added answers

2025-08-17 01:03:32 +00:00 · 2017-01-23 05:18:35 -05:00 · 2017-01-23 05:18:35 -05:00 · 65c3edbed6
commit 65c3edbed6
parent 460fa7c4a8
1 changed files with 87 additions and 0 deletions
--- a/hw1/answers
+++ b/hw1/answers
@@ -0,0 +1,87 @@
+1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
+
+(a) The sample size n is extremely large, and the number of
+predictors p is small.
+    A more flexible method will generally perform better here. With so many observations relative to the number of predictors, a flexible method can fit the true shape of f without overfitting, so its lower bias is not offset by a large increase in variance. The advantage shrinks if the true relationship happens to be close to linear.
+
+(b) The number of predictors p is extremely large, and the number
+of observations n is small.
+    A less flexible method will generally perform better here. With few observations spread across many predictors, a flexible method has enough freedom to fit the noise in the training data, so it overfits and has high variance.
+
+(c) The relationship between the predictors and response is highly
+non-linear.
+    A more flexible method is expected to perform better here, because it can track the non-linear shape of the true function, whereas an inflexible method would carry a large bias.
+
+(d) The variance of the error terms is extremely high.
+    A less flexible method will likely perform better here. When the variance of ε is large, a flexible method tends to fit the noise in the training data, which inflates the variance of f̂ without improving the estimate of f. An inflexible method is less sensitive to that noise. The irreducible error Var(ε) limits prediction accuracy for either method.
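
The bias-variance reasoning in question 1 can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the exercise: the true function is taken to be sin(2πx), the inflexible method is a least-squares line, and the flexible method is k-nearest-neighbors averaging. Error is measured against the true function, so it reflects reducible error only.

```python
import math
import random


def true_f(x):
    # Hypothetical non-linear "truth", chosen only for this illustration.
    return math.sin(2 * math.pi * x)


def simulate(n, noise_sd, rng):
    # Draw n training points with Gaussian noise of the given standard deviation.
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    """Inflexible method: simple least-squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k):
    """Flexible method: average response of the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse_against_truth(predict, grid):
    # Mean squared distance between the fitted curve and the true function.
    return sum((predict(x) - true_f(x)) ** 2 for x in grid) / len(grid)


def compare(n, noise_sd, k, seed=0):
    rng = random.Random(seed)
    xs, ys = simulate(n, noise_sd, rng)
    grid = [(i + 0.5) / 200 for i in range(200)]
    return {
        "linear": mse_against_truth(fit_linear(xs, ys), grid),
        "knn": mse_against_truth(fit_knn(xs, ys, k), grid),
    }


# Large n, low noise, non-linear truth: the flexible fit should win.
large_n_low_noise = compare(n=500, noise_sd=0.1, k=5)
# Small n, very noisy errors: the inflexible fit should win.
small_n_high_noise = compare(n=50, noise_sd=3.0, k=1)
print(large_n_low_noise)
print(small_n_high_noise)
```

The sample sizes, noise levels, and values of k are arbitrary choices. Changing them shows how the balance between bias and variance shifts.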
+
+
+
+2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
+
+(a) We collect a set of data on the top 500 firms in the US. For each
+firm we record profit, number of employees, industry and the
+CEO salary. We are interested in understanding which factors
+affect CEO salary.
+    p = 3 (profit, number of employees, industry; CEO salary is the response)
+    n = 500
+    This is a regression problem, because the response, CEO salary, is quantitative (industry is a qualitative predictor, but that does not change the problem type). The goal is inference: we want to understand which of the three predictors affect CEO salary, not to predict the salary of a particular firm.
+
+(b) We are considering launching a new product and wish to know
+whether it will be a success or a failure. We collect data on 20
+similar products that were previously launched. For each product
+we have recorded whether it was a success or failure, price
+charged for the product, marketing budget, competition price,
+and ten other variables.
+    p = 13 (price, marketing budget, competition price, and ten other variables; success or failure is the response)
+    n = 20
+    This is a classification problem, because the response is a class: success or failure. The goal is prediction: we want to know the outcome for the new product rather than how each predictor influences it.
+
+(c) We are interested in predicting the % change in the US dollar in
+relation to the weekly changes in the world stock markets. Hence
+we collect weekly data for all of 2012. For each week we record
+the % change in the dollar, the % change in the US market,
+the % change in the British market, and the % change in the
+German market.
+    n = 52 (one observation per week of 2012)
+    p = 3 (% change in the US, British, and German markets; % change in the dollar is the response)
+    This is a regression problem, because the response is quantitative. The goal is prediction: the stated aim is to predict the % change in the dollar from the market changes, not to understand how each market individually affects it.
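
Counting p in question 2 comes down to listing the recorded variables and leaving out the response. The sketch below encodes the three scenarios that way. The column names are hypothetical labels invented for illustration, and the ten unnamed variables in (b) are given placeholder names.

```python
# Each scenario lists every recorded variable and marks which one is the response.
# All column names are hypothetical labels, not taken from a real data set.
scenarios = {
    "a": {
        "n": 500,
        "response": "ceo_salary",
        "columns": ["profit", "employees", "industry", "ceo_salary"],
    },
    "b": {
        "n": 20,
        "response": "outcome",
        "columns": ["outcome", "price", "marketing_budget", "competition_price"]
        + [f"other_{i}" for i in range(1, 11)],  # the "ten other variables"
    },
    "c": {
        "n": 52,
        "response": "usd_change",
        "columns": [
            "usd_change",
            "us_market_change",
            "british_market_change",
            "german_market_change",
        ],
    },
}


def count_predictors(scenario):
    # p counts every recorded variable except the response.
    return sum(1 for col in scenario["columns"] if col != scenario["response"])


p_by_scenario = {name: count_predictors(s) for name, s in scenarios.items()}
print(p_by_scenario)  # {'a': 3, 'b': 13, 'c': 3}
```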
+
+
+
+4. You will now think of some real-life applications for statistical learning.
+
+(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your
+answer.
+
+(b) Describe three real-life applications in which regression might
+be useful. Describe the response, as well as the predictors. Is the
+goal of each application inference or prediction? Explain your
+answer.
+
+(c) Describe three real-life applications in which cluster analysis
+might be useful.
+    Grouping stars into categories by their spectral line strengths.
+
+
+
+
+9. This exercise involves the Auto data set studied in the lab. Make sure
+that the missing values have been removed from the data.
+(a) Which of the predictors are quantitative, and which are
+qualitative?
+(b) What is the range of each quantitative predictor? You can
+answer this using the range() function.
+(c) What is the mean and standard deviation of each quantitative
+predictor?
+(d) Now remove the 10th through 85th observations. What is the
+range, mean, and standard deviation of each predictor in the
+subset of the data that remains?
+(e) Using the full data set, investigate the predictors graphically,
+using scatterplots or other tools of your choice. Create some plots
+highlighting the relationships among the predictors. Comment
+on your findings.
+(f) Suppose that we wish to predict gas mileage (mpg) on the basis
+of the other variables. Do your plots suggest that any of the
+other variables might be useful in predicting mpg? Justify your
+answer.
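
Parts (b) through (d) of question 9 reduce to a per-column summary and one row-removal step. The exercise points to R's range() function. The sketch below uses Python's standard library instead, as an assumption made for illustration. The column is a toy stand-in and contains no values from the Auto data set. The helper names `summarize` and `drop_observations` are hypothetical.

```python
import statistics


def summarize(values):
    """Return the (min, max) range, mean, and sample standard deviation."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        # statistics.stdev uses the n - 1 denominator (sample standard deviation).
        "sd": statistics.stdev(values),
    }


def drop_observations(rows, first, last):
    """Drop the 1-indexed observations first..last, inclusive."""
    return rows[: first - 1] + rows[last:]


# Toy stand-in for one quantitative column; these are NOT Auto values.
toy_column = [float(i) for i in range(1, 101)]  # observations 1..100

full_summary = summarize(toy_column)

# Part (d): remove the 10th through 85th observations, then summarize again.
remaining = drop_observations(toy_column, 10, 85)
subset_summary = summarize(remaining)

print(full_summary)
print(len(remaining), subset_summary)
```

The slice keeps observations 1 through 9 and 86 onward, so 76 rows are removed. Removing rows 10 to 84 or 11 to 85 instead is an easy off-by-one mistake.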