diff --git a/hw1/answers b/hw1/answers
new file mode 100644
index 0000000..f95566c
--- /dev/null
+++ b/hw1/answers
@@ -0,0 +1,87 @@
+1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
+
+(a) The sample size n is extremely large, and the number of predictors
+p is small.
+    This still depends somewhat on how the data are distributed, but generally I would say a more flexible method will perform better here. With a very large n and a small p there are enough observations to estimate a complicated f without overfitting, so the flexible fit keeps its lower bias while paying little in variance.
+
+(b) The number of predictors p is extremely large, and the number
+of observations n is small.
+    A less flexible method should generally perform better here. With only a few observations spread across many predictors, a flexible method has too little data to pin down its fit and will tend to overfit, so its variance would be high.
+
+(c) The relationship between the predictors and response is highly
+non-linear.
+    A more flexible method is expected to perform better here, because it can follow the non-linear shape of the real function, whereas an inflexible method would carry a large bias.
+(d) The variance of the error terms is extremely
+high.
+    A less flexible method will likely perform better here. When the variance of the error ε is very high, a flexible method tends to fit the noise in the training data rather than f, which inflates the variance of f̂. The irreducible error itself cannot be reduced by either kind of method.
+
+
+
+2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
+
+(a) We collect a set of data on the top 500 firms in the US. For each
+firm we record profit, number of employees, industry and the
+CEO salary. We are interested in understanding which factors
+affect CEO salary.
+    p = 3 (profit, number of employees, industry; CEO salary is the response)
+    n = 500
+    This is a regression problem, because the response, CEO salary, is quantitative (industry is a qualitative predictor, but that does not change the problem type). The interest is in inference: we want to understand which of the 3 predictors affect CEO salary, not to predict the salary at a particular firm.
+
+(b) We are considering launching a new product and wish to know
+whether it will be a success or a failure. We collect data on 20
+similar products that were previously launched. For each product
+we have recorded whether it was a success or failure, price
+charged for the product, marketing budget, competition price,
+and ten other variables.
+    p=13 (price, marketing budget, competition price, and ten other variables; success or failure is the response)
+    n=20
+    This is a classification problem, because the goal is to predict a class: failure or success. It is a prediction problem, because we are interested in the outcome for the new product as a function of the various predictors, not in how each predictor influences that outcome.
+
+(c) We are interested in predicting the % change in the US dollar in
+relation to the weekly changes in the world stock markets. Hence
+we collect weekly data for all of 2012. For each week we record
+the % change in the dollar, the % change in the US market,
+the % change in the British market, and the % change in the
+German market.
+    n=52
+    p=3 (the % changes in the US, British, and German markets; the % change in the dollar is the response)
+    A clear regression setting, because the response is quantitative. This is a prediction problem, not an inference problem: the stated goal is to predict the % change in the dollar from the % changes in the three markets, not to understand how each market affects it.
+
+
+
+4. You will now think of some real-life applications for statistical learning.
+
+(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction?
Explain your
+answer.
+
+(b) Describe three real-life applications in which regression might
+be useful. Describe the response, as well as the predictors. Is the
+goal of each application inference or prediction? Explain your
+answer.
+
+(c) Describe three real-life applications in which cluster analysis
+might be useful.
+    Star categories using spectral strengths
+
+
+
+
+9. This exercise involves the Auto data set studied in the lab. Make sure
+that the missing values have been removed from the data.
+(a) Which of the predictors are quantitative, and which are qualitative?
+
+(b) What is the range of each quantitative predictor? You can answer
+this using the range() function.
+(c) What is the mean and standard deviation of each quantitative
+predictor?
+(d) Now remove the 10th through 85th observations. What is the
+range, mean, and standard deviation of each predictor in the
+subset of the data that remains?
+(e) Using the full data set, investigate the predictors graphically,
+using scatterplots or other tools of your choice. Create some plots
+highlighting the relationships among the predictors. Comment
+on your findings.
+(f) Suppose that we wish to predict gas mileage (mpg) on the basis
+of the other variables. Do your plots suggest that any of the
+other variables might be useful in predicting mpg? Justify your
+answer.
\ No newline at end of file
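
Exercise 9 is left unanswered in this file. As a rough sketch of the arithmetic behind parts (b) through (d), the snippet below computes the range, mean, and sample standard deviation of a single column and then repeats the computation after dropping the 10th through 85th observations. It is written in Python rather than the R that the exercise assumes, it does not load the Auto data, and the helper names `summarize` and `drop_observations` and the `toy` column are hypothetical stand-ins for illustration only.

```python
import statistics


def summarize(values):
    """Return the range, mean, and sample standard deviation of one predictor."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        # Sample SD (n - 1 denominator), the same convention as R's sd().
        "sd": statistics.stdev(values),
    }


def drop_observations(values, first, last):
    """Remove the first-th through last-th observations (1-indexed, inclusive)."""
    return values[: first - 1] + values[last:]


# Hypothetical toy column standing in for one quantitative Auto predictor.
# These are made-up numbers, not values from the Auto data set.
toy = [float(i) for i in range(1, 101)]

full = summarize(toy)                               # parts (b) and (c)
subset = summarize(drop_observations(toy, 10, 85))  # part (d)

print(full)
print(subset)
```

With the real data, each quantitative column would be passed through `summarize` in turn; the 1-indexed slice in `drop_observations` is there because the exercise counts observations from 1.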