mirror of
https://asciireactor.com/otho/cs-5821.git
synced 2024-11-25 03:35:05 +00:00
1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.
This still depends on the true form of f, but generally I would expect a more flexible method to perform better here. With a very large number of observations and few predictors, there is enough data to fit a flexible model without overfitting, so we can reduce bias at little cost in variance.

(b) The number of predictors p is extremely large, and the number of observations n is small.
A less flexible method is generally preferable in this case. With many predictors and few observations the data are sparse, so a flexible method has enough freedom to fit the noise in the few points available; its variance is high and it overfits.

(c) The relationship between the predictors and response is highly non-linear.
A more flexible method is expected to perform better here, as it can reflect the non-linear nature of the real function, whereas an inflexible method would be badly biased.

(d) The variance of the error terms is extremely high.
A less flexible method will likely perform better here. When the variance of the error ε is extremely high, a flexible method tends to follow the noise in the training data rather than f, which inflates the variance of f̂ without a matching reduction in its bias. The variance of ε itself is irreducible, so neither kind of method can remove it.
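For reference, the trade-off discussed in part (d) is usually written as a decomposition of the expected test error at a point x_0. The notation below (x_0, y_0) is the standard textbook form, not something taken from the answers above:

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The last term is the irreducible error, which sets a floor on the test error whatever method is used. A more flexible method typically lowers the bias term and raises the variance term, and a large Var(ε) gives a flexible fit more noise to chase.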
2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
p = 3
n = 500
This is a regression problem, as the response, CEO salary, is numerical. Inference is the main interest here, because we want to understand which of the available factors affect CEO salary rather than predict the salary of a particular firm. CEO salary is the response, so it does not count toward p; the three predictors are profit, number of employees, and industry (which is qualitative).

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
p = 13
n = 20
This is a classification problem, because the goal is to predict a class: failure or success. Prediction is the main interest, since we want a predicted outcome for the new product as a function of the various predictors. Success or failure is the response rather than a predictor, which leaves price, marketing budget, competition price, and the ten other variables, so p = 13.

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
n = 52
p = 3
A clear regression setting, since the response is a numerical % change. This is a prediction problem, not an inference problem: the stated goal is to predict the % change in the US dollar from the weekly % changes in the US, British, and German markets. Inference would apply if we instead wanted to understand how each market is associated with the dollar. The % change in the dollar is the response, so only the three market changes count as predictors.

4. You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(c) Describe three real-life applications in which cluster analysis might be useful.
Grouping stars into categories based on the strengths of their spectral features.

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
(a) Which of the predictors are quantitative, and which are qualitative?
(b) What is the range of each quantitative predictor? You can answer this using the range() function.
(c) What is the mean and standard deviation of each quantitative predictor?
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
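Parts (b) through (d) come down to per-column summaries before and after dropping a block of rows. The exercise points to R's range(); the sketch below is a Python analogue using only the standard library. The helper names and the placeholder rows are hypothetical: they are not the Auto data, which would need to be loaded separately with missing values removed.

```python
import statistics


def summarize(rows, columns):
    # Range, mean, and sample standard deviation for each quantitative column.
    out = {}
    for col in columns:
        values = [row[col] for row in rows]
        out[col] = {
            "range": (min(values), max(values)),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # n - 1 denominator, like R's sd()
        }
    return out


def drop_observations(rows, first, last):
    # Remove observations first..last, counted from 1 and inclusive,
    # which matches "the 10th through 85th observations".
    return rows[: first - 1] + rows[last:]


# Placeholder rows, NOT values from the Auto data set.
rows = [{"mpg": float(i), "weight": 100.0 * i} for i in range(1, 101)]

kept = drop_observations(rows, 10, 85)
summary = summarize(kept, ["mpg", "weight"])
print(len(kept), summary["mpg"]["range"], summary["mpg"]["mean"])
```

Counting from 1 and treating both ends as inclusive removes 76 rows, which is easy to get wrong with zero-based slicing; the same helper applies unchanged once the real rows are loaded.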