mirror of
https://asciireactor.com/otho/cs-5821.git
synced 2024-11-24 03:05:09 +00:00
added answers
parent
460fa7c4a8
commit
65c3edbed6
87
hw1/answers
Normal file
@ -0,0 +1,87 @@
1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.
This still depends on how the data are distributed, but generally I would expect a more flexible method to perform better here. With a very large number of observations and only a few predictors, a flexible fit can follow the true shape of f with little risk of overfitting, whereas an inflexible method cannot make use of the extra data to reduce its bias.

(b) The number of predictors p is extremely large, and the number of observations n is small.
A less flexible method is generally the safer choice in this case. The data are sparse, with few observations spread across many predictors, so a flexible method would tend to fit the noise in the small sample and its predictions would have high variance.

(c) The relationship between the predictors and response is highly non-linear.
A more flexible method would clearly be expected to perform better here, since it can follow the non-linear shape of the true function, while an inflexible method would carry a large bias.
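
As a rough illustration of (c), the sketch below compares an inflexible fit (a least-squares straight line) with a flexible one (1-nearest-neighbour) on a made-up, noise-free non-linear truth f(x) = x^2. The function names, the choice of f, and the training and test points are all assumptions for illustration and are not part of the exercise; with noisy data the comparison could come out differently.

```python
# Sketch: inflexible (straight line) vs. flexible (1-nearest-neighbour) fit
# on an assumed noise-free, non-linear truth f(x) = x^2.

def true_f(x):
    # Hypothetical "real" function; chosen only because it is non-linear.
    return x * x

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    """1-nearest-neighbour regression; returns a predictor function."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a predictor on the points (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training points on a grid, test points offset from the grid.
train_x = [float(i) for i in range(11)]
train_y = [true_f(x) for x in train_x]
test_x = [i + 0.25 for i in range(10)]
test_y = [true_f(x) for x in test_x]

linear_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_1nn(train_x, train_y), test_x, test_y)

print(f"linear test MSE: {linear_mse:.2f}")
print(f"1-NN test MSE:   {knn_mse:.2f}")
```

In this setup the straight line cannot bend to follow the curve, so its test error is dominated by bias, while the nearest-neighbour fit tracks the curve closely.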

(d) The variance of the error terms is extremely high.
A less flexible method will likely do better here. When the variance of the error ε is extremely high, a flexible method tends to follow the noise in the training data, which inflates the variance of f̂ without reducing its bias by much. The irreducible error Var(ε) also dominates the expected test error, so there is little predictive accuracy to gain from the extra flexibility.
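
For reference, the argument in (d) rests on the standard decomposition of the expected test error at a point x_0 into the variance of f̂, the squared bias of f̂, and the irreducible error. The notation (x_0, y_0 for a test observation) is an assumption here, following the usual textbook convention:

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

The last term does not depend on the method chosen, so when Var(ε) is very large it sets a high floor on the test error whatever the flexibility.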

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
p = 3 (profit, number of employees, industry; CEO salary is the response)
n = 500
This is a regression problem, because the response, CEO salary, is a numerical value (industry is a qualitative predictor, but that does not change the problem type). Inference is the main interest here, because we want to understand which of the 3 available predictors affect CEO salary, rather than to forecast the salary at a particular firm.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
p = 13 (price, marketing budget, competition price, and ten other variables; success or failure is the response)
n = 20
This is a prediction problem, because we're interested in a predicted outcome (success or failure) as a function of the various predictors. It is a classification problem, because the response is qualitative and the goal is to predict a class: failure or success.

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
n = 52
p = 3 (% change in the US, British, and German markets; % change in the dollar is the response)
A clear regression setting, since the response is a numerical % change. This is a prediction problem rather than an inference problem: the stated goal is to predict how the dollar will change given the changes in the three markets, not to understand how each individual market is related to the dollar.
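
The values of n and p in question 2 depend on whether the response is counted among the recorded variables. Under the usual convention that p counts only the predictors, the bookkeeping can be sketched as follows. The column names and the helper `n_and_p` are hypothetical labels introduced here for illustration; only the counts come from the scenario descriptions.

```python
# Sketch: count observations (n) and predictors (p) for each scenario,
# assuming p excludes the response variable. Column names are made up.

def n_and_p(n_observations, columns, response):
    """Return (n, p), where p is the number of columns other than the response."""
    if response not in columns:
        raise ValueError(f"response {response!r} is not among the columns")
    predictors = [c for c in columns if c != response]
    return n_observations, len(predictors)

scenarios = {
    "2(a) firms": (
        500,
        ["profit", "employees", "industry", "ceo_salary"],
        "ceo_salary",
    ),
    "2(b) products": (
        20,
        ["outcome", "price", "marketing_budget", "competition_price"]
        + [f"other_{i}" for i in range(1, 11)],  # the "ten other variables"
        "outcome",
    ),
    "2(c) markets": (
        52,  # one observation per week of 2012
        ["pct_change_dollar", "pct_change_us",
         "pct_change_british", "pct_change_german"],
        "pct_change_dollar",
    ),
}

results = {name: n_and_p(*args) for name, args in scenarios.items()}

for name, (n, p) in results.items():
    print(f"{name}: n = {n}, p = {p}")
```

Counting this way gives one fewer predictor than the number of recorded variables in each scenario, because one recorded variable is always the response.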

4. You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(c) Describe three real-life applications in which cluster analysis might be useful.
Star categories using spectral strengths

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
(a) Which of the predictors are quantitative, and which are qualitative?
(b) What is the range of each quantitative predictor? You can answer this using the range() function.
(c) What is the mean and standard deviation of each quantitative predictor?
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
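
Parts (b) through (d) come down to computing a range, mean, and standard deviation, then repeating that after dropping a block of rows. The exercise itself points to R's range(); the sketch below shows the same steps in Python using only the standard library, on a made-up column of 100 values rather than the Auto data. The helpers `summarize` and `drop_observations` are hypothetical names introduced here. The main thing to get right is the indexing: "10th through 85th" is counted from 1 and is inclusive at both ends.

```python
import statistics

def summarize(values):
    """Range (min, max), mean, and sample standard deviation of one column."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        # statistics.stdev uses the n - 1 denominator (sample standard deviation).
        "sd": statistics.stdev(values),
    }

def drop_observations(values, first, last):
    """Drop observations first..last, counted from 1 and inclusive."""
    return values[:first - 1] + values[last:]

# Made-up stand-in for one quantitative predictor: the values 1.0 .. 100.0.
values = [float(i) for i in range(1, 101)]

# Remove the 10th through 85th observations, as in part (d).
kept = drop_observations(values, 10, 85)

full_summary = summarize(values)
kept_summary = summarize(kept)

print("all observations:", full_summary)
print("after removal:   ", kept_summary)
print("observations kept:", len(kept))
```

Removing the 10th through 85th observations drops 76 rows, so the same check (original row count minus 76) applies to the real data set.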