cs-5821/hw1/answers

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.
    This still depends on the true form of f, but generally I would expect a more flexible method to perform better here. With a very large n and only a few predictors, a flexible method has enough observations to track the shape of f closely without overfitting, so its lower bias outweighs the modest variance it adds.

(b) The number of predictors p is extremely large, and the number
of observations n is small.
    I would expect a less flexible method to perform better in this case. The data are sparse: a few observations spread over many predictors give a flexible method far too little information per predictor, so it will overfit the training set and have high variance on new data.

(c) The relationship between the predictors and response is highly
non-linear.
    A more flexible method would clearly be expected to perform better here, as it can follow the non-linear shape of the true function, whereas an inflexible method would carry a large bias.
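A minimal simulation sketch of this point, under assumptions that are not part of the exercise: the true function is taken to be sin(2πx), the noise is small, a least-squares straight line stands in for the inflexible method, and a 5-nearest-neighbour average stands in for the flexible one.

```python
import math
import random

def true_f(x):
    # Assumed non-linear truth, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def draw(rng, n, noise_sd):
    """Draw n (x, y) pairs with y = true_f(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def fit_line(train):
    """Least-squares straight line: the inflexible method."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x0: intercept + slope * x0

def fit_knn(train, k):
    """Average of the k nearest training responses: the flexible method."""
    def predict(x0):
        nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(predict, data):
    """Mean squared error of a fitted predictor on a data set."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = draw(rng, 200, 0.1)
test = draw(rng, 200, 0.1)

line_mse = mse(fit_line(train), test)
knn_mse = mse(fit_knn(train, 5), test)
print(f"linear test MSE: {line_mse:.3f}")
print(f"5-NN test MSE:   {knn_mse:.3f}")
```

In this setup the straight line cannot bend to follow the sine curve, so its test error is dominated by bias, while the nearest-neighbour fit stays close to the curve.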

(d) The variance of the error terms is extremely high.
    A less flexible method will likely perform better here. When the variance of the error ε is very high, a flexible method tends to chase the noise in the training data, which inflates the variance of f̂ without bringing f̂ any closer to the true f. The variance of ε itself is irreducible, so neither method can remove it; the less flexible method simply avoids adding to it by fitting noise.
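For reference, the trade-off that the answers to question 1 rely on is the standard decomposition of the expected test error at a point x_0. The notation is assumed here: y_0 = f(x_0) + ε, and f̂ is the fitted model.

```latex
\mathbb{E}\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \left[\operatorname{Bias}\big(\hat{f}(x_0)\big)\right]^2
  + \operatorname{Var}(\varepsilon)
```

A more flexible method usually lowers the bias term and raises the variance term, while Var(ε) is a floor that no method can get below.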


2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each
firm we record profit, number of employees, industry and the
CEO salary. We are interested in understanding which factors
affect CEO salary.
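    p = 3 (profit, number of employees, industry; CEO salary is the response)
    n = 500
    This is a regression problem, because the response, CEO salary, is numerical. Industry is a qualitative predictor, but that does not change the type of problem. The goal is inference rather than prediction: we want to understand which of the 3 predictors affect CEO salary and how, not to forecast the salary of a particular CEO.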

(b) We are considering launching a new product and wish to know
whether it will be a success or a failure. We collect data on 20
similar products that were previously launched. For each product
we have recorded whether it was a success or failure, price
charged for the product, marketing budget, competition price,
and ten other variables.
    p = 13 (price, marketing budget, competition price, and ten other variables; success or failure is the response)
    n = 20
    This is a classification problem, because the response is a class: failure or success. The type of problem is set by the response, not by whether the predictors are numerical or categorical. The goal is prediction: we want to know the outcome for the new product, not how each predictor is related to success.

(c) We are interested in predicting the % change in the US dollar in
relation to the weekly changes in the world stock markets. Hence
we collect weekly data for all of 2012. For each week we record
the % change in the dollar, the % change in the US market,
the % change in the British market, and the % change in the
German market.
    n = 52 (one observation per week of 2012)
    p = 3 (% change in the US, British, and German markets; the % change in the dollar is the response)
    A clear regression setting, since the response is numerical. This is a prediction problem, not an inference problem: the stated goal is to predict the % change in the dollar from the weekly market changes, not to understand how each individual market is related to it.
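One way to check the n and p values in question 2: p counts predictors only, so the response is left out of the tally of recorded variables. The sketch below does that count; the variable names are hypothetical stand-ins for the quantities each scenario describes, and the helper `count_n_p` is made up for this illustration.

```python
def count_n_p(n_observations, recorded_variables, response):
    """Return (n, p), where p counts every recorded variable except the response."""
    predictors = [v for v in recorded_variables if v != response]
    return n_observations, len(predictors)

# Hypothetical column names; only the counts come from the exercise text.
scenarios = {
    "2(a)": (500,
             ["profit", "employees", "industry", "ceo_salary"],
             "ceo_salary"),
    "2(b)": (20,
             ["success", "price", "marketing_budget", "competition_price"]
             + [f"other_{i}" for i in range(1, 11)],
             "success"),
    "2(c)": (52,
             ["dollar_change", "us_change", "british_change", "german_change"],
             "dollar_change"),
}

for label, (n_obs, variables, response) in scenarios.items():
    print(label, count_n_p(n_obs, variables, response))
```

This prints (500, 3) for 2(a), (20, 13) for 2(b), and (52, 3) for 2(c).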


4. You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your
answer.

(b) Describe three real-life applications in which regression might
be useful. Describe the response, as well as the predictors. Is the
goal of each application inference or prediction? Explain your
answer.

(c) Describe three real-life applications in which cluster analysis
might be useful.
    Grouping stars into categories based on their spectral strengths.


9. This exercise involves the Auto data set studied in the lab. Make sure
that the missing values have been removed from the data.
(a) Which of the predictors are quantitative, and which are qualitative?
(b) What is the range of each quantitative predictor? You can answer
this using the range() function.
(c) What is the mean and standard deviation of each quantitative
predictor?
(d) Now remove the 10th through 85th observations. What is the
range, mean, and standard deviation of each predictor in the
subset of the data that remains?
(e) Using the full data set, investigate the predictors graphically,
using scatterplots or other tools of your choice. Create some plots
highlighting the relationships among the predictors. Comment
on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis
of the other variables. Do your plots suggest that any of the
other variables might be useful in predicting mpg? Justify your
answer.
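Parts (b) through (d) come down to a range, a mean, and a standard deviation per column, computed once on the full data and once after dropping observations 10 through 85. A minimal sketch of that computation follows, written in Python rather than the R the exercise has in mind, and run on a made-up column of 100 values instead of the real Auto data; the helper names are hypothetical.

```python
import statistics

def summarize(values):
    """Range, mean, and sample standard deviation (n - 1 denominator) of one column."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

def drop_observations(values, first, last):
    """Remove the first-th through last-th observations (1-based, inclusive)."""
    return values[:first - 1] + values[last:]

# Toy stand-in for one quantitative column: 1.0, 2.0, ..., 100.0.
toy_column = [float(i) for i in range(1, 101)]

full_summary = summarize(toy_column)
subset = drop_observations(toy_column, 10, 85)
subset_summary = summarize(subset)

print("full:  ", full_summary)
print("subset:", subset_summary)
```

Dropping observations 10 through 85 removes 76 rows, so the toy column shrinks from 100 values to 24. In R the same subset is usually taken with negative indexing, for example Auto[-(10:85), ].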
added answers 2017-01-23 10:18:35 +00:00