added answers for hw1

caes 2017-01-24 22:48:35 -05:00
parent 6b7d6d3384
commit 13883d0b1c
2 changed files with 228 additions and 65 deletions

@ -1,95 +1,258 @@
1. For each of parts (a) through (d), indicate whether we would generally
expect the performance of a flexible statistical learning method to be
better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p
is small.

This seems to still depend on how the data are distributed, but
generally, I would say a less flexible method will perform better here,
given that we have a large number of observations to average over.

(b) The number of predictors p is extremely large, and the number of
observations n is small.

We might want a more flexible method in this case, since the data are
sparse and we want a model that responds smoothly to possible large
changes along and across predictors.

(c) The relationship between the predictors and response is highly
non-linear.

A more-flexible model will clearly be expected to have better
performance here, as it will reflect the non-linear nature of the real
function.

(d) The variance of the error terms is extremely high.

A less-flexible function will likely respond better here, because the
bias-variance trade-off is concerned with nuanced differences that are
overwhelmed in a high-ε situation. The variance of f̂ and the bias of
f̂ are insignificant compared to the variance of the error ε, so we
don't gain predictability by attempting to reduce them.
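
The three quantities named in (d) are the terms of the expected test error at a point x_0. A sketch of that decomposition, assuming the usual setup Y = f(X) + ε with ε of mean zero and independent of X (this setup is an assumption here, not something stated in the answer):

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

Changing flexibility only trades the first two terms against each other. Var(ε) is a floor on the expected test error that no choice of method lowers.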
2. Explain whether each scenario is a classification or regression problem,
and indicate whether we are most interested in inference or prediction.
Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm
we record profit, number of employees, industry and the CEO salary. We are
interested in understanding which factors affect CEO salary.

p = 4
n = 500

This is a regression problem, as we're predicting numerical values using
numerical values. Prediction is interesting here, because we want to be
able to predict CEO salary as a function of the predictors we find
significant of the 4 available.

(b) We are considering launching a new product and wish to know whether it
will be a success or a failure. We collect data on 20 similar products that
were previously launched. For each product we have recorded whether it
was a success or failure, price charged for the product, marketing budget,
competition price, and ten other variables.

p=14
n=20

Another prediction problem, because we're interested in a predicted
outcome -- success or failure -- as a function of the various
predictors. This could be considered semi-categorical, since at least
one predictor has a classification nature, but I would say it is a
classification problem because the goal is to predict a class: failure
or success.

(c) We are interested in predicting the % change in the US dollar in
relation to the weekly changes in the world stock markets. Hence we collect
weekly data for all of 2012. For each week we record the % change in the
dollar, the % change in the US market, the % change in the British market,
and the % change in the German market.

n=52
p=4

A clear regression setting, but this is an inference problem, not a
prediction problem. With inference, we have a starting place and
attempt to predict the change in a variable as a function of other
observed rates: in this case, we have a known US dollar price, and we
want to predict how it will change given rate shifts in other markets,
so inference clearly applies.
4. You will now think of some real-life applications for statistical
learning.

(a) Describe three real-life applications in which classification might be
useful. Describe the response, as well as the predictors. Is the goal of
each application inference or prediction? Explain your answer.

Image identification. The predictors could be things like "distribution
of greyscale intensity", "distribution of colors", and any number of
clever things I'm sure machine learning professionals have thought up.
The response is the most probable classification. This is a prediction.

Galactic classification. Really this is very similar to general image
identification, but we classify galaxies using very specific spectral
bands for the predictors that involve light intensity, but then we also
look at how strong particular spikes or dips in the spectrum are, so we
might have predictors for "emission line strength" for several spectral
features. The response is the most likely galactic classification. This
is a prediction.

Speech recognition. The predictors would perhaps be the audio spectrum,
with the response being the word the audio spectrum corresponds to.
This would predict the most likely word for the audio received.

(b) Describe three real-life applications in which regression might be
useful. Describe the response, as well as the predictors. Is the goal of
each application inference or prediction? Explain your answer.

Marketing data is obvious. The predictor is perhaps how much was spent
on a certain type of marketing, or a few types of marketing -- this is
now sounding like the example from the book. The response is an amount
sold for the same fiscal period. You could use inference or prediction
here: inference to ask how many additional sales you might add by
spending marketing funds, or prediction by asking just "how many sales
did we see when we spent X amount on marketing?"

I want to try to use this for my project: understanding the time delay,
or reverberation, of a dynamic spectral feature compared against a
similarly dynamic reference feature. 2 predictors, line-of-sight
velocity and time delay, give a response of light intensity. Our task
is to predict the light intensity as a function of these predictors.
This is actually a vanguard question in astrophysics, and I'll bet
somebody is already trying to do this!

Maybe something municipal. I could predict the taxable income of a city
based on a number of predictors, like availability of mass transit or
highways, demographics, resources, distance to neighbouring cities, and
all kinds of things, then the response would continue to just be
taxable income given all of these inputs. Perhaps it would be good to
consider an inference question here, for example: how would my city's
taxable income change if I increased the availability of public transit?

(c) Describe three real-life applications in which cluster analysis might
be useful.

Categorizing star type by spectral band strengths.
Plant and animal species identification.
Tracking objects in sensor data.
9. This exercise involves the Auto data set studied in the lab. Make sure
that the missing values have been removed from the data.

(a) Which of the predictors are quantitative, and which are qualitative?

mpg, horsepower, weight, acceleration, and displacement are all clearly
quantitative.

cylinders I think is arguably qualitative because each number of
cylinders defines a somewhat broad class of vehicles. For the years,
the same argument might apply: each year is a class of vehicles. The
origin is clearly qualitative, and so is name.

(b) What is the range of each quantitative predictor? You can answer this
using the range() function.

$mpg
[1] 9.0 46.6
$cylinders
[1] 3 8
$displacement
[1] 68 455
$horsepower
[1] 46 230
$weight
[1] 1613 5140
$acceleration
[1] 8.0 24.8
$year
[1] 70 82
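
Output in this shape could come from applying range() column by column. A minimal sketch, assuming R (the `$name` / `[1]` layout looks like a printed R list) and using the built-in mtcars data as a stand-in so that it runs without extra packages. `dat` and `quant_cols` are names introduced here for illustration. With the ISLR package installed, `ISLR::Auto` could take the place of `mtcars`, with the column names adjusted.

```r
# Stand-in data: mtcars ships with base R. This is NOT the Auto data set.
dat <- na.omit(mtcars)  # drop rows with missing values (mtcars has none)

# Hypothetical choice of quantitative columns for the stand-in data.
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# range() returns c(min, max) for a numeric vector. lapply() applies it to
# each selected column and returns a named list, printed as $mpg, $cyl, ...
ranges <- lapply(dat[quant_cols], range)
print(ranges)
```

Keeping the column list in one vector makes it easy to move a predictor such as cylinders or year in or out of the quantitative set.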
(c) What is the mean and standard deviation of each quantitative predictor?
$mpg
mu sigma
23.445918 7.805007
$cylinders
mu sigma
5.471939 1.705783
$displacement
mu sigma
194.412 104.644
$horsepower
mu sigma
104.46939 38.49116
$weight
mu sigma
2977.5842 849.4026
$acceleration
mu sigma
15.541327 2.758864
$year
mu sigma
75.979592 3.683737
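
The mu/sigma pairs above have the shape of a named two-element vector per column. A hedged sketch of how such a summary could be computed, again assuming R and using mtcars as a stand-in for Auto. The names `dat`, `quant_cols`, and `mu_sigma` are introduced here and do not come from the commit.

```r
# Stand-in data: mtcars ships with base R. This is NOT the Auto data set.
dat <- na.omit(mtcars)

# Hypothetical choice of quantitative columns for the stand-in data.
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# For each column, build a named vector so every list entry prints with
# "mu" and "sigma" headers. sd() is the sample standard deviation
# (denominator n - 1).
mu_sigma <- lapply(dat[quant_cols], function(x) c(mu = mean(x), sigma = sd(x)))
print(mu_sigma)
```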
(d) Now remove the 10th through 85th observations. What is the range, mean,
and standard deviation of each predictor in the subset of the data that
remains?
$mpg
mu sigma
24.404430 7.867283
$cylinders
mu sigma
5.373418 1.654179
$displacement
mu sigma
187.24051 99.67837
$horsepower
mu sigma
100.72152 35.70885
$weight
mu sigma
2935.9715 811.3002
$acceleration
mu sigma
15.726899 2.693721
$year
mu sigma
77.145570 3.106217
I've now changed my mind and say that both year and cylinders are
quantitative, since it makes plenty of sense to talk about the mean and
standard deviation of those predictors for this set of data.
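
For part (d), dropping observations 10 through 85 can be done with negative row indices. The sketch below assumes R and uses a made-up data frame called `toy`. Its values are random draws inside the mpg and weight ranges reported in (b), not real data, and the row count of 392 is only an illustrative size. The helper `summarise_col` is hypothetical and reports the range together with the mean and standard deviation, since the question asks for all three.

```r
set.seed(1)  # make the made-up values reproducible

# Made-up stand-in for Auto: random values, NOT the real observations.
n <- 392
toy <- data.frame(
  mpg    = runif(n, 9, 46.6),
  weight = runif(n, 1613, 5140)
)

# Negative indices drop rows. The parentheses matter: -(10:85) means
# "all rows except 10..85", whereas -10:85 would be the sequence -10..85.
rest <- toy[-(10:85), ]

# Hypothetical helper: range, mean and sample standard deviation together.
summarise_col <- function(x) {
  c(min = min(x), max = max(x), mu = mean(x), sigma = sd(x))
}

stats <- lapply(rest, summarise_col)
print(nrow(rest))
print(stats)
```

Removing 76 of the 392 rows leaves 316, which is a quick sanity check that the index range was inclusive at both ends.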
(e) Using the full data set, investigate the predictors graphically, using
scatterplots or other tools of your choice. Create some plots highlighting
the relationships among the predictors. Comment on your findings.
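
The commit adds hw1/auto_pairs.png but not the code that produced it. One plausible way to get a figure of that kind is a scatterplot matrix from pairs(). This is an assumption about the method, sketched here in R with mtcars as a stand-in for Auto; `out_file` and `quant_cols` are names introduced for illustration.

```r
# Stand-in data: mtcars ships with base R. This is NOT the Auto data set.
dat <- na.omit(mtcars)

# Hypothetical choice of quantitative columns for the stand-in data.
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# Write the figure to a temporary file rather than into the repository.
out_file <- file.path(tempdir(), "pairs_sketch.png")

png(out_file, width = 900, height = 900)  # open a PNG graphics device
# One scatterplot per pair of columns, arranged in a grid.
pairs(dat[quant_cols], main = "Pairwise scatterplots (stand-in data)")
dev.off()                                 # close the device to flush the file

print(out_file)
```

A scatterplot matrix shows every pairwise relationship at once, which suits a question that asks for relationships among the predictors rather than against a single response.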
(f) Suppose that we wish to predict gas mileage ( mpg ) on the basis of the
other variables. Do your plots suggest that any of the other variables
might be useful in predicting mpg ? Justify your answer.
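
Part (f) has no written answer in this commit. As a numeric companion to the plots, one could rank the other quantitative variables by the strength of their correlation with mpg. The sketch below assumes R and uses mtcars as a stand-in for Auto, so its numbers say nothing about the Auto data; `ranked` and `quant_cols` are names introduced here.

```r
# Stand-in data: mtcars ships with base R. This is NOT the Auto data set.
dat <- na.omit(mtcars)

# Hypothetical choice of quantitative columns for the stand-in data.
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# Row "mpg" of the correlation matrix holds the Pearson correlation of mpg
# with every selected column, including itself.
cors <- cor(dat[quant_cols])["mpg", ]
cors <- cors[names(cors) != "mpg"]  # drop the trivial self-correlation

# Order by absolute value so strong negative relationships rank high too.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Pearson correlation only measures linear association, so a variable with a curved relationship to mpg can look weaker here than it does in a scatterplot.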

hw1/auto_pairs.png
Binary file not shown (size: 101 KiB).