mirror of
https://asciireactor.com/otho/cs-5821.git
synced 2024-11-22 01:25:06 +00:00
finished h2 and added R log
parent
3b3f51d41b
commit
8ffd07f1ef
BIN
hw2/.RData
Normal file
Binary file not shown.
57
hw2/.Rhistory
Normal file
@ -0,0 +1,57 @@
|
|||||||
|
auto = read.table("auto.data",header=T,na.strings="?")
|
||||||
|
auto
|
||||||
|
attach(auto)
|
||||||
|
horsepower
|
||||||
|
plot(mpg ~ horsepower)
|
||||||
|
fit = lm(mpg ~ horsepower)
|
||||||
|
fit
|
||||||
|
summary(fit)
|
||||||
|
lines(fit)
|
||||||
|
abline(fit)
|
||||||
|
abline(fit,col="red")
|
||||||
|
plot(mpg ~ horsepower,pch="x")
|
||||||
|
abline(fit,col="red")
|
||||||
|
abline(fit,col="red",size="2")
|
||||||
|
abline(fit,col="red",lt="2")
|
||||||
|
help(abline)
|
||||||
|
abline(fit,col="red",lwd="2")
|
||||||
|
abline(fit,col="red",lwd="4")
|
||||||
|
summary(lm)
|
||||||
|
summary(fit)
|
||||||
|
help(predict)
|
||||||
|
predict(fit,horsepower=98)
|
||||||
|
predict(fit)
|
||||||
|
predict(fit,98)
|
||||||
|
predict(fit[98])
|
||||||
|
help(predict)
|
||||||
|
help(predict.lm)
|
||||||
|
predict(lm(mpg ~ 98)
|
||||||
|
predict(lm(mpg ~ horsepower)
|
||||||
|
predict(lm(mpg ~ 98))
|
||||||
|
pre
|
||||||
|
predict(fit,data.frame(p=98(
|
||||||
|
predict(fit,data.frame(p=98))
|
||||||
|
predict(fit,data.frame(p=c(98)))
|
||||||
|
predict(fit,data.frame(p=c(98)),interval="confidence")
|
||||||
|
predict(fit,data.frame(horesepower=c(98)),interval="confidence")
|
||||||
|
predict(fit,interval="confidence")
|
||||||
|
fit
|
||||||
|
predict(fit,data.frame(c(98))interval="confidence")
|
||||||
|
predict(fit,data.frame(p=c(98))interval="confidence")
|
||||||
|
predict(fit,data.frame(p=c(98)),interval="confidence")
|
||||||
|
predict(fit,data.frame(p=c(98)),interval="confidence")
|
||||||
|
names(fit)
|
||||||
|
coef(fit)
|
||||||
|
confint(fit)
|
||||||
|
predict(fit,data.frame(horsepower=c(1,98)),interval="confidence")
|
||||||
|
predict(fit,data.frame(horsepower=c(98)),interval="confidence")
|
||||||
|
predict(fit,data.frame(horsepower=98),interval="confidence")
|
||||||
|
predict(fit,data.frame(horsepower=98),interval="prediction")
|
||||||
|
dev.print(pdf,"mpg_horsepower_regression.pdf")
|
||||||
|
plot(fit)
|
||||||
|
par(mfrow=c(2,2))
|
||||||
|
plot(fit)
|
||||||
|
dev.print(pdf,"fit_quality.pdf")
|
||||||
|
save
|
||||||
|
save()
|
||||||
|
q()
|
119
hw2/answers
@ -40,43 +40,70 @@
|
|||||||
dollars). Suppose we use least squares to fit the model, and get
|
dollars). Suppose we use least squares to fit the model, and get
|
||||||
β₀ = 50, β₁ = 20, β₂ = 0.07, β₃ = 35, β₄ = 0.01, β₅ = −10.
|
β₀ = 50, β₁ = 20, β₂ = 0.07, β₃ = 35, β₄ = 0.01, β₅ = −10.
|
||||||
|
|
||||||
(a) Which answer is correct, and why?
|
This is the model: ŷ = 50 + 20 X₁ + 0.07 X₂ + 35 X₃ + 0.01 X₄ +
|
||||||
i. For a fixed value of IQ and GPA, males earn more on
|
-10 X₅
|
||||||
average than females.
|
|
||||||
|
|
||||||
ii. For a fixed value of IQ and GPA, females earn more on
|
For fixed IQ and GPA, we can infer that the starting salary
|
||||||
average than males.
|
for a female exceeds that of a male with the same IQ and GPA
by (35*1 - 10*(GPA*1)) = 35 - 10*GPA salary units. At very low
GPAs (maybe this includes people who didn't attend school?)
this difference is positive, so males have the lower
predicted starting salary. The difference shrinks by 10 units
per GPA point and reaches zero at GPA = 3.5, so males
overtake females once GPA exceeds 3.5. Therefore,
|
||||||
|
|
||||||
iii. For a fixed value of IQ and GPA, males earn more on
|
(a) Which answer is correct, and why? → iii. For a fixed value
|
||||||
average than females provided that the GPA is high enough.
|
of IQ and GPA, males earn more on average than females
|
||||||
|
provided that the GPA is high enough.
|
||||||
|
|
||||||
iv. For a fixed value of IQ and GPA, females earn more on
|
This one is correct.
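The crossover can be checked numerically. The following is a small sketch in R using only the two gender-related coefficients from the question; the function name gender_gap and the GPA grid are illustrative choices, not part of the homework.

```r
# Female-minus-male difference in predicted salary at a fixed IQ and GPA.
# 35 is the gender coefficient and -10 the GPA x gender coefficient from the
# question; gender_gap is a hypothetical helper introduced for illustration.
gender_gap <- function(gpa) 35 - 10 * gpa

gpa_grid <- c(2.0, 3.0, 3.5, 4.0)
gap <- gender_gap(gpa_grid)
print(data.frame(gpa = gpa_grid, female_minus_male = gap))

# GPA at which the difference changes sign: 35 - 10 * GPA = 0.
crossover <- 35 / 10
print(crossover)
```

A positive value means the female prediction is higher, and a negative value means the male prediction is higher, which is the case only above the crossover GPA.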
|
||||||
average than males provided that the GPA is high enough.
|
|
||||||
|
|
||||||
(b) Predict the salary of a female with IQ of 110 and a GPA of
|
(b) Predict the salary of a female with IQ of 110 and a GPA of
|
||||||
4.0.
|
4.0.
|
||||||
|
|
||||||
|
ŷ = 50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*(4.0*110) - 10*(4.0*1)
|
||||||
|
|
||||||
|
→ ŷ = 137.1 salary units
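The same arithmetic can be reproduced in R. This is a minimal sketch that assumes the gender indicator is 1 for female, as in the calculation above; predict_salary and the names in beta are hypothetical, introduced only for this check.

```r
# Coefficients exactly as stated in the question.
beta <- c(intercept = 50, gpa = 20, iq = 0.07, female = 35,
          gpa_iq = 0.01, gpa_female = -10)

# predict_salary() is a hypothetical helper, not part of the homework.
# female is the gender indicator (assumed 1 = female, 0 = male).
predict_salary <- function(gpa, iq, female) {
  unname(beta["intercept"] +
         beta["gpa"] * gpa +
         beta["iq"] * iq +
         beta["female"] * female +
         beta["gpa_iq"] * gpa * iq +
         beta["gpa_female"] * gpa * female)
}

salary_female <- predict_salary(gpa = 4.0, iq = 110, female = 1)
print(salary_female)  # 137.1 salary units, matching the hand calculation
```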
|
||||||
|
|
||||||
(c) True or false: Since the coefficient for the GPA/IQ
|
(c) True or false: Since the coefficient for the GPA/IQ
|
||||||
interaction term is very small, there is very little evidence of
|
interaction term is very small, there is very little evidence of
|
||||||
an interaction effect. Justify your answer.
|
an interaction effect. Justify your answer.
|
||||||
|
|
||||||
|
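False. The size of a coefficient says nothing by itself about
the evidence for an effect, because it depends on the scale
of the predictors. GPA*IQ takes large values (440 in part
(b)), so a coefficient of 0.01 still contributes 4.4 salary
units there, and the coefficient for IQ alone is only 7 times
greater. Evidence for the interaction has to be judged from
the coefficient's standard error, i.e. its t-statistic and
p-value, which are not given here.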
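To illustrate why coefficient size and strength of evidence are separate things, here is a hedged sketch on simulated data. The data are made up (they only borrow the coefficient values from the question), and the noise level and sample size are arbitrary assumptions. Rescaling IQ changes the interaction coefficient by a factor of 100 while leaving its t-statistic unchanged.

```r
set.seed(7)
n <- 200

# Simulated stand-in data; not the data behind the question.
gpa <- runif(n, 2, 4)
iq <- rnorm(n, mean = 100, sd = 15)
salary <- 50 + 20 * gpa + 0.07 * iq + 0.01 * gpa * iq + rnorm(n, sd = 1)

# Same model twice, with IQ in its original units and in units of 100 points.
fit_raw <- lm(salary ~ gpa * iq)
iq_scaled <- iq / 100
fit_scaled <- lm(salary ~ gpa * iq_scaled)

coef_raw <- summary(fit_raw)$coefficients["gpa:iq", ]
coef_scaled <- summary(fit_scaled)$coefficients["gpa:iq_scaled", ]

# The estimate changes with the units; the t value does not.
print(rbind(raw = coef_raw, scaled = coef_scaled))
```

The "t value" and "Pr(>|t|)" columns are what carry the evidence for an interaction; the "Estimate" column alone does not.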
|
||||||
|
|
||||||
|
|
||||||
4. I collect a set of data (n = 100 observations) containing a
|
4. I collect a set of data (n = 100 observations) containing a
|
||||||
single predictor and a quantitative response. I then fit a linear
|
single predictor and a quantitative response. I then fit a linear
|
||||||
regression model to the data, as well as a separate cubic
|
regression model to the data, as well as a separate cubic
|
||||||
regression, i.e. Y = β₀ + β₁ X + β₂ X² + β₃ X³ + .
|
regression, i.e. Y = β₀ + β₁ X + β₂ X² + β₃ X³ + ε.
|
||||||
|
|
||||||
(a) Suppose that the true relationship between X and Y is
|
(a) Suppose that the true relationship between X and Y is
|
||||||
linear, i.e. Y = β₀ + β₁ X + . Consider the training residual
|
linear, i.e. Y = β₀ + β₁ X + ε. Consider the training residual
|
||||||
sum of squares (RSS) for the linear regression, and also the
|
sum of squares (RSS) for the linear regression, and also the
|
||||||
training RSS for the cubic regression. Would we expect one to be
|
training RSS for the cubic regression. Would we expect one to be
|
||||||
lower than the other, would we expect them to be the same, or is
|
lower than the other, would we expect them to be the same, or is
|
||||||
there not enough information to tell? Justify your answer.
|
there not enough information to tell? Justify your answer.
|
||||||
|
|
||||||
|
For the training data, the cubic regression will have an RSS
no greater than the linear regression, and in practice a
slightly lower one. The linear model is a special case of the
cubic (β₂ = β₃ = 0), so least squares over the larger model
can only match or improve the fit. Here any improvement comes
only from the cubic terms fitting points that are varied
according to the ε random error.
|
||||||
|
|
||||||
(b) Answer (a) using test rather than training RSS.
|
(b) Answer (a) using test rather than training RSS.
|
||||||
|
|
||||||
|
For the test error, we would expect the RSS to be greater
for the cubic model than the linear model, because the
random error ε will likely express itself in a way that is
inconsistent with the noise that the cubic model adopted
during its training. This is an expectation rather than a
certainty: on a particular test set the cubic could come out
ahead by chance. The linear model will be more likely to
have a lower RSS the more test data is used against the
models.
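The training and test comparison in (a) and (b) can be tried out on simulated data. This is only a sketch: the true coefficients (2 and 3), the noise level, and the seed are arbitrary assumptions, and rss is a hypothetical helper. The training RSS of the cubic fit can never exceed that of the linear fit because the models are nested; the test RSS carries no such guarantee, so its outcome depends on the draw.

```r
set.seed(1)
n <- 100

# Training data from a truly linear relationship plus noise.
x <- rnorm(n)
train <- data.frame(x = x, y = 2 + 3 * x + rnorm(n))

fit_lin <- lm(y ~ x, data = train)
fit_cub <- lm(y ~ x + I(x^2) + I(x^3), data = train)

# Residual sum of squares of a fitted model on a given data set.
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

train_rss <- c(linear = rss(fit_lin, train), cubic = rss(fit_cub, train))

# Fresh test data from the same linear relationship.
x_new <- rnorm(n)
test <- data.frame(x = x_new, y = 2 + 3 * x_new + rnorm(n))
test_rss <- c(linear = rss(fit_lin, test), cubic = rss(fit_cub, test))

print(rbind(train = train_rss, test = test_rss))
```

Rerunning with different seeds shows the training row always favouring the cubic (or tying), while the test row varies from draw to draw.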
|
||||||
|
|
||||||
(c) Suppose that the true relationship between X and Y is not
|
(c) Suppose that the true relationship between X and Y is not
|
||||||
linear, but we don’t know how far it is from linear. Consider
|
linear, but we don’t know how far it is from linear. Consider
|
||||||
the training RSS for the linear regression, and also the
|
the training RSS for the linear regression, and also the
|
||||||
@ -84,3 +111,71 @@
|
|||||||
lower than the other, would we expect them to be the same, or is
|
lower than the other, would we expect them to be the same, or is
|
||||||
there not enough information to tell? Justify your answer. (d)
|
there not enough information to tell? Justify your answer. (d)
|
||||||
Answer (c) using test rather than training RSS.
|
Answer (c) using test rather than training RSS.
|
||||||
|
|
||||||
|
For the training RSS, the cubic model will pick up more
information because of its additional degrees of freedom, so
it will again have the lower RSS, for the same reason as in
(a) and now also because it can capture real non-linearity.
For the test RSS, there is not enough information to tell.
If the true relationship is far from linear, then the cubic
model will likely have a lower RSS than the linear model. If
it is close to linear, the linear model may still have the
smaller RSS, because the cubic will again pick up
information from the ε noise that is not inherent in the
real relationship.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
8. This question involves the use of simple linear regression on the Auto
|
||||||
|
data set.
|
||||||
|
|
||||||
|
(a) Use the lm() function to perform a simple linear regression
|
||||||
|
with mpg as the response and horsepower as the predictor. Use
|
||||||
|
the summary() function to print the results. Comment on the
|
||||||
|
output. For example:
|
||||||
|
|
||||||
|
There is definitely a relationship between horsepower and
mpg. The RSE is ~4.9, which is not insignificant and shows
there is substantial scatter about the fitted line, but it
is small enough relative to the mpg magnitude that it's
clear a relationship exists. The R² statistic corroborates
this: at ~0.6, it indicates that a large proportion (about
60%) of the mpg variability is explained by the model. mpg
has a negative correlation with horsepower, indicated by the
negative coefficient on the horsepower term.
|
||||||
|
|
||||||
|
|
||||||
|
For example, for vehicles with 98 horsepower, the 95%
confidence interval for the average mpg is 23.97 to 24.96,
if the vehicles follow our model. However, for an individual
vehicle, after incorporating the irreducible error, the
prediction turns out to be much less precise, with a 95%
prediction interval spanning 14.8 to 34.1. From visual
inspection of the plot, some of this variability might also
be reduced by using a quadratic model.
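The difference between the two interval types can be sketched as follows. The data here are simulated as a stand-in for the Auto data set (the intercept, slope, and noise level are made-up assumptions), so the numbers will not match the ones quoted above; only the predict() calls mirror the ones in the R log.

```r
set.seed(42)

# Simulated stand-in for the Auto data; coefficients and noise are invented.
horsepower <- runif(200, min = 50, max = 200)
mpg <- 40 - 0.16 * horsepower + rnorm(200, sd = 5)
sim <- data.frame(horsepower = horsepower, mpg = mpg)

fit_sim <- lm(mpg ~ horsepower, data = sim)
new_car <- data.frame(horsepower = 98)

# Interval for the average mpg of all vehicles with 98 horsepower.
conf_int <- predict(fit_sim, new_car, interval = "confidence")

# Interval for the mpg of one individual vehicle with 98 horsepower.
pred_int <- predict(fit_sim, new_car, interval = "prediction")

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both intervals share the same fitted value; the prediction interval is wider because it adds the variance of the irreducible error to the uncertainty in the fitted line.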
|
||||||
|
|
||||||
|
(b) Plot the response and the predictor. Use the abline() function
|
||||||
|
to display the least squares regression line.
|
||||||
|
|
||||||
|
Attached.
|
||||||
|
|
||||||
|
(c) Use the plot() function to produce diagnostic plots of the least
|
||||||
|
squares regression fit. Comment on any problems you see with
|
||||||
|
the fit.
|
||||||
|
|
||||||
|
Attached. From these four plots it's clear there is a lot of
variability that remains unexplained by the linear model.
The standardized residuals plotted against the fitted values
show clearly that the variability is strong, with values
consistently lying outside 1 standardized residual unit, but
still within a tight range that doesn't extend past 3, which
is often considered an approximate threshold to indicate
values that aren't explained well by the model. There are
many points with high leverage, which tend to have smaller
residuals by construction, and in both of these graphs
we see a few points (323, 330) that are rearing their ugly
heads. These seem to be part of the "uptick" toward the
higher end of the horsepower scale that would probably be
picked up by a quadratic fit.
|
||||||
|
|
||||||
|