diff --git a/hw2/.RData b/hw2/.RData
new file mode 100644
index 0000000..b6aee81
Binary files /dev/null and b/hw2/.RData differ
diff --git a/hw2/.Rhistory b/hw2/.Rhistory
new file mode 100644
index 0000000..0d1d0d5
--- /dev/null
+++ b/hw2/.Rhistory
@@ -0,0 +1,57 @@
+auto = read.table("auto.data",header=T,na.strings="?")
+auto
+attach(auto)
+horsepower
+plot(mpg ~ horsepower)
+fit = lm(mpg ~ horsepower)
+fit
+summary(fit)
+lines(fit)
+abline(fit)
+abline(fit,col="red")
+plot(mpg ~ horsepower,pch="x")
+abline(fit,col="red")
+abline(fit,col="red",size="2")
+abline(fit,col="red",lt="2")
+help(abline)
+abline(fit,col="red",lwd="2")
+abline(fit,col="red",lwd="4")
+summary(lm)
+summary(fit)
+help(predict)
+predict(fit,horsepower=98)
+predict(fit)
+predict(fit,98)
+predict(fit[98])
+help(predict)
+help(predict.lm)
+predict(lm(mpg ~ 98)
+predict(lm(mpg ~ horsepower)
+predict(lm(mpg ~ 98))
+pre
+predict(fit,data.frame(p=98(
+predict(fit,data.frame(p=98))
+predict(fit,data.frame(p=c(98)))
+predict(fit,data.frame(p=c(98)),interval="confidence")
+predict(fit,data.frame(horesepower=c(98)),interval="confidence")
+predict(fit,interval="confidence")
+fit
+predict(fit,data.frame(c(98))interval="confidence")
+predict(fit,data.frame(p=c(98))interval="confidence")
+predict(fit,data.frame(p=c(98)),interval="confidence")
+predict(fit,data.frame(p=c(98)),interval="confidence")
+names(fit)
+coef(fit)
+confint(fit)
+predict(fit,data.frame(horsepower=c(1,98)),interval="confidence")
+predict(fit,data.frame(horsepower=c(98)),interval="confidence")
+predict(fit,data.frame(horsepower=98),interval="confidence")
+predict(fit,data.frame(horsepower=98),interval="prediction")
+dev.print(pdf,"mpg_horsepower_regression.pdf")
+plot(fit)
+par(mfrow=c(2,2))
+plot(fit)
+dev.print(pdf,"fit_quality.pdf")
+save
+save()
+q()
diff --git a/hw2/answers b/hw2/answers
index 0228780..affeb44 100644
--- a/hw2/answers
+++ b/hw2/answers
@@ -17,19 +17,19 @@ which of these predictors have a strong relationship with sales
 of this product.
 
-    TV marketing and radio marketing both have a strong relationship
-    to sales, according to their linear regression p-values, but
-    newspaper advertising does not appear to be effective, given
-    that the linear model does not account for much of the variation
-    in sales across that domain. We can conclude that cutting back
-    on newspaper advertising will likely have little effect on the
-    sales of the product, and that increasing TV and radio
-    advertising budgets likely will have an effect. Furthermore, we
-    can see that radio advertising spending has a stronger
-    relationship with sales, as the best-fit slope is significantly
-    more positive than the best fit for TV advertising spending, so
-    increasing the radio advertising budget will likely be more
-    effective.
+    TV marketing and radio marketing both have a strong relationship
+    to sales, according to their linear regression p-values, but
+    newspaper advertising does not appear to be effective, given
+    that the linear model does not account for much of the variation
+    in sales across that domain. We can conclude that cutting back
+    on newspaper advertising will likely have little effect on the
+    sales of the product, and that increasing TV and radio
+    advertising budgets likely will have an effect. Furthermore, we
+    can see that radio advertising spending has a stronger
+    relationship with sales, as the best-fit slope is significantly
+    more positive than the best fit for TV advertising spending, so
+    increasing the radio advertising budget will likely be more
+    effective.
@@ -40,47 +40,142 @@ dollars).
 Suppose we use least squares to fit the model, and get β₀ = 50,
 β₁ = 20, β₂ = 0.07, β₃ = 35, β₄ = 0.01, β₅ = −10.
 
-    (a) Which answer is correct, and why?
-        i. For a fixed value of IQ and GPA, males earn more on
-        average than females.
+    This is the model:
+    ŷ = 50 + 20 X₁ + 0.07 X₂ + 35 X₃ + 0.01 X₄ − 10 X₅
 
-        ii. For a fixed value of IQ and GPA, females earn more on
-        average than males.
+    For fixed IQ and GPA, a female sharing an IQ and GPA with her
+    male counterpart will earn (35*1 - 10*(GPA*1)) more starting
+    salary units than he does. This means that at low GPAs males
+    have the lower starting salary, and as GPA grows the male
+    salary catches up, overtaking the female salary once GPA
+    exceeds 3.5. Therefore:
 
-        iii. For a fixed value of IQ and GPA, males earn more on
-        average than females provided that the GPA is high enough.
+    (a) Which answer is correct, and why? → iii. For a fixed value
+        of IQ and GPA, males earn more on average than females
+        provided that the GPA is high enough.
+
+        This one is correct, by the reasoning above.
 
-        iv. For a fixed value of IQ and GPA, females earn more on
-        average than males provided that the GPA is high enough.
+    (b) Predict the salary of a female with IQ of 110 and a GPA of
+        4.0.
 
-    (b) Predict the salary of a female with IQ of 110 and a GPA of
-        4.0.
+        ŷ = 50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*(4.0*110) - 10*(4.0*1)
 
-    (c) True or false: Since the coefficient for the GPA/IQ
-    interaction term is very small, there is very little evidence of
-    an interaction effect. Justify your answer.
+        → ŷ = 137.1 salary units
 
+    (c) True or false: Since the coefficient for the GPA/IQ
+    interaction term is very small, there is very little evidence of
+    an interaction effect. Justify your answer.
 
+        False. The size of a coefficient depends on the scale of
+        its variable, and GPA×IQ takes values in the hundreds: at
+        IQ = 110 and GPA = 4.0 the interaction contributes
+        0.01*440 = 4.4, which is comparable to the 0.07*110 = 7.7
+        contributed by IQ alone. So this term holds significant
+        weight compared to the overall response of the model to
+        IQ, and evidence for an interaction is judged by the
+        term's p-value, not by the raw size of its coefficient.
 
 4. I collect a set of data (n = 100 observations) containing a
 single predictor and a quantitative response. I then fit a linear
 regression model to the data, as well as a separate cubic
-regression, i.e. Y = β₀ + β₁ X + β₂ X² + β₃ X³ + .
+regression, i.e. Y = β₀ + β₁ X + β₂ X² + β₃ X³ + ε.
 
 (a) Suppose that the true relationship between X and Y is
-linear, i.e. Y = β₀ + β₁ X + . Consider the training residual
+linear, i.e. Y = β₀ + β₁ X + ε. Consider the training residual
 sum of squares (RSS) for the linear regression, and also the
 training RSS for the cubic regression. Would we expect one to be
 lower than the other, would we expect them to be the same, or is
 there not enough information to tell? Justify your answer.
 
+    On the training data, the cubic regression can never have a
+    higher RSS than the linear regression, because the linear
+    model is nested inside the cubic one: least squares can always
+    set β₂ = β₃ = 0 and recover the linear fit. In practice the
+    cubic's training RSS will be slightly lower, but only because
+    it is fitting variation that comes from the random error ε.
 
 (b) Answer (a) using test rather than training RSS.
 
+    For the test data, the RSS will almost certainly be greater
+    for the cubic model than for the linear model, because the
+    noise the cubic model absorbed during training will not recur
+    in the same pattern in new data. The more test data we use,
+    the more likely the linear model is to have the lower RSS.
 
 (c) Suppose that the true relationship between X and Y is not
 linear, but we don’t know how far it is from linear. Consider
 the training RSS for the linear regression, and also the
 training RSS for the cubic regression. Would we expect one to be
 lower than the other, would we expect them to be the same, or is
 there not enough information to tell? Justify your answer. (d)
-Answer (c) using test rather than training RSS.
\ No newline at end of file
+Answer (c) using test rather than training RSS.
+
+    For (c), the nesting argument from (a) still applies: the
+    cubic model's training RSS will be lower regardless of the
+    true relationship, thanks to its additional degrees of
+    freedom. For (d), the answer depends on how far the truth is
+    from linear. If the true relationship is clearly more complex
+    than linear, the cubic model will likely have the lower test
+    RSS as well. If the departure from linearity is slight (e.g.
+    perhaps it is nearly constant), then the linear model will
+    still be more likely to have the smaller test RSS, because
+    the cubic will again pick up patterns from the ε noise that
+    are not part of the real relationship.
+
+
+
+
+8. This question involves the use of simple linear regression on the Auto
+data set.
+
+    (a) Use the lm() function to perform a simple linear regression
+    with mpg as the response and horsepower as the predictor. Use
+    the summary() function to print the results. Comment on the
+    output. For example:
+
+        There is definitely a correlation between horsepower and
+        mpg. The RSE is ~4.9, which is not insignificant and does
+        indicate that the response may not be truly linear, but it
+        is small enough relative to the magnitude of mpg that it's
+        clear a relationship exists. The R² statistic corroborates
+        this: at ~0.61, it indicates that a substantial proportion
+        of the variability in mpg is explained by the model. mpg
+        has a negative correlation with horsepower, indicated by
+        the negative coefficient on the horsepower term.
+
+        For example, for a vehicle with 98 horsepower, one can
+        expect with 95% confidence that the mean mpg will be
+        between 23.97 and 24.96, if the vehicles follow our model.
+        However, after incorporating the irreducible error, the
+        prediction turns out to be much less precise, with a 95%
+        prediction interval spanning 14.8 to 34.1. Visual
+        inspection of the plot suggests that some of this
+        variability could be captured by a quadratic model.
+
+    (b) Plot the response and the predictor. Use the abline() function
+    to display the least squares regression line.
+
+        Attached.
+
+    (c) Use the plot() function to produce diagnostic plots of the least
+    squares regression fit. Comment on any problems you see with
+    the fit.
+
+        Attached. From these four plots it's clear there is a lot
+        of variability that remains unexplained by the linear
+        model. The plot of standardized residuals against fitted
+        values shows that this variability is strong, with values
+        consistently lying outside one standardized-residual unit,
+        though still within a range that doesn't extend past 3,
+        which is often taken as a rough threshold for points the
+        model explains poorly. There are also many high-leverage
+        points, and such points naturally tend to have smaller
+        residuals because they pull the fit toward themselves; in
+        both of these plots a few observations (323, 330) stand
+        out. These seem to be part of the "uptick" toward the
+        higher end of the horsepower scale that would probably be
+        picked up by a quadratic fit.
+
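The salary-model arithmetic in the answers above can be double-checked in R. This is only an illustrative sketch: the predict_salary helper is hypothetical, with its coefficients taken straight from the problem statement.

```r
# Coefficients from the problem statement (beta0=50, GPA=20,
# IQ=0.07, Female=35, GPA:IQ=0.01, GPA:Female=-10).
# predict_salary is a made-up helper, not part of the original answers.
predict_salary <- function(gpa, iq, female) {
  50 + 20 * gpa + 0.07 * iq + 35 * female +
    0.01 * gpa * iq - 10 * gpa * female
}

# Part (b): female, IQ 110, GPA 4.0 -> 137.1 salary units
predict_salary(4.0, 110, 1)

# The female advantage is 35 - 10*GPA, so it vanishes at GPA = 3.5;
# above that GPA, males earn more (answer iii):
predict_salary(3.5, 110, 1) - predict_salary(3.5, 110, 0)
```

At any GPA above 3.5 the difference goes negative, which is the crossover the answer describes.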
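The question-4(a) point about training RSS can also be checked empirically: the linear model is nested inside the cubic, so least squares on the cubic can never do worse on the training data. A small simulation sketch with synthetic data (not the Auto data):

```r
set.seed(1)                       # synthetic example
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)         # true relationship is linear

rss <- function(fit) sum(residuals(fit)^2)
rss_linear <- rss(lm(y ~ x))
rss_cubic  <- rss(lm(y ~ poly(x, 3)))

# Always TRUE: the cubic can at worst set its extra coefficients
# to zero and recover the linear fit.
rss_cubic <= rss_linear
```

On fresh test data drawn from the same model the ordering usually flips, since the cubic's extra wiggle was fit to noise.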
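Finally, the intervals quoted in the question-8 answer come from predict() with interval = "confidence" and interval = "prediction", as the .Rhistory shows. Since auto.data is not included in this diff, the sketch below uses the built-in mtcars data (mpg ~ hp) as a stand-in, so its numbers will differ from the 23.97 to 24.96 and 14.8 to 34.1 figures quoted above.

```r
# mtcars stands in for the Auto data here; the commands mirror the
# .Rhistory session above (lm, predict, par, plot).
fit <- lm(mpg ~ hp, data = mtcars)

# 95% interval for the *mean* mpg at hp = 98:
predict(fit, data.frame(hp = 98), interval = "confidence")

# 95% interval for a *single new* car: much wider, because it
# also includes the irreducible error:
predict(fit, data.frame(hp = 98), interval = "prediction")

# The four diagnostic plots discussed in part (c):
par(mfrow = c(2, 2))
plot(fit)
```

The prediction interval is always wider than the confidence interval at the same point, which is exactly the gap the answer describes between the two 95% ranges.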