diff --git a/hw3/answers b/hw3/answers
index 666e2be..bea2e50 100644
--- a/hw3/answers
+++ b/hw3/answers
@@ -73,11 +73,13 @@ Part B: Choose one of Questions 10 or 11
 (a) On average, what fraction of people with an odds of 0.37 of
 defaulting on their credit card payment will in fact default?
 
+ .27
+
 (b) Suppose that an individual has a 16 % chance of defaulting on
 her credit card payment. What are the odds that she will de-
 fault?
 
-
+ .19
 
 11. In this problem, you will develop a model to predict whether a
 given car gets high or low gas mileage based on the Auto data
@@ -90,28 +92,122 @@ Part B: Choose one of Questions 10 or 11
 data.frame() function to create a single data set containing
 both mpg01 and the other Auto variables.
 
+> auto$mpg01
+  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0
+ [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ [75] 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
+[112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
+[149] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1
+[186] 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
+[223] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0
+[260] 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
+[297] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
+[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
+
+
 (b) Explore the data graphically in order to investigate the associ-
 ation between mpg01 and the other features. Which of the other
 features seem most likely to be useful in predicting mpg01 ? Scat-
 terplots and boxplots may be useful tools to answer this ques-
 tion. Describe your findings.
 
+ Horsepower clearly has the best relationship from the
+ scatter plots. All of the mpgs over the median are on one
+ side of the plot. Weight and acceleration are alright, but
+ there is significant overlap between middle values.
+ Displacement is on the cusp and the other variables don't
+ have a terribly useful relationship with this median.
+
 (c) Split the data into a training set and a test set.
 
+ Seems like a 50/50 random sampling is appropriate enough.
+
+ > training_indices = sample(nrow(auto),397/2)
+ > train_bools = rep(F,length(auto$mpg))
+ > train_bools[training_indices]=T
+ > head(train_bools)
+ [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
+ > length(train_bools)
+ [1] 397
+ > train_data = auto[train_bools,]
+ > test_data = auto[!train_bools,]
+
+
 (d) Perform LDA on the training data in order to predict mpg01
 using the variables that seemed most associated with mpg01 in
 (b). What is the test error of the model obtained?
 
+ > lda.fit
+ Call:
+ lda(mpg01 ~ horsepower + weight + acceleration + displacement,
+     data = train_data)
+
+ Prior probabilities of groups:
+         0         1
+ 0.5431472 0.4568528
+
+ Group means:
+   horsepower   weight acceleration displacement
+ 0  129.08411 3557.757     14.55981      269.729
+ 1   79.64444 2345.233     16.39222      116.800
+
+ Coefficients of linear discriminants:
+                       LD1
+ horsepower    0.005678626
+ weight       -0.001137499
+ acceleration -0.014950459
+ displacement -0.007401647
+
+
+ Error Rate against test data:
+ > mean(lda.pred$class!=test_data$mpg01,na.rm=T)
+ [1] 0.1179487
+
+
+
 (e) Perform QDA on the training data in order to predict mpg01
 using the variables that seemed most associated with mpg01 in
 (b). What is the test error of the model obtained?
 
+ > qda.fit=qda(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data)
+ > qda.fit
+ Call:
+ qda(mpg01 ~ horsepower + weight + acceleration + displacement,
+     data = train_data)
+
+ Prior probabilities of groups:
+         0         1
+ 0.5431472 0.4568528
+
+ Group means:
+   horsepower   weight acceleration displacement
+ 0  129.08411 3557.757     14.55981      269.729
+ 1   79.64444 2345.233     16.39222      116.800
+
+ Error Rate:
+ > mean(qda.pred$class!=test_data$mpg01,na.rm=T)
+ [1] 0.1025641
+
+
+
 (f) Perform logistic regression on the training data in order to pre-
 dict mpg01 using the variables that seemed most associated with
 mpg01 in (b). What is the test error of the model obtained?
 
+ > glm.fit=glm(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data,family=binomial)
+ > glm.probs=predict(glm.fit,test_data,type="response")
+ > glm.pred=rep(0,199)
+ > glm.pred[glm.probs>.5]=1
+ > mean(glm.pred!=test_data$mpg01)
+ [1] 0.120603
+
+
 (g) Perform KNN on the training data, with several values of K, in
 order to predict mpg01 . Use only the variables that seemed most
 associated with mpg01 in (b). What test errors do you obtain?
 Which value of K seems to perform the best on this data
- set?
\ No newline at end of file
+ set?
+
+
+