almost done with hw3

caes 2017-02-09 03:16:50 -05:00
parent a817d948c9
commit 84a0b4cfc9


@@ -73,11 +73,13 @@ Part B: Choose one of Questions 10 or 11
(a) On average, what fraction of people with an odds of 0.37 of
defaulting on their credit card payment will in fact default?
0.27, since p = odds / (1 + odds) = 0.37 / 1.37
(b) Suppose that an individual has a 16% chance of defaulting
on her credit card payment. What are the odds that she will
default?
0.19, since odds = p / (1 - p) = 0.16 / 0.84
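Both answers are conversions between odds and probability, which can be checked with a short R sketch. The helper names odds_to_prob and prob_to_odds are hypothetical, introduced here only for illustration.

```r
# Hypothetical helpers (not part of the assignment) for converting
# between odds and probability.
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

# (a) odds of 0.37 -> fraction expected to default
round(odds_to_prob(0.37), 2)   # 0.27

# (b) 16% chance of default -> odds of default
round(prob_to_odds(0.16), 2)   # 0.19
```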
11. In this problem, you will develop a model to predict whether a
given car gets high or low gas mileage based on the Auto data
@@ -90,28 +92,122 @@ Part B: Choose one of Questions 10 or 11
data.frame() function to create a single data set containing
both mpg01 and the other Auto variables.
> auto$mpg01
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0
[38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[75] 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
[149] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1
[186] 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
[223] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0
[260] 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
[297] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
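The transcript shows the resulting mpg01 vector but not the commands that built it. One way to construct it is sketched here on R's built-in mtcars data, used as a stand-in because the Auto data is not bundled with base R. The name cars_df is hypothetical.

```r
# Stand-in data (assumption): mtcars ships with R and has an mpg column.
cars_df <- mtcars

# mpg01 is 1 when mpg is above the median and 0 otherwise.
mpg01 <- as.integer(cars_df$mpg > median(cars_df$mpg))

# Combine mpg01 with the original variables in a single data frame.
cars_df <- data.frame(cars_df, mpg01 = mpg01)

# Count how many cars fall in each class.
table(cars_df$mpg01)
```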
(b) Explore the data graphically in order to investigate the
association between mpg01 and the other features. Which of the
other features seem most likely to be useful in predicting mpg01?
Scatterplots and boxplots may be useful tools to answer this
question. Describe your findings.
In the scatter plots, horsepower separates the two classes most
clearly: the cars with mpg above the median all fall on one
side of the plot. Weight and acceleration are also informative,
but the classes overlap significantly at middle values.
Displacement is borderline, and the other variables show no
useful relationship with mpg01.
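The plotting commands behind these findings are not in the transcript. A sketch of one way to produce per-class boxplots and a numeric summary follows, again on the built-in mtcars data as a stand-in. The column names hp, wt, qsec and disp belong to mtcars, not Auto, and cars_df is a hypothetical name.

```r
# Stand-in data (assumption): mtcars instead of Auto.
cars_df <- data.frame(mtcars,
                      mpg01 = as.integer(mtcars$mpg > median(mtcars$mpg)))

# mtcars analogues of horsepower, weight, acceleration, displacement.
vars <- c("hp", "wt", "qsec", "disp")

# One boxplot per predictor, split by mpg01, in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
for (v in vars) {
  boxplot(cars_df[[v]] ~ cars_df$mpg01,
          xlab = "mpg01", ylab = v, main = v)
}
par(op)

# Numeric companion to the plots: mean of each predictor per class.
group_means <- aggregate(cars_df[vars],
                         by = list(mpg01 = cars_df$mpg01),
                         FUN = mean)
group_means
```

Predictors whose boxes barely overlap between the two classes are the ones most likely to help a classifier.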
(c) Split the data into a training set and a test set.
A 50/50 random split seems appropriate.
> training_indices = sample(nrow(auto),floor(nrow(auto)/2))
> train_bools = rep(FALSE,nrow(auto))
> train_bools[training_indices]=TRUE
> head(train_bools)
[1] FALSE TRUE FALSE FALSE TRUE FALSE
> length(train_bools)
[1] 397
> train_data = auto[train_bools,]
> test_data = auto[!train_bools,]
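Because the split depends on a random draw, the error rates reported for the models will change from run to run. A sketch of a reproducible variant follows. It also drops incomplete rows before splitting, so that every model is trained and scored on the same rows. The toy data frame df and the seed value are assumptions made for illustration.

```r
# Toy stand-in data with one missing predictor value (assumption).
df <- data.frame(x = c(1, 2, NA, 4, 5, 6, 7, 8, 9),
                 y = c(0, 0, 0, 0, 1, 1, 1, 1, 1))

# Drop rows with missing values before splitting.
df <- df[complete.cases(df), ]

set.seed(1)                       # arbitrary seed; fixes the random split
n_train <- floor(nrow(df) / 2)    # whole-number training size
train_idx <- sample(nrow(df), n_train)

train_data <- df[train_idx, ]
test_data  <- df[-train_idx, ]    # every row not sampled for training
```

Negative indexing gives the complement of the training rows directly, so no separate logical vector is needed.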
(d) Perform LDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?
> lda.fit
Call:
lda(mpg01 ~ horsepower + weight + acceleration + displacement,
    data = train_data)
Prior probabilities of groups:
        0         1
0.5431472 0.4568528
Group means:
  horsepower   weight acceleration displacement
0  129.08411 3557.757     14.55981      269.729
1   79.64444 2345.233     16.39222      116.800
Coefficients of linear discriminants:
                      LD1
horsepower    0.005678626
weight       -0.001137499
acceleration -0.014950459
displacement -0.007401647
Error Rate against test data:
> mean(lda.pred$class!=test_data$mpg01,na.rm=T)
[1] 0.1179487
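The transcript prints lda.fit and the error rate but omits the fitting and prediction calls. A sketch of the full sequence with lda() from the MASS package follows, run on synthetic stand-in data. The data frame sim, its two predictors, and the seed are assumptions, not the Auto data.

```r
library(MASS)  # provides lda()

# Synthetic two-class data standing in for Auto (assumption).
set.seed(1)
n <- 200
mpg01 <- rep(c(0, 1), each = n / 2)
sim <- data.frame(
  mpg01 = mpg01,
  horsepower = rnorm(n, mean = ifelse(mpg01 == 1, 80, 130), sd = 25),
  weight = rnorm(n, mean = ifelse(mpg01 == 1, 2300, 3500), sd = 500)
)

# 50/50 random split into training and test halves.
train_idx <- sample(n, n / 2)
train_data <- sim[train_idx, ]
test_data <- sim[-train_idx, ]

# Fit on the training half, then predict the held-out half.
lda.fit <- lda(mpg01 ~ horsepower + weight, data = train_data)
lda.pred <- predict(lda.fit, test_data)

# Test error: share of held-out rows whose predicted class is wrong.
lda.err <- mean(lda.pred$class != test_data$mpg01)
lda.err
```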
(e) Perform QDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?
> qda.fit=qda(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data)
> qda.fit
Call:
qda(mpg01 ~ horsepower + weight + acceleration + displacement,
    data = train_data)
Prior probabilities of groups:
        0         1
0.5431472 0.4568528
Group means:
  horsepower   weight acceleration displacement
0  129.08411 3557.757     14.55981      269.729
1   79.64444 2345.233     16.39222      116.800
Error Rate:
> mean(qda.pred$class!=test_data$mpg01,na.rm=T)
[1] 0.1025641
(f) Perform logistic regression on the training data in order to
predict mpg01 using the variables that seemed most associated
with mpg01 in (b). What is the test error of the model obtained?
> glm.fit=glm(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data,family=binomial)
> glm.probs=predict(glm.fit,test_data,type="response")
> glm.pred=rep(0,199)
> glm.pred[glm.probs>.5]=1
> mean(glm.pred!=test_data$mpg01)
[1] 0.120603
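One caveat when comparing this figure with the LDA and QDA rates: those calls pass na.rm=T, which suggests some test rows have a missing predictor. Here the prediction vector starts at 0 and the final mean() has no na.rm. For any such row, predict() returns NA, the assignment leaves the prediction at 0, and the row is still counted in the error rate. A sketch of computing the error only over rows that have a usable prediction follows, on synthetic stand-in data. The data frame sim, the injected NA values, and the seed are assumptions.

```r
# Synthetic stand-in for Auto with a few missing horsepower values
# (assumption).
set.seed(1)
n <- 200
mpg01 <- rep(c(0, 1), each = n / 2)
sim <- data.frame(
  mpg01 = mpg01,
  horsepower = rnorm(n, mean = ifelse(mpg01 == 1, 80, 130), sd = 25),
  weight = rnorm(n, mean = ifelse(mpg01 == 1, 2300, 3500), sd = 500)
)
sim$horsepower[c(5, 50, 150)] <- NA

train_idx <- sample(n, n / 2)
train_data <- sim[train_idx, ]
test_data <- sim[-train_idx, ]

# glm() drops incomplete training rows by default (na.action = na.omit).
glm.fit <- glm(mpg01 ~ horsepower + weight,
               data = train_data, family = binomial)
glm.probs <- predict(glm.fit, test_data, type = "response")

# Keep NA where no probability is available instead of defaulting to 0.
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)

# Error over rows with a prediction only, matching the na.rm convention
# used for LDA and QDA.
glm.err <- mean(glm.pred != test_data$mpg01, na.rm = TRUE)
glm.err
```

With this convention, all three error rates are computed over the same set of test rows.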
(g) Perform KNN on the training data, with several values of K,
in order to predict mpg01. Use only the variables that seemed
most associated with mpg01 in (b). What test errors do you
obtain? Which value of K seems to perform the best on this data
set?
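This revision stops before an answer to part (g). For reference, a sketch of how KNN could be run with knn() from the class package follows, again on synthetic stand-in data. The data frame sim, the K values tried, and the seed are assumptions, and no results for the Auto data are implied.

```r
library(class)  # provides knn()

# Synthetic two-class data standing in for Auto (assumption).
set.seed(1)
n <- 200
mpg01 <- rep(c(0, 1), each = n / 2)
sim <- data.frame(
  mpg01 = mpg01,
  horsepower = rnorm(n, mean = ifelse(mpg01 == 1, 80, 130), sd = 25),
  weight = rnorm(n, mean = ifelse(mpg01 == 1, 2300, 3500), sd = 500)
)
train_idx <- sample(n, n / 2)
vars <- c("horsepower", "weight")

# Standardize with training-set statistics so that weight's larger
# scale does not dominate the distance calculation.
train_X <- scale(sim[train_idx, vars])
test_X <- scale(sim[-train_idx, vars],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))
train_y <- factor(sim$mpg01[train_idx])
test_y <- sim$mpg01[-train_idx]

# Test error for several values of K.
ks <- c(1, 5, 10, 20)
knn.err <- sapply(ks, function(k) {
  pred <- knn(train_X, test_X, cl = train_y, k = k)
  mean(pred != test_y)
})
names(knn.err) <- ks
knn.err
```

The K with the lowest test error on a given split is the natural candidate, though with a single random split the differences between nearby values of K may not be meaningful.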