almost done with hw3

2026-02-18 12:20:13 +00:00 · 2017-02-09 03:16:50 -05:00 · 2017-02-09 03:16:50 -05:00 · 84a0b4cfc9
commit 84a0b4cfc9
parent a817d948c9
1 changed file with 98 additions and 2 deletions
--- a/hw3/answers
+++ b/hw3/answers
@@ -73,11 +73,13 @@ Part B: Choose one of Questions 10 or 11
    (a) On average, what fraction of people with an odds of 0.37 of
    defaulting on their credit card payment will in fact default?
        .27
    (b) Suppose that an individual has a 16 % chance of defaulting
    on her credit card payment. What are the odds that she will
    default?
-
+        .19    
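        Both answers follow from odds = p / (1 - p) and its inverse
        p = odds / (1 + odds). A minimal check; the helper names
        odds_to_prob and prob_to_odds are made up for illustration:

```r
# Convert between a probability p and odds = p / (1 - p).
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

round(odds_to_prob(0.37), 2)  # part (a): 0.37 / 1.37
round(prob_to_odds(0.16), 2)  # part (b): 0.16 / 0.84
```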
 11. In this problem, you will develop a model to predict whether a
    given car gets high or low gas mileage based on the Auto data
@@ -90,28 +92,122 @@ Part B: Choose one of Questions 10 or 11
    data.frame() function to create a single data set containing
    both mpg01 and the other Auto variables.
 > auto$mpg01
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 [112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
 [149] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1
 [186] 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
 [223] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0
 [260] 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
 [297] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
 [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
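        The command that built mpg01 is not part of this hunk. A
        sketch of one way to get a 0/1 indicator like the one printed
        above, assuming "1" means mpg strictly above the median; the
        auto_demo data frame and its values are made up:

```r
# Stand-in for the Auto data: only an mpg column (values are made up).
auto_demo <- data.frame(mpg = c(18, 15, 36, 24, 31, 22, 27))

# 1 when mpg is strictly above the median (24 here), 0 otherwise.
auto_demo$mpg01 <- as.integer(auto_demo$mpg > median(auto_demo$mpg))
auto_demo$mpg01
```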
    (b) Explore the data graphically in order to investigate the
    association between mpg01 and the other features. Which of the
    other features seem most likely to be useful in predicting
    mpg01? Scatterplots and boxplots may be useful tools to answer
    this question. Describe your findings.
        Horsepower shows the clearest relationship in the scatter
        plots: all of the cars with mpg above the median fall on one
        side of the plot. Weight and acceleration are also useful,
        but the two classes overlap significantly at middle values.
        Displacement is borderline, and the other variables don't
        show a useful relationship with mpg01.
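        The plotting commands are not in the transcript. A sketch of
        the kind of boxplot comparison described, on simulated data:
        the demo data frame, its size, and the standard deviations
        are assumptions, and the class means are only loosely based
        on the group means reported in (d).

```r
set.seed(1)
n <- 100
mpg01 <- rep(c(0, 1), each = n / 2)

# Simulated stand-in for two Auto predictors (not the real data).
demo <- data.frame(
  mpg01 = mpg01,
  horsepower = rnorm(n, mean = ifelse(mpg01 == 1, 80, 130), sd = 15),
  acceleration = rnorm(n, mean = ifelse(mpg01 == 1, 16.4, 14.6), sd = 2.5)
)

# plot = FALSE returns the five-number summaries the boxplot would draw;
# drop it to see the actual plot.
bp_hp  <- boxplot(horsepower ~ mpg01, data = demo, plot = FALSE)
bp_acc <- boxplot(acceleration ~ mpg01, data = demo, plot = FALSE)

bp_hp$stats[3, ]   # median horsepower for mpg01 = 0 and mpg01 = 1
bp_acc$stats[3, ]  # median acceleration for mpg01 = 0 and mpg01 = 1
```

        A predictor is more useful when the two boxes barely overlap,
        which is the pattern described for horsepower.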
    (c) Split the data into a training set and a test set.
        A 50/50 random split seems appropriate.
        > training_indices = sample(nrow(auto),floor(nrow(auto)/2))
        > train_bools = rep(FALSE,nrow(auto))
        > train_bools[training_indices]=TRUE
        > head(train_bools)
        [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
        > length(train_bools)
        [1] 397
        > train_data = auto[train_bools,]
        > test_data = auto[!train_bools,]
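        No seed is set above, so the split (and every error rate
        below) changes from run to run. A sketch of a reproducible
        version; the seed value and the demo data frame are
        assumptions, and the 198/199 sizes match the 199 test rows
        used in (f).

```r
set.seed(42)                      # assumed seed, only for reproducibility
demo <- data.frame(id = 1:397)    # stand-in with the same row count as auto

n_train <- floor(nrow(demo) / 2)  # 198 training rows, 199 test rows
train_idx <- sample(nrow(demo), n_train)

train_data <- demo[train_idx, , drop = FALSE]
test_data  <- demo[-train_idx, , drop = FALSE]

c(nrow(train_data), nrow(test_data))
```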
    (d) Perform LDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?
        > lda.fit=lda(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data)
        > lda.fit
        Call:
        lda(mpg01 ~ horsepower + weight + acceleration + displacement, 
            data = train_data)
        Prior probabilities of groups:
                0         1 
        0.5431472 0.4568528 
        Group means:
          horsepower   weight acceleration displacement
        0  129.08411 3557.757     14.55981      269.729
        1   79.64444 2345.233     16.39222      116.800
        Coefficients of linear discriminants:
                              LD1
        horsepower    0.005678626
        weight       -0.001137499
        acceleration -0.014950459
        displacement -0.007401647
        Error rate against test data:
        > lda.pred=predict(lda.fit,test_data)
        > mean(lda.pred$class!=test_data$mpg01,na.rm=T)
        [1] 0.1179487
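        For reference, the whole fit / predict / error sequence in
        one place. This is a sketch on simulated data, not the Auto
        run above: the demo data frame, seed, and standard deviations
        are made up, and only two predictors are used to keep it
        short.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- 200
y <- rep(c(0, 1), each = n / 2)

# Simulated stand-in for the Auto predictors (not the real data).
demo <- data.frame(
  mpg01 = y,
  horsepower = rnorm(n, ifelse(y == 1, 80, 130), 20),
  weight = rnorm(n, ifelse(y == 1, 2350, 3550), 400)
)

train <- sample(n, n / 2)
lda.fit <- lda(mpg01 ~ horsepower + weight, data = demo[train, ])

# predict() returns a list; $class holds the predicted labels.
lda.pred <- predict(lda.fit, demo[-train, ])
test_err <- mean(lda.pred$class != demo$mpg01[-train])
test_err
```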
    (e) Perform QDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?
        > qda.fit=qda(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data)
        > qda.fit
        Call:
        qda(mpg01 ~ horsepower + weight + acceleration + displacement, 
            data = train_data)
        Prior probabilities of groups:
                0         1 
        0.5431472 0.4568528 
        Group means:
          horsepower   weight acceleration displacement
        0  129.08411 3557.757     14.55981      269.729
        1   79.64444 2345.233     16.39222      116.800
        Error rate against test data:
        > qda.pred=predict(qda.fit,test_data)
        > mean(qda.pred$class!=test_data$mpg01,na.rm=T)
        [1] 0.1025641
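        A single error rate hides which class is being
        misclassified. A sketch of a confusion table for a QDA fit,
        again on simulated data (the demo data frame, seed, and
        spreads are assumptions, not the Auto results):

```r
library(MASS)  # provides qda()

set.seed(2)
n <- 200
y <- rep(c(0, 1), each = n / 2)

# Simulated stand-in; the classes get different spreads, the case QDA targets.
demo <- data.frame(
  mpg01 = y,
  horsepower = rnorm(n, ifelse(y == 1, 80, 130), ifelse(y == 1, 12, 30)),
  weight = rnorm(n, ifelse(y == 1, 2350, 3550), ifelse(y == 1, 300, 600))
)

train <- sample(n, n / 2)
test <- demo[-train, ]

qda.fit <- qda(mpg01 ~ horsepower + weight, data = demo[train, ])
qda.pred <- predict(qda.fit, test)

# Rows are predictions, columns are true labels; off-diagonal cells are errors.
conf <- table(predicted = qda.pred$class, actual = test$mpg01)
err <- 1 - sum(diag(conf)) / sum(conf)
conf
err
```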
    (f) Perform logistic regression on the training data in order to
    predict mpg01 using the variables that seemed most associated
    with mpg01 in (b). What is the test error of the model obtained?
        > glm.fit=glm(mpg01 ~ horsepower + weight + acceleration + displacement,data=train_data,family=binomial)
        > glm.probs=predict(glm.fit,test_data,type="response")
        > glm.pred=rep(0,nrow(test_data))
        > glm.pred[glm.probs>.5]=1
        > mean(glm.pred!=test_data$mpg01)
        [1] 0.120603
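        Note on comparability: the three rates are consistent with
        23/195 (LDA), 20/195 (QDA) and 24/199 (logistic). That
        suggests the logistic rate also counts the test rows with
        missing predictors, which keep the default prediction of 0
        because their probability is NA. A sketch of scoring only the
        rows that have a prediction; the data, seed, and injected NAs
        are made up:

```r
set.seed(1)
n <- 200
y <- rep(c(0, 1), each = n / 2)
demo <- data.frame(mpg01 = y,
                   horsepower = rnorm(n, ifelse(y == 1, 80, 130), 20))

train <- sample(n, n / 2)
test_data <- demo[-train, ]
test_data$horsepower[1:4] <- NA  # inject missing predictors (made up)

glm.fit <- glm(mpg01 ~ horsepower, data = demo[train, ], family = binomial)

# predict() passes NAs through, so these four probabilities are NA.
glm.probs <- predict(glm.fit, test_data, type = "response")

# ifelse() keeps NA where the probability is NA instead of defaulting to 0.
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)

# na.rm = TRUE scores only rows with a prediction, like the LDA/QDA lines.
err_complete <- mean(glm.pred != test_data$mpg01, na.rm = TRUE)
sum(is.na(glm.pred))
err_complete
```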
    (g) Perform KNN on the training data, with several values of K,
    in order to predict mpg01. Use only the variables that seemed
    most associated with mpg01 in (b). What test errors do you
    obtain? Which value of K seems to perform the best on this data
    set?
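        Not answered yet. A sketch of how the comparison over K could
        be set up with knn() from the class package. It runs on
        simulated data, so the numbers it prints say nothing about
        Auto; the data, seed, K values, and the choice to standardize
        the predictors are all assumptions.

```r
library(class)  # provides knn()

set.seed(1)
n <- 200
y <- rep(c(0, 1), each = n / 2)

# Simulated predictors, standardized so weight does not dominate the
# distance (scaled on all rows for brevity).
X <- scale(cbind(
  horsepower = rnorm(n, ifelse(y == 1, 80, 130), 20),
  weight = rnorm(n, ifelse(y == 1, 2350, 3550), 400)
))

train <- sample(n, n / 2)
ks <- c(1, 5, 10, 50)

# Test error for each K: fraction of test rows whose label is mispredicted.
knn_err <- sapply(ks, function(k) {
  pred <- knn(X[train, ], X[-train, ], cl = factor(y[train]), k = k)
  mean(pred != y[-train])
})
names(knn_err) <- ks
knn_err
```

        The K with the lowest test error would be the one to report.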