mirror of
https://asciireactor.com/otho/cs-5821.git
synced 2024-11-22 00:45:07 +00:00
Part B: Choose one of Questions 10 or 11

5. We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA
or QDA to perform better on the training set? On the test set?

QDA is more flexible, so it will fit the training set more
closely than LDA. LDA should perform better on the test set:
the true boundary is linear, so QDA's extra flexibility adds
variance without any offsetting reduction in bias.

(b) If the Bayes decision boundary is non-linear, do we expect
LDA or QDA to perform better on the training set? On the test
set?

QDA will still perform better on the training set, and now it
should also perform better than LDA on the test set, since
its quadratic boundary can capture curvature in the true
relationship that LDA cannot.

(c) In general, as the sample size n increases, do we expect the
test prediction accuracy of QDA relative to LDA to improve,
decline, or be unchanged? Why?

Improve. LDA has an advantage when the training set is small
because it estimates fewer parameters and is less sensitive to
fluctuations in those few data. As the training set grows,
QDA can estimate its class-specific covariance matrices well,
and, assuming the true relationship is not linear, QDA should
eventually outperform LDA.

(d) True or False: Even if the Bayes decision boundary for a
given problem is linear, we will probably achieve a superior
test error rate using QDA rather than LDA because QDA is
flexible enough to model a linear decision boundary. Justify
your answer.

False. With a linear Bayes boundary, QDA has no bias advantage
over LDA, and its extra parameters make it more sensitive to
the particular training sample. It will tend to overfit
training data that don't completely represent the relationship
observed in test data, so its test error will likely be higher.

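The variance argument in (a) and (d) can be checked with a small
simulation. The sketch below is not part of the assignment: the class
means, the shared covariance matrix, the sample sizes, and the helper
names simulate() and err() are all assumptions chosen for illustration.
Both classes share one covariance matrix, so the Bayes boundary is
linear. It uses lda(), qda(), and mvrnorm() from the MASS package.

```r
# Hypothetical simulation: two Gaussian classes with a SHARED covariance,
# so the Bayes decision boundary is linear (the setting of parts (a) and (d)).
library(MASS)  # lda(), qda(), mvrnorm()

set.seed(1)

# Draw n points per class; all parameter values are illustrative only.
simulate <- function(n) {
  sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
  a <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)  # class 0
  b <- mvrnorm(n, mu = c(1, 1), Sigma = sigma)  # class 1
  data.frame(x1 = c(a[, 1], b[, 1]),
             x2 = c(a[, 2], b[, 2]),
             y  = factor(rep(c(0, 1), each = n)))
}

train <- simulate(20)    # small training set
test  <- simulate(5000)  # large test set approximates the true error

lda.fit <- lda(y ~ x1 + x2, data = train)
qda.fit <- qda(y ~ x1 + x2, data = train)

# Misclassification rate of a fitted model on a data set.
err <- function(fit, data) mean(predict(fit, data)$class != data$y)

errors <- c(lda_train = err(lda.fit, train), qda_train = err(qda.fit, train),
            lda_test  = err(lda.fit, test),  qda_test  = err(qda.fit, test))
print(round(errors, 3))
```

Rerunning with a larger training set, or with different covariance
matrices per class, is a quick way to explore parts (b) and (c) as well.
A single run is noisy, so averaging over several seeds gives a fairer
comparison.
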
8. Suppose that we take a data set, divide it into equally-sized
training and test sets, and then try out two different
classification procedures. First we use logistic regression and
get an error rate of 20 % on the training data and 30 % on the
test data. Next we use 1-nearest neighbors (i.e. K = 1) and get
an average error rate (averaged over both test and training data
sets) of 18 %. Based on these results, which method should we
prefer to use for classification of new observations? Why?

Logistic regression. With K = 1, each training observation is
its own nearest neighbor, so the training error of 1-nearest
neighbors is 0 %. An average of 18 % over two equally-sized
sets therefore implies a test error of 36 %, which is worse
than the 30 % test error of logistic regression. The test
error is what matters for classifying new observations, so
the lower average error of the nearest-neighbor method is
misleading, and logistic regression is the better choice.

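One way to check this comparison is to back out the 1-nearest-neighbor
test error from the reported average. The sketch below assumes that the
1-NN training error is exactly 0, which holds whenever no two identical
training points carry different labels. The variable names are made up
for illustration.

```r
# Assumption: with K = 1 every training point is its own nearest neighbor,
# so the 1-NN training error is 0 (barring duplicate points with different labels).
knn_avg_error   <- 0.18
knn_train_error <- 0

# Equal-sized sets: average = (train + test) / 2, so test = 2 * average - train.
knn_test_error   <- 2 * knn_avg_error - knn_train_error
logit_test_error <- 0.30

print(c(knn_test = knn_test_error, logistic_test = logit_test_error))
```

Under that assumption the implied 1-NN test error is 0.36, against 0.30
for logistic regression.
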
9. This problem has to do with odds.

(a) On average, what fraction of people with an odds of 0.37 of
defaulting on their credit card payment will in fact default?

0.27, since p = odds / (1 + odds) = 0.37 / 1.37.

(b) Suppose that an individual has a 16 % chance of defaulting
on her credit card payment. What are the odds that she will
default?

0.19, since odds = p / (1 - p) = 0.16 / 0.84.

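The two conversions in this problem are inverses of each other. A
minimal sketch follows; the helper names odds_to_prob() and
prob_to_odds() are made up here and are not from any package.

```r
# Hypothetical helpers for converting between odds and probability.
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

print(round(odds_to_prob(0.37), 2))  # part (a): 0.27
print(round(prob_to_odds(0.16), 2))  # part (b): 0.19
```
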
11. In this problem, you will develop a model to predict whether a
given car gets high or low gas mileage based on the Auto data
set.

──────────────────────────────────────────────────────────────────────────
(a) Create a binary variable, mpg01, that contains a 1 if mpg
contains a value above its median, and a 0 if mpg contains a
value below its median. You can compute the median using the
median() function. Note you may find it helpful to use the
data.frame() function to create a single data set containing
both mpg01 and the other Auto variables.

> auto$mpg01=rep(0,397)
> auto$mpg01[auto$mpg>median(auto$mpg)]=1

> auto$mpg01
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
[149] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1
[186] 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
[223] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0
[260] 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
[297] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1

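The same indicator can be built in one step without hard-coding the
number of rows. The sketch below uses a made-up toy data frame, not the
Auto data, so the values are for illustration only.

```r
# Toy stand-in for the Auto data; the mpg values are invented.
toy <- data.frame(mpg = c(18, 15, 36, 24, 31, 22))

# The comparison yields TRUE/FALSE; as.integer() maps that to 1/0.
toy$mpg01 <- as.integer(toy$mpg > median(toy$mpg))
print(toy)
```

Because the comparison is vectorized, the result always has one entry
per row, whatever the size of the data set.
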
──────────────────────────────────────────────────────────────────────────
(b) Explore the data graphically in order to investigate the
association between mpg01 and the other features. Which of the
other features seem most likely to be useful in predicting
mpg01? Scatterplots and boxplots may be useful tools to answer
this question. Describe your findings.

In the scatter plots, horsepower clearly has the strongest
relationship: all of the cars with mpg over the median are on
one side of the plot. Weight and acceleration are reasonable,
but there is significant overlap among the middle values.
Displacement is borderline, and the other variables don't
have a useful relationship with mpg01.

The boxplots indicate that acceleration isn't a great
predictor of mpg01, but displacement is. They also confirm
horsepower and weight as good predictors, and cylinders also
seems to be very strong, even though that was not apparent
from the scatter plots.

I will use mpg01 ~ horsepower + weight + cylinders + displacement

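A numeric counterpart to the boxplots is to compare the per-group
medians of each candidate predictor. The sketch below uses the built-in
mtcars data as a stand-in, which is an assumption: it is not the Auto
data, and hp, wt, cyl, and disp only play the roles of horsepower,
weight, cylinders, and displacement.

```r
# Stand-in data: built-in mtcars, NOT the Auto data set.
cars <- mtcars
cars$mpg01 <- as.integer(cars$mpg > median(cars$mpg))

# Median of each candidate predictor within the mpg01 = 0 and mpg01 = 1 groups.
group_medians <- aggregate(cbind(hp, wt, cyl, disp) ~ mpg01,
                           data = cars, FUN = median)
print(group_medians)
```

A predictor whose two group medians are far apart relative to its
spread is the kind of variable that stands out in a boxplot.
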
──────────────────────────────────────────────────────────────────────────
(c) Split the data into a training set and a test set.

A 50/50 random split seems appropriate.

> training_indices = sample(nrow(auto),397/2)
> train_bools = rep(F,length(auto$mpg))
> train_bools[training_indices]=T
> head(train_bools)
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
> length(train_bools)
[1] 397
> train_data = auto[train_bools,]
> test_data = auto[!train_bools,]

I later changed this split. A solution I found online used a
different split, and I was having trouble with the KNN model,
so I followed that style and trained on the even model years:

> train_bools <- (auto$year %% 2 == 0)

The rest of the steps are the same.

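A random split is easier to reproduce if the seed is fixed and the row
count is taken from the data rather than typed in. The sketch below is
one way to do that. It uses the built-in mtcars data as a stand-in for
the Auto data, and the seed value is arbitrary.

```r
set.seed(1)     # arbitrary seed, so the same split is drawn on every run
cars <- mtcars  # stand-in for the Auto data
n <- nrow(cars)

# Logical index: TRUE for the rows sampled into the training half.
train_bools <- seq_len(n) %in% sample(n, floor(n / 2))
train_data  <- cars[train_bools, ]
test_data   <- cars[!train_bools, ]

print(c(train = nrow(train_data), test = nrow(test_data)))
```
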
──────────────────────────────────────────────────────────────────────────
(d) Perform LDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?

> lda.fit
Call:
lda(mpg01 ~ horsepower + weight + cylinders + displacement, data = train_data)

Prior probabilities of groups:
        0         1
0.4666667 0.5333333

Group means:
  horsepower   weight cylinders displacement
0  131.96939 3579.827  6.755102     268.4082
1   77.96429 2313.598  4.071429     111.7188

Coefficients of linear discriminants:
                       LD1
horsepower    0.0060634365
weight       -0.0011442212
cylinders    -0.6390942259
displacement  0.0004517291

***Test Data Error Rate:
> mean(lda.pred$class!=test_data$mpg01,na.rm=T)
[1] 0.1428571

──────────────────────────────────────────────────────────────────────────
(e) Perform QDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?

> qda.fit
Call:
lda(mpg01 ~ horsepower + weight + cylinders + displacement, data = train_data)

Prior probabilities of groups:
        0         1
0.4666667 0.5333333

Group means:
  horsepower   weight cylinders displacement
0  131.96939 3579.827  6.755102     268.4082
1   77.96429 2313.598  4.071429     111.7188

Coefficients of linear discriminants:
                       LD1
horsepower    0.0060634365
weight       -0.0011442212
cylinders    -0.6390942259
displacement  0.0004517291

***Test Data Error Rate:
> mean(qda.pred$class!=test_data$mpg01,na.rm=T)
[1] 0.1428571

──────────────────────────────────────────────────────────────────────────
(f) Perform logistic regression on the training data in order to
predict mpg01 using the variables that seemed most associated
with mpg01 in (b). What is the test error of the model obtained?

> glm.fit=glm(mpg01 ~ horsepower + weight + cylinders + displacement,data=train_data,family=binomial)
> glm.probs=predict(glm.fit,test_data,type="response")
> glm.pred=rep(0,199)
> glm.pred[glm.probs>.5]=1

***Test Data Error Rate:
> mean(glm.pred!=test_data$mpg01)
[1] 0.1407035

──────────────────────────────────────────────────────────────────────────
(g) Perform KNN on the training data, with several values of K,
in order to predict mpg01. Use only the variables that seemed
most associated with mpg01 in (b). What test errors do you
obtain? Which value of K seems to perform the best on this data
set?

The knn method can't handle the NA values, so I dropped the
incomplete rows and rebuilt the split first:

> set.seed(1)
> auto <- na.omit(auto)
> train_bools <- (auto$year %% 2 == 0)
> train_data = auto[train_bools,]
> test_data = auto[!train_bools,]

> train.X = cbind(auto$horsepower,auto$displacement,auto$weight,auto$acceleration)[train_bools,]
> test.X = cbind(auto$horsepower,auto$displacement,auto$weight,auto$acceleration)[!train_bools,]
> train.mpg01 = auto$mpg01[train_bools]

***Test Data Error Rates:
k = 1
> mean(knn.pred != test_data$mpg01)
[1] 0.1483516
k = 2
> mean(knn.pred != test_data$mpg01)
[1] 0.1593407
k = 3
> mean(knn.pred != test_data$mpg01)
[1] 0.1648352
k = 4
> mean(knn.pred != test_data$mpg01)
[1] 0.1813187

k = 1 gives the lowest test error of the values tried, since
the error rate increases steadily from k = 1 to k = 4.

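The per-k runs above can be collected in a single loop. The sketch below
is an illustration under several assumptions: it uses the built-in
mtcars data as a stand-in for the Auto data (with qsec standing in for
acceleration), it splits on even and odd row numbers rather than model
years, and it standardizes the predictors with scale(), a step the
transcript above does not take. knn() comes from the class package.

```r
library(class)  # knn()

set.seed(1)     # knn() breaks distance ties at random
cars <- mtcars  # stand-in data, NOT the Auto data set
cars$mpg01 <- as.integer(cars$mpg > median(cars$mpg))
train_bools <- seq_len(nrow(cars)) %% 2 == 0  # illustrative split: even rows train

# Standardize the predictors so that no single variable dominates the distance.
X <- scale(cars[, c("hp", "disp", "wt", "qsec")])
train.X <- X[train_bools, ]
test.X  <- X[!train_bools, ]
train.y <- cars$mpg01[train_bools]
test.y  <- cars$mpg01[!train_bools]

# Test error for each k, computed in one pass.
ks <- 1:4
knn.err <- sapply(ks, function(k) {
  knn.pred <- knn(train.X, test.X, cl = train.y, k = k)
  mean(knn.pred != test.y)
})
names(knn.err) <- paste0("k=", ks)
print(knn.err)
```

Widening ks, for example to 1:20, is a cheap way to see whether the
error keeps rising or turns back down at larger k.
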