Part B: Choose one of Questions 10 or 11

5. We now examine the differences between LDA and QDA.

    (a) If the Bayes decision boundary is linear, do we expect LDA
    or QDA to perform better on the training set? On the test set?

        QDA is more flexible, so it will match the training set
        more closely than LDA. LDA will perform better on the test
        set: since the real boundary is linear, QDA's extra
        flexibility buys no reduction in bias and only adds
        variance.

    (b) If the Bayes decision boundary is non-linear, do we expect
    LDA or QDA to perform better on the training set? On the test
    set?

        QDA will still perform better on the training set, but now
        it should also perform better than LDA on the test set,
        since QDA can capture the curvature of the true boundary
        while LDA cannot.

    (c) In general, as the sample size n increases, do we expect the
    test prediction accuracy of QDA relative to LDA to improve,
    decline, or be unchanged? Why?

        Improve. LDA has an advantage when the training set is
        small because its lower variance makes it less sensitive to
        the fluctuations of those few data. As the training set
        grows, the variance of QDA's extra parameters shrinks, so
        QDA can estimate its coefficients well; unless the true
        boundary is exactly linear, QDA should eventually
        out-perform LDA.

    (d) True or False: Even if the Bayes decision boundary for a
    given problem is linear, we will probably achieve a superior
    test error rate using QDA rather than LDA because QDA is
    flexible enough to model a linear decision boundary. Justify
    your answer.

        False. QDA's extra flexibility adds variance rather than
        removing bias: it will fit noise in the training data that
        does not reflect the true (linear) relationship, so its
        test error rate will likely be worse than LDA's, especially
        when the training set is small.
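
        A quick simulation sketch (my own illustration, not part of
        the assignment) makes the variance argument concrete. The
        data below have a linear Bayes boundary; comparing lda.err
        and qda.err over repeated seeds, LDA's test error is
        typically at or below QDA's:

        > library(MASS)
        > set.seed(1)
        > n <- 200
        > x1 <- rnorm(n); x2 <- rnorm(n)
        > y <- factor(ifelse(x1 + x2 + rnorm(n) > 0, 1, 0))  # linear boundary
        > dat <- data.frame(x1, x2, y)
        > train <- sample(n, n/2)
        > lda.fit <- lda(y ~ x1 + x2, data = dat[train, ])
        > qda.fit <- qda(y ~ x1 + x2, data = dat[train, ])
        > lda.err <- mean(predict(lda.fit, dat[-train, ])$class != dat$y[-train])
        > qda.err <- mean(predict(qda.fit, dat[-train, ])$class != dat$y[-train])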


8. Suppose that we take a data set, divide it into equally-sized
   training and test sets, and then try out two different
   classification procedures. First we use logistic regression and
   get an error rate of 20 % on the training data and 30 % on the
   test data. Next we use 1-nearest neighbors (i.e. K = 1) and get
   an average error rate (averaged over both test and training data
   sets) of 18 %. Based on these results, which method should we
   prefer to use for classification of new observations? Why?

        Logistic regression. With K = 1, each training point is its
        own nearest neighbor, so the 1-NN training error rate is
        0 %. For the average over the two equally-sized sets to be
        18 %, the 1-NN test error rate must therefore be
        2 × 18 % − 0 % = 36 %. For classifying new observations the
        relevant number is the test error rate: 30 % for logistic
        regression versus 36 % for 1-NN, so logistic regression is
        the better choice despite its worse training fit.
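
        A short check of the 0 % training-error claim (a sketch
        with made-up data, not from the original transcript):
        predicting the training set from itself with k = 1 returns
        each point's own label.

        > library(class)
        > set.seed(1)
        > X <- matrix(rnorm(200), 100, 2)            # hypothetical features
        > y <- factor(sample(0:1, 100, replace = TRUE))
        > mean(knn(X, X, y, k = 1) != y)             # 0: each point matches itself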


9. This problem has to do with odds.

    (a) On average, what fraction of people with an odds of 0.37 of
    defaulting on their credit card payment will in fact default?

        p = odds / (1 + odds) = 0.37 / 1.37 ≈ 0.27

    (b) Suppose that an individual has a 16 % chance of defaulting
    on her credit card payment. What are the odds that she will
    default?

        odds = p / (1 - p) = 0.16 / 0.84 ≈ 0.19
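
        The same conversions in R, as a quick sanity check (not
        part of the original transcript):

        > 0.37 / (1 + 0.37)    # odds to probability
        [1] 0.270073
        > 0.16 / (1 - 0.16)    # probability to odds
        [1] 0.1904762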


11. In this problem, you will develop a model to predict whether a
    given car gets high or low gas mileage based on the Auto data
    set.

──────────────────────────────────────────────────────────────────────────
    (a) Create a binary variable, mpg01, that contains a 1 if mpg
    contains a value above its median, and a 0 if mpg contains a
    value below its median. You can compute the median using the
    median() function. Note you may find it helpful to use the
    data.frame() function to create a single data set containing
    both mpg01 and the other Auto variables.

        > auto$mpg01=rep(0,397)
        > auto$mpg01[auto$mpg>median(auto$mpg)]=1

> auto$mpg01
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
[149] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1
[186] 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
[223] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0
[260] 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
[297] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
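
        A length-agnostic equivalent (a sketch; the hard-coded 397
        above assumes this particular file's row count):

        > auto$mpg01 <- as.integer(auto$mpg > median(auto$mpg, na.rm = TRUE))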

──────────────────────────────────────────────────────────────────────────
    (b) Explore the data graphically in order to investigate the
    association between mpg01 and the other features. Which of the
    other features seem most likely to be useful in predicting
    mpg01? Scatterplots and boxplots may be useful tools to answer
    this question. Describe your findings.

        Horsepower shows the clearest relationship in the scatter
        plots: nearly all of the above-median mpg values fall on
        one side of the plot. Weight and acceleration are passable,
        but their middle values overlap substantially. Displacement
        is borderline, and the remaining variables don't have a
        particularly useful relationship with the median split.

        The boxplots indicate that acceleration really isn't a
        great predictor of mpg01, but displacement is. They also
        confirm horsepower and weight as good predictors, and
        cylinders looks very strong as well, even though I didn't
        take that from the scatter plots.

        I will use mpg01 ~ horsepower + weight + cylinders + displacement
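
        The plots themselves aren't reproduced here; commands along
        these lines would generate them (a sketch, not the original
        session):

        > pairs(auto[, c("mpg01", "horsepower", "weight", "acceleration",
        +                "displacement", "cylinders")])
        > boxplot(horsepower ~ mpg01, data = auto, xlab = "mpg01", ylab = "horsepower")
        > boxplot(displacement ~ mpg01, data = auto, xlab = "mpg01", ylab = "displacement")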

──────────────────────────────────────────────────────────────────────────
    (c) Split the data into a training set and a test set.

        A 50/50 random split seems appropriate:

        > training_indices = sample(nrow(auto),397/2)
        > train_bools = rep(F,length(auto$mpg))
        > train_bools[training_indices]=T
        > head(train_bools)
        [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
        > length(train_bools)
        [1] 397
        > train_data = auto[train_bools,]
        > test_data = auto[!train_bools,]

        I later changed this: a solution I found online used a
        different split, and since I was having trouble with the
        KNN model I followed its style, taking even model years as
        the training set:

        > train <- (auto$year %% 2 == 0)

        with the rest of the split code unchanged (spelled out
        below).
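
        Reconstructed in full ("the rest the same"), the
        replacement split would be:

        > train <- (auto$year %% 2 == 0)
        > train_data <- auto[train, ]
        > test_data <- auto[!train, ]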

──────────────────────────────────────────────────────────────────────────
    (d) Perform LDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?
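
        The fit and prediction calls aren't in the transcript;
        reconstructed from the Call line below, they would have
        been roughly:

        > library(MASS)
        > lda.fit <- lda(mpg01 ~ horsepower + weight + cylinders + displacement,
        +               data = train_data)
        > lda.pred <- predict(lda.fit, test_data)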

        > lda.fit
        Call:
        lda(mpg01 ~ horsepower + weight + cylinders + displacement, data = train_data)

        Prior probabilities of groups:
                0         1 
        0.4666667 0.5333333 

        Group means:
          horsepower   weight cylinders displacement
        0  131.96939 3579.827  6.755102     268.4082
        1   77.96429 2313.598  4.071429     111.7188

        Coefficients of linear discriminants:
                               LD1
        horsepower    0.0060634365
        weight       -0.0011442212
        cylinders    -0.6390942259
        displacement  0.0004517291

     ***Test Data Error Rate:
        > mean(lda.pred$class!=test_data$mpg01,na.rm=T)
        [1] 0.1428571

──────────────────────────────────────────────────────────────────────────
    (e) Perform QDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?
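
        As in (d), the fit call is implied by the Call line below;
        it would have been roughly:

        > qda.fit <- qda(mpg01 ~ horsepower + weight + cylinders + displacement,
        +               data = train_data)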

        > qda.fit
        Call:
        qda(mpg01 ~ horsepower + weight + cylinders + displacement, data = train_data)

        Prior probabilities of groups:
                0         1 
        0.4666667 0.5333333 

        Group means:
          horsepower   weight cylinders displacement
        0  131.96939 3579.827  6.755102     268.4082
        1   77.96429 2313.598  4.071429     111.7188

        > qda.pred=predict(qda.fit,test_data)

    ***Test Data Error Rate:
        > mean(qda.pred$class!=test_data$mpg01,na.rm=T)
        [1] 0.1483516

──────────────────────────────────────────────────────────────────────────
    (f) Perform logistic regression on the training data in order
    to predict mpg01 using the variables that seemed most
    associated with mpg01 in (b). What is the test error of the
    model obtained?

        > glm.fit=glm(mpg01 ~ horsepower + weight + cylinders + displacement,
        +             data=train_data, family=binomial)
        > glm.probs=predict(glm.fit,test_data,type="response")
        > glm.pred=rep(0,nrow(test_data))
        > glm.pred[glm.probs>.5]=1

     ***Test Data Error Rate:
        > mean(glm.pred!=test_data$mpg01)
        [1] 0.1373626
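
        To see where those errors fall, a confusion matrix is a
        quick follow-up (a sketch, not from the original session):

        > table(glm.pred, test_data$mpg01)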

──────────────────────────────────────────────────────────────────────────
    (g) Perform KNN on the training data, with several values of K,
    in order to predict mpg01. Use only the variables that seemed
    most associated with mpg01 in (b). What test errors do you
    obtain? Which value of K seems to perform the best on this data
    set?

        The knn() function (from the class package) can't handle NA
        values, so I dropped the incomplete rows and rebuilt the
        split first:

        > library(class)
        > set.seed(1)
        > auto <- na.omit(auto)
        > train_bools <- (auto$year %% 2 == 0)
        > train_data = auto[train_bools,]
        > test_data = auto[!train_bools,]

        > train.X = cbind(auto$horsepower,auto$displacement,auto$weight,
        +                 auto$cylinders)[train_bools,]
        > test.X = cbind(auto$horsepower,auto$displacement,auto$weight,
        +                auto$cylinders)[!train_bools,]
        > train.mpg01 = auto$mpg01[train_bools]

     ***Test Data Error Rates:
     k = 1
        > knn.pred = knn(train.X,test.X,train.mpg01,k=1)
        > mean(knn.pred != test_data$mpg01)
        [1] 0.1483516
     k = 2
        > knn.pred = knn(train.X,test.X,train.mpg01,k=2)
        > mean(knn.pred != test_data$mpg01)
        [1] 0.1593407
     k = 3
        > knn.pred = knn(train.X,test.X,train.mpg01,k=3)
        > mean(knn.pred != test_data$mpg01)
        [1] 0.1648352
     k = 4
        > knn.pred = knn(train.X,test.X,train.mpg01,k=4)
        > mean(knn.pred != test_data$mpg01)
        [1] 0.1923077

        k = 1 looks like the best, since the error rate increases with k.
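
        A compact way to scan more values of K (a sketch, not from
        the original session):

        > sapply(1:10, function(k) {
        +     knn.pred <- knn(train.X, test.X, train.mpg01, k = k)
        +     mean(knn.pred != test_data$mpg01)
        + })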