mirror of https://asciireactor.com/otho/cs-5821.git, synced 2025-10-31 17:38:04 +00:00
117 lines, 5.2 KiB, Plaintext

Part B: Choose one of Questions 10 or 11

5. We now examine the differences between LDA and QDA.

    (a) If the Bayes decision boundary is linear, do we expect LDA
    or QDA to perform better on the training set? On the test set?

        QDA is more flexible, so it will fit the training set more
        closely than LDA. LDA will perform better on the test set:
        the true boundary is linear, so QDA's extra flexibility
        removes no bias and only adds variance (overfitting).


    (b) If the Bayes decision boundary is non-linear, do we expect
    LDA or QDA to perform better on the training set? On the test
    set?

        QDA will still perform better on the training set, and it
        should now also perform better than LDA on the test set,
        since its quadratic boundary can capture curvature in the
        true boundary that LDA's linear boundary cannot.


    (c) In general, as the sample size n increases, do we expect the
    test prediction accuracy of QDA relative to LDA to improve,
    decline, or be unchanged? Why?

        Improve. LDA has an advantage when the training set is
        small because it estimates fewer parameters and so is less
        sensitive to fluctuations in those few data. As the
        training set grows, the variance of QDA's estimates
        shrinks, and assuming the real relationship is not linear,
        QDA should eventually outperform LDA.


    (d) True or False: Even if the Bayes decision boundary for a
    given problem is linear, we will probably achieve a superior
    test error rate using QDA rather than LDA because QDA is
    flexible enough to model a linear decision boundary. Justify
    your answer.

        False. When the true boundary is linear, QDA's extra
        parameters fit noise in the training data rather than real
        structure, so it has higher variance than LDA with no
        reduction in bias. Its test error will probably be higher,
        especially when the training set is small.


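        The variance argument in 5(a) and 5(d) can be checked with a
        small simulation. The following is a minimal sketch, not part
        of the assignment: it assumes one predictor, two classes
        drawn from N(-1, 1) and N(1, 1) with equal priors (so the
        Bayes boundary is the single point x = 0), and the function
        names and sample sizes are illustrative choices.

```python
import math
import random
from statistics import mean, variance


def fit(x0, x1):
    """Estimate class means, per-class variances, and the pooled variance."""
    m0, m1 = mean(x0), mean(x1)
    v0, v1 = variance(x0), variance(x1)
    pooled = ((len(x0) - 1) * v0 + (len(x1) - 1) * v1) / (len(x0) + len(x1) - 2)
    return m0, m1, v0, v1, pooled


def lda_predict(x, m0, m1, pooled):
    # Equal priors are assumed, so the log-prior terms cancel.
    d0 = x * m0 / pooled - m0 ** 2 / (2 * pooled)
    d1 = x * m1 / pooled - m1 ** 2 / (2 * pooled)
    return 1 if d1 > d0 else 0


def qda_predict(x, m0, m1, v0, v1):
    # Each class keeps its own variance, which makes the boundary quadratic.
    d0 = -((x - m0) ** 2) / (2 * v0) - 0.5 * math.log(v0)
    d1 = -((x - m1) ** 2) / (2 * v1) - 0.5 * math.log(v1)
    return 1 if d1 > d0 else 0


def simulate(n_train=10, n_test=500, reps=100, seed=0):
    """Average test error of LDA and QDA over repeated small training sets."""
    rng = random.Random(seed)
    lda_wrong = qda_wrong = total = 0
    for _ in range(reps):
        x0 = [rng.gauss(-1, 1) for _ in range(n_train)]
        x1 = [rng.gauss(1, 1) for _ in range(n_train)]
        m0, m1, v0, v1, pooled = fit(x0, x1)
        for label, mu in ((0, -1), (1, 1)):
            for _ in range(n_test):
                x = rng.gauss(mu, 1)
                lda_wrong += lda_predict(x, m0, m1, pooled) != label
                qda_wrong += qda_predict(x, m0, m1, v0, v1) != label
                total += 1
    return lda_wrong / total, qda_wrong / total


lda_err, qda_err = simulate()
print(f"LDA average test error: {lda_err:.3f}")
print(f"QDA average test error: {qda_err:.3f}")
```

        Under these assumptions both error rates should sit a little
        above the Bayes error of about 0.159, and one would expect
        the gap between them to shrink as n_train grows. The exact
        figures depend on the seed.
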
8. Suppose that we take a data set, divide it into equally-sized
   training and test sets, and then try out two different
   classification procedures. First we use logistic regression and
   get an error rate of 20% on the training data and 30% on the
   test data. Next we use 1-nearest neighbors (i.e. K = 1) and get
   an average error rate (averaged over both test and training data
   sets) of 18%. Based on these results, which method should we
   prefer to use for classification of new observations? Why?

        Logistic regression. With K = 1, each training observation
        is its own nearest neighbor, so the training error of
        1-nearest neighbors is 0% (barring identical observations
        with different labels). Because the training and test sets
        are the same size, an average error rate of 18% means the
        test error is 2 * 18% - 0% = 36%. Logistic regression has a
        test error of 30%, which is lower, and test error is what
        matters when classifying new observations. The 18% average
        only looks better than logistic regression's 25% average
        because it includes the uninformative 0% training error.


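        The arithmetic behind this comparison can be written out
        explicitly. This is a minimal sketch that assumes the 1-NN
        training error is exactly 0 and that the two sets are equally
        sized, as the question states; the helper name is
        illustrative.

```python
# Assumption: with K = 1 every training point is its own nearest
# neighbor, so the 1-NN training error is taken to be 0.
def implied_test_error(avg_error, train_error):
    """Recover the test error when train and test sets are equally sized."""
    return 2 * avg_error - train_error


knn_test = implied_test_error(avg_error=0.18, train_error=0.0)
logit_train, logit_test = 0.20, 0.30
logit_avg = (logit_train + logit_test) / 2

print(f"1-NN test error:     {knn_test:.2f}")    # 0.36
print(f"logistic test error: {logit_test:.2f}")  # 0.30
print(f"logistic average:    {logit_avg:.2f}")   # 0.25
```

        On test error alone, 1-nearest neighbors (36%) does worse
        than logistic regression (30%).
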
9. This problem has to do with odds.

    (a) On average, what fraction of people with an odds of 0.37 of
    defaulting on their credit card payment will in fact default?

    (b) Suppose that an individual has a 16% chance of defaulting
    on her credit card payment. What are the odds that she will
    default?



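        Both parts follow from the definition odds = p / (1 - p),
        which inverts to p = odds / (1 + odds). A minimal Python
        sketch of the two conversions (the function names are
        illustrative, not from the course materials):

```python
def odds_to_prob(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


def prob_to_odds(p):
    """Convert a probability p to odds p / (1 - p)."""
    return p / (1 - p)


frac_default = odds_to_prob(0.37)  # part (a)
odds_default = prob_to_odds(0.16)  # part (b)

print(f"(a) fraction defaulting: {frac_default:.3f}")  # 0.270
print(f"(b) odds of default:     {odds_default:.3f}")  # 0.190
```

        So about 27% of people with odds of 0.37 will default, and a
        16% chance of default corresponds to odds of about 0.19.
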
11. In this problem, you will develop a model to predict whether a
    given car gets high or low gas mileage based on the Auto data
    set.

    (a) Create a binary variable, mpg01, that contains a 1 if mpg
    contains a value above its median, and a 0 if mpg contains a
    value below its median. You can compute the median using the
    median() function. Note you may find it helpful to use the
    data.frame() function to create a single data set containing
    both mpg01 and the other Auto variables.

    (b) Explore the data graphically in order to investigate the
    association between mpg01 and the other features. Which of the
    other features seem most likely to be useful in predicting
    mpg01? Scatterplots and boxplots may be useful tools to answer
    this question. Describe your findings.

    (c) Split the data into a training set and a test set.

    (d) Perform LDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?

    (e) Perform QDA on the training data in order to predict mpg01
    using the variables that seemed most associated with mpg01 in
    (b). What is the test error of the model obtained?

    (f) Perform logistic regression on the training data in order to
    predict mpg01 using the variables that seemed most associated
    with mpg01 in (b). What is the test error of the model obtained?

    (g) Perform KNN on the training data, with several values of K,
    in order to predict mpg01. Use only the variables that seemed
    most associated with mpg01 in (b). What test errors do you
    obtain? Which value of K seems to perform the best on this data
    set?
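
        The mechanics of steps (a), (c), and (g) can be sketched on a
        tiny example. This is only an illustration under loud
        assumptions: the question refers to R functions, whereas the
        sketch uses Python's standard library; the (weight, mpg)
        values are made up and are NOT the Auto data; and weight is
        used as the single predictor purely for illustration, since
        step (b) is what would determine the real choice of
        variables.

```python
from collections import Counter
from statistics import median

# Made-up toy records for illustration only; NOT the real Auto data.
weights = [1800, 2050, 2200, 2450, 2600, 2750, 3000, 3250, 3400, 3650, 3800, 4050]
mpg = [34, 32, 30, 27, 29, 24, 22, 23, 18, 17, 15, 13]

# (a) mpg01 is 1 when mpg is above the median, else 0.
mpg_median = median(mpg)
mpg01 = [1 if m > mpg_median else 0 for m in mpg]

# (c) A simple deterministic split: even-indexed rows train, odd-indexed rows test.
rows = list(zip(weights, mpg01))
train_set = [row for i, row in enumerate(rows) if i % 2 == 0]
test_set = [row for i, row in enumerate(rows) if i % 2 == 1]


def knn_predict(train_rows, x, k):
    """Majority vote among the k training rows closest to x (one predictor)."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(train_rows, test_rows, k):
    """Fraction of test rows whose predicted label differs from the true one."""
    wrong = sum(knn_predict(train_rows, x, k) != y for x, y in test_rows)
    return wrong / len(test_rows)


# (g) Test error for several values of K (odd K avoids tied votes with two classes).
errors = {k: error_rate(train_set, test_set, k) for k in (1, 3, 5)}
for k, err in errors.items():
    print(f"K={k}: test error {err:.2f}")
```

        With several predictors on different scales, the distance
        would usually be computed on standardized variables, and a
        random split would replace the even/odd split used here.
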