Part B: Choose one of Questions 10 or 11

5. We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
QDA is more flexible, so it will fit the training set more closely than LDA. LDA should perform better on the test set: because the true boundary is linear, QDA's extra flexibility adds variance without reducing bias.

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
QDA will still perform better on the training set, and it should now also perform better than LDA on the test set, since QDA can capture curvature in the true boundary that LDA cannot.

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
Improve. LDA has an advantage when the training set is small because it has lower variance and is less sensitive to fluctuations in those few observations. As the training set grows, QDA's additional parameters can be estimated reliably, and assuming the true boundary is not linear, QDA should eventually outperform LDA.

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
False. With a linear Bayes boundary, QDA gains nothing in bias but has higher variance: it will tend to fit features of the training data that do not carry over to the test data.

8. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
Logistic regression. With K = 1, each training observation is its own nearest neighbor, so the training error rate of 1-nearest neighbors is 0%. Because the training and test sets are the same size, an average error rate of 18% implies a test error rate of 36%. That is worse than the 30% test error of logistic regression, and test error is what matters for classifying new observations. Logistic regression's average error over both sets is 25%, which looks worse than 18%, but the comparison is misleading because the 1-nearest neighbor average is pulled down by its training error of zero.

9. This problem has to do with odds.
(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
(b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
(c) Split the data into a training set and a test set.
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
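
Steps (a), (c) and (g) of Question 11 can be sketched in a few lines. The following is only an illustration in Python using the standard library, not a solution on the Auto data: the data set is a synthetic stand-in, and every function name (make_mpg01, split_rows, knn_predict, error_rate, make_toy_cars) is hypothetical. Values exactly equal to the median are assigned 0, a choice the exercise leaves open.

```python
import random
from statistics import median


def make_mpg01(mpg):
    """Step (a): 1 if a value is above the median of mpg, else 0."""
    cutoff = median(mpg)
    return [1 if value > cutoff else 0 for value in mpg]


def split_rows(rows, test_fraction=0.5, seed=0):
    """Step (c): shuffle row indices and split into (train, test)."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    n_test = int(len(rows) * test_fraction)
    test = [rows[i] for i in order[:n_test]]
    train = [rows[i] for i in order[n_test:]]
    return train, test


def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x
    (squared Euclidean distance)."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )
    ones = sum(label for _, label in by_distance[:k])
    return 1 if 2 * ones > k else 0  # ties (even k only) go to class 0


def error_rate(train, rows, k):
    """Fraction of rows whose label differs from the KNN prediction."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in rows)
    return wrong / len(rows)


def make_toy_cars(n=200, seed=1):
    """HYPOTHETICAL stand-in for Auto: two roughly standardized
    features and a noisy mpg that decreases with both."""
    rng = random.Random(seed)
    features, mpg = [], []
    for _ in range(n):
        weight = rng.gauss(0, 1)
        horsepower = 0.7 * weight + rng.gauss(0, 0.7)
        features.append((weight, horsepower))
        mpg.append(23 - 5 * weight - 2 * horsepower + rng.gauss(0, 3))
    return features, mpg


features, mpg = make_toy_cars()
rows = list(zip(features, make_mpg01(mpg)))  # each row: (features, mpg01)
train, test = split_rows(rows)
for k in (1, 3, 5, 11, 21):
    print(f"K={k:2d}  test error = {error_rate(train, test, k):.3f}")
```

On real data the predictors would normally be standardized before computing distances, since KNN is sensitive to scale. Calling error_rate(train, train, 1) returns 0 here, because with K = 1 each training point is its own nearest neighbor; this is the fact Question 8 relies on.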
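
Question 9 rests on the definition odds = p / (1 - p), which inverts to p = odds / (1 + odds). A minimal Python sketch of the two conversions follows; the function names are hypothetical and chosen only for this illustration.

```python
def odds_to_probability(odds):
    """Invert odds = p / (1 - p) to recover the probability p."""
    return odds / (1 + odds)


def probability_to_odds(p):
    """Odds corresponding to a probability p, with 0 <= p < 1."""
    return p / (1 - p)


# Question 9(a): odds of 0.37 of defaulting.
print(round(odds_to_probability(0.37), 2))
# Question 9(b): a 16% chance of defaulting.
print(round(probability_to_odds(0.16), 2))
```

Under this definition, odds of 0.37 correspond to a default probability of about 0.27, and a 16% chance of default corresponds to odds of about 0.19.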