From a817d948c98b44bcc12d11d000cd9d7507bf9066 Mon Sep 17 00:00:00 2001 From: caes Date: Wed, 8 Feb 2017 02:20:48 -0500 Subject: [PATCH] added hw3 answers --- ISLR Sixth Printing.pdf | Bin hw1/.RData | Bin hw1/.Rhistory | 0 hw1/Auto.data | 0 hw1/answers | 0 hw1/assigned | 0 hw1/auto_pairs.png | Bin hw2/.RData | Bin hw2/.Rhistory | 0 hw2/answers | 0 hw3/answers | 117 ++++++++++++++++++++++++++++++++++++++++ lab2/.RData | Bin lab2/.Rhistory | 0 lab2/Figure.pdf | Bin lab2/lab.r | 0 lab2/program | 0 lab2/text | 0 notes | 0 project/ideas | 0 usingR.pdf | Bin 20 files changed, 117 insertions(+) mode change 100644 => 100755 ISLR Sixth Printing.pdf mode change 100644 => 100755 hw1/.RData mode change 100644 => 100755 hw1/.Rhistory mode change 100644 => 100755 hw1/Auto.data mode change 100644 => 100755 hw1/answers mode change 100644 => 100755 hw1/assigned mode change 100644 => 100755 hw1/auto_pairs.png mode change 100644 => 100755 hw2/.RData mode change 100644 => 100755 hw2/.Rhistory mode change 100644 => 100755 hw2/answers create mode 100644 hw3/answers mode change 100644 => 100755 lab2/.RData mode change 100644 => 100755 lab2/.Rhistory mode change 100644 => 100755 lab2/Figure.pdf mode change 100644 => 100755 lab2/lab.r mode change 100644 => 100755 lab2/program mode change 100644 => 100755 lab2/text mode change 100644 => 100755 notes mode change 100644 => 100755 project/ideas mode change 100644 => 100755 usingR.pdf diff --git a/ISLR Sixth Printing.pdf b/ISLR Sixth Printing.pdf old mode 100644 new mode 100755 diff --git a/hw1/.RData b/hw1/.RData old mode 100644 new mode 100755 diff --git a/hw1/.Rhistory b/hw1/.Rhistory old mode 100644 new mode 100755 diff --git a/hw1/Auto.data b/hw1/Auto.data old mode 100644 new mode 100755 diff --git a/hw1/answers b/hw1/answers old mode 100644 new mode 100755 diff --git a/hw1/assigned b/hw1/assigned old mode 100644 new mode 100755 diff --git a/hw1/auto_pairs.png b/hw1/auto_pairs.png old mode 100644 new mode 100755 diff --git a/hw2/.RData 
b/hw2/.RData old mode 100644 new mode 100755 diff --git a/hw2/.Rhistory b/hw2/.Rhistory old mode 100644 new mode 100755 diff --git a/hw2/answers b/hw2/answers old mode 100644 new mode 100755 diff --git a/hw3/answers b/hw3/answers new file mode 100644 index 0000000..666e2be --- /dev/null +++ b/hw3/answers @@ -0,0 +1,117 @@ +Part B: Choose one of Questions 10 or 11 + +5. We now examine the differences between LDA and QDA. + + (a) If the Bayes decision boundary is linear, do we expect LDA + or QDA to perform better on the training set? On the test set? + + The QDA has more flexibility, so it will match the training + set more closely than the LDA. The LDA will perform better + on the test set because the real relationship is linear, so + the QDA adds variance without reducing bias. + + + (b) If the Bayes decision boundary is non-linear, do we expect + LDA or QDA to perform better on the training set? On the test + set? + + QDA will still perform better on the training set, and now + should also perform better than LDA on the test set, since + QDA can capture the curvature in the real relationship + that a linear boundary misses. + + + (c) In general, as the sample size n increases, do we expect the + test prediction accuracy of QDA relative to LDA to improve, + decline, or be unchanged? Why? + + Improve. The LDA has an advantage when the + training set is small because it is less sensitive to the + fluctuations of those few data. As the size of the training + set grows, the QDA can estimate its extra parameters + reliably, and assuming the real relationship is not linear, the + QDA should eventually out-perform the LDA. + + + (d) True or False: Even if the Bayes decision boundary for a + given problem is linear, we will probably achieve a superior + test error rate using QDA rather than LDA because QDA is + flexible enough to model a linear decision boundary. Justify + your answer. + + False.
The QDA will likely have higher variance because it will fit + noise in training data that doesn't represent the + relationship that will be observed in test data. + + +8. Suppose that we take a data set, divide it into equally-sized + training and test sets, and then try out two different + classification procedures. First we use logistic regression and + get an error rate of 20 % on the training data and 30 % on the + test data. Next we use 1-nearest neighbors (i.e. K = 1) and get + an average error rate (averaged over both test and training data + sets) of 18 %. Based on these results, which method should we + prefer to use for classification of new observations? Why? + + Logistic regression. With K = 1, each training observation + is its own nearest neighbor, so the training error of the + nearest-neighbor method is 0 %. Because the training and + test sets are the same size, an average error rate of 18 % + means the nearest-neighbor test error is 2 * 18 % - 0 % = + 36 %. The logistic regression has a test error of 30 %, + which is lower. Test error, not training or averaged error, + is what estimates performance on new observations, so the + logistic regression should be preferred. The low average + for the nearest-neighbor method comes entirely from + memorizing the training data, which is a sign of + overfitting rather than evidence that it models the real + relationship better. + + +9. This problem has to do with odds. + + (a) On average, what fraction of people with an odds of 0.37 of + defaulting on their credit card payment will in fact default? + + (b) Suppose that an individual has a 16 % chance of defaulting + on her credit card payment. What are the odds that she will + default? + + + +11. In this problem, you will develop a model to predict whether a + given car gets high or low gas mileage based on the Auto data + set.
+ + (a) Create a binary variable, mpg01, that contains a 1 if mpg + contains a value above its median, and a 0 if mpg contains a + value below its median. You can compute the median using the + median() function. Note you may find it helpful to use the + data.frame() function to create a single data set containing + both mpg01 and the other Auto variables. + + (b) Explore the data graphically in order to investigate the + association between mpg01 and the other features. Which of the + other features seem most likely to be useful in predicting + mpg01? Scatterplots and boxplots may be useful tools to answer this + question. Describe your findings. + + (c) Split the data into a training set and a test set. + + (d) Perform LDA on the training data in order to predict mpg01 + using the variables that seemed most associated with mpg01 in + (b). What is the test error of the model obtained? + + (e) Perform QDA on the training data in order to predict mpg01 + using the variables that seemed most associated with mpg01 in + (b). What is the test error of the model obtained? + + (f) Perform logistic regression on the training data in order to + predict mpg01 using the variables that seemed most associated + with mpg01 in (b). What is the test error of the model obtained? + + (g) Perform KNN on the training data, with several values of K, + in order to predict mpg01. Use only the variables that seemed + most associated with mpg01 in (b). What test errors do you + obtain? Which value of K seems to perform the best on this data + set?
\ No newline at end of file diff --git a/lab2/.RData b/lab2/.RData old mode 100644 new mode 100755 diff --git a/lab2/.Rhistory b/lab2/.Rhistory old mode 100644 new mode 100755 diff --git a/lab2/Figure.pdf b/lab2/Figure.pdf old mode 100644 new mode 100755 diff --git a/lab2/lab.r b/lab2/lab.r old mode 100644 new mode 100755 diff --git a/lab2/program b/lab2/program old mode 100644 new mode 100755 diff --git a/lab2/text b/lab2/text old mode 100644 new mode 100755 diff --git a/notes b/notes old mode 100644 new mode 100755 diff --git a/project/ideas b/project/ideas old mode 100644 new mode 100755 diff --git a/usingR.pdf b/usingR.pdf old mode 100644 new mode 100755
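Question 9 in hw3/answers is left unanswered in this patch. As a hedged sketch rather than the author's answer, the conversion it asks for follows from the standard definition of odds as the ratio of the probability of an event to the probability of its complement. The symbol p (probability of default) is introduced here only for illustration:

```latex
\[
\text{odds} = \frac{p}{1-p}
\quad\Longleftrightarrow\quad
p = \frac{\text{odds}}{1+\text{odds}}
\]
\[
\text{(a)}\quad p = \frac{0.37}{1 + 0.37} = \frac{0.37}{1.37} \approx 0.27
\qquad
\text{(b)}\quad \text{odds} = \frac{0.16}{1 - 0.16} = \frac{0.16}{0.84} \approx 0.19
\]
```

Under that definition, roughly 27 % of people with odds of 0.37 would be expected to default, and a 16 % chance of default corresponds to odds of about 0.19.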