You will need to load the core library for the course textbook. We will also be using various random number generation functions in today’s seminar, so it would also be worthwhile setting the seed as shown in the code block below.

library(ISLR)
set.seed(12345)

Exercise 7.1 - Leave One Out Cross Validation

In this exercise, we will be using the Weekly dataset, which can be loaded with the following code:

data("Weekly")

This data includes weekly percentage returns for the S&P 500 stock index between 1990 and 2010. To find out more about the variables included in this data, you can look at the help file, which can be accessed using ?Weekly. We will be interested in predicting the Direction variable, which indicates whether the S&P 500 went up or down in a given week.
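Before summarising, it can be useful to glance at the raw data as well as the help file; a couple of standard commands for doing so are shown below (their output is omitted here).

# Open the help page, check the dimensions, and preview the first few rows
?Weekly
dim(Weekly)
head(Weekly)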

  (a) Start by using the summary() command to explore the data. In how many weeks did the market move up?
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

The market moved up in 605 weeks, and down in 484.

  (b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2. Recall that to estimate a logistic regression, you will need to use the glm() function, with family = binomial as an argument to that function. Summarise the results. Is either of the two lag variables significant?
glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                data = Weekly, 
                family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.623  -1.261   1.001   1.083   1.506  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4
  (c) Use the regression that you just estimated to calculate the predicted probability of the outcome for every observation in the data. To do so, you can simply use the predict() function on the estimated model object, with type = "response". This will produce a vector of probabilities which will be larger when an observation is more likely to have moved up rather than down, given its covariate values. For how many observations is the probability of the S&P going up greater than .5?
pred_probs <- predict(glm.fit, type = "response")

sum(pred_probs > .5)
## [1] 1013
table(pred_probs > .5)
## 
## FALSE  TRUE 
##    76  1013

The model predicts upward market movement in 1013 out of 1089 observations.

  (d) Create a new vector which is equal to "Up" if the model predicts upward movement in the stock market and "Down" if the model predicts downward movement in the stock market (i.e. use the vector of probabilities you created in part (c) and code which observations are greater than .5). Use this new variable to construct a confusion matrix with the original outcome variable, Direction. How many false positives are there? How many false negatives are there? What is the training set error rate?

N.b. Remember that, in the context of classification problems, such as the one considered here, the error rate is \(\frac{1}{n}\sum_{i=1}^{n} Err_i\). That is, for each observation, we predict the outcome and compare it to the observed outcome for that observation. If the predicted outcome is different from the actual outcome, then \(Err_i = 1\); otherwise it is equal to 0. The error rate is then just the average of those errors across all observations.

pred_outcome <- ifelse(pred_probs > .5,"Up", "Down")

table(pred_outcome, Weekly$Direction)
##             
## pred_outcome Down  Up
##         Down   38  38
##         Up    446 567
mean(pred_outcome != Weekly$Direction)
## [1] 0.4444444

There are 446 false positives and 38 false negatives. The training set error rate is approximately 44%.
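If you prefer to pull these numbers out of the confusion matrix programmatically rather than reading them off by eye, a minimal sketch is below (conf_mat is a new object name introduced here just for illustration).

# Store the confusion matrix and extract the off-diagonal cells
conf_mat <- table(pred_outcome, Weekly$Direction)
conf_mat["Up", "Down"]    # false positives: predicted Up, actually Down
conf_mat["Down", "Up"]    # false negatives: predicted Down, actually Up
(conf_mat["Up", "Down"] + conf_mat["Down", "Up"]) / sum(conf_mat)  # error rate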

  (e) As we discussed in the lecture, the training set error rate may underestimate the error rate for the test set. We will now use leave-one-out cross-validation to estimate the test-set error rate. In the textbook sections for this topic (particularly sections 5.3.2 and 5.3.3 in James et al.), you will see that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one can compute those quantities using just the glm() and predict() functions, and a for loop. Although it can be more laborious to do this using a loop, it also helps to clarify what is going on in this process!

Start by re-fitting the logistic regression model that you fit in answer to part (b) (predicting Direction using Lag1 and Lag2), but here use all observations except the first observation. Remember, you can subset a data.frame by using the square brackets. So, here, you will want something like my_data[-1,]. How do the coefficients you estimate in this model differ from those in the model in part (b)?

glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                data = Weekly[-1,], 
                family = binomial)

summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6258  -1.2617   0.9999   1.0819   1.5071  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22324    0.06150   3.630 0.000283 ***
## Lag1        -0.03843    0.02622  -1.466 0.142683    
## Lag2         0.06085    0.02656   2.291 0.021971 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1494.6  on 1087  degrees of freedom
## Residual deviance: 1486.5  on 1085  degrees of freedom
## AIC: 1492.5
## 
## Number of Fisher Scoring iterations: 4

They are almost identical, because you are using exactly the same model and the same data except for the omission of the first observation!
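If you saved the full-data fit from part (b) under a different name (glm.full is a hypothetical name used here purely for illustration), you could put the two sets of coefficients side by side:

# Refit on the full data and compare coefficients with the leave-one-out fit
glm.full <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
round(cbind(full_data = coef(glm.full), drop_first = coef(glm.fit)), 5)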

  (f) Use the model from (e) to predict the direction of only the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Remember that to calculate predicted probabilities (rather than predicted log-odds) you will need to add type = "response" as an argument to the predict() function. Compare your prediction to the true value of Direction for the first observation. Was this observation correctly classified?
predict(glm.fit, Weekly[1,], type="response") > 0.5
##    1 
## TRUE
Weekly[1,]$Direction
## [1] Down
## Levels: Down Up

The prediction was Up, but the true Direction was Down, so this observation was misclassified.

  (g) To estimate the LOOCV error, we now need to replicate the process you followed above for every observation in the dataset. Fortunately, rather than writing hundreds of lines of code, we can accomplish this using a for loop. Write a for loop that loops from i=1 to i=n, where n is the number of observations in the data set (i.e. you could use nrow(Weekly)). At each iteration, your loop should perform each of the following steps:

    1. Fit a logistic regression model using all but the i-th observation to predict Direction using Lag1 and Lag2 (i.e. as in your answer to (e) above).
    2. Calculate the probability of the market moving up for the i-th observation (this requires using the same code as you used for your answer to part (f) above.).
    3. Use the probability for the i-th observation in order to predict whether or not the market moves up (again, as you did in part (f)).
    4. Determine whether or not an error was made in predicting the direction for the i-th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0. Make sure that you store these results!

The code provided below can be used to help with this question:

# Set up a vector in which you can store the errors for each observation
errors <- rep(0, nrow(Weekly))

# Loop over each observation in the data
for (i in 1:nrow(Weekly)) {
  
  # Fit the logistic regression model
   ## YOUR CODE GOES HERE
   
   # Calculate the predicted probability, and check whether this is greater than .5
     ## YOUR CODE GOES HERE
   
   # Calculate the true outcome for the i-th observation
    ## YOUR CODE GOES HERE
   
   # Assign the errors vector to be equal to 1 if a prediction error was made (the vector is all 0s otherwise)
    ## YOUR CODE GOES HERE
}

# Count up the number of errors made
 ## YOUR CODE GOES HERE

One possible solution is given below:

# Set up a vector in which you can store the errors for each observation
errors <- rep(0, nrow(Weekly))

# Loop over each observation in the data
for (i in 1:nrow(Weekly)) {
  
  # Fit the logistic regression model
   glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                   data = Weekly[-i,], 
                   family = binomial)
   
   # Calculate the predicted probability, and check whether this is greater than .5
   is_up <-  predict(glm.fit, Weekly[i,], type="response") > 0.5
   
   # Calculate the true outcome for the i-th observation
   is_true_up <-  Weekly[i,]$Direction == "Up"
   
   # Assign the errors vector to be equal to 1 if a prediction error was made (the vector is all 0s otherwise)
   if (is_up != is_true_up) errors[i] <-  1
}

# Count up the number of errors made
sum(errors)
## [1] 490

There are 490 errors.

  (h) Take the average of the n numbers obtained from the loop above in order to calculate the LOOCV estimate for the test error. Comment on the results. How does this compare to the estimate of the training set error calculated in part (d)?
mean(errors)
## [1] 0.4499541

The LOOCV estimates a test error rate of 45%. This is very similar to the training set error rate we calculated in part (d), though that will not always be the case.
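As noted in part (e), the same LOOCV estimate can also be obtained with the cv.glm() function from the boot package (which we use in the next exercise). A minimal sketch is shown below: it refits the full-data model from part (b), defines the classification cost function used later in Exercise 7.2, and omits the K argument so that cv.glm() defaults to leave-one-out. It should give the same answer as the loop, and will take roughly as long to run.

library(boot)

# Classification error: r is the observed 0/1 outcome, pi the predicted probability
cost <- function(r, pi) mean(abs(r - pi) > 0.5)

glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
loocv_est <- cv.glm(Weekly, glm.fit, cost = cost)
loocv_est$delta[1]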

Exercise 7.2 - K-Fold Cross Validation

Running the leave-one-out procedure here does not take too long because the Weekly dataset has only 1089 observations, and this is a simple logistic regression model. However, LOOCV is often too computationally burdensome to be useful. In those cases, K-fold cross-validation is normally used instead. Though it is possible to code cross-validation manually, using only the functions we used above, in this case it will suffice to use some functions designed for the purpose.
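To make clear what K-fold cross-validation actually does, here is a rough manual sketch using only glm(), predict(), and a for loop; the seed, the number of folds, and the object names (folds, fold_errors, fold_pred) are illustrative choices rather than part of the exercise.

set.seed(123)
k <- 10

# Randomly assign each observation to one of the k folds
folds <- sample(rep(1:k, length.out = nrow(Weekly)))
fold_errors <- rep(NA, k)

for (j in 1:k) {
  # Fit on all folds except fold j, then predict for the observations in fold j
  fit <- glm(Direction ~ Lag1 + Lag2,
             data = Weekly[folds != j, ],
             family = binomial)
  fold_pred <- ifelse(predict(fit, Weekly[folds == j, ], type = "response") > 0.5,
                      "Up", "Down")
  fold_errors[j] <- mean(fold_pred != Weekly$Direction[folds == j])
}

# Average the fold-specific error rates to get the K-fold estimate of the test error
mean(fold_errors)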

  (a) First, load the boot library (you may need to install.packages("boot") first):
library(boot)
  (b) Now fit the same logistic regression model as in part (b) of the previous question, i.e. on the whole data:
glm_fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
  (c) Now use the cv.glm() function from the boot package to estimate the cross-validation error. Look at the help file for this function and you will see that you need to provide 4 arguments to make the function work: i) the data for estimating the model; ii) the estimated model object from part (b) above; iii) a “cost” function, which we have copied for you below (this just tells the function that you want to do cross-validation for a logit model); and finally, iv) a value for K, the number of folds to use. Use this function to implement 10-fold cross-validation and report the cross-validation error (this information is stored in the first element of the delta component of the estimated model object). How does this estimate compare to the LOOCV error reported in the question above?
# Cost function for a logit model: r is the observed outcome (0/1) and pi the
# predicted probability, so this returns the proportion of misclassified observations
cost <- function(r, pi) mean(abs(r-pi)> 0.5)
cv_10_fold_logistic <-  cv.glm(Weekly, glm_fit, K= 10, cost = cost)
cv_error_10_fold_logistic <- cv_10_fold_logistic$delta[1]
cv_error_10_fold_logistic
## [1] 0.4536272

The answer is almost exactly the same as the LOOCV, but the function takes less time to complete than the for loop did!

  (d) Experiment with some different values for K in the cv.glm() function. How does this affect the cross-validation error?
cv_5_fold_logistic <-  cv.glm(Weekly, glm_fit, K= 5, cost = cost)
cv_error_5_fold_logistic <- cv_5_fold_logistic$delta[1]
cv_error_5_fold_logistic
## [1] 0.4481175
cv_20_fold_logistic <-  cv.glm(Weekly, glm_fit, K= 20, cost = cost)
cv_error_20_fold_logistic <- cv_20_fold_logistic$delta[1]
cv_error_20_fold_logistic
## [1] 0.446281

The results are largely insensitive to the choice of K. In practice, the choice of K is usually determined by how long the model takes to estimate: if you have a very long-running model, then a lower value for K is normally more practical.
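One way to see the computational trade-off directly is to time the calls with system.time(); a small sketch is below (no output is shown, as timings depend on your machine).

# Compare how long different choices of K take
system.time(cv.glm(Weekly, glm_fit, K = 5, cost = cost))
system.time(cv.glm(Weekly, glm_fit, K = 50, cost = cost))

# Omitting K gives leave-one-out, which is the slowest option here
system.time(cv.glm(Weekly, glm_fit, cost = cost))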

Exercise 7.3 - Lasso and ridge regression

In this exercise, we will predict the number of applications (Apps) received by US colleges using the other variables in the College data set. Again, if you would like to read more about the variables in the data, just type ?College.

  (a) Fit a linear model with Apps as the outcome, and all other variables in the data as predictors. Report the estimated coefficients from the model.
lm.fit <-  lm(Apps ~ . , data = College)
coef(lm.fit)
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -445.08412616 -494.14896913    1.58580720   -0.88069035   49.92627615 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -14.23447911    0.05739022    0.04444654   -0.08587004    0.15102708 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.02089712    0.03110491   -8.67849712   -3.33065614   15.38960682 
##   perc.alumni        Expend     Grad.Rate 
##    0.17866507    0.07789731    8.66762545
  (b) Using the same outcome and the same predictors, fit a ridge regression model on the College data. To get you started, the code below creates a matrix of predictors using the model.matrix() function, which makes working with glmnet() a little more straightforward.
library(glmnet) # Load the relevant package for estimating ridge and lasso regressions
## Loading required package: Matrix
## Loaded glmnet 4.0-2
train.mat <-  model.matrix(Apps ~ . , data = College)[,-1]

To estimate the ridge regression, you will need to use the glmnet() function from the glmnet package. This function takes four main arguments:

  1. x - a matrix of predictor variables
  2. y - a vector of outcomes
  3. alpha - a parameter determining whether you want to run a ridge regression (alpha = 0) or a lasso regression (alpha = 1)
  4. lambda - the penalty parameter applied to the ridge regression

For the purposes of this exercise, use train.mat for x, and the outcome from the College data for y. Set alpha = 0 and lambda = 1000 (a very large penalty parameter). Report the coefficients (coef()) estimated by this model. How do they compare to the coefficients that you estimated using OLS in part (a) above?

ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 1000)
coef(ridge_mod)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -1.983738e+03
## PrivateYes  -5.498690e+02
## Accept       7.233613e-01
## Enroll       7.172059e-01
## Top10perc    1.825737e+01
## Top25perc    5.311974e+00
## F.Undergrad  1.142266e-01
## P.Undergrad  4.288358e-02
## Outstate     8.359463e-04
## Room.Board   1.994826e-01
## Books        2.177900e-01
## Personal    -2.429920e-03
## PhD         -5.943560e-01
## Terminal    -2.261802e+00
## S.F.Ratio    1.126691e+01
## perc.alumni -1.052872e+01
## Expend       6.608624e-02
## Grad.Rate    1.188523e+01
# Presenting these nicely next to the OLS estimates
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))),2)
##                 ols    ridge
## (Intercept) -445.08 -1983.74
## PrivateYes  -494.15  -549.87
## Accept         1.59     0.72
## Enroll        -0.88     0.72
## Top10perc     49.93    18.26
## Top25perc    -14.23     5.31
## F.Undergrad    0.06     0.11
## P.Undergrad    0.04     0.04
## Outstate      -0.09     0.00
## Room.Board     0.15     0.20
## Books          0.02     0.22
## Personal       0.03     0.00
## PhD           -8.68    -0.59
## Terminal      -3.33    -2.26
## S.F.Ratio     15.39    11.27
## perc.alumni    0.18   -10.53
## Expend         0.08     0.07
## Grad.Rate      8.67    11.89

As expected, the ridge regression shrinks the size of almost all the coefficients relative to OLS.
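Incidentally, glmnet() can also fit the model across a whole sequence of lambda values in one call, which makes this kind of comparison easier; a minimal sketch follows (the grid used here is an arbitrary illustrative choice, and coefficients for values of s not on the grid are interpolated).

# Fit ridge regressions for a whole grid of penalty values in a single call
lambda_grid <- 10^seq(4, -2, length.out = 100)
ridge_path <- glmnet(train.mat, College$Apps, alpha = 0, lambda = lambda_grid)

# Extract the coefficients at two particular penalty values from the fitted path
round(cbind(coef(ridge_path, s = 1000), coef(ridge_path, s = 100)), 2)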

  (c) Experiment by lowering the value of the lambda parameter. How does this affect the estimated coefficients? What happens when lambda = 0?
ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 100)
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))),2)
##                 ols   ridge
## (Intercept) -445.08 -914.82
## PrivateYes  -494.15 -514.20
## Accept         1.59    1.32
## Enroll        -0.88   -0.12
## Top10perc     49.93   37.83
## Top25perc    -14.23   -6.65
## F.Undergrad    0.06    0.04
## P.Undergrad    0.04    0.03
## Outstate      -0.09   -0.06
## Room.Board     0.15    0.18
## Books          0.02    0.06
## Personal       0.03    0.01
## PhD           -8.68   -6.76
## Terminal      -3.33   -4.55
## S.F.Ratio     15.39   14.93
## perc.alumni    0.18   -4.19
## Expend         0.08    0.08
## Grad.Rate      8.67    9.79
ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 0)
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))),2)
##                 ols   ridge
## (Intercept) -445.08 -450.43
## PrivateYes  -494.15 -494.51
## Accept         1.59    1.58
## Enroll        -0.88   -0.86
## Top10perc     49.93   49.81
## Top25perc    -14.23  -14.15
## F.Undergrad    0.06    0.06
## P.Undergrad    0.04    0.04
## Outstate      -0.09   -0.09
## Room.Board     0.15    0.15
## Books          0.02    0.02
## Personal       0.03    0.03
## PhD           -8.68   -8.68
## Terminal      -3.33   -3.33
## S.F.Ratio     15.39   15.41
## perc.alumni    0.18    0.12
## Expend         0.08    0.08
## Grad.Rate      8.67    8.67

As we reduce the value of lambda, the ridge coefficients get closer to the OLS coefficients. At lambda = 0, they are essentially the same (the small differences that are still present are due to small implementation differences between lm() and glmnet()).

  (d) To estimate a lasso regression, you can use exactly the same code as for the ridge regression, only changing the alpha parameter to be equal to 1 rather than 0. Try this now, and compare the OLS coefficients to the lasso coefficients when lambda = 20.
lasso_mod <- glmnet(train.mat, College$Apps, alpha = 1, lambda = 20)
coef(lasso_mod)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -585.76484088
## PrivateYes  -430.96055785
## Accept         1.46383398
## Enroll        -0.22858773
## Top10perc     35.00907616
## Top25perc     -3.36767932
## F.Undergrad    .         
## P.Undergrad    0.02373731
## Outstate      -0.06033567
## Room.Board     0.12680896
## Books          .         
## Personal       .         
## PhD           -5.84465314
## Terminal      -3.23247018
## S.F.Ratio      5.27541830
## perc.alumni   -0.94416051
## Expend         0.07026571
## Grad.Rate      5.41871976
# Presenting these nicely next to the OLS estimates
round(data.frame(ols = coef(lm.fit), lasso = as.numeric(coef(lasso_mod))),2)
##                 ols   lasso
## (Intercept) -445.08 -585.76
## PrivateYes  -494.15 -430.96
## Accept         1.59    1.46
## Enroll        -0.88   -0.23
## Top10perc     49.93   35.01
## Top25perc    -14.23   -3.37
## F.Undergrad    0.06    0.00
## P.Undergrad    0.04    0.02
## Outstate      -0.09   -0.06
## Room.Board     0.15    0.13
## Books          0.02    0.00
## Personal       0.03    0.00
## PhD           -8.68   -5.84
## Terminal      -3.33   -3.23
## S.F.Ratio     15.39    5.28
## perc.alumni    0.18   -0.94
## Expend         0.08    0.07
## Grad.Rate      8.67    5.42

The lasso shrinks the coefficients in a similar way to the ridge regression, but here many of the coefficients are shrunk all the way to zero. This illustrates a key difference between the lasso and ridge approaches.
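You can see this sparsity pattern directly by counting how many coefficients are set exactly to zero at different penalty values; a short sketch (with arbitrary illustrative lambda values) is below.

# Count how many coefficients the lasso sets exactly to zero at different penalties
for (l in c(1, 10, 100, 1000)) {
  lasso_l <- glmnet(train.mat, College$Apps, alpha = 1, lambda = l)
  cat("lambda =", l, "->", sum(as.numeric(coef(lasso_l)) == 0), "zero coefficients\n")
}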

Exercise 7.4 - Cross-validation for lasso and ridge regression (hard question)

  (a) In this question we will use the validation set approach in order to estimate the test error rate using lasso and ridge regression. First, split the data set into a training set and a validation set. You may find the sample() function helpful for selecting observations at random to include in the two sets.
library(ISLR)

set.seed(11)
train.size <-  nrow(College) / 2
train <-  sample(1:nrow(College), train.size)
College.train <-  College[train, ]
College.test <-  College[-train, ]
  (b) Use least squares to estimate the model on the training set, and then report the test error obtained for the validation set. To calculate the test error, you will need to use the formula \(\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\), where \(y_i\) is the observed value of the outcome (Apps) for observation \(i\) and \(\hat{y}_i\) is the fitted value for observation \(i\) from the regression model.
lm.fit <-  lm(Apps ~ . , data = College.train)
lm.pred <-  predict(lm.fit, College.test)
mean((College.test$Apps - lm.pred)^2)
## [1] 1026096

The test-set MSE for the linear model is 1026096.

  (c) Using the same outcome and the same predictors, fit a ridge regression model on the training data that you just created. Remember, you will have to convert the training data predictors using the model.matrix() function, as in part (b) of the previous question, in order for the glmnet() function to work. Use the predict() function to calculate fitted values for the test set by setting the newx argument to be equal to your test data. Report the test set MSE from this regression. How does it compare to the test set error you calculated in part (b)?

(For the purposes of this question, set lambda = 0.1. Or, if you are feeling adventurous, follow the instructions on pages 253-254 of the James et al. textbook to see how to use cross-validation for the ridge regression! A sketch of that approach is also given after the solution below.)

train.mat <-  model.matrix(Apps ~ . , data = College.train)[,-1]
test.mat <-  model.matrix(Apps ~ . , data = College.test)[,-1]
ridge_mod <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = 0.1)
ridge_pred <- predict(ridge_mod,  newx = test.mat)
mean((College.test$Apps - ridge_pred)^2)
## [1] 1024134

The test-set MSE is marginally lower than that of OLS, at 1024134.
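For the adventurous route mentioned in the question, the cv.glmnet() function will choose lambda by cross-validation on the training data. A minimal sketch is shown below; the object names (cv_ridge, best_lambda, ridge_pred_cv) are illustrative, and the resulting MSE is not reported here because it depends on the random fold assignment.

# Choose lambda by cross-validation on the training data (10-fold by default)
cv_ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0)
best_lambda <- cv_ridge$lambda.min

# Predict on the test set using the cross-validated lambda and compute the MSE
ridge_pred_cv <- predict(cv_ridge, newx = test.mat, s = best_lambda)
mean((College.test$Apps - ridge_pred_cv)^2)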

  (d) Repeat the process you followed in part (c), but this time using a lasso regression (remember, a lasso is estimated by setting alpha = 1 in glmnet()). Again, set lambda = 0.1. Report the test error obtained. How does this compare to the test set errors for the OLS and ridge models? (Again, further details for implementing the lasso model can be found on page 255 of the textbook.)

(If you take the cross-validation route here, you would again pick \(\lambda\) using College.train and report the error on College.test.)

lasso_mod <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = 0.1)
lasso_pred <- predict(lasso_mod,  newx = test.mat)
mean((College.test$Apps - lasso_pred)^2)
## [1] 1023823

Again, the test-set MSE is marginally lower than for both the OLS and ridge models, at 1023823.
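If you want the comparison in one place, the three test-set MSEs computed above can be collected into a single table:

# Collect the three test-set MSEs computed above for a side-by-side comparison
data.frame(model = c("OLS", "Ridge", "Lasso"),
           test_mse = c(mean((College.test$Apps - lm.pred)^2),
                        mean((College.test$Apps - ridge_pred)^2),
                        mean((College.test$Apps - lasso_pred)^2)))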