You will need to load the core library for the course textbook. We will also be using various random number generation functions in today’s seminar, so it would also be worthwhile setting the seed as shown in the code block below.
library(ISLR)
set.seed(12345)
In this exercise, we will be using the Weekly dataset, which can be loaded with the following code:
data("Weekly")This data includes weekly percentage returns for the S&P 500
stock index between 1990 and 2010. To find out more about the variables
included in this data, you can look at the help file, which can be
accessed using ?Weekly. We will be interested in predicting
the Direction variable, which indicates whether the S&P
500 went up or down in a given week.
Use the summary() command to explore the data. In how many weeks did the market move up?
summary(Weekly$Direction)
## Down   Up 
##  484  605
The market moved up in 605 of 1089 weeks.
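Equivalently, you can compute the proportion of weeks in which the market moved up directly (605/1089, roughly 0.56); a quick optional check:
# Proportion of "Up" weeks
mean(Weekly$Direction == "Up")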
Fit a logistic regression model predicting Direction using Lag1 and Lag2. Recall that to estimate a logistic regression, you will need to use the glm() function, with family = binomial as an argument to that function. Summarise the results. Are either of the two lag variables significant?
glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                data = Weekly, 
                family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.623  -1.261   1.001   1.083   1.506  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4
There is some evidence that the return from two weeks previously (Lag2) positively predicts upward movement in the current week, conditional on the return from the previous week; Lag2 is significant at the 5% level, while Lag1 is not statistically significant.
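If you want a sense of the size of the Lag2 effect, one option (not required by the question) is to exponentiate the coefficient to express it as an odds ratio:
# A one-unit increase in Lag2 multiplies the odds of an "Up" week by exp(0.060), roughly 1.06
exp(coef(glm.fit)["Lag2"])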
Use the predict() function on the estimated model object, with type = "response". This will produce a vector of probabilities which will be larger when an observation is more likely to have moved up rather than down, given its covariate values. For how many observations is the probability of the S&P going up greater than .5?
pred_probs <- predict(glm.fit, type = "response")
sum(pred_probs > .5)
## [1] 1013
table(pred_probs > .5)
## 
## FALSE  TRUE 
##    76  1013
The model predicts upward market movement in 1013 out of 1089 observations.
"Up" if the model
predicts upward movement in the stock market and "Down" if
the model predicts downward movement in the stock market (i.e. use the
vector of probabilities you created in part (c) and code which
observations are greater than .5). Use this new variable to construct a
confusion matrix with the original outcome variable,
Direction. How many false positives are there? How many
false negatives are there? What is the training set error rate?N.b. Remember that, in the context of classification problems, such as the one considered here, the error rate is \(\frac{1}{n}\sum_{i=1}^N Err_i\). That is, for each observation, we predict the outcome and compare it to the observed outcome for that observation. If the predicted outcome is different from the actual outcome, then \(Err_i = 1\), otherwise it is equal to 0. The error-rate is then just the average of those errors across all observations.
pred_outcome <- ifelse(pred_probs > .5,"Up", "Down")
table(pred_outcome, Weekly$Direction)##             
## pred_outcome Down  Up
##         Down   38  38
##         Up    446 567mean(pred_outcome != Weekly$Direction)## [1] 0.4444444There are 446 false positives and 38 false negatives. The training set error rate is 44%.
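As a quick check, the same error rate can be recovered directly from the confusion matrix:
# (false positives + false negatives) / total number of weeks
(446 + 38) / nrow(Weekly)  # 484 / 1089, roughly 0.444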
The cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one can compute those quantities using just the glm() and predict() functions, and a for loop. Although it can be more laborious to do this using a loop, it also helps to clarify what is going on in this process!
Start by re-fitting the logistic regression model that you fit in answer to part (b) (predicting Direction using Lag1 and Lag2), but here use all observations except the first observation. Remember, you can subset a data.frame by using the square brackets. So, here, you will want something like my_data[-1,]. How do the coefficients you estimate in this model differ from those in the model in part (b)?
glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                data = Weekly[-1,], 
                family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6258  -1.2617   0.9999   1.0819   1.5071  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22324    0.06150   3.630 0.000283 ***
## Lag1        -0.03843    0.02622  -1.466 0.142683    
## Lag2         0.06085    0.02656   2.291 0.021971 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1494.6  on 1087  degrees of freedom
## Residual deviance: 1486.5  on 1085  degrees of freedom
## AIC: 1492.5
## 
## Number of Fisher Scoring iterations: 4
They are almost identical, because you are using exactly the same model and the same data, except for the omission of the first observation!
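To see just how small the differences are, you could re-fit the full-data model from part (b) under a different name (glm.fit.full here is just an illustrative name, since glm.fit has now been overwritten) and print the two coefficient vectors side by side:
glm.fit.full <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
# Coefficients from the full data next to those from the data without observation 1
round(cbind(all_obs = coef(glm.fit.full), drop_first = coef(glm.fit)), 4)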
Use the model you have just estimated to predict the market direction for the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Remember that to calculate predicted probabilities (rather than predicted log-odds) you will need to add type = "response" as an argument to the predict() function. Compare your prediction to the true value of Direction for the first observation. Was this observation correctly classified?
predict(glm.fit, Weekly[1,], type="response") > 0.5
##    1 
## TRUE
Weekly[1,]$Direction
## [1] Down
## Levels: Down Up
The prediction was "Up", but the true Direction was "Down", so this observation was misclassified.
To estimate the LOOCV error, we need to now replicate the process
you followed above for every observation in the dataset.
Fortunately, rather than writing hundreds of lines of code, we can
accomplish this using a for loop. Write a for loop that loops from i=1 to i=n, where n is the number of observations in the data set (i.e. you could use nrow(Weekly)). At each iteration, your loop should perform each of the following steps:
- Fit a logistic regression model predicting Direction using Lag1 and Lag2, using all observations except the i-th (i.e. as in your answer to (e) above).
- Calculate the predicted probability for the i-th observation and check whether it is greater than .5.
- Compare that prediction to the true outcome for the i-th observation, and record whether a prediction error was made.
The code provided below can be used to help with this question:
# Set up a vector in which you can store the errors for each observation
errors <- rep(0, nrow(Weekly))
# Loop over each observation in the data
for (i in 1:nrow(Weekly)) {
  
  # Fit the logistic regression model
   ## YOUR CODE GOES HERE
   
   # Calculate the predicted probability, and check whether this is greater than .5
     ## YOUR CODE GOES HERE
   
   # Calculate the true outcome for the i-th observation
    ## YOUR CODE GOES HERE
   
   # Assign the errors vector to be equal to 1 if a prediction error was made (the vector is all 0s otherwise)
    ## YOUR CODE GOES HERE
}
# Count up the number of errors made
The completed loop looks like this:
# Set up a vector in which you can store the errors for each observation
errors <- rep(0, nrow(Weekly))
# Loop over each observation in the data
for (i in 1:nrow(Weekly)) {
  
  # Fit the logistic regression model
   glm.fit <-  glm(Direction ~ Lag1 + Lag2, 
                   data = Weekly[-i,], 
                   family = binomial)
   
   # Calculate the predicted probability, and check whether this is greater than .5
   is_up <-  predict(glm.fit, Weekly[i,], type="response") > 0.5
   
   # Calculate the true outcome for the i-th observation
   is_true_up <-  Weekly[i,]$Direction == "Up"
   
   # Assign the errors vector to be equal to 1 if a prediction error was made (the vector is all 0s otherwise)
   if (is_up != is_true_up) errors[i] <-  1
}
# Count up the number of errors made
sum(errors)
## [1] 490
There are 490 errors.
Take the average of the n numbers obtained from the loop above in order to calculate the LOOCV estimate for the test error. Comment on the results. How does this compare to the estimate of the training set error calculated in part (d)?
mean(errors)
## [1] 0.4499541
The LOOCV estimate of the test error rate is about 45%. This is very similar to the training set error rate we calculated in part (d), though that will not always be the case. In particular, here we are using a very simple model with only two (linear) predictors. As the models we fit become more complicated and more flexible, we increase the chances that we will overfit the training data. When this is the case, the training error rate will decrease but the LOOCV (or any other out-of-sample error rate) might begin to increase.
Running the leave-one-out procedure here does not take too long
because the Weekly dataset has only 1089 observations, and
this is a simple logistic regression model. However, often LOOCV is too
computationally burdensome to be useful. In those cases, K-fold
cross-validation is normally used instead. Though it is possible to code
cross-validation manually, using only the functions we used above, in
this case it will suffice to use some functions designed for the
purpose.
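Before turning to those functions, it can be instructive to see what K-fold cross-validation looks like when coded manually. The following is a rough sketch (the object names are just illustrative), building only on sample() and the functions used above; because the fold assignment is random, the exact error estimate will vary from run to run.
# Randomly assign each week to one of k folds
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Weekly)))
fold_errors <- rep(NA, k)
for (j in 1:k) {
  # Fit the model on all folds except fold j
  fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[folds != j, ], family = binomial)
  # Predict the held-out fold and record its error rate
  probs <- predict(fit, Weekly[folds == j, ], type = "response")
  preds <- ifelse(probs > 0.5, "Up", "Down")
  fold_errors[j] <- mean(preds != Weekly$Direction[folds == j])
}
# The K-fold estimate of the test error rate is the average across folds
mean(fold_errors)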
install.packages("boot") first):library(boot)glm_fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)cv.glm() function from the boot package to
estimate the cross-validation error. Look at the help file for this
function and you will see that you need to provide 4 arguments to make
the function work. i) the data for estimating the model; ii) the
estimated model object from part (b) above; iii) a “cost” function,
which we have copied for you below (this just tells the function that
you want to do cross-validation for a logit model); and finally, iv) a
value for K, the number of folds to use. Use this function
to implement 10-fold cross validation and report the cross-validation
error (this information is stored in the first element of
delta component of the estimated model object). How does
this estimate compare to the LOOCV reported in the question above?
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
cv_10_fold_logistic <- cv.glm(Weekly, glm_fit, K = 10, cost = cost)
cv_error_10_fold_logistic <- cv_10_fold_logistic$delta[1]
cv_error_10_fold_logistic
## [1] 0.4536272
The answer is almost exactly the same as the LOOCV, but the function takes less time to complete than the for loop did.
Experiment with different values for K in the cv.glm() function. How does this affect the cross-validation error?
cv_5_fold_logistic <- cv.glm(Weekly, glm_fit, K = 5, cost = cost)
cv_error_5_fold_logistic <- cv_5_fold_logistic$delta[1]
cv_error_5_fold_logistic
## [1] 0.4481175
cv_20_fold_logistic <- cv.glm(Weekly, glm_fit, K = 20, cost = cost)
cv_error_20_fold_logistic <- cv_20_fold_logistic$delta[1]
cv_error_20_fold_logistic
## [1] 0.446281
The results are essentially insensitive to these choices. Usually the choice of K is determined by how long the model takes to estimate: if you have a very long-running model, then a lower value for K is normally more practical.
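If you want to check this a little more systematically, one option (a quick sketch; k_values is just an illustrative name) is to loop over several choices of K and collect the estimated errors. Note that cv.glm() assigns folds at random, so repeated runs will give slightly different numbers.
# Cross-validation error for a range of values of K
k_values <- c(5, 10, 20, 50)
sapply(k_values, function(k) cv.glm(Weekly, glm_fit, K = k, cost = cost)$delta[1])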
In this exercise, we will predict the number of applications
(Apps) received by US colleges using the other variables in
the College data set. Again, if you would like to read more
about the variables in the data, just type ?College.
Estimate a linear regression model with Apps as the outcome, and all other variables in the data as predictors. Report the estimated coefficients from the model.
lm.fit <-  lm(Apps ~ . , data = College)
coef(lm.fit)
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -445.08412616 -494.14896913    1.58580720   -0.88069035   49.92627615 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -14.23447911    0.05739022    0.04444654   -0.08587004    0.15102708 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.02089712    0.03110491   -8.67849712   -3.33065614   15.38960682 
##   perc.alumni        Expend     Grad.Rate 
##    0.17866507    0.07789731    8.66762545
Now estimate a ridge regression using the College data. To get you started, the code below creates a matrix of predictors using the model.matrix() function, which makes working with glmnet() a little more straightforward.
library(glmnet) # Load the relevant package for estimating ridge and lasso regressions
## Loading required package: Matrix
## Loaded glmnet 4.1-6
train.mat <-  model.matrix(Apps ~ . , data = College)[,-1]
To estimate the ridge regression, you will need to use the glmnet() function from the glmnet package. This function takes four main arguments:
- x: a matrix of predictor variables
- y: a vector of outcomes
- alpha: a parameter determining whether you want to run a ridge regression (alpha = 0) or a lasso regression (alpha = 1)
- lambda: the penalty parameter applied to the ridge regression
For the purposes of this exercise, use train.mat for
x, and the outcome from the College data for
y. Set alpha = 0 and
lambda = 1000 (a very large penalty parameter). Report the
coefficients (coef()) estimated by this model. How do they
compare to the coefficients that you estimated using OLS in part (b)
above?
ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 1000)
coef(ridge_mod)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -1.983738e+03
## PrivateYes  -5.498690e+02
## Accept       7.233613e-01
## Enroll       7.172059e-01
## Top10perc    1.825737e+01
## Top25perc    5.311974e+00
## F.Undergrad  1.142266e-01
## P.Undergrad  4.288358e-02
## Outstate     8.359463e-04
## Room.Board   1.994826e-01
## Books        2.177900e-01
## Personal    -2.429920e-03
## PhD         -5.943560e-01
## Terminal    -2.261802e+00
## S.F.Ratio    1.126691e+01
## perc.alumni -1.052872e+01
## Expend       6.608624e-02
## Grad.Rate    1.188523e+01
## Presenting these nicely next to the OLS estimates
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))), 2)
##                 ols    ridge
## (Intercept) -445.08 -1983.74
## PrivateYes  -494.15  -549.87
## Accept         1.59     0.72
## Enroll        -0.88     0.72
## Top10perc     49.93    18.26
## Top25perc    -14.23     5.31
## F.Undergrad    0.06     0.11
## P.Undergrad    0.04     0.04
## Outstate      -0.09     0.00
## Room.Board     0.15     0.20
## Books          0.02     0.22
## Personal       0.03     0.00
## PhD           -8.68    -0.59
## Terminal      -3.33    -2.26
## S.F.Ratio     15.39    11.27
## perc.alumni    0.18   -10.53
## Expend         0.08     0.07
## Grad.Rate      8.67    11.89
As expected, the ridge regression shrinks the size of many (though not all) of the coefficients relative to OLS.
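One way to see the shrinkage in aggregate (a quick optional check) is to compare the sum of squared coefficients, excluding the intercept, which is (up to glmnet's internal standardisation) the quantity the ridge penalty targets:
# Sum of squared (non-intercept) coefficients under each model
sum(coef(lm.fit)[-1]^2)                 # OLS
sum(as.numeric(coef(ridge_mod))[-1]^2)  # ridge with lambda = 1000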
Experiment with different values for the lambda parameter. How does this affect the estimated coefficients? What happens when lambda = 0?
ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 100)
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))), 2)
##                 ols   ridge
## (Intercept) -445.08 -914.82
## PrivateYes  -494.15 -514.20
## Accept         1.59    1.32
## Enroll        -0.88   -0.12
## Top10perc     49.93   37.83
## Top25perc    -14.23   -6.65
## F.Undergrad    0.06    0.04
## P.Undergrad    0.04    0.03
## Outstate      -0.09   -0.06
## Room.Board     0.15    0.18
## Books          0.02    0.06
## Personal       0.03    0.01
## PhD           -8.68   -6.76
## Terminal      -3.33   -4.55
## S.F.Ratio     15.39   14.93
## perc.alumni    0.18   -4.19
## Expend         0.08    0.08
## Grad.Rate      8.67    9.79
ridge_mod <- glmnet(train.mat, College$Apps, alpha = 0, lambda = 0)
round(data.frame(ols = coef(lm.fit), ridge = as.numeric(coef(ridge_mod))), 2)
##                 ols   ridge
## (Intercept) -445.08 -450.43
## PrivateYes  -494.15 -494.51
## Accept         1.59    1.58
## Enroll        -0.88   -0.86
## Top10perc     49.93   49.81
## Top25perc    -14.23  -14.15
## F.Undergrad    0.06    0.06
## P.Undergrad    0.04    0.04
## Outstate      -0.09   -0.09
## Room.Board     0.15    0.15
## Books          0.02    0.02
## Personal       0.03    0.03
## PhD           -8.68   -8.68
## Terminal      -3.33   -3.33
## S.F.Ratio     15.39   15.41
## perc.alumni    0.18    0.12
## Expend         0.08    0.08
## Grad.Rate      8.67    8.67
As we reduce the value of lambda, the ridge coefficients get closer to the OLS coefficients. At lambda = 0, they are essentially the same (the small differences that remain are due to minor implementation differences between lm() and glmnet()).
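If you want to quantify how close the two sets of estimates are at lambda = 0, a quick check (not part of the original question) is to look at the largest absolute discrepancy:
# Largest absolute difference between the OLS and ridge (lambda = 0) coefficients
max(abs(coef(lm.fit) - as.numeric(coef(ridge_mod))))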
To estimate a lasso regression instead, you simply need to set the alpha parameter to be equal to 1 rather than 0. Try this now, and compare the OLS coefficients to the lasso coefficients when lambda = 20.
lasso_mod <- glmnet(train.mat, College$Apps, alpha = 1, lambda = 20)
coef(lasso_mod)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -585.76484088
## PrivateYes  -430.96055785
## Accept         1.46383398
## Enroll        -0.22858773
## Top10perc     35.00907616
## Top25perc     -3.36767932
## F.Undergrad    .         
## P.Undergrad    0.02373731
## Outstate      -0.06033567
## Room.Board     0.12680896
## Books          .         
## Personal       .         
## PhD           -5.84465314
## Terminal      -3.23247018
## S.F.Ratio      5.27541830
## perc.alumni   -0.94416051
## Expend         0.07026571
## Grad.Rate      5.41871976
## Presenting these nicely next to the OLS estimates
round(data.frame(ols = coef(lm.fit), lasso = as.numeric(coef(lasso_mod))), 2)
##                 ols   lasso
## (Intercept) -445.08 -585.76
## PrivateYes  -494.15 -430.96
## Accept         1.59    1.46
## Enroll        -0.88   -0.23
## Top10perc     49.93   35.01
## Top25perc    -14.23   -3.37
## F.Undergrad    0.06    0.00
## P.Undergrad    0.04    0.02
## Outstate      -0.09   -0.06
## Room.Board     0.15    0.13
## Books          0.02    0.00
## Personal       0.03    0.00
## PhD           -8.68   -5.84
## Terminal      -3.33   -3.23
## S.F.Ratio     15.39    5.28
## perc.alumni    0.18   -0.94
## Expend         0.08    0.07
## Grad.Rate      8.67    5.42
The lasso shrinks the coefficients in a similar way to the ridge regression, but here some of the coefficients (F.Undergrad, Books, and Personal) are shrunk all the way to zero. This illustrates a key difference between the lasso and ridge approaches: the lasso performs variable selection, while ridge does not.
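You can confirm which predictors the lasso has set exactly to zero (a quick check; lasso_coefs is just an illustrative name):
# Predictors dropped entirely by the lasso at lambda = 20
lasso_coefs <- as.numeric(coef(lasso_mod))
rownames(coef(lasso_mod))[lasso_coefs == 0]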
For this question, you might find it helpful to refer to pages 274-275 of the James et al. textbook.
Split the data into a training set and a test set. You may find the sample() function helpful for selecting observations at random to include in the two sets.
library(ISLR)
set.seed(11)
train.size <-  nrow(College) / 2
train <-  sample(1:nrow(College), train.size)
College.train <-  College[train, ]
College.test <-  College[-train, ]
Fit a linear regression model on the training set, and report the test set mean squared error, \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2\), where \(y_i\) is the true number of applications (Apps) for observation \(i\) and \(\hat{y_i}\) is the fitted value for observation \(i\) from the regression model.
lm.fit <-  lm(Apps ~ . , data = College.train)
lm.pred <-  predict(lm.fit, College.test)
mean((College.test$Apps - lm.pred)^2)
## [1] 1026096
The test-set MSE for the linear model is 1026096.
Now fit a ridge regression model on the training set, with lambda = 0.1. Remember, you
will have to convert the training data predictors using the
model.matrix() function, as in part (c) of the previous
question, in order for the glmnet() function to work. Use
the predict() function to calculate fitted values for the
test set by setting the newx argument to be equal to your
test data. Report the test set MSE from this regression. How does it
compare to the test set error you calculated in part (b)?
train.mat <-  model.matrix(Apps ~ . , data = College.train)[,-1]
test.mat <-  model.matrix(Apps ~ . , data = College.test)[,-1]
ridge_mod <- glmnet(train.mat, College.train$Apps, alpha = 0)
ridge_pred <- predict(ridge_mod, newx = test.mat, s = 0.1)
mean((College.test$Apps - ridge_pred)^2)
## [1] 969107.4
The test-set MSE of the ridge model (969107) is marginally lower than that of the OLS model (1026096).
Now fit a lasso model on the training set (i.e. set alpha = 1 in glmnet()). Again, set lambda = 0.1. Report the test error obtained. How does this compare to the test set errors for the OLS and ridge models?
lasso_mod <- glmnet(train.mat, College.train$Apps, alpha = 1)
lasso_pred <- predict(lasso_mod, newx = test.mat, s = 0.1)
mean((College.test$Apps - lasso_pred)^2)
## [1] 1689676
Here, the test-set MSE of the lasso model is a bit higher than that of the OLS model.
Rather than fixing lambda in advance, we can choose it by cross-validation. To do this for both the ridge and lasso models, use the cv.glmnet() function, which takes the following main arguments:
| Argument | Value |
|---|---|
| x | A matrix of predictor variables |
| y | A response variable |
| family | A character string representing the family of the outcome variable. Use "gaussian" here for a numeric outcome. |
| nfolds | Number of folds to use in cross validation |
| alpha | Parameter expressing the type of penalised model you would like to fit. alpha = 1 is the lasso penalty, and alpha = 0 the ridge penalty. |
Implement the cross-validation using this function.
lasso_cv <- cv.glmnet(x = train.mat, 
                      y = College.train$Apps, 
                      family = "gaussian",
                      alpha = 1,
                      nfolds = 10)
ridge_cv <- cv.glmnet(x = train.mat, 
                      y = College.train$Apps, 
                      family = "gaussian",
                      alpha = 0,
                      nfolds = 10)
cv.glmnet() stores the optimal choice for \(\lambda\) in the
optimal choice for \(\lambda\) in the
output of the fitted object. Extract it using
estimated_cv_object$lambda.min and use it as input to the s argument of the predict() function. (You can also see pages 276-277 of the James et al. book if you are struggling here.) Report the test error obtained for both models.
lasso_bestlam <- lasso_cv$lambda.min
ridge_bestlam <- ridge_cv$lambda.min
lasso_pred_cv <- predict(lasso_mod, s = lasso_bestlam,  newx = test.mat)
ridge_pred_cv <- predict(ridge_mod, s = ridge_bestlam,  newx = test.mat)
mean((College.test$Apps - lasso_pred_cv)^2)
## [1] 1013316
mean((College.test$Apps - ridge_pred_cv)^2)
## [1] 969107.4
In general, the results do not change very much. The MSE for the lasso with the optimal lambda is now somewhat smaller than for the OLS model, but the ridge model MSE is unchanged. In this case, there are not really enough variables for us to be over-fitting the data, and so the regularisation approaches do not confer much of a performance advantage. Again, this will not always be the case, and so it is helpful to have these tools at hand for when we need them!
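Finally, it can be convenient to gather the test-set MSEs computed above into a single table for comparison (a quick sketch using the prediction objects created earlier; mse_summary is just an illustrative name):
# Collect the test-set MSEs for the three approaches
mse_summary <- data.frame(
  model = c("OLS", "Ridge (CV-chosen lambda)", "Lasso (CV-chosen lambda)"),
  test_mse = c(mean((College.test$Apps - lm.pred)^2),
               mean((College.test$Apps - ridge_pred_cv)^2),
               mean((College.test$Apps - lasso_pred_cv)^2))
)
mse_summary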