This question relates to the College
dataset from the
ISLR
package. Start by loading that package, as well as the
gam
package which will allow us to estimate generalised
additive models, the splines
package (which,
unsurprisingly, let’s us estimate splines), and the
randomForest
package (guess what that is for?).
library(ISLR)
library(gam)
library(splines)
library(randomForest)
The College
data contains several variables for 777 US
Colleges in the 1990s. Look at the help file for this data set
(?College
) for a description of the variables that are
included.
In this seminar, we will be experimenting with different approaches
to estimating non-linear relationships between the Expend
variable – which measures the expenditure per student of each college
(in dollars) – and several other variables in the data.
Expend
variable and the following four predictors:
Outstate
, PhD
, Room.Board
, and
S.F.Ratio
. For which of these variables do you think there
is evidence of a non-linear relationship?par(mfrow = c(2,2))
plot(College$Outstate, College$Expend)
plot(College$PhD, College$Expend)
plot(College$Room.Board, College$Expend)
plot(College$S.F.Ratio, College$Expend)
It is a little hard to tell, but there is some evidence here that
both PhD
and S.F.Ratio
are non-linearly
related to the Expend
variable. For the PhD variable, there
is essentially no relationship between the percentage of faculty with a
PhD and the expenditure variable for most of the range of X, but a
positive relationship between the variables for the 80-100% range. The
opposite is true for the student-faculty ratio variable, which is
strongly negatively associated with the outcome for lower levels of X
but mostly flat for higher levels of X.
Expend
as
the outcome variable, and each including one of the predictors you
plotted above. Include a second-degree polynomial transformation of X in
each of the models (you can do this by using
poly(x_variable,2)
in the lm()
model formula).
Interpret the significance of the squared term in each of your models.
Can you reject the null hypothesis of linearity?summary(lm(Expend ~ poly(Outstate, 2), data = College))
##
## Call:
## lm(formula = Expend ~ poly(Outstate, 2), data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9328 -1636 -497 665 36680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9660.2 130.5 74.04 <2e-16 ***
## poly(Outstate, 2)1 97863.5 3636.7 26.91 <2e-16 ***
## poly(Outstate, 2)2 36675.2 3636.7 10.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3637 on 774 degrees of freedom
## Multiple R-squared: 0.5162, Adjusted R-squared: 0.515
## F-statistic: 412.9 on 2 and 774 DF, p-value: < 2.2e-16
summary(lm(Expend ~ poly(PhD, 2), data = College))
##
## Call:
## lm(formula = Expend ~ poly(PhD, 2), data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -2263 -357 1309 40415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9660.2 154.4 62.55 <2e-16 ***
## poly(PhD, 2)1 62950.2 4304.9 14.62 <2e-16 ***
## poly(PhD, 2)2 53405.8 4304.9 12.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4305 on 774 degrees of freedom
## Multiple R-squared: 0.3221, Adjusted R-squared: 0.3203
## F-statistic: 183.9 on 2 and 774 DF, p-value: < 2.2e-16
summary(lm(Expend ~ poly(Room.Board, 2), data = College))
##
## Call:
## lm(formula = Expend ~ poly(Room.Board, 2), data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9804 -2240 -628 1293 40005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9660.2 161.6 59.784 <2e-16 ***
## poly(Room.Board, 2)1 72983.8 4504.1 16.204 <2e-16 ***
## poly(Room.Board, 2)2 11416.1 4504.1 2.535 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4504 on 774 degrees of freedom
## Multiple R-squared: 0.2579, Adjusted R-squared: 0.256
## F-statistic: 134.5 on 2 and 774 DF, p-value: < 2.2e-16
summary(lm(Expend ~ poly(S.F.Ratio, 2), data = College))
##
## Call:
## lm(formula = Expend ~ poly(S.F.Ratio, 2), data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19209.3 -1896.6 -415.4 1556.7 31145.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9660.2 138.5 69.76 <2e-16 ***
## poly(S.F.Ratio, 2)1 -84925.2 3860.2 -22.00 <2e-16 ***
## poly(S.F.Ratio, 2)2 49124.6 3860.2 12.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3860 on 774 degrees of freedom
## Multiple R-squared: 0.4549, Adjusted R-squared: 0.4535
## F-statistic: 323 on 2 and 774 DF, p-value: < 2.2e-16
In fact, we can reject the null hypothesis of linearity for all of these variables!
Expend
and PhD
, calculate
fitted values across the range of the PhD
variable.
Recreate the plot between these variable that you constructed in part a,
and add a line representing these fitted values to the plot (using the
lines()
function) to illustrate the estimated relationship.
Interpret the graph. (You will need to use the predict()
function with the newdata
argument in order to complete
this question. You can also find the range of the PhD
variable using the range()
function.) I have given you some
starter code below.fitted_vals_quadratic <- predict(phd_mod_quadratic, newdata = data.frame(PhD = SOME_VALUES_GO_HERE))
plot(SOMETHING_GOES_HERE, SOMETHING_ELSE_GOES_HERE,
xlab = "Percentage of faculty with a PhD",
ylab = "Predicted Expenditure per Student")
lines(SOME_VALUES_GO_HERE, SOME_OTHER_VALUES_GO_HERE, col = "red")
phd_mod_quadratic <- lm(Expend ~ poly(PhD, 2), data = College)
range(College$PhD) # Find the range of the PhD variable
## [1] 8 103
fitted_vals_quadratic <- predict(phd_mod_quadratic, newdata = data.frame(PhD = 8:103))
plot(College$PhD, College$Expend,
xlab = "Percentage of faculty with a PhD",
ylab = "Predicted Expenditure per Student")
lines(8:103, fitted_vals_quadratic, col = "red")
The graph suggests that predicted expenditure per student decreases when moving from low to moderate percentages of faculty with PhDs, and then increases when moving from moderate to high percentages of faculty with PhDs.
Expend ~ PhD
model, this time
including a cubic polynomial (i.e. of degree 3). Can you reject the null
hypothesis for the cubic term? Add another line (in a different colour)
with the fitted values from this model to the plot you created in part
c.phd_mod_cubic <- lm(Expend ~ poly(PhD, 3), data = College)
summary(phd_mod_cubic) # Yes, we can reject the null
##
## Call:
## lm(formula = Expend ~ poly(PhD, 3), data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15884 -2266 -373 1330 39272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9660 152 63.544 < 2e-16 ***
## poly(PhD, 3)1 62950 4238 14.855 < 2e-16 ***
## poly(PhD, 3)2 53406 4238 12.603 < 2e-16 ***
## poly(PhD, 3)3 21518 4238 5.078 4.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4238 on 773 degrees of freedom
## Multiple R-squared: 0.344, Adjusted R-squared: 0.3414
## F-statistic: 135.1 on 3 and 773 DF, p-value: < 2.2e-16
fitted_vals_cubic <- predict(phd_mod_cubic, newdata = data.frame(PhD = 8:103))
plot(College$PhD, College$Expend,
xlab = "Percentage of faculty with a PhD",
ylab = "Predicted Expenditure per Student")
lines(8:103, fitted_vals_quadratic, col = "red")
lines(8:103, fitted_vals_cubic, col = "green")
Expend
and PhD
, this time using a cubic spline
instead of a polynomial. You can implement the cubic spline by using the
bs()
function, which is specified thus:
lm(outcome ~ bs(x_variable, df = ?, degree = 3))
. Select a
value for the df
argument that you think is reasonable.
Estimate the model and then, again, plot the fitted values across the
range of the PhD
variable.phd_mod_spline <- lm(Expend ~ bs(PhD, df = 6, degree = 3), data = College)
fitted_vals_spline <- predict(phd_mod_spline, newdata = data.frame(PhD = 8:103))
plot(College$PhD, College$Expend,
xlab = "Percentage of faculty with a PhD",
ylab = "Predicted Expenditure per Student")
lines(8:103, fitted_vals_quadratic, col = "red")
lines(8:103, fitted_vals_cubic, col = "green")
lines(8:103, fitted_vals_spline, col = "blue")
I have set df = 6
which is likely too flexible for this
data. Nevertheless, the general pattern is consistent with both the
cubic and quadratic models.
loess()
model. The key parameter here is the
span
. High values for span
will result in a
less flexible model, and low values for span
will result in
a more flexible model. Pick a value that you feel is appropriately
wiggly. Again, add it to your (now very colourful) plot.phd_mod_loess <- loess(Expend ~ PhD, data = College, span = .1)
fitted_vals_loess <- predict(phd_mod_loess, newdata = data.frame(PhD = 8:103))
plot(College$PhD, College$Expend,
xlab = "Percentage of faculty with a PhD",
ylab = "Predicted Expenditure per Student")
lines(8:103, fitted_vals_quadratic, col = "red")
lines(8:103, fitted_vals_cubic, col = "green")
lines(8:103, fitted_vals_spline, col = "blue")
lines(8:103, fitted_vals_loess, col = "purple")
I have definitely over-wiggled here!
PhD
and Expend
in the data? Can you tell?No, you cannot tell. It is very hard to work out which relationship is best fitting in this example just by looking at plots! Fortunately, tomorrow we will discuss a set of tool that allows us to work out which relationship is most appropriate by seeing which is best at predicting data out of sample.
College
data using the gam()
function. This
model is just like the lm()
function, but it allows you to
include flexible transformations of the covariates in the model. In this
example, estimate a GAM with Expend
as the outcome, and use
all four of predictors that you plotted in part a as well as the
Private
variable. For each of the continuous predictors,
use a cubic spline (bs()
) with 4 degrees of freedom. Once
you have estimated your model, plot the results by passing the estimated
model object to the plot()
function. Interpret the
results.library(gam)
gam_mod <- gam(Expend ~ Private + bs(PhD, df=4) +
bs(Outstate, df=4) + bs(Room.Board, df=4) +
bs(S.F.Ratio, df=4) ,
data=College)
par(mfrow=c(2, 3))
plot(gam_mod, se=TRUE, col="blue")
Apply bagging and random forests to the Weekly
data set.
This dataset includes Weekly percentage returns for the S&P 500
stock index between 1990 and 2010. Your goal is to predict the
Direction
variable, which has two levels (Down and Up). For
this task, you should fit the models on a randomly selected subset of
your data – the training set – and evaluate their performance on the
rest of the data – the test set. I have given you some code below to
help you construct these datasets. How accurate are the results compared
to simpler methods like logistic regression? Which of these approaches
yields the best performance?
set.seed(3) # Set a value for the random number generator to make the results comparable across runs
train <- sample(nrow(Weekly), 2/3 * nrow(Weekly)) # Randomly select two-thirds of the data
Weekly_train <- Weekly[train,] # Subset to the training observations
Weekly_test <- Weekly[-train,] # Subset to the test observations
Logistic regression
glm.fit <- glm(Direction ~ . -Year-Today,
data=Weekly_train,
family="binomial")
glm.probs <- predict(glm.fit, newdata=Weekly_test,
type = "response")
glm.pred <- rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Weekly_test$Direction)
##
## glm.pred Down Up
## Down 14 18
## Up 151 180
mean(glm.pred != Weekly_test$Direction)
## [1] 0.4655647
Bagging
library(randomForest)
bag.weekly <- randomForest(Direction~.-Year-Today,
data=Weekly_train,
mtry=6)
yhat.bag <- predict(bag.weekly, newdata=Weekly_test)
table(yhat.bag, Weekly_test$Direction)
##
## yhat.bag Down Up
## Down 69 64
## Up 96 134
mean(yhat.bag != Weekly_test$Direction)
## [1] 0.4407713
Random forests
rf.weekly <- randomForest(Direction ~ . -Year-Today,
data=Weekly_train,
mtry=2)
yhat.bag <- predict(rf.weekly, newdata=Weekly_test)
table(yhat.bag, Weekly_test$Direction)
##
## yhat.bag Down Up
## Down 65 59
## Up 100 139
mean(yhat.bag != Weekly_test$Direction)
## [1] 0.4380165
Best performance summary: Bagging resulted in the lowest validation set test error rate in this example, but this needn’t always be the case.