Classification coding demo
Michael Jacobs 2023-07-15
Introduction
This coding demo is based on chapter 4.7 of James et al. ‘An
Introduction to Statistical Learning’. It works with Smarket
- a
dataset of stock market data from the ISLR2
package. You may need to
first install this package before loading the data.
This data set consists of percentage returns for the S&P 500 stock index
over 1, 250 days, from the beginning of 2001 until the end of 2005. For
each date, percentage returns for each of the five previous trading days
is recorded, (Lag1
through Lag5
). Also recorded is Volume
(the
number of shares traded on the previous day, in billions), Today
(the
percentage return on the date in question) and Direction
(whether the
market was Up or Down on this date).
For this coding demo, the goal is to predict Direction
(a
qualitative response) using the other features.
Logistic regression
First, let’s try to predict Direction
using a logistic regression,
with all lags and Volume
as predictors. We do this using the glm()
function. Specifying family = binomial
tells R to fit a logistic
regression rather than another type of generalised linear model.
# Fit a logistic regression
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
data = Smarket , family = binomial
)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Careful how you interpret the output here.
-
As in linear regression model outputs, the p-values tell us whether the coefficient is statistically significant (i.e. can we reject the null hypothesis that this coefficient is 0?)
-
Unlike in a linear regression model, the coefficients themselves (in the
Estimate
column) do not tell us the marginal effect of the variable on the outcome. Instead, it tells us the increase in log odds of the outcome as the predictor increases by one unit. This is difficult to interpret meaningfully! -
We can convert the coefficients into ‘odds ratios’ (rather than change in ‘log odds’) by exponentiating them. i.e. $exp(\beta_1)$ is the ratio of the odds of the binary outcome occurring, given a 1-unit increase in variable 1, over the odds of the binary outcome occurring, given no change in variable 1.
# Exponentiate the coefficients to turn them into odds
exp(coef(glm.fits))
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.8816146 0.9295323 0.9585809 1.0111468 1.0094029 1.0103664
## Volume
## 1.1450412
This is still quite difficult to interpret (and is a disadvantage of logistic regression relative to linear regression). But note that interpreting coefficients is only really important if we are interested in inference, not prediction. Let’s move on to thinking about prediction.
We can use the predict()
function to predict the outcome of units
given we have (a) a pre-trained model (in this case glm.fits
); and (b)
a dataset with values for all the predictors in that model.
# First, let's apply the predict() function to the training data itself
# Set type = "response" to get predicted probabilities out
pred_probs <- predict(glm.fits, type = "response")
pred_probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
Because we did not supply the predict()
function with a newdata
argument, it used the training data of glm.fits
itself in order to
make it’s predictions. In other words, the 10 values in the output above
represent the predict probability of the stock market moving upward for
the first 10 observations in the Smarket dataset.
How well does our model do at predicting the outcomes of the data it was trained on? Let’s evaluate this using a confusion matrix
# First, lets create a predicted outcome variable
# If pred_prob > 0.5, assign value "Up", else assign value "Down"
pred_outcomes <- ifelse(pred_probs > 0.5, "Up", "Down")
# Now let's compare this to the true value of `Direction` (which is known in the training data)
true_outcomes <- Smarket$Direction
table(pred_outcomes, true_outcomes)
## true_outcomes
## pred_outcomes Down Up
## Down 145 141
## Up 457 507
Using this confusion matrix, we can see how many observations were correctly and incorrectly classified in each class. From this, we can calculate the proportion correctly (and incorrectly) classified.
# Proportion correctly classified
(145 + 507) / (145 + 507 + 141 + 457)
## [1] 0.5216
mean(pred_outcomes == true_outcomes)
## [1] 0.5216
So our model can correctly predict the direction of the movement in the stock market in our training set in 52.16% of cases.
But we should never use our training set to evaluate model performance, because there is a risk that we overfit our model to our training data, in which case, the model would be better at predicting outcomes on the training data than new data.
Solution: let’s split our data in a training and test set. We will use the training set to train the model, and the test set to evaluate the model.
# Split the data into a training set (data from before 2005) and a test set (data from after 2005)
training_data <- Smarket[Smarket$Year < 2005, ]
test_data <- Smarket[Smarket$Year >= 2005, ]
# Now lets fit the model, this time just using the training set
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = training_data , family = binomial
)
# And then predict the outcome for the test set
# To do this, we will use the newdata argument this time
pred_probs_test <- predict(glm.fits, newdata = test_data,
type = "response")
# And finally, let's evaluate the performance of the model on the test set
pred_outcomes_test <- ifelse(pred_probs_test > 0.5, "Up", "Down")
true_outcomes_test <- test_data$Direction
table(pred_outcomes_test, true_outcomes_test)
## true_outcomes_test
## pred_outcomes_test Down Up
## Down 77 97
## Up 34 44
# Calculate proportion of test set correctly classified
mean(pred_outcomes_test == true_outcomes_test)
## [1] 0.4801587
Proportion correctly classified in the test set is 48% - which is worse than guessing :(
Naive Bayes
Now let’s try using the Naive Bayes classifier. For this, we will need
the e1071
library - again, you may need to install this using
install.packages()
.
# install.packages("e1071")
library(e1071)
As with the logistic regression model, let’s train the model using data
from before 2005, and test it using data from after 2005. We do this
using the naiveBayes()
function.
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket)
When can then use the predict()
function in exactly the same way as
for the logistic regression model (but this time, to get probabilities,
we specify type = "raw"
).
# ?predict.naiveBayes
pred_probs_test_nb <- predict(nb.fit, newdata = test_data, type = "raw")
# And finally, let's evaluate the performance of the model on the test set
pred_outcomes_test_nb <- predict(nb.fit, newdata = test_data, type = "class")
true_outcomes_test <- test_data$Direction
table(pred_outcomes_test_nb, true_outcomes_test)
## true_outcomes_test
## pred_outcomes_test_nb Down Up
## Down 2 0
## Up 109 141
# Calculate proportion of test set correctly classified
mean(pred_outcomes_test_nb == true_outcomes_test)
## [1] 0.5674603
The proportion correctly classified is now 56.7% - this looks like an improvement! But careful - the model has classified all but 2 observations as “Up”. Of the 111 test set observations with a true value of “Down” only 2 were correctly classified! We therefore need to consider the performance of the model in each class, as well as the average performance across classes.