
Midterm Review

Sarah Jewett

library(stargazer)
library(dplyr)
library(ggplot2)
data("Auto", package = "ISLR")
Auto$name <- as.factor(Auto$name)

Regression

Predicted Values

Say we want to see the predicted values for the first few rows of the Auto data based on a linear regression model with only two independent variables:

lm.1 <- lm(mpg ~ cylinders + weight, Auto)
summary(lm.1)
## 
## Call:
## lm(formula = mpg ~ cylinders + weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6469  -2.8282  -0.2905   2.1606  16.5856 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.2923105  0.7939685  58.305   <2e-16 ***
## cylinders   -0.7213779  0.2893780  -2.493   0.0131 *  
## weight      -0.0063471  0.0005811 -10.922   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.304 on 389 degrees of freedom
## Multiple R-squared:  0.6975, Adjusted R-squared:  0.6959 
## F-statistic: 448.4 on 2 and 389 DF,  p-value: < 2.2e-16

We can use predict() to see what each car’s predicted mpg is, based on the above coefficients:

predict(lm.1, newdata = Auto[1:6,], type = "response")
##        1        2        3        4        5        6 
## 18.28101 17.08141 18.71261 18.73166 18.63010 12.96848
# OR
lm.1$fitted.values[1:6] 
##        1        2        3        4        5        6 
## 18.28101 17.08141 18.71261 18.73166 18.63010 12.96848

Our predicted MPG for the first row is 18.28101.

Let’s see how we get this predicted MPG:

First we pull out the relevant coefficients from lm.1:

intercept      <- 46.2923105
beta_cylinders <- -0.7213779
beta_weight    <- -0.0063471

# OR

intercept      <- lm.1$coefficients[1]
beta_cylinders <- lm.1$coefficients[2]
beta_weight    <- lm.1$coefficients[3]

This is: MPG = 46.29 + -0.72*cylinders + -0.006*weight

Let’s take a look at the first row and see what the values are in the data:

Auto[1,]
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504           12   70      1
##                        name
## 1 chevrolet chevelle malibu

MPG = 46.2923105 + -0.7213779*8 + -0.0063471*3504

Our predicted MPG was 18.28101. Does this match?

predicted_mpg <- intercept + beta_cylinders*Auto$cylinders[1] + beta_weight*Auto$weight[1]

unname(predicted_mpg)
## [1] 18.28101
# unname() takes away the name from 'Named number' values, which this is. 

Now, we can have fun and easily plot our original mpg values alongside the fitted values. We will also add the regression line, like in Assignment 3, to see how the predicted values compare to the line geom_smooth() draws when it runs lm() inside the plot.

lm.3 <- lm(mpg ~ weight, Auto)
summary(lm.3)
## 
## Call:
## lm(formula = mpg ~ weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9736  -2.7556  -0.3358   2.1379  16.5194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.216524   0.798673   57.87   <2e-16 ***
## weight      -0.007647   0.000258  -29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6918 
## F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16
Auto$pred <- unname(lm.3$fitted.values)

ggplot(Auto, aes(y = mpg, x = weight)) + 
  geom_point(aes(color = "Observed")) +
  geom_point(aes(y = pred, color = "Predicted")) +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Now, if we want to start looking at how adding variables changes things relative to the original regression line, we can add points from our earlier regression of mpg on weight and cylinders:

Auto$pred2 <- unname(lm.1$fitted.values)

ggplot(Auto, aes(y = mpg, x = weight)) + 
  geom_point(aes(color = "Observed")) +
  geom_point(aes(y = pred, color = "Predicted")) +
  geom_point(aes(y=pred2, color = "MultiVarReg")) +
    geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Interactions!

When we don’t use interactions, we are assuming that each independent variable’s effect on y does not depend on the other independent variables. An interaction looks at, say, how x1’s impact on y depends on the value of x2.
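
In equation form (using generic x1 and x2 to make the idea concrete):

y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)

so the effect of x1 on y is b1 + b3*x2, which changes depending on the value of x2.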

Imagine we are looking at whether people like peanut butter or jelly. If we only look at the two spreads independently, we might miss that many people actually PREFER to eat peanut butter and jelly together and that their level of enjoyment increases when they eat them together.

If we have a sample from a survey of people at a hotel across a few days, asking whether they enjoyed their breakfast or not, we can see how this PB&J situation might look in the data. Imagine we also asked what they ate for breakfast, from a list of foods including peanut butter and jelly. We also know that on one day the hotel ran out of peanut butter at some point, and on another day it ran out of jelly.

Enjoyment is measured from 1 to 10, with 6 being the threshold for saying they enjoyed the meal. (This is just to avoid a binary dependent variable, even though we are going to more or less treat it as such.)

So say we have:

- 12 people who ate PB & jelly (all but 1 enjoyed their breakfast)
- 5 people who ate only PB (1 enjoyed their breakfast)
- 5 people who ate only jelly (4 enjoyed their breakfast)
- 4 people who ate neither (and all enjoyed their breakfast)

We create a data frame:

food <- data.frame(PB =    c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  1, 1, 1, 1, 1,  0, 0, 0, 0, 0,  0, 0, 0, 0),
                   Jelly = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  0, 0, 0, 0, 0,  1, 1, 1, 1, 1,  0, 0, 0, 0),
                   Enjoy = c(8, 7, 9, 3, 8, 9, 9, 6, 8, 6, 7, 8,  1, 5, 6, 2, 3,  7, 7, 3, 9, 8,  7, 8, 6, 8))

And now we can run three different linear models, two of which include interactions.

So if we decide to look at the impact of whether a person had peanut butter and jelly on their enjoyment rating…

First, a basic regression without an interaction:

lm.no_inter <- lm(Enjoy ~ Jelly + PB, food)

These keep the independent variables along with the interaction (two identical ways of writing it, one with * and one with :):

lm.all_inter <-   lm(Enjoy ~ Jelly * PB, food)
lm.all_inter2 <-  lm(Enjoy ~ Jelly + PB + Jelly:PB, food)

But neither is the same as including only the interaction term:

lm.inter_only <-  lm(Enjoy ~ Jelly:PB, food)

Let’s look at them side by side (using the stargazer package; the table shows in the knit HTML only):

stargazer(lm.no_inter, lm.all_inter, lm.inter_only, type = "html")
                                Dependent variable: Enjoy
                      ----------------------------------------------------------------
                            (1)                    (2)                    (3)
Jelly                     2.240**                -0.450
                         (0.868)                (1.225)
PB                       -1.160                 -3.850***
                         (0.868)                (1.225)
Jelly:PB                                         4.383**                1.619*
                                                (1.564)                (0.863)
Constant                  5.756***               7.250***               5.714***
                         (0.845)                (0.913)                (0.586)
Observations               26                     26                     26
R2                        0.248                  0.446                  0.128
Adjusted R2               0.183                  0.370                  0.092
Residual Std. Error       2.081 (df = 23)        1.827 (df = 22)        2.194 (df = 24)
F Statistic               3.792** (df = 2; 23)   5.898*** (df = 3; 22)  3.519* (df = 1; 24)
Note: *p<0.1; **p<0.05; ***p<0.01

So the regression output without the interaction shows a negative relationship between PB and enjoyment. However, we know that in the observed data, everyone who had peanut butter AND jelly enjoyed their breakfast except for one person. The model could lead us to believe that having PB on its own will lead to lower ratings of breakfast enjoyment, which isn’t untrue if you look only at the people who ate just peanut butter. In short, this isn’t capturing how PB & jelly as a combination relate to enjoyment.

When we look at the regression with the interaction, the relationship between PB and enjoyment becomes even more negative, BUT the interaction rightfully captures that people who had PB & J tended to enjoy their meal more. So in this sense, it is illustrating that when people have PB on its own, they tend to enjoy their meal less, but when they have PB & J together, they are more likely to enjoy their meal.

You can look at jelly the same way: without the interaction, having jelly has a positive relationship with enjoyment, but with the interaction, the jelly coefficient turns negative, even though, for the most part, people who had only jelly enjoyed their meal.

We can further demystify this by looking at the predicted rating for a specific case, in this instance the first row of the data we created, which is someone who had peanut butter and jelly.

predicted.ind <- lm.no_inter$coefficients[1] + # this is the constant/intercept
                 lm.no_inter$coefficients[2]*food$Jelly[1] + # this is beta of Jelly * Jelly (observed)
                 lm.no_inter$coefficients[3]*food$PB[1] # this is beta of PB * PB (observed)

predicted.ind_int <- lm.all_inter$coefficients[1] + # this is the constant/intercept
                     lm.all_inter$coefficients[2]*food$Jelly[1] + # this is beta of Jelly * Jelly (observed)
                     lm.all_inter$coefficients[3]*food$PB[1] + # this is beta of PB * PB (observed)
                     lm.all_inter$coefficients[4]*food$Jelly[1]*food$PB[1] # this is beta of interaction * Jelly (observed) AND * PB (observed)

sprintf("The non-interaction model shows a predicted rating of %g for PB&J and the model with the interaction shows a predicted rating of %g for PB&J\n", 
        unname(predicted.ind), unname(predicted.ind_int)) %>% 
    cat()
## The non-interaction model shows a predicted rating of 6.83523 for PB&J and the model with the interaction shows a predicted rating of 7.33333 for PB&J

So what if they only had jelly? What is the predicted enjoyment rating?

It’s easier to just create a new data frame with the 4 possible combinations and get the predictions using predict():

new_food <- data.frame(PB    = c(1, 1, 0, 0),
                       Jelly = c(1, 0, 1, 0))

predict(lm.no_inter, new_food, type = "response")
##        1        2        3        4 
## 6.835227 4.595455 7.995455 5.755682
# Highest predicted rating goes to Jelly only here
predict(lm.all_inter, new_food, type = "response")
##        1        2        3        4 
## 7.333333 3.400000 6.800000 7.250000
# Highest predicted rating goes to PB & J here

This shows that, without the interaction, people who had only jelly are predicted to have a higher enjoyment rating on average than those who had both peanut butter and jelly.

The interaction model, however, adjusts for the relationship between the two. The now-negative coefficient for jelly brings the jelly-only predicted rating down to 6.8, which is lower than the interaction model’s prediction for PB & J (7.33).
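
We can read both of those numbers straight off the coefficients (a quick check re-using the models fit above):

# Jelly-only prediction from the no-interaction model: intercept + beta of Jelly
unname(lm.no_inter$coefficients[1] + lm.no_inter$coefficients[2]) # 5.756 + 2.240, about 7.995

# Jelly-only prediction from the interaction model: intercept + beta of Jelly
# (the Jelly:PB term drops out because PB = 0)
unname(lm.all_inter$coefficients[1] + lm.all_inter$coefficients[2]) # 7.250 + -0.450 = 6.8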

This is a saturated model, so you may have noticed that the predicted ratings from the interaction model match the mean for each of the four possible combinations. We can see the observed means for the four groupings like so:

enjoy_mean <- food %>% 
    group_by(across(c(PB, Jelly))) %>% 
    summarize(Mean = mean(Enjoy))
## `summarise()` has grouped output by 'PB'. You can override using the `.groups`
## argument.
enjoy_mean
## # A tibble: 4 × 3
## # Groups:   PB [2]
##      PB Jelly  Mean
##   <dbl> <dbl> <dbl>
## 1     0     0  7.25
## 2     0     1  6.8 
## 3     1     0  3.4 
## 4     1     1  7.33
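
As a quick check, we can feed the same four combinations back into the interaction model with predict() and line the fitted values up against these group means (re-using new_food from above; pred_interaction is just a display name):

# The saturated interaction model's predictions reproduce the observed group means
data.frame(new_food, pred_interaction = predict(lm.all_inter, newdata = new_food))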

Manipulating your observations before regressing:

Say we wanted to take all of the numeric values and sqrt them before running the regression. There are a number of reasons why you might want to do this, which you can read about here.

A long way:

sqrt_Auto <- Auto %>%
  mutate(sqrt_mpg = sqrt(mpg)) %>%
  mutate(sqrt_cylinders = sqrt(cylinders)) %>%
  mutate(sqrt_displacement = sqrt(displacement)) %>%
  mutate(sqrt_horsepower = sqrt(horsepower)) %>%
  mutate(sqrt_weight = sqrt(weight)) %>%
  mutate(sqrt_acceleration = sqrt(acceleration))

head(sqrt_Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name     pred    pred2 sqrt_mpg sqrt_cylinders
## 1 chevrolet chevelle malibu 19.42024 18.28101 4.242641       2.828427
## 2         buick skylark 320 17.97489 17.08141 3.872983       2.828427
## 3        plymouth satellite 19.94026 18.71261 4.242641       2.828427
## 4             amc rebel sst 19.96320 18.73166 4.000000       2.828427
## 5               ford torino 19.84084 18.63010 4.123106       2.828427
## 6          ford galaxie 500 13.01941 12.96848 3.872983       2.828427
##   sqrt_displacement sqrt_horsepower sqrt_weight sqrt_acceleration
## 1          17.52142        11.40175    59.19459          3.464102
## 2          18.70829        12.84523    60.77006          3.391165
## 3          17.83255        12.24745    58.61740          3.316625
## 4          17.43560        12.24745    58.59181          3.464102
## 5          17.37815        11.83216    58.72819          3.240370
## 6          20.71232        14.07125    65.88627          3.162278

Or we can do it with dplyr’s across() function:

s_Auto <- Auto %>% 
  mutate(across(c(mpg, cylinders, displacement, horsepower, weight, acceleration), sqrt, .names = "sqrt_{.col}"))

head(s_Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name     pred    pred2 sqrt_mpg sqrt_cylinders
## 1 chevrolet chevelle malibu 19.42024 18.28101 4.242641       2.828427
## 2         buick skylark 320 17.97489 17.08141 3.872983       2.828427
## 3        plymouth satellite 19.94026 18.71261 4.242641       2.828427
## 4             amc rebel sst 19.96320 18.73166 4.000000       2.828427
## 5               ford torino 19.84084 18.63010 4.123106       2.828427
## 6          ford galaxie 500 13.01941 12.96848 3.872983       2.828427
##   sqrt_displacement sqrt_horsepower sqrt_weight sqrt_acceleration
## 1          17.52142        11.40175    59.19459          3.464102
## 2          18.70829        12.84523    60.77006          3.391165
## 3          17.83255        12.24745    58.61740          3.316625
## 4          17.43560        12.24745    58.59181          3.464102
## 5          17.37815        11.83216    58.72819          3.240370
## 6          20.71232        14.07125    65.88627          3.162278

Note: Remove the ‘.names’ argument if you want to overwrite the original columns instead of adding new ones.
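
For example, a quick sketch of the overwriting version (sqrt_in_place is just an illustrative name):

sqrt_in_place <- Auto %>% 
  mutate(across(c(mpg, cylinders, displacement, horsepower, weight, acceleration), sqrt))

head(sqrt_in_place$mpg) # mpg now holds sqrt(mpg) rather than the original values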