Summer School 2019 midsession examination

ME314 Introduction to Data Science and Machine Learning

Suitable for all candidates

Instructions to candidates

  • Complete the assignment by adding your answers directly to the RMarkdown document, knitting the document, and submitting the HTML file to Moodle.
  • Deadline: 19:00 on Wednesday, 7th August 2019.
  • Submit the assignment via Moodle.

Question 1 (40 points)

This question should be answered using the Carseats data set, which is part of the ISLR package. This is a simulated data set containing sales of child car seats at 400 different stores.

data("Carseats", package = "ISLR")
  1. Fit a regression model predicting Sales using Advertising and Price as predictors. Interpret the coefficients, the \(R^2\), and the Residual standard error from the regression (by explaining each in a few statements). (15 points)

COEFFICIENTS:

Advertising
* Holding all other covariates constant, when Advertising increases by one unit, Sales increases by 0.123 units. In terms of the units of the variables, this means that for each additional $1,000 spent on advertising, sales increase on average by about 123 units. The coefficient is significant at less than 0.001.

Price
* Holding all other covariates constant, when Price increases by one unit, Sales decreases by 0.0546 units. In terms of the units of the variables, this means that for every dollar increase in price, sales decrease on average by about 54.6 units. The coefficient is significant at less than 0.001.

RESIDUAL STANDARD ERROR: The average amount the response deviates from the true regression line. It measures the lack of fit of the model to the data, so smaller values are preferable. In this case the RSE is 2.399; given that Sales is measured in thousands of units, actual sales in each market deviate from the true regression line by approximately 2,399 units on average.
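For reference, the reported RSE is the square root of the residual sum of squares divided by the residual degrees of freedom, here \(n - p - 1 = 400 - 2 - 1 = 397\):

\[
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}
\]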

\(R^2\): The proportion of variance in the response explained by the model. In this case it is not particularly high, and therefore the model leaves a substantial amount of the variation unexplained.
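Concretely,

\[
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 0.2819,
\]

so the model explains about 28% of the variance in Sales.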

summary(lmod1 <- lm(Sales ~ Advertising + Price, data = Carseats))
## 
## Call:
## lm(formula = Sales ~ Advertising + Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9011 -1.5470 -0.0223  1.5361  6.3748 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.003427   0.606850  21.428  < 2e-16 ***
## Advertising  0.123107   0.018079   6.809 3.64e-11 ***
## Price       -0.054613   0.005078 -10.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.399 on 397 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2782 
## F-statistic: 77.91 on 2 and 397 DF,  p-value: < 2.2e-16
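As a quick sanity check on the coefficient interpretation, we can predict Sales at two Advertising levels one unit apart, holding Price fixed (the specific Advertising and Price values below are arbitrary):

# Predicted Sales rises by the Advertising coefficient (~0.123 thousand units,
# i.e. ~123 seats) when Advertising rises by one unit ($1,000) at a fixed Price
nd <- data.frame(Advertising = c(10, 11), Price = 115)
diff(predict(lmod1, newdata = nd))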
  2. Fit a second model by adding Urban as an interactive variable with Advertising. Interpret the two new coefficients produced by adding this interaction to the Advertising variable that was already present in the first question, in a few statements. (15 points)

COEFFICIENTS:
  • You can interpret Advertising and Price the same way as in 1.1. UrbanYes is binary, and IF it were statistically significant (which it is not), you might say that, holding all other covariates constant, a store being in an urban location is associated with an average increase of about 4.6 units in sales. Again, if the interaction were statistically significant (which it is not), you might say that being in an urban location weakens the effect of advertising on sales: each additional $1,000 of advertising is associated with about 6.7 fewer units of sales in urban stores than in non-urban stores.
summary(lmod2 <- lm(Sales ~ Advertising*Urban + Price, data = Carseats))
## 
## Call:
## lm(formula = Sales ~ Advertising * Urban + Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8696 -1.5640 -0.0284  1.5323  6.3818 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          12.988703   0.666534  19.487  < 2e-16 ***
## Advertising           0.128015   0.034404   3.721 0.000227 ***
## UrbanYes              0.004602   0.369040   0.012 0.990057    
## Price                -0.054519   0.005110 -10.670  < 2e-16 ***
## Advertising:UrbanYes -0.006666   0.040564  -0.164 0.869559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.405 on 395 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2747 
## F-statistic: 38.77 on 4 and 395 DF,  p-value: < 2.2e-16
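As a sketch of how the interaction combines with the main effect, the implied Advertising slope for urban stores is the main effect plus the interaction term:

# Advertising slope in urban stores: 0.128015 - 0.006666 ~ 0.121
coef(lmod2)["Advertising"] + coef(lmod2)["Advertising:UrbanYes"]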
  3. Which of these two models is preferable, and why? (10 points)

The basic answer is that the two models are more or less the same: adding Urban did not improve prediction and barely changed the estimated coefficients on Advertising and Price. Based on the principle of parsimony, the first and simpler model is preferable. The \(R^2\) is identical, the adjusted \(R^2\) is slightly better in model 1, and the F-statistic is also higher in model 1. Some of you also used AIC to show that model 1 is the better fit.
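A quick way to make the AIC comparison explicit (AIC() is in base R; lower is better):

AIC(lmod1, lmod2)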

Question 2 (60 points)

You will need to load the core library for the course textbook and any other libraries you find suitable to answer the question:

data("Weekly", package = "ISLR")
suppressPackageStartupMessages(library("MASS"))
library("class")

This question should be answered using the Weekly data set, which is part of the ISLR package. This data contains 1,089 weekly stock returns for 21 years, from the beginning of 1990 to the end of 2010.

  1. Perform exploratory data analysis of the Weekly data (produce some numerical and graphical summaries). Discuss any patterns that emerge. (20 points)

As long as there is an effort to produce some descriptive statistics, an effort to visualise the data, and engagement with the results to draw out some insights (like identifying patterns), that should be given credit. It is a simple question: simple summary statistics with base R and a simple pairwise scatterplot with base R make a good answer.

pairs(Weekly)
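One pattern that typically stands out in the pairs plot is the relationship between Year and Volume; plotting Volume against Year makes it explicit:

# Volume grows steadily over the sample period, while the weekly returns
# (Lag1-Lag5, Today) show no comparable trend
plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")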

The summarytools output below is an example from the user-written packages we covered in the lecture. Bottom line: pretty much any effort to get close to the data (EDA) works here as an answer.

# Using the summarytools package, as discussed in the lecture
library(summarytools)
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
freq(Weekly)
## Variable(s) ignored: Lag1, Lag2, Lag3, Lag4, Lag5, Volume, Today
## Frequencies  
## Weekly$Year  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##        1990     47      4.32           4.32      4.32           4.32
##        1991     52      4.78           9.09      4.78           9.09
##        1992     52      4.78          13.87      4.78          13.87
##        1993     52      4.78          18.64      4.78          18.64
##        1994     52      4.78          23.42      4.78          23.42
##        1995     52      4.78          28.19      4.78          28.19
##        1996     53      4.87          33.06      4.87          33.06
##        1997     52      4.78          37.83      4.78          37.83
##        1998     52      4.78          42.61      4.78          42.61
##        1999     52      4.78          47.38      4.78          47.38
##        2000     52      4.78          52.16      4.78          52.16
##        2001     52      4.78          56.93      4.78          56.93
##        2002     52      4.78          61.71      4.78          61.71
##        2003     52      4.78          66.48      4.78          66.48
##        2004     52      4.78          71.26      4.78          71.26
##        2005     52      4.78          76.03      4.78          76.03
##        2006     52      4.78          80.81      4.78          80.81
##        2007     53      4.87          85.67      4.87          85.67
##        2008     52      4.78          90.45      4.78          90.45
##        2009     52      4.78          95.22      4.78          95.22
##        2010     52      4.78         100.00      4.78         100.00
##        <NA>      0                               0.00         100.00
##       Total   1089    100.00         100.00    100.00         100.00
## 
## Weekly$Direction  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##        Down    484     44.44          44.44     44.44          44.44
##          Up    605     55.56         100.00     55.56         100.00
##        <NA>      0                               0.00         100.00
##       Total   1089    100.00         100.00    100.00         100.00
descr(Weekly)
## Non-numerical variable(s) ignored: Direction
## Descriptive Statistics  
## Weekly  
## N: 1089  
## 
##                        Lag1      Lag2      Lag3      Lag4      Lag5     Today    Volume      Year
## ----------------- --------- --------- --------- --------- --------- --------- --------- ---------
##              Mean      0.15      0.15      0.15      0.15      0.14      0.15      1.57   2000.05
##           Std.Dev      2.36      2.36      2.36      2.36      2.36      2.36      1.69      6.03
##               Min    -18.20    -18.20    -18.20    -18.20    -18.20    -18.20      0.09   1990.00
##                Q1     -1.15     -1.15     -1.16     -1.16     -1.17     -1.15      0.33   1995.00
##            Median      0.24      0.24      0.24      0.24      0.23      0.24      1.00   2000.00
##                Q3      1.40      1.41      1.41      1.41      1.40      1.40      2.05   2005.00
##               Max     12.03     12.03     12.03     12.03     12.03     12.03      9.33   2010.00
##               MAD      1.87      1.87      1.87      1.87      1.88      1.87      1.04      7.41
##               IQR      2.56      2.56      2.57      2.57      2.57      2.56      1.72     10.00
##                CV     15.65     15.60     16.04     16.19     16.88     15.72      1.07      0.00
##          Skewness     -0.48     -0.48     -0.48     -0.48     -0.47     -0.48      1.62      0.00
##       SE.Skewness      0.07      0.07      0.07      0.07      0.07      0.07      0.07      0.07
##          Kurtosis      5.67      5.67      5.62      5.63      5.61      5.67      2.06     -1.21
##           N.Valid   1089.00   1089.00   1089.00   1089.00   1089.00   1089.00   1089.00   1089.00
##         Pct.Valid    100.00    100.00    100.00    100.00    100.00    100.00    100.00    100.00
dfSummary(Weekly, plain.ascii = TRUE, style = "grid", 
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")
## Data Frame Summary  
## Weekly  
## Dimensions: 1089 x 9  
## Duplicates: 0  
## 
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | No | Variable   | Stats / Values         | Freqs (% of Valid)   | Graph                      | Missing |
## +====+============+========================+======================+============================+=========+
## | 1  | Year       | Mean (sd) : 2000 (6)   | 21 distinct values   | :                          | 0       |
## |    | [numeric]  | min < med < max:       |                      | : . . . . . . . . .        | (0%)    |
## |    |            | 1990 < 2000 < 2010     |                      | : : : : : : : : : :        |         |
## |    |            | IQR (CV) : 10 (0)      |                      | : : : : : : : : : :        |         |
## |    |            |                        |                      | : : : : : : : : : :        |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 2  | Lag1       | Mean (sd) : 0.2 (2.4)  | 1004 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (15.7)  |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 3  | Lag2       | Mean (sd) : 0.2 (2.4)  | 1005 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (15.6)  |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 4  | Lag3       | Mean (sd) : 0.1 (2.4)  | 1005 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (16)    |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 5  | Lag4       | Mean (sd) : 0.1 (2.4)  | 1005 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (16.2)  |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 6  | Lag5       | Mean (sd) : 0.1 (2.4)  | 1005 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (16.9)  |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 7  | Volume     | Mean (sd) : 1.6 (1.7)  | 1089 distinct values | :                          | 0       |
## |    | [numeric]  | min < med < max:       |                      | :                          | (0%)    |
## |    |            | 0.1 < 1 < 9.3          |                      | : .                        |         |
## |    |            | IQR (CV) : 1.7 (1.1)   |                      | : :                        |         |
## |    |            |                        |                      | : : : . . . .              |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 8  | Today      | Mean (sd) : 0.1 (2.4)  | 1003 distinct values |             :              | 0       |
## |    | [numeric]  | min < med < max:       |                      |           : :              | (0%)    |
## |    |            | -18.2 < 0.2 < 12       |                      |           : :              |         |
## |    |            | IQR (CV) : 2.6 (15.7)  |                      |           : :              |         |
## |    |            |                        |                      |         . : : :            |         |
## +----+------------+------------------------+----------------------+----------------------------+---------+
## | 9  | Direction  | 1. Down                | 484 (44.4%)          | IIIIIIII                   | 0       |
## |    | [factor]   | 2. Up                  | 605 (55.6%)          | IIIIIIIIIII                | (0%)    |
## +----+------------+------------------------+----------------------+----------------------------+---------+
  2. Fit a logistic regression with Direction as the response and different combinations of lag variables plus Volume as predictors. Use the period from 1990 to 2008 as your training set and 2009-2010 as your test set. Produce a summary of results. (20 points)

    Do any of the predictors appear to be statistically significant in your training set? If so, which ones?

train <- (Weekly$Year < 2009)
test <- Weekly[!train, ]
glm_train <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly, family = binomial, subset = train)
summary(glm_train)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7186  -1.2498   0.9823   1.0841   1.4911  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.33258    0.09421   3.530 0.000415 ***
## Lag1        -0.06231    0.02935  -2.123 0.033762 *  
## Lag2         0.04468    0.02982   1.499 0.134002    
## Lag3        -0.01546    0.02948  -0.524 0.599933    
## Lag4        -0.03111    0.02924  -1.064 0.287241    
## Lag5        -0.03775    0.02924  -1.291 0.196774    
## Volume      -0.08972    0.05410  -1.658 0.097240 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1342.3  on 978  degrees of freedom
## AIC: 1356.3
## 
## Number of Fisher Scoring iterations: 4

The approach to splitting training and test sets in ISLR (and our labs) can be followed; you could also do something with the rsample package. As long as the split is consistent with the time sequence (1990-2008 training vs 2009-2010 test), it is fine. For the model itself, the Lag variables (in whatever combination) and Volume should be included. The results should be discussed from the model fit on the training data: state which covariates are statistically significant and name them (here it is only Lag1). A problem arises if the answer discusses a model fit on the full data set or on the test set rather than on the training set.
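One way to pull the statistically significant terms out of the training fit programmatically (this returns the intercept and Lag1):

# Coefficients significant at the 5% level in the training-set fit
coefs <- summary(glm_train)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]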

  3. From your test set, compute the confusion matrix, and calculate accuracy, precision, recall and F1. (20 points)

    Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression, and what can you learn from additional measures of fit like accuracy, precision, recall, and F1.

glm_probs <- predict(glm_train, test, type = "response")
glm_pred <-  rep("Down", length(glm_probs))
glm_pred[glm_probs > .5] <- "Up"
Direction_test <- Weekly$Direction[!train]
table(glm_pred, Direction_test)
##         Direction_test
## glm_pred Down Up
##     Down   31 44
##     Up     12 17

The confusion matrix above can be used to calculate all the required results by hand. We explicitly require reporting of accuracy, precision, recall, and F1, computed from the fit on the test set. The caret package, which we covered in class, can be used either for the estimation itself or just to report the required statistics; either way is fine. Given the amount of time we spent on caret and confusionMatrix in the lecture, a good student would also discuss the No Information Rate and its associated p-value, and kappa, and maybe even go deeper and explain why we should use precision, recall, and F1, and how they complement accuracy when classes are imbalanced (but that is probably only if they paid attention in the lecture, which is unlikely).
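For instance, a minimal hand calculation from the confusion matrix above (positive class is "Down", matching caret's output below):

# TP/FP/FN/TN read off the table: rows are predictions, columns are the truth
tp <- 31; fp <- 44; fn <- 12; tn <- 17
accuracy  <- (tp + tn) / (tp + fp + fn + tn)          # 0.4615
precision <- tp / (tp + fp)                           # 0.4133
recall    <- tp / (tp + fn)                           # 0.7209
f1 <- 2 * precision * recall / (precision + recall)   # 0.5254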

xtab <- table(glm_pred, Direction_test)
caret::confusionMatrix(xtab, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##         Direction_test
## glm_pred Down Up
##     Down   31 44
##     Up     12 17
##                                          
##                Accuracy : 0.4615         
##                  95% CI : (0.3633, 0.562)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.9962         
##                                          
##                   Kappa : -3e-04         
##                                          
##  Mcnemar's Test P-Value : 3.435e-05      
##                                          
##               Precision : 0.4133         
##                  Recall : 0.7209         
##                      F1 : 0.5254         
##              Prevalence : 0.4135         
##          Detection Rate : 0.2981         
##    Detection Prevalence : 0.7212         
##       Balanced Accuracy : 0.4998         
##                                          
##        'Positive' Class : Down           
## 
  4. (Extra credit) Experiment with alternative classification methods. (additional 10 points max)

    Present the results of your experiments reporting method, associated confusion matrix, and measures of fit on the test set like accuracy, precision, recall, and F1.

Given the topics we covered in the lectures, I suspect they will use a random forest or some other tree-based model here. Some may decide to use LDA/QDA/KNN or something else covered in ISLR. It does not really matter: the point is for them to explore a classification method different from logistic regression.

set.seed(123)

rf_weekly <- 
    randomForest::randomForest(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                               data=Weekly, 
                               subset=train, 
                               mtry=2)

yhat_bag <-  predict(rf_weekly, newdata = test)

caret::confusionMatrix(table(yhat_bag, Direction_test), mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##         Direction_test
## yhat_bag Down Up
##     Down   26 35
##     Up     17 26
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : 0.0291          
##                                           
##  Mcnemar's Test P-Value : 0.0184          
##                                           
##               Precision : 0.4262          
##                  Recall : 0.6047          
##                      F1 : 0.5000          
##              Prevalence : 0.4135          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5865          
##       Balanced Accuracy : 0.5154          
##                                           
##        'Positive' Class : Down            
##
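For reference, a minimal sketch of one of the other acceptable methods mentioned above, LDA, fit with the MASS package loaded earlier and evaluated on the same test set:

# Linear discriminant analysis on the same predictors and time-based split
lda_fit <- MASS::lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                     data = Weekly, subset = train)
lda_pred <- predict(lda_fit, newdata = test)$class
caret::confusionMatrix(table(lda_pred, Direction_test), mode = "prec_recall")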