#### Summer School 2019 midsession examination
# ME314 Introduction to Data Science and Machine Learning
## Suitable for all candidates
### Instructions to candidates
* Complete the assignment by adding your answers directly to the RMarkdown document, knitting the document, and submitting the HTML file to Moodle.
* Deadline: 19:00 on Wednesday, 7th August 2019.
* Submit the assignment via [Moodle](https://shortcourses.lse.ac.uk/course/view.php?id=158).
## Question 1 (40 points)
This question should be answered using the `Carseats` data set, which is part of the **ISLR** package. It is a simulated data set containing sales of child car seats at 400 different stores.
```{r}
data("Carseats", package = "ISLR")
```
1. Fit a regression model predicting `Sales` using `Advertising` and `Price` as predictors. Interpret the coefficients, the $R^2$, and the residual standard error from the regression (by explaining each in a few statements). (15 points)
**COEFFICIENTS:**
*Advertising*
* *Holding all other covariates constant*, when `Advertising` increases by one unit, `Sales` increases by 0.123 units. In terms of the units of the variables, this means that for each $1,000 spent on advertising, sales increase on average by about 123 car seats (0.123 thousand units). The coefficient is significant at less than 0.001.
*Price*
* *Holding all other covariates constant*, when `Price` increases by one unit, `Sales` decreases by 0.0546 units. In terms of the units of the variables, this means that for every dollar increase in price, sales decrease on average by about 54.6 units. The coefficient is significant at less than 0.001.
**RESIDUAL STANDARD ERROR:**
* The RSE is the average amount the response will deviate from the true regression line.
* It is a measure of the model's lack of fit to the data, so smaller is preferable.
* In this case the RSE is 2.399; given that `Sales` is measured in thousands of units, actual sales in each market deviate from the true regression line by approximately 2,399 units on average.
**R^2^**
* The proportion of variance in the response explained by the model. In this case it is not particularly high, so the model leaves a substantial amount of the variation in sales unexplained.
***
```{r}
summary(lmod1 <- lm(Sales ~ Advertising + Price, data = Carseats))
```
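As a sanity check on the interpretations above, the RSE and R^2^ can be recomputed by hand from the fitted object (a sketch, assuming `lmod1` from the chunk above):

```{r}
rss <- sum(residuals(lmod1)^2)          # residual sum of squares
sqrt(rss / df.residual(lmod1))          # RSE: matches summary(lmod1)$sigma
tss <- sum((Carseats$Sales - mean(Carseats$Sales))^2)
1 - rss / tss                           # R^2: matches summary(lmod1)$r.squared
```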
2. Fit a second model by adding `Urban` in interaction with `Advertising`. Interpret the two new coefficients produced by adding this interaction to the `Advertising` variable that was already present from the first question, in a few statements. (15 points)
**COEFFICIENTS:**
* `Advertising` and `Price` can be interpreted the same way as in 1.1. `UrbanYes` is binary, and *if* it were statistically significant (which it is not), you might say that, *holding all other covariates constant*, a store being in an urban location is associated with an average increase of 4.6 units in sales.
* Again, if the interaction were statistically significant (which it is not), you might say that being in an urban location reduces the effect of advertising on sales by about 6.6 units of sales per unit of advertising.
```{r}
summary(lmod2 <- lm(Sales ~ Advertising*Urban + Price, data = Carseats))
```
3. Which of these two models is preferable, and why? (10 points)
**The basic answer is that the two models are more or less equivalent: adding `Urban` did not improve prediction and did not affect the estimated coefficients on `Advertising` or `Price`, so omitting the variable changes essentially nothing. On the principle of parsimony, the first, simpler model is preferable. The R^2^ is identical, while the adjusted R^2^ and the F-statistic are slightly better in model 1. Some of you also used AIC, which likewise indicates model 1 as the better fit.**
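The comparison can be formalised with AIC and a nested-model F-test (a sketch, assuming `lmod1` and `lmod2` from the chunks above):

```{r}
AIC(lmod1, lmod2)    # lower AIC indicates the better penalised fit
anova(lmod1, lmod2)  # F-test: do the Urban terms jointly improve the fit?
```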
## Question 2 (60 points)
You will need to load the core library for the course textbook and any other libraries you find suitable to answer the question:
```{r}
data("Weekly", package = "ISLR")
suppressPackageStartupMessages(library("MASS"))
library("class")
```
This question should be answered using the `Weekly` data set, which is part of the **ISLR** package. This data contains 1,089 weekly stock returns for 21 years, from the beginning of 1990 to the end of 2010.
1. Perform exploratory data analysis of the `Weekly` data (produce some numerical and graphical summaries). Discuss any patterns that emerge. (20 points)
**As long as there is an effort to compute some descriptive statistics, to visualise the data, and to engage with the results to extract some insight (such as identifying patterns), credit should be given. It's a simple question: summary statistics with base R and a simple pairwise scatterplot with base R make a good answer.**
```{r}
pairs(Weekly)
```
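A numerical complement to the pairs plot is a correlation matrix of the quantitative variables (a sketch; `Direction`, the factor in column 9, is dropped). The only substantial pairwise correlation is between `Year` and `Volume`: trading volume grows over time, which a simple line plot also shows:

```{r}
round(cor(Weekly[, -9]), 2)          # Direction (column 9) is a factor
plot(Weekly$Volume, type = "l",      # volume trends upward across the sample
     xlab = "Week", ylab = "Volume")
```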
**Below is an example using a user-written package we covered in the lecture. Bottom line: pretty much any effort to get close to the data (EDA) works here as an answer.**
```{r}
# Using the summarytools package, as discussed in the lecture
suppressPackageStartupMessages(library("summarytools"))
```
```{r}
freq(Weekly)
descr(Weekly)
```
```{r}
dfSummary(Weekly, plain.ascii = TRUE, style = "grid",
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")
```
2. Fit a logistic regression with `Direction` as the response and different combinations of lag variables plus `Volume` as predictors. Use the period from 1990 to 2008 as your training set and 2009-2010 as your test set. Produce a summary of results. (20 points)
Do any of the predictors appear to be statistically significant in your training set? If so, which ones?
```{r}
train <- (Weekly$Year < 2009)
test <- Weekly[!train, ]
glm_train <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly, family = binomial, subset = train)
summary(glm_train)
```
**The approach to splitting training and test sets in ISLR (and our labs) can be followed; something with the rsample package would also work. As long as the split is consistent with the time sequence (1990-2008 training vs 2009-2010 test), it's fine. For the model itself, lag variables (in whatever combination) and `Volume` should be included. The results should be discussed from the model fit on the training data: name any covariates that are statistically significant (here it is only `Lag1`). Problems arise if the answer discusses a model fit on the full dataset or on the test set rather than on the training set.**
3. From your test set, compute the confusion matrix, and calculate accuracy, precision, recall and F1. (20 points)
Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression, and what can you learn from additional measures of fit like accuracy, precision, recall, and F1.
```{r}
glm_probs <- predict(glm_train, test, type = "response")
glm_pred <- rep("Down", length(glm_probs))
glm_pred[glm_probs > .5] <- "Up"
Direction_test <- Weekly$Direction[!train]
table(glm_pred, Direction_test)
```
**The confusion matrix above can be used to calculate all the required results by hand. We explicitly require accuracy, precision, recall, and F1 to be reported from the fit on the test set. The caret package, which we covered in class, can be used to get these results, either for estimation or just to report the required statistics; either way is fine. Given the amount of time we spent on caret and `confusionMatrix()` in the lecture, a good student would also discuss the No Information Rate and its associated p-value, and kappa, and might go deeper and explain why precision, recall, and F1 are complementary to accuracy under class imbalance (but that's probably only if they paid attention in the lecture, which is unlikely).**
```{r}
xtab <- table(glm_pred, Direction_test)
caret::confusionMatrix(xtab, mode = "prec_recall")
```
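The same statistics can also be computed by hand directly from the table (a sketch, assuming `glm_pred` and `Direction_test` from the chunks above; note that `"Up"` is treated here as the positive class, whereas `caret::confusionMatrix()` defaults to the first factor level, `"Down"`):

```{r}
cm <- table(glm_pred, Direction_test)    # rows = predicted, columns = actual
tp <- cm["Up", "Up"];   fp <- cm["Up", "Down"]
fn <- cm["Down", "Up"]; tn <- cm["Down", "Down"]
accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)              # of predicted Ups, share actually Up
recall    <- tp / (tp + fn)              # of actual Ups, share we predicted
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)
```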
4. (Extra credit) Experiment with alternative classification methods. (additional 10 points max)
Present the results of your experiments reporting method, associated confusion matrix, and measures of fit on the test set like accuracy, precision, recall, and F1.
**Given the topics we covered in the lectures, I suspect they'll use a random forest or some other tree-based model here. Some may decide to use LDA/QDA/KNN or something else covered in ISLR. It doesn't really matter: the point is for them to explore a classification method different from logistic regression.**
```{r}
set.seed(123)
rf_weekly <- randomForest::randomForest(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Weekly, subset = train, mtry = 2)
yhat_bag <- predict(rf_weekly, newdata = test)
caret::confusionMatrix(table(yhat_bag, Direction_test), mode = "prec_recall")
```