### Non-linear models

This question relates to the `College` dataset from the `ISLR` package. Start by loading that package, as well as the `gam` package, which will allow us to estimate generalised additive models, the `splines` package (which, unsurprisingly, lets us estimate splines), and the `randomForest` package (guess what that is for?).

```{r, warning = FALSE, message = FALSE}
library(ISLR)
library(gam)
library(splines)
library(randomForest)
```

The `College` data contains several variables for 777 US colleges in the 1990s. Look at the help file for this dataset (`?College`) for a description of the variables that are included. In this seminar, we will be experimenting with different approaches to estimating non-linear relationships between the `Expend` variable -- which measures the expenditure per student of each college (in dollars) -- and several other variables in the data.

#### a.

Create a series of scatter plots which show the association between the `Expend` variable and the following four predictors: `Outstate`, `PhD`, `Room.Board`, and `S.F.Ratio`. For which of these variables do you think there is evidence of a non-linear relationship?

#### b.

Estimate four regression models, all with `Expend` as the outcome variable, and each including one of the predictors you plotted above. Include a second-degree polynomial transformation of X in each of the models (you can do this by using `poly(x_variable, 2)` in the `lm()` model formula). Interpret the significance of the squared term in each of your models. Can you reject the null hypothesis of linearity?

#### c.

Using the regression you estimated in part b to describe the association between `Expend` and `PhD`, calculate fitted values across the range of the `PhD` variable. Recreate the plot between these variables that you constructed in part a, and add a line representing these fitted values to the plot (using the `lines()` function) to illustrate the estimated relationship. Interpret the graph. (You will need to use the `predict()` function with the `newdata` argument in order to complete this question. You can also find the range of the `PhD` variable using the `range()` function.) I have given you some starter code below.

```{r, echo = TRUE, eval = FALSE}
fitted_vals_quadratic <- predict(phd_mod_quadratic,
                                 newdata = data.frame(PhD = SOME_VALUES_GO_HERE))

plot(SOMETHING_GOES_HERE, SOMETHING_ELSE_GOES_HERE,
     xlab = "Percentage of faculty with a PhD",
     ylab = "Predicted Expenditure per Student")

lines(SOME_VALUES_GO_HERE, SOME_OTHER_VALUES_GO_HERE, col = "red")
```

#### d.

Re-estimate the `Expend ~ PhD` model, this time including a cubic polynomial (i.e. of degree 3). Can you reject the null hypothesis for the cubic term? Add another line (in a different colour) with the fitted values from this model to the plot you created in part c.

#### e.

Estimate a new model for the relationship between `Expend` and `PhD`, this time using a cubic spline instead of a polynomial. You can implement the cubic spline by using the `bs()` function, which is specified thus: `lm(outcome ~ bs(x_variable, df = ?, degree = 3))`. Select a value for the `df` argument that you think is reasonable. Estimate the model and then, again, plot the fitted values across the range of the `PhD` variable.

#### f.

Guess what? Now it's time to do the same thing again, but this time using a `loess()` model. The key parameter here is the `span`. High values for `span` will result in a less flexible model, and low values for `span` will result in a more flexible model. Pick a value that you feel is appropriately wiggly. Again, add it to your (now very colourful) plot.
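If you are not sure where to begin with the loess fit, here is a minimal sketch. Everything in it is an illustration rather than a recommendation: the `span` of 0.5, the 100-point prediction grid, and the line colour are arbitrary choices, and the sketch assumes the scatter plot from part c is already drawn.

```{r, echo = TRUE, eval = FALSE}
# Minimal sketch for the loess fit. The span of 0.5, the 100-point grid, and
# the line colour are arbitrary choices -- experiment with your own values.
phd_mod_loess <- loess(Expend ~ PhD, data = College, span = 0.5)

# Build a grid of PhD values spanning the observed range
phd_grid <- seq(min(College$PhD), max(College$PhD), length.out = 100)

# Fitted values from the loess model across the grid
fitted_vals_loess <- predict(phd_mod_loess, newdata = data.frame(PhD = phd_grid))

# Add the fitted line to the scatter plot you created in part c
lines(phd_grid, fitted_vals_loess, col = "purple")
```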
#### g.

Examine the nice plot you have constructed. Which of the lines of fitted values best characterises the relationship between `PhD` and `Expend` in the data? Can you tell?

#### h.

Fit a generalised additive model (GAM) to the `College` data using the `gam()` function. This function works much like `lm()`, but it allows you to include flexible transformations of the covariates in the model. In this example, estimate a GAM with `Expend` as the outcome, and use all four of the predictors that you plotted in part a, as well as the `Private` variable. For each of the continuous predictors, use a cubic spline (`bs()`) with 4 degrees of freedom. Once you have estimated your model, plot the results by passing the estimated model object to the `plot()` function. Interpret the results.

### Tree-based methods

Apply a) bagging and b) random forests to the `Weekly` data set. This dataset includes weekly percentage returns for the S&P 500 stock index between 1990 and 2010. Your goal is to predict the `Direction` variable, which has two levels (Down and Up). For this task, you should fit the models on a randomly selected subset of your data -- the training set -- and evaluate their performance on the rest of the data -- the test set. I have given you some code below to help you construct these datasets. How accurate are the results compared to simpler methods like logistic regression? Which of these approaches yields the best performance?

```{r, echo = TRUE, eval = TRUE}
set.seed(3) # Set a value for the random number generator to make the results comparable across runs

train <- sample(nrow(Weekly), 2/3 * nrow(Weekly)) # Randomly select two-thirds of the data

Weekly_train <- Weekly[train,] # Subset to the training observations
Weekly_test <- Weekly[-train,] # Subset to the test observations
```
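If you get stuck, the sketch below shows one way of setting up the comparison. It is only an illustration: it assumes you use the five lag variables and `Volume` as predictors (you might reasonably prefer a different specification), and it uses `mtry = 6` (all predictors) for bagging and `mtry = 2` for the random forest.

```{r, echo = TRUE, eval = FALSE}
# Bagging is a random forest in which every predictor is considered at each
# split, so set mtry equal to the number of predictors (6 in this sketch)
bag_fit <- randomForest(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                        data = Weekly_train, mtry = 6, ntree = 500)

# For the random forest, only a subset of predictors is considered at each split
rf_fit <- randomForest(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                       data = Weekly_train, mtry = 2, ntree = 500)

# Logistic regression as a simpler benchmark
logit_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly_train, family = binomial)

# Test-set accuracy for each approach
bag_acc <- mean(predict(bag_fit, newdata = Weekly_test) == Weekly_test$Direction)
rf_acc  <- mean(predict(rf_fit,  newdata = Weekly_test) == Weekly_test$Direction)

logit_probs <- predict(logit_fit, newdata = Weekly_test, type = "response")
logit_pred  <- ifelse(logit_probs > 0.5, "Up", "Down")
logit_acc   <- mean(logit_pred == Weekly_test$Direction)

c(bagging = bag_acc, random_forest = rf_acc, logistic = logit_acc)
```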