### Exercise 8.1 -- Principal Component Analysis

In today's assignment, we are going to look at a set of variables describing the economic characteristics of English parliamentary constituencies around 2017-2019 (the exact years of the source data vary a bit). You can load the data file into R directly from the course website with the following command:

```{r,echo=TRUE,eval=TRUE}
econ_vars <- read.csv(url("https://raw.githubusercontent.com/lse-me314/lse-me314.github.io/master/data/const-econ-vars.csv"))
```

This data file has 8 variables:

- `ONSConstID` - Office for National Statistics Parliamentary Constituency ID
- `ConstituencyName` - Constituency Name
- `HouseWageRatio` - Ratio of House Prices to Wages
- `UnempConstRate` - Unemployment Rate
- `UnempConstRateChange` - Unemployment Rate Change since 2010
- `WageMedianConst` - Median Wage
- `social_mobility_score` - [Social Mobility Index](https://researchbriefings.parliament.uk/ResearchBriefing/Summary/CBP-8400)
- `deprivation_index_score` - [Social Deprivation Index](https://researchbriefings.parliament.uk/ResearchBriefing/Summary/CBP-7327)

Q1. Use the `cor()` and `pairs()` functions to assess the correlations between the six economic variables in the data set. Which two economic variables are most highly correlated with one another at the constituency level? Which variable is least correlated with the others at the constituency level?

```{r,echo=TRUE,fig.width=12,fig.height=12}
pairs(econ_vars[,3:8])
```

```{r,echo=TRUE}
round(cor(econ_vars[,3:8], use = "pairwise.complete.obs"), 2)
```

> The two most highly correlated variables are the unemployment rate and the deprivation index.

> The change in the unemployment rate since 2010 is barely correlated with any of the other variables. This is very likely because places that are poor or affluent tend to stay that way over time, so there is no reason to expect the *change* in unemployment to be associated with whether a place is currently relatively poor or affluent.

Q2. Use the command `pcafit <- prcomp(econ_vars[,3:8], scale. = TRUE)` to calculate the principal components of these six economic variables. Then examine the object `pcafit` directly and also through `summary(pcafit)`. Which variable has the smallest (magnitude) "loading" on the first principal component? How does this relate to your answer in Q1?

```{r,echo=TRUE}
pcafit <- prcomp(econ_vars[,3:8], scale. = TRUE)
pcafit
summary(pcafit)
```

> The variable with the smallest magnitude loading on PC1 is the change in the unemployment rate. This is what we would expect given that we previously found it had very low correlations with all of the other variables. A principal component that captures as much variation in the variables as possible cannot possibly capture much variation in a variable that is poorly correlated with all the others.

Q3. Extract the standard deviations of the principal components using the command `pcafit$sdev`. Use these standard deviations to calculate the proportion of variance explained by each principal component. (Recall that the variance is just the standard deviation squared!) How much of the variation in the underlying data is explained by the first principal component?

```{r}
pca_variances <- pcafit$sdev^2
pca_variances/sum(pca_variances)
```

> The first principal component explains approximately 52% of the variance in the underlying data.
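These proportions can also be cross-checked against the `importance` matrix stored in the object that `summary()` returns. A minimal sketch, assuming the `pcafit` object from Q2:

```{r}
# summary.prcomp stores the same quantities in its "importance" matrix
# (rows: standard deviation, proportion of variance, cumulative proportion)
summary(pcafit)$importance["Proportion of Variance", ]
```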
Q4. Construct screeplots using either the `type="barplot"` or the `type="lines"` options of the `screeplot()` command. Given this and the output of `summary(pcafit)` above, is it clear how many dimensions are needed to describe these data well?

```{r}
screeplot(pcafit, type = "barplot")
screeplot(pcafit, type = "lines")
```

> The first principal component "explains" about half the variance (0.52) and the second principal component explains about a quarter (0.23). This is a somewhat ambiguous case: each component explains about half as much variation as the previous one. You could certainly make a credible argument for either one or two principal components providing a good summary of the data. Depending on the application of interest, you might think that a single dimension explaining half the variation is a pretty good summary, or that two dimensions explaining three quarters of the variation is preferable.

Q5. Check the signs of the loadings on PC1 for each variable in the model. For each variable, write a sentence of the form "[The ratio of house prices to wages] is [positively/negatively] correlated with the first principal component". Do these all make sense collectively? You could also try writing a sentence of the form: "Places that are high on the first principal component are [high/low] in the house-to-wage ratio, [high/low] in unemployment, ..." What does this tell us about what the first principal component is measuring?

```{r}
# Print the loadings associated with the first principal component
pcafit$rotation[,1]
```

> - The ratio of house prices to wages is negatively correlated with the first principal component
> - The unemployment rate is positively correlated with the first principal component
> - Recent changes in the unemployment rate are negatively correlated with the first principal component
> - Median wage is negatively correlated with the first principal component
> - Social mobility is negatively correlated with the first principal component
> - Deprivation is positively correlated with the first principal component

> If you put these together, the general tendency is for attributes of less economically successful places to be positively correlated with the first principal component: lower house prices relative to wages, higher unemployment, lower wages, lower social mobility and higher deprivation. You could argue about whether this is capturing a concept that should be understood as "economic success" or "affluence" or something similar. The one slight mismatch is the unemployment change variable, which is weakly negatively associated with the first principal component. As we noted before, this variable perhaps does not belong in this analysis because it describes changes over time rather than levels.

Q6. Are you able to identify what PC2 is capturing?

```{r}
# Print the loadings associated with the second principal component
pcafit$rotation[,2]
```

> PC2 is positively associated with unemployment rate changes and negatively associated with everything else. It appears this component mainly captures unemployment rate changes, which we have already seen are mostly uncorrelated with the other variables. It does capture a bit of additional variation from the other variables, but not in a way that I was able to make much sense of!
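To make the interpretation of the loadings concrete, note that each constituency's score on a component is just the loading-weighted sum of its standardised variable values. A minimal sketch verifying this for PC1 (assuming complete cases in columns 3-8, which `prcomp()` requires anyway):

```{r}
# Scores = standardised data %*% loadings: reconstruct PC1 by hand
# and compare with the scores prcomp() stored in pcafit$x
manual_pc1 <- scale(econ_vars[,3:8]) %*% pcafit$rotation[,1]
all.equal(as.numeric(manual_pc1), as.numeric(pcafit$x[,1]))
```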
Q7. Re-do the principal components analysis without the variable that has the smallest magnitude loading on the first principal component. Extract the first principal component from the original analysis with all six variables (using `pcafit$x[,1]`) and also from this new analysis with five variables. Plot them against one another and check their correlation. Explain why you find what you find.

```{r,echo=TRUE}
pcafit2 <- prcomp(econ_vars[,c(3,4,6,7,8)], scale. = TRUE)
pcafit2
summary(pcafit2)
```

```{r}
plot(pcafit$x[,1], pcafit2$x[,1])
cor(pcafit$x[,1], pcafit2$x[,1])
```

> The loadings on the five remaining variables are nearly identical to their previous values, and the first principal component values for the constituencies are also very nearly identical, correlated at over 0.99. Very little has changed because the variable we omitted was very weakly correlated with the others. Put differently, changes in unemployment are not closely related to the affluence or economic success of constituencies, so they are a poor indicator of that more general concept. The other five indicators tend to go together much more strongly.

Q8. Re-do the PCA from Q2 again, but this time set `scale. = FALSE`. Compare the loadings on the first principal component and comment on any differences you see. Why do these differences occur?

```{r,echo=TRUE}
pcafit3 <- prcomp(econ_vars[,3:8], scale. = FALSE)
round(pcafit$rotation, 2)
round(pcafit3$rotation, 2)
```

> When we set `scale. = FALSE` the variables enter the PCA without first being scaled to have variance one (by default, `prcomp()` still centres them to have mean zero). As a result, the first principal component in this analysis is dominated by the `WageMedianConst` variable, as this has the largest unscaled sample variance in our data. In fact, in this analysis `WageMedianConst` and `social_mobility_score` together dominate the first two principal components, but this is entirely driven by the fact that those two variables have large sample variances *in the units in which they are measured*. Of course, we do not want the units of measurement to determine the PCA output, and so this question is here to serve as a reminder that we should always set `scale. = TRUE` when running PCA on variables measured in different units.

### Exercise 8.2 -- K-means Clustering

Consider the `USArrests` data from the "datasets" package, which you can load with the following command:

```{r}
data("USArrests", package = "datasets") # "package" argument optional here
```

The data includes statistics on the number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also gives the percentage of the population living in urban areas. Explore the data using the `summary()` and `pairs()` commands to make sure you are familiar with it. You will also be using some functions that make use of random number generation in this question, so let's set the seed again:

```{r}
set.seed(2)
```

(a) Use the `kmeans()` function to perform k-means clustering on the arrests data. This function takes three main arguments:

1. `x` = the data on which we would like to perform k-means clustering
2. `centers` = a user-specified parameter for the number of clusters (i.e. the value of `K`)
3. `nstart` = a parameter that determines the number of random starts to use. For now, just set this to 20. (This will fit the algorithm 20 times from different starting values and select the best-fitting result.)

Use the arguments above to perform k-means with `K = 3`. Which states are assigned to which clusters? Can you say anything substantive about the clusters given the information in the output?

```{r}
km.out <- kmeans(USArrests, 3, nstart = 20)
km.out
```

> The states assigned to each cluster are given in the output. The table of cluster means suggests that the second cluster captures the least crime-affected states and the third cluster the most crime-affected states, while the first cluster captures states at intermediate levels.
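The `nstart` argument matters because each run of k-means only reaches a local optimum of its objective, the total within-cluster sum of squares. A minimal sketch comparing the best of 20 starts against a single start, using a fresh seed chosen just for this illustration (with only 50 states, a single start may well already find the best solution):

```{r}
# k-means converges to a local optimum; nstart = 20 keeps the run
# with the lowest total within-cluster sum of squares
set.seed(4)
km.single <- kmeans(USArrests, 3, nstart = 1)
km.multi <- kmeans(USArrests, 3, nstart = 20)
c(nstart_1 = km.single$tot.withinss, nstart_20 = km.multi$tot.withinss)
```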
(b) Use the code below to create a map of the clusters of US states. Note that you need to assign your cluster assignments to `cluster_assignments` with something like `cluster_assignments <- your_kmeans_object$cluster` in order for this code to work.

```{r}
library(ggplot2)
library(usmap)

cluster_assignments <- km.out$cluster

classification_df <- data.frame(state = tolower(names(cluster_assignments)),
                                cluster = as.factor(as.numeric(cluster_assignments)))

state_plot <- plot_usmap(regions = "states", data = classification_df, values = "cluster") +
  labs(title = "US States", subtitle = "Clusters by Arrest Rates.") +
  scale_colour_hue() +
  theme(panel.background = element_rect(color = "black", fill = "white"),
        legend.position = "right") +
  guides(fill = guide_legend(title = ""))

state_plot
```

(c) Use the function `hclust()` to perform hierarchical clustering on this data. The `hclust()` function requires the following arguments:

1. `d` = this argument requires an object created by the `dist()` function applied to the data used for clustering. `dist()` takes a numeric matrix or data.frame as an argument (here you should use the `USArrests` object), and also allows you to select a distance metric. For now, just use `method = "euclidean"`.
2. `method` = this argument specifies which agglomeration method should be used for clustering. Select `method = "complete"` here.

Once you have estimated the clusters, use the `plot()` function on the resulting object to plot the dendrogram.

```{r}
hc.complete <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
plot(hc.complete)
```

(d) Cut the dendrogram at a height that results in three distinct clusters using the function `cutree()`. Look at the help file to work out how to use this function. Which states belong to which clusters? How many states are in each cluster?

```{r}
cutree(hc.complete, 3)
table(cutree(hc.complete, 3))
```

(e) How do the clusters from (d) compare to the clusters from (a)? Are the same states clustered together using the two approaches? Note: You may find it easier to answer this question if you assign the cluster identifiers from each of these approaches to the `USArrests` data first.

```{r}
USArrests$clusters_k <- km.out$cluster
USArrests$clusters_h <- cutree(hc.complete, 3)
table(USArrests$clusters_k, USArrests$clusters_h)
```

> The clusters contain the same states regardless of the approach used. Note, however, that the *labels* for the clusters are different for the two approaches!
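Because the cluster labels are arbitrary, it can also help to compare the two partitions in a label-free way. A minimal sketch (a hand-rolled version of the Rand index, assuming the `clusters_k` and `clusters_h` columns created above): for every pair of states, check whether both methods agree on whether the pair shares a cluster.

```{r}
# For each pair of states, do the two partitions agree on "same
# cluster" vs "different cluster"? A value of 1 means the partitions
# are identical up to relabelling
same_k <- outer(USArrests$clusters_k, USArrests$clusters_k, "==")
same_h <- outer(USArrests$clusters_h, USArrests$clusters_h, "==")
mean(same_k[lower.tri(same_k)] == same_h[lower.tri(same_h)])
```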
(f) Hierarchically cluster the states again using complete linkage and Euclidean distance. However, before you run the `hclust()` function, first scale each of the variables to have standard deviation one. You can achieve this by applying the `scale()` function to the arrest variables in `USArrests`. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

```{r}
# Scale only the four original variables (we appended the cluster
# assignments to USArrests in part (e), so they must be excluded here)
dsc <- scale(USArrests[, c("Murder", "Assault", "UrbanPop", "Rape")])
hc.s.complete <- hclust(dist(dsc), method = "complete")
plot(hc.s.complete)
USArrests$clusters_h_s <- cutree(hc.s.complete, 3)
table(USArrests$clusters_h, USArrests$clusters_h_s)
table(USArrests$clusters_h)
```

> Scaling the variables affects the clusters obtained from cutting the dendrogram into 3 clusters. In particular, some of the states in cluster 3 of the unscaled analysis are placed in a different cluster in the scaled analysis. For this data set, it seems reasonable to standardise the variables, because one of them is measured in units that are different from the others (`UrbanPop` is a percentage of the population, while the other three columns are arrest rates per 100,000 residents).
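One way to see why scaling matters so much here: the unscaled sample variances differ by orders of magnitude, so Euclidean distances on the raw data are dominated almost entirely by `Assault`. A quick sketch, again using only the four original columns:

```{r}
# Assault's variance dwarfs the others, so it drives unscaled
# Euclidean distances; scaling puts all four variables on an
# equal footing
apply(USArrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 2, var)
```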