Exercise 8.1 – Principal Component Analysis

In today’s assignment, we are going to look at a set of variables describing the economic characteristics of English parliamentary constituencies around 2017-2019 (the exact year varies across the source datasets).

You can directly load the data file into R from the course website with the following command:

econ_vars <- read.csv(url("https://raw.githubusercontent.com/lse-me314/lse-me314.github.io/master/data/const-econ-vars.csv"))

This data file has 8 variables.

Q1. Use the cor() and pairs() functions to assess the correlations between the six economic variables in the data set. Which two economic variables are most highly correlated with one another at the constituency level? Which variable is least correlated with the others at the constituency level?

pairs(econ_vars[,3:8])

round(cor(econ_vars[,3:8], use = "pairwise.complete.obs"), 2)
##                         HouseWageRatio UnempConstRate UnempConstRateChange
## HouseWageRatio                    1.00          -0.40                -0.04
## UnempConstRate                   -0.40           1.00                -0.17
## UnempConstRateChange             -0.04          -0.17                 1.00
## WageMedianConst                   0.68          -0.52                 0.04
## social_mobility_score             0.56          -0.26                 0.03
## deprivation_index_score          -0.33           0.90                -0.39
##                         WageMedianConst social_mobility_score
## HouseWageRatio                     0.68                  0.56
## UnempConstRate                    -0.52                 -0.26
## UnempConstRateChange               0.04                  0.03
## WageMedianConst                    1.00                  0.66
## social_mobility_score              0.66                  1.00
## deprivation_index_score           -0.55                 -0.30
##                         deprivation_index_score
## HouseWageRatio                            -0.33
## UnempConstRate                             0.90
## UnempConstRateChange                      -0.39
## WageMedianConst                           -0.55
## social_mobility_score                     -0.30
## deprivation_index_score                    1.00

The two most highly correlated variables are the unemployment rate (UnempConstRate) and the deprivation index (deprivation_index_score), with a correlation of 0.90.

The change in the unemployment rate since 2010 is barely correlated with any of the other variables. This is very likely because places that are poor or affluent tend to stay that way over time, so there is no reason to expect the change in unemployment to be associated with whether a place is currently relatively poor or affluent.
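If you would rather identify these answers programmatically than by scanning the matrix, a short sketch (not part of the original assignment, assuming the same columns as above) is:

# Correlation matrix of the six economic variables
cmat <- cor(econ_vars[,3:8], use = "pairwise.complete.obs")

# Ignore the self-correlations on the diagonal
diag(cmat) <- NA

# Most highly correlated pair: largest absolute off-diagonal entry
which(abs(cmat) == max(abs(cmat), na.rm = TRUE), arr.ind = TRUE)

# Least correlated variable: smallest mean absolute correlation with the rest
sort(rowMeans(abs(cmat), na.rm = TRUE))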

Q2. Use the command pcafit <- prcomp(econ_vars[,3:8],scale.=TRUE) to calculate the principal components of these six economic variables. Then examine the object pcafit directly and also through summary(pcafit).

Which variable has the smallest (magnitude) “loading” on the first principal component? How does this relate to your answer in Q1?

pcafit <- prcomp(econ_vars[,3:8],scale.=TRUE)
pcafit
## Standard deviations (1, .., p=6):
## [1] 1.7666732 1.1652675 0.8889118 0.6460604 0.5037648 0.2442953
## 
## Rotation (n x k) = (6 x 6):
##                                PC1        PC2        PC3         PC4
## HouseWageRatio          -0.4159756 -0.3771834 0.05947919 -0.75825898
## UnempConstRate           0.4548338 -0.3205344 0.47247017 -0.03825962
## UnempConstRateChange    -0.1302040  0.6051598 0.73740722 -0.20929981
## WageMedianConst         -0.4881757 -0.2380995 0.06771818  0.04389976
## social_mobility_score   -0.3867713 -0.3868244 0.42061614  0.59345975
## deprivation_index_score  0.4639288 -0.4271388 0.21900014 -0.16016068
##                                 PC5         PC6
## HouseWageRatio           0.30144653  0.12382456
## UnempConstRate          -0.22721115  0.64348050
## UnempConstRateChange    -0.02493771 -0.16920649
## WageMedianConst         -0.82807063 -0.11304521
## social_mobility_score    0.41158724 -0.04752232
## deprivation_index_score -0.04216830 -0.72590075
summary(pcafit)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6
## Standard deviation     1.7667 1.1653 0.8889 0.64606 0.5038 0.24430
## Proportion of Variance 0.5202 0.2263 0.1317 0.06957 0.0423 0.00995
## Cumulative Proportion  0.5202 0.7465 0.8782 0.94776 0.9900 1.00000

The variable with the smallest magnitude loading on PC1 is the change in the unemployment rate. This is what we would expect given that we previously found it had very low correlations with all of the other variables. A principal component that predicts as much variation in the other variables as possible cannot possibly predict much variation in a variable that is poorly correlated with all other variables.
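You can confirm this without scanning the rotation matrix by hand; for example:

# Variable with the smallest absolute loading on PC1
which.min(abs(pcafit$rotation[,1]))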

Q3. Extract the standard deviations of the principal components using the command pcafit$sdev. Use these standard deviations to calculate the proportion of variance explained by each principal component. (Recall that the variance is just the standard deviation squared!) How much of the variation in the underlying data is explained by the first principal component?

pca_variances <- pcafit$sdev^2
pca_variances/sum(pca_variances)
## [1] 0.520189034 0.226308070 0.131694023 0.069565679 0.042296497 0.009946697

The first principal component explains approximately 52% of the variance in the underlying data.
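As a quick consistency check, the running total of these proportions reproduces the “Cumulative Proportion” row reported by summary(pcafit) above:

# Cumulative proportion of variance explained
cumsum(pca_variances/sum(pca_variances))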

Q4. Construct screeplots using either the type="barplot" or the type="lines" options of the screeplot() command. Given this and the output of summary(pcafit) above, is it clear how many dimensions are needed to describe these data well?

screeplot(pcafit,type="barplot")

screeplot(pcafit,type="lines")

The first principal component “explains” about half the variance (0.52) and the second principal component explains about a quarter (0.23). This is a somewhat ambiguous case: each component explains about half as much variation as the previous one. You could certainly make a credible argument for either one or two principal components providing a good summary of the data. Depending on the application of interest, you might think that a single dimension explaining half the variation is a pretty good summary, or that two dimensions explaining three quarters of the variation are preferable.

Q5. Check the signs of the loadings on PC1 for each variable in the model. For each variable, write a sentence of the form “[The ratio of home prices to wages] is [positively/negatively] correlated with the first principal component”. Do these all make sense collectively? You could also try writing a sentence of the form: “Places that are high on the first principal component are [high/low] in the house to wage ratio, [high/low] in unemployment, …” What does this tell us about what the first principal component is measuring?

# Print the loadings associated with the first principal component
pcafit$rotation[,1]
##          HouseWageRatio          UnempConstRate    UnempConstRateChange 
##              -0.4159756               0.4548338              -0.1302040 
##         WageMedianConst   social_mobility_score deprivation_index_score 
##              -0.4881757              -0.3867713               0.4639288
  • The ratio of home prices to wages is negatively correlated with the first principal component
  • The unemployment rate is positively correlated with the first principal component
  • Recent changes in the unemployment rate are negatively correlated with the first principal component
  • Median wage is negatively correlated with the first principal component
  • Social mobility is negatively correlated with the first principal component
  • Deprivation is positively correlated with the first principal component

If you put these together, the general tendency is for attributes of less economically successful places to be positively correlated with the first principal component: lower home-price-to-wage ratios, higher unemployment, lower wages, lower social mobility and higher deprivation. You could argue about whether this is capturing a concept that should be understood as “economic success” or “affluence” or something similar. The one slight mismatch is the unemployment change variable, which is weakly negatively associated with the first principal component. As we noted before, this variable perhaps does not belong in this analysis because it describes changes over time rather than levels.
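Because the sign of each principal component is arbitrary, it can aid interpretation to flip PC1 so that higher scores mean more affluent places. A minimal sketch (the name affluence_score is ours, not part of the assignment):

# Flip the sign of PC1 so that larger values indicate more affluent
# constituencies (higher wages, lower unemployment, lower deprivation)
econ_vars$affluence_score <- -pcafit$x[,1]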

Q6. Are you able to identify what PC2 is capturing?

# Print the loadings associated with the second principal component
pcafit$rotation[,2]
##          HouseWageRatio          UnempConstRate    UnempConstRateChange 
##              -0.3771834              -0.3205344               0.6051598 
##         WageMedianConst   social_mobility_score deprivation_index_score 
##              -0.2380995              -0.3868244              -0.4271388

PC2 is positively associated with unemployment rate changes and negatively associated with everything else. It appears this component mainly serves to capture unemployment rate changes, which we have already seen are mostly uncorrelated with the other variables. It does capture a bit of additional variation from the other variables, but not in a way that I was able to make much sense of!
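If you want to explore the first two components visually, base R’s biplot() overlays the constituency scores with the variable loadings (an optional aid, not required by the assignment):

# Biplot of PC1 and PC2: arrows show variable loadings, points show scores
biplot(pcafit, cex = 0.6)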

Q7. Re-do the principal components analysis without the variable that has the smallest magnitude loading on the first principal component. Extract the first principal component from the original analysis with all six variables (using pcafit$x[,1]) and also from this new analysis with five variables. Plot them against one another and check their correlation. Explain why you find what you find.

pcafit2 <- prcomp(econ_vars[,c(3,4,6,7,8)],scale.=TRUE)
pcafit2
## Standard deviations (1, .., p=5):
## [1] 1.7565429 1.0658834 0.6638214 0.5041835 0.2891193
## 
## Rotation (n x k) = (5 x 5):
##                                PC1        PC2         PC3         PC4
## HouseWageRatio          -0.4294957 -0.3798653 -0.73628018 -0.31772277
## UnempConstRate           0.4531833 -0.5254131  0.07272015  0.20909319
## WageMedianConst         -0.4979047 -0.2271984  0.02627548  0.83039862
## social_mobility_score   -0.3964105 -0.5029322  0.64941606 -0.40424205
## deprivation_index_score  0.4528885 -0.5244840 -0.17369989  0.04856716
##                                 PC5
## HouseWageRatio          -0.16786763
## UnempConstRate          -0.68524285
## WageMedianConst          0.10109171
## social_mobility_score    0.06902949
## deprivation_index_score  0.69805307
summary(pcafit2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5
## Standard deviation     1.7565 1.0659 0.66382 0.50418 0.28912
## Proportion of Variance 0.6171 0.2272 0.08813 0.05084 0.01672
## Cumulative Proportion  0.6171 0.8443 0.93244 0.98328 1.00000
plot(pcafit$x[,1],pcafit2$x[,1])

cor(pcafit$x[,1],pcafit2$x[,1])
## [1] 0.9969929

The relative loadings on the five remaining variables are nearly identical to their previous values. The first principal component values for the constituencies are very nearly identical as well, correlated at over 0.99. Very little has changed because the variable we omitted was very weakly correlated with the others. Put differently, changes in unemployment are not really closely related to affluence or economic success of constituencies, so they are a poor indicator of that more general concept. The other five indicators tend to go together much more strongly.
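To see how little the loadings move, you can put the two sets side by side (a sketch; the name shared is ours), restricting the six-variable fit to the five shared variables:

# Compare PC1 loadings from the six- and five-variable fits
shared <- rownames(pcafit2$rotation)
round(cbind(six_vars = pcafit$rotation[shared, 1],
            five_vars = pcafit2$rotation[, 1]), 2)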

Q8. Re-do the PCA from question 2 again, but this time set scale. = FALSE. Compare the loadings on the first principal component and comment on any differences you see. Why do these differences occur?

pcafit3 <- prcomp(econ_vars[,3:8],scale.=FALSE)

round(pcafit$rotation,2)
##                           PC1   PC2  PC3   PC4   PC5   PC6
## HouseWageRatio          -0.42 -0.38 0.06 -0.76  0.30  0.12
## UnempConstRate           0.45 -0.32 0.47 -0.04 -0.23  0.64
## UnempConstRateChange    -0.13  0.61 0.74 -0.21 -0.02 -0.17
## WageMedianConst         -0.49 -0.24 0.07  0.04 -0.83 -0.11
## social_mobility_score   -0.39 -0.39 0.42  0.59  0.41 -0.05
## deprivation_index_score  0.46 -0.43 0.22 -0.16 -0.04 -0.73
round(pcafit3$rotation,2)
##                           PC1   PC2   PC3   PC4  PC5   PC6
## HouseWageRatio          -0.03 -0.01 -0.02  1.00 0.00  0.00
## UnempConstRate           0.00  0.00  0.00  0.00 0.70  0.72
## UnempConstRateChange     0.00  0.00  0.00  0.00 0.72 -0.70
## WageMedianConst         -0.94  0.32 -0.07 -0.02 0.00  0.00
## social_mobility_score   -0.33 -0.94  0.03 -0.02 0.00  0.00
## deprivation_index_score  0.06 -0.05 -1.00 -0.02 0.00  0.00

When we set scale.=FALSE the variables enter the PCA without first being scaled to have variance one (prcomp still centres them to mean zero by default). As a result, the first principal component in this analysis is dominated by the WageMedianConst variable, as this has the largest raw sample variance in our data. In fact, in this analysis WageMedianConst and social_mobility_score dominate the first two principal components, but this is entirely driven by the fact that those two variables have large sample variances in the units in which they are measured. Of course, we do not want the units of measurement to determine the PCA output, and so this question is just here to serve as a reminder that we should always set scale.=TRUE when running PCA.
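You can verify that WageMedianConst dominates the unscaled fit by checking the raw sample variances directly:

# Sample variances of the raw variables: the unscaled PCA loads most
# heavily on whichever variables have the largest variances
sort(sapply(econ_vars[,3:8], var, na.rm = TRUE), decreasing = TRUE)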

Exercise 8.2 – K-means Clustering

Consider the USArrests data from the “datasets” package, which you can load with the following command:

data("USArrests", package = "datasets")  # "package" argument optional here

The data includes statistics on the number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also gives the percentage of the population living in urban areas. Explore the data using the summary() and pairs() commands to make sure you are familiar with it. You will also be using some functions that make use of random number generation in this question, so let’s set the seed again:

set.seed(2)

  (a) Use the kmeans() function to perform k-means clustering on the arrests data. This function takes three main arguments:
  1. x = the data on which we would like to perform k-means clustering

  2. centers = a user specified parameter for the number of clusters (i.e. the value of K)

  3. nstart = a parameter that determines the number of random starts to use. For now, just set this to 20. (This will fit the algorithm 20 times from different start values, and select the best fitting one)

    Use the arguments above to perform k-means when K = 3. Which states are assigned to which clusters? Can you say anything substantive about the clusters given the information in the output?

km.out <-  kmeans(USArrests, 3, nstart = 20)

The states assigned to each cluster are given in the output. The table of cluster means suggests that the second cluster captures the least crime-affected states, and the third cluster captures the most crime-affected states, while the first cluster captures states at intermediate levels.
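The cluster means and state assignments referred to above can be inspected directly from the fitted object, for example:

# Per-cluster means of the four arrest variables
km.out$centers

# List the states assigned to each cluster
split(rownames(USArrests), km.out$cluster)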

  (b) Use the code below to create a map of the clusters of US states. Note that you need to assign your cluster assignments to cluster_assignments with something like cluster_assignments <- your_kmeans_object$cluster in order for this code to work.
library(ggplot2)
library(usmap)

cluster_assignments <- km.out$cluster

classification_df <- data.frame(state=tolower(names(cluster_assignments)),
                                cluster=as.factor(as.numeric(cluster_assignments)))

state_plot <- plot_usmap(regions="states", data=classification_df, values="cluster") + 
  labs(title="US States",subtitle="Clusters by Arrest Rates.") + 
  scale_colour_hue() + 
  theme( panel.background = element_rect(color = "black", fill = "white"), legend.position="right") + 
  guides(fill=guide_legend(title=""))

state_plot

  (c) Use the function hclust() to perform hierarchical clustering on this data. The hclust() function requires the following arguments:
  1. d = this argument requires an object created by the dist() function applied to the data used for clustering. dist() takes a numeric matrix or data.frame as an argument (here you should use the USArrests object), and also allows you to select a distance metric. For now, just use method = "euclidean".

  2. method = this argument specifies which agglomeration method should be used for clustering. Select method = "complete" here.

    Once you have estimated the clusters, use the plot() function on the resulting object to plot the dendrogram.

hc.complete <-  hclust(dist(USArrests, method = "euclidean"), method = "complete")

plot(hc.complete)

  (d) Cut the dendrogram at a height that results in three distinct clusters using the function cutree(). Look at the help file to work out how to use this function. Which states belong to which clusters? How many states are in each cluster?
cutree(hc.complete, 3)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
table(cutree(hc.complete, 3))
## 
##  1  2  3 
## 16 14 20
  (e) How do the clusters from (d) compare to the clusters from (a)? Are the same states clustered together using the two approaches?

Note: You may find it easier to answer this question if you assign the cluster identifiers from each of these approaches to the USArrests data first.

USArrests$clusters_k <- km.out$cluster
USArrests$clusters_h <- cutree(hc.complete, 3)

table(USArrests$clusters_k, USArrests$clusters_h)
##    
##      1  2  3
##   1  0 14  0
##   2  0  0 20
##   3 16  0  0

The clusters contain the same states regardless of the approach used. Note, however, that the labels for the clusters are different for the two approaches!
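A quick programmatic check that the two partitions agree up to relabelling: in the cross-tabulation above, every row should contain exactly one non-zero cell. A sketch:

# Each k-means cluster should map onto exactly one hierarchical cluster
tab <- table(USArrests$clusters_k, USArrests$clusters_h)
all(rowSums(tab > 0) == 1)  # TRUE when the partitions match up to labels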

  (f) Hierarchically cluster the states again using complete linkage and Euclidean distance. However, before you run the hclust() function, first scale each of the variables to have standard deviation one. You can achieve this by applying the scale() function to USArrests. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
# Exclude the cluster labels we appended to USArrests earlier before scaling
dsc <-  scale(USArrests[, c("Murder", "Assault", "UrbanPop", "Rape")])
hc.s.complete <-  hclust(dist(dsc), method = "complete")
plot(hc.s.complete)

USArrests$clusters_h_s <- cutree(hc.s.complete, 3)
table(USArrests$clusters_h, USArrests$clusters_h_s)
##    
##      1  2  3
##   1 16  0  0
##   2  0 14  0
##   3  0  9 11
table(USArrests$clusters_h)
## 
##  1  2  3 
## 16 14 20

Scaling the variables affects the clusters obtained from cutting the dendrogram into 3 clusters. In particular, some of the states in cluster 3 from the non-scaled analysis are put in a different cluster in the scaled analysis. For this data set, it seems reasonable to standardise the variables because one of them is measured in units that are different from the others (UrbanPop is a percentage of the population, while the other three columns are arrest rates per 100,000 residents).
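You can see the scale differences directly in the standard deviations of the raw variables (computed on the original four columns, since we added cluster labels to the data frame above); Assault in particular has a much larger spread, so it dominates unscaled Euclidean distances:

# Standard deviations of the raw variables
apply(USArrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 2, sd)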