In today’s assignment, we are going to look at a set of variables describing the economic characteristics of English parliamentary constituencies around 2017-2019 (the dates of the source data vary a bit in terms of year).
You can directly load the data file into R from the course website with the following command:
econ_vars <- read.csv(url("https://raw.githubusercontent.com/lse-me314/lse-me314.github.io/master/data/const-econ-vars.csv"))
This data file has 8 variables:

- ONSConstID - Office for National Statistics Parliamentary Constituency ID
- ConstituencyName - Constituency Name
- HouseWageRatio - Ratio of House Prices to Wages
- UnempConstRate - Unemployment Rate
- UnempConstRateChange - Unemployment Rate Change since 2010
- WageMedianConst - Median Wage
- social_mobility_score - Social Mobility Index
- deprivation_index_score - Social Deprivation Index

Q1. Use the cor() and pairs() functions to assess the correlations between the six economic variables in the data set. Which two economic variables are most highly correlated with one another at the constituency level? Which variable is least correlated with the others at the constituency level?
pairs(econ_vars[,3:8])
round(cor(econ_vars[,3:8], use = "pairwise.complete.obs"), 2)
## HouseWageRatio UnempConstRate UnempConstRateChange
## HouseWageRatio 1.00 -0.40 -0.04
## UnempConstRate -0.40 1.00 -0.17
## UnempConstRateChange -0.04 -0.17 1.00
## WageMedianConst 0.68 -0.52 0.04
## social_mobility_score 0.56 -0.26 0.03
## deprivation_index_score -0.33 0.90 -0.39
## WageMedianConst social_mobility_score
## HouseWageRatio 0.68 0.56
## UnempConstRate -0.52 -0.26
## UnempConstRateChange 0.04 0.03
## WageMedianConst 1.00 0.66
## social_mobility_score 0.66 1.00
## deprivation_index_score -0.55 -0.30
## deprivation_index_score
## HouseWageRatio -0.33
## UnempConstRate 0.90
## UnempConstRateChange -0.39
## WageMedianConst -0.55
## social_mobility_score -0.30
## deprivation_index_score 1.00
The two most highly correlated variables are the unemployment rate and the deprivation index.
The change in the unemployment rate since 2010 is barely correlated with any of the other variables. This is very likely because places that are poor or affluent tend to stay that way over time. As a result, there is no reason to expect that the change in unemployment will be associated with whether a place is currently relatively poor or affluent.
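If you want to confirm this programmatically rather than by eye, a small sketch along these lines (reusing the correlation matrix from above) finds the largest off-diagonal correlation and ranks the variables by their average absolute correlation with the others:

cc <- cor(econ_vars[,3:8], use = "pairwise.complete.obs")
diag(cc) <- NA # ignore the 1s on the diagonal
which(abs(cc) == max(abs(cc), na.rm = TRUE), arr.ind = TRUE) # most correlated pair (reported twice, by symmetry)
sort(rowMeans(abs(cc), na.rm = TRUE)) # least correlated variable comes first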
Q2. Use the command pcafit <- prcomp(econ_vars[,3:8], scale.=TRUE) to calculate the principal components of these six economic variables. Then examine the object pcafit directly and also through summary(pcafit). Which variable has the smallest (magnitude) “loading” on the first principal component? How does this relate to your answer in Q1?
pcafit <- prcomp(econ_vars[,3:8],scale.=TRUE)
pcafit
## Standard deviations (1, .., p=6):
## [1] 1.7666732 1.1652675 0.8889118 0.6460604 0.5037648 0.2442953
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4
## HouseWageRatio -0.4159756 -0.3771834 0.05947919 -0.75825898
## UnempConstRate 0.4548338 -0.3205344 0.47247017 -0.03825962
## UnempConstRateChange -0.1302040 0.6051598 0.73740722 -0.20929981
## WageMedianConst -0.4881757 -0.2380995 0.06771818 0.04389976
## social_mobility_score -0.3867713 -0.3868244 0.42061614 0.59345975
## deprivation_index_score 0.4639288 -0.4271388 0.21900014 -0.16016068
## PC5 PC6
## HouseWageRatio 0.30144653 0.12382456
## UnempConstRate -0.22721115 0.64348050
## UnempConstRateChange -0.02493771 -0.16920649
## WageMedianConst -0.82807063 -0.11304521
## social_mobility_score 0.41158724 -0.04752232
## deprivation_index_score -0.04216830 -0.72590075
summary(pcafit)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7667 1.1653 0.8889 0.64606 0.5038 0.24430
## Proportion of Variance 0.5202 0.2263 0.1317 0.06957 0.0423 0.00995
## Cumulative Proportion 0.5202 0.7465 0.8782 0.94776 0.9900 1.00000
The variable with the smallest magnitude loading on PC1 is the change in the unemployment rate. This is what we would expect given that we previously found it had very low correlations with all of the other variables. A principal component that captures as much of the shared variation as possible cannot capture much variation in a variable that is poorly correlated with all of the others.
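A one-line check, if you prefer not to scan the loadings table by eye:

which.min(abs(pcafit$rotation[,1])) # variable with the smallest |loading| on PC1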
Q3. Extract the standard deviations of the principal components using the command pcafit$sdev. Use these standard deviations to calculate the proportion of variance explained by each principal component. (Recall that the variance is just the standard deviation squared!) How much of the variation in the underlying data is explained by the first principal component?
pca_variances <- pcafit$sdev^2
pca_variances/sum(pca_variances)
## [1] 0.520189034 0.226308070 0.131694023 0.069565679 0.042296497 0.009946697
The first principal component explains approximately 52% of the variance in the underlying data.
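As a sanity check, these proportions should match the “Proportion of Variance” row produced by summary(pcafit):

summary(pcafit)$importance["Proportion of Variance", ]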
Q4. Construct screeplots using either the type="barplot" or the type="lines" options of the screeplot() command. Given this and the output of summary(pcafit) above, is it clear how many dimensions are needed to describe these data well?
screeplot(pcafit,type="barplot")
screeplot(pcafit,type="lines")
The first principal component “explains” about half the variance (0.52) and the second principal component explains about a quarter (0.23). This is a somewhat ambiguous case: each component explains about half as much variation as the previous one. You could certainly make a credible argument for either one or two principal components providing a good summary of the data. Depending on the application of interest, you might think that a single dimension explaining half the variation is a pretty good summary, or that two dimensions explaining three quarters of the variation are preferable.
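If the screeplot feels ambiguous, plotting the cumulative proportion of variance can make the one-versus-two-component trade-off more explicit. A quick sketch, reusing pca_variances from Q3 (the 75% reference line is an arbitrary choice):

plot(cumsum(pca_variances)/sum(pca_variances), type = "b",
     xlab = "Number of components", ylab = "Cumulative proportion of variance",
     ylim = c(0, 1))
abline(h = 0.75, lty = 2) # arbitrary reference line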
Q5. Check the signs of the loadings on PC1 for each variable in the model. For each variable, write a sentence of the form “[The ratio of house prices to wages] is [positively/negatively] correlated with the first principal component”. Do these all make sense collectively? You could also try writing a sentence of the form: “Places that are high on the first principal component are [high/low] in the house-price-to-wage ratio, [high/low] in unemployment, …” What does this tell us about what the first principal component is measuring?
# Print the loadings associated with the first principal component
pcafit$rotation[,1]
## HouseWageRatio UnempConstRate UnempConstRateChange
## -0.4159756 0.4548338 -0.1302040
## WageMedianConst social_mobility_score deprivation_index_score
## -0.4881757 -0.3867713 0.4639288
- The ratio of home prices to wages is negatively correlated with the first principal component
- The unemployment rate is positively correlated with the first principal component
- Recent changes in the unemployment rate are negatively correlated with the first principal component
- Median wage is negatively correlated with the first principal component
- Social mobility is negatively correlated with the first principal component
- Deprivation is positively correlated with the first principal component
If you put these together, the general tendency is for attributes of less economically successful places to be positively correlated with the first principal component: lower house-price-to-wage ratios, higher unemployment, lower wages, lower social mobility and higher deprivation. You could argue about whether this is capturing a concept that should be understood as “economic success” or “affluence” or something similar. The one slight mismatch is the unemployment change variable, which is weakly negatively associated with the first principal component. As we noted before, this variable perhaps does not belong in this analysis because it describes changes over time rather than levels.
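One practical note: the overall sign of a principal component is arbitrary, so prcomp() could equally have returned the mirror image of PC1. If you prefer a score where higher values mean more affluent, you can simply flip it (affluence_score is a name introduced here purely for illustration):

affluence_score <- -pcafit$x[,1] # higher = more affluent, under the interpretation above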
Q6. Are you able to identify what PC2 is capturing?
# Print the loadings associated with the second principal component
pcafit$rotation[,2]
## HouseWageRatio UnempConstRate UnempConstRateChange
## -0.3771834 -0.3205344 0.6051598
## WageMedianConst social_mobility_score deprivation_index_score
## -0.2380995 -0.3868244 -0.4271388
PC2 is positively associated with unemployment rate changes and negatively associated with everything else. It appears this component mainly exists to capture unemployment rate changes, which we have already seen are mostly uncorrelated with the other variables. It does capture a bit of additional variation from the other variables, but not in a way that I was able to make much sense of!
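A rough check of this reading, assuming econ_vars has no missing rows (so that the PCA scores line up with the original data): PC2 should correlate far more strongly with the change variable than PC1 does.

cor(pcafit$x[,2], econ_vars$UnempConstRateChange) # should be large in magnitude
cor(pcafit$x[,1], econ_vars$UnempConstRateChange) # should be small in magnitude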
Q7. Re-do the principal components analysis without the variable that has the smallest magnitude loading on the first principal component. Extract the first principal component from the original analysis with all six variables (using pcafit$x[,1]) and also from this new analysis with five variables. Plot them against one another and check their correlation. Explain why you find what you find.
pcafit2 <- prcomp(econ_vars[,c(3,4,6,7,8)],scale.=TRUE)
pcafit2
## Standard deviations (1, .., p=5):
## [1] 1.7565429 1.0658834 0.6638214 0.5041835 0.2891193
##
## Rotation (n x k) = (5 x 5):
## PC1 PC2 PC3 PC4
## HouseWageRatio -0.4294957 -0.3798653 -0.73628018 -0.31772277
## UnempConstRate 0.4531833 -0.5254131 0.07272015 0.20909319
## WageMedianConst -0.4979047 -0.2271984 0.02627548 0.83039862
## social_mobility_score -0.3964105 -0.5029322 0.64941606 -0.40424205
## deprivation_index_score 0.4528885 -0.5244840 -0.17369989 0.04856716
## PC5
## HouseWageRatio -0.16786763
## UnempConstRate -0.68524285
## WageMedianConst 0.10109171
## social_mobility_score 0.06902949
## deprivation_index_score 0.69805307
summary(pcafit2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.7565 1.0659 0.66382 0.50418 0.28912
## Proportion of Variance 0.6171 0.2272 0.08813 0.05084 0.01672
## Cumulative Proportion 0.6171 0.8443 0.93244 0.98328 1.00000
plot(pcafit$x[,1],pcafit2$x[,1])
cor(pcafit$x[,1],pcafit2$x[,1])
## [1] 0.9969929
The relative loadings on the five remaining variables are nearly identical to their previous values. The first principal component values for the constituencies are very nearly identical as well, correlated at over 0.99. Very little has changed because the variable we omitted was very weakly correlated with the others. Put differently, changes in unemployment are not really closely related to affluence or economic success of constituencies, so they are a poor indicator of that more general concept. The other five indicators tend to go together much more strongly.
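You can see this directly by lining up the PC1 loadings from the two fits (dropping row 3, the change variable, from the six-variable solution so that the rows align):

cbind(full = pcafit$rotation[-3, 1], reduced = pcafit2$rotation[,1])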
Q8. Re-do the PCA from question 2 again, but this time set scale. = FALSE. Compare the loadings on the first principal component and comment on any differences you see. Why do these differences occur?
pcafit3 <- prcomp(econ_vars[,c(3,4,5,6,7,8)],scale.=FALSE)
round(pcafit$rotation,2)
## PC1 PC2 PC3 PC4 PC5 PC6
## HouseWageRatio -0.42 -0.38 0.06 -0.76 0.30 0.12
## UnempConstRate 0.45 -0.32 0.47 -0.04 -0.23 0.64
## UnempConstRateChange -0.13 0.61 0.74 -0.21 -0.02 -0.17
## WageMedianConst -0.49 -0.24 0.07 0.04 -0.83 -0.11
## social_mobility_score -0.39 -0.39 0.42 0.59 0.41 -0.05
## deprivation_index_score 0.46 -0.43 0.22 -0.16 -0.04 -0.73
round(pcafit3$rotation,2)
## PC1 PC2 PC3 PC4 PC5 PC6
## HouseWageRatio -0.03 -0.01 -0.02 1.00 0.00 0.00
## UnempConstRate 0.00 0.00 0.00 0.00 0.70 0.72
## UnempConstRateChange 0.00 0.00 0.00 0.00 0.72 -0.70
## WageMedianConst -0.94 0.32 -0.07 -0.02 0.00 0.00
## social_mobility_score -0.33 -0.94 0.03 -0.02 0.00 0.00
## deprivation_index_score 0.06 -0.05 -1.00 -0.02 0.00 0.00
When we set scale.=FALSE, the variables enter the PCA without first being normalised to have mean zero and variance one. As a result, the first principal component in this analysis is dominated by the WageMedianConst variable, as this has the largest unnormalised sample variance in our data. In fact, in this analysis WageMedianConst and social_mobility_score together dominate the first two principal components, but this is entirely driven by the fact that those two variables have large sample variances in the units in which they are measured. Of course, we do not want the units of measurement to determine the PCA output, and so this question is here to serve as a reminder that we should always set scale.=TRUE when running PCA.
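The raw sample variances show why this happens; WageMedianConst dwarfs the other variables in its original units:

sort(apply(econ_vars[,3:8], 2, var, na.rm = TRUE), decreasing = TRUE)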
Consider the USArrests data from the “datasets” package, which you can load with the following command:
data("USArrests", package = "datasets") # "package" argument optional here
The data includes statistics on the number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also gives the percentage of the population living in urban areas. Explore the data using the summary() and pairs() commands to make sure you are familiar with it. You will also be using some functions that make use of random number generation in this question, so let’s set the seed again:
set.seed(2)
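The exploration step suggested above might look like this (output omitted here):

summary(USArrests) # per-variable summaries
pairs(USArrests)   # scatterplot matrix of the four variables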
Use the kmeans() function to perform k-means clustering on the arrests data. This function takes three main arguments:

- x = the data on which we would like to perform k-means clustering
- centers = a user-specified parameter for the number of clusters (i.e. the value of K)
- nstart = a parameter that determines the number of random starts to use. For now, just set this to 20. (This will fit the algorithm 20 times from different start values, and select the best-fitting one.)

Use the arguments above to perform k-means when K = 3. Which states are assigned to which clusters? Can you say anything substantive about the clusters given the information in the output?
km.out <- kmeans(USArrests, 3, nstart = 20)
The states assigned to each cluster are given in the output. The table of cluster means suggests that the second cluster captures the least crime-affected states, and the third cluster captures the most crime-affected states, while the first cluster captures states at intermediate levels.
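Both pieces of information can be read straight off the fitted object:

km.out$centers        # mean arrest rates and urbanisation by cluster
table(km.out$cluster) # number of states in each cluster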
Note: you will need to create an object called cluster_assignments, with something like cluster_assignments <- your_kmeans_object$cluster, in order for the plotting code below to work.

library(ggplot2)
library(usmap)
cluster_assignments <- km.out$cluster
classification_df <- data.frame(state=tolower(names(cluster_assignments)),
cluster=as.factor(as.numeric(cluster_assignments)))
state_plot <- plot_usmap(regions="states", data=classification_df, values="cluster") +
labs(title="US States",subtitle="Clusters by Arrest Rates.") +
scale_colour_hue() +
theme( panel.background = element_rect(color = "black", fill = "white"), legend.position="right") +
guides(fill=guide_legend(title=""))
state_plot
Use the hclust() function to perform hierarchical clustering on this data. The hclust() function requires the following arguments:

- d = this argument requires an object created by the dist() function applied to the data used for clustering. dist() takes a numeric matrix or data.frame as an argument (here you should use the USArrests object), and also allows you to select a distance metric. For now, just use method = "euclidean".
- method = this argument specifies which agglomeration method should be used for clustering. Select method = "complete" here.

Once you have estimated the clusters, use the plot() function on the resulting object to plot the dendrogram.
Once you have estimated the clusters, use the plot()
function on the resulting object to plot the dendrogram.
hc.complete <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
plot(hc.complete)
Use the cutree() function to split the dendrogram into three clusters. Look at the help file to work out how to use this function. Which states belong to which clusters? How many states are in each cluster?

cutree(hc.complete, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
table(cutree(hc.complete, 3))
##
## 1 2 3
## 16 14 20
Note: You may find it easier to answer this question if you assign the cluster identifiers from each of these approaches to the USArrests data first.
USArrests$clusters_k <- km.out$cluster
USArrests$clusters_h <- cutree(hc.complete, 3)
table(USArrests$clusters_k, USArrests$clusters_h)
##
## 1 2 3
## 1 0 14 0
## 2 0 0 20
## 3 16 0 0
The clusters contain the same states regardless of the approach used. Note, however, that the labels for the clusters are different for the two approaches!
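Because the labels are arbitrary, a quick way to confirm that the two solutions agree is to check that every row of the cross-tabulation has exactly one non-zero cell; a small sketch:

# TRUE if each k-means cluster maps onto exactly one hierarchical cluster
all(rowSums(table(USArrests$clusters_k, USArrests$clusters_h) > 0) == 1)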
Before applying the hclust() function, first scale each of the variables to have standard deviation one. You can achieve this by applying the scale() function to USArrests. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

dsc <- scale(USArrests[, 1:4]) # scale only the four original variables, not the cluster labels we added above
hc.s.complete <- hclust(dist(dsc), method = "complete")
plot(hc.s.complete)
USArrests$clusters_h_s <- cutree(hc.s.complete, 3)
table(USArrests$clusters_h, USArrests$clusters_h_s)
##
## 1 2 3
## 1 16 0 0
## 2 0 14 0
## 3 0 9 11
table(USArrests$clusters_h)
##
## 1 2 3
## 16 14 20
Scaling the variables affects the clusters obtained from cutting the dendrogram into 3 clusters. In particular, some of the states in cluster 3 from the non-scaled analysis are put in a different cluster in the scaled analysis. For this data set, it seems reasonable to standardise the variables, because one of them is measured in units that differ from the others (UrbanPop is a percentage of the population, while the other three columns are arrest counts per 100,000 residents).
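The scale differences are easy to verify directly:

apply(USArrests[,1:4], 2, sd) # Assault's standard deviation dwarfs the others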