In this assignment, you will use R to calculate similarity metrics for text data, and apply document classification methods using quanteda.
You will need the following packages:
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)
library(tidyverse)
library(caret)
Did the United States Constitution influence the constitutions of other countries? A growing body of scholarship suggests that the influence of the US Constitution has decreased over time, as it has increasingly diverged from a growing global consensus on the importance of human rights in constitutional settlements. However, there is a lack of empirical and systematic knowledge about the extent to which the US Constitution affects the revision and adoption of formal constitutions across the world.1
David S. Law and Mila Versteeg (2012) investigate the influence of the US constitution empirically and show that other countries have, in recent decades, become increasingly unlikely to model the rights-related provisions of their own constitutions upon those found in the US Constitution. In this problem set, we will use some of the methods that we covered this week to replicate some parts of their analysis.
We will use the constitutions.csv file for this question. Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:
constitutions <- read_csv("constitutions.csv")
This file contains the preambles of 155 (English-translated) constitutions. The data contains the following variables:
| Variable | Description |
|---|---|
| country | Name of the country |
| continent | Continent of the country |
| year | Year in which the constitution was written |
| preamble | Text of the preamble of the constitution |
You can take a quick look at the variables in the data by using the
glimpse()
function from the tidyverse
package:
glimpse(constitutions)
## Rows: 155
## Columns: 4
## $ country <chr> "afghanistan", "albania", "algeria", "andorra", "angola", "a…
## $ continent <chr> "Asia", "Europe", "Africa", "Europe", "Africa", "Americas", …
## $ year <dbl> 2004, 1998, 1989, 1993, 2010, 1981, 1853, 1995, 1995, 1973, …
## $ preamble <chr> "In the name of Allah, the Most Beneficent, the Most Mercifu…
Take a look at the constitutions object to get a sense of the data that we are working with. What is the average length of the texts stored in the preamble variable?2 Which country has the longest preamble text?3 Which has the shortest?4 Has the average length of these preambles changed over time?5

Convert the constitutions
data.frame into a
corpus()
object and then into a dfm()
object
(remember that you will need to use the tokens() function
as well). Make some sensible feature selection decisions.
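One possible pipeline for these steps, assuming the constitutions data.frame loaded above; the specific feature-selection choices here (removing punctuation, numbers, symbols, and English stopwords) are just one sensible option, not the required answer:

```r
library(quanteda)

# Build a corpus from the data.frame, telling quanteda which column holds the text
constitutions_corpus <- corpus(constitutions, text_field = "preamble")

# Tokenise, applying some sensible feature selection decisions
constitutions_tokens <- tokens(constitutions_corpus,
                               remove_punct = TRUE,
                               remove_numbers = TRUE,
                               remove_symbols = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))

# Convert the tokens object to a document-feature matrix
constitutions_dfm <- dfm(constitutions_tokens)
```

Because country, continent, and year remain attached as document variables, you can later subset the dfm by country with dfm_subset().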
Use the topfeatures()
function to find the most
prevalent 10 features in the US constitution. Compare these features to
the top features for three other countries of your choice. What do you
notice?
Apply tf-idf weights to your dfm using the
dfm_tfidf()
function. Repeat the exercise above using the
new matrix. What do you notice?
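A sketch of this comparison, assuming a dfm of the preambles named constitutions_dfm (a hypothetical name) with country kept as a document variable; exactly how the US is spelled in the country variable is an assumption you should check against your own data:

```r
library(quanteda)

# Top 10 features in the US preamble (raw counts); first check how the
# US is actually spelled in your country variable
topfeatures(dfm_subset(constitutions_dfm,
                       country == "united states of america"), 10)

# The same comparison after tf-idf weighting
constitutions_tfidf <- dfm_tfidf(constitutions_dfm)
topfeatures(dfm_subset(constitutions_tfidf,
                       country == "united states of america"), 10)
```

Repeating the first call for a few other countries lets you compare which features dominate raw counts (often terms shared across many preambles) versus tf-idf weights (which up-weight more document-specific terms).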
Make two word clouds: one for the USA and one for another country
of your choice, using the textplot_wordcloud()
function. Marvel at how ugly
these are.6
The cosine similarity (\(\cos\theta\)) between two vectors \(\mathbf{a}\) and \(\mathbf{b}\) is defined as:
\[\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\left|\left| \mathbf{a} \right|\right| \left|\left| \mathbf{b} \right|\right|}\]
where \(\theta\) is the angle between the two vectors and \(\left|\left| \mathbf{a} \right|\right|\) and \(\left|\left| \mathbf{b} \right|\right|\) are the magnitudes of the vectors \(\mathbf{a}\) and \(\mathbf{b}\), respectively. In slightly more laborious, but possibly easier to understand, notation:
\[\cos\theta = \frac{a_1b_1 + a_2b_2 + ... + a_Jb_J}{\sqrt{a_1^2 + a_2^2 + ... + a_J^2} \times \sqrt{b_1^2 + b_2^2 + ... + b_J^2}}\]
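To make the formula concrete, here is a minimal base-R implementation of cosine similarity (in the exercise itself you will use quanteda's textstat_simil(), which computes this for you):

```r
# Cosine similarity between two numeric vectors, following the formula above
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(c(1, 2, 0), c(2, 4, 0))  # parallel vectors: returns 1
cosine_sim(c(1, 0), c(0, 1))        # orthogonal vectors: returns 0
```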
Use the textstat_simil() function to calculate the cosine similarity between the preamble for the US constitution and all other preambles in the data.7 Convert the output of this function to a numeric vector with the as.numeric() function and assign it to a new column in the original constitutions data.frame. Which 3 constitutions are most similar to the US? Which are the 3 least similar?8

There are a couple of coding nuances that you will need to tackle to complete this question.
First, you will need to convert the year
variable to
a decade
variable. You can do this by using the
%%
“modulo” operator, which calculates the remainder after
the division of two numeric variables. For instance,
1986 %% 10
will return a value of 6
. If you
subtract that from the original year, you will be left with the correct
decade (i.e. 1986 - 6 = 1980
).
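In code, the decade calculation is a one-liner; applied to the constitutions data it might look like this (the decade column name is just a suggestion):

```r
# The remainder after dividing by 10 is the final digit of the year
1986 %% 10            # 6
1986 - (1986 %% 10)   # 1980

# The same operation is vectorised, so it works on the whole column
constitutions$decade <- constitutions$year - (constitutions$year %% 10)
```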
Second, you will need to calculate the decade-level averages of
the cosine similarity variable that you created in answer to the
question above. To do so, you should use the group_by()
and
summarise()
functions. group_by() allows you to
allows you to
specify the variable by which the summarisation should be applied, and
the summarise()
function allows you to specify which type
of summary you wish to use (i.e. here you should be using the
mean()
function).
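Putting these pieces together, a sketch of the similarity calculation and the decade averages; constitutions_dfm, us_similarity, and the spelling of the US in the country variable are all assumptions to adapt to your own objects:

```r
library(quanteda)
library(quanteda.textstats)
library(tidyverse)

# Cosine similarity between every preamble (x) and the US preamble (y)
us_sim <- textstat_simil(x = constitutions_dfm,
                         y = dfm_subset(constitutions_dfm,
                                        country == "united states of america"),
                         method = "cosine")

# Store as a numeric column, then average by decade
constitutions$us_similarity <- as.numeric(us_sim)

decade_averages <- constitutions %>%
  mutate(decade = year - (year %% 10)) %>%
  group_by(decade) %>%
  summarise(mean_similarity = mean(us_similarity))
```

The decade_averages data.frame is then ready to be handed to ggplot for the line plot in the next part.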
Make a line plot (using geom_line() in ggplot) with the averages that you calculated above on the y-axis and with the decades on the x-axis. Have constitution preambles become less similar to the preamble of the US constitution over recent history?

In this question, we will use Naive Bayes models to predict whether
movies are positively or negatively reviewed. We will use a classic
computer science dataset of movie reviews (Pang and
Lee 2004). The movies corpus has an attribute sentiment
that labels each text as either pos
or neg
according to the star rating of the original archived imdb.com
review.
You can extract the relevant corpus object using the following line of code:
moviereviews <- quanteda.textmodels::data_corpus_moviereviews
Start by looking at the metadata included with this corpus using the
docvars()
function:
head(docvars(moviereviews))
## sentiment id1 id2
## 1 neg cv000 29416
## 2 neg cv001 19502
## 3 neg cv002 17424
## 4 neg cv003 12683
## 5 neg cv004 12641
## 6 neg cv005 29357
We will be using the sentiment
variable, which includes
information from a human-labelling of movie reviews as either positive
(pos
) or negative (neg
).
Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.

Convert the corpus into a document-feature matrix (dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. You will need to first convert the moviereviews corpus into a tokens object, using tokens().

Next, create a vector that randomly allocates each review to either the training set or the test set, using the code below. Consult ?sample to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!
set.seed(1234)
train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
Subset the dfm into a training set and a test set using the
vector you just created. Use the dfm_subset()
function to
achieve this.
Use the textmodel_nb()
function to train the Naive
Bayes classifier on the training dfm. You should use the dfm you created
for the training corpus as the x
argument to this function,
and the outcome (i.e. training_dfm$sentiment
) as the
y
argument.
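A sketch of the subsetting and training steps; movie_dfm, training_dfm, test_dfm, and nb_model are hypothetical names, and train is the logical vector created above:

```r
library(quanteda)
library(quanteda.textmodels)

# Split the dfm using the logical training-set indicator
training_dfm <- dfm_subset(movie_dfm, train)
test_dfm     <- dfm_subset(movie_dfm, !train)

# Train the Naive Bayes classifier on the training documents
nb_model <- textmodel_nb(x = training_dfm, y = training_dfm$sentiment)

# Word probabilities by class; sort to see the top "pos" words
head(sort(nb_model$param["pos", ], decreasing = TRUE), 10)
```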
Examine the param
element of the fitted model. Which
words have the highest probability under the pos
class?
Which words have the highest probability under the neg
class? You might find the sort()
function helpful
here.
Use the predict()
function to predict the sentiment
of movies in the test set dfm. The predict function takes two arguments
in this instance: 1) the estimated Naive Bayes model from part (e), and
2) the test-set dfm. Create a confusion matrix of the predicted classes
and the actual classes in the test data. What is the accuracy of your
model?
Use the confusionMatrix()
function to calculate
other statistics relevant to the predictive performance of your model.
The first argument to the confusionMatrix()
function should
be the confusion matrix that you created in answer to question (g). You
should also set the positive
argument equal to
"pos"
to tell R the level of the outcome that corresponds
to a “positive” result. Report the accuracy, sensitivity and
specificity of your predictions, giving a brief interpretation of
each.
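The prediction and evaluation steps might look like this, continuing the hypothetical names above (nb_model, test_dfm):

```r
library(caret)

# Predict sentiment for the held-out reviews
predicted_class <- predict(nb_model, newdata = test_dfm)

# Cross-tabulate predictions against the human labels
confusion <- table(predicted_class, test_dfm$sentiment)

# Accuracy, sensitivity, specificity, and related statistics
confusionMatrix(confusion, positive = "pos")
```

If predict() complains that the test dfm contains features not seen in training, dfm_match(test_dfm, featnames(training_dfm)) will align the two feature sets first.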
This problem set draws from material in Quantitative Social Science: An Introduction by Kosuke Imai.↩︎
The ntoken() function will be helpful
here.↩︎
The which.max()
function will be helpful
here.↩︎
The which.min()
function will be helpful
here.↩︎
You will need to compare the length of the preamble
variable to the year
variable in some way (a good-looking
plot would be nice!)↩︎
You may need to set the min_count
argument
to be a lower value than the default of 3 for the US constitution, as
that text is very short.↩︎
You can also provide this function with an
x
matrix and a y
vector. This will
enable you to calculate the similarity between all rows in
x
and the vector used for y
.↩︎
Use the order()
function to achieve this.
Look back at seminar 2 if you have forgotten how to use this function.↩︎