In this assignment, you will use R to calculate similarity metrics for text data, and apply document classification methods using quanteda.

You will need the following packages:

library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)
library(tidyverse)
library(caret)

Exercise 10.1 - Similarity Metrics for Analyzing the Preambles of Constitutions

Did the United States Constitution influence the constitutions of other countries? A growing body of scholarship suggests that the influence of the US Constitution has declined over time, as it diverges ever further from an emerging global consensus on the importance of human rights to constitutional settlements. However, there is little empirical and systematic knowledge about the extent to which the US Constitution shapes the revision and adoption of formal constitutions across the world.1

David S. Law and Mila Versteeg (2012) investigate the influence of the US constitution empirically and show that other countries have, in recent decades, become increasingly unlikely to model the rights-related provisions of their own constitutions upon those found in the US Constitution. In this problem set, we will use some of the methods that we covered this week to replicate some parts of their analysis.

We will use the constitutions.csv file for this question.

Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:


constitutions <- read_csv("constitutions.csv")

This file contains the preambles of 155 (English-translated) constitutions. The data contains the following variables:

Variables in the constitution data:

Variable    Description
country     Name of the country
continent   Continent of the country
year        Year in which the constitution was written
preamble    Text of the preamble of the constitution

You can take a quick look at the variables in the data by using the glimpse() function from the tidyverse package:


glimpse(constitutions)
## Rows: 155
## Columns: 4
## $ country   <chr> "afghanistan", "albania", "algeria", "andorra", "angola", "a…
## $ continent <chr> "Asia", "Europe", "Africa", "Europe", "Africa", "Americas", …
## $ year      <dbl> 2004, 1998, 1989, 1993, 2010, 1981, 1853, 1995, 1995, 1973, …
## $ preamble  <chr> "In the name of Allah, the Most Beneficent, the Most Mercifu…

Tf-idf

  1. Explore the constitutions object to get a sense of the data that we are working with. What is the average length of the texts stored in the preamble variable?2 Which country has the longest preamble text?3 Which has the shortest?4 Has the average length of these preambles changed over time?5

  2. Convert the constitutions data.frame into a corpus() object and then into a dfm() object (remember that you will need to use the tokens() function as well). Make some sensible feature selection decisions.

  3. Use the topfeatures() function to find the 10 most prevalent features in the US constitution. Compare these features to the top features for three other countries of your choice. What do you notice?

  4. Apply tf-idf weights to your dfm using the dfm_tfidf() function. Repeat the exercise above using the new matrix. What do you notice?

  5. Make two word clouds using the textplot_wordcloud() function: one for the USA, and one for another country of your choice. Marvel at how ugly these are.6
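The full pipeline for the questions above can be sketched as follows. This is only one reasonable set of choices: the stopword and punctuation removal are illustrative rather than required, and the exact country label used to pick out the US document ("united states of america") is an assumption you should check against your own data.

```r
# Sketch of the corpus -> tokens -> dfm -> tf-idf pipeline.
# Feature selection choices are illustrative; the US country label is assumed.
library(quanteda)
library(tidyverse)

constitutions <- read_csv("constitutions.csv")

# Build a corpus from the preamble text
constitution_corpus <- corpus(constitutions, text_field = "preamble")

# Average preamble length, in tokens
mean(ntoken(constitution_corpus))

# Corpus -> tokens -> dfm, with some sensible feature selection
constitution_dfm <- constitution_corpus |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm()

# Top 10 features for the US, by raw counts and then by tf-idf
topfeatures(dfm_subset(constitution_dfm,
                       country == "united states of america"), 10)

constitution_tfidf <- dfm_tfidf(constitution_dfm)
topfeatures(dfm_subset(constitution_tfidf,
                       country == "united states of america"), 10)
```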

Cosine Similarity

The cosine similarity (\(cos(\theta)\)) between two vectors \(\textbf{a}\) and \(\textbf{b}\) is defined as:

\[cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\left|\left| \mathbf{a} \right|\right| \left|\left| \mathbf{b} \right|\right|}\]

where \(\theta\) is the angle between the two vectors and \(\left|\left| \mathbf{a} \right|\right|\) and \(\left|\left| \mathbf{b} \right|\right|\) are the magnitudes of the vectors \(\mathbf{a}\) and \(\mathbf{b}\), respectively. In slightly more laborious, but possibly easier to understand, notation:

\[cos(\theta) = \frac{a_1b_1 + a_2b_2 + ... + a_Jb_J}{\sqrt{a_1^2 + a_2^2 + ... + a_J^2} \times \sqrt{b_1^2 + b_2^2 + ... + b_J^2}}\]
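To make the formula concrete, here is a small base-R example that computes the cosine similarity between two toy term-count vectors directly from the definition (the vectors themselves are made up for illustration):

```r
# Two toy term-count vectors (think of them as rows of a small dfm)
a <- c(2, 1, 0, 3)
b <- c(1, 1, 1, 1)

# Cosine similarity: dot product divided by the product of the magnitudes
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim  # approximately 0.802
```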

  1. Use the textstat_simil() function to calculate the cosine similarity between the preamble for the US constitution and all other preambles in the data.7 Assign the output of this function to the original constitutions data.frame using the as.numeric() function. Which 3 constitutions are most similar to the US? Which are the 3 least similar?8

  2. Calculate the average cosine similarity between the constitution of the US and the constitutions of other countries for each decade in the data, for all constitutions written from the 1950s onwards.

There are a couple of coding nuances that you will need to tackle to complete this question.

  3. Create a line graph (geom_line() in ggplot) with the averages that you calculated above on the y-axis and the decades on the x-axis. Have constitution preambles become less similar to the preamble of the US constitution over recent history?
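Assuming constitution_dfm is the document-feature matrix you built earlier from the constitutions data (and, again, that the US document's country label is "united states of america" — verify this in your own data), the similarity calculation might look like:

```r
# Sketch of the textstat_simil() approach; the US country label is an assumption
library(quanteda)
library(quanteda.textstats)

us_dfm <- dfm_subset(constitution_dfm,
                     country == "united states of america")

# Cosine similarity between every document in x and the single document in y
similarities <- textstat_simil(x = constitution_dfm,
                               y = us_dfm,
                               method = "cosine")

# Attach the similarities to the original data frame
constitutions$cosine_us <- as.numeric(similarities)

# Three most similar and three least similar preambles
constitutions$country[order(constitutions$cosine_us, decreasing = TRUE)[1:3]]
constitutions$country[order(constitutions$cosine_us)[1:3]]
```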

Exercise 10.2 - Naive Bayes Classification of Movie Reviews

In this question, we will use Naive Bayes models to predict whether movies are positively or negatively reviewed. We will use a classic computer science dataset of movie reviews (Pang and Lee 2004). The movie reviews corpus has an attribute, sentiment, that labels each text as either pos or neg according to the star rating of the original review archived from imdb.com.

You can extract the relevant corpus object using the following line of code:


moviereviews <- quanteda.textmodels::data_corpus_moviereviews

Start by looking at the metadata included with this corpus using the docvars() function:


head(docvars(moviereviews))
##   sentiment   id1   id2
## 1       neg cv000 29416
## 2       neg cv001 19502
## 3       neg cv002 17424
## 4       neg cv003 12683
## 5       neg cv004 12641
## 6       neg cv005 29357

We will be using the sentiment variable, which includes information from a human-labelling of movie reviews as either positive (pos) or negative (neg).

  (a) Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.

  (b) Make a dfm for this corpus (i.e. dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. You will need to first convert the moviereviews corpus into a tokens object, using tokens().

  (c) Use the code below to create a logical vector of the same length as the number of documents in the corpus. We will use this vector to define our training and test sets. Look at ?sample to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!

set.seed(1234)

train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))

  (d) Subset the dfm into a training set and a test set using the vector you just created. Use the dfm_subset() function to achieve this.

  (e) Use the textmodel_nb() function to train the Naive Bayes classifier on the training dfm. You should use the dfm you created for the training corpus as the x argument to this function, and the outcome (i.e. training_dfm$sentiment) as the y argument.

  (f) Examine the param element of the fitted model. Which words have the highest probability under the pos class? Which words have the highest probability under the neg class? You might find the sort() function helpful here.

  (g) Use the predict() function to predict the sentiment of movies in the test-set dfm. The predict function takes two arguments in this instance: 1) the estimated Naive Bayes model from part (e), and 2) the test-set dfm. Create a confusion matrix of the predicted classes and the actual classes in the test data. What is the accuracy of your model?

  (h) Use the confusionMatrix() function to calculate other statistics relevant to the predictive performance of your model. The first argument to the confusionMatrix() function should be the confusion matrix that you created in answer to part (g). You should also set the positive argument equal to "pos" to tell R the level of the outcome that corresponds to a “positive” result. Report the accuracy, sensitivity and specificity of your predictions, giving a brief interpretation of each.
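Putting the steps above together, a minimal sketch of the workflow might look like the following. The feature selection choices are illustrative, and the variable names (movie_dfm, nb_model, and so on) are just one possible convention:

```r
# Sketch of the Naive Bayes workflow; feature selection choices are illustrative
library(quanteda)
library(quanteda.textmodels)
library(caret)

moviereviews <- quanteda.textmodels::data_corpus_moviereviews

# Tokens -> dfm with some basic feature selection
movie_dfm <- moviereviews |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm()

# Random training/test split
set.seed(1234)
train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))

training_dfm <- dfm_subset(movie_dfm, train)
test_dfm <- dfm_subset(movie_dfm, !train)

# Fit the Naive Bayes classifier
nb_model <- textmodel_nb(x = training_dfm, y = training_dfm$sentiment)

# Highest-probability words under each class
head(sort(nb_model$param["pos", ], decreasing = TRUE))
head(sort(nb_model$param["neg", ], decreasing = TRUE))

# Predict on the test set and evaluate
predicted <- predict(nb_model, newdata = test_dfm)
confusion <- table(predicted, test_dfm$sentiment)
confusionMatrix(confusion, positive = "pos")
```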


  1. This problem set draws from material in Quantitative Social Science: An Introduction by Kosuke Imai.↩︎

  2. The ntoken() function will be helpful here.↩︎

  3. The which.max() function will be helpful here.↩︎

  4. The which.min() function will be helpful here.↩︎

  5. You will need to compare the length of the preambles variable to the year variable in some way (a good-looking plot would be nice!)↩︎

  6. You may need to set the min_count argument to be a lower value than the default of 3 for the US constitution, as that text is very short.↩︎

  7. You can also provide this function with an x matrix and a y vector. This will enable you to calculate the similarity between all rows in x and the vector used for y.↩︎

  8. Use the order() function to achieve this. Look back at seminar 2 if you have forgotten how to use this function.↩︎