In this assignment, you will use R to understand and apply document classification and supervised scaling using R and quanteda.

Exercise 10.1 - Naive Bayes Classification of Movie Revies

We will start with a classic computer science dataset of movie reviews, (Pang and Lee 2004). The movies corpus has an attribute sentiment that labels each text as either pos or neg according to the original imdb.com archived newspaper review star rating.

To use this dataset, you will need to install the quanteda.textmodels package:

install.packages("quanteda.textmodels")
library(quanteda.textmodels)

You can extract the relevant corpus object using the following line of code:


moviereviews <- quanteda.textmodels::data_corpus_moviereviews

You should also load the quanteda package:

library(quanteda)

Start by looking at the metadata included with this corpus using the docvars() function:


head(docvars(moviereviews))
##   sentiment   id1   id2
## 1       neg cv000 29416
## 2       neg cv001 19502
## 3       neg cv002 17424
## 4       neg cv003 12683
## 5       neg cv004 12641
## 6       neg cv005 29357

We will be using the sentiment variable, which includes information from a human-labelling of movie reviews as either positive (pos) or negative (neg).

  1. Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.

  2. Use the code below to create a logical vector of the same length as the number of documents in the corpus. We will use this vector to define our training and test sets. Look at ?sample to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!


set.seed(1234)

train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
  1. Subset the corpus into a training set and a test set using the vector you just created. Use the square brackets to subset (i.e. my_corpus[vector,]) to do this. (Remember, if we use an exclamation point ! before a logical vector, it will reverse the TRUE and FALSE values.)
  1. Make a dfm for the training corpus (i.e. dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. Then make a dfm for the test corpus, and use the dfm_match() function to make sure that it contains the same set of features as the training dfm. See the example in the lecture if you are struggling, or consult the relevant help files.
  1. Use the textmodel_nb() function to train the Naive Bayes classifier on the training dfm. You should use the dfm you created for the training corpus as the x argument to this function, and the outcome (i.e. training_dfm$sentiment) as the y argument.

  2. Examine the param element of the fitted model. Which words have the highest probability under the pos class? Which words have the highest probability under the neg class? You might find the sort() function helpful here.

  3. Use the predict() function to predict the sentiment of movies in the test set dfm. The predict function takes two arguments in this instance: 1) the estimated Naive Bayes model from part (e), and 2) the test-set dfm. Create a confusion matrix of the predicted classes and the actual classes in the test data. What is the accuracy of your model?

  4. Load the caret package (install it first using install.packages() if you need to), and then use the confusionMatrix() function to calculate other statistics relevant to the predictive performance of your model. The first argument to the confusionMatrix() function should be the confusion matrix that you created in answer to question (g). You should also set the positive argument equal to "pos" to tell R the level of the outcome that corresponds to a “positive” result. Report the the accuracy, sensitivity and specificity of your predictions, giving a brief interpretation of each.

Exercise 10.2 - Wordscores for Movie Reviews

  1. We will now use the same training and test set to estimate wordscores for the movie reviews. First, create a new variable (named refscore) for the training set dfm which is equal to 1 for positive movie reviews, and -1 for negative movie reviews. These are the reference scores that we will use for training the model.

  2. Use the textmodel_wordscores() function to estimate wordscores on the training dfm. This function requires two arguments: 1) x, for the dfm you are using to estimate the model, and 2) y for the vector of reference scores associated with each training document (i.e. the variable you created in the answer above).

  3. Predict the wordscores for the test set using the predict function. Again, for predict() to work, you need to pass it the trained wordscores model, and the test set dfm. Save your predictions as a new metadata variable in your training data dfm.

  4. Use the docvars() function on your test set dfm to check that you have correctly assigned the predictions as meta data (hint: if you have done (c) correctly, then when you run str(docvars(my_test_set_dfm)) you should see a column containing the estimated wordscores).

  5. Use the boxplot() function to compare the distribution of wordscores against the “true” sentiment of the reviews given by human annotators (look at ?boxplot to see how to create the plot). Describe the resulting pattern.

  6. Look for examples of texts with positive wordscores (for instance, any text with a wordscore greater than 0.075) that are nonetheless categorised as “negative” by human readers. Look for examples of texts with negative wordscores (for instance, any text with a wordscore smaller than -0.03) that are nonetheless categorised as “positive” by human readers. Why do you think the model gave the wrong predictions in those cases?

    Hint: you may want to use logical relations to find the texts you are looking for. For instance, using my_vetor > 0.05 will return a logical vector which is equal to TRUE when my_vector is greater than 0.05 and FALSE otherwise. Similarly, my_vector == "a string I want" will return a logical vector which is equal to TRUE when my_vector is equal to “a string I want” and FALSE otherwise.

    Hint 2: You can extract the full texts from your corpus object by using as.character(my_corpus) with the appropriate subsetting operator (i.e. [,]).

Exercise 10.3 - Wordfish for Irish Parliamentary Debates (Hard question)

In this part of the assignment, you will use R to understand and apply unsupervised document scaling. Use the data_corpus_irishbudget2010 in quanteda.textmodels for this. You will also need to load (and possible install) the quanteda.textplots package first.

  1. Fit a wordfish model of all the documents in this corpus. Apply any required preprocessing steps first. Use the textplot_scale1d function to visualize the result. (You may want to use the advanced options of this function to get a better plot than just the default one.)

    What do you learn about what the dimension is capturing? You can use wikipedia to learn about the Irish parties involved in this debate to help you answer this question.

  2. Plot the wordfish “Eiffel Tower” plot (as in Figure 2 of Slapin and Proksch 2008), from the wordfish object. You can do this using the textplot_scale1d function. What is your interpretation of these results?

  3. Plot the log of the length in tokens of each text against the \(\hat{\alpha}\) from your estimated wordfish model. What does the relationship indicate? (Hint: you can use the ntoken() function on your dfm to extract the number of words in each text.)

  4. Plot the log of the frequency of the top most frequent 1000 words against the same psi-hat values from your estimated wordfish model, and describe the relationship. The topfeatures() function might be helpful here.