In this assignment, you will use R and quanteda to understand and apply document classification and supervised scaling.
We will start with a classic computer science dataset of movie reviews (Pang and Lee 2004). The movies corpus has an attribute sentiment that labels each text as either pos or neg according to the star rating of the original imdb.com archived newsgroup review.
To use this dataset, you will need to install the quanteda.textmodels package:
install.packages("quanteda.textmodels")
library(quanteda.textmodels)
You can extract the relevant corpus object using the following line of code:
moviereviews <- quanteda.textmodels::data_corpus_moviereviews
You should also load the quanteda package:
library(quanteda)
Start by looking at the metadata included with this corpus using the docvars() function:
head(docvars(moviereviews))
## sentiment id1 id2
## 1 neg cv000 29416
## 2 neg cv001 19502
## 3 neg cv002 17424
## 4 neg cv003 12683
## 5 neg cv004 12641
## 6 neg cv005 29357
We will be using the sentiment variable, which records a human labelling of each movie review as either positive (pos) or negative (neg).
Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.
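One way to tabulate the labels is a sketch like the following, assuming the corpus was extracted into an object named moviereviews as shown above:

```r
# Count the positive and negative reviews in the sentiment docvar
table(docvars(moviereviews, "sentiment"))
```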
Use the code below to create a logical vector of the same length as the number of documents in the corpus. We will use this vector to define our training and test sets. Look at ?sample to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!
set.seed(1234)
train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
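You can sanity-check the resulting split with base R before going further; table() counts the TRUE (training) and FALSE (test) cases, and mean() of a logical vector gives the realised training share:

```r
# Reproduce the split and inspect it
set.seed(1234)
train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
table(train)   # counts of FALSE and TRUE
mean(train)    # realised training proportion, close to 0.75
```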
Use the logical vector to split the corpus into a training corpus and a test corpus. You can subset a corpus object with a logical vector (i.e. my_corpus[vector]) to do this. (Remember, if we use an exclamation point ! before a logical vector, it will reverse the TRUE and FALSE values.)
Create a dfm for the training corpus (dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. Then make a dfm for the test corpus, and use the dfm_match() function to make sure that it contains the same set of features as the training dfm. See the example in the lecture if you are struggling, or consult the relevant help files.
Use the textmodel_nb() function to train the Naive Bayes classifier on the training dfm. You should use the dfm you created for the training corpus as the x argument to this function, and the outcome (i.e. training_dfm$sentiment) as the y argument.
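One possible pipeline for these steps is sketched below. The object names (train_corpus, train_dfm, nb_model, and so on) and the specific feature-selection choices (removing punctuation and English stopwords, trimming rare features) are illustrative assumptions, not the only reasonable ones:

```r
# Split the corpus using the logical vector from above
train_corpus <- moviereviews[train]
test_corpus  <- moviereviews[!train]

# Tokenise, apply some feature selection, and build the training dfm
train_dfm <- tokens(train_corpus, remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_trim(min_docfreq = 5)

# Build the test dfm and align its features with the training dfm
test_dfm <- tokens(test_corpus, remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_match(features = featnames(train_dfm))

# Train the Naive Bayes classifier
nb_model <- textmodel_nb(x = train_dfm, y = train_dfm$sentiment)
```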
Examine the param element of the fitted model. Which words have the highest probability under the pos class? Which words have the highest probability under the neg class? You might find the sort() function helpful here.
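A sketch of one way to do this, assuming the fitted model is stored as nb_model; in textmodel_nb() output, param is a class-by-feature matrix of word probabilities:

```r
# Ten highest-probability words under each class
head(sort(nb_model$param["pos", ], decreasing = TRUE), 10)
head(sort(nb_model$param["neg", ], decreasing = TRUE), 10)
```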
Use the predict() function to predict the sentiment of movies in the test set dfm. The predict function takes two arguments in this instance: 1) the estimated Naive Bayes model from part (e), and 2) the test-set dfm. Create a confusion matrix of the predicted classes and the actual classes in the test data. What is the accuracy of your model?
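A minimal sketch, assuming the fitted model is nb_model and the matched test dfm is test_dfm (both illustrative names):

```r
# Predict test-set classes and cross-tabulate against the true labels
predicted <- predict(nb_model, newdata = test_dfm)
confusion <- table(predicted, test_dfm$sentiment)
confusion

# Accuracy: proportion of correctly classified documents
sum(diag(confusion)) / sum(confusion)
```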
Load the caret package (install it first using install.packages() if you need to), and then use the confusionMatrix() function to calculate other statistics relevant to the predictive performance of your model. The first argument to the confusionMatrix() function should be the confusion matrix that you created in answer to question (g). You should also set the positive argument equal to "pos" to tell R the level of the outcome that corresponds to a “positive” result. Report the accuracy, sensitivity and specificity of your predictions, giving a brief interpretation of each.
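Assuming your confusion matrix is stored as a table named confusion (predictions as rows, actual classes as columns), the call looks like:

```r
library(caret)

# Accuracy, sensitivity, specificity, and related statistics
confusionMatrix(confusion, positive = "pos")
```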
We will now use the same training and test set to estimate wordscores for the movie reviews. First, create a new variable (named refscore) for the training set dfm which is equal to 1 for positive movie reviews, and -1 for negative movie reviews. These are the reference scores that we will use for training the model.
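One way to create this variable, assuming the training dfm is named train_dfm; quanteda allows $ assignment to set a docvar directly on a dfm:

```r
# Reference scores: +1 for positive reviews, -1 for negative reviews
train_dfm$refscore <- ifelse(train_dfm$sentiment == "pos", 1, -1)
```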
Use the textmodel_wordscores() function to estimate wordscores on the training dfm. This function requires two arguments: 1) x, for the dfm you are using to estimate the model, and 2) y for the vector of reference scores associated with each training document (i.e. the variable you created in the answer above).
Predict the wordscores for the test set using the predict function. Again, for predict() to work, you need to pass it the trained wordscores model, and the test set dfm. Save your predictions as a new metadata variable in your test set dfm.
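A sketch of both steps together, assuming the objects are named train_dfm and test_dfm and the refscore docvar was created as described above:

```r
# Estimate wordscores on the training dfm
ws_model <- textmodel_wordscores(x = train_dfm, y = train_dfm$refscore)

# Predict for the test set and store the scores as a docvar
test_dfm$ws <- predict(ws_model, newdata = test_dfm)
```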
Use the docvars() function on your test set dfm to check that you have correctly assigned the predictions as metadata (hint: if you have done (c) correctly, then when you run str(docvars(my_test_set_dfm)) you should see a column containing the estimated wordscores).
Use the boxplot() function to compare the distribution of wordscores against the “true” sentiment of the reviews given by human annotators (look at ?boxplot to see how to create the plot). Describe the resulting pattern.
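Assuming the predicted scores were saved as a docvar named ws on test_dfm, the formula interface to boxplot() gives one box per sentiment label:

```r
# Distribution of predicted wordscores by human-coded sentiment
boxplot(ws ~ sentiment, data = docvars(test_dfm),
        xlab = "Human-coded sentiment", ylab = "Predicted wordscore")
```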
Look for examples of texts with positive wordscores (for instance, any text with a wordscore greater than 0.075) that are nonetheless categorised as “negative” by human readers. Look for examples of texts with negative wordscores (for instance, any text with a wordscore smaller than -0.03) that are nonetheless categorised as “positive” by human readers. Why do you think the model gave the wrong predictions in those cases?
Hint: you may want to use logical relations to find the texts you are looking for. For instance, using my_vector > 0.05 will return a logical vector which is equal to TRUE when my_vector is greater than 0.05 and FALSE otherwise. Similarly, my_vector == "a string I want" will return a logical vector which is equal to TRUE when my_vector is equal to “a string I want” and FALSE otherwise.
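Combining such logical conditions, and again assuming test_dfm carries a ws docvar with the predicted wordscores, you could locate the mismatched texts like this (thresholds taken from the question):

```r
# Texts scored positive by the model but labelled "neg" by humans
false_pos <- which(test_dfm$ws > 0.075 & test_dfm$sentiment == "neg")

# Texts scored negative by the model but labelled "pos" by humans
false_neg <- which(test_dfm$ws < -0.03 & test_dfm$sentiment == "pos")
```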
Hint 2: You can extract the full texts from your corpus object by using as.character(my_corpus) with the appropriate subsetting operator (i.e. [ ]).
In this part of the assignment, you will use R to understand and apply unsupervised document scaling. Use the data_corpus_irishbudget2010 corpus in quanteda.textmodels for this. You will also need to load (and possibly install) the quanteda.textplots package first.
Fit a wordfish model of all the documents in this corpus. Apply any required preprocessing steps first. Use the textplot_scale1d function to visualize the result. (You may want to use the advanced options of this function to get a better plot than just the default one.)
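One possible sketch, assuming minimal preprocessing (removing punctuation) and grouping the document estimates by the party docvar; the object names are illustrative:

```r
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)

# Build a dfm from the Irish budget debate corpus
irish_dfm <- data_corpus_irishbudget2010 |>
  tokens(remove_punct = TRUE) |>
  dfm()

# Fit the wordfish model and plot the document positions by party
wf_model <- textmodel_wordfish(irish_dfm)
textplot_scale1d(wf_model, groups = irish_dfm$party)
```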
What do you learn about what the dimension is capturing? You can use Wikipedia to learn about the Irish parties involved in this debate to help you answer this question.
Plot the wordfish “Eiffel Tower” plot (as in Figure 2 of Slapin and Proksch 2008), from the wordfish object. You can do this using the textplot_scale1d function. What is your interpretation of these results?
Plot the log of the length in tokens of each text against the \(\hat{\alpha}\) from your estimated wordfish model. What does the relationship indicate? (Hint: you can use the ntoken() function on your dfm to extract the number of words in each text.)
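A sketch, assuming the dfm and fitted model are named irish_dfm and wf_model; the fitted wordfish object stores the document fixed effects in its alpha component:

```r
# Log document length against the estimated document fixed effects
plot(log(ntoken(irish_dfm)), wf_model$alpha,
     xlab = "Log tokens per document",
     ylab = expression(hat(alpha)))
```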
Plot the log of the frequency of the 1000 most frequent words against the \(\hat{\psi}\) values from your estimated wordfish model, and describe the relationship. The topfeatures() function might be helpful here.
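A sketch under the same assumed object names; the word fixed effects sit in the psi component of the fitted model, aligned with its features vector, so the top words need to be matched by name:

```r
# Frequencies of the 1000 most frequent words
top_freq <- topfeatures(irish_dfm, 1000)

# Match those words to their estimated psi values
idx <- match(names(top_freq), wf_model$features)

plot(log(top_freq), wf_model$psi[idx],
     xlab = "Log word frequency",
     ylab = expression(hat(psi)))
```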