This exercise is designed to get you working with the quanteda package and some other associated packages. The focus will be on exploring the package, getting some texts into the corpus object format, learning how to convert texts into document-feature matrices, and performing descriptive analyses on this data.

Data

Presidential Inaugural Corpus: inaugural.csv

This data includes the texts of 59 US presidential inaugural addresses from 1789 to the present. It also includes the following variables:

Variable Description
Year Year of inaugural address
President President’s last name
FirstName President’s first name (and possibly middle initial)
Party Name of the President’s political party
text Text of the inaugural address

You can load this file into R using the following command:

inaugural <- read.csv("inaugural.csv")

1. Getting Started.

  1. You will first need to install and load the following packages:
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
  2. You will also need to install the quanteda.corpora package from GitHub, using the install_github() function from the devtools package:
devtools::install_github("quanteda/quanteda.corpora")
library(quanteda.corpora)
  3. Exploring quanteda functions. Look at the Quick Start vignette and browse the manual for quanteda. You can use the example() function for any function in the package to run its examples and see how the function works. You should also browse the documentation, especially ?corpus, to see how a corpus is constructed and what operations it supports. The website http://quanteda.io has extensive documentation.
?corpus
example(dfm)
example(corpus)

2. Making a corpus and corpus structure

A corpus object is the foundation for all the analysis we will be doing in quanteda. The first thing to do after loading some text data into R is to convert it into a corpus using the corpus() function.

  1. The simplest way to create a corpus is to use a set of texts already present in R’s global environment. In our case, we previously loaded the inaugural.csv file and stored it as the inaugural object. Let’s have a look at this object to see what it contains. Use the head() function applied to the inaugural object and report the output. Which variable includes the texts of the inaugural addresses?
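
For example, using the inaugural data frame loaded earlier:

head(inaugural)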

  2. Use the corpus() function on this set of texts to create a new corpus. The first argument to corpus() should be the inaugural object. You will also need to set the text_field to be equal to "text" so that quanteda knows that the text we are interested in is saved in that variable.

inaugural_corpus <- corpus(inaugural, text_field = "text")
  3. Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. Which inaugural address was the longest in terms of the number of sentences?
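
A minimal sketch of one way to answer this; summary() returns a data frame whose Sentences column gives the sentence count for each address:

corpus_summary <- summary(inaugural_corpus)
corpus_summary[which.max(corpus_summary$Sentences), ]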

  4. Note that although we specified text_field = "text" when constructing the corpus, we have not removed the metadata associated with the texts. To access the other variables, we can use the docvars() function applied to the corpus object that we created above. Try this now.
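
For example:

head(docvars(inaugural_corpus))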

3. Tokenizing texts

In order to count word frequencies, we first need to split the text into words (or longer phrases) through a process known as tokenization. Look at the documentation for quanteda’s tokens() function.

  1. Use the tokens() function on the inaugural_corpus object, save the result as an object called inaugural_tokens, and examine the results.
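
A sketch of this step:

inaugural_tokens <- tokens(inaugural_corpus)
inaugural_tokens[1]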

  2. Experiment with some of the arguments of the tokens() function, such as remove_punct and remove_numbers.
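
For example, to drop punctuation and numbers during tokenization (tokens_clean is just an illustrative name):

tokens_clean <- tokens(inaugural_corpus, remove_punct = TRUE, remove_numbers = TRUE)
tokens_clean[1]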

  3. Try tokenizing inaugural_corpus into sentences, using tokens(x, what = "sentence").
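
For example (inaugural_sentences is just an illustrative name):

inaugural_sentences <- tokens(inaugural_corpus, what = "sentence")
inaugural_sentences[1]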

4. Explore some phrases in the text.

  1. quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Use the kwic() function (for “keywords-in-context”) to explore how a specific word or phrase is used in this corpus (use the word-based tokenization that you implemented above). You can look at the help file (?kwic) to see the arguments that the function takes.
kwic(inaugural_tokens, "terror", 3)

Try substituting your own search terms.

  2. By default, kwic() gives exact matches for a given pattern. What if we wanted to see words like “terrorism” and “terrorist” rather than exactly “terror”? We can use the wildcard character * to expand our search by appending it to the end of the search pattern. For example, we could use "terror*". Try this now in the kwic() function.
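
For example, using the word-based tokens object created above:

kwic(inaugural_tokens, "terror*", 3)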

5. Creating a dfm()

Document-feature matrices are the standard way of representing text as quantitative data. Fortunately, it is very simple to convert the tokens objects in quanteda into dfms.

  1. Create a document-feature matrix, using dfm() applied to the inaugural_tokens object you created above. First, read the documentation using ?dfm to see the available options. Once you have created the dfm, use the topfeatures() function to inspect the 20 most frequently occurring features in the dfm. What kinds of words do you see?
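
A sketch of these steps:

inaugural_dfm <- dfm(inaugural_tokens)
topfeatures(inaugural_dfm, 20)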

  2. Experiment with different dfm_* functions, such as dfm_wordstem(), dfm_remove() and dfm_trim(). These functions allow you to reduce the size of the dfm following its construction. How does the number of features in your dfm change as you apply these functions to the dfm object you created in the question above?
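
For example, compare the number of features before and after stemming and trimming (the min_termfreq value here is just one possible choice):

nfeat(inaugural_dfm)
nfeat(dfm_wordstem(inaugural_dfm))
nfeat(dfm_trim(inaugural_dfm, min_termfreq = 5))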

  3. Use the dfm_remove() function to remove English-language stopwords from this data. You can get a list of English stopwords by using stopwords("english").
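
For example (inaugural_dfm_nostop is just an illustrative name):

inaugural_dfm_nostop <- dfm_remove(inaugural_dfm, stopwords("english"))
nfeat(inaugural_dfm_nostop)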

  4. You can easily use quanteda to subset a corpus. There is a corpus_subset() method defined for a corpus, which works just like R’s normal subset() command. For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:

obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus)
obama_dfm <- dfm(obama_tokens)
textplot_wordcloud(obama_dfm)

Try producing that plot without the stopwords and without punctuation. To remove stopwords, use dfm_remove(). To remove punctuation, pass remove_punct = TRUE to the tokens() function.
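
A sketch of that cleaned-up plot (object names are just illustrative):

obama_tokens_clean <- tokens(obama_corpus, remove_punct = TRUE)
obama_dfm_clean <- dfm_remove(dfm(obama_tokens_clean), stopwords("english"))
textplot_wordcloud(obama_dfm_clean)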

6. Descriptive statistics

  1. We can plot the type-token ratio of the inaugural speeches over time. To do this, begin by applying the summary() function to the inaugural_corpus object and examining the results.

  2. Get the type-token ratio for each text, and plot the resulting vector of TTRs as a function of Year. Hint: see ?textstat_lexdiv.
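
One possible sketch, assuming the inaugural_tokens object created earlier (textstat_lexdiv() also accepts a dfm):

ttr <- textstat_lexdiv(inaugural_tokens, measure = "TTR")
plot(docvars(inaugural_corpus, "Year"), ttr$TTR, type = "b",
     xlab = "Year", ylab = "Type-token ratio")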

  3. Use the corpus_subset() function to select the speeches given by presidents between 1900 and 1950. Then, using this subset, measure the term similarities (textstat_simil) for the following words: economy, health, women. Which other terms are most associated with each of these three terms?
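
A sketch of one way to approach this; the object names, the "correlation" method, and the 10-term cutoff are just choices, and the column selection assumes all three terms appear in the subsetted dfm:

mid_corpus <- corpus_subset(inaugural_corpus, Year >= 1900 & Year <= 1950)
mid_dfm <- dfm(tokens(mid_corpus, remove_punct = TRUE))
sims <- textstat_simil(mid_dfm, mid_dfm[, c("economy", "health", "women")],
                       margin = "features", method = "correlation")
lapply(as.list(sims), head, n = 10)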

7. Working with dictionaries

For this question, you will need to create a dfm of the inaugural corpus that you constructed above. If you have not yet done so, create this now using:

inaugural_tokens <- tokens(inaugural_corpus)
inaugural_dfm <- dfm(inaugural_tokens)
  1. Dictionaries are named lists, consisting of a “key” and a set of entries defining the equivalence class for that key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:
pos_dict <- dictionary(list(articles = c("the", "a", "an"),
                            conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))

To use this dictionary to define a set of features, we can apply it to the dfm object we created above. To do so, use the dfm_lookup() function on the relevant dfm, with the dictionary argument equal to the pos_dict created above:

pos_dfm <- dfm_lookup(inaugural_dfm, dictionary = pos_dict)
pos_dfm[1:10,]
## Document-feature matrix of: 10 documents, 2 features (0.00% sparse) and 4 docvars.
##        features
## docs    articles conjunctions
##   text1      178           73
##   text2       15            4
##   text3      344          192
##   text4      232          109
##   text5      256          126
##   text6      166           63
## [ reached max_ndoc ... 4 more documents ]
  2. Plot the counts of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (Hint: you can use docvars(inaugural_corpus, "Year") for the x-axis.) Is the frequency of articles and conjunctions relatively constant across years, as you would expect?
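
A sketch using base R plotting; convert() turns the dfm into a regular data frame, and the colours and legend placement are just choices:

pos_counts <- convert(pos_dfm, to = "data.frame")
year <- docvars(inaugural_corpus, "Year")
plot(year, pos_counts$articles, type = "b", col = "blue",
     xlab = "Year", ylab = "Count")
lines(year, pos_counts$conjunctions, type = "b", col = "red")
legend("topleft", legend = c("articles", "conjunctions"),
       col = c("blue", "red"), lty = 1)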

  3. The previous analysis uses the raw counts of articles and conjunctions, which depend on the length of the speech, as longer speeches will, on average, use more articles and conjunctions. To remove this dependency, we can weight the document-feature matrix by document length and re-compute. To do this, first compute the full dfm (using dfm()), then weight it so that each feature count becomes a proportion of the document's total tokens (using dfm_weight() with the scheme argument equal to "prop"), and finally apply the dictionary (using dfm_lookup()). Apply these steps and then create a plot showing the weighted counts of articles and conjunctions over time.
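
A sketch of these steps (again, the plotting choices are only illustrative):

prop_dfm <- dfm_weight(inaugural_dfm, scheme = "prop")
pos_prop <- convert(dfm_lookup(prop_dfm, dictionary = pos_dict), to = "data.frame")
year <- docvars(inaugural_corpus, "Year")
plot(year, pos_prop$articles, type = "b", col = "blue",
     xlab = "Year", ylab = "Proportion of tokens")
lines(year, pos_prop$conjunctions, type = "b", col = "red")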

  4. Create a new dictionary capturing a concept of your own choosing (perhaps something like “democracy” or “optimism”). Apply this dictionary to the inaugural speeches data and plot the prevalence of that concept in speeches made by US Presidents over time.
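
For example, a hypothetical "optimism" dictionary; the terms here are only illustrative, and dfm_lookup() uses glob-style pattern matching by default, so "optimis*" matches "optimism", "optimistic", and so on:

optimism_dict <- dictionary(list(optimism = c("hope", "hopeful", "optimis*",
                                              "bright", "promise", "prosper*")))
optimism_prop <- convert(dfm_lookup(dfm_weight(inaugural_dfm, scheme = "prop"),
                                    dictionary = optimism_dict), to = "data.frame")
plot(docvars(inaugural_corpus, "Year"), optimism_prop$optimism, type = "b",
     xlab = "Year", ylab = "Share of tokens")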