This exercise is designed to get you working with the quanteda package and some other
associated packages. The focus will be on exploring the package, getting
some texts into the corpus
object format, learning how to
convert texts into document-feature matrices, and performing descriptive
analyses on this data.
Presidential Inaugural Corpus –
inaugural.csv
This data includes the texts of 59 US presidential inaugural addresses from 1789 to the present. It also includes the following variables:
Variable | Description
---|---
Year | Year of inaugural address
President | President’s last name
FirstName | President’s first name (and possibly middle initial)
Party | Name of the President’s political party
text | Text of the inaugural address
You can load this file into R using the following command:
inaugural <- read.csv("inaugural.csv")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
You can also install the quanteda.corpora package from GitHub using the install_github() function from the devtools package:

devtools::install_github("quanteda/quanteda.corpora")
library(quanteda.corpora)
You can use the example() function for any function in the package to run the examples and see how the function works. Of course you should also browse the documentation, especially ?corpus, to see the structure of a corpus and how to construct one. The website http://quanteda.io has extensive documentation.

?corpus
example(dfm)
example(corpus)
A corpus object is the foundation for all the analysis we will be
doing in quanteda
. The first thing to do when you load some
text data into R is to convert it using the corpus()
function.
The simplest way to create a corpus is to use a set of texts
already present in R’s global environment. In our case, we previously
loaded the inaugural.csv
file and stored it as the
inaugural
object. Let’s have a look at this object to see
what it contains. Use the head()
function applied to the
inaugural
object and report the output. Which variable
includes the texts of the inaugural addresses?
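For instance, assuming the inaugural data frame loaded above, something like this will show the first few rows and the variable names:

head(inaugural)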
Use the corpus()
function on this set of texts to
create a new corpus. The first argument to corpus()
should
be the inaugural
object. You will also need to set the
text_field
to be equal to "text"
so that
quanteda knows that the text we are interested in is saved in that
variable.
inaugural_corpus <- corpus(inaugural, text_field = "text")
Once you have constructed this corpus, use the
summary()
method to see a brief description of the corpus.
Which inaugural address was the longest in terms of the number of
sentences?
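For example:

summary(inaugural_corpus)   # reports Types, Tokens, and Sentences for each address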
Note that although we specified text_field = "text"
when constructing the corpus, we have not removed the metadata
associated with the texts. To access the other variables, we can use the
docvars()
function applied to the corpus object that we
created above. Try this now.
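For example:

head(docvars(inaugural_corpus))   # first few rows of the document-level variables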
In order to count word frequencies, we first need to split the text
into words (or longer phrases) through a process known as
tokenization. Look at the documentation for
quanteda’s tokens()
function.
Use the tokens() command on the inaugural_corpus object, and examine the results.
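A minimal sketch (the object name inaugural_tokens is a choice; it is reused in later questions):

inaugural_tokens <- tokens(inaugural_corpus)
head(inaugural_tokens[[1]], 20)   # first 20 tokens of the first address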
Experiment with some of the arguments of the
tokens()
function, such as remove_punct
and
remove_numbers
.
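For example, you might try:

tokens(inaugural_corpus, remove_punct = TRUE, remove_numbers = TRUE)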
Try tokenizing inaugural_corpus into sentences, using tokens(x, what = "sentence").
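For example:

tokens(inaugural_corpus, what = "sentence")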
Use the kwic() function (for “keywords-in-context”) to explore how a specific word or phrase is used in this corpus (use the word-based tokenization that you implemented above). You can look at the help file (?kwic) to see the arguments that the function takes.

kwic(inaugural_tokens, "terror", 3)
Try substituting your own search terms.
We can also use the wildcard character * to expand our search by appending it to the end of the pattern we are searching for. For example, we could use "terror*". Try this now in the kwic() function.
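A minimal sketch, assuming the inaugural_tokens object from above:

kwic(inaugural_tokens, "terror*", 3)   # matches terror, terrors, terrorism, etc.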
dfm()
Document-feature matrices are the standard way of representing text as quantitative data. Fortunately, it is very simple to convert the tokens objects in quanteda into dfms.
Create a document-feature matrix, using dfm() applied to the inaugural_tokens object you created above. First, read the documentation using ?dfm to see the available options.
Once you have created the dfm, use the topfeatures()
function to inspect the top 20 most frequently occurring features in the
dfm. What kinds of words do you see?
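For example, something along these lines should work (object names as above):

inaugural_dfm <- dfm(inaugural_tokens)
topfeatures(inaugural_dfm, 20)   # 20 most frequent features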
Experiment with different dfm_*
functions, such as
dfm_wordstem()
, dfm_remove()
and
dfm_trim()
. These functions allow you to reduce the size of
the dfm following its construction. How does the number of features in
your dfm change as you apply these functions to the dfm object you
created in the question above?
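A sketch, assuming the inaugural_dfm created above; nfeat() reports the number of features after each step:

nfeat(inaugural_dfm)
nfeat(dfm_wordstem(inaugural_dfm))                       # stem the features
nfeat(dfm_remove(inaugural_dfm, stopwords("english")))   # drop stopwords
nfeat(dfm_trim(inaugural_dfm, min_termfreq = 5))         # drop rare features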
Use the dfm_remove()
function to remove
English-language stopwords from this data. You can get a list of English
stopwords by using stopwords("english")
.
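For example:

inaugural_dfm_nostop <- dfm_remove(inaugural_dfm, stopwords("english"))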
You can easily use quanteda to subset a corpus. There is a
corpus_subset()
method defined for a corpus, which works
just like R’s normal subset()
command. For instance if you
want a wordcloud of just Obama’s two inaugural addresses, you would need
to subset the corpus first:
obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus)
obama_dfm <- dfm(obama_tokens)
textplot_wordcloud(obama_dfm)
Try producing that plot without the stopwords and without
punctuation. To remove stopwords, use dfm_remove()
. To
remove punctuation, pass remove_punct = TRUE
to the
tokens()
function.
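One way to do this, with the object names assumed from the code above:

obama_tokens <- tokens(obama_corpus, remove_punct = TRUE)
obama_dfm <- dfm_remove(dfm(obama_tokens), stopwords("english"))
textplot_wordcloud(obama_dfm)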
We can plot the type-token ratio of the inaugural speeches over
time. To do this, begin by summarising each speech: apply the summary() function to the inaugural_corpus object and examine the results.
Get the type-token ratio for each text, and plot the resulting
vector of TTRs as a function of the Year variable. (Hint: see ?textstat_lexdiv.)
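A minimal sketch, assuming the inaugural_tokens object from above:

ttr <- textstat_lexdiv(inaugural_tokens, measure = "TTR")
plot(docvars(inaugural_corpus, "Year"), ttr$TTR,
     type = "b", xlab = "Year", ylab = "Type-token ratio")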
Use the corpus_subset()
function to select the
speeches given by presidents between 1900 and 1950. Then, using this
subset, measure the term similarities (textstat_simil
) for
the following words: economy, health, women.
Which other terms are most associated with each of these three
terms?
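One possible sketch (object names are choices, and it assumes all three terms appear in the subsetted dfm); textstat_simil() is computed on the feature margin and the result converted to a list of the most similar terms:

corpus_sub <- corpus_subset(inaugural_corpus, Year >= 1900 & Year <= 1950)
dfm_sub <- dfm(tokens(corpus_sub, remove_punct = TRUE))
sims <- textstat_simil(dfm_sub, dfm_sub[, c("economy", "health", "women")],
                       margin = "features", method = "correlation")
lapply(as.list(sims), head, 10)   # ten most associated terms for each word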
For this question, you will need to create a dfm of the inaugural corpus that you constructed above. If you have not yet done so, create this now using:
inaugural_tokens <- tokens(inaugural_corpus)
inaugural_dfm <- dfm(inaugural_tokens)
Next, use the dictionary() function to create a simple dictionary of articles and coordinating conjunctions:

pos_dict <- dictionary(list(articles = c("the", "a", "an"),
                            conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
To let this define a set of features, we can use this dictionary on
the dfm object we created above. To do so, apply the
dfm_lookup()
function to the relevant dfm object, with the
dictionary
argument equal to the pos_dict
created above:
pos_dfm <- dfm_lookup(inaugural_dfm, dictionary = pos_dict)
pos_dfm[1:10,]
## Document-feature matrix of: 10 documents, 2 features (0.00% sparse) and 4 docvars.
## features
## docs articles conjunctions
## text1 178 73
## text2 15 4
## text3 344 192
## text4 232 109
## text5 256 126
## text6 166 63
## [ reached max_ndoc ... 4 more documents ]
Plot the counts of articles and conjunctions (actually, here just
the coordinating conjunctions) across the speeches.
(Hint: you can use docvars(inaugural_corpus, "Year") for the x-axis.) Is the distribution of articles and conjunctions relatively constant across years, as you would expect?
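A sketch of one way to produce the plot in base R graphics (object names as above):

year <- docvars(inaugural_corpus, "Year")
pos_df <- convert(pos_dfm, to = "data.frame")

plot(year, pos_df$articles, type = "b", col = "blue",
     xlab = "Year", ylab = "Count")
lines(year, pos_df$conjunctions, type = "b", col = "red")
legend("topright", legend = c("articles", "conjunctions"),
       col = c("blue", "red"), lty = 1)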
The previous analysis uses the count of articles and
conjunctions, which depends on the length of the speech as longer
speeches will, on average, use more articles and conjunctions. To remove
this dependency, we can weight the document-feature matrix by document
length and re-compute. For this, we first have to compute the full dfm
(using dfm()
), then convert the counts to proportions of each document's length (using
dfm_weight()
with the scheme
argument equal to
"prop"
), and finally apply the dictionary (using
dfm_lookup()
). Apply these steps and then create a plot
showing the weighted counts of articles and conjunctions over
time.
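A sketch under the same assumptions as above (it reuses the year vector from the previous plot):

inaugural_dfm_prop <- dfm_weight(inaugural_dfm, scheme = "prop")
pos_dfm_prop <- dfm_lookup(inaugural_dfm_prop, dictionary = pos_dict)
pos_prop_df <- convert(pos_dfm_prop, to = "data.frame")

plot(year, pos_prop_df$articles, type = "b", col = "blue",
     xlab = "Year", ylab = "Proportion of tokens")
lines(year, pos_prop_df$conjunctions, type = "b", col = "red")
legend("topright", legend = c("articles", "conjunctions"),
       col = c("blue", "red"), lty = 1)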
Create a new dictionary capturing a concept of your own choosing (perhaps something like “democracy” or “optimism”). Apply this dictionary to the inaugural speeches data and plot the prevalence of that concept in speeches made by US Presidents over time.
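For example, a purely illustrative "democracy" dictionary might look like the following; the word list is an assumption rather than an authoritative lexicon, and glob patterns such as "democra*" are matched by dfm_lookup() by default:

democracy_dict <- dictionary(list(democracy = c("democra*", "vote*", "election*",
                                                "suffrage", "liberty", "freedom")))
dem_dfm <- dfm_lookup(dfm_weight(inaugural_dfm, scheme = "prop"),
                      dictionary = democracy_dict)
dem_df <- convert(dem_dfm, to = "data.frame")

plot(year, dem_df$democracy, type = "b",
     xlab = "Year", ylab = "Proportion of tokens")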