This exercise is designed to get you working with the [quanteda](https://quanteda.io) package and some other associated packages. The focus will be on exploring the package, getting some texts into the `corpus` object format, learning how to convert texts into document-feature matrices, and performing descriptive analyses on this data.

### Data

**Presidential Inaugural Corpus** -- `inaugural.csv`

This data includes the texts of 59 US presidential inaugural addresses from 1789 to the present. It also includes the following variables:

| Variable    | Description                                           |
|:------------|:------------------------------------------------------|
| `Year`      | Year of inaugural address                             |
| `President` | President's last name                                 |
| `FirstName` | President's first name (and possibly middle initial)  |
| `Party`     | Name of the President's political party               |
| `text`      | Text of the inaugural address                         |

Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:

```{r, echo = TRUE, eval = FALSE}
inaugural <- read.csv("inaugural.csv")
```

```{r, echo = FALSE, warning=FALSE, message=FALSE, eval = TRUE}
inaugural <- read.csv("data/inaugural.csv")
```

#### 1. Getting Started

1. You will first need to install and load the following packages:

```{r, eval=FALSE}
install.packages("quanteda")
install.packages("readtext")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
```

```{r, echo = T, message=FALSE}
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(readtext)
```

2. You will also need to install the package `quanteda.corpora` from GitHub using the `install_github()` function from the `devtools` package:

```{r, eval=FALSE}
devtools::install_github("quanteda/quanteda.corpora")
library(quanteda.corpora)
```

3. Exploring **quanteda** functions. Look at the Quick Start vignette, and browse the manual for quanteda. You can use the `example()` function for any function in the package to run its examples and see how the function works. Of course, you should also browse the documentation, especially `?corpus`, to see the structure of a corpus object and how to construct one. The website http://quanteda.io has extensive documentation.

```{r, eval = F}
?corpus
example(dfm)
example(corpus)
```

#### 2. Making a corpus and corpus structure

A corpus object is the foundation for all the analysis we will be doing in `quanteda`. The first thing to do when you load some text data into R is to convert it using the `corpus()` function.

1. The simplest way to create a corpus is to use a set of texts already present in R's global environment. In our case, we previously loaded the `inaugural.csv` file and stored it as the `inaugural` object. Let's have a look at this object to see what it contains. Use the `head()` function applied to the `inaugural` object and report the output. Which variable includes the texts of the inaugural addresses?

```{r, eval = T, echo=TRUE}
head(inaugural)
```

> The output tells us that this is a `data.frame` and we can see the first six rows of the data. The column labelled `text` contains the texts of the inaugural addresses.

2. Use the `corpus()` function on this set of texts to create a new corpus. The first argument to `corpus()` should be the `inaugural` object. You will also need to set the `text_field` argument equal to `"text"` so that quanteda knows that the text we are interested in is saved in that variable.
```{r, eval = T, echo=TRUE}
inaugural_corpus <- corpus(inaugural, text_field = "text")
```

3. Once you have constructed this corpus, use the `summary()` method to see a brief description of the corpus. Which inaugural address was the longest in terms of the number of sentences?

```{r}
summary(inaugural_corpus)
```

> Joe Biden's address had the largest number of sentences.

4. Note that although we specified `text_field = "text"` when constructing the corpus, we have not removed the metadata associated with the texts. To access the other variables, we can use the `docvars()` function applied to the corpus object that we created above. Try this now.

```{r}
head(docvars(inaugural_corpus))
```

#### 3. Tokenizing texts

In order to count word frequencies, we first need to split the text into words (or longer phrases) through a process known as *tokenization*. Look at the documentation for **quanteda**'s `tokens()` function.

1. Use the `tokens()` command on the `inaugural_corpus` object, and examine the results.

```{r}
inaugural_tokens <- tokens(inaugural_corpus)
```

2. Experiment with some of the arguments of the `tokens()` function, such as `remove_punct` and `remove_numbers`.

```{r}
inaugural_tokens <- tokens(inaugural_corpus, remove_punct = TRUE, remove_numbers = TRUE)
```

3. Try tokenizing `inaugural_corpus` into *sentences*, using `tokens(x, what = "sentence")`.

```{r, eval = F}
inaugural_sentences <- tokens(inaugural_corpus, what = "sentence")
inaugural_sentences[1:2]
```

#### 4. Explore some phrases in the text

1. **quanteda** provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Use the `kwic()` function (for "keywords-in-context") to explore how a specific word or phrase is used in this corpus (use the word-based tokenization that you implemented above). You can look at the help file (`?kwic`) to see the arguments that the function takes.

```{r}
kwic(inaugural_tokens, "terror", 3)
```

Try substituting your own search terms, or working with your own corpus.

```{r, eval = T}
head(kwic(inaugural_tokens, "america", 3))
head(kwic(inaugural_tokens, "democracy", 3))
```

2. By default, `kwic()` gives exact matches for a given pattern. What if we wanted to see words like "terrorism" and "terrorist" rather than exactly "terror"? We can use the wildcard character `*` to expand our search by appending it to the end of the pattern we are using to search. For example, we could use `"terror*"`. Try this now in the `kwic()` function.

```{r}
kwic(inaugural_tokens, "terror*", 3)
```

#### 5. Creating a `dfm()`

Document-feature matrices are the standard way of representing text as quantitative data. Fortunately, it is very simple to convert the tokens objects in quanteda into dfms.

1. Create a document-feature matrix, using `dfm()` applied to the `inaugural_tokens` object you created above. First, read the documentation using `?dfm` to see the available options. Once you have created the dfm, use the `topfeatures()` function to inspect the top 20 most frequently occurring features in the dfm. What kinds of words do you see?

```{r}
mydfm <- dfm(inaugural_tokens)
mydfm
topfeatures(mydfm, 20)
```

**Mostly stop words!**

2. Experiment with different `dfm_*` functions, such as `dfm_wordstem()`, `dfm_remove()` and `dfm_trim()`. These functions allow you to reduce the size of the dfm following its construction. How does the number of features in your dfm change as you apply these functions to the dfm object you created in the question above?
```{r}
dim(mydfm)
dim(dfm_wordstem(mydfm))
dim(dfm_remove(mydfm, pattern = stopwords("english")))
dim(dfm_trim(mydfm, min_termfreq = 5, min_docfreq = 0.01,
             termfreq_type = "count", docfreq_type = "prop"))
```

3. Use the `dfm_remove()` function to remove English-language stopwords from this data. You can get a list of English stopwords by using `stopwords("english")`.

```{r}
mydfm_nostops <- dfm_remove(mydfm, pattern = stopwords("en"))
mydfm_nostops
```

4. You can easily use quanteda to subset a corpus. There is a `corpus_subset()` method defined for a corpus, which works just like R's normal `subset()` command. For instance, if you want a wordcloud of just Obama's two inaugural addresses, you would need to subset the corpus first:

```{r}
obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus)
obama_dfm <- dfm(obama_tokens)
textplot_wordcloud(obama_dfm)
```

Try producing that plot without the stopwords and without punctuation. To remove stopwords, use `dfm_remove()`. To remove punctuation, pass `remove_punct = TRUE` to the `tokens()` function.

```{r}
obama_tokens <- tokens(obama_corpus, remove_punct = TRUE)
obama_dfm <- dfm(obama_tokens)
obama_dfm <- dfm_remove(obama_dfm, pattern = stopwords("en"))
textplot_wordcloud(obama_dfm)
```

#### 6. Descriptive statistics

1. We can plot the type-token ratio of the inaugural speeches over time. To do this, begin by summarising the speeches: apply the `summary()` function to the `inaugural_corpus` object and examine the results.

```{r}
token_info <- summary(inaugural_corpus)
```

2. Get the type-token ratio for each text, and plot the resulting vector of TTRs as a function of `Year`. (**Hint:** see `?textstat_lexdiv`.)

```{r}
# Punctuation was already removed when we created inaugural_tokens above
inaugural_dfm <- dfm(inaugural_tokens)
ttr_by_speech <- textstat_lexdiv(inaugural_dfm, "TTR")
plot(inaugural_dfm$Year, ttr_by_speech$TTR,
     main = "TTR by year", xlab = "Year", ylab = "TTR",
     pch = 19, bty = "n", type = "b")
```

3. Use the `corpus_subset()` function to select the speeches given by presidents between 1900 and 1950. Then, using this subset, measure the term similarities (`textstat_simil()`) for the following words: *economy*, *health*, *women*. Which other terms are most associated with each of these three terms?

```{r}
inaugural_corpus_subset <- corpus_subset(inaugural_corpus, Year %in% c(1900:1950))
inaugural_tokens_subset <- tokens(inaugural_corpus_subset)
inaugural_dfm_subset <- dfm(inaugural_tokens_subset)

# Similarities between all features of the subset dfm and the three target terms
word_similarities <- textstat_simil(inaugural_dfm_subset,
                                    inaugural_dfm_subset[, c("economy", "health", "women")],
                                    margin = "features")
head(word_similarities[order(word_similarities[, 1], decreasing = TRUE), ])
head(word_similarities[order(word_similarities[, 2], decreasing = TRUE), ])
head(word_similarities[order(word_similarities[, 3], decreasing = TRUE), ])
```

#### 7. Working with dictionaries

1. Dictionaries are named lists, consisting of a "key" and a set of entries defining the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:

```{r}
pos_dict <- dictionary(list(articles = c("the", "a", "an"),
                            conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
```

To let this define a set of features, we can use this dictionary on the dfm object we created above.
To do so, apply the `dfm_lookup()` function to the relevant dfm object, with the `dictionary` argument equal to the `pos_dict` object created above:

```{r}
pos_dfm <- dfm_lookup(inaugural_dfm, dictionary = pos_dict)
pos_dfm[1:10, ]
```

2. Plot the counts of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (**Hint:** you can use `docvars(inaugural_corpus, "Year")` for the *x*-axis.) Is the distribution of articles and conjunctions relatively constant across years, as you would expect?

```{r}
par(mfrow = c(1, 2))
plot(inaugural_corpus$Year, as.numeric(pos_dfm[, 1]),
     xlab = "Year", ylab = "Articles")
plot(inaugural_corpus$Year, as.numeric(pos_dfm[, 2]),
     xlab = "Year", ylab = "Conjunctions")
```

3. The previous analysis uses the raw count of articles and conjunctions, which depends on the length of the speech: longer speeches will, on average, use more articles and conjunctions. To remove this dependency, we can weight the document-feature matrix by document length and re-compute. For this, we first have to compute the full dfm (using `dfm()`), then weight it so that each count becomes a proportion of its document's length (using `dfm_weight()` with the `scheme` argument equal to `"prop"`), and finally apply the dictionary (using `dfm_lookup()`). Apply these steps and then create a plot showing the weighted counts of articles and conjunctions over time.

```{r}
inaugural_dfm <- dfm(inaugural_tokens)
inaugural_wgt <- dfm_weight(inaugural_dfm, scheme = "prop")
pos_wgt <- dfm_lookup(inaugural_wgt, dictionary = pos_dict)
pos_wgt[1:10, ]

# For easier processing, you can turn it into a data.frame and add the Year
pos_wgt_df <- convert(pos_wgt, to = "data.frame")
pos_wgt_df <- cbind(pos_wgt_df, year = inaugural_corpus$Year)

par(mfrow = c(1, 1))
plot(pos_wgt_df$year, pos_wgt_df$articles, type = "l", col = "orange",
     ylim = range(pos_wgt), xlab = "Year", ylab = "Weighted count")
lines(pos_wgt_df$year, pos_wgt_df$conjunctions, type = "l", col = "blue")
legend("topright", legend = c("articles", "conjunctions"),
       col = c("orange", "blue"), lty = 1)
```

4. Create a new dictionary capturing a concept of your own choosing (perhaps something like "democracy" or "optimism"). Apply this dictionary to the inaugural speeches data and plot the prevalence of that concept in speeches made by US Presidents over time.

```{r}
optimist_dict <- dictionary(list(optimist = c("best", "better", "great", "awesome", "good",
                                              "fantastic", "terrific", "tremendous", "amazing",
                                              "superior", "astounding", "astonishing",
                                              "extraordinary", "incredible", "magnificent")))

inaugural_dfm <- dfm(inaugural_tokens)
inaugural_wgt <- dfm_weight(inaugural_dfm, scheme = "prop")
optimist_wgt <- dfm_lookup(inaugural_wgt, dictionary = optimist_dict)

# For easier processing, you can turn it into a data.frame and add the Year
optimist_wgt_df <- convert(optimist_wgt, to = "data.frame")
optimist_wgt_df <- cbind(optimist_wgt_df, year = inaugural_corpus$Year)

par(mfrow = c(1, 1))
plot(optimist_wgt_df$year, optimist_wgt_df$optimist, type = "l", col = "orange",
     ylim = range(optimist_wgt), xlab = "Year", ylab = "Weighted count")
```
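When you build a dictionary of your own like this, it is worth checking which of its entries actually occur in the speeches and how often, since a key whose counts are driven by only one or two terms can be misleading. A minimal sketch of one way to do this, reusing the `inaugural_dfm`, `inaugural_tokens` and `optimist_dict` objects from above: both `dfm_select()` and `kwic()` accept a dictionary as their `pattern` argument, so you can count and inspect the matched terms directly.

```{r}
# Which dictionary entries appear in the corpus, and how often?
topfeatures(dfm_select(inaugural_dfm, pattern = optimist_dict), 15)

# A few of the matches in context
head(kwic(inaugural_tokens, pattern = optimist_dict, window = 3))
```

If one or two entries dominate the counts, you may want to revise the dictionary before interpreting the trend plot.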