Exercise summary

This exercise is designed to get you working with quanteda. The focus will be on exploring the package and getting some texts into the corpus object format, using the functions that quanteda provides for creating a corpus of texts.

  1. Getting Started.

    You will first need to install the packages:

    install.packages("quanteda")
    install.packages("readtext")

    You will also need to install the package quanteda.corpora from GitHub, using the install_github() function from the devtools package:

    devtools::install_github('kbenoit/quanteda.corpora')
  2. Exploring quanteda functions.

    library(quanteda)
    library(quanteda.corpora)
    library(readtext)

    Look at the Quick Start vignette, and browse the manual for quanteda. You can use the example() function on any function in the package to run its examples and see how it works. You should also browse the documentation, especially ?corpus, to see how a corpus is structured and constructed. The website http://quanteda.io has extensive documentation.

    ?corpus
    example(dfm)
    example(corpus)
  3. Making a corpus and corpus structure

    1. From a vector of texts already in memory.

      The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package. For example, data_char_ukimmig2010 is a UTF-8 encoded set of 9 UK party manifesto sections from 2010 that deal with immigration policy. Try using corpus() on this set of texts to create a corpus.

      Once you have constructed this corpus, use the summary() method to see a brief description of it. The element names of the character vector data_char_ukimmig2010 should have become the document names.

      immig_corpus <- corpus(data_char_ukimmig2010)
      summary(immig_corpus)
      ## Corpus consisting of 9 documents:
      ## 
      ##          Text Types Tokens Sentences
      ##           BNP  1125   3280        88
      ##     Coalition   142    260         4
      ##  Conservative   251    499        15
      ##        Greens   322    679        21
      ##        Labour   298    683        29
      ##        LibDem   251    483        14
      ##            PC    77    114         5
      ##           SNP    88    134         4
      ##          UKIP   346    723        27
      ## 
      ## Source: /home/cmueller/academia/ay2016-17/teaching/ME314/classes/assignment09/* on x86_64 by cmueller
      ## Created: Thu Aug 16 23:40:12 2018
      ## Notes:
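
      The same steps can be tried on a small vector of your own. The following is a minimal sketch, assuming only that quanteda is loaded; the texts and the names PartyA and PartyB are invented for illustration and are not part of any built-in data set.

      ```r
      library(quanteda)

      # Hypothetical toy texts; the element names are made up for illustration
      txts <- c(PartyA = "We will cap immigration at a fixed level.",
                PartyB = "Immigration has benefited the economy. We welcome it.")

      # corpus() on a named character vector
      toy_corpus <- corpus(txts)

      docnames(toy_corpus)   # document names are taken from names(txts)
      ndoc(toy_corpus)       # number of documents in the corpus
      ```

      If the vector has no names, quanteda generates document names for you, so naming the elements first is usually worth the effort.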
    2. From a directory of text files.

      The readtext() function from the readtext package can read (almost) any set of files into an object that you can then call the corpus() function on, to create a corpus. (See ?readtext for an example.)

      Here you are encouraged to select any directory of plain text files of your own.
      How did it work? Try using docvars() to assign a set of document-level variables. If you do not have a set of text files to work with, then you can use the UK 2010 manifesto texts on immigration, in the Day 8 folder, like this:

      require(quanteda)
      manfiles <- readtext("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
      mycorpus <- corpus(manfiles)
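
      If you would rather test the workflow offline first, here is a minimal sketch that writes two small files to a temporary directory, reads them with readtext(), and then adds a document-level variable with docvars(). The file names, the texts, and the variable in_government are all invented for illustration; the docvarsfrom and docvarnames arguments are described in ?readtext.

      ```r
      library(quanteda)
      library(readtext)

      # Write two small files to a temporary directory (names and texts are made up)
      txt_dir <- file.path(tempdir(), "toy_manifestos")
      dir.create(txt_dir, showWarnings = FALSE)
      writeLines("We will reduce net migration.", file.path(txt_dir, "PartyA_2010.txt"))
      writeLines("We welcome skilled workers.", file.path(txt_dir, "PartyB_2010.txt"))

      # A wildcard reads every matching file; docvarsfrom = "filenames" splits
      # each file name on "_" and stores the parts as document variables
      toy_files <- readtext(file.path(txt_dir, "*.txt"),
                            docvarsfrom = "filenames",
                            docvarnames = c("party", "year"))
      toy_corpus <- corpus(toy_files)

      # Add a further document-level variable by hand
      docvars(toy_corpus, "in_government") <- c(TRUE, FALSE)
      docvars(toy_corpus)
      ```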
    3. From .csv or .json files — see the documentation for the package readtext (help(package = "readtext")).

      Here you can try one of your own examples, or just file this in your mental catalogue for future reference.

  4. Explore some phrases in the text.

    You can do this using the kwic() function (for “keywords in context”) to explore a specific word or phrase.

    kwic(data_corpus_inaugural, "terror", 3)
    ##                                                           
    ##     [1797-Adams, 1325]             violence, by | terror |
    ##  [1933-Roosevelt, 112] unreasoning, unjustified | terror |
    ##  [1941-Roosevelt, 287]          by a fatalistic | terror |
    ##    [1961-Kennedy, 866]     uncertain balance of | terror |
    ##     [1981-Reagan, 813]       Americans from the | terror |
    ##   [1997-Clinton, 1055]        the fanaticism of | terror |
    ##   [1997-Clinton, 1655]   strong defense against | terror |
    ##     [2009-Obama, 1632]         aims by inducing | terror |
    ##                            
    ##  , intrigue,               
    ##  which paralyzes needed    
    ##  , we proved               
    ##  that stays the            
    ##  of runaway living         
    ##  . And they                
    ##  and destruction.          
    ##  and slaughtering innocents

    Try substituting your own search terms, or working with your own corpus.

    head(kwic(data_corpus_inaugural, "america", 3))
    ##                                                         
    ##  [1793-Washington, 63]      people of united | America |
    ##       [1797-Adams, 16]     middle course for | America |
    ##      [1797-Adams, 427]         the people of | America |
    ##     [1797-Adams, 1419]         the people of | America |
    ##     [1797-Adams, 2004] aboriginal nations of | America |
    ##     [1797-Adams, 2152]         the people of | America |
    ##                            
    ##  . Previous to             
    ##  remained between unlimited
    ##  were not abandoned        
    ##  have exhibited to         
    ##  , and a                   
    ##  and the internal
    head(kwic(data_corpus_inaugural, "democracy", 3))
    ##                                                                   
    ##     [1825-Adams, 1546] a confederated representative | democracy |
    ##   [1841-Harrison, 525]                    to that of | democracy |
    ##  [1841-Harrison, 1585]       a simple representative | democracy |
    ##  [1841-Harrison, 7463]                   the name of | democracy |
    ##  [1841-Harrison, 7894]                of devotion to | democracy |
    ##   [1921-Harding, 1087]      temple of representative | democracy |
    ##                   
    ##  were a government
    ##  . If such        
    ##  or republic,     
    ##  they speak,      
    ##  . The foregoing  
    ##  , to be
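
    The searches above use single words. For a multi-word phrase, one option is to wrap the pattern in phrase() and search a tokens object. This is a minimal sketch on an invented sentence, not on the inaugural corpus.

    ```r
    library(quanteda)

    # Hypothetical one-document example text
    toy_text <- c(speech = "We the people of America believe that the people of America are free.")
    toks <- tokens(toy_text)

    # phrase() marks the pattern as a sequence of tokens rather than two separate words
    hits <- kwic(toks, pattern = phrase("people of"), window = 2)
    hits
    nrow(hits)   # one row per match
    ```

    Without phrase(), a pattern containing a space is matched against single tokens, so it would typically find nothing.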
  5. Create a document-feature matrix, using dfm. First, read the documentation using ?dfm to see the available options.

    mydfm <- dfm(data_corpus_inaugural, remove = stopwords("english"))
    mydfm
    ## Document-feature matrix of: 58 documents, 9,221 features (92.6% sparse).
    topfeatures(mydfm, 20)
    ##          ,          .          -     people          ; government 
    ##       7026       4945        762        575        565        564 
    ##         us        can       upon       must      great        may 
    ##        478        471        371        366        340        338 
    ##     states      shall      world    country      every     nation 
    ##        333        314        311        304        298        293 
    ##      peace        one 
    ##        254        252

    Experiment with different dfm options, such as stem = TRUE. The function dfm_trim() allows you to reduce the size of the dfm following its construction.

    dim(dfm(data_corpus_inaugural, stem = TRUE))
    ## [1]   58 5541
    dim(dfm_trim(mydfm, min_termfreq = 5, min_docfreq = 0.01, termfreq_type = "count", docfreq_type = "prop"))
    ## [1]   58 2596
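
    The calls above pass remove and stem directly to dfm(). As far as I know, from quanteda version 3 onward these steps are meant to be done on a tokens object first, and passing them to dfm() is deprecated. The sketch below shows that tokens-first route on a toy corpus whose texts are invented for illustration; treat it as one possible equivalent, not as the code that produced the output above.

    ```r
    library(quanteda)

    # Hypothetical toy corpus
    toy_corpus <- corpus(c(d1 = "The nation and the people govern.",
                           d2 = "The people governed the nations.",
                           d3 = "A government of the people prospers."))

    toks <- tokens(toy_corpus, remove_punct = TRUE)     # tokenize, dropping punctuation
    toks <- tokens_remove(toks, stopwords("english"))   # drop English stopwords
    toks <- tokens_wordstem(toks)                       # stem the remaining tokens

    toy_dfm <- dfm(toks)

    # Keep only features that occur at least twice across the whole dfm
    small_dfm <- dfm_trim(toy_dfm, min_termfreq = 2)

    dim(toy_dfm)
    dim(small_dfm)
    ```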

    Grouping on a variable is an excellent feature of dfm(), in fact one of my favorites.
    For instance, if you want to aggregate all speeches by presidential name, you can execute

    mydfm <- dfm(data_corpus_inaugural, groups = "President")
    mydfm
    ## Document-feature matrix of: 35 documents, 9,357 features (88.3% sparse).
    docnames(mydfm)
    ##  [1] "Adams"      "Buchanan"   "Bush"       "Carter"     "Cleveland" 
    ##  [6] "Clinton"    "Coolidge"   "Eisenhower" "Garfield"   "Grant"     
    ## [11] "Harding"    "Harrison"   "Hayes"      "Hoover"     "Jackson"   
    ## [16] "Jefferson"  "Johnson"    "Kennedy"    "Lincoln"    "Madison"   
    ## [21] "McKinley"   "Monroe"     "Nixon"      "Obama"      "Pierce"    
    ## [26] "Polk"       "Reagan"     "Roosevelt"  "Taft"       "Taylor"    
    ## [31] "Truman"     "Trump"      "Van Buren"  "Washington" "Wilson"

    Note that this groups Theodore and Franklin D. Roosevelt together – to separate them we would have needed to add a firstname variable using docvars() and grouped on that as well.
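
    The idea of grouping on a first name as well can be sketched on a toy corpus. Everything here is hypothetical: the texts, the document names, and the FirstName variable are invented, and dfm_group() is used on an existing dfm instead of the groups argument of dfm().

    ```r
    library(quanteda)

    # Hypothetical toy corpus with a surname and a first-name document variable
    toy_corpus <- corpus(
      c(t1905 = "Duty and effort.",
        f1933 = "Fear itself.",
        f1937 = "One third of a nation."),
      docvars = data.frame(President = c("Roosevelt", "Roosevelt", "Roosevelt"),
                           FirstName = c("Theodore", "Franklin D.", "Franklin D."))
    )

    toy_dfm <- dfm(tokens(toy_corpus))

    # Build one grouping value per document from both variables
    full_name <- paste(docvars(toy_corpus)$FirstName, docvars(toy_corpus)$President)

    by_name <- dfm_group(toy_dfm, groups = full_name)
    docnames(by_name)
    ```

    Grouping on the surname alone would collapse all three documents into a single row; grouping on the combined name keeps the two presidents apart.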

    Do this to aggregate the Irish budget corpus (data_corpus_irishbudget2010) by political party, when creating a dfm.

    mydfm <- dfm(data_corpus_inaugural, remove = stopwords("english"), remove_punct = TRUE, stem = TRUE)
    topfeatures(mydfm, 20)
    ##    nation    govern     peopl        us       can     state     great 
    ##       675       657       623       478       471       450       373 
    ##      upon     power      must   countri     world       may     shall 
    ##       371       370       366       355       339       338       314 
    ##     everi constitut      peac     right       law      time 
    ##       298       286       283       276       271       267
    irish_dfm <- dfm(data_corpus_irishbudget2010, groups = "party")
  6. Explore the ability to subset a corpus.

    There is a corpus_subset() method defined for a corpus, which works just like R’s normal subset() command. For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:

    obamadfm <- dfm(corpus_subset(data_corpus_inaugural, President=="Obama"))
    textplot_wordcloud(obamadfm)
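
    To see what corpus_subset() is doing without plotting anything, here is a minimal sketch on an invented corpus. The condition is evaluated against the document variables, so several conditions can be combined as in subset().

    ```r
    library(quanteda)

    # Hypothetical toy corpus with two document variables
    toy_corpus <- corpus(
      c(a = "First speech.", b = "Second speech.", c = "Third speech."),
      docvars = data.frame(President = c("Obama", "Bush", "Obama"),
                           Year = c(2009, 2005, 2013))
    )

    # Keep documents that match a condition on one document variable
    obama_only <- corpus_subset(toy_corpus, President == "Obama")

    # Conditions can be combined with & and |
    late_obama <- corpus_subset(toy_corpus, President == "Obama" & Year > 2010)

    docnames(obama_only)
    ndoc(late_obama)
    ```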

    Try producing that plot without the stopwords. See dfm_remove() to remove stopwords from the dfm object directly, or supply the remove argument to dfm().

    obamadfm <- dfm(corpus_subset(data_corpus_inaugural, President=="Obama"), remove = stopwords("english"), remove_punct = TRUE)
    textplot_wordcloud(obamadfm)
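
    The dfm_remove() route mentioned above can be sketched as follows, on two invented sentences. Note also that, as far as I know, in quanteda version 3 and later textplot_wordcloud() lives in the separate quanteda.textplots package, so you may need to install and load that package before plotting.

    ```r
    library(quanteda)

    # Hypothetical toy texts, tokenized without punctuation
    toy_dfm <- dfm(tokens(c(d1 = "We are the change that we seek.",
                            d2 = "The future is ours to shape."),
                          remove_punct = TRUE))

    # Remove English stopwords from an existing dfm
    no_stop <- dfm_remove(toy_dfm, pattern = stopwords("english"))

    featnames(toy_dfm)
    featnames(no_stop)
    ```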