You will need to load the following libraries (you may also want to set the random number seed to make everything replicable):
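
The chunk that loaded the libraries does not appear above. Judging only by the functions used later in this exercise (corpus, dfm, ntoken and stopwords from quanteda, and stm from the stm package), a minimal set-up would probably look like the sketch below. The seed value is an arbitrary choice for illustration, not one given in the original.

```r
library(quanteda)  # corpus(), tokens(), dfm(), dfm_trim(), ntoken(), stopwords()
library(stm)       # stm()

# Arbitrary seed (an assumption); any fixed integer makes the random steps repeatable
set.seed(123)
```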


Topic modelling of parliamentary speeches

In this question we are going to use topic modelling to understand how parliamentary speech varies by the gender of the MP. We will be working with a corpus of speeches made by legislators in the UK House of Commons in the 2014 calendar year.

You will need to make sure that the file hoc_speeches.Rdata is in your current working directory, and then use the following command to read this data into R.
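
The command itself is missing from the text above. Assuming hoc_speeches.Rdata is an ordinary R workspace file, load() would be the natural way to read it. The sketch below checks that the file is present first and prints the names of the objects it restores, which should include the speeches data.frame used in the rest of the question. The variable names data_file and loaded are illustrative only.

```r
# Assumes the file sits in the current working directory (see getwd())
data_file <- "hoc_speeches.Rdata"

if (file.exists(data_file)) {
  # load() restores the saved objects and returns their names invisibly
  loaded <- load(data_file)
  print(loaded)
} else {
  message("Could not find ", data_file, " in ", getwd())
}
```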

  1. Inspect the data.frame object speeches and produce some summary statistics.
prop.table(table(speeches$party, speeches$gender),1)
##                        female       male
##   Conservative     0.14849624 0.85150376
##   Labour           0.37480740 0.62519260
##   Liberal Democrat 0.08807588 0.91192412
speeches$ntoken <- ntoken(speeches$speech)
hist(speeches$ntoken, main = "Distribution of speech length", breaks = 100)

  1. Use the functions in the quanteda package to turn this data into a corpus object. Attach the relevant metadata as docvars.
speechCorpus <- corpus(speeches$speech, docvars = speeches)
  1. Turn this corpus into a document-feature matrix. At a minimum, you should remove punctuation and numbers from the texts when constructing the dfm (remove_punct = TRUE and remove_numbers = TRUE), but you may also want to do some additional pre-processing if you don’t want to wait days for your topic model to converge. Think about some of the following:

    1. Unigrams?
    2. Stopwords?
    3. Stemming?
    4. Very infrequent words?
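# Recent quanteda versions apply these options at the tokens stage rather than inside dfm()
speechToks <- tokens(speechCorpus, remove_punct = TRUE, remove_numbers = TRUE)
speechToks <- tokens_wordstem(tokens_remove(speechToks, stopwords("en")))
speechDFM <- dfm(speechToks)

speechDFM <- dfm_trim(speechDFM, min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")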
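
To see what the trimming step does, here is a toy sketch on four made-up sentences (not taken from the speeches data). With docfreq_type = "prop", min_docfreq is read as a share of documents, so the 0.0025 used above asks for features that appear in at least 0.25% of the speeches. The object names toy, toyDFM and toyTrim are illustrative only.

```r
library(quanteda)

# Four invented "speeches"
toy <- c(d1 = "tax tax budget",
         d2 = "tax health",
         d3 = "health budget schools",
         d4 = "tax defence")

toyDFM <- dfm(tokens(toy))

# Keep features found in at least 40% of documents, i.e. in two or more of the four;
# features that appear in only one document are dropped
toyTrim <- dfm_trim(toyDFM, min_docfreq = 0.4, docfreq_type = "prop")

featnames(toyTrim)
```

Comparing nfeat(toyDFM) with nfeat(toyTrim) shows how many features were removed. The same comparison on speechDFM before and after trimming is a quick way to check how aggressive the chosen thresholds are.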
  1. Run a structural topic model for this corpus, using the gender variable in the topic prevalence argument. Use the stm function to do this. Set the seed argument to stm to be equal to 123. Be aware, this takes about 15 minutes to run on Jack’s laptop – for testing purposes you might want to set the maximum iterations for the stm to be some low number (max.em.its = 10 for instance).

Now specify and estimate the stm model:

K <- 20
stmOut <- stm(documents = speechDFM, 
              data = docvars(speechDFM),
              prevalence = ~gender,
              K = K, seed = 123, verbose = FALSE, max.em.its = 500)

Plot the estimated topic model: