The US State Department has produced regular reports on human rights practices across the world for many years. These monitoring reports play an important role both in the international human rights regime and in the production of human rights data. In a paper published in 2018, Benjamin Bagozzi and Daniel Berliner analyse these reports in order to identify a set of topics and to describe how those topics vary over time and space.
In today’s seminar, we will analyse the US State Department’s annual Country Reports on Human Rights Practices (1977–2012), applying structural topic models (STMs) to identify the underlying topics of attention and scrutiny across the entire corpus and in each individual report. We will also assess the extent to which the prevalence of different topics in the corpus is related to covariates pertaining to each country’s relationship with the US.
You will need to load the following packages before beginning the assignment:
library(stm)
library(tidyverse)
library(quanteda)
# If you cannot load these libraries, try installing them first. E.g.:
# install.packages("stm")
Today we will use data on 4067 Human Rights Reports from the US State Department. The table below describes some of the variables included in the data:
Variable | Description |
---|---|
cname | The name of the country which is the subject of the report |
year | The year of the report |
report | The text of the report (note that these texts have already been stemmed and stop words have been removed) |
alliance | Whether the country has a formal military alliance with the United States (1) or not (0) |
p_polity2 | The Polity score for the country |
logus_aid_econ | The (log) level of foreign aid provided to the country by the US |
oecd | Whether the country is a member of the OECD (1) or not (0) |
civilwar | Whether the country is experiencing a civil war (1) or not (0) |
This data is not stored on GitHub because the file is too large. Instead, you will need to download it from this Dropbox link.
Once you have downloaded the file and stored it somewhere sensible, you can load it into R:
human_rights <- read_csv("human_rights_reports.csv")
You can take a quick look at the variables in the data by using the glimpse() function from the tidyverse package:
glimpse(human_rights)
## Rows: 4,067
## Columns: 16
## $ cname <chr> "Albania", "Albania", "Albania", "Albania", "Albania…
## $ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988…
## $ cowcode <dbl> 339, 339, 339, 339, 339, 339, 339, 339, 339, 339, 33…
## $ logwdi_gdpc <dbl> 7.524573, 7.560410, 7.568337, 7.558117, 7.524482, 7.…
## $ p_polity2 <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, 1, 1, 5, 5, …
## $ alliance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ logus_aid_econ <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000…
## $ civilwar <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oecd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ logtrade_with_US <dbl> 3.010621, 2.502255, 3.131137, 2.263844, 2.627563, 2.…
## $ latentmean_Fariss <dbl> -0.915279270, -1.060029900, -1.053791400, -1.0242505…
## $ gd_ptsa <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 4…
## $ years_to_election <dbl> 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3…
## $ rep_pres <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0…
## $ pres_chambers <dbl> 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0…
## $ report <chr> "albania isol balkan nation peopl govern communist r…
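Before going further, it can be useful to check how many countries and years the data actually cover. The snippet below is a small optional check of my own, not part of the assignment:
# Optional check: how many distinct countries and years does the data cover?
human_rights %>%
  summarise(n_countries = n_distinct(cname),
            n_years = n_distinct(year))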
We will begin by estimating the “null” model of the Structural Topic Model, i.e. a model with no covariates. This model is equivalent to the Correlated Topic Model – a close cousin of the LDA model that we covered in the lecture, though one in which the topics in the corpus are allowed to be correlated with each other (LDA assumes that topics are uncorrelated).
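To make the contrast concrete, here is a rough sketch of the generative assumption (my own summary, using standard CTM notation): each document’s topic proportions are drawn from a logistic-normal distribution rather than a Dirichlet,

\[
\eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad \theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K}\exp(\eta_{d,j})},
\]

and it is the off-diagonal elements of \(\Sigma\) that allow topics to co-occur more (or less) often than chance. LDA instead draws \(\theta_d \sim \mathrm{Dirichlet}(\alpha)\), which cannot represent such correlations.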
The stm() function from the stm package can be used to fit the model. There are a few different arguments that you will need to specify for this function:
Argument | Description |
---|---|
documents | The DFM on which you intend to fit the STM. |
K | The number of topics you wish to estimate. |
prevalence | A formula (with no response variable) specifying the covariates you wish to use to model topic prevalence across documents. |
content | A formula (with no response variable) specifying the covariate you wish to use to model the content of each topic across documents. |
seed | A seed number to make the results replicable. |
Create a corpus object from the human_rights data. Then create a dfm, making some feature selection decisions. Note: Topic models can take a long time to estimate, so I would advise that you trim the DFM to keep it reasonably small for now.
# Convert the data into a quanteda corpus, using the report text
human_rights_corpus <- human_rights %>%
  corpus(text_field = "report")

# Tokenise the corpus and convert it to a document-feature matrix
human_rights_dfm <- human_rights_corpus %>%
  tokens() %>%
  dfm()

# Trim the dfm, keeping only features appearing in between 10% and 90% of documents
human_rights_dfm <- human_rights_dfm %>%
  dfm_trim(min_docfreq = .1,
           max_docfreq = .9,
           docfreq_type = "prop")
Use the stm() function from the stm package to fit a topic model. Choose an appropriate number of topics. You should not use any covariates in answer to this question. As the STM model will take a while to run (probably a minute or two), you should make sure you save the output of the model so that you don’t need to run this code repeatedly.
# Fit a 15-topic model with no covariates (the "null" STM)
stm_out <- stm(documents = human_rights_dfm,
               K = 15,
               seed = 12345,
               verbose = FALSE)
save(stm_out, file = "stm_out.Rdata")
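In a later session (or after restarting R), you can then reload the fitted model rather than re-estimating it:
# Reload the saved model object instead of running stm() again
load("stm_out.Rdata")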
Use the plot() function to assess how common each topic is in this corpus. What is the most common topic? What is the least common?
plot(stm_out)
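If you want the answer numerically rather than graphically (an optional extra), the average of each column of the document-topic matrix gives the expected proportion of the corpus devoted to each topic:
# Average topic proportions across all documents, sorted from most to least common
topic_means <- colMeans(stm_out$theta)
names(topic_means) <- paste0("Topic ", 1:ncol(stm_out$theta))
sort(topic_means, decreasing = TRUE)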
Use the labelTopics() function to extract the most distinctive words for each topic. Do some interpretation of these topic “labels”.[^seminar7-2] Is there a sexual violence topic? Is there a topic about electoral manipulation? Create two word clouds illustrating two of the most interesting topics using the cloud() function.
Note: The stm package provides various different metrics for weighting words in estimated topic models. The two most relevant for our purposes are Highest Prob and FREX. Highest Prob simply reports the words that have the highest probability within each topic (i.e. inferred directly from the \(\beta\) parameters). FREX is a weighting that takes into account both frequency and exclusivity (words are upweighted when they are common in one topic but uncommon in other topics).
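For reference, a sketch of how the FREX score is typically defined (my summary of the standard formulation; see the stm documentation for the exact details): it is the harmonic mean of a word’s frequency rank within a topic and its exclusivity rank across topics,

\[
\mathrm{FREX}_{k,v} = \left( \frac{\omega}{\mathrm{ECDF}\!\left(\beta_{k,v}\right)} + \frac{1-\omega}{\mathrm{ECDF}\!\left(\beta_{k,v} \big/ \textstyle\sum_{j}\beta_{j,v}\right)} \right)^{-1},
\]

where \(\omega\) is a weight balancing frequency against exclusivity (0.5 by default in labelTopics()).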
labelTopics(stm_out)
## Topic 1 Top Words:
## Highest Prob: israel, west, bank, arab, territori, militari, occupi
## FREX: israel, west, bank, territori, arab, occupi, east
## Lift: israel, west, strip, arab, occupi, bank, territori
## Score: israel, arab, west, territori, jewish, bank, east
## Topic 2 Top Words:
## Highest Prob: presid, code, ministri, minimum, enforc, minist, legisl
## FREX: code, franc, french, minimum, interior, radio, extrajudici
## Lift: franc, gendarmeri, french, leagu, apprenticeship, le, slaveri
## Score: franc, gendarmeri, code, presid, french, ministri, disabl
## Topic 3 Top Words:
## Highest Prob: civilian, war, militari, regim, attack, special, execut
## FREX: war, regim, iraq, southern, revolutionari, insurg, north
## Lift: iraq, revolutionari, war, regim, casualti, summarili, government-control
## Score: iraq, insurg, war, regim, civilian, revolutionari, militia
## Topic 4 Top Words:
## Highest Prob: traffick, victim, child, sexual, violenc, ministri, ngos
## FREX: roma, sexual, traffick, societ, corrupt, exploit, victim
## Lift: roma, bisexu, transgend, chat, lesbian, reproduct, gay
## Score: roma, traffick, ngos, internet, sexual, child, ombudsman
## Topic 5 Top Words:
## Highest Prob: militari, indigen, judg, ministri, end, crime, presid
## FREX: indigen, guerrilla, de, kidnap, paramilitari, congress, prosecutor
## Lift: guerrilla, jose, carlo, inter-american, san, homicid, el
## Score: guerrilla, indigen, jose, inter-american, carlo, ombudsman, el
## Topic 6 Top Words:
## Highest Prob: guarante, militari, -, amnesti, recent, rate, will
## FREX: guarante, tion, ment, communist, now, growth, current
## Lift: vital, ment, tion, non-government, inter, guarante, invas
## Score: vital, tion, ment, communist, guarante, indian, now
## Topic 7 Top Words:
## Highest Prob: provinc, sentenc, chines, detain, activist, china, provinci
## FREX: provinc, chines, china, provinci, dissid, activist, enterpris
## Lift: china, chines, provinc, dissid, anniversari, provinci, crackdown
## Score: china, chines, provinc, dissid, provinci, internet, communist
## Topic 8 Top Words:
## Highest Prob: see, end, soldier, child, journalist, militari, presid
## FREX: soldier, rebel, idp, fgm, girl, arm, unlik
## Lift: rebel, fgm, idp, loot, soldier, unlik, rob
## Score: rebel, idp, fgm, soldier, see, ngos, ethnic
## Topic 9 Top Words:
## Highest Prob: south, african, black, end, parliament, africa, white
## FREX: black, african, south, white, africa, farm, magistr
## Lift: white, black, africa, african, color, south, farm
## Score: white, african, south, africa, black, parliament, magistr
## Topic 10 Top Words:
## Highest Prob: islam, ministri, see, muslim, sentenc, council, sharia
## FREX: islam, sharia, non-muslim, king, muslim, bahai, christian
## Lift: bahai, non-muslim, sunni, sharia, moham, islam, ali
## Score: bahai, islam, sharia, sunni, non-muslim, king, arab
## Topic 11 Top Words:
## Highest Prob: opposit, presid, militari, detain, minist, leader, newspap
## FREX: opposit, decre, coup, martial, ralli, ban, emerg
## Lift: martial, coup, opposit, campus, sedit, decre, lift
## Score: martial, opposit, coup, presid, decre, militari, presidenti
## Topic 12 Top Words:
## Highest Prob: refuge, ethnic, tradit, presid, power, peopl, can
## FREX: king, loan, role, tradit, agricultur, known, exil
## Lift: loan, -parti, consensus, nonpolit, king, expatri, monarchi
## Score: loan, king, ethnic, royal, tradit, dissid, refuge
## Topic 13 Top Words:
## Highest Prob: district, violenc, child, see, death, end, muslim
## FREX: district, milit, tribal, custodi, bond, injur, villag
## Lift: milit, cast, tribal, ordin, epz, bond, tribe
## Score: milit, ngos, tribal, insurg, traffick, muslim, child
## Topic 14 Top Words:
## Highest Prob: feder, asylum, legisl, immigr, parliament, equal, minor
## FREX: immigr, asylum, feder, applic, equal, racial, european
## Lift: kingdom, racist, racism, alien, german, treati, immigr
## Score: kingdom, parliament, feder, immigr, disabl, asylum, seeker
## Topic 15 Top Words:
## Highest Prob: prosecutor, ethnic, region, presid, media, ministri, parliament
## FREX: prosecutor, russian, orthodox, registr, regist, soviet, region
## Lift: russian, russia, orthodox, soviet, jehovah, procur, psychiatr
## Score: russian, orthodox, soviet, russia, parliament, ethnic, prosecutor
cloud(stm_out, 4)
cloud(stm_out, 11)
Examine the matrix of estimated topic proportions (stm_out$theta). How many rows does this matrix have? How many columns? What do the rows and columns represent?
dim(stm_out$theta)
## [1] 4067 15
This matrix has 4067 rows and 15 columns. The rows here are the documents and the columns represent topics. The value for each cell of this matrix is the proportion of document \(d\) allocated to topic \(k\).
For example, let’s look at the first row of this matrix:
stm_out$theta[1,]
## [1] 0.0040269548 0.0042179549 0.1747053118 0.0002940531 0.0046224977
## [6] 0.6573289918 0.0789538623 0.0003247346 0.0012839030 0.0050881147
## [11] 0.0111577621 0.0320168947 0.0016730949 0.0150901088 0.0092157609
We can see that the first document in our collection is mostly about topic 6, because 66% of the document is allocated to that topic.
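Two small optional checks make this concrete: each row of theta is a probability distribution over topics, so it sums to one, and which.max() returns the dominant topic directly.
# Each row of theta sums to (approximately) one
sum(stm_out$theta[1, ])

# The index of the most prevalent topic in the first document
which.max(stm_out$theta[1, ])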
Pick a topic of interest and plot its estimated proportions against the year variable from the human_rights data. What does this plot suggest?
# Assign the topic of interest to the data
# I have chosen topic 4, you might have selected something else.
human_rights$sexual_violence_topic <- stm_out$theta[, 4]
human_rights %>%
ggplot(aes(x = year, y = sexual_violence_topic)) +
geom_point(alpha = .2) +
theme_bw()
There is evidence that this topic has become much more prominent in the country reports over time.
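One optional way of seeing this trend more clearly (an alternative of my own, not required for the assignment) is to average the topic proportion within each year before plotting:
# Average the topic proportion within each year and plot the yearly means
human_rights %>%
  group_by(year) %>%
  summarise(mean_prevalence = mean(sexual_violence_topic)) %>%
  ggplot(aes(x = year, y = mean_prevalence)) +
  geom_line() +
  theme_bw()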
Estimate a new STM, this time including a covariate in the prevalence argument. You can pick any covariate that you think is likely to show interesting relationships with the estimated topics. Again, remember to save your model output so that you don’t need to estimate the model more than once.
# Fit a 15-topic model, allowing topic prevalence to vary with alliance status
stm_out_prevalence <- stm(documents = human_rights_dfm,
                          prevalence = ~alliance,
                          K = 15,
                          seed = 12345,
                          verbose = FALSE)
save(stm_out_prevalence, file = "stm_out_prevalence.Rdata")
"frex"
scores
for each topic.# Extract the matrix of words with highest frex scores
topic_labels_matrix <- labelTopics(stm_out_prevalence, n = 7)$frex
# Collapse the words for each topic into a single label
topic_labels <- apply(topic_labels_matrix, 1, paste0, collapse = "_")
topic_labels
## [1] "israel_west_bank_territori_arab_occupi_east"
## [2] "code_franc_french_minimum_interior_extrajudici_radio"
## [3] "war_regim_iraq_southern_revolutionari_insurg_north"
## [4] "roma_sexual_traffick_societ_corrupt_exploit_victim"
## [5] "indigen_guerrilla_de_kidnap_paramilitari_congress_prosecutor"
## [6] "guarante_tion_ment_communist_now_growth_current"
## [7] "provinc_chines_china_provinci_dissid_activist_enterpris"
## [8] "soldier_rebel_idp_fgm_girl_arm_unlik"
## [9] "black_african_south_white_africa_farm_magistr"
## [10] "islam_sharia_non-muslim_king_muslim_bahai_christian"
## [11] "opposit_decre_coup_martial_ralli_ban_emerg"
## [12] "king_loan_role_tradit_agricultur_citizenship_known"
## [13] "district_milit_tribal_bond_custodi_injur_villag"
## [14] "immigr_asylum_feder_applic_equal_racial_european"
## [15] "prosecutor_russian_orthodox_registr_regist_soviet_region"
Note that the topics here differ somewhat from those we recovered using the STM without covariates. This is because we have estimated a slightly different model, which results in a slightly different distribution over words for each topic. This sensitivity to model specification is one of the core weaknesses of topic models.
Use the estimateEffect() function to estimate differences in topic usage by one of the covariates in the human_rights data. This function takes three main arguments:
Argument | Description |
---|---|
formula | A formula for the regression, of the form c(1,2,3) ~ covariate_name, where the numbers on the left-hand side indicate the topics for which you would like to estimate effects. |
stmobj | The model output from the stm() function. |
metadata | A data.frame in which the covariates are to be found. You can use docvars(my_dfm) for the dfm you used to estimate the original model. |
# Estimating the effects of having an alliance with the US for *all* topics
prevalence_effects <- estimateEffect(formula = c(1:15) ~ alliance,
stmobj = stm_out_prevalence,
metadata = docvars(human_rights_dfm))
Use the summary() function to extract the estimated regression coefficients. For which topics do you find evidence of a significant relationship with the covariate you selected?
summary(prevalence_effects)
##
## Call:
## estimateEffect(formula = c(1:15) ~ alliance, stmobj = stm_out_prevalence,
## metadata = docvars(human_rights_dfm))
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.018023 0.001535 11.745 < 2e-16 ***
## alliance -0.010857 0.002555 -4.249 2.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.088745 0.003034 29.255 <2e-16 ***
## alliance -0.012048 0.005536 -2.176 0.0296 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 3:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.062235 0.002562 24.295 < 2e-16 ***
## alliance -0.029948 0.004456 -6.721 2.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 4:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.113096 0.004459 25.362 < 2e-16 ***
## alliance 0.034394 0.007990 4.305 1.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 5:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.012426 0.002689 4.621 3.94e-06 ***
## alliance 0.185503 0.005939 31.237 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 6:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.081556 0.003672 22.208 < 2e-16 ***
## alliance 0.044196 0.006697 6.599 4.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 7:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.041075 0.002386 17.22 <2e-16 ***
## alliance -0.008468 0.003994 -2.12 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 8:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.078346 0.003185 24.60 <2e-16 ***
## alliance -0.059257 0.005228 -11.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 9:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.045257 0.002322 19.493 < 2e-16 ***
## alliance -0.023581 0.003730 -6.322 2.86e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 10:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.076252 0.003205 23.79 <2e-16 ***
## alliance -0.056578 0.005138 -11.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 11:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.077174 0.002699 28.589 <2e-16 ***
## alliance -0.007400 0.004608 -1.606 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 12:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.103781 0.002957 35.09 <2e-16 ***
## alliance -0.073682 0.004845 -15.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 13:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.049369 0.002581 19.13 < 2e-16 ***
## alliance -0.016447 0.004569 -3.60 0.000322 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 14:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.071095 0.003515 20.23 <2e-16 ***
## alliance 0.076990 0.006414 12.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 15:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.081501 0.003387 24.062 < 2e-16 ***
## alliance -0.042913 0.006307 -6.804 1.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Most of them! The coefficient on alliance is statistically significant at the 5% level for every topic except Topic 11.
Plot the estimated effects for some topics of interest using the plot.estimateEffect() function. There are various different arguments that you can provide to this function. See the help file for assistance here (?plot.estimateEffect).
plot.estimateEffect(prevalence_effects,
topics = 4,
covariate = "alliance",
method = "pointestimate",
main = topic_labels[4])
plot.estimateEffect(prevalence_effects,
topics = 14,
covariate = "alliance",
method = "pointestimate",
main = topic_labels[14])
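An optional variation (my own addition, using the method, cov.value1 and cov.value2 arguments documented in ?plot.estimateEffect): method = "difference" plots the estimated change in topic prevalence when moving between two values of the covariate, which makes it easy to compare several topics at once.
# Estimated difference in prevalence between alliance = 1 and alliance = 0
plot.estimateEffect(prevalence_effects,
                    topics = c(4, 5, 14),
                    covariate = "alliance",
                    method = "difference",
                    cov.value1 = 1,
                    cov.value2 = 0)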
Estimate a new STM, this time passing a covariate to the content argument of the stm() function (see the lecture slides for an example). Once you have estimated the model, inspect the output and create at least one plot which demonstrates how word use for a given topic differs for the covariate you included in the model. (Note: The use of the content argument can cause the model to take a long time to converge, so you will need to be patient!)
# Fit a 15-topic model, allowing topic content to vary with alliance status
stm_out_content <- stm(documents = human_rights_dfm,
content = ~alliance,
K = 15,
seed = 12345,
verbose = FALSE)
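As with the earlier models, it is worth saving this output so that it does not need to be re-estimated (the filename below is my own choice):
# Save the content model so it does not need to be re-estimated
save(stm_out_content, file = "stm_out_content.Rdata")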
plot(stm_out_content,
topics = c(3),
type = "perspectives")
plot(stm_out_content,
topics = c(1),
type = "perspectives")