In this assignment, you will use R and the quanteda package to understand and apply document classification and supervised scaling.
We will start with a classic computer science dataset of movie reviews (Pang and Lee 2004). The movies corpus has an attribute sentiment that labels each text as either pos or neg according to the star rating of the original review archived from imdb.com.
To use this dataset, you will need to install the
quanteda.textmodels
package:
install.packages("quanteda.textmodels")
library(quanteda.textmodels)
You can extract the relevant corpus object using the following line of code:
moviereviews <- quanteda.textmodels::data_corpus_moviereviews
You should also load the quanteda
package:
library(quanteda)
Start by looking at the metadata included with this corpus using the
docvars()
function:
head(docvars(moviereviews))
## sentiment id1 id2
## 1 neg cv000 29416
## 2 neg cv001 19502
## 3 neg cv002 17424
## 4 neg cv003 12683
## 5 neg cv004 12641
## 6 neg cv005 29357
We will be using the sentiment variable, which records a human labelling of each movie review as either positive (pos) or negative (neg). Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.
table(docvars(moviereviews)$sentiment)
##
## neg pos
## 1000 1000
Use the sample() function to create a vector that randomly assigns each document to either the training set or the test set. Look at the help file (?sample) to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!
set.seed(1234)
train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
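As a quick sanity check (not part of the question), you can tabulate the resulting logical vector to confirm that roughly 75% of the 2,000 documents were assigned to the training set:

```r
# Counts and proportions of TRUE (training) vs FALSE (test) assignments
table(train)
prop.table(table(train))
```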
Use the logical vector you just created to split the corpus into a training corpus and a test corpus. You can use the square-bracket subsetting operator (i.e. my_corpus[vector]) to do this. (Remember, if we use an exclamation point ! before a logical vector, it will reverse the TRUE and FALSE values.)
movies_train_corpus <- moviereviews[train]
movies_test_corpus <- moviereviews[!train]
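You can confirm that the split exhausts the corpus by counting documents with ndoc():

```r
# The training and test corpora should together contain all 2,000 reviews
ndoc(movies_train_corpus) + ndoc(movies_test_corpus)  # should equal ndoc(moviereviews)
```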
Tokenize the training corpus and convert it into a document-feature matrix (using tokens() and dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. Then make a dfm for the test corpus, and use the dfm_match() function to make sure that it contains the same set of features as the training dfm. See the example in the lecture if you are struggling, or consult the relevant help files.
movies_train_tokens <- tokens(movies_train_corpus,
                              remove_punct = TRUE,
                              remove_numbers = TRUE,
                              remove_symbols = TRUE)
movies_test_tokens <- tokens(movies_test_corpus,
                             remove_punct = TRUE,
                             remove_numbers = TRUE,
                             remove_symbols = TRUE)
movies_train_dfm <- dfm(movies_train_tokens) %>%
  dfm_remove(pattern = stopwords("en")) %>%
  dfm_trim(min_termfreq = 10)
movies_test_dfm <- dfm(movies_test_tokens)
movies_test_dfm <- dfm_match(movies_test_dfm, features = featnames(movies_train_dfm))
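After dfm_match(), the test dfm should contain exactly the same features, in the same order, as the training dfm. A quick check (assuming the objects created above):

```r
# Both of these should return TRUE after dfm_match()
nfeat(movies_test_dfm) == nfeat(movies_train_dfm)
identical(featnames(movies_test_dfm), featnames(movies_train_dfm))
```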
Use the textmodel_nb() function to train the Naive Bayes classifier on the training dfm. You should use the dfm you created for the training corpus as the x argument to this function, and the outcome (i.e. training_dfm$sentiment) as the y argument.
movie_nb <- textmodel_nb(movies_train_dfm, movies_train_dfm$sentiment)
Examine the estimated word-level probabilities stored in the param element of the fitted model. Which words have the highest probability under the pos class? Which words have the highest probability under the neg class? You might find the sort() function helpful here.
head(sort(movie_nb$param[2,], decreasing = TRUE), 40)
## film one movie like just also
## 0.015424588 0.009025496 0.007464938 0.005313137 0.004178917 0.003861175
## story good time even can character
## 0.003752579 0.003688226 0.003668116 0.003583653 0.003523322 0.003205579
## much first characters life see well
## 0.003185469 0.003076874 0.003064807 0.003056763 0.003020565 0.003000454
## two way get films best really
## 0.002867727 0.002835550 0.002658580 0.002598249 0.002574116 0.002521830
## make little new many people great
## 0.002401168 0.002393124 0.002340837 0.002316705 0.002292572 0.002220175
## scene never man love movies scenes
## 0.002212131 0.002208109 0.002192021 0.002002984 0.001970808 0.001950698
## world plot still back
## 0.001942654 0.001821992 0.001813948 0.001805904
head(sort(movie_nb$param[1,], decreasing = TRUE), 40)
## film movie one like just even
## 0.013998056 0.010223724 0.009091424 0.006430519 0.005383142 0.004680172
## good time get can much bad
## 0.003944178 0.003892281 0.003661103 0.003661103 0.003529001 0.003448796
## plot character story two characters really
## 0.003146850 0.003132696 0.003061927 0.003033620 0.003000594 0.002816596
## make also first way see little
## 0.002811878 0.002675058 0.002675058 0.002609007 0.002599572 0.002495777
## well scene action films scenes know
## 0.002415573 0.002278753 0.002217420 0.002203267 0.002198549 0.002193831
## people director never new big another
## 0.002170241 0.002165523 0.002160805 0.002005114 0.001948499 0.001939063
## movies man something made
## 0.001924910 0.001887166 0.001877730 0.001858859
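Note that the most probable words under each class ("film", "movie", "one", …) are simply the most frequent words overall, so they tell us little about sentiment. One way to surface more discriminating words (a sketch, assuming the rows of movie_nb$param are labelled with the class names, as in the output above) is to sort by the ratio of the two class probabilities:

```r
# Ratio of each word's probability under "pos" vs "neg":
# large values indicate distinctively positive words,
# small values distinctively negative ones
pos_ratio <- movie_nb$param["pos", ] / movie_nb$param["neg", ]
head(sort(pos_ratio, decreasing = TRUE), 10)   # most distinctively positive words
head(sort(pos_ratio, decreasing = FALSE), 10)  # most distinctively negative words
```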
Use the predict() function to predict the sentiment of movies in the test set dfm. The predict() function takes two arguments in this instance: 1) the estimated Naive Bayes model from part (e), and 2) the test-set dfm. Create a confusion matrix of the predicted classes and the actual classes in the test data. What is the accuracy of your model?
movie_test_predicted_class <- predict(movie_nb, newdata = movies_test_dfm)
movie_confusion <- table(movie_test_predicted_class, movies_test_dfm$sentiment)
movie_confusion
##
## movie_test_predicted_class neg pos
## neg 220 42
## pos 38 182
## Accuracy
mean(movie_test_predicted_class == movies_test_dfm$sentiment)
## [1] 0.8340249
Load the caret package (install it first using install.packages() if you need to), and then use the confusionMatrix() function to calculate other statistics relevant to the predictive performance of your model. The first argument to the confusionMatrix() function should be the confusion matrix that you created in answer to question (g). You should also set the positive argument equal to "pos" to tell R the level of the outcome that corresponds to a “positive” result. Report the accuracy, sensitivity and specificity of your predictions, giving a brief interpretation of each.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
movie_confusion_statistics <- confusionMatrix(movie_confusion, positive = "pos")
movie_confusion_statistics
## Confusion Matrix and Statistics
##
##
## movie_test_predicted_class neg pos
## neg 220 42
## pos 38 182
##
## Accuracy : 0.834
## 95% CI : (0.7977, 0.8661)
## No Information Rate : 0.5353
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.666
##
## Mcnemar's Test P-Value : 0.7373
##
## Sensitivity : 0.8125
## Specificity : 0.8527
## Pos Pred Value : 0.8273
## Neg Pred Value : 0.8397
## Prevalence : 0.4647
## Detection Rate : 0.3776
## Detection Prevalence : 0.4564
## Balanced Accuracy : 0.8326
##
## 'Positive' Class : pos
##
Accuracy: The proportion of observations correctly classified is 0.8340249.
Sensitivity: The proportion of “positive” movie reviews correctly classified is 0.8125.
Specificity: The proportion of “negative” movie reviews correctly classified is 0.8527132.
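These three quantities can be reconstructed directly from the confusion matrix counts reported above, which makes their definitions concrete (treating pos as the positive class):

```r
# Rebuild the confusion matrix from the counts above (rows = predicted, columns = actual)
confusion <- matrix(c(220, 38, 42, 182), nrow = 2,
                    dimnames = list(predicted = c("neg", "pos"),
                                    actual    = c("neg", "pos")))
accuracy    <- sum(diag(confusion)) / sum(confusion)              # (TP + TN) / total
sensitivity <- confusion["pos", "pos"] / sum(confusion[, "pos"])  # TP / (TP + FN)
specificity <- confusion["neg", "neg"] / sum(confusion[, "neg"])  # TN / (TN + FP)
# accuracy = 0.834, sensitivity = 0.8125, specificity = 0.8527
```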
Create a new document-level variable (refscore) for the training set dfm which is equal to 1 for positive movie reviews, and -1 for negative movie reviews. These are the reference scores that we will use for training the model.
movies_train_dfm$refscore <- ifelse(movies_train_dfm$sentiment == "pos", 1, -1)
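A quick cross-tabulation (not required by the question) confirms that the recoding worked as intended:

```r
# Every "pos" review should be coded 1 and every "neg" review -1
table(refscore = movies_train_dfm$refscore,
      sentiment = movies_train_dfm$sentiment)
```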
Use the textmodel_wordscores() function to estimate wordscores on the training dfm. This function requires two arguments: 1) x, the dfm you are using to estimate the model, and 2) y, the vector of reference scores associated with each training document (i.e. the variable you created in the answer above).
wordscore_model <- textmodel_wordscores(movies_train_dfm, movies_train_dfm$refscore)
Use predict() to calculate wordscores for the reviews in the test set. For predict() to work, you need to pass it the trained wordscores model and the test set dfm. Save your predictions as a new metadata variable in your test set dfm.
movies_test_dfm$wordscores <- predict(wordscore_model, movies_test_dfm)
Use the docvars() function on your test set dfm to check that you have correctly assigned the predictions as metadata (hint: if you have done (c) correctly, then when you run str(docvars(my_test_set_dfm)) you should see a column containing the estimated wordscores).
str(docvars(movies_test_dfm))
## 'data.frame': 482 obs. of 4 variables:
## $ sentiment : Factor w/ 2 levels "neg","pos": 1 1 1 1 1 1 1 1 1 1 ...
## $ id1 : chr "cv004" "cv013" "cv015" "cv025" ...
## $ id2 : chr "12641" "10494" "29356" "29825" ...
## $ wordscores: 'predict.textmodel_wordscores' num 0.0179 -0.0481 -0.1056 0.0511 -0.0241 ...
Use the boxplot() function to compare the distribution of wordscores against the “true” sentiment of the reviews given by human annotators (look at ?boxplot to see how to create the plot). Describe the resulting pattern.
boxplot(movies_test_dfm$wordscores ~ movies_test_dfm$sentiment, ylab = "Raw wordscore")
Our model appears to do a pretty good job of assigning more positive scores to “positive” movie reviews.
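The same comparison can be made numerically, for instance by averaging the predicted wordscores within each sentiment class:

```r
# Mean raw wordscore within each human-annotated sentiment class;
# the "pos" mean should be noticeably higher than the "neg" mean
tapply(as.numeric(movies_test_dfm$wordscores),
       movies_test_dfm$sentiment, mean)
```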
Hint: you may want to use logical relations to find the texts you are looking for. For instance, my_vector > 0.05 will return a logical vector which is equal to TRUE when my_vector is greater than 0.05 and FALSE otherwise. Similarly, my_vector == "a string I want" will return a logical vector which is equal to TRUE when my_vector is equal to “a string I want” and FALSE otherwise.
Hint 2: You can extract the full texts from your corpus object by using as.character(my_corpus) with the appropriate subsetting operator (i.e. []).
as.character(movies_test_corpus)[movies_test_dfm$wordscores > .075 & movies_test_dfm$sentiment == "neg"]
## cv926_18471.txt
## "star wars : ? episode i - the phantom menace ( 1999 ) \ndirector : george lucas cast : liam neeson , ewan mcgregor , natalie portman , jake lloyd , ian mcdiarmid , samuel l . jackson , oliver ford davies , terence stamp , pernilla august , frank oz , ahmed best , kenny baker , anthony daniels screenplay : george lucas producers : rick mccallum runtime : 131 min . \nus distribution : 20th century fox rated pg : mild violence , thematic elements \ncopyright 1999 nathaniel r . atcheson \na fellow critic once stated his belief that a reviewer should not speak of himself in his own review . \ni've attempted to obey this rule in recent months , but to do so would be impossible in this case . \nthe fact is , nearly every person who goes to see the phantom menace brings baggage in with them . \nthe original star wars trilogy means so much to so many people . \nfor me , they calibrated my creativity as a child ; they are masterful , original works of art that mix moving stories with what were astonishing special effects at the time ( and they still hold up pretty darn well ) . \ni am too young to have seen star wars in the theater during its original release , but that doesn't make me any less dedicated to it . \non the contrary , the star wars trilogy - and the empire strikes back in particular - are three items on a very short list of why i love movies . \nwhen i heard that george lucas would be making the first trilogy in the nine-film series , i got exited . \nwhen i first saw screenshots from the film , well over a year ago , i embarked on a year-long drool of anticipation . \nand when the first previews were released last thanksgiving , i was ready to see the film . \nbut then there was the hype , the insane marketing campaign , and lucasfilm's secretive snobbery over the picture . 
\nin the last weeks before the picture opened , while multitudes of fans waited outside of theaters and stood in the boiling sun days in advance just to be the first ones in the theater , i was tired of hearing about it . \ni was tired of seeing cardboard cut-outs of the characters whenever i went to kfc or taco bell . \ni just wanted to see the movie . \nreader , do not misunderstand . \ni did not have an anti-hype reaction . \nthe hype was unavoidable . \ni understand and accept the hype - it's just what happens when the prequel to the most widely beloved films of all time get released . \nfive minutes into the phantom menace , i knew there was a problem . \n \" who are these jedi knights ? \" \ni asked . \n \" why are they churning out stale dialogue with machine-gun rapidity ? \" \n \" why aren't these characters being developed before their adventures ? \" \n \" why is there a special effects shot in nearly every frame of the entire film ? \" \nthese were just some of my questions early on . \nlater , i asked , \" where's the magic of the first three films ? \" \nand \" why am i looking at my watch every fifteen minutes ? ' \nby the end of the film , i was tired , maddened , and depressed . \ngeorge lucas has funneled his own wonderful movies into a pointless , mindless , summer blockbuster . \nthe phantom menace is no star wars film . \ntake away the title and the jedi talk and the force , and you're left with what is easily one of the most vacuous special effects movies of all time . \nit's an embarrassment . \ni looked desperately for a scene in which a character is explored , or a new theme is examined , or a special effects shot isn't used . \nthere are a few of each , but they're all token attempts . \nthe fact is , george lucas has created what is simultaneously an abysmally bad excuse for a movie and a pretty good showcase for digital effects . \nthis is not what i wanted to see . 
\ni didn't want to leave the phantom menace with a headache and a bitter taste in my mouth , but i did . \nthe story centers mostly around qui-gon jinn ( liam neeson , looking lost and confused ) and his apprentice , obi-wan kenobi ( ewan mcgregor , who scarcely has a line in the film ) and their attempts to liberate the people of the planet naboo . \nnaboo is the victim of a bureaucratic war with the trade federation ; their contact on naboo is queen amidala ( natalie portman ) , the teenage ruler who truly cares for her people . \nafter picking up jar jar binks ( a completely cgi character , voiced by ahmed best ) , they head to tatooine , where they meet young anakin skywalker ( jake lloyd ) and his mother ( pernilla august ) . \nqui-gon knows that the force is strong with young anakin , and so the jedi knights take the boy with them on their journeys . \nthe bad guys are darth maul and darth sidious , neither of whom have enough lines to register as characters . \nthere isn't anything particularly wrong with this story when looking at it in synopsis form . \nthe way lucas has handled it , however , it unsatisfactory . \nfirst of all , we don't learn one single thing about qui-gon jinn . \nnot one thing . \nwhat was his life like before this film ? \nwell , i imagine he didn't have one . \nthat's why he feels like a plot device . \nthis probably explains why neeson looks so hopeless in the role , and why he's recently retired from film ( i don't blame him , honestly ) . \nobi-wan , a character i was really looking forward to learning more about , is even less interesting . \nmcgregor has just a few lines , so anyone hoping to see the engaging young actor in a great performance is urged to look elsewhere . \nsince these two men are the focus of the phantom menace , lucas has served us a big emotional void as the centerpiece of his movie . 
\nthings start to pick up when our characters reach tatooine ; young anakin is perhaps the only truly fleshed-out character in the film , and lloyd does a thoughtful job with the role . \ni was also hugely impressed with the sand speeder scene ; rarely is an action sequence so fast and so exciting . \nand when anakin says goodbye to his mother , i found it moving . \nalso fairly good is portman , and she manages to give a little depth to a character where no depth has been written . \njar jar binks is one of the most annoying characters i've ever had to endure , but he's more interesting than most of the humans . \nas soon as the relatively-brief segment on tatooine is over , it's back to the mind-numbing special effects and depthless action scenes . \ni've seen many movies that qualify as \" special effects extravaganzas , \" but the phantom menace is the first one i've seen that had me sick of the special effects fifteen minutes into the movie . \nthe reason is obvious : george lucas has no restraint . \ni can't say that i didn't find the effects original , because i did - the final battle between darth maul , obi-wan , and qui-gon is visually exceptional , as is most of the film . \nbut i also found the effects deadening and tiresome . \nmy breaking point was near the end of the picture , as anakin is getting questioned by yoda and the other jedi masters ; in the background , we see hundreds of digital spaceships flying around through a digital sky , and i wanted that to go away . \ncan't we have one stinking scene that isn't bursting at the seems with a special effects shot ? \ni got so sick of looking at the cgi characters and spaceships and planets and backgrounds that i really just wanted to go outside and look at a physical landscape for a few hours . \nand then there's the question of magic . \nwhat was lost in the sixteen years between the phantom menace and return of the jedi ? 
\ni have a feeling that lucas was so focused on how his movie looked that he forgot entirely the way it should feel . \njohn williams' familiar score is no help , nor is lucas' direction . \ni think it comes right down to characters : there are none here . \ni longed for the magnetic presence of han , luke , and leia , but i got no such thing . \nand what about the ridiculous expectations ? \nmine weren't that high ; i simply wanted a film that showed me the roots of the films that i grew up loving , a story that had a few characters and a few great special effects . \ninstead , i got two hours and fifteen minutes of a lifeless and imaginative computer graphics show . \ni don't hate the phantom menace as much as i resent it : i'd like to forget that it exists , and yet i can't . \nit's here to stay . \ni can only hope that episodes ii and iii have something of substance in them , because if they don't , then lucas will have pulled off the impossible task of destroying his own indestructible series . "
as.character(movies_test_corpus)[movies_test_dfm$wordscores < -.03 & movies_test_dfm$sentiment == "pos"]
## cv019_14482.txt
## "there's something about ben stiller that makes him a popular choice among casting directors these days . \nstiller currently has three projects in circulation , and what other actor can lay claim to that ? \nhe's in \" there's something about mary , \" which i * still * haven't seen . \nand he's in the acerbic \" your friends & neighbors , \" playing a talkative , sexually-frustrated drama coach called jerri . \nnow there's \" permanent midnight , \" in which stiller plays another jerry , this one a heroin-addicted television writer , last name stahl . \nthere's also something about this industry that pushes bankable stars like stiller into doing drug-addiction pictures the minute they've proved themselves commercially . \newan mcgregor springs to mind who , after successful turns in \" emma \" and \" brassed off , \" received greater respect and admiration for his mind-blowing realization as renton in danny boyle's transatlantic junk-fest , \" trainspotting . \" \nthe philosophy appears to be a simple one : if you want 'em to be taken seriously , make 'em do drugs . \n \" permanent midnight \" is based on the true life experiences of jerry stahl , a successful hollywood writer who , in the mid-eighties , had a $5 , 000-a-week job churning out plotlines for disposable tv sitcoms and a $6 , 000-a-week heroin habit . \na habit , in stahl's own words , \" the size of utah . \" \nas stahl , stiller contributes a commanding performance . \nunlike \" trainspotting , \" which was successful in having it both ways by chronicling both the highs and the lows of heroin abuse , \" permanent midnight \" instead focuses on the concept of drug addiction as maintenance . \none of the earliest observations in the film is a casual reference to \" naked lunch \" author william s . burroughs who , when asked why he shoots up first thing in the morning responds , \" so i can shave . 
\" \nstahl rarely appears to be puncturing veins for the thrill of it all in \" permanent midnight \" ; it's so he can talk to his mother on the phone , show up for work on time , even pay his bills . \nwhile the film itself occasionally wobbles around along with stahl , the writing ( adapted from stahl's autobiography by director david veloz ) is controlled and pointed . \n \" permanent midnight \" shows how stahl moved from new york to l . a . to - again in the author's words - \" escape the drug scene \" ( yeah , right ) ; why he entered into a convenient marriage with a british tv exec ( elizabeth hurley , so impossibly polite you'd swear her single profanity was dubbed ) ; and that he conceived a child in between his random hirings and firings . \nstahl narrates all this in a motel bedroom to a sympathetic lover called kitty ( norristown's own maria bello ) with whom he spent some rehab time . \njaneane garofalo is wasted - and miscast - as a heavily-bespectacled hollywood talent agent who fails to get her hooks into the doped-up wordsmith , and that's stahl himself playing a jaded clinic counselor . \nstiller , unshaven ( burroughs take note ) and with lots of mascara around the eyes , has stahl stumble through the film looking like a train wreck but , to his credit , never once pushes his pill-popping , needle-jabbing performance over the top . \nthe ubiquitous stiller is the reason to see \" permanent midnight \" ; a dark , comic , and strangely absorbing study of assisted living . "
## cv460_10842.txt
## "deep rising is one of \" those \" movies . \nthe kind of movie which serves no purpose except to entertain us . \nit does not ask us to think about important questions like life on other planets or the possibility that there is no god . . . screw that , it says boldly , let's see some computer generated monsters rip into , decapitate and generally cause irreparable booboos to a bunch of little known actors . \nheh ! \nthem wacky monsters , gotta love 'em . \nof course , since we can rent about a thousand b movies with the same kind of story , hollywood must give that little extra \" oumph \" to get people in theaters . \nthat is where deep rising fails , which is a good thing . \nconfused ? \nlet me explain : \ndespite all them flashy effects and big explosions , deep rising is still , at heart , a good 'ol b movie . \nluckily , it's a very good b movie . \nthe worst cliches in movie history are a b movie's bread and butter . \ntherefore , things that would destroy a serious movie actually help us have a good time while watching a movie of lower calibre . \nof course we know there's a big slimy creature behind that door , that one person will wander off to be picked off by said monster and we always know which persons or person will make it out alive . \nwe just don't know when or how horrible it will be . \ni went to see deep rising with my expections low and my tolerance for bad dialogue high . \nimagine my surprise when i discover that deep rising is actually , well , pretty darn funny at times . \na funny b movie ? \nwell , that's new . \nthese flicks are not supposed to make us laugh . \n ( except for a few unintended laughs once a while . ) \nand before you know it , treat williams , wes studi and famke jansen appear on the big screen . \nhey ! i know them guys ( and gal ) from a couple of other movies . \ncool . \nfamiliar faces . \nso far so good . \nour man treat is the hero , he'll live . \nwes is a staple of b movies , he is the token victim . 
\nwe know he'll buy the farm but he will take a few creeps with him on the way out . \nfamke is the babe , 'nuff said . \nthere is also a guy with glasses ( the guy with glasses always dies ) a black person ( b movie buffs know that the black guy always dies , never fails ) and a very funny , nerdy guy . \n ( ah ! \ncomic relief . \nhow can we possibly explain having to kill him . . . let \nhim live . ) \nafter the first fifteen minutes i felt right at home . \ni know who to root for and who i need to boo too and a gum to chew . \n ( please kill me . ) \nsuffice it to say that for the next hour and a half i jumped out of my seat a few times , went \" ewwww \" about a dozen times and nearly had an orgasm over all the explosions and firepower our heroes were packing . \ni'm a man , we nottice these things . \nall in all , i'd recommend deep rising if you are looking for a good time and care to leave your brain at the door . . . but \nbring your sense of humor and excitement in with you . \nthe acting is decent , the effects top rate . \nhow to best describe it ? \nput together the jet ski scene from hard rain , the bug attacks from starship troopers , a couple of james bond like stunts and all those scenes from friday the thirteenth and freddy where you keep screaming \" don't go there , he's behind you \" and you end up with deep rising . \nfor creepy crawly goodness , tight t-shirts , major firepower and the need to go to the bathroom every fifteen minutes from seing all that water . "
## cv876_9390.txt
## "usually when one is debating who the modern queen of the romantic comedy is they will bring up names like julia roberts or sandra bullock . \nothers will mention meg ryan . \nbut for me , it's not even close . \njaneane garofalo is not only the queen of the romantic comedy , she is the best comic actress in hollywood right now . \nand it's a good thing she's starring in the matchmaker , because without her presence the movie would be bland , unfunny , and dull . \ngarofalo stars as marcy tizard , a top aide to boston senator john mcglory , who is suffering in the polls . \nin an attempt to capture the irish vote , he sends marcy on a mission to a small irish town called ballinagra in search of other mcglory's that never moved to america . \nunfortunately for marcy , her visit coincides with the town's annual matchmaking festival . \nthings get off to a rocky start for marcy though . \nshe has no hotel reservations ( for no rational reason ) and the tiny confined room ( tired old cliche' ) she has to stay in has a visitor in her bathtub . \nhis name is sean , and marcy finds him repugnant at first , so you can obviously tell where this is headed . \nthe movie runs into a few roadblocks . \nfor instance , the story is very thin . \nnone of the characters ( except the old local matchmaker ) are nearly as interesting as garofalo . \nsome of the characters , like the political aide played by denis leary , have wandered in from a completely different movie . \ni think the director realized this and decided to throw in numerous shots of the beautiful irish scenery , and several close-ups of garofalo's winning smile . \nthe strange thing is that it works . \ngarofalo's charm and the irish scenery could carry the thinnest of stories , and it carries this one . "
The model fails in these cases because the reviews contain many words associated with the opposite class, and those words typically describe the subject matter of the movie rather than express sentiment about it.
In this part of the assignment, you will use R to understand and apply unsupervised document scaling. Use the data_corpus_irishbudget2010 corpus in the quanteda.textmodels package for this. You will also need to load (and possibly install) the quanteda.textplots package first.
Estimate a Wordfish model of the Irish budget speeches, then use the textplot_scale1d() function to visualize the result. (You may want to use the advanced options of this function to get a better plot than just the default one.) What do you learn about what the dimension is capturing? You can use Wikipedia to learn about the Irish parties involved in this debate to help you answer this question.
library(quanteda.textplots)
irish_tokens <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE) %>%
  tokens_wordstem()
irish_dfm <- dfm(irish_tokens) %>%
  dfm_remove(pattern = stopwords("en"))
wordfish_model <- textmodel_wordfish(irish_dfm)
textplot_scale1d(wordfish_model, groups = data_corpus_irishbudget2010$party)
The model appears to be capturing a government-versus-opposition dimension rather than a left-right dimension: the estimated \(\theta\) can be read as an opposition score, with parties that were in opposition in 2010 (such as Labour) at one end of the scale and the governing parties at the other.
Plot the estimated word-level parameters from the model using the textplot_scale1d() function with margin = "features". What is your interpretation of these results?
textplot_scale1d(wordfish_model, margin = "features")
In this case, the plot is very hard to interpret! There is some evidence that “citizenship” is an especially discriminating word on the estimated dimension, but the other words are hard to make sense of (because they are difficult to see). It is somewhat easier to just extract the most discriminating words at each end of the dimension, as follows:
head(wordfish_model$features[order(wordfish_model$beta, decreasing = TRUE)])
## [1] "citizenship" "screw" "phrase" "precis" "internat"
## [6] "passport"
head(wordfish_model$features[order(wordfish_model$beta, decreasing = FALSE)])
## [1] "innov" "summari" "boost" "day-to-day" "particip"
## [6] "enhanc"
Even now, however, it is hard to know what this dimension “means” in any real sense. Perhaps it would be easier if you had a PhD in Irish Politics (which I do not). Let this serve as a cautionary tale about the difficulties of unsupervised learning for text!
Plot the estimated alpha parameters against the log number of words in each document. (Hint: you can use the ntoken() function on your dfm to extract the number of words in each text.)
plot(x = log(ntoken(irish_dfm)),
     y = wordfish_model$alpha, pch = 19,
     xlab = "log token count for each document",
     ylab = "estimated alpha")
The strong positive relationship shows that the alpha parameter is essentially measuring document length, i.e. how much each politician speaks.
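This relationship can also be summarized with a single correlation:

```r
# Correlation between log document length and the estimated alpha parameters;
# a value close to 1 would confirm that alpha mostly reflects verbosity
cor(log(ntoken(irish_dfm)), wordfish_model$alpha)
```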
Plot the estimated psi parameters against the log frequency of each word in the corpus. (Hint: the topfeatures() function might be helpful here.)
# finding the top 1,000 words
top1000 <- topfeatures(irish_dfm, n = 1000)
top1000 <- data.frame(word = names(top1000),
                      freq = as.numeric(top1000),
                      stringsAsFactors = FALSE)
# extracting the estimated psi parameters
df <- data.frame(
  word = wordfish_model$features,
  psi_hat = wordfish_model$psi,
  stringsAsFactors = FALSE
)

# Merge the word counts with the estimated word-level coefficients
df <- merge(df, top1000)

# Plot the result
plot(
  x = log(df$freq),
  y = df$psi_hat,
  pch = 19, col = "gray",
  xlab = "log(word frequency)",
  ylab = "estimated psi"
)
Psi captures the log frequency with which each word appears in the corpus.
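As with alpha, a correlation makes this point numerically:

```r
# Correlation between log word frequency and estimated psi for the top 1,000 words;
# a value near 1 suggests psi is essentially a word-frequency effect
cor(log(df$freq), df$psi_hat)
```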