Assignment 10 - Text Classification and Scaling (Solutions)

In this assignment, you will use R and quanteda to understand and apply document classification and supervised scaling.

Exercise 10.1 - Naive Bayes Classification of Movie Reviews

We will start with a classic computer science dataset of movie reviews (Pang and Lee 2004). The movies corpus has an attribute, sentiment, that labels each text as either pos or neg according to the star rating of the original newspaper review archived on imdb.com.

To use this dataset, you will need to install the quanteda.textmodels package:

install.packages("quanteda.textmodels")
library(quanteda.textmodels)

You can extract the relevant corpus object using the following line of code:


moviereviews <- quanteda.textmodels::data_corpus_moviereviews

You should also load the quanteda package:

library(quanteda)

Start by looking at the metadata included with this corpus using the docvars() function:


head(docvars(moviereviews))
##   sentiment   id1   id2
## 1       neg cv000 29416
## 2       neg cv001 19502
## 3       neg cv002 17424
## 4       neg cv003 12683
## 5       neg cv004 12641
## 6       neg cv005 29357

We will be using the sentiment variable, which records the human labelling of each movie review as either positive (pos) or negative (neg).

Use the table() function to work out how many positive and how many negative movie reviews there are in the corpus.


table(docvars(moviereviews)$sentiment)
## 
##  neg  pos 
## 1000 1000

Use the code below to create a logical vector of the same length as the number of documents in the corpus. We will use this vector to define our training and test sets. Look at ?sample to make sure you understand what each part of the code is doing. As we are using randomness to generate this vector, don’t forget to first set your seed so that the results are fully replicable!


set.seed(1234)

train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))
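
Because each document is assigned to the training set independently with probability 0.75, the split will be close to, but not exactly, 75/25. A quick way to check this, using only base R and the same call as above, might look like the following sketch:

```r
set.seed(1234)

train <- sample(c(TRUE, FALSE), 2000, replace = TRUE, prob = c(.75, .25))

# Number of documents assigned to the test set (FALSE) and training set (TRUE)
table(train)

# Share of documents in the training set; TRUE counts as 1, FALSE as 0
mean(train)
```
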

Subset the corpus into a training set and a test set using the vector you just created. Use square brackets to subset (i.e. my_corpus[vector]) to do this. (Remember, if we use an exclamation point ! before a logical vector, it will reverse the TRUE and FALSE values.)


movies_train_corpus <- moviereviews[train]
movies_test_corpus <- moviereviews[!train]

Make a dfm for the training corpus (i.e. dfm()), and make some reasonable feature selection decisions to reduce the number of features in the dfm. Then make a dfm for the test corpus, and use the dfm_match() function to make sure that it contains the same set of features as the training dfm. See the example in the lecture if you are struggling, or consult the relevant help files.


movies_train_tokens <- tokens(movies_train_corpus, 
                              remove_punct = TRUE, 
                              remove_numbers = TRUE, 
                              remove_symbols = TRUE)

movies_test_tokens <- tokens(movies_test_corpus, 
                             remove_punct = TRUE, 
                             remove_numbers = TRUE, 
                             remove_symbols = TRUE)

movies_train_dfm <- dfm(movies_train_tokens) %>%
  dfm_remove(pattern = stopwords("en")) %>%
  dfm_trim(min_termfreq = 10)

movies_test_dfm <- dfm(movies_test_tokens)

movies_test_dfm <- dfm_match(movies_test_dfm, features = featnames(movies_train_dfm))

Use the textmodel_nb() function to train the Naive Bayes classifier on the training dfm. You should use the dfm you created for the training corpus as the x argument to this function, and the outcome (i.e. training_dfm$sentiment) as the y argument.


movie_nb <- textmodel_nb(movies_train_dfm, movies_train_dfm$sentiment)

Examine the param element of the fitted model. Which words have the highest probability under the pos class? Which words have the highest probability under the neg class? You might find the sort() function helpful here.


# Row 2 of param holds the word probabilities for the pos class
head(sort(movie_nb$param[2,], decreasing = TRUE), 40)
##        film         one       movie        like        just        also 
## 0.015424588 0.009025496 0.007464938 0.005313137 0.004178917 0.003861175 
##       story        good        time        even         can   character 
## 0.003752579 0.003688226 0.003668116 0.003583653 0.003523322 0.003205579 
##        much       first  characters        life         see        well 
## 0.003185469 0.003076874 0.003064807 0.003056763 0.003020565 0.003000454 
##         two         way         get       films        best      really 
## 0.002867727 0.002835550 0.002658580 0.002598249 0.002574116 0.002521830 
##        make      little         new        many      people       great 
## 0.002401168 0.002393124 0.002340837 0.002316705 0.002292572 0.002220175 
##       scene       never         man        love      movies      scenes 
## 0.002212131 0.002208109 0.002192021 0.002002984 0.001970808 0.001950698 
##       world        plot       still        back 
## 0.001942654 0.001821992 0.001813948 0.001805904

# Row 1 of param holds the word probabilities for the neg class
head(sort(movie_nb$param[1,], decreasing = TRUE), 40)
##        film       movie         one        like        just        even 
## 0.013998056 0.010223724 0.009091424 0.006430519 0.005383142 0.004680172 
##        good        time         get         can        much         bad 
## 0.003944178 0.003892281 0.003661103 0.003661103 0.003529001 0.003448796 
##        plot   character       story         two  characters      really 
## 0.003146850 0.003132696 0.003061927 0.003033620 0.003000594 0.002816596 
##        make        also       first         way         see      little 
## 0.002811878 0.002675058 0.002675058 0.002609007 0.002599572 0.002495777 
##        well       scene      action       films      scenes        know 
## 0.002415573 0.002278753 0.002217420 0.002203267 0.002198549 0.002193831 
##      people    director       never         new         big     another 
## 0.002170241 0.002165523 0.002160805 0.002005114 0.001948499 0.001939063 
##      movies         man   something        made 
## 0.001924910 0.001887166 0.001877730 0.001858859
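
The two lists above overlap heavily, because words such as film and movie are frequent in every review. One way to see which words actually separate the classes is to compare the two probabilities for each word. The sketch below does this for six words whose values are copied from the output above; the matrix name toy_param is introduced here purely for illustration. Assuming the rows of movie_nb$param are in the same neg, pos order, the same ratio could be computed on the full matrix.

```r
# Word probabilities copied from the printed output above
# (row 1 = neg, row 2 = pos, matching the order used in the solution)
toy_param <- rbind(
  neg = c(film = 0.013998056, movie = 0.010223724, good = 0.003944178,
          plot = 0.003146850, also = 0.002675058, story = 0.003061927),
  pos = c(film = 0.015424588, movie = 0.007464938, good = 0.003688226,
          plot = 0.001821992, also = 0.003861175, story = 0.003752579)
)

# Ratio above 1: the word is relatively more likely under pos
# Ratio below 1: the word is relatively more likely under neg
pos_neg_ratio <- toy_param["pos", ] / toy_param["neg", ]

sort(pos_neg_ratio, decreasing = TRUE)
```

Among these six words, also and story lean towards the pos class, while movie and plot lean towards the neg class.
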

Use the predict() function to predict the sentiment of movies in the test set dfm. The predict function takes two arguments in this instance: 1) the Naive Bayes model you estimated above, and 2) the test-set dfm. Create a confusion matrix of the predicted classes and the actual classes in the test data. What is the accuracy of your model?


movie_test_predicted_class <- predict(movie_nb, newdata = movies_test_dfm)

movie_confusion <- table(movie_test_predicted_class, movies_test_dfm$sentiment)

movie_confusion
##                           
## movie_test_predicted_class neg pos
##                        neg 220  42
##                        pos  38 182

## Accuracy
mean(movie_test_predicted_class == movies_test_dfm$sentiment)
## [1] 0.8340249

Load the caret package (install it first using install.packages() if you need to), and then use the confusionMatrix() function to calculate other statistics relevant to the predictive performance of your model. The first argument to the confusionMatrix() function should be the confusion matrix that you created in the previous step. You should also set the positive argument equal to "pos" to tell R the level of the outcome that corresponds to a “positive” result. Report the accuracy, sensitivity and specificity of your predictions, giving a brief interpretation of each.


library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

movie_confusion_statistics <- confusionMatrix(movie_confusion, positive = "pos")

movie_confusion_statistics
## Confusion Matrix and Statistics
## 
##                           
## movie_test_predicted_class neg pos
##                        neg 220  42
##                        pos  38 182
##                                           
##                Accuracy : 0.834           
##                  95% CI : (0.7977, 0.8661)
##     No Information Rate : 0.5353          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.666           
##                                           
##  Mcnemar's Test P-Value : 0.7373          
##                                           
##             Sensitivity : 0.8125          
##             Specificity : 0.8527          
##          Pos Pred Value : 0.8273          
##          Neg Pred Value : 0.8397          
##              Prevalence : 0.4647          
##          Detection Rate : 0.3776          
##    Detection Prevalence : 0.4564          
##       Balanced Accuracy : 0.8326          
##                                           
##        'Positive' Class : pos             
##

Accuracy: The proportion of observations correctly classified is 0.8340249.

Sensitivity: The proportion of “positive” movie reviews correctly classified is 0.8125.

Specificity: The proportion of “negative” movie reviews correctly classified is 0.8527132.
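
These three statistics can also be recovered by hand from the confusion matrix, which makes their definitions explicit. The following is a minimal base R sketch that re-enters the four counts from the output above; the object name conf is introduced here only for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class)
conf <- matrix(c(220, 38, 42, 182), nrow = 2,
               dimnames = list(predicted = c("neg", "pos"),
                               actual = c("neg", "pos")))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf)) / sum(conf)

# Sensitivity: correctly predicted pos over all actual pos
sensitivity <- conf["pos", "pos"] / sum(conf[, "pos"])

# Specificity: correctly predicted neg over all actual neg
specificity <- conf["neg", "neg"] / sum(conf[, "neg"])

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The results match the Accuracy, Sensitivity and Specificity rows reported by confusionMatrix().
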

Exercise 10.2 - Wordscores for Movie Reviews

We will now use the same training and test set to estimate wordscores for the movie reviews. First, create a new variable (named refscore) for the training set dfm which is equal to 1 for positive movie reviews, and -1 for negative movie reviews. These are the reference scores that we will use for training the model.


movies_train_dfm$refscore <- ifelse(movies_train_dfm$sentiment == "pos", 1, -1)

Use the textmodel_wordscores() function to estimate wordscores on the training dfm. This function requires two arguments: 1) x, for the dfm you are using to estimate the model, and 2) y for the vector of reference scores associated with each training document (i.e. the variable you created in the answer above).


wordscore_model <- textmodel_wordscores(movies_train_dfm, movies_train_dfm$refscore)
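
To build intuition for what the model estimates, here is a hand-rolled sketch of the basic wordscores calculation on hypothetical toy counts. It follows the standard description of the method (each word is scored by the reference scores of the documents it appears in, weighted by how strongly the word is associated with each document). It is not a reproduction of the quanteda implementation, which may differ in details such as smoothing. All object names and counts below are made up for illustration.

```r
# Hypothetical toy counts: two reference documents, three words
counts <- matrix(c(4, 1, 5,
                   1, 4, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("pos_doc", "neg_doc"),
                                 c("great", "bad", "film")))
refscore <- c(pos_doc = 1, neg_doc = -1)

# Relative frequency of each word within each reference document
rel_freq <- counts / rowSums(counts)

# Probability of each reference document, given the word
p_doc_given_word <- sweep(rel_freq, 2, colSums(rel_freq), "/")

# Word score: reference scores weighted by those probabilities
# (refscore is recycled down the rows, so row i is multiplied by refscore[i])
word_scores <- colSums(p_doc_given_word * refscore)
word_scores

# Score for a new document: mean word score, weighted by word frequency
new_doc <- c(great = 2, bad = 0, film = 2)
doc_score <- sum(new_doc / sum(new_doc) * word_scores)
doc_score
```

In this toy case, great scores 0.6, bad scores -0.6, and film, which is equally common in both reference documents, scores 0. The new document scores 0.3: uninformative words such as film pull document scores towards the middle of the scale, which is one reason raw wordscores for real texts tend to be close to zero.
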

Predict the wordscores for the test set using the predict function. Again, for predict() to work, you need to pass it the trained wordscores model, and the test set dfm. Save your predictions as a new metadata variable in your test data dfm.


movies_test_dfm$wordscores <- predict(wordscore_model, movies_test_dfm)

Use the docvars() function on your test set dfm to check that you have correctly assigned the predictions as metadata (hint: if you have done the previous step correctly, then when you run str(docvars(my_test_set_dfm)) you should see a column containing the estimated wordscores).


str(docvars(movies_test_dfm))
## 'data.frame':    482 obs. of  4 variables:
##  $ sentiment : Factor w/ 2 levels "neg","pos": 1 1 1 1 1 1 1 1 1 1 ...
##  $ id1       : chr  "cv004" "cv013" "cv015" "cv025" ...
##  $ id2       : chr  "12641" "10494" "29356" "29825" ...
##  $ wordscores: 'predict.textmodel_wordscores' num  0.0179 -0.0481 -0.1056 0.0511 -0.0241 ...

Use the boxplot() function to compare the distribution of wordscores against the “true” sentiment of the reviews given by human annotators (look at ?boxplot to see how to create the plot). Describe the resulting pattern.


boxplot(movies_test_dfm$wordscores ~ movies_test_dfm$sentiment, ylab = "Raw wordscore")

Our model appears to do a pretty good job of assigning more positive scores to “positive” movie reviews.

Look for examples of texts with positive wordscores (for instance, any text with a wordscore greater than 0.075) that are nonetheless categorised as “negative” by human readers. Look for examples of texts with negative wordscores (for instance, any text with a wordscore smaller than -0.03) that are nonetheless categorised as “positive” by human readers. Why do you think the model gave the wrong predictions in those cases?

Hint: you may want to use logical relations to find the texts you are looking for. For instance, using my_vector > 0.05 will return a logical vector which is equal to TRUE when my_vector is greater than 0.05 and FALSE otherwise. Similarly, my_vector == "a string I want" will return a logical vector which is equal to TRUE when my_vector is equal to “a string I want” and FALSE otherwise.

Hint 2: You can extract the full texts from your corpus object by using as.character(my_corpus) with the appropriate subsetting operator (i.e. single square brackets, as in as.character(my_corpus)[vector]).
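
The two hints can be combined, as in the following small base R sketch. The texts, scores and labels here are hypothetical toy values, not taken from the movie reviews corpus.

```r
# Hypothetical toy data: three "documents", each with a score and a label
my_texts  <- c(doc1 = "a fine film", doc2 = "a dull film", doc3 = "a mixed bag")
my_vector <- c(0.10, -0.04, 0.08)
my_labels <- c("pos", "neg", "neg")

# TRUE only where the score is above 0.05 AND the label is "neg"
keep <- my_vector > 0.05 & my_labels == "neg"
keep

# Subset the character vector with single square brackets
my_texts[keep]
```

Only doc3 satisfies both conditions, so it is the only text returned.
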


as.character(movies_test_corpus)[movies_test_dfm$wordscores > .075 & movies_test_dfm$sentiment == "neg"]
## cv926_18471.txt
## "star wars : ? episode i  -  the phantom menace ( 1999 ) \ndirector : george lucas cast : liam neeson , ewan mcgregor , natalie portman , jake lloyd , ian mcdiarmid , samuel l . jackson , oliver ford davies , terence stamp , pernilla august , frank oz , ahmed best , kenny baker , anthony daniels screenplay : george lucas producers : rick mccallum runtime : 131 min . \nus distribution : 20th century fox rated pg : mild violence , thematic elements \ncopyright 1999 nathaniel r . atcheson \na fellow critic once stated his belief that a reviewer should not speak of himself in his own review . \ni've attempted to obey this rule in recent months , but to do so would be impossible in this case . \nthe fact is , nearly every person who goes to see the phantom menace brings baggage in with them . \nthe original star wars trilogy means so much to so many people . \nfor me , they calibrated my creativity as a child ; they are masterful , original works of art that mix moving stories with what were astonishing special effects at the time ( and they still hold up pretty darn well ) . \ni am too young to have seen star wars in the theater during its original release , but that doesn't make me any less dedicated to it . \non the contrary , the star wars trilogy  -  and the empire strikes back in particular  -  are three items on a very short list of why i love movies . \nwhen i heard that george lucas would be making the first trilogy in the nine-film series , i got exited . \nwhen i first saw screenshots from the film , well over a year ago , i embarked on a year-long drool of anticipation . \nand when the first previews were released last thanksgiving , i was ready to see the film . \nbut then there was the hype , the insane marketing campaign , and lucasfilm's secretive snobbery over the picture . 
\nin the last weeks before the picture opened , while multitudes of fans waited outside of theaters and stood in the boiling sun days in advance just to be the first ones in the theater , i was tired of hearing about it . \ni was tired of seeing cardboard cut-outs of the characters whenever i went to kfc or taco bell . \ni just wanted to see the movie . \nreader , do not misunderstand . \ni did not have an anti-hype reaction . \nthe hype was unavoidable . \ni understand and accept the hype  -  it's just what happens when the prequel to the most widely beloved films of all time get released . \nfive minutes into the phantom menace , i knew there was a problem . \n \" who are these jedi knights ? \" \ni asked . \n \" why are they churning out stale dialogue with machine-gun rapidity ? \" \n \" why aren't these characters being developed before their adventures ? \" \n \" why is there a special effects shot in nearly every frame of the entire film ? \" \nthese were just some of my questions early on . \nlater , i asked , \" where's the magic of the first three films ? \" \nand \" why am i looking at my watch every fifteen minutes ? ' \nby the end of the film , i was tired , maddened , and depressed . \ngeorge lucas has funneled his own wonderful movies into a pointless , mindless , summer blockbuster . \nthe phantom menace is no star wars film . \ntake away the title and the jedi talk and the force , and you're left with what is easily one of the most vacuous special effects movies of all time . \nit's an embarrassment . \ni looked desperately for a scene in which a character is explored , or a new theme is examined , or a special effects shot isn't used . \nthere are a few of each , but they're all token attempts . \nthe fact is , george lucas has created what is simultaneously an abysmally bad excuse for a movie and a pretty good showcase for digital effects . \nthis is not what i wanted to see . 
\ni didn't want to leave the phantom menace with a headache and a bitter taste in my mouth , but i did . \nthe story centers mostly around qui-gon jinn ( liam neeson , looking lost and confused ) and his apprentice , obi-wan kenobi ( ewan mcgregor , who scarcely has a line in the film ) and their attempts to liberate the people of the planet naboo . \nnaboo is the victim of a bureaucratic war with the trade federation ; their contact on naboo is queen amidala ( natalie portman ) , the teenage ruler who truly cares for her people . \nafter picking up jar jar binks ( a completely cgi character , voiced by ahmed best ) , they head to tatooine , where they meet young anakin skywalker ( jake lloyd ) and his mother ( pernilla august ) . \nqui-gon knows that the force is strong with young anakin , and so the jedi knights take the boy with them on their journeys . \nthe bad guys are darth maul and darth sidious , neither of whom have enough lines to register as characters . \nthere isn't anything particularly wrong with this story when looking at it in synopsis form . \nthe way lucas has handled it , however , it unsatisfactory . \nfirst of all , we don't learn one single thing about qui-gon jinn . \nnot one thing . \nwhat was his life like before this film ? \nwell , i imagine he didn't have one . \nthat's why he feels like a plot device . \nthis probably explains why neeson looks so hopeless in the role , and why he's recently retired from film ( i don't blame him , honestly ) . \nobi-wan , a character i was really looking forward to learning more about , is even less interesting . \nmcgregor has just a few lines , so anyone hoping to see the engaging young actor in a great performance is urged to look elsewhere . \nsince these two men are the focus of the phantom menace , lucas has served us a big emotional void as the centerpiece of his movie . 
\nthings start to pick up when our characters reach tatooine ; young anakin is perhaps the only truly fleshed-out character in the film , and lloyd does a thoughtful job with the role . \ni was also hugely impressed with the sand speeder scene ; rarely is an action sequence so fast and so exciting . \nand when anakin says goodbye to his mother , i found it moving . \nalso fairly good is portman , and she manages to give a little depth to a character where no depth has been written . \njar jar binks is one of the most annoying characters i've ever had to endure , but he's more interesting than most of the humans . \nas soon as the relatively-brief segment on tatooine is over , it's back to the mind-numbing special effects and depthless action scenes . \ni've seen many movies that qualify as \" special effects extravaganzas , \" but the phantom menace is the first one i've seen that had me sick of the special effects fifteen minutes into the movie . \nthe reason is obvious : george lucas has no restraint . \ni can't say that i didn't find the effects original , because i did  -  the final battle between darth maul , obi-wan , and qui-gon is visually exceptional , as is most of the film . \nbut i also found the effects deadening and tiresome . \nmy breaking point was near the end of the picture , as anakin is getting questioned by yoda and the other jedi masters ; in the background , we see hundreds of digital spaceships flying around through a digital sky , and i wanted that to go away . \ncan't we have one stinking scene that isn't bursting at the seems with a special effects shot ? \ni got so sick of looking at the cgi characters and spaceships and planets and backgrounds that i really just wanted to go outside and look at a physical landscape for a few hours . \nand then there's the question of magic . \nwhat was lost in the sixteen years between the phantom menace and return of the jedi ? 
\ni have a feeling that lucas was so focused on how his movie looked that he forgot entirely the way it should feel . \njohn williams' familiar score is no help , nor is lucas' direction . \ni think it comes right down to characters : there are none here . \ni longed for the magnetic presence of han , luke , and leia , but i got no such thing . \nand what about the ridiculous expectations ? \nmine weren't that high ; i simply wanted a film that showed me the roots of the films that i grew up loving , a story that had a few characters and a few great special effects . \ninstead , i got two hours and fifteen minutes of a lifeless and imaginative computer graphics show . \ni don't hate the phantom menace as much as i resent it : i'd like to forget that it exists , and yet i can't . \nit's here to stay . \ni can only hope that episodes ii and iii have something of substance in them , because if they don't , then lucas will have pulled off the impossible task of destroying his own indestructible series . "


as.character(movies_test_corpus)[movies_test_dfm$wordscores < -.03 & movies_test_dfm$sentiment == "pos"]
## cv019_14482.txt
## "there's something about ben stiller that makes him a popular choice among casting directors these days . \nstiller currently has three projects in circulation , and what other actor can lay claim to that ? \nhe's in \" there's something about mary , \" which i * still * haven't seen . \nand he's in the acerbic \" your friends & neighbors , \" playing a talkative , sexually-frustrated drama coach called jerri . \nnow there's \" permanent midnight , \" in which stiller plays another jerry , this one a heroin-addicted television writer , last name stahl . \nthere's also something about this industry that pushes bankable stars like stiller into doing drug-addiction pictures the minute they've proved themselves commercially . \newan mcgregor springs to mind who , after successful turns in \" emma \" and \" brassed off , \" received greater respect and admiration for his mind-blowing realization as renton in danny boyle's transatlantic junk-fest , \" trainspotting . \" \nthe philosophy appears to be a simple one : if you want 'em to be taken seriously , make 'em do drugs . \n \" permanent midnight \" is based on the true life experiences of jerry stahl , a successful hollywood writer who , in the mid-eighties , had a $5 , 000-a-week job churning out plotlines for disposable tv sitcoms and a $6 , 000-a-week heroin habit . \na habit , in stahl's own words , \" the size of utah . \" \nas stahl , stiller contributes a commanding performance . \nunlike \" trainspotting , \" which was successful in having it both ways by chronicling both the highs and the lows of heroin abuse , \" permanent midnight \" instead focuses on the concept of drug addiction as maintenance . \none of the earliest observations in the film is a casual reference to \" naked lunch \" author william s . burroughs who , when asked why he shoots up first thing in the morning responds , \" so i can shave . 
\" \nstahl rarely appears to be puncturing veins for the thrill of it all in \" permanent midnight \" ; it's so he can talk to his mother on the phone , show up for work on time , even pay his bills . \nwhile the film itself occasionally wobbles around along with stahl , the writing ( adapted from stahl's autobiography by director david veloz ) is controlled and pointed . \n \" permanent midnight \" shows how stahl moved from new york to l . a . to - again in the author's words -  \" escape the drug scene \" ( yeah , right ) ; why he entered into a convenient marriage with a british tv exec ( elizabeth hurley , so impossibly polite you'd swear her single profanity was dubbed ) ; and that he conceived a child in between his random hirings and firings . \nstahl narrates all this in a motel bedroom to a sympathetic lover called kitty ( norristown's own maria bello ) with whom he spent some rehab time . \njaneane garofalo is wasted - and miscast - as a heavily-bespectacled hollywood talent agent who fails to get her hooks into the doped-up wordsmith , and that's stahl himself playing a jaded clinic counselor . \nstiller , unshaven ( burroughs take note ) and with lots of mascara around the eyes , has stahl stumble through the film looking like a train wreck but , to his credit , never once pushes his pill-popping , needle-jabbing performance over the top . \nthe ubiquitous stiller is the reason to see \" permanent midnight \" ; a dark , comic , and strangely absorbing study of assisted living . " 
## cv460_10842.txt
## "deep rising is one of \" those \" movies . \nthe kind of movie which serves no purpose except to entertain us . \nit does not ask us to think about important questions like life on other planets or the possibility that there is no god . . . screw that , it says boldly , let's see some computer generated monsters rip into , decapitate and generally cause irreparable booboos to a bunch of little known actors . \nheh ! \nthem wacky monsters , gotta love 'em . \nof course , since we can rent about a thousand b movies with the same kind of story , hollywood must give that little extra \" oumph \" to get people in theaters . \nthat is where deep rising fails , which is a good thing . \nconfused ? \nlet me explain : \ndespite all them flashy effects and big explosions , deep rising is still , at heart , a good 'ol b movie . \nluckily , it's a very good b movie . \nthe worst cliches in movie history are a b movie's bread and butter . \ntherefore , things that would destroy a serious movie actually help us have a good time while watching a movie of lower calibre . \nof course we know there's a big slimy creature behind that door , that one person will wander off to be picked off by said monster and we always know which persons or person will make it out alive . \nwe just don't know when or how horrible it will be . \ni went to see deep rising with my expections low and my tolerance for bad dialogue high . \nimagine my surprise when i discover that deep rising is actually , well , pretty darn funny at times . \na funny b movie ? \nwell , that's new . \nthese flicks are not supposed to make us laugh . \n ( except for a few unintended laughs once a while . ) \nand before you know it , treat williams , wes studi and famke jansen appear on the big screen . \nhey ! i know them guys ( and gal ) from a couple of other movies . \ncool . \nfamiliar faces . \nso far so good . \nour man treat is the hero , he'll live . \nwes is a staple of b movies , he is the token victim . 
\nwe know he'll buy the farm but he will take a few creeps with him on the way out . \nfamke is the babe , 'nuff said . \nthere is also a guy with glasses ( the guy with glasses always dies ) a black person ( b movie buffs know that the black guy always dies , never fails ) and a very funny , nerdy guy . \n ( ah ! \ncomic relief . \nhow can we possibly explain having to kill him . . . let \nhim live . ) \nafter the first fifteen minutes i felt right at home . \ni know who to root for and who i need to boo too and a gum to chew . \n ( please kill me . ) \nsuffice it to say that for the next hour and a half i jumped out of my seat a few times , went \" ewwww \" about a dozen times and nearly had an orgasm over all the explosions and firepower our heroes were packing . \ni'm a man , we nottice these things . \nall in all , i'd recommend deep rising if you are looking for a good time and care to leave your brain at the door . . . but \nbring your sense of humor and excitement in with you . \nthe acting is decent , the effects top rate . \nhow to best describe it ? \nput together the jet ski scene from hard rain , the bug attacks from starship troopers , a couple of james bond like stunts and all those scenes from friday the thirteenth and freddy where you keep screaming \" don't go there , he's behind you \" and you end up with deep rising . \nfor creepy crawly goodness , tight t-shirts , major firepower and the need to go to the bathroom every fifteen minutes from seing all that water . " 
## cv876_9390.txt
## "usually when one is debating who the modern queen of the romantic comedy is they will bring up names like julia roberts or sandra bullock . \nothers will mention meg ryan . \nbut for me , it's not even close . 
\njaneane garofalo is not only the queen of the romantic comedy , she is the best comic actress in hollywood right now . \nand it's a good thing she's starring in the matchmaker , because without her presence the movie would be bland , unfunny , and dull . \ngarofalo stars as marcy tizard , a top aide to boston senator john mcglory , who is suffering in the polls . \nin an attempt to capture the irish vote , he sends marcy on a mission to a small irish town called ballinagra in search of other mcglory's that never moved to america . \nunfortunately for marcy , her visit coincides with the town's annual matchmaking festival . \nthings get off to a rocky start for marcy though . \nshe has no hotel reservations ( for no rational reason ) and the tiny confined room ( tired old cliche' ) she has to stay in has a visitor in her bathtub . \nhis name is sean , and marcy finds him repugnant at first , so you can obviously tell where this is headed . \nthe movie runs into a few roadblocks . \nfor instance , the story is very thin . \nnone of the characters ( except the old local matchmaker ) are nearly as interesting as garofalo . \nsome of the characters , like the political aide played by denis leary , have wandered in from a completely different movie . \ni think the director realized this and decided to throw in numerous shots of the beautiful irish scenery , and several close-ups of garofalo's winning smile . \nthe strange thing is that it works . \ngarofalo's charm and the irish scenery could carry the thinnest of stories , and it carries this one . "

The model fails in these cases because the reviews contain many words that are associated with the opposite class. These words mostly describe the subject matter of the movie rather than express the reviewer's sentiment about it.
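A toy calculation can make this failure mode concrete. The sketch below uses base R only, with invented word likelihoods (they are not estimated from the movie review corpus) and equal class priors. It shows how a positive review that lists several negative-sounding words can still end up with a negative Naive Bayes score.

```r
# Invented likelihoods P(word | class), for illustration only.
lik <- data.frame(
  word = c("dull", "bland", "unfunny", "charm", "winning"),
  pos  = c(0.05, 0.05, 0.05, 0.45, 0.40),
  neg  = c(0.35, 0.30, 0.25, 0.05, 0.05),
  stringsAsFactors = FALSE
)

# A positive review that mentions what the film would be like without its star.
review <- c("bland", "unfunny", "dull", "charm", "winning")

# With equal priors, Naive Bayes sums the per-word log likelihood ratios.
idx <- match(review, lik$word)
log_odds <- sum(log(lik$pos[idx]) - log(lik$neg[idx]))

# A negative sum means the model predicts "neg".
predicted <- if (log_odds > 0) "pos" else "neg"
predicted
```

Three negative-leaning words outweigh two positive-leaning ones, even though the reviewer's overall judgement is favourable.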

Exercise 10.3 - Wordfish for Irish Parliamentary Debates (Hard question)

In this part of the assignment, you will use R to understand and apply unsupervised document scaling. Use the data_corpus_irishbudget2010 in quanteda.textmodels for this. You will also need to load (and possibly install) the quanteda.textplots package first.

Fit a wordfish model of all the documents in this corpus. Apply any required preprocessing steps first. Use the textplot_scale1d function to visualize the result. (You may want to use the advanced options of this function to get a better plot than just the default one.)

What do you learn about what the dimension is capturing? You can use Wikipedia to learn about the Irish parties involved in this debate to help you answer this question.


library(quanteda.textplots)

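# Tokenise the speeches, drop punctuation, and stem the tokens
irish_tokens <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE) %>%
  tokens_wordstem()

# Build the document-feature matrix and remove English stopwords
irish_dfm <- dfm(irish_tokens) %>%
  dfm_remove(pattern = stopwords("en"))

# Fit the wordfish model
wordfish_model <- textmodel_wordfish(irish_dfm)

# Plot the estimated document positions, grouped by party
textplot_scale1d(wordfish_model, groups = data_corpus_irishbudget2010$party)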

The model is capturing a government versus opposition dimension rather than a left-right dimension. Here $\theta$ acts as an opposition score, so speakers from parties such as Labour, which was in opposition in 2010, receive higher values.
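One way to check an interpretation like this is to average the estimated positions within each party. The sketch below is a minimal base R illustration with invented positions and party labels; the helper name `mean_by_group` is hypothetical and not part of quanteda.

```r
# Hypothetical helper: average a numeric score within each group, sorted ascending.
mean_by_group <- function(score, group) {
  sort(vapply(split(score, group), mean, numeric(1)))
}

# Invented positions and party labels, for illustration only.
toy_theta <- c(-1.0, -0.5, 1.0, 1.5)
toy_party <- c("FF", "FF", "LAB", "LAB")

group_means <- mean_by_group(toy_theta, toy_party)
group_means
```

With the fitted model, the same helper could be called as `mean_by_group(wordfish_model$theta, data_corpus_irishbudget2010$party)`, assuming the documents in the model are in the same order as in the corpus.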

Plot the wordfish “Eiffel Tower” plot (as in Figure 2 of Slapin and Proksch 2008), from the wordfish object. You can do this using the textplot_scale1d function. What is your interpretation of these results?


textplot_scale1d(wordfish_model, margin = "features")

In this case, the plot is very hard to interpret! There is some evidence that “citizenship” is an especially discriminating word on the estimated dimension, but the other words are hard to make sense of (because they are difficult to see). It is somewhat easier to just extract the most discriminating words at each end of the dimension, as follows:


head(wordfish_model$features[order(wordfish_model$beta, decreasing = TRUE)])
## [1] "citizenship" "screw"       "phrase"      "precis"      "internat"   
## [6] "passport"

head(wordfish_model$features[order(wordfish_model$beta, decreasing = FALSE)])
## [1] "innov"      "summari"    "boost"      "day-to-day" "particip"  
## [6] "enhanc"

Even now, however, it is hard to know what this dimension “means” in any real sense. Perhaps it would be easier if you had a PhD in Irish Politics (which I do not). Let this serve as a cautionary tale about the difficulties of unsupervised learning for text!
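If you want to look at both ends of the dimension side by side, a small helper can collect them into one table. This is only a sketch: `extreme_words` is a hypothetical name, and the toy vectors below are invented so that the block runs without the fitted model.

```r
# Hypothetical helper: the n words with the lowest and the highest beta values.
extreme_words <- function(features, beta, n = 6) {
  ord <- order(beta)  # indices from the most negative to the most positive beta
  data.frame(
    negative_end = features[head(ord, n)],
    positive_end = features[rev(tail(ord, n))],
    stringsAsFactors = FALSE
  )
}

# Invented features and betas, for illustration only.
toy <- extreme_words(c("a", "b", "c", "d"), c(0.5, -2, 3, 0), n = 2)
toy
```

With the fitted model, the call would be `extreme_words(wordfish_model$features, wordfish_model$beta)`.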

Plot the log of the length in tokens of each text against the $\hat{\alpha}$ from your estimated wordfish model. What does the relationship indicate? (Hint: you can use the ntoken() function on your dfm to extract the number of words in each text.)

plot(x = log(ntoken(irish_dfm)),
     y = wordfish_model$alpha, pch = 19,
     xlab = "log token count for each document",
     ylab = "estimated alpha")

The relationship is positive: $\hat{\alpha}$ is a document fixed effect, so it measures how long each text is, that is, how much each politician speaks.
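To see why, it helps to write out the wordfish model of Slapin and Proksch (2008). The notation here ($y_{ij}$ for the count of word $j$ in document $i$, and $\lambda_{ij}$ for its expected value) is my own shorthand rather than something defined earlier in this assignment:

$$
y_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
$$

Summing the expected counts over words gives the expected length of document $i$:

$$
\sum_j \lambda_{ij} = \exp(\alpha_i) \sum_j \exp(\psi_j + \beta_j \theta_i)
$$

Taking logs, the expected log length is $\alpha_i$ plus a second term. As long as the $\beta_j \theta_i$ terms are small relative to the $\psi_j$, that second term varies little across documents, so $\alpha_i$ should track the log token count closely.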

Plot the log of the frequency of the 1000 most frequent words against the corresponding $\hat{\psi}$ values from your estimated wordfish model, and describe the relationship. The topfeatures() function might be helpful here.

# Find the top 1,000 words and their frequencies
top1000 <- topfeatures(irish_dfm, n = 1000)
top1000 <- data.frame(word = names(top1000),
                      freq = as.numeric(top1000),
                      stringsAsFactors = FALSE)

# Extract the estimated psi parameters
df <- data.frame(
  word = wordfish_model$features,
  psi_hat = wordfish_model$psi,
  stringsAsFactors = FALSE
)

# Merge the word counts with the estimated word-level coefficients
# (an inner join on "word", so only the top 1,000 words are kept)

df <- merge(df, top1000, by = "word")

# Plot the result

plot(
  x = log(df$freq),
  y = df$psi_hat,
  pch = 19, col = "gray",
  xlab = "log(word frequency)",
  ylab = "estimated psi"
)

The relationship is positive: $\hat{\psi}$ is a word fixed effect, so it captures the log frequency with which each word appears in the corpus.
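A quick simulation can illustrate why the fixed effects behave this way. The sketch below uses base R only and draws counts from the wordfish functional form, log(lambda_ij) = alpha_i + psi_j + beta_j * theta_i, with invented parameter values. It does not fit a wordfish model; it only checks that the true alpha and psi line up with log document length and log word frequency.

```r
set.seed(1)
n_docs  <- 20
n_words <- 200

# Invented "true" parameters, for illustration only.
alpha <- rnorm(n_docs, mean = 0, sd = 0.5)   # document fixed effects
psi   <- rnorm(n_words, mean = 1, sd = 1)    # word fixed effects
beta  <- rnorm(n_words, mean = 0, sd = 0.3)  # word discrimination parameters
theta <- as.numeric(scale(rnorm(n_docs)))    # document positions

# log(lambda) is a documents-by-words matrix.
log_lambda <- outer(alpha, psi, "+") + outer(theta, beta)
counts <- matrix(rpois(n_docs * n_words, exp(log_lambda)), n_docs, n_words)

# Document length against alpha.
cor_alpha <- cor(log(rowSums(counts)), alpha)

# Word frequency against psi (dropping words that never occur).
word_freq <- colSums(counts)
keep <- word_freq > 0
cor_psi <- cor(log(word_freq[keep]), psi[keep])

c(alpha = cor_alpha, psi = cor_psi)
```

Both correlations should be close to one, which mirrors the patterns in the two plots above.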

Jack Blumenau
