Word embeddings are all the rage.[1] Whenever we have represented words as data in previous seminars, we have simply counted how often they occur across and within documents. We have largely viewed words as individual units: strings of text that uniquely identify a given meaning, with no natural notion of similarity for grouping similar words together.
By contrast, word-embedding approaches represent each unique word in a corpus as a dense, real-valued vector. As we discussed in the lecture, these vectors turn out to encode a lot of information about the ways in which words are used, and this information can be put to good use across a wide range of questions.
In the seminar today, we will familiarise ourselves with some of the pre-trained word embeddings from the GloVe project. We will use these vectors to discover similarities between words, to complete analogy tasks, and to supplement the dictionary-based approaches to measurement that we covered earlier in the course.
You will need to load the following packages before beginning the assignment:
library(tidyverse)
library(quanteda)
library(text2vec)
# If you cannot load these libraries, try installing them first. E.g.:
# install.packages("text2vec")
Download link for GloVe embeddings
Today we will be using the pre-trained GloVe embeddings, which can be downloaded from the link above. Note that the file which contains the word embeddings is very large! It may therefore take a minute or two to download, depending on your internet connection. Once you have downloaded the file, store it in the folder you have created for this assignment.
Despite the large size of the file, we are actually using one of the smaller versions of the GloVe embeddings, which were trained on a combination of Wikipedia and news data. The embeddings are of dimension 300 and cover some 400,000 words. Note that you could replicate any part of this assignment with larger versions of the GloVe embeddings by downloading them from the GloVe project website, but any differences for the applications here are likely to be small.
Load the glove embeddings into R using the load() function.
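For example, assuming the downloaded file is called "glove.RData" (adjust to match the name of the file you downloaded) and that it contains an object called glove:

# Loads the saved embedding object (assumed to be called glove) into the environment
load("glove.RData")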
Look at the dimensions of the glove embeddings object. How many rows does this object have? How many columns? What do these represent?
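A quick way to check, assuming the loaded object is called glove:

# Rows correspond to words in the vocabulary; columns to embedding dimensions
dim(glove)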
Write a function to calculate the cosine similarity between a selected word and every other word in the glove embeddings object. I have provided some starter code below. You should be able to work out what goes in each part of this function by looking at the examples in the lecture slides.
You will need to use the sim2() function from the text2vec package here. Note that this function requires two main arguments: 1) x – a matrix of embeddings; 2) y – a second matrix of embeddings for which you would like to compute similarities. It is important to note that both of these inputs must be in matrix form. If you extract a vector from the glove object for a selected word, you have to transform it into a matrix for use with this function. To do so, use the matrix() function, setting the nrow argument equal to 1.
The function you create should take two inputs: 1) target_word – the word for which you would like to calculate similarities; 2) n – the number of nearest neighbouring words to return.
similarities <- function(target_word, n){
# Extract embedding of target word by subsetting to the relevant row of the glove object
# Calculate cosine similarity between target word and other words using the sim2 function
# Report nearest neighbours of target word (i.e. those with the largest cosine similarity scores)
}
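For reference, one possible completion of this function is sketched below. It assumes the embeddings are stored in a matrix called glove whose row names are the words.

similarities <- function(target_word, n){
  # Extract embedding of target word and coerce it to a 1-row matrix
  target_vec <- matrix(glove[target_word, ], nrow = 1)
  # Cosine similarity between the target word and every word in glove
  target_sim <- sim2(x = glove, y = target_vec, method = "cosine", norm = "l2")
  # Report the n nearest neighbours (note: the target word itself will be included)
  head(sort(target_sim[, 1], decreasing = TRUE), n)
}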
Write a function that computes analogies of the form “a is to b as c is to ___”. For instance, if a is “man”, b is “king”, and c is “woman”, then the missing word should be “queen”.
Your function will need to take four arguments. Three arguments should correspond to the words included in the analogy. The fourth should be an argument specifying the number of nearest neighbouring words returned. Again, I have provided some starter code below, which you should be able to complete by consulting the lecture slides.
analogies <- function(a, b, c, n){
# Extract vectors for each of the three words in analogy task by subsetting the glove matrix
# Generate analogy vector: vector(c) - vector(a) + vector(b)
# Calculate cosine similarity between analogy vector and all other vectors using the sim2 function
# Report nearest neighbours of analogy vector
}
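And one possible completion of the analogy function, under the same assumption that glove is a matrix with words as row names:

analogies <- function(a, b, c, n){
  # Extract vectors for the three words in the analogy
  a_vec <- glove[a, ]
  b_vec <- glove[b, ]
  c_vec <- glove[c, ]
  # Generate the analogy vector: vector(c) - vector(a) + vector(b)
  target_vec <- matrix(c_vec - a_vec + b_vec, nrow = 1)
  # Cosine similarity between the analogy vector and all word vectors
  target_sim <- sim2(x = glove, y = target_vec, method = "cosine", norm = "l2")
  # Report the n nearest neighbours of the analogy vector
  head(sort(target_sim[, 1], decreasing = TRUE), n)
}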
Use the function you created above to find the word-embedding answers to the following analogy completion tasks.
Come up with some of your own analogies and try them here.
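For example, the king/queen analogy from the start of this section could be checked as follows (assuming the completed analogies() function sketched above):

# "man is to king as woman is to ___"
analogies(a = "man", b = "king", c = "woman", n = 5)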
In this exercise, we will use the Moral Foundations Dictionary to score some Reddit posts in terms of their moral content.
We will use two sources of data for this part of the assignment:

Moral Foundations Dictionary – mft_dictionary.csv
Moral Foundations Reddit Corpus – mft_texts.csv

Load both files into R with read_csv():

mft_dictionary_words <- read_csv("mft_dictionary.csv")
mft_texts <- read_csv("mft_texts.csv")
Create a vector of the MFT “Care” words.
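A sketch of this step, assuming mft_dictionary_words has one row per dictionary word, with columns named word and foundation and the care words labelled "care" (check the actual column names and labels in the file):

care_words <- mft_dictionary_words %>%
  filter(foundation == "care") %>%
  pull(word)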
Extract the embeddings from the glove object relating to the care words.
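A minimal sketch, assuming care_words is the vector created above and glove has words as its row names:

# Keep only the care words that appear in the embedding vocabulary
care_embeddings <- glove[rownames(glove) %in% care_words, ]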
Calculate the mean embedding vector of the care words. To do this, use the colMeans() function, which will calculate the mean of each column of the matrix.
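Continuing the sketch:

# One mean value per embedding dimension (i.e. per column)
mean_care_vector <- colMeans(care_embeddings)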
Calculate the similarity between the mean care vector and every other word in the glove embedding object. To do so, use the sim2() function again.
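As before, the single mean vector must be passed to sim2() as a one-row matrix:

# Cosine similarity between the mean care vector and every word in glove
care_sim <- sim2(x = glove,
                 y = matrix(mean_care_vector, nrow = 1),
                 method = "cosine", norm = "l2")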
What are the 500 words that have highest cosine similarity with the mean care vector? How many of these words are in the original dictionary?
Examine the words that are in the top 500 words that you calculated above but which are not in the original care dictionary. Do these represent the concept of care?
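One way to approach the previous two questions, continuing the sketch above:

# The 500 words closest to the mean care vector
top_500_care <- names(sort(care_sim[, 1], decreasing = TRUE)[1:500])
# How many of these are already in the original care dictionary?
sum(top_500_care %in% care_words)
# Which of them are new, i.e. not in the original dictionary?
setdiff(top_500_care, care_words)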
What does your answer to the previous question suggest about this dictionary expansion approach?
The mft_texts object includes a series of variables that record the human annotations of which category each text falls into. In this part of the assignment, you will use dictionary-based methods to score the texts, and compare the dictionary scores to those human codings. If you have forgotten how to apply dictionaries, go back and look at the material from day 9.
Create a new dictionary which includes two categories. The first should be a care_original_words category, which contains only the words from the original care dictionary. The second should be a care_embedding_words category, which contains both the original care words and the top 500 words that you extracted in the last section.
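A sketch using quanteda's dictionary() function, assuming care_words and top_500_care were created in the previous section:

mft_care_dictionary <- dictionary(list(
  care_original_words = care_words,
  care_embedding_words = unique(c(care_words, top_500_care))
))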
Use the dictionary you just constructed to score the texts in the mft_texts object. Create variables in that object that indicate whether a given dictionary classifies each text as a care text or not (i.e. classify a text as a care text if it contains any words from the relevant dictionary).
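A sketch of one way to do this with quanteda; it assumes the Reddit posts are stored in a column of mft_texts called text (check the actual column name) and uses the dictionary constructed above:

# Count dictionary words in each text
mft_dfm <- mft_texts$text %>%
  tokens() %>%
  tokens_lookup(dictionary = mft_care_dictionary) %>%
  dfm()

# Classify a text as a care text if it contains at least one dictionary word
dict_counts <- convert(mft_dfm, to = "data.frame")
mft_texts$care_original <- dict_counts$care_original_words > 0
mft_texts$care_embedding <- dict_counts$care_embedding_words > 0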
Create a confusion matrix which compares the human annotations to the scores generated by the dictionary analysis. Which performs best, the original dictionary or the word-embedding approach?
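A minimal sketch, assuming the human annotation for the care category is stored in a logical or 0/1 variable in mft_texts (here called care; adjust to the actual variable name):

# Human codings vs. original dictionary
table(human = mft_texts$care, dictionary = mft_texts$care_original)
# Human codings vs. embedding-expanded dictionary
table(human = mft_texts$care, dictionary = mft_texts$care_embedding)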
[1] A good qualitative indicator of the success of an innovation in quantitative methods is when it is discussed in some detail in the London Review of Books.