You will need to load the following packages before beginning the assignment:
library(tidyverse)
library(quanteda)
library(guardianapi)
# Run the following code if you cannot load the guardianapi package:
# devtools::install_github("evanodell/guardianapi")
# Or, alternatively:
# install.packages("guardianapi")
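If you are not sure whether you already have these packages, a quick optional check is to compare the list against your installed packages. This sketch only reports what is missing; it does not install anything:

```r
# Optional sanity check: report any required packages that are not yet
# installed (nothing is installed automatically here)
pkgs <- c("tidyverse", "quanteda", "guardianapi")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
}
```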
An Application Programming Interface (API) is often the most convenient way to collect well-structured data from the web. We will cover an example here of getting data from the Guardian newspaper's API using the guardianapi package, and of analysing it using quanteda.
In theory,¹ the Guardian API requires only minimal effort to access. In particular, to access the API you need to register for a developer API key, which you can do by visiting this webpage. You will need to select the “Register developer key” option on that page:
Once you have clicked that button, you will be taken to a page where you can add details of your project.
Fill in the details on that form as follows:
You should receive an email asking you to confirm your LSE email address, and then a second email which contains your API key. It will be a long string like "g257t1jk-df09-4a0c-8ae5-101010d94e428".
Make sure you save this key somewhere!
You can then authenticate with the API by using the gu_api_key() function:
gu_api_key()
When you run that function, you will see the following message appear in your R console.
## Please enter your API key and press enter:
Paste the API key that you received via email into the console and you should see the following message:
## Updating gu.API.key session variable...
You should now be able to use the API functions that are available in the guardianapi package! We will cover some of these functions below.
We will start by using the gu_content() function to retrieve some data from the API. This function takes a number of arguments; some of the more important ones are listed in the table below:
| Argument | Description |
|---|---|
| `query` | A string containing the search query. Today, you can choose a simple query which will retrieve any newspaper article published in the Guardian that contains that term. |
| `from_date` | The start date to which we would like to constrain our search. This argument should be a character string of the form `"YYYY-MM-DD"`. We will use `"2021-01-01"` today, so that we gather articles published on 1st January 2021 or later. |
| `to_date` | The end date of our search. We will use `"2021-12-31"`, so as to collect articles up to 31st December 2021. |
| `production_office` | The Guardian operates in several countries, and this argument allows us to specify which version of the Guardian we would like to collect data from. We will set this to `"UK"` so that we collect news stories published in the UK. |
Execute the gu_content() function using the arguments as specified in the table above. There are two very important things to remember about this step: you should only run the function once, so as not to repeatedly make calls to the API, and you should save the output as soon as you have it.
# You should only run this function once so as to not repeatedly make calls to the API
gu_out <- gu_content(query = "YOUR_SEARCH_TERM_GOES_HERE",
from_date = "2021-01-01",
to_date = "2021-12-31",
production_office = "UK")
I used the term "china" for the query argument, but you can select whatever search term you like. Once the call has finished, save the output so that you do not need to query the API again:
save(gu_out, file = "gu_out.Rdata")
You can then load the data file (if you need to) using the load() function as usual:
load(file = "gu_out.Rdata")
glimpse(gu_out)
## Rows: 5,009
## Columns: 46
## $ id <chr> "media/2021/dec/21/china-deletes-soci…
## $ type <chr> "article", "article", "article", "art…
## $ section_id <chr> "media", "books", "world", "world", "…
## $ section_name <chr> "Media", "Books", "World news", "Worl…
## $ web_publication_date <dttm> 2021-12-21, 2021-11-28, 2021-11-23, …
## $ web_title <chr> "China deletes social media accounts …
## $ web_url <chr> "https://www.theguardian.com/media/20…
## $ api_url <chr> "https://content.guardianapis.com/med…
## $ tags <list> [<data.frame[7 x 12]>], [<data.frame…
## $ is_hosted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ pillar_id <chr> "pillar/news", "pillar/arts", "pillar…
## $ pillar_name <chr> "News", "Arts", "News", "News", "News…
## $ headline <chr> "China deletes social media accounts …
## $ standfirst <chr> "Huang Wei, known by username Viya, w…
## $ trail_text <chr> "Huang Wei, known by username Viya, w…
## $ byline <chr> "Vincent Ni", "Isabel Hilton", "Vince…
## $ main <chr> "<figure class=\"element element-imag…
## $ body <chr> "<p>China has deleted social media ac…
## $ wordcount <chr> "442", "1376", "455", "300", "995", "…
## $ first_publication_date <dttm> 2021-12-21, 2021-11-28, 2021-11-23, …
## $ is_inappropriate_for_sponsorship <chr> "false", "false", "false", "false", "…
## $ is_premoderated <chr> "false", "false", "false", "false", "…
## $ last_modified <chr> "2021-12-21T13:44:51Z", "2021-11-28T0…
## $ production_office <chr> "UK", "UK", "UK", "UK", "AUS", "AUS",…
## $ publication <chr> "theguardian.com", "The Observer", "T…
## $ short_url <chr> "https://www.theguardian.com/p/k3tpd"…
## $ should_hide_adverts <chr> "false", "false", "false", "false", "…
## $ show_in_related_content <chr> "true", "true", "true", "true", "true…
## $ thumbnail <chr> "https://media.guim.co.uk/8f16b195e33…
## $ legally_sensitive <chr> "false", "false", "false", "false", "…
## $ lang <chr> "en", "en", "en", "en", "en", "en", "…
## $ is_live <chr> "true", "true", "true", "true", "true…
## $ body_text <chr> "China has deleted social media accou…
## $ char_count <chr> "2790", "8338", "2779", "1877", "6403…
## $ should_hide_reader_revenue <chr> "false", "false", "false", "false", "…
## $ show_affiliate_links <chr> "false", "false", "false", "false", "…
## $ byline_html <chr> "<a href=\"profile/vincent-ni\">Vince…
## $ newspaper_page_number <chr> NA, "43", "37", "30", "29", NA, NA, N…
## $ newspaper_edition_date <date> NA, 2021-11-28, 2021-11-24, 2021-12-…
## $ sensitive <chr> NA, "true", NA, NA, NA, NA, NA, NA, N…
## $ comment_close_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ commentable <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ display_hint <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ live_blogging_now <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ star_rating <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ contributor_bio <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
We have successfully retrieved 5009 articles that contain the term "china" published in 2021.
We get a lot of information about these articles! In addition to the text of each article (body_text), we also get the title of each article (web_title), the date of publication (first_publication_date), the section of the newspaper in which the article appeared (section_name), the author of the article (byline), as well as many other pieces of potentially useful metadata.
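Because the metadata comes back as ordinary columns, simple summaries are easy. For example, to see which sections the articles come from, you can tabulate section_name. A sketch with a toy vector standing in for gu_out$section_name (your real results will differ):

```r
# Toy stand-in for gu_out$section_name -- with the real data you would pass
# that column to table() directly
section_name <- c("Media", "Books", "World news", "World news", "World news")
sort(table(section_name), decreasing = TRUE)
```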
Convert the text of the articles (the body_text variable) into a quanteda dfm object. How many features are in your dfm?
# Convert to corpus
gu_corpus <- corpus(gu_out, text_field = "body_text")
# Tokenize
gu_tokens <- gu_corpus %>%
tokens(remove_punct = T,
remove_symbols = T,
remove_url = T)
# Convert to DFM
gu_dfm <- gu_tokens %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
dfm_trim(max_docfreq = .9,
docfreq_type = "prop",
min_termfreq = 10)
gu_dfm
## Document-feature matrix of: 5,009 documents, 30,108 features (97.82% sparse) and 45 docvars.
## features
## docs deleted social media accounts influencer known country's livestreaming
## text1 1 3 4 2 1 2 1 3
## text2 0 0 0 1 0 0 0 0
## text3 0 1 2 0 0 0 1 0
## text4 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 1 1 0
## text6 0 0 1 0 0 0 2 0
## features
## docs queen stripped
## text1 2 1
## text2 0 0
## text3 0 0
## text4 0 0
## text5 0 0
## text6 0 0
## [ reached max_ndoc ... 5,003 more documents, reached max_nfeat ... 30,098 more features ]
topfeatures(gu_dfm)
## said people new us government also health
## 77384 42047 37263 30672 27951 26699 25921
## vaccine one covid-19
## 25530 24446 23529
library(quanteda.dictionaries)
# Define a small set of covid-related glob patterns
covid_dictionary <- c("covid*", "covid19", "covid-19", "coronavirus*", "lockdown*", "vaccination*")
covid_dictionary <- dictionary(list(covid = covid_dictionary))
# Count how many dictionary terms appear in each article
gu_covid_dfm <- dfm_lookup(gu_dfm, covid_dictionary)
# Convert the counts to proportions of each article's total tokens
gu_covid_dfm_proportions <- gu_covid_dfm / ntoken(gu_dfm)
# Store the proportions as a new column in the original data
gu_out$covid <- as.numeric(gu_covid_dfm_proportions[, 1])
gu_out %>%
ggplot(aes(x = web_publication_date,
y = covid)) +
geom_point() +
theme_bw() +
ylab("Proportion of COVID-19 words") +
xlab("Publication Date")
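The raw scatter plot is quite noisy; averaging the proportions by calendar month can make the trend easier to read. A sketch with synthetic dates and proportions (with the real data you would use gu_out$web_publication_date and gu_out$covid):

```r
set.seed(1)
# Synthetic stand-ins for gu_out$web_publication_date and gu_out$covid
dates <- as.Date("2021-01-01") + sample(0:364, 200, replace = TRUE)
props <- runif(200, min = 0, max = 0.1)
# Average proportion per calendar month
monthly <- tapply(props, format(dates, "%Y-%m"), mean)
round(head(monthly), 3)
```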
Warning: Collecting data from the web (“web scraping”) is usually really annoying. There is no single function that will give you the information you need from a webpage. Instead, you must carefully and painfully write code that will give you what you want. If that sounds OK, then continue on with this problem set. If it doesn’t, stop here, and do something else.
You will need to load the following libraries to complete this part of the assignment (you may need to use install.packages() first):
library(rvest)
library(xml2)
rvest is a nice package which helps you to scrape information from web pages. xml2 is a package which includes functions that make it (somewhat) easier to navigate through html data that is loaded into R.
Throughout this course, the modal structure of a problem set has been that we give you a nice, clean, rectangular data.frame or tibble, which you use for the application of some fancy method. Here, we are going to walk through an example of getting the horrible, messy, and oddly-shaped data from a webpage, and turning it into a data.frame or tibble that is usable.
Since no two websites are the same, web scraping requires you to identify the relevant parts of the html code that lies behind websites. The goal here is to parse the HTML into usable data. Generally speaking, there are three main steps for webscraping: first, read the raw html of the page into R; second, extract the elements of the html that contain the information you need; and third, tidy the extracted information into a usable data.frame or tibble.
We are going to set ourselves a typical data science-type task in which we are going to scrape some data about politicians from their wiki pages. In particular, our task is to establish which universities were most popular amongst the crop of UK MPs who served in the House of Commons between 2017 and 2019. It is often useful to define in advance what the exact goal of the data collection task is. For us, we would like to finish with a data.frame or tibble that consists of one observation for each MP, and two variables: the MP’s name, and where they went to university.
First, we need to know which MPs were in parliament in this period. A bit of googling shows that this wiki page gives us what we need. Scroll down a little, and you will see that there is a table where each row is an MP. It looks like this:
The nice thing about this is that an html table like this should be reasonably easy to work with. We will need to be able to work with the underlying html code of the wiki page in what follows, so you will need to be able to see the source code of the website. If you don’t know how to look at the source code, follow the relevant instructions on this page for the browser that you are using.
When you have figured that out, you should be able to see something that looks a bit like this:
As you can see, html is horrible to look at. In R, we can read in the html code by using the read_html function from the rvest package:
# Read in the raw html code of the wiki page
mps_list_page <- read_html("https://en.wikipedia.org/wiki/List_of_United_Kingdom_MPs_by_seniority_(2017–2019)")
read_html returns an XML document (to check, try running class(mps_list_page)), which makes navigating the different parts of the website (somewhat) easier.
Now that we have the html code in R, we need to find the parts of the webpage that contain the table. Scroll through the source code that you should have open in your browser to see if you can find the parts of the code that contain the table we are interested in.
On line 1154, you should see something like <table class="wikitable collapsible sortable" style="text-align: center; font-size: 85%; line-height: 14px;">. This marks the beginning of the table that we are interested in, and we can ask rvest to extract that table from our mps_list_page object by using the html_elements function.
# Extract table of MPs
mp_table <- html_elements(mps_list_page,
css = "table[class='wikitable collapsible sortable']")
Here, the string we pass to the css argument tells rvest that we would like to grab the table from the object mps_list_page that has the class wikitable collapsible sortable. The object we have created (mp_table) is itself an XML object, which is good, because we will need to navigate through that table to get the information we need.
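If css attribute selectors are new to you, you can try the same extraction logic on a small inline html snippet first (hypothetical markup, purely for illustration):

```r
library(rvest)

# Hypothetical inline html: two tables, only one with the class we want.
# The css attribute selector singles out the matching table.
snippet <- minimal_html('
  <table class="wikitable collapsible sortable"><tr><td>MP rows here</td></tr></table>
  <table class="infobox"><tr><td>Some other table</td></tr></table>')

html_elements(snippet, css = "table[class='wikitable collapsible sortable']")
```

Running this should return a nodeset containing just the first table.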
Now, within that table, we would like to extract two pieces of information for each MP: their name, and the link to their own individual wikipedia page. Looking back at the html source code, you should be able to see that each MP’s entry in the table is contained within its own separate <span> tag, and the information we are after is further nested within an <a> tag. For example, line 1250 includes the following:
Yes, Bottomley is a funny name.
We would like to extract all of these entries from the table, and we can do so by again using html_elements with the appropriate css expression, which here is "span a", because the information we want is included in the a tag which is itself nested within the span tag.
# Extract MP names and urls
mp_table_entries <- html_elements(mp_table, "span a")
mp_table_entries
## {xml_nodeset (655)}
## [1] <a href="/wiki/Kenneth_Clarke" title="Kenneth Clarke">Kenneth Clarke</a>
## [2] <a href="/wiki/Dennis_Skinner" title="Dennis Skinner">Dennis Skinner</a>
## [3] <a href="/wiki/Peter_Bottomley" title="Peter Bottomley">Sir Peter Bottom ...
## [4] <a href="/wiki/Geoffrey_Robinson_(politician)" title="Geoffrey Robinson ...
## [5] <a href="/wiki/Barry_Sheerman" title="Barry Sheerman">Barry Sheerman</a>
## [6] <a href="/wiki/Frank_Field_(British_politician)" class="mw-redirect" tit ...
## [7] <a href="/wiki/Harriet_Harman" title="Harriet Harman">Harriet Harman</a>
## [8] <a href="/wiki/Kevin_Barron" title="Kevin Barron">Sir Kevin Barron</a>
## [9] <a href="/wiki/Edward_Leigh" title="Edward Leigh">Sir Edward Leigh</a>
## [10] <a href="/wiki/Nick_Brown" title="Nick Brown">Nick Brown</a>
## [11] <a href="/wiki/Jeremy_Corbyn" title="Jeremy Corbyn">Jeremy Corbyn</a>
## [12] <a href="/wiki/David_Amess" title="David Amess">Sir David Amess</a>
## [13] <a href="/wiki/Roger_Gale" title="Roger Gale">Sir Roger Gale</a>
## [14] <a href="/wiki/Nicholas_Soames" title="Nicholas Soames">Sir Nicholas Soa ...
## [15] <a href="/wiki/Margaret_Beckett" title="Margaret Beckett">Dame Margaret ...
## [16] <a href="/wiki/Bill_Cash" title="Bill Cash">Sir Bill Cash</a>
## [17] <a href="/wiki/Ann_Clwyd" title="Ann Clwyd">Ann Clwyd</a>
## [18] <a href="/wiki/Patrick_McLoughlin" title="Patrick McLoughlin">Sir Patric ...
## [19] <a href="/wiki/George_Howarth" title="George Howarth">Sir George Howarth ...
## [20] <a href="/wiki/John_Redwood" title="John Redwood">Sir John Redwood</a>
## ...
Finally, now that we have the entry for each MP, it is very simple to extract the name of the MP and the URL to their wikipedia page:
# html_text returns the text between the tags (here, the MPs' names)
mp_names <- html_text(mp_table_entries)
# html_attr returns the attributes of the tags that you have named. Here we
# ask for the "href", which gives us the link to each MP's own wiki page
mp_hrefs <- html_attr(mp_table_entries,
                      name = "href")
# Combine into a tibble. (Note that stringsAsFactors is a data.frame()
# argument, not a tibble() one: tibble() simply stores it as an extra
# logical column, which is why one appears in the output below.)
mps <- tibble(name = mp_names, url = mp_hrefs, university = NA, stringsAsFactors = FALSE)
head(mps)
## # A tibble: 6 × 4
## name url university stringsAsFactors
## <chr> <chr> <lgl> <lgl>
## 1 Kenneth Clarke /wiki/Kenneth_Clarke NA FALSE
## 2 Dennis Skinner /wiki/Dennis_Skinner NA FALSE
## 3 Sir Peter Bottomley /wiki/Peter_Bottomley NA FALSE
## 4 Geoffrey Robinson /wiki/Geoffrey_Robinson_(poli… NA FALSE
## 5 Barry Sheerman /wiki/Barry_Sheerman NA FALSE
## 6 Frank Field /wiki/Frank_Field_(British_po… NA FALSE
OK, OK, so those urls are not quite complete. We need to paste “https://en.wikipedia.org” onto the front of them first. We can do that using the paste0() function:
mps$url <- paste0("https://en.wikipedia.org", mps$url)
head(mps)
## # A tibble: 6 × 4
## name url university stringsAsFactors
## <chr> <chr> <lgl> <lgl>
## 1 Kenneth Clarke https://en.wikipedia.org/wiki… NA FALSE
## 2 Dennis Skinner https://en.wikipedia.org/wiki… NA FALSE
## 3 Sir Peter Bottomley https://en.wikipedia.org/wiki… NA FALSE
## 4 Geoffrey Robinson https://en.wikipedia.org/wiki… NA FALSE
## 5 Barry Sheerman https://en.wikipedia.org/wiki… NA FALSE
## 6 Frank Field https://en.wikipedia.org/wiki… NA FALSE
That’s better. Though, wait, how many observations are there in our data.frame?
dim(mps)
## [1] 655 4
655? But there are only 650 MPs in the House of Commons! Oh, I know why, it’s because some MPs will have left/died/been caught in a scandal and therefore have been replaced…
Are you still here? Well done! We have something! We have…a list of MPs’ names! But we don’t have anything else. In particular, we still do not know where these people went to university. To find that, we have to move on to step 2.
Let’s look at the page for the first MP in our list: https://en.wikipedia.org/wiki/Kenneth_Clarke. Scroll down the page, looking at the panel on the right-hand side. At the bottom of the panel, you will see this:
The bottom line gives Clarke’s alma mater, which in this case is one
of the Cambridge colleges. That is the information we are after. If we
look at the html source code for this page, we can see that the alma
mater line of the panel is enclosed in another <a>
tag:
Now that we know this, we can call in the html using read_html again:
mp_text <- read_html(mps$url[1])
We can then use html_elements and html_text to extract the name of the university. Here we use a somewhat more complicated argument to find the information we are looking for. The xpath argument tells rvest to look for an a tag with a title of "Alma mater", and then to look for the next a tag that comes after the alma mater tag. This is because the name of the university is actually stored in that subsequent a tag.
mp_university <- html_elements(mp_text,
xpath = "//a[@title='Alma mater']/following::a[1]") %>%
html_text()
print(mp_university)
## [1] "Gonville and Caius College, Cambridge"
Regardless of whether you followed that last bit: it works! We now know where Kenneth Clarke went to university. Finally, we can assign the university that he attended to the mps tibble that we created earlier:
mps$university[1] <- mp_university
head(mps)
## # A tibble: 6 × 4
## name url university stringsAsFactors
## <chr> <chr> <chr> <lgl>
## 1 Kenneth Clarke https://en.wikipedia.org/wiki… Gonville … FALSE
## 2 Dennis Skinner https://en.wikipedia.org/wiki… <NA> FALSE
## 3 Sir Peter Bottomley https://en.wikipedia.org/wiki… <NA> FALSE
## 4 Geoffrey Robinson https://en.wikipedia.org/wiki… <NA> FALSE
## 5 Barry Sheerman https://en.wikipedia.org/wiki… <NA> FALSE
## 6 Frank Field https://en.wikipedia.org/wiki… <NA> FALSE
Write a for-loop which iterates over the rows of the data.frame we just constructed and pulls out the relevant information from each MP’s wiki page. You will find very quickly that web-scraping is a messy business, and your loop will probably fail. You might want to use the stop, next, try and if functions to help avoid problems.

A for-loop is pretty easy to set up given the code provided above. We just need to loop over each row of the mps object, read in the html, find the university, and assign it to the relevant cell in the data.frame. E.g.
for(i in 1:nrow(mps)){
cat('.')
mp_text <- read_html(mps$url[i])
mp_university <- html_elements(mp_text,
xpath = "//a[@title='Alma mater']/following::a[1]") %>%
html_text()
mps$university[i] <- mp_university
}
Here, cat('.') is just a piece of convenience code that will print a dot to the console on every iteration of the loop. This helps us to know that R hasn’t crashed and that something is actually happening. It’s also quite satisfying to know that every time a dot appears, you have collected some new data.
However, if you try running that code, you’ll see that it will cut out after a short while with an error.
The main difficulty with this exercise is that there are essentially an infinite number of ways in which data scraping can go wrong. Here, the main problem is that some of the MPs do not have any information recorded in their wiki profiles about the university that they attended. Look at the page for Ronnie Campbell, for example. He never went to university, but he certainly looks like a happy chap.
Because of that, we need to build some code into the loop that says ‘OK, if you can’t find any information about this MP’s university, just code it as NA.’ I’ve added a line that does this to the loop.
for(i in 1:nrow(mps)){
cat(".")
mp_text <- read_html(mps$url[i])
mp_university <- xml_text(xml_find_all(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]"))
if(length(mp_university)==0) mp_university <- NA
mps$university[i] <- mp_university
}
Now the loop runs without breaking! Hooray!
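The NA guard only handles missing infobox entries; a request that fails outright (a timeout, a moved page) will still kill the loop. A hedged sketch of a further robustness step, using a hypothetical helper that is not in the original code: wrap the whole read-and-extract step in tryCatch(), so any error is recorded as NA and the loop moves on.

```r
# Hypothetical helper (not in the original code): returns the university
# name, or NA if the request or the extraction fails for any reason
safe_university <- function(url) {
  tryCatch({
    page <- rvest::read_html(url)
    out <- rvest::html_text(rvest::html_elements(
      page, xpath = "//a[@title='Alma mater']/following::a[1]"))
    if (length(out) == 0) NA_character_ else out[1]
  }, error = function(e) NA_character_)
}

# Inside the loop you would then write, for example:
# mps$university[i] <- safe_university(mps$url[i])
# Sys.sleep(0.5)  # a short pause between requests is also polite
```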
(It is worth noting that this is a very simple example. In the typical web-scraping exercise, you should expect considerably more frustration than you have encountered here. :) Enjoy!)
sort(table(mps$university), decreasing = T)[1]
## London School of Economics
## 17
So, LSE is the most popular university for MPs? That seems…unlikely… And indeed it is. Remember the Kenneth Clarke example: wiki lists the college he attended in Cambridge, not just the university. Maybe lots of MPs went to Cambridge, but they all just went to different colleges? Let’s check:
unique(mps$university[grep("Cambridge",mps$university)])
## [1] "Gonville and Caius College, Cambridge"
## [2] "Trinity College, Cambridge"
## [3] "Clare College, Cambridge"
## [4] "Newnham College, Cambridge"
## [5] "Sidney Sussex College, Cambridge"
## [6] "Pembroke College, Cambridge"
## [7] "Fitzwilliam College, Cambridge"
## [8] "Corpus Christi College, Cambridge"
## [9] "Emmanuel College, Cambridge"
## [10] "Christ's College, Cambridge"
## [11] "St John's College, Cambridge"
## [12] "Jesus College, Cambridge"
## [13] "Magdalene College, Cambridge"
## [14] "Downing College, Cambridge"
## [15] "Robinson College, Cambridge"
## [16] "St Catharine's College,Cambridge"
## [17] "King's College, Cambridge"
## [18] "Girton College, Cambridge"
## [19] "Queens' College, Cambridge"
## [20] "Peterhouse, Cambridge"
## [21] "University of Cambridge"
## [22] "Corpus Christi College,Cambridge"
## [23] "Pembroke College,Cambridge"
## [24] "Trinity Hall, Cambridge"
## [25] "Selwyn College, Cambridge"
Oh dear. Maybe it is the same for Oxford?
unique(mps$university[grep("Oxford",mps$university)])
## [1] "Lincoln College, Oxford"
## [2] "Magdalen College, Oxford"
## [3] "St John's College, Oxford"
## [4] "St Edmund Hall, Oxford"
## [5] "Balliol College, Oxford"
## [6] "University College, Oxford"
## [7] "St Hugh's College, Oxford"
## [8] "Pembroke College, Oxford"
## [9] "New College, Oxford"
## [10] "Oxford Polytechnic"
## [11] "Exeter College, Oxford"
## [12] "Lady Margaret Hall, Oxford"
## [13] "Brasenose College, Oxford"
## [14] "Merton College, Oxford"
## [15] "Somerville College, Oxford"
## [16] "St Hilda's College, Oxford"
## [17] "Corpus Christi College, Oxford"
## [18] "Keble College, Oxford"
## [19] "Jesus College, Oxford"
## [20] "Trinity College, Oxford"
## [21] "Mansfield College, Oxford"
## [22] "St Benet's Hall, Oxford"
## [23] "Christ Church, Oxford"
## [24] "Hertford College, Oxford"
## [25] "Oxford Brookes"
## [26] "University College, University of Oxford"
## [27] "Wadham College, Oxford"
## [28] "St Anne's College, Oxford"
## [29] "Greyfriars, Oxford"
## [30] "Oriel College, Oxford"
## [31] "University of Oxford"
## [32] "St. Hilda's College, Oxford"
## [33] "St Catherine's College, Oxford"
Yup.
Right, so we need to do some recoding. Let’s create a new variable that we can use to simplify the universities coding:
mps$university_new <- mps$university
mps$university_new[grep("Cambridge",mps$university)] <- "Cambridge"
mps$university_new[grep("Oxford",mps$university)] <- "Oxford"
mps$university_new[grep("London School of Economics",mps$university)] <- "LSE"
head(sort(table(mps$university_new), decreasing = T))
##
## Oxford Cambridge LSE
## 85 46 17
## University of Edinburgh University of Hull University of Glasgow
## 12 12 11
Looks like the Oxbridge connection is still pretty strong!
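If you end up with many institutions to collapse, the pattern-by-pattern assignment above can be generalised with a small helper (hypothetical, not from the original code):

```r
# Hypothetical helper: recode any value that matches a pattern to the
# corresponding label, leaving non-matching values unchanged
recode_by_pattern <- function(x, patterns) {
  out <- x
  for (label in names(patterns)) {
    out[grepl(patterns[[label]], x)] <- label
  }
  out
}

unis <- c("Trinity College, Cambridge", "Balliol College, Oxford",
          "London School of Economics", "University of Hull")
recode_by_pattern(unis, c(Cambridge = "Cambridge",
                          Oxford    = "Oxford",
                          LSE       = "London School of Economics"))
```

This returns "Cambridge", "Oxford", "LSE", and "University of Hull" for the toy vector, and the same pattern table can be reused on mps$university.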
Extend the loop above to add more information about each MP to the tibble. Can you scrape the MPs’ party affiliations? Can you scrape their date of birth? Doing so will require you to look carefully at the html source code, and work out the appropriate xpath expression to use. For guidance on xpath, see here.

mps$university <- NA
mps$party <- NA
mps$birthday <- NA
for(i in 1:nrow(mps)){
cat(".")
mp_text <- read_html(mps$url[i])
mp_university <- html_elements(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]") %>%
html_text()
mp_party <- html_elements(mp_text, xpath = "//tr/th[text()='Political party']/following::a[1]") %>%
html_text()
mp_birthday <- html_elements(mp_text, xpath = "//span[@class='bday']") %>%
html_text()
if(length(mp_university)==0) mp_university <- NA
if(length(mp_party)==0) mp_party <- NA
if(length(mp_birthday)==0) mp_birthday <- NA
mps$university[i] <- mp_university
mps$party[i] <- mp_party
mps$birthday[i] <- mp_birthday
}
save(mps, file = "mps_alma_mater.Rdata")
head(mps)
## # A tibble: 6 × 7
## name url university stringsAsFactors university_new party birthday
## <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
## 1 Kenneth Clarke http… Gonville … FALSE Cambridge Cons… 1940-07…
## 2 Dennis Skinner http… Ruskin Co… FALSE Ruskin College Labo… 1932-02…
## 3 Sir Peter Bot… http… Trinity C… FALSE Cambridge Cons… 1944-07…
## 4 Geoffrey Robi… http… Clare Col… FALSE Cambridge Labo… 1938-05…
## 5 Barry Sheerman http… London Sc… FALSE LSE Labo… 1940-08…
## 6 Frank Field http… Universit… FALSE University of… cros… 1942-07…
In my experience, the interfaces for both the APIs themselves and the R packages designed to access those APIs are not terribly stable. This is an indirect way of saying: if this code doesn’t work for you, and you cannot access the API in class today, don’t blame me! Instead, think of it as a lesson in all the potential dangers you might face by using tools like this in your research process. Kevin Munger has a nice cautionary tale about this here.↩︎