12: APIs and Web-scraping

Jack Blumenau

Final assessment

  • Released 6pm today

  • Due 8pm on Monday

  • 2 questions, worth equal weight. You must answer both questions.

  • 750 words per question. Code and tables do not count towards the limit; everything else does.

  • Questions are open-ended: you may use any model(s) or approach(es) that we have discussed on the course.

Ethics and Big Data

A Data Science Pipeline

  1. Data Collection: The process begins with collecting raw data, which can involve scraping the web, querying databases, using APIs, or even manually entering data.

  2. Data Cleaning: This often unsung hero of the process involves preprocessing data to handle missing values, outliers, and incorrect entries. It also includes data transformation and normalization.

  3. Data Analysis: While crucial, this is only a small part of the overall pipeline. Here, statistical techniques, machine learning algorithms, and data visualization methods are employed to generate insights.

  4. Modeling and Validation: Creating predictive or descriptive models based on the analyses, and validating them using techniques such as cross-validation.

  5. Communication and Deployment: Finally, results need to be communicated effectively, often through visualizations or reports, and models or data products need to be deployed for end-users.

In this course we have focused mainly on 3 and 4.

Today we will also speak about 1 (and a bit of 2).

Ethical Concerns and the Use of Big Data

The tools we cover today provide methods for dramatically expanding the amount of information we can analyse. However, the collection of large and diverse data from online sources comes with some significant ethical challenges.

  • Privacy: The collection and use of big data can pose obvious threats to individual privacy.

  • Informed consent: Individuals may not be aware of how their data is being collected and used, or they may not have given consent for their data to be used in certain ways.

  • Bias: The application of machine learning algorithms to large datasets produced in the course of human interactions can encode the biases of those interactions, which can lead models to discriminate against certain groups of people.

  • Ownership: Often the data we wish to collect is proprietary even if it is publicly available, which can lead to conflict over how it is used.

  • Transparency: How we communicate the use of data in machine learning models is complicated by the fact that even researchers often have a weak understanding about why their models predict certain outcomes. This makes it very difficult for people to understand how their data is being used.

We will think about two of these challenges – Bias and Informed Consent – in a little more detail.

Big data and bias

“There is nothing about doing data analysis that is neutral. What and how data is collected, how the data is cleaned and stored, what models are constructed, and what questions are asked – all of this is political.” Danah Boyd, NYU

  1. Computers can acquire existing human biases

    • Particularly problematic for decision-making (hiring, policing, etc.)
  2. Even in very large datasets, there is always proportionally less data available about minorities

    • If the training data reflect existing social biases against a minority, the algorithm is likely to incorporate these biases
    • Statistical patterns that apply to the majority might be invalid within a minority group

Implication: We need to find ways to measure and correct for such biases in our data

Big data and bias: Example

Word-embeddings are an unsupervised learning method for discovering the “meaning” of words inductively from a corpus of texts.

The distributional hypothesis: the meaning of a word can be derived from the distribution of contexts in which it appears.

  • We can learn about the meaning of a word by investigating the distribution of words that show up around the word

    • “You shall know a word by the company it keeps!” J.R. Firth (1957)
    • “The meaning of words lies in their use” Ludwig Wittgenstein (1953)
  • The hypothesis implies that words that appear in similar “contexts” will share similar meanings

  • Word embedding approaches represent the distributional “meaning” of a word as a vector in multidimensional space

  • The basic idea behind word-embedding models is to use the co-occurrence of terms within a corpus to create vectors that encode the meaning of each term.

  • One way of understanding the resulting embeddings is to see which words are “close” to one another in the embedding space.

Word Embedding Overview

  1. The meaning of each word is based on the distribution of terms with which it co-occurs

  2. We represent this meaning using a vector for each word

  3. Vectors are constructed such that similar words are close to each other in “semantic” space

  4. We build this space automatically by seeing which words are close to one another in texts
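A minimal sketch of step 4, assuming a toy corpus of three invented sentences and treating "context" as "the same sentence". Real embedding models typically use a sliding window of a few words and then compress the counts into dense vectors, so this is only a crude count-based illustration in base R. All object names are hypothetical.

```r
# Toy corpus: invented sentences, purely for illustration
toy_texts <- c("the minister praised the nhs",
               "the minister praised the hospital",
               "the committee debated the budget")

# Split each sentence into its unique words
toy_tokens <- lapply(strsplit(toy_texts, " "), unique)
vocab <- sort(unique(unlist(toy_tokens)))

# Sentence-by-word incidence matrix: 1 if the word appears in the sentence
incidence <- t(sapply(toy_tokens, function(words) as.numeric(vocab %in% words)))
colnames(incidence) <- vocab

# Co-occurrence counts: how often two words share a sentence
cooc <- crossprod(incidence)
diag(cooc) <- 0

# Each row is a (very crude) count-based vector for one word
cooc[c("nhs", "hospital", "budget"), ]
```

In this toy corpus "nhs" and "hospital" never appear together, yet they end up with matching rows because they keep the same company.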

Big data and bias: Example

Let’s use a matrix of word-embeddings that I trained on the corpus of parliamentary speeches we have been using:

word_vectors[1:5,1:10]
               [,1]        [,2]        [,3]        [,4]       [,5]        [,6]
house    0.09834925  0.34462858  0.43410388 -0.01537683  0.3328848 -0.49788390
proceeds 0.10935879 -0.69976782 -0.11314722  0.40691536 -0.6123208 -0.04475971
choice   0.21215889  0.54387728 -0.51125106 -0.56793830  0.8246459 -0.27973160
speaker  0.15791494 -0.05892315  0.21089931  0.05878700  0.3328526 -0.16063796
may      0.13679385  0.59354320  0.08695598  0.07544566  0.3619411 -0.33497751
               [,7]       [,8]       [,9]        [,10]
house    -0.1114182 -0.1368300 -0.1828581 -0.375206858
proceeds -0.2696800 -0.0568067  0.3693623 -0.436618352
choice    0.1071844 -0.1436283  0.2803353 -0.004834098
speaker  -0.8303939  0.1638817 -1.0612585  0.300182838
may       0.4564891  0.2179769 -0.6830856  0.382264276

This shows us the first 10 embedding-dimensions (150 total) of the first 5 words in our corpus.

Similarity

  • A key advantage of word embeddings: we can compute the similarity between words (or collections of words)

  • The similarity between two words can be calculated as the cosine of the angle between the embedding vectors:

\[\cos(\theta) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\lVert \mathbf{w}_i \rVert \, \lVert \mathbf{w}_j \rVert}\]

  • We can then sort the words in order of their similarity with the target word and report the “nearest neighbours”
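The cosine formula can be written directly in base R. This is a minimal sketch, assuming hypothetical three-dimensional embeddings with invented values; `cosine_sim` and `toy_vectors` are illustrative names, not part of any package.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 3-dimensional embeddings (invented values)
toy_vectors <- rbind(
  good      = c(0.9, 0.1, 0.0),
  excellent = c(0.8, 0.2, 0.1),
  budget    = c(0.0, 0.3, 0.9)
)

# Similarity of every word with the target word "good"
sims <- apply(toy_vectors, 1, cosine_sim, b = toy_vectors["good", ])

# Sort to report the nearest neighbours
sort(sims, decreasing = TRUE)
```

The target word always has similarity 1 with itself, which is why it appears first in any nearest-neighbour list.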

Similarity Demonstration

library(text2vec)

# Extract target embeddings (indexing by name keeps the rows in this order)
target <- word_vectors[c("excellent", "european", "health"),]

# Calculate cosine similarity
target_sim <- sim2(word_vectors,
                   target)

# Report nearest neighbours
sort(target_sim[,1], decreasing = TRUE)[1:10]
 excellent  fantastic     superb  brilliant       good  wonderful     speech 
 1.0000000  0.7721880  0.7670712  0.7284035  0.6995969  0.6712192  0.6173562 
marvellous    commend impressive 
 0.6027609  0.5988839  0.5959094 
sort(target_sim[,2], decreasing = TRUE)[1:10]
 european        eu     union    europe countries        ec    states      nato 
1.0000000 0.8081603 0.8077907 0.7269197 0.6310429 0.6277406 0.6229872 0.6119536 
       uk    united 
0.6034233 0.6007785 
sort(target_sim[,3], decreasing = TRUE)[1:10]
    health     mental       care   services        nhs    service     social 
 1.0000000  0.7962167  0.7247389  0.7238975  0.6909844  0.6811504  0.6663994 
 education  wellbeing healthcare 
 0.6655127  0.6202950  0.6118153 

Big data and fairness: Example

We can use these similarity measures to test whether embeddings trained on this corpus embed gender bias.

  1. List a set of job titles
career_words <- c("policeman", "cleaner", "surgeon", "politician", 
                  "author", "librarian", "cashier", "waiter", 
                  "waitress", "banker", "doctor", "academic","nurse")
  2. Calculate the similarity between each of these jobs and the words “man” and “woman”
# drop = FALSE keeps each single row as a matrix, which sim2() expects
female_sim <- sim2(word_vectors["woman", , drop = FALSE], word_vectors[career_words,])
male_sim <- sim2(word_vectors["man", , drop = FALSE], word_vectors[career_words,])
  3. See whether each career is closer to “man” or to “woman” according to the embeddings
more_female_careers <- career_words[female_sim > male_sim]
more_male_careers <- career_words[female_sim < male_sim]

Big data and fairness: Example

print(more_male_careers)
[1] "policeman"  "surgeon"    "politician" "waiter"     "banker"    
[6] "doctor"     "academic"  
print(more_female_careers)
[1] "cleaner"   "author"    "librarian" "cashier"   "waitress"  "nurse"    
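The comparison above is binary. One way to see the size of the gap is to take the difference between the two similarities for each job. A minimal sketch, assuming hypothetical two-dimensional embeddings with invented values (these are not the parliamentary embeddings); `gender_score` is an illustrative name.

```r
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical embeddings, invented for illustration only
toy_vectors <- rbind(
  man     = c(1.0, 0.1),
  woman   = c(0.1, 1.0),
  surgeon = c(0.8, 0.3),
  nurse   = c(0.2, 0.9)
)

jobs <- c("surgeon", "nurse")

# Positive score: closer to "woman"; negative score: closer to "man"
gender_score <- sapply(jobs, function(job) {
  cosine_sim(toy_vectors[job, ], toy_vectors["woman", ]) -
    cosine_sim(toy_vectors[job, ], toy_vectors["man", ])
})

sort(gender_score)
```

A score near zero would indicate a job title that sits roughly equally close to both words.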

Big data and fairness: Example

This is a very general problem! Bolukbasi et al. demonstrate the same phenomenon in word-embeddings trained on news stories.

Similar phenomena have been found to apply to race/ethnicity and social classes.

APIs

  • API: Application Programming Interface — a way for two pieces of software to talk to each other

  • Your software can receive (and also send) data automatically through these services

  • Data is sent over http, the same way your browser does it

  • Most services have helper code (known as a wrapper) to construct http requests

  • Both the wrapper and the service itself are called APIs

  • An http-based service is also sometimes known as a REST (REpresentational State Transfer) service

API registration and authentication

  • APIs typically require you to register for an API key to allow access

    • Many are not free, at least for large-scale use
  • Before you commit to using a given API, check what the rate limits are on its use

    • Limits on total number of requests for a given user
    • Limits on the total number of requests in a given day/minute/hour etc
  • Make sure you register with the service in plenty of time to actually get the data!

  • Once registered, you will have access to some kind of key that will allow you to access the API
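Two habits follow from these points: keep the key out of your scripts, and pause between requests so that you stay within the rate limit. A minimal sketch in base R, assuming a hypothetical environment variable called `MY_API_KEY` and a hypothetical helper `polite_requests()`; `request_fun` stands in for whatever function actually calls the API.

```r
# Read the key from an environment variable instead of hard-coding it.
# "MY_API_KEY" is a hypothetical name; it could be set in ~/.Renviron.
api_key <- Sys.getenv("MY_API_KEY", unset = NA)
if (is.na(api_key)) {
  message("No key found: register with the service and set MY_API_KEY")
}

# Run a set of requests, pausing between them to respect a rate limit.
polite_requests <- function(inputs, request_fun, pause_seconds = 1) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    results[i] <- list(request_fun(inputs[[i]]))
    if (i < length(inputs)) Sys.sleep(pause_seconds)
  }
  results
}

# Usage with a dummy function in place of a real API call
polite_requests(c("brexit", "nhs"), toupper, pause_seconds = 0.1)
```

The right pause length depends entirely on the limits published by the service you are using.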

http requests

It is helpful to start paying attention to the structure of basic http requests.

For instance, let’s say we want to get some data from the TheyWorkForYou API.

A test request:

https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX

  • Parameters to the API are encoded in the URL

    • output = Which format do you want returned?
    • search = Return speeches with which words?
    • num = number requested
    • key = access key
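A request URL of this kind can be assembled from its parts. By convention the query string starts with `?` and parameters are separated by `&`. A minimal sketch in base R, where `build_url()` is a hypothetical helper and the key is left as the placeholder `XXXXX`.

```r
# Assemble a request URL from a base endpoint and a named list of parameters
build_url <- function(base, params) {
  # Percent-encode values so that spaces and symbols are safe in a URL
  values <- vapply(params,
                   function(v) URLencode(as.character(v), reserved = TRUE),
                   character(1))
  paste0(base, "?", paste(names(params), values, sep = "=", collapse = "&"))
}

request_url <- build_url(
  "https://www.theyworkforyou.com/api/getDebates",
  list(output = "xml", search = "brexit", num = 1000, key = "XXXXX")
)
request_url
```

Encoding matters as soon as a search term contains a space or punctuation, which a hand-typed URL would not handle.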

API Output

  • The output of an API will typically not be in csv or Rdata format

  • Often, though not always, it will be in either JSON or XML

    • XML: eXtensible Markup Language

    • JSON : JavaScript Object Notation

  • If you have a choice, you probably want JSON

  • Both types of file are easily read into R

  • jsonlite and xml2 are the relevant packages
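A minimal sketch of reading JSON into R with the jsonlite package. The response text below is invented for illustration and does not come from any real API; the field names are hypothetical.

```r
library(jsonlite)

# A hypothetical API response, invented for illustration
response_text <- '{
  "results": [
    {"speaker": "A. Member", "date": "2023-07-01", "words": 120},
    {"speaker": "B. Member", "date": "2023-07-02", "words": 85}
  ]
}'

# fromJSON() simplifies an array of records into a data frame
parsed <- fromJSON(response_text)
speeches <- parsed$results

str(speeches)
```

Real responses are usually more deeply nested, so expect to spend some time working out which element holds the records you need.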

API packages

  • It’s not usually necessary to construct these kinds of requests yourself

  • R, Python, and other programming languages have libraries to make it easier – but you have to find them!

  • I have provided a sample of APIs that have associated R packages on the next slide

  • The documentation for the API will describe the parameters that are available, though normally in a way that is intensely frustrating.

Sample of APIs

There are many existing R packages that make it straightforward to retrieve data from an API:

API                       R package                          Description
Twitter                   install.packages("rtweet")         Twitter, small-scale use
Guardian Newspaper        install.packages("guardianapi")    Full Guardian archive, 1999-present
Wikipedia                 install.packages("WikipediR")      Wikipedia data and knowledge graph
TheyWorkForYou            install.packages("twfy")           Speeches from the UK House of Commons and Lords
ProPublica Congress API   install.packages("ProPublicaR")    Data from the US Congress

Warning: I have not tested all of these!

API demonstration

Last Year’s Demonstration

library(academictwitteR)

my_api_key <- "YOUR_API_KEY_GOES_HERE"

mp_tweets <- get_all_tweets(
                          # Twitter usernames
                          user = mps$username,
                          # Start date of collected tweets
                          start_tweets = "2022-01-01T00:00:00Z",
                          # End date of collected tweets
                          end_tweets = "2023-02-02T00:00:00Z",
                          # Name of file to save all the tweets in
                          file = "mp_tweets",
                          # Name of folder to save all the json files in
                          data_path = "data/",
                          # Your API key
                          bearer_token = my_api_key,
                          # Maximum number of tweets to be fetched
                          n = 1000000
                          )

Last Year’s Demonstration

Why not use the Twitter API this year?

Guardian API

Instead, we will use the Guardian newspaper API to search for articles about cricket and, specifically, the Ashes.

Tammy Beaumont

Ben Stokes

Guardian API – Registration

Guardian API – Authentication

library(guardianapi)

gu_api_key()
Please enter your API key and press enter: <my_key>
Updating gu.API.key session variable...

Guardian API

Guardian API Application

cricket <- gu_content("ashes", 
                     from_date = "2023-07-01", 
                     to_date = "2023-07-27",
                     production_office = "UK")

save(cricket, file = "cricket.Rdata")

Guardian API Application

glimpse(cricket)
Rows: 233
Columns: 44
$ id                               <chr> "sport/2023/jul/26/why-ashes-the-burn…
$ type                             <chr> "article", "article", "article", "art…
$ section_id                       <chr> "sport", "sport", "sport", "sport", "…
$ section_name                     <chr> "Sport", "Sport", "Sport", "Sport", "…
$ web_publication_date             <dttm> 2023-07-26 11:28:50, 2023-07-16 13:4…
$ web_title                        <chr> "The Spin | Why Ashes? The burning is…
$ web_url                          <chr> "https://www.theguardian.com/sport/20…
$ api_url                          <chr> "https://content.guardianapis.com/spo…
$ tags                             <list> [<data.frame[16 x 15]>], [<data.fram…
$ is_hosted                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ pillar_id                        <chr> "pillar/sport", "pillar/sport", "pill…
$ pillar_name                      <chr> "Sport", "Sport", "Sport", "Sport", "…
$ headline                         <chr> "Why Ashes? The burning issue of ‘obi…
$ standfirst                       <chr> "Reginald Shirley Brooks’s wry dig at…
$ trail_text                       <chr> "Reginald Shirley Brooks’ wry dig at …
$ byline                           <chr> "James Wallace", "Ali Martin", "Geoff…
$ main                             <chr> "<figure class=\"element element-imag…
$ body                             <chr> "<p>A few weeks ago I was walking dow…
$ newspaper_page_number            <chr> "31", "39", "34", "35", "43", "36", N…
$ wordcount                        <chr> "1161", "413", "760", "855", "744", "…
$ comment_close_date               <dttm> 2023-07-29 11:28:50, NA, 2023-07-15 …
$ commentable                      <chr> "true", NA, "true", NA, NA, NA, NA, N…
$ first_publication_date           <dttm> 2023-07-26 11:28:50, 2023-07-16 13:4…
$ is_inappropriate_for_sponsorship <chr> "false", "false", "false", "false", "…
$ is_premoderated                  <chr> "false", "false", "true", "false", "f…
$ last_modified                    <chr> "2023-07-26T13:35:07Z", "2023-07-16T2…
$ newspaper_edition_date           <date> 2023-07-27, 2023-07-17, 2023-07-13, …
$ production_office                <chr> "UK", "UK", "UK", "UK", "UK", "UK", "…
$ publication                      <chr> "The Guardian", "The Guardian", "The …
$ short_url                        <chr> "https://www.theguardian.com/p/zgbt6"…
$ should_hide_adverts              <chr> "false", "false", "false", "false", "…
$ show_in_related_content          <chr> "true", "true", "true", "true", "true…
$ thumbnail                        <chr> "https://media.guim.co.uk/d61056494eb…
$ legally_sensitive                <chr> "false", "false", "false", "false", "…
$ lang                             <chr> "en", "en", "en", "en", "en", "en", "…
$ is_live                          <chr> "true", "true", "true", "true", "true…
$ body_text                        <chr> "A few weeks ago I was walking down t…
$ char_count                       <chr> "6735", "2369", "4419", "4783", "4009…
$ should_hide_reader_revenue       <chr> "false", "false", "false", "false", "…
$ show_affiliate_links             <chr> "false", "false", "false", "false", "…
$ byline_html                      <chr> "<a href=\"profile/james-wallace\">Ja…
$ show_table_of_contents           <chr> "false", "false", "false", "false", "…
$ live_blogging_now                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sensitive                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Guardian API Application

table(cricket$section_id)

australia-news       business  commentisfree        culture    environment 
            29              5              4              1              2 
      football   lifeandstyle          media          music           news 
             3              4              1              3              2 
      politics          sport     technology    theobserver         travel 
             4            153              1              1              1 
  tv-and-radio        us-news          world 
             7              1             11 

Guardian API Application

cricket2 <- gu_content('"ashes" AND "cricket"', 
                     from_date = "2023-07-01", 
                     to_date = "2023-07-27",
                     production_office = "UK")

save(cricket2, file = "cricket2.Rdata")

Guardian API Application

table(cricket2$section_id)

sport 
  153 
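An alternative to refining the query is to keep the broad search and filter the returned data frame afterwards. The two approaches need not return the same articles. A minimal sketch, assuming a toy data frame that reuses the `section_id` and `web_title` column names from the output above; the rows are invented.

```r
# Toy stand-in for the data frame returned by the API (invented rows)
toy_results <- data.frame(
  section_id = c("sport", "music", "sport", "politics"),
  web_title  = c("Ashes: day one", "Ashes to Ashes at 40",
                 "Ashes: day two", "PM comments on the Ashes"),
  stringsAsFactors = FALSE
)

# Keep only the articles published in the sport section
sport_only <- toy_results[toy_results$section_id == "sport", ]

table(sport_only$section_id)
```

Filtering after the fact keeps the raw download intact, which is useful if you later decide the other sections are of interest.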

England Women

england_women <- 
  c("Heather Knight",
    "Tammy Beaumont",
    "Maia Bouchier",
    "Katherine Brunt",
    "Kate Cross",
    "Freya Davies",
    "Charlie Dean",
    "Sophia Dunkley",
    "Sophie Ecclestone",
    "Tash Farrant",
    "Sarah Glenn",
    "Amy Jones",
    "Nat Sciver",
    "Anya Shrubsole",
    "Mady Villiers",
    "Lauren Winfield-Hill",
    "Danni Wyatt")

England Men

england_men <- 
    c("Ben Stokes", 
      "Moeen Ali", 
      "James Anderson", 
      "Jonny Bairstow", 
      "Stuart Broad", 
      "Harry Brook", 
      "Zak Crawley", 
      "Ben Duckett", 
      "Dan Lawrence", 
      "Ollie Robinson", 
      "Joe Root", 
      "Josh Tongue", 
      "Chris Woakes", 
      "Mark Wood")
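With lists of names like these, one natural step is to count how many texts mention each player. A minimal sketch in base R, assuming a toy character vector of invented sentences and simple fixed-string matching; `toy_texts` and `count_mentions()` are hypothetical names.

```r
# Invented article snippets, for illustration only
toy_texts <- c("Ben Stokes and Joe Root batted well.",
               "Tammy Beaumont made a double century.",
               "Ben Stokes won the toss.")

# Count the number of texts that mention each name at least once
count_mentions <- function(players, texts) {
  vapply(players,
         function(p) sum(grepl(p, texts, fixed = TRUE)),
         integer(1))
}

mentions <- count_mentions(c("Ben Stokes", "Joe Root", "Tammy Beaumont"),
                           toy_texts)
mentions
```

Exact matching of full names will miss references by surname only, so counts of this kind are best read as a lower bound.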

Guardian API Application

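library(quanteda)  # provides corpus(), tokens() and kwic()

# Build a corpus from the article text, then tokenise it
cricket_tokens <- cricket2 %>%
                corpus(text_field = "body_text") %>%
                tokens(remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_url = TRUE)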

Guardian API Application

kwic(cricket_tokens, phrase("Tammy Beaumont"))
                                      pre        keyword
1                harder and easier to hit Tammy Beaumont
2            a five-day Test during which Tammy Beaumont
3                 Wyatt slid to her right Tammy Beaumont
4          off has been effective England Tammy Beaumont
5             a flurry of boundaries from Tammy Beaumont
6             had lost Sophia Dunkley and Tammy Beaumont
7                 in and turned away from Tammy Beaumont
8                the field and out stride Tammy Beaumont
9             taking the pace off England Tammy Beaumont
10                 her third ball Just as Tammy Beaumont
11 squad for England The double-centurion Tammy Beaumont
12                  is that King snare of Tammy Beaumont
13                    it on a good length Tammy Beaumont
14                13 Beaumont 50 FIFTY to Tammy Beaumont
15                 four of the innings to Tammy Beaumont
16   Georgia Wareham Megan Schutt England Tammy Beaumont
17                 such as Ben Stokes and Tammy Beaumont
18                over the next half hour Tammy Beaumont
19                 of the match Lose like Tammy Beaumont
                                  post
1              and Alice Capsey did so
2      became the first English player
3               slid to her left Danni
4  Sophia Dunkley Heather Knight Alice
5            and Capsey reaching 84 in
6            in the opening four overs
7                    to hit the top of
8          and Sophia Dunkley to start
9    Sophia Dunkley Heather Knight Nat
10               was saying on TV that
11             isn’t part of the group
12    You’re welcome 25th over England
13            was stripped of a couple
14                    A gem of a knock
15         she saunters down and lofts
16 Sophia Dunkley Alice Capsey Heather
17           The finishes in the first
18         looks gorgeous in a vibrant
19         unable to wipe the chuckles

Guardian API Application

kwic(cricket_tokens, phrase("Ben Stokes"))
                                              pre    keyword
1                      concerns are not shared by Ben Stokes
2                     delivered 75 crucial runs 8 Ben Stokes
3                   Ashes series Mark Wood sought Ben Stokes
4               game was reopened England captain Ben Stokes
5               hypocrisy Stuart Broad who joined Ben Stokes
6                             him in the second 6 Ben Stokes
7               rain not intervened in Manchester Ben Stokes
8                       the second Test at Lord’s Ben Stokes
9                       Moeen Ali Harry Brook and Ben Stokes
10                          in quite the same way Ben Stokes
11                    going to make some memories Ben Stokes
12                           You do not mess with Ben Stokes
13                Stokes upholds coin toss status Ben Stokes
14                         as sheer as El Capitan Ben Stokes
15             tempered by some watchfulness from Ben Stokes
16                                                Ben Stokes
17                      days Most recently it was Ben Stokes
18                       like a yeti McCullum and Ben Stokes
19                this Ashes series remains alive Ben Stokes
20                      regain the Ashes was said Ben Stokes
21                         with two Tests to play Ben Stokes
22                      for the fourth Ashes Test Ben Stokes
23                  it’s a tactical decision from Ben Stokes
24                     a nice attacking move from Ben Stokes
25         particularly threatening I don’t think Ben Stokes
26                     series the next highest is Ben Stokes
27                     2016 Here come the players Ben Stokes
28                off in triumph nonetheless with Ben Stokes
29                 of that wicket were delightful Ben Stokes
30                glove Joel Wilson disagrees and Ben Stokes
31                       Ali Joe Root Harry Brook Ben Stokes
32                       Simon Burnton who was at Ben Stokes
33                      of winning back the Ashes Ben Stokes
34                      one was coming And here’s Ben Stokes
35                    Tests as captain of England Ben Stokes
36             here fabulously though they batted Ben Stokes
37                        chop for the Oval where Ben Stokes
38                         it shows just how much Ben Stokes
39                         357-4 Brook 6 Stokes 6 Ben Stokes
40                         a clue anymore I think Ben Stokes
41               who wouldn’t enjoy working under Ben Stokes
42                         to buy one wicket when Ben Stokes
43                     Travis Head who copied the Ben Stokes
44                         out and no review from Ben Stokes
45                 57-1 Khawaja 25 Labuschagne 30 Ben Stokes
46                   make things happen more than Ben Stokes
47                         2-0 Khawaja 2 Warner 0 Ben Stokes
48                      definitely give it to him Ben Stokes
49              Stokes 74 Robinson 5 Tenth-wicket Ben Stokes
50                     a Roberto Carlos free kick Ben Stokes
51                        that if you had offered Ben Stokes
52                       not to denigrate him Had Ben Stokes
53                        hang about to chat with Ben Stokes
54                  Brook Joe Root Jonny Bairstow Ben Stokes
55                         a good footing Here is Ben Stokes
56                   was batting with his captain Ben Stokes
57                             as well he said of Ben Stokes
58                        in the last series with Ben Stokes
59                       33 making up the quartet Ben Stokes
60             Room at Lords Thistlewaite Tweeted Ben Stokes
61                        said The PM agrees with Ben Stokes
62                   efforts from figures such as Ben Stokes
63  Bairstow’s controversial dismissal and during Ben Stokes
64           of superheroes absolutely gutted for Ben Stokes
65                    soon The debate will rumble Ben Stokes
66                    Feels about eight years ago Ben Stokes
67                   the off-side Lord’s rises to Ben Stokes
68                      carnage from the blade of Ben Stokes
69                        scenes in the Long Room Ben Stokes
70                            10 Just a single to Ben Stokes
71                 swivel-pull off his hip brings Ben Stokes
72                   a short pitched barrage from Ben Stokes
73                        is coming off the field Ben Stokes
74                      Sri Lanka game with gusto Ben Stokes
75                 have another day between Tests Ben Stokes
76                 will learn from their mistakes Ben Stokes
77                            a man on the ground Ben Stokes
78                 crescendo again On the balcony Ben Stokes
79                    Australia off with the ball Ben Stokes
80                        135-4 Brook 28 Stokes 4 Ben Stokes
81                              a win will do for Ben Stokes
82                   fast bowling Ben Duckett and Ben Stokes
83                        223-5 Green 15 Carey 10 Ben Stokes
84                 those triumphs of the unlikely Ben Stokes
85                 doing an amazing impression of Ben Stokes
86                    admire the England team and Ben Stokes
87                        home crowd back to life Ben Stokes
88                   Stuart Broad plus the mighty Ben Stokes
89                         the side can’t rely on Ben Stokes
90                           and not just rely on Ben Stokes
91                 Australia’s coach I think when Ben Stokes
92                        still on a rolling boil Ben Stokes
93                         cast back to 2019 when Ben Stokes
94                         came out In the middle Ben Stokes
95                    target of 251 being reached Ben Stokes
96                      Ian Botham one fewer than Ben Stokes
97                    the past year playing under Ben Stokes
98             of their aggressive approach under Ben Stokes
99                       escape of which to speak Ben Stokes
100                       of the third Ashes Test Ben Stokes
101                retained the fierce loyalty of Ben Stokes
102                         a massive game for us Ben Stokes
103                     is full of admiration how Ben Stokes
104                         Ben Duckett on 50 and Ben Stokes
105                        hit a high of offering Ben Stokes
106                       the Oval next week then Ben Stokes
107                       as such Harry Brook and Ben Stokes
108                      chapter had a burst from Ben Stokes
109                          the shock of the old Ben Stokes
110           call it leadership cooperation When Ben Stokes
111                     innings for the ages from Ben Stokes
112                    and preparing to bowl when Ben Stokes
113            and their guests England’s batters Ben Stokes
114           the dismissals Bairstow chopping on Ben Stokes
115                    have had some dreams after Ben Stokes
116             mud dredging and trudging towards Ben Stokes
117               who suffered from depression in Ben Stokes
118           individual thinkers give some magic Ben Stokes
119                    and cheered and sang while Ben Stokes
120                     England were 193 for five Ben Stokes
121                 likely to continue Either way Ben Stokes
122                     final day and the captain Ben Stokes
123                         looks as nailed on as Ben Stokes
                                        post
1          and Brendon McCullum however with
2                      Only 13 in the second
3                      to ask if the England
4          was perhaps understandably in the
5                at the crease once Bairstow
6   His second‑innings 155 was extraordinary
7                  says his primary focus is
8                  was quite bullish he said
9        all produced mature innings England
10                  had spent the day moving
11          wrote before the start Hopefully
12                           for he is a man
13             won another toss remaining on
14        hero of Headingley starts unbeaten
15                   and Harry Brook late on
16            has promised that England will
17                     It is two years since
18             could be forgiven for wishing
19                   and his players now sit
20                 the perfect place for his
21           and Brendon McCullum have opted
22                c Moeen Ali Jimmy Anderson
23             Anderson starts with a maiden
24                   Moeen starts with a few
25                 will wait too long before
26                 on 543 2nd over Australia
27                      has been doing a bit
28                   who takes more pride in
29       charged towards Bairstow and seemed
30               always so considered in his
31          Jonny Bairstow Chris Woakes Mark
32     press conference yesterday Full steam
33                        said it had been a
34       speaking from underneath his bucket
35             finally has to concede defeat
36            went from making a declaration
37           and Brendon McCullum could well
38          side have changed the parameters
39                    gets off the mark with
40                at Headingly in 2019 might
41            and Brendon McCullum 31st over
42                   was trying to hit every
43              template of batting with the
44                 It looked close though it
45                    goes back to Mark Wood
46               muses Matt Dony His ability
47                     is on the field Ollie
48         the greatest English cricketer of
49            partnership at Headingley on a
50                   to the middle once more
51                an Australian score of 263
52               or Keith Miller played that
53                      at all That’s a huge
54              Moeen Ali Chris Woakes Ollie
55          being typically upbeat about the
56                      up the other end and
57                He didn’t contrast this to
58                  and this one was another
59                        32 may yet bowl as
60            the England captain said after
61                He said he simply wouldn’t
62           and Tammy Beaumont The finishes
63            incredible innings What a game
64                        I defy you to find
65    joins Mike Atherton Having experienced
66                     has his say Here come
67               He looks truly gutted right
68            Two ridiculously big hits sail
69                 is cheered to the rafters
70                 a nudge into the leg-side
71                  his 29th Test Fifty Warm
72              England side that would have
73                is wandering out and Tanya
74             is Arjuna Ranatunga in spirit
75              amongst others looked on his
76                    is showing them how to
77                looks exhausted But he has
78                   head down in bucket hat
79                      is on strike This is
80                   is roared to the crease
81     side They’ve spurned opportunities at
82                      took the game into a
83                    is having a bowl after
84                    is a couple of stories
85               here at Headingley He clubs
86                    is an utter legend But
87    then played another jaw-dropping knock
88             in the all-rounder stakes has
89                    all the time after the
90               was dealing with a strained
91                  is there you’re never in
92              provided a template by using
93          carved out the second Headingley
94             playing cricket from the gods
95                   said it was specific to
96                   Marsh is playing in his
97          and Brendon McCullum as probably
98              Stokes is under no illusions
99             hero of Headingley warrior of
100      endured physical pain and crippling
101        and Brendon McCullum despite some
102            insisted But no longer really
103     and Brendon McCullum have cultivated
104                    on 29 Of course we’re
105                  the threat he so craves
106            and his Bazballers need mercy
107  registered half-centuries but found the
108         and Harry Brook before Australia
109               talks a lot about feelings
110                finally makes an error in
111                 who scored nine sixes on
112                 was ordered to stand him
113          and Stuart Broad were applauded
114                gloving down the leg side
115          pulled off his last-day miracle
116                   in the middle He burns
117         and how previous England regimes
118                has been worth the bother
119          whistled sixes into their midst
120                     was at one end Jonny
121             and his England players left
122      batting with Jonny Bairstow Cameron
123              leading England to glory at

Guardian API Application

How many times does each player feature in the Guardian news corpus we just collected?

# Combine lists of players
england_players <- c(england_women, england_men)

# Code gender of players
genders <- c(rep("Women", length(england_women)),
             rep("Men", length(england_men)))

# Set-up data.frame for storage
out <- data.frame(player = england_players, 
                  gender = genders, 
                  n_mentions = NA)

# Loop over players and count the keyword-in-context matches for each one
for(i in seq_len(nrow(out))){

  out$n_mentions[i] <- nrow(kwic(cricket_tokens, phrase(out$player[i])))

}
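
To try the same counting pattern without the Guardian corpus, here is a minimal sketch in base R. The sentences in `toy_texts` are made up for illustration, and `count_mentions()` is a hypothetical helper that counts exact substring matches, a simpler stand-in for the token-based `kwic()` matching used above.

```r
# Made-up sentences standing in for the news corpus (not real articles)
toy_texts <- c("Jonny Bairstow is on strike. Jonny Bairstow gets off the mark.",
               "Tammy Beaumont wins the toss.",
               "Chris Woakes and Jonny Bairstow rebuild the innings.")

toy_players <- c("Jonny Bairstow", "Chris Woakes", "Tammy Beaumont")

# Hypothetical helper: count every exact occurrence of a phrase across all texts
count_mentions <- function(phrase, texts) {
  matches <- gregexpr(phrase, texts, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so only positive positions are counted
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

# Same storage layout as above: one row per player
toy_out <- data.frame(player = toy_players,
                      n_mentions = vapply(toy_players, count_mentions,
                                          integer(1), texts = toy_texts,
                                          USE.NAMES = FALSE),
                      stringsAsFactors = FALSE)

toy_out
```

Here `vapply()` plays the role of the for loop: it applies the counting function to each player in turn and returns one integer per player.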

Break


If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com

Web-scraping

Web scraping overview

Key steps in any web-scraping project:

  1. Work out how the website is structured

  2. Work out how links connect different pages

  3. Isolate the information you care about on each page

  4. Write a loop which applies step 3 to every page found in step 2, and saves the information you want from each page

  5. Put it all into a nice and tidy data.frame

  6. Feel like a superhero

(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
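
As a rough sketch of how steps 1 to 5 fit together, the code below scrapes a made-up "website" that is held in memory, so it runs without an internet connection. Everything in `toy_site` (paths, class names, text) is hypothetical and is not the structure of any real site.

```r
library(rvest)

# A made-up three-page site: one index page and two profile pages
toy_site <- list(
  "/index" = '<ul><li><a class="person" href="/anna">Anna</a></li><li><a class="person" href="/ben">Ben</a></li></ul>',
  "/anna"  = '<h1>Anna</h1><p class="bio">Studies elections.</p>',
  "/ben"   = '<h1>Ben</h1><p class="bio">Studies parliaments.</p>'
)

# Steps 1 and 2: the index page holds one link per person
index <- read_html(toy_site[["/index"]])
links <- index %>% html_elements("a.person")

# Step 5 (set up in advance): a data.frame with one row per linked page
toy_df <- data.frame(name = html_text(links),
                     url  = html_attr(links, "href"),
                     bio  = NA_character_,
                     stringsAsFactors = FALSE)

# Steps 3 and 4: visit each page and keep only the paragraph we care about
for (i in seq_len(nrow(toy_df))) {
  page <- read_html(toy_site[[toy_df$url[i]]])
  toy_df$bio[i] <- page %>% html_element("p.bio") %>% html_text()
}

toy_df
```

With a real site, the lookups in `toy_site` would be replaced by `read_html()` calls on full URLs.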

Web-scraping Demonstration

  • We will scrape the research interests of members of faculty in the Department of Political Science at UCL

  • The departmental website has a list of faculty members

  • Each member of the department has a unique page

  • The research interests of the faculty member are stored on their unique page

  • Let’s look at an example…

Source code

  • To collect the information we want, we need to see how it is stored within the HTML code that underpins the website

  • Webpages include much more than what is immediately visible to visitors

  • Crucially, they include code which provides structure, style and functionality (which your browser interprets)

    • HTML provides structure
    • CSS provides style
    • JavaScript provides functionality
  • To implement a web-scraper, we have to work directly with the source code

    • Identifying the information on each page that we want to extract
    • Identifying links between pages that help us to navigate the site programmatically

To see the source code, use Ctrl + U or right click and select View/Show Page Source

Load initial page

We can read the source code of any website into R using the readLines() function.

library(tidyverse)

spp_home <- "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff"

spp_html <- readLines(spp_home)
spp_html[1:20]
 [1] "<!DOCTYPE html>"                                                                                                                                      
 [2] "<!--[if IE 7]>"                                                                                                                                       
 [3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"                                                                                        
 [4] "<!--[if IE 8]>"                                                                                                                                       
 [5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"                                                                                               
 [6] "<!--[if gt IE 8]><!-->"                                                                                                                               
 [7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"                                                                                                  
 [8] "<head>"                                                                                                                                               
 [9] "  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"                                                                        
[10] "  <meta name=\"author\" content=\"UCL\"/>"                                                                                                            
[11] "  <meta property=\"og:profile_id\" content=\"uclofficial\"/>"                                                                                         
[12] "  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"                                                                          
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                              
[15] "<meta name=\"ucl:faculty\" content=\"Social &amp; Historical Sciences\" />"                                                                           
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"                                                                       
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"                                                                 
[18] "<meta property=\"og:type\" content=\"website\" />"                                                                                                    
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"                                                                    
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                         
spp_html[grep("Professor Ben", spp_html)[1]]
[1] "<li><a href=\"/political-science/people/academic-teaching-and-research-staff/professor-benjamin-lauderdale\" class=\"nav-item\">Professor Benjamin Lauderdale</a></li>"

This is helpful, but it is awkward to navigate the source code directly.
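
To see why this is awkward, here is a sketch of what extracting information from the raw line involves using only base R. The string is copied from the output above; the regular expressions are one possible approach and would break if the markup changed even slightly.

```r
# The raw line returned by grep() above
raw_line <- "<li><a href=\"/political-science/people/academic-teaching-and-research-staff/professor-benjamin-lauderdale\" class=\"nav-item\">Professor Benjamin Lauderdale</a></li>"

# Pull out the value of the href attribute with a regular expression
href <- sub('.*href="([^"]*)".*', "\\1", raw_line)

# Pull out the visible link text by deleting everything inside angle brackets
link_text <- gsub("<[^>]+>", "", raw_line)

href
link_text
```

Patterns like these have to be rewritten for every new piece of information, which is the motivation for parsing the HTML properly.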

Parse HTML

The read_html function in the rvest package allows us to read the HTML in a more structured format:

library(rvest)

spp <- read_html(spp_home)

spp
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve names for each member of faculty

We can then navigate through the HTML by searching for elements that share a common feature, such as the same tag and class (using html_elements()):

spp_faculty_elements <- spp %>% html_elements("a[class='nav-item']") 

head(spp_faculty_elements)
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...

The name of each faculty member is stored in the text associated with the corresponding element:

[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-helen-brown-coverdale\" class=\"nav-item\">Dr Helen Brown Coverdale</a>"

We can extract these names using the html_text() function:

spp_faculty_names <- spp_faculty_elements %>% html_text()
head(spp_faculty_names)
[1] "Andrew Scott"         "Bugra Susler"         "Dr Adam Harris"      
[4] "Dr Alexandra Hartman" "Dr Amanda Hall"       "Dr Aparna Ravi"      
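
The same selection logic can be tried on a small HTML fragment defined directly in R. The fragment below is made up to mimic the staff list (the names, paths and the `footer-link` class are hypothetical); it shows that the selector keeps only the links with the `nav-item` class.

```r
library(rvest)

# A made-up fragment with two staff links and one unrelated link
toy_list <- read_html('<ul>
    <li><a href="/people/dr-a-example" class="nav-item">Dr A Example</a></li>
    <li><a href="/people/dr-b-example" class="nav-item">Dr B Example</a></li>
    <li><a href="/about" class="footer-link">About</a></li>
  </ul>')

# Select <a> elements whose class attribute is exactly "nav-item"
toy_links <- toy_list %>% html_elements("a[class='nav-item']")

# The visible text and the href attribute of each selected element
toy_names <- toy_links %>% html_text()
toy_hrefs <- toy_links %>% html_attr("href")

toy_names
toy_hrefs
```

The selector `a[class='nav-item']` requires the class attribute to equal "nav-item" exactly, whereas the shorter form `a.nav-item` would also match elements that carry additional classes.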

Retrieve URL for each member of faculty

The URL for each faculty member is stored in the href attribute of the elements:

spp_faculty_elements[22]
{xml_nodeset (1)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href") 

head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      
# paste0() joins strings together
# Note: the hrefs already start with "/", so the joined URLs contain a double slash
spp_urls <- paste0("https://www.ucl.ac.uk/", spp_urls)

head(spp_urls)
[1] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      
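
If you would rather avoid the double slash in these URLs, one option is `url_absolute()` from the xml2 package (which rvest builds on). This is a sketch of an alternative, not the approach used in these slides; the two hrefs are copied from the output above.

```r
library(xml2)

# Two of the relative links extracted above
hrefs <- c("/political-science/people/academic-teaching-and-research-staff/dr-adam-harris",
           "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall")

# Resolve each relative link against the base URL of the site
clean_urls <- url_absolute(hrefs, "https://www.ucl.ac.uk/")

clean_urls
```

Dropping the trailing slash from the base, as in `paste0("https://www.ucl.ac.uk", hrefs)`, gives the same result for links that begin with "/".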

Storage

# Note: this reuses the name spp, replacing the parsed HTML with a data.frame
spp <- data.frame(name = spp_faculty_names, 
                  url = spp_urls, 
                  text = NA)

head(spp)
                  name
1         Andrew Scott
2         Bugra Susler
3       Dr Adam Harris
4 Dr Alexandra Hartman
5       Dr Amanda Hall
6       Dr Aparna Ravi
                                                                                                        url
1      https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2      https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3       https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5       https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6       https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
  text
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA

Retrieve unique page for each faculty member

jack_page <- read_html(spp$url[spp$name == "Dr Jack Blumenau"]) 
jack_page
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve unique page for each faculty member

jack_text <- jack_page %>%
    html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
    html_text() %>%
    paste0(collapse = " ")

jack_text
[1] "I am an Associate Professor of Political Science and Quantitative Research Methods at University College London and I am the programme director for the MSc Data Science and Public Policy. I received my PhD from Government Department of the London School of Economics in 2016. I am currently a member of the UK Cabinet Office’s Trial Advice Panel and was previously a Data Science Advisor to YouGov. My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion.  At UCL, I have taught a series of quantitative methods modules to our (excellent) undergraduate and postgraduate students. These include PUBL0055 – Introduction to Quantitative Methods; PUBL050 – Advanced Quantitative Methods; POLS0012 – Causal Analysis. This year I will also be teaching PUBL0099 – Quantitative Text Analysis for Social Science.I am also programme director for the new MSc degree in Data Science and Public Policy which is a joint degree programme delivered between UCL Departments of Political Science and Economics.I supervise PhD students working in the areas of political behaviour and quantitative methods.  Tweets by uclspp"
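
The XPath expression above reads as "every p element that has a dl element with class accordion somewhere before it in the document". The sketch below applies it to a made-up page (all content hypothetical) to show which paragraphs it keeps.

```r
library(rvest)

# A made-up profile page: one paragraph before the accordion, two after it
toy_profile <- read_html('<body>
    <p>Site navigation text</p>
    <dl class="accordion"><dt>Contact</dt><dd>Room 1</dd></dl>
    <p>My research is about elections.</p>
    <p>I teach quantitative methods.</p>
  </body>')

# Keep only the paragraphs that come after the accordion, then join them
toy_text <- toy_profile %>%
  html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
  html_text() %>%
  paste0(collapse = " ")

toy_text
```

The first paragraph is dropped because no accordion precedes it. On the real pages this is also why trailing text such as "Tweets by uclspp" is included: it sits in a paragraph after the accordion.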

We have the text for one person! How do we get this for all faculty members?

for loops

  • A for loop is a control structure in programming that allows repeating a set of operations multiple times.
  • It works by iterating over a sequence of elements (such as a vector or a list) and executing a block of code for each element in the sequence.
  • In R, the syntax for a for loop is as follows:
for (variable in sequence) {
  # code to be executed for each element in the sequence
}
  • Example:
for (i in 1:10) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
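
In practice, we usually want to store a result from each iteration rather than print it. A minimal sketch of that pattern (the `words` vector is made up for illustration):

```r
words <- c("api", "scrape", "loop")

# Create an empty container with one slot per element before the loop starts
n_chars <- rep(NA_integer_, length(words))

# seq_along(words) gives 1, 2, 3: one index per element of words
for (i in seq_along(words)) {
  n_chars[i] <- nchar(words[i])
}

n_chars
```

Using `seq_along()` (or `seq_len()` for a number of rows) is slightly safer than `1:length(words)`: if `words` were empty, `1:0` would still run the loop twice, over 1 and 0.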

for loops

We can use a for loop to loop over the elements of our url variable

for(i in seq_len(nrow(spp))){

  # Load page for faculty member i
  faculty_member_page <- read_html(spp$url[i]) 
  
  # Extract text from that page
  faculty_member_text <- faculty_member_page %>%
                            html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
                            html_text() %>%
                            paste0(collapse = " ")
  
  # Save text for faculty member i
  spp$text[i] <- faculty_member_text
  
}
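
A loop like this stops completely if a single page fails to load, and it sends requests as fast as R can make them. The sketch below is one possible way to make it more forgiving; `scrape_profile()` and its `fetch` and `pause` arguments are hypothetical names introduced here, and the toy pages exist only so that the example runs offline.

```r
library(rvest)

# Hypothetical helper: returns the profile text, or NA if the page cannot be read
scrape_profile <- function(url, fetch = read_html, pause = 1) {
  Sys.sleep(pause)  # pause between requests so we do not overload the server
  tryCatch({
    fetch(url) %>%
      html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
      html_text() %>%
      paste0(collapse = " ")
  }, error = function(e) NA_character_)
}

# Offline demonstration with a made-up page and a stand-in for read_html()
toy_pages <- list(good = '<dl class="accordion"><dt>Contact</dt></dl><p>Research text.</p>')

toy_fetch <- function(u) {
  if (!u %in% names(toy_pages)) stop("page not found: ", u)
  read_html(toy_pages[[u]])
}

ok_text     <- scrape_profile("good", fetch = toy_fetch, pause = 0)
failed_text <- scrape_profile("missing", fetch = toy_fetch, pause = 0)

# With the real data the loop body would become:
# spp$text[i] <- scrape_profile(spp$url[i])
```

Pages that fail are recorded as `NA`, so they can be inspected or retried after the loop has finished.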

Output

spp[spp$name == "Dr Jack Blumenau",]
               name
27 Dr Jack Blumenau
                                                                                                     url
27 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-jack-blumenau
text
27 I am an Associate Professor of Political Science and Quantitative Research Methods at University College London and I am the programme director for the MSc Data Science and Public Policy. I received my PhD from Government Department of the London School of Economics in 2016. I am currently a member of the UK Cabinet Office’s Trial Advice Panel and was previously a Data Science Advisor to YouGov. My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion.  At UCL, I have taught a series of quantitative methods modules to our (excellent) undergraduate and postgraduate students. These include PUBL0055 – Introduction to Quantitative Methods; PUBL050 – Advanced Quantitative Methods; POLS0012 – Causal Analysis. This year I will also be teaching PUBL0099 – Quantitative Text Analysis for Social Science.I am also programme director for the new MSc degree in Data Science and Public Policy which is a joint degree programme delivered between UCL Departments of Political Science and Economics.I supervise PhD students working in the areas of political behaviour and quantitative methods.  Tweets by uclspp
spp[39,]
             name
39 Dr Lucy Barnes
                                                                                                   url
39 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-lucy-barnes
text
39 Dr Barnes' research focuses on the politics of economic policymaking in rich, western democracies (including the UK). She is currently working on the UKRI-funded Mental Models in Political Economy project, which seeks to understand how various types of people understand the economy. In other projects she is interested in examining the interface between political science and political philosophy, the politics of inequality and redistribution, of fiscal policy and government budgets, and the politics of taxation. Dr Barnes' research focuses on the politics of economic policymaking in rich, western democracies (including the UK). She is currently working on the UKRI-funded Mental Models and Political Economy project, which seeks to understand how various types of people understand the economy. In other projects she is interested in examining the interface between political science and political philosophy, the politics of inequality and redistribution, of fiscal policy and government budgets, and the politics of taxation.    Tweets by uclspp

Topic Model for Departmental Research Interests

  • What should we do with this data?

  • We can use it to estimate a topic model!

  • Two questions in this application:

    • What are the topics that feature in the staff research profiles?
    • Which staff members are most highly associated with each topic?

What next?

Update your CV

You could all now legitimately add something like this to your CV:

Training in data science and machine learning, including experience with: data manipulation and visualisation; supervised and unsupervised learning methods; linear and logistic regression; classification methods; non-linear methods (local regression, splines, GAMs); tree-based methods (bagging, random forests); unsupervised learning methods (k-means, principal components analysis); quantitative text analysis (dictionaries, supervised learning for text, topic models); web-scraping.

Further study

  1. Machine Learning and Causality

    • How can we use these tools to (help) make causal statements?
  2. Machine Learning Theory

    • What is really going on here?
  3. Measurement

    • What does it mean to have a good measure of \(X\) or \(Y\)? Which tools are available to us for measuring our concepts of interest?
  4. Advanced Text Analysis

    • Beyond topic models!

Further study

There are some very good MSc courses near here that offer a lot of this material!

Please feel free to email me if you are interested in the UCL course.

Thank you!