Released 6pm today
Due 8pm on Monday
2 questions, worth equal weight. You must answer both questions.
750 words per question, not including code or tables. Does include everything else.
Questions are open-ended: you may use any model(s) or approach(es) that we have discussed on the course.
1. Data Collection: The process begins with collecting raw data, which can involve scraping the web, querying databases, using APIs, or even manually entering data.
2. Data Cleaning: This often unsung hero of the process involves preprocessing data to handle missing values, outliers, and incorrect entries. It also includes data transformation and normalization.
3. Data Analysis: While crucial, this is only a small part of the overall pipeline. Here, statistical techniques, machine learning algorithms, and data visualization methods are employed to generate insights.
4. Modeling and Validation: Creating predictive or descriptive models based on the analyses, and validating them using techniques such as cross-validation.
5. Communication and Deployment: Finally, results need to be communicated effectively, often through visualizations or reports, and models or data products need to be deployed for end-users.
In this course we have focused mainly on 3 and 4.
Today we will also speak about 1 (and a bit of 2).
The tools we cover today provide methods for dramatically expanding the amount of information we can analyse. However, the collection of large and diverse data from online sources comes with some significant ethical challenges.
Privacy: The collection and use of big data can pose obvious threats to individual privacy.
Informed consent: Individuals may not be aware of how their data is being collected and used, or they may not have given consent for their data to be used in certain ways.
Bias: The application of machine learning algorithms to large datasets produced in the course of human interactions can encode the biases of those interactions, which can lead models to discriminate against certain groups of people.
Ownership: Often the data we wish to collect is proprietary even if it is publicly available, which can lead to conflict over how it is used.
Transparency: How we communicate the use of data in machine learning models is complicated by the fact that even researchers often have a weak understanding about why their models predict certain outcomes. This makes it very difficult for people to understand how their data is being used.
We will think about two of these challenges – Bias and Informed Consent – in a little more detail.
“There is nothing about doing data analysis that is neutral. What and how data is collected, how the data is cleaned and stored, what models are constructed, and what questions are asked – all of this is political.” Danah Boyd, NYU
Computers can learn to acquire existing human biases
Even in very large datasets, there is always proportionally less data available about minorities
Implication: We need to find ways to measure and correct for such biases in our data
Word-embeddings are an unsupervised learning method for discovering the “meaning” of words inductively from a corpus of texts.
The distributional hypothesis: the meaning of a word can be derived from the distribution of contexts in which it appears.
We can learn about the meaning of a word by investigating the distribution of words that show up around the word
The hypothesis implies that words that appear in similar “contexts” will share similar meanings
Word embedding approaches represent the distributional “meaning” of a word as a vector in multidimensional space
The basic idea behind word-embedding models is to use the co-occurrence of terms within a corpus to create vectors that encode the meaning of each term.
One way of understanding the resulting embeddings is to see which words are “close” to one another in the embedding space.
The meaning of each word is based on the distribution of terms with which it co-occurs
We represent this meaning using a vector for each word
Vectors are constructed such that similar words are close to each other in “semantic” space
We build this space automatically by seeing which words are close to one another in texts
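A minimal sketch of the counting step behind this idea, assuming a made-up three-sentence corpus and a window of two words on each side (both are illustrative choices, not the settings used for the embeddings in this lecture). Embedding models then turn this kind of co-occurrence information into dense vectors.

```r
# Toy corpus, invented purely for illustration
texts <- c(
  "the minister praised the nhs",
  "the minister praised the schools",
  "the house debated the nhs"
)

# Split each text into words
tokens <- strsplit(texts, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Square matrix of counts: rows are target words, columns are context words
cooc <- matrix(0, nrow = length(vocab), ncol = length(vocab),
               dimnames = list(vocab, vocab))

# Count every word that appears within `window` positions of each target word
window <- 2
for (doc in tokens) {
  for (i in seq_along(doc)) {
    neighbours <- setdiff(max(1, i - window):min(length(doc), i + window), i)
    for (j in neighbours) {
      cooc[doc[i], doc[j]] <- cooc[doc[i], doc[j]] + 1
    }
  }
}

# "nhs" and "schools" never co-occur, but they share context words
cooc[c("nhs", "schools"), ]
```

The two rows printed at the end overlap on "praised" and "the", which is the sense in which words that appear in similar contexts end up with similar representations.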
Let’s use a matrix of word-embeddings that I trained on the corpus of parliamentary speeches we have been using:
[,1] [,2] [,3] [,4] [,5] [,6]
house 0.09834925 0.34462858 0.43410388 -0.01537683 0.3328848 -0.49788390
proceeds 0.10935879 -0.69976782 -0.11314722 0.40691536 -0.6123208 -0.04475971
choice 0.21215889 0.54387728 -0.51125106 -0.56793830 0.8246459 -0.27973160
speaker 0.15791494 -0.05892315 0.21089931 0.05878700 0.3328526 -0.16063796
may 0.13679385 0.59354320 0.08695598 0.07544566 0.3619411 -0.33497751
[,7] [,8] [,9] [,10]
house -0.1114182 -0.1368300 -0.1828581 -0.375206858
proceeds -0.2696800 -0.0568067 0.3693623 -0.436618352
choice 0.1071844 -0.1436283 0.2803353 -0.004834098
speaker -0.8303939 0.1638817 -1.0612585 0.300182838
may 0.4564891 0.2179769 -0.6830856 0.382264276
This shows us the first 10 embedding-dimensions (150 total) of the first 5 words in our corpus.
A key advantage of word embeddings: we can compute the similarity between words (or collections of words)
The similarity between two words can be calculated as the cosine of the angle between the embedding vectors:
\[\cos(\theta) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\lVert \mathbf{w}_i \rVert \, \lVert \mathbf{w}_j \rVert}\]
excellent fantastic superb brilliant good wonderful speech
1.0000000 0.7721880 0.7670712 0.7284035 0.6995969 0.6712192 0.6173562
marvellous commend impressive
0.6027609 0.5988839 0.5959094
european eu union europe countries ec states nato
1.0000000 0.8081603 0.8077907 0.7269197 0.6310429 0.6277406 0.6229872 0.6119536
uk united
0.6034233 0.6007785
health mental care services nhs service social
1.0000000 0.7962167 0.7247389 0.7238975 0.6909844 0.6811504 0.6663994
education wellbeing healthcare
0.6655127 0.6202950 0.6118153
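Similarity rankings of this kind can be sketched in a few lines of base R. The example below is a sketch under stated assumptions: `cosine_sim()` and `most_similar()` are hypothetical helper names, and `toy_embeddings` is a made-up three-dimensional matrix, not the 150-dimensional embeddings trained on the parliamentary corpus.

```r
# Cosine of the angle between two vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embedding matrix: rows are words, columns are embedding dimensions
toy_embeddings <- rbind(
  good      = c(0.9, 0.1, 0.0),
  excellent = c(0.8, 0.2, 0.1),
  tax       = c(0.0, 0.1, 0.9)
)

# Similarity of every word to a target word, most similar first
most_similar <- function(embeddings, word) {
  target <- embeddings[word, ]
  sims <- apply(embeddings, 1, cosine_sim, b = target)
  sort(sims, decreasing = TRUE)
}

similar_to_good <- most_similar(toy_embeddings, "good")
similar_to_good
```

The target word always comes first with a similarity of 1, which is why each block of output above starts with the query word itself.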
We can use these similarity measures to test whether embeddings trained on this corpus embed gender bias.
[1] "policeman" "surgeon" "politician" "waiter" "banker"
[6] "doctor" "academic"
[1] "cleaner" "author" "librarian" "cashier" "waitress" "nurse"
This is a very general problem! Bolukbasi et al. demonstrate the same phenomenon in word-embeddings trained on news stories.
Similar phenomena have been found to apply to race/ethnicity and social classes.
One of the cornerstones of conducting ethically sound social science research involves the informed consent of participants, obtained through advising them about the study in which they are invited to partake, its possible risks, but also benefits, and the study’s projected outcomes. The use of informed consent is important because it allows participants to make a choice and signals their willing participation.
Questions about whether people are fully informed about the use of their data are particularly relevant to the use of social media data.
Research design
Measure the “emotional state” of Facebook newsfeed posts
Randomly assign Facebook users to three conditions
Measure the “emotional state” of those users’ subsequent newsfeed posts
If those in the treatment conditions have different emotional states than those in the control condition \(\rightarrow\) evidence of emotional “contagion”
Measurement: How did the researchers measure positive and negative emotional states?
“The study was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.” Kramer et al, PNAS, 2014
“The collection of the data by Facebook may have involved practices that were not fully consistent with the principles of obtaining informed consent and allowing participants to opt out.” Editor-in-Chief, PNAS, 2015
Are EULAs (End-User License Agreements) too complex to allow “informed consent”?
No, they used a poorly-designed text measure to detect tiny differences in word use.
But the ethical point stands!
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent by http, the same way your browser does it
Most services have helper code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (REpresentational State Transfer)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
http requests
It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou API.
A test request:
https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX
Parameters to the API are encoded in the URL
output = Which format do you want returned?
search = Return speeches with which words?
num = number requested
key = access key
The output of an API will typically not be in csv or Rdata format
Often, though not always, it will be in either JSON or XML
XML: eXtensible Markup Language
JSON: JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R
jsonlite and xml2 are the relevant packages
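As a rough illustration of reading JSON into R, the sketch below parses a small response with `fromJSON()` from the jsonlite package. The response text and its field names (`results`, `id`, `headline`) are made up for the example and do not come from any particular API.

```r
library(jsonlite)

# Made-up response text; the field names are hypothetical
json_response <- '{
  "results": [
    {"id": "a1", "headline": "First article"},
    {"id": "b2", "headline": "Second article"}
  ]
}'

# fromJSON() converts an array of similar objects into a data.frame
parsed <- fromJSON(json_response)
articles <- parsed$results
str(articles)
```

Because the parsed result is already a data.frame, it can go straight into the usual analysis workflow.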
It’s not usually necessary to construct these kinds of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available, though normally in a way that is intensely frustrating.
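When no wrapper exists, a request URL can be assembled by hand from the documented parameters. The sketch below assumes standard query-string syntax (a `?` before the first parameter and `&` between the rest); `build_request_url()` is a hypothetical helper written for this example, not part of any package.

```r
# Build a request URL from a base address and a named list of parameters
build_request_url <- function(base_url, params) {
  # Percent-encode each value so spaces and symbols are safe in a URL
  encoded <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0(base_url, "?", query)
}

request_url <- build_request_url(
  "https://www.theyworkforyou.com/api/getDebates",
  list(output = "xml", search = "brexit", num = 1000, key = "XXXXX")
)
request_url
```

The key `"XXXXX"` is a placeholder, as in the test request earlier; a real key comes from registering with the service.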
There are many existing R packages that make it straightforward to retrieve data from an API:
API | R package | Description |
---|---|---|
Twitter | install.packages("rtweet") | Twitter, small-scale use |
Guardian Newspaper | install.packages("guardianapi") | Full Guardian archive, 1999-present |
Wikipedia | install.packages("WikipediR") | Wikipedia data and knowledge graph |
TheyWorkForYou | install.packages("twfy") | Speeches from the UK House of Commons and Lords |
ProPublica Congress API | install.packages("ProPublicaR") | Data from the US Congress |
Warning: I have not tested all of these!
library(academictwitteR)
my_api_key <- "YOUR_API_KEY_GOES_HERE"
mp_tweets <- get_all_tweets(
  # Twitter usernames
  user = mps$username,
  # Start date of collected tweets
  start_tweets = "2022-01-01T00:00:00Z",
  # End date of collected tweets
  end_tweets = "2023-02-02T00:00:00Z",
  # Name of file to save all the tweets in
  file = "mp_tweets",
  # Name of folder to save all the json files in
  data_path = "data/",
  # Your API key
  bearer_token = my_api_key,
  # Maximum number of tweets to be fetched
  n = 1000000
)
Why not use the Twitter API this year?
Instead, we will use the Guardian newspaper API to search for articles about cricket and, specifically, the Ashes.
Please enter your API key and press enter: <my_key>
Updating gu.API.key session variable...
Rows: 233
Columns: 44
$ id <chr> "sport/2023/jul/26/why-ashes-the-burn…
$ type <chr> "article", "article", "article", "art…
$ section_id <chr> "sport", "sport", "sport", "sport", "…
$ section_name <chr> "Sport", "Sport", "Sport", "Sport", "…
$ web_publication_date <dttm> 2023-07-26 11:28:50, 2023-07-16 13:4…
$ web_title <chr> "The Spin | Why Ashes? The burning is…
$ web_url <chr> "https://www.theguardian.com/sport/20…
$ api_url <chr> "https://content.guardianapis.com/spo…
$ tags <list> [<data.frame[16 x 15]>], [<data.fram…
$ is_hosted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ pillar_id <chr> "pillar/sport", "pillar/sport", "pill…
$ pillar_name <chr> "Sport", "Sport", "Sport", "Sport", "…
$ headline <chr> "Why Ashes? The burning issue of ‘obi…
$ standfirst <chr> "Reginald Shirley Brooks’s wry dig at…
$ trail_text <chr> "Reginald Shirley Brooks’ wry dig at …
$ byline <chr> "James Wallace", "Ali Martin", "Geoff…
$ main <chr> "<figure class=\"element element-imag…
$ body <chr> "<p>A few weeks ago I was walking dow…
$ newspaper_page_number <chr> "31", "39", "34", "35", "43", "36", N…
$ wordcount <chr> "1161", "413", "760", "855", "744", "…
$ comment_close_date <dttm> 2023-07-29 11:28:50, NA, 2023-07-15 …
$ commentable <chr> "true", NA, "true", NA, NA, NA, NA, N…
$ first_publication_date <dttm> 2023-07-26 11:28:50, 2023-07-16 13:4…
$ is_inappropriate_for_sponsorship <chr> "false", "false", "false", "false", "…
$ is_premoderated <chr> "false", "false", "true", "false", "f…
$ last_modified <chr> "2023-07-26T13:35:07Z", "2023-07-16T2…
$ newspaper_edition_date <date> 2023-07-27, 2023-07-17, 2023-07-13, …
$ production_office <chr> "UK", "UK", "UK", "UK", "UK", "UK", "…
$ publication <chr> "The Guardian", "The Guardian", "The …
$ short_url <chr> "https://www.theguardian.com/p/zgbt6"…
$ should_hide_adverts <chr> "false", "false", "false", "false", "…
$ show_in_related_content <chr> "true", "true", "true", "true", "true…
$ thumbnail <chr> "https://media.guim.co.uk/d61056494eb…
$ legally_sensitive <chr> "false", "false", "false", "false", "…
$ lang <chr> "en", "en", "en", "en", "en", "en", "…
$ is_live <chr> "true", "true", "true", "true", "true…
$ body_text <chr> "A few weeks ago I was walking down t…
$ char_count <chr> "6735", "2369", "4419", "4783", "4009…
$ should_hide_reader_revenue <chr> "false", "false", "false", "false", "…
$ show_affiliate_links <chr> "false", "false", "false", "false", "…
$ byline_html <chr> "<a href=\"profile/james-wallace\">Ja…
$ show_table_of_contents <chr> "false", "false", "false", "false", "…
$ live_blogging_now <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sensitive <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
england_women <-
  c("Heather Knight",
    "Tammy Beaumont",
    "Maia Bouchier",
    "Katherine Brunt",
    "Kate Cross",
    "Freya Davies",
    "Charlie Dean",
    "Sophia Dunkley",
    "Sophie Ecclestone",
    "Tash Farrant",
    "Sarah Glenn",
    "Amy Jones",
    "Nat Sciver",
    "Anya Shrubsole",
    "Mady Villiers",
    "Lauren Winfield-Hill",
    "Danni Wyatt")
pre keyword
1 harder and easier to hit Tammy Beaumont
2 a five-day Test during which Tammy Beaumont
3 Wyatt slid to her right Tammy Beaumont
4 off has been effective England Tammy Beaumont
5 a flurry of boundaries from Tammy Beaumont
6 had lost Sophia Dunkley and Tammy Beaumont
7 in and turned away from Tammy Beaumont
8 the field and out stride Tammy Beaumont
9 taking the pace off England Tammy Beaumont
10 her third ball Just as Tammy Beaumont
11 squad for England The double-centurion Tammy Beaumont
12 is that King snare of Tammy Beaumont
13 it on a good length Tammy Beaumont
14 13 Beaumont 50 FIFTY to Tammy Beaumont
15 four of the innings to Tammy Beaumont
16 Georgia Wareham Megan Schutt England Tammy Beaumont
17 such as Ben Stokes and Tammy Beaumont
18 over the next half hour Tammy Beaumont
19 of the match Lose like Tammy Beaumont
post
1 and Alice Capsey did so
2 became the first English player
3 slid to her left Danni
4 Sophia Dunkley Heather Knight Alice
5 and Capsey reaching 84 in
6 in the opening four overs
7 to hit the top of
8 and Sophia Dunkley to start
9 Sophia Dunkley Heather Knight Nat
10 was saying on TV that
11 isn’t part of the group
12 You’re welcome 25th over England
13 was stripped of a couple
14 A gem of a knock
15 she saunters down and lofts
16 Sophia Dunkley Alice Capsey Heather
17 The finishes in the first
18 looks gorgeous in a vibrant
19 unable to wipe the chuckles
pre keyword
1 concerns are not shared by Ben Stokes
2 delivered 75 crucial runs 8 Ben Stokes
3 Ashes series Mark Wood sought Ben Stokes
4 game was reopened England captain Ben Stokes
5 hypocrisy Stuart Broad who joined Ben Stokes
6 him in the second 6 Ben Stokes
7 rain not intervened in Manchester Ben Stokes
8 the second Test at Lord’s Ben Stokes
9 Moeen Ali Harry Brook and Ben Stokes
10 in quite the same way Ben Stokes
11 going to make some memories Ben Stokes
12 You do not mess with Ben Stokes
13 Stokes upholds coin toss status Ben Stokes
14 as sheer as El Capitan Ben Stokes
15 tempered by some watchfulness from Ben Stokes
16 Ben Stokes
17 days Most recently it was Ben Stokes
18 like a yeti McCullum and Ben Stokes
19 this Ashes series remains alive Ben Stokes
20 regain the Ashes was said Ben Stokes
21 with two Tests to play Ben Stokes
22 for the fourth Ashes Test Ben Stokes
23 it’s a tactical decision from Ben Stokes
24 a nice attacking move from Ben Stokes
25 particularly threatening I don’t think Ben Stokes
26 series the next highest is Ben Stokes
27 2016 Here come the players Ben Stokes
28 off in triumph nonetheless with Ben Stokes
29 of that wicket were delightful Ben Stokes
30 glove Joel Wilson disagrees and Ben Stokes
31 Ali Joe Root Harry Brook Ben Stokes
32 Simon Burnton who was at Ben Stokes
33 of winning back the Ashes Ben Stokes
34 one was coming And here’s Ben Stokes
35 Tests as captain of England Ben Stokes
36 here fabulously though they batted Ben Stokes
37 chop for the Oval where Ben Stokes
38 it shows just how much Ben Stokes
39 357-4 Brook 6 Stokes 6 Ben Stokes
40 a clue anymore I think Ben Stokes
41 who wouldn’t enjoy working under Ben Stokes
42 to buy one wicket when Ben Stokes
43 Travis Head who copied the Ben Stokes
44 out and no review from Ben Stokes
45 57-1 Khawaja 25 Labuschagne 30 Ben Stokes
46 make things happen more than Ben Stokes
47 2-0 Khawaja 2 Warner 0 Ben Stokes
48 definitely give it to him Ben Stokes
49 Stokes 74 Robinson 5 Tenth-wicket Ben Stokes
50 a Roberto Carlos free kick Ben Stokes
51 that if you had offered Ben Stokes
52 not to denigrate him Had Ben Stokes
53 hang about to chat with Ben Stokes
54 Brook Joe Root Jonny Bairstow Ben Stokes
55 a good footing Here is Ben Stokes
56 was batting with his captain Ben Stokes
57 as well he said of Ben Stokes
58 in the last series with Ben Stokes
59 33 making up the quartet Ben Stokes
60 Room at Lords Thistlewaite Tweeted Ben Stokes
61 said The PM agrees with Ben Stokes
62 efforts from figures such as Ben Stokes
63 Bairstow’s controversial dismissal and during Ben Stokes
64 of superheroes absolutely gutted for Ben Stokes
65 soon The debate will rumble Ben Stokes
66 Feels about eight years ago Ben Stokes
67 the off-side Lord’s rises to Ben Stokes
68 carnage from the blade of Ben Stokes
69 scenes in the Long Room Ben Stokes
70 10 Just a single to Ben Stokes
71 swivel-pull off his hip brings Ben Stokes
72 a short pitched barrage from Ben Stokes
73 is coming off the field Ben Stokes
74 Sri Lanka game with gusto Ben Stokes
75 have another day between Tests Ben Stokes
76 will learn from their mistakes Ben Stokes
77 a man on the ground Ben Stokes
78 crescendo again On the balcony Ben Stokes
79 Australia off with the ball Ben Stokes
80 135-4 Brook 28 Stokes 4 Ben Stokes
81 a win will do for Ben Stokes
82 fast bowling Ben Duckett and Ben Stokes
83 223-5 Green 15 Carey 10 Ben Stokes
84 those triumphs of the unlikely Ben Stokes
85 doing an amazing impression of Ben Stokes
86 admire the England team and Ben Stokes
87 home crowd back to life Ben Stokes
88 Stuart Broad plus the mighty Ben Stokes
89 the side can’t rely on Ben Stokes
90 and not just rely on Ben Stokes
91 Australia’s coach I think when Ben Stokes
92 still on a rolling boil Ben Stokes
93 cast back to 2019 when Ben Stokes
94 came out In the middle Ben Stokes
95 target of 251 being reached Ben Stokes
96 Ian Botham one fewer than Ben Stokes
97 the past year playing under Ben Stokes
98 of their aggressive approach under Ben Stokes
99 escape of which to speak Ben Stokes
100 of the third Ashes Test Ben Stokes
101 retained the fierce loyalty of Ben Stokes
102 a massive game for us Ben Stokes
103 is full of admiration how Ben Stokes
104 Ben Duckett on 50 and Ben Stokes
105 hit a high of offering Ben Stokes
106 the Oval next week then Ben Stokes
107 as such Harry Brook and Ben Stokes
108 chapter had a burst from Ben Stokes
109 the shock of the old Ben Stokes
110 call it leadership cooperation When Ben Stokes
111 innings for the ages from Ben Stokes
112 and preparing to bowl when Ben Stokes
113 and their guests England’s batters Ben Stokes
114 the dismissals Bairstow chopping on Ben Stokes
115 have had some dreams after Ben Stokes
116 mud dredging and trudging towards Ben Stokes
117 who suffered from depression in Ben Stokes
118 individual thinkers give some magic Ben Stokes
119 and cheered and sang while Ben Stokes
120 England were 193 for five Ben Stokes
121 likely to continue Either way Ben Stokes
122 final day and the captain Ben Stokes
123 looks as nailed on as Ben Stokes
post
1 and Brendon McCullum however with
2 Only 13 in the second
3 to ask if the England
4 was perhaps understandably in the
5 at the crease once Bairstow
6 His second‑innings 155 was extraordinary
7 says his primary focus is
8 was quite bullish he said
9 all produced mature innings England
10 had spent the day moving
11 wrote before the start Hopefully
12 for he is a man
13 won another toss remaining on
14 hero of Headingley starts unbeaten
15 and Harry Brook late on
16 has promised that England will
17 It is two years since
18 could be forgiven for wishing
19 and his players now sit
20 the perfect place for his
21 and Brendon McCullum have opted
22 c Moeen Ali Jimmy Anderson
23 Anderson starts with a maiden
24 Moeen starts with a few
25 will wait too long before
26 on 543 2nd over Australia
27 has been doing a bit
28 who takes more pride in
29 charged towards Bairstow and seemed
30 always so considered in his
31 Jonny Bairstow Chris Woakes Mark
32 press conference yesterday Full steam
33 said it had been a
34 speaking from underneath his bucket
35 finally has to concede defeat
36 went from making a declaration
37 and Brendon McCullum could well
38 side have changed the parameters
39 gets off the mark with
40 at Headingly in 2019 might
41 and Brendon McCullum 31st over
42 was trying to hit every
43 template of batting with the
44 It looked close though it
45 goes back to Mark Wood
46 muses Matt Dony His ability
47 is on the field Ollie
48 the greatest English cricketer of
49 partnership at Headingley on a
50 to the middle once more
51 an Australian score of 263
52 or Keith Miller played that
53 at all That’s a huge
54 Moeen Ali Chris Woakes Ollie
55 being typically upbeat about the
56 up the other end and
57 He didn’t contrast this to
58 and this one was another
59 32 may yet bowl as
60 the England captain said after
61 He said he simply wouldn’t
62 and Tammy Beaumont The finishes
63 incredible innings What a game
64 I defy you to find
65 joins Mike Atherton Having experienced
66 has his say Here come
67 He looks truly gutted right
68 Two ridiculously big hits sail
69 is cheered to the rafters
70 a nudge into the leg-side
71 his 29th Test Fifty Warm
72 England side that would have
73 is wandering out and Tanya
74 is Arjuna Ranatunga in spirit
75 amongst others looked on his
76 is showing them how to
77 looks exhausted But he has
78 head down in bucket hat
79 is on strike This is
80 is roared to the crease
81 side They’ve spurned opportunities at
82 took the game into a
83 is having a bowl after
84 is a couple of stories
85 here at Headingley He clubs
86 is an utter legend But
87 then played another jaw-dropping knock
88 in the all-rounder stakes has
89 all the time after the
90 was dealing with a strained
91 is there you’re never in
92 provided a template by using
93 carved out the second Headingley
94 playing cricket from the gods
95 said it was specific to
96 Marsh is playing in his
97 and Brendon McCullum as probably
98 Stokes is under no illusions
99 hero of Headingley warrior of
100 endured physical pain and crippling
101 and Brendon McCullum despite some
102 insisted But no longer really
103 and Brendon McCullum have cultivated
104 on 29 Of course we’re
105 the threat he so craves
106 and his Bazballers need mercy
107 registered half-centuries but found the
108 and Harry Brook before Australia
109 talks a lot about feelings
110 finally makes an error in
111 who scored nine sixes on
112 was ordered to stand him
113 and Stuart Broad were applauded
114 gloving down the leg side
115 pulled off his last-day miracle
116 in the middle He burns
117 and how previous England regimes
118 has been worth the bother
119 whistled sixes into their midst
120 was at one end Jonny
121 and his England players left
122 batting with Jonny Bairstow Cameron
123 leading England to glory at
How many times does each player feature in the Guardian news corpus we just collected?
# kwic() and phrase() come from the quanteda package
# Combine lists of players
england_players <- c(england_women, england_men)
# Code gender of players
genders <- c(rep("Women", length(england_women)),
             rep("Men", length(england_men)))
# Set-up data.frame for storage
out <- data.frame(player = england_players,
                  gender = genders,
                  n_mentions = NA)
# Loop over players and count number of mentions
for(i in 1:nrow(out)){
  # Each row of the kwic() result is one mention of the player's name
  out$n_mentions[i] <- nrow(kwic(cricket_tokens, phrase(england_players[i])))
}
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
1. Work out how the website is structured
2. Work out how links connect different pages
3. Isolate the information you care about on each page
4. Write a loop which connects 3 to 2, and saves the information you want from each page
5. Put it all into a nice and tidy data.frame
6. Feel like a superhero
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g.,
It gathers data that is under copyright, has privacy restrictions, or is used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions specified by the host website (often specified in a robots.txt file)
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
HTML provides structure
CSS provides style
JavaScript provides functionality
To implement a web-scraper, we have to work directly with the source code
To see the source code, use Ctrl + U or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
[1] "<!DOCTYPE html>"
[2] "<!--[if IE 7]>"
[3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"
[4] "<!--[if IE 8]>"
[5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"
[6] "<!--[if gt IE 8]><!-->"
[7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"
[8] "<head>"
[9] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"
[10] " <meta name=\"author\" content=\"UCL\"/>"
[11] " <meta property=\"og:profile_id\" content=\"uclofficial\"/>"
[12] " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
[15] "<meta name=\"ucl:faculty\" content=\"Social & Historical Sciences\" />"
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"
[18] "<meta property=\"og:type\" content=\"website\" />"
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
We can then navigate through the HTML by searching for elements that have features in common (using html_elements()):
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
The names of each faculty member are stored in the text associated with these elements:
[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-helen-brown-coverdale\" class=\"nav-item\">Dr Helen Brown Coverdale</a>"
The URL for each faculty member is stored in the href attribute of the elements:
{xml_nodeset (1)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href")
head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
# paste0() joins strings together
spp_urls <- paste0("https://www.ucl.ac.uk/", spp_urls)
head(spp_urls)
[1] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
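The sequence of steps so far (read the HTML, select the link elements, pull out their text and href attributes, and paste on the site address) can be tried on a small inline page without touching a live website. This is a sketch under stated assumptions: the page content, the names, and the `example.org` address are made up, and the `a.nav-item` selector is chosen only because the links shown above carry `class="nav-item"`.

```r
library(rvest)

# A tiny made-up page standing in for a staff listing
toy_page <- minimal_html('
  <ul>
    <li><a class="nav-item" href="/staff/dr-a-example">Dr A Example</a></li>
    <li><a class="nav-item" href="/staff/dr-b-example">Dr B Example</a></li>
  </ul>')

# Select every <a> element with class "nav-item"
links <- toy_page %>% html_elements("a.nav-item")

# Store names and full URLs, leaving a column to fill with text later
staff <- data.frame(
  name = html_text(links),
  url = paste0("https://www.example.org", html_attr(links, "href")),
  text = NA
)
staff
```

Leaving the trailing slash off the site address avoids a doubled `/` when the href values already begin with one.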
name
1 Andrew Scott
2 Bugra Susler
3 Dr Adam Harris
4 Dr Alexandra Hartman
5 Dr Amanda Hall
6 Dr Aparna Ravi
url
1 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...
[1] "I am an Associate Professor of Political Science and Quantitative Research Methods at University College London and I am the programme director for the MSc Data Science and Public Policy. I received my PhD from Government Department of the London School of Economics in 2016. I am currently a member of the UK Cabinet Office’s Trial Advice Panel and was previously a Data Science Advisor to YouGov. My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion. At UCL, I have taught a series of quantitative methods modules to our (excellent) undergraduate and postgraduate students. These include PUBL0055 – Introduction to Quantitative Methods; PUBL050 – Advanced Quantitative Methods; POLS0012 – Causal Analysis. This year I will also be teaching PUBL0099 – Quantitative Text Analysis for Social Science.I am also programme director for the new MSc degree in Data Science and Public Policy which is a joint degree programme delivered between UCL Departments of Political Science and Economics.I supervise PhD students working in the areas of political behaviour and quantitative methods. Tweets by uclspp"
We have the text for one person! How do we get this for all faculty members?
for loops
We can use a for loop to iterate over the elements of our url variable.
for (i in seq_len(nrow(spp))) {
  # Load page for faculty member i
  faculty_member_page <- read_html(spp$url[i])
  # Extract text from that page: all <p> elements that come after
  # the <dl class="accordion"> element, collapsed into one string
  faculty_member_text <- faculty_member_page %>%
    html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
    html_text() %>%
    paste0(collapse = " ")
  # Save text for faculty member i
  spp$text[i] <- faculty_member_text
}
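To see what the XPath expression in the loop selects, it can help to run it on a small page where the answer is obvious. The sketch below assumes the rvest package is installed; the HTML snippet is invented for illustration and is much simpler than the real faculty pages.

```r
library(rvest)

# Invented toy page: one paragraph before the accordion, two after it
toy_page <- read_html('
<html><body>
  <p>Site navigation text</p>
  <dl class="accordion"><dt>Contact</dt><dd>Room 1.01</dd></dl>
  <p>First biography paragraph.</p>
  <p>Second biography paragraph.</p>
</body></html>')

# Same pipeline as in the loop: keep only <p> elements that have a
# <dl class="accordion"> somewhere before them in the document
toy_text <- toy_page %>%
  html_elements(xpath = '//p[preceding::dl[@class="accordion"]]') %>%
  html_text() %>%
  paste0(collapse = " ")

print(toy_text)
```

The `preceding::` axis looks backwards through the document from each paragraph, so the navigation paragraph, which has no accordion before it, is dropped.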
name
27 Dr Jack Blumenau
url
27 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-jack-blumenau
text
27 I am an Associate Professor of Political Science and Quantitative Research Methods at University College London and I am the programme director for the MSc Data Science and Public Policy. I received my PhD from Government Department of the London School of Economics in 2016. I am currently a member of the UK Cabinet Office’s Trial Advice Panel and was previously a Data Science Advisor to YouGov. My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion. At UCL, I have taught a series of quantitative methods modules to our (excellent) undergraduate and postgraduate students. These include PUBL0055 – Introduction to Quantitative Methods; PUBL050 – Advanced Quantitative Methods; POLS0012 – Causal Analysis. This year I will also be teaching PUBL0099 – Quantitative Text Analysis for Social Science.I am also programme director for the new MSc degree in Data Science and Public Policy which is a joint degree programme delivered between UCL Departments of Political Science and Economics.I supervise PhD students working in the areas of political behaviour and quantitative methods. Tweets by uclspp
name
39 Dr Lucy Barnes
url
39 https://www.ucl.ac.uk//political-science/people/academic-teaching-and-research-staff/dr-lucy-barnes
text
39 Dr Barnes' research focuses on the politics of economic policymaking in rich, western democracies (including the UK). She is currently working on the UKRI-funded Mental Models in Political Economy project, which seeks to understand how various types of people understand the economy. In other projects she is interested in examining the interface between political science and political philosophy, the politics of inequality and redistribution, of fiscal policy and government budgets, and the politics of taxation. Dr Barnes' research focuses on the politics of economic policymaking in rich, western democracies (including the UK). She is currently working on the UKRI-funded Mental Models and Political Economy project, which seeks to understand how various types of people understand the economy. In other projects she is interested in examining the interface between political science and political philosophy, the politics of inequality and redistribution, of fiscal policy and government budgets, and the politics of taxation. Tweets by uclspp
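The loop shown earlier stops with an error if any single page fails to load, and it sends requests as fast as the server answers. A minimal sketch of a more defensive pattern is given here, using `tryCatch()` and `Sys.sleep()` from base R. The names `scrape_all` and `fake_fetch` are hypothetical, and `fake_fetch` is a stand-in so that the sketch runs offline; in practice it would be replaced by a function wrapping the `read_html()` pipeline.

```r
# Hypothetical helper: apply `fetch_text` to every URL, keeping NA on failure
scrape_all <- function(urls, fetch_text, delay = 1) {
  out <- rep(NA_character_, length(urls))
  for (i in seq_along(urls)) {
    out[i] <- tryCatch(
      fetch_text(urls[i]),
      error = function(e) {
        # Report the problem and carry on with the next URL
        message("Failed on ", urls[i], ": ", conditionMessage(e))
        NA_character_
      }
    )
    Sys.sleep(delay)  # pause between requests to avoid overloading the server
  }
  out
}

# Stand-in fetcher for illustration only: fails on URLs containing "broken"
fake_fetch <- function(url) {
  if (grepl("broken", url)) stop("page not found")
  paste("text from", url)
}

demo_pages <- c(
  "https://example.org/a",
  "https://example.org/broken",
  "https://example.org/c"
)
demo_text <- scrape_all(demo_pages, fake_fetch, delay = 0)
print(demo_text)
```

Rows that failed keep an `NA`, so they can be identified and retried later with `is.na()` instead of rerunning the whole scrape.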
What should we do with this data?
We can use it to estimate a topic model!
Two questions in this application:
You could all now legitimately add something like this to your CV:
Training in data science and machine learning, including experience with: data manipulation and visualisation; supervised and unsupervised learning methods; linear and logistic regression; classification methods; non-linear methods (local regression, splines, GAMs); tree-based methods (bagging, random forests); clustering and dimension reduction (k-means, principal components analysis); quantitative text analysis (dictionaries, supervised learning for text, topic models); web-scraping.
Machine Learning and Causality
Machine Learning Theory
Measurement
Advanced Text Analysis
There are some very good MSc courses near here that offer a lot of this material!
LSE option: MSc Applied Social Data Science
UCL option: MSc Data Science and Public Policy
Please feel free to email me if you are interested in the UCL course.
ME314: Introduction to Data Science and Machine Learning