LSE Methods Summer Programme 2022
London School of Economics and Political Science
Instructors
 Kenneth Benoit (K.R.Benoit@lse.ac.uk), Department of Methodology, LSE
 Jack Blumenau (j.blumenau@ucl.ac.uk), Department of Political Science, UCL
TAs
 Sarah Jewett (S.Jewett1@lse.ac.uk), LSE
 Yuanmo He (y.he54@lse.ac.uk), LSE
Moodle
 Moodle page here
 Moodle enrollment key: ME31422
This repository contains the class materials for the Research Methods, Data Science, and Mathematics course ME314 Introduction to Data Science and Machine Learning taught in July 2022 by Kenneth Benoit and Jack Blumenau.
Quick links to topics
Day  Date  Instructor  Topic 

1  Mo 11 Jul  KB  Overview and introduction to data science 
2  Tu 12 Jul  KB  The Shape of Data 
3  We 13 Jul  KB  Working with Data (continued) 
4  Th 14 Jul  KB  Linear Regression 
5  Mo 18 Jul  JB  Classification 
6  Tu 19 Jul  JB  Nonlinear models and tree-based methods 
7  We 20 Jul  JB  Resampling methods, model selection and regularization 
8  Th 21 Jul  JB  Unsupervised learning and dimensional reduction 
9  Mo 25 Jul  JB  Text analysis 
10  Tu 26 Jul  JB  Text classification and scaling 
11  We 27 Jul  JB  Topic modelling 
12  Th 28 Jul  JB  Data from the Web 
13  Fr 29 Jul  Final Exam 
Overview
Data science and machine learning are exciting new areas that combine scientific inquiry, statistical knowledge, substantive expertise, and computer programming. One of the main challenges for businesses and policy makers when using big data is to find people with the appropriate skills. Good data science requires experts who combine substantive knowledge with data analytical skills, which makes it a prime area for social scientists with an interest in quantitative methods.
This course integrates prior training in quantitative methods (statistics) and coding with substantive expertise and introduces the fundamental concepts and techniques of data science and machine learning.
Typical students will be advanced undergraduate and postgraduate students from any field requiring the fundamentals of data science or working with typically large datasets and databases. Practitioners from industry, government, or research organisations with some basic training in quantitative analysis or computer programming are also welcome. Because this course surveys diverse techniques and methods, it makes an ideal foundation for more advanced or more specific training. Our applications are drawn from social, political, economic, legal, and business and marketing fields.
Objectives
This course aims to provide an introduction to the quantitative analysis of data using the methods of statistical learning, an approach blending classical statistical methods with recent advances in computational and machine learning. We will cover the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using the methods we cover. We also cover data preparation and processing, including working with structured databases, key-value formatted data (JSON), and unstructured textual data. At the end of this course students will have a sound understanding of the field of data science, the ability to analyse data using some of its main methods, and a solid foundation for more advanced or more specialised study.
The course will be delivered as a series of morning lectures (held from 10am to 1pm, with an extended break in the middle), followed by lab sessions in the afternoon where students will apply the lessons in a series of instructor-guided exercises using data provided as part of the exercises. The course will cover the following topics:
 an overview of data science and the challenge of working with big data using statistical methods
 how to integrate the insights from data analytics into knowledge generation and decision-making
 how to acquire data, both structured and unstructured, and to process it, store it, and convert it into a format suitable for analysis
 the basics of statistical inference including probability and probability distributions, modelling, experimental design
 an overview of classification methods and related methods for assessing model fit and cross-validating predictive models
 supervised learning approaches, including linear and logistic regression, decision trees, and naïve Bayes
 unsupervised learning approaches, including clustering, association rules, and principal components analysis
 quantitative methods of text analysis, including mining social media and other online resources
 data visualisation through a variety of graphs.
Lectures and classes

Lectures: Lectures will be held between 10am and 1pm each day.

Classes: Students will be assigned to four classes, which will be held between 2pm-3.30pm and 3.30pm-5pm each day.
See the Moodle site for ME314 for class lists, Zoom links, and announcements.
Prerequisites
Students should already be familiar with quantitative methods at an introductory level, up to linear regression analysis. Familiarity with computer programming or database structures is a benefit, but not formally required.
Preparing for the course
You will need R and RStudio for this course, so please download and install both on your computer before it begins.
Detailed instructions can also be found here for installing the tools you need and working with the lab materials.
If you are not already familiar with R, we strongly encourage you to become familiar with it before the start of the course. That way, you will spend much less time becoming familiar with the tools, and be able to focus more on the methods. The following links provide basic introductions to R, which you can study at your own pace before the course begins.
 An Introduction to R.
 Data Camp R tutorials.
 Data Camp R Markdown tutorials, first chapter.
 R Codeschool.
We also strongly recommend you spend some time before the course working through the following materials:
 Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O’Reilly Media. Note: Online version is available from the authors’ page here.
 James et al. (2021) An Introduction to Statistical Learning: With applications in R, Springer. Particularly chapters 1 and 2. Note: The book is available free online here.
Important Specifics
Computer Software
Computer-based exercises will feature prominently in the course, especially in the lab sessions. The use of all software tools will be explained in the sessions, including how to download and install them. All of the class work will be done using R, using publicly available packages.
Main Texts
The primary texts are:
 James et al. (2021) An Introduction to Statistical Learning: With applications in R, Springer. Note: The book is available free online here.
 Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O’Reilly Media. Note: Online version is available from the authors’ page here.
 Zumel, N. and Mount, J. (2014). Practical Data Science with R. Manning Publications.
The following are supplemental texts which you may also find useful:
 Lantz, B. (2013). Machine Learning with R. Packt Publishing.
 Lesmeister, C. (2015). Mastering Machine Learning with R. Packt Publishing.
 Conway, D. and White, J. (2012) Machine Learning for Hackers. O’Reilly Media.
 Leskovec, J., Rajaraman, A. and Ullman, J. (2011). Mining of Massive Datasets. Cambridge University Press.
 Zafarani, R., Abbasi, M. A. and Liu, H. (2014) Social Media Mining: An introduction. Cambridge University Press.
 Hastie et al. (2009) The Elements of Statistical Learning: Data mining, inference, and prediction. Springer. Note: The book is available from the authors’ page here.
Instructors
Kenneth Benoit is Director of the Data Science Institute and Professor of Computational Social Science at the Department of Methodology, LSE. With a background in political science, his substantive work focuses on political party competition, political measurement issues, and electoral systems. His research and teaching are primarily in the field of social science statistical applications. His recent work concerns the quantitative analysis of text as data, for which he has developed the package(s) quanteda for the R statistical software.
Jack Blumenau is an Associate Professor in Quantitative Methods at the UCL Department of Political Science and is the Programme Director for the MSc in Data Science and Public Policy at UCL. He is also a member of the UK Cabinet Office’s “What Works” Trial Advice Panel, in which he provides data science expertise to government, and was previously a Data Science Advisor to YouGov. His research and teaching are primarily in the fields of quantitative methods, public opinion, legislative politics, and electoral politics.
Assessment
Daily lab exercises
These are not assessed, but will form the practical materials for each day’s labs. See https://lseme314.github.io/instructions for detailed instructions on obtaining and working with each day’s lab materials.
Midterm
The class assignment for Day 5 will count as the midterm assignment, which will count for 25% of the grade. The midterm will be released after the lecture on Day 5 (Monday 18th July) and will be due at 7pm on Day 7 (Wednesday 20th July).
Exam
The final exam will be set on Friday 29th July.
Slack
We have a Slack workspace for the course which you should use to communicate both with us as instructors, and with your fellow students. You can sign up via this link: https://tinyurl.com/2022ME314
Detailed Course Schedule
1. Overview and introduction to data science
We will use this session to get to know the range of interests and experience students bring to the class, as well as to survey the machine learning approaches to be covered. We will also discuss and demonstrate the R software.
Resources
Required reading
 James et al (2021), Chapters 1–2. Note: The book is available here.
 Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O’Reilly Media, Chapters 1-3. Note: Online version is available from the authors’ page here.
 An Introduction to R.
 Downloading and installing RStudio and R on your computer.
 Data Camp R tutorials.
 Data Camp R Markdown tutorials, first chapter.
 R Codeschool.
Recommended Reading
 Patrick Burns, 2011. The R Inferno. Available here.
 Lantz, Ch. 2.
2. The shape of data
This session introduces the concept of data “beyond the spreadsheet”, the rectangular format most common in statistical datasets. It covers relational structures and the concept of database normalization. We will also cover ways to restructure data from “wide” to “long” format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.
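As a rough illustration of the “wide” to “long” restructuring described above, the following is a minimal sketch in plain Python (the labs themselves use R). The toy records, the column names, and the helper `wide_to_long` are all hypothetical and chosen only for illustration; they are not taken from the course materials.

```python
# Hypothetical toy data in "wide" format: one row per unit,
# one column per measurement occasion.
wide = [
    {"unit": "A", "t1": 1, "t2": 2},
    {"unit": "B", "t1": 3, "t2": 4},
]


def wide_to_long(rows, id_col, names_to, values_to):
    """Return one record per (unit, measured column) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_col: row[id_col], names_to: key, values_to: value}
            )
    return long_rows


long_rows = wide_to_long(wide, id_col="unit", names_to="time", values_to="score")
for record in long_rows:
    print(record)
```

Each wide row with two measurement columns becomes two long rows, so the identifier column is repeated while the former column names move into a single `time` variable.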
Resources
Required reading
 Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition; Ch. 9-12 online).
If you use Python, these references may help
 Reshaping data in Python: “Reshaping and Pivot Tables”.
 Robin Linderborg, “Reshaping Data in Python”, 20 Jan 2017.
3. Working with data (continued)
This day will continue with data manipulation and reshaping. We will cover alternative data formats including JSON, and how to make use of it. We will introduce the concept of databases and SQL, although we will not cover this using SQL directly. Depending on time, we will move on to Day 4 material in preparation for next week and the midterm released on Day 5.
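To make the JSON point concrete, here is a small sketch, again in Python rather than the R used in the labs, of reading key-value formatted data and flattening a nested field into rectangular rows. The JSON string and its field names (`id`, `name`, `tags`) are invented for illustration only.

```python
import json

# Hypothetical JSON document: a list of records, each with a nested list.
raw = """
[
  {"id": 1, "name": "Ada",   "tags": ["r", "sql"]},
  {"id": 2, "name": "Grace", "tags": []}
]
"""

records = json.loads(raw)  # parsed into a list of Python dicts

# Flatten the nested "tags" list so that each (record, tag) pair is one row.
# A record with no tags contributes no rows, much as in an inner join.
rows = [
    {"id": rec["id"], "name": rec["name"], "tag": tag}
    for rec in records
    for tag in rec["tags"]
]

for row in rows:
    print(row)
```

The nested list is what makes this data non-rectangular; once flattened, the rows resemble a separate table keyed by `id`, which is the same idea that underlies relational databases.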
Resources
Required reading
Continue with the Day 2 reading from Wickham and Grolemund (2017).
Recommended Reading
 Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013. Chapters 4-5, Relational Databases and NoSQL databases.
 Nield, Thomas. Getting Started with SQL: A hands-on approach for beginners. O’Reilly, 2016. Entire text.
 SQLite documentation.
 Bassett, L. 2015. Introduction to JavaScript Object Notation: A to-the-point Guide to JSON. O’Reilly Media, Inc.
4. Linear regression
Linear regression model and supervised learning.
Resources
Required Reading
 James et al., Chapter 3.
Recommended Reading
 Zumel and Mount, Chapter 7.1.
 Lantz, Chapter 6
5. Classification
Logistic regression, Naive Bayes, evaluating model performance.
Resources
The midterm exam will be posted on Moodle.
Required Reading
 James et al., Chapter 4.
Recommended Reading
 Lesmeister, Chapter 3.
 Zumel and Mount, Chapters 5, 6, 7.2.
 Lantz, Chapters 3-4, 10.
6. Nonlinear models and tree-based methods
GAMs, local regression, decision trees, random forest, bagging.
Resources
Required Reading
 James et al., Chapters 7-8.
Recommended Reading
 Lesmeister, Chapter 6.
 Zumel and Mount, Chapters 9.1-9.3.
 Muchlinski, D., Siroky, D., He, J., Kocher, M. (2016) “Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data.” Political Analysis, 24(1): 87-103.
7. Resampling methods, model selection and regularization
Cross-validation, bootstrap, ridge and lasso.
Resources
Required Reading
 James et al., Chapters 5-6.
Recommended Reading
 Lesmeister, Chapter 4.
8. Unsupervised learning and dimensional reduction
Cluster analysis, PCA
Resources
Required reading
 James et al., Chapter 12.
Recommended Reading
 Lesmeister, Chapters 5, 8-9.
 Zumel and Mount, Chapter 8.
 Lantz, Chapters 8-9.
 Leskovec et al., Chapter 11.
9. Text analysis
Working with text in R, sentiment analysis, dictionary methods.
Resources
Required reading
 Grimmer, J, and B M Stewart (2013), “Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis.
 Benoit, Kenneth and Alexander Herzog. In press. “Text Analysis: Estimating Policy Preferences From Written and Spoken Words.” In Analytics, Policy and Governance, eds. Jennifer Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.
Recommended Reading
 Denny, M.J. and Spirling, A. (2018), “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It.” Political Analysis.
 Spirling, A. (2012), “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” Journal of Politics.
 Herzog, A. and K. Benoit (2015), “The most unkindest cuts: Speaker selection and expressed government dissent during economic crisis.” Journal of Politics, 77(4):1157–1175.
 Benoit, K., Munger, K., and Spirling, A. (2019), “Measuring and Explaining Political Sophistication Through Textual Complexity.” American Journal of Political Science 63(2, April): 491–508. 10.1111/ajps.12423.
10. Text classification and scaling
Naive Bayes classifier, Wordscores, and Wordfish.
Resources
Required reading
 Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311-331. doi:10.1017/S0003055403000698
 Slapin, J. B. and Proksch, S. (2008), A Scaling Model for Estimating Time‐Series Party Positions from Texts. American Journal of Political Science, 52: 705-722. doi:10.1111/j.1540-5907.2008.00338.x
Recommended Reading
 Statsoft, “Naive Bayes Classifier Introductory Overview.”
 An online article by Paul Graham on classifying spam email.
 Bionicspirit.com, 9 Feb 2012, “How to Build a Naive Bayes Classifier.”
 Lowe, W. (2008). Understanding wordscores. Political Analysis, 16(4), 356-371.
 Benoit, Kenneth and Paul Nulty. 2013. “Classification Methods for Scaling Latent Political Traits.” Presented at the Annual Meeting of the Midwest Political Science Association, April 11–14, Chicago.
11. Topic modelling
Latent Dirichlet Allocation, Correlated Topic Model, Structural Topic Model.
Resources
Required reading
 David Blei (2012). “Probabilistic topic models.” Communications of the ACM, 55(4): 77-84.
 Blei, David, Andrew Y. Ng, and Michael I. Jordan (2003). “Latent dirichlet allocation.” Journal of Machine Learning Research 3: 993-1022.
 Blei, David (2014) “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application, 1: 203-232.
Recommended Reading
 Blei, D. and J. Lafferty “Topic Models.” In Text Mining: Classification, clustering, and applications, A. Srivastava and M. Sahami (eds.), pp 71-94, 2009. Chapter available here.
 Blei, David M., and John D. Lafferty. “Dynamic topic models.” In Proceedings of the 23rd international conference on machine learning, pp. 113-120. ACM, 2006.
 Mimno, D. (April 2012). “Computational Historiography: Data Mining in a Century of Classics Journals.” Journal on Computing and Cultural Heritage, 5 (1).
 Lesmeister Chapter 12.
12. Data from the web
The promises and pitfalls of social media data. The Twitter API. The Facebook API. Web scraping. Ethics.
Resources
Recommended Reading:
 Broniatowski, David A, Michael J Paul, and Mark Dredze. 2013. “National and Local Influenza Surveillance Through Twitter: an Analysis of the 2012-2013 Influenza Epidemic” PLoS ONE 8(12): 83672–78. PDF here
 Barbera, Pablo., 2017. “Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data.” Working Paper
 Munger, Kevin., 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment” Political Behavior 39(3): 629-649
 Ginsberg et al., 2008. “Detecting influenza epidemics using search engine query data” Nature 457: 1012–1014.
 Lazer et al., 2014. “The Parable of Google Flu: Traps in Big Data Analysis” Science 343: 1203-1205
 Earthquake shakes Twitter users: real-time event detection by social sensors
 http://rcrastinate.blogspot.co.uk/2015/02/mappingworldwithtweetsincludinggif.html
 https://github.com/pablobarbera/streamR
 Matthew Russell (2013). Mining the Social Web. O’Reilly Media. 2nd edition.