LSE Methods Summer Programme 2024

London School of Economics and Political Science

This repository contains the class materials for the Research Methods, Data Science, and Mathematics course ME314 Introduction to Data Science and Machine Learning taught in July 2024 by Kenneth Benoit and Jack Blumenau.

Day  Date       Instructor  Topic
1    Mo 08 Jul  KB          Overview and introduction to data science
2    Tu 09 Jul  KB          The Shape of Data
3    Th 11 Jul  KB          Working with Data
4    Fr 12 Jul  KB          Linear Regression
5    Mo 15 Jul  JB          Classification
6    Tu 16 Jul  JB          Non-linear models and tree-based methods
7    We 17 Jul  KB          Resampling methods, model selection and regularization
8    Th 18 Jul  KB          Unsupervised learning and dimensional reduction
9    Mo 22 Jul  JB          Text analysis
10   Tu 23 Jul  JB          Similarity Metrics and Supervised Learning for Text
11   We 24 Jul  JB          Topic modelling
12   Th 25 Jul  JB          Word-embeddings and Large Language Models
13   Fr 26 Jul              Final Exam

Overview

Data science and machine learning are exciting new areas that combine scientific inquiry, statistical knowledge, substantive expertise, and computer programming. One of the main challenges for businesses and policy makers when using big data is to find people with the appropriate skills. Good data science requires experts who combine substantive knowledge with data analytical skills, which makes it a prime area for social scientists with an interest in quantitative methods.

This course integrates prior training in quantitative methods (statistics) and coding with substantive expertise and introduces the fundamental concepts and techniques of data science and machine learning.

Typical students will be advanced undergraduate and postgraduate students from any field requiring the fundamentals of data science or working with typically large datasets and databases. Practitioners from industry, government, or research organisations with some basic training in quantitative analysis or computer programming are also welcome. Because this course surveys diverse techniques and methods, it makes an ideal foundation for more advanced or more specific training. Our applications are drawn from social, political, economic, legal, and business and marketing fields.

Objectives

This course aims to provide an introduction to the quantitative analysis of data using the methods of statistical learning, an approach blending classical statistical methods with recent advances in computational and machine learning. We will cover the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using the methods we cover. We also cover data preparation and processing, including working with structured databases, key-value formatted data (JSON), and unstructured textual data. At the end of this course students will have a sound understanding of the field of data science, the ability to analyse data using some of its main methods, and a solid foundation for more advanced or more specialised study.

The course will be delivered as a series of morning lectures (held from 10am to 1pm, with an extended break in the middle), followed by lab sessions in the afternoon where students will apply the lessons in a series of instructor-guided exercises using data provided as part of the exercises. The topics covered each day are listed in the Detailed Course Schedule.

Lectures and classes

See the Moodle site for ME314 for class lists, Zoom links, and announcements.

Prerequisites

Students should already be familiar with quantitative methods at an introductory level, up to linear regression analysis. Familiarity with computer programming or database structures is a benefit, but not formally required.

Preparing for the course

You will need to download and install R and RStudio on your computer for this course.

Detailed instructions can also be found here for installing the tools you need and working with the lab materials.

If you are not already familiar with R, we strongly encourage you to learn the basics before the start of the course. That way, you will spend much less time becoming familiar with the tools, and be able to focus more on the methods. The following links provide basic introductions to R, which you can study at your own pace before the course begins.

We also strongly recommend you spend some time before the course working through the following materials:

Important Specifics

Computer Software

Computer-based exercises will feature prominently in the course, especially in the lab sessions. The use of all software tools will be explained in the sessions, including how to download and install them. All of the class work will be done using R, using publicly available packages.

Main Texts

The primary texts are:

The following are supplemental texts which you may also find useful:

Instructors

Kenneth Benoit is Director of the Data Science Institute and Professor of Computational Social Science at the Department of Methodology, LSE. With a background in political science, his substantive work focuses on political party competition, political measurement issues, and electoral systems. His research and teaching are primarily in the field of social science statistical applications. His recent work concerns the quantitative analysis of text as data, for which he has developed the package(s) quanteda for the R statistical software.

Jack Blumenau is an Associate Professor in Quantitative Methods at the UCL Department of Political Science and is the Programme Director for the MSc in Data Science and Public Policy at UCL. He is also a member of the UK Cabinet Office’s “What Works” Trial Advice Panel, in which he provides data science expertise to government, and was previously a Data Science Advisor to YouGov. His research and teaching are primarily in the fields of quantitative methods, public opinion, legislative politics, and electoral politics.

Assessment

Daily lab exercises

These are not assessed, but will form the practical materials for each day’s labs. See https://lse-me314.github.io/instructions for detailed instructions on obtaining and working with each day’s lab materials.

Mid-term

The class assignment for Day 5 will serve as the mid-term assignment and will count for 25% of the grade. The mid-term will be released after the lecture on Day 5 (Monday 15th July) and will be due at 7pm on Day 7 (Wednesday 17th July).

Exam

The final exam will be set on Friday 26th July.

Slack

We have a Slack workspace for the course which you should use to communicate both with us as instructors, and with your fellow students. You can sign up via this link: TBC.

Detailed Course Schedule


1. Overview and introduction to data science

We will use this session to get to know the range of interests and experience students bring to the class, as well as to survey the machine learning approaches to be covered. We will also discuss and demonstrate the R software.

Resources
Required reading

2. The shape of data

This session introduces the concept of data “beyond the spreadsheet”, that is, beyond the rectangular format most common in statistical datasets. It covers relational structures and the concept of database normalization. We will also cover ways to restructure data from “wide” to “long” format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.
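To give a rough feel for the “wide” to “long” restructuring mentioned here, the following is a minimal sketch. The labs themselves use R; plain Python is used only to keep the sketch self-contained. The table, its column names, its values, and the helper `wide_to_long` are all invented for illustration and are not taken from the course materials.

```python
# Invented "wide" table: one row per unit, one column per year.
wide = [
    {"unit": "A", "2022": 1.0, "2023": 2.0},
    {"unit": "B", "2022": 3.0, "2023": 4.0},
]


def wide_to_long(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the id column is repeated on each output row instead
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows


# Hypothetical usage: each (unit, year) pair becomes one row.
long_rows = wide_to_long(wide, id_col="unit", var_name="year", value_name="value")
for record in long_rows:
    print(record)
```

The long form has one row per unit-year combination, which is usually the layout that grouping and plotting tools expect.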

Resources
Required reading
If you use Python, these references may help

3. Working with data (continued)

This day will continue with data manipulation and reshaping. We will cover alternative data formats including JSON, and how to make use of them. We will introduce the concept of databases and SQL, although we will not cover this using SQL directly. Depending on time, we will move on to Day 4 material in preparation for next week and the mid-term released on Day 5.
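As a small illustration of what making use of JSON can involve, the sketch below parses a nested JSON document and flattens it into rectangular rows. It is written in plain Python only so that it is self-contained (the labs use R), and the records, field names, and values are invented for the example.

```python
import json

# Invented JSON document: one record per person, each with a nested list.
raw = """
[
  {"name": "Ana", "courses": [{"code": "X1", "mark": 68}, {"code": "X2", "mark": 72}]},
  {"name": "Ben", "courses": [{"code": "X1", "mark": 55}]}
]
"""

# json.loads turns the text into nested Python lists and dictionaries.
records = json.loads(raw)

# Flatten the nesting into one rectangular row per (person, course) pair.
rows = [
    {"name": person["name"], "code": course["code"], "mark": course["mark"]}
    for person in records
    for course in person["courses"]
]

for row in rows:
    print(row)
```

Each nested list entry becomes its own row, with the parent field repeated, which mirrors how a one-to-many relationship is stored across tables in a relational database.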

Resources
Required reading

Continue with the Day 2 reading from Wickham and Grolemund (2017).


4. Linear regression

Linear regression model and supervised learning.

Resources
Required reading

5. Classification

Logistic regression, Naive Bayes, evaluating model performance.

Resources

The mid-term assignment is to be submitted on Moodle.

Required reading

6. Non-linear models and tree-based methods

GAMs, local regression, decision trees, random forest, bagging.

Resources
Required reading

7. Resampling methods, model selection and regularization

Cross-validation, bootstrap, ridge and lasso.

Resources
Required reading

8. Unsupervised learning and dimensional reduction

Cluster analysis, PCA.

Resources
Required reading

9. Text analysis

Working with text in R, dictionary methods.

Resources
Required reading

10. Similarity Metrics and Supervised Learning for Text

Vector space model, tf-idf, cosine similarity, Naive Bayes classification.
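For orientation, here is a minimal sketch of how tf-idf weighting and cosine similarity fit together in the vector space model. It uses plain Python rather than the R packages used in the labs, the three toy documents are invented, and the idf formula log(N / df) is only one of several common variants, chosen here as an assumption.

```python
import math
from collections import Counter

# Invented toy corpus; whitespace tokenisation is a simplification.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents containing each term.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))


def tfidf_vector(tokens):
    """Raw term count times log(N / df), one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = [tfidf_vector(doc) for doc in tokenized]
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing several terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(sim_01, sim_02)
```

Documents with no terms in common have a similarity of zero, while overlapping documents score above zero, with rarer shared terms contributing more weight than common ones.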

Resources
Required reading

J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. Chapters 7 and 11.


11. Topic modelling

Latent Dirichlet Allocation, Structural Topic Model.

Resources
Required reading

12. Word-Embeddings and Large Language Models

Word-embeddings, n-gram models, transformers.

Resources