Language is central to almost all social interaction
Laws are written ✒️
Political events are discussed 📢
History is recorded 📚
People communicate ✉️
But these interactions have not been amenable to quantitative analysis until recently.
Two major changes have contributed to the growth of quantitative text analysis (QTA):
Enormous increase in availability of digitized texts
Development of powerful and easily applicable methods
Consequence: we can now interrogate central questions in social science using data that was never available in the past.
We will be thinking about different methods of doing one core thing:
Assigning numbers to words and documents in order to measure latent concepts in text.
Although the methods we use to generate these numbers differ, the common goal will be to assign numbers that enable us to measure latent concepts from large corpora of text.
Dictionaries
Supervised learning
Topic models
Word-embedding models
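To make "assigning numbers to documents" concrete, here is a minimal sketch of the simplest approach listed above, a dictionary. The word list, the concept ("economy") and the two example texts are illustrative assumptions, not a validated dictionary or real data. The score is simply the share of a document's tokens that appear in the dictionary.

```python
import re

# Hypothetical dictionary for the latent concept "economy".
# The word list is an illustrative assumption, not a validated dictionary.
ECONOMY_TERMS = {"tax", "taxes", "jobs", "growth", "inflation", "wages"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def dictionary_score(text, terms):
    """Return the proportion of tokens in `text` that appear in `terms`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


# Made-up example documents (assumptions for illustration only).
docs = {
    "speech_a": "We will cut taxes and create jobs to restore growth.",
    "speech_b": "Our schools and hospitals need more teachers and nurses.",
}

scores = {name: dictionary_score(text, ECONOMY_TERMS) for name, text in docs.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

Each document is reduced to a single number. Whether that number measures the concept of interest depends entirely on the quality of the word list.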
Many of these approaches share a set of common assumptions:
Texts represent observable manifestations of underlying characteristics of interest (usually attributes of the authors)
Texts can be represented through extracting their features (for now, words)
Analysis of those features can produce meaningful estimates of the underlying characteristic of interest
For any given application, these assumptions may or may not be met.
Statistical models attempt to describe the ways in which data is generated
The data-generation process for language is extraordinarily complex
All the methods we cover on this course make simplifying assumptions, which means they fail to provide an accurate account of the data-generation process
We trade off complexity for tractability
Many of the methods we study are easy to apply quickly and at scale
When applied in any given domain they may lead to misleading or wrong inferences
It is therefore essential to validate the approaches we use in the particular setting we are studying
QTA models should not be evaluated for their realism, but for their usefulness in specific tasks
Validation can take many forms
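One common form of validation is to compare a method's output with human judgement on the same documents. The sketch below assumes a small hand-coded sample and a set of model predictions. Both label lists are made up for illustration. It computes accuracy, precision and recall for one category.

```python
# Hypothetical hand-coded labels and model predictions for six documents.
# Both lists are illustrative assumptions, not real results.
human = ["econ", "econ", "other", "other", "econ", "other"]
model = ["econ", "other", "other", "other", "econ", "econ"]


def accuracy(gold, pred):
    """Share of documents where the model agrees with the human coder."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision(gold, pred, positive):
    """Of the documents the model labelled `positive`, the share that are correct."""
    predicted = [g for g, p in zip(gold, pred) if p == positive]
    return sum(g == positive for g in predicted) / len(predicted) if predicted else 0.0


def recall(gold, pred, positive):
    """Of the documents humans labelled `positive`, the share the model found."""
    actual = [p for g, p in zip(gold, pred) if g == positive]
    return sum(p == positive for p in actual) / len(actual) if actual else 0.0


acc = accuracy(human, model)
prec = precision(human, model, "econ")
rec = recall(human, model, "econ")
print(round(acc, 2), round(prec, 2), round(rec, 2))
```

Which of these quantities matters most depends on the task, which is one reason validation has to be done in the particular setting being studied.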
Text data is inherently multidimensional
A key goal of text analysis is to distill this complexity into a lower-dimensional representation that preserves important aspects of meaning
Visualising the outputs of text models is crucial to conveying the meaning embodied in the texts
Quantitative approaches differ from qualitative approaches
Large-scale analysis of many texts, rather than close readings of few texts
Interpretation of quantitative summaries of text, rather than direct interpretation of texts
But all text analysis involves qualitative judgement…
…in the construction of the feature-document matrix
…in the interpretation of the output of statistical models
Each quantitative text analysis follows a similar workflow:
Conversion of textual features into a quantitative matrix
A quantitative or statistical procedure to extract information from the quantitative matrix
Summary and interpretation of the quantitative results
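The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three toy documents are made up, and cosine similarity stands in for "a quantitative or statistical procedure" only as one simple possibility among many.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Made-up toy documents (assumptions for illustration only).
docs = {
    "doc1": "taxes and jobs and growth",
    "doc2": "jobs and growth for workers",
    "doc3": "schools need teachers",
}

# Step 1: convert textual features (here, words) into a quantitative matrix.
counts = {name: Counter(re.findall(r"[a-z]+", text.lower())) for name, text in docs.items()}
features = sorted(set().union(*counts.values()))
dfm = {name: [c[f] for f in features] for name, c in counts.items()}


# Step 2: apply a quantitative procedure to the matrix (cosine similarity of rows).
def cosine(u, v):
    """Cosine similarity between two count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


similarity = {(a, b): cosine(dfm[a], dfm[b]) for a, b in combinations(docs, 2)}

# Step 3: summarise the results for interpretation.
most_similar = max(similarity, key=similarity.get)
print(most_similar, round(similarity[most_similar], 3))
```

The interpretation in step 3 still needs a qualitative judgement: here the two documents that share economic vocabulary come out as most similar, but whether word overlap captures the similarity we care about is a substantive question.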
In reality, there are additional steps:
Select documents
Digitize documents
Represent as quantitative data
Analyse data
Validate analysis
Interpret analysis
Motivating Example
The UN Sustainable Development Goals are a set of 17 connected global goals which represent “a shared blueprint for peace and prosperity” for people across the world. Each goal is associated with a series of specific targets and indicators.
Question: (How) can we characterise the UN Sustainable Development Goals as numeric data?
[1] "End poverty in all its forms everywhere"
[2] "End hunger, achieve food security and improved nutrition and promote sustainable agriculture"
[3] "Ensure healthy lives and promote well-being for all at all ages"
[4] "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all"
[5] "Achieve gender equality and empower all women and girls"
[6] "Ensure availability and sustainable management of water and sanitation for all"
[7] "Ensure access to affordable, reliable, sustainable and modern energy for all"
[8] "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all"
[9] "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation"
[10] "Reduce inequality within and among countries"
[11] "Make cities and human settlements inclusive, safe, resilient and sustainable"
[12] "Ensure sustainable consumption and production patterns"
[13] "Take urgent action to combat climate change and its impacts"
[14] "Conserve and sustainably use the oceans, seas and marine resources for sustainable development"
[15] "Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss"
[16] "Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels"
[17] "Strengthen the means of implementation and revitalize the global partnership for sustainable development"
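One possible answer to the question above is to count words. The sketch below takes the first three goals exactly as printed and turns them into a small document-feature matrix, with one row per goal and one column per distinct word. The tokenisation rule (lowercase, alphabetic words, hyphenated words kept whole) is an assumption chosen for simplicity, not the course's own preprocessing.

```python
import re
from collections import Counter

# The first three goals, copied from the list above.
goals = [
    "End poverty in all its forms everywhere",
    "End hunger, achieve food security and improved nutrition and promote sustainable agriculture",
    "Ensure healthy lives and promote well-being for all at all ages",
]


def tokenize(text):
    """Lowercase and extract words, keeping hyphenated words as one token."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


# One Counter of word frequencies per goal.
counts = [Counter(tokenize(goal)) for goal in goals]

# Columns of the matrix: every distinct word, in alphabetical order.
features = sorted(set().union(*counts))

# Rows of the matrix: one vector of counts per goal.
dfm = [[c[f] for f in features] for c in counts]

# Inspect a few columns: how often each word occurs in goals 1 to 3.
for word in ["all", "and", "end", "promote"]:
    col = features.index(word)
    print(word, [row[col] for row in dfm])
```

Each goal is now a vector of numbers. Word order is discarded, which is exactly the kind of simplifying assumption discussed earlier.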
[1] "End poverty in all its forms everywhere By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day By 2030, reduce at least by half the proportion of men, women and children of all ages living in poverty in all its dimensions according to national definitions Implement nationally appropriate social protection systems and measures for all, including floors, and by 2030 achieve substantial coverage of the poor and the vulnerable By 2030, ensure that all men and women, in particular the poor and the vulnerable, have equal rights to economic resources, as well as access to basic services, ownership and control over land and other forms of property, inheritance, natural resources, appropriate new technology and financial services, including microfinance By 2030, build the resilience of the poor and those in vulnerable situations and reduce their exposure and vulnerability to climate-related extreme events and other economic, social and environmental shocks and disasters Ensure significant mobilization of resources from a variety of sources, including through enhanced development cooperation, in order to provide adequate and predictable means for developing countries, in particular least developed countries, to implement programmes and policies to end poverty in all its dimensions Create sound policy frameworks at the national, regional and international levels, based on pro-poor and gender-sensitive development strategies, to support accelerated investment in poverty eradication actions "
[2] "End hunger, achieve food security and improved nutrition and promote sustainable agriculture By 2030, end hunger and ensure access by all people, in particular the poor and people in vulnerable situations, including infants, to safe, nutritious and sufficient food all year round By 2030, end all forms of malnutrition, including achieving, by 2025, the internationally agreed targets on stunting and wasting in children under 5 years of age, and address the nutritional needs of adolescent girls, pregnant and lactating women and older persons By 2030, double the agricultural productivity and incomes of small-scale food producers, in particular women, indigenous peoples, family farmers, pastoralists and fishers, including through secure and equal access to land, other productive resources and inputs, knowledge, financial services, markets and opportunities for value addition and non-farm employment By 2030, ensure sustainable food production systems and implement resilient agricultural practices that increase productivity and production, that help maintain ecosystems, that strengthen capacity for adaptation to climate change, extreme weather, drought, flooding and other disasters and that progressively improve land and soil quality By 2020, maintain the genetic diversity of seeds, cultivated plants and farmed and domesticated animals and their related wild species, including through soundly managed and diversified seed and plant banks at the national, regional and international levels, and promote access to and fair and equitable sharing of benefits arising from the utilization of genetic resources and associated traditional knowledge, as internationally agreed Increase investment, including through enhanced international cooperation, in rural infrastructure, agricultural research and extension services, technology development and plant and livestock gene banks in order to enhance agricultural productive capacity in developing countries, in particular least 
developed countries Correct and prevent trade restrictions and distortions in world agricultural markets, including through the parallel elimination of all forms of agricultural export subsidies and all export measures with equivalent effect, in accordance with the mandate of the Doha Development Round Adopt measures to ensure the proper functioning of food commodity markets and their derivatives and facilitate timely access to market information, including on food reserves, in order to help limit extreme food price volatility "
[3] "Ensure healthy lives and promote well-being for all at all ages By 2030, reduce the global maternal mortality ratio to less than 70 per 100,000 live births By 2030, end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce neonatal mortality to at least as low as 12 per 1,000 live births and under-5 mortality to at least as low as 25 per 1,000 live births By 2030, end the epidemics of AIDS, tuberculosis, malaria and neglected tropical diseases and combat hepatitis, water-borne diseases and other communicable diseases By 2030, reduce by one third premature mortality from non-communicable diseases through prevention and treatment and promote mental health and well-being Strengthen the prevention and treatment of substance abuse, including narcotic drug abuse and harmful use of alcohol By 2020, halve the number of global deaths and injuries from road traffic accidents By 2030, ensure universal access to sexual and reproductive health-care services, including for family planning, information and education, and the integration of reproductive health into national strategies and programmes Achieve universal health coverage, including financial risk protection, access to quality essential health-care services and access to safe, effective, quality and affordable essential medicines and vaccines for all By 2030, substantially reduce the number of deaths and illnesses from hazardous chemicals and air, water and soil pollution and contamination Strengthen the implementation of the World Health Organization Framework Convention on Tobacco Control in all countries, as appropriate Support the research and development of vaccines and medicines for the communicable and non-communicable diseases that primarily affect developing countries, provide access to affordable essential medicines and vaccines, in accordance with the Doha Declaration on the TRIPS Agreement and Public Health, which affirms the right of developing countries 
to use to the full the provisions in the Agreement on Trade-Related Aspects of Intellectual Property Rights regarding flexibilities to protect public health, and, in particular, provide access to medicines for all Substantially increase health financing and the recruitment, development, training and retention of the health workforce in developing countries, especially in least developed countries and small island developing States Strengthen the capacity of all countries, in particular developing countries, for early warning, risk reduction and management of national and global health risks "
[4] "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all By 2030, ensure that all girls and boys complete free, equitable and quality primary and secondary education leading to relevant and effective learning outcomes By 2030, ensure that all girls and boys have access to quality early childhood development, care and pre-primary education so that they are ready for primary education By 2030, ensure equal access for all women and men to affordable and quality technical, vocational and tertiary education, including university By 2030, substantially increase the number of youth and adults who have relevant skills, including technical and vocational skills, for employment, decent jobs and entrepreneurship By 2030, eliminate gender disparities in education and ensure equal access to all levels of education and vocational training for the vulnerable, including persons with disabilities, indigenous peoples and children in vulnerable situations By 2030, ensure that all youth and a substantial proportion of adults, both men and women, achieve literacy and numeracy By 2030, ensure that all learners acquire the knowledge and skills needed to promote sustainable development, including, among others, through education for sustainable development and sustainable lifestyles, human rights, gender equality, promotion of a culture of peace and non-violence, global citizenship and appreciation of cultural diversity and of culture’s contribution to sustainable development Build and upgrade education facilities that are child, disability and gender sensitive and provide safe, non-violent, inclusive and effective learning environments for all By 2020, substantially expand globally the number of scholarships available to developing countries, in particular least developed countries, small island developing States and African countries, for enrolment in higher education, including vocational training and information and communications 
technology, technical, engineering and scientific programmes, in developed countries and other developing countries By 2030, substantially increase the supply of qualified teachers, including through international cooperation for teacher training in developing countries, especially least developed countries and small island developing States "
[5] "Achieve gender equality and empower all women and girls End all forms of discrimination against all women and girls everywhere Eliminate all forms of violence against all women and girls in the public and private spheres, including trafficking and sexual and other types of exploitation Eliminate all harmful practices, such as child, early and forced marriage and female genital mutilation Recognize and value unpaid care and domestic work through the provision of public services, infrastructure and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life Ensure universal access to sexual and reproductive health and reproductive rights as agreed in accordance with the Programme of Action of the International Conference on Population and Development and the Beijing Platform for Action and the outcome documents of their review conferences Undertake reforms to give women equal rights to economic resources, as well as access to ownership and control over land and other forms of property, financial services, inheritance and natural resources, in accordance with national laws Enhance the use of enabling technology, in particular information and communications technology, to promote the empowerment of women Adopt and strengthen sound policies and enforceable legislation for the promotion of gender equality and the empowerment of all women and girls at all levels "
[6] "Ensure availability and sustainable management of water and sanitation for all By 2030, achieve universal and equitable access to safe and affordable drinking water for all By 2030, achieve access to adequate and equitable sanitation and hygiene for all and end open defecation, paying special attention to the needs of women and girls and those in vulnerable situations By 2030, improve water quality by reducing pollution, eliminating dumping and minimizing release of hazardous chemicals and materials, halving the proportion of untreated wastewater and substantially increasing recycling and safe reuse globally By 2030, substantially increase water-use efficiency across all sectors and ensure sustainable withdrawals and supply of freshwater to address water scarcity and substantially reduce the number of people suffering from water scarcity By 2030, implement integrated water resources management at all levels, including through transboundary cooperation as appropriate By 2020, protect and restore water-related ecosystems, including mountains, forests, wetlands, rivers, aquifers and lakes By 2030, expand international cooperation and capacity-building support to developing countries in water- and sanitation-related activities and programmes, including water harvesting, desalination, water efficiency, wastewater treatment, recycling and reuse technologies Support and strengthen the participation of local communities in improving water and sanitation management "
[7] "Ensure access to affordable, reliable, sustainable and modern energy for all By 2030, ensure universal access to affordable, reliable and modern energy services By 2030, increase substantially the share of renewable energy in the global energy mix By 2030, double the global rate of improvement in energy efficiency By 2030, enhance international cooperation to facilitate access to clean energy research and technology, including renewable energy, energy efficiency and advanced and cleaner fossil-fuel technology, and promote investment in energy infrastructure and clean energy technology By 2030, expand infrastructure and upgrade technology for supplying modern and sustainable energy services for all in developing countries, in particular least developed countries, small island developing States, and land-locked developing countries, in accordance with their respective programmes of support "
[8] "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all Sustain per capita economic growth in accordance with national circumstances and, in particular, at least 7 per cent gross domestic product growth per annum in the least developed countries Achieve higher levels of economic productivity through diversification, technological upgrading and innovation, including through a focus on high-value added and labour-intensive sectors Promote development-oriented policies that support productive activities, decent job creation, entrepreneurship, creativity and innovation, and encourage the formalization and growth of micro-, small- and medium-sized enterprises, including through access to financial services Improve progressively, through 2030, global resource efficiency in consumption and production and endeavour to decouple economic growth from environmental degradation, in accordance with the 10-year framework of programmes on sustainable consumption and production, with developed countries taking the lead By 2030, achieve full and productive employment and decent work for all women and men, including for young people and persons with disabilities, and equal pay for work of equal value By 2020, substantially reduce the proportion of youth not in employment, education or training Take immediate and effective measures to eradicate forced labour, end modern slavery and human trafficking and secure the prohibition and elimination of the worst forms of child labour, including recruitment and use of child soldiers, and by 2025 end child labour in all its forms Protect labour rights and promote safe and secure working environments for all workers, including migrant workers, in particular women migrants, and those in precarious employment By 2030, devise and implement policies to promote sustainable tourism that creates jobs and promotes local culture and products Strengthen the capacity of domestic financial 
institutions to encourage and expand access to banking, insurance and financial services for all Increase Aid for Trade support for developing countries, in particular least developed countries, including through the Enhanced Integrated Framework for Trade-Related Technical Assistance to Least Developed Countries By 2020, develop and operationalize a global strategy for youth employment and implement the Global Jobs Pact of the International Labour Organization "
[9] "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation Develop quality, reliable, sustainable and resilient infrastructure, including regional and transborder infrastructure, to support economic development and human well-being, with a focus on affordable and equitable access for all Promote inclusive and sustainable industrialization and, by 2030, significantly raise industry’s share of employment and gross domestic product, in line with national circumstances, and double its share in least developed countries Increase the access of small-scale industrial and other enterprises, in particular in developing countries, to financial services, including affordable credit, and their integration into value chains and markets By 2030, upgrade infrastructure and retrofit industries to make them sustainable, with increased resource-use efficiency and greater adoption of clean and environmentally sound technologies and industrial processes, with all countries taking action in accordance with their respective capabilities Enhance scientific research, upgrade the technological capabilities of industrial sectors in all countries, in particular developing countries, including, by 2030, encouraging innovation and substantially increasing the number of research and development workers per 1 million people and public and private research and development spending Facilitate sustainable and resilient infrastructure development in developing countries through enhanced financial, technological and technical support to African countries, least developed countries, landlocked developing countries and small island developing States Support domestic technology development, research and innovation in developing countries, including by ensuring a conducive policy environment for, inter alia, industrial diversification and value addition to commodities Significantly increase access to information and communications technology and strive 
to provide universal and affordable access to the Internet in least developed countries by 2020"
[10] "Reduce inequality within and among countries By 2030, progressively achieve and sustain income growth of the bottom 40 per cent of the population at a rate higher than the national average By 2030, empower and promote the social, economic and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion or economic or other status Ensure equal opportunity and reduce inequalities of outcome, including by eliminating discriminatory laws, policies and practices and promoting appropriate legislation, policies and action in this regard Adopt policies, especially fiscal, wage and social protection policies, and progressively achieve greater equality Improve the regulation and monitoring of global financial markets and institutions and strengthen the implementation of such regulations Ensure enhanced representation and voice for developing countries in decision-making in global international economic and financial institutions in order to deliver more effective, credible, accountable and legitimate institutions Facilitate orderly, safe, regular and responsible migration and mobility of people, including through the implementation of planned and well-managed migration policies Implement the principle of special and differential treatment for developing countries, in particular least developed countries, in accordance with World Trade Organization agreements Encourage official development assistance and financial flows, including foreign direct investment, to States where the need is greatest, in particular least developed countries, African countries, small island developing States and landlocked developing countries, in accordance with their national plans and programmes By 2030, reduce to less than 3 per cent the transaction costs of migrant remittances and eliminate remittance corridors with costs higher than 5 per cent "
[11] "Make cities and human settlements inclusive, safe, resilient and sustainable By 2030, ensure access for all to adequate, safe and affordable housing and basic services and upgrade slums By 2030, provide access to safe, affordable, accessible and sustainable transport systems for all, improving road safety, notably by expanding public transport, with special attention to the needs of those in vulnerable situations, women, children, persons with disabilities and older persons By 2030, enhance inclusive and sustainable urbanization and capacity for participatory, integrated and sustainable human settlement planning and management in all countries Strengthen efforts to protect and safeguard the world’s cultural and natural heritage By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the direct economic losses relative to global gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations By 2030, reduce the adverse per capita environmental impact of cities, including by paying special attention to air quality and municipal and other waste management By 2030, provide universal access to safe, inclusive and accessible, green and public spaces, in particular for women and children, older persons and persons with disabilities Support positive economic, social and environmental links between urban, per-urban and rural areas by strengthening national and regional development planning By 2020, substantially increase the number of cities and human settlements adopting and implementing integrated policies and plans towards inclusion, resource efficiency, mitigation and adaptation to climate change, resilience to disasters, and develop and implement, in line with the Sendai Framework for Disaster Risk Reduction 2015-2030, holistic disaster risk management at all levels Support least developed countries, including through 
financial and technical assistance, in building sustainable and resilient buildings utilizing local materials "
[12] "Ensure sustainable consumption and production patterns Implement the 10-year framework of programmes on sustainable consumption and production, all countries taking action, with developed countries taking the lead, taking into account the development and capabilities of developing countries By 2030, achieve the sustainable management and efficient use of natural resources By 2030, halve per capita global food waste at the retail and consumer levels and reduce food losses along production and supply chains, including post-harvest losses By 2020, achieve the environmentally sound management of chemicals and all wastes throughout their life cycle, in accordance with agreed international frameworks, and significantly reduce their release to air, water and soil in order to minimize their adverse impacts on human health and the environment By 2030, substantially reduce waste generation through prevention, reduction, recycling and reuse Encourage companies, especially large and transnational companies, to adopt sustainable practices and to integrate sustainability information into their reporting cycle Promote public procurement practices that are sustainable, in accordance with national policies and priorities By 2030, ensure that people everywhere have the relevant information and awareness for sustainable development and lifestyles in harmony with nature Support developing countries to strengthen their scientific and technological capacity to move towards more sustainable patterns of consumption and production Develop and implement tools to monitor sustainable development impacts for sustainable tourism that creates jobs and promotes local culture and products Rationalize inefficient fossil-fuel subsidies that encourage wasteful consumption by removing market distortions, in accordance with national circumstances, including by restructuring taxation and phasing out those harmful subsidies, where they exist, to reflect their environmental impacts, taking fully 
into account the specific needs and conditions of developing countries and minimizing the possible adverse impacts on their development in a manner that protects the poor and the affected communities"
[13] "Take urgent action to combat climate change and its impacts Strengthen resilience and adaptive capacity to climate-related hazards and natural disasters in all countries Integrate climate change measures into national policies, strategies and planning Improve education, awareness-raising and human and institutional capacity on climate change mitigation, adaptation, impact reduction and early warning Implement the commitment undertaken by developed-country parties to the United Nations Framework Convention on Climate Change to a goal of mobilizing jointly $100 billion annually by 2020 from all sources to address the needs of developing countries in the context of meaningful mitigation actions and transparency on implementation and fully operationalize the Green Climate Fund through its capitalization as soon as possible Promote mechanisms for raising capacity for effective climate change-related planning and management in least developed countries and small island developing States, including focusing on women, youth and local and marginalized communities"
[14] "Conserve and sustainably use the oceans, seas and marine resources for sustainable development By 2025, prevent and significantly reduce marine pollution of all kinds, in particular from land-based activities, including marine debris and nutrient pollution By 2020, sustainably manage and protect marine and coastal ecosystems to avoid significant adverse impacts, including by strengthening their resilience, and take action for their restoration in order to achieve healthy and productive oceans Minimize and address the impacts of ocean acidification, including through enhanced scientific cooperation at all levels By 2020, effectively regulate harvesting and end overfishing, illegal, unreported and unregulated fishing and destructive fishing practices and implement science-based management plans, in order to restore fish stocks in the shortest time feasible, at least to levels that can produce maximum sustainable yield as determined by their biological characteristics By 2020, conserve at least 10 per cent of coastal and marine areas, consistent with national and international law and based on the best available scientific information By 2020, prohibit certain forms of fisheries subsidies which contribute to overcapacity and overfishing, eliminate subsidies that contribute to illegal, unreported and unregulated fishing and refrain from introducing new such subsidies, recognizing that appropriate and effective special and differential treatment for developing and least developed countries should be an integral part of the World Trade Organization fisheries subsidies negotiation By 2030, increase the economic benefits to Small Island developing States and least developed countries from the sustainable use of marine resources, including through sustainable management of fisheries, aquaculture and tourism Increase scientific knowledge, develop research capacity and transfer marine technology, taking into account the Intergovernmental Oceanographic Commission 
Criteria and Guidelines on the Transfer of Marine Technology, in order to improve ocean health and to enhance the contribution of marine biodiversity to the development of developing countries, in particular small island developing States and least developed countries Provide access for small-scale artisanal fishers to marine resources and markets Enhance the conservation and sustainable use of oceans and their resources by implementing international law as reflected in UNCLOS, which provides the legal framework for the conservation and sustainable use of oceans and their resources, as recalled in paragraph 158 of The Future We Want "
[15] "Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss By 2020, ensure the conservation, restoration and sustainable use of terrestrial and inland freshwater ecosystems and their services, in particular forests, wetlands, mountains and drylands, in line with obligations under international agreements By 2020, promote the implementation of sustainable management of all types of forests, halt deforestation, restore degraded forests and substantially increase afforestation and reforestation globally By 2030, combat desertification, restore degraded land and soil, including land affected by desertification, drought and floods, and strive to achieve a land degradation-neutral world By 2030, ensure the conservation of mountain ecosystems, including their biodiversity, in order to enhance their capacity to provide benefits that are essential for sustainable development Take urgent and significant action to reduce the degradation of natural habitats, halt the loss of biodiversity and, by 2020, protect and prevent the extinction of threatened species Promote fair and equitable sharing of the benefits arising from the utilization of genetic resources and promote appropriate access to such resources, as internationally agreed Take urgent action to end poaching and trafficking of protected species of flora and fauna and address both demand and supply of illegal wildlife products By 2020, introduce measures to prevent the introduction and significantly reduce the impact of invasive alien species on land and water ecosystems and control or eradicate the priority species By 2020, integrate ecosystem and biodiversity values into national and local planning, development processes, poverty reduction strategies and accounts Mobilize and significantly increase financial resources from all sources to conserve and sustainably use biodiversity and 
ecosystems Mobilize significant resources from all sources and at all levels to finance sustainable forest management and provide adequate incentives to developing countries to advance such management, including for conservation and reforestation Enhance global support for efforts to combat poaching and trafficking of protected species, including by increasing the capacity of local communities to pursue sustainable livelihood opportunities "
[16] "Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels Significantly reduce all forms of violence and related death rates everywhere End abuse, exploitation, trafficking and all forms of violence against and torture of children Promote the rule of law at the national and international levels and ensure equal access to justice for all By 2030, significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organized crime Substantially reduce corruption and bribery in all their forms Develop effective, accountable and transparent institutions at all levels Ensure responsive, inclusive, participatory and representative decision-making at all levels Broaden and strengthen the participation of developing countries in the institutions of global governance By 2030, provide legal identity for all, including birth registration Ensure public access to information and protect fundamental freedoms, in accordance with national legislation and international agreements Strengthen relevant national institutions, including through international cooperation, for building capacity at all levels, in particular in developing countries, to prevent violence and combat terrorism and crime Promote and enforce non-discriminatory laws and policies for sustainable development"
[17] "Strengthen the means of implementation and revitalize the global partnership for sustainable development Strengthen domestic resource mobilization, including through international support to developing countries, to improve domestic capacity for tax and other revenue collection Developed countries to implement fully their official development assistance commitments, including the commitment by many developed countries to achieve the target of 0.7 per cent of ODA/GNI to developing countries and 0.15 to 0.20 per cent of ODA/GNI to least developed countries; ODA providers are encouraged to consider setting a target to provide at least 0.20 per cent of ODA/GNI to least developed countries Mobilize additional financial resources for developing countries from multiple sources Assist developing countries in attaining long-term debt sustainability through coordinated policies aimed at fostering debt financing, debt relief and debt restructuring, as appropriate, and address the external debt of highly indebted poor countries to reduce debt distress Adopt and implement investment promotion regimes for least developed countries Enhance North-South, South-South and triangular regional and international cooperation on and access to science, technology and innovation and enhance knowledge sharing on mutually agreed terms, including through improved coordination among existing mechanisms, in particular at the United Nations level, and through a global technology facilitation mechanism Promote the development, transfer, dissemination and diffusion of environmentally sound technologies to developing countries on favourable terms, including on concessional and preferential terms, as mutually agreed Fully operationalize the technology bank and science, technology and innovation capacity-building mechanism for least developed countries by 2017 and enhance the use of enabling technology, in particular information and communications technology Enhance international support for 
implementing effective and targeted capacity-building in developing countries to support national plans to implement all the sustainable development goals, including through North-South, South-South and triangular cooperation Promote a universal, rules-based, open, non-discriminatory and equitable multilateral trading system under the World Trade Organization, including through the conclusion of negotiations under its Doha Development Agenda Significantly increase the exports of developing countries, in particular with a view to doubling the least developed countries’ share of global exports by 2020 Realize timely implementation of duty-free and quota-free market access on a lasting basis for all least developed countries, consistent with World Trade Organization decisions, including by ensuring that preferential rules of origin applicable to imports from least developed countries are transparent and simple, and contribute to facilitating market access Enhance global macroeconomic stability, including through policy coordination and policy coherence Enhance policy coherence for sustainable development Respect each country’s policy space and leadership to establish and implement policies for poverty eradication and sustainable development Enhance the global partnership for sustainable development, complemented by multi-stakeholder partnerships that mobilize and share knowledge, expertise, technology and financial resources, to support the achievement of the sustainable development goals in all countries, in particular developing countries Encourage and promote effective public, public-private and civil society partnerships, building on the experience and resourcing strategies of partnerships By 2020, enhance capacity-building support to developing countries, including for least developed countries and small island developing States, to increase significantly the availability of high-quality, timely and reliable data disaggregated by income, gender, age, race, 
ethnicity, migratory status, disability, geographic location and other characteristics relevant in national contexts By 2030, build on existing initiatives to develop measurements of progress on sustainable development that complement gross domestic product, and support statistical capacity-building in developing countries "
Which features of text would be most helpful for the following research questions?
Predicting whether the author of a text message was young or old
Measuring the financial content of news coverage
Assessing the complexity of a piece of writing
Implication: feature selection will depend on your research question.
Document-Feature Matrix (DFM)
A document-feature matrix is a common way of representing text data in quantitative form.
The rows of the matrix indicate the documents.
The columns of the matrix indicate the features (words, etc).
DFMs are parsimonious representations which discard information. But they are helpful!
In order to construct a dfm, we need to make decisions about both documents and features.
Document
Basic unit (text) of analysis
Corpus
A structured set of documents for analysis
Type
A unique feature in the corpus e.g. a word (“flies”), a punctuation mark, a part-of-speech
Token
An instance of a type in a document e.g. the occurrence of the word in a given document
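The type/token distinction can be checked directly in quanteda. This is a minimal sketch on a made-up one-line text (the text is an illustrative assumption, not course data); `ntoken()` counts tokens and `ntype()` counts unique types.

```r
library(quanteda)

# Hypothetical toy text: "flies" and "like" each occur twice
toy_tokens <- tokens("time flies like an arrow fruit flies like a banana")

# Number of tokens (every occurrence counts)
n_tokens <- ntoken(toy_tokens)

# Number of types (each unique token counts once)
n_types <- ntype(toy_tokens)

n_tokens
n_types
```

Because two words are repeated, the text has fewer types than tokens.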
Selecting documents is an important, and often ignored, step in any QTA analysis.
Key questions:
Is it possible/feasible to collect a set of documents?
Is the corpus representative of the population of interest?
Is it ethical to examine documents of this sort at scale?
Implication: The selection of texts is consequential to the conclusions we can draw.
A “document” is the typical unit of analysis in QTA. But what is a document?
Key: Depends on the research question.
Words
N-grams
Language sequences
Word segments, especially for languages using compound words, e.g.
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
The simplest possible way of characterising a corpus is by counting words
For each text, we record how many times each unique word appears
We ignore everything else.
The words in a document convey meaning
Word order does not matter
Word combinations do not matter (e.g. negation)
Grammar does not matter
Words are the only relevant features (not punctuation, not syllables, etc)
The importance of these assumptions depends on the application.
| | time | flies | fruit | like | an | a | banana | arrow |
|---|---|---|---|---|---|---|---|---|
| Sentence 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| Sentence 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
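The two-sentence table can be reproduced as a small dfm. This is a sketch that assumes the two sentences are "Time flies like an arrow" and "Fruit flies like a banana", which is what the counts in the table imply; the object names are illustrative.

```r
library(quanteda)

# Assumed source sentences for the table (inferred from the counts)
toy_sentences <- c(
  "Sentence 1" = "Time flies like an arrow",
  "Sentence 2" = "Fruit flies like a banana"
)

# Tokenize, then count each word per sentence (dfm() lowercases by default)
toy_dfm <- dfm(tokens(toy_sentences))

# Convert to an ordinary matrix to inspect the counts
toy_matrix <- as.matrix(toy_dfm)
toy_matrix
```

The column order may differ from the table, since features are ordered by first appearance, but the counts are the same.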
# Load the quanteda library
library(quanteda)
# Convert the sdg data.frame into a corpus
sdg_corpus <- corpus(sdg, text_field = "long_description")
# Take the corpus
sdg_dfm <- sdg_corpus %>%
# Tokenize (split) the corpus into individual words
tokens() %>%
# Construct a document-feature matrix
dfm()
# Print the dfm
sdg_dfm
Document-feature matrix of: 17 documents, 1,085 features (86.41% sparse) and 2 docvars.
features
docs end poverty in all its forms everywhere by 2030 ,
text1 2 5 9 7 3 2 2 6 5 24
text2 3 0 11 5 0 2 0 7 4 43
text3 2 0 7 7 0 0 0 8 6 33
text4 0 0 6 8 0 0 0 9 8 39
text5 1 0 5 9 0 3 1 0 0 11
text6 1 0 3 5 0 0 0 8 6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 1,075 more features ]
Wait, what is this %>% thing?
How many features are there in this dfm?
[1] 1085
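Both questions can be answered with a short sketch. The pipe `%>%` passes the object on its left as the first argument of the function on its right, so a piped chain is equivalent to nested function calls, and `nfeat()` returns the number of features in a dfm. The toy texts are illustrative assumptions, and the sketch loads magrittr explicitly on the assumption that it is installed.

```r
library(quanteda)
library(magrittr)  # provides the %>% pipe

# Hypothetical toy texts, not the course data
toy_texts <- c(d1 = "End poverty in all its forms", d2 = "End hunger everywhere")

# Piped version: read left to right
toy_dfm_piped <- toy_texts %>%
  tokens() %>%
  dfm()

# Equivalent nested version: read inside out
toy_dfm_nested <- dfm(tokens(toy_texts))

# Both approaches give the same matrix of counts
same_result <- identical(as.matrix(toy_dfm_piped), as.matrix(toy_dfm_nested))

# Number of features (columns) in the dfm
n_features <- nfeat(toy_dfm_piped)
n_features
```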
N-grams
Contiguous sequence of words from document (1-gram, unigram; 2-gram, bigram)
Document-feature matrix of: 17 documents, 4,337 features (90.46% sparse) and 2 docvars.
features
docs end poverty in all its forms everywhere by 2030 ,
text1 2 5 9 7 3 2 2 6 5 24
text2 3 0 11 5 0 2 0 7 4 43
text3 2 0 7 7 0 0 0 8 6 33
text4 0 0 6 8 0 0 0 9 8 39
text5 1 0 5 9 0 3 1 0 0 11
text6 1 0 3 5 0 0 0 8 6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 4,327 more features ]
Document-feature matrix of: 17 documents, 8,685 features (91.86% sparse) and 2 docvars.
features
docs end poverty in all its forms everywhere by 2030 ,
text1 2 5 9 7 3 2 2 6 5 24
text2 3 0 11 5 0 2 0 7 4 43
text3 2 0 7 7 0 0 0 8 6 33
text4 0 0 6 8 0 0 0 9 8 39
text5 1 0 5 9 0 3 1 0 0 11
text6 1 0 3 5 0 0 0 8 6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 8,675 more features ]
How many features are there in these dfms?
This can lead to a lot of features!
For this example (very small) corpus:
17 documents
1085 unique words
4337 unique 1-gram and 2-gram sequences
8685 unique 1-gram, 2-gram and 3-gram sequences
The resulting dfms are also very sparse – they contain a high fraction of zeros because most n-grams do not appear in most documents
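The growth in features can be seen on a much smaller example. This is a sketch using `tokens_ngrams()` on two made-up texts (an assumption for illustration, not the code that produced the output above); `sparsity()` reports the share of zero cells.

```r
library(quanteda)

# Hypothetical toy texts
toy_tokens <- tokens(c(
  d1 = "end poverty in all its forms",
  d2 = "end hunger in all countries"
))

# Unigrams only
dfm_1 <- dfm(toy_tokens)

# Unigrams and bigrams
dfm_12 <- dfm(tokens_ngrams(toy_tokens, n = 1:2))

# Unigrams, bigrams and trigrams
dfm_123 <- dfm(tokens_ngrams(toy_tokens, n = 1:3))

# Feature counts grow quickly as longer n-grams are added
feature_counts <- c(nfeat(dfm_1), nfeat(dfm_12), nfeat(dfm_123))
feature_counts

# Share of zero cells in each matrix
sparsity(dfm_1)
sparsity(dfm_123)
```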
Reduce complexity: lowercase (automatic in quanteda), remove punctuation (not automatic in quanteda)
Deliberate disregard
Word stemming/lemmatization
Filter by frequency
Purposive selection
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very" "will"
But no list should be considered universal…
[1] "a" "a's" "able" "about"
[5] "above" "according" "accordingly" "across"
[9] "actually" "after" "afterwards" "again"
[13] "against" "ain't" "all" "allow"
[17] "allows" "almost" "alone" "along"
[21] "already" "also" "although" "always"
[25] "am" "among" "amongst" "an"
[29] "and" "another" "any" "anybody"
[33] "anyhow" "anyone" "anything" "anyway"
[37] "anyways" "anywhere" "apart" "appear"
[41] "appreciate" "appropriate" "are" "aren't"
[45] "around" "as" "aside" "ask"
[49] "asking" "associated" "at" "available"
[53] "away" "awfully" "b" "be"
[57] "became" "because" "become" "becomes"
[61] "becoming" "been" "before" "beforehand"
[65] "behind" "being" "believe" "below"
[69] "beside" "besides" "best" "better"
[73] "between" "beyond" "both" "brief"
[77] "but" "by" "c" "c'mon"
[81] "c's" "came" "can" "can't"
[85] "cannot" "cant" "cause" "causes"
[89] "certain" "certainly" "changes" "clearly"
[93] "co" "com" "come" "comes"
[97] "concerning" "consequently" "consider" "considering"
[101] "contain" "containing" "contains" "corresponding"
[105] "could" "couldn't" "course" "currently"
[109] "d" "definitely" "described" "despite"
[113] "did" "didn't" "different" "do"
[117] "does" "doesn't" "doing" "don't"
[121] "done" "down" "downwards" "during"
[125] "e" "each" "edu" "eg"
[129] "eight" "either" "else" "elsewhere"
[133] "enough" "entirely" "especially" "et"
[137] "etc" "even" "ever" "every"
[141] "everybody" "everyone" "everything" "everywhere"
[145] "ex" "exactly" "example" "except"
[149] "f" "far" "few" "fifth"
[153] "first" "five" "followed" "following"
[157] "follows" "for" "former" "formerly"
[161] "forth" "four" "from" "further"
[165] "furthermore" "g" "get" "gets"
[169] "getting" "given" "gives" "go"
[173] "goes" "going" "gone" "got"
[177] "gotten" "greetings" "h" "had"
[181] "hadn't" "happens" "hardly" "has"
[185] "hasn't" "have" "haven't" "having"
[189] "he" "he's" "hello" "help"
[193] "hence" "her" "here" "here's"
[197] "hereafter" "hereby" "herein" "hereupon"
[201] "hers" "herself" "hi" "him"
[205] "himself" "his" "hither" "hopefully"
[209] "how" "howbeit" "however" "i"
[213] "i'd" "i'll" "i'm" "i've"
[217] "ie" "if" "ignored" "immediate"
[221] "in" "inasmuch" "inc" "indeed"
[225] "indicate" "indicated" "indicates" "inner"
[229] "insofar" "instead" "into" "inward"
[233] "is" "isn't" "it" "it'd"
[237] "it'll" "it's" "its" "itself"
[241] "j" "just" "k" "keep"
[245] "keeps" "kept" "know" "knows"
[249] "known" "l" "last" "lately"
[253] "later" "latter" "latterly" "least"
[257] "less" "lest" "let" "let's"
[261] "like" "liked" "likely" "little"
[265] "look" "looking" "looks" "ltd"
[269] "m" "mainly" "many" "may"
[273] "maybe" "me" "mean" "meanwhile"
[277] "merely" "might" "more" "moreover"
[281] "most" "mostly" "much" "must"
[285] "my" "myself" "n" "name"
[289] "namely" "nd" "near" "nearly"
[293] "necessary" "need" "needs" "neither"
[297] "never" "nevertheless" "new" "next"
[301] "nine" "no" "nobody" "non"
[305] "none" "noone" "nor" "normally"
[309] "not" "nothing" "novel" "now"
[313] "nowhere" "o" "obviously" "of"
[317] "off" "often" "oh" "ok"
[321] "okay" "old" "on" "once"
[325] "one" "ones" "only" "onto"
[329] "or" "other" "others" "otherwise"
[333] "ought" "our" "ours" "ourselves"
[337] "out" "outside" "over" "overall"
[341] "own" "p" "particular" "particularly"
[345] "per" "perhaps" "placed" "please"
[349] "plus" "possible" "presumably" "probably"
[353] "provides" "q" "que" "quite"
[357] "qv" "r" "rather" "rd"
[361] "re" "really" "reasonably" "regarding"
[365] "regardless" "regards" "relatively" "respectively"
[369] "right" "s" "said" "same"
[373] "saw" "say" "saying" "says"
[377] "second" "secondly" "see" "seeing"
[381] "seem" "seemed" "seeming" "seems"
[385] "seen" "self" "selves" "sensible"
[389] "sent" "serious" "seriously" "seven"
[393] "several" "shall" "she" "should"
[397] "shouldn't" "since" "six" "so"
[401] "some" "somebody" "somehow" "someone"
[405] "something" "sometime" "sometimes" "somewhat"
[409] "somewhere" "soon" "sorry" "specified"
[413] "specify" "specifying" "still" "sub"
[417] "such" "sup" "sure" "t"
[421] "t's" "take" "taken" "tell"
[425] "tends" "th" "than" "thank"
[429] "thanks" "thanx" "that" "that's"
[433] "thats" "the" "their" "theirs"
[437] "them" "themselves" "then" "thence"
[441] "there" "there's" "thereafter" "thereby"
[445] "therefore" "therein" "theres" "thereupon"
[449] "these" "they" "they'd" "they'll"
[453] "they're" "they've" "think" "third"
[457] "this" "thorough" "thoroughly" "those"
[461] "though" "three" "through" "throughout"
[465] "thru" "thus" "to" "together"
[469] "too" "took" "toward" "towards"
[473] "tried" "tries" "truly" "try"
[477] "trying" "twice" "two" "u"
[481] "un" "under" "unfortunately" "unless"
[485] "unlikely" "until" "unto" "up"
[489] "upon" "us" "use" "used"
[493] "useful" "uses" "using" "usually"
[497] "uucp" "v" "value" "various"
[501] "very" "via" "viz" "vs"
[505] "w" "want" "wants" "was"
[509] "wasn't" "way" "we" "we'd"
[513] "we'll" "we're" "we've" "welcome"
[517] "well" "went" "were" "weren't"
[521] "what" "what's" "whatever" "when"
[525] "whence" "whenever" "where" "where's"
[529] "whereafter" "whereas" "whereby" "wherein"
[533] "whereupon" "wherever" "whether" "which"
[537] "while" "whither" "who" "who's"
[541] "whoever" "whole" "whom" "whose"
[545] "why" "will" "willing" "wish"
[549] "with" "within" "without" "won't"
[553] "wonder" "would" "would" "wouldn't"
[557] "x" "y" "yes" "yet"
[561] "you" "you'd" "you'll" "you're"
[565] "you've" "your" "yours" "yourself"
[569] "yourselves" "z" "zero"
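The two lists above look like the "snowball" and "smart" English lists available through `stopwords()`. The sketch below assumes that is the case, compares the two, and removes the default list from one of the SDG titles with `tokens_remove()`.

```r
library(quanteda)

# Two English stopword lists (sources assumed to match the lists printed above)
snowball_list <- stopwords("en")                    # default source
smart_list    <- stopwords("en", source = "smart")  # a longer alternative

# The lists disagree: "everywhere" is a stopword in only one of them
in_snowball <- "everywhere" %in% snowball_list
in_smart    <- "everywhere" %in% smart_list

# Remove punctuation and the default stopwords from a single text
toy_tokens <- tokens("End poverty in all its forms everywhere", remove_punct = TRUE)
toy_tokens_nostop <- tokens_remove(toy_tokens, pattern = snowball_list)

remaining <- as.list(toy_tokens_nostop)[[1]]
remaining
```

Swapping in `smart_list` would also drop "everywhere", which is one reason the choice of list should be checked against the research question.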
End poverty in all its forms everywhere
End hunger, achieve food security and improved nutrition and promote sustainable agriculture
Ensure healthy lives and promote well-being for all at all ages
Compare…
It was a nice party, Pablo had brought his ukulele.
To…
It was a nice party, but Pablo had brought his ukulele.
| | nice | party | Pablo | brought | ukulele |
|---|---|---|---|---|---|
| Sentence 1 | 1 | 1 | 1 | 1 | 1 |
| Sentence 2 | 1 | 1 | 1 | 1 | 1 |
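The same comparison can be run in quanteda. This sketch assumes the default English stopword list and punctuation removal; after both steps the two sentences have identical rows, even though "but" changes the meaning.

```r
library(quanteda)

party_texts <- c(
  "Sentence 1" = "It was a nice party, Pablo had brought his ukulele.",
  "Sentence 2" = "It was a nice party, but Pablo had brought his ukulele."
)

# Drop punctuation, then drop stopwords (which include "but")
party_tokens <- tokens(party_texts, remove_punct = TRUE)
party_tokens <- tokens_remove(party_tokens, pattern = stopwords("en"))

party_matrix <- as.matrix(dfm(party_tokens))
party_matrix

# The two rows can no longer be told apart
rows_identical <- all(party_matrix["Sentence 1", ] == party_matrix["Sentence 2", ])
rows_identical
```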
Document-feature matrix of: 17 documents, 1,034 features (87.62% sparse) and 2 docvars.
features
docs end poverty forms everywhere 2030 eradicate extreme people currently
text1 2 5 2 2 5 1 2 2 1
text2 3 0 2 0 4 0 2 2 0
text3 2 0 0 0 6 0 0 0 0
text4 0 0 0 0 8 0 0 0 0
text5 1 0 3 1 0 0 0 0 0
text6 1 0 0 0 6 0 0 1 0
features
docs measured
text1 1
text2 0
text3 0
text4 0
text5 0
text6 0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 1,024 more features ]
Stemming
Process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers operate on single words without knowledge of the context.
Example:
Production, producer, produce, produces, produced → produc
Lemmatization
Algorithmic process of converting words to their lemma forms.
Example:
am, are, is → be
Stemming is a crude heuristic process that chops off the ends of words. Lemmatization is smarter, but slower.
End poverty in all its forms everywhere
End hunger, achieve food security and improved nutrition and promote sustainable agriculture
Ensure healthy lives and promote well-being for all at all ages
End poverti in all it form everywher
End hunger , achiev food secur and improv nutrit and promot sustain agricultur
Ensure healthi live and promot well-b for all at all age
Document-feature matrix of: 17 documents, 872 features (84.37% sparse) and 2 docvars.
features
docs end poverti in all it form everywher by 2030 erad
text1 2 5 9 7 3 2 2 6 5 2
text2 3 0 11 5 0 2 0 7 4 0
text3 2 0 7 7 0 0 0 8 6 0
text4 0 0 6 8 0 0 0 9 8 0
text5 1 0 5 9 0 3 1 0 0 0
text6 1 0 3 5 0 0 0 8 6 0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 862 more features ]
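A stemmed dfm of this kind can be built with `tokens_wordstem()`. The sketch below applies it to one of the SDG titles and assumes the default English Snowball stemmer, so it is an illustration rather than the exact code behind the output above.

```r
library(quanteda)

toy_tokens <- tokens("End poverty in all its forms everywhere")

# Reduce each token to its stem
toy_stems <- tokens_wordstem(toy_tokens)
stems <- as.list(toy_stems)[[1]]
stems

# Stemmed features for the dfm
toy_dfm_stemmed <- dfm(toy_stems)
featnames(toy_dfm_stemmed)
```

Stems such as "poverti" are not real words, which is the cost of a stemmer that works without any knowledge of context.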
Very rare words and very frequent words are unlikely to be helpful in discriminating between documents.
Document-feature matrix of: 17 documents, 323 features (70.86% sparse) and 2 docvars.
features
docs end poverty in all its forms everywhere by 2030 eradicate
text1 2 5 9 7 3 2 2 6 5 1
text2 3 0 11 5 0 2 0 7 4 0
text3 2 0 7 7 0 0 0 8 6 0
text4 0 0 6 8 0 0 0 9 8 0
text5 1 0 5 9 0 3 1 0 0 0
text6 1 0 3 5 0 0 0 8 6 0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 313 more features ]
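Frequency filtering can be sketched with `dfm_trim()`. The toy texts and the threshold of two documents are illustrative assumptions; the settings used for the output above are not shown in the source.

```r
library(quanteda)

# Hypothetical toy texts
toy_dfm <- dfm(tokens(c(
  d1 = "end poverty in all its forms",
  d2 = "end hunger in all countries",
  d3 = "end war"
)))

# Keep only features that appear in at least two documents
toy_dfm_trimmed <- dfm_trim(toy_dfm, min_docfreq = 2)

kept_features <- featnames(toy_dfm_trimmed)
kept_features
```

`dfm_trim()` also accepts upper bounds such as `max_docfreq`, which is how very frequent words could be dropped.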
Feature selection matters! See Denny and Spirling, 2017
Just seven (binary) preprocessing decisions lead to a total of \(2^7 = 128\) possible feature matrices
These selection decisions can have substantive implications for the inferences we draw from QTA
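The count of 128 can be verified by enumerating every combination of seven yes/no choices. The step labels below are illustrative placeholders, not necessarily the seven steps studied by Denny and Spirling.

```r
# Seven hypothetical binary preprocessing decisions (labels are illustrative)
steps <- c("lowercase", "remove_punct", "remove_numbers", "remove_stopwords",
           "stem", "add_ngrams", "trim_rare")

# Every combination of FALSE/TRUE across the seven steps
choice_grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(steps)))
names(choice_grid) <- steps

# One row per possible preprocessing pipeline
n_pipelines <- nrow(choice_grid)
n_pipelines
```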
How should we select between these representations?
There is no single “best” dfm
The optimal representation of a corpus will depend on the particular research task
We need to design ways of validating the representations we construct
Are female politicians less aggressive than male politicians? (Hargrave and Blumenau, 2022)
A repeated claim in the qualitative literature on gender and politics is that male and female politicians have distinct styles. Many political observers argue that women are less aggressive in political debate than their male colleagues. Most of the evidence for these claims is taken from small-N classical content analysis studies. We will review this question by applying an existing sentiment dictionary to a large-N corpus of parliamentary texts.
How might we conceptualize “aggression” in the context of parliamentary debate?
Use of aggressive or combative language, which might include criticisms or insults; language that suggests forceful action; or declamatory or adversarial language.
Theoretical conceptualization
Empirical exploration/discovery
Key feature: use of “human” coders to implement a pre-defined coding scheme, by reading and coding texts
Human decision-making is the central feature of coding decisions, not a computer or other mechanized tool
Validity is usually the objective, rather than reliability
Example: hand-coding sentences into pre-defined categories
Dictionaries represent a hybrid procedure that bridges qualitative approaches and fully-automated text-as-data approaches
“Qualitative” since it involves identification of the concepts and associated keys/categories, and the textual features associated with each key/category
“Quantitative” because it involves applying an algorithm to large corpora and presenting statistical summaries of results
Rather than count all words that occur, pre-define words as associated with specific meanings
Two components:
A better metaphor is really a thesaurus: a canonical term or concept (the key) associated with equivalent synonyms (the values)
Key | Values |
---|---|
Dog | Dalmatian, Labrador, Poodle, Pug |
Computation | Data, Number, Computer, Simulation |
Genetics | Gene, DNA, Inherit |
A dictionary is just a list of words (\(m=1,...,M\)) that are related to a common concept.
Aggression |
---|
stupid |
dishonest |
liar |
idiot |
ignorant |
hate |
fight |
battle |
Applying a dictionary to a corpus of texts (\(i = 1,...,N\)) simply requires counting the number of times each word occurs in each text and summing them.
If \(W_{im}\) is the number of times word \(m\) appears in text \(i\) (0 if it does not appear) and \(N_i\) is the total number of words in text \(i\), then the dictionary score for document \(i\) is:
\[ t_i = \frac{\sum_{m=1}^M W_{im}}{N_i} \]
Or, the proportion of words in document \(i\) that appear in the dictionary.
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
\[ t_i = \frac{\sum_{m=1}^M W_{im}}{N_i} = \frac{1+1}{24} = 0.083 \]
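The worked example can be reproduced in a few lines of base R. This is a minimal sketch that assumes a simple tokenizer (split on anything that is not a letter or apostrophe) and uses the short illustrative aggression list from above.

```r
text <- "That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind."

# Short illustrative dictionary
aggression_list <- c("stupid", "dishonest", "liar", "idiot",
                     "ignorant", "hate", "fight", "battle")

# Simple tokenizer (an assumption): lowercase and split on non-letters
words <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
words <- words[words != ""]

# N_i: total number of words in the text
n_words <- length(words)

# Sum over m of W_im: number of words that appear in the dictionary
n_matches <- sum(words %in% aggression_list)

# Dictionary score t_i
score <- n_matches / n_words
round(score, 3)
```

Note that "barbaric" and "cruel" contribute nothing because they are not in this list.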
A slight development on this would be to assign each word in the dictionary a weight which reflects something about the importance of the word to the concept
Aggression | Weight |
---|---|
stupid | .6 |
dishonest | .2 |
lie | .5 |
idiot | .7 |
ignorant | .3 |
brutal | .4 |
violence | .5 |
We can adjust the previous formula to incorporate the weights (\(s_m\)):
\[ t_i = \frac{\sum_{m=1}^M s_mW_{im}}{N_i} \]
Why normalise by \(N_i\)?
Some texts will be longer than others and we do not want these texts to mechanically be assigned higher scores.
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
\[ t_i = \frac{\sum_{m=1}^M s_mW_{im}}{N_i} = \frac{(1\cdot0.6)+(1\cdot0.3)}{24} = 0.0375 \]
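The weighted score can be computed the same way. This sketch stores the weights from the table as a named vector and assumes the same simple tokenizer (split on anything that is not a letter or apostrophe).

```r
text <- "That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind."

# Weights s_m from the table, named by dictionary word
weights <- c(stupid = 0.6, dishonest = 0.2, lie = 0.5, idiot = 0.7,
             ignorant = 0.3, brutal = 0.4, violence = 0.5)

# Simple tokenizer (an assumption): lowercase and split on non-letters
words <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
words <- words[words != ""]

# Every occurrence of a dictionary word, so repeated words count repeatedly
matched_words <- words[words %in% names(weights)]

# Sum of s_m * W_im, divided by the text length N_i
weighted_score <- sum(weights[matched_words]) / length(words)
weighted_score
```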
Most applications of dictionary methods in social science use unweighted dictionaries.
Why learn this then?
Linguistic Inquiry and Word Count
Lexicoder Sentiment Dictionary
Loughran-McDonald Sentiment Dictionary
Martindale’s Regressive Imagery Dictionary
Many of these are directly available in quanteda. Some are available only for purchase.
Applying off-the-shelf dictionaries to new contexts can be problematic:
Problem 1: polysemes – words that have multiple meanings
Problem 2: Dictionaries often lack important words in a given context
Problem 3: Some dictionaries might do more to pick up the topic of a document than the tone of a document
Applying dictionaries outside the domain for which they were developed can lead to serious errors (Grimmer and Stewart, 2013, 268)
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
“Terrible acts of brutality and violence have been carried out against the Rohingya people.”
Dictionaries may miss words that are important to the concept
Dictionaries do not typically capture modifiers
Dictionaries often fail to capture all synonyms
Dictionaries may not capture the relevant concept
aggression_texts is a data.frame which includes 10937 sentences from parliamentary speeches
aggression_words is a vector of 222 words from an existing “Aggression” dictionary
Our goal is to use aggression_words to score the texts in aggression_texts.
[1] "abhor*" "abus*" "abusiv*" "accus*"
[5] "afflict*" "aggress*" "aggressiv*" "ambush*"
[9] "anger*" "angri*" "angrier*" "angry*"
[13] "annihilat*" "annoy*" "annoyanc*" "antagoniz*"
[17] "argu*" "argument*" "army*" "arrow*"
[21] "assault*" "attack*" "aveng*" "ax"
[25] "axe" "axes" "battl*" "beak*"
[29] "beat*" "beaten*" "betray*" "blade*"
[33] "blam*" "bloody*" "bother*" "brawl*"
[37] "break*" "brok*" "broken*" "brutal*"
[41] "cannon*" "chid*" "combat*" "complain*"
[45] "conflict*" "condemn*" "controversy*" "critic*"
[49] "cruel*" "crush*" "cut" "cuts"
[53] "cutt*" "damag*" "decei*" "defeat*"
[57] "degrad*" "demolish*" "depriv*" "derid*"
[61] "despis*" "destroy*" "destruct*" "destructiv*"
[65] "detest*" "disagre*" "disagreement*" "disapprov*"
[69] "discontent*" "dislik*" "disput*" "disturb*"
[73] "doubt*" "enemi*" "enemy*" "enrag*"
[77] "exasperat*" "controversial*" "critique" "disparag*"
[81] "irritable" "exploit*" "exterminat*" "feud*"
[85] "fierc*" "fight*" "fought*" "furiou*"
[89] "fury*" "gash*" "grappl*" "growl*"
[93] "grudg*" "gun" "gunn*" "guns"
[97] "harm*" "harsh*" "hate*" "hatr*"
[101] "hit" "hits" "hitt*" "homicid*"
[105] "hostil*" "hurt*" "ingrat*" "injur*"
[109] "injury*" "insult*" "invad*" "invas*"
[113] "irat*" "irk*" "irritat*" "jealou*"
[117] "jealousy*" "jeer*" "kick*" "kil*"
[121] "kill*" "knif*" "kniv*" "loath*"
[125] "maim*" "mistreat*" "mock*" "murder*"
[129] "obliterat*" "offend*" "oppos*" "predatory*"
[133] "protest*" "quarrel*" "rage" "rages"
[137] "raging" "rapin*" "rebel*" "rebell*"
[141] "rebuk*" "relentles*" "reproach*" "resent*"
[145] "resentment*" "retribut*" "reveng*" "revolt*"
[149] "ridicul*" "rip" "ripp*" "rips"
[153] "rob" "robb*" "robs" "sarcasm*"
[157] "sarcastic*" "scalp*" "scof*" "scoff*"
[161] "scourg*" "seiz*" "sever*" "severity*"
[165] "shatter*" "shoot*" "shot*" "shov*"
[169] "slain*" "slander*" "slap*" "slaughter*"
[173] "slay*" "slew*" "smash*" "snarl*"
[177] "sneer*" "spear*" "spiteful*" "spurn*"
[181] "stab*" "steal*" "stol*" "stolen*"
[185] "strangl*" "strif*" "strik*" "struck*"
[189] "struggl*" "stubborn*" "sword*" "taunt*"
[193] "temper*" "threat*" "threaten*" "tore"
[197] "torment*" "torn*" "tortur*" "traitor*"
[201] "trampl*" "treacherou*" "treachery*" "tyrant*"
[205] "unkind*" "vengeanc*" "vengeful*" "vex"
[209] "vexing" "violat*" "violenc*" "violent*"
[213] "war" "warring" "warrior*" "wars"
[217] "weapon*" "whip*" "wound*" "wrath*"
[221] "football*" "wreck*"
The * character will pick up any token which begins with the relevant string.
E.g. accus* ➡️ accuse, accuses, accused, etc.
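To see how the wildcard behaves, the matching can be imitated in base R. This is a minimal sketch rather than quanteda's own implementation: it assumes utils::glob2rx() as a stand-in for glob matching, and the token vector is made up for illustration.

```r
# Made-up tokens for illustration (not taken from the corpus)
toy_tokens <- c("accuse", "accuses", "accused", "reaccuse", "several", "severe")

# glob2rx() turns a glob pattern into an anchored regular expression,
# so "accus*" only matches tokens that BEGIN with "accus"
accus_regex <- glob2rx("accus*")
accus_matches <- toy_tokens[grepl(accus_regex, toy_tokens)]
print(accus_matches)

# Wildcards can over-match: "sever*" catches "severe" but also "several"
sever_matches <- toy_tokens[grepl(glob2rx("sever*"), toy_tokens)]
print(sever_matches)
```

The second result shows why wildcard entries need checking: a pattern intended to capture "severe" also counts the unrelated word "several".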
# First we convert the texts to a corpus object:
aggression_corpus <- corpus(aggression_texts, text_field = "texts")
# Then we tokenize the texts and create a dfm:
aggression_tokens <- tokens(aggression_corpus)
aggression_dfm <- dfm(aggression_tokens)
# We use the aggression words to create a dictionary object:
aggression_dictionary <- dictionary(list(aggression = aggression_words))
# Finally, we apply the dictionary to the dfm using the dfm_lookup function:
aggression_dfm_dictionary <- dfm_lookup(aggression_dfm,
                                        dictionary = aggression_dictionary)
Document-feature matrix of: 10,937 documents, 1 feature (79.05% sparse) and 1 docvar.
features
docs aggression
text1 0
text2 0
text3 0
text4 0
text5 1
text6 1
[ reached max_ndoc ... 10,931 more documents ]
aggression_dfm_dictionary is a document-feature matrix, where the only “feature” is the dictionary count
Finally, we can calculate the score by dividing the dictionary counts by the number of words in each text:
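The code for this step is not shown here, so the following is a minimal sketch of one way to do it, run on a made-up two-sentence data frame so that it is self-contained. The toy texts and the two-pattern dictionary are assumptions for illustration, and whether the original analysis used exactly these calls is not confirmed.

```r
library(quanteda)

# Toy stand-in for aggression_texts (made-up sentences, for illustration only)
toy_texts <- data.frame(
  texts = c("They attack and insult us", "We welcome the Bill"),
  stringsAsFactors = FALSE
)

# Same pipeline as before: corpus -> tokens -> dfm -> dictionary lookup
toy_corpus <- corpus(toy_texts, text_field = "texts")
toy_tokens <- tokens(toy_corpus)
toy_dfm <- dfm(toy_tokens)
toy_dictionary <- dictionary(list(aggression = c("attack*", "insult*")))
toy_dfm_dictionary <- dfm_lookup(toy_dfm, dictionary = toy_dictionary)

# Score = dictionary hits / total number of tokens in each text
dictionary_counts <- as.numeric(as.matrix(toy_dfm_dictionary)[, "aggression"])
n_words <- as.numeric(ntoken(toy_tokens))
toy_texts$proportions <- dictionary_counts / n_words

print(toy_texts$proportions)
```

Because tokens() keeps punctuation by default, punctuation marks count towards the denominator unless they are removed at the tokenization step.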
Applying dictionaries outside the domain for which they were developed can lead to errors.
One way of assessing the seriousness of these errors is to conduct validation tests
There are many forms of these tests!
All share a core idea: are the texts that are flagged by the dictionary more representative of the relevant concept than other texts?
There are many approaches to assessing validity of a measure, \(m_1\), for a target concept, \(\mu_1\):
Face validity
Concurrent validity
Convergent validity
Discriminant validity
Predictive validity
Comparison to human judgements of a target concept, \(\mu_1\), is often thought to be the “gold standard” of validation
This is based on the (often implicit) assumption that real people can accurately identify and label examples of a given concept (“you know it when you see it”)
This assumption may not be met due to…
The caratage of the gold standard will therefore vary across applications
Intuition: Does our measure of aggression vary in sensible ways?
In this case, one obvious test is whether MPs’ speeches are more aggressive during Prime Minister’s Questions (PMQs).
'data.frame': 10937 obs. of 4 variables:
$ texts : chr "Is it not more important to work hard to open up trade between eastern and western Europe than to allow the Eur"| __truncated__ "Also, the Bill will consider aspects of the procedures applying to boards of inquiry." "On that measure, NHS provision per head of population in Cornwall is about half the national average." "Making it a criminal offence would help to make it clear that forced marriage is completely and utterly unacceptable." ...
$ human : logi NA NA NA NA NA NA ...
$ debate_type: chr "prime_ministers_questions" "legislation" "opposition_day" "question_time" ...
$ proportions: num 0 0 0 0 0.0233 ...
Call:
lm(formula = proportions ~ debate_type, data = aggression_texts)
Residuals:
Min 1Q Median 3Q Max
-0.015708 -0.007154 -0.006837 -0.006837 0.174768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0068367 0.0002633 25.966 <2e-16 ***
debate_typeopposition_day 0.0005019 0.0006371 0.788 0.431
debate_typeprime_ministers_questions 0.0088718 0.0005542 16.009 <2e-16 ***
debate_typequestion_time 0.0003170 0.0003923 0.808 0.419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01809 on 10933 degrees of freedom
Multiple R-squared: 0.02486, Adjusted R-squared: 0.0246
F-statistic: 92.93 on 3 and 10933 DF, p-value: < 2.2e-16
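There is clear evidence that PMQ debates tend to have higher levels of aggressive language than other debates: the estimated share of aggression words is about 1.6% in PMQ texts (0.0068 + 0.0089), more than double the 0.7% in the baseline category (legislation debates).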
How does this approach perform? Let’s look at the top-scoring sentences:
document | score | text |
---|---|---|
text3998 | 0.19 | I fully appreciate that it is the Opposition's job to oppose, but there are times when opposition is destructive. |
text7416 | 0.18 | We unequivocally condemn Hamas's dreadful and murderous rocket attacks and defend Israel's right to defend itself. |
text2941 | 0.14 | They were asking ridiculous prices, because they had the sole remedy for a complaint, so could exploit that situation. |
text106 | 0.13 | Terrible acts of brutality and violence have been carried out against the Rohingya people. |
text144 | 0.13 | The motion condemns the early release scheme for those who have assaulted police officers. |
While some seem reasonable, others indicate that we are picking up topic rather than tone.
The aggression_texts data.frame includes a variable, human, which includes the results of a validation exercise.
'data.frame': 10937 obs. of 4 variables:
$ texts : chr "Is it not more important to work hard to open up trade between eastern and western Europe than to allow the Eur"| __truncated__ "Also, the Bill will consider aspects of the procedures applying to boards of inquiry." "On that measure, NHS provision per head of population in Cornwall is about half the national average." "Making it a criminal offence would help to make it clear that forced marriage is completely and utterly unacceptable." ...
$ human : logi NA NA NA NA NA NA ...
$ debate_type: chr "prime_ministers_questions" "legislation" "opposition_day" "question_time" ...
$ proportions: num 0 0 0 0 0.0233 ...
Implication: Our aggression dictionary rarely detects non-aggressive texts as aggressive but frequently fails to detect aggressive texts.
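The comparison behind this implication can be sketched as a confusion matrix between the dictionary and the human codes. The vectors below are invented purely to illustrate the calculation (they are not the validation results), and treating any text with a non-zero dictionary score as "flagged" is an assumption.

```r
# Invented example data: 1 = aggressive, 0 = not aggressive
human      <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
dictionary <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1)  # e.g. as.numeric(proportions > 0)

# Cross-tabulate dictionary classifications against human codes
confusion <- table(dictionary = dictionary, human = human)
print(confusion)

# Sensitivity: share of human-coded aggressive texts that the dictionary flags
sensitivity <- confusion["1", "1"] / sum(confusion[, "1"])
# Specificity: share of human-coded non-aggressive texts left unflagged
specificity <- confusion["0", "0"] / sum(confusion[, "0"])

print(c(sensitivity = sensitivity, specificity = specificity))
```

A dictionary that rarely produces false positives but misses many aggressive texts shows this pattern: high specificity and low sensitivity. In aggression_texts, human is NA for many rows (presumably texts that were not hand-coded), so those rows would need to be dropped before tabulating.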
Which of these questions is easier?
Is this sentence aggressive?
Which of these sentences is more aggressive?
“I regard it as unbelievable that the minister has said that, when it is clearly wrong.”
“I also welcome the fact that the Bill will encourage more young people to take advantage of the programme.”
Paired comparisons tend to give more useful and reliable information than single ratings.
Apply 7 basic QTA measures (including 6 dictionaries) to 8 million sentences
Score each sentence using uniform word weights
Present pairs of sentences to human coders and ask them to select which sentence is more representative of a certain concept
Does the difference in sentence-level dictionary scores predict human judgements?
Sample pairs of sentences from the corpus
Randomly present to human coders, code (\(Y_i\)) whether:
Calculate the relationship between human coding and dictionaries by:
Repeat for each dictionary
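The exact coding rule and estimation step are not reproduced here, so the following is only a sketch of one plausible version of the test: a logistic regression of the coder's choice on the difference in dictionary scores. The data are simulated, and the binary coding of the outcome, the variable names, and the choice of glm() are all assumptions rather than the procedure used in the paper.

```r
set.seed(1)

# Simulated stand-in for sampled sentence pairs (not real data)
n_pairs <- 500
score_1 <- runif(n_pairs, 0, 0.2)  # dictionary score of sentence 1
score_2 <- runif(n_pairs, 0, 0.2)  # dictionary score of sentence 2
score_diff <- score_1 - score_2

# Simulated coder choice: 1 if sentence 1 is picked, 0 if sentence 2.
# A positive relationship is built in so there is something to recover.
y <- rbinom(n_pairs, size = 1, prob = plogis(20 * score_diff))

# Does the score difference predict which sentence the coder picks?
fit <- glm(y ~ score_diff, family = binomial)
print(coef(summary(fit)))
```

A clearly positive coefficient on score_diff would indicate that the dictionary ranks sentences in the same direction as the human coders; a coefficient near zero would suggest the dictionary is not tracking the concept.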
Aggression tends to manifest very differently in parliamentary speech than in other contexts!
In the paper, H&B develop a more sophisticated approach to measuring aggression (and other styles):
Take an off-the-shelf dictionary of aggressive words
Use word-embeddings to…
Score speeches according to these modified word lists
More on this approach on Thursday.
Let’s believe for a second that the validation strategy worked.
Full paper here.
Quantitative Text Analysis allows us to address a wide variety of important research questions
There is no one right way to represent text for all research questions.
The representation we choose can be consequential for the results we present
Dictionaries are fast, easy-to-apply methods with many pre-existing implementations
Validation is critical to any quantitative text application
The validity of a dictionary will be sensitive to the contexts in which it is developed and applied
ME314