Lecture 9: Text As Data and Dictionaries

Jack Blumenau

Introduction to Quantitative Text Analysis

Much of the Social World is Textual

Language is central to almost all social interaction

  1. Laws are written ✒️

  2. Political events are discussed 📢

  3. History is recorded 📚

  4. People communicate ✉️

But these interactions have not been amenable to quantitative analysis until recently.

The Growth of Quantitative Text Analysis

Two major changes that contributed to the growth of QTA:

  1. Enormous increase in availability of digitized texts

  2. Development of powerful and easily applicable methods

Consequence: we can now interrogate central questions in social science using data that was unavailable in the past.

Quantitative Text Analysis

We will be thinking about different methods of doing one core thing:


Assigning numbers to words and documents in order to measure latent concepts in text.


Although the methods we use to generate these numbers differ, the common goal will be to assign numbers that enable us to measure latent concepts from large corpora of text.

Assigning Numbers to Words/Documents

  1. Dictionaries

    • Some words are assigned a 1, others a 0
    • Documents are characterised by the degree to which they include words in the dictionary
  2. Supervised learning

    • Words are assigned a weight depending on their relative use across groups
    • Documents are characterised by the degree to which they include words associated with different groups
  3. Topic models

    • Words are assigned a vector of numbers, representing their relevance to a set of topics
    • Documents are assigned a vector of numbers, representing their relevance to a set of topics
  4. Word-embedding models

    • Words are assigned a vector of numbers, representing the context in which they are used
    • Documents are characterised by some average of the vectors of the words they contain
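
The dictionary approach in item 1 can be sketched in a few lines of base R. This is a minimal illustration rather than the course's own implementation: the two documents are taken from the SDG goal descriptions used later in the lecture, while the `environment_dict` word list and the helper functions `tokenise()` and `dictionary_score()` are hypothetical names introduced here.

```r
# Toy documents and a hypothetical dictionary (illustration only)
docs <- c(
  d1 = "End poverty in all its forms everywhere",
  d2 = "Take urgent action to combat climate change and its impacts"
)
environment_dict <- c("climate", "oceans", "forests", "ecosystems")

# Lowercase the text and split on anything that is not a letter
tokenise <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[toks != ""]
}

# Each token scores 1 if it is in the dictionary and 0 otherwise;
# the document score is the share of tokens that score 1
dictionary_score <- function(doc, dictionary) {
  toks <- tokenise(doc)
  mean(toks %in% dictionary)
}

scores <- sapply(docs, dictionary_score, dictionary = environment_dict)
print(scores)
```

Dividing by document length (via `mean()`) is one possible design choice; raw counts of dictionary words would be an equally simple alternative when documents are of similar length.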

Applications in QTA

  • How does the media cover the economy?
  • When did Western political culture diverge from the rest of the world?
  • How do central bankers make decisions on economic policy?
  • How has the cultural meaning of words changed over time?
  • How can we detect online hate speech?
  • Which interest groups have policy influence?

Assumptions

Many of these approaches share a set of common assumptions:

  1. Texts represent observable manifestations of underlying characteristics of interest (usually attributes of authors)

  2. Texts can be represented through extracting their features (for now, words)

  3. Analysis of those features can produce meaningful estimates of the underlying characteristic of interest

For any given application, these assumptions may or may not be met.

Principles of Quantitative Text Analysis

  1. All Language Models Are Wrong But Some Are Useful

    • Statistical models attempt to describe the ways in which data is generated
    • The data-generation process for language is extraordinarily complex
    • All the methods we cover on this course make simplifying assumptions, which means they fail to provide an accurate account of the data-generation process
    • We trade off complexity for tractability
  2. Domain-Specific Validation is Essential

    • Many of the methods we study are easy to apply quickly and at scale
    • When applied in any given domain they may lead to misleading or wrong inferences
    • It is therefore essential to validate the approaches we use in the particular setting we are studying
    • QTA models should not be evaluated for their realism, but for their usefulness in specific tasks
    • Validation can take many forms
  3. Visualization is Central to Understanding High Dimensional Data

    • Text data is inherently multidimensional
    • A key goal of text analysis is to distill this complexity into a lower-dimensional representation that preserves important aspects of meaning
    • Visualising the outputs of text models is crucial to conveying the meaning embodied in the texts
  4. Quantitative Text Analysis Requires Qualitative Interpretation

    • Quantitative approaches differ from qualitative approaches
      • Large-scale analysis of many texts, rather than close readings of few texts
      • Interpretation of quantitative summaries of text, rather than direct interpretation of texts
    • But all text analysis involves qualitative judgement…
      • …in the construction of the feature-document matrix
      • …in the interpretation of the output of statistical models
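
Validation can take many forms; one simple possibility, sketched below, is to compare a method's classifications with human hand-coding of the same documents. The vectors `machine` and `human` are made-up values for eight hypothetical documents, and accuracy, precision and recall are used here only as examples of agreement measures, not as the measures the lecture prescribes.

```r
# Hypothetical classifications for the same eight documents
# (1 = concept present, 0 = concept absent)
machine <- c(1, 0, 1, 1, 0, 0, 1, 0)  # e.g. from a dictionary
human   <- c(1, 0, 0, 1, 0, 1, 1, 0)  # hand-coded by a reader

# Cross-tabulate the two sets of codes
confusion <- table(machine = machine, human = human)

# Share of documents on which the two codings agree
accuracy <- mean(machine == human)

# Of the documents the method flags, how many did the human also flag?
precision <- sum(machine == 1 & human == 1) / sum(machine == 1)

# Of the documents the human flags, how many did the method find?
recall <- sum(machine == 1 & human == 1) / sum(human == 1)

print(confusion)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```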

Workflow

Each quantitative text analysis follows a similar workflow:

  1. Conversion of textual features into a quantitative matrix

  2. A quantitative or statistical procedure to extract information from the quantitative matrix

  3. Summary and interpretation of the quantitative results
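
The three steps above can be sketched end to end in base R. The toy `texts` corpus is invented for illustration, and cosine similarity is just one example of a "quantitative procedure" for step 2; the lecture does not commit to a particular one at this point.

```r
# Toy corpus (hypothetical, for illustration only)
texts <- c(
  a = "sustainable energy for all",
  b = "sustainable water and sanitation for all",
  c = "reduce inequality within and among countries"
)

# Step 1: convert textual features into a quantitative matrix
# (rows are documents, columns are words, cells are counts)
tokens <- strsplit(tolower(texts), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dfm <- t(sapply(tokens, function(tk) {
  tabulate(match(tk, vocab), nbins = length(vocab))
}))
colnames(dfm) <- vocab

# Step 2: apply a quantitative procedure to the matrix
# (here: cosine similarity between every pair of documents)
unit_rows <- dfm / sqrt(rowSums(dfm^2))
cosine <- unit_rows %*% t(unit_rows)

# Step 3: summarise and interpret the results
print(round(cosine, 2))
```

Documents `a` and `b` share three words and so receive a high similarity, while `a` and `c` share none and receive a similarity of zero. Interpreting whether that reflects a meaningful difference is the qualitative part of step 3.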

Workflow

In reality, there are additional steps:

  1. Select documents

  2. Digitize documents

  3. Represent as quantitative data

  4. Analyse data

  5. Validate analysis

  6. Interpret analysis

Representing Text as Data

Motivating Example

The UN Sustainable Development Goals are a set of 17 connected global goals which represent “a shared blueprint for peace and prosperity” for people across the world. Each goal is associated with a series of specific targets and indicators.

Question: (How) can we characterise the UN Sustainable Development Goals as numeric data?

sdg <- read.csv("data/SDG-goals.csv")
sdg$description
 [1] "End poverty in all its forms everywhere"                                                                                                                                                     
 [2] "End hunger, achieve food security and improved nutrition and promote sustainable agriculture"                                                                                                
 [3] "Ensure healthy lives and promote well-being for all at all ages"                                                                                                                             
 [4] "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all"                                                                                        
 [5] "Achieve gender equality and empower all women and girls"                                                                                                                                     
 [6] "Ensure availability and sustainable management of water and sanitation for all"                                                                                                              
 [7] "Ensure access to affordable, reliable, sustainable and modern energy for all"                                                                                                                
 [8] "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all"                                                                        
 [9] "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation"                                                                                   
[10] "Reduce inequality within and among countries"                                                                                                                                                
[11] "Make cities and human settlements inclusive, safe, resilient and sustainable"                                                                                                                
[12] "Ensure sustainable consumption and production patterns"                                                                                                                                      
[13] "Take urgent action to combat climate change and its impacts"                                                                                                                                 
[14] "Conserve and sustainably use the oceans, seas and marine resources for sustainable development"                                                                                              
[15] "Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss"
[16] "Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels"           
[17] "Strengthen the means of implementation and revitalize the global partnership for sustainable development"                                                                                    
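
As a first, deliberately crude answer to the question above, each goal can be summarised by a handful of counts. The sketch below copies the first three descriptions into a local `description` vector so that it runs without the CSV file; with the data loaded, `sdg$description` could be used instead. The choice of features (`n_tokens`, `n_sustainable`) is an assumption made for illustration.

```r
# First three goal descriptions, copied from the output above
description <- c(
  "End poverty in all its forms everywhere",
  "End hunger, achieve food security and improved nutrition and promote sustainable agriculture",
  "Ensure healthy lives and promote well-being for all at all ages"
)

# Lowercase and split on non-letters, so "well-being" becomes two tokens
tokens <- strsplit(tolower(description), "[^a-z]+")

# Two simple numeric features per goal
features <- data.frame(
  goal          = seq_along(description),
  n_tokens      = lengths(tokens),
  n_sustainable = sapply(tokens, function(tk) sum(tk == "sustainable"))
)
print(features)

# Word frequencies across all three goals, most frequent first
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(head(word_freq))
```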

Motivating Example

sdg$long_description
 [1] "End poverty in all its forms everywhere By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day By 2030, reduce at least by half the proportion of men, women and children of all ages living in poverty in all its dimensions according to national definitions  Implement nationally appropriate social protection systems and measures for all, including floors, and by 2030 achieve substantial coverage of the poor and the vulnerable  By 2030, ensure that all men and women, in particular the poor and the vulnerable, have equal rights to economic resources, as well as access to basic services, ownership and control over land and other forms of property, inheritance, natural resources, appropriate new technology and financial services, including microfinance By 2030, build the resilience of the poor and those in vulnerable situations and reduce their exposure and vulnerability to climate-related extreme events and other economic, social and environmental shocks and disasters  Ensure significant mobilization of resources from a variety of sources, including through enhanced development cooperation, in order to provide adequate and predictable means for developing countries, in particular least developed countries, to implement programmes and policies to end poverty in all its dimensions  Create sound policy frameworks at the national, regional and international levels, based on pro-poor and gender-sensitive development strategies, to support accelerated investment in poverty eradication actions "                                                                                                                                                                                                                                                                                                                                                                                                                                  
 [2] "End hunger, achieve food security and improved nutrition and promote sustainable agriculture By 2030, end hunger and ensure access by all people, in particular the poor and people in vulnerable situations, including infants, to safe, nutritious and sufficient food all year round By 2030, end all forms of malnutrition, including achieving, by 2025, the internationally agreed targets on stunting and wasting in children under 5 years of age, and address the nutritional needs of adolescent girls, pregnant and lactating women and older persons By 2030, double the agricultural productivity and incomes of small-scale food producers, in particular women, indigenous peoples, family farmers, pastoralists and fishers, including through secure and equal access to land, other productive resources and inputs, knowledge, financial services, markets and opportunities for value addition and non-farm employment By 2030, ensure sustainable food production systems and implement resilient agricultural practices that increase productivity and production, that help maintain ecosystems, that strengthen capacity for adaptation to climate change, extreme weather, drought, flooding and other disasters and that progressively improve land and soil quality  By 2020, maintain the genetic diversity of seeds, cultivated plants and farmed and domesticated animals and their related wild species, including through soundly managed and diversified seed and plant banks at the national, regional and international levels, and promote access to and fair and equitable sharing of benefits arising from the utilization of genetic resources and associated traditional knowledge, as internationally agreed Increase investment, including through enhanced international cooperation, in rural infrastructure, agricultural research and extension services, technology development and plant and livestock gene banks in order to enhance agricultural productive capacity in developing countries, in particular least developed countries Correct and prevent trade restrictions and distortions in world agricultural markets, including through the parallel elimination of all forms of agricultural export subsidies and all export measures with equivalent effect, in accordance with the mandate of the Doha Development Round Adopt measures to ensure the proper functioning of food commodity markets and their derivatives and facilitate timely access to market information, including on food reserves, in order to help limit extreme food price volatility "
 [3] "Ensure healthy lives and promote well-being for all at all ages By 2030, reduce the global maternal mortality ratio to less than 70 per 100,000 live births By 2030, end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce neonatal mortality to at least as low as 12 per 1,000 live births and under-5 mortality to at least as low as 25 per 1,000 live births By 2030, end the epidemics of AIDS, tuberculosis, malaria and neglected tropical diseases and combat hepatitis, water-borne diseases and other communicable diseases  By 2030, reduce by one third premature mortality from non-communicable diseases through prevention and treatment and promote mental health and well-being Strengthen the prevention and treatment of substance abuse, including narcotic drug abuse and harmful use of alcohol  By 2020, halve the number of global deaths and injuries from road traffic accidents By 2030, ensure universal access to sexual and reproductive health-care services, including for family planning, information and education, and the integration of reproductive health into national strategies and programmes  Achieve universal health coverage, including financial risk protection, access to quality essential health-care services and access to safe, effective, quality and affordable essential medicines and vaccines for all By 2030, substantially reduce the number of deaths and illnesses from hazardous chemicals and air, water and soil pollution and contamination Strengthen the implementation of the World Health Organization Framework Convention on Tobacco Control in all countries, as appropriate Support the research and development of vaccines and medicines for the communicable and non-communicable diseases that primarily affect developing countries, provide access to affordable essential medicines and vaccines, in accordance with the Doha Declaration on the TRIPS Agreement and Public Health, which affirms the right of developing countries to use to the full the provisions in the Agreement on Trade-Related Aspects of Intellectual Property Rights regarding flexibilities to protect public health, and, in particular, provide access to medicines for all Substantially increase health financing and the recruitment, development, training and retention of the health workforce in developing countries, especially in least developed countries and small island developing States Strengthen the capacity of all countries, in particular developing countries, for early warning, risk reduction and management of national and global health risks "
 [4] "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all By 2030, ensure that all girls and boys complete free, equitable and quality primary and secondary education leading to relevant and effective learning outcomes By 2030, ensure that all girls and boys have access to quality early childhood development, care and pre-primary education so that they are ready for primary education  By 2030, ensure equal access for all women and men to affordable and quality technical, vocational and tertiary education, including university  By 2030, substantially increase the number of youth and adults who have relevant skills, including technical and vocational skills, for employment, decent jobs and entrepreneurship  By 2030, eliminate gender disparities in education and ensure equal access to all levels of education and vocational training for the vulnerable, including persons with disabilities, indigenous peoples and children in vulnerable situations By 2030, ensure that all youth and a substantial proportion of adults, both men and women, achieve literacy and numeracy  By 2030, ensure that all learners acquire the knowledge and skills needed to promote sustainable development, including, among others, through education for sustainable development and sustainable lifestyles, human rights, gender equality, promotion of a culture of peace and non-violence, global citizenship and appreciation of cultural diversity and of culture’s contribution to sustainable development Build and upgrade education facilities that are child, disability and gender sensitive and provide safe, non-violent, inclusive and effective learning environments for all By 2020, substantially expand globally the number of scholarships available to developing countries, in particular least developed countries, small island developing States and African countries, for enrolment in higher education, including vocational training and information and communications technology, technical, engineering and scientific programmes, in developed countries and other developing countries By 2030, substantially increase the supply of qualified teachers, including through international cooperation for teacher training in developing countries, especially least developed countries and small island developing States "
 [5] "Achieve gender equality and empower all women and girls End all forms of discrimination against all women and girls everywhere  Eliminate all forms of violence against all women and girls in the public and private spheres, including trafficking and sexual and other types of exploitation Eliminate all harmful practices, such as child, early and forced marriage and female genital mutilation Recognize and value unpaid care and domestic work through the provision of public services, infrastructure and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate  Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life Ensure universal access to sexual and reproductive health and reproductive rights as agreed in accordance with the Programme of Action of the International Conference on Population and Development and the Beijing Platform for Action and the outcome documents of their review conferences Undertake reforms to give women equal rights to economic resources, as well as access to ownership and control over land and other forms of property, financial services, inheritance and natural resources, in accordance with national laws Enhance the use of enabling technology, in particular information and communications technology, to promote the empowerment of women Adopt and strengthen sound policies and enforceable legislation for the promotion of gender equality and the empowerment of all women and girls at all levels "                                                                                                                                                                                                                                                                                                                                                                                              
 [6] "Ensure availability and sustainable management of water and sanitation for all By 2030, achieve universal and equitable access to safe and affordable drinking water for all By 2030, achieve access to adequate and equitable sanitation and hygiene for all and end open defecation, paying special attention to the needs of women and girls and those in vulnerable situations By 2030, improve water quality by reducing pollution, eliminating dumping and minimizing release of hazardous chemicals and materials, halving the proportion of untreated wastewater and substantially increasing recycling and safe reuse globally By 2030, substantially increase water-use efficiency across all sectors and ensure sustainable withdrawals and supply of freshwater to address water scarcity and substantially reduce the number of people suffering from water scarcity By 2030, implement integrated water resources management at all levels, including through transboundary cooperation as appropriate  By 2020, protect and restore water-related ecosystems, including mountains, forests, wetlands, rivers, aquifers and lakes By 2030, expand international cooperation and capacity-building support to developing countries in water- and sanitation-related activities and programmes, including water harvesting, desalination, water efficiency, wastewater treatment, recycling and reuse technologies Support and strengthen the participation of local communities in improving water and sanitation management "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
 [7] "Ensure access to affordable, reliable, sustainable and modern energy for all By 2030, ensure universal access to affordable, reliable and modern energy services  By 2030, increase substantially the share of renewable energy in the global energy mix By 2030, double the global rate of improvement in energy efficiency By 2030, enhance international cooperation to facilitate access to clean energy research and technology, including renewable energy, energy efficiency and advanced and cleaner fossil-fuel technology, and promote investment in energy infrastructure and clean energy technology By 2030, expand infrastructure and upgrade technology for supplying modern and sustainable energy services for all in developing countries, in particular least developed countries, small island developing States, and land-locked developing countries, in accordance with their respective programmes of support "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
 [8] "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all Sustain per capita economic growth in accordance with national circumstances and, in particular, at least 7 per cent gross domestic product growth per annum in the least developed countries Achieve higher levels of economic productivity through diversification, technological upgrading and innovation, including through a focus on high-value added and labour-intensive sectors  Promote development-oriented policies that support productive activities, decent job creation, entrepreneurship, creativity and innovation, and encourage the formalization and growth of micro-, small- and medium-sized enterprises, including through access to financial services  Improve progressively, through 2030, global resource efficiency in consumption and production and endeavour to decouple economic growth from environmental degradation, in accordance with the 10-year framework of programmes on sustainable consumption and production, with developed countries taking the lead By 2030, achieve full and productive employment and decent work for all women and men, including for young people and persons with disabilities, and equal pay for work of equal value By 2020, substantially reduce the proportion of youth not in employment, education or training Take immediate and effective measures to eradicate forced labour, end modern slavery and human trafficking and secure the prohibition and elimination of the worst forms of child labour, including recruitment and use of child soldiers, and by 2025 end child labour in all its forms  Protect labour rights and promote safe and secure working environments for all workers, including migrant workers, in particular women migrants, and those in precarious employment By 2030, devise and implement policies to promote sustainable tourism that creates jobs and promotes local culture and products Strengthen the capacity of domestic 
financial institutions to encourage and expand access to banking, insurance and financial services for all Increase Aid for Trade support for developing countries, in particular least developed countries, including through the Enhanced Integrated Framework for Trade-Related Technical Assistance to Least Developed Countries By 2020, develop and operationalize a global strategy for youth employment and implement the Global Jobs Pact of the International Labour Organization "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
 [9] "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation Develop quality, reliable, sustainable and resilient infrastructure, including regional and transborder infrastructure, to support economic development and human well-being, with a focus on affordable and equitable access for all Promote inclusive and sustainable industrialization and, by 2030, significantly raise industry’s share of employment and gross domestic product, in line with national circumstances, and double its share in least developed countries  Increase the access of small-scale industrial and other enterprises, in particular in developing countries, to financial services, including affordable credit, and their integration into value chains and markets By 2030, upgrade infrastructure and retrofit industries to make them sustainable, with increased resource-use efficiency and greater adoption of clean and environmentally sound technologies and industrial processes, with all countries taking action in accordance with their respective capabilities Enhance scientific research, upgrade the technological capabilities of industrial sectors in all countries, in particular developing countries, including, by 2030, encouraging innovation and substantially increasing the number of research and development workers per 1 million people and public and private research and development spending Facilitate sustainable and resilient infrastructure development in developing countries through enhanced financial, technological and technical support to African countries, least developed countries, landlocked developing countries and small island developing States Support domestic technology development, research and innovation in developing countries, including by ensuring a conducive policy environment for, inter alia, industrial diversification and value addition to commodities Significantly increase access to information and communications technology and 
strive to provide universal and affordable access to the Internet in least developed countries by 2020"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
[10] "Reduce inequality within and among countries By 2030, progressively achieve and sustain income growth of the bottom 40 per cent of the population at a rate higher than the national average  By 2030, empower and promote the social, economic and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion or economic or other status  Ensure equal opportunity and reduce inequalities of outcome, including by eliminating discriminatory laws, policies and practices and promoting appropriate legislation, policies and action in this regard Adopt policies, especially fiscal, wage and social protection policies, and progressively achieve greater equality Improve the regulation and monitoring of global financial markets and institutions and strengthen the implementation of such regulations  Ensure enhanced representation and voice for developing countries in decision-making in global international economic and financial institutions in order to deliver more effective, credible, accountable and legitimate institutions Facilitate orderly, safe, regular and responsible migration and mobility of people, including through the implementation of planned and well-managed migration policies Implement the principle of special and differential treatment for developing countries, in particular least developed countries, in accordance with World Trade Organization agreements Encourage official development assistance and financial flows, including foreign direct investment, to States where the need is greatest, in particular least developed countries, African countries, small island developing States and landlocked developing countries, in accordance with their national plans and programmes By 2030, reduce to less than 3 per cent the transaction costs of migrant remittances and eliminate remittance corridors with costs higher than 5 per cent "                                                                                                         
[11] "Make cities and human settlements inclusive, safe, resilient and sustainable By 2030, ensure access for all to adequate, safe and affordable housing and basic services and upgrade slums  By 2030, provide access to safe, affordable, accessible and sustainable transport systems for all, improving road safety, notably by expanding public transport, with special attention to the needs of those in vulnerable situations, women, children, persons with disabilities and older persons By 2030, enhance inclusive and sustainable urbanization and capacity for participatory, integrated and sustainable human settlement planning and management in all countries Strengthen efforts to protect and safeguard the world’s cultural and natural heritage  By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the direct economic losses relative to global gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations  By 2030, reduce the adverse per capita environmental impact of cities, including by paying special attention to air quality and municipal and other waste management  By 2030, provide universal access to safe, inclusive and accessible, green and public spaces, in particular for women and children, older persons and persons with disabilities Support positive economic, social and environmental links between urban, per-urban and rural areas by strengthening national and regional development planning By 2020, substantially increase the number of cities and human settlements adopting and implementing integrated policies and plans towards inclusion, resource efficiency, mitigation and adaptation to climate change, resilience to disasters, and develop and implement, in line with the Sendai Framework for Disaster Risk Reduction 2015-2030, holistic disaster risk management at all levels Support least developed countries, including through 
financial and technical assistance, in building sustainable and resilient buildings utilizing local materials "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
[12] "Ensure sustainable consumption and production patterns Implement the 10-year framework of programmes on sustainable consumption and production, all countries taking action, with developed countries taking the lead, taking into account the development and capabilities of developing countries  By 2030, achieve the sustainable management and efficient use of natural resources By 2030, halve per capita global food waste at the retail and consumer levels and reduce food losses along production and supply chains, including post-harvest losses  By 2020, achieve the environmentally sound management of chemicals and all wastes throughout their life cycle, in accordance with agreed international frameworks, and significantly reduce their release to air, water and soil in order to minimize their adverse impacts on human health and the environment  By 2030, substantially reduce waste generation through prevention, reduction, recycling and reuse  Encourage companies, especially large and transnational companies, to adopt sustainable practices and to integrate sustainability information into their reporting cycle Promote public procurement practices that are sustainable, in accordance with national policies and priorities  By 2030, ensure that people everywhere have the relevant information and awareness for sustainable development and lifestyles in harmony with nature Support developing countries to strengthen their scientific and technological capacity to move towards more sustainable patterns of consumption and production Develop and implement tools to monitor sustainable development impacts for sustainable tourism that creates jobs and promotes local culture and products Rationalize inefficient fossil-fuel subsidies that encourage wasteful consumption by removing market distortions, in accordance with national circumstances, including by restructuring taxation and phasing out those harmful subsidies, where they exist, to reflect their environmental impacts, taking 
fully into account the specific needs and conditions of developing countries and minimizing the possible adverse impacts on their development in a manner that protects the poor and the affected communities"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[13] "Take urgent action to combat climate change and its impacts Strengthen resilience and adaptive capacity to climate-related hazards and natural disasters in all countries  Integrate climate change measures into national policies, strategies and planning Improve education, awareness-raising and human and institutional capacity on climate change mitigation, adaptation, impact reduction and early warning Implement the commitment undertaken by developed-country parties to the United Nations Framework Convention on Climate Change to a goal of mobilizing jointly $100 billion annually by 2020 from all sources to address the needs of developing countries in the context of meaningful mitigation actions and transparency on implementation and fully operationalize the Green Climate Fund through its capitalization as soon as possible Promote mechanisms for raising capacity for effective climate change-related planning and management in least developed countries and small island developing States, including focusing on women, youth and local and marginalized communities"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
[14] "Conserve and sustainably use the oceans, seas and marine resources for sustainable development By 2025, prevent and significantly reduce marine pollution of all kinds, in particular from land-based activities, including marine debris and nutrient pollution By 2020, sustainably manage and protect marine and coastal ecosystems to avoid significant adverse impacts, including by strengthening their resilience, and take action for their restoration in order to achieve healthy and productive oceans Minimize and address the impacts of ocean acidification, including through enhanced scientific cooperation at all levels By 2020, effectively regulate harvesting and end overfishing, illegal, unreported and unregulated fishing and destructive fishing practices and implement science-based management plans, in order to restore fish stocks in the shortest time feasible, at least to levels that can produce maximum sustainable yield as determined by their biological characteristics By 2020, conserve at least 10 per cent of coastal and marine areas, consistent with national and international law and based on the best available scientific information By 2020, prohibit certain forms of fisheries subsidies which contribute to overcapacity and overfishing, eliminate subsidies that contribute to illegal, unreported and unregulated fishing and refrain from introducing new such subsidies, recognizing that appropriate and effective special and differential treatment for developing and least developed countries should be an integral part of the World Trade Organization fisheries subsidies negotiation By 2030, increase the economic benefits to Small Island developing States and least developed countries from the sustainable use of marine resources, including through sustainable management of fisheries, aquaculture and tourism Increase scientific knowledge, develop research capacity and transfer marine technology, taking into account the Intergovernmental Oceanographic Commission 
Criteria and Guidelines on the Transfer of Marine Technology, in order to improve ocean health and to enhance the contribution of marine biodiversity to the development of developing countries, in particular small island developing States and least developed countries Provide access for small-scale artisanal fishers to marine resources and markets Enhance the conservation and sustainable use of oceans and their resources by implementing international law as reflected in UNCLOS, which provides the legal framework for the conservation and sustainable use of oceans and their resources, as recalled in paragraph 158 of The Future We Want "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                       
[15] "Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss By 2020, ensure the conservation, restoration and sustainable use of terrestrial and inland freshwater ecosystems and their services, in particular forests, wetlands, mountains and drylands, in line with obligations under international agreements By 2020, promote the implementation of sustainable management of all types of forests, halt deforestation, restore degraded forests and substantially increase afforestation and reforestation globally By 2030, combat desertification, restore degraded land and soil, including land affected by desertification, drought and floods, and strive to achieve a land degradation-neutral world By 2030, ensure the conservation of mountain ecosystems, including their biodiversity, in order to enhance their capacity to provide benefits that are essential for sustainable development Take urgent and significant action to reduce the degradation of natural habitats, halt the loss of biodiversity and, by 2020, protect and prevent the extinction of threatened species Promote fair and equitable sharing of the benefits arising from the utilization of genetic resources and promote appropriate access to such resources, as internationally agreed Take urgent action to end poaching and trafficking of protected species of flora and fauna and address both demand and supply of illegal wildlife products By 2020, introduce measures to prevent the introduction and significantly reduce the impact of invasive alien species on land and water ecosystems and control or eradicate the priority species By 2020, integrate ecosystem and biodiversity values into national and local planning, development processes, poverty reduction strategies and accounts Mobilize and significantly increase financial resources from all sources to conserve and sustainably use biodiversity and 
ecosystems Mobilize significant resources from all sources and at all levels to finance sustainable forest management and provide adequate incentives to developing countries to advance such management, including for conservation and reforestation Enhance global support for efforts to combat poaching and trafficking of protected species, including by increasing the capacity of local communities to pursue sustainable livelihood opportunities "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                  
[16] "Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels Significantly reduce all forms of violence and related death rates everywhere End abuse, exploitation, trafficking and all forms of violence against and torture of children Promote the rule of law at the national and international levels and ensure equal access to justice for all By 2030, significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organized crime Substantially reduce corruption and bribery in all their forms Develop effective, accountable and transparent institutions at all levels Ensure responsive, inclusive, participatory and representative decision-making at all levels Broaden and strengthen the participation of developing countries in the institutions of global governance By 2030, provide legal identity for all, including birth registration Ensure public access to information and protect fundamental freedoms, in accordance with national legislation and international agreements Strengthen relevant national institutions, including through international cooperation, for building capacity at all levels, in particular in developing countries, to prevent violence and combat terrorism and crime Promote and enforce non-discriminatory laws and policies for sustainable development"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                               
[17] "Strengthen the means of implementation and revitalize the global partnership for sustainable development Strengthen domestic resource mobilization, including through international support to developing countries, to improve domestic capacity for tax and other revenue collection Developed countries to implement fully their official development assistance commitments, including the commitment by many developed countries to achieve the target of 0.7 per cent of ODA/GNI to developing countries and 0.15 to 0.20 per cent of ODA/GNI to least developed countries; ODA providers are encouraged to consider setting a target to provide at least 0.20 per cent of ODA/GNI to least developed countries Mobilize additional financial resources for developing countries from multiple sources Assist developing countries in attaining long-term debt sustainability through coordinated policies aimed at fostering debt financing, debt relief and debt restructuring, as appropriate, and address the external debt of highly indebted poor countries to reduce debt distress Adopt and implement investment promotion regimes for least developed countries  Enhance North-South, South-South and triangular regional and international cooperation on and access to science, technology and innovation and enhance knowledge sharing on mutually agreed terms, including through improved coordination among existing mechanisms, in particular at the United Nations level, and through a global technology facilitation mechanism Promote the development, transfer, dissemination and diffusion of environmentally sound technologies to developing countries on favourable terms, including on concessional and preferential terms, as mutually agreed Fully operationalize the technology bank and science, technology and innovation capacity-building mechanism for least developed countries by 2017 and enhance the use of enabling technology, in particular information and communications technology  Enhance international support for 
implementing effective and targeted capacity-building in developing countries to support national plans to implement all the sustainable development goals, including through North-South, South-South and triangular cooperation  Promote a universal, rules-based, open, non-discriminatory and equitable multilateral trading system under the World Trade Organization, including through the conclusion of negotiations under its Doha Development Agenda Significantly increase the exports of developing countries, in particular with a view to doubling the least developed countries’ share of global exports by 2020 Realize timely implementation of duty-free and quota-free market access on a lasting basis for all least developed countries, consistent with World Trade Organization decisions, including by ensuring that preferential rules of origin applicable to imports from least developed countries are transparent and simple, and contribute to facilitating market access  Enhance global macroeconomic stability, including through policy coordination and policy coherence Enhance policy coherence for sustainable development Respect each country’s policy space and leadership to establish and implement policies for poverty eradication and sustainable development Enhance the global partnership for sustainable development, complemented by multi-stakeholder partnerships that mobilize and share knowledge, expertise, technology and financial resources, to support the achievement of the sustainable development goals in all countries, in particular developing countries Encourage and promote effective public, public-private and civil society partnerships, building on the experience and resourcing strategies of partnerships By 2020, enhance capacity-building support to developing countries, including for least developed countries and small island developing States, to increase significantly the availability of high-quality, timely and reliable data disaggregated by income, gender, age, race, 
ethnicity, migratory status, disability, geographic location and other characteristics relevant in national contexts By 2030, build on existing initiatives to develop measurements of progress on sustainable development that complement gross domestic product, and support statistical capacity-building in developing countries "

There Is No Single Right Way To Represent Text

Which features of text would be most helpful for the following research questions?

  1. Predicting whether the author of a text message was young or old

    • Emojis; informal language; length
  2. Measuring the financial content of news coverage

    • Words relating to finance
  3. Assessing the complexity of a piece of writing

    • Number of syllables; relative number of adjectives, nouns, verbs, etc.

Implication: feature selection will depend on your research question.

Document-feature matrix

Document-Feature Matrix (DFM)

A document-feature matrix is a common way of representing text data in quantitative form.

  • The rows of the matrix indicate the documents.

  • The columns of the matrix indicate the features (words, etc).

DFMs are parsimonious representations that discard information. But they are helpful!

In order to construct a DFM, we need to make decisions about both documents and features.

Terminology

Document

Basic unit (text) of analysis

Corpus

A structured set of documents for analysis

Type

A unique feature in the corpus e.g. a word (“flies”), a punctuation mark, a part-of-speech

Token

An instance of a type in a document e.g. the occurrence of the word in a given document
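
To make the type/token distinction concrete, here is a minimal base R sketch. The example sentence and the whitespace tokenizer are assumptions made for illustration; they are not part of the lecture's data. For real corpora, quanteda provides ntoken() and ntype() for the same counts.

```r
# Illustrative text (an assumption for this sketch, not from the SDG data)
text <- "time flies like an arrow and fruit flies like a banana"

# Split on single spaces: each element is one token
tokens_vec <- strsplit(text, " ", fixed = TRUE)[[1]]

# Types are the unique tokens
types_vec <- unique(tokens_vec)

n_tokens <- length(tokens_vec)  # 11 tokens
n_types  <- length(types_vec)   # 9 types ("flies" and "like" each occur twice)

print(c(tokens = n_tokens, types = n_types))
```

The type "flies" contributes two tokens here, which is exactly the distinction in the definitions above.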

Selecting documents

Selecting documents is an important, and often ignored, step in any quantitative text analysis.

Key questions:

  1. Is it possible/feasible to collect a set of documents?

  2. Is the corpus representative of the population of interest?

  3. Is it ethical to examine documents of this sort at scale?

Implication: The selection of texts is consequential to the conclusions we can draw.

Strategies for defining “documents”

A “document” is the typical unit of analysis in QTA. But what is a document?

  • Entire document
  • Pages
  • Paragraphs
  • Tweets

Key: Depends on the research question.
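
As a rough sketch of how the unit of analysis can be changed in practice, quanteda's corpus_reshape() can split a corpus into smaller documents. The example text below is an assumption made for illustration. The function also accepts to = "paragraphs", while units such as pages or tweets would normally be defined when the data are collected.

```r
library(quanteda)

# One illustrative "document" containing three sentences (assumed example text)
corp <- corpus(c(doc1 = "Laws are written. Political events are discussed. History is recorded."))

# The corpus starts with a single document
n_before <- ndoc(corp)

# Redefine the unit of analysis as the sentence
corp_sentences <- corpus_reshape(corp, to = "sentences")

# Each sentence is now its own document
n_after <- ndoc(corp_sentences)

print(c(before = n_before, after = n_after))
```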

Strategies for defining “features”

Bags of words

  1. The simplest possible way of characterising a corpus is by counting words

  2. For each text, we record how many times each unique word appears

  3. We ignore everything else.

Bags of words assumptions

  1. The words in a document convey meaning

  2. Word order does not matter

  3. Word combinations do not matter (e.g. negation)

  4. Grammar does not matter

  5. Words are the only relevant features (not punctuation, not syllables, etc)

The importance of these assumptions depends on the application.
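
A small base R sketch can show what assumptions 2 and 3 throw away. The sentences and the helper function bag_of_words() are hypothetical, written only for this illustration.

```r
# Hypothetical helper: count how often each word occurs in a sentence
bag_of_words <- function(sentence) {
  words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]
  table(words)
}

# Word order differs, but the bags of words are identical
bag_a <- bag_of_words("dog bites man")
bag_b <- bag_of_words("man bites dog")
same_order_pair <- identical(bag_a, bag_b)

# Negation attaches to a different word, but the bags are still identical
bag_c <- bag_of_words("the policy is good not bad")
bag_d <- bag_of_words("the policy is bad not good")
same_negation_pair <- identical(bag_c, bag_d)

print(c(order = same_order_pair, negation = same_negation_pair))
```

Both comparisons return TRUE: under the bag-of-words assumptions, each pair of sentences is indistinguishable even though their meanings are opposite.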

Bag of words assumption

  1. Time flies like an arrow.
  2. Fruit flies like a banana.
            time  flies  fruit  like  an  a  banana  arrow
Sentence 1     1      1      0     1   1  0       0      1
Sentence 2     0      1      1     1   0  1       1      0
  • The dependency structure between words in each sentence is lost
  • The word “flies” has two different meanings (a verb used metaphorically versus a noun naming an insect)
  • The word “like” has two different meanings (preposition versus verb)
  • The “joke” is no longer funny
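
The table for these two sentences can be rebuilt by hand in base R. This is only a sketch of the counting logic, not how quanteda does it internally. The column order is typed in to match the slide, and the variable names are assumptions.

```r
sentences <- c("Sentence 1" = "Time flies like an arrow.",
               "Sentence 2" = "Fruit flies like a banana.")

# Lowercase, drop the full stop, and split on spaces
clean <- gsub(".", "", tolower(sentences), fixed = TRUE)
tokens_list <- strsplit(clean, " ", fixed = TRUE)
names(tokens_list) <- names(sentences)

# Features (columns), in the same order as the table on the slide
vocab <- c("time", "flies", "fruit", "like", "an", "a", "banana", "arrow")

# Start from a matrix of zeros: one row per document, one column per feature
dfm_by_hand <- matrix(0L, nrow = length(tokens_list), ncol = length(vocab),
                      dimnames = list(names(tokens_list), vocab))

# Add one to the relevant cell for every token
for (d in seq_along(tokens_list)) {
  for (w in tokens_list[[d]]) {
    dfm_by_hand[d, w] <- dfm_by_hand[d, w] + 1L
  }
}

print(dfm_by_hand)
```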

Bags of words

# Load the quanteda library
library(quanteda)
# Attach magrittr so the %>% pipe used below is available
library(magrittr)

# Convert the sdg data.frame into a corpus
sdg_corpus <- corpus(sdg, text_field = "long_description")

# Take the corpus 
sdg_dfm <- sdg_corpus %>% 
           # Tokenize (split) the corpus into individual words
           tokens() %>% 
           # Construct a document-feature matrix
           dfm()

# Print the dfm
sdg_dfm
Document-feature matrix of: 17 documents, 1,085 features (86.41% sparse) and 2 docvars.
       features
docs    end poverty in all its forms everywhere by 2030  ,
  text1   2       5  9   7   3     2          2  6    5 24
  text2   3       0 11   5   0     2          0  7    4 43
  text3   2       0  7   7   0     0          0  8    6 33
  text4   0       0  6   8   0     0          0  9    8 39
  text5   1       0  5   9   0     3          1  0    0 11
  text6   1       0  3   5   0     0          0  8    6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 1,075 more features ]

Coding Interlude

Wait, what is this %>% thing?

  • This is called a “pipe”
  • It takes the output of one function and passes it as the first argument to the next function

E.g.

my_vector <- c(1,2,3)
mean(my_vector) %>% sqrt()
[1] 1.414214
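
To see that the pipe only changes how the code reads, not what it computes, the sketch below writes the same calculation three ways. It assumes the magrittr package is installed; the native |> pipe needs R 4.1 or later.

```r
library(magrittr)

my_vector <- c(1, 2, 3)

# Nested call: read from the inside out
nested <- sqrt(mean(my_vector))

# magrittr pipe: read from left to right
piped <- my_vector %>% mean() %>% sqrt()

# Native pipe (R 4.1 or later), no package needed
native <- my_vector |> mean() |> sqrt()

print(c(nested = nested, piped = piped, native = native))
```

All three give the square root of the mean of my_vector, the same value as in the example above.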

Bags of words

How many features are there in this dfm?

ncol(sdg_dfm)
[1] 1085

And how many documents?

nrow(sdg_dfm)
[1] 17

And what are the most common features in this dfm?

topfeatures(sdg_dfm, 10)
      and         ,       the        of        to        in        by countries 
      476       464       165       161       140       115       105        85 
      all       for 
       80        77 

Top features

Word sequences/N-grams

N-grams

Contiguous sequence of words from document (1-gram, unigram; 2-gram, bigram)
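
As a minimal illustration of the definition, n-grams can be built in base R by pasting each word to the words that follow it. The word vector is an assumed example. The underscore separator is chosen to mirror the default in quanteda's tokens_ngrams().

```r
# An illustrative tokenized sentence (assumed example)
words <- c("time", "flies", "like", "an", "arrow")
n <- length(words)

# Bigrams: each word joined to the word that follows it
bigrams <- paste(words[1:(n - 1)], words[2:n], sep = "_")

# Trigrams: each word joined to the next two words
trigrams <- paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n], sep = "_")

print(bigrams)   # "time_flies" "flies_like" "like_an" "an_arrow"
print(trigrams)  # "time_flies_like" "flies_like_an" "like_an_arrow"
```

A document with n tokens has n - 1 bigrams and n - 2 trigrams, so adding longer n-grams quickly increases the number of distinct features.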

Word sequences/N-grams

sdg_dfm <- sdg_corpus %>% 
                    # Split the corpus into individual words
                    tokens() %>%
                    # Construct a document-feature matrix
                    dfm()

sdg_dfm

Word sequences/N-grams

sdg_dfm_bigram <- sdg_corpus %>% 
                    # Split the corpus into individual words
                    tokens() %>%
                    # Construct uni-grams and bi-grams
                    tokens_ngrams(1:2) %>%
                    # Construct a document-feature matrix
                    dfm()

sdg_dfm_bigram
Document-feature matrix of: 17 documents, 4,337 features (90.46% sparse) and 2 docvars.
       features
docs    end poverty in all its forms everywhere by 2030  ,
  text1   2       5  9   7   3     2          2  6    5 24
  text2   3       0 11   5   0     2          0  7    4 43
  text3   2       0  7   7   0     0          0  8    6 33
  text4   0       0  6   8   0     0          0  9    8 39
  text5   1       0  5   9   0     3          1  0    0 11
  text6   1       0  3   5   0     0          0  8    6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 4,327 more features ]

Word sequences/N-grams

sdg_dfm_trigram <- sdg_corpus %>% 
                    # Split the corpus into individual words
                    tokens() %>%
                    # Construct uni-grams, bi-grams and tri-grams
                    tokens_ngrams(1:3) %>%
                    # Construct a document-feature matrix
                    dfm()

sdg_dfm_trigram
Document-feature matrix of: 17 documents, 8,685 features (91.86% sparse) and 2 docvars.
       features
docs    end poverty in all its forms everywhere by 2030  ,
  text1   2       5  9   7   3     2          2  6    5 24
  text2   3       0 11   5   0     2          0  7    4 43
  text3   2       0  7   7   0     0          0  8    6 33
  text4   0       0  6   8   0     0          0  9    8 39
  text5   1       0  5   9   0     3          1  0    0 11
  text6   1       0  3   5   0     0          0  8    6 21
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 8,675 more features ]

Word sequences/N-grams

How many features are there in these dfms?

ncol(sdg_dfm)
[1] 1085
ncol(sdg_dfm_bigram)
[1] 4337
ncol(sdg_dfm_trigram)
[1] 8685

Strategies for feature selection

  • This can lead to a lot of features!

  • For this example (very small) corpus:

    • 17 documents

    • 1085 unique words

    • 4337 unique 1-gram and 2-gram sequences

    • 8685 unique 1-gram, 2-gram and 3-gram sequences

  • The resulting dfms are also very sparse – they contain a high fraction of zeros because most n-grams do not appear in most documents

sparsity(sdg_dfm)
[1] 0.8640824
sparsity(sdg_dfm_bigram)
[1] 0.9046237
sparsity(sdg_dfm_trigram)
[1] 0.9186359
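
Sparsity is simply the share of cells in the matrix that are zero. The sketch below computes it by hand for the two example sentences from the bag-of-words slide. The matrix is typed in manually for illustration; quanteda's sparsity() reports the same quantity for a real dfm.

```r
# Word counts for the two example sentences, entered by hand
toy_dfm <- rbind(
  "Sentence 1" = c(time = 1, flies = 1, fruit = 0, like = 1,
                   an = 1, a = 0, banana = 0, arrow = 1),
  "Sentence 2" = c(time = 0, flies = 1, fruit = 1, like = 1,
                   an = 0, a = 1, banana = 1, arrow = 0)
)

# Proportion of cells equal to zero: 6 zeros out of 16 cells
sparsity_by_hand <- mean(toy_dfm == 0)

print(sparsity_by_hand)  # 0.375
```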

Strategies for feature selection

  1. Reduce complexity

    • Convert to lowercase (automatic in quanteda), remove punctuation (not automatic in quanteda)
  2. Deliberate disregard

    • Ignore words that have no substantive content (“stop” words)
  3. Word stemming/lemmatization

    • Define some words as equivalent to each other (school, schools, schooling, etc)
  4. Filter by frequency

    • Document frequency: Ignore words that occur rarely across documents
    • Term frequency: Ignore words that occur rarely overall
  5. Purposive selection

    • Select only certain words to analyse
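
The five strategies above could be chained in quanteda roughly as follows. This is a sketch on two made-up texts rather than the SDG corpus. The thresholds and selected words are arbitrary assumptions, and the right choices depend on the application.

```r
library(quanteda)

# Illustrative texts (assumed for this sketch, not the SDG data)
texts <- c(d1 = "The schools are improving. Schooling matters!",
           d2 = "A school needs funding, and funding needs votes.")

# 1. Reduce complexity: remove punctuation and lowercase
toks <- tokens(texts, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# 2. Deliberate disregard: remove stop words
toks <- tokens_remove(toks, stopwords("en"))

# 3. Stemming: treat school, schools, schooling as the same feature
toks <- tokens_wordstem(toks)

full_dfm <- dfm(toks)

# 4. Filter by frequency: keep features that occur at least twice overall
trimmed_dfm <- dfm_trim(full_dfm, min_termfreq = 2)

# 5. Purposive selection: keep only chosen features
selected_dfm <- dfm_select(full_dfm, pattern = c("school", "fund"))

print(featnames(full_dfm))
print(featnames(trimmed_dfm))
print(featnames(selected_dfm))
```

Each step shrinks the feature set, which is the point: fewer and more meaningful columns, at the cost of discarding some information.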

Common stop words

stopwords("en")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"       "will"      

But no list should be considered universal…
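
One way to adapt a stop word list to an application is to remove from it the words you want to keep as features. The sketch below keeps negation words, which the default English list above includes. The choice of words and the example sentence are assumptions made for illustration.

```r
library(quanteda)

# Words we want to keep as features even though they are on the default list
negations <- c("no", "nor", "not")

# Default English list minus the negation words
custom_stopwords <- setdiff(stopwords("en"), negations)

# Illustrative sentence (assumed example)
toks <- tokens("This policy is not good", remove_punct = TRUE)

# "this" and "is" are removed, but "not" survives
kept <- as.list(tokens_remove(toks, custom_stopwords))[[1]]

print(kept)
```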

Other common stop words

stopwords("smart")
  [1] "a"             "a's"           "able"          "about"        
  [5] "above"         "according"     "accordingly"   "across"       
  [9] "actually"      "after"         "afterwards"    "again"        
 [13] "against"       "ain't"         "all"           "allow"        
 [17] "allows"        "almost"        "alone"         "along"        
 [21] "already"       "also"          "although"      "always"       
 [25] "am"            "among"         "amongst"       "an"           
 [29] "and"           "another"       "any"           "anybody"      
 [33] "anyhow"        "anyone"        "anything"      "anyway"       
 [37] "anyways"       "anywhere"      "apart"         "appear"       
 [41] "appreciate"    "appropriate"   "are"           "aren't"       
 [45] "around"        "as"            "aside"         "ask"          
 [49] "asking"        "associated"    "at"            "available"    
 [53] "away"          "awfully"       "b"             "be"           
 [57] "became"        "because"       "become"        "becomes"      
 [61] "becoming"      "been"          "before"        "beforehand"   
 [65] "behind"        "being"         "believe"       "below"        
 [69] "beside"        "besides"       "best"          "better"       
 [73] "between"       "beyond"        "both"          "brief"        
 [77] "but"           "by"            "c"             "c'mon"        
 [81] "c's"           "came"          "can"           "can't"        
 [85] "cannot"        "cant"          "cause"         "causes"       
 [89] "certain"       "certainly"     "changes"       "clearly"      
 [93] "co"            "com"           "come"          "comes"        
 [97] "concerning"    "consequently"  "consider"      "considering"  
[101] "contain"       "containing"    "contains"      "corresponding"
[105] "could"         "couldn't"      "course"        "currently"    
[109] "d"             "definitely"    "described"     "despite"      
[113] "did"           "didn't"        "different"     "do"           
[117] "does"          "doesn't"       "doing"         "don't"        
[121] "done"          "down"          "downwards"     "during"       
[125] "e"             "each"          "edu"           "eg"           
[129] "eight"         "either"        "else"          "elsewhere"    
[133] "enough"        "entirely"      "especially"    "et"           
[137] "etc"           "even"          "ever"          "every"        
[141] "everybody"     "everyone"      "everything"    "everywhere"   
[145] "ex"            "exactly"       "example"       "except"       
[149] "f"             "far"           "few"           "fifth"        
[153] "first"         "five"          "followed"      "following"    
[157] "follows"       "for"           "former"        "formerly"     
[161] "forth"         "four"          "from"          "further"      
[165] "furthermore"   "g"             "get"           "gets"         
[169] "getting"       "given"         "gives"         "go"           
[173] "goes"          "going"         "gone"          "got"          
[177] "gotten"        "greetings"     "h"             "had"          
[181] "hadn't"        "happens"       "hardly"        "has"          
[185] "hasn't"        "have"          "haven't"       "having"       
[189] "he"            "he's"          "hello"         "help"         
[193] "hence"         "her"           "here"          "here's"       
[197] "hereafter"     "hereby"        "herein"        "hereupon"     
[201] "hers"          "herself"       "hi"            "him"          
[205] "himself"       "his"           "hither"        "hopefully"    
[209] "how"           "howbeit"       "however"       "i"            
[213] "i'd"           "i'll"          "i'm"           "i've"         
[217] "ie"            "if"            "ignored"       "immediate"    
[221] "in"            "inasmuch"      "inc"           "indeed"       
[225] "indicate"      "indicated"     "indicates"     "inner"        
[229] "insofar"       "instead"       "into"          "inward"       
[233] "is"            "isn't"         "it"            "it'd"         
[237] "it'll"         "it's"          "its"           "itself"       
[241] "j"             "just"          "k"             "keep"         
[245] "keeps"         "kept"          "know"          "knows"        
[249] "known"         "l"             "last"          "lately"       
[253] "later"         "latter"        "latterly"      "least"        
[257] "less"          "lest"          "let"           "let's"        
[261] "like"          "liked"         "likely"        "little"       
[265] "look"          "looking"       "looks"         "ltd"          
[269] "m"             "mainly"        "many"          "may"          
[273] "maybe"         "me"            "mean"          "meanwhile"    
[277] "merely"        "might"         "more"          "moreover"     
[281] "most"          "mostly"        "much"          "must"         
[285] "my"            "myself"        "n"             "name"         
[289] "namely"        "nd"            "near"          "nearly"       
[293] "necessary"     "need"          "needs"         "neither"      
[297] "never"         "nevertheless"  "new"           "next"         
[301] "nine"          "no"            "nobody"        "non"          
[305] "none"          "noone"         "nor"           "normally"     
[309] "not"           "nothing"       "novel"         "now"          
[313] "nowhere"       "o"             "obviously"     "of"           
[317] "off"           "often"         "oh"            "ok"           
[321] "okay"          "old"           "on"            "once"         
[325] "one"           "ones"          "only"          "onto"         
[329] "or"            "other"         "others"        "otherwise"    
[333] "ought"         "our"           "ours"          "ourselves"    
[337] "out"           "outside"       "over"          "overall"      
[341] "own"           "p"             "particular"    "particularly" 
[345] "per"           "perhaps"       "placed"        "please"       
[349] "plus"          "possible"      "presumably"    "probably"     
[353] "provides"      "q"             "que"           "quite"        
[357] "qv"            "r"             "rather"        "rd"           
[361] "re"            "really"        "reasonably"    "regarding"    
[365] "regardless"    "regards"       "relatively"    "respectively" 
[369] "right"         "s"             "said"          "same"         
[373] "saw"           "say"           "saying"        "says"         
[377] "second"        "secondly"      "see"           "seeing"       
[381] "seem"          "seemed"        "seeming"       "seems"        
[385] "seen"          "self"          "selves"        "sensible"     
[389] "sent"          "serious"       "seriously"     "seven"        
[393] "several"       "shall"         "she"           "should"       
[397] "shouldn't"     "since"         "six"           "so"           
[401] "some"          "somebody"      "somehow"       "someone"      
[405] "something"     "sometime"      "sometimes"     "somewhat"     
[409] "somewhere"     "soon"          "sorry"         "specified"    
[413] "specify"       "specifying"    "still"         "sub"          
[417] "such"          "sup"           "sure"          "t"            
[421] "t's"           "take"          "taken"         "tell"         
[425] "tends"         "th"            "than"          "thank"        
[429] "thanks"        "thanx"         "that"          "that's"       
[433] "thats"         "the"           "their"         "theirs"       
[437] "them"          "themselves"    "then"          "thence"       
[441] "there"         "there's"       "thereafter"    "thereby"      
[445] "therefore"     "therein"       "theres"        "thereupon"    
[449] "these"         "they"          "they'd"        "they'll"      
[453] "they're"       "they've"       "think"         "third"        
[457] "this"          "thorough"      "thoroughly"    "those"        
[461] "though"        "three"         "through"       "throughout"   
[465] "thru"          "thus"          "to"            "together"     
[469] "too"           "took"          "toward"        "towards"      
[473] "tried"         "tries"         "truly"         "try"          
[477] "trying"        "twice"         "two"           "u"            
[481] "un"            "under"         "unfortunately" "unless"       
[485] "unlikely"      "until"         "unto"          "up"           
[489] "upon"          "us"            "use"           "used"         
[493] "useful"        "uses"          "using"         "usually"      
[497] "uucp"          "v"             "value"         "various"      
[501] "very"          "via"           "viz"           "vs"           
[505] "w"             "want"          "wants"         "was"          
[509] "wasn't"        "way"           "we"            "we'd"         
[513] "we'll"         "we're"         "we've"         "welcome"      
[517] "well"          "went"          "were"          "weren't"      
[521] "what"          "what's"        "whatever"      "when"         
[525] "whence"        "whenever"      "where"         "where's"      
[529] "whereafter"    "whereas"       "whereby"       "wherein"      
[533] "whereupon"     "wherever"      "whether"       "which"        
[537] "while"         "whither"       "who"           "who's"        
[541] "whoever"       "whole"         "whom"          "whose"        
[545] "why"           "will"          "willing"       "wish"         
[549] "with"          "within"        "without"       "won't"        
[553] "wonder"        "would"         "would"         "wouldn't"     
[557] "x"             "y"             "yes"           "yet"          
[561] "you"           "you'd"         "you'll"        "you're"       
[565] "you've"        "your"          "yours"         "yourself"     
[569] "yourselves"    "z"             "zero"         

Stop words example


End poverty in all its forms everywhere


End hunger, achieve food security and improved nutrition and promote sustainable agriculture


Ensure healthy lives and promote well-being for all at all ages

Stop words can matter

Compare…

It was a nice party, Pablo had brought his ukulele.

To…

It was a nice party, but Pablo had brought his ukulele.

            nice  party  Pablo  brought  ukulele
Sentence 1     1      1      1        1        1
Sentence 2     1      1      1        1        1
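The table above can be reproduced with a minimal base-R sketch. The `stop_words` vector here is a hand-picked illustrative subset rather than the full list returned by `stopwords("en")`, and `tokenize()` is a hypothetical helper, not a quanteda function.

```r
s1 <- "It was a nice party, Pablo had brought his ukulele."
s2 <- "It was a nice party, but Pablo had brought his ukulele."

# Illustrative subset of a stop word list (assumption: not the full list)
stop_words <- c("it", "was", "a", "but", "had", "his")

# Hypothetical helper: lowercase and split on anything that is not a letter
tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")[[1]]

# Drop stop words from each sentence
remove_stop <- function(words) words[!words %in% stop_words]
content1 <- remove_stop(tokenize(s1))
content2 <- remove_stop(tokenize(s2))

# The sentences differ before stop word removal...
same_before <- identical(tokenize(s1), tokenize(s2))
# ...but are indistinguishable afterwards, although "but" reverses the meaning
same_after <- identical(content1, content2)
```

After removal, both sentences reduce to the same five tokens, so any downstream method sees them as identical.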

Removing stop words in R

sdg_dfm <- sdg_corpus %>% 
           tokens() %>% 
           dfm()

sdg_dfm

Removing stop words in R

sdg_dfm_no_stop <- sdg_corpus %>% 
           tokens(remove_punct = TRUE) %>% 
           tokens_remove(stopwords("en")) %>%
           dfm()

sdg_dfm_no_stop
Document-feature matrix of: 17 documents, 1,034 features (87.62% sparse) and 2 docvars.
       features
docs    end poverty forms everywhere 2030 eradicate extreme people currently
  text1   2       5     2          2    5         1       2      2         1
  text2   3       0     2          0    4         0       2      2         0
  text3   2       0     0          0    6         0       0      0         0
  text4   0       0     0          0    8         0       0      0         0
  text5   1       0     3          1    0         0       0      0         0
  text6   1       0     0          0    6         0       0      1         0
       features
docs    measured
  text1        1
  text2        0
  text3        0
  text4        0
  text5        0
  text6        0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 1,024 more features ]

Top features

Stemming and lemmatization

Stemming

Process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers operate on single words without knowledge of the context.

Example:

Production, producer, produce, produces, produced → produc

Lemmatization

Algorithmic process of converting words to their lemma forms.

Example:

am, are, is → be

Stemming is a crude heuristic process that chops off the ends of words. Lemmatization is smarter, but slower.
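To make the contrast concrete, the lemmatization example can be sketched as a lookup table in base R. This is only an illustration under strong assumptions: `lemma_table` contains just the three forms from the example, and real lemmatizers rely on a full vocabulary and usually on part-of-speech information.

```r
# Toy lemma table containing only the forms from the example above (assumption)
lemma_table <- c(am = "be", are = "be", is = "be")

# Hypothetical helper: replace any word found in the table with its lemma
lemmatize <- function(words) {
  hit <- words %in% names(lemma_table)
  words[hit] <- lemma_table[words[hit]]
  words
}

lemmas <- lemmatize(c("am", "are", "is", "party"))
```

Words that are not in the table, such as "party", are returned unchanged. A stemmer could not map "am", "are" and "is" to a common form, because the three words share no common prefix.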

Stemming example


End poverty in all its forms everywhere


End hunger, achieve food security and improved nutrition and promote sustainable agriculture


Ensure healthy lives and promote well-being for all at all ages

Stemming example


End poverti in all it form everywher


End hunger , achiev food secur and improv nutrit and promot sustain agricultur


Ensure healthi live and promot well-b for all at all age

Stemming in R

sdg_dfm <- sdg_corpus %>% 
           tokens() %>% 
           dfm()

sdg_dfm

Stemming in R

sdg_dfm_stem <- sdg_corpus %>% 
           tokens(remove_punct = TRUE) %>% 
           tokens_wordstem() %>%
           dfm()

sdg_dfm_stem
Document-feature matrix of: 17 documents, 872 features (84.37% sparse) and 2 docvars.
       features
docs    end poverti in all it form everywher by 2030 erad
  text1   2       5  9   7  3    2         2  6    5    2
  text2   3       0 11   5  0    2         0  7    4    0
  text3   2       0  7   7  0    0         0  8    6    0
  text4   0       0  6   8  0    0         0  9    8    0
  text5   1       0  5   9  0    3         1  0    0    0
  text6   1       0  3   5  0    0         0  8    6    0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 862 more features ]

Top features

Filter by frequency

Very rare words and very frequent words are unlikely to be helpful in discriminating between documents.

Frequency-filtering in R

sdg_dfm <- sdg_corpus %>% 
           tokens() %>% 
           dfm()

sdg_dfm

Frequency-filtering in R

sdg_dfm_filtered <- sdg_corpus %>% 
           tokens(remove_punct = TRUE) %>% 
           dfm() %>%
           # Remove all words that appear fewer than 3 times in the corpus
           dfm_trim(min_termfreq = 3)

sdg_dfm_filtered
Document-feature matrix of: 17 documents, 323 features (70.86% sparse) and 2 docvars.
       features
docs    end poverty in all its forms everywhere by 2030 eradicate
  text1   2       5  9   7   3     2          2  6    5         1
  text2   3       0 11   5   0     2          0  7    4         0
  text3   2       0  7   7   0     0          0  8    6         0
  text4   0       0  6   8   0     0          0  9    8         0
  text5   1       0  5   9   0     3          1  0    0         0
  text6   1       0  3   5   0     0          0  8    6         0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 313 more features ]
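The logic behind frequency trimming can be sketched in base R on the three example goals used earlier. This is a simplified illustration rather than quanteda's implementation, and the threshold is lowered to 2 because the toy corpus is so small.

```r
goals <- c(
  "End poverty in all its forms everywhere",
  "End hunger, achieve food security and improved nutrition and promote sustainable agriculture",
  "Ensure healthy lives and promote well-being for all at all ages"
)

# Lowercase and split on anything that is not a letter or a hyphen
tokens_list <- strsplit(tolower(goals), "[^a-z-]+")

# Total frequency of each term across the whole corpus
term_freq <- table(unlist(tokens_list))

# Keep only terms that occur at least twice (assumed threshold for this toy corpus)
kept <- names(term_freq)[term_freq >= 2]
```

Only "all", "and", "end" and "promote" survive the filter, which shows how sharply trimming reduces the number of features.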

Feature Comparison

dim(sdg_dfm)
[1]   17 1085
dim(sdg_dfm_bigram)
[1]   17 4337
dim(sdg_dfm_trigram)
[1]   17 8685
dim(sdg_dfm_no_stop)
[1]   17 1034
dim(sdg_dfm_stem)
[1]  17 872
dim(sdg_dfm_filtered)
[1]  17 323
sparsity(sdg_dfm)
[1] 0.8640824
sparsity(sdg_dfm_bigram)
[1] 0.9046237
sparsity(sdg_dfm_trigram)
[1] 0.9186359
sparsity(sdg_dfm_no_stop)
[1] 0.876152
sparsity(sdg_dfm_stem)
[1] 0.8436994
sparsity(sdg_dfm_filtered)
[1] 0.7086141
  • Feature selection matters! See Denny and Spirling, 2017

  • Just seven (binary) preprocessing decisions lead to a total of \(2^7 = 128\) possible feature matrices

  • These selection decisions can have substantive implications for the inferences we draw from QTA
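The count of 128 can be checked by enumerating every combination of seven on/off choices. The decision labels below are illustrative names chosen for this sketch, not identifiers from any package.

```r
# Illustrative labels for seven binary preprocessing decisions (assumption)
decisions <- c("remove_punctuation", "remove_numbers", "lowercase", "stem",
               "remove_stopwords", "add_ngrams", "drop_infrequent")

# Every combination of TRUE/FALSE across the seven decisions
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(decisions)))
names(grid) <- decisions

n_specifications <- nrow(grid)
```

Each row of `grid` corresponds to one possible preprocessing pipeline, and so to one possible document-feature matrix.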

Choosing between representations

How should we select between these representations?

  1. There is no single “best” dfm

  2. The optimal representation of a corpus will depend on the particular research task

    • Would you want to remove stop words when trying to detect gendered hate speech?
    • Would you want to stem if you wanted to measure future-oriented language?
    • Would you want to discard rare words when calculating linguistic complexity?
  3. We need to design ways of validating the representations we construct

Break

Dictionaries

Motivating Example

Are female politicians less aggressive than male politicians? (Hargrave and Blumenau, 2022)

A repeated claim in the qualitative literature on gender and politics is that male and female politicians have distinct styles. Many political observers argue that women are less aggressive in political debate than their male colleagues. Most of the evidence for these claims is taken from small-N classical content analysis studies. We will review this question by applying an existing sentiment dictionary to a large-N corpus of parliamentary texts.

Motivating Example

How might we conceptualize “aggression” in the context of parliamentary debate?

Use of aggressive or combative language, which might include criticisms or insults; language that suggests forceful action; or declamatory or adversarial language.

  1. Theoretical conceptualization

    • Existing literature makes frequent reference to the importance of combative language in politics
  2. Empirical exploration/discovery

    • We can read and watch parliamentary debates to assess the ways in which aggression manifests in politicians’ speeches

Hand-coding: “Classic” content analysis

  • Key feature: use of “human” coders to implement a pre-defined coding scheme, by reading and coding texts

  • Human decision-making is the central feature of coding decisions, not a computer or other mechanized tool

  • Validity is usually the objective, rather than reliability

    • Validity: am I measuring what I am claiming to measure?
    • Reliability: am I able to reliably replicate my coding?
  • Example: hand-coding sentences into pre-defined categories

Bridging Qualitative and Quantitative Text Analyses

Dictionaries represent a hybrid procedure that bridges qualitative approaches and fully-automated text-as-data approaches

  • “Qualitative” since it involves identification of the concepts and associated keys/categories, and the textual features associated with each key/category

    • Dictionary construction involves a lot of contextual interpretation and qualitative judgment
  • “Quantitative” because it involves applying an algorithm to large corpora and presenting statistical summaries of results

    • Perfect reliability because there is no human decision making as part of the text analysis procedure

Rationale for dictionaries

  • Rather than count all words that occur, pre-define words as associated with specific meanings

  • Two components:

    1. key: the label for the equivalence class for the concept or canonical term
    2. values: (multiple) terms or patterns that are declared equivalent occurrences of the key class
  • A better metaphor is really a thesaurus: a canonical term or concept (the key) associated with equivalent synonyms (the values)

Key          Values
Dog          Dalmatian, Labrador, Poodle, Pug
Computation  Data, Number, Computer, Simulation
Genetics     Gene, DNA, Inherit

Counting words

A dictionary is just a list of words (\(m=1,...,M\)) that are all related to a common concept.

Aggression
stupid
dishonest
liar
idiot
ignorant
hate
fight
battle

Counting words

Applying a dictionary to a corpus of texts (\(i = 1,...,N\)) simply requires counting the number of times each word occurs in each text and summing them.

If \(W_{im}\) is the number of times dictionary word \(m\) appears in text \(i\), and \(N_i\) is the total number of words in text \(i\), then the dictionary score for document \(i\) is:

\[ t_i = \frac{\sum_{m=1}^M W_{im}}{N_i} \]

Or, the proportion of words in document \(i\) that appear in the dictionary.

Counting words

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

\[ t_i = \frac{\sum_{m=1}^M W_{im}}{N_i} = \frac{1+1}{24} = 0.083 \]
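The calculation above can be reproduced with a short base-R sketch. The tokenization rule (lowercase, split on non-letters) is an assumption made for illustration and is simpler than quanteda's tokenizer.

```r
sentence <- "That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind."

# The small illustrative aggression dictionary from the earlier slide
aggression_dict <- c("stupid", "dishonest", "liar", "idiot",
                     "ignorant", "hate", "fight", "battle")

# Simple tokenization: lowercase and split on anything that is not a letter
words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]

N_i  <- length(words)                                       # words in the text
W_im <- sapply(aggression_dict, function(m) sum(words == m)) # count per dictionary word
t_i  <- sum(W_im) / N_i                                     # dictionary score
```

Only "stupid" and "ignorant" are matched, so the score is 2/24, even though "barbaric" and "cruel" arguably signal aggression as well.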

Counting weighted words

A slight extension is to assign each word in the dictionary a weight that reflects how important the word is to the concept.

Aggression Weight
stupid .6
dishonest .2
lie .5
idiot .7
ignorant .3
brutal .4
violence .5
  • Weights are implicit in all dictionary approaches.
  • Typically, all words are counted equally, which implies a weight of 1 for every word.
  • This is not necessarily correct!

Counting weighted words

We can adjust the previous formula to incorporate the weights (\(s_m\)):

\[ t_i = \frac{\sum_{m=1}^M s_mW_{im}}{N_i} \]

Why normalise by \(N_i\)?

Some texts will be longer than others and we do not want these texts to mechanically be assigned higher scores.

Counting weighted words

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

\[ t_i = \frac{\sum_{m=1}^M s_mW_{im}}{N_i} = \frac{(1\cdot0.6)+(1\cdot0.3)}{24} = 0.0375 \]
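The weighted score can be sketched in the same way, using the weights from the table above. As before, the tokenization rule is a simplifying assumption.

```r
sentence <- "That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind."

# Weights s_m from the weighted dictionary table
s_m <- c(stupid = 0.6, dishonest = 0.2, lie = 0.5, idiot = 0.7,
         ignorant = 0.3, brutal = 0.4, violence = 0.5)

# Simple tokenization: lowercase and split on anything that is not a letter
words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]

# Count of each dictionary word in the text, in the same order as the weights
W_im <- sapply(names(s_m), function(m) sum(words == m))

# Weighted dictionary score, normalised by text length
t_i <- sum(s_m * W_im) / length(words)
```

Setting every element of `s_m` to 1 recovers the unweighted score, which makes the implicit equal-weighting assumption explicit.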

Weights or no weights?

Most applications of dictionary methods in social science use unweighted dictionaries.

Why learn this then?

  1. The equal weighting assumption is not necessarily reasonable or effective.
  2. The idea of assigning weights to words is something that will come up many times in future weeks.

Advantages of dictionaries: Many existing implementations

  1. Linguistic Inquiry and Word Count

    • 82 different word categories reflecting psychological states, emotions, thinking styles, and social concerns
  2. Lexicoder Sentiment Dictionary

    • 2,858 “negative” sentiment words and 1,709 “positive” sentiment words designed for the automated coding of sentiment in news coverage, legislative speech and other text
  3. Moral Foundations Dictionary

    • 5 category dictionary of moral terms
  4. Loughran-McDonald Sentiment Dictionary

    • 9 category sentiment dictionary, especially developed for financial analyses
  5. Martindale’s Regressive Imagery Dictionary

    • 43 category dictionary designed to measure primordial vs. conceptual thinking.

Many of these are directly available in quanteda. Some are available only for purchase.

Advantages of dictionaries: Multi-lingual

Disadvantage: Off-the-Shelf Dictionaries and Context

Applying off-the-shelf dictionaries to new contexts can be problematic:

  • Problem 1: polysemes – words that have multiple meanings

    • Loughran and McDonald classify sentiment for a corpus of 50,115 firm-year 10-K filings from 1994–2008
    • Almost three-fourths of the “negative” words in the general-purpose Harvard dictionary were typically not negative in a financial context: e.g. tax, cost, liability, foreign, vice, etc
  • Problem 2: Dictionaries often lack important words in a given context

    • e.g. negative financial words such as felony, litigation, restated, misstatement, and unanticipated
  • Problem 3: Some dictionaries might do more to pick up the topic of a document than the tone of a document

Applying dictionaries outside the domain for which they were developed can lead to serious errors (Grimmer and Stewart, 2013, 268)

Disadvantages of Dictionaries

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

“Terrible acts of brutality and violence have been carried out against the Rohingya people.”

  • Dictionaries may miss words that are important to the concept

    • “barbaric” is probably an aggressive word in this context
  • Dictionaries do not typically capture modifiers

    • “downright” is an intensifier (also: negators like “not good”)
  • Dictionaries often fail to capture all synonyms

    • “deliberate misconception” is parliamentary language for “lie”
  • Dictionaries may not capture the relevant concept

    • brutality/violence: descriptions, rather than expressions, of aggression

Application

Applying dictionaries in quanteda

library(quanteda)
aggression_texts <- read.csv("aggression_texts.csv")
aggression_words <- read.csv("aggression_words.csv")[,1]
  1. aggression_texts is a data.frame which includes 10937 sentences from parliamentary speeches
  2. aggression_words is a vector of 222 words from an existing “Aggression” dictionary

Our goal is to use aggression_words to score the texts in aggression_texts.

Aggressive Words?

print(aggression_words)
  [1] "abhor*"         "abus*"          "abusiv*"        "accus*"        
  [5] "afflict*"       "aggress*"       "aggressiv*"     "ambush*"       
  [9] "anger*"         "angri*"         "angrier*"       "angry*"        
 [13] "annihilat*"     "annoy*"         "annoyanc*"      "antagoniz*"    
 [17] "argu*"          "argument*"      "army*"          "arrow*"        
 [21] "assault*"       "attack*"        "aveng*"         "ax"            
 [25] "axe"            "axes"           "battl*"         "beak*"         
 [29] "beat*"          "beaten*"        "betray*"        "blade*"        
 [33] "blam*"          "bloody*"        "bother*"        "brawl*"        
 [37] "break*"         "brok*"          "broken*"        "brutal*"       
 [41] "cannon*"        "chid*"          "combat*"        "complain*"     
 [45] "conflict*"      "condemn*"       "controversy*"   "critic*"       
 [49] "cruel*"         "crush*"         "cut"            "cuts"          
 [53] "cutt*"          "damag*"         "decei*"         "defeat*"       
 [57] "degrad*"        "demolish*"      "depriv*"        "derid*"        
 [61] "despis*"        "destroy*"       "destruct*"      "destructiv*"   
 [65] "detest*"        "disagre*"       "disagreement*"  "disapprov*"    
 [69] "discontent*"    "dislik*"        "disput*"        "disturb*"      
 [73] "doubt*"         "enemi*"         "enemy*"         "enrag*"        
 [77] "exasperat*"     "controversial*" "critique"       "disparag*"     
 [81] "irritable"      "exploit*"       "exterminat*"    "feud*"         
 [85] "fierc*"         "fight*"         "fought*"        "furiou*"       
 [89] "fury*"          "gash*"          "grappl*"        "growl*"        
 [93] "grudg*"         "gun"            "gunn*"          "guns"          
 [97] "harm*"          "harsh*"         "hate*"          "hatr*"         
[101] "hit"            "hits"           "hitt*"          "homicid*"      
[105] "hostil*"        "hurt*"          "ingrat*"        "injur*"        
[109] "injury*"        "insult*"        "invad*"         "invas*"        
[113] "irat*"          "irk*"           "irritat*"       "jealou*"       
[117] "jealousy*"      "jeer*"          "kick*"          "kil*"          
[121] "kill*"          "knif*"          "kniv*"          "loath*"        
[125] "maim*"          "mistreat*"      "mock*"          "murder*"       
[129] "obliterat*"     "offend*"        "oppos*"         "predatory*"    
[133] "protest*"       "quarrel*"       "rage"           "rages"         
[137] "raging"         "rapin*"         "rebel*"         "rebell*"       
[141] "rebuk*"         "relentles*"     "reproach*"      "resent*"       
[145] "resentment*"    "retribut*"      "reveng*"        "revolt*"       
[149] "ridicul*"       "rip"            "ripp*"          "rips"          
[153] "rob"            "robb*"          "robs"           "sarcasm*"      
[157] "sarcastic*"     "scalp*"         "scof*"          "scoff*"        
[161] "scourg*"        "seiz*"          "sever*"         "severity*"     
[165] "shatter*"       "shoot*"         "shot*"          "shov*"         
[169] "slain*"         "slander*"       "slap*"          "slaughter*"    
[173] "slay*"          "slew*"          "smash*"         "snarl*"        
[177] "sneer*"         "spear*"         "spiteful*"      "spurn*"        
[181] "stab*"          "steal*"         "stol*"          "stolen*"       
[185] "strangl*"       "strif*"         "strik*"         "struck*"       
[189] "struggl*"       "stubborn*"      "sword*"         "taunt*"        
[193] "temper*"        "threat*"        "threaten*"      "tore"          
[197] "torment*"       "torn*"          "tortur*"        "traitor*"      
[201] "trampl*"        "treacherou*"    "treachery*"     "tyrant*"       
[205] "unkind*"        "vengeanc*"      "vengeful*"      "vex"           
[209] "vexing"         "violat*"        "violenc*"       "violent*"      
[213] "war"            "warring"        "warrior*"       "wars"          
[217] "weapon*"        "whip*"          "wound*"         "wrath*"        
[221] "football*"      "wreck*"        

The * character will pick up any token which begins with the relevant string.

For example, accus* ➡️ accuse, accuses, accused, etc.
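The wildcard behaviour can be illustrated in base R by converting the glob pattern to a regular expression with `utils::glob2rx()`. This is a sketch of the matching logic only. It assumes the dictionary is applied with glob-style matching and does not call quanteda itself.

```r
# Candidate tokens chosen for illustration
candidates <- c("accuse", "accuses", "accused", "accusation", "excuse", "reaccuse")

# A trailing * matches any token that begins with the stem
wildcard_hits <- grepl(utils::glob2rx("accus*"), candidates)

# An entry without * (such as "cut" in the list above) matches only the exact token
exact_hits <- grepl(utils::glob2rx("cut"), c("cut", "cuts", "cutting"))
```

The first four candidates match, while "excuse" and "reaccuse" do not because they do not begin with "accus". The exact-match case explains why the dictionary lists "cut", "cuts" and "cutt*" separately rather than the much broader "cut*".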

Applying Dictionaries in Quanteda

# First we convert the texts to a corpus object:
aggression_corpus <- corpus(aggression_texts, text_field = "texts")

# Then we tokenize the texts and create a dfm:
aggression_tokens <- tokens(aggression_corpus)
aggression_dfm <- dfm(aggression_tokens)

# We use the aggression words to create a dictionary object:
aggression_dictionary <- dictionary(list(aggression = aggression_words))

# Finally, we apply the dictionary to the dfm using the dfm_lookup function:
aggression_dfm_dictionary <- dfm_lookup(aggression_dfm,
                                         dictionary = aggression_dictionary)

Applying Dictionaries in Quanteda

print(aggression_dfm_dictionary)
Document-feature matrix of: 10,937 documents, 1 feature (79.05% sparse) and 1 docvar.
       features
docs    aggression
  text1          0
  text2          0
  text3          0
  text4          0
  text5          1
  text6          1
[ reached max_ndoc ... 10,931 more documents ]

aggression_dfm_dictionary is a document-feature matrix in which the only “feature” is the dictionary count for each text

Applying Dictionaries in Quanteda

Finally, we can calculate the score by dividing the dictionary counts by the number of words in each text:

aggression_texts$proportions <- as.numeric(aggression_dfm_dictionary[,1]) /
  ntoken(aggression_corpus)
summary(aggression_texts$proportions)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.00811 0.00000 0.19048 
hist(aggression_texts$proportions)

Validation

Validation tests

  • Applying dictionaries outside the domain for which they were developed can lead to errors.

  • One way of assessing the seriousness of these errors is to conduct validation tests

  • There are many forms of these tests!

  • All share a core idea: are the texts that are flagged by the dictionary more representative of the relevant concept than other texts?

Types of validation

There are many approaches to assessing validity of a measure, \(m_1\), for a target concept, \(\mu_1\):

  1. Face validity

    • Does \(m_1\) pass basic sanity checks?
    • Example: Correlation between aggression measure and different debate types.
  2. Concurrent validity

    • Does \(m_1\) correlate with a previously validated measure, \(m_2\) for the same concept?
    • Example: Correlation between aggression measure and measure of vocal pitch.
  3. Convergent validity

    • Does \(m_1\) positively correlate with \(m_2\) for a different target concept, \(\mu_2\), where we expect \(\mu_1\) and \(\mu_2\) to be positively correlated?
    • Example: Correlation between aggression measure and measure for negativity.
  4. Discriminant validity

    • Is \(m_1\) uncorrelated or negatively correlated with \(m_2\) for a different concept, \(\mu_2\), where we expect \(\mu_1\) and \(\mu_2\) to be unrelated or negatively related?
    • Example: Correlation between aggression and happiness.
  5. Predictive validity

    • Does \(m_1\) correlate with some covariate, \(x\), which we expect to be correlated with \(\mu_1\)?
    • Example: Correlation between aggression and time, where we expect aggression to be higher in certain periods.

Human Judgement as a “Gold Standard”

  • Comparisons to human judgements of a target concept, \(\mu\), are often thought to be the “gold standard” of validation

  • This is based on the (often implicit) assumption that real people can accurately identify and label examples of a given concept (“you know it when you see it”)

  • This assumption may not be met due to…

    • Misinterpretation
    • Poor/unclear conceptualisation
    • Lack of coder training
    • Etc
  • The caratage of the gold standard will therefore vary across applications

Face validity (1)

Intuition: Does our measure of aggression vary in sensible ways?

In this case, one obvious test is whether MPs’ speeches are more aggressive during Prime Minister’s Questions (PMQs) than in other types of debate.

Face validity (1)

str(aggression_texts)
'data.frame':   10937 obs. of  4 variables:
 $ texts      : chr  "Is it not more important to work hard to open up trade between eastern and western Europe than to allow the Eur"| __truncated__ "Also, the Bill will consider aspects of the procedures applying to boards of inquiry." "On that measure, NHS provision per head of population in Cornwall is about half the national average." "Making it a criminal offence would help to make it clear that forced marriage is completely and utterly unacceptable." ...
 $ human      : logi  NA NA NA NA NA NA ...
 $ debate_type: chr  "prime_ministers_questions" "legislation" "opposition_day" "question_time" ...
 $ proportions: num  0 0 0 0 0.0233 ...
table(aggression_texts$debate_type)

              legislation            opposition_day prime_ministers_questions 
                     4720                       972                      1376 
            question_time 
                     3869 

Face validity (1)

library(tidyverse) # Load libraries

aggression_texts %>% # Pipe the aggression texts object
  group_by(debate_type) %>% # Group data by the debate_type variable
  summarise(mean_dictionary = mean(proportions)) # Calculate the mean dictionary score for each type
# A tibble: 4 × 2
  debate_type               mean_dictionary
  <chr>                               <dbl>
1 legislation                       0.00684
2 opposition_day                    0.00734
3 prime_ministers_questions         0.0157 
4 question_time                     0.00715

Face validity (1)

summary(lm(proportions ~ debate_type, data = aggression_texts))

Call:
lm(formula = proportions ~ debate_type, data = aggression_texts)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.015708 -0.007154 -0.006837 -0.006837  0.174768 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          0.0068367  0.0002633  25.966   <2e-16 ***
debate_typeopposition_day            0.0005019  0.0006371   0.788    0.431    
debate_typeprime_ministers_questions 0.0088718  0.0005542  16.009   <2e-16 ***
debate_typequestion_time             0.0003170  0.0003923   0.808    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01809 on 10933 degrees of freedom
Multiple R-squared:  0.02486,   Adjusted R-squared:  0.0246 
F-statistic: 92.93 on 3 and 10933 DF,  p-value: < 2.2e-16

There is clear evidence that PMQ debates tend to have higher levels of aggressive language than other debates.
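Because debate_type is the only predictor, the regression simply re-expresses the group means: the intercept is the mean for the baseline category (legislation) and each coefficient is a difference from that baseline. A quick check using the estimates printed above:

```r
# Estimates copied from the regression output above
intercept <- 0.0068367
coefs <- c(opposition_day            = 0.0005019,
           prime_ministers_questions = 0.0088718,
           question_time             = 0.0003170)

# Implied mean dictionary score for each debate type
implied_means <- c(legislation = intercept, intercept + coefs)

# How much higher is the PMQ mean than the legislation mean?
pmq_ratio <- implied_means[["prime_ministers_questions"]] /
  implied_means[["legislation"]]
```

The implied means match the grouped summary shown earlier, and the average dictionary score in PMQs is roughly 2.3 times the score in legislative debates.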

Face validity (2)

How does this approach perform? Let’s look at the top-scoring sentences:

score text
text3998 0.19 I fully appreciate that it is the Opposition's job to oppose, but there are times when opposition is destructive.
text7416 0.18 We unequivocally condemn Hamas's dreadful and murderous rocket attacks and defend Israel's right to defend itself.
text2941 0.14 They were asking ridiculous prices, because they had the sole remedy for a complaint, so could exploit that situation.
text106 0.13 Terrible acts of brutality and violence have been carried out against the Rohingya people.
text144 0.13 The motion condemns the early release scheme for those who have assaulted police officers.

While some seem reasonable, others indicate that we are picking up topic rather than tone.
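To see how a dictionary can pick up topic rather than tone, here is a minimal base-R sketch of proportion-based scoring. The word list is hypothetical and is not the aggression dictionary used in the lecture. The first sentence is invented for illustration, and the second is taken from the table above.

```r
# Hypothetical word list, for illustration only
toy_dictionary <- c("condemn", "murderous", "attacks", "destructive",
                    "ridiculous", "brutality", "violence")

# Proportion of a sentence's tokens that appear in the dictionary,
# with every dictionary word given the same weight
score_sentence <- function(sentence, dictionary) {
  tokens <- tolower(unlist(strsplit(sentence, "[^[:alpha:]']+")))
  tokens <- tokens[tokens != ""]
  mean(tokens %in% dictionary)
}

sentences <- c(
  tone  = "That is a ridiculous and destructive argument from the minister.",
  topic = "Terrible acts of brutality and violence have been carried out against the Rohingya people."
)

scores <- sapply(sentences, score_sentence, dictionary = toy_dictionary)
print(scores)
```

Both sentences receive a non-zero score, but only the first is aggressive in tone towards another speaker. The second describes violence without being an aggressive statement, and a word-count approach cannot tell the two apart.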

Comparison to Human Judgement

  • The aggression_texts data.frame contains a variable, human, which records the results of a human validation exercise.
str(aggression_texts)
'data.frame':   10937 obs. of  4 variables:
 $ texts      : chr  "Is it not more important to work hard to open up trade between eastern and western Europe than to allow the Eur"| __truncated__ "Also, the Bill will consider aspects of the procedures applying to boards of inquiry." "On that measure, NHS provision per head of population in Cornwall is about half the national average." "Making it a criminal offence would help to make it clear that forced marriage is completely and utterly unacceptable." ...
 $ human      : logi  NA NA NA NA NA NA ...
 $ debate_type: chr  "prime_ministers_questions" "legislation" "opposition_day" "question_time" ...
 $ proportions: num  0 0 0 0 0.0233 ...
table(dictionary = aggression_texts$proportions > 0, 
      human = aggression_texts$human)
          human
dictionary FALSE TRUE
     FALSE   674  124
     TRUE     75  127
(127 + 674)/1000
[1] 0.801
  • Is this good?
  • Accuracy = \(\frac{674 + 127}{1000} = 80\%\)
  • Sensitivity = \(\frac{127}{251} = 51\%\)
  • Specificity = \(\frac{674}{749} = 90\%\)

Implication: Our aggression dictionary rarely detects non-aggressive texts as aggressive but frequently fails to detect aggressive texts.
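The three measures can be reproduced directly from the cross-tabulation. The sketch below re-enters the counts by hand, with object names chosen here for illustration. It also computes one quantity that is not on the slide: the accuracy of a naive rule that labels every text as non-aggressive, which is a useful benchmark for the "Is this good?" question.

```r
# Counts copied from the table above (rows: dictionary, columns: human)
confusion <- matrix(c(674, 75, 124, 127), nrow = 2,
                    dimnames = list(dictionary = c("FALSE", "TRUE"),
                                    human      = c("FALSE", "TRUE")))

tp <- confusion["TRUE", "TRUE"]    # dictionary and human both say aggressive
tn <- confusion["FALSE", "FALSE"]  # both say not aggressive
fp <- confusion["TRUE", "FALSE"]   # dictionary says aggressive, human does not
fn <- confusion["FALSE", "TRUE"]   # human says aggressive, dictionary does not

accuracy    <- (tp + tn) / sum(confusion)
sensitivity <- tp / (tp + fn)  # share of human-coded aggressive texts detected
specificity <- tn / (tn + fp)  # share of non-aggressive texts correctly left unflagged

# Benchmark (an addition, not from the slide): always predict "not aggressive"
baseline_accuracy <- (tn + fp) / sum(confusion)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline_accuracy), 3))
```

The naive rule is already right for 749 of the 1000 coded texts, so the dictionary's 80% accuracy is only a modest improvement. The low sensitivity is the more informative number.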

Paired Comparisons versus Single Ratings

Which of these questions is easier?

  1. Is this sentence aggressive?

    • “I regard it as unbelievable that the minister has said that, when it is clearly wrong.”
  2. Which of these sentences is more aggressive?

    • “I regard it as unbelievable that the minister has said that, when it is clearly wrong.”

    • “I also welcome the fact that the Bill will encourage more young people to take advantage of the programme.”

Paired comparisons tend to give more useful and reliable information than single ratings.

Paired Comparisons versus Single Ratings

  1. Apply 7 basic QTA measures (including 6 dictionaries) to 8 million sentences

    • Aggression
    • Positive Emotion
    • Negative Emotion
    • Fact
    • Anecdote
    • Complexity
    • Repetition
  2. Score each sentence using uniform word weights

  3. Present pairs of sentences to human coders and ask them to select which sentence is more representative of a certain concept

Validation Measure

Does the difference in sentence-level dictionary scores predict human judgements?

  • Sample pairs of sentences from the corpus

    • Score each pair as \(\text{Diff}_{i} = t_{2i} - t_{1i}\)
  • Randomly present pairs to human coders, who code (\(Y_i\)) whether:

    • Sentence one is more <style> (1)
    • About the same (0)
    • Sentence two is more <style> (-1)
  • Calculate the relationship between human coding and dictionaries by:

    • \(Y_{i} = \alpha + \beta \text{Diff}_{i} + \varepsilon_{i}\)
    • \(Cor(Y_{i},\text{Diff}_{i})\)
  • Repeat for each dictionary
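The procedure above can be sketched on simulated data. Everything in the block is invented for illustration, including the sample size, the noise level and the cut-offs for "about the same"; none of it comes from the lecture's data. One point to note about sign: with the coding as written (\(Y_i = 1\) when sentence one is more <style>, and \(\text{Diff}_{i} = t_{2i} - t_{1i}\)), a dictionary that agrees with human coders produces a negative \(\beta\) and a negative correlation. Reversing either convention flips the sign.

```r
set.seed(1)  # simulated data only; not the data used in the lecture
n_pairs <- 500

# Dictionary scores for sentence one and sentence two in each pair
t1 <- runif(n_pairs, 0, 0.2)
t2 <- runif(n_pairs, 0, 0.2)
diff_score <- t2 - t1  # Diff_i as defined above

# Simulated coder: perceives how much more <style> sentence one is, with noise
perceived <- (t1 - t2) + rnorm(n_pairs, sd = 0.05)
y <- ifelse(perceived > 0.03, 1,        # sentence one is more <style>
     ifelse(perceived < -0.03, -1, 0))  # sentence two is more <style>, else about the same

# Relationship between the human coding and the dictionary difference
fit  <- lm(y ~ diff_score)
beta <- coef(fit)[["diff_score"]]
rho  <- cor(y, diff_score)

print(c(beta = beta, correlation = rho))
```

A dictionary with no relationship to human judgement would give a \(\beta\) and correlation close to zero, so the magnitude of these quantities can be compared across dictionaries.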

Validation Results

Interpretation

Aggression tends to manifest very differently in parliamentary speech than in other contexts!

In the paper, H&B develop a more sophisticated approach to measuring aggression (and other styles):

  1. Take an off-the-shelf dictionary of aggressive words

  2. Use word-embeddings to…

    1. …expand the initial dictionary to include words that are relevant to parliamentary speeches
    2. …upweight words that are used in a similar way in parliamentary speech
    3. …downweight words that are not typically used in a similar way in parliamentary speech
  3. Score speeches according to these modified word lists
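The general idea behind these steps can be illustrated with a toy example. This is not H&B's actual procedure, which the lecture returns to later. The two-dimensional "embeddings", the seed words and the similarity threshold are all invented for illustration. The sketch measures each word's cosine similarity to the average seed vector, adds similar words to the dictionary, and uses the similarity as the word's weight.

```r
# Toy 2-dimensional word vectors, invented for illustration
embeddings <- rbind(
  attack    = c(0.90, 0.10),
  disgrace  = c(0.80, 0.30),
  shameful  = c(0.85, 0.20),
  fight     = c(0.40, 0.70),  # imagined as often used non-aggressively in this corpus
  welcome   = c(0.10, 0.90),
  programme = c(0.20, 0.80)
)
seed_words <- c("attack", "disgrace", "fight")  # hypothetical off-the-shelf dictionary

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every word to the average seed-word vector
seed_centre <- colMeans(embeddings[seed_words, ])
similarity  <- apply(embeddings, 1, cosine, b = seed_centre)

# Step 1: expand the dictionary with non-seed words that are close to the seeds
threshold <- 0.9  # hypothetical cut-off
new_words <- setdiff(names(similarity)[similarity >= threshold], seed_words)
dictionary_words <- c(seed_words, new_words)

# Steps 2 and 3: weight each dictionary word by its similarity, so words used
# like the other seeds count for more and atypical words count for less
word_weights <- similarity[dictionary_words]
print(round(word_weights, 2))
```

In this toy setup "shameful" joins the dictionary because its vector sits close to the seed words, while "fight" stays in the dictionary with the lowest weight because its vector points towards the non-aggressive words.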

More on this approach on Thursday.

Word-embedding Results

Are women less aggressive than men?

Let’s believe for a second that the validation strategy worked.

Have political styles changed over time?

Full paper here.

Conclusion

Summing Up

  • Quantitative Text Analysis allows us to address a wide variety of important research questions

  • There is no one right way to represent text for all research questions.

  • The representation we choose can be consequential for the results we present

  • Dictionaries are fast, easy-to-apply methods with many pre-existing implementations

  • Validation is critical to any quantitative text application

  • The validity of a dictionary will be sensitive to the contexts in which it is developed and applied