Using automated event detection to reduce data collection costs with an application to the BFRS dataset

The project aims to replicate and extend the BFRS dataset (which measured political violence in Pakistan from 1988 to 2011, based on press reporting) using machine learning techniques.

Incident-level data on political violence has been analysed to study causal linkages between terrorism and economic growth. Such research helps policymakers formulate policies that reduce the impact of terrorism. Better scholarship and policy based on the BFRS dataset can not only create academic value but also strengthen the case for improving internal security in Pakistan, while better-informed voters can put pressure on the government to improve (Banerjee, Kumar, Pande, and Su; Ferrera, 2011).

BFRS documents incidents of political violence by recording the location, consequence, cause, type of violence, and party responsible. However, datasets compiled by manually extracting events from newspapers carry a recurring cost each time they are updated. Advances in textual analytics suggest a better way to keep the dataset up to date: extracting the data automatically.

We propose to create a similar dataset from 2010 to the present day by automating the identification and categorisation of events using textual analysis with pattern recognition. Automation will streamline the process and, once developed, the machine-learning tool will provide quick updates at little additional cost. The project also aims to build capacity for constructing similar datasets on subjects other than violence.

Our proposed dataset’s overlap with the BFRS time period will allow us to compare the two and gauge the accuracy of the tool. First, we will search our scraped data for events reported in BFRS. Second, we will use these matched events to train our algorithms to detect such events. Finally, we will apply the trained algorithm to data scraped between 2013 and the present to extract further instances of political violence without human intervention. As an additional check on data quality, human coders will review a sample of the extracted data to detect false positives.
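
The train, apply, and review steps above can be illustrated with a minimal sketch. It is not the project's actual method: the classifier (a small naive Bayes model over headline words), the function names, and the example headlines are all assumptions invented for illustration, and a real pipeline would work from full scraped articles labelled by matching them against BFRS.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def train(labelled):
    """Fit a naive Bayes model from (text, is_violent) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return True if the text is scored as reporting political violence."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return scores[True] > scores[False]


def review_sample(flagged, k, seed=0):
    """Draw a reproducible random sample of flagged items for human coders."""
    return random.Random(seed).sample(flagged, min(k, len(flagged)))


# Hypothetical training data: in the overlap period, a headline would be
# labelled True when it matches an event recorded in BFRS.
labelled = [
    ("bomb blast kills five in Peshawar", True),
    ("gunmen attack police checkpoint in Quetta", True),
    ("cricket team wins series in Lahore", False),
    ("government announces new budget", False),
]
model = train(labelled)

# Hypothetical headlines from the later, unlabelled period.
new_headlines = [
    "blast near police station in Karachi",
    "budget session begins in Islamabad",
]
flagged = [h for h in new_headlines if predict(model, h)]
to_review = review_sample(flagged, k=10)
```

Human coders would then mark each item in `to_review` as a true or false positive, and the share of true positives gives an estimate of the tool's precision on data outside the BFRS period.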