Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks

04/03/2023
by Andrew Halterman, et al.

Event data, or structured records of “who did what to whom” that are automatically extracted from text, is an important source of data for scholars of international politics. The high cost of developing new event datasets, especially using automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a “bag of tricks” for efficient, custom event data production, drawing on recent advances in natural language processing (NLP) that allow researchers to rapidly produce customized event datasets. The paper introduces techniques for three tasks: training an event category classifier with active learning; identifying actors and the recipients of actions in text using large language models, standard machine learning classifiers, and pretrained “question-answering” models from NLP; and resolving mentions of actors to their Wikipedia articles in order to categorize them. We describe how these techniques produced the new POLECAT global event dataset, which is intended to replace ICEWS, along with examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.
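
To make the active learning step more concrete, the following is a minimal sketch of one common query strategy, uncertainty sampling. The abstract does not say which strategy the authors use, so this is an illustrative assumption and not their method. The function name `select_for_labeling` and the example scores are hypothetical.

```python
from typing import List, Sequence


def select_for_labeling(probabilities: Sequence[float], k: int) -> List[int]:
    """Return indices of the k documents the classifier is least sure about.

    `probabilities` holds a binary classifier's predicted probability that
    each unlabeled document belongs to the event category of interest.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Uncertainty is highest when the predicted probability is near 0.5,
    # so rank documents by their distance from 0.5 (smallest first).
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical classifier scores for five unlabeled sentences (made-up values).
scores = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_for_labeling(scores, k=2)
print(to_label)
```

In a full loop, the selected documents would be hand-labeled, added to the training set, and the classifier retrained before the next round of selection.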


