The automation of political event detection in text has been of interest to political scientists for over two decades. Schrodt (1998) introduced KEDS, the Kansas Event Data System, an early piece of event coding software, in the 1990s. Successors to KEDS include TABARI, JABARI-NLP, and now PETRARCH in its various incarnations [18, 17]. However, these tools rely primarily on pattern-matching within texts against dictionaries, limiting their ability to recognize singular events across multiple sentences or documents. This leads to unwanted duplication within event data sets and limits the types of detected events to those that are concisely summarized in a single line. (The data set described in this paper is available on Harvard Dataverse: https://doi.org/10.7910/DVN/8TEG5R.)
Social scientists have recently begun exploring machine learning-based approaches to coding particular types of political events [2, 9, 15]. However, these efforts still focus mainly on classifying events at the sentence or document level. In this paper, I propose an approach to event coding that is able to detect singular events at both the document (headline) level and across documents. The challenge is therefore not only a classification task but also a coreference prediction task: headlines are classified as pertaining to events, and multiple headlines referring to the same event are identified as coreferencing that event.
This two-part challenge mirrors real-world cross-document event coreference detection. The first task is the identification of relevant events among a corpus that contains relevant (positive) and irrelevant (negative) events. The second task is to identify event coreferences across documents. Multiple articles may refer to the same event, and there may be an arbitrary number of distinct events within the corpus. This second task is conceptualized as link prediction wherein a link between articles signifies that they refer to the same event.
Event linking, or coreference resolution, has been studied in computer science and computational linguistics, often framed within the larger problem of automated knowledge base population from text. Lu and Ng (2018) review research in this area over the previous two decades, including standard data sets, evaluation methods, common linguistic features used for coreference resolution, and coreference resolution models. Notable data sets for coreference resolution include one built by Hong et al. (2016) from the Automated Content Extraction (ACE2005, http://projects.ldc.upenn.edu/ace) corpus, a data set produced by Song et al. (2018) in support of the Text Analysis Conference Knowledge Base Population effort, and the EventCorefBank (ECB) and ECB+ (http://www.newsreader-project.eu/results/data/the-ecb-corpus/) data sets [3, 6].
Advances in event linking also promise to enhance automated event data generation for social science applications. Event data sets like ICEWS, GDELT, and Phoenix suffer from duplicate event records when single events are reported multiple times by multiple sources [4, 11, 1]. Typically, duplicated records are removed via heuristics based on the uniqueness of event attribute sets. Event linking techniques may allow event data sets like these to better represent complex phenomena (e.g., wars) that are described across multiple documents while avoiding the duplication problem.
The paper proceeds as follows. I first describe a novel data set designed to evaluate performance on cross-document event detection. I then introduce a model capable of both event detection and cross-document coreference prediction and evaluate its performance on out-of-sample data. The paper concludes with a discussion of the limitations of the evaluation data set and suggested directions for future research.
I introduce here a task-specific evaluation data set referred to as the Headlines of War (HoW) data set. HoW takes the form of a node list that describes news story headlines and an edge list that represents coreference links between headlines. HoW draws headline and coreference data from two sources. The first is the Militarized Interstate Disputes data set (MIDS) version 3. MIDS provides a set of newspaper headlines that coreference interstate disputes. The New York Times (NYT) provides a second source of headlines that constitute the negative (non-coreferential) samples.
MIDS is a standard in political science and international relations. (The Militarized Interstate Disputes data set is referred to as MIDS, while an individual dispute is referred to as a MID, plural MIDs; a MID incident is sometimes referred to as a MIDI.) It is published by the Correlates of War Project, an effort that dates to 1963. A MID is a collection of “incidents involving the deliberate, overt, government-sanctioned, and government-directed threat, display, or use of force between two or more states”. As such, many MIDs, and the incidents they comprise, are macro-level events that may occur over an extended period of time and comprise many smaller events. For example, a number of ceasefire violations in Croatia in February 1992 together constitute incident 3555003, one of many incidents that make up MID 3555, the Croatian War for Independence. MIDs and the incidents they comprise tend to be larger in scale than the events found in typical event data sets.
MIDS differs from automated event data in several ways. Automated event data sets (referred to herein simply as “event data”) like GDELT, ICEWS, and Phoenix typically document discrete events that are easily described in a single sentence. This is due, in part, to the fact that the necessary coding software parses stories sentence-by-sentence and uses pattern-matching to identify the key components of an event within a given sentence. This leads to data sets that feature simple events and often include duplicate records of events. Failure to deduplicate led, in one case, to a popular blog being forced to issue corrections due to the over-counting of kidnapping events in GDELT.
Because it is coded manually, MIDS features more complex events than automated event data systems are capable of producing. MIDs comprise incidents, and incidents may (or may not) themselves comprise a number of actions that would each constitute their own entry in an automated event data set. Because each MID is coded from a number of news sources, duplication of disputes is not a concern; human coders are capable of mapping stories from multiple news sources to the single incident or dispute to which they all refer.
MIDS provides HoW with positive class labels (i.e., headlines associated with MIDs) and positive coreferences (pairs of headlines associated with common MIDs). I use the third version of MIDS due to the availability of a subset of the source headlines used to produce the data set. (The source data are available at https://correlatesofwar.org/data-sets/MIDs. An effort to update HoW with MIDS version 4 headlines is underway.)
2.2 The New York Times
Negative samples, headlines not associated with militarized interstate disputes, are drawn from The New York Times for the same period as that covered by MIDS 3.0: 1992–2001. (MIDS 3.0 only includes those conflicts from 1992 that were ongoing in 1993; for simplicity, NYT headlines are sampled from January 1, 1992.) NYT headlines and their associated sections (e.g., World, US, Sports, …) are available from https://spiderbites.nytimes.com. HoW contains only samples from the World section. This is to ensure that the resulting task is sufficiently difficult: articles drawn from the World section are more likely to mirror the MIDS headlines in tone and substance, so distinguishing between MIDS headlines and NYT World headlines should be more difficult than it would be if articles from all sections were sampled.
| | Training | Validation | Testing |
| Start date (01/01) | 1992 | 1997 | 1998 |
| End date (12/31) | 1996 | 1997 | 2001 |
2.3 Putting it Together: HoW
The HoW data are partitioned into three parts: training, validation, and testing. Partitioning is performed by year to make it unlikely that a single MID incident’s reference headlines are found across all three partitions. An unfortunate consequence of doing so is that it is difficult to control the relative sizes of each partition. MID incidents are not evenly distributed across years, and so the validation set is smaller (in terms of headline-pairs) than the training set, which is, in turn, smaller than the testing set.
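The year-based partitioning scheme can be sketched as follows, using the start and end dates listed above and assuming they correspond to the training, validation, and testing partitions, respectively. The headline records are hypothetical examples, not HoW data.

```python
# Sketch of HoW's year-based partitioning (assumed ranges: training
# 1992-1996, validation 1997, testing 1998-2001).

def assign_partition(year):
    """Map a headline's publication year to a HoW partition."""
    if 1992 <= year <= 1996:
        return "training"
    if year == 1997:
        return "validation"
    if 1998 <= year <= 2001:
        return "testing"
    raise ValueError(f"year {year} is outside HoW coverage (1992-2001)")

# Hypothetical headline records for illustration.
headlines = [
    {"text": "Ceasefire violated in Croatia", "year": 1992},
    {"text": "Feuding factions meet in Congo", "year": 1998},
]
partitions = {h["text"]: assign_partition(h["year"]) for h in headlines}
```

Because a headline's partition depends only on its publication year, all headlines referencing a given MID incident tend to land in the same partition, which is the property the scheme is designed to provide.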
Summary statistics for each partition of HoW are given in Table 1. Not all MIDs and MID incidents during the relevant time periods are included because the MIDS source data do not report headlines for all incidents. In many cases, page numbers and section numbers are provided in lieu of the headline text itself. Therefore, HoW contains a total of only 18 unique MIDs (with one appearing in two partitions) and 36 unique incidents.
Each partition comprises a node list and an edge list. The node list contains the headline text, publication date, associated MID identifier and incident identifier (if applicable), and an indicator of whether the headline is a positive (MID) sample or a negative (NYT World) sample. The edge list includes positive links between headlines if they refer to the same MID incident along with a sample of negative links drawn randomly from NYT World and MIDS headlines. Therefore, a single MID incident is represented in the edge list by a fully-connected subgraph of headlines.
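The fully-connected-subgraph convention can be made concrete with a short sketch: every pair of positive headlines sharing an incident identifier yields a positive edge. The node records below are hypothetical (they reuse incident 3555003 from the running example), and the field names are assumptions rather than HoW's actual schema.

```python
# Sketch: positive coreference edges are all pairs of headlines that
# reference the same MID incident, so each incident forms a clique.
from itertools import combinations

nodes = [
    {"id": 0, "incident": "3555003", "positive": True},
    {"id": 1, "incident": "3555003", "positive": True},
    {"id": 2, "incident": "3555003", "positive": True},
    {"id": 3, "incident": None, "positive": False},  # NYT World sample
]

def incident_edges(nodes):
    """Emit positive links: all pairs of positive nodes sharing an incident."""
    by_incident = {}
    for n in nodes:
        if n["positive"]:
            by_incident.setdefault(n["incident"], []).append(n["id"])
    edges = []
    for members in by_incident.values():
        edges.extend(combinations(members, 2))
    return edges

edges = incident_edges(nodes)  # three headlines yield a triangle of links
```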
Figure 1 depicts the distribution of headline lengths, in words, for each of the HoW subsets. The average headline length is just under nine words.
3 Modeling Strategy
To demonstrate that HoW presents a tractable pair of tasks, I describe a model capable of accomplishing, to a degree, both headline classification and link prediction on the data set. The model is a multi-task neural network that takes as input numerical representations of two headlines, h1 and h2, along with the reciprocal of the time difference between their publication dates. The model then predicts the MID status of both h1 and h2, and whether or not the headlines refer to the same MID incident.
3.1 Preparing the Headlines
The first step of modeling is to remove all punctuation from the headlines’ texts. For convenience, headlines are zero-padded such that they are all of equal length. Headlines are then tokenized, and word vectors are substituted for each word. (When a word vector cannot be obtained for a given token, that token is simply dropped.) Pre-trained word vectors are obtained from Facebook’s fastText. fastText is selected because it is able to produce word vectors for out-of-sample words—those it has not previously seen. Word vectors are length-300 real-valued vectors that represent words in such a way that semantically and syntactically related words share similar vectors.
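The preparation pipeline described above can be sketched as follows. The `vectors` lookup table stands in for fastText's embeddings, and the padded length of 12 is a hypothetical choice; tokens absent from the lookup are dropped, per the footnote above.

```python
# Sketch of headline preparation: strip punctuation, tokenize, look up a
# 300-dimensional vector per token, and zero-pad to a fixed length.
import string

EMBED_DIM = 300
MAX_LEN = 12  # hypothetical padded length

vectors = {}  # token -> list[float]; populated from fastText in practice

def prepare(headline, max_len=MAX_LEN):
    text = headline.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    rows = [vectors[t] for t in tokens if t in vectors][:max_len]  # drop OOV
    while len(rows) < max_len:
        rows.append([0.0] * EMBED_DIM)  # zero-padding
    return rows
```

Each headline thus becomes a fixed-size (12 × 300) matrix suitable as input to the convolutional layer described in the next section.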
3.2 Model Architecture
The model itself comprises a single convolutional layer and three dense, fully-connected layers for predicting MID status and coreference status. For a given input pair, the model outputs three predictions: the probability that h1 describes a MID incident, the probability that h2 describes a MID incident, and the probability that h1 and h2 refer to the same MID incident. The overall model architecture is depicted in Figure 2. The model contains 13,537 trainable parameters, 13,515 of which are in the convolutional layer.
The intuition behind the model is as follows. MID classification should be the same task regardless of whether the input headline is h1 or h2. Therefore, the convolutional layer and subsequent densely-connected layer are shared between the two. Combined, these output a predicted probability that a given headline describes a MID incident. After the convolutional layer and an element-wise maximum value pooling layer, the dot product of the hidden states representing h1 and h2 is computed; this represents the similarity of the two headlines. This value is multiplied by the predicted probabilities that each headline represents a MID incident as well as by a linear function of the time difference (in days) between the two headlines. A sigmoid activation is applied to this product; the result represents the probability of a MID incident coreference between h1 and h2. Therefore, MID incident coreferences are most likely when the model predicts that both h1 and h2 describe MID incidents, when the hidden state representations of those headlines are most similar, and when the publication date difference between the headlines is small.
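A minimal numpy forward pass illustrating this architecture is sketched below. The filter width, hidden size, and the use of the reciprocal of the day difference as the time term are assumptions for illustration, and the weights are random placeholders rather than trained parameters.

```python
# Sketch of the shared-encoder forward pass: convolution + max pooling,
# a shared dense MID head, and a coreference score from the dot product.
import numpy as np

rng = np.random.default_rng(0)
EMBED, HIDDEN, WIDTH = 300, 15, 3        # hypothetical sizes

Wc = rng.normal(size=(WIDTH * EMBED, HIDDEN)) * 0.01  # conv filter bank
wd = rng.normal(size=HIDDEN) * 0.01                   # shared MID head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    """Shared convolution + element-wise max pooling: (T, 300) -> (HIDDEN,)."""
    T = x.shape[0]
    windows = np.stack([x[i:i + WIDTH].ravel() for i in range(T - WIDTH + 1)])
    return np.tanh(windows @ Wc).max(axis=0)

def forward(x1, x2, dt_days):
    h1, h2 = encode(x1), encode(x2)
    p1, p2 = sigmoid(h1 @ wd), sigmoid(h2 @ wd)  # per-headline MID probs
    time_term = 1.0 / (1.0 + dt_days)            # shrinks with the time gap
    coref = sigmoid((h1 @ h2) * p1 * p2 * time_term)
    return p1, p2, coref
```

Because the encoder and MID head are shared, the coreference score is pushed up only when both headlines look like MID incidents, their pooled representations are similar, and their publication dates are close, mirroring the intuition above.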
3.3 Training Procedure
The model is trained for 100 epochs on batches of 64 training samples. The validation set is used for parameter tuning; the testing set remains unobserved until the final model is selected. Because the model must predict three binary responses, the loss function is the unweighted sum of the three binary cross-entropy terms given in Equation 1. The model is fit using Nadam, a variant of the Adam optimizer with Nesterov momentum.
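The combined objective can be written out as a short sketch: one binary cross-entropy term per output (the MID status of each headline and the coreference status), summed without weights.

```python
# Sketch of the three-term loss for a single training example.
import math

def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one label/probability pair."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def how_loss(y_true, y_pred):
    """Unweighted sum of the three binary cross-entropy terms.

    y_true, y_pred: triples of (MID status of h1, MID status of h2,
    coreference status)."""
    return sum(bce(y, p) for y, p in zip(y_true, y_pred))
```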
This model is similar in some respects to the one introduced by Krause et al. (2016). Major differences include the use of fastText vectors here rather than word2vec vectors, the requirement that this model not only identify coreferential headlines but also discriminate between events and non-events, and the lack of additional contextual information about event pairs. (Krause et al. (2016) include type compatibility, position in discourse, realis match, and argument overlap.)
3.4 Task Evaluation
Tasks 1 and 2 are both conceptualized as binary classification, and therefore a number of evaluation metrics are available. Here, I report classification accuracy (the percentage classified correctly), precision, recall, F-score, and the area under the receiver operating characteristic curve (AUC) for both tasks. Due to class imbalance, I also report BLANC scores to better capture model performance among event links and non-links. The equivalent statistics, referred to as macro-averaged precision, recall, and F-score, are reported for MID classification.
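For reference, the standard confusion-matrix metrics and their macro averages can be computed from predictions with a few lines of self-contained code; this is a generic sketch, not the paper's evaluation script.

```python
# Precision, recall, and F-score for one class, plus the macro average
# over both classes (the class-imbalance-aware variant reported above).

def prf(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_prf(y_true, y_pred):
    """Average per-class precision, recall, and F-score over classes 0 and 1."""
    scores = [prf(y_true, y_pred, positive=c) for c in (0, 1)]
    return tuple(sum(s[i] for s in scores) / 2 for i in range(3))
```

Macro averaging weights both classes equally, so a model that only ever predicts the majority negative class no longer looks artificially strong.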
In out-of-sample evaluation (i.e., validation and test set performance) I use no information about the headline classes (MID incident versus non-incident) or coreferences. In other words, link predictions are conditioned on the texts and publication dates of headlines only and not on the MID status of a given headline.
Table 2 panels: (a) MIDI classification (positive class); (b) MIDI classification (macro average); (c) Coref prediction (positive class); (d) Coref prediction (BLANC).
| MID | Pred. prob. | Headline |
| False | 0.91 | Serbs Advance in Kosovo, Imperiling… |
| True | 0.90 | Feuding factions meet in Congo… |
| True | 0.87 | Significant Rwandan troop movement… |
| False | 0.87 | Serbs Stone Albanians in Divided Ko… |
| True | 0.86 | Zimbabwean troops deployed in Congo… |
| False | 0.85 | Attack in Baghdad… |
| False | 0.83 | Clashes in Zimbabwe… |
| True | 0.82 | Zimbabwe wins major battle in Congo… |
| True | 0.80 | Kabila moving against rebellious tr… |
| False | 0.80 | U.S. Cutbacks in Yemen… |
I turn now to an assessment of the model’s performance on both tasks: MID classification at the headline level and coreference prediction between pairs of headlines. In this analysis, only the results for the first headline of each pair are included when assessing MID classification. This prevents unintentional repeat counting of headlines that appear as both the first and second element in different training example pairs.
The model achieves high precision for coreference prediction but lower precision for MID classification: 0.99 and 0.27 on the testing set, respectively. Relatively high false negative rates mean that recall is low for both tasks: 0.19 for MIDI classification and 0.45 for coreference prediction. However, given the class imbalance present for both tasks and apparent in Table 1, the macro-averaged and BLANC-adjusted statistics are also reported, as recommended in previous work on coreference resolution. The model fares better on both tasks when this imbalance is taken into account, achieving recall values of 0.59 and 0.72 for classification and coreference prediction, respectively. Table 2 provides a full set of results for all three partitions. The final column of Table 2 reports the area under the receiver operating characteristic curve (AUC), which can be interpreted as the probability that a randomly selected positive example will be assigned a higher predicted probability of belonging to the positive class than a randomly selected negative example. The very high accuracy and AUC scores (near 1.0) can be attributed to the high recall of the classifiers with respect to the majority negative class. The table also reveals overfitting to the training set, on which the model consistently achieves its highest scores.
Because content relevant to militarized interstate disputes often appears in the NYT World section, the HoW data set currently contains a significant number of false negative headlines. Table 3 reproduces the 10 highest-scoring headlines with respect to their predicted probabilities of describing a MID. Some of the reported non-MID headlines clearly refer to MIDs. (Because these non-MID headlines are from the NYT, they are not associated with a MID in HoW.) I hope to reduce false negatives in future iterations of HoW.
Figure 3 depicts predicted coreferences in the test set; two of the four MID incidents are represented. A selection of the headlines labeled in Figure 3 is provided in Table 4. The four MID incidents present in the HoW test set are 4248001, 4248003, 4283012, and 4339; coreferences are identified among two or more headlines referring to 4339 and 4248003. MID 4339 is the Congo War. Incidents 4248001 and 4248003 occurred between Uganda and Sudan during 1998. Incident 4283012 occurred between the UK and Afghanistan during the 2001 invasion of Afghanistan.
| Label | Headline |
| A | Sudanese plane bombed Ugandan town aid … |
| B | uganda condemns sudanese air attack… |
| C | One Dies as Navy Jets Collide Off Turkey… |
| D | U.S. to Change Strategy in Narcotics Fig… |
| E | Heading for an African War… |
| F | DRC gun running a rumour… |
| G | Rwanda needs and will get a buffer zone… |
| H | Farmers Protest Against Fox in Mexico Ci… |
| I | South Koreans Challenge Northerner on U…. |
The HoW data set comes with a number of caveats, discussed below. The negative sampling is performed by first subsetting MIDS 3.0 into the training, testing, and validation sets. Then, negative samples are drawn at a rate of five for every positive MID story pair (i.e., edge). This scale factor is selected arbitrarily and results in a sparse graph. (While the 5-to-1 negative sampling ratio is chosen arbitrarily, it does follow the standard in the literature for negative-sampling skipgram models like word2vec.) Many negative samples describe MIDs themselves and should not be labeled as negative. No negative samples have been manually corrected, and at least some false negatives can be expected. Negative samples are drawn only from the NYT World section, while the MIDS 3.0 headlines are drawn from many diverse (English-language) sources. Unfortunately, a representative corpus of headlines for negative sampling was unavailable at the time of writing.
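The 5-to-1 negative edge sampling described above can be sketched as follows. The function name and node IDs are illustrative, and the sketch assumes enough candidate pairs exist to satisfy the requested ratio.

```python
# Sketch: draw `ratio` random non-coreferential headline pairs for every
# positive (coreferential) pair in the edge list.
import random

def sample_negative_edges(positive_edges, node_ids, ratio=5, seed=0):
    """Return ratio * len(positive_edges) distinct negative edges.

    Assumes node_ids admits enough distinct non-positive pairs;
    otherwise the loop would not terminate."""
    rng = random.Random(seed)
    positive = set(map(tuple, positive_edges))
    negatives = set()
    while len(negatives) < ratio * len(positive_edges):
        a, b = rng.sample(node_ids, 2)
        edge = (min(a, b), max(a, b))
        if edge not in positive:
            negatives.add(edge)
    return sorted(negatives)
```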
Not all sources in MIDS are documented with enough specificity to identify the relevant headline. Some MID incidents only reference a section or page number and not a headline. A future step in the development of HoW will seek to identify the original source data for MID incidents that currently lack headline text to improve the coverage of MIDs over the period in question. Longer-term, additional data sources may provide event types beyond MIDs and therefore allow researchers to evaluate the out-of-class generalizability of cross-document event detection methods. In the near term, the more comprehensive headline data set for MIDS 4 (2002–2010) is being used to extend HoW and address the high proportion of missing MID incidents in HoW.
The decision made here was to partition HoW by date. This has the advantage of offering a simple explanation of how the partitions differ from one another: they cover distinct date ranges. It also allows researchers to consider the impact of the temporal proximity of two headlines on their likelihood of being associated with the same event. In that way, date-based partitioning imitates the likely real-world scenario of cross-document event detection: near real-time monitoring. However, it also means that models fit to the training data set may generalize poorly to the testing data set since the testing data set represents events from up to five years later in time. Partitioning by time in such a way makes it difficult to control the number of positive-class observations per set. Down-sampling headlines from MIDS may help to manage partition balance but at the cost of even fewer positive MID headline examples.
HoW offers a novel evaluation data set for researchers interested in automated event data and coreference resolution. Conceptualizing event data generation as a two-task problem of detection and coreference resolution will allow future efforts to better identify complex social phenomena that may otherwise be invisible given existing sentence and document-level event coding strategies. It also has implications for deduplication: the ability to automatically detect event coreferences across documents may help to reduce the number of duplicate event records that result from coverage across multiple sources.
Future efforts should seek to build on HoW by including multiple classes of events or incidents. (The previously mentioned event coreference resolution data sets contain multiple event types.) Additionally, strategies for identifying true negative samples, rather than relying on the assumption that all non-MIDS headlines are negative samples, will help to more precisely evaluate model performance.
7 Bibliographical References
-  (2019) Cline Center Historical Phoenix Event Data. Cline Center for Advanced Social Research.
-  (2016) Generating politically-relevant event data. In Proceedings of the 2016 EMNLP Workshop on Natural Language Processing and Computational Social Science, pp. 37–42.
-  (2010) Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 1412–1422.
-  (2015) BBN ACCENT event coding evaluation (updated v01). In ICEWS Coded Event Data.
-  (2014) Mapping kidnappings in Nigeria (updated). Online: https://fivethirtyeight.com/features/mapping-kidnappings-in-nigeria/
-  (2014) Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014).
-  (2016) Incorporating Nesterov momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf
-  (2004) The MID3 data set, 1993–2001: procedures, coding rules, and description. Conflict Management and Peace Science 21, pp. 133–154.
-  (2019) Overview of CLEF 2019 Lab ProtestNews: extracting protests from news in a cross-context setting. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Cham, pp. 425–432.
-  (2016) Event linking with sentential features from convolutional neural networks. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 239–249.
-  (2013) GDELT: global data on events, location, and tone. ISA Annual Convention.
-  (2019) The dyadic Militarized Interstate Disputes (MIDs) dataset version 3.0: logic, characteristics, and comparisons to alternative datasets. Journal of Conflict Resolution 63 (3), pp. 811–835.
-  (2018) Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
-  (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, Red Hook, NY, USA, pp. 3111–3119.
-  (2019) Automated dictionary generation for political event coding. Political Science Research and Methods, pp. 1–15.
-  (2011) BLANC: implementing the Rand index for coreference evaluation. Natural Language Engineering 17 (4), pp. 485–510.
-  (2014) (Website.)
-  (2009) (Website.)
-  (1966) Formal alliances, 1815–1939. Journal of Peace Research 3, pp. 1–31.