Mining Events with Declassified Diplomatic Documents

12/20/2017 ∙ by Yuanjun Gao, et al. ∙ University of Michigan, MIT, Columbia University

Since 1973 the State Department has been using electronic records systems to preserve classified communications. Recently, approximately 1.9 million of these records from 1973-77 have been made available by the U.S. National Archives. Some of these communication streams have periods witnessing an acceleration in the rate of transmission, while others do not show any notable patterns in communication intensity. Given the sheer volume of these communications -- far greater than what had been available until now -- scholars need automated statistical techniques to identify the communications that warrant closer study. We develop a statistical framework that can semi-automatically identify, from a large corpus of documents, a handful of electronic records that historians would consider more interesting. Our approach brings together related but distinct statistical concepts from nonparametric signal estimation and statistical hypothesis testing -- which, when put together, help us identify and analyze various geometrical aspects of the communication streams. Dominant periods of heightened and sustained activity, or "bursts," as identified through these methods correspond well with historical events recognized by standard reference works on the 1970s.


1 Introduction

For more than forty years, social scientists have been developing datasets of events for the quantitative analysis of world politics. The last decade has witnessed a dramatic increase in activity in this area, much of it focused on automatic event detection for purposes of explaining and predicting political crises Beieler2016. All of these efforts, however, have used public information, such as newspaper or wire service reporting. Rather than directly measuring political activity, these systems can only count what reporters write about, which can vary over time and geography depending on many extraneous factors jenkins2016should. Together with the intrinsic challenges in automatic extraction, this has produced datasets that purport to track the same kinds of events, such as political protests, but that are completely uncorrelated with one another Hanna2010. Moreover, some of the most important political activity is not immediately reported, and may not become publicly known until decades later, when formerly secret records are declassified. Even then, the sheer volume of these records can make it difficult even for the diligent researcher to identify individual events and assess their relative importance.

In this paper we study a new dataset of declassified documents and use statistical methods to identify and rank political events. Since 1973, the State Department has been using electronic records systems to preserve classified communications. The National Archives (https://aad.archives.gov/aad/series-list.jsp?cat=WR43) now makes available over 1.4 million declassified cables from 1973-77 and, in addition, the metadata of more than 0.4 million other communications originally delivered by diplomatic pouch. They are all machine-readable and rich with metadata, creating many opportunities for statistical modeling.

[Figure 1 panels: (a) daily communication counts for TAGS CY, (b) TAGS VS, (c) TAGS UNGA, (d) TAGS FI; y-axis: count]

Figure 1: Counts of communications sent on each day in the period 1973-1978. The numbers in the plot represent the day of week (0-Sunday, 1-Monday, 2-Tuesday, …, 6-Saturday), with weekdays colored in blue and weekends in red. Panels (a)–(d) show the communications restricted to different TAGS. The apparent heightened activities in the communication streams correspond to (a) the Cyprus coup, (b) the Fall of Saigon (the most intense one), and (c) the yearly United Nations General Assembly meetings. There does not seem to be any interesting activity in panel (d), showing cables related to Finland. A goal of this paper is to create statistical methods that automatically identify series with heightened diplomatic communications and further describe their structural patterns.

Our goal is to explore modern statistical techniques that can automatically identify statistically interesting events in an important corpus of historical documents, one that will continue to grow year by year as millions more communications are declassified. For these communication streams, we are interested in studying and identifying “interesting” statistical patterns – we contend that these patterns correspond to heightened diplomatic activity, and we validate our findings against standard reference works on the 1970s. A statistically interesting pattern can mean several things. Loosely speaking, it can correspond to sudden localized changes or abrupt “jumps” in communication traffic, regardless of the overall series-specific baseline activity (a communication stream may be very active or have very low traffic intensity overall). There can also be contiguous periods in a communication stream where the data lie consistently above a series-specific baseline corresponding to a representative global activity level of that stream – these are “bursts” of activity in the temporal structure of the document streams that likely correspond to heightened diplomatic activity, such as the start or end of a war. An interesting event can also correspond to heightened traffic intensity that plays out over a longer period, such as a gradual increase over time, whether or not there are jumps.

When these communications were first entered in the State Department system, they were assigned one or more TAGS (Traffic Analysis by Geography and Subject), which indicate what countries or subjects each cable is related to. For example, “VS” signifies South Vietnam, and “SHUM” concerns human rights. By using these content-based TAGS as features, we avoid the complications of language processing and focus on identifying statistically relevant activity patterns in the communication streams.

1.1 A brief exploratory description of the data

A glimpse of the processed data, in the form of communication streams, is shown in Figure 1. The data show that there is less traffic on weekends and holidays (including the end of the year). In addition, the number of communications sent in 1973 appears smaller than in later years, because fewer records are available for that year. Since the overall (aggregated across all TAGS) number of communications sent over time did not show any systematic pattern indicative of events of historical importance, we decided to study the time series at a more granular level, by restricting to different TAGS. In Figure 1, panels (a)–(d) represent the communication streams when restricted by TAGS. Panels (a)–(c) show noticeable forms of increased activity in portions of the series – these are indicative of “interesting” historical events. For example, in panel (a) the increased activity in July 1974 corresponds to the Cyprus coup; in panel (b) the increase in the number of diplomatic communications in April 1975 corresponds to the Fall of Saigon. Panel (c) shows multiple bursts in the number of cables containing the particular TAGS UNGA (which stands for United Nations General Assembly) – interestingly, they all correspond to the annual United Nations General Assembly meetings. In addition to these visible bursts there seem to be some shorter periods of heightened activity, such as the smaller peaks for VS (South Vietnam) a year after the Fall of Saigon, which correspond to the ensuing refugee crisis.

In contrast to panels (a)–(c) in Figure 1, panel (d), for cables related to Finland (FI), does not seem to show any period of heightened activity during the time period under consideration. These prototypes are representative of the different TAGS-specific series: exploratory analyses of the database of TAGS-specific communication streams suggest that several series contain some “interesting event” (as in panels (a)–(c)), while others seem to be less active (as in panel (d)). A first goal of our work is to quantitatively define traits that separate communication streams like the one in panel (d) from those in panels (a)–(c). We develop statistical methods that can mine these (TAGS-specific) time series and identify communication streams that exhibit statistically interesting activity. Once we identify these interesting communication streams, we develop algorithms that perform a deeper investigation of each series and identify time intervals where the signal undergoes abrupt localized changes in communication traffic. The informal ideas described above are made more precise in Section 2 of this paper.

Exploratory analysis suggests that changes in the proportion of cables carrying a particular TAGS are a better indicator of whether a period is active than the corresponding raw counts. Because of the noticeable difference in the number of cables communicated on weekdays versus low-traffic days, the communication streams shown in Figure 1 appear to be a superposition of a high-traffic series and a low-traffic (weekend and holiday) series. As a pre-processing step, we therefore filtered out the days on which the total number of cables communicated was very small – this leads to more reliable estimates of the proportions.
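To make the pre-processing concrete, the following sketch builds a daily (count, total, proportion) series for one TAGS and drops low-traffic days. It is a minimal illustration, not the pipeline used for the paper: the DataFrame `cables`, its `date` and `tags` columns, and the cutoff of 20 cables per day are assumptions made for the example.

```python
# Minimal pre-processing sketch (not the authors' code). Assumes a pandas
# DataFrame `cables` with one row per cable, a `date` column, and a `tags`
# column holding the list of TAGS strings; the low-traffic cutoff is illustrative.
import pandas as pd

def tags_proportion_series(cables: pd.DataFrame, tag: str, min_daily_total: int = 20):
    """Return daily (y_t, n_t, p_t) for one TAGS, dropping low-traffic days."""
    daily_total = cables.groupby("date").size().rename("n")            # n_t: all cables per day
    has_tag = cables["tags"].apply(lambda ts: tag in ts)
    daily_tag = cables[has_tag].groupby("date").size().rename("y")      # y_t: cables with this TAGS
    df = pd.concat([daily_total, daily_tag], axis=1).fillna(0)
    df = df[df["n"] >= min_daily_total]                                  # filter weekends/holidays
    df["p"] = df["y"] / df["n"]                                          # daily proportion p_t
    return df
```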

[Figure 2 panels: (a) TAGS FI, p-value = 0.181; (b) TAGS OSCI, p-value = 0.005; (c) TAGS PFOR, p-value = 0; (d) TAGS PINT, p-value = 0; y-axis: proportion]

Figure 2: Communication streams with different significance scores in the spirit of Section 2.1. (a): relating to Finland, corresponds to a null model. (b): relating to scientific grants, shows a weak deviation from the null model – perhaps due to a slowly decreasing trend in the series. (c): relating to foreign policy – this shows a significant deviation from the null model, due to sudden changes/jumps after Oct ’76. (d): relating to internal political affairs, shows deviation from the null model, but not due to a jump as prominent as in (c) – this series seems to exhibit some systematic pattern of heightened activity after Oct ’76, leading to a small p-value. The small p-values suggest the presence of a statistically interesting event in each series and can be used to identify interesting communication streams – the p-value, however, does not provide additional insight into the finer structural patterns of the streams. Additional examples can be found in Figure 10.

2 Statistical Methodology

For the reader’s convenience, we first present an outline of the main statistical approaches that we discuss in this paper. Section 2.1 addresses our first question: how do we determine whether a communication stream, among several hundred, is interesting or not? We address this as a statistical hypothesis testing problem, where we consider whether a communication stream is generated from a homogeneous binomial process – in other words: is the proportion of documents containing the particular TAGS uniform across time? We select a collection of TAGS for which there seems to be some form of a statistically interesting event, and explore their geometry in further detail. If $p_t$ denotes the proportion of cables containing a TAGS at time $t$, we estimate the function $t \mapsto p_t$ with a piecewise constant signal – this is performed by using a regularized negative log-likelihood criterion based on the fused lasso penalty TSRZ2005 or its $\ell_0$-counterpart (boysen2009consistencies ; Johnson2013fldp ). For efficient computation, we develop new fast algorithms for these penalties by adapting existing work for the least squares loss function to the generalized linear model likelihood studied herein – see Section 2.2. The jumps between piecewise constant segments correspond to localized changes in the underlying signal – we use hypothesis testing ideas based on sample splitting wasserman2009high to rank-order the strengths of the jumps based on the associated p-values – see Section 2.3. The locations of discontinuities or jumps of the signal are aggregated to obtain locally contiguous subintervals of heightened diplomatic communications, which we call bursts, adopting the terminology coined by kleinberg2003bursty in a different context (Section 2.4). Finally, in Section 3, we discuss preliminary results on estimating the underlying signal with more flexible local models, beyond piecewise constant segments.

2.1 Identifying Interesting Communication Streams

We develop a framework to identify communication streams that exhibit some form of heightened or, in other words, statistically interesting activity. Consider a TAGS-specific series $\{(y_t, n_t)\}_{t=1}^{T}$, where $y_t$ denotes the number of documents containing the specific TAGS among the $n_t$ cables sent at time $t$, with proportion $p_t = y_t/n_t$. We will assume that the conditional distributions of the $y_t$'s are all independent across $t$. We ask:

Is there any evidence of (localized) heightened intensity of the proportions, compared to a baseline model, where all proportions are the same?

To measure a localized change (increase) in intensity, we fix a window of size $w$ and consider all the points in the neighborhood of a time point $t$, given by $N_t = \{s : |s - t| \le w\}$ (the window length, in days, is held fixed in our experiments). The average proportion in this neighborhood, denoted $\bar{p}_t$, is a measure of communication traffic around the reference time point $t$. We say that a large value of $\bar{p}_t$, compared to a baseline value (for example, the global proportion), indicates the presence of intense activity of some form, and we declare such an occurrence to be statistically interesting. We hypothesize that such an occurrence is associated with an event of historical interest, and subsequently validate this belief by factoring in the insights of a historian or social scientist.

Formally, we consider a simple testing problem with null hypothesis $H_0$: $p_1 = \cdots = p_T$, versus the alternative that there exists a $t$ such that $\bar{p}_t$ is larger than the (global) average proportion. Inspired by popularly used scan statistics glaz2001scan , we propose to use a test statistic of the form

  $S \;=\; \max_{t} \; (\bar{p}_t - \hat{p}_0) / \hat{\sigma}_t$,   (1)

where $\hat{p}_0$ is the (estimated) global proportion of the signal estimated under the null hypothesis, $\bar{p}_t$ is an estimate of the local proportion around time $t$, and $\hat{\sigma}_t$ is the standard deviation of $\bar{p}_t$ evaluated under the null ($H_0$). $S$ measures the strength of a locally contiguous period of heightened activity – the larger this value, the more pronounced is the localized traffic. We use a permutation-based approach to compute the distribution of $S$ under the null. Figure 2 shows different communication streams with their associated p-values computed using the framework described above, illustrating varying levels of activity across representative communication streams – the large p-value for panel (a) (corresponding to TAGS FI) signifies a lack of interesting activity in this series, which aligns with an expert's understanding of historical events. A summary of how many TAGS survive different p-value thresholds is provided in Table 1 (Appendix).
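The sketch below illustrates a scan-type statistic in the spirit of (1) together with a permutation p-value. It is illustrative only: the pooled windowed proportion, the window half-width of 15 days, and the simple day-permutation null are our assumptions about details not fully specified above.

```python
# Sketch of a permutation scan test in the spirit of the statistic in (1).
# `y` and `n` are daily TAGS counts and totals (numpy arrays); the pooled
# local proportion and the day-permutation null are assumptions for this example.
import numpy as np

def scan_statistic(y, n, w=15):
    """Max standardized local proportion over all windows of half-width w."""
    p0 = y.sum() / n.sum()                        # global proportion under H0
    kernel = np.ones(2 * w + 1)
    y_loc = np.convolve(y, kernel, mode="valid")  # windowed counts
    n_loc = np.convolve(n, kernel, mode="valid")
    p_loc = y_loc / n_loc                         # local proportions p_bar_t
    sd0 = np.sqrt(p0 * (1 - p0) / n_loc)          # std of p_bar_t under H0
    return np.max((p_loc - p0) / sd0)

def permutation_pvalue(y, n, w=15, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    obs = scan_statistic(y, n, w)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(y))             # shuffle days jointly in (y, n)
        null[b] = scan_statistic(y[idx], n[idx], w)
    return (1 + np.sum(null >= obs)) / (1 + n_perm)
```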

We also used a test statistic based on the likelihood ratio: we replaced the standardized local proportion in display (1) by the local likelihood ratio derived from the binomial distribution, assuming that the $y_t$'s are independently distributed across $t$. The results obtained via this likelihood-based approach were quite similar to those obtained from the model-free testing procedure described above, and are hence omitted for the sake of brevity.

2.2 Identifying Jumps in Communication Streams

The method in Section 2.1 simply identifies whether something statistically interesting, in the form of a heightened localized activity (for example), is occurring somewhere in a communication stream. It does not inform us about the structure or precise location of such an activity. We proceed to explore some finer structural properties of the series. Inspired by popularly used signal segmentation/estimation methods TSRZ2005 ; Johnson2013fldp ; mammen1997locally in statistical signal processing, we seek to identify breaks or jumps in a piecewise constant approximation of the signal – these changes are more localized than the (global) deviations studied in Section 2.1.

We use the notation of Section 2.1. We assume herein that $y_t \sim \text{Binomial}(n_t, p_t)$, where $p_t$ denotes the probability of success. Our model assumes that the parameters $p_t$ are piecewise constant across $t$. Locations where the underlying signal exhibits a discontinuity will be called a “jump” in the communication stream. Instead of working directly with the proportions $p_t$, we will use the logistic link $\theta_t = \log\{p_t/(1-p_t)\}$ and treat $\theta = (\theta_1, \ldots, \theta_T)$ as the natural parameter. This leads to the following regularized estimation problem:

  $\min_{\theta} \;\; \ell(\theta) + \lambda \, P(\theta)$,   (2)

where $\ell(\theta) = \sum_{t=1}^{T} \{ n_t \log(1 + e^{\theta_t}) - y_t \theta_t \}$, the negative logarithm of the likelihood, is the data-fidelity term and $P(\theta)$ is the regularizer. The regularizer encourages the estimated $\theta_t$'s (and hence the proportions $p_t$) to be piecewise constant, and the regularization parameter $\lambda \ge 0$ controls the amount of shrinkage. Two common examples of $P(\theta)$ that we use herein are TSRZ2005 ; mammen1997locally ; boysen2009consistencies ; Johnson2013fldp :

  • $\ell_1$-segmentation (Fused Lasso): $P_1(\theta) = \sum_{t=2}^{T} |\theta_t - \theta_{t-1}|$, which penalizes the total variation of the signal and may also be thought of as a soft version of the number of jumps in $\theta$.

  • $\ell_0$-segmentation: here we take $P_0(\theta) = \sum_{t=2}^{T} \mathbf{1}(\theta_t \neq \theta_{t-1})$, which penalizes the number of jumps in the signal $\theta$.

We assume above and in the discussion below, that the time points are equally spaced. If they are not equispaced, the penalty function needs to be adjusted in a straightforward fashion as discussed in Section A.1.
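Before turning to the specialized algorithms of Section 2.2.1, note that the convex ($\ell_1$) version of Problem (2) can also be prototyped directly with a generic solver. The sketch below uses cvxpy for illustration; it is not the authors' first-order algorithm, and the value of `lam` is arbitrary.

```python
# Direct (off-the-shelf) formulation of Problem (2) with the fused lasso
# penalty P1, via cvxpy. A sketch for illustration only; `y`, `n` are numpy
# arrays of daily counts/totals and `lam` is an illustrative penalty level.
import cvxpy as cp
import numpy as np

def fit_fused_logistic(y, n, lam=5.0):
    T = len(y)
    theta = cp.Variable(T)                                   # theta_t = log-odds
    nll = cp.sum(cp.multiply(n, cp.logistic(theta)) - cp.multiply(y, theta))
    penalty = cp.norm1(cp.diff(theta))                       # sum_t |theta_t - theta_{t-1}|
    cp.Problem(cp.Minimize(nll + lam * penalty)).solve()
    return 1.0 / (1.0 + np.exp(-theta.value))                # estimated proportions p_hat
```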

For the penalty function $P_1$, Problem (2) is a convex optimization problem. The $\ell_1$-penalty on the successive differences in $\theta$ is commonly referred to as the fused lasso or total-variation penalty TSRZ2005 ; mammen1997locally . This semi-norm induces shrinkage along with sparsity on the coefficient differences $\theta_t - \theta_{t-1}$. However, the $\ell_1$-based penalty often over-estimates the number of jumps when the tuning parameter is chosen so as to obtain a model with good data-fidelity – this is especially true if the tuning parameter is chosen based on a held-out test set to minimize test error. This is due to the shrinkage effect of the $\ell_1$-penalty, which severely penalizes large values of the jumps. To obtain a model with fewer jumps, the regularization parameter needs to be made larger – this, however, may lead to a model where some of the important jumps are missed. Many of these limitations can be overcome by using an $\ell_0$-based penalty boysen2009consistencies , which directly penalizes the number of jumps and is less sensitive to the precise value of each jump. The caveat, however, is that the resulting optimization problem becomes non-convex, and discrete optimization methods are required to obtain the global minimum of such problems boysen2009consistencies ; Johnson2013fldp . Developing global optimization algorithms for Problem (2) (with the penalty $P_0$) along the lines of mazumder2017discrete ; bertsimas2016best is beyond the scope of the current paper. Herein, we describe fast and robust algorithms to obtain good quality solutions to Problem (2). Our proposal is motivated by first-order optimization algorithms nesterov2004introductorynew pioneered in the convex optimization community. Loosely speaking, these are iterative methods that can be used to obtain high quality approximate solutions to convex optimization tasks, compared to off-the-shelf interior point methods that are difficult to scale to large problems. These methods apply to both $P_1$ and $P_0$ – however, there are certain subtleties, as we describe below.

2.2.1 Model Fitting: Optimization Algorithm

Problem (2) is of the composite form, i.e., the objective function can be written as the sum of a smooth convex function $\ell(\theta)$ and a non-smooth penalty function $\lambda P(\theta)$. We will apply proximal gradient descent methods fista-09 for this problem. The negative log-likelihood function $\ell(\cdot)$ is continuously differentiable and its gradient satisfies:

  $\| \nabla \ell(\theta) - \nabla \ell(\tilde{\theta}) \|_2 \;\le\; L \, \| \theta - \tilde{\theta} \|_2$,   (3)

with $L = \max_t n_t / 4$. This follows by noting that the $t$th coordinate of $\nabla \ell(\theta)$ is $n_t p_t - y_t$, and that $\nabla^2 \ell(\theta)$ is a diagonal matrix with the $t$th diagonal entry being:

  $n_t \, p_t (1 - p_t) \;\le\; n_t / 4$.   (4)

Hence, the largest eigenvalue of $\nabla^2 \ell(\theta)$ is at most $\max_t n_t / 4$, which justifies the choice of $L$ above. For a fixed $\lambda$, our algorithm performs the following updates (for all $k \ge 1$):

  $\theta^{(k+1)} \;\in\; \arg\min_{\theta} \; \tfrac{L}{2} \, \big\| \theta - \big( \theta^{(k)} - \tfrac{1}{L} \nabla \ell(\theta^{(k)}) \big) \big\|_2^2 + \lambda \, P(\theta)$.   (5)

This leads to a decreasing sequence of objective values $f(\theta^{(k)})$, where $f(\theta) = \ell(\theta) + \lambda P(\theta)$. We now study the fate of the sequence $\{\theta^{(k)}\}$, depending upon the choice of $P$.

The fused lasso penalty ($P = P_1$)

We first consider the case where the regularizer $P = P_1$, and hence the function $f(\theta) = \ell(\theta) + \lambda P_1(\theta)$, is convex in $\theta$. The sequence $\{\theta^{(k)}\}$ converges to a minimum of Problem (2) with the penalty function $P_1$. More precisely, it can be shown that fista-09

  $f(\theta^{(k)}) - f(\theta^{\star}) \;\le\; \frac{L \, \|\theta^{(0)} - \theta^{\star}\|_2^2}{2k}$,   (6)

where $\theta^{\star}$ is an optimal solution to Problem (2). Display (6) states that the sequence converges to a minimizer of Problem (2) at a rate of $O(1/k)$. In fact, this rate can be improved further under a minor additional assumption. Let $\alpha$ denote the smallest diagonal entry of the Hessian of $\ell$ along the iterates, i.e., $\alpha = \min_{t,k} n_t \, p_t^{(k)} (1 - p_t^{(k)})$. As long as the $\theta^{(k)}$'s are uniformly bounded, we have $\alpha > 0$. The rate of convergence is then linear and is given by:

  $\| \theta^{(k)} - \theta^{\star} \|_2 \;\le\; \gamma^{k} \, \| \theta^{(0)} - \theta^{\star} \|_2$,   (7)

where,

  $\gamma \;=\; 1 - \alpha / L$.   (8)

Note that sub-problem (5) requires solving a problem of the form:

  $\min_{z} \; \tfrac{1}{2} \| z - b \|_2^2 + \mu \, P(z)$, with $b = \theta^{(k)} - \tfrac{1}{L} \nabla \ell(\theta^{(k)})$ and $\mu = \lambda / L$,   (9)

which, for $P = P_1$, can be solved very efficiently via dynamic programming Johnson2013fldp with cost $O(T)$ – one can solve instances with $T$ in the thousands in a few milliseconds.

The $\ell_0$-segmentation penalty ($P = P_0$)

The algorithm above also applies to the regularizer $P_0$: in update (5) we set $P$ to $P_0$. This requires solving

  $\min_{z} \; \tfrac{1}{2} \| z - b \|_2^2 + \tfrac{\lambda}{L} \sum_{t=2}^{T} \mathbf{1}(z_t \neq z_{t-1})$,   (10)

which can also be computed efficiently using the dynamic programming algorithm of Johnson2013fldp . Describing the properties of this sequence is subtle, since the associated optimization problem (2) is non-convex. Following bertsimas2016best , it can be shown that the sequence $f(\theta^{(k)})$ is decreasing, bounded below (we assume that the function $f$ is bounded below and the minimum exists), and converges to some value $f^{\star}$ (say), which may not be a global minimum. We say that $\hat{\theta}$ is a first-order stationary point for Problem (2) if it satisfies the following fixed point equation:

  $\hat{\theta} \;\in\; \arg\min_{\theta} \; \tfrac{L}{2} \, \big\| \theta - \big( \hat{\theta} - \tfrac{1}{L} \nabla \ell(\hat{\theta}) \big) \big\|_2^2 + \lambda \, P_0(\theta)$.

$\hat{\theta}$ is said to be an $\epsilon$-accurate first-order stationary point if it satisfies this fixed point condition up to a tolerance of $\epsilon$. Following the convergence analysis in bertsimas2016best [Theorem 3.1], we obtain the following finite-time convergence rate of $\{\theta^{(k)}\}$ to a first-order stationary point:

(11)

Note that this algorithm may not reach the global minimum of the $\ell_0$ version of Problem (2). However, in practice, it reaches a high quality solution quite fast.
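As an illustration of the scheme above, the sketch below implements update (5) for the $\ell_0$ penalty, solving the prox step (10) exactly with a simple $O(T^2)$ dynamic program over segment boundaries. The paper relies on the faster algorithm of Johnson2013fldp ; this DP is only a readable stand-in, and the penalty level and iteration count are illustrative.

```python
# Sketch of the proximal gradient scheme (5) for Problem (2) with the l0 penalty.
# The prox (10) is solved exactly here by a simple O(T^2) dynamic program; the
# paper uses the faster algorithm of Johnson (2013), so this is a stand-in.
import numpy as np

def l0_prox(b, gamma):
    """argmin_z 0.5*||z - b||^2 + gamma * (#jumps in z), by dynamic programming."""
    T = len(b)
    s1 = np.concatenate(([0.0], np.cumsum(b)))          # prefix sums of b
    s2 = np.concatenate(([0.0], np.cumsum(b ** 2)))      # prefix sums of b^2

    def seg_cost(i, j):                                  # best 0.5*SSE of b[i:j+1] fit by a constant
        m, ss, length = s1[j + 1] - s1[i], s2[j + 1] - s2[i], j - i + 1
        return 0.5 * (ss - m ** 2 / length)

    F = np.full(T + 1, np.inf)                           # F[j] = optimal cost of prefix b[:j]
    F[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for j in range(1, T + 1):
        for i in range(j):                               # last segment is b[i:j]
            cost = F[i] + seg_cost(i, j - 1) + (gamma if i > 0 else 0.0)
            if cost < F[j]:
                F[j], back[j] = cost, i
    z, j = np.empty(T), T
    while j > 0:                                         # recover segment means by backtracking
        i = back[j]
        z[i:j] = (s1[j] - s1[i]) / (j - i)
        j = i
    return z

def fit_l0_logistic(y, n, lam=3.0, iters=200):
    """Proximal gradient (update (5)) for the binomial likelihood with l0 penalty."""
    L = np.max(n) / 4.0                                  # Lipschitz constant of the gradient
    theta = np.zeros(len(y))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-theta))
        grad = n * p - y                                 # gradient of the negative log-likelihood
        theta = l0_prox(theta - grad / L, lam / L)       # prox step with effective penalty lam/L
    return 1.0 / (1.0 + np.exp(-theta))                  # estimated proportions p_hat
```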

2.2.2 Estimated Signal

To gather some intuition about the behavior of the segmentation methods described above, we consider a synthetic example in Figure 3 – here the underlying (true) signal is piecewise constant with three levels up to a certain time point, where there is a right discontinuity, after which it becomes linear. (More specifically, the data are generated from a binomial model whose success probability is constant on three consecutive segments and then increases linearly.) The figure presents the signal estimates (for both the $\ell_1$ and $\ell_0$ penalties) at the cross-validated choices of the tuning parameter – we use 10-fold cross-validation FHT-09-new , which is also used in the R package genlasso. (Since we want to ensure each fold is representative of the time series, instead of randomly assigning points to a fold, we systematically assign points by placing every 10th point into the same fold.) For both segmentation schemes, the estimated signals serve as good (overall) approximations of the underlying signal – however, there are some subtle differences. First of all, the $\ell_1$-segmentation scheme leads to biased estimates, and the bias becomes quite prominent in estimating the jump at the centre of the signal. This behavior is not present for the $\ell_0$-scheme. In addition, the estimates for the linear component (at the right) also differ across the $\ell_1$ and $\ell_0$ schemes. The $\ell_0$ regularizer leads to fewer segments (here three) compared to the $\ell_1$-penalty, which has several smaller jumps.


Figure 3: Estimators obtained from Problem (2) with $\ell_1$ (upper panel) and $\ell_0$ (lower panel) regularization. The data are synthetic and the underlying signal contains two sharp jumps and a gradually increasing trend. We use cross-validation to select a value of $\lambda$. The $\ell_1$ penalty shrinks the estimated probability during the big burst and gives more jumps during the gradual-increase period.

Figures 4 and 5 show the estimated signal proportions $\hat{p}_t = 1/(1 + e^{-\hat{\theta}_t})$, where the $\hat{\theta}_t$'s are obtained from Problem (2). Both penalty functions do a good job of estimating a piecewise constant version of the underlying signal – the $\ell_0$ scheme leads to fewer jumps than its $\ell_1$ counterpart, for a comparable data-fidelity. The figures also show fitted signals for a few other values of $\lambda$ around the cross-validated choice at the center ($\lambda$ increases as one moves down the rows); we include the tuning parameter selected by the one-standard-error rule FHT-09-new (see also the R package genlasso). We can see that as $\lambda$ decreases, the algorithm captures a more granular structure of the data and estimates more jumps.


Figure 4: Figure showing the raw proportions (blue dots) for TAGS UNGA (U.N. General Assembly) and the estimated proportions $\hat{p}_t$, as obtained from the regularization framework in Problem (2). One column shows the estimates obtained with the $\ell_1$-segmentation penalty and the other the estimates with the $\ell_0$-segmentation penalty. The middle rows correspond to the optimal $\lambda$ chosen by 10-fold cross-validation. It shows how, in between the cyclical jumps in UN-related communications relating to the regular Fall meetings of the General Assembly, there was also a jump in April-May 1974. This occurred when Algeria called a special session to demand UN support for a “New International Economic Order.” We show a few additional choices of the regularization parameter for each example (see text for details). The figure (bottom right panel) shows the large bias incurred by the $\ell_1$-penalization method in the estimation process – this behavior is less pronounced with the $\ell_0$ penalty. The $\ell_1$-penalty also exhibits a stair-casing effect – with many small jumps. Unlike the $\ell_1$-penalty, the $\ell_0$-penalty selects some segments that are very short; these segments disappear upon increasing the penalty parameter.

Figure 5: Figure showing the raw proportions (blue dots) for TAGS VS (South Vietnam) and the estimated proportions $\hat{p}_t$, as obtained from the regularization framework in Problem (2). The two columns show the results for the $\ell_1$- and $\ell_0$-segmentation penalties. The middle rows correspond to the optimal $\lambda$ chosen by cross-validation, and we show a few additional choices of the regularization parameter for each example. A practitioner might select one value or another depending on whether they want to identify the smaller jumps that correspond, in this case, to the refugee crisis that followed the defeat of South Vietnam.

2.3 A deeper investigation of Jumps

2.3.1 How intense is a Jump?

The procedure in Section 2.2 provides an estimated piecewise constant signal $\hat{\theta}$ (equivalently, $\hat{p}$). In particular, this can be used to identify locations $t$ where the signal changes: $\hat{\theta}_t \neq \hat{\theta}_{t-1}$. Of course, these jumps depend upon the choice of the tuning parameter, the penalty used, and also the underlying signal. A jump estimated by the $\ell_1$- or $\ell_0$-segmentation procedure may reflect (a) a discontinuity in the underlying signal – in this case, the signal is well approximated by locally constant segments with pieces adapting to the data; or (b) a localized trend in the signal, as we saw in Figure 3 – a jump here is a consequence of the slope of the signal and not a discontinuity in it.

Given such an estimate, a scholar accustomed to analyzing events through a close reading of historical documents may ask:

  • Which of these jumps might be important or are indicative of a historical event of interest?

  • Can one obtain a rank ordering of the jumps based on their intensities?

We formalize this question as follows: given an estimate of the signal and a set of candidate jumps, can we identify the jumps that are strong enough? Ideally, we would like a simple measure that associates a score with the strength and size of a jump selected by the estimation procedure – this would help us select a smaller set of jumps that merit closer scrutiny. Towards this end, we use a sample-splitting procedure wasserman2009high (due to the large number of samples, sample splitting does not significantly reduce the size of the training dataset): a subsample consisting of 50% of the data is used for estimating the location of the jumps, and the remaining held-out part of the data is used to associate a p-value score (the method is described below) with each jump identified in the first stage. This method is simple and intuitive, and provides a natural way to rank the jumps in a communication stream. Using this scheme, one can potentially reduce the number of jumps selected by the fitting procedure by screening out jumps with p-values larger than a user-specified threshold.

We describe our approach with reference to one of the penalization procedures, though the idea applies to both the $\ell_1$- and $\ell_0$-penalized estimates. Suppose a candidate change-point location $\tau$ is estimated based on the first part of the sample (used for estimating the signal). The test statistic is evaluated on the held-out sample. We take a neighborhood of size $2h$ centered at $\tau$, and denote the time points to the left of $\tau$ as $N_L$ and those to the right as $N_R$. Let us assume that the proportions $p_t$ for $t \in N_L$ are all equal to $p_L$, and those for $t \in N_R$ are all equal to $p_R$. We then test the null hypothesis ($H_0$) that the proportions on the left and right of $\tau$ are equal, $p_L = p_R$, versus the alternative ($H_1$) that $p_L \neq p_R$. We use the likelihood ratio test statistic for this purpose. To compute the null distribution, we used a two-step procedure. We first identified segments of the time series that did not overlap with any candidate change-point location (i.e., parts of the series where the estimated signal was constant over stretches at least as long as the neighborhood). Based on the regions thus selected (i.e., the locally constant stretches of the signal), we simulated the null distribution of the test statistic by using a permutation test. These two-sided p-values were consequently used as a measure of the intensity of every jump.

Note that a candidate jump estimated by the signal estimation procedure at the cross-validated choice of the tuning parameter need not necessarily correspond to a jump with a low p-value. A small p-value indicates that the intervals to the left and right of $\tau$ have different proportions. (A jump obtained from the estimated signal may be due to a linear rise in the signal, which need not correspond to a significant change in local proportions.) Our experiments indicate that jumps in $\hat{\theta}$ that correspond to gradual linear rises in the signal have higher p-values associated with them than sudden or abrupt changes in $\hat{\theta}$. Thus the p-values can be used to (a) devise a scoring mechanism to rank-order multiple jumps observed in a series and/or (b) prune out redundant jumps and identify ones that exhibit a strong difference in proportions between the left and right intervals. Figure 6 shows the communication stream for TAGS CVIS and the estimated signal obtained via the segmentation framework of Section 2.2. We also computed the p-values for each potential jump location suggested by the segmentation fit, pruned the jumps based on their p-value scores, and refitted the model. It helps to interpret these patterns alongside Figure 9 (TAGS CVIS), which presents more flexible fits of the underlying signal for this communication stream. Figure 7 shows additional examples interpreting the p-values associated with the different jumps. Figure 7 suggests that the p-values are indicative of whether a jump is due to a shift in the piecewise constant level or to a linear trend – the p-values are larger when there is a linear trend rather than a sharp jump (as in a piecewise constant signal). Figure 9 shows a more flexible approximation of this communication stream, which provides further insight into the patterns of the TAGS US series. To validate the intuition gathered above, we consider a synthetic example in Figure 8 (with the same data as in Figure 3) – here we observe that the p-values tend to be larger for jumps in the right part of the signal – these jumps result from approximating a linear trend (which appears at the right of the series) with piecewise constant segments. Note that the p-values associated with the first three jumps (at the left of the signal) are quite small – they correspond to jumps in the underlying piecewise constant signal.
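A minimal sketch of the sample-splitting p-value for a single candidate jump is given below. The likelihood-ratio statistic follows the verbal description above; the window half-width `h`, the number of permutations, and the way pseudo-windows are drawn from a locally constant stretch are illustrative choices.

```python
# Sketch of the sample-splitting p-value of Section 2.3.1 for one candidate
# jump at index tau, evaluated on held-out counts (y, n). The half-width h and
# the permutation scheme (drawing pseudo-windows from a locally constant
# stretch indexed by flat_idx) are illustrative assumptions.
import numpy as np

def binom_loglik(y, n, p):
    eps = 1e-12
    return np.sum(y * np.log(p + eps) + (n - y) * np.log(1 - p + eps))

def lr_statistic(y_left, n_left, y_right, n_right):
    """Binomial log-likelihood ratio for p_L = p_R vs p_L != p_R."""
    p_l = y_left.sum() / n_left.sum()
    p_r = y_right.sum() / n_right.sum()
    p_all = (y_left.sum() + y_right.sum()) / (n_left.sum() + n_right.sum())
    return (binom_loglik(y_left, n_left, p_l) + binom_loglik(y_right, n_right, p_r)
            - binom_loglik(np.concatenate([y_left, y_right]),
                           np.concatenate([n_left, n_right]), p_all))

def jump_pvalue(y, n, tau, flat_idx, h=15, n_perm=500, seed=0):
    """Permutation p-value for a jump at tau; flat_idx indexes a locally constant stretch."""
    rng = np.random.default_rng(seed)
    L, R = slice(tau - h, tau), slice(tau, tau + h)
    obs = lr_statistic(y[L], n[L], y[R], n[R])
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.choice(flat_idx, size=2 * h, replace=False)   # pseudo-window under H0
        null[b] = lr_statistic(y[idx[:h]], n[idx[:h]], y[idx[h:]], n[idx[h:]])
    return (1 + np.sum(null >= obs)) / (1 + n_perm)
```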

In passing, we note that it is also possible to use multiple testing procedures lehmann2006testing to attach error rates to a family of jumps. In particular, we can use family-wise error rate (FWER) or false discovery rate (FDR) control procedures to return a collection of candidate jumps with a prescribed control on the error rate lehmann2006testing . However, our goal here is to use the sample splitting and p-value framework to perform an exploratory analysis of the strengths of the different jumps – we do not pursue an in-depth study of multiple testing in this work.

[Figure 6 panels: no pruning, and pruning at three different p-value thresholds; y-axis: proportion]


Figure 6: Figure showing the communication stream for TAGS CVIS – the estimated signal is obtained from the segmentation scheme of Section 2.2 (at the cross-validated choice of $\lambda$). We compute the p-values (based on sample splitting, as described in Section 2.3.1) for every candidate jump location and prune the jump locations (refitting the signal with the retained jump locations) based on different thresholds. We observe that a pruning rule based on p-values leads to a fairly robust selection of intense jump locations, where each location corresponds to a sharp change in local means.
[Figure 7 panels: TAGS ENRG and TAGS US; y-axis: proportion]

Figure 7: TAGS ENRG (energy) and US (for cables relating to the U.S.), with p-values associated with the estimated jumps (using the framework in Section 2.3.1). The jumps for ENRG are much sharper and indicate rapid (though not instantaneous) changes in mean, starting with the 1973 OPEC oil embargo. This gives them low p-values. In contrast, the two jumps to the right of the signal for US occur during less rapid changes in mean and thus have slightly larger p-values. Figure 9 presents more flexible fits of the underlying signal. (The annotations in the figure are shorthand for the corresponding p-values.)


Figure 8: Synthetic data: the data description is the same as in Figure 3; the signal contains three real jumps and a linearly increasing trend. The first three detected jumps (from left to right) have small p-values (close to zero) – they correctly correspond to the jumps in the underlying signal. The other two potential jumps have larger p-values – these jumps are a consequence of the linear trend (we use the framework in Section 2.3.1). (The annotations in the figure are shorthand for the corresponding p-values.)

2.4 From Jumps to Bursts

Section 2.2 presents methods to detect several jumps in a signal, and Section 2.3 presents an in-depth investigation of how to associate scores (p-values) with each of the detected jumps using a sample-splitting scheme. We now explore whether it is possible to summarize a single communication stream (corresponding to a specific TAGS) with a score – a measure that should ideally capture information about the strengths and durations of the different jumps. Towards this end, following the terminology introduced by Kleinberg kleinberg2003bursty , we formalize the notion of a “burst”. The general approach pursued in this paper and the models used differ from kleinberg2003bursty . Informally speaking, a burst corresponds to a stretch of time during which a communication stream exhibits traffic that is larger than a baseline value. We present a computational scheme to estimate such bursts for a TAGS-specific communication stream.

2.4.1 Computation of the strength of a Burst

As a starting point, we consider an estimate of a baseline proportion $p_0$ (we discuss how to compute this below) that is specific to a communication stream. A “burstiness period” or simply a burst corresponds to a maximal time interval $B$ over which the estimated signal lies above the baseline value, i.e., $\hat{p}_t > p_0$ for all $t \in B$. Following kleinberg2003bursty , we define the strength of the burst as the logarithm of the likelihood ratio (here, the numerator is the likelihood of the estimated signal and the denominator is the likelihood evaluated at the baseline), given by $W(B) = \sum_{t \in B} \log\{ L_t(\hat{p}_t) / L_t(p_0) \}$, where $L_t(p)$ denotes the likelihood at time $t$ with success probability $p$. Note that the baseline is specific to a communication stream and the score $W(B)$ represents a deviation from this global baseline. $W(B)$ is different from the magnitude of a jump, $|\hat{\theta}_t - \hat{\theta}_{t-1}|$ – it takes into account both the deviation of $\hat{p}_t$ from the baseline and the duration of the burst, given by the length of $B$. A large value of $W(B)$ means that a large part of the likelihood is explained by deviations from the baseline, and therefore corresponds to a strong burst. Note that each TAGS can have multiple bursts, leading to multiple intervals – each with an assigned strength.

Choice of baseline: The baseline value $p_0$ should be representative of the behavior of the TAGS-specific communication stream. The global proportion of a communication stream is a reasonable choice; we set $p_0$ to be one standard deviation larger than the global proportion. A robust estimate like the median can also be used instead of the average. In our experiments we found that the top-ranked bursts were relatively insensitive to the choice of the baseline $p_0$.
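A sketch of the burst computation is given below: bursts are maximal runs where the estimated proportion exceeds the baseline, and each burst is scored by the binomial log-likelihood ratio against the baseline. Taking the standard deviation of the daily proportions when forming the baseline is an assumption made for the example.

```python
# Sketch of the burst-strength computation of Section 2.4.1. `p_hat` is the
# estimated proportion signal (e.g., from Problem (2)), `p0` the baseline.
import numpy as np

def bursts_and_strengths(y, n, p_hat, p0):
    """Return (start, end, strength) for each burst, strongest first."""
    eps = 1e-12
    above = p_hat > p0
    out, t = [], 0
    while t < len(p_hat):
        if above[t]:
            s = t
            while t < len(p_hat) and above[t]:
                t += 1                                    # extend the burst
            yt, nt, pt = y[s:t], n[s:t], p_hat[s:t]
            strength = np.sum(yt * np.log((pt + eps) / (p0 + eps))
                              + (nt - yt) * np.log((1 - pt + eps) / (1 - p0 + eps)))
            out.append((s, t - 1, strength))              # inclusive interval [s, t-1]
        else:
            t += 1
    return sorted(out, key=lambda b: -b[2])

def baseline(y, n):
    """Global proportion plus one std of the daily proportions (assumed choice)."""
    return y.sum() / n.sum() + np.std(y / n)
```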

2.4.2 Interpretation of Bursts

Table 2 presents the top thirty bursts, with their start and end dates and the date with the highest value. A close study of the content of the cables shows that not all of these bursts correspond with what scholars would recognize as an event of historical importance. After all, the cable TAGS that diplomats used do not necessarily correspond with diplomatic activity. For instance, the second biggest burst is made up of cables related to transportation (ETRN), a TAGS that was commonly used, and overused, from the beginning of the available records until 1974, when diplomats’ use of this TAGS was largely discontinued. The biggest burst, for CVIS (visas), has a similar pattern (as shown in Figure 6). But in this case, it appears to reflect a decision by archivists to stop preserving records related to visas langbart2007 . To the model, both of these look like bursts, but they simply reflect administrative procedures rather than historical events.

The bursts that follow, on the other hand, appear to correspond well to historical events. The next ten include the Carter administration’s prioritization of human rights (SHUM), Anwar Sadat’s surprise visit to Israel (PGOV), the Southeast Asian “Boat People” crisis (SREF), the U.S. withdrawal from the International Labor Organization (PORG), the conclusion of the Panama Canal Treaty (PDIP), the 1973 Yom Kippur War (XF, for Middle East), Portugal’s withdrawal from Angola (AO), and the 1974 crisis over Cyprus (CY). All are included in each of the four standard reference works we consulted brune2003chronological ; flanders1993dictionary ; jentleson1997encyclopedia ; deconde2002encyclopedia .

A systematic evaluation of hundreds of bursts for historical significance lies outside the scope of this paper. But the relative proportion of recognized historical events appears to diminish as one examines smaller bursts, like the ones ranked in the range 13-22. They include the denouement of the Vietnam War (VM and VS), the OPEC oil embargo (ENRG), the Vladivostok summit (OVIP), and negotiations to end white rule in Rhodesia (RH). But there are also largely unrecognized events, like a 1975 UN General Assembly debate over the command of foreign military forces in South Korea, that would appear to merit closer scrutiny. The identification of such unstudied episodes, no less than the rank-ordering of well-known events, is valuable for historical scholarship.

3 Beyond Piecewise Constant Segments

A major focus of Section 2 was on approximating a communication stream with a piecewise constant signal. This framework does help us answer some of the major data-driven questions of interest to a political scientist, based on a first order approximation of the communication streams. We now investigate more flexible signal approximations that provide us insights into the finer behavior of the signals. A natural extension of a piecewise constant estimate is a piecewise linear estimate. However, there are subtleties in incorporating this structure into our framework, as we discuss below.

To fix ideas, let us first consider the usual signal denoising problem with data $z_t = \mu_t + \epsilon_t$, for $t = 1, \ldots, T$, where the $\epsilon_t$'s denote noise. We seek to estimate $\mu = (\mu_1, \ldots, \mu_T)$ such that it is piecewise linear. In this vein, it is common to use the trend-filtering approach kim2009ell_1 ; tibshirani2014adaptive with regularizer $\sum_{t=2}^{T-1} |\mu_{t-1} - 2\mu_t + \mu_{t+1}|$ to obtain a signal with piecewise linear segments:

  $\min_{\mu} \; \tfrac{1}{2} \sum_{t=1}^{T} (z_t - \mu_t)^2 + \lambda \sum_{t=2}^{T-1} |\mu_{t-1} - 2\mu_t + \mu_{t+1}|$.

The penalty function encodes the $\ell_1$-norm of the discrete second-order derivative of the signal, assuming that the time points are all equally spaced. It can be interpreted as a convexification of its $\ell_0$ version, which counts the number of distinct piecewise linear segments.

Our situation is different from the denoising example outlined above. Since we are working under the modeling assumption $y_t \sim \text{Binomial}(n_t, p_t)$ with $p_t = 1/(1 + e^{-\theta_t})$, imposing a trend-filtering penalty on the $p_t$'s so as to maintain piecewise linearity would lead to a non-convex optimization problem, due to the nonlinear dependence of $p_t$ on $\theta_t$. Thus, instead of enforcing the sequence $\{p_t\}$ to be piecewise linear, we allow the latent parameters $\{\theta_t\}$ to be piecewise linear – this leads to a computationally tractable estimation framework. Encouraging $\theta$ to be piecewise linear allows $p_t$ to be more flexible than a piecewise constant signal. Towards this end, we propose a simple adaptation of the estimation criterion in Problem (2) by setting the regularizer $P(\theta) = \sum_{t=2}^{T-1} |\theta_{t-1} - 2\theta_t + \theta_{t+1}|$. Figure 9 shows the estimates obtained for some communication streams using the $\ell_1$-trend filtering penalty. If the time points are not equally spaced, then this penalty can be modified appropriately – see, for example, kim2009ell_1 and also Section A.1.

Computation

The proximal gradient-style algorithm update (5) can be adapted to the setting described above with the trend-filtering regularizer. For the prox step we use the specialized interior point solver of kim2009ell_1 (via the R-package wrapper available from https://github.com/hadley/l1tf) – this works quite nicely for the problem sizes encountered in this paper. The resulting Problem (2) is convex, and the sequence (5) leads to the optimum of the problem. The convergence rates outlined in (6) and (7) also apply to this problem. If we instead use the $\ell_0$ version of the penalty (counting the number of distinct piecewise linear segments), the resulting Problem (2) becomes a challenging nonconvex optimization problem – in this case, there is no analogue of the highly efficient dynamic programming implementation that was available for piecewise constant $\ell_0$-segmentation. Thus, for the case where we desire $\theta$ to be piecewise linear, we confine our study to the convex $\ell_1$-trend filtering regularizer.
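For prototyping, the piecewise-linear variant can also be posed directly in a generic convex solver; the sketch below replaces the specialized interior point solver with cvxpy and uses an arbitrary value of `lam`.

```python
# Sketch of the piecewise-linear extension of Section 3: the same binomial
# likelihood as in Problem (2), but with an l1 penalty on second differences
# of theta (trend filtering). cvxpy stands in for the specialized solver.
import cvxpy as cp
import numpy as np

def fit_trendfilter_logistic(y, n, lam=20.0):
    T = len(y)
    theta = cp.Variable(T)
    nll = cp.sum(cp.multiply(n, cp.logistic(theta)) - cp.multiply(y, theta))
    penalty = cp.norm1(cp.diff(theta, 2))     # |theta_{t-1} - 2*theta_t + theta_{t+1}|
    cp.Problem(cp.Minimize(nll + lam * penalty)).solve()
    return 1.0 / (1.0 + np.exp(-theta.value))
```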

Acknowledgements

R. Mazumder was supported by ONR (grant N000141512342) and NSF-IIS (grant 1718258). The authors would like to thank David Blei, David Madigan, and Shawn Simpson for helpful comments and encouragement. The authors would also like to thank the participants of the workshop “Famine and Feast – International Historical Research in the Digital Age” (London, UK; 2015) for comments on this work. Preliminary results from this article appeared in a media article on buzzfeed.com (www.buzzfeed.com/josephbernstein/can-a-computer-algorithm-do-the-job-of-a-historian).

[Figure 9 panels: CY and ENRG (top row); VS and SHUM (bottom row); y-axis: proportion]

Figure 9: Figure showing the estimates obtained from Problem (2) with the $\ell_1$-trend filtering regularizer (see Section 3). The sharp spike in the CY (Cyprus) communication stream corresponds to an unanticipated event, when Greek forces launched a coup with the goal of annexing Cyprus. The first peak for the second stream (ENRG) corresponds to the 1973 energy crisis, after the OPEC oil ministers announced an embargo during the Yom Kippur War. The peak for VS, for South Vietnam, corresponds to the Fall of Saigon in 1975, which marked the end of the Vietnam War. SHUM, for communications related to human rights, shows the increasing attention the State Department gave to this subject, especially after the election of President Jimmy Carter.

Appendix A Appendix

Significance level    Number of TAGS surviving test (1)
0.1 914
0.01 768
0.001 622
0.0001 509
0.00001 391
Table 1: Table showing how many TAGS-specific communication streams survive at different significance levels, with 0.00001 being the smallest detectable level. This is out of the first 1000 TAGS, which roughly corresponded to the TAGS with more than 50 total cables.

A.1 Problem (5) with irregularly spaced time points

If the time points are irregularly spaced, then the penalty functions need to be adjusted accordingly. The fused lasso penalty function becomes $\sum_{t=2}^{T} w_t \, |\theta_t - \theta_{t-1}|$, where the weight $w_t$ depends on the time difference between time point $t-1$ and the next time point, indexed by $t$. For this choice, the associated proximal map, i.e., Problem (5), needs to be modified to:

  $\min_{z} \; \tfrac{1}{2} \| z - b \|_2^2 + \mu \, \| D z \|_1$,   (12)

where $D$ is the matrix of weighted first-order differences. More generally, if we consider the $\ell_1$-trend filtering example with varying time intervals, we get an instance of Problem (12) with $D$ the matrix of (suitably weighted) second-order differences.

We use the Alternating Direction Method of Multipliers (ADMM) procedure boyd-admm1 , which introduces the decomposition $w = D z$ and the Augmented Lagrangian:

  $\mathcal{L}_{\rho}(z, w, u) \;=\; \tfrac{1}{2} \| z - b \|_2^2 + \mu \| w \|_1 + u^{\top} (D z - w) + \tfrac{\rho}{2} \| D z - w \|_2^2$,

for some choice of $\rho > 0$. The usual ADMM approach performs the following sequence of updates:

  $z \leftarrow \arg\min_{z} \mathcal{L}_{\rho}(z, w, u)$,  $w \leftarrow \arg\min_{w} \mathcal{L}_{\rho}(z, w, u)$,  $u \leftarrow u + \rho (D z - w)$,   (13)

where, in the update with respect to $z$, the other variables remain fixed, and the same applies for the update with respect to $w$. We refer the reader to boyd-admm1 for details pertaining to the convergence of this algorithm and to choices of $\rho$. We note that the update with respect to $z$ in display (13) can be obtained quite easily by solving a system of linear equations:

  $(I + \rho D^{\top} D) \, z \;=\; b + D^{\top} (\rho w - u)$.

Note that $D$ is a bidiagonal matrix when it corresponds to the weighted fused lasso penalty and a tridiagonal matrix when it corresponds to the weighted trend filtering penalty. The resulting banded linear systems can be solved with cost linear in $T$ BV2004 ; kim2009ell_1 – furthermore, the factorization can be computed once (at the onset), as the matrix does not change across iterations. The update with respect to $w$ in display (13) requires solving the following problem:

  $\min_{w} \; \mu \| w \|_1 + \tfrac{\rho}{2} \, \| w - (D z + u / \rho) \|_2^2$,

whose solution is given by the familiar soft-thresholding FHT-09-new operation, $w = \mathrm{sign}(D z + u/\rho) \cdot \max\{ |D z + u/\rho| - \mu/\rho, \, 0 \}$, applied componentwise. The sequence of updates in (13) is performed until some convergence criterion is met boyd-admm1 .
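A compact sketch of the ADMM scheme (13) for the weighted prox problem (12) is given below; the weights, the value of $\rho$, and the fixed iteration count are illustrative, and the dense matrix inverse stands in for the banded factorization discussed above.

```python
# Sketch of ADMM (13) for the weighted prox problem (12):
#   0.5*||z - b||^2 + mu*||D z||_1, with D a weighted first-difference matrix.
# Weights, rho, and the stopping rule are illustrative; a banded solver would
# replace the dense inverse in a serious implementation.
import numpy as np

def admm_weighted_fused(b, weights, mu, rho=1.0, iters=500):
    T = len(b)
    D = np.diff(np.eye(T), axis=0) * weights[:, None]   # (T-1) x T weighted differences
    M = np.eye(T) + rho * D.T @ D                        # system matrix for the z-update
    M_inv = np.linalg.inv(M)                             # fixed across iterations; factor once
    z, w, u = b.copy(), D @ b, np.zeros(T - 1)
    soft = lambda x, g: np.sign(x) * np.maximum(np.abs(x) - g, 0.0)
    for _ in range(iters):
        z = M_inv @ (b + D.T @ (rho * w - u))            # z-update: linear system
        w = soft(D @ z + u / rho, mu / rho)              # w-update: soft-thresholding
        u = u + rho * (D @ z - w)                        # dual update
    return z
```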

Rank  TAGS  Meaning  Start  End  Peak  Burst strength
1 ETRN Economic Affairs-Transportation 1973-07-02 1974-08-09 1973-09-28 5146.05
2 CVIS Consular Affairs-Visas 1973-07-02 1975-01-02 1974-06-28 4839.35
3 SHUM Social Affairs-Human Rights 1977-01-19 1977-12-30 1977-11-18 2872.02
4 US United States 1976-01-28 1977-09-16 1976-04-15 2516.03
5 PGOV Political Affairs-Government 1977-06-03 1977-12-30 1977-11-18 2484.57
6 SREF Social Affairs-Refugees 1975-04-22 1976-07-20 1976-06-02 1662.58
7 SOPN Social Affairs-Public Opinion and Information 1976-11-26 1977-12-30 1977-08-26 1597.14
8 PORG Political Affairs-Policy Relations With International Organizations 1977-06-15 1977-12-30 1977-11-11 1547.35
9 PDIP Political Affairs-Diplomatic and Consular Representation 1977-05-24 1977-12-30 1977-09-02 1462.93
10 XF Middle East 1973-10-09 1973-12-19 1973-10-16 1453.76
11 AO Angola 1975-11-08 1976-02-23 1975-11-10 1439.58
12 CY Cyprus 1974-07-15 1974-07-29 1974-07-20 1378.79
13 VM Vietnam 1977-10-11 1977-12-30 1977-10-12 1365.45
14 PDEV Political Affairs-National Development 1977-06-13 1977-12-30 1977-08-31 1344.70
15 VS Vietnam (South) 1973-07-02 1975-06-06 1975-04-25 1150.46
16 UNGA UN General Assembly 1975-08-19 1975-12-13 1975-11-07 1044.29
17 CARR Consular Affairs-Americans Arrested Abroad 1977-06-01 1977-12-30 1977-06-28 951.98
18 MCAP Political Affairs-Military Capabilities 1973-07-02 1974-08-15 1974-07-03 903.26
19 ENRG Economic Affairs-Energy 1973-11-08 1974-02-21 1974-01-25 760.64
20 PBOR Political Affairs-Boundary and Sovereignty Claims 1977-07-01 1977-12-30 1977-11-09 685.75
21 OVIP Operations-VIP Travel Arrangements 1974-10-09 1974-11-09 1974-10-31 607.17
22 RH Rhodesia 1976-09-01 1977-12-30 1977-08-31 569.89
23 AEMR Administration-Emergency and Evacuation 1975-03-28 1975-05-12 1975-04-28 524.59
24 MPLA Popular Movement for the Liberation of Angola 1975-11-07 1976-02-24 1976-02-18 507.98
25 MSG Marine Security Guards 1976-09-02 1977-12-30 1977-11-28 507.27
26 OREP Operations-Congressional Travel 1976-10-27 1976-11-18 1976-11-02 481.08
27 PRG Provisional Revolutionary Government of South Vietnam 1975-01-16 1975-02-06 1975-02-03 470.51
28 MNUC Military and Defense Affairs-Military Nuclear Applications 1977-03-11 1977-12-30 1977-08-22 421.53
29 UNGA UN General Assembly 1974-09-05 1974-12-05 1974-10-10 417.20
30 CB Cambodia (Khmer Republic) 1973-07-02 1975-05-21 1975-04-16 370.14
Table 2: Top 30 bursts identified using the segmentation algorithm of Section 2.2, with burst strengths computed using the method in Section 2.4. For interpretation of the bursts, see the discussion in Section 2.4.2.
[Figure 10 panels – row 1: SA, TQ, FI; row 2: SW, AR, OSCI; row 3: PFOR, SREF, OEXC (rapid change); row 4: PINT, OCON, SF (gradual change); y-axis: proportion]

Figure 10: Communication streams with different significance scores in the spirit of Section 2.1. The top row shows series for which, as for FI (Finland), the p-values are larger than 0.1; the second row has p-values between 0.1 and 0.001 (such as OSCI, scientific grants). The third and fourth rows show series that seemed to have a high degree of intense activity: all p-values were smaller than 0.001. This includes, for example, SREF (refugees) and SF (South Africa). As we will see, a strong deviation from the null model under the global testing framework does not necessarily imply a communication stream with a significant change point.

References

  • [1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
  • [2] John Beieler, Patrick T. Brandt, Andrew Halterman, Philip A. Schrodt, and Erin M. Simpson. Generating political event data in near real time: Opportunities and challenges. In R. Michael Alvarez, editor, Computational Social Science, pages 98–120. Cambridge University Press, 2016.
  • [3] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. Annals of Statistics, 44 (2):23–42, 2016.
  • [4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Number 3(1). Now Publishers, 2011.
  • [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • [6] Leif Boysen, Angela Kempe, Volkmar Liebscher, Axel Munk, and Olaf Wittich. Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, pages 157–183, 2009.
  • [7] Lester H Brune and Richard D Burns. Chronological History of U.S. Foreign Relations: 1933-1988. Routledge, 2003.
  • [8] Alexander DeConde, Richard D. Burns, Fredrik Logevall, and Louise B. Ketz. Encyclopedia of American Foreign Policy. Scribner, 2002.
  • [9] Stephen A Flanders and Carl N Flanders. Dictionary of American foreign affairs. Macmillan Library Reference, 1993.
  • [10] Joseph Glaz, Joseph I Naus, Sylvan Wallenstein, Sylvan Wallenstein, and Joseph I Naus. Scan statistics. Springer, 2001.
  • [11] Alex Hanna. Assessing GDELT with handcoded protest data. www.badhessian.org, 2014. (accessed: July 29, 2016).
  • [12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction (Springer Series in Statistics). Springer New York, 2 edition, 2009.
  • [13] J Craig Jenkins and Thomas V Maher. What should we do about source selection in event data? challenges, progress, and possible solutions. International Journal of Sociology, 46(1):42–57, 2016.
  • [14] Bruce W Jentleson, Thomas G Paterson, and Nicholas X Rizopoulos. Encyclopedia of US foreign relations, volume 2. Oxford University Press, USA, 1997.
  • [15] Nicholas A. Johnson. A dynamic programming algorithm for the fused lasso and l0-segmentation. Journal of Computational and Graphical Statistics, 22(2):246–260, 2013.
  • [16] Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. $\ell_1$ trend filtering. SIAM Review, 51(2):339–360, 2009.
  • [17] Jon Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.
  • [18] David Langbart, William Fischer, and Lisa Roberson. Appraisal of records covered by N1-59-07-3-P. National Archives, 2007.
  • [19] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science & Business Media, 2006.
  • [20] Enno Mammen, Sara van de Geer, et al. Locally adaptive regression splines. The Annals of Statistics, 25(1):387–413, 1997.
  • [21] Rahul Mazumder and Peter Radchenko. The discrete Dantzig selector: Estimating sparse linear models via mixed integer linear optimization. IEEE Transactions on Information Theory, 2017.
  • [22] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Norwell, 2004.
  • [23] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67:91–108, 2005.
  • [24] Ryan J Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
  • [25] Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of statistics, 37(5A):2178, 2009.