Wasserstein Index Generation Model: Automatic Generation of Time-series Index with Application to Economic Policy Uncertainty

August 12, 2019 · Fangzhou Xie · NYU

I propose a novel method, called the Wasserstein Index Generation model (WIG), to generate a public sentiment index automatically. It can be performed off-the-shelf and is especially good at detecting sudden sentiment spikes. To test the model's effectiveness, an application to generating the Economic Policy Uncertainty (EPU) index is showcased.

1 Introduction

Inspired by recent developments in machine learning, I propose a novel method to generate a time-series index from news headlines automatically, namely the Wasserstein Index Generation model (WIG). It incorporates several methods that are widely used in machine learning: word embedding (mikolov2013), Wasserstein Dictionary Learning (WDL) (schmitz2018), the Adam algorithm (kingma2015), and Singular Value Decomposition (SVD).

I test this method's effectiveness by generating the Economic Policy Uncertainty index (baker2016, EPU) and comparing the result against existing ones (azqueta-gavaldon2017) generated by the auto-labeling Latent Dirichlet Allocation (blei2003, LDA) method. Results show that the WIG model achieves similar results with a much smaller dataset and without the need for human intervention. Thus, it can also be applied to generate other time-series indices from news headlines in a faster and more efficient manner.

Recently, there has been much progress in the methodology for generating the EPU index, e.g. differentiating contexts for uncertainty (saltzman2018), generating the index from Google Trends (castelnuovo2017), and correcting the EPU for Spain (ghirelli2019). I wish to extend the scope of index generation by proposing this generalized WIG model.

2 Methods and Material

2.1 Wasserstein Index Generation Model

schmitz2018 propose an unsupervised machine learning technique to cluster documents into topics, called Wasserstein Dictionary Learning (WDL), wherein both documents and topics are considered discrete distributions over the vocabulary.

Consider a corpus of $M$ documents over a vocabulary of $N$ words. These documents form a matrix $Y = [y_1, \dots, y_M] \in \mathbb{R}^{N \times M}$, where each $y_m$ lies in the $N$-dimensional simplex $\Sigma_N$. We wish to find $K$ topics $T = [t_1, \dots, t_K] \in \mathbb{R}^{N \times K}$, with associated weights $\Lambda = [\lambda_1, \dots, \lambda_M] \in \mathbb{R}^{K \times M}$.

In other words, each document $y_m$ is a discrete distribution lying in the $N$-dimensional simplex $\Sigma_N$. Our aim is to represent and reconstruct these documents by topics $T = [t_1, \dots, t_K]$, with associated weights $\Lambda = [\lambda_1, \dots, \lambda_M]$, where $K$ is the total number of topics to be clustered. Note that each topic $t_k$ is a distribution over the vocabulary, and each weight vector $\lambda_m$ represents its associated document as a weighted barycenter of the underlying topics. We can also obtain a distance matrix $C \in \mathbb{R}^{N \times N}$ over the total vocabulary by first generating word embeddings $x_1, \dots, x_N \in \mathbb{R}^E$ and then measuring word distances pair-wise with a metric function, i.e. $C_{ij} = d(x_i, x_j)$, where $d$ is the Euclidean distance and $E$ is the embedding depth. (saltzman2018 proposes differentiating the use of "uncertainty" between positive and negative contexts. Word embedding methods, e.g. Word2Vec (mikolov2013), can do more: they consider not only the positive and negative contexts for a given word, but all possible contexts for all words.)
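As a concrete illustration of this step, the sketch below builds the cost matrix $C$ from word vectors; the use of NumPy/SciPy, the random stand-in embeddings, and the sizes are my own illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-in for trained Word2Vec vectors: one row per vocabulary word, of length E.
N, E = 1000, 50                          # vocabulary size and embedding depth (illustrative)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(N, E))

# Pairwise Euclidean distances between word vectors form the N x N cost matrix C.
C = cdist(embeddings, embeddings, metric="euclidean")

# C is symmetric with a zero diagonal, as a transport cost should be.
assert np.allclose(C, C.T) and np.allclose(np.diag(C), 0.0)
```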

Further, we can calculate the distances between documents and topics, namely the Sinkhorn distance. It is essentially a Wasserstein distance with an entropic regularization term added to ensure faster computation. (One can refer to cuturi2013 for the Sinkhorn algorithm and villani2003 for the theoretical results in optimal transport.)

Definition (Sinkhorn Distance). Given $\mu, \nu \in \Sigma_N$ as Borel probability measures on the vocabulary, $\varepsilon > 0$, and $C \in \mathbb{R}^{N \times N}$ as the cost matrix,

$S_\varepsilon(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \langle \pi, C \rangle - \varepsilon H(\pi), \qquad (1)$

where $\Pi(\mu, \nu) = \{ \pi \in \mathbb{R}_+^{N \times N} : \pi \mathbf{1}_N = \mu,\ \pi^\top \mathbf{1}_N = \nu \}$, $H(\pi) = -\sum_{i,j} \pi_{ij} (\log \pi_{ij} - 1)$ is the entropy of the transport plan, and $\varepsilon$ is the Sinkhorn weight.
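To make Equation 1 concrete, here is a minimal Sinkhorn iteration in the spirit of cuturi2013; the toy cost matrix, the regularization weight, and the fixed iteration count are illustrative choices, and the function returns the transport cost $\langle \pi, C \rangle$ of the regularized plan.

```python
import numpy as np

def sinkhorn_distance(mu, nu, C, eps=0.1, n_iter=200):
    """Entropically regularized optimal transport between histograms mu and nu."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):              # alternating Sinkhorn scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    pi = u[:, None] * K * v[None, :]     # regularized transport plan
    return np.sum(pi * C)                # transport cost <pi, C>

# Toy example with a three-word vocabulary.
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.1, 0.2, 0.7])
print(sinkhorn_distance(mu, nu, C))
```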

Given the distance function for a single document, we can set up the loss function for the training process:

$\min_{T, \Lambda} \sum_{m=1}^{M} \mathcal{L}\big(\hat{y}(T, \lambda_m),\, y_m\big), \quad \text{with } \hat{y}(T, \lambda_m) := \operatorname{arg\,min}_{y \in \Sigma_N} \sum_{k=1}^{K} \lambda_{km}\, S_\varepsilon(t_k, y). \qquad (2)$

In Equation 2, $\hat{y}(T, \lambda_m)$ is the reconstructed document, i.e. the Sinkhorn barycenter of the topics $T$ under the weights $\lambda_m$ (Equation 1), and $\mathcal{L}$ is a fitting loss between the reconstruction and the original document. Moreover, the constraint that the columns of $T$ and $\Lambda$ be distributions in Equation 1 is automatically fulfilled by a column-wise softmax operation inside the loss function.

Require: word distribution matrix $Y$, batch size $s$, Sinkhorn weight $\varepsilon$, Adam learning rate $\rho$.
Ensure: topics $T$, weights $\Lambda$.
1:  Initialize unconstrained parameters $\alpha \in \mathbb{R}^{N \times K}$ and $\beta \in \mathbb{R}^{K \times M}$.
2:  $T \leftarrow \mathrm{softmax}(\alpha)$, $\Lambda \leftarrow \mathrm{softmax}(\beta)$ (column-wise).
3:  for each batch of documents do
4:     Reconstruct the batch as Sinkhorn barycenters of $T$ under $\Lambda$; compute the loss in Equation 2 and its gradients with respect to $\alpha$ and $\beta$.
5:     Update $\alpha$ and $\beta$ by Adam; recompute $T \leftarrow \mathrm{softmax}(\alpha)$, $\Lambda \leftarrow \mathrm{softmax}(\beta)$.
6:  end for
Algorithm 1 Wasserstein Index Generation
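The PyTorch sketch below illustrates one way the loop in Algorithm 1 could be implemented. The barycenter routine follows the standard fixed-point iteration for entropic Wasserstein barycenters used in the WDL literature (schmitz2018); the mean-squared-error fitting loss, the omission of mini-batching, the toy data, and all hyper-parameter values are my own illustrative assumptions rather than the paper's exact choices.

```python
import torch

def sinkhorn_barycenter(T, lam, K, n_iter=50):
    """Entropic Wasserstein barycenter of the columns of T with weights lam.

    T:   (N, n_topics) column-stochastic topic matrix
    lam: (n_topics,)   barycentric weights summing to one
    K:   (N, N)        Gibbs kernel exp(-C / eps)
    """
    b = torch.ones_like(T)                                 # one scaling vector per topic
    for _ in range(n_iter):
        phi = K.T @ (T / (K @ b))                          # (N, n_topics)
        p = torch.exp((torch.log(phi) * lam).sum(dim=1))   # weighted geometric mean
        b = p[:, None] / phi
    return p                                               # reconstructed document, shape (N,)

# --- toy setup: every size and value below is illustrative ---
torch.manual_seed(0)
N, M, n_topics, eps = 30, 20, 4, 0.1
X = torch.rand(N, 5)                                # stand-in word embeddings
K = torch.exp(-torch.cdist(X, X) ** 2 / eps)        # Gibbs kernel from a toy cost matrix
Y = torch.softmax(torch.rand(N, M), dim=0)          # stand-in documents (columns sum to 1)

alpha = torch.randn(N, n_topics, requires_grad=True)   # unconstrained topic parameters
beta = torch.randn(n_topics, M, requires_grad=True)    # unconstrained weight parameters
opt = torch.optim.Adam([alpha, beta], lr=0.01)

for step in range(100):
    T = torch.softmax(alpha, dim=0)                 # column-wise softmax -> distributions
    Lam = torch.softmax(beta, dim=0)
    recon = torch.stack([sinkhorn_barycenter(T, Lam[:, m], K) for m in range(M)], dim=1)
    loss = torch.mean((recon - Y) ** 2)             # illustrative fitting loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```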

Next, we are ready to generate the time-series index. By applying Singular Value Decomposition (SVD) with a single component to the topic matrix $T$, we can shrink its vocabulary dimension from $N$ to 1, which yields one score per topic. Next, we multiply this vector of topic scores by the weight matrix $\Lambda$ to obtain the document-wise scores given by SVD. (Other candidate dimension-reduction methods, PCA and ICA, were also considered; SVD performs better (Figure 2 in Appendix A).)

Adding up these scores by month and scaling the index to get a mean of 100 and unit standard deviation, we get the final index.
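A minimal sketch of this final step is given below; the NumPy/pandas calls, the random stand-ins for the trained topics and weights, and the daily toy dates are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N, K, M = 500, 4, 360                        # vocabulary, topics, documents (illustrative)
T = rng.dirichlet(np.ones(N), size=K).T      # (N, K) column-stochastic topic matrix
Lam = rng.dirichlet(np.ones(K), size=M).T    # (K, M) weights, one column per document

# SVD with a single component: collapse the vocabulary dimension of T from N to 1.
U, s, Vt = np.linalg.svd(T, full_matrices=False)
topic_scores = s[0] * Vt[0, :]               # one scalar score per topic

doc_scores = topic_scores @ Lam              # (M,) document-wise scores

# Aggregate by month, then rescale to mean 100 and unit standard deviation.
dates = pd.date_range("1985-01-01", periods=M, freq="D")   # stand-in publication dates
monthly = pd.Series(doc_scores, index=dates).resample("MS").sum()
index = 100 + (monthly - monthly.mean()) / monthly.std()
print(index.head())
```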

2.2 Data and Computation

I collected a dataset from The New York Times and Nexis Uni, consisting of news headlines from Jan. 1, 1985 to Dec. 31, 2018. The corpus contains 15,515 documents and 10,198 unique tokens. (The majority of headlines are from the NYT, supplemented by some from the proprietary retrieval system Nexis Uni. The plots in Figure 1, however, cover Jan. 1, 1985 to Aug. 31, 2016 to maintain the same range as azqueta-gavaldon2017.)

Next, I preprocess the corpus for the training process, for example by removing special symbols and lemmatizing each token. (Lemmatization refers to the process of converting each word to its dictionary form according to its context.)
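Preprocessing of this kind could look like the following sketch; spaCy, the specific cleaning regex, and the stop-word removal are my own assumptions, since the paper does not name its preprocessing tools.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline (must be downloaded beforehand)

def preprocess(headline: str) -> list[str]:
    """Strip special symbols and lemmatize each token of a news headline."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", headline)          # drop special symbols
    return [tok.lemma_.lower() for tok in nlp(cleaned)
            if not tok.is_space and not tok.is_stop]

print(preprocess("Stocks Plunge as Trade Tensions Escalate"))
# e.g. ['stock', 'plunge', 'trade', 'tension', 'escalate']
```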

Given this lemmatized corpus, I use Word2Vec to generate embedding vectors for the whole dictionary and can thus calculate the distance matrix $C$ for any pair of words.

To calculate the gradients, I use the automatic differentiation library PyTorch (paszke2017) to differentiate the loss function and then update the parameters with the Adam algorithm (kingma2015).

To determine several important hyper-parameters, I use cross-validation, as is common practice in the machine learning community. One-third of the documents are set aside as test data and the rest are used for training, in order to choose the embedding depth, Sinkhorn weight, batch size, number of topics, and Adam learning rate. Once these hyper-parameters are set at their optimal values, the whole dataset is used for training, and the topics and their associated weights are thus obtained.
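A holdout search of this kind could be organized as in the sketch below; `train_wig_and_score` is a hypothetical placeholder for a full WIG training run that returns a held-out loss, and the candidate grids are purely illustrative, not the values selected in the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_docs = 15515
test_idx = rng.choice(n_docs, size=n_docs // 3, replace=False)   # one-third held out
train_idx = np.setdiff1d(np.arange(n_docs), test_idx)

def train_wig_and_score(train_idx, test_idx, emb_depth, eps, batch_size, n_topics, lr):
    """Hypothetical helper: train WIG on train_idx, return reconstruction loss on test_idx."""
    return rng.random()   # placeholder; a real run would fit the model and evaluate it here

grid = itertools.product([25, 50, 100],    # embedding depth (illustrative candidates)
                         [0.05, 0.1],      # Sinkhorn weight
                         [32, 64],         # batch size
                         [4, 8],           # number of topics
                         [1e-3, 1e-2])     # Adam learning rate
best = min(grid, key=lambda hp: train_wig_and_score(train_idx, test_idx, *hp))
print("selected hyper-parameters:", best)
```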

3 Results

Figure 1: Above: original EPU (baker2016), EPU with LDA (azqueta-gavaldon2017), and EPU with WIG from Sec. 2.1. Below: cumulative differences between the cyclical component of the original EPU and those of the EPU given by LDA and WIG, after applying the Hodrick–Prescott filter.

As shown in Figure 1, the EPU index generated by the WIG model clearly resembles the original EPU. Moreover, the WIG detects sentiment spikes better than LDA, especially during major geopolitical events, such as “Gulf War I,” “Bush Election,” “9/11,” and “Gulf War II.” To examine this point further, I apply the Hodrick–Prescott filter to the three EPU indices, remove their trend components, and calculate the cumulative differences between the cycles of EPU (LDA) and EPU (WIG) and that of the original (lower panel of Figure 1). These cumulative differences show that WIG captures the EPU's cyclical behavior better than LDA over this three-decade period.
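This comparison can be reproduced along the lines of the sketch below; the use of statsmodels' hpfilter, the conventional monthly smoothing parameter of 129,600, the absolute-value cumulative gap, and the toy series are my assumptions standing in for the actual EPU indices.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

def cumulated_gap(index_a: pd.Series, index_b: pd.Series) -> pd.Series:
    """Cumulative absolute difference between the cyclical components of two indices."""
    cycle_a, _trend_a = hpfilter(index_a, lamb=129600)   # conventional monthly smoothing
    cycle_b, _trend_b = hpfilter(index_b, lamb=129600)
    return (cycle_a - cycle_b).abs().cumsum()

# Toy monthly series standing in for the original EPU and a generated index.
dates = pd.date_range("1985-01-01", "2016-08-01", freq="MS")
rng = np.random.default_rng(0)
epu_original = pd.Series(100 + 0.1 * rng.normal(0, 10, len(dates)).cumsum(), index=dates)
epu_wig = epu_original + rng.normal(0, 5, len(dates))

print(cumulated_gap(epu_original, epu_wig).tail())
```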

Moreover, this method requires only a small dataset compared with LDA. The dataset used in this article contains only news headlines, and the dimensionality of its dictionary is a small fraction of that used by the LDA method. The WIG model takes only half an hour of computation and still produces similar results. (A comparison of the datasets is given in Table 1 in Appendix A.)

Further, it extends the scope of automation in the generation process. Previously, LDA was considered an automatic-labeling method, but it continues to require human interpretation of topic terms to produce time-series indices. By introducing SVD, we could eliminate even that requirement and generate the index automatically as a black-box method. Yet, it by no means loses its interpretability. The key terms are still retrievable, given the result of WDL, if one wishes to view them.
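For instance, the most heavily weighted terms of each topic can be read directly off the columns of the trained topic matrix, as in the sketch below; the variable names and the random stand-in matrix are illustrative.

```python
import numpy as np

def top_terms(T: np.ndarray, vocab: list[str], n: int = 5) -> list[list[str]]:
    """Return the n highest-probability words for each topic (each column of T)."""
    return [[vocab[i] for i in np.argsort(T[:, k])[::-1][:n]]
            for k in range(T.shape[1])]

# Toy stand-ins for a trained topic matrix and its vocabulary.
rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(100)]
T = rng.dirichlet(np.ones(100), size=4).T    # (N = 100 words, K = 4 topics)
for k, terms in enumerate(top_terms(T, vocab)):
    print(f"topic {k}: {terms}")
```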

Last, given its advantages, the WIG model is not restricted to generating the EPU; it could potentially be applied to any dataset on a topic whose time-series sentiment index is of economic interest. The only requirement is that the input corpus be related to that topic, which is easily satisfied.

4 Conclusions

I proposed a novel method to generate time-series indices of economic interest using unsupervised machine learning techniques. It can be applied as a black-box method, requires only a small dataset, and is applicable to the generation of other time-series indices. The method incorporates several techniques from machine learning research, including word embedding, Wasserstein Dictionary Learning, and the widely used Adam algorithm.

Acknowledgements

I am grateful to Alfred Galichon for launching this project and to Andrés Azqueta-Gavaldón for kindly providing his EPU data. I would also like to express my gratitude to the referees of the 3rd Workshop on Mechanism Design for Social Good (MD4SG ’19) at the ACM Conference on Economics and Computation (EC ’19) and the participants at the Federated Computing Research Conference (FCRC 2019) for their helpful remarks and discussions.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

Appendix A Supplementary Materials

Figure 2: Comparison of dimension-reduction methods: cumulative differences between the indices produced by SVD, ICA, and PCA and the original EPU. SVD performs better than both ICA and PCA.
Name      Method      Type       Num. Entries   Num. Tokens   Time
EPU       Manual      articles   12,009         N/A           two years
EPU LDA   Semi-auto   articles   40,454         1,000,000+    several hours
EPU WIG   Automatic   headlines  15,515         10,198        30 min
Table 1: Comparison of the datasets among the three methods. WIG requires a much smaller dataset and runs faster.