Pruned Wasserstein Index Generation Model and wigpy Package

03/30/2020
by Fangzhou Xie

The recently proposed Wasserstein Index Generation model (WIG) has shown a new direction for automatically generating indices. However, it is challenging to fit large datasets in practice, for two reasons. First, the Sinkhorn distance is notoriously expensive to compute and suffers severely from the curse of dimensionality. Second, the model requires a full N×N matrix to fit into memory, where N is the size of the vocabulary; when the vocabulary is too large, the computation becomes infeasible altogether. I hereby propose a Lasso-based shrinkage method that reduces the dimensionality of the vocabulary as a pre-processing step prior to fitting the WIG model. After obtaining word embeddings from a Word2Vec model, we cluster these high-dimensional vectors by k-means and pick the most frequent tokens within each cluster to form the "base vocabulary". Each non-base token is then regressed on the vectors of the base tokens to obtain a transformation weight, so that the whole vocabulary can be represented by the base tokens alone. This variant, called pruned WIG (pWIG), enables us to shrink the vocabulary dimension at will while still achieving high accuracy. I also provide a wigpy package in Python that carries out the computation in both flavors. An application to the Economic Policy Uncertainty (EPU) index is showcased as a comparison with existing methods of generating time-series sentiment indices.
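The pruning step described above can be sketched with off-the-shelf tools. The following is a minimal illustration, assuming gensim for Word2Vec and scikit-learn for k-means and the Lasso; the function name prune_vocabulary and all parameter values are illustrative, not the actual wigpy API, and it keeps a single most-frequent token per cluster for simplicity.

    import numpy as np
    from collections import Counter
    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Lasso

    def prune_vocabulary(sentences, n_base=500, alpha=0.01):
        """Shrink the vocabulary to 'base tokens' plus sparse Lasso weights.

        sentences: tokenized documents (lists of strings).
        n_base:    number of clusters, i.e. size of the base vocabulary.
        alpha:     Lasso regularization strength.
        """
        # Step 1: train Word2Vec to obtain an embedding for every token.
        w2v = Word2Vec(sentences, vector_size=100, min_count=1)
        vocab = list(w2v.wv.index_to_key)
        vectors = np.stack([w2v.wv[t] for t in vocab])

        # Step 2: cluster the high-dimensional word vectors with k-means.
        labels = KMeans(n_clusters=n_base, n_init=10).fit_predict(vectors)

        # Step 3: within each cluster, keep the most frequent token as its
        # representative "base token".
        freq = Counter(t for doc in sentences for t in doc)
        base = [
            max((t for t, l in zip(vocab, labels) if l == c),
                key=freq.__getitem__)
            for c in range(n_base)
        ]

        # Step 4: regress each non-base vector on the base vectors with the
        # Lasso, expressing every token as a sparse combination of base tokens.
        X = np.stack([w2v.wv[t] for t in base]).T          # (embed_dim, n_base)
        base_set = set(base)
        weights = {
            t: Lasso(alpha=alpha).fit(X, w2v.wv[t]).coef_  # (n_base,) weights
            for t in vocab if t not in base_set
        }
        return base, weights

Under this sketch, a document's distribution over the full vocabulary can be mapped onto the base vocabulary through these sparse weights before the Sinkhorn computation, so the N×N cost matrix shrinks to n_base×n_base.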


