Post-Processing of Word Representations via Variance Normalization and Dynamic Embedding

08/20/2018, by Bin Wang, et al.

Although embedded vector representations of words offer impressive performance on many natural language processing (NLP) applications, information about the order of words in input sequences is partially lost when only context-based samples are used in training. For further performance improvement, two new post-processing techniques, called post-processing via variance normalization (PVN) and post-processing via dynamic embedding (PDE), are proposed in this work. The PVN method normalizes the variance of the principal components of word vectors, while the PDE method learns orthogonal latent variables from ordered input sequences. The PVN and PDE methods can be integrated to achieve better performance. We apply these post-processing techniques to two popular word embedding methods (i.e., word2vec and GloVe) to yield their post-processed representations. Extensive experiments are conducted to demonstrate the effectiveness of the proposed post-processing techniques.


1 Introduction

By transferring prior knowledge from large unlabeled corpora, one can embed words into high-dimensional vectors that capture both semantic and syntactic information in their distributional representations. The design of effective word embedding methods has attracted the attention of researchers in recent years because of their superior performance in many downstream natural language processing (NLP) tasks, including sentiment analysis [1], information retrieval [2] and machine translation [3]. In this work, two new post-processing techniques, called post-processing via variance normalization (PVN) and post-processing via dynamic embedding (PDE), are proposed for further performance improvement.

PCA-based post-processing methods have been examined in various research fields. In the word embedding field, it has been observed that learned word vectors usually share a large mean and several dominant principal components, which prevents word embeddings from being isotropic. Word vectors that are isotropically distributed (i.e., uniformly distributed over spatial angles) can be differentiated from each other more easily. A post-processing algorithm (PPA) was recently proposed in [4] to exploit this property: the mean and several dominant principal components are removed by the PPA method. However, their complete removal may not be the best choice since they may still carry useful information. Instead of removing them, we propose a new post-processing technique that normalizes the variance of embedded words and call it the PVN method. The PVN method imposes constraints on dominant principal components instead of erasing their contributions completely.

Existing word embedding methods are primarily built upon the idea that “You shall know a word by the company it keeps” [5]. As a result, most current word embedding methods are trained on “(word, context)” samples. Most context-based word embedding methods do not differentiate word order in sentences; that is, they ignore the relative distance between the target and the context words within a chosen context window. Intuitively, words that are closer in a sentence should have stronger correlation, which has been verified in [6]. Thus, it is promising to design a word embedding method that not only captures the context information but also models the dynamics in a word sequence.

To achieve further performance improvement, we propose a second post-processing technique called the PDE method. Inspired by dynamic principal component analysis (Dynamic-PCA) [7], the PDE method projects existing word vectors onto an orthogonal subspace that captures the sequential information optimally under a pre-defined language model. The PVN and PDE methods can work together to boost the overall performance. Extensive experiments are conducted to demonstrate the effectiveness of PVN/PDE post-processed representations over their original counterparts.

2 Highlighted Contributions

Post-processing and dimensionality-reduction techniques for word embeddings have primarily been based on principal component analysis (PCA). There is a long history of analyzing high-dimensional correlated data with latent variable extraction, including PCA, singular spectrum analysis (SSA) and canonical correlation analysis (CCA), which have proven effective in various applications. Among them, PCA is a widely used data-driven dimensionality reduction technique since it maximizes the variance of the extracted latent variables. However, conventional PCA focuses on static variance and ignores time dependence in the data, so additional work is needed to apply PCA to dynamic data.

It is pointed out in [4] that embedded words usually share a large mean and several dominant principal components. As a consequence, the distribution of embedded words is not isotropic. Word vectors that are isotropically distributed (i.e., uniformly distributed over spatial angles) are easier to differentiate from each other. To make the embedding scheme more robust and to alleviate the hubness problem [8], the authors of [4] proposed to remove the dominant principal components of embedded words. On the other hand, some linguistic properties are still captured by these dominant principal components. Instead of removing them completely, we propose a new post-processing technique that imposes regularization on the principal components.

Recently, contextualized word embedding has gained attention since it tackles the word-sense problem using information from the whole sentence [9]. It contains a bi-directional long short-term memory (bi-LSTM) module that learns a language model whose inputs are ordered sequences. The performance of this model indicates that ordered information plays an important role in context-dependent representations and should also be taken into consideration in the design of context-independent word embedding methods.

There are three main contributions in this work. First, we propose two new post-processing techniques in Sec. 3 and Sec. 4, respectively. Second, we apply the developed techniques to several popular word embedding methods to generate their post-processed representations. Third, extensive experiments are conducted over various baseline models, including SGNS [10], CBOW [10], GloVe [11] and Dict2vec [12], in Sec. 5 to demonstrate the effectiveness of post-processed representations over their original counterparts.

3 Post-Processing via Variance Normalization

We modify the PPA method [4] by regularizing the variances of the leading components to a similar level and call the result post-processing via variance normalization (PVN). The PVN method is described in Algorithm 1, where V denotes the vocabulary set.

  Input: Word representations v(w) ∈ R^D, w ∈ V, and threshold parameter d.
  1. Remove the mean: ṽ(w) = v(w) − μ, where μ = (1/|V|) Σ_{w∈V} v(w).
  2. Compute the first d+1 PCA components u_1, …, u_{d+1} of {ṽ(w)}_{w∈V}.
  3. Compute the standard deviations of the projections onto the first d+1 PCA components, σ_i = std_{w∈V}(u_i^T ṽ(w)), i = 1, …, d+1.
  4. Determine the new representation v'(w) = ṽ(w) − Σ_{i=1}^{d} (1 − σ_{d+1}/σ_i)(u_i^T ṽ(w)) u_i.
  Output: Processed representations v'(w), w ∈ V.
Algorithm 1 Post-Processing via Variance Normalization (PVN)

In Step 4, u_i^T ṽ(w) is the projection of ṽ(w) onto the i-th principal component. We multiply it by the ratio factor (1 − σ_{d+1}/σ_i) to constrain its variance, project it back onto the original basis, and subtract it from the mean-removed word vector.

To compute the standard deviation of the i-th (i ≤ d) principal component of the processed representation v'(w), we project v'(w) onto the basis vector u_i:

u_i^T v'(w) = (σ_{d+1}/σ_i) u_i^T ṽ(w).   (1)

Thus, the standard deviation of every post-processed leading principal component, i = 1, …, d, is equal to σ_{d+1}. In other words, the variances of all leading principal components are normalized to the same level by PVN, which makes the embedding vectors more evenly distributed across dimensions.
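
To make the procedure concrete, the following Python sketch implements the PVN steps described above with numpy; the function and variable names (pvn, V, d, sigma) are our own, and the snippet is an illustration of Algorithm 1 as reconstructed here, not the authors' released code.

import numpy as np

def pvn(V, d):
    # Post-processing via variance normalization (PVN) -- a sketch.
    # V : (|vocab|, D) matrix of word vectors, one word per row.
    # d : threshold parameter (number of leading components to normalize).

    # Step 1: remove the mean.
    mu = V.mean(axis=0)
    V_tilde = V - mu

    # Step 2: first d+1 principal directions (rows of Ut) via SVD of the centered matrix.
    _, _, Ut = np.linalg.svd(V_tilde, full_matrices=False)
    U = Ut[:d + 1]                        # (d+1, D)

    # Step 3: standard deviations of the projections onto u_1, ..., u_{d+1}.
    proj = V_tilde @ U.T                  # (|vocab|, d+1)
    sigma = proj.std(axis=0)              # sigma_1 >= ... >= sigma_{d+1}

    # Step 4: shrink each leading component so its std matches sigma_{d+1}.
    ratio = 1.0 - sigma[d] / sigma[:d]    # (d,)
    return V_tilde - (proj[:, :d] * ratio) @ U[:d]

# Minimal usage example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
V_post = pvn(rng.normal(size=(1000, 300)), d=11)
print(V_post.shape)

Shrinking each leading projection by the factor (1 − σ_{d+1}/σ_i) is exactly the Step 4 update, so all post-processed leading components end up with standard deviation σ_{d+1}.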

The only hyper-parameter to tune is the threshold parameter d. The optimal setting may vary with different word embedding baselines. A good rule of thumb is to choose d in proportion to the dimension, D, of the word embeddings. Alternatively, we can determine the dimension threshold d by examining the energy ratios of the principal components.
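
As a hypothetical illustration of choosing d from energy ratios (the 0.2 cumulative-energy threshold below is our own example value, not a setting taken from the paper):

import numpy as np

def choose_d(V_tilde, energy_threshold=0.2):
    # Pick the smallest d whose leading components hold a given share of the
    # total variance of the mean-removed vectors V_tilde.
    _, S, _ = np.linalg.svd(V_tilde, full_matrices=False)
    energy = S ** 2 / np.sum(S ** 2)          # per-component energy ratio
    cumulative = np.cumsum(energy)
    return int(np.searchsorted(cumulative, energy_threshold) + 1)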

4 Post-Processing via Dynamic Embedding

4.1 Language Model

Our language model is a linear transformation that predicts the current word from its ordered context words. For a sequence of words w_1, w_2, …, w_T, the corresponding word embeddings are denoted by v_{w_1}, v_{w_2}, …, v_{w_T}. Two baseline models, SGNS and GloVe, are considered; in other words, v_{w_t} is the embedded vector of w_t obtained with one of these methods. Our objective is to maximize the conditional probability

p(w_t | w_{t−c}, …, w_{t−1}, w_{t+1}, …, w_{t+c}),   (2)

where c is the context window size. As compared with other language models that use only tokens from the past [9], we consider the two-sided context in Eq. (2) since both sides are about equally important to the center-word distribution in language modeling.
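
To make the two-sided, ordered context of Eq. (2) concrete, the sketch below builds (ordered context, center) samples from a tokenized sequence; it is our own illustration of this data preparation, with hypothetical function names.

def context_center_pairs(tokens, c):
    # Yield (ordered context, center) pairs with a two-sided window of size c.
    # The context keeps word order: c words before and c words after the center.
    # Positions too close to either end of the sequence are skipped for simplicity.
    for t in range(c, len(tokens) - c):
        left = tokens[t - c:t]             # w_{t-c}, ..., w_{t-1}
        right = tokens[t + 1:t + c + 1]    # w_{t+1}, ..., w_{t+c}
        yield left + right, tokens[t]

# Example with c = 2 on a toy sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
for context, center in context_center_pairs(sentence, c=2):
    print(context, "->", center)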

The linear language model can be written as

ṽ_{w_t} = Σ_{j=−c, j≠0}^{c} a_j ṽ_{w_{t+j}} + e_t,   (3)

where ṽ_w is the word embedding representation after the latent-variable transform to be discussed in Sec. 4.2 and a_j are position-dependent weights. The term e_t represents the information loss that cannot be well modeled by the linear model. We treat e_t as a negligible term and regard Eq. (3) as a linear approximation of the original language model.

4.2 Dynamic Latent Variable Extraction

We now apply the dynamic latent variable technique to the word embedding problem. To begin with, we define an objective function for extracting the dynamic latent variables. The word sequence data is denoted by

S = (w_1, w_2, …, w_T),   (4)

and, for each position t, the data matrix X_t derived from S is formed using the chosen context window size c and the word embedding representations of the data:

X_t = [v_{w_{t−c}}, …, v_{w_{t−1}}, v_{w_{t+1}}, …, v_{w_{t+c}}] ∈ R^{D×2c},   (5)
y_t = v_{w_t},   (6)
p_t = X_t a = Σ_{j=−c, j≠0}^{c} a_j v_{w_{t+j}},   (7)

where D is the word embedding dimension and a = (a_{−c}, …, a_{−1}, a_1, …, a_c)^T collects the context weights. Then, the objective function used to extract the dynamic latent variables can be written as

max_{W, a} Σ_t ⟨ W^T p_t, W^T y_t ⟩,   (8)

where W is a matrix of dimension D × d', p_t is the weighted sum of context word representations defined in Eq. (7), and d' is the selected dynamic dimension.

We interpret W in Eq. (8) as the matrix that stores the dynamic latent variables. If W contains all learned dynamic principal components of dimension d', then W^T p_t is the projection of the combined context word representations onto the dynamic principal components, and W^T y_t is the projection of the center word representation. The vector p_t is the weighted sum of context representations used for prediction. We seek the optimal W and a that maximize the sum, over all positions t, of the inner products between the predicted center word representation W^T p_t and the projected t-th center word representation W^T y_t. The inner product is chosen over other distance measures so as to maximize the variance of the extracted dynamic latent variables. For further details, we refer to [7].
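
The numpy sketch below evaluates the objective of Eq. (8) for given W and a, making the projections and inner products explicit; names such as dynamic_objective and d_dyn are ours, and the snippet assumes the equations as reconstructed above.

import numpy as np

def dynamic_objective(W, a, contexts, centers):
    # Sum of inner products <W^T p_t, W^T y_t> over all positions t.
    # W        : (D, d_dyn) matrix of dynamic latent variables (columns ~ orthonormal).
    # a        : (2c,) ordered context weights with the center position removed.
    # contexts : (T, 2c, D) embeddings of the ordered context words for each position.
    # centers  : (T, D) embeddings of the center words.
    p = np.einsum("tjd,j->td", contexts, a)   # p_t = sum_j a_j v_{w_{t+j}}
    pred = p @ W                              # W^T p_t        (T, d_dyn)
    target = centers @ W                      # W^T y_t        (T, d_dyn)
    return float(np.sum(pred * target))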

4.3 Optimization

There is no analytical solution to Eq. (8) since W and a are coupled [13]. Besides, we need to impose constraints on W and a: the columns of W must be orthonormal vectors, and a is normalized. The orthogonality constraint on W plays an important role. A similar orthogonality constraint has been introduced for bilingual word embedding for several reasons: the original word embedding space remains invariant and self-consistent under an orthogonal transform [14, 15]. Moreover, without such a constraint, the learned dynamic latent variables would have to be extracted iteratively, which is time consuming. Furthermore, the extracted latent variables would tend to be close to each other, with small angles between them.

We adopt the optimization procedure in Algorithm 2 to solve this optimization problem. Note that the parameter β is used to control the orthogonality-wise convergence rate.

  Initialize W and a randomly as learnable parameters.
  for training batch b = 1, …, m do
     Look up the corresponding word embeddings for the batch.
     Predict the center word representation W^T p_t.
     Extract negative samples based on word frequency.
     Update W and a by gradient descent to optimize Eq. (9).
     Re-orthogonalize the columns of W with convergence rate β.
     Re-normalize a.
  end for
Algorithm 2 Optimization for extracting dynamic latent variables
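
The paper does not spell out the re-orthogonalization step of Algorithm 2 here. One common way to softly pull the columns of W toward orthonormality between gradient updates, shown purely as an assumed illustration rather than the authors' exact rule, is the iterative update below, with beta acting as the orthogonality-wise convergence rate.

import numpy as np

def soft_orthogonalize(W, beta=0.01):
    # One step of W <- (1 + beta) W - beta W (W^T W).
    # For small beta and W close to orthonormal, this pulls the singular
    # values of W toward 1, i.e., toward orthonormal columns, without an SVD.
    return (1.0 + beta) * W - beta * (W @ (W.T @ W))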

In theory, we maximize the inner product over all tokens as shown in Eq. (8); in practice, we adopt negative sampling for the parameter update to save computation. The objective function can be rewritten as the sum over positions t of

L_t = log σ(⟨W^T p_t, W^T v_{w_t}⟩) + Σ_{k=1}^{K} E_{w_k ∼ P_n(w)} [ log σ(−⟨W^T p_t, W^T v_{w_k}⟩) ],   (9)

where σ(x) = 1/(1 + e^{−x}), K is the number of negative samples used per positive training sample, and the negative samples w_k are drawn according to their overall frequency distribution P_n(w).
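
A minimal sketch of the per-position negative-sampling objective in Eq. (9), under the reconstruction above (sigmoid score of the projected inner product for one true center word and K frequency-sampled negatives); in practice, gradients of this loss with respect to W and a would be taken with an autodiff framework and applied as in Algorithm 2.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, p_t, center, negatives):
    # Negative of the Eq. (9) objective for a single position t.
    # p_t       : (D,)   weighted sum of context embeddings.
    # center    : (D,)   embedding of the true center word w_t.
    # negatives : (K, D) embeddings of K words sampled by frequency.
    pred = W.T @ p_t                                 # W^T p_t
    pos = np.log(sigmoid(pred @ (W.T @ center)))     # positive term
    neg = np.sum(np.log(sigmoid(-(negatives @ W) @ pred)))  # K negative terms
    return -(pos + neg)                              # minimize the negative objective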

The final word embedding vector is the concatenation of two parts: 1) d' dynamic dimensions obtained by projecting onto the learned dynamic subspace, in the form of W^T v_w, and 2) static dimensions obtained from static PCA dimension reduction, in the form of W_s^T v_w, where the columns of W_s are the leading static principal components.
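
A sketch of this concatenation, assuming the static projection W_s and the dynamic matrix W have already been learned (the names are ours; the 240/60 split reported in Sec. 5.4 corresponds to 240 static and 60 dynamic dimensions for 300-dimensional inputs):

import numpy as np

def compose_embedding(V, W_static, W_dynamic):
    # Concatenate the static and dynamic parts of each word vector.
    # V         : (|vocab|, D)  original (e.g., SGNS) word vectors.
    # W_static  : (D, D_s)      top static principal directions from PCA.
    # W_dynamic : (D, d_dyn)    learned dynamic latent variables.
    # Returns     (|vocab|, D_s + d_dyn) post-processed vectors.
    static_part = V @ W_static       # W_s^T v for every word
    dynamic_part = V @ W_dynamic     # W^T v for every word
    return np.concatenate([static_part, dynamic_part], axis=1)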

5 Experiments

5.1 Baselines and Hyper-parameter Settings

We apply the PVN on top of several baseline models. For all SGNS-related experiments, the wiki2010 corpus (around 6 GB, http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2) is used for training. The vocabulary set contains about 830k words; only words that occur more than 10 times are included. For CBOW, GloVe and Dict2vec, we adopt the officially released code and train on the same corpus as SGNS. We set d = 11 across all experiments.

For the PDE method, we obtain training pairs from the wiki2010 corpus to maintain consistency with the SGNS model. The vocabulary size is 800k. Words with low frequency are assigned the same word vector for simplicity.

Name Pairs Year
WS-353 353 2002
WS-353-SIM 203 2009
WS-353-REL 252 2009
Rare-Word 2034 2013
MEN 3000 2012
MTurk-287 287 2011
MTurk-771 771 2012
SimLex-999 999 2014
Verb-143 143 2014
SimVerb-3500 3500 2016
Table 1: Word similarity datasets used in our experiments, where Pairs denotes the number of word pairs in each dataset.

5.2 Datasets

We consider two popular intrinsic evaluation benchmarks: 1) word similarity and 2) word analogy. A detailed introduction can be found in [16]. Our proposed post-processing methods work well with both evaluation methods, as reported in Sec. 5.3 and Sec. 5.4.

Type SGNS PPA PVN(ours)
WS-353 65.7 67.6 68.1
WS-353-SIM 73.2 73.8 73.9
WS-353-REL 58.1 59.4 60.7
Rare-Word 39.5 42.4 42.9
MEN 70.2 72.5 73.2
MTurk-287 62.8 64.7 66.4
MTurk-771 64.6 66.2 66.8
SimLex-999 41.6 42.6 42.8
Verb-143 35.0 38.9 39.5
SimVerb-3500 26.5 28.1 28.5
Average 47.8 49.8 50.3
Table 2: The SRCC performance comparison for SGNS alone, SGNS+PPA and SGNS+PVN on the word similarity datasets, where the last row is the average performance weighted by the number of pairs in each dataset.
Type: SGNS PPA PVN(ours)
Google Add 59.6 61.3 62.1
Mul 61.2 60.3 61.9
Semantic Add 57.8 62.4 62.4
Mul 59.3 59.5 60.9
Syntactic Add 61.1 60.5 61.8
Mul 62.7 61.0 62.7
MSR Add 51.0 53.0 53.4
Mul 53.3 53.3 54.9
Table 3: The SRCC performance comparison for SGNS alone, SGNS+PPA and SGNS+PVN on the word analogy datasets.

5.2.1 Word Similarity

Word similarity evaluation is widely used to assess word embedding quality and focuses on the semantic meaning of words. Here, we use the cosine similarity to measure the distance between word vectors and Spearman’s rank correlation coefficient (SRCC) [17] to evaluate the agreement between our similarity scores and human scores. We conduct tests on 10 popular datasets (see Table 1) to avoid drawing conclusions from any single benchmark. For more information on each dataset, we refer to http://www.wordvectors.org/.
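
As an illustration of this evaluation protocol, the following sketch scores a word-similarity dataset with cosine similarity and SRCC; it assumes a hypothetical emb dictionary mapping words to vectors and is not the benchmark's official script.

import numpy as np
from scipy.stats import spearmanr

def word_similarity_srcc(emb, pairs):
    # pairs: iterable of (word1, word2, human_score); returns SRCC x 100.
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 not in emb or w2 not in emb:
            continue                      # skip out-of-vocabulary pairs
        v1, v2 = emb[w1], emb[w2]
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
        human_scores.append(score)
    return 100.0 * spearmanr(model_scores, human_scores).correlation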

5.2.2 Word Analogy

Because of the limitations of performance comparison in terms of word similarity [18], word analogy is adopted as a complementary tool to evaluate the quality of word embedding methods. Both addition and multiplication operations are used for word prediction. For the PDE method, we report only the commonly used addition operation for simplicity.
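
The addition and multiplication analogy scores are commonly realized as the 3CosAdd and 3CosMul rules; a minimal sketch under that assumption, with E a matrix of unit-normalized word vectors and vocab the word list (both hypothetical names):

import numpy as np

def analogy(E, vocab, a, b, c, method="add", eps=1e-3):
    # Answer "a is to b as c is to ?" over unit-normalized rows of E.
    # method='add' ranks by cos(x,b) - cos(x,a) + cos(x,c)            (addition),
    # method='mul' ranks by cos(x,b) * cos(x,c) / (cos(x,a) + eps)    (multiplication),
    # with cosines shifted to [0, 1] in the multiplicative form.
    idx = {w: i for i, w in enumerate(vocab)}
    cos_a, cos_b, cos_c = (E @ E[idx[w]] for w in (a, b, c))
    if method == "add":
        scores = cos_b - cos_a + cos_c
    else:
        scores = ((cos_b + 1) / 2) * ((cos_c + 1) / 2) / ((cos_a + 1) / 2 + eps)
    for w in (a, b, c):                   # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]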

We choose two major datasets for word analogy evaluation: 1) the Google dataset [19] and 2) the MSR dataset [20]. The Google dataset contains 19,544 questions in two major categories, semantic and morpho-syntactic, which contain 8,869 and 10,675 questions, respectively. We also report results on these two Google subsets. The MSR dataset contains 8,000 analogy questions. Out-of-vocabulary words, i.e., words that appear fewer than 10 times in the wiki2010 corpus, were removed from both datasets.

5.2.3 Extrinsic Evaluation

For the PVN method, we conduct further experiments on extrinsic evaluation tasks, namely sentiment analysis and neural machine translation (NMT). For both tasks, a bidirectional LSTM is used as the inference tool. Two sentiment analysis datasets are used: the Internet Movie Database (IMDb) and the Stanford Sentiment Treebank (SST). The Europarl v8 English-French dataset is used for the NMT task. We report test accuracy for IMDb and SST, and validation accuracy for NMT.

Baselines IMDb SST NMT
SGNS 80.92/86.03 66.00/66.76 50.50/50.62
CBOW 85.20/85.81 67.12/66.94 49.78/49.97
GloVe 83.51/84.88 64.53/67.74 50.31/50.58
Dict2vec 80.62/84.40 65.06/66.89 50.45/50.56
Table 4: Extrinsic evaluation results for each baseline with and without PVN post-processing, where the first value in each cell is from the original embedding model and the second from its PVN post-processed version.

5.3 Performance Evaluation of PVN

The performance of PVN as a post-processing tool for the SGNS baseline is given in Tables 2 and 3, which also report results for the baseline and the baseline+PPA [4] for benchmarking.

Table 2 compares the SRCC scores of SGNS alone, SGNS+PPA and SGNS+PVN on the word similarity datasets. We see that SGNS+PVN performs better than SGNS+PPA. The largest performance gain of SGNS+PVN reaches 5.2% in terms of the average SRCC score. It is also worth pointing out that PVN is more robust than PPA to different settings of d.

Table 3 compares the SRCC scores of SGNS, SGNS+PPA and SGNS+PVN on the word analogy datasets using both the addition and the multiplication evaluation methods. PVN performs better than PPA in both. For the multiplication evaluation, PPA performs even worse than the baseline, whereas the proposed PVN method has no such negative effect and performs consistently well. This can be explained as follows: under the multiplication evaluation, the dimensions removed by PPA strongly influence the relative angles between vectors. This further supports the observation that some linguistic properties are captured by these high-variance dimensions, so their total elimination is sub-optimal.

Table 4 shows the extrinsic evaluation results. Our PVN post-processing improves over the original embeddings in nearly all downstream settings, often by a clear margin (e.g., more than 5 points for SGNS on IMDb).

Type SGNS PDE(ours)
WS-353 65.7 65.9
WS-353-SIM 73.2 73.6
WS-353-REL 58.1 59.3
Rare-Word 39.5 38.6
Google 59.6 60.8
Semantic 57.8 59.6
Syntactic 61.1 61.8
MSR 51.0 51.6
Table 5: The SRCC performance comparison for SGNS alone and SGNS+PDE on the word similarity and analogy datasets.

5.4 Performance Evaluation of PDE

We adopt the same settings as the SGNS baseline when evaluating the PDE method, such as the window size, vocabulary size, number of negative samples and training data. The final word representation is the concatenation of two parts, [W_s^T v_w; W^T v_w], where W_s^T v_w is the static part obtained from PCA dimension reduction and W^T v_w is the projection of v_w onto the learned dynamic subspace. Here, we set the dimensions of the static and dynamic parts to 240 and 60, respectively. The SRCC performance comparison of SGNS alone and SGNS+PDE on the word similarity and analogy datasets is shown in Table 5. By adding the ordered information via PDE, we see that the quality of word representations is improved in both evaluation tasks.

Type SGNS PVN/PDE(ours)
WS-353 65.7 69.0
WS-353-SIM 73.2 75.3
WS-353-REL 58.1 61.9
Verb-143 35.0 44.1
Rare-Word 39.5 42.5
Google 59.6 62.8
Semantic 57.8 62.8
Syntactic 61.1 62.8
MSR 51.0 53.7
Table 6: The SRCC performance comparison for SGNS alone and the SGNS+PVN/PDE model on the word similarity and analogy datasets.

5.5 Performance Evaluation of Integrated PVN/PDE

We can integrate PVN and PDE to improve upon their individual performance. Since PVN provides better word embeddings, it helps PDE learn better. Furthermore, normalizing the variances of the dominant principal components is beneficial since they otherwise occupy too much energy and mask the contributions of the remaining components. On the other hand, components with very low variances may mostly contain noise; they should be removed or replaced, and PDE can be used to replace these noisy components.

The SRCC performance of the baseline SGNS method and the SGNS+PVN/PDE method on the word similarity and word analogy tasks is listed in Table 6. Better results are obtained across all datasets. The improvement on the Verb-143 dataset is among the largest of all datasets with either joint PVN/PDE or PDE alone, which matches our expectation since context order contributes more to the meaning of verbs.

6 Conclusion and Future Work

Two post-processing techniques, PVN and PDE, were proposed in this work to improve the quality of baseline word embedding methods. The two techniques can work independently or jointly. Their effectiveness was demonstrated on both intrinsic and extrinsic evaluation tasks.

In the near future, we would like to extend the PVN method by exploiting correlations between dimensions and by applying it to dimensionality reduction of word representations. Furthermore, we would like to apply the dynamic embedding technique to generic and/or domain-specific word embedding methods with a limited amount of data, and to investigate its applicability to non-linear language models.

References