By transferring prior knowledge from large unlabeled corpora, one can embed words into high-dimensional vectors whose distributional representations capture both semantic and syntactic meanings. The design of effective word embedding methods has attracted the attention of researchers in recent years because of their superior performance in many downstream natural language processing (NLP) tasks, including sentiment analysis, information retrieval and machine translation. In this work, two new post-processing techniques, called post-processing via variance normalization (PVN) and post-processing via dynamic embedding (PDE), are proposed for further performance improvement.
PCA-based post-processing methods have been examined in various research fields. In the word embedding field, it is observed that learned word vectors usually share a large mean and several dominant principal components, which prevents word embeddings from being isotropic. Word vectors that are isotropically distributed (i.e., uniformly distributed in spatial angles) can be differentiated from each other more easily. A post-processing algorithm (PPA) was recently proposed by Mu et al. to exploit this property: the mean and several dominant principal components are removed by the PPA method. On the other hand, their complete removal may not be the best choice, since they may still provide some useful information. Instead of removing them, we propose a new post-processing technique that normalizes the variance of embedded words, called the PVN method here. The PVN method imposes constraints on dominant principal components instead of erasing their contributions completely.
Existing word embedding methods are primarily built upon Firth's observation that "You shall know a word by the company it keeps." As a result, most current word embedding methods are trained on samples of (word, context) pairs. Most context-based word embedding methods do not differentiate word order in sentences; that is, they ignore the relative distance between the target and context words within a chosen context window. Intuitively, words that are closer in a sentence should have stronger correlation, as verified by Khandelwal et al. Thus, it is promising to design a new word embedding method that not only captures the context information but also models the dynamics in a word sequence.
To achieve further performance improvement, we propose a second post-processing technique called the PDE method. Inspired by dynamic principal component analysis (dynamic PCA), the PDE method projects existing word vectors onto an orthogonal subspace that captures the sequential information optimally under a pre-defined language model. The PVN and PDE methods can work together to boost the overall performance. Extensive experiments are conducted to demonstrate the effectiveness of PVN/PDE post-processed representations over their original ones.
2 Highlighted Contributions
Post-processing and dimensionality-reduction techniques for word embeddings have primarily been based on principal component analysis (PCA). There is a long history of analyzing high-dimensional correlated data by latent variable extraction, including PCA, singular spectrum analysis (SSA) and canonical correlation analysis (CCA), which have been shown to be effective in various applications. Among them, PCA is a widely used data-driven dimensionality-reduction technique, as it maximizes the variance of the extracted latent variables. However, conventional PCA captures only static variance and ignores the temporal dependence between data samples, so additional work is required to apply PCA to dynamic data.
It is pointed out by Mu et al. that embedded words usually share a large mean and several dominant principal components; as a consequence, the distribution of embedded words is not isotropic. Word vectors that are isotropically distributed (i.e., uniformly distributed in spatial angles) are more easily differentiated from each other. To make the embedding scheme more robust and alleviate the hubness problem, they proposed to remove the dominant principal components of embedded words. On the other hand, some linguistic properties are still captured by these dominant principal components. Instead of removing them completely, we propose a new post-processing technique that imposes regularization on the principal components.
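To make the removal operation concrete, here is a minimal numpy sketch of the PPA idea; the function name `ppa` and the toy data are our own illustration, not the original implementation, which operates on trained word vectors:

```python
import numpy as np

def ppa(X, d=2):
    """Sketch of the PPA idea: remove the common mean vector and the
    projections onto the d most dominant principal components."""
    Xc = X - X.mean(axis=0)                    # remove the mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:d]                                 # top-d principal directions (d x D)
    return Xc - (Xc @ U.T) @ U                 # subtract dominant projections

# toy data: 1000 random 50-dim "word vectors" sharing a large mean
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) + 5.0
Y = ppa(X, d=2)
```

After the call, the processed vectors have zero mean and no energy along the two removed directions; the PVN technique proposed in this work softens exactly this hard removal step.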
Recently, contextualized word embedding has gained attention since it tackles the word-meaning problem using information from the whole sentence. The ELMo model of Peters et al. contains a bi-directional long short-term memory (bi-LSTM) module that learns a language model whose inputs are word sequences. The performance of this model indicates that order information plays an important role in context-dependent representations, and that it should also be taken into consideration in the design of context-independent word embedding methods.
There are three main contributions in this work. First, we propose two new post-processing techniques in Sec. 3 and Sec. 4, respectively. Second, we apply the developed techniques to several popular word embedding methods and generate their post-processed representations. Third, extensive experiments are conducted over various baseline models, including SGNS, CBOW, GloVe and Dict2vec, in Sec. 5 to demonstrate the effectiveness of the post-processed representations over their original ones.
3 Post-Processing via Variance Normalization
We modify the PPA method of Mu et al. by regularizing the variances of the leading components to a similar level, and call the result post-processing via variance normalization (PVN). The PVN method is described in Algorithm 1, where V denotes the vocabulary set.
In Step 4, the mean-removed word vector is projected onto the i-th principal component u_i. We multiply this projection by a ratio factor, 1 - σ_{d+1}/σ_i, where σ_i denotes the standard deviation along the i-th principal component; then we project it back onto the original bases and subtract it from the mean-removed word vector.
To verify the effect, we can compute the standard deviation of the i-th principal component of the processed representation by projecting the processed word vectors onto the basis u_i. The standard deviation of every post-processed leading principal component (i = 1, ..., d) is equal to σ_{d+1}. Thus, the variances of all d leading principal components are normalized by the PVN to the same level. This makes embedding vectors more evenly distributed across dimensions.
The only hyper-parameter to tune is the threshold parameter d. The optimal setting may vary with different word embedding baselines. A good rule of thumb is to choose d in proportion to the dimension D of the word embeddings. We can also determine the threshold d by examining the energy ratios of the principal components.
4 Post-Processing via Dynamic Embedding
4.1 Language Model
Our language model is a linear transformation that predicts the current word given its ordered context words. For a sequence of words w_1, w_2, ..., w_T, the corresponding word embeddings are denoted by v_{w_1}, v_{w_2}, ..., v_{w_T}. Two baseline models, SGNS and GloVe, are considered; in other words, v_{w_i} is the embedded word of w_i using one of these methods. Our objective is to maximize the conditional probability

p(w_i | w_{i-N}, ..., w_{i-1}, w_{i+1}, ..., w_{i+N}),   (2)

where N is the context window size. As compared to other language models that use tokens from the past only, we consider the two-sided context as shown in Eq. (2), since both sides are about equally important to the center word distribution in language modeling.
The linear language model can be written as

y_{w_i} = Σ_{j=-N, j≠0}^{N} c_j y_{w_{i+j}} + r_i,   (3)

where y_w is the word embedding representation after the latent variable transform to be discussed in Sec. 4.2, and c_j are the context weights. The term r_i represents the information loss that cannot be well modeled by the linear model. We treat r_i as a negligible term and Eq. (3) as a linear approximation of the original language model.
4.2 Dynamic Latent Variable Extraction
We apply the dynamic latent variable technique to the word embedding problem in this subsection. To begin with, we define an objective function to extract the dynamic latent variables. The word sequence data is denoted by w_1, w_2, ..., w_T, and the data matrix derived from it is formed from the word embedding representations of the data using the chosen context window size:

V = [v_{w_1}, v_{w_2}, ..., v_{w_T}]^T ∈ R^{T×D},

where D is the word embedding dimension. Then, the objective function used to extract the dynamic latent variables can be written as

max_{W, c} Σ_i ⟨ W^T v_{w_i}, p_i ⟩,  with  p_i = Σ_{j=-N, j≠0}^{N} c_j W^T v_{w_{i+j}},  s.t.  W^T W = I,   (8)

where W is a matrix of dimension D × m, p_i is a weighted sum of context word representations, and m is the selected dynamic dimension.
We interpret W in Eq. (8) as a matrix that stores the dynamic latent variables. If W contains all learned dynamic principal components of dimension m, then W^T v_{w_{i+j}} is the projection of each context word representation onto the dynamic principal components, and W^T v_{w_i} is the projection of the center word representation. The vector p_i is a weighted sum of context projections used for prediction. We seek the optimal W and c that maximize the sum, over all positions i, of the inner products between the predicted center word representation p_i and the projected i-th center word representation W^T v_{w_i}. The inner product is chosen over other distance measures in order to maximize the variance of the extracted dynamic latent variables. For further details, we refer to Dong and Qin.
The columns of W are orthonormal vectors, and this orthogonality constraint on matrix W plays an important role. A similar orthogonality constraint has been introduced for bilingual word embedding for several reasons. First, the original word embedding space is invariant and self-consistent under an orthogonal transform [14, 15]. Moreover, without such a constraint, the learned dynamic latent variables would have to be extracted iteratively, which is time consuming. Furthermore, without it, the extracted latent variables tend to be close to each other, with small angles between them.
We adopt the optimization procedure in Algorithm 2 to solve this optimization problem. Note that a tunable parameter is used to control the orthogonality-wise convergence rate.
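Since Algorithm 2 itself is not reproduced here, the following numpy sketch illustrates one plausible way to carry out the constrained maximization in Eq. (8): gradient ascent on W with QR re-orthonormalization after each step. The context weights c_j are fixed to uniform values for simplicity (in the paper they are learned jointly with W), and all names are illustrative:

```python
import numpy as np

def pde_subspace(V, m, window=2, steps=50, lr=1e-3, seed=0):
    """Sketch of extracting an m-dim dynamic subspace W by maximizing
    sum_i <W^T v_i, sum_j c_j W^T v_{i+j}> subject to W^T W = I."""
    rng = np.random.default_rng(seed)
    N, D = V.shape
    offs = [j for j in range(-window, window + 1) if j != 0]
    c = {j: 1.0 / (2 * window) for j in offs}      # fixed uniform context weights
    M = np.zeros((D, D))                           # M accumulates sum_i v_i ctx_i^T
    for i in range(window, N - window):
        ctx = sum(c[j] * V[i + j] for j in offs)   # weighted context sum
        M += np.outer(V[i], ctx)
    W = np.linalg.qr(rng.normal(size=(D, m)))[0]   # random orthonormal start
    for _ in range(steps):                         # objective is tr(W^T M W):
        W = np.linalg.qr(W + lr * (M + M.T) @ W)[0]  # ascent + re-orthonormalize
    return W

rng = np.random.default_rng(2)
V = rng.normal(size=(500, 20))                     # toy "embedded word sequence"
W = pde_subspace(V, m=4)
```

With c fixed, the optimum coincides with the top-m eigenvectors of the symmetrized matrix (M + M^T)/2; the iterative form above is shown only to mirror the alternation between an ascent step and orthonormalization.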
In theory, we maximize the inner product over all tokens, as shown in Eq. (8); in practice, we adopt negative sampling for the parameter update to save computation. The objective function is rewritten so that each positive training sample is contrasted with k negative samples, where k is the number of negative samples used per positive training sample and the negative samples are drawn according to the overall word frequency distribution.
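The negative-sampling step described above can be sketched as follows; drawing negatives from the overall frequency distribution is what the text specifies, while the function name and toy counts are our own:

```python
import numpy as np

def sample_negatives(freqs, k, rng):
    """Draw k negative word indices from the unigram frequency
    distribution (no 3/4-power smoothing, matching the text)."""
    p = np.asarray(freqs, dtype=float)
    p /= p.sum()                                  # normalize counts to probabilities
    return rng.choice(len(p), size=k, p=p)        # frequency-proportional sampling

# toy vocabulary of 4 words with raw counts; draw k=5 negatives
rng = np.random.default_rng(3)
neg = sample_negatives([100, 10, 5, 1], k=5, rng=rng)
```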
The final word embedding vector is a concatenation of two parts: 1) the dynamic dimensions, obtained by projecting a word vector v onto the learned dynamic subspace in the form W^T v, and 2) the static dimensions, obtained from static PCA dimension reduction of the same vector.
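A small sketch of assembling the final representation under the stated split, i.e., static PCA dimensions concatenated with dynamic projections; the shapes, the stand-in subspace, and the function name are illustrative:

```python
import numpy as np

def final_embedding(V, W, d_static):
    """Concatenate a static part (top PCA dimensions of the mean-removed
    vectors) with a dynamic part (projection onto the subspace W)."""
    Xc = V - V.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    static = Xc @ Vt[:d_static].T      # static PCA dimension reduction
    dynamic = V @ W                    # projection onto the dynamic subspace
    return np.hstack([static, dynamic])

rng = np.random.default_rng(4)
V = rng.normal(size=(100, 50))                     # toy embeddings
W = np.linalg.qr(rng.normal(size=(50, 10)))[0]     # stand-in dynamic subspace
E = final_embedding(V, W, d_static=40)             # 40 static + 10 dynamic dims
```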
5 Experiments
5.1 Baselines and Hyper-parameter Settings
We conduct the PVN on top of several baseline models. For all SGNS-related experiments, the wiki2010 corpus (http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2, around 6 GB) is used for training. The vocabulary set contains 830k words; only words that occur more than 10 times are included. For CBOW, GloVe and Dict2vec, we adopt the officially released code and train on the same dataset as SGNS. We set d = 11 across all experiments.
For the PDE method, we obtain training pairs from the wiki2010 corpus to maintain consistency with the SGNS model. The vocabulary size is 800k. Words with low frequency are assigned the same word vector for simplicity.
5.2 Evaluation Benchmarks
We consider two popular intrinsic evaluation benchmarks: 1) word similarity and 2) word analogy. A detailed introduction to both can be found in Wang et al. Our proposed post-processing methods work well under both evaluation methods, as reported in Sec. 5.3 and Sec. 5.4.
5.2.1 Word Similarity
Word similarity evaluation is widely used to assess word embedding quality, focusing on the semantic meaning of words. We use cosine similarity to measure the distance between word vectors and Spearman's rank correlation coefficient (SRCC) to evaluate the agreement between our results and human scores. We conduct tests on 10 popular datasets (see Table 1) to avoid conclusions that hinge on any single dataset. For more information on each dataset, we refer to the website http://www.wordvectors.org/.
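As an illustration of this evaluation protocol, the sketch below scores a toy lexicon with cosine similarity and measures agreement with made-up "human" scores via Spearman's correlation, implemented directly for the no-ties case; all vectors and scores here are invented for illustration:

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks
    (assuming no ties, which suffices for this illustration)."""
    ra = np.argsort(np.argsort(a)).astype(float)   # rank of each entry
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical embeddings and word-similarity pairs with "human" scores
emb = {"cat": np.array([1.0, 0.2]), "dog": np.array([0.9, 0.3]),
       "car": np.array([-0.2, 1.0]), "truck": np.array([-0.1, 0.9])}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5),
         ("cat", "car", 1.5), ("dog", "truck", 2.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho = spearman_rho(np.array(model_scores), np.array(human_scores))
```

A perfect rank agreement would give rho = 1; in this toy example two pairs swap ranks, so the SRCC falls below 1.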
5.2.2 Word Analogy
Due to the limitations of performance comparison in terms of word similarity, performance comparison in word analogy is adopted as a complementary tool to evaluate the quality of word embedding methods. Both addition and multiplication operations are implemented here to predict the target word. For the PDE method, we report only the commonly used addition operation for simplicity.
We choose two major datasets for word analogy evaluation: 1) the Google dataset and 2) the MSR dataset. The Google dataset contains 19,544 questions belonging to two major categories, semantic and morpho-syntactic, with 8,869 and 10,675 questions, respectively. We also report results on these two Google subsets. The MSR dataset contains 8,000 analogy questions. Out-of-vocabulary words (those that appear fewer than 10 times in the wiki2010 dataset) were removed from both datasets.
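For concreteness, here is a sketch of the addition-based (3CosAdd) analogy prediction used in these benchmarks, run on a toy vocabulary of our own; the multiplication variant differs only in how the three cosine terms are combined:

```python
import numpy as np

def analogy_add(emb, a, b, c):
    """3CosAdd: answer 'a is to b as c is to ?' by maximizing
    cos(x, b - a + c) over the vocabulary, excluding the query words."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):                 # standard exclusion of query words
            continue
        sim = float(v @ target / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# toy vectors where the gender offset is (approximately) shared
emb = {"king": np.array([1.0, 1.0]), "queen": np.array([1.0, -1.0]),
       "man": np.array([0.2, 1.0]), "woman": np.array([0.2, -1.0]),
       "apple": np.array([-1.0, 0.1])}
ans = analogy_add(emb, "man", "woman", "king")
```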
5.2.3 Extrinsic Evaluation
For the PVN method, we conduct further experiments on extrinsic evaluation tasks, including sentiment analysis and neural machine translation (NMT). For both tasks, a bidirectional LSTM is used as the inference tool. Two sentiment analysis datasets are used: the Internet Movie Database (IMDb) and the Sentiment Treebank dataset (SST). The Europarl v8 dataset for English-French translation is used in our neural machine translation task. We report accuracy for the IMDb and SST datasets and validation accuracy for NMT.
5.3 Performance Evaluation of PVN
Table 2 compares the SRCC scores of SGNS alone, SGNS+PPA and SGNS+PVN on the word similarity datasets. We see that SGNS+PVN performs better than SGNS+PPA, with the largest gain of SGNS+PVN reaching 5.2% in the average SRCC score. It is also interesting to point out that the PVN is more robust than the PPA under different settings of the threshold d.
Table 3 compares the SRCC scores of SGNS, SGNS+PPA and SGNS+PVN on the word analogy datasets, using both the addition and the multiplication evaluation methods. PVN performs better than PPA in both. Under the multiplication evaluation, PPA performs worse than the baseline; in contrast, the proposed PVN method has no such negative effect and performs consistently well. This can be explained as follows: under the multiplication evaluation, the dimensions removed by the PPA strongly influence the relative angles between vectors. This further confirms that some linguistic properties are captured by these high-variance dimensions, so that their total elimination is sub-optimal.
Table 4 shows the extrinsic evaluation results. Our PVN post-processing method performs much better than the original embeddings across various downstream tasks.
5.4 Performance Evaluation of PDE
We adopt the same settings (window size, vocabulary size, number of negative samples, training data, etc.) in evaluating the PDE method when it is applied to the SGNS baseline. The final word representation is the concatenation of two parts: a static part obtained through PCA dimension reduction and a dynamic part given by the projection onto the learned dynamic subspace W. Here, we set the static and dynamic dimensions to 240 and 60, respectively. The SRCC performance comparison of SGNS alone and SGNS+PDE on the word similarity and analogy datasets is shown in Table 5. By adding the ordered information via PDE, we see that the quality of the word representations improves in both evaluation tasks.
5.5 Performance Evaluation of Integrated PVN/PDE
We can integrate PVN and PDE to improve upon their individual performance. Since the PVN provides better word embeddings, it helps the PDE learn better. Furthermore, normalizing the variances of dominant principal components is beneficial, since they otherwise occupy too much energy and mask the contributions of the remaining components. On the other hand, components with very low variance may be dominated by noise; they should be removed or replaced, and the PDE can be used to replace these noisy components.
The SRCC performance of the baseline SGNS method and the SGNS+PVN/PDE method on the word similarity and word analogy tasks is listed in Table 6. Better results are obtained across all datasets. The improvement on the Verb-143 dataset ranks high among all datasets with either joint PVN/PDE or PDE alone. This matches our expectation, since context order contributes more for verbs.
6 Conclusion and Future Work
Two post-processing techniques, PVN and PDE, were proposed in this work to improve the quality of baseline word embedding methods. The two techniques can work independently or jointly. Their effectiveness was demonstrated by both intrinsic and extrinsic evaluation tasks.
In the near future, we would like to study the PVN method further by exploiting the correlation between dimensions and applying it to dimensionality reduction of word representations. Furthermore, we would like to apply the dynamic embedding technique to both generic and domain-specific word embedding tasks with a limited amount of data. Its applicability to non-linear language models is also worth investigating.
-  Bonggun Shin, Timothy Lee, and Jinho D Choi, “Lexicon integrated cnn models with attention for sentiment analysis,” arXiv preprint arXiv:1610.06272, 2016.
-  Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan, Introduction to information retrieval, vol. 39, Cambridge University Press, 2008.
-  Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun, “Visualizing and understanding neural machine translation,” in NAACL, 2017, vol. 1, pp. 1150–1159.
-  Jiaqi Mu, Suma Bhat, and Pramod Viswanath, “All-but-the-top: Simple and effective postprocessing for word representations,” arXiv preprint arXiv:1702.01417, 2017.
-  John R Firth, “A synopsis of linguistic theory, 1930-1955,” Studies in linguistic analysis, 1957.
-  Urvashi Khandelwal, He He, and Peng Qi, “Sharp nearby, fuzzy far away: How neural language models use context,” arXiv preprint arXiv:1805.04623, 2018.
-  Yining Dong and S Joe Qin, “A novel dynamic pca algorithm for dynamic data modeling and process monitoring,” Journal of Process Control, vol. 67, pp. 1–11, 2018.
-  Milos Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović, “On the existence of obstinate results in vector space models,” in ACM SIGIR. ACM, 2010, pp. 186–193.
-  Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.
-  Julien Tissier, Christophe Gravier, and Amaury Habrard, “Dict2vec: Learning word embeddings using lexical dictionaries,” in EMNLP, 2017, pp. 254–263.
-  Junghui Chen and Kun-Chih Liu, “On-line batch process monitoring using dynamic pca and dynamic pls models,” Chemical Engineering Science, vol. 57, no. 1, pp. 63–75, 2002.
-  Mikel Artetxe, Gorka Labaka, and Eneko Agirre, “Learning principled bilingual mappings of word embeddings while preserving monolingual invariance,” in EMNLP, 2016, pp. 2289–2294.
-  Mikel Artetxe, Gorka Labaka, and Eneko Agirre, “Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations,” in AAAI, 2018.
-  Bin Wang, Angela Wang, Fenxiao Chen, Yunchen Wang, and C-C Jay Kuo, “Evaluating word embedding models: Methods and experimental results,” arXiv preprint arXiv:1901.09785, 2019.
-  Charles Spearman, “The proof and measurement of association between two things,” The American journal of psychology, vol. 15, no. 1, pp. 72–101, 1904.
-  Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer, “Problems with evaluation of word embeddings using word similarity tasks,” arXiv preprint arXiv:1605.02276, 2016.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, “Linguistic regularities in continuous space word representations,” in NAACL, 2013, pp. 746–751.