Improving Word Representations: A Sub-sampled Unigram Distribution for Negative Sampling

10/21/2019, by Wenxiang Jiao et al.

Word2Vec is the most popular model for word representation and has been widely investigated in the literature. However, its noise distribution for negative sampling is chosen by empirical trials, and its optimality has largely been ignored. We suggest that this distribution is a sub-optimal choice and propose a sub-sampled unigram distribution for better negative sampling. Our contributions include: (1) proposing the concept of semantics quantification and deriving a suitable sub-sampling rate for the proposed distribution, adaptive to different training corpora; (2) demonstrating the advantages of our approach in both negative sampling and noise contrastive estimation through extensive evaluation tasks; and (3) proposing a semantics weighted model for the MSR sentence completion task, resulting in considerable improvements. Our work not only improves the quality of word vectors but also deepens the current understanding of Word2Vec.

1 Introduction

The recent decade has witnessed the great success of word representation in natural language processing (NLP). It has proved to be an integral part of most other NLP tasks, in which words have to be vectorized before being input to the models. High-quality word vectors have boosted the performance of many tasks, such as named entity recognition (Pennington et al., 2014; Sienčnik, 2015), sentence completion (Yogatama et al., 2014; Liu et al., 2015), part-of-speech tagging (Ling et al., 2015a, b), sentiment analysis (Tsvetkov et al., 2015; Yu et al., 2017), and machine translation (Sutskever et al., 2014; Johnson et al., 2016). Conventionally, word vectors are obtained from word-context co-occurrence matrices, either by concatenating the row and column vectors (Lund and Burgess, 1996) or by applying singular value decomposition (SVD) (Deerwester et al., 1990). However, these approaches are limited by the sub-optimal linear structure of the resulting vector space and by the sharply increased memory requirements for huge vocabularies. Both problems have been addressed by a popular model called Word2Vec (Mikolov et al., 2013b), which utilizes two shallow neural networks, i.e., skip-gram and continuous bag-of-words, to learn word vectors from large corpora. The model is also capable of capturing interesting linear relationships between word vectors.

While Word2Vec marked a breakthrough in word representation, it has not been fully understood and further theoretical investigation is still needed. One aspect that has largely been ignored is the choice of noise distribution for negative sampling. Word2Vec employs a smoothed unigram distribution with a power rate of 3/4 as the noise distribution. This choice was made by empirical trials but has been widely adopted in subsequent work (Levy et al., 2015; Ling et al., 2015a; Yang et al., 2017; Bamler and Mandt, 2017). However, the quality of learned word vectors is sensitive to the choice of noise distribution (Gutmann and Hyvärinen, 2010; Levy et al., 2015) when using a moderate number (5 to 15) of negative samples, which is a common strategy for trading off vector quality against computation cost.

In this paper, we propose to employ a sub-sampled unigram distribution for negative sampling and demonstrate its capability of improving the linear relationships between word vectors. Our contributions include three aspects: (1) We propose the concept of semantics quantification and derive a suitable sub-sampling rate for the proposed distribution. (2) We demonstrate the advantages of our noise distribution in both negative sampling and noise contrastive estimation by extensive experiments. (3) We propose a semantics weighted model for the MSR sentence completion task, resulting in considerable improvements.

2 Word2Vec

2.1 Architectures

First, we briefly introduce the two architectures in Word2Vec (Mikolov et al., 2013b), i.e., skip-gram (SG) and continuous bag-of-words (CBOW). For a corpus with a word sequence $w_1, w_2, \ldots, w_T$, skip-gram predicts the context words given the center word $w_t$ and maximizes the average log probability

$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c\le j\le c,\, j\ne 0}\log p(w_{t+j}\mid w_t)$,   (1)

where $c$ is the size of the context window and $p(w_O\mid w_I)$ is defined by the full softmax function

$p(w_O\mid w_I)=\frac{\exp\!\left(\mathbf{v}'^{\top}_{w_O}\mathbf{v}_{w_I}\right)}{\sum_{w=1}^{V}\exp\!\left(\mathbf{v}'^{\top}_{w}\mathbf{v}_{w_I}\right)}$,   (2)

where $\mathbf{v}_{w_I}$ and $\mathbf{v}'_{w_O}$ are the vectors of the "input" and "output" words, and $V$ is the size of the vocabulary.

As for CBOW, it predicts the center word based on the context words. The input vector is usually the average of the context words' vectors, i.e., $\bar{\mathbf{v}} = \frac{1}{2c}\sum_{-c\le j\le c,\, j\ne 0}\mathbf{v}_{w_{t+j}}$.

Figure 1: Illustration of the skip-gram and continuous bag-of-words (CBOW) architectures.

2.2 Negative Sampling

For large vocabularies, it is inefficient to compute the full softmax function in Eq. (2). To tackle this problem, Word2Vec utilizes negative sampling, which distinguishes the real output word from noise words by maximizing

$\log\sigma\!\left(\mathbf{v}'^{\top}_{w_O}\mathbf{v}_{w_I}\right)+\sum_{i=1}^{k}\mathbb{E}_{w_i\sim P_n(w)}\!\left[\log\sigma\!\left(-\mathbf{v}'^{\top}_{w_i}\mathbf{v}_{w_I}\right)\right]$,   (3)

where $\sigma(x)=1/(1+e^{-x})$, $k$ is the number of negative samples, and $P_n(w)$ is the so-called noise distribution, representing the probability of a word being sampled as a noise word. The smoothed unigram distribution used in Word2Vec is expressed as

$P_n(w)=\frac{f(w)^{3/4}}{\sum_{w'=1}^{V} f(w')^{3/4}}$,   (4)

where $f(w)$ is the frequency of word $w$.
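To make the sampling procedure concrete, here is a minimal NumPy sketch of drawing negative samples from the smoothed unigram distribution in Eq. (4); the function names and the toy counts are ours, and the original word2vec tool instead precomputes a large unigram table for speed.

```python
import numpy as np

def smoothed_unigram(counts, power=0.75):
    """Noise distribution of Eq. (4): unigram counts raised to the 3/4 power, renormalized."""
    probs = np.asarray(counts, dtype=np.float64) ** power
    return probs / probs.sum()

def draw_negatives(noise_probs, k, rng):
    """Draw k noise-word indices for one (input, output) training pair."""
    return rng.choice(len(noise_probs), size=k, p=noise_probs)

counts = [1000, 400, 150, 60, 10]            # toy vocabulary counts, word indices 0..4
P_n = smoothed_unigram(counts)
print(draw_negatives(P_n, k=10, rng=np.random.default_rng(0)))
```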

2.3 Sub-sampling

Sub-sampling is a process in Word2Vec that randomly deletes the most frequent words during training, since they are usually stop words carrying less information than infrequent ones. During sub-sampling, the probability that a word $w$ is kept is defined as

$p(w)=\min\!\left(1,\ \sqrt{\frac{t}{\tilde{f}(w)}}\right)$,   (5)

where $\tilde{f}(w)$ is the normalized frequency of $w$, and $t$ is called the sub-sampling rate, typically between $10^{-5}$ and $10^{-3}$. Because the keep probability is truncated at 1, the process does not delete infrequent words.
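As a small illustration of Eq. (5), the sketch below computes keep probabilities for a toy vocabulary; the helper name, the Zipf-like toy counts, and the rate value are ours.

```python
import numpy as np

def keep_probability(counts, t=1e-4):
    """Keep probability of Eq. (5), truncated at 1 so infrequent words are never deleted."""
    freq = np.asarray(counts, dtype=np.float64)
    norm_freq = freq / freq.sum()                 # normalized word frequencies
    return np.minimum(1.0, np.sqrt(t / norm_freq))

counts = 1e6 / np.arange(1, 50001)                # Zipf-like counts for a 50k-word vocabulary
p_keep = keep_probability(counts)
print(p_keep[:3], p_keep[-3:])                    # frequent words get p < 1, rare words keep p = 1
```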

3 Related Work

Unigram. In noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010), the noise distribution is recommended to be close to the distribution of the real data. This guidance found its earliest application in training language models by Mnih and Teh (2012), who demonstrated that the unigram distribution works better than a uniform distribution. The choice has also been adopted in other work (Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013; Xiao and Guo, 2013; Baltescu and Blunsom, 2014). However, the performance of such models is limited due to the inadequate training of infrequent words (Chen et al., 2015; Labeau and Allauzen, 2017).

Smoothed Unigram. The smoothed unigram distribution in Word2Vec (Mikolov et al., 2013b) solves this problem because it gives more chances for infrequent words to be sampled. However, the required power rate is decided empirically, and may need adjustment for different scenarios (Bojanowski et al., 2016; Ai et al., 2016). Labeau and Allauzen (2017) even propose to use a bigram distribution after studying the power rate, but it is infeasible for large corpora. Besides, the smoothed unigram distribution also changes the lexical structure of infrequent words, which could be a reason for the limited quality of word vectors.

4 Sub-sampled Unigram Distribution

We believe a sub-sampled unigram distribution is better suited for negative sampling, since it reduces the proportion of frequent words while maintaining the lexical structure of infrequent words. To the best of our knowledge, we are the first to employ such a noise distribution for negative sampling. Beyond this, we propose an approach to derive a sub-sampling rate that is adaptive to different corpora (Table 1).

4.1 Critical Word

We start our analysis by recalling the keep probability of a word during sub-sampling, given in Eq. (5). Obviously, we need to choose the sub-sampling rate in order to determine the noise distribution. Although empirically selecting a sub-sampling rate can already bring improvements (Table 3), we aim to derive a sub-sampling rate that adapts to different corpora. To accomplish this, we first introduce the concept of the critical word, denoted by $w_c$, which is the word whose keep probability in Eq. (5) is exactly 1. By definition, words with frequencies lower than that of the critical word are not deleted during sub-sampling. The critical word is uniquely determined by the sub-sampling rate; thus, if we first select a critical word with certain properties, we can obtain a suitable sub-sampling rate in return.

The basic rule for us to select the critical word is to find a word with balanced semantic and syntactic information. We prefer not to delete words with relatively more semantic information. Now, the problem is how to measure these two kinds of information a word possesses.

Figure 2: Illustration of a unigram distribution, the fitting line, and the sub-sampled version.

4.2 Semantics Quantification

In order to quantify the semantic and syntactic information of words, we consider two observations: (1) frequent words are more likely to be function words carrying more syntactic information; (2) infrequent words are more likely to be content words carrying more semantic information (Hochmann et al., 2010). Thus, for the $r$-th most frequent word $w_r$ with frequency $f_r$, the quantities of its semantic and syntactic information, $I_{sem}(w_r)$ and $I_{syn}(w_r)$, can be described as

$I_{sem}(w_r) = g(r), \qquad I_{syn}(w_r) = h(f_r)$,   (6)

where $g(\cdot)$ and $h(\cdot)$ are monotonically increasing functions of the ranking $r$ and the frequency $f_r$, respectively. One can tell that these functions capture the two observations above.

On the other hand, we require that the total quantity of semantic and syntactic information, denoted by $I$, is fixed for all words, i.e.,

$g(r) + h(f_r) = I$,   (7)

where $I$ is a constant. We rewrite Eq. (7) in an exponential form,

$e^{g(r)} \cdot e^{h(f_r)} = e^{I}$.   (8)

This expression leads us to a well-known power law called Zipf's law (Zipf, 1950), which approximates the relationship between $f_r$ and $r$ as

$f_r = \frac{C}{r^{\alpha}}$,   (9)

where $C$ and $\alpha$ are constants and $\alpha \approx 1$. Consequently, we can decide the form of the functions $g(\cdot)$ and $h(\cdot)$ as

$g(r) = \alpha \log r, \qquad h(f_r) = \log f_r$.   (10)

Obviously, these functions satisfy the definitions made before. As a result, the total information becomes $I = \log C$ given Eq. (9).

4.3 Expression of Sub-sampling Rate

Now, given the quantified information, we are able to decide the critical word $w_c$ by requiring its semantic and syntactic information to be balanced,

$g(r_c) = h(f_{r_c})$.   (11)

Combined with Eq. (9), we obtain the frequency of the critical word,

$f_{r_c} = \sqrt{C}$,   (12)

where $r_c$ is the ranking of the critical word. Meanwhile, we know from Eq. (5) that the probability of the critical word being kept should be exactly 1. Thus, with Eq. (5) and Eq. (12), the sub-sampling rate for our noise distribution is expressed as

$t' = \frac{f_{r_c}}{\sum_{r=1}^{V} f_r} = \frac{\sqrt{C}}{\sum_{r=1}^{V} f_r}$.   (13)

Note that we use $t'$ to distinguish it from the sub-sampling rate $t$ applied to the training corpus.

4.4 Constants Estimation

As for the estimation of the constants $C$ and $\alpha$, we provide two choices:
(1) wLSE-1. We use weighted least squares estimation (wLSE) on the log-log relationship $y_r = \log C - \alpha x_r$, where $x_r = \log r$ and $y_r = \log f_r$. Since the head of the distribution accounts for most of the data, wLSE with larger weights for more frequent words makes sure that the trend of the line is well fit. The estimated constants are

$\hat{\alpha} = -\frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^{2}} - \bar{x}^{2}}$,   (14)
$\log\hat{C} = \bar{y} + \hat{\alpha}\,\bar{x}$,   (15)

where $\bar{z}$ denotes the weighted average of a quantity $z$ such that $\bar{z} = \sum_{r} u_r z_r / \sum_{r} u_r$ for weights $u_r$.
(2) wLSE-2. We use wLSE with the condition that the fitting line passes through the point $(0, \log f_1)$ of the most frequent word. This method engages the most frequent word to further control the trend of the line. As a result, $\hat{C} = f_1$ and

$\hat{\alpha} = \frac{\bar{x}\,\log f_1 - \overline{xy}}{\overline{x^{2}}}$.   (16)

Now, we can write down the expression of the sub-sampled unigram distribution,

$P'_n(w) = \frac{f(w)\, p'(w)}{\sum_{w'} f(w')\, p'(w')}$,   (17)

where $p'(w)$ satisfies

$p'(w) = \min\!\left(1,\ \sqrt{\frac{t'}{\tilde{f}(w)}}\right)$.   (18)

Note that we use $P'_n(w)$ to distinguish it from the original noise distribution $P_n(w)$ in Word2Vec.
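To make the construction concrete, the NumPy sketch below fits Zipf's law in log-log space, derives a rate following Eq. (13) as reconstructed above, and builds the sub-sampled unigram noise distribution of Eqs. (17)-(18). The helper names, the simple frequency-based fitting weights, and the toy counts are our own assumptions rather than the authors' exact wLSE implementation.

```python
import numpy as np

def fit_zipf(counts):
    """Weighted least-squares fit of log f_r = log C - alpha * log r in log-log space.
    The frequency-based weights are an illustrative assumption (wLSE-1 style)."""
    f = np.sort(np.asarray(counts, dtype=np.float64))[::-1]
    r = np.arange(1, len(f) + 1)
    # np.polyfit's w multiplies the residuals, so pass the square root of the desired weights.
    slope, intercept = np.polyfit(np.log(r), np.log(f), deg=1, w=np.sqrt(f))
    return np.exp(intercept), -slope                      # C, alpha

def derived_rate(counts):
    """Sub-sampling rate t' = sqrt(C) / sum_r f_r, following Eq. (13) as reconstructed."""
    C, _alpha = fit_zipf(counts)
    return np.sqrt(C) / float(np.sum(counts))

def subsampled_unigram(counts, t_prime):
    """Noise distribution of Eqs. (17)-(18): unigram counts scaled by keep probabilities."""
    f = np.asarray(counts, dtype=np.float64)
    keep = np.minimum(1.0, np.sqrt(t_prime / (f / f.sum())))
    noise = f * keep
    return noise / noise.sum()

counts = 2e6 / np.arange(1, 100001) ** 1.05               # Zipf-like toy counts
t_prime = derived_rate(counts)
P_sub = subsampled_unigram(counts, t_prime)
```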

4.5 Discussions

In semantics quantification, the modeling of the word distribution is not limited to Zipf's law. We adopt it because of its popularity and conciseness. There are other choices (Mandelbrot, 1953; Piantadosi, 2014), and the expression of $t'$ would need to be modified accordingly. Besides, one can either use the chosen law to decide the critical word or simply search through the unigram distribution to find it.

5 Experiments

To show the advantages of our noise distribution, we conduct experiments on three evaluation tasks. While the word analogy task (Mikolov et al., 2013b) is our focus for testing the linear relationships between word vectors, we also evaluate the learned word vectors on the word similarity task (Pennington et al., 2014) and the synonym selection task (Liu et al., 2015).

In the following, we first describe the experimental setup, including baselines, training corpora, and training details. Next, we report experimental results for the three NLP tasks. Finally, we introduce the semantics weighted model proposed for the MSR sentence completion task (Zweig et al., 2012).

5.1 Experimental Setup

5.1.1 Baselines

We train the two models, SG and CBOW, using the original noise distribution and two others obtained by our approach, specifically:
(1) Uni. The smoothed unigram distribution proposed by Mikolov et al. (2013b).
(2) Sub-1. The sub-sampled unigram distribution whose sub-sampling rate is estimated by wLSE-1.
(3) Sub-2. The sub-sampled unigram distribution whose sub-sampling rate is estimated by wLSE-2.

5.1.2 Training Corpora

Our training corpora come from four sources, described below:
(1) BWLM. The "One Billion Word Language Modeling Benchmark" (http://www.statmt.org/lm-benchmark), which is already pre-processed and has almost 1 billion tokens.
(2) Wiki10. The April 2010 snapshot of the Wikipedia corpus (http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html), with a total of about 2 million articles and 1 billion tokens.
(3) UMBC. The UMBC WebBase corpus (http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words) from the Stanford WebBase project's February 2007 Web crawl, with over 3 billion tokens.
(4) MSR. The MSR corpus containing 5 Conan Doyle Sherlock Holmes novels (https://www.microsoft.com/en-us/research/project/msr-sentence-completion-challenge), with about 50 million tokens.

The first three large corpora are used for the word similarity, synonym selection, and word analogy tasks. The MSR corpus is designated for the MSR sentence completion task. We pre-process the corpora by converting all words to lowercase and removing all non-alphanumeric characters. The number of remaining tokens for each corpus is listed in the column Size of Table 1. Vocabularies are built by discarding words whose occurrences are less than the threshold shown in the column Mcn, and the column Vocab gives the sizes of the resulting vocabularies. The rightmost two columns are the sub-sampling rates for our noise distribution obtained by the wLSE-1 and wLSE-2 estimations, respectively; the listed values are scaled by a constant factor for readability.

Corpus Size Mcn Vocab t' (wLSE-1) t' (wLSE-2)
BWLM 0.7 B 20 195 k 3.10 3.17
Wiki10 1.0 B 50 249 k 2.76 2.80
UMBC 3.0 B 50 267 k 1.33 1.39
MSR 47 M 5 77 k 13.2 13.1
Table 1: Information of the training corpora.
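As an illustration of the pre-processing and vocabulary construction described above, here is a short sketch; the function names, the file name, and the threshold value are ours.

```python
import re
from collections import Counter

def preprocess(line):
    """Lowercase a line and keep only alphanumeric tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", line.lower()).split()

def build_vocab(lines, min_count=50):
    """Count tokens and discard words occurring fewer than min_count times (column Mcn)."""
    counts = Counter(tok for line in lines for tok in preprocess(line))
    return {w: c for w, c in counts.items() if c >= min_count}

with open("corpus.txt", encoding="utf8") as f:   # hypothetical raw corpus file
    vocab = build_vocab(f)
```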

5.1.3 Training details

We implement the training of word vectors with the word2vec tool (https://code.google.com/archive/p/word2vec), in which the noise-distribution component is modified to support several choices. For SG and CBOW, we set the vector dimensionality to 100 and the size of the context window to 5. We use 10 negative samples for each training sample. The models are trained with stochastic gradient descent (SGD) using a linearly decaying learning rate, initialized to 0.025 for SG and 0.05 for CBOW. We train the models on the three large corpora for 2 epochs; for the MSR corpus the number of epochs may vary. Results in this paper are reported as percentages, and each is the average of 4 repeated runs unless otherwise stated.
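The patched noise-sampling code of the C tool is not reproduced here. Purely as a reference point, a comparable baseline (Uni) configuration can be expressed with gensim's Word2Vec as sketched below; gensim (assumed version 4.0 or later) exposes the smoothing exponent via ns_exponent but does not support a sub-sampled unigram noise distribution out of the box, and the file names are hypothetical.

```python
from gensim.models import Word2Vec

# Baseline skip-gram configuration with the hyperparameters reported above;
# ns_exponent=0.75 reproduces the smoothed unigram (Uni) noise distribution.
sg_model = Word2Vec(
    corpus_file="corpus.txt",   # hypothetical pre-processed corpus, one sentence per line
    sg=1,                       # 1 = skip-gram, 0 = CBOW
    vector_size=100,
    window=5,
    negative=10,
    ns_exponent=0.75,
    sample=1e-4,                # corpus sub-sampling rate t (illustrative value)
    alpha=0.025,                # initial learning rate (0.05 for CBOW in the paper)
    min_count=50,
    epochs=2,
    workers=4,
)
sg_model.wv.save_word2vec_format("vectors.txt")
```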

5.2 Task 1: Word Similarity Task

5.2.1 Task Description

The task computes the correlation between word similarity scores given by human judgment and the similarities of the corresponding word vectors. We use the Pearson correlation coefficient $\rho_{X,Y}$ as the metric; the higher it is, the better the word vectors are. It is expressed as

$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$,   (19)

where $X$ and $Y$ are random variables for the word similarity scores by human judgment and the cosine similarities between word vectors, respectively. Benchmark datasets for this task include RG (Rubenstein and Goodenough, 1965), MC (Miller and Charles, 1991), WS (Finkelstein et al., 2001), MEN (Bruni et al., 2012), and RW (Luong et al., 2013).
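A minimal sketch of this evaluation, assuming word vectors stored in a dict and dataset rows of (word1, word2, human score); the names and toy data are ours, and out-of-vocabulary pairs are simply skipped.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity(vectors, dataset):
    """Pearson correlation of Eq. (19) between human scores and cosine similarities."""
    human, model = [], []
    for w1, w2, score in dataset:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return float(np.corrcoef(human, model)[0, 1])

vectors = {"car": np.array([0.9, 0.1]), "automobile": np.array([0.8, 0.2]),
           "fruit": np.array([0.1, 0.9]), "banana": np.array([0.2, 0.8])}
pairs = [("car", "automobile", 9.2), ("car", "fruit", 1.5), ("banana", "fruit", 8.0)]
print(word_similarity(vectors, pairs))
```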

5.2.2 Results

We implement the task on the five datasets mentioned above and show the results in the Word Similarity columns of Table 2. At first glance, our noise distributions Sub-1 and Sub-2 perform slightly better than Uni. Significant improvements are achieved on the two small datasets RG and MC, because they are more sensitive to vector quality. Another observation is that CBOW benefits more from Sub-1 and Sub-2 than SG does, as can be seen by comparing the results on RG and MC with the Wiki10 corpus. These results show that our noise distributions are at least as capable as the smoothed unigram distribution of learning good word vectors.

Size Model Noise Word Similarity Synonym Selection Word Analogy
RG MC WS MEN RW LEX TOEFL Tot Sem Syn Tot
0.7 B sg Uni 62.1 64.4 62.5 66.8 43.3 66.1 74.6 67.9 61.4 57.4 59.2
Sub-1 62.9 66.0 63.1 67.1 43.3 67.9 73.6 69.1 62.8 56.8 59.5
Sub-2 63.0 66.8 62.8 67.1 43.2 68.3 73.6 69.4 63.5 56.9 59.9
cbow Uni 64.3 66.4 60.4 66.1 44.0 66.4 79.6 69.2 53.4 58.7 56.4
Sub-1 64.6 67.1 61.0 66.7 44.1 67.4 78.2 69.7 57.4 59.8 58.7
Sub-2 65.7 67.9 60.7 66.7 43.7 68.3 79.2 70.7 58.0 60.4 59.3
1.0 B sg Uni 77.2 81.4 68.2 70.1 43.3 65.9 82.8 69.6 61.6 58.8 60.1
Sub-1 77.2 81.9 68.7 70.5 43.6 65.6 86.3 70.2 64.2 58.6 61.1
Sub-2 77.3 81.5 68.4 70.4 43.5 64.7 84.4 69.1 63.9 58.7 61.1
cbow Uni 76.2 76.9 68.7 70.6 44.1 68.8 82.8 71.9 65.0 61.3 63.0
Sub-1 77.4 80.3 69.3 71.0 44.6 67.7 84.0 71.3 67.4 62.2 64.6
Sub-2 76.8 80.0 69.2 71.2 44.3 69.5 80.8 72.0 68.4 62.7 65.3
3.0 B sg Uni 69.7 77.6 67.6 69.5 46.7 72.6 84.6 75.1 46.4 63.2 55.7
Sub-1 69.9 78.7 67.8 70.2 47.3 72.9 84.6 75.3 51.4 64.1 58.4
Sub-2 70.5 79.2 67.8 70.2 47.1 72.4 85.9 75.2 51.2 63.8 58.2
cbow Uni 72.7 77.5 67.3 71.0 48.2 76.5 87.8 78.8 44.9 64.6 55.9
Sub-1 74.2 78.2 67.8 71.5 48.5 77.2 87.8 79.4 50.6 66.6 59.5
Sub-2 74.4 78.7 68.0 71.5 48.4 78.0 87.1 79.9 50.7 66.8 59.6
Table 2: Results of the evaluation tasks on the learned word vectors, i.e., word similarity, synonym selection, and word analogy. The same corpus sub-sampling rate $t$ is applied in all settings.

5.3 Task 2: Synonym Selection Task

5.3.1 Task Description

This task attempts to select, from the candidate answers, the word semantically closest to the stem word. For example, given the stem word "costly" and the candidate answers "expensive, beautiful, popular, complicated", the most similar word should be "expensive". For each candidate answer, we compute the cosine similarity between its word vector and that of the stem word; the candidate answer with the highest score is our final answer for a question. Here we use the TOEFL dataset (Landauer and Dumais, 1997) with 80 synonym questions, and the LEX dataset with 303 questions collected by ourselves from the two ebooks "501 Synonym and Antonym Questions" and "1001 Vocabulary & Spelling Questions" provided on the eLearning platform LearningExpress (https://www.learningexpresshub.com).
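A sketch of the selection rule described above; the function name and the toy vectors are ours.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_synonym(vectors, stem, candidates):
    """Return the candidate whose vector is closest (by cosine similarity) to the stem word."""
    return max(candidates, key=lambda c: cosine(vectors[stem], vectors[c]))

vectors = {
    "costly":      np.array([0.90, 0.20, 0.10]),
    "expensive":   np.array([0.85, 0.25, 0.10]),
    "beautiful":   np.array([0.10, 0.90, 0.20]),
    "popular":     np.array([0.20, 0.30, 0.90]),
    "complicated": np.array([0.30, 0.10, 0.80]),
}
print(select_synonym(vectors, "costly", ["expensive", "beautiful", "popular", "complicated"]))
# -> "expensive"
```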

5.3.2 Results

We report the results of this task in the Synonym Selection columns of Table 2. For all the noise distributions, the results are not stable on the TOEFL dataset since it is quite small. Still, Sub-1 and Sub-2 perform comparably with Uni; in particular, Sub-1 makes considerable improvements with the Wiki10 corpus. As for the LEX dataset, Sub-1 and Sub-2 outperform Uni in both the SG and CBOW models with the BWLM corpus. With the other two corpora, Sub-2 performs better than Sub-1 and Uni using the CBOW model. But again, the SG model appears to be boosted less by Sub-1 and Sub-2 in terms of the corresponding results. Considering the unbalanced numbers of questions in these two datasets, we also provide the total results on TOEFL+LEX and conclude that our noise distributions are better than Uni.

5.4 Task 3: Word Analogy Task

Figure 3: Word analogy results for (a) CBOW and (b) SG with varying numbers of negative samples, and (c) the optimality of the derived sub-sampling rate. Smoothed and wLSE-2 denote the smoothed unigram distribution (Uni) and our sub-sampled unigram distribution with the rate estimated by wLSE-2 (Sub-2); the horizontal axis of (c) varies the sub-sampling rate around the wLSE-2 estimate.

5.4.1 Task Description

The task comes from the idea that arithmetic operations in a word vector space can be predicted: given three words $a$, $b$, and $c$, the goal is to find a word $d$ such that the relation between $c$ and $d$ is the same as the relation between $a$ and $b$. Semantic questions are of the form "Athens:Greece as Berlin:Germany", and syntactic ones are like "dance:dancing as fly:flying". Here we choose the fourth word by maximizing the cosine similarity, i.e., $d = \arg\max_{w}\cos\!\left(\mathbf{v}_w,\ \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c\right)$ (Mikolov et al., 2013c). We test the learned word vectors on the Google analogy dataset (Mikolov et al., 2013b), which contains 8,869 semantic questions and 10,675 syntactic ones.
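A sketch of this selection rule (often called 3CosAdd); the function name is ours, and excluding the three query words follows common practice.

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return d maximizing cos(v_d, v_b - v_a + v_c), excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(target, vec) / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. solve_analogy(vectors, "athens", "greece", "berlin") is expected to return "germany"
```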

5.4.2 Results

This task is our primary focus because it exposes interesting linear relationships between word vectors. Thus we conduct four sub-experiments to investigate four aspects of our noise distributions.

Model Responses. The two models, SG and CBOW, respond differently to our noise distributions, as shown in Table 2. When applying the CBOW model to the three corpora, our noise distributions Sub-1 and Sub-2 bring significant improvements compared with Uni, especially on semantic questions. Specifically, the accuracy on semantic questions is improved by 2 to 6 points, and on syntactic questions by 1.5 to 2 points. As for the SG model, the improvements on semantic questions by Sub-1 and Sub-2 are still considerable (2 to 5 points). On syntactic questions, however, Uni becomes competitive with Sub-1 and Sub-2 and is slightly better on the BWLM and Wiki10 corpora. The reason may be that the SG model is better at capturing semantic relationships between words than the CBOW model. Still, it is safe to say that our noise distributions are better for SG in terms of total accuracy.

Number of Negative Samples. Increasing the number of negative samples does not necessarily reduce the advantages of our noise distributions. We report the results of the task with various numbers of negative samples in Fig. 3 (a) for CBOW and Fig. 3 (b) for SG. Note that we only train the models on Wiki10 and compare Sub-2 with Uni. For CBOW, Sub-2 outperforms Uni consistently with significant margins on both semantic and syntactic questions. For SG, though the two distributions are competitive with each other on syntactic questions, Sub-2 always performs better than Uni on semantic ones.

Optimality. Since our approach is built on assumptions and new concepts, we wonder whether the resulting sub-sampling rate is optimal. We select several rates around the wLSE-2 estimate $t'_2$ and show the word analogy results in Fig. 3 (c). For CBOW, $t'_2$ approaches the optimal point in terms of the accuracy on semantic questions and on the total dataset. For SG, the optimal point lies close to $t'_2$, with negligible advantages relative to Sub-2. Notice that one of the tested rates corresponds to a commonly used corpus sub-sampling rate and shows much worse performance than Sub-2. This indicates that simply reusing a commonly used sub-sampling rate is inappropriate, and that our approach is better.

Scalability. We apply our noise distributions in NCE, from which negative sampling originates, to train word vectors. The implementation comes from wang2vec (https://github.com/wlin12/wang2vec) by Ling et al. (2015a), and we report the results of this task using CBOW. We include the plain unigram distribution with a power rate of 1 (Mnih and Kavukcuoglu, 2013) and a sub-sampled unigram distribution with a manually chosen rate, denoted Sub (manual), for comparison. We draw three conclusions: (1) the smoothed unigram distribution indeed works much better than the plain unigram, as claimed in (Mikolov et al., 2013b); (2) the manually sub-sampled distribution results in considerable improvements over the smoothed unigram, especially on semantic questions; (3) our Sub achieves the best performance consistently, even with a larger vector size of 300. Note that even though the manually sub-sampled distribution or the smoothed unigram performs better on syntactic questions with the UMBC corpus, their results on semantic questions and on the total dataset are much worse than ours. We therefore believe that our approach also extends to NCE-related work.

Size Dim Noise Sem Syn Tot
0.7 B 100 Uni (power 1) 36.2 47.5 42.5
Uni (power 3/4) 44.8 50.5 47.9
Sub (manual) 49.4 51.4 50.5
Sub (ours) 52.3 51.8 52.0
300 Uni (power 3/4) 46.4 58.3 53.0
Sub (ours) 55.0 59.7 57.6
1.0 B 100 Uni (power 1) 51.5 47.6 49.3
Uni (power 3/4) 57.5 50.7 53.8
Sub (manual) 61.9 51.1 56.0
Sub (ours) 63.5 52.7 57.6
300 Uni (power 3/4) 65.8 59.0 62.1
Sub (ours) 70.3 60.8 65.1
3.0 B 100 Uni (power 1) 25.4 48.1 38.0
Uni (power 3/4) 34.7 54.7 45.8
Sub (manual) 37.1 55.7 47.4
Sub (ours) 42.6 54.8 49.4
300 Uni (power 3/4) 52.4 62.3 57.9
Sub (ours) 62.0 61.8 61.9
Table 3: Results of the word analogy task using NCE for the training of word vectors. Each entry is the average of 2 repeated experiments.

5.5 Extension of Semantics Quantification

5.5.1 MSR Sentence Completion Task

The task deals with incomplete sentences, e.g., "A few faint ___ were gleaming in a violet sky." with candidate answers "tragedies, stars, rumours, noises, explanations", and aims to choose the word (here, "stars") that best completes the sentence. Several works evaluate word vectors on this task (Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013; Liu et al., 2015), since it requires a combination of semantics and occasional logical reasoning. Most of them follow the implementation described in (Mnih and Teh, 2012). Specifically, for each candidate answer $w_a$ we calculate the probability that each word in the set $\mathcal{C}$ of words surrounding the blank is a context word of $w_a$. The score of the candidate answer is the sum of these probabilities,

$S(w_a) = \sum_{w_c \in \mathcal{C}} p(w_c \mid w_a)$,   (20)

and the candidate with the highest score is the final answer to the question.
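A sketch of the conventional scoring in Eq. (20); the helper names are ours, word identities are integer indices, and the context probabilities are computed with a full softmax over the output vectors (Eq. (2)), which is only an illustrative stand-in for the faster approximations used in practice.

```python
import numpy as np

def context_prob(v_in, V_out, context_idx):
    """p(w_c | w_a): full softmax over the output vectors, as in Eq. (2)."""
    logits = V_out @ v_in
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[context_idx]

def score_candidate(V_in, V_out, candidate_idx, context_idxs):
    """Eq. (20): sum of context-word probabilities given the candidate answer."""
    return sum(context_prob(V_in[candidate_idx], V_out, c) for c in context_idxs)

def complete_sentence(V_in, V_out, candidate_idxs, context_idxs):
    """Return the candidate index with the highest score."""
    return max(candidate_idxs, key=lambda a: score_candidate(V_in, V_out, a, context_idxs))

# V_in, V_out: (vocab_size, dim) input and output embedding matrices learned by SG/CBOW.
```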

Since the conventional method ignores the syntactic structure of sentences, it should be biased toward semantics. Thus, we modify the method in two steps: (1) applying sub-sampling to the words in the sentences (CM-sub); and (2) using the quantified semantics as weights on top of (1) to form a semantics weighted model (SWM). Then we have

$S(w_a) = \sum_{w_c \in \mathcal{C}'} g(r_c)\, p(w_c \mid w_a)$,   (21)

where $\mathcal{C}'$ is the sub-sampled context set and $g(r_c)$ is the quantified semantic information of the context word $w_c$ from Eq. (10).
Model Acc
LSA (Zweig et al., 2012) 49.0
SG (Mikolov et al., 2013b) 48.0
ivLBL (Mnih and Kavukcuoglu, 2013) 55.5
SWE (Liu et al., 2015) 56.2
Model Dim Method Acc (Sub-1) Acc (Sub-2)
sg 100 CM 49.0 49.4
CM-sub 54.8 54.4
SWM 56.0 56.5
300 CM 49.9 49.5
CM-sub 56.4 55.4
SWM 58.0 57.9
cbow 100 CM 47.8 46.2
CM-sub 55.3 54.3
SWM 56.3 55.8
300 CM 49.6 48.8
CM-sub 56.4 56.1
SWM 57.5 57.3
Table 4: Results of the MSR sentence completion task by previous word representation models (top) and our approach (bottom).

5.5.2 Results

The setup of the models is slightly different here: the size of the context window is 10 for SG and 5 for CBOW; the number of negative samples is 20 in both models; we train SG for 5 and 10 epochs when the size of the word vectors is 100 and 300, respectively, while the number of epochs is 10 and 20 for CBOW; and we use all the remaining words in a sentence to form the context set $\mathcal{C}$.

Our focus here is to popularize SWM rather than to compare the noise distributions. We show the results of this task for previous word representation models and for our approach in Table 4. The last three of the previous models follow the conventional method. Accordingly, we draw two conclusions: (1) sub-sampling the words in the sentences brings significant improvements over the conventional method; and (2) SWM further improves on the sub-sampled conventional method and beats the previous word representation models with a vector size of 300, indicating the success of semantics quantification.

6 Conclusions

We propose to employ a sub-sampled unigram distribution for better negative sampling and design an approach to derive the required sub-sampling rate. Experimental results show that our noise distribution captures better linear relationships between words than the baselines. It adapts to different corpora and extends to NCE-related work. The proposed semantics weighted model also achieves success on the MSR sentence completion task. In summary, our work not only improves the quality of word vectors but also sheds light on the understanding of Word2Vec.

References

  • Q. Ai, L. Yang, J. Guo, and W. B. Croft (2016) Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM international conference on the theory of information retrieval, pp. 133–142. Cited by: §3.
  • P. Baltescu and P. Blunsom (2014) Pragmatic neural language modelling in machine translation. arXiv preprint arXiv:1412.7119. Cited by: §3.
  • R. Bamler and S. Mandt (2017) Dynamic word embeddings. In International Conference on Machine Learning, pp. 380–389. Cited by: §1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §3.
  • E. Bruni, G. Boleda, M. Baroni, and N. Tran (2012) Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 136–145. Cited by: §5.2.1.
  • W. Chen, D. Grangier, and M. Auli (2015) Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906. Cited by: §3.
  • S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American society for information science 41 (6), pp. 391. Cited by: §1.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. In Proceedings of the 10th international conference on World Wide Web, pp. 406–414. Cited by: §5.2.1.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1, §3.
  • J. Hochmann, A. D. Endress, and J. Mehler (2010) Word frequency as a cue for identifying function words in infancy. Cognition 115 (3), pp. 444–457. Cited by: §4.2.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2016) Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558. Cited by: §1.
  • M. Labeau and A. Allauzen (2017) An experimental analysis of noise-contrastive estimation: the noise distribution matters. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Vol. 2, pp. 15–20. Cited by: §3, §3.
  • T. K. Landauer and S. T. Dumais (1997) A solution to plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge.. Psychological review 104 (2), pp. 211. Cited by: §5.3.1.
  • O. Levy, Y. Goldberg, and I. Dagan (2015) Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225. Cited by: §1.
  • W. Ling, C. Dyer, A. W. Black, and I. Trancoso (2015a) Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1299–1304. Cited by: §1, §1, §5.4.2.
  • W. Ling, Y. Tsvetkov, S. Amir, R. Fermandez, C. Dyer, A. W. Black, I. Trancoso, and C. Lin (2015b) Not all contexts are created equal: better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1367–1372. Cited by: §1.
  • Q. Liu, H. Jiang, S. Wei, Z. Ling, and Y. Hu (2015) Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1501–1511. Cited by: §1, §5.5.1, Table 4, §5.
  • K. Lund and C. Burgess (1996) Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior research methods, instruments, & computers 28 (2), pp. 203–208. Cited by: §1.
  • T. Luong, R. Socher, and C. Manning (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §5.2.1.
  • B. Mandelbrot (1953) An informational theory of the statistical structure of language. Communication theory 84, pp. 486–502. Cited by: §4.5.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §5.5.1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2.1, §3, §5.1.1, §5.4.1, §5.4.2, Table 4, §5.
  • T. Mikolov, W. Yih, and G. Zweig (2013c) Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Cited by: §5.4.1.
  • G. A. Miller and W. G. Charles (1991) Contextual correlates of semantic similarity. Language and cognitive processes 6 (1), pp. 1–28. Cited by: §5.2.1.
  • A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pp. 2265–2273. Cited by: §3, §5.4.2, §5.5.1, Table 4.
  • A. Mnih and Y. W. Teh (2012) A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426. Cited by: §3, §5.5.1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §5.
  • S. T. Piantadosi (2014) Zipf’s word frequency law in natural language: a critical review and future directions. Psychonomic bulletin & review 21 (5), pp. 1112–1130. Cited by: §4.5.
  • H. Rubenstein and J. B. Goodenough (1965) Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §5.2.1.
  • S. K. Sienčnik (2015) Adapting word2vec to named entity recognition. In Proceedings of the 20th nordic conference of computational linguistics, nodalida 2015, may 11-13, 2015, vilnius, lithuania, pp. 239–243. Cited by: §1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • Y. Tsvetkov, M. Faruqui, W. Ling, G. Lample, and C. Dyer (2015) Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2049–2054. Cited by: §1.
  • A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang (2013) Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1387–1392. Cited by: §3.
  • M. Xiao and Y. Guo (2013) Domain adaptation for sequence labeling tasks with a probabilistic language adaptation model. In International Conference on Machine Learning, pp. 293–301. Cited by: §3.
  • W. Yang, W. Lu, and V. Zheng (2017) A simple regularization-based algorithm for learning cross-domain word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2898–2904. Cited by: §1.
  • D. Yogatama, M. Faruqui, C. Dyer, and N. A. Smith (2014) Learning word representations with hierarchical sparse coding. CoRR abs/1406.2035. External Links: Link, 1406.2035 Cited by: §1.
  • L. Yu, J. Wang, K. R. Lai, and X. Zhang (2017) Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 534–539. Cited by: §1.
  • G. K. Zipf (1950) Human behavior and the principle of least effort. an introduction to human ecology. Philosophy of Science 17 (2), pp. 204–205. Cited by: §4.2.
  • G. Zweig, J. C. Platt, C. Meek, C. J. Burges, A. Yessenalina, and Q. Liu (2012) Computational approaches to sentence completion. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 601–610. Cited by: Table 4, §5.