A key input for statistical machine translation is a set of bilingual mappings of words and phrases that are created from parallel, i.e., already translated, corpora. The creation of such high-quality labeled data is costly, and this cost limits supply given the large number of bilingual language pairs.
word2vec Mikolov et al. (2013a) is an unsupervised learning method that generates a distributed representation Rumelhart et al. (1986) of words and phrases in a shared high-dimensional vector space. word2vec uses a neural network that is trained to capture the relation between language elements and the context in which they occur. More specifically, the network learns to predict the neighbors within a given text window for each word in the vocabulary. As a result, the relative locations of language elements in this space reflect co-occurrence in the training text material.
Moreover, this form of representation captures rich information on syntactic and semantic relationships between words and phrases. Syntactic matches like singular and plural nouns, present and past tense, or adjectives and their superlatives can be found through simple vector algebra: the plural form of a word is very likely to lie in the same direction and at the same distance from its singular form as the plural of another word does from its own singular form. The same approach works for semantic relations like countries and their capitals or family relationships. Most famously, the location of ‘queen’ can be obtained by subtracting ‘man’ from ‘king’ while adding ‘woman’.
Monolingual vector spaces for different languages learnt separately using word2vec from comparable text material have been shown to generate similar geometric representations. To the extent that similarities among languages, which aim to encode similar real-world concepts Mikolov et al. (2013a), are reflected in similar relative positions of words and phrases, learning a projection matrix that translates locations between these spaces could help identify matching words and phrases for use in machine translation from monolingual corpora only.
The benefit would be a significant expansion of the training material that can be used to produce high-quality inputs for language translation. In addition, the candidates for word and phrase translation identified through this approach can be scored using their distance to the expected location.
The paper proceeds as follows: (1) introduce the word2vec algorithm and the evaluation of its results. (2) outline the learning process for the projection matrix between vector spaces, the corresponding approach to word and phrase translation, and the evaluation of translation quality. (3) describe key steps to obtain and preprocess the Wikipedia input data, and present important descriptive statistics. (4) present empirical results and steps taken to optimize these results.
The word2vec Method
word2vec stands in a tradition of learning continuous vectors to represent words Mikolov et al. (2013d) using neural networks Bengio et al. (2003). The word2vec approach emerged with the goal to enhance the accuracy of capturing the multiple degrees of similarity along syntactic and semantic dimensions Mikolov et al. (2013c), while reducing computational complexity to allow for learning vectors beyond the thus far customary 50-100 dimensions, and for training on more than a few hundred million words Mikolov et al. (2013a).
Feed-forward neural network language models (NNLM) have been popular for learning distributed representations because they outperform Latent Semantic Analysis and Latent Dirichlet Allocation in capturing linear regularities, and the latter in particular also in computational cost.
However, computational cost remained high because NNLM combine input, projection, hidden, and output layers. The input layer contains the neighbors (the input context in 1-of-V encoding) for each word in the vocabulary and is used to predict a probability distribution over the vocabulary. The context is expanded to a higher dimensionality in the Projection Layer, and then fed forward through a non-linear Hidden Layer. The output probability distribution can be obtained through a Softmax layer, or, more efficiently, a hierarchical Softmax that uses a balanced binary or Huffman tree to reduce output complexity to $\log_2(V)$ or $\log_2(\text{unigram perplexity}(V))$, respectively Mikolov et al. (2013a).
Recurrent Neural Networks
Recurrent Neural Networks (RNN) avoid the need to specify the size N of the context, and can represent more varied patterns than NNLM Bengio & Lecun (2007). RNN have input, hidden, and output layers but no projection layer, and add a recurrent matrix that connects the hidden layer to itself to enable time-delayed effects, or short-term memory. The output layer works as for NNLM.
The Skip-Gram Model
The computational complexity, i.e., the number of parameters involved in the matrix multiplications required for back-propagation during learning with stochastic gradient descent, is large due to the dense, non-linear Hidden Layer.
The work by Mikolov et al. (2013a) has focused on simpler models to learn word vectors, and then train NNLM using these vector representations. The result is two architectures that eliminate the Hidden Layer, and learn word vectors by, as above, predicting a word using its context, or, alternatively, by predicting the context for each word in the vocabulary.
The Continuous Bag-of-Words Model (CBOW) averages the vectors of words in a window before and after the target word for its prediction, as above. The model is called ‘bag of words’ because word order does not matter.
The Continuous Skip-Gram Model, in turn, changes the direction of the prediction task, and learns word vectors by predicting various individual targets in the context window around each word. A comparison of these architectures Mikolov et al. (2013a) suggests that the Skip-Gram model produces word vectors that better capture the multiple degrees of similarity among words (see below on specific metrics used for this task). For this reason, the experiments in this paper focus on the Skip-Gram Model.
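The difference between the two prediction tasks can be illustrated with a minimal sketch that only builds the training examples, without any learning. The function names `skipgram_pairs` and `cbow_examples` and the fixed `window` argument are illustrative assumptions, not part of the original word2vec code.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-Gram: one (input word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW: one (list of context words, target word) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


sentence = "the quick brown fox".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

Skip-Gram yields one training pair per neighbor, whereas CBOW yields one example per position whose context vectors would be averaged before the prediction.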
To improve the Skip-Gram model’s accuracy or increase the speed of training, several architecture refinements have been proposed, namely using candidate sampling for a more efficient formulation of the objective function, and the subsampling of frequently occurring words.
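To find word representations that predict surrounding words within a context with high accuracy, the Skip-Gram model, for a given sequence of words $w_1, w_2, \dots, w_N$, maximizes the following objective of average log probability

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{n+j} \mid w_n)$$

over all target words and their respective contexts, where $c$ is the context size. The probability predicted for any context word can be based on the inner product of the vector representations of the input and the output candidates, $v_{w_I}$ and $v'_{w_O}$, respectively, normalized to conform to the requirements of a probability distribution over all words in the vocabulary of size $V$ using the Softmax function:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$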
However, the complexity of calculating these probabilities and related gradients for all words becomes prohibitive as the vocabulary grows.
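A small sketch of the full Softmax makes the cost explicit: every prediction requires one inner product per vocabulary word. The vocabulary size, embedding size, and random vectors below are toy assumptions for illustration only.

```python
import numpy as np


def softmax_probs(v_in: np.ndarray, out_vectors: np.ndarray) -> np.ndarray:
    """p(w_O | w_I) for every candidate output word w_O.

    v_in:        input vector of the current word, shape (d,)
    out_vectors: output vectors of all V vocabulary words, shape (V, d)
    """
    scores = out_vectors @ v_in          # V inner products, the costly step
    scores = scores - scores.max()       # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
V, d = 6, 4                              # toy sizes, assumed for the example
input_vectors = rng.normal(size=(V, d))
output_vectors = rng.normal(size=(V, d))
probs = softmax_probs(input_vectors[2], output_vectors)
```

Both the forward pass and the gradient touch all rows of `out_vectors`, which is why the approximations discussed next matter for large vocabularies.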
One alternative is the Hierarchical Softmax Bengio et al. (2003) that reduces the number of computations to $\log_2(V)$ by representing the output layer as a balanced binary tree. Using Huffman codes to obtain short codes for frequent words further speeds up training.
An alternative to the Hierarchical Softmax function that reduces the number of computations required for inference and back-propagation is Noise Contrastive Estimation (NCE) Gutmann & Hyvärinen (2012). Instead of calculating a probability distribution over all possible target words, NCE uses logistic regression to distinguish a target from samples from a noise distribution. NCE approximately maximizes the log probability of the Softmax. Mikolov et al. (2013a) simplify NCE by introducing Negative Sampling (NEG), which uses only samples from the noise distribution and obviates the need for the numerical probabilities of the noise distribution itself.
Either NCE or NEG replaces the expression for $\log p(w_{n+j} \mid w_n)$ in the Skip-Gram objective function, and the network is trained using the back-propagation signals resulting from the probabilities predicted for noise samples and actual context words during training.
Mikolov et al. (2013a) suggest values of $k$, the number of noise samples, in the range of 2-5 for large, and 5-20 for smaller data sets. In addition, using NCE or NEG requires defining the noise distribution, and the authors recommend drawing from the unigram distribution raised to the power $3/4$.
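The following sketch shows, under simplifying assumptions, how the noise distribution and the Negative Sampling loss for a single (input, context) pair could be computed. The toy word counts, the vector sizes, and the helper names are hypothetical. Unlike practical implementations, the sketch does not resample a noise word that happens to coincide with the true context word.

```python
import numpy as np


def noise_distribution(counts: np.ndarray, power: float = 0.75) -> np.ndarray:
    """Unigram distribution raised to the given power and renormalized."""
    weights = counts.astype(float) ** power
    return weights / weights.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_sampling_loss(v_in, v_out_pos, v_out_neg) -> float:
    """Negative log-likelihood of one true pair against k noise words."""
    pos = np.log(sigmoid(v_out_pos @ v_in))            # true context word
    neg = np.log(sigmoid(-(v_out_neg @ v_in))).sum()   # k noise words
    return float(-(pos + neg))


rng = np.random.default_rng(0)
counts = np.array([1000, 100, 10, 1])                  # toy unigram counts
p_noise = noise_distribution(counts)

k, d = 5, 8                                            # assumed toy sizes
out_vectors = rng.normal(scale=0.1, size=(len(counts), d))
v_in = rng.normal(scale=0.1, size=d)
neg_ids = rng.choice(len(counts), size=k, p=p_noise)   # draw k noise words
loss = neg_sampling_loss(v_in, out_vectors[1], out_vectors[neg_ids])
```

Raising the counts to the power $3/4$ flattens the distribution, so rare words are drawn as noise more often than their raw frequency would suggest.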
Subsampling Frequent Words
The frequency distribution of words in large corpora tends to be uneven. In order to address an imbalance of very frequent, but less informative words that can dilute the quality of the word vectors, Mikolov et al. (2013a) introduce subsampling of frequent words, and discard each word $w_i$ in the training set with a probability:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a threshold (recommended at $10^{-5}$), and $f(w_i)$ is the frequency of word $w_i$. The benefit of this subsampling approach is to significantly curtail the occurrence of words that are more frequent than $t$ while preserving the frequency ranking of words overall. The authors report improvements in both training speed and accuracy.
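As a minimal sketch of this rule, the discard probability and a subsampling pass over a token list might look as follows. The function names, the clipping at zero for words rarer than the threshold, and the toy corpus with its unusually large threshold are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def discard_probability(freq: float, t: float = 1e-5) -> float:
    """1 - sqrt(t / f); clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def subsample(tokens, t=1e-5, seed=0):
    """Drop each token independently with its word's discard probability."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [
        w for w in tokens
        if rng.random() >= discard_probability(counts[w] / total, t)
    ]


# Toy corpus: a large threshold is used only because the corpus is tiny.
tokens = ["the"] * 900 + ["translation"] * 100
kept = subsample(tokens, t=0.05)
```

A word with relative frequency 0.01 is discarded with probability of about 0.97 at the default threshold, while words at or below the threshold are always kept.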
Hyper Parameter Choices
Training the Skip-Gram model on text samples requires choices for preprocessing and the setting of model hyper-parameters. Hyper-parameters include:
Context size $c$: increasing $c$ has been reported to boost accuracy through a larger number of training samples, but also increases training time. Mikolov et al. suggest randomizing the size of the context range for each training sample with probability $1/c$, where $c$ is the maximum context size. In practice, for $c = 5$, this means selecting a context size of $1, 2, \dots, 5$ with probability $0.2$ each.
Minimum count for words to be included in the vocabulary: words that are observed less often than this minimum are replaced by a token ‘UNK’ for unknown and all treated alike.
Subsampling frequency $t$: as mentioned above, recommended at $10^{-5}$.
Number of negative samples $k$: 2-5 for large samples, 5-20 for smaller training sets.
Embedding size $d$: the dimensionality of the word vector increases computational complexity, just as increasing the training set size does. Mikolov et al. (2013a) suggest that conventional vector size choices of 50-100 are too small and report significantly better results for ranges 300-1,000.
Epochs to train: ranges from 3-50 have been reported, but this choice often depends on the constraints imposed by the size of the training set, the computational complexity resulting from other parameter choices, and available resources.
Additional choices include the optimization algorithm. Both standard stochastic gradient descent and adaptive methods like Adagrad or Adam have been used. In either case, the initial learning rate and possibly decay rates may have to be selected, as well as batch sizes.
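Two of the choices listed above, the randomized context size and the minimum word count, can be sketched in a few lines. The helper names `sample_window` and `replace_rare` are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter


def sample_window(max_window: int, rng: random.Random) -> int:
    """Draw a context size uniformly from 1..max_window (inclusive)."""
    return rng.randint(1, max_window)


def replace_rare(tokens, min_count, unk="UNK"):
    """Map words seen fewer than min_count times to a shared UNK token."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else unk for w in tokens]


rng = random.Random(0)
draws = Counter(sample_window(5, rng) for _ in range(10_000))
cleaned = replace_rare(["a", "a", "b"], min_count=2)
```

With a maximum context size of 5, each of the sizes 1 to 5 is drawn roughly 20% of the time, so nearby words contribute to more training examples than distant ones.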
Evaluating Vector Quality
While the Skip-Gram model is trained to accurately predict context words for any given word, the desired output of the model is the set of learnt embeddings that represent each word in the vocabulary.
A number of tests have been developed to assess whether these vectors represent the multiple degrees of similarity that characterize words. These tests are based on the observation that word vectors encode semantic and syntactic relations in the relative locations of words, and that these relations can be recovered through vector algebra. In particular, for a relation $a : b$ analogous to $c : d$, the location of $d$ should closely correspond to the following operation:

$$\vec{d} \approx \vec{c} + (\vec{b} - \vec{a})$$

Mikolov et al. (2013a) have made available over 500 analogy pairs in 15 syntactic and semantic categories. Within each category, the base pairs are combined to yield four-valued analogies. For each, the location of the fourth word is calculated from the above vector operation. Then, the cosine distance of the correct fourth term is calculated, and compared to other words found in the neighborhood of the projected $\vec{d}$. According to the metric $P@k$ used for evaluation, a prediction is deemed correct when the correct term appears among the $k$ nearest neighbors.
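The analogy test can be sketched as follows, assuming a tiny hand-made vocabulary and vectors that are purely illustrative. The three question words are excluded from the candidates, a common convention that is assumed here rather than taken from the text.

```python
import numpy as np


def normalize(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def analogy_top_k(a, b, c, vocab, vectors, k=5):
    """Return the k words closest (by cosine) to c + (b - a)."""
    index = {w: i for i, w in enumerate(vocab)}
    unit = normalize(vectors)
    query = unit[index[b]] - unit[index[a]] + unit[index[c]]
    query /= np.linalg.norm(query)
    sims = unit @ query                      # cosine similarities
    for w in (a, b, c):
        sims[index[w]] = -np.inf             # exclude the question words
    best = np.argsort(-sims)[:k]
    return [vocab[i] for i in best]


def precision_at_k(questions, vocab, vectors, k=1):
    """Share of analogies whose correct answer is among the top k."""
    hits = sum(d in analogy_top_k(a, b, c, vocab, vectors, k)
               for a, b, c, d in questions)
    return hits / len(questions)


# Toy vectors with axes (male, female, royal, fruit); assumed for the demo.
vocab = ["man", "woman", "king", "queen", "apple"]
vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
questions = [("man", "woman", "king", "queen"),
             ("king", "queen", "man", "woman")]
p_at_1 = precision_at_k(questions, vocab, vectors, k=1)
```

On real embeddings the same routine would be run once per analogy category, restricted to questions whose four words are all in the vocabulary.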
|Topic||Count||A or C||B or D|
In order to adapt this test to multiple languages, I translated the base pairs using the Google Translate API and then manually verified the result.
A few complications arise when translating the English word pairs into German, Spanish and French:
Both words of a pair translate to a single word in the target language (e.g., some adjectives and adverbs have the same word form in German), rendering the sample unsuitable for the geometric translation test. These pairs were excluded for the affected language.
The translation of a single-word source results in ngrams with $n > 1$. In these cases, I combined the ngrams to unigrams using underscores and replaced the ngrams with the result.
Learning a Projection Matrix
The result of each monolingual Skip-Gram model is an embedding matrix of dimensions $V \times d$, where $V$ is the size of the vocabulary for the language, and $d$ is the corresponding embedding size. Hence, for each word in the source language there is a vector $x_i \in \mathbb{R}^{d_x}$, and for each word in the target language, there is a vector $z_j \in \mathbb{R}^{d_z}$.
We need to find a translation matrix $W$ so that for a correctly translated pair $\{x_k, z_k\}$, the matrix approximately translates the vector locations so that $W x_k \approx z_k$. The solution can be found by solving the optimization problem

$$\min_{W} \sum_{k=1}^{n} \left\lVert W x_k - z_k \right\rVert^2$$

using gradient descent to minimize the above loss. The resulting translation matrix will estimate the expected location of a matching translation for any word in the source language provided its vector representation, and cosine distance is then used to identify the nearest candidates. In practice, the translation matrix will be estimated using known translations obtained via the Google Translate API.
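A minimal sketch of this procedure, under the assumption of synthetic data in place of real embeddings, fits the matrix by batch gradient descent and then ranks target words by cosine similarity to the projected vector. The function names, learning rate, and number of epochs are illustrative choices; the loss is averaged over the pairs, which rescales the objective without changing its minimizer.

```python
import numpy as np


def fit_translation_matrix(X, Z, lr=0.1, epochs=500):
    """Minimize the mean of ||W x_k - z_k||^2 over the n training pairs.

    X: source vectors, shape (n, d_x); Z: target vectors, shape (n, d_z).
    """
    n, d_x = X.shape
    W = np.zeros((Z.shape[1], d_x))
    for _ in range(epochs):
        residual = X @ W.T - Z               # (n, d_z): W x_k - z_k per row
        grad = 2.0 * residual.T @ X / n      # gradient of the averaged loss
        W -= lr * grad
    return W


def nearest_translations(x, W, target_vectors, target_vocab, k=5):
    """Project x into the target space and rank candidates by cosine."""
    z_hat = W @ x
    unit = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    sims = unit @ (z_hat / np.linalg.norm(z_hat))
    best = np.argsort(-sims)[:k]
    return [(target_vocab[i], float(sims[i])) for i in best]


# Synthetic stand-in for a dictionary of 100 translated pairs (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W_true = rng.normal(size=(6, 5))
Z = X @ W_true.T
target_vocab = [f"t{i}" for i in range(len(Z))]

W = fit_translation_matrix(X, Z)
candidates = nearest_translations(X[0], W, Z, target_vocab, k=3)
```

The similarity score returned with each candidate can serve as the confidence measure for a proposed translation.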
The Data: Wikipedia
Corpus Size & Diversity
The English corpus is more than twice as large as the German corpus, the second largest, and counts 5.3 million articles, over 2 billion tokens and 8.2 million distinct word forms.
High-level statistics for the four language corpora highlight some differences between languages that may impact translation performance. For instance, the German corpus contains more unique word forms than the English corpus while only containing 35% of the number of tokens. The number of tokens per unique word form is 85 for German, 187 for French, 216 for Spanish and 254 for English.
In order to exclude less meaningful material like redirects and others, articles with fewer than 50 words were excluded, resulting in the reduced sample sizes shown in Figure 5.
Parts-of-Speech & Entity Tags
Sentence parsing revealed markedly shorter sentences in German, compared to longer sentences in Spanish and French (see Figure 6).
I used a simple approach to identify bigrams that represent common phrases. In particular, I used the following scoring formula to identify pairs of words that are more likely to occur together:

$$\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$$

where $\delta$ represents a minimum count threshold for each unigram.
To allow for phrases composed of more than two unigrams, bigrams were scored repeatedly and, after each iteration, combined if their score exceeded a threshold. The threshold was gradually reduced. After three iterations, about 20% of the tokens consisted of ngrams.
|English||German||Spanish||French|
|such as||vor allem||sin embargo||c’ est|
|has been||z. b.||estados unidos||qu’ il|
|as well as||gibt es||así como||à partir|
|United States||im jahr||se encuentra||ainsi que|
|had been||unter anderem||por ejemplo||par exemple|
|have been||nicht mehr||cuenta con||a été|
|based on||befindet sich||junto con||au cours|
|known as||im jahre||se encuentran||n’ a|
|would be||gab es||lo largo||la plupart|
|New York||zum beispiel||se convirtió||au début|
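The phrase detection step described above can be sketched as a scoring pass followed by a greedy left-to-right merge. The function names, the toy sentence, and the values of `delta` and `threshold` are assumptions for illustration; the score follows the form (bigram count minus delta) divided by the product of the unigram counts.

```python
from collections import Counter


def score_bigrams(tokens, delta=5):
    """Score each adjacent word pair by its discounted co-occurrence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold):
    """Join pairs scoring above the threshold with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2                           # skip both merged words
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = "new york is big and new york is old".split()
scores = score_bigrams(tokens, delta=1)
merged = merge_phrases(tokens, scores, threshold=0.2)
```

Running both functions again on `merged`, with a lower threshold, is one way to obtain phrases of more than two words, since a merged bigram then counts as a single token.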
Since model training with the entire corpus is quite resource intensive (90 min per epoch for the English corpus on 16 cores), I tested various model configurations and preprocessing techniques on smaller subsets of the data comprising 100 million tokens each.
The models used text input at various stages of preprocessing:
raw: unprocessed text
clean: removed punctuation and symbols recognized by language parsers
ngrams: identified by phrase modeling process
The following hyper parameters were tested (baseline in bold):
$d$: Embedding Dimensionality (100, 200, 300)
$k$: NCE candidate sample size (10, 25, 50)
$t$: subsample threshold (1e-3 vs custom threshold, set to subsample 0.2% most frequent words)
$c$: text window size (3, 5, 8)
All models were trained for 5 epochs using stochastic gradient descent with a starting learning rate of 0.25 that linearly declined to 0.001. Models were evaluated using Precision@1 on the analogy test using questions that were covered by the reduced sample vocabulary.
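The linear learning rate schedule mentioned above can be written as a one-line interpolation. The function name and the step-based parameterization are assumptions; the start and end values are those stated in the text.

```python
def linear_lr(step: int, total_steps: int,
              start: float = 0.25, end: float = 0.001) -> float:
    """Interpolate linearly from start (step 0) to end (final step)."""
    frac = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return start + frac * (end - start)


schedule = [linear_lr(s, 100) for s in (0, 50, 100)]
```

Halfway through training, the rate under this schedule is the midpoint of the two values, about 0.1255.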
The results show:
English performed significantly better than the other languages for the same (reduced) training set size.
Spanish and French did not benefit from text preprocessing, and achieved their highest score with the raw text input.
German alone performed best with a larger embedding size.
English performed best with a larger negative sample size.
Relative to baseline values for the same input:
A higher embedding size benefited all languages except English, and reducing the embedding size had a negative impact for all languages.
A larger context improved performance for French and Spanish, while all languages did worse for a smaller context.
A larger number of negative samples had a small positive impact for all except French, while a smaller number reduced performance for all.
A custom (higher) subsampling threshold had a negative impact for English and French and only a small positive impact for German.
The validity of these results is limited by the number of training rounds and the reduced vocabulary, but they provide orientation regarding preferable parameter settings for training with the full set.
|Parameter Settings||Precision@1 Analogy Test Results|
|Input||Embedding Size ($d$)||Negative Samples ($k$)||Subsample Threshold ($t$)||Context ($c$)||English||French||Spanish||German|
Mikolov et al. (2013a) report the following accuracy for the word2vec Skip-Gram model in English:
|Vector Size||# Words||Epochs||Accuracy|
Monolingual word2vec
For the complete Wikipedia corpus for each language, various input formats and hyper parameters were applied to improve results and test for robustness. The choices took into account the sample results from above while adjusting for the larger data set (e.g., a smaller number of negative samples is recommended for larger data).
|Model||Loss||Ngrams||Min. Count||Vocab Size||Testable Analogies||Embed. Size||Initial LR||Negative Samples||Sub-sample Threshold||P@1||P@5|
Models with both raw input and phrase (ngram) input were tested using Noise Contrastive Estimation and Negative Sampling, but the latter did not change results (as expected). Embeddings ranged from , negative samples from , and the initial learning rate from 0.025-0.03, decaying linearly to . The subsample threshold varied between and .
The most significant impact resulted from reducing the vocabulary size by increasing the minimum count for a word to be included from 5 for the smaller samples to at least 25, up to 500 (reducing the number of analogies available in the vocabulary, esp. in German and Spanish).
All models were trained for 20 epochs using stochastic gradient descent in TensorFlow.
English, with over 2bn test tokens, performed best (while also covering most analogy word pairs), but did not achieve the benchmark performance reported above. The best models achieved P@1 accuracy of 43.5%, and P@5 accuracy of 87.8% when the vocabulary was reduced to words. For a vocabulary almost six times this size, accuracy was still 37.8%.
Spanish, French, and German performed less well on the analogy test (which is tailored to the English language). P@1 accuracy ranges from 16.8% to 21%, and P@5 accuracy from 42.2% to 63.7%.
Overall, results behaved as expected in response to changes in parameter settings, with smaller vocabularies boosting precision. The accuracy curves below show that performance was still improving after 20 epochs, so further training (Mikolov et al. (2013a) suggest 3-50 epochs) would likely have further increased performance.
Performance across Analogies
Performance across the various analogy categories is uneven, and a closer look reveals some weaknesses of the models. Most models perform strongly on capital-country relations for common countries, nationality adjectives, and family relationships.
Most other areas offer more alternative matches or synonyms, and performance is often strong at the P@5 level, but not necessarily at the P@1 level. Currency-country relationships are poorly captured for all languages, which might be due to the input text containing less information on these (the original tests were conducted on Google News corpora with arguably more economic content).
Mikolov et al. (2013b) introduced word2vec for bilingual translation, and tested English to Spanish translation using the WMT11 corpus. It provides 575m tokens and 127K unique words in English, and 84m tokens and 107K unique words in Spanish. The authors reported P@1 accuracy of 33%, and P@5 accuracy of 51%.
P@5 accuracy arguably is a better measure given that there are often several valid translations, while the dictionary contains only one solution (the Google Translate API does not provide multiple options).
I used the best performing models to test the translation quality from English to any of the other languages, learning the transformation matrix on a training set of the 5,000 most frequent words with matching translations in the process. Results on the test set of 2,500 words per language were best for Spanish with P@1 at 47.2% and P@5 at 62.7%. French and German performed less well.
In order to illustrate how word2vec produces many useful related terms, I am displaying the top 5 options for the English words complexity, pleasure and monsters, marking the dictionary entry provided by Google Translate in bold.
All languages translate complexity correctly, but produce additional interesting associations: both Spanish and French relate complexity to generalization, with Spanish also emphasizing prediction. German closely relates complexity to dynamics.
Pleasure, correctly translated in Spanish and German but with a debatable mistake - according to google translate - in French, also highlights interesting nuances - Spanish and French stress curiosity, while Spanish and German mention patience.
Figure 9 illustrates the projection of English word vectors into the Spanish vector space using the translation matrix.
A random sample of 75 English terms was selected based on correct translation at the accuracy level. The English vectors were translated into the foreign language space, and both the English vector and the nearest Spanish neighbor vector were projected into two dimensions using t-distributed stochastic neighbor embedding (t-SNE) and displayed with their labels.
The proximity of similar concepts both within a language and across language boundaries is striking (see the highlighted shark and elephant example and their translation).
It is possible to produce high-quality translations using largely unsupervised learning, with limited supervised learning input to translate between monolingual vector spaces. Preprocessing is also quite limited compared to standard dependency parsing.
Using the Wikipedia corpus instead of the smaller WMT11 corpus shows that accuracy can be increased over the previously achieved benchmark. In addition, the translations come with a similarity metric that expresses confidence.
These translations can be extended to multiple bilingual pairs without major effort (e.g. to all six pairs among the four languages considered, and in either direction for 12 applications).
There are multiple opportunities to further improve results:
Monolingual models did not perform optimally, and fell short of benchmark performance. This seems in part to be due to the nature of the analogies chosen for testing, as well as the text material and the quality of the translations. The ‘city-in-state’ pairs emphasize US geography, and the ‘country-currency’ pairs found few matches for all languages, arguably because these terms are used less frequently in Wikipedia than in the Google News corpus used by the original authors. It would certainly be useful to blend multiple corpora to obtain more comprehensive coverage.
The reduced coverage of translated analogies also suggests reviewing in more detail how to adapt the translations. The manual review of the Google Translate API results produced a significant number of refinements. Syntactic analogies may also need to be tailored to each language’s grammar, as several areas (e.g. adjective-adverb) produced far fewer meaningful pairs in German than in English.
Monolingual models would also benefit from longer training and more computing resources as performance on the analogy test was still improving after the allotted amount of time ended. This would also allow for better hyper parameter settings, in particular with respect to larger context windows (resulting in more training examples per epoch), and higher embedding dimensionality (increasing computational complexity). Furthermore, additional preprocessing options could be explored, e.g. to utilize information from parts-of-speech tagging to distinguish between identical words used in different functions (as noun and verb).
Bilingual models performed quite well despite some shortcomings of the monolingual models, in particular in the English-Spanish case. A thorough review of the translation quality would likely improve the accuracy of the translation matrix.
The word2vec-based translations could possibly be improved by complementing them with additional techniques that have been used successfully, e.g. removing named entities or using the edit distance to account for morphological similarity.
- Bengio & Lecun (2007) Bengio, Yoshua and Lecun, Yann. Scaling learning algorithms towards AI. Large-scale kernel machines, 2007.
- Bengio et al. (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003. ISSN 1533-7928.
- Gutmann & Hyvärinen (2012) Gutmann, Michael U. and Hyvärinen, Aapo. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of Machine Learning Research, 13(1):307–361, February 2012. ISSN 1532-4435.
- Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs], January 2013a.
- Mikolov et al. (2013b) Mikolov, Tomas, Le, Quoc V., and Sutskever, Ilya. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs], September 2013b.
- Mikolov et al. (2013c) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pp. 3111–3119, USA, 2013c. Curran Associates Inc.
- Mikolov et al. (2013d) Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Atlanta, Georgia, June 2013d. Association for Computational Linguistics.
- Rumelhart et al. (1986) Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323:533, October 1986.