Word and Phrase Translation with word2vec

05/09/2017 ∙ by Stefan Jansen, et al. ∙ 0

Word and phrase tables are key inputs to machine translation but are costly to produce. New unsupervised learning methods represent words and phrases in a high-dimensional vector space, and these monolingual embeddings have been shown to encode syntactic and semantic relationships between language elements. The information captured by these embeddings can be exploited for bilingual translation by learning a transformation matrix that matches relative positions across two monolingual vector spaces. This method aims to identify high-quality candidates for word and phrase translation more cost-effectively from unlabeled data. This paper expands the scope of previous attempts at bilingual translation to four languages (English, German, Spanish, and French). It shows how to process the source data, train a neural network to learn the high-dimensional embeddings for individual languages, and expand the framework for testing their quality beyond the English language. Furthermore, it shows how to learn bilingual transformation matrices, obtain candidates for word and phrase translation, and assess their quality.





Key inputs for statistical machine translation are bilingual mappings of words and phrases that are created from parallel, i.e., already translated, corpora. The creation of such high-quality labeled data is costly, and this cost limits supply given the large number of bilingual language pairs.

word2vec Mikolov et al. (2013a) is an unsupervised learning method that generates a distributed representation Rumelhart et al. (1986) of words and phrases in a shared high-dimensional vector space. word2vec uses a neural network that is trained to capture the relation between language elements and the context in which they occur. More specifically, the network learns to predict the neighbors within a given text window for each word in the vocabulary. As a result, the relative locations of language elements in this space reflect co-occurrence in the training text material.

Moreover, this form of representation captures rich information on syntactic and semantic relationships between words and phrases. Syntactic matches like singular and plural nouns, present and past tense, or adjectives and their superlatives can be found through simple vector algebra: the location of the plural form of a word is very likely to be in the same direction and at the same distance as the plural of another word relative to its singular form. The same approach works for semantic relations like countries and their capitals or family relationships. Most famously, the location of ‘queen’ can be obtained by subtracting ‘man’ from ‘king’ while adding ‘woman’.

Monolingual vector spaces for different languages learnt separately using word2vec from comparable text material have been shown to generate similar geometric representations. To the extent that similarities among languages, which aim to encode similar real-world concepts Mikolov et al. (2013a), are reflected in similar relative positions of words and phrases, learning a projection matrix that translates locations between these spaces could help identify matching words and phrases for use in machine translation from monolingual corpora only.

The benefit would be a significant expansion of the training material that can be used to produce high-quality inputs for language translation. In addition, the candidates for word and phrase translation identified through this approach can be scored using their distance to the expected location.

The paper proceeds as follows: (1) introduce the word2vec algorithm and the evaluation of its results; (2) outline the learning process for the projection matrix between vector spaces, the corresponding approach to word and phrase translation, and the evaluation of translation quality; (3) describe key steps to obtain and preprocess the Wikipedia input data, and present important descriptive statistics; and (4) present empirical results and the steps taken to optimize them.

The word2vec Method

word2vec stands in a tradition of learning continuous vectors to represent words Mikolov et al. (2013d) using neural networks Bengio et al. (2003). The word2vec approach emerged with the goal to enhance the accuracy of capturing the multiple degrees of similarity along syntactic and semantic dimensions Mikolov et al. (2013c), while reducing computational complexity to allow for learning vectors beyond the thus far customary 50-100 dimensions, and for training on more than a few hundred million words Mikolov et al. (2013a).

Feed-Forward Networks

Feed-forward neural network language models (NNLM) have been popular for learning distributed representations because they outperform Latent Semantic Analysis and Latent Dirichlet Allocation in capturing linear regularities, and also outperform the latter in particular in terms of computational cost.

Figure 1: Feed-Forward NN Architecture

However, computational cost remained high because NNLM combine input, projection, hidden, and output layers. The input layer contains the $N$ neighbors (the input context in 1-of-$V$ encoding, where $V$ is the vocabulary size) of each word in the vocabulary and is used to predict a probability distribution over the vocabulary. The context is expanded to a higher dimensionality in the Projection Layer, and then fed forward through a non-linear Hidden Layer. The output probability distribution can be obtained through a Softmax layer, or, more efficiently, a hierarchical Softmax that uses a balanced binary or Huffman tree to reduce output complexity to $\log_2(V)$ or $\log_2(\text{unigram perplexity}(V))$, respectively Mikolov et al. (2013a).

Recurrent Neural Networks

Recurrent Neural Networks (RNN) avoid the need to specify the size N of the context, and can represent more varied patterns than NNLM Bengio & Lecun (2007). RNN have input, hidden, and output layers but no projection layer, and add a recurrent matrix that connects the hidden layer to itself to enable time-delayed effects, or short-term memory. The output layer works as for NNLM.

Figure 2: Recurrent NN Architecture

The Skip-Gram Model

The computational complexity, or number of parameters, corresponds to the matrix multiplications required for back-propagation during the learning process driven by stochastic gradient descent. It is large due to the dense, non-linear Hidden Layer.

Figure 3: Continuous Bag of Words & Skip-Gram Models

The work by Mikolov et al. (2013a) has focused on simpler models to learn word vectors, and then train NNLM using these vector representations. The results are two architectures that eliminate the Hidden Layer, and learn word vectors by, as above, predicting a word using its context, or, alternatively, by predicting the context for each word in the vocabulary.

The Continuous Bag-of-Words Model (CBOW) averages the vectors of words in a window before and after the target word for its prediction, as above. The model is called ‘bag of words’ because word order does not matter.

The Continuous Skip-Gram Model, in turn, reverses the direction of the prediction task, and learns word vectors by predicting various individual targets in the context window around each word. A comparison of these architectures Mikolov et al. (2013a) suggests that the Skip-Gram model produces word vectors that better capture the multiple degrees of similarity among words (see below on specific metrics used for this task). For this reason, the experiments in this paper focus on the Skip-Gram Model.
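To make the prediction direction of the Skip-Gram model concrete, the following minimal Python sketch generates the (input word, context word) training pairs for a fixed window. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions and are not taken from the experiments in this paper.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for each token and every neighbor within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


# Hypothetical toy sentence, window of one word on each side.
pairs = list(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
```

Each pair is one training example: the network receives the first word and is asked to predict the second. A CBOW model would instead group all context words of a position into a single example that predicts the center word.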

Figure 4: TensorFlow Computational Graph

Architecture Refinements

To improve the Skip-Gram model’s accuracy or increase the speed of training, several architecture refinements have been proposed, namely using candidate sampling for a more efficient formulation of the objective function, and the subsampling of frequently occurring words.

Candidate Sampling

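To find word representations that predict surrounding words within a context with high accuracy, the Skip-Gram model, for a given sequence of words $w_1, w_2, \dots, w_T$, maximizes the following objective of average log probability

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t) \]

over all target words and their respective contexts of size $c$. The probability predicted for any context word can be based on the inner product of the vector representations of the input and the output candidates, $v_{w_I}$ for the input word $w_I$ and $v'_{w_O}$ for the output word $w_O$, respectively, normalized to conform to the requirements of a probability distribution over all words in the vocabulary of size $V$ using the Softmax function:

\[ p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \]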

However, the complexity of calculating these probabilities and related gradients for all words becomes prohibitive as the vocabulary grows.
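The cost of the full Softmax can be seen in a small sketch: every prediction requires one inner product per vocabulary word before normalizing. The code below is a minimal illustration in plain Python; the function name `softmax_probs` and the toy vectors are assumptions made for this example only.

```python
import math
from typing import List


def softmax_probs(v_in: List[float], out_vectors: List[List[float]]) -> List[float]:
    """Probability of each output word given the input vector, via a full Softmax."""
    # One inner product per vocabulary word: this is the O(V) cost per prediction.
    scores = [sum(a * b for a, b in zip(v_in, v_out)) for v_out in out_vectors]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical input vector and a three-word output vocabulary.
probs = softmax_probs([1.0, 0.5], [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
```

With a vocabulary of several hundred thousand words, this loop (and the matching gradient computation) runs for every training pair, which motivates the cheaper alternatives described next.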

One alternative is the Hierarchical Softmax Bengio et al. (2003) that reduces the number of computations to about $\log_2(V)$, where $V$ is the vocabulary size, by representing the output layer as a balanced binary tree. Using Huffman codes to obtain short codes for frequent words further speeds up training.

An alternative to the Hierarchical Softmax function that reduces the number of computations required for inference and back-propagation is Noise Contrastive Estimation (NCE) Gutmann & Hyvärinen (2012). Instead of calculating a probability distribution over all possible target words, NCE uses logistic regression to distinguish a target from samples from a noise distribution. NCE approximately maximizes the log probability of the Softmax.

Mikolov et al. (2013a) simplify NCE by introducing Negative Sampling (NEG), which uses only samples from the noise distribution and obviates the need for the numerical probabilities of the noise distribution itself.

Either NCE or NEG replaces the expression for the log probability $\log p(w_O \mid w_I)$ of a context word $w_O$ given the input word $w_I$ in the Skip-Gram objective function, and the network is trained using the back-propagation signals resulting from the probabilities predicted for noise samples and actual context words during training.
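As a rough illustration of how Negative Sampling replaces the Softmax term, the sketch below computes the loss for one (input, context) pair and a handful of noise words. It assumes the standard NEG formulation of Mikolov et al., in which the true context word is scored with a logistic function and each noise word with its complement; the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_loss(v_in: List[float], v_pos: List[float],
                      v_negs: List[List[float]]) -> float:
    """Negative of the NEG objective for one true context word and k noise words."""
    positive_term = log_sigmoid(dot(v_pos, v_in))
    noise_term = sum(log_sigmoid(-dot(v_neg, v_in)) for v_neg in v_negs)
    return -(positive_term + noise_term)


# Hypothetical vectors: the loss is low when the true word aligns with the input
# and the noise word points away, and high in the reverse situation.
good = neg_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = neg_sampling_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

Only the vectors of the true context word and the sampled noise words receive gradient updates, so the cost per training pair depends on the number of noise samples rather than on the vocabulary size.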

Mikolov et al. (2013a) suggest values for the number of noise samples $k$ in the range of 2-5 for large, and 5-20 for smaller data sets. In addition, using NCE or NEG requires defining the noise distribution, and the authors recommend drawing from the unigram distribution raised to the power $3/4$.
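A minimal sketch of such a noise distribution follows, assuming the exponent of 3/4 proposed by Mikolov et al. The helper name `noise_distribution`, the toy counts, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import Counter
from typing import Dict


def noise_distribution(counts: Dict[str, int], power: float = 0.75) -> Dict[str, float]:
    """Unigram counts raised to `power` and renormalized to sum to one."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical counts: one very frequent and one rare word.
counts = Counter({"the": 1000, "cat": 10})
noise = noise_distribution(counts)

# Draw k = 5 noise words for one training example (seed fixed for repeatability).
rng = random.Random(0)
negatives = rng.choices(list(noise), weights=list(noise.values()), k=5)
```

Raising counts to a power below one flattens the distribution, so rare words are drawn as noise more often than their raw frequency would suggest.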

Subsampling Frequent Words

The frequency distribution of words in large corpora tends to be uneven. To address the imbalance caused by very frequent but less informative words that can dilute the quality of the word vectors, Mikolov et al. (2013a) introduce subsampling of frequent words, and discard each word $w_i$ in the training set with probability

\[ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \]

where $t$ is a threshold (recommended at $10^{-5}$), and $f(w_i)$ is the frequency of word $w_i$. The benefit of this subsampling approach is to significantly curtail the occurrence of words that are more frequent than $t$ while preserving the frequency ranking of words overall. The authors report improvements in both training speed and accuracy.
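The subsampling rule can be sketched in a few lines, assuming the discard probability 1 - sqrt(t / f(w)) from Mikolov et al., clipped at zero for words rarer than the threshold. Function names, the default threshold, and the example frequencies are illustrative assumptions.

```python
import math
import random
from typing import Dict, List


def discard_probability(freq: float, threshold: float = 1e-5) -> float:
    """Probability of dropping a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


def subsample(tokens: List[str], freqs: Dict[str, float],
              threshold: float, rng: random.Random) -> List[str]:
    """Keep each token with probability 1 - discard_probability(frequency)."""
    return [w for w in tokens
            if rng.random() >= discard_probability(freqs[w], threshold)]


# A word making up 1% of the corpus is dropped about 97% of the time at t = 1e-5,
# while a word rarer than the threshold is never dropped.
p_frequent = discard_probability(0.01)
p_rare = discard_probability(1e-6)
kept = subsample(["rare"] * 10, {"rare": 1e-6}, 1e-5, random.Random(0))
```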

Hyper Parameter Choices

Training the Skip-Gram model on text samples requires choices for preprocessing and the setting of model hyper-parameters. Hyper-parameters include:

  1. Context size $C$: increasing $C$ has been reported to boost accuracy through a larger number of training samples, but also increases training time. Mikolov et al. suggest randomizing the size of the context range for each training sample with probability $1/C$, where $C$ is the maximum context size. In practice, for $C = 5$, this means selecting a context size of 1, 2, 3, 4, or 5 with probability $1/5$ each.

  2. Minimum count for words to be included in the vocabulary: words that are observed less often than the minimum count are replaced by a token ‘UNK’ for unknown and all treated alike.

  3. Subsampling frequency $t$: as mentioned above, recommended at $10^{-5}$.

  4. Size of negative samples: 2-5 for large samples, 5-20 for smaller training sets.

  5. Embedding size: the dimensionality of the word vector increases computational complexity just as increasing the training set size does. Mikolov et al. (2013a) suggest that conventional vector size choices of 50-100 are too small and report significantly better results for ranges 300-1,000.

  6. Epochs to train: ranges from 3-50 have been reported, but this choice often depends on the constraints imposed by the size of the training set, the computational complexity resulting from other parameter choices, and available resources.

Additional choices include the optimization algorithm. Both standard stochastic gradient and adaptive methods like Adagrad or AdamOptimizer have been used. In either case, the initial learning rate and possibly decay rates may have to be selected, as well as batch sizes.

Evaluating Vector Quality

While the Skip-Gram model is trained to accurately predict context words for any given word, the desired outputs of the model are the learnt embeddings that represent each word in the vocabulary.

A number of tests have been developed to assess whether these vectors represent the multiple degrees of similarity that characterize words. These tests are based on the observation that word vectors encode semantic and syntactic relations in the relative locations of words, and that these relations can be recovered through vector algebra. In particular, for a relation $A : B$ analogous to $C : D$, the location of $D$ should closely correspond to the following operation:

\[ \vec{D} \approx \vec{C} + (\vec{B} - \vec{A}) \]

Mikolov et al. (2013a) have made available over 500 analogy pairs in 15 syntactic and semantic categories. Within each category, the base pairs are combined to yield four-valued analogies. For each, the location of the fourth word is calculated from the above vector operation. Then, the cosine distance of the correct fourth term is calculated, and compared to other words found in the neighborhood of the projected $\vec{D}$. According to the metric used for evaluation, a prediction is deemed correct when the correct term appears among the nearest $k$ neighbors.
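The analogy test can be sketched as follows: compute the expected location of the fourth word from the other three, then rank the remaining vocabulary by cosine similarity. The tiny three-dimensional embeddings below are made up for illustration, and excluding the three query words from the candidates is an assumption (it is common practice, but not stated in this paper).

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy_candidates(emb: Dict[str, Vector], a: str, b: str, c: str,
                       k: int = 5) -> List[str]:
    """Rank words by cosine similarity to the expected location c + (b - a)."""
    target = [vc + vb - va for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    others = [w for w in emb if w not in (a, b, c)]  # assumption: skip query words
    return sorted(others, key=lambda w: cosine(emb[w], target), reverse=True)[:k]


# Hypothetical toy embeddings, not learnt from data.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 1.0],
    "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 3.0],
}
top = analogy_candidates(toy, "man", "woman", "king", k=1)
```

A question counts as correct at Precision@k when the expected fourth word appears in the returned list of k candidates.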

Topic Count A or C B or D
Capital-common-Countries 22 tokyo japan
Capital-World 92 zagreb croatia
City-in-State 66 cleveland ohio
Currency 29 vietnam dong
Family 22 uncle aunt
Adjective-to-Adverb 31 usual usually
Opposite 28 tasteful distasteful
Comparative 36 young younger
Superlative 33 young youngest
Present-Participle 32 write writing
Nationality-Adjective 40 ukraine ukrainian
Past-Tense 39 writing wrote
Plural 36 woman women
Plural-verbs 29 write writes
Table 1: Analogy Test

In order to adapt this test to multiple languages, I translated the base pairs using the Google Translate API and then manually verified the result.

A few complications arise when translating the English word pairs into German, Spanish and French:

  • the translation yields the same single word for both elements of a pair in the target language (e.g., some adjectives and adverbs have the same word form in German), rendering the sample unsuitable for the geometric translation test. These pairs were excluded for the affected language.

  • the translation of the single-word source results in ngrams with more than one word. In these cases, I combined the ngrams to unigrams using underscores and replaced the ngrams with the result.

Learning a Projection Matrix

The result of each monolingual Skip-Gram model is an embedding matrix of dimensions $V_L \times D_L$, where $V_L$ is the size of the vocabulary for the language $L$, and $D_L$ is the corresponding embedding size. Hence, for each word $i$ in the source language there is a vector $x_i$, and for each word $j$ in the target language, there is a vector $z_j$.

We need to find a translation matrix $W$ so that for a correctly translated pair $(x_i, z_i)$, the matrix approximately translates the vector locations so that $W x_i \approx z_i$. The solution can be found by solving the optimization problem

\[ \min_{W} \sum_{i=1}^{n} \lVert W x_i - z_i \rVert^2 \]

over the $n$ known translation pairs, using gradient descent to minimize the above loss. The resulting translation matrix will estimate the expected location of a matching translation for any word in the source language given its vector representation, and cosine distance is then used to identify the nearest candidates. In practice, the translation matrix will be estimated using known translations obtained via the Google Translate API.
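The following sketch shows how such a translation matrix could be fitted with plain batch gradient descent on a squared-error loss, and how candidates could then be ranked by cosine similarity. It is a toy illustration under stated assumptions: two-dimensional vectors, a made-up target space obtained by rotating the source space, hypothetical target words `t_a`, `t_b`, `t_c`, and hand-picked learning rate and step count. It is not the TensorFlow implementation used for the experiments.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_translation_matrix(X: List[Vector], Z: List[Vector],
                           lr: float = 0.05, steps: int = 500) -> Matrix:
    """Minimize the mean of ||W x_i - z_i||^2 over all training pairs."""
    d_src, d_tgt, n = len(X[0]), len(Z[0]), len(X)
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(steps):
        grad = [[0.0] * d_src for _ in range(d_tgt)]
        for x, z in zip(X, Z):
            err = [p - t for p, t in zip(matvec(W, x), z)]
            for r in range(d_tgt):
                for c in range(d_src):
                    grad[r][c] += 2.0 * err[r] * x[c] / n
        for r in range(d_tgt):
            for c in range(d_src):
                W[r][c] -= lr * grad[r][c]
    return W


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def translation_candidates(W: Matrix, x: Vector, targets: Dict[str, Vector],
                           k: int = 5) -> List[str]:
    """Project a source vector and return the k most similar target words."""
    projected = matvec(W, x)
    return sorted(targets, key=lambda w: cosine(targets[w], projected), reverse=True)[:k]


# Toy setup: the target space is the source space rotated by 90 degrees.
W_true = [[0.0, -1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
Z = [matvec(W_true, x) for x in X]
W = fit_translation_matrix(X, Z)

targets = {"t_a": [0.0, 1.0], "t_b": [-1.0, 0.0], "t_c": [1.0, 0.0]}
best = translation_candidates(W, [3.0, 0.0], targets, k=1)
```

Dividing the gradient by the number of pairs only rescales the loss and leaves the minimizer unchanged. For realistic embedding sizes, a vectorized library would replace the nested loops.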

The Data: Wikipedia

Corpus Size & Diversity

The empirical application of word2vec word and phrase translation uses monolingual Wikipedia corpora available online in the four languages shown in Table 2.

Figure 5: Article Length Distribution

The English corpus is more than twice as large as the second-largest corpus, German, counting 5.3 million articles, over 2 billion tokens and 8.2 million distinct word forms.

Language Articles Sentences Tokens Word Forms
English 5.3 95.8 2091.5 8.2
German 2.0 49.8 738.1 8.9
Spanish 1.3 21.2 637.1 3.0
French 1.8 20.9 571.4 3.1
Table 2: Wikipedia corpus statistics (in million)

High-level statistics for the four language corpora highlight some differences between languages that may impact translation performance. For instance, the German corpus contains more unique word forms than the English corpus while only containing 35% of the number of tokens. The number of tokens per unique word form is 85 for German, 187 for French, 216 for Spanish and 254 for English.

In order to exclude less meaningful material like redirects and others, articles with fewer than 50 words were excluded, resulting in the reduced sample sizes shown in Figure 5.

Parts-of-Speech & Entity Tags

To obtain further insight into differences in language structure, the corpora were parsed using SpaCy (English & German) and Stanford CoreNLP (French & Spanish).

Figure 6: Sentence Length Distribution

Sentence parsing revealed markedly shorter sentences in German, compared to longer sentences in Spanish and French (see Figure 6).

Phrase Modeling

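I used a simple approach to identify bigrams that represent common phrases. In particular, I used the following scoring formula to identify pairs of words that are more likely to occur together:

\[ \text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)} \]

where $\delta$ represents a minimum count threshold for each unigram.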
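One pass of this phrase detection can be sketched as below. The sketch assumes the bigram score proposed by Mikolov et al. (2013c), (count(a b) - delta) / (count(a) * count(b)), and joins accepted bigrams with an underscore. The function names, the toy sentence, and the values of `delta` and `threshold` are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def score_bigrams(tokens: List[str], delta: float) -> Dict[Tuple[str, str], float]:
    """Score every adjacent word pair by discounted co-occurrence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(tokens: List[str], scores: Dict[Tuple[str, str], float],
                  threshold: float) -> List[str]:
    """Greedily join adjacent pairs whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy text: only "new york" occurs often enough to be merged.
tokens = "new york is big and new york was old".split()
scores = score_bigrams(tokens, delta=1)
phrases = merge_phrases(tokens, scores, threshold=0.1)
```

Running the two functions again on the merged output would allow trigrams and longer phrases to form from an already merged bigram and its neighbor.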

Rank English German Spanish French
1 the . de ,
2 , , , de
3 . der el .
4 of die la la
5 and und . le
6 in in en "
7 " y et
8 to von a l’
9 a ) " à
10 was den que les
Table 3: Most frequent words

To allow for phrases composed of more than two unigrams, bigrams were scored repeatedly and, after each iteration, combined if their score exceeded a threshold. The threshold was gradually reduced. After three iterations, about 20% of the tokens consisted of ngrams.

English | German | Spanish | French
such as | vor allem | sin embargo | c’ est
has been | z. b. | estados unidos | qu’ il
as well as | gibt es | así como | à partir
United States | im jahr | se encuentra | ainsi que
had been | unter anderem | por ejemplo | par exemple
have been | nicht mehr | cuenta con | a été
based on | befindet sich | junto con | au cours
known as | im jahre | se encuentran | n’ a
would be | gab es | lo largo | la plupart
New York | zum beispiel | se convirtió | au début
Table 4: Most frequent ngrams


Hyperparameter Tuning

Since model training with the entire corpus is quite resource intensive (90 min per epoch for the English corpus on 16 cores), we tested various model configurations and preprocessing techniques on smaller subsets of the data comprising 100 million tokens each.

The models used text input at various stages of preprocessing:

  • raw: unprocessed text

  • clean: removed punctuation and symbols recognized by language parsers

  • ngrams: identified by phrase modeling process

The following hyper parameters were tested (baseline values, taken from the configuration shared by most rows in Table 5, are noted for each):

  • Embedding dimensionality (100, 200, 300; baseline 200)

  • NCE candidate sample size (10, 25, 50; baseline 25)

  • Subsample threshold (1e-3 vs custom threshold, set to subsample 0.2% most frequent words; baseline 1e-3)

  • Context window size (3, 5, 8; baseline 5)

All models were trained for 5 epochs using stochastic gradient descent with a starting learning rate of 0.25 that declined linearly to 0.001. Models were evaluated using Precision@1 on the analogy test, using only questions covered by the reduced sample vocabulary.
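The linear learning-rate schedule described above can be written as a one-line interpolation. The function name `linear_lr` and the step counts in the usage lines are assumptions; the start and end values are the ones stated in the text.

```python
def linear_lr(step: int, total_steps: int,
              start: float = 0.25, end: float = 0.001) -> float:
    """Learning rate that declines linearly from `start` to `end` over training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + fraction * (end - start)


# Hypothetical run of 1,000 update steps.
lr_first = linear_lr(0, 1000)
lr_middle = linear_lr(500, 1000)
lr_last = linear_lr(1000, 1000)
```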

The results show:

  • English performs significantly better for the same (reduced) training set size.

  • the Spanish and French languages did not benefit from text preprocessing, and achieved their highest score with the raw text input.

  • German alone performed best with a larger embedding size

  • English performed best with a larger negative sample size.

Relative to baseline values for the same input:

  • A higher embedding size benefited all languages except English, and reducing the embedding size had a negative impact for all languages.

  • A larger context improved performance for French and Spanish, while all languages did worse for a smaller context.

  • A larger number of negative samples had a small positive impact for all except French, while a smaller number reduced performance for all.

  • A custom (higher) subsampling threshold had a negative impact for English and French and only a small positive impact for German.

The validity of these results is limited by the number of training rounds and the reduced vocabulary, but they provide orientation regarding preferable parameter settings for training with the full set.

Input | Embedding Size | Negative Samples | Subsample Threshold | Context | P@1 English | P@1 French | P@1 Spanish | P@1 German
clean | 200 | 50 | 0.001 | 5 | 0.0411 | 0.0125 | 0.0106 | 0.0070
clean | 200 | 25 | 0.001 | 5 | 0.0401 | 0.0124 | 0.0098 | 0.0062
clean | 200 | 25 | 0.001 | 8 | 0.0354 | 0.0110 | 0.0109 | 0.0051
clean | 300 | 25 | 0.001 | 5 | 0.0345 | 0.0126 | 0.0104 | 0.0117
raw | 200 | 25 | 0.001 | 5 | 0.0319 | 0.0137 | 0.0133 | 0.0079
clean | 200 | 25 | 0.001 | 3 | 0.0287 | 0.0097 | 0.0101 | 0.0064
clean | 100 | 25 | 0.001 | 5 | 0.0263 | 0.0085 | 0.0085 | 0.0055
clean | 200 | 25 | custom* | 5 | 0.0222 | 0.0095 | 0.0098 | 0.0068
clean | 200 | 10 | 0.001 | 5 | 0.0186 | 0.0095 | 0.0080 | 0.0059
ngrams | 200 | 25 | 0.001 | 5 | 0.0170 | 0.0072 | 0.0078 | 0.0045
clean, ngrams | 200 | 25 | 0.001 | 5 | 0.0122 | 0.0086 | 0.0078 | 0.0050
Table 5: Sample Results (parameter settings and Precision@1 on the analogy test; * custom threshold set to subsample the 0.2% most frequent words)

Monolingual Benchmarks

Mikolov et al. (2013a) report the following accuracy for the word2vec Skip-Gram model in English:

Vector Size # Words Epochs Accuracy
300 783M 3 36.1%
300 783M 1 49.2%
300 1.6B 1 53.8%
600 783M 1 55.5%
1000 6B 1 66.5%
Table 6: English Benchmark

Monolingual word2vec

For the complete Wikipedia corpus for each language, various input formats and hyper parameter were applied to improve results and test for robustness. The choices took into account the sample results from above while adjusting for the larger data set (e.g., a smaller number of negative samples is recommended for larger data).

Model | Loss | Ngrams | Min. Count | Vocab Size | Testable Analogies | Embed. Size | Initial LR | Negative Samples | Subsample Threshold | P@1 | P@5
English | NCE | y | 25 | 481955 | 25392 | 250 | 0.025 | 25 | 0.0010 | 0.365 | 0.637
English | NCE | n | 25 | 481732 | 25392 | 250 | 0.025 | 25 | 0.0010 | 0.378 | 0.626
English | NCE | y | 100 | 219341 | 25268 | 200 | 0.030 | 15 | 0.0005 | 0.387 | 0.705
English | NCE | y | 500 | 82020 | 23736 | 200 | 0.030 | 15 | 0.0010 | 0.426 | 0.747
English | NCE | y | 500 | 82020 | 23736 | 200 | 0.030 | 15 | 0.0010 | 0.434 | 0.878
English | NEG | y | 500 | 82020 | 23736 | 200 | 0.030 | 15 | 0.0010 | 0.435 | 0.878
English | NCE | y | 500 | 82020 | 23736 | 200 | 0.030 | 15 | 0.0005 | 0.435 | 0.748
French | NCE | y | 25 | 241742 | 18498 | 250 | 0.025 | 25 | 0.0010 | 0.094 | 0.637
French | NCE | n | 5 | 624810 | 18832 | 200 | 0.030 | 15 | 0.0010 | 0.100 | 0.626
French | NCE | n | 25 | 237069 | 20962 | 250 | 0.030 | 15 | 0.0005 | 0.102 | 0.328
French | NCE | y | 250 | 62349 | 13712 | 200 | 0.030 | 15 | 0.0010 | 0.140 | 0.328
French | NCE | y | 500 | 43610 | 9438 | 200 | 0.030 | 15 | 0.0010 | 0.190 | 0.328
German | NCE | n | 25 | 536947 | 6414 | 250 | 0.025 | 25 | 0.0005 | 0.119 | 0.253
German | NCE | y | 250 | 103317 | 3772 | 200 | 0.030 | 15 | 0.0010 | 0.136 | 0.328
German | NCE | y | 500 | 65179 | 2528 | 200 | 0.030 | 15 | 0.0010 | 0.168 | 0.422
Spanish | NCE | n | 5 | 626183 | 8162 | 200 | 0.025 | 25 | 0.0010 | 0.098 | 0.637
Spanish | NCE | n | 25 | 236939 | 5026 | 250 | 0.030 | 15 | 0.0010 | 0.106 | 0.626
Spanish | NCE | y | 250 | 63308 | 2572 | 200 | 0.030 | 15 | 0.0005 | 0.210 | 0.328
Table 7: Monolingual word2vec Model Results

Models with both raw input and phrase (ngram) input were tested using Noise Contrastive Estimation and Negative Sampling, but the latter did not change results (as expected). Embedding sizes ranged from 200 to 250, negative samples from 15 to 25, and the initial learning rate from 0.025 to 0.03, decaying linearly over the course of training. The subsample threshold varied between 0.0005 and 0.001.

The most significant impact resulted from reducing the vocabulary size by increasing the minimum count for a word to be included from 5 for the smaller samples to at least 25, and up to 500 (which also reduced the number of analogies available in the vocabulary, esp. in German and Spanish).

Figure 7: Model Accuracy

All models were trained for 20 epochs using stochastic gradient descent in TensorFlow.

English, with over 2bn training tokens, performed best (while also covering most analogy word pairs), but did not achieve the benchmark performance reported above. The best models achieved P@1 accuracy of 43.5%, and P@5 accuracy of 87.8% when the vocabulary was reduced to 82,020 words. For a vocabulary almost six times this size, P@1 accuracy was still 37.8%.

Spanish, French, and German performed less well on the analogy test (which is tailored to the English language). P@1 accuracy ranges from 16.8% to 21%, and P@5 accuracy from 42.2% to 63.7%.

Overall, results behaved as expected in response to changes in parameter settings, with smaller vocabularies boosting precision. The accuracy curves in Figure 7 show that performance was still improving after 20 epochs, so further training (Mikolov et al. (2013a) suggest 3-50 epochs) would likely have further increased performance.

Performance across Analogies

Performance across the various analogy categories is uneven, and a closer look reveals some weaknesses of the models. Most models perform strongly on capital-country relations for common countries, nationality adjectives, and family relationships.

Figure 8: Model Accuracy by Analogy Topic

Most other areas offer more alternative matches or synonyms, and performance is often strong at the P@5 level, but not necessarily at the P@1 level. Currency-country relationships are poorly captured for all languages, which might be due to the input text containing less information on these (the original tests were conducted on Google News corpora with arguably more economic content).


Mikolov et al. (2013b) introduced word2vec for bilingual translation, and tested English to Spanish translation using the WMT11 corpus. It provides 575m tokens and 127K unique words in English, and 84m tokens and 107K unique words in Spanish. The authors reported P@1 accuracy of 33%, and P@5 accuracy of 51%.

P@5 accuracy arguably is a better measure given that there are often several valid translations, while the dictionary contains only one solution (the Google Translate API does not provide multiple options).

I used the best performing models to test the translation quality from English to any of the other languages, learning the transformation matrix on a training set of the 5,000 most frequent words with matching translations in the process. Results on the test set of 2,500 words per language were best for Spanish, with P@1 at 47.2% and P@5 at 62.7%. French and German performed less well.

P@k Spanish French German
1 47.2% 33.4% 32.0%
2 55.0% 39.6% 39.3%
3 58.3% 43.3% 43.9%
4 61.0% 45.6% 46.4%
5 62.7% 47.5% 48.3%
Table 8: Translation Test Performance

In order to illustrate how word2vec produces many useful related terms, I am displaying the top 5 options for the English words complexity, pleasure and monsters, marking the dictionary entry provided by Google Translate in bold.

Spanish | French | German
complejidad | complexité | Komplexität
predicción | généralisation | Dynamik
mecánica cuántica | formulation | Präzision
generalización | probabilités | dynamische
simplicidad | modélisation | dynamischen
Table 9: Translations for ‘complexity’

All languages translate complexity correctly, but produce additional interesting associations: both Spanish and French relate complexity to generalization, with Spanish also emphasizing prediction. German closely relates complexity to dynamics.

Pleasure, correctly translated in Spanish and German but with a debatable mistake (according to Google Translate) in French, also highlights interesting nuances: Spanish and French stress curiosity, while Spanish and German mention patience.

Spanish French German
placer plaisirs vergnügen
encanto curiosité neugier
paciencia bonheur geduld
curiosidad volontiers stillen
ternura plaisir erlebnis
Table 10: Translations for ‘Pleasure‘
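The P@k metric reported in Table 8 can be illustrated with the French candidates from Tables 9 and 10. In the sketch below, the ranked lists are copied from those tables; the reference translation ‘plaisir’ for pleasure is an assumption about the dictionary entry (the bold marking did not survive in the tables), and the function name `precision_at_k` is hypothetical.

```python
from typing import Dict, List


def precision_at_k(candidates: Dict[str, List[str]], gold: Dict[str, str],
                   k: int) -> float:
    """Share of source words whose reference translation is among the top-k candidates."""
    hits = sum(gold[word] in ranked[:k] for word, ranked in candidates.items())
    return hits / len(candidates)


# Ranked French candidates as listed in Tables 9 and 10.
french_candidates = {
    "complexity": ["complexité", "généralisation", "formulation",
                   "probabilités", "modélisation"],
    "pleasure": ["plaisirs", "curiosité", "bonheur", "volontiers", "plaisir"],
}
# Assumed dictionary entries (one reference translation per word).
french_gold = {"complexity": "complexité", "pleasure": "plaisir"}

p_at_1 = precision_at_k(french_candidates, french_gold, k=1)
p_at_5 = precision_at_k(french_candidates, french_gold, k=5)
```

Under these assumptions, the plural ‘plaisirs’ ranked first counts as a miss at k = 1, while the reference entry is still found within the top five, which is why P@5 is the more forgiving measure.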

Figure 9 illustrates the projection of English word vectors into the Spanish vector space using the translation matrix.

Figure 9: t-SNE Projection of Bilingual word2vec Space

A random sample of 75 English terms was selected based on correct translation at the P@1 accuracy level. The English vectors were translated into the foreign language space, and both the English vector and the nearest Spanish neighbor vector were projected into two dimensions using t-distributed stochastic neighbor embedding (t-SNE) and displayed with their labels.

The proximity of similar concepts both within a language and across language boundaries is striking (see the highlighted shark and elephant example and their translation).


It is possible to produce high-quality translations using largely unsupervised learning, with limited supervised learning input to translate between monolingual vector spaces. Preprocessing is also quite limited compared to standard dependency parsing.

Using the Wikipedia corpus instead of the smaller WMT11 corpus shows that accuracy can be increased over the previously achieved benchmark. In addition, the translations come with a similarity metric that expresses confidence.

These translations can be extended to multiple bilingual pairs without major effort (e.g., to all six pairs among the four languages considered, and in either direction, for 12 applications).

There are multiple opportunities to further improve results:

  • Monolingual models did not perform optimally, and fell short of benchmark performance. This seems in part to be due to the nature of the analogies chosen for testing, as well as the text material and the quality of the translations. The ‘city-in-state’ pairs emphasize US geography, and the ‘country-currency’ pairs found few matches for all languages, arguably because these terms are used less frequently in Wikipedia than in the Google News corpus used by the original authors. It would certainly be useful to blend multiple corpora to obtain more comprehensive coverage.

  • The reduced coverage of translated analogies also suggests reviewing in more detail how to adapt the translations. The manual review of the Google Translate API results produced a significant number of refinements. Syntactic analogies may also need to be tailored to each language’s grammar, as several areas (e.g. adjective-adverb) produced far fewer meaningful pairs in German than in English.

  • Monolingual models would also benefit from longer training and more computing resources as performance on the analogy test was still improving after the allotted amount of time ended. This would also allow for better hyper parameter settings, in particular with respect to larger context windows (resulting in more training examples per epoch), and higher embedding dimensionality (increasing computational complexity). Furthermore, additional preprocessing options could be explored, e.g. to utilize information from parts-of-speech tagging to distinguish between identical words used in different functions (as noun and verb).

  • Bilingual models performed quite well despite some shortcomings of the monolingual models, in particular in the English-Spanish case. A thorough review of the translation quality would likely improve the accuracy of the translation matrix.

  • The word2vec-based translations could possibly be improved by complementing them with additional techniques that have been used successfully, e.g. removing named entities or using the edit distance to account for morphological similarity.


  • Bengio & Lecun (2007) Bengio, Yoshua and Lecun, Yann. Scaling learning algorithms towards AI. Large-scale kernel machines, 2007.
  • Bengio et al. (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003. ISSN 1533-7928.
  • Gutmann & Hyvärinen (2012) Gutmann, Michael U. and Hyvärinen, Aapo. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. J. Mach. Learn. Res., 13(1):307–361, February 2012. ISSN 1532-4435.
  • Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs], January 2013a.
  • Mikolov et al. (2013b) Mikolov, Tomas, Le, Quoc V., and Sutskever, Ilya. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs], September 2013b.
  • Mikolov et al. (2013c) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pp. 3111–3119, USA, 2013c. Curran Associates Inc.
  • Mikolov et al. (2013d) Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Atlanta, Georgia, June 2013d. Association for Computational Linguistics.
  • Rumelhart et al. (1986) Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323:533, October 1986.