Vectorial word representations have proven useful for multiple NLP tasks [1, 2]. It’s been shown that meaningful representations, capturing syntactic and semantic similarity, can be learned from unlabled data. Along with a (usually smaller) set of labeled data, these representations allows to exploit unlabeled data and improve the generalization performance on some given task, even allowing to generalize out of the vocabulary observed in the labeled data only.
While the majority of previous work has concentrated on the monolingual case, recent work has started looking at learning word representations that are aligned across languages [3, 4, 5]. These representations have been applied to a variety of problems, including cross-lingual document classification  and phrase-based machine translation . A common property of these approaches is that a word-level alignment of translated sentences is leveraged, either to derive a regularization term relating word embeddings across languages [3, 4].
In this workshop paper, we experiment with a method to learn multilingual word representations that does without word-to-word alignment of bilingual corpora during training. We only require aligned sentences and do not exploit word-level alignments (e.g. extracted using GIZA++, as is usual). To do so, we propose a multilingual autoencoder model, that learns to relate the hidden representation of paired bag-of-words sentences.
We use these representations in the context of cross-lingual document classification where labeled dataset can be available in one language, but not in another one. With the multilingual word representations, we want to learn a classifier with documents in one language and then use it on documents in another language. Our preliminary experiments suggest that our method is competitive with the representations learned by, which rely on word-level alignments.
In Section 2, we describe the initial autoencoder model that can learn a representation from which an input bag-of-words can be reconstructed. Then, in Section 3, we extend this autoencoder for the multilingual setting. Related work is discussed in Section 4 and experiments are presented in Section 5.
2 Autoencoder for Bags-of-Words
Let be the bag-of-words representation of a sentence. Specifically, each is a word index from a fixed vocabulary of words. As this is a bag-of-words, the order of the words within does not correspond to the word order in the original sentence. We wish to learn a -dimensional vectorial representation of our words from a training set of sentence bag-of-words .
We propose to achieve this by using an autoencoder model that encodes an input bag-of-words as the sum of its word representations (embeddings) and, using a non-linear decoder, is trained to reproduce the original bag-of-words.
Specifically, let matrix be the
matrix whose columns are the vector representations for each word. The aggregated representation for a given bag-of-words will then be:
To learn meaningful word representations, we wish to encourage to contain information that allows for the reconstruction of the original bag-of-words . This is done by choosing a reconstruction loss and by designing a parametrized decoder which will be trained jointly with the word representations so as to minimize this loss.
Because words are implicitly high-dimensional objects, care must be taken in the choice of reconstruction loss and decoder for stochastic gradient descent to be efficient. For instance,Dauphin et al.  recently designed an efficient algorithm for reconstructing binary bag-of-words representations of documents, in which the input is a fixed size vector where each element is associated with a word and is set to 1 only if the word appears at least once in the document. They use importance sampling to avoid reconstructing the whole -dimensional input vector, which would be expensive.
In this work, we propose a different approach. We assume that, from the decoder, we can obtain a probability distribution over any wordobserved at the reconstruction output layer . Then, we treat the input bag-of-words as a -trials multinomial sample from that distribution and use as the reconstruction loss its negative log-likelihood:
We now must ensure that the decoder can compute efficiently from . Specifically, we’d like to avoid a procedure scaling linear with the vocabulary size , since will be very large in practice. This precludes any procedure that would compute the numerator of for each possible word separetly and normalize so it sums to one.
We instead opt for an approach borrowed from the work on neural network language models[7, 8]. Specifically, we use a probabilistic tree decomposition of . Let’s assume each word has been placed at the leaf of a binary tree. We can then treat the sampling of a word as a stochastic path from the root of the tree to one of the leaf.
We note as as the sequence of internal nodes in the path from the root to a given word , with always corresponding to the root. We will not as the vector of associated left/right branching choices on that path, where means the path branches left at internal node and branches right if
otherwise. Then, the probabilityof a certain word observed in the bag-of-words is computed as
where is output by the decoder. By using a full binary tree of words, the number of different decoder outputs required to compute will be logarithmic in the vocabular size . Since there are words in the bag-of-words, at most outputs are thus required from the decoder. This is of course a worse case scenario, since words will share internal nodes between their paths, for which the decoder output can be computed just once. As for organizing words into a tree, as in Larochelle and Lauly  we used a random assignment of words to the leaves of the full binary tree, which we have found to work well in practice.
Finally, we need to choose of parametrized form for the decoder. We choose the following non-linear form:
where is an element wise non-linearity, is a
-dimensional bias vector,is a (-1)-dimensional bias vector, is a matrix and 111While the literature on autoencoders usually refers to the post-nonlinearity activation vector as the hidden layer, we use a different description here simply to be consistent with the representation we will use for documents in our experiments, where the non-linearity will not be used.
3 Multilingual Bag-of-words
Let’s now assume that for each sentence bag-of-words in some source language , we have an associated bag-of-words for the same sentence translated in some target language by a human expert. Assuming we have a training set of such pairs, we’d like to use it to learn representations in both languages that are aligned, such that pairs of translated words have similar representations.
To achieve this, we propose to augment the regular autoencoder proposed in Section 2 so that, from the sentence representation in a given language, a reconstruction can be attempted of the original sentence in the other language.
Specifically, we now define language specific word representation matrices and , corresponding to the languages of the words in and respectively. Let and also be the number of words in the vocabulary of both languages, which can be different. The word representations however are of the same size in both languages. The sentence-level representation extracted by the encoder becomes
From the sentence in either languages, we want to be able to perform a reconstruction of the original sentence in any of the languages. In particular, given a representation in any language, we’d like a decoder that can perfrom a reconstruction in language and another decoder that can reconstruct in language . Again, we use decoders of the form proposed in Section 2, but let the decoders of each language have their own parameters and :
where can be either or . Notice that we share the bias in the nonlinearity across decoders.
This encoder/decoder structure allows us to learn a mapping within each language and across the languages. Specifically, for a given pair , we can train the model to (1) construct from , (2) construct from , (3) reconstruct from itself and (4) reconstruct from itself. In our experiments, performed each of these 4 tasks simultaneously, combining the equally weighting the learning gradient from each. Experiments on various weighting schemes should be investigated and are left for future work. Another promising direction of investigation to the exploit the fact that tasks (3) and (4) could be performed on extra monolingual corpora, which is more plentiful.
4 Related work
We mentioned that recent work has considered the problem of learning multilingual representations of words and usually relies on word-level alignments. Klementiev et al.  propose to train simultaneously two neural network languages models, along with a regularization term that encourages pairs of frequently aligned words to have similar word embeddings. Zou et al.  use a similar approach, with a different form for the regularizor and neural network language models as in . In our work, we specifically investigate whether a method that does not rely on word-level alignments can learn comparably useful multilingual embeddings in the context of document classification.
Looking more generally at neural networks that learn multilingual representations of words or phrases, we mention the work of Gao et al.  which showed that a useful linear mapping between separately training monolingual skip-gram language models could be learned. They too however rely on the specification of pairs of words in the two languages to align. Mikolov et al.  also propose a method for training a neural network to learn useful representations of phrases (i.e. short segments of words), in the context of a phrase-based translation model. In this case, phrase-level alignments (usually extracted from word-level alignments) are required.
To evaluate the quality of the word embeddigns learned by our model, we experiment with a task of cross-lingual document classication. The setup is as follows. A labeled data set of documents in some language is available to train a classifier, however we are interested in classifying documents in a different language at test time. To achieve this, we leverage some bilingual corpora, which importantly is not labeled with any document-level categories. This bilingual corpora is used instead to learn document representations in both languages and that are enroucaged to be invariant to translations from one language to another. The hope is thus that we can successfully apply the classifier trained on document representations for language directly to the document representations for language .
We trained our multilingual autoencoder to learn bilingual word representation between English and French and between English and German. To train the autoencoder, we used the English/French and English/German section pairs of the Europarl-v7 dataset222http://www.statmt.org/europarl/. This data is composed of about two million sentences, where each sentence is translated in all the relevant languages.
For our crosslingual classification problem, we used the English, French and German sections of the Reuters RCV1/RCV2 corpus, as provided by Amini et al. 333http://multilingreuters.iit.nrc.ca/ReutersMultiLingualMultiView.htm. There are 18758, 26648 and 29953 documents (news stories) for English, French and German respectively. Document categories are organized in a hierarchy in this dataset. A 4-category classification problem was thus created by using the 4 top-level categories in the hierarchy (CCAT, ECAT, GCAT and MCAT). The set of documents for each language is split into training, validation and testing sets of size 70%, 15% and 15% respectively. The raw documents are represented in the form of a bag-of-words using a TFIDF-based weighting scheme. Generally, this setup follows the one used by Klementiev et al. , but uses the preprocessing pipeline of Amini et al. .
5.2 Crosslingual classification
As described earlier, crosslingual document classification is performed by training a document classifier on documents in one language and applying that classifier on documents in a different language at test time. Documents in a language are representated as a linear combination of its word embeddings learned for that language. Thus, classification performance relies heavily on the quality of multlingual word embeddigns between languages, and specifically on whether similar words across languages have similar embeddings.
Overall, the experimental procedure is as follows.
Train bilingual word representations and on sentence pairs extracted from Europarl-v7 for languages and (we use a separate validation set to early-stop training).
Train document classifier on the Reuters training set for language , where documents are represented using the word representations (we use the validation set for the same language to perform model selection).
Use the classifier trained in the previous step on the Reuters test set for language , using the word representations to represent the documents.
We used a linear SVM as our classifier.
We compare our representations to those learned by Klementiev et al. 444The trained word embeddings were downloaded from http://people.mmci.uni-saarland.de/~aklement/data/distrib/. This is achieved by simply skipping the first step of training the bilingual word representations and directly using those of Klementiev et al.  in step 2 and 3. The provided word embeddings are of size 80 for the English and French language pair, and of size 40 for the English and German pair. The vocabulary used by Klementiev et al.  consisted in 43614 words in English, 35891 words in French and 50110 words in German. The same vocabulary was used by our model, to represent the Reuters documents.
In all cases, document representations were obtained by multiplying the word embeddings matrix with either the TFIDF-based bag-of-words feature vector or its binary version (the choice of which method to use is made based on the validation set performance). We normalized the TFIDF-based weights to sum to one.
Test set classification error results are reported in Table 1. We observe that the word representations learned by our autoencoder are competitive with those provided by Klementiev et al. . One will notice that the results for Klementiev et al.  are worse than those reported in the original reference. This difference might be due to the fact that our preprocessing of the Reuters data, which comes from Amini et al. , is different from the one in Klementiev et al. . In particular, we note that Klementiev et al.  ignored documents that belonged to multiple categories, while Amini et al.  included them by assigning them to the category with the least training examples.
Table 1 also shows, for a few French words, what are the nearest neighbor words in the English embedding space. A more complete picture is presented in the t-SNE visualization  of Figure 2. It shows a 2D visualization of the French/English word embeddings, for the 600 most frequent words in both languages. Both illustrations confirm that the multilingual autoencoder was able to learn similar embeddings for similar words across the two languages.
6 Conclusion and Future Work
We presented evidence that meaningful multilingual word representations could be learned without relying on word-level alignments. Our proposed multilingual autoencoder was able to perform competitively on a crosslingual document classification task, compared to a word representation learning method that exploits word-level alignments.
Encouraged by these preliminary results, our future work will investigate extensions of our bag-of-words multilingual autoencoder to bags-of-ngrams, where the model would also have to learn representations for short phrases. Such a model should be particularly useful in the context of a machine translation system. Thanks to the use of a probabilistic tree in the output layer, our model could efficiently assign scores to pairs of sentences. We thus think it could act as a useful, complementary metric in a standard phrase-based translation system.
Thanks to Alexandre Allauzen for his help on this subject. Thanks also to Cyril Goutte for providing the classification dataset. Finally, a big thanks to Alexandre Klementiev and Ivan Titov for their help.
Turian et al. 
Joseph Turian, Lev Ratinov, and Yoshua Bengio.
Word representations: A simple and general method for semi-supervised learning.In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL2010), pages 384–394. Association for Computational Linguistics, 2010.
Collobert et al. 
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa.
Natural Language Processing (Almost) from Scratch.
Journal of Machine Learning Research, 12:2493–2537, 2011.
Klementiev et al. 
Alexandre Klementiev, Ivan Titov, and Binod Bhattarai.
Inducing Crosslingual Distributed Representations of Words.In Proceedings of the International Conference on Computational Linguistics (COLING), 2012.
- Zou et al.  Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 2013.
- Mikolov et al.  Tomas Mikolov, Quoc Le, and Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation. Technical report, arXiv, 2013.
- Dauphin et al.  Yann Dauphin, Xavier Glorot, and Yoshua Bengio. Large-Scale Learning of Embeddings with Reconstruction Sampling. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 945–952. Omnipress, 2011.
Morin and Bengio 
Frederic Morin and Yoshua Bengio.
Hierarchical Probabilistic Neural Network Language Model.
In Proceedings of the 10th International Workshop on Artificial
Intelligence and Statistics (AISTATS 2005)
, pages 246–252. Society for Artificial Intelligence and Statistics, 2005.
- Mnih and Hinton  Andriy Mnih and Geoffrey E Hinton. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages 1081–1088, 2009.
- Larochelle and Lauly  Hugo Larochelle and Stanislas Lauly. A Neural Autoregressive Topic Model. In Advances in Neural Information Processing Systems 25 (NIPS 25), 2012.
- Gao et al.  Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. Learning Semantic Representations for the Phrase Translation Model. Technical report, Microsoft Research, 2013.
- Amini et al.  Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 28–36, 2009.
- van der Maaten and Hinton  Laurens van der Maaten and Geoffrey E Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.