Leveraging Monolingual Data for Crosslingual Compositional Word Representations

12/19/2014
by Hubert Soyer, et al.

In this work, we present a novel neural network based architecture for inducing compositional crosslingual word representations. Unlike previously proposed methods, our method fulfills the following three criteria: it constrains the word-level representations to be compositional, it is capable of leveraging both bilingual and monolingual data, and it is scalable to large vocabularies and large quantities of data. The key component of our approach is what we refer to as a monolingual inclusion criterion, which exploits the observation that phrases are more closely semantically related to their sub-phrases than to other randomly sampled phrases. We evaluate our method on a well-established crosslingual document classification task and achieve results that are either comparable to, or greatly improve upon, previous state-of-the-art methods. Concretely, our method reaches 92.7% and 84.4% accuracy on the English to German and German to English sub-tasks respectively. The former advances the state of the art by 0.9 points of accuracy, while the latter is an absolute improvement of 7.7 points of accuracy over the previous state of the art, corresponding to an error reduction of 33.0%.


1 Introduction

Dense vector representations (embeddings) of words and phrases, as opposed to discrete feature templates, have recently allowed for notable advances in the state of the art of Natural Language Processing (NLP) (Socher et al., 2013; Baroni et al., 2014). These representations are typically induced from large unannotated corpora by predicting a word given its context (Collobert & Weston, 2008). Unlike discrete feature templates, these representations allow supervised methods to readily make use of unlabeled data, effectively making them semi-supervised (Turian et al., 2010).

A recent focus has been on crosslingual, rather than monolingual, representations. Crosslingual representations are induced to represent words, phrases, or documents for more than one language, where the representations are constrained to preserve representational similarity or can be transformed between languages (Klementiev et al., 2012; Mikolov et al., 2013b; Hermann & Blunsom, 2014). In particular, crosslingual representations can be helpful for tasks such as translation or for leveraging training data in a source language when little or no training data is available for a target language. Examples of such transfer learning tasks are crosslingual sentiment analysis (Wan, 2009) and crosslingual document classification (Klementiev et al., 2012).

Mikolov et al. (2013b) induced language-specific word representations, learned a linear mapping between the language-specific representations using bilingual word pairs, and evaluated their approach on single word translation. Klementiev et al. (2012) used automatically aligned sentences and words to constrain word representations across languages based on the number of times a given word in one language was aligned to a word in another language. They also introduced a dataset for crosslingual document classification and evaluated their work on this task. Hermann & Blunsom (2014) introduced a method to induce compositional crosslingual word representations from sentence-aligned bilingual corpora. Their method is trained to distinguish the sentence pairs given in a bilingual corpus from randomly generated pairs. The model represents sentences as a function of their word representations, encouraging the word representations to be compositional. Another approach has been to use auto-encoders and bag-of-words representations of sentences, which can easily be applied to jointly leverage both bilingual and monolingual data (Chandar A P et al., 2014). Most recently, Gouws et al. (2014) extended the Skip-Gram model of Mikolov et al. (2013a) to be applicable to bilingual data. Just like the Skip-Gram model, they predict the words in the context of a given word, but additionally constrain the linear combinations of word representations from aligned sentences to be similar.

However, these previous methods all suffer from one or more of three shortcomings. Klementiev et al. (2012); Mikolov et al. (2013b); Gouws et al. (2014) all learn their representations using a word-level monolingual objective. This effectively means that compositionality is not encouraged by the monolingual objective, which may be problematic when composing word representations for a phrase- or document-level task. While the method of Hermann & Blunsom (2014) allows for arbitrary composition functions, it is limited to using sentence-aligned bilingual data and it is not immediately obvious how it could be extended to make use of monolingual data. Lastly, while the method of Chandar A P et al. (2014) suffers from neither of the above issues, it represents each sentence as a bag-of-words vector with the size of the whole vocabulary. This leads to computational scaling issues and necessitates a vocabulary cut-off, which may hamper performance for compounding languages such as German.

The question that we pose is thus: can a single method

  1. constrain the word-level representations to be compositional,

  2. leverage both monolingual and bilingual data, and

  3. scale to large vocabulary sizes without greatly impacting training time?

In this work, we propose a neural network based architecture for creating crosslingual compositional word representations. The method is agnostic to the choice of composition function and combines a bilingual training objective with a novel way of training monolingual word representations. This enables us to draw from a plethora of unlabeled monolingual data, while our method is efficient enough to be trained using roughly seven million sentences in about six hours on a single-core desktop computer. We evaluate our method on a well-established document classification task and achieve results for both sub-tasks that are either comparable or greatly improve upon the previous state of the art. For the German to English sub-task our method achieves 84.4% in accuracy, an error reduction of 33.0% in comparison to the previous state of the art.

2 Model

2.1 Inducing Crosslingual Word Representations

For any task involving crosslingual word representations we distinguish between two kinds of errors

  1. Transfer errors occur due to transferring representations between languages. Ideally, expressions of the same meaning (words, phrases, or documents) should be represented by the same vectors, regardless of the language they are expressed in. The more different these representations are from language 1 ($L_1$) to language 2 ($L_2$), the larger the transfer error.

  2. Monolingual errors occur because the word, phrase, or document representations within the same language are not expressive enough. For example, in the case of classification this would mean that the representations do not possess enough discriminative power for a classifier to achieve high accuracy.

The way to attain high performance for any task that involves crosslingual word representations is to keep both transfer errors and monolingual errors to a minimum using representations that are both expressive and constrained crosslingually.

2.2 Creating Representations for Phrases and Documents

Following the work of Klementiev et al. (2012); Hermann & Blunsom (2014); Gouws et al. (2014) we represent each word as a vector and use separate word representations for each language. Like Hermann & Blunsom (2014), we look up the vector representations for all words of a given sentence in the corresponding lookup table and apply a composition function to transform these word vectors into a sentence representation. To create document representations, we apply the same composition function again, this time to transform the representations of all sentences in a document to a document representation. For the majority of this work we will make use of the addition composition function, which can be written as the sum of all word representations in a given phrase

$p_v = \sum_{i=1}^{|p|} x_i$   (1)

where $p$ is the phrase, $x_i$ the vector representation of its $i$-th word, and the subscript $v$ denotes the composed vector representation of a phrase.

To give an example of another possible candidate composition function, we also use the bigram based addition (Bi) composition function, formalized as

$p_v = \sum_{i=1}^{|p|-1} \tanh(x_i + x_{i+1})$   (2)

where the hyperbolic tangent ($\tanh$) is wrapped around every word bigram to produce intermediate results that are then summed up. By introducing a non-linear function, the Bi composition is no longer a bag-of-vectors function and takes word order into account.

Given that neither of the above composition functions involves any additional parameters, the only parameters of our model are in fact the word representations, which are shared globally across all training samples.
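To make the two composition functions concrete, here is a minimal NumPy sketch (separate from the authors' implementation); it assumes a phrase is given as an array of its word vectors, one row per word.

```python
import numpy as np

def compose_add(word_vecs):
    """Addition composition (Eq. 1): sum the word vectors of a phrase."""
    return np.sum(word_vecs, axis=0)

def compose_bi(word_vecs):
    """Bigram-addition composition (Bi, Eq. 2): apply tanh to each word bigram,
    then sum the intermediate results. Assumes the phrase has at least two words."""
    return np.sum(np.tanh(word_vecs[:-1] + word_vecs[1:]), axis=0)

# Example: a three-word phrase with 40-dimensional word vectors.
phrase = np.random.randn(3, 40)
sentence_vec = compose_add(phrase)
```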

2.3 Objective

Following Klementiev et al. (2012), we split our objective into two sub-objectives: a bilingual objective minimizing the transfer errors and a monolingual objective minimizing the monolingual errors for $L_1$ and $L_2$. We formalize the loss over the whole training set as

$\mathcal{L}(\theta) = \sum_{(a,b) \in C_{bi}} \mathcal{L}_{bi}(a,b) + \sum_{s \in C_1} \mathcal{L}_{mono}(s) + \sum_{s' \in C_2} \mathcal{L}_{mono}(s') + \lambda \|\theta\|_2^2$   (3)

where $\mathcal{L}_{bi}$ is the bilingual loss for two aligned sentences, $(a,b)$ is a sample from the set $C_{bi}$ of aligned sentence pairs in language 1 and language 2, and $\mathcal{L}_{mono}$ is the monolingual loss, which we sum over sentences $s$ from corpora $C_1$ in language 1 and sentences $s'$ from corpora $C_2$ in language 2. We learn the parameters $\theta$, which represent the whole set of word representations for both $L_1$ and $L_2$. The parameters are used in a shared fashion to construct sentence representations for both the monolingual corpora and the parts of the bilingual corpus corresponding to each language. We regularize $\theta$ using the squared Euclidean norm and scale the contribution of the regularizer by $\lambda$.

Both objectives operate on vectors that represent composed versions of phrases and are agnostic to how a phrase is transformed into a vector. The objective can therefore be used with arbitrary composition functions.
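As a rough sketch of how the pieces of Equation 3 fit together (the function names and the flat parameter layout are illustrative, not taken from the paper), the total loss could be assembled as follows; `bilingual_loss` and `monolingual_loss` are sketched in the next two subsections.

```python
import numpy as np

def total_loss(theta, aligned_pairs, mono_samples_l1, mono_samples_l2, lam=1e-4):
    """Eq. 3 (sketch): bilingual loss summed over aligned sentence pairs,
    monolingual loss summed over samples from each language's corpora, plus a
    squared-Euclidean-norm regularizer on the word representations theta,
    scaled by lam (the value here is a placeholder, not the tuned one)."""
    loss = sum(bilingual_loss(a, b) for a, b in aligned_pairs)
    loss += sum(monolingual_loss(*sample) for sample in mono_samples_l1)
    loss += sum(monolingual_loss(*sample) for sample in mono_samples_l2)
    return loss + lam * np.sum(theta ** 2)
```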

An illustration of our proposed method can be found in Figure 1.

Figure 1: An illustration of our method.

2.3.1 Bilingual Objective

Given a pair of aligned sentences, $a$ in $L_1$ and $b$ in $L_2$, we first compute their vector representations $a_v$ and $b_v$ using the composition function. Since the sentences are either translations of each other or at least very close in meaning, we require their vector representations to be similar and express this as minimizing the squared Euclidean distance between $a_v$ and $b_v$. More formally, we write

$\mathcal{L}_{bi}(a, b) = \| a_v - b_v \|_2^2$   (4)

for any two vector representations $a_v$ and $b_v$ corresponding to the sentences of an aligned translation pair.
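A minimal sketch of this bilingual loss, reusing `compose_add` from the sketch above and again treating sentences as arrays of word vectors:

```python
import numpy as np

def bilingual_loss(sent_l1_vecs, sent_l2_vecs, compose=compose_add):
    """Eq. 4: squared Euclidean distance between the composed representations
    of two aligned sentences."""
    diff = compose(sent_l1_vecs) - compose(sent_l2_vecs)
    return float(np.dot(diff, diff))
```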

The bilingual objective on its own is degenerate, since setting the vector representations of all sentences to the same value poses a trivial solution. We therefore combine this bilingual objective with a monolingual objective.

2.3.2 Monolingual Objective

The choice of the monolingual objective greatly influences the generality of models for crosslingual word representations. Klementiev et al. (2012) use a neural language model to leverage monolingual data. However, this does not explicitly encourage compositionality of the word representations. Hermann & Blunsom (2014) achieve good results with a noise-contrastive objective, discriminating aligned translation pairs from randomly sampled pairs. However, their approach can only be trained using sentence aligned data, which makes it difficult to extend to leverage unannotated monolingual data. Gouws et al. (2014) introduced BilBOWA combining a bilingual objective with the Skip-Gram model proposed by Mikolov et al. (2013a) which predicts the context of a word given the word itself. They achieve high accuracy on the sub-task of the crosslingual document classification task introduced by Klementiev et al. (2012). Chandar A P et al. (2014) presented a bag-of-words auto-encoder model which is the current state of the art for the sub-task for the same task. Both the auto-encoder based model and BilBOWA require a sentence-aligned bilingual corpus, but in addition are capable of leveraging monolingual data. However, due to their bag-of-words based nature, their architectures implicitly restrict how sentence representations are composed from word representations.

We extend the idea of the noise-contrastive objective given by Hermann & Blunsom (2014) to the monolingual setting and propose a framework that, like theirs, is agnostic to the choice of composition function and operates on the phrase level. However, our framework, unlike theirs, is able to leverage monolingual data. Our key novel idea is based on the observation that phrases are typically more similar to their sub-phrases than to randomly sampled phrases. We leverage this insight using the hinge loss as follows

$\mathcal{L}_{hinge}(p^o, p^i, n) = \max\left(0,\; m + \| p^o_v - p^i_v \|_2^2 - \| p^o_v - n_v \|_2^2\right)$   (5)
$\mathcal{L}_{mono}(p^o, p^i, n) = \frac{|p^o|}{|p^i|} \left( \mathcal{L}_{hinge}(p^o, p^i, n) + \| p^o_v - p^i_v \|_2^2 \right)$   (6)

where $m$ is a margin, $p^o$ is a phrase sampled from a sentence, $p^i$ is a sub-phrase of $p^o$, and $n$ is a phrase extracted from a sentence that was sampled uniformly from the corpus. The start and end positions of both phrases and the sub-phrase were chosen uniformly at random within their context and constrained to guarantee a minimum length of 3 words. The subscript $v$ denotes that a phrase has been transformed into its vector representation. We add the distance term $\| p^o_v - p^i_v \|_2^2$ to the hinge loss to reduce the influence of the margin as a hyperparameter and to make sure that we retain an error signal even after the hinge loss objective is satisfied. To compensate for differences in phrase and sub-phrase length, we scale the error by the ratio $|p^o| / |p^i|$ between the number of words in the outer phrase and the inner sub-phrase. Minimizing this objective captures the intuition stated above: a phrase should generally be closer to its sub-phrases than to randomly sampled phrases.
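The following sketch spells out the inclusion objective of Equations 5 and 6. The sampling helper is illustrative, the default margin of 40 only reflects the range the paper reports tuning over, and phrases are again arrays of word vectors.

```python
import random
import numpy as np

def sample_phrase(seq, min_len=3):
    """Sample a contiguous sub-span of at least min_len items, with start and end
    positions chosen uniformly at random. Assumes len(seq) >= min_len."""
    start = random.randint(0, len(seq) - min_len)
    end = random.randint(start + min_len, len(seq))
    return seq[start:end]

def monolingual_loss(outer_vecs, inner_vecs, noise_vecs, margin=40.0, compose=compose_add):
    """Eqs. 5-6: hinge loss encouraging the outer phrase to be closer to its
    sub-phrase than to a random noise phrase, plus the outer/inner distance term,
    scaled by the length ratio |outer| / |inner|."""
    o = compose(outer_vecs)
    i = compose(inner_vecs)
    n = compose(noise_vecs)
    d_pos = float(np.sum((o - i) ** 2))  # distance to the contained sub-phrase
    d_neg = float(np.sum((o - n) ** 2))  # distance to the noise phrase
    hinge = max(0.0, margin + d_pos - d_neg)                       # Eq. 5
    return (len(outer_vecs) / len(inner_vecs)) * (hinge + d_pos)   # Eq. 6

# Sampling: an outer phrase from a sentence, a sub-phrase of that outer phrase,
# and a noise phrase from a uniformly sampled sentence, e.g.:
#   outer = sample_phrase(sentence_vecs)
#   inner = sample_phrase(outer)
#   noise = sample_phrase(noise_sentence_vecs)
```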

The examples in Figure 2 seek to further clarify this observation.

Figure 2: Examples illustrating the inclusion criterion which we use to leverage monolingual text.

In both examples, the blue area represents the outer phrase ($p^o$), the red area covers the inner sub-phrase ($p^i$), and the gray area marks a randomly selected phrase ($n$) in a randomly sampled noise sentence. The inner workings of the monolingual inclusion objective only become clear when more than one example is considered. In Example 1, the inner sub-phrase is embedded in the same context as the inner sub-phrase in Example 2, while in both examples the outer phrase is contrasted with the same noise phrase. Minimizing the objective brings the representations of both “likes to drink beer” and “likes to eat chips” closer to the phrase they are embedded in and makes them less similar to the noise sentence. Since in both examples the outer phrases are very similar, this causes “likes to drink beer” and “likes to eat chips” to be similar. While we picked idealized sentences for demonstration purposes, this relative notion still holds in practice to varying degrees depending on the choice of sentences.

In contrast to many recently introduced log-linear models, like the Skip-Gram model, where word vectors are similar if they appear as the center of similar word windows, our proposed objective, using addition for composition, encourages word vectors to be similar if they tend to be embedded in similar phrases. The major difference between these two formulations manifests itself for words that appear close or next to each other very frequently. These word pairs are not usually the center of the same word windows, but they are embedded together in the same phrases.

For example, the two-word central context of “eat” is “to” and “chips”, whereas the context of “chips” would be “eat” and “when”. Using the Skip-Gram model, this would cause “chips” and “eat” to be less similar, with “chips” probably being similar to other words related to food and “eat” being similar to other verbs. Employing the inclusion objective, the representations for “eat” and “chips” will end up close to each other since they tend to be embedded in the same phrases. This causes the word representations induced by the inclusion criterion to be more topical in nature. We hypothesize that this property is particularly useful for document classification.

3 Experiments

3.1 Crosslingual Document Classification

Crosslingual document classification constitutes a task where a classifier is trained to classify documents in one language () and is later applied to documents in a different language (). This requires either transforming the classifier itself to fit the new language or transforming/sharing representations of the text for both languages. The crosslingual word and document representations induced using the approach proposed in this work present an intuitive way to tackle crosslingual document classification.

Like previous work, we evaluate our method on the crosslingual document classification task introduced by Klementiev et al. (2012). The goal is to correctly classify news articles taken from the English and German sections of the RCV1 and RCV2 corpora (Lewis et al., 2004) into one of four categories: Economics, Government/Social, Markets, or Corporate. Maintaining the original setup, we train an averaged perceptron (Collins, 2002) for 10 iterations on representations of documents in one language (English/German) and evaluate its performance on representations of documents in the corresponding other language (German/English).

We use the original data and the original implementation of the averaged perceptron used by Klementiev et al. (2012) to evaluate the document representations created by our method. There are different versions of the training set, ranging in size from 100 to 10,000 documents, and the test sets for both languages contain 5,000 documents each. Most related work only reports results using the 1,000-document training set. Following previous work, we tune the hyperparameters of our model on held-out documents in the same language that the model was trained on.
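For reference, the evaluation protocol can be approximated with a simple stand-in for the original averaged perceptron implementation of Klementiev et al. (2012); the code below is our own minimal multi-class version, not the original.

```python
import numpy as np

def train_averaged_perceptron(X, y, n_classes, n_iter=10):
    """Train a multi-class averaged perceptron for n_iter passes over document
    vectors X (n_docs x dim) with integer labels y, returning the summed
    (averaged up to a constant factor) weight matrix."""
    W = np.zeros((n_classes, X.shape[1]))
    W_sum = np.zeros_like(W)
    for _ in range(n_iter):
        for x, gold in zip(X, y):
            pred = int(np.argmax(W @ x))
            if pred != gold:
                W[gold] += x
                W[pred] -= x
            W_sum += W
    return W_sum

def accuracy(W, X, y):
    return float(np.mean(np.argmax(X @ W.T, axis=1) == y))

# Crosslingual evaluation: train on e.g. English document vectors, then test on
# German document vectors composed with the same shared word representations.
# W = train_averaged_perceptron(X_en_train, y_en_train, n_classes=4)
# print(accuracy(W, X_de_test, y_de_test))
```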

3.2 Inducing Crosslingual Word Representations

To induce representations using the method proposed in this work, we require at least a bilingual corpus of aligned sentences. In addition, our model allows the representations to draw upon monolingual data from either or both languages. Like Klementiev et al. (2012), we choose EuroParl v7 (Koehn, 2005) as our bilingual corpus and leverage the English and German parts of the RCV1 and RCV2 corpora as monolingual resources. To avoid a testing bias, we exclude all documents that are part of the crosslingual classification task. We detect sentence boundaries using the pre-trained models of the Punkt tokenizer (Kiss & Strunk, 2006) shipped with NLTK (http://www.nltk.org/) and perform tokenization and lowercasing with the scripts deployed with the cdec decoder (http://www.cdec-decoder.org/). Following Turian et al. (2010), we remove all English sentences (and their German correspondences in EuroParl) whose ratio of lower-case characters falls below a cutoff; this affects mainly headlines and reports with numbers and reduces the number of sentences in both EuroParl and the English part of the Reuters corpus. Since German features more upper-case characters than English, we use a lower cutoff ratio for German. Further, we replace words that occur fewer times than a corpus-specific threshold with an UNK token. Corpus statistics and thresholds are reported in Table 1.
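A rough sketch of this preprocessing pipeline is shown below; NLTK's word_tokenize stands in for the cdec tokenization scripts, and the character-ratio filtering and exact thresholds are omitted.

```python
from collections import Counter
import nltk  # requires the pre-trained Punkt models, e.g. nltk.download("punkt")

def preprocess(lines, unk_threshold, language="english"):
    """Sentence-split with the pre-trained Punkt models (Kiss & Strunk, 2006),
    lowercase and tokenize, then replace tokens occurring fewer than
    unk_threshold times with an UNK token."""
    sentences = []
    for line in lines:
        for sent in nltk.sent_tokenize(line, language=language):
            sentences.append(nltk.word_tokenize(sent.lower(), language=language))
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= unk_threshold else "UNK" for tok in sent]
            for sent in sentences]
```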

Corpus          Type         UNK threshold   #sentences     #tokens       Vocabulary
EuroParl (EN)   bilingual    2               1.66 million   46 million    51,000
EuroParl (DE)   bilingual    2               1.66 million   48 million    163,000
Reuters (EN)    monolingual  5               4.5 million    120 million   114,000
Reuters (DE)    monolingual  3               0.9 million    18 million    117,000
Table 1: Statistics from the corpora used to induce crosslingual word representations listing type, frequency threshold for turning tokens into UNKs, the number of sentences, the number of tokens and the vocabulary size. The statistics were calculated on the preprocessed versions of the corpora.

We initialize all word representations with noise sampled from a Gaussian and optimize them in a stochastic setting to minimize the objective defined in Equation 3. To speed up the convergence of training, we use AdaGrad (Duchi et al., 2011). We tuned all hyperparameters of our model, exploring a range of learning rates, mini-batch sizes around 40,000, hinge loss margins around 40 (since our vector dimensionality is 40), and regularization strengths $\lambda$. We trained all versions that use the full monolingual data for 25 iterations and the versions only involving bilingual data for 100 iterations on their training sets. Training our model (our implementation is available at https://github.com/ogh/binclusion), implemented in a high-level, dynamic programming language (Bezanson et al., 2012), on the largest set of data takes roughly six hours on a single-core desktop computer. This can be compared to, for example, Chandar A P et al. (2014), who train their auto-encoder model for 3.5 days.
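For completeness, a single AdaGrad step (Duchi et al., 2011) as used for the stochastic optimization; the learning rate shown is illustrative rather than the tuned value.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """Update parameters theta in place: accumulate squared gradients and scale
    each coordinate's step by the inverse square root of its accumulator."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```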

4 Results

4.1 Crosslingual Document Classification

We compare our method to various architectures introduced in previous work. As these methods differ in their ability to handle monolingual data, we evaluate several versions of our model using different data sources and sizes for training. Following previous work, we use 40-dimensional word representations. We report results when using the first 500,000 sentence pairs of EuroParl (Euro500k), the full EuroParl corpus (EuroFull), the first 500,000 sentence pairs of EuroParl plus the German and English text from the Reuters corpus as monolingual data (Euro500kReuters), and one version using the full EuroParl and Reuters corpora (EuroFullReuters). Table 2 shows results for all of these configurations. The table also includes previous work as well as the Glossed, Machine Translation, and Majority Class baselines from Klementiev et al. (2012).

Method                                Training Data      EN→DE (%)   DE→EN (%)
Machine Translation                   –                  68.1        67.4
Glossed                               –                  65.1        68.6
Majority Class                        –                  46.8        46.8
I-Matrix (Klementiev et al., 2012)    EuroFullReuters    77.6        71.1
ADD (Hermann & Blunsom, 2014)         Euro500k           83.7        71.4
BAE-cr (Chandar A P et al., 2014)     Euro500k           86.1        68.8
BAE-cr (Chandar A P et al., 2014)     Euro500kReuters    87.9        76.7
BAE-cr (Chandar A P et al., 2014)     EuroFullReuters    91.8        74.2
BilBOWA (Gouws et al., 2014)          Euro500k           86.5        75.0
Binclusion                            Euro500k           86.8        76.7
Binclusion                            EuroFull           87.8        75.7
Binclusion                            Euro500kReuters    92.7        84.4
Binclusion (reduced vocabulary)       Euro500kReuters    92.6        82.8
Binclusion                            EuroFullReuters    90.8        79.5
Binclusion (Bi)                       EuroFullReuters    89.8        80.1
Table 2: Results for our proposed models, baselines, and related work. All results are reported for a training set size of 1,000 documents for each language. We refer to our proposed method as Binclusion.

Our method achieves results that are comparable to or improve upon the previous state of the art for all dataset configurations. It advances the state of the art for the English to German sub-task by 0.9 points of accuracy and greatly outperforms the previous state of the art for the German to English sub-task, where it yields an absolute improvement of 7.7 points of accuracy. The latter corresponds to an error reduction of 33.0% in comparison to the previous state of the art.

An important observation is that including monolingual data is strongly beneficial for classification accuracy. We found notable increases in performance for both sub-tasks, even when using only a fraction of the monolingual data. We hypothesize that the key cause of this effect is domain adaptation. From this observation it is also worth pointing out that our method is on par with the previous state of the art for the German to English sub-task using no monolingual training data and would improve upon it using only a fraction of the monolingual data. To show that our method achieves high accuracy even with a reduced vocabulary, we discard representations for infrequent terms and report results using our best setup with the same vocabulary size as Klementiev et al. (2012).

4.2 Interesting Properties of the Induced Crosslingual Word Representations

For a bilingual word representation model that uses monolingual data, the most difficult cases to resolve are words appearing in the monolingual data, but not in the bilingual data. Since the model does not have any kind of direct signal regarding what translations these words should correspond to, their location in the vector space is entirely determined by how the monolingual objective arranges them. Therefore, looking specifically at these difficult examples presents a good way to get an impression of how well the monolingual and bilingual objective complement each other.

In Table 3, we list some of the most frequently occurring words that are present in the monolingual data but not in the bilingual data. The nearest neighbors are topically strongly related to their corresponding queries. For example, the credit-rating agency Standard & Poor’s (s&p) is matched to rating-related words, soybeans is proximal to crop and food related terms, forex features a list of currency related terms, and the list for stockholders includes aktionäre, its correct German translation. This speaks strongly in favor of how our objectives complement each other: even though these words were only observed in the monolingual data, they relate sensibly across languages.

English query      soybeans             forex               s&p                     stockholders
German neighbors   mais                 drachme             ratings                 aktionärsschutz
                   alkoholherstellung   liquiditätsfalle    ratingindustrie         minderheitenaktionäre
                   silomais             bankenliquidität    ratingbranche           aktionärsrechte
                   genmais              abnutzung           ratingstiftung          aktionäre
                   gluten               pfändung            kreditratingagenturen   minderheitenaktionären
Table 3: German nearest neighbors for English words that only appear in the monolingual data.

To convey an impression of how the induced representations behave, not interlingually, but within the same language, we list some examples in Table 4. The semiconductor chip maker intel is very close to IT-related companies such as ibm or netscape and also to microprocessor-related terms. For the verb fly, the nearest neighbors not only include forms like flying, but also related nouns like airspace or air, underlining the topical nature of our proposed objective.

Query       kabul      intel             transport        fly
Neighbors   taliban    pentium           traffic          air
            talibans   microprocessor    transporting     flying
            taleban    ibm               transports       flight
            masood     microprocessors   dockworkers      airspace
            dostum     netscape          transportation   naval
Table 4: English words and their nearest neighbors in the induced space, demonstrating the topical nature of the word representations.
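Neighbor lists like those in Tables 3 and 4 can be obtained with a simple lookup in the induced space. The sketch below uses Euclidean distance (the paper does not state which metric was used for these lists), and `vocab` is assumed to map words to row indices of the representation matrix.

```python
import numpy as np

def nearest_neighbors(query, vectors, vocab, k=5):
    """Return the k words closest to `query` in the induced vector space.
    With both languages sharing one space, vectors/vocab may come from either
    language to obtain monolingual or crosslingual neighbors."""
    inv = {idx: word for word, idx in vocab.items()}
    q = vectors[vocab[query]]
    dists = np.sum((vectors - q) ** 2, axis=1)
    return [inv[i] for i in np.argsort(dists) if inv[i] != query][:k]
```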

5 Conclusion and Future Work

In this work we introduced a method that can induce compositional crosslingual word representations while scaling to large datasets. Our novel approach for learning monolingual representations integrates naturally with our bilingual objective and allows us to make use of sentence-aligned bilingual corpora as well as monolingual data. The method is agnostic to the choice of composition function, enabling more complex ways (e.g. ones preserving word order information) to compose phrase representations from word representations. For crosslingual document classification (Klementiev et al., 2012), our models perform comparably to, or greatly improve upon, previously reported results.

To increase the expressiveness of our method we plan to investigate more complex composition functions, possibly based on convolution to preserve word order information. We consider the monolingual inclusion objective worthy of further research on its own and will evaluate its performance in comparison to related methods when learning word representations from monolingual data.

Acknowledgements

This work was supported by the Data Centric Science Research Commons Project at the Research Organization of Information and Systems and by the Japan Society for the Promotion of Science KAKENHI Grant Number 13F03041.

References

  • Baroni et al. (2014) Baroni, Marco, Dinu, Georgiana, and Kruszewski, Germán. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL, pp. 238–247, 2014.
  • Bezanson et al. (2012) Bezanson, Jeff, Karpinski, Stefan, Shah, Viral B., and Edelman, Alan. Julia: A Fast Dynamic Language for Technical Computing. arXiv, abs/1209.5145, September 2012.
  • Chandar A P et al. (2014) Chandar A P, Sarath, Lauly, Stanislas, Larochelle, Hugo, Khapra, Mitesh, Ravindran, Balaraman, Raykar, Vikas C, and Saha, Amrita. An Autoencoder Approach to Learning Bilingual Word Representations. In NIPS, pp. 1853–1861, 2014.
  • Collins (2002) Collins, Michael. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP, pp. 1–8, 2002.
  • Collobert & Weston (2008) Collobert, Ronan and Weston, Jason. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML, pp. 160–167, 2008.
  • Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12:2121–2159, 2011.
  • Gouws et al. (2014) Gouws, Stephan, Bengio, Yoshua, and Corrado, Greg. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. arXiv, abs/1410.2455, 2014.
  • Hermann & Blunsom (2014) Hermann, Karl Moritz and Blunsom, Phil. Multilingual Models for Compositional Distributed Semantics. In ACL, pp. 58–68, 2014.
  • Kiss & Strunk (2006) Kiss, Tibor and Strunk, Jan. Unsupervised Multilingual Sentence Boundary Detection. CL, 32(4):485–525, 2006.
  • Klementiev et al. (2012) Klementiev, Alexandre, Titov, Ivan, and Bhattarai, Binod. Inducing Crosslingual Distributed Representations of Words. In COLING, pp. 1459–1474, 2012.
  • Koehn (2005) Koehn, Philipp. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pp. 79–86, 2005.
  • Lewis et al. (2004) Lewis, David D., Yang, Yiming, Rose, Tony G., and Li, Fan. RCV1: A New Benchmark Collection for Text Categorization Research. JMLR, 5:361–397, 2004.
  • Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop, 2013a.
  • Mikolov et al. (2013b) Mikolov, Tomas, Le, Quoc V., and Sutskever, Ilya. Exploiting Similarities among Languages for Machine Translation. arXiv, abs/1309.4168, 2013b.
  • Socher et al. (2013) Socher, Richard, Perelygin, Alex, Wu, Jean, Chuang, Jason, Manning, Christopher D., Ng, Andrew, and Potts, Christopher. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP, pp. 1631–1642, 2013.
  • Turian et al. (2010) Turian, Joseph, Ratinov, Lev-Arie, and Bengio, Yoshua. Word representations: A simple and general method for semi-supervised learning. In ACL, pp. 384–394, 2010.
  • Wan (2009) Wan, Xiaojun. Co-training for Cross-lingual Sentiment Classification. In ACL, pp. 235–243, 2009.