Expanding the Text Classification Toolbox with Cross-Lingual Embeddings

03/23/2019 ∙ by Meryem M'hamdi, et al. ∙ USC Information Sciences Institute

Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This is creating a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches, such as Cross-Lingual Text Classification (CLTC) - the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to the emerging AI divide. Recent work on CLTC has focused on demonstrating the benefits of using bilingual word embeddings as features, relegating the CLTC problem to a mere benchmark based on a simple averaged perceptron. In this paper, we explore more extensively and systematically two flavors of the CLTC problem: news topic classification and textual churn intent detection (TCID) in social media. In particular, we test the hypothesis that embeddings with context are more effective, by multi-tasking the learning of multilingual word embeddings and text classification; we explore neural architectures for CLTC; and we move from bi- to multi-lingual word embeddings. For all architectures, types of word embeddings and datasets, we notice a consistent gain trend in favor of multilingual joint training, especially for low-resourced languages.




1 Introduction

Text classification is one of the main applications of Natural Language Processing (NLP). However, like the majority of NLP tasks, text classification methods tend to focus on English or a handful of other languages that have text corpora of hundreds of millions of words. This is contributing to a new flavor of the digital divide: the AI divide, an inequality in the access to, use of, or impact of AI. Several technology companies are now addressing the digital divide with “next billion users” initiatives. In NLP, transfer-based approaches, such as Cross-Lingual Text Classification (CLTC), the task of categorizing texts written in different languages into a common taxonomy, are a promising solution.

The first CLTC studies, appearing as early as Bel N. (2003), range from creating a single classifier for several languages by pooling the training data to training a monolingual classifier and using the translation of important terms for the other languages. Since then, the face of NLP, including CLTC, has been transformed by embeddings. Word embeddings have become a widely adopted way to transfer information from large unlabeled datasets to downstream tasks, such as sentiment analysis (Maas et al. (2011)), document summarization (Wang et al. (2016)) or dialogue management systems (Yan et al. (2016)).

While most applications of embeddings transfer knowledge across tasks for the same language (English), multilingual embeddings aim to learn a representation common to multiple languages at the same time, making them a perfect addition to the CLTC toolbox. Indeed, a simple averaged perceptron-based CLTC is a common benchmark task to evaluate the quality of bilingual embeddings, by training CLTC on documents in a source language and testing its direct applicability to documents in a different target language.

However, the focus on CLTC as a benchmark has left several gaps. Firstly, there is no systematic comparison between CLTC with monolingual versus multilingual embeddings. Secondly, it is not clear whether a neural architecture improves CLTC and, if so, which one gives the best results. Finally, and most importantly, the multilingual embeddings are fed as such to the CLTC, treating them as a universal feature representation, while recent work has shown that encoding words in context significantly improves performance in a variety of NLP tasks, for example, by transferring the encoder of a machine translation system (McCann et al. (2017)) or by multi-tasking the learning of the multilingual embeddings alongside the CLTC learning.

In this paper, we address the above gaps by establishing a comprehensive, systematic benchmarking framework (to be open-sourced after publication) for surveying the performance of various types of embeddings on different variations of CLTC architectures. The components of the framework, corresponding to our main contributions, are:

  • Several CLTC architectures, adjusted to be fed directly with mono-/multi-lingual embeddings, thus enabling mono-/multi-lingual training and the comparison between the two modes (Section 3.1).

  • A representative set of state-of-the-art multilingual embeddings, obtained either via training from scratch or via offline linear projection methodologies (e.g., Singular Value Decomposition (SVD), Canonical Correlation Analysis (CCA), Attract-Repel).

  • A multi-tasking architecture that fine-tunes multilingual embeddings alongside the CLTC training, thus specializing them to the CLTC task (Section 3.2).

Our experiments (Section 5) with two flavors of CLTC (long news stories to be classified by topics versus short tweets to be classified for churn intent) show that the multilingual approach clearly benefits low-resource languages and that multilingual training outperforms language-specific models for each language.

2 Related Work

In previous work, the quality of multilingual embeddings is evaluated either intrinsically or extrinsically. Intrinsic evaluation directly tests their ability to capture syntactic and semantic relationships between words; such benchmarks include word similarity, word translation, and correlation-based evaluation. Extrinsically, multilingual models are evaluated on their performance when used as input features to downstream semantic transfer tasks.

One of the main downstream applications of multilingual word embeddings is the Cross-Lingual Document Classification (CLDC) benchmark, initially defined in Klementiev et al. (2012). They train a model on labeled documents in a source language and apply it directly to classify unlabeled documents in a target language. This aims to test the ability of multilingual embeddings to act as important agents in direct transfer learning. However, a comparison between the performance using monolingual versus multilingual embeddings is missing.

Zhou et al. (2015) propose a methodology to learn a cross-lingual representation of sentiment information to enable sentiment classification (CLSC). They jointly train bilingual embeddings using the documents annotated with sentiments and their translations to other languages and show that the multilingual approach outperforms monolingual training.

Other work that multi-tasks the training of multilingual embeddings with the task at hand includes Wang et al. (2017) for named entity recognition.

Ferreira et al. (2016) propose a model that jointly learns to embed and predict classes of multilingual documents by optimizing for a loss that combines a cross-lingual training loss with a supervised document classification loss using logistic regression. Despite the simplicity of each loss component, this model manages to surpass other state-of-the-art models. The gain in performance shown in this work motivates us to investigate a multi-tasking model where a more complex model is adapted for document classification. To the best of our knowledge, we are the first to compare the gain when multilingual embeddings are used across different independent and multi-tasking architectures.

3 A Framework for Benchmarking Embeddings in CLTC Tasks

In what follows, we describe the different neural network models used for CLTC at different levels of complexity and how we apply them to the multilingual setting, either by directly incorporating different kinds of already trained multilingual embeddings or by training the embeddings alongside the task.

3.1 Cross-Lingual Text Classification using Pre-trained Embeddings

The different variations of plain text classification models to which pre-trained embeddings are directly fed are represented in Fig. 1. In addition to the fine-tuned multi-layer perceptron (FT-MLP) in Fig. 1(a), which is an extension of the averaged perceptron of Klementiev et al. (2012), we implement and evaluate other extensions, namely multi-filter convolutional neural networks and a bi-directional GRU with attention. Before describing them, we briefly explain the rationale used to reproduce the set of pre-trained multilingual embeddings we work with.

(a) Averaged Multi-Layer Perceptron (FT-MLP)
(b) Multi-Filter CNN (MF-CNN)
(c) Bidirectional GRU with Attention (bi-GRU-Att)
Figure 1: Different Document Classification Models

3.1.1 Pre-trained Multilingual Embeddings

We obtain and reproduce several multilingual embeddings to draw fair conclusions on the potential gain of a multilingual approach applied to text classification. They have been chosen to comply with previous work proving that models with higher levels of supervision tend to perform the best Upadhyay et al. (2016). We cover a wide range of supervised methodologies to work with both models fine-tuned on top of monolingual embeddings and those trained from scratch.

Fine-tuned multilingual embeddings models are built on top of monolingual embeddings by mapping words from different languages into one joint target space. We set English as the target space, and we learn the linear transformation that aligns other languages to English using bilingual translation pairs. We evaluate two offline fine-tuned approaches. The first variant uses Singular Value Decomposition (SVD), following the work of Smith et al. (2017), to produce two versions based on the type of bilingual dictionaries used: ground truth dictionaries and matching strings. The other variant uses Canonical Correlation Analysis (CCA).

We follow the Attract-Repel methodology of Mrksic et al. (2017) for generating semantically specialized multilingual embeddings by injecting monolingual and cross-lingual synonyms and antonyms as linguistic constraints into monolingual distributional vectors. We include more details on how the alignment from bilingual to multilingual is solved using the Vowpal Wabbit tool in Appendix B.

Models trained from scratch are optimized using either cross-lingual constraints only or both monolingual and cross-lingual constraints. The first type follows a sentence alignment approach to optimize a cross-lingual objective, which consists of minimizing the distance between parallel sentences from different languages, as described in Appendix C. The second type of embeddings uses a skip-gram objective modified for the multilingual setting, as introduced by Luong et al. (2015).

3.1.2 Multi-Filter CNN (MF-CNN)

We build a multi-filter CNN where convolutions of different kernel sizes are applied and concatenated as described in the work of Kim (2014) and shown in Fig. 1(b). This architecture works better than a single-filter CNN as it is shown to over-fit less.

Given an input text which consists of the concatenation of $n$ words, an embeddings layer is used to convert the words into their corresponding $m$-dimensional embeddings vectors $x_1$, $x_2$, … and $x_n$. The input to the convolution is then the concatenation of the word vectors: $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$. We apply a two-dimensional convolution operation which consists of applying a filter over a window of shape $k \times m$, where $k$ is the number of words and $m$ is the entire embeddings dimensionality to be traversed at a time.

In the end, an output feature $c_i$ is produced from each consecutive window of $k$ words using the following equation: $c_i = f(W \cdot x_{i:i+k-1} + b)$, where $W$ and $b$ are the weights and bias terms and $f$ is a non-linearity. By applying each filter $n-k+1$ times, we obtain feature maps $c = [c_1, c_2, \dots, c_{n-k+1}]$. In order to concatenate different feature maps from each filter type of sizes ($k_1$, $k_2$, and so on), we apply max pooling as described in Collobert et al. (2011). Then, we apply dropout regularization to the concatenated features before feeding the output to a dense layer with softmax activation to convert it to a probability distribution over the set of labels.
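The convolution and pooling steps described above can be sketched in plain Python. This is only an illustration under assumptions: ReLU stands in for the non-linearity $f$, the toy sizes and the function names `conv_feature_map` and `multi_filter_features` are hypothetical, and dropout and the final softmax layer are omitted.

```python
import random


def conv_feature_map(x, W, b, k):
    """Slide a k-word window over x (n vectors of size m); return n-k+1 features."""
    n = len(x)
    feats = []
    for i in range(n - k + 1):
        # Flatten the window of k consecutive word vectors into one vector of length k*m.
        window = [v for word in x[i:i + k] for v in word]
        z = sum(w * v for w, v in zip(W, window)) + b
        feats.append(max(0.0, z))  # ReLU as the non-linearity f (an assumption)
    return feats


def multi_filter_features(x, filters):
    """Max-pool each filter's feature map and concatenate the pooled values."""
    return [max(conv_feature_map(x, W, b, k)) for (k, W, b) in filters]


random.seed(0)
n, m = 7, 4  # toy sentence length and embedding size
x = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
filters = []
for k in (2, 3, 4):  # one filter per kernel size, for brevity
    W = [random.uniform(-1, 1) for _ in range(k * m)]
    filters.append((k, W, 0.0))
pooled = multi_filter_features(x, filters)
```

Here `pooled` holds one value per filter; in a full model there would be many filters per kernel size and the pooled vector would feed the dense softmax layer.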

3.1.3 Bi-directional GRU with Attention (bi-GRU-Att)

We use a non-hierarchical version of the bi-directional GRU with attention model, as shown in Fig. 1(c). GRU is used instead of LSTM since it is more lightweight and faster to train while keeping comparable performance (Chung et al. (2014)). We encode the input both in its forward and backward directions to encapsulate both the past and future. On top of that, we use an attention mechanism (Bahdanau et al. (2014)) to get a measure of which words are more important by assigning weights of importance.

Formally, at each time step $t$, the GRU computes the output state $h_t$ as a function of the previous hidden state $h_{t-1}$ and the update gate $z_t$, dropping the forget gate, as follows: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, where $\tilde{h}_t$ is the candidate state. We encode each sentence $s = (x_1, \dots, x_n)$, where $x_i$ is the embeddings vector for word $w_i$, using a GRU in the forward and backward directions: $\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i)$ and $\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i)$, computed for each word $w_i$. Those states are then concatenated to form the encoded representation for each word: $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$.

Attention weights are computed using a dense layer over all encoders’ states as shown in Eq. 1. Then, those scores are normalized and a probability distribution is obtained using softmax as in Eq. 2. The sentence representation is simply the weighted sum of the different encoder states by the attention weights as in Eq. 3.

$$u_i = \tanh(W h_i + b) \tag{1}$$

$$\alpha_i = \frac{\exp(u_i^\top u_w)}{\sum_{j} \exp(u_j^\top u_w)} \tag{2}$$

$$s = \sum_{i} \alpha_i h_i \tag{3}$$

where $W$ and $b$ are the weights and bias of the dense layer, and $u_w$ is the context vector, which gives a high-level representation of a fixed query on the words and is initialized randomly and learned during the training process.
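As a rough illustration of the three attention steps (dense scoring, softmax normalization, weighted sum), the sketch below uses plain Python lists. The tanh activation in the dense layer, the function name `attention_pool` and the toy inputs are assumptions made for this example.

```python
import math


def attention_pool(H, W, b, u_w):
    """H: list of n encoder states of size d. Returns (attention weights, sentence vector)."""
    scores = []
    for h in H:
        # Dense layer with tanh over one encoder state (Eq. 1, tanh assumed).
        u = [math.tanh(sum(W[r][c] * h[c] for c in range(len(h))) + b[r])
             for r in range(len(W))]
        # Score the hidden representation against the learned context vector u_w.
        scores.append(sum(ui * qi for ui, qi in zip(u, u_w)))
    # Softmax over the scores (Eq. 2), shifted by the max for numerical stability.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the encoder states (Eq. 3).
    d = len(H[0])
    s_vec = [sum(a * h[j] for a, h in zip(alphas, H)) for j in range(d)]
    return alphas, s_vec


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy encoder states
W = [[1.0, 0.0], [0.0, 1.0]]              # identity dense layer
b = [0.0, 0.0]
u_w = [1.0, 1.0]                          # toy context vector
alphas, s_vec = attention_pool(H, W, b, u_w)
```

With these toy values the third state receives the largest weight, since it aligns best with the context vector.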

3.1.4 Loss Function

We use a weighted categorical cross entropy loss which is defined as follows:

$$\mathcal{L}_{\text{clf}} = -\sum_{i=1}^{N} w_i \, y_i \log(\hat{y}_i) \tag{4}$$

where $N$ is the number of testing instances, $w_i$ is the weight attributed to each instance corresponding to its class, $y_i$ is the true label and $\hat{y}_i$ is the prediction. The weights are inversely proportional to the distribution of classes to circumvent the possibility of over-fitting that can be caused by an imbalanced label distribution and are computed as follows:

$$w_c \propto \frac{1}{N_c} \tag{5}$$

where $N_c$ is the number of instances of class $c$.
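A small sketch of a class-weighted cross entropy may help make the weighting concrete. The normalization used here (total count divided by the number of classes times the class count) is one common choice and is an assumption, as are the function names and the toy labels.

```python
import math
from collections import Counter


def class_weights(labels):
    """Weights inversely proportional to class frequency (normalization is an assumption)."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(labels, probs, weights):
    """Sum over instances of -w_class * log(probability assigned to the true class)."""
    eps = 1e-12  # guards against log(0)
    return -sum(weights[y] * math.log(max(p[y], eps)) for y, p in zip(labels, probs))


labels = ["churn", "no", "no", "no"]  # toy imbalanced label set
probs = [
    {"churn": 0.5, "no": 0.5},
    {"churn": 0.1, "no": 0.9},
    {"churn": 0.1, "no": 0.9},
    {"churn": 0.1, "no": 0.9},
]
weights = class_weights(labels)
loss = weighted_cross_entropy(labels, probs, weights)
```

In this toy case the rare class gets three times the weight of the frequent one, so a mistake on a rare-class instance costs more than a mistake on a frequent-class instance.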


3.2 Specialized Multilingual Embeddings

In addition to directly applying multilingual embeddings trained independently to text classification task, we investigate training them along with the task at hand in an end-to-end multi-tasking fashion.

Figure 2 depicts the main components of the followed architecture. The left-hand side fine-tunes multilingual embeddings using sentence alignment while the right-hand side optimizes for document classification using hierarchical bidirectional GRU attention network. The two tasks share a single embeddings layer which is tuned by the two tasks. Other layers which are shared between the tasks include word level GRU units and attention activation.

Figure 2: Multi-tasking hierarchical attention networks for CLDC and multilingual embeddings alignment

3.2.1 Sentence Alignment (Sent-Ali)

The goal of this component is to construct sentence embeddings out of word embeddings using the weighted average of the output of bi-GRU states, a representation which can encapsulate word order and word importance and is more useful than taking the plain average of word embeddings. Let $v_s$ and $v_t$ be the bi-GRU encoded representations of the source and target sentences in the alignment pair ($s$, $t$) respectively. The loss is inversely related to the cosine similarity between each pair ($v_s$, $v_t$), in addition to an l2-regularizer to avoid the exploding gradient problem, as follows:

$$\mathcal{L}_{\text{ali}} = -\sum_{(s, t)} \cos(v_s, v_t) + \lambda \lVert \theta \rVert_2^2 \tag{6}$$

where $\lambda$ is an arbitrarily fixed scalar that is experiment specific and $\theta$ is the training weights.
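One way to read the alignment loss is as a negative cosine similarity summed over aligned pairs plus an l2 penalty on the weights. The sketch below takes that reading as an assumption; the function names, the flat weight list and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(pairs, weights, lam):
    """Negative summed cosine similarity of aligned pairs plus lam * ||weights||^2."""
    similarity = sum(cosine(v_s, v_t) for v_s, v_t in pairs)
    l2_penalty = sum(w * w for w in weights)
    return -similarity + lam * l2_penalty


pairs = [
    ([1.0, 0.0], [2.0, 0.0]),  # same direction, cosine = 1
    ([1.0, 0.0], [0.0, 3.0]),  # orthogonal, cosine = 0
]
loss = alignment_loss(pairs, weights=[0.5, -0.5], lam=0.1)
```

The loss decreases as aligned sentence vectors point in the same direction, while the penalty term discourages large weights.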

3.2.2 Hierarchical Bidirectional GRU-Attention Networks (bi-GRU-Att)

The goal of this component is to come up with a hierarchical representation for documents (only relevant for CLDC). Unlike Yang et al. (2016), we use a bidirectional GRU with attention at different levels. More specifically, we construct document representation using sentence encodings where each sentence representation is built from word representations where both levels of encodings use bidirectional GRUs with attention.

3.2.3 Learning Methodology

We alternate between the training of the losses of the two tasks as defined in Eq. 4 and Eq. 6. Two different optimizers are adapted to each task to make the learning of one task synchronized with the other one.
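The alternating schedule can be illustrated with a toy problem in which two losses share a single parameter and each task takes its own gradient step with its own learning rate. The strict step-by-step alternation, the quadratic losses and all names below are assumptions for illustration, not the training loop used in the experiments.

```python
def alternate_training(theta, grad_a, grad_b, lr_a, lr_b, steps):
    """Alternate plain gradient steps on two losses that share the parameter theta."""
    for step in range(steps):
        if step % 2 == 0:
            theta -= lr_a * grad_a(theta)  # step on task A (e.g. sentence alignment)
        else:
            theta -= lr_b * grad_b(theta)  # step on task B (e.g. classification)
    return theta


# Toy losses: task A prefers theta = 1, task B prefers theta = 3.
grad_a = lambda t: 2.0 * (t - 1.0)  # gradient of (t - 1)^2
grad_b = lambda t: 2.0 * (t - 3.0)  # gradient of (t - 3)^2
theta = alternate_training(0.0, grad_a, grad_b, lr_a=0.1, lr_b=0.1, steps=200)
```

The shared parameter settles between the two task optima, which is the intended effect of tuning one embeddings layer with both objectives.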

4 Experimental Setup

In this section, we present the approach used to compare the performance of different embeddings models and text classification architectures including datasets used for the evaluation, how experiments are designed and how models are trained.

4.1 Datasets

4.1.1 Cross-Lingual Document Dataset

The dataset used for CLDC is the Reuters RCV1/RCV2 corpora described in Lewis et al. (2004), which we obtain under a NIST license (http://trec.nist.gov/data/reuters/reuters.html). We choose to work with this dataset since it has a sufficient number of training instances and has been extensively used in prior research on the evaluation of multilingual embeddings, which enables easy comparison with other work. RCV1 consists of about 810,000 English newswire stories, while RCV2 contains over 487,000 news stories in thirteen other languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish), all made available by Reuters, Ltd.

        English   German   French   Italian
Train   418,566   50,387   40,470   12,566
Valid   104,601   12,609   10,090    3,129
Test    130,780   15,843   12,669    3,964
Total   653,947   78,839   63,229   19,659
Table 1: Training, Validation and Testing Distribution of RCV Dataset across Languages

We follow the same cross-lingual document classification benchmark defined in Klementiev et al. (2012) and work on a multi-class classification task with at most one label per document among four high-level topic categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). Table 1 shows the distribution of training, validation and testing instances per language, which make up roughly 64%, 16% and 20% of the dataset respectively.

4.1.2 Cross-Lingual Churn Datasets

We use churn datasets from two languages: English and German. The English dataset (Amiri and Daume (2015)) contains tweets mentioning the following telecommunication brands: Verizon, AT&T or T-Mobile. A churny tweet is one that mentions a particular brand that the Twitter user expresses an intent to leave. There are 4854 tweets in total with an annotation confidence above 0.7, out of which only 944 are churny. On the other hand, the German dataset (Abbet et al. (2018)) contains a total of 4339 tweets, of which 611 are churny, regarding telecommunication operators active in German-speaking countries.

4.1.3 Multilingual Parallel Sentences Corpus

We use a combination of Europarl Parallel Corpus v7.1 (Koehn (2005)), titles from Wikipedia, and parallel news commentary (http://) as our sentence alignment dataset. Extracted from parliament proceedings, Europarl covers over 21 European languages. This extended corpus is chosen because it is commonly used in the literature due to its richness and its large number of instances. The whole dataset consists of around 2.9M, 3.1M and 2.6M sentence pairs for English-German, English-French, and English-Italian.

4.2 Experiment Design

For both CLDC and TCID, we design several experiments for the evaluation of different multilingual embeddings. We train several language specific and multilingual models using different text classification architectures. In both cases, models are trained for each language independently and are used as a baseline against different multilingual embeddings models. In the end, we report only on the best and average multilingual embeddings performances in each case.

For CLDC, we evaluate three models: Fine-Tuned MLP (FT-MLP), Multi-Filter CNN (MF-CNN) and Multi-tasking embeddings with the task (HAN+Sent-Ali). For TCID, we additionally investigate the performance of bidirectional GRU with Attention (bi-GRU-Att). It was not possible to investigate the performance of bi-GRU-Att on CLDC, due to the high number of training documents and the higher number of words per document which imposes a long training time.

Mono: Training on a specific language $l$ using monolingual embeddings and testing on $l$.

Multi: Training on all languages jointly (i.e. four languages for CLDC and two for TCID) using multilingual embeddings and testing on English, German, French and Italian for CLDC or English and German for TCID.

For all models except HAN+Sent-Ali, we run over the whole dataset. However, we only manage to run using 10K instances for HAN+Sent-Ali. This also enables us to test our hypothesis against a low data regime scenario where no language is predominant and how this impacts the gain in performance of multilingual over monolingual. On the other hand, we use the whole churn dataset in all experiments since it is already not that large.

To ensure a fair comparison between different monolingual and multilingual experiments, we use the same hyper-parameters in the design of each text classification architecture independently. In other words, we only change the parameters when switching between architectures but not when switching between monolingual and multilingual training modes. More details about the hyper-parameters used are available in Appendix D. The metrics used for performance evaluation are macro F1-scores, macro precision, and macro recall.
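For reference, macro-averaged precision and recall can be computed as the unweighted mean of the per-class scores. The sketch below derives F1 as the harmonic mean of macro precision and macro recall, which appears consistent with the values reported in the result tables (for example, precision 91.62 and recall 91.75 give 91.68); averaging per-class F1 scores is the other common definition. The function name and the toy labels are assumptions.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, F1 as their harmonic mean)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["CCAT", "CCAT", "ECAT", "GCAT"]
y_pred = ["CCAT", "ECAT", "ECAT", "GCAT"]
precision, recall, f1 = macro_scores(y_true, y_pred)
```

Macro averaging gives every class the same influence regardless of its size, which matters for the imbalanced churn labels.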

4.3 Pre-trained Embeddings

Monolingual word embeddings are obtained directly from FastText (Bojanowski et al. (2017)). Details of training the two SVD-based variants can be found in Appendix A. We obtain the CCA-based and multilingual skip-gram embeddings from Ammar et al. (2016), as pre-trained 512-dimensional embeddings for up to 13 languages. The linear projection from bilingual spaces to one multilingual space for the Attract-Repel embeddings is optimized using the Vowpal Wabbit tool as explained in Appendix B. Details for training the sentence alignment embeddings can be found in Appendix C.

5 Results

                 FT-MLP                 MF-CNN                 HAN+Sent-Ali
                 F1     P      R        F1     P      R        F1     P      R
EN  mono         91.68  91.62  91.75    90.34  91.24  89.59    61.44  53.47  72.22
    best multi   92.07  93.14  91.02    90.8   90.88  90.72    82.84  85.62  80.27
    avg multi    91.62  91.79  91.45    90.43  90.57  90.31    -      -      -
    gain         0.39   1.52   -0.73    0.46   -0.36  1.13     21.4   32.15  8.05
DE  mono         81.65  79.95  83.44    84.11  85.88  82.42    89.53  86.66  92.59
    best multi   84.85  86.21  83.54    86.41  89.76  83.31    85.93  86.11  85.83
    avg multi    83.90  84.25  83.57    84.89  85.53  84.33    -      -      -
    gain         3.2    6.26   0.13     2.3    3.88   0.89     -3.6   -0.55  -6.76
FR  mono         81.92  88.44  76.29    85.77  88.55  83.17    76.89  82.22  72.22
    best multi   88.55  88.55  88.56    89.47  90.76  88.21    83.86  83.05  84.72
    avg multi    88.32  88.73  87.94    88.89  89.55  88.24    -      -      -
    gain         6.63   0.11   12.27    3.7    2.21   5.04     6.97   0.83   12.5
IT  mono         74.2   77.95  70.8     78.16  81.06  75.47    57.27  53.47  61.66
    best multi   81.86  84.31  79.27    81.78  84.67  79.09    74.72  73.61  76.11
    avg multi    80.79  82.77  78.90    80.48  83.25  77.90    -      -      -
    gain         7.66   6.36   8.47     3.62   3.61   3.62     17.45  20.14  14.45
Avg Gain         4.47   3.56   5.03     2.52   2.33   2.67     10.55  13.14  7.06
Table 2: CLDC performance comparison between different text classification architectures highlighting gain per language

                 FT-MLP                 MF-CNN                 bi-GRU-Att             HAN+Sent-Ali
                 F1     P      R        F1     P      R        F1     P      R        F1     P      R
EN  mono         68.04  70.05  66.15    77.42  85.84  70.5     79.71  81.48  78.01    52.91  51.26  55.79
    best         71.84  70.68  73.03    79.43  85.56  74.13    78.28  82.15  74.76    74.98  77.52  73.79
    avg          69.68  71.02  68.46    76.09  82.08  70.95    73.78  80.55  68.97    -      -      -
    gain         3.8    0.63   6.88     2.01   -0.28  3.63     -1.43  0.67   -3.25    22.07  26.26  18
DE  mono         55.46  59.93  51.61    58.58  68.22  51.33    58.58  68.22  51.33    63.24  63.59  64.16
    best         65.68  66.01  65.37    69.03  77.32  62.34    69.87  74.81  65.54    73.44  80.05  69.21
    avg          64.05  66.07  62.20    66.93  73.98  61.14    66.34  72.84  60.94    -      -      -
    gain         10.22  6.08   13.76    10.45  9.1    11.01    11.29  6.59   14.21    10.2   16.46  5.05
Avg Gain         7.01   3.35   10.32    6.23   4.41   7.32     4.93   3.63   5.48     16.13  21.36  11.52
Table 3: TCID performance comparison between different text classification architectures highlighting gain per language
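The gain rows can be recomputed directly from the table entries. The short sketch below does so for the first group of F1 columns in Table 2 (taken here to be FT-MLP, an assumption based on the order in which the architectures are discussed); the gain is the best multilingual score minus the monolingual score, and the average gain is the mean over languages.

```python
# (mono F1, best multi F1) per language, copied from the first F1 column of Table 2.
f1_scores = {
    "EN": (91.68, 92.07),
    "DE": (81.65, 84.85),
    "FR": (81.92, 88.55),
    "IT": (74.2, 81.86),
}

# Gain of the best multilingual model over the monolingual baseline.
gains = {lang: round(best - mono, 2) for lang, (mono, best) in f1_scores.items()}

# Average gain across languages, as in the "Avg Gain" row.
avg_gain = round(sum(gains.values()) / len(gains), 2)
```

The recomputed values match the gain rows of the table, with the largest gains for the languages with the fewest training instances.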

Table 2 summarizes F1-score, precision and recall performance for the CLDC task by comparing the gains of the best multilingual over monolingual training for each text classification architecture. In general, for all languages and all text classification architectures, multilingual training wins over monolingual training, with an average improvement in F1-score of 4.47, 2.52 and 10.55 for FT-MLP, MF-CNN and HAN+Sent-Ali respectively.

After examining different multilingual embeddings (Appendix E.1 and E.2), we notice that the gain is more or less the same across embeddings, and that the differences between them are not as significant as the fluctuations in gains when changing the text classification architecture. For these reasons, we report only the best performing multilingual model in each case, and we average over all multilingual embeddings for a more concise and clear analysis.

Using FT-MLP, the improvement is most pronounced for Italian (the most resource-scarce language) with an increase of 7.66 in F1-score, followed by French and German with increases of 6.63 and 3.2 respectively, which matches the order of languages in terms of the number of training and validation instances according to Table 1. This finding is similar for MF-CNN and confirms our hypothesis that the less resourced a language is, the more likely it is to benefit from multilingual training. Although there is a gain in performance for English, it is marginal for both architectures (only 0.46 at most). Obtaining a monolingual performance for English on par with multilingual performance is not at all surprising, as English is the dominant language, accounting for more than 80% of the training and validation data.

We notice that MF-CNN performs slightly better than FT-MLP on average across languages and has a smaller gap between multilingual and monolingual training (an average F1 gain of 2.52 versus 4.47), which is not counter-intuitive since MF-CNN is more complex than FT-MLP and has more parameters to train, which leads even monolingual models to converge better. This supports the hypothesis that the gain that comes from multilingual aggregation is more pronounced the shallower the model is.

On the other hand, performance when multi-tasking the embeddings training alongside the classification task using the HAN architecture is even lower compared to shallower models like MF-CNN and FT-MLP. This can be explained by the low data regime adopted. The results support our assumption by showing a more significant gain of multilingual over monolingual training, since all languages are low-resourced in this case. For example, the higher gain for English with HAN+Sent-Ali compared to other models (21.4 for HAN+Sent-Ali versus 0.39 and 0.46 for FT-MLP and MF-CNN respectively) is due to the fact that English, in this model, is treated as a low-resourced language, as a maximum of 10K instances from each language are used for training.

Table 3 compares the results obtained for TCID using different text classification architectures across different training modes and embeddings. The results show that, in general, multilingual models tend to outperform monolingual baselines for both English and German irrespective of the embeddings model used, with an average increase in F1-score of 7.01, 6.23 and 4.93 for FT-MLP, MF-CNN, and bi-GRU-Att respectively.

We notice that the gap between multilingual and monolingual becomes smaller (7.01, 6.23, and 4.93) for the three architectures from less to more complex, which matches our previous finding in CLDC. The difference between the degree of improvement of multilingual versus monolingual for English and German is due to the fact that the English dataset already has what it takes to learn classification patterns, while German benefits more from the aggregation of more languages to learn complex patterns that are not present in German alone. In all cases, we notice that the performances of the multilingual embeddings are close to each other and far from monolingual performance, for FT-MLP for example. Appendix E.1 and E.2 provide a fine-grained analysis over all multilingual embeddings for CLDC and TCID tasks respectively.

6 Conclusion and Future Work

In this paper, we put in place a systematic multi-dimensional comparative analysis of multilingual embeddings on two variations of Cross-Lingual Text Classification (CLTC) tasks. Our approach has the advantage of being unified for training across languages, leveraging different multilingual embeddings methods and an end-to-end benchmark for their evaluation against their monolingual counterparts. The embeddings covered in our analysis span a diverse spectrum of methodologies covering those fine-tuned on top of monolingual embeddings, those trained from scratch, and those learned jointly with the task. We test both in an imbalanced data scenario with English being the most dominant language and in a low data regime, and witness a consistent gain of the multilingual approach, especially for low-resource languages, for all text classification architectures and for both datasets.

Although this study focuses on four languages at most (English, French, German and Italian), the described models and evaluation strategy can be extended to more languages. Testing more languages, especially under-resourced ones, can be explored in future work. It is also worth investigating ways of making the multi-tasking architecture scalable in order to test it on the whole imbalanced document dataset.


Appendix A Training Offline Embeddings

We build multilingual embeddings which map words from different languages into one joint vector space by learning translations of monolingual embeddings into a target space. We set English as the target space and we learn the transformation matrix that aligns other languages to English using bilingual translation pairs. In other words, this approach fine-tunes non-English embedding by applying a linear transformation that maps them into the English space.

We learn the alignment on top of monolingual embeddings using the training split of the expert bilingual dictionary, where the problem of building bilingual embeddings reduces to learning the linear transformation matrix $W$ which maps the source monolingual space into the English space. Formally, given $X$ and $Y$, the monolingual word vector matrices for the source and target spaces, the goal is to learn the orthogonal $W$ that maximizes the cosine similarity defined by:

$$\max_{W:\, W^\top W = I} \sum_{i} \cos(W x_i, y_i)$$

where $x_i$ and $y_i$ are the vectors of the $i$-th translation pair.

Smith et al. (2017) prove that this optimization objective can be solved directly and efficiently using the SVD of the product of the paired dictionary matrices:

$$Y_D^\top X_D = U \Sigma V^\top$$

where $X_D$ and $Y_D$ contain the rows of $X$ and $Y$ for the dictionary pairs. The resulting $U$ and $V$ are orthonormal matrices whose product gives us the desired transformation matrix $W = U V^\top$. We also apply dimensionality reduction by keeping only the first rows in matrices $U$ and $V$, which correspond to large values in the diagonal matrix $\Sigma$.

We train two variants of this approach: one based on ground truth dictionaries and one based on matching strings. For training the first, we use ground truth bilingual dictionaries as introduced in Conneau et al. (2017) (a large repository of up to 110 bilingual dictionaries covering high and low resource languages is available at https://github.com/facebookresearch/MUSE), consisting of translation pairs for each pair of source and target languages (where the target language is always English). Only the train split (consisting of 5000 pairs) is used for training, while 1500 pairs are used for testing the quality of the embeddings before feeding them to the downstream applications. For both variants, we use dimensionality reduction on top of SVD by considering only the first significant rows corresponding to a value threshold of 1 in the diagonal vector.
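The SVD-based alignment can be sketched with NumPy as follows. This is a minimal illustration under assumptions: the function name `learn_alignment`, the random toy data and the hidden orthogonal map used to check the result are all hypothetical, and the dimensionality reduction step is left out.

```python
import numpy as np


def learn_alignment(X_d, Y_d):
    """Return an orthogonal W with W @ x_i close to y_i for paired rows of X_d and Y_d."""
    # SVD of the product of the paired dictionary matrices.
    U, S, Vt = np.linalg.svd(Y_d.T @ X_d)
    return U @ Vt, S


rng = np.random.default_rng(0)
d, n = 5, 50                                        # toy embedding size and dictionary size
X_d = rng.normal(size=(n, d))                       # source-language vectors, one per row
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden orthogonal map for the toy check
Y_d = X_d @ R_true.T                                # target vectors: y_i = R_true @ x_i

W, singular_values = learn_alignment(X_d, Y_d)
mapped = X_d @ W.T                                  # source vectors mapped into the target space
```

On this noise-free toy data the learned `W` recovers the hidden map; with real dictionaries the mapping is only approximate, and the singular values could be used to decide how many dimensions to keep.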

Appendix B Training of Semantic Specialized Multilingual Embeddings using Vowpal Wabbit

To learn the alignment from bilingual to multilingual space, we learn the weights for two linear projections: from EN-FR to EN-DE and from EN-IT to EN-DE to bring the French part of EN-FR and Italian part of EN-IT to the same joint space as EN-DE. We solve each linear projection using logistic regression optimized using stochastic gradient descent.

Here, we describe the approach for learning the mapping $W$ from EN-FR to EN-DE. The idea is to exploit the inherent parallelism between the two spaces: the English vectors for words in space EN-FR should be aligned to the vectors of the same words in space EN-DE. Formally, let $u_i$ and $v_i$ be the vectors of word $i$ in spaces EN-FR and EN-DE respectively. We learn the matrix $W$ such that $u_i W \approx v_i$. This can be solved by minimizing the Euclidean distance between the English words shared between the two spaces as follows:

$$\min_{W} \lVert U W - V \rVert_F^2$$

where $U$ and $V$ are embedding matrices whose rows are the vectors, in EN-FR and EN-DE respectively, of each word shared between the two spaces, and $\lVert \cdot \rVert_F$ is the Frobenius norm.

To solve this m-variate linear regression, we use stochastic gradient descent (SGD) as implemented in Vowpal Wabbit (VW), a library that can handle large-scale data efficiently. Because VW cannot handle multidimensional output, we split the problem into single-output linear regression sub-problems, creating one VW file for each embedding dimension of the target space. Each file contains n examples, one per shared word, each with m features, where n is the number of words and m is the dimensionality of the embeddings. Running the optimization on one such file yields the corresponding column of the desired transformation.

In the end, this transformation is applied to the French vectors in EN-FR (and, with the same methodology, to the Italian vectors in EN-IT), leaving the German and English vectors of EN-DE unchanged. We run for 100 passes and let Vowpal Wabbit tune the learning parameters by itself.
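The split into single-output sub-problems can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the helper `vw_lines`, the integer feature names, and the toy matrices are hypothetical. The lines follow VW's plain-text input convention of a label, a `|` separator, and `name:value` features. To keep the sketch runnable without VW, the per-dimension regressions are solved with NumPy least squares in place of VW's SGD.

```python
import numpy as np

def vw_lines(U, V, j):
    """Yield VW-format examples for predicting dimension j of V from rows of U.

    U: (n, m) source vectors of the shared words (features).
    V: (n, m) target vectors of the same words (labels come from column j).
    """
    for u_row, target in zip(U, V[:, j]):
        feats = " ".join(f"{k}:{val:.6f}" for k, val in enumerate(u_row, start=1))
        yield f"{target:.6f} | {feats}"

# Toy data (assumed): two shared words, two dimensions.
U_toy = np.array([[0.5, -1.0], [2.0, 0.25]])
V_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
lines = list(vw_lines(U_toy, V_toy, 0))

# Solving each output dimension separately and stacking the solutions as
# columns gives the same matrix as the joint multi-output regression.
rng = np.random.default_rng(0)
U = rng.normal(size=(20, 3))
V = rng.normal(size=(20, 3))
W_joint = np.linalg.lstsq(U, V, rcond=None)[0]
W_cols = np.column_stack(
    [np.linalg.lstsq(U, V[:, j], rcond=None)[0] for j in range(V.shape[1])]
)
```

Each list returned by `vw_lines` would be written to its own file, one per target dimension, and the learned weight vector for file j would form column j of the transformation.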

Appendix C Training Sentence Alignment

Bilingual Case

Training embeddings by optimizing the cross-lingual objective using sentence alignment means training a model that maximizes the semantic similarity between parallel sentences. Formally, given pairs of parallel sentences in two languages, the goal is to find one embedding matrix per language such that sentences in both languages are mapped to one common space. For that purpose, we minimize the sum, over all aligned sentence pairs, of the distances between the embedding representations of the aligned sentences, together with regularization terms. Here the l1 distance was chosen instead of the l2 distance for its robustness against outliers.

We take advantage of monolingual embeddings to initialize the embedding matrices, which are then optimized using gradient descent with a separate step size for each matrix.
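One gradient step on an l1 sentence-alignment objective can be sketched as below. Several choices here are assumptions made only for illustration and are not taken from the paper: a sentence is represented as the sum of its word vectors (via bag-of-words count matrices), the regularization terms are omitted, and all names (`l1_alignment_step`, `E_src`, `E_tgt`, `B_src`, `B_tgt`) are hypothetical.

```python
import numpy as np

def l1_alignment_step(E_src, E_tgt, B_src, B_tgt, lr_src, lr_tgt):
    """One subgradient step on sum_i || s_i - t_i ||_1 (no regularization).

    E_src, E_tgt: (vocab, dim) embedding matrices of the two languages.
    B_src, B_tgt: (n_sent, vocab) bag-of-words counts of the aligned
    sentences, so B @ E gives one summed-embedding vector per sentence.
    Returns the updated matrices and the loss measured before the update.
    """
    diff = B_src @ E_src - B_tgt @ E_tgt      # (n_sent, dim)
    loss = np.abs(diff).sum()
    g = np.sign(diff)                         # subgradient of the l1 distance
    # d loss / d E_src = B_src^T g and d loss / d E_tgt = -B_tgt^T g,
    # each applied with its own step size.
    E_src_new = E_src - lr_src * (B_src.T @ g)
    E_tgt_new = E_tgt + lr_tgt * (B_tgt.T @ g)
    return E_src_new, E_tgt_new, loss

# Tiny example (assumed data): one sentence pair with one word per side.
E_src = np.array([[1.0, 0.0]])
E_tgt = np.array([[0.0, 0.0]])
B = np.array([[1.0]])

E_src1, E_tgt1, loss0 = l1_alignment_step(E_src, E_tgt, B, B, 0.1, 0.1)
_, _, loss1 = l1_alignment_step(E_src1, E_tgt1, B, B, 0.1, 0.1)
```

Without regularization or a good initialization, this objective has a trivial minimum at all-zero embeddings, which is one reason for starting from monolingual embeddings and adding the regularization terms.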


The parameters used to generate the embeddings in our experiments are listed in Table 4.

| Param | Val |
| --- | --- |
| num epochs | |
| Dimension | 300 |
| Batch size | 64 |

Table 4: Training Parameters for Sentence Alignment

Multilingual Extension

The multilingual extension is straightforward because the bilingual objective function is additive. The multilingual objective is the sum of multiple bilingual objectives, which is equivalent to a single bilingual objective in which the source language of a sentence is any non-English language and the target is English. Thus, we train multilingual embeddings using a concatenation of all German, French, and Italian sentences to learn the source-side embedding matrix and the corresponding English sentences to learn the target-side embedding matrix.

Appendix D Implementation and Hyperparameter Choices

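a) FT-MLP

| Param | Val |
| --- | --- |
| Dense Units L1 | 512 |
| Dense Act L1 | relu |
| Dropout | 0.7 |
| Optim | Ada |
| Learning Rate | 10^-2 |
| Patience | 20 |
| Batch | 64 |

b) MF-CNN

| Param | Val |
| --- | --- |
| Kernel Sizes | 3,4,5 |
| # Filters | 200 |
| Dropout | 0.3 |
| Optim | Ada |
| Learning Rate | 10^-3 |
| Patience | 20 |
| Batch | 64 |

c) bi-GRU-Att

| Param | Val |
| --- | --- |
| # GRU units | 150 |
| GRU activation | tanh |
| Dropout | 0.3 |
| Optim | Ada |
| Learning Rate | 10^-3 |
| Patience | 20 |
| Batch | 64 |

d) Multi-Tasking

| Param | Val |
| --- | --- |
| # GRU units | 50 |
| GRU activation | tanh |
| Dropout | 0.5 |
| Optim Task 1 | Ada (10^-3) |
| Optim Task 2 | Ada (10^-2) |
| beta | 1e-10 |
| Batch | 15 |

e) Sentence Alignment

| Param | Val |
| --- | --- |
| | 1e-9 |
| | 1e-11 |
| | 1e-11 |
| # epochs | 50 |
| 1 | 1e-12 |
| Learning Rate | 10^-2 |
| Batch | 64 |

Table 5: Hyperparameters for different text classification architectures and sentence alignment

| Train | Test | Embeddings | FT-MLP F1 | FT-MLP P | FT-MLP R | MF-CNN F1 | MF-CNN P | MF-CNN R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EN | EN | mono | 91.68 | 91.62 | 91.75 | 90.34 | 91.24 | 89.59 |
| All | EN | multi(pseudo_dict) | 91.46 | 91.21 | 91.71 | 90.46 | 90.42 | 90.52 |
| All | EN | multi(exp_dict) | 91.61 | 91.40 | 91.81 | 90.34 | 90.45 | 90.26 |
| All | EN | multi(CCA) | 91.48 | 91.91 | 91.05 | 90.8 | 90.88 | 90.72 |
| All | EN | multi(sem) | 92.07 | 93.14 | 91.02 | 90.14 | 90.85 | 89.48 |
| All | EN | multi(sent_ali) | 91.61 | 91.68 | 91.55 | 90.18 | 90.13 | 90.24 |
| All | EN | multi(skip_gram) | 91.49 | 91.4 | 91.58 | 90.66 | 90.69 | 90.64 |
| DE | DE | mono | 81.65 | 79.95 | 83.44 | 84.11 | 85.88 | 82.42 |
| All | DE | multi(pseudo_dict) | 84.44 | 85.66 | 83.25 | 83.77 | 84.64 | 82.97 |
| All | DE | multi(exp_dict) | 84.85 | 86.21 | 83.54 | 86.37 | 83.91 | 88.97 |
| All | DE | multi(CCA) | 83.07 | 82.68 | 83.46 | 84.46 | 85.5 | 83.45 |
| All | DE | multi(sem) | 83.15 | 83.13 | 83.19 | 83.79 | 84.13 | 83.46 |
| All | DE | multi(sent_ali) | 83.93 | 83.46 | 84.42 | 86.41 | 89.76 | 83.31 |
| All | DE | multi(skip_gram) | 83.96 | 84.35 | 83.57 | 84.52 | 85.21 | 83.84 |
| FR | FR | mono | 81.92 | 88.44 | 76.29 | 85.77 | 88.55 | 83.17 |
| All | FR | multi(pseudo_dict) | 88.51 | 89.54 | 87.5 | 88.69 | 88.72 | 88.66 |
| All | FR | multi(exp_dict) | 88.27 | 89.99 | 86.62 | 88.03 | 88.83 | 87.25 |
| All | FR | multi(CCA) | 88.34 | 88.38 | 88.31 | 89.47 | 90.76 | 88.21 |
| All | FR | multi(sem) | 87.75 | 86.97 | 88.55 | 88.55 | 88.85 | 88.26 |
| All | FR | multi(sent_ali) | 88.55 | 88.55 | 88.56 | 89.43 | 90.11 | 88.75 |
| All | FR | multi(skip_gram) | 88.52 | 88.97 | 88.07 | 89.16 | 90.04 | 88.29 |
| IT | IT | mono | 74.2 | 77.95 | 70.8 | 78.16 | 81.06 | 75.47 |
| All | IT | multi(pseudo_dict) | 81.86 | 84.31 | 79.27 | 80.11 | 83.41 | 77.07 |
| All | IT | multi(exp_dict) | 80.76 | 84.15 | 77.65 | 78.56 | 81.40 | 75.92 |
| All | IT | multi(CCA) | 81.53 | 81.17 | 81.89 | 81.78 | 84.67 | 79.09 |
| All | IT | multi(sem) | 80.82 | 82.96 | 78.80 | 80.76 | 81.81 | 79.74 |
| All | IT | multi(sent_ali) | 78.98 | 82.37 | 75.87 | 80.18 | 84.14 | 76.57 |
| All | IT | multi(skip_gram) | 80.76 | 81.64 | 79.89 | 81.49 | 84.07 | 79.07 |

Table 6: CLDC Performance Comparison between different training modes with different embeddings using different text classification architectures

| Train | Test | Embeddings | FT-MLP F1 | FT-MLP P | FT-MLP R | MF-CNN F1 | MF-CNN P | MF-CNN R | bi-GRU-Att F1 | bi-GRU-Att P | bi-GRU-Att R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EN | EN | mono | 68.04 | 70.05 | 66.15 | 77.42 | 85.84 | 70.5 | 79.71 | 81.48 | 78.01 |
| All | EN | multi(pseudo_dict) | 71.84 | 70.68 | 73.03 | 79.31 | 83.51 | 75.51 | 76.86 | 77.73 | 76.02 |
| All | EN | multi(exp_dict) | 67.12 | 69.22 | 65.14 | 79.43 | 85.56 | 74.13 | 78.28 | 82.15 | 74.76 |
| All | EN | multi(CCA) | 70.89 | 70.52 | 71.28 | 76.76 | 82.52 | 71.76 | 78.19 | 82.42 | 74.39 |
| All | EN | multi(sem) | 68.12 | 70.80 | 65.64 | 73.37 | 80.0 | 67.76 | 73.53 | 79.34 | 68.51 |
| All | EN | multi(sent_ali) | 69.19 | 72.55 | 66.14 | 69.45 | 74.55 | 65.01 | 69.39 | 71.23 | 67.65 |
| All | EN | multi(skip_gram) | 70.91 | 72.35 | 69.52 | 78.22 | 86.33 | 71.5 | 66.43 | 90.44 | 52.5 |
| DE | DE | mono | 55.46 | 59.93 | 51.61 | 58.58 | 68.22 | 51.33 | 58.58 | 68.22 | 51.33 |
| All | DE | multi(pseudo_dict) | 64.08 | 65.38 | 62.83 | 69.03 | 77.32 | 62.34 | 69.87 | 74.81 | 65.54 |
| All | DE | multi(exp_dict) | 62.13 | 66.53 | 58.27 | 66.81 | 74.40 | 60.62 | 68.71 | 75.74 | 62.87 |
| All | DE | multi(CCA) | 65.02 | 65.90 | 64.16 | 64.45 | 72.97 | 57.71 | 65.39 | 74.12 | 58.50 |
| All | DE | multi(sem) | 64.15 | 65.99 | 62.42 | 68.46 | 76.1 | 62.21 | 66.26 | 73.24 | 60.5 |
| All | DE | multi(sent_ali) | 65.68 | 66.01 | 65.37 | 65.57 | 70.29 | 61.45 | 67.17 | 71.89 | 63.03 |
| All | DE | multi(skip_gram) | 63.21 | 66.6 | 60.14 | 67.25 | 72.79 | 62.5 | 60.63 | 67.24 | 55.2 |

Table 7: Comparison of Detection Results using different text classification architectures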

Table 5 shows the hyperparameters used for each model. For FT-MLP, we use a first dense layer with 512 units and rectified linear unit (ReLU) activation, followed by a second dense layer that directly precedes the softmax activation, a dropout rate of 0.7, and an Adam optimizer with learning rate 10^-2. For MF-CNN, we use three types of filters with kernel sizes 3, 4 and 5, each consisting of 200 filters, a dropout rate of 0.3, and an Adam optimizer with learning rate 10^-3. bi-GRU-Att uses 150 GRU units with tanh activation, a dropout rate of 0.3, and an Adam optimizer with learning rate 10^-3. For the multi-tasking experiments, we design a hierarchical attention network that uses bidirectional GRUs with 50 units and tanh activation. We use a dropout rate of 0.5 and alternate training between the two tasks with two different learning rates in order to keep them synchronized with each other.

We use Keras version 2.0.2 for training FT-MLP, MF-CNN and bi-GRU-Att, and TensorFlow version 1.4.0 to implement the multi-tasking models, as they require lower-level handling of the loss function.

Appendix E Fine-Grained Results

E.1 Cross-lingual Document Classification

Table 6 shows the fine-grained analysis of the performance of the different embeddings with the different neural architectures for document classification.

E.2 Cross-lingual Churn Detection

Table 7 shows the fine-grained analysis of the performance of the different embeddings with the different neural architectures for churn detection.