Bayesian multilingual topic model for zero-shot cross-lingual topic identification

07/02/2020 ∙ by Santosh Kesiraju, et al. ∙ Brno University of Technology ∙ IIIT Hyderabad

This paper presents a Bayesian multilingual topic model for learning language-independent document embeddings. Our model learns to represent the documents in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. We propagate the learned uncertainties through linear classifiers for zero-shot cross-lingual topic identification. Our experiments on the 5-language Europarl and Reuters (MLDoc) corpora show that the proposed model outperforms multilingual word embedding and BiLSTM sentence encoder based systems by significant margins in the majority of the transfer directions. Moreover, our system, trained in under a day on a single GPU with much less data, performs competitively with the state-of-the-art universal BiLSTM sentence encoder trained on 93 languages. Our experimental analysis shows that increasing the amount of parallel data improves the overall performance of the embeddings. Nonetheless, exploiting the uncertainties is always beneficial.




1 Introduction

The majority of pattern recognition tasks in (but not limited to) computer vision, speech and natural language processing rely on deep learning models [Goodfellow et al.2016]. By exploiting the large amounts of available data, these models are able to learn compact, semantically rich representations [Schroff et al.2015, Olah et al.2018, Grave et al.2018]. Embeddings are such semantically rich representations extracted from the input data. In the context of text modelling, these refer to word and (sentence) document embeddings [Bojanowski et al.2017], which are further used in several downstream tasks such as text classification [Pappagari et al.2018], (neural) machine translation [Qi et al.2018], named entity recognition [Chiu and Nichols2016], and language model adaptation [Chen et al.2015, Beneš et al.2018]. Often, these embeddings are only point estimates and do not capture any uncertainty in the estimates. Thus, any error in the estimated embeddings is propagated to the downstream tasks. This is especially important when the training data for the downstream classification task is scarce. In this paper, we present a Bayesian model that learns to represent document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty within its covariance. Furthermore, the uncertainty is propagated to the classifier, which exploits it for topic identification (ID) in a low-resource scenario. More specifically, we learn language-independent document embeddings which are used for zero-shot cross-lingual topic ID.

Closed-set monolingual topic ID or document classification in resource-rich scenarios is usually done with discriminative models such as end-to-end neural network classifiers [Zhang et al.2015, Yang et al.2016] or pre-trained language models fine-tuned for classification [Howard and Ruder2018, Yang et al.2019]. In the case of cross-lingual topic ID, where the target data has few or no labels, learning a common embedding space for multiple (say, $L$) languages is beneficial [Ammar et al.2016, Schwenk and Li2018, Ruder et al.2019]. This common embedding space is learnt by exploiting parallel dictionaries or parallel sentences (translations) among the languages. Such parallel data is not required to have any topic labels. A classifier is then trained on the embeddings from a source (src) language (one of the $L$ languages) that has topic labels. The same classifier is then used to classify the embeddings extracted for test data, which can be from any of the target (tar) languages. The underlying assumption here is that the embeddings carry semantic concept(s), independent of language, enabling cross-lingual transferability (src → tar). Hence, the reliability of this scheme depends solely on the quality of the embedding space. Note that the amount of available training data for the classifier could be limited and different from the parallel data, which is also the case for the experiments presented in this paper. To summarize:

  1. We propose a Bayesian multilingual topic model (§ 2), which aims to learn a common low-dimensional subspace for document-specific unigram distributions from multiple languages. Moreover, the proposed model represents the document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. We present two classifiers for zero-shot cross-lingual topic identification that exploit these uncertainties: (a) a generative Gaussian linear classifier (§ 3.1), and (b) a discriminative multi-class logistic regression (§ 3.2).

  2. The experiments on the 5 European (EU) language subset of the Reuters multilingual corpora (MLDoc) show that the proposed system outperforms (a) the multilingual word embedding based method (Multi-CCA), and (b) the neural machine translation based sequence-to-sequence bi-directional long short-term memory network (BiLSTM-EU) systems [Schwenk and Li2018]. We also show that our system, even when using a relatively low amount of parallel training data, performs competitively against the state-of-the-art universal sentence encoder trained on 93 languages (BiLSTM-93) [Artetxe and Schwenk2019].

  3. Our experimental analysis (§ 6.2) shows that increasing the amount of parallel data improves the overall performance of the cross-lingual transfers. Nonetheless, exploiting the uncertainties during classification is always beneficial.

2 Model


Figure 1: (Left) Graphical representation of the proposed multilingual model, where $L$ represents the number of languages and $D$ denotes the number of $L$-way parallel documents (translations). $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}$ are document-independent, language-specific model parameters, whereas $\mathbf{w}$ is a document-specific but language-independent random variable (embedding). $N_\ell$ represents the number of word tokens in the document from language $\ell$. (Right) Alternative representation, where the document embedding $\mathbf{w}$ is passed through language-specific linear layers whose parameters are $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}$. The outputs are sent through the softmax function to obtain the unigram distribution of words in the document for each language $\ell$.

Like the majority of probabilistic topic models [Blei2012, Miao et al.2016], our model relies on a bag-of-words representation of documents. Let $V_\ell$ represent the vocabulary size in language $\ell$. Let $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}$ represent the language-specific model parameters, where $\mathbf{T}_\ell$ is a low-rank matrix of size $V_\ell \times K$ that defines the subspace of document-specific unigram distributions. Our multilingual model assumes that the $L$-way parallel data (translations of bag-of-words) are generated according to the following process:

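First, sample a $K$-dimensional ($K \ll V_\ell$) language-independent, document-specific embedding $\mathbf{w}$ from an isotropic Gaussian prior distribution with precision $\lambda$:

$$\mathbf{w} \sim \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}) \qquad (1)$$

$\mathbf{w}$ can be interpreted as a vector representing higher-level semantic concepts (topic-like) of a document, independent of any language. For each language $\ell \in \{1, \ldots, L\}$, a vector of word counts $\mathbf{x}_\ell$ is generated by the following two steps:

  1. Compute the document-specific unigram distribution using the language-specific parameters:

     $$\boldsymbol{\theta}_\ell = \mathrm{softmax}(\mathbf{b}_\ell + \mathbf{T}_\ell \mathbf{w}) \qquad (2)$$

  2. Sample a vector of word counts $\mathbf{x}_\ell$:

     $$\mathbf{x}_\ell \sim \mathrm{Multinomial}(\boldsymbol{\theta}_\ell, N_\ell) \qquad (3)$$

     where $N_\ell$ is the number of trials (word tokens in the document), i.e., $N_\ell = \sum_{i=1}^{V_\ell} x_{\ell i}$.

$\{\mathbf{x}_\ell\}_{\ell=1}^{L}$ represent the $L$-way parallel bag-of-words statistics.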

The above steps describe the generative process of the proposed multilingual topic model. However, in reality, we do not generate any data; instead, we invert the generative process: given the training (observed) data $\{\mathbf{x}_\ell\}_{\ell=1}^{L}$, we estimate the language-specific model parameters $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}_{\ell=1}^{L}$ and also the posterior distributions of the language-independent document embeddings $\mathbf{w}$. Moreover, given an unseen document $\mathbf{x}_\ell$ from any of the $L$ languages, we infer the corresponding posterior distribution of the document embedding $p(\mathbf{w} \mid \mathbf{x}_\ell)$. Note that such a posterior distribution also carries the uncertainty about the estimate.

Although we describe the model assuming $L$-way parallel data, in practice the model can be trained with parallel text (translations) between language pairs covering all the $L$ languages.
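The generative process of this section can be sketched in a few lines of NumPy: one draw of the embedding from the isotropic Gaussian prior, then a softmax and a multinomial draw per language. This is only an illustrative sketch, not the authors' implementation. The function name `sample_parallel_document`, the toy vocabulary sizes, the token counts, and the randomly drawn parameters are all assumptions made for the example.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax of a 1-D array."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()


def sample_parallel_document(params, n_tokens, lam=1.0, rng=None):
    """Sample one set of parallel bag-of-words vectors.

    params   : dict lang -> (b, T), with b of shape (V,) and T of shape (V, K)
    n_tokens : dict lang -> number of word tokens to draw for that language
    lam      : precision of the isotropic Gaussian prior on the embedding
    """
    rng = np.random.default_rng() if rng is None else rng
    K = next(iter(params.values()))[1].shape[1]
    # Language-independent, document-specific embedding drawn from the prior.
    w = rng.normal(0.0, 1.0 / np.sqrt(lam), size=K)
    counts = {}
    for lang, (b, T) in params.items():
        theta = softmax(b + T @ w)                     # unigram distribution
        counts[lang] = rng.multinomial(n_tokens[lang], theta)  # word counts
    return w, counts


# Toy setup (hypothetical sizes, far smaller than real vocabularies).
rng = np.random.default_rng(0)
K = 4
vocab_sizes = {"en": 30, "de": 50}
params = {
    lang: (rng.normal(size=V), rng.normal(scale=0.5, size=(V, K)))
    for lang, V in vocab_sizes.items()
}
w, counts = sample_parallel_document(params, {"en": 40, "de": 35}, rng=rng)
```

Both languages share the same `w`, which is what ties the translations together; only `b` and `T` differ per language.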

2.1 Variational Bayes training

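The proposed model is trained using the variational Bayes framework, i.e., we approximate the intractable true posterior of the embedding $\mathbf{w}$ with the variational distribution:

$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\nu}, \boldsymbol{\Gamma}^{-1}) \qquad (4)$$

and optimize the evidence lower-bound [Bishop2006]. Further, we use Monte Carlo samples via the re-parametrization trick [Kingma and Welling2014, Rezende et al.2014] to approximate the expectation over the log-sum-exp term which appears in the lower-bound [Miao et al.2016, Kesiraju et al.2019]. The resulting lower-bound for a single set of $L$-way parallel documents is given by:

$$\mathcal{L} = -D_{\mathrm{KL}}(q \,\|\, p) + \sum_{\ell=1}^{L} \sum_{i=1}^{V_\ell} x_{\ell i} \left[ b_{\ell i} + \mathbf{t}_{\ell i} \boldsymbol{\nu} - \frac{1}{R} \sum_{r=1}^{R} \log \sum_{k=1}^{V_\ell} \exp\{ b_{\ell k} + \mathbf{t}_{\ell k}\, g(\boldsymbol{\epsilon}_r) \} \right] \qquad (5)$$

where $x_{\ell i}$ is the count of word $i$ in language $\ell$, $b_{\ell i}$ and $\mathbf{t}_{\ell i}$ are the $i$-th element and row of the language-specific parameters $\mathbf{b}_\ell$ and $\mathbf{T}_\ell$, $D_{\mathrm{KL}}(q \,\|\, p)$ is the Kullback-Leibler divergence from the variational distribution (4) to the prior (1), and $g(\boldsymbol{\epsilon}_r) = \boldsymbol{\nu} + \boldsymbol{\Gamma}^{-1/2} \boldsymbol{\epsilon}_r$ with $\boldsymbol{\epsilon}_r \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. $R$ is the number of Monte Carlo samples used for empirically approximating the expectation over the log-sum-exp term. The derivation of the lower-bound for the monolingual case is given in [Kesiraju et al.2019].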

The complete lower-bound is just the summation over all the documents. Additionally, we use regularization term with weight for language-specific model parameters . Thus, the final objective is


In practice, we follow batch-wise stochastic optimization of (6) using Adam [Kingma and Ba2015]. In each iteration, we update all the model parameters $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}$ and the corresponding posterior distributions of the document embeddings $q(\mathbf{w})$.
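A Monte Carlo estimate of the per-document lower-bound can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes a diagonal covariance for the variational distribution, drops the multinomial normalization constant (which does not depend on the parameters), and omits the regularization term. The names `kl_to_prior`, `log_sum_exp` and `elbo`, as well as the toy data, are hypothetical.

```python
import numpy as np


def kl_to_prior(nu, var, lam):
    """KL( N(nu, diag(var)) || N(0, (1/lam) I) ) in closed form."""
    return 0.5 * np.sum(lam * var + lam * nu ** 2 - 1.0 - np.log(lam * var))


def log_sum_exp(a):
    """Numerically stable log(sum(exp(a))) for a 1-D array."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())


def elbo(counts, params, nu, var, lam=1.0, n_samples=8, rng=None):
    """Monte Carlo estimate of the lower-bound for one set of parallel documents.

    counts : dict lang -> word-count vector x of shape (V,)
    params : dict lang -> (b, T), with b of shape (V,) and T of shape (V, K)
    nu, var: mean and (assumed diagonal) variance of q(w), each of shape (K,)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Re-parametrization: w_r = nu + sqrt(var) * eps_r, with eps_r ~ N(0, I).
    eps = rng.standard_normal((n_samples, nu.size))
    w_samples = nu + np.sqrt(var) * eps
    value = -kl_to_prior(nu, var, lam)
    for lang, x in counts.items():
        b, T = params[lang]
        # The linear term has an exact expectation under q(w).
        linear = x @ (b + T @ nu)
        # The log-sum-exp term is averaged over the Monte Carlo samples.
        lse = np.mean([log_sum_exp(b + T @ w) for w in w_samples])
        value += linear - x.sum() * lse
    return value


# Toy example with two languages (hypothetical sizes and random parameters).
rng = np.random.default_rng(0)
K = 3
params = {
    "en": (rng.normal(size=20), rng.normal(scale=0.3, size=(20, K))),
    "de": (rng.normal(size=25), rng.normal(scale=0.3, size=(25, K))),
}
counts = {
    "en": rng.multinomial(30, np.full(20, 1 / 20)),
    "de": rng.multinomial(28, np.full(25, 1 / 25)),
}
value = elbo(counts, params, nu=np.zeros(K), var=np.ones(K), rng=rng)
```

During training this quantity would be maximized with respect to both the model parameters and `(nu, var)`; for an unseen document only `(nu, var)` would be updated.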

2.2 Extracting embeddings for unseen documents

Given bag-of-words statistics $\mathbf{x}_\ell$ from an unseen document in any of the $L$ languages, we can infer (extract) the corresponding document embedding along with its uncertainty. This is done by keeping the language-specific model parameters $\{\mathbf{b}_\ell, \mathbf{T}_\ell\}$ fixed, and iteratively optimizing the objective in (5) with respect to the parameters of the variational distribution. In the resulting $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\nu}, \boldsymbol{\Gamma}^{-1})$, the mean $\boldsymbol{\nu}$ represents the (most likely) document embedding, and the variance $\boldsymbol{\Gamma}^{-1}$ encodes the uncertainty around the mean.

3 Classification exploiting uncertainties

In a traditional scenario, where we have only point estimates of embeddings, all the embeddings are considered equally important by a classifier. This may not always hold. For example, shorter and ambiguous documents can result in poor estimates of the embeddings, which can affect the classifier during training and its performance during prediction. Since our proposed model yields document embeddings represented by Gaussian distributions, with the uncertainty about the embedding encoded in the covariance, we use two linear classifiers that can exploit this uncertainty. The first one is the generative Gaussian linear classifier with uncertainty (GLCU) [Kesiraju et al.2019]. The second one is the discriminative multi-class logistic regression with uncertainty (MCLRU).

3.1 Generative classifier

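In general, for any classification task, we estimate the posterior probability of a class label ($c$) given a feature vector (embedding) $\mathbf{w}$:

$$p(c \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c, \Theta)\, p(c)}{\sum_{j} p(\mathbf{w} \mid j, \Theta)\, p(j)} \qquad (7)$$

where $p(\mathbf{w} \mid c, \Theta)$ is the likelihood function parametrized by $\Theta$, and $p(c)$ is the class prior. In the case of generative classifiers, the likelihood function is assumed to have a known parametric form (e.g. Gaussian, Multinomial).

For the Gaussian linear classifier (GLC), the likelihood function is $p(\mathbf{w} \mid c) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_c, \mathbf{D}^{-1})$, where $\mathbf{w}$ is the input feature (point estimate of the embedding), $\boldsymbol{\mu}_c$ is the mean of class $c$, and $\mathbf{D}$ is the precision matrix shared across all the classes.

Given that our input features (embeddings) come in the form of Gaussian distributions, i.e., $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\nu}, \boldsymbol{\Gamma}^{-1})$, we can integrate out (exploit) the uncertainty in the input while evaluating the likelihood function. In the case of the generative Gaussian classifier, where the likelihood function is also Gaussian, the expected likelihood has an analytical form [Cumani et al.2015, Kesiraju et al.2019]:

$$\mathbb{E}_{q}\left[ p(\mathbf{w} \mid c) \right] = \int \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_c, \mathbf{D}^{-1})\, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\nu}, \boldsymbol{\Gamma}^{-1})\, \mathrm{d}\mathbf{w} = \mathcal{N}(\boldsymbol{\nu} \mid \boldsymbol{\mu}_c, \mathbf{D}^{-1} + \boldsymbol{\Gamma}^{-1}) \qquad (8)$$

GLC with the likelihood function replaced by (8) is called GLCU. Both are essentially the same classifier, i.e., they make the same assumptions about the underlying data and hence have the same model parameters. The only difference lies in the evaluation of the likelihood function.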
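The difference between GLC and GLCU at prediction time can be illustrated with a small NumPy sketch. It relies on the standard Gaussian identity that integrating a Gaussian likelihood against a Gaussian input adds the two covariances. The function names, the two toy classes, and all numeric values are assumptions made for the illustration; they do not come from the paper.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x | mean, cov)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def glcu_log_likelihood(nu, emb_cov, class_mean, shared_cov):
    """Expected likelihood: the embedding covariance is added to the shared one."""
    return gaussian_logpdf(nu, class_mean, shared_cov + emb_cov)


def class_posteriors(log_liks, log_priors):
    """Bayes rule in the log domain, normalized over classes."""
    a = np.asarray(log_liks) + np.asarray(log_priors)
    a = a - a.max()
    p = np.exp(a)
    return p / p.sum()


# Two toy classes with a shared covariance (hypothetical values).
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
shared_cov = 0.25 * np.eye(2)
log_priors = np.log([0.5, 0.5])

# One document embedding: mean nu and a large posterior covariance.
nu = np.array([0.3, 0.0])
emb_cov = 4.0 * np.eye(2)

# GLC uses only the mean; GLCU also uses the covariance of the embedding.
p_glc = class_posteriors(
    [gaussian_logpdf(nu, m, shared_cov) for m in means], log_priors
)
p_glcu = class_posteriors(
    [glcu_log_likelihood(nu, emb_cov, m, shared_cov) for m in means], log_priors
)
```

In this toy case the point-estimate classifier is confident about the second class, while the uncertainty-aware posterior moves toward 0.5, reflecting that a highly uncertain embedding carries less evidence.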

3.2 Discriminative classifier

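For a discriminative classifier such as multi-class logistic regression (MCLR), the posterior probability of a class label ($c$) given an input feature vector $\mathbf{w}$ is

$$p(c \mid \mathbf{w}) = \frac{\exp\{\mathbf{h}_c^{\mathsf{T}} \mathbf{w} + a_c\}}{\sum_{j=1}^{C} \exp\{\mathbf{h}_j^{\mathsf{T}} \mathbf{w} + a_j\}} \qquad (9)$$

where $\{\mathbf{h}_c, a_c\}_{c=1}^{C}$ are the parameters of the classifier. Unlike in GLC, we cannot analytically compute the expectation over (9) with respect to the input features (Gaussian distributions). Instead, we approximate the expectation using $M$ Monte Carlo samples [Kendall and Gal2017, Xiao and Wang2019]:

$$p(c \mid q(\mathbf{w})) \approx \frac{1}{M} \sum_{m=1}^{M} p(c \mid \mathbf{w}_m), \qquad \mathbf{w}_m \sim q(\mathbf{w}) \qquad (10)$$

Eq. (10) represents the posterior probability computation for MCLRU.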
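The Monte Carlo approximation used by MCLRU can be sketched as below: sample embeddings from the Gaussian posterior, apply the softmax classifier to each sample, and average the class probabilities. The weight matrix `H`, the bias `a`, the sample count, and the toy embedding are illustrative assumptions, not values from the paper.

```python
import numpy as np


def softmax_rows(a):
    """Row-wise numerically stable softmax of a 2-D array."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)


def mclr_posterior(w, H, a):
    """Class posteriors for a point estimate w; H has shape (C, K), a has shape (C,)."""
    return softmax_rows((H @ w + a)[None, :])[0]


def mclru_posterior(nu, emb_cov, H, a, n_samples=256, rng=None):
    """Average of softmax outputs over samples drawn from N(nu, emb_cov)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(nu, emb_cov, size=n_samples)  # (M, K)
    return softmax_rows(samples @ H.T + a).mean(axis=0)


# Toy three-class classifier on 2-D embeddings (hypothetical parameters).
rng = np.random.default_rng(0)
H = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
a = np.zeros(3)
nu = np.array([1.0, 0.2])

p_point = mclr_posterior(nu, H, a)
p_uncertain = mclru_posterior(nu, 4.0 * np.eye(2), H, a, rng=rng)
p_certain = mclru_posterior(nu, 1e-12 * np.eye(2), H, a, rng=rng)
```

With a near-zero covariance the averaged posterior coincides with the point-estimate posterior, whereas a broad covariance spreads probability mass across classes.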

Theoretically, given the true uncertainties in the training examples, GLCU and MCLRU can better estimate the model parameters of the classifier. Similarly, they can also exploit the uncertainties in the test examples during classification. See Appendix A for an illustration on synthetic data. However, in our case, the uncertainties are estimated using our Bayesian multilingual topic model as described in § 2.2. The underlying assumption here is that the uncertainties extracted using our model are close enough to the true uncertainties expected by the classifiers. This assumption is empirically supported by our experimental results presented in § 6.

4 Related works

4.1 Gaussian embeddings: modelling uncertainties

Recent works in NLP [Vilnis and McCallum2015, Sun et al.2018] represent word embeddings in the form of Gaussian distributions. Using the asymmetric KL divergence or the symmetric Wasserstein distance, the uncertainty is exploited for word similarity, entailment and document classification tasks. Similar to the present paper, [Xiao and Wang2019] quantifies the uncertainties in the data and exploits them for sentiment analysis, named entity recognition, etc.

Gaussian embeddings extracted from spoken utterances, popularly known as i-vectors [Dehak et al.2011], were used for speaker identification and verification tasks, and were the state-of-the-art for several years [Kenny et al.2013]. Ondel et al. [Ondel et al.2019] proposed a fully Bayesian subspace hidden Markov model for acoustic unit discovery from speech, where phone-like (acoustic) units from an unseen language are represented by Gaussian embeddings living in a subspace that was learnt using labelled data from other languages. Brümmer et al. [Brümmer et al.2018] developed a theoretical framework around Gaussian embeddings for various classification and verification scenarios.

Kendall and Gal [Kendall and Gal2017] argued for the importance of modelling uncertainty in safety-critical applications of computer vision, and applied it to semantic segmentation and depth regression tasks.

4.2 Multilingual embeddings in NLP

Multilingualism in machine learning models can be achieved using word embeddings, or joint sentence (document) embeddings or pre-trained language models sharing a common vocabulary.

Ammar et al. [Ammar et al.2016] showed that word embeddings trained using monolingual corpora in several languages can be mapped to a common space (EN) by exploiting parallel dictionaries. The authors used canonical correlation analysis (CCA) to learn these mappings. The mapped embeddings are used in a convolutional neural network for cross-lingual topic ID [Schwenk and Li2018].

Using parallel data (Europarl), Schwenk and Li [Schwenk and Li2018] trained a sequence-to-sequence (seq2seq) model comprising BiLSTM layers to learn a common embedding space for sentences from multiple languages. In their model, each language has a separate encoder and decoder. A similar seq2seq model was proposed in [Artetxe and Schwenk2019], where the authors used a joint byte-pair-encoding vocabulary over 93 languages. Further, the encoder and decoder are shared across all the languages. The encoder is a BiLSTM with 5 layers, whereas the decoder is a single LSTM layer, which additionally takes the language ID (embedding) as input. Embeddings for new test data are obtained by forward propagating through the encoder. This is followed by a feed-forward neural network classifier with two hidden layers for cross-lingual topic ID.

BERT [Devlin et al.2019] is a transformer based pre-trained language model. Multilingual BERT (mBERT) [Wu and Dredze2019] uses a shared word piece vocabulary from 104 languages and aims to learn cross-lingual representations without any parallel data. On the other hand, the multilingual translation encoder (MMTE) [Siddhant et al.2020] uses the transformer architecture for neural machine translation, whose encoder is fine-tuned for classification tasks.

5 Experimental setup

5.1 Datasets

Europarl (v7) contains numerous parallel sentences between several European language pairs [Koehn2005]. We considered 5 languages namely, English (EN), German (DE), French (FR), Italian (IT) and Spanish (ES) and constructed multi-aligned sentences. Using English as reference, we retained sentences that are at least 40 words in length; which resulted in k multi-aligned sentences. These were used to train the proposed multi-lingual document embedding model. The maximum number of sentences are kk. In reality, not every sentence has a translation in all 5 languages. Later in § 6.2, we present the comparison of our systems with various amounts of parallel data.

MLDoc (Reuters multilingual corpus, vols. 1 and 2) is a collection of more than 800k news stories covering 4 topics in 13 languages, including EN, DE, FR, IT and ES. Using the standardized data preparation framework [Schwenk and Li2018], we created 5 class-balanced splits, where each split has 1000 training, 1000 development and 4000 test documents. We report the average classification accuracy over the 5 splits.

Language        Vocabulary size ($V_\ell$)
English (EN)    29823
German (DE)     60937
French (FR)     37164
Italian (IT)    44300
Spanish (ES)    44724

Table 1: Vocabulary size in each language.

Hyper-parameters Multilingual model MCLR
Table 2: Model hyper-parameters: the embedding dimension, and the regularization weights for the multilingual model and MCLR respectively.

5.2 Pre-processing

The vocabulary was built using only the multi-aligned Europarl corpus. Table 1 presents the vocabulary statistics. All the words were lower-cased and punctuation was stripped. Further, words that do not occur in at least two sentences were removed.

5.3 Hyper-parameters and model configurations

The proposed Bayesian multilingual topic model has 2 important hyper-parameters, i.e., latent (embedding) dimension and regularization weight corresponding to the model parameters . Table 2 presents the list of hyper-parameters we explored in our experiments. The prior distribution (1) was set to and the variational distribution (4) was initialized to be the same as the prior. This enabled us to use the same learning rate for both mean and variance parameters. A batch size of was used during training. A constant learning rate of was used both during training and inference. The model is trained for epochs and inference is done for iterations to obtain the posterior distributions.

The Gaussian linear classifier with uncertainty (GLCU) has no hyper-parameters to tune. We added a regularization term with weight (Table 2) for the parameters of multi-class logistic regression (MCLR). The classifier was trained for a maximum of 100 epochs using Adam with a constant learning rate of . For multi-class logistic regression with uncertainty (MCLRU), we used for the empirical approximation (10). did not affect the classification performance significantly, but lower values degraded the performance by about 5%.

5.4 Proposed topic ID systems

The two linear classifiers GLC and MCLR use only the point estimates of the embeddings, i.e., they cannot exploit uncertainty during training and test. In the experiments, we used only the mean parameter ($\boldsymbol{\nu}$) as the point estimate of the document embedding. In contrast, GLCU and MCLRU are trained with the full posterior distribution $q(\mathbf{w})$.

5.5 Baseline systems

Our baseline systems for comparison are based on multilingual word embeddings + CNN classifier (Multi-CCA) and BiLSTM based seq2seq models [Schwenk and Li2018]. We denote BiLSTM-EU [Schwenk and Li2018] as the system trained on 5 European languages similar to our systems.

Further, we also compare with the seq2seq BiLSTM trained on 93 languages sharing a common encoder [Artetxe and Schwenk2019]. We denote this as BiLSTM-93. Since the published work [Artetxe and Schwenk2019] only reports results for EN → XX, we took the full matrix of results from the corresponding GitHub repository maintained by the authors. These are the improved results since the publication. BiLSTM-93 was trained on 16 NVIDIA V100 GPUs, which took about 5 days [Artetxe and Schwenk2019].

Although all of these models use the same MLDoc corpus for cross-lingual topic ID, the multilingual embedding models are trained on different amounts of data comprising various languages, hence we cannot directly compare all the models. However, we can compare BiLSTM-EU with our primary system, since both models use the same 5 European languages from Europarl.

6 Results and discussion

We present the full matrix of results, i.e., all possible training-test combinations among the 5 languages. It shows the cross-lingual performance in all transfer directions, enabling a detailed understanding. Fig. 2 shows the accuracy on the development set for various regularization weights . We split the results into two parts: in language refers to the same source and target language, whereas zero-shot transfer refers to different source and target languages. Note that MCLR performs best in the in language setting, whereas GLCU and MCLRU perform best in the zero-shot transfer setting. However, model selection was based only on the in language performance. For MCLRU was found to give best results on the development set (in language average = ). Similarly, for GLCU was found to give best results on the development set (in language average = ). These two are our primary systems; each has about 56 million parameters and took about 22 hours to train on a single NVIDIA Tesla P-100 GPU. Since the language-specific model parameters are independent, inferring the embeddings can be easily parallelized.

Figure 2: Comparison of average classification accuracies on dev set for various hyper-parameters , and classifiers. The embedding dimension .

6.1 Zero-shot cross-lingual transfer

Table 3 presents the zero-shot classification results of our primary system with GLCU and MCLRU respectively. These are the average accuracies from 5 test splits (§ 5.1). All the further comparisons are made with-respect-to these primary systems.

Table 4 shows the absolute differences in classification accuracy between our primary systems and each of the baseline systems. Positive values indicate the absolute improvement of our system compared to the respective baseline system. Note that the first two baseline systems are slightly better when the training and test languages are the same, but significantly worse in the transfer directions. This suggests that these models over-fit on the source language and generalize poorly to the target languages.

As a specific example, by examining the results of Multi-CCA (Table 4 from [Schwenk and Li2018]; alternatively, we can infer the same from Table 4 of this paper), it can be observed that the system performs better when training and testing on the same language. Moreover, Multi-CCA is slightly better when transferring from EN → XX, but relatively worse in other cases such as IT → XX and XX → DE, suggesting a language bias in the embedding space. Note that our primary systems outperform Multi-CCA and BiLSTM-EU in the majority of the transfer directions by significant margins, and moreover perform competitively with the state-of-the-art BiLSTM-93 system. On average, our primary systems (GLCU, MCLRU) are 9.2% and 5.6% better than Multi-CCA and BiLSTM-EU respectively, and only 1.6% worse than BiLSTM-93 in the zero-shot cross-lingual transfer (off-diagonal). Note that BiLSTM-93 is trained with 223M parallel sentences across 93 languages, whereas our primary system is trained on just 730k parallel sentences across 5 languages.

             Test language (GLCU)                     Test language (MCLRU)
Train     EN     DE     FR     IT     ES          EN     DE     FR     IT     ES
EN     86.99  83.90  80.23  65.14  72.60       87.04  83.04  78.39  64.40  73.51
DE     74.04  91.25  81.75  63.50  76.79       74.61  91.67  82.45  66.67  76.97
FR     77.00  85.60  90.34  69.00  78.74       76.21  86.11  89.81  70.69  79.05
IT     71.89  79.36  80.22  80.89  79.69       71.63  80.56  80.37  80.93  79.23
ES     73.14  81.75  81.17  72.32  89.45       72.43  77.93  79.79  71.68  90.12

Table 3: Average test accuracies of the primary systems with GLCU (Left) and MCLRU (Right). Rows are training languages and columns are test languages.
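The split into in language (diagonal) and zero-shot (off-diagonal) averages can be reproduced from the GLCU half of Table 3 with a short NumPy sketch. The helper name `in_language_and_zero_shot` is hypothetical; the numbers are copied from the table, with rows taken as training languages and columns as test languages.

```python
import numpy as np

langs = ["EN", "DE", "FR", "IT", "ES"]

# GLCU accuracies from Table 3: rows = training language, columns = test language.
glcu = np.array([
    [86.99, 83.90, 80.23, 65.14, 72.60],
    [74.04, 91.25, 81.75, 63.50, 76.79],
    [77.00, 85.60, 90.34, 69.00, 78.74],
    [71.89, 79.36, 80.22, 80.89, 79.69],
    [73.14, 81.75, 81.17, 72.32, 89.45],
])


def in_language_and_zero_shot(acc):
    """Return (mean of the diagonal, mean of the off-diagonal entries)."""
    off_diagonal = ~np.eye(acc.shape[0], dtype=bool)
    return np.diag(acc).mean(), acc[off_diagonal].mean()


in_lang_avg, zero_shot_avg = in_language_and_zero_shot(glcu)
print(f"in language: {in_lang_avg:.2f}, zero-shot: {zero_shot_avg:.2f}")
```

For the GLCU matrix this gives roughly 87.78 for the in language average and 76.39 for the zero-shot average over the 20 transfer directions.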
             Test language (GLCU)                     Test language (MCLRU)
Train     EN     DE     FR     IT     ES          EN     DE     FR     IT     ES

Multi-CCA [Schwenk and Li2018]
EN     -5.10   1.42   6.17  -5.62   0.89       -5.16   1.84   6.01  -4.98   1.01
DE     17.30  -2.24  10.78   2.26   3.84       18.66  -2.03  10.90   2.69   3.74
FR     11.52  31.73  -2.50   8.74  13.25       11.41  32.41  -2.69   9.54  13.65
IT     17.34  28.76  16.99  -4.28  20.33       17.93  31.36  18.12  -4.62  20.55
ES     -1.67  23.00  13.66  12.88  -4.53       -1.57  22.13  14.16  13.33  -4.32

BiLSTM-EU [Schwenk and Li2018]
EN     -1.30  10.80   5.75   3.03   6.74       -1.36  11.21   5.59   3.67   6.86
DE      1.73  -0.57   6.88   9.79   1.57        3.09  -0.36   7.00  10.22   1.47
FR      0.32   7.01   0.25   6.19   7.95        0.21   7.69   0.06   6.99   8.35
IT      3.89  11.74  14.17  -1.61  11.94        4.48  14.34  15.30  -1.95  12.16
ES      9.63   7.75  16.62  13.30   1.64        9.73   6.88  17.12  13.75   1.85

BiLSTM-93 [Artetxe and Schwenk2019]
EN     -3.74  -2.35   2.20  -5.06  -6.70       -3.69  -3.21   0.36  -5.80  -5.79
DE     -6.71  -1.45  -1.08  -9.75  -2.81       -6.14  -1.03  -0.38  -6.58  -2.63
FR     -3.08  -1.43  -0.46  -2.08   0.34       -3.87  -0.92  -0.99  -0.39   0.65
IT     -2.26  -1.37   1.87  -5.04  -2.91       -2.52  -0.17   2.02  -5.00  -3.37
ES      3.56   2.02   5.87   1.22   0.70        2.85  -1.80   4.49   0.58   1.38

Table 4: Comparison of our primary systems (GLCU (Left) and MCLRU (Right)) with the baseline systems. Rows are training languages and columns are test languages. Positive values indicate an absolute improvement of our system over the respective baseline.

6.2 Significance of uncertainties in low-resource scenario

In this section, we compare the zero-shot topic ID performance of various classifiers with the embeddings extracted using our multilingual model. Given that we have only 1000 examples for training the classifiers, we can see the importance of modelling and utilizing uncertainties in such a low-resource setting.

To better illustrate the importance of uncertainties, we trained GLC and MCLR with only the mean parameters, but at test (prediction) time we used the full posterior distributions (along with uncertainties) of the test document embeddings. This is valid because GLC and GLCU have exactly the same model parameters (§ 3.1). Similarly, MCLR and MCLRU have exactly the same model parameters (§ 3.2). We denote these two classifiers as GLCU-P and MCLRU-P, where -P denotes that uncertainty is exploited only during prediction.

The comparisons with GLCU-P and MCLRU-P are presented in conjunction with the amount of parallel data that was used for training our multilingual embedding model. For simplicity, we present the results in two parts, in language and zero-shot transfer. Figure 3 shows the average score on the development set of all the 6 classifiers for varying amounts of parallel data. The overall performance of the systems increases slightly with the amount of parallel data. Nonetheless, exploiting the uncertainties, even only at test time (GLCU-P, MCLRU-P), is always beneficial.

Figure 3: Comparison of average classification accuracies for various classifiers and varying amounts of parallel data. The models trained with 146k multi-aligned parallel sentences are our primary systems.
System                                  Number of languages    Test language
                                        in training data       EN       DE       FR       IT       ES
mBERT [Wu and Dredze2019]               104                    94.20†   80.20    72.60    68.90    72.60
MMTE [Siddhant et al.2020]              103                    94.70*   77.40    77.20    64.20    73.00
BiLSTM-93 [Artetxe and Schwenk2019]     93                     90.73    86.25*   78.03    70.20*   79.30*
Multi-CCA [Schwenk and Li2018]          5                      92.20    81.20    72.38    69.38†   72.50
BiLSTM-EU [Schwenk and Douze2017]       5                      88.40    71.83    72.80    60.73    66.65
Primary system (GLCU)                   5                      86.99    83.90†   80.23*   65.14    72.60
Primary system (MCLRU)                  5                      87.04    83.04    78.39†   64.40    73.51†

Table 5: Results of multilingual zero-shot topic ID systems from EN → XX. * and † mark the first and second best scores in each column, respectively.

6.3 Results for reference

In Table 5, we present the cross-lingual topic ID results from recently published works for reference. Note that all the systems were evaluated on the MLDoc corpus, but the multilingual representation (embedding) models were trained on different amounts of data from various languages. Only BiLSTM-EU and our primary system are trained on the Europarl corpus with the same 5 languages. Moreover, mBERT and BiLSTM-EU are models with a relatively large number of parameters which take enormous computational resources to train, whereas our model can be trained in under a day on a single GPU.

7 Conclusions

In this paper, we presented a Bayesian multilingual topic model, which learns language-independent document embeddings along with their uncertainties. We propagated the uncertainties into a generative and a discriminative linear classifier for zero-shot cross-lingual topic ID. Our systems outperformed the former state-of-the-art BiLSTM and multilingual word embedding based systems in the majority of the transfer directions by significant margins. Moreover, our systems perform competitively with the state-of-the-art universal sentence encoder, while requiring only a fraction of the training data and computational resources. Our detailed experimental analysis emphasizes the importance of modelling and exploiting uncertainties for cross-lingual topic ID.

Appendix A Gaussian linear classifier with uncertainty

Figure 4 compares the Gaussian linear classifier (GLC) with the Gaussian linear classifier with uncertainty (GLCU) on two-dimensional synthetic data. GLC and GLCU are the same classifier with the same model parameters. The difference lies in the evaluation of the likelihood function. Given training data in the form of Gaussian distributions (uncertainty encoded in the covariance), GLCU can exploit this uncertainty to better estimate the model parameters and the corresponding decision boundaries.

Figure 4: Illustration of GLC vs GLCU on two-dimensional synthetic data. The image should be read row-wise first and then compared column-wise. Subplot (i.a) in the first row shows 4 Gaussian distributed classes with their true means marked, along with the training data sampled from these true distributions. The next subplot (i.b) shows the class means and shared covariance estimated from the training samples using the Gaussian linear classifier (GLC). The corresponding (oracle) decision boundaries are shown in subplot (i.c). Subplot (ii.a) shows noisy (uncertain) training samples, which are obtained by adding Gaussian noise to each of the original training samples. Subplots (ii.b) and (ii.c) show the parameters and decision boundaries estimated by GLC. Subplot (iii.a) in the last row shows the same noisy training samples with the (true) uncertainties (only a few uncertainties, drawn as Gaussian ellipses, are shown for illustration). Subplots (iii.b) and (iii.c) show the parameters and decision boundaries estimated by GLCU, which exploits the uncertainties in the training examples.


References

  • [Ammar et al.2016] Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.
  • [Artetxe and Schwenk2019] Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.
  • [Beneš et al.2018] Karel Beneš, Santosh Kesiraju, and Lukáš Burget. 2018. i-Vectors in Language Modeling: An Efficient Way of Domain Adaptation for Feed-Forward Models. In Proc. Interspeech 2018, pages 3383–3387.
  • [Bishop2006] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
  • [Blei2012] David M. Blei. 2012. Probabilistic topic models. Commun. ACM, 55(4):77–84, April.
  • [Bojanowski et al.2017] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
  • [Brümmer et al.2018] Niko Brümmer, Anna Silnova, Lukáš Burget, and Themos Stafylakis. 2018. Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, pages 349–356.
  • [Chen et al.2015] X. Chen, T. Tan, Xunying Liu, Pierre Lanchantin, M. Wan, Mark J. F. Gales, and Philip C. Woodland. 2015. Recurrent neural network language model adaptation for multi-genre broadcast speech recognition. In Proc. Interspeech, ISCA, pages 3511–3515, Dresden, Germany, September.
  • [Chiu and Nichols2016] Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
  • [Cumani et al.2015] Sandro Cumani, Oldřich Plchot, and Radek Fér. 2015. Exploiting i-vector posterior covariances for short-duration language recognition. In Proceedings of INTERSPEECH, ISCA, pages 1002–1006, Dresden, Germany.
  • [Dehak et al.2011] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech & Language Processing, 19(4):788–798.
  • [Devlin et al.2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
  • [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • [Grave et al.2018] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
  • [Howard and Ruder2018] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July. Association for Computational Linguistics.
  • [Kendall and Gal2017] Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc.
  • [Kenny et al.2013] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel. 2013. PLDA for speaker verification with utterances of arbitrary duration. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7649–7653, May.
  • [Kesiraju et al.2019] Santosh Kesiraju, Oldřich Plchot, Lukáš Burget, and Suryakanth V. Gangashetty. 2019. Learning document embeddings along with their uncertainties. ArXiv, abs/1908.07599v2.
  • [Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • [Kingma and Welling2014] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR Conference Track Proceedings, Banff, AB, Canada, April.
  • [Koehn2005] Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.
  • [Miao et al.2016] Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16, pages 1727–1736, New York, NY, USA.
  • [Olah et al.2018] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability. Distill.
  • [Ondel et al.2019] Lucas Ondel, K. Hari Vydana, Lukáš Burget, and Jan Černocký. 2019. Bayesian subspace hidden markov model for acoustic unit discovery. In Proceedings of Interspeech, pages 261–265. International Speech Communication Association.
  • [Pappagari et al.2018] R. Pappagari, J. Villalba, and N. Dehak. 2018. Joint verification-identification in end-to-end multi-scale cnn framework for topic identification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203, April.
  • [Qi et al.2018] Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana, June. Association for Computational Linguistics.
  • [Rezende et al.2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, 22–24 Jun. PMLR.
  • [Ruder et al.2019] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. J. Artif. Int. Res., 65(1):569–630, May.
  • [Schroff et al.2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 815–823.
  • [Schwenk and Douze2017] Holger Schwenk and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, pages 157–167.
  • [Schwenk and Li2018] Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.
  • [Siddhant et al.2020] Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI NY, USA, February 7-12, 2020, pages 8854–8861. AAAI Press.
  • [Sun et al.2018] Chi Sun, Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2018. Gaussian word embedding with a wasserstein distance loss. ArXiv, abs/1808.07016v7.
  • [Vilnis and McCallum2015] Luke Vilnis and Andrew McCallum. 2015. Word representations via gaussian embedding. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • [Wu and Dredze2019] Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China, November. Association for Computational Linguistics.
  • [Xiao and Wang2019] Yijun Xiao and William Yang Wang. 2019. Quantifying uncertainties in natural language processing tasks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7322–7329.
  • [Yang et al.2016] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1480–1489.
  • [Yang et al.2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.
  • [Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 649–657, Cambridge, MA, USA. MIT Press.