
Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization

We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.






Introduction

Learning multilingual sentence representations [29] is a key technique for building NLP applications with multilingual support. A primary advantage of multilingual sentence embeddings is that they enable us to train a single classifier on a single language (e.g., English) and then apply it to other languages (e.g., German) without training models for those languages. Furthermore, recent multilingual sentence embedding techniques [30, 6] have been shown to exhibit competitive performance on several downstream NLP tasks, compared to the two-stage approach that relies on machine translation followed by monolingual sentence embedding techniques.

The main challenge of multilingual sentence embeddings is that they are sensitive to textual similarity (textual similarity bias), which negatively affects the semantic similarity of sentence embeddings [36]. The following example illustrates this point:

  • S1: What time is the pool open tonight?

  • S2: What time are the stores on 5th open tonight?

  • S3: When does the pool open this evening?

S1 and S3 have similar intents: both ask for the opening hours of the pool in the evening. S2 has a different intent: it asks about the opening hours of stores. We expect embeddings of sentences with the same intent to be closer (e.g., to have higher cosine similarity) to one another than embeddings of sentences with different intents.

We tested several pre-trained (multilingual) sentence embedding models [24, 8, 30, 6] in both monolingual and cross-lingual settings. Somewhat surprisingly, every model provided lower similarity scores between S1 and S3 (compared to S1 and S2, or S2 and S3). This is mainly because S1 and S2 are more textually similar (because both sentences contain “what time” and “tonight”) compared to S1 and S3. This example highlights that general-purpose multilingual sentence embeddings exhibit textual similarity bias, which is a fundamental limitation as they may not correctly capture the semantic similarity of sentences.

Motivated by the need for sentence embeddings that better reflect the semantics of sentences, we examine multilingual semantic specialization, which tailors pre-trained multilingual sentence embeddings to handle semantic similarity. Although prior work has developed semantic specialization methods for word embeddings [23] and has studied semantic and linguistic properties of sentence embeddings [36, 9], no prior work has considered semantic specialization of multilingual sentence embeddings.

In this paper, we develop a “lightweight” approach for semantic specialization of multilingual embeddings that can be applied to any base model. Our approach fine-tunes a pre-trained multilingual sentence embedding model based on a classification task that considers semantic similarity. This aligns with common pre-training techniques for NLP [19, 27, 11]. We explore several loss functions to determine which is appropriate for the semantic specialization of cross-lingual sentence embeddings. We found that naive choices of loss functions such as the softmax loss, which is a common choice for classification, may significantly degrade the original multilingual sentence embedding model.

We also design Emu to specialize multilingual sentence embeddings using only monolingual training data as it is expensive to collect parallel training data in multiple languages. Our solution incorporates language adversarial training to enhance the multilinguality of sentence embeddings. Specifically, we implemented a language discriminator that tries to identify the language of an input sentence given its embedding and optimizes multilingual sentence embeddings to confuse the language discriminator.

We conducted experiments on three cross-lingual intent classification tasks that involve 6 languages. The results show that Emu successfully specializes the state-of-the-art multilingual sentence embedding technique, namely LASER, using only monolingual labeled training data together with unlabeled data in other languages. It outperforms the original LASER model and monolingual sentence embeddings with machine translation by up to 47.7% and 86.2% respectively.

The contributions of the paper are as follows:

  • We developed Emu, a system that semantically enhances pre-trained multilingual sentence embeddings. Emu incorporates multilingual adversarial training on top of fine-tuning to enhance multilinguality without using parallel sentences.

  • We experimented with several loss functions and show that two loss functions, namely $\ell_2$-constrained softmax and center loss, outperform common loss functions used for fine-tuning.

  • We show that Emu successfully specializes multilingual sentence embeddings using only monolingual labeled data.

Multilingual Semantic Specialization

The architecture of Emu is depicted in Figure 1. There are three main components, which we detail next: multilingual encoder $E$, semantic classifier $C$, and language discriminator $D$. The solid lines show the flow of forward propagation for fine-tuning $E$ and $C$, and the dotted lines show that for $D$. These arrows are reversed during backpropagation. The semantic classifier and language discriminator are only used for fine-tuning.

After fine-tuning, Emu uses the fine-tuned multilingual encoder to obtain sentence embeddings for input sentences. More specifically, we expect the embeddings of two related sentences, in any languages, to be closer to each other (e.g., to have higher cosine similarity). We consider cosine similarity as it is the most common choice and can be calculated efficiently [33].

Figure 1: Architecture of Emu.

Multilingual Encoder

A multilingual encoder is a language-agnostic sentence encoder that converts sentences in any language into embedding vectors in a common space.

Emu is flexible with the choice of multilingual encoders and their architectures. The only requirement of this component is that it encodes a sentence in any language into a sentence embedding.

In this paper, we use LASER [30] as a base multilingual sentence embedding model. LASER is a multilingual sentence embedding model that covers more than 93 languages with more than 23 different alphabets. It is an encoder-decoder model that shares the same BiLSTM encoder with max-pooling and uses BPE [31] to accept sentences in any language as input. The model is trained on a set of bilingual translation tasks and has been shown to achieve state-of-the-art performance on cross-lingual NLP tasks including bitext mining. We use LASER instead of multilingual BERT models [11] because (1) LASER outperformed the BERT model on the XNLI task [30] and (2) a LASER model can be used as a sentence encoder without any change, whereas a BERT model needs to be fine-tuned to use the first vector corresponding to [CLS] as a sentence embedding.

Semantic Classifier

The semantic classifier categorizes input sentences into groups that share the same intent, such as “seeking pool information” or “seeking restaurant information”. We expect the semantic classifier to enhance multilingual sentence embeddings to better reflect the semantic similarity of related sentences, where the semantic similarity is calculated as the cosine similarity between the embeddings of the two sentences.

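Additionally, we expect that learned embeddings retain semantic similarity with respect to cosine similarity. Thus, we propose the use of $\ell_2$-constrained softmax loss [28] and center loss [34], which are known to be effective for image recognition tasks. To the best of our knowledge, we are the first to apply these loss functions for fine-tuning embedding models. We describe these loss functions next.

$\ell_2$-constrained softmax loss $\ell_2$-constrained softmax loss [28] imposes a hard constraint on the norm of embedding vectors on top of the softmax loss:

$$L_{\ell_2\text{-sm}} = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(W_{y_i}^{\top} u_i + b_{y_i})}{\sum_{j=1}^{n} \exp(W_j^{\top} u_i + b_j)} \quad \text{subject to} \quad \|u_i\|_2 = \alpha, \; \forall i = 1, \dots, M,$$

where $n$ denotes the number of classes, $M$ the number of training sentences, $W$ and $b$ the weights and biases of the output layer, and $u_i$ and $y_i$ are the $i$-th sentence embedding vector and its true label respectively.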

The constraint ensures that embedding vectors are distributed on the hypersphere of radius $\alpha$. On this hypersphere, the squared Euclidean distance between two vectors is proportional to their cosine distance, so reducing one also reduces the other. This property is helpful for specializing sentence embeddings to learn semantic similarity in the form of cosine similarity. Note that this $\ell_2$-constraint is different from the $\ell_2$ regularization term applied to the weight parameters of the output layer, which is added to the loss function rather than imposed as a hard constraint.

To implement the $\ell_2$-constrained softmax loss, the model additionally inserts an $\ell_2$-normalization layer that normalizes the encoder output (i.e., $u_i / \|u_i\|_2$), followed by a layer that scales the result by the hyper-parameter $\alpha$. The scaled vectors are then fed into the output layer, where the model evaluates the softmax loss.
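The normalize-then-scale construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' PyTorch implementation: the function names, the toy weights, and the value of `alpha` are hypothetical. The last lines check numerically that, on the hypersphere of radius `alpha`, the squared Euclidean distance equals `2 * alpha**2 * (1 - cosine similarity)`.

```python
import math

def l2_normalize_and_scale(u, alpha):
    """Project u onto the hypersphere of radius alpha (normalize, then scale)."""
    norm = math.sqrt(sum(x * x for x in u))
    return [alpha * x / norm for x in u]

def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def l2_constrained_softmax_loss(u, label, weights, biases, alpha):
    """Softmax loss evaluated on the normalized and rescaled embedding."""
    v = l2_normalize_and_scale(u, alpha)
    logits = [sum(w_k * v_k for w_k, v_k in zip(w, v)) + b
              for w, b in zip(weights, biases)]
    return softmax_cross_entropy(logits, label)

# Toy values (hypothetical): alpha and the vectors are chosen for illustration.
alpha = 4.0
a = l2_normalize_and_scale([3.0, 4.0], alpha)
b = l2_normalize_and_scale([1.0, 0.0], alpha)
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
cos = sum(x * y for x, y in zip(a, b)) / alpha ** 2
identity_gap = abs(sq_dist - 2 * alpha ** 2 * (1 - cos))  # close to zero

loss = l2_constrained_softmax_loss(
    [3.0, 4.0], 1, weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0], alpha=1.0
)
```

Because every rescaled embedding has the same norm, the classifier can only separate classes by direction, which is what ties the classification objective to cosine similarity.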

Center loss The center loss [34] was originally developed for face recognition tasks to stabilize deep features learned from data. The center loss is described as follows:

$$L_{\text{center}} = \frac{1}{2} \sum_{i=1}^{M} \| u_i - c_{y_i} \|_2^2,$$

where $u_i$ is the $i$-th of $M$ sentence embedding vectors, $y_i$ is its label, and $c_{y_i}$ denotes the centroid of sentence embedding vectors of class $y_i$. The loss function pulls the embedding vector of the $i$-th sample toward the centroid of its true category. Our motivation to use this loss function is to enhance the intra-class compactness of sentence embeddings. That is, we want to ensure that the sentence embeddings that have the same intent form compact clusters, because other loss functions, such as the softmax loss, do not have this property. The center loss also works as a cross-lingual center loss: if multilingual training data are available, it pulls sentences, in any language, that belong to the same intent into the same cluster.

We consider combining the center loss with another function with a hyper-parameter $\lambda$:

$$L_C = L_{\ell_2\text{-sm}} + \lambda L_{\text{center}},$$

where $L_{\ell_2\text{-sm}}$ denotes the $\ell_2$-constrained softmax loss function.
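As a rough illustration of the center loss and its weighted combination with a classification loss, the following plain-Python sketch computes class centroids as batch means and sums the squared distances to them. Computing centroids from the current batch is an assumption made for brevity (the original center loss maintains running centers), and the names `centroids`, `center_loss`, `softmax_term`, and `lam` are hypothetical.

```python
def centroids(embeddings, labels):
    """Mean embedding per class, computed from the given batch."""
    sums, counts = {}, {}
    for u, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(u))
        for k, x in enumerate(u):
            acc[k] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def center_loss(embeddings, labels, centers):
    """Half the sum of squared distances between embeddings and their class centroid."""
    total = 0.0
    for u, y in zip(embeddings, labels):
        total += sum((x - c) ** 2 for x, c in zip(u, centers[y]))
    return 0.5 * total

# Toy batch (hypothetical values): two sentences of intent 0, one of intent 1.
batch = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
batch_labels = [0, 0, 1]
centers = centroids(batch, batch_labels)
l_center = center_loss(batch, batch_labels, centers)

# Weighted combination with a classification loss value computed elsewhere.
softmax_term = 0.7   # placeholder value for the softmax-based loss
lam = 0.1            # placeholder weight for the center loss
combined = softmax_term + lam * l_center
```

A small `lam` keeps the classification term dominant while still penalizing embeddings that drift away from their intent cluster.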

Language Discriminator

The semantic classifier does not directly consider multilinguality, so a model that is fine-tuned on a single language may perform worse on other languages. To avoid this problem, we incorporate multilingual adversarial learning into the framework. Specifically, the language discriminator $D$ aims to identify the language of an input sentence given its embedding, whereas the multilingual sentence encoder incorporates an additional loss function to “confuse” $D$. The idea was inspired by related work that used adversarial learning for multilingual NLP models [5, 4]. We hypothesize, and our experiments show, that incorporating adversarial learning also enhances the multilinguality of sentence embeddings.

The language discriminator is trained to determine whether the languages of two input embeddings are different. Simultaneously, the other part of the model is trained to confuse the discriminator. In our implementation, we use Wasserstein GAN [2] because it is known to be more robust than the original GAN [16].

Algorithm 1 shows a single training step of Emu. Each step consists of two training routines, one for the language discriminator $D$ and one for the other components (multilingual sentence encoder $E$ and semantic classifier $C$). Target language $t$ denotes the language used for training (e.g., English); $t$ is randomly chosen from the training language set if multiple languages are used for training. Adversarial languages $L_{adv}$ is the set of languages from which adversarial sentences are retrieved. To train the language discriminator $D_t$, training sentences in language $t$ and adversarial sentences from a randomly chosen language $l \in L_{adv}$ are used to evaluate $L_{D_t}$. Formally, the loss function for any training language $t$ is described as

$$L_{D_t} = L_d(1, D_t(u_t)) + L_d(0, D_t(u_l)), \tag{3}$$

where $L_d$ is the cross-entropy loss, and $u_t$ and $u_l$ are embedding vectors (encoded by $E$) of sentences in language $t$ and language $l$ ($l \in L_{adv}$). Our design implements a language discriminator for each training language $t$. For instance, language discriminator $D_{en}$ aims to predict whether an input multilingual sentence embedding belongs to English.

Next, labeled sentences in language $t$ and adversarial sentences are sampled to update the parameters of $E$ and $C$ while the parameters of $D_t$ are fixed. The overall loss function now takes into account the loss value of $D_t$, so that the multilingual encoder $E$ generates embeddings for sentences in languages $t$ and $l$ that the language discriminator $D_t$ cannot tell apart. We use hyper-parameter $\gamma$ to balance the loss functions:

$$L_{C,E} = L_C - \gamma L_{D_t}. \tag{4}$$
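To make the two opposing objectives concrete, here is a minimal plain-Python sketch of the loss values only (no gradients). It assumes the discriminator outputs a probability that an embedding comes from the training language, uses binary cross-entropy as the text describes, and subtracts the weighted discriminator loss from the classification loss so that the encoder is rewarded when the discriminator is confused. The sign convention, the function names, and the clipping helper are assumptions for illustration rather than the authors' code.

```python
import math

def binary_cross_entropy(target, prob, eps=1e-12):
    """Cross-entropy between a 0/1 target and a predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def discriminator_loss(probs_train, probs_adv):
    """D should output 1 for training-language embeddings and 0 for adversarial ones."""
    loss_t = sum(binary_cross_entropy(1, p) for p in probs_train) / len(probs_train)
    loss_l = sum(binary_cross_entropy(0, p) for p in probs_adv) / len(probs_adv)
    return loss_t + loss_l

def encoder_classifier_loss(classification_loss, disc_loss, gamma):
    """Objective for the encoder and classifier: classify well AND confuse D."""
    return classification_loss - gamma * disc_loss

def clip_parameters(params, c):
    """Clip each discriminator parameter to the interval [-c, c]."""
    return [min(max(p, -c), c) for p in params]

# A discriminator that cannot tell languages apart outputs 0.5 everywhere.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator has a much smaller loss.
confident = discriminator_loss([0.99, 0.98], [0.01, 0.02])
```

With these definitions, a more confused discriminator (larger `disc_loss`) lowers the encoder objective, which is the direction the encoder update should push.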

Algorithm 1 Single Training Step of Emu

Require: training language $t$, adversarial languages $L_{adv}$, iteration number $k$, clipping interval $c$.

for $i = 1$ to $k$ do
    Sample training sentences as $x_t$
    Sample adversarial language $l$ from $L_{adv}$
    Sample adversarial sentences as $x_l$
    $u_t = E(x_t)$; $u_l = E(x_l)$
    Evaluate loss $L_{D_t}$ (Eq. 3)
    Update $D_t$ parameters
    Clip $D_t$ parameters to $[-c, c]$
Sample training sentences and labels as $x_t$ and $y_t$
Sample adversarial language $l$ from $L_{adv}$
Sample adversarial sentences as $x_l$
$u_t = E(x_t)$; $u_l = E(x_l)$
Evaluate loss $L_{C,E}$ (Eq. 4)
Update $E$ and $C$ parameters
| | HotelQA | ATIS | Quora |
|---|---|---|---|
| # of classes | 28 | 13 | 50 |
| # of training data | 676 | 1,195 | 1,059 |
| # of test data | 144 | 252 | 353 |
| Vocab. size (en) | 977 | 626 | 1,308 |

Table 1: Statistics of the datasets.


We evaluated Emu on the cross-lingual intent classification task. The task is to detect the intent of an input sentence in a source language (e.g., German) based on labeled sentences associated with intent labels in a target language (e.g., English). We consider similarity-based intent detection, which assigns to an input sentence the label of its nearest neighbor, i.e., the labeled sentence that has the highest cosine similarity to the input sentence. We adopted this evaluation method since it is widely used in search-based QA systems [25] and works robustly, especially if training data are sparse. An intuitive alternative for intent detection is to directly use the trained semantic classifier (see Figure 1). We evaluated the classification results using the semantic classifier, but the performance was poor. Therefore, we excluded those results from the tables.
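Similarity-based intent detection as described above amounts to a nearest-neighbor lookup under cosine similarity. The sketch below uses two-dimensional toy vectors in place of real sentence embeddings; the function names, vectors, and labels are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_intent(query, labeled_embeddings, labels):
    """Return the label of the labeled sentence most similar to the query."""
    best = max(
        range(len(labeled_embeddings)),
        key=lambda i: cosine_similarity(query, labeled_embeddings[i]),
    )
    return labels[best]

# Toy embeddings standing in for encoded target-language sentences.
labeled = [[1.0, 0.1], [0.1, 1.0]]
intents = ["pool", "stores"]
predicted = predict_intent([0.9, 0.3], labeled, intents)
```

Because the classifier head is discarded at inference time, the quality of this lookup depends entirely on how well the fine-tuned encoder places same-intent sentences near each other.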

Table 2: Experimental results (Acc@1) on three datasets: (a) HotelQA, (b) ATIS, (c) Quora. The highest performance (excluding Emu-Parallel) is in bold and the highest performance by Emu-Parallel is underlined. Significance markers denote $p$-value levels based on the binomial proportion confidence intervals of Acc@1 values against the baseline methods.

| Training data | en-en | en-de | en-fr | de-en | de-de | de-fr | fr-en | fr-de | fr-fr |
|---|---|---|---|---|---|---|---|---|---|
| En only | +37.5% | +40.0% | +34.3% | +27.0% | +10.0% | +1.7% | +12.3% | +12.7% | +11.1% |
| De only | +26.2% | +47.7% | +10.0% | +49.2% | +10.0% | +25.0% | +9.6% | +7.9% | +9.9% |
| Fr only | +30.0% | +33.8% | +28.6% | +17.5% | +8.7% | +16.7% | +31.5% | +15.9% | +17.3% |
| En + De | +37.5% | +58.5% | +27.1% | +50.8% | +17.5% | +23.3% | +9.6% | +12.7% | +14.8% |
| En + Fr | +40.0% | +60.0% | +50.0% | +46.0% | +12.5% | +33.3% | +35.6% | +25.4% | +23.5% |
| De + Fr | +28.7% | +50.8% | +37.1% | +55.6% | +12.5% | +46.7% | +31.5% | +25.4% | +17.3% |
| En + De + Fr | +41.2% | +63.1% | +47.1% | +60.3% | +20.0% | +56.7% | +31.5% | +34.9% | +25.9% |

Table 3: Relative performance (Acc@1 on HotelQA) of Emu w/o LD models trained on different training languages against the original LASER model for each language pair.
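| | en | de | es | fr | zh | ja |
|---|---|---|---|---|---|---|
| en | +37.5% | +40.0% | +22.9% | +34.3% | +39.1% | +26.1% |
| de | +27.0% | +10.0% | +10.0% | +1.7% | +14.5% | +20.9% |
| es | +34.8% | +0.0% | +11.5% | +5.0% | +21.3% | +8.0% |
| fr | +12.3% | +12.7% | +23.2% | +11.1% | +13.2% | +7.1% |
| zh | +21.9% | +34.5% | +11.4% | +9.6% | +9.1% | +10.1% |
| ja | +18.3% | +31.0% | +23.4% | +20.3% | +32.3% | +22.1% |

Table 4: Relative performance of Acc@1 on HotelQA of Emu w/o LD against the original LASER model for each language pair. Rows give the first language of each pair and columns the second (e.g., row en, column de is en-de).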


We used three datasets for evaluation. Some statistics of these datasets are shown in Table 1.

HotelQA is a real-world private corpus of 820 questions collected via a multi-channel communication platform for hotel guests and hotel staff. Questions are always asked by guests and have ground truth labels for 28 intent classes (e.g., check-in, pool). The utterances are professionally translated into 5 non-English languages: German (de), Spanish (es), French (fr), Japanese (ja), and Chinese (zh). We split the dataset into training and test sets so that the sentences used for fine-tuning do not appear in the test set.

ATIS [17] is a publicly available corpus for spoken dialog systems and is widely used for intent classification research. The dataset consists of more than 5k sentences annotated with 22 intent labels. We excluded the “flights” class from the dataset since the class accounts for about 75% of the dataset. We also ensured that each class has at least 5 sentences in each of the training and test sets. As a result, 13 classes remained in the dataset. Similar to previous studies [10, 14], we used Google Translate to generate corresponding translations in the same 5 non-English languages as HotelQA.

Quora is a publicly available paraphrase detection dataset that contains over 400k questions with duplicate labels. Each row is a pair of questions with a duplicate label. Duplicate questions can be considered sentences that belong to the same intent. Therefore, we created a graph where each node is a question and an edge between two nodes denotes that these questions are considered duplicates. By doing this, we can consider each disjoint clique in the graph as a single intent class. Specifically, we kept only complete subgraphs whose size (i.e., # of nodes) is less than 30 to avoid having extremely large clusters that are too general. We chose the 50 largest clusters after the filtering. The original dataset contains only English sentences. We used Google Translate to translate them into the same 5 languages in the same manner as ATIS.
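The clique-based construction of intent classes from duplicate pairs can be sketched as follows. The sketch interprets a "disjoint clique" as a connected component in which every pair of questions is marked as duplicate, which is an assumption about the original procedure; the function name, parameter names, and question ids are hypothetical.

```python
from collections import defaultdict

def intent_clusters(duplicate_pairs, max_size=30, top_k=50):
    """Keep connected components that are complete subgraphs with fewer than
    max_size nodes, and return the top_k largest ones."""
    neighbors = defaultdict(set)
    for a, b in duplicate_pairs:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Collect the connected component that contains `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        # The component is a clique when every node links to all the others.
        is_clique = all(len(neighbors[n]) == len(component) - 1 for n in component)
        if is_clique and len(component) < max_size:
            clusters.append(sorted(component))

    clusters.sort(key=len, reverse=True)
    return clusters[:top_k]

# Toy duplicate pairs: a triangle, a single pair, and a path that is not a clique.
pairs = [("q1", "q2"), ("q2", "q3"), ("q1", "q3"),
         ("q4", "q5"),
         ("q6", "q7"), ("q7", "q8")]
clusters = intent_clusters(pairs)
```

In the toy input, the path q6-q7-q8 is dropped because q6 and q8 are not marked as duplicates of each other, so the component is not a complete subgraph.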


Baselines

MT + sent2vec We consider the two-stage approach that uses machine translation and monolingual sentence embeddings in a pipeline. The non-English sentences obtained through MT from English had to be translated back to English, and we observed some degradation in ja and zh due to the repeated application of MT. We used Google Translate for translation and sent2vec [24] as the monolingual embedding method. We also tested the official implementation of InferSent [8], finding that its performance was unstable and often significantly lower than that of sent2vec. Thus, we decided to use sent2vec in the experiments.

Softmax loss Softmax loss is the most common loss function for classification, and thus a natural choice for fine-tuning the embeddings. We used the softmax loss function to train the semantic classifier and adjust the embeddings.

Contrastive loss Contrastive loss [7] is a widely used pairwise loss function for metric learning. The loss function minimizes the squared distance between two embeddings if their labels are the same, and otherwise pushes the two samples apart until they are separated by at least a margin hyper-parameter. For contrastive loss, we use the Siamese (i.e., dual-encoder) architecture [7], which feeds two input sentences into a shared encoder (i.e., the multilingual encoder) to obtain sentence embeddings.
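A common formulation of the pairwise contrastive loss is sketched below in plain Python. The margin value used here is a placeholder rather than the value from the experiments, and constant scaling factors that appear in some formulations are omitted; the function name is hypothetical.

```python
import math

def contrastive_loss(u, v, same_label, margin=1.0):
    """Squared distance for same-label pairs; squared hinge on the margin otherwise."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    if same_label:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Same-label pair far apart: penalized by the squared distance.
far_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=True)
# Different-label pair already farther apart than the margin: no penalty.
far_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False, margin=1.0)
# Different-label pair closer than the margin: penalized.
near_diff = contrastive_loss([0.0, 0.0], [0.6, 0.8], same_label=False, margin=2.0)
```

Unlike the classification-based losses, this objective operates on Euclidean distances between pairs and does not constrain the norm of the embeddings.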

N-pair loss As another metric learning method, we used the N-pair sampling cosine loss [35], which first samples one positive sample and negative samples and then minimizes a cosine similarity-based loss function.

Experimental Settings

For each dataset, we used only English training data to fine-tune the models with Emu and the baseline methods. To train Emu’s language discriminator, we used unlabeled training data in the non-English languages (i.e., de, es, fr, ja, zh).

Emu variants To verify the effect of the language discriminator and the center loss, we also evaluated Emu without the language discriminator (Emu w/o LD) and Emu without the language discriminator or the center loss (Emu w/o LD+CL) as a part of an ablation study. Finally, we evaluated Emu-Parallel, which uses parallel sentences instead of randomly sampled sentences for cross-lingual adversarial training.

Hyper-parameters We used the official implementation of LASER and the pre-trained models including BPE. We implemented our proposed method and the baseline methods using PyTorch. We used an initial learning rate of and optimized the model with Adam. We used a batch size of 16. For our proposed methods, we set and . All the models were trained for 3 epochs. The architecture of language discriminator $D$ has two 900-dimensional fully-connected layers with a dropout rate of 0.2. The hyper-parameters were , , respectively. The language discriminator was also optimized with Adam with an initial learning rate of .

Evaluation Metric

We used the leave-one-out evaluation method on the test data. For each sentence, we consider the other sentences in the test data as labeled sentences to find the nearest neighbor to predict the label. The idea is to exclude the direct translation of an input sentence in the target language to make the nearest neighbor search more challenging and to simulate the real-world setting where parallel sentences are missing. We used Acc@1 (the ratio of test sentences that are correctly categorized into the intent classes) as our evaluation metric.
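The leave-one-out protocol can be sketched as below. The sketch assumes the source-language and target-language test sets are index-aligned, so that position `i` in both lists holds translations of the same sentence, and it excludes that position from the candidate set; this alignment assumption and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def leave_one_out_acc_at_1(source_emb, target_emb, labels):
    """Acc@1 of nearest-neighbor intent detection, excluding each query's own translation."""
    correct = 0
    for i, query in enumerate(source_emb):
        candidates = [j for j in range(len(target_emb)) if j != i]
        best = max(candidates, key=lambda j: cosine_similarity(query, target_emb[j]))
        correct += labels[best] == labels[i]
    return correct / len(source_emb)

# Toy embeddings: four parallel sentences, two per intent.
source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
target = [[1.0, 0.05], [0.95, 0.1], [0.05, 1.0], [0.1, 0.95]]
acc = leave_one_out_acc_at_1(source, target, labels=[0, 0, 1, 1])
```

Setting `source_emb` and `target_emb` to the same list gives the monolingual variant (e.g., en-en), where excluding index `i` simply removes the query itself.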

Results and Discussion

Table 2 shows the experimental results on these three datasets. In Table 2 (a), Emu achieved the best performance for all the 11 tasks (en-fr, en-ja, and ja-en by Emu w/o LD, and en-ja by Emu w/o LD+CL), outperforming the baseline methods including the original LASER model. In Table 2 (b), Emu achieved the best performance for 10 tasks (en-fr by Emu w/o LD+CL). The original LASER model showed the best performance for zh-en, and all of the Emu methods degraded the performance for that task. In Table 2 (c), Emu achieved the best performance for 7 tasks (en-zh by Emu w/o LD), whereas the original LASER model achieved the best performance for the rest of the tasks. Overall, Emu outperformed the baseline methods, including the original LASER model, on most tasks, but failed to improve the performance of five tasks, namely zh-en on ATIS (Table 2 (b)) and en-fr, fr-en, ja-en, and ja-zh on Quora (Table 2 (c)).

We would like to emphasize that the Emu models were trained using labeled data only in English, together with unlabeled data in non-English languages. It is therefore noteworthy that our framework successfully specializes multilingual sentence embeddings for multiple language pairs, which involve English, using only English labeled data. The results support that Emu is effective in semantically specializing multilingual sentence embeddings.

For all the tasks, we observe that the baseline fine-tuning methods (i.e., contrastive loss, N-pair loss, softmax loss) do not improve the performance but instead decrease the accuracy values compared to the original LASER performance. The results indicate that fine-tuning multilingual sentence embeddings is sensitive to the choice of loss functions, and that $\ell_2$-constrained softmax loss is the best choice among the loss functions.

Table 5: Ablation study of Emu, reporting the language discriminator and the center loss on HotelQA, ATIS, and Quora. Each value denotes the average percentage point (pp) drop after removing the component. Negative values denote improvements after removing the component. Significance markers denote $p$-value levels (Wilcoxon signed-rank test).

Ablation study We conducted an ablation study to quantitatively evaluate the contribution of each component of Emu, namely, the language discriminator and the center loss. First, we compared Emu w/o LD with Emu to verify the effect of the language discriminator, and then compared Emu w/o LD and Emu w/o LD+CL to determine the effect of the center loss.

Table 5 shows the average percentage point drop (i.e., the degree of contributions) of each component. The language discriminator had a significant contribution of 2.81 points on ATIS. The contributions were 1.45 points and 1.05 points on HotelQA and Quora respectively. Similarly, the center loss had a significant impact on Quora, whereas it had almost no effect on ATIS and had a negative impact on HotelQA.

Figure 2: Visualizations of the sentence embeddings of English and German test data of the ATIS dataset. We used t-SNE to convert the sentence embeddings into the 2D space. Each point is a sentence and the color denotes the intent class. The plots are: (a) the original LASER embeddings, (b) softmax loss, (c) Emu w/o LD, (d) Emu.

Sentence Embedding Visualization We conducted a qualitative analysis to observe how our framework with the language discriminator specialized multilingual sentence embeddings and enhanced the multilinguality. We filtered English and German sentences from the test data of the ATIS dataset and visualized sentence embeddings of (a) the original LASER model, (b) the softmax loss, (c) Emu w/o LD, and (d) Emu in the same 2D space using t-SNE.

Figure 2 shows visualizations of these methods. Figure 2 (a) shows that the original LASER sentence embeddings have multilinguality, as the sentences with the same intent in English and German were embedded close to each other. Figure 2 (b) shows that fine-tuning the model with the softmax loss function not only broke the intent clusters but also spoiled the multilinguality. In Figure 2 (c), Emu w/o LD successfully specialized the sentence embeddings, but multilinguality was degraded, as the English and German sentence embeddings of the same intent classes were more separated than in the original LASER model. Finally, Emu (with the language discriminator) moved sentence embeddings of the same intent in English and German close to each other, as shown in Figure 2 (d).

From the results, we observe that incorporating the language discriminator enriches the multilinguality in the embedding space.

Do we need parallel sentences for Emu? We compared Emu to Emu-Parallel, which uses parallel sentences instead of randomly sampled sentences, to verify whether using parallel sentences makes multilingual adversarial learning more effective. The results are shown in Tables 2 (a)-(c). Compared to Emu, Emu-Parallel showed lower Acc@1 values on the three datasets. The differences were -0.5 points, -1.2 points, and -5.9 points on HotelQA, ATIS, and Quora respectively. The differences are not statistically significant except for Quora. The results show that the language discriminator of Emu does not need a costly parallel corpus and can improve performance using unlabeled and non-parallel sentences in other languages.

What language(s) should we use for training? We also investigated how the performance changes when fine-tuning with training data in languages other than English. To isolate this effect, we turned off the language discriminator in this analysis to ensure that Emu uses data only in the specified languages. Table 4 summarizes the relative performance of Emu w/o LD against the original LASER model on the HotelQA dataset. As discussed above, the accuracy values of tasks that involve English on at least one side (i.e., source language, target language, or both) show larger improvements than the other pairs that only involve non-English languages. This is likely because sentence embeddings of those languages were not fine-tuned as well as those of English, since training data in those languages were not used.

Therefore, we hypothesized that using training data in the same language for a target and/or source language would be the best choice. To test the hypothesis, we chose English, German, and French as source/target languages and conducted additional experiments on the HotelQA dataset. The experimental settings, including the hyper-parameters, followed the main experiments, with only the training data used for fine-tuning being different.

Table 3 shows the results. When only using training data in a single language (i.e., En only, De only, Fr only), the target language was the best training data for monolingual intent classification tasks because this method achieved the best performance in the en-en, de-de, and fr-fr tasks respectively. Similarly, using the source and target languages as training data was the best configuration for methods that trained in two languages. That is, En+De achieved the best performance for the en-de and de-en tasks. En+Fr (De+Fr) also achieved the best performance for the en-fr (de-fr) and fr-en (fr-de) tasks. Finally, the method that used training data in the three languages (En+De+Fr) showed the best accuracy values for 7 out of 9 tasks. The degradation in the remaining two tasks (i.e., en-fr and fr-en) occurred when En+De+Fr incorporated a language that was neither the source nor the target language.

From the results, we conclude that we should focus on creating training data in a target or source language to obtain the best performance with Emu and use our budget effectively.

Related Work

Multilingual embedding techniques [29] have been well studied, and most of the prior work has focused on word embeddings. However, relatively few techniques have been developed for multilingual sentence embeddings. This is because such techniques [18, 30] require parallel sentences for training multilingual sentence embeddings, and some use both sentence-level and word-level alignment information [22]. The recently developed LASER [30, 3] trains a language-agnostic sentence embedding model with a large number of translation tasks on large-scale parallel corpora.

Similar to the center loss used in this paper, two techniques have incorporated cluster-level information [20, 12] to enhance the compactness of word clusters to improve the quality of multilingual word embedding models. None of them have directly used the centroid of each class to calculate loss values for training.

Adversarial learning [16] is a common technique that has been used for many NLP tasks, including (monolingual) sentence embeddings [26] and multilingual word embeddings [10, 4]. Chen et al. [5] developed a technique that uses a language discriminator to train a cross-lingual sentiment classifier. Our framework is similar in the use of a language discriminator, but our novelty is that it uses a language discriminator for learning multilingual sentence embeddings instead of cross-lingual transfer.

There is a line of work in post-processing word embedding models called word embedding specialization [13, 21, 23]. Prior work specialized word embeddings with different external resources such as semantic information [13]. The common approaches are (1) post-hoc learning [13], which uses an additional loss function to tune pre-trained embeddings, (2) learning an additional model [15, 32], and (3) the fine-tuning approach [1], which is similar to our fine-tuning approach. However, to the best of our knowledge, we are the first to approach semantic specialization of multilingual sentence embeddings.


Conclusion

We have presented Emu, a semantic specialization framework for multilingual sentence embeddings. Emu incorporates multilingual adversarial training on top of fine-tuning to enhance multilinguality without using parallel sentences.

Our experimental results show that Emu outperformed the baseline methods, including the state-of-the-art multilingual sentence embeddings (LASER) and monolingual sentence embeddings after machine translation, on multiple language pairs. The results also show that Emu can successfully train a model using only monolingual labeled data and unlabeled data in other languages.


References

  • [1] M. Abdalla, M. Sahlgren, and G. Hirst (2019) Enriching word embeddings with a regressor instead of labeled corpora. In Proc. AAAI ’19.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In Proc. ICML ’17, Vol. 70, pp. 214–223.
  • [3] M. Artetxe and H. Schwenk (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In Proc. ACL ’19.
  • [4] X. Chen and C. Cardie (2018) Unsupervised multilingual word embeddings. In Proc. EMNLP ’18, pp. 261–270.
  • [5] X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, and K. Weinberger (2018) Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570.
  • [6] M. Chidambaram, Y. Yang, D. Cer, S. Yuan, Y. Sung, B. Strope, and R. Kurzweil (2018) Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836.
  • [7] S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In Proc. CVPR ’05, pp. 539–546.
  • [8] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proc. EMNLP ’17.
  • [9] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proc. ACL ’18, pp. 2126–2136.
  • [10] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. In Proc. ICLR ’18.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [12] Y. Doval, J. Camacho-Collados, L. Espinosa Anke, and S. Schockaert (2018) Improving cross-lingual word embeddings by meeting in the middle. In Proc. EMNLP ’18, pp. 294–304.
  • [13] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015) Retrofitting word vectors to semantic lexicons. In Proc. NAACL-HLT ’15, pp. 1606–1615.
  • [14] G. Glavaš, R. Litschko, S. Ruder, and I. Vulić (2019) How to (properly) evaluate cross-lingual word embeddings: on strong baselines, comparative analyses, and some misconceptions. In Proc. ACL ’19 (to appear).
  • [15] G. Glavaš and I. Vulić (2018) Explicit retrofitting of distributional word vectors. In Proc. ACL ’18, pp. 34–45.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proc. NIPS ’14, pp. 2672–2680.
  • [17] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Proc. the Workshop on Speech and Natural Language, HLT ’90, pp. 96–101.
  • [18] K. M. Hermann and P. Blunsom (2014) Multilingual models for compositional distributional semantics. In Proc. ACL ’14.
  • [19] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proc. ACL ’18, pp. 328–339.
  • [20] L. Huang, K. Cho, B. Zhang, H. Ji, and K. Knight (2018) Multi-lingual common semantic space construction via cluster-consistent word embedding. In Proc. EMNLP ’18.
  • [21] D. Kiela, F. Hill, and S. Clark (2015) Specializing word embeddings for similarity or relatedness. In Proc. EMNLP ’15, pp. 2044–2048.
  • [22] T. Luong, H. Pham, and C. D. Manning (2015) Bilingual word representations with monolingual quality in mind. In Proc. RepL4NLP ’15, pp. 151–159.
  • [23] N. Mrkšić, I. Vulić, D. Ó Séaghdha, I. Leviant, R. Reichart, M. Gašić, A. Korhonen, and S. Young (2017) Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics 5, pp. 309–324.
  • [24] M. Pagliardini, P. Gupta, and M. Jaggi (2018) Unsupervised learning of sentence embeddings using compositional n-gram features. In Proc. NAACL-HLT ’18.
  • [25] M. Paşca (2003) Open-domain question answering from large text collections. MIT Press.
  • [26] B. N. Patro, V. K. Kurmi, S. Kumar, and V. P. Namboodiri (2018) Learning semantic sentence embeddings using pair-wise discriminator. In Proc. COLING ’18.
  • [27] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. NAACL-HLT ’18, pp. 2227–2237.
  • [28] R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
  • [29] S. Ruder, I. Vulić, A. Søgaard, and M. Faruqui (2019) Cross-lingual word embeddings. Morgan & Claypool Publishers.
  • [30] H. Schwenk, D. Kiela, and M. Douze (2019) Analysis of joint multilingual sentence representations and semantic k-nearest neighbor graphs. In Proc. AAAI ’19, pp. 6982–6990.
  • [31] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. ACL ’16, pp. 1715–1725.
  • [32] I. Vulić, G. Glavaš, N. Mrkšić, and A. Korhonen (2018) Post-specialisation: retrofitting vectors of words unseen in lexical resources. In Proc. NAACL-HLT ’18.
  • [33] J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. (2017) A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (4), pp. 769–790.
  • [34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In Proc. ECCV ’16, pp. 499–515.
  • [35] Y. Yang, G. H. Abrego, S. Yuan, M. Guo, Q. Shen, D. Cer, Y. Sung, B. Strope, and R. Kurzweil (2019) Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv preprint arXiv:1902.08564.
  • [36] X. Zhu, T. Li, and G. de Melo (2018) Exploring semantic properties of sentence embeddings. In Proc. ACL ’18, pp. 632–637.