Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

02/22/2019 ∙ by Yinfei Yang, et al. ∙ Google

In this paper, we present an approach to learn multilingual sentence embeddings using a bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve state-of-the-art results on the United Nations (UN) parallel corpus retrieval task. In all the languages tested, the system achieves P@1 of 86% or higher. We use the retrieved pairs to train NMT models that achieve similar performance to models trained on gold pairs. We explore simple document-level embeddings constructed by averaging our sentence embeddings. On the UN document-level retrieval task, document embeddings achieve around 97% on P@1. Lastly, we evaluate the proposed model on the BUCC mining task. The learned embeddings with raw cosine similarity scores achieve competitive results compared to current state-of-the-art models, and with a second-stage scorer we achieve a new state-of-the-art level on this task.




1 Introduction

Neural machine translation (NMT) systems are highly sensitive to the volume and quality of the parallel data (also called bitext) used for model training. This naturally leads to increasing interest in collecting large parallel corpora, where filtering is used to guarantee quality. [Uszkoreit et al.2010, Antonova and Misyurev2011] have shown that it is possible to mine parallel documents from the internet using large distributed systems with computationally intensive and heavily engineered subsystems. Recently, lightweight end-to-end word and sentence embedding-based approaches have gained popularity and are showing some success for this purpose [Grégoire and Langlais2017, Bouamor and Sajjad2018, Schwenk2018, Guo et al.2018, Artetxe and Schwenk2018]. These systems are usually easier to train as they require very little feature engineering.

One popular type of approach is based on cross-lingual embeddings, where the two sentences in a translation pair are trained to be close to each other in an embedding space. In this approach, given a source sentence, nearest-neighbor search can be used in the cross-lingual embedding space to find potential translation candidates. The nearest neighbors can be chosen in terms of cosine distance. However, current approaches are limited by the fact that the scale of the cosine distance is not globally consistent [Guo et al.2018, Artetxe and Schwenk2018]. A complex classifier or re-scoring step is still necessary to compensate for this problem and thus get a reasonably clean parallel corpus for training an NMT system. Moreover, the lack of global consistency of the embedding space can also degrade retrieval performance.

Figure 1: The learned embedding space with or without additive margin softmax. The decision boundary with margin helps reduce the intra-class variation, so that examples in the same class are more compact in the space. The examples with different shapes represent sentences that are translations of each other in different languages.

In order to solve this problem, we use a bi-directional dual encoder with additive margin softmax. In this task, it seems reasonable to consider the order of the source and target languages irrelevant. It is therefore natural to extend the dual encoder to a bi-directional one by optimizing losses for both directions, which, as described later in the paper, also helps stabilize the embedding space. The additive margin softmax was first introduced to solve a face verification problem [Wang et al.2018a]. The intuition is that the standard softmax is good at optimizing the inter-class difference, but not good at reducing the intra-class variation. This is problematic given that both small intra-class variation and large inter-class discrimination are important to achieve good performance when learning with a metric such as cosine distance. Because of these limitations of the standard softmax, the cosine distance in the dual encoder of [Guo et al.2018] is not a good metric for scoring translation pair quality. In another work, [Artetxe and Schwenk2018] use the embeddings from a learned translation model without directly optimizing for cosine similarity, and thus they face a similar problem. Introducing a large margin property in the softmax forces a fixed margin between decision boundaries of different classes. The margin between decision boundaries then makes the examples in the same class more compact, as illustrated in figure 1.

Extensive experiments demonstrate that the learned multilingual sentence embeddings achieve state-of-the-art parallel sentence retrieval results. On the United Nations (UN) Parallel Corpus [Ziemski et al.2016], the system achieves P@1 of 86% or higher (from 11.3 million candidates) in all the languages tested. We train NMT models from the retrieved data filtered with a simple strategy based on our cosine similarity. The trained models perform as well as models trained on gold data or on data from more advanced filtering strategies, indicating that the cosine distances of bitext pairs are stable. Furthermore, we show how this approach can be used to produce document embeddings by simply averaging all sentence embeddings, which achieves around 97% on P@1 (from 86k candidates) on the UN document-level retrieval task. On the BUCC bitext mining task [Pierre Zweigenbaum and Rapp2018], the learned embeddings with raw cosine similarity scores also show competitive results compared to state-of-the-art models, with F-scores ranging from 84.6 to 89.2. These results outperform the baseline cosine similarity models by a large margin of around 10 points. With a second-stage scorer we achieve a new state-of-the-art level, with F-scores ranging from 93.38 (ru-en) to 97.24 (de-en).

We summarize our contributions as follows:

  • We introduce a novel bidirectional objective and additive margin softmax into the dual encoder framework for the bitext retrieval task.

  • Empirical results show that the proposed approach greatly stabilizes cosine similarity in the multilingual embedding space, which also leads to state-of-the-art results on the UN sentence-level bitext retrieval task.

  • We also introduce an approach to embed documents by simply averaging all sentence embeddings, which achieves nearly state-of-the-art results on the UN document-level bitext retrieval task.

2 Model

In this section we introduce the dual-encoder model and our extension of using a bidirectional encoder with additive margin softmax. We also describe a light-weight approach to generate hard negatives in a low-dimensional space.

2.1 Dual Encoder Model

Figure 2: Dual encoder architecture.

A naïve dual-encoder model consists of two encoders for a paired input. As shown in figure 2, each encoder encodes one input and a piece-wise function is applied at the top. Previous work has shown that dual-encoder models with negative sampling are well suited for ranking problems such as conversation response prediction, translation candidate prediction, etc [Yang et al.2018, Guo et al.2018].

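Following [Guo et al.2018], the translation retrieval task can be modeled as a ranking problem: rank the true translation $y_i$ over all other sentences in $\mathcal{Y}$ for $x_i$ in a given language, where $x_i$ and $y_i$ are translations of each other. With a scoring function $\phi$ that assesses the compatibility between $x_i$ and $y_i$, the translation distribution can be expressed as the following log-linear model:

$$P(y_i \mid x_i) = \frac{e^{\phi(x_i, y_i)}}{\sum_{\bar{y} \in \mathcal{Y}} e^{\phi(x_i, \bar{y})}} \tag{1}$$

The distribution is approximated by summing over the compatibility score of matching $x_i$ to sampled negatives from the same batch of size $N$ during training, along with the compatibility score for the positive candidate:

$$P_{approx}(y_i \mid x_i) = \frac{e^{\phi(x_i, y_i)}}{e^{\phi(x_i, y_i)} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}} \tag{2}$$

The model is then trained to optimize the probability for the true translation pair using a softmax loss function:

$$\mathcal{L}_s = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(x_i, y_i)}}{e^{\phi(x_i, y_i)} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}} \tag{3}$$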

The cosine scoring function is commonly used, as it is efficient to use matrix multiplication to compute scores for all examples in the same batch.
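
The in-batch softmax with cosine scoring can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `l2_normalize` and `in_batch_softmax_loss` and the random toy inputs are hypothetical, and the embeddings are L2-normalized so that a single matrix product yields every cosine score in the batch.

```python
import numpy as np

def l2_normalize(v):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def in_batch_softmax_loss(src, tgt):
    """Mean softmax loss where row i of `tgt` is the positive for row i of `src`
    and every other row in the batch acts as a sampled negative."""
    scores = l2_normalize(src) @ l2_normalize(tgt).T      # (N, N) cosine scores
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy data (assumption): identical source/target rows stand in for translations.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
aligned_loss = in_batch_softmax_loss(src, src)          # positives on the diagonal
shuffled_loss = in_batch_softmax_loss(src, src[::-1])   # positives misaligned
```

When the positives sit on the diagonal the loss is lower than when the rows are misaligned, which is the behavior the ranking objective rewards.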

2.2 Bidirectional Dual Encoder

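The order of source and target in a translation pair does not matter in the translation retrieval task: we can perform forward retrieval from source to target, or backward retrieval from target to source. It seems natural to extend the dual encoder and turn it into a bidirectional one by also optimizing for the backward ranking, i.e. from the target $y_i$ to the source $x_i$:

$$\mathcal{L}'_s = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(y_i, x_i)}}{e^{\phi(y_i, x_i)} + \sum_{n=1, n \neq i}^{N} e^{\phi(y_i, x_n)}} \tag{4}$$

$$\bar{\mathcal{L}}_s = \mathcal{L}_s + \mathcal{L}'_s \tag{5}$$

where $\phi$ is the scoring function, $N$ is the batch size, and $\mathcal{L}_s$ is the forward softmax loss of section 2.1.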

2.3 Dual Encoder with Additive Margin Softmax

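Likewise, additive margin softmax can be used to extend the piece-wise scoring function and thus introduce a large margin property to the positive pairs [Wang et al.2018a]. With $\phi(x_i, y_j)$ denoting the cosine score between source $x_i$ and target $y_j$, the cosine scoring function is redefined as follows:

$$\phi'(x_i, y_j) = \begin{cases} \phi(x_i, y_j) - m & \text{if } i = j \\ \phi(x_i, y_j) & \text{if } i \neq j \end{cases} \tag{6}$$

where $m$ is the margin value. Plugging equation 6 back into the softmax loss functions, we get the final objective functions for the dual-encoder with additive margin:

$$\mathcal{L}_{ams} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(x_i, y_i) - m}}{e^{\phi(x_i, y_i) - m} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}} \tag{7}$$

$$\bar{\mathcal{L}}_{ams} = \mathcal{L}_{ams} + \mathcal{L}'_{ams} \tag{8}$$

where $\mathcal{L}'_{ams}$ is the backward counterpart of $\mathcal{L}_{ams}$, obtained by swapping the roles of $x$ and $y$.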

As discussed in section 1, the intuition behind the additive margin softmax is that adding a large margin property can reduce the intra-class variation of the standard softmax so the examples of the same class are more compact in the learned embedding space. The experiments in the following sections show that the cosine distance is surprisingly stable in the learned embedding space. The improved embedding space also leads to better retrieval performance.
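
A minimal NumPy sketch of the bidirectional loss with additive margin may help make the objective concrete. It is an illustration under assumptions rather than the authors' code: the function names and toy data are hypothetical, the margin is subtracted only from the positive (diagonal) cosine scores, and the default of 0.3 is the margin value reported in the model configuration.

```python
import numpy as np

def _softmax_loss(scores):
    """Mean negative log-probability of the diagonal (positive) entries, row-wise."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def bidirectional_am_softmax_loss(src, tgt, margin=0.3):
    """Forward plus backward in-batch softmax loss with an additive margin
    subtracted from the positive (diagonal) cosine scores only."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    scores = src @ tgt.T                                  # scores[i, j] = cosine(source i, target j)
    scores = scores - margin * np.eye(scores.shape[0])    # margin on positives only
    forward = _softmax_loss(scores)      # rank the target given the source
    backward = _softmax_loss(scores.T)   # rank the source given the target
    return forward + backward

# Toy data (assumption): targets are noisy copies of the sources.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src + 0.1 * rng.normal(size=(4, 8))
loss_no_margin = bidirectional_am_softmax_loss(src, tgt, margin=0.0)
loss_with_margin = bidirectional_am_softmax_loss(src, tgt, margin=0.3)
```

For the same embeddings, the margin makes the loss strictly larger, so training has to pull positive pairs closer together (or push negatives further away) to reach the same loss value.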

2.4 Hard Negatives

Following [Guo et al.2018], we also integrate semantically similar hard-negative examples in training. For each source sentence, we identify hard negatives as target sentences that achieve high dot product scores with the source sentence embedding but are not the correct translation. We generate hard negatives for each target sentence using the following approach. The source and target embeddings are generated from a "coarse" model trained using a deep averaging network (DAN) architecture. Five negatives are generated for each source and target sentence.

It is worth noting that we use a very low dimension embedding vector (d=25) in the coarse model. The coarse model, by itself, performs worse than a high dimension "fine" model (d=500) on the retrieval task. However, in our preliminary experiments, we did not notice a significant performance difference between models trained with coarse or fine negatives. Using a low-dimension embedding space can dramatically decrease the computational resources and speed up the approximate nearest neighbor search. This improvement has allowed the generation of hard negatives for the entire corpus instead of generating hard negatives for only 20% of the data as in [Guo et al.2018].
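
The selection of hard negatives by dot product can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: `mine_hard_negatives` is a hypothetical name, exact brute-force scoring stands in for the approximate nearest neighbor search, and row i of the source and target arrays is assumed to be a translation pair.

```python
import numpy as np

def mine_hard_negatives(src_emb, tgt_emb, num_negatives=5):
    """For each source i, return indices of the highest-scoring targets
    (by dot product) that are not its own translation, target i."""
    scores = src_emb @ tgt_emb.T
    np.fill_diagonal(scores, -np.inf)       # exclude the true translation
    order = np.argsort(-scores, axis=1)     # best-scoring targets first
    return order[:, :num_negatives]

# Toy data (assumption): 20 pairs embedded with d=25, as in the coarse model.
rng = np.random.default_rng(0)
coarse_src = rng.normal(size=(20, 25))
coarse_tgt = coarse_src + 0.1 * rng.normal(size=(20, 25))
negatives = mine_hard_negatives(coarse_src, coarse_tgt)
```

Running the same function with the two arguments swapped would give hard negatives for each target sentence.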

3 Experimental Setup

In this section, we describe the training data and provide details about the training process. We also illustrate the retrieval pipeline, powered by an approximate nearest neighbor (ANN) search engine.

3.1 Training Data

The training corpus is extracted from the internet using an off-the-shelf bitext mining system, which is similar to the approach described in [Uszkoreit et al.2010]. The extracted sentence pairs are filtered by a pre-trained data selection scoring model [Wang et al.2018b]. We set the threshold of the scoring model according to a small subset of the scored pairs. We ask human annotators to manually evaluate each sentence pair from the subset, labeling it as either a GOOD translation pair or a BAD one. The threshold is selected so that the retained pairs have an 80% GOOD rate. Finally, we construct a corpus that contains around 400 million sentence pairs for en-fr, en-es, en-de, en-ru and en-zh. (Footnote 1: We started with a large dataset. However, preliminary results showed that we could use only 10% of the data without downgrading the model performance in the hold-out dev set.) For each language pair, we split the sentence pairs into a training set and a development set for parameter tuning. We evaluate the trained models on the United Nations Parallel Corpus reconstruction task and the BUCC bitext mining task.

3.2 Model Configuration

In our models, we use both word-level and character-level representations. For word-level representation, we build a shared 200k unigram vocabulary with 10k out-of-vocabulary (OOV) buckets for each language pair. Each token is mapped to a 300 dimension word embedding space. For character-level representation, we build 200k hash buckets and each character $n$-gram is mapped to one of the buckets via a quick hashing method. The character hash buckets are shared across both languages in each language pair. We first decompose each token into character $n$-grams over a range of $n$. Then, we look up a 300 dimension embedding vector for each character $n$-gram and sum them as the final character-level representation. We set $n$=[1, 4] for en-zh and $n$=[3, 6] for other language pairs. We use different settings of $n$ for the en-zh pair because collocations of more than 3 character tokens are rare in Chinese. Finally, we sum the word-level representation and character-level representation as the final token representation. All inputs are tokenized and normalized before being fed to the model.
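
The character-level bucketing can be sketched with the standard library alone. The paper does not name its hashing method, so CRC32 below is purely a stand-in assumption, and `char_ngrams` and `ngram_buckets` are hypothetical helper names.

```python
import zlib

NUM_BUCKETS = 200_000  # number of hash buckets, as stated in the text

def char_ngrams(token, n_min, n_max):
    """All character n-grams of `token` for n_min <= n <= n_max."""
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def ngram_buckets(token, n_min, n_max, num_buckets=NUM_BUCKETS):
    """Map each character n-gram to a bucket id. CRC32 is a stand-in hash,
    chosen only because it is deterministic and in the standard library."""
    return [zlib.crc32(gram.encode("utf-8")) % num_buckets
            for gram in char_ngrams(token, n_min, n_max)]

grams = char_ngrams("corpus", 3, 6)      # 3- to 6-grams, the non-Chinese setting
buckets = ngram_buckets("corpus", 3, 6)
```

Each bucket id would then index a 300 dimension embedding table, and the looked-up vectors would be summed to form the character-level representation of the token.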

The encoder uses a 3-layer transformer network architecture [Vaswani et al.2017]. In the transformer layers, we use 8 attention heads, a hidden size of 512, and a filter size of 2048. The transformer encoder output is a variable-length sequence. We make use of 4 pooling layers consisting of max pooling, mean pooling, first token pooling, and attention pooling [dos Santos et al.2016]. The outputs of all pooling layers are concatenated into a single vector and then projected to a fixed 500-dimensional space. The encoders for each language pair are shared.
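
The pooling-and-projection step can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: `pool_and_project`, `attn_query` and `projection` are hypothetical names, the weights are random rather than learned, and attention pooling is reduced to softmax weights from a single query vector, which is simpler than the attentive pooling of [dos Santos et al.2016].

```python
import numpy as np

def pool_and_project(hidden, attn_query, projection):
    """Concatenate four poolings of a (seq_len, dim) encoder output and
    project the result to a fixed-size sentence embedding."""
    max_pool = hidden.max(axis=0)
    mean_pool = hidden.mean(axis=0)
    first_pool = hidden[0]
    # Simplified attention pooling (assumption): softmax over query-token scores.
    logits = hidden @ attn_query
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    attn_pool = weights @ hidden
    pooled = np.concatenate([max_pool, mean_pool, first_pool, attn_pool])
    return pooled @ projection

rng = np.random.default_rng(0)
dim, out_dim = 512, 500                        # hidden size and embedding size from the text
attn_query = rng.normal(size=dim)
projection = rng.normal(size=(4 * dim, out_dim))
short = pool_and_project(rng.normal(size=(5, dim)), attn_query, projection)
long = pool_and_project(rng.normal(size=(30, dim)), attn_query, projection)
```

Both the 5-token and the 30-token input end up as 500-dimensional vectors, which is what allows sentences of any length to be compared in one embedding space.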

For training, we use SGD with learning rate 0.003 and batch size 100. We train all models for a total of 40 million steps. The learning rate is reduced to 0.0003 after 33 million steps for fine-tuning. A margin value of 0.3 is used in all our experiments unless otherwise specified. We follow [Chidambaram et al.2018] and multiply the gradients of the word and character embeddings by a factor of 25, in order to alleviate vanishing gradient issues and greatly improve training speed. All these parameters are tuned on the hold-out development set.

3.3 The Parallel Corpus Retrieval Pipeline

As mentioned in section 2.1, the source embedding and target embedding can be encoded separately using a dual-encoder model. Taking advantage of this fact, we use approximate nearest neighbor (ANN) search [Vanderkam et al.2013] to find the candidate targets for each given source once all target embeddings are built, and vice versa. Figure 3 illustrates the retrieval pipeline. Source and target can be either sentences or documents in the following experiments.
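
A toy version of the retrieval step is sketched below. Exact brute-force cosine search is used as a stand-in for the ANN engine, which is an assumption made for readability; `retrieve_top_k` and the random embeddings are hypothetical.

```python
import numpy as np

def retrieve_top_k(source_emb, target_emb, k=10):
    """Return, for each source, the indices of its k most cosine-similar targets.
    Exact search is used here as a stand-in for an ANN index."""
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    scores = src @ tgt.T
    return np.argsort(-scores, axis=1)[:, :k]

# Toy data (assumption): sources are slightly perturbed copies of the first 5 targets.
rng = np.random.default_rng(0)
targets = rng.normal(size=(100, 16))     # pre-encoded targets
sources = targets[:5] + 0.01 * rng.normal(size=(5, 16))
top = retrieve_top_k(sources, targets, k=3)
```

Because the targets are encoded independently of the sources, the target matrix can be built once and reused for every query, which is the property the pipeline relies on.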


[Figure 3 diagram: the source is encoded, an approximate nearest neighbor (ANN) search is run against the pre-encoded targets, and the selected targets are returned.]

Figure 3: Translation retrieval pipeline. A source/target can be a sentence or a document. Figure from [Guo et al.2018].

Models                    en-fr             en-es             en-ru             en-zh
                          P@1   P@3   P@10  P@1   P@3   P@10  P@1   P@3   P@10  P@1   P@3   P@10
[Guo et al.2018]          48.9  62.3  73.0  54.9  67.8  78.1  -     -     -     -     -     -
[Artetxe and Schwenk2018] 83.3  -     -     85.8  -     -     -     -     -     -     -     -
DE                        80.7  87.9  91.2  85.6  92.7  95.1  83.9  89.7  92.0  82.2  91.1  94.1
BiDE                      82.3  90.7  94.2  86.3  93.0  95.6  85.7  92.3  95.1  83.1  91.3  94.6
BiDE+AM                   86.1  93.5  96.1  89.0  95.2  97.2  89.2  94.8  96.9  87.9  94.7  97.1

Table 1: Precision at N (P@N) (%) of target sentence retrieval on the UN corpus. Models attempt to select the true translation target for a source sentence from the entire corpus (11.3 million aligned sentence pairs). [Guo et al.2018] is a dual encoder (DE) model trained with a deep averaging network (DAN) instead of a transformer.

4 Evaluation on the United Nations Corpus

The United Nations Parallel Corpus [Ziemski et al.2016] contains 86,000 bilingual document pairs in five language pairs: from en to fr, es, ru, ar and zh. Each document pair is also aligned at the sentence level. For each language pair, there are a total of 11.3 million aligned sentence pairs. We use our model at both sentence and document levels for en-fr, en-es, en-ru and en-zh.

4.1 Sentence-Level Retrieval

We first apply the proposed model to retrieve target candidates at sentence level from the entire UN corpus using the retrieval pipeline illustrated in figure 3. After de-duping, there are still around 9.5 million unique pairs for each language pair. Precision at N (P@N) is used as the evaluation metric for target retrieval, for N=1, 3, 10. We compare the performance with two baseline models. The first baseline is a dual encoder model [Guo et al.2018] trained with a deep averaging network (DAN) and separate word embeddings for each language. The second is from [Artetxe and Schwenk2018], who train the multilingual embeddings from an encoder-decoder translation model and discard the decoder after training.
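
For reference, P@N can be computed as in the sketch below. The function name and the toy rankings are illustrative assumptions; each query is assumed to have exactly one gold target.

```python
def precision_at_n(ranked_candidates, gold, n):
    """Percentage of queries whose gold target is among the top-n candidates."""
    hits = sum(1 for ranked, g in zip(ranked_candidates, gold) if g in ranked[:n])
    return 100.0 * hits / len(gold)

# Toy rankings (assumption): one ranked candidate list per query.
ranked = [[0, 7, 3], [5, 1, 2], [9, 8, 2], [4, 3, 6]]
gold = [0, 1, 2, 3]
p_at_1 = precision_at_n(ranked, gold, 1)   # 25.0: only the first query is right at rank 1
p_at_3 = precision_at_n(ranked, gold, 3)   # 100.0: every gold target is in the top 3
```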

Table 1 shows the results of the nearest neighbor search for each English sentence when retrieving the true translation among the sentences in the target language. The results from the baseline systems are listed in the first two rows. Rows 3 to 5 show the results from the models described in section 2: a naïve dual-encoder (DE) model, a bi-directional dual-encoder (BiDE) model, and the bi-directional model with additive margin softmax (BiDE+AM). The BiDE model is only slightly better than the naïve dual-encoder model, but the improvement is consistent across all four language pairs. Adding the additive margin greatly improves the performance in all metrics, with a minimum of 86.1% P@1 across all language pairs. It also outperforms the baseline models in both en-fr and en-es by a large margin and, to the best of our knowledge, establishes a new state-of-the-art performance level.

Figure 4: en-fr model performance and mean cosine similarity at different margin values.

Figure 4 shows the F1 scores for the en-fr model with different margin values, from 0 to 0.5. The models improved even with a small margin value, and they kept improving until the margin reached a value of 0.3. The figure also shows the mean cosine similarity scores of positive pairs and of negative pairs. These pairs are all sampled from the top 2 retrieved nearest neighbors, and thus the negatives are all considered hard negatives. The mean score of the positive pairs keeps increasing with the margin value, while the mean score of the negative pairs keeps decreasing, showing the effectiveness of the additive margin softmax.

We also report the backward search performance in table 2. Instead of always treating English as the source language, we use each non-English sentence to search for its English translation. Interestingly, the P@1 from the backward search is higher in all four language pairs. We suspect that, in general, our embedding approach may be implicitly biased to work better when retrieving English sentences. Another possible reason is that class separability might be easier in English, given that for many English terms there is often more than one representation in the other languages.

Direction  en-fr  en-es  en-ru  en-zh
Forward    86.1   89.0   89.2   87.9
Backward   88.4   90.6   90.4   88.9

Table 2: P@1 (%) on the UN corpus from forward search and backward search. Forward search treats English as source and the other language as target. Backward is vice versa.

4.2 Evaluation Using a Translation Model

After the retrieval step, we use the raw cosine similarity score to filter sentence pairs. We empirically set the threshold for the cosine similarity to 0.5, based on the hold-out development set. Then, we train NMT models on the filtered sentence pairs for en-es and en-fr. The trained models are evaluated on the wmt13 [Bojar et al.2013] and wmt14 [Bojar et al.2014] test sets for en-es and en-fr, respectively.

To train the translation models, the sentence pairs are segmented using a shared 32,000 word-piece vocabulary [Schuster and Nakajima2012]. We employ a 6-layer transformer architecture [Vaswani et al.2017] with a model dimension of 512, a hidden dimension of 2048, and 8 attention heads. The Adam optimization algorithm is used with the training schedule described in [Vaswani et al.2017]. To ensure a fair comparison, our training setup is similar to the one used in [Guo et al.2018], where sentences are batched by approximate sequence length with 128 sentences per batch. We test the performance of this setup using the UN Oracle training set and confirm that our results are within 0.2 BLEU of the original numbers reported.

We evaluate the sentence pairs mined from the forward and backward directions separately. Table 3 shows the BLEU [Papineni et al.2002] of the NMT models trained from the mined sentence pairs. The first two rows show two baseline models, trained with the original UN pairs (Oracle) and from [Guo et al.2018] respectively. All models perform similarly, scoring within 2 BLEU points of each other. The one trained with pairs mined from the backward search achieves the highest performance in both language pairs. It is surprising to see that the model trained on mined data is even better than the model trained on the original data. To investigate why this is the case, we examined some "false positive" examples returned by the retrieval pipeline. We found that many of them are actually quite good translations, with only a few words different from the actual translation pairs. Some of them are even better aligned than the original translation. Thus, the pipeline potentially filtered out noise in the original data, which leads to better performance. In contrast to the model from [Guo et al.2018], which uses a separate filtering classifier, the proposed model shows that raw cosine similarity can perform as well on the UN corpus. This demonstrates that the cosine similarity is quite stable in our models. Notice, however, that we are not claiming that the separate filtering is unnecessary. We will discuss this point in later sections.

Model             en-fr (wmt14)  en-es (wmt13)
Oracle            30.96          28.81
[Guo et al.2018]  29.63          29.03
Mined forward     31.11          28.70
Mined backward    31.12          29.15

Table 3: BLEU scores on WMT testing sets of the NMT models trained on original UN pairs (Oracle) and on two versions of mined UN corpora at sentence level.

Model                   en-fr  en-es  en-ru  en-zh
[Guo et al.2018]        89.0   90.5   -      -
[Uszkoreit et al.2010]  93.4   94.4   -      -
Forward (Avg)           96.7   97.3   98.6   97.3
Backward (Avg)          96.9   97.8   98.2   97.4

Table 4: P@1 (%) of target document retrieval on the UN corpus. Models attempt to select the true target from the entire corpus (85k documents). We simply take the average of the sentence embeddings from BiDE+AM models as the document embedding.

4.3 Document-Level Retrieval

In this section, we experiment with the proposed model at document level. For each document, we first compute the embeddings of all the sentences in the document using the sentence-level model, e.g. BiDE+AM. Then, we take the average of all the sentence embeddings as the final representation for that document. Once the document embeddings are built, we once again apply the retrieval pipeline in figure 3 and perform the nearest neighbor search.
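
The averaging step is simple enough to state in a few lines. The sketch below assumes the sentence embeddings of one document are already available as rows of an array; `document_embedding` and the 2-dimensional toy vectors are hypothetical.

```python
import numpy as np

def document_embedding(sentence_embeddings):
    """Average a (num_sentences, dim) array into one document vector."""
    return np.asarray(sentence_embeddings).mean(axis=0)

# Toy document (assumption): three sentences with 2-dimensional embeddings.
doc = document_embedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The resulting document vectors can be fed to the same nearest neighbor search as the sentence vectors, since they live in the same space.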

Table 4 shows the P@1 results at document level for different models. The results of two baseline models are shown in rows 1 and 2. The baseline models require either a post-document matching step after the sentence-level retrieval [Guo et al.2018] or a complex engineering system [Uszkoreit et al.2010]. Rows 3 and 4 show the results of the proposed averaging BiDE+AM model for forward and backward retrieval respectively. Remarkably, both en-fr and en-es models establish new state-of-the-art performance at 97% P@1. We also list the results of the averaging models for en-ru and en-zh. It is worth noticing that the performance is consistent across all language pairs.

Models                     Direction  fr-en             de-en             ru-en             zh-en
                                      P     R     F     P     R     F     P     R     F     P     R     F
[Artetxe and Schwenk2018]  Forward    82.1  74.2  78.0  78.9  75.1  77.0  -     -     -     -     -     -
[Artetxe and Schwenk2018]  Backward   77.2  72.7  74.7  79.0  73.1  75.9  -     -     -     -     -     -
BiDE+AM                    Forward    86.7  85.6  86.1  90.3  88.0  89.2  84.6  91.1  87.7  86.7  90.9  88.8
BiDE+AM                    Backward   83.8  85.5  84.6  89.3  87.7  88.5  83.6  90.5  86.9  88.7  87.5  88.1

Table 5: [P]recision, [R]ecall and [F]-score on the BUCC training set, with the cosine similarity score used to optimize the threshold for best F-score. Following the naming of [Pierre Zweigenbaum and Rapp2018], we treat en as the target language and the other language as source in forward search. Backward is vice versa. We show the cosine similarity performance of [Artetxe and Schwenk2018] as baseline.

5 Evaluation on the BUCC Mining Task

In this section, we evaluate the proposed models on the BUCC mining task [Pierre Zweigenbaum and Rapp2018]. The BUCC mining task, available since 2016, is a shared task on parallel sentence extraction from two monolingual corpora, a subset of which is assumed to be parallel. We make use of the data from the 2018 shared task, which consists of corpora for four language pairs: fr-en, de-en, ru-en and zh-en. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list containing true translation pairs. These pairs are the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is compared to the ground truth, and evaluated in terms of the F1 measure. For more details on this task refer to [Pierre Zweigenbaum and Rapp2018].

5.1 Mining with Cosine Similarity

For each BUCC language pair, we apply the same retrieval pipeline depicted in figure 3 to iteratively retrieve the nearest neighbors for each source sentence. We then filter the retrieved nearest-neighbor pairs using the cosine similarity scores. Table 5 reports the precision, recall and F-score on the training set for both the forward and backward search. As opposed to the UN corpus, the forward search in BUCC treats English as target and the other language as source. The cosine score threshold is optimized on the training set itself, for comparison with the raw cosine similarity results of [Artetxe and Schwenk2018], which are listed in rows 1 and 2. Rows 3 and 4 show the performance of our model. Both the forward and backward search achieve very good performance on all metrics, with F-scores ranging from 84.6 to 89.2. Both perform better than the baseline models for fr-en and de-en, with a remarkably large margin of around 10 points in terms of F-score. Once again, the search from non-English to English (the forward direction in this case) works better than starting from English to retrieve the other languages. The high performance level in all four languages shows that the model is quite stable even for distant language pairs such as en-zh.
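
Choosing the cosine threshold that maximizes F-score can be sketched as a single sweep over the mined pairs sorted by score. This is a minimal illustration, assuming distinct scores and a gold set of pairs; `best_f1_threshold` and the toy data are hypothetical.

```python
def best_f1_threshold(scored_pairs, gold):
    """Sweep cosine thresholds over mined pairs and return (threshold, P, R, F1)
    maximizing F1. `scored_pairs` is a list of (pair, score); `gold` is a set of pairs."""
    best = (None, 0.0, 0.0, 0.0)
    kept, correct = 0, 0
    for pair, score in sorted(scored_pairs, key=lambda item: -item[1]):
        kept += 1                      # keep every pair scoring at least `score`
        correct += pair in gold
        precision = correct / kept
        recall = correct / len(gold)
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best[3]:
                best = (score, precision, recall, f1)
    return best

# Toy data (assumption): five mined pairs, four gold pairs, one gold pair never mined.
scored = [(("a", "A"), 0.95), (("b", "B"), 0.90), (("c", "X"), 0.85),
          (("d", "D"), 0.80), (("e", "Y"), 0.60)]
gold = {("a", "A"), ("b", "B"), ("d", "D"), ("f", "F")}
threshold, precision, recall, f1 = best_f1_threshold(scored, gold)
```

On this toy input the best threshold is 0.80, giving precision, recall and F1 of 0.75 each. Gold pairs that the retrieval step never returns, such as the last one here, cap the achievable recall regardless of the threshold.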

Model                             fr-en  de-en  ru-en  zh-en
Train
  Raw cosine                      86.1   89.2   87.7   88.8
  Margin-based scorer             90.0   92.6   90.1   92.5
  Simplified margin-based scorer  89.4   91.6   89.3   91.2
Test
  Margin-based scorer             90.02  92.26  89.18  92.81
  Simplified margin-based scorer  89.56  91.42  88.31  91.89
  BERT scorer                     96.96  97.24  93.38  96.00
Baseline 1                        81     86     81     77
Baseline 2                        92.89  95.58  92.57  92.03

Table 6: The final F1 scores of BUCC shared task with rescoring. Baseline 1: best results from [Pierre Zweigenbaum and Rapp2018]; Baseline 2: [Artetxe and Schwenk2018] with margin based scorer.

5.2 Mining with a Second-Stage Scorer

False positives (cosine similarity = 0.96, 0.89)
en-000034142 The Declaration of Brussels (1874) stated that the ”honours and rights of the family…should be respected”.
zh-000090005 ”布鲁塞尔宣言(1874年)表示,”家庭荣誉和权利…应当受到尊重.
en-000069836 Males had a median income of 26,397 versus 25,521 for females.
zh-000078235 男性人均年收入为 52,454,女性为43,750。
False negatives (cosine similarity = 0.74, 0.69)
en-000038309 This is especially true of Brazil, the source of much of this mistrust.
zh-000044511 其中巴西尤其如此,这个国家是南美国家联盟间不信任感的主要源头。
en-000027659 But this may change as the source countries become richer and undergo rapid declines in birth rates.
zh-000023675 但随着移民母国逐渐发达、出生率逐渐下降,美国的情况也会很快有所改变。
Table 7: Failure cases in the BUCC zh-en task using cosine similarity scoring. The threshold of 0.78 is tuned to maximize the F-score on the training set. False positives are mined pairs that are not in the gold pairs. False negatives are gold pairs that are missed by the retrieval system or have scores lower than the threshold.

We further experiment with the retrieved nearest neighbor pairs with a second-stage scorer other than the raw cosine similarity. We follow  [Artetxe and Schwenk2018] to apply a margin-based scorer using the formula: , where is the cosine similarity function and


The formula is slightly modified by adding the cosine similarity as part of the final score. We found it is consistently better in our model compared to the original formula, which also indicates that the raw cosine similarity is a strong signal for quality measurement.

We also experiment with a simplified margin only considering one direction, which greatly reduces computation time and simplifies the system:


Furthermore, we also train a classifier by fine-tuning a multilingual BERT model [Devlin et al.2018]. For training, we use the nearest neighbor pairs mined from the BUCC training set. Pairs in the gold map list are treated as positives, and the rest as negatives. We sample the negative pairs so that the positive/negative ratio is 1:10. 80% of the pairs are used for training and 20% are used for development. On the test set, we first retrieve the nearest neighbors and remove those pairs that are not nearest neighbors of each other. Then, the fine-tuned BERT classifier is applied to select the positive pairs.
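
The filtering of pairs that are not nearest neighbors of each other can be sketched as follows. The dictionary representation and the name `mutual_nearest_neighbors` are assumptions made for illustration; the paper does not describe its data structures.

```python
def mutual_nearest_neighbors(forward_nn, backward_nn):
    """Keep (source, target) pairs where each side is the other's nearest neighbor.
    `forward_nn` maps source -> nearest target; `backward_nn` maps target -> nearest source."""
    return [(s, t) for s, t in forward_nn.items() if backward_nn.get(t) == s]

# Toy neighbors (assumption): s2 points at t1, but t1's nearest source is s1.
forward_nn = {"s1": "t1", "s2": "t1", "s3": "t3"}
backward_nn = {"t1": "s1", "t2": "s2", "t3": "s3"}
pairs = mutual_nearest_neighbors(forward_nn, backward_nn)
```

Only the pairs that survive this check would be passed to the classifier.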

The final results on the BUCC task are shown in table 6. The margin-based scorer still helps to improve the F1 scores by 2 to 3 points on average, even with the simplified one-direction version from Eq. 10. The proposed approach with the BERT scorer significantly outperforms the current state-of-the-art models, with F-scores ranging from 93.38 (ru-en) to 97.24 (de-en). We will show examples where the scorer helped to improve the results in the next section.

Figure 5: The precision-recall curve on the training set of the BUCC shared task, using a forward search with cosine similarity scoring.

5.3 Analysis

Figure 5 shows the precision-recall curve on the training set of the BUCC shared task using a forward search with cosine similarity scoring. The curves show that the cosine similarity score is quite stable and a strong translation quality signal in all language pairs. The figure also lists the average precision for each language pair, ranging from 88% for fr-en to 94% for de-en, which is also very promising.

To better understand the learned embedding space and what could be improved by a second-stage scorer, we list typical failure cases using a raw cosine similarity model for the zh-en task in table 7. The first "false positive" is not really a failure. It is a good translation that happens to be missing from the gold pairs. [Pierre Zweigenbaum and Rapp2018] mentioned that some true translations are missing, but that they did not affect the evaluation before. With the improvement of the models, more and more missed translations are being retrieved, which may start affecting the evaluation. The second false positive is a typical failure of the proposed model. The model has trouble differentiating sentences that are semantically similar but include different numbers or entities, especially numbers with long digit sequences. Part of the error may originate from the tokenization and normalization steps. Interestingly, several sentences are often very close, differing only in the numbers, e.g. (en-000065918) Males had a median income of $59,738 versus $39,692 for females; (en-000069337) Males had a median income of $31,106 versus $21,985 for females. As the pattern is very clear, such pairs can simply be removed by a second-stage scorer.

We also list two typical false negatives from the gold pairs. They are both "partial translation" pairs with cosine similarity close to the threshold. These cases are especially challenging, as it is hard even for humans to make a consistent judgment on them. We found that in the BUCC training task some partial translations are present in the gold set while others are not. In practice, the threshold could be changed and fine-tuned based on the application use case. This type of error seems hard for a margin-based scorer to capture. The BERT scorer, however, is likely capturing the human preference on partial translations from the training data.

6 Conclusion

In this paper we presented an approach to learn multilingual sentence and document embeddings using a bi-directional dual-encoder with additive margin softmax. In the proposed approach, the additive margin softmax addresses the intra-class variation problem observed in the original formulation of the dual encoder model. Through our experimental results, we showed that the cosine similarities of bitext pairs are surprisingly stable, and that the improved embedding space also leads to better retrieval performance on the United Nations (UN) parallel corpus. At the sentence level, the system achieves P@1 of 86% or higher in all the languages tested. At the document level, document embeddings were produced by simply averaging all the sentence embeddings of a document. Despite the simplicity of this approach, the document embeddings achieved 97% at P@1. Lastly, we evaluated the proposed model on the BUCC bitext mining task, where the learned embeddings with raw cosine similarity scores also showed competitive results compared to current state-of-the-art models. With a second-stage BERT scorer, we achieve a new state-of-the-art level on this task.


The authors are especially grateful to Wei Wang and Keith Stevens of the Google Translation Team for the valuable discussions and for their help in running the data extraction and selection pipeline.


  • [Antonova and Misyurev2011] Alexandra Antonova and Alexey Misyurev. Building a web-based parallel corpus and filtering out machine-translated text. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 136–144. Association for Computational Linguistics, 2011.
  • [Artetxe and Schwenk2018] Mikel Artetxe and Holger Schwenk. Margin-based parallel corpus mining with multilingual sentence embeddings. CoRR, abs/1811.01136, 2018.
  • [Bojar et al.2013] Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
  • [Bojar et al.2014] Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58, 2014.
  • [Bouamor and Sajjad2018] Houda Bouamor and Hassan Sajjad. H2@BUCC18: Parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In Reinhard Rapp, Pierre Zweigenbaum, and Serge Sharoff, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, May 2018. European Language Resources Association (ELRA).
  • [Chidambaram et al.2018] Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Learning cross-lingual sentence representations via a multi-task dual-encoder model. CoRR, abs/1810.12836, 2018.
  • [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
  • [dos Santos et al.2016] Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. Attentive pooling networks. CoRR, abs/1602.03609, 2016.
  • [Grégoire and Langlais2017] Francis Grégoire and Philippe Langlais. A deep neural network approach to parallel sentence extraction. arXiv preprint arXiv:1709.09783, 2017.
  • [Guo et al.2018] Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176. Association for Computational Linguistics, 2018.
  • [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [Pierre Zweigenbaum and Rapp2018] Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Reinhard Rapp, Pierre Zweigenbaum, and Serge Sharoff, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, May 2018. European Language Resources Association (ELRA).
  • [Schuster and Nakajima2012] M. Schuster and K. Nakajima. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152, March 2012.
  • [Schwenk2018] Holger Schwenk. Filtering and mining parallel data in a joint multilingual space. arXiv preprint arXiv:1805.09822, 2018.
  • [Uszkoreit et al.2010] Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 1101–1109, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [Vanderkam et al.2013] Dan Vanderkam, Rob Schonberger, Henry Rowley, and Sanjiv Kumar. Nearest neighbor search in Google Correlate. Technical report, Google, 2013.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [Wang et al.2018a] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
  • [Wang et al.2018b] Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143. Association for Computational Linguistics, 2018.
  • [Yang et al.2018] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-Yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174. Association for Computational Linguistics, 2018.
  • [Ziemski et al.2016] Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC ’16. European Language Resources Association, April 2016.