Not only have regular vector representations of words become ubiquitous [word2vec], but bilingual word embeddings [DBLP:journals/corr/Ruder17] have also enjoyed remarkable success, enabling many novel forms of cross-lingual NLP. Owing to the availability of adequate cross-lingual word datasets and supervised learning methods, cross-lingual transfer learning at the word level is well studied. However, only a few results exist on sentence-level cross-lingual mapping, let alone for low-resource settings.
To obtain sentence-level representations with little effort, one of the most popular methods is to compute a (possibly weighted) average of the word vectors of all words encountered in a sentence. The method is favored for its simplicity, particularly in light of the widespread availability of pre-trained word vectors. While there are numerous more powerful methods (cf. Section 2), they require substantially longer training times, and pre-trained models are typically less convenient to load. Somewhat surprisingly, weighted averages of word vectors have been shown capable of outperforming several more advanced techniques, including certain long short-term memory (LSTM)-based setups [wieting2015towards, arora2017asimple].
In this paper, we study to what extent sentence embeddings based on simple word vector averages can be aligned cross-lingually. While word vector averages have been studied for embeddings of entire text documents, such document embeddings mainly need to capture topic information. Sentence embeddings, in contrast, are typically expected to retain more detailed semantic information. If simple word vector averages can achieve this cross-lingually despite being entirely oblivious of the order of words in the input sentences, this would provide a simple means of connecting semantically related sentences across language boundaries, in support of a diverse range of possible tasks such as question answering, recommendation, and plagiarism detection [liu2019xqa, Xian:2019:RKG:3331184.3331203, ferrero2017usingword]. While there has been research on joint multilingual training of NMT to obtain richer cross-lingual sentence embeddings [Wang2019MultilingualNM], such methods tend to require substantial training data.
At the same time, aligning sentences cross-lingually with limited parallel data is challenging, as it is not obvious how to exploit non-parallel data (see Figure 1). While linear transformations have proven fruitful for cross-lingual word vector mapping [Mikolov2013ExploitingSA], recent work shows that non-linear transformations may be necessary even just at the level of individual words [Nakashole2018]. At the sentence level, this necessity may be much more pronounced, due to the divergent syntactic and morphological properties of different languages and the linear superposition of different concepts. We not only study this empirically but also show how non-linear transformations can be learned with limited parallel data.
Specifically, we propose Adversarial Bi-directional Sentence Embedding Mapping (ABSent), based on the generative adversarial network (GAN) framework, to bridge the gap between languages while avoiding overfitting even with limited parallel data. The model takes simple (weighted) averaged word embeddings for a source language sentence as input and generates a sentence embedding in the target language space, which is itself defined by (weighted) averaged word vectors in the target language. The bi-directional structure additionally enables joint transformations between the two languages.
The major contributions of the paper can be outlined as follows: 1) We highlight the simplicity and effectiveness of inducing high-quality sentence representations from pre-trained word embeddings by means of Term Frequency-Inverse Document Frequency (TF-IDF) weighted averaging of word vectors. 2) We propose an adversarial bi-directional model for cross-lingual sentence embedding mapping based on a custom form of the GAN framework that is capable of utilizing non-parallel sentence pairs. Moreover, we show that the same model architecture can easily be extended to more than one source language. 3) We extensively evaluate the performance of our method on the Tatoeba and Europarl corpora, obtaining high retrieval accuracy as well as high-quality mapping results, even in low-resource settings.
2 Related Work
Cross-Lingual Projection Approaches.
A number of papers consider linear projections to align two word vector spaces with a regression objective [Mikolov2013ExploitingSA, zou2013bilingual]. Faruqui and Dyer [FaruquiDyer:2014:EACL] proposed using Canonical Correlation Analysis (CCA). Xing et al. [Xing2015NormalizedWE] showed that adding an orthogonality constraint to the mapping can significantly enhance the result quality, and that the resulting problem has a closed-form solution. There have been approaches that assume that languages share some common vocabulary items as a heuristic for supervision [Smith2017OfflineBW, dong2018cross, artetxe2017learning].
A few works also attempt to align monolingual word vector spaces with no supervision at all. [P17-1179] employed a form of adversarial training, but their approach differs from ours in multiple respects. First, they rely on sharp drops of discriminator accuracy for model selection. Second, their performance is highly sensitive to the selected parallel corpus. [lample2018word] presented a related unsupervised technique that learns a rotation matrix and outperforms several state-of-the-art supervised techniques. In contrast to our approach, none of the above methods consider non-linear transformations [Nakashole2018].
Well-known approaches to create sentence embeddings include the Paragraph Vector approach [LeMikolov2014ParagraphVectorPMLR], which straightforwardly extends word2vec to generate vectors for paragraphs, and the Skip-Thought Vector approach [Kiros2015SkipThoughtVectors], which relies on recurrent units to encode and decode sentence representations such that these are predictive of neighbouring sentences. There are more sophisticated methods that rely on supervision from a range of different NLP tasks [SubramanianEtAl2018MultiTaskSentenceEmbeddings, Yang2019ImprovingMS].
However, inspired by the results from Wieting et al. [wieting2015towards], Arora et al. [arora2017asimple] presented a weighting technique that enables simple weighted sums of word vectors to outperform several state-of-the-art models. In our experiments, we build on these insights and likewise consider weighted sums of word vectors as sentence embeddings, as these are readily available, even for many low-resource languages. Our weighting scheme is described in Section 4.
Recently, GANs [Goodfellow2014GANs] have shown remarkable success across a diverse range of multimodal tasks. Their adversarial training process resembles a min–max game. Some GAN approaches require a supervised learning setting like image-to-image transfer [pix2pix2017]. The CycleGAN approach [zhu2017unpaired] shows promise in its exploitation of unpaired data to achieve a domain transfer. While GANs have mostly been considered for multimodal data [liu2019oogan], we show how they can be used for linguistic representations in an NLP task.
3 ABSent Approach
In this paper, we seek to learn a transformation between two languages such that the mapping model can be invoked to project an embedding of a source language sentence to a target language space and to find the nearest neighbor targets in the target space at the sentence level. At the same time, our approach is shown to be robust under low-resource conditions in terms of the amount of parallel sentences available for training. We start with a formal definition of our sentence representation problem under limited parallel data, followed by a detailed description of our proposed deep neural model.
3.1 Problem Definition
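Formally, we assume a source language (domain) $\mathcal{X}$ and a target one $\mathcal{Y}$, such that each element is a $d$-dimensional vector, denoted by $x \in \mathcal{X}$ or $y \in \mathcal{Y}$. We assume that each $x \in \mathcal{X}$ is aligned with one $y \in \mathcal{Y}$, denoted by $x \sim y$. Given a bilingual corpus $\mathcal{C}$ with a labeled (parallel) subset $\mathcal{C}_l$, an unlabeled subset $\mathcal{C}_u$, and some distance measure $\delta$, our goal is to learn two non-linear transformation functions $G_x: \mathcal{X} \to \mathcal{Y}$ and $G_y: \mathcal{Y} \to \mathcal{X}$ that minimize the distance between each projected embedding and its aligned counterpart over the labeled set (Equation 1).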
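As a sketch of what Equation 1 may look like, assuming the notation $G_x$, $G_y$ for the two transformation functions, $\delta$ for the distance measure, and $\mathcal{C}_l$ for the labeled subset (these symbols are illustrative choices, and the exact form is an assumption), the distance objective can be written as:

```latex
\mathcal{L}_{\mathrm{dist}}(G_x, G_y)
  = \mathbb{E}_{(x, y) \in \mathcal{C}_l}
    \Bigl[\, \delta\bigl(G_x(x),\, y\bigr) + \delta\bigl(x,\, G_y(y)\bigr) \Bigr]
```

The first term penalizes source embeddings that land far from their aligned target after projection, and the second term does the same for the reverse direction.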
Note that in the labeled set $\mathcal{C}_l$, the alignment between $x$ and $y$ is known, while in the unlabeled set $\mathcal{C}_u$, the relationship between $x$ and $y$ is unknown.
In this paper, we consider the case when the labeled set is very limited and the size of the unlabeled set is large. In other words, the challenge is how to utilize unpaired vectors from two domains to learn good mappings from one domain to the other. Specifically, the model is expected to be able to jointly learn from bidirectional transformations between the two languages at the same time to improve the mappings for each direction by better modeling the joint distribution, which is important in settings with limited parallel data as considered in this paper.
3.2 Our Method
To solve the transformation problem with limited parallel data, we introduce the novel Adversarial Bi-directional Sentence Embedding Mapping (ABSent) method.
The core of our model is inspired by the Triangle Generative Adversarial Network [GanCWPZLLC17], which addresses the task of image-to-image translation. The key idea is that the generator component learns to non-linearly project embeddings across the two representation spaces, while a discriminator component attempts to distinguish automatically projected embeddings from genuine target language embeddings, thus constraining the generator to more closely match the target distribution. Unlike regular GANs, our model incorporates additional information from “adversarial” pairs of sentence embeddings that come from both parallel and non-parallel data. The corresponding objective function is given in Equation 2.
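A minimal sketch of such an adversarial objective, written in the standard GAN form and assuming the illustrative symbols $G_x$, $G_y$ for the generators, $D_1$ for the real-versus-fake discriminator, and $\mathcal{C}_l$ for the parallel set (the exact objective used by the method may differ), is:

```latex
\min_{G_x, G_y} \max_{D_1} \;
\mathcal{L}_{\mathrm{adv}} =
    \mathbb{E}_{(x, y) \in \mathcal{C}_l} \bigl[ \log D_1(x, y) \bigr]
  + \mathbb{E}_{x} \bigl[ \log \bigl( 1 - D_1(x, G_x(x)) \bigr) \bigr]
  + \mathbb{E}_{y} \bigl[ \log \bigl( 1 - D_1(G_y(y), y) \bigr) \bigr]
```

The discriminator is rewarded for scoring known parallel pairs highly and generated pairs lowly, while both generators are rewarded for producing pairs that it cannot tell apart from real ones.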
Here, $D_1$ is a discriminator that aims to distinguish real pairs from fake pairs. A real pair $(x, y)$ is a known mapping in the parallel dataset. A fake pair $(x, G_x(x))$ or $(G_y(y), y)$ is an artificial pair based on a projection emitted by the generator $G_x$ or $G_y$.
Equation 2 reflects an adversarial min–max game, in which the generators $G_x$, $G_y$ and the discriminator $D_1$ are trained adversarially and concurrently to improve their respective abilities. This is a bidirectional process due to its reliance on both generator functions $G_x$ and $G_y$ to map from one domain to the other.
In addition, in order to utilize non-parallel information, we further take into consideration mismatch pairs induced from non-parallel data. Given the set of source language sentence embeddings $X$ and the set of target language sentence embeddings $Y$, the set of mismatch pairs consists of all training pairs $(x, y)$ with $x \in X$ and $y \in Y$ such that $x$ is an embedding for a sentence that is not translationally equivalent to the sentence represented by $y$. The loss function is augmented with a mismatch term over these pairs (Equation 3).
However, the discriminator $D_1$ alone cannot determine the directionality between fake pairs. Therefore, we introduce another discriminator $D_2$ to distinguish whether fake pairs come from the domain $\mathcal{X}$ or from the domain $\mathcal{Y}$. The corresponding loss function is given in Equation 4.
The overall framework of our ABSent method is illustrated in Figure 2, where we seek to solve the joint optimization problem given in Equation 5.
Here, $\lambda$ is a weighting factor to balance the effect between the distance metric and the adversarial components. In this paper, we use cosine similarity as the basis of the distance measure.
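One common way to turn cosine similarity into a distance, shown here as an assumption rather than as the exact definition used in the paper, is to subtract it from one. With $\delta$ as an illustrative symbol for the distance and $u$, $v$ as two sentence embeddings:

```latex
\delta(u, v) \;=\; 1 - \cos(u, v)
          \;=\; 1 - \frac{u^{\top} v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
```

Under this convention, identical directions give a distance of 0 and opposite directions give a distance of 2.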
We train our model adversarially to learn the mappings bi-directionally, encouraging the generated pairs to be indistinguishable from genuine pairs and the direction in which a pair was generated to remain as indiscernible as possible.
3.3 Zero-Shot Multilingual Setting
Our model can also be easily extended to align sentences between two languages $\mathcal{X}_1$ and $\mathcal{X}_2$ in a zero-shot manner without any parallel data between them. The zero-shot multilingual task involves jointly projecting two languages $\mathcal{X}_1$ and $\mathcal{X}_2$ to a common target language $\mathcal{Y}$ given only limited parallel data and non-aligned data connecting each to $\mathcal{Y}$. As input, we have labeled data and unlabeled data for the language pairs ($\mathcal{X}_1$, $\mathcal{Y}$) and ($\mathcal{X}_2$, $\mathcal{Y}$). However, we do not observe any direct relationship between the two source languages $\mathcal{X}_1$ and $\mathcal{X}_2$ in the training data.
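In this case, we adopt the same framework as in the previous section except that the generator is expected to learn two language projections, from $\mathcal{X}_1$ to $\mathcal{Y}$ and from $\mathcal{X}_2$ to $\mathcal{Y}$. Let $\mathcal{L}(\mathcal{X}_i, \mathcal{Y})$ be the loss function defined in Equation 5 for the languages $\mathcal{X}_i$ and $\mathcal{Y}$. The overall loss function for multilingual mapping is simply $\mathcal{L}(\mathcal{X}_1, \mathcal{Y}) + \mathcal{L}(\mathcal{X}_2, \mathcal{Y})$.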
3.4 Sentence Representation and Mapping
With regard to obtaining the sentence representations, we adopt pre-trained word vectors [bojanowski2016enriching] trained on Wikipedia using fastText, which are available for numerous languages. We ensure that the baselines use the same input embeddings in training and evaluation as our model. Based on the results of Wieting et al. [wieting2015towards] and Arora et al. [arora2017asimple], we adopt simple (weighted) averages of word vectors, which are surprisingly powerful, although our method could also be applied to other sentence embedding methods.
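The (weighted) averaging step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the original implementation: the function name `sentence_embedding`, the dictionary-based word vector lookup, and the toy two-dimensional vectors are all hypothetical, and out-of-vocabulary tokens are simply skipped.

```python
import math

def sentence_embedding(tokens, word_vectors, weights=None):
    """Return the L2-normalized (weighted) average of the word vectors of `tokens`.

    `word_vectors` maps a token to a list of floats (hypothetical lookup table).
    `weights` optionally maps a token to a weight such as a TF-IDF score;
    tokens without an explicit weight default to 1.0.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # assumption: unknown words are ignored
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for i, value in enumerate(vec):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    avg = [t / weight_sum for t in total]
    norm = math.sqrt(sum(a * a for a in avg))
    return [a / norm for a in avg] if norm > 0 else avg

# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "morning": [0.0, 1.0]}
plain = sentence_embedding(["good", "morning"], toy_vectors)
weighted = sentence_embedding(["good", "morning"], toy_vectors, {"good": 3.0})
```

With uniform weights both words contribute equally, whereas the weighted call pulls the sentence vector towards the more heavily weighted word.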
Given source sentence embeddings $X$ and target sentence embeddings $Y$ acquired as described above, we can train the generators $G_x$ and $G_y$ through the joint loss function in Equation 5. Subsequently, we evaluate the obtained transformation via a standard sentence retrieval task. For each source sentence embedding $x$, we compute the nearest neighbours of its projection $G_x(x)$ in terms of the distance function $\delta$ among all target embeddings. The corresponding target sentences are regarded as the candidate set of mapping results.
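The retrieval step described above can be sketched as a brute-force nearest-neighbour search. The names `cosine_similarity` and `retrieve` are hypothetical, the projected source embedding is assumed to have been produced by an already trained generator, and the toy vectors are for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(projected, targets, k=5):
    """Return the indices of the k targets most similar to `projected`, best first."""
    ranked = sorted(
        range(len(targets)),
        key=lambda j: cosine_similarity(projected, targets[j]),
        reverse=True,
    )
    return ranked[:k]

# Toy target space with three sentence embeddings.
targets = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = retrieve([0.9, 0.1], targets, k=2)
```

For large target sets, an approximate nearest-neighbour index would normally replace the full sort, which is kept here only for clarity.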
|Mikolov et al. [Mikolov2013ExploitingSA]||13.6||20.9||12.8||21.7||31.1||46.9||24.7||38.3||6.4||13.4||7.8||14.6||13.1||21.6||12.4||22.1|
|Dinu et al. [dinu2014improving]||17.3||30.8||22.7||36.4||35.4||52.8||30.5||46.2||13.8||21.9||11.9||21.1||18.5||28.6||15.7||27.3|
In this section, we extensively evaluate the effectiveness of our ABSent method compared with state-of-the-art approaches on two heterogeneous real-world corpora.
4.1 Experimental Setup
We evaluate the precision of our approach on the Europarl parallel corpus and on data extracted from the Tatoeba service (http://tatoeba.org), which provides translations of commonly used phrases that might be useful to language learners. We focus on German–English and Spanish–English translation retrieval. For the English–German datasets, we take 160k pairs as the training set and 1,600 pairs as the test set for both corpora. For the English–Spanish datasets, we take 60k pairs as training and 600 pairs as test data for the Tatoeba corpus, and 130k as training and 1,300 as test data for the Europarl corpus. However, to emphasize that our model can cope with very limited amounts of parallel data, we use just 20% of the parallel training data when training our model, while all the baseline methods exploit 100% of the parallel training data. Given a set of training pairs, we randomly sample false pairs of the same size as the respective parallel data.
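The sampling of false pairs can be sketched as follows. This is an illustrative assumption about the procedure rather than the original code: the function name `sample_mismatch_pairs` is hypothetical, pairs are represented as tuples of sentence strings or IDs, and a candidate counts as a mismatch whenever it is not one of the known parallel pairs.

```python
import random

def sample_mismatch_pairs(parallel_pairs, seed=0, max_attempts_factor=100):
    """Sample as many (source, target) pairs as there are parallel pairs,
    such that no sampled pair is one of the known translations.

    Sampling is with replacement, so the same mismatch may occur twice.
    """
    n = len(parallel_pairs)
    if n < 2:
        raise ValueError("need at least two parallel pairs to form a mismatch")
    rng = random.Random(seed)
    known = set(parallel_pairs)
    mismatches = []
    attempts = 0
    while len(mismatches) < n:
        attempts += 1
        if attempts > max_attempts_factor * n:
            raise RuntimeError("could not sample enough mismatch pairs")
        i, j = rng.randrange(n), rng.randrange(n)
        candidate = (parallel_pairs[i][0], parallel_pairs[j][1])
        if candidate not in known:
            mismatches.append(candidate)
    return mismatches

pairs = [("Guten Morgen", "Good morning"), ("Danke", "Thank you"), ("Hallo", "Hello")]
mismatch_pairs = sample_mismatch_pairs(pairs)
```

Note that this check only excludes known translations; two different sentences that happen to be paraphrases could still be paired, which a real pipeline may or may not want to filter.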
For comparison, we consider as baselines the linear transformation methods by Mikolov et al. [Mikolov2013ExploitingSA], Dinu et al. [dinu2014improving], and Smith et al. [Smith2017OfflineBW], as well as the supervised version of MUSE [lample2018word] with cross-domain similarity local scaling. We also use the multilingual version of BERT [devlin2018bert], using the standard method for sentence-level representations based on the [CLS] token (as provided by bert-as-service: https://github.com/hanxiao/bert-as-service), to generate sentence vectors that are already multilingual without further projection. This is to assess how far we can take simple word vector averages in comparison to powerful alternatives.
We further consider a seq2seq [sutskever2014sequence] NMT baseline jointly trained to translate from the source language to the target language as well as to monolingually auto-encode target-language sentences. We use two different encoders with a shared decoder such that the two encoders produce latent representations in the same space. This allows us to save the latent sentence embeddings for evaluation rather than generate an output translation.
Additionally, we consider the fairseq NMT [gehring2017convs2s] approach based on a convolutional encoder model, which constructs latent representations hierarchically.
Finally, we investigate a Conditional GAN [Mirza2014ConditionalGA], for which we use our model but do not consider any fake pairs or mismatch pairs for training.
Both generators $G_x$ and $G_y$ consist of three fully connected layers with hidden sizes of 512, 1024, and 512, respectively. Each hidden layer is connected with a BatchNorm layer and the ReLU activation function. The final activation function is tanh. Both discriminators $D_1$ and $D_2$ take as input two embeddings, followed by three fully connected layers of sizes 512, 1024, and 512 with concatenation. Each hidden layer is connected with a leaky ReLU activation function (0.2), while the output is activated by a sigmoid function. We rely on Adam optimization with an initial learning rate of 0.002 and a batch size of 128.
4.2 Main Results
We assess the quality of the mapping by considering the ranking of the ground truth paired target sentence. The overall quality across all test set instances is given by the precision@$k$ metric, which, following previous work [Mikolov2013ExploitingSA] in this area, is defined as the ratio of test set instances for which the correct target is among the top $k$ retrieved candidates. We then repeat the same evaluation process for all four datasets from the two corpora.
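The metric described above can be sketched as follows. The function name `precision_at_k` and the toy rankings are illustrative assumptions; each ranked list holds target indices ordered from most to least similar for one test query.

```python
def precision_at_k(ranked_lists, gold_indices, k):
    """Fraction of queries whose gold target index appears among the top-k results."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_lists, gold_indices)
        if gold in ranked[:k]
    )
    return hits / len(ranked_lists)

# Three toy queries over a target set of three sentences.
ranked_lists = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
gold_indices = [0, 1, 1]
p_at_1 = precision_at_k(ranked_lists, gold_indices, k=1)
p_at_2 = precision_at_k(ranked_lists, gold_indices, k=2)
```

In this toy case only the second query ranks its gold target first, while all three queries place it within the top two.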
The results are reported in Table 1. Recall that we only use 20% of the parallel sentences (true pairs) to train our model, while all the baselines utilize 100% of the parallel sentence pairs for training. We observe that our ABSent approach still outperforms the baselines by a large margin. Take the deu–eng data from Tatoeba as an example. Our method achieves a precision@1 of 46.2% and a precision@5 of 65.5%, which are 18.6 and 22.9 absolute percentage points higher than the respective results of the best baseline. Similar trends can be observed for the other datasets. Note that the results for different languages are not fully comparable due to the different sizes of training data.
4.3 Detailed Analysis
Influence of Bi-directional Transformation.
We evaluate how the bi-directional mapping strategy affects the effectiveness of our model. Taking German sentence embeddings as the source and English sentence embeddings as the target, we train our model to map the German sentence embeddings to the corresponding English vector space and align sentences with the same meaning. We obtain the unidirectional transformation model from German to English, which we refer to as uni-Sent. Then we repeat the same process for English to German, Spanish to English, and English to Spanish. Thus, this model only acquires the ability to conduct a unidirectional transformation between two languages, since the bi-directional discriminator is omitted. In this case, we only learn the generator that maps the source to the target domain. The goal is to optimize the resulting unidirectional objective, in which the mismatch and distance terms stay the same as in Equations 3 and 1. The results in Figure 3 show that the effectiveness of the method is improved substantially by the introduction of bidirectional learning. This demonstrates that the bidirectional training method not only enables a simultaneous transformation between the sentence embeddings of the two languages, but also delivers better results. One possible reason is the effectiveness of the second discriminator, which can regularize the directional uncertainty among the sentence embedding transformations between the two languages. A unidirectional transformation does not bear this benefit.
Influence of Ratio of Parallel Training Corpus.
Next, we study how various ratios of available parallel data affect the effectiveness of our model. Apart from using 20% of the parallel data in the training corpus, we also evaluate using 10%, 40%, and 100% as ratios of parallel training sentence pairs. We randomly sample the mismatch pairs to be of equal size as the respective parallel data. The results are also depicted in Figure 3. We observe that the precision improves as the ratio of parallel labelled sentences increases.
Influence of Mismatch Loss.
Recall that our ABSent method incorporates a custom mismatch loss in Equation 3. This experiment aims to study how this loss function influences the effectiveness of our proposed models. For simplicity, we refer to a model X without the mismatch loss as Xmis, where X can be either uni-Sent or ABSent. The results in Table 2 demonstrate the effectiveness of the mismatch strategy of introducing false pairs. We take 10% of parallel training pairs, leaving other parameters as in the main experiments. Generally, bringing in mismatch pairs into the training improves the effectiveness of both our ABSent model and the uni-Sent version.
Influence of Weighting Strategy.
For the input sentence embeddings of ABSent, we take the average of the word embeddings and normalize it to obtain sentence representations for the Tatoeba corpus. For the Europarl corpus, we define the sentence embeddings to be the normalized sum over the word vectors multiplied by their TF-IDF scores, i.e., a weighted average. As a comparison experiment, we swap the weighting strategy while keeping all other parameters as for ABSent, i.e., we use TF-IDF weighted vectors for Tatoeba and vanilla word averages for Europarl.
The effect of this swap is shown in Table 2. Choosing an appropriate weighting strategy boosts the experimental results. TF-IDF weights words in accordance with their assumed importance. For the Europarl corpus, sentence lengths mostly fall between 35 and 60 words, while for the Tatoeba corpus they fall between 8 and 16. In long sentences, words that appear more frequently, especially function words such as 'a' and 'the', ought to have a lower weight, while infrequent ones ought to have a higher weight. For short sentences, it may make sense to retain even words with higher frequency, so as not to neglect their semantic contribution to the sentence. Thus, simple averaging works better than TF-IDF weighted averaging for the Tatoeba corpus.
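To make the weighting concrete, the following sketch computes TF-IDF weights with each sentence treated as a document. The smoothed IDF formula used here is one common variant and is an assumption, since the exact TF-IDF definition is not specified; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(tokenized_sentences):
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(tokenized_sentences)
    doc_freq = Counter()
    for tokens in tokenized_sentences:
        doc_freq.update(set(tokens))  # count each word once per sentence
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def tfidf_weights(tokens, idf, default_idf=1.0):
    """TF-IDF weight per distinct token of one sentence (raw count times IDF)."""
    counts = Counter(tokens)
    return {tok: count * idf.get(tok, default_idf) for tok, count in counts.items()}

corpus = [
    ["the", "parliament", "votes"],
    ["the", "commission", "agrees"],
    ["the", "council", "meets"],
]
idf = idf_weights(corpus)
sentence_weights = tfidf_weights(corpus[0], idf)
```

In this toy corpus the function word "the" occurs in every sentence and receives the smallest weight, which mirrors the intuition given above for long Europarl sentences.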
First Aligning Words.
As we can see in Table 2, the accuracy drops quite notably compared to the regular ABSent approach if we first align words and then create the sentence vectors. The only difference between this variant and ABSent is that we first align individual word vectors using our method and then average them (TF-IDF weighted averaging for the Europarl corpus) to generate sentence embeddings in the target vector space. For this comparison, we take 10% of parallel word pairs for training. We conjecture that this approach is less able to account for variation in the meaning of a word across different sentences.
We additionally provide representative examples of nearest neighbours, showcasing typical high-quality, medium quality, as well as low-quality transformation results for English–German in Table 4. The lower quality results in some cases highlight the limits of average word vector based embeddings, as they disregard word order and may lose semantically salient signals. This problem can be overcome by applying our method using more semantically sophisticated methods to obtain sentence embeddings. Although many such methods require extensive training and in some cases also rich supervision, the advantage of our model is that we can rely on just limited parallel data to project a resource-poor language into the embedding space of a resource-rich language such as English for which such sentence embedding methods are readily available.
|#train||Acc (%)||#train||Acc (%)||#test|
|Example||German Sentence||English Sentences ranked by similarity score|
| ||Ich finde keine Worte.||(1) I am at a loss for words. (0.852)|
| || ||(2) I just don’t know what to say. (0.847)|
| || ||(3) Tom has turned twenty. (0.836)|
| ||Notwendig ist die Interoperabilität der Set-Top-Boxen.||(1) Set top boxes must be compatible with each other. (0.895)|
| || ||(2) People are showing great solidarity at local and regional level and help is being mobilised at national level. (0.858)|
| || ||(3) I support the proposed deadline for the Commission being 30 September in proposed amendment. (0.846)|
|Europarl (medium-quality)||Zunächst ist es zu begrüßen, dass der Universaldienst zwar einfache, aber keine breitbandigen Internetanschlüsse umfassen soll.||(1) In the past, the international community has done itself credit by prohibiting anti-personnel mines on these grounds. It should now, by the same token, ban weapons containing depleted uranium. (0.866)|
| || ||(2) This is because it provides clarity and therefore does not expose public services to the attacks which would otherwise have been levelled at them. (0.863)|
| || ||(3) That does not mean, however, that we do not still see much room for improvement, as other speakers have pointed out as well, and some of our wishes have not been fulfilled. (0.854)|
| || ||(4) Firstly, the fact that the universal service is to include simple but not broadband Internet connections is to be welcomed. (0.852)|
| ||Sie teilt die Auffassung, dass mit dem Universaldienst nicht nur die geographische Abdeckung gewährleistet werden soll.||(1) At the European Council in Gothenburg at the end of this week, the Swedish Presidency will point out the need to discuss these issues within the European Union in order to develop a concrete basis allowing powerful action by the European Union on these vital issues. (0.862)|
| || ||(2) It could have expressed a lot more in the way of hopes for the future. (0.860)|
| || ||(3) So what does this railway package contain? (0.859)|
| || ||(6) The Commission shares the view that universal service is not just about getting geographical coverage right. (0.814)|
4.4 Mapping of Low-Resource Languages
In order to better evaluate the effectiveness and robustness of our model for diverse languages, we conduct additional experiments on low-resource languages. For comparison, we consider the state-of-the-art massively multilingual neural MT model LASER (https://github.com/facebookresearch/LASER) [ArtetxeSchwenk2018]. We evaluate the mapping of low-resource languages with English on the test sets provided with LASER. However, as only test sets but not training sets are provided, our model is trained on comparably sized training data obtained via random sampling from Tatoeba, OpenSubtitles2018, and Global Voices (available from http://opus.nlpl.eu/). For Irish, we incorporate some additional training data from the EUbookshop dataset. We adopt equivalent preprocessing steps such as filtering certain special characters and eliminating duplicate pairs.
The results in Table 3 confirm that simple word averages can be aligned with a broadly similar level of accuracy. This is obtained although our method does not have access to word order information and is not trained on the rich massively multilingual data used to train LASER, but only on the respective single language pairs.
4.5 Zero-Shot Multilingual Mapping
We also evaluate multilingual training, which entails mapping two source languages (Spanish and German) both to the same target language space (English) without any parallel data connecting the two source languages.
From both Tatoeba and Europarl, we take 44,280 Spanish–English sentence pairs and 44,280 German–English sentence pairs each. To train the baselines, we also collect the same number of German–Spanish pairs, for which the bilingual baselines make use of dedicated supervision, while our method does not receive any pairings at all for this language pair. Additionally, our proposed method only utilizes 20% of the parallel training data and an equal amount of non-parallel data for training, while all the baselines take 100% of the training data with parallel labels. During the training process, we alternate over mini-batches with Spanish–English pairings and German–English pairings. The number of test queries is 490 for all experiments.
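The alternation over mini-batches can be sketched as a simple interleaving of two batch streams. This is an illustrative assumption about the training schedule, not the original training loop; the helper names and language tags are hypothetical, and the integer lists stand in for real sentence-pair data.

```python
from itertools import zip_longest

def batches(pairs, batch_size):
    """Yield consecutive slices of `pairs` with at most `batch_size` items each."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

def alternating_batches(es_en_pairs, de_en_pairs, batch_size):
    """Yield (language_tag, batch) tuples, alternating between the two language pairs."""
    for es_batch, de_batch in zip_longest(
        batches(es_en_pairs, batch_size), batches(de_en_pairs, batch_size)
    ):
        if es_batch is not None:
            yield "es-en", es_batch
        if de_batch is not None:
            yield "de-en", de_batch

# Placeholder data: four pairs per language pair, two pairs per mini-batch.
schedule = [
    tag for tag, _ in alternating_batches(list(range(4)), list(range(4)), batch_size=2)
]
```

Each yielded batch would be fed to the shared generator together with its language tag, so that updates for the two source languages are interleaved rather than run one after the other.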
The results are reported in Table 5. Though seq-to-seq models can learn full-fledged neural translation models, they do not fare particularly well in resource-constrained scenarios with limited training data. In particular, even with a parallel sentence pair percentage of just 20%, our model outperforms many baselines that utilize the total amount of training data. Moreover, the retrieval accuracies between the two source languages German and Spanish obtained by our model are very competitive with baselines receiving supervision for that language pair. Note that in our method, we do not provide any direct pairwise mapping. This demonstrates the effectiveness of our zero-shot multilingual mapping.
Our study shows that despite their simplicity, word vector averages can serve as reasonably strong cross-lingually projectable sentence representations. To this end, we have presented the ABSent model to align such representations via an adversarial approach that requires only small amounts of parallel data. We obtain competitive results, although our method does not obtain any information about the word order in the input sentences. Our results in a series of retrieval experiments on both short and long sentences outperform previous work by a substantial margin.