Motivated by the way a monolingual speaker can acquire translation ability by consulting a bilingual dictionary, we propose a novel MT task in which no parallel sentences are available, while a ground-truth bilingual dictionary and large-scale monolingual corpora can be used. This task departs from the unsupervised MT task, in which no parallel resources, including a ground-truth bilingual dictionary, may be used Artetxe2018Unsupervised2; Lample2018Phrase. It is also distinct from the supervised/semi-supervised MT task, which mainly depends on parallel sentences Bahdanau2015Neural; Gehring2017Convolutional; Vaswani2017Attention; chen2018best; sennrich-haddow-birch2016Improving.
The bilingual dictionary is often used as a seed in bilingual lexicon induction (BLI), which aims to induce more word pairs within the language pair Mikolov2013Exploiting. Another use of the bilingual dictionary is for translating low-frequency words in supervised NMT Arthur2016Incorporating; Zhang2016Bridging. We are the first to use the bilingual dictionary together with large-scale monolingual corpora to see how much potential an MT system can achieve without using parallel sentences. This differs from using artificial bilingual dictionaries generated by unsupervised BLI to initialize an unsupervised MT system Artetxe2018Unsupervised2; Artetxe2018Unsupervised; Lample2018Unsupervised: we use the ground-truth bilingual dictionary and apply it throughout the training process.
We propose Anchored Training (AT) to tackle this task. Since word representations are learned over monolingual corpora without any parallel sentence supervision, the representation distances between the source language and the target language are often quite large, leading to significant translation difficulty. As one solution, AT selects words covered by the bilingual dictionary as anchoring points that pull the source language space and the target language space closer together, so that translation between the two languages becomes easier. Furthermore, we propose Bi-view AT, which places anchors based on either the source language view or the target language view, and combines both views to enhance the translation quality.
Experiments on various language pairs show that AT performs significantly better than various baselines, including word-by-word translation through looking up the dictionary, unsupervised MT, and dictionary-supervised cross-lingual word embedding transformation to make distances between both languages closer. Bi-view AT further improves AT performance due to mutual strengthening of both views of the monolingual data. When combined with cross-lingual pretraining Lample2019Cross, Bi-view AT achieves performances comparable to traditional SMT systems trained on more than 4M parallel sentences. The main contributions of this paper are as follows:
A novel MT task is proposed that can only use the ground-truth bilingual dictionary and monolingual corpora, and is independent of parallel sentences.
AT is proposed as a solution to the task. AT uses the bilingual dictionary to place anchors that can encourage monolingual spaces of both languages to become closer so that translation becomes easier.
The detailed evaluation on various language pairs shows that AT, especially Bi-view AT, performs significantly better than various methods, including word-by-word translation, unsupervised MT, and cross-lingual embedding transformation. On distant language pairs, where unsupervised MT struggles to be effective, AT and Bi-view AT perform remarkably better.
2 Related Work
The bilingual dictionaries used in previous works are mainly for bilingual lexicon induction (BLI), which independently learns the embeddings in each language using monolingual corpora, and then learns a transformation from one embedding space to another by minimizing the squared Euclidean distances between all word pairs in the dictionary Mikolov2013Exploiting; Artetxe2016Learning. Later efforts for BLI include optimizing the transformation further through new training objectives, constraints, or normalizations Xing2015Normalized; Lazaridou2015Hubness; Zhang2016Ten; Artetxe2016Learning; Smith2017Offline; Faruqui2014Improving; Lu2015Deep. The bilingual dictionary is also used for supervised NMT, which requires large-scale parallel sentences Arthur2016Incorporating; Zhang2016Bridging. To our knowledge, we are the first to use the bilingual dictionary for MT without using any parallel sentences.
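The transformation objective mentioned at the start of this paragraph can be written out as follows. This is a sketch under assumed notation that does not appear in the original text: $x_i$ and $y_i$ denote the source and target embeddings of the $i$-th dictionary pair, $n$ is the number of pairs, and $W$ is the linear transformation.

```latex
W^{*} = \arg\min_{W} \sum_{i=1}^{n} \left\lVert W x_i - y_i \right\rVert_2^{2}
```

The later BLI variants cited above can be read as adding constraints or normalizations to this basic objective, for example restricting $W$ to be orthogonal.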
Our work is closely related to unsupervised NMT (UNMT) Artetxe2018Unsupervised2; Lample2018Phrase; yang-etal-2018-unsupervised; sun-etal-2019-unsupervised, which does not use parallel sentences either. The difference is that UNMT may use the artificial dictionary generated by unsupervised BLI for initialization Artetxe2018Unsupervised2; Lample2018Unsupervised, or abandon the artificial dictionary by using joint BPE so that multiple BPE units can be shared by both languages Lample2018Phrase. We instead use the ground-truth dictionary and apply it throughout a novel training process. UNMT works well on close language pairs such as English-French, but performs remarkably badly on distant language pairs, for which aligning the embeddings of the two languages is quite challenging. We use the ground-truth dictionary to alleviate this problem, and experiments on distant language pairs show the necessity of using the bilingual dictionary.
Other uses of the bilingual dictionary for tasks beyond MT include cross-lingual dependency parsing Xiao2014Distributed, unsupervised cross-lingual part-of-speech tagging and semi-supervised cross-lingual super sense tagging Gouws2015Simple, multilingual word embedding training Ammar2016Massively; Long2016Learning, and transfer learning for low-resource language modeling Adams2017Cross.
3 Our Approach
There are multiple freely available bilingual dictionaries, such as the Muse dictionary (https://github.com/facebookresearch/MUSE) conneau2017word, Wiktionary (https://en.wiktionary.org/wiki/Wiktionary:Main_Page), and PanLex (https://panlex.org/). We adopt the Muse dictionary, which contains 110 large-scale ground-truth bilingual dictionaries.
We propose to inject the bilingual dictionary into MT training by placing anchoring points in the large-scale monolingual corpora, which drives the semantic spaces of both languages closer together so that MT training without parallel sentences becomes easier. We present the proposed Anchored Training (AT) and Bi-view AT in the following.
3.1 Anchored Training (AT)
Since word embeddings are trained on monolingual corpora independently, the embedding spaces of the two languages are quite different, leading to significant translation difficulty. AT forces the words of a translation pair to share the same word embedding, which serves as an anchor. We place multiple anchors by selecting words covered by the bilingual dictionary. With stable anchors, the embedding spaces of both languages become progressively closer during the AT process.
As illustrated in Figure 1 (a), given the source sentence “” with words of and being covered by the bilingual dictionary, we replace the two words with their translation words according to the dictionary. This results in the source sentence “”, of which and serve as the anchors which are actually the target language words obtained by translating and according to the dictionary, respectively. Through the anchors, some words on the source side share the same word embeddings with the corresponding words on the target side. The AT process will strengthen the consistency of embedding spaces of both languages based on these anchors.
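The anchoring step described above amounts to a dictionary lookup over the source tokens. The following is a minimal sketch, not the authors' implementation: the function name `anchor_sentence`, whitespace tokenization, and the toy English-to-French dictionary are all assumptions made for illustration.

```python
from typing import Dict, List


def anchor_sentence(tokens: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Replace every dictionary-covered source word with its target-language translation.

    Words that the dictionary does not cover are kept unchanged.
    """
    return [dictionary.get(tok, tok) for tok in tokens]


# Hypothetical toy English-to-French dictionary, for illustration only.
toy_dict = {"cat": "chat", "milk": "lait"}
anchored = anchor_sentence("the cat drinks milk".split(), toy_dict)
print(" ".join(anchored))  # the chat drinks lait
```

Because the replaced positions now hold target-language words, they index the same embedding rows as those words on the target side, which is what lets them act as anchors.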
The training process illustrated in Figure 1 (a) consists of a mutual back-translation procedure. The anchored source sentence “” is translated into target sentence “” by using source-to-target decoding, then “” and “” constitute a sentence pair for training the target-to-source translation model. In contrast, the target sentence “” is translated into anchored source sentence “” by using target-to-source decoding, then both sentences constitute a sentence pair for training the source-to-target translation model. Note that when training the translation model, the input sentences are always pseudo sentences generated by decoding an MT model, while the output sentences are always true or anchored true sentences. Besides this mutual back-translation procedure, a denoising procedure used in unsupervised MT Lample2018Phrase is also adopted. Deletion and permutation noise is added to the source/target sentence, and the translation model is also trained to denoise the result into the original source/target sentence.
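The deletion and permutation noise used in the denoising procedure can be sketched as below. This is an illustrative sketch only: the function name `add_noise`, the drop probability `p_drop`, and the shuffle `window` are hypothetical parameters, and their default values are not taken from the paper.

```python
import random
from typing import List, Optional


def add_noise(tokens: List[str], p_drop: float = 0.1, window: float = 3,
              rng: Optional[random.Random] = None) -> List[str]:
    """Apply deletion noise, then a local permutation, to one sentence."""
    rng = rng or random.Random()
    # Deletion noise: drop each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    # Permutation noise: sort by (position + random offset in [0, window)),
    # so every token stays within a few positions of where it started.
    keys = [i + rng.uniform(0, window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda kt: kt[0])]


# Example call with a fixed seed so the result is reproducible.
print(add_noise("the chat drinks lait every morning".split(), rng=random.Random(0)))
```

The model is then trained to map the noisy sentence back to the original one, in addition to the back-translation pairs.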
During testing, a source sentence is transformed into an anchored sentence at first by looking up the bilingual dictionary. Then we use the source-to-target model trained in the AT process to decode the anchored sentence.
We use the Transformer architecture Vaswani2017Attention as our translation model, with four stacked layers in both the encoder and the decoder. In the encoder, the last three layers are shared by both languages, and the first layer is not shared. In the decoder, the first three layers are shared by both languages, and the last layer is not shared. This architecture is designed to capture both the common and the language-specific characteristics of the two languages in one model.
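The layer-sharing pattern can be made concrete by listing which parameter set each layer would use for a given language. This is only a schematic sketch; the function names and the parameter naming scheme (`enc0.src`, `dec3.tgt`, and so on) are hypothetical and do not come from the paper.

```python
from typing import List

NUM_LAYERS = 4  # four stacked layers in both encoder and decoder, as stated in the text


def encoder_layer_names(lang: str) -> List[str]:
    """First encoder layer is language-specific; the last three are shared."""
    return [f"enc0.{lang}"] + [f"enc{i}.shared" for i in range(1, NUM_LAYERS)]


def decoder_layer_names(lang: str) -> List[str]:
    """First three decoder layers are shared; the last one is language-specific."""
    shared = [f"dec{i}.shared" for i in range(NUM_LAYERS - 1)]
    return shared + [f"dec{NUM_LAYERS - 1}.{lang}"]


for lang in ("src", "tgt"):
    print(lang, encoder_layer_names(lang), decoder_layer_names(lang))
```

Under this layout the language-specific parameters sit at the two ends of the model, next to the input embeddings and the output layer, while the middle of the network is common to both languages.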
3.2 Bi-view AT
AT, as illustrated in Figure 1 (a), models the sentences of both languages in the target language view: source sentences with some words replaced by their target translations, together with full target language sentences. Bi-view AT enhances AT by adding another language view. Figure 1 (b) adds the source language view, shown in the right part, to accompany the target language view of Figure 1 (a). In particular, the target language sentence “” is in the form of “” after looking up the bilingual dictionary. Such partial target words replaced with the source words and the full source language sentence “” constitute the source language view.
Based on the target language view shown in the left part and the source language view shown in the right part, we further combine both views through the pseudo sentences denoted by primes in Figure 1 (b). As shown by “” in Figure 1 (b), “” is further transformed into “” by looking up the bilingual dictionary. Similarly, “” is further transformed into “” as shown by “”. Finally, the solid-line box represents training the source-to-target model on data from both views, and the dashed-line box represents training the target-to-source model on data from both views.
Bi-view AT starts from training both views in parallel. After both views converge, we generate pseudo sentences in both the solid line box and the dashed line box, and pair these pseudo sentences (as input) with genuine sentences (as output) to train the corresponding translation model. This generation and training process iterates until Bi-view AT converges. Through such rich views, the translation models of both directions are mutually strengthened.
3.3 Anchored Cross-lingual Pretraining (ACP)
Cross-lingual pretraining has demonstrated effectiveness on tasks such as cross-lingual classification and unsupervised MT Lample2019Cross. It is conducted over large monolingual corpora by masking random words and training the model to predict them as a cloze task. Instead, we propose ACP, which pretrains on data obtained by transforming the genuine monolingual corpora of both languages into their anchored versions. For example, words in the source language corpus that are covered by the bilingual dictionary are replaced with their respective translation words. Such words are anchoring points that drive the pretraining to close the gap between the source language space and the target language space more effectively than the original pretraining method of Lample and Conneau Lample2019Cross, as evidenced by the experiments in Section 4.5. The anchored source language corpus and the genuine target language corpus together constitute the target language view for ACP.
ACP can be conducted in either the source language view or the target language view. After ACP, each of them is used to initialize the encoder of the corresponding AT system.
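To illustrate how an anchored cloze example could be built, the sketch below first anchors a sentence and then masks random positions. It is an assumption-laden illustration: the `[MASK]` symbol, the masking probability `p_mask`, and the function name are hypothetical, and the actual XLM masking scheme has further details that are not reproduced here.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol


def anchored_cloze_example(
    tokens: List[str],
    dictionary: Dict[str, str],
    p_mask: float = 0.15,  # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[int, str]]:
    """Anchor a sentence, then mask random positions for a cloze objective.

    Returns the masked input and a map from masked position to the token to predict.
    """
    rng = rng or random.Random()
    anchored = [dictionary.get(t, t) for t in tokens]
    inputs: List[str] = []
    targets: Dict[int, str] = {}
    for i, tok in enumerate(anchored):
        if rng.random() < p_mask:
            inputs.append(MASK)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets


print(anchored_cloze_example("the cat drinks milk".split(),
                             {"cat": "chat", "milk": "lait"},
                             rng=random.Random(0)))
```

When a masked position held an anchor, the model has to predict a target-language word from source-language context, which is one way to see why anchoring can tie the two spaces together during pretraining.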
3.4 Training Procedure
For AT, the pseudo sentence generation step and the NMT training step are interleaved. Taking the target language view AT shown in Figure 1 (a) as an example, we extract anchored source sentences as one batch and decode them into pseudo target sentences; we then use the same batch to train the target-to-anchored-source NMT model. In the meantime, a batch of target sentences is decoded into pseudo anchored source sentences, and we use the same batch to train the anchored-source-to-target NMT model. This process repeats until AT converges.
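One iteration of the interleaved generation and training steps for target-view AT can be sketched as follows. The decoders and trainers here are stand-in callables, and all names (`at_iteration`, `decode_s2t`, `train_t2s`, and so on) are hypothetical; a real system would plug in NMT models in their place.

```python
from typing import Callable, List, Tuple

Sentence = List[str]
Pair = Tuple[Sentence, Sentence]  # (pseudo input, genuine output)


def at_iteration(
    anchored_src_batch: List[Sentence],
    tgt_batch: List[Sentence],
    decode_s2t: Callable[[Sentence], Sentence],
    decode_t2s: Callable[[Sentence], Sentence],
    train_s2t: Callable[[List[Pair]], None],
    train_t2s: Callable[[List[Pair]], None],
) -> None:
    """One interleaved generation/training step of target-view AT."""
    # Anchored source -> pseudo target; these pairs train target-to-anchored-source.
    pseudo_tgt = [decode_s2t(s) for s in anchored_src_batch]
    train_t2s(list(zip(pseudo_tgt, anchored_src_batch)))
    # Target -> pseudo anchored source; these pairs train anchored-source-to-target.
    pseudo_src = [decode_t2s(t) for t in tgt_batch]
    train_s2t(list(zip(pseudo_src, tgt_batch)))


# Stand-in models for illustration: "decoding" just reverses the tokens,
# and "training" only records the pairs it receives.
s2t_pairs: List[Pair] = []
t2s_pairs: List[Pair] = []
at_iteration(
    anchored_src_batch=[["the", "chat", "sleeps"]],
    tgt_batch=[["le", "chat", "dort"]],
    decode_s2t=lambda s: s[::-1],
    decode_t2s=lambda t: t[::-1],
    train_s2t=s2t_pairs.extend,
    train_t2s=t2s_pairs.extend,
)
```

In every recorded pair the input is a pseudo sentence and the output is a genuine or anchored genuine sentence, matching the note in Section 3.1. In practice this function would be called repeatedly on fresh batches until AT converges.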
For Bi-view AT, after each mono-view AT converges, we set a larger batch size for generating pseudo sentences, as shown in the solid/dashed line boxes in Figure 1 (b), and train the corresponding NMT model using the same batch.
For ACP, we follow the XLM procedure Lample2019Cross, and conduct pretraining on the anchored monolingual corpora concatenated with the genuine corpora of the other language.
We conduct experiments on English-French, English-Russian, and English-Chinese translation to check the potential of our MT system with only a bilingual dictionary and large-scale monolingual corpora. The English-French task deals with translation between closely related languages, while the English-Russian and English-Chinese tasks deal with translation between distant languages that do not share the same alphabet.
For the English-French translation task, we use the monolingual data released by XLM Lample2019Cross (https://github.com/facebookresearch/XLM/blob/master/get-data-nmt.sh). For the English-Russian translation task, we use the same monolingual data as Lample et al. Lample2018Unsupervised, who use all available sentences from the WMT monolingual News Crawl datasets from years 2007 to 2017. For the English-Chinese translation task, we extract Chinese sentences from half of the 4.4M parallel sentences from LDC, and extract English sentences from the complementary half. We use WMT newstest-2013/2014, WMT newstest-2015/2016, and NIST2006/NIST2002 as validation/test sets for English-French, English-Russian, and English-Chinese, respectively.
For cross-lingual pretraining, we extract raw sentences from Wikipedia dumps, which contain 80M, 60M, 13M, 5.5M monolingual sentences for English, French, Russian, and Chinese, respectively.
Muse ground-truth bilingual dictionaries are used for our dictionary-related experiments. If a word has multiple translations, we select the translation word that appears most frequently in the monolingual corpus. Table 1 summarizes the number of word pairs and their coverage on the monolingual corpora on the source side.
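The rule for resolving words with multiple translations can be sketched as a frequency lookup. This is an illustration under assumptions: we take "the monolingual corpus" to mean the target-side corpus, ties are broken by dictionary order, and the function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def select_translations(
    candidates: Dict[str, List[str]],
    target_corpus: Iterable[List[str]],
) -> Dict[str, str]:
    """For each source word, keep the candidate that is most frequent in the corpus."""
    freq = Counter(tok for sent in target_corpus for tok in sent)
    # max() returns the first candidate on ties, i.e. dictionary order.
    return {src: max(tgts, key=lambda w: freq[w])
            for src, tgts in candidates.items() if tgts}


# Toy data for illustration only.
toy_candidates = {"bank": ["rive", "banque"]}
toy_corpus = [["la", "banque", "est", "ouverte"], ["banque", "centrale"], ["la", "rive"]]
print(select_translations(toy_candidates, toy_corpus))  # {'bank': 'banque'}
```

The result is a one-to-one mapping, so each covered source word is always replaced by the same anchor.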
4.2 Experiment Settings
For AT/Bi-view AT without cross-lingual pretraining, we use a Transformer with 4 layers, 512 embedding/hidden units, and a feed-forward filter size of 2048, for a fair comparison to UNMT Lample2018Phrase. For AT/Bi-view AT with ACP, we use a Transformer with 6 layers, 1024 embedding/hidden units, and a feed-forward filter size of 4096, for a fair comparison to XLM Lample2019Cross.
We conduct joint byte-pair encoding (BPE) Sennrich2015Neural on the monolingual corpora of both languages, with a shared vocabulary of 60k tokens for both the English-French and English-Russian tasks, and 40k tokens for the English-Chinese task.
During training, we set the batch size to 32 and limit the sentence length to 100 BPE tokens. We employ the Adam optimizer with , and . At decoding time, we generate greedily with length penalty .
Word-by-word translation by looking up the ground truth dictionary or the artificial dictionary generated by Conneau et al. conneau2017word.
Unsupervised NMT (UNMT) that does not rely on any parallel resources Lample2018Phrase (https://github.com/facebookresearch/UnsupervisedMT). In addition, UNMT based on cross-lingual pretraining (XLM) Lample2019Cross (https://github.com/facebookresearch/XLM) is set as a stronger baseline (XLM+UNMT).
We implement a UNMT system initialized by Unsupervised Word Embedding Transformation (UNMT+UWET) as a baseline artetxe2018iclr. The transformation function is learned in an unsupervised way without using any ground-truth bilingual dictionaries conneau2017word (https://github.com/facebookresearch/MUSE).
We also implement a UNMT system initialized by Supervised Word Embedding Transformation (UNMT+SWET) as a baseline. Instead of UWET used in Artetxe et al. artetxe2018iclr, we use the ground-truth bilingual dictionary as the supervision signal to train the transformation function for transforming the source word embeddings into the target language space conneau2017word. After such initialization, the gap between the embedding spaces of both languages is narrowed for easy UNMT training.
|system||fr→en||en→fr||ru→en||en→ru||zh→en||en→zh|
|Without Cross-lingual Pre-training|
|Word-by-word using artificial dictionary||7.76||4.88||3.05||1.60||1.99||1.14|
|Word-by-word using ground-truth dictionary||7.97||6.61||4.17||2.81||2.68||1.79|
|With Cross-lingual Pre-training|
4.4 Experimental Results: without Cross-lingual Pretraining
The upper part of Table 2 presents the results of various baselines and our AT approaches. AT and Bi-view AT significantly outperform the baselines, and Bi-view AT is consistently better than AT. Detailed comparisons are listed below:
Results of Word-by-word Translation
The results show that using the ground-truth dictionary is slightly better than using the artificial one generated by Conneau et al. conneau2017word. Both perform remarkably badly, indicating that simple word-by-word translation does not qualify as an MT method. A more effective use of the bilingual dictionary is needed to improve translation performance.
Comparison between UNMT and UNMT with WET Initialization
UNMT-related systems generally improve on the performance of word-by-word translation. On the closely related language pair of English-French, UNMT is better than UNMT+UWET/SWET. This is partly because numerous BPE units are shared by English and French, which makes it easy to establish a shared word embedding space for the two languages. In this setting, WET, which transforms the source word embeddings into the target language space, does not seem to be a necessary initialization step, since the shared BPE units already establish the shared space.
On distant language pairs, UNMT does not have an advantage over UNMT with WET initialization. On English-Chinese in particular, UNMT performs extremely badly, even worse than the word-by-word translation method. We argue that this is because the BPE units shared by the two languages are so few that UNMT fails to align the language spaces. In contrast, using the bilingual dictionary greatly alleviates this problem for distant language pairs. UNMT+SWET, which transforms the source word embeddings into the target word embedding space under the supervision of the bilingual dictionary, outperforms UNMT by more than 18 BLEU points on Chinese-to-English and more than 7 BLEU points on English-to-Chinese. This indicates the necessity of the bilingual dictionary for translation between distant language pairs.
Comparison between AT/Bi-view AT and The Baselines
Our proposed AT approaches significantly outperform the baselines. The baselines that use the ground-truth bilingual dictionary, i.e., word-by-word translation using the dictionary and UNMT+SWET, which uses the dictionary to supervise the word embedding transformation, are inferior to our AT approaches.
The AT approaches consistently improve performance on both the closely related language pair of English-French and the distant language pairs of English-Russian and English-Chinese. Our Bi-view AT achieves the best performance on all language pairs.
4.5 Experimental Results: with Cross-lingual Pretraining
The bottom part of Table 2 reports performances of UNMT with XLM, which conducts the cross-lingual pretraining on concatenated non-parallel corpora Lample2019Cross, and performances of our AT/Bi-view AT with the anchored cross-lingual pretraining, i.e., ACP. The results show that our proposed AT approaches are still superior when equipped with the cross-lingual pretraining.
UNMT obtains great improvement when combined with XLM, achieving state-of-the-art unsupervised MT performance better than Unsupervised SMT artetxe2019acl-effective and Unsupervised NMT Lample2018Phrase across close and distant language pairs.
ACP+AT/Bi-view AT consistently outperforms XLM+UNMT. On distant language pairs in particular, ACP+Bi-view AT gains 2.7-9.4 BLEU points over the strong XLM+UNMT baseline. This indicates that AT/Bi-view AT with ACP builds closer language spaces via anchored pretraining and anchored training. We examine this advantage in the analyses of Section 4.6.
Comparison with Supervised SMT
To check the ability of our system using only the dictionary and non-parallel corpora, we compare it with supervised SMT trained on over 4M parallel sentences, which are from WMT19 for English-Russian and from LDC for English-Chinese. We use Moses (http://www.statmt.org/moses/) with its default settings as the supervised SMT system, with a 5-gram language model trained on the target language part of the parallel corpora.
The bottom part of Table 2 shows that ACP+Bi-view AT performs comparably to supervised SMT, and performs even better on English-to-Russian and English-to-Chinese.
We analyze the cross-lingual property of our approaches at both the word level and the sentence level. We also compare the performance obtained with the ground-truth dictionary and with the artificial dictionary. Finally, we vary the size of the bilingual dictionary and report its impact on AT.
Effect on Bilingual Word Embeddings
As shown in Figure 2, we depict the word embeddings of some sampled words in English-Chinese after our Bi-view AT. The dimensions of the embedding vectors are reduced to two by using t-SNE and are visualized with the visualization tool in TensorFlow (https://projector.tensorflow.org/).
We sample the English words that are not covered by the dictionary at first, then search their nearest Chinese neighbors in the embedding space. It shows that the words which constitute a new ground-truth translation pair do appear as neighboring points in the 2-dimensional visualization of Figure 2.
Precision of New Word Pairs
We continue studying the bilingual word embeddings through a quantitative analysis of the new word pairs, which are detected by searching for bilingual words that are neighbors in the word embedding space, and we evaluate them using the ground-truth bilingual dictionary. In particular, we split the Muse dictionary of Chinese-to-English into a standard training set and test set as in BLI Artetxe2018Robust. The training set is used for the dictionary-based systems, including our AT/Bi-view AT, UNMT+SWET, and Muse, which is a BLI toolkit. The test set is used to evaluate these systems by computing the precision of the discovered translation words given the source words in the test set. The neighborhood is computed by CSLS distance conneau2017word.
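The evaluation just described can be sketched in plain Python. The CSLS score below follows the usual definition from conneau2017word, CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean cosine between x and its k nearest target words and r_S(y) is the mean cosine between y and its k nearest source words. The function names, the toy vectors, and the value of k are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Set

Vec = Sequence[float]


def cosine(u: Vec, v: Vec) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def csls_scores(src: List[Vec], tgt: List[Vec], k: int = 10) -> List[List[float]]:
    """CSLS score for every (source word, target word) pair."""
    cos = [[cosine(x, y) for y in tgt] for x in src]
    # r_src[i]: mean cosine of source word i to its k nearest target words.
    r_src = [sum(sorted(row, reverse=True)[:k]) / min(k, len(row)) for row in cos]
    # r_tgt[j]: mean cosine of target word j to its k nearest source words.
    r_tgt = [sum(sorted(col, reverse=True)[:k]) / min(k, len(col)) for col in zip(*cos)]
    return [[2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]


def precision_at_k(scores: List[List[float]], gold: Dict[int, Set[int]], k: int) -> float:
    """Fraction of test source words with a gold translation among the top-k candidates."""
    if not gold:
        return 0.0
    hits = 0
    for i, answers in gold.items():
        top = sorted(range(len(scores[i])), key=lambda j: scores[i][j], reverse=True)[:k]
        hits += bool(answers & set(top))
    return hits / len(gold)


# Toy 2-d embeddings: source word 0 should match target 0, source word 1 target 1.
toy_src = [[1.0, 0.0], [0.0, 1.0]]
toy_tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
toy_scores = csls_scores(toy_src, toy_tgt, k=2)
print(precision_at_k(toy_scores, {0: {0}, 1: {1}}, k=1))
```

Subtracting the two neighborhood terms penalizes "hub" words that are close to everything, which is the usual motivation for preferring CSLS over plain cosine similarity.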
Table 3 shows the precision, where precision@k indicates the accuracy of the top-k predicted candidates. Muse induces new word pairs in either a supervised or an unsupervised way. MuseSupervised is better than MuseUnsupervised since it is supervised by the ground-truth bilingual dictionary. Our AT/Bi-view AT surpasses MuseSupervised by a large margin. UNMT+SWET/UWET also obtains good performance through the word embedding transformation. Bi-view AT significantly surpasses UNMT+SWET/UWET in precision@5 and precision@10, but is worse than them in precision@1. This indicates that Bi-view AT produces better lists of top-ranked candidate translation words, which helps NMT beam decoding find better translations.
Through the word level analysis, we can see that AT/Bi-view AT leads to more consistent word embedding space shared by both languages, making the translation between both languages easier.
Sentence Level Similarity of Parallel Sentences
We check the sentence level representational invariance across languages for the cross-lingual pretraining methods. In detail, following Arivazhagan et al. arivazhagan2019missing, we adopt a max-pooling operation to collect the sentence representation of each encoder layer for all Chinese-to-English sentence pairs in the test set. Then we calculate the cosine similarity for each sentence pair and average all cosine scores.
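The similarity computation for a single encoder layer can be sketched as below. The input format (one list of per-token hidden vectors per sentence) and the function names are assumptions for illustration; extracting the hidden states from an actual encoder is outside the scope of this sketch.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # tokens x hidden units


def max_pool(states: Matrix) -> List[float]:
    """Element-wise max over the token dimension, giving one vector per sentence."""
    return [max(dim) for dim in zip(*states)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pair_similarity(src_states: List[Matrix], tgt_states: List[Matrix]) -> float:
    """Average cosine similarity between pooled representations of parallel sentences."""
    sims = [cosine(max_pool(s), max_pool(t)) for s, t in zip(src_states, tgt_states)]
    return sum(sims) / len(sims)


# Toy hidden states for one sentence pair at one layer (2 hidden units).
toy_src_states = [[[1.0, 0.0], [0.0, 2.0]]]  # pools to [1.0, 2.0]
toy_tgt_states = [[[2.0, 4.0]]]              # pools to [2.0, 4.0]
print(mean_pair_similarity(toy_src_states, toy_tgt_states))
```

Running this once per encoder layer gives the per-layer curve that the analysis compares across systems.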
Figure 3 shows the sentence level cosine similarity. ACP+Bi-view AT consistently has a higher similarity for parallel sentences than XLM+UNMT on all encoder layers. When comparing Bi-view AT and AT, Bi-view AT is better on more encoder layers.
We can see that in both the word level and the sentence level analyses, our AT methods achieve better cross-lingual invariance and significantly reduce the gap between the source language space and the target language space, which decreases the translation difficulty between the two languages.
Ground-Truth Dictionary Vs Artificial Dictionary
|Ground-Truth Dict.||Artificial Dict.|
|ACP+AT with 1/4 of the dictionary||22.84|
|ACP+AT with 1/2 of the dictionary||24.32|
|ACP+AT with the full dictionary||26.80|
Table 4 presents the comparison in English-Chinese. The ground-truth dictionary is from the Muse dictionary repository, and the artificial dictionary is generated by unsupervised BLI conneau2017word. We extract the top-ranked word pairs as the artificial dictionary, keeping as many pairs as there are entries in the ground-truth dictionary.
Both dictionaries are used with the AT methods for translation. As shown in Table 4, the ground-truth dictionary performs significantly better than the artificial dictionary for both methods and in both translation directions.
The Effect of The Dictionary Size
We randomly select a portion of the ground-truth bilingual dictionary to study the effect of the dictionary size on the performance. Table 5 reports the performance of ACP+AT using a quarter or a half of the Chinese-English dictionary.
It shows that, in comparison to the XLM+UNMT baseline, which does not use a dictionary, a quarter of the dictionary, consisting of around 3k word pairs, is enough to improve the performance significantly. More word pairs in the dictionary lead to better translation results, suggesting that expanding the current Muse dictionary by collecting various dictionaries built by human experts may improve the translation performance further.
5 Discussion and Future Work
In the literature on unsupervised MT, which only uses non-parallel corpora, Unsupervised SMT (USMT) and Unsupervised NMT (UNMT) are complementary to each other. Combining them (USMT+UNMT) achieves significant improvement over the individual systems, and performs comparably to XLM+UNMT Lample2018Phrase; artetxe2019acl-effective.
We have set XLM+UNMT as a stronger baseline, and our ACP+AT/Bi-view AT surpasses it significantly. Following the unsupervised MT literature, ACP+AT/Bi-view AT could likewise be combined with SMT. We leave this as future work.
In this paper, we explore how much potential an MT system can achieve when only using a bilingual dictionary and large-scale monolingual corpora. This task simulates people acquiring translation ability via looking up the dictionary and depending on no parallel sentence examples. We propose to tackle the task by injecting the bilingual dictionary into MT via anchored training that drives both language spaces closer so that the translation becomes easier. Experiments show that, on both close language pairs and distant language pairs, our proposed approach effectively reduces the gap between the source language space and the target language space, leading to significant improvement of translation quality over the MT approaches that do not use the dictionary and the approaches that use the dictionary to supervise the cross-lingual word embedding transformation.
The authors would like to thank the anonymous reviewers for the helpful comments. This work was supported by National Natural Science Foundation of China (Grant No. 61525205, 61673289), National Key R&D Program of China (Grant No. 2016YFE0132100), and was also partially supported by the joint research project of Alibaba and Soochow University.