Building on pretrained word embeddings (Mikolov et al., 2013b), Mikolov et al. (2013a) stated that continuous embedding spaces exhibit similar structures across different languages, and that this similarity can be exploited by a linear transformation from the source embedding space to the target embedding space. This similarity gives rise to the Bilingual Lexicon Induction (BLI) task. The goal of BLI is to align the embedding spaces of two languages and generate a word translation lexicon automatically. This fundamental problem in natural language processing benefits much other research, such as sentence translation (Rapp, 1995; Fung, 1995), unsupervised machine translation (Lample et al., 2017), and cross-lingual information retrieval (Lavrenko et al., 2002).
Recent work (Lample et al., 2018; Alvarez-Melis and Jaakkola, 2018; Grave et al., 2019; Artetxe et al., 2017) has shown that unsupervised BLI can perform on par with supervised methods. A crucial part of these approaches is the matching procedure, i.e., how to generate the translation plan. Alvarez-Melis and Jaakkola (2018) used the Gromov-Wasserstein distance to approximate the matching between languages. Grave et al. (2019) regarded it as a classic optimal transport problem and used the Sinkhorn algorithm (Cuturi, 2013) to compute the translation plan.
In this work, we follow the previous iterative framework but use a different matching procedure. Previous iterative algorithms need to compute an approximate one-to-one matching at every step. This one-to-one constraint produces many redundant matchings. To avoid this problem, we relax the constraint and control the degree of relaxation by adding two KL divergence regularization terms to the original loss function. This relaxation yields a more precise matching and significantly improves performance. We then propose a bidirectional optimization framework that optimizes the mapping from source to target and from target to source simultaneously. In the experiments section, we verify the effectiveness of our method; the results show that it outperforms many state-of-the-art (SOTA) methods on the BLI task.
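2 Background
Early work on the BLI task requires a parallel lexicon between languages. Suppose we are given two embedding matrices $X$ and $Y$ of shape $n \times d$ ($n$: number of words, $d$: vector dimension) for two languages, where word $x_i$ in $X$ is the translation of word $y_i$ in $Y$, i.e., we have a parallel lexicon $X \rightarrow Y$. Mikolov et al. (2013a) pointed out that we could exploit the similarities of monolingual embedding spaces by learning a linear transformation $Q^\star$ such that
$$Q^\star = \arg\min_{Q \in \mathbb{R}^{d \times d}} \|XQ - Y\|_F^2, \tag{1}$$
where $\mathbb{R}^{d \times d}$ is the space of $d \times d$ matrices of real numbers. Xing et al. (2015) stated that enforcing an orthogonality constraint on $Q$ would improve performance. Under this constraint the problem has a closed-form solution called Procrustes: $Q^\star = UV^\top$, where $USV^\top$ is the singular value decomposition of $X^\top Y$.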
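The closed-form Procrustes solution mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function name `procrustes` and the synthetic data are assumptions, and the rows of the two matrices are assumed to be already aligned by a parallel lexicon.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Q minimizing ||XQ - Y||_F for row-aligned X and Y."""
    # SVD of the d x d cross-covariance matrix X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check (illustrative data): Y is an exact rotation of X.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
Y = X @ Q_true
Q_hat = procrustes(X, Y)
```

In this noiseless setting `Q_hat` should coincide with `Q_true` up to floating-point error; with real embeddings the fit is only approximate.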
Under the unsupervised condition without a parallel lexicon, i.e., when the vectors in $X$ and $Y$ are totally out of order, Lample et al. (2018) proposed a domain-adversarial approach for learning $Q$. Based on the observation that monolingual embedding spaces of different languages keep similar spatial structures, Alvarez-Melis and Jaakkola (2018) applied the Gromov-Wasserstein distance, which relies on the internal structure of each space, to find the corresponding translation pairings between $X$ and $Y$ and further derived the orthogonal mapping $Q$. Grave et al. (2019) formulated the unsupervised BLI task as
$$\min_{Q \in \mathcal{O}_d,\; P \in \mathcal{P}_n} \|XQ - PY\|_F^2, \tag{2}$$
where $\mathcal{O}_d$ is the set of orthogonal matrices and $\mathcal{P}_n$ is the set of permutation matrices. Given $Q$, estimating $P$ in Problem (2) is equivalent to minimizing the squared 2-Wasserstein distance between the two sets of points $XQ$ and $Y$:
$$W_2^2(XQ, Y) = \min_{P \in \mathcal{P}_n} \sum_{i,j} P_{ij}\, \|x_i Q - y_j\|_2^2 = \min_{P \in \mathcal{P}_n} \langle D, P \rangle, \tag{3}$$
where $D_{ij} = \|x_i Q - y_j\|_2^2$ is the cost matrix.
Problem (3) is the standard optimal transport problem, which can be solved by the Earth Mover Distance linear program with $\mathcal{O}(n^3)$ time complexity. Considering the computational cost, Zhang et al. (2017) and Grave et al. (2019) used the Sinkhorn algorithm (Cuturi, 2013) to estimate $P$ by solving the entropy-regularized optimal transport problem (Peyré et al., 2019).
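The entropy-regularized problem solved by the Sinkhorn algorithm can be illustrated as follows. This is a hedged sketch: the function name `sinkhorn`, the regularization strength `eps`, the iteration count, and the random point sets are illustrative assumptions rather than settings from the cited papers.

```python
import numpy as np

def sinkhorn(D, p, q, eps=0.1, n_iter=1000):
    """Entropy-regularized transport plan for cost D with row marginals p
    and column marginals q."""
    K = np.exp(-D / eps)              # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)               # rescale rows towards p
        v = q / (K.T @ u)             # rescale columns towards q
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, d = 30, 4
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))
D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
D = D / D.max()                                     # rescale for numerical stability
p = np.full(n, 1.0 / n)
q = np.full(n, 1.0 / n)
P = sinkhorn(D, p, q)
```

The returned plan is dense and only approximately a (scaled) permutation; a smaller `eps` makes it sharper at the price of slower convergence and a higher risk of numerical underflow.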
We also take Problem (2) as our loss function, and our model shares a similar alternating framework with Grave et al. (2019). However, we argue that the permutation matrix constraint on $P$ is too strong and leads to many inaccurate and redundant matchings between $X$ and $Y$, so we relax it by unbalanced optimal transport.
Alaux et al. (2019) extended the line of BLI to the problem of aligning multiple languages to a common space. Zhou et al. (2019) estimated Q by a density matching method called normalizing flow. Artetxe et al. (2018) proposed a multi-step framework of linear transformations that generalizes a substantial body of previous work. Garneau et al. (2019) further investigated the robustness of Artetxe et al. (2018)’s model by introducing four new languages that are less similar to English than the ones proposed by the original paper. Artetxe et al. (2019) proposed an alternative approach to this problem that builds on the recent work on unsupervised machine translation.
3 Proposed Method
In this section, we propose a method for the BLI task. As mentioned in the background, we take Problem (2) as our loss function and use an optimization framework similar to that of Grave et al. (2019) to estimate $P$ and $Q$ alternately. Our method focuses on the estimation of $P$ and tries to find a more precise matching $P$ between $XQ$ and $Y$. $Q$ is estimated by stochastic gradient descent. We also propose a bidirectional optimization framework in Section 3.2.
3.1 Relaxed Matching Procedure
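We regard the embedding sets $X$ and $Y$ as two discrete distributions $\mu = \sum_{i=1}^{n} p_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{n} q_j \delta_{y_j}$, where $p$ (or $q$) is a column vector satisfying $\sum_i p_i = 1$, $p_i > 0$ ($q$ is similar), and $\delta_x$ is the Dirac function supported on point $x$.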
Standard optimal transport forces the optimal transport plan to be a joint distribution whose marginals are fixed. As a result, every unit of mass in $X$ must be matched to the same amount of mass in $Y$. A recent application of unbalanced optimal transport (Wang et al., 2019) shows that relaxing the marginal conditions can lead to more flexible and local matching, which avoids some counterintuitive matchings of source-target mass pairs with high transportation cost.
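The formulation of unbalanced optimal transport (Chizat et al., 2018a) differs from balanced optimal transport in two ways. First, the set of transport plans to be optimized is generalized to $\mathbb{R}_+^{n \times n}$. Second, the marginal conditions of Problem (3) are relaxed by two KL-divergence terms:
$$\min_{P \in \mathbb{R}_+^{n \times n}} \langle D, P \rangle + \lambda_1 \mathrm{KL}(P\mathbf{1}_n \,\|\, p) + \lambda_2 \mathrm{KL}(P^\top \mathbf{1}_n \,\|\, q), \tag{4}$$
where $D_{ij} = \|x_i Q - y_j\|_2^2$ is the cost of matching $x_i Q$ with $y_j$, $p$ and $q$ are the marginal weight vectors, $\lambda_1$ and $\lambda_2$ are the coefficients of the relaxation terms, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence.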
We estimate $P$ by considering the relaxed Problem (4) instead of the original Problem (3) of Grave et al. (2019). Problem (4) can also be solved by entropy regularization with the generalized Sinkhorn algorithm (Chizat et al., 2018b; Wang et al., 2019; Peyré et al., 2019).
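A minimal sketch of the generalized Sinkhorn iteration for a KL-relaxed problem of this kind is given below. It assumes the scaling updates described for unbalanced transport by Chizat et al. (2018b), in which each balanced update is raised to the power $\lambda / (\lambda + \varepsilon)$. The function name, the toy one-dimensional data, and all parameter values are illustrative assumptions and not the settings used in the experiments.

```python
import numpy as np

def unbalanced_sinkhorn(D, p, q, lam1, lam2, eps=0.05, n_iter=500):
    """Entropic solver for <D, P> + lam1*KL(P1 || p) + lam2*KL(P^T 1 || q)."""
    K = np.exp(-D / eps)
    u, v = np.ones_like(p), np.ones_like(q)
    a1 = lam1 / (lam1 + eps)          # exponent damping the row update
    a2 = lam2 / (lam2 + eps)          # exponent damping the column update
    for _ in range(n_iter):
        u = (p / (K @ v)) ** a1
        v = (q / (K.T @ u)) ** a2
    return u[:, None] * K * v[None, :]

# Toy data (assumption): the last target point is an outlier far from every source.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
D = (x[:, None] - y[None, :]) ** 2
D = D / D.max()
p = np.full(5, 0.2)
q = np.full(5, 0.2)

P_strict = unbalanced_sinkhorn(D, p, q, lam1=10.0, lam2=10.0)    # nearly balanced
P_relaxed = unbalanced_sinkhorn(D, p, q, lam1=0.01, lam2=0.01)   # strongly relaxed

# Share of the total transported mass that lands on the outlier target.
share_strict = P_strict[:, -1].sum() / P_strict.sum()
share_relaxed = P_relaxed[:, -1].sum() / P_relaxed.sum()
```

With large coefficients the outlier still receives close to its full share of mass, whereas with small coefficients almost no mass is sent to it, which is the behaviour the relaxation is meant to provide.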
In short, we already have an algorithm to obtain the minimum of Problem (4). To avoid the hubness phenomenon, we replace the $\ell_2$ distance between embeddings with the RCSLS distance proposed in Joulin et al. (2018), formalized as $D_{ij} = \mathrm{RCSLS}(x_i Q, y_j)$. In our evaluation, RCSLS does not provide significantly better results than the Euclidean distance. However, previous studies suggest that RCSLS can be considered a better metric between words than the Euclidean distance, so we present our approach with RCSLS. Most of the improvement comes from the "relaxed matching" procedure and the "bi-directional optimization" we propose.
We call this relaxed estimation of $P$ the Relaxed Matching Procedure (RMP). With RMP, two points may be matched only when they are less than some radius apart. We can thus avoid some counterintuitive matchings and obtain a more precise matching $P$. In the experiments section we verify the effectiveness of RMP.
3.2 Bidirectional Optimization
Previous research solved the mapping from $X$ to $Y$ and the mapping from $Y$ to $X$ as two independent problems, i.e., it tried to learn two orthogonal matrices $Q_1$ and $Q_2$ to match $XQ_1$ with $Y$ and $YQ_2$ with $X$, respectively. Intuitively, from the point-cloud matching perspective, we consider these two problems in opposite directions to be symmetric. We therefore propose an optimization framework that solves for only one $Q$ for both directions.
In our approach, we match $XQ$ with $Y$ and $YQ^\top$ with $X$ simultaneously. Based on the stochastic optimization framework of Grave et al. (2019), we randomly choose one direction to optimize at each iteration.
The entire process of our method is summarized in Algorithm 2. At iteration $i$, we start by sampling batches $X_b$, $Y_b$ of shape $b \times d$. Then we generate a random integer $r$ and choose to map $X_b Q$ to $Y_b$ or $Y_b Q^\top$ to $X_b$ according to the parity of $r$. Given the mapping direction, we run RMP to solve Problem (4) with Sinkhorn and obtain a matching matrix $P^\star$ between $X_b Q$ and $Y_b$ (or $Y_b Q^\top$ and $X_b$). Finally, we use gradient descent and Procrustes to update $Q$ given $P^\star$. The update of $Q$ is detailed in Grave et al. (2019).
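The iteration described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 2: a balanced Sinkhorn with uniform marginals and squared Euclidean cost stands in for the RMP step, the orthogonality projection is done by an SVD, and all names, the learning rate, and the synthetic data are hypothetical.

```python
import numpy as np

def sinkhorn_plan(D, eps=0.05, n_iter=100):
    """Balanced entropic plan with uniform marginals (stand-in for RMP)."""
    n, m = D.shape
    K = np.exp(-D / eps)
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]

def project_orthogonal(M):
    """Closest orthogonal matrix to M (Procrustes-style projection)."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def bidirectional_step(X, Y, Q, batch, lr, rng):
    """One stochastic update of the single shared mapping Q."""
    xb = X[rng.choice(len(X), batch, replace=False)]
    yb = Y[rng.choice(len(Y), batch, replace=False)]
    forward = rng.integers(0, 2**31) % 2 == 0     # direction chosen by parity
    src, tgt, M = (xb, yb, Q) if forward else (yb, xb, Q.T)
    mapped = src @ M
    D = ((mapped[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    P = sinkhorn_plan(D / D.max())
    # For orthogonal M the loss is const - 2*trace(M^T src^T P tgt),
    # so its gradient with respect to M is -2 * src^T P tgt.
    G = -2.0 * src.T @ (P @ tgt)
    if not forward:
        G = G.T                                    # gradient w.r.t. Q when M = Q^T
    return project_orthogonal(Q - lr * G)

# Synthetic data (assumption): Y is a rotated, shuffled copy of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
Q_true, _ = np.linalg.qr(rng.standard_normal((8, 8)))
Y = (X @ Q_true)[rng.permutation(200)]
Q = np.eye(8)
for _ in range(20):
    Q = bidirectional_step(X, Y, Q, batch=50, lr=0.5, rng=rng)
```

Because both directions update the same matrix, the reverse direction simply uses the transpose of `Q`. The sketch makes no claim about convergence from an identity initialization; it only shows the control flow and keeps `Q` orthogonal after every step.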
4 Experiments
In this section, we evaluate our method in two settings. First, we conduct ablation experiments to verify the effectiveness of RMP and bidirectional optimization. Then we compare our method, consisting of both RMP and bidirectional optimization, with various SOTA methods on the BLI task.
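Table 1 (precision at 1, CSLS criterion):
| Method | Supervision | EN-ES | ES-EN | EN-FR | FR-EN | EN-DE | DE-EN | EN-RU | RU-EN | EN-IT | IT-EN | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adv. - Refine | None | 81.7 | 83.3 | 82.3 | 82.1 | 74.0 | 72.2 | 44.0 | 59.1 | 77.9 | 77.5 | 73.4 |
| W.Proc. - Refine | None | 82.8 | 84.1 | 82.6 | 82.9 | 75.4 | 73.3 | 43.7 | 59.1 | - | - | 73.0 |
| Dema - Refine | None | 82.8 | 84.9 | 82.6 | 82.4 | 75.3 | 74.9 | 46.9 | 62.4 | - | - | 74.0 |
| Ours - Refine | None | 82.7 | 85.8 | 83.0 | 83.8 | 76.2 | 74.9 | 48.1 | 64.7 | 79.1 | 80.3 | 75.9 |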
Datasets (https://github.com/facebookresearch/MUSE). We conduct word translation experiments on 6 pairs of languages and use pretrained word embeddings from fastText. We use the bilingual dictionaries open-sourced by Lample et al. (2018) as our evaluation set. Following Lample et al. (2018), we use the CSLS retrieval method for evaluation in both settings. All reported translation accuracies are precision at 1 with the CSLS criterion. Our source code is available on GitHub (https://github.com/BestActionNow/bidirectional-RMP).
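For concreteness, precision at 1 with CSLS retrieval can be computed roughly as below. The sketch follows the CSLS score of Lample et al. (2018), namely twice the cosine similarity minus the mean similarity of each point to its $k$ nearest neighbours in the other space. The function name, the dictionary format (`gold` maps a source index to a set of acceptable target indices), and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def csls_precision_at_1(XQ, Y, gold, k=10):
    """P@1 of CSLS retrieval; `gold` maps source index -> set of valid target indices."""
    XQ = XQ / np.linalg.norm(XQ, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = XQ @ Y.T                                        # cosine similarities
    r_src = np.sort(S, axis=1)[:, -k:].mean(axis=1)     # source -> k nearest targets
    r_tgt = np.sort(S, axis=0)[-k:, :].mean(axis=0)     # target -> k nearest sources
    csls = 2.0 * S - r_src[:, None] - r_tgt[None, :]
    pred = csls.argmax(axis=1)
    hits = [int(pred[i]) in targets for i, targets in gold.items()]
    return float(np.mean(hits))

# Toy check (assumption): four orthogonal "words", one deliberately wrong gold entry.
E = np.eye(4)
gold = {0: {0}, 1: {1}, 2: {2}, 3: {0}}
score = csls_precision_at_1(E, E, gold, k=2)
```

Here three of the four queries retrieve an accepted translation, so `score` is 0.75. The value of `k` must not exceed the number of words on either side.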
4.1 Main Results
Through the experimental evaluation, we seek to demonstrate the effectiveness of our method compared to other SOTA methods. The word embeddings are normalized and centered before entering the model. We start with a batch size of 500 and 2000 iterations per epoch. After each epoch, we double the batch size and quarter the number of iterations. The first 2.5K words are used for initialization, and samples are drawn only from the first 20K words of the frequency-ranked vocabulary. The coefficients $\lambda_1$ and $\lambda_2$ of the relaxation terms in Problem (4) are both set to 0.001.
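The batch-size and iteration schedule described above can be written down as a small helper. The function name is hypothetical, and the use of integer division with a floor of one iteration is an assumption, since the text does not say how fractional iteration counts are handled.

```python
def batch_schedule(n_epochs, batch_size=500, n_iters=2000):
    """Per-epoch (batch size, iterations): double the batch, quarter the iterations."""
    schedule = []
    for _ in range(n_epochs):
        schedule.append((batch_size, n_iters))
        batch_size *= 2
        n_iters = max(1, n_iters // 4)   # assumption: round down, never below 1
    return schedule

plan = batch_schedule(3)
```

For three epochs this yields batches of 500, 1000 and 2000 words with 2000, 500 and 125 iterations respectively.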
Baselines. We take basic Procrustes and the RCSLS loss of Joulin et al. (2018) as two supervised baselines. Several unsupervised methods are also taken into account, including the Gromov-Wasserstein matching method of Alvarez-Melis and Jaakkola (2018), the adversarial training (Adv.-Refine) of Lample et al. (2018), the Wasserstein Procrustes method (W.Proc.-Refine) of Grave et al. (2019), and the density matching method (Dema-Refine) of Zhou et al. (2019).
Table 1 shows that our approach leads by an average of 2 percentage points, outperforms other unsupervised methods in most instances, and is on par with the supervised method on some language pairs. Surprisingly, our method achieves significant progress in some tough cases such as English-Russian and English-Italian, which contain a lot of noise. Our method keeps the mapping computed at every step precise, which has a noise-reduction effect.
However, there still exists a noticeable gap between our method and the supervised RCSLS method, which indicates that further research could bring the advantages of this metric to unsupervised methods.
We also compare our method with W.Proc. on two non-English pairs, FR-DE and FR-ES, to show how bidirectional relaxed matching improves performance; the results are presented in Table 2. Most recent studies did not report results on non-English pairs, which makes a fair comparison hard. Nevertheless, the results in Table 2 show that our method keeps an advantage over W.Proc. Note that the W.Proc. results here come from our implementation rather than from the original paper.
4.2 Ablation Study
The algorithms for BLI can be roughly divided into three parts: (1) initialization, (2) iterative optimization, and (3) a refinement procedure, such as that of Lample et al. (2017). W.Proc. (Grave et al., 2019) only covers the first two parts. Our approaches, i.e., relaxed matching and bi-directional optimization, fall into the second part. To ensure a fair comparison, W.Proc.-Refine is compared to Ours-Refine, which is discussed in the next section. To verify the effectiveness of RMP and bidirectional optimization directly, we apply them one by one to the method proposed in Grave et al. (2019). We use the same implementation and hyperparameters reported in their paper and code (https://github.com/facebookresearch/fastText/alignment), but use RMP to solve for $P$ instead of the ordinary 2-Wasserstein problem.
On four language pairs, we gradually applied RMP, bidirectional optimization, and the refinement procedure to the original W.Proc. and evaluated the change in performance. Figure 1 clearly shows that after applying bidirectional RMP, the translation accuracy improves by 3 percentage points on average. The results of 'WP-RMP' are worse than those of 'WP-RMP-bidirection' but better than those of the original 'WP'. Moreover, we find that with RMP, a more precise $P$ not only eliminates many unnecessary matchings but also leads to faster convergence of the optimization procedure. Furthermore, the refinement procedure is highly effective.
To summarize, we consider the average of the scores (from en-es to ru-en). By mitigating the counter-intuitive pairs caused by polysemy and obscure words, the "relaxed matching" procedure improves the average score by about 2 points, and the "bi-directional optimization" improves it by about 0.6 points. These results suggest that our ideas of relaxed matching and bidirectional optimization could also be applied to other frameworks, such as the adversarial training of Lample et al. (2017) and the Gromov-Wasserstein approach of Alvarez-Melis and Jaakkola (2018).
5 Conclusion
This paper focuses on the matching procedure of the BLI task. Our key insight is that relaxed matching mitigates the counter-intuitive pairs caused by polysemy and obscure words, which is supported by comparing W.Proc.-RMP with W.Proc. in Table 1. The optimal transport constraint considered by W.Proc. is not well suited to BLI tasks. Moreover, our approach optimizes the translation mapping Q in a bi-directional way and, with refinement, is shown in Table 1 to outperform all other unsupervised SOTA models.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (11871297, 91646202), the National Key R&D Program of China (2018YFB1404401, 2018YFB1402701), and the Tsinghua University Initiative Scientific Research Program.
References
- Alaux et al. (2019). Unsupervised hyperalignment for multilingual word embeddings. CoRR abs/1811.01124.
- Alvarez-Melis and Jaakkola (2018). Gromov-Wasserstein alignment of word embedding spaces. In EMNLP 2018, pp. 1881–1890.
- Artetxe et al. (2017). Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462.
- Artetxe et al. (2018). Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Artetxe et al. (2019). Bilingual lexicon induction through unsupervised machine translation. In ACL 2019, pp. 5002–5007.
- Chizat et al. (2018a). An interpolating distance between optimal transport and Fisher–Rao metrics. Foundations of Computational Mathematics 18 (1), pp. 1–44.
- Chizat et al. (2018b). Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation 87 (314), pp. 2563–2609.
- Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. In NIPS 2013, pp. 2292–2300.
- Fung (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In Third Workshop on Very Large Corpora.
- Garneau et al. (2019). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings: making the method robustly reproducible as well. CoRR abs/1912.01706.
- Grave et al. (2019). Unsupervised alignment of embeddings with Wasserstein Procrustes. In AISTATS 2019, pp. 1880–1890.
- Joulin et al. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In EMNLP 2018, pp. 2979–2984.
- Lample et al. (2017). Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
- Lample et al. (2018). Word translation without parallel data. In ICLR 2018.
- Lavrenko et al. (2002). Cross-lingual relevance models. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 175–182.
- Mikolov et al. (2013a). Exploiting similarities among languages for machine translation. CoRR abs/1309.4168.
- Mikolov et al. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Peyré et al. (2019). Computational optimal transport. Foundations and Trends in Machine Learning 11 (5-6), pp. 355–607.
- Rapp (1995). Identifying word translations in non-parallel texts. arXiv preprint cmp-lg/9505037.
- Wang et al. (2019). Wasserstein-Fisher-Rao document distance. CoRR abs/1904.10294.
- Xing et al. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL 2015, pp. 1006–1011.
- Zhang et al. (2017). Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1934–1945.
- Zhou et al. (2019). Density matching for bilingual word embedding. In NAACL 2019, pp. 1588–1598.