Evaluation metrics are essential for judging progress in natural language generation (NLG) tasks such as machine translation (MT) and summarization, as they identify the state of the art in a key NLP technology. Despite their wide dissemination, it has recently become more and more evident that classical lexical overlap evaluation metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are unsuitable as metrics, especially when judging the quality of modern NLG systems (Mathur et al., 2020; Marie et al., 2021), necessitating novel (BERT-based) evaluation metrics that correlate better with human judgments. This has been a very active research area in the last 2-3 years, cf. (Zhang et al., 2020; Zhao et al., 2020, 2019; Colombo et al., 2021; Yuan et al., 2021; Zhao et al., 2022). (Of course, the search for high-quality metrics dates back at least to the invention of BLEU and its predecessors.)
A deficit of most evaluation metrics (classical or more recent ones) is their need for some form of supervision, requiring human involvement: (i) TYPE-1: most metrics use supervision in the form of human references which they compare to system outputs (reference-based metrics) (Yuan et al., 2021; Zhao et al., 2019; Zhang et al., 2020); (ii) TYPE-2: some metrics are trained on human assessments such as Direct-Assessment (DA) or Post-Editing (PE) scores (Sellam et al., 2020; Rei et al., 2020); (iii) TYPE-3: there are also so-called reference-free metrics which do not necessarily use supervision in the form of (i) or (ii). However, to work well, they still use parallel data (Zhao et al., 2020; Song et al., 2021), which is considered a form of supervision, e.g., in the MT community (Artetxe et al., 2018; Lample et al., 2018), or are fine-tuned as in (ii) (Ranasinghe et al., 2021).
In this work, we aim for fully unsupervised evaluation metrics (for MT) that do not use any form of supervision. In addition, subject to the constraint that no supervision is allowed, our metrics should be of maximally high quality, i.e., correlation with human assessments. We have two use cases in mind: (a) Such sample efficiency (we use the term in a generalized sense to denote the amount of supervision required) is a prerequisite for the wide applicability of the metrics. This is especially important when we want to overcome the current English-centricity (Anastasopoulos and Neubig, 2020) of MT systems and evaluation metrics and also cover much lower-resource language pairs (Fomicheva et al., 2021). We point out that the languages involved for which our approach is relevant need not necessarily be low-resource individually; the particular pairing can also be low-resource, e.g., Latvian-Chinese, for which it may be difficult to obtain human supervision signals. (b) Our fully unsupervised evaluation metrics should be considered strong lower bounds for any future work that uses (mild) forms of supervision for metric induction, i.e., we want to push the lower bounds for newly developed TYPE-k metrics.
To achieve our goals, we employ self-learning (He et al., 2020; Wei et al., 2021) and in particular, we leverage the following dualities to make our metrics maximally effective, cf. Figure 1: (a) Evaluation metrics and NLG systems are closely related; e.g., a metric can be an optimization criterion for an NLG system (Böhm et al., 2019), and a system can conversely generate pseudo references (a.o.) from which to improve a metric; (b) evaluation metrics and parallel corpus mining (Artetxe and Schwenk, 2019) are closely related; e.g., a metric can be used to mine parallel data, which in turn can be used to improve the metric (Zhao et al., 2020), e.g., by remapping deficient embedding spaces. Our contributions are the following:
- We show that effective unsupervised evaluation metrics can be obtained by exploiting relationships with parallel corpus mining approaches and MT system induction;
- to do so, we explore ways to (a) make parallel corpus mining with Word Mover Distance-based metrics efficient (e.g., overcome cubic runtime complexity) and (b) induce unsupervised multilingual sentence embeddings from pseudo-parallel data;
- we show that pseudo-parallel data can rectify deficient vector spaces such as mBERT;
- we show that our metrics beat three current state-of-the-art supervised metrics on four out of five datasets that we evaluate on.
We take inspiration from three recent supervised (reference-free; TYPE-3) metrics to induce our own unsupervised metric UScore. The three metrics are: XMoverScore (Zhao et al., 2020), DistilScore (Reimers and Gurevych, 2020), and SentSim (Song et al., 2021). Below, we briefly review key aspects of each of the three metrics (§2.1, §2.2, §2.3), show where supervision plays a role, and describe how we plan to eliminate it in our new contributions (§2.4, §2.5, §2.6).
Central to XMoverScore is the use of Word Mover’s Distance (WMD) as a measure of similarity between an MT hypothesis and a source text (Zhao et al., 2020). WMD and further enhancements are discussed below.
Word Mover’s Distance
Word Mover’s Distance is a distance function that compares sentences at the token level (Kusner et al., 2015) by leveraging word embeddings. XMoverScore obtains these word embeddings by extracting the hidden states of the last layer of mBERT (Devlin et al., 2019). From a source sentence $x$ and an MT hypothesis $y$, WMD constructs a (word travel cost) distance matrix $C \in \mathbb{R}^{|x| \times |y|}$, where $C_{ij} = \lVert E(x_i) - E(y_j) \rVert_2$ is the distance between two word embeddings; $i, j$ index respective words in $x$, $y$. WMD uses these word travel costs to compute a distance between the two sentences. This cost can be defined as the linear programming problem
$$\mathrm{WMD}(x, y) = \min_{F \ge 0} \sum_{i,j} C_{ij} F_{ij} \quad \text{s.t.} \quad F\mathbf{1} = \tfrac{1}{|x|}\mathbf{1}, \;\; F^{\top}\mathbf{1} = \tfrac{1}{|y|}\mathbf{1},$$
where $F$ is an alignment matrix with $F_{ij}$ denoting how much of word $x_i$ travels to word $y_j$. The two constraints prevent the degenerate solution where $F$ is a zero matrix.
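As a minimal sketch of the linear program described above, the following solves WMD for two small sets of token embeddings with SciPy's general-purpose LP solver. It assumes uniform word weights (each source word carries mass 1/|x|, each hypothesis word 1/|y|) and Euclidean travel costs; the function name `wmd` and the toy inputs are illustrative and not taken from XMoverScore.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(X, Y):
    """Word Mover's Distance between token embeddings X (n, d) and Y (m, d).

    Assumes uniform word weights and Euclidean word travel costs.
    Returns the optimal cost and the (n, m) alignment matrix F.
    """
    n, m = len(X), len(Y)
    # Word travel cost matrix C[i, j] = ||X[i] - Y[j]||_2.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

    # F is flattened row-major, so F[i, j] is variable i * m + j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i of F sums to 1/n
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j of F sums to 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)
```

A general LP solver is used here only for clarity; it is far slower than dedicated optimal-transport solvers and is not meant for corpus-scale use.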
Monolingual Subspace Realignment
Zhao et al. (2020), akin to similar earlier and subsequent work (Cao et al., 2020; Schuster et al., 2019), argue that the monolingual subspaces of mBERT are not well aligned, i.e., the embeddings for similar words in different languages could be far apart. As a remedy, they investigate linear projection methods which alter the vector representations of the source language words using parallel data (i.e., a supervision signal), with the goal of improving cross-lingual alignments post hoc. We refer to this approach as vector space remapping. XMoverScore explores two different remapping approaches, CLP and UMD. Both leverage sentence-level parallel data, from which they extract word-level alignments using fast-align; these alignments are then used for the vector space remapping. We give more details in the appendix (§A.1).
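XMoverScore linearly combines WMD with the perplexity of a GPT-2 language model (Radford et al., 2019), which is meant to penalize ungrammatical translations. This updates the scoring function of XMoverScore to
$$\mathrm{XMoverScore}(x, y) = w_{\mathrm{xlng}}\,\mathrm{WMD}(x, y) + w_{\mathrm{lm}}\,\mathrm{LM}(y),$$
where $w_{\mathrm{xlng}}$ and $w_{\mathrm{lm}}$ are weights for the cross-lingual WMD and LM components of XMoverScore.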
Reimers and Gurevych (2020) show that the cosine between multilingual sentence embeddings captures semantic similarity and can be used to assess multi- and cross-lingual semantic textual similarity. Their approach to inducing embedding models is based on multilingual knowledge distillation. We refer to this metric as DistilScore. Their approach requires supervision at multiple levels. First, parallel sentences are needed to induce multilingual models, and second, NLI and STS corpora are required to induce teacher embeddings in the source language.
A key difference between XMoverScore and DistilScore is that one approach is based on word embeddings and the other one on sentence embeddings. Song et al. (2021) and Kaster et al. (2021) show that combining approaches based on word-level and sentence-level representations can substantially improve metrics. The metric of Song et al. (2021), which is called SentSim, combines supervised DistilScore and a word embedding-based metric. Overall, the authors explore two word embedding-based metrics. The first one is quite similar to XMoverScore, as it is also based on WMD. The other one is a multilingual variant of BERTScore (Zhang et al., 2020).
To overcome the need for supervision in XMoverScore, we aim to replace the parallel sentences used during remapping with pseudo-parallel data (Tran et al., 2020). Since fast-align’s parameter optimization depends directly on how well sentences are aligned (Dyer et al., 2013), we replace it in XMoverScore with an unsupervised variant of awesome-align (Dou and Neubig, 2021) which only relies on pre-trained language models. This allows for completely unsupervised remapping, and we call the resulting metric UScorewmd. In the following, we explain our approach to obtaining pseudo-parallel data.
Efficient WMD-based Corpus Mining
Metrics such as XMoverScore could in principle be used for pseudo-parallel corpus mining since they allow arbitrary sentences to be compared. However, when WMD-based metrics are scaled to corpus mining, algorithmic efficiency problems arise: (a) the computational complexity of WMD scales cubically with the number of words in a sentence (Kusner et al., 2015); (b) to compare $n$ source sentences to $m$ target sentences, $n \cdot m$ WMD invocations are necessary, which quickly becomes intractable. Thus, we explore ways to improve the performance of WMD for efficient pseudo-parallel corpus mining. Kusner et al. (2015) define an approximation of WMD called word centroid distance (WCD; linear complexity) and use it to define a prefetch and pruning algorithm for fast similarity search; they include a second approximation called relaxed word moving distance (RWMD; quadratic complexity), which we omit, however. The algorithm first sorts all target samples according to their WCD to a given query and then computes exact WMD only for the $k$ nearest neighbors.
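The prefetch step can be sketched as follows, under the simplification stated above (WCD prefetch followed by exact reranking, without the RWMD pruning step). The function `mine_pairs`, its parameter `k`, and the `exact_wmd` callable are hypothetical names introduced for illustration; any exact WMD implementation that returns a float can be passed in.

```python
import numpy as np


def mine_pairs(sources, targets, exact_wmd, k=5):
    """Pair each source sentence with its best target sentence.

    sources, targets: lists of (num_tokens, d) token-embedding arrays.
    exact_wmd: callable (X, Y) -> float, the expensive exact distance.
    k: number of WCD-prefetched candidates to rerank (assumed value).
    Returns a list of (source_index, target_index, distance).
    """
    # Word centroid distance: distance between mean token embeddings.
    src_centroids = np.stack([X.mean(axis=0) for X in sources])
    tgt_centroids = np.stack([Y.mean(axis=0) for Y in targets])
    wcd = np.linalg.norm(
        src_centroids[:, None, :] - tgt_centroids[None, :, :], axis=-1
    )

    pairs = []
    for i, X in enumerate(sources):
        # Prefetch: keep only the k targets with the smallest WCD.
        candidates = np.argsort(wcd[i])[:k]
        # Rerank the candidates with the exact (cubic-time) metric.
        scored = [(exact_wmd(X, targets[j]), int(j)) for j in candidates]
        best_dist, best_j = min(scored)
        pairs.append((i, best_j, best_dist))
    return pairs
```

With this scheme, the number of exact WMD calls drops from one per source-target combination to `k` per source sentence, at the cost of missing matches whose centroids are not among the `k` closest.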
We use this approach iteratively (cf. Figure 1): we start out with an initial WMD metric (based on mBERT), obtain pseudo-parallel data with it (via the approach described), use UMD and CLP to remap mBERT (using word alignments from unsupervised awesome-align) from which we obtain a better WMD metric; then we iterate.
Apart from remapping, pseudo-parallel corpora could potentially overcome subspace realignment problems in other ways. Specifically, we want to mine enough pseudo-parallel data to train an unsupervised MT system to translate source sentences into the target language to create so-called pseudo references (Albrecht and Hwa, 2007; Gao et al., 2020; Fomicheva et al., 2020). This would allow for a comparison with the hypothesis in the target language only, similar to reference-based metrics, eliminating the problem of mismatches in multilingual embeddings. This approach updates UScorewmd to
$$\mathrm{UScore}_{\mathrm{wmd}}(x, y) = w_{\mathrm{xlng}}\,\mathrm{WMD}^{(t)}(x, y) + w_{\mathrm{lm}}\,\mathrm{LM}(y) + w_{\mathrm{pseudo}}\,\mathrm{WMD}(y', y),$$
where $t$ denotes the iterations of remapping, $y'$ is the pseudo reference, and $w_{\mathrm{pseudo}}$ is a new weight to control the influence of the pseudo reference on the total score. Tran et al. (2020) show that fine-tuning the mBART transformer using pseudo-parallel data leads to very promising results, so we use it for our experiments as well. All components of UScorewmd are illustrated in Figure 2.
Besides our word-based metric, we induce an unsupervised metric based on the cosine similarity between sentence embeddings, which we refer to as UScorecos. One could, similarly to UScorewmd, use pseudo-parallel data to perform knowledge distillation, e.g., in DistilScore, to induce unsupervised multilingual sentence embeddings. As our initial experiments in this direction were unsuccessful, we chose another approach to induce unsupervised sentence embeddings.
We explore contrastive learning for unsupervised multilingual sentence embedding induction, which has recently been successfully used to train unsupervised monolingual sentence embeddings (Gao et al., 2021). In our context, the basic idea behind contrastive learning is to pull semantically close sentences together and to push distant sentences apart in the embedding space. Let $h_i$ and $h_i^{+}$ be the embeddings of two sentences that are semantically related and $N$ an arbitrary batch size. The contrastive training objective for this pair can be formulated as
$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+}) / \tau\right)},$$
where $\mathrm{sim}$ denotes cosine similarity and $\tau$ is a temperature hyperparameter that can be used to either amplify or dampen the assessed distances. For each sentence, all remaining sentences in the current batch can be used as so-called in-batch negatives; those should be pushed apart in the embedding space. For positive sentences that should be pulled together in the embedding space, we again use pseudo-parallel sentence pairs as positive training instances. We use pooled XLM-R embeddings as sentence representations, and, as with unsupervised remapping, we plan to experiment with multiple iterations of successive pseudo-parallel sentence mining and sentence embedding induction operations. To make this possible, the pseudo-parallel data must be processed using UScorecos.
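To make the objective concrete, here is a minimal NumPy sketch of a contrastive loss with in-batch negatives, assuming cosine similarity as the similarity function. The function name and the default temperature are illustrative assumptions; the sketch computes loss values only and leaves out the encoder and gradient updates.

```python
import numpy as np


def contrastive_loss(H, H_pos, tau=0.05):
    """Per-pair contrastive loss with in-batch negatives.

    H, H_pos: (N, d) arrays; row i of H_pos is the positive for row i of H.
    tau: temperature (0.05 is an assumed default, not a value from the text).
    Returns an array of N loss values, one per sentence pair.
    """
    # Normalize rows so that dot products are cosine similarities.
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    H_pos = H_pos / np.linalg.norm(H_pos, axis=1, keepdims=True)

    # logits[i, j] = sim(h_i, h_j^+) / tau; the diagonal holds the positives,
    # every other column acts as an in-batch negative for row i.
    logits = H @ H_pos.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs)
```

A smaller `tau` sharpens the softmax, so a negative that is nearly as similar as the positive is penalized more heavily.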
Ratio Margin-based Corpus Mining
As UScorecos is based on sentence embeddings, we cannot use the prefetch and pruning algorithm for mining since it requires access to word-level representations. An alternative would be to just use cosine similarity for mining, but Artetxe and Schwenk (2019) show that this approach often retrieves badly aligned sentence pairs. Instead, we follow Artetxe and Schwenk (2019) and use a ratio margin function defined as
$$\mathrm{margin}(u, v) = \frac{\cos(u, v)}{\sum_{z \in \mathrm{NN}_k(u)} \frac{\cos(u, z)}{2k} + \sum_{z \in \mathrm{NN}_k(v)} \frac{\cos(v, z)}{2k}},$$
where $\mathrm{NN}_k(u)$ and $\mathrm{NN}_k(v)$ are the $k$ nearest neighbors of sentence embeddings $u$ and $v$ in the respective language. Informally, this ratio margin function divides the cosine similarity of a candidate pair by the average similarity of each sentence to its neighborhood.
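A ratio margin score of this kind might be computed as in the sketch below. It assumes that the neighborhood of a source embedding consists of its `k` most similar target embeddings and vice versa, which is our reading of Artetxe and Schwenk (2019); the function name and the default `k` are illustrative.

```python
import numpy as np


def ratio_margin_scores(S, T, k=4):
    """Ratio margin scores for all source/target sentence-embedding pairs.

    S: (n, d) source embeddings, T: (m, d) target embeddings, k <= min(n, m).
    Returns an (n, m) matrix; higher means a more confident match.
    """
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    cos = S @ T.T  # cos[i, j] = cosine similarity of source i and target j

    # Average similarity of each sentence to its k nearest neighbors
    # on the other side.
    src_nn = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # shape (n,)
    tgt_nn = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # shape (m,)

    # Dividing the two neighborhood averages by 2 reproduces the 1/(2k) terms.
    denom = (src_nn[:, None] + tgt_nn[None, :]) / 2.0
    return cos / denom
```

The normalization lowers the score of "hub" sentences that are similar to many candidates; the sketch does not guard against a non-positive denominator, which can occur if all neighborhood similarities are negative.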
2.6 UScorewmd cos
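Inspired by SentSim, which combines word and sentence embeddings, we similarly ensemble UScorewmd and UScorecos. We refer to this final metric as UScorewmd cos, with two new weights $w_{\mathrm{wmd}}$ and $w_{\mathrm{cos}}$:
$$\mathrm{UScore}_{\mathrm{wmd\,cos}}(x, y) = w_{\mathrm{wmd}}\,\mathrm{UScore}_{\mathrm{wmd}}(x, y) + w_{\mathrm{cos}}\,\mathrm{UScore}_{\mathrm{cos}}(x, y).$$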
In this section, we evaluate all UScore variants at the segment level and compare them to TYPE-1/2/3 upper bounds. We do not evaluate at the system level because metrics there often perform very similarly and achieve very high correlations, making it difficult to determine the best metric (Mathur et al., 2020; Freitag et al., 2021).
We use various datasets to evaluate the performance of our proposed metrics. Most of them consist of pairs of source sentences, machine-translated hypotheses, and human-annotated scores, allowing us to compute the correlation with human assessments using Pearson’s r correlation. We also evaluate our metrics as pseudo-parallel corpus mining tools, for which we report Precision at N (P@N) on parallel sentence matching, a standard evaluation measure in the parallel corpus mining field (Guo et al., 2018; Kvapilíková et al., 2020). The task is to search a set of shuffled parallel sentences to recover correct translation pairs.
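Both evaluation measures are simple to compute. The following sketch shows P@1 for parallel sentence matching and Pearson's r for segment-level correlation; it assumes the similarity matrix is arranged so that the gold translation of source sentence i sits at target index i (i.e., the shuffle has been undone for scoring), and the function names are illustrative.

```python
import numpy as np


def precision_at_1(sim):
    """Fraction of source sentences whose top-ranked target is the gold one.

    sim: (n, n) array with sim[i, j] the metric score for source i and
    target j; the gold translation of source i is assumed to be target i.
    """
    return float((sim.argmax(axis=1) == np.arange(len(sim))).mean())


def pearson_r(metric_scores, human_scores):
    """Pearson correlation between metric scores and human assessments."""
    return float(np.corrcoef(metric_scores, human_scores)[0, 1])
```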
In WMT-16, each language pair consists of tuples of source sentences from the news domain, machine-translated hypotheses, and reference translations. Each tuple was annotated with a direct assessment (DA) score, which quantifies the adequacy of the hypothesis given the reference translation. Following Zhao et al. (2020) and Song et al. (2021), we use these DA scores also to assess the adequacy of the hypothesis given the source. We also make use of the analogous dataset from the following year, which we refer to as WMT-17. In this, some language pairs and directions were changed.
MLQE-PE has been used in the WMT 2020 Shared Task on Quality Estimation (Specia et al., 2020). MLQE-PE only provides source sentences and hypotheses for its language pairs, with no references. Each source sentence and hypothesis pair was annotated with cross-lingual direct assessment (CLDA) scores. In terms of annotation, Eval4NLP is very similar to MLQE-PE. However, it focuses on non-English-centric language directions, especially de-zh and ru-de. WMT-MQM uses fine-grained error annotations from the Multidimensional Quality Metrics (MQM) framework (Freitag et al., 2021) for adequacy assessments. Here, MQM is used to structure all possible problems that may occur during MT into a hierarchy, evaluate them separately, and aggregate them into a single score using adjusted weightings. Like MLQE-PE and Eval4NLP, WMT-MQM also assigns scores based on source sentences and hypotheses.
Using ISO 639-1 codes, our datasets cover the language pairs: de-zh, ru-de, en-ru, en-zh, cs-en, de-en, en-de, et-en, fi-en, lv-en, ne-en, ro-en, ru-en, si-en, tr-en, zh-en.
Parallel sentence matching
To evaluate our metrics on parallel sentence matching, we use the News Commentary dataset (http://data.statmt.org/news-commentary). It consists of parallel sentences crawled from economic and political data. We use News Commentary v15, as in the WMT-20 Machine Translation of News Shared Task (Barrault et al., 2020).
3.2 Preliminary Studies on de-en
We conduct preliminary experiments to gain an understanding of the properties of iterative techniques and the influence of individual parameters. For these experiments, we only use the de-en language direction of WMT-16 and News Commentary v15.
Monolingual Subspace Realignment
We explore if the CLP and UMD remapping methods also work with word alignments extracted from pseudo-parallel sentence pairs. We use News Crawl for these experiments. Since large corpora tend to include low-quality data points, we follow Artetxe and Schwenk (2019) and Keung et al. (2021) and apply three simple filtering techniques, described in the appendix (§A.2). For mining, we randomly extract 40k monolingual sentences per language direction, set for the prefetch and pruning algorithm, and select the best k sentence pairs with the highest similarity scores. This gives us the same number of sentences as were used for training the remapping matrices already provided by XMoverScore.
The results for UMD and CLP-based remapping on the de-en language direction can be seen in Figure 3.
The figure contains two graphs, one for correlation with human judgments and one for precision on the parallel sentence matching task. Each graph illustrates model performance before remapping (depicted as Iteration 0) and after remapping one to five times. After one iteration of remapping, both UMD and CLP achieve substantial improvements in Pearson’s r correlation. The improvement of CLP, however, is noticeably larger. For subsequent iterations, UMD seems to continue to improve slightly, but the correlations of CLP seem to drop. This can be explained by the results for precision where the P@1 of CLP drops each iteration, meaning the remapping capabilities of the metrics decrease. UMD does not exhibit this problem. We conclude that UMD could be a more robust choice for metrics that should perform reasonably well on both tasks.
Pseudo References & Language Model
Next, we add a language model to the metric and investigate pseudo-parallel corpus mining to train an MT system for pseudo references. Since fine-tuning for MT is a very resource-intensive undertaking requiring many parallel sentence pairs (Barrault et al., 2020), especially compared to our subspace realignment experiments, we use considerably more training data. Tran et al. (2020) roughly use between 100k and 300k pseudo-parallel sentence pairs to train their final mBART model, so for fine-tuning an MT system, we increase the number of monolingual sentences used for mining by a factor of 100. This means we now have a pool of 4m sentences per language direction, again taken from News Crawl. We extract the top 5% pseudo-parallel sentence pairs and thus have 200k samples to train on. Our results on the de-en data of WMT-16 are reported in Figure 4, which is similar to an ablation study.
On the x-axis, we vary the weight for UScorewmd with pseudo references, and on the y-axis, we explore different weights for the language model. We set . Without a language model (i.e. ), the best results are achieved with . When combining both approaches, the overall best performance is achieved with and . The improvement when pseudo references and a language model are included in the metric is substantial, highlighting their effectiveness—e.g., we improve from 28% correlation with humans to 49% with the best weight combination, an improvement of 75%.
For UScorecos, we additionally filter our retrieved pseudo-parallel data. Up to now, we aligned each source sentence with the best matching target sentence, which could lead to multiple source sentences being aligned to the same target sentence. While this was not problematic in our previous experiments, it could lead to complications for contrastive learning since the same positive sentences could also appear as in-batch negative instances. As a remedy, we discard all sentence pairs where the target sentence has already occurred before. Since the additional filtering means that we have fewer potential sentence pairs to choose from, we decided to only use the best 2.5% sentence pairs to train the models. As we again use 4m monolingual sentences per language, this means our training datasets consist of 100k sentence pairs. The results of UScorecos are shown in Figure 5.
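The additional filtering can be sketched as follows. Two details are our assumptions rather than statements from the text: pairs are ranked by score before duplicates are dropped, so that the best-scoring pair for each target survives, and the kept fraction is taken relative to the size of the monolingual pool (2.5% of 4m sentences gives the 100k pairs mentioned above). The function and parameter names are illustrative.

```python
def filter_pairs(pairs, pool_size, keep_fraction=0.025):
    """Drop pairs with repeated targets, then keep the top-scoring ones.

    pairs: iterable of (source, target, score); higher score = better match.
    pool_size: number of monolingual sentences mined per language.
    keep_fraction: share of the pool to keep as training pairs.
    """
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)

    seen_targets = set()
    unique = []
    for src, tgt, score in ranked:
        if tgt in seen_targets:
            # This target already serves as a positive for a better-scoring
            # source; keeping it twice could make it an in-batch negative
            # of its own positive.
            continue
        seen_targets.add(tgt)
        unique.append((src, tgt, score))

    n_keep = int(round(pool_size * keep_fraction))
    return unique[:n_keep]
```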
The P@1 scores seem to steadily improve over the previous best result every two training iterations. Beginning with the sixth iteration, the precision seems to converge.
Table 2 shows pseudo-parallel data mined with UScorecos and with UScorewmd after remapping once with UMD. The mined sentences are semantically similar, but contain factuality errors (e.g., have wrong places or numbers in hypotheses).
3.3 Other Languages
We now test our unsupervised metrics on other languages and datasets. For UScorewmd, we remap mBERT once with UMD and make use of a language model and pseudo references obtained from an MT system using the best weight configuration identified in Section 3.2. For UScorecos, we train its sentence embedding model for six iterations. We also evaluate UScorewmd cos. Based on the experiments in Section 3.2, we determined and to be the best configuration for ensembling.
[Table 1. Columns: WMT-16, WMT-17, MLQE-PE, Eval4NLP, WMT-MQM. Row groups: Supervised Metrics (TYPE-1/2); Supervised Metrics (TYPE-3).]
The Pearson’s r correlations with human judgments, averaged over language pairs, are shown in Table 1 for all metrics (results on individual language pairs can be found in Appendix A). For comparison purposes, we also present the results of the popular TYPE-1 metric BLEU, where possible, and the recent trained TYPE-2 metrics MonoTransQuest (Ranasinghe et al., 2020b, a) and COMET-QE (Rei et al., 2021). Finally, as more direct competitors, we compare to the TYPE-3 metrics XMoverScore and SentSim.
As expected, DistilScore, which uses parallel data, is always better than UScorecos, by 2 to 10 correlation points. In contrast, UScorewmd is generally on par with XMoverScore, even though XMoverScore uses parallel data; the difference is that UScorewmd also leverages pseudo-references, which XMoverScore does not. From Figure 4, we observe that the pseudo-references can make an improvement of up to 1-11 points in correlation (comparing ‘column’ labeled to the columns ).
We beat reference-based TYPE-1 BLEU across the board. TYPE-2 metrics, which are fine-tuned on human scores, are generally the best. They exhibit 10+ points higher correlation than our metrics on four out of five datasets. Intriguingly, the only two language pairs where our metrics are on par are the non-English de-zh and ru-de from Eval4NLP. These languages are outside the training scope of the current TYPE-2 metrics and thus test their generalization abilities. For example, on ru-de our best metric outperforms MonoTransQuest by 5 points correlation and COMET-QE by 9 points (see Table 6 in the appendix). This indicates an interesting application scenario for TYPE-3 metrics as well as our class of metrics.
Surprisingly, our unsupervised metrics also outperform the TYPE-3 upper bounds on four out of five datasets. Compared to TYPE-3 competitors, on WMT-16, WMT-17, and on Eval4NLP, our combined metric has the best overall results. On WMT-MQM, UScorewmd alone has the highest correlation score. The drop in performance for the combined metric is caused by UScorecos, which on its own achieves astonishingly bad correlations. However, supervised DistilScore exhibits the same issues. Thus, this could be a general problem for metrics based on sentence embeddings on this dataset.
For MLQE-PE, the SentSim metrics perform best on average among TYPE-3 and our metrics (although our reproduced scores for this dataset differ noticeably from the authors’ results because their original script read the human scores from an incorrect data column). Among our self-learned metrics, the combined variant performs best on average again, but still is 3-5 points below SentSim and DistilScore, even though it outperforms both XMoverScore variants by over 6 points. Interestingly, by itself, UScorecos works better than UScorewmd, unlike for the other datasets. Similarly, DistilScore clearly outperforms XMoverScore. One reason for this unusual behavior is the usage of mBERT in UScorewmd. MLQE-PE contains sentences in Sinhala, a language not present in the training data of mBERT. Another explanation is the data collection scheme for ru-en, which uses different sources of parallel sentences. These sources are mainly made of colloquial data and Russian proverbs, which use rather unconventional grammar (Fomicheva et al., 2020a). This unconventional grammar apparently confuses the language model. Similarly, we believe that the MT system has problems translating colloquial sentences because it has been trained on news data where formal writing is used. When we exclude si-en and ru-en from MLQE-PE, UScorewmd cos performs best, with a Pearson’s r of 44.22 vs. 43.82 for SentSim (BERTScore).
In §A.3, we show that incorporating real parallel data (in addition to pseudo-parallel data), an order of magnitude less than SentSim uses, allows us to outperform SentSim on MLQE-PE as well.
Limitations of our metrics include (1) algorithmic inefficiency, (2) resource inefficiency, (3) the brittleness of unsupervised MT systems in certain situations, and (4) hyperparameters.
(1) Some of the components of UScorewmd (mainly the MT system) lead to substantial computational overhead and make inference slow. To put this into perspective, XMoverScore and SentSim (BERTScore) take less than 30 seconds to score 1000 hypotheses on an Nvidia V100 Tensor Core GPU. UScorewmd, on the other hand, takes over 2.5 minutes. This algorithmic inefficiency is the price of our sample efficiency, i.e., of not using any supervision signals. In future work, we aim to experiment with efficient MT architectures (e.g., distilled versions) to reduce computational costs.
(2) Like XMoverScore, MonoTransQuest, or SentSim, our metrics use high-quality encoders such as BERT or GPT, which are not only inefficient in terms of memory and inference time but also leverage large monolingual resources. Future work should thus investigate not only smaller BERT models but also models that leverage smaller amounts of monolingual resources.
(3) We leverage unsupervised MT approaches via pseudo references. Even though such approaches may be less effective for truly low-resource languages (Marchisio et al., 2020), this remains a very active and fascinating field of research with a constant influx of more powerful solutions (Ranathunga et al., 2021; Sun et al., 2021).
(4) Even though we presented our approach as fully unsupervised, it still has three tunable weights. Figure 4 shows that these may have a big influence on the outcomes. In this work, we set these hyperparameters with reference to one high-resource language pair (de-en) only, which is a very mild form of supervision. On the other hand, this may also mean that our model could perform much better if more suitable language-specific choices were made.
5 Related Work
All metrics in this work presented so far treated the MT model generating the hypotheses as a black-box which is otherwise not further involved in the scoring process. There also exists a recent line of work of so-called glass-box metrics, which actively incorporate the MT model under test into the scoring process (Fomicheva et al., 2020b, 2020). In particular, Fomicheva et al. (2020) explore whether the MT model under test can be used to generate additional hypotheses (Dreyer and Marcu, 2012). They then define various reference-based and reference-free metrics. A crucial difference to our metrics is the required availability of the original MT model, which we are agnostic about. The MT models used in Fomicheva et al. (2020) are all trained on parallel data, which makes their approach a supervised metric in our sense.
We do not classify metrics such as Prism and BARTScore as unsupervised, as Prism is trained from scratch on parallel data and BARTScore uses a BART model fine-tuned on labeled summarization or paraphrasing datasets.
There are also multilingual sentence embedding models which are highly relevant in our context. Kvapilíková et al. (2020), for example, fine-tune XLM-R with translation language modeling on synthetic data translated with an unsupervised MT system. Similar to our contrastive learning approach, the resulting embedding model is completely unsupervised. Important differences are that our sentence embedding model can be improved iteratively and does not rely on an MT system. We leave a comparison to future work.
Another relevant multilingual sentence embedding model is the supervised model LaBSE (Feng et al., 2020). The fine-tuning task of LaBSE consists of optimizing a so-called additive margin softmax loss (Yang et al., 2019) on parallel sentences. This is an instance of a contrastive training objective that shares some similarities with the contrastive loss of our UScorecos. A crucial difference, however, is the presence of a margin parameter. LaBSE achieves state-of-the-art performance on various bitext retrieval and corpus mining tasks but performs worse than comparable sentence embedding models on assessing semantic similarity. We suspect that this is due to the additional margin parameter, which may bias the assessed distance for similar sentences that are not perfect translations of each other.
Finally, the idea of fully unsupervised text generation systems has originated in the MT community (Artetxe et al., 2018; Lample et al., 2018; Artetxe et al., 2019). Given the similarity of MT systems and evaluation metrics, designing fully unsupervised evaluation metrics is an apparent next step, which we take in this work.
In this work, we aimed for sample-efficient evaluation metrics that do not use any form of supervision signals. In addition, our novel metrics should be maximally effective, i.e., of high quality. To achieve this, we leveraged pseudo-parallel data obtained from fully unsupervised evaluation metrics in an iterative manner. We also exploited pseudo references from unsupervised MT systems as an alternative to original human references. We showed that such an approach can lead to substantial quality boosts when the right parameter choices are made for the components. Moreover, we showed that our approach is effective and can outperform three supervised upper bounds (making use of parallel data) on 4 out of 5 datasets we included in our comparison.
In future work, we want to aim for algorithmic efficiency, include pseudo source texts as additional components (using the MT system in backward translation) and use the MT system to generate additional (better) pseudo-parallel data. We also think that our approach still has substantial room for improvement given that we selected hyperparameters based on one high-resource language pair (de-en) only. Thus, it will be particularly intriguing to explore weakly-supervised approaches which leverage minimal forms of supervision. It will also be interesting to explore other ways of inferring unsupervised metrics, e.g., adapting BERTScore (Zhang et al., 2020) or BARTScore (Yuan et al., 2021).
- Regression for sentence-level MT evaluation with pseudo references. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 296–303. External Links: Cited by: §2.4.
- Should all cross-lingual embeddings speak English?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8658–8679. External Links: Cited by: §1.
- Unsupervised neural machine translation. In International Conference on Learning Representations, External Links: Cited by: §1, §5.
- An effective approach to unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 194–203. External Links: Cited by: §5.
- Margin-based parallel corpus mining with multilingual sentence embeddings. In ACL, Cited by: §1, §2.5, §3.2.
- Findings of the 2020 conference on machine translation (wmt20). In WMT@EMNLP, Cited by: §3.1, §3.2.
- Better rewards yield better summaries: learning to summarise without references. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3110–3120. External Links: Cited by: §1.
- Multilingual alignment of contextual word representations. In International Conference on Learning Representations, External Links: Cited by: §2.1.
- Automatic text evaluation through the lens of Wasserstein barycenters. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 10450–10466. External Links: Cited by: §1.
- Attenuating bias in word vectors. In AISTATS, Cited by: §A.1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Cited by: §2.1.
- Word alignment by fine-tuning embeddings on parallel corpora. In EACL, Cited by: §2.4.
- HyTER: meaning-equivalent semantics for translation evaluation. In NAACL, Cited by: §5.
- Efforts in the development of an augmented english-nepali parallel corpus. Technical report Kathmandu University. Cited by: §A.3.
- A simple, fast, and effective reparameterization of ibm model 2. In HLT-NAACL, Cited by: §A.1, §2.4.
- Language-agnostic bert sentence embedding. ArXiv abs/2007.01852. Cited by: §5.
- MLQE-pe: a multilingual quality estimation and post-editing dataset. ArXiv abs/2010.04480. Cited by: §3.3.
- Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 539–555. Cited by: §5.
- The Eval4NLP shared task on explainable quality estimation: overview and results. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Punta Cana, Dominican Republic, pp. 165–178. External Links: Cited by: §1.
- Multi-hypothesis machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1218–1232. External Links: Cited by: §2.4, §5.
- Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics 9, pp. 1460–1474. External Links: Cited by: §3.1.
- Results of the WMT21 metrics shared task: evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, Online, pp. 733–774. External Links: Cited by: §3.
- SimCSE: simple contrastive learning of sentence embeddings. In EMNLP, Cited by: §2.5.
- SUPERT: towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1347–1354. External Links: Cited by: §2.4.
- Effective parallel corpus mining using bilingual sentence embeddings. In WMT, Cited by: §3.1.
- Revisiting self-training for neural sequence generation. In Proceedings of ICLR, External Links: Cited by: §A.3, §1.
- Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 427–431. External Links: Cited by: §A.2.
- Global explainability of BERT-based evaluation metrics by disentangling along linguistic factors. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 8912–8925. External Links: Cited by: §2.3.
- Unsupervised bitext mining and translation via self-trained contextual embeddings. Transactions of the Association for Computational Linguistics 8, pp. 828–841. Cited by: §3.2.
- Europarl: a parallel corpus for statistical machine translation. In MT summit, Cited by: §A.1.
- From word embeddings to document distances. In ICML, Cited by: §2.1, §2.4.
- Unsupervised multilingual sentence embeddings for parallel corpus mining. In ACL, Cited by: §3.1, §5.
- Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5039–5049. External Links: Cited by: §1, §5.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: Cited by: §1.
- When does unsupervised machine translation work?. In Proceedings of the Fifth Conference on Machine Translation, Online, pp. 571–583. External Links: Cited by: §4.
- Scientific credibility of machine translation research: a meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 7297–7306. External Links: Cited by: §1.
- Tangled up in BLEU: reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4984–4997. External Links: Cited by: §1.
- Results of the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, Online, pp. 688–725. External Links: Cited by: §3.
- Exploiting similarities among languages for machine translation. ArXiv abs/1309.4168. Cited by: §A.1.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Cited by: §1.
- Language models are unsupervised multitask learners. In OpenAI Blog, Cited by: §2.1.
- TransQuest at wmt2020: sentence-level direct assessment. In Proceedings of the Fifth Conference on Machine Translation, Cited by: §3.3.
- TransQuest: translation quality estimation with cross-lingual transformers. In Proceedings of the 28th International Conference on Computational Linguistics, Cited by: §3.3.
- An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, pp. 434–440. External Links: Cited by: §1.
- Neural machine translation for low-resource languages: a survey. External Links: Cited by: §4.
- Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, Online, pp. 1030–1040. External Links: Cited by: §3.3.
- COMET: a neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2685–2702. External Links: Cited by: §1.
- Making monolingual sentence embeddings multilingual using knowledge distillation. In EMNLP, Cited by: §2.2, §2.
- Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1599–1613. External Links: Cited by: §2.1.
- WikiMatrix: mining 135m parallel sentences in 1620 language pairs from wikipedia. In EACL, Cited by: §A.3.
- BLEURT: learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7881–7892. External Links: Cited by: §1.
- SentSim: crosslingual semantic evaluation of machine translation. In NAACL, Cited by: §1, §2.3, §2, §3.1.
- Findings of the wmt 2020 shared task on quality estimation. In WMT@EMNLP, Cited by: §3.1.
- Unsupervised neural machine translation for similar and distant language pairs: an empirical study. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 20 (1), pp. 1–17. Cited by: §4.
- Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In EMNLP, Cited by: §5.
- Cross-lingual retrieval for iterative self-supervised training. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 2207–2219. External Links: Cited by: §2.4, §2.4, §3.2.
- Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, External Links: Cited by: §1.
- Normalized word embedding and orthogonal transform for bilingual word translation. In HLT-NAACL, Cited by: §A.1.
- Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 5370–5378. External Links: Cited by: §5.
- BARTScore: evaluating generated text as text generation. In Thirty-Fifth Conference on Neural Information Processing Systems, External Links: Cited by: §1, §1, §5, §6.
- BERTScore: evaluating text generation with bert. In International Conference on Learning Representations, External Links: Cited by: §1, §1, §2.3, §6.
- On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In ACL, Cited by: §A.1, §A.1, §A.1, §1, §1, §1, §2.1, §2.1, §2, §3.1.
- MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 563–578. External Links: Cited by: §1, §1.
- DiscoScore: evaluating text generation with bert and discourse coherence. External Links: Cited by: §1.
Appendix A Appendix
A.1 CLP and UMD
Procrustes alignment: Mikolov et al. (2013) propose to compute a linear transformation matrix $W$ which can be used to map a vector $x$ of a source word into the target language subspace by computing $Wx$. The transformation can be computed by solving the problem
$$\min_{W} \lVert WX - Y \rVert_F$$
Here, $X$ and $Y$ are matrices with embeddings of source and target words, respectively, where the tuples $(x_i, y_i)$ come from parallel word pairs. XMoverScore constrains $W$ to be an orthogonal matrix such that $W^\top W = I$, since this can lead to further improvements (Xing et al., 2015). Zhao et al. (2020) call this remapping Linear Cross-Lingual Projection (CLP).
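The orthogonally constrained problem above has a well-known closed-form solution via a singular value decomposition. The following is a minimal sketch of that solution, not the XMoverScore implementation: the function name `orthogonal_procrustes_map`, the column-wise layout of the embedding matrices, and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def orthogonal_procrustes_map(X, Y):
    """Return the orthogonal matrix W minimizing ||W X - Y||_F.

    X, Y: arrays of shape (d, n). Column i of X is a source word embedding,
    column i of Y the embedding of its translation (assumed layout).
    """
    # SVD of the cross-covariance Y X^T = U S V^T; the minimizer is W = U V^T.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy data (hypothetical): the target space is an exact orthogonal
# transformation of the source space, so W should recover it.
rng = np.random.default_rng(0)
d, n = 4, 50
X = rng.normal(size=(d, n))
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Y = Q_true @ X

W = orthogonal_procrustes_map(X, Y)
mapped = W @ X  # source embeddings projected into the target space
```

With real embeddings the fit is never exact, so `W @ X` only approximates `Y`; the orthogonality constraint keeps distances and angles between the remapped source embeddings unchanged.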
De-biasing: The second remapping method of XMoverScore is rooted in the removal of biases from word embeddings. Dev and Phillips (2019) explore a bias attenuation technique called Universal Language Mismatch-Direction (UMD). It involves a bias vector $v_B$, which is supposed to capture the bias direction. For each word embedding $w$, an updated word embedding $w'$ is computed by subtracting its projection onto $v_B$, as in
$$w' = w - \langle w, v_B \rangle \, v_B$$
where $\langle \cdot, \cdot \rangle$ is the dot product. To obtain the bias vector $v_B$, Dev and Phillips (2019) use a set of word pairs that should be de-biased (e.g., man and woman). The differences between the embeddings of the words in each pair are then stacked to form a matrix $Q$, and the bias vector is its top left singular vector. Zhao et al. (2020) use the same approach for XMoverScore, but $Q$ instead consists of parallel word pairs.
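The de-biasing step can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the original code: the function names `umd_bias_direction` and `umd_remap` are hypothetical, the difference vectors are assumed to be stacked as columns (so that the left singular vectors live in the embedding space), and the toy data is constructed so that all pair differences share one direction.

```python
import numpy as np

def umd_bias_direction(src, tgt):
    """Estimate the bias direction from paired embeddings.

    src, tgt: arrays of shape (n, d); row i of src and row i of tgt are the
    embeddings of one word pair (assumed layout).
    """
    diffs = (src - tgt).T  # shape (d, n): one difference vector per column
    U, _, _ = np.linalg.svd(diffs, full_matrices=False)
    return U[:, 0]         # top left singular vector, unit length

def umd_remap(E, v):
    """Subtract from each row of E its projection onto the unit vector v."""
    return E - np.outer(E @ v, v)

# Toy data (hypothetical): every pair differs only along the direction b.
rng = np.random.default_rng(1)
n, d = 20, 5
b = rng.normal(size=d)
b /= np.linalg.norm(b)
src = rng.normal(size=(n, d))
tgt = src - rng.uniform(0.5, 1.5, size=(n, 1)) * b

v = umd_bias_direction(src, tgt)
debiased = umd_remap(src, v)
```

The sign of a singular vector is arbitrary, but this does not matter here: the projection that is subtracted is the same for `v` and `-v`.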
Zhao et al. (2020) show that these remapping methods lead to substantial improvements of their XMoverScore metric (on average, up to 10 points in correlation). The required parallel word pairs were extracted from sentences of the EuroParl corpus (Koehn, 2005) using the fast_align (Dyer et al., 2013) word alignment tool. The best results were obtained when remapping on 2k parallel sentences.
|Top-WMD||Uruguay belegt mit vier Punkten nur Platz Sieben.||Russia was second with four gold and 13 medals.|
|Top-WMD||Soweit lautet zumindest die Theorie.||That, at least, is the theory.|
|Rnd-WMD||Die USA stellen etwa 17.000 der insgesamt 47.000 ausländischen Soldaten in Afghanistan.||Currently, there are about 170,000 U.S. troops in Iraq and 26,000 in Afghanistan.|
|Rnd-WMD||“Das ist eine schwierige Situation”, sagte Kaczynski.||“It seemed like a ridiculous situation,” Vanderjagt said.|
|Top-Cos||Die Wahlen für ein neues Parlament sollen dann Anfang Januar stattfinden.||Parliamentary elections are to be held by January.|
|Top-Cos||Anzeichen für die Blauzungenkrankheit sind Fieber, Entzündungen und Blutungen an der Zunge der Tiere.||Contact with the creatures can cause itching, rashes, conjunctivitis and, in some cases, breathing problems.|
|Rnd-Cos||Riesen-Wirbel an der Universität Zagreb: An der wirtschaftlichen Fakultät und am Institut für Verkehrsstudien durchsuchen Polizisten die Büros von Dozenten.||Those attending the Soil Forensics International Conference work in the fields of science, policing, forensic services as well as private industries.|
|Rnd-Cos||Frankfurt soll WM-Finale der Frauen ausrichten||The women’s tournament gets underway on Sunday.|
|Supervised Metrics (TYPE-1/2)||de-en||en-ru||ru-en||ro-en||cs-en||fi-en||tr-en|
|Supervised Metrics (TYPE-3)|
|Supervised Metrics (TYPE-1/2)||cs-en||de-en||fi-en||lv-en||ru-en||tr-en||zh-en|
|Supervised Metrics (TYPE-3)|
|Supervised Metrics (TYPE-1/2)||en-de||en-zh||ru-en||ro-en||et-en||ne-en||si-en|
|Supervised Metrics (TYPE-3)|
|Supervised Metrics (TYPE-1/2)||en-de||zh-en||de-zh||ru-de|
|Supervised Metrics (TYPE-3)|
We first remove all sentences from each monolingual corpus for which the fastText language identification tool (Joulin et al., 2017) predicts a different language. We then filter out all sentences which are shorter than 3 tokens or longer than 30 tokens. As the last step, we discard sentence pairs sharing substantial lexical overlap, which prevents degenerate alignments of, e.g., proper names: we remove all sentence pairs whose overlap, as measured by the Levenshtein distance, exceeds 50%.
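The three filtering steps can be sketched as follows. This is a minimal sketch under several assumptions that the text does not specify: tokens are obtained by whitespace splitting, the overlap is computed at the character level as one minus the Levenshtein distance normalized by the longer string, and `detect_lang` is a hypothetical callable standing in for the language identification tool.

```python
def levenshtein(a, b):
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def overlap_ratio(a, b):
    """Assumed overlap measure: 1 - normalized Levenshtein distance."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0

def keep_sentence(sentence, expected_lang, detect_lang,
                  min_tokens=3, max_tokens=30):
    """Steps 1 and 2: language check and length filter (whitespace tokens)."""
    n_tokens = len(sentence.split())
    return (detect_lang(sentence) == expected_lang
            and min_tokens <= n_tokens <= max_tokens)

def keep_pair(src, tgt, max_overlap=0.5):
    """Step 3: drop pairs whose overlap exceeds the threshold."""
    return overlap_ratio(src, tgt) <= max_overlap
```

In practice, `detect_lang` would wrap a call to the language identifier and return a language code; here any function mapping a sentence to a code works, which keeps the sketch independent of a particular model file.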
A.3 Fine-Tuning on Parallel Data
To examine whether and by how much we can further improve our metrics using forms of supervision, we experiment with a fine-tuning step on parallel sentences and treat self-learning on pseudo-parallel data as pre-training (He et al., 2020). We use the parallel data to fine-tune the contrastive sentence embeddings of UScorecos and the MT system of UScorewmd, which is responsible for generating pseudo references. Further, we also compute new remapping matrices for UScorewmd. Since CLP is superior to UMD when parallel data is used (see Section 3.2), we compute these remapping matrices using CLP instead of UMD. To assess how different amounts of parallel sentences affect performance, we fine-tune our metrics on 10k, 20k, 30k, and 200k parallel sentences. We use WikiMatrix (Schwenk et al., 2021) and the Nepali Translation Parallel Corpus (Duwal et al., 2019) to obtain parallel sentences.
Pearson’s r correlations with human judgments for individual and averaged language pairs are shown in Figure 6; we focus on MLQE-PE, where our metrics performed worst. Overall, introducing parallel data into the training process improves performance for the majority of language directions, and more parallel data leads to better results. The largest relative improvements are achieved for the si-en language direction, which is in accordance with our discussion above. When fine-tuning with 30k parallel sentences, the performance of our metrics is roughly on par with the SentSim variants (see Table 1). With 200k parallel sentences, our metrics clearly outperform SentSim, which uses millions of parallel sentences and NLI data as supervision signals.