compare-mt
A program to compare language generation results and extract salient features
In this paper, we describe compare-mt, a tool for holistic analysis and comparison of the results of systems for language generation tasks such as machine translation. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems that can then be used to guide further analysis or system improvement. It implements a number of tools to do so, such as analysis of accuracy of generation of particular types of words, bucketed histograms of sentence accuracies or counts based on salient characteristics, and extraction of characteristic n-grams for each system. It also has a number of advanced features such as use of linguistic labels, source side data, or comparison of log likelihoods for probabilistic models, and also aims to be easily extensible by users to new types of analysis. The code is available at https://github.com/neulab/compare-mt.
Tasks involving the generation of natural language are ubiquitous in NLP, including machine translation (MT; Koehn, 2010), language generation from structured data Reiter and Dale (2000), summarization Mani (1999), dialog response generation Oh and Rudnicky (2000), and image captioning Mitchell et al. (2012). Unlike tasks that involve prediction of a single label, such as text classification, natural language texts are nuanced, and there are no clear yes/no distinctions about whether outputs are correct or not. Evaluation measures such as BLEU Papineni et al. (2002), ROUGE Lin (2004), METEOR Denkowski and Lavie (2011), and many others attempt to give an overall idea of system performance, and technical research often attempts to improve accuracy according to these metrics.
However, as useful as these metrics are, they are often opaque: if we see, for example, that an MT model has achieved a gain of one BLEU point, this does not tell us what characteristics of the output have changed. Without fine-grained analysis, readers of research papers, or even the writers themselves, can be left scratching their heads asking “what exactly is the source of the gains in accuracy that we’re seeing?”
Unfortunately, this analysis can be time-consuming and difficult. Manual inspection of individual examples can be informative, but finding salient patterns for unusual phenomena requires perusing a large number of examples. There is also a risk that confirmation bias will simply affirm pre-existing assumptions. If a developer has some hypothesis about specifically what phenomena their method should be helping with, they can develop scripts to automatically test these assumptions. However, this requires deep intuitions with respect to what changes to expect in advance, which cannot be taken for granted in beginning researchers or others not intimately familiar with the task at hand. In addition, creation of special-purpose one-off analysis scripts is time-consuming.
In this paper, we present compare-mt, a tool for holistic comparison and analysis of the results of language generation systems. The main use case of compare-mt, illustrated in Fig. 1, is that once a developer obtains multiple system outputs (e.g. from a baseline system and improved system), they feed these outputs along with a reference output into compare-mt, which extracts aggregate statistics comparing various aspects of these outputs. The developer can then quickly browse through this holistic report and note salient differences between the systems, which will then guide fine-grained analysis of specific examples that elucidate exactly what is changing between the two systems.
Examples of the aggregate statistics generated by compare-mt are shown in §2, along with description of how these lead to discovery of salient differences between systems. These statistics include word-level accuracies for words of different types, sentence-level accuracies or counts for sentences of different types, and salient n-grams or sentences where one system does better than the other. §3 further details more advanced functionality of compare-mt that can make use of specific labels, perform analysis over source side text through alignments, and allow simple extension to new types of analysis. §4 demonstrates compare-mt’s practical applicability by showing some case studies where it has already been used for analysis in our previously published work. The methodology in compare-mt is inspired by several previous works on automatic error analysis Popović and Ney (2011), and we perform an extensive survey of the literature, note how many of the methods proposed in previous work can be easily realized by using functionality in compare-mt, and detail the differences with other existing toolkits in §5.
Using compare-mt with the default settings is as simple as typing
compare-mt ref sys1 sys2
where ref is a manually curated reference file, and sys1 and sys2 are the outputs of two systems that we would like to compare. These analysis results can be written to the terminal in text format, but can also be written to a formatted HTML file with charts and LaTeX tables that can be directly used in papers or reports. (In fact, all of the figures and tables in this paper, with the exception of Fig. 1, were generated by compare-mt, and only slightly modified for formatting. An example of the command used to do so is shown in the Appendix.)
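For instance, assuming the flag name used in the project’s README at the time of writing (consult compare-mt --help for the authoritative interface), an HTML report might be requested with:

compare-mt ref sys1 sys2 --output_directory report/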
In this section, we demonstrate the types of analysis that are provided by this standard usage of compare-mt. Specifically, we use the example of comparing phrase-based Koehn et al. (2003) and neural Bahdanau et al. (2015) Slovak-English machine translation systems from Neubig and Hu (2018).
|  | PBMT | NMT | Win? |
|---|---|---|---|
| BLEU | 22.43 [21.76, 23.19] | 24.03 [23.33, 24.65] | s2 > s1 (p < 0.001) |
| RIBES | 80.00 [79.39, 80.64] | 80.00 [79.44, 80.92] | - (p = 0.44) |
| Length | 94.79 [94.10, 95.49] | 93.82 [92.90, 94.85] | s1 > s2 (p < 0.001) |

Tab. 1: Aggregate score analysis with scores, confidence intervals, and pairwise significance tests.
The first variety of analysis is not unique to compare-mt, answering the standard question posed by most research papers: “given two systems, which one has better accuracy overall?” It can calculate scores according to standard BLEU Papineni et al. (2002), as well as other measures such as output-to-reference length ratio (which can discover systematic biases towards generating too-long or too-short sentences) or alternative evaluation metrics such as chrF Popović (2015) and RIBES Isozaki et al. (2010). compare-mt also has an extensible Scorer class, which will be used to expand the metrics supported by compare-mt in the future, and can be used by users to implement their own metrics as well. Confidence intervals and significance of differences in these scores can be measured using bootstrap resampling Koehn (2004). Tab. 1 shows the concrete results of this analysis on our PBMT and NMT systems. From the results we can see that the NMT system achieves higher BLEU but shorter sentence length, while there is no significant difference in RIBES.
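To make the significance testing concrete, here is a minimal sketch of paired bootstrap resampling in the spirit of Koehn (2004). It illustrates the idea rather than reproducing compare-mt’s implementation, and assumes the user supplies a corpus-level corpus_score function such as a BLEU implementation:

```python
import random

def paired_bootstrap(refs, sys1, sys2, corpus_score, n_samples=1000):
    """Fraction of bootstrap resamples on which sys1 beats sys2.

    corpus_score(hyps, refs) is any corpus-level metric, e.g. BLEU.
    A fraction near 1.0 (or 0.0) suggests a significant difference;
    1 minus the larger fraction is an approximate p-value.
    """
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]  # sample sentences with replacement
        r = [refs[i] for i in idx]
        s1 = corpus_score([sys1[i] for i in idx], r)
        s2 = corpus_score([sys2[i] for i in idx], r)
        wins += s1 > s2
    return wins / n_samples
```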
A second, and more nuanced, variety of analysis supported by compare-mt is bucketed analysis, which assigns words or sentences to buckets, and calculates salient statistics over these buckets.
Specifically, bucketed word accuracy analysis attempts to answer the question “which types of words can each system generate better than the other?” by calculating word accuracy by bucket. One example of this, shown in Fig. 2, is measurement of word accuracy bucketed by frequency in the training corpus. By default this “accuracy” is defined as f-measure of system outputs with respect to the reference, which gives a good overall picture of how well the system is doing, but it is also possible to separately measure precision or recall, which can demonstrate how much a system over- or under-produces words of a specific type as well. From the results in the example, we can see that both PBMT and NMT systems do more poorly on rare words, but the PBMT system tends to be more robust to low-frequency words while the NMT system does a bit better on very high-frequency words.
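The following is a simplified sketch of how such frequency-bucketed F-measure can be computed; the cutoff values and function names here are illustrative choices, not necessarily compare-mt’s defaults:

```python
from collections import Counter

# Illustrative frequency cutoffs: a word with training count c falls in
# the first bucket whose upper bound exceeds c.
CUTOFFS = [1, 2, 3, 5, 10, 100, 1000, float("inf")]

def freq_bucket(word, train_freq):
    count = train_freq.get(word, 0)
    for i, upper in enumerate(CUTOFFS):
        if count < upper:
            return i
    return len(CUTOFFS) - 1

def bucketed_fmeasure(ref_sents, sys_sents, train_freq):
    """Per-bucket F-measure of system output words against the reference."""
    match, in_ref, in_sys = Counter(), Counter(), Counter()
    for ref, out in zip(ref_sents, sys_sents):
        ref_counts, out_counts = Counter(ref.split()), Counter(out.split())
        for w, c in ref_counts.items():
            in_ref[freq_bucket(w, train_freq)] += c
        for w, c in out_counts.items():
            b = freq_bucket(w, train_freq)
            in_sys[b] += c
            match[b] += min(c, ref_counts[w])  # clipped matches, as in precision/recall
    fmeas = {}
    for b in set(in_ref) | set(in_sys):
        p = match[b] / in_sys[b] if in_sys[b] else 0.0
        r = match[b] / in_ref[b] if in_ref[b] else 0.0
        fmeas[b] = 2 * p * r / (p + r) if p + r else 0.0
    return fmeas
```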
A similar analysis can be done on the sentence level, attempting to answer questions of “on what types of sentences can one system perform better than the other?” In this analysis we define the “bucket type”, which determines how we split sentences into buckets, and the “statistic” that we calculate for each of these buckets. For example, compare-mt calculates three types of analysis by default:
bucket=length, statistic=score: This calculates the BLEU score by reference sentence length, indicating whether a system does better or worse at shorter or longer sentences. From Fig. 3, we can see that the PBMT system does better at very long sentences, while the NMT system does better at very short sentences.
bucket=lengthdiff, statistic=count: This outputs a histogram of the number of sentences that have a particular length difference from the reference output. A distribution peaked around 0 indicates that a system generally matches the output length, while a flatter distribution indicates a system has trouble generating sentences of the correct length. Fig. 4 indicates that while PBMT rarely generates extremely short sentences, NMT has a tendency to do so in some cases.
bucket=score, statistic=count: This outputs a histogram of the number of sentences receiving a particular score (e.g. sentence-level BLEU score). This shows how many sentences of a particular accuracy each system outputs. From Fig. 5, we can see that the PBMT system has slightly more sentences with low scores.
These are just three examples of the many different types of sentence-level analysis that are possible with different settings of the bucket and statistic types.
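As an illustration of how lightweight such a statistic can be, here is a minimal sketch of the “bucket=lengthdiff, statistic=count” analysis (an independent re-implementation of the idea, not compare-mt’s code):

```python
from collections import Counter

def length_diff_histogram(ref_sents, sys_sents):
    """Counts of (system length - reference length) in tokens per sentence.

    A peak at 0 means the system usually matches the reference length;
    heavy negative mass indicates a tendency toward too-short outputs.
    """
    return Counter(len(out.split()) - len(ref.split())
                   for ref, out in zip(ref_sents, sys_sents))
```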
The holistic analysis above is quite useful when word or sentence buckets can uncover salient accuracy differences between the systems. However, it is also common that we may not be able to predict a priori what kinds of differences we might expect between two systems. As a method for more fine-grained analysis, compare-mt implements a method that looks at differences in the n-grams produced by each system, and tries to find n-grams that each system is better at producing than the other Akabe et al. (2014). Specifically, it counts the number of times each system matches each n-gram x, defined as m_{s1}(x) and m_{s2}(x) respectively, and calculates a smoothed probability of an n-gram match coming from one system or another:

$$p(s_1 \mid x) = \frac{m_{s_1}(x) + \alpha}{m_{s_1}(x) + m_{s_2}(x) + 2\alpha} \qquad (1)$$
Intuitively, n-grams where the first system excels will have a high value of p(s1|x) (close to 1), and when the second excels the value will be low (close to 0). If the smoothing coefficient α is set high, the method will prefer frequent n-grams with robust statistics, and when α is low, it will prefer highly characteristic n-grams with a high ratio of matches in one system compared to the other.
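A minimal sketch of this ranking, written independently of compare-mt’s implementation, might look as follows:

```python
def salient_ngrams(matches_s1, matches_s2, alpha=1.0, top_k=10):
    """Rank n-grams by the smoothed probability p(s1|x) from Eq. (1).

    matches_s1 / matches_s2 are Counters mapping each n-gram to the
    number of times the corresponding system matched it against the
    reference.
    """
    prob = {x: (matches_s1[x] + alpha) /
               (matches_s1[x] + matches_s2[x] + 2 * alpha)
            for x in set(matches_s1) | set(matches_s2)}
    ranked = sorted(prob.items(), key=lambda kv: kv[1])
    return ranked[-top_k:][::-1], ranked[:top_k]  # s1's best, s2's best
```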
| n-gram | m_{s1}(x) | m_{s2}(x) | p(s1\|x) |
|---|---|---|---|
| phantom | 34 | 1 | 0.945 |
| Amy | 9 | 0 | 0.909 |
| , who | 8 | 0 | 0.900 |
| my mother | 7 | 0 | 0.889 |
| else happened | 5 | 0 | 0.857 |
| going to show you | 0 | 6 | 0.125 |
| going to show | 0 | 6 | 0.125 |
| hemisphere | 0 | 5 | 0.143 |
| Is | 0 | 5 | 0.143 |
| ’m going to show | 0 | 5 | 0.143 |

Tab. 2: Characteristic n-grams for each system (s1 = PBMT, s2 = NMT).
An example of n-grams discovered with this analysis is shown in Tab. 2. From this, we can then explore the references and outputs of each system, and figure out what phenomena resulted in these differences in n-gram accuracy. For example, further analysis showed that the relatively high accuracy of “hemisphere” for the NMT system was due to the propensity of the PBMT system to output the mis-spelling “hemispher,” which it picked up from a mistaken alignment. This may indicate the necessity to improve alignments for word stems, a problem that could not have easily been discovered from the bucketed analysis in the previous section.
Finally, compare-mt makes it possible to analyze and compare individual sentence examples based on statistics, or differences of statistics. Specifically, we can calculate a measure of accuracy of each sentence (e.g. sentence-level BLEU score), sort the sentences in the test set according to the difference in this measure, then display the examples where the difference in evaluation is largest in either direction.
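A sketch of this sorting step, assuming any sentence-level sent_score function such as smoothed sentence-level BLEU, might be:

```python
def most_divergent_examples(refs, sys1, sys2, sent_score, k=10):
    """Examples where sys1 most outperforms sys2, and vice versa.

    sent_score(hyp, ref) is any sentence-level measure, e.g. a smoothed
    sentence-level BLEU; this illustrates the sorting idea, not
    compare-mt's exact implementation.
    """
    scored = [(sent_score(h1, r) - sent_score(h2, r), r, h1, h2)
              for r, h1, h2 in zip(refs, sys1, sys2)]
    scored.sort(key=lambda t: t[0])
    return scored[-k:][::-1], scored[:k]  # sys1 wins, sys2 wins
```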
| Ref/Sys | BLEU | Text |
|---|---|---|
| Ref | - | Beth Israel ’s in Boston . |
| PBMT | 1.00 | Beth Israel ’s in Boston . |
| NMT | 0.41 | Beat Isaill is in Boston . |
| Ref | - | And what I ’m talking about is this . |
| PBMT | 0.35 | And that ’s what I ’m saying is this . |
| NMT | 1.00 | And what I ’m talking about is this . |

Tab. 3: Examples where one system outperforms the other, sorted by difference in sentence-level BLEU.
Tab. 3 shows two examples (cherry-picked from the top 10 sentence examples due to space limitations). We can see that in the first example, the PBMT-based system performs better on accurately translating a low-frequency named entity, while in the second example the NMT system accurately generates a multi-word expression with many frequent words. These concrete examples can help both reinforce our understanding of the patterns found in the holistic analysis above and uncover new examples that may lead to new methods for holistic analysis.
Here we discuss advanced features that allow for more sophisticated types of analysis using other sources of information than the references and system outputs themselves.
One feature that greatly improves the flexibility of analysis is compare-mt’s ability to do analysis over arbitrary word labels. For example, we can perform word accuracy analysis where we bucket the words by POS tags, as shown in Fig. 6. In the case of the PBMT vs. NMT analysis above, this uncovers the interesting fact that PBMT was better at generating base-form verbs, whereas NMT was better at generating conjugated verbs. This can also be applied to the n-gram analysis, finding which POS n-grams are generated well by one system or another, a type of analysis that was performed by Chiang et al. (2005) to understand differences in reordering between different systems.
Labels are provided by external files, where there is one label per word in the reference and system outputs, which means that generating these labels can be an arbitrary pre-processing step performed by the user without any direct modifications to the compare-mt code itself. These labels do not have to be POS tags, of course, and can also be used for other kinds of analysis. For example, one may perform analysis to find accuracy of generation of words with particular morphological tags Popović et al. (2006), or words that appear in a sentiment lexicon Mohammad et al. (2016).
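As an illustration of such a pre-processing step, the following hedged sketch writes one POS tag per token using NLTK’s pos_tag; the space-separated, token-parallel file layout is an assumption here, so consult the compare-mt documentation for the exact expected format:

```python
import nltk  # assumes nltk and its "averaged_perceptron_tagger" model are installed

def write_pos_labels(text_file, label_file):
    """Write one POS tag per token, space-separated, one sentence per
    line, token-parallel with the input text file."""
    with open(text_file) as fin, open(label_file, "w") as fout:
        for line in fin:
            tokens = line.split()
            tags = [tag for _, tag in nltk.pos_tag(tokens)]
            fout.write(" ".join(tags) + "\n")
```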
While most analysis up until this point focused on whether a particular word on the target side is accurate or not, it is also of interest which source-side words are or are not accurately translated. compare-mt also supports word accuracy analysis for source-language words given the source language input file, and alignments between the input and both the reference and the system outputs. Using alignments, compare-mt finds which words on the source side were generated correctly or incorrectly on the target side, and can do aggregate word accuracy analysis, either using word frequency or labels such as POS tags.
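The following is a simplified, self-contained illustration of this kind of bookkeeping, assuming alignments as (source index, target index) pairs from an external aligner such as fast_align; it is not compare-mt’s actual implementation:

```python
from collections import Counter

def source_word_accuracy(src_sents, ref_sents, sys_sents, ref_aligns, sys_aligns):
    """Per-source-word accuracy derived from alignments.

    ref_aligns / sys_aligns hold, per sentence, lists of (src_idx, tgt_idx)
    pairs. A source token counts as correctly translated if any system
    word aligned to it equals some reference word aligned to it -- a
    deliberately simplified criterion for illustration.
    """
    correct, total = Counter(), Counter()
    for src, ref, out, ra, sa in zip(src_sents, ref_sents, sys_sents,
                                     ref_aligns, sys_aligns):
        src_toks, ref_toks, out_toks = src.split(), ref.split(), out.split()
        ref_by_src, out_by_src = {}, {}
        for i, j in ra:
            ref_by_src.setdefault(i, set()).add(ref_toks[j])
        for i, j in sa:
            out_by_src.setdefault(i, set()).add(out_toks[j])
        for i, w in enumerate(src_toks):
            total[w] += 1
            if ref_by_src.get(i, set()) & out_by_src.get(i, set()):
                correct[w] += 1
    return {w: correct[w] / total[w] for w in total}
```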
Finally, as many recent methods can directly calculate a log likelihood for each word, we also provide another tool compare-ll that makes it possible to perform holistic analysis of these log likelihoods. First, the user creates a file where there is one log likelihood for each word in the reference file, and then, like the word accuracy analysis above, compare-ll can calculate aggregate statistics for this log likelihood based on word buckets.
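A minimal sketch of this kind of aggregation could look as follows; the assumption that each line holds one space-separated log likelihood per reference token is mine, made for illustration:

```python
from collections import defaultdict

def loglik_by_bucket(ref_sents, loglik_lines, train_freq, bucket_fn):
    """Average per-word log likelihood, aggregated by word bucket.

    loglik_lines holds one space-separated log likelihood per reference
    token; bucket_fn is any word -> bucket function, such as the
    frequency bucketing sketched earlier. An illustration of the
    aggregation, not compare-ll's implementation.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for ref, lls in zip(ref_sents, loglik_lines):
        for word, ll in zip(ref.split(), map(float, lls.split())):
            b = bucket_fn(word, train_freq)
            totals[b] += ll
            counts[b] += 1
    return {b: totals[b] / counts[b] for b in totals}
```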
One other useful feature is compare-mt’s ability to be easily extended to new types of analysis. For example,
If a user is interested in using a different evaluation metric, they could implement a new instance of the Scorer class and use it for aggregate score analysis (with significance tests), sentence bucket analysis, or sentence example analysis (a sketch of such an extension follows after this list).
If a user wanted to bucket words according to a different type of statistic or feature, they could implement their own instance of a Bucketer class, and use this in the word accuracy analysis.
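For instance, a toy scorer might look like the following. Note that the single-method interface shown here is an illustrative assumption; a real extension should subclass the Scorer class in compare-mt’s scorers module and follow its actual signatures:

```python
class LengthRatioScorer:
    """Toy corpus-level metric: output-to-reference length ratio.

    NOTE: this interface is an assumption for illustration, not
    compare-mt's actual Scorer API.
    """
    name = "length_ratio"

    def score_corpus(self, refs, outs):
        ref_len = sum(len(r.split()) for r in refs)
        out_len = sum(len(o.split()) for o in outs)
        return out_len / ref_len if ref_len else 0.0
```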
Over the past year or so, we have already been using compare-mt in our research to accelerate the analysis of our results and figure out what directions are most promising to pursue next. Accordingly, results from compare-mt have already made it into a number of our published papers. For example:
Figs. 4 and 5 of Wang et al. (2018) can be generated using sentence bucket analysis to measure “bucket=length, statistic=score” and “bucket=lengthdiff, statistic=count”.
Tab. 7 of Qi et al. (2018) shows the results of n-gram analysis, and Fig. 2 shows the results of frequency-based word accuracy analysis.
Fig. 4 of Sachan and Neubig (2018) shows the results of frequency-based word accuracy analysis.
Tab. 8 of Michel and Neubig (2018) used compare-mt to compare under- and over-generated n-grams.
Tab. 5 of Kumar and Tsvetkov (2018) used compare-mt for frequency-based word accuracy analysis.
There have been a wide variety of tools and methods developed to perform analysis of machine translation results. These can be broadly split into those that attempt to perform holistic analysis and those that attempt to perform example-by-example analysis.
compare-mt is a tool for holistic analysis over the entire corpus, and many of the individual pieces of functionality provided by compare-mt are inspired by previous works on this topic. Our word error rate analysis is inspired by previous work on automatic error analysis, which takes a typology of errors Flanagan (1994); Murata et al. (2005); Vilar et al. (2006), and attempts to automatically predict which sentences contain these errors Popović and Ney (2011); Zeman et al. (2011); Fishel et al. (2012). Many of the ideas contained in these works can be used easily in compare-mt. Measuring word matches, insertions, and deletions decomposed over POS/morphological tags Popović et al. (2006); Popović and Ney (2007); Zeman et al. (2011); El Kholy and Habash (2011) or other “linguistic checkpoints” Zhou et al. (2008); Naskar et al. (2011) can be largely implemented using the labeled bucketing functionality described in §3. Analysis of word reordering accuracy Birch et al. (2010); Popović and Ney (2011); Bentivogli et al. (2016) can be done through the use of reordering-sensitive measures such as RIBES as described in §2. In addition, the extraction of salient n-grams is inspired by similar approaches for POS n-gram Chiang et al. (2005); Lopez and Resnik (2005) and word n-gram Akabe et al. (2014) based analysis, respectively. To the best of our knowledge, and somewhat surprisingly, no previous analysis tool has included the flexible sentence-bucketed analysis that is provided by compare-mt.
One other practical advantage of compare-mt compared to other tools is that it is publicly available under the BSD license on GitHub,333https://github.com/neulab/compare-mt and written in modern Python, which is quickly becoming the standard programming language of the research community. Many other tools are either no longer available Stymne (2011), or written in other languages such as Perl Zeman et al. (2011) or Java Naskar et al. (2011), which provides some degree of practical barrier to their use and extension.
In contrast to the holistic methods listed above, example-by-example analysis methods attempt to intelligently visualize single translation outputs in a way that can highlight salient differences between the outputs of multiple systems, or help understand the inner workings of a system. There is a plethora of tools that aid the manual analysis of individual outputs of multiple systems through visualization or highlighting of salient parts of the output Lopez and Resnik (2005); Stymne (2011); Zeman et al. (2011); Madnani (2011); Aziz et al. (2012); Gonzàlez et al. (2012); Federmann (2012); Chatzitheodorou and Chatzistamatis (2013); Klejch et al. (2015). There has also been work that attempts to analyze the internal representations or alignments of phrase-based DeNeefe et al. (2005); Weese and Callison-Burch (2010) and neural Ding et al. (2017); Lee et al. (2017) machine translation systems to understand why they arrived at the decisions they did. While these tools are informative, they play a role complementary to the holistic analysis that compare-mt proposes, and adding the ability to examine individual examples more visually and extensively is planned as future work.
Recently, there has been a move towards creating special-purpose diagnostic test sets designed specifically to test an MT system’s ability to handle a particular phenomenon. For example, these cover things like grammatical agreement Sennrich (2017), translation of pronouns or other discourse-sensitive phenomena Müller et al. (2018); Bawden et al. (2018), or diagnostic tests for a variety of different phenomena Isabelle et al. (2017). These sets are particularly good for evaluating long-tail phenomena that may not appear in naturalistic data, but are often limited to specific language pairs and domains. compare-mt takes a different approach of analyzing the results on existing test sets and attempting to extract salient phenomena, an approach that affords more flexibility but less focus than these special-purpose methods.
In this paper, we presented an open-source tool for holistic analysis of the results of machine translation or other language generation systems. It makes it possible to discover salient patterns that may help guide further analysis.
compare-mt is evolving, and we plan to add more functionality as it becomes necessary to further understand cutting-edge techniques for MT. One concrete future plan includes better integration with example-by-example analysis (after doing holistic analysis, clicking through to individual examples that highlight each trait), but many more improvements will be made as the need arises.
Acknowledgements: The authors thank the early users of compare-mt and anonymous reviewers for their feedback and suggestions (especially Reviewer 1, who found a mistake in a figure!). This work is sponsored in part by Defense Advanced Research Projects Agency Information Innovation Office (I2O) Program: Low Resource Languages for Emergent Incidents (LORELEI) under Contract No. HR0011-15-C0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
Fig. 7 shows an example of the command that was used to generate the report containing the figures and tables used in this paper.