Evaluating models is a critical step of the machine learning workflow. However, unlike classification-based tasks, evaluating models which generate text is difficult and is a research area on its own. The basic workflow for developing a new automatic evaluation metric is to design/implement the metric, calculate its correlation to human judgments, then use that metric to evaluate text generation systems.
While there have been significant efforts to build libraries for developing machine learning models (Klein et al., 2017; Gardner et al., 2018; Ott et al., 2019), no equivalent library exists for developing evaluation metrics. In this work, we present SacreROUGE, an open-source, Python-based library for using and developing text generation metrics, with an emphasis on summarization.
SacreROUGE removes many obstacles that researchers face when they use or develop evaluation metrics. First, the official implementations of various metrics do not share a common interface or programming language, so using many metrics to evaluate a model can be frustrating and time-consuming. SacreROUGE provides Python-based wrappers around many evaluation metrics so they all implement a simple, easy-to-use interface regardless of how they are implemented internally (§2).
Second, evaluating metrics themselves can be tricky. Correlations between metric values and human judgments are calculated in several different ways, there are multiple commonly used correlation coefficients, and fairly comparing human-written references to system output requires implementing jackknifing. Since the evaluation code in SacreROUGE is shared across all of the metrics, any metric that implements the common Metric interface can be evaluated without writing additional code (§3).
Third, datasets that contain judgments which are commonly used to evaluate metrics do not share the same format, so writing code to load each dataset requires a significant amount of effort. SacreROUGE provides scripts for popular summarization datasets that load and reformat them into a common schema so they can easily be used for evaluation (§4).
The development of SacreROUGE is ongoing. We intend to add more metrics and datasets to the library as they become available. Further, we encourage researchers to use the SacreROUGE framework both to run existing metrics and to develop new ones. SacreROUGE is released under the Apache 2.0 license and is open to contributions from the community.
2 The Metric Interface
The development of evaluation metrics for summarization has been an active area of research for two decades. However, the community has not converged on a consistent format for the input data, so each metric uses its own custom schema. Further, the published code for evaluation metrics is written in various programming languages based on which language was popular when the metric was proposed. These challenges make it very cumbersome to use multiple metrics to evaluate a summarization system. SacreROUGE addresses these two problems by unifying all of the metrics’ implementations into a common interface called Metric. The interface provides a Pythonic API that allows for evaluating an individual summary or batch of summaries. Since all of the metrics share the same interface, evaluating a summarization system with several different metrics is trivial.
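As a rough illustration of what such a shared interface makes possible, the sketch below defines a hypothetical Metric base class and one toy metric. The method names score and score_all and the UnigramRecall class are assumptions made for this example; they are not taken from the SacreROUGE API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Metric(ABC):
    """Hypothetical common interface; the method names are illustrative only."""

    @abstractmethod
    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        """Score one summary against its references."""

    def score_all(self, summaries: List[str],
                  references_list: List[List[str]]) -> List[Dict[str, float]]:
        """Score a batch by delegating to the single-summary method."""
        return [self.score(s, refs) for s, refs in zip(summaries, references_list)]


class UnigramRecall(Metric):
    """Toy metric: fraction of distinct reference words found in the summary."""

    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        summary_words = set(summary.lower().split())
        reference_words = set(" ".join(references).lower().split())
        if not reference_words:
            return {"unigram_recall": 0.0}
        overlap = len(summary_words & reference_words)
        return {"unigram_recall": overlap / len(reference_words)}


# Any object that implements Metric can be swapped in without changing this loop.
metrics: List[Metric] = [UnigramRecall()]
for metric in metrics:
    print(metric.score("the cat sat", ["the cat sat down"]))
```

Because the calling code only depends on the base class, adding a further metric to the list requires no other changes.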
In order to support older evaluation metrics written in languages such as Perl or Java, we have written Python wrappers around the original code that still implement the Metric interface. Internally, each wrapper serializes the input summaries to the format required by the underlying metric, creates a subprocess to run the original metric’s code, and then loads the output from disk back into Python. This way, we do not have to port the original metric’s code to Python and end-users can still use the metrics with the Python API.
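The serialize, run, and load pattern can be sketched as follows. This is a minimal illustration and not the library's wrapper code: the "legacy tool" is a hypothetical stand-in script (run with the current Python interpreter so the example is self-contained), and the file names and the function run_wrapped_metric are assumptions.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List

# Hypothetical stand-in for an external tool (e.g., a Perl or Java program).
# It reads one summary per line and writes each line's token count as JSON.
LEGACY_TOOL = """
import json, sys
with open(sys.argv[1]) as f:
    counts = [len(line.split()) for line in f.read().splitlines()]
with open(sys.argv[2], "w") as f:
    json.dump(counts, f)
"""


def run_wrapped_metric(summaries: List[str]) -> List[int]:
    """Score summaries by calling the external tool through a subprocess."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "input.txt"
        output_path = Path(tmp) / "output.json"
        # 1. Serialize the summaries to the format the tool expects
        #    (one per line; summaries are assumed to contain no newlines).
        input_path.write_text("\n".join(summaries), encoding="utf-8")
        # 2. Run the original code in a subprocess.
        subprocess.run(
            [sys.executable, "-c", LEGACY_TOOL, str(input_path), str(output_path)],
            check=True,
        )
        # 3. Load the tool's output from disk back into Python.
        return json.loads(output_path.read_text(encoding="utf-8"))


print(run_wrapped_metric(["a b c", "d e"]))
```

Passing check=True makes a failure in the external program surface as a Python exception rather than a silently missing output file.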
SacreROUGE currently supports the following evaluation metrics: AutoSummENG (Giannakopoulos et al., 2008), BERTScore (Zhang et al., 2019), BEwT-E (Tratz and Hovy, 2008), METEOR (Denkowski and Lavie, 2014), MeMoG (Giannakopoulos and Karkaletsis, 2010), NPowER (Giannakopoulos and Karkaletsis, 2013), ROUGE (Lin, 2004), a near-identical Python-based version of ROUGE that we wrote, SIMetrix (Louis and Nenkova, 2009), and SumQE (Xenouleas et al., 2019).
Many of the evaluation metrics rely on external resources in the form of code, models, or data files. Setting up these dependencies in the right format to use the metrics can be difficult.
The SacreROUGE library addresses this problem by providing setup scripts for each metric which download or compile any required resources. To make this process as easy as possible for the end-user, these scripts are run through a setup-metric command. The command takes the name of the metric to set up, then downloads the required dependencies to a common folder which is managed by SacreROUGE. Hiding the metric setup behind a simple command lets the end-user quickly and easily begin using all of the metrics within the library.
3 Evaluating Systems and Metrics
The two most common use cases of an evaluation metric are to evaluate a summarization system and to evaluate a metric itself by calculating its correlation to human judgments. Since all of the metrics in SacreROUGE implement a common interface, the code for these procedures is shared, so developers of new metrics do not need to rewrite the code to implement these procedures. This logic is exposed through evaluate, score, and correlate, which are subcommands of sacrerouge, the entry point for the library’s command-line interface.
The evaluate Subcommand
The evaluate subcommand accepts a specific metric and an input file that contains the output of a summarization system for an input corpus. The command will load the input data, pass it to the metric, and save the metric’s output at the summary-level and system-level. The summary-level output contains the metric’s value for each individual summary, whereas system-level output represents the average performance across the dataset and is most often reported in papers.
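The relationship between the two output levels can be sketched as below. The function evaluate and the toy metric toy_precision are hypothetical names used only for illustration, and the system-level value is assumed to be a plain mean of the summary-level values.

```python
from statistics import mean
from typing import Callable, List, Tuple

MetricFn = Callable[[str, List[str]], float]


def toy_precision(summary: str, references: List[str]) -> float:
    """Toy metric: fraction of summary words that appear in any reference."""
    words = summary.split()
    reference_words = {w for ref in references for w in ref.split()}
    return sum(w in reference_words for w in words) / len(words) if words else 0.0


def evaluate(summaries: List[str], references_list: List[List[str]],
             metric_fn: MetricFn) -> Tuple[List[float], float]:
    """Return (summary-level scores, system-level score) for one system."""
    # Summary level: one metric value per individual summary.
    summary_level = [metric_fn(s, refs) for s, refs in zip(summaries, references_list)]
    # System level: the average over the whole dataset.
    system_level = mean(summary_level)
    return summary_level, system_level


print(evaluate(["a b", "a c"], [["a c"], ["a c"]], toy_precision))
```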
The score Subcommand
Evaluating metrics themselves is exposed through the score and correlate subcommands. The score subcommand is very similar to evaluate except for two key differences. First, the input data is not expected to be the output from a single system; correlations to human judgments often involve scoring summaries from many different summarization models. Consequently, no system-level metrics are calculated.
Second, the score subcommand will run jackknifing on the input data when possible and necessary. Jackknifing is a procedure which allows the value of a metric on system-produced and human-written summaries to be fairly compared when the human-written summaries are used to assess the quality of the system summary. Briefly, if there is more than one reference summary, each reference is evaluated against all of the others. Each system summary is repeatedly evaluated against each possible subset of the reference summaries that has one reference removed. The final system summary score is an average across those evaluations. When jackknifing is performed, a _jk suffix is appended to the name of the metric which makes it clear that it is not comparable to the non-jackknifed version.
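The procedure described above can be sketched in a few lines. The function names and the toy metric are hypothetical; only the leave-one-out logic follows the description in the text.

```python
from statistics import mean
from typing import Callable, List

MetricFn = Callable[[str, List[str]], float]


def toy_precision(summary: str, references: List[str]) -> float:
    """Toy metric: fraction of summary words that appear in any reference."""
    words = summary.split()
    reference_words = {w for ref in references for w in ref.split()}
    return sum(w in reference_words for w in words) / len(words) if words else 0.0


def jackknife_system(summary: str, references: List[str], metric_fn: MetricFn) -> float:
    """Average the metric over every reference subset with one reference removed."""
    if len(references) < 2:
        # With a single reference there is nothing to hold out.
        return metric_fn(summary, references)
    scores = []
    for i in range(len(references)):
        subset = references[:i] + references[i + 1:]
        scores.append(metric_fn(summary, subset))
    return mean(scores)


def jackknife_reference(index: int, references: List[str], metric_fn: MetricFn) -> float:
    """Score one human-written reference against all of the other references."""
    others = references[:index] + references[index + 1:]
    return metric_fn(references[index], others)


references = ["a", "b", "c"]
print(jackknife_system("a b", references, toy_precision))
print(jackknife_reference(0, references, toy_precision))
```

In both cases the summary being scored is compared against one fewer than the total number of references, which is what makes the system and human scores comparable.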
The correlate Subcommand
After the score subcommand is complete, the correlate subcommand can be used to calculate the correlation between two metrics. SacreROUGE calculates the three correlation coefficients most commonly used in summarization: Pearson, Spearman, and Kendall. Further, these correlations are computed at three different granularities: the summary-level, the system-level, and globally. The summary-level correlation calculates the average correlation per input. The system-level calculates the correlation between average system performances for each metric. The global correlation directly calculates the correlation between all of the observed metric values. The former two granularities are most often used in the summarization literature.
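To make the three granularities concrete, the following sketch computes each of them with a hand-written Pearson coefficient. The record layout and function names are assumptions for illustration; Spearman and Kendall would be aggregated in the same way and are omitted for brevity.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: (input_id, system_id, metric_value, human_value).
Record = Tuple[str, str, float, float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def _group(records: List[Record], key_index: int) -> Dict[str, List[Tuple[float, float]]]:
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append((record[2], record[3]))
    return groups


def summary_level(records: List[Record]) -> float:
    # Correlate across systems within each input, then average over inputs.
    per_input = []
    for pairs in _group(records, 0).values():
        xs, ys = zip(*pairs)
        per_input.append(pearson(xs, ys))
    return mean(per_input)


def system_level(records: List[Record]) -> float:
    # Average each system's values first, then correlate the averages.
    xs, ys = [], []
    for pairs in _group(records, 1).values():
        xs.append(mean(x for x, _ in pairs))
        ys.append(mean(y for _, y in pairs))
    return pearson(xs, ys)


def global_level(records: List[Record]) -> float:
    # Correlate all observed (metric, human) pairs directly.
    return pearson([r[2] for r in records], [r[3] for r in records])


records = [
    ("d1", "s1", 1.0, 2.0), ("d1", "s2", 2.0, 4.0), ("d1", "s3", 3.0, 6.0),
    ("d2", "s1", 1.0, 3.0), ("d2", "s2", 2.0, 2.0), ("d2", "s3", 3.0, 1.0),
]
print(summary_level(records), system_level(records), global_level(records))
```

On this toy data the three granularities disagree sharply, which illustrates why the level at which a correlation is reported should always be stated.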
Handling Different Input Requirements
It is often the case that different metrics require different input data (e.g., some metrics use reference summaries, others need access to the input documents). Therefore, the required data must be loaded from the input file and the evaluate and score subcommands must pass the required data to the metric.
The interface for loading data from an input file in SacreROUGE is called a DatasetReader. Given one or more input files, a DatasetReader loads the Fields for the evaluation instances. A Field is a base class which contains the data for an input instance, such as a DocumentsField that maintains the contents of the input documents. Then, each evaluation instance contains a mapping from the name of a field to its data.
In order to pass the appropriate Fields to the summarization metrics, we require that every class that implements the Metric interface lists the names of the Fields that it uses. For instance, the wrapper for the document-based evaluation metric SIMetrix specifies it needs a field called documents, a key in the evaluation instance Field mapping. Then, once the input data has been loaded, the evaluate and score commands can pass the required data to a metric for evaluation.
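A minimal sketch of this dispatch mechanism is given below. The attribute name required_fields, the instance layout, and both toy metric classes are assumptions made for illustration rather than the library's actual classes.

```python
from typing import Any, Dict, List

# Hypothetical evaluation instance: a summary plus a mapping from field
# names to the data held by each field.
Instance = Dict[str, Any]


class ReferenceBasedMetric:
    # The metric declares the names of the fields it needs.
    required_fields = ["references"]

    def score(self, summary: str, references: List[str]) -> float:
        """Toy score: 1.0 if the summary exactly matches a reference."""
        return 1.0 if summary in references else 0.0


class DocumentBasedMetric:
    required_fields = ["documents"]

    def score(self, summary: str, documents: List[str]) -> float:
        """Toy score: fraction of summary words found in the input documents."""
        words = summary.split()
        document_words = {w for doc in documents for w in doc.split()}
        return sum(w in document_words for w in words) / len(words) if words else 0.0


def score_instance(metric: Any, instance: Instance) -> float:
    """Look up only the fields the metric asks for and pass them by name."""
    kwargs = {name: instance["fields"][name] for name in metric.required_fields}
    return metric.score(instance["summary"], **kwargs)


instance = {
    "summary": "a b",
    "fields": {"references": ["a b"], "documents": ["a c d"]},
}
for metric in (ReferenceBasedMetric(), DocumentBasedMetric()):
    print(type(metric).__name__, score_instance(metric, instance))
```

The same score_instance function serves both metrics even though they consume different inputs, which is the property the shared evaluate and score commands rely on.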
Automatically Generated Subcommands
It is desirable to have a different evaluate and score subcommand for each individual metric so that developers can easily specify different metric parameters on the command line. A naive implementation of this would require manually creating the subcommand for each metric. However, in order to eliminate as much boilerplate code as possible, SacreROUGE includes a feature to automatically generate these subcommands for any metric that implements the Metric interface.
Using Python’s inspect and typing libraries, we are able to examine the constructor of each metric and generate a command-line argument for each parameter. For parameters with primitive types, the argparse library directly supports casting command line parameters to the correct types. However, some metrics may use complex types, such as a list of integers. In such situations, SacreROUGE assumes that the command line argument will be a string-serialized JSON object that can be deserialized into the required type at runtime. This allows us to support automatically generating evaluate and score subcommands for every metric supported by the library.
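The idea can be sketched with the standard library alone. The ToyMetric class, its parameters, and the helper names are hypothetical, and the sketch is simplified: it treats int, float, and str as primitives and parses everything else as JSON.

```python
import argparse
import inspect
import json
from typing import List, Optional

PRIMITIVES = (int, float, str)


class ToyMetric:
    """Hypothetical metric whose constructor parameters become CLI flags."""

    def __init__(self, max_ngram: int = 4, alpha: float = 0.5,
                 ngram_orders: Optional[List[int]] = None):
        self.max_ngram = max_ngram
        self.alpha = alpha
        self.ngram_orders = ngram_orders if ngram_orders is not None else [1, 2]


def add_metric_arguments(parser: argparse.ArgumentParser, metric_cls: type) -> None:
    """Add one command-line flag per constructor parameter of metric_cls."""
    signature = inspect.signature(metric_cls.__init__)
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        # Primitive types are cast by argparse; anything else is parsed as JSON.
        arg_type = param.annotation if param.annotation in PRIMITIVES else json.loads
        required = param.default is inspect.Parameter.empty
        parser.add_argument(
            f"--{name}",
            type=arg_type,
            required=required,
            default=None if required else param.default,
        )


def build_metric(metric_cls: type, argv: List[str]):
    """Parse argv with the generated flags and construct the metric."""
    parser = argparse.ArgumentParser(prog=metric_cls.__name__)
    add_metric_arguments(parser, metric_cls)
    return metric_cls(**vars(parser.parse_args(argv)))


metric = build_metric(ToyMetric, ["--max_ngram", "2", "--ngram_orders", "[1, 3]"])
print(metric.max_ngram, metric.alpha, metric.ngram_orders)
```

Boolean parameters are deliberately left out of the sketch, since passing bool as an argparse type treats any non-empty string as true and would need separate handling.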
4 A Common Dataset Format
Over the past two decades, the summarization community has collected a large number of expensive summarization datasets and human quality annotations. However, these very useful datasets are seldom saved in a common format, forcing every researcher who wants to train a model on the datasets or use the judgments to evaluate a metric to write boilerplate code to load the data.
To mitigate this issue, SacreROUGE provides scripts that will load the datasets and their corresponding judgments, then serialize them to new files with a common format. The data is serialized in such a manner that it can be directly used in the evaluate, score, and correlate subcommands, thereby making it incredibly easy to run or evaluate any metric in the library on the dataset.
The scripts to preprocess the datasets are exposed through the setup-dataset subcommand. The subcommand accepts the name of a dataset, an output directory, and any potential dataset-specific arguments. Then, SacreROUGE will load and preprocess the respective dataset. For datasets which are publicly available, the scripts will download the data automatically. However, many summarization datasets are licensed, so the corresponding preprocessing scripts require that paths to the original data be supplied to the command.
The datasets which are currently supported by SacreROUGE are the Document Understanding Conference from 2001 to 2007 (https://duc.nist.gov/), the Text Analysis Conference from 2008 to 2011 (https://tac.nist.gov/), the MultiLing 2011, 2013, 2015, 2017, and 2019 Workshops (http://multiling.iit.demokritos.gr/), and the CNN/DailyMail dataset judgments provided by Chaganty et al. (2018). We intend to add more datasets as they become available.
5 Related Work
The name and idea for SacreROUGE came from the SacreBLEU (Post, 2018) library. SacreBLEU was developed to standardize and simplify calculating BLEU (Papineni et al., 2002) for machine translation. Like SacreROUGE, it provides a simple command-line interface to download and evaluate on common machine translation datasets. Whereas SacreBLEU is mainly for evaluating machine translation models with BLEU, our library focuses on summarization and includes a large number of evaluation metrics. Further, SacreROUGE also provides a framework for developing and evaluating new metrics.
Much of the design of SacreROUGE was inspired by AllenNLP (Gardner et al., 2018), a library built on PyTorch (Paszke et al., 2017) for developing deep learning models. AllenNLP provides useful abstractions over different models and neural network modules that allow for the sharing of boilerplate code so developers can quickly create and train new machine learning models. SacreROUGE provides similar abstractions for evaluation metrics.
Recently, Hugging Face released a library called nlp (https://github.com/huggingface/nlp) that sets out to achieve similar goals to SacreROUGE. Namely, they also standardize loading different datasets and provide a Pythonic API to many popular evaluation metrics. However, because their library is focused on a large number of NLP tasks and SacreROUGE is built specifically for summarization, SacreROUGE is able to support more summarization datasets and metrics. Further, unlike SacreROUGE, nlp does not provide a framework for developing and evaluating new metrics.
6 Conclusion
We have presented SacreROUGE, an open-source library dedicated to the development of summarization evaluation metrics. With a unified metric interface and common data format, our library makes it very simple to use existing evaluation metrics as well as develop new ones with a minimum amount of effort. We hope that future researchers will contribute their own metrics and datasets to the library so that it is as easy as possible to run and evaluate summarization metrics.
References
- Chaganty et al. (2018). The price of debiasing automatic metrics in natural language evaluation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 643–653.
- Denkowski and Lavie (2014). Meteor universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, pp. 376–380.
- Gardner et al. (2018). AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6.
- Giannakopoulos et al. (2008). Summarization system evaluation revisited: n-gram graphs. TSLP 5 (3), pp. 5:1–5:39.
- Giannakopoulos and Karkaletsis (2010). Summarization system evaluation variations based on n-gram graphs. In Proceedings of the Third Text Analysis Conference, TAC 2010, Gaithersburg, Maryland, USA, November 15-16, 2010.
- Giannakopoulos and Karkaletsis (2013). Summary evaluation: together we stand npower-ed. In Computational Linguistics and Intelligent Text Processing - 14th International Conference, CICLing 2013, Samos, Greece, March 24-30, 2013, Proceedings, Part II, A. F. Gelbukh (Ed.), Lecture Notes in Computer Science, Vol. 7817, pp. 436–450.
- Klein et al. (2017). OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL.
- Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- Louis and Nenkova (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 306–314.
- Ott et al. (2019). Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Papineni et al. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318.
- Paszke et al. (2017). Automatic differentiation in pytorch.
- Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, pp. 186–191.
- Tratz and Hovy (2008). Summarization evaluation using transformed basic elements. In Proceedings of the First Text Analysis Conference, TAC 2008, Gaithersburg, Maryland, USA, November 17-19, 2008.
- Xenouleas et al. (2019). SUM-QE: a bert-based summary quality estimation model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 6004–6010.
- Zhang et al. (2019). BERTScore: evaluating text generation with BERT. CoRR abs/1904.09675.