Shared Task on Evaluating Accuracy in Natural Language Generation

06/22/2020 ∙ by Ehud Reiter, et al. ∙ University of Aberdeen

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts. Participants will measure the accuracy of basketball game summaries produced by NLG systems from basketball box score data.




1 Introduction

Users expect data-to-text NLG systems to generate textual summaries which are accurate. However, many NLG systems, neural ones in particular, generate texts which are factually incorrect. This is an aspect of hallucination. There are many kinds of accuracy errors in NLG texts; for more information, see Reiter (2020).

The gold standard for accuracy evaluation is to ask human annotators to carefully fact-check generated texts against the source data. However, this is a time-consuming process. Our experience at Aberdeen shows that it can take an experienced annotator 30 minutes to fact-check a moderately complicated 300-word paragraph produced by a neural data-to-text NLG system.

It would be very useful to the NLG community if we could come up with quicker and easier ways of measuring accuracy which had good correlations with careful fact-checking. These could be based on less time-consuming human evaluations, such as asking subjects to rate the accuracy of a text on a Likert-type scale (van der Lee et al., 2019), or on automatic metrics. But we should only use such techniques if we feel confident that they have good agreement and correlation with gold-standard fact-checking.

The goal of our proposed shared task is to encourage innovative ideas for evaluating accuracy, including both automatic metrics and protocols for human evaluation. Participants will apply their techniques to summaries of basketball games generated from box score data by neural NLG systems such as Wiseman et al. (2017), Puduppully et al. (2019), and Rebuffel et al. (2020). We will assess how well the results produced by the participants' techniques correlate with gold-standard fact-checking.

The shared task is unusual because submissions can be protocols for human evaluations as well as computer algorithms (i.e., metrics). The community has limited experience with shared tasks which evaluate protocols, and we hope our experiences will help develop a better understanding of how to run such shared tasks, as well as a better understanding of how to evaluate the accuracy of NLG texts.

2 Organisers

The organisers are

  • Ehud Reiter, University of Aberdeen

  • Craig Thomson, University of Aberdeen

3 Task Description

Participants will be asked to make one or more submissions, each of which describes either

  • An evaluation protocol for human subjects which assesses the accuracy of generated texts. This should include experimental design, guidance on number and type of subjects, and recommended statistical analysis (van der Lee et al., 2019). The subjects will have access to data about the game and the teams, and also (if part of the protocol) to a reference text.

  • An automatic metric (algorithm) which computes the accuracy of a generated text. The algorithm will have access to data about the game and the teams, and to a reference text.

It is fine for submissions to give human subjects or metrics access to additional data beyond the box score data used to generate the texts. The goal is to find statements which are not true in the real world (i.e., classic fact-checking), not just statements which disagree with (or are not derivable from) the box score data.

The output of the evaluation protocol or metric will be a list of mistakes in the text. Each mistake will be characterised by

  • Its position in the text.

  • A category. We are currently using the following categories, although these may evolve.

    • Incorrect number: It does not matter whether the number is spelled out or is in digits.

    • Incorrect named entity: This includes people, places, teams, and days of the week.

    • Incorrect word: A word which is not one of the above and is incorrect.

    • Context error: A phrase which causes an incorrect inference because of context or discourse.

    • Not checkable: A statement which cannot be checked, either because the information is not available or because it is too time-consuming to check.

    • Other: Any other type of mistake.
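A mistake record of this kind can be sketched as a small data structure. This is only an illustration under stated assumptions: the proposal does not fix how a position is encoded, so the half-open token span (`start`, `end`) and the names `Category`, `Mistake`, and `find_span` are hypothetical choices, not part of the task definition.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The six categories listed above.
    NUMBER = "Incorrect number"
    NAMED_ENTITY = "Incorrect named entity"
    WORD = "Incorrect word"
    CONTEXT = "Context error"
    NOT_CHECKABLE = "Not checkable"
    OTHER = "Other"


@dataclass(frozen=True)
class Mistake:
    # Assumed encoding: half-open token span [start, end) in the generated text.
    start: int
    end: int
    category: Category

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")


def find_span(tokens, phrase_tokens):
    """Return the first [start, end) span where phrase_tokens occurs, or None."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i, i + n
    return None


# Whitespace tokenisation is an assumption made for this sketch only.
tokens = "The Grizzlies were led by Isaiah Thomas".split()
start, end = find_span(tokens, ["Isaiah", "Thomas"])
mistake = Mistake(start, end, Category.CONTEXT)
```

A character-offset encoding would work equally well; what matters for scoring is that two positions can be compared for overlap.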

An example is shown in Figure 1. Note that this example combines fragments from texts produced by several different systems, in order to illustrate different types of mistakes. Box score data for this game is available at .

The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3-2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59-42. The Grizzlies were led by Isaiah Thomas, who is averaging 19 points in the season so far.

List of errors:

  • 2: incorrect number, should be 0.

  • Monday: incorrect named entity, should be Wednesday.

  • Talking Stick Resort Arena: incorrect named entity, should be US Airways Arena.

  • strong: incorrect word, the Grizzlies did not do well in the first half.

  • out-scored: incorrect word, the Suns had a higher score in the first half.

  • 59: incorrect number, should be 46.

  • 42: incorrect number, should be 52.

  • led: incorrect word. Thomas did not lead the Grizzlies since he played for the Suns.

  • Isaiah Thomas: Context error. Thomas played for the Suns, but context here implies he played for the Grizzlies.

  • averaging 19 points in the season so far: Not checkable. This is very hard to check, since data sources report performance per season and per game, not performance at a particular point in a season.

Figure 1: Example text with error annotations. Corrections and explanations are not required, but are included here for clarity. Box score data for this game is available at .


We will also ask participants to submit estimates of the time required to find mistakes in a text (human time for human evaluations, and CPU/GPU time for metrics). Submitting these estimates is optional.

We also plan to have an ’open’ track where people can submit ideas for evaluating accuracy on our data set which do not fit into the above framework.

4 Data

We will use texts produced by three systems that use basketball box score data: Wiseman et al. (2017), Puduppully et al. (2019), and Rebuffel et al. (2020). We are currently in the process of getting 30 texts (ten from each system) carefully fact-checked. We will ask each participant in the shared task to manually fact-check an additional 20 texts. If we have 6 participants, this will result in a total of 150 fact-checked texts (30 + 6 × 20), which will serve as training data.

Participants will also have access to all of the texts produced by each of the three systems, along with source box score data and a human-written reference text.

We will create a separate test set of 45 texts which will be manually fact-checked.

5 Evaluation Plans

We will release the test set (but not the manual fact-checking annotations), and give participants two weeks to apply their techniques to the test set and return the results. Each mistake will be reported as a position and category, as described above. We will create a Reported Mistake List (RML) for each annotated text submitted by a participant.

We will then try to align each RML entry with an entry in the gold standard mistake list (GSML) for the same text, as follows:

  • First look for a GSML entry which is an exact match to the RML entry.

  • If not found, look for a GSML entry with the same category and maximal (non-zero) overlap in position.

  • If not found, look for a GSML entry with a different category and maximal (non-zero) overlap in position.

  • If not found, the RML entry cannot be aligned with any GSML entry.
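The alignment steps above can be sketched as follows. This is a minimal illustration, not the organisers' implementation: it assumes positions are half-open token spans, measures overlap as the number of shared tokens, aligns each RML entry independently (so one GSML entry may be matched by several RML entries), and breaks ties by taking the first candidate. The names `Entry`, `overlap`, and `align` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    start: int      # assumed: half-open token span [start, end)
    end: int
    category: str


def overlap(a: Entry, b: Entry) -> int:
    """Number of tokens shared by the two spans (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def align(rml_entry: Entry, gsml: List[Entry]) -> Optional[Entry]:
    """Return the GSML entry aligned with rml_entry, or None if there is none."""
    # Step 1: exact match on both position and category.
    if rml_entry in gsml:
        return rml_entry
    # Step 2 (same category), then step 3 (different category):
    # pick the candidate with maximal non-zero overlap.
    for same_category in (True, False):
        candidates = [
            g for g in gsml
            if (g.category == rml_entry.category) == same_category
            and overlap(rml_entry, g) > 0
        ]
        if candidates:
            return max(candidates, key=lambda g: overlap(rml_entry, g))
    # Step 4: no alignment is possible.
    return None


gsml = [
    Entry(3, 4, "Incorrect word"),
    Entry(5, 7, "Context error"),
    Entry(10, 16, "Not checkable"),
]
exact = align(Entry(5, 7, "Context error"), gsml)
partial = align(Entry(9, 12, "Not checkable"), gsml)
cross_category = align(Entry(3, 4, "Incorrect number"), gsml)
unaligned = align(Entry(20, 21, "Other"), gsml)
```

Whether a GSML entry may be claimed by more than one RML entry is not specified in the text; a one-to-one variant would need to remove matched GSML entries as it goes.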

When we have done this, we will compute a set of scores as follows:

  • Recall and precision for each category. In other words, for each category, what percentage of mistakes of this type in GSML were aligned with an RML entry of this category, and vice-versa.

  • Overall recall and precision (ignoring category). Looking at RML as a whole, what percentage of entries were successfully aligned with a GSML entry (of any category), and vice-versa.
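Given an alignment, these scores could be computed along the following lines. The sketch is hedged in several ways: the input layout (`aligned_to` holding a GSML index or `None` for each RML entry), the function name `score`, and the convention of returning 0.0 when a denominator is empty are all assumptions, not details given in the proposal.

```python
from typing import Dict, List, Optional, Tuple


def score(rml_categories: List[str],
          gsml_categories: List[str],
          aligned_to: List[Optional[int]]) -> Dict[str, Tuple[float, float]]:
    """Return {category: (precision, recall)} plus an "overall" entry.

    aligned_to[i] is the index of the GSML entry aligned with RML entry i,
    or None if that RML entry could not be aligned.
    """
    def ratio(hits: int, total: int) -> float:
        # Assumed convention: an empty denominator yields 0.0.
        return hits / total if total else 0.0

    pairs = [(i, j) for i, j in enumerate(aligned_to) if j is not None]
    scores = {}
    for cat in sorted(set(rml_categories) | set(gsml_categories)):
        # Alignments where both sides carry this category.
        same = [(i, j) for i, j in pairs
                if rml_categories[i] == cat and gsml_categories[j] == cat]
        precision = ratio(len({i for i, _ in same}), rml_categories.count(cat))
        recall = ratio(len({j for _, j in same}), gsml_categories.count(cat))
        scores[cat] = (precision, recall)
    # Overall scores ignore category: any successful alignment counts.
    scores["overall"] = (
        ratio(len(pairs), len(rml_categories)),
        ratio(len({j for _, j in pairs}), len(gsml_categories)),
    )
    return scores


rml = ["Incorrect number", "Incorrect word", "Incorrect number", "Other"]
gsml = ["Incorrect number", "Incorrect word", "Incorrect named entity"]
aligned_to = [0, 1, 2, None]   # the last RML entry has no GSML counterpart
result = score(rml, gsml, aligned_to)
```

In this toy input, the third RML entry is aligned with a GSML entry of a different category, so it counts towards the overall scores but not towards either per-category score.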

6 Schedule

We plan on the following schedule:

  • Soon after INLG2020: announce task, ask for participants.

  • 6 months before INLG2021: deadline for participants to register and provide 20 manually fact-checked stories.

  • 3 months before INLG2021: submission of techniques (algorithms and protocols). Test set issued, participants give results on test set within 2 weeks.

  • 2 months before INLG2021: results of evaluation computed.

  • INLG2021: results presented at INLG, along with posters describing the techniques.


References

  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii.
  • C. Rebuffel, L. Soulier, G. Scoutheeten, and P. Gallinari (2020) A hierarchical model for data-to-text generation. In Advances in Information Retrieval, J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, and F. Martins (Eds.), Cham, pp. 65–80. ISBN 978-3-030-45439-5.
  • E. Reiter (2020) Accuracy errors go beyond getting facts wrong.
  • C. van der Lee, A. Gatt, E. van Miltenburg, S. Wubben, and E. Krahmer (2019) Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, pp. 355–368.
  • S. Wiseman, S. Shieber, and A. Rush (2017) Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2253–2263.