CoTK: An Open-Source Toolkit for Fast Development and Fair Evaluation of Text Generation

by   Fei Huang, et al.
Tsinghua University

In text generation evaluation, many practical issues, such as inconsistent experimental settings and metric implementations, are often ignored but lead to unfair evaluation and untenable conclusions. We present CoTK, an open-source toolkit aiming to support fast development and fair evaluation of text generation. In model development, CoTK helps handle the cumbersome issues, such as data processing, metric implementation, and reproduction. It standardizes the development steps and reduces human errors which may lead to inconsistent experimental settings. In model evaluation, CoTK provides implementation for many commonly used metrics and benchmark models across different experimental settings. As a unique feature, CoTK can signify when and which metric cannot be fairly compared. We demonstrate that it is convenient to use CoTK for model development and evaluation, particularly across different experimental settings.



1 Introduction

Neural text generation, as a key but challenging task in NLP, has been widely studied recently. Text generation has been applied to various scenarios, such as dialog generation Vinyals and Le (2015), story generation Roemmele (2016), machine translation Sutskever et al. (2014), text summarization Rush et al. (2015), and image captioning Vinyals et al. (2015).

Novel methods for text generation are constantly being developed, but the evaluation of text generation has received much less attention Huang et al. (2019). Even worse, results reported by prior work commonly contradict one another, which makes it hard to identify state-of-the-art models for a given task. After a thorough investigation of existing open-source projects on text generation, we observe two common problems in model evaluation.

Figure 1: Examples of problems leading to inconsistent results. (a) Transformer is better than GRU when compared in the same setting. If GRU is tested on a smaller vocabulary, the unfair comparison may lead to a wrong conclusion. (b) If the ground truth contains unk (unknown tokens), the original BLEU-3 may prefer the worse sentence (the first one). (c) Different implementations of perplexity lead to inconsistent results, where eos denotes the end-of-sentence token and pad denotes padding. We present the formulations of log perplexity, where each term is the negative log probability of one token in one sentence.

One problem is that models are tested on different datasets and with different experimental settings, which makes them incomparable. Even on the same dataset, the settings used in evaluation, such as the data split (training/test), the tokenization method, and the vocabulary size, often differ without being noticed, leading to unfair comparison between models and untenable conclusions. As shown in Figure 1 (a), a subtle difference in vocabularies can completely change the results. However, uncovering the differences among these settings can be extremely difficult, which makes results hard to reproduce.

The other problem is that the metrics of text generation are rather complicated, which leads to inconsistency among different implementations. For example, as shown in Figure 1 (b), truncating the vocabulary brings unk (unknown tokens) into the ground truth, and the original BLEU metric Papineni et al. (2002) favors the sentence containing more unk tokens. In Figure 1 (c), we present several implementations of perplexity Brown et al. (1992), which lead to very different results.

These problems severely hinder fair comparison of different models, reproduction of existing models, and implementation of new models. Checking the details of experimental settings, metric implementations, and more is a heavy burden. To this end, we develop Conversational Toolkit (CoTK), an open-source toolkit released as a Python package under the Apache License 2.0. CoTK is mainly developed for open-domain conversation generation, but other text generation tasks are also supported. CoTK is designed to achieve two goals:

  • Empowering fast development. CoTK helps handle the cumbersome issues in data loading, processing, evaluation, and reproduction, so that the researchers can concentrate on the most creative part, i.e., the implementation of novel models.

  • Empowering fair evaluation. CoTK is specially designed for ensuring fair comparison, where a unique hash code can signify whether the experimental results are comparable.

In CoTK, we provide:

  • Data loaders. We build data loaders for common text generation tasks, where the data loaders handle the whole procedure before sending data into the models, including reading files, processing, and packing sample batches.

  • Metrics. CoTK covers commonly used metrics in text generation tasks. Each evaluation result will be tagged with a hash code, which can be used to verify the fairness of comparisons.

  • A tool for publication and reproduction. CoTK can track the code and experimental environment, which enables researchers to publish models or reproduce others’ results conveniently.

  • Resources and benchmark models. We collect some public datasets and commonly used benchmark models, which facilitate development of new models and comparison with existing ones.

2 Design and Structure

CoTK aims at supporting researchers through the entire lifetime of model development. As shown in Figure 2, we divide the development procedure into four steps: data processing, model implementation, evaluation, and publication. CoTK characterizes itself in three aspects: data loaders for data processing, metrics for evaluation, and a tool for publication and reproduction. For model implementation, CoTK is compatible with many existing toolkits. That is, researchers can be supported in data processing, evaluation, and publication from CoTK while implementing models with other toolkits, such as Texar Hu et al. (2019) and Fairseq Ott et al. (2019), in PyTorch Paszke et al. (2019), TensorFlow Abadi et al. (2016) or other deep learning frameworks.

Figure 2: CoTK’s design concepts. We provide tools for data processing, evaluation and model publication, as well as resources and benchmark models. CoTK is compatible with many existing toolkits that provide various model implementations.

2.1 Data Loader

The data loader helps users prepare data for deep learning models. Following user-specified settings, the data loader can read files, tokenize text, build vocabularies, and pack sentences into mini-batches.

A task is specified by the configurations shown in Table 1. We mainly support text generation (without input) Sutskever et al. (2011), single-turn dialog generation Vinyals and Le (2015) and multi-turn dialog generation Sordoni et al. (2015). By assembling different types of input and output, CoTK can be easily extended to various tasks, such as machine translation Sutskever et al. (2014) and controllable conversation generation Zhou et al. (2018).

Task | Configuration
Text Generation (w/o input) | ∅ → Sentence
Single-Turn Dialog | Sentence → Sentence
Multi-Turn Dialog | Context → Sentence
Machine Translation | Sentence → Sentence
Controllable Generation | (Sentence, Label) → Sentence
Table 1: Examples of tasks supported by CoTK. Configuration is described by Input → Output. ∅ denotes no input, and Context means a sequence of utterances.

The tokenization method is usually ignored but can largely affect the experimental results. We provide the widely used Punkt tokenizer Kiss and Strunk (2006) as well as tokenizers for GPT-2 Radford et al. (2019) and other pretrained models.

It is a common choice to filter out rare words in the vocabulary. However, fair comparison among the models trained with different vocabularies is not trivial, as shown by the example in Figure 1 (a). To this end, we split a vocabulary into two parts:

  • Frequent vocabulary ($V_{\text{freq}}$). The frequent vocabulary contains frequent words from the training set. It is the vocabulary used by most models.

  • Rare vocabulary ($V_{\text{rare}}$). The rare vocabulary contains the remaining words from the training and test sets, which cannot be generated by most models except those with a copy mechanism He et al. (2017). Note that $V_{\text{freq}}$ and $V_{\text{rare}}$ have no intersection.

In the training stage, models only see $V_{\text{freq}}$ and regard all words not in $V_{\text{freq}}$ as unk. In the test stage, models are evaluated on $V_{\text{freq}} \cup V_{\text{rare}}$. Although rare words cannot be generated by most models, they are crucial for evaluation. Our metrics are designed to achieve fair comparison as long as $V_{\text{freq}} \cup V_{\text{rare}}$ does not change. Supposing that two models are trained with different frequent vocabularies, $V_{\text{freq}}^{A}$ and $V_{\text{freq}}^{B}$ for instance, they can be fairly compared by adjusting the rare vocabularies to keep $V_{\text{freq}}^{A} \cup V_{\text{rare}}^{A} = V_{\text{freq}}^{B} \cup V_{\text{rare}}^{B}$.
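The vocabulary split can be illustrated with a small sketch. It is not CoTK's implementation: the function name `split_vocabulary` and the `min_freq` parameter are hypothetical, and the rule "a word is frequent if it appears at least `min_freq` times in the training set" is an assumption made for illustration.

```python
from collections import Counter


def split_vocabulary(train_sentences, test_sentences, min_freq):
    """Split words into a frequent and a rare vocabulary (illustrative only).

    A word is frequent if it appears at least `min_freq` times in the
    training set; every other word seen in training or test data is rare.
    """
    counts = Counter(tok for sent in train_sentences for tok in sent)
    frequent = {word for word, count in counts.items() if count >= min_freq}
    all_words = set(counts) | {tok for sent in test_sentences for tok in sent}
    rare = all_words - frequent
    return frequent, rare


train = [["a", "cat", "sat"], ["a", "dog", "sat"]]
test = [["a", "bird", "sat"]]

freq_1, rare_1 = split_vocabulary(train, test, min_freq=1)
freq_2, rare_2 = split_vocabulary(train, test, min_freq=2)

# The frequent vocabulary shrinks when the threshold rises, but the union
# of both parts stays the same, which is the condition for fair comparison.
print(sorted(freq_2), sorted(rare_2))
print(freq_1 | rare_1 == freq_2 | rare_2)
```

Because the rare vocabulary absorbs every word dropped from the frequent one, two models trained with different thresholds still share the same evaluation vocabulary.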

Hash Code for Data Loader
Since it is difficult to track the differences among various data loaders, CoTK provides hash codes to identify each part of the data loader including the input data, vocabularies and settings (shown in Table 2). For example, if two data loaders have the same General Hash code, their data, vocabularies and settings are guaranteed to be the same. This is implemented by computing SHA-256 given the corresponding parts of data loaders as input. A usage case is presented in Section 3.1.

Hash Code | Object to be Identified
Raw Data Hash | Raw Text Data
Data Hash | Tokenized Data
Vocab Hash | Vocabulary
Setting Hash | Settings
General Hash | All Above
Table 2: Hash codes used to identify different parts of data loaders including data, vocabulary and settings.
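The idea behind Table 2 can be sketched with the standard `hashlib` module. This is a minimal illustration, not CoTK's actual code: the helper names, the JSON serialization, and the choice to derive the general hash from the four component hashes are all assumptions.

```python
import hashlib
import json


def sha256_of(obj):
    """SHA-256 hex digest of a JSON-serializable object (canonical encoding)."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def data_loader_hashes(raw_lines, tokenized, vocab, settings):
    """Compute one hash per component plus a general hash over all of them."""
    hashes = {
        "raw_data_hash": sha256_of(raw_lines),
        "data_hash": sha256_of(tokenized),
        "vocab_hash": sha256_of(sorted(vocab)),  # sets are sorted for stability
        "setting_hash": sha256_of(settings),
    }
    # Assumption: the general hash is derived from the component hashes.
    hashes["general_hash"] = sha256_of(hashes)
    return hashes


raw = ["how are you ?", "fine , thanks ."]
tok = [line.split() for line in raw]

base = data_loader_hashes(raw, tok, {"how", "are", "you"}, {"min_freq": 2})
small = data_loader_hashes(raw, tok, {"how", "are"}, {"min_freq": 3})

# Only the components that actually changed get a new hash.
for name in base:
    print(name, base[name][:6], small[name][:6], base[name] == small[name])
```

In this toy run the raw data and tokenized data hashes agree, while the vocabulary, setting, and general hashes differ, which pinpoints where the two configurations diverge.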

2.2 Metric

CoTK covers commonly used metrics in text generation tasks, as shown in Table 3.

Task | Metric
Text Generation (Without Input) | Perplexity Brown et al. (1992)
Text Generation (Without Input) | Self-BLEU Zhu et al. (2018)
Text Generation (Without Input) | Forward / Backward BLEU Shi et al. (2018)
Text Generation (Without Input) | Forward / Reverse Perplexity Zhao et al. (2018)
Dialog Generation | BLEU Papineni et al. (2002)
Dialog Generation | Distinct N-gram Li et al. (2016)
Dialog Generation | BOW Embedding Forgues et al. (2014)
Machine Translation & Text Summarization | ROUGE Lin (2004)
Machine Translation & Text Summarization | METEOR Banerjee and Lavie (2005)
Table 3: Part of the metrics supported by CoTK.

In CoTK, metrics are implemented to achieve fair comparison among models in different experimental settings. We will take perplexity and BLEU as examples to introduce our implementation.

Example: Perplexity
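The original perplexity is calculated as

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i \mid x_{<i})\right),$$

where $x_i$ is the $i$-th token in the ground truth, $N$ is the number of tokens, and $P(x_i \mid x_{<i})$ is given by the model. Suppose two models, A and B, are trained with different frequent vocabularies, denoted as $V_{\text{freq}}^{A}$ and $V_{\text{freq}}^{B}$ respectively, and there exists a word $w \in V_{\text{freq}}^{B}$ but $w \notin V_{\text{freq}}^{A}$. Then model B must predict the exact word $w$, whereas model A only needs to predict unk. It is therefore unfair to compare the two models based on the original perplexity, as shown in Figure 1 (a).

Similar to Ahn et al. (2016), we distribute the probability of unk evenly to the rare words:

$$P'(x_i = w \mid x_{<i}) = \begin{cases} P(x_i = w \mid x_{<i}), & w \in V_{\text{freq}}, \\ \dfrac{P(x_i = \text{unk} \mid x_{<i})}{|V_{\text{rare}}|}, & w \in V_{\text{rare}}, \end{cases}$$

where $V_{\text{freq}}$ and $V_{\text{rare}}$ are the frequent vocabulary and the rare vocabulary respectively. This method converts the predicted probability distribution over $V_{\text{freq}} \cup \{\text{unk}\}$ to a distribution over $V_{\text{freq}} \cup V_{\text{rare}}$, so that the perplexity can always be fairly compared as long as $V_{\text{freq}} \cup V_{\text{rare}}$ remains unchanged.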
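The redistribution of the unk probability can be sketched in a few lines of Python. This is an illustration under assumptions, not CoTK's metric class: the function name `perplexity`, the dictionary-based probability format, and the `"<unk>"` token string are all hypothetical.

```python
import math


def perplexity(token_dists, references, frequent_vocab, rare_vocab,
               unk="<unk>", redistribute=True):
    """Perplexity over reference tokens (illustrative sketch).

    token_dists[i][j] maps each frequent word and `unk` to the model's
    probability at position j of sentence i. With redistribute=True, the
    unk probability is shared evenly among the rare words; otherwise a rare
    reference word is scored with the full unk probability.
    """
    total_nll = 0.0
    n_tokens = 0
    for sent_dists, sentence in zip(token_dists, references):
        for dist, word in zip(sent_dists, sentence):
            if word in frequent_vocab:
                prob = dist[word]
            elif word in rare_vocab:
                prob = dist[unk] / len(rare_vocab) if redistribute else dist[unk]
            else:
                raise ValueError(f"{word!r} is in neither vocabulary")
            total_nll -= math.log(prob)
            n_tokens += 1
    return math.exp(total_nll / n_tokens)


frequent = {"a", "b"}
rare = {"c", "d"}
step = {"a": 0.5, "b": 0.25, "<unk>": 0.25}  # same toy distribution at each step
refs = [["a", "c"]]
dists = [[step, step]]

fair = perplexity(dists, refs, frequent, rare)                      # about 4.0
naive = perplexity(dists, refs, frequent, rare, redistribute=False)  # about 2.83
print(fair, naive)
```

In the toy example the rare word "c" receives probability 0.25 / 2 after redistribution, so the fair score is higher than the naive one, which treats predicting unk as if it were predicting the exact word.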

Example: BLEU
The BLEU metric may be affected by two issues: (1) different tokenizers produce different token sequences; (2) BLEU may favor sentences with unk, as shown in Figure 1 (b).

In CoTK’s BLEU, we first join the tokens of both hypotheses and references back into text and then tokenize them again with the Punkt tokenizer. This step standardizes the tokenization. Then we count n-gram matches following the original BLEU, but we never match n-grams containing unk, because unk is not a real token and should always be regarded as mismatched.

Figure 3: The steps of calculating BLEU in CoTK. Blue solid lines align matched pairs, and red dotted lines denote unmatched pairs.

This modification greatly extends applicability, which ensures fair comparison regardless of tokenization methods or vocabulary sets adopted by generation models.
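The rule of never matching n-grams that contain unk can be sketched as a clipped n-gram precision, the building block of BLEU. The code below is a simplified illustration rather than CoTK's BLEU: it handles a single reference, omits the re-tokenization step and the brevity penalty, and the name `clipped_precision` and the `"<unk>"` token string are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n, unk="<unk>"):
    """Clipped n-gram precision where n-grams containing `unk` never match.

    Such n-grams still count in the denominator, so they are treated as
    mismatches instead of being silently dropped.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(
        min(count, ref_counts[gram])
        for gram, count in hyp_counts.items()
        if unk not in gram
    )
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


reference = ["the", "<unk>", "sat"]
copy_unk = ["the", "<unk>", "sat"]

# Copying unk from the reference earns no credit for the unk-bearing n-grams.
print(clipped_precision(copy_unk, reference, n=1))
print(clipped_precision(copy_unk, reference, n=2))
```

Under this rule, a hypothesis that reproduces unk tokens from a truncated reference cannot inflate its score, which is the failure mode described in Figure 1 (b).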

Hash Code for Metric
Hash codes generated for metrics can track the settings and the reference data, where two metric scores are comparable if and only if they have the same hash code. The implementation of the hash code in each metric can be different. For example, the hash code of perplexity computes the SHA-256 hash given the reference sentences, the frequent vocabulary, and rare vocabulary as input. However, the computation of the hash code for BLEU only uses the tokenized reference sentences as input, because BLEU does not rely on the vocabulary set for fair comparison.

Hash codes have several advantages: they avoid human errors such as inconsistent settings, and they save researchers from memorizing the requirements of each metric for fair comparison. A usage case is presented in Section 3.1.
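A rough sketch of per-metric hash codes follows. It is an interpretation, not CoTK's code: the function names are hypothetical, and hashing the union of the frequent and rare vocabularies for perplexity is an assumption chosen so that moving a word between the two parts leaves the hash unchanged, in line with the comparability condition stated in Section 2.1.

```python
import hashlib


def _digest(parts):
    """SHA-256 over a sequence of strings, with a separator byte between them."""
    hasher = hashlib.sha256()
    for part in parts:
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")  # keeps ["ab", "c"] distinct from ["a", "bc"]
    return hasher.hexdigest()


def bleu_hash(references):
    """BLEU only depends on the (tokenized) reference sentences."""
    return _digest(" ".join(sentence) for sentence in references)


def perplexity_hash(references, frequent_vocab, rare_vocab):
    """Perplexity also depends on the overall evaluation vocabulary."""
    parts = [" ".join(sentence) for sentence in references]
    parts.append("|vocab|")
    parts.extend(sorted(set(frequent_vocab) | set(rare_vocab)))
    return _digest(parts)


refs = [["a", "b"], ["c"]]
same_union = perplexity_hash(refs, {"a", "b"}, {"c"}) == perplexity_hash(refs, {"a"}, {"b", "c"})
smaller_union = perplexity_hash(refs, {"a", "b"}, {"c"}) == perplexity_hash(refs, {"a", "b"}, set())
print(same_union, smaller_union)
```

Two scores would then be treated as comparable exactly when their hashes match, so a researcher does not need to remember which inputs each metric is sensitive to.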

Tasks | Resources
Text Generation (Without Input) | MSCOCO Chen et al. (2015)
Text Generation (Without Input) | EMNLP2017 WMT (only the monolingual corpus is used)
Single-Turn Dialog | OpenSubtitles Tiedemann (2016)
Multi-Turn Dialog | Ubuntu Lowe et al. (2015)
Multi-Turn Dialog | SwitchBoard (footnote: … et al. (2017))
Table 4: Some datasets supported by CoTK.

Tasks | Benchmark Models
Text Generation (Without Input) | GRU Graves (2013); Chung et al. (2014)
Text Generation (Without Input) | Transformers Vaswani et al. (2017)
Text Generation (Without Input) | GPT2-finetune Radford et al. (2019)
Text Generation (Without Input) | VAE Kingma and Welling (2014)
Single-Turn Dialog | Seq2Seq-GRU Sutskever et al. (2014)
Single-Turn Dialog | Seq2Seq-Trans Vaswani et al. (2017)
Single-Turn Dialog | GPT2-finetune Wolf et al. (2019)
Multi-Turn Dialog | HRED Sordoni et al. (2015)
Multi-Turn Dialog | CVAE Zhao et al. (2017)
Table 5: Some benchmark models provided by CoTK.

2.3 Publication and Reproduction

To further improve reproducibility, we develop a tool that helps researchers publish their code and experimental results.

Publication: To share results with the community, a user follows three steps: (1) use the version control system git to track code updates; (2) write code that generates a result file when executed; (3) execute the code through our tool. These three steps track the code, the results, and the running environment. All of these data can then be uploaded to our website or GitHub, where they are accessible to the community. We highlight that the results contain hash codes, which guarantee fair comparison with other results.

Reproduction: If a user wants to reproduce the results, the user only needs to run our tool to fetch the data uploaded by another user, including the code and the running environment.

Dashboard: The dashboard is a website that maintains the results uploaded by users, which makes it convenient to compare the performance of models and find state-of-the-art models. As a unique feature, users can submit results obtained by running code from other users, which further facilitates reproduction.

Figure 4: Different hash codes (only the leading 6 characters shown) in the different settings. The table demonstrates that hash codes can identify the differences in settings and avoid unfair comparisons.

2.4 Resources and Benchmark Models

To improve usability, we further provide resources and benchmark models compatible with our toolkit. The resources include benchmark datasets, pretrained model weights and more, which can be automatically downloaded by data loaders in CoTK. Some resources and benchmark models we provide are presented in Table 4 and Table 5.

2.5 Other Features

Batched Data: CoTK is specially designed for deep learning models, where all APIs receive batched data. Batched sentences can be directly converted to/from tensors. This feature avoids errors in handling padding.
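The conversion between sentences and padded batches can be sketched in plain Python. The helper names `pack_batch` and `unpack_batch`, the special-token ids, and the list-of-lists layout are assumptions for illustration, not CoTK's API; the resulting nested list could be handed to a tensor constructor of any framework.

```python
def pack_batch(sentences, word2id, pad_id=0, unk_id=1):
    """Convert tokenized sentences to a padded matrix of ids plus true lengths."""
    lengths = [len(sentence) for sentence in sentences]
    max_len = max(lengths, default=0)
    batch = [
        [word2id.get(word, unk_id) for word in sentence]
        + [pad_id] * (max_len - len(sentence))
        for sentence in sentences
    ]
    return batch, lengths


def unpack_batch(batch, lengths, id2word):
    """Invert pack_batch: drop the padding and map ids back to words."""
    return [
        [id2word[idx] for idx in row[:length]]
        for row, length in zip(batch, lengths)
    ]


word2id = {"<pad>": 0, "<unk>": 1, "hi": 2, "there": 3}
id2word = {idx: word for word, idx in word2id.items()}

batch, lengths = pack_batch([["hi", "there"], ["hi"], ["hello"]], word2id)
print(batch)    # rows are padded to the longest sentence
print(lengths)  # true lengths, so padding is never scored by a metric
print(unpack_batch(batch, lengths, id2word))
```

Keeping the true lengths next to the padded matrix is what lets a metric ignore padding positions without each model author re-implementing that logic.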

Compatibility: CoTK does not depend on any particular deep learning framework, such as TensorFlow Abadi et al. (2016) or PyTorch Paszke et al. (2019). CoTK is also compatible with many other text generation toolkits, such as Texar Hu et al. (2019) and Fairseq Ott et al. (2019), which means that models implemented with these toolkits can be evaluated or published with CoTK. Moreover, comparison across frameworks is also possible.

Extensibility: CoTK is highly extensible, where new tasks, metrics and benchmark models can be easily integrated into the toolkit. We believe that CoTK can grow with the advancements of text generation in the community.

3 Proof-of-Concept Examples

3.1 Hash Codes of Different Settings

We present an example to demonstrate how hash codes can identify differences in settings. We choose a subset of OpenSubtitles as our dataset (Origin), and modify it to obtain four settings:

  • Shuffled data. The lines of the data file are shuffled, which only affects the order of the samples.

  • Small vocabulary. The size of the frequent vocabulary is changed from 1323 to 752.

  • Different tokenizers. The tokenizer is changed from the Punkt tokenizer to the BERT tokenizer (bert-base-uncased).

  • Corrupted data. A sample from the dataset is removed.

The result is presented in Figure 4. Shuffling data does not change any hash code because it does not affect training or evaluation. With the small vocabulary, the Vocab Hash and Setting Hash codes are different. However, the hash codes for metrics do not change since the results are still comparable. With a different tokenizer, the Perplexity Hash changes, because comparison under different tokenizers is not supported by perplexity. On corrupted data, all the hash codes change, and the Raw Data Hash signifies that it is a different dataset.

3.2 Comparison under Different Vocabularies

We present an example to demonstrate that we can achieve fair comparison under different vocabularies. We train a GRU text generation model on the dataset MSCOCO with different frequent vocabularies, and show how the vocabulary size affects the result.

Frequency Threshold | Frequent Vocabulary Size | CoTK’s Perplexity | Original Perplexity
1 | 30765 | 14.17 | 14.15
2 | 19227 | 14.11 | 13.95
4 | 12555 | 14.10 | 13.74
10 | 8044 | 14.22 | 13.39
40 | 4062 | 15.30 | 12.69
160 | 1900 | 17.13 | 11.49
Table 6: Perplexity under different sets of the frequent vocabulary. In each row, the words appearing fewer times than the frequency threshold in the training data are regarded as rare words. The results of CoTK’s perplexity are comparable while those of the original perplexity are not.

The result is presented in Table 6. When the frequency threshold is 4, the model reaches the best CoTK’s perplexity (14.10). However, the original perplexity gets smaller as the frequent vocabulary shrinks, and it will reach 1 when the frequent vocabulary is empty. This shows that the original perplexity is not a fair metric under different vocabularies.
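The limiting value of 1 can be checked by hand. The derivation below is a sketch under assumed notation ($N$ reference tokens $x_1, \dots, x_N$ and a rare vocabulary $V_{\text{rare}}$): with an empty frequent vocabulary, a model can only output unk and therefore assigns it probability 1 at every position.

```latex
\begin{align*}
\text{Original:}\quad
\mathrm{PPL} &= \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i=\text{unk}\mid x_{<i})\Big)
             = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log 1\Big) = 1,\\
\text{Redistributed:}\quad
\mathrm{PPL}' &= \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log \frac{1}{|V_{\text{rare}}|}\Big)
              = |V_{\text{rare}}|.
\end{align*}
```

Under this reading, the original score rewards shrinking the vocabulary, whereas the redistributed score treats a model that only emits unk like a uniform guess over all words.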

3.3 Evaluation of Benchmark Models

We demonstrate the evaluation results of some benchmark models on text generation (without input) and single-turn dialog generation, as shown in Table 7 and Table 8. The details of implementation and metrics are presented in the appendix.

Notice that the perplexity of GPT2-ft (finetune) cannot be fairly compared with that of the other models, because their tokenizations differ. However, the other metrics, including BLEU, S-BLEU Zhu et al. (2018), F/B/H-BLEU Shi et al. (2018), F/R-PPL Zhao et al. (2018), and Distinct-2 Li et al. (2016), standardize the tokenization with the same method as the BLEU described in Section 2.2, so the results of these metrics are comparable among the models.

Model | PPL | S-BLEU | F/B/H-BLEU | F/R-PPL
GRU | 47.74 | 30.8 | 27.5/20.1/23.2 | 80.9/186.2
Transformer | 37.04 | 30.2 | 25.3/20.3/22.5 | 80.7/180.3
GPT2-ft | 19.30* | 31.7 | 26.8/20.5/23.3 | 76.1/168.9
Table 7: Evaluation results of text generation (without input) models on the EMNLP2017 dataset. (*) indicates the value is not comparable with its counterparts.

Model | PPL | BLEU | Distinct-2 | F/R-PPL
GRU | 37.26 | 1.07 | 0.094 | 23.4/243.3
Transformer | 39.36 | 0.99 | 0.063 | 19.5/308.6
GPT2-ft | 21.12* | 1.30 | 0.156 | 29.8/124.2
Table 8: Evaluation results of single-turn dialog models on the OpenSubtitles dataset. (*) indicates the value is not comparable with its counterparts.

3.4 Other Examples

More examples about usage, model publication and reproduction are demonstrated in the appendix.

4 Related Work

Unlike PyTorch-NLP Petrochuk (2018), torchtext, AllenNLP Gardner et al. (2018), and GluonNLP Guo et al. (2019), which provide common modules and utilities in NLP, CoTK mainly focuses on text generation. Analogous to ours, Texar Hu et al. (2019) and Fairseq Ott et al. (2019) provide state-of-the-art models in text generation. However, these toolkits largely target model implementation. CoTK characterizes itself by focusing more on data processing, evaluation and reproduction, while remaining compatible with other toolkits: the models implemented with these toolkits can be easily evaluated with CoTK.

Data Processing
Many toolkits provide data loaders as we do. Texar and Fairseq implement utilities for loading data and provide benchmarks for text generation tasks. PyTorch-NLP, torchtext and AllenNLP provide a similar function for general NLP tasks.

In comparison, as a unique feature, CoTK uses hash code to identify the differences and remind researchers when experimental settings change. Furthermore, the data loaders work with our implemented metrics to realize fair comparison across different settings and datasets.

The evaluation of text generation is less addressed by previous toolkits. Most toolkits, such as torchtext, PyTorch-NLP and Texar, provide only a few metrics such as BLEU. Although NLTK Loper and Bird (2002) and AllenNLP contain evaluation modules, few of them are designed for text generation.

We provide unified APIs to receive batched samples, which are convenient for deep learning models. Moreover, hash code plays an essential role in our toolkit to achieve fair comparison.

Publication and Reproduction
Publication and reproduction are rarely addressed by existing toolkits for text generation. Here we list two applications that achieve a similar function. Sacred Greff et al. (2017) is an experiment management tool, where the configurations, code and results are tracked for reproduction. It is intended for individual use and is not designed for sharing results with the community. Papers with Code collects evaluation results and leaderboards for different tasks. However, the results are filled in manually and can be hard to reproduce. CoTK can track the code, the results and the running environment automatically and publish them to the community, which is more convenient and efficient for comparison and reproducibility.

5 Conclusion and Future Work

In this paper, we introduce CoTK, a toolkit for fast development and fair evaluation of text generation. CoTK provides support through the entire lifetime of model development and addresses issues that are often ignored but lead to unfair comparison. With CoTK, researchers can easily handle data processing, model evaluation, and reproduction. Our special design signifies when and which metric cannot be fairly compared. CoTK can grow with the development of text generation in the community, where more tasks, metrics, resources and benchmark models can be constantly integrated into our toolkit. We believe that this toolkit will not only help researchers develop text generation models, but also support fair comparison among models and promote the reproducibility of these models.


Appendix A An Example of Development Procedure

Figure 5: An example of a development procedure with the help of CoTK. (a) The main script, including code for loading data, training the model, and performing evaluation. The implementation of the model is omitted. (b) Two commands for publishing and reproducing results.

With the help of CoTK, it is convenient to develop a novel model. Figure 5 shows an example of the development procedure.

In (a), we show the main code for data processing, model training and evaluation. The three parts of code are explained as follows:

  • Data processing. The dataset is downloaded and processed in a single line, where SingleTurnDialog is a data loader (Section 2.1) and OpenSubtitles is a resource (Section 2.4).

  • Model training. Our data loader provides batches of data, where sentences are converted to indices and padded. This format is commonly used by most text generation models.

  • Model evaluation. The model can generate data in batches, which are passed to a metric (Section 2.2), BleuCorpusMetric in our case. The metric object produces the result with hash codes.

In (b), we show two commands for publishing and reproducing results, respectively. The first command uploads the code to GitHub, and the running environment and the results to our dashboard. The second command shows how to access and reproduce the results in one line.

Appendix B An Example of Comparison Under Different Tokenizers

As mentioned above, our implementation of BLEU supports fair comparison on a dataset processed with different tokenizers. Here we train a GRU seq2seq model on the OpenSubtitles dataset with the Punkt tokenizer Kiss and Strunk (2006) and with the tokenizers of BERT Devlin et al. (2019), GPT2 Radford et al. (2019) and XLNet Yang et al. (2019) (bert-base-uncased, gpt2 and xlnet-base-cased, respectively). The result is presented in Table 9. These scores are directly comparable with our implementation.

Tokenizer Punkt BERT GPT2 XLNet
BLEU-4 1.07 1.07 1.05 0.99
Table 9: BLEU-4 with different tokenizers. These values have the same hash code (3f2b67…), which indicates that the results are comparable.

Appendix C Evaluation Details of Benchmark Models

C.1 Metrics

Here we present the details of the metrics in the evaluation of text generation (without input) and single-turn dialog tasks. All the implementations can be found in the code of CoTK.

  • PPL(Perplexity) Brown et al. (1992). Perplexity is a common metric for text generation models.

  • BLEU Papineni et al. (2002). The metric used in the single-turn dialog task shows the overlap between generated responses and ground-truth responses. We use BLEU-4.

  • S-BLEU(Self-BLEU) Zhu et al. (2018). The metric used in the text generation (without input) task shows the diversity of generated sentences. We adopt BLEU-4 and use 1,000 sentences as samples.

  • F/B/H-BLEU(Forward/Backward/Harmony BLEU) Shi et al. (2018). The metric used in the text generation (without input) task shows fluency, diversity and overall quality of generated sentences, respectively. We adopt BLEU-4 and use 1,000 sentences as references.

  • F/R-PPL(Forward/Reverse Perplexity) Zhao et al. (2018). The metric used in both tasks shows the fluency/diversity of generated sentences. We adopt a 5-gram language model with Kneser–Ney smoothing trained on the test set and use 10,000 sentences as references.

  • Distinct N-gram Li et al. (2016). The metric used in the single-turn dialog task shows the diversity of generated sentences. We use Distinct-2.
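As a concrete illustration of one metric in the list above, Distinct-n can be sketched as the ratio of unique n-grams to all n-grams in the generated sentences. This is one common formulation and an assumption here; the function name `distinct_n` is hypothetical, and CoTK's own implementation may differ in details such as tokenization and normalization.

```python
def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to total n-grams over all generated sentences."""
    total = 0
    unique = set()
    for tokens in sentences:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


generated = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]

# 6 bigrams in total, 4 of them unique.
print(distinct_n(generated, n=2))
```

A low value indicates that the model repeats the same phrases across responses, which is the typical symptom of generic dialog replies.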

C.2 Benchmark Models

The implementation details of benchmark models for text generation (without input) and single-turn dialog tasks are presented in Table 10 and Table 11, respectively. All the implementations are publicly available in the model zoo (docs/index.html#model-zoo).

Model | Parameter | Value
GRU | Embedding Size | 300
GRU | Decoder Features | 300
GRU | Optimizer/Learning Rate | Adam/1e-3
GRU | Decoding Strategy | Random Sampling
GRU | Decoding Temperature | 0.9
Transformer | Embedding Size | 300
Transformer | Decoder Features | 256
Transformer | Decoder Heads/Layers | 4/5
Transformer | Optimizer/Learning Rate | RAdam/1e-3
Transformer | Decoding Strategy | Random Sampling
Transformer | Decoding Temperature | 0.9
GPT2-ft | Pretrained Model | gpt2-117M
GPT2-ft | Optimizer/Learning Rate | RAdam/1e-4
GPT2-ft | Decoding Strategy | Random Sampling
GPT2-ft | Decoding Temperature | 0.9
Table 10: Implementation details of text generation models (without input).

Model | Parameter | Value
GRU | Embedding Size | 300
GRU | Encoder Features | 200 (bidirectional)
GRU | Decoder Features | 300
GRU | Optimizer/Learning Rate | Adam/1e-3
GRU | Decoding Strategy | Top-10 Sampling
GRU | Decoding Temperature | 0.9
Transformer | Embedding Size | 300
Transformer | Encoder Features | 256
Transformer | Encoder Heads/Layers | 4/5
Transformer | Decoder Features | 256
Transformer | Decoder Heads/Layers | 4/5
Transformer | Optimizer/Learning Rate | RAdam/1e-3
Transformer | Decoding Strategy | Top-10 Sampling
Transformer | Decoding Temperature | 0.9
GPT2-ft | Pretrained Model | gpt2-117M
GPT2-ft | Optimizer/Learning Rate | RAdam/1e-4
GPT2-ft | Decoding Strategy | Top-10 Sampling
GPT2-ft | Decoding Temperature | 0.9
Table 11: Implementation details of single-turn dialog generation models.