It is common practice in natural language processing (NLP) to collect and possibly annotate a text corpus and split it into training, development and test data. These splits are created either randomly or using some kind of ordering of the data, the latter being common in NLP and referred to as ‘standard splits’. Gorman and Bedrick (2019) recently showed that system ranking results based on standard splits differ from results based on random splits and used this to argue in favor of using random splits. While perhaps less common, random splits are already used in probing Elazar and Goldberg (2018), interpretability Poerner et al. (2018), as well as core NLP tasks Yu et al. (2019); Geva et al. (2019) (see also many of the tasks in the SemEval evaluation campaigns: http://alt.qcri.org/semeval2020/).
Gorman and Bedrick (2019) focus on whether there is a significant performance difference between systems and ; , in their notation. They argue that McNemar’s test Gillick and Cox (1989) or the bootstrap Efron (1981) can establish that , using random splits to sample from . This, of course, relies on the assumption that the data was sampled i.i.d. Wolpert (1996).
In reality, what Gorman and Bedrick (2019) call the true difference in system performance, i.e., , is the system difference on data that users would expect the systems to work well on (see §2 for practical examples) – and not just on the corpus that we happen to have annotations for. Our estimates of can therefore be very misleading. In this paper, we investigate how misleading our estimates can be: We show that random splits consistently over-estimate performance at test time. This favors systems that overfit, leading to bad estimates of system rankings. We investigate alternatives across a heterogeneous set of NLP tasks, and based on our experiments, our answer to community-wide overfitting to standard splits is not to use random splits but to collect more diverse data with different biases – or if that is not feasible, split your data in adversarial, not random, ways. In general, we observe that estimates of test time error are worst for random splits, slightly better for standard splits (if those exist), better for heuristic and adversarial splits, but error still tends to be higher on new (in-domain) samples; see Figure 1.
We consider seven different NLP tasks: POS tagging (like Gorman and Bedrick (2019)), two sentence representation probing tasks, headline generation, translation quality estimation, emoji prediction, and news classification. We experiment with these tasks because they a) are diverse, b) have not been subject to decades of community-wide overfitting (with the exception of POS tagging), and c) three of them enabled temporal splits (see Appendix §A.5).
|QE||WMT 2016||IT||WMT 2018|
The datasets used in our experiments are presented in Table 1. For all seven tasks, we present results for standard splits when possible (POS, Probing, QE, Headlines), random splits, heuristic and adversarial splits, as well as on new samples. In the case of Emojis, Headlines and News, which are all time-stamped datasets, we leave out historically more recent data as our new samples. All new samples are in-domain samples of data to which models are supposed to generalize, i.e., samples from similar text sources (footnote 2). This is a key point: These are samples that any end user would expect decent NLP models to fare well on. Examples include a sample of newspaper articles from newspaper for a POS tagger trained on articles from newspaper ; tweets sampled the day after the training data was sampled; or news headlines sampled from the same sources, but a year later.
Footnote 2: Domains are commonly defined as collections of similar text sources Harbusch et al. (2003); Koehn and Knowles (2017). In addition to using similar sources, we control for low $\mathcal{A}$-distance Ben-David et al. (2006) by looking at separability; e.g., a simple linear classifier over frequent unigrams can distinguish between Penn Treebank development and test sections with an accuracy of 64%, and between the development section and our new sample with an accuracy of 69%.
We resample random splits multiple times (3-10 per task) and report average results.
The heuristic splits are obtained by finding a sentence length threshold and putting the long sentences in the test split. We choose the threshold so that approximately 10% of the data ends up in this split. The idea of checking whether models generalize to longer sentences is not new; on the contrary, it goes back at least to early formal studies of recurrent neural networks, e.g., Siegelmann and Sontag (1992). In §A.3, we present a few experiments with alternative heuristic splits, but in our main experiments we limit ourselves to splits based on sentence length.
Finally, the adversarial splits are computed by approximately maximizing the Wasserstein distance between the splits. The Wasserstein distance is often used to measure divergence between distributions Arjovsky et al. (2017); Tolstikhin et al. (2018); Shen et al. (2018); Shah et al. (2018), and while alternatives exist Ben-David et al. (2006); Borgwardt et al. (2006), it is easy to compute and parameter-free. Since selecting the worst-case split is an NP-hard problem (e.g., by reduction of the knapsack problem), we have to rely on an approximation. We first compute a ball tree encoding the Wasserstein distances between the data points in our sample. We then randomly select a centroid for our test split and find its nearest neighbors. Those nearest neighbors constitute our test split; the rest is used to train and validate our model. We repeat these steps to estimate performance on worst-case splits of our sample. See §A.4 for an algorithm sketch. Random, heuristic, and adversarial results are averaged across five runs.
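The length-threshold heuristic described above can be illustrated with a minimal sketch. It is not the code used in the experiments: the function name `length_threshold_split`, the `test_fraction` parameter, and whitespace tokenization are assumptions made for illustration.

```python
def length_threshold_split(sentences, test_fraction=0.1):
    """Return (train, test, threshold); the longest sentences go to the test split."""
    if not sentences:
        raise ValueError("need at least one sentence")
    # Whitespace tokenization is an assumption; use the task's own tokenizer.
    lengths = [len(s.split()) for s in sentences]
    n_test = max(1, round(len(sentences) * test_fraction))
    # Length of the n_test-th longest sentence. Sentences tied at the
    # threshold all go to test, so the split is only approximately
    # test_fraction of the data.
    threshold = sorted(lengths)[-n_test]
    train = [s for s, n in zip(sentences, lengths) if n < threshold]
    test = [s for s, n in zip(sentences, lengths) if n >= threshold]
    return train, test, threshold


if __name__ == "__main__":
    toy = [" ".join(["w"] * n) for n in range(1, 21)]
    train, test, threshold = length_threshold_split(toy)
    print(len(train), len(test), threshold)
```

With this construction every test sentence is at least as long as the threshold and every training sentence is strictly shorter, which is the property the heuristic relies on.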
We first consider the task in Gorman and Bedrick (2019), experiment with heuristic and adversarial splits of the original Penn Treebank Marcus et al. (1993), and add the Xinhua section of OntoNotes 5.0 (https://catalog.ldc.upenn.edu/LDC2013T19) as our New Sample. Our tagger is NCRF++ (https://github.com/jiesutd/NCRFpp) with default parameters.
We also include two SentEval probing tasks Conneau et al. (2018) with data from the Toronto Book Corpus: Probing-WC (word classification) and Probing-BShift (whether a bigram was swapped). Unlike the other probing tasks, these two tasks do not rely on external syntactic parsers, which would otherwise introduce a new type of bias that we would have to take into account in our analysis. We use the official SentEval framework (https://github.com/facebookresearch/SentEval) and BERT (https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/1) as our sentence encoder. The probing model is a logistic regression classifier with regularization, tuned on the development set. As our New Samples, we use five random samples of the 2018 Gutenberg Corpus (https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) for each task, preprocessed in the same way as Conneau et al. (2018).
We use the WMT 2014 shared task datasets for Quality Estimation. Specifically, we use the Spanish-English data from Task 1.1: scoring for perceived post-editing effort. The dataset comes with a training and test set, and a second, unofficial test set, which we use as our New Sample. In §A.2, we also present results training on Spanish-English and evaluating on German-English. We present a simple model that only considers the target sentence, but performs better than the best shared task systems: we train an MLP over a LASER sentence embedding Schwenk et al. (2019), using the Adam optimizer Kingma and Ba (2015), a batch size of 200, and penalty of strength .
We use the standard dataset for headline generation, derived from the Gigaword corpus Napoles et al. (2012), as published by Rush et al. (2015). The task is to generate a headline from the first sentence of a news article. Our architecture is a sequence-to-sequence model with stacked bi-directional LSTMs with dropout, attention Luong et al. (2015) and beam decoding; the number of hidden units is 128; we do not pre-train. Unlike Rush et al. (2015), we use subword units Sennrich et al. (2016) to overcome the OOV problem and speed up training. The ROUGE scores we obtain on the standard splits are higher than those reported by Rush et al. (2015) and comparable to those of Nallapati et al. (2016), e.g., ROUGE-1 of 0.321. As our New Sample, we reserve 20,000 sentence-headline pairs each from the first and second halves of 2004 for validation and testing; years 1998-2003 are used for training. For all the experiments we report the error reduction in ROUGE-2 of the model over the identity baseline, which simply copies the input sentence (other ROUGE values are reported in §A.1). In §A.5, we explore how much of a performance drop on the fixed test set is caused by shifting the training data by only five years to the past.
Go et al. (2009) introduce an emoji prediction dataset that was collected from Twitter and is time-stamped. We use the 67,980 tweets from June 16 as our New Sample, and tweets from all previous days for the remaining experiments. For this task, we again train an MLP over a LASER embedding Schwenk et al. (2019) with the following hyper-parameters: two hidden layers with 50 parameters each and ReLU activation functions, trained using the Adam stochastic gradient-based optimizer Kingma and Ba (2015), a batch size of 200, and penalty of strength . See §A.5 for a discussion of temporal drift in this data.
We use a UCI Machine Learning Repository text classification problem (https://archive.ics.uci.edu/ml/datasets/News+Aggregator). Our datapoints are headlines associated with five different news genres. We use the last year of this corpus as our New Sample. We sample 100,000 headlines from the rest and train an MLP over a LASER embedding Schwenk et al. (2019) with the following hyper-parameters: two hidden layers with 100 parameters and ReLU activation functions, trained using the Adam stochastic gradient-based optimizer Kingma and Ba (2015), dynamic batch sizes, and penalty of strength .
Our results are presented in Table 2. Since the results are computed on different subsamples of data, we report error reductions over multinomial random (or, for Headlines, identity) baselines, following previous work comparing system rankings across different samples Søgaard (2013).
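To make the reported quantity concrete, the following sketch computes an error reduction over a multinomial random baseline for an accuracy-style metric. It rests on two assumptions that the text does not spell out: that the baseline draws labels at random according to the training label distribution, and that error reduction is the relative decrease in error, (baseline error - model error) / baseline error. The function names are hypothetical.

```python
from collections import Counter


def multinomial_baseline_accuracy(train_labels, test_labels):
    """Expected accuracy of guessing labels according to the training distribution."""
    train, test = Counter(train_labels), Counter(test_labels)
    n_train, n_test = len(train_labels), len(test_labels)
    # P(guess = c) * P(gold = c), summed over classes seen in the test data.
    return sum((train[c] / n_train) * (test[c] / n_test) for c in test)


def error_reduction(model_accuracy, baseline_accuracy):
    """Fraction of the baseline's error that the model removes."""
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error == 0:
        raise ValueError("baseline is already perfect")
    return (baseline_error - (1.0 - model_accuracy)) / baseline_error


if __name__ == "__main__":
    base = multinomial_baseline_accuracy(["a", "a", "b", "b"], ["a", "b"])
    print(base, error_reduction(0.9, base))
```

Under these assumptions, a model with 90% accuracy against a 50% baseline removes 80% of the baseline's error. Metrics where lower is better (such as RMSE) or generation metrics (such as ROUGE) would need the corresponding notion of error substituted in.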
Our main observations are the following: (a) Random splits (and standard splits) consistently under-estimate error on new samples. The absolute differences between error reductions over random baselines for random splits and on new samples are often higher than 20%, and in the case of Probing-BShift, for example, the BERT model reduces 80% of the error of a random baseline when data is randomly split, but only 45% averaging over five samples of new data from the same domain. (b) Heuristic splits sometimes under-estimate error on new samples. Our heuristic splits in the above experiments are quite aggressive. We only evaluate our models on sentences that are longer than any of the sentences observed during training. Nevertheless for 5/7 tasks, this leads to more optimistic performance estimates than evaluating on new samples! (c) The same story holds for adversarial splits based on approximate maximization of Wasserstein distances between training and test data. While adversarial splits are very challenging, results on adversarial splits are more optimistic than on new samples in 4/7 cases.
The fact that random splits over-estimate real-life performance also leads to misleading system rankings. If, for example, we remove the CRF inference layer from our POS tagger, performance on our Random splits drops to 0.952; on the New Sample, however, performance is 0.930, which is significantly better than with a CRF layer.
In the spirit of earlier position papers Sakaguchi et al. (2017); Madnani and Cahill (2018); Gorman and Bedrick (2019), we provide recommendations for improving the correlation between reported performance and performance on new data samples: (i) Using biased splits can help determine what data characteristics affect performance. We proposed heuristic and adversarial splits, acknowledging that sometimes performance on new samples is worse. (ii) Evaluating on new samples enables computing significance across datasets Demsar (2006), providing confidence that the model is better than alternatives at test time. Several benchmarks already provide multiple, diverse test sets (e.g. Hovy et al., 2006; Petrov and McDonald, 2012; Williams et al., 2018); we hope more will follow. (These are domain adaptation datasets; to be clear, we argue it is important for all benchmarks to include multiple test samples, even if focusing on a single domain.)
What explains the high variance across samples in NLP? One reason is the dimensionality of language Bengio et al. (2003), but in §A.5 we also show a significant impact of temporal drift.
We have shown that out-of-sample error can be hard to estimate from random splits, which tend to underestimate error by some margin, but even biased and adversarial splits sometimes underestimate error on new samples. We show this phenomenon across seven very different NLP tasks and provide practical recommendations on how to best bridge the gap between experimental practices and what is needed to produce truly robust NLP models that perform well in the wild.
- Arjovsky et al. (2017). Wasserstein GAN. In ICML.
- Ben-David et al. (2006). Analysis of representations for domain adaptation. In NeurIPS.
- Bengio et al. (2003). A neural probabilistic language model. JMLR 3, pp. 1137–1155.
- Borgwardt et al. (2006). Integrating structured biological data by kernel maximum mean discrepancy. In Proceedings 14th International Conference on Intelligent Systems for Molecular Biology 2006, Fortaleza, Brazil, August 6-10, 2006, pp. 49–57.
- Conneau et al. (2018). What you can cram into a single \$&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 2126–2136.
- Demsar (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7.
- Efron (1981). Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68.
- Elazar and Goldberg (2018). Adversarial removal of demographic attributes from text data. In EMNLP.
- Geva et al. (2019). DISCOFUSE: a large-scale dataset for discourse-based sentence fusion. In NAACL.
- Gillick and Cox (1989). Some statistical issues in the comparison of speech recognition algorithms. In ICASSP.
- Gorman and Bedrick (2019). We need to talk about standard splits. In ACL.
- Harbusch et al. (2003). Domain-specific disambiguation for typing with ambiguous keyboards. In EACL Workshop on Language Modeling for Text Entry Methods.
- Hovy et al. (2006). OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, pp. 57–60.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In ICLR.
- Koehn and Knowles (2017). Six challenges for neural machine translation. In ACL Workshop on Neural Machine Translation.
- Luong et al. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421.
- Madnani and Cahill (2018). Automated scoring: beyond natural language processing. In COLING.
- Marcus et al. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19.
- Napoles et al. (2012). Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), Montréal, Canada, pp. 95–100.
- Petrov and McDonald (2012). Overview of the 2012 shared task on parsing the web.
- Poerner et al. (2018). Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement. In ACL.
- Sakaguchi et al. (2017). GEC into the future: where are we going and how do we get there? In BEA.
- Schwenk et al. (2019). CCMatrix: mining billions of high-quality parallel sentences on the web. In ArXiv.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
- Shah et al. (2018). Adversarial domain adaptation for duplicate question detection. In EMNLP.
- Shah et al. (2020). Predictive biases in natural language processing models: a conceptual framework and overview. In ACL.
- Shen et al. (2018). Wasserstein distance guided representation learning for domain adaptation. In AAAI.
- Shimodaira (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90.
- Søgaard (2013). Estimating effect size across datasets. In NAACL.
- Tolstikhin et al. (2018). Wasserstein auto-encoders. In ICLR.
- Williams et al. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
- Wolpert (1996). The lack of a priori distinctions between learning algorithms. Neural Computation 8.
- Yu et al. (2019). What you see is what you get: visual pronoun coreference resolution in dialogues. In EMNLP.
Appendix A Appendices
We present supplementary details about two of our tasks in §A.1 and §A.2 and discuss variations over heuristic splits in §A.3. In §A.4, we present the pseudo-algorithm for how we compute adversarial splits, and finally, in §A.5, we present our results documenting temporal drift.
Table 3 reports the error reduction in ROUGE-1, ROUGE-2 and ROUGE-L over the identity baseline (see §2) for the different data splits. The results are consistent with Table 2. Figure 2 gives more details on an interesting drift phenomenon, which contributed to the superior performance of the model trained on the most recent five years (1999-2003). Apparently, the dotless spelling of U.S./US (’United States’) became more common over time. Consequently, the model trained on the 1999-2003 part generated US more frequently than the model trained on 1994-1998.
A.2 Quality Estimation
In the results above, we train and test our quality estimation regressor on Spanish-English from WMT QE 2014. We also ran a similar experiment where we used the German-English test data as our New Sample. Here, we see a similar pattern to the one above: The RMSE on the Standard split was 0.630, which is slightly higher than for Spanish-English; with our Heuristic split, RMSE is 0.652; for Adversarial, it is 0.626 (which is slightly better than with standard splits), and on our New Sample, RMSE is 0.813.
|Task||Model||Standard||Bootstrap||Random Length||Rare Words|
A.3 Alternative Heuristic Splits
For both SentEval tasks we experimented with the following alternatives for heuristic splits.
Bootstrap: Instead of cross-validation, a random split can be generated by bootstrap resampling. For this we randomly select 10% of the data as the test set and then randomly sample (with replacement) a new training and dev set from the remaining examples.
Random Length: As an alternative to the length threshold heuristic in earlier experiments, we randomly sample a length and select all examples having this length to be part of the test set. We repeat this procedure until approximately 10% of the data ends up in the test set. With this procedure we create 5 different test sets. We included this heuristic in order to see how fragile the probing setup is.
Rare Words: Another alternative for heuristic splits is to use word frequency information. Here we assign those sentences containing at least one of the rarest words of the dataset to the test set. This way we again end up with approximately 10% of the data in the test set. Note that this way we create only one dataset, because it is not a random process.
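The three alternatives above can be sketched as follows. This is an illustrative reading of the descriptions, not the code behind Table 4: all function names are hypothetical, `dev_fraction` is an assumed parameter (the text does not give the train/dev ratio), whitespace tokenization is assumed, and ranking sentences by the frequency of their rarest word is one possible way to realize "containing at least one of the rarest words".

```python
import random
from collections import Counter, defaultdict


def bootstrap_split(n_examples, test_fraction=0.1, dev_fraction=0.1, seed=0):
    """Hold out a random test set, then resample train/dev with replacement."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_test = max(1, round(n_examples * test_fraction))
    test, rest = indices[:n_test], indices[n_test:]
    n_dev = max(1, round(len(rest) * dev_fraction))  # assumed ratio
    train = [rng.choice(rest) for _ in range(len(rest) - n_dev)]
    dev = [rng.choice(rest) for _ in range(n_dev)]
    return train, dev, test


def random_length_split(sentences, test_fraction=0.1, seed=0):
    """Sample lengths at random; all sentences of a sampled length go to test."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for i, s in enumerate(sentences):
        by_length[len(s.split())].append(i)
    remaining = sorted(by_length)
    target = test_fraction * len(sentences)
    test = set()
    while remaining and len(test) < target:
        length = remaining.pop(rng.randrange(len(remaining)))
        test.update(by_length[length])
    return sorted(test)


def rare_word_split(sentences, test_fraction=0.1):
    """Deterministic split: sentences whose rarest word is least frequent go to test."""
    counts = Counter(w for s in sentences for w in s.split())
    n_test = max(1, round(len(sentences) * test_fraction))

    def rarest(i):
        words = sentences[i].split()
        return min(counts[w] for w in words) if words else float("inf")

    ranked = sorted(range(len(sentences)), key=lambda i: (rarest(i), i))
    return sorted(ranked[:n_test])
```

Calling `random_length_split` with five different seeds would give five different test sets, whereas `rare_word_split` always returns the same indices for the same data.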
Table 4 lists the results. While bootstrap resampling leads to slightly lower error reduction than cross-validation, we decided to report the latter in the main part of this paper because it is a more widespread way to randomly split datasets. Random Length results are comparable to standard split results. The split based on word frequency (Rare Words) leads to a considerable drop in both tasks. However, the drop is not as strong as that of the heuristic split (length threshold) in the main part of the paper.
A.4 Computing adversarial splits
We present the pseudo-algorithm of our implementation of approximate Wasserstein splitting in Algorithm 1. We also make the corresponding code available as part of our code repository for this paper.
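As a rough illustration of the procedure described in §2 (pick a random centroid, take its nearest neighbors under a Wasserstein distance as the test split), consider the sketch below. It makes several simplifying assumptions and should not be read as the released implementation: it replaces the ball tree with a brute-force neighbor search, it treats the components of each feature vector as a one-dimensional sample so that the Wasserstein-1 distance reduces to a comparison of sorted values, and the names `wasserstein_1d` and `adversarial_split` are hypothetical.

```python
import random


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equally sized, equally weighted 1-D samples."""
    if len(u) != len(v):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def adversarial_split(vectors, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx, centroid_idx) for one approximate worst-case split."""
    rng = random.Random(seed)
    n = len(vectors)
    n_test = max(1, round(n * test_fraction))
    centroid = rng.randrange(n)
    # Brute-force nearest neighbors of the centroid (a ball tree would
    # answer the same query faster on large samples).
    by_distance = sorted(
        range(n), key=lambda i: wasserstein_1d(vectors[centroid], vectors[i])
    )
    test = sorted(by_distance[:n_test])
    train = sorted(by_distance[n_test:])
    return train, test, centroid


if __name__ == "__main__":
    data = [[float(i), float(i) + 1.0, float(i) + 2.0] for i in range(20)]
    for run in range(5):  # repeat with different centroids and average results
        train, test, centroid = adversarial_split(data, seed=run)
        print(run, centroid, test)
```

Because the test split is a tight neighborhood around one point, it tends to lie far from most of the training data, which is what makes the split challenging; repeating with different seeds corresponds to averaging over several such neighborhoods.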
A.5 The significance of drift
Some of our splits in the main experiments were based on slicing data into different time periods (Headlines, Emojis). Since temporal drift is a potential explanation for sampling bias, we analyze it in more detail here. We show that temporal drift is pervasive and leads to surprising drops in performance, although it is of course not the only cause of sampling bias. Since we have time stamps for two of our datasets, we study these in greater detail:
Our headline generation data covers the years 1994 to 2004. Having reserved 20,000 sentence-headline pairs from the first half of 2004 for validation and the same amount from the second half for testing, we use 50% of the years 1994-2003 for training three models. The models’ architectures and parameters are identical (same as in Sec. 3). The only difference is in what the models are trained on: (a) a random half, (b) the first half, or (c) the second half of 1994-2003. The training data sizes are comparable (1.63-1.76M), and the publisher distributions (AFP, APW, CNA, NYT or XIN) are also similar. Hence, the models are expected to perform similarly on the same test set.
|50% of 1994-2003||0.409||0.205||0.386|
As Table 5 indicates, shifting the training data by five years to the past results in a big performance drop. Sampling training data randomly or taking the most recent period produces models with similar ROUGE scores, both much better than the identity baseline. However, about half of the gap to the identity baseline disappears when older training data is taken. In §A.1, we give an example of temporal drift in the Headlines data: US largely replaces U.S. in the newer training set and the test set.
For emoji prediction, Go et al. (2009) provide data for a temporal span of 62 days. We split the data into single days and keep the splits with more than 25,000 datapoints in which both classes are represented. We use the last of these, June 16, as our test sample and vary the training data from the first day to the day before June 16. Figure 3 (left) visualizes the results.
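The day-level slicing described above can be sketched as follows, assuming each example is a `(day, text, label)` tuple whose `day` field sorts chronologically. The function name `daily_splits` and the tuple layout are assumptions for illustration; only the size threshold and the "both classes represented" filter come from the text.

```python
from collections import defaultdict


def daily_splits(examples, min_size=25000):
    """Group examples by day, filter days, and hold out the last kept day as test."""
    by_day = defaultdict(list)
    for day, text, label in examples:
        by_day[day].append((text, label))
    # Keep days with more than min_size datapoints and at least two classes.
    kept = {
        day: rows
        for day, rows in by_day.items()
        if len(rows) > min_size and len({label for _, label in rows}) >= 2
    }
    if not kept:
        raise ValueError("no day satisfies the filters")
    days = sorted(kept)
    test_day = days[-1]
    train_days = {day: kept[day] for day in days[:-1]}
    return train_days, kept[test_day], test_day


if __name__ == "__main__":
    toy = [
        ("06-14", "t1", 0), ("06-14", "t2", 1), ("06-14", "t3", 1),
        ("06-15", "t4", 0), ("06-15", "t5", 0), ("06-15", "t6", 0),
        ("06-16", "t7", 1), ("06-16", "t8", 0), ("06-16", "t9", 0),
        ("06-17", "t10", 1),
    ]
    train_days, test_rows, test_day = daily_splits(toy, min_size=2)
    print(sorted(train_days), test_day, len(test_rows))
```

Training one model per entry of `train_days` and evaluating each on the held-out day gives a performance-versus-temporal-distance curve of the kind the text refers to.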