1 Introduction
Consider the following standard and general paradigm of NLP training: given a corpus $D$ consisting of $n$ samples, each indexed by $i$, the training of NLP model $\theta$ aims at optimizing a corpus-level objective $F(\theta, D)$. For example, a popular training method follows the maximum likelihood estimation (MLE) principle, in which a sample is a pair $s_i = (x_i, y_i)$, with $x_i$ being a decision context, which is usually one or more sentences in NLP tasks, and $y_i$ being a desired atomic decision, which is usually a token in generative tasks or a class label in discriminative tasks. The corpus-level objective that MLE-oriented training aims at maximizing is the log-likelihood of the whole corpus: $F_{MLE}(\theta, D) = \sum_i \log P_\theta(y_i \mid x_i)$.

The MLE objective is relatively easy to optimize because we can construct a sample-level loss function $f(\theta, s_i)$ such that the sample average $\bar{f}(\theta, D) = \frac{1}{n}\sum_i f(\theta, s_i)$ can "effectively represent" $F_{MLE}$ as a surrogate objective of the optimization. Specifically, since $F_{MLE}$ itself is additive with respect to the samples in $D$, we can simply take the CE loss $f_{CE}(\theta, s_i) = -\log P_\theta(y_i \mid x_i)$, which gives $\bar{f}_{CE}(\theta, D) = -\frac{1}{n} F_{MLE}(\theta, D)$. The average form of $\bar{f}_{CE}$ admits efficient stochastic-gradient optimization (which requires the objective to be a population mean so that its gradient can be unbiasedly estimated by the gradient of the sample mean over a random mini-batch), and the proportionality between $\bar{f}_{CE}$ and $F_{MLE}$ guarantees that an optimal (or better) solution of the former is also an optimal (or better) solution of the latter.

However, it is rare that a task directly uses $F_{MLE}$ as the end-to-end evaluation metric. Instead, common evaluation metrics used in practice include accuracy, precision/recall/F1 (for discriminative tasks), and BLEU Papineni et al. (2002) (for machine translation and other language generation tasks). While a model trained with $\bar{f}_{CE}$ may well optimize the corresponding MLE objective $F_{MLE}$, it does not necessarily optimize the true evaluation metric of the task. For this reason, researchers have proposed to optimize alternative objectives that are closer to, or in some cases equal to, the true evaluation metric used at testing time. For example, the Dice loss Li et al. (2020) has been recently proposed for tasks such as Paraphrase Similarity Matching (PSM) and Named Entity Recognition (NER) because of its similarity to the F1 metric used in these tasks. Similarly, sentence-level BLEU scores have been used in sentence-level training for machine translation due to their correspondence to the true corpus-level BLEU metric Ranzato et al. (2016); Wu et al. (2016); Edunov et al. (2018).

Unfortunately, these alternative learning objectives pose new challenges in optimization. Specifically, metrics like F1 and BLEU (and many others) are not sample-separable, meaning that they cannot be converted proportionally or monotonically into an averaged form as in the case of MLE. Consequently, while the intended objectives $F_{F1}$ and $F_{BLEU}$ are more aligned with the evaluation metrics of the corresponding tasks, what the training algorithms are truly optimizing is usually the averaged-form objectives $\bar{f}_{F1}$ and $\bar{f}_{BLEU}$, and models thus trained could improve the averaged objective $\bar{f}$ while at the same time becoming worse with respect to the intended objective $F$.
In this paper, we call the disparity mentioned above the Simpson's bias. It is a bias between a non-separably aggregated objective $F$ and its corresponding averaged form $\bar{f}$. The name is inspired by the classic paradox known as Simpson's reversal in statistics and social sciences, which refers to a class of conflicting conclusions obtained when comparing two "candidates" based on their aggregated performance and based on their per-case performance. In the following, we will give a systematic analysis of how a similar effect can widely arise in the context of machine learning when designing sample-level losses for many popular metrics, including precision, recall, Dice Similarity Coefficient (DSC), Macro-F1, and BLEU. We then experimentally examine and verify the practical impact of the Simpson's bias on the training of state-of-the-art models in three different NLP tasks: Paraphrase Similarity Matching (with the DSC metric), Named Entity Recognition (with the Macro-F1 metric), and Machine Translation (with the BLEU metric).
2 The Simpson’s Bias
As discussed in the last section, the ultimate goal of NLP training is to optimize a set function $F(\theta, D)$, which is a corpus-wise aggregated measurement of model $\theta$'s performance on a given data set $D$. On the other hand, the model is typically trained by following the gradient direction of a sample-level loss $f(\theta, s)$ on a random sample $s$. (When mini-batch is used, the algorithm generates a random batch $B \subseteq D$ at each optimization step and follows the gradient direction of the batch-wise averaged loss $\frac{1}{|B|}\sum_{s \in B} f(\theta, s)$.) Such training is expected to find an extreme point of the averaged performance $\bar{f}(\theta, D) = \frac{1}{n}\sum_i f(\theta, s_i)$.

We will pay special attention to the "naive" sample-level loss $f(\theta, s) = F(\theta, \{s\})$, which uses the same metric $F$ to measure a single sample. We use the $\bar{f}$ without subscript to denote the corpus-wise averaged performance corresponding to this particular sample loss, so $\bar{f}(\theta, D) = \frac{1}{n}\sum_i F(\theta, \{s_i\})$. Note that every well-defined set function $F$ is conjugated with such an $\bar{f}$, which is the arithmetic average of $F$ over all singletons of $D$. On the other hand, the function $F$ itself, when used as a performance metric in machine learning, often involves some form of "complex averaging" over $D$ as well. We are interested in understanding whether, or to what extent, a model optimized for the arithmetic average $\bar{f}$ can also perform well w.r.t. the "complex" average $F$, for various specific forms of $F$.
2.1 Special case 1: Ratio of Sums (RoS)
This is a very common family of metrics $F$ which compute the ratio of two summations over the set $D$. Let $a_i = a(\theta, s_i)$ and $b_i = b(\theta, s_i)$ be two quantities defined on each sample $s_i$; the RoS family of $F$ is generally in the form of

(1) $F(\theta, D) = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$

and the corresponding "naively" averaged metric is

(2) $\bar{f}(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \frac{a_i}{b_i}$

In the above, we have omitted $\theta$, which is considered given in this section. As a best case, $F$ of the RoS family equals $\bar{f}$ in the following two conditions:

Type-1: If $b_i = c$ for some constant $c$, then $F(D) = \bar{f}(D) = \frac{1}{nc}\sum_i a_i$.

Type-2: If $a_i = c \cdot b_i$ for some constant $c$, then $F(D) = \bar{f}(D) = c$.
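As a concrete illustration (ours, with made-up numbers, not from the paper's experiments), the following sketch compares the RoS aggregate $F = \sum a_i / \sum b_i$ with its naive average $\bar{f} = \frac{1}{n}\sum a_i/b_i$, confirming equality under the two conditions above and disparity otherwise.

```python
def F(a, b):      # corpus-level Ratio-of-Sums: sum(a) / sum(b)
    return sum(a) / sum(b)

def f_bar(a, b):  # naive averaged form: mean of per-sample ratios a_i / b_i
    return sum(ai / bi for ai, bi in zip(a, b)) / len(a)

# Type-1 condition: constant denominator b_i = c  ->  F == f_bar
a1, b1 = [1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]
assert abs(F(a1, b1) - f_bar(a1, b1)) < 1e-12

# Type-2 condition: a_i = c * b_i  ->  F == f_bar == c
a2, b2 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
assert abs(F(a2, b2) - f_bar(a2, b2)) < 1e-12

# In general, neither condition holds and the two disagree
a3, b3 = [1.0, 0.0], [1.0, 3.0]
assert F(a3, b3) != f_bar(a3, b3)   # 0.25 vs. 0.5
```

The last pair is the smallest possible reversal setting: which of the two quantities is larger depends entirely on whether the per-sample ratios are weighted by $b_i$ (aggregate) or uniformly (average).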
Depending on the precise definitions of $a_i$ and $b_i$, the RoS family subsumes many concrete metrics used in NLP tasks. We discuss three popular RoS metrics in the following.
Scenario 1.a: Accuracy
Let $y_i$ be a ground-truth decision on sample $s_i$ and $\hat{y}_i$ the decision output by the model $\theta$; the accuracy of $\theta$ on data set $D$ of size $n$ is

(3) $A(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{y}_i = y_i]$

which is a special case of (1) with $a_i = \mathbb{1}[\hat{y}_i = y_i]$ and $b_i = 1$, where $\mathbb{1}[\cdot]$ is the indicator function.

Accuracy is the simplest case in our analysis, which does not suffer from the Simpson's bias at all, as it satisfies the type-1 condition above. In other words, optimization based on the naive sample-level loss $f(\theta, s_i) = \mathbb{1}[\hat{y}_i = y_i]$ will maximize exactly the accuracy $A(\theta, D)$.
Note that in supervised learning, the sample loss may further need to be differentiable, in which case the indicator variable is usually approximated in practice. For example, in binary recognition problems, which ask to judge whether each sample is positive or negative (w.r.t. some feature of interest), the model is usually set to output a probability $p_i = P_\theta(y_i = 1 \mid x_i)$, and differentiable sample losses such as the cross-entropy $-y_i \log p_i - (1 - y_i)\log(1 - p_i)$ are used, essentially as smoothed variants of the discrete loss $\mathbb{1}[\hat{y}_i = y_i]$. We do not consider errors from such differentiable relaxations as part of the Simpson's bias under discussion, as the former is mostly a limitation of specific (types of) learning algorithms. In contrast, the Simpson's bias that we are studying in this paper is concerned more with intrinsic properties of the learning objectives themselves. For example, the exact sample-level accuracy can indeed be directly optimized through reinforcement learning algorithms, in which case the learning algorithm is equivalently optimizing exactly the corpus-wise accuracy $A(\theta, D)$.
Scenario 1.b: Precision/Recall
While being applicable to almost all discrete decision tasks, accuracy can be problematic for tasks with imbalanced data. For example, in binary recognition problems, a model always outputting negative would have very high accuracy if positive samples are rare. Precision and recall are standard evaluation metrics used in binary recognition tasks to solve this problem.
In binary recognition problems, let $y_i \in \{0, 1\}$ be the true label of sample $s_i$, with $y_i = 0$ for a negative sample and $y_i = 1$ for a positive sample. Let $\hat{y}_i$ be the label predicted by model $\theta$, with $\hat{y}_i = 0$ for negative output and $\hat{y}_i = 1$ for positive output. The precision on a data set $D$ of size $n$ is

(4) $P(\theta, D) = \frac{\sum_i \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]}{\sum_i \mathbb{1}[\hat{y}_i = 1]}$

It is clear that $P$ can be seen as a RoS metric with $a_i = \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]$ and $b_i = \mathbb{1}[\hat{y}_i = 1]$. But strictly speaking, $P$ is not a completely well-defined metric, as its denominator can be zero. This issue becomes more evident when we try to write its naively-conjugated form $\bar{p}$. For this reason, we turn to consider the smoothed precision

(5) $P_\gamma(\theta, D) = \frac{\gamma + \sum_i \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]}{\gamma + \sum_i \mathbb{1}[\hat{y}_i = 1]}$

which is a genuine RoS metric that subsumes the vanilla precision with $\gamma = 0$, and its average form

(6) $\bar{p}_\gamma(\theta, D) = \frac{1}{n} \sum_i \frac{\mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1] + \gamma}{\mathbb{1}[\hat{y}_i = 1] + \gamma}$

is always well defined for $\gamma \neq 0$.
Unlike accuracy, the (smoothed) precision metrics do not satisfy either of the two equality conditions above, and may suffer from the Simpson's bias in general. This is especially true for the small positive smoothing constants commonly used in existing practice, as Section 4 will later demonstrate. However, the following theorem shows that the Simpson's bias for smoothed precision may disappear under a special (and unusual) smoothing term $\gamma^*$, such that the smoothed precision equals precisely its conjugate metric under this special $\gamma^*$.
Theorem 1
There exists a special smoothing term $\gamma^*$ (depending on $\theta$ and $D$) such that $P_{\gamma^*}(\theta, D) = \bar{p}_{\gamma^*}(\theta, D)$.
More importantly, there turns out to be also a special smoothing term $\gamma^\dagger$, such that the averaged sample-level precision smoothed by this particular $\gamma^\dagger$ happens to equal precisely the original precision metric $P$.

Theorem 2
$\bar{p}_{\gamma^\dagger}(\theta, D) = P(\theta, D)$ if $\gamma^\dagger = -\frac{1}{n} \sum_i \mathbb{1}[\hat{y}_i = 0]$.

According to Theorem 2, the special smoothing term $\gamma^\dagger$ is the negated negative-output-rate of the model $\theta$. The theorem says that although the original precision metric does suffer from the Simpson's bias (in the sense that $\bar{p} \neq P$ in general), the bias can be completely resolved by using the special smoothing term $\gamma^\dagger$. Note that $\gamma^\dagger$, as a negative smoothing term, is outside the typical value range of smoothing-term tuning in previous works (which usually used $\gamma > 0$). (We also remark that the smoothing term was previously only used to make the precision metric well defined on singleton samples, not for solving the Simpson's bias.)
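As a quick numerical sanity check (ours, not part of the original experiments), the following sketch verifies this property on random labels: averaging the sample-level precision smoothed by the negated negative-output-rate recovers the corpus-level precision exactly, while an ordinary positive smoothing constant does not.

```python
import random

def smoothed_avg_precision(y, y_hat, gamma):
    # Averaged sample-level smoothed precision: mean of (a_i + gamma) / (b_i + gamma)
    terms = []
    for yi, pi in zip(y, y_hat):
        a = 1 if (yi == 1 and pi == 1) else 0   # sample-level true-positive indicator
        b = 1 if pi == 1 else 0                 # sample-level positive-output indicator
        terms.append((a + gamma) / (b + gamma))
    return sum(terms) / len(terms)

def corpus_precision(y, y_hat):
    tp = sum(1 for yi, pi in zip(y, y_hat) if yi == 1 and pi == 1)
    return tp / sum(y_hat)

random.seed(0)
y     = [random.randint(0, 1) for _ in range(1000)]
y_hat = [random.randint(0, 1) for _ in range(1000)]

# negated negative-output-rate of the "model"
gamma_star = -sum(1 for pi in y_hat if pi == 0) / len(y_hat)

assert abs(smoothed_avg_precision(y, y_hat, gamma_star) - corpus_precision(y, y_hat)) < 1e-9
assert smoothed_avg_precision(y, y_hat, 1.0) != corpus_precision(y, y_hat)
```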
Finally, the recall metric is symmetrically defined as $R(\theta, D) = \frac{\sum_i \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]}{\sum_i \mathbb{1}[y_i = 1]}$, thus all the observations about precision discussed above also symmetrically apply to recall. In particular, we have $\bar{r}_\gamma(\theta, D) = R(\theta, D)$ for $\gamma = -\frac{1}{n}\sum_i \mathbb{1}[y_i = 0]$.
Scenario 1.c: Dice Coefficient
Dice similarity coefficient (DSC) is a measure to gauge the similarity of two overlapped (sub)sets. In binary recognition problems, DSC is used as a performance metric that combines precision and recall.
Specifically, the DSC metric is the harmonic mean of precision and recall. Following the same formulation as Scenario 1.b, we can write

(7) $DSC(\theta, D) = \frac{2 \sum_i \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]}{\sum_i \mathbb{1}[y_i = 1] + \sum_i \mathbb{1}[\hat{y}_i = 1]}$

which is a RoS metric with $a_i = 2 \cdot \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]$ and $b_i = \mathbb{1}[y_i = 1] + \mathbb{1}[\hat{y}_i = 1]$. We can also similarly generalize DSC to the smoothed variant

(8) $DSC_\gamma(\theta, D) = \frac{\gamma + 2\sum_i \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1]}{\gamma + \sum_i \mathbb{1}[y_i = 1] + \sum_i \mathbb{1}[\hat{y}_i = 1]}$

which has the conjugated average form

(9) $\overline{dsc}_\gamma(\theta, D) = \frac{1}{n} \sum_i \frac{2 \cdot \mathbb{1}[y_i = 1] \cdot \mathbb{1}[\hat{y}_i = 1] + \gamma}{\mathbb{1}[y_i = 1] + \mathbb{1}[\hat{y}_i = 1] + \gamma}$
The following theorem shows an interesting connection between DSC and accuracy. See the proofs in Appendix A.
Theorem 3
$\overline{dsc}_\gamma(\theta, D) = A(\theta, D) + \big(1 - A(\theta, D)\big) \cdot \frac{\gamma}{1 + \gamma}$ for $\gamma > 0$.
When $\gamma \to 0^+$, the right-hand side of Theorem 3 is very close to the value of accuracy. So, it turns out that averaging the nearly unsmoothed sample-level DSC gives us the corpus-level accuracy: $\overline{dsc}_\gamma \to A(\theta, D)$ as $\gamma \to 0^+$. In other words, Theorem 3 implies that the original DSC metric (which is approximately $DSC_\gamma$ with $\gamma \to 0$, see (8)) does not only have the Simpson's bias, but the bias in this metric is so significant that its average-form conjugate with $\gamma \to 0$ has been completely distorted towards another metric (i.e., towards accuracy $A$).

Moreover, Theorem 3 further implies that the Simpson's bias in DSC cannot be resolved by any smoothing term $\gamma$. Specifically, the theorem asserts that the smoothed averaged DSC is monotonic to the error rate $1 - A$ under any admissible $\gamma$, and thus is monotonic to the correction rate (i.e., accuracy) as well. This means optimizing the average-form DSC under whatever admissible smoothing term will be equivalent to optimizing just the accuracy. In other words, in any binary recognition problem where the DSC metric is preferred over accuracy, the (potential) advantage of direct DSC optimization would be completely offset by the Simpson's bias, no matter how we tune the smoothing constant.
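This distortion can be checked numerically (an illustrative check of ours): the per-sample smoothed DSC takes value 1 on both true positives and true negatives and $\gamma/(1+\gamma)$ on both error types, so its average is the closed form $A + (1-A)\,\gamma/(1+\gamma)$, regardless of $\gamma$.

```python
import random

def avg_smoothed_dsc(y, y_hat, gamma):
    # mean over samples of (2*1[y=1]*1[yhat=1] + gamma) / (1[y=1] + 1[yhat=1] + gamma)
    total = 0.0
    for yi, pi in zip(y, y_hat):
        a = 2 * (yi == 1) * (pi == 1)
        b = (yi == 1) + (pi == 1)
        total += (a + gamma) / (b + gamma)
    return total / len(y)

def accuracy(y, y_hat):
    return sum(yi == pi for yi, pi in zip(y, y_hat)) / len(y)

random.seed(1)
y     = [random.randint(0, 1) for _ in range(2000)]
y_hat = [random.randint(0, 1) for _ in range(2000)]

acc = accuracy(y, y_hat)
for gamma in (1e-6, 0.1, 1.0):
    closed_form = acc + (1 - acc) * gamma / (1 + gamma)
    assert abs(avg_smoothed_dsc(y, y_hat, gamma) - closed_form) < 1e-9
```

In particular, `avg_smoothed_dsc` is a strictly increasing function of accuracy for every admissible `gamma`, which is exactly the equivalence claimed above.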
2.2 Special case 2: Macro-F1
The DSC metric can be further extended to multi-class classification problems, in which the model $\theta$ is asked to classify each sample input $x_i$ into one of $K$ predefined classes. The ground-truth label $y_i$ is a categorical variable whose $k$-th component $y_{i,k}$ is $1$ if sample $i$ is from class $k$, and $0$ otherwise. The decision of the model is similarly encoded by a one-hot vector $\hat{y}_i$, where $\hat{y}_{i,k} = 1$ if $k$ is the model output under $x_i$. For a given class $k$, the model is making binary recognition on the particular class $k$, thus all the metrics discussed so far apply in a per-class sense. Specifically, the model's precision for class $k$ is $P^{(k)}(\theta, D)$, its recall for class $k$ is $R^{(k)}(\theta, D)$, and the DSC for class $k$ is, accordingly, $DSC^{(k)}(\theta, D)$. The F1 score of the model is the mean DSC value averaged over all classes ((10) is usually called Macro-F1, although the same name has also been used for a similar but different metric Opitz and Burst (2019); other F1 variants also exist, such as Micro-F1; (10) is the evaluation metric used in the tasks that we will experimentally examine later), denoted as
(10) $F1(\theta, D) = \frac{1}{K} \sum_{k=1}^{K} DSC^{(k)}(\theta, D)$
The F1 metric is a linear sum of several RoS metrics, but is not itself a RoS metric. The corresponding (smoothed) average-form F1 is

(11) $\overline{f1}_\gamma(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} \frac{2\, y_{i,k}\, \hat{y}_{i,k} + \gamma}{y_{i,k} + \hat{y}_{i,k} + \gamma}$
From Theorem 3 we know that the average-form F1 (that is, $\overline{f1}_\gamma$ with $\gamma \to 0$) is equivalent to a "mean accuracy over classes" metric, which is different from the aggregated F1 metric (and is also different from the multi-class accuracy metric actually used in multi-class classification tasks).
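To make the disparity concrete, here is a small hand-constructed example (illustrative data of ours, $K = 3$ classes, near-zero smoothing) where corpus-level Macro-F1 and its averaged per-sample form differ by a wide margin:

```python
GAMMA = 1e-9   # near-unsmoothed
K = 3

y     = [0, 0, 1, 2]   # gold classes
y_hat = [0, 1, 1, 1]   # predicted classes

def dsc(numer2tp, denom):
    return (numer2tp + GAMMA) / (denom + GAMMA)

# Corpus-level Macro-F1: per-class DSC over the whole set, then mean over classes
macro_f1 = 0.0
for k in range(K):
    tp = sum(1 for g, p in zip(y, y_hat) if g == k and p == k)
    gold = sum(1 for g in y if g == k)
    pred = sum(1 for p in y_hat if p == k)
    macro_f1 += dsc(2 * tp, gold + pred) / K

# Averaged form: per-sample F1 (mean per-class DSC on each singleton), then mean
avg_f1 = 0.0
for g, p in zip(y, y_hat):
    per_sample = sum(dsc(2 * (g == k and p == k), (g == k) + (p == k))
                     for k in range(K)) / K
    avg_f1 += per_sample / len(y)

assert abs(macro_f1 - 7 / 18) < 1e-6   # aggregate: ~0.389
assert abs(avg_f1 - 2 / 3) < 1e-6      # averaged form: ~0.667
```

The averaged form rewards every sample's "true negative" classes with a per-class DSC of 1 (via the $\gamma/\gamma$ term), which is precisely the accuracy-like distortion predicted by Theorem 3.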
2.3 Special case 3: BLEU
BLEU is a widely used evaluation metric in machine translation (MT) and question answering (QA). Given a parallel corpus $D$ consisting of $n$ sentence pairs $(x_i, y_i)$, with $x_i$ being the source sentence and $y_i$ a reference translation, the MT model $\theta$ will generate a translation $\hat{y}_i$ for each $x_i$. The BLEU score of the model on such a data set is defined as

(12) $BLEU(\theta, D) = \min\left(1,\ e^{\,1 - \sum_i r_i / \sum_i c_i}\right) \cdot \left(\prod_{k=1}^{4} \frac{\sum_i m_{k,i}}{\sum_i t_{k,i}}\right)^{1/4}$

where $t_{k,i}$ is the total number of $n$-grams of length $k$ in $\hat{y}_i$, $m_{k,i}$ is the number of "matched" $n$-grams of length $k$ in $\hat{y}_i$, $c_i$ is the total number of 1-grams in $\hat{y}_i$, $r_i$ is the length of the reference $y_i$, and the exponent $1/4$ means taking the geometric mean over $k = 1 \ldots 4$.

To subsume the BLEU metric into our framework, define

(13) $F_{BLEU}(\theta, D) = \frac{1}{4} \sum_{k=1}^{4} \log \frac{\sum_i m_{k,i}}{\sum_i t_{k,i}} + \min\left(0,\ 1 - \frac{\sum_i r_i}{\sum_i c_i}\right)$

which is the logarithm of (12) and is thus equivalent to the exact BLEU metric in terms of model training. Similar to $F1$, the metric $F_{BLEU}$ is also an aggregation of five RoS sub-metrics. However, different from $F1$, the RoS sub-metrics in $F_{BLEU}$ each go through a nonlinear transformation before summing over together. The corresponding average-form BLEU is

(14) $\bar{f}_{BLEU}(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \left[\frac{1}{4} \sum_{k=1}^{4} \log \frac{m_{k,i}}{t_{k,i}} + \min\left(0,\ 1 - \frac{r_i}{c_i}\right)\right]$
Note that in $\bar{f}_{BLEU}$, a sample is a sentence, and the metric computes a sentence-level BLEU score Chen and Cherry (2014) for each sentence $i$, then takes the arithmetic mean over all sentence-level scores. Sentence-level training could be conducted based on $\bar{f}_{BLEU}$, as has been explored by many authors Ranzato et al. (2016); Shen et al. (2016); Wu et al. (2016); Bahdanau et al. (2017); Wu et al. (2018); Edunov et al. (2018), if the sentence-averaged BLEU indeed serves as a good proxy to the true evaluation metric $F_{BLEU}$, a presumption that we will experimentally examine in later sections.
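To see concretely why BLEU is not sample-separable, it suffices to look at one of its RoS sub-metrics, the 1-gram precision. A toy example (ours, with made-up sentences) shows that the corpus-level ratio of sums and the mean of per-sentence ratios already disagree:

```python
from collections import Counter

def matched_unigrams(hyp, ref):
    # clipped 1-gram matches, as in BLEU
    h, r = Counter(hyp), Counter(ref)
    return sum(min(c, r[w]) for w, c in h.items())

hyps = [["the", "cat", "sat"], ["a", "dog", "barked", "loudly", "today"]]
refs = [["the", "cat", "sat"], ["the", "dog", "barked"]]

m = [matched_unigrams(h, r) for h, r in zip(hyps, refs)]  # [3, 2]
t = [len(h) for h in hyps]                                # [3, 5]

corpus_precision = sum(m) / sum(t)                             # ratio of sums: 5/8
avg_precision = sum(mi / ti for mi, ti in zip(m, t)) / len(m)  # mean of ratios

assert abs(corpus_precision - 0.625) < 1e-12
assert abs(avg_precision - 0.7) < 1e-12
```

The corpus-level score weights each sentence by its length, while the averaged form weights every sentence equally; a model can therefore trade the two off differently, which is the Simpson's bias for this sub-metric.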
3 Connections to Simpson’s Paradox
Our naming of the bias between a corpus-level metric $F$ and its average-form conjugate $\bar{f}$ is largely inspired by its connection with the famous notion of Simpson's reversal paradox, which we will explain in this section.

Simpson's reversal often refers to the statistical observation that a candidate method/model is better in each and every case, but is worse in terms of the overall performance. For example, let $T$ be a new medical treatment that is better than the baseline method $B$ in terms of survival rate for both the group of male patients and the group of female patients; it turns out that $T$ could have a lower survival rate than $B$ for the combined group of all patients, as famously shown by Blyth (1972).

Many people find it surprising, or even paradoxical, when they observe Simpson's reversal. Blyth (1972) was the first to call this phenomenon Simpson's paradox, named after Edward H. Simpson for his technical note Simpson (1951) that proposed to study the phenomenon more carefully. On the other hand, Simpson's reversal, as a mathematical fact, is not too rare in real-world experiences. Pavlides and Perlman (2009) show that the reversal occurs in about 2% of all possible 2×2×2 contingency tables. It is then interesting to ask why people consider a not-so-uncommon phenomenon psychologically surprising – the paradoxical feeling appears to suggest some deeply held conviction in people's minds that the Simpson's reversal has clashed with.

The sure-thing principle has been hypothesized to be such a contradictory conviction behind the Simpson's paradox Pearl (2014); it validly asserts that a method that helps in every case must be beneficial in terms of the averaged performance under any mixture distribution. In the medical example above, for instance, the new method $T$ improves the survival rate for both males and females, which by the sure-thing principle does entail that $T$'s average survival rate under any given gender ratio must improve. However, it is often overlooked that the aggregated survival rate of a method (over both males and females) is not a simple average of its per-gender survival rates, but depends on the specific gender ratio that the method is facing (which may vary between methods). People might find the Simpson's reversal paradoxical if they overlook the difference between the averaged performance and the aggregated performance, in which case the observed reversal clashes with the sure-thing principle in the observer's mind.
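The classic reversal is easy to reproduce with concrete numbers (illustrative, not Blyth's original table): treatment $T$ beats baseline $B$ within every subgroup, yet loses in aggregate because the two methods face different subgroup mixtures.

```python
# (survivors, patients) per subgroup; T mostly treats the harder (female) group
T = {"male": (9, 10), "female": (27, 90)}    # per-group rates: 0.90 and 0.30
B = {"male": (72, 90), "female": (2, 10)}    # per-group rates: 0.80 and 0.20

for g in ("male", "female"):
    assert T[g][0] / T[g][1] > B[g][0] / B[g][1]   # T better in every subgroup

def aggregate(d):
    return sum(s for s, _ in d.values()) / sum(p for _, p in d.values())

assert aggregate(T) < aggregate(B)   # ...yet worse in aggregate: 0.36 vs. 0.74
```

The aggregate is a mixture-weighted average of the per-group rates, and the weights differ between $T$ and $B$; this is exactly the average-vs-aggregate disparity discussed above.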
We argue that this often-overlooked disparity between average and aggregate performance, possibly the real crux behind the Simpson's paradox, is indeed sometimes overlooked in the context of NLP training as well – not only regarding its existence, but also regarding its impact on the training. Given the presence of this disparity, a model that is better in terms of averaged per-sample performance could turn out to be worse in terms of the aggregate performance measured by applying the same evaluation metric to the whole data set directly. This reversal in ranking NLP models (or model parameters) can not only bias the gradient estimation in SGD (which is based on the average performance), causing inefficiency or failure to optimize the model towards better aggregate performance, but more severely, can cause the training to land in suboptimal solutions (in terms of aggregate performance) even if an oracle optimization procedure is given (which can at its best maximize the average performance). As both the aforementioned issue in model training and the classic Simpson's paradox in statistical sciences are fundamentally rooted in the disparity between two different ways to compute the same metric (averaged or aggregated), we call this disparity the Simpson's bias, so as to highlight the intrinsic connections between the two.
For completeness, we remark that there is another paradox about Simpson's reversal that arises when we must make decisions based on the reversed result – sometimes it feels reasonable to consult the aggregate measurement, while in other scenarios the per-case measurement is the one we want to resort to. This is a different paradoxical experience from the "Simpson's paradox" discussed above: one occurs when we merely observe the reversal, the other occurs when we go on to use the reversed data. For clarity we will call the former Simpson's Reversal Paradox (SRP) and the latter Simpson's Decision Paradox (SDP). There is an active AI community that studies SDP from a causal perspective Pearl (2014). Their causal framework also helps explain why people often overlook the Simpson's bias behind SRP.

We stress, however, that the SDP literature is less relevant to our paper, which focuses only on SRP. On the other hand, the causal explanation of SRP is complementary to our paper: we point out that the perhaps causally-rooted (or for whatever reason) tendency to overlook the Simpson's bias may not only induce the Simpson's Reversal Paradox in statistical sciences, but may also lead to undesired results in ML/NLP.
4 Experiments
This section experimentally studies (1) how significant the Simpson’s bias can be in standard NLP benchmarks and (2) how the bias affects the NLP training in those benchmarks. In the following, we report observations about these two questions in three common NLP tasks: Paraphrase Similarity Matching (PSM), Named Entity Recognition (NER) and Machine Translation (MT).
4.1 Experiment Design
The first question is relatively easy to address. Let $\theta$ be a NLP model trained for a task with training corpus $D$ and testing metric $F$; the significance of the Simpson's bias of $F$ on model $\theta$ is denoted by

(15) $\delta(\theta) = \left| F(\theta, D) - \bar{f}(\theta, D) \right|$

where $\bar{f}$ is the average-form metric corresponding to $F$. Note that model $\theta$ is not necessarily trained with $\bar{f}$, but we can generally measure the Simpson's bias between $F$ and $\bar{f}$ on an arbitrary model. In our experiments, we will measure the bias in various tasks with various metrics $F$, and on models trained with various loss functions under various hyper-parameter and preprocessing settings.
The second question, i.e., measuring the impact of the Simpson's bias, is more tricky. Ideally, one would want to directly compare the performance (in terms of $F$) between models trained with the sample-level objective $\bar{f}$ and those trained with the corpus-level objective $F$. However, a key obstacle here is that we cannot easily compute/estimate the gradient of the corpus-level objective $F$ (over any corpus beyond modest size) to optimize it, which is exactly why people turned to the sample-level objective in the first place. In our experiments we instead observe the impact of the Simpson's bias on NLP training from three indirect perspectives.

First, we seek to observe how consistent $F$ and $\bar{f}$ can be when used to compare a given pair of models. Such a model pair essentially serves as a highly degenerate model/parameter space (of size $2$), over which we want to see if the optimum of $\bar{f}$ is also the optimum of $F$. In this paper we focus on comparing pairs of models obtained from consecutive learning steps in a training process. For a learning step $t$, we measure the changing directions at $t$ by calculating $\Delta F_t$ and $\Delta \bar{f}_t$ according to:

(16) $\Delta F_t = F(\theta_{t+1}, D) - F(\theta_t, D), \qquad \Delta \bar{f}_t = \bar{f}(\theta_{t+1}, D) - \bar{f}(\theta_t, D)$

The sign of $\Delta F_t$ or $\Delta \bar{f}_t$ represents the changing direction. $\Delta F_t \cdot \Delta \bar{f}_t > 0$ indicates that $F$ and $\bar{f}$ are consistent in evaluating the models at steps $t$ and $t+1$. $\Delta F_t \cdot \Delta \bar{f}_t < 0$ suggests that $F$ and $\bar{f}$ have changed in opposite directions in step $t$, indicating inconsistent model evaluation. We call such an inconsistent pair $(\theta_t, \theta_{t+1})$ a reversal pair. If reversal pairs are rare throughout the whole training process, we can say that the changes of $F$ and $\bar{f}$ are highly consistent; in other words, we can maximize $F$ by optimizing $\bar{f}$. Alternatively, if there are a large number of reversal pairs, we may at least need a longer time to reach the optimal $F$. Moreover, a tremendous number of inconsistent directions increases the risk that $F$ ends up significantly suboptimal.
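The reversal-pair test can be sketched as follows (the two metric curves here are made-up stand-ins for values logged during training, not the paper's data):

```python
# Count "reversal pairs": consecutive training steps where the corpus-level
# metric F and its averaged form f-bar move in opposite directions.
F_curve    = [0.10, 0.15, 0.14, 0.20, 0.22, 0.21]
fbar_curve = [0.12, 0.13, 0.16, 0.18, 0.17, 0.20]

def reversal_pairs(F_vals, fbar_vals):
    count = 0
    for t in range(len(F_vals) - 1):
        dF = F_vals[t + 1] - F_vals[t]
        df = fbar_vals[t + 1] - fbar_vals[t]
        if dF * df < 0:   # opposite changing directions at step t
            count += 1
    return count

assert reversal_pairs(F_curve, fbar_curve) == 3   # 3 of the 5 steps reverse
```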
Our second experiment to observe the impact of the Simpson's bias is to compare models trained with $\bar{f}$ to those trained with the standard CE loss. In particular, some previous NLP works, such as Li et al. (2020), proposed to replace the CE loss with a smoothed Dice loss for imbalanced data sets due to its similarity to the F1 metric. Instead of asking if models thus trained are competitive with those trained directly with F1, we ask: how much can the models trained with Dice loss (at least) outperform those trained with CE loss? As our theoretical analysis (Theorem 3 in particular) has pointed out, optimizing the smoothed average-form DSC is actually equivalent to optimizing the accuracy. One may then expect comparable learning results between the smoothed Dice loss and the CE loss. If this were indeed the case, it would indirectly indicate that the models trained with Dice loss (corresponding to $\bar{f}$) might be substantially suboptimal in F1 (corresponding to $F$), assuming that the CE loss (which is not F1-oriented) cannot fully optimize F1 (which was the general premise for considering a conjugated loss at all).
Our third experiment on the impact of Simpson’s bias is to examine the correlation between the bias and the training quality (in varying training settings). If high significanceofbias is correlated with low training quality, it may potentially imply some deeper causal relationships between the two.
4.2 Dataset and Setting
For PSM, we use two standard data sets: the Microsoft Research Paraphrase Corpus (MRPC) Dolan and Brockett (2005) and Quora Question Pairs (QQP) Wang et al. (2018). We adopt the pre-trained BERT-base-uncased model with different training objectives (CE and Dice loss). The officially recommended parameter settings Wolf et al. (2019) are used, including max sequence length = 128, epoch number = 3, train batch size = 32, learning rate = 2e-5, and smoothing constant $\gamma = 1$.

For NER, we fine-tune the BERT-base-multilingual-cased model with different loss functions (CE / Dice) on the GermEval 2014 dataset Benikova et al. (2014). Formally, let $D$ be a NER data set consisting of $N$ sentences in total; each sentence $i$ has $M_i$ tokens. We want to train a neural network model that classifies each token $j$ into one of $K$ predefined entity classes. In the experiment, we use the same settings as Wolf et al. (2019), including max sequence length = 128, epoch = 3, lr = 5e-5, batch size = 32, and the Dice loss is $DL = 1 - \overline{dsc}$, where $\overline{dsc}$ refers to:

(17) $\overline{dsc} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \sum_{j=1}^{M_i} p_{ij}\, y_{ij} + \gamma}{\sum_{j=1}^{M_i} p_{ij} + \sum_{j=1}^{M_i} y_{ij} + \gamma}$

where $y_{ij}$ is the gold label indicator and $p_{ij}$ the model's predicted probability for token $j$ of sentence $i$.
There is an alternative Dice loss $DL' = 1 - \overline{dsc}'$, where $\overline{dsc}'$ is defined as:

(18) $\overline{dsc}' = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M_i} \sum_{j=1}^{M_i} \frac{2\, p_{ij}\, y_{ij} + \gamma}{p_{ij} + y_{ij} + \gamma}$

Both (17) and (18) correspond to Dice losses, but (17) uses the "standard" method that scores all entity phrases in a sentence jointly, while (18) is a variant of (17) that treats each token independently, and thus obviously introduces the Simpson's bias relative to (17).

This latter Dice loss is ill-conditioned. Since the sentences in the dataset do not all have the same number of words, padding is necessary. Ideally, padding should make no or almost no contribution to the training objective; in (18), however, without additional processing the effect of padding tokens is the same as that of negative examples in the dataset. At the same time, the smoothing is applied directly to each individual token, so the DSC value of a single negative example changes from 0 to 1. Such changes make training hard.

For MT, we train a transformer model Vaswani et al. (2017) on the IWSLT 2016 dataset using the default settings of the original paper, except that we hold the learning rate constant and set the batch size to a fixed number of tokens after padding.
More details of data and setting appear in the appendix.
4.3 Significance of Simpson’s Bias
For the PSM task, Figures 0(a) and 0(b) show how the Simpson's bias changes over time when training BERT with the Dice loss on the MRPC/QQP tasks. As the training progresses, the value of $\delta$ gradually decreases, but it still cannot be ignored at the end of training. For the NER task, the Simpson's bias cannot be resolved by smoothing. Because of the significant bias between $F$ and $\bar{f}$, one of the scores seems to converge early in Figure 0(c), but it does not: over the whole training process, one score increases rapidly and then changes only on a small scale, while the other increases slowly and finally converges to about 0.4. For the MT task, Figure 0(d) shows how the $F$ and $\bar{f}$ scores change over time during training; while both increase, there is a clear disparity between them. From these observations, we find that (1) the smoothing strategy in these NLP tasks is of limited use for eliminating the bias; (2) throughout the whole training process, the value of the bias is significant and cannot be ignored.
4.4 Impact of Simpson’s Bias
Consistency testing
This experiment seeks to observe how consistent $F$ and $\bar{f}$ can be when used to compare a given pair of models. For the PSM task, Figures 1(a) and 1(b) show a clear inconsistency between the changes in $F$ and $\bar{f}$ on the MRPC and QQP tasks. Tracking the direction of the DSC value changes at steps $t$ and $t+1$, we find that about half of the training steps show opposite trends between $F$ and $\bar{f}$. In Figure 1(b), 46 out of 100 sampled dot pairs have different change directions; the red dots indicate the disparity between $\Delta F$ and $\Delta \bar{f}$. For the NER task, there are some extreme values early in training, which reflect the fastest improvements; since these extreme values hinder our analysis, they are excluded from Figure 1(c). As can be seen from Figure 1(c), in most cases the change directions of $F$ and $\bar{f}$ are completely inconsistent. For the MT task, we plotted the scattered dots for each pair of changes to see whether both scores increase or decrease in the same direction; a substantial fraction of the sampled dots have different changing directions. With such a large number of reversal pairs on these NLP tasks, training may at least need a longer time to reach the optimum. Moreover, the high degree of inconsistency between $F$ and $\bar{f}$ may increase the difficulty of optimization.
Comparison with CE
This experiment observes the impact of the Simpson's bias by comparing models trained with $\bar{f}$ to those trained with the standard CE loss. For the PSM task, as shown in Table 1, BERT trained with the CE loss outperforms BERT trained with the Dice loss by a small margin: +0.78/+0.45 in terms of F1 score on the MRPC/QQP tasks. For the NER task, as Table 1 shows, the model trained with CE is about 3.53 points higher than that trained with Dice. The fact that Dice did not achieve better performance in Table 1 suggests that it does not necessarily drive the optimization toward high DSC scores, despite their similarity, and that using smoothing constants does not eliminate the Simpson's bias on these tasks.
Loss      | MRPC  | QQP   | NER
----------|-------|-------|------
CE Loss   | 89.78 | 87.84 | 86.14
Dice Loss | 89.00 | 87.39 | 82.61
Impacts on training quality
We conduct more experiments under different settings to obtain various model variants on the MRPC task. No matter how we modify the hyper-parameters, the bias between $F$ and $\bar{f}$ remains significant: there are still many reversal pairs, and the performance of the model trained with $\bar{f}$ is worse than that of the model trained with CE. Meanwhile, we find a negative relation between model quality on the training dataset and the significance of the bias $\delta$. Figure 3 is a scatter plot of the significance of the bias against training quality; as can be seen from the figure, $\delta$ tends to decrease as training quality increases. These experimental results suggest that the Simpson's bias is a common phenomenon in NLP training and does not vanish with model tuning. See more discussions in the appendix.
5 Conclusions
In this paper we coined a new concept, the Simpson's bias, for its similar role in inducing suboptimal training in ML and in inducing the Simpson's paradox in statistics. We presented a theoretical taxonomy for the Simpson's bias in ML, revealing how a similar effect is embodied in a wide spectrum of ML metrics, from ones as simple as accuracy to ones as sophisticated as BLEU. For some aggregate-form metrics, we showed that it is possible to construct provably unbiased average-form surrogates by adding special and uncommon (e.g., negative) smoothing constants. But the Simpson's bias is generally a factor with important impact in a variety of NLP tasks, as our experiments showed. We observed both noticeable margins of the bias and a significant number of "reversed" SGD steps in all the different tasks, datasets, and metrics. Our experiments also show that models trained with "naively conjugated" objectives (such as Dice loss for F1) can be even worse than those trained with non-conjugated objectives (such as CE loss for F1), which could reflect a significant suboptimality when training with (seemingly) conjugated objectives. Finally, a clear correlation between the Simpson's bias and training quality is consistently observed. We believe these results indicate that the Simpson's bias is a serious issue in NLP training, and probably in machine learning in general, that deserves more study in the future.
References
Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. 2017. An actor-critic algorithm for sequence prediction. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.
Benikova, D., Biemann, C., and Reznicek, M. 2014. NoSta-D named entity annotation for German: guidelines and dataset. In LREC, pp. 2524–2531.
Blyth, C. R. 1972. On Simpson's paradox and the sure-thing principle. Journal of the American Statistical Association 67(338), pp. 364–366.
Chen, B., and Cherry, C. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 362–367.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dolan, W. B., and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Edunov, S., Ott, M., Auli, M., Grangier, D., and Ranzato, M. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 355–364.
Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. 2020. Dice loss for data-imbalanced NLP tasks. In ACL, pp. 465–476.
Milletari, F., Navab, N., and Ahmadi, S.-A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
Opitz, J., and Burst, S. 2019. Macro F1 and macro F1. arXiv preprint arXiv:1911.03347.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
Pavlides, M. G., and Perlman, M. D. 2009. How likely is Simpson's paradox? The American Statistician 63(3), pp. 226–233.
Pearl, J. 2014. Comment: understanding Simpson's paradox. The American Statistician 68(1), pp. 8–13.
Post, M. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.
Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. 2016. Minimum risk training for neural machine translation. In ACL (1).
Simpson, E. H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological) 13(2), pp. 238–241.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. 2018. GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wolf, T., et al. 2019. HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Wu, L., Tian, F., Qin, T., Lai, J., and Liu, T.-Y. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3612–3621.
Wu, Y., et al. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Appendix A Proofs
Lemma 1
for .
Proof. By definition . As both and are binary variables in , we can write the contingency table of as follows:

for  
for  
for  
for 

From the table we see that is anchored at except for and , in which case gets an additional penalty of . With this observation we immediately have
Proof of Theorem 1: Let and denote the set of false positives and true positives, respectively. From Lemma 1 we have . On the other hand, from (5) we have . Comparing the two equations, we see that they coincide when the denominators are equal, that is, if
(19)
Rearranging (19) gives , as desired.
Note that (19) is based on Lemma 1, which requires and , or equivalently, and . As the theorem has already excluded the case of , we only need to further handle the special case of .
The problem with is that in this case , which invalidates Lemma 1. However, taking a closer look at its proof, we see that the only reason Lemma 1 excludes is that it makes the first two entries of 's contingency table ill-defined. Nevertheless, note that with we are dealing with a special model that always outputs , in which case we never run into the first two entries of 's contingency table at all. As a result, in the special case of , Lemma 1 holds – and thus (19) also holds – even if .
Finally, we remark that for , that is, for models with exactly one positive output throughout the data set , we must indeed have , since otherwise is ill-defined on that single positive instance. On the other hand, we see from the above proof that only if . This contradiction means there is no way to make when .
Proof of Theorem 2: The proof idea is similar to that of Theorem 1, except that now we want to connect to . Clearly, the equality condition for is
(20)
or equivalently, .
Again, we need to discuss the two special cases and separately (as and in these two cases, respectively, which invalidates Lemma 1). But this time we observe that Theorem 2 is valid in both special cases, so we do not need to exclude any model (even those that always or never output a positive) from the theorem. Specifically, when (or ) we have (or ), in which case Lemma 1 and (20) hold even for (or ), as the last (or first) two entries of 's contingency table are impossible.
Proof of Theorem 3: The proof idea is similar to that of Lemma 1. By definition , whose contingency table is as follows.
for  
for  
for  
for 
From the table we see that when , and when . With this observation we have
when .
Note that the above result also implies that with