Simpson's Bias in NLP Training

by Fei Yuan, et al.

In most machine learning tasks, we evaluate a model M on a given data population S by measuring a population-level metric F(S;M). Examples of such evaluation metrics F include precision/recall for (binary) recognition, the F1 score for multi-class classification, and the BLEU metric for language generation. On the other hand, the model M is trained by optimizing a sample-level loss G(S_t;M) at each learning step t, where S_t is a subset of S (a.k.a. the mini-batch). Popular choices of G include cross-entropy loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption behind this paradigm is that the mean value of the sample-level loss G, if averaged over all possible samples, should effectively represent the population-level metric F of the task, i.e., that 𝔼[ G(S_t;M) ] ≈ F(S;M). In this paper, we systematically investigate the above assumption in several NLP tasks. We show, both theoretically and experimentally, that some popular designs of the sample-level loss G may be inconsistent with the true population-level metric F of the task, so that models trained to optimize the former can be substantially sub-optimal with respect to the latter — a phenomenon we call Simpson's bias, due to its deep connections with the classic paradox known as Simpson's reversal paradox in statistics and the social sciences.
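As a hypothetical illustration of the gap the abstract describes (not an example from the paper itself): for a non-linear metric such as F1, the mean of per-batch scores 𝔼[ G(S_t;M) ] generally does not equal the population-level score F(S;M), even though the underlying predictions are identical.

```python
# Illustrative sketch: mean per-batch F1 (sample-level G) vs. F1 over
# the pooled population (population-level F). The two values differ,
# which is the mismatch underlying Simpson's bias.

def f1(preds, labels):
    """Binary F1 score for parallel 0/1 lists."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Two mini-batches with very different label balance.
batch1 = ([1, 1, 1, 1], [1, 1, 1, 0])  # mostly positive labels
batch2 = ([0, 0, 0, 1], [1, 0, 0, 0])  # mostly negative labels

# Mean of the per-batch (sample-level) F1 scores: E[G(S_t; M)]
mean_batch_f1 = (f1(*batch1) + f1(*batch2)) / 2

# Population-level F1 over the pooled data: F(S; M)
pop_preds = batch1[0] + batch2[0]
pop_labels = batch1[1] + batch2[1]
pop_f1 = f1(pop_preds, pop_labels)

print(mean_batch_f1, pop_f1)  # ~0.4286 vs. ~0.6667 -- not equal
```

Here the averaged sample-level score (3/7 ≈ 0.43) substantially understates the population-level score (2/3 ≈ 0.67), so a model tuned on the former optimizes a different objective than the metric it is evaluated on.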
