Artificial Intelligence (AI) agents have begun to outperform humans on remarkably challenging tasks: AlphaGo defeated legendary Go players silver2016mastering; singh2017learning, and OpenAI's Dota 2 AI has defeated human world champions of the game berner2019dota. These AI tasks can be evaluated objectively, e.g. using the total score achieved in a game or the outcome of a match against another player. However, for supervised learning tasks such as image classification and sentiment analysis, certifying a machine learning model as superhuman is subjectively tied to human judgments rather than comparison against an oracle. This work focuses on paving a way towards evaluating models with potentially superhuman performance in classification.
When evaluating the performance of a classification model, we generally rely on the accuracy of the predicted labels with respect to ground-truth labels, which we call the oracle accuracy. However, oracle labels may arguably be unobservable. For tasks such as object detection, the predictions are subject to many factors specific to the annotators, e.g., their background and physical or mental state. For other tasks, even experts may not be able to articulate an explicit rule for the prediction, such as predicting molecule toxicity and stability. Without observing the oracle labels, human predictions or aggregated human annotations are treated as ground truth wang2018glue; lin2014microsoft; wang2019superglue to approximate the oracle. Such approximation suffers from two main disadvantages. Firstly, the quality control of human annotation is challenging artstein2017inter; lampert2016empirical. Secondly, current evaluation paradigms focus on evaluating the performance of models, but not the oracle accuracy of humans; yet we cannot claim that a machine learning model is superhuman without a proper estimate of human performance.
In this paper, we work in the setting where oracle labels are unobserved (see Figure 1). Within this setting, we develop a theory for estimating the oracle accuracy on classification tasks. Our theory includes i) upper bounds on the average oracle accuracy of the annotators, ii) lower bounds on the oracle accuracy of the model, and iii) finite sample analysis for both bounds and their margin, which represents the model's outperformance. We propose an algorithm to discover competitive models and to report confidence scores, which formally bound the probability that a given model outperforms the average human annotator. Empirically, we observe that some existing models for sentiment classification and natural language inference (NLI) have already achieved superhuman performance.
2 Related Work
Classification accuracy is a widely used measure of model performance han2011data, although there are other options such as precision, recall, F1-score chowdhury2010introduction; sasaki2007truth, Matthews correlation coefficient matthews1975comparison; chicco2020advantages, etc. Accuracy measures the agreement between the model outputs and some reference labels. A common practice is to collect human labels to serve as the reference. However, we argue that the ideal reference is the (unobserved) oracle, as human predictions are imperfect. We focus on measuring the oracle accuracy of both human annotators and machine learning models, and on comparing the two.
A widely accepted approach is to crowd-source kittur2008crowdsourcing; mason2012conducting a dataset for testing purposes. Researchers collect a large corpus with each example labeled by multiple annotators, and the aggregated annotations are then treated as ground-truth labels socher2013recursive; bowman2015large. Aggregation largely reduces the variance of the predictions nowak2010reliable; kruger2014axiomatic; however, the aggregated results are still not the oracle, and their difference from the oracle remains unclear. In this paper, we prove that the accuracy against aggregated human predictions, treated as ground truth, can be viewed as a special case of our lower bound on the oracle accuracy of machine learning models. On the other hand, much work considers the reliability of collected data by reporting agreement scores between annotators landis1977measurement. Statistical measures of inter-annotator agreement gwet2010handbook, such as Cohen's Kappa pontius2011death and Fleiss' Kappa fleiss1971measuring, are normally based on the raw agreement ratio. However, agreement between annotators does not directly reflect oracle accuracy; e.g. identical predictions from two annotators do not mean that both are oracles. We prove that the observed agreement between annotators can serve as an upper bound on their average oracle accuracy. Overall, we propose a theory for comparing the oracle accuracy of human annotators and machine learning models by connecting these two bounds.
The discovery that models can predict better than human experts dates back at least to the seminal and controversial work of Meehl54clinicalversus, which compared ad hoc predictions based on subjective, informal information with those of simple linear models over a (typically small) number of relevant numeric attributes. Subsequent work found that one may even train such a model to mimic the predictions made by the experts (rather than an oracle) and still maintain superior out-of-sample performance goldberg70. The comparison of human and algorithmic decision making remains an active topic of psychology research kahnemannoise.
3 Evaluation Theory
In this section, we present our theory for comparing the oracle accuracy of human annotators and machine learning models on classification tasks.
3.1 Problem Statement
We are given labels crowd-sourced from $m$ human annotators, $h_1, \dots, h_m$, along with labels from a model $f$. We denote by $h_j(x_i)$ and $f(x_i)$ the labels assigned by annotator $h_j$ and model $f$ to the $i$-th data point $x_i$, for $i = 1, \dots, n$. We observe the ratio of matched labels $P(h_j = h_k)$ for all pairs of annotators $h_j$ and $h_k$. Denote by $a$ the label of the "average" human annotator, which we define as the label obtained by selecting one of the $m$ human annotators uniformly at random. We seek to formally compare the oracle accuracy of the average human, $P(a = o)$, with that of the machine learning model, $P(f = o)$, where $o$ denotes the unobserved oracle label. Denote by $g$ the label obtained by aggregating (say, by majority voting) the human annotators' labels. Our work distinguishes between the oracle accuracy $P(f = o)$ and the agreement with human annotations $P(f = g)$, although these two concepts have been confounded in many previous applications and benchmarks.
3.2 An Upper Bound for the Average Annotator Performance
The oracle accuracy of the average annotator follows the definition of the previous section, and conveniently equals the average of the oracle accuracies of the individual annotators, i.e.
$$P(a = o) = \frac{1}{m} \sum_{j=1}^{m} P(h_j = o).$$
By introducing an assumption, also discussed in Section 4.2, we may bound the above quantity.
Theorem (Average Performance Upper Bound)
Assume the annotators are positively correlated, namely $P(h_j = o, h_k = o) \ge P(h_j = o)\,P(h_k = o)$ for all $j \ne k$. Then, the averaged annotator accuracy with respect to the oracle is upper bounded as
$$P(a = o) \le \sqrt{\frac{1}{m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} P(h_j = h_k)}. \qquad (2)$$
We observe that $P(h_j = o)\,P(h_k = o)$ is overestimated as $P(h_j = h_j) = 1$ when $j = k$, but that the total overestimation in (2) is less than or equal to $1/m$ ($m$ out of $m^2$ terms), and that this influence reduces and converges to zero as $m \to \infty$. To calibrate the overestimation, we introduce an empirically approximated upper bound $U_{\mathrm{a}}$. In contrast, the bound in (2) is also referred to as the theoretical upper bound, $U_{\mathrm{t}}$.
The empirically approximated upper bound excludes the $m$ diagonal terms,
$$U_{\mathrm{a}} = \sqrt{\frac{1}{m(m-1)} \sum_{j \ne k} P(h_j = h_k)}.$$
Lemma (Convergence of $U_{\mathrm{a}}$)
Assume that $U_{\mathrm{a}}^2 \ge 1/c$, where $c$ is the number of classes. The approximated upper bound satisfies
$$0 \le U_{\mathrm{t}}^2 - U_{\mathrm{a}}^2 = \frac{1}{m}\left(1 - U_{\mathrm{a}}^2\right) \le \frac{1}{m}\left(1 - \frac{1}{c}\right).$$
Therefore, for large $m$, $U_{\mathrm{a}}$ converges to $U_{\mathrm{t}}$ from below.
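To make these bounds concrete, here is a minimal NumPy sketch (our own illustration; the function name and array layout are not from the paper) that computes the empirical pairwise agreements together with $U_{\mathrm{t}}$ and $U_{\mathrm{a}}$ from an $m \times n$ matrix of annotator labels.

```python
import numpy as np

def upper_bounds(labels: np.ndarray):
    """Upper bounds on the average annotator oracle accuracy (Section 3.2).

    labels: integer array of shape (m, n); labels[j, i] is annotator j's
    label for sample i.  Returns (U_theory, U_approx).
    """
    m, _ = labels.shape
    # Empirical pairwise agreement ratios, agree[j, k] ~ P(h_j = h_k).
    agree = (labels[:, None, :] == labels[None, :, :]).mean(axis=2)
    # Theoretical bound: includes the m diagonal terms, which equal 1.
    u_theory = np.sqrt(agree.sum() / m ** 2)
    # Approximated bound: drops the diagonal to calibrate the overestimation.
    u_approx = np.sqrt((agree.sum() - np.trace(agree)) / (m * (m - 1)))
    return u_theory, u_approx

# Toy usage: 5 simulated annotators who match a latent "oracle" 80% of the time.
rng = np.random.default_rng(0)
oracle = rng.integers(0, 3, size=200)
labels = np.stack([
    np.where(rng.random(200) < 0.8, oracle, rng.integers(0, 3, size=200))
    for _ in range(5)
])
print(upper_bounds(labels))
```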
3.3 A Lower Bound for Model Performance
For our next result, we introduce another assumption, also discussed in Section 4.2. Given two predictors $f$ and $g$, we assume that $f$ is reasonably predictive even on those instances that $g$ gets wrong, in the following sense.
Theorem (Performance Lower Bound)
Assume that for any incorrect label $\tilde{y} \ne o$,
$$P(f = o \mid g = \tilde{y}) \ge P(f = \tilde{y} \mid g = \tilde{y}).$$
Then, the oracle accuracy of $f$ is lower bounded by its agreement with $g$:
$$P(f = o) \ge P(f = g).$$
In practice, a more accurate $g$ gives a tighter lower bound on $P(f = o)$, and so we employ the aggregated human annotations as the reference (letting $g$ be the aggregated label) to calculate the lower bound for the machine learning model (letting $f$ be the model's prediction), as demonstrated in Section 4.2.
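As a sketch of this recipe (our own code, assuming majority voting as the aggregation scheme; the helper names are hypothetical), the lower bound is simply the model's agreement rate with the aggregated labels.

```python
import numpy as np

def majority_vote(annotator_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Aggregate an (m, n) matrix of annotator labels into one label per
    sample; ties are broken towards the smallest class index."""
    counts = np.apply_along_axis(
        np.bincount, 0, annotator_labels, minlength=num_classes)  # (c, n)
    return counts.argmax(axis=0)

def model_lower_bound(model_labels: np.ndarray,
                      annotator_labels: np.ndarray,
                      num_classes: int) -> float:
    """Agreement of the model with the aggregated labels: a lower bound on
    the model's oracle accuracy under the assumption of Theorem 3.3."""
    g = majority_vote(annotator_labels, num_classes)
    return float((model_labels == g).mean())
```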
Connection to common practice.
Generally, the ground truth of a benchmark corpus is constructed by aggregating multiple human annotations wang2018glue; wang2019superglue. For example, the averaged sentiment score is used in SST socher2013recursive, and the majority vote in SNLI bowman2015large. The aggregated annotations are then treated as ground truth to calculate accuracy. Under this setting, the accuracy against the (aggregated) human ground truth may be viewed as a special case of our lower bound.
3.4 Finite Sample Analysis
The results above assume that the agreement probabilities $P(h_j = h_k)$ and $P(f = g)$ are known; we now address the finite sample case where those probabilities are estimated empirically. We begin with a standard concentration inequality (see e.g. (boucheron2013concentration, § 2.6)).
Theorem (Hoeffding’s Inequality)
Let $X_1, \dots, X_n$ be independent random variables such that $a_i \le X_i \le b_i$ almost surely, for all $i$. Let $S_n = \sum_{i=1}^{n} \left( X_i - \mathbb{E}[X_i] \right)$; then, for any $t > 0$,
$$P(S_n \ge t) \le \exp\!\left( - \frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).$$
Combining this with Theorem 3.2, we obtain the following.
Theorem (Sample Average Performance Upper Bound)
Take the assumptions of Theorem 3.2, and let
$$\hat{P}(h_j = h_k) = \frac{1}{n} \sum_{i=1}^{n} \left[ h_j(x_i) = h_k(x_i) \right]$$
be the empirical agreement ratio (here $[\cdot]$ is the Iverson bracket). Define
$$\hat{U}_{\mathrm{t}} = \sqrt{\frac{1}{m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} \hat{P}(h_j = h_k)}.$$
With probability at least $1 - \delta$, for any $\delta \in (0, 1)$,
$$P(a = o) \le \sqrt{\hat{U}_{\mathrm{t}}^2 + \sqrt{\frac{\log(1/\delta)}{2n}}}.$$
Analogously for Theorem 3.3, we have
Theorem (Sample Performance Lower Bound)
Take the assumptions of Theorem 3.3, and let $\hat{P}(f = g) = \frac{1}{n} \sum_{i=1}^{n} \left[ f(x_i) = g(x_i) \right]$ be the empirical agreement ratio between the model and the reference. With probability at least $1 - \delta$, for any $\delta \in (0, 1)$,
$$P(f = o) \ge \hat{P}(f = g) - \sqrt{\frac{\log(1/\delta)}{2n}}.$$
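The finite-sample corrections are mechanical to apply. The sketch below is our own helper, assuming the one-sided Hoeffding deviation $\sqrt{\log(1/\delta)/(2n)}$ of the theorems above; it adjusts the empirical bounds for a chosen failure probability $\delta$.

```python
import math

def adjusted_bounds(u_hat: float, l_hat: float, n: int, delta: float = 0.05):
    """Hoeffding-adjusted upper and lower bounds, each valid with
    probability at least 1 - delta over the n test samples."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    # The deviation enters the squared term of the upper bound.
    upper = math.sqrt(min(1.0, u_hat ** 2 + eps))
    lower = max(0.0, l_hat - eps)
    return upper, lower

# e.g. with n = 2000 samples and delta = 0.05, eps is roughly 0.027.
print(adjusted_bounds(u_hat=0.90, l_hat=0.95, n=2000))
```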
3.5 Detecting and Certifying Superhuman Models
We propose a procedure to discover potentially superhuman models based on our theorems.
Calculate the upper bound on the average oracle accuracy of the human annotators, $\hat{U}$, with $n_1$ samples;
Calculate the lower bound $\hat{L}$ on the oracle accuracy of the model, using the aggregated human annotations as the reference (we demonstrate that aggregating the predictions by voting and weighted averaging is effective in improving our bounds; we emphasize, however, that the aggregated predictions need not be perfect, as we do not assume that this aggregation yields an oracle), with $n_2$ samples;
Check whether the finite sample margin $\hat{\Delta} = \hat{L} - \hat{U}$ is larger than zero;
Choose $\epsilon_1$ and $\epsilon_2$ and calculate a confidence score for the out-performance.
Generally, a larger margin indicates higher confidence in the out-performance. To formally quantify the confidence associated with the aforementioned margin, we provide the following.
Theorem (Confidence of Out-Performance)
Take the assumptions of Theorems 3.2 and 3.3, with $n_1$ samples behind the upper bound and $n_2$ samples behind the lower bound. If the empirical margin satisfies $\hat{\Delta} = \hat{L} - \hat{U} \ge \epsilon_1 + \epsilon_2$ for some $\epsilon_1, \epsilon_2 > 0$, then $P(f = o) > P(a = o)$ holds with probability at least
$$1 - e^{-2 n_1 \epsilon_1^2} - e^{-2 n_2 \epsilon_2^2}. \qquad (28)$$
Confidence Score Estimation.
The above theorem suggests the confidence score
$$S = 1 - e^{-2 n_1 \epsilon_1^2} - e^{-2 n_2 \epsilon_2^2},$$
and we now need only choose the free constants $\epsilon_1$ and $\epsilon_2$ on which it depends. Recall (28), and remove one degree of freedom by parameterising $\epsilon_2$ in terms of $\epsilon_1$ as
$$\epsilon_2 = \hat{\Delta} - \epsilon_1.$$
We are interested in the largest margin we can certify, and so we may set $\epsilon_1 + \epsilon_2 = \hat{\Delta}$. We offer two alternatives for selecting $\epsilon_1$ and $\epsilon_2$.
Algorithm 1 (Heuristic Margin Separation, HMS).
We assign half of the margin to $\epsilon_1$, i.e. $\epsilon_1 = \hat{\Delta} / 2$. Then, with $\epsilon_2 = \hat{\Delta} - \epsilon_1$, we calculate the corresponding $\delta_1 = e^{-2 n_1 \epsilon_1^2}$ and $\delta_2 = e^{-2 n_2 \epsilon_2^2}$, and compute the heuristic confidence score $S_{\mathrm{HMS}} = 1 - \delta_1 - \delta_2$.
Algorithm 2 (Optimal Margin Separation, OMS).
For a locally (in $\epsilon_1$) optimal confidence score, we perform gradient ascent lemarechal2012cauchy on $S(\epsilon_1)$, where
$$S(\epsilon_1) = 1 - e^{-2 n_1 \epsilon_1^2} - e^{-2 n_2 (\hat{\Delta} - \epsilon_1)^2},$$
with $\epsilon_1$ initialized as $\hat{\Delta} / 2$ before optimization (for all OMS experiments, we set the learning rate to $10^{-4}$ and iterate 100 times; we will publish our code upon acceptance).
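Both margin-separation schemes are straightforward to reproduce. The sketch below is our own reading of the two algorithms, where $n_1$ and $n_2$ denote the sample counts behind the upper and lower bounds, the $\delta$ terms follow the Confidence of Out-Performance theorem as reconstructed above, and the learning rate and iteration count follow the footnote.

```python
import math

def confidence_hms(margin: float, n1: int, n2: int) -> float:
    """Heuristic Margin Separation: assign half of the margin to each bound."""
    e1 = e2 = margin / 2.0
    return 1.0 - math.exp(-2 * n1 * e1 ** 2) - math.exp(-2 * n2 * e2 ** 2)

def confidence_oms(margin: float, n1: int, n2: int,
                   lr: float = 1e-4, steps: int = 100) -> float:
    """Optimal Margin Separation: gradient ascent on the confidence score
    over the split e1, with e2 = margin - e1 (one degree of freedom)."""
    def score(e1: float) -> float:
        return (1.0 - math.exp(-2 * n1 * e1 ** 2)
                    - math.exp(-2 * n2 * (margin - e1) ** 2))

    e1 = margin / 2.0  # initialise at the HMS split
    for _ in range(steps):
        e2 = margin - e1
        # Analytic gradient of score(e1).
        grad = (4 * n1 * e1 * math.exp(-2 * n1 * e1 ** 2)
                - 4 * n2 * e2 * math.exp(-2 * n2 * e2 ** 2))
        e1 = min(max(e1 + lr * grad, 0.0), margin)  # keep both deviations >= 0
    return score(e1)

# With unequal sample sizes the optimal split differs slightly from the even one.
print(confidence_hms(0.02, n1=2000, n2=10000))
print(confidence_oms(0.02, n1=2000, n2=10000))
```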
4 Experiments and Discussion
Previously, we introduced a new theory for analyzing the oracle accuracy of a set of classifiers using the observed agreements between them. In this section, we apply our theory to several classification tasks to demonstrate its utility and the reliability of the associated assumptions.
4.1 Experimental Setup
We first consider two classification tasks whose oracle labels are generated by rules. Given the oracle predictions, we can empirically validate the assumptions of our theorems and observe the convergence of the bounds. We then apply our theory to two real-world classification tasks and demonstrate that some existing state-of-the-art models have potentially achieved better performance than the average performance of the human annotators.
Classification tasks with oracle rules.
To validate the correctness of our theory, we collect datasets with observable oracle labels. We construct two visual cognitive tasks, Color Classification and Shape Classification, with explicit unambiguous rules to acquire oracle labels, as follows:
Color Classification: the oracle selects the most frequently occurring color among the objects in a given image.
Shape Classification: the oracle selects the most frequently occurring shape among the objects in a given image.
For both tasks, the size of the objects is ignored. As illustrated in Figure 2, we vary three colors (Red, Blue and Yellow) and five shapes (Triangle, Square, Pentagon, Hexagon and Circle) for the two tasks, respectively.
For each task, we generated 100 images and recruited 10 annotators from Amazon Mechanical Turk (https://www.mturk.com) to label them. Each randomly generated example includes 20 to 40 objects. We enforce that no object overlaps more than 70% with any other, and that there is only one class with the highest count, to ensure uniqueness of the oracle label. The oracle counts of the colors and shapes are recorded to generate oracle labels for the examples, as sketched below. More details about annotation interfaces and guidelines are provided in Appendix B.
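For concreteness, here is a minimal sketch of this rejection-sampling constraint for the color task (our own simplification: it draws only the colour multiset and the oracle label, omitting image rendering and the 70% overlap check).

```python
import random

COLORS = ["red", "blue", "yellow"]

def sample_color_example(rng: random.Random):
    """Resample the objects of one synthetic image until the most frequent
    colour is unique, so that the oracle label is well defined."""
    while True:
        objects = [rng.choice(COLORS) for _ in range(rng.randint(20, 40))]
        counts = sorted((objects.count(c), c) for c in COLORS)
        if counts[-1][0] > counts[-2][0]:      # unique maximum count
            return objects, counts[-1][1]      # object colours, oracle label

objects, oracle_label = sample_color_example(random.Random(0))
```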
Real-World Classification Tasks.
We analyze the performance of human annotators and machine learning models on two real-world NLP tasks, namely sentiment classification and natural language inference (NLI). We use the Stanford Sentiment Treebank (SST) socher2013recursive for sentiment classification. The sentiment labels are mapped into two classes (SST-2; samples with overall neutral scores are excluded as in tai2015improved) or five classes (SST-5): very negative, negative, neutral, positive, and very positive. We use the Stanford Natural Language Inference (SNLI) corpus bowman2015large for NLI. All samples are classified by five annotators into three categories: Contradiction (C), Entailment (E), and Neutral (N). More details of the datasets are reported in Table 1. In the remainder of this section, we only report the estimated upper bounds on test sets, as we intend to compare them with the performance of machine learning models, which is generally evaluated on test sets.
Machine Learning Models.
For both classification tasks with known oracles, we treat them as detection tasks and train YOLOv3 models redmon2018yolov3. The input image resolution is 608 × 608, and we use the proposed Darknet-53 as the backbone feature extractor. For comparison, we train two models, a strong model and a weak model, on 512 and 128 randomly generated examples, respectively. All models are trained for a maximum of 200 epochs, until convergence. During inference, the model detects the objects, and we count each type of object to obtain the prediction.
We compare several representative models and their variants for the real-world classification tasks, such as Recurrent Neural Networks chen2018enhancing; zhou2015c, Tree-based Neural Networks mou2016natural; tai2015improved, and Pre-trained Transformers devlin2019bert; radfordimproving; wang2020structbert; sun2020self.
4.2 Results and Discussion
We now conduct several experiments to validate the convergence of the bounds and the validity of the assumptions. We then demonstrate the utility of our theory by detecting superhuman models. We organize the discussion into several research questions (RQ).
RQ1: Will the bounds converge given more annotators?
We first analyze the lower bounds. We show the lower bounds for the strong models (s) and weak models (w) in Figure 3, in black and blue lines respectively. Generally, i) the lower bounds always lie under the oracle accuracy of the corresponding models; ii) the lower bounds grow and tend to approach the bounded scores as more annotators are aggregated. We then analyze the upper bounds. We illustrate the theoretical upper bound $U_{\mathrm{t}}$ and the empirically approximated upper bound $U_{\mathrm{a}}$, in comparison with the average oracle accuracy of the annotators $P(a = o)$, in Figure 3. We observe that i) both upper bounds give higher estimates than the average oracle accuracy of the annotators; ii) the margin between $U_{\mathrm{t}}$ and $U_{\mathrm{a}}$ reduces as more annotators are incorporated; iii) $U_{\mathrm{a}}$ generally provides a tighter bound than $U_{\mathrm{t}}$, and we use $U_{\mathrm{a}}$ as the upper bound $\hat{U}$ to calculate confidence scores in the later discussion.
RQ2: Are the assumptions of our theorems valid?
We verify the key assumptions for the upper bound of Theorem 3.2 and the lower bound of Theorem 3.3 by computing the relevant quantities in Table 2. Both assumptions hold in our experiments, although we can only perform this analysis on the tasks with known oracle labels. For the assumption required by our lower bound, our check is more conservative than the assumption itself, as we sum over all incorrect labels (see column 2 of Table 2.b). Despite this stricter setting, the assumption still holds in both experiments.
(a) Theorem 3.2 assumes $P(h_j = o, h_k = o) \ge P(h_j = o)\,P(h_k = o)$.
(b) Theorem 3.3 assumes $P(f = o \mid g = \tilde{y}) \ge P(f = \tilde{y} \mid g = \tilde{y})$ for incorrect labels $\tilde{y} \ne o$.
Disclaimer: while the assumptions appear reasonable, we recommend, where possible, obtaining a small set of oracle labels to validate the assumptions in future research.
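In that spirit, a small validation script might look as follows (our own sketch with hypothetical helper names; the second check uses the conservative summed form reported in Table 2.b).

```python
import numpy as np

def check_assumptions(labels: np.ndarray, model: np.ndarray,
                      aggregated: np.ndarray, oracle: np.ndarray):
    """Validate the assumptions of Theorems 3.2 and 3.3 on a small
    oracle-labelled subset.

    labels: (m, n) annotator labels; model, aggregated, oracle: (n,) arrays.
    """
    m, _ = labels.shape
    # Theorem 3.2: positive correlation between annotator correctness,
    # P(h_j = o, h_k = o) >= P(h_j = o) * P(h_k = o) for all j != k.
    correct = labels == oracle                                  # (m, n)
    acc = correct.mean(axis=1)                                  # (m,)
    joint = (correct[:, None, :] & correct[None, :, :]).mean(axis=2)
    corr_ok = all(joint[j, k] >= acc[j] * acc[k]
                  for j in range(m) for k in range(m) if j != k)
    # Theorem 3.3, conservative form summed over all incorrect labels:
    # where the reference g is wrong, the model hits the oracle label at
    # least as often as it copies g's wrong label.
    wrong = aggregated != oracle
    lb_ok = ((model[wrong] == oracle[wrong]).sum()
             >= (model[wrong] == aggregated[wrong]).sum())
    return bool(corr_ok), bool(lb_ok)
```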
| SST 5-Class | Acc. | SST 2-Class | Acc. | SNLI 3-Class | Acc. |
|---|---|---|---|---|---|
| Avg. Human ($U_{\mathrm{t}}$) | 0.790 | Avg. Human ($U_{\mathrm{t}}$) | 0.960 | Avg. Human ($U_{\mathrm{t}}$) | 0.904 |
| Avg. Human ($U_{\mathrm{a}}$) | 0.660 | Avg. Human ($U_{\mathrm{a}}$) | 0.939 | Avg. Human ($U_{\mathrm{a}}$) | 0.879 |
| CNN-LSTM zhou2015c | 0.492 | CNN-LSTM zhou2015c | 0.878 | BiLSTM chen2018enhancing | 0.855 |
| Constituency Tree-LSTM tai2015improved | 0.510 | Constituency Tree-LSTM tai2015improved | 0.880 | Tree-CNN mou2016natural | 0.821 |
| BERT-large devlin2019bert | 0.555 | BERT-large devlin2019bert | 0.949 | LM-Pretrained Transformer radfordimproving | 0.899 |
| RoBERTa+Self-Explaining sun2020self | 0.591 | StructBERT wang2020structbert | 0.971 | SemBERT zhang2020semantics | 0.919 |
RQ3: How to identify a ‘powerful’, or even superhuman, classification model?
We first compare the model lower bounds with the annotator upper bounds in our toy experiments, in Figure 3. Overall, it is more likely to observe superhuman performance given more annotators. We observe that the lower bound of the strong model outperforms both $U_{\mathrm{t}}$ and $U_{\mathrm{a}}$ given more than 4 and 6 annotators for color classification and shape classification, respectively. When the model only marginally outperforms the humans, as with the weak model for color classification, we may not observe a clear superhuman performance margin: the bounds are very close given more than 7 annotators.
For real-world classification tasks, we i) calculate the average annotator upper bounds given multiple annotators' labels and ii) collect model lower bounds reported in previous literature. Some results on SST and SNLI are reported in Table 3. We observe that pre-trained language models provide significant performance improvements on those tasks. Our theory manages to identify some of these models as potentially exceeding the average human annotator performance, by comparing their lower bounds with $U_{\mathrm{a}}$, or with the even more restrictive $U_{\mathrm{t}}$.
RQ4: How confident are the certifications?
We calculate our confidence scores for the identified outperforming models via $\hat{U}$, $\hat{L}$, $n_1$, and $n_2$, using HMS and OMS, as reported in Table 4. Generally, the confidence scores for SNLI models are higher than those for SST-2, because the former's test set is more than five times larger, while more recent and advanced models achieve higher confidence scores as they enjoy a larger margin $\hat{\Delta} = \hat{L} - \hat{U}$.
5 Conclusion
In this paper, we built a theory towards estimating the oracle accuracy of classifiers. Our theory covers i) upper bounds on the average performance of human annotators, ii) lower bounds for machine learning models, and iii) confidence scores which formally capture the degree of certainty to which we may assert that a model outperforms the average human annotator. Our theory provides formal guarantees even in the practically relevant setting of a finite data sample and no access to an oracle to serve as the ground truth. Our experiments on synthetic classification tasks validate the plausibility of the assumptions on which our theorems are built. Finally, our meta-analysis of existing progress succeeded in identifying some existing state-of-the-art models that have already achieved superhuman performance relative to the average human annotator.
Our approach can identify classification models that outperform typical humans in terms of classification accuracy. Such conclusions influence our understanding of the current state of research on classification, and therefore potentially impact the strategies and policies of human-computer collaboration and interaction. The questions we may help to answer include the following: When should we prefer a model's diagnosis over that of a medical professional? In courts of law, should we leave sentencing to an algorithm rather than a judge? These questions and many more like them are too important to ignore. Given recent progress in machine learning, we believe this work is overdue.
Yet we caution that estimating a model's oracle accuracy in this way is not free. Our approach requires results from multiple annotators, and preferably the number of annotators should exceed the number of possible classes in the target classification task. Another potential challenge in applying our analysis is that some of our assumptions may not hold for specific tasks or settings. We recommend that those who apply our theory collect, where possible, a small amount of 'oracle' annotations to validate the assumptions in this paper.
Appendix A Proof Details
Proof of Lemma 3.2
Appendix B Details for Annotation
We crowd-source the annotations via Amazon Mechanical Turk. The annotation interfaces with instructions for color classification and shape classification are illustrated in Figure 4. Each example is annotated by 10 different annotators. For quality control, we i) offer our tasks only to experienced annotators with 100 or more approved HITs; and ii) automatically reject answers from annotators who select 'None of the above'.