In recent years, there has been rising interest in a new learning paradigm that seeks to develop predictive models operating under different automation levels: models that take decisions for a given fraction of instances and leave the remaining ones to human experts. This new paradigm has so far been referred to as learning under algorithmic triage [raghu2019algorithmic], learning under human assistance [de2020aaai, de2021aaai], learning to complement humans [wilder2020learning, bansal2020optimizing], and learning to defer to an expert [sontag2020]. Here, one has to find not only a predictive model but also a triage policy, which determines who predicts each instance.
The motivation that underpins learning under algorithmic triage is the observation that, while there are high-stakes tasks where predictive models have matched, or even surpassed, the average performance of human experts [cheng2015antisocial, pradel2018deepbugs, topol2019high], they are still less accurate than human experts on some instances, where they make far more errors than average [raghu2019algorithmic]. The main promise is that, by working together, human experts and predictive models are likely to achieve considerably better performance than either would achieve on their own. While the above-mentioned work has shown some success at fulfilling this promise, the interplay between the predictive accuracy of a predictive model and its human counterpart under algorithmic triage is not well understood.
One of the main challenges in learning under algorithmic triage is that, for each potential triage policy, there is an optimal predictive model; however, the triage policy is also something one seeks to optimize, as first noted by de2020aaai. In this context, most previous work on learning under algorithmic triage has developed heuristic algorithms that do not enjoy theoretical guarantees at learning the triage policy and the predictive model [raghu2019algorithmic, bansal2020optimizing, sontag2020, wilder2020learning]. The only exceptions are by de2020aaai, de2021aaai.
Our contributions. Our starting point is a theoretical investigation of the interplay between the prediction accuracy of supervised learning models and human experts under algorithmic triage. By doing so, we hope to better inform the design of general purpose techniques for training differentiable models under algorithmic triage. Our investigation yields the following insights:
To find the optimal triage policy and predictive model, we need to take into account the amount of human expert disagreement, or expert uncertainty, on a per-instance level.
We identify under which circumstances a predictive model that is optimal under full automation may be suboptimal under a desired level of triage.
Given any predictive model and desired level of triage, the optimal triage policy is a deterministic threshold rule in which triage decisions are derived by thresholding the difference between the model and human errors on a per-instance level.
Building on the above insights, we introduce a practical gradient-based algorithm that finds a sequence of triage policies and predictive models
of increasing performance subject to a constraint on the maximum level of triage.
Finally, we apply our gradient-based algorithm in a wide variety of supervised learning tasks using both synthetic and real-world data
from two important applications—content moderation and scientific discovery.
Our experiments illustrate our theoretical results and show that the models and triage policies provided by our algorithm outperform those provided
by several competitive baselines. To facilitate research in this area, we will release an open-source implementation of our algorithm with the final version of the paper.
Further related work.
Our work is also related to the areas of learning to defer and active learning.
In learning to defer, the goal is to design machine learning models that are able to defer predictions [bartlett2008classification, cortes2016learning, geifman2018bias, ramaswamy2018consistent, geifman2019selectivenet, liu2019deep, thulasidasan2019combating, ziyin2020learning]. Most previous work focuses on supervised learning and designs classifiers that learn to defer either by considering the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, in this line of work, there are no human experts who make predictions whenever the classifiers defer them; the classifiers just pay a constant cost every time they defer predictions. Moreover, the classifiers are trained to predict the labels of all samples in the training set, as in full automation. A very recent notable exception is by meresht2020learning, who consider a human decision maker; however, they tackle the problem in a reinforcement learning setting.
In active learning, the goal is to find which subset of samples one should label so that a model trained on these samples predicts accurately any sample at test time [cohn1995active, hoi2006batch, sugiyama2006active, willett2006faster, guo2008discriminative, sabato2014active, chen2017active, hashemi2019submodular]. In contrast, in our work, the trained model only needs to predict accurately a fraction of samples picked by the triage policy at test time and rely on human experts to predict the remaining samples.
2 Supervised Learning under Triage
Let $\mathcal{X}$ be the feature domain, $\mathcal{Y}$ be the label domain, and assume features and labels are sampled from a ground truth distribution $P(x, y)$. Moreover, let $h$ be the label predictions provided by a human expert and assume they are sampled from a distribution $P(h \mid x, y)$, which models the disagreements amongst experts [raghu2019direct]. Then, in supervised learning under triage, one needs to find:
a triage policy $\pi(x) : \mathcal{X} \to \{0, 1\}$, which determines who predicts each feature vector: a supervised learning model ($\pi(x) = 0$) or a human expert ($\pi(x) = 1$);
a predictive model $m(x) : \mathcal{X} \to \mathcal{Y}$, which needs to provide label predictions for those feature vectors $x$ for which $\pi(x) = 0$.
Here, as in standard supervised learning, we look for the triage policy and the predictive model that result in the most accurate label predictions by minimizing a loss function $\ell(\hat{y}, y)$. More formally, given a hypothesis class of triage policies $\Pi$ and predictive models $\mathcal{M}$, our goal is to solve the following minimization problem:
$$\underset{\pi \in \Pi,\, m \in \mathcal{M}}{\text{minimize}} \quad L(\pi, m) \quad \text{subject to} \quad \mathbb{E}_{x}[\pi(x)] \leq b, \qquad (1)$$
where $b$ is a given parameter that limits the level of triage, i.e., the percentage of samples human experts need to provide predictions for, and
$$L(\pi, m) = \mathbb{E}_{x, y, h}\left[ (1 - \pi(x))\, \ell(m(x), y) + \pi(x)\, \ell(h, y) \right]. \qquad (2)$$
Here, one might think of replacing $h$ in the loss function with its point estimate $\mu_h(x, y) = \mathbb{E}_{h}[h \mid x, y]$. However, the resulting objective would have a bias term, as formalized by the following proposition (all proofs can be found in Appendix A): Let $\ell(\hat{y}, y)$ be a convex function with respect to $\hat{y}$ and assume there exist $x \in \mathcal{X}$ for which the distribution of human predictions $P(h \mid x, y)$ is not a point mass. Then, the function
$$\bar{L}(\pi, m) = \mathbb{E}_{x, y}\left[ (1 - \pi(x))\, \ell(m(x), y) + \pi(x)\, \ell(\mu_h(x, y), y) \right]$$
is a biased estimate of the true average loss defined in Eq. 2. The above result implies that, to find the optimal triage policy and predictive model, we need to take into account the amount of expert disagreement, or expert uncertainty, on each feature vector rather than just an average expert prediction.
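The bias can be seen on a single instance. The following is a minimal sketch, assuming a squared loss and three hypothetical expert predictions (neither comes from the paper): the loss of the averaged prediction understates the average loss of the individual predictions, as Jensen's inequality suggests for a convex loss.

```python
import numpy as np

def squared_loss(pred, y):
    """Squared error, which is convex in the prediction."""
    return (pred - y) ** 2

def human_loss_true(human_preds, y):
    """Average loss over the individual expert predictions for one instance."""
    preds = np.asarray(human_preds, dtype=float)
    return float(np.mean(squared_loss(preds, y)))

def human_loss_point_estimate(human_preds, y):
    """Loss of the averaged expert prediction (the point estimate)."""
    return float(squared_loss(np.mean(human_preds), y))

# Hypothetical instance: three experts disagree about a response whose true value is 1.0.
human_preds = [0.0, 1.0, 2.0]
y = 1.0

true_loss = human_loss_true(human_preds, y)              # (1 + 0 + 1) / 3
point_loss = human_loss_point_estimate(human_preds, y)   # (1.0 - 1.0) ** 2
print(true_loss, point_loss)
```

In this toy case the point estimate reports zero human loss even though two of the three experts are wrong, so a triage policy based on it would be too eager to defer.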
3 On the Interplay Between Prediction Accuracy and Triage
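Let $m^{*}_{0}$ be the optimal predictive model under full automation, i.e., $m^{*}_{0} = \operatorname{argmin}_{m \in \mathcal{M}} L(\pi_0, m)$,
where $\pi_0(x) = 0$ for all $x \in \mathcal{X}$. Then, the following proposition tells us that, if the predictions made by $m^{*}_{0}$ are less accurate than those by human experts on some instances, the model will always benefit from algorithmic triage: If there is a subset $\mathcal{V} \subset \mathcal{X}$ of positive measure under $P$ such that
$$\int_{x \in \mathcal{V}} \mathbb{E}_{y \mid x}\left[ \ell(m^{*}_{0}(x), y) \right] dP > \int_{x \in \mathcal{V}} \mathbb{E}_{y, h \mid x}\left[ \ell(h, y) \right] dP,$$
then there exists a nontrivial triage policy $\pi \neq \pi_0$ such that $L(\pi, m^{*}_{0}) < L(\pi_0, m^{*}_{0})$. Moreover, if we rewrite the average loss as
$$L(\pi, m) = \mathbb{E}_{x}\left[ \pi(x)\, \mathbb{E}_{y, h \mid x}\left[ \ell(h, y) - \ell(m(x), y) \right] \right] + \mathbb{E}_{x, y}\left[ \ell(m(x), y) \right],$$
it becomes apparent that, for any model $m$, the optimal triage policy is a deterministic threshold rule in which triage decisions are derived by thresholding the difference between the model and human loss on a per-instance level. More specifically, it is given by the following proposition: Let $m$ be any fixed predictive model. Then, the optimal triage policy that minimizes the loss $L(\pi, m)$ subject to a constraint $\mathbb{E}_{x}[\pi(x)] \leq b$ on the maximum level of triage is given by:
$$\pi^{*}_{m, b}(x) = \begin{cases} 1 & \text{if } \mathbb{E}_{y, h \mid x}\left[ \ell(m(x), y) - \ell(h, y) \right] > t_{P, b, m}, \\ 0 & \text{otherwise}, \end{cases} \qquad (3)$$
where $t_{P, b, m} = \operatorname{argmin}_{\tau \geq 0} \mathbb{E}_{x}\left[ \tau b + \max\left( \mathbb{E}_{y, h \mid x}\left[ \ell(m(x), y) - \ell(h, y) \right] - \tau,\, 0 \right) \right]$.
Here, note that, in the unconstrained case, $b = 1$ and $t_{P, b, m} = 0$.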
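On a finite set of samples, the threshold rule can be applied without computing the threshold explicitly. The following is a minimal sketch, assuming per-instance model and human losses are already available as arrays; the function name `optimal_triage` and the toy numbers are hypothetical. It defers at most a fraction `b` of the samples, picks those with the largest gap between model and human loss, and never defers a sample on which the model is at least as good.

```python
import numpy as np

def optimal_triage(model_loss, human_loss, b):
    """Return a 0/1 array; 1 means the instance is handed to a human expert.

    At most floor(b * n) instances are deferred, chosen as those with the
    largest positive gap between model loss and human loss.
    """
    model_loss = np.asarray(model_loss, dtype=float)
    human_loss = np.asarray(human_loss, dtype=float)
    gap = model_loss - human_loss
    n = gap.shape[0]
    budget = int(np.floor(b * n))
    policy = np.zeros(n, dtype=int)
    if budget == 0:
        return policy
    order = np.argsort(-gap, kind="stable")   # largest gap first
    chosen = order[:budget]
    chosen = chosen[gap[chosen] > 0]          # keep only instances where humans are strictly better
    policy[chosen] = 1
    return policy

# Hypothetical per-instance losses for four samples.
model_loss = [0.9, 0.1, 0.5, 0.3]
human_loss = [0.2, 0.4, 0.1, 0.3]

policy_b_half = optimal_triage(model_loss, human_loss, b=0.5)
policy_b_one = optimal_triage(model_loss, human_loss, b=1.0)
print(policy_b_half, policy_b_one)
```

With `b=1.0` the budget never binds, which mirrors the unconstrained case: only the instances with a strictly positive gap are deferred.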
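Building on the above expression, we can now identify the circumstances under which the optimal predictive model under full automation within a hypothesis class of parameterized predictive models $\mathcal{M}(\Theta)$ is suboptimal under algorithmic triage. More formally, our main result is the following proposition: Let $m_{\theta^{*}_{0}}$ be the optimal predictive model under full automation within a hypothesis class of parameterized models $\mathcal{M}(\Theta)$, $\pi^{*}_{m_{\theta^{*}_{0}}, b}$ the optimal triage policy for $m_{\theta^{*}_{0}}$ defined in Eq. 3 for a given maximum level of triage $b$, and $\mathcal{V} = \{ x \mid \pi^{*}_{m_{\theta^{*}_{0}}, b}(x) = 0 \}$. If
$$\int_{x \in \mathcal{V}} \mathbb{E}_{y \mid x}\left[ \nabla_{\theta}\, \ell(m_{\theta}(x), y) \big|_{\theta = \theta^{*}_{0}} \right] dP \neq 0,$$
then it holds that $L(\pi^{*}_{m_{\theta^{*}_{0}}, b}, m_{\theta^{*}_{0}}) > \min_{\theta \in \Theta} L(\pi^{*}_{m_{\theta^{*}_{0}}, b}, m_{\theta})$. Finally, we can also identify the circumstances under which any predictive model within a hypothesis class of parameterized predictive models $\mathcal{M}(\Theta)$ is suboptimal under algorithmic triage: Let $m_{\theta'}$ be a predictive model within a hypothesis class of parameterized models $\mathcal{M}(\Theta)$, $\pi^{*}_{m_{\theta'}, b}$ the optimal triage policy for $m_{\theta'}$ defined in Eq. 3 for a given maximum level of triage $b$, and $\mathcal{V} = \{ x \mid \pi^{*}_{m_{\theta'}, b}(x) = 0 \}$. If
$$\int_{x \in \mathcal{V}} \mathbb{E}_{y \mid x}\left[ \nabla_{\theta}\, \ell(m_{\theta}(x), y) \big|_{\theta = \theta'} \right] dP \neq 0,$$
then it holds that $L(\pi^{*}_{m_{\theta'}, b}, m_{\theta'}) > \min_{\theta \in \Theta} L(\pi^{*}_{m_{\theta'}, b}, m_{\theta})$. The above results lay the foundations for our practical gradient-based algorithm for differentiable learning under triage in the next section.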
4 How To Learn Under Triage
In this section, our goal is to find the predictive model within a hypothesis class of parameterized predictive models that minimizes the loss defined in Eq. 5.
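To this end, we now introduce a general gradient-based algorithm to approximate the optimal predictive model and triage policy given a desirable maximum level of triage $b$. The main obstacle we face is that the threshold value $t_{P, b, m}$ in the average loss depends on the predictive model $m_{\theta}$, which we are trying to learn. To overcome this challenge, we proceed sequentially, starting from the triage policy $\pi_0$, with $\pi_0(x) = 0$ for all $x \in \mathcal{X}$, and build a sequence of triage policies and predictive models $\{ (\pi^{*}_{m_{\theta_t}, b}, m_{\theta_t}) \}_{t}$ with lower loss value, i.e., $L(\pi^{*}_{m_{\theta_t}, b}, m_{\theta_t}) \leq L(\pi^{*}_{m_{\theta_{t-1}}, b}, m_{\theta_{t-1}})$.
More specifically, in step $t$, we find the parameters $\theta_t$ of the predictive model $m_{\theta_t}$
via stochastic gradient descent (SGD) [kiefer1952stochastic], i.e., $\theta_t^{(j)} = \theta_t^{(j-1)} - \alpha^{(j-1)}\, \nabla_{\theta} L(\pi^{*}_{m_{\theta_{t-1}}, b}, m_{\theta}) \big|_{\theta = \theta_t^{(j-1)}}$.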
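In practice, given a set of samples $\mathcal{D} = \{ (x_i, y_i, h_i) \}$, we can use the following finite sample Monte-Carlo estimator for the gradient $\nabla_{\theta} L(\pi^{*}_{m_{\theta_{t-1}}, b}, m_{\theta})$:
$$\nabla_{\theta} L(\pi^{*}_{m_{\theta_{t-1}}, b}, m_{\theta}) \approx \frac{1}{|\mathcal{D}|} \sum_{i=1}^{\max(\lceil (1-b) |\mathcal{D}| \rceil,\, p)} \nabla_{\theta}\, \ell(m_{\theta}(x_{[i]}), y_{[i]}),$$
where $[i]$ denotes the $i$-th sample in the set $\mathcal{D}$ in increasing order of the difference between the model and the human loss, $\ell(m_{\theta_{t-1}}(x), y) - \ell(h, y)$, and $p$ is the number of samples with $\ell(m_{\theta_{t-1}}(x), y) - \ell(h, y) \leq 0$. Note that, if the set of samples contains several predictions by different human experts for each sample, we would use all of them to estimate the (average) human loss.
In the above, note that we do not have to explicitly compute the threshold $t_{P, b, m_{\theta_{t-1}}}$ nor the triage policy for every sample in the set; we just need to pick the top $\max(\lceil (1-b) |\mathcal{D}| \rceil, p)$ samples in this order, i.e., those with the smallest difference between the model and the human loss, using the predictive model $m_{\theta_{t-1}}$ fitted in step $t-1$. To understand why, note that, as long as $t_{P, b, m_{\theta_{t-1}}} > 0$, by definition, it needs to satisfy that
$$\frac{d}{d\tau} \left[ \tau b + \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \max\left( \ell(m_{\theta_{t-1}}(x_i), y_i) - \ell(h_i, y_i) - \tau,\, 0 \right) \right] \Bigg|_{\tau = t_{P, b, m_{\theta_{t-1}}}} = 0,$$
and this can only happen if
$$\ell(m_{\theta_{t-1}}(x_i), y_i) - \ell(h_i, y_i) - t_{P, b, m_{\theta_{t-1}}} > 0$$
for $b\, |\mathcal{D}|$ out of $|\mathcal{D}|$ samples. Here, we can think of our gradient-based algorithm as a particular instance of disciplined parameterized programming [amos2017optnet, agrawal2019differentiable], where the differentiable convex optimization layer is given by the minimization with respect to the triage policy.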
Unfortunately, at test time, we cannot do the same and, to make things worse, we cannot explicitly compute the optimal triage policy for an unseen sample, since we would need to observe its corresponding label and human prediction. To overcome this, during training, we also fit a model to approximate the optimal triage policy using SGD,
where the choice of loss depends on the model class chosen for this approximation.
Algorithm 1 summarizes the overall gradient-based algorithm: at each iteration, it samples a minibatch from the training set and selects the top samples in the minibatch in terms of the difference between the model and the human loss.
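The sequential procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation of Algorithm 1: it assumes a linear model with squared loss, one human prediction per sample, and hypothetical names (`top_gap_mask`, `train_under_triage`) and hyperparameters, and it omits the separate model that approximates the triage policy at test time.

```python
import numpy as np

def top_gap_mask(model_loss, human_loss, b):
    """Mask of the (at most) floor(b * n) samples with the largest positive loss gap."""
    gap = model_loss - human_loss
    k = int(np.floor(b * gap.shape[0]))
    mask = np.zeros(gap.shape[0], dtype=bool)
    if k > 0:
        top = np.argsort(-gap, kind="stable")[:k]
        mask[top[gap[top] > 0]] = True
    return mask

def train_under_triage(X, y, h, b, steps=5, epochs=20, batch=32, lr=0.05, seed=0):
    """Hypothetical sketch: linear model m(x) = x @ theta trained with squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta_prev = np.zeros(d)              # model fitted in the previous step
    human_loss = (h - y) ** 2
    for step in range(steps):
        theta = theta_prev.copy()
        for _ in range(epochs):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
                if step == 0:
                    # First step: full automation, nothing is deferred.
                    deferred = np.zeros(idx.shape[0], dtype=bool)
                else:
                    # Later steps: rank samples with the previous step's model.
                    prev_loss = (X[idx] @ theta_prev - y[idx]) ** 2
                    deferred = top_gap_mask(prev_loss, human_loss[idx], b)
                keep = ~deferred
                # Gradient of the squared loss over the samples kept for the model,
                # normalized by the full minibatch size.
                resid = X[idx][keep] @ theta - y[idx][keep]
                theta -= lr * (2.0 * X[idx][keep].T @ resid / idx.shape[0])
        theta_prev = theta
    return theta_prev

# Hypothetical data: the response is not linear in x, and experts are slightly noisy.
rng = np.random.default_rng(1)
X = np.c_[rng.uniform(-1, 1, size=200), np.ones(200)]
y = np.where(X[:, 0] > 0, 2.0 * X[:, 0], -1.0)
h = y + rng.normal(0, 0.1, size=200)

theta = train_under_triage(X, y, h, b=0.5)
model_loss = (X @ theta - y) ** 2
deferred = top_gap_mask(model_loss, (h - y) ** 2, b=0.5)
loss_triage = float(np.mean(np.where(deferred, (h - y) ** 2, model_loss)))
loss_full = float(np.mean(model_loss))
print(loss_triage, loss_full)
```

Ranking with the previous step's parameters keeps the set of model-assigned samples fixed while the current parameters are updated, which is what makes each inner loop a standard SGD problem.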
5 Experiments on Synthetic Data
In this section, our goal is to shed light on the theoretical results from Section 3. To this end, we use our gradient-based algorithm in a simple regression task in which the optimal predictive model under full automation is suboptimal under algorithmic triage.
Experimental setup. We generate samples, where we first draw the features uniformly at random, ,
, and then obtain the response variables using two different sigmoid functions . More specifically, we set if and if . Moreover, we assume human experts provide noisy predictions of the response variables, , , where with
In the above, we are using heteroscedastic noise motivated by multiple lines of evidence that suggest that human experts' performance on a per-instance level spans a wide range [raghu2019algorithmic, raghu2019direct, de2020aaai].
Then, we consider the hypothesis class of predictive models parameterized by sigmoid functions, , , and utilize the sum of squared errors on the predictions as loss function, , , to train the following models and triage policies:
Predictive model trained under full automation without algorithmic triage, , for all .
Predictive model trained under full automation with optimal algorithmic triage .
Predictive model trained under algorithmic triage , with , with suboptimal algorithmic triage . Here, note that we use the triage policy that is optimal for the predictive model trained under full automation.
Predictive model trained under algorithmic triage , with , with optimal algorithmic triage .
In all the cases, we train the predictive models and using Algorithm 1 with and , respectively.
Finally, we investigate the interplay between the accuracy of the above predictive models and the human experts and the structure of the triage policies at a per-instance level.
Results. Figure 3 shows the training samples along with the predictions made by the predictive models and and the values of the triage policies , and , as well as the losses achieved by the models and triage policies (1-4) on a per-instance level. The results provide several interesting insights.
Since the predictive model trained under full automation seeks to generalize well across the entire feature space, the loss it achieves on a per-instance level is never too high, but never too low either, as shown in the left column of Panel (a). As a consequence, this model under no triage achieves the highest average loss among all alternatives. This may not come as a surprise, since the mapping between feature and response variables does not lie within the hypothesis class of predictive models used during training. However, since the predictions by human experts are more accurate than those provided by the above model in some regions of the feature space, we can deploy the model with the optimal triage policy given by Theorem 3 and lower the average loss, as shown in the right column of Panel (a) and suggested by Proposition 3.
In contrast with the predictive model trained under full automation, the predictive model trained under triage learns to predict very accurately the instances that lie in the regions of the feature space colored in green and yellow, but it gives up on the regions colored in red and blue, where its predictions incur a very high loss, as shown in Panel (b). However, these latter instances, where the loss would have been the highest if the predictive model had to predict their response variables, are those that the optimal triage policy hands over to human experts to make predictions. As a result, this predictive model under the optimal triage policy does achieve the lowest average loss among all alternatives, as suggested by Propositions 3 and 6.
Finally, our results also show that deploying the predictive model under a suboptimal triage policy may actually lead to a higher loss than the loss achieved by the predictive model trained under full automation with its optimal triage policy . This happens because the predictive model is trained to work well only on the instances such that and not necessarily on those with .
6 Experiments on Real Data
In this section, we use Algorithm 1 in two classification tasks in content moderation and scientific discovery, one binary and one multiclass. We first investigate the interplay between the accuracy of the predictive models and human experts and the structure of the optimal triage policies at different steps of the training process. Then, we compare the performance of our algorithm with several competitive baselines.
Experimental setup. We use two publicly available datasets [hateoffensive, bamford2009galaxy], one from an application in content moderation and the other for scientific discovery.
— Hatespeech: It consists of
tweets containing lexicons used in hate speech. Each tweet is labeled by three to five human experts from Crowdflower as “hate-speech”, “offensive”, or “neither”.
— Galaxy zoo: It consists of galaxy images. Each image is labeled by human experts as “early type” or “spiral”. (The original Galaxy zoo dataset consists of images; however, we report results on a randomly chosen subset of images for scalability reasons. We found similar results in other random subsets.)
For each tweet in the Hatespeech dataset, we first generate a feature vector using fasttext [ft], similarly to de2020aaai. For each image in the Galaxy zoo dataset, we use its corresponding pixel map (the pixel maps for each image are available at https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge). Given an instance with feature value $x$, we estimate the distribution of human predictions from the number of human experts who predicted each label, and set its true label based on these counts. Moreover, at test time, for each instance that the triage policy assigns to humans, we sample the human prediction from this estimated distribution.
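The estimation step can be illustrated with a short sketch. It assumes that the human prediction distribution is the normalized vector of per-label annotation counts and, as a further assumption not stated above, that the true label is the majority vote; the function names and the example counts are hypothetical.

```python
import numpy as np

def human_label_distribution(counts):
    """Normalize per-label annotation counts into an empirical distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def majority_label(counts):
    """Label chosen by most annotators (assumption: ties go to the lowest index)."""
    return int(np.argmax(counts))

def sample_human_prediction(counts, rng):
    """Draw one human prediction from the empirical distribution."""
    probs = human_label_distribution(counts)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical tweet annotated by four experts over three labels.
counts = [1, 3, 0]
probs = human_label_distribution(counts)
label = majority_label(counts)

rng = np.random.default_rng(0)
draws = [sample_human_prediction(counts, rng) for _ in range(50)]
print(probs, label)
```

Sampling from the empirical distribution, rather than always returning the majority label, preserves the per-instance expert disagreement that the analysis in Section 2 relies on.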
In all our experiments, we consider the hypothesis class of probabilistic predictive models parameterized by softmax distributions, ,
where, for the Hatespeech dataset,
is the convolutional neural network (CNN) developed by kim2014convolutional and, for the Galaxy zoo dataset, it is the deep residual network developed by he2015deep. Refer to Appendix B for more details. During training, we use a cross entropy loss on the observed labels, , . Here, if an instance is assigned to the predictive model, we have that
and, if an instance is assigned to a human expert, we have that . For the model that approximates the triage policy, we use the class of logistic functions, , , where, for the Hatespeech dataset, is the CNN developed by kim2014convolutional and, for the Galaxy zoo dataset, it is the deep residual network developed by he2015deep. Moreover, during training, we also use the cross entropy loss, , . Finally, in each experiment, we used 60% of the samples for training, 20% for validation and 20% for testing.
Results. First, we look at the average loss achieved by the predictive models and triage policies throughout the execution of Algorithm 1 during training. Figure 6 summarizes the results, which reveal several insights. For small values of the triage level, the models aim to generalize well across a large portion of the feature space and, as a result, they incur a large training loss, as shown in Panel (a). In contrast, for large values of the triage level, the models are trained to generalize across a smaller region of the feature space, which leads to a considerably smaller training loss. However, for such a high triage level, the overall performance of our method is also contingent on how well the fitted triage model approximates the optimal triage policy. Fortunately, Panel (b) shows that, as epochs increase, the average training loss of the fitted triage model decreases.
Next, we compare the predictive model and the human expert losses per training instance throughout the execution of Algorithm 1 during training. Figure 13 summarizes the results. At each time step, we find that the optimal triage policy hands over to human experts those instances (in orange) where the loss would have been the highest if the predictive model had to predict their response variables. Moreover, at the beginning of the training process (i.e., low step values), since the predictive model seeks to generalize across a large portion of the feature space, the model loss remains similar across the feature space. However, later into the training process (i.e., high step values), the predictive model focuses on predicting more accurately the samples that the triage policy hands over to the model, achieving a lower loss on those samples, and gives up on the remaining samples, where it achieves a high loss.
Finally, we compare the performance of our method against four baselines in terms of misclassification test error. Refer to Appendix B for more details on the baselines, which we refer to as confidence-based triage [bansal2020optimizing], score-based triage [raghu2019algorithmic], surrogate-based triage [sontag2020] and full automation triage. Figure 16 summarizes the results, which show that the predictive models and triage policies found by our gradient-based algorithm outperform the baselines across the majority of maximum triage levels in both the Hatespeech and Galaxy zoo datasets.
7 Conclusions
In this paper, we have contributed towards a better understanding of supervised learning under algorithmic triage. We have first identified under which circumstances predictive models may benefit from algorithmic triage, including those trained for full automation. Then, given a predictive model and desired level of triage, we have shown that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived from the model and human per-instance errors. Finally, we have introduced a practical gradient-based algorithm to train supervised learning models under triage and have shown that the models and triage policies provided by our algorithm outperform those provided by several competitive baselines.
Our work also opens many interesting avenues for future work. For example, we have assumed that each instance is predicted by either a predictive model or a human expert. However, there may be many situations in which human experts predict all instances but their predictions are informed by a predictive model [lubars2020ask]. Moreover, we have studied the problem of learning under algorithmic triage in a batch setting. It would be very interesting to study the problem in an online setting. Finally, it would be valuable to assess the performance of supervised learning models under algorithmic triage using interventional experiments on a real-world application.
Appendix A Proofs
Proof of Proposition 2. Due to Jensen’s inequality and the fact that, by assumption, the distribution of human predictions is not a point-mass, it holds that . Hence,
Proof of Proposition 3. Let . Then, we have:
where inequality holds by assumption and equality holds by the definition of .
Proof of Theorem 3. We first provide the proof of the unconstrained case. First, we note that,
Since the second term in the above equation does not depend on , we can find the optimal policy by solving the following optimization problem:
Note that the above problem is a linear program and it decouples with respect to each feature vector. Therefore, for each feature vector, the optimal solution is clearly given by:
Next, we provide the proof of the constrained case. Here, we need to solve the following optimization problem:
To this aim, we consider the dual formulation of the optimization problem, where we only introduce a Lagrangian multiplier for the first constraint, ,
The inner minimization problem can be solved using an argument similar to that for the unconstrained case. Therefore, we have:
Proof of Proposition 3. The optimal predictive model under full automation within a parameterized hypothesis class of predictive models satisfies that
and the optimal predictive model under satisfies that
Now we have that