Differentiable Learning Under Triage

Multiple lines of evidence suggest that predictive models may benefit from algorithmic triage. Under algorithmic triage, a predictive model does not predict all instances but instead defers some of them to human experts. However, the interplay between the prediction accuracy of the model and the human experts under algorithmic triage is not well understood. In this work, we start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level. Building upon these results, we introduce a practical gradient-based algorithm that is guaranteed to find a sequence of triage policies and predictive models of increasing performance. Experiments on a wide variety of supervised learning tasks using synthetic and real data from two important applications – content moderation and scientific discovery – illustrate our theoretical results and show that the models and triage policies provided by our gradient-based algorithm outperform those provided by several competitive baselines.

1 Introduction

In recent years, there has been rising interest in a new learning paradigm that seeks to develop predictive models operating under different automation levels—models that make decisions for a given fraction of instances and leave the remaining ones to human experts. This paradigm has so far been referred to as learning under algorithmic triage [raghu2019algorithmic], learning under human assistance [de2020aaai, de2021aaai], learning to complement humans [wilder2020learning, bansal2020optimizing], and learning to defer to an expert [sontag2020]. Here, one has to find not only a predictive model but also a triage policy that determines who predicts each instance.

The motivation that underpins learning under algorithmic triage is the observation that, while there are high-stake tasks where predictive models have matched, or even surpassed, the average performance of human experts [cheng2015antisocial, pradel2018deepbugs, topol2019high], they are still less accurate than human experts on some instances, where they make far more errors than average [raghu2019algorithmic]. The main promise is that, by working together, human experts and predictive models are likely to achieve a considerably better performance than each of them would achieve on their own. While the above mentioned work has shown some success at fulfilling this promise, the interplay between the predictive accuracy of a predictive model and its human counterpart under algorithmic triage is not well understood.

One of the main challenges in learning under algorithmic triage is that, for each potential triage policy, there is an optimal predictive model; however, the triage policy is also something one seeks to optimize, as first noted by de2020aaai. In this context, most previous work on learning under algorithmic triage has developed heuristic algorithms that enjoy no theoretical guarantees when learning the triage policy and the predictive model [raghu2019algorithmic, bansal2020optimizing, sontag2020, wilder2020learning]. The only exceptions are de2020aaai and de2021aaai, who have reduced the problem to the maximization of approximately submodular functions; however, their methodology is only applicable to ridge regression and support vector machines.

Our contributions. Our starting point is a theoretical investigation of the interplay between the prediction accuracy of supervised learning models and human experts under algorithmic triage. By doing so, we hope to better inform the design of general-purpose techniques for training differentiable models under algorithmic triage. Our investigation yields the following insights:

  • To find the optimal triage policy and predictive model, we need to take into account the amount of human expert disagreement, or expert uncertainty, on a per-instance level.

  • We identify under which circumstances a predictive model that is optimal under full automation may be suboptimal under a desired level of triage.

  • Given any predictive model and desired level of triage, the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level.

Building on the above insights, we introduce a practical gradient-based algorithm that finds a sequence of triage policies and predictive models of increasing performance subject to a constraint on the maximum level of triage. Finally, we apply our gradient-based algorithm to a wide variety of supervised learning tasks using both synthetic and real-world data from two important applications—content moderation and scientific discovery. Our experiments illustrate our theoretical results and show that the models and triage policies provided by our algorithm outperform those provided by several competitive baselines. (To facilitate research in this area, we will release an open-source implementation of our algorithm with the final version of the paper.)

Further related work.

Our work is also related to the areas of learning to defer and active learning.

In learning to defer, the goal is to design machine learning models that are able to defer predictions [bartlett2008classification, cortes2016learning, geifman2018bias, ramaswamy2018consistent, geifman2019selectivenet, liu2019deep, thulasidasan2019combating, ziyin2020learning]. Most previous work focuses on supervised learning and designs classifiers that learn to defer either by treating the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, in this line of work, there are no human experts who make predictions whenever the classifiers defer—the classifiers just pay a constant cost every time they defer a prediction. Moreover, the classifiers are trained to predict the labels of all samples in the training set, as in full automation. A notable recent exception is by meresht2020learning, who consider a setting with a human decision maker; however, they tackle the problem in a reinforcement learning setting.

In active learning, the goal is to find which subset of samples one should label so that a model trained on these samples accurately predicts any sample at test time [cohn1995active, hoi2006batch, sugiyama2006active, willett2006faster, guo2008discriminative, sabato2014active, chen2017active, hashemi2019submodular]. In contrast, in our work, the trained model only needs to accurately predict the fraction of samples picked by the triage policy at test time, relying on human experts to predict the remaining samples.

2 Supervised Learning under Triage

Let X denote the feature domain, Y denote the label domain, and assume features x and labels y are sampled from a ground-truth distribution P(x, y). Moreover, let h ∈ Y be the label predictions provided by human experts and assume they are sampled from a distribution P(h | x, y), which models the disagreements amongst experts [raghu2019direct]. Then, in supervised learning under triage, one needs to find:

  • a triage policy π : X → {0, 1}, which determines who predicts each feature vector—a supervised learning model (π(x) = 0) or a human expert (π(x) = 1);

  • a predictive model m : X → Y, which needs to provide label predictions for those feature vectors for which π(x) = 0.

Here, similarly as in standard supervised learning, we look for the triage policy and the predictive model that result in the most accurate label predictions by minimizing a loss function ℓ. More formally, given a hypothesis class of triage policies Π and predictive models M, our goal is to solve the following minimization problem:

minimize over π ∈ Π and m ∈ M:  L(m, π)  subject to  E[π(x)] ≤ b   (1)

where b ∈ [0, 1] is a given parameter that limits the level of triage, i.e., the percentage of samples human experts need to provide predictions for, and

L(m, π) = E_{(x, y) ∼ P, h ∼ P(h | x, y)} [ (1 − π(x)) ℓ(m(x), y) + π(x) ℓ(h, y) ]   (2)

Here, one might think of replacing h in the loss function with its point estimate E[h | x, y]. However, the resulting objective would have a bias term, as formalized by the following proposition (all proofs can be found in Appendix A): Let ℓ be convex with respect to h and assume there exists x for which the distribution of human predictions is not a point mass. Then, the function obtained by replacing h with its point estimate is a biased estimate of the true average loss defined in Eq. 2. The above result implies that, to find the optimal triage policy and predictive model, we need to take into account the amount of expert disagreement, or expert uncertainty, on each feature vector rather than just an average expert prediction.
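To make this bias concrete, here is a small numeric sketch (a toy example of our own, using a squared loss and Gaussian expert noise, neither of which is prescribed above): by Jensen's inequality, scoring the average expert prediction underestimates the average expert loss whenever the loss is convex and the expert distribution is not a point mass.

```python
import numpy as np

def squared_loss(pred, y):
    return (pred - y) ** 2

rng = np.random.default_rng(0)
y = 1.0
# Hypothetical expert predictions: unbiased around y but noisy,
# so the distribution of human predictions is not a point mass.
h = y + rng.normal(0.0, 0.5, size=100_000)

true_human_loss = squared_loss(h, y).mean()      # E[l(h, y)], about Var(h) = 0.25
point_estimate_loss = squared_loss(h.mean(), y)  # l(E[h], y), close to 0
```

Here `point_estimate_loss` is far below `true_human_loss`, so a triage policy based on the point estimate would systematically overrate the experts.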

3 On the Interplay Between Prediction Accuracy and Triage

Let m* be the optimal predictive model under full automation, i.e., under the triage policy that defers no instances. Then, the following proposition tells us that, if the predictions made by m* are less accurate than those by human experts on some instances, the model will always benefit from algorithmic triage: If there is a subset of the feature domain with positive measure such that, on every instance in the subset, the expected human loss is smaller than the expected model loss,

then there exists a nontrivial triage policy under which m* achieves a strictly lower average loss than under full automation. Moreover, if we rewrite the average loss as

it becomes apparent that, for any model, the optimal triage policy is a deterministic threshold rule in which triage decisions are derived by thresholding the difference between the model and human loss on a per-instance level. More specifically, it is given by the following proposition: Let m be any fixed predictive model. Then, the optimal triage policy that minimizes the loss subject to a constraint on the maximum level of triage is given by:

π*(x) = 1 if E[ℓ(m(x), y) | x] − E[ℓ(h, y) | x] ≥ t_b, and π*(x) = 0 otherwise   (3)

where

t_b ≥ 0 is the smallest threshold value such that E[π*(x)] ≤ b.   (4)

If we plug the optimal triage policy given by Eq. 3 into Eq. 1, we can rewrite our minimization problem as:

(5)

where

with

Here, note that, in the unconstrained case, the threshold is zero.
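On a finite sample, the threshold rule above amounts to deferring the fraction of instances with the largest model-minus-human loss difference, with no need to compute the threshold explicitly. A minimal sketch (the function name and array-based interface are our own):

```python
import numpy as np

def optimal_triage_policy(model_loss, human_loss, b):
    """Deterministic threshold rule (sketch): defer to human experts the
    fraction b of instances where the model's per-instance loss exceeds
    the human loss by the largest margin."""
    n = len(model_loss)
    k = int(np.floor(b * n))           # number of deferred instances
    diff = model_loss - human_loss     # positive => humans are better here
    policy = np.zeros(n, dtype=int)
    if k > 0:
        # indices of the k largest loss differences get deferred
        top_k = np.argsort(diff)[-k:]
        policy[top_k] = 1
    return policy
```

Sorting by the loss difference and cutting at the budget is exactly the thresholding operation: the implicit threshold is the loss difference of the last deferred instance.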

Building on the above expression, we can now identify the circumstances under which the optimal predictive model under full automation within a hypothesis class of parameterized predictive models is suboptimal under algorithmic triage. More formally, our main result is the following Proposition: Let be the optimal predictive model under full automation within a hypothesis class of parameterized models , the optimal triage policy for defined in Eq. 3 for a given maximum level of triage , and . If

(6)

then it holds that . Finally, we can also identify the circumstances under which any predictive model within a hypothesis class of parameterized predictive models is suboptimal under algorithmic triage: Let be a predictive model within a hypothesis class of parameterized models , the optimal triage policy for defined in Eq. 3 for a given maximum level of triage , and . If

(7)

then it holds that . The above results lay the foundations for the practical gradient-based algorithm for differentiable learning under triage that we present in the next section.

4 How To Learn Under Triage

In this section, our goal is to find the predictive model within a hypothesis class of parameterized predictive models that minimizes the loss defined in Eq. 5.

To this end, we now introduce a general gradient-based algorithm to approximate the optimal model and triage policy given a desirable maximum level of triage b. The main obstacle we face is that the threshold value in the average loss depends on the predictive model, which we are trying to learn. To overcome this challenge, we proceed sequentially, starting from the triage policy that defers no instances, and build a sequence of triage policies and predictive models with decreasing loss values.

More specifically, in each step, we find the parameters of the predictive model via stochastic gradient descent (SGD) [kiefer1952stochastic].

In practice, given a set of samples, we can use the following finite-sample Monte Carlo estimator for the gradient:

where the samples are ranked in terms of the difference between the model and the human loss (note that, if the set of samples contains several predictions by different human experts for each sample, we would use all of them to estimate the average human loss) and the normalizing constant is the number of samples the triage policy assigns to the model.
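One such gradient step can be sketched as follows, under simplifying assumptions of our own (a linear model with squared loss and pre-computed per-sample human losses): the model is updated only on the samples the current triage rule keeps for it.

```python
import numpy as np

def triage_sgd_step(w, X, y, human_loss, b, lr=0.1):
    """One SGD step of TrainModel (sketch): update the model only on
    the minibatch samples the current triage rule keeps for the model,
    i.e. the samples with the smallest (model loss - human loss)."""
    preds = X @ w
    model_loss = (preds - y) ** 2
    n = len(y)
    k = int(np.floor(b * n))              # samples deferred to humans
    diff = model_loss - human_loss
    keep = np.argsort(diff)[: n - k]      # samples the model must predict
    # gradient of the mean squared error over the kept samples only
    grad = 2.0 * X[keep].T @ (preds[keep] - y[keep]) / len(keep)
    return w - lr * grad
```

Because the kept set is recomputed from the current model at every step, the model progressively concentrates its capacity on the instances it will actually be asked to predict.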

1:Set of training samples , predictive model , maximum level of triage , number of time steps and iterations , minibatch size , learning rate .
2:
3:for  do
4:     
5:
6:return
7:
8:function TrainModel(, , , , )
9:     
10:     for  do
11:         
12:         
13:         
14:         for  do
15:              if  > 0 then
16:                                          
17:               
18:     return
19:
20:function TrainTriage(, , , , )
21:     
22:     for  do
23:         
24:         
25:         for  do
26:                        
27:               
28:     return
Algorithm 1 Differentiable Triage: it returns a predictive model and a triage policy .

In the above, note that we do not have to explicitly compute the threshold nor the triage policy for every sample in the set—we just need to pick the top samples in terms of the difference between the model and the human loss, using the predictive model fitted in the previous step. To understand why, note that, as long as the triage constraint is active, the threshold, by definition, needs to satisfy that

and this can only happen if the deferral condition holds for exactly the chosen number of deferred samples out of all samples. Here, we can think of our gradient-based algorithm as a particular instance of disciplined parameterized programming [amos2017optnet, agrawal2019differentiable], where the differentiable convex optimization layer is given by the minimization with respect to the triage policy.

Unfortunately, at test time, we cannot do the same and, to make things worse, we cannot explicitly compute the triage decision for an unseen sample, since we would need to observe its corresponding label and human prediction. To overcome this, during training, we also fit a separate model to approximate the optimal triage policy using SGD,

where the choice of loss depends on the model class chosen for the triage model.

Algorithm 1 summarizes the overall gradient-based algorithm, where one subroutine samples a minibatch from the training set and another returns the top samples in the minibatch in terms of the difference between the model and the human loss.
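The TrainTriage step can be sketched as fitting a probabilistic classifier to the binary deferral decisions derived on the training set (here a plain logistic model trained with full-batch gradient descent, a simpler stand-in for the model classes used in our experiments; names are ours):

```python
import numpy as np

def fit_triage_classifier(X, defer_labels, lr=0.5, epochs=500):
    """TrainTriage (sketch): fit a logistic model to imitate the optimal
    deferral decisions, since at test time the per-instance losses (and
    hence the threshold rule) are unobservable."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted deferral probability
        grad = X.T @ (p - defer_labels) / n  # cross-entropy gradient
        w -= lr * grad
    return w

def triage_decision(w, x):
    """Defer to a human iff the predicted deferral probability exceeds 1/2."""
    return int((x @ w) > 0.0)
```

At test time, only this fitted classifier is needed: it maps raw features to a deferral decision without ever seeing labels or human predictions.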

(a) Predictive model trained under full automation
(b) Predictive model trained under algorithmic triage
Figure 3: Interplay between the per-instance accuracy of predictive models and experts under different triage policies. In both panels, the first row shows the training samples along with the predictions made by the models and the triage policy values, and the second row shows the predictive model loss against the human expert loss on a per-instance level. Each triage policy shown is optimal for its corresponding predictive model. Each point corresponds to one instance and, for each instance, the color indicates the amount of noise in the predictions by experts, as given by Eq. 8, and the tone indicates the triage policy value. In all panels, we used the class of predictive models parameterized by sigmoid functions.

5 Experiments on Synthetic Data

In this section, our goal is to shed light on the theoretical results from Section 3. To this end, we use our gradient-based algorithm in a simple regression task in which the optimal predictive model under full automation is suboptimal under algorithmic triage.

Experimental setup. We generate samples by first drawing the features uniformly at random and then obtaining the response variables using two different sigmoid functions, one for negative feature values and one for positive feature values. Moreover, we assume human experts provide noisy predictions of the response variables, with additive noise whose scale is given by

(8)

In the above, we use heteroscedastic noise, motivated by multiple lines of evidence suggesting that human experts' performance on a per-instance level spans a wide range

[raghu2019algorithmic, raghu2019direct, de2020aaai].

Then, we consider the hypothesis class of predictive models parameterized by sigmoid functions and utilize the sum of squared errors on the predictions as the loss function to train the following models and triage policies:

  • Predictive model trained under full automation, i.e., without algorithmic triage.

  • Predictive model trained under full automation, deployed with its optimal triage policy.

  • Predictive model trained under algorithmic triage, deployed with a suboptimal triage policy. Here, note that we use the triage policy that is optimal for the predictive model trained under full automation.

  • Predictive model trained under algorithmic triage, deployed with its optimal triage policy.

In all cases, we train the predictive models using Algorithm 1.
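The data-generating process above can be sketched as follows (the sigmoid parameters and the noise profile below are illustrative placeholders of our own, not the exact constants used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 1000
x = rng.uniform(-5.0, 5.0, size=n)             # features drawn uniformly
# Piecewise ground truth built from two different sigmoids
# (hypothetical parameter choices).
y = np.where(x < 0, sigmoid(2.0 * x + 4.0), sigmoid(3.0 * x - 6.0))
# Heteroscedastic human noise: expert accuracy varies across the
# feature space (assumed noise profile).
noise_std = 0.3 * sigmoid(x)
h = y + rng.normal(0.0, 1.0, size=n) * noise_std
```

With this profile, experts are nearly noiseless on one side of the feature space and unreliable on the other, which is what makes the triage policy non-trivial.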

(a) Training losses of the predictive models
(b) Training losses of the triage policies
Figure 6: Average training losses achieved by the predictive models and the triage policies on the Hatespeech and Galaxy zoo datasets throughout the execution of Algorithm 1. In Panel (a), each predictive model is the output of TrainModel at each time step. In Panel (b), each triage policy is the output of TrainTriage at each epoch.

Finally, we investigate the interplay between the accuracy of the above predictive models and the human experts and the structure of the triage policies at a per-instance level.

Results. Figure 3 shows the training samples along with the predictions made by the predictive models and and the values of the triage policies , and , as well as the losses achieved by the models and triage policies (1-4) on a per-instance level. The results provide several interesting insights.

Since the predictive model trained under full automation seeks to generalize well across the entire feature space, the loss it achieves on a per-instance level is never very high, but never very low either, as shown in the left column of Panel (a). As a consequence, this model under no triage achieves the highest average loss among all alternatives. This may not come as a surprise, since the mapping between features and response variables does not lie within the hypothesis class of predictive models used during training. However, since the predictions by human experts are more accurate than those provided by the above model in some regions of the feature space, we can deploy the model with the optimal triage policy given by Theorem 3 and lower the average loss, as shown in the right column of Panel (a) and suggested by Proposition 3.

In contrast with the predictive model trained under full automation, the predictive model trained under triage learns to predict very accurately the instances that lie in the regions of the feature space colored in green and yellow, but it gives up on the regions colored in red and blue, where its predictions incur a very high loss, as shown in Panel (b). However, these latter instances, where the loss would have been the highest if the predictive model had to predict their response variables, are precisely those that the optimal triage policy hands over to human experts. As a result, this predictive model under its optimal triage policy achieves the lowest average loss among all alternatives, as suggested by Propositions 3 and 6.

Finally, our results also show that deploying the predictive model trained under triage with a suboptimal triage policy may actually lead to a higher loss than the loss achieved by the predictive model trained under full automation with its optimal triage policy. This happens because the former model is trained to work well only on the instances its own optimal triage policy assigns to it, and not necessarily on those assigned to it by a different triage policy.

6 Experiments on Real Data

In this section, we use Algorithm 1 in two classification tasks in content moderation and scientific discovery, one binary and one multiclass. We first investigate the interplay between the accuracy of the predictive models and human experts and the structure of the optimal triage policies at different steps of the training process. Then, we compare the performance of our algorithm with several competitive baselines.

Experimental setup. We use two publicly available datasets [hateoffensive, bamford2009galaxy], one from an application in content moderation and the other for scientific discovery.

— Hatespeech: It consists of tweets containing lexicons used in hate speech. Each tweet is labeled by three to five human experts from Crowdflower as "hate-speech", "offensive", or "neither".

— Galaxy zoo: It consists of galaxy images (the original Galaxy zoo dataset is larger; we report results on a randomly chosen subset due to scalability reasons, and we found similar results in other random subsets). Each image is labeled by human experts as "early type" or "spiral".

For each tweet in the Hatespeech dataset, we first generate a feature vector using fasttext [ft], similarly as in de2020aaai. For each image in the Galaxy zoo dataset, we use its corresponding pixel map as the feature vector (the pixel maps for each image are available at https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge). Given an instance, we estimate the empirical distribution of human predictions from the number of human experts who predicted each label, and set its true label to the majority prediction. Moreover, at test time, for each instance that the triage policy assigns to humans, we sample the human prediction from this empirical distribution.
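This preprocessing can be sketched as follows (function names are ours; annotator votes are assumed to be given as integer label arrays):

```python
import numpy as np

def annotator_distribution(votes, num_classes):
    """Estimate the per-instance human prediction distribution from
    multiple annotator votes, and set the true label by majority vote
    (a sketch of the preprocessing described above)."""
    counts = np.bincount(votes, minlength=num_classes)
    p_h = counts / counts.sum()      # empirical vote distribution
    y = int(np.argmax(counts))       # majority label
    return p_h, y

def sample_human_prediction(p_h, rng):
    """At test time, sample a human prediction from the estimated
    distribution for instances the triage policy defers."""
    return int(rng.choice(len(p_h), p=p_h))
```

Keeping the full vote distribution, rather than only the majority label, is what allows the training objective to account for expert disagreement on a per-instance level.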

(a)–(d) Increasing time steps of Algorithm 1
Figure 13: Predictive model and expert losses at a per-instance level on a randomly selected subset of samples of the Hatespeech (top row) and Galaxy zoo (bottom row) datasets throughout the execution of Algorithm 1 during training. The maximum level of triage is set separately for the Hatespeech and Galaxy zoo datasets. Each point corresponds to an individual instance and, for each instance, the color pattern indicates the triage policy value.

In all our experiments, we consider the hypothesis class of probabilistic predictive models parameterized by softmax distributions:

(9)
(10)

where, for the Hatespeech dataset, the underlying network is the convolutional neural network (CNN) developed by kim2014convolutional and, for the Galaxy zoo dataset, it is the deep residual network developed by he2015deep. Refer to Appendix B for more details. During training, we use a cross-entropy loss on the observed labels. Here, if an instance is assigned to the predictive model, the loss is computed on the model's prediction and, if an instance is assigned to a human expert, it is computed on the human prediction. For the triage model, we use the class of logistic functions, where, for the Hatespeech dataset, the underlying network is the CNN developed by kim2014convolutional and, for the Galaxy zoo dataset, it is the deep residual network developed by he2015deep. Moreover, during training of the triage model, we also use the cross-entropy loss. Finally, in each experiment, we used 60% of the samples for training, 20% for validation and 20% for testing.
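The resulting training loss can be sketched as follows (a simplified stand-in of our own: hard human predictions are scored with a 0/1-style log loss, whereas in practice one may score against the estimated annotator distribution):

```python
import numpy as np

def triage_cross_entropy(model_probs, human_preds, labels, defer):
    """Overall cross-entropy under triage (sketch): deferred instances
    are scored by the human prediction, the rest by the model's
    predicted class probabilities."""
    n = len(labels)
    eps = 1e-12
    # cross-entropy of the model's softmax output at the true label
    model_ce = -np.log(model_probs[np.arange(n), labels] + eps)
    # crude 0/1-style log loss for hard human predictions (a simplification)
    human_ce = -np.log((human_preds == labels).astype(float) + eps)
    return np.mean(np.where(defer == 1, human_ce, model_ce))
```

The `defer` mask plays the role of the triage policy: only the term matching each instance's assignment contributes to the average loss, mirroring Eq. 2.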

Results. First, we look at the average loss achieved by the predictive models and triage policies throughout the execution of Algorithm 1 during training. Figure 6 summarizes the results, which reveal several insights. For small values of the triage level, the models aim to generalize well across a large portion of the feature space and, as a result, they incur a large training loss, as shown in Panel (a). In contrast, for larger triage levels, the models are trained to generalize across a smaller region of the feature space, which leads to a considerably smaller training loss. However, for such high triage levels, the overall performance of our method is also contingent on how well the fitted triage model approximates the optimal triage policy. Fortunately, Panel (b) shows that, as epochs increase, the average training loss of the fitted triage model decreases.

Next, we compare the predictive model and the human expert losses per training instance throughout the execution of Algorithm 1 during training. Figure 13 summarizes the results. At each time step, we find that the optimal triage policy hands over to human experts those instances (in orange) where the loss would have been the highest if the predictive model had to predict their response variables. Moreover, at the beginning of the training process (i.e., at low step values), since the predictive model seeks to generalize across a large portion of the feature space, the model loss remains similar across the feature space. However, later into the training process (i.e., at high step values), the predictive model focuses on predicting more accurately the samples that the triage policy hands over to the model, achieving a lower loss on those samples, and gives up on the remaining samples, where it achieves a high loss.

Finally, we compare the performance of our method against four baselines in terms of misclassification test error. Refer to Appendix B for more details on the baselines, which we refer to as confidence-based triage [bansal2020optimizing], score-based triage [raghu2019algorithmic], surrogate-based triage [sontag2020] and full automation triage. Figure 16 summarizes the results, which show that the predictive models and triage policies found by our gradient-based algorithm outperform the baselines across the majority of maximum triage levels on both the Hatespeech and Galaxy zoo datasets.

(a) Hatespeech
(b) Galaxy zoo
Figure 16: Misclassification test error against the triage level on the Hatespeech and Galaxy zoo datasets for our algorithm, confidence-based triage [bansal2020optimizing], score-based triage [raghu2019algorithmic], surrogate-based triage [sontag2020] and full automation triage. Appendix B contains more details on the baselines.

7 Conclusions

In this paper, we have contributed towards a better understanding of supervised learning under algorithmic triage. We have first identified under which circumstances predictive models may benefit from algorithmic triage, including those trained for full automation. Then, given a predictive model and desired level of triage, we have shown that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically from the model and human per-instance errors. Finally, we have introduced a practical gradient-based algorithm to train supervised learning models under triage and have shown that the models and triage policies provided by our algorithm outperform those provided by several competitive baselines.

Our work also opens many interesting avenues for future work. For example, we have assumed that each instance is predicted by either a predictive model or a human expert. However, there may be many situations in which human experts predict all instances but their predictions are informed by a predictive model [lubars2020ask]. Moreover, we have studied the problem of learning under algorithmic triage in a batch setting; it would be very interesting to study the problem in an online setting. Finally, it would be valuable to assess the performance of supervised learning models under algorithmic triage using interventional experiments in a real-world application.

References

Appendix A Proofs

Proof of Proposition 2. Due to Jensen's inequality and the fact that, by assumption, the distribution of human predictions is not a point mass, it holds that E[ℓ(h, y)] > ℓ(E[h], y). Hence,

(11)

Proof of Proposition 3. Let . Then, we have:

where the inequality holds by assumption and the equality holds by definition.

Proof of Theorem 3. We first consider the unconstrained case. Note that,

Since the second term in the above equation does not depend on , we can find the optimal policy by solving the following optimization problem:

subject to

Note that the above problem is a linear program and it decouples across instances. Therefore, for each instance, the optimal solution is clearly given by:

Next, we provide the proof of the constrained case. Here, we need to solve the following optimization problem:

subject to

To this aim, we consider the dual formulation of the optimization problem, where we only introduce a Lagrange multiplier for the first constraint:

(12)
subject to (13)

The inner minimization problem can be solved using a similar argument to the one for the unconstrained case. Therefore, we have:

where

Proof of Proposition 3. The optimal predictive model under full automation within a parameterized hypothesis class of predictive models satisfies that

(14)

and the optimal predictive model under satisfies that

(15)

Now we have that