1 Introduction
Increasingly, humans work collaboratively with an AI teammate, for example, because the team may perform better than either the AI or the human alone [nagar2011making, patel2019human, kamar2012combining], or because legal requirements may prohibit complete automation [gdpr, facerecognitionlaw]. For human-AI teams, just like for any team, optimizing the performance of the whole team is more important than optimizing the performance of any individual member. Yet, to date, the AI community has for the most part focused on maximizing the individual accuracy of machine-learning models. This raises an important question: Is the most accurate AI the best possible teammate for a human?
We argue that the answer is "No." We show this formally, but the intuition is simple. Consider human-human teams: is the best-ranked tennis player necessarily the best doubles teammate? Clearly not; teamwork puts additional demands on participants besides high individual performance, such as the ability to complement and coordinate with one's partner. Similarly, creating high-performing human-AI teams may require training AI that exhibits additional human-centered properties that facilitate trust and delegation. Implicitly, this is the motivation behind much work in intelligible AI [caruanakdd15, weldcacm19] and post-hoc explainable AI [Ribeiro2016], but we suggest that directly modeling the collaborative process may offer additional benefits.
Recent work has emphasized the importance of better understanding how people transform AI recommendations into decisions [kleinbergecon18]. For instance, consider scenarios in which a system outputs a recommendation on which it is uncertain. A rational user is likely to distrust such recommendations, since erroneous recommendations often correlate with low confidence in the prediction [hendrycksarxiv18]. In this work we assume that the user will discard such a recommendation and solve the task themselves, after incurring a cost (e.g., due to additional human effort). As a result, team performance depends on the AI's accuracy only in the accept region, i.e., the region where a user is actually likely to rely on the AI. Thus, the singular objective of optimizing for AI accuracy (e.g., using log-loss) may hurt team performance when the model has a fixed inductive bias; team performance will instead benefit from improving the AI in the accept regions in Figure 1. While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper is to account for team-based utility as a basis for collaboration.
In sum, we make the following contributions:

We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.

We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function, team-loss, which overcomes these issues by computing the team's expected utility.

We present experiments on multiple real-world datasets that compare the gains in utility achieved by team-loss and log-loss. We observed that, while the gains are small and vary across datasets, they reflect the behavior encoded in the loss. We present further analysis to understand how team-loss achieves a higher utility and when, for example, as a function of domain parameters such as the cost of a mistake.
2 Problem Description
Symbol  Description

$a$  Human accuracy
$\lambda$  Cost of mistake
$c$  Confidence in the predicted label
$h$  Human decision maker
$f$  Classifier
$\mathcal{F}$  Classifier hypothesis space
$\gamma$  Cost of human effort to Solve
$m$  Meta-decision function
$\mathcal{M}$  Meta-decision space
$U$  Expected team utility
$r$  Recommendation
$\mathcal{R}$  Recommendation space
$u$  Utility function
$\mathcal{X}$  Feature space
$\hat{y}$  Recommended label
$\mathcal{Y}$  Label space
We focus on a special case of AI-advised decision making where a classifier gives recommendations to a human decision maker to help make decisions (Figure 1(a)). If $f(x)$ denotes the classifier's output, a probability distribution over $\mathcal{Y}$, the recommendation consists of a label and a confidence value, i.e., $r = (\hat{y}, c)$, where $\hat{y} = \arg\max_{y} f(x)_y$ and $c = f(x)_{\hat{y}}$. Using this recommendation, the user computes a final decision $d$. The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let $u$ denote the utility function. If the team classifies a sequence of instances, the objective of this team is to maximize the cumulative utility. Before deriving a closed-form equation of the objective, we characterize the form of the human-AI collaboration along with our assumptions. We study this particular, simple setting as a first step to explore the opportunities and challenges in team-centric optimization. If we cannot optimize for this simple setting, it may be much harder to optimize for more complex scenarios (discussed further in Section 4).
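As a concrete, minimal sketch, the recommendation can be formed from the classifier's output distribution as follows; the function name `recommend` and the list-of-probabilities input are our own illustration, not code from the paper.

```python
def recommend(probs):
    """Form a recommendation r = (label, confidence) from the
    classifier's output distribution over the label space."""
    label = max(range(len(probs)), key=lambda i: probs[i])  # predicted label
    return label, probs[label]                              # confidence = top probability
```

For example, `recommend([0.3, 0.7])` returns label `1` with confidence `0.7`.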
User either accepts the recommendation or solves the task themselves: The human computes the final decision by first making a meta-decision: Accept or Solve (Figure 1(b)). Accept passes off the recommendation as the final decision. In contrast, Solve ignores the recommendation and the user computes the final decision themselves. Let $m$ denote the function that maps an input instance and recommendation to a meta-decision in $\mathcal{M} = \{\text{Accept}, \text{Solve}\}$. As a result, the optimal classifier would maximize the team's expected utility:

$f^{*} = \arg\max_{f \in \mathcal{F}} U(f, h)$ (1)
Mistakes are costly: A correct decision results in a unit reward whereas an incorrect decision results in a penalty of $\lambda$.

Solving the task is costly: Since it takes time and effort for the human to perform the task themselves (e.g., cognitive effort), we assume that the Solve meta-decision costs more than Accept. Further, without loss of generality, we assume $\gamma$ units of cost to Solve and zero cost to Accept.
Using the above assumptions we obtain the following utility function. The values in each cell of the table originate from subtracting the cost of the action from the environment reward.
Meta-decision / Decision  Correct  Incorrect
Accept [A]  $1$  $-\lambda$
Solve [S]  $1 - \gamma$  $-\lambda - \gamma$

Figure 3: Team utility w.r.t. meta-decision and decision accuracy.
Human is uniformly accurate: Let $a$ denote the conditional probability that, if the user solves the task, they will make the correct decision, i.e.,

$a = P\big(h(x) = y \mid m(x, r) = \text{Solve}\big)$ (2)
Human is rational: The user makes the meta-decision by comparing expected utilities. Further, the user trusts the classifier's confidence as an accurate indicator of the recommendation's reliability. As a result, the user will choose Accept if and only if the expected utility of accepting is higher than the expected utility of solving.
Let $\tau$ denote the minimum value of system confidence for which the user's meta-decision is Accept. Equating the expected utilities of the two meta-decisions and solving for the confidence gives:

$c \cdot 1 + (1 - c)(-\lambda) \;\geq\; a \cdot 1 + (1 - a)(-\lambda) - \gamma \;\iff\; c \;\geq\; a - \frac{\gamma}{1 + \lambda} =: \tau$ (3)

This implies that the human will follow the following threshold-based policy to make meta-decisions:

$m(x, r) = \begin{cases} \text{Accept} & \text{if } c \geq \tau \\ \text{Solve} & \text{otherwise} \end{cases}$ (4)
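The threshold policy admits a direct implementation. The sketch below uses our notation (human accuracy `a`, solve cost `gamma`, mistake penalty `lam`); the closed form for the threshold follows from equating the two expected utilities, as in Equation 3.

```python
def accept_threshold(a, gamma, lam):
    """Minimum confidence at which a rational user accepts (Equation 3).

    Accept iff  c*1 + (1 - c)*(-lam)  >=  a*1 + (1 - a)*(-lam) - gamma,
    which simplifies to  c >= a - gamma / (1 + lam)."""
    return a - gamma / (1 + lam)

def meta_decision(confidence, a, gamma, lam):
    """Threshold-based meta-decision policy of Equation 4."""
    return "Accept" if confidence >= accept_threshold(a, gamma, lam) else "Solve"
```

With `a = 0.85`, `gamma = 0.1`, and `lam = 1`, the threshold is `0.8`: a recommendation at confidence 0.9 is accepted, while one at confidence 0.7 is solved by the human.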
2.1 Expected Team Utility
We now derive the equation for the expected utility of recommendations. Let $U(f, h)$ denote the expected utility of the classifier $f$ and decision maker $h$. Since, upon Accept, the human returns the classifier's recommendation, the probability that the final decision $d$ is correct is the same as the classifier's predicted probability of the correct decision:

$P\big(d = y \mid m(x, r) = \text{Accept}\big) = f(x)_y$ (5)

Combining the utility table with the threshold policy yields the per-example expected utility:

$\mathbb{E}[u \mid x, y] = \mathbb{1}[c \geq \tau]\big(f(x)_y (1 + \lambda) - \lambda\big) + \mathbb{1}[c < \tau]\big(a (1 + \lambda) - \lambda - \gamma\big)$ (6)

$U(f, h) = \mathbb{E}_{(x, y)}\big[\mathbb{E}[u \mid x, y]\big]$ (7)
Figure 3(a) visualizes the expected team utility of the classifier predictions as a function of confidence in the true label.
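The expected-utility curve in Figure 3(a) can be traced with a short function. The sketch below assumes binary labels, so that confidence in the predicted label is `max(p, 1 - p)` where `p` is the probability assigned to the true label; the parameter values in the usage note are ours, not the paper's.

```python
def expected_utility(p_true, a, gamma, lam):
    """Per-example expected team utility (Equations 5-7) as a function
    of the classifier's probability of the true label."""
    c = max(p_true, 1.0 - p_true)        # confidence in the predicted label
    tau = a - gamma / (1.0 + lam)        # accept threshold (Equation 3)
    if c >= tau:                         # Accept region: AI's answer stands
        return p_true * (1 + lam) - lam
    return a * (1 + lam) - lam - gamma   # Solve region: human takes over
```

Note the worst case: a confidently wrong prediction (`p_true` near 0) lands in the Accept region and yields a utility near `-lam`, which is exactly why accuracy in the Accept region matters most.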
2.2 Utility-Based Loss
Since gradient descent-based minimization of loss functions is common in machine learning, we transform the expected utility into a loss function by negating it. We call this new loss function team-loss.

$L_{\text{team}}(f) = -\,\mathbb{E}_{(x, y)}\Big[\log\big(\mathbb{E}[u \mid x, y] - u_{\min}\big)\Big]$ (8)

We take a logarithm before negating the utility to allow comparisons with log-loss, where the logarithmic nature of the loss is known to benefit optimization, for example, by heavily penalizing high-confidence mistakes.¹ Figure 3(b) visualizes this new loss function.

¹Since the utility can be negative, in our implementation we shift the utility up (by subtracting its minimum value $u_{\min}$) before computing its logarithm.
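A minimal per-example implementation of team-loss, under the shift-then-log construction described above; the small epsilon added inside the logarithm is our own detail to keep the loss finite at the minimum utility, and the parameter values in the usage note are illustrative.

```python
import math

def team_loss(p_true, a, gamma, lam):
    """Negative log of the (shifted) per-example expected team utility."""
    c = max(p_true, 1.0 - p_true)            # confidence in the predicted label
    tau = a - gamma / (1.0 + lam)            # accept threshold
    if c >= tau:
        util = p_true * (1 + lam) - lam      # Accept region
    else:
        util = a * (1 + lam) - lam - gamma   # Solve region
    u_min = -lam - gamma                     # minimum of the utility table
    return -math.log(util - u_min + 1e-6)    # shift up, then log, then negate
```

Unlike log-loss, this loss is constant across the Solve region: with `a = 0.85`, `gamma = 0.1`, `lam = 1`, `team_loss(0.55, ...)` equals `team_loss(0.7, ...)`, since both confidences fall below the threshold.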
3 Experiments
We conducted experiments to answer the following research questions:

Does the new loss function result in a classifier that improves team utility over the most accurate classifier?

How do these improvements change with properties of the task, e.g., the cost of mistakes ($\lambda$)?

How do these improvements change with properties of the dataset, e.g., with data distribution or dimensionality?
Metrics and Datasets We compared the utility achieved by two models: the most accurate classifier, trained using log-loss, and a classifier optimized using team-loss, on the datasets described in Table 2. We experimented with two synthetic datasets and four high-stakes real-world datasets. The real datasets are from domains that are known to deploy, or already deploy, AI to assist human decision makers. The Scenario1 dataset refers to a dataset we created by sampling 10000 points from the data distribution described in Figure 1.
Dataset  #Features  #Examples  Fraction Positive 

Scenario1  2  10000  0.43 
Moons  2  10000  0.50 
German  24  1000  0.30 
Fico  39  9861  0.52 
Recidivism  13  6172  0.46 
Mimic  714  21139  0.13 
Training Procedure
We experimented with two models: logistic regression and a multilayer perceptron (two hidden layers with 50 and 100 units). For each task (defined by a choice of task parameters, dataset, model, and loss) we optimized the loss using stochastic gradient descent (SGD) and used standard, well-known training practices such as regularization, checkpointing, and learning-rate schedulers. We selected the best hyperparameters using 5-fold cross-validation, including values for the learning rate, batch size, the patience and decay factor of the learning-rate scheduler, and the weight of the L2 regularizer.
In our initial experiments training with team-loss using SGD, we observed that the classifier's loss would never decrease and in fact remained constant. This happened because, in practice, random initializations resulted in classifiers that are uncertain on all examples, and, since, by definition, team-loss is flat in these uncertain regions (Figure 3(b)), the gradients were zero and uninformative. To overcome this issue, we initialized the classifiers with the (already converged) most accurate classifier.
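The plateau can be reproduced numerically. The sketch below (our own, with assumed parameter values) finite-differences the per-example team-loss with respect to the predicted probability of the true label; in the Solve region the gradient is exactly zero, which is the failure mode described above.

```python
import math

def team_loss(p_true, a=0.85, gamma=0.1, lam=1.0):
    # Per-example team-loss (Section 2.2): utility shifted above zero,
    # then log-transformed and negated.
    c = max(p_true, 1.0 - p_true)
    tau = a - gamma / (1.0 + lam)
    util = p_true * (1 + lam) - lam if c >= tau else a * (1 + lam) - lam - gamma
    return -math.log(util + lam + gamma + 1e-6)

def grad(p, h=1e-5):
    # Central finite difference of the loss w.r.t. p.
    return (team_loss(p + h) - team_loss(p - h)) / (2 * h)

print(grad(0.5))   # 0.0: no signal in the Solve region (tau = 0.8)
print(grad(0.9))   # negative: raising p_true lowers the loss in the Accept region
```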
3.1 Results
Model  Dataset  Acc (LL)  Util (LL)  ΔAcc  ΔUtil 

Linear  Scenario1  0.86  0.59  0.16  0.165 
Moons  0.89  0.81  0.01  0.020  
German  0.75  0.61  0.004  0.009  
Mimic  0.88  0.80  0.000  0.001  
Recid  0.68  0.53  0.000  0.000  
Fico  0.73  0.58  0.000  0.000  
MLP  Fico  0.72  0.56  0.01  0.018 
Scenario1  0.98  0.84  0.04  0.008  
Moons  1.00  0.99  0.00  0.007  
German  0.74  0.61  0.02  0.003  
Mimic  0.88  0.80  0.00  0.003  
Recid  0.67  0.52  0.00  0.001 

Note: LL indicates log-loss
RQ1: Experiments showed that when we used team-loss, the magnitude of the improvements in team utility over log-loss varied across the datasets, but the improvements were consistently observed (Table 3). We observed that team-loss often sacrifices classifier accuracy to improve team utility, the more desirable metric. For the linear classifier, this sacrifice is especially large on the synthetic datasets: Scenario1 (16%) and Moons (1%). For the MLP, team-loss sacrifices 2% accuracy to improve team utility.²

²We report absolute improvements instead of percentage improvements in utility because utility can be negative.
While the metrics in Table 3 (change in accuracy and utility) provide a global understanding of the effect of team-loss, they do not explain how team-loss achieved these improvements or whether the behavior of the new models is consistent with intuition. Figure 5 visualizes the difference in behavior (averaged over 150 seeds) between the classifiers produced by log-loss and team-loss on the Scenario1 dataset. Specifically, as shown in Figure 5, we visualize and compare their:

Calibration, using reliability curves, which compare system confidence with true accuracy. A perfectly calibrated system, for example, will be 80% accurate on regions where it is 80% confident. In practice, however, systems may be over- or under-confident.

Distributions of confidence in predictions. For example, in Figure 5, team-loss makes more high-confidence predictions than log-loss.

Fraction of total system accuracy contributed by different regions (of confidence values). Thus, the area under this curve indicates the system’s total accuracy. Note that for our setup the area under the curve in the Accept region is more crucial than the area in the Solve region since in the latter the human is expected to take over.

Similar to (V3), the fourth subgraph shows the fraction of total system utility contributed by different regions of confidence.
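The first of these visualizations, the reliability curve, is standard and easy to recompute. The sketch below bins binary-classification confidences in [0.5, 1] and is our own rendering, not the plotting code used for Figure 5.

```python
def reliability_curve(confidences, correct, n_bins=10):
    """Mean confidence vs. empirical accuracy per confidence bin.

    confidences: predicted-label confidences in [0.5, 1] (binary case);
    correct: 1 if the predicted label was right, else 0.
    Returns (mean confidence, accuracy) pairs for non-empty bins."""
    width = 0.5 / n_bins
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int((conf - 0.5) / width), n_bins - 1)  # clamp conf = 1.0
        bins[idx].append((conf, ok))
    return [(sum(c for c, _ in b) / len(b), sum(o for _, o in b) / len(b))
            for b in bins if b]
```

A perfectly calibrated system produces points on the diagonal: ten predictions at confidence 0.8, of which eight are correct, yield the single point (0.8, 0.8).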
If team-loss had not resulted in different predictions than log-loss, the curves in Figure 5 for the two loss functions would be indistinguishable. However, we observed that team-loss results in dramatically different predictions than log-loss. In fact, we noticed two types of behavior when team-loss improved utility.

The first type of behavior was observed on the Scenario1 dataset (Figure 5) and is easier to understand as it matches the intuition we set out in the beginning: the classifier trained with team-loss sacrifices accuracy on the uncertain examples in the Solve region to make more high-confidence predictions in the Accept region. This change improves system accuracy in the Accept region, which is where system accuracy matters and contributes to team utility. Later, we show that this same behavior is observed on the German dataset (Figure 6).

The second type of behavior was observed on Moons (Figure 5), where the new loss increases accuracy in the Accept region at the cost of making fewer very high-confidence predictions (e.g., when confidence is greater than 0.95, in the region marked Y). This change improves utility because the system's accuracy in the Accept region matters more than making very high-confidence predictions.
In both of these behaviors, team-loss effectively increases the contribution to AI accuracy from the Accept region, i.e., the region where the AI's performance provides value to the team. In contrast, log-loss has no such considerations. Figure 6 shows a similar analysis on the real datasets for both the linear and MLP classifiers. When team-loss improves utility, we see one of the two behaviors we described above.
RQ2: Since the penalty of mistakes may be task-dependent (e.g., an incorrect diagnosis may be costlier than an incorrect loan approval), we varied the mistake penalty $\lambda$ to study its effect on the improvements from team-loss. Our experiments (Figure 7) showed that the difference in utilities depends on the cost of a mistake, and the highest difference is observed at a different value of $\lambda$ across datasets. We also observed that, for our setup, as the mistake penalty increases, log-loss may achieve lower performance than the human-only baseline, and so deploying automation is undesirable in these cases. For example, on Fico (for large $\lambda$), the linear model learned using log-loss achieves lower performance than the human baseline. Similarly, on Mimic (for large $\lambda$), deploying the MLP learned using log-loss is undesirable.
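The comparison against the human-only baseline can be sketched directly from the utility model of Section 2. The predictions and parameter values below are hypothetical, chosen only to illustrate the crossover, and are not the paper's data.

```python
def human_baseline(a, gamma, lam):
    """Expected utility when the human always solves the task."""
    return a * (1 + lam) - lam - gamma

def team_utility(p_trues, a, gamma, lam):
    """Average expected team utility (Equation 7) over predicted
    probabilities of the true label (binary case)."""
    tau = a - gamma / (1 + lam)
    solve_u = human_baseline(a, gamma, lam)
    utils = [p * (1 + lam) - lam if max(p, 1 - p) >= tau else solve_u
             for p in p_trues]
    return sum(utils) / len(utils)

# A classifier that is confidently wrong 20% of the time beats the
# human-only baseline when mistakes are cheap, but falls below it
# once the mistake penalty grows.
preds = [0.99] * 8 + [0.05] * 2
for lam in (0.5, 4.0):
    print(lam, team_utility(preds, a=0.9, gamma=0.2, lam=lam),
          human_baseline(a=0.9, gamma=0.2, lam=lam))
```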
Model  Dataset  Acc (LL)  Util (LL)  ΔAcc  ΔUtil 

Linear  Germanb  0.72  0.56  0.01  0.004 
Mimicb  0.77  0.65  0.00  0.002  
MLP  Germanb  0.74  0.57  0.00  0.024 
Mimicb  0.93  0.87  0.00  0.002 
RQ3: Since the gains from using team-loss were small and varied across datasets, we conducted experiments to investigate properties of the dataset that may have affected these improvements. While there are many properties of a dataset one could investigate, we studied the following:

Data distribution In the Moons dataset, we observed that the linear model trained with team-loss increased utility by increasing confidence on the examples on the outer edges of the moons enough to move these examples to the Accept region. So, to test whether a different data distribution would benefit from using team-loss, we created additional versions of Moons by systematically moving points from the middle of the circle towards its edges. Figure 9 shows the improvements in utility as we moved more data.

Class imbalance While most of our datasets were balanced, German and Mimic had a lower percentage of positive instances (see Table 2). We conducted experiments on balanced versions of these two datasets to understand whether class imbalance affected our observations in the previous experiments. Table 4 shows the performance after we oversampled the positive class to adjust for class imbalance in the two datasets. We observed that in both cases correcting the class imbalance increased the improvement from using team-loss.
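A minimal version of the balancing step, assuming oversampling by duplicating randomly chosen minority-class examples; the paper does not spell out its exact procedure, so treat this as one plausible implementation.

```python
import random

def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen minority-class examples until the
    two classes are balanced (binary labels 0/1)."""
    rng = random.Random(seed)
    pos = [(x, l) for x, l in zip(X, y) if l == 1]
    neg = [(x, l) for x, l in zip(X, y) if l == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = pos + neg + extra
    rng.shuffle(data)
    X_bal, y_bal = zip(*data)
    return list(X_bal), list(y_bal)
```

For example, a dataset with 3 positives and 7 negatives becomes one with 7 of each after oversampling.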
We also investigated the dimensionality of the datasets. Since team-loss may be harder to optimize than NLL, an increase in data dimensions may affect the ability of the optimizer (in this case, SGD) to optimize the team-loss objective. We also experimented with Adam as the optimizer, but unfortunately it did not provide any benefits. For a given dataset, we varied its dimensionality by using only a subset of the features. However, we did not notice any correlation between dimensionality and improvements in utility.
4 Discussion
We conjecture two reasons to explain the small gains in utility on real datasets using team-loss: either there is no scope for improving the utility on those dataset-model pairs, or our current optimization procedures are ineffective. Since we do not know the optimal (utility) solution for a given dataset and model, we cannot verify or reject the first conjecture. However, the results on the two synthetic datasets suggest the existence of situations where there is a significant gap between the utilities achieved using team-loss and log-loss.
However, it is possible that our current optimization procedures are ineffective for optimizing team-loss. One reason this might happen is that team-loss is more complex than log-loss: it introduces new plateaus in the loss surface and thus may increase the chances of optimization methods such as stochastic gradient descent getting stuck in local minima. In fact, in our experiments, we observed that on the datasets where team-loss did not increase utility, it resulted in predictions identical to log-loss. This may happen, for example, if the most accurate classifier is a local minimum. Since we use the most accurate classifier to initialize the optimization of team-loss, this suggests that further optimization with the new loss did not manage to escape this potential local minimum.
While we propose a solution for simplified human-AI teamwork (see the assumptions in Section 2), our observations have implications for human-AI teams in general. If we cannot optimize utility for our simplified case, it may be harder to optimize utility in scenarios where users make Accept and Solve decisions using a richer, more complex mental model that goes beyond relying on model confidence alone. Such scenarios are common when the system confidence is an unreliable indicator of performance (e.g., due to poor calibration) and, as a result, the user develops an understanding of system failures in terms of domain features. For example, Tesla drivers often override the Autopilot using features such as road and weather conditions. We can reduce this case, where users have a complex mental model, to the one we studied. Specifically, we can construct a loss function that is constant when a prediction belongs to the Solve region described by the user's mental model and log-loss otherwise. This case may be harder to optimize because the resultant loss surface will contain more complex combinations of plateaus and local optima than the one we considered.
5 Related Work
Our approach is closely related to maximum-margin classifiers, such as an SVM optimized with the hinge loss [burges1998tutorial], where a larger soft margin can be used to make high-confidence and accurate predictions. However, unlike our approach, it is not possible to directly plug the domain's payoff matrix (e.g., in Figure 3) into such a model. Furthermore, the SVM's output and margin do not have an immediate probabilistic interpretation, which is crucial for our problem setting. One possible (though computationally intensive) solution direction is to convert the margin into probabilities, e.g., using post-hoc calibration (such as Platt scaling [platt99]), and to use cross-validation to select margin parameters that optimize team utility. While it is still an open question whether such an approach would be effective for SVM classifiers, in this work we focused our attention on gradient-based optimization.
Another related problem is cost-sensitive learning, where different mistakes incur different penalties; for example, false negatives may be costlier than false positives [zadrozny2003cost]. A common solution here is upweighting the inputs where mistakes are costlier. Also relevant is work on importance-based learning, where reweighting helps learn from imbalanced data or speeds up training. However, in our setup, reweighting the inputs makes less sense: the weights would depend on the output of a classifier that has not been trained yet. An iterative approach may be possible, but our initial analysis showed that this approach is prone to oscillations, where the classifier may never converge. We leave exploring this avenue for future work.
A fundamental line of work that renders AI predictions more actionable (for humans) and better suited for teaming is confidence calibration, for example, using Bayesian models [ghahramani2015probabilistic, beach1975expert, gal_dropout_2016] or post-hoc calibration [platt99, zadrozny2001obtaining, guo2017calibration, niculescu2005predicting]. A key difference between these methods and our approach is that team-loss retrains the model to improve on inputs on which users are more likely to rely on the AI's predictions. The same contrast distinguishes our approach from outlier-detection techniques [hendrycks2018deep, lee2017training, hodge2004survey].
More recent work that adjusts model behavior to accommodate collaboration is backward compatibility for AI [bansal2019updates], where the model considers user interactions with a previous version of the system to preserve trust across updates. Recent user studies showed that when users develop mental models of a system's mistakes, properties other than accuracy are also desirable for successful collaboration, for example, parsimonious and deterministic error boundaries [bansal2019beyond]. Our approach is a first step towards implementing these desiderata within machine-learning optimization itself. Other approaches to human-centered optimization regularize or constrain model optimization for other human-centered requirements such as interpretability [wu2019regional, wu2018beyond] or fairness [jung2019eliciting, zafar2015fairness].
6 Conclusions
We studied the problem of training classifiers that optimize team performance, a metric that matters more for collaboration than mere automation accuracy. To support direct optimization of team performance we devised a new loss function, team-loss, with a formulation based on the expected utility of the human-AI team for decision making. Thorough investigations and visualizations of classifier behavior before and after optimizing with team-loss show that, when such optimization is effective, team-loss can fundamentally change model behavior and improve team utility. Changes in model behavior include either (i) sacrificing model accuracy in low-confidence regions for more accurate high-confidence predictions, or (ii) increasing accuracy in the Accept region through more accurate predictions but fewer highly confident ones. Such behaviors were observed on synthetic and real-world datasets from domains where AI is known to be employed in support of human decision makers. However, we also report that current optimization techniques were not always effective; in fact, they sometimes did not change model behavior at all, i.e., models remained identical even after fine-tuning with team-loss. Since team-loss clearly surfaces optimization challenges, mostly related to its flat curvature and potential local minima in the Solve region, we invite future work on machine-learning optimization and human-AI collaboration to jointly approach such challenges at the intersection of both fields.
References
7 Appendix
7.1 Extension: Users May Not be Rational
In Section 2 we assumed that the user acted rationally while making the meta-decision. We now relax this assumption and assume that, with a small probability $\epsilon$, the user (uniformly) randomly chooses between Accept and Solve.³ Then, extending Equation 4, the user will Accept the system recommendation with probability:

$P\big(m(x, r) = \text{Accept} \mid c\big) = \begin{cases} 1 - \frac{\epsilon}{2} & \text{if } c \geq \tau \\ \frac{\epsilon}{2} & \text{otherwise} \end{cases}$ (9)

³Note that $\epsilon$ can be conditioned on the confidence and the threshold.
In the above equation, when the model is confident, the acceptance probability decreases by $\frac{\epsilon}{2}$ because the user may decide to Solve. Similarly, when the model is not confident, the increase (compared to Equation 4) indicates that the user may randomly decide to Accept an uncertain recommendation.
To simplify deriving the new equation for the expected utility, we introduce the shorthand $G_A = f(x)_y (1 + \lambda) - \lambda$ and $G_S = a (1 + \lambda) - \lambda - \gamma$ for the expected utilities of Accept and Solve, and rewrite Equation 6 as:

$\mathbb{E}[u \mid x, y] = \mathbb{1}[c \geq \tau]\, G_A + \mathbb{1}[c < \tau]\, G_S$ (10)

Using the above two equations, we obtain the following equation for the expected utility when the user is not perfectly rational:

$\mathbb{E}_\epsilon[u \mid x, y] = \mathbb{1}[c \geq \tau]\Big(\big(1 - \tfrac{\epsilon}{2}\big) G_A + \tfrac{\epsilon}{2}\, G_S\Big) + \mathbb{1}[c < \tau]\Big(\tfrac{\epsilon}{2}\, G_A + \big(1 - \tfrac{\epsilon}{2}\big) G_S\Big)$ (11)
The above equation denotes that, when the system is confident, instead of always obtaining the Accept utility as in Equation 10, with a small probability the user may obtain the expected utility associated with a Solve action. Similarly, when the system is uncertain, the user may sometimes obtain the expected utility associated with an Accept action. Qualitatively, this results in a worse best-case expected utility, an artifact of the user making a suboptimal decision (to Solve) when automation would result in the highest utility. Similarly, the expected utility in the Solve region will also decrease: the user may Accept uncertain recommendations. On the other hand, this improves the worst-case utility: the new user will avoid some high-confidence mistakes that a rational user would not. However, unlike the rational-user utility, Equation 11 is strictly monotonic in the classifier's probability of the true label: the Accept utility is a linear function of that probability and hence strictly monotonic, and the sum of a strictly monotonic function and a constant is strictly monotonic.
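The noisy-rational utility of Equations 9 to 11 can be checked numerically. This sketch uses our notation (human accuracy `a`, solve cost `gamma`, mistake penalty `lam`, deviation probability `eps`) and assumes binary labels.

```python
def noisy_expected_utility(p_true, a, gamma, lam, eps):
    """Per-example expected utility when the user deviates from the
    threshold policy with probability eps (Equations 9-11)."""
    c = max(p_true, 1.0 - p_true)                     # confidence in predicted label
    tau = a - gamma / (1.0 + lam)                     # accept threshold
    g_accept = p_true * (1 + lam) - lam               # expected utility of Accept
    g_solve = a * (1 + lam) - lam - gamma             # expected utility of Solve
    p_accept = 1 - eps / 2 if c >= tau else eps / 2   # Equation 9
    return p_accept * g_accept + (1 - p_accept) * g_solve

# With eps > 0 the utility varies with p_true even inside the Solve
# region, since the user sometimes accepts anyway; with eps = 0 it is
# flat there, recovering the rational-user plateau.
```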