Increasingly, humans work collaboratively with an AI teammate, for example, because the team may perform better than either the AI or the human alone [nagar2011making, patel2019human, kamar2012combining], or because legal requirements may prohibit complete automation [gdpr, face-recognition-law]. For human-AI teams, just like for any team, optimizing the performance of the whole team is more important than optimizing the performance of an individual member. Yet, to date, the AI community has for the most part focused on maximizing the individual accuracy of machine-learning models. This raises an important question: Is the most accurate AI the best possible teammate for a human?
We argue that the answer is "No." We show this formally, but the intuition is simple. Consider human-human teams: is the best-ranked tennis player necessarily the best doubles teammate? Clearly not: teamwork puts additional demands on participants beyond high individual performance, such as the ability to complement and coordinate with one's partner. Similarly, creating high-performing human-AI teams may require training AI that exhibits additional human-centered properties that facilitate trust and delegation. Implicitly, this is the motivation behind much work in intelligible AI [caruana-kdd15, weld-cacm19] and post-hoc explainable AI [Ribeiro2016], but we suggest that directly modeling the collaborative process may offer additional benefits.
Recent work has emphasized the importance of better understanding how people transform AI recommendations into decisions [kleinberg-econ18]. For instance, consider scenarios in which a system outputs a recommendation on which it is uncertain. A rational user is likely to distrust such recommendations, since erroneous recommendations are often correlated with low confidence in the prediction [hendrycks-arxiv18]. In this work we assume that the user will discard the recommendation and solve the task themselves, after incurring a cost (e.g., due to additional human effort). As a result, team performance depends on AI accuracy only in the accept region, i.e., the region where a user is actually likely to rely on the AI. Thus, the singular objective of optimizing for AI accuracy (e.g., using log-loss) may hurt team performance when the model has a fixed inductive bias; team performance will instead benefit from improving the AI in the accept regions in Figure 1. While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper is accounting for team-based utility as a basis for collaboration.
In sum, we make the following contributions:
We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.
We show that log-loss, the most popular loss function, is insufficient (it ignores team utility) and develop a new loss function, team-loss, which overcomes this issue by computing the team's expected utility.
We present experiments on multiple real-world datasets that compare the gains in utility achieved by team-loss and log-loss. We observed that while the gains are small and vary across datasets, they reflect the behavior encoded in the loss. We present further analysis to understand how and when team-loss results in higher utility, for example, as a function of domain parameters such as the cost of a mistake.
2 Problem Description
|Symbol|Description|
|M|Cost of mistake|
|c|Confidence in the predicted label|
|H|Human decision maker|
|F|Classifier hypothesis space|
|e|Cost of human effort to Solve|
|E[U]|Expected team utility|
We focus on a special case of AI-advised decision making in which a classifier f, drawn from a hypothesis space F, gives recommendations to a human decision maker H to help make decisions (Figure 1(a)). If f(x) denotes the classifier's output on an input x, a probability distribution over the labels, the recommendation consists of a label and a confidence value, i.e., y_hat = argmax_y f(x)_y and c = max_y f(x)_y. Using this recommendation, the user computes a final decision d. The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of this team is to maximize the cumulative utility. Before deriving a closed-form equation of the objective, we characterize the form of the human-AI collaboration along with our assumptions. We study this particular, simple setting as a first step to explore the opportunities and challenges in team-centric optimization. If we cannot optimize for this simple setting, it may be much harder to optimize for more complex scenarios (discussed further in Section 4).
User either accepts the recommendation or solves the task themselves: The human computes the final decision d by first making a meta-decision: Accept or Solve (Figure 1(b)). Accept passes off the recommendation as the final decision. In contrast, Solve ignores the recommendation and the user computes the final decision themselves. Let m denote the function that maps an input instance and recommendation to a meta-decision in {Accept, Solve}. As a result, the optimal classifier is the one that maximizes the team's expected utility:

f* = argmax_{f in F} E[U(f, H)]
Mistakes are costly: A correct decision results in a unit reward, whereas an incorrect decision results in a penalty M.
Solving the task is costly: Since it takes time and effort for the human to perform the task themselves (e.g., cognitive effort), we assume that the Solve meta-decision costs more than Accept. Further, without loss of generality, we assume e units of cost to Solve and zero cost to Accept.
Using the above assumptions we obtain the following utility function. The values in each cell of the table originate from subtracting the cost of the action from the environment reward.
|Meta-decision|Correct|Incorrect|
|Accept [A]|1|-M|
|Solve [S]|1 - e|-M - e|

Figure 3: Team utility w.r.t. meta-decision and decision accuracy.
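As a concrete illustration, the payoff table above can be sketched in code. This is a minimal sketch under the stated assumptions (unit reward, mistake penalty M, Solve cost e); the function name and symbols are our placeholders, not the paper's implementation.

```python
# Minimal sketch of the team utility table above. M (mistake penalty),
# e (cost to Solve), and team_utility are illustrative placeholders.

def team_utility(meta: str, correct: bool, M: float, e: float) -> float:
    """Environment reward minus the cost of the chosen meta-decision."""
    reward = 1.0 if correct else -M          # unit reward, penalty M
    cost = e if meta == "Solve" else 0.0     # Accept incurs no extra cost
    return reward - cost
```

For example, with M = 1 and e = 0.1, accepting a correct recommendation yields 1.0, while solving and still erring yields -1.1.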
Human is uniformly accurate: Let h denote the conditional probability that, if the user solves the task, they will make the correct decision, i.e., h = P(d = y | Solve), independent of the input x.
Human is rational: The user makes the meta-decision by comparing expected utilities. Further, the user trusts the classifier’s confidence as an accurate indicator of the recommendation’s reliability. As a result, the user will choose Accept if and only if the expected utility for accepting is higher than the expected utility for solving.
Let tau denote the minimum value of system confidence for which the user's meta-decision is Accept. Comparing the expected utilities of the two meta-decisions yields tau = h - e / (1 + M).
This implies that the human follows a threshold-based policy to make meta-decisions:

m(x, y_hat, c) = Accept if c >= tau, and Solve otherwise.
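The threshold-based policy can be sketched as follows. The closed form tau = h - e / (1 + M) follows from equating the expected utilities of Accept and Solve under the payoff assumptions above; all names (h for human accuracy, M for mistake penalty, e for Solve cost) are our placeholder notation.

```python
# Sketch of the rational user's threshold policy. tau = h - e / (1 + M)
# follows from comparing the expected utilities of Accept and Solve.

def accept_threshold(h: float, M: float, e: float) -> float:
    return h - e / (1.0 + M)

def meta_decision(conf: float, h: float, M: float, e: float) -> str:
    """Accept iff the classifier's confidence clears the threshold."""
    return "Accept" if conf >= accept_threshold(h, M, e) else "Solve"
```

With h = 0.85, M = 1, and e = 0.05, the threshold is 0.825, so a recommendation at 90% confidence is accepted while one at 70% is not.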
2.1 Expected Team Utility
We now derive the equation for the expected utility of recommendations. Let E[U(f, H)] denote the expected utility of the classifier f paired with the decision maker H. Since upon Accept the human returns the classifier's recommendation, the probability that the final decision is correct is the same as the classifier's predicted probability of the correct decision:

E[U(f, H)] = E_{(x,y)} [ 1[c >= tau] (f(x)_y (1 + M) - M) + 1[c < tau] (h (1 + M) - M - e) ]

where f(x)_y is the predicted probability of the true label y, tau is the Accept threshold, M the mistake penalty, e the Solve cost, and h the human accuracy.
Figure 3(a) visualizes the expected team utility of the classifier predictions as a function of confidence in the true label.
2.2 Utility-Based Loss
Since gradient descent-based minimization of loss functions is common in machine learning, we transform the expected utility into a loss function by negating it. We call this new loss function team-loss.
We take a logarithm before negating the utility to allow comparisons with log-loss, where the logarithmic nature of the loss is known to benefit optimization, for example, by heavily penalizing high-confidence mistakes.[1] Figure 3(b) visualizes this new loss function.

[1] Since the utility can be negative, in our implementation, before computing the logarithm of the utility we shift the utility function up appropriately (by subtracting its minimum value).
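A minimal numpy sketch of team-loss under these definitions might look as follows. The symbols tau, h, M, and e are our placeholders for the Accept threshold, human accuracy, mistake penalty, and Solve cost, and the shift constant M + e is one possible choice for the utility minimum; this is an illustration, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of team-loss. p_true: model's predicted probability
# of the true label; conf: confidence of the predicted label.

def team_loss(p_true, conf, tau, h, M, e, eps=1e-6):
    p_true = np.asarray(p_true, float)
    conf = np.asarray(conf, float)
    eu_accept = p_true * (1.0 + M) - M        # expected utility upon Accept
    eu_solve = h * (1.0 + M) - M - e          # constant utility upon Solve
    utility = np.where(conf >= tau, eu_accept, eu_solve)
    shifted = utility + (M + e) + eps         # shift above zero before log
    return float(np.mean(-np.log(shifted)))
```

By construction, the loss is identical for any two predictions that both fall below the threshold, since such examples contribute only the constant Solve utility.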
We conducted experiments to answer the following research questions:
Does the new loss function result in a classifier that improves team utility over the most accurate classifier?
How do these improvements change with properties of the task, e.g., the cost of mistakes (M)?
How do these improvements change with properties of the dataset, e.g., with data distribution or dimensionality?
Metrics and Datasets We compared the utility achieved by two models: the most accurate classifier, trained using log-loss, and a classifier optimized using team-loss, on the datasets described in Table 2. We experimented with two synthetic datasets and four high-stakes real-world datasets. The real datasets are from domains that are known to deploy, or are considering deploying, AI to assist human decision makers. The Scenario1 dataset refers to a dataset we created by sampling 10,000 points from the data distribution described in Figure 1.
We experimented with two models: logistic regression and a multi-layer perceptron (two hidden layers with 50 and 100 units). For each task (defined by a choice of task parameters, dataset, model, and loss) we optimized the loss using stochastic gradient descent (SGD) and followed standard training practices such as regularization, check-pointing, and learning-rate schedulers. We selected the best hyper-parameters using 5-fold cross-validation, including values for the learning rate, batch size, the patience and decay factor of the learning-rate scheduler, and the weight of the L2 regularizer.
In our initial experiments training with team-loss using SGD, we observed that the classifier's loss would never decrease and would in fact remain constant. This happened because, in practice, random initializations resulted in classifiers that are uncertain on all examples, and, since, by definition, team-loss is flat in these uncertain regions (Figure 3(b)), the gradients were zero and uninformative. To overcome this issue, we initialized the classifiers with the (already converged) most accurate classifier.
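The plateau problem can be reproduced with a small finite-difference check. This sketch uses an illustrative team-loss with hypothetical parameter values (tau = 0.8, h = 0.85, M = 1, e = 0.05); it demonstrates the phenomenon, not the paper's exact setup.

```python
import numpy as np

# Finite-difference illustration of the plateau: for an uncertain
# prediction (confidence below the threshold tau), team-loss is constant,
# so the gradient w.r.t. the predicted probability is zero.

def team_loss_point(p, tau=0.8, h=0.85, M=1.0, e=0.05, eps=1e-6):
    conf = max(p, 1.0 - p)                        # confidence of predicted label
    eu = p * (1 + M) - M if conf >= tau else h * (1 + M) - M - e
    return -np.log(eu + M + e + eps)              # shifted, negated log-utility

def grad(p, dp=1e-4):
    return (team_loss_point(p + dp) - team_loss_point(p - dp)) / (2 * dp)
```

An uncertain prediction (p near 0.5) sits on the plateau and receives no gradient signal, which is why warm-starting from the log-loss solution helps.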
|Model|Dataset|Acc LL|Util LL|Acc|Util|
Note: LL indicates log-loss
RQ1: Experiments showed that when we used team-loss, improvements in team utility over log-loss were consistently observed, though their magnitude varied across the datasets (Table 3). We observed that team-loss often sacrifices classifier accuracy to improve team utility, the more desirable metric. For the linear classifier, this sacrifice is especially large on the synthetic datasets: Scenario1 (16%) and Moons (1%). For the MLP, team-loss sacrifices 2% accuracy to improve team utility.[2]

[2] We report absolute improvements instead of percentage improvements in utility because utility can be negative.
While the metrics in Table 3 (change in accuracy and utility) provide a global understanding of the effect of team-loss, they do not help understand how team-loss achieved improvements and whether the behavior of the new models is consistent with intuition. Figure 5 visualizes the difference in behavior (averaged over 150 seeds) between the classifiers produced by log-loss and team-loss on the Scenario1 dataset. Specifically, as shown in Figure 5, we visualize and compare their:
Calibration, using reliability curves, which compare system confidence with its true accuracy. A perfectly calibrated system, for example, will be 80% accurate on regions where it is 80% confident. In practice, however, systems may be over- or under-confident.
Distributions of confidence in predictions. For example, in Figure 5, team-loss makes more high-confidence predictions than log-loss.
Fraction of total system accuracy contributed by different regions (of confidence values). Thus, the area under this curve indicates the system’s total accuracy. Note that for our setup the area under the curve in the Accept region is more crucial than the area in the Solve region since in the latter the human is expected to take over.
Analogous to the accuracy curve above, the fourth sub-graph shows the fraction of total system utility contributed by different regions of confidence.
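The first two of these views can be computed with straightforward binning. This is a sketch under our own bin choices (equal-width confidence bins over [0.5, 1] for a binary task), not the paper's plotting code.

```python
import numpy as np

# Sketch of a reliability curve for a binary classifier: bin predictions
# by confidence and compare each bin's mean confidence to its empirical
# accuracy; the per-bin fraction gives the confidence distribution.

def reliability_curve(conf, correct, n_bins=5):
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    mean_conf, accuracy, fraction = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                       # skip empty bins
            mean_conf.append(float(conf[mask].mean()))
            accuracy.append(float(correct[mask].mean()))
            fraction.append(float(mask.mean()))
    return mean_conf, accuracy, fraction
```

Plotting mean_conf against accuracy gives the reliability curve; a perfectly calibrated system lies on the diagonal.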
If team-loss had not resulted in different predictions than log-loss, the curves in Figure 5 for the two loss functions would have been indistinguishable. However, we observed that team-loss results in dramatically different predictions than log-loss. In fact, we noticed two types of behaviors when team-loss improved utility.
The first type of behavior was observed on the Scenario1 dataset (Figure 5) and is easier to understand as it matches the intuition we set out in the beginning: the classifier trained with team-loss sacrifices accuracy on the uncertain examples in the Solve region to make more high-confidence predictions in the Accept region. This change improves system accuracy in the Accept region, which is where system accuracy matters and contributes to team utility. Later, we show that this same behavior is observed on the German dataset (Figure 6).
The second type of behavior was observed on Moons (Figure 5), where the new loss increases accuracy in the Accept region at the cost of making fewer very high-confidence predictions (e.g., when confidence is greater than 0.95, in the region marked Y). This change improves utility because the system's accuracy in the Accept region matters more than making very high-confidence predictions.
In both these behaviors, team-loss effectively increases the contribution to AI accuracy from the Accept region, i.e., the region where the AI's performance provides value to the team. In contrast, log-loss has no such considerations. Figure 6 shows a similar analysis on the real datasets for both the linear and MLP classifiers. When team-loss improves utility, we see one of the two behaviors described above.
RQ2: Since the penalty of mistakes may be task-dependent (e.g., an incorrect diagnosis may be costlier than an incorrect loan approval), we varied the mistake penalty M to study its effect on the improvements from team-loss. Our experiments (Figure 7) showed that the difference in utilities depends on the cost of a mistake, and the highest difference is observed at a different value of M for each dataset. We also observed that, for our setup, as the mistake penalty increases, log-loss may achieve lower performance than the human-only baseline, making deploying automation undesirable in these cases. For example, on Fico, the linear model learned using log-loss achieves lower performance than the human baseline; similarly, on Mimic, when the MLP is learned using log-loss, deploying the AI is undesirable.
|Model|Dataset|Acc LL|Util LL|Acc|Util|
RQ3: Since the gains from using team-loss were small and varied across datasets, we conducted experiments to investigate properties of the dataset that may have affected these improvements. While there are many properties of a dataset one could investigate, we studied the following:
Data distribution In the Moons dataset, we observed that the linear model trained with team-loss increased utility by increasing confidence on the examples on the outer edges of the moons, enough to move these examples into the Accept region. So, to test whether a different data distribution would benefit from using team-loss, we created additional versions of Moons by systematically moving points from the middle of the circle towards its edges. Figure 9 shows the improvements in utility as we moved more data.
Class imbalance While most of our datasets were balanced, German and Mimic had a lower percentage of positive instances (see Table 2). We conducted experiments on balanced versions of these two datasets to understand whether class imbalance affected our observations in the previous experiments. Table 4 shows the performance after we over-sampled the positive class to adjust for class imbalance in the two datasets. We observed that in both cases correcting the class imbalance increased the improvement from using team-loss.
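The balancing step can be sketched as simple over-sampling with replacement; the array names and seed are our choices, and the sketch assumes the positive class is the minority.

```python
import numpy as np

# Sketch of correcting class imbalance by over-sampling the positive
# class (label 1) with replacement until both classes are equal in size.

def oversample_positive(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)                 # avoid class-sorted batches
    return X[idx], y[idx]
```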
Dimensionality We also focused on the dimensionality of the datasets. Since team-loss may be harder to optimize than log-loss, an increase in data dimensions may affect the ability of the optimizer (in this case SGD) to optimize the team-loss objective. We also experimented with Adam as the optimizer; unfortunately, it did not provide any benefit. For a given dataset, we varied its dimensionality by using only a subset of features. However, we did not notice any correlation between dimensionality and improvements in utility.
We conjecture two reasons for the small gains in utility on real datasets using team-loss: either there is no scope for improving utility on those dataset and model pairs, or our current optimization procedures are ineffective. Since we do not know the optimal (utility) solution for a given dataset and model, we cannot verify or reject the first conjecture. However, the results on the two synthetic datasets suggest the existence of situations where there is a significant gap between the utilities achieved using team-loss and log-loss.
It is also possible that our current optimization procedures are ineffective for optimizing team-loss. One reason this might happen is that team-loss is more complex than log-loss: it introduces new plateaus in the loss surface and thus may increase the chances of optimization methods such as stochastic gradient descent getting stuck in local minima. In fact, in our experiments, we observed that on the datasets where team-loss did not increase utility, it resulted in predictions identical to log-loss. This may, for example, happen if the most accurate classifier is a local minimum. Since we use the most accurate classifier to initialize the optimization of team-loss, this suggests that further optimization with the new loss did not manage to escape this potential local minimum.
While we propose a solution for simplified human-AI teamwork (see assumptions in Section 2), our observations have implications for human-AI teams in general. If we cannot optimize utility in our simplified case, it may be harder to optimize utility in scenarios where users make Accept and Solve decisions using a richer, more complex mental model, rather than relying on model confidence alone. Such scenarios are common when system confidence is an unreliable indicator of performance (e.g., due to poor calibration), and, as a result, the user develops an understanding of system failures in terms of domain features. For example, Tesla drivers often override the Autopilot based on features such as road and weather conditions. We can reduce this case, where users have a complex mental model, to the one we studied: specifically, we can construct a loss function that is constant when a prediction belongs to the Solve region described by the user's mental model, and log-loss otherwise. This case may be harder to optimize because the resulting loss surface will contain more complex combinations of plateaus and local optima than the one we considered.
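The reduction just described can be sketched directly. Here in_solve_region is a hypothetical boolean predicate standing in for the user's feature-based mental model, and the constant assigned in the Solve region is arbitrary; both are our assumptions for illustration.

```python
import numpy as np

# Sketch of the reduction: a loss that is constant on instances the
# user's mental model routes to Solve, and standard log-loss otherwise.

def mental_model_loss(p_true, in_solve_region, solve_const=1.0):
    p_true = np.asarray(p_true, float)
    in_solve_region = np.asarray(in_solve_region, bool)
    logloss = -np.log(np.clip(p_true, 1e-12, 1.0))   # log-loss on true label
    return float(np.mean(np.where(in_solve_region, solve_const, logloss)))
```

As with team-loss, the plateau is explicit: any two predictions inside the Solve region incur the same loss, so gradients vanish there.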
5 Related Work
Our approach is closely related to maximum-margin classifiers, such as an SVM optimized with the hinge loss [burges1998tutorial], where a larger soft margin can be used to make high-confidence and accurate predictions. However, unlike our approach, it is not possible to directly plug the domain's payoff matrix (e.g., in Figure 3) into such a model. Furthermore, the SVM's output and margin do not have an immediate probabilistic interpretation, which is crucial for our problem setting. One possible (though computationally intensive) solution direction is to convert margins into probabilities, e.g., using post-hoc calibration such as Platt scaling [platt-99], and to use cross-validation to select margin parameters that optimize team utility. While it is still an open question whether such an approach would be effective for SVM classifiers, in this work we focused our attention on gradient-based optimization.
Another related problem is cost-sensitive learning, where different mistakes incur different penalties; for example, false negatives may be costlier than false positives [zadrozny2003cost]. A common solution here is up-weighting the inputs where mistakes are costlier. Also relevant is work on importance-based learning, where re-weighting helps learn from imbalanced data or speeds up training. However, in our setup, re-weighting the inputs makes less sense: the weights would depend on the classifier's output, which has not been trained yet. An iterative approach may be possible, but our initial analysis showed this approach is prone to oscillations, where the classifier may never converge. We leave exploring this avenue for future work.
A fundamental line of work that renders AI predictions more actionable (for humans) and better suited for teaming is confidence calibration, for example, using Bayesian models [ghahramani2015probabilistic, beach1975expert, gal_dropout_2016] or via post-hoc calibration [platt-99, zadrozny2001obtaining, guo2017calibration, niculescu2005predicting]. A key difference between these methods and our approach is that team-loss re-trains the model to improve on inputs on which users are more likely to rely on the AI predictions. The same contrast distinguishes our approach from outlier detection techniques [hendrycks2018deep, lee2017training, hodge2004survey].
More recent work that adjusts model behavior to accommodate collaboration is backward compatibility for AI [bansal2019updates], where the model considers user interactions with a previous version of the system to preserve trust across updates. Recent user studies showed that when users develop mental models of a system's mistakes, properties other than accuracy are also desirable for successful collaboration, for example, parsimonious and deterministic error boundaries [bansal2019beyond]. Our approach is a first step towards implementing these desiderata within machine-learning optimization itself. Other approaches to human-centered optimization regularize or constrain model optimization for other human-centered requirements, such as interpretability [wu2019regional, wu2018beyond] or fairness [jung2019eliciting, zafar2015fairness].
We studied the problem of training classifiers that optimize team performance, a metric that matters more for collaboration than automation accuracy alone. To support direct optimization of team performance, we devised a new loss function, team-loss, with a formulation based on the expected utility of the human-AI team for decision making. Thorough investigation and visualization of classifier behavior before and after optimizing with team-loss show that, when such optimization is effective, team-loss can fundamentally change model behavior and improve team utility. Changes in model behavior include either (i) sacrificing model accuracy in low-confidence regions for more accurate high-confidence predictions, or (ii) increasing accuracy in the Accept region through more accurate predictions but fewer highly confident ones. Such behaviors were observed on synthetic and real-world datasets from domains where AI is known to be employed to support human decision makers. However, we also report that current optimization techniques were not always effective; in fact, sometimes they did not change model behavior at all, i.e., models remained identical even after fine-tuning with team-loss. Since team-loss clearly poses optimization challenges, mostly related to its flat curvature and potential local minima in the Solve region, we invite future work on machine-learning optimization and human-AI collaboration to jointly approach these challenges at the intersection of both fields.
7.1 Extension: Users May Not be Rational
In Section 2 we assumed that the user acts rationally while making the meta-decision. We now relax this assumption and assume that, with a small probability epsilon, the user (uniformly) randomly chooses between Accept and Solve.[3] Then, extending Equation 4, the user will Accept the system recommendation with probability:

P(Accept | c) = (1 - epsilon) 1[c >= tau] + epsilon / 2

[3] Note that epsilon can be conditioned on the confidence and threshold.
In the above equation, when the model is confident, the Accept probability decreases by epsilon/2 because the user may decide to Solve. Similarly, when the model is not confident, the increase of epsilon/2 (compared to Equation 4) reflects that the user may randomly decide to Accept an uncertain recommendation.
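The noisy user's Accept probability can be written directly; tau and eps are our placeholder names for the confidence threshold and the deviation probability.

```python
# Sketch of the noisy user's Accept probability: with probability
# (1 - eps) the user follows the rational threshold policy, and with
# probability eps chooses uniformly between Accept and Solve.

def p_accept(conf: float, tau: float, eps: float) -> float:
    rational = 1.0 if conf >= tau else 0.0
    return (1.0 - eps) * rational + eps / 2.0
```

Setting eps = 0 recovers the rational user, whose Accept probability is exactly 0 or 1.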
To simplify deriving the new equation for the expected utility, we re-write Equation 6 using EU_A(x), the expected utility when the user Accepts, and EU_S, the (constant) expected utility when the user Solves. Using the above two equations, we obtain the following equation for the expected utility when the user is not perfectly rational:

E[U] = E_x [ 1[c >= tau] ((1 - epsilon/2) EU_A(x) + (epsilon/2) EU_S) + 1[c < tau] ((1 - epsilon/2) EU_S + (epsilon/2) EU_A(x)) ]
The above equation shows that, when the system is confident, instead of always obtaining the Accept utility as in Equation 10, with a small probability the user may obtain the expected utility associated with a Solve action. Similarly, when the system is uncertain, the user may sometimes obtain the expected utility associated with an Accept action. Qualitatively, this results in a worse best-case expected utility, an artifact of the user making a sub-optimal decision (to Solve) when automation would result in the highest utility. Similarly, the expected utility in the Solve region will also decrease, since the user may Accept uncertain recommendations. On the other hand, this improves the worst-case utility: the new user will avoid some high-confidence mistakes that a rational user would not. However, unlike the expected utility for a rational user, the new expected utility is strictly monotonic in the system confidence: the expected Accept utility is a linear function of confidence and hence strictly monotonic, and the sum of a strictly monotonic function and a constant function is strictly monotonic.