1 Introduction
In a wide range of critical applications, societies rely on the judgement of human experts to take consequential decisions. Unfortunately, the timeliness and quality of these decisions are often compromised by the sheer number of decisions to be taken and a shortage of human experts. For example, in certain medical specialties, patients in most countries need to wait for months to be diagnosed by a specialist. In content moderation, online publishers often stop hosting comments sections because their staff is unable to moderate the myriad comments they receive. In software development, bugs may sometimes be overlooked by developers who spend long hours on code reviews for large software projects.
In this context, there is widespread discussion about letting machine learning models take decisions in these high-stakes tasks, where they have matched, or even surpassed, the average performance of human experts [topol2019high, cheng2015antisocial, pradel2018deepbugs]. Currently, these models are mostly trained for full automation, that is, under the assumption that they will take all the decisions. However, on some instances their decisions are still worse than those of human experts, and on these instances they make far more errors than average [raghu2019algorithmic]. Motivated by this observation, our goal is to develop machine learning models that are optimized to operate under different automation levels. In other words, these models are optimized to take decisions for a given fraction of the instances and leave the remaining ones to humans.
More specifically, we focus on ridge regression and introduce a novel problem formulation that allows for different automation levels. Based on this formulation, we make the following contributions:

We show that the problem is NP-hard. This is due to its combinatorial nature: for each potential meta-decision about which instances the machine will decide upon, there is an optimal set of parameters for the regression model; however, the meta-decision itself is also something we seek to optimize.

We derive an alternative representation of the objective function as a difference of nondecreasing submodular functions. This representation enables us to use a recent iterative algorithm [iyer2012algorithms] to solve the problem; however, this algorithm does not enjoy approximation guarantees.

Building on the above representation, we further show that the objective function is nondecreasing and satisfies α-submodularity [gatmiry2019], a recently introduced notion of approximate submodularity. These properties allow a simple and efficient greedy algorithm (refer to Algorithm 1) to enjoy approximation guarantees.
Here, we would like to acknowledge that our contributions are just a first step towards designing machine learning models that are optimized to operate under different automation levels. It would be very interesting to extend our work to more sophisticated machine learning models and other machine learning tasks (e.g., classification).
Finally, we experiment with synthetic and real-world data from two important applications, medical diagnosis and content moderation. Our results show that the greedy algorithm beats several competitive algorithms, including the iterative algorithm for maximizing a difference of submodular functions mentioned above, and is able to identify and outsource to humans those samples where human expertise is required.
Related work. The work most closely related to ours is by Raghu et al. [raghu2019algorithmic]
, in which a classifier can outsource samples to humans. However, in contrast to our work, their classifier is trained to predict the labels of all samples in the training set, as in full automation, and the proposed algorithm does not enjoy theoretical guarantees. As a result, a natural extension of their algorithm to ridge regression achieves a significantly lower performance than ours, as shown in Figure
4. There is a rapidly increasing line of work devoted to designing classifiers that are able to defer decisions [bartlett2008classification, cortes2016learning, geifman2018bias, geifman2019selectivenet, raghu2019direct, ramaswamy2018consistent, thulasidasan2019combating]. Here, the classifiers learn to defer either by considering the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, there are two fundamental differences between this line of work and ours. First, it does not consider a human decision maker, with an associated error model, who takes a decision whenever the classifiers defer. Second, the classifiers are trained to predict the labels of all samples in the training set, as in full automation.
Finally, our work also relates to robust linear regression [bhatia2017consistent, suggala2019adaptive, tsakonas2014convergence, wright2010dense] and robust logistic regression [feng2014robust], where the (implicit) assumption is that a constant fraction of the output variables are corrupted by unbounded noise. There, the goal is to find a consistent estimator of the model parameters that ignores the samples whose output variables are noisy. In contrast, in our work, we do not assume any noise model for the output variables but rather a human error per sample, and our goal is to find an estimator of the model parameters that outsources some of the samples to humans.
2 Problem Formulation
In this section, we formally state the problem of ridge regression under human assistance, where some of the predictions can be outsourced to humans.
Given a set of training samples and a human error per sample, we can outsource a subset of the training samples to humans, subject to a constraint on the size of this subset. Then, ridge regression under human assistance seeks to minimize the overall training error, including that on the outsourced samples, i.e.,
(1) 
with
(2) 
where the first term accounts for the human error, the second term accounts for the machine error, and is a given regularization parameter for the machine.
Moreover, if we define and , we can rewrite the above objective function as
where is the subvector of indexed by and is the submatrix formed by columns of that are indexed by . Then, whenever , it readily follows that the optimal parameter is given by
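As a concrete illustration, the closed-form ridge solution for the samples the machine keeps can be computed directly. The following is a minimal numpy sketch; the function and variable names, and the exact scaling of the regularization term, are our own assumptions rather than the paper's notation:

```python
import numpy as np

def ridge_params(X, y, S, lam):
    """Closed-form ridge solution fit only on the samples kept by the machine.

    X: (n, d) feature matrix, y: (n,) responses, S: set of outsourced
    sample indices, lam: regularization strength. Solves the normal
    equations w* = (X_m^T X_m + lam I)^{-1} X_m^T y_m, where the subscript
    m denotes the samples not in S.
    """
    keep = np.array(sorted(set(range(len(y))) - set(S)))
    Xm, ym = X[keep], y[keep]
    d = X.shape[1]
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(d), Xm.T @ ym)
```

The solution can be sanity-checked by verifying that the gradient of the regularized squared error vanishes at the returned parameters.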
If we plug the above equation into Eq. 1, we can rewrite the ridge regression problem under human assistance as a set function maximization problem, i.e.,
(3) 
where
Unfortunately, due to its combinatorial nature, the above problem formulation is difficult to solve, as formalized by the following theorem: The problem of ridge regression under human assistance defined in Eq. 1 is NP-hard. To prove this, consider a particular instance of the problem with for all and
. Moreover, assume the response variables
are generated as follows:(4) 
where is a
sparse vector which takes nonzero values on at most
corrupted samples, and zero elsewhere. Then, the problem can be viewed as a robust least squares regression (RLSR) problem [studer2011recovery], i.e.,(5) 
which has been shown to be NP-hard [bhatia2017consistent]. This concludes the proof.
However, in the next section, we will show that, perhaps surprisingly, a simple greedy algorithm enjoys approximation guarantees. In the remainder of the paper, to ease the notation, we will use .
Remarks. Once the model is trained, given a new unlabeled sample, we outsource it to a human if its nearest neighbor in the training set belongs to the solution to the above maximization problem, and pass it on to the machine otherwise.
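This test-time routing rule can be sketched as follows. This is a hypothetical helper of our own, not the paper's code; Euclidean nearest-neighbor distance and the function name are assumptions:

```python
import numpy as np

def route_sample(x_new, X_train, S, w):
    """Route a test sample: defer to a human if the nearest training
    neighbour was outsourced (is in S), otherwise return the machine's
    linear prediction under parameters w."""
    nn = int(np.argmin(np.linalg.norm(X_train - x_new, axis=1)))
    if nn in S:
        return ("human", None)
    return ("machine", float(x_new @ w))
```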
3 An Algorithm With Approximation Guarantees
In this section, we first show that the objective function in Eq. 3 can be represented as a difference of nondecreasing submodular functions. Then, we build on this representation to show that the objective function is nondecreasing and satisfies α-submodularity [gatmiry2019], a recently introduced notion of approximate submodularity. Finally, we present an efficient greedy algorithm that, due to the α-submodularity of the objective function, enjoys approximation guarantees.
Difference of submodular functions. We start by rewriting the objective function using the following Lemma, which states a well-known property of the Schur complement of a block matrix: Let . If is invertible, then . More specifically, consider , and in the above lemma. Then, for , it readily follows that:
(6) 
where
In the above, note that, for , the functions and are not defined. As will become clear later, for , it will be useful to define their values as follows:
where note that these values also satisfy Eq. 6. Next, we show that, under mild technical conditions, the above functions are nonincreasing and satisfy a natural diminishing returns property called submodularity (a set function is submodular iff it satisfies that for all and , where is the ground set).
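The Schur complement determinant identity behind Eq. 6, namely that the determinant of the block matrix equals det(A) times det(D - C A^{-1} B) when A is invertible, can be verified numerically. A minimal sketch with random, well-conditioned blocks (the shapes and the diagonal shift are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a block matrix M = [[A, B], [C, D]] with an invertible A block.
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # diagonal shift keeps A well-conditioned
B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3))
D = rng.normal(size=(2, 2)) + 3 * np.eye(2)

M = np.block([[A, B], [C, D]])
schur = D - C @ np.linalg.solve(A, B)          # Schur complement of A in M
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(schur)
assert np.isclose(lhs, rhs)                    # det(M) = det(A) det(D - C A^{-1} B)
```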
Assume and with . Then, and are nonincreasing and submodular. We start by showing that is submodular, i.e., for all and . First, define
and observe that
Then, it follows from Proposition A.4 (refer to Appendix A.4) that
Hence, we have a Cholesky decomposition
Similarly, we have that
(7) 
Now, for , we have
where equality (a) follows from Sylvester’s determinant theorem [shamaiah2010greedy]. Moreover, from Eq. 7, it follows that
Therefore, and hence . In addition, since , using Lemma 3, we have that
which in turn indicates that . Hence, due to Proposition A.4 (refer to Appendix A.4), we have .
Finally, for , we have that
(8) 
where the first inequality follows from the proof of submodularity for and the second inequality comes from the definition of for . This concludes the proof of submodularity of .
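Sylvester's determinant theorem, used in step (a) of the proof above, states that det(I_m + AB) = det(I_n + BA) for any A of shape m×n and B of shape n×m. It can be checked numerically; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(2, 4))

# Sylvester's determinant theorem: det(I_4 + A B) = det(I_2 + B A),
# even though the two identity matrices have different sizes.
lhs = np.linalg.det(np.eye(4) + A @ B)
rhs = np.linalg.det(np.eye(2) + B @ A)
assert np.isclose(lhs, rhs)
```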
Next, we show that is nonincreasing. First, recall that, for , we have that
(9) 
Then, note that and . Hence, using Proposition A.4 (refer to Appendix A.4), it follows that , which proves is nonincreasing for . Finally, for , it readily follows from Eq. 8 that
(10) 
Now since we have proved that is nonincreasing for . This concludes the proof of monotonicity of .
Proceeding similarly, it can be proven that is also nonincreasing and submodular. We would like to highlight that the above technical conditions have a natural interpretation: the first condition is satisfied if the human error is not greater than a fraction of the true response variable, and the second condition is satisfied if the regularization parameter is not too small.
In our experiments, the above result will enable us to use a recent heuristic iterative algorithm for maximizing the difference of submodular functions [iyer2012algorithms] as a baseline. However, this algorithm does not enjoy approximation guarantees; it only guarantees to monotonically reduce the objective function at every step.

Monotonicity. We start by analyzing the monotonicity of whenever , for any , in the following Lemma (proven in Appendix A.1): Assume and with . Then, it holds that for all . Building on the above lemma, we have the following Theorem, which shows that is a strictly nonincreasing function (proven in Appendix A.2): Assume and with . Then, the function is strictly nonincreasing, i.e.,
for all and .
Finally, note that the above result does not imply that the human error is always smaller than the machine error , where is the optimal parameter for , as formalized by the following Proposition (proven in Appendix A.3): Assume and with and . Then, it holds that
α-submodularity. Given the above results, we are now ready to present and prove our main result, which characterizes the objective function of the optimization problem defined in Eq. 3:
Assume , with , and (note that we can always rescale the data to satisfy this last condition). Then, the function is a nondecreasing α-submodular function (a function is α-submodular [gatmiry2019] iff it satisfies that for all and , where is the ground set) and the parameter satisfies that
(11) 
with
Using that and the function is nonincreasing, we can conclude that . Then, it readily follows from the proof of Theorem 3 that
(12) 
Hence we have,
(13) 
where the inequality follows from Proposition A.4 (refer to Appendix A.4) and the equality follows from Theorem 3, which implies that . Then, we have that
Next, we bound the first term as follows:
where inequality (a) follows from Eq. 13, inequality (b) follows from the monotonicity of , and inequalities (c) and (d) follow from Theorem 3. Finally, we use the monotonicity of and Eq. 13 to bound the second term as follows:
This concludes the proof.
A greedy algorithm. The greedy algorithm proceeds iteratively and, at each step, it assigns to the humans the sample that provides the highest marginal gain among the set of samples which are currently assigned to the machine. Algorithm 1 summarizes the greedy algorithm.
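The steps above can be sketched as follows. This is our own illustrative implementation, not the paper's code: it recomputes the closed-form ridge solution from scratch for every candidate, whereas an efficient implementation would maintain the matrix inverse incrementally; names and the regularization convention are assumptions:

```python
import numpy as np

def overall_error(X, y, c, S, lam):
    """Training objective in the spirit of Eq. 1: human error on the
    outsourced set S plus regularized ridge error of the machine on the
    remaining samples."""
    keep = sorted(set(range(len(y))) - set(S))
    Xm, ym = X[keep], y[keep]
    w = np.linalg.solve(Xm.T @ Xm + lam * np.eye(X.shape[1]), Xm.T @ ym)
    machine = np.sum((Xm @ w - ym) ** 2) + lam * np.sum(w ** 2)
    return sum(c[i] for i in S) + machine

def greedy_outsource(X, y, c, n_out, lam=0.1):
    """At each step, outsource the sample with the highest marginal gain
    (largest drop in overall error) among those currently at the machine."""
    S = set()
    for _ in range(n_out):
        best = min((i for i in range(len(y)) if i not in S),
                   key=lambda i: overall_error(X, y, c, S | {i}, lam))
        S.add(best)
    return S
```

For instance, on data with one heavily corrupted response and zero human error, the greedy algorithm immediately outsources the corrupted sample, since removing it from the machine's fit yields by far the largest drop in the objective.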
Since the objective function in Eq. 3 is α-submodular, it readily follows from Theorem 9 in Gatmiry and Gomez-Rodriguez [gatmiry2019] that the above greedy algorithm enjoys an approximation guarantee. More specifically, we have the following Theorem: The greedy algorithm returns a set such that , where is the optimal value and with defined in Eq. 11.
In the next section, we will demonstrate that, in addition to enjoying the above approximation guarantees, the above greedy algorithm performs better in practice than several competitive alternatives.
4 Experiments on Synthetic Data
In this section, we experiment with a variety of synthetic examples. First, we look into the solution provided by the greedy algorithm. Then, we compare the performance of the greedy algorithm with several competitive baselines. Finally, we investigate how the performance of the greedy algorithm varies with respect to the amount of human error.
Experimental setup. For each sample, we first generate each dimension of the feature vector uniformly at random and then sample the response variable from either (i) a Gaussian distribution or (ii) a logistic distribution. Moreover, we sample the associated human error from a Gaussian distribution. In each experiment, we use training samples and we compare the performance of the greedy algorithm with three competitive baselines:
— An iterative heuristic algorithm (DS) for maximizing the difference of submodular functions by iyer2012algorithms.
— A greedy algorithm (Distorted greedy) for maximizing weakly submodular functions by harshaw2019submodular (note that any α-submodular function is weakly submodular [gatmiry2019]).
— A natural extension of the algorithm (Pre-training) by raghu2019algorithmic, originally developed for classification under human assistance.
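The synthetic setup above can be sketched as follows. Since the exact distribution parameters are elided in the text, all constants below (feature range, noise scales, the logistic shape) are illustrative assumptions of ours, not the paper's values:

```python
import numpy as np

def make_synthetic(n=100, d=2, response="gaussian", sigma_h=0.1, seed=0):
    """Synthetic data in the spirit of Section 4: uniform features, a
    Gaussian or logistic-shaped response, and Gaussian-noise human
    predictions whose squared deviation gives the per-sample human error."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    w = rng.normal(size=d)
    if response == "gaussian":
        y = X @ w + rng.normal(scale=0.05, size=n)
    else:  # logistic-shaped relationship between features and response
        y = 1.0 / (1.0 + np.exp(-(X @ w)))
    h = y + rng.normal(scale=sigma_h, size=n)  # simulated human predictions
    c = (h - y) ** 2                           # per-sample human error
    return X, y, c
```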
Results. We first look into the solution provided by the greedy algorithm, both for the Gaussian and logistic distributions and for different numbers of outsourced samples. Figure 1 summarizes the results, which reveal several interesting insights. For the logistic distribution, as the number of outsourced samples increases, the greedy algorithm lets the machine focus on the samples where the relationship between the features and the response variable is more linear and outsources the remaining points to humans. For the Gaussian distribution, as this number increases, the greedy algorithm outsources the samples on the tails of the distribution to humans.
Then, we compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the three competitive baselines. Figure 2 summarizes the results, which show that the greedy algorithm consistently outperforms the baselines across the entire range of automation levels.
Finally, we investigate how the performance of our greedy algorithm varies with respect to the amount of human error. Figure 3 summarizes the results, which show that, for low levels of human error, the overall mean squared error decreases monotonically with the number of samples outsourced to humans. In contrast, for high levels of human error, it is not beneficial to outsource samples.
5 Experiments on Real Data
In this section, we experiment with four real-world datasets from two important applications, medical diagnosis and content moderation, and show that the greedy algorithm beats several competitive baselines. Moreover, we also look at the samples that the greedy algorithm outsources to humans and show that, for different distributions of human error, the outsourced samples are those that humans are able to predict more accurately.
Experimental setup. We experiment with one dataset for content moderation and three datasets for medical diagnosis, which are publicly available [hateoffensive, decenciere_feedback_2014, hoover2000locating]. More specifically:

Hatespeech: It consists of
tweets containing words, phrases and lexicons used in hate speech. Each tweet is given several scores by three to five annotators from Crowdflower, which measure the severity of hate speech.

StareH: It consists of retinal images. Each image is given a score by a single expert, on a five-point scale, which measures the severity of a retinal hemorrhage.

StareD: It contains the same set of images as StareH. However, in this dataset, each image is given a score by a single expert, on a six-point scale, which measures the severity of the Drusen disease.

Messidor: It contains eye images. Each image is given a score by a single expert, on a three-point scale, which measures the severity of an edema.
We first generate a 100-dimensional feature vector using fastText [ft] for each sample in the Hatespeech dataset and a 1000-dimensional feature vector using ResNet [resnet] for each sample in the StareH, StareD, and Messidor datasets. Then, we use the top 50 features, as identified by PCA, in our experiments. For the image datasets, the response variable is just the available score by a single expert and the human predictions are sampled from a categorical distribution whose parameters are the probabilities of each potential score value for a sample with the given features. For the Hatespeech dataset, the response variable is the mean of the scores provided by the annotators and the human predictions are picked uniformly at random from the available individual scores given by each annotator. In all datasets, we compute the human error as the squared difference between the human prediction and the response variable for each sample. In each experiment, we use 80% of the samples for training and 20% for testing.

Results. We first compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the same competitive baselines used in the experiments on synthetic data, i.e., DS [iyer2012algorithms], distorted greedy [harshaw2019submodular], and pre-training [raghu2019algorithmic]. Figure 4 summarizes the results, which show that the greedy algorithm consistently outperforms the baselines across (almost) the entire range of automation levels in all datasets. The notable exception is the Messidor dataset under high automation levels, where the pre-training baseline is the best performer.
Next, we look at the samples that the greedy algorithm outsources to humans and those that it leaves to the machine. Intuitively, human assistance should be required for those samples which are difficult for a machine, but easy for a human, to decide about. Figure 5 provides an illustrative example of an easy and a difficult sample image. While both sample images are given a severity score of zero for the Drusen disease, one of them contains yellow spots, which are often a sign of Drusen disease (in this particular case, the patient suffered from diabetic retinopathy, which is also characterized by yellow spots), and is therefore difficult to predict. In this particular case, the greedy algorithm outsourced the difficult sample to humans and let the machine decide about the easy one. Does this intuitive assignment happen consistently? To answer this question, we run our greedy algorithm on the StareH and StareD datasets under different distributions of human error and assess to what extent the greedy algorithm outsources to humans those samples they can predict more accurately.
More specifically, we sample the human predictions from a nonuniform categorical distribution under which the human error is low for a fraction of the samples and high for the remaining fraction. Figure 6 shows the performance of the greedy algorithm for different values of this fraction. We observe that, as long as there are samples that humans can predict with low error, the greedy algorithm outsources them to humans and thus the overall performance improves. However, whenever the fraction of outsourced samples is higher than the fraction of samples with low human error, the performance degrades. This results in a characteristic U-shaped curve.
6 Conclusions
In this paper, we have studied the problem of developing machine learning models that are optimized to operate under different automation levels. We have focused on ridge regression and shown that a simple and efficient greedy algorithm is able to find a solution with nontrivial approximation guarantees. Moreover, using synthetic and real-world data, we have shown that this greedy algorithm beats several competitive baselines. Our work also opens many interesting avenues for future work. For example, it would be worthwhile to develop other, more sophisticated machine learning models, both for regression and classification, that operate under different automation levels. It would also be very interesting to find tighter lower bounds on the parameter , which better characterize the good empirical performance we observe. Finally, in our work, we have assumed that we can measure the human error for every training sample. It would be interesting to tackle the problem under uncertainty about the human error.
References
Appendix A Proofs
a.1 Proof of Lemma 3
By definition, we have that
Moreover, note that it is enough to show that , without the logarithms, to prove the result. Then, we have that
where equality follows from Lemma A.4 and inequality follows from the lower bound on .