
Regression Under Human Assistance

Decisions are increasingly taken by both humans and machine learning models. However, machine learning models are currently trained for full automation: they are not aware that some of the decisions may still be taken by humans. In this paper, we take a first step towards making machine learning models aware of the presence of human decision-makers. More specifically, we first introduce the problem of ridge regression under human assistance and show that it is NP-hard. Then, we derive an alternative representation of the corresponding objective function as a difference of nondecreasing submodular functions. Building on this representation, we further show that the objective is nondecreasing and satisfies ξ-submodularity, a recently introduced notion of approximate submodularity. These properties allow a simple and efficient greedy algorithm to enjoy approximation guarantees for solving the problem. Experiments on synthetic and real-world data from two important applications, medical diagnosis and content moderation, demonstrate that the greedy algorithm beats several competitive baselines.


1 Introduction

In a wide range of critical applications, societies rely on the judgement of human experts to take consequential decisions. Unfortunately, the timeliness and quality of these decisions are often compromised by the large number of decisions to be taken and a shortage of human experts. For example, in certain medical specialties, patients in most countries need to wait for months to be diagnosed by a specialist. In content moderation, online publishers often stop hosting comments sections because their staff is unable to moderate the myriad of comments they receive. In software development, bugs may sometimes be overlooked by developers who spend long hours on code reviews for large software projects.

In this context, there is a widespread discussion on the possibility of letting machine learning models take decisions in these high-stakes tasks, where they have matched, or even surpassed, the average performance of human experts [topol2019high, cheng2015antisocial, pradel2018deepbugs]. Currently, these models are mostly trained for full automation: they assume they will take all the decisions. However, on some instances their decisions are still worse than those of human experts, and on such instances they make far more errors than average [raghu2019algorithmic]. Motivated by this observation, our goal is to develop machine learning models that are optimized to operate under different automation levels. In other words, these models are optimized to take decisions for a given fraction of the instances and leave the remaining ones to humans.

More specifically, we focus on ridge regression and introduce a novel problem formulation that allows for different automation levels. Based on this formulation, we make the following contributions:

  • We show that the problem is NP-hard. This is due to its combinatorial nature: for each potential meta-decision about which instances the machine will decide upon, there is an optimal set of parameters for the regression model; however, the meta-decision itself is also something we seek to optimize.

  • We derive an alternative representation of the objective function as a difference of nondecreasing submodular functions. This representation enables us to use a recent iterative algorithm [iyer2012algorithms] to solve the problem; however, this algorithm does not enjoy approximation guarantees.

  • Building on the above representation, we further show that the objective function is nondecreasing and satisfies ξ-submodularity, a recently introduced notion of approximate submodularity [gatmiry2019]. These properties allow a simple and efficient greedy algorithm (refer to Algorithm 1) to enjoy approximation guarantees.

Here, we would like to acknowledge that our contributions are just a first step towards designing machine learning models that are optimized to operate under different automation levels. It would be very interesting to extend our work to more sophisticated machine learning models and other machine learning tasks (e.g., classification).

Finally, we experiment with synthetic and real-world data from two important applications, medical diagnosis and content moderation. Our results show that the greedy algorithm beats several competitive algorithms, including the iterative algorithm for maximizing a difference of submodular functions mentioned above, and is able to identify and outsource to humans those samples where human expertise is required.

Related work. The work most closely related to ours is by Raghu et al. [raghu2019algorithmic], in which a classifier can outsource samples to humans. However, in contrast to our work, their classifier is trained to predict the labels of all samples in the training set, as in full automation, and the proposed algorithm does not enjoy theoretical guarantees. As a result, a natural extension of their algorithm to ridge regression achieves significantly lower performance than ours, as shown in Figure 4.

There is a rapidly increasing line of work devoted to designing classifiers that are able to defer decisions [bartlett2008classification, cortes2016learning, geifman2018bias, geifman2019selectivenet, raghu2019direct, ramaswamy2018consistent, thulasidasan2019combating]. Here, the classifiers learn to defer either by considering the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, there are two fundamental differences between this work and ours. First, they do not consider that there is a human decision-maker, with an associated error model, who takes a decision whenever the classifier defers. Second, the classifiers are trained to predict the labels of all samples in the training set, as in full automation.

Finally, our work also relates to robust linear regression [bhatia2017consistent, suggala2019adaptive, tsakonas2014convergence, wright2010dense] and robust logistic regression [feng2014robust], where the (implicit) assumption is that a constant fraction of the output variables are corrupted by unbounded noise. There, the goal is to find a consistent estimator of the model parameters which ignores the samples whose output variables are noisy. In contrast, in our work, we do not assume any noise model for the output variables but rather a human error per sample, and our goal is to find an estimator of the model parameters that outsources some of the samples to humans.

2 Problem Formulation

In this section, we formally state the problem of ridge regression under human assistance, where some of the predictions can be outsourced to humans.

Given a set of training samples $\{(x_i, y_i)\}_{i \in \mathcal{V}}$ and a human error $c(y_i)$ per sample, we can outsource a subset $\mathcal{S} \subseteq \mathcal{V}$ of the training samples to humans, with $|\mathcal{S}| \leq n$. Then, ridge regression under human assistance seeks to minimize the overall training error, including the outsourced samples, i.e.,

$$\underset{\mathcal{S} \subseteq \mathcal{V} \,:\, |\mathcal{S}| \leq n}{\text{minimize}} \;\; \min_{w} \; \ell(\mathcal{S}, w) \qquad (1)$$

with

$$\ell(\mathcal{S}, w) = \sum_{i \in \mathcal{S}} c(y_i) + \sum_{i \in \mathcal{V} \setminus \mathcal{S}} \left( y_i - w^\top x_i \right)^2 + \lambda \|w\|_2^2, \qquad (2)$$

where the first term accounts for the human error, the second term accounts for the machine error, and $\lambda \geq 0$ is a given regularization parameter for the machine.
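
To make the formulation concrete, here is a minimal sketch that evaluates the objective in Eq. 2 under the notation above; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def objective(S, X, y, c, w, lam):
    """Overall training error l(S, w) of Eq. 2 (a sketch; names are ours).

    S:   indices outsourced to humans
    X:   m x N matrix whose columns are the feature vectors x_i
    y:   N-vector of response variables
    c:   N-vector of human errors c(y_i)
    w:   m-vector of machine parameters
    lam: regularization parameter lambda
    """
    S = np.asarray(sorted(S), dtype=int)
    Sc = np.setdiff1d(np.arange(len(y)), S)        # samples left to the machine
    human_term = c[S].sum()                        # human error on outsourced samples
    machine_term = ((y[Sc] - X[:, Sc].T @ w) ** 2).sum()
    return human_term + machine_term + lam * (w @ w)
```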

Moreover, if we define $X = [\,x_i\,]_{i \in \mathcal{V}}$ as the matrix whose columns are the feature vectors and $y = [\,y_i\,]_{i \in \mathcal{V}}$ as the vector of response variables, we can rewrite the above objective function as

$$\ell(\mathcal{S}, w) = \sum_{i \in \mathcal{S}} c(y_i) + \|y_{\mathcal{S}^c} - X_{\mathcal{S}^c}^\top w\|_2^2 + \lambda \|w\|_2^2,$$

where $y_{\mathcal{S}^c}$ is the subvector of $y$ indexed by $\mathcal{S}^c = \mathcal{V} \setminus \mathcal{S}$ and $X_{\mathcal{S}^c}$ is the submatrix formed by the columns of $X$ that are indexed by $\mathcal{S}^c$. Then, whenever $\lambda > 0$, it readily follows that the optimal parameter is given by

$$w^*(\mathcal{S}) = \left( X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top + \lambda I \right)^{-1} X_{\mathcal{S}^c}\, y_{\mathcal{S}^c}.$$

If we plug the above equation into Eq. 1, we can rewrite the ridge regression problem under human assistance as a set function maximization problem, i.e.,

$$\underset{\mathcal{S} \subseteq \mathcal{V} \,:\, |\mathcal{S}| \leq n}{\text{maximize}} \;\; g(\mathcal{S}) \qquad (3)$$

where $g(\mathcal{S}) := \ell(\emptyset, w^*(\emptyset)) - \ell(\mathcal{S}, w^*(\mathcal{S}))$ measures the reduction in the optimal overall training error due to outsourcing $\mathcal{S}$, relative to full automation.
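
In code, the closed-form solution and the resulting set function can be evaluated directly. A minimal sketch continuing the snippet above (names are ours; a practical implementation would cache matrix factorizations):

```python
def fit_machine(S, X, y, lam):
    """Optimal ridge parameter w*(S), fit on the samples kept by the machine."""
    Sc = np.setdiff1d(np.arange(len(y)), np.asarray(sorted(S), dtype=int))
    Xc = X[:, Sc]
    return np.linalg.solve(Xc @ Xc.T + lam * np.eye(X.shape[0]), Xc @ y[Sc])

def g(S, X, y, c, lam):
    """Set function of Eq. 3: reduction in optimal training error due to outsourcing S."""
    def ell(T):
        return objective(T, X, y, c, fit_machine(T, X, y, lam), lam)
    return ell(set()) - ell(set(S))
```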

Unfortunately, due to its combinatorial nature, the above problem formulation is difficult to solve, as formalized by the following Theorem: the problem of ridge regression under human assistance defined in Eq. 1 is NP-hard.

Proof. Consider a particular instance of the problem with $c(y_i) = 0$ for all $i \in \mathcal{V}$ and $\lambda = 0$. Moreover, assume the response variables are generated as follows:

$$y = X^\top \bar{w} + b, \qquad (4)$$

where $b$ is an $n$-sparse vector which takes non-zero values on at most $n$ corrupted samples and is zero elsewhere. Then, the problem can be just viewed as a robust least squares regression (RLSR) problem [studer2011recovery], i.e.,

$$\min_{\mathcal{S} \,:\, |\mathcal{S}| \leq n} \; \min_{w} \; \sum_{i \in \mathcal{V} \setminus \mathcal{S}} \left( y_i - w^\top x_i \right)^2, \qquad (5)$$

which has been shown to be NP-hard [bhatia2017consistent]. This concludes the proof.

However, in the next section, we will show that, perhaps surprisingly, a simple greedy algorithm enjoys approximation guarantees. In the remainder of the paper, to ease the notation, we will write $w^*$ for the optimal parameter $w^*(\mathcal{S})$ whenever the set $\mathcal{S}$ is clear from context.

Remarks. Once the model is trained, given a new unlabeled sample $x$, we outsource the sample to a human if its nearest neighbor among the training samples belongs to $\mathcal{S}^*$, where $\mathcal{S}^*$ is the solution to the above maximization problem, and pass it on to the machine otherwise.
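
In code, this test-time rule reduces to a single nearest-neighbor query. A minimal sketch, assuming the training feature matrix X and the solution set S* are available from training:

```python
def outsource_to_human(x, X, S_star):
    """Route a new sample: True -> human, False -> machine.

    The sample x is outsourced iff its nearest neighbor among the
    training feature vectors (columns of X) belongs to the set S*.
    """
    nearest = int(np.argmin(np.linalg.norm(X - x[:, None], axis=0)))
    return nearest in S_star
```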

3 An Algorithm With Approximation Guarantees

In this section, we first show that the objective function in Eq. 3 can be represented as a difference of nondecreasing submodular functions. Then, we build on this representation to show that the objective function is nondecreasing and satisfies ξ-submodularity [gatmiry2019], a recently introduced notion of approximate submodularity. Finally, we present an efficient greedy algorithm that, due to the ξ-submodularity of the objective function, enjoys approximation guarantees.

Difference of submodular functions. We first start by rewriting the objective function using the following Lemma, which states a well-known property of the Schur complement of a block matrix: let $P = \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix}$. If $A$ is invertible, then $\det(P) = \det(A)\,\det(D - B^\top A^{-1} B)$. More specifically, consider $A = X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top + \lambda I$, $B = X_{\mathcal{S}^c}\, y_{\mathcal{S}^c}$ and $D = y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c}$ in the above lemma. Then, for $\mathcal{S} \subset \mathcal{V}$, it readily follows that:

$$\min_w \; \|y_{\mathcal{S}^c} - X_{\mathcal{S}^c}^\top w\|_2^2 + \lambda \|w\|_2^2 \;=\; \frac{\det(P)}{\det(A)} \;=\; e^{\,u(\mathcal{S}) - v(\mathcal{S})}, \qquad (6)$$

where

$$u(\mathcal{S}) = \log\det(P) \quad \text{and} \quad v(\mathcal{S}) = \log\det\!\left( X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top + \lambda I \right).$$

In the above, note that, for $\mathcal{S} = \mathcal{V}$, the functions $u$ and $v$ are not defined. As it will become clearer later, for $\mathcal{S} = \mathcal{V}$, it will be useful to define their values so that $e^{\,u(\mathcal{V}) - v(\mathcal{V})} = 0$, where note that these values also satisfy Eq. 6. Next, we show that, under mild technical conditions, the above functions are nonincreasing and satisfy a natural diminishing property called submodularity¹.

¹A set function $f : 2^{\mathcal{V}} \to \mathbb{R}$ is submodular iff it satisfies that $f(\mathcal{S} \cup \{i\}) - f(\mathcal{S}) \geq f(\mathcal{T} \cup \{i\}) - f(\mathcal{T})$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $i \in \mathcal{V} \setminus \mathcal{T}$, where $\mathcal{V}$ is the ground set.
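
Eq. 6 can be sanity-checked numerically: under the block-matrix choice above, the determinant ratio should coincide with the optimal ridge objective on the samples kept by the machine. A small sketch of such a check (our code, written against the reconstruction above):

```python
def check_eq6(X, y, S, lam):
    """Numerically verify Eq. 6: det ratio equals optimal ridge error on S^c."""
    Sc = np.setdiff1d(np.arange(len(y)), np.asarray(sorted(S), dtype=int))
    Xc, yc = X[:, Sc], y[Sc]
    A = Xc @ Xc.T + lam * np.eye(X.shape[0])       # top-left block
    b = Xc @ yc                                    # off-diagonal block
    P = np.block([[A, b[:, None]], [b[None, :], np.array([[yc @ yc]])]])
    det_ratio = np.linalg.det(P) / np.linalg.det(A)
    w_star = np.linalg.solve(A, b)                 # optimal ridge parameter
    ridge = ((yc - Xc.T @ w_star) ** 2).sum() + lam * (w_star @ w_star)
    assert np.isclose(det_ratio, ridge), (det_ratio, ridge)
```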

Assume the human error is not greater than a fraction of the true response variable and the regularization parameter $\lambda$ is not too small. Then, $u$ and $v$ are nonincreasing and submodular. We start by showing that $u$ is submodular, i.e., that $u(\mathcal{S} \cup \{i\}) - u(\mathcal{S}) \geq u(\mathcal{T} \cup \{i\}) - u(\mathcal{T})$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $i \in \mathcal{V} \setminus \mathcal{T}$. First, define

and observe that

Then, it follows from Proposition A.4 (refer to Appendix A.4) that

Hence, we have a Cholesky decomposition

Similarly, we have that

(7)

Now, for $\mathcal{T} \cup \{i\} \subset \mathcal{V}$, we have

where equality (a) follows from Sylvester’s determinant theorem [shamaiah2010greedy], i.e., $\det(I + AB) = \det(I + BA)$. Moreover, from Eq. 7, it follows that

Therefore, the desired inequality holds in this case. In addition, we also note that, using Lemma 3, we have that

which in turn indicates that the required condition holds. Hence, due to Proposition A.4 (refer to Appendix A.4), we have the claimed inequality.

Finally, for $\mathcal{T} \cup \{i\} = \mathcal{V}$, we have that

(8)

where the first inequality follows from the proof of submodularity for $\mathcal{T} \cup \{i\} \subset \mathcal{V}$ and the second inequality comes from the definition of $u(\mathcal{S})$ for $\mathcal{S} = \mathcal{V}$. This concludes the proof of submodularity of $u$.

Next, we show that $u$ is nonincreasing. First, recall that, for $\mathcal{S} \subset \mathcal{V}$, we have that

(9)

Then, comparing the corresponding terms and using Proposition A.4 (refer to Appendix A.4), it follows that $u(\mathcal{S} \cup \{i\}) \leq u(\mathcal{S})$, which proves $u$ is nonincreasing for $\mathcal{S} \subset \mathcal{V}$. Finally, for $\mathcal{S} \cup \{i\} = \mathcal{V}$, it readily follows from Eq. 8 that

(10)

since we have already proved that $u$ is nonincreasing for $\mathcal{S} \subset \mathcal{V}$. This concludes the proof of monotonicity of $u$.

Proceeding similarly, it can be proven that $v$ is also nonincreasing and submodular. We would like to highlight that the above technical conditions have a natural interpretation: the first condition is satisfied if the human error is not greater than a fraction of the true response variable and the second condition is satisfied if the regularization parameter is not too small.

In our experiments, the above result will enable us to use a recent heuristic iterative algorithm for maximizing the difference of submodular functions [iyer2012algorithms] as a baseline. However, this algorithm does not enjoy approximation guarantees; it only guarantees to monotonically reduce the objective function at every step.

Monotonicity. We first establish, in the following Lemma (proven in Appendix A.1), a monotonicity property of the machine error that holds under the same technical conditions as above, whenever an additional sample is outsourced. Then, building on the above lemma, we have the following Theorem (proven in Appendix A.2), which shows that the optimal overall training error is strictly nonincreasing, i.e., $\ell(\mathcal{S} \cup \{i\}, w^*(\mathcal{S} \cup \{i\})) < \ell(\mathcal{S}, w^*(\mathcal{S}))$ for all $\mathcal{S} \subset \mathcal{V}$ and $i \in \mathcal{V} \setminus \mathcal{S}$.

Finally, note that the above result does not imply that the human error $c(y_i)$ is always smaller than the machine error $(y_i - w^{*\top} x_i)^2$, where $w^*$ is the optimal parameter for $\mathcal{S}$: as formalized by the following Proposition (proven in Appendix A.3), under the same technical conditions there exist instances for which an outsourced sample incurs a larger human error than machine error.

ξ-submodularity. Given the above results, we are now ready to present and prove our main result, which characterizes the objective function of the optimization problem defined in Eq. 3:

1: Input: Ground set $\mathcal{V}$, set of training samples $\{(x_i, y_i)\}_{i \in \mathcal{V}}$, parameters $n$ and $\lambda$.
2: Output: Set of items $\mathcal{S}$ outsourced to humans
3: $\mathcal{S} \leftarrow \emptyset$
4: while $|\mathcal{S}| < n$ do
5:     $i^* \leftarrow \operatorname{argmax}_{i \in \mathcal{V} \setminus \mathcal{S}} \; g(\mathcal{S} \cup \{i\}) - g(\mathcal{S})$ % Find best sample
6:     $\mathcal{S} \leftarrow \mathcal{S} \cup \{i^*\}$ % Sample $i^*$ is outsourced to humans
7: end while
8: return $\mathcal{S}$
Algorithm 1 Greedy algorithm

Assume the human error is not greater than a fraction of the true response variable, the regularization parameter $\lambda$ is not too small, and the data is rescaled so that the feature vectors have bounded norm². Then, the function $g$ is a nondecreasing ξ-submodular function³ and the parameter ξ satisfies the lower bound

(11)

²Note that we can always rescale the data to satisfy this last condition.

³A function $f$ is ξ-submodular [gatmiry2019] iff it satisfies that $f(\mathcal{S} \cup \{i\}) - f(\mathcal{S}) \geq \xi \left( f(\mathcal{T} \cup \{i\}) - f(\mathcal{T}) \right)$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $i \in \mathcal{V} \setminus \mathcal{T}$, where $\mathcal{V}$ is the ground set.

Using the monotonicity of $u - v$, we can conclude that the relevant marginal differences are nonpositive. Then, it readily follows from the proof of Theorem 3 that

(12)

Hence, we have

(13)

where the inequality follows from Proposition A.4 (refer to Appendix A.4) and the equality follows from Theorem 3. Then, we have that

Next, we bound the first term as follows:

where inequality (a) follows from Eq. 13, inequality (b) follows from the monotonicity of $u - v$, and inequalities (c) and (d) follow from Theorem 3. Finally, we use the monotonicity of $u - v$ and Eq. 13 to bound the second term as follows:

This concludes the proof.

Panels (a)–(d): Logistic; panels (e)–(h): Gaussian.
Figure 1: Visualization of the training samples, the fitted line, and the samples outsourced to humans by the proposed greedy and DS algorithms on the Logistic and Gaussian datasets, for increasing numbers of outsourced samples $n$. It shows that the proposed greedy approach outsources more samples from the nonlinear part of the curve to humans than the DS method.

A greedy algorithm. The greedy algorithm proceeds iteratively and, at each step, assigns to humans the sample that provides the highest marginal gain among the samples currently assigned to the machine. Algorithm 1 summarizes the procedure.
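
A direct implementation of Algorithm 1, reusing the set function g sketched in Section 2, might look as follows. This is a naive sketch: each step re-solves one ridge problem per candidate, so lazy evaluation or rank-one updates of the matrix inverse would be needed at scale.

```python
def greedy(X, y, c, lam, n):
    """Algorithm 1: repeatedly outsource the sample with the highest marginal gain.

    Maximizing g(S | {i}) over i is equivalent to maximizing the marginal
    gain g(S | {i}) - g(S), since g(S) is constant within one step.
    """
    S = set()
    while len(S) < n:
        best = max(set(range(len(y))) - S,
                   key=lambda i: g(S | {i}, X, y, c, lam))
        S.add(best)                                # sample is outsourced to humans
    return S
```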

Since the objective function in Eq. 3 is ξ-submodular, it readily follows from Theorem 9 in Gatmiry and Gomez-Rodriguez [gatmiry2019] that the above greedy algorithm enjoys an approximation guarantee. More specifically, we have the following Theorem: the greedy algorithm returns a set $\hat{\mathcal{S}}$ such that $g(\hat{\mathcal{S}}) \geq \alpha \, g(\mathcal{S}^{\text{OPT}})$, where $g(\mathcal{S}^{\text{OPT}})$ is the optimal value and the approximation factor $\alpha$ depends on the parameter ξ defined in Eq. 11.

In the next section, we will demonstrate that, in addition to enjoying the above approximation guarantees, the greedy algorithm performs better in practice than several competitive alternatives.

4 Experiments on Synthetic Data

In this section, we experiment with a variety of synthetic examples. First, we look into the solution provided by the greedy algorithm. Then, we compare the performance of the greedy algorithm with several competitive baselines. Finally, we investigate how the performance of the greedy algorithm varies with respect to the amount of human error.

Experimental setup. For each sample $i$, we first generate each dimension of the feature vector $x_i$ uniformly at random, and then sample the response variable $y_i$ from either (i) a Gaussian distribution or (ii) a logistic distribution. Moreover, we sample the associated human error from a Gaussian distribution. In each experiment, we use a fixed set of training samples and compare the performance of the greedy algorithm with three competitive baselines (a sketch of one possible instantiation of this setup follows the list below):

— An iterative heuristic algorithm (DS) for maximizing the difference of submodular functions [iyer2012algorithms].

— A greedy algorithm (Distorted greedy) for maximizing weakly submodular functions [harshaw2019submodular]⁴.

⁴Note that any ξ-submodular function is also weakly submodular [gatmiry2019].

— A natural extension of the algorithm (Pre training) of Raghu et al. [raghu2019algorithmic], originally developed for classification under human assistance.
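
As a concrete instantiation of the setup above, the data generation could look like the sketch below; the exact distribution parameters are not specified here, so the values used are assumptions.

```python
def synthetic_data(N, m, dist="logistic", noise=0.1, rng=None):
    """Generate synthetic samples as described above (parameter values assumed)."""
    rng = rng or np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(m, N))        # features, uniform at random
    z = rng.normal(size=m) @ X                     # latent linear response
    if dist == "gaussian":
        y = z + rng.normal(scale=noise, size=N)    # (i) Gaussian response
    else:
        y = 1.0 / (1.0 + np.exp(-z))               # (ii) logistic response
    c = rng.normal(scale=noise, size=N) ** 2       # human error via a Gaussian
    return X, y, c
```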

Results. We first look into the solution provided by the greedy algorithm, both for the Gaussian and logistic distributions and for different numbers of outsourced samples $n$. Figure 1 summarizes the results, which reveal several interesting insights. For the logistic distribution, as $n$ increases, the greedy algorithm lets the machine focus on the samples where the relationship between the features and the response variables is more linear and outsources the remaining points to humans. For the Gaussian distribution, as $n$ increases, the greedy algorithm outsources samples on the tails of the distribution to humans.

Then, we compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the three competitive baselines. Figure 2 summarizes the results, which show that the greedy algorithm consistently outperforms the baselines across the entire range of automation levels.

(a) Gaussian
(b) Logistic
Figure 2: Mean squared error (MSE) against the number of outsourced samples for the proposed greedy algorithm, DS [iyer2012algorithms], distorted greedy [harshaw2019submodular] and pre training [raghu2019algorithmic] on synthetic data. The greedy algorithm consistently outperforms the baselines across the entire range of automation levels.
(a) Gaussian
(b) Logistic
Figure 3: Mean squared error (MSE) achieved by the proposed greedy algorithm against the number of outsourced samples for different levels of human error on synthetic data. For low levels of human error, the overall mean squared error decreases monotonically with respect to the number of outsourced samples. In contrast, for high levels of human error, it is not beneficial to outsource samples to humans.

Finally, we investigate how the performance of our greedy algorithm varies with respect to the amount of human error. Figure 3 summarizes the results, which show that, for low levels of human error, the overall mean squared error decreases monotonically with respect to the number of samples outsourced to humans. In contrast, for high levels of human error, it is not beneficial to outsource samples.

5 Experiments on Real Data

In this section, we experiment with four real-world datasets from two important applications, medical diagnosis and content moderation, and show that the greedy algorithm beats several competitive baselines. Moreover, we also look at the samples that the greedy algorithm outsources to humans and show that, for different distributions of human error, the outsourced samples are those that humans are able to predict more accurately.

(a) Hatespeech
(b) Stare-H
(c) Stare-D
(d) Messidor
Figure 4: Mean squared error (MSE) against the number of outsourced samples for the proposed greedy algorithm, DS [iyer2012algorithms], distorted greedy [harshaw2019submodular] and pre training [raghu2019algorithmic] on four real-world datasets. The greedy algorithm consistently outperforms the baselines across (almost) the entire range of automation levels in all datasets. The notable exception is the Messidor dataset under high automation levels, where the pre training baseline is the best performer.

Experimental setup. We experiment with one dataset for content moderation and three datasets for medical diagnosis, which are publicly available [hateoffensive, decenciere_feedback_2014, hoover2000locating]. More specifically:

  • Hatespeech: It consists of tweets containing words, phrases and lexicons used in hate speech. Each tweet is given several scores by three to five annotators from CrowdFlower, which measure the severity of hate speech.

  • Stare-H: It consists of retinal images. Each image is given a score by a single expert, on a five-point scale, which measures the severity of a retinal hemorrhage.

  • Stare-D: It contains the same set of images as Stare-H. However, in this dataset, each image is given a score by a single expert, on a six-point scale, which measures the severity of the Drusen disease.

  • Messidor: It contains eye images. Each image is given a score by a single expert, on a three-point scale, which measures the severity of an edema.

We first generate a 100-dimensional feature vector using fasttext [ft] for each sample in the Hatespeech dataset and a 1000-dimensional feature vector using Resnet [resnet] for each sample in the Stare-H, Stare-D, and Messidor datasets. Then, we use the top 50 features, as identified by PCA, as feature vectors $x_i$ in our experiments. For the image datasets, the response variable is just the available score by a single expert and the human predictions are sampled from a categorical distribution, where the probabilities of each potential score value depend on the features of the sample. For the Hatespeech dataset, the response variable is the mean of the scores provided by the annotators and the human predictions are picked uniformly at random from the available individual scores given by each annotator. In all datasets, we compute the human error from the response variable and the human prediction for each sample (see the sketch below). In each experiment, we use 80% of the samples for training and 20% for testing.
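
A sketch of the human-error computation and split described above; the squared-difference form of the error is an assumption on our part, since the exact formula is not reproduced here.

```python
def human_errors(y, y_human):
    """Per-sample human error, assumed to be the squared difference
    between the response variable and the sampled human prediction."""
    return (np.asarray(y) - np.asarray(y_human)) ** 2

def train_test_split(N, frac_train=0.8, rng=None):
    """80%/20% split of sample indices, as in the experiments."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(N)
    k = int(frac_train * N)
    return idx[:k], idx[k:]
```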

(a) Easy sample
(b) Difficult sample
Figure 5: An easy and a difficult sample image from the Stare-D dataset. Both images are given a score of severity zero for the Drusen disease, which is characterized by pathological yellow spots. The easy sample does not contain yellow spots and thus it is easy to predict its score. In contrast, the difficult sample contains yellow spots, which are caused not by Drusen but by diabetic retinopathy, and thus it is challenging to accurately predict its score. As a result, the greedy algorithm decides to outsource the difficult sample to humans, whereas it lets the machine decide about the easy one.
(a) Stare-H
(b) Stare-D
Figure 6: Mean squared error (MSE) achieved by the proposed greedy algorithm against the number of outsourced samples under different distributions of human error on two real-world datasets. Under each distribution, the human error is low for a fraction of the samples and high for the remaining ones. As long as there are samples that humans can predict with low error, the greedy algorithm does outsource them to humans and thus the overall performance improves. However, whenever the fraction of outsourced samples is higher than the fraction of samples with low human error, the performance degrades. This results in a characteristic U-shaped curve.

Results. We first compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the same competitive baselines used in the experiments on synthetic data, i.e., DS [iyer2012algorithms], distorted greedy [harshaw2019submodular], and pre training [raghu2019algorithmic]. Figure 4 summarizes the results, which show that the greedy algorithm consistently outperforms the baselines across (almost) the entire range of automation levels in all datasets. The notable exception is the Messidor dataset under high automation levels, where the pre training baseline is the best performer.

Next, we look at the samples that the greedy algorithm outsources to humans and those that it leaves to the machine. Intuitively, human assistance should be required for those samples which are difficult (easy) for a machine (a human) to decide about. Figure 5 provides an illustrative example of an easy and a difficult sample image. While both sample images are given a score of severity zero for the Drusen disease, one of them contains yellow spots, which are often a sign of Drusen disease⁵, and is therefore difficult to predict. In this particular case, the greedy algorithm outsourced the difficult sample to humans and let the machine decide about the easy one. Does this intuitive assignment happen consistently? To answer this question, we run our greedy algorithm on the Stare-H and Stare-D datasets under different distributions of human error and assess to what extent the greedy algorithm outsources to humans those samples they can predict more accurately.

⁵In this particular case, the patient suffered from diabetic retinopathy, which is also characterized by yellow spots.

More specifically, we sample the human predictions from a non-uniform categorical distribution under which the human error is low for a fraction of the samples and high for the remaining ones. Figure 6 shows the performance of the greedy algorithm for different values of this fraction. We observe that, as long as there are samples that humans can predict with low error, the greedy algorithm outsources them to humans and thus the overall performance improves. However, whenever the fraction of outsourced samples is higher than the fraction of samples with low human error, the performance degrades. This results in a characteristic U-shaped curve.

6 Conclusions

In this paper, we have studied the problem of developing machine learning models that are optimized to operate under different automation levels. We have focused on ridge regression and shown that a simple and efficient greedy algorithm is able to find a solution with nontrivial approximation guarantees. Moreover, using synthetic and real-world data, we have shown that this greedy algorithm beats several competitive baselines. Our work also opens many interesting avenues for future work. For example, it would be worthwhile to advance the development of other more sophisticated machine learning models, both for regression and classification, under different automation levels. It would be very interesting to find tighter lower bounds on the parameter ξ, which better characterize the good empirical performance of the greedy algorithm. Finally, in our work, we have assumed that we can measure the human error for every training sample. It would be interesting to tackle the problem under uncertainty about the human error.

References

Appendix A Proofs

A.1 Proof of Lemma 3

By definition, we have that

Moreover, note that it is enough to prove the inequality without the logarithms to prove the result. Then, we have that

where the equality follows from Lemma A.4 and the inequality follows from the lower bound on $\lambda$.

A.2 Proof of Theorem 3

Note that

where the equality follows from Proposition A.4. Then, it follows that

where the first equality follows from Proposition A.4, the first inequality uses the technical conditions, the second equality follows from the following observation:

and the final inequality follows from Lemma 3.

A.3 Proof of Proposition 3