Learning the Truth From Only One Side of the Story

06/08/2020 ∙ by Heinrich Jiang, et al. ∙ Google ∙ UC Berkeley ∙ Stanford University

Learning under one-sided feedback (i.e., where examples arrive in an online fashion and the learner only sees the labels for examples it predicted positively on) is a fundamental problem in machine learning – applications include lending and recommendation systems. Despite this, there has been surprisingly little progress made in ways to mitigate the effects of the sampling bias that arises. We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge sub-optimally or even fail to converge to the optimal solution. We propose an adaptive Upper Confidence Bound approach that comes with rigorous regret guarantees and we show that it outperforms several existing methods experimentally. Our method leverages uncertainty estimation techniques for generalized linear models to more efficiently explore uncertain areas than existing approaches which explore randomly.




1 Introduction

Machine learning is used in a wide range of critical applications where the feedback is one-sided, including bank lending Tsai and Chen (2010); Kou et al. (2014); Tiwari (2018), criminal recidivism prediction Tollenaar and Van der Heijden (2013); Wang et al. (2010); Berk (2017), credit card fraud Chan et al. (1999); Srivastava et al. (2008), oil spill detection Brekke and Solberg (2005); Topouzelis (2008), spam detection Jindal and Liu (2007); Sculley (2007), mineral exploration Rodriguez-Galiano et al. (2015); Granek (2016), self-driving motion planning Paden et al. (2016); Lee et al. (2014), and recommendation systems Pazzani and Billsus (2007); Covington et al. (2016); He et al. (2014). These applications can often be modeled as online learning with one-sided feedback, in that the true labels are only observed for the positively predicted examples. For example, in bank loans, the learner only observes whether the loan was repaid if it was approved. In criminal recidivism prediction, the decision maker only observes re-offences for inmates who were released.

One often overlooked aspect is that the samples used to train the model are biased by past predictions. In practical applications, there is a common belief that the main issue caused by such one-sided sampling is label imbalance He et al. (2014), since the proportion of positive examples among the observed labels is expected to be much higher than in the overall population. Indeed, this biasing of the labels can be a challenge, and it motivates much of the vast literature on label imbalance. However, the challenges go beyond label imbalance. We show that without accounting for such biased sampling, it is possible to under-sample in regions where the model makes false negative predictions, so that even with continual online feedback, the model never corrects those mistakes.

Despite the importance of learning in this one-sided feedback setting, there has been surprisingly little work studying the effects of such biased sampling and how to mitigate them. Learning with partial feedback was first studied by Helmbold et al. (2000) under the name “apple tasting.” They suggest transforming any online learning procedure into an apple-tasting one by randomly flipping some of the negative predictions into positive ones with probability decaying over time, and they show upper and lower bounds on the number of mistakes made by the procedure in the partial feedback setting. Since then, there have been only a handful of works on this challenging yet ubiquitous problem in machine learning, which we outline in the related works section.

In this paper, we focus on generalized linear models, borrowing assumptions from a popular linear bandit framework Filippi et al. (2010). Our contributions can be summarized as follows.

  • In Section 3, we propose a new notion of regret to capture the one-sided learner’s objective.

  • In Section 4, we show that without leveraging online learning, where the model is continuously updated upon seeing new samples, an offline learner may need to be trained on as many as Θ(1/ε³) samples to attain an average regret of at most ε.

  • In Section 5, we show that the online greedy approach (i.e., updating the model only on examples with positive predictions at each timestep without any adjustments) in general may not have vanishing regret.

  • In Section 6, we give an upper confidence bound based strategy that adaptively adjusts the model decision by incorporating the uncertainty of the prediction, attaining an Õ(√T) regret.

  • In Section 7, we provide an extensive experimental analysis of linear and logistic regression on various benchmark datasets, showing that our method outperforms a number of baselines.

To the best of our knowledge, we give the most detailed analysis of the ways in which passive or greedy learners are sub-optimal in the one-sided feedback setting, and we present a practical algorithm that comes with theoretical guarantees and outperforms existing methods empirically. Our method is adaptive: it leverages variance estimation techniques for generalized linear models to explore uncertain areas more efficiently than existing approaches, which perform the exploration randomly. Such efficient exploration is critical: in practice, mistakes can be costly (e.g., when a bank grants a loan that defaults), and at the same time we show that principled exploration is necessary for long-run benefit (e.g., the bank needs to take a chance on some loan applicants in order to find more profitable lending opportunities in the future).

2 Related Works

As mentioned in the introduction, the problem of learning with one-sided feedback was studied by Helmbold et al. (2000) under the name “apple tasting.” They propose randomly flipping some of the negative predictions into positive ones with probability decaying over time; thus, their method can be seen as performing the exploration randomly, while our method performs the exploration adaptively. Sculley (2007) studies the one-sided feedback setting for the application of email spam filtering, shows that the approach of Helmbold et al. (2000) was less effective than a simple greedy strategy, and explores active learning approaches to solve the problem.

Bechavod et al. (2019) consider the problem of one-sided learning in the group-based fairness context, where the goal is to satisfy equal opportunity Hardt et al. (2016) at every round. They consider convex combinations over a finite set of classifiers and arrive at a solution which is a randomized mixture of at most two of these classifiers.

Cesa-Bianchi et al. (2006b) study a setting called partial monitoring, which generalizes one-sided feedback, by considering repeated two-player games in which the player receives feedback generated by the combined choice of the player and the environment; they propose a randomized solution. Antos et al. (2013) provide a classification of such two-player games in terms of the attainable regret rates, and Bartók and Szepesvári (2012) study a variant of the problem with side information. Our approach does not rely on the randomization typically required to solve such two-player games. There has also been work studying the effects of distributional shift caused by biased sampling Perdomo et al. (2020).

Ensign et al. (2017) studies the one-sided feedback setting through the problems of predictive policing and recidivism prediction. They show a reduction to the partial monitoring setting and provide corresponding regret guarantees.

More broadly, this one-sided learning problem is related to selective sampling or active learning, where the learner chooses which labels to observe. Cesa-Bianchi et al. (2006a) propose, for linear models, to sample randomly based on the model's prediction score. The difference in our setting is that we incur a cost when we query for a label that is negative, and the goal is to query exactly the positively labeled examples.

Filippi et al. (2010) propose the generalized linear model framework for the multi-armed bandit problem, where for arm a with feature vector x_a, the reward is of the form μ(θ*ᵀx_a) + ε, where θ* is unknown to the learner, ε is additive noise, and μ is a link function. Our work borrows ideas from this framework as well as proof techniques. Their notion of regret is based on the difference between the expected reward of the chosen arm and that of an optimal arm. One of our core contributions is showing that, surprisingly, a modification to the GLM-UCB algorithm leads to a procedure for the contextual bandit setting that minimizes a very different notion of regret: one that is one-sided and compared not to any single arm but to the best context-dependent decision at each time step.

3 Setup

We assume that datapoints are streaming in and the learner interacts with the data in sequential rounds: at time step t we are presented with a batch of M samples x_{t,1}, …, x_{t,M}, and for the data points we decide to observe, we are further shown the corresponding labels y_{t,i}, while no feedback is provided for the unobserved ones. We make the following assumptions on the model, which are standard in works on linear bandits.

Assumption 1.

There exist a parameter θ* (unknown to the learner) and a link function μ (known to the learner) such that each label is drawn according to the additive noise model y = μ(θ*ᵀx) + ε, where the following holds:

  1. Both the covariates and the responses are bounded in norm: ‖x_{t,i}‖₂ and |y_{t,i}| are bounded by known constants for all t, i.

  2. The unknown parameter θ* is bounded in norm.

  3. The noise residuals ε_{t,i} are mutually independent and conditionally zero-mean, i.e., E[ε_{t,i} | x_{t,i}] = 0. Moreover, each is conditionally sub-Gaussian with parameter σ.

  4. The link function μ is continuously differentiable and strictly monotonically increasing, with Lipschitz constant L, i.e., |μ(a) − μ(b)| ≤ L|a − b| for all a, b.


Taking μ(z) = z gives a linear model and μ(z) = 1/(1 + e^{−z}) gives a logistic model. Also note that the assumptions imply there exists κ > 0 such that μ′(θᵀx) ≥ κ for all θ and x satisfying the norm bounds above (see Lemma 3 in Appendix C for a short proof).
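As a concrete illustration of the link-function assumption, the following sketch checks numerically that the identity link (linear model) and the sigmoid link (logistic model) satisfy the Lipschitz condition; the helper names are ours, not the paper's.

```python
import numpy as np

# Two link functions satisfying Assumption 1:
# identity -> linear model, sigmoid -> logistic model.
def mu_linear(z):
    return z

def mu_logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerically estimate the Lipschitz constant sup |mu'(z)| on a bounded range.
z = np.linspace(-10, 10, 100001)
L_linear = np.max(np.abs(np.gradient(mu_linear(z), z)))
L_logistic = np.max(np.abs(np.gradient(mu_logistic(z), z)))
# mu'(z) = mu(z)(1 - mu(z)) for the sigmoid, which peaks at 1/4 at z = 0,
# so L = 1 for the linear link and L = 1/4 for the logistic link.
```

Both links are also strictly increasing with derivative bounded below by a positive constant on any bounded range of θᵀx, matching the κ lower bound above.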

We are interested in learning a policy that can identify all the feature vectors x whose expected response is above some pre-specified cutoff γ, while making as few mistakes as possible along the sequential learning process compared to the Bayes-optimal oracle that knows θ* (i.e., the classifier x ↦ 1{μ(θ*ᵀx) ≥ γ}). It is worth noting that we make no distributional assumption on the feature vectors x_{t,i}. Thus, our adaptive algorithm works both in the adversarial setting and in the stochastic setting where the features are drawn i.i.d. from some unknown underlying distribution.

Our goal is to minimize the notion of regret formally defined in Definition 1, which penalizes exactly when the model makes an incorrect prediction w.r.t. the Bayes-optimal decision rule, and the penalty is how far the expected response value for that example is from the cutoff γ.

Definition 1 (Instantaneous Regret).

For feature-action pairs (x_{t,i}, a_{t,i}), the regret incurred at time t on a batch of size M with cutoff at γ is the following:

r_t = Σ_{i=1}^{M} |μ(θ*ᵀx_{t,i}) − γ| · 1{a_{t,i} ≠ 1{μ(θ*ᵀx_{t,i}) ≥ γ}}.
We emphasize that this doesn't quite fit into the usual setting considered in the bandit literature, due to the imbalance of information between the two actions: if we choose to observe (a_{t,i} = 1), we have full information for both actions (i.e., we can evaluate the counterfactuals), whereas if we don't, no information whatsoever is gathered about x_{t,i}, and we don't get to observe the corresponding instantaneous regret. It is for this reason that one can only perform empirical minimization on the datapoints for which a_{t,i} = 1.
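The regret of Definition 1 can be sketched in code. This is a hedged reconstruction from the description above (penalize each decision that disagrees with the Bayes-optimal threshold rule by the distance of the expected response from the cutoff); the function name is ours.

```python
import numpy as np

def instantaneous_regret(mu_vals, actions, gamma):
    """Instantaneous regret (Definition 1, reconstructed): each example whose
    decision disagrees with the Bayes-optimal rule 1{mu(theta*^T x) >= gamma}
    is penalized by |mu(theta*^T x) - gamma|."""
    mu_vals = np.asarray(mu_vals, dtype=float)
    actions = np.asarray(actions, dtype=int)
    bayes = (mu_vals >= gamma).astype(int)
    return float(np.sum(np.abs(mu_vals - gamma) * (actions != bayes)))

# A batch of 4 examples with cutoff gamma = 0.5. Two mistakes:
# accepted 0.3 (cost 0.2) and rejected 0.6 (cost 0.1) -> regret 0.3.
r = instantaneous_regret([0.9, 0.3, 0.6, 0.4], actions=[1, 1, 0, 0], gamma=0.5)
```

Note that only the terms with a_{t,i} = 1 can be estimated from data; the false-negative terms are exactly the ones the learner never gets to see.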

We give an illustrative example of how this notion of regret could be relevant in practice. Suppose that a company is looking to hire job applicants, where each applicant will contribute some variable amount of revenue to the company, and hiring an applicant incurs a fixed cost of γ. If the company makes the correct decision on each applicant, it incurs no regret, where correct means that it hires exactly the applicants whose expected revenue contribution is at least γ. The company incurs regret whenever it makes an incorrect decision: if it hires an applicant whose expected revenue is below γ, it is penalized by the difference. Likewise, if it doesn't hire an applicant whose expected revenue is above γ, it is penalized by the expected profit that could have been made. Moreover, this definition of regret also promotes a notion of individual fairness because it encourages the decision maker not to hire an unqualified applicant over a qualified one. While our setup captures scenarios beyond fairness applications, this aspect of individual fairness in one-sided learning may be of independent interest.

4 Offline Learner Has Slow Learning Rate

In this section, we show that under the stronger i.i.d. data generation assumption, in order to achieve sublinear regret one can leverage an “offline” algorithm that performs one-time exploration only, but at the cost of a slower rate for our notion of regret under one-sided feedback. Our offline learner (Algorithm 1) proceeds by predicting positively on an initial run of samples to obtain labeled examples to fit on; the first n₁ of these samples are used to obtain a finite set of models representing all possible binary decision combinations on these samples that could have been made by the GLM model. The entire set of observed labeled examples is then used to choose the best model from this finite set, which is used for the remaining rounds without further updating.

More formally, we work in the setting where the feature-utility pairs are generated i.i.d. in each round. Let the policy class consist of threshold rules of the form x ↦ 1{μ(θᵀx) ≥ γ}. Moreover, let the utility for covariate x with action a be a · (μ(θ*ᵀx) − γ).

The initial discretization of the policy class is used for a covering argument, the size of which is bounded via the VC dimension. We show that with optimal choices of n₁ and n₂, Algorithm 1 has suboptimal guarantees, needing as many as Θ(1/ε³) rounds in order to attain an average regret of at most ε, whereas our adaptive algorithm, introduced later, will only need Θ(1/ε²) rounds. This suggests the importance of having the algorithm actively engage in both exploration and exploitation throughout the data streaming process, beyond working with a large collection of observational data only, for efficient learning.

Inputs: Discretization sample size n₁, exploration sample size n₂, cutoff γ, time horizon T
Initialization: Choose to observe the feature-label pairs for the first n₁ + n₂ rounds, i.e., set the action to 1
Construct a discretized policy class using the first n₁ samples, containing one representative for each distinct set of binary decisions realizable on these samples
Find the best policy π̂ on the n₂ observed data pairs by empirical utility maximization
for t = n₁ + n₂ + 1, …, T do
     Output π̂(x_t) as the decision on x_t; observe y_t if π̂(x_t) = 1
Algorithm 1 Offline Learner
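A minimal sketch of the explore-then-commit behavior of Algorithm 1, with a single least-squares fit standing in for the discretization and covering step; the function and parameter names are ours, not the paper's.

```python
import numpy as np

def offline_linear_learner(stream, n_explore, gamma):
    """Explore-then-commit sketch (simplified Algorithm 1): accept everything
    for the first n_explore rounds to collect full feedback, fit once, then
    apply the frozen threshold rule for the remaining rounds.
    `stream` yields (x, y) pairs; labels are only revealed on acceptance."""
    X, Y, decisions = [], [], []
    theta = None
    for t, (x, y) in enumerate(stream):
        if t < n_explore:            # exploration phase: accept everything
            X.append(x)
            Y.append(y)
            decisions.append(1)
            if t == n_explore - 1:   # commit: fit once, never update again
                theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
        else:                        # exploitation phase: frozen model
            decisions.append(int(x @ theta >= gamma))
    return theta, decisions
```

Because the model is frozen after exploration, any estimation error from the exploration phase persists for all remaining rounds, which is the source of the slower rate compared to the adaptive algorithm.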

We give the regret guarantee in the proposition below. The proof is deferred to Appendix B.

Proposition 1 (Regret Bound for Algorithm 1).

Under Assumption 1 and the assumption that the feature-utility pairs are drawn i.i.d. in each round, with suitable choices of n₁ and n₂ in Algorithm 1, the following holds with probability at least 1 − δ:

This in turn gives the following cumulative regret bound with the same probability:

5 Greedy Learner May Incur Linear Regret

In this section, we show that greedy online learning, which updates the model after each round on the received labeled examples without adjusting for one-sided feedback, can fail to have vanishing regret, even under the i.i.d. data generation assumption. More specifically, at each time step the greedy learner fits the parameter that minimizes the empirical loss on the datapoints whose labels it has seen so far: in the linear case this is the squared loss, and in the logistic case it is the cross-entropy loss. An alternative definition of the greedy learner instead uses the decision rule mandated by the parameter that minimizes the regret (Definition 1) on the datapoints seen thus far. In our setup this is possible because whenever a datapoint's label is revealed, the regret incurred by the decision can be estimated. As it turns out, these two methods share similar behavior, and we defer the discussion of this alternative to Appendix A.

We illustrate in Theorem 1 below that even when allowing warm starting with full-rank randomly drawn i.i.d. samples, there are settings where the greedy learner suffers linear regret. More specifically, suppose the underlying data distribution produces, with constant probability, a fixed vector v, with the rest of the mass concentrated on the orthogonal subspace. Under a Gaussian noise assumption, the prediction on v has a Gaussian distribution centered at the true value. Using the Gaussian anti-concentration inequality from Lemma 1 provided in Appendix A, we can show that if the true response on v is too close to the decision boundary γ, there is a constant probability that the model will predict negatively on v, after which it may never gather more information in that direction for updating the prediction, since no more observations are made on v's label. This situation can arise, for instance, when dealing with a population consisting of two subgroups with small overlap between their features.
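The failure mode above can be reproduced with a short simulation: a greedy least-squares learner that, once its estimate on a rarely seen coordinate direction falls below the cutoff, never observes that direction again. The constants and warm-start scheme here are illustrative, not those of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, sigma = 0.0, 1.0
theta_true = np.array([0.05, 1.0])  # e1 response sits barely above the cutoff

# Warm start with noisy labels on both coordinate directions.
X = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
Y = X @ theta_true + sigma * rng.normal(size=10)

regret = 0.0
for t in range(2000):
    x = np.array([1.0, 0.0]) if rng.random() < 0.3 else np.array([0.0, 1.0])
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    if x @ theta_hat >= gamma:                    # greedy: accept and observe
        y = x @ theta_true + sigma * rng.normal()
        X, Y = np.vstack([X, x]), np.append(Y, y)
    else:                                         # reject: no feedback, no update
        regret += max(x @ theta_true - gamma, 0)  # false negatives accrue regret
# If the noisy warm start lands theta_hat[0] below gamma, every e1 arrival
# is rejected forever, the e1 estimate never updates, and regret grows
# linearly in t.
```

With constant probability over the warm-start noise, the e1 coordinate of the estimate starts below the cutoff and the learner is permanently locked out of that direction, which is exactly the mechanism of Theorem 1.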

Figure 1: Examples where greedy fails to converge. We use the example provided in Theorem 1. The x-axis shows the number of rounds (i.e., number of batches) and the y-axis shows the average regret. The average regret fails to decrease for the greedy method, but our method (Algorithm 2) exhibits vanishing regret.
Theorem 1 (Linear Regret for Greedy Learner).

Let the noise be Gaussian and independent of the covariates. Moreover, let the feature distribution place constant probability on a distinguished vector v, with the remaining mass on vectors orthogonal to v. Consider an MLE fit using pairs of i.i.d. samples from this distribution for warm starting the greedy learner. Under the additional assumption that the warm-start samples span the whole space, if the expected response on v is sufficiently close to the cutoff γ, then the cumulative regret is lower bounded linearly in T:

6 Adaptive Algorithm

We propose the following algorithm with the goal of minimizing the cumulative regret at time horizon T, accounting for one-sided feedback. We recast the problem as a generalized linear contextual bandit problem where we make M binary decisions in each round, one for each of the M data points in the batch. Each context is equipped with a feature matrix in which each row is either the corresponding feature vector or the zero vector: the optimal policy chooses between x_{t,i} and 0 to decide whether or not to observe each example at round t. This turns our definition of regret in (1) into a linear reward over the chosen vectors, from which we build upon Filippi et al. (2010) for the analysis of the regret bound.

Inputs: Batch size M, initialization sample size n₀ and eigenvalue lower bound λ₀, cutoff γ
Inputs: Lipschitz constant L, norm bounds, time horizon T, confidence δ
Initialization: Choose to observe n₀ labeled pairs, set the initial design matrix accordingly
for t do
     Solve for the MLE θ̂_t on the observed data, using e.g. Newton's method
     if θ̂_t is within the norm bound then keep it
     else  perform a projection step on θ̂_t
     Set the upper confidence bound estimate for each example in the batch
     for i = 1, …, M do
          Choose to observe x_{t,i} if its upper confidence bound estimate is at least γ
          Update the design matrix and observed dataset if chosen to observe
Algorithm 2 Adaptive One-sided Batch UCB

The algorithm proceeds by first training a model on an initial labeled sample, with the assumption that after initialization the empirical covariance matrix is invertible with smallest eigenvalue at least λ₀. At each time step, we solve for the MLE fit on the examples observed so far. If the norm of the estimate is too large, we perform a projection step; this step is only required as a theoretical artifact to ensure that the estimate is bounded, so that the derivative of μ is lower bounded by a positive quantity whenever it is evaluated in the algorithm. The model then produces a point estimate for each example in the batch.

From here, we adopt an upper confidence bound approach: the uncertainty in the prediction for data point x is proportional to √(xᵀA_t⁻¹x) (where A_t is the design matrix of the labeled examples seen thus far), multiplied by a slowly increasing factor in order to balance exploration and exploitation for the best regret guarantee. At an intuitive level, the algorithm explores more for the samples (and the corresponding subspace) on which we haven't collected enough information. The algorithm then predicts positively if the upper bound estimate is above the threshold, and it receives the labels of all positively predicted examples in the batch at once for updating the model in the next round.
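The per-example decision rule can be sketched as follows; here beta stands in for the slowly increasing confidence factor, and the rank-one design-matrix update mirrors the bookkeeping in Algorithm 2 (function and variable names are ours).

```python
import numpy as np

def ucb_decisions(theta_hat, A, batch, gamma, beta, mu=lambda z: z):
    """One round of the adaptive UCB rule (sketch): accept example x iff
    mu(theta_hat^T x) + beta * sqrt(x^T A^{-1} x) >= gamma, where A is the
    design matrix of the labels observed so far."""
    A_inv = np.linalg.inv(A)
    out = []
    for x in batch:
        width = np.sqrt(x @ A_inv @ x)  # large in poorly explored directions
        out.append(int(mu(theta_hat @ x) + beta * width >= gamma))
    return out

def update_design(A, x):
    """Rank-one update of the design matrix after observing an accepted example."""
    return A + np.outer(x, x)
```

With an uninformative estimate (theta_hat = 0) and identity design matrix, a unit vector is accepted when beta ≥ γ and rejected otherwise, illustrating how the confidence width alone can trigger exploration in unseen directions.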

We give the following regret bound for the proposed algorithm, whose proof we defer to Appendix C.

Theorem 2 (Regret Guarantee for Algorithm 2).

Suppose that Assumption 1 holds. Given a batch size M, we have that for all T, the cumulative regret is bounded as Õ(√T) with probability at least 1 − δ, where Õ hides poly-logarithmic factors in T.

7 Experiments

To further support our theoretical findings and demonstrate the effectiveness of our algorithm in practice, we test our method on the following datasets:
1. Adult Lichman et al. (2013). The task is to predict whether a person's income is more than $50K.
2. Bank Marketing Lichman et al. (2013). Predict if someone will subscribe to a bank product.
3. ProPublica's COMPAS ProPublica (2018). Recidivism data.
4. Communities and Crime Lichman et al. (2013). Predict if a community is high (>70th percentile) crime.
5. German Credit Lichman et al. (2013). Classify into good or bad credit risks.
6. Blood Transfusion Service Center Vanschoren et al. (2013). Predict if a person donated blood.
7. Diabetes Vanschoren et al. (2013). Detect if a patient shows signs of diabetes.
8. EEG Eye State Vanschoren et al. (2013). Detect if eyes are open or closed based on EEG data.
9. Australian Credit Approval Vanschoren et al. (2013). Predict credit card approvals.
10. Churn Vanschoren et al. (2013). Determine whether or not the customer churned.

We compare our method against the following baselines:
1. Greedy, where we perform a least-squares/logistic fit on the collected data and predict positively (observe the label) if the predicted value is at least the cutoff γ.
2. ε-Greedy Sutton and Barto (2018), in which, with probability ε, we make a random decision on the prediction (with equal probability), and otherwise use the greedy approach.
3. One-sided ε-Greedy, in which, with probability ε, we predict positively, and otherwise use the greedy approach. This baseline is inspired by ideas in the original apple tasting paper Helmbold et al. (2000).
4. Noise, in which we add a perturbation drawn uniformly from a symmetric interval around zero to the prediction.
5. One-sided Noise, in which we add a perturbation drawn uniformly from a non-negative interval to the prediction.
6. Margin, in which we add a fixed margin to the prediction. This can be seen as a non-adaptive version of our approach, since the quantity added to the prediction is uniform across all points.
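For concreteness, the baseline decision rules can be sketched in one function; `eps` and `a` are hypothetical parameter names standing in for the tuned exploration magnitudes, and `score` is the model's predicted value.

```python
import numpy as np

def baseline_decision(score, gamma, method, eps=0.1, a=0.1, rng=None):
    """Decision rules of the six baselines (sketch; eps and a are hypothetical
    names for the tuned exploration magnitudes)."""
    rng = rng or np.random.default_rng()
    if method == "greedy":
        return int(score >= gamma)
    if method == "eps-greedy":             # random decision w.p. eps
        return int(rng.random() < 0.5) if rng.random() < eps else int(score >= gamma)
    if method == "one-sided-eps-greedy":   # force a positive prediction w.p. eps
        return 1 if rng.random() < eps else int(score >= gamma)
    if method == "noise":                  # symmetric perturbation of the score
        return int(score + rng.uniform(-a, a) >= gamma)
    if method == "one-sided-noise":        # upward perturbation only
        return int(score + rng.uniform(0, a) >= gamma)
    if method == "margin":                 # fixed, non-adaptive bonus
        return int(score + a >= gamma)
    raise ValueError(method)
```

The margin rule is the closest to the adaptive method: it adds a constant bonus where the UCB approach would add a per-example, uncertainty-dependent one.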

Dataset cutoff greedy ε-grdy os-ε-grdy noise os-noise margin ours
Adult 50% 239.45 236.34 211.74 230.77 165.77 162.31 144.92
70% 134.74 134.18 133.8 131.66 132.39 132.67 129.81
Bank 50% 164.23 162.67 117.86 136.0 88.49 86.26 74.64
70% 207.6 197.0 185.9 198.66 153.3 150.75 137.24
COMPAS 50% 41.56 36.67 36.93 36.93 28.09 28.12 26.01
70% 41.66 39.16 39.61 39.87 38.03 36.98 34.07
Crime 50% 15.77 15.77 15.5 15.66 14.93 14.73 13.95
70% 22.0 21.75 21.99 20.33 20.63 20.1 19.19
German 50% 14.7 14.51 14.12 13.62 11.12 10.52 9.63
70% 15.89 15.53 15.93 15.41 14.09 14.52 13.07
Blood 50% 2.06 2.06 2.06 2.06 1.92 1.72 1.52
70% 3.7 2.78 3.04 2.38 3.13 3.06 2.65
Diabetes 50% 4.17 4.16 4.23 3.94 3.81 3.95 3.61
70% 6.05 5.56 6.14 6.05 5.6 5.39 5.33
EEG Eye 50% 256.47 200.04 175.8 173.52 106.26 96.85 119.7
70% 175.71 167.94 168.73 157.68 167.52 160.76 155.79
Australian 50% 3.74 3.74 3.77 3.63 3.0 2.79 2.65
70% 6.77 6.77 6.77 6.66 5.09 5.26 4.65
Churn 50% 46.98 43.65 30.65 36.64 21.24 18.83 14.89
70% 49.99 47.84 47.91 49.89 41.18 36.17 35.27
Table 1: Cumulative regret for Linear Regression.

For each dataset, we take all the examples and make a random stratified split, using part of the data to train the initial model and the rest for online learning. For the linear regression experiments we used one batch size throughout, while for logistic regression we used a larger batch size for Adult, Bank, and EEG Eye State and a smaller one for the rest, due to the computational cost of retraining after each batch using scikit-learn's implementation of logistic regression. We compute the regret using an estimated θ* obtained by fitting the respective model (either linear or logistic) on the entire dataset. Due to space limitations, we only show the results for cutoffs chosen so that 50% and 70% of the data points are below the cutoff w.r.t. the estimated θ*. Full results are in Appendix D. For each dataset and setting of the cutoff, we averaged the performance of each method across different random splits of the dataset and tuned each method's exploration parameter over a geometric grid (except greedy).

Figure 2: Average regret for linear regression. Each round consists of presenting a batch of examples.
cutoff greedy ε-grdy os-ε-grdy noise os-noise margin ours
Adult 50% 43.48 43.55 43.48 43.35 43.41 43.38 42.63
70% 102.86 102.86 102.9 102.6 102.81 102.47 100.06
Bank 50% 23.22 23.26 23.18 23.3 23.33 23.2 23.23
70% 85.72 85.94 85.67 85.51 85.26 85.27 85.75
COMPAS 50% 44.47 43.88 44.15 43.07 42.11 42.64 40.34
70% 43.7 43.59 43.41 43.66 43.83 43.7 43.7
Crime 50% 11.04 10.83 11.04 10.85 10.33 10.44 9.42
70% 26.05 25.93 26.13 25.94 25.84 25.55 24.46
German 50% 35.71 35.21 33.55 33.35 24.19 23.19 20.33
70% 42.55 41.14 42.18 40.98 40.64 40.3 37.12
Blood 50% 5.05 5.05 4.87 4.83 4.71 4.53 4.24
70% 13.04 13.04 13.03 13.04 10.84 12.14 9.69
Diabetes 50% 28.23 28.23 27.75 27.22 26.67 26.18 25.16
70% 29.36 28.0 27.79 28.0 27.4 27.9 28.11
EEG Eye 50% 239.33 238.92 239.09 236.65 200.61 201.51 187.28
70% 209.48 207.89 208.83 206.63 204.94 205.4 199.04
Australian 50% 21.88 21.88 21.87 21.21 21.76 20.81 20.38
70% 17.47 17.29 17.46 16.49 17.24 17.46 17.43
Churn 50% 61.04 57.74 54.13 53.85 39.46 38.88 34.89
70% 122.96 117.49 116.04 112.36 94.61 88.3 82.23
Table 2: Cumulative regret for Logistic Regression.
Figure 3: Average regret for logistic regression. Each round consists of presenting a batch of examples.

Broader Impact

Machine learning plays an increasingly important role in policy making. Many machine learning systems learn under one-sided feedback, which justifies modeling them as a dynamical process where careful experimental design is interleaved with more traditional observational data study. In such scenarios, the data collection is informed by past decisions and can be inherently biased. In this work, we show that without accounting for such biased sampling, the model can enter a feedback loop that only reinforces its past misjudgements, resulting in a policy that does not necessarily align with the long-term learning goal. Indeed, we demonstrate that de facto default approaches such as greedy and offline learning often yield suboptimal performance when viewed through this lens. In turn, we propose a natural notion of regret for the one-sided learner and give a simple, practical algorithm that can be used to avoid such undesirable downstream effects and ultimately drive the decision-making process toward a better equilibrium. Both the theoretical grounding and the empirical effectiveness of the proposed adaptive algorithm offer evidence that it serves as a much better alternative in such settings.


  • A. Antos, G. Bartók, D. Pál, and C. Szepesvári (2013) Toward a classification of finite partial-monitoring games. Theoretical Computer Science 473, pp. 77–99. Cited by: §2.
  • G. Bartók and C. Szepesvári (2012) Partial monitoring with side information. In International Conference on Algorithmic Learning Theory, pp. 305–319. Cited by: §2.
  • Y. Bechavod, K. Ligett, A. Roth, B. Waggoner, and S. Z. Wu (2019) Equal opportunity in online classification with partial feedback. In Advances in Neural Information Processing Systems, pp. 8972–8982. Cited by: §2.
  • R. Berk (2017) An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Journal of Experimental Criminology 13 (2), pp. 193–216. Cited by: §1.
  • A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 19–26. Cited by: Appendix B.
  • C. Brekke and A. H. Solberg (2005) Oil spill detection by satellite remote sensing. Remote sensing of environment 95 (1), pp. 1–13. Cited by: §1.
  • N. Cesa-Bianchi, C. Gentile, and L. Zaniboni (2006a) Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research 7 (Jul), pp. 1205–1230. Cited by: §2.
  • N. Cesa-Bianchi, G. Lugosi, and G. Stoltz (2006b) Regret minimization under partial monitoring. Mathematics of Operations Research 31 (3), pp. 562–580. Cited by: §2.
  • P. K. Chan, W. Fan, A. L. Prodromidis, and S. J. Stolfo (1999) Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and Their Applications 14 (6), pp. 67–74. Cited by: §1.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. Cited by: §1.
  • D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, and S. Venkatasubramanian (2017) Decision making with limited feedback: error bounds for recidivism prediction and predictive policing. Cited by: §2.
  • S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári (2010) Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 586–594. Cited by: Appendix C, Appendix C, §1, §2, §6.
  • J. Granek (2016) Application of machine learning algorithms to mineral prospectivity mapping. Ph.D. Thesis, University of British Columbia. Cited by: §1.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §2.
  • X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §1, §1.
  • D. P. Helmbold, N. Littlestone, and P. M. Long (2000) Apple tasting. Information and Computation 161 (2), pp. 85–139. Cited by: §1, §2, §7.
  • N. Jindal and B. Liu (2007) Review spam detection. In Proceedings of the 16th international conference on World Wide Web, pp. 1189–1190. Cited by: §1.
  • G. Kou, Y. Peng, and C. Lu (2014) MCDM approach to evaluating bank loan default models. Technological and Economic Development of Economy 20 (2), pp. 292–311. Cited by: §1.
  • U. Lee, S. Yoon, H. Shim, P. Vasseur, and C. Demonceaux (2014) Local path planning in a complex environment for self-driving car. In The 4th Annual IEEE International Conference on Cyber Technology in Automation, Control and Intelligent, pp. 445–450. Cited by: §1.
  • M. Lichman et al. (2013) UCI machine learning repository. Irvine, CA. Cited by: §7.
  • B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent vehicles 1 (1), pp. 33–55. Cited by: §1.
  • M. J. Pazzani and D. Billsus (2007) Content-based recommendation systems. In The adaptive web, pp. 325–341. Cited by: §1.
  • J. C. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt (2020) Performative prediction. arXiv preprint arXiv:2002.06673. Cited by: §2.
  • ProPublica (2018) COMPAS recidivism risk score data and analysis. External Links: Link Cited by: §7.
  • V. Rodriguez-Galiano, M. Sanchez-Castillo, M. Chica-Olmo, and M. Chica-Rivas (2015) Machine learning predictive models for mineral prospectivity: an evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geology Reviews 71, pp. 804–818. Cited by: §1.
  • D. Sculley (2007) Practical learning from one-sided feedback. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 609–618. Cited by: §1, §2.
  • A. Srivastava, A. Kundu, S. Sural, and A. Majumdar (2008) Credit card fraud detection using hidden Markov model. IEEE Transactions on Dependable and Secure Computing 5 (1), pp. 37–48. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §7.
  • A. K. Tiwari (2018) Machine learning application in loan default prediction. Machine Learning 4 (5). Cited by: §1.
  • N. Tollenaar and P. Van der Heijden (2013) Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176 (2), pp. 565–584. Cited by: §1.
  • K. N. Topouzelis (2008) Oil spill detection by SAR images: dark formation detection, feature extraction and classification algorithms. Sensors 8 (10), pp. 6642–6659. Cited by: §1.
  • C. Tsai and M. Chen (2010) Credit rating by hybrid machine learning techniques. Applied Soft Computing 10 (2), pp. 374–380. Cited by: §1.
  • J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §7.
  • P. Wang, R. Mathieu, J. Ke, and H. Cai (2010) Predicting criminal recidivism with support vector machine. In 2010 International Conference on Management and Service Science, pp. 1–9. Cited by: §1.

Appendix A Proof for Section 5

The following Gaussian anti-concentration bound is used throughout the proof.

Lemma 1.

For , we have the following lower bound on the Gaussian density:

A.1 Proof of Theorem 1

Proof of Theorem 1.

The optimality condition for the MLE fit gives that , which implies

In the direction , where is orthogonal to every other vector drawn from , we have for the number of times has appeared in the samples used for warm-starting the greedy learner,

and therefore, assuming , we have . If for , then by the anti-concentration property of the Gaussian distribution (Lemma 1):

Hence, with constant probability the greedy procedure will reject at the next round. From then onward, the greedy learner will reject all instances of and will incur regret every time it encounters one in any subsequent round. ∎
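The failure mode established here can be checked numerically. The sketch below uses assumed constants (true mean utility 0.2, unit Gaussian noise, 5 warm-start labels), since the paper's symbols were not preserved in this rendering: a greedy learner is warm-started on a single one-hot direction, and whenever the empirical mean utility comes out non-positive, the greedy rule rejects that direction forever and never collects another label, even though the true mean utility is positive.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, n0 = 0.2, 1.0, 5   # true mean utility, noise scale, warm-start size (assumed)
trials = 10_000

# Greedy rule: accept the direction iff its empirical mean utility is positive.
# Under one-sided feedback a rejection is permanent: no new labels ever arrive,
# so the estimate is never corrected.
stuck = 0
for _ in range(trials):
    labels = mu + sigma * rng.standard_normal(n0)  # warm-start observations
    if labels.mean() <= 0:                         # greedy rejects from now on
        stuck += 1

print(f"fraction of runs where greedy locks out a positive-utility group: {stuck / trials:.3f}")
```

With these constants the lock-out event fires in roughly a third of runs, matching the constant-probability claim: a single unlucky warm start is enough to commit the greedy learner to linear regret.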

A.2 Linear Regret for Empirical Regret Minimization

Let . For observed pairs of denoted by set , the empirical risk minimization of our regret can be rewritten as

from which it is clear that the objective is not convex in . However, in the case (or any other orthogonal system) the coordinates decouple (after rotation), and the problem reduces to solving one-dimensional problems of the form

and the following procedure finds the optimal solution: (1) for all such that , compute ; (2) similarly compute ; (3) compare the two quantities: if , any would be a global optimum with objective function value ; otherwise any would be a global optimum with objective function value . Therefore, for group () with response , where we initialize with observed samples, again using Lemma 1,

for and , after which no observations will be made on group as and linear regret will be incurred with constant probability.
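The three-step procedure above can be sketched in code. Because the extraction dropped the symbols, the names below are hypothetical, but the structure is the one described: for a single decoupled coordinate, compare the empirical objective of accepting the group against that of rejecting it, and keep whichever is larger.

```python
import numpy as np

def erm_one_coordinate(y_obs, reject_value=0.0):
    """ERM for one decoupled coordinate (names are illustrative assumptions):
    the only two candidate policies are 'accept this group' and 'reject this
    group'.  `y_obs` holds the observed utilities for the group; `reject_value`
    is the assumed objective value of always rejecting."""
    accept_value = float(np.mean(y_obs))   # step (1): empirical value of accepting
    # step (2): the value of rejecting needs no labels; it is reject_value.
    if accept_value >= reject_value:       # step (3): compare and keep the better
        return "accept", accept_value
    return "reject", reject_value

decision, value = erm_one_coordinate(np.array([0.5, -0.1, 0.3]))
```

Because the coordinates decouple, running this comparison once per orthogonal direction recovers a global optimum of the (non-convex) joint problem.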

Appendix B Proof for Algorithm 1

Before proceeding, we give a lemma below that characterizes the optimal solution to the regret minimization problem on the population level.

Lemma 2.

The optimal policy for the expected regret minimization problem satisfies

for the policy class , where and the expectation is taken over data drawn from . In other words, is the optimal policy for the population-level objective.


Proof.

We can rewrite the objective in terms of as

Therefore it suffices to show that

As is zero-mean and independent of by assumption, we have

and the claim above immediately follows. ∎
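Lemma 2 says that thresholding the conditional mean utility at zero is optimal at the population level, precisely because zero-mean noise that is independent of the features contributes nothing to the expected utility of any policy. A Monte Carlo sketch (the conditional mean f(x) = x - 0.3 and the uniform feature distribution are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed utility model: y = f(x) + eps, with eps zero-mean and independent
# of x, and conditional mean utility f(x) = x - 0.3.
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
f = lambda t: t - 0.3
y = f(x) + rng.standard_normal(n)

def expected_utility(threshold):
    """Monte Carlo estimate of E[y * pi(x)] for pi(x) = 1{x >= threshold}."""
    return float(np.mean(y * (x >= threshold)))

# The best threshold should sit where f crosses zero (0.3), i.e. the policy
# that accepts exactly when E[y | x] >= 0, as the lemma asserts.
candidates = [-0.5, 0.0, 0.3, 0.6]
values = {t: expected_utility(t) for t in candidates}
best = max(values, key=values.get)
```

The noise term averages out of every candidate's score, so only the conditional mean determines the ranking, which is exactly the content of the lemma.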

With this in hand, we are ready to show the regret bound for the offline learner in Algorithm 1.

Proof of Proposition 1.

To construct an -cover in the pseudo-metric for i.i.d. feature-utility pairs : since the class of linear threshold functions in has VC dimension , a standard argument via Sauer’s lemma (see e.g. Beygelzimer et al. (2011)) shows that for sequences drawn i.i.d., with probability over a random subset of size ,

for . Now running the offline algorithm for the discretized policy class on the exploration data collected in the first phase (which consists of rounds), we have for a fixed policy , since the terms are i.i.d. and unbiased,

Now, applying Hoeffding’s inequality for bounded random variables and a union bound over , with probability at least , simultaneously for all ,

Therefore applying the inequality twice with and , and using the optimality of as the empirical minimizer, we have for each round,

with probability at least . Summing over rounds, combining the inequalities established above, and minimizing over and yields the final utility bound

with probability at least . This in turn gives the regret bound

with the same probability, where we used Lemma 2 in the first equality and the fact that is zero-mean and independent of for the last step. ∎
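The concentration step in this proof can be checked numerically. The sketch below uses assumed constants (cover size, sample count, failure probability): it draws i.i.d. bounded utilities for a finite cover of policies and compares the largest empirical deviation against the Hoeffding-plus-union bound sqrt(log(2K/delta) / (2n)).

```python
import numpy as np

rng = np.random.default_rng(1)

n, K, delta = 2000, 50, 0.05  # samples, size of the policy cover, failure prob. (assumed)
# Hoeffding for [0, 1]-bounded terms, plus a union bound over the K cover points.
bound = float(np.sqrt(np.log(2 * K / delta) / (2 * n)))

# Each policy's per-round utility is Bernoulli here, hence bounded in [0, 1].
true_means = rng.uniform(0.2, 0.8, size=K)
samples = (rng.uniform(size=(n, K)) < true_means).astype(float)
max_dev = float(np.max(np.abs(samples.mean(axis=0) - true_means)))

print(f"max empirical deviation {max_dev:.4f}, Hoeffding+union bound {bound:.4f}")
```

With probability at least 1 - delta the maximum deviation sits below the bound, and the bound shrinks at the sqrt(log K / n) rate, which is the uniform-deviation rate invoked in the proof.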

Appendix C Proof for Algorithm 2

Lemma 3.

Suppose that Assumption 1 holds. Let . Then there exists such that for all satisfying and .


Proof.

By Assumption 1, is continuous and positive everywhere. Define the interval . By Cauchy–Schwarz, we have . Since is closed and bounded, it is compact in by the Heine–Borel theorem. Since the image of a compact set under a continuous function is itself compact, there exists such that for all , as desired. ∎
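Lemma 3's compactness argument can be made concrete with a specific link function; the logistic sigmoid below is an assumption for illustration, since the paper's link is generic. Its derivative is continuous and positive everywhere, so on any compact interval it attains a positive minimum kappa, even though its infimum over the whole real line is zero.

```python
import numpy as np

def sigmoid_deriv(z):
    """Derivative of the logistic link: continuous and strictly positive on R."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

grid = np.linspace(-5.0, 5.0, 10_001)     # a compact interval, as in the lemma
kappa = float(sigmoid_deriv(grid).min())  # strictly positive minimum on the compact set

# Far outside the interval the derivative is much smaller than kappa,
# so no uniform lower bound holds over all of R.
far_out = float(sigmoid_deriv(np.array(15.0)))
```

This is exactly why the lemma restricts attention to a compact set: continuity plus compactness buys the uniform lower bound kappa that the later regret analysis relies on.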

Much of the analysis in the lemma below is built upon (Filippi et al., 2010), generalized to our setting.

Lemma 4 (Instantaneous Regret).

For all and some , with and , we have under the assumption stated in Theorem 2 that

with probability at least for , where is either or depending on whether we choose to observe the context.


Proof.

We recast the problem as picking one out of choices (induced by all possible binary decisions on each of the samples in the batch) in each round, with a linear reward function. To this end, for each feature vector , we encode the algorithm’s two choices as -dimensional vectors (selecting it) and (not selecting it). Let us denote , where is the cutoff. Then for each at round , OPT chooses to predict positively and observe if exceeds , where . We can then define the following notation representing Algorithm 2’s choices at round :


where with as the MLE fit on