1 Introduction
Binary classification models are used for online decision-making in a wide variety of practical settings. These settings include bank lending (Tsai and Chen, 2010; Kou et al., 2014; Tiwari, 2018), criminal recidivism prediction (Tollenaar and Van der Heijden, 2013; Wang et al., 2010; Berk, 2017), credit card fraud (Chan et al., 1999; Srivastava et al., 2008), spam detection (Jindal and Liu, 2007; Sculley, 2007), self-driving motion planning (Paden et al., 2016; Lee et al., 2014), and recommendation systems (Pazzani and Billsus, 2007; Covington et al., 2016; He et al., 2014). In many of these tasks, the model only receives the true labels for examples accepted by the learner. We refer to this class of online learning problems as the bank loan problem (BLP), motivated by the following prototypical task. A learner interacts with an online sequence of loan applicants. At the beginning of each timestep the learner receives features describing a loan applicant, from which the learner decides whether or not to issue the requested loan. If the learner decides to issue the loan, the learner, at some point in the future, observes whether or not the applicant repays the loan. If the loan is not issued, no further information is observed by the learner. The process is then repeated in the subsequent timesteps. The learner’s objective is to maximize its reward by issuing loans to as many applicants as possible who can repay them and denying loans to applicants who cannot.
The BLP can be viewed as a contextual bandit problem with two actions: accepting or rejecting a loan application. Rejection carries a fixed, known reward of 0: with no loan issued, there is no dependency on the applicant. If in contrast the learner accepts a loan application, it receives a reward of $1$ if the loan is repaid and suffers a loss of $1$ if the loan is not repaid. The probabilities of repayment associated with any individual are not known in advance, so the learner is required to build an accurate predictive model of these while ensuring that not too many loans are handed out to individuals who cannot repay them and not too many loans are denied to individuals who can. This task can become tricky since the samples used to train the model, which govern the model’s future decisions, can suffer from bias as they are the result of past predictions from a potentially incompletely trained model. In the BLP setting, a model can get stuck in a self-fulfilling false rejection loop, in which the very samples that could correct the erroneous model never enter the training data in the first place because they are being rejected by the model.
Existing contextual bandit approaches typically assume a known parametric form on the reward function. With restrictions on the reward function, a variety of methods (Filippi et al., 2010; Chu et al., 2011) introduce strong theoretical guarantees and empirical results, both in the linear and generalized linear model settings (in a linear model the expected response of a point $x$ satisfies $\mathbb{E}[y \mid x] = x^\top \theta_\star$, whereas in a generalized linear model the expected response satisfies $\mathbb{E}[y \mid x] = \mu(x^\top \theta_\star)$ for some nonlinearity $\mu$, typically assumed to be the logistic function). These methods often make use of end-to-end optimism, incorporating uncertainty about the reward function in both the reward model and decision criteria.
In practice, however, deep neural networks (DNNs) are often used to learn the binary classification model (Riquelme et al., 2018; Zhou et al., 2020), presenting a scenario that is vastly richer than the linear and generalized linear model assumptions of many contextual bandit works. In the static setting these methods have achieved effective practical performance, and in the case of Zhou et al. (2020), theoretical guarantees. A large class of these methods use two components: a feature extractor, and a post-hoc exploration strategy fed by the feature extractor. These methods leave the feature extractor itself open to bias, with only the limited post-hoc exploration strategy to compensate. Another class of methods incorporates uncertainty into the neural network feature extractor (Osband et al., 2018), building on the vast literature on uncertainty estimation in neural networks.
We introduce an algorithm, PLOT (see Algorithm 1), which explicitly trains DNNs to make optimistic online decisions for the BLP, by incorporating optimism in both representation learning and decision making. The intuition behind PLOT’s optimistic optimization procedure is as follows:
“If I trained a model with the query point having a positive label, would it predict it to be positive?”
If the answer to this question is yes, PLOT would accept this point. To achieve this, at each timestep, the algorithm retrains its base model, treating the new candidate points as if they had already been accepted, and temporarily adds them to the existing dataset with positive pseudo-labels. PLOT’s accept and reject decisions are based on the predictions from this optimistically retrained base model. The addition of the fake pseudo-labeled points prevents the emergence of self-fulfilling false negatives. In contrast to false rejections, any false accepts introduced by the pseudo-labels are self-correcting. Once the model has (optimistically) accepted enough similar data points and obtained their true, negative labels, these will overrule the impact of the optimistic label and result in a reject for novel queries.
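The question above can be turned into a small runnable sketch. This is a minimal illustration, not the paper's implementation: it assumes a logistic-regression model in place of a DNN, and the names `fit_logistic`, `predict_proba`, `optimistic_accept` and the parameter `pseudo_weight` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, weights=None, steps=500, lr=0.5, l2=1e-3):
    """Weighted logistic regression fitted by gradient descent (bias appended)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    sw = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * (Xb.T @ (sw * (p - y)) / sw.sum() + l2 * w)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(Xb @ w)

def optimistic_accept(X_seen, y_seen, x_query, pseudo_weight=1.0):
    """Retrain with the query labeled positive; accept if it is then predicted positive."""
    X_aug = np.vstack([X_seen, x_query[None, :]])
    y_aug = np.append(y_seen, 1.0)                       # positive pseudo-label
    sw = np.append(np.ones(len(y_seen)), pseudo_weight)  # weight of the pseudo-label
    w = fit_logistic(X_aug, y_aug, sw)
    return bool(predict_proba(w, x_query[None, :])[0] >= 0.5)
```

In this toy version a query far from any observed negative is accepted, while a query sitting on top of many observed negatives is rejected, because the true labels outweigh the single pseudo-label.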
Although PLOT is conceptually and computationally simple, we show empirically that it obtains competitive performance across a set of three different benchmark problems in the BLP domain. With minimal hyperparameter tuning, it matches or outperforms greedy, $\epsilon$-greedy (with a decaying schedule) and state-of-the-art methods from the literature such as NeuralUCB (Zhou et al., 2020). Furthermore, our analysis shows that PLOT is 3-5 times more likely to accept a data point that the current model rejects if the data point is indeed a true accept, compared to a true reject.
1.1 Related Work
Contextual Bandits
As we formally describe in Section 2, the BLP can be formulated as a contextual bandit. Perhaps the setting most closely related to the BLP in the contextual bandits literature is the work of Metevier et al. (2019). In their paper the authors study the loan problem in the presence of offline data. In contrast with our definition of the BLP, their formulation is not concerned with the online decision-making component of the BLP, the main point of interest in our formulation. Furthermore, their setting is also concerned with studying ways to satisfy fairness constraints, an aspect of the loan problem that we consider completely orthogonal to our work. Other recent works have forayed into the analysis and deployment of bandit algorithms in the setting of function approximation. Most notably, Riquelme et al. (2018) conduct an extensive empirical evaluation of existing bandit algorithms on public datasets. A more recent line of work has focused on devising ways to add optimism to the predictions of neural network models. Methods such as Neural UCB and Shallow Neural UCB (Zhou et al., 2020; Xu et al., 2020) are designed to add an optimistic bonus to the model predictions that is a function of the representation layers of the model. Their theoretical analysis is inspired by insights gained from Neural Tangent Kernel (NTK) theory. Other recent works in the contextual bandit literature, such as Foster and Rakhlin (2020), have started to pose the question of how to extend theoretically valid guarantees into the function approximation scenario, so far with limited success (Sen et al., 2021). A fundamental component of our work that is markedly different from previous approaches is that it explicitly encourages optimism throughout representation learning, rather than relying on post-hoc exploration on top of a non-optimistic representation.
Learning with Abstention
The literature on learning with abstention shares many features with our setting. In this literature, an online learner can choose to abstain from prediction for a fixed cost, rather than incurring arbitrary regret (Cortes et al., 2018). In our setting, a rejected point always receives a constant reward, similar to learning with abstention. However, regret in the BLP is measured against the potential reward, rather than against a fixed cost. Although the BLP itself does not naturally admit abstention, the extension of PLOT to the abstention setting is an interesting future problem.
Repeated Loss Minimization
A closely related problem setting to the BLP (see Section 2) is Repeated Loss Minimization. Previous works (Hashimoto et al., 2018) have studied repeated classification settings where the acceptance or rejection decision produces a change in the underlying distribution of individuals faced by the decision maker. In their work, the distributional shift induced by the learner’s decisions is assumed to be intrinsic to the dynamics of the world. This line of work has recently garnered a flurry of attention and inspired the formalization of different problem domains such as strategic classification (Hardt et al., 2016a) and performative prediction (Perdomo et al., 2020; Miller et al., 2021). A common theme in these works is the necessity of thinking strategically about the learner’s actions and how these may affect its future decisions as a result of the world reacting to them. In this paper we focus on a different set of problems encountered by decision makers when faced with the BLP in the presence of a reward function. We do not treat the world as a strategic decision maker; instead, we treat the distribution of data points presented to the learner as fixed, and focus on understanding the effects that making online decisions can have on the future accuracy and reward experienced by an agent engaged in repeated classification. The main goal in this setting is to devise methods that prevent the learner from getting trapped in false rejection or false acceptance cycles that may compromise its reward. Thus, the learner’s task is not contingent on the arbitrariness of the world, but on a precise understanding of its own knowledge of the world.
Learning with partial feedback
In Sculley (2007), the authors study the one-sided feedback setting for the application of email spam filtering and show that the approach of Helmbold et al. (2000) was less effective than a simple greedy strategy. The one-sided feedback setting shares with our definition of the BLP the assumption that an estimator of the instantaneous regret is only available in the presence of an accept decision. The main difference between the setting of Helmbold et al. (2000) and ours is that the BLP is defined in the presence of possibly noisy labels. Moreover, our algorithm can be used with powerful function approximators, a setting that goes beyond the simple label-generating functions studied in Helmbold et al. (2000). In related work, Bechavod et al. (2019) consider the problem of one-sided learning in the group-based fairness context with the goal of satisfying equal opportunity (Hardt et al., 2016b) at every timestep. They consider convex combinations over a finite set of classifiers and arrive at a solution which is a randomized mixture of at most two of these classifiers. Moving beyond the single-player one-sided feedback problem, Cesa-Bianchi et al. (2006) study a setting which generalizes one-sided feedback, called partial monitoring, by considering repeated two-player games in which the player receives feedback generated by the combined choice of the player and the environment, and propose a randomized solution. Antos et al. (2013) provide a classification of such two-player games in terms of the regret rates attained, and Bartók and Szepesvári (2012) study a variant of the problem with side information.
2 Setting
We formally define the bank loan problem (BLP) as a sequential contextual binary decision problem with conditional labels, where the labels are only observed when datapoints are accepted. In this setting, a decision maker and a data generator interact through a series of timesteps, which we index by time $t$. At the beginning of every timestep $t$, the decision maker receives a data point $x_t$ and has to decide whether to accept or reject this point. The label $y_t \in \{0, 1\}$ is only observed if the data point is accepted. If the datapoint is rejected the learner collects a reward of zero. If instead the datapoint is accepted the learner collects a reward of $1$ if $y_t = 1$ and $-1$ otherwise. Here, we focus our notation and discussion on the setting where only one point is acted upon in each timestep. All definitions below can be extended naturally to the batch scenario, where in each timestep a learner receives a batch of data points and sees labels only for the accepted points in that batch.
In the BLP, contexts (unlabeled batches of query points) do not have to be IID; their distributions may change adversarially. As a simple example, the bank may only see applicants for loans under $1000 until some time point $t$, after which the bank sees applicants for larger loans. Although contexts do not have to be IID, we assume that the reward function itself is always stationary, i.e., the conditional distribution of the responses, $P(y_t \mid x_t)$, is fixed for all $t$. Finally, in the BLP, it is common for rewards to be delayed, e.g., the learner does not observe the reward for its decision at time $t$ until some later time $t' > t$. In this work, we assume that rewards are received immediately, as a wide body of work exists for adapting online learning algorithms to delayed rewards (Mesterharm, 2005).
Reward
The learner’s objective is to maximize its cumulative accuracy, or the number of correct decisions it has made during training, as in Kilbertus et al. (2020). In our model, if the learner accepts the datapoint $x_t$, it receives a reward of $2y_t - 1$, where $y_t \in \{0, 1\}$ is a binary random variable, while a rejection leads to a reward of $0$. Concretely, in the loan scenario, this reward model corresponds to a lender that makes a unit gain for a repaid loan, incurs a unit loss for an unpaid loan and collects zero reward whenever no loan is given.
Contextual Bandit Reduction
The BLP can be easily modeled as a specific type of contextual bandit problem (Langford and Zhang, 2008), where at the beginning of every timestep $t$ the learner receives a context $x_t$ and has the choice of selecting one of two actions, accept or reject. This characterization implies an immediate reduction of our problem to a two-action contextual bandit problem, where the payout of one of the actions is known by the learner (rejection always pays $0$). To distinguish the problem we study from the more general contextual bandit setting, we refer to our problem setting as the bank loan problem (BLP).
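The interaction protocol described in this section can be sketched as a short simulation loop. This is an illustrative sketch only: `run_blp`, `accept_all` and `reject_all` are hypothetical names, and the policy interface (a function of the current point and the accepted history) is an assumption made for the example.

```python
def run_blp(policy, contexts, labels):
    """Simulate the BLP: a label is revealed only when its point is accepted.

    policy(x, history) -> bool, where history holds the (x, y) pairs of
    previously accepted points.
    """
    history, total_reward = [], 0
    for x, y in zip(contexts, labels):
        if policy(x, history):                    # accept
            total_reward += 1 if y == 1 else -1   # unit gain or unit loss
            history.append((x, y))                # label is observed
        # reject: reward 0 and nothing is observed
    return total_reward, history

def accept_all(x, history):
    return True

def reject_all(x, history):
    return False
```

A policy that always rejects never collects any labels, which is the extreme case of the self-fulfilling rejection loop: its `history` stays empty and it can never improve.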
3 Pseudo-Label Optimism
Overview
In this section, we describe PLOT in detail and provide theoretical guarantees for the method. Recall the discussion from Section 1 where we described the basic principle behind the algorithm. At the beginning of each timestep $t$, PLOT retrains its base model by adding the candidate points with positive pseudo-labels into the existing data buffer. The learner then decides whether to accept or reject the candidate points by following the predictions from this optimistically trained model. Although the implementation details of PLOT (see Algorithm 1) are a bit more involved than this, the basic operating principle behind the algorithm remains rooted in this very simple idea.
Primarily, PLOT aims to provide guarantees similar to those of the existing contextual bandit literature, generalized to the function approximation regime. To do so, we rely on the following realizability assumption for the underlying neural model and the distribution of the labels:
Assumption 1 (Labels generation).
We assume the labels are generated according to the following model:
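$$\mathbb{P}(y_t = 1 \mid x_t) = \mu\big(f_{\theta_\star}(x_t)\big), \qquad \mathbb{P}(y_t = 0 \mid x_t) = 1 - \mu\big(f_{\theta_\star}(x_t)\big) \tag{1}$$
for some function $f_{\theta_\star}$ parameterized by $\theta_\star \in \Theta$, where $\mu(z) = 1/(1 + \exp(-z))$ is the logistic link function. We denote the function class parameterized by $\theta \in \Theta$ as $\mathcal{F}_\Theta$.
For simplicity, we discuss PLOT under the assumption that $\mathcal{F}_\Theta$ is a parametric class. Our main theoretical results of Theorem 1 hold when this parameterization is rich enough to encompass the set of all constant functions.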
In PLOT, the learner’s decision at time $t$ is parameterized by parameters $\theta_t$ and a function $f_{\theta_t}$, and takes the form: accept $x_t$ if $\mu(f_{\theta_t}(x_t)) \geq 1/2$, and reject it otherwise.
We call the function $f_{\theta_t}$ the learner’s model. We denote by $a_t \in \{0, 1\}$ the indicator of whether the learner has decided to accept (1) or reject (0) data point $x_t$. We measure the performance of a given decision-making procedure by its pseudo-regret (the prefix pseudo is a common moniker to indicate that the reward considered is a conditional expectation; it has no relation to the pseudo-label optimism of our methods):
$$\mathcal{R}(t) = \sum_{\ell=1}^{t} \Big[ \max\big(0,\, 2\mu(f_{\theta_\star}(x_\ell)) - 1\big) - a_\ell \big(2\mu(f_{\theta_\star}(x_\ell)) - 1\big) \Big]$$
For all $t$ we denote the pseudo-reward received at time $t$ as $a_t \big(2\mu(f_{\theta_\star}(x_t)) - 1\big)$. The optimal pseudo-reward at time $t$ equals $0$ if $\mu(f_{\theta_\star}(x_t)) < 1/2$ and $2\mu(f_{\theta_\star}(x_t)) - 1$ otherwise. Minimizing regret is a standard objective in the online learning and bandits literature (see Lattimore and Szepesvári (2020)). As a consequence of Assumption 1, the optimal reward-maximizing decision rule equals the true model $f_{\theta_\star}$.
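As a concrete illustration of the pseudo-regret, the sketch below computes it from the true repayment probabilities and a sequence of decisions. It assumes the unit-gain/unit-loss reward model of Section 2, so that accepting a point with repayment probability $p$ has expected reward $2p - 1$; the function name `pseudo_regret` and its arguments are hypothetical.

```python
import numpy as np

def pseudo_regret(p_true, actions):
    """Cumulative pseudo-regret of accept (1) / reject (0) decisions.

    p_true[t] is the true probability of a positive label at step t.
    Accepting has expected reward 2 * p - 1; rejecting has reward 0.
    """
    p_true = np.asarray(p_true, dtype=float)
    actions = np.asarray(actions, dtype=float)
    expected_accept = 2.0 * p_true - 1.0
    optimal = np.maximum(expected_accept, 0.0)   # best of accept and reject
    received = actions * expected_accept         # pseudo-reward actually obtained
    return np.cumsum(optimal - received)
```

For example, accepting a point with repayment probability 0.2 costs 0.6 in pseudo-regret, while rejecting a point with probability 0.6 costs 0.2.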
In order to show theoretical guarantees for our setting we will work with the following realizability assumption.
Assumption 2 (Neural Realizability).
There exists an $L$-Lipschitz function $f_{\theta_\star} \in \mathcal{F}_\Theta$ such that for all $x$:
$$\mathbb{P}(y = 1 \mid x) = \mu\big(f_{\theta_\star}(x)\big) \tag{2}$$
Recall that in our setting the learner interacts with the environment in a series of timesteps, observing points and labels $(x_t, y_t)$ only during those timesteps when $a_t = 1$. The learner also receives an expected reward of $a_t \big(2\mu(f_{\theta_\star}(x_t)) - 1\big)$, a quantity for which the learner only has access to an unbiased estimator. Whenever Assumption 2 holds, a natural way of computing an estimator of $\theta_\star$ is via maximum-likelihood estimation: if we denote by $\mathcal{D}_t$ the dataset of accepted points up to the learner’s decision at time $t$, the estimator maximizes the regularized log-likelihood (equivalently, minimizes the regularized cross-entropy loss) of $\mathcal{D}_t$.
The Realizable Linear Setting
If , the BLP can be reduced to a generalized linear contextual bandit problem (see Filippi et al. (2010)) via the following reduction. At time , the learner observes a context of the form . In this case, the payoff corresponding to a decision can be realized by action for all models in .
Unfortunately, this reduction does not immediately apply to the neural realizable setting. In the neural setting, there may not exist an input vector, known a priori, at which every model in the class predicts the payoff of the reject action. Stated differently, there may not exist an easy way to represent the bank loan problem as a natural instance of a two-action contextual bandit problem with the payoffs fully specified by the neural function class at hand. We can get around this issue here, because in the BLP it is enough to compare the model’s prediction with the neutral probability $1/2$. Although we make Assumption 2 for the purpose of analyzing and explaining our algorithms, in practice it is not necessary that this assumption holds.
Just as in the case of generalized linear contextual bandits, relying directly on the maximum-likelihood model may lead to catastrophic underestimation of the true response for any query point. The core of PLOT is a method to avoid the emergence of self-fulfilling negative predictions. We do so using a form of implicit optimism, resulting directly from the optimization of a new loss function, which we call the optimistic pseudo-label loss.
Definition 1 (Optimistic PseudoLabel Loss).
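Let $\mathcal{D}$ be a dataset consisting of labeled datapoints and responses $(x, y)$, and let $\mathcal{B}$ be a dataset of unlabeled data points. We define the optimistic pseudo-label loss of $(\mathcal{D}, \mathcal{B})$ as the cross-entropy over the labeled pairs of $\mathcal{D}$ that lie within the ’focus’ radius $R$ of $\mathcal{B}$, plus the cross-entropy of the points in $\mathcal{B}$ against positive pseudo-labels, weighted by a factor $W$.
Let’s take a closer look at the optimistic pseudo-label loss. Given any pair of labeled-unlabeled datasets $(\mathcal{D}, \mathcal{B})$, optimizing the optimistic loss corresponds to minimizing the cross-entropy of a dataset of pairs of the form $(x, y) \in \mathcal{D}$ or $(x, 1)$ such that $x \in \mathcal{B}$. In other words, the minimizer of the optimistic loss aims to satisfy two objectives:
1. Minimize error on the labeled data $\mathcal{D}$.
2. Maximize the likelihood of a positive label for the unlabeled points in $\mathcal{B}$.
The model resulting from minimizing the optimistic loss will therefore strive to be optimistic over $\mathcal{B}$ while keeping a low loss value, and consequently a high accuracy (when Assumption 2 holds), over the true labels of the points in $\mathcal{D}$. We note that if $|\mathcal{D}|$ is much larger than $W|\mathcal{B}|$ (the weighted size of $\mathcal{B}$), optimizing the loss will favor models that are accurate over $\mathcal{D}$ instead of optimistic over $\mathcal{B}$. Whenever $W|\mathcal{B}| \gg |\mathcal{D}|$, the opposite is true.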
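One way to write such a loss in code is sketched below. This is a hedged illustration rather than the paper's exact definition: the argument names (`f`, `X_lab`, `y_lab`, `X_unl`, `W`, `R`), the small constant `eps`, and in particular the reading of the focus radius as "keep labeled points within distance `R` of some unlabeled point" are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def optimistic_pseudo_label_loss(f, X_lab, y_lab, X_unl, W=1.0, R=np.inf, eps=1e-12):
    """Cross-entropy on nearby labeled points plus W-weighted cross-entropy
    of the unlabeled points against positive pseudo-labels.

    f maps an (n, d) array of points to n logits.
    """
    # Keep labeled points within distance R of at least one unlabeled point.
    dists = np.linalg.norm(X_lab[:, None, :] - X_unl[None, :, :], axis=-1)
    keep = (dists <= R).any(axis=1)
    p_lab = sigmoid(f(X_lab[keep]))
    y = y_lab[keep]
    labeled = -(y * np.log(p_lab + eps) + (1 - y) * np.log(1 - p_lab + eps)).sum()
    # Every unlabeled point is treated as if its label were 1.
    pseudo = -W * np.log(sigmoid(f(X_unl)) + eps).sum()
    return labeled + pseudo
```

With `R=np.inf` every labeled point is kept, and with `W=1` each pseudo-labeled point counts exactly like one ordinary positive example.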
PLOT Algorithm
Based on these insights, we design Pseudo-Labels for Optimism (PLOT), an algorithm that utilizes the optimistic pseudo-label loss to inject the appropriate amount of optimism into the learner’s decisions: high for points that have not been seen much during training, and low for those points whose acceptance may cause a catastrophic loss increase over the points accepted by the learner so far.
During the very first timestep ($t = 1$), PLOT accepts all the points in the first batch. In subsequent timesteps PLOT makes use of a dual $\epsilon$-greedy and MLE-greedy filtering subroutine to find a subset of the current batch composed of those points that are both currently being predicted as rejects by the MLE estimator and have been selected by the $\epsilon$-greedy schedule (see step 3 of Algorithm 1).
This dual filtering mechanism ensures that only a small proportion (based on the $\epsilon$-greedy schedule) of the datapoints are ever considered for inclusion in the empirical optimistic pseudo-label loss. The MLE filtering mechanism further ensures that not all the points selected by $\epsilon$-greedy are further investigated, but only those that are currently being rejected by the MLE model. This has the effect of preventing the pseudo-label filtered batch from growing too large.
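The dual filter can be sketched in a few lines. This is an assumed reading of the mechanism, with hypothetical names: `p_mle` holds the MLE model's predicted acceptance probabilities for the batch, `epsilon` is the current exploration probability, and each point gets an independent coin flip.

```python
import numpy as np

def pseudo_label_candidates(p_mle, epsilon, rng):
    """Boolean mask of batch points that receive a pseudo-label.

    A point is selected only if (a) an independent coin with success
    probability epsilon comes up heads and (b) the MLE model currently
    rejects it, i.e. predicts a probability below 1/2.
    """
    p_mle = np.asarray(p_mle, dtype=float)
    coin = rng.random(p_mle.shape[0]) < epsilon
    rejected_by_mle = p_mle < 0.5
    return coin & rejected_by_mle
```

Points the MLE model already accepts are never pseudo-labeled, so the filtered batch can only shrink as the MLE model accepts more of the incoming points or as `epsilon` decays.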
As we have mentioned above, the relative sizes of the labeled and unlabeled batches have an effect on the degree of optimism the algorithm will inject into its predictions. As the size of the collected dataset grows, the inclusion of the pseudo-labeled batch has less and less effect on the learned model. In the limit, once the dataset is sufficiently large and accurate information can be inferred about the true labels, the inclusion of the batch in the pseudo-label loss has a vanishing effect. The latter has the beneficial effect of making the false positive rate decrease with $t$.
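The vanishing influence of the pseudo-labels can be checked by hand in a simplified special case, which is an illustrative assumption and not the general neural setting: a constant model that predicts the same acceptance probability $p$ everywhere, trained on $n_1$ positive and $n_0$ negative true labels plus pseudo-positive points of total weight $W$. The symbols $n_0$, $n_1$ and $p$ are introduced only for this sketch.

```latex
\begin{align*}
\ell(p) &= -(n_1 + W)\log p \;-\; n_0 \log(1 - p), \\
\ell'(p) &= -\frac{n_1 + W}{p} + \frac{n_0}{1 - p} = 0
\;\;\Longrightarrow\;\;
p^\star = \frac{n_1 + W}{n_0 + n_1 + W}.
\end{align*}
```

With only negative labels observed ($n_1 = 0$), $p^\star = W/(n_0 + W)$, which exceeds $1/2$ only while $n_0 < W$ and tends to $0$ as $n_0$ grows. In this toy case an optimistic accept is therefore overturned once enough true negatives have accumulated.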
The following guarantee shows that in the case of separable data satisfying Assumption 2, the PLOT algorithm initialized with the right parameters satisfies a logarithmic regret guarantee.
Theorem 1.
Let be a distribution over data point, label pairs satisfying

All are bounded .

The conditional distributions of the labels satisfy the data generating Assumption 2 with a class of Lipschitz functions containing all constant functions.

holds for all .
Let the marginal distribution of over points be and let’s assume the algorithm will be used in the presence of i.i.d. data such that independently for all . Define and where corresponds to the ball centered at and radius . Let . If , and , the Algorithm with batch size satisfies for all simultaneously,
With probability at least , where is a parameter that only depends on the geometry of and hides logarithmic factors in and .
The proof can be found in Appendix B. Although the guarantees of Theorem 1 require knowledge of , in practice this requirement can easily be alleviated by using any of a variety of model selection approaches such as those of Pacchiano et al. (2020b); Cutkosky et al. (2021); Abbasi-Yadkori et al. (2020); Lee et al. (2021); Pacchiano et al. (2020a), at the price of a slightly worse regret rate. In the following section we conduct extensive empirical studies of PLOT and demonstrate competitive finite-time regret on a variety of public classification datasets.
4 Experimental Results
Experiment Design and Methods
We evaluate the performance of PLOT (Google Colab: shorturl.at/pzDY7) on three binary classification problems adapted to the BLP setting. In timestep $t$, the algorithm observes context $x_t$ and classifies the point, i.e., accepts or rejects it. If the point is accepted, a reward of one is given if that point is from the target class, and minus one otherwise. If the point is rejected, the reward is zero. We focus on two datasets from the UCI Collection (Dua and Graff, 2017), the Adult dataset and the Bank dataset. Additionally we make use of MNIST (Lecun et al., 1998) (d=784). The Adult dataset is defined as a binary classification problem, where the positive class has income > $50k. The Bank dataset is also binary, with the positive class as a successful marketing conversion. On MNIST, we convert the multiclass problem to binary classification by taking the positive class to be images of the digit 5, and treating all other images as negatives. Our main metric of interest is regret, measured against a baseline model trained on the entire dataset. The baseline model is used instead of the true label, as many methods cannot achieve oracle accuracy on real-world problems even with access to the entire dataset.
We focus on comparing our method to other neural algorithms, as prior papers (Riquelme et al., 2018; Kveton et al., 2019; Zhou et al., 2020) generally find neural models to have the best performance on these datasets. In particular, we focus on NeuralUCB (Zhou et al., 2020) as a strong benchmark method. We perform a grid search over a few values of the hyperparameter of NeuralUCB, considering {0.1, 1, 4, 10}, and report results from the best value. We also consider greedy and $\epsilon$-greedy methods. For $\epsilon$-greedy, we follow Kveton et al. (2019) and give the method an unfair advantage, i.e., we use a decayed schedule, dropping to 0.1% exploration by T=2000. Otherwise, the performance is too poor to plot.
In our experiments, we set the PLOT weight parameter to 1, equivalent to simply adding the pseudo-label point to the dataset. We set the PLOT radius parameter to $\infty$, thus including all prior observed points in the training dataset. Although our regret guarantees require problem-dependent settings of these two parameters, PLOT achieves strong performance with these simple and intuitive settings, without sweeping.
For computational efficiency, we run our method on batches of data, with batch size = 32. We average results over 5 runs, running for a horizon of T = 2000 timesteps. Our dataset consists of the points accepted by the algorithm, for which we have the true labels. We report results for a two-layer, 40-node, fully-connected neural network. At each timestep, we train this neural network on the above data for a fixed number of steps. Then, a second neural network is cloned from those weights. A new dataset with pseudo-label data and historical data is constructed, and the second neural network is trained on that dataset for the same number of steps. This allows us to keep a continuously trained model which only sees true labels. The pseudo-label model only ever sees one batch of pseudo-labels. Each experiment runs on a single Nvidia Pascal GPU, and replicated experiments, distinct datasets, and methods can be run in parallel, depending on GPU availability.
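The two-model procedure described above can be sketched as follows. This is a simplified stand-in, not the experimental code: a linear logistic model replaces the two-layer network, the names `sgd_steps` and `two_model_step` are hypothetical, and the final acceptance rule (accept if the base model accepts, or if the point is a candidate and the pseudo-label model accepts) is an assumed reading of the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_steps(w, X, y, steps=200, lr=0.5):
    """Run a fixed number of full-batch gradient steps, starting from a copy of w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def two_model_step(w_base, X_hist, y_hist, X_batch, candidates):
    """One timestep: update the base model, clone it, train the clone optimistically."""
    # Base model: only ever trained on true labels of accepted points.
    w_base = sgd_steps(w_base, X_hist, y_hist)
    # Pseudo-label model: cloned from the base weights, then trained on
    # history plus the candidate points labeled as positives.
    X_aug = np.vstack([X_hist, X_batch[candidates]])
    y_aug = np.concatenate([y_hist, np.ones(int(candidates.sum()))])
    w_pseudo = sgd_steps(w_base, X_aug, y_aug)
    base_accept = sigmoid(X_batch @ w_base) >= 0.5
    pseudo_accept = sigmoid(X_batch @ w_pseudo) >= 0.5
    return w_base, base_accept | (candidates & pseudo_accept)
```

Because the clone is discarded after each step, pseudo-labels never leak into the base model's training data; only the true labels of accepted points persist.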
Analysis of Results
In the top row of Figure 1, we provide cumulative regret plots for the above datasets and methods. Our method’s cumulative regret is consistently competitive with other methods, and outperforms them on MNIST. In addition, the variance of our method is much lower than that of NeuralUCB and Greedy, showing very consistent performance across the five experiments.
The bottom row of Figure 1 provides a breakdown of the decisions made by our model. As described in Section 3, on average the pseudo-label model only acts on $\epsilon$ percent of points classified as negative by the base model. We provide the cumulative probability of acceptance of true positive and true negative points acted on by the pseudo-label model. As the base model improves, the pseudo-label model receives fewer false positives, and becomes more confident in supporting rejections from the base model. To differentiate this decaying process from the pseudo-label learning, we highlight the significant gap between the probability of accepting positives and the probability of accepting negatives in our method. This shows that the PLOT method is not simply performing a decayed $\epsilon$-greedy strategy, but rather learning for which datapoints to inject optimism into the base classifier.
Figure 1: Reward and pseudo-label accuracy are reported as a function of the timestep. One standard deviation from the mean, computed across the five experiments, is shaded.
PLOT in action.
We illustrate the workings of the PLOT algorithm by testing it on a simple XOR dataset. In Figure 2 we illustrate the evolution of the model’s decision boundary in the presence of pseudo-label optimism. In the top left panel of Figure 2 we plot samples from the XOR dataset. There are four clusters in the XOR dataset. Each of these is produced by sampling a multivariate normal with isotropic covariance with a diagonal value of . The cluster centers are set at , and . All points sampled from the red clusters are classified as and all points sampled from the black clusters are classified as . Although there are no overlaps in this picture, there is a nonzero probability that a black point may be sampled from deep inside a red region and vice versa.
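A dataset of this kind can be generated with a few lines of NumPy. The sketch below is only illustrative: the cluster centers, the variance `spread`, and the assignment of labels to corners are hypothetical placeholders chosen for the example, not the values used in the experiment.

```python
import numpy as np

def sample_xor(n_per_cluster, spread=0.5, rng=None):
    """Sample four isotropic Gaussian clusters arranged in an XOR pattern."""
    rng = np.random.default_rng() if rng is None else rng
    # Hypothetical centers: opposite corners of a square share a label.
    centers = np.array([[0.0, 5.0], [5.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
    labels = np.array([1, 1, 0, 0])
    std = np.sqrt(spread)  # spread is the diagonal value of the covariance
    X = np.vstack([c + std * rng.standard_normal((n_per_cluster, 2)) for c in centers])
    y = np.repeat(labels, n_per_cluster)
    return X, y
```

Because each cluster is Gaussian, a point of one class can occasionally land inside the region of the other class, which is the label noise mentioned above.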
5 Conclusion
We propose PLOT, a novel algorithm that provides end-to-end optimism for the bank loan problem with neural networks. Rather than post-hoc optimism separate from the neural net training, optimism is directly incorporated into the neural net loss function through the addition of optimistic pseudo-labels. We provide regret guarantees for PLOT, and demonstrate its performance on real-world problems, where its performance illustrates the value of end-to-end optimism.
Our current analysis and deployment of PLOT is focused on the bank loan problem, with binary actions, where pseudo-label semantics are most clear. Due to the deep connections with active learning and binary contextual bandits, extending this work to larger action spaces is an interesting future direction.
Although the BLP is naturally modeled with delayed feedback, PLOT assumes that rewards are received immediately, as a wide body of work exists for adapting online learning algorithms to delayed rewards (Mesterharm, 2005). Contexts (unlabeled batches of query points) do not have to be IID; their distributions may change adversarially. Handling this type of shift is a key component of PLOT’s approach. Optimism is essential to avoiding feedback loops in online learning algorithms, with significant implications for the fairness literature. We presented regret analyses here, which we hope can lay the foundation for future work on the analysis of optimism in the fairness literature.
6 Statement of Broader Impact
Explicitly incorporating optimism into neural representation learning is key to ensuring optimal exploration in the bank loan problem. Other methods for exploration run the risk of feature blindness, where a neural network loses its uncertainty over certain features. When a representation learning method falls victim to this, additional optimism is insufficient to ensure exploration. This has ramifications for fairness and safe decision making. We believe that explicit optimism is a key step forward for safe and fair decision making.
However, we do want to provide caution around our method’s limitations. Our regret guarantees and empirical results assume IID data, and may not prevent representation collapse in non-stationary and adversarial settings. Additionally, although our empirical results show strong performance in the non-separable setting, our regret guarantees only hold uniformly in the separable setting.
References
 [1] (2020) Regret balancing for bandit and rl model selection. arXiv preprint arXiv:2006.05491. Cited by: §3.
 [2] (2013) Toward a classification of finite partialmonitoring games. Theoretical Computer Science 473, pp. 77–99. Cited by: §1.1.
 [3] (2012) Partial monitoring with side information. In International Conference on Algorithmic Learning Theory, pp. 305–319. Cited by: §1.1.
 [4] (2019) Equal opportunity in online classification with partial feedback. In Advances in Neural Information Processing Systems, pp. 8972–8982. Cited by: §1.1.

 [5] (2017) An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Journal of Experimental Criminology 13 (2), pp. 193–216. Cited by: §1.
 [6] (2006) Regret minimization under partial monitoring. Mathematics of Operations Research 31 (3), pp. 562–580. Cited by: §1.1.
 [7] (1999) Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and Their Applications 14 (6), pp. 67–74. Cited by: §1.

 [8] (2021) On the theory of reinforcement learning with once-per-episode feedback. arXiv preprint arXiv:2105.14363. Cited by: Appendix B.
 [9] (2011) Unbiased online active learning in data streams. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 195–203. Cited by: §1.
 [10] (2018, 10–15 Jul) Online learning with abstention. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1059–1067. Cited by: §1.1.
 [11] (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: §1.
 [12] (2021) Dynamic balancing for model selection in bandits and rl. In International Conference on Machine Learning, pp. 2276–2285. Cited by: §3.
 [13] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §4.
 [14] (2010) Parametric bandits: the generalized linear case.. In NIPS, Vol. 23, pp. 586–594. Cited by: §1, §3.
 [15] (2020) Beyond ucb: optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199–3210. Cited by: §1.1.
 [16] (2016) Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pp. 111–122. Cited by: §1.1.

 [17] (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §1.1.
 [18] (2018) Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929–1938. Cited by: §1.1.
 [19] (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §1.
 [20] (2000) Apple tasting. Information and Computation 161 (2), pp. 85–139. Cited by: §1.1.
 [21] (2007) Review spam detection. In Proceedings of the 16th international conference on World Wide Web, pp. 1189–1190. Cited by: §1.

 [22] (2020) Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pp. 277–287. Cited by: §2.
 [23] (2014) MCDM approach to evaluating bank loan default models. Technological and Economic Development of Economy 20 (2), pp. 292–311. Cited by: §1.
 [24] (2019, 09–15 Jun) Garbage in, reward out: bootstrapping exploration in multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 3601–3610. Cited by: §4.

 [25] (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Vol. 20, pp. . Cited by: §2.
 [26] (2020) Bandit algorithms. Cambridge University Press. Cited by: §3.
 [27] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.
 [28] (2021) Online model selection for reinforcement learning with function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3340–3348. Cited by: §3.
 [29] (2014) Local path planning in a complex environment for self-driving car. In The 4th Annual IEEE International Conference on Cyber Technology in Automation, Control and Intelligent, pp. 445–450. Cited by: §1.
 [30] (2005) Online learning with delayed label feedback. In Algorithmic Learning Theory, S. Jain, H. U. Simon, and E. Tomita (Eds.), Berlin, Heidelberg, pp. 399–413. External Links: ISBN 9783540316961 Cited by: §2, §5.
 [31] (2019) Offline contextual bandits with high probability fairness guarantees. Advances in neural information processing systems 32. Cited by: §1.1.
 [32] (2021) Outside the echo chamber: optimizing the performative risk. arXiv preprint arXiv:2102.08570. Cited by: §1.1.
 [33] (2018) Randomized prior functions for deep reinforcement learning. External Links: 1806.03335 Cited by: §1.
 [34] (2020) Regret bound balancing and elimination for model selection in bandits and rl. arXiv preprint arXiv:2012.13045. Cited by: §3.
 [35] (2020) Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704. Cited by: §3.
 [36] (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent vehicles 1 (1), pp. 33–55. Cited by: §1.
 [37] (2007) Content-based recommendation systems. In The adaptive web, pp. 325–341. Cited by: §1.
 [38] (2020) Performative prediction. arXiv preprint arXiv:2002.06673. Cited by: §1.1.

 [39] (2018) Deep bayesian bandits showdown: an empirical comparison of bayesian deep networks for thompson sampling. External Links: 1802.09127 Cited by: §1.1, §1, §4.
 [40] (2007) Practical learning from one-sided feedback. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 609–618. Cited by: §1.1, §1.
 [41] (2021) Top extreme contextual bandits with arm hierarchy. arXiv preprint arXiv:2102.07800. Cited by: §1.1.

 [42] (2008) Credit card fraud detection using hidden markov model. IEEE Transactions on dependable and secure computing 5 (1), pp. 37–48. Cited by: §1.
 [43] (2018) Machine learning application in loan default prediction. Machine Learning 4 (5). Cited by: §1.
 [44] (2013) Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176 (2), pp. 565–584. Cited by: §1.
 [45] (2010) Credit rating by hybrid machine learning techniques. Applied soft computing 10 (2), pp. 374–380. Cited by: §1.
 [46] (2019) High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48, Cambridge University Press. Cited by: Appendix B.

 [47] (2010) Predicting criminal recidivism with support vector machine. In 2010 International Conference on Management and Service Science, pp. 1–9. Cited by: §1.
 [48] (2020) Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780. Cited by: §1.1.
 [49] (2020) Neural contextual bandits with ucb-based exploration. External Links: 1911.04462 Cited by: §1.1, §1, §1, Figure 1, §4.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work? Yes, in our societal impact section we discuss limitations of our work.

Did you discuss any potential negative societal impacts of your work? We have added a section on the broader impact of our method.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results?

Did you include complete proofs of all theoretical results? The appendix contains detailed proofs of our theoretical claims.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Our plots include error bars for 5 runs.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? We described the types of GPUs used, as well as the parallelism used in our method.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? We cited the MNIST and Adult dataset creators.

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL? We have included our code in the supplemental material.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? The datasets we used are public and previously published.

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? The datasets we used are public and previously published, and do not contain PII or offensive content.


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? No crowdsourcing or human subjects.

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Why Optimism?
In this section we describe the common proof template behind the principle of optimism in Stochastic Bandit problems. We illustrate this in the setting of binary classification that we work with.
As we mentioned in Section 2, we work with decision rules based on producing at time a model that is used to make a prediction of the form of the probability that point should be accepted. If , point will be accepted and its label observed, whereas if , the point will be discarded and the label will remain unseen. Here we define an optimistic algorithm in this setting:
Definition 2 (Optimistic algorithm).
We say an algorithm is optimistic for this setting if the models selected at all times satisfy for all .
We now show the regret of any optimistic algorithm can be upper bounded by the model’s estimation error,
Let’s see why inequality holds. Notice that for any optimistic model, the false negative rate must be zero. Rejection of a point may occur only for points that are truly negative. This implies the instantaneous regret satisfies
By definition only when . This observation plus the optimistic nature of the models implies that and thus inequality .
As a consequence of this discussion, we can conclude that in order to control the regret of an optimistic algorithm, it is enough to control its estimation error. In other words, finding a model that overestimates the response is not sufficient; the model’s estimation error must converge as well.
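The argument above can be checked numerically. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a reward of +1 for a repaid loan, -1 for a default, and 0 for a rejection, and an acceptance threshold of 1/2 on the predicted repayment probability. The names `p_true`, `p_hat`, `accept`, `expected_reward`, `instantaneous_regret` and `check_optimism_bound` are hypothetical. Under these assumptions, the instantaneous regret of an optimistic prediction should never exceed twice its estimation error.

```python
import random


def accept(p_hat):
    # Assumed decision rule: accept when the predicted repayment probability is at least 1/2.
    return p_hat >= 0.5


def expected_reward(p_true, accepted):
    # Assumed rewards: +1 if repaid, -1 if not repaid, 0 when the loan is rejected.
    return (2.0 * p_true - 1.0) if accepted else 0.0


def instantaneous_regret(p_true, p_hat):
    # Expected reward of the best action minus expected reward of the chosen action.
    best = max(0.0, 2.0 * p_true - 1.0)
    return best - expected_reward(p_true, accept(p_hat))


def check_optimism_bound(n=10_000, seed=0):
    # Returns the smallest value of (bound - regret) over random optimistic predictions.
    rng = random.Random(seed)
    worst_slack = float("inf")
    for _ in range(n):
        p_true = rng.random()
        # Optimistic prediction: p_hat is drawn from [p_true, 1].
        p_hat = p_true + rng.random() * (1.0 - p_true)
        regret = instantaneous_regret(p_true, p_hat)
        bound = 2.0 * (p_hat - p_true)  # twice the estimation error
        worst_slack = min(worst_slack, bound - regret)
    return worst_slack
```

Calling `check_optimism_bound()` returns a value that is nonnegative up to floating-point error, while a pessimistic prediction such as `instantaneous_regret(0.9, 0.4)` incurs positive regret that no estimation-error bound of this form controls, because the point is rejected and its label is never observed.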
Appendix B Theory: Proof of Theorem 1
In this section we prove the results stated in Theorem 1. The following property of the logistic function will prove useful.
Remark 1.
The logistic function is Lipschitz.
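For reference, the Lipschitz constant of the logistic function follows from a bound on its derivative. The sketch below writes the logistic function as $\mu$, a symbol assumed here for illustration, and records the standard fact that it is $1/4$-Lipschitz; the constant the paper relies on may be stated in a different form.

```latex
\mu(z) = \frac{1}{1 + e^{-z}}, \qquad
\mu'(z) = \mu(z)\bigl(1 - \mu(z)\bigr) \le \frac{1}{4}
\quad\Longrightarrow\quad
|\mu(a) - \mu(b)| \le \frac{1}{4}\,|a - b|
\quad \text{for all } a, b \in \mathbb{R}.
```

The bound on the derivative holds because $u(1-u) \le 1/4$ for every $u \in [0, 1]$, with equality at $u = 1/2$, that is, at $z = 0$.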
Throughout the discussion we will make use of the notation to denote the ball of radius centered around point .
In this section we will make the following assumptions.
Assumption 3 (Bounded Support ).
has bounded support. All satisfy .
Assumption 4 (Lipschitz ).
The function class is Lipschitz and contains all constant functions ( such that for ).
Assumption 5 (Gap).
For all , the values are bounded away from zero.
where .
The following supporting result regarding the logistic function will prove useful.
Lemma 1.
For , the logistic function satisfies, where and .
Proof.
The derivative of satisfies and is a decreasing function in the interval with a minimum value of .
Consider the function . It is easy to see that and that for all , therefore, is increasing in the interval and we conclude that for all . The result follows:
To prove the second direction we consider the function . Observe that and therefore since this implies the result.
∎
We will make use of Pinsker’s inequality,
Lemma 2 (Pinsker’s inequality).
Let and be two distributions defined on the universe . Then,
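For reference, the standard form of Pinsker's inequality can be written as follows. The symbols $P$, $Q$, $\mathcal{U}$, $\mathrm{TV}$ and $\mathrm{KL}$ are assumed notation, and the paper's own display may use a different normalization (for instance, the $\ell_1$ distance, which equals twice the total variation distance).

```latex
\mathrm{TV}(P, Q)
\;=\; \sup_{A \subseteq \mathcal{U} \text{ measurable}} \bigl| P(A) - Q(A) \bigr|
\;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)} .
```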
Recall the unregularized and normalized negative cross entropy loss over a dataset equals,
(3) 
We can extend this definition to the population level. For any distribution we define the unregularized normalized cross entropy loss over whose labels are generated according to a logistic model with parameter as
As an immediate consequence of the last equality, we see that when , the vector is a minimizer of the population cross entropy loss. From now on we’ll use the notation to denote the empirical distribution over datapoints given by .
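The two definitions above can be sketched in code. This is a minimal illustration that assumes the loss in (3) is the standard binary cross entropy with a logistic link; the helper names `sigmoid`, `cross_entropy`, `population_cross_entropy` and `best_prediction` are hypothetical. The last function illustrates the minimizer claim at a single context: when labels are Bernoulli with mean `p_true`, the expected loss is smallest when the predicted probability equals `p_true`.

```python
import math


def sigmoid(z):
    # Logistic function.
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(logits, labels):
    # Unregularized, normalized negative log-likelihood over a dataset
    # (assumed to be the standard binary cross entropy).
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def population_cross_entropy(p_true, q):
    # Expected loss at one context when y ~ Bernoulli(p_true) and the model predicts q.
    return -(p_true * math.log(q) + (1.0 - p_true) * math.log(1.0 - q))


def best_prediction(p_true, grid_size=999):
    # Grid search over predicted probabilities q strictly inside (0, 1).
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda q: population_cross_entropy(p_true, q))
```

For example, `cross_entropy([0.0, 0.0], [1, 0])` evaluates to log 2, the loss of an uninformative prediction of 1/2, and `best_prediction(0.7)` selects the grid point at 0.7, matching the statement that the true parameter minimizes the population loss.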
Observe also that if , for all such that , we have that as a consequence of Assumption 4,
and therefore because of Assumption 5,
Now let’s consider such that . By Assumption 5, this implies that . Similarly if such that implies that .
Let’s start by considering the case when satisfy
. Let’s for a moment assume that
for all and therefore (by Assumption 5) that . If this is the case, we will assume that
Lemma 3.
If satisfies the following properties,

it holds that .

There exists such that .

.
Then,
Satisfies, for all .
Proof.
First observe that as a consequence of the Lipschitzness of having all points in be contained within a ball of radius implies that for all the difference . In particular this also implies that . The last inequality holds because .
Let be a point in such that . By Assumption 5, and therefore, . This implies that for all , and that .
By Lemma 1 where and therefore for all . In other words, all points in should have true positive average labels with a probability gap value (away from ) of at least .
We will prove this Lemma by exhibiting an Lipschitz classifier whose loss always lower bounds the loss of any classifier that rejects any of the points. But first, let’s consider a classifier parametrized by such that for some . If this holds, the radius of and the Lipschitzness of the function class imply,
And therefore that . Similar to the argument we made for above, Lipschitzness implies,
(4) 
Combining these observations we conclude that
(5) 
Let’s now consider be a parameter such that so that for all . This is a constant classifier whose responses lie exactly midway between the lower bounds for the predictions of and .
Denote by and .
Recall that . Hence,
Notice that and therefore for all .
Recall that by Assumption 5, the gap and therefore by Lemma 1, where . Let’s try showing that . Since by assumption , this statement holds if
(6) 
for all . The optimization problem corresponding to can be considered first. Let . The derivative of w.r.t equals,