# Fast Rates for Online Prediction with Abstention

In the setting of sequential prediction of individual {0, 1}-sequences with expert advice, we show that by allowing the learner to abstain from the prediction by paying a cost marginally smaller than 1/2 (say, 0.49), it is possible to achieve expected regret bounds that are independent of the time horizon T. We exactly characterize the dependence on the abstention cost c and the number of experts N by providing matching upper and lower bounds of order log N/1-2c, which is to be contrasted with the best possible rate of √(Tlog N) that is available without the option to abstain. We also discuss various extensions of our model, including a setting where the sequence of abstention costs can change arbitrarily over time, where we show regret bounds interpolating between the slow and the fast rates mentioned above, under some natural assumptions on the sequence of abstention costs.

## Authors

• 27 publications
• 15 publications
10/29/2019

### Dying Experts: Efficient Algorithms with Optimal Regret Bounds

We study a variant of decision-theoretic online learning in which the se...
06/21/2019

### Thompson Sampling for Adversarial Bit Prediction

We study the Thompson sampling algorithm in an adversarial setting, spec...
06/18/2012

### Learning the Experts for Online Sequence Prediction

Online sequence prediction is the problem of predicting the next element...
02/20/2015

### Low-Cost Learning via Active Data Procurement

We design mechanisms for online procurement of data held by strategic ag...
04/13/2021

### Sequential Ski Rental Problem

The classical 'buy or rent' ski-rental problem was recently considered i...
07/24/2020

### A PDE Approach to the Prediction of a Binary Sequence with Advice from Two History-Dependent Experts

The prediction of a binary sequence is a classic example of online machi...
10/27/2021

### Fast rates for prediction with limited expert advice

We investigate the problem of minimizing the excess generalization error...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the problem of online prediction of individual sequences which is one of the first and well-studied online learning models (Cesa-Bianchi and Lugosi, 2006). In this setup, a learner and an environment interact in a sequence of rounds as follows. In round , the learner observes the -valued predictions of experts, denoted as . Based on these observations and possibly some use of randomness, the learner predicts and then the environment reveals . In particular, the environment can be aware of the strategy of the learner but not the random bits used for producing . Having made its prediction, the learner suffers the real-valued loss , which in this paper is chosen as the binary loss . The aim of the learner is to minimize its regret which is the difference between the total loss of the prediction strategy and the total loss of the best fixed expert chosen in full knowledge of the sequence of outcomes.

Since the naive upper bound always holds for the regret, we are aiming for better dependencies. It is well known that it is not possibly to guarantee non-trivial bounds on the regret when the learner’s predictions are deterministic. Therefore, we allow the learner to randomize its decisions and measure its performance in terms of the expected regret defined formally as

 RT=maxi∈[N]E[T∑t=1(ˆℓt−ℓt,i)],

where the expectation is taken with respect to the randomness injected by the learner and is the loss suffered by expert in round . As shown in the seminal work of Littlestone and Warmuth (1994), it is possible to construct an algorithm that achieves regret of order in this setting, and this is essentially the best one we hope for as long as the binary loss together with randomization is considered—see also the classic works of Hannan (1957), Cover (1965) and Vovk (1990) and the excellent monograph by Cesa-Bianchi and Lugosi (2006). Notably, the result holds without any assumptions on the behaviour of the environment which can be completely adversarial.

One natural direction of research is to understand when it is possible to improve on the worst-case regret guarantees mentioned above. Of particular interest are bounds that are independent on the time horizon . Bounds of this type are sometimes refereed to as the fast rates in online learning (to be contrasted with the worst-case slow rate

mentioned above). There are several conditions on the loss function or the sequence of outcomes that are known to imply fast rates. A classic result due to

Haussler, Kivinen, and Warmuth (1998) and Vovk (1990) claims that if the learner is allowed to output and the loss function is convex with sufficient curvature, then there exists a deterministic prediction strategy guaranteeing . A second type of assumptions that can lead to fast rates even in the case of the binary loss is that the outcomes are generated in an i.i.d. manner, rather than being chosen by an adversarial environment. In this favourable setup, assuming that the set of experts and the distribution of satisfy the so-called Bernstein assumption (introduced originally in the statistical learning literature by Bartlett and Mendelson, 2006), Koolen et al. (2016) show that it is possible to obtain intermediate rates between and depending on the parameters in the Bernstein assumption. Similar results are well known in the statistical learning setup (see, e.g., Tsybakov, 2004). For an extensive survey on fast rates in online learning, we refer the interested reader to van Erven et al. (2015).

In this paper we show that a simple variation on the basic protocol of online binary prediction also permits regret of : instead of forcing the learner to output a prediction in , we also allow the learner to abstain from prediction. This model is sometimes referred to as Chow’s reject option model, introduced in the seminal work of Chow (1970). This setup was intensively studied in statistical learning where the learner can output one of three values and the value corresponds to an abstention (see, e.g., the works of Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Bousquet and Zhivotovskiy, 2019). In this model the loss is assumed to be binary on outputs but the price for abstention is assumed to be equal to . This means that the learner may profit from using the option to abstain whenever they believe that there is no way to predict better than by flipping an unbiased coin. We only assume that is marginally smaller than (say, 0.49), so that there is no reason for the learner to abstain too often if they want to have a small regret. [roundcorner=12pt,linewidth=1pt,leftmargin=2cm,rightmargin=2cm,topline=true,bottomline=true,skipabove=6pt]

The protocol of online binary prediction with abstentions

For each round , repeat:

1. The learner observes the -valued predictions of experts, denoted as ,

2. based on these observations and possibly some use of randomness, the learner predicts ,

3. the environment reveals ,

4. the learner suffers the loss if , otherwise a loss of if .

We stress that even when deciding to abstain, the learner still gets to observe the true outcome. The first thing to notice regarding the hardness of our setting is that the learner still needs to randomize its decisions in order to achieve sublinear regret. Specifically, one can easily adapt the classic counterexample of Cover (1965) to show that any deterministic strategy will result in a regret of at least for the learner—we provide the details in Appendix A.

We are not the first to explore the effects of allowing abstentions in the context of online learning. The most notable contribution is the work of Cortes, DeSalvo, Gentile, Mohri, and Yang (2018), who extended the model of Herbei and Wegkamp (2006); Bartlett and Wegkamp (2008) from the setting of statistical learning to that of online learning and provided a range of results for both stochastic and adversarial environments. A crucial feature of their setup is that abstention is an action available to the experts: each expert produces a prediction in , which is then used by the learner for producing its their own randomized action. Furthermore, when deciding to abstain, the learner does not get to observe the outcome produced by the environment. In contrast, our setting treats abstention as an action only available to the learner, and the true outcome is always revealed to the learner at the end of the round independently of the learner’s decision. Thus, the setting of Cortes et al. (2018) is in many senses more complicated than ours, and as a result their regret guarantees are weaker: they only match the worst-case rate of . It is unclear if it is possible to refine their techniques to obtain fast rates comparable to the ones we prove. Another work related to ours is that of Zhang and Chaudhuri (2016), who considered an online prediction problem with an abstention option under a realizability assumption and noted the possibility of adapting Chow’s reject option model to the sequential prediction setting as an interesting direction of future work—which is precisely what we address in this paper.

Our pursuit of fast rates is motivated by the recent results of Bousquet and Zhivotovskiy (2019) who have shown for the first time that allowing the learner to abstain from prediction enables fast rates in statistical learning

. Notably, their results were obtained without any further assumptions on the data distribution (i.e., no Bernstein condition) or on the curvature of the loss function (i.e., binary loss). Their approach is based on an application of the empirical risk minimization principle on an extended set of classifiers, resulting in a

deterministic algorithm guaranteeing an excess risk of at most when the set of classifiers is finite. While our results are of similar flavor, the approach that we put forward in this paper is different in almost every respect from theirs.

#### Our contributions.

Our main contribution the following: For a fixed abstention cost we show in Section 2 that there is an algorithm with the regret bound

 RT≤logN1−2c∧√TlogN2.

We also prove that our bound is tight up to multiplicative constant factors if is bounded away from zero (Section 4). Assuming that the abstention cost changes in time, we prove intermediate rates between and in Section 3. Finally, we provide two natural extensions of our main result in Section 5.

#### Notation.

We define and . We also use the standard notation. The indicator of the event is denoted by . Given integer we define .

## 2 Algorithm and main result

Our algorithm is based on the classic exponentially weighted forecaster of Littlestone and Warmuth (1994) (see also Vovk, 1990; Freund and Schapire, 1997), and is described as follows. In each round , the algorithm computes the set of weights

 wt,i=e−η∑t−1k=1ℓk,i

for some learning rate

, the corresponding probability distribution over all experts

 qt,i=wt,i∑jwt,j,and the aggregated % predictionpt=∑iqt,iyt,i.

The crucial step in the algorithm design is generating a randomized -valued prediction based on . The key intuition is that the learner should abstain whenever and output a deterministic prediction whenever . We choose to linearly interpolate between these extreme cases by producing a randomized output supported on when and otherwise. Precisely, our algorithm will base its decision on the probability of the most likely label : it will abstain with probability , and output the most likely label with probability . The procedure is compactly presented as Algorithm 1.

Note that the abstention probability is in due to the fact that . To gain some intuition about this rule, observe that when assigning the numerical value to abstention, the resulting prediction has expectation . Our main result regarding the performance of our algorithm is the following: Suppose that and . Then, the expected regret of Algorithm 1 satisfies

 RT≤logNη.

Notably, this theorem shows that whenever the abstention cost is bounded away from , we can set and our algorithm achieves a regret bound that is independent of the time horizon . However, when is very close to , one may favor to fall back to the standard worst-case regret guarantee of order . This is, however, easily achieved by choosing a conservative value of for this unfavorable case. The following simple corollary of Theorem 2 summarizes the rates achieved by our algorithm in all regimes. Setting , the regret of our algorithm satisfies

 RT≤logN1−2c∧√TlogN2.

We provide the simple proofs of both Theorem 2 and Corollary 2 below. A central object in our analysis is the quantity

 ˜ℓt=−1ηlog∑iqt,ie−ηℓt,i, (1)

often called the mix loss in the literature (Vovk, 1990, 1998; De Rooij et al., 2014). The following classic result highlights the key role of the mix loss in the analysis of the exponentially weighted forecaster: For any the cumulative mix loss of the exponentially weighted forecaster satisfies

 T∑t=1˜ℓt≤miniT∑t=1ℓt,i+logNη.

We include the proof in Appendix B for the sake of completeness. The heart of our analysis is the following lemma that establishes a connection between the mix loss and the loss suffered by our algorithm: Suppose that and . Then, . Let us define the misclassification probability of the plain exponentially weighted forecaster as . With this notation, the mix loss can be written as

 −1ηlog∑iqt,ie−ηℓt,i =−1ηlog∑iqt,i(1+(e−η−1)I{yt,i≠yt})=−1ηlog(1+rt(e−η−1)).

On the other hand, the expected loss of our algorithm is

 E[ˆℓt]=αtc+(1−αt)I{k∗t≠yt}=2(1−p∗t)c+(2p∗t−1)I{k∗t≠yt}.

In the case , we have so the above becomes

 E[ˆℓt]=2(1−p∗t)c=2rtc=rt−(1−2c)rt.

Otherwise, we have so that the expected loss is

 E[ˆℓt]=2(1−p∗t)c+2p∗t−1=2(1−rt)c+2rt−1=rt−(1−2c)(1−rt).

Also noticing that holds if and only if , we conclude that

 E[ˆℓt]=rt−(1−2c)(rt∧(1−rt)).

To finish the proof, we define the functions

 f(r)=−1ηlog(1+r(e−η−1))% andg(r)=r−(1−2c)(r∧(1−r)), (2)

and note that and . We will show that for all , which will imply the statement of the theorem. To this end, note that both functions are convex and equal to each other at and , so the desired inequality will hold if the respective derivatives at these two points satisfy . To verify this, first observe that

On the other hand, we have

 g′(r)={2cif r=0,2(1−c)if r=1.

Thus, the condition on the derivatives we seek is satisfied when

 1−e−ηη≥2candeη−1η≤2(1−c).

Also observing that

 1−e−ηη≥1−ηandeη−1η≤1+η,

both hold for , we can verify that the conditions above are satisfied whenever , as required in the statement of the theorem. This concludes the proof. Putting Lemmas 2 and 2 together proves Theorem 2. Corollary 2 is proved by first observing that

 E[ˆℓt]=rt−(1−2c)(rt∧(1−rt))≤rt=∑iqt,iℓt,i,

holds as long as , and then using Hoeffding’s lemma (Lemma 2.2 in Cesa-Bianchi and Lugosi, 2006) that guarantees

 ∑iqt,iℓt,i≤˜ℓt+η8.

Putting this inequality together with Lemma 2, we retrieve the standard bound of the exponentially weighted forecaster: . Combining this bound with the one of Theorem 2 and tuning proves the corollary.

## 3 Changing abstention costs

Let us now consider a slight variation on our problem where the abstention cost may depend on time: in round , the cost of abstention is . The main question we address in this section is whether it is possible to attain fast rates even in the case where can get arbitrarily close to in a small number of rounds. It is easy to see that our algorithm given in Section 2 satisfies the following regret bound with its proof presented in Appendix B. Suppose that for all and . Then, the regret of our algorithm satisfies

 RT≤logNη+η8T∑t=1I{1−2ct<η}.

Notably, this result asserts that it is not necessary to assume that is strictly bounded away from in order to obtain an improvement over the worst-case regret bound. In fact, we show that a clear improvement is possible when making the following quantitative assumption about the behavior of the abstention costs around : [Tsybakov’s condition for abstention costs] Assume that there are constants such that for any ,

 1TT∑t=1I{1−2ct

As the name suggests, this condition is directly inspired by the renowned Tsybakov’s margin assumption introduced first by Mammen and Tsybakov (1999) in the context of binary classification and which was further analyzed by Tsybakov (2004). In the context of classification where one has to predict a random label based on a random instance , Tsybakov’s assumption is used to characterize the distribution of the regression function around , which is related to the hardness of the classification problem at hand. Indeed, instances such that the regression function is further away from are “easier” to classify than ones closer to this value, so restricting the density of around this problematic region can make the overall learning problem easier—for more details, and relations of Tsybakov’s assumption with the so-called Bernstein assumption we refer the interested reader to van Erven et al. (2015). Similarly, our own online prediction problem becomes “harder” when the abstention costs are densely distributed around . By analogy with the case of binary classification, we choose to quantify the density of the quantity of interest around through Tsybakov’s assumption: larger values of correspond to less frequent values around the margin, whereas smaller values correspond to more problematic points. Similarly, the situation where the abstention cost is strictly bounded away from can be seen as an analog of Massart’s margin assumption in the statistical learning setup (Massart and Nédélec, 2006).

Assuming that the costs satisfy the inequality (3), our algorithm can be easily seen to achieve the following result: Suppose that the sequence of abstention costs satisfies Tsybakov’s margin condition. Then, setting , the regret of our algorithm satisfies

 RT=O((logN)12−αT1−α2−α). (4)

The proof is immediate given Proposition 3 and the definition of Tsybakov’s condition. We remark that the bound of the very same form as (4) has been recently shown in (Koolen et al., 2016) under the assumption that the losses are i.i.d. and that the Bernstein assumption is satisfied. However, our assumption is different and does not imply the Bernstein assumption, even if it leads to similar regret bounds.

One downside of Corollary 3 is that it relies on a choice of the learning rate that requires prior knowledge of the sequence of abstention costs. More generally, the learning rate that minimizes the bound of Proposition 3 is clearly a function of the abstention costs. One may wonder if it is possible to attain a regret guarantee comparable to

 R∗T=minη>0(logNη+η8T∑t=1I{1−2ct<η}),

without having prior knowledge of the sequence of abstention costs.

We answer this question in the positive by considering a simple variation of our algorithm based on the exponentially weighted forecaster with adaptive learning rates. Precisely, in each round , we will choose a positive learning-rate parameter , compute the weights , and then use these weights in the same way as our basic algorithm did. The intuition driving our algorithm design is to keep the learning rate as large as possible, and only decrease it when observing high abstention costs. Specifically, we let be the number of times that the learning rate has exceeded before round , and define our learning rate as , with defined as . The full algorithm is shown as Algorithm 2.

The following theorem (proved in Appendix B) establishes a regret bound for this algorithm. The regret of Algorithm 2 satisfies .

## 4 Lower bound

In this section we prove that the regret is optimal in the fixed abstention cost setup. Precisely, the main result we present here is the following non-asymptotic lower bound which shows that the bound of Theorem 2 is sharp with respect to and provided that is separated away from zero: Fix the abstention price , and . There is a set of experts of size such that for any randomized online prediction strategy there is a strategy of the environment such that

 RT=Ω(logN1−2c∧√TlogN).

The proof of this result is based on observing that any randomized online algorithm can be converted to a randomized batch algorithm in a way that the excess risk in the latter setting is closely related to the regret in the former setting. Thus, any lower bound in the batch setting will imply a lower bound on the best achievable regret. An additional crucially important ingredient is relating the cost of abstention to a suitably chosen strongly convex loss function.

First, we recall the standard statistical learning setup. In this setting we consider the instance space and the label space , where is a constant. Let be an i.i.d. sequence of points sampled according to some unknown distribution on . We will refer to this sequence as the learning sample and to this setup as the batch setting. Given the loss function , the risk of a hypothesis is , where the expectation is taken with respect to . We denote and . We need the following standard result which can be found in, e.g., Audibert (2009).

[Online to batch conversion] Consider any randomized online learning algorithm that produces the possibly randomized 111To simplify the notation we do not write the parameters responsible for randomization explicitly in the functions. hypothesis at time when the environment uses an i.i.d. sequence . We define the randomized batch algorithm that, given the data and any new observation , predicts its label according to the hypothesis chosen uniformly at random among . We denote this randomized hypothesis by . Then, the risk of satisfies

 E[ℓ(ˆgT(X),Y)]=1TE[T∑t=1ℓ(ht(Xt),Yt)],

where the expectation on the right-hand side integrates over the random hypothesis chosen by , the instances , and the randomization of the online algorithm. Our main technical tool is the following lower bound in the batch setting. [Theorem 8.4 in Audibert (2009)] Let and , if contains at least points, there exists a set of hypotheses satisfying for any (possibly randomized 222We note that the original version of this result does not highlight that is potentially randomized. Nevertheless, the standard argument (see, e.g., the discussions on Yao’s minimax principle in Shamir (2015) or Remark 8.1 in Audibert (2009) claiming that any lower bound for the batch setting leads to a lower bound in the sequential prediction setting) shows that Lemma 4 continues to hold in the same form.

there exists a probability distribution on such that

 E[|Y−ˆg(X)|q]−ming∈GE[|Y−g(X)|q]≥(q90(q−1)∨e−1)Bq(⌊log2N⌋T+1∧1),

where the expectation in is taken with respect to the learning sample, and the (possible) randomness of the algorithm.

With these results in mind we are ready to provide some intuition on the proof of Theorem 4. For it appears that the construction in Lemma 4 uses only the distributions such that and the set of hypotheses taking their values in . Using Lemma 4 based on our online learning algorithm we may construct an estimator taking its values in where the value corresponds to an abstention. Indeed, it is clear that as long as we consider only outputs the loss is equivalent to binary. However, if predicts then for the loss will be equal to for any as above. The full proof of Theorem 4 is presented in Appendix B.

## 5 Extensions

In this section we provide two natural extensions of our results. First, we discuss that the same results are possible for multiclass classification and then we extend our main results to the case where the set of experts is infinite.

### 5.1 Multiclass classification

We now turn our attention to a generalization of our online prediction setup that allows multiple classes. Specifically, we will now consider the case where the outcomes can take different values, and the learner’s prediction is evaluated through the binary loss. Formally, our setup can be described as follows: [roundcorner=12pt,linewidth=1pt,leftmargin=2cm,rightmargin=2cm,topline=true,bottomline=true,skipabove=6pt]

The protocol of online multiclass prediction with abstentions

For each round , repeat:

1. The learner observes the -valued predictions of experts, denoted as ,

2. based on these observations and possibly some randomness, the learner predicts ,

3. the environment reveals ,

4. the learner suffers the loss if , otherwise a loss of if .

Our algorithm described in the previous sections can be extended to this setup in a straightforward way. Specifically, let us define and for each expert in the same way as before, and define the aggregated probability assigned to class as . Given the above notation, Algorithm 1 can be adapted to the multiclass case by making the following adjustments:

1. Redefine and , that is, the label attaining the highest probability, and

2. let and predict with probability and with probability .

Note that whenever , the algorithm abstains with probability . While this may appear surprising at first sight, it is easily justified by observing that the probability of predicting incorrectly in such cases results in a loss greater than , thus one can only win by abstaining by our assumption that . We also note that the lower bound of Theorem 4 continues to hold in the multiclass setting since one can clearly embed a binary classification problem into this more general problem setting. Thus, we cannot expect to have rates faster than in the multiclass case either. The following theorem (proved in Appendix B) establishes an upper bound of this exact order, thus matching the lower bound. Suppose that and . Then, the expected regret of our algorithm for multiclass classification satisfies

 RT≤logN1−2c∧√TlogN2.

### 5.2 Infinite sets of experts

Let us return to the case of binary classification with the twist that we now allow the set of hypotheses to be infinite. As shown by Ben-David, Pál, and Shalev-Shwartz (2009), the complexity of online prediction in this case is characterized by the so-called Littlestone dimension of the class of hypothesis—for the precise definition, we refer to Ben-David et al. (2009). Assuming that the Littlestone dimension of the class is finite, Ben-David et al. show that a clever extension of the exponentially weighted forecaster satisfies the regret bound .

In our setup we provide the following regret bound, and defer the proof to Appendix B. Assume that has finite Littlestone dimension and , and that the price of abstention is . Then, there is a randomized prediction strategy satisfying

 RT≤LlogeTL1−2c∧ ⎷LTlog(eTL)2.

In particular, whenever is separated from the regret scales as which is much better than the standard dependence. We also remark that this result can be seen as an online analog of Theorem 2.2 in Bousquet and Zhivotovskiy (2019), with the Littlestone dimension replacing the VC dimension therein. However, as we mentioned, their algorithm and analysis are based on completely different techniques that take their roots in empirical process theory.

## 6 Concluding remarks and directions of future research

For the first time in the context of online learning, we have shown that endowing the learner with the option of abstaining from prediction allows a significant improvement in the best achievable rates, and that a very natural algorithm can attain the optimal rates. Notably, these results are proved without making any assumption about the sequence of outcomes that the learner has to predict or about the curvature of the loss functions. In this view, our results represent a new flavor of fast rates that, to the best of our knowledge, haven’t been explored before in the online learning literature. Arguably, one key factor that makes these fast rates possible is that the learner has access to an action that the comparator strategy cannot use. However, this feature alone cannot explain our results: since we only require the extra action to be marginally better than a uniform random prediction, the learner cannot abuse it to easily beat the best expert. How to characterize properties of such powerful additional actions for the learner in more general convex optimization problems remains to be seen.

Our key technical contribution is observing that the expected loss of our algorithm is upper bounded by the mix loss. This technique played a crucial role in proving a number of classic results in the literature, and was traditionally enabled by directly making an assumption about the curvature of the loss function evaluating the learner’s predictions. To our knowledge, our work is the first to make use of this technique without making direct assumptions about the loss. Rather, our approach is based on transforming the output of the exponential weights algorithm into a -valued prediction in a way that effectively replaces the linear loss by a mixable one from the perspective of the algorithm producing . Since this transformation can be applied to the output of any algorithm, we find it plausible that our rates could be improved by a more refined prediction algorithm such as the Aggregating Forecaster of Vovk (1998, 2001). Another possible improvement to pursue is proving high-probability versions of our upper bounds.

Another important technique we employed for proving our bounds in Section 3 was an adaptive stepzise schedule that decreases the learning rate in each round when it proves too large for the loss to be mixable. This technique bears a vague resemblance to the AdaHedge algorithm of Van Erven et al. (2011); De Rooij et al. (2014) that uses a learning rate proportional to the cumulative gap between the expected loss and the mix loss. However, their analysis is restricted to the case where this gap is nonnegative, whereas our bounds crucially use that they can be often negative. As a result, our technique can be directly adapted to account for more general mixable losses with time-dependent mixability parameters. Similarly, our variant of Tsybakov’s condition on the abstention costs can be adapted to mixability parameters to characterize the hardness of learning with a non-stationary sequence of loss functions. For instance, one can easily prove the following result by adjusting our proof techniques in a straightforward way: Consider the problem of predicting a sequence of outcomes in under a sequence of loss functions , that satisfy the following Tsybakov’s type margin assumption for : for any , holds for some . Then, there is a deterministic prediction strategy satisfying . This result uses a variant of Tsybakov’s condition that characterizes the behaviour of the second derivative of the loss around zero. The statement is proven by observing that the expected loss is upper-bounded by the mix loss whenever , and then closely following the proof steps in Section 3—we leave this as an easy exercise for the reader. The result can be seen to address an open question of Haussler et al. (1998) about whether intermediate rates between and are attainable in online prediction under suitable curvature assumptions on the loss functions. It also bears resemblance to the results of Bartlett, Hazan, and Rakhlin (2008) who provide algorithms that adapt to the curvature of the losses in more general online convex optimization problems, the key difference being that the curvature of our losses can be quantified in terms of exp-concavity rather than strong convexity as done by Bartlett et al..

Finally, we note that while our techniques are different from those of Bousquet and Zhivotovskiy (2019) who prove analogous results for the setting of statistical learning, the two analyses actually related in a very subtle way: both of them are based on techniques that are traditionally used for proving fast rates for losses with curvature, in their respective settings. While our approach is based on bounding the mixability gap as discussed above, the one of Bousquet and Zhivotovskiy is based on an extension of the recently proposed technique of Mendelson (2019) for statistical learning with the square loss. Our lower bound also makes use of this connection. Deeper investigation of this connection between curvature and the abstention may lead to further interesting results.

## References

• Audibert [2009] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.
• Bartlett and Mendelson [2006] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.
• Bartlett and Wegkamp [2008] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss.

Journal of Machine Learning Research

, 9(Aug):1823–1840, 2008.
• Bartlett et al. [2008] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2008.
• Ben-David et al. [2009] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.
• Bousquet and Zhivotovskiy [2019] Olivier Bousquet and Nikita Zhivotovskiy. Fast classification rates without standard margin assumptions. arXiv preprint arXiv:1910.12756, 2019.
• Cesa-Bianchi and Lugosi [2006] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
• Chow [1970] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41–46, 1970.
• Cortes et al. [2018] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. In Proceedings of the 35th International Conference on Machine Learning, pages 1059–1067, 2018.
• Cover [1965] Thomas M. Cover. Behavior of sequential predictors of binary sequences. Technical report, Trans. 4th Prague Conf. Information Theory Statistical Decision Functions, Random Processes, 1965.
• De Rooij et al. [2014] Steven De Rooij, Tim Van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. The Journal of Machine Learning Research, 15(1):1281–1316, 2014.
• Freund and Schapire [1997] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
• Hannan [1957] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the theory of games, 3:97–139, 1957.
• Haussler et al. [1998] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
• Herbei and Wegkamp [2006] Radu Herbei and Marten H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.
• Koolen et al. [2016] Wouter M Koolen, Peter Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457–4465, 2016.
• Littlestone and Warmuth [1994] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
• Mammen and Tsybakov [1999] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
• Massart and Nédélec [2006] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
• Mendelson [2019] Shahar Mendelson. An unrestricted learning procedure. J. ACM, 66(6), 2019.
• Shamir [2015] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. The Journal of Machine Learning Research, 16(1):3475–3486, 2015.
• Tsybakov [2004] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
• Van Erven et al. [2011] Tim Van Erven, Wouter M. Koolen, Steven D Rooij, and Peter Grünwald. Adaptive hedge. In Advances in Neural Information Processing Systems, pages 1656–1664, 2011.
• van Erven et al. [2015] Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16(54):1793–1861, 2015.
• Vapnik and Chervonenkis [1968] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Proc. USSR Acad. Sci., 1968.
• Vovk [1990] Vladimir Vovk. Aggregating strategies.

Proc. of Computational Learning Theory

, 1990.
• Vovk [1998] Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.
• Vovk [2001] Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
• Zhang and Chaudhuri [2016] Chicheng Zhang and Kamalika Chaudhuri. The extended Littlestone’s dimension for learning with mistakes and abstentions. In Conference on Learning Theory, pages 1584–1616, 2016.

## Appendix A Why do we need to randomize?

Consider the case of two experts such that one of them always predicts zero and the other always predicts one. Assume that the environment waits for the learner’s prediction and then provides the opposite label as the true label if the learner predicts zero or one. If the learner decides to abstain the environment provides the label according to the prediction of the best expert on rounds. In what follows, we show that the regret of any deterministic strategy with abstention will be at least Assume that out of rounds the learner used the abstention times and on the remaining rounds they predicted or . The total loss suffered by the learner will be equal to . Now we need to upper bound the loss of the best expert. The simple upper bound is . Now, since the environment is aware of the deterministic strategy of the learner we can always guarantee that it reveals the labels such that the loss of the best expert on these rounds is equal to zero. This implies the number of mistakes made by the best expert is at most . Considering the case we have that the regret is at least . If we have that the regret is at least .

## Appendix B Omitted proofs

### Proof of Lemma 2

The proof is based on studying the evolution of the cumulative weights . Defining and for all , we have

 −miniLT,i =mini1ηloge−ηLT,i≤1ηlog∑ie−ηLT,i=1ηlogWT+1=1ηT∑t=1logWt+1Wt+logW1η =1ηT∑t=1log∑iwt,ie−ηℓt,i∑jwt,j+logNη=1ηT∑t=1log∑iqt,ie−ηℓt,i+logNη.

Reordering gives the result.

### Proof of Proposition 3

The proof is based on considering two types of rounds depending on the value of . In the case where is such that is satisfied, we have by Lemma 2. Otherwise, we use the general bound as in the proof of Corollary 2. The proof is concluded by appealing to Lemma 2.

### Proof of Theorem 2

We begin by studying some properties of the learning rate optimizing the function , where . We observe that is lower-semicontinuous with at most discontinuities, and thus it takes its minimum either at one of its discontinuities or when its derivative is zero. In either case, satisfies

 logNη∗≥η∗τ(η∗)8,

with equality when the minimum is achieved where the function is continuous. Note that this implies that is bounded as . Furthermore, the bound also entails .

Regarding the regret of our algorithm, standard arguments can be used to prove the bound

 RT≤logNηT+1+T∑t=1(E[ˆℓt]−˜ℓt),

which actually holds for general nonincreasing learning-rate sequences (see, e.g., Lemma 2 in De Rooij et al., 2014). By the same argument as used to prove Corollary 2, each term in the sum on the right-hand side of the above bound can be bounded as

 E[ˆℓt]−˜ℓt≤ηt8I{1−2ct<ηt},

so that the sum itself can be bounded as

 T∑t=1(E[ˆℓt]−˜ℓt)≤T∑t=1ηt8I{1−2ct<ηt}≤18dT∑k=1√logNk≤14√dTlogN,

where we used that increases precisely whenever , and that . Now notice that by monotonicity of , we have , which implies that there exists a such that holds for all , so that we have

 ηt=√logNdt≤√logNd∗=η∗.

After round , the remaining number of times that our algorithm updates the learning rate is at most , which implies the bound , and the cumulative mixability gap is bounded as

 T∑t=1(E[ˆℓt]−˜ℓt)≤14√dTlogN≤3logN4η∗+√logN4.

On the other hand, we have

 logNηT+1=√dTlogN≤3logNη∗+√logN.

Thus, the total regret of our algorithm is bounded as

 15logN4η∗+5√logN4.

Comparing this bound with the lower bound on concludes the proof.

### Proof of Theorem 4

Our trick will be to construct a randomized prediction algorithm in the batch setup using an algorithm for our main setting through the online-to-batch conversion described in Lemma 4. First, we fix and observe that it follows from the proof of Lemma 4 that the class of functions used in the lower bound is a cube of -valued functions consisting of unique functions and that we are only interested in distributions with . For the exact details implying these two facts we refer to the definition of the hypercube of probability distributions (Definition 8.1 in Audibert, 2009) and the first lines of the proof of Theorem 8.3 and 8.4 where Audibert sets . Finally, his proof uses and this implies that the among all measurable functions is achieved by some -valued function for each distribution in the hypercube of distributions.

With these important observations in mind we will construct the following game of prediction with expert advice. For each round , the environment draws , sets , and each expert (we can naturally identify with experts) predicts .

This setup allows us to use our online learning algorithm to produce a randomized hypothesis defined on and taking its values in and matching the bound of Lemma 4. For each matching the conditions of Lemma 4 by running on the sequence . For any , produces its prediction by feeding the expert predictions to , obtaining a random prediction , and finally mapping to an element of according to , and . Observe that since the following holds if we fix ,

 |Y−^g(x)|q={I{^g(x)≠Y},if ^g(x)=±12;c,if ^g(x)=0.

At the same time, for any , it holds that . To conclude, we have in this setup that the loss of is binary whenever the output is or equal to whenever the output is equal to zero regardless of the value of . Using Lemma 4 we have

 E[|Y−^g(X)|q]=1TE[T∑t=1ˆℓt]andming∈GE[|Y−g(X)|q]=ming∈GE[I{g(X)≠Y}]. (5)

Now, provided that for our choice we have and

 12−12(q−1)log2≤12q≤12−14(q−1)log2,

which implies

 14(q−1)log2≤12−c≤12(q−1)log2. (6)

Since in order to apply Lemma 4 we need