Efficient Transductive Online Learning via Randomized Rounding

06/13/2011 ∙ by Nicolò Cesa-Bianchi (Università degli Studi di Milano) and Ohad Shamir (Microsoft Research)

Most traditional online learning algorithms are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, tailored for transductive settings, which combines "random playout" and randomized rounding of loss subgradients. As an application of our approach, we present the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.


1 Introduction

Online learning algorithms, which have received much attention in recent years, enjoy an attractive combination of computational efficiency, lack of distributional assumptions, and strong theoretical guarantees. However, it is probably fair to say that at their core, most of these algorithms are based on the same small set of fundamental techniques, in particular mirror descent and regularized follow-the-leader (see for instance [14]).

In this work we revisit, and significantly extend, an algorithm which uses a completely different approach. This algorithm, known as the Minimax Forecaster, was introduced in [9, 11] for the setting of prediction with static experts. It computes minimax predictions in the case of known horizon, binary outcomes, and absolute loss. Although the original version is computationally expensive, it can easily be made efficient through randomization.

We extend the analysis of [9] to the case of non-binary outcomes and arbitrary convex and Lipschitz loss functions. The new algorithm is based on a combination of "random playout" and randomized rounding, which assigns random binary labels to future unseen instances, in a way depending on the loss subgradients. Our resulting Randomized Rounding (R2) Forecaster has a parameter trading off regret performance and computational complexity, and runs in polynomial time (for $T$ predictions, it requires computing $O(T^2)$ empirical risk minimizers in general, as opposed to $O(T)$ for generic follow-the-leader algorithms). The regret of the R2 Forecaster is determined by the Rademacher complexity of the comparison class. The connection between online learnability and Rademacher complexity has also been explored in [2, 1]. However, these works focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms. The idea of "random playout", in the context of online learning, has also been used in [16, 3], but we apply this idea in a different way.

We show that the R2 Forecaster can be used to design the first efficient online learning algorithm for collaborative filtering with trace-norm constrained matrices. While this is a well-known setting, a straightforward application of standard online learning approaches, such as mirror descent, appears to give only trivial performance guarantees. Moreover, our regret bound matches the best currently known sample complexity bound in the batch distribution-free setting [21].

As a different application, we consider the relationship between batch learning and transductive online learning. This relationship was analyzed in [16], in the context of binary prediction with respect to classes of bounded VC dimension. Their main result was that efficient learning in a statistical setting implies efficient learning in the transductive online setting, but at an inferior rate of $O(T^{3/4})$ (where $T$ is the number of rounds). The main open question posed by that paper is whether a better rate can be obtained. Using the R2 Forecaster, we improve on those results, and provide an efficient algorithm with the optimal $O(\sqrt{T})$ rate, for a wide class of losses. This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well.

We emphasize that the R2 Forecaster requires computing many empirical risk minimizers (ERMs) at each round, which might be prohibitive in practice. Thus, while it does run in polynomial time whenever an ERM can be efficiently computed, we make no claim that it is a "fully practical" algorithm. Nevertheless, it seems to be a useful tool for showing that efficient online learnability is possible in various settings, often working in cases where more standard techniques appear to fail. Moreover, we hope the techniques we employ might prove useful in deriving practical online algorithms in other contexts.

2 The Minimax Forecaster

We start by introducing the sequential game of prediction with expert advice (see [10]). The game is played between a forecaster and an adversary, and is specified by an outcome space $\mathcal{Y}$, a prediction space $\mathcal{D}$, a nonnegative loss function $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$, which measures the discrepancy between the forecaster's prediction and the outcome, and an expert class $\mathcal{F}$. Here we focus on classes of static experts, whose prediction at each round $t$ does not depend on the outcomes of previous rounds. Therefore, we think of each $f \in \mathcal{F}$ simply as a sequence $f = (f_1, \ldots, f_T)$ where each $f_t \in \mathcal{D}$. At each step $t = 1, 2, \ldots$ of the game, the forecaster outputs a prediction $p_t \in \mathcal{D}$ and simultaneously the adversary reveals an outcome $y_t \in \mathcal{Y}$. The forecaster's goal is to predict the outcome sequence almost as well as the best expert in the class $\mathcal{F}$, irrespective of the outcome sequence $y_1, y_2, \ldots$. The performance of a forecasting strategy is measured by the worst-case regret

$V_T(\mathcal{F}) \;=\; \sup_{y_1, \ldots, y_T \in \mathcal{Y}} \left( \sum_{t=1}^{T} \ell(p_t, y_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t) \right) \qquad (1)$

viewed as a function of the horizon $T$. To simplify notation, let $L^*(y_1, \ldots, y_T) = \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t)$ denote the cumulative loss of the best expert in hindsight.

Consider now the special case where the horizon $T$ is fixed and known in advance, the outcome space is $\mathcal{Y} = \{-1, +1\}$, the prediction space is $\mathcal{D} = [-1, +1]$, and the loss is the absolute loss $\ell(p, y) = |p - y|$. We will denote the regret in this special case as $\mathrm{Reg}_T(\mathcal{F})$.

The Minimax Forecaster (which is based on work presented in [9] and [11]; see also [10] for an exposition) is derived by an explicit analysis of the minimax regret $\inf \mathrm{Reg}_T(\mathcal{F})$, where the infimum is over all forecasters producing at round $t$ a prediction $p_t$ as a function of $y_1, \ldots, y_{t-1}$. For general online learning problems, the analysis of this quantity is intractable. However, for the specific setting we focus on (absolute loss and binary outcomes), one can get both an explicit expression for the minimax regret, as well as an explicit algorithm, provided $L^*(y_1, \ldots, y_T)$ can be efficiently computed for any sequence $y_1, \ldots, y_T$. This procedure is akin to performing empirical risk minimization (ERM) in statistical learning. A full development of the analysis is out of scope, but is outlined in Appendix A. In a nutshell, the idea is to begin by calculating the optimal prediction in the last round $T$, and then work backwards, calculating the optimal prediction at round $T-1$, and so on. Remarkably, the value of the minimax regret is exactly $\mathcal{R}_T(\mathcal{F})$, the Rademacher complexity of the class $\mathcal{F}$, which is known to play a crucial role in understanding the sample complexity in statistical learning [5]. In this paper, we define it as (in the statistical learning literature, it is more common to scale this quantity by $1/T$, but the form we use here is more convenient for stating cumulative regret bounds):

$\mathcal{R}_T(\mathcal{F}) \;=\; \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \sigma_t f_t \right] \qquad (2)$

where $\sigma_1, \ldots, \sigma_T$ are i.i.d. Rademacher random variables, taking values $-1, +1$ with equal probability. When $\mathcal{R}_T(\mathcal{F}) = o(T)$, we get a minimax regret which implies a vanishing per-round regret.
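For concreteness, the expectation in Eq. (2) can be estimated by straightforward Monte Carlo sampling. The following Python sketch assumes, purely for illustration, a finite class represented as the rows of a matrix:

    import numpy as np

    def rademacher_complexity(F, num_samples=10000, rng=None):
        # Monte Carlo estimate of Eq. (2): E[ sup_f sum_t sigma_t f_t ],
        # for a finite class F given as a (k x T) array of expert sequences.
        rng = np.random.default_rng(rng)
        total = 0.0
        for _ in range(num_samples):
            sigma = rng.choice([-1.0, 1.0], size=F.shape[1])  # Rademacher draws
            total += np.max(F @ sigma)                        # sup over the class
        return total / num_samples

    # Two constant experts over T = 100 rounds: the estimate is
    # E|sum_t sigma_t| ~ sqrt(2T/pi) ~ 8, which is indeed o(T).
    T = 100
    F = np.vstack([np.ones(T), -np.ones(T)])
    print(rademacher_complexity(F))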

In terms of an explicit algorithm, the optimal prediction at round $t$ is given by a complicated-looking recursive expression, involving exponentially many terms. Indeed, for general online learning problems, this is the most one seems able to hope for. However, an apparently little-known fact is that when one deals with a class of fixed sequences as discussed above, one can write the optimal prediction in a much simpler way. Letting $Y_{t+1}, \ldots, Y_T$ be i.i.d. Rademacher random variables, the optimal prediction at round $t$ can be written as

$p_t \;=\; \frac{1}{2}\,\mathbb{E}\Bigl[ L^*\bigl(y_1, \ldots, y_{t-1}, -1, Y_{t+1}, \ldots, Y_T\bigr) \;-\; L^*\bigl(y_1, \ldots, y_{t-1}, +1, Y_{t+1}, \ldots, Y_T\bigr) \Bigr] \qquad (3)$

In words, the prediction is simply half the expected difference between the minimal cumulative loss over $\mathcal{F}$ when the adversary plays $-1$ at round $t$ and random values afterwards, and the minimal cumulative loss over $\mathcal{F}$ when the adversary plays $+1$ at round $t$ and the same random values afterwards. Again, we refer the reader to Appendix A for how this is derived. We denote this optimal strategy (for absolute loss and binary outcomes) as the Minimax Forecaster (mf):

  for $t = 1$ to $T$ do
     Predict $p_t$ as defined in Eq. (3)
     Receive outcome $y_t$ and suffer loss $|p_t - y_t|$
  end for
Algorithm 1 Minimax Forecaster (mf)

The relevant guarantee for mf is summarized in the following theorem.

Theorem 1.

For any class $\mathcal{F} \subseteq [-1, +1]^T$ of static experts, the regret of the Minimax Forecaster (Algorithm 1) satisfies $\mathrm{Reg}_T(\mathcal{F}) = \mathcal{R}_T(\mathcal{F})$.

2.1 Making the Minimax Forecaster Efficient

The Minimax Forecaster described above is not computationally efficient, as the computation of $p_t$ in Eq. (3) requires averaging over exponentially many ERMs. However, by a martingale argument, it is not hard to show that it is in fact sufficient to compute only two ERMs per round.

  for $t = 1$ to $T$ do
     For $i = t+1, \ldots, T$, let $Y_i$ be a Rademacher random variable
     Let $p_t := \frac{1}{2}\bigl( L^*(y_1, \ldots, y_{t-1}, -1, Y_{t+1}, \ldots, Y_T) - L^*(y_1, \ldots, y_{t-1}, +1, Y_{t+1}, \ldots, Y_T) \bigr)$
     Predict $p_t$, receive outcome $y_t$ and suffer loss $|p_t - y_t|$
  end for
Algorithm 2 Minimax Forecaster with efficient implementation (mf*)
Theorem 2.

For any class $\mathcal{F} \subseteq [-1, +1]^T$ of static experts, the regret of the randomized forecasting strategy mf* (Algorithm 2) satisfies

$\mathrm{Reg}_T(\mathcal{F}) \;\le\; \mathcal{R}_T(\mathcal{F}) + \sqrt{8T\ln\frac{1}{\delta}}$

with probability at least $1 - \delta$. Moreover, if the predictions $p_t$ are computed reusing the random values $Y_2, \ldots, Y_T$ drawn at the first iteration of the algorithm, rather than drawing fresh values at each iteration, then it holds that

$\mathbb{E}\bigl[\mathrm{Reg}_T(\mathcal{F})\bigr] \;\le\; \mathcal{R}_T(\mathcal{F}).$

Proof sketch.

To prove the second statement, note that $\mathbb{E}[p_t]$ equals the Minimax Forecaster's prediction in Eq. (3) for any fixed $y_1, \ldots, y_{t-1}$, that the absolute loss $|p_t - y_t| = 1 - p_t y_t$ is linear in $p_t$ for $p_t$ bounded in $[-1, +1]$ and $y_t \in \{-1, +1\}$, and use Thm. 1. To prove the first statement, note that $|p_t - y_t| - \mathbb{E}\bigl[|p_t - y_t|\bigr]$ for $t = 1, \ldots, T$ is a martingale difference sequence with respect to the algorithm's internal randomization, and apply Azuma's inequality. ∎

The second statement in the theorem bounds the regret only in expectation and is thus weaker than the first one. On the other hand, it might have algorithmic benefits. Indeed, if we reuse the same values $Y_2, \ldots, Y_T$ in every round, then the computation of the infima over $f \in \mathcal{F}$ in mf* is with respect to an outcome sequence which changes at only one point per round. Depending on the specific learning problem, it might be easier to re-compute the infimum after changing a single point in the outcome sequence, as opposed to computing the infimum over a completely different outcome sequence in each round.
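For concreteness, here is a minimal Python sketch of mf* for a finite class, using the prediction rule as reconstructed above and drawing fresh random values at each round; the matrix representation of the class and the helper names are illustrative choices, not part of the algorithm itself:

    import numpy as np

    def erm_value(F, y):
        # L*(y): minimal cumulative absolute loss over a finite class F,
        # represented as a (k x T) array whose rows are the experts.
        return np.min(np.abs(F - y).sum(axis=1))

    def mf_star(F, outcomes, rng=None):
        # One run of mf* (Algorithm 2) against a fixed outcome sequence
        # in {-1, +1}^T. Returns the forecaster's total absolute loss.
        rng = np.random.default_rng(rng)
        T = F.shape[1]
        y, total_loss = np.zeros(T), 0.0
        for t in range(T):
            playout = rng.choice([-1.0, 1.0], size=T - t - 1)  # random future
            y_minus = np.concatenate([y[:t], [-1.0], playout])
            y_plus = np.concatenate([y[:t], [+1.0], playout])
            # Only two ERM computations per round, as in Algorithm 2.
            p_t = 0.5 * (erm_value(F, y_minus) - erm_value(F, y_plus))
            total_loss += abs(p_t - outcomes[t])
            y[t] = outcomes[t]
        return total_loss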

3 The R2 Forecaster

The Minimax Forecaster presented above is very specific to the absolute loss and to binary outcomes $\mathcal{Y} = \{-1, +1\}$, which limits its applicability. We note that extending the forecaster to other losses or different outcome spaces is not trivial: indeed, the recursive unwinding of the minimax regret term, which leads to an explicit expression and an explicit algorithm, does not work as-is for other cases. Nevertheless, we will now show how one can deal with general (convex, Lipschitz) loss functions and outcomes belonging to any real interval $[-b, b]$.

The algorithm we propose essentially uses the Minimax Forecaster as a subroutine, by feeding it with a carefully chosen sequence of binary values $\tilde{y}_1, \ldots, \tilde{y}_T$, and using predictions which are scaled to lie in the interval $[-b, b]$. The values of $\tilde{y}_t$ are based on a randomized rounding of values in $[-1, +1]$, which depend in turn on the loss subgradient. Thus, we denote the algorithm as the Randomized Rounding (R2) Forecaster.

To describe the algorithm, we introduce some notation. For any scalar $a \in [-b, b]$, define $\tilde{a} = a/b$ to be the scaled version of $a$ into the range $[-1, +1]$. For vectors $\boldsymbol{a} = (a_1, \ldots, a_T)$, define $\tilde{\boldsymbol{a}} = (a_1/b, \ldots, a_T/b)$, and for the class $\mathcal{F}$, define $\tilde{\mathcal{F}} = \{\tilde{f} : f \in \mathcal{F}\}$. Also, we let $\partial\ell(p_t, y_t)$ denote any subgradient of the loss function $\ell$ with respect to the prediction $p_t$. The pseudocode of the R2 Forecaster is presented as Algorithm 3 below, and its regret guarantee is summarized in Thm. 3. The proof is presented in Appendix B.

  Input: Upper bound $b$ on $|f_t|$ and $|y_t|$ for all $t$ and $f \in \mathcal{F}$; upper bound $\rho$ on $\sup_{p, y} |\partial\ell(p, y)|$; precision parameter $\eta \in (0, 1]$.
  for $t = 1$ to $T$ do
     $p_t := 0$
     for $j = 1$ to $\lceil \eta T \rceil$ do
        For $i = t+1, \ldots, T$, let $Y_i$ be a Rademacher random variable
        Draw the binary sequences $\boldsymbol{y}^- = (\tilde{y}_1, \ldots, \tilde{y}_{t-1}, -1, Y_{t+1}, \ldots, Y_T)$ and $\boldsymbol{y}^+ = (\tilde{y}_1, \ldots, \tilde{y}_{t-1}, +1, Y_{t+1}, \ldots, Y_T)$
        Let $p_t := p_t + \frac{b}{2\lceil \eta T \rceil}\bigl( \tilde{L}^*(\boldsymbol{y}^-) - \tilde{L}^*(\boldsymbol{y}^+) \bigr)$, where $\tilde{L}^*(\boldsymbol{y}) = \inf_{f \in \tilde{\mathcal{F}}} \sum_{i=1}^{T} |f_i - y_i|$
     end for
     Predict $p_t$
     Receive outcome $y_t$ and suffer loss $\ell(p_t, y_t)$
     Let $z_t := \frac{1}{2}\bigl( 1 - \partial\ell(p_t, y_t)/\rho \bigr)$
     Let $\tilde{y}_t := +1$ with probability $z_t$, and $\tilde{y}_t := -1$ with probability $1 - z_t$
  end for
Algorithm 3 The R2 Forecaster
Theorem 3.

Suppose $\ell$ is convex and $\rho$-Lipschitz in its first argument, and that $|y_t| \le b$ and $\sup_{f \in \mathcal{F}} |f_t| \le b$ for all $t$. For any $\delta \in (0, 1)$, the regret of the R2 Forecaster (Algorithm 3) satisfies

$\mathrm{Reg}_T(\mathcal{F}) \;\le\; \rho\,\mathcal{R}_T(\mathcal{F}) \;+\; O\!\left( \rho b \sqrt{\frac{T}{\eta} \log\frac{T}{\delta}} \right) \qquad (4)$

with probability at least $1 - \delta$.

The prediction $p_t$ which the algorithm computes is an empirical approximation to the Minimax Forecaster's prediction, Eq. (3), with respect to the scaled class $\tilde{\mathcal{F}}$, obtained by repeatedly drawing independent values for $Y_{t+1}, \ldots, Y_T$ and averaging. The accuracy of the approximation is reflected in the precision parameter $\eta$. A larger value of $\eta$ improves the regret bound, but also increases the runtime of the algorithm. Thus, $\eta$ provides a trade-off between the computational complexity of the algorithm and its regret guarantee. We note that even when $\eta$ is taken to be a constant fraction, the resulting algorithm still runs in polynomial time $O(cT^2)$, where $c$ is the time required to compute a single ERM. In subsequent results pertaining to the R2 Forecaster, we will assume that $\eta$ is taken to be a constant fraction.
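The rounding step itself is simple enough to show in isolation. The following minimal Python sketch follows the reconstruction of Algorithm 3 given above; the function name and interface are illustrative:

    import numpy as np

    def round_label(p, y, rho, subgradient, rng=None):
        # Randomized rounding step: the loss subgradient at (p, y), scaled by
        # 1/rho so that it lies in [-1, +1], is rounded to a random binary
        # label whose conditional mean is -subgradient/rho.
        rng = np.random.default_rng(rng)
        z = 0.5 * (1.0 - subgradient / rho)  # probability of rounding to +1
        return 1.0 if rng.random() < z else -1.0

    # Example with the absolute loss l(p, y) = |p - y|, which is 1-Lipschitz
    # (rho = 1) and has subgradient sign(p - y) with respect to p.
    p, y = 0.2, -0.7
    y_tilde = round_label(p, y, rho=1.0, subgradient=np.sign(p - y))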

We end this section with a remark that plays an important role in what follows.

Remark 1.

The predictions of our forecasting strategies do not depend on the ordering of the predictions of the experts in $\mathcal{F}$. In other words, all the results proven so far also hold in a setting where the elements of $\mathcal{F}$ are functions $f : \{1, \ldots, T\} \to \mathcal{D}$, and the adversary has control on the permutation $\pi$ of $\{1, \ldots, T\}$ that is used to define the prediction $f(\pi(t))$ of expert $f$ at time $t$. (Formally, at each step $t$: (1) the adversary chooses and reveals the next element $\pi(t)$ of the permutation; (2) the forecaster chooses $p_t$ and simultaneously the adversary chooses $y_t$.) Also, Thm. 1 implies that the value of $\mathcal{R}_T(\mathcal{F})$ remains unchanged irrespective of the permutation chosen by the adversary.

4 Application 1: Transductive Online Learning

The first application we consider is a rather straightforward one, in the context of transductive online learning [6]. In this model, we have an arbitrary sequence of labeled examples $(x_1, y_1), \ldots, (x_T, y_T)$, where only the set $\{x_1, \ldots, x_T\}$ of unlabeled instances is known to the learner in advance. At each round $t$, the learner must provide a prediction $p_t$ for the label of $x_t$. The true label $y_t$ is then revealed, and the learner incurs a loss $\ell(p_t, y_t)$. The learner's goal is to minimize the transductive online regret

$\sum_{t=1}^{T} \ell(p_t, y_t) \;-\; \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell\bigl(h(x_t), y_t\bigr)$

with respect to a fixed class $\mathcal{H}$ of predictors of the form $h : \mathcal{X} \to \mathcal{D}$.

The work [16] considers the binary classification case with zero-one loss. Their main result is that if a class $\mathcal{H}$ of binary functions has bounded VC dimension $d$, and there exists an efficient algorithm to perform empirical risk minimization, then one can construct an efficient randomized algorithm for transductive online learning, whose regret is at most $O\bigl(T^{3/4}\sqrt{d \log T}\bigr)$ in expectation. The significance of this result is that efficient batch learning (via empirical risk minimization) implies efficient learning in the transductive online setting. This is an important result, as online learning can be computationally harder than batch learning (see, e.g., [8] for an example in the context of Boolean learning).

A major open question posed by [16] was whether one can achieve the optimal rate $O(\sqrt{dT})$, matching the rate of a batch learning algorithm in the statistical setting. Using the R2 Forecaster, we can easily achieve this rate, as well as similar results in a strictly more general setting. This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well.

Theorem 4.

Suppose we have a computationally efficient algorithm for empirical risk minimization (with respect to the zero-one loss) over a class $\mathcal{H}$ of $\{-1, +1\}$-valued functions with VC dimension $d$. Then, in the transductive online model, the efficient randomized forecaster mf* achieves an expected regret of $O(\sqrt{dT})$ with respect to the zero-one loss.
Moreover, for an arbitrary class $\mathcal{H}$ of $[-b, b]$-valued functions with Rademacher complexity $\mathcal{R}_T(\mathcal{H})$, and any convex $\rho$-Lipschitz loss function, if there exists a computationally efficient algorithm for empirical risk minimization, then the R2 Forecaster is computationally efficient and achieves, in the transductive online model, a regret of $\rho\,\mathcal{R}_T(\mathcal{H}) + O\bigl(\rho b\sqrt{T \log(T/\delta)}\bigr)$ with probability at least $1 - \delta$.

Proof.

Since the set of unlabeled examples is known, we reduce the online transductive model to prediction with expert advice in the setting of Remark 1. This is done by mapping each function $h \in \mathcal{H}$ to a function $f : \{1, \ldots, T\} \to \mathcal{D}$ by $f(t) = h(x_t)$, which is equivalent to an expert in the setting of Remark 1. When $\mathcal{H}$ maps to $\{-1, +1\}$, and we care about the zero-one loss, we can use the forecaster mf* to compute randomized predictions (predicting $+1$ with probability $(1 + p_t)/2$, so that the expected zero-one loss is $|p_t - y_t|/2$) and apply Thm. 2 to bound the expected transductive online regret by $\mathcal{R}_T(\mathcal{F})$, where $\mathcal{F}$ is the induced class of experts. For a class with VC dimension $d$, $\mathcal{R}_T(\mathcal{F}) \le C\sqrt{dT}$ for some universal constant $C$, using Dudley's chaining method [12], and this concludes the proof of the first part of the theorem. The second part is an immediate corollary of Thm. 3. ∎
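The reduction in the proof is mechanical. The following minimal Python sketch illustrates it, assuming for simplicity a finite hypothesis class given as callables:

    import numpy as np

    def experts_from_hypotheses(hypotheses, xs):
        # Transductive reduction from the proof of Thm. 4: each hypothesis h
        # becomes the static expert f = (h(x_1), ..., h(x_T)) over the known
        # (unordered) set of instances xs. The resulting (k x T) matrix can be
        # fed to mf* or the R2 Forecaster in the role of the class F.
        return np.array([[h(x) for x in xs] for h in hypotheses])

    # Example with two threshold classifiers on the real line:
    xs = [0.1, -0.4, 0.7, 0.3]
    hypotheses = [lambda x: 1.0 if x > 0 else -1.0,
                  lambda x: 1.0 if x > 0.5 else -1.0]
    F = experts_from_hypotheses(hypotheses, xs)  # shape (2, 4)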

We close this section by contrasting our results for online transductive learning with those of [7] on standard online learning. If $\mathcal{H}$ contains $\{-1, +1\}$-valued functions, then the optimal regret bound for online learning is of order $\sqrt{T\,\mathrm{Ldim}(\mathcal{H})}$, where $\mathrm{Ldim}(\mathcal{H})$ is the Littlestone dimension of $\mathcal{H}$. Since the Littlestone dimension of a class is never smaller than its VC dimension, we conclude that online learning is a harder setting than online transductive learning.

5 Application 2: Online Collaborative Filtering

We now turn to discuss the application of our results in the context of collaborative filtering with trace-norm constrained matrices, presenting what is (to the best of our knowledge) the first computationally efficient online algorithm for this problem.

In collaborative filtering, the learning problem is to predict entries of an unknown matrix based on a subset of its observed entries. A common approach is norm regularization, where we seek a low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13].

Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e.g., [23, 21]). However, even when the guarantees are distribution-free, assuming a fixed distribution fails to capture important aspects of collaborative filtering in practice, such as non-stationarity [17]. Thus, an online adversarial setting, where no distributional assumptions whatsoever are required, seems to be particularly well-suited to this problem domain.

In an online setting, at each round $t$ the adversary reveals an index pair $(i_t, j_t)$ and secretly chooses a value $y_t$ for the corresponding matrix entry. After that, the learner selects a prediction $p_t$ for that entry. Then $y_t$ is revealed and the learner suffers a loss $\ell(p_t, y_t)$. Hence, the goal of a learner is to minimize the regret with respect to a fixed class $\mathcal{W}$ of prediction matrices,

$\sum_{t=1}^{T} \ell(p_t, y_t) \;-\; \inf_{W \in \mathcal{W}} \sum_{t=1}^{T} \ell\bigl(W_{i_t, j_t}, y_t\bigr).$

Following reality, we will assume that the adversary picks a different entry in each round. When the learner's performance is measured by the regret after all $mn$ entries of an $m \times n$ matrix have been predicted, the online collaborative filtering setting reduces to prediction with expert advice as discussed in Remark 1.

As mentioned previously, $\mathcal{W}$ is often taken to be a convex class of matrices with bounded trace-norm. Many convex learning problems, such as linear and kernel-based predictors, as well as matrix-based predictors, can be learned efficiently, both in a stochastic and an online setting, using mirror descent or regularized follow-the-leader methods. However, for reasonable choices of $\mathcal{W}$, a straightforward application of these techniques can lead to algorithms with trivial bounds. In particular, in the case of $\mathcal{W}$ consisting of matrices with trace-norm at most $r$, standard online regret bounds would scale like $O(r\sqrt{T})$. Since for this norm one typically has $r = \Theta(\sqrt{mn})$, we get a per-round regret guarantee of $O(\sqrt{mn/T})$. This is a trivial bound, since it becomes "meaningful" (smaller than a constant) only after all entries have been predicted.
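To spell out the arithmetic: dividing the cumulative bound $O(r\sqrt{T}) = O(\sqrt{mnT})$ by $T$ gives a per-round regret of $O(\sqrt{mn/T})$, and since the adversary picks a different entry in each round we always have $T \le mn$, so $\sqrt{mn/T} \ge 1$ throughout the game.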

On the other hand, based on general techniques developed in [15] and greatly extended in [1], it can be shown that online learnability is information-theoretically possible for such $\mathcal{W}$. However, these techniques do not provide a computationally efficient algorithm. Thus, to the best of our knowledge, there is currently no efficient (polynomial time) online algorithm which attains non-trivial regret. In this section, we show how to obtain such an algorithm using the R2 Forecaster.

Consider first the transductive online setting, where the set of indices to be predicted is known in advance, and the adversary may only choose the order and values of the entries. It is readily seen that the R2 Forecaster can be applied in this setting, using any convex class $\mathcal{W}$ of fixed matrices with bounded entries to compete against, and any convex Lipschitz loss function. To do so, we let $\{1, \ldots, T\}$ enumerate the entries to be predicted, and run the R2 Forecaster with respect to the class $\mathcal{F} = \bigl\{ (W_{i_1, j_1}, \ldots, W_{i_T, j_T}) : W \in \mathcal{W} \bigr\}$, which corresponds to a class of experts as discussed in Remark 1.

What is perhaps more surprising is that the R2 Forecaster can also be applied in a non-transductive setting, where the indices to be predicted are not known in advance. Moreover, the R2 Forecaster does not even need to know the horizon $T$ in advance. The key idea to achieve this is to utilize the non-asymptotic nature of the learning problem, namely, that the game is played over a finite matrix, so the time horizon is necessarily bounded by $mn$.

The algorithm we propose is very simple: we apply the R2 Forecaster as if we are in a setting with time horizon $mn$, which is played over all entries of the matrix. By Remark 1, the R2 Forecaster does not need to know the order in which these entries are going to be revealed. Whenever $\mathcal{W}$ is convex and $\ell$ is a convex function, we can find an ERM in polynomial time by solving a convex problem. Hence, we can implement the R2 Forecaster efficiently.
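To illustrate why this ERM step is tractable, the following Python sketch formulates it as a convex program; the use of the cvxpy package and of the squared loss are illustrative assumptions, and any convex loss satisfying our conditions works equally well:

    import cvxpy as cp

    def trace_norm_erm(observed, m, n, r, b):
        # ERM oracle for the collaborative filtering application: minimize the
        # cumulative loss on the entries revealed so far, over the convex class
        # of m x n matrices with trace-norm at most r and entries bounded by b.
        W = cp.Variable((m, n))
        loss = sum(cp.square(W[i, j] - y) for (i, j, y) in observed)
        constraints = [cp.normNuc(W) <= r,  # trace-norm (nuclear norm) ball
                       cp.abs(W) <= b]      # entrywise boundedness
        problem = cp.Problem(cp.Minimize(loss), constraints)
        problem.solve()
        return problem.value, W.value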

To show that this is indeed a viable strategy, we need the following lemma, whose proof is presented in Appendix C.

Lemma 1.

Consider a (possibly randomized) forecaster for a class $\mathcal{F}$ whose regret after $T$ steps satisfies $\mathrm{Reg}_T(\mathcal{F}) \le G$ with probability at least $1 - \delta$. Furthermore, suppose the loss function is such that for any $p, p' \in \mathcal{D}$ there exists an outcome $y \in \mathcal{Y}$ with $\ell(p, y) \ge \ell(p', y)$. Then, for every $t = 1, \ldots, T$, the regret at the end of round $t$ is at most $G$ with probability at least $1 - \delta$.

Note that a simple sufficient condition for the assumption on the loss function to hold is that $\mathcal{D} \subseteq \mathcal{Y}$, that $\ell(p, y) \ge 0$ for all $p$ and $y$, and that $\ell(p, y) = 0$ whenever $p = y$.

Using this lemma, the following theorem exemplifies how we can obtain a regret guarantee for our algorithm, in the case of $\mathcal{W}$ consisting of the convex set of matrices with bounded trace-norm and bounded entries. For the sake of clarity, we will consider square $n \times n$ matrices.

Theorem 5.

Let $\ell$ be a loss function which satisfies the conditions of Lemma 1. Also, let $\mathcal{W}$ consist of $n \times n$ matrices with trace-norm at most $rn$ (for some constant $r$) and entries bounded by $b$, and suppose we apply the R2 Forecaster over time horizon $n^2$ and all entries of the matrix. Then with probability at least $1 - \delta$, after any number of rounds $T$, the algorithm achieves an average per-round regret of at most $O\bigl(n^{3/2}/T\bigr)$.

Proof.

In our setting, where the adversary chooses a different entry at each round, [21, Theorem 6] implies that for the class of all matrices with trace-norm at most $rn$, it holds that the Rademacher complexity over all $n^2$ entries is $O(n^{3/2})$. Since $\mathcal{W}$ is contained in that class, we get by the definition of the Rademacher complexity that $\mathcal{R}_{n^2}(\mathcal{W}) = O(n^{3/2})$ as well. By Thm. 3, the regret after $n^2$ rounds is $O(n^{3/2})$ with probability at least $1 - \delta$. Applying Lemma 1, we get that the cumulative regret at the end of any round $T \le n^2$ is at most $O(n^{3/2})$, as required. ∎

This bound becomes non-trivial after $T = \Omega(n^{3/2})$ entries are revealed, which is still a vanishing proportion of all $n^2$ entries. While an $O(n^{3/2}/T)$ regret rate might seem unusual compared to standard regret bounds (which usually scale as $\sqrt{T}$ for general losses), it is a natural outcome of the non-asymptotic nature of our setting, where $T$ can never be larger than $n^2$. In fact, this is the same rate one would obtain in a batch setting, where the entries are drawn from an arbitrary distribution. Moreover, an assumption such as boundedness of the entries is required for currently-known guarantees even in a batch setting (see [21] for details).

References

  • [1] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.
  • [2] J. Abernethy, P. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2009.
  • [3] J. Abernethy and M. Warmuth. Repeated games against budgeted adversaries. In NIPS, 2010.
  • [4] F. Bach. Consistency of trace-norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.
  • [5] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In COLT, 2001.
  • [6] S. Ben-David, E. Kushilevitz, and Y. Mansour. Online learning versus offline learning. Machine Learning, 29(1):45–63, 1997.
  • [7] S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
  • [8] A. Blum. Separating distribution-free and mistake-bound learning models over the boolean domain. SIAM J. Comput., 23(5):990–1000, 1994.
  • [9] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, and M. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.
  • [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • [11] T. Chung. Approximate methods for sequential decision making using expert advice. In COLT, 1994.
  • [12] R. M. Dudley. A Course on Empirical Processes (École d'Été de Probabilités de Saint-Flour, 1982), volume 1097 of Lecture Notes in Mathematics. Springer-Verlag, 1984.
  • [13] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. In NIPS, 2011.
  • [14] E. Hazan. The convex optimization approach to regret minimization. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning. MIT Press, to appear.
  • [15] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.
  • [16] S. Kakade and A. Kalai. From batch to transductive online learning. In NIPS, 2005.
  • [17] Y. Koren. Collaborative filtering with temporal dynamics. In KDD, 2009.
  • [18] J. Lee, B. Recht, R. Salakhutdinov, N. Srebro, and J. Tropp. Practical large-scale optimization for max-norm regularization. In NIPS, 2010.
  • [19] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2007.
  • [20] R. Salakhutdinov and N. Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In NIPS, 2010.
  • [21] O. Shamir and S. Shalev-Shwartz. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In COLT, 2011.
  • [22] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.
  • [23] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, 2005.

Appendix A Derivation of the Minimax Forecaster

In this appendix, we outline how the Minimax Forecaster is derived, as well as its associated guarantees. This outline closely follows the exposition in [10, Chapter 8], to which we refer the reader for some of the technical derivations.

First, we note that the Minimax Forecaster as presented in [10] actually refers to a slightly different setup than ours, where the outcome space is $\{0, 1\}$ and the prediction space is $[0, 1]$, rather than $\{-1, +1\}$ and $[-1, +1]$. We will first derive the forecaster for the first setting, and then show how to convert it to the second setting.

Our goal is to find a predictor which minimizes the worst-case regret

$\sup_{y_1, \ldots, y_T \in \{0, 1\}} \left( \sum_{t=1}^{T} |p_t - y_t| \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} |f_t - y_t| \right),$

where $p_1, \ldots, p_T$ is the prediction sequence.

For convenience, in the following we sometimes use the notation $y_s^t$ to denote a vector $(y_s, \ldots, y_t)$ in $\{0, 1\}^{t-s+1}$. The idea of the derivation is to work backwards, starting with computing the optimal prediction at the last round $T$, then deriving the optimal prediction at round $T-1$, and so on. In the last round $T$, the first $T-1$ outcomes $y_1^{T-1}$ have been revealed, and we want to find the optimal prediction $p_T$. Since our goal is to minimize worst-case regret with respect to the absolute loss, we just need to compute the $p_T$ which minimizes

$\max\Bigl\{\, p_T - L^*(y_1^{T-1}, 0),\;\; 1 - p_T - L^*(y_1^{T-1}, 1) \,\Bigr\}.$

In our setting, it is not hard to show that $\bigl| L^*(y_1^{T-1}, 0) - L^*(y_1^{T-1}, 1) \bigr| \le 1$ (see [10, Lemma 8.1]). Using this, we can compute the optimal $p_T$ to be

$p_T \;=\; \frac{1 + L^*(y_1^{T-1}, 0) - L^*(y_1^{T-1}, 1)}{2}, \qquad (5)$

where $L^*(y_1^t) = \inf_{f \in \mathcal{F}} \sum_{i=1}^{t} |f_i - y_i|$.
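For intuition, Eq. (5) is obtained by equalizing the regret under the two possible outcomes: setting

$p_T - L^*(y_1^{T-1}, 0) \;=\; (1 - p_T) - L^*(y_1^{T-1}, 1)$

and solving for $p_T$ recovers Eq. (5); the condition $\bigl| L^*(y_1^{T-1}, 0) - L^*(y_1^{T-1}, 1) \bigr| \le 1$ guarantees that the solution indeed lies in $[0, 1]$.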

Having determined $p_T$, we can continue to the previous prediction $p_{T-1}$. This is equivalent to minimizing

$\max_{y_{T-1} \in \{0, 1\}} \Bigl( |p_{T-1} - y_{T-1}| + R_T(y_1^{T-1}) \Bigr),$

where

$R_T(y_1^{T-1}) \;=\; \inf_{p_T} \max_{y_T \in \{0, 1\}} \bigl( |p_T - y_T| - L^*(y_1^T) \bigr). \qquad (6)$

Note that by plugging in the value of $p_T$ from Eq. (5), we also get the following equivalent formulation for $R_T(y_1^{T-1})$:

$R_T(y_1^{T-1}) \;=\; \frac{1 - L^*(y_1^{T-1}, 0) - L^*(y_1^{T-1}, 1)}{2}.$

Again, it is possible to show that the optimal value of $p_{T-1}$ is

$p_{T-1} \;=\; \frac{1 + R_T(y_1^{T-2}, 1) - R_T(y_1^{T-2}, 0)}{2}.$

Repeating this procedure, one can show that at any round $t$, the minimax optimal prediction is

$p_t \;=\; \frac{1 + R_{t+1}(y_1^{t-1}, 1) - R_{t+1}(y_1^{t-1}, 0)}{2}, \qquad (7)$

where $R_{t+1}$ is defined recursively as $R_{T+1}(y_1^T) = -L^*(y_1^T)$ and

$R_t(y_1^{t-1}) \;=\; \frac{1 + R_{t+1}(y_1^{t-1}, 0) + R_{t+1}(y_1^{t-1}, 1)}{2} \qquad (8)$

for all $t = T, T-1, \ldots, 1$.

At first glance, computing $p_t$ from Eq. (7) might seem tricky, since it requires computing $R_{t+1}$, whose recursive expansion in Eq. (8) involves exponentially many terms. Luckily, the recursive expansion has a simple structure, and it is not hard to show that

$R_{t+1}(y_1^t) \;=\; \frac{T - t}{2} \;-\; \mathbb{E}\Bigl[ L^*\bigl(y_1^t, Y_{t+1}, \ldots, Y_T\bigr) \Bigr], \qquad (9)$

where $Y_{t+1}, \ldots, Y_T$ is a sequence of i.i.d. Bernoulli random variables, which take values in $\{0, 1\}$ with equal probability. Plugging this into the formula for the minimax prediction in Eq. (7), we get that (this fact appears in an implicit form in [9]; see also [10, Exercise 8.4])

$p_t \;=\; \frac{1}{2}\Bigl( 1 + \mathbb{E}\bigl[ L^*\bigl(y_1^{t-1}, 0, Y_{t+1}, \ldots, Y_T\bigr) - L^*\bigl(y_1^{t-1}, 1, Y_{t+1}, \ldots, Y_T\bigr) \bigr] \Bigr). \qquad (10)$

This prediction rule constitutes the Minimax Forecaster as presented in [10].

After deriving the algorithm, we turn to analyze its regret performance. To do so, we just need to note that $R_1$ equals the worst-case regret (see the recursive definition at Eq. (6) and Eq. (8)). Using the alternative explicit definition in Eq. (9), we get that the worst-case regret equals

$\frac{T}{2} - \mathbb{E}\bigl[ L^*(Y_1, \ldots, Y_T) \bigr] \;=\; \mathbb{E}\Bigl[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Bigl( \frac{1}{2} - |f_t - Y_t| \Bigr) \Bigr] \;=\; \mathbb{E}\Bigl[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \sigma_t \Bigl( f_t - \frac{1}{2} \Bigr) \Bigr],$

where $\sigma_1, \ldots, \sigma_T$ are i.i.d. Rademacher random variables (taking values of $-1$ and $+1$ with equal probability). Recalling the definition of Rademacher complexity, Eq. (2), we get that the regret is bounded by the Rademacher complexity of the shifted class, which is obtained from $\mathcal{F}$ by taking every $f \in \mathcal{F}$ and replacing every coordinate $f_t$ by $f_t - \frac{1}{2}$.

Finally, it remains to show how to convert the forecaster and analysis above to the setting discussed in this paper, where the outcomes are in $\{-1, +1\}$ rather than $\{0, 1\}$ and the predictions are in $[-1, +1]$ rather than $[0, 1]$. To do so, consider a learning problem in this new setting, with some class $\mathcal{F} \subseteq [-1, +1]^T$. For any vector $f \in [-1, +1]^T$, define $f' = \frac{1}{2}(f + \mathbf{1})$ to be the shifted vector, where $\mathbf{1}$ is the all-ones vector. Also, define $\mathcal{F}' = \{f' : f \in \mathcal{F}\}$ to be the shifted class. It is easily seen that $|f'_t - y'_t| = \frac{1}{2}|f_t - y_t|$ for any $y_t \in \{-1, +1\}$ and $y'_t = \frac{1}{2}(y_t + 1)$. As a result, if we look at the prediction $p_t$ given by our forecaster in Eq. (3), then $p'_t = \frac{1}{2}(p_t + 1)$ is the minimax optimal prediction given by Eq. (10) with respect to the class $\mathcal{F}'$ and the outcomes $y'_1, \ldots, y'_T$. So our analysis above applies, and we get that the worst-case regret is

$2\,\mathbb{E}\Bigl[ \sup_{f' \in \mathcal{F}'} \sum_{t=1}^{T} \sigma_t \Bigl( f'_t - \frac{1}{2} \Bigr) \Bigr] \;=\; \mathbb{E}\Bigl[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \sigma_t f_t \Bigr],$

which is exactly the Rademacher complexity $\mathcal{R}_T(\mathcal{F})$ of the class $\mathcal{F}$.

Appendix B Proof of Thm. 3

Let $\boldsymbol{Y}_t$ denote the set of random values drawn by the algorithm at round $t$ (the playout values drawn in the inner loop, and the rounded label $\tilde{y}_t$). Let $\mathbb{E}_t[\cdot]$ denote expectation with respect to $\boldsymbol{Y}_t$, conditioned on $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_{t-1}$ as well as $y_1, \ldots, y_t$. Let $\mathbb{E}_{\tilde{y}_t}[\cdot]$ denote the expectation with respect to the random drawing of $\tilde{y}_t$, conditioned on $\tilde{y}_1, \ldots, \tilde{y}_{t-1}$ and $y_1, \ldots, y_t$.

We will need two simple observations. First, by convexity of the loss function, we have that for any $p$,

$\ell(p_t, y_t) - \ell(p, y_t) \;\le\; \partial\ell(p_t, y_t)\,(p_t - p).$

Second, by the definition of $z_t$ and $\tilde{y}_t$, we have that for any fixed $p \in [-b, b]$, with $\tilde{p} = p/b$,

$\partial\ell(p_t, y_t)\,p \;=\; -\rho b\,\mathbb{E}_{\tilde{y}_t}\bigl[ \tilde{p}\,\tilde{y}_t \bigr] \;=\; \rho b\,\mathbb{E}_{\tilde{y}_t}\bigl[ |\tilde{p} - \tilde{y}_t| \bigr] - \rho b.$

The last transition uses the fact that $|a - s| = 1 - as$ for any $a \in [-1, +1]$ and $s \in \{-1, +1\}$. By these two observations, we have

$\sum_{t=1}^{T} \ell(p_t, y_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t) \;\le\; \rho b\,\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathbb{E}_{\tilde{y}_t}\Bigl[ |\tilde{p}_t - \tilde{y}_t| - |\tilde{f}_t - \tilde{y}_t| \Bigr]. \qquad (11)$

Now, note that $\bigl( |\tilde{p}_t - \tilde{y}_t| - |\tilde{f}_t - \tilde{y}_t| \bigr) - \mathbb{E}_{\tilde{y}_t}\bigl[ |\tilde{p}_t - \tilde{y}_t| - |\tilde{f}_t - \tilde{y}_t| \bigr]$ for $t = 1, \ldots, T$ is a martingale difference sequence: for any values of $\tilde{y}_1, \ldots, \tilde{y}_{t-1}$ and of the playout values (which fix $\tilde{p}_t$), the conditional expectation of this expression over $\tilde{y}_t$ is zero. Using Azuma's inequality, we can upper bound Eq. (11) with probability at least $1 - \delta/2$ by

$\rho b\,\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Bigl( |\tilde{p}_t - \tilde{y}_t| - |\tilde{f}_t - \tilde{y}_t| \Bigr) \;+\; \rho b\sqrt{8T\ln\frac{2}{\delta}}. \qquad (12)$

The next step is to relate Eq. (12) to the regret of the Minimax Forecaster on the binary sequence $\tilde{y}_1, \ldots, \tilde{y}_T$. It might be tempting to appeal to Azuma's inequality again. Unfortunately, there is no martingale difference sequence here, since $\tilde{y}_t$ is itself a random variable whose distribution is influenced by the playout values, which also determine $\tilde{p}_t$. Thus, we need to turn to coarser methods. Eq. (12) can be upper bounded by

$\rho b \sum_{t=1}^{T} \bigl| \tilde{p}_t - \mathbb{E}[\tilde{p}_t] \bigr| \;+\; \rho b\,\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Bigl( \bigl| \mathbb{E}[\tilde{p}_t] - \tilde{y}_t \bigr| - |\tilde{f}_t - \tilde{y}_t| \Bigr) \;+\; \rho b\sqrt{8T\ln\frac{2}{\delta}}, \qquad (13)$

where $\mathbb{E}[\tilde{p}_t]$ denotes the expectation of the computed prediction over the playout values drawn at round $t$.

Recall that $\tilde{p}_t$ is an average over $\lceil \eta T \rceil$ i.i.d. random variables, with expectation $\mathbb{E}[\tilde{p}_t]$. By Hoeffding's inequality, this implies that for any fixed $t$, with probability at least $1 - \delta/2T$ over the choice of the playout values, $\bigl| \tilde{p}_t - \mathbb{E}[\tilde{p}_t] \bigr| \le \sqrt{\frac{2}{\eta T}\ln\frac{4T}{\delta}}$. By a union bound, it follows that with probability at least $1 - \delta/2$ over the choice of $\tilde{p}_1, \ldots, \tilde{p}_T$,

$\sum_{t=1}^{T} \bigl| \tilde{p}_t - \mathbb{E}[\tilde{p}_t] \bigr| \;\le\; \sqrt{\frac{2T}{\eta}\ln\frac{4T}{\delta}}.$

Combining this with Eq. (13), we get that with probability at least $1 - \delta$, the regret is upper bounded by

$\rho b\,\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Bigl( \bigl| \mathbb{E}[\tilde{p}_t] - \tilde{y}_t \bigr| - |\tilde{f}_t - \tilde{y}_t| \Bigr) \;+\; \rho b\sqrt{\frac{2T}{\eta}\ln\frac{4T}{\delta}} \;+\; \rho b\sqrt{8T\ln\frac{2}{\delta}}. \qquad (14)$

Finally, by definition of the algorithm, we have

$\mathbb{E}[\tilde{p}_t] \;=\; \frac{1}{2}\,\mathbb{E}\Bigl[ \tilde{L}^*\bigl(\tilde{y}_1, \ldots, \tilde{y}_{t-1}, -1, Y_{t+1}, \ldots, Y_T\bigr) - \tilde{L}^*\bigl(\tilde{y}_1, \ldots, \tilde{y}_{t-1}, +1, Y_{t+1}, \ldots, Y_T\bigr) \Bigr].$

This is exactly the Minimax Forecaster's prediction at round $t$, with respect to the sequence of outcomes $\tilde{y}_1, \ldots, \tilde{y}_T$ and the class $\tilde{\mathcal{F}}$. Therefore, using Thm. 1, we can upper bound Eq. (14) by

$\rho b\,\mathcal{R}_T(\tilde{\mathcal{F}}) \;+\; \rho b\sqrt{\frac{2T}{\eta}\ln\frac{4T}{\delta}} \;+\; \rho b\sqrt{8T\ln\frac{2}{\delta}}.$

By the definition of $\tilde{\mathcal{F}}$ and of the Rademacher complexity, it is straightforward to verify that $\mathcal{R}_T(\tilde{\mathcal{F}}) = \mathcal{R}_T(\mathcal{F})/b$. Using this to rewrite the bound, and slightly simplifying for readability, the result stated in the theorem follows.

Appendix C Proof of Lemma 1

The proof assumes that the infimum and supremum of certain functions over $\mathcal{F}$ are attainable. If not, the proof can be easily adapted by finding attainable values which are $\epsilon$-close to the infimum or supremum, and then taking $\epsilon \to 0$.

For the purpose of contradiction, suppose there exists a strategy for the adversary and a round $t < T$ such that at the end of round $t$, the forecaster suffers a regret larger than $G$ with probability larger than $\delta$. Consider the following modified strategy for the adversary: the adversary plays according to the aforementioned strategy until round $t$. It then computes

$f^* \;=\; \arg\min_{f \in \mathcal{F}} \sum_{s=1}^{t} \ell(f_s, y_s).$

At all subsequent rounds $s = t+1, \ldots, T$, the adversary chooses

$y_s \;=\; \arg\max_{y \in \mathcal{Y}} \bigl( \ell(p_s, y) - \ell(f^*_s, y) \bigr).$

By the assumption on the loss function, $\ell(p_s, y_s) - \ell(f^*_s, y_s) \ge 0$ for all $s > t$. Thus, the regret over all $T$ rounds, with respect to $f^*$, is

$\sum_{s=1}^{T} \ell(p_s, y_s) - \sum_{s=1}^{T} \ell(f^*_s, y_s) \;\ge\; \sum_{s=1}^{t} \ell(p_s, y_s) - \sum_{s=1}^{t} \ell(f^*_s, y_s),$

which is at least the regret at the end of round $t$, hence larger than $G$ with probability larger than $\delta$. On the other hand, we know that the learner's regret after $T$ rounds is at most $G$ with probability at least $1 - \delta$. Thus we have a contradiction and the proof is concluded.