# On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of martingale type for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the existence of regret minimization strategies for certain online prediction problems.


## 1 Introduction

Let Z₁, Z₂, … be a martingale difference sequence taking values in a separable (2, D)-smooth Banach space 𝔅. A result due to Pinelis [17] asserts that for any u > 0,

 P( sup_{n≥1} ‖ ∑_{t=1}^n Z_t ‖ ≥ σu ) ≤ 2 exp{ −u²/(2D²) }, (1)

where σ is a constant satisfying ∑_{t≥1} ‖Z_t‖²_∞ ≤ σ². Writing the norm as the supremum over the dual ball, we may re-interpret (1) as a one-sided tail control for the supremum of the stochastic process { ∑_{t=1}^n ⟨y, Z_t⟩ : y ∈ 𝔅*, ‖y‖_* ≤ 1 }. In this paper, we consider several extensions of (1), motivated by the following questions:

1. Can (1) be strengthened by replacing σ with a “path-dependent” version of variation?

2. Does a version of (1) hold when we move away from the linear structure of the Banach space?

Positive answers to these questions constitute the first contribution of our paper. The second contribution is the technique itself. The cornerstone of our analysis is a certain equivalence of martingale inequalities and deterministic pathwise statements. The latter inequalities are studied in the field of online learning (or, sequential prediction), and are referred to as regret bounds. We show that the existence (which can be certified via the minimax theorem) of prediction strategies that minimize regret yields predictable processes that help in answering questions (1) and (2). The equivalence is exploited in both directions: stronger regret bounds are derived from the corresponding probabilistic bounds, and vice versa. To obtain one of the main results of the paper, we sharpen the bound by passing several times between the deterministic statements and probabilistic tail bounds. The equivalence asserts a strong connection between probabilistic inequalities for martingales and online learning algorithms.

In the remainder of this section, we present a simple example of the equivalence, based on the gradient descent method, arguably the most popular convex optimization procedure. The example captures, loosely speaking, a correspondence between deterministic optimization methods and probabilistic bounds. Let B denote the unit Euclidean ball in ℝ^d. Define, recursively, the Euclidean projections

 ŷ_{t+1} = ŷ_{t+1}(z₁,…,z_t) = Proj_B( ŷ_t − n^{−1/2} z_t ) (2)

for each t ≥ 1, with the initial value ŷ₁ = 0. Elementary algebra (see the two-line proof in the Appendix, Lemma 19) shows that for any f ∈ B, the regret inequality ∑_{t=1}^n ⟨ŷ_t − f, z_t⟩ ≤ √n holds deterministically for any sequence z₁,…,z_n ∈ B. We re-write this statement as

 ‖ ∑_{t=1}^n z_t ‖ − √n ≤ ∑_{t=1}^n ⟨ŷ_t, −z_t⟩. (3)
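Since (3) is a deterministic statement that holds for every sequence, it can be checked directly by simulation. The following sketch (a sanity check, not part of the formal development; it assumes NumPy) runs the projected updates (2) on sequences in the unit ball and verifies that the regret gap is never positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_ball(y):
    """Euclidean projection onto the unit ball B."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

def regret_gap(z):
    """Return ||sum_t z_t|| - sqrt(n) - sum_t <yhat_t, -z_t> for the updates (2).
    Inequality (3) asserts this gap is <= 0 for any sequence in the unit ball."""
    n, d = z.shape
    y = np.zeros(d)                     # initial value yhat_1 = 0
    lhs = 0.0
    for t in range(n):
        lhs += y @ (-z[t])              # yhat_t is computed before seeing z_t
        y = project_ball(y - z[t] / np.sqrt(n))
    return np.linalg.norm(z.sum(axis=0)) - np.sqrt(n) - lhs

# check on random sequences z_1, ..., z_n in the unit ball
for _ in range(20):
    z = rng.normal(size=(50, 5))
    z /= np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1.0)
    assert regret_gap(z) <= 1e-9
```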

Applying the deterministic inequality (3) to a B-valued martingale difference sequence Z₁,…,Z_n, we obtain, for any u > 0,

 P( ‖ ∑_{t=1}^n Z_t ‖ − √n > u ) ≤ P( ∑_{t=1}^n ⟨ŷ_t, −Z_t⟩ > u ) ≤ exp{ −u²/(2n) }. (4)

The latter upper bound is an application of the Azuma–Hoeffding inequality. Indeed, the process (ŷ_t) is predictable with respect to the filtration generated by (Z_t), and thus (⟨ŷ_t, −Z_t⟩) is a [−1,1]-valued martingale difference sequence. It is worth emphasizing the conclusion:

one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic inequality.

Next, integrating the tail bound in (4) yields a seemingly weaker in-expectation statement

 E‖ ∑_{t=1}^n Z_t ‖ ≤ c√n (5)

for an appropriate absolute constant c. The twist in this uncomplicated story comes next: with the help of the minimax theorem, [23] established the existence of strategies (ŷ_t) such that

 ∀z₁,…,z_n ∈ B, ∀f ∈ B,   ∑_{t=1}^n ⟨ŷ_t − f, z_t⟩ ≤ sup E‖ ∑_{t=1}^n Z_t ‖, (6)

with the supremum taken over all B-valued martingale difference sequences with respect to a dyadic filtration. In view of (5), this bound is at most c√n.
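The in-expectation bound (5) is easy to probe numerically. A minimal Monte Carlo sketch (assuming NumPy): in a Hilbert space, Jensen's inequality gives E‖∑Z_t‖ ≤ √(∑E‖Z_t‖²) = √n for unit-norm increments, so c = 1 already suffices there:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_norm(n, d, reps=2000):
    """Monte Carlo estimate of E||sum_t Z_t|| for i.i.d. Z_t uniform on the
    unit sphere of R^d (a martingale difference sequence in the unit ball)."""
    z = rng.normal(size=(reps, n, d))
    z /= np.linalg.norm(z, axis=2, keepdims=True)
    return np.linalg.norm(z.sum(axis=1), axis=1).mean()

# Jensen: E||sum Z_t|| <= sqrt(sum E||Z_t||^2) = sqrt(n) in the Hilbert case
for n in (10, 100, 400):
    assert mean_norm(n, 5) <= np.sqrt(n)
```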

What have we achieved? Let us summarize. The deterministic inequality (3), which holds for all sequences, implies a tail bound (4). The latter, in turn, implies an in-expectation bound (5), which implies (3) (with a worse constant) through a minimax argument, thus closing the loop. The equivalence—studied in depth in this paper—is informally stated below:

#### Informal:

The following bounds imply each other: (a) an inequality that holds for all sequences; (b) a deviation tail probability for the size of a martingale; (c) an in-expectation bound on the size of a martingale.

The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds.

As already mentioned, the pathwise inequalities, such as (3), are extensively studied in the field of online learning. In this paper, we employ some of the recently developed data-dependent (adaptive) regret inequalities to prove tail bounds for martingales. In turn, in view of the above equivalence, martingale inequalities shall give rise to novel deterministic regret bounds.

While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob’s maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts [2, 3, 13, 4]. The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.

This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces, and also improve it by replacing √n with the square root of a variation. We define a notion of martingale type for general classes of functions in Section 3, and exhibit a tight connection to the growth of sequential Rademacher complexity. Section 4 presents sequential symmetrization; here we prove that statements for the dyadic filtration automatically yield corresponding tail bounds for general discrete-time stochastic processes. In Section 5, we introduce the machinery for obtaining regret inequalities, and show how these inequalities allow one to amplify certain in-expectation bounds into high-probability statements (Section 6). The last two sections contain some of the main results: in Section 7 we prove a high-probability bound for the notion of martingale type, and we present a finer analysis of adaptivity of the variation term in Section 8.

## 2 Results in Banach spaces

For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the bound of (5) can be improved to the distribution-dependent quantity ( ∑_{t=1}^n E‖Z_t‖² )^{1/2}. Given the equivalence sketched earlier, one may wonder whether this implies the existence of a gradient-descent-like method with a sequence-dependent variation governing the rate of convergence of this optimization procedure. Below, we indeed present such a method for 2-smooth Banach spaces.

Let 𝔅 be a separable Banach space, and let 𝔅* denote its dual. 𝔅 is of martingale type p (for 1 ≤ p ≤ 2) if there exists a constant C such that

 E‖ ∑_{t=1}^n Z_t ‖^p ≤ C^p ∑_{t=1}^n E‖Z_t‖^p (7)

for any 𝔅-valued martingale difference sequence. The best possible constant in this inequality (as well as its finiteness) is known to depend on the geometry of the Banach space. For instance, for a Hilbert space, (7) holds with p = 2 and constant C = 1. On the other hand, the triangle inequality implies that any Banach space has the trivial martingale type p = 1.

An equivalent way to define martingale type p is to require that there exist a constant C such that

 E sup_{‖y‖_* ≤ 1} ∑_{t=1}^n ⟨y, Z_t⟩ = E‖ ∑_{t=1}^n Z_t ‖ ≤ C ( ∑_{t=1}^n E‖Z_t‖^p )^{1/p}. (8)

We now show that the strengthening to a sequence-dependent variation holds for any 2-smooth Banach space. Based on the equivalence mentioned earlier, we then immediately obtain tail bounds.

Assume 𝔅 is (2, D)-smooth. Let D_R(·, ·) be the Bregman divergence with respect to a convex function R, which is assumed to be 1-strongly convex on the unit ball of 𝔅. Denote by R²_max the supremum of D_R over the unit ball. We extend and improve (4) as follows.

###### Theorem 1.

Let (Z_t) be a 𝔅-valued martingale difference sequence, and let E_{t−1} stand for the conditional expectation given Z₁,…,Z_{t−1}. For any u > 0, it holds that

 P( ( ‖ ∑_{t=1}^n Z_t ‖ − 2.5 R_max(√V_n + 1) ) / √( V_n + W_n + (E√(V_n + W_n))² ) > u ) ≤ √2 exp{ −u²/4 }, (9)

where

 V_n = ∑_{t=1}^n ‖Z_t‖²  and  W_n = ∑_{t=1}^n E_{t−1}‖Z_t‖². (10)

Furthermore, the bound holds with W_n = 0 if the martingale differences are conditionally symmetric.

In addition to extending the Euclidean result of the previous section to Banach spaces, (9) offers several advantages. First, it does not depend on n. Second, the deviations are self-normalized (that is, scaled by root-variation terms). We refer to Lemma 11 for other forms of probabilistic bounds.

To prove the theorem, we start with a deterministic inequality from [21, Corollary 2]. For completeness, the proof is provided in the Appendix.

###### Lemma 2.

Let B ⊆ 𝔅 be a convex set. Define, recursively, the mirror descent updates

 ŷ_{t+1} = argmin_{y∈B} { η_t ⟨y, z_t⟩ + D_R(y, ŷ_t) } (11)

with ŷ₁ = argmin_{y∈B} R(y), and with the adaptive step size η_t chosen as in [21]. Then for any f ∈ B and any z₁,…,z_n ∈ 𝔅,

 ∑_{t=1}^n ⟨ŷ_t − f, z_t⟩ ≤ 2.5 R_max ( √( ∑_{t=1}^n ‖z_t‖² ) + 1 ).
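In the Euclidean case (R(y) = ½‖y‖², so mirror descent reduces to projected gradient descent), a bound of the same form can be verified numerically. A minimal sketch, assuming NumPy and the concrete adaptive step η_t = (1 + ∑_{s≤t}‖z_s‖²)^{−1/2}; the constant checked here (3, rather than 2.5 R_max) is the loose one given by the standard adaptive-step analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

def adaptive_regret(z, f):
    """Projected gradient descent on the unit ball with the adaptive step
    eta_t = (1 + sum_{s<=t} ||z_s||^2)^(-1/2); returns sum_t <yhat_t - f, z_t>."""
    n, d = z.shape
    y = np.zeros(d)
    reg, sq = 0.0, 1.0
    for t in range(n):
        reg += (y - f) @ z[t]           # regret term uses yhat_t before the update
        sq += z[t] @ z[t]
        y = y - z[t] / np.sqrt(sq)      # gradient step with the adaptive step size
        nrm = np.linalg.norm(y)
        if nrm > 1.0:
            y /= nrm                    # projection back onto the unit ball
    return reg

# regret is bounded by c (sqrt(V_n) + 1) with V_n = sum_t ||z_t||^2 and c = 3
for _ in range(10):
    z = rng.normal(size=(60, 4)) * rng.uniform(0.1, 1.0)
    f = rng.normal(size=4)
    f /= max(np.linalg.norm(f), 1.0)
    V = float((z ** 2).sum())
    assert adaptive_regret(z, f) <= 3 * (np.sqrt(V) + 1)
```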
###### Proof of Theorem 1.

We take B to be the unit ball in 𝔅, ensuring that ‖ŷ_t‖ ≤ 1. For any martingale difference sequence (Z_t) with values in 𝔅, the above lemma implies, by definition of the norm,

 ‖ ∑_{t=1}^n Z_t ‖ − 2.5 R_max(√V_n + 1) ≤ ∑_{t=1}^n ⟨ŷ_t, Z_t⟩ (12)

for all sample paths. Dividing both sides by √( V_n + W_n + (E√(V_n + W_n))² ), we conclude that the left-hand side in (9) is upper bounded by

 P( ∑_{t=1}^n ⟨ŷ_t, Z_t⟩ / √( V_n + W_n + (E√(V_n + W_n))² ) > u ). (13)

To control this probability, we recall the following result [8, Theorem 2.7]:

###### Theorem 3 ([8]).

For a pair of random variables (A, B), with B ≥ 0, such that

 E exp{ λA − λ²B²/2 } ≤ 1   ∀λ ∈ ℝ, (14)

it holds that

 P( |A| / √( B² + (EB)² ) > u ) ≤ √2 exp{ −u²/4 }.

To apply this theorem, we verify assumption (14):

###### Lemma 4.

The random variables A = ∑_{t=1}^n ⟨ŷ_t, Z_t⟩ and B = √(V_n + W_n) satisfy (14).

The proof of the Lemma, as well as most of the proofs in this paper, is postponed to the Appendix. This concludes the proof of Theorem 1. ∎
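As a quick sanity check of Theorem 3 (not part of the proof; assuming NumPy), take the simplest pair satisfying (14): A = σN with N standard Gaussian and the deterministic B = σ, for which E exp{λA − λ²B²/2} = 1 for all λ. The empirical tail of |A|/√(B² + (EB)²) should then sit below √2 exp{−u²/4}:

```python
import numpy as np

rng = np.random.default_rng(3)

def self_normalized_tail(u, sigma=2.0, reps=200_000):
    """Empirical P( |A| / sqrt(B^2 + (EB)^2) > u ) for A = sigma*N, B = sigma."""
    a = sigma * rng.normal(size=reps)
    return float(np.mean(np.abs(a) / np.sqrt(2.0 * sigma ** 2) > u))

# the Gaussian tail lies well below the bound of Theorem 3
for u in (1.0, 2.0, 3.0):
    assert self_normalized_tail(u) <= np.sqrt(2.0) * np.exp(-u ** 2 / 4.0)
```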

Let us make several remarks. First, [21, Corollary 2] proves a more general deterministic inequality, valid for an arbitrary collection of functions rather than only the linear functions employed above; we refer to [21] for the precise statement.

Second, the reader will notice that the pathwise inequality (12) does not depend on n, and the construction of (ŷ_t) is also oblivious to this value. A simple argument (Lemma 20 in the Appendix) then allows us to lift the real-valued Burkholder–Davis–Gundy inequality (with the constant √3 from [6]) to Banach space valued martingales:

 E max_{s=1,…,n} ‖ ∑_{t=1}^s Z_t ‖ ≤ ( 2.5 R_max + √3 ) E√V_n + 2.5 R_max.

Notably, the constant in the resulting BDG inequality is proportional to R_max.
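A small Monte Carlo experiment (a sketch assuming NumPy, with i.i.d. sphere-valued increments, for which E√V_n = √n exactly) illustrates that the ratio E max_s ‖∑_{t≤s} Z_t‖ / E√V_n indeed stays below an absolute constant; we check the generous constant 4:

```python
import numpy as np

rng = np.random.default_rng(4)

def bdg_ratio(n, d, reps=1000):
    """Monte Carlo estimate of E max_s ||sum_{t<=s} Z_t|| / E sqrt(V_n) for
    i.i.d. Z_t uniform on the unit sphere of R^d (so V_n = n deterministically)."""
    z = rng.normal(size=(reps, n, d))
    z /= np.linalg.norm(z, axis=2, keepdims=True)
    running = np.linalg.norm(z.cumsum(axis=1), axis=2)   # norms of partial sums
    return running.max(axis=1).mean() / np.sqrt(n)

# the lifted BDG inequality predicts a ratio bounded by an absolute constant
assert bdg_ratio(100, 3) <= 4.0
```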

We also remark that Theorem 1 can be naturally extended to p-smooth Banach spaces for p ∈ (1, 2]. This is accomplished in a straightforward manner by extending Lemma 2.

In conclusion, we were able to replace the distribution-independent bound √n with the sequence-dependent quantity √V_n. One may ask whether this phenomenon is general; that is, whether a sequence-dependent variation bound necessarily holds whenever the corresponding distribution-independent bound does. We prove in Theorem 5 below that this is indeed the case (up to a logarithmic factor), a result that holds for general classes of functions.

## 3 Martingale Type for a General Class of Functions

We now define the analogue of martingale type for a class G of real-valued measurable functions on an abstract measurable space 𝒵. To this end, we assume that (Z_t)_{t≥1} is a discrete-time process on a probability space (Ω, 𝒜, P). Let E denote the expectation on this probability space, and let E_{t−1} denote the conditional (given Z₁,…,Z_{t−1}) expectation. For any g ∈ G,

 ∑_{t=1}^n ( g(Z_t) − E_{t−1}[g(Z_t)] ) (15)

is a sum of martingale differences. We let (Z′_t) be a tangent sequence; that is, conditionally on Z₁,…,Z_{t−1}, the variables Z_t and Z′_t are independent and identically distributed. Let E′_{t−1} denote the conditional (given Z₁,…,Z_{t−1}) expectation with respect to Z′_t.

###### Definition 1.

A class G has martingale type p if there exists a constant C such that

 E[ sup_{g∈G} ∑_{t=1}^n ( g(Z_t) − E_{t−1}[g(Z_t)] ) ] ≤ C E( ∑_{t=1}^n E′_{t−1} sup_{g∈G} |g(Z_t) − g(Z′_t)|^p )^{1/p}. (16)
###### Remark 3.1.

We conjecture that the statements below also hold for the definition of martingale type in which the quantity on the right-hand side of (16) is replaced with the smaller and more natural quantity C E( sup_{g∈G} ∑_{t=1}^n E′_{t−1} |g(Z_t) − g(Z′_t)|^p )^{1/p}.

In proving (16), we shall work with a dyadic filtration. Let (𝒜_t) be the filtration generated by independent Rademacher (symmetric {±1}-valued) random variables ε₁, ε₂, …. Let x = (x_t) be a predictable process with respect to this filtration (that is, x_t is 𝒜_{t−1}-measurable) with values in some set 𝒳. Sequential Rademacher complexity (defined in [25] without the absolute values; this difference is minor) of an abstract class F on x is defined as

 R_n(F; x) = E| sup_{f∈F} ∑_{t=1}^n ε_t f(x_t) |. (17)
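For a small depth-n tree, (17) can be computed exactly by enumerating all 2^n sign sequences. A minimal sketch (assumption: the tree is encoded as a map from sign prefixes to points of 𝒳, and each f ∈ F as a map from points to reals):

```python
import itertools

def seq_rademacher(tree, F, n):
    """Exact R_n(F; x) = E | sup_{f in F} sum_t eps_t f(x_t(eps)) | for a
    depth-n tree `tree` mapping each prefix eps_{1:t-1} (a tuple) to a point."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        best = max(
            sum(e * f[tree[eps[:t]]] for t, e in enumerate(eps))
            for f in F
        )
        total += abs(best)
    return total / 2 ** n

# depth-2 example: a single constant function recovers E|eps_1 + eps_2| = 1
tree = {(): 'a', (-1,): 'b', (1,): 'c'}
F = [{'a': 1.0, 'b': 1.0, 'c': 1.0}]
assert abs(seq_rademacher(tree, F, 2) - 1.0) < 1e-12
```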
###### Definition 2.

Let r ∈ (1, 2]. We say that sequential Rademacher complexity of F exhibits an n^{1/r} growth with constant C if

 ∀n ≥ 1,  ∀x,   R_n(F; x) ≤ C n^{1/r} · sup_{f∈F, ε∈{±1}^n, t≤n} |f(x_t(ε))|. (18)

We will work with a particular class of difference functions derived from G and defined on 𝒵 × 𝒵. It is immediate that this class exhibits n^{1/r} growth whenever G does, and vice versa, with at most a doubling of the constant C.

By a sequential symmetrization technique, it holds (see [25]) that

 E[ sup_{g∈G} ∑_{t=1}^n ( g(Z_t) − E_{t−1}[g(Z_t)] ) ] ≤ 2 sup_z R_n(G; z), (19)

with the supremum over 𝒵-valued predictable processes z.

Therefore, the statement “G has martingale type r whenever its sequential Rademacher complexity exhibits an n^{1/r} growth” corresponds to the phenomenon that, loosely speaking, “one may replace the distribution-independent bound with a sequence-dependent variation.”

The next theorem shows a tight connection between the complexity growth and martingale type.

###### Theorem 5.

For any function class , the following statements hold:

1. If for some r ∈ (1, 2] sequential Rademacher complexity of G exhibits an n^{1/r} growth, then G has martingale type p for every p < r.

2. If G has martingale type p, then sequential Rademacher complexity exhibits an n^{1/p} growth.

The proof relies on the development in the next few sections, and especially on Lemma 15. The technique is partly inspired by the work of Burkholder [7] and Pisier [18]. In particular, a key tool is the reverse Hölder principle [19, Prop. 8.53].

In addition to Theorem 5, let us state informal versions of Theorems 17 and 18 which appear, respectively, in Sections 7 and 8. Define the random variables

 Var_p = E′ ∑_{t=1}^n sup_{g∈G} |g(Z_t) − g(Z′_t)|^p,   Var_p(g) = E′ ∑_{t=1}^n |g(Z_t) − g(Z′_t)|^p,

where E′ is the expectation with respect to the tangent sequence, conditionally on Z₁,…,Z_n. Then Theorem 17 states that, with high probability (controlled by the parameter u),

 sup_{g∈G} ∑_{t=1}^n ( g(Z_t) − E_{t−1}[g(Z_t)] ) ≲ log(n) · Var_r^{1/r} + u · Var_2^{1/2}

whenever sequential Rademacher complexity exhibits an n^{1/r} growth. Theorem 18 addresses the case of martingale type 2 and states that, with high probability (controlled by the parameter u),

 sup_{g∈G} ∑_{t=1}^n ( g(Z_t) − E_{t−1}[g(Z_t)] ) − n^{q/4} · ( Var_2^{1/2}(g) )^{(2−q)/4} − u · Var_2^{1/2}(g) ≲ 0

whenever the sequential entropy (defined below) at scale ε behaves as ε^{−q}.

### 3.1 Other complexity measures

We see that the martingale type of G is described by the behavior of sequential Rademacher complexity. The latter behavior can, in turn, be quantified in terms of geometric quantities, such as sequential covering numbers and the sequential scale-sensitive dimension. We present the following two definitions from [25], both stated in terms of an 𝒳-valued predictable process x with respect to the dyadic filtration. It may be beneficial (at least it was for the authors of [25]) to think of x as a complete binary tree of depth n, decorated by elements of 𝒳, with ε ∈ {±1}^n specifying a path in this tree.

###### Definition 3 (Sequential covering number).

Let x be an 𝒳-valued predictable process with respect to the dyadic filtration, and let α > 0. A collection V of ℝ-valued predictable processes is called an α-cover (with respect to ℓ_p) of F on x if

 ∀f ∈ F, ∀ε ∈ {±1}^n, ∃v ∈ V,   ( (1/n) ∑_{t=1}^n |f(x_t(ε)) − v_t(ε)|^p )^{1/p} ≤ α. (20)

The cardinality of the smallest α-cover is denoted by N_p(F, α, x), and we write N_p(F, α, n) = sup_x N_p(F, α, x); both are referred to as sequential covering numbers. Sequential entropy is defined as log N_p(F, α, n).

###### Definition 4 (Sequential fat-shattering dimension).

We say that F shatters the predictable process x at scale α if there exists a real-valued predictable process s = (s_t) such that

 ∀ε ∈ {±1}^n,  ∃f ∈ F,  s.t.  ∀t ≤ n,   ε_t( f(x_t(ε)) − s_t(ε) ) ≥ α/2.

The largest length n of a shattered predictable process is called the sequential fat-shattering dimension at scale α and is denoted by fat_α(F).
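For small trees, the shattering condition can be checked exhaustively. A minimal sketch (same hypothetical encoding as before: trees as maps from sign prefixes to points, functions as maps from points to reals; the witness s is itself a real-valued tree):

```python
import itertools

def shatters(tree, witness, F, n, alpha):
    """Check whether F shatters the depth-n tree `tree` at scale alpha with the
    real-valued witness tree `witness`: every sign pattern eps must admit some
    f in F with eps_t * (f(x_t(eps)) - s_t(eps)) >= alpha/2 for all t."""
    for eps in itertools.product((-1, 1), repeat=n):
        ok = any(
            all(e * (f[tree[eps[:t]]] - witness[eps[:t]]) >= alpha / 2
                for t, e in enumerate(eps))
            for f in F
        )
        if not ok:
            return False
    return True

# depth-1 example: two functions at values +1 and -1 shatter at scale 2,
# while a single function cannot realize both sign patterns
tree, witness = {(): 'x'}, {(): 0.0}
assert shatters(tree, witness, [{'x': 1.0}, {'x': -1.0}], 1, 2.0)
assert not shatters(tree, witness, [{'x': 1.0}], 1, 2.0)
```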

The sequential covering numbers and the fat-shattering dimension are natural extensions of the classical notions, as shown in [25]. In particular, a Dudley-type entropy integral upper bound in terms of sequential covering numbers holds for sequential Rademacher complexity. The sequential covering numbers are, in turn, upper bounded in terms of the fat-shattering dimensions, in parallel to the way classical empirical covering numbers are controlled by the scale-sensitive version of the Vapnik-Chervonenkis dimension. We summarize the implications of these relationships in the following corollary:

###### Corollary 6.

For any function class ,

1. If for some r ∈ (1, 2] either fat_α(F) = O(α^{−r}) or the sequential entropy grows as O(α^{−r}), then G has martingale type p for any p < r.

2. If G has martingale type p then, for every r < p, there exists a constant C_r such that fat_α(F) ≤ C_r α^{−r} and log N_p(F, α, n) ≤ C_r α^{−r}, for all α > 0.

We have established a relation between the martingale type of a function class and several sequential complexities of the class. However, unlike our starting point (1) and Theorem 1, our results so far do not quantify the tail behavior of the difference between the supremum of the martingale process and the corresponding variation. A natural idea is to mimic the “equivalence” argument used in Section 2 to conclude exponential tail bounds. Unfortunately, the deviation inequalities of the previous section rest on pathwise regret bounds that, in turn, rely on the linear structure of the associated Banach space, as well as on properties such as smoothness and uniform convexity. Without the linear structure, it is not clear whether analogous pathwise statements hold. The goal of the rest of the paper is to bring forth some of the tools recently developed within the online learning literature, and to apply these pathwise regret bounds to conclude the high-probability tail bounds associated with martingale type. In addition to this goal, we will seek a version of Theorem 5(i) for bounded functions, where the n^{1/r} growth of sequential Rademacher complexity implies martingale type r itself (rather than every p < r), but with an additional logarithmic factor. Our third goal will be to establish per-function variation bounds (similar to the notion of a weak variance [5]). We show that this latter bound is a finer version of the variation term, possible for classes that are “not too large”.

Our plan is as follows. First, we reduce the problem to one based on the dyadic filtration. After that, we shall introduce certain deterministic inequalities from the online learning literature that are already stated for the dyadic filtration.

## 4 Symmetrization: dyadic filtration is enough

The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization. Consider the martingale

 M_g = ∑_{t=1}^n g(Z_t) − E[ g(Z_t) | Z₁,…,Z_{t−1} ]

indexed by g ∈ G. If (Z_t) is adapted to a dyadic filtration, each increment takes on the value

 f_g( x_t(ε_{1:t−1}) ) ≜ ( g(Z_t(ε_{1:t−1}, +1)) − g(Z_t(ε_{1:t−1}, −1)) ) / 2

or its negation, where x is a predictable process with values in 𝒵 × 𝒵 defined by x_t(ε_{1:t−1}) = ( Z_t(ε_{1:t−1}, +1), Z_t(ε_{1:t−1}, −1) ). In the rest of the paper, we work directly with martingales of the form ∑_{t=1}^n ε_t f(x_t(ε)), indexed by an abstract class F and an abstract 𝒳-valued predictable process x.
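The increment identity above is elementary and can be checked mechanically on a toy dyadic process. A sketch (the process Z_t(ε) = t·∑_{s≤t} ε_s and the function g below are arbitrary hypothetical choices):

```python
import itertools

def Z(t, eps):
    """Toy dyadic process: Z_t is a function of the first t signs."""
    return t * sum(eps[:t])

def g(z):
    return z * z + 1.0

# check: g(Z_t) - E[g(Z_t) | eps_{1:t-1}] = eps_t * f_g(x_t(eps_{1:t-1})),
# since eps_t is uniform on {-1, +1} given the past
for n in (1, 2, 3):
    for eps in itertools.product((-1, 1), repeat=n):
        plus = g(Z(n, eps[:n - 1] + (1,)))
        minus = g(Z(n, eps[:n - 1] + (-1,)))
        f_g = (plus - minus) / 2          # the tree function f_g
        cond_mean = (plus + minus) / 2    # conditional expectation over eps_n
        increment = g(Z(n, eps)) - cond_mean
        assert abs(increment - eps[n - 1] * f_g) < 1e-12
```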

We extend the symmetrization approach of Panchenko [15] to sequential symmetrization for the case of martingales. In contrast to the more frequently-used Giné-Zinn symmetrization proof (via Chebyshev’s inequality) [12, 26] that allows a direct tail comparison of the symmetrized and the original processes, Panchenko’s approach allows for an “indirect” comparison. The following immediate extension of [15, Lemma 1] will imply that any