# Towards Practical Conditional Risk Minimization

We study conditional risk minimization (CRM), i.e. the problem of learning a hypothesis of minimal risk for prediction at the next step of sequentially arriving dependent data. Despite it being a fundamental problem, successful learning in the CRM sense has so far only been demonstrated using theoretical algorithms that cannot be used for real problems as they would require storing all incoming data. In this work, we introduce MACRO, a meta-algorithm for CRM that does not suffer from this shortcoming, as instead of storing all data it maintains and iteratively updates a set of learning subroutines. Using suitable approximations, MACRO can be implemented and applied to real data, leading, as we illustrate experimentally, to improved prediction performance compared to traditional non-conditional learning.


## 1 Introduction

Conditional risk minimization (CRM) is a fundamental learning problem when the available data is not an i.i.d. sample from a fixed data distribution, but a sequence of interdependent observations, i.e. a stochastic process. Like i.i.d. samples, stochastic processes can be interpreted in a generative way: each data point is sampled from a conditional data distribution, where the conditioning is on the sequence of observations so far. This view is common, e.g., in the literature on time-series prediction, where the goal is to predict the next value of a time series given the sequence of observed values so far. CRM is the discriminative analog to this: for a given loss function and a set of hypotheses, the goal is to identify the best hypothesis to apply at the next time step.

Conditional risk minimization has many applications for learning tasks in which data arrives sequentially and decisions have to be made quickly, e.g. frame-wise classification of video streams. As a more elaborate example, imagine the problem of predicting which flights will be delayed over the next hour at an airport. Clearly, the observed data for this task has a temporal structure and exhibits strong dependencies. A prediction function that is adjusted to the current conditions of the airport will be very useful. It will not only make it possible to predict the delays of incoming flights in advance, but also to identify the possible parameter changes that would avoid delays, e.g. by rescheduling or rerouting aircraft (Rosenberger2003).

Despite its fundamental nature, CRM remains largely an unsolved problem in theory as well as in practice. One reason is that it is a far harder task than ordinary i.i.d. learning, because the underlying conditional distributions, and therefore the optimal hypotheses, change at every time step. Until recently, it was not even clear how to formalize successful learning in the CRM context, as there exists no single target hypothesis to which a sequential learning process could converge. zimin2016aistats was the first work to offer a handle on the problem by formalizing a notion of learnability in the CRM sense, see Section 2 for a formal discussion. Unfortunately, their results are purely theoretical as the suggested procedure would require storing all observations of the stochastic process for all future time steps, which is clearly not possible in practice.

In this work, we make three contributions. First, we generalize the notion of learnability from zimin2016aistats to a more practical approximate notion, $\varepsilon$-learnability, resembling the classic probably approximately correct (PAC) framework (valiant1984theory). Second, we introduce MACRO, a meta-algorithm for CRM, and prove that it achieves $\varepsilon$-learnability under less restrictive assumptions than previous approaches. Finally, we show that MACRO is practical, as it requires only training a set of elementary learning algorithms on different subsets of the available data, but not storing the complete sequence of observations, and we report on practical experiments that highlight MACRO's straightforward applicability to real-world problems.

#### Related work

The fundamentals of statistical learning theory were originally built on the assumption of independent and identically distributed data (Vapnik01), but extensions to stochastic processes were suggested soon after.

When one is more interested in the short-term behaviour of the process, it makes sense to focus on the conditional risk, where the expectation is taken with respect to the conditional distribution of the process, as was argued in (Pestov2010; Shalizi13). Learnability, i.e. the ability to perform risk minimization with respect to the conditional risk, was established for a number of particular classes of stochastic processes, such as i.i.d., exchangeable, mixing and some others, see (steinwart2005consistency; Pestov2010; Berti04; Mohri02fixed; zimin2015arxiv). Most of these works focus on the estimation of the conditional risk by the average of losses over the data observed so far. Conditional risk minimization was considered by (kuznetsov2015learning) with a later extension in (kuznetsov2016time). Without trying to achieve learnability, they consider the behaviour of the empirical risk minimization algorithm at each fixed time step by using a non-adaptive estimator. A general study of learnability was performed in (zimin2016aistats) by emphasizing the role of pairwise discrepancies and the necessity to use an adaptive estimator. A number of related settings that utilize the notion of conditional risk and its variants have been studied in Kuznetsov01; zimin2015arxiv; wintenberger2014optimal.

A topic related to conditional risk minimization is time-series prediction. While, in general, time-series methods cannot be applied to CRM, both fields share a number of ideas. Traditional approaches to prediction include forecasting by fitting different parametric models to the data, such as ARMA or ARIMA, or using spectral methods, see, e.g., (box2015time). Alternative approaches include nonparametric prediction of time series (Modha01; modha1998memory; alquier2012model), and prediction by statistical learning (alquier2013prediction; mcdonald2012time). Similar research problems are studied in the field of dynamical systems, where one tries to identify the underlying transformation that governs the transitions, see (nobel2001consistent; farmer1987predicting; casdagli1989nonlinear; steinwart2009consistency).

## 2 Conditional Risk Minimization

We first introduce our main notation. We are given a sequence of observations from a stochastic process taking values in some space $\mathcal{Z}$. The most common choice for $\mathcal{Z}$ is a product space, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output sets, respectively, of a supervised prediction problem. Other choices are possible, though, e.g. for modeling unsupervised learning tasks. We write $z_{1:n}$ as a shorthand for a sequence $z_1, \dots, z_n$. We fix a hypothesis class $\mathcal{H}$, usually a subset of the mappings from $\mathcal{X}$ to some decision space $\mathcal{D}$, e.g. $\mathcal{D} = \mathcal{Y}$. We also fix a loss function $\ell$ that allows us to evaluate the loss of a given hypothesis $h$ on a given point $z = (x,y)$ as $\ell(h(x), y)$, abbreviated $\ell(h, z)$, with the latter version used to shorten the notation. For any time step $n$, we denote by $R_n(h)$ the expected loss of $h$ with respect to the distribution of the next step, conditioned on the data so far, i.e.

$$R_n(h) = \mathbb{E}\big[\, \ell(h, z_{n+1}) \,\big|\, z_{1:n} \,\big].$$

To characterize the capacity of the hypothesis class we use sequential covering numbers with respect to the $\ell_\infty$-norm, $N_\infty(\cdot, \cdot, \cdot)$, introduced by (Rakhlin01). This complexity measure is applied to the induced function space $L(\mathcal{H}) = \{\ell(h, \cdot) : h \in \mathcal{H}\}$. Throughout the paper we assume that $L(\mathcal{H})$ has a finite sequential fat-shattering dimension, a notion of complexity of a function class. These quantities are generalizations of more traditional measures for i.i.d. data, so readers unfamiliar with the theory of stochastic processes can read the results in terms of usual covering numbers and VC dimension. The formal definitions can be found in the supplementary material.

Our task is to find a hypothesis of minimal risk, i.e. solve

$$\min_{h \in \mathcal{H}} R_n(h), \qquad (1)$$

where $R_n(h)$ is the conditional risk at step $n$. Note that the conditional distribution is different at every time step, so the objective also changes constantly. At each step we can compute only an approximate solution, $h_n$, based on the data. A desired property is that the quality improves with the number of observations, as summarized in the following definition.

###### Definition 1 (Learnability(zimin2016aistats)).

For a fixed loss function $\ell$ and a hypothesis class $\mathcal{H}$, we call a class of processes $\mathcal{C}$ conditionally learnable in the limit if there exists an algorithm that, for every process in $\mathcal{C}$, produces a sequence of hypotheses, $h_n$, each based on $z_{1:n}$, satisfying

$$R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) \to 0 \qquad (2)$$

in probability over the samples drawn from the process. An algorithm that satisfies (2) we call a limit learner for the class $\mathcal{C}$.

From a practical perspective, one might be satisfied with achieving only some target accuracy. For this purpose, we introduce the following relaxed definition.

###### Definition 2 (ε-Learnability).

For a fixed loss function $\ell$ and a hypothesis class $\mathcal{H}$, we call a class of processes $\mathcal{C}$ $\varepsilon$-conditionally learnable for $\varepsilon > 0$ if there exists an algorithm that, for every process in $\mathcal{C}$, produces a sequence of hypotheses, $h_n$, each based on $z_{1:n}$, satisfying

$$\Pr\Big[ R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) > \varepsilon \Big] \to 0. \qquad (3)$$

An algorithm that satisfies (3) we call an $\varepsilon$-learner for the class $\mathcal{C}$.

A class of processes is learnable in the sense of (zimin2016aistats) if and only if it is $\varepsilon$-learnable for all $\varepsilon > 0$.

### 2.1 Discrepancies

Following (zimin2016aistats), our approach relies on a specific notion of distance between distributions called discrepancy, a subclass of integral probability metrics (zolotarev1983probability). It is a popular measure used, for example, in the field of domain adaptation (kifer2004detecting; BenDavid2007; ben2010theory). In this work we use this distance only to quantify distances between the conditional distributions, therefore we define the discrepancies only between those.

###### Definition 3 (Pairwise discrepancy).

For a sample from a fixed stochastic process, the pairwise discrepancy between time points $i$ and $j$ is

$$d_{i,j} = \sup_{h \in \mathcal{H}} \big| R_{i-1}(h) - R_{j-1}(h) \big|. \qquad (4)$$

Given the sequence of conditional distributions, an interesting quantity is the covering number of this set of distributions with the discrepancy as a metric. If, as a thought experiment, we identify $\varepsilon$-close distributions with each other, then this covering number characterizes the minimal number of distributions sufficient to represent all of them. This quantity can therefore serve as a measure of the complexity of learning from this sequence, and it appears naturally in the analysis of the performance of our algorithm as a lower bound on the necessary computational resources.

## 3 A Meta-Algorithm for Conditional Risk Optimization (MACRO)

This section contains our main contributions. We introduce a meta-algorithm, MACRO, that can utilize any ordinary learning algorithm as a subroutine, and we study its theoretical properties, in particular establishing its ability to -learn a broad class of stochastic processes. For the sake of readability, for the main theorems we only provide their statements, while the proofs can be found in the supplemental material.

The main feature of the analysis provided in zimin2016aistats is that all properties of the process under consideration are summarized in a single assumption: the existence of a specific upper bound on the discrepancies. In the present paper we continue in this framework, but use the following weaker assumption.

###### Assumption 1.

For any $i$ and $j$, there exists a value $M_{i,j}$ that is a computable function of the observations, such that $d_{i,j} \le M_{i,j}$.

In words, we must be able to upper-bound the discrepancies of the process by a quantity we can actually compute. Note that the assumption is trivially fulfilled for many simple processes. For example, $M_{i,j} = 0$ works for i.i.d. data and $M_{i,j} = \mathbb{1}[z_{i-1} \ne z_{j-1}]$ for discrete Markov chains. In contrast to (zimin2016aistats), we do not require any additional measurability conditions, since those are taken care of by the way MACRO uses the upper bounds. Further examples of upper bounds can be found in zimin2016aistats. Note that for real data it might not be possible to verify Assumption 1 with non-trivial $M_{i,j}$, but the same is true for most assumptions made about stochastic processes in the literature, and even for the i.i.d. property in standard learning.
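The two example bounds just mentioned can be written as simple functions. The following is a minimal sketch; the function names and the state-based signature are our own illustration, not the paper's notation:

```python
# Hedged sketch of discrepancy upper bounds M_{i,j} from Assumption 1.
# Each function receives the observations immediately preceding steps i and j.

def iid_bound(state_i, state_j):
    """For i.i.d. data all conditional distributions coincide,
    so M_{i,j} = 0 is a valid upper bound on d_{i,j}."""
    return 0.0

def markov_bound(state_i, state_j):
    """For a discrete Markov chain the conditional distribution at a step
    depends only on the preceding state, so the discrepancy vanishes when
    the preceding states agree and is at most 1 (bounded loss) otherwise."""
    return 0.0 if state_i == state_j else 1.0
```

Any such computable bound can be plugged into MACRO, including the trivial constant 1, although only sufficiently tight bounds yield learnability guarantees.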

In principle MACRO can use any computable upper bounds, even the trivial constant 1. However, learnability is only guaranteed if they satisfy certain conditions, which we will discuss after the corresponding theorems.

### 3.1 Conditional Risk Minimization with Bounded Discrepancies

The main idea behind MACRO is the thought experiment from the previous section: if two conditional distributions are very similar, we can use the same hypothesis for both of them. To find these hypotheses, the meta-algorithm maintains a list of learning subroutines, each of which is run independently and updated using a selected subset of the observed data points. Over the course of its run, the meta-algorithm always maintains an active hypothesis that can immediately be applied when a new observation arrives. After each observation, one or more of the existing subroutines are updated, and a new subroutine can be added to the list, if necessary. The meta-algorithm then constructs a new active hypothesis from the ones produced by the currently running subroutines, to be prepared for the next step of the process. The schema of the algorithm is given in Figure 1.
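The main loop described above can be sketched as follows. This is a minimal illustration, not the exact procedure of Figure 1: the subroutine interface, the similarity threshold `eps`, and the rule of updating every close subroutine are our assumptions.

```python
# Hedged sketch of the MACRO meta-algorithm's main loop.

class MACRO:
    def __init__(self, make_subroutine, discrepancy_bound, eps):
        self.make_subroutine = make_subroutine  # factory for a fresh learner
        self.bound = discrepancy_bound          # computable upper bound M_{i,j}
        self.eps = eps                          # similarity threshold
        self.subroutines = []                   # running learners
        self.anchors = []                       # creation step of each learner
        self.active = None                      # index of the active subroutine

    def step(self, t, history, z_t):
        # Find subroutines whose associated conditional distribution is
        # eps-close (according to the bound) to the current one.
        close = [j for j, s in enumerate(self.anchors)
                 if self.bound(history, s, t) <= self.eps]
        if not close:
            # No existing subroutine is close enough: start a new one.
            self.subroutines.append(self.make_subroutine())
            self.anchors.append(t)
            close = [len(self.subroutines) - 1]
        for j in close:                  # update all eps-close subroutines
            self.subroutines[j].update(z_t)
        self.active = close[0]           # hypothesis used for the next step

    def predict(self, x):
        return self.subroutines[self.active].predict(x)
```

Note that the raw observations are consumed by the subroutines and need not be stored by the meta-algorithm itself, which is the key practical advantage over earlier CRM procedures.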

Before we proceed to the theoretical properties of the meta-algorithm, we fix further notation for its components. At any time step $n$, we denote by $N_n$ the number of started subroutines. The time steps at which the $j$-th subroutine has been updated up to step $n$ form a set $C_{j,n}$ of size $s_{j,n}$. By $h_{j,i}$ we denote the output of the $j$-th subroutine after having been updated $i$ times. By $I_n$ we denote the index of the subroutine that MACRO outputs in step $n$, i.e. $h_n = h_{I_n, s_{I_n,n}}$.

#### Computational considerations.

The amount of computation done by MACRO in step $n$ is at most proportional to the current number of subroutines, $N_n$. Therefore, we first discuss the quantitative behavior of this number.

###### Lemma 1.

Let $N(M, n, \varepsilon)$ be the $\varepsilon$-covering number of the sequence of conditional distributions with respect to the bounds $M_{i,j}$. Then for any $n$, it holds that

$$N(M, n, \varepsilon) \le N_n \le N(M, n, \varepsilon/2). \qquad (5)$$

Observe that $N_n$ is always lower-bounded by $N(M, n, \varepsilon)$, making the latter a natural limit on how many separate subroutines are required to learn a particular sequence.

The overall computational complexity of the previous algorithms based on the ERM principle from zimin2015arxiv and zimin2016aistats grows much faster with the dataset size than that of MACRO, whose per-step cost is proportional to the number of running subroutines, with a potential for further reduction. This allows its application to much larger datasets, as shown in Section 4.

#### Exceptional sets.

As discussed in zimin2016aistats (and resembling the "probably" aspect of PAC learning), learnability guarantees for stochastic processes may not hold for every possible realization of the process. Here, we follow the same strategy and introduce a set of exceptional realizations. However, the definition differs from the one in (zimin2016aistats), as it is adapted to the working mechanisms of the meta-algorithm.

###### Definition 4 (Exceptional set).

For a fixed threshold $\varepsilon$, for any $k$ and $m$, set

$$E_{k,m} = \Big\{ \big| \operatorname{supp}(I_n) \big| \le k \ \wedge\ \min_{j \in \operatorname{supp}(I_n)} s_{j,n} \ge m \Big\}, \qquad (6)$$

where $\operatorname{supp}(I_n)$ denotes the set of subroutine indices used by MACRO up to step $n$. Then $E_{k,m}^c$, the complement of $E_{k,m}$, is an exceptional set of realizations.

In words, the favorable realizations are the ones that do not force the algorithm to use too many subroutines (at most $k$) and, at the same time, for which all used subroutines are updated often enough (at least $m$ times). The intuition behind this is that a subroutine will be slow in converging to an optimal predictor if it is updated very rarely. However, the overall performance of the meta-algorithm can suffer only if rarely updated subroutines are nevertheless used from time to time.

### 3.2 Subroutines

MACRO, as a meta-algorithm, relies on the subroutines to perform the actual learning of hypotheses. In the following we go through several options for subroutines and discuss the resulting theoretical guarantees.

#### Empirical risk minimization.

We start with the simplest choice of a subroutine: an empirical risk minimization (ERM) algorithm that stores all data points it is updated with. When required, it outputs the hypothesis that minimizes the average loss over this training set. Formally, the $j$-th ERM subroutine outputs

$$h_{j,i} = \operatorname*{argmin}_{h \in \mathcal{H}} \hat{R}_n(h, j) \quad \text{for} \quad \hat{R}_n(h, j) = \frac{1}{s_{j,n}} \sum_{t \in C_{j,n}} \ell(h, z_t). \qquad (7)$$

Consequently, MACRO's output is $h_n = h_{I_n, s_{I_n,n}}$, for which we can prove the following theorem.

###### Theorem 1.

If MACRO is run with ERM as a subroutine, then for any $k$, $m$, $\alpha$ and $\beta$, we have

$$\Pr\Big[ R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) > \alpha + 2\varepsilon \Big] \le \frac{2k\, N_\infty(L(\mathcal{H}), \beta, n)}{(\alpha - 4\beta)^2}\, e^{-\frac{1}{2} m (\alpha - 4\beta)^2} + \Pr\big[ E_{k,m}^c \big].$$

From this theorem we can read off the conditions for learnability of the meta-algorithm. If there exist sequences $k_n$ and $m_n$ satisfying $\Pr[E_{k_n,m_n}^c] \to 0$ and growing such that the first term of the bound vanishes, then the meta-algorithm with ERM as a subroutine is an $\varepsilon$-learner (up to a constant). The condition on the rate of growth of $m_n$ comes from the fact that it needs to compensate for the growth of the covering numbers, which is polynomial in $n$ (see the supplementary material for more details). The existence of such sequences $k_n$ and $m_n$ depends purely on the properties of the process (or class of processes) that the data is sampled from. Importantly, neither $k_n$ nor $m_n$ needs to be known by MACRO, as it automatically adapts to unfavorable conditions and exploits favorable ones.

Note that the computation of $h_n$ can be seen as minimization of a non-uniformly weighted average over the observed data, an approach proposed by (zimin2016aistats). However, our method differs in the way the weights are computed and how the exceptional set is defined, and it relies on a less restrictive assumption.

#### Online learning.

ERM as a subroutine is interesting from a theoretical perspective, but it defeats the main purpose of the meta-algorithm, namely that not all data of the process has to be stored. Instead, one would prefer to rely on a subroutine that can be trained incrementally, i.e. one sample at a time, as it is typical in online learning.

In the following, by an online subroutine we understand any algorithm that is designed to control the regret over each particular realization, see (Cesa-Bianchi01) for a thorough study of the problem. The regret of the $j$-th subroutine at step $n$ is defined as

$$W_{j,n} = \sum_{i=1}^{s_{j,n}} \ell(h_{j,i-1}, z_{t_{j,i}}) - \inf_{h \in \mathcal{H}} \sum_{i=1}^{s_{j,n}} \ell(h, z_{t_{j,i}}). \qquad (8)$$

The choice of a particular subroutine depends on the loss function and the hypotheses class. To abstract from concrete bounds and subroutines, we prove a theorem that bounds the performance of the meta-algorithm in terms of the regrets of the subroutines. Thereby, we obtain that any regret minimizing algorithm will be efficient as a subroutine for MACRO as well.

As our goal is not to minimize regret, but the conditional risk, we perform an online-to-batch conversion to choose the output hypothesis of each subroutine. In this work we consider two of the many existing online-to-batch conversion methods, one specifically for convex losses and one for the general case.

#### Convex losses.

For a convex loss function, the output of a subroutine is the average of the hypotheses it has produced so far. In this case, MACRO's output is $h_n = \frac{1}{s_{I_n,n}} \sum_{i=1}^{s_{I_n,n}} h_{I_n,i}$, and we can prove the following theorem.

###### Theorem 2.

For a convex loss $\ell$, if the subroutines of MACRO use averaging for online-to-batch conversion, then for any $k$, $m$, $\alpha$ and $\beta$, we have

$$\Pr\Big[ R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) > \alpha + W_{I_n,n}/s_{I_n,n} + 4\varepsilon \Big] \le \frac{4k\, N_\infty(L(\mathcal{H}), \beta, n)}{(\alpha/2 - 4\beta)^2}\, e^{-\frac{1}{2} m (\alpha/2 - 4\beta)^2} + \Pr\big[ E_{k,m}^c \big]. \qquad (9)$$

For Hannan-consistent online algorithms, the average regret $W_{j,n}/s_{j,n}$ vanishes as $s_{j,n}$ grows. Hence, the same conditions as those given after Theorem 1 ensure that MACRO is an $\varepsilon$-learner in this case as well.
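For linear hypotheses, the averaging online-to-batch conversion used for convex losses amounts to averaging the parameter iterates of the online learner. A minimal sketch, with the parameterization being our assumption:

```python
# Hedged sketch of the averaging online-to-batch conversion for convex
# losses: the batch hypothesis is the average of the online iterates.

import numpy as np

def online_to_batch_average(param_iterates):
    """Return the parameters of the averaged hypothesis
    (1/T) * sum_t h_t produced by an online learner."""
    return np.mean(np.asarray(param_iterates), axis=0)

# Three parameter vectors produced by a hypothetical online learner.
iterates = [np.array([0.0, 2.0]), np.array([2.0, 0.0]), np.array([4.0, 4.0])]
avg = online_to_batch_average(iterates)
```

By Jensen's inequality, the risk of the averaged hypothesis is bounded by the average risk of the iterates, which is what makes this conversion valid for convex losses.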

#### Non-convex losses.

For non-convex losses, a simple averaging for online-to-batch conversion does not work, so we need to perform a more elaborate procedure. We use a modification of the method introduced in (Cesa-Bianchi02). Due to space constraints we omit the description of the approach and just state the performance guarantee that we are able to prove.

###### Theorem 3.

For any $\delta > 0$ and $\beta > 0$, denote

$$U_\delta(j, \beta) = 2\sqrt{\frac{1}{s_{j,n}} \log \frac{s_{j,n}^3 (s_{j,n}+1)}{\delta}} + \sqrt{\frac{1}{s_{j,n}} \log \frac{s_{j,n}^2}{\delta}} + \sqrt{\frac{1}{s_{j,n}} \log \frac{s_{j,n}^2\, N_\infty(L(\mathcal{H}), \beta, n)}{\delta}} + 4\beta. \qquad (10)$$

If the subroutines of MACRO use the score-based online-to-batch conversion of Cesa-Bianchi02 with confidence , it holds that

$$\Pr\Big[ R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) > W_{I_n,n}/s_{I_n,n} + U_\delta(I_n, \beta) \Big] \le k\delta/m + \Pr\big[ E_{k,m}^c \big]. \qquad (11)$$

The same conditions as before ensure $\varepsilon$-learnability. Note that to perform this form of online-to-batch conversion, neither $k$ nor $m$ needs to be known.

## 4 Experiments

In this section we highlight the practical applicability of the meta-algorithm by applying it to two large-scale sequential prediction problems, showing how CRM can lead to improved prediction quality compared to ordinary marginal learning. The code for all the experiments will be made publicly available.

We adopt a classification setting, i.e. $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes a feature space, $\mathcal{Y}$ a set of labels, and $\ell$ is the 0/1-loss. Following the discussions of zimin2016aistats, we use a distance between histories for the discrepancy bound. As the performance of the algorithm will depend on the choice of distance, we perform the experiments with two distances of different characteristics and study how they affect the predictive performance. For the first distance, we consider only the labels of the data points in the histories and compare the vectors of the fractions of labels. For the second distance, we work in the feature space and consider a norm-distance between histories. The final conclusions for both distances are quite similar; therefore, we present the results only for the feature-based distance in the main manuscript. Results for the label-based distance as well as the exact definitions of the distances can be found in the supplementary material.
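As an illustration of the first, label-based distance, the following is a minimal sketch; the choice of the $\ell_1$-norm and the handling of the windows are our assumptions, with the exact definition given in the supplement:

```python
# Hedged sketch of the label-based distance between histories: compare the
# vectors of label fractions over two windows of observations.

from collections import Counter

def label_fraction_distance(labels_a, labels_b, classes=(0, 1)):
    """Distance between two histories based only on their label fractions."""
    def fractions(labels):
        counts = Counter(labels)
        n = max(len(labels), 1)
        return [counts[y] / n for y in classes]
    fa, fb = fractions(labels_a), fractions(labels_b)
    # l1-distance between the two fraction vectors (our choice of norm).
    return sum(abs(a - b) for a, b in zip(fa, fb))

d = label_fraction_distance([0, 0, 1, 1], [0, 1, 1, 1])
```

Such a distance is cheap to maintain incrementally, which matters because MACRO evaluates it against every running subroutine at each step.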

### 4.1 DataExpo Airline Dataset

First, we apply MACRO to the DataExpo Airline dataset (airline_dataset), which contains entries about all commercial flights in the United States between 1987 and 2008. Out of these, we select the most recent year with complete data, 2007, and a number of the most active airports at that time, which gives, for example, more than 300,000 flights for the Atlanta airport (ATL). The task is binary classification: predict whether a flight is delayed or not, where flights count as delayed if they arrive more than 15 minutes later than their scheduled arrival time. Clearly, the temporal order creates dependencies between flight delays that a CRM approach can try to exploit for higher classification accuracy. Observations are defined by grouping the flights into 10-minute chunks, so that at each time step, the task is to output a predictor that is applied to all flights in the next chunk.

Since any algorithm can be used as a subroutine in MACRO, our goal is to show that MACRO is able to improve upon the baseline of just running the subroutine on the whole data (which is what any standard approach would do). We perform experiments for both types of subroutines that we introduced in Section 3.2, which reflect the go-to choices for online classification problems in practice.

ERM

As a tractable approximation of ERM, we use logistic regression classifiers that are trained incrementally using stochastic gradient descent, i.e. after each observation the model parameters are updated by one gradient step on the logistic loss of that observation.
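The incremental logistic-regression subroutine can be sketched as follows; the learning rate and the zero initialization are our assumptions:

```python
# Hedged sketch of the SGD-trained logistic regression subroutine:
# one gradient step on the logistic loss per observation.

import numpy as np

class LogisticSGD:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)  # model parameters
        self.lr = lr            # learning rate (our assumption)

    def update(self, x, y):
        """One SGD step on the logistic loss; y in {0, 1}."""
        p = 1.0 / (1.0 + np.exp(-self.w @ x))  # predicted probability
        self.w -= self.lr * (p - y) * x        # gradient of the logistic loss

    def predict(self, x):
        return int(self.w @ x > 0.0)
```

Each MACRO subroutine holds one such model, so the memory cost is one parameter vector per subroutine rather than the full observation history.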

VW

As an online learning subroutine, we use Vowpal Wabbit, a popular software package for large-scale online learning tasks. We set VW to use logistic loss as well with the default choice of meta parameters.

Figure 2 shows the results of evaluating MACRO with ERM and VW as subroutines, compared to a single ERM or VW algorithm run on the whole data. Numeric results can be found in the supplemental material. We see that for all of the presented airports, MACRO achieves a better accuracy than the marginal versions of the corresponding algorithms for a wide range of thresholds $\varepsilon$. The effect is most pronounced with the VW subroutine, where MACRO is able to achieve performance on the level of MACRO with the ERM subroutine, even though the VW subroutine by itself seems to perform sub-optimally.

In addition to evaluating MACRO for a range of fixed thresholds, we show results for two methods that do not require fixing this parameter. Both of them run a number of MACRO instances with different thresholds in parallel and choose the output by one of two standard online learning strategies: Follow The Leader (FTL) or Exponentially Weighted Average (EWA). Both strategies generally achieve good results, in particular better than marginal training, with the FTL strategy usually outperforming the EWA strategy and in all cases achieving an error rate close to the best fixed threshold. Even though both strategies use much more resources than a single instance of MACRO, they have the advantage of making the learning process completely parameter-free, and are therefore attractive if sufficient resources are available.
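The EWA strategy over parallel MACRO instances can be sketched as follows; the learning rate `eta` and the voting scheme are our assumptions:

```python
# Hedged sketch of the EWA threshold-selection strategy: each "expert" is a
# MACRO instance with a different threshold; predictions are combined by a
# vote weighted by exponentiated cumulative loss.

import math

class EWA:
    def __init__(self, n_experts, eta=1.0):
        self.losses = [0.0] * n_experts  # cumulative loss per expert
        self.eta = eta                   # learning rate (our assumption)

    def weights(self):
        ws = [math.exp(-self.eta * l) for l in self.losses]
        z = sum(ws)
        return [w / z for w in ws]

    def predict(self, expert_preds):
        # Weighted majority vote over the experts' {0, 1} predictions.
        score = sum(w * p for w, p in zip(self.weights(), expert_preds))
        return int(score > 0.5)

    def update(self, expert_losses):
        self.losses = [l + e for l, e in zip(self.losses, expert_losses)]
```

FTL corresponds to the degenerate variant that simply outputs the prediction of the expert with the smallest cumulative loss so far.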

### 4.2 Breakfast Actions Dataset

In this set of experiments we present MACRO in a quite different setting. We use the Breakfast Actions Dataset, which consists of videos of 52 people performing 10 actions related to breakfast preparation. Each combination of a person and an action is treated as a separate learning task, and the performance is measured by the per-frame error rate. Following the usage of a Gaussianity assumption by previous approaches (kuehne2014language; kuehne2016end), we use Gaussian Naive Bayes classifiers trained online as subroutines.

G-NB

The algorithm tracks the running average in the feature space for each class separately and predicts the class with the closest mean. After receiving a new point, the algorithm incrementally updates the mean of the corresponding class.
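The running-mean subroutine just described can be sketched as follows; the class interface and the Euclidean tie-breaking are our assumptions:

```python
# Hedged sketch of the online nearest-mean classifier used as the G-NB
# subroutine: per-class running means, predict the class with closest mean.

import numpy as np

class OnlineNearestMean:
    def __init__(self):
        self.sums = {}    # class -> running sum of feature vectors
        self.counts = {}  # class -> number of observations

    def update(self, x, y):
        """Incrementally update the mean of class y with feature vector x."""
        if y not in self.sums:
            self.sums[y] = np.zeros_like(x, dtype=float)
            self.counts[y] = 0
        self.sums[y] += x
        self.counts[y] += 1

    def predict(self, x):
        # Return the class whose running mean is closest in Euclidean distance.
        return min(self.sums,
                   key=lambda y: np.linalg.norm(x - self.sums[y] / self.counts[y]))
```

The per-class state is just a sum vector and a counter, so each MACRO subroutine remains cheap to store and to update.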

As for the airports dataset, we present the results for the feature-based distance, while results for the label-based one can be found in the supplement. As above, we also evaluate the FTL and EWA strategies for threshold selection.

The results are presented in Figure 3. We observe that the effect of MACRO is even stronger than for the airport dataset. The error rate is always reduced, in some cases by more than 70% relative to the baseline. Both threshold-selection strategies show excellent performance, with FTL again outperforming EWA.

Overall, we see that MACRO consistently outperforms the traditional online algorithms for both datasets. This illustrates two facts: CRM is indeed a promising approach to sequential prediction problems, and MACRO allows applying CRM principles to large real-world datasets that previously suggested methods are unable to handle.

## 5 Conclusion

In this paper we presented a new meta-algorithm, MACRO, for conditional risk minimization that is based on the idea of maintaining a number of learning subroutines that are created when necessary on-the-fly and trained individually only on relevant subsets of data. We proved theoretical guarantees on the performance of the presented meta-algorithm for different choices of subroutines. In contrast to previous work, MACRO does not require storing all observed data and can be efficiently implemented. This makes MACRO the first CRM algorithm that is able to handle sequential learning problems of practically relevant size, as we demonstrate by applying it to two large scale problems, the DataExpo Airline and the Breakfast Actions datasets.

## Supplementary material

To characterize the complexity of a function class we use covering numbers and the sequential fat-shattering dimension. Before we can give those definitions, we need to introduce the notion of $\mathcal{Z}$-valued trees.

A $\mathcal{Z}$-valued tree $z$ of depth $n$ is a sequence of mappings $z_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$, $t = 1, \dots, n$. A sequence $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n) \in \{\pm 1\}^n$ defines a path in a tree. To shorten the notations, $z_t(\varepsilon_1, \dots, \varepsilon_{t-1})$ is denoted as $z_t(\varepsilon)$. For a double sequence $(z_t, z'_t)$, we define $\chi_t(\varepsilon_t)$ as $z_t$ if $\varepsilon_t = 1$ and as $z'_t$ if $\varepsilon_t = -1$. Also define the conditional distributions $p_t(\cdot \mid z_{1:t-1})$ over $\mathcal{Z}$, where $p$ is the distribution of the process under consideration. Then we can define a distribution over two $\mathcal{Z}$-valued trees $z$ and $z'$ as follows: $z_1$ and $z'_1$ are sampled independently from the initial distribution of the process, and for any path $\varepsilon$ and $t \ge 2$, $z_t(\varepsilon)$ and $z'_t(\varepsilon)$ are sampled independently from $p_t(\cdot \mid \chi_1(\varepsilon_1), \dots, \chi_{t-1}(\varepsilon_{t-1}))$.

For any random variable $f$ that is measurable with respect to $\sigma(z_{1:n})$ (the $\sigma$-algebra generated by $z_{1:n}$), we define its symmetrized counterpart as follows. We know that there exists a measurable function $g$ such that $f = g(z_1, \dots, z_n)$. Then we define the symmetrized counterpart as $g(\chi_1(\varepsilon_1), \dots, \chi_n(\varepsilon_n))$, where the samples used by the $\chi_t$'s are understood from the context.

Now we can define covering numbers.

###### Definition 5.

A set $V$ of $\mathbb{R}$-valued trees of depth $n$ is a (sequential) $\theta$-cover (with respect to the $\ell_\infty$-norm) of a function class $\mathcal{F}$ on a $\mathcal{Z}$-valued tree $z$ of depth $n$ if

$$\forall f \in \mathcal{F},\ \forall \varepsilon \in \{\pm 1\}^n,\ \exists v \in V: \quad \max_{1 \le t \le n} \big| f(z_t(\varepsilon)) - v_t(\varepsilon) \big| \le \theta. \qquad (12\text{–}13)$$

The (sequential) $\theta$-covering number of a function class $\mathcal{F}$ on a given tree $z$ is

$$N_\infty(\mathcal{F}, \theta, z) = \min\big\{ |V| : V \text{ is a } \theta\text{-cover w.r.t. the } \ell_\infty\text{-norm of } \mathcal{F} \text{ on } z \big\}. \qquad (14\text{–}15)$$

The maximal (sequential) $\theta$-covering number of a function class over depth-$n$ trees is

$$N_\infty(\mathcal{F}, \theta, n) = \sup_{z} N_\infty(\mathcal{F}, \theta, z). \qquad (16)$$

To control the growth of covering numbers we use the following notion of complexity.

###### Definition 6.

A $\mathcal{Z}$-valued tree $z$ of depth $n$ is $\theta$-shattered by a function class $\mathcal{F}$ if there exists an $\mathbb{R}$-valued tree $s$ of depth $n$ such that

$$\forall \varepsilon \in \{\pm 1\}^n,\ \exists f \in \mathcal{F} \text{ s.t. } \forall\, 1 \le t \le n, \quad \varepsilon_t \big( f(z_t(\varepsilon)) - s_t(\varepsilon) \big) \ge \theta/2. \qquad (17\text{–}18)$$

The (sequential) fat-shattering dimension $\operatorname{fat}_\theta(\mathcal{F})$ at scale $\theta$ is the largest $n$ such that $\mathcal{F}$ $\theta$-shatters a $\mathcal{Z}$-valued tree of depth $n$.

An important result of [Rakhlin01] is the following connection between the covering numbers and the fat-shattering dimension.

###### Lemma 2 (Corollary 1 of [Rakhlin01]).

Let $\mathcal{F}$ be a function class. For any $\theta > 0$ and any $n$, we have that

$$N_\infty(\mathcal{F}, \theta, n) \le \Big( \frac{2en}{\theta} \Big)^{\operatorname{fat}_\theta(\mathcal{F})}. \qquad (19)$$

In all of the proofs we use the following technical lemma about the meta-algorithm.

###### Lemma 3.

Irrespective of the subroutine used by the meta-algorithm, for any $k$, $m$, $\alpha$ and $\beta$, we have

$$\Pr\Big[ \sup_{h \in \mathcal{H}} \Big| \frac{1}{s_{I_n,n}} \sum_{t \in C_{I_n,n}} \big( \ell(h, z_t) - R_{t-1}(h) \big) \Big| > \alpha \ \wedge\ E_{k,m} \Big] \le \frac{2k\, N_\infty(L(\mathcal{H}), \beta, n)}{(\alpha - 4\beta)^2}\, e^{-\frac{1}{2} m (\alpha - 4\beta)^2}. \qquad (20)$$

Moreover, for any sequence $g_{j,i}$ of bounded martingale differences adapted to the process,

$$\Pr\Big[ \Big| \frac{1}{s_{I_n,n}} \sum_{i=1}^{s_{I_n,n}} g_{I_n,i} \Big| > \alpha \ \wedge\ E_{k,m} \Big] \le \frac{2k}{\alpha^2}\, e^{-\frac{1}{2} m \alpha^2}.$$
###### Proof.

Introduce events $B_{i,j} = \{ s_{j,n} = i \wedge I_n = j \}$ for $i \ge 1$ and $j \ge 1$ (we suppress the dependence on $n$ to increase readability). Observe that $E_{k,m} \cap \{I_n = j\} \subseteq \cup_{i \ge m} B_{i,j}$. Denoting

$$\Lambda(j) = \sup_{h \in \mathcal{H}} \Big| \frac{1}{s_{j,n}} \sum_{t \in C_{j,n}} \big( \ell(h, z_t) - R_{t-1}(h) \big) \Big|,$$

we have

$$\Pr\big[ \Lambda(I_n) > \alpha \wedge E_{k,m} \big] \le \sum_{j \in \operatorname{supp}(I_n)} \Pr\big[ \Lambda(j) > \alpha \wedge \{\textstyle\cup_{i \ge m} B_{i,j}\} \big]. \qquad (21\text{–}22)$$

Each of the last probabilities can be bounded using a union bound.

$$\Pr\big[ \Lambda(j) > \alpha \wedge \{\textstyle\cup_{i \ge m} B_{i,j}\} \big] \le \sum_{i \ge m} \Pr\big[ \Lambda(j) > \alpha \wedge B_{i,j} \big]. \qquad (23)$$

Using Lemma 4 from [zimin2016aistats], we get

$$\Pr\big[ \Lambda(j) > \alpha \wedge B_{i,j} \big] \le 2 N_\infty(L(\mathcal{H}), \beta, n)\, e^{-\frac{1}{2} i (\alpha - 4\beta)^2}. \qquad (24)$$

Summing the probabilities, we obtain the first statement of the lemma.

For the second part of the lemma, denote $g_{j,i}=\ell(h_{j,i-1},z_{t_{j,i}})-R_{t_{j,i}-1}(h_{j,i-1})$ and, using the same decomposition as above, it suffices to bound the corresponding probability for each subroutine $j\in\operatorname{supp}(I_n)$. Observe that each $g_{j,i}$ is adapted to the filtration generated by the process, hence behaves like a martingale difference sequence. However, there is a technical difficulty in the fact that the indices $t_{j,i}$ are in fact stopping times. To get around it, observe that we can write the sum over the $j$-th subroutine's samples as a sum over all the data with adapted weights. Set $w_t$ to $1$ if we updated the $j$-th algorithm at step $t$ and to $0$ otherwise. Correspondingly, define $\bar h_t$ as the last hypothesis chosen by the $j$-th algorithm before step $t$. This way, both $w_t$ and $\bar h_t$ are adapted to the original process. Then

$$\sum_{i=1}^{s_{j,n}}\bigl(\ell(h_{j,i-1},z_{t_{j,i}})-R_{t_{j,i}-1}(h_{j,i-1})\bigr)=\sum_{t=1}^{n}w_t\bigl(\ell(\bar h_t,z_t)-R_{t-1}(\bar h_t)\bigr). \tag{25}$$

At this point we can again use Lemma 4 from [zimin2016aistats] and get the second statement of the lemma. ∎
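The re-indexing step above, rewriting the subroutine's sum over stopping times as an adapted weighted sum over all data as in (25), can be checked numerically. A minimal sketch with a toy adapted update rule, dropping the conditional-risk term since only the re-indexing matters; all names are hypothetical:

```python
import random

random.seed(0)
n = 50
z = [random.random() for _ in range(n)]

# Toy adapted update rule: whether we update at step t depends only on the past.
update = [t == 0 or z[t - 1] > 0.5 for t in range(n)]

def loss(h, x):
    return (h - x) ** 2  # toy loss; hypotheses are just numbers here

# Left-hand side: sum over the subroutine's own sample indices t_{j,i}.
ts = [t for t in range(n) if update[t]]
hs = [0.0] + [z[t] for t in ts]          # h_{j,i-1}: hypothesis before i-th update
lhs = sum(loss(hs[i], z[ts[i]]) for i in range(len(ts)))

# Right-hand side: weighted sum over *all* data with adapted weights w_t,
# where h_bar is the last hypothesis chosen before step t.
rhs, h_bar = 0.0, 0.0
for t in range(n):
    w = 1.0 if update[t] else 0.0
    rhs += w * loss(h_bar, z[t])
    if update[t]:
        h_bar = z[t]

print(abs(lhs - rhs) < 1e-12)  # True: the two sums coincide term by term
```

Both sides add exactly the same terms in the same order, which is the point of the trick: the right-hand side involves only quantities adapted to the original process.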

###### Proof of Lemma 1.

The lower bound comes from the fact that MACRO constructs an $\varepsilon$-covering. For the upper bound, observe that a new subroutine is started if and only if its associated conditional distribution differs by more than $\varepsilon$ from those of all previously created subroutines. Therefore, the conditional distributions associated with the subroutines form an $\varepsilon$-separated set (no two elements are closer than $\varepsilon$ to each other). The maximal size of such a set is at most the covering number at half the distance. ∎
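The separation argument mirrors how MACRO decides when to start a new subroutine. A minimal sketch of this greedy covering rule, assuming conditional distributions can be compared through some metric `dist`; all names and the toy one-dimensional example are illustrative:

```python
def assign_subroutine(centers, d_new, dist, eps):
    """Greedy covering step: reuse the closest existing subroutine if its
    associated conditional distribution is within eps of the new one,
    otherwise start a new subroutine. `centers` holds one representative
    distribution per subroutine."""
    if centers:
        j = min(range(len(centers)), key=lambda i: dist(centers[i], d_new))
        if dist(centers[j], d_new) <= eps:
            return j
    centers.append(d_new)          # differs by more than eps from all: new subroutine
    return len(centers) - 1

# Toy example: distributions summarized as points on a line, dist = |a - b|.
centers, eps = [], 0.3
dist = lambda a, b: abs(a - b)
picks = [assign_subroutine(centers, x, dist, eps)
         for x in [0.0, 0.1, 0.5, 0.55, 1.0, 0.05]]
print(picks, len(centers))  # [0, 0, 1, 1, 2, 0] 3
```

By construction the stored centers are pairwise more than `eps` apart, which is exactly the $\varepsilon$-separation used in the upper bound.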

###### Proof of Theorem 1.

We start with the usual argument for empirical risk minimization, which allows us to focus on the uniform deviations:

$$R_n(h_n)-\inf_{h\in\mathcal{H}}R_n(h)\le 2\sup_{h\in\mathcal{H}}\bigl|R_n(h)-\hat R_n(h,I_n)\bigr|. \tag{27}$$

Denoting $\bar R_n(h,I_n)=\frac{1}{s_{I_n,n}}\sum_{t\in C_{I_n,n}}R_{t-1}(h)$, we can upper bound the last term:

$$\begin{aligned}
\sup_{h\in\mathcal{H}}\bigl|R_n(h)-\hat R_n(h,I_n)\bigr|&\le\sup_{h\in\mathcal{H}}\bigl|R_n(h)-\bar R_n(h,I_n)\bigr|+\sup_{h\in\mathcal{H}}\bigl|\bar R_n(h,I_n)-\hat R_n(h,I_n)\bigr|\\
&\le\frac{1}{s_{I_n,n}}\sum_{i\in C_{I_n,n}}d_{i,n+1}+\sup_{h\in\mathcal{H}}\bigl|\bar R_n(h,I_n)-\hat R_n(h,I_n)\bigr|\\
&\le\frac{1}{s_{I_n,n}}\sum_{i\in C_{I_n,n}}M_{i,n+1}+\sup_{h\in\mathcal{H}}\bigl|\bar R_n(h,I_n)-\hat R_n(h,I_n)\bigr|\\
&\le 2\varepsilon+\sup_{h\in\mathcal{H}}\bigl|\bar R_n(h,I_n)-\hat R_n(h,I_n)\bigr|,
\end{aligned}$$

where the last bound follows from the way the meta-algorithm chooses the active subroutine $I_n$. Hence, we get

$$\mathbb{P}\Bigl[R_n(h_n)-\inf_{h\in\mathcal{H}}R_n(h)>\alpha+2\varepsilon\Bigr]\le\mathbb{P}\Bigl[\sup_{h\in\mathcal{H}}\bigl|\bar R_n(h,I_n)-\hat R_n(h,I_n)\bigr|>\alpha\Bigr]. \tag{33}$$

The last probability can be bounded using Lemma 3, giving us the statement of the theorem. ∎

###### Proof of Theorem 2.

Note that by the way $I_n$ is chosen, we get for any $h\in\mathcal{H}$ and any $1\le i\le s_{I_n,n}$ that

$$R_n(h)-R_{t_{I_n,i}-1}(h)\le 2\varepsilon. \tag{35}$$

Therefore, by using the convexity of the loss,

$$R_n(h_n)\le\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_n(h_{I_n,i})\le\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h_{I_n,i})+2\varepsilon. \tag{36}$$

Similarly, for any fixed $h\in\mathcal{H}$,

$$R_n(h)\ge\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h)-2\varepsilon. \tag{38}$$

Therefore,

$$R_n(h_n)-\inf_{h\in\mathcal{H}}R_n(h)\le 4\varepsilon+\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h_{I_n,i})-\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h). \tag{39}$$

We split the last difference into the following three terms and deal with them separately.

$$\begin{aligned}
T_1&=\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\bigl(R_{t_{I_n,i}-1}(h_{I_n,i})-\ell(h_{I_n,i},z_{t_{I_n,i}})\bigr)\\
T_2&=\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\ell(h_{I_n,i},z_{t_{I_n,i}})-\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\ell(h,z_{t_{I_n,i}})\\
T_3&=\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\ell(h,z_{t_{I_n,i}})-\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h).
\end{aligned}$$

The first term can be bounded using Lemma 3. $T_2$ is in fact just $\frac{1}{s_{I_n,n}}$ times the regret of the $I_n$-th subroutine. For $T_3$, observe that

$$\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}R_{t_{I_n,i}-1}(h)\ge\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\ell(h,z_{t_{I_n,i}})+\inf_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\bigl(R_{t_{I_n,i}-1}(h)-\ell(h,z_{t_{I_n,i}})\bigr). \tag{45}$$

Therefore, $T_3$ is bounded by $\tilde T_3$:

$$\tilde T_3=\sup_{h\in\mathcal{H}}\frac{1}{s_{I_n,n}}\sum_{i=1}^{s_{I_n,n}}\bigl(\ell(h,z_{t_{I_n,i}})-R_{t_{I_n,i}-1}(h)\bigr). \tag{47}$$

Combining everything together,

$$\begin{aligned}
\mathbb{P}\Bigl[R_n(h_n)-\inf_{h}R_n(h)>\alpha+4\varepsilon+W_{I_n,n}\Bigr]&\le\mathbb{P}\bigl[T_1+\tilde T_3>\alpha\,\wedge\,E_{k,m}\bigr]+\mathbb{P}\bigl[E_{k,m}^c\bigr]\\
&\le\mathbb{P}\bigl[T_1>\alpha/2\,\wedge\,E_{k,m}\bigr]+\mathbb{P}\bigl[\tilde T_3>\alpha/2\,\wedge\,E_{k,m}\bigr]+\mathbb{P}\bigl[E_{k,m}^c\bigr].
\end{aligned}$$

Both terms in the last line can be bounded using Lemma 3, giving us the statement of the theorem. ∎

#### Online-to-batch conversion for non-convex losses.

Here we describe the modification of the online-to-batch conversion method of [Cesa-Bianchi02]. As the original method was designed for i.i.d. data, we need to extend it to stochastic processes. The general idea is to assign a score to each hypothesis $h_{j,i}$ produced by a subroutine and choose the one with the lowest score. For a given confidence $\delta$, the score of $h_{j,i}$ is computed as

$$u_n(j,i)=\tilde R_n(j,i)+c_{j,\delta}(s_{j,n}-i), \tag{51}$$

where

$$\tilde R_n(j,i)=\frac{1}{s_{j,n}-i}\sum_{k=i+1}^{s_{j,n}}\ell(h_{j,i},z_{t_{j,k}})\qquad\text{and}\qquad c_{j,\delta}(t)=\sqrt{\frac{1}{2(t+1)}\log\frac{s_{j,n}^3(s_{j,n}+1)}{\delta}}. \tag{52}$$

Setting $j=I_n$, MACRO’s output $h_n$ is the hypothesis $h_{I_n,i}$ with the lowest score $u_n(I_n,i)$.
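The scoring rule (51)–(52) is straightforward to implement. A minimal sketch, assuming each hypothesis stores the losses it incurred on the samples that arrived after it was produced; the function and variable names are hypothetical:

```python
import math

def select_hypothesis(tail_losses, delta):
    """Online-to-batch selection in the spirit of (51)-(52):
    tail_losses[i] holds the losses of hypothesis h_{j,i} on the samples
    arriving after it was produced.  Returns the index minimizing the
    penalized score u_n(j, i).  Hypotheses with an empty tail are skipped."""
    s = len(tail_losses)                        # plays the role of s_{j,n}

    def score(i):
        t = len(tail_losses[i])                 # number of tail samples
        emp = sum(tail_losses[i]) / t           # empirical tail risk ~R_n(j, i)
        pen = math.sqrt(math.log(s**3 * (s + 1) / delta) / (2 * (t + 1)))
        return emp + pen                        # u_n(j, i)

    return min((i for i in range(s) if tail_losses[i]), key=score)

# Toy run: hypothesis 1 has low tail losses and enough tail samples.
tails = [[0.9, 0.8, 0.7], [0.1, 0.2], [0.6]]
print(select_hypothesis(tails, delta=0.05))  # -> 1
```

The penalty term shrinks with the number of tail samples, so early hypotheses are judged on more data while late ones pay a larger confidence price, which is exactly the trade-off the score in (51) encodes.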

The following lemma is an analog of Lemma 3 from [Cesa-Bianchi02], proved for the case of dependent data and the conditional risk.

###### Lemma 4.

For the setting of Theorem 3, let

$$v(j,i)=R_n(h_{j,i})+2c_{I_n,\delta}(s_{I_n,n}-i). \tag{53}$$

Then we have

$$\mathbb{P}\Bigl[R_n(h_n)>\min_{1\le i\le s_{I_n,n}}v(I_n,i)+2\varepsilon\,\wedge\,E_{k,m}\Bigr]\le\frac{k\delta}{m}. \tag{54}$$
###### Proof.

Introduce the events $A_r=\{s_{I_n,n}=r\}$ for $r\ge m$. Using a union bound, we have

$$\mathbb{P}\Bigl[R_n(h_n)>\min_{1\le i\le s_{I_n,n}}v(I_n,i)+2\varepsilon\,\wedge\,E_{k,m}\Bigr]\le\sum_{r\ge m}\mathbb{P}\Bigl[R_n(h_n)>\min_{1\le i\le s_{I_n,n}}v(I_n,i)+2\varepsilon\,\wedge\,A_r\Bigr]. \tag{55}$$

Therefore, we will focus on the last probabilities. Let $J^\star_n=\operatorname*{argmin}_{1\le i\le s_{I_n,n}}v(I_n,i)$ and also introduce the events $B_i=\{h_n=h_{I_n,i}\}$. Then, since $\cup_i B_i$ is always true, we get

$$\mathbb{P}\Bigl[R_n(h_n)>\min_{1\le i\le s_{I_n,n}}v(I_n,i)+2\varepsilon\,\wedge\,A_r\Bigr]\le\sum_{i=1}^{r}\mathbb{P}\bigl[R_n(h_{I_n,i})>v(I_n,J^\star_n)+2\varepsilon\,\wedge\,B_i\,\wedge\,A_r\bigr]. \tag{57}$$

Observe that if the event $\{R_n(h_{I_n,i})>v(I_n,J^\star_n)+2\varepsilon\}\wedge B_i\wedge A_r$ is true, then at least one of the following events is also true.

 D1,i={~Rn(In,i)≤Rn(hIn,i)−ε−cIn,δ(sIn,n−i)}, (59) D2,i={Rn(hIn,i)