Locally-Adaptive Nonparametric Online Learning

02/05/2020, by Ilja Kuzborskij, et al., Università degli Studi di Milano

One of the main strengths of online algorithms is their ability to adapt to arbitrary data sequences. This is especially important in nonparametric settings, where regret is measured against rich classes of comparator functions that are able to fit complex environments. Although such hard comparators and complex environments may exhibit local regularities, efficient algorithms whose performance can provably take advantage of these local patterns are hardly known. We fill this gap introducing efficient online algorithms (based on a single versatile master algorithm) that adapt to: (1) local Lipschitzness of the competitor function, (2) local metric dimension of the instance sequence, (3) local performance of the predictor across different regions of the instance space. Extending previous approaches, we design algorithms that dynamically grow hierarchical packings of the instance space, and whose prunings correspond to different "locality profiles" for the problem at hand. Using a technique based on tree experts, we simultaneously and efficiently compete against all such prunings, and prove regret bounds scaling with quantities associated with all three types of local regularities. When competing against "simple" locality profiles, our technique delivers regret bounds that are significantly better than those proven using the previous approach. On the other hand, the time dependence of our bounds is not worse than that obtained by ignoring any local regularities.


1 Introduction

In online convex optimization (Zinkevich, 2003; Hazan, 2016), a learner interacts with an unknown environment in a sequence of rounds. In the specific setting considered in this paper, at each round $t$ the learner observes an instance $x_t$ and outputs a prediction $\hat{y}_t$ for the label $y_t$ associated with the instance. After predicting, the learner incurs the loss $\ell_t(\hat{y}_t)$. We consider two basic learning problems: regression with square loss, where $\ell_t(\hat{y}_t) = (\hat{y}_t - y_t)^2$ and $y_t, \hat{y}_t \in [0,1]$, and binary classification with absolute loss, where $\ell_t(\hat{y}_t) = |\hat{y}_t - y_t|$ and $y_t \in \{0,1\}$ (or, equivalently, the probability of a mistake for randomized predictions with $\hat{y}_t \in [0,1]$). The performance of a learner is measured through the notion of regret, which is defined as the amount by which the cumulative loss of the learner predicting with $\hat{y}_1, \hat{y}_2, \dots$ exceeds the cumulative loss —on the same sequence of instances and labels— of any function $f$ in a given reference class of functions $\mathcal{F}$, namely

$$ R_T(f) \;=\; \sum_{t=1}^{T} \ell_t(\hat{y}_t) \;-\; \sum_{t=1}^{T} \ell_t\big(f(x_t)\big), \qquad f \in \mathcal{F}. \qquad (1) $$

In order to capture complex environments, we focus on nonparametric classes containing Lipschitz functions. The specific approach adopted in this paper is inspired by the simple and versatile algorithm of Hazan and Megiddo (2007), henceforth denoted with HM, achieving a regret bound of the form¹

¹ We use to denote and to denote .

(2)

for all given $L$, where the reference class is the set of $L$-Lipschitz functions $f$ satisfying

$$ |f(x) - f(x')| \;\le\; L\,\|x - x'\| \qquad (3) $$

for all $x, x'$.² Although Lipschitzness is a standard assumption in nonparametric learning, a Lipschitz function may alternate regions of low variation with regions of high variation. This implies that, if computed locally (i.e., on pairs $x, x'$ that belong to the same small region), the value of the smallest $L$ satisfying (3) would change significantly across these regions. If we knew the local Lipschitzness profile in advance, we could design algorithms that exploit this information to gain better control of the regret.

² The bound for the square loss, which is not contained in (Hazan and Megiddo, 2007), can be proven with a straightforward extension of the analysis in that paper.
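To make the notion of a local Lipschitz profile concrete, the following Python sketch (an illustration of ours, with a toy function and hand-picked regions that are not taken from the paper) estimates the smallest constant satisfying (3) separately on a flat region and on a rapidly varying region of a piecewise-defined function.

```python
import numpy as np

def f(x):
    # Toy 1-d function: nearly flat on [0, 0.5], rapidly varying on (0.5, 1].
    return np.where(x <= 0.5, 0.05 * x, 0.05 * x + np.sin(40.0 * (x - 0.5)))

def empirical_lipschitz(xs):
    # Smallest L with |f(x) - f(x')| <= L * |x - x'| over all pairs drawn from xs.
    xs = np.asarray(xs)
    df = np.abs(f(xs)[:, None] - f(xs)[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    mask = dx > 0
    return float(np.max(df[mask] / dx[mask]))

flat, steep = np.linspace(0.0, 0.5, 200), np.linspace(0.5, 1.0, 200)
print("local L on [0.0, 0.5]:", empirical_lipschitz(flat))    # about 0.05
print("local L on [0.5, 1.0]:", empirical_lipschitz(steep))   # about 40
print("global L on [0.0, 1.0]:", empirical_lipschitz(np.concatenate([flat, steep])))
```

A single global constant is driven entirely by the steep region, which is exactly the slack that a local Lipschitz profile removes.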

Although asymptotic rates that improve on (2) can be obtained using different and more complicated algorithms, it is not clear whether these other algorithms can be made locally adaptive in a principled way as we do with HM.

Local Lipschitzness.

Our first contribution is an algorithm for regression with square loss that competes against all functions in . However, unlike the regret bound (2) achieved by HM, the regret of our algorithm depends in a detailed way on the local Lipschitzness profile of . Our algorithm operates by sequentially constructing a -level hierarchical packing of the instance space with balls whose radius decreases with each level of the hierarchy. The levels are associated with local Lipschitz constants provided as an input parameter to the algorithm.

Figure 1: Matching functions to prunings. Profiles of local smoothness correspond to prunings so that smoother functions are matched to smaller prunings.

If we view the hierarchical packing as a -level tree whose nodes are the balls in the packing at each level, then the local Lipschitzness profile of a function translates into a pruning of this tree (this is visually explained in Figure 1). By training a base predictor in each ball, we can use the leaves of a pruning to approximate a function whose local Lipschitz profile “matches” . Namely, a function that satisfies (3) with for all observed instances that belong to some leaf of at level , for all levels (since is a pruning of the hierarchical packing , there is a one-to-one mapping between instances and leaves of ). Because our algorithm is simultaneously competitive against all prunings, it is also competitive against all functions whose local Lipschitz profile —with respect to the instance sequence— is matched by some pruning. More specifically, we prove that for any and for any pruning matching on the sequence of instances,

(4)

where, from now on, always denotes the total number of time steps in which the current instance belongs to a leaf at level of the pruning . The expectation is with respect to the random variable that takes value with probability equal to the fraction of leaves of at level . The first term in the right-hand side of (4) bounds the estimation error, and is large when most of the leaves of reside at deep levels (i.e., has just a few regions of low variation). The second term bounds the approximation error, and is large whenever most of the instances belong to leaves of at deep levels.

In order to compare this bound to (2), consider with . If is matched by some pruning such that most instances belong to shallow leaves of , then our bound on becomes of order , as opposed to the bound of (2) which is of order . On the other hand, for any we have at least one pruning matching the function: the one whose leaves are all at the deepest level of the tree. In this case, our bound on becomes of order , which is asymptotically equivalent to (2). This shows that, up to log factors, our bound is never worse than (2), and can be much better in certain cases.

Our locally adaptive approach can be generalized beyond Lipschitzness. Next, we present two additional contributions where we show that variants of our algorithm can be made adaptive with respect to different local properties of the problem.

Local metric dimension.

It is well known that nonparametric regret bounds inevitably depend exponentially on the metric dimension of the set of data points (Hazan and Megiddo, 2007; Rakhlin et al., 2015). Similarly to local Lipschitzness, we want to take advantage of cases in which most of the data points live on manifolds that locally have a low metric dimension. In order to achieve a dependence on the “local dimension profile” in the regret bound, we propose a slight modification of our algorithm, where each level of the hierarchical packing is associated with a local dimension bound such that . Note that —unlike the case of local Lipschitzness— the local dimension is decreasing as the tree gets deeper. Although this might seem counterintuitive, it is explained by the fact that higher-dimensional balls occupy a larger volume than lower-dimensional ones with the same radius, and so they occur at shallower levels in the hierarchical packing.

We say that a pruning of the tree associated with the packing matches a sequence of instances if the number of leaves of the pruning at each level is . For regression with square loss we can prove that, for any and for any pruning matching , this modified algorithm achieves regret

(5)

where, as before, the expectation is with respect to the random variable that takes value with probability equal to the fraction of leaves of at level . If most lie in a low-dimensional manifold of , so that is matched by some pruning with deeper leaves, we obtain a regret of order . This is nearly a parametric rate whenever . In the worst case, when all instances are concentrated at the top level of the tree, we still recover (2).

Local loss bounds.

Whereas the local Lipschitz profile measures a property of a function with respect to an instance sequence, and the local dimension profile measures a property of the instance sequence, we now consider the local loss profile, which measures a property of a local online learner with respect to a sequence of examples . The local loss profile describes how the cumulative loss of the local predictor changes across different regions of the instance space. To this end, we introduce the functions , which upper bound the total loss incurred by our local predictors sitting on nodes at level . We can use the local predictors on the leaves of a pruning to predict a sequence of examples whose local loss profile matches that of . Namely, such that the online local learners run on the subsequence of examples that belong to leaves at level of incur a total loss bounded by , for all levels . In order to take advantage of good local loss profiles, we focus on losses —such as the absolute loss— for which we can prove “first-order” regret bounds that scale with the loss of the expert against which the regret is measured. For the absolute loss, the algorithm we consider attains regret

(6)

for any , where —as before— the expectation is with respect to the random variable that takes value with probability equal to the fraction of leaves of at level . For concreteness, set , so that deeper levels correspond to loss rates that grow faster with time. When has shallow leaves and is negligible for , the regret becomes of order , which has a significantly better dependence on than achieved by HM. Note that we have a pruning matching all sequences: the one whose leaves are all at the deepest level of the tree. Indeed, is a trivial upper bound on the absolute loss of any online local learner. In this case, our bound on becomes of order , which is asymptotically equivalent to (2) in its dependence on the time horizon. Note that our dependence on the Lipschitz constant is slightly worse than in (2). This happens because we have to pay an additive constant regret term in each ball, which is unavoidable in any first-order regret bound.

Intuition about the proof.

Hazan and Megiddo (2007) prove (2) using a greedy construction of a ball packing of the instance space, where each ball hosts a local online learner, and the label of a new instance is predicted by the learner in the nearest ball. Balls shrink at a polynomial rate in time, and a new ball is allocated whenever an instance falls outside the current packing. The algorithms we present here generalize this approach to a hierarchical construction of packings at multiple levels. Each ball at a given level contains a lower-level packing using balls of smaller radius, and we view this nested structure of packings as a tree. Radii are now tuned not only with respect to time, but also with respect to the level, where the dependence on the level is characterized by the specific locality setting (i.e., local smoothness, local dimension, or local losses). The main novelty of our proof is that we analyze HM in a level-wise manner, while simultaneously competing against the best pruning over the entire hierarchy. Our approach is adaptive because the regret now depends on both the number of leaves of the best pruning and the number of observations made by the pruning at each level. In other words, if the best pruning has no leaves at a particular level, or is active for only a few time steps at that level, then the algorithm will seldom use the local predictors hosted at that level.

Our main algorithmic technology is the sleeping experts framework of Freund et al. (1997), where each node of the tree is treated as an expert predicting with the learner hosted in the associated ball, and active (non-sleeping) experts in a given time step are those along the root-to-leaf path associated with the current instance. For regression with square loss we use exponential weights (up to re-normalization due to active experts). For classification with absolute loss, we avoid the tuning problem by resorting to the parameter-free algorithm AdaNormalHedge of Luo and Schapire (2015). This makes our approach computationally efficient: despite the exponential number of experts in the comparison class we only pay in the regret a factor corresponding to the depth of the tree.

2 Definitions

Throughout the paper, we assume instances have bounded norm, $\|x_t\| \le 1$, so that the instance space is the unit ball centered at the origin. We use $B(c, r)$ to denote the ball of center $c$ and radius $r$, and we write $B(r)$ instead of $B(\mathbf{0}, r)$.

Definition 1 (Coverings and packings).

An $\varepsilon$-cover of a set $S$ is a subset $C$ such that for each $x \in S$ there exists $c \in C$ such that $\|x - c\| \le \varepsilon$. An $\varepsilon$-packing of a set $S$ is a subset $P \subseteq S$ such that for any distinct $c, c' \in P$, we have $\|c - c'\| > \varepsilon$.
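As a concrete illustration of these definitions, the following Python sketch (ours, not the paper's code) runs the standard greedy construction, which returns a set of centers that is simultaneously an $\varepsilon$-packing and an $\varepsilon$-cover of a finite set of points.

```python
import numpy as np

def greedy_packing(points, eps):
    """Greedily keep a point as a new center whenever it is farther than eps from
    all current centers: the result is an eps-packing, and by maximality every
    point lies within eps of some center, so it is also an eps-cover."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return centers

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(2000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]      # keep points in the unit ball
for eps in (0.5, 0.25, 0.125):
    print(eps, len(greedy_packing(pts, eps)))      # grows roughly like eps^(-2)
```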

Definition 2 (Metric dimension).

A set $S$ has metric dimension $d$ if there exists³ a constant $C_d$ such that, for all $\varepsilon > 0$, $S$ has an $\varepsilon$-cover of size at most $C_d\,\varepsilon^{-d}$.

³ Note that $C_d$ is exactly quantifiable for various metrics (Clarkson, 2006).

In this paper we consider the following online learning protocol with oblivious adversary. Given an unknown sequence $(x_1, y_1), (x_2, y_2), \dots$ of instances and labels, for every round $t = 1, 2, \dots$:

  1. The environment reveals the instance $x_t$.

  2. The learner selects an action $\hat{y}_t$ and incurs the loss $\ell_t(\hat{y}_t)$.

  3. The learner observes $y_t$.

In the rest of the paper, we use as an abbreviation for .

2.1 Hierarchical packings, trees, and prunings

A pruning of a rooted tree is the tree obtained after the application of zero or more replace operations, where each replace operation deletes the subtree rooted at an internal node without deleting the node itself (which becomes a leaf).
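A small Python sketch (our own illustration, with hypothetical node names) of the replace operation: collapsing the subtree rooted at an internal node turns that node into a leaf, and a pruning is obtained by applying zero or more such operations.

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []     # an empty list marks a leaf

def prune(node, to_collapse):
    """Return the pruning obtained by replacing the subtree rooted at every node
    whose name is in `to_collapse` with that node itself (now a leaf)."""
    if node.name in to_collapse:
        return Node(node.name)
    return Node(node.name, [prune(c, to_collapse) for c in node.children])

def leaves(node):
    if not node.children:
        return [node.name]
    return [name for c in node.children for name in leaves(c)]

# Depth-2 tree: root -> {a, b}, a -> {a1, a2}, b -> {b1, b2}.
tree = Node("root", [Node("a", [Node("a1"), Node("a2")]),
                     Node("b", [Node("b1"), Node("b2")])])
print(leaves(tree))                  # ['a1', 'a2', 'b1', 'b2']
print(leaves(prune(tree, {"a"})))    # ['a', 'b1', 'b2']
```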

Recall that our algorithms work by sequentially building a hierarchical packing of the instance sequence. This tree-like structure is defined as follows.

Definition 3 (Hierarchical packing).

A hierarchical packing of depth of an instance sequence is a sequence of nonempty subsets and radii satisfying the following properties. For each level :

  1. the set is a -packing of the elements of with balls ;

  2. for all , either or for some ;

  3. if , then there exists such that .

Figure 2: An example of the mapping between a tree and a hierarchical packing of some sequence.
Figure 3: Pruning of a tree.

Any such hierarchical packing can be viewed as a rooted tree (conventionally, the root of the tree is the unit ball ) defined by the parent function, where if and only if for , and —see Figure 2.

Given an instance sequence , let be the family of all trees of depth generated from by choosing the -packings at each level in all possible ways. Given and its pruning , we use to denote the subset of containing the nodes of that correspond to leaves of —see Figure 3. When is clear from the context, we abbreviate with . For any fixed let also be the number of leaves in .

3 Related Work

In nonparametric prediction, a classical topic in statistics, one is interested in predicting well compared to the best function in a large class, which typically includes all functions that have certain regularities. In online learning, nonparametric prediction was studied by Vovk (2006a, b, 2007), who analyzed the regret of algorithms against Lipschitz function classes with bounded metric entropy. Rakhlin and Sridharan (2014) later used a non-constructive argument establishing minimax regret rates (when ) for both square and absolute loss. Inspired by their work, Gaillard and Gerchinovitz (2015) devised the first online algorithms for nonparametric regression enjoying minimax regret. A computationally efficient variant of their algorithm, with running time , relies on a nested covering of a function class, where —roughly speaking— functions are approximated by an aggregation of indicator functions at different levels of a cover.

In this work we employ a nested packing approach, which bears a superficial resemblance to the construction of Gaillard and Gerchinovitz (2015) and to the analysis technique of Rakhlin and Sridharan (2014). However, the crucial difference is that we hierarchically cover the input space, rather than the function class, and use local no-regret learners within each element of the cover. Our algorithm is conceptually similar to the one of Hazan and Megiddo (2007); however, their space packing can be viewed as a “flat” version of the one proposed here, and their analysis only holds for a known time horizon (a restriction later lifted by Kpotufe and Orabona (2013)).

Our algorithms adapt to the regularity of the problem in an online fashion using the tree-expert variant of the prediction with expert advice setting —see also (Cesa-Bianchi and Lugosi, 2006). In this setting, originally introduced by Helmbold and Schapire (1997), there is a tree-expert for each pruning of a complete tree with a given branching factor. Although the number of such prunings is exponential, predictions and updates can be performed in time linear in the tree depth using the context tree algorithm of Willems et al. (1995). In this work we consider a conceptually simpler version, which relies on sleeping experts (Freund et al., 1997), where each node of a tree is associated with an expert, and on each round only the experts along a root-to-leaf path are awake. The goal is to compete against the best pruning in hindsight, which typically requires knowledge of the pruning size for tuning purposes. In the case of prediction with absolute loss, we avoid the tuning problem by exploiting a parameter-free algorithm of Luo and Schapire (2015).

Local adaptivity to regularities of a competitor, as discussed in the current paper, can also be viewed as automatic parameter tuning through hierarchical expert advice. A similar idea, albeit without the use of a hierarchy, was explored by van Erven and Koolen (2016) for automatic step size tuning in online convex optimization —see (Orabona and Pál, 2016) for a detailed discussion on the topic. Finally, the idea of exploiting a variant of a context tree for nonlinear classification was also explored in neural network learning by Veness et al. (2019), where —roughly speaking— context trees are used to combine randomly initialized halfspaces.

While standard results in statistics assume some form of a uniform regularity of an optimal function (such as Lipschitzness or Hölder continuity), several works have investigated nonparametric regression under local smoothness assumptions. For instance, Mammen and van de Geer (1997) considered a one-dimensional nonparametric regression problem with a fixed design, where the regression function belongs to the class of -times weakly differentiable functions with bounded total variation. They proposed and analyzed locally-adaptive regression splines, where an estimator is a variant of Regularized Least Squares (RLS) with a total variation penalty, and showed minimax optimal rates with exponential dependence in . A similar direction was also pursued by Tibshirani (2014) through trend filtering. He proposed a less computationally intensive algorithm with comparable rates. Unlike these works, here we address local Lipschitzness in general metric spaces without any statistical assumptions.

Adaptivity of k-NN regression and kernel regression to the local effective dimension of the stochastic data-generating process was studied by Kpotufe (2011); Kpotufe and Garg (2013); however, they considered a notion of locality different from the one studied here. The idea of adaptivity to the global effective dimension, combined with the packing construction of Hazan and Megiddo (2007) in the online setting, was proposed by Kpotufe and Orabona (2013). Kuzborskij and Cesa-Bianchi (2017) investigated a stronger form of adaptivity to the dimension in nonparametric online learning, which is related to recovering the subspace where the target function is smoother. In online convex optimization, adaptivity to the global Lipschitz constant of the loss function was recently proposed by Mhammedi et al. (2019).

Finally, related ideas of hierarchical covering were also explored in the global optimization literature (Munos, 2011), where adaptivity to local regularities is exploited for the search of critical points.

4 Description of the algorithm

Recall that we identify a hierarchical packing with a tree whose nodes correspond to the elements of the packing. Our algorithm predicts using a hierarchical packing evolving with time, and competes against the best pruning of the tree corresponding to the final hierarchical packing. A local online learner is associated with each node of except for the root. When a new instance is observed, it is matched with the closest center at each level , until a leaf is reached. The local learners associated with these closest centers output predictions, which are then aggregated using an algorithm for prediction with expert advice, where the local learner at each node is viewed as an expert. Since only a fraction of experts (i.e., those associated with the closest centers, which form a path in a tree) are active at any given round, this can be viewed as an instance of the “sleeping experts” framework of Freund et al. (1997). In the regression case, since the square loss is exp-concave for bounded predictions, we can directly apply the results of Freund et al. (1997). In the classification case, we use instead the parameter-free approach of Luo and Schapire (2015).

One might wonder whether a similar algorithm could be formulated without a dynamically evolving packing, by constructing a fixed partition of the instance space ahead of time. Such an algorithm would be inferior to ours, since it would be competitive only for a known time horizon (unless one used a cumbersome doubling trick or resorted to a non-trivial tree-growing extension of the algorithm). In addition, identifying an element of such a partition is straightforward for some metrics, while it would be computationally non-trivial for an arbitrary metric. On the other hand, the dynamic algorithm presented here works with any metric and enjoys local adaptivity on the induced metric space.

Algorithm 1 contains the pseudocode for the case of exp-concave loss functions. The algorithm invokes two subroutines, propagate and update: the former collects the predictions of the local learners along the path of active experts corresponding to an incoming instance, while the latter updates these learners.

1:Depth parameter , radius tuning function
2: Centers at each level
3:for each round  do
4:     Receive Prediction
5:      propagate() Subroutine 2
6:     
7:     Predict
8:     Observe Update
9:     update() Subroutine 3
10:     
11:     for each  do
12:         
13:     end for
14:end for
Algorithm 1 Locally Adaptive Online Learning (Hedge style)

We use to denote the root-to-leaf path in of active experts associated with the current instance . The subroutine propagate finds in each level the center closest to . Then, the path of active experts associated with these centers and the vector of their predictions are returned to the algorithm (line 5). The sum of the current weight of each active expert on the path is computed in line 6, where is used to denote a node in whose path is a prefix of . This sum is used to compute the aggregated prediction on line 7. After observing the true label (line 8), the subroutine update updates the active experts. Finally, the weights of the active experts are updated (lines 10 and 12).

We now describe concrete implementations of propagate and update which will be used in Section 5. For simplicity, we assume that all variables of the meta-algorithm which are not explicitly given as input values are visible to both procedures.

1:instance , time step index
2: Start from root
3:for depth  do
4:     if  then
5:          Create initial ball at depth
6:         Create predictor at
7:     end if
8:      Find active expert at level
9:      Add index of active expert to path
10:      prediction of active expert Add prediction to prediction vector
11:      Get current radius
12:      Set ball of active expert as current element in the packing
13:end for
14:path of active experts and vector of active expert predictions
Subroutine 2 propagate.

The subroutine propagate finds in a tree the path of active experts associated with an instance . When invoked at time , the tree is created as a list of nested balls with common center and radii for (lines 5 and 6). For all , starting from the root node set as parent node (line 2), the procedure finds in each level the center closest to the current instance among those centers which belong to the parent node (line 8). Note that the parent node is a ball and therefore there is at least one center in . The active expert indices are collected in a vector , while their predictions are stored in a vector and then aggregated using Algorithm 1. We use to denote the subset of time steps on which the expert at node is active. These are the such that occurs in .

1:Path of active experts, example , time step
2:for depth  do
3:      Get current radius
4:      Get next active expert in path
5:     if  then
6:         Update active expert using
7:     else
8:          Add new center to level
9:         Create predictor at and initialize it with
10:     end if
11:end for
Subroutine 3 update.

The subroutine update checks whether the current instance belongs to each of the balls hosting an active expert listed in . If belongs to the active ball at level , then is used to update the expert (line 6). If is outside of the active ball at level , then a new ball with center is created in the packing at that level. Then, a new predictor associated with that ball is created and initialized using the current example (line 9).
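To complement the pseudocode and its description, here is a compact, self-contained Python sketch of the whole scheme: propagate lazily creates an initial ball per level and returns the nearest center at each level together with its prediction, update either feeds the example to the active local learner or opens a new ball, and the meta-algorithm aggregates the active experts with exponential weights restricted to the active path. The radius schedule, the learning rate, and the running-mean local learner are simplified placeholder choices, not the tuned quantities of the paper.

```python
import math
import random

class RunningMean:
    """Simplified local learner: predicts the mean of the labels seen so far."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self):
        return self.s / self.n if self.n else 0.5
    def update(self, y):
        self.n, self.s = self.n + 1, self.s + float(y)

class Center:
    def __init__(self, point):
        self.point = point
        self.learner = RunningMean()
        self.log_w = 0.0                    # log-weight of the associated expert

class LocallyAdaptiveLearner:
    """Sketch of the Hedge-style meta-algorithm over a dynamically grown
    hierarchical packing (placeholder radius schedule and learning rate)."""
    def __init__(self, depth, eta=0.5):
        self.depth, self.eta = depth, eta
        self.levels = [[] for _ in range(depth)]    # centers at each level

    def radius(self, level, t):
        # Placeholder tuning: balls shrink with depth and with time.
        return 2.0 ** (-(level + 1)) * t ** (-0.25)

    def propagate(self, x, t):
        path = []
        for s in range(self.depth):
            if not self.levels[s]:
                self.levels[s].append(Center(x))    # initial ball at this level
            node = min(self.levels[s], key=lambda c: math.dist(x, c.point))
            path.append((s, node, node.learner.predict()))
        return path

    def update(self, path, x, y, t):
        for s, node, _ in path:
            if math.dist(x, node.point) <= self.radius(s, t):
                node.learner.update(y)              # x is inside the active ball
            else:
                fresh = Center(x)                   # open a new ball at level s,
                fresh.learner.update(y)             # initialized with (x, y)
                self.levels[s].append(fresh)

    def step(self, x, y, t):
        path = self.propagate(x, t)
        m = max(node.log_w for _, node, _ in path)
        ws = [math.exp(node.log_w - m) for _, node, _ in path]
        yhat = sum(w * p for w, (_, _, p) in zip(ws, path)) / sum(ws)
        for _, node, p in path:
            node.log_w -= self.eta * (p - y) ** 2   # exponential-weights update
        self.update(path, x, y, t)
        return yhat

# Toy run on a 1-d regression stream with a piecewise-constant target.
random.seed(0)
learner = LocallyAdaptiveLearner(depth=3)
for t in range(1, 201):
    x = (random.random(),)
    y = 0.2 if x[0] < 0.5 else 0.8
    learner.step(x, y, t)
```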

5 Nonparametric regression with local Lipschitzness

We first consider the case of local Lipschitz bounds for regression with the square loss $(\hat{y}_t - y_t)^2$, where $y_t \in [0,1]$ for all $t$. Here we use Follow-the-Leader (FTL) as the local online predictor. As explained in the introduction, we need to match prunings to functions with certain local Lipschitz profiles. This is implemented by the following definition.
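For the square loss, FTL at a node has a closed form: the minimizer of the cumulative past loss is the running mean of the labels observed at that node. A minimal sketch (ours, with an arbitrary default prediction before any data arrives):

```python
class FollowTheLeader:
    """FTL for the square loss: argmin_p sum_t (p - y_t)^2 is the mean of the y_t."""
    def __init__(self, default=0.5):
        self.n, self.s, self.default = 0, 0.0, default
    def predict(self):
        return self.s / self.n if self.n else self.default
    def update(self, y):
        self.n += 1
        self.s += float(y)

ftl = FollowTheLeader()
for y in (0.1, 0.3, 0.2):
    ftl.update(y)
print(round(ftl.predict(), 6))   # 0.2, the mean of the observed labels
```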

Definition 4 (Functions admissible with respect to a pruning).

Given , a hierarchical packing of an instance sequence , and a time-dependent radius tuning function , we define the set of admissible functions with respect to a pruning of by

Equipped with this definition, we establish a regret bound with respect to admissible functions. Recall that is the total number of time steps in which the current instance belongs to a leaf at level of the pruning .

Theorem 1.

Given , suppose that Algorithm 1 using Subroutines 2 and 3 is run for rounds with radius tuning function , and let be the resulting hierarchical packing. Then, for all prunings of the regret satisfies

(7)

The expectation is understood with respect to the random variable that takes value with probability equal to the fraction of leaves of at level .

Since is the hierarchical packing generated by Algorithm 1, the prunings and the admissible functions depend on the algorithm through . Similar remarks hold for our results in Sections 6 and 7.

6 Nonparametric regression with local dimension

In this section we look at a different notion of adaptivity: we demonstrate that Algorithm 1 is also capable of adapting to the local dimension of the data sequence. We consider a decreasing sequence of local dimension bounds, where is assigned to the level of the hierarchical packing maintained by Algorithm 1. We also make a small modification to update (Subroutine 3): a new center is added at level only if the designated size of the packing (which depends on the local dimension bound) has not been exceeded. The modified subroutine is updateDim (Subroutine 4).

1:Path of active experts, example , time step , (see Def. 2)
2:for depth  do
3:      Get current radius
4:      Get next active expert in path
5:     if  then
6:         Update active expert using
7:     else if  then Restrict packing size at each level
8:          Add new center to level
9:         Create predictor at and initialize it with
10:     end if
11:end for
Subroutine 4 updateDim.
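A minimal Python sketch (ours) of the only change relative to update: a new center is opened at level s only while the packing at that level is below a capacity budget derived from the dimension bound assigned to that level. The budget formula and the Center class below are placeholder assumptions used for illustration.

```python
import math

class Center:
    """Minimal stand-in for a ball center hosting a local learner."""
    def __init__(self, point, learner):
        self.point, self.learner = point, learner

def capacity(level, t, dim_bounds, radius):
    # Placeholder budget: a region of metric dimension d admits on the order of
    # radius**(-d) disjoint balls of the given radius.
    return math.ceil(radius(level, t) ** (-dim_bounds[level]))

def update_dim(levels, path, x, y, t, dim_bounds, radius, dist, new_learner):
    """Like update, but a center is added at level s only if the packing at that
    level has not yet reached its designated size."""
    for s, node in path:
        if dist(x, node.point) <= radius(s, t):
            node.learner.update(y)                    # x is inside the active ball
        elif len(levels[s]) < capacity(s, t, dim_bounds, radius):
            fresh = Center(x, new_learner())
            fresh.learner.update(y)                   # initialize with the current example
            levels[s].append(fresh)
        # else: the packing at level s is full, so no new center is created there.
```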

Since the local dimension assumption is made on the instance sequence rather than on the function class, in this scenario we can afford to compete against the class of all -Lipschitz functions, while we restrict the prunings to those that are compatible with the local dimension bounds with respect to the hierarchical packing built by the algorithm.

Definition 5 (Prunings admissible with respect to local dimension bounds).

Given and a hierarchical packing of an instance sequence , we define the set of admissible prunings by

We prove the following regret bound.

Theorem 2.

Given , suppose that Algorithm 1 using Subroutines 2 and 4 is run for rounds with radius tuning function , and let be the resulting hierarchical packing. Then, for all prunings the regret satisfies

(8)

7 Nonparametric classification with local losses

The third notion of adaptivity we study is with respect to the loss of the local learners in each node of a hierarchical packing. The local loss profile is parameterized by a sequence of nonnegative and nondecreasing functions, each of which bounds the total loss of all local learners at level of the hierarchical packing. In order to achieve better regret when the data sequence can be predicted well by local learners in a shallow pruning, we assume for all , where the choice of allows us to fall back to the standard regret bounds if the data sequence is hard to predict.

Whereas Sections 5 and 6 consider regression with the square loss, here we work with binary classification with the absolute loss , which —unlike the square loss— is not exp-concave. As we explained in Section 1, working with losses that are not exp-concave is motivated by the availability of first-order regret bounds, which allow us to take advantage of good local loss profiles. While the exp-concavity of the square loss spared us the need to tune Algorithm 1 using properties of the pruning, here we circumvent the tuning issue by replacing Algorithm 1 with the parameter-free Algorithm 5 (stated in Appendix A), which is based on the AdaNormalHedge algorithm of Luo and Schapire (2015). Instead of the standard exponential weights on which the updates of Algorithm 1 are based, AdaNormalHedge performs its updates using a potential-based weight function.
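For reference, a minimal sketch of the AdaNormalHedge potential and weight, written from Luo and Schapire (2015): here R stands for an expert's cumulative instantaneous regret and C for its cumulative absolute instantaneous regret.

```python
import math

def potential(R, C):
    # Phi(R, C) = exp( max(R, 0)^2 / (3C) ), with the convention Phi(., 0) = 1.
    return 1.0 if C <= 0 else math.exp(max(R, 0.0) ** 2 / (3.0 * C))

def anh_weight(R, C):
    """AdaNormalHedge weight: w(R, C) = ( Phi(R+1, C+1) - Phi(R-1, C+1) ) / 2."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

# On each round an awake expert is weighted by anh_weight(R, C), possibly scaled
# by a prior, and the prediction is the normalized weighted average of the awake
# experts' predictions; no learning rate needs to be tuned.
print(anh_weight(0.0, 0.0), anh_weight(3.0, 5.0))
```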

As online local learners we use self-confident Weighted Majority (Cesa-Bianchi and Lugosi, 2006, Exercise 2.10) with two constant experts predicting 0 and 1. In the following, we denote by the cumulative loss of a local learner at node over the time steps when the expert is active. Similarly to the previous section, we compete against the class of all Lipschitz functions, and introduce the following constraint on the prunings:

(9)

If then the total loss of all the leaves at a particular level behaves in accordance with .
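A rough sketch (ours) of the local learners used in this section: weighted majority over the two constant experts predicting 0 and 1, with a simplified self-confident learning-rate schedule in the spirit of Auer et al. (2002); the exact tuning analyzed in the paper may differ.

```python
import math

class SelfConfidentWM:
    """Weighted majority over the constant experts 0 and 1 under the absolute loss,
    with a learning rate that shrinks as the best expert's loss grows."""
    def __init__(self):
        self.loss = [0.0, 0.0]             # cumulative losses of experts 0 and 1

    def predict(self):
        eta = math.sqrt(math.log(2.0) / (1.0 + min(self.loss)))
        w0, w1 = (math.exp(-eta * l) for l in self.loss)
        return w1 / (w0 + w1)              # probability of predicting label 1

    def update(self, y):
        self.loss[0] += abs(0.0 - y)       # expert always predicting 0
        self.loss[1] += abs(1.0 - y)       # expert always predicting 1

wm = SelfConfidentWM()
for y in (1, 1, 0, 1):
    wm.update(y)
print(round(wm.predict(), 3))              # leans toward 1 after mostly-1 labels
```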

Theorem 3.

Suppose that Algorithm 5 runs self-confident weighted majority at each node with radius tuning function , and let be the resulting hierarchical packing. Then for all prunings of the regret satisfies:

Acknowledgments.

We are grateful to Pierre Gaillard, Sébastien Gerchinovitz, and András György for many insightful comments.

References

  • Zinkevich [2003] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003.
  • Hazan [2016] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hazan and Megiddo [2007] E. Hazan and N. Megiddo. Online Learning with Prior Knowledge. In Learning Theory, pages 499–513. Springer, 2007.
  • Rakhlin et al. [2015] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 16(2):155–186, 2015.
  • Freund et al. [1997] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 334–343. ACM, 1997.
  • Luo and Schapire [2015] H. Luo and R. E. Schapire. Achieving All with No Parameters: AdaNormalHedge. In Conference on Computational Learning Theory (COLT), 2015.
  • Clarkson [2006] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-neighbor methods for learning and vision: theory and practice, pages 15–59, 2006.
  • Vovk [2006a] V. Vovk. On-line regression competitive with reproducing kernel Hilbert spaces. In International Conference on Theory and Applications of Models of Computation. Springer, 2006a.
  • Vovk [2006b] V. Vovk. Metric entropy in competitive on-line prediction. arXiv preprint cs/0609045, 2006b.
  • Vovk [2007] V. Vovk. Competing with wild prediction rules. Machine Learning, 69(2):193–212, 2007.
  • Rakhlin and Sridharan [2014] A. Rakhlin and K. Sridharan. Online Non-Parametric Regression. In Conference on Computational Learning Theory (COLT), 2014.
  • Gaillard and Gerchinovitz [2015] P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Computational Learning Theory (COLT), 2015.
  • Kpotufe and Orabona [2013] S. Kpotufe and F. Orabona. Regression-Tree Tuning in a Streaming Setting. In Conference on Neural Information Processing Systems (NIPS), 2013.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Helmbold and Schapire [1997] D. P. Helmbold and R. E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.
  • Willems et al. [1995] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
  • van Erven and Koolen [2016] T. van Erven and W. M. Koolen. Metagrad: Multiple learning rates in online learning. In Conference on Neural Information Processing Systems (NIPS), 2016.
  • Orabona and Pál [2016] F. Orabona and D. Pál. Coin betting and parameter-free online learning. In Conference on Neural Information Processing Systems (NIPS), 2016.
  • Veness et al. [2019] J. Veness, T. Lattimore, A. Bhoopchand, D. Budden, C. Mattern, A. Grabska-Barwinska, P. Toth, S. Schmitt, and M. Hutter. Gated linear networks. arXiv preprint arXiv:1910.01526, 2019.
  • Mammen and van de Geer [1997] E. Mammen and S. van de Geer. Locally adaptive regression splines. The Annals of Statistics, 25(1):387–413, 1997.
  • Tibshirani [2014] R. J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
  • Kpotufe [2011] S. Kpotufe. k-NN regression adapts to local intrinsic dimension. In Conference on Neural Information Processing Systems (NIPS), 2011.
  • Kpotufe and Garg [2013] S. Kpotufe and V. Garg. Adaptivity to local smoothness and dimension in kernel regression. In Conference on Neural Information Processing Systems (NIPS), 2013.
  • Kuzborskij and Cesa-Bianchi [2017] I. Kuzborskij and N. Cesa-Bianchi. Nonparametric Online Regression while Learning the Metric. In Conference on Neural Information Processing Systems (NIPS), 2017.
  • Mhammedi et al. [2019] Z. Mhammedi, W. M. Koolen, and T. Van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Conference on Computational Learning Theory (COLT), 2019.
  • Munos [2011] R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Conference on Neural Information Processing Systems (NIPS), 2011.
  • Mourtada and Maillard [2017] J. Mourtada and O.-A. Maillard. Efficient tracking of a growing number of experts. In Algorithmic Learning Theory (ALT), 2017.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.

Appendix A Algorithm for nonparametric classification with local losses

1:Depth parameter , radius tuning function
2: Centers at each level
3:for each round  do
4:     Receive Prediction
5:      propagate() Subroutine 2
6:     for each  do
7:         if  then
8:              
9:         else
10:              
11:         end if
12:     end for
13:     Predict where
14:     Observe Update
15:     update Subroutine 3
16:     
17:     for each  do
18:         
19:     end for
20:end for
Algorithm 5 Locally Adaptive Online Learning (AdaNormalHedge style)

Appendix B Learning with expert advice over trees

In order to prove the regret bounds in our locally-adaptive learning setting, we start by deriving bounds for prediction with expert advice when the competitor class consists of all the prunings of a tree in which each node hosts an expert, a framework initially investigated by Helmbold and Schapire [1997]. Our analysis uses the sleeping experts setting of Freund et al. [1997], in which only a subset of the node experts are active at each time step . In our locally-adaptive setting, the set of active experts at time corresponds to the active root-to-leaf path selected by the current instance —see Section 4. The inactive experts at time neither output predictions nor get updated. The prediction of a pruning at time , denoted with , is the prediction of the node expert corresponding to the unique leaf of on .

1:Tree and initial weights for each node of the tree
2:for each round  do
3:     Observe predictions of active experts (corresponding to a root-to-leaf path in the tree)
4:     Predict and observe
5:     Update the weight of each active expert
6:end for
Algorithm 6 Learning over trees through sleeping experts

Next, we consider two algorithms for the problem of prediction with expert advice over trees. In order to be simultaneously competitive with all prunings, we need algorithms that do not require tuning their parameters based on the specific pruning against which the regret is measured. In the case of exp-concave losses (like the square loss), tuning is not required and Hedge-style algorithms work well. In the case of generic convex losses, we use the more complex parameter-free algorithm AdaNormalHedge.

We start by recalling the algorithm for learning with sleeping experts and the basic regret bound of Freund et al. [1997]. The sleeping experts setting assumes a set of experts without any special structure. At every time step only an adversarially chosen subset of the experts provides predictions and gets updated —see Algorithm 7.

1:Initial nonnegative weights
2:for each round  do
3:     Receive predictions of active experts
4:      Prediction
5:     Observe
6:     For Update
7:end for
Algorithm 7 Exponential weights with sleeping experts for -exp-concave losses

The regret bound is parameterized in terms of the relative entropy between the initial distribution over experts and any target distribution . The following theorem states a slightly more general bound that holds for any -exp-concave loss function (for completeness, the proof is given in Appendix D).

Theorem 4 ([Freund et al., 1997]).

If Algorithm 7 is run on any sequence of -exp-concave loss functions, then for any sequence of awake experts and for any distribution over , the following holds

(10)

where .

By taking to be uniform over the experts, the above theorem implies a bound with a factor. However, since we predict and perform updates only with respect to awake experts, this can be improved to , where is the number of distinct experts ever awake throughout the time steps. The following lemma (whose proof is deferred to Appendix D) formally states this fact.
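A minimal Python sketch (ours) of this sleeping-experts scheme for an exp-concave loss, using the square loss with a fixed learning rate as an illustrative choice: predictions and multiplicative updates involve only the awake experts, while sleeping experts keep their weights unchanged.

```python
import math

def sleeping_exp_weights(rounds, n_experts, eta=0.5):
    """Exponential weights with sleeping experts. `rounds` yields triples
    (awake, preds, y): indices of awake experts, their predictions, true label."""
    log_w = [0.0] * n_experts
    total_loss = 0.0
    for awake, preds, y in rounds:
        m = max(log_w[i] for i in awake)
        w = {i: math.exp(log_w[i] - m) for i in awake}
        z = sum(w.values())
        yhat = sum(w[i] * p for i, p in zip(awake, preds)) / z    # aggregate awake experts
        total_loss += (yhat - y) ** 2
        for i, p in zip(awake, preds):
            log_w[i] -= eta * (p - y) ** 2     # only awake experts are updated
    return total_loss

# Three experts; expert 2 sleeps in the first round.
rounds = [([0, 1], [0.2, 0.9], 1.0), ([0, 1, 2], [0.1, 0.8, 1.0], 1.0)]
print(sleeping_exp_weights(rounds, n_experts=3))
```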

Fix a sequence of awake experts such that . Let the uniform distribution supported over the awake experts, denoted with , be defined by if and otherwise.

Lemma 1.

Suppose Algorithm 7 is run with initial weights for and with a sequence of awake experts. Then the regret of the algorithm initialized with matches the regret of the algorithm initialized with .

We use Theorem 4 and Lemma 1 to derive a regret bound for Algorithm 6 when predictions and updates are provided by Algorithm 7. The same regret bound can be achieved through the analysis of [Mourtada and Maillard, 2017, Theorem 3], albeit their proof follows a different argument.

Theorem 5.

Suppose that Algorithm 6 is run using predictions and updates provided by Algorithm 7. Then, for any sequence of -exp-concave losses and for any pruning of the input tree ,

Proof.

Let be the uniform distribution over the terminal nodes of . At each round, exactly one terminal node of is in the active path of . Therefore , and also for all because only one expert in is awake in the support of . Now note that although the algorithm is actually initialized with , Lemma 1 shows that the regret remains the same if we assume the algorithm is initialized with . The choice of the competitor gives us . By applying Theorem 4 we finally get

(only one expert awake in the active path)

concluding the proof. ∎

In the case of general convex losses, we simply apply the following theorem, where denotes the cumulative loss of pruning .

Theorem 6 (Section 6 in [Luo and Schapire, 2015]).

Suppose that Algorithm 6 is run using predictions and updates provided by AdaNormalHedge. Then, for any sequence of convex losses and for any pruning of the input tree ,

Appendix C Proofs for nonparametric prediction

We start by proving a master regret bound that can be specialized to various settings of interest. Recall that the prediction of a pruning at time is , where is the prediction of the node expert sitting at the unique leaf of the pruning on the active path . Recall also that is the center of the ball in the hierarchical packing corresponding to node in the tree. Since in our locally-adaptive setting node experts are local learners, should be viewed as the prediction of the local online learning algorithm sitting at node of the tree. Let be the subset of time steps when is on the active path . We now introduce the definitions of regret for the tree expert

and for node expert

where is either (regression with square loss) or (classification with absolute loss), and

Note that, for all and for defined as above,

(11)
Lemma 2.

Suppose that Algorithm 1 (or, equivalently, Algorithm 5) is run on a sequence of convex and -Lipschitz losses and let be the resulting hierarchical packing. Then for any pruning of and for any