AMF: Aggregated Mondrian Forests for Online Learning

by Jaouad Mourtada, et al.

Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to feature scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants, however, are "offline" algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees; in particular, this yields a truly online parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.




1 Introduction

Introduced by Breiman (2001), Random Forests (RF) is one of the algorithms of choice in many supervised learning applications. The appeal of these methods comes from their remarkable accuracy in a variety of tasks, the small number (or even the absence) of parameters to tune, their reasonable computational cost at training and prediction time, and their suitability in high-dimensional settings.

Most commonly used RF algorithms, such as the original random forest procedure (Breiman, 2001), extra-trees (Geurts et al., 2006), or conditional inference forests (Hothorn et al., 2010), are batch algorithms that require the whole dataset to be available at once. Several online random forest variants have been proposed to overcome this issue and handle data that come sequentially. Utgoff (1989) was the first to extend Quinlan's ID3 batch decision tree algorithm (see Quinlan, 1986) to an online setting. Later on, Domingos and Hulten (2000) introduced Hoeffding Trees, which can be easily updated: since observations are available sequentially, a cell is split when (i) enough observations have fallen into this cell and (ii) the best split in the cell is statistically relevant (a generic Hoeffding inequality being used to assess the quality of the best split).

Since random forests are known to exhibit better empirical performance than individual decision trees, online random forests have been proposed (see, e.g., Saffari et al., 2009; Denil et al., 2013). These procedures aggregate several trees by computing the mean of the tree predictions (regression setting) or the majority vote among trees (classification setting). The tree construction differs from one forest to another but shares similarities with Hoeffding trees: a cell is split if (i) and (ii) (defined above) are verified.

One forest of particular interest for this paper is the Mondrian Forest (Lakshminarayanan et al., 2014), based on the Mondrian process (Roy and Teh, 2009). Its construction differs from the construction described above since each new observation modifies the tree structure: instead of waiting for enough observations to fall into a cell in order to split it, the properties of the Mondrian process allow the Mondrian tree partition to be updated each time a sample is collected. Once a Mondrian tree is built, its prediction function uses a hierarchical prior on all subtrees, and the average of predictions on all subtrees is computed with respect to this hierarchical prior using an approximation algorithm.

The algorithm we propose, called AMF, and illustrated in Figure 1 below on a toy binary classification dataset, differs from Mondrian Forest by the smoothing procedure used on each tree. While the hierarchical Bayesian smoothing proposed in Lakshminarayanan et al. (2014) requires approximations, the prior we choose allows for exact computation of the posterior distribution. The choice of this posterior is inspired by Context Tree Weighting (see, e.g., Willems et al., 1995; Willems, 1998; Helmbold and Schapire, 1997; Catoni, 2004), commonly used in lossless compression to aggregate all subtrees of a prespecified tree, which is both computationally efficient and theoretically sound.

Figure 1: Evolution of the decision function of AMF along the online learning steps. We observe the online property of this algorithm, which produces a smooth decision function at each iteration, and leads to a correct AUC on a test set even in the early stages.

Since we are able to compute the posterior distribution exactly, our approach is drastically different from Bayesian trees (see, for instance, Chipman et al., 1998; Denison et al., 1998; Taddy et al., 2011), and from BART (Chipman et al., 2010), which implement MCMC methods to approximate posterior distributions on trees. The Context Tree Weighting algorithm has been applied to regression trees by Blanchard (1999) in the case of a fixed-design tree, in which splits are prespecified. This requires splitting the dataset into two parts (using the first part to select the best splits and the second to compute the posterior distribution) and having access to the whole dataset, since the tree structure needs to be fixed in advance.

As noted by Rockova and van der Pas (2017), the theoretical study of Bayesian methods on trees (Chipman et al., 1998; Denison et al., 1998) or sums of trees (Chipman et al., 2010) is less developed. Rockova and van der Pas (2017) analyze some variants of Bayesian regression trees and sums of trees; they obtain near minimax optimal posterior concentration rates. Likewise, Linero and Yang (2018) analyze Bayesian sums of soft decision trees models, and establish minimax rates of posterior concentration for the resulting SBART procedure. While these frameworks differ from ours (here results are posterior concentration rates as opposed to regret bounds and excess risk bounds, and the design is fixed), their approach differs from ours primarily in the chosen trade-off between computational complexity and adaptivity of the method: these procedures involve approximate posterior sampling over large functional spaces through MCMC methods, and it is unclear whether the considered priors allow for reasonably efficient posterior computations. In particular, the prior used in Rockova and van der Pas (2017) is taken over all subsets of variables, whose number is exponentially large in the number of features.

The literature focusing on the original RF algorithm or its related variants is more extensive, even if the data-dependent nature of the algorithm and its numerous components (sampling procedure, split selection, aggregation) make the theoretical analysis difficult. The consistency of stylized RF algorithms was first established by Biau et al. (2008), and later obtained for more sophisticated variants in Denil et al. (2013); Scornet et al. (2015). Note that consistency results do not provide rates of convergence, and hence only offer limited guidance on how to properly tune the parameters of the algorithm. Starting with Biau (2012); Genuer (2012), some recent work has thus sought to quantify the speed of convergence of some stylized variants of RF. Minimax optimal nonparametric rates were first obtained by Arlot and Genuer (2014) in dimension $1$ for the Purely Uniformly Random Forests (PURF) algorithm, in conjunction with suboptimal rates in arbitrary dimension (when the number of features exceeds $1$).

Several recent works (Wager and Walther, 2015; Duroux and Scornet, 2016) also established rates of convergence for variants of RF that essentially amount to some form of Median Forests, where each node contains at least a fixed fraction of observations of its parent. While valid in arbitrary dimension, the established rates are suboptimal. More recently, adaptive minimax optimal rates were obtained by Mourtada et al. (2018) in arbitrary dimension for the batch Mondrian Forests algorithm. Our proposed online algorithm, AMF, also achieves minimax rates in an adaptive fashion, namely without knowing the smoothness of the regression function.

In this paper, we introduce AMF, a random forest algorithm which is fully online and computationally exact: unlike Bayesian trees and sum-of-trees procedures relying on approximate posterior sampling, we are able to compute exactly the prediction function of AMF in a very efficient way. Section 2 introduces the setting considered and general notations, and provides a precise construction of the AMF algorithm. A theoretical analysis of AMF is given in Section 3, where we establish regret bounds for AMF together with a minimax adaptive upper bound. Section 4 introduces a modification of AMF which is used in all the numerical experiments of the paper, together with a guarantee and a discussion on its computational complexity. Numerical experiments are provided in Section 5, on a large number of datasets, that include a comparison of AMF with several strong baselines. Our conclusions are provided in Section 6. The proofs of all the results are gathered in Section 7.

2 Towards a forest of aggregated Mondrian trees

We define in Section 2.1 the setting and notations that will be used throughout the paper, together with the definition of the Mondrian process, introduced by Roy and Teh (2009), which is a key element of our algorithm. In Section 2.2, we explicitly describe the prediction function that we want to compute, and prove in Proposition 1 that the AMF algorithm described in Section 2.3 computes it exactly.

2.1 The setting, trees, forests and the Mondrian process

We are interested in an online supervised learning problem in which we assume that the dataset is not fixed in advance. In this scenario, we are given an i.i.d. sequence $(x_1, y_1), (x_2, y_2), \dots$ of $[0,1]^d \times \mathcal{Y}$-valued random variables that come sequentially, such that each $(x_t, y_t)$ has the same distribution as a generic pair $(x, y)$.

Our aim is to design an online algorithm that can be updated "on the fly" given new sample points, that is, at each time step $t \geq 1$, a randomized prediction function

$$\hat f_t(\cdot, \Pi, \mathcal{D}_t) : [0,1]^d \to \hat{\mathcal{Y}},$$

where $\mathcal{D}_t$ is the dataset available at time $t$, where $\Pi$ is a random variable that accounts for the randomization procedure and $\hat{\mathcal{Y}}$ is a prediction space (see Examples 1 and 2 below). In the rest of the paper, we omit the explicit dependence on $\mathcal{D}_t$.

We consider prediction rules that are random forests, defined as the averaging of a set of randomized decision trees. We let $\hat f_t(x, \Pi^{(1)}), \dots, \hat f_t(x, \Pi^{(M)})$ be randomized tree predictors at a point $x$ at time $t$, associated to the same randomized mechanism, where the $\Pi^{(m)}$ for $m = 1, \dots, M$ are i.i.d. and correspond to a random tree partition, which is described below. Setting $\Pi^{(1:M)} = (\Pi^{(1)}, \dots, \Pi^{(M)})$, the random forest estimate $\hat f_t(x, \Pi^{(1:M)})$ is then defined by

$$\hat f_t(x, \Pi^{(1:M)}) = \frac{1}{M} \sum_{m=1}^{M} \hat f_t(x, \Pi^{(m)}), \tag{1}$$

namely taking the average over all tree predictions $\hat f_t(x, \Pi^{(m)})$. The online training of each tree can be done in parallel, since the trees are fully independent of each other and each of them follows the exact same randomized construction. Therefore, we describe only the construction of a single tree (and its associated random partition and prediction function) and omit from now on the dependence on $m$.

The random tree partitions are given by $\Pi = (\mathcal{T}, \Sigma)$, where $\mathcal{T}$ is a binary tree and $\Sigma$ contains information about each node in $\mathcal{T}$, such as splits, as explained below. Let us now introduce notations and definitions for these objects. For simplicity, we first assume that $t$ is fixed, and drop the dependence on $t$ for a little while.

Definition 1 (Tree partition).

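Let $C = \prod_{j=1}^{d} [a_j, b_j] \subseteq \mathbf{R}^d$ be a hyper-rectangular box, with $-\infty \leq a_j \leq b_j \leq +\infty$ for $j = 1, \dots, d$ (the interval being open at an infinite extremity). A tree partition (or k-d tree, guillotine partition) of $C$ is a pair $(\mathcal{T}, \Sigma)$, where

  • $\mathcal{T}$ is a finite ordered binary tree, which is represented as a finite subset of the set $\{0,1\}^* = \bigcup_{n \geq 0} \{0,1\}^n$ of all finite words on the alphabet $\{0,1\}$. The set $\{0,1\}^*$ is endowed with a tree structure (and called the complete binary tree): the empty word $\epsilon$ is the root, and for any $v \in \{0,1\}^*$, the left (resp. right) child of $v$ is $v0$ (resp. $v1$), obtained by adding a $0$ (resp. $1$) at the end of $v$. We denote by $\mathrm{intnodes}(\mathcal{T})$ the set of its interior nodes and by $\mathrm{leaves}(\mathcal{T})$ the set of its leaves, which are disjoint by definition.

  • $\Sigma = (\sigma_v)_{v \in \mathrm{intnodes}(\mathcal{T})}$ is a family of splits at the interior nodes of $\mathcal{T}$, where each split $\sigma_v = (j_v, s_v)$ is characterized by its split dimension $j_v \in \{1, \dots, d\}$ and its threshold $s_v$. In Section 2.3, we will actually store in $\Sigma$ more information about nodes $v \in \mathcal{T}$.

One can associate to $\Pi = (\mathcal{T}, \Sigma)$ a partition $(C_v)_{v \in \mathrm{leaves}(\mathcal{T})}$ of $C$ as follows. For each node $v \in \mathcal{T}$, its cell $C_v$ is a hyper-rectangular region $C_v \subseteq C$ defined recursively: the cell associated to the root of $\mathcal{T}$ is $C$, and, for each $v \in \mathrm{intnodes}(\mathcal{T})$, we define

$$C_{v0} = \{ x \in C_v : x_{j_v} \leq s_v \} \quad \text{and} \quad C_{v1} = C_v \setminus C_{v0}.$$

Then, the leaf cells $(C_v)_{v \in \mathrm{leaves}(\mathcal{T})}$ form a partition of $C$ by construction.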
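The recursive definition of cells can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes are encoded as strings over '0' and '1' (the empty string being the root), splits are stored in a dictionary mapping an interior node to a pair (split dimension, threshold), and the function name `cell_of` is hypothetical. The sketch tracks only the bounds of each side and does not represent which endpoint is open or closed.

```python
def cell_of(node, root_box, splits):
    """Return the cell of `node` as a list of [lower, upper] bounds per dimension.

    node     -- word over '0'/'1'; '' is the root (assumed encoding)
    root_box -- list of (lower, upper) pairs describing the root cell
    splits   -- dict: interior node -> (split dimension, threshold)
    """
    box = [list(side) for side in root_box]
    for depth, bit in enumerate(node):
        j, s = splits[node[:depth]]  # split stored at the ancestor of this depth
        if bit == "0":
            box[j][1] = s            # left child: coordinate j is at most s
        else:
            box[j][0] = s            # right child: coordinate j is above s
    return box


# Root split on dimension 0 at 0.5, then its left child split on dimension 1 at 0.25.
example_splits = {"": (0, 0.5), "0": (1, 0.25)}
example_cell = cell_of("01", [(0.0, 1.0), (0.0, 1.0)], example_splits)
```

Here `example_cell` is the cell of the right child of the left child of the root: its first coordinate ranges up to 0.5 and its second coordinate starts at 0.25.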

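Mondrian partitions are a specific family of random tree partitions whose construction is described below. An infinite Mondrian partition $\Pi$ of $[0,1]^d$ can be sampled from the infinite Mondrian process, denoted $\mathrm{MP}$ from now on, using the procedure $\mathrm{SampleMondrian}([0,1]^d, \tau = 0)$ described below. If $C = \prod_{j=1}^{d} C^j$ with intervals $C^j = [a_j, b_j]$, we denote $|C^j| = b_j - a_j$ and $|C| = \sum_{j=1}^{d} |C^j|$. We denote by $\mathrm{Exp}(\lambda)$ the exponential distribution with intensity $\lambda > 0$ and by $\mathcal{U}([a, b])$ the uniform distribution on a finite interval $[a, b]$.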


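1:  Inputs: The cell $C_v = \prod_{j=1}^{d} C_v^j$ and creation time $\tau_v$ of a node $v$
2:  Sample a random variable $E \sim \mathrm{Exp}(|C_v|)$ and put $\tau_{v0} = \tau_{v1} = \tau_v + E$
3:  Sample a split coordinate $j_v \in \{1, \dots, d\}$ with $\mathbf{P}(j_v = j) = |C_v^j| / |C_v|$
4:  Sample a split threshold $s_v$ conditionally on $j_v$ as $s_v \mid j_v \sim \mathcal{U}(C_v^{j_v})$
5:  Following Definition 1, the split $\sigma_v = (j_v, s_v)$ defines children cells $C_{v0}$ and $C_{v1}$
6:  return $\mathrm{SampleMondrian}(C_{v0}, \tau_{v0}) \cup \mathrm{SampleMondrian}(C_{v1}, \tau_{v1})$
Algorithm 1 $\mathrm{SampleMondrian}(C_v, \tau_v)$: sample a Mondrian starting from a cell $C_v$ and time $\tau_v$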

The call to $\mathrm{SampleMondrian}([0,1]^d, 0)$ corresponds to a call starting at the root node $v = \epsilon$, since $C_\epsilon = [0,1]^d$ and the birth time of $\epsilon$ is $\tau_\epsilon = 0$. This random partition is built by iteratively splitting cells at some random time, which depends on the linear dimension $|C|$ of the input cell $C$ (the sum of its side lengths). The split coordinate $j$ is chosen at random, with a probability of sampling $j$ which is proportional to the side length $|C^j|$ of the cell, and the split threshold is sampled uniformly in $C^j$. The number of recursions in this procedure is infinite: the Mondrian process is a distribution on infinite tree partitions of $[0,1]^d$; see Roy and Teh (2009) and Roy (2011) for a rigorous construction. The random partition $\Pi_t$ described in Section 2.3 below is, however, not infinite, and depends on the features vectors $x_1, \dots, x_t$ seen until time $t$. The implementation of AMF used in all our experiments, described in Section 4 below, also considers finite partitions, through the concept of restricted Mondrian partitions, introduced in Lakshminarayanan et al. (2014). At this point, the birth times computed in Algorithm 1 are not used. They will allow us to define time prunings of a Mondrian partition in Section 3.1 below, a notion which is necessary to prove that AMF has adaptation capabilities to the optimal time pruning. Birth times are also necessary for the definition of restricted Mondrian partitions in Section 4, which is an important ingredient in the actual implementation of AMF.
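A single split of the sampling procedure (Lines 1–5 of Algorithm 1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_split` and the list-based encoding of a cell by its lower and upper corners are assumptions, and the cell is assumed to be bounded with positive total side length so that the exponential intensity is well defined.

```python
import random


def sample_split(lower, upper, tau, rng):
    """Sample one Mondrian split of the box prod_j [lower[j], upper[j]] created at time tau.

    Returns the birth time of the children, the split coordinate, the threshold,
    and the two children cells as (lower, upper) pairs.
    """
    sides = [u - l for l, u in zip(lower, upper)]
    linear_dim = sum(sides)                       # sum of the side lengths of the cell
    birth = tau + rng.expovariate(linear_dim)     # exponential waiting time with intensity linear_dim
    # Split coordinate drawn with probability proportional to the side length.
    j = rng.choices(range(len(sides)), weights=sides, k=1)[0]
    s = rng.uniform(lower[j], upper[j])           # threshold uniform on the chosen side
    left = (list(lower), upper[:j] + [s] + upper[j + 1:])
    right = (lower[:j] + [s] + lower[j + 1:], list(upper))
    return birth, j, s, left, right


rng = random.Random(0)
birth, j, s, left, right = sample_split([0.0, 0.0], [1.0, 0.5], 0.0, rng)
```

Applying `sample_split` recursively to `left` and `right`, each with creation time `birth`, would reproduce the recursion of Line 6; in practice the recursion has to be truncated, since the full process is infinite.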

2.2 Aggregation with exponential weights and prediction functions

The prediction function of AMF is an aggregation of the predictions given by all finite subtrees of the infinite Mondrian partition $\Pi$. This aggregation step is performed in a purely online fashion, using an aggregation algorithm based on exponential weights, with a branching process prior over the subtrees, see Definition 3 below. This weighting scheme gives more importance to subtrees with a good predictive performance.

Let us assume that the realization of an infinite Mondrian partition $\Pi \sim \mathrm{MP}$ is available at some fixed step $t$. We will argue in Section 2.3 that it suffices to store a finite partition $\Pi_t$, and show how to update it. The definition of the prediction function used in AMF requires the notions of node and subtree prediction, defined below.

Definition 2.

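Given $\Pi \sim \mathrm{MP}$, we define

$$\hat y_{v,t} = h\big( (y_s)_{1 \leq s \leq t-1 \,:\, x_s \in C_v} \big) \quad \text{and} \quad L_{v,t} = \sum_{1 \leq s \leq t \,:\, x_s \in C_v} \ell(\hat y_{v,s}, y_s)$$

for each node $v$ of $\Pi$ (which defines a cell $C_v \subseteq [0,1]^d$ following Definition 1) and each $t \geq 1$, where $h$ is a prediction algorithm used in each cell, with $\hat{\mathcal{Y}}$ its prediction space and $\ell : \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbf{R}$ a generic loss function. The prediction at time $t$ of a finite subtree $\mathcal{T}$ of $\Pi$ associated to some features vector $x \in [0,1]^d$ is defined by

$$\hat y_{\mathcal{T},t}(x) = \hat y_{v_{\mathcal{T}}(x),\, t},$$

where $v_{\mathcal{T}}(x)$ is the leaf of $\mathcal{T}$ that contains $x$. We define also the cumulative loss of $\mathcal{T}$ at time $t$ as

$$L_t(\mathcal{T}) = \sum_{s=1}^{t} \ell\big( \hat y_{\mathcal{T},s}(x_s), y_s \big).$$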

Before defining the prediction function of AMF, let us first make explicit the prediction function and the loss considered in two specific cases of interest: regression and classification.

Example 1 (Regression).

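In regression, we use empirical mean forecasters

$$\hat y_{v,t} = \frac{1}{n_{v,t}} \sum_{1 \leq s \leq t-1 \,:\, x_s \in C_v} y_s,$$

where $n_{v,t} = |\{ 1 \leq s \leq t-1 : x_s \in C_v \}|$, and we simply put $\hat y_{v,t} = 0$ if $C_v$ is empty (namely, contains no data point). The loss is the quadratic loss $\ell(\hat y, y) = (\hat y - y)^2$ for any $y \in \mathcal{Y}$ and $\hat y \in \hat{\mathcal{Y}}$, where $\hat{\mathcal{Y}} = \mathcal{Y} = \mathbf{R}$.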

Example 2 (Classification).

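For multi-class classification, we have labels $y_t \in \mathcal{Y}$, where $\mathcal{Y}$ is a finite set of label modalities (such as $\mathcal{Y} = \{1, \dots, K\}$), and predictions are in $\hat{\mathcal{Y}} = \mathcal{P}(\mathcal{Y})$, the set of probability distributions on $\mathcal{Y}$. We use the Krichevsky-Trofimov (KT) forecaster (see Tjalkens et al., 1993) in each node $v$, which predicts

$$\hat y_{v,t}(y) = \frac{n_{v,t}(y) + 1/2}{\sum_{y' \in \mathcal{Y}} n_{v,t}(y') + |\mathcal{Y}|/2}$$

for any $y \in \mathcal{Y}$, where $n_{v,t}(y) = |\{ 1 \leq s \leq t-1 : x_s \in C_v, \, y_s = y \}|$. For an empty $C_v$ we use the uniform distribution on $\mathcal{Y}$. We consider the logarithmic loss (also called cross-entropy or self-information loss) $\ell(\hat y, y) = -\log \hat y(y)$, where $\hat y(y) \in (0, 1]$.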

Remark 1.

The Krichevsky-Trofimov forecaster coincides with the exponential weights algorithm under the logarithmic loss (with $\eta = 1$) on $\mathcal{P}(\mathcal{Y})$ with a prior equal to the Dirichlet distribution $\mathrm{Dir}(1/2, \dots, 1/2)$, namely the Jeffreys prior on the multinomial model $\mathcal{P}(\mathcal{Y})$.
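Both node forecasters of Examples 1 and 2 can be maintained online with constant memory per node. The sketch below is an illustration only: the class names `MeanForecaster` and `KTForecaster` are hypothetical, and the KT denominator uses the number of labels seen so far in the cell, which is the standard form of the Krichevsky-Trofimov estimator.

```python
from collections import Counter


class MeanForecaster:
    """Running empirical mean of the labels seen in a cell (0 for an empty cell)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def update(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # incremental mean, no need to store past labels


class KTForecaster:
    """Krichevsky-Trofimov estimate of the label distribution in a cell."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.counts = Counter()
        self.n = 0

    def predict(self):
        k = len(self.labels)
        # Add 1/2 to every count; an empty cell therefore yields the uniform distribution.
        return {y: (self.counts[y] + 0.5) / (self.n + k / 2) for y in self.labels}

    def update(self, y):
        self.counts[y] += 1
        self.n += 1
```

For instance, with three labels and two observations of the same label, the KT forecaster assigns probability (2 + 1/2) / (2 + 3/2) = 5/7 to that label and 1/7 to each of the other two.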

Definition 3.

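Let $t \geq 1$ and $x \in [0,1]^d$. The prediction function of AMF at step $t$ is given by

$$\hat f_t(x) = \frac{\sum_{\mathcal{T}} \pi(\mathcal{T}) \, e^{-\eta L_{t-1}(\mathcal{T})} \, \hat y_{\mathcal{T},t}(x)}{\sum_{\mathcal{T}} \pi(\mathcal{T}) \, e^{-\eta L_{t-1}(\mathcal{T})}},$$

where the sum is over all subtrees $\mathcal{T}$ of $\Pi$ and where the prior $\pi$ on subtrees is the probability distribution defined by

$$\pi(\mathcal{T}) = 2^{-|\mathcal{T}|}, \tag{3}$$

where $|\mathcal{T}|$ is the number of nodes in $\mathcal{T}$ and $\eta > 0$ is a parameter called learning rate.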

Note that $\pi$ is the distribution of the branching process with branching probability $1/2$ at each node of $\Pi$, with exactly two children when it branches; this branching process gives finite subtrees almost surely. The learning rate $\eta$ can be optimally tuned following theoretical guarantees from Section 3, see in particular Corollaries 1 and 2. This aggregation procedure is a non-greedy way to prune trees: the weights do not depend only on the quality of one single split but rather on the performance of each subsequent split.

Let us stress that computing $\hat f_t$ from Definition 3 seems computationally infeasible in practice, since it involves a sum over all subtrees $\mathcal{T}$ of $\Pi$. Besides, it requires keeping in memory one weight $e^{-\eta L_{t-1}(\mathcal{T})}$ for each subtree $\mathcal{T}$, which seems prohibitive as well. Indeed, the number of subtrees of the minimal tree that separates $n$ points is exponential in the number of nodes, and hence exponential in $n$. However, the proper choice of the prior $\pi$ in Equation (3) allows us to prove that $\hat f_t$ can actually be computed very efficiently, at almost no memory cost, as stated in Proposition 1 below, where we prove that the AMF algorithm described in Section 2.3 below computes it exactly and efficiently.

Proposition 1.

Let $t \geq 1$ and $x \in [0,1]^d$. The value $\hat f_t(x)$ from Definition 3 can be computed exactly via the $\mathrm{AmfPredict}(x)$ procedure (see Algorithms 2 and 3 from Section 2.3 below).

The proof of Proposition 1 is given in Section 7. It proves that aggregating the predictions of all subtrees weighted by the prior $\pi$ can be done exactly via Algorithm 3. This prior choice makes it possible to bypass the need to maintain one weight per subtree, and leads to a "collapsed" implementation that only requires maintaining one weight per node (the number of nodes being exponentially smaller than the number of subtrees). Note that this algorithm is exact, in the sense that it does not require any approximation scheme. Moreover, this online algorithm corresponds to its batch counterpart, in the sense that there is no loss of information coming from the online (or streaming) setting versus the batch setting (where the whole dataset is available at once).

The proof of Proposition 1 relies on some standard identities that make it possible to efficiently compute sums of products over tree structures in a recursive fashion (from Helmbold and Schapire, 1997), recalled in Lemma 3 from Section 7. Such identities are at the core of the Context Tree Weighting algorithm (CTW), which our online algorithm implements (albeit over an evolving tree structure, as explained in Section 2.3 below), and which consists of an efficient way to perform Bayesian mixtures of contextual tree models under a branching process prior. The CTW algorithm, based on a sum-product factorization, is a state-of-the-art algorithm used in lossless coding and compression. We use a variant of the Tree Expert algorithm (Helmbold and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006), which is closely linked to CTW (Willems et al., 1995; Willems, 1998; Catoni, 2004).

2.3 AMF: a forest of aggregated Mondrian trees

In an online setting, the number of sample points increases over time, allowing one to capture more details on the distribution of $y$ conditionally on $x$. This means that the complexity of our models (in this context, the complexity of the decision trees) should increase over time. We will therefore need to consider not just an individual, fixed tree partition $\Pi$, but a sequence $(\Pi_t)_{t \geq 1}$, indexed by "time" $t$ corresponding to the number of samples available. Furthermore, AMF uses the aggregated prediction function given in Definition 3 (independently within each tree from the forest, see Equation (1)). When a new sample point $(x_{t+1}, y_{t+1})$ becomes available, the algorithm does two things, in the following order:

  • Partition update. Using $x_{t+1}$, update the decision tree structure from $\Pi_t$ to $\Pi_{t+1}$, i.e. sample new splits in order to ensure that each leaf in the tree contains at most one point among $x_1, \dots, x_{t+1}$. This update uses the recursive properties of Mondrian partitions;

  • Prediction function update. Using $x_{t+1}$ and $y_{t+1}$, update the node predictions $\hat y_v$ and the weights $w_v$ and $\bar w_v$ that are necessary for the computation of the prediction function $\hat f$ from Definition 3. These updates are local and are performed only along the path of nodes leading to the leaf containing $x_{t+1}$. This update is efficient and enables the computation of $\hat f$ from Definition 3, which aggregates the decision functions of all the prunings of the tree, thanks to a variant of CTW.

Both updates can be implemented on the fly in a purely sequential manner. Training over a sequence $(x_1, y_1), \dots, (x_t, y_t)$ means using each sample once for training, and both updates are exact and do not rely on an approximate sampling scheme. Both steps are precisely described in Algorithm 2 below and illustrated in Figure 2. Also, in order to ease the reading of this technical part of the paper, we gather in Table 1 notations that are used in this section.

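Notation or formula                                Description
$v$                                                A node
$\mathcal{T}$                                      A tree
$v0$ (resp. $v1$)                                  The left (resp. right) child of $v$
$\mathcal{T}_v$                                    A subtree rooted at $v$
$\mathrm{leaves}(\mathcal{T})$                     The set of leaves of $\mathcal{T}$
$\mathrm{intnodes}(\mathcal{T})$                   The set of the interior nodes of $\mathcal{T}$
$(C_v)_{v \in \mathrm{leaves}(\mathcal{T})}$       The cells of the partition defined by $\mathcal{T}$
$\hat y_{v,t}$                                     Prediction of a node $v$ at time $t$
$L_{v,t}$                                          Cumulative loss of the node $v$ at time $t$
$w_{v,t}$                                          Weight stored in node $v$ at time $t$
$\bar w_{v,t}$                                     Average weight stored in node $v$ at time $t$
Table 1: Notations and definitions used in AMF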

Partition update.

Before seeing the point $x_{t+1}$, the algorithm maintains a partition $\Pi_t$, which corresponds to the minimal subtree of the infinite Mondrian partition $\Pi$ that separates all distinct sample points in $\{x_1, \dots, x_t\}$. This corresponds to the tree obtained from the infinite tree $\Pi$ by removing all splits of "empty" cells (that do not contain any point among $x_1, \dots, x_t$). As $x_{t+1}$ becomes available, this tree is updated as follows (this corresponds to Lines 2–11 in Algorithm 2 below):

  • find the leaf in $\Pi_t$ that contains $x_{t+1}$; it contains at most one point among $x_1, \dots, x_t$;

  • if the leaf contains no point among $x_1, \dots, x_t$, then let $\Pi_{t+1} = \Pi_t$. Otherwise, let $x_s$ be the unique point among $x_1, \dots, x_t$ (distinct from $x_{t+1}$) in this cell. Splits of the cell containing $\{x_s, x_{t+1}\}$ are successively sampled (following the recursive definition of the Mondrian distribution), until a split separates $x_s$ and $x_{t+1}$.

Prediction function update.

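The algorithm maintains weights $w_{v,t}$ and $\bar w_{v,t}$ and predictions $\hat y_{v,t}$ in order to compute the aggregation over the tree structure (Lines 12–18 in Algorithm 2). Namely, after round $t$ (after seeing sample $(x_t, y_t)$), each node $v$ has the following quantities in memory:

  • the weight $w_{v,t} = \exp(-\eta L_{v,t})$, where $L_{v,t} = \sum_{1 \leq s \leq t \,:\, x_s \in C_v} \ell(\hat y_{v,s}, y_s)$;

  • the averaged weight $\bar w_{v,t} = \sum_{\mathcal{T}_v} 2^{-|\mathcal{T}_v|} \prod_{v' \in \mathrm{leaves}(\mathcal{T}_v)} w_{v',t}$, where the sum ranges over all subtrees $\mathcal{T}_v$ rooted at $v$;

  • the forecast $\hat y_{v,t+1}$ in node $v$ at time $t+1$.

Now, given a new sample point $(x_{t+1}, y_{t+1})$, the update is performed as follows: we find the leaf $v(x_{t+1})$ containing $x_{t+1}$ in $\Pi_{t+1}$ (the partition has been updated with $x_{t+1}$ already, since the partition update is performed before the prediction function update). Then, we update the values of $w_{v,t}, \bar w_{v,t}, \hat y_{v,t+1}$ for each $v$ along an upwards recursion from $v(x_{t+1})$ to the root, while the values of nodes outside of the path are kept unchanged:

  • $w_{v,t+1} = w_{v,t} \exp\big( -\eta \, \ell(\hat y_{v,t+1}, y_{t+1}) \big)$;

  • if $v = v(x_{t+1})$ then $\bar w_{v,t+1} = w_{v,t+1}$, otherwise $\bar w_{v,t+1} = \frac{1}{2} w_{v,t+1} + \frac{1}{2} \bar w_{v0,t+1} \, \bar w_{v1,t+1}$;

  • $\hat y_{v,t+2} = h\big( (y_s)_{1 \leq s \leq t+1 \,:\, x_s \in C_v} \big)$ using the prediction algorithm $h$, see Definition 2. Note that the prediction algorithms given in Examples 1 and 2 can be updated online using only $y_{t+1}$ and do not require looking back at the sequence $y_1, \dots, y_t$.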

The partition update and prediction function update correspond to the procedure described in Algorithm 2 below.

1:  Input: a new sample
2:  Let be the leaf such that and put
3:  while  contains some  do
4:     Use Lines 1–5 from Algorithm 1 to split and obtain children cells and
5:     if  for some  then
6:         Put , and ( is the default initial prediction described in Examples 1 and 2)
7:     else
8:         Let be such that and . Put and and
9:     end if
10:  end while
11:  Put (memorize the fact that contains )
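12:  Let $v = v(x)$
13:  while $v \neq \varnothing$ do
14:     Set $w_v = w_v \exp(-\eta \, \ell(\hat y_v, y))$
15:     Set $\bar w_v = w_v$ if $v$ is a leaf and $\bar w_v = \frac{1}{2} w_v + \frac{1}{2} \bar w_{v0} \bar w_{v1}$ otherwise
16:     Update $\hat y_v$ using $y$ (following Definition 2)
17:     If $v \neq \epsilon$ let $v = \mathrm{parent}(v)$, otherwise let $v = \varnothing$
18:  end while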
Algorithm 2 update AMF with a new sample

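Training AMF over a sequence $(x_1, y_1), \dots, (x_n, y_n)$ means using successive calls to $\mathrm{AmfUpdate}(x_1, y_1), \dots, \mathrm{AmfUpdate}(x_n, y_n)$. The procedure $\mathrm{AmfUpdate}(x, y)$ maintains in memory the current state of the Mondrian partition $\Pi = (\mathcal{T}, \Sigma)$. The tree $\mathcal{T}$ contains the parent and children relations between all nodes $v \in \mathcal{T}$, while each $\sigma_v \in \Sigma$ can contain

$$\sigma_v = (j_v, s_v, \hat y_v, w_v, \bar w_v, x_v),$$

namely the split coordinate $j_v$ and split threshold $s_v$ (only if $v \in \mathrm{intnodes}(\mathcal{T})$), the prediction function $\hat y_v$, aggregation weights $w_v, \bar w_v$ and a vector $x_v$ if $v \in \mathrm{leaves}(\mathcal{T})$. An illustration of Algorithm 2 is given in Figure 2 below.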

Remark 2.

The complexity of $\mathrm{AmfUpdate}(x, y)$ is twice the depth of the tree at the moment it is called, since it requires following a downwards path to a leaf and then going back upwards to the root. As explained in Proposition 2 from Section 4 below, the depth of the Mondrian tree used in AMF is $O(\log t)$ in expectation at step $t$ of training, which leads to a complexity $O(c \log t)$ for both Algorithms 2 and 3, where $c$ corresponds to the update complexity of a single node, while the original MF algorithm uses an update with complexity that is linear in the number of leaves in the tree (which is typically exponentially larger).

[Figure 2 panels: "Tree before seeing the new point" (left), "Updated tree" (right), "Updates along the path of the new point".]

Figure 2: Illustration of the procedure from Algorithm 2: update of the partition, weights and node predictions as a new data point becomes available. Left: tree partition before seeing the new point. Right: update of the partition (in red) and new splits to separate the new point from the point already contained in its leaf. Empty circles denote empty leaves, while leaves containing a point are indicated by a filled circle. The path of the new point in the tree is indicated in bold. The updates of weights and predictions along the path are indicated, and are computed in an upwards recursion.


At any point in time, one can ask AMF to perform prediction for an arbitrary features vector $x$. Let us assume that AMF has already performed $t$ training steps on the $M$ trees it contains, and let us recall that the prediction produced by AMF is the average of their predictions, see Equation (1), where the prediction of each decision tree is computed in parallel following Definition 3.

The prediction of a decision tree is performed through a call to the procedure $\mathrm{AmfPredict}(x)$ described in Algorithm 3 below. First, we perform a temporary partition update of $\Pi_t$ using $x$, following Lines 2–10 of Algorithm 2, so that we find or create a new leaf node $v(x)$ such that $x \in C_{v(x)}$. Let us stress that this update of $\Pi_t$ using $x$ is discarded once the prediction for $x$ is produced, so that the decision function of AMF does not change after producing predictions. The prediction is then computed recursively, along an upwards recursion going from $v(x)$ to the root $\epsilon$, in the following way:

  • if $v = v(x)$ we set $\tilde y_v(x) = \hat y_{v,t+1}$;

  • if $v \neq v(x)$ (it is an interior node such that $x \in C_v$), then assuming that $va$ ($a \in \{0,1\}$) is the child of $v$ such that $x \in C_{va}$, we set

$$\tilde y_v(x) = \frac{1}{2} \frac{w_{v,t}}{\bar w_{v,t}} \, \hat y_{v,t+1} + \Big( 1 - \frac{1}{2} \frac{w_{v,t}}{\bar w_{v,t}} \Big) \tilde y_{va}(x).$$

The prediction of the tree is given by $\hat f_{t+1}(x) = \tilde y_\epsilon(x)$, which is the last value obtained in this recursion. Let us recall that this computes the aggregation with exponential weights of all the decision functions produced by all the prunings of the current Mondrian tree, as described in Definition 3 and stated in Proposition 1 above. The prediction procedure is summarized in Algorithm 3 below.

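1:  Input: a features vector $x \in [0,1]^d$
2:  Follow Lines 2–10 of Algorithm 2 to do a temporary update of the current partition $\Pi$ using $x$ and let $v(x)$ be the leaf such that $x \in C_{v(x)}$
3:  Set $v = v(x)$ and $\tilde y_v = \hat y_v$
4:  while $v \neq \epsilon$ do
5:     Let $(v, va) = (\mathrm{parent}(v), v)$ (for some $a \in \{0,1\}$)
6:     Let $\tilde y_v = \frac{1}{2} \frac{w_v}{\bar w_v} \hat y_v + \big( 1 - \frac{1}{2} \frac{w_v}{\bar w_v} \big) \tilde y_{va}$
7:  end while
8:  Return $\tilde y_\epsilon$
Algorithm 3 $\mathrm{AmfPredict}(x)$: predict the label of $x$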
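The following sketch checks numerically, on a small fixed tree, that the upwards recursion agrees with a brute-force aggregation over all prunings. It is only an illustration under stated assumptions: the tree, the node weights and the scalar node forecasts are made-up values, all function names are hypothetical, and the brute force uses the prior induced on a finite tree by the convention that the averaged weight of a leaf equals its weight (a factor 1/2 for each node of the pruning that is an interior node of the full tree).

```python
import math

# Full tree: root '' with children '0' and '1'; node '0' has children '00' and '01'.
children = {"": ("0", "1"), "0": ("00", "01")}
w = {"": 0.5, "0": 0.8, "1": 0.9, "00": 0.7, "01": 0.6}        # node weights (made up)
y_hat = {"": 0.1, "0": 0.4, "1": 0.9, "00": 0.2, "01": 0.7}    # node forecasts (made up)


def average_weights(w, children):
    """Bottom-up computation of the averaged weights."""
    wbar = {}
    for v in sorted(w, key=len, reverse=True):  # deepest nodes first
        if v in children:
            left, right = children[v]
            wbar[v] = 0.5 * w[v] + 0.5 * wbar[left] * wbar[right]
        else:
            wbar[v] = w[v]
    return wbar


def predict(path, w, wbar, y_hat):
    """Upwards recursion along `path`, the nodes from the root to the leaf containing x."""
    y = y_hat[path[-1]]
    for v in reversed(path[:-1]):
        a = 0.5 * w[v] / wbar[v]
        y = a * y_hat[v] + (1.0 - a) * y
    return y


def prunings(v, children):
    """Yield the list of leaves of every pruning of the subtree rooted at v."""
    yield [v]
    if v in children:
        left, right = children[v]
        for leaves_left in prunings(left, children):
            for leaves_right in prunings(right, children):
                yield leaves_left + leaves_right


def brute_force(path, w, y_hat, children):
    """Explicit weighted average of the predictions of all prunings."""
    num = den = 0.0
    for leaves in prunings("", children):
        nodes = {leaf[:k] for leaf in leaves for k in range(len(leaf) + 1)}
        prior = 0.5 ** sum(1 for u in nodes if u in children)
        weight = prior * math.prod(w[u] for u in leaves)
        leaf = next(u for u in leaves if u in path)  # leaf of the pruning containing x
        num += weight * y_hat[leaf]
        den += weight
    return num / den


path = ["", "0", "01"]
wbar = average_weights(w, children)
recursive_value = predict(path, w, wbar, y_hat)
explicit_value = brute_force(path, w, y_hat, children)
```

The two values coincide up to floating-point error, while the recursion visits only the nodes of one path instead of enumerating every pruning.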

The next Section 3 provides theoretical guarantees for AMF, but before that, let us provide the following numerical illustration on three toy datasets for binary classification. The aim of this illustration is to exhibit the effect of aggregation in AMF, compared to the same method with no aggregation, the original Mondrian Forest algorithm, batch Random Forest and Extra Trees (see Section 5 for a precise description of the implementations used). We observe that AMF with aggregation (AMF(agg)) produces a very smooth decision function in all cases, which generalizes better on this instance (AUCs displayed on the bottom right-hand side of each plot are computed on a test dataset) than all other methods. All the other algorithms display rather non-smooth decision functions, which suggests that the underlying probability estimates are not well-calibrated.

Figure 3: Decision functions of AMF, Mondrian Forest (MF), Breiman’s batch random forest (RF) and batch Extra Trees (ET), on several toy datasets for binary classification (input data ). We observe that AMF, thanks to aggregation, leads to a smooth decision function, hence with a better generalization property (AUC on the test sets, displayed bottom right of each plot, is slightly better in all cases). Let us stress that both AMF and MF do a single pass on the data, while RF and ET require many passes. All algorithms use a forest containing 10 trees.

3 Theoretical guarantees

In addition to being efficiently implementable in a streaming fashion, AMF is amenable to a thorough end-to-end theoretical analysis. This relies on two main ingredients: a precise control of the geometric properties of the Mondrian partitions and a regret analysis of the aggregation procedure (exponentially weighted aggregation of all finite prunings of the infinite Mondrian) which in turn yields excess risk bounds and adaptive minimax rates. The guarantees provided below hold for a single tree in the forest, but also hold for the average of several trees (used by the forest) by convexity of the loss (see Examples 1 and 2).

3.1 Regret bounds

For now, the sequence $(x_1, y_1), \dots, (x_n, y_n)$ is arbitrary, and is in particular not required to be i.i.d. Let us recall that at step $t$, we have a realization $\Pi_t$ of a finite Mondrian tree, which is the minimal subtree of the infinite Mondrian partition $\Pi$ that separates all distinct sample points in $\{x_1, \dots, x_t\}$. Let us recall also that $\hat y_{\mathcal{T},t}$ are the tree forecasters from Definition 2, where $\mathcal{T}$ is some subtree of $\Pi$. We need the following definition.

Definition 4.

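Let $\eta > 0$. A loss function $\ell : \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbf{R}$ is said to be $\eta$-exp-concave if the function $\exp(-\eta \, \ell(\cdot, y)) : \hat{\mathcal{Y}} \to \mathbf{R}_+$ is concave for each $y \in \mathcal{Y}$.

The following loss functions are $\eta$-exp-concave:

  • The logarithmic loss $\ell(\hat y, y) = -\log \hat y(y)$, with $\mathcal{Y}$ a finite set and $\hat{\mathcal{Y}} = \mathcal{P}(\mathcal{Y})$, with $\eta = 1$ (see Example 2 above);

  • The quadratic loss $\ell(\hat y, y) = (\hat y - y)^2$ on $[-B, B]$, with $\eta = 1/(8B^2)$.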
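Both claims can be checked by a short calculation, sketched below. The symbol $g$ is introduced here only for this verification, and $B > 0$ denotes a bound on predictions and labels, both assumed to lie in $[-B, B]$ for the quadratic loss.

```latex
% Logarithmic loss: with \eta = 1,
%   \exp(-\ell(\hat y, y)) = \hat y(y),
% which is linear (hence concave) in the distribution \hat y.
%
% Quadratic loss: fix y and let g(\hat y) = \exp(-\eta (\hat y - y)^2). Then
\begin{align*}
g'(\hat y)  &= -2\eta\,(\hat y - y)\, g(\hat y), \\
g''(\hat y) &= 2\eta \,\bigl( 2\eta\,(\hat y - y)^2 - 1 \bigr)\, g(\hat y).
\end{align*}
% Hence g'' \le 0 if and only if (\hat y - y)^2 \le 1/(2\eta).
% For \hat y, y \in [-B, B] we have (\hat y - y)^2 \le 4B^2, so concavity holds
% on the whole interval as soon as 4B^2 \le 1/(2\eta), that is, \eta \le 1/(8B^2).
```

The value $1/(8B^2)$ is thus the largest learning rate for which the quadratic loss is exp-concave over the whole interval.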

We start with Lemma 1, which states that the prediction function used in AMF (see Definition 3) satisfies a regret bound where the regret is computed with respect to any pruning of $\Pi$.

Lemma 1.

Consider an $\eta$-exp-concave loss function $\ell$. Fix a realization $\Pi \sim \mathrm{MP}$ and let $\mathcal{T}$ be a finite subtree of $\Pi$. For every sequence $(x_1, y_1), \dots, (x_n, y_n)$, the prediction functions $\hat f_1, \dots, \hat f_n$ based on $\Pi$ and computed by AMF satisfy

$$\sum_{t=1}^{n} \ell\big( \hat f_t(x_t), y_t \big) - \sum_{t=1}^{n} \ell\big( \hat y_{\mathcal{T},t}(x_t), y_t \big) \leq \frac{1}{\eta} \, |\mathcal{T}| \log 2,$$

where we recall that $|\mathcal{T}|$ is the number of nodes in $\mathcal{T}$.

Lemma 1 is a direct consequence of a standard regret bound for the exponential weights algorithm (see Lemma 4 from Section 7), together with the fact that the Context Tree Weighting algorithm performed in Algorithms 2 and 3 computes it exactly, as stated in Proposition 1. By combining Lemma 1 with regret bounds for the online algorithms used in each node, both for the logarithmic loss (see Example 2) and the quadratic loss (see Example 1), we obtain the following regret bounds with respect to any pruning of $\Pi$.

Corollary 1 (Classification).

Fix as in Lemma 1 and consider the classification setting described in Example 2 above. For any finite subtree of and every sequence , the prediction functions based on computed by AMF with satisfy


for any function which is constant on the leaves of .

Corollary 2 (Regression).

Fix as in Lemma 1 and consider the regression setting described in Example 1 above with . For every finite subtree of and every sequence , the prediction functions based on computed by AMF with satisfy


for any function