1 Introduction
Introduced by Breiman (2001), Random Forests (RF) are among the algorithms of choice in many supervised learning applications. The appeal of these methods comes from their remarkable accuracy in a variety of tasks, the small number (or even the absence) of parameters to tune, their reasonable computational cost at training and prediction time, and their suitability for high-dimensional settings.
Most commonly used RF algorithms, such as the original random forest procedure (Breiman, 2001), Extra-Trees (Geurts et al., 2006), or conditional inference forests (Hothorn et al., 2010), are batch algorithms, which require the whole dataset to be available at once. Several online random forest variants have been proposed to overcome this limitation and handle data that arrive sequentially. Utgoff (1989) was the first to extend Quinlan's ID3 batch decision tree algorithm (see Quinlan, 1986) to an online setting. Later on, Domingos and Hulten (2000) introduced Hoeffding Trees, which can be easily updated: since observations arrive sequentially, a cell is split when (a) enough observations have fallen into it and (b) the best split in the cell is statistically relevant (a generic Hoeffding inequality being used to assess the quality of the best split). Since random forests are known to exhibit better empirical performance than individual decision trees, online random forests have been proposed (see, e.g., Saffari et al., 2009; Denil et al., 2013). These procedures aggregate several trees by computing the mean of the tree predictions (regression setting) or the majority vote among trees (classification setting). The tree construction differs from one forest to another but shares similarities with Hoeffding Trees: a cell is split only if conditions (a) and (b) above are satisfied.
One forest of particular interest for this paper is the Mondrian Forest (Lakshminarayanan et al., 2014), based on the Mondrian process (Roy and Teh, 2009). Its construction differs from the constructions described above, since each new observation modifies the tree structure: instead of waiting for enough observations to fall into a cell before splitting it, the properties of the Mondrian process allow the Mondrian tree partition to be updated each time a sample is collected. Once a Mondrian tree is built, its prediction function uses a hierarchical prior over all subtrees, and the average of the subtree predictions with respect to this hierarchical prior is computed using an approximation algorithm.
The algorithm we propose, called AMF, and illustrated in Figure 1 below on a toy binary classification dataset, differs from the Mondrian Forest by the smoothing procedure used on each tree. While the hierarchical Bayesian smoothing proposed in Lakshminarayanan et al. (2014) requires approximations, the prior we choose allows for exact computation of the posterior distribution. The choice of this posterior is inspired by Context Tree Weighting (see, e.g., Willems et al., 1995; Willems, 1998; Helmbold and Schapire, 1997; Catoni, 2004), commonly used in lossless compression to aggregate all subtrees of a pre-specified tree, which is both computationally efficient and theoretically sound.
Since we are able to compute the posterior distribution exactly, our approach differs markedly from Bayesian trees (see, for instance, Chipman et al., 1998; Denison et al., 1998; Taddy et al., 2011) and from BART (Chipman et al., 2010), which implement MCMC methods to approximate posterior distributions over trees. The Context Tree Weighting algorithm has been applied to regression trees by Blanchard (1999) in the case of a fixed-design tree, in which the splits are pre-specified. This requires splitting the dataset into two parts (using the first part to select the best splits and the second to compute the posterior distribution) and having access to the whole dataset, since the tree structure needs to be fixed in advance.
As noted by Rockova and van der Pas (2017), the theoretical study of Bayesian methods on trees (Chipman et al., 1998; Denison et al., 1998) or sums of trees (Chipman et al., 2010) is less developed. Rockova and van der Pas (2017) analyze a variant of Bayesian regression trees and sums of trees, and obtain near-minimax optimal posterior concentration rates. Likewise, Linero and Yang (2018) analyze Bayesian sums of soft decision tree models, and establish minimax rates of posterior concentration for the resulting SBART procedure. While these frameworks differ from ours (their results are posterior concentration rates, as opposed to our regret bounds and excess risk bounds, and their design is fixed), their approach differs from ours primarily in the chosen trade-off between computational complexity and adaptivity of the method: these procedures involve approximate posterior sampling over large functional spaces through MCMC methods, and it is unclear whether the considered priors allow for reasonably efficient posterior computations. In particular, the prior used in Rockova and van der Pas (2017) is taken over all subsets of variables, which is exponentially large in the number of features.
The literature focusing on the original RF algorithm or its related variants is more extensive, even if the data-dependent nature of the algorithm and its numerous components (sampling procedure, split selection, aggregation) make the theoretical analysis difficult. The consistency of stylized RF algorithms was first established by Biau et al. (2008), and later obtained for more sophisticated variants in Denil et al. (2013); Scornet et al. (2015). Note that consistency results do not provide rates of convergence, and hence only offer limited guidance on how to properly tune the parameters of the algorithm. Starting with Biau (2012); Genuer (2012), some recent work has thus sought to quantify the speed of convergence of stylized variants of RF. Minimax optimal nonparametric rates were first obtained by Arlot and Genuer (2014) in dimension one for the Purely Uniformly Random Forests (PURF) algorithm, in conjunction with suboptimal rates in arbitrary dimension (when the number of features exceeds one).
Several recent works (Wager and Walther, 2015; Duroux and Scornet, 2016) also established rates of convergence for variants of RF that essentially amount to some form of Median Forests, where each node contains at least a fixed fraction of observations of its parent. While valid in arbitrary dimension, the established rates are suboptimal. More recently, adaptive minimax optimal rates were obtained by Mourtada et al. (2018) in arbitrary dimension for the batch Mondrian Forests algorithm. Our proposed online algorithm, AMF, also achieves minimax rates in an adaptive fashion, namely without knowing the smoothness of the regression function.
In this paper, we introduce AMF, a random forest algorithm which is fully online and computationally exact: unlike Bayesian trees and sum-of-trees procedures relying on approximate posterior sampling, we are able to compute the prediction function of AMF exactly and very efficiently. Section 2 introduces the setting considered and general notations, and provides a precise construction of the AMF algorithm. A theoretical analysis of AMF is given in Section 3, where we establish regret bounds for AMF together with a minimax adaptive upper bound. Section 4 introduces a modification of AMF which is used in all the numerical experiments of the paper, together with a guarantee and a discussion of its computational complexity. Numerical experiments are provided in Section 5, on a large number of datasets, including a comparison of AMF with several strong baselines. Our conclusions are provided in Section 6. The proofs of all the results are gathered in Section 7.
2 Towards a forest of aggregated Mondrian trees
We define in Section 2.1 the setting and notations that will be used throughout the paper, together with the definition of the Mondrian process, introduced by Roy and Teh (2009), which is a key element of our algorithm. In Section 2.2, we explicitly describe the prediction function that we want to compute, and prove in Proposition 1 that the AMF algorithm described in Section 2.3 computes it exactly.
2.1 The setting, trees, forests and the Mondrian process
We are interested in an online supervised learning problem in which the dataset is not fixed in advance. In this scenario, we are given an i.i.d. sequence (X_1, Y_1), (X_2, Y_2), … of X × Y-valued random variables that arrive sequentially, each pair (X_t, Y_t) having the same distribution as a generic pair (X, Y). Our aim is to design an online algorithm that can be updated "on the fly" given new sample points: at each time step t, it produces a randomized prediction function f̂_t(·, Z) based on the dataset available at time t, where Z is a random variable that accounts for the randomization procedure and where predictions belong to a prediction space Ŷ, see Examples 1 and 2 below. In the rest of the paper, we leave the dependence on the dataset implicit.
We consider prediction rules that are random forests, defined as the averaging of a set of randomized decision trees. We let f̂_t(·, Z_1), …, f̂_t(·, Z_M) be randomized tree predictors at time t, associated with the same randomized mechanism, where the Z_m for m = 1, …, M are i.i.d. and correspond to random tree partitions, as described below. Setting Z = (Z_1, …, Z_M), the random forest estimate is then defined by

f̂_t(x, Z) := (1/M) Σ_{m=1}^{M} f̂_t(x, Z_m),   (1)

namely by taking the average over all tree predictions. The online training of each tree can be done in parallel, since the trees are fully independent of each other and each of them follows the exact same randomized construction. We therefore describe only the construction of a single tree (with its associated random partition and prediction function) and omit from now on the dependence on m.
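The averaging in Equation (1) can be sketched in a few lines (illustrative names of our own, not the paper's implementation):

```python
def forest_predict(tree_predictors, x):
    """Random forest estimate of Equation (1): the average of the
    predictions of M independent randomized trees (regression
    setting).  `tree_predictors` is a list of callables x -> float;
    this layout is illustrative, not the paper's implementation."""
    return sum(f(x) for f in tree_predictors) / len(tree_predictors)
```

In classification, the same averaging is applied componentwise to the predicted probability vectors.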
The random tree partitions are given by pairs (T, Σ), where T is a binary tree and Σ contains information about each node in T, such as splits, as explained below. Let us now introduce notations and definitions for these objects; for simplicity, we first assume that the time step t is fixed, and drop the dependence on t for a while.
Definition 1 (Tree partition).
Let C ⊆ R^d be a hyperrectangular box of the form C = ∏_{j=1}^{d} I_j, where each I_j is an interval (possibly open at an infinite extremity). A tree partition (or kd-tree, guillotine partition) of C is a pair (T, Σ), where

T is a finite ordered binary tree, represented as a finite subset of the set {0, 1}* of all finite words on the alphabet {0, 1}. The set {0, 1}* is endowed with a tree structure (and called the complete binary tree): the empty word ε is the root, and for any v ∈ {0, 1}*, the left (resp. right) child of v is v0 (resp. v1), obtained by appending a 0 (resp. a 1) at the end of v. We denote by int(T) the set of its interior nodes and by leaves(T) the set of its leaves, which are disjoint by definition.

Σ = (σ_v)_{v ∈ int(T)} is a family of splits at the interior nodes of T, where each split σ_v = (j_v, s_v) is characterized by its split dimension j_v ∈ {1, …, d} and its threshold s_v. In Section 2.3, we will actually store in Σ more information about the nodes.
One can associate to (T, Σ) a partition of C as follows. For each node v ∈ T, its cell C_v is a hyperrectangular region defined recursively: the cell associated to the root ε of T is C, and, for each v ∈ int(T) with split (j_v, s_v), we define

C_{v0} := {x ∈ C_v : x_{j_v} ≤ s_v}  and  C_{v1} := C_v \ C_{v0}.

Then, the leaf cells (C_v)_{v ∈ leaves(T)} form a partition of C by construction.
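The recursive definition of the cells yields a simple descent rule for locating the leaf cell that contains a given point. Below is a minimal sketch, assuming splits are stored in a dictionary keyed by node words (our own representation, not the paper's data structure):

```python
def leaf_of(x, splits, root=""):
    """Return the leaf (as a binary word over {0, 1}) whose cell
    contains the point x.

    `splits` maps each interior node to a pair (j, s): split dimension
    j and threshold s.  Following Definition 1, the 0-child keeps the
    points with x[j] <= s."""
    v = root
    while v in splits:          # interior node: descend one level
        j, s = splits[v]
        v = v + ("0" if x[j] <= s else "1")
    return v                    # v carries no split: it is a leaf
```

The while loop mirrors the recursion defining C_{v0} and C_{v1}: at each interior node, exactly one child cell contains x.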
Mondrian partitions are a specific family of random tree partitions whose construction is described below. An infinite Mondrian partition of C can be sampled from the infinite Mondrian process, denoted MP(C) from now on, using the procedure described below. If C = ∏_{j=1}^{d} [a_j, b_j] with intervals [a_j, b_j], we denote by |C| := Σ_{j=1}^{d} (b_j − a_j) its linear dimension. We denote by Exp(λ) the exponential distribution with intensity λ > 0, and by U([a, b]) the uniform distribution on a finite interval [a, b]. The call to MP(C) corresponds to a call starting at the root node ε, since C_ε = C and the birth time of ε is 0. This random partition is built by iteratively splitting cells at random times, which depend on the linear dimension of the input cell. The split coordinate j is chosen at random, with a probability of sampling j proportional to the side length b_j − a_j of the cell, and the split threshold is sampled uniformly in [a_j, b_j]. The number of recursions in this procedure is infinite: the Mondrian process is a distribution on infinite tree partitions of C, see Roy and Teh (2009) and Roy (2011) for a rigorous construction. The random partition described in Section 2.3 below is, however, not infinite, and depends on the feature vectors
seen until time t. The implementation of AMF used in all our experiments, described in Section 4 below, also considers finite partitions, through the concept of restricted Mondrian partitions, introduced in Lakshminarayanan et al. (2014). At this point, the birth times computed in Algorithm 1 are not used. They will allow us to define time prunings of a Mondrian partition in Section 3.1 below, a notion which is necessary to prove that AMF adapts to the optimal time pruning. Birth times are also necessary for the definition of restricted Mondrian partitions in Section 4, which is an important ingredient in the actual implementation of AMF.

2.2 Aggregation with exponential weights and prediction functions
The prediction function of AMF is an aggregation of the predictions given by all finite subtrees of the infinite Mondrian partition . This aggregation step is performed in a purely online fashion, using an aggregation algorithm based on exponential weights, with a branching process prior over the subtrees, see Definition 3 below. This weighting scheme gives more importance to subtrees with a good predictive performance.
Let us assume that the realization of an infinite Mondrian partition is available at some fixed step t. We will argue in Section 2.3 that it suffices to store a finite partition, and show how to update it. The definition of the prediction function used in AMF requires the notions of node and subtree prediction, defined below.
Definition 2.
Given t ≥ 1, we define a node prediction ŷ_{t,v} for each node v (which defines a cell C_v following Definition 1), where ŷ_{t,v} is produced by a prediction algorithm used in each cell, with Ŷ its prediction space and ℓ : Y × Ŷ → R a generic loss function. The prediction at time t of a finite subtree T at some feature vector x is defined by ŷ_t(T, x) := ŷ_{t, v_T(x)}, where v_T(x) is the leaf of T that contains x. We also define the cumulative loss of T at time t as the sum L_t(T) of the losses incurred by the successive predictions of T on the first t samples.
Before defining the prediction function of AMF, let us first make explicit the prediction function and the loss considered in two specific cases of interest: regression and classification.
Example 1 (Regression).
In regression, we use empirical mean forecasters

ŷ_{t,v} := (1 / N_{t,v}) Σ_{s ≤ t : X_s ∈ C_v} Y_s,

where N_{t,v} is the number of points among X_1, …, X_t that fall in C_v, and we simply put ŷ_{t,v} := 0 if C_v is empty (namely, contains no data point). The loss is the quadratic loss ℓ(y, ŷ) := (y − ŷ)² for any y, ŷ ∈ Ŷ, where Ŷ = Y is a bounded interval of R.
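A minimal sketch of the node forecaster of Example 1, with illustrative class and function names of our own:

```python
class MeanForecaster:
    """Running empirical mean in a cell (Example 1): predicts the
    average of the labels seen so far in the cell, and 0 when the
    cell is empty.  A sketch, not the paper's implementation."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def update(self, y):
        self.count += 1
        self.total += y


def quadratic_loss(y, y_hat):
    """Quadratic loss used in the regression setting."""
    return (y - y_hat) ** 2
```

The forecaster is updated in O(1) per sample, which matters for the per-node update cost discussed in Remark 2.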
Example 2 (Classification).
For multiclass classification, labels take values in a finite set Y of label modalities (such as {0, 1}) and predictions lie in Ŷ = P(Y), the set of probability distributions on Y. We use the Krichevsky-Trofimov (KT) forecaster (see Tjalkens et al., 1993) in each node v, which predicts

ŷ_{t,v}(c) := (N_{t,v}(c) + 1/2) / (N_{t,v} + |Y|/2),   (2)

for any c ∈ Y, where N_{t,v}(c) is the number of samples with label c falling in C_v up to time t and N_{t,v} := Σ_{c ∈ Y} N_{t,v}(c). For an empty cell C_v we use the uniform distribution on Y. We consider the logarithmic loss (also called cross-entropy or self-information loss) ℓ(y, ŷ) := −log ŷ(y), where ŷ ∈ P(Y).
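A sketch of the KT forecaster and the logarithmic loss of Example 2, assuming the per-class rule (N(c) + 1/2) / (N + K/2) with K classes (names are ours):

```python
import math

def kt_predict(counts):
    """Krichevsky-Trofimov probabilities in a node from per-class
    counts: (n_c + 1/2) / (n + K/2), with K classes and n points in
    the cell.  With no data, this reduces to the uniform distribution
    on the K classes."""
    K = len(counts)
    n = sum(counts)
    return [(n_c + 0.5) / (n + K / 2) for n_c in counts]

def log_loss(probs, c):
    """Logarithmic (self-information) loss of the predicted
    distribution `probs` on the observed label c."""
    return -math.log(probs[c])
```

Note that the probabilities always sum to one, since the K additive corrections of 1/2 match the K/2 term in the denominator.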
Remark 1.
The Krichevsky-Trofimov forecaster coincides with the exponential weights algorithm under the logarithmic loss (with η = 1) on the set of constant probability forecasts, with a prior equal to the Dirichlet distribution Dir(1/2, …, 1/2), namely the Jeffreys prior on the multinomial model.
Definition 3.
Let η > 0 be a parameter called the learning rate. The prediction function of AMF at step t is given by

f̂_t(x) := ( Σ_T π(T) e^{−η L_t(T)} ŷ_t(T, x) ) / ( Σ_{T′} π(T′) e^{−η L_t(T′)} ),

where the sums range over all finite subtrees T of the infinite Mondrian partition, and where the prior π on subtrees is the probability distribution defined by

π(T) := 2^{−|T|},   (3)

where |T| is the number of nodes in T.
Note that π is the distribution of the branching process with branching probability 1/2 at each node of the complete binary tree, with exactly two children when it branches; this branching process produces finite subtrees almost surely. The learning rate η can be optimally tuned following the theoretical guarantees from Section 3, see in particular Corollaries 1 and 2. This aggregation procedure is a non-greedy way to prune trees: the weights do not depend on the quality of one single split only, but rather on the performance of each subsequent split.
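Sampling from this branching-process prior is straightforward; the sketch below (our own code, with a depth cap added for practicality) makes the distribution π(T) = 2^{−|T|} concrete:

```python
import random

def sample_subtree(rng, prefix="", max_depth=20):
    """Sample a subtree from the branching-process prior of
    Definition 3: each node, independently, branches into exactly two
    children with probability 1/2, so that pi(T) = 2^(-|T|) with |T|
    the number of nodes.  Returns the list of node words.  The
    `max_depth` cap is a practical truncation of ours (the process
    itself produces finite trees almost surely)."""
    nodes = [prefix]
    if len(prefix) < max_depth and rng.random() < 0.5:   # branch
        nodes += sample_subtree(rng, prefix + "0", max_depth)
        nodes += sample_subtree(rng, prefix + "1", max_depth)
    return nodes
```

Every sampled tree is full (each node has zero or two children), so the number of nodes is always odd.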
Let us stress that computing f̂_t from Definition 3 seems computationally infeasible in practice, since it involves a sum over all subtrees T. Besides, it seemingly requires keeping in memory one weight for each subtree T, which looks prohibitive as well: indeed, the number of subtrees of the minimal tree that separates the sample points is exponential in the number of nodes, and hence exponential in the number of points. However, the proper choice of the prior π in Equation (3) allows us to prove that f̂_t can actually be computed very efficiently, at almost no memory cost, as stated in Proposition 1 below, where we prove that the AMF algorithm described in Section 2.3 computes f̂_t exactly and efficiently.
Proposition 1.
The prediction function f̂_t from Definition 3 is computed exactly by the AMF procedure described in Section 2.3 (Algorithms 2 and 3).
The proof of Proposition 1 is given in Section 7. It shows that aggregating the predictions of all subtrees weighted by the prior π can be done exactly via Algorithm 3. This choice of prior makes it possible to bypass the need to maintain one weight per subtree, and leads to a "collapsed" implementation that only requires maintaining one weight per node (an exponentially smaller number). Note that this algorithm is exact, in the sense that it does not require any approximation scheme. Moreover, this online algorithm matches its batch counterpart, in the sense that there is no loss of information in the online (or streaming) setting compared with the batch setting (where the whole dataset is available at once).
The proof of Proposition 1 relies on standard identities that allow sums of products over tree structures to be computed efficiently in a recursive fashion (from Helmbold and Schapire, 1997), recalled in Lemma 3 from Section 7. Such identities are at the core of the Context Tree Weighting algorithm (CTW), which our online algorithm implements (albeit over an evolving tree structure, as explained in Section 2.3 below), and which consists of an efficient way to perform Bayesian mixtures of contextual tree models under a branching process prior. The CTW algorithm, based on a sum-product factorization, is a state-of-the-art algorithm used in lossless coding and compression. We use a variant of the Tree Expert algorithm (Helmbold and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006), which is closely linked to CTW (Willems et al., 1995; Willems, 1998; Catoni, 2004).
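The flavor of these sum-product identities can be checked directly on a toy example: on a complete binary tree of depth 2, the recursive computation of average weights coincides with an explicit (exponentially larger) sum over all prunings weighted by the branching-process prior. This is a sketch under our own conventions (node weights stored in a dict keyed by binary words), not the paper's implementation:

```python
def wbar(v, w, max_depth=2):
    """Average weight of node v via the CTW-style recursion:
    wbar(v) = w(v) at maximal depth, and otherwise
    wbar(v) = 0.5 * w(v) + 0.5 * wbar(v0) * wbar(v1)."""
    if len(v) == max_depth:
        return w[v]
    return 0.5 * w[v] + 0.5 * wbar(v + "0", w, max_depth) * wbar(v + "1", w, max_depth)

def prunings(v="", max_depth=2):
    """All prunings of the complete binary tree of depth `max_depth`
    rooted at v, each given by its list of leaves."""
    result = [[v]]  # v kept as a leaf
    if len(v) < max_depth:
        for left in prunings(v + "0", max_depth):
            for right in prunings(v + "1", max_depth):
                result.append(left + right)
    return result

def brute_force_wbar(w, max_depth=2):
    """Sum over prunings T of pi(T) * (product of leaf weights), the
    prior charging a factor 1/2 for every node of T strictly above the
    maximal depth (branch or stop, each with probability 1/2)."""
    total = 0.0
    for leaves in prunings("", max_depth):
        nodes = {leaf[:i] for leaf in leaves for i in range(len(leaf) + 1)}
        prior = 0.5 ** sum(1 for u in nodes if len(u) < max_depth)
        prod = 1.0
        for leaf in leaves:
            prod *= w[leaf]
        total += prior * prod
    return total
```

The recursion visits each node once, while the brute-force sum ranges over all prunings (5 of them already at depth 2), which illustrates the exponential saving.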
2.3 AMF: a forest of aggregated Mondrian trees
In an online setting, the number of sample points increases over time, allowing one to capture more details on the distribution of Y conditionally on X. This means that the complexity of our models (in this context, the complexity of the decision trees) should increase over time. We will therefore consider not just an individual, fixed tree partition, but a sequence of tree partitions indexed by the "time" t corresponding to the number of samples available. Furthermore, AMF uses the aggregated prediction function given in Definition 3 (independently within each tree of the forest, see Equation (1)). When a new sample point becomes available, the algorithm does two things, in the following order:

Partition update. Using X_t, update the decision tree structure from step t − 1 to step t, i.e., sample new splits in order to ensure that each leaf of the tree contains at most one point among X_1, …, X_t. This update uses the recursive properties of Mondrian partitions;

Prediction function update. Using X_t and Y_t, update the node prediction functions and the weights that are necessary for the computation of the prediction function from Definition 3. These updates are local, performed only along the path of nodes leading to the leaf containing X_t. This update is efficient and enables the computation of the prediction function from Definition 3, which aggregates the decision functions of all the prunings of the tree, thanks to a variant of CTW.
Both updates can be implemented on the fly in a purely sequential manner: training over a sequence means using each sample once, and both updates are exact and do not rely on an approximate sampling scheme. Both steps are precisely described in Algorithm 2 below and illustrated in Figure 2. To ease the reading of this technical part of the paper, we gather in Table 1 the notations used in this section.
Notation or formula  Description

v  A node
T  A tree
v0 (resp. v1)  The left (resp. right) child of v
T_v  A subtree rooted at v
leaves(T)  The set of leaves of T
int(T)  The set of interior nodes of T
(C_v)_{v ∈ leaves(T)}  The cells of the partition defined by T
ŷ_{t,v}  Prediction of node v at time t
L_{t,v}  Cumulative loss of node v at time t
w_{t,v}  Weight stored in node v at time t
w̄_{t,v}  Average weight stored in node v at time t
Partition update.
Before seeing the point X_t, the algorithm maintains a partition which corresponds to the minimal subtree of the infinite Mondrian partition that separates all distinct sample points among X_1, …, X_{t−1}. This corresponds to the tree obtained from the infinite tree by removing all splits of "empty" cells (cells that contain no point among X_1, …, X_{t−1}). As X_t becomes available, this tree is updated as follows (Lines 2–11 in Algorithm 2 below):

find the leaf of the current partition that contains X_t; it contains at most one point among X_1, …, X_{t−1};

if this leaf contains no point among X_1, …, X_{t−1}, the partition is left unchanged. Otherwise, let X_s be the unique such point (distinct from X_t) in this cell. Splits of the cell containing X_t are then successively sampled (following the recursive definition of the Mondrian distribution), until a split separates X_t and X_s.
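The second step above can be sketched as follows, assuming a cell is stored as a list of (a_j, b_j) intervals and ignoring the birth times of the Mondrian process (our own simplification; in Algorithm 2 the intermediate non-separating splits are also kept in the tree):

```python
import random

def split_to_separate(cell, x, xp, rng):
    """Sample Mondrian-style splits of `cell` (a list of (a_j, b_j)
    intervals) until one of them separates x from xp.  The split
    coordinate j is chosen with probability proportional to the side
    length b_j - a_j, and the threshold uniformly in (a_j, b_j); when
    a split does not separate the two points, we recurse into the
    child cell containing both.  Returns the separating split (j, s).
    Split (birth) times are omitted in this sketch."""
    while True:
        lengths = [b - a for a, b in cell]
        j = rng.choices(range(len(cell)), weights=lengths)[0]
        a, b = cell[j]
        s = rng.uniform(a, b)
        if (x[j] <= s) != (xp[j] <= s):
            return j, s
        # both points fall on the same side: shrink the cell and retry
        cell = list(cell)
        cell[j] = (a, s) if x[j] <= s else (s, b)
```

Since each iteration has a positive probability of separating the two distinct points, the loop terminates almost surely.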
Prediction function update.
The algorithm maintains weights w_{t,v}, averaged weights w̄_{t,v} and predictions ŷ_{t,v} in order to compute the aggregation over the tree structure (Lines 12–18 in Algorithm 2). Namely, after round t (after seeing the sample (X_t, Y_t)), each node v holds the following quantities in memory:

the weight w_{t,v} = e^{−η L_{t,v}}, where L_{t,v} is the cumulative loss of the forecasts of node v at time t;

the averaged weight w̄_{t,v}, defined by a sum that ranges over all subtrees rooted at v;

the forecast ŷ_{t,v} of node v at time t.
Now, given a new sample point (X_t, Y_t), the update is performed as follows: we find the leaf v_t containing X_t in the current partition (the partition has already been updated with X_t, since the partition update is performed before the prediction function update). Then, we update the values w_{t,v}, w̄_{t,v} and ŷ_{t,v} for each node v along an upwards recursion from v_t to the root, while the values of nodes outside of this path are kept unchanged:

w_{t,v} = w_{t−1,v} · e^{−η ℓ(Y_t, ŷ_{t−1,v})}, namely the weight is discounted by the exponentiated loss of the node's previous forecast;

if v is a leaf of the current tree, then w̄_{t,v} = w_{t,v}; otherwise, w̄_{t,v} = (1/2) w_{t,v} + (1/2) w̄_{t,v0} w̄_{t,v1}.
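The upwards update along the path can be sketched as follows, with node states stored in a dictionary and a CTW-style recursion for the averaged weights (our own layout and naming, not the paper's implementation):

```python
import math

def update_path(nodes, leaf, y, eta, loss):
    """Upwards weight update along the root-to-leaf path.  `nodes`
    maps a node word to a dict with keys "w" (weight), "wbar"
    (averaged weight) and "pred" (node forecast).  Each node v on the
    path does:
      w_v    <- w_v * exp(-eta * loss(y, pred_v))
      wbar_v <- w_v                                  if v is a leaf
      wbar_v <- 0.5 * w_v + 0.5 * wbar_v0 * wbar_v1  otherwise."""
    v = leaf
    while True:
        node = nodes[v]
        node["w"] *= math.exp(-eta * loss(y, node["pred"]))
        if v + "0" in nodes:  # interior node: combine children
            node["wbar"] = 0.5 * node["w"] + 0.5 * nodes[v + "0"]["wbar"] * nodes[v + "1"]["wbar"]
        else:                 # leaf
            node["wbar"] = node["w"]
        if not v:             # reached the root
            break
        v = v[:-1]            # move up to the parent
```

Off-path nodes keep their stored averaged weights, which is what makes the update cost proportional to the depth of the tree rather than to its size.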
The partition update and prediction function update correspond to the procedure described in Algorithm 2 below.
Training AMF over a sequence (X_1, Y_1), (X_2, Y_2), … means using successive calls to the update procedure of Algorithm 2. The procedure maintains in memory the current state of the Mondrian partition. The tree structure contains the parent and child relations between all nodes, while each node v can store

(j_v, s_v, ŷ_{t,v}, w_{t,v}, w̄_{t,v}),   (4)

namely the split coordinate and split threshold (only if v is an interior node), the prediction function, the aggregation weights and, in classification, a vector of label counts if v is a leaf. An illustration of Algorithm 2 is given in Figure 2 below.
Remark 2.
The complexity of a single update is twice the depth of the tree at the moment it is called, since it requires following a downwards path to a leaf and going back upwards to the root. As explained in Proposition 2 from Section 4 below, the depth of the Mondrian tree used in AMF is O(log t) in expectation at step t of training, which leads to an O(C log t) complexity for both Algorithms 2 and 3, where C corresponds to the update complexity of a single node, while the original MF algorithm uses an update whose complexity is linear in the number of leaves of the tree (which is typically exponentially larger).

Prediction.
At any point in time, one can ask AMF to produce a prediction for an arbitrary feature vector x. Let us assume that AMF has already performed t training steps on the trees it contains, and recall that the prediction produced by AMF is the average of the tree predictions, see Equation (1), where the prediction of each decision tree is computed in parallel following Definition 3.
The prediction of a decision tree is performed through a call to the procedure described in Algorithm 3 below. First, we perform a temporary partition update using x, following Lines 2–10 of Algorithm 2, so that we find or create a leaf node v such that x ∈ C_v. Let us stress that this update is discarded once the prediction for x is produced, so that the decision function of AMF does not change after producing predictions. The prediction is then computed recursively, along an upwards recursion from v to the root ε, in the following way:

if v is a leaf, the recursive prediction at v is initialized with the node forecast ŷ_{t,v};

if v is an interior node such that x ∈ C_v, then, denoting by va (with a ∈ {0, 1}) the child of v such that x ∈ C_{va}, we set the recursive prediction at v to a convex combination of the node forecast ŷ_{t,v} and the recursive prediction at va, with mixture weights determined by w_{t,v} and the averaged weights.
The prediction of the tree is the value obtained at the root at the end of this recursion. Let us recall that this computes the aggregation with exponential weights of all the decision functions produced by all the prunings of the current Mondrian tree, as described in Definition 3 and stated in Proposition 1 above. The prediction procedure is summarized in Algorithm 3 below.
The next Section 3 provides theoretical guarantees for AMF, but before that, let us provide the following numerical illustration on three toy datasets for binary classification. The aim of this illustration is to exhibit the effect of aggregation in AMF, compared with the same method without aggregation, the original Mondrian Forest algorithm, batch Random Forest and Extra-Trees (see Section 5 for a precise description of the implementations used). We observe that AMF with aggregation (AMF(agg)) produces a very smooth decision function in all cases, which generalizes better on this instance (AUCs displayed on the bottom right-hand side of each plot are computed on a test dataset) than all the other methods. All the other algorithms display rather non-smooth decision functions, which suggests that the underlying probability estimates are not well-calibrated.
3 Theoretical guarantees
In addition to being efficiently implementable in a streaming fashion, AMF is amenable to a thorough end-to-end theoretical analysis. This relies on two main ingredients: a precise control of the geometric properties of Mondrian partitions, and a regret analysis of the aggregation procedure (exponentially weighted aggregation of all finite prunings of the infinite Mondrian), which in turn yields excess risk bounds and adaptive minimax rates. The guarantees provided below hold for a single tree in the forest, but they also hold for the average of several trees (used by the forest) by convexity of the loss (see Examples 1 and 2).
3.1 Regret bounds
For now, the sequence (X_1, Y_1), (X_2, Y_2), … is arbitrary; in particular, it is not required to be i.i.d. Let us recall that at step t, we have a realization of a finite Mondrian tree, which is the minimal subtree of the infinite Mondrian partition that separates all distinct sample points among X_1, …, X_t. Let us also recall that the ŷ_t(T, ·) are the tree forecasters from Definition 2, where T is some subtree of this realization. We need the following definition.
Definition 4.
Let η > 0. A loss function ℓ is said to be η-exp-concave if the function z ↦ e^{−η ℓ(y, z)} is concave for each y.
The following loss functions are η-exp-concave:

The logarithmic loss ℓ(y, ŷ) = −log ŷ(y), with Y a finite set and ŷ ∈ P(Y), which is η-exp-concave for η = 1 (see Example 2 above);

The quadratic loss ℓ(y, ŷ) = (y − ŷ)² with y, ŷ ∈ [0, 1], which is η-exp-concave for any η ≤ 1/2.
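Exp-concavity can be sanity-checked numerically (a grid check of midpoint concavity, not a proof; names are ours):

```python
import math

def exp_concave_on_grid(loss, eta, y, grid):
    """Check midpoint concavity of F(z) = exp(-eta * loss(y, z)) over
    a grid of prediction values: F((z1 + z2) / 2) >= (F(z1) + F(z2)) / 2
    for all grid pairs.  A numeric sanity check, not a proof."""
    F = lambda z: math.exp(-eta * loss(y, z))
    for z1 in grid:
        for z2 in grid:
            if F((z1 + z2) / 2) < (F(z1) + F(z2)) / 2 - 1e-12:
                return False
    return True
```

For the quadratic loss on [0, 1], the check passes with η = 1/2 and fails for large η, where z ↦ e^{−η(y−z)²} develops convex tails.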
We start with Lemma 1, which states that the prediction function used in AMF (see Definition 3) satisfies a regret bound with respect to any pruning of the infinite Mondrian partition.
Lemma 1.
Consider an η-exp-concave loss function ℓ. Fix a realization of the infinite Mondrian partition and let T be one of its finite subtrees. For every sequence (X_1, Y_1), …, (X_n, Y_n), the prediction functions f̂_0, …, f̂_{n−1} computed by AMF satisfy

Σ_{t=1}^{n} ℓ(Y_t, f̂_{t−1}(X_t)) ≤ Σ_{t=1}^{n} ℓ(Y_t, ŷ_{t−1}(T, X_t)) + (|T| log 2) / η,   (5)

where we recall that |T| is the number of nodes in T.
Lemma 1 is a direct consequence of a standard regret bound for the exponential weights algorithm (see Lemma 4 from Section 7), together with the fact that the Context Tree Weighting procedure performed in Algorithms 2 and 3 computes it exactly, as stated in Proposition 1. By combining Lemma 1 with regret bounds for the online algorithms used in each node, both for the logarithmic loss (see Example 2) and the quadratic loss (see Example 1), we obtain the following regret bounds with respect to any pruning of the infinite Mondrian partition.