Learning a probability distribution based on an i.i.d. sample of observations is a fundamental inference task with a variety of applications. However, when the sample space involves more than a few dimensions, learning the measure through, for example, density estimation is very challenging due to the so-called curse of dimensionality(Hastie et al., 2009). Few observations are close to each other in such spaces and so traditional local density smoothers do not produce satisfactory estimates. Similar challenges also arise for other density estimators such as random Bayesian nonparametric mixture models.
In other contexts of statistical learning such as classification and regression, boosting is acknowledged as one of the most powerful learning algorithms for high-dimensional problems, largely due to its ability to overcome the curse of dimensionality and achieve a desirable balance in bias-variance trade-off. Boosting fits an additive ensemble of weak learners, often in the form of regression or classification trees, in a sequential or iterative manner to the current residuals of the original observations through either optimization(Hastie et al., 2009) or Bayesian sampling (Chipman et al., 2010). The success of boosting in nonparametric regression and classification problems suggests that a similar strategy will likely also prevail in learning the structures of high-dimensional probability measures based on a sample of i.i.d. observations.
Adapting boosting to learning high-dimensional distributions, however, requires the formulation of two indispensable concepts—(i) the addition of probability measures and (ii) the residual of an observation after “subtracting” a probability measure from it. While a natural idea to define addition on probability measures might be to embed them as elements in a space where the operation of addition is available, we are not aware of any such embedding that renders a conceptually and computationally simple notion of “residualization”—that is, subtracting a probability measure—from i.i.d. observations.
Therefore we introduce a new notion of “adding” probability measures that allows simple residualization of an i.i.d. sample. We start from the case of univariate distributions, for which the addition of two measures can be defined simply in terms of the composition of their cumulative distribution functions (CDFs). Accordingly, the residual of an observation is simply the application of the corresponding CDF transform to that observation. In generalizing this notion of addition to multivariate distributions on with , however, the classical notion of the CDF, which maps to the interval is insufficient, as easily seen by the fact that one can neither define a composition of two such CDFs nor define residuals that still lie in . Interestingly, we show that a proper notion of the CDF for multivariate distributions, which maps to , does exist and it naturally generalizes our addition rule and residuals to multivariate distributions.
Based on these new notions of addition and residuals, we introduce a forward-stagewise (FS) boosting algorithm for learning probability measures. As in classical boosting, our algorithm in each step (i) computes the current residuals by subtracting the fitted measure at the current step from the observations and (ii) fits a tree-based weak learner on the residuals and adding the estimated distribution to the current fit. The output of the algorithm enables the user to analytically compute the density function of the fitted distribution and it provides an exact Monte Carlo simulator for drawing independent samples from the fitted distribution.
Our boosting algorithm can use as the weak learner a variety of tree-based density estimators that produce a piece-wise constant density estimate on a recursive dyadic partition tree of the sample space with axis-aligned partition lines. We demonstrate using such a density estimator introduced in Awaya and Ma (2020), which achieves state-of-the-art performance among single-tree density estimators. It is based on computing a maximum a posteriori (MAP) estimate of a Pólya tree (PT) prior on densities (Ferguson, 1973; Lavine, 1992) along an inferred recursive partition tree structure (Wong and Ma, 2010; Awaya and Ma, 2020). Analogous to classical boosting algorithms, with this choice a single tuning parameter can be use to adjust for the amount of shrinkage/regularization on the weak learners. Some other tree-based density estimators that could potentially be used as a weak learner include the Bayesian sequential partitioning (BSP) model (Lu et al., 2013), the tree-based density estimation using the squared loss (Anderlini, 2016), and data-driven histograms (Scott, 1979; Lugosi et al., 1996).
Our proposed boosting algorithm inherits the tuning strategies of classical boosting. In particular, we show that the performance of the algorithm can be adjusted by specifying the number of trees, the complexity of the trees, and the amount of shrinkage/regularization. Popular devices for summarizing the importance of different variables also finds a natural counterpart in the current context.
Our boosting algorithm also produces a very useful inference device that seems to be unique for the problem of learning an unknown probability measure. Namely, the exact simulator for the fitted measure that our algorithm provides is a very useful tool for exploring high-dimensional distributions in practice. In such problems, summaries of the underlying distribution might not be analytically computable even given the exact density function. Yet if one can draw as many independent samples from the fitted measure as one wish then one can estimate such summaries by Monte Carlo. We will demonstrate this use and validate its behavior in the context of a mass-cytometry dataset.
2.1 CDF-based addition and residualization for univariate distributions
We start by defining notions of addition and residuals in the one-dimensional setting before generalizing them to multivariate distributions. Without loss of generality, let represent the one-dimensional sample space. Let be an i.i.d. sample generated from , an unknown distribution of interest.
In order to learn the unknown distribution as the “sum” of probability measures , we consider an addition rule based on compositions of their cumulative distribution functions (CDFs). Specifically, let be the corresponding CDFs. We define the sum of , denoted by , to be the distribution whose CDF is
where represents function composition. This definition of addition on probability distributions is motivated by the following fact.
If have full support on , (i.e., when the CDFs are strictly increasing,) then
This property follows easily from the definition of the CDF, and it leads to a natural notion of “subtracting” a probability measure from a collection of observations. Specifically, “subtracting” a measure from an observation boils down to simply applying its CDF transform to the observation: . Proposition 2.1 then implies that such residualization can be applied sequentially just as in usual subtraction in a Euclidean space. That is, if is the residual of after subtracting . Then , and for
The “additivity” induced by the composition of the CDFs is perhaps most clearly demonstrated on the log-likelihood scale. Specifically, suppose be the pdf for all with respect to a common dominating measure . Then the density of satisfies
It is a well-known fact that if , then . Thus when the observations follow a distribution similar to , the distribution of the residuals from subtracting ,
, is close to the uniform distribution. Because the CDF of the Unifis the identity map, the corresponding residual after subtracting the standard uniform from an observation remains . As such, the uniform distribution corresponds to a notion of “zero”.
|Boosting for regression||Boosting for probability measures|
|Zero||Uniform distribution on|
Table 1 compares the corresponding notions of addition, residuals, and zero in boosting for regression and for learning probability measures. With these new notions of addition and residuals, we are ready to introduce a boosting algorithm for one-dimensional distributions. However, the more interesting application involves multivarate (in fact high-dimensional) distributions. As such, we first generalize these notions to multivariate cases, and then introduce a multivariate version our boosting algorithm that contains the (less interesting) univariate scenario as a special case.
2.2 Generalization to multivariate distributions
The above notions of addition and residuals do not find direct counterparts for multivariate measures if one uses the traditional definition of CDFs for multivariate distributions. In particular, because the traditional CDF is a mapping from to instead of , we can neither take the composition of the CDFs nor compute the residuals, which should remain in the space of .
Interestingly, there exists a new notion of multivariate CDFs that generalizes the one-dimensional CDF while maintaining those critical properties. In particular, a multivariate CDF for a measure should satisfy the following minimal requirements:
is a mapping from to .
If , then .
is uniquely determined by .
The first and second conditions are necessary for defining the notions of addition and residuals. The third condition is required for identifying a measure based on .
There are many choices of the mapping that satisfy these requirements. In fact, as we will show in the following, one can find a mapping that satisfies the three conditions given any dyadic recursive partition tree with axis-aligned partition lines (defined later) on the sample space. While for general measure the tree needs to have infinite depth and generate the Borel -algebra, in the context of boosting, we only need to consider (the sum of) probability measures with piece-wise constant densities on the leaves of finite partition trees. For such measures the mapping can be defined based on partition trees of finite depths. Moreover, as we shall see, the partition trees needed in carrying out boosting are exactly those produced by tree-based weaker learners.
2.2.1 Characterizing probability measures on a recursive dyadic partition tree
A recursive dyadic partition of depth is a sequence of nested dyadic partitions
on the sample space . The first partition only includes , and for , the partition consists of all the sets generated by dividing each into two children and , where and . (Throughout, we use subscripts and to indicate left and right children respectively.) We can denote the recursive partition using a tree . As such, we refer to the sets in the partitions as “nodes”. We call the nodes in the “terminal” nodes or “leafs” of ; the nodes in other levels are the “non-leaf” nodes or “interior” nodes, which we denote by .
We consider partition trees with only axis-aligned partition lines. In other words, a node is of the following rectangular form:
For a non-leaf node , the children and are generated by dividing in one of the dimensions, say ,
In the following, for each partition tree , we let be the class of probability measures that are conditionally uniform on the leafs of and have full support on . That is,
where is the uniform measure, and and indicate the corresponding conditional measures on . To generalize the CDF transform in one-dimensional cases to a multivariate CDF, first we note that in the univariate setting , applying a standard CDF transform for any measure on an observation can actually be accomplished sequentially in a bottom-up fashion along the branch in the partition tree in which falls. Specifically, suppose has depth , then for , let be a sequence of nodes in such that and
The CDF transform is equivalent to the composition of a sequence of local-move functions:
where for any with two children and , the mapping “slides” a point in in the direction of the child node with less (conditional) probability mass than the (conditional) uniform measure. Specifically, the mapping is determined by
Note that the conditional measures and the input and output of have the following relationship:
Figure 1 visualizes how “slides” an input point in each level. The amount of sliding on is proportional to the probability mass differential between the two children of in relative to .
If we think of applying the CDF transform as “subtracting” the information contained in a probability measure from an observation, Eq. (4) indicates that such subtraction can be done through a sequence of “local” moves, each substracting a piece of the information regarding the measure from the observation. This perspective leads to a generalization of the CDF transform for the multivariate case as we describe below.
For a point , again let be a recursive dyadic partition tree of depth on the sample space , and the sequence of nodes in that contains as before. Then we define a mapping in terms of a sequence of fine-to-coarse local moves along that branch in . Specifically, for a node as in Eq. (2) with children and attained from dividing in the th dimension as described in Eq. (3), we define a local move mapping such that for any , where for all , and
Just as in the one-dimensional case, the local move mapping “slides” in the direction of the child node with less probability mass relative to the uniform measure, except that now in the multivariate setting there are a total of available directions.
As before, we now define a mapping called a “tree-CDF” as the composition of these local move functions. That is,
is injective from to for any with full support on . Additionally, because is surjective for every , is also surjective. Hence we have the following property.
The tree-CDF mapping is bijective for any .
The next two theorems show that satisfies Conditions (C2) and (C3). That is, applying the mapping to an observation effectively “subtracts” the information in from that observation, and that uniquely determines .
If , then . Conversely, if , then .
Theorem 2.1 also provides a practical strategy for generating samples from based on , which we will use later to build an exact Monte Carlo simulator from the fitted probability measure.
A measure for some partition tree can be determined by the tree-CDF mapping as follows
Theorem 2.2 implies that while the mapping for is not unique but tree-specific, uniquely determines , regardless of the tree from which is defined.
2.2.2 Addition and residuals for multivariate distributions
From the properties we have obtained, we are now ready to define addition for multivariate distributions. Let be a collection of probability measures such that for , and let be the corresponding tree-CDFs. As a generalization of the discussion on univariate cases, we define “a summation” of the measures in terms of their tree-CDFs, and first we show that such a sum indeed pins down a unique probability measure in the following lemma.
For (), the mapping defined as
is a probability measure.
Then, we can define the sum of measures, , to be the measure given in Eq. (5). This definition of addition contains the one-dimensional case presented earlier as a special case. We note that the addition implicitly involves the tree structures . This dependency on the trees, however, is suppressed in the “” notation for simplicity without causing confusion.
Next we turn to the notion of residuals. We first generalize Proposition 2.1 from the univariate case to multivariate distributions.
Let be a collection of probability measures such that for . Then
Just as in the univariate case, this proposition implies that we can define the residual of an observation after subtracting as , and the residual contains the information that allow us to fit the next summand . Moreover, the sequential update of the residuals given in Eq. (1) remains valid. The only difference is that now the residualization in each step depends on an implicit partition tree structure, encapsulated in the tree-CDF mappings.
2.3 A forward-stagewise (FS) algorithm for boosting
Equipped with the new notions of addition and residuals, we are ready to design a boosting algorithm based on forward-stagewise (FS) fitting. Suppose we have an i.i.d. sample from an unknown distribution , which we model as an additive ensemble of probability measures
where each is modeled as a member in for some (unknown) . We introduce an FS algorithm in which we compute the residuals step-by-step and at the th step, fit to the current residuals. The fit at the th step produces an estimate for along with a partition tree , which is used to define the tree-CDF used in the next step for computing the new residuals.
Set and set , the uniform distribution.
- Forward-stagewise fitting
Repeat the following steps for :
Update the residuals , where .
Fit a weak learner that produces a pair of outputs to the residualized observations , where is an inferred partition tree and is the tree-CDF for a measure .
The output of the FS algorithm in terms of the collection of pairs for contains all of the information from the data regarding the underlying distribution. (In fact the ’s alone contain all the relevant information, but the ’s are indispensable for effectively representing and storing the ’s.) Next we show how such information can be extracted. In particular, we show (i) how to compute the density function of the fitted measure at any point in the sample space analytically, and (ii) how to draw Monte Carlo samples from the measure .
Evaluating the density function of . Density estimation is a common objective in learning multivariate distributions. The next proposition generalizes the additive decomposition of the log-likelihood for the univariate case and provides a recipe for evaluating the density for the fitted measure analytically based on the output of the FS algorithm.
For any , the density for of the additive form in Eq. (6) is given by
where is the density of , , is the residual for after subtracting , and in particular . In other words, the density is exactly the product of the fitted density of each weak learner evaluated at the corresponding sequence of residuals for .
An exact Monte Carlo simulator for . The structure of a high-dimensional distribution can often be understood through evaluating various lower-dimensional summaries. However, such summaries are often not available analytically. One can use Monte Carlo to estimate them if one can simulate from the distribution of interest. Thus the availability of a Monte Carlo sampler for the fitted distribution is very useful in practice.
It turns out that one can use the classical idea of inverse-CDF sampling to draw exact Monte Carlo samples from based on Theorem 2.1. Specifically, we can simulate from by generating and then compute the following transform
where is the corresponding inverse for the tree-CDF for . To implement the sampler, we next obtain the analytic form of the inverse of a tree-CDF.
Recall that in Section 2.2.1 we showed that a tree-CDF for a measure can be expressed as the composition of a sequence of local move mappings along each subbranch of . The inverse of the local move mapping can be expressed as for any , where
With the inverse local move function available for all , we can obtain the explicit form for the inverse tree-CDF as
2.4 Evaluating variable importance
Just as in classification and regression problems, it is often desirable to evaluate the contribution of each dimension to the richness in the structure of the measure
. Thus we provide a way to quantify variable importance in a conceptually similar manner to that used in boosting for classification and regression with decision trees as the weak learner (seeHastie et al. (2009)).
A natural quantity for measuring the importance of a variable at each split of a tree is the extent to which it contributes to the deviation of the resulting probability measure from “zero”, i.e., the uniform measure , in terms of the KL divergence. For any measure , the KL divergence KL can be decomposed over the splits of the tree as follows
The derivation of this expression is provided in the supplementary materials. Thus, the total contribution to the KL between and from dividing in the th dimension is
where represents the collection of nodes in that are split in the th dimension. Then we can define the importance of the th variable in the additive measure by summing over the variable importance across the ’s,
2.5 A weak learner based on the Pólya tree
Our boosting algorithm can be applied in conjunction with different weaker learners for fitting each . The minimal requirement for the weak learner is that it should produce an estimate for each along with a tree . Beyond the minimal requirement, it is also desirable to choose a learner for which one can adjust the level of shrinkage/regularization (i.e., the bias-variance trade-off) easily, because it is well-known that the amount of shrinkage is an important tuning parameter for the effectiveness of boosting in classification and regression settings.
Through the rest of the paper we adopt a learner based on the so-called Pólya tree (PT) process (Ferguson, 1973; Lavine, 1992), which is a Bayesian nonparametric model for probability measures. While the standard Pólya tree model is specified on a fixed partition tree , recent developments in Wong and Ma (2010) and Awaya and Ma (2020) introduce strategies for learning a flexible dyadic partition tree in multivariate settings. In particular, the approach by Awaya and Ma (2020) is scalable to high-dimensional distributions, and its computational cost is shown to be approximately linear to the dimensionality, so we use it as our weak learner. This learner can be specified in such a way that the resulting shrinkage is specified by a single parameter with smaller values of corresponding to stronger shrinkage; corresponds to complete shrinkage to the uniform measure and corresponds to no shrinkage at all. We provide the details of this learner in the supplementary materials and interested readers may refer to Awaya and Ma (2020) for additional details as well as the relevant literature.
3 Numerical experiments
We carry out numerical experiments to investigate the performance of the new boosting method using the following two representative scenarios in 48 dimensions:
“Clusters”: are independent, and for ,
“Correlation”: are independent, and each 4-tuple follows , where
The true marginal densities of are visualized in Figure 3. In the first scenario, the observations arise from (two) local spikes, which are far from a piece-wise uniform distribution. In the second scenario, the variables are highly correlated. Thus the density contours are not axis-aligned and the underlying densities are smooth. As such, both scenarios are difficult to capture using a piece-wise constant density on a single tree.
In this experiment, the training sample size is set to . There are a total of three tuning parameters—namely, the total number of weak learners or trees , the maximum depth of each tree , and the shrinkage parameter for the weak learner. To examine how the performance of boosting is affected by the different specifications of these tuning parameters, we vary the number of trees in our additive model up to 2,000, adopt three different tree depth , and three different values of the shrinkage parameter . We compare the performance of our boosting algorithm to the single tree model of Awaya and Ma (2020) with a maximum depth of along with a parameter specification that makes it a “strong” learner following the default recommendation in that paper.
We evaluate the fit of the fitted density function using a simulated training data set and computing a predictive score
where on a simulated testing set . We generate 20 different data sets and take the average of the predictive scores to compare the model with each combination of the tuning parameters.
Figure 4 presents the comparison of the predictive scores under the different number of trees. The results show that boosting can improve density estimation substantially over the single tree model. The maximum resolution needs to be relatively small (i.e., 3 or 5) allowing relatively shallow trees in each weak learner, and otherwise, the over-fitting can happen as is seen in the first scenario. The decay in performance is also observed when the parameter is large, resulting in too little shrinkage (and thus high variance). All of these are consistent with the traditional wisdom in applying boosting to classification and regression (Friedman, 2001).
Figure 5 compares the estimated densities (evaluated at the simulated observations) under different combinations of and . By comparing the distributions given by and , we see that when each tree is too shallow (i.e., ), the estimate can fail to capture the detailed distributional structures accurately in Scenario 1. Also, when there is too little shrinkage (i.e., ) overfitting is observed in a form of artificial jumps not present in the true densities.
4 Case study: a mass-cytometry data set
Finally we demonstrate our boosting method and especially the use of the simulator for the fitted measure in the context of a mass-cytometry data set collected by Kleinsteuber et al. (2016). This data records the values of 19 biomarkers for cells in a blood sample provided by an HIV-infected patient. The result of the density estimation can be utilized in the automated classification (Malek et al., 2015) and identify cross-sample differences.
The one-dimensional marginal means and center 90% quantile bins of the 19 biomarkers in the original mass-cytometry data and the replicated data. The numberindicates the size of the data used to construct the simulator for replication.
In understanding the nature and composition of the cell population, it is useful to examine lower-dimensional summaries–such as marginal mean, quantiles, and pairwise covariance—of the underlying distribution. As such we demonstrate how we use the simulator for the fitted measure from our boosting algorithm to estimate such summaries. We note that we chose a data set with a massive sample size and so many low-dimensional summaries such as the first two moments can be empirically estimated from the original sample. This provides a proxy to the “truth”. Specifically we will fit our boosting algorithm on both this full data set as well as a small subsample of size. After computing the summaries estimated from “replicate” data sets generated from our boosting fit on the full data and on the subsample (), we compare them to the “truth”, thereby allowing us to evaluate the quality of our fit.
The comparison is visualized in Figure 6 and Figure 7, and they show that the estimated density accurately retrieves the characteristics of the original data in these low-dimensional quantities, even with a sample size .
provides the computed variable importance of the 19 biomarkers (the importance is re-scaled so that the summation is 1) we obtained based on the subsample of size 5,000. This figure also includes the marginal densities for the three biomarkers with the highest and lowest importance. We can see that the markers with high importance tend to have highly skewed densities while the markers with low importance have distributions that are much closer to the uniform distribution.
5 Concluding remarks
We have proposed a new boosting method for learning multivariate probability measures by introducing new notions of addition and residuals based on tree-CDF transforms. We demonstrate how one can carry out density estimation and simulate from the fitted measure based on the output of the algorithm. Given its similarity to classical boosting for regression and classification, we expect other techniques for the boosting in such contexts, for example subsampling (Friedman, 2002), could further improve the performance of our boosting method.
An R package for our method is available at https://github.com/MaStatLab/boostPM.
LM’s research is partly supported by NSF grants DMS-2013930 and DMS-1749789. NA is partly supported by a fellowship from the Nakajima Foundation.
Density estimation trees as fast non-parametric modelling tools. In Journal of Physics: Conference Series, Vol. 762, pp. 012042. Cited by: §1.
- Hidden markov Pólya trees for high-dimensional distributions. arXiv preprint arXiv:2011.03121. Cited by: Appendix B, Appendix B, Appendix B, §1, §2.5, §3.
- BART: Bayesian additive regression trees. The Annals of Applied Statistics 4 (1), pp. 266–298. Cited by: §1.
- A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230. Cited by: §1, §2.5.
Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189–1232. Cited by: §3.
- Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378. Cited by: §5.
- The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. Cited by: §1, §1, §2.4.
- Standardization and quality control for high-dimensional mass cytometry studies of human samples. Cytometry Part A 89 (10), pp. 903–913. Cited by: §4.
- Some aspects of Pólya tree distributions for statistical modelling. The Annals of Statistics 20 (3), pp. 1222–1235. Cited by: §1, §2.5.
- Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association 108 (504), pp. 1402–1410. Cited by: §1.
- Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics 24 (2), pp. 687–706. Cited by: §1.
- FlowDensity: reproducing manual gating of flow cytometry data by automated density-based cell population identification. Bioinformatics 31 (4), pp. 606–607. Cited by: §4.
- On optimal and data-based histograms. Biometrika 66 (3), pp. 605–610. Cited by: §1.
- Sequential monte carlo methods in practice. Springer Science & Business Media. Cited by: Appendix B.
Optional Pólya tree and Bayesian inference. The Annals of Statistics 38 (3), pp. 1433–1459. Cited by: §1, §2.5.
Appendix A Proofs
In the following proofs, for a tree CDF and , the image is denoted by
and the same notation rule is applied for its inverse. Additionally, denotes the Lebesgue measure.
a.1 Proof of Proposition 2.1
Suppose . For , we have
so . The converse can be shown in the same way.
a.2 Proof of Theorem 2.1
To prove the first assertion, define for as
. Then the first assertion is equivalent to that if , then
which we prove here.
In the proof, we let and for
Let denote a probability measure for . With these notations, we prove (7) by induction: We show that, if for , and
Assume Eq (8) holds for some . By the definition, is bijective, and for every . Then for and , we have,
Hence the first equation in Eq (9) holds. To prove the second equation, let . Then, we show that for ,
holds. The probability in the left hand side can be written as
Let be divided in the th dimension. By the definition of , for , . For , because is strictly increasing,
where . The expression of changes depending on whether or not, where is a partition point at which is divided. We first assume that . In this case, the second term in Eq (11) is 0 because does not happen if . Also, by the definition of ,
Therefore, it follows that