In this paper, we consider the problem of the conditional density estimation. We observe a -sample of a couple , in which
is the vector of interest whilegathers auxiliary variables. We denote the joint dimension. In particular we are interested in the inference of the -dimensional conditional density of conditionally to .
There is a growing demand for methods of conditional density estimation in a wide spectrum of applications such as Economy[HRL04], Cosmology [IzbickiLee16], Medicine [takeuchi09], Actuaries [efromovich10a], Meteorology [JeonTaylor] among others. It can be explained by the double role of the conditional density estimation: deriving the underlying distribution of a dataset and determining the impact of the vector of auxiliary variables on the vector of interest . In this aspect, the conditional density estimation is richer than both the unconditional density estimation and the regression problem. In particular, in the regression framework, only the conditional mean are estimated instead of the full conditional density, which can be especially poorly informative in case of an asymmetric or multi-modal conditional density. Conversely, from the conditional density estimators, one can, e.g.
, derive the conditional quantiles[TLSS06] or give accurate predictive intervals [FLCLY02]. Furthermore, since the posterior distribution in the Bayesian framework is actually a conditional density, the present paper also offers an alternative method to the ABC methodology (for Approximate Bayesian Computation) [beaumont02 ; MPR12 ; BCG15] in the case of an intractable-yet-simulable model.
The challenging issue in conditional density estimation is to circumvent the ”curse of dimensionality”. The problem is twofold: theoretical and practical. In theory, it is stigmatized by the minimax approach, stating that in a-dimensional space the best convergence rate for the pointwise risk over a -regular class of functions is : in particular, the larger is , the slower is the rate. In practice, the larger the dimension is, the larger the sample size is needed to control the estimation error. In order to maintain reasonable running times in moderately large dimensions, methods have to be designed especially greedy.
Furthermore, one interesting question is how to retrieve the eventual relevant components in case of sparsity structure on the conditional density . For example, if we have at disposal plenty of auxiliary variables without any indication on their dependency with our vector of interest , the ideal procedure will take in input the whole dataset and still achieve a running time and a minimax rate as fast as if only the relevant components were given and considered for the estimation. More precisely, two goals are simultaneously addressed : converging at rate with the relevant dimension, i.e. the number of components that influence the conditional density , and detect the irrelevant components at an early stage of the procedure in order to afterwards only work on the relevant data and thus speed up the running time.
1.2 Existing methodologies
Several nonparametric methods have been proposed to estimate conditional densities: kernel density estimators [rosenblatt69 ; HBG96 ; BLR16] and various methodologies for the selection of the associated bandwidth [BH01 ; FYim04 ; HRL04]; local polynomial estimators [FYT96 ; HY02]; projection series estimators [efromovich99 ; efromovich07]; piecewise constant estimator [GK07 ; sart17]; copula [faugeras09].
But while most of the aforementioned works are only defined for bivariate data or at least when either or is univariate, they are also computationally intractable as soon as .
It is in particular the case for the kernel density methodologies (Hall, Racine ,Li 2004, Bertin et al. 2016): they achieve the optimal minimax rate, and even the detection of the relevant components, thanks to an adequate choice of the bandwidth (for the two aforementioned methods by cross validation and Goldenshluger-Lepski methodology), but the computational cost of these bandwidth selections is prohibitive even for moderate sizes of and . To the best of our knowledge, only two kernel density methods have been proposed to handle large datasets. [HGI10] propose a fast method of approximated cross-validation, based on a dual-tree speed-up, but they do not establish any rate of convergence and only show the consistency of their method. For scalar , [FPYZ09] proposed to perform a prior step of dimension reduction on to bypass the curse of dimensionality, then they estimate the bivariate approximated conditional density by kernel estimators. But the proved convergence rate is not the optimal minimax rate for the estimation of a bivariate function of assumed regularity . Moreover, the step of dimension reduction restricts the dependency of to a linear combination of its components, which may induce a significant loss of information.
Projection series methods for scalar have also been proposed. [efromovich10b] extends his previous work [efromovich07] to a multivariate . Theoretically the method achieves an oracle inequality, thus the optimal minimax rate. Moreover it performs an automatic dimension reduction on when there exists a smaller intrinsic dimension. To our knowledge, it is the only method which addresses datasets of dimension larger than with reasonable running times and does not pay its numerical performance with non optimal minimax rates. However the computation cost is prohibitive when both and are large. More recently, Izbicki and Lee have proposed two methodologies using orthogonal series estimators [IzbickiLee16 ; IzbickiLee17]. The first method is particularly fast and can handle very large (with more than covariates). Moreover the convergence rate adapts to an eventual smaller unknown intrinsic dimension of the support of the conditional density. The second method originally proposes to convert successful high dimensional regression methods into the conditional density estimation, interpreting the coefficients of the orthogonal series estimator as regression functions, which allows to adapt to all kind of figures (mixed data, smaller intrinsic dimension, relevant variables) in function of the regression method. However both methods converge slower than the optimal minimax rate. Moreover their optimal tunings depend in fact on the unknown intrinsic dimension.
For multivariate and , [OtneimTjostheim17] propose a new semiparametric method, called Locally Gaussian Density Estimator: they rewrite the conditional density as a product of a function depending on the marginal distribution functions (easily estimated since univariate, then plug-in), and a term which measures the dependency between the components, which is approximated by a centred Gaussian whose covariance is parametrically estimated. Numerically, the methodology seems robust to addition of covariates of independent of , but it is not proved. Moreover they only establish the asymptotic normality of their method.
1.3 Our strategy and contributions
The challenge in this paper is to handle large datasets, thus we assume at our disposal a sample of large size and of moderately large dimension. Then our work is motivated by the following three objectives:
achieving the optimal minimax rate (up to a logarithm term);
being greedy, meaning that the procedure must have reasonable running times for large and moderately large dimensions, in particular when ;
adapting to a potential sparsity structure of . More precisely, in the case where locally depends only on a number of its components, can be seen as the local relevant dimension. Then the desired convergence rate has to adapt to the unknown relevant dimension : under this sparsity assumption, the benchmark for the estimation of a -regular function is to achieve a convergence rate of the order , which is the optimal minimax rate if the relevant components were given by an oracle.
Our strategy is based on kernel density estimators. The considered family has been recently introduced and studied in [BLR16]. This family is especially designed for conditional densities and
is better adapted for the objective c than the intensively studied estimator built as the ratio of a kernel estimator of the joint density over one of the marginal density of . For example, a relevant component for the joint density and the marginal density of may be irrelevant for the conditional density and it is the case if a component of is independent of . Note though that many more cases of irrelevance exist since we define the relevance as a local property.
The main issue with kernel density estimators is the selection of the bandwidth , and in our case, we also want to complete the objective b, since the pre-existing methodologies of bandwidth selection does not satisfy this restriction and thus cannot handle large datasets. In this paper, it is performed by an algorithm we call CDRodeo, which is derived from the algorithm Rodeo [LW08 ; LLW07], which has respectively been applied for the regression and the unconditional density estimation. The greediness of the algorithm allows us to address datasets of large sizes while keeping a reasonable running time (see Section 3.5 for further details). We give a simulated example with a sample of size and of dimension in Section 4. Moreover, Rodeo-type algorithms ensure an early detection of irrelevant component, and thus achieve the objective c while improving the objective b.
From the theoretical point of view, if the regularity of is known, our method achieves an optimal minimax rate (up to a logarithmic factor), which is adaptive to the unknown sparsity of . The last property is mostly due to the Rodeo-type procedures. The improvement of our method in comparison to the paper [LLW07] which estimates the unconditional density with Rodeo is twofold. First, our result is extended to any regularity , whereas [LLW07] fixed . Secondly, our notion of relevance is both less restrictive and more natural. In [LLW07], they studied the -risk of their estimator, therefore they have to consider a notion of global relevance, whereas we consider a pointwise approach, which allows us to define a local property of relevance, which can be applied to a broader class of functions. Moreover, their notion of relevance is not intrinsic to the unknown density, but in fact depends on a tuning of the method, a prior chosen baseline density which has no connexion with the density, which limits the interpretation of the relevance.
Our paper is organized as follows. We introduce the CDRodeo method in Section 2. The theoretical results are in Section 3, in which we specify the assumptions and the tunings of the procedure from which are derived the convergence rate and the complexity cost of the method. A numerical example is presented in Section 4. The proofs are in the last section.
2 CDRodeo method
Let be a sample of a couple of multivariate random vectors: for ,
with valued in and in . We denote the joint dimension.
We assume that the marginal distribution of and the conditional distribution of given are absolutely continuous with respect to the Lebesgue measure, and we define such as for any , is the conditional density of conditionally to . We denote the marginal density of .
Our method estimates pointwisely : let us fix the point of interest.
Our method is based on kernel density estimators. More specifically, we consider the family proposed in [BLR16], which is especially designed for the conditional density estimation. Let be a kernel function, ie: , then for any bandwidth , the estimator of is defined by:
where is an estimator of , built from another sample of . We denote by the sample size of . The choices of and are specified later (see section 3.2).
In kernel density estimation, selecting the bandwidth is a critical choice which can be viewed as a bias-variance trade-off. In[BLR16], it is performed by the Goldenshluger-Lepski methodology (see [GL11]) and requires an optimization over an exhaustive grid of couples of bandwidths, which leads to intractable running time when the dimension exceeds (and large dataset).
That is why we focus in a method which excludes optimization over an exhaustive grid of bandwidths to rather propose a greedy algorithm derived from the algorithm Rodeo. First introduced in the regression framework [LW06 ; LW08], a variation of Rodeo was proposed in [LLW07] for the density estimation. Our method we called CDRodeo (for Conditional Density Rodeo) addresses the more general problem of conditional density estimation.
Like Rodeo (which means Regularisation Of Derivative Expectation Operator), the CDRodeo algorithm generates an iterative path of decreasing bandwidths, based on tests on the partial derivatives of the estimator with respect to the components of the bandwidth. Note that the greediness of the procedure leans on the selection of this path of bandwidths, which enables us to address high dimensional problems of functional inference.
Let us be more precise: we take a kernel of class and consider the statistics for and , defined by:
is easily computable, since it can be expressed by:
where is the function defined by:
The details of the CDRodeo procedure are described in Algorithm 1 and can be summed up in one sentence: for a well-chosen threshold (specified in Section 3.3), the algorithm performs at each iteration the test to determine if the component of the current bandwidth must be shrunk or not. It can be interpreted by the following principle: the bandwidth of a kernel estimator quantifies within which distance of the point of interest and at which degree an observation
helps in the estimation. Heuristically, the larger the variation ofis, the smaller the bandwidth is required for an accurate estimation. The statistics are used as a proxy of to quantify the variation of in the direction . Note in particular that since the partial derivatives vanish for irrelevant components, this bandwidth selection leads to an implicit variable selection, and thus to avoid the curse of dimensionality under sparsity assumptions.
3 Theoretical results
This section gathers the theoretical results of our method.
We consider a compactly supported kernel. For any bandwidth , we define the neighbourhood of (typically, or , and or ) as follows:
Then we denote the CDRodeo initial bandwidth and for short, .
We also introduce the notation for the supremum norm over a set .
The following first assumption ensures a certain amount of observations in the neighbourhood of our point of interest .
Assumption 1 ( bounded away of ).
We assume .
Note that if the neighbourhood does not contain any observation , the estimation of the conditional distribution of given the event is obviously intractable.
The second assumption specifies the notions of ”sparse function” and ”relevant component”, under which the curse of high dimensionality can be avoided.
Assumption 2 (Sparsity condition).
There exists a subset such that for any fixed , the function is constant on .
In other words, if we denote the cardinal of , Assumption 2 means that locally depends on only of its variables. We call relevant any component in .
The notion of relevant component depends on the point where is estimated. For example, a component which behaves as in the conditional density is only relevant in the neighbourhood of and .
Note that this local property addresses a broader class of functions, which extends the application field of Theorem 2 and improves the convergence rate of the method.
Finally, the conditional density is required to be regular enough.
Assumption 3 (Regularity of ).
There exists a known integer such that is of class on and such that for all .
3.2 Conditions on the estimator of
Given the definition of the estimator (1), we need an estimator of .
If is known.
We take . This case is not completely obvious. In particular, it tackles the case of unconditional density estimation, if we set by convention and .
If is unknown.
We need an estimator which satisfies the following two conditions:
a positive lower bound:
a concentration inequality in local sup norm: there exists a constant such that:
The following proposition proves these conditions are feasible. Furthermore, the provided estimator of (see the proof in Section 5.3.1) is easily implementable and does not need any optimisation.
3.3 CDRodeo parameters choice.
We choose the kernel function of class , with compact support and of order , i.e.: for , , and .
Note that considering a compactly supported kernel is fundamental for the local approach. In particular, it relaxes the assumptions by restricting them to a neighbourhood of .
Taking a kernel of order is usual for the control of the bias of the estimator.
Let be the decreasing factor of the bandwidth. The larger , the more accurate the procedure, but the longer the computational time. From the theoretical point of view, it remains of little importance, as it only affects the constant terms. In practice, we set it close to .
We recall that we set (and the initial bandwidth as ).
For any bandwidth and for , we set the threshold as follows:
with (where is defined in (3)) and . The expression is obtained by using concentration inequalities on . For the proof, the parameter has to be tuned such that:
which is satisfied for large enough. The influence of this parameter is discussed in the next section, once the theoretical results are stated.
Hereafter, unless otherwise specified, the parameters are chosen as described in this section.
3.4 Mains results
Let us denote the bandwidth selected by CDRodeo. In Theorem 2, we introduce a set of bandwidths which contains
with high probability, which leads to an upper bound of the pointwise estimation error with high probability. Increftypecap 3, we deduce the convergence rate of CDRodeo from Theorem 2.
More precisely, in Theorem 2, we determine lower and upper bounds (with high probability) for the stopping iteration of each bandwidth component. We set:
Then we define the set of bandwidths by:
Assume that satisfies Conditions a and b of section 3.2 and Assumptions 2, 3 and 1 are satisfied. Then, the bandwidth selected by CDRodeo belongs to with high probability. More precisely, for any and for large enough:
Moreover, with probability larger than , the CDRodeo estimator verifies:
Under the assumptions of Theorem 2, for any :
creftypecap 3 presents a generalization of the previous works on Rodeo [LW08] and [LLW07] whose results are restricted to the regularity and to simpler problems, namely regression and density estimation.
We compare the convergence rate of CDRodeo with the optimal minimax rate. In particular, our benchmark is the pointwise minimax rate, which is of order , for the problem of -regular -dimensional density estimation, obtained by [DL92].
Without sparsity structure (), CDRodeo achieves the optimal minimax rate, up to a logarithmic factor. The exponent of this factor depends on the parameter . For the proofs, we need in order to satisfy (5), but if an upper bound (or a pre-estimator) of were known, we could obtain the similar result with and a modified constant term. Note that the logarithmic factor is a small price to pay for a computationally-tractable procedure for high-dimensional functional inference, in particular see section 3.5 for the computational gain of our procedure.
Under sparsity assumptions, we avoid the curse of high dimensionality and our procedure achieves the desired rate (up to a logarithmic term), which is optimal if the relevant components were known. Note that some additional logarithmic factors could be unavoidable due to the unknown sparsity structure, which needs to be estimated. Identifying the exact order of the logarithm term in the optimal minimax rate for the sparse case remains an open challenging question.
We now discuss the complexity of CDRodeo
without taking into account the pre-computation cost of at the points , (used for computing the ), but a fast procedure for is required, to avoid losing CDRodeo computational advantages.
For CDRodeo, the main cost lies in the computation of the ’s along the path of bandwidths.
The condition restricts to at most updates of the bandwidth across all components, leading to a worst-case complexity of order .
But as shown in Theorem 2, with high probability, , in which only the relevant components are active after the first iteration. In first iteration, the ’s computation costs operations, while the product kernel enables us to compute the ’s in following iteration with only operations, which leads to the complexity .
In order to grasp the advantage of CDRodeo greediness, we compare its complexity with optimization over an exhaustive bandwidth grid with values for each component of the bandwidth (which is often the case in others methods: Cross validation, Lepski methods…): for each bandwidth of -sized grid, the computation of a statistic from the -sized dataset needs at least operation, which leads to a complexity of order . Using the parameters used in the simulated example in section 4 (, , , ), the ratio of complexities is , and even without sparsity structure: . It means that CDRodeo run is a billion times faster on this data set.
In this section, we test the practical performances of our method. In particular, we study CDRodeo on a 5-dimensional example. The major purpose of this section is to assess if the numerical performances of our procedure. Let us describe the example. We set and and simulate an i.i.d sample with the following distribution: for any :
the first component of
follows a uniform distribution on,
the other components , , are independent standard normal and are independent of ,
is independent of , and and the conditional distribution of given is exponential with survival parameter .
The estimated conditional density function is then defined by:
This example enables us to test several criteria: sparsity detection, behaviour when fonctions are not continuous, bimodality estimation, robustness when takes small values.
In the following simulations, if not stated explicitly otherwise, Rodeo is run with sample size , product Gaussian kernel, initial bandwidth value , bandwidth decreasing factor and parameter and .
Figure 1 illustrates CDRodeo bandwidth selection. In which, the boxplots of each selected bandwidth component are built from 200 runs of CDRodeo at the point . This figure reflects the specificity of CDRodeo to capture the relevance degree of each component, and one could compare it with variable selection (as done in [LW08]). The components and are irrelevant and for this point of interest, the components and are clearly relevant while the component is barely relevant as is constant in the direction in near neighbourhood of . As expected, the irrelevant and are mostly deactivated at the first iteration, while the relevant and are systematically shrunk. The relevance degree of is also well detected as the values of are smaller than , but significantly larger than and .
Figure 2 gives CDRodeo estimation of from one -sample. The function is well estimated. In particular, irrevance, jumps and bi-modality are features which are well detected by our method. As expected, main estimation errors are made on points of discontinuity for and or at the boundaries for , and . Note that the values are particularly small at the boundaries of the plots in function of , leading to lack of observations for the estimation. Note however that null value for does not deteriorate the estimation (cf top left plot), since the estimate of vanishes automatically when there is no observation near the point of interest.
The simulations are implemented in R on a Macintosh laptop with a GHz Intel Core i7 processor. In the Figure Figure 1, the runs of CDRodeo take seconds (around minutes), or seconds per run.
We first give the outlines of the proofs in Section 5.1. To facilitate the lecture of the proof, we have divided the proofs of the main results (Proposition 1, Theorem 2 and creftypecap 3) into intermediate results which are stated in Section 5.2 and proved in Section 5.4. The proof of the main results are in Section 5.3.
5.1 Outlines of the proofs
We first prove Proposition 1 by constructing an estimator of with the wanted properties. In this proof, we use some usual properties of a kernel density estimator (control of the bias, concentration inequality), which are gathered in Lemma 4.
Theorem 2 states two results: the bandwidth selection (8) and the estimation error of the procedure (9). For the proof of the bandwidth selection (8), Proposition 8 mades explicit the highly probable behaviour of CDRodeo along a run, and thus the final selected bandwidth. In particular, the proof leans on an analysis of , which is made in two steps. We first consider , a simpler version of in which we substitute the estimator of by itself, and we detail its behaviour in Lemma 6. Then we control the difference (see in 1. of Lemma 7) to ensure behaves like .
To control the estimation error of the procedure (9), we similarly analyse in two parts: in Lemma 5, we describe the behaviour of , the simpler version of in which we substitute the estimator of by itself, and in 2. of Lemma 7, we bound the difference . Then the bandwidth selection (8) leads to the upper bound with high probability of the estimation error of (9).
Finally, we obtain the expected error of stated in creftypecap 3 by controlling the error on the residual event.
5.2 Intermediate results
For any bandwidth , we define the kernel density estimator by: for any ,
where a kernel which is compactly supported, of class ,
of order where we recall that is defined by .
We also introduce the neighbourhood
Lemma 4 ( behaviour).
We assume is on with , then for any bandwidth ,
if we denote , then
If the condition
is satisfied, then for and for any :
Lemma 5 ( behaviour).
Lemma 6 ( behaviour).
We introduce the notation , , the state of the bandwidth at iteration if . In particular for a fixed , is identical for any . Then we consider the event:
where we denote