The empirical Christoffel function is defined by an input measure , which is a scaled counting measure uniformly supported on a cloud of data-points, and by a degree . It has a strong connection to the population Christoffel function associated to a measure with density on an unknown input set . In particular, is typically obtained from a sample drawn from , in which case can be seen as an estimate of the population Christoffel function (see Lasserre and Pauwels (2019)).
The (population) Christoffel function itself has a long history of research in the mathematical analysis literature. Its construction is based on multivariate polynomials of degree at most and it has strong links to the theory of orthogonal polynomials. In particular, the asymptotic behavior of the Christoffel function as the degree increases provides useful information regarding the support and density of the associated input measure . Important references in multivariate settings include Bos (1994); Bos et al. (1998); Xu (1999a); Kroó and Lubinsky (2013a, b), which concern specific cases of the input measure and set . These works not only provide valuable information on the asymptotics of the population Christoffel function as goes to infinity, but also motivate the use of this function in statistical contexts, especially in support recovery. Indeed, Lasserre and Pauwels (2019) provides a thresholding scheme using the Christoffel function which approximates the compact support of the measure with strong asymptotic guarantees. More precisely, Lasserre and Pauwels (2019) considers a family of polynomial sublevel sets with , where the degree increases with and where the threshold is well-chosen between a lower bound of the Christoffel function inside and an upper bound outside . Another thresholding scheme can be found in Marx et al. (2019), which provides useful results on the relation between and its estimator . The topic of set estimation based on the population Christoffel function is thus currently a subject of active interest with a large range of applications in machine learning (see Pauwels and Lasserre (2016); Lasserre and Pauwels (2019)).
In a statistical context, the population Christoffel function is not available and only the empirical Christoffel function is, based on the observed empirical measure . Let us detail results and discussions presented in Lasserre and Pauwels (2019). Statistical procedures based on the empirical Christoffel function have three important features: (i) computations are remarkably simple and involve no optimization procedures, (ii) they scale efficiently with the number of observations and (iii) the procedures are affine invariant. Furthermore, when considering a compactly supported population measure as well as its empirical counterpart supported on a sample of vectors in , drawn independently from and when the degree is fixed, the empirical Christoffel function converges uniformly to , almost surely with respect to the draw of the random sample. This asymptotic result is appealing given the strong connections between and the support of , which suggest that could be used for inferring the support of the population measure . Yet a more precise quantification of the relation between the sample size and the degree bound is required: Lasserre and Pauwels (2019) does not provide any explicit way to choose the degree as a function of , and does not provide any convergence guarantee for the full plug-in approach based on the empirical Christoffel function , when depends on . These shortcomings constitute one of the main motivations for the present work.
Our contribution is twofold:
We adapt the thresholding scheme in Lasserre and Pauwels (2019), using the empirical Christoffel function, by a careful tuning of the degree and the threshold level set in the limit of large sample size. This scheme makes it possible to estimate the compact support of a measure. Our results include, under regularity assumptions on , a quantitative rate of convergence analysis which was unknown for this estimator. More precisely, we consider the Hausdorff distance between the original set and its estimator and between their respective boundaries, as well as the Lebesgue measure of their symmetric difference. These results rigorously establish that, when is large enough, these distances decrease to zero with an explicit rate.
This analysis relies on results which could be considered of independent interest. First, we provide a quantitative concentration result regarding the convergence of the empirical Christoffel function to its population counterpart. Second, this concentration relies on an estimate of the supremum of the Christoffel - Darboux kernel on the support of the underlying measure. We prove that, for a large class of slowly decaying densities with smooth support boundary, this supremum is at most polynomial in the degree . This shows that the considered class of measures is regular in the sense of the Bernstein-Markov property, see (Piazzon, 2016) and references therein.
Comparison with the existing literature on set estimation
Support inference (more generally set estimation) has been a topic of research in the statistics literature for more than half a century. The main subject of interest is to infer a set (support of an input measure, level set of an input density function,…) based on samples that are drawn independently from an unknown distribution. An introduction and first results on this subject can be found in Rényi and Sulanke (1963); Geffroy (1964), which motivated a subsequent analysis of estimators based on convex hulls for convex domains (Chevalier, 1976) or unions of balls for non-convex sets (Devroye and Wise, 1980). More involved estimators followed, such as the excess mass estimator (Polonik et al., 1995) or the plug-in approach based on the use of density estimators (Cuevas et al., 1997; Molchanov, 1998; Cuevas et al., 2006). Those works also motivated the development of minimax statistical analysis for the set estimation problem. Minimax results can be found for the recovery of sets with (piecewise) smooth boundaries in Mammen and Tsybakov (1995), for the estimation of smooth or convex density level sets in Tsybakov et al. (1997) and for the plug-in approach in Rigollet et al. (2009). More recent works related to set estimation include local convex hull estimators (Aaron and Bodart, 2016) and cone-convex hulls (Cholaquidis et al., 2014).
We obtain convergence rates both in terms of symmetric difference measure and Hausdorff distance, which can be arbitrarily close to where is the sample size, is the ambient dimension and measures the speed of decrease of the population density around the boundary of the support ( corresponds to a density which is uniformly bounded away from ). In Cuevas and Rodríguez-Casal (2004), the Devroye and Wise estimator is shown to have a convergence rate of order in Hausdorff distance (as we do, they consider Hausdorff distances both between sets and between their boundaries), under geometric assumptions similar to ours corresponding to the choice . Later on, Biau et al. (2008) proved for the same estimator, under assumptions similar to ours, a rate which can be arbitrarily close to for the measure of the symmetric difference for and . Earlier work presented in Mammen and Tsybakov (1995) proved that is minimax optimal for the symmetric difference measure for a special class of piece-wise boundaries. Recently, Patschkowski et al. (2016) proved a minimax lower bound on the convergence rate for symmetric difference, of order for adaptive estimators to unknown . Although the rate which we obtain is not optimal, its dependency on the dimension and on the speed of decrease of the density seems reasonable in comparison to existing rates. Let us insist on the fact that our analysis covers a wide range of density decrease regimes and a variety of divergence measures between sets for which the results for other estimators are not known. A detailed comparison between all geometric conditions on the support, its boundary and different notions of divergence between sets is out of reach given the diversity of assumptions in the literature, and as such we only consider a high-level general discussion based on orders of magnitude here.
From a computational point of view, our approach using the empirical Christoffel function has important advantages. The most important one is that this approach estimates the support of by a polynomial sublevel set, which is conceptually simple to manipulate. As an important illustrative example, consider the situation when one is interested in performing numerical optimization over the estimated support. This situation can arise when a criterion is to be optimized over a feasible domain, which needs to be estimated from data. In this optimization case, the fact that the estimated support is a polynomial sublevel set is beneficial: for instance, one can use nonlinear optimization techniques such as Sequential Quadratic Programming (SQP) or barrier functions. If the support is estimated by a union of balls centered at the observations (Devroye and Wise, 1980), the estimated support may be less amenable to numerical optimization. In terms of numerical implementation, our approach requires computing and storing the inverse of a matrix of size (see Sections 2 and 3) where is the selected degree for the sample size . Then, each input point can be tested for membership in the estimated support, at the cost of evaluating a quadratic form of size and of computing monomials in dimension . In practice, is smaller than (to avoid rank deficiencies), and in our asymptotic results, is selected such that .
Organisation of the paper
Section 2 introduces the notation and definitions which will be used throughout the text, especially the definition of the population and empirical Christoffel functions and their known properties. In Section 3, we present our main assumptions as well as our results on support estimation and convergence of the empirical Christoffel function to the population one. Concluding remarks are provided in Section 4. The proofs are postponed to the appendix. The appendix also contains additional results of interest on upper and lower bounds on the Christoffel function, outside and inside the support.
2.1 General notations
When is a measure on , we denote by the support of . Let be a measurable function from to . The push-forward measure of by , denoted by , is a measure on defined by: for all Borel sets of . Given an arbitrary (measurable) set , we denote by the interior of , the boundary of , the complement of , the Lebesgue measure of , the diameter of , the Lebesgue measure restricted on and the uniform measure on (when ).
When is a square matrix, we denote by the operator norm of , i.e.
If in addition, is symmetric and positive definite, we can define its inverse and its unique square root which are also symmetric positive definite matrices. We denote by the inverse of the square root of , which is also symmetric and positive definite.
For , we let be the Euclidean norm between and . For and , let .
We also denote by the open Euclidean ball of radius and centered at while is the associated closed ball. In particular, denotes the unit Euclidean ball .
We denote by
the surface area of the -dimensional unit sphere in . We denote by
the normalization constant of the measure whose density is on the unit ball (see e.g. Xu (1999b), page 2441, (2.2)). Finally, for and , the associated binomial coefficient is defined as follows:
2.2 Problem setting
The following notation and assumptions will be standing throughout the text.
is a Borel probability measure on and its support is compact with nonempty interior.
, is fixed and are independent and identically distributed random vectors with distribution . The corresponding empirical measure is denoted by
where is the Dirac measure at .
Using the notations of Assumption 2.1, given the sample our goal is to build an estimator in order to approximate . We construct a specific kind of estimator based on the empirical Christoffel function. The rest of this section is dedicated to the presentation of further background needed to define our estimator. Convergence of our estimator to using different criteria is described next in Section 3.
2.3 The Christoffel function
Polynomials of variables are indexed by the set of multi-indices. For example, given a set of variables and a multi-index , the monomial is given by , whose degree is
The space of polynomials of degree at most is the linear span of monomials of degree up to :
The space of polynomials of variables is
The degree of a polynomial , denoted by , is the maximum degree of its monomial associated to a nonzero coefficient (the null polynomial has degree 0). Note that , we denote by the quantity throughout the text.
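The indexing of monomials by multi-indices can be made concrete with a short sketch. It enumerates the exponents of all monomials of total degree at most d in p variables and compares their number with the binomial coefficient "p + d choose d", which is the standard dimension of this polynomial space. The function names `multi_indices` and `space_dimension` are illustrative and do not come from the paper.

```python
from itertools import product
from math import comb


def multi_indices(p, d):
    """Exponent tuples (a_1, ..., a_p) of all monomials with a_1 + ... + a_p <= d."""
    return [a for a in product(range(d + 1), repeat=p) if sum(a) <= d]


def space_dimension(p, d):
    """Dimension of the space of polynomials of degree at most d in p variables."""
    return comb(p + d, d)


indices = multi_indices(p=2, d=3)
print(len(indices), space_dimension(2, 3))  # prints "10 10"
```

The brute-force enumeration filters a grid of size (d + 1)^p, which is fine for small p and d; a recursive generator would be preferable in higher dimension.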
Since satisfies Assumption 2.1, we have the following inner product:
where are polynomials. A sequence of orthonormal polynomials with respect to in is a sequence of polynomials such that for all (here is if , otherwise ). The Gram-Schmidt orthonormalization process guarantees the existence of such an orthonormal sequence. Restricting the degree up to , we obtain which is also a basis of .
Now, let be a basis of (not necessarily orthonormal). We denote
The moment matrix of with respect to the basis is a square matrix of dimension which is defined by
where the integral is taken entry-wise. We have the following property of the moment matrix which is useful in the sequel.
Let whose representations with respect to the basis are
where . Then
The Christoffel - Darboux kernel
The space of polynomials of degree at most along with the inner product defined by is then a finite-dimensional Hilbert space of functions from to and . Moreover, is a reproducing kernel Hilbert space (RKHS) (see Aronszajn (1950) for the definition). Indeed, we notice that the function is linear on the space of polynomials and is finite-dimensional (hence all norms are equivalent), therefore we obtain the continuity of this function on for any . This property of guarantees the existence and uniqueness of a reproducing kernel which is defined as follows.
The Christoffel - Darboux kernel, denoted by , is the reproducing kernel of the RKHS , i.e. for all and , we have and
The two following propositions are explicit formulas for the Christoffel - Darboux kernel. The first one is its expression as a sum of squares of orthonormal polynomials, while the other is a computation based on the moment matrix (and does not require an orthonormal basis).
Proposition 2.4 (see e.g Dunkl and Xu (2014), page 97, (3.6.3)).
Let be an orthonormal basis of with respect to . Then for all
Proposition 2.5 (see e.g. Lasserre and Pauwels (2019), page 7, (3.1)).
Let be a basis of and be the corresponding moment matrix (see (2.4)). For all , we have
By Proposition 2.4,
where is an orthonormal basis of . Moreover, the cannot be all since otherwise, the polynomial will be at point , which is impossible. So for all .
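The agreement between the two formulas (Propositions 2.4 and 2.5) can be checked numerically on a simple case. The sketch below assumes, for illustration only, the uniform probability measure on [-1, 1]. Its moment matrix in the monomial basis has entries 1/(i + j + 1) when i + j is even and 0 otherwise, and the rescaled Legendre polynomials sqrt(2k + 1) P_k form an orthonormal basis for it. The function names are ours.

```python
import numpy as np
from numpy.polynomial import legendre


def moment_matrix_uniform(d):
    """Moment matrix of the uniform probability measure on [-1, 1], basis 1, x, ..., x^d."""
    M = np.zeros((d + 1, d + 1))
    for i in range(d + 1):
        for j in range(d + 1):
            if (i + j) % 2 == 0:
                M[i, j] = 1.0 / (i + j + 1)  # integral of x^(i+j) dx / 2 over [-1, 1]
    return M


def kernel_from_moments(x, y, d):
    """Kernel as v(x)^T M^{-1} v(y), with v the vector of monomials (moment-matrix formula)."""
    vx = np.array([x ** k for k in range(d + 1)])
    vy = np.array([y ** k for k in range(d + 1)])
    return vx @ np.linalg.solve(moment_matrix_uniform(d), vy)


def kernel_from_orthonormal(x, y, d):
    """Kernel as a sum of products of orthonormal polynomials sqrt(2k+1) P_k."""
    total = 0.0
    for k in range(d + 1):
        coeffs = np.zeros(k + 1)
        coeffs[k] = 1.0  # selects the Legendre polynomial P_k
        total += (2 * k + 1) * legendre.legval(x, coeffs) * legendre.legval(y, coeffs)
    return total


print(kernel_from_moments(0.3, -0.7, 4), kernel_from_orthonormal(0.3, -0.7, 4))
```

Both values should coincide up to rounding. The moment-matrix route needs no orthonormal basis, but the monomial moment matrix becomes ill-conditioned as the degree grows, so this check is only meaningful for moderate d.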
The Christoffel function
Now, we will define the (population) Christoffel function and provide some of its properties which are useful for the sequel.
Let . The Christoffel function associated to and is the function
Note that the Christoffel function is well-defined by the positivity of the Christoffel - Darboux kernel. The following proposition is an equivalent definition of the Christoffel function.
Proposition 2.7 (see e.g. Dunkl and Xu (2014), Theorem 3.6.6).
We now highlight the following properties of the Christoffel function which will be useful in the sequel. The following proposition guarantees the invariance of the Christoffel function by affine transformations.
Proposition 2.8 (see e.g. Pauwels and Lasserre (2016), Lemma 1).
Let be an invertible affine map from to . Recall that is the push-forward measure of by . Then for all ,
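Affine invariance can be illustrated numerically on an empirical measure. The sketch below assumes the moment-matrix formula of Proposition 2.5, so that the Christoffel function at x is 1 / (v(x)^T M^{-1} v(x)) with v the vector of monomials. It evaluates the function at a point for a sample, then at the image of that point for the image of the sample under an invertible affine map. The sample, the map and the function names are arbitrary illustrative choices.

```python
import numpy as np
from itertools import product


def monomials(X, d):
    """Evaluate all monomials of total degree <= d at the rows of X (shape n x p)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    p = X.shape[1]
    exponents = [a for a in product(range(d + 1), repeat=p) if sum(a) <= d]
    return np.stack([np.prod(X ** np.array(a), axis=1) for a in exponents], axis=1)


def empirical_christoffel(sample, x, d):
    """1 / (v(x)^T M^{-1} v(x)), M being the moment matrix of the empirical measure."""
    V = monomials(sample, d)
    M = V.T @ V / len(sample)
    v = monomials(x, d)[0]
    return 1.0 / (v @ np.linalg.solve(M, v))


rng = np.random.default_rng(0)
sample = rng.uniform(-1.0, 1.0, size=(500, 2))
A = np.array([[2.0, 1.0], [0.5, 3.0]])  # invertible linear part
b = np.array([1.0, -2.0])               # translation
x = np.array([0.2, -0.4])

before = empirical_christoffel(sample, x, d=3)
after = empirical_christoffel(sample @ A.T + b, A @ x + b, d=3)
print(before, after)
```

The two printed values should agree up to numerical error, since composing a polynomial of degree at most d with an affine map yields a polynomial of degree at most d.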
The next proposition expresses the monotonicity property of the Christoffel function. It is a direct consequence of Proposition 2.7.
If is a Borel measure on , such that , in the sense that for all Borel sets , then for all ,
2.4 The empirical Christoffel function
The Christoffel function associated to (see Assumption 2.1), is called the empirical Christoffel function. It is to be compared to the population Christoffel function . The convergence of the empirical Christoffel function towards its population counterpart as and for a fixed has been shown in (Lasserre and Pauwels, 2019). This allows, by a careful choice of the threshold and the degree , the construction of a sequence of polynomial sublevel sets
which estimate the support . It is worth mentioning that the empirical Christoffel function can be computed using the inversion of a square matrix of size thanks to Proposition 2.5.
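The computation described above can be sketched as follows: build the empirical moment matrix in the monomial basis, invert it once, and test points by evaluating a quadratic form. This is a minimal sketch under our own assumptions, not the authors' implementation. The class name, the monomial basis and in particular the threshold used in the example (the smallest value of the empirical Christoffel function over the sample) are illustrative choices. The tuned degree and threshold of the paper are not reproduced here. A point is declared inside the estimated set when the empirical Christoffel function is at least the threshold.

```python
import numpy as np
from itertools import product


class ChristoffelSupportEstimator:
    """Plug-in support estimator built on the empirical Christoffel function."""

    def __init__(self, sample, degree):
        sample = np.atleast_2d(np.asarray(sample, dtype=float))
        n, p = sample.shape
        # Exponents of all monomials of total degree <= degree in p variables.
        self.exponents = np.array(
            [a for a in product(range(degree + 1), repeat=p) if sum(a) <= degree]
        )
        V = self._monomials(sample)
        moment_matrix = V.T @ V / n  # empirical moment matrix
        # Inverted once and stored; n must be large enough for invertibility.
        self.inverse_moment = np.linalg.inv(moment_matrix)

    def _monomials(self, X):
        X = np.atleast_2d(np.asarray(X, dtype=float))
        return np.prod(X[:, None, :] ** self.exponents[None, :, :], axis=2)

    def christoffel(self, X):
        """Empirical Christoffel function at each row of X."""
        V = self._monomials(X)
        kernel_diagonal = np.einsum("ij,jk,ik->i", V, self.inverse_moment, V)
        return 1.0 / kernel_diagonal

    def contains(self, X, threshold):
        """True where the empirical Christoffel function is at least the threshold."""
        return self.christoffel(X) >= threshold


# Example: sample drawn uniformly from the unit disk.
rng = np.random.default_rng(0)
radius = np.sqrt(rng.uniform(size=2000))
angle = rng.uniform(0.0, 2.0 * np.pi, size=2000)
sample = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

estimator = ChristoffelSupportEstimator(sample, degree=4)
threshold = estimator.christoffel(sample).min()  # illustrative choice only
print(estimator.contains([[0.0, 0.0], [3.0, 3.0]], threshold))
```

Storing the inverse makes each membership test cost one quadratic form plus the evaluation of the monomials. With the monomial basis the moment matrix can be poorly conditioned for larger degrees, in which case a better-conditioned basis would be a natural substitute.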
3 Main results
From now on, we consider the case where the probability measure has density with respect to Lebesgue measure. Our main result is that for a large enough number of observations , by suitably choosing a degree and a threshold for the empirical Christoffel function , we obtain a sequence of polynomial sublevel sets
which approximates the support of . More explicitly, we show that under smoothness assumptions on , is close to both in Hausdorff distance and Lebesgue measure of their symmetric difference. For any , we obtain an explicit convergence rate of order
where measures the speed of decrease of the density of , , close to , see Assumption 3.5.
Those results are obtained from the following materials:
1. Properties of the population Christoffel function. We provide a lower bound on the Christoffel function in the interior of the support and an upper bound in the exterior of . We also provide a bound on the supremum of the Christoffel-Darboux kernel on . Those results will be discussed in Appendices A and B.
2. Concentration results for the speed of convergence of the empirical Christoffel function to its population counterpart . This part requires the above mentioned bound on the supremum of the Christoffel - Darboux kernel. Those results could be of independent interest and will be discussed in Subsection 3.4 with all the proofs in Appendix C.
3. We introduce a thresholding scheme using the empirical Christoffel function as in (3.1) by a careful tuning of the degree and the threshold in the limit of large sample size . With this thresholding scheme, we prove the desired results described in (3.2). The details will be in Subsection 3.3 with proofs postponed to Appendix D.
3.2 Conditions on the support and the density
Throughout the text, we consider a probability measure which is supported on and has density .
3.2.1 Assumptions on the support
We first introduce the following definitions, notations and assumptions.
Consider a closed set and a constant . We say that a ball of radius rolls inside if for any , there exists a ball centered at of radius such that . If a ball of radius rolls inside , then we say that a ball of radius rolls outside .
Consider a closed set . Denote by the -extension of , defined as
We also define the volume function
where we recall that denotes the Lebesgue measure of a set.
is a compact set which has non-empty interior and satisfies:
There exists such that a ball of radius rolls inside and outside .
For small ,
where is a constant which only depends on .
We will rely on Assumption 3.3 for our results and proofs. The first part of this assumption is made relatively frequently in the support inference literature, see for instance Cuevas and Rodríguez-Casal (2004). This part is interpreted as meaning that the boundary of is smooth. In particular, this assumption prevents corners in the boundary of . The case of sets with non-smooth boundaries is a future research topic of interest that is not addressed here for the sake of concision. The second part of Assumption 3.3 will be needed when working with the Lebesgue measure of the symmetric difference.
Next, we provide a class of sets with some geometric properties, under which Assumption 3.3 holds.
This lemma is a sufficient condition on so that satisfies Assumption 3.3. Its proof requires the tubular neighborhood theorem from differential geometry (for the first part of Assumption 3.3) and Weyl’s tube formula (for the second one). The details of the proof are presented in Appendix E. Smoothness of the support boundary was assumed by Biau et al. (2008) to analyse the Devroye and Wise estimator.
3.2.2 Assumption on the density
Now, for , we set
The next assumption concerns the rate of decay of the density of at the boundary of the support .
The density is such that and for all , we have
where and are fixed constants (depending only on ).
3.3 Main results for support estimation
First, we design our thresholding scheme using the empirical Christoffel function . This thresholding scheme depends on the constants given by the assumptions on (Assumptions 3.3 and 3.5). It also depends on a constant which can be made arbitrarily small (a smaller leads to a better rate of convergence), and on a constant which is small such that our following results hold with probability .
Given and , we define
The explicit results for this thresholding scheme will be presented in the next subsections. First, we set
3.3.1 Result for the Hausdorff distance between two sets and two boundaries
Recall the definition of the Hausdorff distance between two subsets of :
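In its usual form, the Hausdorff distance between two sets A and B (labels chosen here for illustration) is the larger of two quantities: the supremum over points of A of their distance to B, and the supremum over points of B of their distance to A. For finite point clouds, such as discretizations of a set and of its estimator, it can be computed directly. The sketch below is a brute-force version with an illustrative function name.

```python
import numpy as np


def hausdorff_distance(A, B):
    """Hausdorff distance between two finite point clouds given as rows of A and B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # pairwise[i, j] is the Euclidean distance between A[i] and B[j].
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).max()  # farthest point of A from the set B
    b_to_a = pairwise.min(axis=0).max()  # farthest point of B from the set A
    return max(a_to_b, b_to_a)


A = [[0.0, 0.0], [1.0, 0.0]]
B = [[0.0, 0.0], [3.0, 0.0]]
print(hausdorff_distance(A, B))  # prints 2.0
```

In the example, every point of A is within distance 1 of B, while the point (3, 0) of B is at distance 2 from A, so the distance is 2. The full pairwise matrix costs memory proportional to the product of the two cloud sizes, which limits this version to moderately sized clouds.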
The following result provides an explicit quantitative rate of convergence for the estimation of using the thresholding scheme (3.3) based on the empirical Christoffel function. More explicitly, this estimation of by is measured by the Hausdorff distance between them and between their boundaries. Thus, this theorem is one of the most important results of this paper.
3.3.2 Result for the Lebesgue measure of the symmetric difference between two sets
Recall the definition of the symmetric difference between two subsets of :
In this section, in order to measure the convergence of the estimator to the true set , we will use the Lebesgue measure of their symmetric difference:
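The symmetric difference of two sets consists of the points belonging to exactly one of them, and its Lebesgue measure can be approximated by Monte Carlo integration when both sets have computable indicator functions and lie in a known bounding box. The sketch below is such an approximation. The function name, the bounding box and the two disks in the example are illustrative assumptions.

```python
import numpy as np


def symmetric_difference_volume(in_A, in_B, lower, upper, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the Lebesgue measure of the symmetric difference.

    in_A and in_B map an array of points (rows) to boolean membership arrays;
    both sets are assumed to lie inside the box with corners lower and upper.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    points = rng.uniform(lower, upper, size=(n_draws, lower.size))
    in_exactly_one = in_A(points) != in_B(points)  # XOR of the two indicators
    return np.prod(upper - lower) * in_exactly_one.mean()


def unit_disk(X):
    return np.linalg.norm(X, axis=1) <= 1.0


def larger_disk(X):
    return np.linalg.norm(X, axis=1) <= 1.1


estimate = symmetric_difference_volume(unit_disk, larger_disk, [-1.5, -1.5], [1.5, 1.5])
exact = np.pi * (1.1 ** 2 - 1.0 ** 2)
print(estimate, exact)
```

The estimate should be close to the exact annulus area, with a Monte Carlo error decreasing like the inverse square root of the number of draws. A tighter bounding box reduces the variance.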
The following result, which is a counterpart of Theorem 3.6 for the Lebesgue measure of the symmetric difference, is the second main result of this paper.
The order of magnitude of the error for the thresholding scheme (3.3) is for both the Hausdorff distance between two sets and between their boundaries as well as the Lebesgue measure of their symmetric difference. Since can be taken arbitrarily small, the rate of convergence is essentially .
The tuning of and in (3.3) depends on the constants and from Assumption 3.5 and on the constant from Assumption 3.3. In practice, these constants are typically unknown, but then the values of and can be selected in a data driven way, for instance by cross validation.
On a theoretical level, the main aim of this paper is to show that it is possible to obtain rates of convergence, by selecting and according to the constants , and . For the sake of concision, the situation where , and are estimated from data is not studied in this paper. Let us nevertheless discuss it briefly here. First, we remark that if Assumptions 3.5 and 3.3 hold with constants , and , then they hold a fortiori with constants , and . Hence, in order to obtain rates of convergence, it is sufficient to tune and based on conservative values of and that are overly small and of that are overly large, such that Assumptions 3.5 and 3.3 hold. Obtaining conservative values is statistically easier than obtaining the sharpest possible values of , and such that Assumptions 3.5 and 3.3 hold. Another important question is adaptivity: obtaining a procedure based on the Christoffel function, with no knowledge of the values of , and such that Assumptions 3.5 and 3.3 hold, and which yields the same rates of convergence as when knowing the sharpest values of , and such that Assumptions 3.5 and 3.3 hold.
3.3.3 Sketch of the main proofs
First, we suppose that the estimation of the population Christoffel function by its empirical counterpart can be controlled. More explicitly, we assume that there exists a constant such that for all ,
Now we introduce a sequence of polynomial sublevel sets which estimates the support using the empirical function where does not depend on .
For fixed and for , we define:
The idea of this estimator comes from Marx et al. (2019), Section 4.1. The difference is that we let be arbitrarily small for a better rate of convergence (instead of setting as in Marx et al. (2019)). Moreover, by choosing the threshold carefully, we obtain an estimator such that not only is contained in a small enlargement of (which has been shown in Marx et al. (2019)), but we also have a small enlargement of that contains . The explicit result is as follows.
The above relation between and is important since the difference between and is controlled by a decreasing sequence:
By adding some assumptions on , we can obtain results concerning the Hausdorff distances and the Lebesgue measure of the symmetric difference between and .
Now, under Assumption 3.3 - part 1 and Assumption 3.5 and thanks to the concentration results in Subsection 3.4, we can select such that (3.4) holds with high probability with . Subsequently, we can select a threshold that will optimize the convergence rate of to . We obtain now the thresholding scheme (3.3) and all the results regarding the Hausdorff distances and the Lebesgue measure of the symmetric difference between and will follow.
All details of the proofs are postponed to Appendix D for the sake of clarity.
3.4 A concentration result for the approximation of the Christoffel function by its empirical counterpart
Let be a measure which satisfies Assumption 2.1 and be the corresponding empirical measure. We consider now the speed of convergence of the empirical Christoffel function towards . All the proofs of the following results will be postponed to Appendix C.
First, we state below a technical lemma which bounds uniformly the quantity
by the operator norm of a moment-based random matrix.
Let be a basis of orthonormal polynomials with respect to . Denote by the moment matrix of with respect to the basis (see Subsection 2.3). Then for all , we have
where we recall that the norm of matrices is the operator norm.
Note that is actually the associated moment matrix of with respect to the basis . Now, to control the operator norm of the random matrix , we rely on Theorem 5.44 from Vershynin (2010). The following theorem makes use of this random matrix result and of Lemma 3.9 to obtain an upper bound for the quantity with high probability.
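The behavior of such a random matrix can be explored numerically. The sketch below assumes that the matrix of interest is the difference between the empirical moment matrix, computed in a basis that is orthonormal for the population measure, and the identity. It also assumes, for illustration, the uniform probability measure on [-1, 1], for which the rescaled Legendre polynomials sqrt(2k + 1) P_k are orthonormal. The function names are ours. The operator norm of this difference is expected to shrink as the sample size grows.

```python
import numpy as np
from numpy.polynomial import legendre


def orthonormal_legendre(x, d):
    """Columns sqrt(2k+1) P_k(x), k = 0..d: orthonormal for the uniform law on [-1, 1]."""
    V = legendre.legvander(x, d)  # shape (n, d + 1), columns P_0, ..., P_d
    return V * np.sqrt(2 * np.arange(d + 1) + 1)


def moment_deviation(n, d, seed=0):
    """Operator norm of (empirical moment matrix - identity) for a sample of size n."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    V = orthonormal_legendre(x, d)
    empirical_moments = V.T @ V / n
    return np.linalg.norm(empirical_moments - np.eye(d + 1), ord=2)


for n in (100, 10_000, 100_000):
    print(n, moment_deviation(n, d=3))
```

For a fixed degree the printed norms should decrease roughly like the inverse square root of n. Increasing the degree at fixed n makes the deviation larger, which is the trade-off that the tuning of the degree has to balance.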
Let be a measure which satisfies Assumption 2.1 and be the corresponding empirical measure. Then for all and , we have
with probability at least , where
Note that in our case, the supremum of the Christoffel - Darboux kernel has a quantitative upper bound of order which is of independent interest and will be provided in Appendix B. The following corollary is a consequence of Theorem 3.10 combined with Theorem B.3, and is useful in the tuning of for the thresholding scheme (3.3).