NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
view repo
Sobolev quantities (norms, inner products, and distances) of probability density functions are important in the theory of nonparametric statistics, but have rarely been used in practice, partly due to a lack of practical estimators. They also include, as special cases, L^2 quantities which are used in many applications. We propose and analyze a family of estimators for Sobolev quantities of unknown probability density functions. We bound the bias and variance of our estimators over finite samples, finding that they are generally minimax rate-optimal. Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing. Finally, we empirically validate our estimators on synthetic data.
READ FULL TEXT VIEW PDFThis project collects the different accepted papers and their link to Arxiv or Gitxiv
quantities (i.e., inner products, norms, and distances) of continuous probability density functions are important information theoretic quantities with many applications in machine learning and signal processing. For example, estimates of the norm as can be used for goodness-of-fit testing goria05new, image registration and texture classification (hero2002aes)
, and parameter estimation in semi-parametric models
Wolsztynski85minimum.inner products estimates can be used with linear or polynomial kernels to generalize kernel methods to inputs which are distributions rather than numerical vectors.
(poczos12kernelsForImages) Estimators of distance have been used for two-sample testing (anderson94L_2TwoSampleTest; pardo05divergenceMeasures), transduction learning (quadrianto09distributionMatching), and machine learning on distributional inputs (poczos12divergenceApplication). principe10RenyiEntropy gives further applications of quantities to adaptive information filtering, classification, and clustering.quantities are a special case of less-well-known Sobolev quantities. Sobolev norms measure global smoothness of a function in terms of integrals of squared derivatives. For example, for a non-negative integer and a function with an derivative , the -order Sobolev norm is given by (when this quantity is finite). See Section 2 for more general definitions, and see leoni09Sobolev for an introduction to Sobolev spaces.
Estimation of general Sobolev norms has a long history in nonparametric statistics (e.g., schweder75windowVariance; ibragimov78nonparametricFunctionals; hall87SobolevNorms; bickel88squaredDensityDerivatives) This line of work was motivated by the role of Sobolev norms in many semi- and non-parametric problems, including density estimation, density functional estimation, and regression, (see Tsybakov:2008:INE:1522486, Section 1.7.1) where they dictate the convergence rates of estimators. Despite this, to our knowledge, these quantities have never been studied in real data, leaving an important gap between the theory and practice of nonparametric statistics. We suggest this is in part due a lack of practical estimators for these quantities. For example, the only one of the above estimators that is statistically minimax-optimal (bickel88squaredDensityDerivatives) is extremely difficult to compute in practice, requiring numerical integration over each of
different kernel density estimates, where
denotes the sample size. We know of no estimators previously proposed for Sobolev inner products and distances.The main goal of this paper is to propose and analyze a family of computationally and statistically efficient estimators for Sobolev inner products, norms, and distances. Our specific contributions are:
We propose a family of nonparametric estimators for Sobolev norms, inner products, and distances (Section 4).
We analyze the bias and variance of the estimators. Assuming the underlying density functions have bounded support in and lie in a Sobolev class of sufficient smoothness parametrized by , we show that the estimator for Sobolev quantities of order converges to the true value at the “parametric” rate of in mean squared error when , and at a slower rate of otherwise. (Section 5).
We derive asymptotic distributions for our estimators, and we use these to derive tests for the general statistical problem of two-sample testing. We also draw theoretical connections between our test and the recent work on nonparametric two-sample testing. (Section 9).
We validate our theoretical results on simulated data. (Section 8).
In terms of mean squared error, minimax lower bounds matching our convergence rates over Sobolev or Hölder smoothness classes have been shown by krishnamurthy2014renyiAndFriends for (i.e., quantities), and birge95functionalLowerBounds for Sobolev norms with integer . We conjecture but do not prove that our estimator is minimax rate-optimal for all Sobolev quantities and .
As described in Section 7, our estimators are computable in time using only basic matrix operations, where is the sample size and is a tunable parameter trading statistical and computational efficiency; the smallest value of at which the estimator continues to be minimax rate-optimal approaches as we assume more smoothness of the true density.
Let and let denote the Lebesgue measure on . For -tuples of integers, let ^{1}^{1}1We suppress dependence on ; all function spaces are over except as discussed in Section 2.1. defined by for all denote the element of the -orthonormal Fourier basis, and, for , let denote the Fourier coefficient of . ^{2}^{2}2Here, denotes the dot product on . For a complex number , denotes the complex conjugate of , and denotes the modulus of . For any , define the Sobolev space of order on by ^{3}^{3}3When , . For , should be read as , so that even when . In the case, we use the convention that .
(1) |
Fix a known and a unknown probability density functions , and suppose we have IID samples and from each of and . We are interested in estimating the inner product
(2) |
Estimating the inner product gives an estimate for the (squared) induced norm and distance, since ^{4}^{4}4 is pseudonorm on because it fails to distinguish functions identical almost everywhere up to additive constants; a combination of and is used when a proper norm is needed. However, since probability densities integrate to , is a proper metric on the subset of (almost-everywhere equivalence classes of) probability density functions in , which is important for two-sample testing (see Section 9). For simplicity, we use the terms “norm”, “inner product”, and “distance” for the remainder of the paper.
(3) |
Since our theoretical results assume the samples from and are independent, when estimating , we split the sample from in half to compute two independent estimates of , although this may not be optimal in practice.
For a more classical intuition, we note that, in the case and , (via Parseval’s identity and the identity ), that one can show the following: includes the subspace of functions with at least derivatives in and, if denotes the derivative of
(4) |
In particular, when , , , and . As we describe in the supplement, equation (4) and our results generalizes trivially to weak derivatives, as well as to non-integer via a notion of fractional derivative.
A notable restriction above is that and are supported in . In fact, our estimators and tests are well-defined and valid for densities supported on arbitrary subsets of . In this case, they act on the -periodic summation defined for by , which is itself a probability density function on . For example, the estimator for will instead estimate , and the two-sample test for distributions and will attempt to distinguish from . In most cases, this is not problematic; for example, for most realistic probability densities, and have similar orders of smoothness, and if and only if . However, there are (meagre) sets of exceptions; for example, if is a translation of by exactly , then , and one can craft a highly discontinuous function such that is uniform on . (zygmund02trigSeries) These exceptions make it difficult to extend theoretical results to densities with arbitrary support, but in practice, they are fixed simply by randomly rescaling the data (similar to the approach of chwialkowski15fastTwoSample). If the densities have (known) bounded support, they can simply be shifted and scaled to be supported on .
There is a large body of work on estimating nonlinear functionals of probability densities, with various generalizations in terms of the class of functionals considered. Table 1 gives a subset of such work, for functionals related to Sobolev quantities. As shown in Section 2, the functional form we consider is a strict generalization of norms, Sobolev norms, and inner products. It overlaps with, but is neither a special case nor a generalization of the remaining functional forms in the table.
Functional Name | Functional Form | References |
---|---|---|
norms | schweder75windowVariance; gine08L2Norm | |
(Integer) Sobolev norms | bickel88squaredDensityDerivatives | |
Density functionals | laurent92integralFunctionals; laurent96integralFunctionals | |
Derivative functionals | birge95functionalLowerBounds | |
inner products | krishnamurthy2014renyiAndFriends; krishnamurthy2014L2Divergence | |
Multivariate functionals | singh14densityFunctionals; kandasamy15vonMises |
Nearly all of the above approaches compute an optimally smoothed kernel density estimate and then perform bias corrections based on Taylor series expansions of the functional of interest. They typically consider distributions with densities that are -Hölder continuous and satisfy periodicity assumptions of order on the boundary of their support, for some constant (see, for example, Section 4 of krishnamurthy2014renyiAndFriends for details of these assumptions). The Sobolev class we consider is a strict superset of this Hölder class, permitting, for example, certain “small” discontinuities. In this regard, our results are slightly more general than most of these prior works.
Finally, there is much recent work on estimating entropies, divergences, and mutual informations, using methods based on kernel density estimates (singh14RenyiDivergence; singh14densityFunctionals; moon16improvingConvergence; krishnamurthy2014renyiAndFriends; krishnamurthy2014L2Divergence; kandasamy15vonMises) or -nearest neighbor statistics (leonenko08RenyiEntropy; poczos11alphaDivergence; moon14divergencesEnsemble; moon14divergencesConfidence). In contrast, our estimators are more similar to orthogonal series density estimators, which are computationally attractive because they require no pairwise operations between samples. However, they require quite different theoretical analysis; unlike prior work, our estimator is constructed and analyzed entirely in the frequency domain, and then related to the data domain via Parseval’s identity. We hope our analysis can be adapted to analyze new, computationally efficient information theoretic estimators.
For a non-negative integer parameter (to be specified later), let
(5) |
denote the projections of and , respectively, onto the linear subspace spanned by the -orthonormal family . Note that, since whenever , the Fourier basis has the special property that it is orthogonal in as well. Hence, since and lie in the span of while and lie in the span of , . Therefore,
(6) |
We propose an unbiased estimate of
. Notice that Fourier coefficients of are the expectations . Thus, and are independent unbiased estimates of and , respectively. Since is bilinear in and , the plug-in estimator for is unbiased. That is, our estimator for is(7) |
Here, we present our main theoretical results, bounding the bias, variance, and mean squared error of our estimator for finite .
By construction, our estimator satisfies
Thus, via (6) and Cauchy-Schwarz, the bias of the estimator satisfies
(8) |
is the error of approximating by an order- trigonometric polynomial, a classic problem in approximation theory, for which Theorem 2.2 of kreiss79Fourierstability shows:
(9) |
In combination with (8), this implies the following bound on the bias of our estimator:
(Bias bound) If for some , then, for ,
(10) |
Hence, the bias of decays polynomially in , with a power depending on the “extra” orders of smoothness available. On the other hand, as we increase , the number of frequencies at which we estimate increases, suggesting that the variance of the estimator will increase with . Indeed, this is expressed in the following bound on the variance of the estimator.
(Variance bound) If for some , then
(11) |
where and are the constants (in )
(12) |
and .
The proof of Theorem 2 is perhaps the most significant theoretical contribution of this work. Due to space constraints, the proof is given in the appendix. Combining Theorems 1 and 2 gives a bound on the mean squared error (MSE) of via the usual decomposition into squared bias and variance:
(Mean squared error bound) If for some , then
(13) |
If, furthermore, we choose (optimizing the rate in inequality 13), then
(14) |
Corollary 3 recovers the phenomenon discovered by bickel88squaredDensityDerivatives: when , the minimax optimal MSE decays at the “semi-parametric” rate, whereas, when , the MSE decays at a slower rate. Also, the estimator is -consistent if and as . This is useful in practice, since is known but is not.
In this section, we derive the asymptotic distributions of our estimator in two cases: (1) the inner product estimator and (2) the distance estimator in the case
. These results provide confidence intervals and two-sample tests without computationally intensive resampling. While (1) is more general in that it can be used with (
3) to bound the asymptotic distributions of the norm and distance estimators, (2) provides a more precise result leading to a more computationally and statistically efficient two-sample test. Proofs are given in the supplementary material.Theorem 4 shows that our estimator has a normal asymptotic distribution, assuming slowly enough as , and also gives a consistent estimator for its asymptotic variance. From this, one can easily estimate asymptotic confidence intervals for inner products, and hence also for norms.
(Asymptotic normality) Suppose that, for some , , and suppose and as . Then, is asymptotically normal with mean . In particular, for and with , define and , so that and are column vectors in . Let
denote the empirical means and covariances of and , respectively. Then, for
where denotes convergence in distribution.
Since distances can be written as a sum of three inner products (Eq. (3)), Theorem 4
might suggest an asymptotic normal distribution for Sobolev distances. However, extending asymptotic normality from inner products to their sum requires that the three estimates be independent, and hence that we split data between the three estimates. This is inefficient in practice and somewhat unnatural, as we know, for example, that distances should be non-negative. For the particular case
(as in the null hypothesis of two-sample testing), the following theorem
^{5}^{5}5This result is closely related to Proposition 4 of chwialkowski15fastTwoSample. However, in their situation, and the set of test frequencies is fixed as , whereas our set is increasing. provides a more precise asymptotic () distribution of our Sobolev distance estimator, after an extra decorrelation step. This gives, for example, a more powerful two-sample test statistic (see Section
9 for details).(Asymptotic null distribution) Suppose that, for some , , and suppose and as . For and with , define , so that is a column vector in . Let
denote the empirical mean and covariance of , and define . Then, if , then
where
denotes the quantile function (inverse CDF) of the
distribution with degrees of freedom.Let denote our estimator for (i.e., plugging into (3)). While Theorem 5 immediately provides a valid two-sample test of desired level, it is not immediately clear how this relates to , nor is there any suggestion of why the test statistic ought to be a good (i.e., consistent) one. Some intuition is as follows. Notice that
. Since, by the central limit theorem,
has a normal asymptotic distribution, if the components of were uncorrelated (and were fixed), we would expect to have an asymptotic distribution with degrees of freedom. However, because we use the same data to compute each component of , they are not typically uncorrelated, and so the asymptotic distribution of is difficult to derive. This motivates the statistic , since the components of are (asymptotically) uncorrelated.Here, we give statistical and computational considerations for choosing the smoothing parameter .
Statistical perspective: In practice, of course, we do not typically know , so we cannot simply set , as suggested by the mean squared error bound (14). Fortunately (at least for ease of parameter selection), when , the dominant term of (14) is for . Hence if we are willing to assume that the density has at least orders of smoothness (which may be a mild assumption in practice), then we achieve statistical optimality (in rate) by setting , which depends only on known parameters. On the other hand, the estimator can continue to benefit from additional smoothness computationally.
Computational perspective One attractive property of the estimator discussed is its computational simplicity and efficiency with respect to , in low dimensions. Most competing nonparametric estimators, such as kernel-based or nearest-neighbor methods, either take time or rely on complex data structures such as -d trees or cover trees (ram09linearTime) for time performance. Since computing the estimator takes time and memory (that is, the cost of estimating each of Fourier coefficients by an average), a statistically optimal choice of gives a runtime of . Since the estimate requires only a vector outer product, exponentiation, and averaging, the constants involved are small and computations parallelize trivially over frequencies and data.
Under severe computational constraints, for very large data sets, or if is large relative to , we can reduce to trade off statistical for computational efficiency. For example, if we want an estimator with runtime and space requirement for some , setting still gives a consistent estimator, with mean squared error of the order .
Kernel- or nearest-neighbor-based methods, including nearly all of the methods described in Section 3, tend to require storing previously observed data, resulting in space requirements. In contrast, orthogonal basis estimation requires storing only estimated Fourier coefficients. The estimated coefficients can be incrementally updated with each new data point, which may make the estimator or close approximations feasible in streaming settings.
In this section, we use synthesized data to demonstrate the effectiveness of our methods. A MATLAB implementation of our estimators, two-sample tests, and experiments is available at https://github.com/sss1/SobolevEstimation. For all experiments, we use samples for estimation.
We first test our estimators on 1D distances. Figure 0(a) shows estimated distance between and ; Figure 0(b) shows estimated distance between and ; Figure 0(c) shows estimated distance between Unif and Unif; Figure 0(d) shows estimated distance between and a triangular distribution whose density is highest at . Error bars indicate asymptotic confidence intervals based on Theorem 4. These experiments suggest samples is sufficient to estimate
distances with high confidence. Note that we need fewer samples to estimate Sobolev quantities of Gaussians than, say, of uniform distributions, consistent with our theory, since (infinitely differentiable) Gaussians are smoothier than (discontinuous) uniform distributions.
Next, we test our estimators on distances of multivariate distributions. Figure 0(e) shows estimated distance between and ; Figure 0(f) shows estimated distance between and . Again, these experiments show that our estimators can also handle multivariate distributions.
Lastly, we test our estimators for norms. Figure 0(g) shows estimated norm of and Figure 0(h) shows norm of . Notice that we need fewer samples to estimate than , which verifies our theory.
Here, we discuss the use of our estimator in two-sample testing. There is a large literature on nonparametric two-sample testing, but we discuss only some recent approaches with theoretical connections to ours.
Let denote our estimate of the Sobolev distance, consisting of plugging into equation (3). Since is a metric on the space of probability density functions in , computing leads naturally to a two-sample test on this space. Theorem 5
suggests an asymptotic test, which is computationally preferable to a permutation test. In particular, for a desired Type I error rate
our test rejects the null hypothesis if and only if .When
, this approach is closely related to several two-sample tests in the literature based on comparing empirical characteristic functions (CFs). Originally, these tests
(heathcote1972test; epps86twoSampleECF) computed the same statistic with a fixed number of random -valued frequencies instead of deterministic -valued frequencies. This test runs in linear time, but is not generally consistent, since the two CFs need not differ almost everywhere. Recently, chwialkowski15fastTwoSample suggested using smoothed CFs, i.e., the convolution of the CF with a universal smoothing kernel . This is computationally easy (due to the convolution theorem) and, when , for almost all , reducing the need for carefully choosing test frequencies. Furthermore, this test is almost-surely consistent under very general alternatives. However, it is not clear what sort of assumptions would allow finite sample analysis of the power of their test. Indeed, the convergence as can be arbitrarily slow, depending on the random test frequencies used. Our analysis instead uses the assumption to ensure that small, -valued frequencies contain most of the power of . ^{6}^{6}6Note that smooth CFs can be used in our test by replacing with , whereis the inverse Fourier transform of a characteristic kernel. However, smoothing seems less desirable under Sobolev assumptions, as it spreads the power of the CF away from small
-valued frequencies where our test focuses.These fixed-frequency approaches can be thought of as the extreme point of the computational/statistical trade-off described in section 7: they are computable in linear time and (with smoothing) are strongly consistent, but do not satisfy finite-sample bounds under general conditions.
At the other extreme () are MMD-based tests of gretton06MMD_NIPS; gretton12MMD_JMLR, which utilize the entire spectrum . These tests are statistically powerful and have strong guarantees for densities in an RKHS, but have computational complexity. ^{7}^{7}7Fast MMD approximations have been proposed, including the Block MMD, (zaremba13blockMMD) FastMMD, (zhao15fastMMD) and sub-sampled MMD, but these lack the statistical guarantees of MMD. The computational/statistical trade-off discussed in Section 7
can be thought of as an interpolation (controlled by
) of these approaches, with runtime in the case approaching quadratic for large and small .In this paper, we proposed nonparametric estimators for Sobolev inner products, norms and distances of probability densities, for which we derived finite-sample bounds and asymptotic distributions.
A natural follow-up question to our work is whether estimating smoothness of a density can guide the choice of smoothing parameters in nonparametric estimation. For some problems, such as estimating functionals of a density, this may be especially useful, since no error metric is typically available for cross-validation. Even when cross-validation is an option, as in density estimation or regression, estimating smoothness may be faster, or may suggest an appropriate range of parameter values.
(Variance Bound) If for some , then
(15) |
where and are the constants (in )
and .
Proof: We will use the Efron-Stein inequality (efronStein81) to bound the variance of . To do this, suppose we were to draw additional IID samples , and define, for all ,
Let
denote our estimate when we replace by . Noting the symmetry of in and , the Efron-Stein inequality tells us that
(16) |
where the expectation above (and elsewhere in this section) is taken over all samples . Expanding the difference in (16), note that any terms with cancel, so that ^{8}^{8}8It is useful here to note that and that .
and so
(17) |
Since and are IID,
and, since are IID,
In view of these two equalities, taking the expectation of (17) and using the fact that and are independent of , (17) reduces:
Comments
There are no comments yet.