1 Introduction
Fix a compact sample space $\mathcal{X} \subseteq \mathbb{R}^d$. Let $\mathcal{P}$ denote the family of all Borel probability measures on $\mathcal{X}$. For each $P \in \mathcal{P}$, let $\tilde P : \mathbb{Z}^d \to \mathbb{C}$ denote the characteristic function of $P$, given by
$$\tilde P(z) := \mathbb{E}_{X \sim P}\left[e^{-i \langle z, X \rangle}\right],$$
where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product on $\mathbb{R}^d$.
For any family $a = \{a_z\}_{z \in Z}$ of real-valued coefficients indexed by a countable set $Z \subseteq \mathbb{Z}^d$, define a subspace
$$\mathcal{H}_a := \left\{ P \in \mathcal{P} : \sum_{z \in Z} a_z^2 \left|\tilde P(z)\right|^2 < \infty \right\}.$$
Now fix two unknown probability measures $P, Q \in \mathcal{P}$. We study estimation of the semi-inner product^{1}
(1) $\displaystyle \langle P, Q \rangle_b := \sum_{z \in Z} b_z^2\, \tilde P(z)\, \overline{\tilde Q(z)},$
as well as the squared seminorm $\|P\|_b^2 := \langle P, P \rangle_b$ and pseudometric $\|P - Q\|_b$, using IID samples $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_n \sim Q$ from each distribution. Specifically, we assume $P$ and $Q$ lie in a smaller subspace $\mathcal{H}_a \subseteq \mathcal{H}_b$ parameterized by a second real family $a = \{a_z\}_{z \in Z}$, and we study the minimax error of estimating $\langle P, Q \rangle_b$ over $P$ and $Q$ lying in a (unit) ellipsoid with respect to $\|\cdot\|_a$; that is, the quantity
(2) $\displaystyle M(a, b, n) := \inf_{\hat S} \sup_{\|P\|_a, \|Q\|_a \leq 1} \mathbb{E}\left[\left|\hat S - \langle P, Q \rangle_b\right|^2\right],$
where the infimum is taken over all estimators $\hat S$ (i.e., all complex-valued functions of the data).
^{1} For a complex number $c$, $\overline{c}$ denotes the complex conjugate of $c$.
The minimax rate $M(a, b, n)$ is primarily governed by the rates at which $1/a_z$ and $b_z/a_z$ decay to $0$ as $\|z\| \to \infty$.^{2} This has been studied extensively in the Sobolev (or polynomial-decay) case, where, for some $s > t \geq 0$, $b_z = \|z\|^t$ and $a_z = \|z\|^s$, corresponding to estimation of order-$t$ Sobolev semi-inner products under order-$s$ Sobolev smoothness assumptions on $P$ and $Q$ (as described in Example 3 below) (bickel88squaredDerivatives; donoho90minimaxQuadratic; laurent00adaptiveQuadratic; singh16sobolev). In this case, the rate of $M(a, b, n)$ has been identified (by bickel88squaredDerivatives, donoho90minimaxQuadratic, and singh16sobolev, in increasing generality) as^{3}
(3) $\displaystyle M(a, b, n) \asymp \max\left\{ n^{-1},\; n^{-\frac{8(s - t)}{4s + d}} \right\},$
so that the "parametric" rate $n^{-1}$ dominates when $s \geq 2t + d/4$, and the slower rate $n^{-\frac{8(s-t)}{4s+d}}$ dominates otherwise. laurent00adaptiveQuadratic additionally showed that, for $s < 2t + d/4$, $M(a, b, n)$ increases by a polylogarithmic factor in $n$ in the "adaptive" case, when the tail index $s$ is not assumed to be known to the estimator.
^{2} By equivalence of finite-dimensional norms, the choice of norm here affects only constant factors.
^{3} Here and elsewhere, $\asymp$ denotes equality up to constant factors.
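As a quick numeric companion to the Sobolev rate (3), the following sketch (our own illustration, not part of the original analysis; the function names are hypothetical) computes the MSE exponent and the elbow condition exactly, using rational arithmetic:

```python
from fractions import Fraction

def sobolev_mse_exponent(s, t, d):
    """Exponent r in the minimax MSE rate M ≍ n^{-r} for estimating an
    order-t Sobolev semi-inner product under an order-s Sobolev assumption:
    r = min(1, 8(s - t) / (4s + d))."""
    s, t, d = Fraction(s), Fraction(t), Fraction(d)
    assert s > t >= 0 and d >= 1
    return min(Fraction(1), 8 * (s - t) / (4 * s + d))

def is_parametric(s, t, d):
    """The parametric rate n^{-1} dominates exactly when s >= 2t + d/4."""
    return Fraction(s) >= 2 * Fraction(t) + Fraction(d, 4)
```

For example, with $t = 0$ (squared $L^2$ norm) and $d = 1$, any $s \geq 1/4$ already yields the parametric rate, while in $d = 8$ the same weights give the slower exponent $8s/(4s+8)$.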
However, the behavior of $M(a, b, n)$ for other (non-polynomial) decay rates of $a$ and $b$ has not been studied, despite the fact that, as discussed in Section 1.1, other rates of decay, such as Gaussian or exponential, correspond to products and assumptions commonly considered in nonparametric statistics. The goal of this paper is therefore to understand the behavior of $M(a, b, n)$ for general sequences $a$ and $b$.
Although our results apply more generally, to summarize our results simply, consider the case where $a$ and $b$ are "radial"; i.e., $a_z$ and $b_z$ are both functions of some norm $\|z\|$ of the index $z$. Under mild assumptions, we show that the minimax convergence rate is then a function of the quantities
$$\sup_{\|z\| > T} \frac{b_z^2}{a_z^2} \qquad \text{and} \qquad \sum_{\|z\| \leq T} b_z^4,$$
which can be thought of as measures of the "strengths" of the norms $\|\cdot\|_a$ and $\|\cdot\|_b$, for a particular choice of a "smoothing" (or "truncation") parameter $T > 0$. Specifically, we show
(4) $\displaystyle M(a, b, n) \asymp \min\left\{ 1,\; \frac{1}{n} + \inf_{T > 0} \left[ \left( \sup_{\|z\| > T} \frac{b_z^2}{a_z^2} \right)^2 + \frac{1}{n^2} \sum_{\|z\| \leq T} b_z^4 \right] \right\}.$
While (4) is difficult to simplify without knowing the forms of $a$ and $b$, it is easy to check that, in the Sobolev case (where $a_z$ and $b_z$ are polynomial in $\|z\|$), the assumptions are satisfied, and (4) recovers the previously known rate (3). Moreover, our assumptions are also satisfied by other rates of interest, such as exponential (where $b_z = e^{\sigma \|z\|}$ and $a_z = e^{\sigma' \|z\|}$ for some $\sigma' > \sigma > 0$) and Gaussian (where $b_z = e^{\sigma \|z\|^2}$ and $a_z = e^{\sigma' \|z\|^2}$) rates, for which we are the first to identify minimax rates. As in the Sobolev case, the rates here exhibit the so-called "elbow" phenomenon, where the convergence rate is "parametric" (i.e., of order $n^{-1}$) when the decay of $1/a_z$ is sufficiently fast relative to that of $1/b_z$, and slower otherwise. However, for rapidly decaying $1/a_z$, such as in the exponential case, the location of this elbow no longer depends directly on the dimension $d$; the parametric rate is achieved as soon as $\sigma' \geq 2\sigma$.
We note that, in all of the above cases, the minimax rate (4) is achieved by a simple bilinear estimator:
$$\hat S_T := \sum_{z \in Z_T} b_z^2\, \hat{\tilde P}(z)\, \overline{\hat{\tilde Q}(z)},$$
where $\hat{\tilde P}(z)$ and $\hat{\tilde Q}(z)$ are linear estimates of the characteristic functions $\tilde P(z)$ and $\tilde Q(z)$, and the truncation set $Z_T := \{z \in Z : \|z\| \leq T\}$ is a tuning parameter. We also show that, in many cases, a rate-optimal $T$ can be chosen adaptively (i.e., without knowledge of the space $\mathcal{H}_a$ in which $P$ and $Q$ lie).
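For concreteness, here is a minimal numerical sketch (ours, not the paper's implementation) of such a bilinear estimator on the one-dimensional torus $[0, 2\pi]$, using plug-in empirical characteristic functions; the unit weights and the truncation level below are illustrative choices:

```python
import cmath
import math
import random

def ecf(xs, z):
    """Empirical characteristic function: (1/n) * sum_j exp(-i z X_j)."""
    return sum(cmath.exp(-1j * z * x) for x in xs) / len(xs)

def inner_product_estimate(xs, ys, b, T):
    """Bilinear plug-in estimate of sum_{|z| <= T} b(z)^2 P~(z) conj(Q~(z))."""
    return sum(b(z) ** 2 * ecf(xs, z) * ecf(ys, z).conjugate()
               for z in range(-T, T + 1))

random.seed(0)
xs = [random.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
ys = [random.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
# With unit weights and both samples uniform, only the z = 0 term of the
# target is nonzero, so the estimand equals 1.
est = inner_product_estimate(xs, ys, b=lambda z: 1.0, T=5)
```

The estimator is bilinear in the two empirical characteristic functions, which is what makes its bias and variance tractable to analyze term by term.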
1.1 Motivating Examples
Here, we briefly present some examples of products and spaces of the form (1) that are commonly encountered in statistical theory and functional analysis. In the following examples, the base measure on $\mathcal{X}$ is taken to be the Lebesgue measure $\mu$, and "probability densities" are with respect to $\mu$. Also, for a square-integrable function $f$, we use $\tilde f(z)$ to denote the $z^{\text{th}}$ Fourier coefficient of $f$.
The simplest example is the standard $L^2$ inner product:
Example 1.
In the "unweighted" case where $b_z = 1$ for all $z$, $\mathcal{H}_b$ includes the usual space of square-integrable probability densities on $\mathcal{X}$, and, for $P$ and $Q$ with square-integrable densities $p$ and $q$, we have
$$\langle P, Q \rangle_b = \int_{\mathcal{X}} p(x)\, q(x)\, d\mu(x) = \langle p, q \rangle_{L^2}.$$
Typically, however, we are interested in weight sequences such that $b_z \to \infty$ as $\|z\| \to \infty$, and $\mathcal{H}_b$ will be strictly smaller than $L^2(\mathcal{X})$ to ensure that $\langle \cdot, \cdot \rangle_b$ is finite-valued; this corresponds intuitively to requiring additional smoothness of functions in $\mathcal{H}_b$. Here are two examples widely used in statistics:
Example 2.
If $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) with a translation-invariant kernel $K(x, y) = \kappa(x - y)$ (where $\kappa \in L^2(\mathcal{X})$), one can show via Bochner's theorem (see, e.g., Theorem 6.6 of (wendland2005scattered)) that the semi-inner product of $\mathcal{H}_K$ satisfies
$$\langle p, q \rangle_{\mathcal{H}_K} = \sum_{z \in \mathbb{Z}^d} \frac{\tilde p(z)\, \overline{\tilde q(z)}}{\tilde \kappa(z)}.$$
Hence, setting each $b_z^2 = 1/\tilde\kappa(z)$, $\mathcal{H}_b$ contains any distributions $P$ and $Q$ on $\mathcal{X}$ with densities $p, q \in \mathcal{H}_K$ and, moreover, we have $\langle P, Q \rangle_b = \langle p, q \rangle_{\mathcal{H}_K}$.
Example 3.
If, for some integer $s > 0$, $\mathcal{H}^s$ is the order-$s$ Sobolev space
$$\mathcal{H}^s := \left\{ f \in L^2(\mathcal{X}) : \sum_{z \in \mathbb{Z}^d} \|z\|^{2s} \left|\tilde f(z)\right|^2 < \infty \right\},$$
one can show via Parseval's identity that the semi-inner product
$$\langle f, g \rangle_{\mathcal{H}^s} := \sum_{|\alpha| = s} \binom{s}{\alpha} \int_{\mathcal{X}} \left(D^\alpha f\right)(x)\, \overline{\left(D^\alpha g\right)(x)}\, d\mu(x)$$
over $\mathcal{H}^s$ satisfies
(5) $\displaystyle \langle f, g \rangle_{\mathcal{H}^s} = \sum_{z \in \mathbb{Z}^d} \|z\|^{2s}\, \tilde f(z)\, \overline{\tilde g(z)}$
(indeed, (5) is commonly used to generalize $\langle \cdot, \cdot \rangle_{\mathcal{H}^s}$, for example, to non-integer values of $s$). Thus, setting $b_z = \|z\|^s$, $\mathcal{H}_b$ contains any distributions with densities in $\mathcal{H}^s$, and, moreover, we have $\langle P, Q \rangle_b = \langle p, q \rangle_{\mathcal{H}^s}$. Note that, when $s > d/2$, one can show via Bochner's theorem that $\mathcal{H}^s$ is in fact also an RKHS, with translation-invariant kernel defined as above by $\tilde\kappa(z) = \|z\|^{-2s}$.
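The Parseval identity behind (5) can be sanity-checked numerically on a trigonometric polynomial. This sketch (our own illustration) takes $s = 1$, $d = 1$, and $f(x) = \cos(kx)$ on $[0, 2\pi]$, for which $\int_0^{2\pi} (f')^2\, dx = \pi k^2$; periodic Riemann sums are exact (up to rounding) for trigonometric polynomials:

```python
import cmath
import math

N = 4096                       # grid size; spectrally exact for trig polynomials
H = 2.0 * math.pi / N

def spatial_seminorm_sq(k):
    """Integral of (d/dx cos(kx))^2 over [0, 2*pi] via a periodic Riemann sum."""
    return sum((k * math.sin(k * j * H)) ** 2 for j in range(N)) * H

def fourier_coeff(f, z):
    """<f, phi_z> with the orthonormal basis phi_z(x) = exp(izx)/sqrt(2*pi)."""
    s = sum(f(j * H) * cmath.exp(-1j * z * j * H) for j in range(N)) * H
    return s / math.sqrt(2.0 * math.pi)

k = 3
f = lambda x: math.cos(k * x)
spatial = spatial_seminorm_sq(k)
fourier = sum(abs(z) ** 2 * abs(fourier_coeff(f, z)) ** 2 for z in range(-6, 7))
```

Both sides evaluate to $\pi k^2$, matching the Fourier-side expression in (5).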
Paper Organization
The remainder of this paper is organized as follows: In Section 2, we provide notation needed to formally state our estimation problem, given in Section 3. Section 4 reviews related work on estimation of functionals of probability densities, as well as some applications of this work. Sections 5 and 6 present our main theoretical results, with upper bounds in Section 5 and minimax lower bounds in Section 6; proofs of all results are given in Appendix 10. Section 7 expands upon these general results in a number of important special cases. Finally, we conclude in Section 8 with a discussion of broader consequences and avenues for future work.
2 Notation
We assume the sample space $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$, and we use $\mu$ to denote the usual Lebesgue measure on $\mathcal{X}$. We use $\{\phi_z\}_{z \in \mathbb{Z}^d}$ to denote the standard orthonormal Fourier basis of $L^2(\mathcal{X})$, indexed by tuples $z = (z_1, \dots, z_d) \in \mathbb{Z}^d$ of integer frequencies. For any function $f \in L^2(\mathcal{X})$ and $z \in \mathbb{Z}^d$, we use
$$\tilde f(z) := \langle f, \phi_z \rangle_{L^2} = \int_{\mathcal{X}} f(x)\, \overline{\phi_z(x)}\, d\mu(x)$$
to denote the $z^{\text{th}}$ Fourier coefficient of $f$ (i.e., the projection of $f$ onto $\phi_z$), and, for any probability distribution $P$ on $\mathcal{X}$, we use the same notation
$$\tilde P(z) := \mathbb{E}_{X \sim P}\left[\overline{\phi_z(X)}\right]$$
to denote the characteristic function of $P$.
We will occasionally use the notation $\|z\|$ for indices $z \in \mathbb{Z}^d$. Due to equivalence of finite-dimensional norms, the exact choice of norm affects only constant factors; for concreteness, one may take $\|\cdot\|$ to be the Euclidean norm.
For certain applications, it is convenient to consider only a subset $Z \subseteq \mathbb{Z}^d$ of indices of interest (for example, Sobolev seminorms are indexed only over $Z = \mathbb{Z}^d \setminus \{0\}$). The subset $Z$ may be considered arbitrary but fixed in our work.
Given two $[0, \infty)$-valued sequences^{5} $a = \{a_z\}_{z \in Z}$ and $b = \{b_z\}_{z \in Z}$, we are interested in products of the form
$$\langle p, q \rangle_b := \sum_{z \in Z} b_z^2\, \tilde p(z)\, \overline{\tilde q(z)}$$
and their induced (semi)norms $\|p\|_b := \sqrt{\langle p, p \rangle_b}$ over spaces of the form^{6}
$$\mathcal{H}_b := \left\{ p \in L^2(\mathcal{X}) : \|p\|_b < \infty \right\}$$
(and similarly when replacing $b$ by $a$). Typically, we will have $a_z \geq b_z$ whenever $\|z\|$ is sufficiently large, implying the inclusion $\mathcal{H}_a \subseteq \mathcal{H}_b$.
^{5} A more proper mathematical term for $a$ and $b$ would be net.
^{6} Specifically, we are interested in probability densities, which lie in the simplex $\Delta := \{p \in L^1(\mathcal{X}) : p \geq 0, \int_{\mathcal{X}} p\, d\mu = 1\}$, so that we should write, e.g., $\mathcal{H}_b \cap \Delta$. Henceforth, "density" refers to any function lying in $\Delta$.
3 Formal Problem Statement
Suppose we observe $n$ IID samples $X_1, \dots, X_n \sim P$ and $n$ IID samples $Y_1, \dots, Y_n \sim Q$, where $P$ and $Q$ are (unknown) distributions lying in the (known) space $\mathcal{H}_b$. We are interested in the problem of estimating the inner product (1), along with the closely related (squared) seminorm and pseudometric given by
(6) $\displaystyle \|P\|_b^2 := \langle P, P \rangle_b \qquad \text{and} \qquad \|P - Q\|_b^2 := \|P\|_b^2 - 2\,\mathrm{Re}\,\langle P, Q \rangle_b + \|Q\|_b^2.$
We assume $P$ and $Q$ lie in a (known) smaller space $\mathcal{H}_a \subseteq \mathcal{H}_b$, and we are specifically interested in identifying, up to constant factors, the minimax mean squared (i.e., $L^2$) error of estimating $\langle P, Q \rangle_b$ over $P$ and $Q$ lying in a unit ellipsoid with respect to $\|\cdot\|_a$; that is, the quantity
(7) $\displaystyle M(a, b, n) := \inf_{\hat S} \sup_{\|P\|_a, \|Q\|_a \leq 1} \mathbb{E}\left[\left|\hat S - \langle P, Q \rangle_b\right|^2\right],$
where the infimum is taken over all estimators $\hat S$ (i.e., all functions of the data $X_1, \dots, X_n, Y_1, \dots, Y_n$).
4 Related Work
4.1 Prior work on special cases
While there has been substantial work on estimating unweighted $L^2$ norms and distances of densities (schweder75L2; anderson94L2TwoSampleTest; gine08simpleL2), to the best of our knowledge, most work on the more general problem of estimating weighted inner products or norms has been work on estimating Sobolev quantities (see Example 3 in Section 1) by bickel88squaredDerivatives, donoho90minimaxQuadratic, and singh16sobolev. bickel88squaredDerivatives considered the case of integer-order Sobolev norms, which (for $d = 1$) have the form
(8) $\displaystyle \|f\|_{\mathcal{H}^s}^2 = \int_{\mathcal{X}} \left( f^{(s)}(x) \right)^2 dx,$
for which they upper bounded the error of an estimator based on plugging a kernel density estimate into (8) and then applying an analytic bias correction. They also derived matching minimax lower bounds for this problem.^{7} singh16sobolev proved rate-matching upper bounds on the error of a much simpler inner product estimator (generalizing an estimator proposed by donoho90minimaxQuadratic), which applies for arbitrary dimension $d$. Our upper and lower bounds are strict generalizations of these results. Specifically, relative to this previous work on the Sobolev case, our work makes advances in three directions:
^{7} bickel88squaredDerivatives actually make Hölder assumptions on their densities (essentially, an $L^\infty$ bound on the derivatives of the density), rather than our slightly milder Sobolev assumption (essentially, an $L^2$ bound on the $s^{\text{th}}$ derivative). However, these assumptions are closely related such that the results are comparable up to constant factors.
We consider estimating a broader class of inner product functionals $\langle \cdot, \cdot \rangle_b$, for arbitrary sequences $b$. The Sobolev case corresponds to $b_z = \|z\|^t$ for some $t \geq 0$.

We consider a broader range of assumptions on the true data densities, of the form $\|p\|_a, \|q\|_a \leq 1$, for arbitrary sequences $a$. The Sobolev case corresponds to $a_z = \|z\|^s$ for some $s > t$.

We prove lower bounds that match our upper bounds, thereby identifying minimax rates. For many cases, such as Gaussian or exponential RKHS inner products or densities, these results are the first concerning minimax rates, and, even in the Sobolev case, our lower bounds address some previously open cases (namely, non-integer $s$ and $t$, and $d > 1$).
The closely related work of fan91quadratic also generalized the estimator of donoho90minimaxQuadratic, proved (both upper and lower) bounds on $M(a, b, n)$ for somewhat more general sequences, and also considered norms with exponent $p \neq 2$ (i.e., norms not generated by an inner product, such as those underlying a broad class of Besov spaces). However, his analysis placed several restrictions on the rates of $a$ and $b$; for example, it requires a polynomial-growth-type regularity condition on the weights. This condition holds when $a_z$ and $b_z$ vary polynomially in $\|z\|$, but fails in many of the cases we consider, such as exponential decay. The estimation of norms with exponent $p \neq 2$ and with $a$ and $b$ varying non-polynomially, therefore, remains an important unstudied case, which we leave for future work.
Finally, we note that, except singh16sobolev, all of the above works have considered only $d = 1$ (i.e., the case where the sample space $\mathcal{X} \subseteq \mathbb{R}$), despite the fact that $d$ can play an important role in the convergence rates of the estimators. The results in this paper hold for arbitrary $d \geq 1$.
4.2 Estimation of related functionals
Although minimax rates for basic cases have been long established (birge95integralFunctionals; laurent96integralFunctionals), there has been quite a large amount of recent work (singh14RenyiDivergence; moon14divergencesEnsemble; singh14densityFunctionals; krishnamurthy14RenyiAndFriends; moon14divergencesConfidence; krishnamurthy15L2Divergence; kandasamy15vonMises; gao15stronglyDependent; gao15localGaussian; mukherjee15lepski; mukherjee16adaptive; moon16improvingConvergence; singh16kNN; berrett16kNNentropy; gao17density; gao2017estimating; jiao2017nearest; han2017optimal; noshad2017direct; wisler2017direct; singh2017nonparanormal; noshad2018scalable; bulinski2018statistical) on practical estimation of nonlinear integral functionals of probability densities, of the form^{8}
(9) $\displaystyle F(P) := \int_{\mathcal{X}} f\big(p(x)\big)\, d\mu(x),$
where $f$ is nonlinear but smooth and $p$ denotes the density of $P$. The focus has been on estimating information-theoretic quantities such as variants of entropy, mutual information, and divergence. However, as discussed in detail by laurent96integralFunctionals, under Sobolev smoothness assumptions on $p$, estimation of quadratic functionals (such as those considered in this paper) is key to constructing minimax rate-optimal estimators for general functionals of the form (9). The reason for this is that minimax rate-optimal estimators of $F(P)$ can be constructed by approximating a second-order Taylor expansion of $F$ around a density estimate of $p$ that is itself minimax rate-optimal (with respect to integrated mean squared error); the second-order term of this expansion is precisely a quadratic functional of the density. Indeed, to the best of our knowledge, this is the approach taken by all estimators that are known to achieve minimax rates (birge95integralFunctionals; laurent96integralFunctionals; krishnamurthy14RenyiAndFriends; kandasamy15vonMises; mukherjee15lepski; mukherjee16adaptive) for general functionals of the form (9).
^{8} Some of this work has considered more general functionals that can depend on multiple densities or their derivatives.
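To make the Taylor-expansion construction above concrete, the following sketch (our own hypothetical example, not taken from the cited works) expands the entropy-type functional $F(p) = \int p \log p$ around the uniform reference density $q \equiv 1$ on $[0, 1]$; the second-order correction is exactly a quadratic functional of the perturbation $p - q$:

```python
import math

def integral(g, lo=0.0, hi=1.0, N=4096):
    """Midpoint-rule quadrature on [lo, hi]."""
    h = (hi - lo) / N
    return sum(g(lo + (j + 0.5) * h) for j in range(N)) * h

f = lambda u: u * math.log(u)           # integrand: F(p) = ∫ p log p
fp = lambda u: math.log(u) + 1.0        # f'
fpp = lambda u: 1.0 / u                 # f''

p = lambda x: 1.0 + 0.2 * math.sin(2.0 * math.pi * x)   # perturbed density
q = lambda x: 1.0                                        # reference density

F_exact = integral(lambda x: f(p(x)))
# Second-order Taylor expansion of F around q; the final term is a
# quadratic functional of (p - q):
F_taylor = (integral(lambda x: f(q(x)))
            + integral(lambda x: fp(q(x)) * (p(x) - q(x)))
            + 0.5 * integral(lambda x: fpp(q(x)) * (p(x) - q(x)) ** 2))
```

Here the linear term vanishes (both $p$ and $q$ integrate to one against the constant derivative), so the quadratic term carries all of the leading correction; replacing $q$ by a rate-optimal density estimate turns this quadratic term into exactly the kind of functional studied in this paper.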
Interestingly, the estimators studied in the recent papers above are all based on either kernel density estimators (singh14RenyiDivergence; singh14densityFunctionals; krishnamurthy14RenyiAndFriends; krishnamurthy15L2Divergence; kandasamy15vonMises; moon16improvingConvergence; mukherjee15lepski; mukherjee16adaptive) or nearest neighbor methods (moon14divergencesEnsemble; moon14divergencesConfidence; singh16kNN; berrett16kNNentropy; gao17density). This contrasts with our approach, which is more comparable to orthogonal series density estimation; given the relative computational efficiency of orthogonal series methods, it may be desirable to try to adapt our estimators to these classes of functionals.
When moving beyond Sobolev assumptions, only estimation of very specific functionals has been studied. For example, under RKHS assumptions, only estimation of maximum mean discrepancy (MMD) (gretton12kernel; ramdas15decreasingPower; tolstikhin16minimaxMMD) has received much attention. Hence, our work significantly expands our understanding of minimax functional estimation in this setting. More generally, our work begins to provide a framework for a unified understanding of functional estimation across different types of smoothness assumptions.
Along a different line, there has also been some work on estimating $L^2$ norms of regression functions, under similar Sobolev smoothness assumptions (lepski99regressionNorm). However, the problem of norm estimation for regression functions turns out to have quite different statistical properties and requires significantly different estimators and analysis, compared to norm estimation for density functions. Generally, the problem for densities is statistically easier in terms of having a faster convergence rate under a comparable smoothness assumption; this is most obvious for the $L^1$ norm, since the $L^1$ norm of a density is always $1$, while the $L^1$ norm of a regression function is less trivial to estimate. However, this is true more generally as well. For example, lepski99regressionNorm showed that, under order-$s$ Sobolev assumptions, the minimax rate for estimating the $L^2$ norm of a $d$-dimensional regression function (up to logarithmic factors) is strictly slower than the corresponding rate $\max\left\{n^{-1}, n^{-\frac{8s}{4s+d}}\right\}$ for estimating the $L^2$ norm of a density function, which is parametric when $s \geq d/4$. To the best of our knowledge, there has been no work on the natural question of estimating Sobolev or other more general quadratic functionals of regression functions.
4.3 Applications
Finally, although this paper focuses on estimation of general inner products from the perspective of statistical theory, we mention a few of the many applications that motivate the study of this problem.
Estimates of quadratic functionals can be directly used for nonparametric goodness-of-fit, independence, and two-sample testing (anderson94L2TwoSampleTest; dumbgen98goodnessOfFit; ingster12nonparametric; goria05new; pardo05statistical; chwialkowski15fastTwoSample). They can also be used to construct confidence sets for a variety of nonparametric objects (li89honestConfidence; baraud04confidence; genovese05waveletRegression), as well as for parameter estimation in semiparametric models (wolsztynski05minimumEntropyEstimation). In machine learning, Sobolev-weighted distances can also be used in transfer learning (du17hypothesisTransferLearning) and transduction learning (quadrianto09transduction) to measure relatedness between source and target domains, helping to identify when transfer can benefit learning. Semi-inner products can be used as kernels over probability distributions, enabling generalization of a wide variety of statistical learning methods from finite-dimensional vectorial inputs to nonparametric distributional inputs (sutherland16thesis). This distributional learning approach has been applied to many diverse problems, including image classification (Poczos:2011:NDE:3020548.3020618; poczos12imageClassification), galaxy mass estimation (ntampaka15galaxyMass), ecological inference (flaxman15election; flaxman16election), aerosol prediction in climate science (szabo15learningTheoryDistributions), and causal inference (lopez15causalTheory). Further applications of these quantities can be found in (principe10information).
5 Upper Bounds
In this section, we provide upper bounds on the minimax risk (7). Specifically, we propose estimators for semi-inner products, seminorms, and pseudometrics, and bound the risk of the semi-inner product estimator; identical bounds (up to constant factors) follow easily for seminorms and pseudometrics.
5.1 Proposed Estimators
Our proposed estimator of $\langle P, Q \rangle_b$ consists of simply plugging estimates of $\tilde P$ and $\tilde Q$ into a truncated version of the summation in Equation (1). Specifically, since
$$\tilde P(z) = \mathbb{E}_{X \sim P}\left[\overline{\phi_z(X)}\right] \qquad \text{and} \qquad \tilde Q(z) = \mathbb{E}_{Y \sim Q}\left[\overline{\phi_z(Y)}\right],$$
we estimate each $\tilde P(z)$ by $\hat{\tilde P}(z) := \frac{1}{n} \sum_{j=1}^n \overline{\phi_z(X_j)}$ and each $\tilde Q(z)$ by $\hat{\tilde Q}(z) := \frac{1}{n} \sum_{j=1}^n \overline{\phi_z(Y_j)}$. Then, for some finite set $Z_T \subseteq Z$ (a tuning parameter to be chosen later), our estimator for the product (1) is
(10) $\displaystyle \hat S_T := \sum_{z \in Z_T} b_z^2\, \hat{\tilde P}(z)\, \overline{\hat{\tilde Q}(z)}.$
To estimate the squared seminorm $\|P\|_b^2$ from a single sample $X_1, \dots, X_n \sim P$, we use
(11) $\displaystyle \widehat{\|P\|_b^2} := \sum_{z \in Z_T} b_z^2\, \hat{\tilde P}_1(z)\, \overline{\hat{\tilde P}_2(z)},$
where $\hat{\tilde P}_1(z)$ is estimated using the first half of the sample and $\hat{\tilde P}_2(z)$ is estimated using the second half of the sample. While it is not clear that sample splitting is optimal in practice, it allows us to directly apply convergence results for the semi-inner product, which assume the samples from the two densities are independent.
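The sample-splitting scheme can be sketched numerically as follows (our own illustration on the one-dimensional torus, with illustrative unit weights; the two characteristic-function factors use disjoint halves of the sample, so each summand is an unbiased estimate of its population counterpart):

```python
import cmath
import math
import random

def ecf(xs, z):
    """Empirical characteristic function: (1/n) * sum_j exp(-i z X_j)."""
    return sum(cmath.exp(-1j * z * x) for x in xs) / len(xs)

def sq_norm_estimate(xs, b, T):
    """Split-sample estimate of sum_{|z| <= T} b(z)^2 |P~(z)|^2: the two
    factors are computed from disjoint halves of the sample, so the
    product of the two (independent) factors is unbiased for each z."""
    half = len(xs) // 2
    first, second = xs[:half], xs[half:]
    est = sum(b(z) ** 2 * ecf(first, z) * ecf(second, z).conjugate()
              for z in range(-T, T + 1))
    return est.real  # the target is real; discard the imaginary noise

random.seed(1)
xs = [random.uniform(0.0, 2.0 * math.pi) for _ in range(4000)]
# For the uniform distribution, only the z = 0 term is nonzero, so the
# target squared seminorm (with unit weights) is 1.
est = sq_norm_estimate(xs, b=lambda z: 1.0, T=5)
```

Without the split, each summand would contain a squared empirical characteristic function, whose expectation includes an $O(1/n)$ diagonal bias term; splitting removes that term at the cost of halving the effective sample size.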
5.2 Bounding the risk of $\hat S_T$
Here, we state upper bounds on the bias, variance, and mean squared error of the semi-inner product estimator $\hat S_T$, beginning with an easy bound on the bias of $\hat S_T$ (proven in Appendix 10.1):
Proposition 1 (Upper bound on bias of $\hat S_T$).
Suppose $\|P\|_a, \|Q\|_a \leq 1$. Then,
(12) $\displaystyle \left| \mathbb{E}\left[\hat S_T\right] - \langle P, Q \rangle_b \right| \leq \sup_{z \in Z \setminus Z_T} \frac{b_z^2}{a_z^2}$
(note that, assuming $a_z \to \infty$ faster than $b_z$ as $\|z\| \to \infty$, this supremum vanishes as $Z_T$ grows to cover $Z$). While (12) does not explicitly depend on the sample size $n$, in practice, the parameter set $Z_T$ will be chosen to grow with $n$, and hence the supremum over $Z \setminus Z_T$ will decrease monotonically with $n$. Next, we provide a bound on the variance of $\hat S_T$, whose proof, given in Appendix 10.2, is more involved.
Proposition 2 (Upper bound on variance of $\hat S_T$).
Suppose $\|P\|_a, \|Q\|_a \leq 1$. Then,
(13) $\displaystyle \mathrm{Var}\left[\hat S_T\right] \leq \frac{C_1\, c_{a,b,T}}{n} + \frac{C_2}{n^2} \sum_{z \in Z_T} b_z^4,$
where $C_1, C_2 > 0$ are universal constants and
(14) $\displaystyle c_{a,b,T} := \left( \sum_{z \in Z_T} \frac{b_z^8}{a_z^4} \right)^{1/2}.$
Having bounded the bias and variance of the estimator $\hat S_T$, we now turn to the mean squared error (MSE). Via the usual decomposition of MSE into (squared) bias and variance, Propositions 1 and 2 together immediately imply the following bound:
Theorem 3 (Upper bound on MSE of $\hat S_T$).
Suppose $\|P\|_a, \|Q\|_a \leq 1$. Then,
(15) $\displaystyle \mathbb{E}\left[\left|\hat S_T - \langle P, Q \rangle_b\right|^2\right] \leq \left( \sup_{z \in Z \setminus Z_T} \frac{b_z^2}{a_z^2} \right)^2 + \frac{C_1\, c_{a,b,T}}{n} + \frac{C_2}{n^2} \sum_{z \in Z_T} b_z^4.$
Corollary 4 (Norm estimation).
In the particular case of norm estimation (i.e., when $P = Q$), this simplifies to:
(16) $\displaystyle \mathbb{E}\left[\left|\widehat{\|P\|_b^2} - \|P\|_b^2\right|^2\right] \lesssim \left( \sup_{z \in Z \setminus Z_T} \frac{b_z^2}{a_z^2} \right)^2 + \frac{c_{a,b,T}}{n} + \frac{1}{n^2} \sum_{z \in Z_T} b_z^4.$
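For intuition, the two non-parametric pieces of such a bound can be evaluated for concrete weights. The sketch below (our own illustration; it assumes a tail-supremum bias term $\sup_{\|z\| > T} b_z^2/a_z^2$ and a truncated variance contribution $\frac{1}{n^2}\sum_{\|z\| \leq T} b_z^4$, with Sobolev-form weights $b_z = |z|^t$, $a_z = |z|^s$ in $d = 1$) shows how the two terms move in opposite directions as $T$ grows:

```python
def bias_term(s, t, T, zmax=100000):
    """sup_{|z| > T} (b_z / a_z)^2 with b_z = |z|^t, a_z = |z|^s, s > t.
    The ratio is decreasing in |z|, so the supremum sits at |z| = T + 1."""
    return max(z ** (2.0 * (t - s)) for z in range(T + 1, zmax + 1))

def variance_term(t, T, n):
    """(1/n^2) * sum over 0 < |z| <= T of b_z^4, with b_z = |z|^t, d = 1."""
    return 2.0 * sum(z ** (4.0 * t) for z in range(1, T + 1)) / n ** 2
```

Increasing $T$ shrinks the first term (roughly as $T^{-2(s-t)}$) while growing the second (roughly as $T^{4t+1}/n^2$), which is the trade-off resolved by the optimal smoothing level discussed in the next subsection.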
5.3 Discussion of Upper Bounds
Two things might stand out that distinguish the above variance bound from many other nonparametric variance bounds: First, the rate depends on the smoothness of the densities. Smoothness assumptions in nonparametric statistics are usually needed only to bound the bias of estimators (Tsybakov:2008:INE:1522486). The reason the smoothness appears in this variance bound is that the estimand in Equation (1) includes products of the Fourier coefficients of $P$ and $Q$. Hence, the estimates of $\tilde P(z)$ are scaled by $\tilde Q(z)$, and vice versa, and, as a result, the decay rates of $\tilde P$ and $\tilde Q$ affect the variance of the tails of $\hat S_T$. One consequence of this is that the convergence rates exhibit a phase transition, with a parametric convergence rate when the tails of $\tilde P$ and $\tilde Q$ are sufficiently light, and a slower rate otherwise.
Second, the bounds are specific to the Fourier basis (as opposed to, say, any uniformly bounded basis, e.g., one with $\sup_z \|\phi_z\|_\infty < \infty$). The reason for this is that, when expanded, the variance includes terms of the form $\mathbb{E}\left[\phi_z(X)\overline{\phi_w(X)}\right]$, for some $z, w \in Z$. In general, these covariance-like terms are difficult to bound tightly; for example, the uniform boundedness assumption above would only give a bound of the form $\left|\mathbb{E}\left[\phi_z(X)\overline{\phi_w(X)}\right]\right| \leq \sup_z \|\phi_z\|_\infty^2$. For the Fourier basis, however, the recurrence relation $\phi_z \overline{\phi_w} \propto \phi_{z - w}$ allows us to bound $\mathbb{E}\left[\phi_z(X)\overline{\phi_w(X)}\right]$ in terms of assumptions on the decay rates of the Fourier coefficients of $P$. It turns out that this quantity decays significantly faster than the trivial uniform bound suggests, and this tighter bound is needed to prove optimal convergence rates.
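The recurrence in question is just the exponential addition law: for $\phi_z(x) = e^{izx}$ (normalization aside), $\phi_z(x)\,\overline{\phi_w(x)} = \phi_{z-w}(x)$ pointwise, so each covariance-like term is itself a characteristic-function value. A one-line numeric check:

```python
import cmath

def phi(z, x):
    """Unnormalized Fourier basis function e^{izx} on the 1-d torus."""
    return cmath.exp(1j * z * x)

# Pointwise recurrence: phi_z(x) * conj(phi_w(x)) = phi_{z-w}(x).
checks = [(z, w, x) for z in (-3, 0, 2, 7) for w in (-2, 1, 5)
          for x in (0.0, 0.7, 3.1)]
max_err = max(abs(phi(z, x) * phi(w, x).conjugate() - phi(z - w, x))
              for z, w, x in checks)
```

Taking expectations under $X \sim P$ then turns the product term into a single Fourier coefficient of $P$, whose decay is exactly what the smoothness assumption controls.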
More broadly, this suggests that convergence rates for estimating inner products in terms of weights in a particular basis may depend on algebraic properties of that basis. For example, another common basis, the Haar wavelet basis, satisfies a different recurrence relation: the product of two Haar wavelets is either identically $0$ (disjoint supports), proportional to the finer-scale wavelet (nested supports), or a scaled indicator function (identical supports), depending on whether (and how) the supports of the two wavelets are nested or disjoint. We leave investigation of this and other bases for future work.
Clearly, $\sup_T c_{a,b,T} < \infty$ if and only if the sequence $\left(b_z^2 / a_z\right)_{z \in Z}$ is appropriately summable (i.e., $\sum_{z \in Z} b_z^8 / a_z^4 < \infty$). Thus, assuming $b_z / a_z \to 0$, this already identifies the precise condition required for the minimax rate to be parametric. For the optimal choice of the parameter set $Z_T$,
it is the case that
$$\frac{1}{n^2} \sum_{z \in Z_T} b_z^4 \;\lesssim\; \frac{c_{a,b,T}}{n},$$
and hence the second term in (13) will be dominated by the first. Hence, the upper bound is of order
(17) $\displaystyle \mathbb{E}\left[\left|\hat S_T - \langle P, Q \rangle_b\right|^2\right] \lesssim \left( \sup_{z \in Z \setminus Z_T} \frac{b_z^2}{a_z^2} \right)^2 + \frac{1 + c_{a,b,T}}{n}.$
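The optimal truncation level balances the tail bias against the truncated variance. As a sketch (ours, using an illustrative surrogate objective with Sobolev-type weights in $d = 1$: bias$^2 \sim T^{-4(s-t)}$, parametric term $1/n$, and truncated variance $\sim T^{4t+d}/n^2$), the numerical minimizer tracks the theoretical smoothing level $n^{2/(4s+d)}$:

```python
def mse_bound(T, n, s, t, d):
    """Surrogate bias^2 + variance bound with Sobolev-type weights:
    T^{-4(s-t)} (squared tail bias) + 1/n (parametric term)
    + T^{4t+d}/n^2 (truncated variance)."""
    return T ** (-4.0 * (s - t)) + 1.0 / n + T ** (4.0 * t + d) / n ** 2

def best_T(n, s, t, d, Tmax=10000):
    """Brute-force integer minimizer of the surrogate bound."""
    return min(range(1, Tmax), key=lambda T: mse_bound(T, n, s, t, d))

n, s, t, d = 10 ** 6, 2.0, 0.5, 1
T_opt = best_T(n, s, t, d)
T_theory = n ** (2.0 / (4.0 * s + d))   # theoretical level, T ≍ n^{2/(4s+d)}
```

Because the objective is a sum of a decreasing and an increasing power of $T$, the minimizer sits (up to constants) where the two non-parametric terms cross, which is precisely the balance condition used in the special cases of Section 7.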
Simplifying the bound further requires some knowledge of the form of $a$ and/or $b$, and we develop this in several cases in Section 7. In Section 8, we also consider some heuristics for approximately simplifying (17) in certain settings.
6 Lower Bounds
In this section, we provide a lower bound on the minimax risk of the estimation problems described in Section 3. Specifically, we use a standard information-theoretic framework to lower bound the minimax risk for seminorm estimation; bounds of the same rate follow easily for inner products and pseudometrics. In a wide range of cases, our lower bound matches the MSE upper bound (Theorem 3) presented in the previous section.
Theorem 5 (Lower Bound on Minimax MSE).
Suppose $\mathcal{X}$ has finite base measure $\mu(\mathcal{X}) < \infty$, and suppose the basis $\{\phi_z\}_{z \in Z}$ contains the constant function $\phi_0 \equiv 1$ and is uniformly bounded (i.e., $\sup_{z \in Z} \|\phi_z\|_\infty < \infty$). Define $Z_T := \{z \in Z : \|z\| \leq T\}$, and let $\rho_T := \sup_{z \in Z \setminus Z_T} b_z^2 / a_z^2$ and $\nu_T := \sum_{z \in Z_T} b_z^4$. If $\rho_T \to 0$ as $T \to \infty$, then we have the minimax lower bound
$$M(a, b, n) \gtrsim \frac{1}{n} + \rho_{T_n}^2,$$
where $T_n$ is chosen to satisfy $\rho_{T_n}^2 \asymp \nu_{T_n} / n^2$. Also, if $\rho_T \not\to 0$, then we have the (looser) minimax lower bound
$$M(a, b, n) \gtrsim \frac{1}{n}.$$
Remark 6.
The uniform boundedness assumption permits the Fourier basis, our main case of interest, but also allows other bases (see, e.g., the “generalized Fourier bases” used in Corollary 2.2 of liang2017well).
Remark 7.
Intuitively, the ratio $b_z / a_z$ measures the relative strengths of the norms $\|\cdot\|_a$ and $\|\cdot\|_b$. As expected, consistent estimation is possible if and only if $\|\cdot\|_a$ is a stronger norm than $\|\cdot\|_b$.
7 Special Cases
In this section, we develop our lower and upper bound results for several special cases of interest. Some notation: here, for simplicity, we assume that the estimator uses a choice of $Z_T$ that is symmetric across dimensions; in particular, $Z_T$ (for some $T$ depending on $n$) is the Cartesian product of $d$ copies of the set $\{-T, \dots, T\}$. Throughout this section, we use $\lesssim$ and $\gtrsim$ to denote inequality up to polylogarithmic factors. Although we do not explicitly discuss estimation of $L^2$ norms, it appears as a special case of the Sobolev case with $t = 0$.

Sobolev Case: For some $s > t \geq 0$, $b_z = \|z\|^t$ and $a_z = \|z\|^s$.
Upper Bound: By Proposition 1, the bias is at most $\sup_{\|z\| > T} \|z\|^{2(t - s)} \asymp T^{-2(s - t)}$, and, by Proposition 2,
$$\mathrm{Var}\left[\hat S_T\right] \lesssim \frac{1 + T^{(d + 8t - 4s)/2}}{n} + \frac{T^{d + 4t}}{n^2}.$$
Thus,
$$\mathbb{E}\left[\left|\hat S_T - \langle P, Q \rangle_b\right|^2\right] \lesssim T^{-4(s - t)} + \frac{1 + T^{(d + 8t - 4s)/2}}{n} + \frac{T^{d + 4t}}{n^2}.$$
One can check that this bound is minimized when $T \asymp n^{\frac{2}{4s + d}}$, and that, for this choice of $T$, the term $T^{d + 4t}/n^2$ is of lower order, giving the convergence rate
$$\mathbb{E}\left[\left|\hat S_T - \langle P, Q \rangle_b\right|^2\right] \lesssim \max\left\{ \frac{1}{n},\; n^{-\frac{8(s - t)}{4s + d}} \right\}.$$
Lower Bound: Note that $\sup_{\|z\| > T} b_z^2 / a_z^2 \asymp T^{-2(s - t)}$ and $\sum_{\|z\| \leq T} b_z^4 \asymp T^{d + 4t}$. Solving the bias-variance balance gives $T_n \asymp n^{\frac{2}{4s + d}}$. Thus, Theorem 5 gives a minimax lower bound of
$$M(a, b, n) \gtrsim \max\left\{ \frac{1}{n},\; n^{-\frac{8(s - t)}{4s + d}} \right\},$$
matching the upper bound. Note that the rate is parametric ($\asymp n^{-1}$) when $s \geq 2t + d/4$, and slower otherwise.

Gaussian RKHS: For some $\sigma' > \sigma > 0$, $b_z = e^{\sigma \|z\|^2}$ and $a_z = e^{\sigma' \|z\|^2}$.
Upper Bound: By Proposition 1, the bias is at most $\sup_{\|z\| > T} e^{2(\sigma - \sigma')\|z\|^2} = e^{-2(\sigma' - \sigma)T^2}$. If we use the upper bound
$$\sum_{\|z\| \leq T} e^{c \|z\|^2} \leq C_\varepsilon\, e^{(c + \varepsilon) T^2}$$
for any $\varepsilon > 0$ and some $C_\varepsilon > 0$, then Proposition 2 gives
$$\mathrm{Var}\left[\hat S_T\right] \lesssim \frac{1 + e^{(4\sigma - 2\sigma' + \varepsilon)T^2}}{n} + \frac{e^{(4\sigma + \varepsilon)T^2}}{n^2}.$$
Thus,
$$\mathbb{E}\left[\left|\hat S_T - \langle P, Q \rangle_b\right|^2\right] \lesssim e^{-4(\sigma' - \sigma)T^2} + \frac{1 + e^{(4\sigma - 2\sigma' + \varepsilon)T^2}}{n} + \frac{e^{(4\sigma + \varepsilon)T^2}}{n^2}.$$
One can check that this bound is minimized when $T^2 \asymp \frac{2 \log n}{4\sigma' + \varepsilon}$, and that, for this choice of $T$, the term $e^{(4\sigma + \varepsilon)T^2}/n^2$ is of lower order, giving an MSE convergence rate of
$$\max\left\{ \frac{1}{n},\; n^{-\frac{8(\sigma' - \sigma)}{4\sigma' + \varepsilon}} \right\}.$$
Lower Bound: Again, we use the bound
$$\sum_{\|z\| \leq T} e^{c \|z\|^2} \leq C_\varepsilon\, e^{(c + \varepsilon) T^2},$$
as well as the trivial lower bound $\sum_{\|z\| \leq T} e^{c \|z\|^2} \geq e^{c T^2}$. Solving the bias-variance balance gives $T^2 \asymp \frac{\log n}{2\sigma'}$ up to $\log \log n$ factors. Thus, ignoring such factors, Theorem 5 gives a minimax lower bound of
$$M(a, b, n) \gtrsim \max\left\{ \frac{1}{n},\; n^{-\frac{8(\sigma' - \sigma)}{4\sigma' + \varepsilon}} \right\}$$
for some $\varepsilon \to 0$ as $n \to \infty$, matching the upper bound rate. Note that the rate is parametric when $\sigma' \geq 2\sigma$, and slower otherwise.

Exponential RKHS: For some $\sigma' > \sigma > 0$, $b_z = e^{\sigma \|z\|}$ and $a_z = e^{\sigma' \|z\|}$.
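As a hedged illustration of the dimension-independence of the elbow claimed in the introduction, the following sketch (our own; it assumes radial exponential weights as above and encodes the resulting MSE exponent $\min\{1,\, 2(\sigma' - \sigma)/\sigma'\}$, up to polylogarithmic factors) shows that the exponent involves only the two scale parameters, with no dependence on $d$:

```python
def exponential_mse_exponent(sigma_b, sigma_a):
    """Illustrative MSE exponent r (M ~ n^{-r}, up to polylog factors) for
    radial exponential weights b_z = exp(sigma_b * ||z||) and
    a_z = exp(sigma_a * ||z||), with 0 < sigma_b < sigma_a.
    Note that r does not involve the dimension d."""
    assert 0.0 < sigma_b < sigma_a
    return min(1.0, 2.0 * (sigma_a - sigma_b) / sigma_a)
```

Under these assumptions, the elbow sits at $\sigma' = 2\sigma$ in any dimension, in contrast with the Sobolev elbow $s = 2t + d/4$.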