I Introduction
Let be a probability measure with alphabet size , where we use a vector representation of ; i.e., and . Let be a mapping from to . Given a set of i.i.d. samples , we deal with the problem of estimating an additive functional of . The additive functional of is defined as
(1) 
We simplify this notation to . Most entropy-like criteria can be formed in terms of . For instance, when , is Shannon entropy. For a positive real , letting , becomes Rényi entropy. More generally, letting where is a concave function, becomes entropies Akaike1998InformationPrinciple. The estimation problem of such entropy-like criteria is a basic but important component in various research areas, such as physics Lake2011AccurateDevices, neuroscience Nemenman2004EntropyProblem, security Gu2005DetectingEstimation, and machine learning Quinlan1986InductionTrees,Peng2005FeatureMinredundancy.
The goal of this study is to construct the minimax optimal estimator of given a function . To precisely define the minimax optimality, we introduce the (quadratic) minimax risk. A sufficient statistic of is a histogram , where letting be the indicator function, and . The estimator of can thus be defined as a function , where for an integer . The quadratic minimax risk is defined as
(2) 
where is the set of all probability measures on , and the infimum is taken over all estimators . With this definition, an estimator is minimax (rate-)optimal if there is a constant such that
(3) 
Since no estimator achieves a smaller worst-case risk than the minimax risk, the minimax optimal estimator is the best with respect to the worst-case risk.
Notations. We now introduce some additional notations. For any positive real sequences and , denotes that there exists a positive constant such that . Similarly, denotes that there exists a positive constant such that . Furthermore, implies and . For an event , we denote its complement as . For two real numbers and , and .
I-A Related Work
Many researchers have studied the estimation problem of the additive functional and have provided many estimators and analyses over the past decades. The plug-in estimator, or maximum likelihood estimator (MLE), is the simplest way to estimate the additive functional , in which the empirical probabilities are substituted into as . The plug-in estimator is asymptotically consistent Antos2001ConvergenceDistributions, asymptotically efficient, and minimax optimal Vaart1998AsymptoticStatistics under weak assumptions for fixed . However, this is not true for the large regime. Indeed, Jiao2015MinimaxDistributions and Wu2016MinimaxApproximation derived a lower bound for the quadratic risk for the plug-in estimator of and . In the case of Shannon entropy, the lower bound is given by Jiao2015MinimaxDistributions as
(4) 
The first term comes from the bias and indicates that if grows linearly with respect to , the plug-in estimator becomes inconsistent. Bias-correction methods, such as Miller1955NoteEstimates,Grassberger1988FiniteEstimates,Zahl1977JackknifingDiversity, can be applied to the plug-in estimator of to reduce the bias; however, these bias-corrected estimators are still inconsistent if is larger than . The estimators based on Bayesian approaches in Schurmann1996EntropySequences,Schober2013SomeDistributions,Holste1998BayesEntropies are also inconsistent for Han2015DoesProblem.
Paninski2004EstimatingSamples first revealed the existence of a consistent estimator even if the alphabet size is larger than linear order of the sample size . However, they did not provide a concrete form of the consistent estimator. The first estimator that achieves consistency in the large regime was proposed by Valiant2011EstimatingCLTs. However, the estimator of Valiant2011EstimatingCLTs has not been shown to achieve the minimax rate even in a more detailed analysis in Valiant2011TheEstimators.
Recently, many researchers have investigated the minimax optimal risk for the additive functionals in the large regime for some specific . Acharya2015TheEntropy showed that the bias-corrected estimator of Rényi entropy achieves the minimax optimal risk in regard to the sample complexity if and , but they did not show the minimax optimality for other . Jiao2015MinimaxDistributions introduced a minimax optimal estimator for for any in the large regime. Wu2015ChebyshevUnseen derived a minimax optimal estimator for . For , Jiao2015MinimaxDistributions,Wu2016MinimaxApproximation independently introduced the minimax optimal estimator in the large regime. Table I summarizes the existing minimax optimal risks for the additive functional estimation with some specific . The first column shows the target function , and the second column denotes the parameter appearing in . The third column shows the minimax optimal risk corresponding to , where these rates are proved only when the condition shown in the fourth column is satisfied. In the case of Shannon entropy, the optimal risk was obtained as
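As a concrete illustration of the plug-in estimator and of a first-order bias correction, the following minimal sketch estimates Shannon entropy (in nats) from a list of samples. The function names are hypothetical, and the correction term `(m - 1) / (2n)`, with `m` the number of distinct observed symbols, is the commonly used Miller-Madow form; it is shown here as an assumption about what a first-order correction looks like, not as the exact estimator analyzed in the cited works.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Plug-in (MLE) estimate of Shannon entropy in nats:
    substitute empirical frequencies into -sum p * ln(p)."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def miller_corrected_entropy(samples):
    """Plug-in estimate plus the first-order correction (m - 1) / (2n),
    where m is the number of distinct symbols observed (assumed form)."""
    n = len(samples)
    m = len(set(samples))
    return plugin_entropy(samples) + (m - 1) / (2 * n)

samples = ["a", "b", "a", "c", "a", "b", "d", "a"]
h_plugin = plugin_entropy(samples)
h_miller = miller_corrected_entropy(samples)
```

The correction only shifts the estimate by a term of order alphabet size over sample size, which is consistent with the remark above that such corrections do not restore consistency once the alphabet is large relative to the sample.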
(5) 
The first term is improved from eq. 4. It indicates that the introduced estimator can consistently estimate Shannon entropy even when , as long as .
minimax risk  condition  

Jiao2015MinimaxDistributions  
  Jiao2015MinimaxDistributions  
    Jiao2015MinimaxDistributions,Wu2016MinimaxApproximation  
Jiao2015MinimaxDistributions  
  Jiao2017MaximumDistributions  
    Wu2015ChebyshevUnseen 
While the recent efforts revealed the minimax optimal estimators for the additive functionals with some specific , there is no unified methodology to derive the minimax optimal estimator for the additive functional with general . Jiao2015MinimaxDistributions suggested that their proposed estimator can be extended for general additive functional . However, the minimax optimality of the estimator was only proved for specific cases of , including and . To prove the minimax optimality for other , we need to individually analyze the minimax optimality for specific . The aim of the present paper is to clarify which property of substantially influences the minimax optimal risk when estimating the additive functional.
The optimal estimators for divergences with large alphabet size have been investigated in Bu2016EstimationDistributions,Han2016MinimaxDistributions,Jiao2016MinimaxDistance,Acharya2018ProfileDivergence. The estimation problems of divergences are considerably more complicated than those of additive functionals, although similar techniques were applied to derive the minimax optimality.
I-B Contributions
In this paper, we investigate the minimax optimal risk of the additive functional estimation and reveal a substantial property of that characterizes the minimax optimal risk. More precisely, we show that the divergence speed of , which is defined below, characterizes the minimax optimal risk of the additive functional estimation.
[Divergence speed] For a positive integer and , the th divergence speed of is if there exist constants , and such that for all ,
(6) 
The divergence speed is faster if is larger. Informally, the meaning of “the th divergence speed of a function is ” is that goes to infinity at the same speed as the th derivative of when approaches . In table I, the divergence speed of for non-integer is for any . Also, the divergence speed of is for any .
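The informal reading above can be checked numerically. The sketch below is an illustration under assumptions: the names `alpha` and `order` stand in for the exponent and the derivative order, and the two test functions (a power function and the Shannon entropy summand `-p ln p`) are chosen only because their derivatives are available in closed form. It shows that the ratio between the magnitude of the derivative and the matching power of `p` stays constant as `p` approaches zero.

```python
import math

def power_derivative(alpha, order, p):
    """order-th derivative of p**alpha:
    alpha * (alpha - 1) * ... * (alpha - order + 1) * p**(alpha - order)."""
    coeff = 1.0
    for j in range(order):
        coeff *= alpha - j
    return coeff * p ** (alpha - order)

def entropy_second_derivative(p):
    """Second derivative of -p * ln(p), which equals -1 / p."""
    return -1.0 / p

alpha, order = 0.5, 2
points = [1e-2, 1e-4, 1e-6]

# |derivative| / p**(alpha - order) stays constant as p -> 0.
ratios_power = [
    abs(power_derivative(alpha, order, p)) / p ** (alpha - order) for p in points
]

# For -p * ln(p), the second derivative grows like p**(1 - 2).
ratios_entropy = [
    abs(entropy_second_derivative(p)) / p ** (1 - 2) for p in points
]
```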
minimax risk  condition  estimator  

no consistent estimator    
best poly. & 2nd-order bias-correction  
  best poly. & 2nd-order bias-correction  
  best poly. & 2nd-order bias-correction  
  best poly. & 4th-order bias-correction  
  plug-in 
The results are summarized in table II. This table shows the minimax optimal risk (third column), the condition to prove the minimax optimal risk (fourth column), and the estimator that achieves the optimal risk (fifth column) for each range of . The column (second column) means that the presented minimax optimality is valid if the th divergence speed of is . As we can see from table II, the minimax risk is affected only by . Thus, we succeed in characterizing the minimax optimal risk only by the property of , i.e., in the divergence speed, without specifying . In general, the convergence speed of the minimax optimal risk becomes faster as increases but saturates for .
As shown in table II, the behaviour of the minimax optimal risk changes with the range of . If , is an unbounded function. Thus, we trivially show that there is no consistent estimator of . In other words, the minimax optimal rate is larger than constant order if . This means that there is no reasonable estimator if , and thus there is no need to derive the minimax optimal estimator for this case.
For and , Jiao2015MinimaxDistributions showed the same minimax optimal risk for . Besides, Jiao2015MinimaxDistributions,Wu2016MinimaxApproximation proved the same minimax optimal risk for . Our result is a generalized version of their results such that it is applicable to the general including and .
For , we show that the minimax optimal risk is
(7) 
where we assume to prove the above rate. As an existing result, Jiao2015MinimaxDistributions proved the same minimax optimal risk for , where their proof requires an assumption (first row in table I). Their assumption is stronger than ours, i.e., . In this sense, we provide a clearer understanding of the additive functional estimation problem for this case.
For , we prove the following minimax optimal risk
(8) 
As an existing result for this range, Jiao2015MinimaxDistributions investigated the minimax optimal risk for . As shown in table I (fourth row), they showed that the minimax optimal rate for is . However, their analysis requires the strong condition , and the minimax optimality for is therefore far from clearly understood. In contrast, we succeed in proving eq. 8 without any condition on the relationship between and . As a result, we clarify the number of samples that is necessary to estimate consistently.
II Preliminaries
II-A Poisson Sampling
We employ the Poisson sampling technique to derive upper and lower bounds for the minimax risk. The Poisson sampling technique models the histogram counts as independent Poisson random variables, whereas the counts of the original samples follow a multinomial distribution. Specifically, the sufficient statistic for in the Poisson sampling is a histogram , where are independent random variables such that . The minimax risk for Poisson sampling is defined as follows:
(9) 
The minimax risk of the Poisson sampling well approximates that of the multinomial distribution. Indeed, Jiao2015MinimaxDistributions presented the following lemma. [Jiao2015MinimaxDistributions] The minimax risks under the Poisson model and the multinomial model are related via the following inequalities:
(10) 
section IIA states , and thus we can derive the minimax rate of the multinomial distribution from that of the Poisson sampling.
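The reason the Poisson model yields independent Poisson counts can be verified numerically for a single symbol. The sketch below assumes the usual Poissonization construction (a Poisson-distributed total sample size, followed by multinomial sampling); the function names and the truncation level `max_total` are illustrative choices. It checks that mixing a binomial count over a Poisson total reproduces a Poisson distribution whose mean is the sample-size parameter times the symbol probability.

```python
import math

def poisson_pmf(j, mean):
    """P(N = j) for N ~ Poisson(mean), computed in log space for stability."""
    return math.exp(-mean + j * math.log(mean) - math.lgamma(j + 1))

def binomial_pmf(j, m, p):
    """P(B = j) for B ~ Binomial(m, p)."""
    return math.comb(m, j) * p ** j * (1 - p) ** (m - j)

def poissonized_count_pmf(j, n, p, max_total=200):
    """P(count = j) when total ~ Poisson(n) and count | total ~ Binomial(total, p).
    The infinite sum over totals is truncated at max_total (assumed large enough)."""
    return sum(
        poisson_pmf(m, n) * binomial_pmf(j, m, p) for m in range(j, max_total + 1)
    )

n, p = 20, 0.3
mixture = [poissonized_count_pmf(j, n, p) for j in range(15)]
direct = [poisson_pmf(j, n * p) for j in range(15)]
max_gap = max(abs(a - b) for a, b in zip(mixture, direct))
```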
II-B Polynomial Approximation
Cai2011TestingFunctional presented a technique of the best polynomial approximation for deriving the minimax optimal estimators and their lower bounds for the risk. Besides, Jiao2017MaximumDistributions used the Bernstein polynomial approximation to derive the upper bound on the estimation error of the plug-in estimator. We use such polynomial approximation techniques to derive upper and lower bounds on the minimax optimal risk for the additive functional estimation.
The key to characterizing these approximations is the (weighted) modulus of smoothness. For an interval , let us define the th finite difference of a real-valued scalar function at point as
(11) 
if and , and otherwise . The modulus of smoothness of on an interval is defined as
(12) 
More generally, we can define the weighted modulus of smoothness by introducing a weight function . The weighted modulus of smoothness of on an interval is defined as
(13) 
Note that for . is also known as the modulus of continuity.
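The definitions above can be explored numerically. The following sketch assumes the standard symmetric form of the finite difference (alternating binomial weights on a stencil centred at the point, set to zero when the stencil leaves the interval) and approximates the supremum in the unweighted modulus of smoothness on a finite grid. The names `lo`, `hi`, and `grid` are illustrative parameters, and the grid search gives only a lower approximation of the true supremum.

```python
import math

def finite_difference(f, r, h, x, lo=0.0, hi=1.0):
    """r-th symmetric finite difference of f at x with step h.
    Returns 0 when the stencil x +/- r*h/2 leaves [lo, hi]."""
    if x - r * h / 2 < lo or x + r * h / 2 > hi:
        return 0.0
    return sum(
        (-1) ** k * math.comb(r, k) * f(x + (r / 2 - k) * h) for k in range(r + 1)
    )

def modulus_of_smoothness(f, r, t, lo=0.0, hi=1.0, grid=200):
    """Grid approximation of the sup over 0 < h <= t and x in [lo, hi]
    of |finite_difference(f, r, h, x)|."""
    best = 0.0
    for i in range(1, grid + 1):
        h = t * i / grid
        for j in range(grid + 1):
            x = lo + (hi - lo) * j / grid
            best = max(best, abs(finite_difference(f, r, h, x, lo, hi)))
    return best

# For f(x) = x**2 the second difference equals 2 * h**2 wherever it is defined,
# so the second-order modulus at t = 0.1 should be close to 0.02.
omega = modulus_of_smoothness(lambda x: x * x, r=2, t=0.1)
```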
We introduce a useful property of the modulus of smoothness, which will be used in later analyses: [DeVore1993ConstructiveApproximation] For a positive integer and any times continuously differentiable function where , there exists a constant only depending on such that
(14) 
For later analyses, we use two kinds of polynomial approximations: the Bernstein polynomial approximation and the best polynomial approximation.
Bernstein Polynomial Approximation. A Bernstein polynomial is a linear combination of Bernstein basis polynomials, which are defined as
(15) 
Given a function , the polynomial obtained by the Bernstein polynomial approximation with degree is defined as
(16) 
If is continuous on , the Bernstein polynomial converges to as tends to .
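The convergence of the Bernstein polynomial approximation can be observed directly. The sketch below uses the standard Bernstein basis on the unit interval (binomial coefficient times powers of `x` and `1 - x`); the function names are hypothetical. For the test function `x**2`, the approximation error at a point `x` is known to equal `x * (1 - x) / n`, which gives a convenient check.

```python
import math

def bernstein_basis(nu, n, x):
    """Bernstein basis polynomial: C(n, nu) * x**nu * (1 - x)**(n - nu)."""
    return math.comb(n, nu) * x ** nu * (1 - x) ** (n - nu)

def bernstein_approximation(f, n, x):
    """Degree-n Bernstein approximation of f at x: sum of f(nu / n) times the basis."""
    return sum(f(nu / n) * bernstein_basis(nu, n, x) for nu in range(n + 1))

f = lambda x: x * x
x0 = 0.3
# Pointwise error at x0 for increasing degree; it shrinks like x0 * (1 - x0) / n.
errors = {n: abs(bernstein_approximation(f, n, x0) - f(x0)) for n in (10, 50, 200)}
```

The slow `1 / n` decay for a smooth function is one reason the best polynomial approximation, rather than the Bernstein approximation, is used when the sharpest approximation error is needed.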
Ditzian1994DirectPolynomials provided an upper bound on the pointwise error of the Bernstein polynomial approximation by using the second-order modulus of smoothness: [A special case of Ditzian1994DirectPolynomials] Given a function , for an arbitrary , we have
(17) 
Best Polynomial Approximation. Let be the set of polynomials of degree up to . Given a polynomial and a function defined on an interval , the error between and is defined as
(18) 
The best polynomial approximation of by a degree polynomial is a polynomial that minimizes the error. Such a polynomial uniquely exists if is continuous and can be obtained, for instance, by the Remez algorithm Remez1934SurDonnee if is bounded.
The error of the best polynomial approximation is defined as
(19) 
This error is non-increasing as increases because covers all the smaller degree polynomials, i.e., . The rate at which this error decreases with respect to the degree has been well studied since the 1960s Timan1963TheoryVariable,Petrushev1988RationalFunctions,Ditzian2012ModuliSmoothness,Achieser2004TheoryApproximation.
Ditzian2012ModuliSmoothness revealed that if , the weighted modulus of smoothness with characterizes the best polynomial approximation error regarding . They showed the following direct and converse inequalities: [direct result Ditzian2012ModuliSmoothness] For a continuous function defined on , we have for ,
(20) 
[converse result Ditzian2012ModuliSmoothness] For a continuous function defined on , we have
(21) 
As a consequence of these lemmas, the best polynomial approximation error is characterized by the weighted modulus of smoothness as follows: [Ditzian2012ModuliSmoothness] Let be a continuous realvalued function on . Then, for ,
(22) 
are equivalent, where . From section IIB, we can obtain the best polynomial approximation error rate regarding by analyzing the weighted modulus of smoothness.
III Main Results
Our main results reveal the minimax optimal risk of the additive functional estimation, characterized by the divergence speed defined in section IB. The behaviour of the minimax optimal risk varies depending on the range of as , , , , , . We will show the minimax optimal risk for each range of one by one.
First, we demonstrate that we cannot construct a consistent estimator for . Suppose is a function whose first divergence speed is for . Then, there is no consistent estimator, i.e., . The proof of section III is given in appendix A. Consistency is a necessary property for a reasonable estimator, and thus this proposition shows that there is no reasonable estimator if . For this reason, there is no need to derive the minimax optimal estimator in this case.
For , the minimax optimal risk is obtained as follows. Suppose is a function whose fourth divergence speed is for . If and , the minimax optimal risk is obtained as
(23) 
If , there is no consistent estimator. For this range, Jiao2015MinimaxDistributions showed the same minimax optimal risk for the specific function . However, their analysis needs a stronger condition, , to prove the rate. We succeed in proving this rate under the weaker condition . However, for , the optimal risk is not obtained in the current analysis and remains an open problem.
We follow a similar approach to Jiao2015MinimaxDistributions to derive the estimator that achieves section III. In Jiao2015MinimaxDistributions, the optimal estimator is constructed from two estimators: the best polynomial estimator and the bias-corrected plug-in estimator. The best polynomial estimator is an unbiased estimator of the polynomial that best approximates . The bias-corrected plug-in estimator is an estimator obtained by applying two techniques to the plug-in estimator of . The first technique is the bias correction of Miller1955NoteEstimates, which offsets the second-order approximation of the bias. The second technique deals with the requirement from the plug-in estimator; that is, the plugging function should be smooth. Jiao2015MinimaxDistributions utilized an interpolation technique to fulfill the smoothness requirement. However, the second technique is not directly applicable to general . We therefore introduce a truncation operator as a surrogate for the interpolation technique. We will describe the optimal estimator for general using the truncation operator in section IV.
Next, the following theorem gives the optimal risk of the additive functional estimation for . Suppose is a function whose fourth divergence speed is for . If , the minimax optimal risk is obtained as
(24) 
otherwise there is no consistent estimator. Jiao2015MinimaxDistributions also showed the same minimax optimal risk for with this range of . section III generalizes their result to general . The optimal estimator for is equivalent to the estimator for .
If , the minimax optimal risk is given by the following theorem. Suppose is a function whose fourth divergence speed is for . If , the minimax optimal risk is obtained as
(25) 
otherwise there is no consistent estimator. One of the functions that satisfies the divergence speed assumption with is , and the optimal risk of this function was revealed by Jiao2015MinimaxDistributions,Wu2016MinimaxApproximation. They showed the same optimal risk for , and thus section III is a generalized version of their result. The optimal estimator is also equivalent to the estimator for .
If , the minimax optimal risk is given by the following theorem. Suppose is a function whose sixth divergence speed is for . If , the minimax optimal risk is obtained as
(26) 
otherwise there is no consistent estimator. For this range, Jiao2015MinimaxDistributions showed the minimax optimal risk for as under the condition that . In contrast, we succeed in proving the minimax optimal risk for this range without the condition . The first term corresponds to their result because it is the same as their result under the condition they assume. The optimal estimator is similar to the estimator for except that we apply a slightly different bias-correction technique to the bias-corrected plug-in estimator. In the estimator for , we use the bias-correction technique of Miller1955NoteEstimates, which offsets the second-order Taylor approximation of the bias. Instead of the technique of Miller1955NoteEstimates, we introduce a technique that offsets the fourth-order Taylor approximation of the bias. This technique will be explained in detail in section IV.
A lower bound for the second term in sections III, III and III is easily obtained by applying Le Cam’s two point method Tsybakov2009IntroductionEstimation. To prove the lower bound of the first term in sections III, III, III and III, we follow the same approach as the analysis given by Wu2015ChebyshevUnseen, in which the minimax lower bound is connected to the lower bound on the best polynomial approximation. Our careful analysis of the best polynomial approximation yields the lower bound.
The optimal minimax risk for is obtained as follows. Suppose is a function whose second divergence speed is for . Then, the minimax optimal risk is obtained as
(27) 
The second divergence speed of is as long as the second derivative of is bounded. Since a function whose th divergence speed is for any and has bounded second derivative, this result covers all cases for . The upper bound is obtained by employing the plug-in estimator; its analysis is easy if because the second derivative of is bounded. For , we extend the analysis of given by Jiao2017MaximumDistributions to be applicable to general . The lower bound can be obtained easily by application of Le Cam’s two point method Tsybakov2009IntroductionEstimation.
IV Estimator for
In this section, we describe our estimator for in the case . The optimal estimator for is composed of the bias-corrected plug-in estimator and the best polynomial estimator. We first describe the overall estimation procedure on the supposition that the bias-corrected plug-in estimator and the best polynomial estimator are black boxes. Then, we describe the bias-corrected plug-in estimator and the best polynomial estimator in detail.
For simplicity, we assume the samples are drawn from the Poisson sampling model, where we first draw , and then draw i.i.d. samples . Given the samples , we first partition the samples into two sets. We use one set of the samples to determine whether the bias-corrected plug-in estimator or the best polynomial estimator should be employed, and the other set is used to estimate . Let be i.i.d. random variables drawn from the Bernoulli distribution with parameter , i.e., for . We partition according to , and construct the histograms and , which are defined as
(28) 
for . Then, and are independent histograms, and .
Given , we determine whether the bias-corrected plug-in estimator or the best polynomial estimator should be employed for each symbol of the alphabet. Let be a threshold depending on and to determine which estimator is employed, which will be specified in section VI. We apply the best polynomial estimator if , and otherwise, i.e., , we apply the bias-corrected plug-in estimator. Let and be the best polynomial estimator and the bias-corrected plug-in estimator for , respectively. Then, the estimator of is written as
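The sample-splitting step can be sketched as follows. This is a minimal illustration, assuming that each observed sample is assigned independently to the first set with probability `prob`; the default value `0.5` and the names `split_histogram`, `first`, and `second` are assumptions made for the example rather than values taken from the text. Under the Poisson model, such independent thinning of a Poisson count produces two independent Poisson counts, which is what makes the two histograms independent.

```python
import random

def split_histogram(histogram, rng, prob=0.5):
    """Split each count by assigning every observed sample independently
    to the first histogram with probability `prob` (assumed value)."""
    first, second = [], []
    for count in histogram:
        kept = sum(1 for _ in range(count) if rng.random() < prob)
        first.append(kept)
        second.append(count - kept)
    return first, second

rng = random.Random(0)  # fixed seed so the example is reproducible
histogram = [5, 0, 12, 3]
first, second = split_histogram(histogram, rng)
```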
(29) 
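The overall procedure, with the two component estimators treated as black boxes, can be sketched as below. The names `combined_estimate`, `poly_estimator`, and `plugin_estimator` are hypothetical, the strict comparison against `threshold` is an assumption, and the choice of which histogram drives the selection and which is used for estimation is a convention of this sketch.

```python
def combined_estimate(first, second, threshold, poly_estimator, plugin_estimator):
    """Sum per-symbol estimates. The count in `second` selects the estimator;
    the independent count in `first` is what the chosen estimator is applied to."""
    total = 0.0
    for n_first, n_second in zip(first, second):
        if n_second < threshold:
            # Small count: the symbol is likely rare, so use the polynomial estimator.
            total += poly_estimator(n_first)
        else:
            # Large count: the bias-corrected plug-in estimator is used.
            total += plugin_estimator(n_first)
    return total

# Dummy black boxes, only to show the control flow.
value = combined_estimate(
    first=[3, 4, 5],
    second=[0, 7, 2],
    threshold=3,
    poly_estimator=lambda count: 1.0,
    plugin_estimator=lambda count: 10.0,
)
```

Selecting on one histogram and estimating on the other keeps the selection event independent of the count being plugged in, which simplifies the bias and variance analysis.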
Next, we describe the details of the best polynomial estimator and the bias-corrected plug-in estimator .
IV-A Best Polynomial Estimator
The best polynomial estimator is an unbiased estimator of the polynomial that provides the best approximation of . Let be coefficients of the polynomial that achieves the best approximation of by a degree polynomial with range , where will be specified in section VI. Then, the approximation of by the polynomial at point is written as
(30) 
From eq. 30, an unbiased estimator of can be derived from an unbiased estimator of . For the random variable drawn from the Poisson distribution with mean parameter , the expectation of the th factorial moment becomes . Thus, is an unbiased estimator of . Substituting this into eq. 30 gives the unbiased estimator of as
(31) 
Next, we truncate so that it is not outside of the domain of . Let and . Then, the best polynomial estimator is defined as
(32) 
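The factorial-moment construction can be checked numerically. The sketch below assumes that the polynomial is given by coefficients of powers of the symbol probability `p` and that the count follows a Poisson distribution with mean `n * p`; the names `coeffs`, `lower`, and `upper` are hypothetical, with the clipping bounds standing in for the truncation described above. It verifies, by summing against the Poisson probability mass function, that the untruncated estimator is unbiased for the polynomial.

```python
import math

def falling_factorial(count, m):
    """count * (count - 1) * ... * (count - m + 1); equals 1 when m == 0."""
    result = 1
    for j in range(m):
        result *= count - j
    return result

def polynomial_estimator(count, n, coeffs, lower, upper):
    """Estimate sum_m coeffs[m] * p**m from count ~ Poisson(n * p) by replacing
    p**m with falling_factorial(count, m) / n**m, then clip to [lower, upper]."""
    raw = sum(a * falling_factorial(count, m) / n ** m for m, a in enumerate(coeffs))
    return min(max(raw, lower), upper)

def poisson_pmf(j, mean):
    """P(N = j) for N ~ Poisson(mean)."""
    return math.exp(-mean + j * math.log(mean) - math.lgamma(j + 1))

# Unbiasedness check of the unclipped estimator (bounds set to +/- infinity).
n, p, coeffs = 30, 0.2, [0.5, -1.0, 2.0, 0.25]
target = sum(a * p ** m for m, a in enumerate(coeffs))
expected = sum(
    poisson_pmf(j, n * p) * polynomial_estimator(j, n, coeffs, -math.inf, math.inf)
    for j in range(200)
)
```

Clipping introduces a small bias in exchange for keeping the estimate inside a prescribed range, so the exact unbiasedness shown here applies only to the unclipped version.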
IV-B Bias-corrected Plug-in Estimator
The problem with the plug-in estimator is that it causes a large bias when the occurrence probability of a symbol is small. This large bias comes from a large derivative around . To avoid this large bias, we truncate the function as follows:
(33) 
Besides, we define the derivative of the truncated function as
(34) 
where . Then, we construct bias-corrected plug-in estimators for this truncated function. Note that this derivative is not the same as the standard derivative of because is not differentiable at and . Even when differentiability is lost by this truncation, a technique using the generalized Hermite interpolation shown in section VI enables us to obtain the bounds on the bias and variance of the bias-corrected plug-in estimator for this truncated function.
We use slightly different bias-correction methods for and . We use second-order bias correction for and fourth-order bias correction for . We describe the second-order bias correction first and then move on to the fourth-order bias correction.
Second-order bias correction. For , we employ the plug-in estimator with the bias correction of Miller1955NoteEstimates. The bias correction offsets the second-order Taylor approximation of the bias, which is obtained as follows.
(35) 
where for . The bias-corrected function is hence obtained as .
Using the truncation operator, the truncated second-order bias-corrected function is defined as
(36) 
Then, is the plug-in estimator of ; that is
(37) 
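The effect of the second-order correction can be seen on a simple example. The sketch below assumes the Poisson model, in which the empirical frequency `N / n` has variance `p / n`, so that the second-order Taylor term of the bias is `p / (2n)` times the second derivative; the corrected function used here, `phi(p) - p / (2n) * phi''(p)`, follows from that assumption. Truncation is omitted, and all names are hypothetical. For a quadratic function the correction removes the bias exactly, which the code verifies by direct summation.

```python
import math

def second_order_corrected(phi, phi_second, n):
    """Return p -> phi(p) - p / (2n) * phi''(p): the plug-in function with the
    second-order Taylor term of the Poisson-model bias subtracted (assumed form)."""
    return lambda p: phi(p) - p / (2 * n) * phi_second(p)

def poisson_pmf(j, mean):
    """P(N = j) for N ~ Poisson(mean)."""
    return math.exp(-mean + j * math.log(mean) - math.lgamma(j + 1))

def expectation(g, n, p, max_count=400):
    """E[g(N / n)] for N ~ Poisson(n * p), by direct (truncated) summation."""
    return sum(poisson_pmf(j, n * p) * g(j / n) for j in range(max_count))

n, p = 50, 0.1
phi = lambda q: q * q
phi_second = lambda q: 2.0

# Plain plug-in: E[(N / n)**2] = p**2 + p / n, so the bias is p / n.
bias_plain = expectation(phi, n, p) - phi(p)
# Corrected plug-in: the p / n term is cancelled for this quadratic example.
bias_corrected = expectation(second_order_corrected(phi, phi_second, n), n, p) - phi(p)
```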
Fourth-order bias correction. For , we employ the fourth-order bias correction. In analogy with the second-order bias correction, the fourth-order bias correction offsets the fourth-order approximation of the bias. By the Taylor approximation, the bias of the plug-in estimator for is obtained as
(38) 
Thus, the fourth-order bias-corrected function is obtained as .
As with the second-order bias correction, we define the truncated fourth-order bias-corrected function as
(39) 
Then, is the plug-in estimator of ; that is,
(40) 
V Consequential Properties of Divergence Speed Assumption
In this section, we present some properties that come from the divergence speed assumption; these properties are useful for the later analyses.
V-A Lower Order Divergence Speed
If the th divergence speed of a function is for some and , the same divergence speed is satisfied for the lower order, such as , ,…, under a certain condition. The precise claim is shown in the following lemma. For a real , let be an times continuously differentiable function on whose th divergence speed is . If , the th divergence speed of is also . The proof of this lemma can be found in appendix B. We immediately obtain the following consequence of section VA: For a real , let be an times continuously differentiable function on whose th divergence speed is . Then, for any positive integer such that and , the th divergence speed of is also .
Proof.
Applying section VA times yields the claim. ∎
When is an integer, the th derivative of diverges as approaches zero with logarithmic divergence speed, as shown in the following lemma: For a positive integer , let be an times continuously differentiable function on whose th divergence speed is , where . Then, there exist constants , and such that for all ,
(41) 
The proof of section VA is also shown in appendix B. The bounds on in section VA are similar to the divergence speed in section IB except that is replaced by . The function diverges to infinity as approaches zero, where its speed is slower than for any real and any integer such that .
V-B Hölder Continuity
The divergence speed assumption implies Hölder continuity of . For a real , a function is Hölder continuous on if
(42) 
In particular, Hölder continuity is known as Lipschitz continuity.
We reveal Hölder continuity of and its derivative under the assumption that the th divergence speed of is . We derive the Hölder continuity by dividing into three cases: , , and . Suppose is a function whose th divergence speed is for and . Then, is Hölder continuous. Suppose is a function whose th divergence speed is for and . Then, is Hölder continuous for any . Suppose is a function whose th divergence speed is for and such that . If , is Lipschitz continuous, and is Hölder continuous. Note that we can assume without loss of generality because, for any , where .
VI Upper Bound Analysis
In this section, we analyze the worst-case quadratic errors of the plug-in estimator and the proposed estimator described in section IV. We will prove the following theorems: Suppose is a function such that one of the following conditions holds:

the fourth divergence speed of is for ,

the sixth divergence speed of is for .
Let and where and are universal constants such that , , and . If , the worst-case risk of is bounded above as
(43) 
where we need if . If , the worst-case risk of is bounded above as
(44) 
If , the worst-case risk of is bounded above as