# Tail bounds for empirically standardized sums

Exponential tail bounds for sums play an important role in statistics, but the example of the t-statistic shows that the exponential tail decay may be lost when population parameters need to be estimated from the data. However, it turns out that if Studentizing is accompanied by estimating the location parameter in a suitable way, then the t-statistic regains the exponential tail behavior. Motivated by this example, the paper analyzes other ways of empirically standardizing sums and establishes tail bounds that are sub-Gaussian or even closer to normal for the following settings: Standardization with Studentized contrasts for normal observations, standardization with the log likelihood ratio statistic for observations from an exponential family, and standardization via self-normalization for observations from a symmetric distribution with unknown center of symmetry. The latter standardization gives rise to a novel scan statistic for heteroscedastic data whose asymptotic power is analyzed.


## 1 Introduction

Tail bounds and concentration inequalities for sums of independent random variables play a key role in statistics and machine learning, see e.g. van der Vaart and Wellner (1996), Boucheron et al. (2013), Vershynin (2018), or Wainwright (2019). Of particular importance are exponential tail bounds, which typically involve the expected value of the sum as well as a scale factor such as the variance. On the other hand, few results seem to be available when these parameters need to be estimated from the data, as may be required to make statistical methodology operational. The most prominent example is the $t$-statistic: If $X_1,\ldots,X_m$ are i.i.d. N$(\mu,\sigma^2)$, then

$$T := \frac{\frac{1}{\sqrt{m}}\sum_{i=1}^m (X_i-\mu)}{\sqrt{\frac{1}{m-1}\sum_{i=1}^m (X_i-\bar{X})^2}} \tag{1}$$

has the heavy algebraic tails of the $t_{m-1}$-distribution, so estimating $\sigma^2$ with the sample variance comes at the expense of losing the exponential tail decay. This paper explores the case where the expectation $\mu$ is also unknown and must be estimated. This is the typical setting for scan statistics, where observations in a scan window are assessed against an unknown baseline which is estimated with the sample mean of all observations, see e.g. Yao (1993). It turns out that, rather than exacerbating the situation, this additional estimation step actually restores the exponential tail bound:

###### Corollary 1 (to Proposition 1)

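Let $X_1,\ldots,X_n$ be i.i.d. N$(\mu,\sigma^2)$ and $1\le m<n$. Then for $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$:

$$V := \frac{\sqrt{\frac{n}{m(n-m)}}\sum_{i=1}^m (X_i-\bar{X})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2}}$$

satisfies

$$\mathbb{P}(V>t) \le \mathbb{P}(N(0,1)>t) \quad\text{for}\quad \begin{cases} t\ge 2.5 \text{ and } n\ge 10, \text{ or}\\ t\ge 2.75 \text{ and } n\ge 6.\end{cases}$$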

This result raises the question whether exponential tail bounds hold for other relevant ways of empirically (i.e. without using population parameters) standardizing sums. The answer turns out to be positive and this paper establishes tail bounds that are sub-Gaussian or even closer to normal for the following settings: standardization by empirically centering and Studentizing sums of normal observations in Section 2, standardization with the log likelihood ratio statistic for observations from an exponential family in Section 3, and standardization via self-normalization for observations from a symmetric distribution with unknown center of symmetry in Section 4. The latter standardization gives rise to a novel scan statistic for heteroscedastic data that is based on self-normalization, and its asymptotic power properties are also analyzed in Section 4. This analysis shows that the tail bounds are tight in the sense that they allow optimal detection in a certain scan problem; it is known that this optimality hinges on having the correct sub-Gaussian tail bound.

## 2 Normal tail bounds for Studentized contrasts and empirically centered sums

Corollary 1 about empirically centered and Studentized sums is a consequence of the following result about Studentized linear contrasts:

###### Proposition 1

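Let $X_1,\ldots,X_n$ be i.i.d. N$(\mu,\sigma^2)$ and $b\in\mathbb{R}^n$ with $\sum_{i=1}^n b_i=0$, $\sum_{i=1}^n b_i^2=1$. Then

$$V := \frac{\langle b,X\rangle}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2}}$$

is a pivot and satisfies a normal tail bound:

$$V \stackrel{d}{=} \frac{\sum_{i=1}^{n-1} Z_i}{\sqrt{\sum_{i=1}^{n-1} Z_i^2}} \ \text{ for } Z_i \text{ i.i.d. N}(0,1), \qquad \frac{V^2}{n-1} \sim \text{Beta}\Bigl(\frac{1}{2},\frac{n-2}{2}\Bigr),$$

$$\mathbb{P}(V>t) \le \mathbb{P}(N(0,1)>t) \quad\text{for}\quad \begin{cases} t\ge 2.5 \text{ and } n\ge 10, \text{ or}\\ t\ge 2.75 \text{ and } n\ge 6,\end{cases}$$

and the analogous bound holds for the left tail of $V$.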

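Corollary 1 follows from Proposition 1 by setting $b=(b_1,\ldots,b_n)$ with $b_i=\sqrt{\frac{n-m}{nm}}$ if $i\le m$ and $b_i=-\sqrt{\frac{m}{n(n-m)}}$ otherwise. Then $\sum_{i=1}^n b_i=0$ and $\sum_{i=1}^n b_i^2=1$.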
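The tail bound of Proposition 1 can be checked numerically without simulation, because the Beta law of $V^2/(n-1)$ fixes the density of $V$. The following is a minimal sketch under that assumption; the function names `density_v`, `tail_v` and `normal_tail` are illustrative and not from the paper, and the integration uses a plain composite Simpson rule.

```python
import math

def density_v(t, n):
    """Density of V for sample size n >= 3, implied by V^2/(n-1) ~ Beta(1/2, (n-2)/2)."""
    m = n - 1
    u = 1.0 - t * t / m
    if u <= 0.0:
        return 0.0  # V is supported on [-sqrt(m), sqrt(m)]
    log_c = (math.lgamma(m / 2) - math.lgamma(0.5)
             - math.lgamma((m - 1) / 2) - 0.5 * math.log(m))
    return math.exp(log_c + 0.5 * (m - 3) * math.log(u))

def tail_v(t, n, steps=20000):
    """P(V > t) by composite Simpson integration of the density (steps must be even)."""
    upper = math.sqrt(n - 1)
    if t >= upper:
        return 0.0
    h = (upper - t) / steps
    total = density_v(t, n) + density_v(upper, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density_v(t + k * h, n)
    return total * h / 3

def normal_tail(t):
    """P(N(0,1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

if __name__ == "__main__":
    for t, n in [(2.5, 10), (2.75, 6), (1.0, 10)]:
        print(t, n, tail_v(t, n), normal_tail(t))
```

In this sketch the comparison goes the other way for small thresholds (for example $t=1$, $n=10$), which illustrates why the bound is stated only for sufficiently large $t$.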

Studentization is a special case of self-normalization, see e.g. de la Peña et al. (2009) and Section 4. Self-normalization has certain advantages over standardizing with the population standard deviation because, roughly speaking, erratic fluctuations of the statistic are mirrored and therefore compensated by the random self-normalizing (Studentizing) term in the denominator, see Shao and Zhou (2016, 2017) for formal results. Corollary 1 shows that centering empirically rather than with the expected value can likewise be advantageous.

## 3 Sub-Gaussian tail bounds for the log likelihood ratio statistic

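Let $X_1,\ldots,X_n$ be independent observations from a regular one-dimensional natural exponential family $\{f_\theta,\theta\in\Theta\}$, i.e. $f_\theta$ is a density with respect to some $\sigma$-finite measure which is of the form $f_\theta(x)=\exp(\theta x-A(\theta))h(x)$ and the natural parameter space $\Theta$ is open.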

In order to derive good finite sample tail bounds in this setting, it turns out to be useful to standardize with the log likelihood ratio statistic rather than by centering and scaling. In more detail, let $\theta_0\in\Theta$ and $1\le m<n$. Then the generalized log likelihood ratio statistic based on the observations $X_1,\ldots,X_m$ is

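$$\log \mathrm{LR}_m(\theta_0) = \log\frac{\sup_{\theta\in\Theta}\prod_{i=1}^m f_\theta(X_i)}{\prod_{i=1}^m f_{\theta_0}(X_i)} = \sup_{\theta\in\Theta}\Bigl((\theta-\theta_0)\sum_{i=1}^m X_i - m\bigl(A(\theta)-A(\theta_0)\bigr)\Bigr) \tag{2}$$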

The MLE $\hat\theta_m$ is defined as the argmax of (2) if the argmax exists. Note that $\log \mathrm{LR}_m(\theta_0)$ is always well defined whether $\hat\theta_m$ exists or not.

$\sqrt{2\log \mathrm{LR}_m(\theta_0)}$ represents a standardization of the sum $\sum_{i=1}^m X_i$ since by Wilks' theorem $2\log \mathrm{LR}_m(\theta_0)$ is asymptotically pivotal if the population parameter is $\theta_0$. The idea pursued in this section is that the signed square root of $2\log \mathrm{LR}_m(\theta_0)$ is therefore approximately standard normal, and hence it might be possible to establish a finite sample sub-Gaussian tail bound. In the binomial case such a tail bound was indeed established by Rivera and Walther (2013), see also Harremoës (2016) for bounds when . This section first extends the binomial bound to the exponential family case and then addresses the case of empirical standardization where the typically unknown $\theta_0$ is replaced by the MLE.

It should be pointed out that while the square root of the log likelihood ratio does not commonly appear in the current literature, it has a history as a statistic for inference in exponential families. Barndorff-Nielsen (1986) calls the signed square root of $2\log \mathrm{LR}_m(\theta_0)$, as well as its empirically standardized counterpart below, the signed likelihood ratio statistic. Rivera and Walther (2013), Frick et al. (2014) and König et al. (2020) use this statistic for detection problems. An important advantage of working with this standardization is that it allows one to make full use of the power of the Chernoff bound, as can be seen from the proof of Theorem 1(a). The resulting tail bound is therefore tighter than those obtained from various relaxations of the Chernoff bound such as the Hoeffding or Bennett bounds.

Usually $\theta_0$ is not known. Then an empirical standardization is obtained by substituting the MLE $\hat\theta_n$, computed from all the observations $X_1,\ldots,X_n$, into the log likelihood ratio statistic:

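$$\log \mathrm{LR}_{m,n}(\hat\theta_n) = \log\frac{\bigl(\sup_{\theta\in\Theta}\prod_{i=1}^m f_\theta(X_i)\bigr)\bigl(\sup_{\theta\in\Theta}\prod_{i=m+1}^n f_\theta(X_i)\bigr)}{\sup_{\theta\in\Theta}\prod_{i=1}^n f_\theta(X_i)} \tag{3}$$

$$= \sup_{\theta\in\Theta}\Bigl(\theta\sum_{i=1}^m X_i - mA(\theta)\Bigr) + \sup_{\theta\in\Theta}\Bigl(\theta\sum_{i=m+1}^n X_i - (n-m)A(\theta)\Bigr) - \sup_{\theta\in\Theta}\Bigl(\theta\sum_{i=1}^n X_i - nA(\theta)\Bigr).$$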

As an aside, this statistic can be interpreted as the generalized log likelihood ratio test statistic for testing a common $\theta$ against different $\theta$ for $\{X_i, i\le m\}$ and $\{X_i, i>m\}$. The standardization in Corollary 1 has the same interpretation. In fact, if $f_\theta$ is N$(\mu,\sigma^2)$ with unknown mean $\mu$ and known $\sigma$, then one computes that $\sqrt{2\log \mathrm{LR}_{m,n}(\hat\theta_n)}$ equals $|V|$ with the sample variance replaced by $\sigma^2$ in the definition of $V$.

As another example, if the $X_i$ are Bernoulli with unknown parameter $p$, then the natural parameter for the exponential family is $\theta=\log\frac{p}{1-p}$. One computes that $\log \mathrm{LR}_{m,n}(\hat\theta_n)$ equals

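$$m\Bigl(\bar{X}_m\log\frac{\bar{X}_m}{\bar{X}}+(1-\bar{X}_m)\log\frac{1-\bar{X}_m}{1-\bar{X}}\Bigr)+(n-m)\Bigl(\bar{X}_{m^c}\log\frac{\bar{X}_{m^c}}{\bar{X}}+(1-\bar{X}_{m^c})\log\frac{1-\bar{X}_{m^c}}{1-\bar{X}}\Bigr)$$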

where $\bar{X}_m=\frac{1}{m}\sum_{i=1}^m X_i$, $\bar{X}_{m^c}=\frac{1}{n-m}\sum_{i=m+1}^n X_i$ and $\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i$. This statistic was proposed as a scan statistic by Kulldorff (1997) and, despite its lengthy form, has been widely adopted for scanning problems in computer science and statistics, see e.g. Neill and Moore (2004a, 2004b) and Walther (2010).
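The Bernoulli form of the statistic is straightforward to compute once the convention $0\log 0=0$ is fixed. The sketch below is one possible implementation under that convention; the helper names `bernoulli_kl` and `log_lr_two_sample` are hypothetical and chosen here for illustration only.

```python
import math

def bernoulli_kl(p, q):
    """p*log(p/q) + (1-p)*log((1-p)/(1-q)), with the convention 0*log(0) = 0."""
    if p == q:
        return 0.0
    out = 0.0
    if p > 0.0:
        out += p * math.log(p / q)
    if p < 1.0:
        out += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return out

def log_lr_two_sample(x, m):
    """log LR_{m,n} for 0/1 data x: window = first m observations, rest = complement."""
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    mean_in = sum(x[:m]) / m          # mean inside the window
    mean_out = sum(x[m:]) / (n - m)   # mean outside the window
    mean_all = sum(x) / n             # overall mean (plug-in for the MLE of p)
    return m * bernoulli_kl(mean_in, mean_all) + (n - m) * bernoulli_kl(mean_out, mean_all)

if __name__ == "__main__":
    print(log_lr_two_sample([1, 1, 0, 0], 2))
    print(log_lr_two_sample([1, 0, 1, 0], 2))
```

The statistic is zero when the window mean equals the overall mean and grows as the two window means separate.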

###### Theorem 1

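Let $X_1,\ldots,X_n$ be i.i.d. $f_\theta$, $\theta\in\Theta$, a regular one-dimensional natural exponential family, and let $1\le m<n$. Then for $x>0$:

• (a) $$\mathbb{P}_{\theta_0}\Bigl(\sqrt{2\log \mathrm{LR}_m(\theta_0)}>x\Bigr) \le 2\exp\Bigl(-\frac{x^2}{2}\Bigr)$$
• (b) $$\mathbb{P}_{\theta_0}\Bigl(\sqrt{2\log \mathrm{LR}_{m,n}(\hat\theta_n)}>x\Bigr) \le \begin{cases}(4+2x^2)\exp\bigl(-\frac{x^2}{2}\bigr) & \\ (4+2e)\exp\bigl(-\frac{x^2}{2}\bigr) & \text{if } x\le \bigl(\frac{n}{C}\bigr)^{1/6}\end{cases}$$

for a certain constant $C$.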

The bounds can be divided by 2 if one considers the signed square root for one-sided inference. The proof of (a) proceeds by inverting the Cramér-Chernoff tail bound as in Rivera and Walther (2013), where this technique is employed for the binomial case. The bounds in (b) do not quite match the bound in (a) and the author has not been able to establish the simple bound $2\exp(-\frac{x^2}{2})$ for (b). Simulations suggest that in fact an even better bound holds which is closer to the standard normal bound, i.e. a bound that gains the factor $\frac{1}{x}$ on the sub-Gaussian bound as in (6). Establishing such a bound is a relevant open problem given its importance for scan statistics, see Walther and Perry (2019) and the references therein.
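For Bernoulli observations the bound in (a) can be verified exactly by enumerating the binomial distribution, since in that case $\log \mathrm{LR}_m(\theta_0)$ equals $m$ times the Kullback-Leibler divergence between the sample mean and the null success probability. The following sketch does this; the names `bernoulli_kl` and `lr_tail_binomial` and the parameter `p0` are illustrative assumptions rather than notation from the paper.

```python
import math

def bernoulli_kl(p, q):
    """p*log(p/q) + (1-p)*log((1-p)/(1-q)) for 0 < q < 1, with 0*log(0) = 0."""
    if p == q:
        return 0.0
    out = 0.0
    if p > 0.0:
        out += p * math.log(p / q)
    if p < 1.0:
        out += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return out

def lr_tail_binomial(m, p0, x):
    """Exact P(sqrt(2 log LR_m(theta0)) > x) for m i.i.d. Bernoulli(p0) observations."""
    total = 0.0
    for k in range(m + 1):
        log_lr = m * bernoulli_kl(k / m, p0)
        # guard against tiny negative values caused by rounding
        if math.sqrt(max(0.0, 2.0 * log_lr)) > x:
            total += math.comb(m, k) * p0**k * (1.0 - p0)**(m - k)
    return total

if __name__ == "__main__":
    for x in (1.0, 2.0, 3.0):
        print(x, lr_tail_binomial(50, 0.3, x), 2 * math.exp(-x * x / 2))
```

The exact tail probabilities can be compared with $2\exp(-x^2/2)$ for any choice of `m` and `p0`; the enumeration costs only `m + 1` terms.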

## 4 Tail bounds for self-normalized and empirically centered sums of symmetric random variables

The goal of this section is to extend the results for i.i.d. normal observations in Section 2 to the case of heteroscedastic normal observations with possibly unequal expected values. It turns out that the proposed methodology also allows one to treat the more general setting of independent (not necessarily identically distributed) observations having a symmetric distribution with an unknown center of symmetry.

It is informative to recapitulate the short and well known argument for establishing a sub-Gaussian tail bound via self-normalization in the case where the center of symmetry is known to be zero, see e.g. de la Peña et al. (2009): If $X_1,\ldots,X_m$ are independent and symmetric about 0, then introduce i.i.d. Rademacher random variables $\epsilon_i$, $i=1,\ldots,m$, which are independent of the $X_i$. Then $(X_1,\ldots,X_m)\stackrel{d}{=}(\epsilon_1X_1,\ldots,\epsilon_mX_m)$ and hence for $t>0$:

$$\mathbb{P}\Biggl(\frac{\sum_{i=1}^m X_i}{\sqrt{\sum_{i=1}^m X_i^2}}\ge t\Biggr) = \mathbb{E}\,\mathbb{P}\Biggl(\frac{\sum_{i=1}^m \epsilon_iX_i}{\sqrt{\sum_{i=1}^m X_i^2}}\ge t\,\Bigg|\,X_1,\ldots,X_m\Biggr) \le \exp\Bigl(-\frac{t^2}{2}\Bigr) \tag{4}$$

by Hoeffding's inequality. Hence the sub-Gaussian tail bound is inherited from the Rademacher sum. Based on various heuristic and numerical arguments, Efron (1969, pp. 1285–1288) suggested that the sub-Gaussian tail bound (4) can be lowered to the normal tail in the usual hypothesis testing range , but Fig. 1 in Pinelis (2007) shows that the normal tail is too small by a factor of at least 1.2 for certain $t$. However, recent remarkable results by Pinelis (2012) and Bentkus and Dzindzalieta (2015) show that the sub-Gaussian tail bound (4) for the Rademacher sum can be improved upon to a bound of the order $\frac{1}{t}\exp(-\frac{t^2}{2})$, namely to a multiple of $\mathbb{P}(N(0,1)>t)$ where the multiple is at most 3.18 and is even close to 1 for large $t$. This tail bound will then translate to the sum after self-normalization via the above argument. This makes the use of the self-normalization very attractive in this setting, cf. the remarks in Section 1.
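The conditional step in the symmetrization argument can be made concrete: for fixed weights, the tail of the normalized Rademacher sum is a finite average over sign patterns and can be enumerated exactly. The sketch below does this for a small weight vector and compares the result with the sub-Gaussian bound $\exp(-t^2/2)$ and with $3.18\,\mathbb{P}(N(0,1)>t)$. The function name `rademacher_tail` and the example weights are illustrative assumptions.

```python
import itertools
import math

def rademacher_tail(a, t):
    """Exact P(sum_i eps_i * a_i / ||a|| >= t) over all 2^m equally likely sign patterns."""
    norm = math.sqrt(sum(v * v for v in a))
    hits = 0
    for signs in itertools.product((-1, 1), repeat=len(a)):
        s = sum(e * v for e, v in zip(signs, a)) / norm
        if s >= t:
            hits += 1
    return hits / 2 ** len(a)

def normal_tail(t):
    """P(N(0,1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

if __name__ == "__main__":
    weights = [3, 1, 4, 1, 5, 9, 2, 6]  # plays the role of the observed |X_i|
    for t in (0.5, 1.0, 1.5, 2.0, 2.5):
        print(t, rademacher_tail(weights, t),
              math.exp(-t * t / 2), 3.18 * normal_tail(t))
```

The enumeration is exponential in the number of weights, so it is only meant for small illustrative vectors.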

The first aim of this section is to extend these results to the case where the center of symmetry is unknown and may vary between the $X_i$. At first glance, this would appear to be a hopeless undertaking since the above Rademacher argument depends crucially on the symmetry around zero. However, there are observations available outside the summation window which can be used for an empirical standardization. The idea is to construct an empirical centering which eliminates the unknown center of symmetry from the symmetrization argument, or which at least results in certain bounds on the center of symmetry. The second step then is to show that these bounds still allow for nearly normal tails.

For simplicity of exposition it is assumed in the following that $n=mp$ for integers $m$ and $p\ge 2$. If $m$ is much smaller than $n$, as is typically the case for scan problems, then this can always be arranged by discarding a small fraction of the observations if necessary. The proposed empirical centering is given by a linear transformation $\tilde{X}=AX$, where the matrix $A$ satisfies the conditions in Proposition 2. One example of such an empirical centering is

$$\tilde{X}_i := X_i-\frac{1}{p-1}\sum_{j=m+(i-1)(p-1)+1}^{m+i(p-1)}X_j,\qquad i=1,\ldots,m \tag{5}$$

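Corresponding to the linear transformation $\tilde{X}=AX$ write $\tilde\mu:=A\mu$, where $\mu_i$ is the center of symmetry of $X_i$. Note that it is not assumed that the $X_i$ have a finite expected value. The subscript $I$ denotes averaging over the index set $I:=\{1,\ldots,m\}$, so e.g. $\tilde\mu_I=\frac{1}{m}\sum_{i=1}^m\tilde\mu_i$.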

###### Proposition 2

Let $A$ be an $m\times n$ matrix that has $p$ non-zero entries in each row and one non-zero entry in each column, and these entries are 1 in columns $1,\ldots,m$ and $-\frac{1}{p-1}$ in columns $m+1,\ldots,n$. (Footnote 1: This uniquely determines $A$ up to permutations of the columns $1,\ldots,m$ and permutations of the columns $m+1,\ldots,n$.)

Let $X_i$, $i=1,\ldots,n$, be independent and symmetric about $\mu_i$ (so the $X_i$ need not be identically distributed).

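• (a) If , then the self-normalized sum of the $\tilde{X}_i$ satisfies

$$\frac{\sum_{i=1}^m\tilde{X}_i}{\sqrt{\sum_{i=1}^m\tilde{X}_i^2}} = \frac{\frac{n}{n-m}\sum_{i=1}^m(X_i-\bar{X})}{\sqrt{X^TA^TAX}} =: T_m$$
• (b) If and for all and for all , then

$$\mathbb{P}(T_m\ge t) \le \min\bigl(3.18,g(t)\bigr)\,\mathbb{P}(N(0,1)>t) \tag{6}$$

for all $t>0$, where $g(t)\to 1$ as $t\to\infty$.

• (c) If and (7) or (8) hold, then the tail bound (6) holds for for some .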

Condition (7) requires that the $\tilde\mu_i$ don't vary much:

$$\sum_{i=1}^m(\tilde\mu_i-\tilde\mu_I)^2 \le v\sum_{i=1}^m\tilde\mu_i^2 \quad\text{for some } v\in[0,1) \tag{7}$$

Condition (8) requires that the don’t vary much and likewise for :

 (8)
• (d) The analogous inequalities to (b) and (c) hold for the left tail of $T_m$ if .

The proof of Proposition 2 shows that the transformed $\tilde{X}_i$ is symmetric about $\tilde\mu_i$ which may not equal zero. Nevertheless, the self-normalized sum of the $\tilde{X}_i$ satisfies the normal tail bound (6) if the centers of symmetry satisfy the conditions given in (b) or (c). (b) is a standard assumption for testing against an elevated mean on $I$, see Yao (1993). Note that $T_m$ is similar to the statistic $V$ used in Corollary 1 for the homoscedastic case. Indeed, the proof of Proposition 1 shows that $V$ is the self-normalized sum of a linear transformation of $X$ given by a certain matrix.
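The empirical centering (5) and the resulting self-normalized statistic $T_m$ can be sketched in a few lines. The code below assumes $n=mp$ with an integer $p\ge 2$ and uses the block assignment of (5); the function names `empirical_centering` and `t_m` are hypothetical. Because each centered value is a difference between an observation and an average of other observations, the statistic is unchanged when a common constant is added to all observations or when they are multiplied by a positive scale.

```python
import math

def empirical_centering(x, m):
    """Centered values from (5): X_i minus the mean of its own block of p-1 outside observations."""
    n = len(x)
    if n % m != 0 or n // m < 2:
        raise ValueError("need n = m * p with an integer p >= 2")
    p = n // m
    centered = []
    for i in range(1, m + 1):                # 1-based index i as in (5)
        start = m + (i - 1) * (p - 1)        # 0-based position of j = m + (i-1)(p-1) + 1
        block = x[start:start + (p - 1)]     # j = m+(i-1)(p-1)+1, ..., m+i(p-1)
        centered.append(x[i - 1] - sum(block) / (p - 1))
    return centered

def t_m(x, m):
    """Self-normalized sum of the empirically centered window observations."""
    xt = empirical_centering(x, m)
    return sum(xt) / math.sqrt(sum(v * v for v in xt))

if __name__ == "__main__":
    data = [2.0, -1.0, 0.5, 3.0, 1.5, -2.0, 0.0, 4.0, 1.0]
    print(empirical_centering(data, 3))
    print(t_m(data, 3))
```

The numerator of this statistic agrees with $\frac{n}{n-m}\sum_{i=1}^m(X_i-\bar{X})$, which gives a quick consistency check of an implementation.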

### 4.1 Scanning heteroscedastic normal observations

As the statistic $T_m$ appears to be new, it is incumbent to demonstrate its utility with an analysis of its power. To this end this section considers the scan problem where one observes independent $X_i\sim$ N$(\mu_i,\sigma_i^2)$, $i=1,\ldots,n$, and the goal is to detect an elevated mean on some interval $I$. Both the starting point and the length of $I$ are unknown, likewise the $\mu_i$ and $\sigma_i$ are unknown.

$T_m$ tests for an elevated mean on the interval $\{1,\ldots,m\}$. It is straightforward to analyze a different interval $I$, e.g. by applying $T_{|I|}$ to the rearranged data vector . Denote this statistic by $T_I$. Analyzing all possible intervals $I$ gives rise to a multiple testing problem that is addressed by combining the corresponding $T_I$ into a scan statistic. Walther and Perry (2019) give several ways for combining the $T_I$ such that optimal inference is possible, such as the Bonferroni scan. The use of that scan requires the availability of a tail bound for the null distribution, such as (6). The Bonferroni scan and the normal tail bound (6) give a critical value of the form with , which follows as in the proof of Theorem 2 in Walther and Perry (2019).

In order to compare the power of this scan statistic to an optimal benchmark, this section first considers the homoscedastic case where all the $\sigma_i$ equal a common $\sigma$. Then it is known that there is a precise condition under which detection is possible with asymptotic power 1: , provided that does not go to zero too quickly: . On the other hand, detection is impossible if ‘’ is replaced by ‘’. Hence measures the difficulty of the detection problem, and the theory of that problem shows that it affects this difficulty as an exponent. This explains the efforts in the literature to approach as fast as possible, and the rates given above appear to be the currently best known rates. Attaining the factor hinges on having the correct scale factor in the sub-Gaussian null distribution of the test statistic. References and summaries of these results are given in Walther and Perry (2019) and Walther (2021).

(11) in Theorem 2 shows that in the practically important range the Bonferroni scan based on the does indeed have asymptotic power 1 if exceeds the above detection threshold, since (10) gives and by homoscedasticity. It is notable that this Bonferroni scan, which is designed to deal with heteroscedastic data, allows optimal detection in the special case of homoscedastic data. In fact, Theorem 2 shows that it achieves the detection boundary for the homoscedastic case already if only the , are equal and the outside don’t grow too quickly, as required in (9).

If the data are heteroscedastic, then Theorem 2 requires that needs to be replaced by in the lower bound for . There appears to be not much literature about the scanning problem with heteroscedastic observations, presumably because it is difficult to derive appropriate methodology. For example, the recent work of Enikeeva (2018) considers the heteroscedastic Gaussian detection problem where is allowed to be different on and , but it is assumed that is constant and known on both and on . The finite-sample tail bound (6) holds without such a restriction and thus self-normalized statistics may prove to be quite useful for scanning problems.

###### Theorem 2

Let , , be independent, , let be the linear transformation (5) and write . Assume the satisfy (7) or (8).

If with , and , and if

$$\frac{\sigma_j^2}{\sigma_I^2} \le S\sqrt{\max_{i\in I}(j-i)} \quad\text{for all } j\in\{1,\ldots,n\} \text{ and some } S>0, \tag{9}$$

then

$$R_I \le 1+2S\sqrt{\frac{|I|^2}{n}} \tag{10}$$

and

$$\mathbb{P}\Biggl(T_I > \sqrt{2\log\frac{n}{|I|}}+O(1)\Biggr) \to 1 \quad (n\to\infty). \tag{11}$$

This result extends to intervals , , by applying Theorem 2 to .

## 5 Proofs

### 5.1 Proof of Proposition 1

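Write $X=(X_1,\ldots,X_n)^T$ and let $A$ be an orthogonal matrix with first row $(\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}})$. Then $Y:=AX$ is a vector of independent normal random variables with variance $\sigma^2$ and $\mathbb{E}Y_1=\sqrt{n}\mu$, $\mathbb{E}Y_i=0$, $i=2,\ldots,n$. Further $\sum_{i=2}^nY_i^2=\sum_{i=1}^n(X_i-\bar{X})^2$. Note that this is the same transformation that is commonly used in textbooks to derive the distribution of Student's $t$-statistic. In the latter case one is interested in $Y_1=\sqrt{n}\bar{X}$, which is independent of $Y_2,\ldots,Y_n$. In contrast, the condition $\sum_{i=1}^nb_i=0$ ensures that $V$ is a function of $Y_2,\ldots,Y_n$ only: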

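$$V = \frac{\langle b,X\rangle}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}} = \frac{\langle b,A^TY\rangle}{\sqrt{\frac{1}{n-1}\sum_{i=2}^nY_i^2}} = \frac{\langle c,Y\rangle}{\sqrt{\frac{1}{n-1}\sum_{i=2}^nY_i^2}}, \tag{12}$$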

where $c:=Ab$ has $c_1=\frac{1}{\sqrt{n}}\sum_{i=1}^nb_i=0$ and thus $\sum_{i=2}^nc_i^2=\sum_{i=1}^nb_i^2=1$.

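Set $U_i:=Y_i/\sqrt{\sum_{j=2}^nY_j^2}$, $i=2,\ldots,n$. Then $U:=(U_2,\ldots,U_n)$ has the uniform distribution on the unit sphere in $\mathbb{R}^{n-1}$ since the $Y_i$, $i\ge 2$, are i.i.d. N$(0,\sigma^2)$. Therefore $\sum_{i=2}^nw_iU_i$, the length of the projection of $U$ onto a unit vector $w$, has the same distribution for every unit vector $w$.

Setting $w_i=\frac{1}{\sqrt{n-1}}$, $i=2,\ldots,n$, gives

$$V = \sqrt{n-1}\sum_{i=2}^nc_iU_i \stackrel{d}{=} \sqrt{n-1}\sum_{i=2}^nw_iU_i = \frac{\sum_{i=2}^nY_i}{\sqrt{\sum_{i=2}^nY_i^2}} \stackrel{d}{=} \frac{\sum_{i=1}^{n-1}Z_i}{\sqrt{\sum_{i=1}^{n-1}Z_i^2}}$$

where the $Z_i$ are i.i.d. N$(0,1)$. (Footnote 2: Alternatively, construct rows 2 to $n$ of the orthogonal matrix $A$ such that $c_i=\frac{1}{\sqrt{n-1}}$ for $i=2,\ldots,n$. Then (12) gives $V=\sum_{i=2}^nY_i/\sqrt{\sum_{i=2}^nY_i^2}$ without assuming that the $X_i$ are normal. This also shows that $V$ is a self-normalized sum. However, the $Y_i$ may not be independent if the $X_i$ are not normal.) Setting $w=(1,0,\ldots,0)$ gives

$$V \stackrel{d}{=} \sqrt{n-1}\sum_{i=2}^nw_iU_i = \frac{\sqrt{n-1}\,Y_2}{\sqrt{\sum_{i=2}^nY_i^2}} \stackrel{d}{=} \frac{\sqrt{n-1}\,Z_1}{\sqrt{\sum_{i=1}^{n-1}Z_i^2}},$$

so $\frac{V^2}{n-1}\sim\text{Beta}(\frac{1}{2},\frac{n-2}{2})$ follows from a well known fact about the beta distribution.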

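It is also known that the uniform distribution on the sphere in $\mathbb{R}^m$, $m:=n-1$, gives $U_2$ the density $\frac{\Gamma(\frac{m}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{m-1}{2})}(1-u^2)^{\frac{m-3}{2}}1(-1\le u\le 1)$, hence $V$ has density

$$f_V(t) = \frac{1}{\sqrt{m}}\frac{\Gamma(\frac{m}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{m-1}{2})}\Bigl(1-\frac{t^2}{m}\Bigr)^{\frac{m-3}{2}}1\bigl(-\sqrt{m}\le t\le\sqrt{m}\bigr).$$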

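The plan is to show that $f_V(t)$ is not larger than the standard normal density $\phi(t)$ for $t$ large enough. Clearly $f_V(t)=0\le\phi(t)$ for $|t|>\sqrt{m}$. For $|t|<\sqrt{m}$ one has $\frac{\Gamma(\frac{m}{2})}{\Gamma(\frac{m-1}{2})}\le\sqrt{\frac{m}{2}}$ by Gautschi's inequality, and for $m>3$:

$$f_V(t) \le \frac{1}{\sqrt{2\pi}}\exp\Bigl(\frac{m-3}{2}\log\Bigl(1-\frac{t^2}{m}\Bigr)\Bigr) \le \frac{1}{\sqrt{2\pi}}\exp\Bigl(\frac{m-3}{2}\Bigl(-\frac{t^2}{m}-\frac{t^4}{2m^2}\Bigr)\Bigr) = \phi(t)\exp\Bigl(\frac{3}{2m}t^2-\frac{m-3}{4m^2}t^4\Bigr) \le \phi(t) \quad\text{for } t^2\ge\frac{6m}{m-3} \tag{13}$$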

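The condition $t^2\ge\frac{6m}{m-3}$ is satisfied if e.g. $t\ge 2.5$ and $m\ge 76$. Less conservative bounds obtain by employing higher order terms for bounding $\log(1-x)$. For example, $\log(1-x)\le -x-\frac{x^2}{2}-\frac{x^3}{3}$ for $x\in[0,1)$ yields

$$f_V(t) \le \phi(t)\exp\Bigl(\frac{3}{2m}t^2-\frac{m-3}{4m^2}t^4-\frac{m-3}{6m^3}t^6\Bigr).$$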

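Dividing the argument in the exponent by $\frac{(m-3)t^2}{2m^2}$ shows that the argument is non-positive if

$$\frac{3m}{m-3}-\frac{1}{2}t^2-\frac{1}{3m}t^4 \le 0$$

and this inequality holds for $t^2\ge g(m):=\frac{3m}{4}\Bigl(\sqrt{1+\frac{16}{m-3}}-1\Bigr)$, the positive root of the quadratic in $t^2$. One checks numerically that

$$\max_{m\in\{5,\ldots,8\}}g(m)\le 2.75^2,\qquad \max_{m\in\{9,\ldots,75\}}g(m)\le 2.5^2. \tag{14}$$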

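Therefore $f_V(t)\le\phi(t)$ follows for $t\ge 2.5$ and $m\ge 76$ from (13), for $t\ge 2.5$ and $m\in\{9,\ldots,75\}$ from (14), and for $t\ge 2.75$ and $m\ge 5$ from these results together with (14). Integrating this density bound over $(t,\infty)$, the last claim of the Proposition now obtains with $m=n-1$.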
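The numerical check behind (14) is easy to reproduce. The sketch below assumes that $g(m)$ is the positive root in $t^2$ of the quadratic inequality displayed above, which gives the closed form $g(m)=\frac{3m}{4}\bigl(\sqrt{1+16/(m-3)}-1\bigr)$; this closed form is derived here from that inequality and is an assumption about how the threshold is defined. The function names `g` and `exponent` are illustrative.

```python
import math

def g(m):
    """Smallest t^2 with 3m/(m-3) - t^2/2 - t^4/(3m) <= 0 (assumed definition, needs m > 3)."""
    return 0.75 * m * (math.sqrt(1.0 + 16.0 / (m - 3)) - 1.0)

def exponent(t2, m):
    """Exponent of the refined density bound as a function of t2 = t^2."""
    return (1.5 * t2 / m
            - (m - 3) * t2 ** 2 / (4 * m * m)
            - (m - 3) * t2 ** 3 / (6 * m ** 3))

if __name__ == "__main__":
    print(max(g(m) for m in range(5, 9)), 2.75 ** 2)
    print(max(g(m) for m in range(9, 76)), 2.5 ** 2)
```

The exponent vanishes at $t^2=g(m)$ and is negative beyond it, which is exactly what the density comparison with $\phi$ requires.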

### 5.2 Proof of Theorem 1

The proof of (a) proceeds by inverting the Cramér-Chernoff tail bound, as in Rivera and Walther (2013) for the binomial case.

for . Markov’s inequality gives for :

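$$\mathbb{P}_{\theta_0}\Bigl(\frac{1}{m}\sum_{i=1}^mX_i>x\Bigr) \le \inf_{t\ge 0}\frac{\mathbb{E}\exp\bigl(t\sum_{i=1}^mX_i\bigr)}{\exp(tmx)} \le \exp\Bigl\{-\sup_{t\ge 0,\,t+\theta_0\in\Theta}\Bigl(tmx-m\bigl(A(\theta_0+t)-A(\theta_0)\bigr)\Bigr)\Bigr\} = \exp\bigl\{-\log \mathrm{LR}_m(x,\theta_0)\bigr\}$$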

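where $\log \mathrm{LR}_m(x,\theta_0):=\sup_{\theta\in\Theta}\bigl((\theta-\theta_0)mx-m(A(\theta)-A(\theta_0))\bigr)$. This conclusion used the fact that the sup over $\theta\ge\theta_0$ equals the sup over $\theta\in\Theta$ since convexity of $A$ yields

$$(\theta-\theta_0)x-\bigl(A(\theta)-A(\theta_0)\bigr) \le (\theta-\theta_0)x-(\theta-\theta_0)A'(\theta_0) \tag{15}$$

and the RHS is negative if $\theta<\theta_0$ and $x>A'(\theta_0)=\mathbb{E}_{\theta_0}X_1$. The following claim will be proved below: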

$$\text{The function } x\mapsto\log \mathrm{LR}_m(x,\theta_0) \text{ is continuous and strictly increasing on } [\mathbb{E}_{\theta_0}X_1,\infty)\cap M_0 \tag{16}$$

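where $M_0$ denotes the convex hull of the support of $X_1$. Analogously one shows that for $x<\mathbb{E}_{\theta_0}X_1$:

$$\mathbb{P}_{\theta_0}\Bigl(\frac{1}{m}\sum_{i=1}^mX_i<x\Bigr) \le \exp\bigl\{-\log \mathrm{LR}_m(x,\theta_0)\bigr\}$$

and $x\mapsto\log \mathrm{LR}_m(x,\theta_0)$ is continuous and strictly decreasing on $(-\infty,\mathbb{E}_{\theta_0}X_1]\cap M_0$. Together with $\log \mathrm{LR}_m(\mathbb{E}_{\theta_0}X_1,\theta_0)=0$, which follows from (15) and the choice $\theta=\theta_0$, one obtains

$$\mathbb{P}_{\theta_0}\Bigl(\log \mathrm{LR}_m\Bigl(\frac{1}{m}\sum_{i=1}^mX_i,\theta_0\Bigr)>t\Bigr) \le 2\exp(-t)$$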

for and claim (a) follows. It remains to prove (16). This follows from Lemma 6.7 in Brown (1986) or from a general result in convex analysis to the effect that the Legendre transform satisfies if (in which case the MLE exists uniquely and is given by by exponential family theory) and since the exponential family is minimal. Hence is differentiable wrt and

$$\frac{d}{dx}\log \mathrm{LR}_m(x,\theta_0) = m\bigl(\theta(x)-\theta_0\bigr)$$

It was shown above that if , then the maximizer satisfies . Now (16) follows from for . Part (a) of the theorem is proved.

As for part (b), by the definition (3)

$$\log \mathrm{LR}_{m,n}(\hat\theta_n) \le \log \mathrm{LR}_{m,n}(\theta_0) = \log \mathrm{LR}_I(\theta_0)+\log \mathrm{LR}_{I^c}(\theta_0) \tag{17}$$

where and and for an index set write

 logLRJ(θ0) = logsup