# Distribution-free properties of isotonic regression

It is well known that the isotonic least squares estimator is characterized as the derivative of the greatest convex minorant of a random walk. Provided the walk has exchangeable increments, we prove that the slopes of the greatest convex minorant are, marginally, distributed as order statistics of the running averages. This result yields an exact non-asymptotic formula for the squared error risk of least squares in isotonic regression when the true sequence is constant, valid for every exchangeable error distribution.


## 1 Introduction

Isotonic regression refers to the problem of estimating a monotone sequence $\theta^* = (\theta^*_1, \dots, \theta^*_n)$ based on a noisy observation vector $Y = (Y_1, \dots, Y_n)$, which is assumed to be an additive perturbation of $\theta^*$,

$$Y = \theta^* + \sigma Z,$$

where the components of $Z$ are assumed to have zero mean and unit variance. It is commonly assumed that $Z_1, \dots, Z_n$ are independent and identically distributed (i.i.d.), but we work with the more general assumption of exchangeability in this paper. A natural estimator for $\theta^*$ in this setting is the isotonic Least Squares Estimator (LSE), defined as

$$\hat\theta := \Pi_{\mathcal M_n}(Y) := \operatorname*{arg\,min}_{\theta \in \mathcal M_n} \|Y - \theta\|_2^2,$$

where $\|\cdot\|_2$ denotes the usual Euclidean norm on $\mathbb{R}^n$ and $\mathcal M_n := \{\theta \in \mathbb{R}^n : \theta_1 \le \cdots \le \theta_n\}$ is the monotone cone of length-$n$ non-decreasing sequences. As $\mathcal M_n$ is a closed convex cone, $\hat\theta$ as defined above exists uniquely; it can also be computed in $O(n)$ time by the pool adjacent violators algorithm [4, 11].
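The pool adjacent violators algorithm admits a very short implementation: scan left to right, and whenever two adjacent blocks violate monotonicity, merge them into their weighted mean. A minimal sketch in Python (the function name `isotonic_lse` is ours, not from the paper; production code would use a library routine):

```python
# Pool Adjacent Violators Algorithm (PAVA): computes the isotonic LSE,
# i.e. the Euclidean projection of y onto the monotone cone M_n.

def isotonic_lse(y):
    """Project y onto the cone of non-decreasing sequences via PAVA."""
    # Each block stores a (mean, weight) pair; violating blocks are merged.
    means, weights = [], []
    for v in y:
        means.append(float(v))
        weights.append(1)
        # Merge while the last two block means violate monotonicity.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2))
            weights.append(w1 + w2)
    # Expand blocks back to a full-length sequence.
    return [m for m, w in zip(means, weights) for _ in range(w)]

print(isotonic_lse([3, 1, 2]))  # → [2.0, 2.0, 2.0]
```

Each merge removes a block, so the total work is linear in $n$.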

The statistical properties of $\hat\theta$ are typically studied in terms of the risk, or normalized mean squared error,

$$R(\hat\theta, \theta^*) := \frac{1}{n}\,\mathbb{E}_{\theta^*}\|\hat\theta - \theta^*\|_2^2.$$

A key quantity in understanding $R(\hat\theta, \theta^*)$ is

$$\delta_n(\mu) := \mathbb{E}_{Z \sim \mu}\|\Pi_{\mathcal M_n}(Z)\|_2^2,$$

where $\mu$ denotes the law of the noise vector $Z$. Indeed, it is clear that

$$\frac{n}{\sigma^2}R(\hat\theta, \theta^*) = \delta_n(\mu) \qquad \text{when } \theta^*_1 = \cdots = \theta^*_n.$$

When $\theta^*_1, \dots, \theta^*_n$ are not all equal, let $A_1, \dots, A_k$ be the finest partition of $\{1, \dots, n\}$ such that $\theta^*$ is constant on each $A_i$. It has been shown [15, 9, 3] that

$$\frac{n}{\sigma^2}R(\hat\theta, \theta^*) \begin{cases} \le \delta_{n_1}(\mu_{A_1}) + \cdots + \delta_{n_k}(\mu_{A_k}) & \text{for every } \sigma > 0, \\ \to \delta_{n_1}(\mu_{A_1}) + \cdots + \delta_{n_k}(\mu_{A_k}) & \text{as } \sigma \downarrow 0, \end{cases} \tag{1.1}$$

where $\mu_{A_i}$ denotes the marginal distribution of $(Z_j)_{j \in A_i}$ and $n_i := |A_i|$ is the length of the block $A_i$ for all $i$. We emphasize that (1.1) holds for arbitrarily dependent $Z$ with zero mean and finite variance. It was also shown in [3] that $\delta_{n_1}(\mu_{A_1}) + \cdots + \delta_{n_k}(\mu_{A_k})$ also bounds the risk of the isotonic LSE in misspecified settings where $\theta^*$ does not lie in $\mathcal M_n$.

The quantity $\delta_n(\mu)$ therefore crucially controls the risk of the isotonic LSE. The goal of this paper is to explicitly determine $\delta_n(\mu)$ for every $n$ under the additional assumption that $\mu$ is exchangeable. Specifically, under the assumption of exchangeability, we show in Corollary 3.3 that, for all $n \ge 1$,

$$\delta_n(\mu) = \rho n + (1 - \rho)H_n, \tag{1.2}$$

where $H_n := \sum_{k=1}^n 1/k$ is the harmonic number and $\rho := \operatorname{Corr}(Z_1, Z_2)$ is the pairwise correlation. Combined with (1.1), our result provides a sharp non-asymptotic bound on the risk of isotonic regression for any exchangeable noise vector. In the special case when $Z_1, \dots, Z_n$ are i.i.d. with zero mean and unit variance, $\rho = 0$ and thus (1.2) gives:

$$\delta_n(\otimes_{i=1}^n \eta) = H_n \qquad \text{for every probability measure } \eta. \tag{1.3}$$

Here $\eta$ is the common distribution of the independent variables $Z_1, \dots, Z_n$.
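Formula (1.3) can be checked by brute force on a small non-Gaussian example. With $n = 5$ i.i.d. Rademacher signs (a distribution with mean $0$ and variance $1$), averaging $\|\Pi_{\mathcal M_n}(Z)\|_2^2$ over all $2^5$ equally likely sign vectors should give exactly $H_5 = 137/60$. A sketch (the helper `isotonic_lse` is our own PAVA implementation, not code from the paper):

```python
# Exhaustive check of delta_n = H_n for i.i.d. Rademacher noise, n = 5.
from itertools import product

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

n = 5
# Average ||Pi(Z)||_2^2 over all 2^n equally likely sign vectors.
total = sum(sum(t * t for t in isotonic_lse(z))
            for z in product((-1, 1), repeat=n))
delta_n = total / 2 ** n
H_n = sum(1 / k for k in range(1, n + 1))
assert abs(delta_n - H_n) < 1e-9   # delta_5 = H_5 = 137/60 exactly
```

Since the sample space is finite and uniform, this is an exact verification rather than a Monte Carlo estimate.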

Previously, the formula (1.3) was known when $\eta$ is the standard Gaussian probability measure on $\mathbb{R}$. This was observed by Amelunxen et al. [2], who proved it by first observing that when $Z$ is drawn from the standard Gaussian measure on $\mathbb{R}^n$, the formula

$$\mathbb{E}\|\Pi_K(Z)\|_2^2 = \sum_{k=0}^n k\,\nu_k(K) \tag{1.4}$$

holds for every closed convex cone $K \subseteq \mathbb{R}^n$, where $\nu_k(K)$ is the $k$th intrinsic volume of $K$. When $K$ is the monotone cone, the right-hand side of equation (1.4) can be shown to be equal to $H_n$ by using the fact that the generating function of the intrinsic volumes can be computed in closed form. Amelunxen et al. [2] used the theory of finite reflection groups [7] to obtain the exact expression for this generating function. However, the exact expression for this generating function can already be found in the classical literature on isotonic regression (see Theorem 2.4.2 in Robertson et al. [16] and references therein).

The above proof does not work for non-Gaussian $\mu$, mainly because the expression (1.4) does not hold for general $\mu$. In fact, the best available result on $\delta_n$ for non-Gaussian noise is in equation (2.11) of Zhang [18], who proved the asymptotic result:

$$\delta_n(\otimes_{i=1}^n \eta) = (1 + o(1))(1 + \log n) \qquad \text{as } n \to \infty.$$

This bound gives the right behavior as the right-hand side of equation (1.3), but only as $n \to \infty$. We improve this result by proving for every $n$ that $\delta_n(\otimes_{i=1}^n \eta)$ is always equal to the harmonic number $H_n$ for every probability measure $\eta$ having mean $0$ and variance $1$.

We prove (1.2) by developing a precise characterization of the marginal distribution of each individual component of $\Pi_{\mathcal M_n}(Z)$. Specifically, as long as $Z$ is exchangeable, we show in Theorem 2.2 that $(\Pi_{\mathcal M_n}(Z))_k$ has the same distribution as $\bar Z_{(k)}$, the $k$th order statistic of the running averages $\bar Z_k := \frac{1}{k}\sum_{i=1}^k Z_i$. We prove Theorem 2.2 in Section 2, using a characterization of the components of the isotonic LSE as the left-hand slopes of the greatest convex minorant of the random walk with increments $Z_1, \dots, Z_n$. This result and its continuous-time analogue may be of independent interest outside the study of isotonic regression, so in Section 2 we also address consequences for the greatest convex minorant of a stochastic process with exchangeable increments. The order statistics of the running averages can be fairly complicated even when $Z$ is Gaussian; however, Theorem 2.2 easily implies results such as (1.2). In Section 3, we detail some risk calculations for isotonic regression and its variants, which all follow from Theorem 2.2.

## 2 Main Result

Let $S_k := Z_1 + \cdots + Z_k$ denote the partial sums for $k = 1, \dots, n$, started at $S_0 = 0$. Identify the random walk with its cumulative sum diagram (CSD), the piecewise-linear function passing through the points $(k, S_k)$ for integers $0 \le k \le n$ and linearly interpolated between integers. Let $C$ denote the greatest convex minorant (GCM) of the CSD, i.e. the greatest convex function that lies below it. See Figure 1 for a depiction of the GCM of the CSD. With this notation, we now recall the graphical representation of the isotonic LSE as given in Theorem 1.2.1 of Robertson et al. [16].

###### Lemma 2.1.

For any vector $Z \in \mathbb{R}^n$, the isotonic LSE is given by the left-hand slopes of the greatest convex minorant of the cumulative sum diagram. For all $k = 1, \dots, n$,

$$(\Pi_{\mathcal M_n}(Z))_k = C(k) - C(k-1) = \partial^- C(k).$$

For the remainder of this section let

$$\Delta_k := \partial^- C(k) = \min_{k \le v \le n}\ \max_{0 \le u < k}\ \frac{S_v - S_u}{v - u} \tag{2.1}$$

denote the left-hand slope of the GCM at $k$, so $\Delta_k$ is equal to $(\Pi_{\mathcal M_n}(Z))_k$ by the lemma. In particular, when $k = 1$ we have $\Delta_1 = \min_{1 \le v \le n} S_v / v = \bar Z_{(1)}$, the smallest running average. When $k = n$, we have $\Delta_n = \max_{0 \le u < n}(S_n - S_u)/(n - u)$, and if $Z$ is exchangeable then $\Delta_n$ is distributed as the largest running average. Our next result generalizes this observation, showing that the slope $\Delta_k$ is equal in distribution to the $k$th smallest running average if $Z$ is exchangeable.
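The minimax representation (2.1) can be checked directly against the projection: computing $\Delta_k$ by brute force over all pairs $u < k \le v$ reproduces the isotonic LSE component by component. A sketch (both helper names are ours; `isotonic_lse` implements PAVA):

```python
# Check that the left-hand GCM slopes  min_{k<=v<=n} max_{0<=u<k} (S_v-S_u)/(v-u)
# coincide with the isotonic LSE computed by PAVA.

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

def gcm_slopes(z):
    """Left-hand slopes of the GCM of the cumulative sum diagram of z."""
    n = len(z)
    S = [0.0]                      # partial sums S_0, ..., S_n
    for v in z:
        S.append(S[-1] + v)
    return [min(max((S[v] - S[u]) / (v - u) for u in range(k))
                for v in range(k, n + 1))
            for k in range(1, n + 1)]

for z in ([3, 1, 2], [1, 3, 2, 5, 4], [2, -1, 0, 7, -3]):
    a, b = isotonic_lse(z), gcm_slopes(z)
    assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

The $O(n^3)$ double loop is of course only for verification; PAVA gives the same answer in linear time.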

###### Theorem 2.2.

Suppose $Z$ is exchangeable. Let $\bar Z_k := \frac{1}{k}\sum_{i=1}^k Z_i$ denote the running average for $k = 1, \dots, n$ and let $\bar Z_{(1)} \le \cdots \le \bar Z_{(n)}$ denote their order statistics. Then

$$\Delta_k \overset{d}{=} \bar Z_{(k)} \tag{2.2}$$

marginally for all $k = 1, \dots, n$.

###### Proof.

As before, let $S_k$ denote the partial sum. Let $M := \max(\operatorname{argmin}_{0 \le k \le n} S_k)$ be the last argmin of the sequence $S_0, S_1, \dots, S_n$, and let $N := \#\{1 \le k \le n : S_k \le 0\}$ be the amount of time the walk is non-positive. We will use Corollary 11.14 of Kallenberg [12], due to Sparre Andersen, which says $M \overset{d}{=} N$ as long as $Z$ is exchangeable.

Note that the slope of the GCM switches from non-positive to positive at time $M$, since the horizontal line at height $\min_{0 \le j \le n} S_j$ minorizes the GCM and touches it at time $M$. Hence, no matter the sequence of increments $Z_1, \dots, Z_n$, there is the identity of events

$$(\Delta_k \le 0) = (M \ge k). \tag{2.3}$$

Also, for the time $N$ that the walk is non-positive, since $S_k \le 0$ if and only if $\bar Z_k \le 0$, there is the identity of events

$$(\bar Z_{(k)} \le 0) = (N \ge k).$$

The equality in distribution $M \overset{d}{=} N$ then implies

$$P(\Delta_k \le 0) = P(\bar Z_{(k)} \le 0).$$

If the sequence is modified to $Z_1 - z, \dots, Z_n - z$ for some fixed $z \in \mathbb{R}$, the modified sequence is exchangeable, and the values of $\Delta_k$ and $\bar Z_{(k)}$ for the modified sequence are just $\Delta_k - z$ and $\bar Z_{(k)} - z$. Applying the above identity to the modified sequence gives

$$P(\Delta_k \le z) = P(\Delta_k - z \le 0) = P(\bar Z_{(k)} - z \le 0) = P(\bar Z_{(k)} \le z).$$

So $\Delta_k$ and $\bar Z_{(k)}$ have the same cumulative distribution function, hence the same distribution. ∎
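Theorem 2.2 can be verified exactly on a small exchangeable example. For i.i.d. Rademacher signs with $n = 4$, the sample space has $2^4$ equally likely outcomes, so equality in distribution of $\Delta_k$ and $\bar Z_{(k)}$ means their value-multisets over all outcomes coincide for each $k$. A sketch (the helper `isotonic_lse` is our own PAVA implementation):

```python
# Exhaustive check of Theorem 2.2 for i.i.d. Rademacher signs (n = 4):
# for each k, the distribution of the k-th LSE component equals that of the
# k-th order statistic of the running averages.
from itertools import product

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

n = 4
slope_vals = [[] for _ in range(n)]   # values of Delta_k over all outcomes
order_vals = [[] for _ in range(n)]   # values of the k-th order statistic
for z in product((-1, 1), repeat=n):
    for k, s in enumerate(isotonic_lse(z)):
        slope_vals[k].append(s)
    for k, a in enumerate(sorted(sum(z[:j + 1]) / (j + 1) for j in range(n))):
        order_vals[k].append(a)
for k in range(n):
    sv, ov = sorted(slope_vals[k]), sorted(order_vals[k])
    assert all(abs(x - y) < 1e-9 for x, y in zip(sv, ov))
```

Note the identity is only marginal: for a fixed outcome, the vector of slopes generally differs from the sorted running averages, but each coordinate's distribution matches.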

The proof of Theorem 2.2 has a straightforward generalization to the setting where $(S(t))_{t \in [0,1]}$ is a continuous-time stochastic process. Knight [13] showed that the analogous distributional identity $M \overset{d}{=} N$ holds when $S$ has exchangeable increments and $S(0) = 0$. Hence, by a similar proof, we find that the slope of the greatest convex minorant of $S$ at time $p$ has the same distribution as the $p$th percentile point of the occupation measure for the process $t \mapsto S(t)/t$. We record this result as the following corollary.

###### Corollary 2.3.

Let $S = (S(t))_{t \in [0,1]}$ denote a real-valued càdlàg stochastic process on $[0,1]$ with exchangeable increments, such that $S(0) = 0$. Define $\Delta(p)$ as the slope of the greatest convex minorant of $S$ at $p \in (0,1)$, and let $F$ denote the (random) cdf associated with the occupation measure of $t \mapsto S(t)/t$,

$$F(x) = \lambda(\{t \in [0,1] : S(t) \le tx\}), \tag{2.4}$$

where $\lambda$ denotes Lebesgue measure. Then

$$\Delta(p) = \inf_{p \le v \le 1}\ \sup_{0 \le u < p}\ \frac{S(v) - S(u)}{v - u}\ \overset{d}{=}\ F^{-1}(p) \tag{2.5}$$

marginally for all $p \in (0,1)$, where $F^{-1}(p) := \inf\{x : F(x) \ge p\}$ is the quantile function of the occupation measure.

See Abramson et al. [1] for a general study of convex minorants of random walks and processes with exchangeable increments. In the special cases where $S$ is a standard Brownian motion or Brownian bridge on the unit interval, Carolan & Dykstra [6] derive the distribution of the slope $\Delta(p)$, jointly with the process and its convex minorant at $p$, for a fixed value $p \in (0,1)$. Given our corollary, their explicit formula for the slope provides the distribution of $F^{-1}(p)$, giving new information about the occupation measure of $t \mapsto S(t)/t$ for Brownian motion and Brownian bridge. The distribution of the percentile point of the occupation measure for $S$ itself has been obtained under the same generality as Corollary 2.3: see the introduction of Dassios [8] and references therein.

## 3 Consequences for Isotonic Regression

Since the identity of Theorem 2.2 holds marginally, it allows us to simplify expectations of functions that are additive in the components of $\Pi_{\mathcal M_n}(Z)$. As long as $Z$ is exchangeable,

$$\sum_{k=1}^n \mathbb{E}\,h\big((\Pi_{\mathcal M_n}(Z))_k\big) = \sum_{k=1}^n \mathbb{E}\,h(\bar Z_{(k)}) = \sum_{k=1}^n \mathbb{E}\,h(\bar Z_k). \tag{3.1}$$

Taking $h(x) = |x|^p$, we obtain our first corollary.

###### Corollary 3.1.

Suppose $Z$ is exchangeable. For $p \ge 1$,

$$\mathbb{E}\|\Pi_{\mathcal M_n}(Z)\|_p^p = \sum_{k=1}^n \mathbb{E}\left|\frac{1}{k}\sum_{i=1}^k Z_i\right|^p, \tag{3.2}$$

provided $\mathbb{E}|Z_1|^p < \infty$.

###### Remark 3.2.

Viewed through its graphical representation, $(\Pi_{\mathcal M_n}(Z))_k$ is the left-derivative of the GCM at $k$, so when the power $p = 1$, equation (3.2) yields the discrete arc-length formula

$$\sum_{k=1}^n \mathbb{E}|C(k) - C(k-1)| = \mathbb{E}\|\Pi_{\mathcal M_n}(Z)\|_1 = \sum_{k=1}^n \frac{1}{k}\,\mathbb{E}|S_k|. \tag{3.3}$$

Closely related to this formula is the identity of Spitzer & Widom [17], which takes $\tilde Z_1, \dots, \tilde Z_n$ to be a sequence of i.i.d. random variables in $\mathbb{R}^2$ (or the complex plane $\mathbb{C}$) with finite variance. If $\tilde S_k := \tilde Z_1 + \cdots + \tilde Z_k$ is the partial sum and $\tilde L_n$ is the length of the perimeter of the convex hull of $\{0, \tilde S_1, \dots, \tilde S_n\}$, then

$$\mathbb{E}\tilde L_n = 2\sum_{k=1}^n \frac{1}{k}\,\mathbb{E}\|\tilde S_k\|. \tag{3.4}$$

These formulas connect the geometry of the convex hull of a random walk to the magnitudes of the running means.
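The arc-length formula (3.3) is itself exact for any exchangeable noise, so it too can be verified by exhaustive enumeration over Rademacher signs (the helper `isotonic_lse` is our own PAVA implementation; $n = 4$ is an arbitrary small choice):

```python
# Exhaustive check of the arc-length formula (3.3) for Rademacher signs:
# E||Pi(Z)||_1 = sum_k E|S_k| / k.
from itertools import product

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

n = 4
outcomes = list(product((-1, 1), repeat=n))
lhs = rhs = 0.0
for z in outcomes:
    lhs += sum(abs(t) for t in isotonic_lse(z)) / len(outcomes)
    partial = 0
    for k, v in enumerate(z, start=1):    # running partial sums S_k
        partial += v
        rhs += abs(partial) / k / len(outcomes)
assert abs(lhs - rhs) < 1e-9
```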

Consider the case when $p = 2$. Since $Z$ is exchangeable, every pair of components has the same correlation $\rho := \operatorname{Corr}(Z_1, Z_2)$. If we further assume $Z$ has zero mean and unit variance, the right-hand side of equation (3.2) can be computed explicitly:

$$\mathbb{E}\left(\frac{1}{k}\sum_{i=1}^k Z_i\right)^2 = \rho + \frac{1 - \rho}{k}.$$

Summing over $k = 1, \dots, n$ yields our next result.

###### Corollary 3.3.

Suppose $Z \sim \mu$ is an exchangeable random vector with zero mean, unit variance, and pairwise correlation $\rho$. Then

$$\delta_n(\mu) = \rho n + (1 - \rho)H_n.$$
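Corollary 3.3 can be sanity-checked by simulation with genuinely correlated exchangeable noise: $Z_i = \sqrt{\rho}\,W + \sqrt{1-\rho}\,E_i$ with $W, E_1, \dots, E_n$ i.i.d. standard Gaussian is exchangeable with zero mean, unit variance, and pairwise correlation $\rho$. A Monte Carlo sketch (the helper `isotonic_lse` is our own PAVA implementation; $n$, $\rho$, and the replication count are arbitrary choices):

```python
# Monte Carlo check of delta_n = rho*n + (1-rho)*H_n for correlated
# exchangeable Gaussian noise.
import math
import random

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

random.seed(1)
n, rho, reps = 4, 0.5, 40000
acc = 0.0
for _ in range(reps):
    w = random.gauss(0, 1)                      # shared component -> correlation rho
    z = [math.sqrt(rho) * w + math.sqrt(1 - rho) * random.gauss(0, 1)
         for _ in range(n)]
    acc += sum(t * t for t in isotonic_lse(z)) / reps
H_n = sum(1 / k for k in range(1, n + 1))
target = rho * n + (1 - rho) * H_n              # = 2 + H_4 / 2
assert abs(acc - target) < 0.2                  # up to Monte Carlo error
```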

This result should be contrasted with other distribution-free identities, namely

$$\mathbb{E}\|Z\|_2^2 = n \quad\text{and}\quad \mathbb{E}\|\bar Z_n \mathbf{1}_n\|_2^2 = 1,$$

provided $Z$ has i.i.d. components with zero mean and unit variance. In particular, suppose we observe $Y = \theta^* + \sigma Z$ where $Z$ has i.i.d. components with zero mean and unit variance, but it turns out that $\theta^*$ is constant. If we know $\theta^*$ is constant, we can estimate it by the constant sequence $\bar Y_n \mathbf{1}_n$ and pay a constant price $\sigma^2$ in total risk. If we know nothing about the structure of $\theta^*$ and use $\hat\theta = Y$, the total risk $n\sigma^2$ is quite large by comparison. The monotone sequence estimate resides in the middle, with a much smaller total risk of $\sigma^2 H_n$ and knowledge only about the relative order. We explained in Section 1 how risk calculations when $\theta^*$ is constant generalize to MSE bounds that are sharp in the low-noise limit for arbitrary $\theta^* \in \mathcal M_n$. For example, when $\theta^* \in \mathcal M_n$ has $k$ constant pieces, then (1.1), Corollary 3.3 and the fact that $H_m \le 1 + \log m$ for every $m \ge 1$ imply that

$$R(\hat\theta, \theta^*) \le \frac{k\sigma^2}{n}\log\left(\frac{en}{k}\right)$$

whenever $Z_1, \dots, Z_n$ are i.i.d. with mean zero and unit variance. Also, if $\theta^*$ is not necessarily in $\mathcal M_n$, then Corollary 3.3, together with the results of [3], implies that

$$R(\hat\theta, \theta^*) \le \inf_{\theta \in \mathcal M_n}\left(\frac{1}{n}\|\theta - \theta^*\|_2^2 + \frac{\sigma^2 k(\theta)}{n}\log\frac{en}{k(\theta)}\right),$$

where $k(\theta)$ is the number of constant pieces of the vector $\theta$. These formulae (with the leading constant of 1 in front of the $\frac{k\sigma^2}{n}\log(en/k)$ term on the right-hand side) were previously only known when the distribution of $Z$ was standard Gaussian.

Define the $\ell_p$-risk of the isotonic LSE,

$$R^{(p)}(\hat\theta, \theta^*) = \frac{1}{n}\,\mathbb{E}\|\hat\theta - \theta^*\|_p^p,$$

so that $R^{(2)} = R$. We can similarly employ Theorem 2.2 to explicitly calculate the $\ell_p$-risk of the isotonic LSE when $\theta^*$ is constant and the noise is Gaussian:

###### Corollary 3.4.

Suppose $Z \sim N(0, I_n)$. Then for any $p \ge 1$,

$$\mathbb{E}\|\Pi_{\mathcal M_n}(Z)\|_p^p = H_{n,p/2}\,\mathbb{E}|Z_1|^p = H_{n,p/2}\sqrt{\frac{2^p}{\pi}}\,\Gamma\!\left(\frac{p+1}{2}\right),$$

where $H_{n,q} := \sum_{k=1}^n k^{-q}$ denotes the generalized harmonic number.

###### Proof.

Note that $\frac{1}{k}\sum_{i=1}^k Z_i \overset{d}{=} k^{-1/2} Z_1$ and apply the theorem. ∎
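For $p = 1$ the corollary reads $\mathbb{E}\|\Pi_{\mathcal M_n}(Z)\|_1 = H_{n,1/2}\sqrt{2/\pi}$, which is easy to check by simulation. A Monte Carlo sketch (the helper `isotonic_lse` is our own PAVA implementation; $n$ and the replication count are arbitrary):

```python
# Monte Carlo check of Corollary 3.4 with p = 1 and standard Gaussian noise:
# E||Pi(Z)||_1 = H_{n,1/2} * E|Z_1| = H_{n,1/2} * sqrt(2/pi).
import math
import random

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

random.seed(2)
n, reps = 5, 20000
acc = 0.0
for _ in range(reps):
    z = [random.gauss(0, 1) for _ in range(n)]
    acc += sum(abs(t) for t in isotonic_lse(z)) / reps
# H_{n,1/2} * E|Z_1|  with  E|Z_1| = sqrt(2/pi)
predicted = sum(k ** -0.5 for k in range(1, n + 1)) * math.sqrt(2 / math.pi)
assert abs(acc - predicted) < 0.15   # up to Monte Carlo error
```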

Corollary 3.4 should similarly be contrasted with the following identities when $Z \sim N(0, I_n)$:

$$\mathbb{E}\|Z\|_p^p = n\,\mathbb{E}|Z_1|^p \quad\text{and}\quad \mathbb{E}\|\bar Z_n \mathbf{1}_n\|_p^p = n^{1-p/2}\,\mathbb{E}|Z_1|^p,$$

respectively. In particular, when $p > 2$ the bound $H_{n,p/2} \le \zeta(p/2) < \infty$ holds for all $n$, which is to say $\mathbb{E}\|\Pi_{\mathcal M_n}(Z)\|_p^p$ stays bounded as $n$ grows, whereas $\mathbb{E}\|Z\|_p^p$ grows without bound.

When $\theta^*$ is constant and $Z \sim N(0, I_n)$, the $\ell_p$-risk of isotonic regression is

$$R^{(p)}(\hat\theta, \theta^*) = \frac{H_{n,p/2}}{n}\,\sigma^p\,\mathbb{E}|Z_1|^p. \tag{3.5}$$

For fixed $p$, Theorem 2.3 of Zhang [18] gives an asymptotic result for the risk over constant $\theta^*$ that agrees with equation (3.5).

The continuous-time distributional identity in Corollary 2.3 applies to the asymptotic distribution of the isotonic least squares estimator. A standard model for studying the asymptotic behavior of isotonic regression is

$$\theta^*_k = f^*\!\left(\frac{k}{n}\right), \qquad k = 1, \dots, n,$$

where $f^* : [0,1] \to \mathbb{R}$ is non-decreasing. We observe $Y = \theta^* + \sigma Z$, a noisy version of $\theta^*$, and calculate $\hat\theta = \Pi_{\mathcal M_n}(Y)$ by projecting onto the monotone cone. The function estimate $\hat f$ is defined by $\hat f(k/n) := \hat\theta_k$ and linearly interpolated between design points. Here, as before, the dependence on $n$ in $\hat\theta$ and $\hat f$ is suppressed, but now we are interested in the behavior of isotonic least squares at a fixed point $p \in (0,1)$ as $n \to \infty$.

Define the partial sum process $S_n$ by $S_n(k/n) := \frac{1}{\sqrt n}\sum_{i \le k} Z_i$, linearly interpolated between design points. When the function $f^*$ is constant, the quantity

$$\sqrt n\,\big(\hat f(p) - f^*(p)\big)$$

is given by the left-derivative of the greatest convex minorant of $\sigma S_n$ at $p$. By the invariance principle, this converges in distribution to the left-derivative of the greatest convex minorant of (scaled) standard Brownian motion at $p$. This asymptotic result is well known, and a similar result was noted for the Grenander estimator in Carolan & Dykstra [5], where Brownian motion is replaced with a Brownian bridge. Corollary 2.3 relates this asymptotic distribution to the percentile points of the occupation measure for $t \mapsto S(t)/t$.

Finally, Corollary 3.3 on the projection onto $\mathcal M_n$ extends to the projection onto the set of non-negative monotone sequences $\mathcal M_n^+ := \mathcal M_n \cap [0,\infty)^n$. Theorem 1 of Németh & Németh [14] observes that the projection of $Z$ onto $\mathcal M_n^+$ is given by $(\Pi_{\mathcal M_n}(Z))_+$, the element-wise positive part of the projection onto $\mathcal M_n$. Hence the distributional identity of Theorem 2.2 yields a similar set of identities for non-negative isotonic regression.

###### Corollary 3.5.

For any exchangeable noise vector $Z$,

$$(\Pi_{\mathcal M_n^+}(Z))_k \overset{d}{=} (\bar Z_{(k)})_+. \tag{3.6}$$

Provided $\mathbb{E}|Z_1|^p < \infty$,

$$\mathbb{E}\|\Pi_{\mathcal M_n^+}(Z)\|_p^p = \sum_{k=1}^n \mathbb{E}\left(\frac{1}{k}\sum_{i=1}^k Z_i\right)_+^p. \tag{3.7}$$

Furthermore, if $Z$ is symmetric with unit variance, the generalized statistical dimension of the non-negative monotone cone is

$$\mathbb{E}\|\Pi_{\mathcal M_n^+}(Z)\|_2^2 = \frac{\rho n + (1 - \rho)H_n}{2}, \tag{3.8}$$

where $\rho$ is the pairwise correlation.

###### Proof.

Equation (3.7) follows from equation (3.1) by taking $h(x) = (x)_+^p$. When $Z$ is symmetric with unit variance,

$$\mathbb{E}\left(\frac{1}{k}\sum_{i=1}^k Z_i\right)_+^2 = \frac{1}{2}\,\mathbb{E}\left(\frac{1}{k}\sum_{i=1}^k Z_i\right)^2 = \frac{1}{2}\left(\rho + \frac{1 - \rho}{k}\right).$$

Summing over $k = 1, \dots, n$ yields equation (3.8). ∎
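Since Rademacher signs are symmetric with unit variance and $\rho = 0$, equation (3.8) predicts $\mathbb{E}\|\Pi_{\mathcal M_n^+}(Z)\|_2^2 = H_n/2$ exactly, which can again be verified by exhaustive enumeration. A sketch using the Németh & Németh positive-part characterization (the helper `isotonic_lse` is our own PAVA implementation):

```python
# Exhaustive check of (3.8) for symmetric Rademacher noise, n = 4:
# the projection onto the non-negative monotone cone is the element-wise
# positive part of the isotonic LSE, and E||Pi_{M_n^+}(Z)||^2 = H_n / 2.
from itertools import product

def isotonic_lse(y):
    """Projection of y onto the monotone cone via pool adjacent violators."""
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2)); weights.append(w1 + w2)
    return [m for m, w in zip(means, weights) for _ in range(w)]

n = 4
outcomes = list(product((-1, 1), repeat=n))
acc = 0.0
for z in outcomes:
    # Element-wise positive part of the isotonic projection (Nemeth & Nemeth).
    proj_plus = [max(t, 0.0) for t in isotonic_lse(z)]
    acc += sum(t * t for t in proj_plus) / len(outcomes)
H_n = sum(1 / k for k in range(1, n + 1))
assert abs(acc - H_n / 2) < 1e-9   # = 25/24 for n = 4
```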

Equation (3.8) is also shown in Amelunxen et al. [2] in the special case $Z \sim N(0, I_n)$ using the theory of finite reflection groups. The identity (3.7) allows us to show equation (3.8) for a much wider variety of noise vectors, and as before also allows us to obtain relations for the expected $\ell_p$ norms of the projection of the noise vector. All of our exact formulae follow from the distributional identity in Theorem 2.2, which exploits the geometric characterization of the isotonic LSE in Lemma 2.1. An interesting open question is whether similar characterizations (such as for convex regression [10]) may yield exact non-asymptotic risk calculations in other shape-constrained estimation problems.

## References

• [1] J. Abramson, J. Pitman, N. Ross, and G. U. Bravo. Convex minorants of random walks and Lévy processes. Electronic Communications in Probability, 16:423–434, 2011.
• [2] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.
• [3] P. C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.
• [4] H. D. Brunk, R. E. Barlow, D. J. Bartholomew, and J. M. Bremner. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.
• [5] C. Carolan and R. Dykstra. Asymptotic behavior of the Grenander estimator at density flat regions. Canadian Journal of Statistics, 27(3):557–566, 1999.
• [6] C. Carolan and R. Dykstra. Marginal densities of the least concave majorant of Brownian motion. The Annals of Statistics, 29(6):1732–1750, 2001.
• [7] H. S. M. Coxeter and W. O. J. Moser. Generators and Relations for Discrete Groups, volume 14. Springer Science & Business Media, 2013.
• [8] A. Dassios. On the quantiles of Brownian motion and their hitting times. Bernoulli, 11(1):29–36, 2005.
• [9] B. Fang and A. Guntuboyina. On the risk of convex-constrained least squares estimators under misspecification. arXiv preprint arXiv:1706.04276, 2017.
• [10] P. Groeneboom, G. Jongbloed, and J. A. Wellner. Estimation of a convex function: characterizations and asymptotic theory. The Annals of Statistics, 29(6):1653–1698, 2001.
• [11] S. J. Grotzinger and C. Witzgall. Projections onto order simplexes. Applied Mathematics and Optimization, 12(1):247–270, 1984.
• [12] O. Kallenberg. Foundations of Modern Probability. Springer Science & Business Media, 2006.
• [13] F. B. Knight. The uniform law for exchangeable and Lévy process bridges. Astérisque, (236):171–188, 1996.
• [14] A. B. Németh and S. Z. Németh. How to project onto the monotone nonnegative cone using pool adjacent violators type algorithms. arXiv preprint arXiv:1201.2343, 2012.
• [15] S. Oymak and B. Hassibi. Sharp MSE bounds for proximal denoising. Foundations of Computational Mathematics, 16(4):965–1029, 2016.
• [16] T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. Wiley, 1988.
• [17] F. Spitzer and H. Widom. The circumference of a convex polygon. Proceedings of the American Mathematical Society, 12(3):506–509, 1961.
• [18] C.-H. Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528–555, 2002.