Fast rates for empirical risk minimization with cadlag losses with bounded sectional variation norm

Empirical risk minimization over sieves of the class $\mathcal{F}$ of cadlag functions with bounded variation norm has a long history, starting with Total Variation Denoising (Rudin et al., 1992), and has been considered by several recent articles, in particular Fang et al. (2019) and van der Laan (2015). In this article, we show how a certain representation of cadlag functions with bounded sectional variation, also called Hardy-Krause variation, allows us to bound the bracketing entropy of sieves of $\mathcal{F}$ and therefore to derive fast rates of convergence in nonparametric function estimation. Specifically, for any sequence $a_n$ that (slowly) diverges to $\infty$, we show that we can construct an estimator with rate of convergence $O_P(2^{d/3} n^{-1/3} (\log n)^{d/3} a_n^{2/3})$ over $\mathcal{F}$, under some fairly general assumptions. Remarkably, the dimension only affects the rate in $n$ through the logarithmic factor, making this method especially appropriate for high-dimensional problems. In particular, we show that in the case of nonparametric regression over sieves of cadlag functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.

1 Introduction

Empirical risk minimization setting.

We consider the empirical risk minimization setting. Suppose that we observe $n$ i.i.d. realizations of a random variable with distribution $P_0$, taking values in a $d$-dimensional sample space, for some integer $d \geq 1$. We suppose that $P_0$ lies in a set of distributions, which we will call the statistical model. Consider a mapping from the statistical model to a set $\Theta$ that we will call the parameter set. We want to estimate the parameter $\theta_0$ of the data-generating distribution $P_0$, defined as the image of $P_0$ under this mapping. Let $L$ be a loss mapping, that is, for all $\theta \in \Theta$, $L(\theta)$ is the loss function for the parameter value $\theta$. We suppose that $L$ is a valid loss in the sense that

$$\theta_0 \in \arg\min_{\theta \in \Theta} P_0 L(\theta). \tag{1}$$

Our estimator $\hat{\theta}_n$ will be defined as the empirical risk minimizer over some subset $\Theta_n$ of $\Theta$, that is, $\hat{\theta}_n = \arg\min_{\theta \in \Theta_n} P_n L(\theta)$, where $P_n$ denotes the empirical distribution. We further discuss $\Theta_n$ in the next paragraph.

Statistical model, sieve, and estimator

We suppose that the parameter space and the loss mapping are such that the loss functions belong to the class of cadlag functions with bounded sectional variation norm. We now define the notion of sectional variation norm. Denote by $D((0,1]^d)$ the set of real-valued cadlag functions with domain $(0,1]^d$. Consider a function $f \in D((0,1]^d)$. For every subset $s \subseteq \{1,\dots,d\}$ and every vector $x$, define the vectors $x_s = (x_j : j \in s)$ and $x_{-s} = (x_j : j \notin s)$, and the section $f_s$ of $f$ as the mapping $x_s \mapsto f(x_s, 0_{-s})$. The sectional variation norm of $f$ is defined as

$$\|f\|_v \equiv \sum_{s \subset \{1,\dots,d\}} \int |f_s(dx_s)|, \tag{2}$$

where $f_s(dx_s)$ is the signed measure generated by the cadlag function $f_s$. Denote

$$\mathcal{F} := \{f \in D((0,1]^d) : \|f\|_v < \infty\}, \tag{3}$$

and, for all $M > 0$,

$$\mathcal{F}_M := \{f \in D((0,1]^d) : \|f\|_v \leq M\}. \tag{4}$$
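To make definition (2) concrete, the following is a minimal sketch that approximates the sectional variation norm on a regular grid. It is an illustration under stated assumptions, not the authors' procedure: the function name `sectional_variation`, the grid resolution `m`, and the convention that the empty subset contributes $|f(0)|$ are all choices made here. Each section is obtained by setting the coordinates outside $s$ to zero, and its variation is approximated by summing absolute mixed increments over grid cells.

```python
from itertools import combinations, product


def sectional_variation(f, d, m):
    """Grid approximation of the sectional variation norm of f on [0,1]^d.

    For every non-empty subset s of coordinates, the section fixes the
    coordinates outside s at 0 and sums the absolute mixed increments of f
    over the m^|s| grid cells. The empty subset contributes |f(0)|
    (an assumed convention).
    """
    grid = [k / m for k in range(m + 1)]
    total = abs(f(*([0.0] * d)))
    for size in range(1, d + 1):
        for s in combinations(range(d), size):
            for cell in product(range(m), repeat=size):
                # Mixed increment over the cell: signed sum over its corners.
                inc = 0.0
                for corner in product((0, 1), repeat=size):
                    x = [0.0] * d
                    for axis, idx, up in zip(s, cell, corner):
                        x[axis] = grid[idx + up]
                    sign = (-1) ** (size - sum(corner))
                    inc += sign * f(*x)
                total += abs(inc)
    return total


if __name__ == "__main__":
    # x1 * x2 has zero one-dimensional sections and unit mixed variation.
    print(sectional_variation(lambda x1, x2: x1 * x2, d=2, m=10))
    # x1 + x2 has two unit one-dimensional sections and no mixed variation.
    print(sectional_variation(lambda x1, x2: x1 + x2, d=2, m=10))
```

The grid sum is exact for these two smooth examples; for a general cadlag function it is only a lower approximation that improves as the grid is refined.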

In the most general version of our results, the statistical model will be defined implicitly. Specifically, we are going to consider a sieve $(\Theta_n)_{n \geq 1}$ through $\Theta$, that is, a growing (for the inclusion) collection of subsets of $\Theta$ such that for all $\theta \in \Theta$, there exists an $n_0$ such that $\theta \in \Theta_n$ for all $n \geq n_0$. We will take the sieve to be such that $L(\Theta_n) \subseteq \mathcal{F}_{a_n}$ for a sequence $a_n$.

Example: least squares regression with bounded loss.

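An example of the setting just described is least-squares regression over the class of cadlag functions with bounded sectional variation norm, under the assumption that the range of the dependent variable is bounded. Formally, this corresponds to the above setting with observations made of a covariate vector $x$ and a bounded dependent variable $y$, a parameter set $\Theta$ of candidate regression functions, and the loss mapping $L$ such that, for all $\theta \in \Theta$ and all $(x, y)$, $L(\theta)(x, y) = (y - \theta(x))^2$.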

Rate of convergence result.

Our main theoretical result states that the empirical risk minimizer $\hat{\theta}_n$ converges to $\theta_0$ at a rate of order $n^{-1/3}$, up to a power of $\log n$ and a factor depending on $a_n$, if we take a sieve such that the sectional variation norm of the loss grows no faster than $a_n$. The key to proving this result is a characterization of the bracketing entropy of the class of cadlag functions with bounded sectional variation norm. A rate of convergence is then derived based on the famed “peeling” technique.

Tractable representation of the estimator.

Fang et al. (2019) showed that if the parameter space is itself a set of cadlag functions with bounded sectional variation norm, then the empirical risk minimizer can be represented as a finite linear combination of basis functions, for a certain set of basis functions that depends on the observations. The empirical risk minimization problem then reduces to a LASSO problem.
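The reduction to a LASSO problem can be sketched as follows. This is a minimal illustration, not the construction of Fang et al. (2019): it assumes that the basis consists of indicator functions $x \mapsto 1\{x_s \geq X_{i,s}\}$, one per non-empty coordinate subset $s$ and observation $X_i$, and it solves the $\ell_1$-penalized least-squares problem with a plain coordinate descent. The names `indicator_basis`, `design_matrix`, and `lasso_cd` are hypothetical.

```python
from itertools import combinations


def indicator_basis(X):
    """One indicator x -> 1{x_s >= X_i,s} per non-empty subset s and observation X_i."""
    d = len(X[0])
    basis = []
    for size in range(1, d + 1):
        for s in combinations(range(d), size):
            for xi in X:
                knot = tuple(xi[j] for j in s)
                basis.append(
                    lambda x, s=s, knot=knot: float(
                        all(x[j] >= k for j, k in zip(s, knot))
                    )
                )
    return basis


def design_matrix(X, basis):
    """Evaluate every basis function at every observation."""
    return [[phi(x) for phi in basis] for x in X]


def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(A, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - b0 - A beta||^2 + lam * ||beta||_1.

    The intercept b0 is not penalized. Returns (b0, beta).
    """
    n, p = len(A), len(A[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Re-center the residuals through the intercept.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(A[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= A[i][j] * delta
                beta[j] = new
    return b0, beta


if __name__ == "__main__":
    X = [(0.1,), (0.4,), (0.7,)]
    y = [0.0, 1.0, 1.0]
    A = design_matrix(X, indicator_basis(X))
    print(lasso_cd(A, y, lam=0.05))
```

Under this assumed basis, the $\ell_1$ norm of the coefficients plays the role of the sectional variation norm of the fitted function, so the penalty level `lam` acts as the counterpart of the variation bound.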

Related work and contributions.

van der Laan (2015) considered empirical risk minimization over sieves of $\mathcal{F}$, under the general bounded loss setting, and showed that it achieves a rate of convergence strictly faster than $n^{-1/4}$ in loss-based dissimilarity. Fang et al. (2019) consider nonparametric least-squares regression with Gaussian errors and a lattice design, over $\mathcal{F}_M$ for a certain $M$, and show that the least-squares estimator achieves a rate of convergence of order $n^{-1/3}$ up to a logarithmic factor whose exponent is a certain constant. In this article, we show that a similar rate of convergence can be achieved under the general setting of empirical risk minimization with bounded loss over sieves of $\mathcal{F}$. We show that this setting covers the case of nonparametric least-squares regression with a bounded dependent variable, under no assumption on the design. We also consider the nonparametric regression with sub-exponential errors setting, and show that this rate is achieved by the least-squares estimator over a certain sieve of the set of cadlag functions with bounded sectional variation norm.

2 Representation and entropy of the cadlag functions with bounded sectional variation norm

As recently recalled by van der Laan (2015), Gill et al. (1995) showed that any cadlag function on the unit hypercube with bounded sectional variation norm can be represented as a sum of signed measures of bounded variation. This readily implies that any such function can be written as a sum of differences of scaled cumulative distribution functions, as formally stated in the following proposition.

Proposition 1.

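Consider $f \in D((0,1]^d)$ such that $\|f\|_v \leq M$, for some $M \geq 0$. For every subset $s \subseteq \{1,\dots,d\}$ and every vector $x$, define the vector $x_s = (x_j : j \in s)$. The function $f$ can be represented as follows: for all $x$,

$$f(x) = f(0) + M \sum_{s \subset \{1,\dots,d\},\, s \neq \emptyset} \int_0^{x_s} \bigl( \alpha_{s,1}\, g_{s,1}(dx_s) - \alpha_{s,2}\, g_{s,2}(dx_s) \bigr), \tag{5}$$

where $g_{s,1}$ and $g_{s,2}$ are cumulative distribution functions on the hypercube $[0,1]^{|s|}$, and $\alpha = (\alpha_{s,i}) \in \Delta^{2^{d+1}}$, where $\Delta^{2^{d+1}}$ is the $2^{d+1}$-standard simplex.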

This and a recent result of Song and Wellner (2008) on the bracketing entropy of distribution functions imply that the class of cadlag functions with variation norm bounded by $M$ has well-controlled entropy, as formalized by the following proposition.

Proposition 2.

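For a given $M > 0$, consider the class $\mathcal{F}_M$ of cadlag functions with sectional variation norm smaller than $M$. For any $r \geq 1$ and any sufficiently small $\epsilon > 0$, the bracketing entropy of $\mathcal{F}_M$ with respect to the $L_r(P)$ norm satisfies

$$\log N_{[\,]}(\epsilon, \mathcal{F}_M, L_r(P)) \lesssim 2^d K(r) M \epsilon^{-1} (\log(M/\epsilon))^d, \tag{6}$$

where $K(r)$ is a constant that depends only on $r$. This implies the following bound on the bracketing entropy integral of $\mathcal{F}_M$ with respect to the $L_r(P)$ norm: for all sufficiently small $\delta > 0$,

$$J_{[\,]}(\delta, \mathcal{F}_M, L_r(P)) \lesssim 2^{d/2} \sqrt{K(r)} \sqrt{M}\, \delta^{1/2} (\log(M/\delta))^{d/2}. \tag{7}$$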
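The step from (6) to (7) can be spelled out as follows. This is a sketch that assumes the usual definition of the bracketing entropy integral as $\int_0^\delta \sqrt{\log N_{[\,]}}\, d\epsilon$ and uses lemma 2 of appendix A after the rescaling $\epsilon = M u$; it yields the exponent $\lceil d/2 \rceil$ on the logarithm, which coincides with the $d/2$ of (7) when $d$ is even.

```latex
\begin{align*}
J_{[\,]}(\delta, \mathcal{F}_M, L_r(P))
  &= \int_0^\delta \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}_M, L_r(P))}\, d\epsilon \\
  &\lesssim 2^{d/2} \sqrt{K(r)} \sqrt{M}
     \int_0^\delta \epsilon^{-1/2} (\log(M/\epsilon))^{d/2}\, d\epsilon \\
  &= 2^{d/2} \sqrt{K(r)}\, M
     \int_0^{\delta/M} u^{-1/2} (\log(1/u))^{d/2}\, du
     \qquad (\epsilon = M u) \\
  &\lesssim 2^{d/2} \sqrt{K(r)}\, M\, (\delta/M)^{1/2}
     (\log(M/\delta))^{\lceil d/2 \rceil} \\
  &= 2^{d/2} \sqrt{K(r)} \sqrt{M}\, \delta^{1/2}
     (\log(M/\delta))^{\lceil d/2 \rceil}.
\end{align*}
```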

3 Rate of convergence under the general bounded loss setting

We now present our rate of convergence result for empirical risk minimizers over sieves of $\Theta$. Consider an increasing sequence $(a_n)$ of positive numbers that diverges to infinity. We present assumptions under which the empirical risk minimizer achieves rate of convergence $O_P(n^{-1/3} (\log n)^{d/2} a_n)$.

Assumption 1 (Function class of losses).

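The sieve $(\Theta_n)_{n \geq 1}$ is a growing (for the inclusion) sequence of subsets of $\Theta$ such that, for all $n$,

$$L(\Theta_n) \subseteq \mathcal{F}_{a_n}, \tag{8}$$

and such that, for all $\theta \in \Theta$, there exists an $n_0$ such that for all $n \geq n_0$, $\theta \in \Theta_n$.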

Note that the above assumption defines the function class implicitly: the cadlag and bounded sectional variation norm requirements are expressed for $L(\Theta_n)$, not for $\Theta_n$ directly. In practice, we will often want to assume directly that the parameter space is a set of cadlag functions with bounded sectional variation norm, instead of formulating an assumption on the losses. We argue that under fairly mild conditions, if $\Theta$ is a set of cadlag functions with bounded sectional variation, then assumption 1 holds. We illustrate this fact in the least-squares example with bounded dependent variable.

The reason why we consider a growing sieve is to ensure that we do not have to know in advance an upper bound on the variation norm of the losses. The growth rate of $a_n$ impacts the asymptotic rate of convergence and the finite sample performance. As the theorem will make clear, the slower $a_n$ grows, the better the speed of convergence. However, if $a_n$ grows too slowly, $\theta_0$ might not be included in $\Theta_n$ even for reasonable sample sizes. Note nevertheless that however slowly $a_n$ grows, $\theta_0$ will always be included in $\Theta_n$ for $n$ large enough, and therefore we can always achieve an asymptotic rate of convergence arbitrarily close to $n^{-1/3} (\log n)^{d/2}$, the rate corresponding to constant $a_n$.

We will express our rate of convergence result in terms of loss-based dissimilarity, which we define now.

Definition 1.

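Let $n \geq 1$. Denote $\theta_n = \arg\min_{\theta \in \Theta_n} P_0 L(\theta)$. For all $\theta \in \Theta_n$, we define the square of the loss-based dissimilarity between $\theta$ and $\theta_n$ as the discrepancy

$$d^2(\theta, \theta_n) = P_0 L(\theta) - P_0 L(\theta_n). \tag{9}$$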

The second main assumption of our theorem requires the loss to be smooth in the loss-based dissimilarity.

Assumption 2 (Smoothness).

For every $n$, it holds that

$$\sup_{\theta \in \Theta_n} \|L(\theta) - L(\theta_n)\|_{P_0,2} \leq a_n\, d(\theta, \theta_n). \tag{10}$$

We can now state our theorem.

Theorem 1.

Consider a sieve $(\Theta_n)_{n \geq 1}$ such that assumptions 1 and 2 hold for the sequence $(a_n)$ considered here. Suppose that for some . Consider our estimator $\hat{\theta}_n$, which, we recall, is defined as the empirical risk minimizer over $\Theta_n$, that is

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta_n} P_n L(\theta). \tag{11}$$

Suppose that the parameter space $\Theta$ contains the parameter $\theta_0$ of the data-generating distribution $P_0$. Then, we have the following upper bound on the rate of convergence of $\hat{\theta}_n$ to $\theta_0$:

$$d(\hat{\theta}_n, \theta_0) = O_P\bigl(n^{-1/3} (\log n)^{d/2} a_n\bigr). \tag{12}$$
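To get a feel for how the dimension and the sieve sequence enter the bound (12), the short sketch below evaluates the order $n^{-1/3} (\log n)^{d/2} a_n$ with all constants ignored. The sequence $a_n = \log\log n$ is a hypothetical example of a slowly diverging choice, and the function names are illustrative only.

```python
import math


def rate_bound(n, d, a_n):
    """Order of the bound (12): n^(-1/3) * (log n)^(d/2) * a_n, constants ignored."""
    return n ** (-1.0 / 3.0) * math.log(n) ** (d / 2.0) * a_n


def slow_sieve(n):
    """Hypothetical slowly diverging sequence a_n = log(log(n)), for n >= 16."""
    return math.log(math.log(n))


if __name__ == "__main__":
    for d in (1, 3, 10):
        for n in (10**4, 10**6, 10**8):
            print(d, n, rate_bound(n, d, slow_sieve(n)))
```

The exponent of $n$ stays at $-1/3$ whatever the dimension, but the logarithmic factor can dominate at moderate sample sizes when $d$ is large, so the bound is informative only once $n$ is large relative to $(\log n)^{3d/2}$.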

4 Application to least squares regression with bounded dependent variable

In this section we show how least-squares regression over the class of cadlag functions with bounded sectional variation norm falls under the setting of theorem 1, provided the dependent variable has bounded range.

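Let us make this formal. Suppose we collect observations that are i.i.d. replicates of a pair $(X, Y)$, with $X$ a vector of covariates and $Y$ a real-valued dependent variable whose range is bounded by a constant $M_Y$. The target parameter mapping maps every distribution in the model to the regression function $x \mapsto E[Y \mid X = x]$. The least-squares loss is defined, for all $\theta \in \Theta$, as $L(\theta)(x, y) = (y - \theta(x))^2$. We suppose that $\Theta$ is a set of cadlag functions with bounded sectional variation norm, and for all $M > 0$, we denote by $\Theta(M)$ the set of parameters in $\Theta$ with sectional variation norm at most $M$.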

The following proposition characterizes the function classes and , for .

Proposition 3.

Consider the setting and notations of this section. Suppose that for all , , for some . Then and, for all ,

$$L(\Theta(M)) \subseteq \mathcal{F}_{4(M + M_Y)^2}. \tag{13}$$

The following proposition characterizes the loss-based dissimilarity under the least-squares loss and shows that the loss is Lipschitz with respect to it.

Proposition 4.

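Consider the setting of this section. Let $M > 0$ be such that $\theta_0 \in \Theta(M)$. Then, for all $\theta \in \Theta(M)$,

$$d(\theta, \theta_0) = \|\theta - \theta_0\|_{P_0,2}, \tag{14}$$

and

$$\|L(\theta) - L(\theta_0)\|_{P_0,2} \leq 2(M_Y + M)\, d(\theta, \theta_0). \tag{15}$$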
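A short derivation shows where (14) and (15) come from. It is a sketch under the following assumptions, which are read off the constants in (13) and (15) rather than stated explicitly here: the loss is $L(\theta)(x,y) = (y - \theta(x))^2$, the dependent variable satisfies $|y| \leq M_Y$, both $\theta$ and $\theta_0$ are bounded by $M$ in supremum norm, and $\theta_0(X) = E_{P_0}[Y \mid X]$, so that the cross term vanishes.

```latex
\begin{align*}
L(\theta)(x,y) - L(\theta_0)(x,y)
  &= (y - \theta(x))^2 - (y - \theta_0(x))^2 \\
  &= (\theta_0(x) - \theta(x))\,(2y - \theta(x) - \theta_0(x)), \\
|L(\theta)(x,y) - L(\theta_0)(x,y)|
  &\leq 2(M_Y + M)\, |\theta(x) - \theta_0(x)|, \\
P_0 L(\theta) - P_0 L(\theta_0)
  &= E_{P_0}\bigl[(\theta_0(X) - \theta(X))^2\bigr]
   + 2 E_{P_0}\bigl[(Y - \theta_0(X))(\theta_0(X) - \theta(X))\bigr] \\
  &= \|\theta - \theta_0\|_{P_0,2}^2 .
\end{align*}
```

Taking the $L_2(P_0)$ norm of the second display gives (15), and the last display gives (14).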

Consider . Propositions 3 and 4 tell us that choosing ensures that assumptions 1 and 2 are satisfied for the choice of considered here.

Therefore, we can state as a corollary of theorem 1 the following result on the convergence rate of least-squares regression over the sieve.

Corollary 1.

Consider the setting of this section. Consider the sieve . Then, for , the least-squares estimator over ,

$$d(\hat{\theta}_n, \theta_0) = O_P\bigl(n^{-1/3} (\log n)^{d/2} a_n\bigr). \tag{16}$$

5 Least-squares regression with sub-exponential errors

In this section we consider a fairly general nonparametric regression setting, namely least-squares regression over a sieve of cadlag functions with bounded sectional variation norm, under the assumption that the errors follow a sub-exponential distribution. Although this situation is not covered by the hypotheses of theorem 1, our general bounded loss result, it is handled by fairly similar arguments. This is a setting of interest in the literature (see e.g. section 3.4.3.2 of van der Vaart and Wellner, 1996).

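Suppose that we collect observations $(X_1, Y_1), \dots, (X_n, Y_n)$, which are i.i.d. random variables with distribution $P_0$. Suppose that, for all $i$, $X_i$ is a vector of covariates and $Y_i$ is a real-valued dependent variable, and that

$$Y_i = \theta_0(X_i) + e_i, \tag{17}$$

where $\theta_0$ is the regression function, and $e_1, \dots, e_n$ are i.i.d. errors that follow a sub-exponential distribution with parameters $(\alpha, \nu)$. Suppose that for all $i$, $X_i$ and $e_i$ are independent. As in the previous section, we define the sets $\Theta(M)$.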

The following theorem characterizes the rate of convergence of our least-squares estimators, which we explicitly define in the statement of the theorem.

Theorem 2.

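Consider the setting of this section. Consider the sieve . Suppose that . Then $\hat{\theta}_n$, the least-squares estimator over $\Theta_n$, satisfies

$$\|\hat{\theta}_n - \theta_0\|_{P_0,2} = O_P\Bigl(2^{d/3} \bigl((\tilde{C}(\alpha,\nu) + 3)\, a_n + \|\theta_0\|_\infty\bigr)\, n^{-1/3} (\log n)^{d/3}\Bigr), \tag{18}$$

where the constant $\tilde{C}(\alpha,\nu)$ is defined in the appendix.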

Appendix A Proof of the bracketing entropy bound (proposition 2)

The proof of proposition 2 relies on the representation of cadlag functions with bounded sectional variation norm and on the results below.

The first result characterizes the bracketing entropy of the set of $d$-dimensional cumulative distribution functions.

Lemma 1 (Song and Wellner, 2008).

Consider the set $\mathcal{G}_{0,d}$ of $d$-dimensional cumulative distribution functions. For any probability measure $Q$ and any $r \geq 1$,

$$\log N_{[\,]}(\epsilon, \mathcal{G}_{0,d}, L_r(Q)) \lesssim \epsilon^{-1} (\log(1/\epsilon))^d. \tag{19}$$

The next lemma will be useful to bound the bracketing entropy integral.

Lemma 2.

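For any integer $d \geq 1$ and any sufficiently small $\delta > 0$, we have that

$$\int_0^\delta \epsilon^{-1/2} (\log(1/\epsilon))^{d/2}\, d\epsilon \lesssim \delta^{1/2} (\log(1/\delta))^{\lceil d/2 \rceil}. \tag{20}$$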
Proof.

The result is readily obtained by integration by parts. ∎
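The inequality (20) can be checked numerically. The sketch below is an illustration only: it evaluates the left-hand side with a midpoint rule after the substitution $\epsilon = u^2$, which removes the $\epsilon^{-1/2}$ singularity, and compares it with the right-hand side. For $d = 2$, integration by parts gives the closed form $2\delta^{1/2}\log(1/\delta) + 4\delta^{1/2}$, which serves as a sanity check. The function names are hypothetical.

```python
import math


def entropy_integral(delta, d, steps=100_000):
    """Midpoint-rule value of the integral in (20), for 0 < delta < 1.

    Uses the substitution eps = u**2, so that eps^(-1/2) d(eps) = 2 du
    and log(1/eps) = -2 log(u).
    """
    upper = math.sqrt(delta)
    h = upper / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += 2.0 * (-2.0 * math.log(u)) ** (d / 2.0) * h
    return total


def lemma2_rhs(delta, d):
    """Right-hand side of (20), without the implicit constant."""
    return math.sqrt(delta) * math.log(1.0 / delta) ** math.ceil(d / 2)


def closed_form_d2(delta):
    """Exact value of the integral for d = 2, from integration by parts."""
    return 2.0 * math.sqrt(delta) * math.log(1.0 / delta) + 4.0 * math.sqrt(delta)


if __name__ == "__main__":
    for delta in (1e-2, 1e-4, 1e-6):
        lhs = entropy_integral(delta, d=2)
        print(delta, lhs, closed_form_d2(delta), lhs / lemma2_rhs(delta, d=2))
```

For $d = 2$ the ratio of the two sides equals $2 + 4/\log(1/\delta)$, which stays bounded as $\delta \to 0$, in line with the $\lesssim$ statement.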

We can now present the proof of proposition 2.

Proof.

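We will first upper bound the $(\epsilon, L_r(P))$-bracketing number of $\mathcal{F}_1$. An upper bound on the $(\epsilon, L_r(P))$-bracketing number of $\mathcal{F}_M$ will then be obtained at the end of the proof by means of a change of variable. Recall that any function in $\mathcal{F}_1$ can be written as

$$f = \sum_{s \subset \{1,\dots,d\},\, s \neq \emptyset} \alpha_{s,1}\, g_{s,1} - \alpha_{s,2}\, g_{s,2}, \tag{21}$$

with $g_{s,1}, g_{s,2} \in \mathcal{G}_s$, the set of cumulative distribution functions indexed by $s$, and $\alpha = (\alpha_{s,i}) \in \Delta^{2^{d+1}}$, where $\Delta^{2^{d+1}}$ is the $2^{d+1}$-standard simplex.

Let $\epsilon > 0$. Denote by $N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty)$ the $(\epsilon/2^{d+1}, \|\cdot\|_\infty)$-covering number of $\Delta^{2^{d+1}}$. Let

$$\bigl\{\alpha^{(j)} : j = 1, \dots, N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty)\bigr\} \tag{22}$$

be an $\epsilon/2^{d+1}$-covering of $\Delta^{2^{d+1}}$. For all $s$, denote by $N_{[\,]}(\epsilon, \mathcal{G}_s, L_r(P))$ the $(\epsilon, L_r(P))$-bracketing number of $\mathcal{G}_s$, and let

$$\bigl\{(l_s^{(j)}, u_s^{(j)}) : j = 1, \dots, N_{[\,]}(\epsilon, \mathcal{G}_s, L_r(P))\bigr\} \tag{23}$$

be an $\epsilon$-bracketing of $\mathcal{G}_s$.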

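Step 1: Construction of a bracket for $\mathcal{F}_1$.

We now construct a bracket for $f$ from the cover of $\Delta^{2^{d+1}}$ and the bracketings of the $\mathcal{G}_s$ we just defined. By definition of an $\epsilon/2^{d+1}$-cover, there exists $j_0$ such that $\|\alpha - \alpha^{(j_0)}\|_\infty \leq \epsilon/2^{d+1}$. Consider $s \subset \{1,\dots,d\}$ and $i \in \{1,2\}$. By definition of an $\epsilon$-bracketing, there exists $j_{s,i}$ such that

$$l_s^{(j_{s,i})} \leq g_{s,i} \leq u_s^{(j_{s,i})}. \tag{24}$$

This and the fact that

$$\alpha^{(j_0)}_{s,i} - \epsilon/2^{d+1} \leq \alpha_{s,i} \leq \alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1} \tag{25}$$

will allow us to construct a bracket for $\alpha_{s,i}\, g_{s,i}$. Some care has to be taken due to the fact that $l_s^{(j_{s,i})}$ can be negative (as bracketing functions do not necessarily belong to the class they bracket). Observe that, since $\alpha_{s,i} \geq 0$, we have

$$\alpha_{s,i}\, l_s^{(j_{s,i})} \leq \alpha_{s,i}\, g_{s,i} \leq \alpha_{s,i}\, u_s^{(j_{s,i})}. \tag{26}$$

Denoting by $(l_s^{(j_{s,i})})^+$ and $(l_s^{(j_{s,i})})^-$ the positive and negative parts of $l_s^{(j_{s,i})}$, we have that

\begin{align}
\bigl(\alpha^{(j_0)}_{s,i} - \epsilon/2^{d+1}\bigr) (l_s^{(j_{s,i})})^+ &\leq \alpha_{s,i}\, (l_s^{(j_{s,i})})^+ \tag{27} \\
\text{and} \quad -\bigl(\alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}\bigr) (l_s^{(j_{s,i})})^- &\leq -\alpha_{s,i}\, (l_s^{(j_{s,i})})^-. \tag{28}
\end{align}

Therefore, summing the two inequalities,

$$\alpha^{(j_0)}_{s,i}\, l_s^{(j_{s,i})} - \frac{\epsilon}{2^{d+1}} \bigl|l_s^{(j_{s,i})}\bigr| \leq \alpha_{s,i}\, l_s^{(j_{s,i})}. \tag{29}$$

Since $u_s^{(j_{s,i})} \geq 0$ (as it is above at least one cumulative distribution function from $\mathcal{G}_s$), and $\alpha_{s,i} \leq \alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}$, we have that

$$\alpha_{s,i}\, g_{s,i} \leq \bigl(\alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}\bigr) u_s^{(j_{s,i})}. \tag{30}$$

Therefore, we have shown that

$$\alpha^{(j_0)}_{s,i}\, l_s^{(j_{s,i})} - \frac{\epsilon}{2^{d+1}} \bigl|l_s^{(j_{s,i})}\bigr| \leq \alpha_{s,i}\, g_{s,i} \leq \bigl(\alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}\bigr) u_s^{(j_{s,i})}. \tag{31}$$

Summing over $s$ and $i$, we have that

$$\Lambda_1 - \Gamma_2 \leq f \leq \Gamma_1 - \Lambda_2, \tag{32}$$

where, for $i = 1, 2$,

\begin{align}
\Lambda_i &= \sum_{s \subset \{1,\dots,d\}} \Bigl( \alpha^{(j_0)}_{s,i}\, l_s^{(j_{s,i})} - \frac{\epsilon}{2^{d+1}} \bigl|l_s^{(j_{s,i})}\bigr| \Bigr), \tag{33} \\
\text{and} \quad \Gamma_i &= \sum_{s \subset \{1,\dots,d\}} \bigl(\alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}\bigr) u_s^{(j_{s,i})}. \tag{34}
\end{align}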
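The pointwise inequality (31) is easy to test numerically. The sketch below is an illustration, not part of the proof: it draws a weight $\alpha \geq 0$, a cover point $\alpha_0$ within $\eta$ of it (with $\eta$ standing in for $\epsilon/2^{d+1}$), a value $g \in [0,1]$ of a cumulative distribution function, and a bracket $l \leq g \leq u$ in which $l$ may be negative, then checks both sides of (31). All names are hypothetical.

```python
import random


def bracket_bounds(alpha0, eta, l, u):
    """Lower and upper envelopes of alpha * g, as in (31)."""
    return alpha0 * l - eta * abs(l), (alpha0 + eta) * u


def check_bracket(trials=10_000, seed=0, tol=1e-12):
    """Randomized pointwise check of (31); returns True if no violation is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        eta = rng.uniform(0.0, 0.1)
        alpha = rng.uniform(0.0, 1.0)
        alpha0 = alpha + rng.uniform(-eta, eta)  # cover point within eta of alpha
        g = rng.uniform(0.0, 1.0)                # value taken by a CDF
        l = g - rng.uniform(0.0, 0.2)            # lower bracket, possibly negative
        u = g + rng.uniform(0.0, 0.2)            # upper bracket, nonnegative
        lower, upper = bracket_bounds(alpha0, eta, l, u)
        if lower > alpha * g + tol or alpha * g > upper + tol:
            return False
    return True


if __name__ == "__main__":
    print(check_bracket())
```

The absolute value in the lower envelope is what keeps the bound valid when $l$ is negative, which is the point of the care taken in (27) to (29).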

Step 2: Bounding the size of the brackets.

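For $i = 1, 2$,

$$0 \leq \Gamma_i - \Lambda_i = \sum_{s \subset \{1,\dots,d\}} \Bigl( \alpha^{(j_0)}_{s,i} \bigl(u_s^{(j_{s,i})} - l_s^{(j_{s,i})}\bigr) + \frac{\epsilon}{2^{d+1}} \bigl(u_s^{(j_{s,i})} + \bigl|l_s^{(j_{s,i})}\bigr|\bigr) \Bigr). \tag{35}$$

Since, for every $s$ and $i$, $u_s^{(j_{s,i})}$ and $l_s^{(j_{s,i})}$ are at most $\epsilon$-away in $\|\cdot\|_{P,r}$ norm from a cumulative distribution function, we have that $\|u_s^{(j_{s,i})}\|_{P,r} \leq 1 + \epsilon$ and $\|l_s^{(j_{s,i})}\|_{P,r} \leq 1 + \epsilon$. By definition, for all $s$ and $i$, $\|u_s^{(j_{s,i})} - l_s^{(j_{s,i})}\|_{P,r} \leq \epsilon$. Therefore, from the triangle inequality,

$$\|\Gamma_i - \Lambda_i\|_{P,r} \leq \epsilon \sum_{s \subset \{1,\dots,d\}} \alpha_{s,i} + \epsilon(1 + \epsilon). \tag{36}$$

Therefore, using the triangle inequality one more time,

\begin{align}
\|\Gamma_1 - \Lambda_2 - (\Lambda_1 - \Gamma_2)\|_{P,r} &\leq \epsilon \sum_{s \subset \{1,\dots,d\}} (\alpha_{s,1} + \alpha_{s,2}) + 2\epsilon(1 + \epsilon) \tag{37} \\
&\leq 3\epsilon + 2\epsilon^2. \tag{38}
\end{align}

Since cumulative distribution functions have range $[0,1]$, brackets never need to be of size larger than 1. Therefore, without loss of generality, we can assume that $\epsilon \leq 1$. Therefore, pursuing the above display, we get

$$\|\Gamma_1 - \Lambda_2 - (\Lambda_1 - \Gamma_2)\|_{P,r} \leq 5\epsilon. \tag{39}$$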

Step 3: Counting the brackets.

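Consider the set of brackets of the form $(\Lambda_1 - \Gamma_2, \Gamma_1 - \Lambda_2)$, where, for $i = 1, 2$,

\begin{align}
\Lambda_i &= \sum_{s \subset \{1,\dots,d\}} \Bigl( \alpha^{(j_0)}_{s,i}\, l_s^{(j_{s,i})} - \frac{\epsilon}{2^{d+1}} \bigl|l_s^{(j_{s,i})}\bigr| \Bigr), \tag{40} \\
\text{and} \quad \Gamma_i &= \sum_{s \subset \{1,\dots,d\}} \bigl(\alpha^{(j_0)}_{s,i} + \epsilon/2^{d+1}\bigr) u_s^{(j_{s,i})}, \tag{41}
\end{align}

where $j_0 \in \{1, \dots, N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty)\}$ and $j_{s,i} \in \{1, \dots, N_{[\,]}(\epsilon, \mathcal{G}_s, L_r(P))\}$ for any $s$ and $i$. From step 1 and step 2, we know that this set of brackets is a $(5\epsilon, \|\cdot\|_{P,r})$-bracketing of $\mathcal{F}_1$. Its cardinality is no larger than the cardinality of its index set. Therefore

$$N_{[\,]}(5\epsilon, \mathcal{F}_1, \|\cdot\|_{P,r}) \leq N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty) \prod_{s \subset \{1,\dots,d\}} N_{[\,]}(\epsilon, \mathcal{G}_s, \|\cdot\|_{P,r})^2. \tag{42}$$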

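The covering number of the simplex can be bounded (crudely) as follows:

$$N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty) \leq \bigl(2^{d+1}/\epsilon\bigr)^d, \tag{43}$$

therefore

$$\log N(\epsilon/2^{d+1}, \Delta^{2^{d+1}}, \|\cdot\|_\infty) \leq d \log(1/\epsilon) + d(d+1) \log 2. \tag{44}$$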

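From lemma 1,

$$\log N_{[\,]}(\epsilon, \mathcal{G}_s, \|\cdot\|_{P,r}) \leq K \epsilon^{-1} (\log(1/\epsilon))^d. \tag{45}$$

Therefore

\begin{align}
\log N_{[\,]}(5\epsilon, \mathcal{F}_1, \|\cdot\|_{P,r}) &\leq K 2^{d+2} \epsilon^{-1} (\log(1/\epsilon))^d + d \log(1/\epsilon) + d(d+1) \log 2 \tag{46} \\
&\lesssim K 2^{d+2} \epsilon^{-1} (\log(1/\epsilon))^d. \tag{47}
\end{align}

Therefore, doing a change of variable (and for a different constant absorbed in the $\lesssim$ symbol),

$$\log N_{[\,]}(\epsilon, \mathcal{F}_M, \|\cdot\|_{P,r}) \lesssim K 2^{d+2} M \epsilon^{-1} (\log(M/\epsilon))^d. \tag{48}$$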
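The change of variable behind (48) can be made explicit. The following sketch relies on the homogeneity of the sectional variation norm, so that $f \in \mathcal{F}_M$ if and only if $f/M \in \mathcal{F}_1$, and on the fact that multiplying a bracket $[l, u]$ by $M$ multiplies its $\|\cdot\|_{P,r}$ size by $M$. The last step assumes $\epsilon/M$ small enough that $\log(5M/\epsilon) \lesssim \log(M/\epsilon)$.

```latex
\begin{align*}
N_{[\,]}(\epsilon, \mathcal{F}_M, \|\cdot\|_{P,r})
  &= N_{[\,]}(\epsilon/M, \mathcal{F}_1, \|\cdot\|_{P,r}), \\
\log N_{[\,]}(\epsilon/M, \mathcal{F}_1, \|\cdot\|_{P,r})
  &\lesssim K 2^{d+2}\, \frac{5M}{\epsilon}
     \Bigl(\log \frac{5M}{\epsilon}\Bigr)^d
     \qquad \text{(apply (47) with } 5\epsilon' = \epsilon/M\text{)} \\
  &\lesssim K 2^{d+2} M \epsilon^{-1} (\log(M/\epsilon))^d .
\end{align*}
```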

Appendix B Proofs of theorem 1 and pertaining results

The proof of the theorem relies on theorem 3.4.1 in van der Vaart and Wellner (1996), which gives an upper bound on the rate of convergence of the estimator in terms of the “modulus of continuity” of an empirical process indexed by a difference in loss functions. We bound this “modulus of continuity” by using a maximal inequality for this empirical process. This maximal inequality is expressed in terms of the bracketing entropy integrals of the sieve.

We first restate here theorem 3.4.1 in van der Vaart and Wellner (1996).

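Theorem 3 (Theorem 3.4.1 in van der Vaart and Wellner, 1996).

For each $n$, let $\mathbb{M}_n$ and $M_n$ be stochastic processes indexed by a set $\Theta$. Let $\theta_n \in \Theta$ (possibly random) and $0 \leq \delta_n < \eta$ be arbitrary, and let $\theta \mapsto d_n(\theta, \theta_n)$ be an arbitrary map (possibly random) from $\Theta$ to $[0, \infty)$. Suppose that, for every $n$ and $\delta_n < \delta \leq \eta$,

\begin{align}
\sup_{\delta/2 \leq d_n(\theta,\theta_n) \leq \delta,\ \theta \in \Theta_n} M_n(\theta) - M_n(\theta_n) &\leq -\delta^2, \tag{49} \\
E^* \sup_{\delta/2 \leq d_n(\theta,\theta_n) \leq \delta,\ \theta \in \Theta_n} \sqrt{n} \bigl[(\mathbb{M}_n - M_n)(\theta) - (\mathbb{M}_n - M_n)(\theta_n)\bigr]^+ &\lesssim \phi_n(\delta), \tag{50}
\end{align}

for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing on $(\delta_n, \eta)$ for some $\alpha < 2$. Let $r_n \lesssim \delta_n^{-1}$ satisfy

$$r_n^2\, \phi_n\Bigl(\frac{1}{r_n}\Bigr) \leq \sqrt{n}, \quad \text{for every } n. \tag{51}$$

If the sequence $\hat{\theta}_n$ takes its values in $\Theta_n$ and satisfies

$$\mathbb{M}_n(\hat{\theta}_n) \geq \mathbb{M}_n(\theta_n) - O_P(r_n^{-2}) \tag{52}$$

and $d_n(\hat{\theta}_n, \theta_n)$ converges to zero in outer probability, then $r_n\, d_n(\hat{\theta}_n, \theta_n) = O_P^*(1)$. If the displayed conditions are valid for $\eta = \infty$, then the condition that $\hat{\theta}_n$ is consistent is unnecessary.

The quantity $\phi_n(\delta)$ is the so-called “modulus of continuity” of the centered process $\sqrt{n}(\mathbb{M}_n - M_n)$ over $\Theta_n$. Theorem 3 essentially teaches us that the rate of the modulus of continuity gives us (an upper bound on) the rate of convergence of the estimator.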
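The rate equation $r_n^2 \phi_n(1/r_n) \leq \sqrt{n}$ of theorem 3 can be solved numerically for a given modulus of continuity. The sketch below is an illustration under stated assumptions: it takes a hypothetical modulus $\phi(\delta) = c\, \delta^{1/2} (\log(1/\delta))^{d/2}$, of the same form as the entropy integral bound (7), and finds the largest admissible $r$ by bisection on a logarithmic scale, assuming $r \mapsto r^2 \phi(1/r)$ is increasing on the search interval. With no logarithmic factor, the solution is $r = (\sqrt{n}/c)^{2/3}$, which is where the $n^{1/3}$ rate comes from.

```python
import math


def phi_entropy(delta, d=0, c=1.0):
    """Hypothetical modulus of continuity c * delta^(1/2) * log(1/delta)^(d/2), delta < 1."""
    return c * math.sqrt(delta) * math.log(1.0 / delta) ** (d / 2.0)


def solve_rate(n, phi, lo=1.0 + 1e-9, hi=1e12, iters=200):
    """Largest r in [lo, hi] with r^2 * phi(1/r) <= sqrt(n), by geometric bisection.

    Assumes r -> r^2 * phi(1/r) is increasing on [lo, hi].
    """
    target = math.sqrt(n)
    if lo * lo * phi(1.0 / lo) > target:
        return lo
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mid * mid * phi(1.0 / mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    n = 10**6
    print(solve_rate(n, phi_entropy))                          # d = 0
    print(solve_rate(n, lambda t: phi_entropy(t, d=2)))        # log factor slows the rate
```

Adding the logarithmic factor lowers the admissible $r_n$, which matches the $(\log n)$ powers that appear in the rates stated in the theorems.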

We now restate the maximal inequality that we will use to bound the modulus of continuity.

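Lemma 3 (Lemma 3.4.2 in van der Vaart and Wellner, 1996).

Let $\mathcal{F}$ be a class of measurable functions such that $P f^2 < \delta^2$ and $\|f\|_\infty \leq M$ for every $f \in \mathcal{F}$. Then

$$E_P^* \sup_{f \in \mathcal{F}} \sqrt{n}\, |(P_n - P) f| \lesssim J_{[\,]}(\delta, \mathcal{F}, L_2(P)) \Bigl(1 + \frac{J_{[\,]}(\delta, \mathcal{F}, L_2(P))}{\delta^2 \sqrt{n}}\, M\Bigr). \tag{53}$$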

Application of the above maximal inequality is what will allow us to bound the “modulus of continuity”.

We now present the proof of theorem 1.

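Proof of theorem 1.

The proof essentially consists of checking the assumptions of theorem 3 for a certain choice of $\mathbb{M}_n$, $M_n$, and $d_n$. Specifically, we set, for every $n$, and every $\theta \in \Theta_n$,

\begin{align}
\mathbb{M}_n(\theta) &= -P_n L(\theta), \tag{54} \\
M_n(\theta) &= -P_0 L(\theta), \tag{55} \\
\theta_n &= \arg\min_{\theta \in \Theta_n} P_0 L(\theta), \tag{56} \\
d_n^2(\theta, \theta_n) &= P_0 L(\theta) - P_0 L(\theta_n), \tag{57} \\
r_n &= 2^{-d/3} a_n^{-2/3} n^{1/3} (\log n)^{-d/3}. \tag{58}
\end{align}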

Further set and . From now, we proceed in three steps.

Step 1: Checking condition (49).

By definition of $M_n$ and by definition of the loss-based dissimilarity, we directly have, for every $\theta \in \Theta_n$,

$$M_n(\theta) - M_n(\theta_n) = -P_0 (L(\theta) - L(\theta_n)) = -d^2(\theta, \theta_n). \tag{59}$$

Therefore, condition (49) holds.

Step 2: Bounding the modulus of continuity.

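We want to bound

\begin{align}
& E_{P_0} \sup_{\theta \in \Theta_n :\, d_n(\theta,\theta_n) \leq \delta} \bigl|(\mathbb{M}_n - M_n)(\theta) - (\mathbb{M}_n - M_n)(\theta_n)\bigr| \tag{60} \\
&= E_{P_0} \sup_{\theta \in \Theta_n :\, d_n(\theta,\theta_n) \leq \delta} \bigl|(P_n - P_0)(L(\theta) - L(\theta_n))\bigr| \tag{61} \\
&= E_{P_0} \sup_{g \in \mathcal{G}_n(\delta)} |(P_n - P_0) g|, \tag{62}
\end{align}

where

$$\mathcal{G}_n(\delta) = \{L(\theta) - L(\theta_n) : \theta \in \Theta_n,\ d_n(\theta, \theta_n) \leq \delta\}. \tag{63}$$

We now further characterize the set $\mathcal{G}_n(\delta)$. From assumption 2, for all $g \in \mathcal{G}_n(\delta)$, $\|g\|_{P_0,2} \leq a_n \delta$. From assumption 1, $\mathcal{G}_n(\delta) \subseteq \mathcal{F}_{a_n} - L(\theta_n)$. Therefore,

$$\mathcal{G}_n(\delta) \subseteq \{g \in \mathcal{F}_{a_n} - L(\theta_n) : \|g\|_{P_0,2} \leq a_n \delta\}. \tag{64}$$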

Therefore, from (62) and the maximal inequality of lemma 3, we have

 EP0supθ∈Θn:dn(θ,θn)≤δ|(M