 # Box-constrained monotone L_∞-approximations and Lipschitz-continuous regularized functions

Let f:[0,1]→[0,1] be a nondecreasing function. The main goal of this work is to provide a regularized version, say f̃_L, of f. Our choice will be a best L_∞-approximation to f in the set of functions h:[0,1]→[0,1] which are Lipschitz-continuous, for a fixed Lipschitz norm bound L, and verify the boundary restrictions h(0)=0 and h(1)=1. Our findings allow us to characterize a solution through a monotone best L_∞-approximation to the Lipschitz regularization of f. This is seen to be equivalent to the alternative route of averaging the Pasch-Hausdorff envelopes. We include results showing stability of the procedure as well as directional differentiability of the L_∞-distance to the regularized version. This problem is motivated by a statistical problem involving trimmed versions of distribution functions, used to measure the level of contamination discrepancy from a fixed model.


## 1 Introduction

Let us briefly motivate the problem. In his seminal paper , Huber introduced the contamination neighbourhood of a probability, which became one of the very foundations of Robust Statistics. An (α-)contamination neighbourhood of a probability distribution P0 is the set of probability distributions

 Vα(P0)={(1−α)P0+αQ : Q∈P}, (1)

where P is the set of all probability distributions in the space. Although it can be defined in a wholly general setting, throughout the paper P will be the set of probabilities on the (Borel) sets of the real line. In this way, given an “ideal” model P0, the vicinity Vα(P0) includes those probabilities which are distorted versions of the model through gross or rounding errors: given a particular value α, a probability in Vα(P0) would generate samples with an approximate proportion 1−α of the data coming from P0. A dual point of view would consider that such a sample could be suitably “trimmed” so as to obtain a genuine sample from the model. In fact, P0 itself would arise from an appropriate trimming of P.

The introduction of general trimmings of a probability goes back at least to . A probability ~P is said to be a trimming of level α of P whenever there exists a down-weighting function that reweights the mass that P assigns to each set. Equivalently, ~P must be absolutely continuous w.r.t. P, with Radon–Nikodym derivative bounded by 1/(1−α). The set of α-trimmings of the probability distribution P will be denoted by Rα(P):

 Rα(P)={~P∈P : ~P≪P, d~P/dP ≤ 1/(1−α) P-a.s.}, (2)

and the key link between (1) and (2), obtained in , is given by

 P∈Vα(P0)⟺P0∈Rα(P). (3)

The drawback of both approaches is that in a realistic statistical setting we know neither the value α nor the “contaminated” distribution P. We merely have at our disposal an approximation Pn to P; usually Pn will be the sample distribution associated to a data set, and our goal is to search for statistical evidence, on the basis of Pn, for or against the hypothesis P∈Vα(P0). For such a task we can resort to a metric d on P and consider d(P0,Rα(Pn)) as an estimator of d(P0,Rα(P)). With the introduction of this distance, adjusting the trimming level accordingly, we measure to what extent our distribution, P, can be considered a contaminated version of the model, P0. The success of this strategy strongly depends on the suitability of the metric for this task. Our choice here is the Kolmogorov distance, dK, which for two probabilities is defined by the L∞-distance between their distribution functions. In Section 2 we will present some alternative characterizations of the set Rα(P) as well as its main topological properties in this setting. In particular we will show (see Lemma 2.4) that, defining Γ := F0∘F−1, F0 and F being the distribution functions of P0 and P, with great generality the following identity holds:

 dK(P0,Rα(P)) = min{∥h−Γ∥ : h∈Cα}, (4)

where

 Cα := {h:[0,1]→[0,1] with h(0)=0, h(1)=1, and ∥h∥Lip ≤ 1/(1−α)}. (5)

Here, and as will be used throughout, for any real-valued mapping f defined on a metric space (ℵ,d), we denote by ∥f∥ the L∞-norm and by ∥f∥Lip the Lipschitz norm:

 ∥f∥ = sup_{x∈ℵ} |f(x)|,    ∥f∥Lip = sup_{x≠y∈ℵ} |f(x)−f(y)|/d(x,y).

Therefore, (4) translates the problem into finding a useful expression for a best L∞-approximation to a monotone function by Lipschitz-continuous functions verifying the boundary conditions h(0)=0, h(1)=1. This goal should include a computationally feasible characterization of (4), to be used for statistical purposes. Both objectives are achieved in Theorem 2.5, although the proof will be given through Section 3. There, we will show how the Pasch-Hausdorff envelopes (see ) of a monotone function preserve monotonicity and provide the basis to build a best L∞-approximation verifying the boundary constraints. We will also relate this process to the alternative way of obtaining Ubhaya’s monotone L∞-best approximation (see [9, 10]) to the Lipschitz regularization of the objective function. This approach is followed in Section 4. Finally, we must highlight our results on stability of the constrained regularization (see Proposition 2.2) as well as on directional differentiability of the L∞-distance to the regularized version (see Theorem 4.4), for which the latter approach is better suited. The relevance of this type of differentiability result has been pointed out in , and recently highlighted in relation with statistical applications in . In fact, these results provide a sound mathematical foundation allowing forthcoming statistical applications of the proposed methodology.

## 2 The set of trimmings in the L∞-topological setting

Since probabilities on the real line are determined by their distribution functions (d.f.’s in the sequel), and (1) and (2) can be equivalently stated in terms of the corresponding distribution functions, we will use the same notation Vα(F0) and Rα(F), with the same meanings as before, but defined in terms of distribution functions. On the other hand, the Kolmogorov distance between probabilities is defined just through the L∞-distance between the corresponding d.f.’s, but we will often keep the notation dK for this distance.

The set Rα(F) can also be characterized, as shown in  (see also Proposition 2.2 in  for a more general result), in terms of the set of α-trimmed versions of the uniform probability on (0,1). Notice that this set is just Cα, as defined in (5). The parameterization, obtained through the composition of the functions h and F, gives

 Rα(F)={h∘F : h∈Cα}. (6)

We note that, as a consequence, the “trimmed Kolmogorov distance” from F0 to Rα(F) is

 dK(F0,Rα(F)) := inf_{~F∈Rα(F)} ∥~F−F0∥ = inf_{h∈Cα} ∥h∘F−F0∥.

The set Rα(F) is convex and also well behaved w.r.t. weak convergence of probabilities and widely employed probability metrics (see Section 2 in ). We show next that this good behaviour also holds in the present L∞ setting.

###### Proposition 2.1

For α∈(0,1) and distribution functions F, F0, F1, F2, G1 and G2, we have:

• (a) Rα(F) is compact w.r.t. ∥·∥.

• (b) The map h↦∥h∘F−F0∥ attains its minimum on Cα.

• (c) ∣dK(G1,Rα(F1))−dK(G2,Rα(F2))∣ ≤ ∥G1−G2∥ + (1/(1−α))∥F1−F2∥.

Proof. By the Ascoli–Arzelà Theorem, Cα is a compact subset of the space of continuous functions on [0,1] endowed with the uniform norm. Hence, from any sequence of elements in Rα(F), say {hn∘F} (recall (6)), we can extract a subsequence {hnk} such that hnk converges uniformly to some h∈Cα. But then, obviously, hnk∘F→h∘F in L∞, with h∘F∈Rα(F), which proves (a). Since, on the other hand,

 ∣∣∥h1∘F−F0∥−∥h2∘F−F0∥∣∣≤∥h1∘F−h2∘F∥≤∥h1−h2∥,

we see that the map h↦∥h∘F−F0∥ is continuous and, consequently, it attains its minimum on Cα, as claimed in (b). Finally, to check (c) we note that

 ∣dK(G1,Rα(F1))−dK(G1,Rα(F2))∣ ≤ sup_{h∈Cα} ∣∥G1−h∘F1∥−∥G1−h∘F2∥∣ ≤ sup_{h∈Cα} ∥h∘F1−h∘F2∥ ≤ (1/(1−α))∥F1−F2∥ (7)

and

 ∣dK(G1,Rα(F2))−dK(G2,Rα(F2))∣ ≤ sup_{h∈Cα} ∣∥G1−h∘F2∥−∥G2−h∘F2∥∣ ≤ ∥G1−G2∥. (8)

Now, (7) and (8) yield (c).

Proposition 2.1 guarantees the existence of optimal L∞-approximations to every distribution function F0 by α-trimmed versions of F:

 There exists  ~F∈Rα(F)  such that  ∥F0−~F∥=dK(F0,Rα(F)). (9)

It also shows, through (3), that for any d.f.’s F0 and F,

 F∈Vα(F0)  if and only if dK(F0,Rα(F))=0. (10)

Moreover, by convexity of Rα(F), the set of optimally trimmed versions of F associated to problem (9) is also convex. However, guaranteeing uniqueness of the minimizer (as holds w.r.t. the Wasserstein metric by Corollary 2.10 in ) is not possible here.

An additional consequence of Proposition 2.1 is the continuity of dK(G,Rα(F)) in G and F. We collect this and some additional facts in our next result.

###### Proposition 2.2

For α∈(0,1), if {Fn}n and F are d.f.’s such that ∥Fn−F∥→0, then:

• (a) for every ~F∈Rα(F) there exist ~Fn∈Rα(Fn) such that ∥~Fn−~F∥→0;

• (b) if ~Fn∈Rα(Fn) for every n, then there exists some ∥·∥-convergent subsequence {~Fnk}. If ~F is the limit of such a subsequence, necessarily ~F∈Rα(F);

• (c) if, additionally, {Gn}n and G are d.f.’s such that ∥Gn−G∥→0, then dK(Gn,Rα(Fn))→dK(G,Rα(F)) as n→∞.

Proof. To prove a), since ~F=h∘F with h∈Cα, it suffices to consider ~Fn:=h∘Fn and recall that h is Lipschitz. For b), we write ~Fn=hn∘Fn and argue as in the proof of Proposition 2.1 to get a ∥·∥-convergent subsequence, from which we easily get the conclusion. Finally, c) is a direct consequence of Proposition 2.1 (c).

By Pólya’s uniform convergence theorem, if F and G are continuous and {Fn}n, {Gn}n are sequences of d.f.’s which, respectively, converge weakly to F and G, then they also converge in the L∞ sense; therefore dK(Gn,Rα(Fn))→dK(G,Rα(F)) holds. Also, a direct application of the Glivenko–Cantelli theorem and item c) above guarantees the following strong consistency result.

###### Proposition 2.3

Let G, F be d.f.’s and let {Fm}m be the sequence of empirical d.f.’s based on a sequence of independent random variables with distribution function F. If {Gn}n is any sequence of distribution functions L∞-approximating the d.f. G (i.e. such that ∥Gn−G∥→0), then:

 dK(Gn,Rα(Fm))→dK(G,Rα(F)),  as n,m→∞,  with probability one.

Given a d.f. F, we write F−1 for the associated quantile function (or left-continuous inverse function), namely, F−1(t):=inf{x : F(x)≥t}. We recall that F−1(U) has d.f. F if U is a random variable uniformly distributed on (0,1). Similarly, if X has a continuous d.f. F, the composed function F0∘F−1 is the quantile function associated to the r.v. F0(X). As we show next, under some regularity assumptions dK(F0,Rα(F)) can be expressed in terms of the function Γ=F0∘F−1. We will see later the usefulness of this fact both for the asymptotic analysis and for the practical computation of dK(F0,Rα(Fn)) when Fn is an empirical d.f. based on a data sample X1,…,Xn. Recall that then Fn−1(t)=X(i) for t∈((i−1)/n, i/n], where X(1)≤⋯≤X(n) denotes the ordered sample.

###### Lemma 2.4

Let α∈(0,1). If F0 and F are continuous d.f.’s and F is additionally strictly increasing, then

 dK(F0,Rα(F)) = min_{h∈Cα} ∥h−F0∘F−1∥  and  dK(F0,Rα(Fn)) = min_{h∈Cα} ∥h−F0∘Fn−1∥.

Proof. For the first identity observe that

 ∥h∘F−F0∥ = sup_{x∈ℝ} |h(F(x))−F0(x)| = sup_{x∈ℝ} |h(F(x))−F0(F−1(F(x)))| = sup_{t∈[0,1]} |h(t)−F0(F−1(t))| = ∥h−F0∘F−1∥.

On the other hand, let X(1)≤⋯≤X(n) denote the ordered sample associated to X1,…,Xn (the same set of values, but arranged in nondecreasing order) and set

 t0=0, ti=i/n, hi=h(Fn(X(i)))=h(ti), and F0,i=F0(X(i)), 1≤i≤n.

Taking into account that Fn and h∘Fn are piecewise constant, while h and F0 are nondecreasing and continuous, we obtain

 ∥h∘Fn−F0∥ = max_{1≤i≤n} max(F0,i−h_{i−1}, h_i−F0,i) = ∥h−F0∘Fn−1∥,

and the other identity follows from Proposition 2.1, part (b).
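The finite-max formula above makes the trimmed Kolmogorov criterion directly computable for samples. A minimal numerical sketch (our own illustration; the logistic model F0, the sample, and the particular feasible h∈Cα are arbitrary choices, not taken from the paper) compares it against a brute-force evaluation of the supremum:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.normal(loc=0.3, size=n))         # ordered sample x_(1) <= ... <= x_(n)

alpha = 0.2
F0 = lambda z: 1.0 / (1.0 + np.exp(-z))          # a continuous model d.f. (logistic, for illustration)
h = lambda t: np.minimum(t / (1 - alpha), 1.0)   # one feasible element of C_alpha

# Finite formula of Lemma 2.4: max_i max(F0_i - h_{i-1}, h_i - F0_i)
t = np.arange(n + 1) / n
ht = h(t)                                        # h_0 = h(0) = 0, ..., h_n = h(1) = 1
F0i = F0(x)                                      # F_{0,i} = F0(x_(i))
finite = max((F0i - ht[:-1]).max(), (ht[1:] - F0i).max())

# Brute-force evaluation of sup_z |h(Fn(z)) - F0(z)| on a dense grid
z = np.linspace(x[0] - 8.0, x[-1] + 8.0, 400_001)
Fn = np.searchsorted(x, z, side="right") / n     # empirical d.f. at the grid points
brute = np.abs(h(Fn) - F0(z)).max()

print(finite, brute)                             # agree up to grid resolution
```

The brute-force value can only approach the supremum from below, so it matches the finite formula up to the grid spacing.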

Our final result in this section provides a simple representation of min_{h∈Cα}∥h−Γ∥ (hence, of dK(F0,Rα(F))). In this statement we assume that Γ is a nondecreasing function taking values in [0,1] (which is always the case for Γ=F0∘F−1). Note that, taking right and left limits at 0 and 1 respectively, we can assume that Γ is a nondecreasing (and left-continuous) function from [0,1] to [0,1].

###### Theorem 2.5

Let α∈(0,1) and let Γ:[0,1]→[0,1] be a nondecreasing function. Define G(t):=Γ(t)−t/(1−α),

 U(t):=sup_{s∈[t,1]} G(s),   L(t):=inf_{s∈[0,t]} G(s),

and

 ~hα(t)=max(min((U(t)+L(t))/2, 0), −α/(1−α)).

Then,

 min_{h∈Cα} ∥h−Γ∥ = ∥~hα−G∥.

The proof of this result will be developed in Section 3; in fact, Theorem 3.3 is just a rephrasing of it. A look at that theorem shows that hα(t):=~hα(t)+t/(1−α) is an element of Cα such that ∥hα−Γ∥=min_{h∈Cα}∥h−Γ∥, that is, hα is an optimal trimming function in the sense described above. We recall that we do not claim uniqueness of this minimizer, but this particular choice allows us to compute dK(F0,Rα(Fn)) for sample d.f.’s. Moreover, Theorem 2.5 even provides a simple way to compute dK(F0,Rα(F)) for theoretical distributions. Let us see an illustration of this use.
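The definitions of U and L are partly lost in this version of the text; reading them (consistently with Lemma 3.2 and Theorem 3.3) as the one-sided monotone envelopes of G(t)=Γ(t)−t/(1−α), the recipe of Theorem 2.5 becomes a few lines of grid arithmetic. The following is our illustrative sketch under that assumption, not code from the paper:

```python
import numpy as np

def trimmed_sup_distance(Gamma, alpha, n_grid=100_001):
    """Grid evaluation of min_{h in C_alpha} ||h - Gamma|| following
    Theorem 2.5, with U, L read as the one-sided monotone envelopes of
    G(t) = Gamma(t) - t/(1-alpha) (an assumption: their definitions are
    elided in this version of the text)."""
    Lc = 1.0 / (1.0 - alpha)                      # Lipschitz bound 1/(1-alpha)
    t = np.linspace(0.0, 1.0, n_grid)
    G = Gamma(t) - Lc * t
    U = np.maximum.accumulate(G[::-1])[::-1]      # U(t) = sup_{s >= t} G(s)
    Lo = np.minimum.accumulate(G)                 # L(t) = inf_{s <= t} G(s)
    h_tilde = np.clip(0.5 * (U + Lo), -alpha / (1.0 - alpha), 0.0)
    h = h_tilde + Lc * t                          # candidate optimal h_alpha in C_alpha
    return np.abs(h_tilde - G).max(), h, t

dist, h, t = trimmed_sup_distance(lambda s: s**2, alpha=0.3)
```

For Γ(t)=t² and α=0.3 the value is 4/49 ≈ 0.0816 (the amount by which G dips below −α/(1−α)), and the returned h satisfies the constraints defining Cα.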

###### Example 2.1 (Trimmed Kolmogorov distances in the Gaussian model.)

Consider the case F0=Φ, where Φ denotes the standard normal d.f., and F(x)=Φ((x−μ)/σ), the d.f. of N(μ,σ²). Here we have Γ(t)=Φ(μ+σΦ−1(t)). We note that Γ′(t)>1/(1−α) if and only if p(Φ−1(t))<0, where

 p(x)=(σ²−1)x²+2μσx+μ²−2log((1−α)σ). (11)

To avoid cumbersome computations we focus on the cases σ=1, μ≠0 and μ=0, σ≠1.

If σ=1 and μ>0 then p is linear with positive slope, and we see that p(x)<0 if and only if x<xμ:=log(1−α)/μ−μ/2. Writing G(t):=Γ(t)−t/(1−α), this means that G is increasing in (0,Φ(xμ)) and decreasing in (Φ(xμ),1). Since G(0)=0 and G(1)=−α/(1−α), we have that G(t)>0 for 0<t<t̄, where t̄ is (the unique) solution to G(t)=0 in (Φ(xμ),1), and G(t)<0 for t>t̄. We conclude that dK(Rα(N(μ,1)),N(0,1))=G(Φ(xμ)). The case μ<0 can be handled similarly to obtain

 dK(Rα(N(μ,1)),N(0,1)) = Φ(|μ|/2+log(1−α)/|μ|) − (1/(1−α)) Φ(−|μ|/2+log(1−α)/|μ|),  μ≠0. (12)

We focus now on the case μ=0. If σ<1, p is a parabola with negative leading coefficient, and p(x)>0 if and only if |x|<Δ, where Δ:=(2log((1−α)σ)/(σ²−1))^{1/2}. Equivalently, Γ′(t)>1/(1−α) if and only if |Φ−1(t)|>Δ. This means that G(t):=Γ(t)−t/(1−α) is increasing in (0,Φ(−Δ)), decreasing in (Φ(−Δ),Φ(Δ)), and increasing in (Φ(Δ),1), with G(0)=0 and G(1)=−α/(1−α). Arguing as above, we conclude that

 dK(Rα(N(0,σ²)),N(0,1)) = Φ(−σΔ) − (1/(1−α)) Φ(−Δ),  if σ<1.

If 1≤σ≤1/(1−α) then we have that p(x)≥0 for all x, so that Γ is itself an element of Cα. In particular, dK(Rα(N(0,σ²)),N(0,1))=0.

Finally, we consider the case σ>1/(1−α). In this case p(x)>0 if and only if |x|>Δ, with Δ:=(2log((1−α)σ)/(σ²−1))^{1/2}, so that G is decreasing in (0,Φ(−Δ)), increasing in (Φ(−Δ),Φ(Δ)), and decreasing in (Φ(Δ),1), with G(0)=0 and G(1)=−α/(1−α). Balancing the maximal oscillation of G between Φ(−Δ) and Φ(Δ), we obtain

 dK(Rα(N(0,σ²)),N(0,1)) = Φ(σΔ) − (Φ(Δ)−α/2)/(1−α),  if σ>1/(1−α).
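The closed-form expression for the location case can be cross-checked numerically by evaluating min_{h∈Cα}∥h−Γ∥ on a grid, with Γ(t)=Φ(μ+Φ−1(t)) and with U, L in Theorem 2.5 read as the one-sided monotone envelopes of G(t)=Γ(t)−t/(1−α) (this reading of the elided definitions is our assumption; the function names are ours):

```python
import numpy as np
from math import erf, log, sqrt

Phi = np.vectorize(lambda v: 0.5 * (1.0 + erf(v / sqrt(2.0))))  # standard normal d.f.

def dK_trimmed_gaussian(mu, alpha, n_grid=200_001):
    """dK(R_alpha(N(mu,1)), N(0,1)) via the grid recipe of Theorem 2.5,
    parametrizing t = Phi(v), so that Gamma(t) = Phi(v + mu)."""
    Lc = 1.0 / (1.0 - alpha)
    v = np.linspace(-8.0, 8.0, n_grid)
    t = Phi(v)
    G = Phi(v + mu) - Lc * t
    U = np.maximum.accumulate(G[::-1])[::-1]      # sup_{s >= t} G(s)
    Lo = np.minimum.accumulate(G)                 # inf_{s <= t} G(s)
    h_tilde = np.clip(0.5 * (U + Lo), -alpha / (1.0 - alpha), 0.0)
    return np.abs(h_tilde - G).max()

def dK_formula(mu, alpha):
    """Closed form (12), valid for sigma = 1 and mu != 0."""
    a = abs(mu) / 2.0 + log(1.0 - alpha) / abs(mu)
    return 0.5 * (1.0 + erf(a / sqrt(2.0))) \
        - 0.5 * (1.0 + erf((a - abs(mu)) / sqrt(2.0))) / (1.0 - alpha)

print(dK_trimmed_gaussian(1.0, 0.1), dK_formula(1.0, 0.1))
```

For μ=0 the grid evaluation returns (numerically) zero, in agreement with N(0,1)∈Vα(N(0,1)).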

## 3 Best L∞-approximations by Lipschitz-continuous functions with box constraints

In this section we refresh the notation. The role of 1/(1−α) will now be played by a generic Lipschitz constant L; our Γ will be replaced by a bounded function f defined on ℵ, where (ℵ,d) is (at least at the beginning) a general metric space, while [0,1] will remain the relevant range of values. We will also use the notation f∨g (resp. f∧g) for the maximum (resp. minimum) of two numbers (or functions). Regarding the Lipschitz norm, recall the trivial inequalities

 ∥f∧g∥Lip,∥f∨g∥Lip≤∥f∥Lip∨∥g∥Lip. (13)

The first lemma collects some basic properties of the Pasch-Hausdorff envelopes of a function, which yield a Lipschitz-continuous best L∞-approximation with a constrained Lipschitz constant. For the sake of completeness, we also include a simple proof.

###### Lemma 3.1

For a bounded function f:ℵ→ℝ and a given constant L>0, let us consider

 fL,1(x):=inf_{y∈ℵ}(f(y)+Ld(x,y)),   fL,2(x):=sup_{y∈ℵ}(f(y)−Ld(x,y)).

• (i) This defines functions fL,1, fL,2:ℵ→ℝ such that inf_ℵ f ≤ fL,1 ≤ f ≤ fL,2 ≤ sup_ℵ f.

• (ii) fL,1 is the pointwise largest function g satisfying ∥g∥Lip≤L and g≤f. Likewise, fL,2 is the pointwise smallest function g satisfying ∥g∥Lip≤L and g≥f.

• (iii) The average fL:=(fL,1+fL,2)/2 satisfies ∥fL∥Lip≤L and

 ∥g−f∥ ≥ ∥fL−f∥ = ∥fL,2−fL,1∥/2

for any function g such that ∥g∥Lip≤L.

Proof. Part (i) follows directly from the definitions of fL,1 and fL,2 because, for every x∈ℵ:

 infy∈ℵf(y)≤fL,1(x)≤f(x)+Ld(x,x)=f(x)=f(x)−Ld(x,x)≤fL,2(x)≤supy∈ℵf(y).

To address part (ii) observe that, for arbitrary x1,x2,y∈ℵ, the triangle inequality for the distance implies d(x1,y)≤d(x1,x2)+d(x2,y), leading to the inequalities

 |fL,j(x2)−fL,j(x1)| ≤ Ld(x1,x2)   for j=1,2,

and thus to ∥fL,j∥Lip≤L. Now, if g satisfies ∥g∥Lip≤L and g≤f, then, for x,y∈ℵ, g(x)≤g(y)+Ld(x,y)≤f(y)+Ld(x,y), with equality if y=x. Hence

 g(x) = inf_{y∈ℵ}(g(y)+Ld(x,y)) ≤ inf_{y∈ℵ}(f(y)+Ld(x,y)) = fL,1(x).

Analogously, it follows from ∥g∥Lip≤L and g≥f that g≥fL,2, proving (ii).

As to part (iii), let ϵ:=∥g−f∥. Then g−ϵ≤f≤g+ϵ and ∥g±ϵ∥Lip=∥g∥Lip≤L. Consequently, by part (ii),

 g−ϵ ≤ fL,1 ≤ f ≤ fL,2 ≤ g+ϵ.

This implies that

 |fL−f| = (f−fL)∨(fL−f) ≤ (fL,2−fL)∨(fL−fL,1) = (fL,2−fL,1)/2 ≤ ϵ,

whence

 ∥fL−f∥ ≤ ∥fL,2−fL,1∥/2 ≤ ∥g−f∥.

Since ∥fL∥Lip≤L, taking g=fL gives the announced equality.
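On a finite metric space the envelopes can be computed directly from their definitions, and the statements of the lemma can be checked numerically. A small sketch (our own illustration, with an arbitrary jumpy f that is far from L-Lipschitz):

```python
import numpy as np

def pasch_hausdorff(xs, f, L):
    """Pasch-Hausdorff envelopes of f on the finite metric space (xs, |.|)."""
    d = np.abs(xs[:, None] - xs[None, :])        # d(x_i, x_j)
    f_lo = (f[None, :] + L * d).min(axis=1)      # f_{L,1}(x_i) = inf_y (f(y) + L d(x_i, y))
    f_up = (f[None, :] - L * d).max(axis=1)      # f_{L,2}(x_i) = sup_y (f(y) - L d(x_i, y))
    return f_lo, f_up, 0.5 * (f_lo + f_up)       # ... and the average f_L

xs = np.linspace(0.0, 1.0, 501)
f = np.floor(4 * xs) / 4 + 0.1 * np.sin(25 * xs)   # bounded, far from 1-Lipschitz
L = 1.0
f_lo, f_up, f_L = pasch_hausdorff(xs, f, L)

best_err = np.abs(f_L - f).max()                 # ||f_L - f||
gap = (f_up - f_lo).max()                        # ||f_{L,2} - f_{L,1}||
print(best_err, gap / 2)                         # equal, by part (iii)
```

The equality ∥fL−f∥ = ∥fL,2−fL,1∥/2 holds exactly here because the grid itself is the metric space.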

When ℵ is a real interval and f is non-decreasing, the functions fL,1 and fL,2 in Lemma 3.1 share that property and can alternatively be expressed in terms of the Ubhaya monotone envelopes of the function x↦f(x)−Lx. This is the content of the following lemma.

###### Lemma 3.2

Let ℵ be a real interval equipped with the usual distance d(x,y)=|x−y|. If f:ℵ→ℝ is bounded and non-decreasing, then the functions fL,1, fL,2 in Lemma 3.1 are non-decreasing too, and, for arbitrary x∈ℵ and j=1,2,

 fL,j(x) = γL,j(x) + Lx,

where γL,1, γL,2 are the non-increasing functions

 γL,1(x) := inf_{y∈ℵ: y≤x}(f(y)−Ly)   and   γL,2(x) := sup_{y∈ℵ: y≥x}(f(y)−Ly).

In particular,

 ∥fL,2−fL,1∥ = ∥γL,2−γL,1∥ = sup_{y,x∈ℵ: y≤x}(f(x)−f(y)−L(x−y)). (14)

Proof. The representations of fL,1 and fL,2 in terms of γL,1 and γL,2 follow from the fact that, for arbitrary x,y∈ℵ,

 f(y)+Ld(x,y) = f(y)+L(x−y) = (f(y)−Ly)+Lx if y≤x,  while f(y)+Ld(x,y) ≥ f(x) = (f(x)−Lx)+Lx if y≥x,
 f(y)−Ld(x,y) = f(y)−L(y−x) = (f(y)−Ly)+Lx if y≥x,  while f(y)−Ld(x,y) ≤ f(x) = (f(x)−Lx)+Lx if y≤x,

where the inequalities follow from f being non-decreasing. Note that both functions γL,1 and γL,2 are non-increasing, but adding the term Lx to them leads to non-decreasing functions: for x1,x2∈ℵ with x1≤x2, isotonicity of f implies that

 fL,2(x1) = sup_{y≥x2}(f(y)−Ly+Lx1) ∨ sup_{x1≤y≤x2}(f(y)−Ly+Lx1) ≤ (fL,2(x2)−Lx2+Lx1) ∨ f(x2) ≤ fL,2(x2),

and

 fL,1(x2) = inf_{y≤x1}(f(y)−Ly+Lx2) ∧ inf_{x1≤y≤x2}(f(y)−Ly+Lx2) ≥ (fL,1(x1)+Lx2−Lx1) ∧ f(x1) ≥ fL,1(x1),

because L(x2−x1)≥0, f(x2)≤fL,2(x2) and f(x1)≥fL,1(x1).
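For a non-decreasing f on an ordered grid, γL,1 and γL,2 are a running minimum and a (reversed) running maximum of f(y)−Ly, so the envelopes of Lemma 3.1 can be computed in linear rather than quadratic time. A sketch of this, together with a numerical check of identity (14) (our own illustration; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 400)
f = np.cumsum(rng.exponential(size=xs.size))    # a non-decreasing function on the grid
f /= f[-1]
L = 2.0

g = f - L * xs                                   # f(y) - L*y
gam1 = np.minimum.accumulate(g)                  # gamma_{L,1}(x) = inf_{y <= x} (f(y) - L*y)
gam2 = np.maximum.accumulate(g[::-1])[::-1]      # gamma_{L,2}(x) = sup_{y >= x} (f(y) - L*y)
fL1 = gam1 + L * xs                              # = f_{L,1}, by Lemma 3.2
fL2 = gam2 + L * xs                              # = f_{L,2}, by Lemma 3.2

# Compare with the O(n^2) definitions of Lemma 3.1
d = np.abs(xs[:, None] - xs[None, :])
fL1_direct = (f[None, :] + L * d).min(axis=1)
fL2_direct = (f[None, :] - L * d).max(axis=1)

# Identity (14): ||f_{L,2} - f_{L,1}|| = sup_{y <= x} (f(x) - f(y) - L*(x - y))
lhs = (fL2 - fL1).max()
ix, iy = np.tril_indices(xs.size)                # index pairs with xs[iy] <= xs[ix]
rhs = (f[ix] - f[iy] - L * (xs[ix] - xs[iy])).max()
```

The cumulative-extremum form is what makes the regularization practical for large sample d.f.'s.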

Finally, let us incorporate the boundary restrictions into the problem.

###### Theorem 3.3

Let f be non-decreasing. For a constant L, consider the function

 ~fL(x) := (fL(x)