# On the stability of bootstrap estimators

It is shown that bootstrap approximations of an estimator which is based on a continuous operator from the set of Borel probability measures defined on a compact metric space into a complete separable metric space are stable in the sense of qualitative robustness. Support vector machines based on shifted loss functions are treated as special cases.

01/29/2013

## 1 Introduction

The finite sample distribution of many nonparametric methods from statistical learning theory is unknown, because the distribution from which the data were generated is unknown and because often only asymptotic results on the behaviour of such methods are known.

The goal of this paper is to show that bootstrap approximations of an estimator which is based on a continuous operator from the set of Borel probability measures defined on a compact metric space into a complete separable metric space are stable in the sense of qualitative robustness. As a special case it is shown that bootstrap approximations for the support vector machine (SVM) are stable, both for the risk functional and for the SVM operator itself. The results can be interpreted as generalizations of theorems derived by [4].

The rest of the paper has the following structure. Section 2 gives the general result and Section 3 contains the results for SVMs. All proofs are given in the appendix.

## 2 On Qualitative Robustness of Bootstrap Estimators

If not otherwise mentioned, we will use the Borel σ-algebra on a set and denote the Borel σ-algebra on a set S by B(S).

###### Assumption 1.

Let (Ω, A, P) be a probability space, where P is unknown, let (Z, d_Z) be a compact metric space, and let B(Z) be the Borel σ-algebra on Z. Denote the set of all Borel probability measures on (Z, B(Z)) by M_1(Z, B(Z)). On M_1(Z, B(Z)) we use the Borel σ-algebra B(M_1(Z, B(Z))) and the bounded Lipschitz metric d_BL, see (4.11). Let S be a statistical operator defined on M_1(Z, B(Z)) with values in a complete, separable metric space (W, d_W) equipped with its Borel σ-algebra B(W). Let Z_i, i ∈ ℕ, be independent and identically distributed Z-valued random variables with distribution P. Let S_n := S_n(Z_1, …, Z_n) be a statistic with values in W. Denote the empirical measure of Z_1, …, Z_n by P_n. The statistic S_n is defined via the operator

 S : (M_1(Z, B(Z)), B(M_1(Z, B(Z)))) → (W, B(W)),

where S_n = S(P_n). Denote the distribution of S_n when Z_i ∼ P by L_n(S; P). Accordingly, we denote the distribution of S_n when Z_i ∼ Q by L_n(S; Q).

Efron [9, 10] proposed the bootstrap, whose main idea is to approximate the unknown distribution L_n(S; P) by L_n(S; P_n). Note that these bootstrap approximations are (probability measure-valued) random variables with values in M_1(W, B(W)).
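Efron's resampling idea can be sketched in a few lines of Python; the statistic (the sample median), the data, and the number of bootstrap replicates below are illustrative choices, not taken from the paper:

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot=2000, seed=0):
    """Approximate the sampling distribution L_n(S; P) of a statistic
    by the bootstrap distribution L_n(S; P_n): resample the observed
    data with replacement and recompute the statistic each time."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic([rng.choice(sample) for _ in range(n)])
            for _ in range(n_boot)]

# Observed data: an i.i.d. sample from the unknown distribution P.
data = [1.2, 0.4, 2.3, 1.9, 0.8, 1.5, 2.1, 0.6, 1.1, 1.7]
boot = bootstrap_distribution(data, statistics.median)
```

Each replicate resamples n points from the empirical measure P_n, so the list `boot` is a Monte Carlo approximation of the bootstrap distribution L_n(S; P_n).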

Following [4] we call a sequence of bootstrap approximations qualitatively robust at P ∈ M_1(Z, B(Z)) if the sequence of transformations

 g_n : M_1(Z, B(Z)) → M_1(W, B(W)), g_n(Q) = L(L_n(S; Q_n)), n ∈ ℕ, (2.1)

is asymptotically equicontinuous at P, i.e. if

 ∀ε > 0 ∃δ > 0 ∃n_0 ∈ ℕ: d_BL(Q, P) < δ ⇒ sup_{n≥n_0} d_BL(L(L_n(S; Q_n)), L(L_n(S; P_n))) < ε. (2.2)

Following [4] again, we call a sequence of statistics (S_n)_{n∈ℕ} uniformly qualitatively robust in a neighborhood U(P_0) of P_0 if

 ∃n_0 ∈ ℕ ∀ε > 0 ∀n ≥ n_0 ∃δ > 0 ∀P ∈ U(P_0): d_BL(Q, P) < δ ⇒ d_BL(L_n(S; Q), L_n(S; P)) < ε. (2.3)

The following two results and Theorem 8 in the next section are the main results of this paper.

###### Theorem 2.

If Assumption 1 is valid and if the operator S is uniformly continuous in a neighborhood U(P_0) of P_0, then (S_n)_{n∈ℕ} is uniformly qualitatively robust in U(P_0).

###### Theorem 3.

If Assumption 1 is valid and if (S_n)_{n∈ℕ} is uniformly qualitatively robust in a neighborhood U(P_0) of P_0, then the sequence of bootstrap approximations of S_n is qualitatively robust for P_0.

As an immediate consequence of both theorems given above we obtain

###### Corollary 4.

If Assumption 1 is valid and if S is a continuous operator, then the sequence of bootstrap approximations of S_n is qualitatively robust for all P ∈ M_1(Z, B(Z)).

###### Remark 5.

Theorems 2 and 3 can be considered as a generalization of [4, Thm. 2, Thm. 3], who considered the case of Z being a finite interval and ℝ-valued random variables. In our case, the statistics are W-valued, where W is a complete separable metric space whose dimension can be infinite.

## 3 On Qualitative Robustness of Bootstrap SVMs

In this section we apply the previous results to support vector machines, which belong to the modern class of statistical machine learning methods. That is, we consider the special case that W is the reproducing kernel Hilbert space H used by a support vector machine (SVM). Note that H typically has infinite dimension, which is true, e.g., if the popular Gaussian RBF kernel k(x, x′) = exp(−‖x − x′‖²/γ²), γ > 0, for x, x′ ∈ X, is used.

To state our result on the stability of bootstrap SVMs in Theorem 8 below, we need the following assumptions on the loss function and the kernel.

###### Assumption 6.

Let Z := X × Y be a compact metric space with metric d_Z, where Y ⊂ ℝ is closed. Let L : X × Y × ℝ → [0, ∞) be a loss function such that L is continuous and convex with respect to its third argument and such that L is uniformly Lipschitz continuous with respect to its third argument with uniform Lipschitz constant |L|_1 ∈ (0, ∞), i.e. |L|_1 is the smallest constant such that |L(x, y, t) − L(x, y, t′)| ≤ |L|_1 |t − t′| for all (x, y) ∈ X × Y and all t, t′ ∈ ℝ. Denote the shifted loss function by L⋆(x, y, t) := L(x, y, t) − L(x, y, 0), (x, y, t) ∈ X × Y × ℝ. Let k be a continuous kernel with reproducing kernel Hilbert space H and assume that k is bounded, i.e. ‖k‖_∞ < ∞. Let λ ∈ (0, ∞).

These assumptions can be considered as standard assumptions for stable SVMs, see, e.g., [1] and [15, Chap. 10].

In this paper the RKHS H, the regularizing constant λ, and the loss function L, and thus the shifted loss function L⋆, are fixed. Therefore, we write in the next definition just S and R to shorten the notation.

###### Definition 7.

The SVM operator is defined by

 S(P) := f_{L⋆,P,λ} := argmin_{f∈H} 𝔼_P L⋆(X, Y, f(X)) + λ‖f‖²_H. (3.4)

The SVM risk functional is defined by

 R(P) := 𝔼_P L⋆(X, Y, S(P)(X)) = 𝔼_P L⋆(X, Y, f_{L⋆,P,λ}(X)). (3.5)
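As a hypothetical numerical illustration of Definition 7 (not part of the paper), the following sketch computes the empirical SVM for the absolute-value loss L(x, y, t) = |y − t|, which is Lipschitz with |L|_1 = 1 and whose shift L(x, y, 0) = |y| does not depend on f, so minimizing the shifted and the unshifted empirical objective yields the same minimizer. It uses a Gaussian kernel (so ‖k‖_∞ = 1) and plain subgradient descent on the coefficients of the representer expansion f = Σ_j α_j k(·, x_j); the data, λ, and step size are arbitrary choices:

```python
import math

def gauss_kernel(x, xp, gamma=1.0):
    # Gaussian RBF kernel on the real line; bounded with ||k||_inf = 1.
    return math.exp(-gamma * (x - xp) ** 2)

def fit_svm(xs, ys, lam=1.0, steps=500, lr=0.05):
    """Minimize (1/n) sum_i |y_i - f(x_i)| + lam * ||f||_H^2 over
    f = sum_j alpha_j k(., x_j) by subgradient descent on alpha."""
    n = len(xs)
    K = [[gauss_kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        f = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        # subgradient of t -> |y - t| evaluated at the residuals
        s = [0.0 if ys[i] == f[i] else (1.0 if f[i] > ys[i] else -1.0)
             for i in range(n)]
        # d/d alpha_j of the objective (K is symmetric)
        grad = [sum(K[i][j] * s[i] for i in range(n)) / n
                + 2 * lam * sum(K[j][i] * alpha[i] for i in range(n))
                for j in range(n)]
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha, K

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 0.9, 0.7, 0.2]
alpha, K = fit_svm(xs, ys)
f_vals = [sum(K[i][j] * alpha[j] for j in range(len(xs)))
          for i in range(len(xs))]
```

For these choices the bound (3.6) reads ‖S(P)‖_∞ ≤ λ⁻¹|L|_1‖k‖²_∞ = 1, which the computed function values respect by a wide margin.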

If Assumption 6 is valid, then S(P) is well-defined because f_{L⋆,P,λ} exists and is unique, R(P) is well-defined because the expectation in (3.5) exists and is finite, and it holds, for all P ∈ M_1(Z, B(Z)),

 ‖S(P)‖_∞ ≤ (1/λ)|L|_1‖k‖²_∞ < ∞ and |R(P)| ≤ (1/λ)|L|²_1‖k‖²_∞ < ∞, (3.6)

see [2, Thm. 5, Thm. 6, (17), (18)].
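For orientation, here is a short sketch of how the first bound in (3.6) can be obtained (a standard argument, using the optimality of f := f_{L⋆,P,λ} compared to the zero function, the fact that L⋆(x, y, 0) = 0, the Lipschitz bound |L⋆(x, y, t)| ≤ |L|_1|t|, and (4.32)); the bound on |R(P)| then follows by one more application of the Lipschitz bound:

```latex
\begin{align*}
  \lambda\lVert f\rVert_H^2
    &\le -\,\mathbb{E}_P L^\star(X,Y,f(X))
      && \text{(optimality of $f$ versus $f \equiv 0$)}\\
    &\le |L|_1\,\mathbb{E}_P |f(X)|
      \;\le\; |L|_1\,\lVert f\rVert_\infty
      \;\le\; |L|_1\,\lVert k\rVert_\infty \lVert f\rVert_H,
\end{align*}
```

hence ‖f‖_H ≤ λ⁻¹|L|_1‖k‖_∞ and, by (4.32), ‖f‖_∞ ≤ ‖k‖_∞‖f‖_H ≤ λ⁻¹|L|_1‖k‖²_∞.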

###### Theorem 8.

If the general Assumption 1 and Assumption 6 are valid, then the SVM operator S and the SVM risk functional R fulfill:

1. The sequence of bootstrap SVM estimators of S_n is qualitatively robust for all P ∈ M_1(Z, B(Z)).

2. The sequence of bootstrap SVM risk estimators of R_n is qualitatively robust for all P ∈ M_1(Z, B(Z)).

## 4 Proofs

### 4.1 Proofs of the results in Section 2

For the proofs we need Theorem 9 and Theorem 10 below. To state Theorem 9 on uniform Glivenko-Cantelli classes, we need the following notation. For any metric space (S, d) and any bounded real-valued function f : S → ℝ, we denote the bounded Lipschitz norm of f by

 ‖f‖_BL := sup_{x∈S} |f(x)| + sup_{x,y∈S, x≠y} |f(x) − f(y)| / d(x, y). (4.7)

Let F̃ be a set of measurable functions from (S, B(S)) to (ℝ, B). For any functional G defined on F̃ (such as a signed measure) define

 ‖G‖_F̃ := sup{ |G(f)| : f ∈ F̃ }. (4.8)
###### Theorem 9.

[8, Prop. 12] For any separable metric space (S, d) and any M ∈ (0, ∞),

 F̃_M := { f : (S, B(S)) → (ℝ, B) ; ‖f‖_BL ≤ M } (4.9)

is a universal Glivenko-Cantelli class. It is a uniform Glivenko-Cantelli class, i.e., for all ε > 0,

 lim_{n→∞} sup_{ν∈M_1(S,B(S))} Pr∗( sup_{m≥n} ‖ν_m − ν‖_{F̃_M} > ε ) = 0, (4.10)

if and only if (S, d) is totally bounded. Here, Pr∗ denotes the outer probability.

Note that the term ‖ν_m − ν‖_{F̃_M} in (4.10) equals the bounded Lipschitz metric of the probability measures ν_m and ν if M = 1, i.e.

 ‖ν_m − ν‖_{F̃_1} = sup_{f∈F̃_1} |(ν_m − ν)(f)| = sup_{f: ‖f‖_BL≤1} | ∫ f dν_m − ∫ f dν | =: d_BL(ν_m, ν), (4.11)

see [7, p. 394]. Hence, Theorem 9 can be interpreted as a generalization of [4, Lemma 1, p. 186], which says that if Z is a finite interval, then the empirical measure P_n converges almost surely to P uniformly in P. For various characterizations of Glivenko-Cantelli classes, we refer to [16, Thm. 22] and [6].
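Since a bounded Lipschitz function attaining the supremum in (4.11) only needs to be specified on the support points of two discrete measures (it extends to all of ℝ with the same BL norm), d_BL can be approximated by brute force for toy examples. The following grid search is an illustrative sketch, not an efficient algorithm; the point configuration is arbitrary:

```python
import itertools

def d_bl(points, p, q, grid=41):
    """Approximate d_BL(P, Q) = sup{ |int f dP - int f dQ| : ||f||_BL <= 1 }
    for P = sum_i p_i delta_{x_i}, Q = sum_i q_i delta_{x_i} by grid search
    over the values f(x_i), subject to sup|f| + Lip(f) <= 1."""
    vals = [i / (grid - 1) * 2 - 1 for i in range(grid)]  # grid on [-1, 1]
    best = 0.0
    for f in itertools.product(vals, repeat=len(points)):
        lip = max((abs(f[i] - f[j]) / abs(points[i] - points[j])
                   for i in range(len(points)) for j in range(i)),
                  default=0.0)
        if max(abs(v) for v in f) + lip <= 1.0:
            best = max(best, abs(sum(fv * (pi - qi)
                                     for fv, pi, qi in zip(f, p, q))))
    return best

# Two point masses at distance 2.
d = d_bl([0.0, 2.0], [1.0, 0.0], [0.0, 1.0])
```

For two point masses at distance d the optimal f takes the values ±d/(d + 2), so d_BL(δ_0, δ_2) = 2·2/(2 + 2) = 1, which the grid search recovers up to its resolution.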

We next list the other main result we need for the proof of Theorem 8. This result is an analogue of the famous Strassen theorem for the bounded Lipschitz metric instead of the Prohorov metric.

###### Theorem 10.

[13, Thm. 4.2, p. 30] Let (S, d) be a Polish space and let ε > 0. Let d_BL be the bounded Lipschitz metric defined on the set of all Borel probability measures on S, and let μ and ν be two such measures. Then the following two statements are equivalent:

1. There are random variables X with distribution μ and Y with distribution ν such that 𝔼[d(X, Y)] ≤ ε.

2. d_BL(μ, ν) ≤ ε.

• We closely follow the proof of [4, Thm. 2]. However, we use Theorem 9 instead of their Lemma 1 and we use [3, Lem. 1] instead of [12, Lem. 1].

Let P_n denote the set of empirical distributions of order n, i.e.

 P_n := { P_n ∈ M_1(Z, B(Z)) ; ∃ (z_1, …, z_n) ∈ Z^n such that P_n = (1/n) Σ_{i=1}^n δ_{z_i} }, (4.12)

n ∈ ℕ. If misunderstandings are unlikely, we identify P_n with its set of atoms.

It is enough to show that

 ∀ε > 0 ∃δ > 0 ∃n_0 ∈ ℕ (4.13)

such that for all P ∈ U(P_0) and for all n ≥ n_0 and for all Q_n, Q̃_n ∈ P_n we have

 d_BL(Q_n, Q̃_n) < δ ⇒ d_W(S(Q_n), S(Q̃_n)) < ε. (4.14)

From this we obtain that (S_n)_{n∈ℕ} is uniformly qualitatively robust by [3, Lem. 1].

Let ε > 0. Since the operator S is uniformly continuous in U(P_0) we obtain

 ∃δ_0 > 0 ∀P ∈ U(P_0): d_BL(P, Q) < δ_0 ⇒ d_W(S(P), S(Q)) < ε/2. (4.15)

Hence by Theorem 9 for the special case M = 1 and by (4.11), we get

 ∃n_0 ∈ ℕ: sup_{P∈U(P_0)} Pr∗( sup_{n≥n_0} d_BL(P_n, P) < δ_0 ) > 1 − ε. (4.16)

For n ∈ ℕ and P ∈ M_1(Z, B(Z)), define

 E_{n,P} := { Q_n ∈ P_n : d_BL(Q_n, P) < δ_0/2 }. (4.17)

It follows that Q_n ∈ E_{n,P} together with d_BL(Q_n, Q̃_n) < δ_0/2 implies that

 d_BL(Q_n, P) < δ_0/2 and d_BL(Q̃_n, P) < δ_0.

The triangle inequality thus yields, due to (4.15),

 d_W(S(Q_n), S(Q̃_n)) ≤ d_W(S(Q_n), S(P)) + d_W(S(P), S(Q̃_n)) < ε, (4.18)

from which the assertion follows.

• The proof mimics the proof of [4, Thm. 3], but uses Theorem 9 instead of [4, Lem. 1].

Fix ε > 0 and P_0 ∈ M_1(Z, B(Z)). By the uniform qualitative robustness of (S_n)_{n∈ℕ} in U(P_0), there exist n ∈ ℕ and δ > 0 such that

 d_BL(Q, P) < δ ⇒ sup_{m≥n} sup_{P∈U(P_0)} d_BL(L_m(S; Q), L_m(S; P)) < ε. (4.19)

Define δ_1 := δ/2. Due to Theorem 9 for the special case M = 1 and by (4.11), we have, for all ε > 0,

 lim_{n→∞} sup_{P∈M_1(Z,B(Z))} Pr∗( sup_{m≥n} d_BL(P_m, P) > ε ) = 0. (4.20)

Hence (4.19) and Varadarajan’s theorem on the almost sure convergence of empirical measures to a Borel probability measure defined on a separable metric space, see e.g. [7, Thm. 11.4.1, p. 399], yield for the empirical distributions Q_n of Q and P_{0,n} of P_0 that

 ∃n_1 ≥ n ∀n ≥ n_1: d_BL(Q, P_0) < δ_1 ⇒ d_BL(Q_n, P_{0,n}) < δ almost surely. (4.21)

It follows from the uniform qualitative robustness of (S_n)_{n∈ℕ}, see (4.19), that

 ∃n_1 ∈ ℕ ∀ε > 0 ∀n ≥ n_1 ∃δ > 0 ∀P ∈ U(P_0): d_BL(Q, P) < δ ⇒ d_BL(L_n(S; Q_n), L_n(S; P_{0,n})) < ε almost surely. (4.22)

For notational convenience, we write for the sequences of bootstrap estimators

 ξ_{1,n} := L_n(S; Q_n), ξ_{2,n} := L_n(S; P_{0,n}), n ∈ ℕ. (4.23)

Note that ξ_{1,n} and ξ_{2,n} are (measure-valued) random variables with values in the set M_1(W, B(W)). We denote the distribution of ξ_{i,n} by L(ξ_{i,n}) for i ∈ {1, 2} and n ∈ ℕ. Hence (4.22) yields

 d_BL(ξ_{1,n}, ξ_{2,n}) < ε almost surely for all n ≥ n_1, (4.24)

and it follows

 𝔼[d_BL(ξ_{1,n}, ξ_{2,n})] ≤ ε for all n ≥ n_1. (4.25)

Now an application of an analogue of Strassen’s theorem, see Theorem 10, yields

 d_BL(L(ξ_{1,n}), L(ξ_{2,n})) ≤ ε for all n ≥ n_1, (4.26)

which completes the proof, because

 L(ξ_{1,n}) = L(L_n(S; Q_n)) and L(ξ_{2,n}) = L(L_n(S; P_{0,n})). (4.27)

### 4.2 Proofs of the results in Section 3

• Proof of part (i). By assumption, Z = X × Y is a compact metric space, where Y ⊂ ℝ is closed. Let B(Z) be the Borel σ-algebra on Z. It is well-known that the bounded Lipschitz metric d_BL metrizes the weak topology on the space M_1(Z, B(Z)), see [7, Thm. 11.3.3], and that (M_1(Z, B(Z)), d_BL) is a compact metric space if and only if (Z, d_Z) is a compact metric space, see [14, p. 45, Thm. 6.4]. From the compactness of (M_1(Z, B(Z)), d_BL), it of course follows that this metric space is separable and totally bounded, see [5, Thm. 1.4.26].

Under the assumptions of the theorem we have, for every fixed λ ∈ (0, ∞), that the SVM operator S : M_1(Z, B(Z)) → H, S(P) = f_{L⋆,P,λ}, is well-defined because the minimizer exists and is unique, see [2, Thm. 5, Thm. 6], and is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the norm topology on H, see [11, Thm. 3.3, Cor. 3.4]. There it was also shown that the risk functional R : M_1(Z, B(Z)) → ℝ is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the standard topology on ℝ. Because (M_1(Z, B(Z)), d_BL) is a compact metric space, the operators S and R are therefore even uniformly continuous on the whole space with respect to the mentioned topologies, see [5, Prop. 1.5.9].

Because the reproducing kernel Hilbert space H is a Hilbert space, H is complete. Furthermore, because the input space X is separable and the kernel k is continuous, the RKHS H is also separable, see [15, Lem. 4.33]. Therefore, Theorem 2 yields that the sequence of H-valued statistics

 S_n((X_1, Y_1), …, (X_n, Y_n)) = argmin_{f∈H} (1/n) Σ_{i=1}^n L⋆(X_i, Y_i, f(X_i)) + λ‖f‖²_H, n ∈ ℕ, (4.28)

is uniformly qualitatively robust in a neighborhood U(P_0) for every probability measure P_0 ∈ M_1(Z, B(Z)). Now we apply Theorem 3, which yields that the sequence of bootstrap SVM estimators of S_n is qualitatively robust for all P ∈ M_1(Z, B(Z)), which gives the first assertion of the theorem.

Proof of part (ii). The proof consists of two steps. In Step 1 the continuity of the SVM risk functional R will be shown. In Step 2, Theorems 2 and 3 will be used to show that the sequence R_n, n ∈ ℕ, of bootstrap SVM risk estimators is qualitatively robust.

Step 1. We will first show that the SVM risk functional R is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the standard topology on ℝ.

As mentioned in part (i), the assumption that Z is a compact metric space implies that (M_1(Z, B(Z)), d_BL) is a compact metric space and hence this space is separable and totally bounded.

Under the assumptions of the theorem, the SVM operator S, S(P) = f_{L⋆,P,λ}, is well-defined because the minimizer exists and is unique for all P ∈ M_1(Z, B(Z)) and for all λ ∈ (0, ∞), see [2, Thm. 5, Thm. 6]. Furthermore, S is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the norm topology on H, see [11, Thm. 3.3]. Hence the function

 g_P : X × Y → ℝ, g_P(x, y) := L⋆(x, y, S(P)(x)) = L⋆(x, y, f_{L⋆,P,λ}(x)) (4.29)

is well-defined. Because the kernel k is bounded and continuous, all functions f ∈ H, and hence in particular S(P), are continuous, see e.g. [15, Lem. 4.28, Lem. 4.29]. Hence the function g_P is continuous (with respect to d_Z), because the loss function L and hence the shifted loss function L⋆ are continuous. Furthermore, the function g_P is bounded, because Z = X × Y is by assumption a compact metric space, the Lipschitz continuous loss function L maps from X × Y × ℝ to [0, ∞), and ‖S(P)‖_∞ < ∞, see [2, p. 314, (17)]. Hence g_P is a bounded continuous function on Z. Because the bounded Lipschitz metric metrizes the weak topology on M_1(Z, B(Z)), it follows that

 ∀ε_1 > 0 ∃δ_1 > 0: d_BL(Q, P) < δ_1 ⟹ | ∫ g_P dQ − ∫ g_P dP | < ε_1. (4.30)

Recall that S is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the norm topology on H, see [11, Thm. 3.3]. Hence

 ∀ε_2 > 0 ∃δ_2 > 0: d_BL(Q, P) < δ_2 ⟹ ‖S(Q) − S(P)‖_H < ε_2. (4.31)

Fix ε > 0. Define

 ε_1 := ε/3 and ε_2 := ε/(3|L|_1‖k‖_∞).

Using the triangle inequality, the definition of the shifted loss function L⋆, the definition of the function g_P in (4.29), the Lipschitz continuity of L, and the well-known formula

 ‖f‖_∞ ≤ ‖k‖_∞‖f‖_H, f ∈ H, (4.32)

see e.g. [15, p. 124], we obtain that d_BL(Q, P) < min{δ_1, δ_2} implies

 |R(Q) − R(P)|
 = | ∫ L⋆(x, y, S(Q)(x)) dQ(x, y) − ∫ L⋆(x, y, S(P)(x)) dP(x, y) |
 ≤ | ∫ L⋆(x, y, S(Q)(x)) dQ(x, y) − ∫ L⋆(x, y, S(P)(x)) dQ(x, y) | + | ∫ L⋆(x, y, S(P)(x)) dQ(x, y) − ∫ L⋆(x, y, S(P)(x)) dP(x, y) |
 ≤ ∫ |L(x, y, S(Q)(x)) − L(x, y, S(P)(x))| dQ(x, y) + | ∫ g_P dQ − ∫ g_P dP | (4.35)
 ≤ |L|_1 ‖S(Q) − S(P)‖_∞ + ε_1 (4.36)
 ≤ |L|_1 ‖k‖_∞ ‖S(Q) − S(P)‖_H + ε_1 (4.37)
 ≤ |L|_1 ‖k‖_∞ ε_2 + ε_1 = (2/3) ε < ε. (4.38)

Hence, R is continuous with respect to the combination of the weak topology on M_1(Z, B(Z)) and the standard topology on ℝ.

Step 2. Because (M_1(Z, B(Z)), d_BL) is a compact metric space and the risk functional R is continuous, R is even uniformly continuous with respect to the mentioned topologies, see [5, Prop. 1.5.9]. Obviously, ℝ is a complete separable metric space. Therefore, Theorem 2 yields that the sequence of ℝ-valued statistics

 R_n((X_1, Y_1), …, (X_n, Y_n)) = (1/n) Σ_{i=1}^n L⋆(X_i, Y_i, f_{L⋆,D,λ}(X_i)), n ∈ ℕ,

where D denotes the empirical measure of ((X_1, Y_1), …, (X_n, Y_n)), is uniformly qualitatively robust in a neighborhood U(P_0) for every probability measure P_0 ∈ M_1(Z, B(Z)). Now we apply Theorem 3, which yields that the sequence of bootstrap SVM risk estimators of R_n is qualitatively robust for all P ∈ M_1(Z, B(Z)), which completes the proof.

## References

• [1] A. Christmann and I. Steinwart. Consistency and robustness of kernel based regression. Bernoulli, 13:799–819, 2007.
• [2] A. Christmann, A. Van Messem, and I. Steinwart. On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2:311–327, 2009.
• [3] A. Cuevas. Qualitative robustness in abstract inference. J. Statist. Plann. Inference, 18:277–289, 1988.
• [4] A. Cuevas and J. Romo. On robustness properties of bootstrap approximations. J. Statist. Plann. Inference, 37:181–191, 1993.
• [5] Z. Denkowski, S. Migórski, and N. Papageorgiou. An introduction to nonlinear analysis: Theory. Kluwer Academic Publishers, Boston, 2003.
• [6] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, Cambridge, 1999.
• [7] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, 2002.
• [8] R. M. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. J. Theor. Prob., 4:485–510, 1991.
• [9] B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26, 1979.
• [10] B. Efron. The Jackknife, the Bootstrap, and Other Resampling Plans, volume 38. CBMS Monograph, Society for Industrial and Applied Mathematics, Philadelphia, 1982.
• [11] R. Hable and A. Christmann. Qualitative robustness of support vector machines. Journal of Multivariate Analysis, 102:993–1007, 2011.
• [12] F. R. Hampel. A general qualitative definition of robustness. Ann. Math. Statist., 42:1887–1896, 1971.
• [13] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
• [14] K. R. Parthasarathy. Probability Measures on Metric Spaces. Academic Press, New York, 1967.
• [15] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
• [16] M. Talagrand. The Glivenko-Cantelli problem. Ann. Probability, 15:837–870, 1987.