# Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability

We study a statistical method to estimate the optimal value, and the optimality gap of a given solution for stochastic optimization as an assessment of the solution quality. Our approach is based on bootstrap aggregating, or bagging, resampled sample average approximation (SAA). We show how this approach leads to valid statistical confidence bounds for non-smooth optimization. We also demonstrate its statistical efficiency and stability that are especially desirable in limited-data situations, and compare these properties with some existing methods. We present our theory that views SAA as a kernel in an infinite-order symmetric statistic, which can be approximated via bagging. We substantiate our theoretical findings with numerical results.

## Authors

• 21 publications
• 3 publications
• ### Admissibility of solution estimators for stochastic optimization

We look at stochastic optimization problems through the lens of statisti...
01/21/2019 ∙ by Amitabh Basu, et al. ∙ 0

• ### Sample Complexity of Sample Average Approximation for Conditional Stochastic Optimization

In this paper, we study a class of stochastic optimization problems, ref...
05/28/2019 ∙ by Yifan Hu, et al. ∙ 0

• ### Optimal Convergence for Stochastic Optimization with Multiple Expectation Constraints

In this paper, we focus on the problem of stochastic optimization where ...
06/08/2019 ∙ by Kinjal Basu, et al. ∙ 0

• ### A Kernel Mean Embedding Approach to Reducing Conservativeness in Stochastic Programming and Control

We apply kernel mean embedding methods to sample-based stochastic optimi...
01/28/2020 ∙ by Jia-Jie Zhu, et al. ∙ 0

• ### CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing

Deep learning applications require optimization of nonconvex objective f...
05/29/2020 ∙ by Oleksandr Borysenko, et al. ∙ 0

• ### Inference by Stochastic Optimization: A Free-Lunch Bootstrap

Assessing sampling uncertainty in extremum estimation can be challenging...
04/20/2020 ∙ by Jean-Jacques Forneron, et al. ∙ 0

• ### Data-Pooling in Stochastic Optimization

Managing large-scale systems often involves simultaneously solving thous...
06/01/2019 ∙ by Vishal Gupta, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a stochastic optimization problem

 Z∗=minx∈X{Z(x)=EF[h(x,ξ)]} (1)

where is generated under some distribution , and denotes its expectation. We focus on the situations where is not known, but instead a collection of i.i.d. data for , say , are available. Obtaining a good solution for (1) under this setting has been under active investigation both from the stochastic and the optimization communities. Common methods include the sample average approximation (SAA) [46, 28], stochastic approximation (SA) or gradient descent [29, 8, 39], and (distributionally) robust optimization [14, 4, 55, 3]. These methods aim to find a solution that is nearly optimal, or in some way provide a safe approximation. Applications of the generic problem (1) and its data-driven solution techniques span from operations research, such as inventory control, revenue management, portfolio selection (see, e.g., [46, 5]

) to risk minimization in machine learning (e.g.,

[23]).

This paper concerns the estimation of using limited data. Moreover, given a solution, say , a closely related problem is to estimate the optimality gap

 G(^x)=Z(^x)−Z∗ (2)

This allows us to assess the quality of , in the sense that the smaller is, the closer is the solution to the true optimum in terms of achieved objective value. More precisely, we will focus on inferring a lower confidence bound for , and, correspondingly, an upper bound for - noting that its first term can be treated as a standard population mean of that is estimable using a sample independent of the given , or that can be represented as the max of the expectation of whose estimation is structurally the same as .

This problem is motivated by the fact that many state-of-the-art solution methods mentioned before are only amenable to crude, worst-case performance bounds. For instance, [48] and [28]

provide large deviations bounds on the optimality gap of SAA in terms of the diameter or cardinality of the decision space and the maximal variance of the function

. [39] and [24]

provide bounds on the expected value and deviation probabilities of the SA iterates in terms of the strong convexity parameters, space diameter and maximal variance. These bounds can be refined under additional structural information (e.g.,

[47]). While they are very useful in understanding the behaviors of the optimization procedures, using them as a precise assessment on the quality of an obtained solution may be conservative. Because of this, a stream of work study approaches to validate solution performances by statistically bounding optimality gaps. [37, 1, 35, 45] investigate the use of SAA to estimate these bounds, [32] validate the performances of SA iterates by using convexity conditions. [49, 42] study approaches like the jackknife and probability metric minimization to reduce the bias in the resulting gap estimates. [2] utilize gap estimates to guide sequential sampling. [17, 6, 31] investigate the use of empirical and profile likelihoods to estimate optimal values. Our investigation in this paper follows the above line of work on solution validation, focusing on the situation when data are limited and hence the statistical efficiency becomes utmost important. We also point out a related series of work that validate feasibility under uncertain constraints (e.g., [36, 40, 54, 12, 11]), though their problem of interest is beyond the scope of this paper, as we focus on deterministically constrained problems and objective value performances.

More precisely, we introduce a bootstrap aggregating, or commonly known as bagging [9], approach to estimate a lower confidence bound for . This comprises repeated resampling of data to construct SAAs, and ultimately averaging the resampled optimal SAA values. We demonstrate how this approach applies under very general conditions on the cost function and decision space , while enjoys high statistical efficiency and stability. Compared to procedures based on batching (e.g., [37]

), which also have documented benefits in wide applicability and stability, the data recycling in our approach breaks free a tradeoff between the tightness of the resulting bound and the statistical accuracy/correctness exhibited by batching. In cases where sufficient smoothness is present and central limit theorem (CLT) for SAA (e.g.,

[46, 1]

) can be directly applied, we also see that our approach gains stability regarding standard error estimation, thanks to the smoothing effect brought by bagging. Nonetheless, our approach generally requires higher computational load than these previous methods due to the need to solve many resampled programs. While we focus primarily on statistical performances, towards the end of this paper we will discuss some computational implications.

The theoretical justification of our bagging scheme comes from viewing SAA as a kernel in an infinite-order symmetric statistic [22], and an established optimistic bound for SAA as its asymptotic limit. A symmetric statistic is a generalization of sample mean in which each summand consists of a function (i.e., kernel) acting on more than one observation [44]

. In particular, the size of the SAA program can be seen as precisely the kernel “order” (or “degree”), which depends on the data size and is consequently of an infinite-order nature. Our bagging scheme serves as a Monte Carlo approximation for this symmetric statistic. As a main methodological contribution, we analyze the asymptotic behaviors of the statistic and the resulting bounds as the SAA size grows, and translate them into efficient performances of our bagging scheme. Finally, we note that the notion of infinite-order symmetric statistics has been used in analyzing ensemble machine learning predictors like random forests

[52]; our SAA kernels are, from this view, in parallel to the base learners in the latter context.

Finally, we mention that [21]

has also studied the resampling of SAA programs to construct confidence intervals for the optimal values of stochastic programs. Our approach connects with, but also differs substantially from

[21] in several regards. In terms of scope of applicability, [21]

focuses on mixed-integer linear programs, while we consider cost functions that can be generally non-Donsker. However, we instead require an additional “non-degeneracy” condition that depends on the cost function and the underlying probability distribution. In terms of methodology,

[21]

utilizes the quantiles of the resampled distribution to generate confidence intervals, by observing the same limiting distribution between an original CLT and the bootstrap CLT. The resampling in

[21] requires a “two-layer” extended bootstrap where each resample is drawn from a new sample of the true distribution (as opposed to some bootstrap methods that allows repeated resample from the same original sample, with the availability of a conditional bootstrap CLT). Thus the approach requires substantial data size or otherwise resorting to subsampling. Our bagging approach, in contrast, is based on a direct use of Gaussian limit and standard error estimation in the CLT for the optimistic bound. Our burden lies on the bootstrap size requirement to obtain consistent standard error estimate, and less on the data size requirement.

We summarize our contributions of this paper as follows:

1. Motivated from the challenges of existing techniques (Section 2), we introduce a bagging procedure to estimate a lower confidence bound for , correspondingly an upper confidence bound for (Section 3). We present the idea of our procedure that views SAA as a kernel in a symmetric statistic, and an optimistic bound for SAA as its associated limiting quantity (Section 4).

2. We analyze the asymptotic behaviors of the infinite-order symmetric statistic generated from the SAA kernel, under minimal smoothness requirements on the optimization problem. Moreover, when smoothness conditions are introduced, we demonstrate how these behaviors recover the classical CLT on SAA. These results are presented in Section 5. The mathematical developments without smoothness conditions utilize a combination of probabilistic coupling arguments and a new hypergeometric representation associated with the Hajek projection [50] (Appendices A.1 and A.2). The developments to recover the classical CLT use another analysis-of-variance (ANOVA) decomposition and a maximal deviation bound for empirical processes (Appendix A.3).

3. Building on the above results, we demonstrate how the bounds generated from our bagging procedure exhibit asymptotically correct coverages, and improve a tradeoff between the bound tightness and the statistical accuracy in existing batching schemes. This efficiency gain can be seen by an asymptotic comparison of the standard error in our estimator and an interpretation using conditional Monte Carlo. These developments are in Sections 6 and 7, with mathematical details in Appendices A.4-A.8.

4. We explain the stability in our generated bounds brought by the smoothing effect of bagging in estimating standard error. This compares favorably with the direct use of CLT in situations where the objective function is smooth. This property is supported by our numerical experiments (Section 8).

## 2 Existing Challenges and Motivation

We discuss some existing methods and their challenges, to motivate our investigation. We start the discussion with the direct use of asymptotics from sample average approximation (SAA).

### 2.1 Using Asymptotics of Sample Average Approximation

When the cost function in (1) is smooth enough, it is known classically that a central limit theorem (CLT) governs the behavior of the estimated optimal value in SAA, namely

 ^Zn=minx∈X1nn∑i=1h(x,ξi). (3)

We first introduce the following Lipschitz condition:

###### Assumption 1 (Lipschitz continuity in the decision).

The cost function is Lipschitz continuous with respect to , in the sense that

 |h(x1,ξ)−h(x2,ξ)|≤M(ξ)∥x1−x2∥

for any , where satisfies .

Denote “” as convergence in distribution. The following result is taken from [46]:

###### Theorem 1 (Extracted from Theorem 5.7 in [46]).

Suppose that Assumption 1 holds, for some point , and is compact. Given i.i.d. data , consider the SAA problem (3). The SAA optimal value satisfies

 √n(^Zn−Z∗)⇒infx∈X∗Y(x) (4)

where is the set of optimal solutions for (1), and is a centered Gaussian process on that has a covariance structure defined by between any .

Roughly speaking, Theorem 1 stipulates that, under the depicted conditions, one can use (4) to obtain

 ^Zn−^q√n (5)

as a valid lower confidence bound for (and analogously for given ), where is some suitable error term that captures the quantile of the limiting distribution in (4). Indeed, in the case of estimating , [1] provides an elegant argument that shows that, to achieve confidence, one can take where is the standard normal critical value and

is a standard deviation estimate, regardless of whether the limit in (

4

) is a Gaussian distribution (or in other words the solution is unique).

[1] calls this the single-replication procedure. More precisely, is obtained from

 ^σ2=1n−1n∑i=1(h(^x,ξi)−h(^x∗n,ξi)−(¯h(^x)−¯h(^x∗n)))2

where is the solution from (3), and .

Though Theorem 1 (and other related work, e.g., [15, 28]) is very useful, there are at least two reasons why one would need more general methods:

1. When the decision space contains discrete elements (e.g., combinatorial problems), Assumption 1 does not hold anymore. There is no guarantee in using the bound (5), i.e., it may still be correct but conservative, or it may simply possess incorrect coverages. We note, however, that for some class of problems (e.g., two-stage mixed-integer linear programs), extensions to Theorem 1 and approaches such as quantile-based bootstrapping (e.g., [21]) are useful alternatives.

2. If the SAA solutions have a “jumping” behavior, namely that program (1) has several near-optimal solutions with hugely differing objective variances, then the standard deviation estimate needed in the bound (5) can be unreliable. This is because depends heavily on , which can fall close to any of the possible near-optimal solutions with substantial chance and make the resulting estimation noisy. This issue is illustrated in, e.g., Examples 1 and 2 in [1].

We should also mention that, as an additional issue, the bias in relative to can be quite large in any given problem, i.e., arbitrarily close to order described in the CLT, even if all the conditions in Theorem 1 hold [41]. Note that this bias is in the optimistic direction (i.e., the resulting bound is still correct, but conservative), and it also appears in the “optimistic bound” approach that we discuss next. There have been techniques such as jackknife [41, 42] and probability metric minimization [49] in reducing this bias effect.

### 2.2 Batching Procedures

An alternate approach is to use the optimistic bound [37, 45, 25]

 E[^Zn]≤Z∗ (6)

where in (6) is taken with respect to the data in constructing the SAA value . The bound (6) holds for any , as a direct consequence from exchanging the expectation and the minimization operator in the SAA, and holds as long as are i.i.d.

The bound (6) offers a simple way to construct a lower bound for under great generality. Note that the left hand side of (6) is a mean of SAA. Thus, if one can “sample” a collection of SAA values, then a lower confidence bound for can be constructed readily by using a standard estimate of population mean. To “sample” SAA values, an approach suggested by [37] is to batch the i.i.d. data set into say batches, each batch consisting of observations, so that (we ignore rounding issues). For each , solve an SAA using the observations in the -th batch; call this value . Then use

 ~Zk−z1−α~σ√m (7)

where and are the sample mean and variance from , and is the -level standard normal quantile.

The bound (7) does not rely on any continuity of , and is simply the sample standard deviation for a sample mean. In these regards, the bound largely circumvents the two concerns described before.

Nonetheless, there is an intrinsic tradeoff between tightness and statistical accuracy in this batching approach. On one hand, must be chosen big enough (e.g., roughly ) so that one can use the CLT to justify the approximation (7). Moreover, the larger is , typically the smaller is the magnitude of the standard error in the second term of (7). On the other hand, the larger is , the closer is to in (6), leading to a tighter lower bound for . This is thanks to a monotonicity property in that is non-decreasing in [37]. Therefore, there is a tradeoff between the statistical accuracy controlled by (in terms of the validity of the CLT and the magnitude of the standard error term) and the tightness controlled by (in terms of the position of in (6)). In the batching or the so-called multiple-replication approach of [37], this tradeoff is confined to the relation . There have been suggestions to improve this tradeoff, e.g., by using overlapping batches [35, 34], but their validity requires uniqueness or exponential convergence of the solution (e.g., in discrete decision space).

### 2.3 Motivation and Overview of Our Approach

Thus, in general, when the sample size is small, the batching approach appears to necessarily settle for a conservative bound in order to retain statistical validity/accuracy. The starting motivation for the bagging procedure that we propose next is to break free this tightness-accuracy tradeoff. In particular, we offer a bound roughly in the form

 Zbagk−qbag√n (8)

where is a point estimate obtained from bagging many resampled SAA values, and signifies the size of the resampled SAA (i.e., the “bags”). The quantity relies on a standard deviation estimate of . Our method operates at a similar level of generality as batching and handles the two concerns Points 1 and 2 in Section 2.1: The estimate does not succumb to the “jumping” solution behavior, and the bound holds regardless of the continuity to the decision. Moreover, compared to the batching bound (7), our bound has a standard error term shrunk to order from (and relies on an asymptotic on , not ), thus gaining higher statistical precision. In fact, this term regains the same order of precision level as the bound (5) that uses SAA asymptotics directly.

On the other hand, we will show that the choice of in (8), which affects the tightness, can be taken as roughly in general. Compared with the direct-CLT bound (5), our bound appears less tight. However, we have gained estimation stability of and, moreover, we consider conditions more general than when (5) is applicable. We will see that if we re-impose Lipschitz continuity on the decision (i.e., Assumption 1), then can be set arbitrarily close to the order of . This means that our approach is almost as statistically efficient as the bound (5), with the extra benefit of stability in estimating .

Nonetheless, we point out that our approach requires solving resampled SAA programs many times, and is thus computationally more costly than batching and direct-CLT methods. The higher computation cost is the price to pay to elicit our benefits depicted above. Our approach is thus most recommended when statistical performance is of higher concern than computation efficiency, prominently in small-sample situations.

The next section will explain our procedure in more detail. A key insight is to view SAA as a symmetric kernel and the optimistic bound (6) as a limiting quantity of an associated symmetric statistic, which can be estimated by bagging. On a high level, the stability in estimating the standard error can be attributed to the nature of bagging as a smoother [10, 19].

## 3 Bagging Procedure to Estimate Optimal Values

This section presents our approach. Instead of batching the data, we uniformly resample observations from for many, say , times. We use each resample to form an SAA problem and solve it. We then average all these resampled SAA optimal values. The resampling can be done with or without replacement (we will discuss some differences between the two). We summarize our procedure in Algorithm 1.

In the output of Algorithm 1, the first term is the average of many bootstrap resampled SAA values, which resembles a bagging predictor by viewing each SAA as a “base learner” [9]. The quantity in (10) is the covariance between the count of a specific observation in a bootstrap resample, denoted , and the resulting resampled SAA value . The quantity is an empirical version of the so-called infinitesimal jackknife (IJ) estimator [19], which has been used to estimate the standard deviation of bagging schemes, including in random forests or tree ensembles [53]. The additional constant factor in the second line of (9) is a finite-sample correction specific to resampling without replacement. Although the IJ variance estimator is not affected by this factor in the asymptotic regime that we will discuss, we find it significantly improves the finite-sample performance of our method.

## 4 SAA as Symmetric Kernel

We explain how Algorithm 1 arises. In short, the in Algorithm 1 acts as a point estimator for in the optimistic bound (6), whereas captures the standard error in using this point estimator.

To be more precise, let us introduce a functional viewpoint and write

 Wk(F)=EFk[Hk(ξ1,…,ξk)] (11)

where

 Hk(ξ1,…,ξk)=minx∈X1kk∑i=1h(x,ξi)

is the SAA value, expressed more explicitly in terms of the underlying data used. Here, the expectation is generated with respect to i.i.d. variables , i.e., denotes the product measure of ’s. For convenience, we denote as the expectation either with respect to or the product measure of ’s when no confusion arises. Also, we denote .

With these notations, the optimistic bound (6) can be expressed as

 Wk(F)≤Z∗

with the best bound being thanks to the monotonicity property of the expected SAA value mentioned before.

Suppose that we have used sampling with replacement in Algorithm 1. Also say we use infinitely many bootstrap replications, i.e., . Then, the estimator in Algorithm 1 becomes precisely

 ~Zbagk=Wk(^F)

where is the empirical distribution formed by , i.e., where is the delta measure at . If is “smooth” in some sense, then one would expect to be close to . Indeed, when is fixed, , which is expressible as the -fold expectation under in (11), is multi-linear, i.e.,

 Wk(F)=EFk[Hk(ξ1,…,ξk)]=∫⋯∫Hk(ξ1,…,ξk)k∏j=1dF(ξj)

and is always differentiable with respect to (in the Gateaux sense) from the theory of von Mises statistical functionals [44]. This ensures that is close to probabilistically, as elicited by a CLT (Theorem 2 below).

Note that is exactly the average of over all possible combinations of drawn with replacement from . This is equivalent to

 Vn,k=1nk∑ij∈{1,…,n},j=1,…,kHk(ξi1,…,ξik) (12)

which is the so-called -statistic. If we have used sampling without replacement in Algorithm 1, we arrive at the estimator (assuming again )

 Un,k=1(nk)∑(i1,…,ik)∈CkHk(ξi1,…,ξik) (13)

where denotes the collection of all subsets of size in . The quantity (13) is known as the -statistic. The and estimators in (12) and (13) both belong to the class of symmetric statistics [44, 50, 13], since the estimator is unchanged against a shuffling of the ordering of the data . Correspondingly, the function is known as the symmetric kernel. Symmetric statistics generalize the sample mean, the latter corresponding to the case when .

When , then and above are approximated by a random sampling of the summands on the right hand side of (12) and (13). These are known as incomplete - and -statistics [33, 7, 27], and are precisely our . As is chosen large enough, will well approximate and .

To discuss further, we make the following assumptions:

###### Assumption 2 (L2-boundedness).

We have

 Esupx∈X|h(x,ξ)|2<∞

Denote . Denote as the variance under .

###### Assumption 3 (Finite non-zero variance).

We have .

We have the following asymptotics of and :

###### Theorem 2.

Suppose is fixed, and Assumptions 2 and 3 hold. Then

 √n(Un,k−Wk)⇒N(0,k2Var(gk(ξ))) (14)

and

 √n(Vn,k−Wk)⇒N(0,k2Var(gk(ξ))) (15)

as , where

is a normal distribution with mean 0 and variance

.

###### Proof.

Assumption 2 implies that for any (possibly identical) indices , since

 EHk(ξi1,…,ξik)2≤1k2Esupx∈X(k∑j=1h(x,ξij))2≤Esupx∈X|h(x,ξ)|2<∞ (16)

by the Minkowski inequality. Then, under (16) and Assumption 3, (14) follows from Theorem 12.3 in [50], and (15) follows from Section 5.7.3 in [44]. ∎

Theorem 2 is a consequence of the classical CLT for symmetric statistics. The expression , as a function defined on the space , is the so-called influence function of , which can be viewed as its functional derivative with respect to [26]. Alternately, for a -statistic , the expression is the so-called Hajek projection [50], which is the projection of the statistic onto the subspace generated by the linear combinations of and any measurable function . It turns out that these two views coincide, and the - and -statistics (whose approximation uses the projection viewpoint and the functional derivative viewpoint respectively) obey the same CLT as depicted in Theorem 2.

The output of Algorithm 1 is now evident given Theorem 2. When , is precisely under sampling without replacement or under sampling with replacement. The quantity in Algorithm 1, an empirical IJ estimator, can be shown to approximate the asymptotic variance as , by borrowing recent results in bagging [19, 52] (Theorems 10 and 11 below show stronger results). Then the procedural output is the standard CLT-based lower confidence bound for .

The discussion above holds for a fixed , the sample size used in the resampled SAA. It also shows that, at least asymptotically, using with or without replacement does not matter. However, using a fixed regardless of the size of is restrictive and leads to conservative bounds. The next subsection will relax this requirement and present results on a growing against , which in turn allows us to get a tighter in the optimistic bound (6).

## 5 Asymptotic Behaviors with Growing Resample Size

We first make the following strengthened version of Assumption 2:

###### Assumption 4 (L2+δ-bounded modulus of continuity).

We have

 Esupx∈X|h(x,ξ)−h(x,ξ′)|2+δ<∞

where are i.i.d. generated from .

Assumption 4 holds quite generally, for instance under the following sufficient conditions:

###### Assumption 5 (Uniform boundedness).

is uniformly bounded over .

###### Assumption 6 (Uniform Lipschitz condition).

is Lipschitz continuous with respect to , where the Lipschitz constant is uniformly bounded in , i.e.,

 |h(x,ξ)−h(x,ξ′)|≤L∥ξ−ξ′∥

where is some norm in . Moreover, .

###### Assumption 7 (Majorization).
 |h(x,ξ)−h(x,ξ′)|≤f(ξ)+f(ξ′)

where .

That Assumption 5 implies Assumption 4 is straightforward. To see how Assumption 6 implies Assumption 4, note that, if the former is satisfied, we have

 Esupx∈X|h(x,ξ)−h(x,ξ′)|2+δ≤L2+δE∥ξ−ξ′∥2+δ<∞

Similarly, Assumption 7 implies Assumption 4 because the former leads to

 Esupx∈X|h(x,ξ)−h(x,ξ′)|2+δ≤E(f(ξ)+f(ξ′))2+δ<∞

Next, we also make the following assumption:

###### Assumption 8 (Non-degeneracy).

We have

 P(minx∈X{h(x,ξ)−Z(x)}>0)+P(E[minx∈X{h(x,ξ)−h(x,ξ′)}∣∣ξ′]>0)>0 (17)

where .

Roughly speaking, Assumption 8 means that is sufficiently mixed so that the optimal value of a data-driven optimization problem with only one (or two) data point can deviate away from its mean. This assumption holds, e.g., when lies in a positive region in the real space that is bounded away from the origin. The assumption can be further relaxed in practical problems. For example, one can replace in (17) by a smaller region that can possibly contain any candidates of optimal solutions. Moreover, if the cost function is Lipschitz (i.e., Assumption 1 holds), it suffices to replace the entire decision space in (17) with the set of optimal solutions , namely:

###### Assumption 9 (A weaker non-degeneracy condition).

We have

 P(minx∈X∗{h(x,ξ)−Z∗}>0)+P(E[minx∈X∗{h(x,ξ)−h(x,ξ′)}∣∣ξ′]>0)>0 (18)

where is the set of optimal solutions for (1). In particular, when the optimal solution is unique, i.e., , this assumption is reduced to .

An important implication of the above two assumptions is to ensure that is bounded away from 0 even as grows, thus leading to a behavior similar to Assumption 3 for the finite case.

###### Lemma 1 (Non-degenerate asymptotic variance).

Suppose Assumption 2 holds. Also, suppose either Assumption 8 holds, or that Assumptions 1 and 9 hold jointly and is compact. Then for some constant , when is sufficiently large.

The proof of Lemma 1 uses a coupling argument between , which is a conditional expectation on , and , the full expectation on

, by assigning the same random variables

. This coupling is used to bound the difference used in calculating the variance , which then combines with the non-degeneracy condition (Assumption 8 or 9) to get a lower bound for . See Appendix A.1 for the detailed proof.

We have the following asymptotics:

###### Theorem 3 (CLT for growing resample size under sampling without replacement).

Suppose Assumptions 2, 4 and 8 hold. If the resample size , then

 √n(Un,k−Wk)k√Var(gk(ξ))⇒N(0,1)

where is the standard normal variable.

###### Theorem 4 (CLT for growing resample size under sampling with replacement).

Suppose Assumptions 2, 4 and 8 hold. If the resample size for some , then

 √n(Vn,k−Wk)k√Var(gk(ξ))⇒N(0,1)

where is the standard normal variable.

Theorems 3 and 4 are analogs of Theorem 2 when . In both theorems, we see that there is a limit in how large we can take relative to , which is thresholded at roughly order . A symmetric statistic with a growing is known as an infinite-order symmetric statistic [22], and has been harnessed in analyzing random forests [38, 53, 52]. Theorems 3 and 4 give the precise conditions under which the SAA kernel results in an asymptotically converging infinite-order symmetric statistic.

The proof of Theorem 3 utilizes a general projection theorem, in which one can translate the convergence of a projected statistic into convergence of the beginning statistic, if the ratio of their variances tends to 1 (Theorem 11.2 in [50]; restated in Theorem 13 in Appendix A). In our case, the considered projection is the Hajek projection of the infinite-order

-statistic. To execute this theorem, we approximate the variance ratios between the projection and the remaining orthogonal component. This requires using a further coupling argument among the higher-order conditional expectations, and combining with a representation of the variance ratio in terms of moments of hypergeometric random variables. Then, the CLT for the

-statistic follows by verifying the Lyapunov condition of the Hajek-projected -statistic.

From Theorem 3, the conclusion of Theorem 4 follows by using a relation between - and -statistics in the form

 nk(Un,k−Vn,k)=(nk−nPk)(Un,k−Rn,k) (19)

where and is the average of all with at least two of being the same (see, e.g., Section 5.7.3 in [44]). By carefully controlling the difference between and , one can show an asymptotic for under a similar growth rate of as that for . This leads to a slightly less general result for in Theorem 4. We mention that the growth rates of in both Theorems 3 and 4 are sufficient conditions. We will also see in the next section that, under further conditions, the growth of can be allowed bigger.

The proofs of Theorems 3 and 4 are both in Appendix A.2. These two theorems conclude that and continue to well approximate the optimistic bound even as , under the depicted assumptions and bounds on the growth rate.

Taking one step further, the following shows that bagging under sampling without replacement achieves almost the same efficiency as the direct use of CLT for SAA in (5).

###### Theorem 5 (Validity of nearly full resample under Lipschitzness).

Suppose Assumptions 1, 2, 4 and 9 hold, and that the decision space is compact. Then the conclusion of Theorem 3 holds by choosing .

Theorem 5 implies that, asymptotically, we can use almost the full data set to construct the resampled SAA in . This implies that its standard error is of order close to , and also the point estimate is approximately the SAA with full size . Hence both the tightness and statistical accuracy of the resulting bound reach the level of (5). Moreover, the standard error of our bagging estimator is stabler than the one in (5), as it does not rely on the quality of only one particular SAA solution.

Next we show yet another refinement when, in addition to Lipschitzness, the optimal solution is also unique. Under this additional assumption, our bagging scheme elicits essentially the same CLT as Theorem 1, and thus recovers the direct-CLT bound in (5).

###### Theorem 6 (Recovery of the classical CLT for SAA under solution uniqueness).

In addition to the conditions in Theorem 5, if we further assume that (1) has a unique optimal solution , then the conclusion of Theorem 3 holds for any . Moreover we have and as . In particular, if for some constant , then

 √n(Un,k−Z∗)⇒N(0,Var(h(x∗,ξ)))

where is the normal variable with mean zero and variance .

Note that, compared with Theorems 3 and 4, the centering quantity in Theorem 6 is changed from to . The asymptotic distribution is Gaussian with variance precisely the objective variance at . This recovers Theorem 1 in the special case where . If the uniqueness condition does not hold, there could be a discrepancy between the optimistic bound and (This can be hinted by observing the different types of limits between Theorems 3, 4 and Theorem 1, namely Gaussian versus the minimum of a Gaussian process).

We obtain Theorems 5 and 6 from a different path than Theorem 3, in particular by looking at the variance of via an analysis-of-variance (ANOVA) decomposition [20] of the symmetric kernel . Thanks to the uncorrelatedness among the ANOVA terms, we can control the variance of by using a bound from [52], which can be shown to depend on the maximal deviation of an empirical process generated by the centered cost function indexed by the decision, i.e., . The Lipschitz assumption allows us to estimate this maximal deviation using empirical process theory. Appendix A.3 shows the proof details.

## 6 Statistical Properties of Bagging Bounds and Comparisons with Batching

We analyze the properties of our confidence bounds implied from Theorems 3 and 4, namely consisting of a point estimator or and a standard error . We first show that the latter is of order , thus reconciling with our claim in (8) and demonstrating an asymptotically higher statistical precision compared to the batching bound in (7).

###### Proposition 1 (Magnitude of the standard error).

Under Assumption 2, we have for some constant , as . Consequently, the asymptotic standard deviation of or , namely , is of order .

Note that Proposition 1 is quite general in that it does not impose any growth rate restriction on . We also note that, under conditions that provide a CLT for the SAA (i.e., Theorem 1), the in the batching bound (7) can be of order as the data size per batch grows, and thus the resulting error term there can be controlled to be like ours (and also the direct-CLT bound (5)). Nonetheless, Proposition 1 is free of such type of assumptions. Its proof uses the coupling argument in bounding the variance that appears in the proof of Theorem 3. The proof details are in Appendix A.4.

The following shows a more revealing result on the higher statistical efficiency of our bagging procedure compared to batching:

###### Theorem 7 (Asymptotic variance reduction).

Recall that is the point estimate in the bound (7) given by the batching procedure. Assume the same conditions and resample sizes of either Theorem 3 or 5 in the case of resampling without replacement, or Theorem 4 in the case of resampling with replacement. With the same batch size and resample size, both denoted by , we define the asymptotic ratios of variance

 rU:=limsupn,k→∞Var(Un,k)Var(~Zk),rV:=limsupn,k→∞Var(Vn,k)Var(~Zk). (20)

We have , and in particular

1. when

2. when the conditions of Theorem 5 hold, is not a singleton and the covariance for is not a constant

3. when the conditions of Theorem 5 hold and is a singleton.

The following example shows that in the second case of Theorem 7, i.e. when the cost function is Lipschitz continuous in decision and there are multiple optimal solutions, the asymptotic ratio of variance not only is strictly less than but also can be arbitrarily close to .

###### Example 1.

Consider the cost function

 h(x,ξ)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(2−x)ξ1+(x−1)ξ2 if 1≤x≤2⋮⋮(j+1−x)ξj+(x−j)ξj+1 if j

for and uncertain quantity where are independent standard normal variables. In other words, at the cost is set to

and everywhere else given by a linear interpolation between the two neighboring integer points. In this case, the objective is constantly zero over the entire decision space so

. The SAA value where is the sample mean of the -th component , hence is the minimum of independent standard normal variables. A direct application of Corollary 1.9 in [16] leads to for some universal constant . In Appendix A.5 we show that . Therefore .

Furthermore, the following shows that the point estimator under sampling without replacement always has a smaller variance than the batching estimator, for any and :

###### Theorem 8 (Variance reduction under any finite sample).

Recall that is the point estimate in the bound (7) given by the batching procedure. Let be the order statistic of the data set . With the same batch size and resample size, both denoted by , we have

 Var(~Zk)=Var(Un,k)+E[Var(~Zk|ξ(1),…,ξ(n))]

and hence for any .

###### Proof.

By the law of total variance we have

 Var(~Zk)=E[Var(~Zk|ξ(1),…,ξ(n))]+Var(E[~Zk|ξ(1),…,ξ(n)]).

The desired conclusion follows from noticing that . ∎

Theorem 8 reinforces the smaller standard error in bagging compared to batching from asymptotic to any finite sample, provided that we use sampling without replacement. Intuitively, bagging eliminates the additional variability contributed from the ordering of the data, whereas the batching estimator is subject to change if the data are reordered. Alternately, one can also interpret bagging as a conditional Monte Carlo scheme applied on the batching estimator given the data ordering.

Next, the following result concerns the biases of and :

###### Theorem 9 (Bias).

Under the same assumptions and resample sizes as Theorems 3 and 4, the bias of in estimating is 0, whereas the bias of in estimating is where is any fixed positive integer.

The zero-bias property of is trivial: Each summand in its definition is an SAA value with distinct i.i.d. data, and thus has mean exactly . On the other hand, the summands in are SAA values constructed from potentially repeated observations, which induces bias relative to . The proof of the latter again utilizes the relation (19), and is left to Appendix A.6.

From Theorem 9, we see that outperforms in terms of bias control. When is fixed, such an advantage for is relatively mild, since the bias of in estimating the optimistic bound is of order . However, as grows, this advantage becomes more significant, and the bias of can be arbitrarily close to (when ).

Theorems 5, 8 and 9 together justify that, in terms of both standard error and bias, sampling without replacement, i.e., , seems to be the more recommendable choice for our bagging procedure. However, in our numerical experiments in Section 8, and appear to perform quite similarly.

Lastly, we should mention that the biases depicted in Theorem 9 concern the estimators of , but do not capture the discrepancy between and . The latter quantity is of separate interest. As discussed at the end of Section 2.1, it can be generally reduced by existing methods like the jackknife or probability metric minimization [42, 49].

## 7 Error Estimates and Coverages

Finally, we analyze the use of the IJ estimator in approximating the standard error and the error coming from the Monte Carlo noise in running the bootstrap. Together with the results in Section 5 and 6, these will give us an overall CLT on the output from Algorithm 1. First, we have the following consistency of the IJ variance estimator, relative to the magnitude of the target standard error:

###### Theorem 10 (Relative consistency of IJ estimator under resampling without replacement).

Consider resampling without replacement. Under the same conditions and resample sizes of either Theorem 3 or 5, the IJ variance estimator is relatively consistent, i.e.

 n2(n−k)2n∑i=1Cov2∗(N∗i,H∗k)/k2nVar(gk(ξ))p→1.
###### Theorem 11 (Relative consistency of IJ estimator under resampling with replacement).

Consider resampling with replacement. Under the same conditions and resample sizes of Theorem 4, the IJ variance estimator is relatively consistent, i.e.

 n∑i=1Cov2∗(N∗i,H∗k)/k2nVar(gk(ξ))p→1.

Theorem 10 is justified by adopting the arguments for random forests in [52]

and a weak law of large numbers, and Theorem

11 follows from analyzing the difference between - and -statistics as in the proof of Theorem 4. Appendix A.7 shows the details.

When a large enough bootstrap size is used in Algorithm 1, the Monte Carlo errors in estimating the point estimator and its variance both vanish. This gives an overall CLT for the output of our bagging procedure, as in the next theorem:

###### Theorem 12 (CLT for Algorithm 1).

Under the same conditions and resample sizes of either Theorem 3 or 5 in the case of resampling without replacement, or Theorem 4 in the case of resampling with replacement, if the bootstrap size in Algorithm 1 is such that , then the output of Algorithm 1 satisfies

 ~Zbagk−Wk~σIJ⇒N(0,1)

where is the standard normal variable.

An immediate consequence of Theorem 12 is the correct coverage of the true optimal value:

###### Corollary 1 (Correct coverage from Algorithm 1).

Under the same assumptions, growth rates of the resample size and the bootstrap size in Theorem 12, the output of Algorithm 1 satisfies

 P(~Zbagk−z1−α~σIJ≤Z∗)≥P(~Zbagk−z1−α~σIJ≤Wk)→1−α

where is generated under the data .

Theorem 12 and Corollary 1 thus close our analyses by showing an exact asymptotic coverage of our bagging bound for the optimistic bound , and a correct asymptotic coverage for