Scalable Uncertainty Quantification via GenerativeBootstrap Sampler

06/01/2020 ∙ by Minsuk Shin, et al. ∙ 0

It has been believed that the virtue of using statistical procedures is on uncertainty quantification in statistical decisions, and the bootstrap method has been commonly used for this purpose. However, nowadays as the size of data massively increases and statistical models become more complicated, the implementation of bootstrapping turns out to be practically challenging due to its repetitive nature in computation. To overcome this issue, we propose a novel computational procedure called Generative Bootstrap Sampler (GBS), which constructs a generator function of bootstrap evaluations, and this function transforms the weights on the observed data points to the bootstrap distribution. The GBS is implemented by one single optimization, without repeatedly evaluating the optimizer of bootstrapped loss function as in standard bootstrapping procedures. As a result, the GBS is capable of reducing computational time of bootstrapping by hundreds of folds when the data size is massive. We show that the bootstrapped distribution evaluated by the GBS is asymptotically equivalent to the conventional counterpart and empirically they are indistinguishable. We examine the proposed idea to bootstrap various models such as linear regression, logistic regression, Cox proportional hazard model, and Gaussian process regression model, quantile regression, etc. The results show that the GBS procedure is not only accelerating the computational speed, but it also attains a high level of accuracy to the target bootstrap distribution. Additionally, we apply this idea to accelerate the computation of other repetitive procedures such as bootstrapped cross-validation, tuning parameter selection, and permutation test.



There are no comments yet.


page 20

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Since its introduction by Efron (1979), the bootstrap methods have been considered as a seminal tool in uncertainty quantification

(UQ) of statistical inference procedures. The bootstrap can be applied to a wide range of situations, of which many lack concrete theoretical guidance and/or other practical means on how to quantify the sampling variability of a statistical procedure. Generally, only some minor regularity conditions, such as a finite variance and a certain degree of smoothness in the functional evaluated, are required for theoretically valid results in bootstrap procedures

(Hall, 1992, 1986; Efron and Tibshirani, 1994; Shao and Tu, 2012; Davison and Hinkley, 1997). Despite its popularity and theoretical justifications, the bootstrap’s practical use is hindered by its computational burden (and need) of repetitive evaluations. It is required to re-sample the observations and evaluate the statistic at least a few hundreds times, which significantly increase the total computation time. In particular, in the era of big data, the size of data sets is likely to be massive, and statistical models are also highly complicated so that practitioners are not computationally capable of training the statistical model multiple times.

Even when the theoretical form of the sampling distribution of an estimator is explicitly known, a practical UQ would be computationally challenging under high-dimensional settings. For example, in linear regression models with

number of covariates, we know the closed form of the sampling distribution of the coefficient estimator, and its variance is the inverse of the Gram matrix. However, the evaluation of the inverse matrix takes a computational complexity at an order of . When is large, say millions, this computation is infeasible due to issues of computational time and memory capacity in the computing server (Aghazadeh et al., 2018). In astronomy and computational biology, it is not uncommon to see modern scientific applications where the feature space is in a dimension of millions (Weinberger et al., 2009; Wood and Salzberg, 2014; Bray et al., 2016; Vervier et al., 2016)

. For these massive-sized data sets, theoretical sampling distribution itself is not helpful in practical inference such as constructing a confidence interval.

To resolve these issues of UQ for big data and complex models, we propose a scalable computational procedure called Generative Bootstrap Sampler (GBS) that accelerates the computation of nonparametric bootstrap (Efron, 1979) and random weight bootstrap (Barbe and Bertail, 2012; Newton and Raftery, 1994; Præstgaard and Wellner, 1993). This novel computational strategy is to construct a generator function that maps the random weights in bootstrap to bootstrap samples of a statistic. This means that once the generator function is trained, one can generate massive-sized bootstrap samples of the statistic with an almost free computational cost. This is implemented by plugging in randomly generated weights to the trained generator.

The proposed procedure enjoys multiple advantages over the existing UQ procedures, and we list them below:

Accurate evaluation of bootstrapped distribution. The GBS is exact or at least approximately accurate. The vanilla version of the GBS is exact in a sense that each bootstrapped statistic generated by the GBS is exactly equivalent to the corresponding counterpart in classical bootstrap sharing the same weight value (the detail of will be introduced in Section 2.1). This result will be rigorously proven in Theorem 2.1. The faster version of the GBS, which is a subgrouped bootstrap (Carlstein et al., 1998) that will be introduced in Section 2.3, is also asymptotically approximating the classical bootstrap distribution as shown in Theorem 2.2

. We empirically also show that this approximation is remarkably accurate in various examples such as linear models, logistic regression, Cox proprtional hazard regression, Gaussian mixture models, quantile regression models, etc.

Scalability. The GBS is fast and scalable. We consider neural networks to formalize the generator function in the GBS, and its computation can be efficiently implemented via a single optimization procedure, and this optimization is utilizing a subsampling technique of stochastic gradient descent (Bottou, 2010). When the sample size is extremely massive, it iteratively updates the generator by using a small subset of samples so that each iteration of optimization is feasible in a GPU computation. We show that the GBS is capable of accelerating the bootstrap by a few hundreds times for various models with massive-sized data sets.

Unlike subsampling-based bootstrap procedures in parallel computing environments (Politis et al., 2001; Kleiner et al., 2014), the GBS can be used to accelerate the computational speed of high-dimensional problems. The subsampling bootstrap procedures improve computational efficiency in a way that they reduce the sample size via a subsampling and parallelly evaluate bootstrapped estimators. As a result, they suffer from a bias problem by nature and they are not computationally optimal for so-called “large

problems”. In contrast, even though the GBS algorithm utilizes subsamples in each iteration, the whole optimization procedure is aiming at minimizing the loss function of the full samples. Not only that, recent machine learning applications to image data sets

(Karras et al., 2018; Ledig et al., 2017; Wang et al., 2018; Choi et al., 2018) empirically proved that neural network generators can be efficiently computed even when the dimension of a model is in a scale of millions, and our empirical results also show that the GBS is stably working well when the number of the parameters in the model is as large as thousands.

Widely applicable to avoid repetitive computations. The proposed idea can be generally applied to a wide range of statistical procedures whose loss function can be expressed as a sum of individual losses. The GBS is applicable to general statistical procedures such that the loss function of the procedure is differentiable with respect to the parameters of the model, and the generator in the GBS is smooth with respect to the weight values in the loss function of the GBS. Although these two conditions are not necessary to ensure theoretical ideals, they are important in practical and computational performance of the algorithm of the GBS. Since a gradient update is an essential step for the optimization of the neural network in the GBS, the use of SGD for a highly non-differentiable loss function would degrade the optimization stability.

Moreover, we extend the idea of the GBS to other statistical procedures that require multiple computations of similar evaluations. These examples include bootstrapped cross-validation (CV; Efron and Tibshirani (1997)) to quantify the uncertainty in the estimation of out-of-sample prediction error, evaluations on difference tuning parameters, and the evaluation for permuted null distribution in hypothesis testing. Like the bootstrap application of the GBS, the GBS offers a way to circumvent multiple repetitive computations in these examples by employing a single optimization on the generator, instead of thousands of computational repetitions. In Section 4, we show that their computational efficiency can be significantly improved by the GBS idea.

We note that while the GBS achieves such desirable properties, one caveat is that its application is restricted to nonparametric bootstrap (Efron, 1979) and weighted bootstrap procedures (Shao and Tu, 2012) in general. Only under limited situations, where an explicit functional form of the data-generating process is available, the GBS can be applied to parametric bootstraps. A brief discussion on this limitation will be given in Section 5.

Convenient diagnosis of optimization. The optimization convergence of the GBS can be diagnosed via an intuitively straightforward way. Since the GBS generates the exact bootstrapped samples of a statistic, the convergence quality of the optimization can be diagnosed by comparing the evaluations from the classical bootstrap procedure and the corresponding GBS counterparts. If the optimization procedure is converged well, these bootstrap samples of the statistic should be the same or close enough to their counterparts. For this purpose, we do not need to evaluate many bootstrap estimator from the classical bootstrap procedure In Section 2.5, we only consider five bootstrap samples to diagnose the convergence, and the computational cost of this evaluation is negligible compared with that of classical bootstrap procedures that require thousands of evaluations.

2 Generative Bootstrap Sampler

2.1 A General Form of Bootstrap

We consider observations generated from a true data-generating distribution characterized by a parameter of interest . For many problems, a valid estimator of can be found as the solution of the following optimization problem:

with being a suitable loss function chosen by the researcher. When the class of distributions belongs to a nice parametric family, an attractive choice of the loss function is the negative log-likelihood and the resulting is the Maximum Likelihood Estimate (MLE).

The conventional bootstrap procedure for assessing statistical properties of the resulting estimator relies on the generation of the bootstrap samples, , , via the following repetitions


where is independently generated from a distribution . More explicitly, for each drawn from , we can write the resulting solution of (1) as .

The classic nonparametric bootstrap (Efron, 1979) induces the distribution , which is termed “nonparametric bootstrap" and is most boradly used in practice. In contrast, the parametric bootstrap procedure generates the bootstrap samples by using repeatedly generated new samples from the estimated distribution in a parametric family, i.e., , corresponding to using the negative log-likelihood as the loss function in (1. The Bayesian bootstrap (BB) was introduced by (Rubin, 1981) to provide a nonparametric Bayesian view of the bootstrap, which turns out to be also a smoother version. More precisely, BB can be viewed as the posterior distribution of the unknown sampling distribution under the Dirichlet process prior with its base measure converging to zero. As an extension of the BB, Newton and Raftery (1994) proposes the Weighted Likelihood Bootstrap (WLB) by choosing the loss function as the log-likelihood and employing . Theoretically, they showed that the bootstrap distribution of the WLB is asymptotically valid at a first-order correctness under some regularity conditions. Later the idea was generated to estimating equation frameworks Chatterjee et al. (2005)

. In this paper, we mainly focus on the WLB, because of technical convenience. The Dirichlet distribution with a uniform parameter of one can be easily approximated by independent exponential distribution. That is,

for i.i.d. . Since

by the law of large number,

approximately follows the Dirichlet distribution. This property is convenient in a sense that we do not need to consider the dependence structure in , and simply generate independent samples from .

We note that, although formulation (1) cannot cover all estimators of interest, it is general enough and includes many later variants of bootstrap. All variants of nonparametric bootstrap (Rubin, 1981; Efron, 1979; Kirk and Stumpf, 2009; Silverman and Young, 1987) require repetitions of evaluating the bootstrapped estimators. Even though these evaluations can be completely parallelizable, its computational burden is still hazardously heavy to generate an large enough number of bootstrap samples in practice.

2.2 Generative Bootstrap Sampler

Instead of this inefficient repetition, we take a new point of view at the bootstrap computation. That is, we explicitly treat as a function of . By doing so, the bootstrap computation can be summarized as the evaluation of a generator function such that


where is the expectation operator with respect to . Then, it is straightforward that the distribution of is equivalent to the conventional bootstrap distribution, and this point is rigorously addressed in the following theorem:

Theorem 2.1.

Suppose that is the solution of (2) and for a given . Also, assume that the solution is unique for any . Then, .


For a given , we have for all . This means that , which completes the proof. ∎

In practice, we restrict the generator function

to a class of feed-forwarding neural networks

. As shown in classical references (Cybenko, 1989; Hornik et al., 1989; Barron, 1993), feed-forwarding neural networks flexibly approximate any functional forms when the number of nodes is sufficiently large. Also, applications of image generation in machine learning support the success of neural networks in approximating extremely complicated functions (Goodfellow et al., 2014; Arjovsky et al., 2017; Isola et al., 2017). Under the adoption of the neural networks, the optimization procedure in (2) turns out to minimize the loss function with respect to the weight and bias parameters of the neural network, and this can be implemented via a backpropagation algorithm (Rumelhart et al., 1986). Its optimization can be numerically conducted via an efficient GPU-optimized platform such as Pytorch and Tensorflow. The details of the neural networks are given in Section 2.4.

Once is trained, one can easily generate bootstrap estimators with almost free computational costs (0.1 seconds to generate 10,000 bootstrap estimators for some examples in Section 3). More detailed steps are described below:

For ,

1. Sample .

2. Evaluate .

Although ideally this vanilla version of the GBS generates exact bootstrap samples of a statistic, one critical issue in this procedure is that due to the large-sized input dimension of , which is , the optimization of (2) is practically challenging for “large ” data sets. This is because as the dimension of the input increases, the resulting optimization procedures on a high-dimensional neural net tend to converge slowly, and its network size also tremendously increases, which devours vast amounts of computing resources. To relieve this issue, we consider a modified bootstrap procedure that subgroups the observations and imposes the same weight on observations in the same subgroup.

2.3 Subgroup Bootstrap

Conventional bootstrap weights.
An example of a subgroup bootstrap with .
Figure 1: The structure of bootstrap weights for the subgroup bootstrap.

We consider an arbitrary partition of the index set of observations, . That is, and , . Typically as but at a slower rate, i.e., . Without loss of generality, we assume that the size of each is the same, i.e., for . We define an index function , which assigns each observation to the corresponding subgroup: if then . Then, for some weight distribution for , we impose the same value of weight on all elements in a subgroup as


where for , and we denote . Similar to the vanilla GBS previously introduced, setting and result in an approximated versions of the WLB and the nonparametric bootstrap, respectively. Then, the input of the resulting generator function is , and the dimension is reduced from to . This simple modification dramatically improves the computational efficiency of the GBS. When the sample size is larger than millions, the vanilla setup considers the generator with the input dimension being millions, but the input dimension for the subgrouped bootstrap is just (). While this improvement is advantageous, a natural question still remain: does this modification produce a theoretically valid bootstrap distribution? To investigate this theoretical justification, we introduce some helpful notation below.

We distinguish a random variable

and its observed value . Let

be an iid sequence from the probability measure space

. We denote the empirical probability measure by , where is a point mass measure at , and let , where is a probability measure and is -measurable. Let and denote the true parameter that involves in the true data-generating process by , and Then, can be considered as a function of the empirical process defined as . Suppose that weakly converges to a probability measure defined on some sample space and its sigma field . In the regime of bootstrap, what we are interested in is to estimate by using some weighted empirical distribution that is , where is an iid weight random variable from a probability measure . In the same sense, the probability measure acts on the block bootstrap is denoted by . When is set to be a multinomial probility law with a uniform one parameter, Giné and Zinn (1990) showed that under some measurability condition on a collection of some continuous functions of interest , the following statement


where is the envelope function of , is equivalent to


Præstgaard and Wellner (1993) generalized this result and suggested a set of conditions on to guarantee the asymptotic justification in (5). Based on the theoretical findings in Præstgaard and Wellner (1993), the following theorem justifies the use of the proposed subgrouped bootstrap:

Theorem 2.2.

Recall that is defined in Theorem 2.1. Suppose that (4) holds and in (3). Assume that as . Recall that is defined in Theorem 2.1. Then,

where is the Borel sigma algebra.


See the appendix. ∎

Figure 2: Comparisons of block bootstraps across different numbers of blocks.

Even though the result in Theorem 2.2 is valid in asymptotics, it does not guarantee practically sufficient accuracy of the approximation in finite samples. To examine its finite sample accuracy, we consider a simple example of a linear regression model as


where for . we generate a synthesized data set with and , and its true coefficient is a sequence of equi-spaced values between -2 and 2, and . We generate a data set from this setting and apply the block bootstrap version of the GBS to the synthesized data set.

For the simulated data set, Figure 2 shows boxplots of block bootstrap distributions of MLE with different number of blocks ( 5 and 100). An interesting point in Figrue 2 is that the block bootstrap distribution approximates well the target distribution (the non-block bootstrap distribution), even when the number of subgroups is tiny (). When the number of blocks is moderately large (), its approximation is indistinguishable from the target. This result is confirmed again in Figure 3, which shows histograms of block bootstrap distributions for a few coefficients. When , while the location of the block bootstrap distribution is close to that of the target one, the scale is unstably different for some coefficients. However, as increases, the quality of the approximation of the block bootstrap distribution clearly get improved, and when , the target bootstrap distribution is almost perfectly approximated by the block bootstrap distribution. Not only for this toy example, in all simulated and real applications we examined in Section 3 and Section 4, we find that the approximations of the block bootstrap are highly accurate. We use this block bootstrap with as a default of the GBS in sequel.

Block bootstrap with with .
Block bootstrap with with .
Block bootstrap with with .
Figure 3: Histograms of block bootstrap distributions with various for the coefficient of (left column), (middle column), and (right column). The red line indicates the density function of the target distribution. Each true coefficient is marked by a blue x.

2.4 Generator Function Constructed by Neural Network

A popular way to construct a function with a complicated structure is to use neural networks. In applications of machine learning, various results empirically showed that the use of neural networks as a generator of complicated patterns, e.g. image generations, is practically adequate and computationally feasible even for a massive-sized data set (Ledig et al., 2017; Wang et al., 2018; Karras et al., 2018; Goodfellow et al., 2014; Arjovsky et al., 2017). We thus consider a simple form of the network, a modified feed-forwarding neural network, to construct the generator in (2).

First, we explain a general notion of feed-forwarding neural networks, then we introduce our neural net used in the GBS. Feed-forwarding neural nets are constructed by composing activated linear transformations, and we consider

number of network layers where for . For the -th layer, its weight parameter, which is represented by a matrix, and its bias parameter, which is a

-dimensional vector are denoted by

and , respectively. Then, the -th layer is expressible as

Fitting a neural network requires to choose an activation function

, and commonly used activation functions are a sigmoid function, a hyperbolic tangent function, the Rectified Linear Unit (ReLU), etc. For our generator, we use the ReLU function that is

as a default.

A feed-forwarding neural network is defined by a composition of these layers as


where and are the input layer and the output layer, respectively. This means that the structure of the neural network in (7) is composed in a way that the output of the previous layer will be used as the input of the next layer. The input of the generator is bootstrap weights , and these weights are feed-forwarded towards the first hidden-layer , and the output of is fed to the second hidden-layer , and so on. At the last layer, the bootstrap sample of is expressed as a linear combinations of components in the final hidden-layer. We consider a feed-forwarding neural net with three hidden-layers in all simulation and real data studies. The details of the neural net structure are deferred to the appendix.

Figure 4: An example of the structure of via the modified feed-forwarding neural network.

An issue of this feed-forwarding neural net is that the variation on the weights is less likely to be transmitted towards the output as the number of layers increases. As a result, we empirically found that simple feed-forwarding neural nets tend to underestimate the variance of the bootstrap distribution. To overcome this issue, we modify the simple feed-forwarding neural net in a way that every layer receives the bootstrap weight as an input. This modification can be formally expressed as a recursive way that follows


for . Figure 4 illustrates this structure of . At every layer, the bootstrap weight is concatenated to the output of the previous layer, and the concatenated results become the input of the next layer. Compared to the simple feed-forwarding neural network, this modified one directly connects the weight and the bootstrap output, and helps gradient flow in training deep neural networks. We use this neural net as a default of the GBS and examine its performance in various examples in Section 3.

2.5 Diagnosis of Optimization Convergence

The GBS procedures enjoy a clear advantage over some other optimization-based UQ procedures in terms of convergence diagnosis in optimization. In the Bayesian paradigm, variational inference and its variants (Rezende and Mohamed, 2015; Blei et al., 2017; Kingma et al., 2016; Ho et al., 2019) are commonly used to approximate the posterior distribution for big data. However, they not only underestimate the posterior variance (Pati et al., 2018; Yao et al., 2018; MacKay, 2003), but also lack diagnostic tools for monitoring optimization convergence. Thus, using the variational Bayes is not ideal for constructing a confidence interval or quantifying uncertainty in statistical inference.

In contrast, the convergence status of the GBS procedures can be diagnosed by comparing the GBS solution and the corresponding classical solution that shares the same weight. Theorem 2.1 shows that when the weight is the same, the GBS solution and the classical solution should be equivalent. So, if the optimization procedure for the GBS converges well, the resulting solutions should be close to these from the classical bootstrap. To be more formal, we consider this diagnosis as a hypothesis testing procedure. Since the discrepancy between two results can be interpreted as a paired two sample problem. More specifically, for a small number , say , we generate from the pre-specified weight distribution, and we can calculate the discrepancy between results from the GBS and the classical bootstrap procedure. We define the discrepancy as . The basic idea is that if is close enough to zero for , we can conclude that the optimization procedure is converged well. This procedure is reasonable in a sense that it is rare to match the results from two procedures by chance, when the number of parameters is moderately large. However, a question “how close towards zero is close enough?” still remains. We thus employ a statistical testing procedure to examine a formal way to diagnose the convergence.

We assume that for . Then, diagnosing the optimization convergence can be processed via a hypothesis testing such as

This formulation is natural in examining the significance of the discrepancy, but a classical hypothesis testing procedure is mainly designed to detect a significant difference, not to assert no difference. Because the situation where the optimization is converged is corresponding to for all , classical significance testing procedures are not appropriate in this problem. Instead, we consider a Bayesian hypothesis testing procedure that provides a measure of evidence that directly compares two hypotheses (or models). For the Bayesian hypothesis testing for each , under we set


for . Then, the Bayesian hypothesis testing procedure measures evidence on each hypothesis by comparing marginal likelihoods evaluated under and

, and the odd between these marginal likelihood is called

Bayes factor (Kass and Raftery, 1995; Jeffreys, 1961). That is,

where and are the marginal likelihoods under and , respectively.

By using the conjugacy, we can derive an explicit form of the Bayes factor under (9), and it follows


where . As noted by Liang et al. (2008) and Johnson and Rossell (2010), the value of Bayes factor is significantly affected by the prior scale parameter . As the prior on gets more diffused, more evidence will be assigned on , so when is larger, this testing procedure will be more in favor of . We admit that the diagnosis is sensitive to the choice of , but a simple setting performs well in all examples we tested. Then, our criterion to diagnose the convergence is the sample mean of ’s; . As a rule of thumb, we decide that the optimization is converged if for . We set as a default, and this setting reasonably works well in various examples considered in this paper. We note that this criterion does not follow the standard decision rules suggested by Kass and Raftery (1995). They recommended a decision criterion that a value of larger than 5 indicate “very strong” evidence in favor of , but this rule cannot be applied to our case, since . This means that we may have a situation where the Bayes factor cannot exceed the criterion suggested by Kass and Raftery (1995) when is small. In algorithm 1, we employ this convergence diagnosis () within the iterative algorithm, and every 100 iterations we diagnose the optimization chain is converged. If the optimization procedure consecutively achieves five times, we diagnose that the optimization is converged, and we stop the algorithm after iterations.

This diagnosis procedure of the GBS can be a unique advantage over the other competing procedures, because other optimization-based uncertainty quantification methods are lack of convergence diagnosis tools. For example, while variational inference procedures (Blei et al., 2017; Pati et al., 2018; Carbonetto and Stephens, 2012) are commonly used to approximate the posterior distribution through a family of simple distributions, it is not trivial to check its algorithmic convergence, and practitioners are exposed to a risk of obtaining sub-optimal results. In contrast, the convergence diagnosis of the GBS is straightforward . If the result of the proposed Bayesian test is against the evidence of a convergence, we can tune technical settings in the optimization procedure or run more iterations until the optimization procedure is diagnosed to be converged. This diagnosis procedure requires to evaluate (a small number less than 10) number of bootstrap samples from the classical procedure, but this extra computational burden would be negligible compared to that required in sampling thousands of classical bootstrap samples.

2.6 Computational Strategy

As examined in various machine learning literature, the optimization for loss functions based on neural networks is highly efficient in computation. That is because neural networks contain simple structure as in (7), and the gradient can be efficiently evaluated by using backpropagation algorithm (Rumelhart et al., 1986; Hecht-Nielsen, 1992) implemented via GPU computing. Once the gradient is evaluated, Stochastic Gradient Descent (SGD) algorithm and its variants update the parameter in an iterative sense. The same strategy is used for the computation of GBS. See the appendix for the details of the SGD.

To accelerate the convergence, we use a small number of bootstrap samples evaluated from the conventional bootstrap procedure. We add an extra penalty on the discrepancy between the GBS bootstrap samples and the conventional bootstrap samples, and the resulting loss function follows that


where is the neural net parameter of the generator , and is a tuning parameter that controls the influential level of the discrepancy, and and indicate a pre-sampled weight and its corresponding bootstrap estimator, respectively, for . By adding this extra penalty term , we expect that the convergence speed of the SGD accelerates, and the stability of the optimization is improved. That is because this distance directly forces the generator to converge towards the target values. However, we cannot overlook a concern of overfitting towards ’s. If the tuning parameter is large, the generator would be updated in a way that only the generated samples from for are accurate and the other weights would result in sub-optimal generated values. To avoid that, we set after the optimization status is diagnosed as converged. After converged, we run a long enough length of iterations to prevent the GBS from overfitting. If evaluating a few classical bootstrap samples is not available for some reason, we can ignore the extra penalty term and consider the original loss function in (2). We find that there is a tendency that the convergence of the original form in (2) is relatively slower than the form with the extra penalty term when the data size is massive.

Algorithm 1 shows the details of the optimization algorithm for minimizing (2) or (11). The main idea is that the expectation is approximated by a Monte Carlo method with number of randomly sampled bootstrap weights, and its gradient can be evaluated, then we update the neural net parameter of the generator. If a few bootstrap samples are computable from the conventional bootstrap procedure, we process a convergence diagnosis step at every 100 iteration in the algorithm. If the current optimization status is diagnosed as converged five time in a row via the diagnosis procedure introduced in Section 2.5, we set and run iterations more, then the algorithm stops.

Preparation step
Generate (this should be sampled by subgroups; see Section 2.3).
Evaluate for .
Optimization step
Set batch size , , , , , , , and .
Initialize a matrix .
Set and .
while the stop condition is not satisfied or  do
       Let .
       Sub-sample samples from the original data set, and denote the index set by .
       Randomly select number of rows of . For each selected row of , randomly select number of elements,
      and replace them with generated values from i.i.d .
       Transform the updated to a matrix (see Section 2.3 for details).
       Calculate , where , is the -th row
      and -th element of , respectively.
      (Optional) Set .
       Update with via a SGD step.
      if  then
             Evaluate , where is defined in (10).
            if  then
            end if
      end if
      if   then
             Set in (11) and stop after N iterations.
      end if
end while
Algorithm 1 An optimization algorithm for the GBS.

3 Examples and Simulation Studies

3.1 Gaussian Linear Regression

We consider an example of linear regression models in (6) to examine the performance of the GBS in evaluating the bootstrap distribution. We note that the exact form of the sampling distribution of MLE is available for linear models, and the advantage of using bootstrapping is minimal for this simple example in practice. Nerverthless, examining linear models is still meaningful in a sense that the GBS is a general procedure that is applicable to a wide range of models even under situations where the evaluation of the sampling distribution is nontrivial.

To more rigorously examine the performance of the GBS, we consider a simulation study. The settings of the simulation follows and , , and the true regression coefficient are equi-spaced between and . For simplicity, we assume that is known. Each covariate is independently generated from standard Gaussian or dependent such as , where , if , and , if , and they are denoted by “Independent” and “Correlated” in Table 1, respectively. For each bootstrap procedures, we generate 10,000 bootstrap estimators. We report the average of the results across 100 independently replicated data sets.

This simulation study is implemented on a workstation equipped with a AMD Threadripper 2990WX CPU with 32 cores and 64 threads of base clock 3.0GHz, two Nvidia RTX2080Ti GPUs, and 128GB memory with DDR4 2,800Mhz clock speed. To implement the WLB, we use lm and glm functions in R. For the GBS, Pytorch

, which is a GPU-efficient deep learning library in

Python, is used to optimize the generator function . A parallel computing environment with 25 cores is considered for the standard bootstrap procedures to reduce the computation time via an R package snowfall. For the GBS procedures, we separately report the training time to train and the time to generate samples from the trained generator. For example of the GBS, on the first row of Table 1, “37.7+0.1” means that it takes 37.7 seconds to optimize the generator function and 0.1 seconds to generate 10,000 bootstrap samples after the training. For large-sized data sets with , because the classical bootstraps are computationally infeasible, we mark their result by “NA”, and we estimate their computation time from the time in generating 20 bootstrap samples, instead of 10000. We also consider the target distribution of the bootstrap distribution, which is , to examine the approximation performance of the GBS. All MSE and the coverage results of the classical bootstrap are reported from the parallel environment with 25 cores.

Correlated Independent
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.969 0.432 37.7+0.1 0.936 0.205 45.2+0.1 0.936 0.110 39.4+0.1 0.957 0.052 47.9+0.1
BT(1C) 0.932 0.435 16.2 0.933 0.206 25.1 0.931 0.113 18.4 0.937 0.052 27.3
BT(25C) 0.9 1.3 0.9 1.3
Target 0.951 0.429 0.950 0.204 0.953 0.109 0.954 0.052
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.928 0.805 61.4+0.1 0.923 0.212 112.8+0.1 0.915 0.205 63.8+0.1 0.928 0.053 114.9+0.1
BT(1C) 0.910 0.805 1924.3 0.932 0.211 11209.8 0.911 0.202 1967.4 0.932 0.053 11024.2
BT(25C) 92.6 525.7 91.1 520.2
Target 0.948 0.804 0.948 0.211 0.948 0.202 0.948 0.053
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.940 0.448 244.9+0.1 0.966 0.169 366.3+0.1 0.914 0.113 245.7+0.1 0.948 0.044 361.2+0.1
BT(1C) NA NA > 1 week NA NA > 2 weeks NA NA > 1 week NA NA > 2 weeks
BT(25C) NA NA > 8 hours NA NA > 14 hours NA NA > 8 hours NA NA > 14 hours
Target 0.949 0.446 0.949 0.167 0.949 0.112 0.950 0.041
Table 1: Linear regression results of the coverage of confidence intervals, MSE, and the actual computation time in second to generate 10,000 bootstrap samples for each procedure. We omit “Cov” and “MSE” of bootstrap(25C), because they are the same with those of bootstrap(1C).

For linear models, the explicit form of the solution of (1) is available, i.e. , where is a diagonal matrix of . This means that bootstrap sampling can be computationally more convenient and faster than other arbitrary models. However, despite this computational efficiency, the results in Table 1 suggest that implementing a classical bootstrap for a large-sized data set, like , is practically challenging and requires a huge amount of computing resource. With a single-threading computation, it takes more than 2 weeks to generate 10,000 bootstrap samples, and even under a parallel computing environment with 25 cores, the computation is expected to be longer than 14 hours. In contrast, the training time of the GBS procedure is around 6 minutes under the same setting, and the generation time for bootstrap samples is expected to 0.1 second in all scenarios. This result shows that the GBS is capable of accelerating computing bootstrap distribution by hundreds of folds in the linear regression examples where conventional bootstrap procedures are favored in computation. Moreover, the 95% coverage of the confidence intervals constructed by the GBS is stably close to that of the WLB where the GBS is approximating.

3.2 Robust regression model

When outliers exist, it is reasonable to consider a robust regression model based on an

M-estimator that minimizes the following cost function:


where is a residual function that satisfies i) for all ; ii) ; iii) ; iv) is monotonically increasing (Rousseeuw and Leroy, 2005). Common choices of include the Huber function, the absolute value function for median regression, the bisquare function, etc. Unlike the Gaussian linear regression model previously examined, this M-estimator does not have a closed form of the solution, and its computation is commonly conducted by a computationally expensive procedure like the Iteratively Reweighted Least Square (IRLS, Street et al. (1988)). This computational burden hurdles the use of bootstrapping procedures to quantify the uncertainty in the statistical procedure. Moreover, it is also challenging to derive the sampling distribution of in these robust regression models. That is because their sampling distribution depends on the true distribution of errors, which is unknown, and estimating the error distribution is extremely challenging in general. Thus, the UQ on the statistical inference based on the M-estimator is practically difficult to evaluate.

We apply the GBS to overcome these bottlenecks in UQ, and provide a scalable way to generate a bootstrap distribution of the M-estimator. By directly applying the form (2), the objective function of the GBS follows:


We examine the accuracy of the GBS in computing the bootstrap distribution by using some simulated data set that are generated from the same setting used for linear regression models, except that the errors are generated from

distribution with degree of freedom 3. We use an

R package robust to implement the WLB as a reference of the GBS.

Correlated Independent
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.951 0.281 208.6+0.1 0.966 0.117 243.0+0.1 0.964 0.069 265.9+0.1 0.960 0.028 295.6+0.1
BT(1C) 0.98 0.268 3006.0 0.976 0.096 15772.0 0.985 0.065 3102.8 0.979 0.024 14610.9
BT(25C) 144.5 756.7 146.8 684.4
BT(GPU) > 3 hours > 3 hours > 3 hours > 3 hours
Method Cov MSE Time Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.987 0.179 214.9+0.1 0.981 0.087 351.3+0.1 0.937 0.042 342.6 0.940 0.023 478.2
BT(1C) NA NA > 1 week NA NA > 2 weeks NA NA > 1 week NA NA > 2 weeks
BT(25C) NA NA > 7 hours NA NA > 20 hours NA NA > 7 hours NA NA > 20 hours
BT(GPU) NA NA > 3 hours NA NA > 3 hours NA NA > 3 hours NA NA > 3 hours
Table 2: Results for least absolute deviation regression.

3.3 Logistic regression model

Suppose we have observations , where the ’s are binary, equaling either 0 or 1. Consider the standard logistic regression model, in which the log odds of the -th event is a linear function of its covariate , i.e.,

A standard way to construct a confidence interval for a regression coefficient

is based on a Wald-type procedure. However, a Wald-type CI highly relies on the asymptotic normality of the MLE, it can perform poorly if the sample size is small or moderate. In the generalized linear model, a Wald-type CI can also perform poorly when the distribution of the parameter estimator is skewed or if the standard error is a poor estimate of the standard deviation of the estimator. For the logistic regression model, one of the most widely used generalized linear models, profile likelihood confidence interval can provide better coverage (Royston,2007). Profile likelihood confidence interval is derived from the asymptotic

distribution of the likelihood-ratio test statistic. The confidence interval for a single entry of a vector is constructed by repeatedly maximizing over the other parameters. Since a grid of potential values of this entry are evaluated in this process, profile likelihood CI can be time and computation consuming when the dimension of the parameter is high. Another alternative to the Wald-type CI is the bootstrap confidence interval. Bootstrap CI does not involve any assumptions of the sampling distribution of the estimate but it can be inefficient when the model is complicated. In this section, we compared the proposed GBS CI with the profile likelihood CI and the bootstrap CI from the perspective of 95% coverage rate, MSE and computation time.

To generate simulated data sets, we assume the true regression coefficients are equi-spaced numbers between -0.5 and 0.5. We follow the same setting, used for the lineare regression examples, in generating simulated covariates. Analogous to the setting in linear regression, we compared the performance of GBS to the standard bootstrap and the profile likelihood confidence interval when covariates are correlated and independent, with sample size and dimension of covariates . For each configuration, 100 replicates were conducted in evaluating MSE, the coverage of the CI, and actual computation time. The results are demonstrated in Table 3.

Correlated Independent
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.922 0.336 121.4+0.1 0.919 0.033 127.4+0.1 0.895 0.121 116.4+0.1 0.929 0.010 124.8+0.1
BT(1C) 0.939 0.339 25.1 0.945 0.033 134.5 0.943 0.123 24.3 0.953 0.010 129.3
BT(25C) 6.8 14.7 6.7 14.2
BT(GPU) > 12 hours > 12 hours > 12 hours > 12 hours
ProfileLik 0.949 0.322 0.2 0.955 0.033 0.6 0.957 0.114 0.2 0.955 0.010 0.6
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.942 0.530 246.5+0.1 0.953 0.239 280.3+0.1 0.928 0.255 236.4+0.1 0.944 0.105 267.8+0.1
BT(1C) 0.933 0.525 1284.2 0.937 0.240 3283.7 0.911 0.245 1387.9 0.926 0.104 3236.2
BT(25C) 60.5 148.9 66.2 147.5
BT(GPU) > 15 hours > 15 hours > 15 hours > 15 hours
ProfileLik 0.947 0.488 55.3 0.947 0.231 152.8 0.939 0.205 51.9 0.945 0.095 150.9
Method Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec) Cov MSE Time(sec)
GBS 0.952 0.938 607.5+0.1 0.923 0.359 683.9+0.1 0.913 0.625 571.8+0.1 0.971 0.275 655.4+0.1
BT(1C) NA NA > 2 days NA NA > 4 days NA NA > 2 days NA NA > 4 days
BT(25C) NA NA > 2 hours NA NA > 5 hours NA NA > 2 hours NA NA > 5 hours
BT(GPU) NA NA > 3 days NA NA > 1 week NA NA > 3 days NA NA > 1 week
ProfileLik NA NA > 5 hours NA NA > 13 hours NA NA > 5 hours NA NA > 13 hours
Table 3: Results for logistic regression models.

3.4 Quantile Regression Model

Quantile regression analysis has been commonly used in modeling a relation between a certain quantile of the response and covariates, and it models a conditional quantile of the response by a linear transform of the covariate as follows: for a given



where is the conditional -th quantile of the response given . Bootstrap samples of an estimate of can be obtained by setting the loss function in (1) as



. Unlike simple parametric models such as logistic and linear regression models, it is challenging to derive its asymptotic sampling distribution of the coefficient estimator in quantile regression. As a result, uncertainty quantification procedures, such as estimations of standard errors, variance-covariance matrices, and confidence interval, cannot be conducted from theoretical results. That is because direct estimation of the asymptotic sampling distribution requires an estimate of the regression error density function, and this density function estimation itself is an extremely difficult task under high-dimensional settings

(Koenker, 1994). Instead, in routine applications of quantile regression analysis, bootstrap procedures are popular to approximate the behavior of the sampling distribution (Feng et al., 2011; Hahn, 1995; Kocherginsky et al., 2005). However, the nature of resampling in bootstrap procedures imposes heavy burden in computation. When a practitioner is interested in investigating multiple quantile levels, it is also required to evaluate multiple bootstrap distributions at different quantile levels. These computational repetitions are disastrous and practically infeasible when the size of data is massive.

To relieve this computational bottleneck, we shall apply our GBS to bootstrapping quantile regression models by plugging in (2). We note that we make a minor modification from the original loss function (2) to account for varying . The main idea of the modification is that we can consider each optimizer of the loss function as a function of and . This point is reflected in the resulting loss function below:


where is the expectation operator on and , assuming that follows some distribution whose support is (0,1) and independent with . A default choice is that and independently follows a scaled Dirichlet distribution with unit parameter. Like the main idea of the GBS, this loss function (16) for the quantile regression views the optimizer of (1) as a function of weights and a quantile level . As a corollary of Theorem 2.1, one can show that a well-trained generator in (16) generates the same bootstrap samples with those produced from standard bootstrap procedures in (1) under the same weights:

Corollary 3.1.

Suppose that is the solution of (16) and for a given and . Then, .


This is an immediate corollary of Theorem 2.1. ∎

This corollary states that once the generator is trained, we can easily produce bootstrap samples of the quantile regression under a specific quantile level by just plugging in randomly generated ’s and the quantile level of interest into the generator. This means that the GBS procedure does not require any re-optimizations for a large number of different weights, which is usually computationally intensive when examining multiple different quantile levels. As a consequence, it is expected that the computing time of the bootstrapping would be significantly reduced without compromising the computational accuracy.

To examine a practical usefulness of this procedure, we examine an example considered in Feng et al. (2011), and a simulated data set is generated from a model with and , such as , where and . We choose

to be one for the first 80% of the observations and zero for the rest, and the other covariates are independently sampled from the standard log-normal distribution.

Figure 5 illustrates 95% confidence bands computed via the GBS and the standard bootstrap (WLB) over quantiles varying from 0.1 and 0.9. Not only the results show that our GBS generates almost identical bootstrap distribution with that of the standard bootstrap, but both 95% confidence bands also successfully cover the true coefficient. We note that without an repetitive computation, this results are obtained from just substituting different values of weights and a value of tuning parameter. In the next section, we further extend the idea of the GBS to other statistical procedures with repetitive computations, and we show that the GBS improve their computational efficiency.

Figure 5: Comparisons between 95% confidence bands from the GBS (left column) and the classical bootstrap (right column) across quantile. The top figure is for the coefficient of the intercept; the middle is for the coefficient of ; the bottom is for the coefficient of . The true coefficients are marked by red dotted lines.

3.5 Gaussian Mixture Model

We consider a Gaussian mixture model with components as


where and are mean and variance of the -th Gaussian component, respectively, for . This Gaussian mixture model can be modeled by a density function such as , where is a Gaussian density function and and for . Thus, we can construct the objective function of a bootstrap procedure for Gaussian mixture models as

Since we can interpret each as an output of a generator function from an input value , the GBS idea can be applied to this Gaussian mixture model by following as

where , , , and are the parameters of the -th Gaussian component. Also, the generator functions are mapping as , , , and , where is the space of -dimensional simplex, i.e. , and is the space of lower triangular matrices so that for any , is always positive definite. We parameterize the inverse of the covariance matrix as the product of the lower triangular matrix to avoid a matrix inversion in the likelihood.

Figure 6: An application of the GBS to a Gaussian mixture model. (a) estimated individual component proportion from the bootstrap distribution is illustrated as a mixture of red, green, and blue colors; (b) The bootstrap distribution of location parameters. The 100,000 bootstrap samples of the location parameters are marked by red dots, and the true locations are marked by X. (c) A bootstrap distribution of component proportion parameters. The colors (red, blue, and green) are matched with these in (a) for each cluster. (d) A bootstrap distribution of the off-diagonal term in the covariance matrix for each Gaussian component, and the true values are noted by X.

We consider a synthetic example of a 2-dimensional Gaussian mixture model whose sample size is moderate, but large enough to impede the use of boostrapping procedures due to computational burdens ().

The bootstrap distribution of this simulation setting is illustrated in Figure 6. The figure shows that the GBS reasonably induces bootstrap distributions as the bootstrap samples of , , and in Figure 6 (b) cover the true parameter well. Figure 6 (c) and (d) also present the bootstrap distribution distribution of and for , respectively. Because the classical bootstrap takes 44 days to evaluate 10000 bootstrap samples, we cannot compare the results from the GBS with that of the classical bootstrap. However, we note that the computation time of the GBS takes only 15 minutes to get 10000 bootstrap sample. This computational acceleration is not ignorable.

Finally, we make a summary of the computation times of the GBS and the classical counterpart for the considered models in Table 4. This results shows that the GBS improves the computational efficiency at a fold of hundred, compared with a parallel computing with 25 cores. Compared with a single core computing, the speed-up in computation is almost a fold of thousands.

Model Linear LAD Logistic GaussMix
GBS about 6 min about 6 min about 11 min about 15 min
BT(1C) > 2 weeks > 2 weeks > 4 days > 44 days
BT(25C) > 14 hours > 20 hours > 5 hours > 2 days
Table 4: A brief summary of the comparisons of computation time in generating 10,000 bootstrap samples.

4 Some Extensions of Generative Bootstrap Sampler

4.1 Tuning Parameter Selection

In practice, tuning parameter selection has been a thorny issue and computationally expensive for many machine learning algorithms. A main reason is that many candidates of the tuning parameters need to be considered and for each candidate the same computational (optimization) procedure has to be repeated and an evaluation score (such as the cross-validation error or the value of an information criterion) of the resulting prediction model is computed. However, when the data size is massive, the required repetitive computations can be practically infeasible. To overcome this issue, we apply the GBS in computing the MLEs for different tuning parameters, and the results show that our GBS dramatically improves computational efficiency by bypassing undesirable repetitions in computation.

Consider a model with a loss function for a tuning parameter , and a popular example is a penalized likelihood via setting , where is a likelihood function and is a penalty function on . These forms are common and some useful examples include Gaussian process regression (Kirk and Stumpf, 2009; Rasmussen and Williams, 2006) and splines (Wahba, 1978, 1990), LASSO (Tibshirani, 1996), group LASSO (Yuan and Lin, 2006), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), etc. In a framework of the GBS, the tuning parameter can be viewed as an extra input of the generator function , and this point can be formalized as:


where is the expectation operator with respect to and , and is assumed to follow a pre-specified distribution that attains a large enough amount of density on interesting regions in the tuning parameter space. It is also reasonable to assume that and are independent. When a practitioner is not interested in constructing a bootstrapping distribution and only wants to focus on tuning parameter evaluation, he or she can just simply set for .

LASSO. A popular example of penalized likelihood procedures is the LASSO, and the corresponding loss function for the GBS version is expressible as


where is the norm. By changing the log-likelihood part and the penalty part in (19), we can easily apply the GBS to other penalized likelihood procedures.

Like the vanilla GBS in the previo