1 Introduction
The numerical evaluation of integrals is a foundational aspect of mathematics, with impact on diverse areas such as finance, uncertainty quantification, stochastic programming, and many others. Monte Carlo (MC) sampling is arguably the most robust means of estimating such integrals and can be easily applied to arbitrary integration domains and measures. The MC estimate of an integral is unbiased, and its rate of convergence is independent of the number of variables and smoothness of the integrand.
However, when the integrand consists of a function that is expensive to evaluate, for instance a highfidelity computational simulation, obtaining a MC estimate with even moderate accuracy can be infeasible because the variance of a MC estimator is proportional to the ratio of the variance of the true integral and the number of samples used. Control variate (CV) techniques have a long history of enhancing MC estimates of integrals [18, 20, 19, 14]. They reduce the number of samples to compute an integral by introducing additional functions, which are correlated with the integrand, to reduce the variance of its estimate.
The use of CV methods has recently seen a resurgence, driven by the field of uncertainty quantification (UQ). In UQ, the need to evaluate statistics of computationally expensive highfidelity models is of paramount importance, and multiple information sources, arising from multiple simulation models, are often available to accelerate the convergence of statistics [22, 11, 27, 2, 6]. Such models can be used to accelerate algorithms for both forward UQ problem and inverse UQ [1]. A set of models can result from simulating different sets of equations (i.e., the multifidelity case of differing model forms) and from varying temporal and spatial discretizations (i.e., the multilevel case of differing numerical resolutions for the same set of equations). Multiple conceptual dimensions can exist within a modeling hierarchy, leading to the generalized notion of multiindex MC methods [12], although it can be important to distinguish dimensions which define a convergent sequence (i.e., multilevel) from those that do not (e.g., multifidelity). In the most general setting, the model ensemble could include reducedorder models [28], dimensionreduction or surrogate models [23] (e.g., active subspace approximations), and even data from physical experiments [17]. Finally, both multiphysics and multiscale simulations contribute additional combinatorial richness to the modeling ensemble.
Traditional CV methods [18] require explicit knowledge of the statistics (for instance the expected value) of their approximate information sources, however these estimates are frequently unavailable a priori in the UQ simulationbased context. Consequently, CV methods must be modified to balance the computational cost of evaluating lower fidelity models and the reduction in error they each provide. Existing strategies for using these CV techniques in the UQ context have therefore focused on slightly modified frameworks. For instance, in the case where additional information sources arise from a hierarchy of discretization levels, multilevel and multiindex Monte Carlo (MLMC) [8, 12]
use the additional information sources through the estimation of statistics of discrepancy terms. When the variance of these discrepancy decays, significant variance reduction can be achieved. In more general contexts, heuristic estimators have been developed and analyzed
[24, 26]. These estimators use additional samples of lowfidelity information sources to obtain approximations of their missing statistical information. Finally, more flexible multilevelmultifidelity estimators that combine these notions have also been investigated [3, 5, 6].Within these multilevel, multifidelity, and multilevelmultifidelity UQ strategies, an optimal sample allocation is defined across different model instances for the case where a recursive model dependency structure has been defined a priori. In this paper, we present a framework to leverage control variate techniques in the context where both the controlvariate means and the model dependency structure are unknown. We demonstrate how existing algorithms, e.g., multilevel Monte Carlo (MLMC) [9] and multifidelity Monte Carlo (MFMC) [21, 26], are suboptimal realizations of this framework in that they do not extract all possible value from the correlations among information sources.
There are two mechanisms by which control variates can be used to reduce the variance of a MC estimator: (1) increasing the number of high fidelity evaluations and (2) leveraging the correlations of the lowfidelity models to the greatest extent possible. An optimal way to reduce the variance uses both of these mechanisms to allocate samples amongst all available information sources to reduce the variance per unit cost. In this paper, we provide a framework to exploit the second pathway for variance reduction in a more optimal way. Furthermore, we then embed this framework into an optimal sample allocation scheme. In particular, we extend the control variate theory cited above to the case of unknown controlvariate means and make three primary contributions:

New theoretical results demonstrating the suboptimality (in term of maximum possible variance reduction for a fixed highfidelity number of evaluations) of MLMC and MFMC.

Within the context of MLMC, we derive a novel estimator that, for fixed resources, is provably more accurate than the standard MLMC estimator.

Within the context of MFMC, we derive a dropin replacement (“ACVMF”) that can address this suboptimality.


A framework for new control variate algorithms that, for estimated control means, converges toward a level of variance reduction that can be orders of magnitude greater than the maximum level of variance reduction achievable by MLMC and MFMC.

Initial sample allocation strategies for optimal variance reduction given a fixed computational cost.
A significant gap can exist, of orders of magnitude in some cases, between the variance reduction achievable by current schemes and our generalized schemes. The initial sample allocation strategies we present for exploiting this gap are most effective when the number of highfidelity samples that can be generated is fixed or the cost of the high fidelity model is significantly more expensive than the low fidelity models. This situation occurs in practice when the high fidelity model is expensive to run and requires significant computational resources. In this scenario, our estimators can achieve greater variance reduction by changing the allocation and pairings of samples for each of the approximate models used as control variates. Central to these observations, we prove the following two results:

An approximate control variate scheme can be devised to converge to the optimal linear control variate in the limit of infinite data (Theorem 5).
The remainder of the paper is structured as follows: in Section 2, we describe a unifying framework for multimodel control variate approaches and then identify existing multilevel and multifidelity estimators within this general formalism; in Section 3, we develop a set of new approximate control variate estimators; and, in Section 4, we demonstrate our approaches for several prototypical applications in fluid dynamics.
2 A unifying approximate control variate framework
In this section we provide background on control variate methods and present a generalized framework for estimating integrals using control variates with unknown mean. Then we show how existing methodologies (MLMC and MFMC) are suboptimal instances of this general framework.
Let
be a probability space and
, be a vectorvalued random variable defined on this space. Let
denote a mapping from to a scalar valued output, we will call this mapping a “highfidelity model” because it will represent the process whose statistics we desire to estimate. In particular, our goal is to generate an estimate of using a set of samples of the input random variables and the corresponding evaluations of the function at those samples . The MC estimate of the mean(1) 
is unbiased, that is , and conveges a.s. For
with bounded variance, the Central Limit Theorem implies that the error in the estimate becomes normally distributed with variance
, as .2.1 Traditional control variate estimation
The MCbased^{2}^{2}2Controlvariatebased variance reduction algorithms can certainly be performed with other samplingbased and nonsamplingbased estimators. In this paper we focus on the commonly Monte Carlo estimators. linear CV algorithm seeks a new estimator that has smaller variance than while still requiring only evaluations of . The algorithm works by first introducing an additional random variable with known mean . Then it requires computing an estimator of using the same samples that were used for . Finally, these quantities are assembled into the following estimator
(2) 
for some scalar ^{3}^{3}3In the rest of the paper we drop the term “linear.” It is assumed that any scheme that is discussed is linear. For reference, nonlinear control variates are those that may have polynomial corrections of the form . As we will see shortly, the variance of is strongly influenced by the correlation between and , and greater variance reduction is achieved for higher correlation. To this end we will call the CV correlated mean estimator (CME), and we will refer to as the control variate mean (CVM). This approach can be extended to information sources , see e.g., [18, 20, 19], using
(3) 
where the estimator now uses a vector of CV weights ^{4}^{4}4CV methods can estimate multiple quantities of interest, e.g. [29, 31]. For simplicity we only consider a scalar ..
The effectiveness of the control variate algorithm is measured by the variance reduction ratio ratio between the variances of and :
(4) 
The optimal CV (OCV) uses weights that minimize this variance:
(5) 
Following [20], let denote the covariance matrix among and denote the vector of covariances between and each . The solution to Equation (5) is . If we further define , where is the Pearson correlation coefficient between and , then the variance reduction corresponding to these weights becomes
(6) 
where . When , then we have no advantage over MC, and when , we have the maximum variance reduction possible. When there is only a single approximate information source, this quantity is equivalent to .
2.2 Approximate control variate estimation
Traditional control variate estimation assumes that the means are known. In our motivating problem of multifidelity uncertainty quantification, these means are not known, but rather must be estimated from lower fidelity simulations. To this end, let denote a set of samples used to evaluate the highfidelity function . Furthermore, let denote an ordered set of samples partitioned into two ordered subsets and such that . Note that and are not required to be disjoint, i.e., in many strategies they overlap so that . We will construct and analyze approximate CV estimators of the following form
(7) 
where , and we use to denote the input values to the highfidelity model and each CV model . Here we use additional samples to estimate each mean , and this estimate is denoted by and called the estimated control variate mean (ECVM). All of the estimators in the literature, and those that we derive in this paper, are differentiated by how the samples are related among the CVs and how the subsets and are defined within each CV. It will be useful to consider ordered sets of samples such that In this case, if and intersect then for some and
Because we must now use extra samples to estimate , we will have to account for the additional cost this imposes. Let denote the number of available realizations of and
denote the total number of available evaluations of the th CV for scaling factor . Let denote the ratio between the cost of a single realization of control variates and the cost of obtaining a realization of the highfidelity . Then the cost of the CV estimator is
(8) 
For fixed , the estimator is unbiased when each component, , , and for is unbiased. The variance of is given in the following proposition.
Proposition 1 (Variance of the approximate CV estimator)
Proof
This result follows directly from the the rules regarding the variance of MC estimators, the variance of scaled random variables, and the variance of sums of random variables. We have
where the middle term on the first line is the definition, the last term follows from the variance of sums of random variables, and the second line factors out the baseline estimator variance.
This result means that the variance ratio of the approximate control variate (ACV) estimator is
(10) 
The optimal approximate CV estimator consists of the which minimizes this ratio. The properties of this optimal estimator are given in the following proposition.
Proposition 2 (Optimal approximate CV)
Assume is positive definite and . The CV weight that provides the greatest variance reduction for the approximate CV Equation (7) is given by
(11) 
with corresponding estimator variance
(12) 
Proof
The variance reduction is a quadratic function of and so its extremum can be found by setting the gradient to zero
If we substitute this weight into Equation (9), we obtain the stated result
Proposition 2 provides a means to construct and theoretically compare different control variate estimators. In the following, we will use this tool to derive a number of existing estimators which use different sampling strategies to compute and .
2.3 Multilevel Monte Carlo (MLMC)
The MLMC estimator [8] is inspired by ideas drawn from multigrid solvers for PDEs. It posits that if a stochastic process can be simulated using various resolution methods, then low resolution simulations can be used to obtain a reduction in the variance of the statistics of higher resolution simulations. An initial set of motivating applications included stochastic differential equations [7, 9, 13], but it has been extended also to elliptic problems [30] (among others), filtering problems [15] and a large number of other scientific and engineering applications [16, 6, 4, 10].
In this section, we provide several new contributions. First, we derive a novel expression for the variance reduction provided by MLMC that is a function of the correlations between each model. Second, we identify that MLMC is suboptimal in that it cannot provide a greater variance reduction than the optimal control variate estimator with one control variate (OCV1), regardless of how many resolutions or how much data is used. Third, we provide a simple enhancement to the standard MLMC assembly strategy to provide a provably greater variance reduction for virtually no additional cost.
Let be a sequence of random variables, ordered bottomup^{5}^{5}5Here, we use instead of because in this paper is always of “higher fidelity” than , for and MLMC traditionally reverses these notions. We use a different letter and then provide a mapping back to to reflect this discrepancy. from coarsest resolution at level 0 to finest resolution at level . The MLMC estimator seeks the expectation of by rewriting it as a telescopic sum . Then MLMC calls for (by definition) estimating the expectation using independent sample sets for each level , by summing the following sequence of estimators
(13) 
From the cancellation of terms, we see that this estimator is still an unbiased estimate of
.To link this expression to Eq. (7), we make the following associations: the number of control variates is , the finest resolution level is our quantity of interest , and the order of is reversed to obtain our control variates for . Using this assignment it is clear that the the same sum can be written as
(14) 
where we have introduced independent sample sets for each level with for and From here it is clear that two estimators of each QoI are obtained; one estimator uses and the other Thus we can identify the ECVM of with the estimator that uses since this is the set with more samples. Accounting for the sample set for the high fidelity model we can rewrite this equation as
Finally, repairing terms to reflect the same level but different sample sets, we arrive at the form of the approximate CV in (7)
(15) 
where we have set and for and for all . It is important to note that the sampling strategy implies a recursive underpinning to the MLMC strategy. In this sense, MLMC can be derived by starting with a single with unknown CVM. Then we introduce as a control variate for the ECVM. It is for this reason that the samples between and are shared.
Given this partitioning, the receptive cardinality of and is and , and
(16) 
for . Here the evaluation ratios for each model are and because . The sampling scheme is illustrated in Figure 0(a) for reference. The variance reduction of the MLMC estimator is given in the following proposition.
Lemma 1 (Variance reduction of MLMC)
Let the MLMC estimator be defined as
(17) 
with the sampling strategy: . Each ordered set of samples is partitioned according to , , , and for . The variance of this estimator is
(18) 
where
(19) 
is the ratio of the standard deviations,
is the Pearson correlation coefficient between , and are defined recursively according to
The proof is provided in Appendix A. MLMC provides a smaller variance reduction than the single optimal CV.
Theorem 1 (Maximum variance reduction of MLMC)
The variance reduction of MLMC is bounded above by the optimal single CV i.e.,
(20) 
The proof is provided in Appendix B. This bound is a result of the sampling strategy used by MLMC, which does not allocate any of the samples used to evaluate the high fidelity model to the models for . Because and do not share samples there is no correlation between the estimators of and for . Ignoring these correlations limits the maximum attainable variance reduction of the final estimator to the one corresponding to OCV1; specifically, the control variates , for , are used to accelerate convergence of to OCV1 using model .
Note that the bound given in Theorem 1 does not depend on the number of samples of each . Even with an infinite number of samples of the CVs, the manner in which the samples are distributed restricts the variance reduction achieved. As far as we are aware, this suboptimality has not been discussed previously in the literature, where the focus has been to show some variance reduction as compared with MC.
This suboptimality property can have a significant effect in practice. We will show that significantly greater variance reduction can be obtained with essentially no additional computational expense by simply redistributing the sample allocation (even within the two components of ).
Finally, we briefly mention that multiindex Monte Carlo MIMC algorithm [12] is a multidimensional analogue of MLMC in that it considers a two (or multi) dimensional hierarchy of models This estimator can also be written in the language of control variates; whereas MLMC introduces a single to reduce the variance in estimating the ECVM , MIMC introduces several control variates to reduce the variance of the ECVM . As a result MIMC can be more efficient because it has the potential for greater variance reduction; however, it still suffers from a similar suboptimality, with respect to metric, because it still uses lowerfidelity models to only accelerate convergence of the of the immediate higherfidelity models rather than of – in other words they are still recursive.
2.3.1 Weighted Multilevel Monte Carlo
A small modification of the MLMC algorithm can reduce its variance with no additional computational cost. This improvement is obtained by replacing the weights with the optimal weights given by Proposition 2. Since these weights guarantee that the estimator achieves the greatest variance reduction, using these weights will yields an estimator that employs the same sampling strategy as the standard MLMC algorithm (i.e., sampling in a sequence of model pairs), but has a lower variance (error).
Pseudocode for constructing the optimal multilevel MC estimator is given in Algorithm 1. The values of and are provided in Appendix A. Here we assume that the allocation of the samples is given a priori. For example a typical MLMC allocation strategy that attempts to balance physical prediction bias and statistical error can be used. Such an allocation may no longer be optimal. We discuss how to construct an optimal allocation strategy in Section 3.3.
2.4 Multifidelity Monte Carlo (MFMC)
An alternative to MLMC, called Multifidelity Monte Carlo (MFMC), was recently proposed in [26, 22, 24, 27]. In this section, we provide several new results regarding this sampler. First, we derive a new variance reduction expression, an alternative to [26, Lemma 3.3], that relates the correlation of the CVs to the variance reduction. Second, we use this expression to demonstrate that MFMC is a recursive estimator, like MLMC, and therefore is also suboptimal. That is, the maximum variance reduction provided by this estimator is limited to that provided by the single optimal CV, regardless of how many samples are used to evaluate .
The MFMC estimator can be obtained from the following recursive procedure. Samples are used to obtain and the CME () of . Then an enriched set of samples are used for the ACVM () of to obtain
(21) 
where we have chosen and . In other words, consists of the original samples along with an additional set.
Now, let us introduce another CV to reduce the variance of the according to
where we have again set and
Continuing in this recursive pattern and collapsing products, we obtain the MFMC estimator
(22) 
and the sampling strategy partitions into and according to
for and and . The sampling scheme for the MFMC estimator is illustrated in Figure 0(b) for reference.
Unlike the MLMC estimator, this estimator has free weights that we can tune, and the optimal weights are given in the following Lemma.
Lemma 2 (Optimal CV weights for MFMC)
The optimal weights for the MFMC estimator are
(23) 
The proof is provided in Appendix C, and provides an alternate derivation to the identical result in [26, Th. 2.4], but it does not require the same assumptions; in the prior work, these weights were derived using assumptions on model costs and correlation orderings. The next Lemma provides a new representation of the variance reduction of this estimator in terms of the correlations between model pairs.
Lemma 3 (Variance reduction of the optimal MFMC)
Assume and let . The variance reduction of the optimal MFMC estimator is
(24) 
where
(25) 
The proof is provided in Appendix D. Under the assumption made in [26] that, without loss of generality (can be constructed by reordering), for , we can use Lemma 3 to show that the MFMC estimator cannot perform better than the optimal single CV, regardless of how many samples are allocated to the control variate models.
Theorem 2 (Maximum variance reduction of MFMC)
The variance reduction of MFMC is bounded above by the optimal single CV, i.e.,
(26) 
The proof is provided in Appendix E. The variance reduction of MFMC approaches in the limit of infinite data. Though the MFMC estimator is limited by the highest correlation , it will be able to approach this limit more efficiently than a single lowfidelity CV because is often larger than , given the (assumed) decreasing computational cost in the sequence from the first to the last lowfidelity model . Similarly to the MLMC case, the suboptimality of MFMC is due to the recursive nature of this estimator. Each successive CV is used to enhance convergence of .
2.5 Summary and example
Table 1 below summarizes the sample distribution and the variance reduction ratio of the stateoftheart methods described above, and Figure 1 pictorially illustrate the existing variance reduction algorithms that do not require a known mean.
Algorithm  Relation between and  Reduction ratio  

OCV  
OCV1  
MLMC [8]  , for  ,  
MFMC [26]  for all  , 
To demonstrate these results we consider the following simple monomial example: let and for , where . The correlation matrix for this problem is given in Table 2.
1  0.994995  0.975042  0.927132  0.820633  
0.994995  1  0.992172  0.958367  0.865941  
0.975042  0.992172  1  0.986021  0.916385  
0.927132  0.958367  0.986021  1  0.968153  
0.820633  0.865941  0.916385  0.968153  1 
As we have not yet introduced a cost model, we first explore the performance of these methods through the lens of an assumed sample ratio so that for . As a result, the sample allocations across are prescribed by , rather than computed from known and estimated ; this is the viewpoint taken in Figures 2, 4, and 5. We explore the introduction of and resulting optimal allocation strategies later in Sections 3.3.1 and 4.1. These two views highlight two critical features for these algorithms: (a) the existence of a significant gap between OCV1 and OCV, and (b) an effective ability to exploit this gap to achieve computational savings.
Figure 2 shows the variance reduction ratio for the described algorithms. The dotted horizontal lines provide baseline variance reduction ratios corresponding to MC (), the single OCV (OCV1), double OCV (OCV2), triple OCV (OCV3), and the optimal control variate that uses all the models (OCV). We clearly see that the variance reduction of the existing estimators is bounded by that provided by OCV1, as established by Theorems 1 and 2 and summarized in Table 1.
To summarize, this example shows that when the number of high fidelity samples is fixed, the existing stateoftheart estimators are suboptimal. As such, in applications where it is infeasible to obtain more samples of the highest fidelity model, we can potentially achieve orders of magnitude greater variance reduction. We want to emphasize that we are aware of no existing methods in the form (7) that can converge to the OCV estimator (with known means) as the amount of data increases. This defines our goal for the discussion to follow: overcome the suboptimality of the existing estimators by developing algorithms that can converge to the optimal control variate estimator (red OCV line in Figure 2).
3 Approximate CV
In this section we use Propositions 1 and 2 to derive approximate CV (ACV) estimators that converge to the knownmean optimal control variate (OCV) estimators with increasing data.
3.1 Two convergent estimators
The most straightforward way to guarantee convergence of an ACV estimator to an OCV estimator is to set for each CV and then to use all available samples to estimate , i.e., . In other words, we use the exact same input samples for as those used for ; any additional samples available for each control variate to evaluate .
The formal definition of this estimator, which we coin approximate CVIS (ACVIS) where IS stands for “independent samples”, is
Definition 1 (AcvIs)
Let , and for and . Then the estimator ACVIS is defined as
(27) 
The estimator only requires shared evaluations for the input samples that make up and each . The rest of the samples for each control variate are completely independent. As a result, an attractive feature of the sample distribution strategy for ACVIS is that each control variate can be evaluated separately and in parallel. The sampling scheme for the ACVIS estimator is illustrated in Figure 2(a) for reference.
The optimal control variate weights and variance reduction ratio for ACVIS are now given.^{6}^{6}6In this paper, denotes a Hadamard, or elementwise, product.
Theorem 3 (Optimal CVweights and variance reduction for ACVIS)
The optimal CV weights and resulting estimator variance for the ACVIS estimator are
(28) 
and
(29) 
where
(30) 
and such that
(31) 
The proof is provided in Appendix F.
Next we consider the case where the samples in the sets are not independent among models, but rather use overlapping sets. We call this estimator ACVMF.
Definition 2 (AcvMf)
Let , and for and . Then ACVMF estimator is
(32) 
Note that this estimator can use an identical set of samples to the MFMC estimator and can thus be considered a dropin replacement. The only difference is that is evaluated using only the first samples instead of . Furthermore, using less samples for does not cause loss of accuracy because the control variates method does not require an accurate estimate of in terms of how close it is to , it requires an estimator that is correlated to and unbiased. The sampling scheme for the ACVMF estimator is illustrated in Figure 2(b) for reference.
The optimal weights and resulting variance reduction of the ACVMF estimator are now provided.
Theorem 4 (Optimal CVweights and variance reduction for ACVMF)
The optimal CV weights and resulting estimator variance for the ACVMF estimator are
(33) 
and
(34) 
where