We consider optimization problems in the form
is a probability measure governing the random variableon some space and is a set depending on . Problem (1.1) enforces a solution to satisfy with high probability, namely at least . This problem is often known as a probabilistically constrained or chance-constrained program (CCP) (Prékopa, 2003). It provides a natural framework for decision-making under stochastic resource capacity or risk tolerance, and has been applied in various domains such as production planning (Murr and Prékopa, 2000), inventory management (Lejeune and Ruszczynski, 2007), reservoir design (Prékopa and Szántai, 1978; Prékopa et al., 1978), communications (Shi et al., 2015), and ranking and selection (Hong et al., 2015).
We focus on the situations where is unknown, but some i.i.d. data, say , are available. One common approach to handle (1.1) in these situations is to use the so-called scenario optimization (SO) or constraint sampling (Campi and Garatti, 2011; Nemirovski and Shapiro, 2006). This replaces the unknown constraint in (1.1) with , namely, by considering
Note that CCP (1.1) is generally difficult to solve even when the set is convex for any given and the distribution is known (Prékopa, 2003). Thus, the sampled problem (1.2) offers a tractable approximation for the difficult CCP even in non-data-driven situations, assuming the capability to generate these samples.
Our goal is to find a good feasible solution for (1.1) by solving (1.2) under the availability of i.i.d. data described above. As the sample size increases, the number of constraints in (1.2) increases and one expects them to sufficiently populate the safety set , thus ultimately give rise to a feasible solution for (1.1). To make this more precise, we first mention that because of the statistical noise from the data, one must settle for finding a solution that is feasible with a high confidence. More specifically, define, for any given solution ,
to be the violation probability of under probability measure that generates . Obviously, is feasible for (1.1) if and only if
We would like to obtain a solution, say , from the data such that
where is the distribution that generates the i.i.d. data , and is a given confidence level (e.g., ). In other words, we want to satisfy the chance constraint in (1.1) with the prescribed confidence.
Under the convexity of and mild additional assumptions, the seminal work (Campi and Garatti, 2008)
provides a tight estimate on the required data sizeto guarantee (1.4). They show that a solution obtained by solving (1.2) satisfies
with equality held for the class of “fully-supported” optimization problems (Campi and Garatti, 2008). Thus, suppose we have a sample size large enough such that
However, in small-sample situations in which the data size is not large enough to support (1.6), the feasibility guarantee described above may not hold. It can be shown (Campi and Garatti, 2008) that the minimum that achieves (1.4) is linear in and reciprocal in , thus may impose challenges especially in high-dimensional and low-tolerance problems. Similar dependence on the key problem parameters also appears in other related methods such as in (De Farias and Van Roy, 2004; Campi and Garatti, 2011; Luedtke and Ahmed, 2008). Several recent lines of techniques have been suggested to overcome these challenges and reduce sample size requirements, including the use of support rank and solution-dependent support constraints (Schildbach et al., 2013; Campi and Garatti, 2018), regularization (Campi and Carè, 2013), and sequential approaches (Carè et al., 2014; Calafiore et al., 2011; Chamanbaz et al., 2016; Calafiore, 2017).
In this paper, we offer a different path to alleviate the data size requirement than the above methods, when possesses known parametric structures. Namely, we assume for some parametric family of distribution, where satisfies two basic requirements: It is estimatable, i.e., the unknown quantity or parameter can be estimated from data, and simulatable, i.e., given , samples from can be drawn using Monte Carlo methods. Under these presumptions, our approach turns the CCP (1.1), with an unknown parameter, into a CCP that has a definite parameter and a suitably re-adjusted tolerance level, which then allows us to generate enough Monte Carlo samples and consequently utilize the guarantee provided from (1.5). On a high level, this approach replaces the data size requirement in using (1.2) (or, in fact, any of its variant methods) with a Monte Carlo size requirement, the latter potentially more available given cheap modern computational power. Our methodological contributions consist of the development of procedures, related statistical results on their sample size requirement translations, and also showing some key differences between parametric and nonparametric regimes.
Our approach starts with a distributionally robust optimization (DRO) to incorporate the data-driven parametric uncertainty. The latter is a framework for decision-making under modeling uncertainty on the underlying probability distributions in stochastic problems. It advocates the search for decisions over the worst case, among all distributions contained in a so-called uncertainty set or ambiguity set (e.g.,(Wiesemann et al., 2014; Delage and Ye, 2010; Goh and Sim, 2010; Ben-Tal et al., 2013)). In CCP, this entails a worst-case chance constraint over this set (e.g., (Hanasusanto et al., 2015; Zymler et al., 2013; Hanasusanto et al., 2017; Li et al., 2019; Jiang and Guan, 2016; Zhang et al., 2016; Hu and Hong, 2013)). When the uncertainty set covers the true distribution with a high confidence (i.e., the set is a confidence region), then feasibility for the distributionally robust CCP would convert into a confidence on the feasibility for the original CCP. We follow this viewpoint and utilize uncertainty sets in the form of a neighborhood ball surrounding a baseline distribution, where the ball size is measured by a statistical distance (e.g., (Petersen et al., 2000; Hansen and Sargent, 2008; Love and Bayraksan, 2015; Dupuis et al., 2016; Lam and Zhou, 2017; Lam, 2019; Gotoh et al., 2018; Duchi et al., 2016; Esfahani and Kuhn, 2015; Blanchet and Murthy, 2016; Blanchet et al., 2016; Gao and Kleywegt, 2016)). In the parametric case, a suitable choice of this distance (such as the -divergence that we focus on) allows easy and meaningful calibration of the ball size from the data, so that the resulting DRO provides a provable feasibility conversion to the CCP.
Our next step is to combine this DRO with Monte Carlo sampling and scenario approximation. The definition of DRO means that there are many possible candidate distributions that can govern the truth, whereas the statistical guarantee for SO assumes a specific distribution that generates the data or Monte Carlo samples. To resolve this discrepancy, we select a generating distribution that draws the Monte Carlo samples, and develop a translation of the guarantee from a fixed distribution into one on the DRO. Moreover, we investigate the optimal choice of the generating distribution in relation to the target DRO, in the sense of requiring the least Monte Carlo size. We show that, if there is no ambiguity on the distribution (i.e., a standard CCP), or when the uncertainty set of a DRO is constructed via a divergence ball in the nonparametric space, the best generating distribution is, in a certain sense, the true or the baseline distribution at the center of the ball. However, if there is parametric information, the optimal choice of the generating distribution can deviate from the baseline distribution in a divergence-based DRO. We derive these results by casting the problem of selecting a generating distribution into a hypothesis testing problem, which connects the sampling efficiency of the generating distribution with the power of the test and the Neyman-Pearson lemma (Lehmann and Romano, 2006). The results on DRO in particular combines this Neyman-Pearson machinery with the established DRO reformulation of chance constraints in (Jiang and Guan, 2016; Hu and Hong, 2013), with the discrepancy between the best generating distribution and the baseline distribution in the parametric case stemming from the removal of the extremal distributions in the corresponding nonparametric uncertainty set. These connections among hypothesis testing, SO and DRO are, to our best knowledge, the first of its kind in the literature.
Finally, given the non-optimality of the baseline distribution of a divergence-based DRO in generating Monte Carlo samples, we further develop procedures to search over generating distributions that improve upon this baseline. On a high level, this can be achieved by increasing the sampling variability to incorporate the uncertainty of the distributional parameters (one may intuit this from the perspective of a posterior distribution in a Bayesian framework). We provide several classes of distributions with such a variability enlargement, and study descent-type algorithms to search for good distributions in these classes.
We conclude this introduction by briefly discussing two other lines of related literature. The first is the so-called robust Monte Carlo or robust simulation that, like us, also considers using Monte Carlo sampling together with DRO (Hu et al., 2012; Glasserman and Xu, 2014; Lam, 2016; Hu and Hong, 2015; Lam, 2018; Ghosh and Lam, 2019). However, this literature focuses on approximating DRO with stochasticity in the objective function, and does not study the chance constraint feasibility and SO that constitute our main focus. Second, (Erdoğan and Iyengar, 2006) considers a scenario approach to distributionally robust CCP with an uncertainty set based on the Prohorov distance. Like (De Farias and Van Roy, 2004), (Erdoğan and Iyengar, 2006) utilizes the Vapnik-Chervonenkis dimension in studying feasibility, in contrast to the convexity-based argument in (Campi and Garatti, 2008) that we utilize. More importantly, we aim to optimize the efficiency of Monte Carlo sampling in handling limited-data CCP, thus motivating us to study the choice of distance, calibration schemes, and selection of generating distributions that are different from (Erdoğan and Iyengar, 2006).
2. From Data-Driven DRO to Scenario Optimization
This section introduces our overall framework. Recall our goal as to find a good feasible solution for (1.1), and suppose that we have an i.i.d. data size possibly less than the requirement shown in (1.6). As discussed in the introduction, we first formulate a DRO that incorporates the parametric estimation noise and subsequently allows us to resort to Monte Carlo sampling to obtain a feasible solution for (1.1). In the following, Section 2.1 first describes the basic guarantees from DRO. Section 2.2 investigates Monte Carlo sampling that provides guarantees on DRO. Section 2.3 discusses the choice of the uncertainty set.
2.1. Overview of Data-Driven DRO
For concreteness, suppose the unknown true distribution , the class of possible probability distributions for (to be specified later). Given the observed data , the basic steps in our data-driven DRO are:
Step 1: Find a data-driven uncertainty set such that
where denotes the measure generating the data .
Step 2: Given , set up the distributionally robust CCP:
where the probability measure is the decision variable in the .
Step 3: Find a solution feasible for (2.2).
2.2. Monte Carlo Sampling for DRO
To use the above procedure, we need to provide a way to construct the depicted and to find a (confidently) feasible solution for (2.2). We postpone the set construction to the next subsection and focus on finding a feasible solution here. We resort to SO, via Monte Carlo sampling, to handle (2.2). Note that, unlike in the standard SO discussed in the introduction, the distribution here can be any candidate within the set . Thus, let us select a generating distribution, called (which can depend on the data), to generate Monte Carlo samples , and solve
For convenience, denote, for any ,
From the result of (Campi and Garatti, 2008) discussed in the introduction, using or more Monte Carlo samples from in (2.4) would give a solution (highlighting the dependence on ) that satisfies (where the two are independent) with confidence level . This is not exactly the distributionally robust feasibility statement for problem (2.2). To address this discrepancy, we consider, conditional on the data ,
If we can bound the optimal value in (2.6), then we can trace back the level of that is required to ensure a chance constraint validity of tolerance level . However, the event involved in defining and , namely , can be challenging to handle in general. Thus, we relax (2.6) to
Conditional on the data , the optimal value of optimization problem (2.7), which we denote , is clearly an upper bound for that of (2.6). In fact, it is also clear from (2.7) that is non-decreasing in and
by simply taking and in (2.7). We have the following guarantee:
Theorem 2.2.1 ().
Given , and , suppose there exists small enough such that
where is the measure governing the real-data generation under the true distribution , then the obtained solution would be feasible for (1.1) with confidence at least .
where is the measure with respect to the Monte Carlo samples drawn from . Moreover, based on the monotonicity property of and (2.8), we have
Thus (2.9) implies that
Thus, if we denote to be entire sequence consisting of real data and the generated Monte Carlo samples, it then follows that
Theorem 2.2.1 can be cast in terms of asymptotic instead of finite-sample guarantees by following the same line of arguments. We summarize it as the following corollary.
Corollary 2.2.2 ().
To summarize, in the presence of data insufficiency, if we choose to satisfy the confidence property (2.1), and are able to evaluate the bounding function that translates the violation probability under to a worst-case violation probability over , then we can run SO with Monte Carlo samples from to obtain a solution for (1.1) with confidence .
Finally, we also note that the above scheme still holds if the in (2.5) is replaced by the sample size requirements of other variants of SO (e.g., FAST (Carè et al., 2014)) that are potentially smaller. This works as long as we stay with the same SO-based procedure in using the Monte Carlo samples. For clarity, throughout most of our exposition we will focus on the sample size requirement depicted in (2.5), but we will discuss other variants in our implementation and numerical sections.
2.3. Constructing Uncertainty Sets: A Preliminary Discussion
In this section we give some preliminary discussion on the construction of the uncertainty set , with further improvement building upon this discussion in Section 3. We assume the true distribution of lies in a parametric family. We denote the true parameter as . To highlight the parametric dependence, we call the true distribution indexed by . Given data , we want to construct an uncertainty set satisfying
so that Corollary 2.2.2 applies. To do so, we first estimate from the data. There are various approaches to do so; here we apply the common maximum likelihood estimator (MLE) , and set to be
where is the quantile of , the
-distribution with degree of freedom, and is the -divergence between two probability measures, i.e., given a convex function , with , a distance between two probability measures and defined as
assuming is absolutely continuous with respect to with Radon-Nikodym derivative on . Moreover, we assume that is twice continuously differentiable with , and if necessary set the continuation of to as for . In (2.17), we call the center of the divergence ball, , to be the baseline distribution.
To guarantee desirable asymptotic properties of our uncertainty set, we make the following assumption:
Let be the true parameter and let be the MLE of estimated from i.i.d. data points. Then, as , consistency and asymptotic normality of holds:
where is the fisher information for the parametric family with well-defined inverse that is continuous in the domain .
Under Assumption A1, it can be shown (Pardo, 2005; Vaart, 1998) that in (2.17) satisfies the confidence guarantee (2.16). Furthermore, since we can identify each in with , we can equivalently view as a subset of , and write it as
For convenience, we shall use the two definitions of interchangeably depending on the context. It is also known that the asymptotic confidence properties of (2.17) or (2.20) are the same among different choices within the -divergence class. These can be seen via a second order expansion of the -divergences. Moreover, they are asymptotically equivalent to
where is the estimated Fisher information, under the regularity conditions above (Pardo, 2005; Nielsen and Nock, 2014; Vaart, 1998). In other words, under Assumption A1, both (2.20) and (2.21) satisfy
) depends on the higher-order properties of the parametric model, which in turn can depend on the parameter dimension. Different from the sample size requirements in SO, this convergence rate is a consequence of MLE properties. Some details on finite-sample behaviors of MLE can be found in(Korostelev and Korosteleva, 2011).
The discussed above is a set over the parametric class of distributions (or parameter values). Considering tractability, DRO over nonparametric space could be easier to handle than parametric, which suggests a relaxation of the parametric constraint to estimate the bounding function . This also raises the question of whether one can possibly contain in a nonparametric ball with a shrunk radius and subsequently obtain a better . These would be the main topics of Sections 3 and 4.
3. Bounding Functions and Generating Distributions
Given the uncertainty set in (2.20), we turn to the choice of the generating measure and the bounding function which, as we recall, is the optimal value of optimization problem (2.7). In the discussed parametric setup, the latter becomes
From Theorem 2.2.1 and the fact that is non-decreasing in , we want to choose that minimizes so that we can take the maximum and subsequently achieve overall confident feasibility with the least Monte Carlo sample size. Note that is a multi-input function depending on both and , and so a priori it is not clear that a uniform minimizer can exist across all values of so that the described task is well-defined. It turns out that this is possible in some cases, which we shall investigate in detail.
3.1. Neyman-Pearson Connections and A Least Powerful Null Hypothesis
We first consider, for a given , the optimization problem
This problem can be viewed as choosing a most powerful decision rule in a statistical hypothesis test. More precisely, one can think of
as a rejection region for a simple test with null hypothesisand alternate hypothesis . Subject to a tolerance of Type-I error, optimization problem (3.2) looks for a decision rule that maximizes the power of the test. By the Neyman-Pearson lemma (Lehmann and Romano, 2006), under mild regularity conditions on the parametric family, the optimal set of (3.2) takes the form
with chosen so that . Also, then, the optimal value of (3.2) is . Generalizing the above analysis to all , we conclude that
is the optimal value of (3.1). These observations will be useful for deriving our subsequent results.
Our goal is to choose to minimize (3.4). To start our analysis, let us first consider the extreme case where the uncertainty set consists of only one point . In this case, we look for that minimizes , the optimal value of
That is, for a given measure , we seek for the maximum discrepancy between and over all -measure sets that have or less content. This is similar to minimizing the total variation distance between and , and hints that the optimal choice of is . The following theorem, utilizing the Neyman-Pearson lemma depicted above, confirms this intuition. We remark that the assumptions of the theorem can be relaxed by using more general versions of the lemma, but the presented version suffices for most purposes and also the subsequent examples we will give.
Theorem 3.1.1 ().
Given , among all such that exists and is continuous and positive almost surely, the minimum is obtained by choosing , giving .
Under the assumptions, by the Neyman-Pearson lemma, for a fixed measure , the set achieving the optimal value of (3.5) takes the form for some with . It then follows that
Under the absolute continuity assumption, we define
which can be seen to be a non-increasing function for and a non-decreasing function for . To see this, take , and we have
Thus, when , we have because
while when , we have because
Then, to identify the minimum of , we either decrease from 1 to 0 which gives
by using the dominated convergence theorem (e.g., by considering the set ) or we increase from 1 to which gives
Theorem 3.1.1 shows that under mild regularity conditions, in terms of choosing the generating distribution and minimizing , we cannot do better than simply choosing itself. This means that if we had known the true distribution was , and without additional knowledge of the event of interest, the safest choice (in the minimax sense) for sampling would be , a quite intuitive result. In the language of hypothesis testing, given the simple alternate hypothesis , the null hypothesis that provides the least power for the test, i.e., makes it most difficult to distinguish between the two hypotheses, is .
3.2. Nonparametric and Parametric DRO
Building on the discussion in Section 3.1, we now consider the choice of generating distribution to minimize the bounding function obtained from (3.1). Before so, we first discuss the nonparametric case, where the analog of (3.1) is in the form:
for some ball radius , where the decision variables are in the space of all distributions absolutely continuous with respect to , and .
We show that the above setting can be effectively reduced to the unambiguous case, i.e., when lies in a singleton discussed in Section 3.1. This comes from an established equivalence between a distributionally robust chance constraint and an unambiguous chance constraint evaluated by the center of the divergence ball, when the event is fixed (Hu and Hong, 2013; Jiang and Guan, 2016). In particular, suppose the stochasticity space is , and admits a density . Theorem 1 in (Jiang and Guan, 2016) shows that, for any ,
where can be explicitly determined by , and as
and , , , and if , if and , and otherwise, where is the Lebesgue measure on .
The above equivalence can be used to obtain the following result.
Theorem 3.2.1 ().
Suppose and admits a density. Among all such that exists and is continuous, positive almost surely, an optimal choice of that minimizes , namely the optimal value of (3.8), is the center of the -divergence ball . Moreover, this gives , where is the inverse of the function defined in (3.10) with respect to , given by
which, according to (3.9), has the same optimal value as
Since, fixing and , is a non-decreasing function of , we see that minimizing is equivalent to minimizing . Denoting as the optimal value of the optimization problem
An implication of Theorem 3.2.1 is that, by noting that a parametric divergence ball lies inside a corresponding nonparametric ball, we can compute a bound for to obtain a required Monte Carlo size, drawn from the baseline , to get a feasible solution for the distributionally robust CCP (2.2) and subsequently the CCP (1.1). More precisely, recall the bounding function with , given by (3.1), as the optimal value of
Corollary 3.2.2 ().
Given a data size , suppose and admits a density, where is the MLE under Assumption A1. If we choose and draw Monte Carlo samples from the generating distribution to construct the sampled problem (2.4), then the obtained solution will be feasible for (1.1) with asymptotic confidence level at least .
Note that a parametric divergence ball lies inside a corresponding nonparametric ball in the sense that
Thus, by the definition of , we have
where the equality follows from Theorem 3.2.1. Thus, if we choose such that , or , where as presented in (2.17), and the generating distribution as , then Corollary 2.2.2 guarantees that running SO on Monte Carlo samples gives a feasible solution for (1.1) with confidence asymptotically at least . ∎
Corollary 3.2.2 thus provides an implementable procedure to handle (1.1) through (2.2). Next we discuss further the choice of generating distributions in parametric DRO beyond . While the ball center is a valid choice, the equivalence relation (3.9) does not apply when the divergence ball is in a parametric class, and the optimal choice of the generating distribution may no longer be , as shown in the next result.
Theorem 3.2.3 ().
In terms of selecting a generating distribution to minimize , the optimal value of (3.15), the choice can be strictly dominated by other distributions.
Intuitively, Theorem 3.2.3 arises because the extreme distribution that achieves the equivalence relation (3.9) may not be in the considered parametric family. It implies more flexibility in choosing the generating measure , in the sense of requiring less Monte Carlo samples than using .
From the standpoint of hypothesis testing in Section 3.1, the imposed minimax problem (3.15) in searching for the best can be viewed as finding a simple null hypothesis that is uniformly least powerful across the uncertainty set. This question is related and appears more general than finding the least favorable or powerful prior in testing against composite null hypothesis (Lehmann and Romano, 2006). In the latter context, given a set , one aims to find a distribution such that for all distributions on , where is the optimal value of
The distribution is interpreted as a prior on a composite null hypothesis parametrized by , and is the least favorable prior. The difference between (3.16) and our formulation (3.15) lies in the restriction to measures of the form for the former, leading to a smaller search space than ours. This mixture-type and the Bayesian connection will partly motivate our investigation in Section 4.
To prove Theorem 3.2.3, we present a counter example and also some related discussion.
Example 3.2.4 ().
Consider the uncertainty set within Gaussian location family on with fixing . This can be thought of, e.g., as an uncertainty set based on the -distance, the latter defined between two probability measures and as
Note that the -distance is in the family of -divergences, by choosing . We aim to find a generating distribution to minimize , the optimal value of
We consider several symmetric distributions as (symmetry is reasonably conjectured as a good property since an imbalanced shift might increase the power for the alternative hypothesis on one side and the worst case overall). We list these symmetric distributions in increasing variability: