1. The problem, prior results, perspectives and prelude to our results
1.1. Minmax optimization and scenario approximation
The minmax optimization problem is typically phrased as follows: Let be a positive integer and () be a metric space. Let be a nonempty subset of and be a probability space where is the Borel -algebra on induced by the metric . Let be a lower semicontinuous (l.s.c.) function. (Recall that a function is lower semicontinuous if every sublevel set of is closed, i.e., is closed for all .) We are interested in the following robust optimization problem:
Here on the one hand, plays the role of the decision variable, and is the set of variables from which a choice of one decision has to be made. On the other hand, plays the role of a parameter that affects the cost associated with each decision variable, and takes a fixed, albeit unknown, value in the set . In problem (1), in effect, we pick a decision variable that incurs the least cost assuming that the worst possible value of corresponding to each value of the decision variable is realised.
If is an infinite set, then the minmax optimization problem (1) is an example of a semi-infinite optimization problem. Semi-infinite problems have been reported to be computationally intractable to solve in general [BTN98, BTN99, BTN01]. Nevertheless, such optimization problems are of great importance in engineering, and, consequently, there is a natural need to find computationally tractable tight approximations to the problem (1). The central object of study in this work is the following approximation to :
where is an independent and identically distributed (i.i.d.) sequence of elements sampled from . This approximation is also known as the scenario approximation to the minmax optimization problem (1) [CC05]. We call each instance of the optimization problem (2) corresponding to the sample a scenario optimization problem. Observe that each scenario optimization problem is no longer semi-infinite since the inner maximum involves only finitely many variables. This makes the scenario optimization problem computationally more tractable, at least for moderate values of , than otherwise; and this is an attractive feature of (2).
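As an illustration of the scenario approximation just described, the following minimal sketch solves a toy problem that is not taken from this article. The cost (x - u)^2, the sets [-1, 1] for both the decision variable and the uncertain parameter, the uniform sampling distribution, and the grid search over the decision variable are all assumptions made for the sake of the example. For this toy problem the worst-case cost at x is (|x| + 1)^2, so the true minmax value is 1, attained at x = 0.

```python
import random


def cost(x, u):
    """Hypothetical cost of decision x under uncertain parameter u."""
    return (x - u) ** 2


def scenario_value(samples, grid):
    """Minimize, over the decision grid, the maximum cost over the sampled scenarios."""
    return min(max(cost(x, u) for u in samples) for x in grid)


# Decision grid on [-1, 1]; it contains x = 0 exactly (index 100).
grid = [-1.0 + 2.0 * k / 200 for k in range(201)]

# One i.i.d. sequence of scenarios; nested prefixes of it are used below.
rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

true_value = 1.0  # min over x of sup over u of (x - u)^2 on [-1, 1] x [-1, 1]
values = [scenario_value(samples[:n], grid) for n in (10, 100, 1000)]

for n, v in zip((10, 100, 1000), values):
    print(f"N = {n:5d}: scenario value = {v:.4f}, error = {true_value - v:.4f}")
```

Because the inner maximum is taken over a subset of the uncertainty set, the scenario value computed this way cannot exceed the true value, and it cannot decrease when further samples are appended to the same sequence.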
1.2. Desirable properties of scenario approximation
Before proceeding further we record the two following definitions for future reference:
We note that involves an abuse of notation since depends on the sample and not just on its size ; however, we suppress the explicit mention of this dependence in the interest of brevity. It follows immediately from the definition that which means that
In other words, the value of each scenario approximation (2) is always an approximation of from below.
Questions naturally arise about the goodness of such approximations. One natural notion of goodness of the scenario approximation scheme is qualified in the form of:
(G1) Consistency: Recall that a numerical approximation procedure is said to be consistent if, intuitively as the level of approximation is made “finer”, the approximate solution it computes converges to the actual solution of the problem being approximated. Consistency is a very rudimentary property that most sound numerical approximation procedures are expected to possess, and we say that the scenario approach is consistent if
To wit, the scenario approach is consistent if for almost every (countable) sequence of samples from , the approximate solution computed by the scenario approach using only a finite initial segment of the sequence converges to the solution of the original problem as the length of the initial segment grows. We will study consistency of the scenario approach in greater detail in the later sections. In particular, we will establish an obstruction to consistency that arises when the set is noncompact, a condition under which the scenario approach is guaranteed to be inconsistent.
A second desirable property of the scenario approximation is a good quality of:
(G2) Finite sample behaviour: Observe that the condition of consistency is purely asymptotic; it gives us no information about the nature of the approximate solution computed after drawing only a finite number of samples. But in the real world, information regarding the finite sample behaviour of the scenario approach is crucial and the behaviour of the scenario approach after drawing finitely many samples also warrants attention. In addition, it is desirable that this information is available to us a priori, before the approximation procedure is executed so that the number of samples drawn can be determined based on the accuracy demanded by the application before executing the approximation scheme. We start our study of finite sample behaviour by quantifying levels of approximation and bad samples associated with scenario approximations. For define the set of “bad” samples of size as those that give at least -bad estimates of , that is, those for which is at least away from :
we immediately get the sandwich relation
We find it easier to work with than , and since these two sets are sandwiched between each other, estimates for the probability of one will naturally lead to estimates for the probability of the other. Note that there is nothing special about the factor in (7): the relation holds with any factor strictly greater than 1, and we chose for convenience. Since
the set defined in (6) is, in fact,
Since we already have the obvious bound (4) that , if , then we also get . In other words, if a sample lies outside the bad set , then the difference between the approximate infimum and the actual infimum is less than . Naturally, it is desirable that the bad set is as ‘small’ as possible.
As mentioned earlier, in the real world a priori quantitative information regarding the finite sample behaviour of the scenario approach is crucial. We are especially interested in results that provide an upper bound on the probability of occurrence of the bad set
before the approximation procedure begins. Such quantitative bounds provide a PAC (“probably approximately correct”) type guarantee that the scenario approximation (2) computed using i.i.d. samples from has an accuracy with probability at least as much as . Given an accuracy level and a confidence level , from such a bound we may determine the number of samples that are required to ensure that the approximate minimum is at most away from the actual minimum with a probability at least . In the subsequent sections we will prove such PAC type guarantees for a large class of nonconvex minmax optimization programs.
1.3. A technical look at the sampling probability
Since the scenario approximation procedure involves i.i.d. sampling according to an arbitrary probability distribution, it is not reasonable to expect that the sampled maximum approximates the supremum. If the probability distribution has large “holes” in regions where the supremum is achieved, these regions are never explored and consequently the sampled maximum does not approximate the supremum. We refer the reader to [Ram18, §1] for a more detailed discussion on this matter. A more meaningful notion of a supremum that we expect the sampled maximum to approximate is that of the essential supremum:
While the supremum of a set of numbers is its least upper bound, the essential supremum is the least “almost” upper bound. If the probability distribution has “holes” in certain regions, the essential supremum avoids considering the value of the function in these regions automatically, and therefore avoiding these regions while trying to approximate the essential supremum does not create any technical issues. However, the supremum possesses a lot of nice properties that the essential supremum does not possess, and in order to avoid having to deal with the additional complications brought by the essential supremum, we would like to ensure that the supremum and the essential supremum are one and the same. Fortunately, it turns out that the assumption of lower semicontinuity of that we made at the start is sufficient for this, and to verify this statement, we start with a standard definition from measure theory. [Par67, Definition 2.1, p. 28] The support of the measure is
It can be shown that is the smallest (w.r.t set inclusion) closed subset of that has -measure 1.
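The difference between the supremum and the essential supremum can be seen on a finite toy space. The sketch below is only an illustration under stated assumptions: the three-point space, the function values, and the two probability vectors are hypothetical, and on a finite space with the discrete metric the support of a measure is simply the set of points carrying positive mass.

```python
def supremum(values):
    """Least upper bound of the function over the whole (finite) space."""
    return max(values.values())


def essential_supremum(values, prob):
    """Least 'almost' upper bound: points of zero probability are ignored."""
    return max(v for point, v in values.items() if prob[point] > 0)


# Hypothetical function on a three-point space.
values = {"a": 1.0, "b": 5.0, "c": 2.0}

# A measure with a "hole" at the point where the supremum is attained.
prob_with_hole = {"a": 0.5, "b": 0.0, "c": 0.5}

# A fully supported measure: every point has positive mass.
prob_full = {"a": 0.4, "b": 0.2, "c": 0.4}

print(supremum(values), essential_supremum(values, prob_with_hole))
print(supremum(values), essential_supremum(values, prob_full))
```

With the hole at the maximizer, sampling can never reveal the value at that point, so only the essential supremum is a sensible target. Under the fully supported measure the two quantities coincide.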
Consider the problem (1) with its associated data. If is lower semicontinuous, then for each
A proof of this result is provided in Appendix A. Lemma 1.3 says that lower semicontinuity of ensures that the essential supremum is equal to the supremum on a certain subset of probability 1. For the sake of brevity of notation in the following discussion, we assume that ; all the results that we derive below carry over to the situation when . We stipulate that is a fully supported probability measure, that is,
This is equivalent to stipulating that
Assumption 1.3 ensures that it is not unreasonable to expect that the sampled maximum approximates the supremum, and this is the content of the next lemma: Consider the problem (1) with its associated data. If Assumption 1.3 holds, then
All the assumptions made until this point will remain in force throughout the article, unless specifically mentioned otherwise; for the convenience of the reader, we recollect them here.
is a subset of and is a metric space.
is a lower semicontinuous function.
is a fully supported probability measure on .
1.4. Prior work and contributions
The scenario approach has been studied extensively in the literature in the particular case where is a convex set and is a convex function for each . We will henceforth refer to the problem as a random convex program. We review two recent representative results related to scenario approximations of random convex programs: one on consistency and the other on finite sample behaviour of the scenario approach, and compare the contributions of this article with the two results. We point out that both of these results rely crucially on the results established in [CC05, CC06, CG08].
The first result from [Ram18] establishes the consistency of the scenario approach for random convex programs, under an additional stipulation of an appropriate notion of coercivity on the class of functions . Recall that if is a metric space a function is weakly coercive if for each there exists a compact set such that the -sublevel set of is contained in , that is, . In particular, if is compact, every function is weakly coercive.
In addition, if is convex for all , the scenario approach is consistent in the sense that
Under the additional Assumption of (12), Theorem 1.4 establishes the consistency of the scenario approach for random convex programs. When the set itself is compact (12) holds trivially since any function on a compact set is weakly coercive. If is non-compact (12) may fail to hold, and consequently the consistency of the scenario approach may also be jeopardized. We study this situation in detail, and the first main contribution of the article will be identifying an obstruction to consistency of the scenario approach when is non-compact, that is, a condition that guarantees that the scenario approach will not be consistent.
We now review the main result from [ESL15] that establishes finite sample guarantees for the scenario approach applied to random convex programs.
[[ESL15, Theorem 14 on p. 5]] The tail probability for worst case violation is the function defined by
The function is called a uniform level set bound (ULB) of if for every ,
Given a ULB and numbers , for all we have
Theorem 1.4 provides a guarantee that if number of points are sampled in an i.i.d. fashion from and the corresponding scenario approximation problem is solved to obtain an approximate minimum, then one can say with confidence that the approximate minimum is at most away from the true minimum . Note that the guarantee is a priori: one does not need any information related to the actual samples drawn in order to compute , and consequently, one can use Theorem 1.4 to determine the number of samples required to be drawn in order to obtain a solution of the given accuracy fixed at the beginning of the optimization procedure. One of the crucial ingredients in the proof of Theorem 1.4 is a result from [CG08] which is valid only for random convex programs. In the light of recent extensions of [CG08] to the nonconvex case in [CGR18], one can extend some results of [ESL15], including Theorem 1.4, to nonconvex robust optimization problems. However, the results of [CGR18] in the nonconvex regime are of an a posteriori nature, meaning that the guarantees given depend on the sample drawn, and the extension of Theorem 1.4 to the nonconvex case via that route inherits this same a posteriori property. This means that one cannot determine the number of samples that give an approximate solution of desired accuracy before the approximation procedure begins. However, once a sample is drawn and the corresponding scenario approximation is found, then one can find the accuracy of the computed approximate solution. In other words, one can only assess the quality of a scenario approximate solution after the solution is computed. In contrast, the second main contribution of the present article is a methodology to establish a priori PAC type finite sample guarantees similar to Theorem 1.4 that is applicable to scenario approximations of a large class of nonconvex minmax optimization problems.
1.5. Numerical experiments in high dimensions
We devote this subsection to examining in detail, with the aid of numerical experiments, a simple minmax optimization problem and the quality of scenario approximations of it. Recall that for a given vector the quantity denotes its infinity norm defined by Consider the optimization problem:
In the language of the problem (1), here we have chosen , , and the continuous function Observe that for each , is a convex function on the convex set . In other words, (18) is a random convex program. Moreover the set is compact and therefore (18) satisfies all the conditions of Theorem 1.4, and the latter guarantees that the scenario approximations will almost surely converge to .
To study the finite sample behaviour of scenario approximations of (18), we first compute the optimal value . This can be done by observing that
Since we have the optimal value in (18) and the scenario approximate solution can be computed numerically on a computer, we can compute the error associated with the scenario approximations of (18). In Figure 1 we present the results of our numerical experiments that give the error in the scenario approximation (2) of (18) and its variation with the dimension of the uncertainty set and the number of samples. We sampled independently from the uniform distribution on to obtain these scenario approximations. The error shown in the figure for each value of and was computed by taking the average error of the scenario approximations over 25 sets of samples of length from .
Figure 1 follows the expected trend that the error decreases as the number of samples increases and the dimension of the uncertainty set decreases. A closer look at the value of the error shows that even for a moderate dimension of (see Figure 2) of the uncertainty set, even after sampling as much as a million scenarios, one still gets an error as large as . To put this in perspective, observe that the value of the cost varies between and as and vary over and , respectively, which means that this is an error of about . The results are much worse for higher dimensions; for instance, when the dimension of the uncertainty set is and a million scenarios are drawn, the error in the scenario approximation is around in absolute units, which puts the relative error at around .
We get even worse results if we consider the slightly modified problem
where the uncertainty set is noncompact. The optimal value is in this case as well; indeed,
In Figure 3 we present the results of our numerical experiments that give the error in the scenario approximation (2) of (21) and its variation with the dimension of the uncertainty set and the number of samples. We sampled independently from the Gaussian distribution with mean 0 and variance on to obtain these scenario approximations. The error shown in the figure for each value of and was computed by taking the average error of the scenario approximations over sets of samples of length from . We see that even for a moderate dimension of (see Figure 4) of the uncertainty set, even after sampling as much as a million scenarios, one still gets an error as large as in absolute units. As expected, the results are much worse for higher dimensions: for instance, when the dimension of the uncertainty set is and a million scenarios are drawn, the error in the scenario approximation is around in absolute units.
Of course, the measuring stick in the scenario approximations of the example immediately above the current paragraph is the probability measure corresponding to the Gaussian employed for sampling, and the specific concentration properties of this measure naturally affect the outcome of the experiment as a consequence. Whether these estimates are satisfactory or not is difficult to assess unilaterally and uniformly across the spectrum of robust minmax optimization problems, and such conclusions are best left to the judgment of the practitioners concerned.
The main culprit in the examples above is the fact that scenario approximations rely on i.i.d samples, and i.i.d samples of high dimensional random vectors tend to concentrate with high probability around certain regions of the space leaving the rest of the space unexplored; this feature leads to a preference for certain (typically thin) regions of the sample space of the algorithm, and unless the optimizers are in these thin sets, the quality of approximation may be low. The preceding observations clearly point to the fact that there is still scope to develop general, computationally feasible, and tight approximation schemes for robust optimization problems, especially in high dimensions insofar as the optimal value is concerned; one such approximation method involving better sampling will be reported subsequently elsewhere.
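The concentration effect described above can be reproduced with a few lines of code. The sketch below does not reproduce problem (18) or (21); it uses a hypothetical function, the average of the absolute values of the coordinates of u on the cube [-1, 1]^n, whose supremum is 1 in every dimension. The sample size, the dimensions, and the seed are likewise arbitrary choices made for illustration.

```python
import random


def sampled_max_gap(dim, num_samples, seed=0):
    """Gap between the supremum (equal to 1) and the maximum over i.i.d. uniform samples."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(num_samples):
        u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Hypothetical test function: mean absolute coordinate, bounded above by 1.
        best = max(best, sum(abs(c) for c in u) / dim)
    return 1.0 - best


gaps = {dim: sampled_max_gap(dim, 2000) for dim in (1, 5, 20)}
for dim, gap in gaps.items():
    print(f"dimension {dim:2d}: gap between supremum and sampled maximum = {gap:.4f}")
```

In dimension 1 the sampled maximum is very close to the supremum, whereas in dimension 20 the samples concentrate around the mean value of the function and the corners of the cube, where the supremum is attained, are essentially never visited.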
2. An obstruction to consistency
Consistency of the scenario approach is not guaranteed, in general, for all problems of the form (1). Even in the particular case of random convex programs, observe that the statement of Theorem 1.4 has the additional requirement of coercivity, which is not always satisfied if the set is not compact. We begin with a simple example that illustrates this effect.
Let and . Assume that is endowed with the standard Gaussian probability measure with mean 0 and variance 1. Consider, in the language of (1) the cost function
All the requirements of the problem (1) are satisfied by (24) in addition to Assumption 1.3. Yet we show that the scenario approach is inconsistent in this situation. For a given sample we define . One checks that
which imply that
This means that for any sequence of samples , which shows that consistency fails to hold.
It is clear from Example 2 that the set being noncompact can readily lead to inconsistency of the scenario approach. In this section we study this issue further and characterize one possible obstruction to the consistency of the scenario approach when the set of optimization variables is noncompact. We begin with the following definition that will be needed in both the current and the next section: The tail probability is the function defined by
For each , we define the infimum of over all by
It is clear from the proof of Theorem 2 that the set is precisely the set from which one needs to sample in order to get a solution with accuracy; if the sampled sequence does not contain any element from the set corresponding to any one value of , then the approximate solution is going to be at least far away from . In the light of this fact, the condition that amounts to saying that the sets from which one needs to sample in order to get an approximation of accuracy are arbitrarily small regions of ; consequently and in retrospect the above result appears to be natural.
The function is very similar to an object we have encountered before: cf. the function defined in (14). These two functions and are weakly related to each other. In general, one can say that for every . Indeed, observe that for each , since
since . This implies that for each and , which further implies that
In subsequent sections (see Remark 3.1.2) we will see further evidence pointing to the fundamental nature of in relation to the scenario approach.
Example 1 (continued).
which implies that (Recall that the function is equal to the when is the standard Gaussian measure (normal with mean 0 and variance 1) on .) Clearly,
It is now evident that the obstruction pointed out by Theorem 2 prevents consistency in this example as well.
The result of Theorem 2 is equally valid in the case where is compact. However, we started the discussion claiming that the obstruction arises when is a noncompact set, and the next proposition affirms this statement: we show that when the set is compact, the obstruction presented in Theorem 2 cannot arise.
3. Finite sample performance guarantees in the nonconvex setting
In this section, for a large class of nonconvex minmax problems, we prove a general positive result that gives an upper bound on the a priori probability of the bad set (8) for finite samples of the scenario approach. In other words, we establish a finite sample performance guarantee in a general nonconvex setting. Of course, in the presence of more detailed structure, we may be able to refine these preliminary estimates, and as an illustration of this scheme we then discuss several special cases of this result.
3.1. General performance guarantees
The first order of business is making the word nonconvexity precise. The class of nonconvex functions is vast, and it appears that very little can be said about a priori estimates under the scenario method at this level of generality; indeed, it is natural to expect, at least in principle, that the greater the regularity of the functions under consideration, the tighter the bounds that should be possible to obtain. Physical considerations point us towards focussing our investigations on classes of functions that arise naturally in physical systems, e.g., trigonometric polynomials of finite bandwidth, smooth functions restricted to compact sets, etc. Our approach here follows standard principles of functional analysis and approximation theory via estimates involving covering numbers à la [CZ07]; the techniques exposed here are fairly general, and conform to the following simple steps (summary of our approach):
We find upper bounds on the covering number of the family of functions
in the supremum norm topology defined below. This step provides us with a finite collection of representatives from the (possibly infinite dimensional) class of functions under consideration.
The i.i.d property of the sampling in the scenario approach permits us to employ the bounds on the covering number found in the preceding step in standard probabilistic inequalities to arrive at bounds on the probability .
The class of nonconvex functions is vast, and we consider only a few reasonable classes of finite and infinite dimensional subsets of this class in the article at hand. The primary difficulty with infinite dimensionality of function classes is overcome in a standard way by the consideration of covering numbers. Recall that given a metric space and a subset , a set is called an cover of if for every element , there exists such that . We define the covering number of to be the smallest number such that there exists an cover of of cardinality . It is a standard result that is precompact iff for all , the covering number is finite.
Recall that if is compact, the set of continuous real valued functions on is a metric space when endowed with the metric inherited from the supremum norm given by .
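To make the notion of a cover in the supremum norm concrete, the following sketch builds a cover greedily for a finite family of functions. Everything specific in it is an assumption made for illustration: the family a·sin(2πu) with amplitudes a in [0, 1], the representation of each function by its values on a grid of 101 points, the radius 0.1, and the convention that a point is covered when its distance to a centre is at most the radius. The number of centres returned is only an upper bound on the covering number of this finite family; it is not claimed to be minimal.

```python
import math


def sup_distance(f, g):
    """Supremum-norm distance between two functions given by their values on a common grid."""
    return max(abs(a - b) for a, b in zip(f, g))


def greedy_cover(functions, eps):
    """Select centres so that every function lies within eps of some centre."""
    centers = []
    for f in functions:
        # f becomes a new centre only if no existing centre already covers it.
        if all(sup_distance(f, c) > eps for c in centers):
            centers.append(f)
    return centers


# Hypothetical family: u -> a * sin(2*pi*u), tabulated on a grid of [0, 1].
grid = [k / 100 for k in range(101)]
amplitudes = [j / 50 for j in range(51)]
family = [[a * math.sin(2 * math.pi * u) for u in grid] for a in amplitudes]

eps = 0.1
cover = greedy_cover(family, eps)
print(f"{len(family)} functions are covered by {len(cover)} balls of radius {eps}")
```

The selected centres are pairwise more than the radius apart, which is why the greedy procedure stops after a small number of centres whenever the family is precompact.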
3.1.2. Main Result
The following theorem is the key result of this section. Given the problem (1) and its associated notation, let denote the family of functions
Consider the problem (1) along with its associated data, and suppose that Assumption 1.3 holds. Let be as defined in (28), and recall from (9) that . If is compact and the set of functions defined in (28) is precompact in , then
Fix and . By definition of , there exists a subset of such that for each there exists that satisfies
Recalling the definition of from (8), we see that
This means, in view of the definition of in (9)
as asserted, where we have employed the standard union bound in step and the inequality for in step above. ∎
In the absence of any further structure in the various sets that appear in the definition of in the proof of Theorem 3.1.2, it appears that the standard union bound employed in step in the proof above is a reasonable option. However, in certain specific cases it may be possible to refine this particular step further to arrive at tighter bounds.
Theorem 3.1.2 can be rewritten in the following way. Suppose we are given a desired accuracy level and confidence level . We define
If the number of i.i.d. samples (scenarios) drawn from under is at least , then the probability of occurrence of the bad set is guaranteed to be at most . Let us compare this estimate with that of Theorem 1.4. We begin by noting that defined in (16) can be rewritten explicitly as (see [CGP09, Theorem 1])
With this in mind, we can rewrite the result of Theorem 1.4 in the language of its statement as follows: If the number of i.i.d. samples drawn from under is at least , then the probability of occurrence of the bad set is guaranteed to be at most . Observe that samples guarantee an accuracy of as opposed to . To remove this dependency on and to obtain an explicit result, recall the definitions in (13), (14) and (15). If has a well defined inverse , then sampling number of elements would guarantee an accuracy of for each . However, such an inverse function may not always exist. Nevertheless, the right hand side of (15) in the definition of is a pseudo inverse of the function . Indeed, if is monotonically increasing, then is its inverse in the usual sense. Due to this fact, if we employ as the inverse of , define
and sample number of i.i.d. samples from , then Theorem 1.4 guarantees that the probability of occurrence of the bad set is at most .
Theorem 2 along with the estimate of Theorem 3.1.2 in the form given in (30) points to the fundamental nature of the function . On the one hand, Theorem 2 says that if is equal to zero, then the scenario approach is not consistent and even as the number of i.i.d. samples drawn approaches infinity, the scenario approximation remains at least away from the true minimum. On the other hand, according to Theorem 3.1.2, even if is nonzero, the number of samples that need to be drawn to guarantee an approximation of accuracy grows increasingly large as goes to zero. This is reminiscent of the condition number of a matrix in linear algebra; recall that a square matrix is singular only if its condition number is infinite. However, even if the condition number is finite, it becomes increasingly harder to numerically compute the inverse of a matrix as its condition number increases, to the extent that a matrix with a very large condition number is practically singular from a numerical standpoint. In this sense, is a measure of how well behaved the scenario approximations of a robust optimization problem are: as decreases, the finite sample behaviour of the scenario approximations also deteriorates, and finally, when becomes zero, the performance deteriorates so much that even consistency is lost.
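The growth of the required number of samples as the tail quantity goes to zero can be illustrated numerically. The sketch below does not restate the bound of Theorem 3.1.2 or formula (30); it assumes, purely for illustration, a bound of the generic form C·(1 - t)^N ≤ C·exp(-N·t) on the probability of the bad set, which is the shape suggested by the union bound and the exponential inequality used in the proof. The names cover_size (C), tail (t) and delta, and all numerical values, are hypothetical.

```python
import math


def samples_needed(cover_size, tail, delta):
    """Smallest N with cover_size * exp(-N * tail) <= delta, under the assumed bound."""
    if cover_size < 1 or not (0.0 < tail <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("expected cover_size >= 1, 0 < tail <= 1 and 0 < delta < 1")
    # Solve C * exp(-N * t) <= delta for N and round up to an integer.
    return max(1, math.ceil(math.log(cover_size / delta) / tail))


cover_size = 100  # hypothetical covering number at the chosen accuracy
delta = 0.05      # allowed probability of landing in the bad set

requirements = {t: samples_needed(cover_size, t, delta) for t in (0.1, 0.01, 0.001)}
for t, n in requirements.items():
    print(f"tail = {t}: at least {n} samples under the assumed bound")
```

Under this assumed form the sample requirement scales like 1/t, so it blows up as the tail quantity approaches zero, in line with the condition-number analogy drawn above.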
3.2. Scenario bounds for bandlimited trigonometric functions
Here we employ Theorem 3.1.2 of the previous section to derive bounds on the probability of the “bad set” in the situation where is an -dimensional hypercube and the set of functions is a bounded subset of the linear subspace of trigonometric polynomials of bandwidth . More precisely, our premise for this subsection is the following:
In the context of the problem (1) and its associated data, we stipulate that:
For each , the function is a trigonometric polynomial of bandwidth . In other words,
The set of trigonometric polynomials of bandwidth is a dimensional subspace of , and therefore, any bounded subset of it is precompact. Consequently, Theorem 3.1.2 applies to this situation. In the following Lemma, whose proof is deferred to Appendix B, we provide estimates of the covering number .
Consider the problem (1) along with its associated data and suppose that Assumption 1.3 holds. In addition, suppose that the family of functions defined in (28) satisfies Assumption 3.2, and define . Then, for as defined in (9), we have
As mentioned in Remark 3.1.2, we can rewrite the result of Theorem 3.2 in terms of the number of samples required to achieve a desired level of accuracy and confidence. As before, suppose we are given an accuracy level and a confidence level . If we draw
number of i.i.d. samples from under , then the probability of occurrence of the bad set is guaranteed to be less than .
3.3. Scenario bounds for smooth functions on the -torus
We apply Theorem 3.1.2 to the problem of determining bounds on the probability of the “bad set” when the uncertainty set is an -dimensional torus and the cost function is smooth with respect to the uncertain parameters. Formally, we ask: In the context of the problem (1) and its associated data, we stipulate that:
there exists an integer such that for each , the function