## 1 Introduction

### 1.1 Problem Setting

Data-driven decision-making must manage stochastic uncertainty due to a limited number of available data samples. Suppose that a decision is made by solving the following parameterized optimization problem [6]:

(1) |

where is a parameter, is a decision domain, is a decision variable, is an objective function, and are parameterized constraint functions that are linear in . The ideal decision is given by an optimal solution under true parameter ; however, is unknown to us and must be estimated from i.i.d. samples () from a distribution whose expectation is . In practice, we often need a solution that satisfies the true constraints (i.e., constraints with true parameters) even though these are unknown.
In such a case, a naive approach that solves the problem with estimated parameters would be inadequate because the obtained solution might not satisfy the true constraints.
Rather, robust optimization [2, 6] can be used as follows.
For a positive semidefinite matrix , we define .
The *robust optimization problem (with ellipsoidal uncertainty)* is then given by

(2) |

where is a parameter called the nonnegative scale. Let be the solution of the above problem. Then, satisfies the true constraints if . This means that if is large, the solution will most likely satisfy the true constraints but may have a poor best-case performance. Conversely, if is small, the solution will have a good best-case performance but also a high risk of violating the true constraints. Therefore, choosing appropriately is an important task [8].
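For intuition, consider the special case of a single constraint that is linear in both the parameter and the decision, written here as $\theta^\top x \le b$. This notation and the helper names `quad_form_sqrt` and `robust_lhs` are assumptions made for illustration. Under an ellipsoidal uncertainty set of scale `lam` centred at the estimate, the worst case of the left-hand side has the well-known closed form $\hat\theta^\top x + \lambda \sqrt{x^\top \Sigma x}$. The following minimal Python sketch computes this quantity; it is not meant as the implementation used in this study.

```python
import math


def quad_form_sqrt(sigma, x):
    """Return sqrt(x^T sigma x) for a symmetric PSD matrix given as nested lists."""
    d = len(x)
    q = sum(x[i] * sigma[i][j] * x[j] for i in range(d) for j in range(d))
    return math.sqrt(max(q, 0.0))  # guard against tiny negative round-off


def robust_lhs(theta_hat, sigma, lam, x):
    """Worst case of theta^T x over an ellipsoid centred at theta_hat.

    Assumed uncertainty set (illustrative):
        {theta : (theta - theta_hat)^T sigma^{-1} (theta - theta_hat) <= lam^2}.
    """
    nominal = sum(t * xi for t, xi in zip(theta_hat, x))
    return nominal + lam * quad_form_sqrt(sigma, x)


if __name__ == "__main__":
    theta_hat = [1.0, 2.0]
    sigma = [[4.0, 0.0], [0.0, 1.0]]
    x = [0.5, 0.5]
    for lam in (0.0, 0.5, 1.5):
        # A larger scale gives a more conservative (larger) left-hand side.
        print(lam, robust_lhs(theta_hat, sigma, lam, x))
```

With `lam = 0` the function returns the nominal value, and increasing `lam` tightens the constraint monotonically, which is the trade-off between feasibility and performance discussed above.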

In this study, we consider the problem of finding a such that the problem will have a small objective value and satisfy the true constraints with a sufficiently high probability. Formally, the problem is defined as follows. Let be the true objective function:

Let be the empirical average of , be an estimate of the covariance matrix defined by , and be defined by . Our goal then is to find an algorithm that determines a scale from the observation such that the ()-quantile of , i.e., the *value-at-risk*, is minimized as follows^{1}:

^{1} Here we define , where the probability is taken over the i.i.d. sampling of .

###### Problem 1.

Given , find an algorithm that minimizes .
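The empirical quantities that such an algorithm takes as input can be computed directly from the samples. The sketch below is a minimal standard-library version; the function names are hypothetical, and the $1/n$ normalization of the covariance estimate is an assumption (an unbiased $1/(n-1)$ variant would differ only by a constant factor).

```python
def empirical_mean(samples):
    """Coordinate-wise average of a list of equal-length sample vectors."""
    n = len(samples)
    d = len(samples[0])
    return [sum(s[j] for s in samples) / n for j in range(d)]


def empirical_covariance(samples):
    """Covariance estimate with 1/n normalization (an assumption of this sketch)."""
    n = len(samples)
    d = len(samples[0])
    mean = empirical_mean(samples)
    cov = [[0.0] * d for _ in range(d)]
    for s in samples:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (s[i] - mean[i]) * (s[j] - mean[j]) / n
    return cov


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 6.0], [2.0, 1.0]]
    print(empirical_mean(data))
    print(empirical_covariance(data))
```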

### 1.2 Standard Approach

A standard approach for Problem 1 determines a such that holds with a desired probability [4, 7, 10]. Let be the true covariance matrix and . We then obtain the following:

###### Proposition 2.

Suppose that with probability at least . It then holds that .

If is (asymptotically) distributed according to a normal distribution with mean and covariance , then such will be determined by , where is a chi-distribution with degrees of freedom . Observe that such a will be of . Thus, the speed of convergence depends on the dimension of the parameter . The speed of convergence will be especially degraded when a large number of features () are available, but only a small number of features is useful, which will cause the so-called “curse of dimensionality” in robust optimization.
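To illustrate the dimension dependence noted above, the following sketch estimates a quantile of the chi distribution by plain Monte Carlo, using only the Python standard library. The function name `chi_quantile_mc`, the number of draws, and the division by $\sqrt{n}$ in the usage lines are assumptions for illustration; a statistics library with an exact quantile function could be used instead.

```python
import math
import random


def chi_quantile_mc(dim, level, n_draws=20000, seed=0):
    """Monte Carlo estimate of the `level`-quantile of a chi distribution.

    A chi-distributed variable with `dim` degrees of freedom is the Euclidean
    norm of a `dim`-dimensional standard normal vector.
    """
    rng = random.Random(seed)
    norms = sorted(
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(n_draws)
    )
    index = min(n_draws - 1, max(0, math.ceil(level * n_draws) - 1))
    return norms[index]


if __name__ == "__main__":
    n = 100  # hypothetical sample size
    for dim in (1, 5, 20):
        # The scale grows with the dimension for a fixed sample size.
        print(dim, chi_quantile_mc(dim, 0.95) / math.sqrt(n))
```

For a fixed sample size, the printed scale increases with `dim`, which is the effect that the standard approach suffers from when many features are available.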

### 1.3 Our Approach and Results

In this study, we prove the following theorem, which improves the convergence rate of Proposition 2:

###### Theorem 3 (Informal version of Theorem 20).

It is important to note that our is asymptotically independent of the dimensionality of the parameter , i.e., the speed of convergence is less affected by the dimensionality of the parameter .

The theorem is proved in the following steps. We first derive a new inequality on a subspace. We then establish methods to apply the new inequality effectively.

#### 1.3.1 New Inequality on Subspace

Recall that Proposition 2 is derived from the following fundamental fact in robust optimization.

###### Fact 4.

Let . If and satisfy , then

(3) |

This fact holds because
the assumption implies
for arbitrary and .
Here, we observe that this is too conservative.
If we consider given () and a subspace that includes the resulting solution ,
we can prove the same inequality (3) under a weaker assumption.
Let .
For , let be the set of errors that do not cause a violation of the true constraint over , i.e.,

We then obtain the following lemma:

###### Lemma 5.

Lemma 5 generalizes Fact 4 because the assumption of Fact 4 is a sufficient condition in Lemma 5 for since holds for any , and (5) is trivial with . Our idea is to use Lemma 5 rather than Fact 4 to derive better bounds than Proposition 2.

The following example shows the effect of choosing a different .

###### Example 6.

Let us consider the following -dimensional parameterized problem:

(6) |

The true parameters are , and the true covariance of samples is , where is the identity matrix of size . Figure 1 illustrates for , , and , all of which include the true optimum solution . The figure clearly shows that, with a fixed , a smaller results in a larger :

Recall that Fact 4 and Lemma 5 lead us to define in order to satisfy with desired probability .

More specifically, let be distributed by the normal distribution and let . Fact 4 with then requires , while Lemma 5 together with requires . Since a smaller yields a closer solution to the optimal one, this indicates the possibility of having a better solution.

∎

For an effective use of Lemma 5, we must be able to evaluate the probability for , and must find a good subspace without knowing and .

#### 1.3.2 Probability Evaluation

In order to evaluate probability without and , in Section 3 we characterize as an intersection of Minkowski sums of an ellipsoid and a dual cone. Since such an intersection is a convex set, we can apply the high-dimensional Berry-Esseen theorem for convex sets [5] in order to approximate the distribution of by a normal distribution while maintaining a theoretical guarantee. We then propose a sampling-based algorithm for measuring the probability with sufficient accuracy.

#### 1.3.3 Subspace Selection

To find a good subspace , in Section 4 we extend the two-step empirical domain reduction algorithm proposed for risk minimization with a single uncertain objective [16] to our context of robust optimization with multiple uncertain constraints. Such a reduced space must satisfy (4) and (5) in Lemma 5 with high probability. For (4), since the reduction is based on empirical estimate , the resulting also includes stochastic uncertainty. We must thus avoid over-fitting to a given single sample and maintain consistency with other sampling scenarios. In addition, (5) requires that the solution be included in without forcing it by adding constraints. These two requirements call for precise control of domain reduction, which can be achieved using our empirical two-step domain reduction algorithm.

### 1.4 Related studies

Our approach can be summarized in terms of the following concepts: inequality on subspaces, probability evaluation, and subspace selection. While these concepts are novel in the context of robust optimization, they have already been proven to be effective in the context of variance-based regularization for risk minimization in machine learning. Note that variance-based regularization regularizes uncertain objectives by means of the scaled (square root of the) variance. It is thus similar to robust optimization with ellipsoidal uncertainty, which introduces into uncertain constraints a redundancy that is proportional to the (square root of the) variance.

In the context of risk minimization, the relationship between the size of an optimization domain and the speed of convergence has been characterized by various complexity measures, such as VC dimension [15], covering number [13], and Rademacher complexity [1]. Studies of variance-based regularization [13, 14] have determined the scale of a regularizer on the basis of these complexity measures in order for a regularized empirical risk function to bound the true objectives with a desired probability. A recent study of empirical hypothesis space reduction [16] has achieved, by means of subspace selection, an acceleration of convergence that is asymptotically independent of the dimensionality of uncertain parameters.

To make use of the approach in [16], which is designed for variance-based regularization, in robust optimization, we have dealt with the issues below. Robust optimization has several uncertain constraints, while risk minimization has a single uncertain objective. In particular, while an uncertain objective does not influence the feasible domain, uncertain constraints do influence it, which increases the difficulty of subspace selection. In addition, in robust optimization, it is often assumed that uncertain functions are linear with respect to uncertain parameters, and this linearity has led us to propose a novel sampling-based evaluation of violation probability.

## 2 Preliminary

### 2.1 Technical lemmas

This section introduces a series of lemmas. The following Gershgorin circle theorem characterizes the list of eigenvalues of a matrix by its elements .

###### Lemma 7 (The Gershgorin circle theorem; see [11]).

Let , and define . Under suitable ordering, the eigenvalues of satisfy .
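As a small numerical illustration of the Gershgorin circle theorem in its standard form (every eigenvalue lies in at least one disc centred at a diagonal entry, with radius equal to the sum of the absolute off-diagonal entries of that row), the following sketch checks the statement on a symmetric $2 \times 2$ matrix. The function names and the example matrix are hypothetical, and the closed-form eigenvalue formula applies to the symmetric $2 \times 2$ case only.

```python
import math


def gershgorin_discs(a):
    """Return one (centre, radius) pair per row of the square matrix `a`."""
    n = len(a)
    return [
        (a[i][i], sum(abs(a[i][j]) for j in range(n) if j != i))
        for i in range(n)
    ]


def eig_sym_2x2(a):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, in increasing order."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return (tr / 2.0 - disc, tr / 2.0 + disc)


def in_some_disc(value, discs, tol=1e-12):
    """True if `value` lies in at least one Gershgorin disc (real case)."""
    return any(abs(value - c) <= r + tol for c, r in discs)


if __name__ == "__main__":
    a = [[2.0, 0.5], [0.5, 3.0]]
    discs = gershgorin_discs(a)
    for ev in eig_sym_2x2(a):
        print(ev, in_some_disc(ev, discs))
```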

Hoeffding’s inequality below bounds the gap between sample average and true average.

###### Lemma 8 (Hoeffding’s inequality; see [9]).

The following Berry-Esseen theorem enables us to evaluate the speed of convergence of the central limit theorem (for a tighter coefficient in the one-dimensional case, see [12]).

###### Lemma 9 (The high-dimensional Berry-Esseen inequality [5]).

Let be an integer. Let be the set of convex sets in and . Let be i.i.d. random variables over with , , and . Let . It then holds that

### 2.2 Estimation accuracy

We can show here the accuracy of estimators and on the basis of the lemmas introduced in the previous section and the following assumption. Let denote the distribution of where .

###### Assumption 10.

An upper-bound of and of is given.

We here assume that (upper bounds of) the third moment and the sixth moment are known, since the speed of convergence is less influenced by the higher-order moments and than by the first and second moments ( and ). In practice, and can be calculated if the domain of the distribution is bounded, or they can be estimated using samples .

Under Assumption 10, the distribution of converges to the normal distribution by the central limit theorem. Recall the definitions of and in Lemma 9. The speed of convergence can be characterized by , , and Lemma 9 as follows:

###### Lemma 11.

It holds that

For , let us define by

The relative accuracy of estimator can be characterized by Lemma 7 and Lemma 9 as follows.

###### Lemma 12.

For , the following then holds with probability at least :

(7) |

## 3 Probability Evaluation Algorithm

This section proposes an algorithm that realizes the concept of probability evaluation discussed in Section 1.3.

### 3.1 Geometric characterization of

Recall that is the set of estimation errors that do not cause a violation of the true constraint over with the scale . Here we prove that is a convex set for any , , and .

Since is linear in , there exists a function such that

For , we define by

We then define the polar cone and dual cone of by

The polar cone and the dual cone are convex cones. For a pair of sets , their Minkowski sum is defined by . Note that the Minkowski sum of two convex sets is also a convex set. The following lemma characterizes as an intersection of convex sets:

###### Lemma 13.

For any , , and , is a convex set. In particular, if is linear in for all and is convex, then it holds that

### 3.2 Normal approximation via the multidimensional Berry-Esseen theorem

On the basis of [16, Definition 5], we here introduce the minimum spatial uniform bounds as a minimum scale that satisfies (4) with probability when is distributed by :

###### Definition 14.

Let , , , and be a distribution over . The minimum spatial uniform bound is defined by:

(8) |

Let us define as the distribution of . The ideal goal of this section, then, is to obtain . satisfies the following component-wise monotonicity. Let us denote .

###### Lemma 15.

Let , , , and . Then the following holds.

(i) for all . [Monotonicity in ]

(ii) for all . [Reverse Monotonicity in ]

(iii) and for all . [Reverse Monotonicity in ]

(iv) for all . [Scaling]

Recall that Lemma 13 shows the convexity of for arbitrary , , and . We can then apply Lemma 11 to bound the gap of probability in (8) for and the corresponding normal . Combined with the monotonicity shown in Lemma 15 (i), we can upper-bound of unknown distribution by its asymptotically normal counterpart:

###### Proposition 16.

It holds that

### 3.3 Sampling algorithm for calculating spatial uniform bounds for normal distribution

This section fixes , , and , and we then denote for notational simplicity. Proposition 16 reduces our goal to that of calculating a minimum spatial uniform bound . Lemma 15 (i) shows that such a is monotone in , and satisfies the following property:

###### Lemma 17.

(i) It holds that .

(ii) It holds that

Thus is monotone and bounded above and below, and its inverse can be calculated by a series of sampling and optimization steps. This characterization leads to the following algorithm for calculating by sampling from and a binary search, as shown in Algorithm 1.

Given , , and , together with parameters deciding accuracy, Algorithm 1 calculates a that approximates . Line 1 of the algorithm first defines the number of samples by

(9) |

Line 2, then, generates samples for from the normal distribution . Lines 3–18 estimate by the binary search. Line 4 defines the upper-bound and the lower-bound on the basis of Lemma 17. Line 6 updates as a mean of the upper-bound and the lower-bound , and then Lines 7–12 approximately calculate on the basis of the samples for and the following lemma:

###### Lemma 18.

For , if and only if , where

(10) |

For each , Line 8 calculates the optimization problem (10) for , and then Lines 9–11 approximately calculate by counting the number of such for . Note that some of the calculation of can be omitted in practice since is monotonically increasing in . Lines 13–17 update the upper-bound or the lower-bound on the basis of the approximate probability of . Finally, Line 19 outputs the approximate upper-bound of .

This output satisfies the following probabilistic guarantee.

###### Proposition 19.

With probability at least , the output of Algorithm 1 satisfies
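The sampling and binary-search structure described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the membership test `is_covered` is a hypothetical stand-in for solving the optimization problem (10) for one sample, it is assumed to be monotone in the scale, and the initial upper bound is assumed to already reach the target coverage. The toy usage draws two-dimensional standard normal samples and treats a sample as covered when its Euclidean norm is at most the scale.

```python
import math
import random


def smallest_covering_scale(samples, is_covered, lower, upper, target, tol=1e-3):
    """Binary search for (approximately) the smallest scale whose empirical
    coverage over `samples` is at least `target`.

    Assumes is_covered(sample, scale) is monotone in `scale` and that the
    initial `upper` already attains the target coverage.
    """
    def coverage(scale):
        return sum(1 for s in samples if is_covered(s, scale)) / len(samples)

    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if coverage(mid) >= target:
            upper = mid  # mid is large enough; try smaller scales
        else:
            lower = mid  # mid is too small; try larger scales
    return upper


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(5000)]

    def inside_ball(sample, scale):
        # Toy membership oracle: the sample lies in a ball of radius `scale`.
        return math.hypot(sample[0], sample[1]) <= scale

    print(smallest_covering_scale(samples, inside_ball, 0.0, 10.0, target=0.9))
```

Because the returned value is always an upper end of the search interval, it attains the target empirical coverage by construction, which mirrors the role of the output as an approximate upper bound.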

## 4 Empirical Domain Reduction

This section presents an algorithm that achieves our main theoretical result via the concept of subspace selection discussed in Section 1.3.

### 4.1 Two-step domain reduction algorithm

Let us define constants and empirically reduced domain for by

We can then propose Algorithm 2 for the calculation of the scale of robustness. Lines 1–4 represent the first stage of this algorithm, which conducts subspace selection. Line 1 first calculates coefficient using Algorithm 1 with . Line 2 defines by scaling , and then Line 3 calculates optimum value by solving (2) on the basis of scale . Line 4 conducts subspace selection by defining . Lines 5–6 represent the second stage of the algorithm, which calculates scale on the basis of the subspace . Line 5 calculates using Algorithm 1 with , and Line 6 then outputs , which is defined by scaling .

### 4.2 Theoretical analysis

Let be the minimum integer that satisfies

Note that such must exist since . Our main theoretical result can then be given as follows:

###### Theorem 20.

Thus, under the uniqueness of the true optimum solution , the estimated scale is asymptotically independent of the dimension .

The essence of the proof of this statement lies in finding subspaces that satisfy the following two conditions: (i) with high probability, it holds that

and (ii) . Property (i), together with Lemma 5, contributes to the proof of (11), and property (ii) implies the asymptotic convergence (12). Since is controlled by the estimated optimum value , for the existence of both and , we need to control the range of the estimated optimum value . We utilize the following lemma for this control:

###### Lemma 21.

Let and . Suppose that , , and satisfy

It then holds that .

Roughly speaking, we bound the range of in probability by applying Lemma 21 with and .

Let us conclude this section with a brief discussion of the computational tractability of our algorithms.

###### Remark 22.

Algorithm 2 mainly consists of an optimization (2) in Line 3 and two applications of Algorithm 1 in Lines 1 and 5, where each application consists of optimization of (10). If the original robust optimization (2) is a convex programming problem, then (10) with is a series of convex programs. It is thus natural to assume that Lines 1 and 3 are computationally tractable.

For the second application of Algorithm 1, in Line 5, note that the subspace will not generally be convex because of the concavity of . In practical implementation, we can replace in the definition of by an upper-bound that makes convex. If such a can be bounded above as with some constant , then the resulting output of Algorithm 2 will also satisfy the guarantee in Theorem 20.

Let us introduce a simple example of such an upper-bound . Suppose that is the positive orthant () and that . Then , where is the -norm for . It then holds that . Observe that is linear on the positive orthant , and thus this is computationally tractable.

## 5 Experiments

### 5.1 Experimental setting

Let us consider the following simple portfolio optimization problem, which has been examined by several existing studies in robust optimization [8, 3]. Suppose that there are items whose costs are , which are uncertain parameters. The task is to find a convex combination (portfolio) that minimizes the cost: