A Kernel Mean Embedding Approach to Reducing Conservativeness in Stochastic Programming and Control

We apply kernel mean embedding methods to sample-based stochastic optimization and control. Specifically, we use the reduced-set expansion method as a way to discard sampled scenarios. The effect of such constraint removal is improved optimality and decreased conservativeness. This is achieved by solving a distributional-distance-regularized optimization problem. We demonstrate that this optimization formulation is well motivated in theory, computationally tractable, and effective in numerical algorithms.




1 Introduction

Robustification against uncertain events is at the core of modern optimization and control. From the classic S-lemma to recent advances in distributionally robust optimization (DRO), we have witnessed computational tools giving rise to new robustification designs.

Figure 1: Steering the state trajectories (blue) without crossing the constraints (red). The two panels depict optimization over different numbers of scenarios: (left) the full set of sampled scenarios; (right) the reduced set of scenarios.

The classic “worst-case” approach sets out to robustify constraints against all realizations of disturbances in a mathematical model, which often results in over-conservativeness. Consider the illustrative example of a sample-based approach to solving a constrained stochastic control problem in Figure 1. In this case, we sample different numbers of realizations of the uncertainty in (left) and (right), and then solve the control problem under those realized scenarios. Intuitively, one may expect the controller associated with more scenarios in (left) to be more robust against constraint violation than the one in (right). However, this robustness comes with a conservative design, which may be reflected in a high cost.

The central idea of modern data-driven robust optimization (e.g., [bertsimas2018data]), put in lay terms, is to use data samples to form an empirical understanding of the true distribution and to robustify only against this empirical understanding, instead of against the whole support. One concrete point of relevance to our discussion, for example in Figure 1, is that fewer scenarios translate to fewer constraints, which in turn leads to reduced cost. Naturally, this is a trade-off between optimality and feasibility.

In this paper, we show that constraint removal can be formulated as an optimization problem that aims to form a new distribution close to the empirical data distribution. Our contributions are: (1) We formulate constraint removal in stochastic programming and control as a tractable convex optimization problem with a reproducing kernel Hilbert space (RKHS) distance regularization or constraint. This formulation is well motivated in theory and effective in numerical studies. (2) To our knowledge, this is the first use of the RKHS-embedding reduced-set method in stochastic optimization and scenario approaches to control. Its implication is a connection between stochastic control and probability-metric-constrained DRO.

Notation. In this work, the symbol $\mathcal{H}$ denotes a reproducing kernel Hilbert space (RKHS). We write $X \sim P$ to denote that the random variable or vector (RV) $X$ follows the distribution law $P$. By the empirical distribution of the data, we mean the linear combination of Dirac measures of the seen data, $\hat{P} = \frac{1}{N} \sum_{i=1}^{N} \delta_{\xi_i}$, where $\{\xi_i\}_{i=1}^{N}$ is the data set.

2 Background & related work

2.1 Stochastic programming and scenario optimization for control

In this paper, the problem of interest is the (chance-constrained) stochastic programming (SP; also known as stochastic optimization) in the following canonical formulation.


Because the uncertain parameter is assumed to be an RV, program (1) may be intuitively understood as making decisions under the uncertainty originating from it. We consider the following sample-based SP (a.k.a. the scenario approach).

Suppose we have a set of realizations of the uncertain parameter. We then solve the sample-based program


If the objective and constraint functions are convex in the decision variable and measurable in the uncertain parameter, it can be shown that this formulation is a convex approximation to the original SP (1). As the number of scenarios grows, the solution recovers that of the SP with a vanishing violation level. However, with a large number of scenarios, the solution to (2) is overly conservative: it aims to satisfy the constraints almost everywhere in the distribution of the uncertain parameter. Therefore, the sample size trades off conservativeness against constraint satisfaction. Extensive research (e.g., [calafiore2006scenario, dentcheva2000concavity, luedtke2010integer]) has focused on approaches that remove a subset of the sampled constraints to reduce the conservativeness of the solution.
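
For concreteness, programs (1) and (2) can be sketched in the following notation. The symbols are our choice for illustration and are assumptions: decision variable $x$, uncertain parameter $\xi$, a deterministic objective $F$, constraint function $C$, violation level $\delta$, and $N$ sampled scenarios $\xi_1, \dots, \xi_N$.

```latex
\begin{align}
  \min_{x} \; F(x) \quad & \text{s.t.} \quad
    \mathbb{P}\big( C(x, \xi) \le 0 \big) \ge 1 - \delta, \tag{1} \\
  \min_{x} \; F(x) \quad & \text{s.t.} \quad
    C(x, \xi_i) \le 0, \quad i = 1, \dots, N. \tag{2}
\end{align}
```

In this reading, every additional scenario in (2) adds one constraint, which is why removing scenarios can only lower (or keep) the optimal cost.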

Most relevant to this paper, [campi2011sampling] established guaranteed bounds relating the constraint satisfaction probability to the number of removed constraints (out of the total number of sampled constraints). Our method builds upon their sampling-and-discarding framework. [campi2013random] used regularization to encourage sparsity in decision variables, which is different from our sparsity in RKHS expansion terms.

For readers who are interested in sample-based stochastic programming, good textbook references are Ch. 5 of [shapiro2009lectures] and Ch. 9 of [birge2011introduction].

2.2 Reproducing kernel Hilbert space (RKHS) embeddings

This section establishes the necessary tools from kernel methods. It is by no means a comprehensive survey. For readers who are not familiar with RKHS embeddings, we refer to [zhu2019new] for an accessible introduction in the context of stochastic systems and to [scholkopf2002learning, Muandet2017] for extensive coverage.

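A positive definite kernel is a real-valued, bivariate, symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0$ for any $n \in \mathbb{N}$, $\{x_i\}_{i=1}^{n} \subset \mathcal{X}$, and $\{a_i\}_{i=1}^{n} \subset \mathbb{R}$. One may intuitively think of $k(x, x')$ as a generalized similarity measure (inner product) between $x$ and $x'$ after mapping them into the feature space $\mathcal{H}$, i.e., $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. We refer to $\phi$ as the feature map associated with the kernel $k$, and to $\mathcal{H}$ as the associated RKHS. A canonical kernel is the Gaussian kernel $k(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$, where $\sigma > 0$ is a bandwidth parameter.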
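
The defining property of a positive definite kernel can be checked numerically on a small example. The sketch below is a minimal illustration for scalar inputs using only the standard library; the data, the bandwidth value, and the helper names (`gaussian_kernel`, `gram_matrix`, `quadratic_form`) are our own assumptions.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-(x - y)^2 / (2 sigma^2)) for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Gram matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, a):
    """Evaluate sum_i sum_j a_i a_j K[i][j]."""
    n = len(a)
    return sum(a[i] * a[j] * K[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)
points = [rng.uniform(-3.0, 3.0) for _ in range(8)]
K = gram_matrix(points, gaussian_kernel)

# Positive definiteness: the quadratic form is nonnegative for any coefficients.
min_form = min(
    quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in points]) for _ in range(200)
)
print(f"smallest quadratic form over 200 random draws: {min_form:.3e}")
```

Random coefficient draws cannot prove positive definiteness, but they give a quick sanity check that a kernel implementation behaves as expected.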

RKHS embedding, or kernel mean embedding (KME) [Smola07Hilbert], maps probability distributions to (deterministic) elements of a Hilbert space. Mathematically, the KME of a random variable $X \sim P$ is given by the function $\mu_P := \mathbb{E}_{X \sim P}[\phi(X)] = \int k(x, \cdot)\, dP(x)$, which is a member of the RKHS. For example, the RKHS associated with the second-order polynomial kernel $k(x, x') = (x^\top x' + 1)^2$ consists of quadratic functions, and the coefficients of the embedding preserve the statistical mean and variance. Gaussian kernel embeddings, on the other hand, preserve richer information, up to moments of infinite order.
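
The difference between the two kernels can be illustrated with a small numerical sketch. It compares the squared RKHS distance between the empirical embeddings of two scalar samples that share the same mean and variance. The sample values, the inhomogeneous form of the polynomial kernel, and the helper names are assumptions made for illustration.

```python
import math


def poly2_kernel(x, y):
    """Second-order polynomial kernel (x*y + 1)^2 for scalar inputs."""
    return (x * y + 1.0) ** 2


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel with bandwidth sigma for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, kernel):
    """Squared RKHS distance between the empirical KMEs of xs and ys."""
    def mean_k(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)


# Two different samples with the same mean (0) and the same variance (1).
sample_a = [-1.0, 1.0]
r = math.sqrt(1.5)
sample_b = [-r, 0.0, r]

mmd_poly = mmd_squared(sample_a, sample_b, poly2_kernel)
mmd_gauss = mmd_squared(sample_a, sample_b, gaussian_kernel)
print(f"polynomial kernel: {mmd_poly:.2e}, Gaussian kernel: {mmd_gauss:.2e}")
```

Under the polynomial kernel the two embeddings coincide up to rounding, because only the first two moments are encoded. The Gaussian kernel separates the two samples.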

Reduced-set expansion method using RKHS embeddings. Given a data set $\{x_i\}_{i=1}^{N}$, the sample-based KME is given by $\hat{\mu} = \sum_{i=1}^{N} \alpha_i \phi(x_i)$, where one can simply choose $\alpha_i = 1/N$. It has been shown that one may use fewer than the total $N$ data samples to represent the distribution. This is the idea of reduced-set approximation (cf. [scholkopf2002learning]). Mathematically, this method seeks an embedding with fewer expansion terms that approximates $\hat{\mu}$ in the sense of the RKHS distance. The reduced-set method forms the backbone of our approach. We also note that there are other related approximation methods, such as those of [chen2012super, bach2012equivalence]. Recently, [zhu2019new] considered recursive applications of the reduced-set method to uncertainty in stochastic systems.

3 Method

3.1 Stochastic programming with reduced-set expansion of RKHS embeddings

We consider the sample-based formulation of the stochastic programming problem (2). Our main idea is to perform constraint removal systematically using the aforementioned RKHS-embedding reduced-set method. Typically, constraint removal discards low-probability scenarios to reduce the conservativeness of the resulting solution. Given a set of realized scenarios and a positive definite kernel, we formulate the optimization problem as


where the scaling vector weights the individual terms of the $\ell_1$-penalty. It can often be set to reflect specific concerns, such as the distance of states to the constraint. The KME expansion weights need not sum to one. The target of the approximation is the empirical KME estimator of the scenario distribution. We further write down the equivalent Lagrangian form.
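
A sketch of the two formulations, (3) and (4), in assumed notation may help fix ideas. Here $\alpha \in \mathbb{R}^N$ are the expansion weights, $w$ is the scaling vector, $\odot$ is the elementwise product, $\epsilon$ is a distance budget, and $\lambda$ is a regularization coefficient. The symbols, the squared distance in (4), and the placement of $\lambda$ are assumptions made for illustration.

```latex
\begin{align}
  \min_{\alpha \in \mathbb{R}^N} \; \| w \odot \alpha \|_1
    \quad & \text{s.t.} \quad
    \Big\| \sum_{i=1}^{N} \alpha_i \phi(\xi_i) - \hat{\mu} \Big\|_{\mathcal{H}} \le \epsilon,
    \qquad \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} \phi(\xi_i), \tag{3} \\
  \min_{\alpha \in \mathbb{R}^N} \;
    & \Big\| \sum_{i=1}^{N} \alpha_i \phi(\xi_i) - \hat{\mu} \Big\|_{\mathcal{H}}^2
    + \lambda \, \| w \odot \alpha \|_1. \tag{4}
\end{align}
```

In this form, scenarios with a large entry in $w$ are penalized more heavily and are therefore more likely to receive a zero weight.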


The resulting solution is sparse due to the sparsity-inducing term. We then discard the points identified by the resulting index set, i.e., those whose expansion weights vanish. Finally, we re-solve the stochastic programming problem with the reduced-set scenarios


The intuition behind formulations (3) and (4) is to produce a subset of the data whose distribution is close to the empirical data distribution in the sense of the RKHS-embedding distance. Meanwhile, the weighted $\ell_1$-penalty incentivizes the solution to become sparse. Therefore, the solution to (4) discards the “corner” cases while maintaining the statistical information. We outline the algorithmic procedure in Algorithm 1.

1:  Solve the sample-based stochastic programming problem (2).
2:  Find the reduced-set RKHS embedding by solving the convex optimization problem (3). The reduced index set is defined in (5).
3:  Solve the stochastic programming problem according to the reduced set RKHS approximation (i.e., constraint removal by sparse optimization).
4:  Output: Solution of the above reduced stochastic program.
Algorithm 1 RKHS approximation to stochastic programming
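
The steps of Algorithm 1 can be sketched end to end on a toy problem. Everything below is an illustrative assumption rather than the implementation used in the experiments: the toy program (minimize $x$ subject to $x \ge \xi_i$), the exponential scaling vector, the squared-distance form of problem (4), and the proximal-gradient (ISTA) solver, which stands in for a generic convex solver.

```python
import math


def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel with bandwidth sigma for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def soft_threshold(v, t):
    """Proximal operator of t * |.|."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def reduced_set_weights(xi, w, lam, iters=500):
    """ISTA on a^T K a - 2 b^T a + lam * sum_i w_i |a_i| (assumed form of (4))."""
    n = len(xi)
    K = [[gaussian_kernel(p, q) for q in xi] for p in xi]
    b = [sum(row) / n for row in K]           # b = K 1 / N
    step = 1.0 / (2.0 * n)                    # 1/L with L = 2 trace(K) >= 2 lambda_max(K)
    a = [1.0 / n] * n                         # start from the empirical weights
    for _ in range(iters):
        grad = [2.0 * (sum(K[i][j] * a[j] for j in range(n)) - b[i]) for i in range(n)]
        a = [soft_threshold(a[i] - step * grad[i], step * lam * w[i]) for i in range(n)]
    return a


def solve_toy_sp(scenarios):
    """min x  s.t.  x >= xi_i for all i; the minimizer is the largest scenario."""
    return max(scenarios)


# Ten typical scenarios plus two rare "corner" scenarios.
xi = [0.1 * k for k in range(10)] + [2.5, 3.0]

x_full = solve_toy_sp(xi)                                  # Step 1: full program
w = [math.exp(s - max(xi)) for s in xi]                    # heavier penalty near the constraint
alpha = reduced_set_weights(xi, w, lam=1.0)                # Step 2: sparse weights
kept = [i for i, a_i in enumerate(alpha) if abs(a_i) > 1e-12]
x_reduced = solve_toy_sp([xi[i] for i in kept])            # Step 3: reduced program

print("kept indices:", kept)                               # Step 4: output
print("full cost:", x_full, "reduced cost:", x_reduced)
```

In this toy setting the two isolated scenarios receive zero weight, so the reduced program is no longer driven by them and its cost drops. Which of the clustered scenarios are kept depends on the kernel bandwidth and on `lam`.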

Remark. In Step 2 of Algorithm 1, we may also use the reduced-set expansion of any transformation of the random variables. This generalization, termed kernel probabilistic programming [Scholkopf2015], is often of interest, as in our numerical examples. Its statistical consistency is justified by [SimonGabriel2016]. See also Section 2.2 of [zhu2019new] for an accessible discussion.

The following lemma shows that formulation (4) is computationally tractable.

Lemma. If the Hilbert space in problem (4) is the RKHS associated with a positive definite kernel, the objective of the optimization problem is convex.

Proof sketch. We use the sample-based estimator of the RKHS distance. With expansion weights $\alpha \in \mathbb{R}^N$ and uniform empirical weights $1/N$, the kernel trick (cf. [scholkopf2002learning]) gives the squared distance term of the objective as

$$\Big\| \sum_{i=1}^{N} \alpha_i \phi(\xi_i) - \frac{1}{N} \sum_{i=1}^{N} \phi(\xi_i) \Big\|_{\mathcal{H}}^2 = \alpha^\top K \alpha - 2\, b^\top \alpha + c,$$

where $b = \frac{1}{N} K \mathbf{1}$ is a constant vector, $c = \frac{1}{N^2} \mathbf{1}^\top K \mathbf{1}$ is a constant, and $K$ with $K_{ij} = k(\xi_i, \xi_j)$ is the Gram matrix associated with the positive definite kernel, which implies $K \succeq 0$. Using the convexity of the $\ell_1$-norm, the conclusion follows.
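
The kernel-trick identity used in the proof sketch can be verified numerically for a kernel whose feature map is known in closed form. The sketch below uses the scalar second-order polynomial kernel with the explicit feature map $(x^2, \sqrt{2}x, 1)$. The data, the weights, and the variable names are arbitrary choices made for illustration.

```python
import math


def poly2_kernel(x, y):
    """Second-order polynomial kernel (x*y + 1)^2 for scalar inputs."""
    return (x * y + 1.0) ** 2


def poly2_features(x):
    """Explicit feature map with <phi(x), phi(y)> = (x*y + 1)^2 for scalars."""
    return (x * x, math.sqrt(2.0) * x, 1.0)


data = [-1.0, 0.5, 2.0, 3.0]
alpha = [0.4, 0.0, 0.7, -0.1]   # arbitrary expansion weights; they need not sum to one
n = len(data)

# Kernel-trick form: alpha^T K alpha - 2 b^T alpha + c.
K = [[poly2_kernel(p, q) for q in data] for p in data]
b = [sum(row) / n for row in K]
c = sum(sum(row) for row in K) / n ** 2
quad = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
kernel_form = quad - 2.0 * sum(b_i * a_i for b_i, a_i in zip(b, alpha)) + c

# Direct form: squared Euclidean norm of the difference of feature-space vectors.
feats = [poly2_features(x) for x in data]
diff = [
    sum(alpha[i] * feats[i][d] for i in range(n)) - sum(f[d] for f in feats) / n
    for d in range(3)
]
direct_form = sum(v * v for v in diff)

print(f"kernel form: {kernel_form:.6f}, direct form: {direct_form:.6f}")
```

For kernels such as the Gaussian kernel the feature space is infinite-dimensional, so only the kernel-trick form is computable. The check above only confirms that the two forms agree where both are available.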

Remark (Relation to distributionally robust optimization, DRO). We can equivalently write the constraint of program (3) in the form of the maximum mean discrepancy (MMD), $\mathrm{MMD}(\hat{P}, P_\alpha) \leq \epsilon$, where $\hat{P}$ is the empirical distribution of the data samples and $P_\alpha$ is the distribution induced by the reduced-set embedding. The distribution associated with the reduced-set embedding can then be viewed as an $\epsilon$-perturbation of the empirical distribution, i.e., $P_\alpha$ must lie within an MMD-ambiguity set around $\hat{P}$. Optimization problems with such constraints are often referred to as generalized moment problems. The connection to distributionally robust optimization is evident: we robustify against the worst case within an MMD-ambiguity set around the empirical data distribution instead of the whole support. (We refer readers unfamiliar with DRO to [kuhn2019wasserstein] or [erdougan2006ambiguous] for a recent introduction.)

3.2 Application to stochastic optimal control

Let us consider the following sample-based (scenario) formulation of the stochastic optimal control problem (OCP).


where the uncertain variables are represented by their sampled realizations. The uncertainty in the initial state is particularly relevant to MPC designs. After proper transcription and discretization, this OCP takes the same form as the sample-based SP (2) and is solvable by Algorithm 1.

Remark. For conciseness, we restrict the uncertainty to the initial states in the OCP above. Reduced-set RKHS embedding of more general process disturbances has been discussed in [zhu2019new].

4 Numerical Experiments

4.1 Min-max robust regression

We first consider a synthetic stochastic programming problem given in the form of the following min-max robust regression. A similar example was considered in [campi2019scenario].


For simplicity, we consider scalar data generated randomly, where the (unknown) true parameter is drawn from a uniform distribution.

Given the computed solution to the full program (9), let us consider the quantity of interest, which is an RV due to the uncertainty in the data. We now apply Algorithm 1 to find the reduced-set embedding of this quantity. In Step 2, when solving program (4), we use a softmax scaling factor to incentivize the removal of “corner” points (its parameter may be thought of as the “softness” of this softmax scaling factor). We then remove the constraints associated with the identified index set from the stochastic program and re-compute a solution. Following our discussion in the previous sections, this embedding captures the distributional information while discarding the rare scenarios. This is done by solving the sparse optimization problem (3). The results are illustrated in Figure 2. As we can see, scenarios associated with “corner” data points are not selected, which reduces conservativeness.
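
One plausible form of such a softmax scaling factor is sketched below. The exact expression is not reproduced here, so the function `softmax_scaling`, its `softness` argument, and the sample values are hypothetical and serve only to illustrate how the penalty concentrates on extreme points.

```python
import math


def softmax_scaling(q, softness=1.0):
    """Softmax weights over a quantity of interest q (hypothetical form)."""
    m = max(q)
    e = [math.exp((v - m) / softness) for v in q]   # shift by the max for stability
    total = sum(e)
    return [v / total for v in e]


residuals = [0.1, 0.2, 0.15, 1.2, 0.05]             # one "corner" point at index 3
w_soft = softmax_scaling(residuals, softness=1.0)
w_sharp = softmax_scaling(residuals, softness=0.1)

print("soft :", [round(v, 3) for v in w_soft])
print("sharp:", [round(v, 3) for v in w_sharp])
```

A smaller softness value pushes almost all of the penalty onto the most extreme point, which makes that point the first candidate for removal.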

Figure 2: Solutions of the min-max robust regression problem. The three panels (left, center, right) correspond to three different regularization coefficients and numbers of discarded scenarios. The shaded strip denotes the robust margin. Red points are the points selected by Step 2 of Algorithm 1. Dark points correspond to discarded scenarios and constraints. We used a Gaussian kernel with a fixed bandwidth to calculate the RKHS embedding in Algorithm 1.

Figure 3 illustrates the effect of constraint removal on the optimal objective value and constraint violation. See the caption for a detailed description.

Figure 3: Estimates produced by a large-sample Monte Carlo simulation. (left) Constraint violation probability of the solution produced by Algorithm 1, estimated by the Monte Carlo average of the indicator function of the violation event. (right) The new expected cost associated with the solution produced by Algorithm 1, estimated by the Monte Carlo average of the cost evaluated at the solution of the proposed method.
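
The two Monte Carlo estimators described in the caption can be sketched as follows. The toy uncertainty model (standard normal), the constraint, the cost, and the two candidate decisions are assumptions for illustration and are unrelated to the regression data of the experiment.

```python
import math
import random


def monte_carlo_estimates(x, violated, cost, sampler, num_samples=50_000, seed=0):
    """Estimate the violation probability and the expected cost of a decision x."""
    rng = random.Random(seed)
    n_violations = 0
    total_cost = 0.0
    for _ in range(num_samples):
        xi = sampler(rng)
        n_violations += 1 if violated(x, xi) else 0   # indicator of the violation event
        total_cost += cost(x, xi)
    return n_violations / num_samples, total_cost / num_samples


def sampler(rng):
    """Toy uncertainty model (assumption): xi ~ N(0, 1)."""
    return rng.gauss(0.0, 1.0)


def violated(x, xi):
    """Toy constraint (assumption): xi <= x must hold."""
    return xi > x


def cost(x, xi):
    """Toy cost (assumption): slack between the decision and the realization."""
    return x - xi


p_full, c_full = monte_carlo_estimates(3.0, violated, cost, sampler)
p_reduced, c_reduced = monte_carlo_estimates(1.5, violated, cost, sampler)
exact_reduced = 0.5 * math.erfc(1.5 / math.sqrt(2.0))   # closed form, for comparison only

print(f"conservative decision: violation {p_full:.4f}, cost {c_full:.3f}")
print(f"optimistic decision:   violation {p_reduced:.4f}, cost {c_reduced:.3f}")
```

The less conservative decision has a lower average cost and a higher violation probability, which is the trade-off that the two panels display.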

4.2 Stochastic control

We now consider the Van der Pol oscillator model


The goal of the control design is to steer the system state to a certain level. This is formulated as the following OCP.


We sample i.i.d. realizations of the uncertainty from its distribution. Because of the nonlinear dynamics, we cannot propagate the uncertainty in a tractable manner, as in LQG, without resorting to approximations. We use the sampled scenarios to form the OCP (8). The continuous-time dynamics are transcribed using multiple shooting with the CVODES integrator (interfaced through CasADi). We then solve the discretized OCP with IPOPT to obtain the optimal control. An example of the states associated with the solution is given in Figure 1 (left). The experiment uses a fixed total time horizon and a fixed number of control steps.
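
The scenario sampling and simulation part of this setup can be sketched without CasADi or IPOPT. The block below uses the textbook form of the controlled Van der Pol dynamics, a fixed-step RK4 integrator, Gaussian perturbations of the initial state, a zero control sequence, and a position bound `X1_MAX`. All of these are assumptions for illustration and may differ from the model and settings used in the experiment.

```python
import random

X1_MAX = 2.5   # hypothetical upper bound on the position x1


def van_der_pol(state, u, mu=1.0):
    """Textbook controlled Van der Pol dynamics (assumed form)."""
    x1, x2 = state
    return (x2, mu * (1.0 - x1 * x1) * x2 - x1 + u)


def rk4_step(f, state, u, dt):
    """One classical Runge-Kutta step with the control held constant."""
    k1 = f(state, u)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), u)
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), u)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), u)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def rollout(x0, controls, dt=0.1):
    """Simulate the state trajectory for one scenario of the initial state."""
    traj = [x0]
    for u in controls:
        traj.append(rk4_step(van_der_pol, traj[-1], u, dt))
    return traj


def distance_to_bound(traj):
    """Smallest distance from the position x1 to the upper bound along a trajectory."""
    return min(X1_MAX - x1 for x1, _ in traj)


rng = random.Random(0)
controls = [0.0] * 50                                          # placeholder control sequence
scenarios = [(1.0 + rng.gauss(0.0, 0.1), 0.0) for _ in range(20)]
margins = [distance_to_bound(rollout(x0, controls)) for x0 in scenarios]

print("smallest margin over scenarios:", round(min(margins), 3))
```

In a scenario OCP, one such trajectory per sampled initial state would enter the constraints, and the per-scenario margins are the kind of quantity that can feed the scaling vector in Step 2 of Algorithm 1.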

Let us consider the quantity of interest, the distance from the state position to the upper-bound constraint. This quantity reflects how close we are to infeasibility. It is random because the states are a function of the uncertain initial state. In Step 2 of Algorithm 1, the scaling factor is chosen to encourage the removal of close-to-constraint trajectories.

We are now ready to apply Algorithm 1 to find the reduced-set embedding of this quantity, represented as a vector comprising its values at all time steps. Finally, we re-solve the resulting reduced-set SP, i.e., the OCP. Figure 1 (right) illustrates the reduced number of scenarios.

After applying Algorithm 1, we obtain the “optimistic” controller. To evaluate its performance, we use a large-sample Monte Carlo simulation to estimate the constraint violation probability, i.e., the probability appearing in the chance-constrained SP (1), as well as the expected cost over the large-sample simulation. We plot the state trajectories for different numbers of removed constraints in Figure 4. The trade-off between the two is illustrated in Figure 5. The result makes intuitive sense: the more constraints we remove, the less conservative the solution, but the higher the violation probability.

Figure 4: (left) The optimistic controller evaluated in independent Monte Carlo simulations. The controller is produced by Algorithm 1 for one choice of the number of removed scenarios (and the corresponding regularization coefficient), together with its estimated constraint-violation probability. (right) The controller produced with a different number of removed scenarios (and regularization coefficient), together with its estimated constraint-violation probability.
Figure 5: Performance estimates produced by independent Monte Carlo simulations. (left) Constraint violation probability of the solution produced by Algorithm 1, estimated by the Monte Carlo average of the indicator function of violating any of the constraints in OCP (11). (right) The new expected cost of the OCP associated with the solution produced by Algorithm 1, estimated by the Monte Carlo average of the cost under the optimistic controller.

5 Discussion

This paper proposed a distributional-distance-regularized optimization formulation for stochastic programming under the framework of sampling-and-discarding. We demonstrated effective conservativeness reduction in data-driven optimization and control tasks. Although we did not study the guaranteed bounds, all analysis in [campi2011sampling] applies to our case. However, as our approach produces new distributions that are close to the empirical distribution, the sample complexity is likely to be less conservative.

One particularly interesting aspect is the interpretation of perturbing the empirical data distribution within an ambiguity set in the sense of the RKHS distance (as remarked in Section 3.1). This is worthy of further investigation.

We owe the inspiration for the second numerical example to Joris Gillis. We also thank Wittawat Jitkrittum for his helpful feedback. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 798321, from the German Federal Ministry for Economic Affairs and Energy (BMWi) via eco4wind (0324125B) and DyConPV (0324166B), and from DFG via Research Unit FOR 2401.