# Mixed Strategy for Constrained Stochastic Optimal Control

Choosing control inputs randomly can result in a reduced expected cost in optimal control problems with stochastic constraints, such as stochastic model predictive control (SMPC). We consider a controller with initial randomization, meaning that the controller randomly chooses from K+1 control sequences at the beginning (called K-randomization). It is known that, for a finite-state, finite-action Markov Decision Process (MDP) with K constraints, K-randomization is sufficient to achieve the minimum cost. We found that the same result holds for stochastic optimal control problems with continuous state and action spaces. Furthermore, we show that randomization of the control input can reduce the cost when the optimization problem is nonconvex, and that the cost reduction is equal to the duality gap. We then provide the necessary and sufficient conditions for the optimality of a randomized solution, and develop an efficient solution method based on dual optimization. In the special case of K=1, such as a joint chance-constrained problem, the dual optimization can be solved even more efficiently by root finding. Finally, we test the theories and demonstrate the solution method on multiple practical problems ranging from path planning to the planning of entry, descent, and landing (EDL) for future Mars missions.


## I Introduction

The main finding of this paper is that, in optimal control problems with stochastic constraints, choosing control inputs randomly can result in a lower expected cost than deterministically optimizing them. To communicate the idea, consider the toy problem illustrated in Figure 2. The goal is to plan a path to the goal with a minimum expected cost while limiting the chance of failure to 1%. There are two path options, A and B. A has an expected cost of 20 and a 0.5% chance of failure; B has an expected cost of 10 and a 1.5% chance of failure. Choosing B violates the chance constraint, hence the optimal solution is A if only a deterministic choice is allowed. However, we can create a mixed solution by flipping a coin to randomly choose between A and B. Assuming the probabilities of heads and tails are both 0.5, the resulting mixed solution has an expected cost of 15 and a 1% chance of failure, which satisfies the chance constraint. The expected cost of the mixed solution is less than that of A. In the example above, A and B are represented by two different sequences of control inputs. When a mixed strategy is employed, the system flips a coin once at the beginning and chooses a sequence according to the result of the coin flip. Once a sequence is selected, the system sticks to the sequence until the end.
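The arithmetic of the toy problem can be checked with a few lines of Python. This is only an illustrative sketch: the tuple layout `(expected cost, failure probability)` and the function name `mix` are assumptions introduced here, not notation from the paper.

```python
# Toy problem from the text: each option is (expected cost, failure probability).
A = (20.0, 0.005)
B = (10.0, 0.015)
RISK_BOUND = 0.01  # chance of failure must not exceed 1%


def mix(p_a, a, b):
    """Expected cost and failure probability when `a` is chosen with probability p_a."""
    cost = p_a * a[0] + (1.0 - p_a) * b[0]
    risk = p_a * a[1] + (1.0 - p_a) * b[1]
    return cost, risk


# Fair coin flip between A and B.
cost, risk = mix(0.5, A, B)
print(f"mixed: cost={cost:.3f}, risk={risk:.4f}")
print(f"pure A: cost={A[0]:.3f}, risk={A[1]:.4f}")
```

Both the expected cost and the failure probability are linear in the mixing probability, which is why the mixed solution lies on the line segment between A and B in the cost-risk space.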

Mixed strategy is essentially a convexification. As shown in Figure 2, in the cost-risk space, the set of pure strategies is a nonconvex set consisting of two points. The solution set of mixed strategies is the line segment between A and B, which is the convex hull of the pure solution set. In general, as shown in Figure 3, when the original problem is nonconvex, introducing mixed strategy extends the solution set, which could improve the cost of the optimal solution. The improvement is equal to the duality gap. When there is no duality gap, the optimal mixed solution is equivalent to the optimal pure solution (i.e., choosing the optimal pure solution with probability one). Therefore, the optimal solution to the mixed strategy problem is always at least as good as the optimal solution to the original problem.

In general, unlike the illustrative example above, a stochastic optimal control problem has infinitely many solutions. An optimal mixed-strategy controller first computes a finite number of control sequences, then randomly chooses one of them. The formal problem definition is given as an extension to a standard finite-horizon, constrained stochastic optimal control problem, where $N+1$ control sequences and the probabilities of choosing them are optimized (called $N$-randomization). The degree of randomization, $N$, is pre-specified. The controller chooses one control sequence at the beginning (hence called initial randomization). The chosen control sequence is then executed until the end.

The contribution of this paper is three-fold. First, we provide rigorous characterizations of mixed-strategy, constrained stochastic optimal control, which are summarized in Theorems 1-3 in Section II. Theorem 1 in Section II-B provides the sufficient degree of randomization. Specifically, for a problem with $K$ constraints, $K$-randomization is sufficient for optimality. Theorem 2 in Section II-C states that the attainable cost reduction by mixed strategy is equal to the duality gap in the original (non-randomized) problem. This is because the original problem is convexified by mixed strategy. In other words, mixed strategy can improve the solution only if the original problem is nonconvex. Theorem 3 in Section II-D provides the necessary and sufficient conditions for optimality, which consist of a subset of the KKT conditions and an additional constraint requiring that all the candidate control sequences are minimizers of the Lagrangian function with the optimal dual variable.

The second contribution of this paper is to develop an efficient solution approach to the mixed-strategy, constrained stochastic optimal control problem. A naive solution approach requires co-optimization of the $K+1$ control sequences as well as the probability distribution, which is significantly more complex than the original optimization problem without randomization. A key observation is that, since the mixed-strategy optimization problem is the convexification of the original problem, their dual optimal solutions are the same. This observation leads to our general solution method that solves the dual of the original optimization problem (without randomization). The primal optimal solution for the mixed-strategy problem can be recovered from the dual optimal solution. Furthermore, in the special case of $K=1$, we provide an even more efficient solution approach that solves the dual problem with root finding. For the more specific case where the proposed approach is applied to a linear SMPC with nonconvex constraints, we present an efficient, approximate solution approach in which the minimization of the Lagrangian function is approximated by a MILP through piecewise linearization of the cumulative distribution function of the state uncertainty.

The third contribution is to validate the theories and demonstrate the solution method in various practical scenarios. We first show an example of SMPC-based path planning with obstacles and a joint chance constraint, and show that mixed strategy indeed improves the expected cost. We also show that the proposed dual solution approach is also applicable to a finite-state optimal control problem. Two examples are presented in this domain: path planning with obstacles, and the planning of entry, descent, and landing (EDL) for future Mars rover/lander missions.

### I-A Related Work

In game theory, mixed strategy is usually discussed in the context of a simultaneous adversarial game. A classical example is rock-paper-scissors, where the sole Nash equilibrium is for both players to randomize uniformly. An underlying assumption here is that both players optimize their strategy given the strategy of the other player. The SMPC problem is different in that one player (the controller) optimizes her strategy given the strategy of the other (nature) but not vice versa. In other words, one player is cognitive while the other is blind. In rock-paper-scissors with cognitive and blind players, the cognitive player cannot be better off by employing a mixed strategy. Therefore, the fact that the cognitive player can be better off with a mixed strategy in optimal control is seemingly contradictory. The explanation is that, unlike players in classical game-theoretic settings, the controller solves a constrained optimization. Intuitively, the constraint and the objective work adversarially, like two players within a controller.

It is known that mixed strategy can improve the solution of constrained Markov decision process (MDP). Major results on this subject, including randomization, are summarized in [1]. A stochastic optimal control problem can be viewed as an MDP with continuous state and control spaces.

The majority of the existing methods for solving constrained MDPs use the idea of the convex-analytical (CA) approach [2]. The CA approach optimizes the performance metric of the MDP by reducing the problem to the optimization of a linear function over the set of occupancy measures; hence, it is formulated as a linear program. The CA approach has been shown to be useful for solving MDPs with multiple criteria and constraints when the constraints have the same additive structure as the performance measure (i.e., a linear function over the set of occupancy measures). This paper, on the other hand, does not limit the constraints or the performance metric to any structure, so it can be used for solving a more general class of constrained MDPs.

The scope of our work considers the performance of a mixture of nonrandomized policies (i.e., only considers initial randomization). It is worth mentioning that previous work has studied whether it is possible to split a randomized policy (i.e., randomization of feedback control law) into a mixture of deterministic policies while preserving performance [3]. It is shown that any Markov policy is a mixture of nonrandomized Markov policies [4, Theorem 5.2]. This inclusion suggests that this work also generalizes to randomization of control policy.

Note that the proposed method is fundamentally different from randomized SMPC methods such as scenario-based MPC [5, 6]. In scenario-based MPC, the optimal control inputs are deterministic but the solution method to obtain them is randomized. In contrast, in this work, the optimal control inputs are randomized but the solution method is deterministic. Randomized control input was considered in a control-theoretic context by [7, 8]. The problem considered in these studies is the probabilistic coordination of swarms of autonomous agents using a Markov chain controller. There, randomized control is used for a different purpose than in our work: in Markov chain control, randomized control inputs are used to achieve the desired spatial density distribution of the swarm agents without assuming inter-agent communication. In contrast, in our work, randomized control inputs are used to achieve a lower expected cost.

## II Method

The following is the rough sketch of the proposed solution process.

1. Prove that the original problem and the mixed strategy problem share the same dual optimal solution

2. Compute the dual optimal solution by solving the dual of the original problem

3. Recover the primal optimal solution of the mixed strategy problem from the dual optimal solution

The solution method is explained in detail in the following subsections.

### II-A Problem Formulation

We first formulate a pure-strategy problem that does not involve randomization. Consider a discrete-time optimal control problem with stochastic constraints, where the objective and constraints are on the expected cost over a finite horizon, $T$. Let $u \in \mathcal{U}^T$ be the control sequence and $x \in \mathcal{X}^T$ be the state sequence, where $\mathcal{U}$ and $\mathcal{X}$ are the feasible control set and the state space, respectively. We denote by $w$ the sequence of exogenous disturbances, which follows a known probability distribution. The system dynamics are represented by $g(x, u, w) = 0$. We consider closed-loop control, where the feedback law at the $k$-th time step is given by a deterministic control policy $\mu_k: \mathcal{X} \to \mathcal{U}$. Let $\mathcal{M}$ be the set of deterministic control policies that we consider. Hence, we seek an optimal sequence of deterministic control policies, $\mu \in \mathcal{M}^T$. We define cost functions, $f_i(x, u)$, for $i = 0 \cdots K$. With a slight abuse of notation, we denote the closed-loop cost and dynamics by $f_i(x, \mu)$ and $g(x, \mu, w) = 0$, respectively.

The problem is to minimize the expectation of $f_0$ while constraining the expectations of $f_i$ below $V_i$. The minimized expected cost is denoted by $c^\star_P$.

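PSOC (Pure-strategy Stochastic Optimal Control)

$$c^\star_P = \min_{\substack{\mu \in \mathcal{M}^T \\ g(x,\mu,w)=0}} \mathbb{E}[f_0(x,\mu)] \tag{1}$$

$$\text{s.t.}\quad \mathbb{E}[f_i(x,\mu)] \le V_i, \quad i = 1 \cdots K. \tag{2}$$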

A notable example of constraints in the form of (2) is a chance constraint. Let $\mathcal{X}_F$ be the set of feasible states. A chance constraint imposes a bound on the probability that the state stays within $\mathcal{X}_F$ over the planning horizon:

$$\Pr[x \in \mathcal{X}_F^T] \ge 1 - V. \tag{3}$$

This constraint is posed in the form of (2) by

$$f_i(x,\mu) := \begin{cases} 0 & (x \in \mathcal{X}_F^T) \\ 1 & (\text{otherwise}). \end{cases}$$
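The equivalence between the chance constraint (3) and the expectation constraint (2) follows in one line from the indicator form of $f_i$. The short derivation below is added for clarity, writing $X_F$ for the set of feasible states and taking $V_i = V$:

```latex
\begin{aligned}
\mathbb{E}[f_i(x,\mu)]
  &= 0 \cdot \Pr[x \in X_F^T] + 1 \cdot \Pr[x \notin X_F^T]
   = 1 - \Pr[x \in X_F^T], \\
\mathbb{E}[f_i(x,\mu)] \le V
  &\iff \Pr[x \in X_F^T] \ge 1 - V .
\end{aligned}
```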

The problem is reduced to an open-loop control problem (i.e., optimization of a control sequence) if $\mathcal{M}$ is limited to constant functions. Typical feedback MPCs limit $\mathcal{M}$ to linear feedback laws whose parameters are optimized. An MDP usually considers all possible mappings with finite state and control spaces. The discussion in Section II poses no assumptions on the policy set, the state space, or the control set (except for standard assumptions), hence it can be applied to a variety of problems ranging from stochastic MPC with continuous state and control to MDPs with finite state and control. Sections III and IV then discuss more specialized cases.

We next define the mixed-strategy problem, in which one of $N+1$ policy sequences is chosen at the beginning. Following the convention in constrained MDPs, we call such a randomization the $N$-randomization in this paper. Consider mixing $N+1$ policy sequences, $\mu^1 \cdots \mu^{N+1}$. Let $p_j$ be the probability that $\mu^j$ is chosen. The mixed strategy problem is to optimize $p_1 \cdots p_{N+1}$ as well as $\mu^1 \cdots \mu^{N+1}$ to minimize the expected cost. The minimized expected cost is denoted by $c^{\star N}_M$.

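MSOC (Mixed-strategy Stochastic Optimal Control)

$$c^{\star N}_M = \min_{\substack{\mu^1 \cdots \mu^{N+1} \in \mathcal{M}^T \\ \sum_{j=1}^{N+1} p_j = 1,\ p_j \ge 0 \\ g(x,\mu,w)=0}} \ \sum_{j=1}^{N+1} p_j \mathbb{E}[f_0(x,\mu^j)] \tag{4}$$

$$\text{s.t.}\quad \sum_{j=1}^{N+1} p_j \mathbb{E}[f_i(x,\mu^j)] \le V_i, \quad i = 1 \cdots K. \tag{5}$$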

### II-B Sufficient Degree of Randomization

Before solving MSOC, we have to determine $N$. In other words, we have to know the sufficient number of control sequences to be mixed. We show in this subsection that $K$-randomization ($N = K$) is sufficient to minimize $c^{\star N}_M$. In other words, for a problem with $K$ stochastic constraints, at most $K+1$ control sequences need to be mixed to form an optimal solution. The formal statement is given in Theorem 1 later in this subsection, but we first need a few preparations.

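Let $c := [c_0, c_1, \cdots, c_K]$, where $c_i$ is the $i$-th cost value:

$$c_i = \mathbb{E}[f_i(x,\mu)].$$

We denote by $F$ the feasible set of the costs of the original problem, that is,

$$F := \{ c \mid \mu \in \mathcal{M}^T \land g(x,\mu,w) = 0 \}. \tag{6}$$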

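We assume that $F$ is a closed set, which typically holds when $\mathcal{U}$ and $\mathcal{X}$ are closed sets. With $F$, PSOC (1), (2) can be written in a simpler form as follows:

PSOC’:

$$c^\star_P = \min_{c \in F} c_0 \tag{7}$$

$$\text{s.t.}\quad c_i \le V_i, \quad i = 1 \cdots K. \tag{8}$$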

Likewise, MSOC is equivalent to:

MSOC’:

$$\min_{\substack{c^1 \cdots c^{N+1} \in F \\ \sum_{j=1}^{N+1} p_j = 1,\ p_j \ge 0}} \ \sum_{j=1}^{N+1} p_j c^j_0 \tag{9}$$

$$\text{s.t.}\quad \sum_{j=1}^{N+1} p_j c^j_i \le V_i, \tag{10}$$

where $c^j$ is the $j$-th cost vector and $c^j_i$ is its $i$-th component. We note that, when actually solving the problem, we do not explicitly compute $F$. We introduce it for ease of understanding.

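For later convenience, we derive another equivalent form of MSOC. Let

$$F^N_M := \left\{ \sum_{j=1}^{N+1} p_j c^j \ \middle|\ c^j \in F,\ 0 \le p_j,\ \sum_{j=1}^{N+1} p_j = 1 \right\}. \tag{11}$$

Observe that MSOC’ is equivalent to:

MSOC”:

$$c^{\star N}_M = \min_{c \in F^N_M} c_0 \tag{12}$$

$$\text{s.t.}\quad c_i \le V_i, \quad i = 1 \cdots K. \tag{13}$$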

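Let $\mathbb{N}$ be the set of positive integers. The following theorem holds:

###### Theorem 1.
$$K \in \arg\min_{N \in \mathbb{N}} c^{\star N}_M. \tag{14}$$
###### Proof.

From the definition of $F^N_M$, it is obvious that

$$F^i_M \subseteq F^{i+1}_M, \quad \forall i \in \mathbb{N}.$$

Therefore,

$$c^{\star i}_M \ge c^{\star i+1}_M, \quad \forall i \in \mathbb{N}. \tag{15}$$

Also, since $F \subset \mathbb{R}^{K+1}$, it follows from Carathéodory’s Theorem that

$$F^i_M = \mathrm{conv}(F), \quad \forall i \in \mathbb{N},\ i \ge K,$$

where $\mathrm{conv}(\cdot)$ is the convex hull of a set. Therefore,

$$c^{\star i}_M = c^{\star i+1}_M, \quad \forall i \in \mathbb{N},\ i \ge K. \tag{16}$$

The theorem follows from (15) and (16). ∎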

Theorem 1 means that we only need to consider $K$-randomization in order to minimize the expected cost. In the remainder of this paper we only consider MSOC with $N = K$, which we simply denote by MSOC. Its dual problem, DMSOC, plays an important role in the analysis later in the paper. Therefore, we further assume that the Slater condition is satisfied for the MSOC problem, i.e., there exists a feasible point in the relative interior of $\mathrm{conv}(F)$. The Slater condition guarantees a zero duality gap. In general, it is easy to check and is rarely violated in practical problems.

We also use the following simplified notation:

$$c^\star_M := c^{\star K}_M.$$

### II-C Cost Reduction by Randomization

We next discuss under what condition the mixed strategy control can outperform pure strategies, and if it does, by how much.

###### Lemma 1.

The optimal mixed strategy control is at least as good as the optimal pure strategy control, that is:

$$c^\star_M \le c^\star_P.$$
###### Proof.

It follows from the following:

$$F = F^0_M \subseteq F^K_M.$$

∎

This result is obvious because a pure strategy control can be viewed as a mixed strategy control that always assigns probability one to a single control sequence. The next question is under what condition mixed strategies strictly dominate pure strategies.

###### Lemma 2.

The necessary condition for

$$c^\star_M < c^\star_P$$

is that $F$ is a non-convex set.

###### Proof.

We prove the contrapositive. If $F$ is a convex set, then

$$F = \mathrm{conv}(F) = F^K_M.$$

Hence,

$$c^\star_M = c^\star_P.$$

∎

The non-convexity of $F$ is not a sufficient condition for strict dominance because the optimal solution to MSOC could be in $F$. Also note that the convexity of $F$ is implied by the convexity of PSOC, but not vice versa.

Figure 3 provides a graphical interpretation of the above lemmas in the case of $K = 1$. The set painted in solid blue is $F$. Within $F$, the area to the left of the vertical line at $c_1 = V_1$ satisfies the constraint. Hence, the optimal solution to PSOC is located at the intersection of the vertical line and the lower edge of $F$, called the minimum common point and shown as the green dot in Figure 3. Likewise, the optimal solution to MSOC is the minimum common point of $\mathrm{conv}(F)$ and the vertical line. The dominance of mixed strategy (Lemma 1), as well as the necessary condition for strict dominance (Lemma 2), is graphically obvious from Figure 3.

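We next discuss by how much the mixed strategy can improve the expected cost, which requires some preparations. The following is the dual optimization problem of PSOC’, where the dual optimal cost is denoted by $q^\star_P$:

DPSOC (Dual of PSOC’)

$$q^\star_P = \max_{\lambda \ge 0} \min_{c \in F} \ c_0 + \lambda (c_{1:K} - V), \tag{17}$$

where $\lambda$ is the vector of dual variables, $c_{1:K} := [c_1 \cdots c_K]^T$, and $V := [V_1 \cdots V_K]^T$ ($\cdot^T$ is the matrix transpose). From a standard result in optimization (weak duality), $q^\star_P \le c^\star_P$. The duality gap is denoted by $\Delta$, that is,

$$\Delta = c^\star_P - q^\star_P.$$

The dual optimization of MSOC’ is given as follows:

DMSOC (Dual of MSOC’)

$$q^\star_M = \max_{\lambda \ge 0} \min_{c \in \mathrm{conv}(F)} \ c_0 + \lambda (c_{1:K} - V). \tag{18}$$

Since $\mathrm{conv}(F)$ is convex, there is no duality gap, hence $q^\star_M = c^\star_M$.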

It turns out that the improvement in expected cost by the optimal mixed strategy is equal to the duality gap of PSOC.

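###### Theorem 2.
$$c^\star_M = c^\star_P - \Delta.$$
###### Proof.

Since $F \subseteq \mathrm{conv}(F)$,

$$\min_{c \in \mathrm{conv}(F)} c_0 + \lambda c_{1:K} \le \min_{c \in F} c_0 + \lambda c_{1:K}. \tag{19}$$

The left-hand side of the above is equivalent to

$$\min_{\substack{c^1 \cdots c^{K+1} \in F \\ \sum_{i=1}^{K+1} p_i = 1,\ p_i \ge 0}} \ \sum_{i=1}^{K+1} p_i \left( c^i_0 + \lambda c^i_{1:K} \right).$$

A convex combination of a set of scalars cannot be less than the minimum of the set, i.e., for any $c^i \in F$ and any $p_i \ge 0$ with $\sum_{i=1}^{K+1} p_i = 1$,

$$\sum_{i=1}^{K+1} p_i \left( c^i_0 + \lambda c^i_{1:K} \right) \ge \min_{c \in F} c_0 + \lambda c_{1:K}.$$

Hence,

$$\min_{c \in \mathrm{conv}(F)} c_0 + \lambda c_{1:K} \ge \min_{c \in F} c_0 + \lambda c_{1:K}. \tag{20}$$

From (19) and (20), it follows that

$$\min_{c \in \mathrm{conv}(F)} c_0 + \lambda c_{1:K} = \min_{c \in F} c_0 + \lambda c_{1:K}.$$

Since $\lambda V$ is a constant for a given $\lambda$, it follows from the above that $q^\star_M = q^\star_P$. Therefore,

$$c^\star_M = q^\star_M = q^\star_P = c^\star_P - \Delta. \tag{21}$$

∎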
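Theorem 2 can be checked numerically on the two-path toy problem from the introduction. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes $K = 1$, a finite cost set `F` of `(c0, c1)` pairs containing at least one feasible pure point, and uses hypothetical helper names (`pure_optimum`, `dual_optimum`, `mixed_optimum`). For a finite set, the dual function is piecewise linear and concave, so its maximum over $\lambda \ge 0$ is attained at $\lambda = 0$ or at an intersection of two of its linear pieces.

```python
from itertools import combinations

# Finite cost set for the toy problem: (c0 = expected cost, c1 = failure probability).
F = [(20.0, 0.005), (10.0, 0.015)]
V = 0.01  # risk bound


def pure_optimum(F, V):
    """c*_P: best feasible pure point (assumes at least one exists)."""
    return min(c0 for c0, c1 in F if c1 <= V)


def dual_value(F, V, lam):
    """q(lambda) = min over F of c0 + lambda * (c1 - V)."""
    return min(c0 + lam * (c1 - V) for c0, c1 in F)


def dual_optimum(F, V):
    """q*_P: evaluate q at lambda = 0 and at all pairwise line intersections."""
    candidates = [0.0]
    for (a0, a1), (b0, b1) in combinations(F, 2):
        if a1 != b1:
            lam = (b0 - a0) / (a1 - b1)
            if lam > 0.0:
                candidates.append(lam)
    return max(dual_value(F, V, lam) for lam in candidates)


def mixed_optimum(F, V):
    """c*_M for K = 1: best pure point or best two-point mix with a tight constraint."""
    best = pure_optimum(F, V)
    for a, b in combinations(F, 2):
        lo, hi = (a, b) if a[1] <= b[1] else (b, a)
        if lo[1] <= V < hi[1]:
            p = (hi[1] - V) / (hi[1] - lo[1])  # probability of the low-risk point
            best = min(best, p * lo[0] + (1.0 - p) * hi[0])
    return best


c_P = pure_optimum(F, V)
q_P = dual_optimum(F, V)
c_M = mixed_optimum(F, V)
gap = c_P - q_P
print(c_P, q_P, c_M, gap)
```

In this example the pure optimum is path A, and the printed duality gap should coincide (up to floating-point rounding) with the cost saved by mixing A and B.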

The graphical interpretation of Theorem 2 is given by Figure 3. Consider a hyperplane that contains $F$ in its upper closed halfspace and intersects $F$, that is, the intersection of the hyperplane and $F$ is nonempty. Let the normal vector of the hyperplane be $[1, \lambda]$. The points in the intersection correspond to the optimal solutions to the inner optimization problem of (17) given $\lambda$. The value of $c_0$ at the crossing point between the hyperplane and the vertical line $c_1 = V_1$ is the dual objective value. Therefore, the optimal dual solution to DPSOC corresponds to the hyperplane that has the maximum crossing point [9]. Observe that the maximum crossing point for DPSOC, shown as the orange point in Figure 3, is the same as the minimum common point (i.e., the primal optimal solution) for MSOC. Therefore the reduction in expected cost brought by mixed strategy is equal to the duality gap in PSOC.

### II-D Solution Approach

A naive approach to solve MSOC is simply to solve (4)-(5). However, the multiplication of the probabilities $p_j$ and the expected costs increases the problem complexity (e.g., linear vs. bilinear), making it difficult to solve. Instead, in this paper, we present an efficient approach that solves MSOC by solving the dual of PSOC. This approach is built on the fact, revealed in the proof of Theorem 2, that the optimal dual solutions to MSOC and PSOC are the same.

Let $\lambda^\star$ be the optimal solution to DPSOC, (17). Let $C(\lambda)$ be the set of all the optimal solutions to the inner optimization problem of DPSOC, that is,

$$C(\lambda) = \arg\min_{c \in F} \ c_0 + \lambda (c_{1:K} - V). \tag{22}$$

For example, in case of Figure 3,

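###### Theorem 3.

The necessary and sufficient condition for $(c^1 \cdots c^{K+1}, p_1 \cdots p_{K+1})$ to be an optimal solution to MSOC’, (9)-(10), is that there exists $\lambda \ge 0$ that satisfies the following:

$$\begin{aligned}
&\text{a)}\ c^i \in C(\lambda) \ \lor\ p_i = 0, \quad \forall i = 1 \cdots K+1, \\
&\text{b)}\ \lambda \left( \sum_{i=1}^{K+1} p_i c^i_{1:K} - V \right) = 0, \\
&\text{c)}\ \sum_{i=1}^{K+1} p_i = 1, \\
&\text{d)}\ p_i \ge 0, \quad \forall i = 1 \cdots K+1, \\
&\text{e)}\ \sum_{i=1}^{K+1} p_i c^i_{1:K} \le V, \text{ and} \\
&\text{f)}\ c^i \in F, \quad \forall i = 1 \cdots K+1.
\end{aligned} \tag{23}$$
###### Proof.

Sufficiency: It follows from c), d), e), and f) that $(c^i, p_i)$ satisfies all the constraints of MSOC’. With regard to a), note that:

$$c^i \in C(\lambda) \iff c^i_0 + \lambda (c^i_{1:K} - V) = q^\star_P.$$

It follows from a), b), and c) that

$$\sum_{i=1}^{K+1} p_i c^i_0 = \sum_{i=1}^{K+1} p_i c^i_0 + \lambda \left( \sum_{i=1}^{K+1} p_i c^i_{1:K} - V \right) = \sum_{i=1}^{K+1} p_i \left\{ c^i_0 + \lambda (c^i_{1:K} - V) \right\} = q^\star_P = c^\star_M. \tag{24}$$

Since we know that the minimum objective value of MSOC’ is $c^\star_M$, $(c^i, p_i)$ is an optimal solution to MSOC’.

Necessity: We prove the contrapositive. Note that b)-f) are part of the KKT conditions for MSOC’. Therefore, if any of b)-f) does not hold, $(c^i, p_i)$ is not an optimal solution. Next, assume that only a) does not hold, that is,

$$\exists i \in \{1 \cdots K+1\}: \quad c^i_0 + \lambda (c^i_{1:K} - V) > q^\star_P \ \land\ p_i > 0.$$

Using (24), we have $\sum_{i=1}^{K+1} p_i c^i_0 > c^\star_M$. Therefore $(c^i, p_i)$ is not an optimal solution to MSOC’. ∎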

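###### Remark 1.

Theorem 3 can be immediately adapted to the original MSOC, i.e., (4)-(5). Let

$$U(\lambda) = \arg\min_{\substack{\mu \in \mathcal{M}^T \\ g(x,\mu,w)=0}} \ \mathbb{E}[f_0(x,\mu)] + \lambda \left( \mathbb{E}[f_{1:K}(x,\mu)] - V \right). \tag{25}$$

Then a) is replaced by the following condition:
a’) $\mu^i \in U(\lambda) \ \lor\ p_i = 0, \quad \forall i = 1 \cdots K+1$.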

The distinctive feature of Theorem 3 is a). It means that the candidate control sequences, from which the controller chooses randomly, can be obtained by solving the dual of the pure-strategy problem. More specifically, MSOC can be solved by the following process:

1. Solve DPSOC and obtain the optimal dual solution, $\lambda^\star$.

2. If $\lambda^\star = 0$, optimal solutions to PSOC are also optimal for MSOC because (22) then reduces to the minimization of $c_0$ over $F$.

3. If $\lambda^\star > 0$,

• Solve (22) to obtain $C(\lambda^\star)$.

• Find $c^i \in C(\lambda^\star)$ and $p_i$ that satisfy conditions b)-e) of Theorem 3.

The concrete solution method for DPSOC depends on the problem. In general, it can be solved by a general convex optimization method such as the subgradient method. The multidimensional bisection method [10] can solve it more efficiently if applicable. More efficient, specialized solution approaches may be available for special cases of MSOC. However, such specialized approaches are out of the scope of this paper, except for the one discussed in the following subsection.

### II-E Efficient Solution for K=1

PSOC with $K = 1$ (i.e., there is only one stochastic constraint) has important applications, most notably problems with a joint chance constraint, which imposes an upper bound on the probability of violating any constraint at any time step during the planning horizon. A joint chance constraint is of practical importance since it gives the operator of a system an intuitive way to specify the acceptable level of risk of an entire plan. For example, in the Mars Exploration Rovers (MER) mission, ground operators made decisions on the trajectory correction maneuver before atmospheric entry with a lower bound on the probability of successful landing (the thresholds for the Spirit and Opportunity rovers were 91% and 96%, respectively) [11].

When $K = 1$, DPSOC can be solved very efficiently by a root finding method. Furthermore, the candidate solutions to be mixed are obtained as by-products of root finding. This involves evaluating the dual objective function repeatedly by solving (25) with varying $\lambda$. The convergence is very fast; some standard root finding algorithms, such as Brent’s method, have a superlinear convergence rate.

Let $q(\lambda)$ be the dual objective function of DPSOC, that is,

$$q(\lambda) = \min_{c \in F} \ c_0 + \lambda (c_1 - V_1). \tag{26}$$

From a standard result of convex optimization theory, $q(\lambda)$ is a concave function [12], hence its subgradient is monotonically decreasing, as shown in Figure 5. The dual optimal solution, $\lambda^\star$, lies at the zero-crossing of the subdifferential $\partial q(\lambda)$. Another standard result of convex optimization theory is that:

$$c \in C(\lambda) \Rightarrow c_1 - V_1 \in \partial q(\lambda).$$

Therefore the dual optimization problem can be solved by finding a root of $\partial q(\lambda)$. Standard root finding algorithms can be used, such as the bisection method and Brent’s method [13].

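We assume that there is an algorithm that takes $\lambda$ and returns an optimal solution to (25), as well as the corresponding optimal solution to (26), which satisfies a) and f) of Theorem 3. The root finding algorithm is initialized with an interval of $\lambda$ that includes $\lambda^\star$. The interval is tightened iteratively until a terminal condition is met. Through the iteration, both ends of the interval converge to $\lambda^\star$, while the solutions obtained at the two ends converge to two candidate solutions, whose cost vectors we denote by $c^{1\star}$ and $c^{2\star}$, as illustrated in Figure 5. The subscript 1 below denotes the component associated with the constraint. If $\lambda^\star > 0$, the probabilities $p^{1\star}$ and $p^{2\star}$ that satisfy b), c), and e) in Theorem 3 are computed by solving the following:

$$p^{1\star} c^{1\star}_1 + p^{2\star} c^{2\star}_1 = V_1, \qquad p^{1\star} + p^{2\star} = 1.$$

The solution to the above also satisfies d) because $V_1$ lies between $c^{1\star}_1$ and $c^{2\star}_1$, which implies $p^{1\star} \ge 0$ and $p^{2\star} \ge 0$. Therefore, $(c^{1\star}, c^{2\star}, p^{1\star}, p^{2\star})$ satisfies a)-f) of Theorem 3, hence it is an optimal solution to MSOC. If $\lambda^\star = 0$, the solution obtained at $\lambda = 0$ is optimal by itself, and $p^{1\star}$ and $p^{2\star}$ can be any values that satisfy c) and d).

The optimal mixed control is to execute the control sequence corresponding to $c^{1\star}$ with probability $p^{1\star}$ and the one corresponding to $c^{2\star}$ with probability $p^{2\star}$.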
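The root-finding procedure for $K = 1$ can be sketched as a bisection on the sign of the subgradient $c_1 - V_1$. The code below is a minimal illustration under assumptions that are not from the paper: the feasible cost set is a small finite list `F` of `(c0, c1)` pairs, the Lagrangian minimization is done by enumeration in a hypothetical `oracle` function (in an SMPC setting this would be the optimization (25)), and the upper end of the interval is found by doubling.

```python
def oracle(F, V, lam):
    """Minimize the Lagrangian c0 + lam * (c1 - V) over a finite set F."""
    return min(F, key=lambda c: c[0] + lam * (c[1] - V))


def solve_mixed_k1(F, V, tol=1e-9, max_doublings=60):
    """Return a list of (probability, cost vector) pairs forming the mixed solution."""
    risky = oracle(F, V, 0.0)
    if risky[1] <= V:
        # lambda* = 0: the unconstrained minimizer is feasible, so a pure strategy suffices.
        return [(1.0, risky)]

    # Grow the interval [lo, hi] until the solution at hi satisfies the constraint.
    lo, hi = 0.0, 1.0
    safe = oracle(F, V, hi)
    doublings = 0
    while safe[1] > V:
        if doublings == max_doublings:
            raise ValueError("no feasible pure solution found")
        lo, risky = hi, safe
        hi *= 2.0
        safe = oracle(F, V, hi)
        doublings += 1

    # Bisection: the subgradient c1 - V is positive at lo and nonpositive at hi.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        c = oracle(F, V, mid)
        if c[1] > V:
            lo, risky = mid, c
        else:
            hi, safe = mid, c

    # Mix the two candidates so that the constraint holds with equality.
    p_safe = (risky[1] - V) / (risky[1] - safe[1])
    return [(p_safe, safe), (1.0 - p_safe, risky)]


# Toy set: paths A and B from the introduction plus a hypothetical very safe option.
F = [(20.0, 0.005), (10.0, 0.015), (30.0, 0.001)]
solution = solve_mixed_k1(F, V=0.01)
expected_cost = sum(p * c[0] for p, c in solution)
expected_risk = sum(p * c[1] for p, c in solution)
print(solution, expected_cost, expected_risk)
```

The two candidates returned at termination play the role of the solutions obtained at the two ends of the interval; a faster bracketing method such as Brent's method could replace the plain bisection loop.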

## III Deployment on Linear SMPC

The proposed algorithm is demonstrated with an implementation on a linear SMPC with normally distributed disturbance and polygonal obstacles in the state space. Since the problem is nonconvex, a mixed strategy may outperform pure strategies. A practical challenge is that (25) is a nonlinear, nonconvex program. The nonlinearity comes from the cumulative distribution function (CDF) that is used to evaluate the probability of constraint violation. Although efficient solvers are available for limited classes of nonconvex programming, such as mixed integer linear programming (MILP) and mixed integer quadratic programming (MIQP), the problem does not fall under these classes.

Repeatedly solving such a problem could result in a prohibitive computational cost. Our approach is to approximate the CDF with a piecewise linear function and convert the problem into a MILP.

### III-A Formulation

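We assume linear discrete-time dynamics with state $x_k$ and control input $u_k$:

$$x_{k+1} = A x_k + B u_k + w_k,$$

where $w_k$ is a normally distributed zero-mean disturbance with covariance $\Sigma_w$. The feasible control set is assumed to be a polytope, hence it is represented as $P u_k \preceq q$, where $\preceq$ and $\succeq$ are componentwise inequalities. We assume that there are $M$ polytopic obstacles, whose interiors are represented as:

$$H_i x_k \succeq g_i, \quad i = 1 \cdots M,\ k = 1 \cdots N.$$

A chance constraint in the form of (3) is imposed to limit the probability of colliding with the obstacles to $V$. The cost function is the total $L_1$ norm of $u_k$ over the horizon, that is, $f_0 = \sum_{k=1}^{N} |u_k|_1$. Since this cost function is deterministic, $\mathbb{E}[f_0] = f_0$. The PSOC is given as follows:

$$\begin{aligned}
\min\ & \sum_{k=1}^{N} |u_k|_1 \\
\text{s.t.}\ & \Pr\left[ \bigvee_{i=1}^{M} \bigvee_{k=1}^{N} H_i x_k \succeq g_i \right] \le V \\
& x_{k+1} = A x_k + B u_k + w_k, \quad k = 1 \cdots N \\
& P u_k \preceq q, \quad k = 1 \cdots N,
\end{aligned}$$

where $\bigvee$ is the logical disjunction. The inner optimization problem of the dual optimization, (25), is given as:

$$\begin{aligned}
\min\ & \sum_{k=1}^{N} |u_k|_1 + \lambda \left( \Pr\left[ \bigvee_{i=1}^{M} \bigvee_{k=1}^{N} H_i x_k \succeq g_i \right] - V \right) \\
\text{s.t.}\ & x_{k+1} = A x_k + B u_k + w_k, \quad k = 1 \cdots N \\
& P u_k \preceq q, \quad k = 1 \cdots N.
\end{aligned}$$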

### III-B Conversion to MILP

We use a few tricks and approximations to convert the above problem into a MILP. We note that the probability of constraint violation is always approximated conservatively (meaning that it is overestimated) so that a solution to the approximated problem is always a feasible solution to the original problem. First, by replacing absolute values with slack variables, the norm objective is equivalent to the following:

$$\min |u|_1 = \min \sum_{d=1}^{m} v_d \quad \text{s.t.}\quad v_d \ge u_d,\ v_d \ge -u_d,$$

where $u_d$ is the $d$-th component of the vector $u$. Second, the joint probability is decomposed by Boole’s inequality, whose conservatism is negligible in most practical cases where the risk bound is very small [14]:

$$\Pr\left[ \bigvee_{i=1}^{M} \bigvee_{k=1}^{N} H_i x_k \succeq g_i \right] \sim \sum_{i=1}^{M} \sum_{k=1}^{N} \Pr\left[ H_i x_k \succeq g_i \right].$$

The componentwise inequality in the probability is decomposed using the risk selection approach [15], which is again a conservative approximation. Let $h_{ij}$ and $g_{ij}$ be the $j$-th rows of $H_i$ and $g_i$, and let $R_i$ be the number of rows:

$$\min \Pr\left[ H_i x_k \succeq g_i \right] \sim \min \delta \quad \text{s.t.}\quad \bigvee_{j=1}^{R_i} \Pr\left[ h_{ij} x_k \ge g_{ij} \right] \le \delta. \tag{27}$$

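The probability above is univariate, hence it can be easily evaluated by the CDF:

$$\Pr\left[ h_{ij} x_k \ge g_{ij} \right] = F\left( \frac{h_{ij} \bar{x}_k - g_{ij}}{\sqrt{h_{ij} \Sigma_{x_k} h_{ij}^T}} \right),$$

where $\bar{x}_k$ is the mean of $x_k$ and $F$ is the CDF of the standard normal distribution. The covariance matrix of $x_k$ is computed recursively by $\Sigma_{x_{k+1}} = A \Sigma_{x_k} A^T + \Sigma_w$.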
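The evaluation of a single half-space violation probability can be sketched in plain Python. This is an illustration only: it assumes open-loop propagation of the state covariance through $\Sigma_{x_{k+1}} = A \Sigma_{x_k} A^T + \Sigma_w$, normalizes the CDF argument by the standard deviation $\sqrt{h \Sigma_x h^T}$, and uses a hypothetical two-state (position, velocity) system rather than the plant from the simulations. All function names are introduced here for illustration.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mat_mul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def propagate_cov(A, sigma_x, sigma_w):
    """One step of the covariance recursion: A * Sigma_x * A^T + Sigma_w."""
    asa = mat_mul(mat_mul(A, sigma_x), transpose(A))
    return [[asa[i][j] + sigma_w[i][j] for j in range(len(asa))]
            for i in range(len(asa))]


def halfspace_prob(h, g, x_mean, sigma_x):
    """Pr[h x >= g] for x ~ N(x_mean, sigma_x); assumes h Sigma_x h^T > 0."""
    mean = sum(hi * xi for hi, xi in zip(h, x_mean))
    var = sum(h[i] * sigma_x[i][j] * h[j]
              for i in range(len(h)) for j in range(len(h)))
    return std_normal_cdf((mean - g) / math.sqrt(var))


# Hypothetical 1-D double integrator with unit time step and position noise only.
A = [[1.0, 1.0], [0.0, 1.0]]
SIGMA_W = [[0.01, 0.0], [0.0, 0.0]]

sigma = [[0.0, 0.0], [0.0, 0.0]]   # known initial state
for _ in range(2):                 # covariance after two steps
    sigma = propagate_cov(A, sigma, SIGMA_W)

# Probability that the position exceeds 1.0 when its mean is 0.5.
prob = halfspace_prob(h=[1.0, 0.0], g=1.0, x_mean=[0.5, 0.0], sigma_x=sigma)
print(sigma, prob)
```

Because the mean lies on the safe side of the boundary, the CDF argument is negative and the resulting probability is below 0.5, which is the regime used by the piecewise linear approximation.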

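We apply a piecewise linear approximation of the CDF. Since the CDF of the standard normal distribution is convex for nonpositive arguments, the piecewise linear approximation can be done without introducing integer variables. An underlying assumption is that the mean state is always outside of the obstacles, hence the CDF is evaluated only in its convex region. This assumption is implied by $V < 0.5$ because, if the mean state is on a constraint boundary, the probability of violating the constraint is 0.5. In practical cases users usually do not allow a 50% risk. Let $a_l z + b_l$, $l = 1 \cdots L$, be the $l$-th linear approximation of $F(z)$. The right-hand side of (27) is approximated as follows:

$$\min \delta \quad \text{s.t.}\quad \bigvee_{j=1}^{R_i} \bigwedge_{l=1}^{L} \ a_l \left( \frac{h_{ij} \bar{x}_k - g_{ij}}{\sqrt{h_{ij} \Sigma_{x_k} h_{ij}^T}} \right) + b_l \le \delta.$$

Finally, the disjunction is replaced by mixed-integer constraints using a standard trick called the big-M method [16]. Letting the coefficient $M$ of $z_{ij}$ be a very large positive constant (not to be confused with the number of obstacles), the optimization problem formulated in the previous subsection is converted to a MILP as follows:

$$\begin{aligned}
\min\ & \sum_{k=1}^{N} \sum_{d=1}^{m} v_{kd} + \lambda \left( \sum_{i=1}^{M} \sum_{k=1}^{N} \delta_{ik} - V \right) \\
\text{s.t.}\ & v_{kd} \ge u_{kd}, \quad v_{kd} \ge -u_{kd}, \quad P u_k \preceq q \\
& \bar{x}_{k+1} = A \bar{x}_k + B u_k \\
& a_l \left( \frac{h_{ij} \bar{x}_k - g_{ij}}{\sqrt{h_{ij} \Sigma_{x_k} h_{ij}^T}} \right) + b_l \le \delta_{ik} + M z_{ij} \\
& \sum_{j=1}^{R_i} z_{ij} \le R_i - 1, \quad z_{ij} \in \{0, 1\} \\
& k = 1 \cdots N,\ d = 1 \cdots m,\ i = 1 \cdots M,\ j = 1 \cdots R_i,\ l = 1 \cdots L.
\end{aligned}$$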

### III-C Simulation Results

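We performed simulations on a double integrator plant:

$$A = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} \tfrac{1}{2}\Delta T^2 & 0 \\ 0 & \tfrac{1}{2}\Delta T^2 \\ \Delta T & 0 \\ 0 & \Delta T \end{bmatrix}, \quad
\Sigma_w = \begin{bmatrix} \sigma_w^2 & 0 & 0 & 0 \\ 0 & \sigma_w^2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix},$$

$$\Delta T = 1, \quad \sigma_w = 0.1.$$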

We first considered an illustrative example shown in Figure 6, where two obstacles were placed with a narrow shortcut passage between them. Initial state was at and the mean final state was constrained at . Horizon length was , and finally the risk bound is . The simulation was performed on a machine with an Intel Core i7-3612QM CPU clocked at 2.10 GHz and 8.00 GB RAM. The algorithm was implemented in MATLAB using YALMIP [17], and the MILP was solved by CPLEX. The bisection method was used for root finding. For comparison, the optimal pure strategy was also computed using the same MILP approximation presented in Section III-B.

Figure 6 shows the mixed and pure strategy solutions computed by the proposed algorithm. The mixed strategy consisted of the two control sequences shown as blue lines. The lower dual solution corresponds to a risk-taking path that goes through the narrow passage, while the upper dual solution results in a risk-averse path that goes around the obstacles. The former took of risk and of cost, while the latter took of risk and of cost. The mixed strategy chose between them with the probabilities of and , resulting in a risk of exactly 0.01 and an expected cost of 4.027. On the other hand, the pure optimal strategy took a risk of exactly 0.01 and a cost of 4.175. Therefore this example validates our claim that mixed strategy can result in a lower expected cost than pure strategy in a nonconvex SMPC.

We next performed a Monte Carlo simulation to empirically validate our claim that the optimal solution to the mixed strategy problem is always at least as good as the optimal solution to the original (pure) problem. We randomly placed four square obstacles in a 2-D state space. The center of each square was sampled from a uniform distribution within . The size of each square was sampled from a uniform distribution in . The initial state was , and the mean final state was constrained at .

Figure 7 shows the resulting costs of the optimal mixed and pure solutions to 200 randomized problems. The average computation time was 59.8 sec. There were 163 samples on the line in the plot, meaning that the costs of the optimal mixed and pure solutions were identical in those samples. There were 37 samples below the line, meaning that the cost of the optimal mixed solution was strictly less than the cost of the optimal pure solution. There was no sample above the line. This result supports our claim that the optimal solution to the mixed strategy problem is always at least as good as the optimal solution to the original problem. At least in this particular problem domain, the mixed strategy outperforms the pure strategy only occasionally, and as seen in Figure 7 the improvement is often marginal. It is certainly possible to engineer a problem that better highlights the advantage of a mixed strategy, but that does not serve the objective of this paper. The most important contributions of this paper are the theoretical finding that a mixed strategy can outperform a pure strategy in nonconvex SMPCs, as well as the algorithm to compute the optimal mixed strategy solutions. The empirical results validate the theoretical finding and the algorithm.

## IV Deployment on Chance-Constrained MDP

The application of the proposed approach is not limited to SMPC. In this section we present applications to finite-state MDPs with a chance constraint.

We consider a finite number of time steps, . The state space and action space are finite and time-varying, denoted by and . State and control sequence variables are represented as and . A control policy is a map . The sequence of control policies is denoted by . A mixed strategy finds multiple control policy sequences, , and randomly chooses one. The control objective is to minimize the expected total cost, . A set of failure states, , is specified for each time step. A joint chance constraint limits the probability that one of the failure states is visited at any time step:

$$
\Pr\left[\bigvee_{k=1}^{T} x_k \in X_{F_k}\right] \le V.
$$
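To make the left-hand side of this constraint concrete, the following Python sketch evaluates the joint failure probability of a fixed policy on a small finite-state chain by forward propagation. It is only an illustration under stated assumptions: the data layout (dictionaries of transition probabilities with the policy already applied), the function name, and the toy chain are hypothetical. Probability mass that enters a failure state is counted once and removed, so the result is the probability of visiting a failure state at any time step rather than a sum of per-step probabilities.

```python
def joint_failure_probability(init_dist, transitions, failure_sets):
    """Probability that a failure state is visited at any step k = 1..T.

    init_dist:    dict state -> probability at step 0.
    transitions:  list of length T; transitions[k][s] is a dict
                  next_state -> probability under the fixed policy.
    failure_sets: list of length T; failure_sets[k] is the set of failure
                  states for the state reached after transition k.
    """
    alive = dict(init_dist)  # mass that has not failed so far
    failed = 0.0
    for trans_k, fail_k in zip(transitions, failure_sets):
        next_alive = {}
        for state, prob in alive.items():
            for nxt, q in trans_k[state].items():
                if nxt in fail_k:
                    failed += prob * q  # absorbed: counted exactly once
                else:
                    next_alive[nxt] = next_alive.get(nxt, 0.0) + prob * q
        alive = next_alive
    return failed


# Toy chain: from state 0, stay safe with probability 0.9 or fail with 0.1.
step = {0: {0: 0.9, "F": 0.1}}
risk = joint_failure_probability({0: 1.0}, [step, step], [{"F"}, {"F"}])
```

For the two-step toy chain the joint risk is $1 - 0.9^2 = 0.19$, which would then be compared against the risk bound $V$.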

Since $K=1$, the mixed-strategy problem can be solved by root finding, as in Section II-E. The inner optimization problem is solved through chance-constrained dynamic programming [18]. In the remainder of this section we present two applications: path planning and Mars Entry, Descent, and Landing (EDL).
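The root-finding idea can be sketched with bisection over a scalar dual variable. The Python code below is a minimal illustration and not necessarily the procedure of Section II-E: the inner problem is replaced by enumeration over a small, hypothetical list of pure strategies given as (cost, risk) pairs, whereas the paper solves it by chance-constrained dynamic programming. The names `inner_minimizer`, `bisect_dual`, and `lam_hi` are assumptions introduced here.

```python
def inner_minimizer(candidates, lam):
    """Pure strategy minimizing cost + lam * risk (enumeration stand-in)."""
    return min(candidates, key=lambda cr: cr[0] + lam * cr[1])


def bisect_dual(candidates, risk_bound, lam_hi=1e3, tol=1e-9):
    """Bisection on the dual variable for a single chance constraint.

    Returns (lam, risk_taking, risk_averse): the approximate dual root and
    the inner minimizers just below and just above it.
    """
    lam_lo = 0.0
    unconstrained = inner_minimizer(candidates, lam_lo)
    if unconstrained[1] <= risk_bound:
        # The cheapest pure strategy already satisfies the constraint.
        return lam_lo, unconstrained, unconstrained
    if inner_minimizer(candidates, lam_hi)[1] > risk_bound:
        raise ValueError("lam_hi is too small or the problem is infeasible")
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if inner_minimizer(candidates, mid)[1] > risk_bound:
            lam_lo = mid  # still too risky: penalize risk more
        else:
            lam_hi = mid  # feasible: try a smaller penalty
    return lam_hi, inner_minimizer(candidates, lam_lo), inner_minimizer(candidates, lam_hi)


# Hypothetical (cost, risk) pairs of pure strategies and a risk bound of 0.01.
strategies = [(3.6, 0.05), (4.1, 0.012), (4.2, 0.002), (5.0, 0.0005)]
lam, risk_taking, risk_averse = bisect_dual(strategies, 0.01)
```

With these hypothetical values the bisection converges to a dual variable of about 12.5, bracketed by the risk-taking strategy (3.6, 0.05) and the risk-averse strategy (4.2, 0.002). A mixed strategy then randomizes between these two so that the expected risk meets the bound.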

### IV-A Application to Path Planning

In this application, we assume and a two-dimensional state space in discretized into a 100×100 grid. Obstacles are placed as shown in Figure 8. Single-integrator dynamics are assumed:

$$
x_{k+1} = x_k + u_k + w_k,\qquad \|u_k\|_2 \le d_k,\qquad w_k \sim \mathcal{N}(0, \sigma^2 I),
$$

where $u_k$ is a two-dimensional vector specifying the increment in position, $w_k$ is a discretized, Gaussian-distributed noise, $d_k$ and $\sigma$ are constant parameters, $\mathcal{N}(0, \sigma^2 I)$ is a zero-mean Gaussian distribution with the covariance matrix $\sigma^2 I$, and $I$ is the two-dimensional identity matrix. We set and . The cost function is the expected length of the resulting path that connects the start and goal states. The risk bound is .

The optimal solution to MSOC consists of two control policy sequences, and , which have expected path lengths of and , while their risks of hitting obstacles are and , respectively. The nominal paths resulting from and (i.e., state sequence assuming when ) are shown in Figure 8. The two pure control strategies are chosen with probabilities of and , respectively. As a result, the mixed control strategy has an expected path length of while the risk of hitting obstacles is exactly . On the other hand, solving PSOC results in the same pure control strategy as , whose expected path length is . As expected, the mixed strategy resulted in a lower expected cost while respecting the stochastic constraint.

The solution time of MSOC was seconds while that of the PSOC was seconds. (Simulations were conducted on a machine with an Intel(R) Xeon(R) X5690 CPU clocked at 3.47 GHz and 96 GB of RAM.) The difference in computation time is small because solving PSOC also requires iterative dual optimization in this case [18].

### IV-B Application to Mars Entry, Descent, and Landing

We next present an application to the planning of entry, descent, and landing (EDL) for future Mars missions [18]. Mars EDL is subject to various sources of uncertainty, such as atmospheric variability and imperfect aerodynamic models. The resulting dispersion of the landing position typically spans tens of kilometers for a 99.9% confidence ellipse [11]. Given the highly uncertain nature of EDL, a target landing site must be carefully chosen in order to limit the risk of landing on rocky or uneven terrain. At the same time, it is equally important to land near science targets in order to minimize the traverse distance after landing.

Future Mars lander/rover missions would aim to reduce the uncertainty by using several new active control technologies, consisting of the following three stages: entry-phase targeting, powered-descent guidance (PDG) [19], and hazard detection and avoidance (HDA) [20]. Each control stage is capable of correcting the predicted landing position by a certain distance, but each stage is subject to execution errors, which cause the spacecraft to deviate from the planned landing position.

We pose this problem as optimal sequential decision making under persistent uncertainty. At the $k$th control stage, $x_k$ represents the projected landing location without further control. By applying a control at the $k$th stage, the lander can correct the projected landing location to $u_k$, which must be within an ellipsoid centered around $x_k$. At the end of the $k$th control stage, the projected landing location deviates from $u_k$ due to a disturbance $w_k$, which is assumed to have a Gaussian distribution. This EDL model is described as follows:

$$
x_{k+1} = u_k + w_k,\qquad (u_k - x_k)^T D_k (u_k - x_k) \le d_k^2,\qquad w_k \sim \mathcal{N}(0, \Sigma_k),
$$

where $D_k$ and $\Sigma_k$ are positive definite matrices, and $d_k$ is a scalar constant. We use the same parameter settings as [21].
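The EDL model above can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the implementation used for the results: the function names are hypothetical, the parameter values in the usage lines are made up (the actual settings are those of [21]), and the disturbance is sampled with an axis-aligned (diagonal) covariance for simplicity, which is a special case of the general covariance in the model.

```python
import random


def within_correction_ellipse(x, u, D, d):
    """Check (u - x)^T D (u - x) <= d^2 for 2-D points and a 2x2 matrix D."""
    e0, e1 = u[0] - x[0], u[1] - x[1]
    quad = D[0][0] * e0 * e0 + (D[0][1] + D[1][0]) * e0 * e1 + D[1][1] * e1 * e1
    return quad <= d * d


def edl_step(u, sigmas, rng=random):
    """Sample x_{k+1} = u_k + w_k.

    Assumption: the covariance is diagonal with standard deviations `sigmas`,
    so each axis is sampled independently.
    """
    return (u[0] + rng.gauss(0.0, sigmas[0]), u[1] + rng.gauss(0.0, sigmas[1]))


# Hypothetical values: identity weighting matrix and a correction radius of 2.
D_k = [[1.0, 0.0], [0.0, 1.0]]
feasible = within_correction_ellipse((0.0, 0.0), (1.0, 1.0), D_k, 2.0)
x_next = edl_step((1.0, 1.0), (0.1, 0.1))
```

A target `u` that fails the ellipse check is outside the correction capability of the stage and would not be an admissible control at that stage.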

We consider three control stages, i.e., and is the final landing location. The state space is a 2 km-by-2 km square, which is discretized at a one-meter resolution. As a result, the problem has four million states at each time step. The control and the disturbance are also discretized at the same resolution. The cost function is the expected distance to drive on the surface to visit two science targets, shown as magenta squares in Figure 9, starting from the landing location. The infeasible areas are specified using data from the HiRISE (High Resolution Imaging Science Experiment) camera on the Mars Reconnaissance Orbiter. We use the real landscape of a site named “East Margaritifer” on Mars.

Figure 9 shows the simulation result with a risk bound . The optimal solution to MSOC chooses between two control policy sequences, and , with the probabilities of and . The probabilities of failure of the two control policy sequences are and , while their costs are and . The resulting probability of failure of the mixed strategy is exactly . The optimal solution to PSOC is equivalent to . Again, as expected, the mixed strategy reduces the expected cost while respecting the stochastic constraint.

Note that the optimal pure control policy takes significantly less risk than the risk bound. This is because there is no other solution that is within the risk bound and has a lower cost. The mixed control strategy improves the cost by mixing this optimal pure control strategy with another control policy that has an excessive risk but a lower cost.

It may sound unrealistic to decide a landing site probabilistically. However, consider a situation where there are 1,000 vehicles and we require 999 of them to land successfully while minimizing the total cost. Then our result means that the optimal strategy is to send 849 of them to the first landing site and 151 of them to the other. With only one vehicle, the interpretation of this result depends on the viewpoint. For a person who knows the result of the coin flip in advance of the landing, the resulting action is no longer mixed, and hence it may violate the given chance constraint. However, if the result of the coin flip is hidden from the observer, like Schrödinger’s cat in a box, then this mixed strategy results in the minimum expected cost while the probability of failure is still within the specified bound.

## Conclusions

We found that, in nonconvex SMPC, choosing control inputs randomly can result in a lower expected cost than deterministically optimizing them. We developed a solution method based on dual optimization and deployed it on a linear nonconvex SMPC problem, which was efficiently solved using an MILP approximation. Finally, we validated our theoretical findings through simulations.

## Acknowledgment

The research described in this paper was in part carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was supported by the Office of Naval Research, Science of Autonomy Program, under Contracts N00014-15-IP-00052 and N00014-15-1-2673.

## References

• [1] E. Altman, Constrained Markov decision processes, ser. Stochastic modeling.   Chapman & Hall/CRC, 1999. [Online]. Available: http://books.google.com/books?id=3X9S1NM2iOgC
• [2] V. S. Borkar, Convex Analytic Methods in Markov Decision Processes.   Boston, MA: Springer US, 2002, pp. 347–375.
• [3] E. A. Feinberg and U. G. Rothblum, “Splitting randomized stationary policies in total-reward Markov decision processes,” Math. Oper. Res., vol. 37, no. 1, pp. 129–153, Feb. 2012.
• [4] E. A. Feinberg, On measurability and representation of strategic measures in Markov decision processes, ser. Lecture Notes–Monograph Series.   Hayward, CA: Institute of Mathematical Statistics, 1996, vol. 30, pp. 29–43.
• [5] D. Bernardini and A. Bemporad, “Scenario-based model predictive control of stochastic constrained linear systems,” in Proceedings of Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, 2009.
• [6] G. C. Calafiore, F. Dabbene, and R. Tempo, “Research on probabilistic methods for control system design,” Automatica, vol. 47, no. 7, pp. 1279 – 1293, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109811001300
• [7] B. Acikmese and D. Bayard, “A Markov chain approach to probabilistic swarm guidance,” in American Control Conference (ACC), 2012, June 2012, pp. 6300–6307.
• [8] B. Acikmese, N. Demir, and M. Harris, “Convex necessary and sufficient conditions for density safety constraints in Markov chain synthesis,” Automatic Control, IEEE Transactions on, vol. 60, no. 10, pp. 2813–2818, Oct 2015.
• [9] D. P. Bertsekas, Convex Optimization Theory.   Athena Scientific, 2009.
• [10] Z. Baoping, G. R. Wood, and W. P. Baritompa, “Multidimensional bisection: The performance and the context,” Journal of Global Optimization, vol. 3, no. 3, pp. 337–358. [Online]. Available: http://dx.doi.org/10.1007/BF01096775
• [11] P. C. Knocke, G. G. Wawrzyniak, B. M. Kennedy, P. N. Desai, T. J. Parker, M. P. Golombek, T. C. Duxbury, and D. M. Kass, “Mars exploration rovers landing dispersion analysis,” in Proceedings of AIAA/AAS Astrodynamics Specialist Conference and Exhibit, 2004.
• [12] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge University Press, Mar. 2004.
• [13] K. E. Atkinson, An Introduction to Numerical Analysis, Second Edition.   John Wiley & Sons, 1989.
• [14] M. Ono and B. C. Williams, “An efficient motion planning algorithm for stochastic dynamic systems with constraints on probability of failure,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), 2008.
• [15] M. Ono, L. Blackmore, and B. C. Williams, “Chance constrained finite horizon optimal control with nonconvex constraints,” in Proceedings of American Control Conference, 2010, pp. 1145–1152.
• [16] D. Bertsimas and J. Tsitsiklis, Introduction to Linear Optimization, 1st ed.   Athena Scientific, 1997.
• [17] J. Lofberg, “YALMIP: A toolbox for modeling and optimization in MATLAB,” in Proceedings of the CACSD Conference, Taipei, Taiwan, 2004. [Online]. Available: http://control.ee.ethz.ch/~joloef/yalmip.php
• [18] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram, “Chance-constrained dynamic programming with application to risk-aware robotic space exploration,” Auton. Robots, vol. 39, no. 4, pp. 555–571, Dec. 2015. [Online]. Available: http://dx.doi.org/10.1007/s10514-015-9467-7
• [19] B. Acikmese and S. Ploen, “Convex programming approach to powered descent guidance for Mars landing,” AIAA Journal of Guidance, Control, and Dynamics, vol. 30, no. 5, pp. 1353–1366, 2007.
• [20] A. Johnson, A. Huertas, R. Werner, and J. Montgomery, “Analysis of On-Board Hazard Detection and Avoidance for Safe Lunar Landing,” in IEEE Aerospace Conference, 2008.
• [21] M. Ono, Y. Kuwata, and J. B. Balaram, “Joint chance-constrained dynamic programming,” in Proceedings of the IEEE Conference on Decision and Control, 2012.