DeepAI

# Dynamic Stochastic Approximation for Multi-stage Stochastic Optimization

In this paper, we consider multi-stage stochastic optimization problems with convex objectives and conic constraints at each stage. We present a new stochastic first-order method, namely the dynamic stochastic approximation (DSA) algorithm, for solving these types of stochastic optimization problems. We show that DSA can achieve an optimal O(1/ϵ^4) rate of convergence in terms of the total number of required scenarios when applied to a three-stage stochastic optimization problem. We further show that this rate of convergence can be improved to O(1/ϵ^2) when the objective function is strongly convex. We also discuss variants of DSA for solving more general multi-stage stochastic optimization problems with the number of stages T > 3. The developed DSA algorithms only need to go through the scenario tree once in order to compute an ϵ-solution of the multi-stage stochastic optimization problem. To the best of our knowledge, this is the first time that stochastic approximation type methods are generalized for multi-stage stochastic optimization with T > 3.

• 33 publications
• 10 publications
01/25/2018

### Stochastic Successive Convex Approximation for Non-Convex Constrained Stochastic Optimization

This paper proposes a constrained stochastic successive convex approxima...
12/16/2019

### Complexity of Stochastic Dual Dynamic Programming

Stochastic dual dynamic programming is a cutting plane type algorithm fo...
06/26/2015

### ASOC: An Adaptive Parameter-free Stochastic Optimization Techinique for Continuous Variables

Stochastic optimization is an important task in many optimization proble...
07/05/2013

### Stochastic Optimization of PCA with Capped MSG

We study PCA as a stochastic optimization problem and propose a novel st...
06/23/2021

### Bayesian Joint Chance Constrained Optimization: Approximations and Statistical Consistency

This paper considers data-driven chance-constrained stochastic optimizat...
07/18/2012

### Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

We develop and analyze stochastic optimization algorithms for problems i...
09/14/2009

### Stochastic Optimization of Linear Dynamic Systems with Parametric Uncertainties

This paper describes a new approach to solving some stochastic optimizat...

## 1 Introduction

Multi-stage stochastic optimization aims at optimal decision-making over multiple periods of time, where the decision in the current period has to take into account what will happen in the future. Optimal decisions over a certain time horizon is of paramount importance to different applications areas including finance, logistics, robotics and clinic trials etc. In this paper, we are interested in solving a class of multi-stage stochastic optimization problems given by

 min{ h1(x1,c1)+Eξ2{min{ h2(x2,c2)+…+EξT[min{ hT(xT,cT)}]}}}s.t. A1x1−b1∈K1s.t. A2x2−b2−B2x1∈K2,s.t. ATxT−bT−BTxT−1∈KT,x1∈X1,    x2∈X2,    xT∈XT. (1.1)

Here denotes the number of stages, are relatively simple convex functions, are closed convex cones, ,

, are the random vectors at stage

, and denote the conditional expectation with respect to given . By defining value functions, we can write problem (1.1) equivalently as

 min {h1(x1,c1)+v2(x1)} s.t.  A1x1−b1∈K1,x1∈X1, (1.2)

where the value factions are recursively defined by

 (1.3)

and

 vT(xT−1):=EξT[VT(xT−1,ξT)],VT(xT−1,ξT):=min hT(xT,cT) s.t.  ATxT−bT−BTxT−1∈KT,xT∈XT. (1.4)

In particular, if are affine, and are polyhedral, then problem (1.1

) reduces to the well-known multi-stage stochastic linear programming problem (see, e.g.,

[2]). The incorporation of the nonlinear (but convex) objective function and conic constraints allows us to model a much wider class of problems. Moreover, if , then problem (1.1) is often referred to as a two-stage (or static) stochastic programming problem.

In spite of its wide applicability, multi-stage stochastic optimization remains highly challenging to solve. Many existing methods for multi-stage stochastic optimization are based on sample average approximation (see Nemirovski and Shapiro [30] and Shapiro [31]). In this approach, one first generate a deterministic counterpart of (1.1) by replacing the expectations with (conditional) sample averages. In particular, if the number of stages , the total number of samples (a.k.a. scenarios) cannot be smaller than in general. Once after a deterministic approximation of (1.1) is generated, one can then develop decomposition methods to solve it to certain accurary. The most popular decomposition methods consists of stage-based and scenario-based decomposition method. One widely-used stage-based method is the stochastic dual dynamic programming (SDDP) algorithm, which is essentially an approximate cutting plane method, first presented by Pereira and Pinto [24] and later studied by Shapiro [32], Donohue and Birge [5], and Hindsberger [12] etc. The progressive hedging algorithm by Rockafellar and Wets [28] is a well-known scenario-based decomposition method, which basically applies an augmented Lagrangian method to penalize the violation of the non-anticipativity constraints. All these methods assume that the scenario tree has been generated and will go through the scenario tree many times. Usually there are no performance guarantees provided regarding their rate of convergence, i.e., the number of times one needs to go through the scenario tree. In SDDP, one also needs to assume that random vectors are stage-wise independent.

Recently, a different approach called stochastic approximation (SA) has attracted much attention for solving static stochastic optimization problems given in the form of

 minx∈X{f(x):=Eξ[F(x,ξ)]}, (1.5)

where is a closed convex set, denotes the random vecctor and is a closed convex function. Observe that when , problem (1.1) can be cast in the form of (1.5) and hence one can apply the aforementioned SA methods to solve these two-stage stochastic optimization problems (see [19, 14]). The basic SA algorithm, initially proposed by Robbins and Monro [27]

, mimics the simple projected gradient descent method by replacing exact gradient with its unbiased estimator. Important improvements for the SA methods have been made by Nemirovski and Yudin

[20] and later by Polayk and Juditsky ([25, 26]). During the past few years, significant progress has been made in SA methods [19, 13, 6, 7, 8, 17, 10]. In particular, Nemirovski et. al. [19] presented a properly modified SA approach, namely, mirror descent SA for solving general nonsmooth convex SP problems. Lan [13] introduced an accelerated SA method, based on Nesterov’s accelerated gradient method [21], for solving smooth, nonsmooth and stochastic optimization in a uniform manner. Novel nonconvex SA methods and their accelerated versions have also been studied in [8, 10, 9]. All these SA algorithms only need to access one single at each iteration, and hence do not require much memory. It has been shown in [19, 14] that SA methods can significantly outperform the SAA approach for solving static (or two-stage) stochastic programming problems. However, it remains unclear whether these SA methods can be generalized for multi-stage stochastic optimization problems with .

In this paper, we intend to shed some light on this problem by developing a dynamic stochastic approximation (DSA) method for multi-stage stochastic optimization. The basic idea of the DSA method is to design an inexact primal-dual SA method for solving the -th stage optimization problem in order to compute an approximate stochastic subgradient for its associated value functions . In the pursuit of this idea, we manage to resolve the following difficulties. First, the first-order information for the value function used to solve the -stage subproblem is not only stochastic, but also biased. We need to control the bias associated with such first-order information. Second, in order to establish the convergence of stochastic optimization subroutines for solving the

-stage problem, we need to guarantee that the variance of approximate stochastic subgradients and hence the dual multipliers associated with the

-stage problem are bounded. Third, we need to make sure that the approximation errors do not accumulate quickly as the number of stages increases. By properly addressing these issues, we were able to show that the DSA method can achieve an optimal rate of convergence in terms of the number of random samples when applied to a three-stage stochastic optimization problem. We further show that this rate of convergence can be improved to when the objective function is strongly convex. Moreover, we discuss variants of the DSA method which exhibit optimal rate of convergence for solving more general multi-stage stochastic optimization problems with . The developed DSA algorithms only need to go through the scenario tree once in order to compute an -solution of the multi-stage stochastic optimization problem. As a result, the required memory for DSA increases only linearly with respect to . To the best of our knowledge, this is the first time that stochastic approximation type methods are generalized to and their complexities are established for multi-stage stochastic optimization.

This paper is organized as follows. In Section 2, we introduce the basic scheme of the DSA algorithm and establish its main convergence properties for solving three-stage stochastic optimization problems. In Section 3, we apply the DSA to the problem (2.1) under the strongly convex assumption on the objective function at each stage. and we develop variants of the DSA method for solving more general form of (1.1) with in Section 4. Finally, some concluding remarks are made in Section 5.

### 1.1 Notation and terminology

For a closed convex set , a function is called a distance generating function with parameter , if is continuously differentiable and strongly convex with parameter with respect to . Therefore, we have

 ⟨x1−x,∇ωX(x1)−∇ωX(x)⟩≥∥x1−x∥2,∀x1,x∈X.

The prox-function associated with is given by

 PX(x,x1)=ωX(x1)−ωX(x)−⟨∇ωX(x),x1−x⟩,∀x1,x∈X.

It can be easily seen that

 PX(x,x1)≥αX2∥x1−x∥2, ∀x1,x∈X. (1.6)

If is bounded, we define the diameter of the set as

 Ω2X:=maxx1,x∈XPX(x,x1). (1.7)

For a given closed convex cone , we choose the distance generating function . For simplicity, we often skip the subscript of whenever we apply it to an unbounded set (such as a cone).

For a given closed convex set and a closed convex function , is called an -subgradient of at if

 V(x1)≥V(x)+⟨g(x),x1−x⟩−ϵ  ∀x1∈X. (1.8)

The collection of all such -subgradients of at is called the -subdeifferential of at , denoted by .

Assume that is Lipschitz continuous in an -neighborhood of , i.e.,

 |V(x1)−V(x)|≤M0∥x1−x∥, ∀x1,x∈Xϵ:={p∈Rn:p=r+x,x∈X,∥r∥≤ϵ}. (1.9)

We can show that

 ∥g(x)∥∗≤M0+1  ∀x∈X. (1.10)

Indeed, if , the result follows immediately by setting and in (1.8). Otherwise, we need to choose properly s.t. and . It should be noted, however, that if is Lipschitz continuous over (rather than ), then one cannot guarantee the boundedness of an -subgradient of .

## 2 Three-stage problems with generally convex objectives

Our goal in this section is to introduce the basic scheme of the DSA algorithm and discuss its convergence properties. For the sake of simplicity, we will focus on three-stage stochastic optimization problems with simple convex objective functions in this section. Extensions to strongly convex cases and more general form of multi-stage stochastic optimization problems will be studied in later sections.

### 2.1 Value functions and stochastic ϵ-subgradients

Consider the following three-stage stochastic programming problem:

 min { h1(x1,c1)+ Eξ2{min { h2(x2,c2) +Eξ3[min { h3(x3,c3)}]}}} (2.1) s.t. A1x1−b1∈K1 s.t. A2x2−b2−B2x1∈K2, s.t. A3x3−b3−B3x2∈K3, x1∈X1, x2∈X2, x3∈X3,

where , , are compact convex sets for some , are relatively simple convex functions, denote the linear mappings from to for some , and are closed convex cones. Note that the first expectation in (2.1) is taken w.r.t. the random vector and the second one is the conditional expectation with respect to given . As an example, if , and are polyhedronal, then (2.1) reduces to a well-known three-stage stochastic linear programming problem.

We can write problem (2.1) in a more compact form by using value functions as discussed in Section 1. More specifically, let be the stochastic value function at the third stage and be the corresponding (expected) value function:

 V3(x2,ξ3):=min h3(x3,c3) s.t.  A3x3−b3−B3x2∈K3,x3∈X3.v3(x2):=Eξ3[V3(x2,ξ3)]. (2.2)

We can then define the stochastic value function and its corresponding (expected) value function as

 V2(x1,ξ2):=min {h2(x2,c2)+v3(x2)}  s.t.  A2x2−b2−B2x1∈K2,x2∈X2.v2(x1):=Eξ2[V2(x1,ξ2)]. (2.3)

Problem (2.1) can then be formulated equivalently as

 min {h1(x1,c1)+v2(x1)} s.t.  A1x1−b1∈K1,x1∈X1. (2.4)

Throughout this paper, we assume that the expected value functions and are well-defined and finite-valued for any and , respectively.

In order to solve problem (2.4), we need to understand how to compute first-order information about the value functions and . Since both and are given in the form of expectation, the exact first-order information is hard to compute. We resort to the computation of a stochastic -subgradient of these value functions defined as follows.

###### Definition 1

is called a stochastic -subgradient of the value function if is an unbiased estimator of an -subgradient of , i.e.,

 Eξ[G(u,ξ)]=g(u)  and  g(u)∈∂ϵv(u). (2.5)

In order to compute a stochastic -subgradient of (resp., ), we need to compute an approximate subgradient of the corresponding stochastic value function (resp., ). To this end, we further assume that strong Lagrange duality holds for the optimization problems defined in (2.3) (resp.,(2.2)) almost surely. In other words, these problems can be formulated as saddle point problems:

 V2(x1,ξ2) =maxy2∈K2∗minx2∈X2⟨b2+B2x1−A2x2,y2⟩+h2(x2,c2)+v3(x2), (2.6) V3(x2,ξ3) =maxy3∈K3∗minx3∈X3⟨b3+B3x2−A3x3,y3⟩+h3(x3,c3), (2.7)

where and are corresponding dual cones to and , respectively.

Observe that both (2.6) and (2.7) can be viewed as special cases of the following more generic saddle point problem

 V(u,ξ)≡V(u,(A,b,B,c)):=maxy∈K∗minx∈X⟨b+Bu−Ax,y⟩+h(x,c)+~v(x), (2.8)

where and denote the linear mappings. For example, (2.7) is a special case of (2.8) with , , , , , , and . Let

 (x∗,y∗)∈Z≡X×K∗

be a pair of optimal solutions of the saddle point problem (2.6), i.e.,

 V(u,ξ) =⟨y∗,b+Bu−Ax∗⟩+h(x∗,c)+~v(x∗)=h(x∗,c)+~v(x∗), (2.9)

where the second identity follows from the complementary slackness of Lagrange duality. It is worth noting that the first stage problem can also be viewed as a special case of (2.8), since (2.4) is equivalent to

 maxy∈K1∗minx1∈X1 {⟨b1−A1x1,y1⟩+h1(x1,c1)+v2(x1)}. (2.10)

Below we provide a different characterization of an -subgradient of than the one in (1.8).

###### Lemma 1

Let and be given. If

 Q(¯z;x,y∗) :=⟨y∗,b+Bu−A¯x⟩+h(¯x,c)+~v(¯x) (2.11) −⟨¯y,b+Bu−Ax⟩−h(x,c)−~v(x)≤ϵ, ∀x∈X,

then is an -subgradient of at .

Proof. For simplicity, let us denote . For any , we denote as a pair of primal-dual solution of (2.8) (with ). Hence,

 V(u1):=⟨y∗1,b+Bu1−Ax∗1⟩+h(x∗1,c)+~v(x∗1). (2.12)

It follows from the definition of in (2.8) and (2.11) that

 V(u) =⟨y∗,b+Bu−Ax∗⟩+h(x∗,c)+~v(x∗) (2.13) ≤⟨y∗,b+Bu−A¯x⟩+h(¯x,c)+~v(¯x) ≤⟨¯y,b+Bu−Ax∗1⟩+h(x∗1,c)+~v(x∗1)+ϵ

Observe that by

 ⟨¯y,b+Bu−Ax∗1⟩=⟨¯y,B(u−u1)⟩+⟨¯y,b+Bu1−Ax∗1⟩ ≤⟨¯y,B(u−u1)⟩+⟨y∗1,b+Bu1−Ax∗1⟩,

where the last inequality follows from the assumption that is a pair of optimal solution of (2.8) with . Combining these two observations and using (2.12), we have

 V(u)≤⟨BT¯y,u−u1⟩+V(u1)+ϵ,

which, in view of (1.8), implies that is an -subgradient of .

In view of Lemma 1, in order to compute a stochastic subgradient of at a given point , we first generate a random realization and then try to find a pair of solutions satisfying (2.11). We can then use as a stochastic -subgradient of at . However, when applied to the value function of the second stage, the difficulty exists in that the function , being the value function of the third stage, is also given in the form of expectation. We will discuss how to address these issues in more details in the next subsection.

### 2.2 The DSA algorithm

Our goal in this subsection is to describe the basic scheme of our dynamic stochastic approximation algorithm applied to problem (2.4).

Our algorithm relies on the following three key primal-dual steps, referred to as stochastic primal-dual transformation (SPDT), applied to the generic saddle point problem in (2.8) at every stage.

:

 ~d =θ(d−d_)+d. (2.14) p+ =argminx∈X⟨b+Bu−Ax,~d⟩+h(x,c)+⟨~v′,x⟩+τPX(p,x). (2.15) d+ =argminy∈K∗⟨−b−Bu+Ap+,y⟩+η2∥y−d∥2. (2.16)

In the above primal-dual tranformation, the input denotes the current primal solution, dual solution, and the previous dual solution, respectively. Moreover, the input denotes a stochastic -subgradient for at the current search point . The parameters describes the problem in (2.8) and are certain algorithmic parameters to be specified. Given these input parameters, the relation in (2.14) defines a dual extrapolation (or prediction) step to estimate the dual variable for the next iterate. Based on this estimate, (2.15) performs a primal prox-mapping to compute , and then (2.16) updates in the dual space to compute by using the updated . We assume that the above SPDT operator can be performed very fast or even has explicit expressions.

In order to solve problem (2.4

), we will combine the above primal-dual transformation applied to all the three stages together with the scenario generation for the random variables

and in the second and third stage, as well as certain averaging steps in both the primal and dual spaces. We are now ready to describe the basic scheme of the DSA algorithm.

This algorithm consists of three loops. The innermost (third) loop runs steps of SPDT in order to compute an approximate stochastic subgradient () of the value function of the third stage. The second loop consists of SPDTs applied to the saddle point formulation of the second-stage problem, which requires the output from the third loop. The outer loop applies SPDTs to the saddle point formulation of the first-stage optimization problem in (2.4), using the stochastic approximate subgradients ( ) for computed by the second loop. In this algorithm, we need to generate and realizations for the random vectors and , respectively. Observe that the DSA algorithm described above is conceptual only since we have not specified any algorithmic parameters yet. We will come back to this issue after establishing some general convergence properties about this method in the next two subsections.

### 2.3 Basic tools: inexact primal-dual stochastic approximation

In this subsection, we provide some basic tools for the convergence analysis of the DSA method.

Our analysis will be centered around an inexact primal-dual stochastic approximation (I-PDSA) method, which consists of iterative applications of the SPDTs defined in (2.14), (2.15) and (2.16) to solve the generic saddle point problem in (2.8).

Algorithm 2 formally describes the I-PDSA method for solving (2.8), which evolves from the primal-dual method by Chambolle and Pork in [3]. The primal-dual method in [3] is an efficient and simple method for solving saddle point problems, which can be viewed as a refined version of the primal-dual hybrid gradient method by Arrow et al. [1]. However, its design and analysis is more closely related to a few recent important works which established the rate of convergence for solving bilinear saddle point problems (e.g., [22, 18, 16, 11]). In particular, it is equivalent to a linearized version of the alternative direction method of multipliers. The first stochastic version of the primal-dual method was studied by Chen, Lan and Ouyang [4] together with an accelerated scheme. Lan and Zhou [15] revealed some inherent relationship between Nesterov’s accelerated gradient method and the primal-dual method, and presented an optimal randomized incremental gradient method. In this section, we provide a thorough analysis for an inexact version of stochastic primal-dual method that differ from the previous studies of stochastic primal-dual method in the following two aspects. First, we need to deal with stochastic and biased subgradient information about the value function . Second, we will investigate how to guarantee the boundedness of output dual multiplier in order to generate an approximate stochastic subgradient of with bounded variance.

Throughout this section, we assume that there exists such that

 E[∥Gk∥2∗]≤M2  ∀k≥1. (2.18)

This assumption, in view of (2.17) and Jensen’s inequality, then implies that For notational convenience, we assume that the Lipschitz constant of the function is also bounded by . Indeed, by definition, any exact subgradient can be viewed as an -subgradient. Hence, the size of subgradient (and the Lipschtiz constant of ) can also be bounded by . Later in this section (see Corollary 7), we will discuss different ways to ensure that the assumption in (2.18) holds.

Below we discuss some convergence properties for Algorithm 2. More specifically, we will first establish in Proposition 2 the relation between and after running one step of SPDT, and then discuss in Theorems 3 and 5 the convergence properties of Algorithm 2 applied to problem (2.8). Using these results, we will establish the convergence of the DSA method for solving problem (2.1) in Section 2.4.

###### Proposition 2

Let be defined in (2.11). For any and , we have

 Q(zk,z)+⟨A(xk−x),yk−yk−1⟩−θk⟨A(xk−1−x),yk−1−yk−2⟩ (2.19) ≤τk[PX(xk−1,x)−PX(xk,x)]+ηk2(∥y−yk−1∥2−∥y−yk∥2)−αXτk2∥xk−xk−1∥2 −ηk2∥yk−1−yk∥2+⟨Δk−1,xk−1−x⟩+(M+∥Gk−1∥∗)∥xk−xk−1∥+¯ϵ +θk⟨A(xk−xk−1),yk−1−yk−2⟩,

where

 Δk:=g(xk)−Gk. (2.20)

Proof. By the Lipschitz continuity of and the definition of an -subgradient, we have

 ~v(xk) ≤~v(xk−1)+M∥xk−xk−1∥ ≤~v(x)+⟨g(xk−1),xk−1−x⟩+M∥xk−xk−1∥+¯ϵ.

Moreover, by (2.20), we have

 ⟨g(xk−1),xk−1−x⟩ =⟨Gk−1,xk−1−x⟩+⟨Δk−1,xk−1−x⟩ =⟨Gk−1,xk−x⟩+⟨Gk−1,xk−1−xk⟩+⟨Δk−1,xk−1−x⟩ ≤⟨Gk−1,xk−x⟩+∥Gk−1∥∗∥xk−xk−1∥+⟨Δk−1,xk−1−x⟩.

Combining the above two inequalities, we obtain

 ~v(xk)−~v(x) ≤⟨Gk−1,xk−x⟩+⟨Δk−1,xk−1−x⟩+(M+∥Gk−1∥∗)∥xk−xk−1∥+¯ϵ. (2.21)

Also observe that by the optimality conditions of (2.15) and (2.16) (with input , output (see, e.g., Lemma 2 of [13]), we have

 ⟨−A(xk−x),~yk⟩+h(xk,c) −h(x,c)+⟨Gk−1,xk−x⟩ ≤τk[PX(xk−1,x)−PX(xk,x)−PX(xk−1,xk)],∀x∈X, (2.22) ⟨−b−Bu+Axk,yk−y⟩ ≤ηk2[∥yk−1−y∥2−∥yk−y∥2−∥yk−1−yk∥2],∀y∈K∗. (2.23)

Using the definition of in (2.11) and the relations (2.21), (2.22) and (2.23), we have

 Q(zk,z)+⟨A(xk−x),yk−~yk⟩≤τk[PX(xk−1,x)−PX(xk,x)]+ηk2[∥yk−1−y∥2−∥yk−y∥2] −τkPX(xk−1,xk)−ηk2∥yk−1−yk∥2+⟨Δk−1,xk−1−x⟩+(M+∥Gk−1∥∗)∥xk−xk−1∥+¯ϵ.

Also note that by the definition of (i.e., in (2.14)), we have and hence

 ⟨A(xk−x),yk−~yk⟩ =⟨A(xk−x),yk−yk−1⟩−θk⟨A(xk−x),yk−1−yk−2⟩ =⟨A(xk−x),yk−yk−1⟩−