Deep Learning Approximation for Stochastic Control Problems

11/02/2016 ∙ by Jiequn Han, et al.

Many real world stochastic control problems suffer from the "curse of dimensionality". To overcome this difficulty, we develop a deep learning approach that directly solves high-dimensional stochastic control problems based on Monte-Carlo sampling. We approximate the time-dependent controls as feedforward neural networks and stack these networks together through model dynamics. The objective function for the control problem plays the role of the loss function for the deep neural network. We test this approach using examples from the areas of optimal trading and energy storage. Our results suggest that the algorithm presented here achieves satisfactory accuracy and at the same time, can handle rather high dimensional problems.






1 Introduction

The traditional way of solving stochastic control problems is through the principle of dynamic programming. While mathematically elegant, this approach runs into the technical difficulty associated with the “curse of dimensionality” for high-dimensional problems. In fact, it is precisely in this context that the term was first introduced, by Richard Bellman (Bellman1957, ). It turns out that the same problem is also at the heart of many other subjects such as machine learning and quantum many-body physics.

In recent years, deep learning has shown impressive results on a variety of hard problems in machine learning (Lecun1998, ; Bengio2009, ; Krizhevsky2012, ; LeCun2015, ), suggesting that deep neural networks might be an effective tool for dealing with the curse of dimensionality. It should be emphasized that although there are partial analytical results, the reason why deep neural networks have performed so well still largely remains a mystery. Nevertheless, it motivates using the deep neural network approximation in other contexts where the curse of dimensionality is the essential obstacle.

In this paper, we develop the deep neural network approximation in the context of stochastic control problems. Even though this is not such an unexpected idea, and has in fact already been explored in the context of reinforcement learning (Sutton1998, ), a subject that overlaps substantially with control theory, our formulation of the problem still has some merits. First of all, the framework we propose here is much simpler than the corresponding work for reinforcement learning. Secondly, we study the problem over a finite horizon. This makes the optimal controls time dependent. Thirdly, instead of formulating approximations to the value function as is commonly done (Powell2011, ), our formulation is in terms of approximating the optimal control at each time. In fact, the control at each time step is approximated by a feedforward subnetwork. We stack these subnetworks together to form a very deep network and train them simultaneously. Numerical examples in Section 4 suggest that this approximation can achieve near-optimality and at the same time handle high-dimensional problems with relative ease.

We note in passing that research on similar stochastic control problems has evolved under the name of deep reinforcement learning in the artificial intelligence (AI) community (Mnih2015, ; Silver2016, ; Schulman2015, ; Lillicrap2016, ; JohnSchulman2016, ). As was stressed in (Duan2016, ), most of these papers deal with the infinite horizon problem with a time-independent policy. In contrast, our algorithm only involves a single deep network obtained by stacking together, through model dynamics, the different subnetworks approximating the time-dependent controls.

In dealing with high-dimensional stochastic control problems, the conventional approach taken by the operations research (OR) community has been “approximate dynamic programming” (ADP) (Powell2011, ). There are two essential steps in ADP. The first is replacing the true value function with some function approximation. The second is advancing forward in time along a sample path, with a backward sweep to update the value function. Unlike ADP, we do not deal with the value function at all. We deal directly with the controls. In addition, our approximation scheme appears to be more generally applicable.

2 Mathematical formulation

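We consider a stochastic control problem with finite time horizon $T$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with a filtration $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_T = \mathcal{F}$. Throughout the paper we adopt the convention that any variable indexed by $t$ is $\mathcal{F}_t$-measurable. We use $s_t \in \mathcal{S}_t \subseteq \mathbb{R}^m$ to denote the state variable, where $\mathcal{S}_t$ is the set of potential states. The control variable is denoted by $a_t \in \mathbb{R}^n$.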

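Our setting is model-based. We assume that the evolution of the system is described by the stochastic model:

$$ s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1}. \tag{1} $$

Here $b_t$ is the deterministic drift term given by the model, and $\xi_{t+1}$ is an $\mathcal{F}_{t+1}$-measurable random variable that contains all the noisy information arriving between time $t$ and $t+1$. One can view this as a discretized version of stochastic differential equations. To ensure generality of the model, we allow some state-dependent constraints on the control for all $t$:

$$ g_i(s_t, a_t) = 0, \quad i = 1, \dots, I, \qquad h_j(s_t, a_t) \ge 0, \quad j = 1, \dots, J. \tag{2} $$

Assuming the state variable completely characterizes the model (in the sense that the optimal control depends only on the current state $s_t$), we can write the set of admissible functions for $a_t$ as

$$ \mathcal{A}_t = \{ a_t(s_t) : \mathbb{R}^m \to \mathbb{R}^n \ \text{such that}\ a_t\ \text{satisfies (2)} \}. $$
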
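Our problem is finally formulated as (taking minimization for example)

$$ \min_{\{a_t\}_{t=0}^{T-1}} \mathbb{E}\{ C_T \mid s_0 \} := \min_{\{a_t\}_{t=0}^{T-1}} \mathbb{E}\Big\{ \sum_{t=0}^{T-1} c_t\big(s_t, a_t(s_t)\big) + c_T(s_T) \,\Big|\, s_0 \Big\}, $$

where the minimum is taken over admissible controls, $c_t(s_t, a_t)$ is the intermediate cost, $c_T(s_T)$ is the final cost and $C_T$ is the total cost. For later purposes we also define the cumulative cost

$$ C_t = \sum_{\tau=0}^{t} c_\tau(s_\tau, a_\tau), \qquad t = 0, \dots, T-1. $$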
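As an illustration of this formulation, the following sketch estimates the expected total cost by Monte-Carlo sampling for one fixed feedback control on a toy one-dimensional problem. The functions `drift`, `running_cost`, `final_cost` and `control`, as well as the Gaussian noise, are hypothetical choices made for this example only and are not taken from the paper.

```python
import random

def drift(t, s, a):
    # Hypothetical drift term: the control shifts the state directly.
    return a

def running_cost(t, s, a):
    # Hypothetical intermediate cost at time t.
    return s * s + a * a

def final_cost(s):
    # Hypothetical final cost at the end of the horizon.
    return s * s

def control(t, s):
    # Hypothetical feedback rule; it depends only on the current state.
    return -0.5 * s

def total_cost(s0, T, rng, sigma=0.1):
    """Simulate one path (state + drift + noise) and return its total cost."""
    s, cost = s0, 0.0
    for t in range(T):
        a = control(t, s)
        cost += running_cost(t, s, a)
        s = s + drift(t, s, a) + rng.gauss(0.0, sigma)
    return cost + final_cost(s)

def expected_cost(s0, T, n_paths=2000, seed=0):
    """Monte-Carlo estimate of the expected total cost given the initial state."""
    rng = random.Random(seed)
    return sum(total_cost(s0, T, rng) for _ in range(n_paths)) / n_paths

if __name__ == "__main__":
    print(expected_cost(s0=1.0, T=10))
```

Replacing `control` by a parametrized function and minimizing `expected_cost` over its parameters is the optimization problem that the next section addresses.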


3 A neural network approximation algorithm

Our task is to approximate the functional dependence of the control on the state, i.e. $a_t$ as a function of $s_t$. Here we have assumed that there are no memory effects, but if necessary, memory effects can also be taken into account with no difference in principle. We represent this dependence by a multilayer feedforward neural network,

$$ a_t(s_t) \approx a_t(s_t \mid \theta_t), $$

where $\theta_t$ denotes the parameters of the neural network. Note that we only apply the nonlinear activation function at the layers for the hidden variables and no activation function is used at the final layer. To better explain the algorithm, we assume temporarily that there are no other constraints but the $\mathcal{F}_t$-measurability for the control. Then the optimization problem becomes

$$ \min_{\{\theta_t\}_{t=0}^{T-1}} \mathbb{E}\Big\{ \sum_{t=0}^{T-1} c_t\big(s_t, a_t(s_t \mid \theta_t)\big) + c_T(s_T) \Big\}. $$

Here, for clarity, we ignore the conditional dependence on the initial distribution. A key observation for the derivation of the algorithm is that, given a sample of the stochastic process $\{\xi_t\}_{t=1}^{T}$, the total cost $C_T$ can be regarded as the output of a deep neural network. The general architecture of the network is illustrated in Figure 1. Note that there are three types of connections in this network:

  1. $s_t \to h_t^1 \to h_t^2 \to \cdots \to h_t^N \to a_t$ is the multilayer feedforward neural network approximating the control at time $t$. The weights $\theta_t$ of this subnetwork are the parameters we aim to optimize.

  2. $(s_t, a_t) \to C_t$ is the direct contribution to the final output of the network. Its functional form is determined by the cost function $c_t(s_t, a_t)$. There are no parameters to be optimized in this type of connection.

  3. $(s_t, a_t, \xi_{t+1}) \to s_{t+1}$ is the shortcut connecting blocks at different times, which is completely characterized by (1). There are also no parameters to be optimized in this type of connection.

If we use $N$ hidden layers in each subnetwork, as illustrated in Figure 1, then the whole network has $(N+2)T$ layers in total.
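The stacking described above can be sketched as a plain forward pass. This is a minimal illustration in standard-library Python, not the authors' implementation: the quadratic cost, the drift equal to the control, and the helper names (`init_subnetwork`, `subnetwork`, `forward`, `total_layers`) are assumptions made for the example. The layer count follows the convention stated later in the text (each subnetwork has one input layer, its hidden layers and one output layer).

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    # W is a list of rows; returns W v + b.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def init_subnetwork(sizes, rng):
    # One (W, b) pair per linear map; sizes = [input, hidden..., output].
    return [([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def subnetwork(params, s):
    # Connection type 1: state -> hidden layers -> control (no output activation).
    v = s
    for W, b in params[:-1]:
        v = relu(linear(W, b, v))
    W, b = params[-1]
    return linear(W, b, v)

def forward(all_params, s0, noise):
    """Map one noise sample (one vector per time step) to the total cost."""
    s, cost = list(s0), 0.0
    for params, xi in zip(all_params, noise):
        a = subnetwork(params, s)
        # Connection type 2: cost contribution (hypothetical quadratic cost).
        cost += sum(x * x for x in s) + sum(x * x for x in a)
        # Connection type 3: model dynamics with a hypothetical drift equal to a.
        s = [si + ai + xii for si, ai, xii in zip(s, a, xi)]
    return cost + sum(x * x for x in s)  # hypothetical final cost

def total_layers(n_hidden, T):
    # Input layer + hidden layers + output layer, repeated for every time step.
    return (n_hidden + 2) * T

if __name__ == "__main__":
    rng = random.Random(0)
    T, dim = 3, 2
    nets = [init_subnetwork([dim, 8, 8, dim], rng) for _ in range(T)]
    noise = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(T)]
    print(forward(nets, [1.0, -1.0], noise), total_layers(2, 30))
```

Only the weights inside `all_params` are trainable; the cost and dynamics connections are fixed functions, which is why the whole stack can be trained as a single network.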

Figure 1: Illustration of the network architecture for stochastic control problems with $N$ hidden layers for each subnetwork. Each column (except $\xi_t$) corresponds to a subnetwork at $t$. $h_t^1, \dots, h_t^N$ are the hidden variables in the subnetwork at time $t$.

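To deal with constraints, we revise the cumulative cost $C_t$ in Figure 1 by introducing:

$$ C_t \leftarrow C_t + \sum_{i=1}^{I} \lambda_i P_e\big(g_i(s_t, a_t)\big) + \sum_{j=1}^{J} \sigma_j P_{ie}\big(h_j(s_t, a_t)\big). $$

Here $P_e$ and $P_{ie}$ are the penalty functions for the equality constraints $g_i = 0$ and the inequality constraints $h_j \ge 0$, while $\lambda_i$ and $\sigma_j$ are penalty coefficients. Specific examples can be found below. We should stress that in the testing stage, we project the optimal controls we learned to the admissible set to ensure that they strictly satisfy all the constraints.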
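A minimal sketch of this penalty construction is given below. It assumes quadratic penalties, equality constraints written as g = 0 and inequality constraints written as h >= 0; the box projection in `project_to_box` is a hypothetical example of a test-time projection, since the appropriate projection depends on the admissible set of the problem at hand.

```python
def equality_penalty(g):
    # Quadratic penalty for an equality constraint g = 0.
    return g * g

def inequality_penalty(h):
    # Quadratic penalty for an inequality constraint h >= 0:
    # zero when satisfied, quadratic in the violation otherwise.
    return min(0.0, h) ** 2

def penalized_cost(cost, g_values, h_values, lam=1.0, sigma=1.0):
    """Cost plus weighted penalties for constraint violations (training only)."""
    return (cost
            + lam * sum(equality_penalty(g) for g in g_values)
            + sigma * sum(inequality_penalty(h) for h in h_values))

def project_to_box(a, lower, upper):
    # Test-time projection for a hypothetical box constraint lower <= a <= upper.
    return [min(max(x, lo), hi) for x, lo, hi in zip(a, lower, upper)]

if __name__ == "__main__":
    print(penalized_cost(1.0, [0.5], [-2.0, 3.0], lam=2.0, sigma=10.0))
    print(project_to_box([-1.0, 0.5, 2.0], [0.0] * 3, [1.0] * 3))
```

The penalty only discourages violations during training, so a learned control may still violate a constraint slightly; the projection step is what guarantees feasibility at test time.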

3.1 Training algorithm

During training we sample $\{\xi_t\}_{t=1}^{T}$ as the input data and compute $C_T$ from the neural network. The standard stochastic gradient descent (SGD) method with backpropagation can be easily adapted to this situation. The training algorithm can be easily implemented using common libraries (e.g., TensorFlow (Abadi2015, )) without modifying the SGD-type optimizers. We also adopted the technique of batch normalization (Ioffe2015, ) in the subnetworks, right after each linear transformation and before activation. This method accelerates the training by allowing a larger step size and easier parameter initialization.
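To make the training loop concrete, here is a small self-contained sketch of mini-batch SGD with backpropagation through the model dynamics. It is deliberately simplified and is not the authors' TensorFlow code: each subnetwork is replaced by a single scalar gain `k[t]` (so the control is `k[t] * s`), and the dynamics and quadratic costs are hypothetical. The gradient is propagated backward in time by hand, which is what an automatic differentiation library would do for the stacked network.

```python
import random

def simulate(k, s0, noise):
    """Forward pass: total cost plus the visited states and controls."""
    s, cost, states, controls = s0, 0.0, [], []
    for k_t, xi in zip(k, noise):
        a = k_t * s                  # time-dependent linear control
        states.append(s)
        controls.append(a)
        cost += s * s + a * a        # hypothetical intermediate cost
        s = s + a + xi               # hypothetical dynamics
    return cost + s * s, states, controls, s

def gradient(k, s0, noise):
    """Backpropagate the total cost to every gain k[t] for one sample path."""
    _, states, controls, s_T = simulate(k, s0, noise)
    grad, g_next = [0.0] * len(k), 2.0 * s_T      # g_next: d cost / d next state
    for t in reversed(range(len(k))):
        d_a = 2.0 * controls[t] + g_next          # d cost / d control at time t
        grad[t] = d_a * states[t]
        g_next = 2.0 * states[t] + g_next + d_a * k[t]
    return grad

def train(T=5, steps=500, batch=32, lr=0.05, sigma=0.3, seed=0):
    rng = random.Random(seed)
    k = [0.0] * T
    for _ in range(steps):
        total = [0.0] * T
        for _ in range(batch):
            noise = [rng.gauss(0.0, sigma) for _ in range(T)]
            for t, g in enumerate(gradient(k, 1.0, noise)):
                total[t] += g / batch
        k = [k_t - lr * g for k_t, g in zip(k, total)]   # plain SGD step
    return k

def mean_cost(k, n_paths=2000, sigma=0.3, seed=1):
    rng = random.Random(seed)
    costs = [simulate(k, 1.0, [rng.gauss(0.0, sigma) for _ in k])[0]
             for _ in range(n_paths)]
    return sum(costs) / n_paths

if __name__ == "__main__":
    k = train()
    print(k, mean_cost(k), mean_cost([0.0] * 5))
```

All gains are updated simultaneously from the same sampled paths, mirroring the simultaneous training of the stacked subnetworks.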

3.2 Implementation

We briefly mention some details of the implementation. All our numerical examples are run on a Dell desktop with a 3.2GHz Intel Core i7, without any GPU acceleration. We use TensorFlow to implement our algorithm with the Adam optimizer (Kingma2015, ) to optimize parameters. Adam is a variant of the SGD algorithm, based on adaptive estimates of lower-order moments. We set the default values for corresponding hyper-parameters as recommended in (Kingma2015, ). To deal with the constraints, we choose quadratic functions as penalty functions:

$$ P_e(x) = x^2 \ \text{(equality constraints)}, \qquad P_{ie}(x) = (\min\{0, x\})^2 \ \text{(inequality constraints)}. $$


For the architecture of the subnetworks, we set the number of layers to 4, with 1 input layer (for the state variable), 2 hidden layers and 1 output layer (representing the control). We choose the rectified linear unit (ReLU) as our activation function for the hidden variables. All the weights in the network are initialized using a normal distribution without any pre-training.
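The subnetwork layout described here can be sketched as follows, with batch normalization placed after each hidden linear transformation and before the ReLU. This is an illustrative standard-library version under stated assumptions: the learnable scale and shift of full batch normalization are omitted, normalization is applied only in the hidden layers, and the function names are hypothetical.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean and unit variance.

    The learnable scale and shift of full batch normalization are omitted.
    """
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]

def linear(W, b, batch):
    # Apply W x + b to every sample (row) of the mini-batch.
    return [[sum(w * x for w, x in zip(w_row, row)) + bi
             for w_row, bi in zip(W, b)] for row in batch]

def hidden_layer(W, b, batch):
    # Linear transformation -> batch normalization -> ReLU, in that order.
    normalized = batch_norm(linear(W, b, batch))
    return [[max(0.0, x) for x in row] for row in normalized]

def subnetwork(layers, batch):
    # layers = [(W1, b1), (W2, b2), (W_out, b_out)]:
    # two hidden layers followed by a linear output layer.
    for W, b in layers[:-1]:
        batch = hidden_layer(W, b, batch)
    W_out, b_out = layers[-1]
    return linear(W_out, b_out, batch)  # no activation at the output layer

if __name__ == "__main__":
    identity = ([[1.0]], [0.0])
    print(subnetwork([identity, identity, identity], [[1.0], [3.0]]))
```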

In the numerical results reported below, to facilitate the comparison with the benchmark, we fix the initial state $s_0$ to some deterministic value rather than sampling it from a random distribution. Therefore the optimal control at $t = 0$ is also deterministic and batch normalization is skipped at $t = 0$.

4 Numerical results and discussion

4.1 Execution costs for portfolios

Our first example is in the area of finance. It is concerned with minimizing the expected cost for trading blocks of stocks over a fixed time horizon. When a portfolio requires frequent rebalancing, large orders across many stocks may appear, which must be executed within a relatively short time horizon. The execution costs associated with such tradings are often substantial, and this calls for smart trading strategies. Here we consider a linear percentage price-impact model based on the work of (Bertsimas1998, ; Bertimas1999, ). The reason we choose this example is that it has an analytic solution, which facilitates the evaluation of our numerical solutions.

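Denote by $a_t \in \mathbb{R}^n$ the number of shares of each stock bought in period $t$ at price $p_t \in \mathbb{R}^n$, $t = 0, 1, \dots, T-1$. The investor’s objective is to

$$ \min_{\{a_t\}} \ \mathbb{E} \sum_{t=0}^{T-1} p_t^{\top} a_t $$

subject to $\sum_{t=0}^{T-1} a_t = \bar{a}$, where $\bar{a} \in \mathbb{R}^n$ denotes the shares of the $n$ stocks to be purchased within time $T$. The execution price is assumed to be the sum of two components

$$ p_t = \tilde{p}_t + \delta_t. $$

Here $\tilde{p}_t$ is the “no-impact” price, modeled by geometric Brownian motion, and $\delta_t$ is the impact price, modeled by

$$ \delta_t = \tilde{P}_t \big( A \tilde{P}_t a_t + B x_t \big), $$

where $\tilde{P}_t = \mathrm{diag}[\tilde{p}_t]$, $x_t \in \mathbb{R}^m$ captures the potential influence of market conditions and $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$. To complete the model specification, we set the dynamics of $x_t$ as a simple multivariate autoregressive process:

$$ x_t = C x_{t-1} + \eta_t, $$

where $\eta_t$ is a white noise vector and $C \in \mathbb{R}^{m \times m}$. The state variable of this model can be chosen as $(\tilde{p}_t, x_t, w_t)$, where $w_t$ denotes the remaining shares to be bought at time $t$. This problem can be solved analytically using dynamic programming, see (Bertimas1999, ) for the analytic expression of the optimal execution strategy and the corresponding optimal cost.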

In our implementation, all the parameters of the model are assigned with realistic values. We choose and , which gives us a generic high-dimensional problem with the control space: . We set the number of hidden units in the two hidden layers to 100, the initial learning rate to 0.001, the batch size to 64 and iteration steps to 15000. The learning curves over five different random seeds with different time horizons are plotted in Figure 2.

(a) relative trading cost
(b) relative error for the controls
Figure 2: Relative trading cost and relative error for the controls (compared to the exact solution) as a function of the number of iterations on validation samples. The shaded area depicts the mean ± the standard deviation over five different random seeds. The average relative trading cost and relative error for the controls on test samples are and for . The average running time is 605 s, 778 s, 905 s respectively.

The dashed line in Figure 2 (a) represents the analytical optimal trading cost (rescaled to 1), defined as the optimal execution cost in cents/share above the no-impact cost . For this problem, the objective function achieves near-optimality with good accuracy: the average relative trading costs to the exact solution are for . From Figure 2 (b) we also observe that the computed optimal strategy approximates the exact solution well. Note that for $T = 30$, there are 120 layers in total.

In most practical applications, there are usually constraints on execution strategies. For example, a feasible buying strategy might require $a_t$ to be nonnegative. Such constraints can be imposed easily by adding penalty terms in our algorithm. Here we leave the optimization task with constraints to the next example.

4.2 Energy storage and allocation benchmark

Storage of wind energy has recently received significant attention as a way to increase the efficiency of the electrical grid. Practical and adaptive methods for the optimal energy allocation on such power systems are critical for them to operate reliably and economically. Here we consider an allocation problem from (Salas2013, ; Jiang2015, ), which aims at optimizing revenues from a storage device and a renewable wind energy source while satisfying stochastic demand.

The model is set up as follows. Let the state variable be , where is the amount of energy in the storage device, is the amount of energy produced by the wind source, is the price of electricity on the spot market, and is the demand to be satisfied. Let be the maximum rates of charging and discharging from the storage device respectively, and be the capacity of the storage device. The control variable is given by , where is the amount of energy transferred from to at time . The superscript stands for wind, for demand, for storage and for spot market. Figure 3 illustrates the meaning of the control components in a network diagram.

Figure 3: Network diagram for the energy storage problem.

We require that all components of be nonnegative. We also require:

The intermediate reward function at time is


Here we do not consider the holding cost. Let . The dynamics for is characterized by



are modeled by first-order Markov chains in bounded domains, which are all independent from the control (See S2 case in

(Jiang2015, ) for the exact specification).

To maximize the total reward, we need to find the optimal control in the space . Since all the components of the control should be nonnegative, we add a ReLU activation at the final layer of each subnetwork. We set the number of hidden units in the two hidden layers to 400, the batch size to 256, all the penalty coefficients to 500 and iteration steps to 50000. The learning rate is 0.003 for the first half of the iterations and 0.0003 for the second half. In the literature, many algorithms for multidimensional stochastic control problems, e.g. the ones in (Salas2013, ; Jiang2015, ), proceed by discretizing the state variable and the control variable into a finite set, and present the optimal control in the form of a lookup table. In contrast, our algorithm can handle continuous variables directly. However, for ease of comparison with the optimal lookup table obtained from backward dynamic programming as in (Jiang2015, ), here we artificially confine to the values in their lookup table. The relative rewards over five different random seeds are plotted in Figure 4.
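Two of the settings in this paragraph are simple enough to sketch directly: the piecewise-constant learning rate and the final-layer ReLU that keeps every control component nonnegative. The function names below are hypothetical, and the step index is assumed to start at 0.

```python
def learning_rate(step, total_steps=50000, first=0.003, second=0.0003):
    """Piecewise-constant schedule: `first` for the first half, `second` after."""
    return first if step < total_steps // 2 else second

def nonnegative_output(pre_activation):
    # Final-layer ReLU: every control component is mapped to a value >= 0.
    return [max(0.0, x) for x in pre_activation]

if __name__ == "__main__":
    print(learning_rate(0), learning_rate(25000))
    print(nonnegative_output([-1.0, 2.0]))
```

The ReLU enforces nonnegativity exactly, so penalty terms are only needed for the remaining constraints.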

Figure 4: Relative reward as a function of the number of iterations on validation samples, with the optimal lookup table obtained from backward dynamic programming as a benchmark. The shaded area depicts the mean ± the standard deviation over five different random seeds. The average relative reward on test samples is for . The average running time is 5041 s and 8150 s.

Despite the presence of multiple constraints, our algorithm still gives near-optimal reward. When , the neural-network policy gives even higher expected reward than the lookup table policy. It should be noted that if we relax the discretization constraint we imposed on , then our method can achieve better reward than the lookup table in both cases of and .

The learning curves in Figure 4 display clearly a feature of our algorithm for this problem: as the time horizon increases, the variance becomes larger with the same batch size and more iteration steps are required. We also see that the learning curves are rougher than those in the first example. This might be due to the presence of multiple constraints that result in more nonlinearity in the optimal control policy.

4.3 Multidimensional energy storage

Now we extend the previous example to the case of devices and test the algorithm’s performance for the rather high dimensional problems, in which we do not find any other available solution for comparison. We consider the situation of pure arbitrage, i.e. for all , and allow buying electricity from the spot market to store in the device. The state variable is where is the resource vector denoting the storage of each device. The control variable is characterized by . denote the energy capacity, maximum charging rates and discharging rates of storage device respectively. We also introduce , which are no larger than , as the charging and discharging efficiency of storage device . The holding cost is no longer zero as before, but denoted by . The intermediate reward function at time is revised to be


and the dynamics for becomes


with .

We make a simple but realistic assumption that a device with higher energy capacity has lower power transfer capacity , lower efficiency and lower holding cost . All these model parameters are distributed in the bounded domains. As the number of devices increases, we look for more refined allocation policy and the expected reward should be larger.

Figure 5: Reward relative to the expected total reward (with controls satisfying constraints strictly) in the case as a function of the number of iterations on validation samples. The shaded area depicts the mean ± the standard deviation over five different random seeds. The average relative reward on test samples is for . The average running time for three cases is 6672 s, 8374 s and 10219 s.

We use the same learning parameters as in the case of a single device except that we reduce all the penalty coefficients to and batch size to . The learning curves plotted in Figure 5 confirm our expectation that the reward increases as the number of devices increases. The learning curves behave similarly to those in the case of a single device, and different random initializations still give similar expected reward. Note that the function space of the control policy: grows as increases from 30 to 50, while our algorithm still finds a near-optimal solution with slightly increased computational time.

5 Conclusion

In this paper, we formulate a deep learning approach to directly solve high-dimensional stochastic control problems over a finite horizon. We use feedforward neural networks to approximate the time-dependent control at each time and stack them together through model dynamics. The objective function for the control problem plays the role of the loss function in deep learning. Our numerical results suggest that for different problems, even in the presence of multiple constraints, this algorithm finds near-optimal solutions and extends readily to high-dimensional cases.

The approach presented here should be applicable to a wide variety of problems in different areas including dynamic resource allocation with many resources and demands, dynamic game theory with many agents and wealth management with large portfolios. In the literature these problems were treated under different assumptions such as separability or mean-field approximation. As suggested by the results of this paper, the deep neural network approximation should provide a more general setting and should give better results.


  • [1] Richard Ernest Bellman. Dynamic Programming. Rand Corporation research study. Princeton University Press, 1957.
  • [2] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
  • [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  • [6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.
  • [7] Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2011.
  • [8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
  • [9] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
  • [10] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of The 32nd International Conference on Machine Learning (ICML), June 2015.
  • [11] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), May 2016.
  • [12] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), May 2016.
  • [13] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of The 33rd International Conference on Machine Learning (ICML), June 2016.
  • [14] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from
  • [15] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning (ICML), June 2015.
  • [16] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), May 2015.
  • [17] Dimitris Bertsimas and Andrew W. Lo. Optimal control of execution costs. Journal of Financial Markets, 1(1):1–50, April 1998.
  • [18] Dimitris Bertsimas, Andrew W. Lo, and P. Hummel. Optimal control of execution costs for portfolios. Computing in Science and Engineering, 1(6):40–53, November 1999.
  • [19] Daniel F. Salas and Warren B. Powell. Benchmarking a scalable approximation dynamic programming algorithm for stochastic control of multidimensional energy storage problems. Technical report, Princeton University, 2013.
  • [20] Daniel R. Jiang and Warren B. Powell. An approximate dynamic programming algorithm for monotone value functions. Operations Research, 63(6):1489–1511, December 2015.