1 Introduction
The traditional way of solving stochastic control problems is through the principle of dynamic programming. While mathematically elegant, this approach runs into the technical difficulty known as the “curse of dimensionality” for high-dimensional problems. In fact, it is precisely in this context that the term was first introduced, by Richard Bellman (Bellman1957). It turns out that the same problem is also at the heart of many other subjects such as machine learning and quantum many-body physics.
In recent years, deep learning has shown impressive results on a variety of hard problems in machine learning (Lecun1998; Bengio2009; Krizhevsky2012; LeCun2015), suggesting that deep neural networks might be an effective tool for dealing with the curse of dimensionality. It should be emphasized that although there are partial analytical results, the reason why deep neural networks have performed so well still largely remains a mystery. Nevertheless, it motivates using the deep neural network approximation in other contexts where the curse of dimensionality is the essential obstacle.
In this paper, we develop the deep neural network approximation in the context of stochastic control problems. Even though this is not such an unexpected idea, and has in fact already been explored in the context of reinforcement learning (Sutton1998), a subject that overlaps substantially with control theory, our formulation of the problem still has some merits. First of all, the framework we propose here is much simpler than the corresponding work for reinforcement learning. Secondly, we study the problem in finite horizon. This makes the optimal controls time dependent. Thirdly, instead of formulating approximations to the value function as is commonly done (Powell2011), our formulation is in terms of approximating the optimal control at each time. In fact, the control at each time step is approximated by a feedforward subnetwork. We stack these subnetworks together to form a very deep network and train them simultaneously. Numerical examples in Section 4 suggest that this approximation can achieve near-optimality and at the same time handle high-dimensional problems with relative ease.
We note in passing that research on similar stochastic control problems has evolved under the name of deep reinforcement learning in the artificial intelligence (AI) community (Mnih2015; Silver2016; Schulman2015; Lillicrap2016; JohnSchulman2016). As was stressed in (Duan2016), most of these papers deal with the infinite horizon problem with time-independent policy. In contrast, our algorithm only involves a single deep network obtained by stacking together, through model dynamics, the different subnetworks approximating the time-dependent controls.
In dealing with high-dimensional stochastic control problems, the conventional approach taken by the operations research (OR) community has been “approximate dynamic programming” (ADP) (Powell2011). There are two essential steps in ADP. The first is replacing the true value function using some function approximation. The second is advancing forward in time from a sample path with a backward sweep to update the value function. Unlike ADP, we do not deal with the value function at all. We deal directly with the controls. In addition, our approximation scheme appears to be more generally applicable.
2 Mathematical formulation
We consider a stochastic control problem with finite time horizon
on a probability space
with a filtration . Throughout the paper we adopt the convention that any variable indexed by is measurable. We use to denote the state variable, where is the set of potential states. The control variable is denoted by .
Our setting is model-based. We assume that the evolution of the system is described by the stochastic model:
(1) 
Here is the deterministic drift term given by the model. is a measurable random variable that contains all the noisy information arriving between time and . One can view this as a discretized version of stochastic differential equations. To ensure generality of the model, we allow some state-dependent constraints on the control for all :
(2)
(3) 
Assuming the state variable completely characterizes the model (in the sense that the optimal control depends only on the current state ), we can write the set of admissible functions for as
(4) 
Our problem is finally formulated as (taking minimization for example)
(5) 
where is the intermediate cost, is the final cost and is the total cost. For later use we also define the cumulative cost
(6) 
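As a reading aid, the setup described in words in this section can be written compactly. The notation below ($s_t$ for the state, $a_t$ for the control, $b_t$ for the drift, $\xi_{t+1}$ for the noise, $c_t$ and $c_T$ for the intermediate and final costs, $C_t$ for the cumulative cost, $T$ for the horizon) is an assumption made for this sketch, as is the index range of the cumulative sum; the symbols and conventions in (1), (5) and (6) may differ.

```latex
\begin{align*}
s_{t+1} &= s_t + b_t(s_t, a_t) + \xi_{t+1}, \qquad t = 0, \dots, T-1, \\
\min_{a_0, \dots, a_{T-1}} \; & \mathbb{E}\Big[ \sum_{t=0}^{T-1} c_t(s_t, a_t) + c_T(s_T) \Big], \\
C_t &= \sum_{\tau=0}^{t} c_\tau(s_\tau, a_\tau).
\end{align*}
```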
3 A neural network approximation algorithm
Our task is to approximate the functional dependence of the control on the state, i.e. as a function of . Here we assumed that there are no memory effects but if necessary, memory effects can also be taken into account with no difference in principle. We represent this dependence by a multilayer feedforward neural network,
(7) 
where denotes the parameters of the neural network. Note that we only apply the nonlinear activation function at the layers for the hidden variables and no activation function is used at the final layer. To better explain the algorithm, we assume temporarily that there are no other constraints but the measurability for the control. Then the optimization problem becomes
(8)
Here, for clarity, we ignore the conditional dependence on the initial distribution. A key observation for the derivation of the algorithm is that, given a sample of the stochastic process , the total cost can be regarded as the output of a deep neural network. The general architecture of the network is illustrated in Figure 1. Note that there are three types of connections in this network:

is the multilayer feedforward neural network approximating the control at time . The weights of this subnetwork are the parameters we aim to optimize.

is the direct contribution to the final output of the network. Their functional form is determined by the cost function . There are no parameters to be optimized in this type of connection.

is the shortcut connecting blocks at different times, which is completely characterized by (1). There are also no parameters to be optimized in this type of connection.
If we use hidden layers in each subnetwork, as illustrated in Figure 1, then the whole network has layers in total.
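The stacked architecture described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight scale `0.1`, and the callables `drift`, `step_cost` and `final_cost` are all hypothetical, and the layer count follows the convention of Section 3.2, where a subnetwork with two hidden layers is counted as four layers.

```python
import random

def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_subnetwork(rng, n_state, n_hidden, n_control, n_hidden_layers=2):
    """Random (weights, bias) pairs for one feedforward subnetwork (assumed init scale)."""
    sizes = [n_state] + [n_hidden] * n_hidden_layers + [n_control]
    return [([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def control(subnet, state):
    """Map the current state to a control; ReLU on hidden layers only."""
    h = state
    for i, (weights, bias) in enumerate(subnet):
        h = linear(weights, bias, h)
        if i < len(subnet) - 1:          # no activation at the final layer
            h = [max(0.0, v) for v in h]
    return h

def total_cost(subnets, s0, noises, drift, step_cost, final_cost):
    """Total cost along one sample path, i.e. the output of the stacked network."""
    s, cost = list(s0), 0.0
    for subnet, xi in zip(subnets, noises):
        a = control(subnet, s)           # subnetwork: trainable parameters
        cost += step_cost(s, a)          # direct contribution to the output
        # shortcut between time blocks: state + drift + noise (no parameters)
        s = [si + bi + ni for si, bi, ni in zip(s, drift(s, a), xi)]
    return cost + final_cost(s)

def total_layers(n_hidden_layers, n_steps):
    """Layer count when input and output layers are counted per subnetwork."""
    return (n_hidden_layers + 2) * n_steps
```

Under this counting convention, `total_layers(2, 30)` gives 120, which matches the figure quoted in Section 4.1 if one assumes 30 time steps there.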
To deal with constraints, we revise the cumulative cost in Figure 1 by introducing:
(9) 
Here are the penalty functions for equality and inequality constraints while are penalty coefficients. Specific examples can be found below. We should stress that in the testing stage, we project the learned controls onto the admissible set to ensure that they strictly satisfy all the constraints.
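A penalized cost of this kind can be sketched as follows. The sketch assumes quadratic penalties (the choice reported in Section 3.2), the convention that an inequality constraint is satisfied when its value is nonnegative, and a single shared penalty coefficient; the function names are hypothetical and the exact form of (9) may differ.

```python
def equality_penalty(value):
    """Quadratic penalty: zero only when the equality constraint value is 0."""
    return value * value

def inequality_penalty(value):
    """Quadratic penalty on the violated part; assumes 'value >= 0' means satisfied."""
    return min(0.0, value) ** 2

def penalized_cost(cost, equality_values, inequality_values, coefficient):
    """Add weighted constraint penalties to a cost (minimization setting)."""
    penalty = sum(equality_penalty(v) for v in equality_values)
    penalty += sum(inequality_penalty(v) for v in inequality_values)
    return cost + coefficient * penalty
```

For instance, `penalized_cost(1.0, [0.5], [-0.2, 0.3], 10.0)` adds penalties only for the equality residual `0.5` and the violated inequality `-0.2`; the satisfied inequality `0.3` contributes nothing.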
3.1 Training algorithm
During training we sample as the input data and compute from the neural network. The standard stochastic gradient descent (SGD) method with backpropagation can be easily adapted to this situation. The training algorithm can be easily implemented using common libraries (e.g. (Abadi2015)) without modifying the SGD-type optimizers. We also adopt the technique of batch normalization (Ioffe2015) in the subnetworks, right after each linear transformation and before activation. This method accelerates the training by allowing a larger step size and easier parameter initialization.
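The training loop can be illustrated on a toy problem. The sketch below is not the authors' setup: it assumes a scalar state, a linear feedback control `a_t = theta[t] * s_t` in place of a subnetwork, quadratic costs, and it estimates gradients by central finite differences rather than backpropagation, purely to keep the example dependency-free. All constants and names are hypothetical.

```python
import random

T = 5          # number of time steps (assumed)
SIGMA = 0.1    # noise standard deviation (assumed)
S0 = 1.0       # deterministic initial state (assumed)

def total_cost(theta, noise):
    """Total cost along one sample path for controls a_t = theta[t] * s_t."""
    s, cost = S0, 0.0
    for t in range(T):
        a = theta[t] * s        # time-dependent feedback control
        cost += a * a           # intermediate cost (assumed quadratic)
        s = s + a + noise[t]    # dynamics: state + drift + noise
    return cost + s * s         # final cost (assumed quadratic)

def batch_cost(theta, batch):
    """Monte Carlo estimate of the expected total cost over a batch of paths."""
    return sum(total_cost(theta, noise) for noise in batch) / len(batch)

def sample_batch(rng, size):
    """Draw `size` independent noise paths of length T."""
    return [[rng.gauss(0.0, SIGMA) for _ in range(T)] for _ in range(size)]

def train(steps=200, batch_size=32, lr=0.05, eps=1e-4, seed=0):
    """Plain SGD on theta; gradients via central finite differences (not backprop)."""
    rng = random.Random(seed)
    theta = [0.0] * T
    for _ in range(steps):
        batch = sample_batch(rng, batch_size)   # fresh noise samples each step
        grad = []
        for t in range(T):
            up = theta[:]
            up[t] += eps
            down = theta[:]
            down[t] -= eps
            grad.append((batch_cost(up, batch) - batch_cost(down, batch)) / (2 * eps))
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta
```

All time steps are updated simultaneously from the same sampled paths, which mirrors the idea of training the stacked subnetworks jointly; a library implementation would replace the finite-difference loop with automatic differentiation.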
3.2 Implementation
We briefly mention some details of the implementation. All our numerical examples are run on a Dell desktop with a 3.2GHz Intel Core i7, without any GPU acceleration. We use TensorFlow to implement our algorithm with the Adam optimizer (Kingma2015) to optimize parameters. Adam is a variant of the SGD algorithm, based on adaptive estimates of lower-order moments. We set the default values for the corresponding hyperparameters as recommended in (Kingma2015). To deal with the constraints, we choose quadratic functions as penalty functions:
(10)
For the architecture of the subnetworks, we set the number of layers to 4, with 1 input layer (for the state variable), 2 hidden layers and 1 output layer (representing the control). We choose the rectified linear unit (ReLU) as our activation function for the hidden variables. All the weights in the network are initialized using a normal distribution without any pretraining.
In the numerical results reported below, to facilitate the comparison with the benchmark, we fix the initial state to some deterministic value rather than drawing it from a random distribution. Therefore the optimal control at is also deterministic and batch normalization is skipped at .
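For reference, a single Adam update on a flat list of parameters can be sketched as below. The defaults shown (step size 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are the values recommended in the Adam paper; the function name and the list-based interface are hypothetical and are not TensorFlow's API.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based step count, `m` and `v` the moment estimates."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias corrections
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's scale.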
4 Numerical results and discussion
4.1 Execution costs for portfolios
Our first example is in the area of finance. It is concerned with minimizing the expected cost for trading blocks of stocks over a fixed time horizon. When a portfolio requires frequent rebalancing, large orders across many stocks may appear, which must be executed within a relatively short time horizon. The execution costs associated with such trades are often substantial, and this calls for smart trading strategies. Here we consider a linear percentage price-impact model based on the work of (Bertsimas1998; Bertimas1999). The reason we choose this example is that it has an analytic solution, which facilitates the evaluation of our numerical solutions.
Denote by the number of shares of each stock bought in period at price , . The investor’s objective is to
(11) 
subject to , where denotes the shares of the stocks to be purchased within time . The execution price is assumed to be the sum of two components
Here is the “no-impact” price, modeled by geometric Brownian motion, and is the impact price, modeled by
(12) 
where , captures the potential influence of market conditions and . To complete the model specification, we set the dynamics of as a simple multivariate autoregressive process:
(13) 
where is a white noise vector and . The state variable of this model can be chosen as , where denotes the remaining shares to be bought at time . This problem can be solved analytically using dynamic programming; see (Bertimas1999) for the analytic expression of the optimal execution strategy and the corresponding optimal cost.
In our implementation, all the parameters of the model are assigned realistic values. We choose and , which gives us a generic high-dimensional problem with the control space: . We set the number of hidden units in the two hidden layers to 100, the initial learning rate to 0.001, the batch size to 64 and the number of iteration steps to 15000. The learning curves over five different random seeds with different time horizons are plotted in Figure 2.
Figure 2: [...] the standard deviation over five different random seeds. The average relative trading cost and relative error for the controls on test samples are and for . The average running time is 605 s, 778 s, 905 s respectively.
The dashed line in Figure 2 (a) represents the analytical optimal trading cost (rescaled to 1), defined as the optimal execution cost in cents/share above the no-impact cost . For this problem, the objective function achieves near-optimality with good accuracy: the average relative trading costs with respect to the exact solution are for . From Figure 2 (b) we also observe that the computed optimal strategy approximates the exact solution well. Note that for , there are 120 layers in total.
In most practical applications, there are usually constraints on execution strategies. For example, a feasible buying strategy might require to be nonnegative. Such constraints can be imposed easily by adding penalty terms in our algorithm. Here we leave the optimization task with constraints to the next example.
4.2 Energy storage and allocation benchmark
Storage of wind energy has recently received significant attention as a way to increase the efficiency of the electrical grid. Practical and adaptive methods for the optimal energy allocation on such power systems are critical for them to operate reliably and economically. Here we consider an allocation problem from (Salas2013; Jiang2015), which aims at optimizing revenues from a storage device and a renewable wind energy source while satisfying stochastic demand.
The model is set up as follows. Let the state variable be , where is the amount of energy in the storage device, is the amount of energy produced by the wind source, is the price of electricity on the spot market, and is the demand to be satisfied. Let be the maximum rates of charging and discharging from the storage device respectively, and be the capacity of the storage device. The control variable is given by , where is the amount of energy transferred from to at time . The superscript stands for wind, for demand, for storage and for spot market. Figure 3 illustrates the meaning of the control components in a network diagram.
We require that all components of be nonnegative. We also require:
The intermediate reward function at time is
(14) 
Here we do not consider the holding cost. Let . The dynamics for is characterized by
(15) 
and are modeled by first-order Markov chains in bounded domains, which are all independent of the control (see the S2 case in (Jiang2015) for the exact specification).
To maximize the total reward, we need to find the optimal control in the space . Since all the components of the control should be nonnegative, we add a ReLU activation at the final layer of each subnetwork. We set the number of hidden units in the two hidden layers to 400, the batch size to 256, all the penalty coefficients to 500 and the number of iteration steps to 50000. The learning rate is 0.003 for the first half of the iterations and 0.0003 for the second half. In the literature, many algorithms for multidimensional stochastic control problems, e.g. the ones in (Salas2013; Jiang2015), proceed by discretizing the state variable and the control variable into a finite set, and present the optimal control in the form of a lookup table. In contrast, our algorithm can handle continuous variables directly. However, for ease of comparison with the optimal lookup table obtained from backward dynamic programming as in (Jiang2015), here we artificially confine to the values in their lookup table. The relative rewards over five different random seeds are plotted in Figure 4.
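The two-stage learning rate and the nonnegative output layer used in this example can be sketched as follows. The numerical values are those stated in the text; the function names and the 0-based step index are assumptions of this sketch.

```python
def learning_rate(step, total_steps=50000, first_rate=0.003, second_rate=0.0003):
    """Piecewise-constant schedule: `first_rate` for the first half, `second_rate` after."""
    return first_rate if step < total_steps // 2 else second_rate

def nonnegative_output(pre_activations):
    """ReLU applied at the final layer so that every control component is >= 0."""
    return [max(0.0, v) for v in pre_activations]
```

With these defaults the rate drops by a factor of ten at step 25000, and an output such as `[-1.0, 0.5]` is mapped to `[0.0, 0.5]`, so the nonnegativity requirement holds by construction rather than through a penalty term.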
Despite the presence of multiple constraints, our algorithm still gives near-optimal reward. When , the neural-network policy gives even higher expected reward than the lookup table policy. It should be noted that if we relax the discretization constraint we imposed on , then our method can achieve better reward than the lookup table in both cases of and .
The learning curves in Figure 4 display clearly a feature of our algorithm for this problem: as the time horizon increases, the variance becomes larger with the same batch size and more iteration steps are required. We also see that the learning curves are rougher than those in the first example. This might be due to the presence of multiple constraints that result in more nonlinearity in the optimal control policy.
4.3 Multidimensional energy storage
Now we extend the previous example to the case of devices and test the algorithm’s performance on rather high-dimensional problems, for which we do not find any other available solution for comparison. We consider the situation of pure arbitrage, i.e. for all , and allow buying electricity from the spot market to store in the device. The state variable is where is the resource vector denoting the storage of each device. The control variable is characterized by . denote the energy capacity, maximum charging rates and discharging rates of storage device respectively. We also introduce , which are no larger than , as the charging and discharging efficiency of storage device . The holding cost is no longer zero as before, but denoted by . The intermediate reward function at time is revised to be
(16) 
and the dynamics for becomes
(17) 
with .
We make a simple but realistic assumption that a device with higher energy capacity has lower power transfer capacity , lower efficiency and lower holding cost . All these model parameters are distributed in bounded domains. As the number of devices increases, we look for a more refined allocation policy and the expected reward should be larger.
We use the same learning parameters as in the case of a single device except that we reduce all the penalty coefficients to and the batch size to . The learning curves plotted in Figure 5 confirm our expectation that the reward increases as the number of devices increases. The learning curves behave similarly to those in the case of a single device, and different random initializations still give similar expected reward. Note that the function space of the control policy: grows as increases from 30 to 50, while our algorithm still finds near-optimal solutions with slightly increased computational time.
5 Conclusion
In this paper, we formulate a deep learning approach to directly solve high-dimensional stochastic control problems in finite horizon. We use feedforward neural networks to approximate the time-dependent control at each time and stack them together through model dynamics. The objective function for the control problem plays the role of the loss function in deep learning. Our numerical results suggest that for different problems, even in the presence of multiple constraints, this algorithm finds near-optimal solutions and extends well to the high-dimensional case.
The approach presented here should be applicable to a wide variety of problems in different areas including dynamic resource allocation with many resources and demands, dynamic game theory with many agents and wealth management with large portfolios. In the literature these problems were treated under different assumptions such as separability or mean-field approximation. As suggested by the results of this paper, the deep neural network approximation should provide a more general setting and should give better results.
References
 [1] Richard Ernest Bellman. Dynamic Programming. Rand Corporation research study. Princeton University Press, 1957.
 [2] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
 [6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.
 [7] Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2011.
 [8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
 [9] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
 [10] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of The 32nd International Conference on Machine Learning (ICML), June 2015.
 [11] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), May 2016.
 [12] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), May 2016.
 [13] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of The 33rd International Conference on Machine Learning (ICML), June 2016.
 [14] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [15] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning (ICML), June 2015.
 [16] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), May 2015.
 [17] Dimitris Bertsimas and Andrew W. Lo. Optimal control of execution costs. Journal of Financial Markets, 1(1):1–50, April 1998.
 [18] Dimitris Bertsimas, Andrew W. Lo, and P. Hummel. Optimal control of execution costs for portfolios. Computing in Science and Engineering, 1(6):40–53, November 1999.
 [19] Daniel F. Salas and Warren B. Powell. Benchmarking a scalable approximation dynamic programming algorithm for stochastic control of multidimensional energy storage problems. Technical report, Princeton University, 2013.
 [20] Daniel R. Jiang and Warren B. Powell. An approximate dynamic programming algorithm for monotone value functions. Operations Research, 63(6):1489–1511, December 2015.