I Introduction
Stochastic optimal control is a mature discipline of control theory with a plethora of applications to autonomy, robotics, aerospace systems, computational neuroscience, and finance. From a methodological standpoint, stochastic dynamic programming is the pillar of stochastic optimal control theory. Application of stochastic dynamic programming results in the so-called Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE). Algorithms for stochastic control can be classified into different categories depending on how they deal with the curse of dimensionality in solving the HJB PDE for systems with many degrees of freedom and/or states.
The game-theoretic, or min-max, extension to optimal control was first investigated by Isaacs [1]. He associated the solution of a differential game with the solution to an HJB-like equation, namely its min-max extension, also known as the Hamilton-Jacobi-Isaacs (HJI) equation. The HJI equation was derived heuristically under the assumptions of Lipschitz continuity of the cost and the dynamics, in addition to the assumption that both of them are separable in terms of the maximizing and minimizing controls. Despite extensive results in the theory of differential games, algorithmic development has seen less growth, due to the difficulties involved in addressing such problems. Prior work, including the Markov Chain approximation method [2], largely suffers from the curse of dimensionality. In addition, a specific class of min-max control trajectory optimization methods has been derived recently, relying on the foundations of differential dynamic programming (DDP) [3, 4, 5], which requires linear and/or quadratic approximation of the dynamics and value function.
Due to the inherent difficulties of solving stochastic differential games, most of the effort in optimal control theory has focused on the HJB PDE. To solve the HJB equation, a number of algorithms for stochastic optimal control have been proposed that rely on the probabilistic representation of solutions of linear and nonlinear backward PDEs. Starting from the path integral control framework [6], the HJB equation is transformed into a linear backward PDE under certain conditions related to control authority and variance of noise. The probabilistic representation of the solution of this PDE is provided by the linear Feynman-Kac theorem [7, 8, 9]. The nonlinear Feynman-Kac theorem avoids the assumption required in the path integral control framework at the cost, however, of representing the solution of the HJB equation with a system of Forward-Backward Stochastic Differential Equations (FBSDEs) [10, 11]. Previous work by our group aimed at improving sampling efficiency and reducing computational complexity, and in [12, 13, 14] an importance sampling scheme was proposed and employed to develop iterative stochastic control algorithms using the FBSDE formulation. This work led to algorithms for, among others, risk-sensitive stochastic optimal control as well as stochastic differential games [15, 16, 17].
In [18] the authors incorporate deep learning algorithms, such as deep feed-forward neural networks, within the FBSDE formulation and demonstrate the applicability of the resulting algorithms to solving PDEs. While the approach in [18] offers an efficient method to represent the value function and its gradient, it has only been applied to PDEs that correspond to simple dynamics. Motivated by the limitations of the existing work on FBSDEs and Deep Learning (DL), the work in [19] utilizes importance sampling together with the benefits of recurrent neural networks in order to capture the temporal dependencies of the value function and to scale the deep FBSDE algorithm to high-dimensional nonlinear stochastic systems.
In this work, we demonstrate that the FBSDEs associated with stochastic differential games can be solved with the deep FBSDE framework. We focus on the case of min-max stochastic control that corresponds to risk sensitive control. Using the Long Short-Term Memory (LSTM) network architecture [20], we introduce a scalable deep min-max FBSDE controller that results in trajectories with reduced variance. We demonstrate the variance reduction benefit of this algorithm against the standard risk neutral stochastic optimal control formulation of the deep FBSDE framework on a pendulum and a quadcopter in simulation.
The rest of this paper is organized as follows: in Section II we introduce the min-max stochastic control problem, demonstrate its connection to risk sensitive control, and reformulate the problem with a system of FBSDEs. We present the min-max FBSDE controller in Section III. In Section IV, we compare the controller introduced in this work against the deep FBSDE algorithm for standard stochastic optimal control, and we explore the variance reduction benefit of our controller as a function of risk sensitivity. Finally, we conclude the paper in Section V.
II FBSDE for Differential Games
II-A Min-Max Stochastic Control
Consider a system with control affine dynamics in a differential game setting as follows:
(1) 
where , is the task horizon, is the state, is the minimizing control, is the adversarial control, is a standard dimensional Brownian motion, represents the drift, represents the actuator dynamics, represents the adversarial control dynamics and represents the diffusion. For this system we can define the following cost function:
(2) 
where is the terminal state cost, is the running state cost, and and are positive definite control cost matrices.
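As a rough illustration of the setup above, the following sketch simulates one trajectory of a control-affine system driven by a minimizing and an adversarial input, using Euler-Maruyama integration, and accumulates one sampled cost. Everything named here is an assumption made for illustration: the function `simulate_game`, the scalar toy coefficients in the usage lines, and the sign convention in which the minimizing effort adds to the cost while the adversarial effort subtracts from it.

```python
import numpy as np

def simulate_game(x0, f, G, L, Sigma, u_fn, v_fn, q, g, R_u, R_v,
                  T=1.0, dt=0.01, seed=0):
    """Euler-Maruyama rollout of dx = (f + G u + L v) dt + Sigma dw.

    f, G, L, Sigma, u_fn, v_fn and q are callables of (x, t); g is a callable of x.
    Returns the state path and one sampled cost.
    Assumed cost convention:
        g(x_T) + sum_k (q + 0.5 u'R_u u - 0.5 v'R_v v) dt
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    cost = 0.0
    for k in range(n_steps):
        t = k * dt
        u, v = u_fn(x, t), v_fn(x, t)
        S = Sigma(x, t)
        # Brownian increment with variance dt in each noise channel
        dw = rng.normal(0.0, np.sqrt(dt), size=S.shape[1])
        cost += (q(x, t) + 0.5 * u @ R_u @ u - 0.5 * v @ R_v @ v) * dt
        x = x + (f(x, t) + G(x, t) @ u + L(x, t) @ v) * dt + S @ dw
        path.append(x.copy())
    cost += g(x)
    return np.array(path), float(cost)

# Toy scalar example (hypothetical coefficients): stable drift, linear
# feedback for the minimizer, and an inactive adversary.
I1 = np.eye(1)
path, cost = simulate_game(
    x0=[1.0],
    f=lambda x, t: -x,
    G=lambda x, t: I1,
    L=lambda x, t: I1,
    Sigma=lambda x, t: 0.1 * I1,
    u_fn=lambda x, t: -x,
    v_fn=lambda x, t: np.zeros(1),
    q=lambda x, t: float(x @ x),
    g=lambda x: float(x @ x),
    R_u=I1, R_v=I1)
print(path.shape, cost)
```

Replacing `v_fn` with a non-zero policy shows how an adversary that pushes along the same channels as the noise changes both the path and the sampled cost.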
The min-max stochastic control problem is formulated as follows:
(3) 
where the minimizing control’s goal is to reduce the cost under all admissible non-anticipating strategies, while the adversarial control maximizes the cost under all admissible non-anticipating strategies.
The HJI equation for this problem is:
(4) 
The terms inside the infimum and supremum operations are collectively called the Hamiltonian. The optimal minimizing and adversarial controls are those for which the gradient of the Hamiltonian vanishes, and they take the following form:
(5) 
Substitution of the expressions above into the HJI equation results in:
(6) 
Note that we will drop functional dependence in all PDEs for notational compactness. In the following section we show the equivalence of a certain case of min-max control to risk sensitive control.
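For readers who want to trace the step from the Hamiltonian to the optimal controls and the resulting PDE, the derivation can be sketched as follows. The notation is assumed here rather than taken from the text: $V_t$, $V_x$ and $V_{xx}$ denote the time derivative, gradient and Hessian of the value function; $f$, $G$, $L$ and $\Sigma$ stand for the drift, actuator, adversarial and diffusion terms; $q$ is the running state cost; and $R_u$, $R_v$ are the two control cost matrices, with the adversarial effort entering the cost with a negative sign.

```latex
\begin{align*}
\mathcal{H} &= \tfrac{1}{2}\operatorname{tr}\!\left(V_{xx}\Sigma\Sigma^{\top}\right)
  + V_x^{\top}\left(f + G u + L v\right) + q
  + \tfrac{1}{2}u^{\top}R_u u - \tfrac{1}{2}v^{\top}R_v v, \\
\nabla_u \mathcal{H} &= G^{\top}V_x + R_u u = 0
  \;\Rightarrow\; u^{*} = -R_u^{-1}G^{\top}V_x, \\
\nabla_v \mathcal{H} &= L^{\top}V_x - R_v v = 0
  \;\Rightarrow\; v^{*} = R_v^{-1}L^{\top}V_x, \\
0 &= V_t + \tfrac{1}{2}\operatorname{tr}\!\left(V_{xx}\Sigma\Sigma^{\top}\right)
  + V_x^{\top}f + q
  + \tfrac{1}{2}V_x^{\top}\left(L R_v^{-1}L^{\top} - G R_u^{-1}G^{\top}\right)V_x .
\end{align*}
```

Because $R_u$ and $R_v$ are positive definite, this Hamiltonian is convex in $u$ and concave in $v$, so the stationary point is a minimum with respect to $u$ and a maximum with respect to $v$.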
II-B Risk Sensitive Stochastic Optimal Control
Risk sensitive stochastic optimal control [21] is essential in cases where decisions have to be made in a manner that is robust to the stochasticity of the environment. Let us consider the following performance index:
(7) 
where is the risk sensitivity. The risk sensitive stochastic optimal control problem is formulated with the following value function:
(8) 
subject to the dynamics:
(9) 
where is a small constant, and represents diffusion [22].
The HJB equation for this stochastic optimal control problem is formulated as follows:
(10) 
The optimal control can be obtained by finding the control where the gradient of the terms inside the infimum vanishes and has the form . By substituting in the optimal control and setting in (10), we get the following final form of the HJB PDE:
(11) 
Note that the above PDE is a special case of the HJI PDE (6) when and (Fig. 1). Intuitively, this means that min-max control collapses to risk sensitive control when it is solving a problem with nonzero mean noise as the adversary, and the control authority of this adversary is proportional to the risk sensitivity.
II-C FBSDE Reformulation
We now reformulate the min-max control PDE (6) in the risk sensitive case to a set of FBSDEs. Here we restate the nonlinear Feynman-Kac theorem (Theorem 2) from [16]:
Theorem 1 (Nonlinear Feynman-Kac)
In order to apply the nonlinear Feynman-Kac theorem to (6), we assume that there exist matrix-valued functions and such that and for all , satisfying the same regularity conditions. This assumption suggests that there cannot be a channel containing control input but no noise. In the risk sensitive case of min-max control, this assumption is already satisfied with and , where is an identity matrix, because adversarial control enters through the noise channels. Under this assumption, Theorem 1 can be applied to the risk sensitive case of the HJI equation (6) with
(16) 
The relationship between FBSDE (14), (15), HJI PDE (6), HJB PDE (11), and the parabolic PDE (12) is summarized in Fig. 1.
II-D Importance Sampling
The system of FBSDEs in (14) and (15) corresponds to a system whose dynamics are uncontrolled. In many cases, especially for unstable systems, it is hard or impossible to reach the target state with uncontrolled dynamics. We can address this problem by modifying the drift term in the dynamics (forward SDE) with an additional control term. Through Girsanov’s theorem [23] on change of measure, the drift term in the forward SDE (14) can be changed if the backward SDE (15) is compensated accordingly. This results in a new FBSDE system given by
(17) 
and
(18) 
for any measurable, bounded and adapted process . It is easy to verify that the PDE associated with the new system is the same as the original one (12). For the full derivation of the change of measure for FBSDEs, we refer readers to the proof of Theorem 1 in [14]. We can conveniently set for min-max control. Note that the nominal controls can be any open- or closed-loop control or a control from a previous iteration.
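The effect of the drift change can be checked numerically on a toy problem. The sketch below is not the system studied in this paper: it assumes the scalar PDE $V_t + \tfrac{1}{2}\sigma^2 V_{xx} = 0$ with terminal condition $V(x,T) = x^2$, whose solution $V(x,t) = x^2 + \sigma^2 (T-t)$ is known in closed form, and it assumes the compensated pair takes the form $dx = \sigma K\,dt + \sigma\,dw$, $dy = zK\,dt + z\,dw$ with $z = \sigma V_x$. The function name and all parameter values are hypothetical.

```python
import numpy as np

def propagate_compensated_fbsde(K, x0=1.0, sigma=0.5, T=1.0, dt=0.01,
                                n_paths=2000, seed=0):
    """Forward-Euler rollout of a drift-modified scalar FBSDE.

    Toy problem (assumed): V_t + 0.5*sigma^2*V_xx = 0, V(x, T) = x^2,
    with exact solution V(x, t) = x^2 + sigma^2 * (T - t).
    Forward SDE with extra drift:  dx = sigma*K dt + sigma dw
    Compensated backward SDE:      dy = z*K dt + z dw,  z = sigma * V_x
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    x = np.full(n_paths, float(x0))
    y = np.full(n_paths, x0**2 + sigma**2 * T)  # exact V(x0, 0)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        z = sigma * 2.0 * x                     # sigma * V_x with the exact gradient
        y = y + z * K * dt + z * dw             # compensated value propagation
        x = x + sigma * K * dt + sigma * dw     # drifted forward dynamics
    return x, y

# The terminal condition y_T = g(x_T) = x_T^2 should hold (up to
# discretization error) no matter which constant drift K is used.
for K in (0.0, 1.0, -2.0):
    x_T, y_T = propagate_compensated_fbsde(K)
    gap = np.mean(np.abs(y_T - x_T**2))
    print(f"K = {K:+.1f}: mean x_T = {x_T.mean():.3f}, mean |y_T - g(x_T)| = {gap:.4f}")
```

The drift moves where the samples end up, while the compensation term keeps the propagated value consistent with the same PDE solution, which is the property that makes sampling around a nominal control possible.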
Table I: Total state variance of all test cases.
                         Pendulum                  Quadcopter
                         Low Noise    High Noise   Low Noise    High Noise
Baseline                 5.3          149.5        3.2          78.6
RS                       3.9          134.2        2.7          69.6
Variance Reduction (%)   26           10           16           11
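The last row of the table can be reproduced from the two rows above it, assuming (as the numbers suggest) that it reports the relative decrease of RS with respect to Baseline, rounded to the nearest percent. The dictionary keys below are hypothetical labels for the four test cases.

```python
# Total state variance values copied from the table.
baseline = {"pendulum_low": 5.3, "pendulum_high": 149.5,
            "quadcopter_low": 3.2, "quadcopter_high": 78.6}
rs = {"pendulum_low": 3.9, "pendulum_high": 134.2,
      "quadcopter_low": 2.7, "quadcopter_high": 69.6}

# Relative variance reduction in percent: 100 * (baseline - rs) / baseline.
reduction = {case: 100.0 * (baseline[case] - rs[case]) / baseline[case]
             for case in baseline}

for case, value in reduction.items():
    print(f"{case}: {value:.2f}% (rounded: {round(value)}%)")
```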
II-E Forward Sampling of BSDE
The compensated BSDE (18) needs to satisfy a terminal condition, meaning its solution needs to be propagated backward in time, yet the filtration evolves forward in time. This poses a challenge for sampling-based methods to solve the system of FBSDEs. One solution is to approximate the conditional probability of the process and back-propagate the expected value. This approach lacks scalability due to the inevitable compounding of approximation errors that are accumulated at every time step during regression.
This problem can be alleviated with DL. Using a deep recurrent network, we can initialize the value function and its gradient at the initial time and treat the initializations as trainable network parameters. This allows the BSDE to be propagated forward in time along with the FSDE. At the final time, the terminal condition can be compared against the propagated value in the loss function to update the initializations as well as the network parameters. Compared to the conditional probability approximation scheme, the DL approach has the additional advantage of not accumulating errors at every time step, since the recurrent network at each time step contributes to a common goal of predicting the target terminal condition and thus prediction errors are jointly minimized.
III Deep Min-Max FBSDE Controller
With eqs. (17) and (18), we have a system of FBSDEs that we can sample from around a nominal control trajectory. Inspired by the network architecture developed in [19], we propose a deep min-max FBSDE algorithm that solves the risk sensitive formulation of min-max control.
III-A Numerics and Network Architecture
The task horizon can be discretized as with a time discretization of . With this we can approximate the continuous variables as step functions and obtain their discretization as if .
The network architecture used in this paper is shown in Fig. 2, which is based on the LSTM network in [19] with min-max dynamics and value function dynamics incorporated. LSTM is a natural choice of network here since it is designed to effectively deal with the vanishing gradient problem in recurrent prediction of long time series [20]. We use a two-layer LSTM network with tanh activation and Xavier initialization [24]. At every time step, the LSTM predicts the value function gradient using the current state as input. The optimal minimizing and adversarial controls are then calculated with
(19)
(20)
and fed back to the dynamics for importance sampling. Note that the adversarial control is only present during training. After the network is trained, only the optimal minimizing control is used at test time. By exposing the minimizing controller to an adversary that behaves in an optimal fashion, it becomes more robust, resulting in trajectories with smaller variances.
III-B Algorithm
The Deep Min-Max FBSDE algorithm can be found in Algorithm 1. It solves a finite time horizon control problem by approximating the gradient of the value function at every time step with an LSTM and propagating the FBSDE associated with the control problem. The gradient is computed for each trajectory in a batch, and the batch-wise computation can be done in parallel. For a given initial state condition, the algorithm randomly initializes the value function and its gradient at the initial time. These initial values are trainable, as are the LSTM parameters. During training, at every time step, control inputs are sampled around the optimal minimizing and adversarial controls and applied to the system. The discretized forward dynamics and the value function SDEs are propagated using an explicit forward Euler integration scheme. The function is calculated using (16). At the final time step, a modified loss with regularization is computed, which compares the propagated value function against the true value function calculated using the final state. For training our network, we propose a new regularized loss function, which is a convex combination of a) the difference between the target and the predicted value function, and b) the target value function itself:
(21) 
since we want the prediction to be close to the target and, at the same time, the target value function to converge to zero for the sake of optimality. Notice that this additional component in the loss function is possible only due to importance sampling. The modified drift is implemented as a connection in the computational graph between the LSTM output and the input to the forward SDE at the next time step. This allows the network parameters to influence the next state and hence the final state. The network can be trained with any Stochastic Gradient Descent (SGD) type optimizer; in our experiments, we used the Adam [25] optimizer.
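The structure of such a loss can be sketched in a few lines. This is a minimal illustration rather than the exact expression used for training: the weight `alpha`, its default value, the function name, and the choice of batch-averaged squared errors are all assumptions, and NumPy is used only to show the arithmetic (in practice the loss would be written in an automatic differentiation framework so that gradients flow through the rollout).

```python
import numpy as np

def regularized_terminal_loss(y_pred, y_target, alpha=0.9):
    """Convex combination of (a) prediction error and (b) target magnitude.

    y_pred   : value function propagated to the final time step, shape (batch,)
    y_target : terminal cost evaluated at the final state, shape (batch,)
    alpha    : weight in [0, 1]; 0.9 is an arbitrary placeholder value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    y_pred = np.asarray(y_pred, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    fit = np.mean((y_target - y_pred) ** 2)   # (a) target vs. prediction
    reg = np.mean(y_target ** 2)              # (b) target value itself
    return alpha * fit + (1.0 - alpha) * reg

# Small worked example with a batch of two trajectories.
loss = regularized_terminal_loss([1.0, 2.0], [1.5, 2.0], alpha=0.5)
print(loss)
```

With `alpha = 1` the second term disappears and the loss reduces to a plain terminal mismatch; smaller values put more weight on driving the terminal cost itself toward zero.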
IV Experiments
The algorithm is implemented on a pendulum and a quadcopter system in simulation. The task for the two systems is to reach a target state. The trained networks are tested on 128 trajectories. The time discretization is 0.02 seconds across all cases. We compare the algorithm proposed in this paper with the one in [19], where the standard optimal control problem is considered, in two different noise conditions. We will use “RS” to denote the algorithm in this work and “Baseline” for the algorithm that we are comparing against. All experiments were done in TensorFlow [26] on an Intel i7-4820K CPU.
In all trajectory plots, the solid line denotes the mean trajectory in the low noise condition, the dashed line denotes the mean trajectory in the high noise condition, and the red dashed line denotes the target state. In addition, the 4 conditions are denoted by different colors, with blue for RS in the low noise condition, green for RS in the high noise condition, orange for Baseline in the low noise condition, and magenta for Baseline in the high noise condition. The shaded region of each color denotes the 95% confidence region.
IV-A Pendulum
For the pendulum system, the algorithm was implemented to complete a swing-up task with a task horizon of 1.5 seconds. The two system states are the pendulum angle and the pendulum angular rate. Fig. 3 plots the pendulum states in all 4 cases (RS with low and high noise and Baseline with low and high noise). The control applied to the system is the torque (Fig. 4).
IV-B Quadcopter
The algorithm was implemented on a quadcopter system for the task of reaching a final target state from an initial position with a task horizon of 2 seconds. The initial condition is 0 across all states. The target is 1 [] upward, forward and to the right from the initial position with zero velocities and attitudes. The quadcopter dynamics used can be found in [27]. The 12 system states are composed of the position, angles, linear velocities, and angular velocities. The control inputs to the system are 4 torques, which control the rotors (Fig. 8).
IV-C Reduced Variance with Deep Min-Max FBSDE Controller
The trajectory plots (Figs. 3, 5, 6, and 7) compare the Deep Min-Max FBSDE controller against the risk neutral Deep FBSDE controller in a low noise setting and a high noise setting for both systems. From the plots we can observe that the min-max controller proposed in this work accomplishes the tasks with a similar level of performance compared to the baseline controller. Numerical comparisons of the total state variance (sum of variance in all states over the entire trajectory) of all test cases can be found in Table I. The results demonstrate a reduction of at least 10% in total state variance across all cases. It is worth noting that the high noise setting results in a smaller variance reduction benefit. By examining the substitution made between (10) and (11) in the risk sensitive control derivation, we can see that increasing the noise level is in some sense equivalent to increasing the risk sensitivity parameter. This naturally reduces the effect of the risk sensitive controller, as shown in the next section.
IV-D Variance vs. Risk Sensitivity
We also investigated the relationship between total state variance and risk sensitivity in the two systems. Fig. 9 plots the total state variance for different values of the risk sensitivity parameter while also keeping track of task completion. In the scatter plots of variance versus risk sensitivity, blue circles denote runs with successful task completion, whereas red crosses denote runs where the task failed. Since the risk sensitivity parameter is inversely proportional to the adversarial control authority, we expect the risk sensitive controller to converge to the standard optimal controller as the parameter increases to infinity. On the other hand, as the parameter gets smaller, the adversarial control will eventually dominate the minimizing control and cause task failure. This is reflected in the plots, where the minimizing controller starts to fail when the parameter is too low. It is worth noting that the failure threshold increases as the system gets more complex and higher dimensional. Although the variance starts to increase with the parameter in the Quadcopter plot, the convergence to the standard optimal controller is harder to observe as we only explore a limited range of values.
V Conclusions
In this paper, we proposed the Deep Min-Max FBSDE Control algorithm, based on the risk sensitive case of stochastic game-theoretic optimal control theory. Utilizing prior work on importance sampling of FBSDEs and the efficiency of the LSTM network in predicting long time series, the algorithm is capable of solving stochastic game-theoretic control problems for nonlinear systems with control-affine dynamics. Comparison of this algorithm against the standard stochastic optimal control formulation suggests that by considering an adversarial control in the form of noise-related risk, the controller outputs trajectories with lower variance. Our algorithm scales in terms of the number of system states and system complexity for the min-max control problem, while previous works did not. In future work, we would like to explore different network architectures to reduce the training time.
References
 [1] R. Isaacs. Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization. New York: Wiley, 1965.
 [2] H. Kushner. Numerical approximations for stochastic differential games. SIAM J. Control Optim., 41:457–486, 2002.
 [3] J. Morimoto, G. Zeglin, and C. Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, pages 1927–1932, October 27–31, 2003.
 [4] J. Morimoto and C. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. Advances in Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Canada, December 9–14, 2002.
 [5] W. Sun, E. A. Theodorou, and P. Tsiotras. Game-theoretic continuous time differential dynamic programming. American Control Conference, Chicago, IL, pages 5593–5598, July 1–3, 2015.
 [6] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 11:P11011, 2005.
 [7] W.H. Fleming. Exit probabilities and optimal stochastic control. Applied Math. Optim, 9:329–346, 1971.
 [8] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Applications of mathematics. Springer, New York, 2nd edition, 2006.
 [9] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (Graduate Texts in Mathematics). Springer, 2nd edition, August 1991.
 [10] Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 1999.
 [11] Etienne Pardoux and Aurel Rascanu. Stochastic Differential Equations, Backward SDEs, Partial Differential Equations, volume 69. 07 2014.
 [12] I. Exarchos. Stochastic Optimal Control: A Forward and Backward Sampling Approach. PhD thesis, Georgia Institute of Technology, 2017.
 [13] I. Exarchos and E. A. Theodorou. Learning optimal control via forward and backward stochastic differential equations. In American Control Conference (ACC), 2016, pages 2155–2161. IEEE, 2016.
 [14] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159–165, 2018.
 [15] I. Exarchos, E. A. Theodorou, and P. Tsiotras. Stochastic optimal control via forward and backward sampling. Systems & Control Letters, 118:101–108, 2018.
 [16] Ioannis Exarchos, Evangelos A Theodorou, and Panagiotis Tsiotras. Game-theoretic and risk-sensitive stochastic optimal control via forward and backward stochastic differential equations. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6154–6160. IEEE, 2016.
 [17] I. Exarchos, E. A. Theodorou, and P. Tsiotras. Stochastic Differential Games – A Sampling Approach via FBSDEs. Dynamic Games and Applications, pages 1–20, 2018.
 [18] Jiequn Han et al. Deep Learning Approximation for Stochastic Control Problems. arXiv preprint arXiv:1611.07422, 2016.
 [19] Marcus Pereira, Ziyi Wang, Ioannis Exarchos, and Evangelos A Theodorou. Neural network architectures for stochastic control using the nonlinear Feynman-Kac lemma. arXiv preprint arXiv:1902.03986, 2019.
 [20] Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. pages 473–479, 1997.
 [21] T. Basar and P. Bernhard. H-infinity Optimal Control and Related Minimax Design. Birkhauser, Boston, 1995.
 [22] Wendell H. Fleming and William M. McEneaney. Risk-sensitive control on an infinite time horizon. SIAM Journal of Control Optimization, 33(6):1881–1915, 1995.
 [23] Igor Vladimirovich Girsanov. On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Theory of Probability & Its Applications, 5(3):285–301, 1960.

 [24] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [25] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014.
 [26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [27] Maki K Habib, Wahied Gharieb Ali Abdelaal, Mohamed Shawky Saad, et al. Dynamic modeling and control of a quadrotor using linear and nonlinear approaches. 2014.