1 Introduction
Stochastic optimal control is central to decision making under uncertainty, with a long history and extensive prior work in both theory and algorithms [1, 2]. One of the most celebrated formulations of stochastic control is for linear dynamics and additive noise, the so-called Linear Quadratic Gaussian (LQG) case [1]. For stochastic systems that are nonlinear in the state and affine in control, stochastic control results in the Hamilton-Jacobi-Bellman (HJB) equation, a backward nonlinear partial differential equation (PDE). Solving the HJB equation for high-dimensional systems is in general a challenging task.
Different algorithms have been derived to address stochastic control problems and solve the HJB equation. They can be classified into algorithms that rely on linearization and algorithms that rely on sampling. Linearization-based algorithms require linearization (iLQG) or quadratic approximation (Stochastic Differential Dynamic Programming) of the dynamics, together with a quadratic approximation of the cost function [3, 4]. Applying these algorithms is not straightforward and requires special linearization schemes, especially for control- and/or state-dependent noise. It is also worth mentioning that the convergence properties of these algorithms have not been investigated and remain an open question. Sampling-based methods include the Markov Chain Monte Carlo (MCMC) approximation of the HJB equation [5, 6]. MCMC-based algorithms rely on backward propagation of the value function on a pre-specified grid. Recently, researchers have incorporated tensor-train decomposition techniques to scale these methods [7]. However, these techniques have been applied only to special classes of systems and stochastic control problem formulations and have demonstrated limited applicability so far.
Alternative sampling-based methodologies rely on the probabilistic representation of backward Partial Differential Equations and on the generalization of the so-called linear Feynman-Kac lemma [8] to its nonlinear version [9]. Application of the linear Feynman-Kac lemma requires an exponential transformation of the value function and certain assumptions related to control authority and the variance of the noise. Stochastic control is then computed using forward sampling of stochastic differential equations [10, 11, 12, 13]. The nonlinear version of the Feynman-Kac lemma overcomes these limitations. However, it requires a more sophisticated numerical scheme than forward sampling alone, one that relies on the theory of Forward-Backward Stochastic Differential Equations (FBSDEs) and their connection to backward PDEs. The FBSDE formulation is very general and has been utilized in many problem formulations such as stochastic control [14, 15, 16], min-max and risk-sensitive control [17], and control of systems with control multiplicative noise [18]. The major limitation of FBSDE-based methods for stochastic control is the compounding of errors from the least-squares approximation used at every time step of the Backward Stochastic Differential Equation (BSDE).
Recent efforts in the area of Deep Learning for solving nonlinear PDEs have shown encouraging results in terms of scalability and numerical efficiency. A Deep Learning-based algorithm was introduced in [19] to approximate the solution of nonlinear parabolic PDEs through their connection to first-order FBSDEs. That framework relies on propagation of dynamics driven by white noise, which proves successful for simple linear systems but suffers from insufficient exploration for nonlinear dynamics. Thus, the approach is not directly applicable to many Stochastic Optimal Control (SOC) problems. One solution to this problem was proposed in [14] through the application of importance sampling, which modifies the drift terms in the FBSDE to allow for sufficient exploration through the controlled forward dynamics.
In this paper we develop a novel Deep Neural Network (DNN) architecture for PDEs that are fully nonlinear. Fully nonlinear PDEs appear in stochastic control problems in which noise is additive and control multiplicative. Such problem formulations are important in biomechanics and computational neuroscience, autonomous systems, and finance [20, 21, 22, 23, 24]. Prior work on stochastic control of such systems considers linear dynamics and quadratic cost functions. Attempts to generalize these linear methods to stochastic nonlinear dynamics with control multiplicative noise are only preliminary and require special treatment of how the underlying stochastic dynamics are propagated forward and linearized [25].
Given the prior work in the core areas of stochastic control and deep learning, below we summarize the contributions of our work:

We derive an importance sampling scheme for stochastic control of systems with control multiplicative noise. The derivation is based on a generalization of the nonlinear Feynman-Kac lemma that utilizes second-order FBSDEs (2FBSDEs). These 2FBSDEs provide a probabilistic representation of the solution of the fully nonlinear PDEs considered in this work and are essential to the development of sampling-based algorithms.

We design a novel DNN architecture to represent and solve 2FBSDEs. The neural network architecture consists of Fully Connected (FC) and Long Short-Term Memory (LSTM) layers. The resulting Deep 2FBSDE network can be used to solve fully nonlinear PDEs in high dimensions by incorporating the importance sampling step into the underlying network architecture.

We demonstrate the applicability and correctness of the proposed algorithm in four examples. The proposed algorithm recovers analytical controls in the case of linear dynamics, and it is also able to successfully control nonlinear dynamics with control-multiplicative and additive sources of uncertainty. Our simulations show the robustness of the Deep 2FBSDE algorithm and demonstrate the importance of considering the nature of the stochastic disturbances in both the problem formulation and the neural network representation.
The rest of the paper is organized as follows: in Section 2 we discuss the problem formulation. In Section 3 we provide the 2FBSDE formulation. The Deep 2FBSDE algorithm is introduced in Section 4. Then we demonstrate the simulation results in Section 5. Finally we conclude the paper in Section 6 with discussion and future directions.
2 Stochastic Control
In this section we provide definitions essential for the development of our proposed algorithm and then present the problem formulation.
2.1 Definitions
We first introduce stochastic dynamical systems whose drift term is nonlinear in the state but affine in the controls, and whose stochasticity comprises nonlinear functions of the state and affine control multiplicative matrix coefficients. For a fixed finite time horizon , let be a Brownian motion in
on a filtered probability space
where , and the components of are mutually independent one-dimensional standard Brownian motions. Let be the state variable vector, and
be the control variable vector in the set of all admissible controls . We now assume that functions , , and satisfy certain Lipschitz and growth conditions (refer to Assumption 1 in the supplementary material). Given this assumption, it is known that for every initial condition , there exists a unique solution to the Forward Stochastic Differential Equation (FSDE)
(1) 
where , , and
represent the drift, actuator dynamics, diffusion and standard deviation of the control multiplicative noise term respectively.
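As a rough illustration of the kind of dynamics described above, the following sketch simulates one Euler-Maruyama rollout of an SDE with additive and control multiplicative noise. The specific form `dx = (f + G u) dt + Sigma dv + sigma G u dw`, with a scalar Brownian motion `w` scaling the control channel, is an assumption made for this example, as are all function and parameter names and the numerical values; it is not taken verbatim from (1).

```python
import numpy as np


def simulate_fsde(f, G, Sigma, sigma, policy, x0, T, n_steps, rng):
    """Euler-Maruyama rollout of an assumed SDE

        dx = (f(x,t) + G(x,t) u) dt + Sigma(x,t) dv + sigma * G(x,t) u dw,

    where v is a vector Brownian motion (additive noise) and w is a scalar
    Brownian motion that multiplies the control channel.
    All names here are hypothetical and chosen for illustration only.
    """
    dt = T / n_steps
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for k in range(n_steps):
        t = k * dt
        u = policy(x, t)                      # control, shape (n_u,)
        S = Sigma(x, t)                       # diffusion, shape (n_x, n_v)
        Gu = G(x, t) @ u                      # actuated direction, shape (n_x,)
        dv = rng.normal(scale=np.sqrt(dt), size=S.shape[1])
        dw = rng.normal(scale=np.sqrt(dt))    # scalar control-dependent noise
        x = x + (f(x, t) + Gu) * dt + S @ dv + sigma * Gu * dw
        traj.append(x.copy())
    return np.stack(traj)


# Example: scalar linear system driven by a proportional feedback policy.
rng = np.random.default_rng(0)
traj = simulate_fsde(
    f=lambda x, t: 0.2 * x,
    G=lambda x, t: np.array([[1.0]]),
    Sigma=lambda x, t: np.array([[0.1]]),
    sigma=0.5,
    policy=lambda x, t: -2.0 * x,
    x0=[1.0],
    T=1.0,
    n_steps=100,
    rng=rng,
)
```

Because the multiplicative term scales with `G u`, larger control inputs inject more noise into the actuated states, which is the effect the rest of the paper is concerned with.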
2.2 Problem Statement and HJB PDE
For the controlled diffusion process with control multiplicative noise above, we formulate the SOC problem as minimizing the following expected cost
(2) 
where is the running cost and is the terminal state cost. The expectation is taken with respect to the probability measure over the space of trajectories induced by the controlled stochastic dynamics. We can define the value function as
(3) 
Under the condition that the value function is in , we can apply the stochastic Bellman's principle [26] and Ito's differentiation rule [27] to find that it satisfies
(4) 
This equation is commonly known as the HJB equation, and its derivation is included in the supplementary materials. The explicit dependence on independent variables in the PDE above and all PDEs henceforth is omitted for the sake of conciseness, but will be maintained for their corresponding Stochastic Differential Equations for clarity. Here we consider a nonlinear running state cost and a quadratic control cost , where the control weight is a positive definite matrix of size . The optimal control can be found by taking the gradient of the terms inside the infimum and setting it to zero to obtain
(5) 
where is assumed to be invertible. Substituting the optimal control back into (4) we can drop the infimum and get the final form of the HJB PDE
(6) 
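The minimization step described above can be sketched numerically. The example assumes that the control-dependent part of the HJB integrand is `0.5 u^T R u + V_x^T G u + 0.5 sigma^2 u^T G^T V_xx G u`, which is what a quadratic control cost and a scalar control multiplicative noise of standard deviation `sigma` would give; the symbols `G`, `R`, `sigma`, `V_x` and `V_xx` and the numbers used are assumptions for illustration, not a transcription of (5).

```python
import numpy as np


def optimal_control(V_x, V_xx, G, R, sigma):
    """Minimizer of an assumed control-dependent part of the HJB integrand,

        0.5 u^T R u + V_x^T G u + 0.5 sigma^2 u^T G^T V_xx G u.

    Setting the gradient with respect to u to zero gives
        u* = -(R + sigma^2 G^T V_xx G)^{-1} G^T V_x,
    which requires the bracketed matrix to be invertible.
    """
    R_hat = R + sigma**2 * G.T @ V_xx @ G
    return -np.linalg.solve(R_hat, G.T @ V_x)


# Small check on a 2-state, 1-input example with made-up numbers.
G = np.array([[0.0], [1.0]])
R = np.array([[2.0]])
V_x = np.array([1.0, 3.0])
V_xx = np.array([[1.0, 0.0], [0.0, 4.0]])
u_star = optimal_control(V_x, V_xx, G, R, sigma=0.5)
```

Under these assumptions, setting `sigma = 0` recovers the familiar `-R^{-1} G^T V_x`, while a positive definite Hessian term increases the effective control weight and shrinks the control.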
3 FBSDE Formulation
The theory of BSDEs has been used to solve stochastic optimal control problems by establishing a representation of the solution to parabolic PDEs in a set of FBSDEs. Here we choose the forward process to be governed by the FSDE
(7) 
with initial condition . To obtain the corresponding second-order BSDE (2BSDE), we can apply the nonlinear version of the Feynman-Kac lemma:
Lemma 1.
[Nonlinear Feynman-Kac]
Let and be a quadruple of progressively measurable processes in respectively, then we call a solution of the 2BSDE corresponding to if
(8) 
where
(9) 
The HJB operator is defined as
(10) 
With this lemma (proof in supplementary material), we transform the original problem of solving a nonlinear PDE into solving an equivalent set of 2FBSDEs.
3.1 Importance Sampling
The 2FBSDEs (7), (8) correspond to dynamic processes without control. This is very limiting since in many cases the target state simply cannot be reached by the uncontrolled stochastic system dynamics. We can eliminate this problem by modifying the forward SDE. In particular, we can add a control term to guide the dynamical process to the target state. The change in the forward SDE has to be compensated for accordingly in the 2BSDEs. This is known as importance sampling for 2FBSDEs, which we formalize in the following theorem:
Theorem 1.
3.2 Forward Sampling of 2BSDEs
Using (12), we can forward sample the FSDE. The 2BSDEs, on the other hand, need to satisfy a terminal condition and therefore have to be propagated backwards in time. In [28], this is achieved by propagating the approximate conditional expectation of the two processes backwards in time using regression. This method, however, suffers from compounding errors introduced by least-squares estimation at every time step. The Deep Learning (DL) based approach, first introduced in [19], mitigates this problem by using the terminal condition as the prediction target for a forward-propagated BSDE. This is enabled by randomly initializing the initial condition and treating it as a trainable parameter of a self-supervised learning problem. In addition, the approximation error at each time step is accounted for by backpropagation during the training of the deep network. This allowed FBSDEs to be used to solve the HJB PDE for high-dimensional linear systems. However, the original scheme lacked sufficient exploration and relied purely on noise (uncontrolled dynamics) to guide learning. A more recent approach, the Deep FBSDE controller [29], utilizes importance sampling to guide exploration and has been successfully applied in simulation to systems that correspond to first-order FBSDEs. Extending this work, we propose a new framework for solving SOC problems of systems with control multiplicative noise, for which the value function solutions correspond to 2FBSDEs. We leverage importance sampling by explicitly computing and executing the optimal control with (5) at every time step, and forward propagate all three SDEs in the modified second-order FBSDEs as follows:
(13) 
where , and with initial and terminal conditions of , and .
4 Deep 2FBSDE Controller
In this section, we introduce a new deep network architecture called the Deep 2FBSDE Controller and present a training algorithm to solve SOC problems with control multiplicative noise.
Time discretization: First, we introduce a simple Euler time-discretization scheme. Here we overload t as both the continuous-time variable and the discrete time index, and discretize the task horizon as , where . This is also used to discretize all variables as step functions if their discrete time index t lies in the interval . We use subscript to denote the discretized variables.
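A uniform time grid and the step-function convention described above might look as follows. The horizon `T`, the number of steps `N` and the helper names are hypothetical values chosen only for this sketch.

```python
import numpy as np

T = 2.0                 # task horizon in seconds (hypothetical value)
N = 100                 # number of time steps (hypothetical value)
dt = T / N
t_grid = np.linspace(0.0, T, N + 1)   # t_0 = 0 < t_1 < ... < t_N = T


def step_index(t):
    """Index k of the interval [t_k, t_{k+1}) containing continuous time t."""
    return min(int(np.floor(t / dt)), N - 1)


def as_step_function(values, t):
    """Evaluate a discretized variable (one value per step) at time t."""
    return values[step_index(t)]
```

Clamping the index at `N - 1` simply assigns the final instant `t = T` to the last interval.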
Network architecture: Inspired by the LSTM-based recurrent neural network architecture introduced in [29], we propose the network in fig. 1, tailored for the 2FBSDEs given by (12). Instead of predicting the gradient of the value function at every time step, the output of the LSTM is used to predict the Hessian of the value function and using two separate FC layers. Of these, is used to compute the control term , introduced in (12), for importance sampling. This in turn is used to propagate the stochastic dynamics and compensate for the added control in the two forward-propagated backward processes and . Both and are used to propagate , which is then used to propagate . This is repeated until the end of the time horizon, as shown in fig. 1, which also represents an unrolled computational graph of the recurrent network. Finally, the predicted values of , and are compared with their targets computed using
at the end of the horizon, to compute a loss function for backpropagation.
Algorithm: Algorithm 1 details the training procedure of the Deep 2FBSDE Controller. Note that superscripts indicate the batch index for variables and the iteration number for trainable parameters. The value function and its gradient (at time index ) are randomly initialized and trainable. Functions denote the forward propagation equations of standard LSTM layers [30] and FC layers with tanh and linear activations respectively, and represents a discretized version of (13) using the Euler time discretization scheme. The loss function () is computed using the value function, its gradient and Hessian at the final time index, as well as a regularization term. A detailed justification of each loss term is included in the supplementary materials. The network is trained using any variant of Stochastic Gradient Descent (SGD), such as Adam [31], until convergence. The trained network will return the optimal control trajectory starting from the initial state.
5 Simulation Results
In this section we demonstrate the capability of the Deep 2FBSDE Controller in two ways: 1) comparison with the analytical solutions for a scalar linear system with additive and control multiplicative stochasticities; 2) control tasks for three nonlinear systems (pendulum, cartpole and quadcopter) in simulation, with control multiplicative noise artificially introduced in the actuation. The results are compared to [29], wherein the effect of control multiplicative noise is ignored by considering only first-order FBSDEs.
All the comparison plots are evaluated over 128 test trials. We used time discretization seconds for the linear system case and seconds in other simulations. In all plots, the solid line represents the mean trajectory, and the shaded region represents the 68% confidence region. During comparison, we use green for 2FBSDE, blue for first order FBSDE and analytical solution, and we use red dotted line for target state.
Linear System: We consider a scalar linear time invariant system
(14) 
along with quadratic running and terminal state cost
(15) 
The dynamics and cost function parameters are set as . The task is to drive the state to 0. We assume that the value function (6) has the form , where . The values of and are obtained by solving the corresponding Riccati equations, which can be found in [1], using ODE solvers, and are used to compute the optimal control (5) at every time step. The solution obtained from the 2FBSDE controller is compared against the analytical solution in fig. 2. The resulting trajectories have matching means and comparable variances, which supports the effectiveness of the controller on linear systems.
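To make the analytical baseline concrete, the sketch below integrates scalar Riccati-type ODEs backward in time with a hand-written RK4 scheme. It assumes dynamics `dx = (A x + B u) dt + sigma B u dw + C dv`, running cost `0.5 Q x^2 + 0.5 R u^2`, terminal cost `0.5 Q_T x^2`, and a value function of the form `0.5 P(t) x^2 + S(t)`; substituting this ansatz into the HJB equation under those assumptions gives `dP/dt = -Q - 2 A P + B^2 P^2 / (R + sigma^2 B^2 P)` and `dS/dt = -0.5 C^2 P`. The symbol names and the model form are assumptions for this example and may differ from (14) and (15).

```python
import numpy as np


def solve_scalar_riccati(A, B, C, sigma, Q, R, Q_T, T, n_steps):
    """Integrate the assumed scalar Riccati ODEs backward from t = T with RK4.

    Assumed model: dx = (A x + B u) dt + sigma B u dw + C dv,
    running cost 0.5 Q x^2 + 0.5 R u^2, terminal cost 0.5 Q_T x^2,
    value function ansatz V(x, t) = 0.5 P(t) x^2 + S(t).
    Returns arrays P, S on the uniform grid t_0 = 0, ..., t_N = T.
    """
    def rhs(y):
        # Time derivatives (dP/dt, dS/dt) implied by the ansatz.
        P, _ = y
        dP = -Q - 2.0 * A * P + (B * P) ** 2 / (R + sigma**2 * B**2 * P)
        dS = -0.5 * C**2 * P
        return np.array([dP, dS])

    h = -T / n_steps                       # negative step: march from T to 0
    y = np.array([Q_T, 0.0])               # terminal conditions P(T), S(T)
    out = [y]
    for _ in range(n_steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out.append(y)
    P, S = np.array(out[::-1]).T           # reorder so index 0 is t = 0
    return P, S


def lq_control(x, P_t, B, R, sigma):
    """Feedback u* = -B P x / (R + sigma^2 B^2 P) under the same assumptions."""
    return -B * P_t * x / (R + sigma**2 * B**2 * P_t)
```

With `sigma = 0` the first ODE reduces to the standard scalar LQR Riccati equation, which gives a convenient sanity check for any implementation.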
Pendulum: We tested the controller on pendulum dynamics (see supplementary) for a swing-up task with a time horizon of seconds. The network was trained by sampling a batch of trajectories each for iterations with a custom scheduled learning rate. The performance of the controller, as seen in fig. 3, very closely resembles that of the first-order Deep FBSDE controller, which illustrates that the effect of control multiplicative noise is negligible for low-dimensional systems such as this one and can therefore be ignored.
Cartpole: We applied the controller to cartpole dynamics (see supplementary) for a swing-up task with a time horizon of s. Similar to the pendulum, the network was trained using a batch size of each for iterations with a custom scheduled learning rate. However, in contrast to the simple pendulum, the increase in the number of states and the underactuated dynamics cause the performance of the first-order FBSDE controller to deteriorate, as the effect of control multiplicative noise becomes significant. As seen in fig. 3, by taking control multiplicative noise into account the Deep 2FBSDE Controller is able to achieve the task objectives with much lower variance.
Quadcopter: The controller was also tested on quadcopter dynamics (see supplementary) for a task of reaching and hovering at a target position with a time horizon of seconds. The network was trained with a batch size of for iterations. Only linear and angular states are included in fig. 4 since they most directly reflect the task performance (velocity plots are included in the supplementary materials). The figure demonstrates superior performance of the 2FBSDE controller over the FBSDE controller. In particular, the mean trajectories from the 2FBSDE controller are smoother and have smaller residual error. Additionally, the high variance in the FBSDE trajectories results from large actuator inputs when exposed to control multiplicative noise at test time, while the 2FBSDE controller maintains small trajectory variances by limiting the control inputs.
Discussion: In our simulations we observed the cart position remaining regulated although the corresponding state cost weight was set to zero. A plausible explanation is that the cart is a directly actuated state of the system, so ignoring control multiplicative noise affects the cart position and cart velocity the most. In the case of the quadcopter we observe a similar effect in the z-directional position and velocities. This is due both to the task requiring the quadcopter to accelerate vertically upwards and decelerate to hover, and to the thrust being the sum of the four motor torques (see supplementary for control profiles). Additionally, for the above nonlinear systems, the choice of network initialization was crucial to convergence. A discussion on initialization strategy and training performance is included in the supplementary materials.
6 Conclusions
In this paper, we proposed the Deep 2FBSDE control algorithm to solve Stochastic Optimal Control problems for systems with both additive and multiplicative noise. The algorithm relies on the 2FBSDE importance sampling theorem introduced in this paper for sufficient exploration. The effectiveness of the algorithm is demonstrated by comparing against the analytical solution for a linear system, and against the first-order FBSDE controller on pendulum, cartpole and quadcopter systems in simulation. Potential future directions of this work include: 1) application to tendon-driven rigid body models of biomechanical systems; 2) handling of control constraints while preserving optimality, a crucial requirement for actuation in robotic systems; 3) application to financial models with control multiplicative noise; and 4) theoretical analysis of error bounds on value function approximation.
References
 [1] R. F. Stengel. Optimal control and estimation. Dover books on advanced mathematics. Dover Publications, New York, 1994.
 [2] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Applications of mathematics. Springer, New York, 2nd edition, 2006.
 [3] E. Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pages 300–306, vol. 1, June 2005.
 [4] E.A. Theodorou, Y. Tassa, and E. Todorov. Stochastic differential dynamic programming. In American Control Conference, pages 1125–1132, 2010.
 [5] H. J. Kushner and P. G. Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time. Springer-Verlag, London, UK, 1992.
 [6] V. A. Huynh, S. Karaman, and E. Frazzoli. An incremental sampling-based algorithm for stochastic optimal control. I. J. Robotic Res., 35(4):305–333, 2016.
 [7] Alex Gorodetsky, Sertac Karaman, and Youssef Marzouk. Efficient high-dimensional stochastic optimal motion control using tensor-train decomposition. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
 [8] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (Graduate Texts in Mathematics). Springer, 2nd edition, August 1991.
 [9] Etienne Pardoux and Aurel Rascanu. Stochastic Differential Equations, Backward SDEs, Partial Differential Equations, volume 69. 07 2014.
 [10] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Phys Rev Lett, 95:200201, 2005.
 [11] E. Todorov. Linearly-solvable Markov decision problems. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Vancouver, BC, 2007. Cambridge, MA: MIT Press.
 [12] E. Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 106(28):11478–11483, 2009.

 [13] E.A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral approach to reinforcement learning. Journal of Machine Learning Research, (11):3137–3181, 2010.
 [14] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159–165, 2018.
 [15] I. Exarchos and E. A. Theodorou. Learning optimal control via forward and backward stochastic differential equations. In American Control Conference (ACC), 2016, pages 2155–2161. IEEE, 2016.
 [16] I. Exarchos, E. A. Theodorou, and P. Tsiotras. Stochastic optimal control via forward and backward sampling. Systems & Control Letters, 118:101–108, 2018.
 [17] Ioannis Exarchos, Evangelos Theodorou, and Panagiotis Tsiotras. Stochastic differential games: A sampling approach via fbsdes. Dynamic Games and Applications, 9(2):486–505, Jun 2019.
 [18] K. S. Bakshi, D. D. Fan, and E. A. Theodorou. Stochastic control of systems with control multiplicative noise using second order fbsdes. In 2017 American Control Conference (ACC), pages 424–431, May 2017.
 [19] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving highdimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
 [20] Emanuel Todorov. Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Comput., 17:1084–1108, May 2005.
 [21] Djordje Mitrovic, Stefan Klanke, Rieko Osu, Mitsuo Kawato, and Sethu Vijayakumar. A computational model of limb impedance control based on principles of internal model uncertainty. PLOS ONE, 5(10):1–11, 10 2010.
 [22] J. A. Primbs. Portfolio optimization applications of stochastic receding horizon control. In 2007 American Control Conference, pages 1811–1816, July 2007.
 [23] P. McLane. Optimal stochastic control of linear systems with state- and control-dependent disturbances. IEEE Transactions on Automatic Control, 16(6):793–798, December 1971.
 [24] Y. Phillis. Controller design of systems with multiplicative noise. IEEE Transactions on Automatic Control, 30(10):1017–1019, October 1985.
 [25] G. DeLa Torre and E.A. Theodorou. Stochastic variational integrators for system propagation and linearization. Sept 2015.
 [26] Richard Bellman. Dynamic Programming. Dover Publications, March 2003.
 [27] Kiyosi Itô. 109. stochastic integral. Proceedings of the Imperial Academy, 20(8):519–524, 1944.
 [28] K. S. Bakshi, D. D. Fan, and E. A. Theodorou. Stochastic control of systems with control multiplicative noise using second order fbsdes. In 2017 American Control Conference (ACC), pages 424–431, May 2017.
 [29] Marcus A. Pereira, Ziyi Wang, Ioannis Exarchos, and Evangelos A. Theodorou. Neural Network Architectures for Stochastic Control using the Nonlinear Feynman-Kac Lemma, 2019.
 [30] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [32] Maki K Habib, Wahied Gharieb Ali Abdelaal, Mohamed Shawky Saad, et al. Dynamic modeling and control of a quadrotor using linear and nonlinear approaches. 2014.

 [33] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [34] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Supplementary Materials
1 Assumptions
Assumption 1.
There exists a constant such that
(16)  
(17)  
(18) 
where , , and .
2 HJB PDE Derivation
Applying the dynamic programming principle to the value function we have
(19) 
Then, we can approximate the running cost integral with a step function and apply Ito’s lemma to obtain
(20)  
(21) 
Finally, we can cancel on both sides and bring the terms not dependent on controls outside of the infimum to get the HJB PDE (4).
3 Proof of Nonlinear Feynman-Kac lemma
Firstly, we can apply Ito’s differentiation rule to :
(22) 
Since is the value function, we can substitute in the final form of HJB PDE (6) for and the uncontrolled forward dynamics for (7) to get:
(23) 
For , we can again apply Ito’s differentiation rule:
(24) 
We can substitute the uncontrolled forward dynamics for and get:
(25) 
Note that the transpose on is dropped since it is symmetric.
4 Proof of 2FBSDE Importance Sampling theorem
To prove the 2FBSDE importance sampling theorem, we can start with equations (22) and (24). When substituting in the forward dynamics for , we can use the modified dynamics
(26) 
instead of the uncontrolled forward dynamics. The terms and will then be added to equations (23) and (25) respectively. We now show that the modified system of 2FBSDEs corresponds to the same HJB PDE (6). We first rewrite the PDE in the form of
(27) 
where correspond to both the state process in the original problem formulation (1) and the uncontrolled forward process in the 2FBSDEs (7), with and being the drift terms in the uncontrolled FSDE and first BSDE respectively. Substituting in the modified dynamics and , the additional terms cancel out in the PDE (27), resulting in the same PDE as the one corresponding to the original uncontrolled dynamics. This therefore also proves that solving the modified set of 2FBSDEs (12) is equivalent to solving the original HJB PDE (6).
5 Loss Function
The loss function used in this work builds on the loss functions used in [19] and [29]. Because the Deep 2FBSDE Controller propagates two BSDEs, the propagated gradient of the value function can be used in the loss function in addition to the propagated value function , to enforce that the network meets both terminal constraints, i.e. and respectively. Moreover, since the terminal cost function is known, its Hessian can be computed and used to enforce that the output of the layer at the terminal time index equals the target Hessian of the terminal cost function . Although this is enforced only at the terminal time index , the weights of a recurrent neural network are shared across time, so in order to predict accurately all of the prior predictions have to be adjusted accordingly, thereby representing some form of propagation of the Hessian of the value function.
where,
Additionally, importance sampling introduces an additional gradient path through the system dynamics at every time step. Although this makes training difficult (vanishing gradient problem), it allows the weights to influence what the next state (i.e. at the next time index) will be. As a result, the weights can control the state at the end time index and hence the target for the neural network prediction itself. This can be added to the loss function, which, in addition to importance sampling, helps accelerate the minimization of the terminal cost to achieve the task objectives. So, the above loss function becomes,
Finally, for high-dimensional systems, we observed that the training process was very sensitive to weight initialization. Unlike [29], the 2FBSDE network propagates the gradient of the value function rather than predicting it at every time step. Because the initial weights of the network are random and the number of trainable parameters in the layer is , poor initial predictions lead to a snowballing effect, causing and hence the loss to diverge. This makes training very unstable and sometimes impossible. Therefore, to regulate the growth of error in , in addition to proper initialization, we added a term to the loss function to penalize the growth of the gradient of the value function. We define the loss term as follows,
We used for our quadcopter experiments. Finally, the entire loss function is as follows,
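The loss terms described in this section could be combined along the following lines. This is only a sketch: the squared-error form of each term, the weights `a1` to `a5`, and all function and argument names are assumptions made for illustration rather than the exact loss used in the experiments.

```python
import numpy as np


def deep_2fbsde_loss(V_T, Vx_T, Vxx_T, phi_T, phi_x_T, phi_xx_T, Vx_traj,
                     weights=(1.0, 1.0, 1.0, 1.0, 1e-3)):
    """Illustrative batch loss combining the terms discussed in this section.

    Shapes (B = batch size, n = state dimension, N = number of time steps):
      V_T, phi_T        : (B,)       propagated value vs. terminal cost
      Vx_T, phi_x_T     : (B, n)     propagated gradient vs. terminal-cost gradient
      Vxx_T, phi_xx_T   : (B, n, n)  predicted Hessian vs. terminal-cost Hessian
      Vx_traj           : (B, N, n)  propagated gradient along the trajectory
    The squared-error form and the weights are assumptions for this sketch.
    """
    a1, a2, a3, a4, a5 = weights
    value_err = np.mean((V_T - phi_T) ** 2)
    grad_err = np.mean(np.sum((Vx_T - phi_x_T) ** 2, axis=-1))
    hess_err = np.mean(np.sum((Vxx_T - phi_xx_T) ** 2, axis=(-2, -1)))
    target_cost = np.mean(phi_T ** 2)         # pushes the terminal cost itself down
    grad_growth = np.mean(np.sum(Vx_traj ** 2, axis=-1))  # penalize gradient growth
    return (a1 * value_err + a2 * grad_err + a3 * hess_err
            + a4 * target_cost + a5 * grad_growth)
```

In a real training loop these quantities would be tensors in an automatic differentiation framework so that gradients can flow through the dynamics; NumPy is used here only to keep the example self-contained.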
6 Nonlinear System Dynamics and System Parameters
6.1 Pendulum
The equations of motion for a simple pendulum are given by
The model and dynamics parameters are set as , , , , . The initial condition for the system is the pendulum vertically pointing down at 0 and stationary. The target is a swung up position, with the pendulum pointing vertically upward at an angle of and stationary.
6.2 Cartpole
The equations of motion for the cartpole are given by
The model and dynamics parameters are set as , , , , . The initial pole and cart position are 0 and 0 with no velocity, and the target state is a pole angle of and zeros for all other states.
6.3 Quadcopter
The quadcopter dynamics used can be found in [32]. The model and dynamics parameters are set as , , , , , . Additionally, we used an exploration factor of on the additive noise during training to facilitate convergence (effectively sampling