Over the past 15 years there has been significant interest from the robotics community in developing algorithms for stochastic control of systems operating in dynamic and uncertain environments. This interest was initiated by two main developments related to theory and hardware. From a theoretical standpoint, there is now a better and, in some sense, deeper understanding of the connections between different disciplines. As an example, the connections between optimality principles in control theory and information-theoretic concepts in statistical physics are by now well understood [1, 2, 3, 4, 5]. These connections have resulted in novel algorithms that are scalable, run in real time, and can handle complex nonlinear dynamics [6, 7, 8]. On the hardware side, significant technological developments have made possible the use of high-performance computing for real-time Stochastic Optimal Control (SOC) in robotics and autonomy.
Traditionally, SOC problems are solved using Dynamic Programming (DP). Dynamic Programming requires solving a nonlinear second-order Partial Differential Equation (PDE) known as the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that the HJB equation suffers from the curse of dimensionality. One way to tackle this problem is through an exponential transformation that linearizes the HJB equation, which can then be solved with forward sampling using the linear Feynman-Kac lemma. While the linear Feynman-Kac lemma provides an exact probabilistic representation of the solution to the HJB equation, its application relies on certain assumptions relating control authority and noise. In addition, the exponential transformation of the value function reduces the discriminability between good and bad states, which makes the computation of the optimal control policy difficult.
An alternative approach to solving SOC problems is to transform the HJB equation into a system of Forward-Backward Stochastic Differential Equations (FBSDEs) using the nonlinear version of the Feynman-Kac lemma [13, 14]. This is a more general approach than the standard Path Integral control framework, in that it does not rely on any assumptions relating control authority and noise [15, 16, 17]. In addition, it is valid for general classes of stochastic processes, including jump-diffusions and infinite-dimensional stochastic processes [18, 19]. The main challenge in using the nonlinear Feynman-Kac lemma, however, is solving the backward SDE, which requires back-propagating a conditional expectation and, unlike the forward SDE, cannot be solved through sampling directly. This therefore requires numerical approximation techniques for utilization in an actual algorithm. Exarchos and Theodorou [21] addressed this by back-propagating a least-squares estimate of the conditional expectation at every time step. However, this method suffers from compounding errors of the Least Squares approximation at every time step.
Recently, the idea of using Deep Neural Networks (DNNs) and other data-driven techniques for approximating the solutions of nonlinear PDEs has been garnering significant attention. In Raissi et al., DNNs were used both for solving and for data-driven discovery of the coefficients of nonlinear PDEs popular in the physics literature, such as the Schrödinger equation, the Allen-Cahn equation, and the Navier-Stokes and Burgers equations. They demonstrated that their DNN-based approach can surpass the performance of other data-driven methods, such as the sparse linear regression proposed by Rudy et al. On the other hand, using DNNs for end-to-end Model Predictive Optimal Control (MPOC) has also become a popular research area. Pereira et al. introduced a DNN architecture for Imitation Learning (IL), inspired by MPOC and based on the Path Integral (PI) control approach, while Amos et al. introduced an end-to-end MPOC architecture that uses the KKT conditions of a convex approximation. Pan et al. demonstrated the MPOC capabilities of a DNN control policy, using only camera and wheel speed sensors, through IL. Morton et al. used a Koopman-operator-based DNN model for learning the dynamics of fluids and performing MPOC to suppress vortex shedding in the wake of a cylinder.
This tremendous success of DNNs as universal function approximators inspires an alternative scheme to solve systems of FBSDEs. Recently, Han et al. introduced a Deep Learning based algorithm to solve FBSDEs associated with nonlinear parabolic PDEs. Their framework was applied to solve the HJB equation for a white-noise-driven linear system to obtain the value function at the initial time step. This framework, although effective for solving parabolic PDEs, cannot be applied directly to solve the HJB equation for optimal control of unstable nonlinear systems, since it lacks sufficient exploration and is limited to states that can be reached by purely noise-driven dynamics. This problem was addressed in subsequent work through application of Girsanov's theorem, which allows for the modification of the drift terms in the FBSDE system, thereby facilitating efficient exploration through controlled forward dynamics.
In this paper, we propose a novel framework for solving SOC problems of nonlinear systems in robotics. The resulting algorithms overcome limitations of previous work by exploiting Girsanov's theorem to enable efficient exploration and by utilizing the ability of recurrent neural networks to learn temporal dependencies. We begin by proposing essential modifications to the existing FBSDE framework that use the solution of the HJB equation at every timestep to compute an optimal feedback control, which in turn drives the exploration toward optimal areas of the state space. Additionally, we propose a novel architecture that utilizes Long Short-Term Memory (LSTM) networks to capture the underlying temporal dependencies of the problem. In contrast to a stack of individual Fully Connected (FC) networks, our proposed architecture uses fewer parameters, is faster to train, scales to longer time horizons, and produces smoother control trajectories. We also extend our framework to problems with control constraints, which are highly relevant to most applications in robotics, wherein actuation torques must not violate specified box constraints. Finally, we compare the performance of both network architectures on systems with nonlinear dynamics, namely a pendulum, a cart-pole, and a quadcopter, in simulation.
The rest of this paper is organized as follows: in Section II we reformulate the stochastic optimal control problem in the context of FBSDEs. In Section III we extend the same FBSDE framework to the control-constrained case. We then present the Deep FBSDE Control algorithm in Section IV. Experimental results are included in Section V. Finally, we conclude the paper and discuss future research directions.
II Stochastic Optimal Control through FBSDEs
II-A Problem Formulation
Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, \mathbb{Q})$ be a complete, filtered probability space on which is defined a $v$-dimensional standard Brownian motion $w(t)$, such that $\{\mathcal{F}_t\}_{t \geq 0}$ is the normal filtration of $w(t)$. Consider a general stochastic nonlinear system with control-affine dynamics,
$$\mathrm{d}x(t) = \big(f(x(t), t) + G(x(t), t)\,u(t)\big)\,\mathrm{d}t + \Sigma(x(t), t)\,\mathrm{d}w(t), \qquad x(0) = x_0,$$
where $t \in [0, T]$, $T$ is the time horizon, $x \in \mathbb{R}^n$ is the state vector, $u \in \mathbb{R}^m$ is the control vector, $f : \mathbb{R}^n \times [0, T] \to \mathbb{R}^n$ represents the drift, $G : \mathbb{R}^n \times [0, T] \to \mathbb{R}^{n \times m}$ represents the actuator dynamics, and $\Sigma : \mathbb{R}^n \times [0, T] \to \mathbb{R}^{n \times v}$ represents the diffusion. The Stochastic Optimal Control problem can be formulated as minimization of an expected cost functional given by
$$J(u) = \mathbb{E}_{\mathbb{Q}}\Big[\, g(x(T)) + \int_0^T \big(q(x(t), t) + \tfrac{1}{2}\,u(t)^{\top} R\, u(t)\big)\,\mathrm{d}t \,\Big],$$
where $g$ is the terminal state cost, $q$ is the running state cost, and $R$ is a positive definite matrix. The expectation is taken with respect to the probability measure over the space of trajectories induced by the controlled stochastic dynamics. With $\mathcal{U}([0, T])$ denoting the set of all admissible controls, we can define the value function as,
$$V(x, t) = \inf_{u \in \mathcal{U}([t, T])} \mathbb{E}_{\mathbb{Q}}\Big[\, g(x(T)) + \int_t^T \big(q(x(s), s) + \tfrac{1}{2}\,u^{\top} R\, u\big)\,\mathrm{d}s \;\Big|\; x(t) = x \Big].$$
Using Bellman's principle of optimality and Itô's lemma, the value function satisfies the HJB equation,
$$V_t + \inf_{u} \Big\{ \tfrac{1}{2}\operatorname{tr}\!\big(V_{xx}\,\Sigma\Sigma^{\top}\big) + V_x^{\top}\big(f + G u\big) + q + \tfrac{1}{2}\,u^{\top} R\, u \Big\} = 0, \qquad V(x, T) = g(x),$$
where $V_x$ and $V_{xx}$ denote the gradient and Hessian of $V$, respectively. The explicit dependence on independent variables in the PDE above, and henceforth in all PDEs in this paper, is omitted for the sake of conciseness, but will be maintained for the corresponding SDEs for clarity. For the chosen form of the cost functional integrand, the infimum operation can be carried out by taking the gradient of the terms inside, known as the Hamiltonian, with respect to $u$ and setting it to zero,
$$\nabla_u \Big( V_x^{\top} G\, u + \tfrac{1}{2}\,u^{\top} R\, u \Big) = G^{\top} V_x + R\, u = 0.$$
Therefore, the optimal control is obtained as
$$u^{*}(x, t) = -R^{-1}\, G^{\top}(x, t)\, V_x(x, t).$$
Plugging the optimal control back into the original HJB equation, the following form of the equation is obtained,
$$V_t + \tfrac{1}{2}\operatorname{tr}\!\big(V_{xx}\,\Sigma\Sigma^{\top}\big) + V_x^{\top} f + q - \tfrac{1}{2}\,V_x^{\top} G R^{-1} G^{\top} V_x = 0, \qquad V(x, T) = g(x).$$
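The optimal feedback law above amounts to a single linear solve against the control cost matrix. As a minimal NumPy sketch (the 2-state, 1-control system and all numerical values here are hypothetical, for illustration only):

```python
import numpy as np

def optimal_control(R, G, V_x):
    """Optimal feedback u* = -R^{-1} G^T V_x for the control-affine
    HJB minimization, with R positive definite."""
    return -np.linalg.solve(R, G.T @ V_x)

# Hypothetical 2-state, 1-control example.
R = np.array([[2.0]])                # control cost weight
G = np.array([[0.0], [1.0]])         # control enters the 2nd state only
V_x = np.array([0.5, -4.0])          # value-function gradient at (x, t)
u_star = optimal_control(R, G, V_x)  # -> array([2.0])
```

Using `np.linalg.solve` instead of explicitly inverting `R` is the standard numerically stable choice when `R` is more than a scalar.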
II-B Non-linear Feynman-Kac Lemma
Here we restate the non-linear Feynman-Kac lemma [13, 14]. Consider the Cauchy problem,
$$v_t + \tfrac{1}{2}\operatorname{tr}\!\big(v_{xx}\,\Sigma\Sigma^{\top}\big) + v_x^{\top} b + h\big(x, v, \Sigma^{\top} v_x, t\big) = 0, \qquad v(x, T) = g(x),$$
wherein the solution admits the probabilistic representation $v(x, t) = y(t)$ and $\Sigma^{\top}(x, t)\,v_x(x, t) = z(t)$, where $(x(\cdot), y(\cdot), z(\cdot))$ is the unique solution of the FBSDE system given by,
$$\mathrm{d}x(t) = b(x(t), t)\,\mathrm{d}t + \Sigma(x(t), t)\,\mathrm{d}w(t), \qquad x(0) = x_0,$$
where, without loss of generality, $w(t)$ is chosen as an $n$-dimensional Brownian motion. The process $x(t)$, satisfying the above forward SDE, is also called the state process. And,
$$\mathrm{d}y(t) = -h\big(x(t), y(t), z(t), t\big)\,\mathrm{d}t + z(t)^{\top}\,\mathrm{d}w(t), \qquad y(T) = g(x(T)),$$
is the associated backward SDE. The function $h$ is called the generator or driver.
We assume that there exists a matrix-valued function $\gamma : \mathbb{R}^n \times [0, T] \to \mathbb{R}^{v \times m}$ such that the control matrix in (1) can be decomposed as $G(x, t) = \Sigma(x, t)\,\gamma(x, t)$ for all $(x, t)$, with $\gamma$ satisfying the same mild regularity conditions. This decomposition can be justified in the case of stochastic actuators, where noise enters the system through the control channels. Under this assumption, we can apply the nonlinear Feynman-Kac lemma to the HJB PDE (7) and establish equivalence to (8), with the coefficients of (8) given by
$$b(x, t) = f(x, t), \qquad h(x, y, z, t) = q(x, t) - \tfrac{1}{2}\, z^{\top} \gamma(x, t)\, R^{-1} \gamma^{\top}(x, t)\, z.$$
II-C Importance Sampling for Efficient Exploration
There are several systems for which the goal state practically cannot be reached by the uncontrolled stochastic dynamics. This issue can be eliminated if one is given the ability to modify the drift term of the forward SDE. Specifically, by changing the drift, we can direct the exploration of the state space towards a given goal state, or any other state of interest reachable by control. Through Girsanov's theorem on the change of measure, the drift term in the forward SDE (11) can be changed if the backward SDE (12) is compensated accordingly. This is known as importance sampling for FBSDEs, and it results in a new system of FBSDEs that is, in a certain sense, equivalent to the original one,
$$\mathrm{d}x(t) = \big[b(x(t), t) + \Sigma(x(t), t)\,K(t)\big]\,\mathrm{d}t + \Sigma(x(t), t)\,\mathrm{d}\tilde{w}(t), \qquad x(0) = x_0,$$
along with the compensated BSDE,
$$\mathrm{d}y(t) = \big[-h\big(x(t), y(t), z(t), t\big) + z(t)^{\top} K(t)\big]\,\mathrm{d}t + z(t)^{\top}\,\mathrm{d}\tilde{w}(t), \qquad y(T) = g(x(T)),$$
for any measurable, bounded and adapted process $K(\cdot)$. We refer the reader to the proof of Theorem 1 in prior work for the full derivation of the change of measure for FBSDEs. The PDE associated with this new system is given by
$$v_t + \tfrac{1}{2}\operatorname{tr}\!\big(v_{xx}\,\Sigma\Sigma^{\top}\big) + v_x^{\top}\big(b + \Sigma K\big) + h - \big(\Sigma^{\top} v_x\big)^{\top} K = 0, \qquad v(x, T) = g(x),$$
which is identical to the original problem (8), as we have merely added and subtracted the term $v_x^{\top}\Sigma K$. Recalling the decomposition of the control matrix in the case of stochastic actuators, the modified drift term can be realized with any nominal control $\bar{u}(t)$ to achieve the controlled dynamics,
$$\mathrm{d}x(t) = \big[f(x(t), t) + G(x(t), t)\,\bar{u}(t)\big]\,\mathrm{d}t + \Sigma(x(t), t)\,\mathrm{d}\tilde{w}(t),$$
with $K(t) = \gamma(x(t), t)\,\bar{u}(t)$. The nominal control $\bar{u}$ can be any open-loop or closed-loop control, a random control, or a control computed from a previous run of the algorithm.
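The practical effect of importance sampling is that forward trajectories are rolled out under a nominal control rather than under noise alone. A minimal Euler-Maruyama sketch of this controlled rollout (the scalar system, nominal control, and parameter values are hypothetical):

```python
import numpy as np

def sample_controlled_trajectory(f, G, Sigma, u_nom, x0, T, N, rng):
    """Euler-Maruyama rollout of the importance-sampled forward SDE
    dx = (f(x,t) + G(x,t) u_nom(x,t)) dt + Sigma(x,t) dw,
    i.e. exploration driven by a nominal control instead of noise alone."""
    dt = T / N
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for k in range(N):
        t = k * dt
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + (f(x, t) + G(x, t) @ u_nom(x, t)) * dt + Sigma(x, t) @ dw
        path.append(x.copy())
    return np.array(path)

# Hypothetical scalar system: f = -x, G = I, Sigma = 0.1 I, nominal u = 1.
rng = np.random.default_rng(0)
path = sample_controlled_trajectory(
    lambda x, t: -x, lambda x, t: np.eye(1), lambda x, t: 0.1 * np.eye(1),
    lambda x, t: np.ones(1), x0=[0.0], T=1.0, N=100, rng=rng)
```

With the nominal control on, the sampled state drifts toward the controlled equilibrium near 1 instead of diffusing around the origin.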
II-D FBSDE Reformulation
Solutions to BSDEs need to satisfy a terminal condition, and thus integration needs to be performed backwards in time, yet the filtration still evolves forward in time. It turns out that a terminal value problem involving BSDEs admits an adapted solution if one back-propagates the conditional expectation of the process. This was the basis of the approximation scheme and corresponding algorithm introduced by Exarchos and Theodorou [21]. However, this scheme is prone to approximation errors introduced by least squares estimates, which compound over time steps. On the other hand, the Deep Learning (DL)-based approach of Han et al. uses the terminal condition of the BSDE as a prediction target for a self-supervised learning problem, with the goal of using back-propagation to estimate the value function at the initial timestep. This is achieved by treating the value at the initial timestep, $y(0) = V(x_0, 0)$, as one of the trainable parameters of a DL model. There is a two-fold advantage to this approach: (i) starting with a random guess of $y(0)$, the backward SDE can be forward-propagated instead; this eliminates the need to back-propagate a least-squares estimate of the conditional expectation to solve the BSDE and instead treats the BSDE similarly to the FSDE, and (ii) the approximation errors at every time step are compensated by the back-propagation training process of DL, because the individual networks at every timestep contribute to a common goal of predicting the target terminal condition and are trained jointly.
In this work, we combine importance sampling concepts for FBSDEs with the Deep Learning techniques that allow forward sampling of the BSDE, and propose a new algorithm for Stochastic Optimal Control problems. The novelty of our approach is to incorporate importance sampling for efficient exploration into the DL model. Instead of the original HJB equation (7), we focus on obtaining solutions of the modified HJB PDE (16) by using the modified FBSDE system (14), (15). Additionally, we explicitly compute the control at every time step inside the computational graph, using the analytical expression for the optimal control (6). The FBSDE system is solved by integrating both SDEs forward in time as follows,
$$x(t) = x_0 + \int_0^t \big[b + \Sigma K\big]\,\mathrm{d}s + \int_0^t \Sigma\,\mathrm{d}\tilde{w}(s),$$
$$y(t) = y(0) - \int_0^t \big[h - z^{\top}K\big]\,\mathrm{d}s + \int_0^t z^{\top}\,\mathrm{d}\tilde{w}(s).$$
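The forward-in-time treatment of both SDEs can be sketched as a single rollout in which one Brownian increment drives both equations and a placeholder function stands in for the per-step network. Everything below (the scalar system, the quadratic driver, the linear guess for the gradient) is a hypothetical illustration, not the paper's exact formulation:

```python
import numpy as np

def fbsde_rollout(x0, y0, f, G, Sigma, R, q, g, predict_Vx, T, N, rng):
    """Forward-in-time integration of both the FSDE and the BSDE of the
    importance-sampled system: the same Brownian increment dw drives both,
    and predict_Vx(x, t) stands in for the per-step network output V_x."""
    dt = T / N
    x, y = np.array(x0, dtype=float), float(y0)
    for k in range(N):
        t = k * dt
        Vx = predict_Vx(x, t)
        u = -np.linalg.solve(R, G(x, t).T @ Vx)   # optimal feedback, eqn (6)
        z = Sigma(x, t).T @ Vx                    # z = Sigma^T V_x
        h = q(x, t) + 0.5 * u @ R @ u             # simplified quadratic driver
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + (f(x, t) + G(x, t) @ u) * dt + Sigma(x, t) @ dw
        y = y - h * dt + z @ dw                   # BSDE propagated forward
    return x, y, (y - g(x)) ** 2                  # terminal prediction loss

# Hypothetical scalar example with a guessed linear V_x.
rng = np.random.default_rng(0)
x_T, y_T, loss = fbsde_rollout(
    x0=[1.0], y0=0.5,
    f=lambda x, t: -0.5 * x, G=lambda x, t: np.eye(1),
    Sigma=lambda x, t: 0.2 * np.eye(1), R=np.array([[1.0]]),
    q=lambda x, t: float(x @ x), g=lambda x: float(x @ x),
    predict_Vx=lambda x, t: 2.0 * x, T=1.0, N=50, rng=rng)
```

In the full algorithm, `predict_Vx` is a trained network, the driver follows the compensated form of (13)/(15), and the returned terminal loss is what back-propagation minimizes.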
III Stochastic Control Problems with Control Constraints
The framework we have considered so far can be suitably modified to accommodate a certain type of control constraint, namely upper and lower bounds. Specifically, each control component satisfies $u_i(t) \in [-u_i^{\max}, u_i^{\max}]$ for all $t \in [0, T]$. Such control constraints are common in mechanical systems, where control forces and/or torques are bounded, and may be readily introduced in our framework via the addition of a "soft" constraint, integrated within the cost functional.
Prior work on constrained trajectory optimization typically dealt with deterministic problems and made use of tools from constrained quadratic programming to compute the optimal controls. Here we take a different approach that incorporates the control constraints directly in the HJB equation by defining an appropriate control cost function. Indeed, one can replace the cost functional given by (2) with
$$J(u) = \mathbb{E}_{\mathbb{Q}}\Big[\, g(x(T)) + \int_0^T \Big( q(x(t), t) + \sum_{i=1}^{m} c_i \int_0^{u_i(t)} S^{-1}\!\big(v / u_i^{\max}\big)\,\mathrm{d}v \Big)\,\mathrm{d}t \Big],$$
where $c_i > 0$ are constant weights, $S$ denotes a sigmoid (tanh-like) function that saturates at infinity, i.e., $\lim_{v \to \pm\infty} S(v) = \pm 1$, while $v$ is a dummy variable of integration. A suitable example along with its inverse is
$$S(v) = \tanh(v), \qquad S^{-1}(s) = \tfrac{1}{2}\ln\frac{1 + s}{1 - s}.$$
Following the same procedure as in Section II, we set the derivative of the Hamiltonian equal to zero and obtain
$$c_i\, S^{-1}\!\big(u_i / u_i^{\max}\big) + G_i^{\top}(x, t)\, V_x = 0, \qquad i = 1, \ldots, m.$$
By introducing the notation
$$\mathcal{G}_i(x, t) \triangleq G_i^{\top}(x, t)\, V_x(x, t),$$
where $G_i$ (not to be confused with the terminal cost $g$) denotes the $i$-th column of $G$, we may write the optimal control in component-wise notation as
$$u_i^{*} = u_i^{\max}\, S\!\Big(-\tfrac{1}{c_i}\,\mathcal{G}_i(x, t)\Big), \qquad i = 1, \ldots, m.$$
The optimal control can be written equivalently in vector form. Indeed, if $u^{\max} = (u_1^{\max}, \ldots, u_m^{\max})^{\top}$ is the vector of bounds, $C^{-1} = \operatorname{diag}(1/c_1, \ldots, 1/c_m)$ is a diagonal matrix of the reciprocals of the weights, and $U = \operatorname{diag}(u_1^{\max}, \ldots, u_m^{\max})$ is a diagonal matrix of the bounds, one readily obtains
$$u^{*}(x, t) = U\, S\!\big(-C^{-1} G^{\top}(x, t)\, V_x\big),$$
with $S$ applied element-wise.
Substituting the expression for the constrained controls into the HJB equation (16) results in a PDE of the same form, in which the driver $h$ is modified accordingly to account for the saturated control cost.
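The saturated feedback policy can be illustrated in a few lines. This sketch uses one common smooth-saturation form (tanh scaled by the bound); the exact scaling in the paper's constrained control expression may differ, and all numerical values are hypothetical:

```python
import numpy as np

def constrained_control(c, G, V_x, u_max):
    """Smoothly saturated feedback: each component of the unconstrained
    feedback -(G^T V_x)_i / c_i is passed through a tanh scaled by the
    bound, so |u_i| < u_max_i always holds."""
    u_unc = -(G.T @ V_x) / c           # unconstrained per-component feedback
    return u_max * np.tanh(u_unc / u_max)

c = np.array([1.0])                    # control cost weight (hypothetical)
G = np.array([[0.0], [1.0]])
V_x = np.array([0.0, -8.0])            # large gradient -> saturation
u = constrained_control(c, G, V_x, u_max=np.array([2.0]))
```

For a large unconstrained command (here 8) the output presses against the bound but never exceeds it, which is exactly the behavior the soft-constraint cost enforces.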
IV Deep FBSDE Controller
In this section we introduce a simple Euler time-discretization scheme and formulate algorithms for the solution of stochastic optimal control problems using two neural network architectures.
The task horizon $[0, T]$ in continuous time can be discretized as $\{0, 1, \ldots, N\}$, where $\Delta t = T/N$. Here we abuse the notation $t$ as both the continuous time variable and the discrete time index. With this, we can also discretize all variables as step functions, such that a discrete variable takes the value of its continuous counterpart whenever the discrete time index $t$ corresponds to the time interval $[t\Delta t, (t+1)\Delta t)$.
The Deep FBSDE algorithm, as shown in Alg. 1, solves the finite-time-horizon control problem by approximating the gradient of the value function at every time step with a DNN. Note that the superscript denotes the batch index, and the batch-wise calculation can be implemented in parallel. The initial value and its gradient are parameterized by trainable variables and are randomly initialized. The optimal control action is calculated using the discretized version of (6) (or (26) for the control-constrained case). The dynamics and the value function are propagated using the Euler integration scheme, as shown in the algorithm. The driver function is calculated using (13) (or (28) for the control-constrained case). The predicted final value is compared against the true final value to compute the loss. The networks can be trained with any variant of Stochastic Gradient Descent (SGD), such as the Adam optimizer, until convergence, with custom learning rate scheduling. The trained networks can then be used to predict the optimal control at every time step starting from the given initial condition $x_0$.
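The training signal of Alg. 1, comparing the forward-propagated value against the true terminal cost, can be illustrated with the trainable initial value alone. The following is a stripped-down sketch (the batch data, the fixed BSDE offsets, and the plain gradient-descent loop are hypothetical stand-ins for the full network training):

```python
import numpy as np

def terminal_loss(y0, x_T_batch, y_T_offsets, g):
    """Batch loss of Alg. 1: predicted terminal value (trainable y0 plus
    the accumulated BSDE increments) vs. true terminal cost g(x_T)."""
    y_pred = y0 + y_T_offsets                 # forward-propagated value
    y_true = np.array([g(x) for x in x_T_batch])
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical batch: fit the trainable scalar y0 by plain gradient descent.
rng = np.random.default_rng(1)
x_T = rng.normal(size=(128, 2))               # simulated terminal states
offsets = rng.normal(scale=0.1, size=128)     # simulated BSDE increments
g = lambda x: float(x @ x)                    # quadratic terminal cost
y0 = 0.0
for _ in range(200):
    grad = 2 * np.mean((y0 + offsets) - np.array([g(x) for x in x_T]))
    y0 -= 0.1 * grad                          # SGD step on the trainable y0
```

In the actual algorithm the same loss gradient also flows back through every network in the unrolled graph, so all per-step parameters are trained jointly toward this single terminal target.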
IV-B Network Architecture
The network architecture proposed in fig. 1 is an extension of prior work, with additional connections that use the gradient of the value function at every time step for optimal feedback control. A similar architecture was previously introduced to solve model-based Reinforcement Learning (RL) problems posed as finite-time-horizon SOC problems. That architecture was designed to predict time-varying controls by parameterizing the controller at every time step with an independent FC network, as shown in fig. 3. The networks are stacked together to form one large deep network, which is then trained in an end-to-end fashion.
In our proposed architecture, we choose to apply the optimal control (see (18)), calculated using the value function gradient predicted by the network, as the nominal control. This, however, creates a new path for gradient backpropagation through time, which introduces both advantages and challenges for training the networks. The advantage is a direct influence of the weights on the state cost, leading to accelerated convergence. Nonetheless, this path also leads to the vanishing gradient problem, which is known to plague training of Recurrent Neural Networks on long sequences (or time horizons).
To tackle this problem, we propose a new LSTM-based network architecture, as shown in fig. 2 and fig. 4, which can effectively deal with the vanishing gradient problem as it allows the gradient to flow unchanged. Additionally, since the weights are shared across all time steps, the total number of parameters to train is far smaller than for the FC structure. These features allow the algorithm to scale to optimal control problems with long time horizons. Intuitively, one can also think of the LSTM as modeling the time evolution of the value function gradient $V_x$, in contrast to the FC structure, which acts independently at every time step.
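The parameter-count argument is easy to make concrete: a per-step FC stack grows linearly with the horizon, while a shared LSTM does not. The layer sizes below (12-dim state, 64 hidden units, 100 time steps) are hypothetical, chosen only to illustrate the scaling:

```python
def fc_stack_params(n_steps, in_dim, hid, out_dim):
    """One independent FC network (two hidden layers) per time step,
    as in the fig. 3 style architecture."""
    per_step = (in_dim * hid + hid) + (hid * hid + hid) + (hid * out_dim + out_dim)
    return n_steps * per_step

def lstm_params(in_dim, hid, out_dim, n_layers=2):
    """A 2-layer LSTM shared across all time steps, plus an output projection;
    each LSTM layer has 4 gates (input, forget, cell, output)."""
    total, d = 0, in_dim
    for _ in range(n_layers):
        total += 4 * (d * hid + hid * hid + hid)
        d = hid
    return total + hid * out_dim + out_dim

# Hypothetical sizes: 12-dim state, 64 hidden units, 12-dim V_x output, 100 steps.
fc_total = fc_stack_params(100, 12, 64, 12)   # grows linearly with the horizon
lstm_total = lstm_params(12, 64, 12)          # independent of the horizon
```

Doubling the horizon doubles the FC stack's parameter count but leaves the LSTM's unchanged, which is the scaling advantage claimed above.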
V Simulation Results
We applied the Deep FBSDE controller to the pendulum, cart-pole, and quadcopter systems for the task of reaching a target final state. The trained networks are evaluated over 128 trials, and the results are compared between the different network architectures for both the unconstrained and the control-constrained case. We use FC to denote experiments with the network architecture in fig. 1 and LSTM for the architecture in fig. 2. We use 2-layer FC and LSTM networks with tanh activations for all experiments. All experiments were conducted in TensorFlow on an Intel i7-4820k CPU processor.
In all plots, the solid line represents the mean trajectory, and the shaded region shows the 95% confidence region. To differentiate between the 4 cases, we use blue for unconstrained FC, green for unconstrained LSTM, cyan for constrained FC, and magenta for constrained LSTM.
V-A Pendulum
The algorithm was applied to the pendulum system for the swing-up task with a time horizon of 1.5 seconds, using the standard pendulum equation of motion. The initial pendulum angle is 0, and the target pendulum angle and rate are $\pi$ and 0, respectively. A maximum torque constraint is imposed for the control-constrained cases.
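For reference, a swing-up rollout under a saturated torque can be simulated in a few lines. The model form and every parameter value below (masses, damping, the torque bound of 5) are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def pendulum_step(theta, omega, u, dt, m=1.0, l=1.0, b=0.1, g=9.81):
    """One explicit-Euler step of a standard damped pendulum,
    m l^2 theta'' = u - b theta' - m g l sin(theta)."""
    alpha = (u - b * omega - m * g * l * np.sin(theta)) / (m * l ** 2)
    return theta + omega * dt, omega + alpha * dt

# Roll out the 1.5 s horizon at dt = 0.01 with a constant saturated torque.
u_max = 5.0                              # hypothetical torque bound
theta, omega = 0.0, 0.0
for _ in range(150):
    u = min(10.0, u_max)                 # commanded torque clipped at the bound
    theta, omega = pendulum_step(theta, omega, u, 0.01)
```

A learned policy would replace the constant torque with the network's feedback control at each step; the Euler update itself is the same one used inside the Deep FBSDE computational graph.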
The state trajectories across the 4 cases show that the swing-up task is completed in all cases with low variance. However, the pole rate does not return to 0 for the unconstrained FC, as compared to the unconstrained LSTM. When the control is constrained, the pendulum angular rate becomes serrated for FC while remaining smooth for LSTM. This is even more noticeable in the control torques (fig. 6): the control torques become very spiky for FC due to the independent networks at each time step. On the other hand, the hidden temporal connections within the LSTM allow for a smooth and optimally behaved control policy.
V-B Cart Pole
The algorithm was applied to the cart-pole system for the swing-up task with a time horizon of 1.5 seconds, using the standard cart-pole equations of motion. The initial pole angle is 0, and the target pole angle is $\pi$, with target pole and cart velocities of 0. Note that despite the target of 0 for the cart position, we do not penalize non-zero cart position during training. A maximum force constraint of 10 is used for the control-constrained case.
The cart-pole states are shown in fig. 7. Similar to the pendulum experiment, the swing-up task is completed with low variance across all cases. Interestingly, when control is constrained, both FC and LSTM first swing the pole in the direction opposite to the target and utilize momentum to complete the task. Another interesting observation is that in the unconstrained case, the LSTM policy is able to exploit long-term temporal connections to initially apply large controls to swing up the pole and then focus on decelerating the pole for the rest of the time horizon, whereas the FC policy appears to be more myopic, resulting in a delayed swing-up action. As in the pendulum experiment, under control constraints the FC policy produces sawtooth-like controls, while the LSTM policy outputs smooth control trajectories.
V-C Quadcopter
The algorithm was applied to the quadcopter system for the task of flying from its initial position to a target final position with a time horizon of 2 seconds. The quadcopter dynamics used are described in detail by Habib et al. The initial condition is 0 across all states, and the target is 1 upward, forward, and to the right of the initial location, with zero velocities and attitude. The controls are motor torques. A maximum torque constraint of 3 is imposed for the control-constrained case.
This task required one individual FC network per time step. After extensive experimentation, we conclude that tuning the FC-based policy becomes significantly more difficult and cumbersome as the time horizon of the task increases. On the other hand, tuning our proposed LSTM-based policy was equivalent to that for the cart-pole and pendulum experiments. Moreover, the weights shared across all time steps result in faster build times and run times for the TensorFlow computational graph. As seen in figs. 9-12 from our experiments, the performance of the LSTM-based policies surpassed that of the FC-based policies (especially for the attitude states), due to their exploitation of long-term temporal dependence and ease of tuning.
VI Conclusions
In this paper, we proposed the Deep FBSDE Control algorithm, with both an FC-based and a novel LSTM-based architecture, to solve finite-time-horizon Stochastic Optimal Control problems for nonlinear systems with control-affine dynamics. Our work relies on prior work on importance sampling of FBSDEs and on the ability of recurrent neural networks to capture the temporal dependence of the value function and its gradient.
There are three observations that are essential for the application of the proposed methods to robotic and autonomous systems. First, the LSTM-based architecture is capable of providing smooth controls, in stark contrast to the FC-based architecture; this feature makes the LSTM-based architecture suitable for deployment on real robotic systems. Second, the importance sampling approach is key for scaling the proposed algorithms to high-dimensional systems. While the importance sampling scheme was first introduced in prior work, the LSTM-based architecture introduced here significantly increases its effectiveness on high-dimensional systems.
Finally, our control-constrained stochastic optimal control formulation is essential for robotic control applications, since robotic systems very often operate in the presence of input saturation limits and control constraints. In terms of future research, promising directions include alternative neural network architectures and alternative stochastic control problem formulations.
- Dai Pra et al. [1996] P. Dai Pra, L. Meneghini, and W. Runggaldier. Connections between stochastic control and dynamic games. Mathematics of Control, Signals, and Systems (MCSS), 9(4):303–326, 1996. URL http://dx.doi.org/10.1007/BF01211853.
- Fleming  W. H. Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization, 4:329–346, 1978.
- Theodorou and Todorov  E.A Theodorou and E. Todorov. Relative entropy and free energy dualities: Connections to path integral and kl control. In the Proceedings of IEEE Conference on Decision and Control, pages 1466–1473, Dec 2012. doi: 10.1109/CDC.2012.6426381.
- Theodorou  E. A. Theodorou. Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations. Entropy, 17(5):3352, 2015.
- Williams et al. [2017a] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. Rehg, B. Boots, and E.A. Theodorou. Information theoretic mpc for model-based reinforcement learning. In Proceedings of the 2017 IEEE Conference on Robotics and Automation (ICRA), 2017a.
- Williams et al.  G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics, 34(6):1603–1622, Dec 2018. ISSN 1552-3098. doi: 10.1109/TRO.2018.2865891.
- Drews et al. [2017] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg. Aggressive deep driving: Combining convolutional neural networks and model predictive control. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 133–142. PMLR, 13–15 Nov 2017. URL http://proceedings.mlr.press/v78/drews17a.html.
- Williams et al. [2017b] G. Williams, A. Aldrich, and E. A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40:2:344–357, 2017b.
- NVIDIA [Aug 31, 1999] NVIDIA. Nvidia launches the world’s first graphics processing unit: Geforce 256. Aug 31, 1999. URL https://www.nvidia.com/object/IO_20020111_5424.html.
- Bellman  Richard Bellman. Dynamic programming. Courier Corporation, 2013.
- Theodorou et al. [2010a] E. A. Theodorou, J. Buchli, and S. Schaal. A Generalized Path Integral Control Approach to Reinforcement Learning. J. Mach. Learn. Res., 11:3137–3181, December 2010a. ISSN 1532-4435.
- Karatzas and Shreve  I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (Graduate Texts in Mathematics). Springer, 2nd edition, August 1991. ISBN 0387976558.
- Yong and Zhou  Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 1999.
- Pardoux and Rascanu  Etienne Pardoux and Aurel Rascanu. Stochastic Differential Equations, Backward SDEs, Partial Differential Equations, volume 69. Springer, 2014. doi: 10.1007/978-3-319-05714-9.
- Kappen [2005a] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Phys Rev Lett, 95:200201, 2005a. Journal Article United States.
- Kappen [2005b] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 11:P11011, 2005b.
- Theodorou et al. [2010b] E.A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral approach to reinforcement learning. Journal of Machine Learning Research, (11):3137–3181, 2010b.
- Kharroubi and Pham  Idris Kharroubi and Huyên Pham. Feynman–kac representation for hamilton–jacobi–bellman ipde. Ann. Probab., 43(4):1823–1865, 07 2015. doi: 10.1214/14-AOP920. URL https://doi.org/10.1214/14-AOP920.
- Fabbri et al.  Giorgio Fabbri, Fausto Gozzi, and Andrzej Swiech. Stochastic Optimal Control in Infinite Dimensions: Dynamic Programming and HJB Equations. Number 82 in Probability Theory and Stochastic Modelling. Springer, January 2017. URL https://hal-amu.archives-ouvertes.fr/hal-01505767.
- Exarchos and Theodorou  I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159–165, 2018.
- Exarchos and Theodorou  I. Exarchos and E. A. Theodorou. Learning optimal control via forward and backward stochastic differential equations. In American Control Conference (ACC), 2016, pages 2155–2161. IEEE, 2016.
- Exarchos  I. Exarchos. Stochastic Optimal Control-A Forward and Backward Sampling Approach. PhD thesis, Georgia Institute of Technology, 2017.
- Raissi et al.  M Raissi, P Perdikaris, and GE Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- Rudy et al.  Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
- Pereira et al.  M. Pereira, D. D. Fan, G. Nakajima An, and E. A. Theodorou. MPC-Inspired Neural Network Policies for Sequential Decision Making. arXiv preprint arXiv:1802.05803, 2018.
- Amos et al.  Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for End-to-end Planning and Control. Advances in Neural Information Processing Systems, 2018.
- Pan et al.  Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots. Agile Off-Road Autonomous Driving Using End-to-End Deep Imitation Learning. Robotics: Science and Systems, 2018.
- Morton et al.  Jeremy Morton, Antony Jameson, Mykel J Kochenderfer, and Freddie Witherden. Deep Dynamical Modeling and Control of Unsteady Fluid Flows. Advances in Neural Information Processing Systems 31, pages 9278–9288, 2018.
- Goodfellow et al.  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016. ISBN 0262035618, 9780262035613.
- Han et al.  Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018. ISSN 0027-8424. doi: 10.1073/pnas.1718942115. URL https://www.pnas.org/content/115/34/8505.
- Girsanov  Igor Vladimirovich Girsanov. On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Theory of Probability & Its Applications, 5(3):285–301, 1960.
- Tassa et al.  Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175, 2014.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
- Han and E  Jiequn Han and Weinan E. Deep Learning Approximation for Stochastic Control Problems. arXiv preprint arXiv:1611.07422, 2016.
- Werbos  Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems, pages 473–479, 1997.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Habib et al.  Maki K Habib, Wahied Gharieb Ali Abdelaal, Mohamed Shawky Saad, et al. Dynamic modeling and control of a quadrotor using linear and nonlinear approaches. 2014.