Neural Network Architectures for Stochastic Control using the Nonlinear Feynman-Kac Lemma

02/11/2019 ∙ by Marcus Pereira, et al.

In this paper we propose a new methodology for decision-making under uncertainty using recent advancements in the areas of nonlinear stochastic optimal control theory, applied mathematics and machine learning. Our work is grounded on the nonlinear Feynman-Kac lemma and the fundamental connection between backward nonlinear partial differential equations and forward-backward stochastic differential equations. Using these connections and results from our prior work on importance sampling for forward-backward stochastic differential equations, we develop a control framework that is scalable and applicable to general classes of stochastic systems and decision-making problem formulations in robotics and autonomy. Two architectures for stochastic control are proposed that consist of feed-forward and recurrent neural networks. The performance and scalability of the aforementioned algorithms is investigated in two stochastic optimal control problem formulations including the unconstrained L2 and control-constrained case, and three systems in simulation. We conclude with a discussion on the implications of the proposed algorithms to robotics and autonomous systems.




I Introduction

Over the past 15 years there has been significant interest from the robotics community in developing algorithms for stochastic control of systems operating in dynamic and uncertain environments. This interest was initiated by two main developments related to theory and hardware. From a theoretical standpoint, there has been a better and in some sense deeper understanding of connections between different disciplines. As an example, the connections between optimality principles in control theory and information theoretic concepts in statistical physics are by now well understood [1, 2, 3, 4, 5]. These connections have resulted in novel algorithms that are scalable, run in real time and can handle complex nonlinear dynamics [6, 7, 8]. On the hardware side, there have been significant technological developments that have made possible the use of high performance computing for real-time Stochastic Optimal Control (SOC) in robotics and autonomy [9].

Traditionally SOC problems are solved using Dynamic Programming (DP). Dynamic Programming requires solving a nonlinear second order Partial Differential Equation (PDE) known as the Hamilton-Jacobi-Bellman (HJB) equation [10]. It is well-known that the HJB equation suffers from the curse of dimensionality. One way to tackle this problem is through an exponential transformation to linearize the HJB equation, which can then be solved with forward sampling using the linear Feynman-Kac lemma [11, 12]. While the linear Feynman-Kac lemma provides a probabilistic representation of the solution to the HJB that is exact, its application relies on certain assumptions between control authority and noise. In addition, the exponential transformation of the value function reduces the discriminability between good and bad states, which makes the computation of the optimal control policy difficult.

An alternative approach to solve SOC problems is to transform the HJB into a system of Forward-Backward Stochastic Differential Equations (FBSDEs) using the nonlinear version of the Feynman-Kac lemma [13, 14]. This is a more general approach compared to the standard Path Integral control framework, in that it does not rely on any assumptions between control authority and noise [15, 16, 17]. In addition, it is valid for general classes of stochastic processes including jump-diffusions and infinite dimensional stochastic processes [18, 19]. The main challenge with using the nonlinear Feynman-Kac lemma, however, is solving the backward SDE, which requires back-propagating a conditional expectation and, unlike the forward SDE, cannot be solved directly through sampling. Using it in an actual algorithm therefore requires numerical approximation techniques. Exarchos and Theodorou [20] developed an importance sampling based iterative scheme by approximating the conditional probability at every time step using linear regression (also see [21] and [22]). However, this method suffers from compounding errors from the Least Squares approximation at every time step.

Recently, the idea of using Deep Neural Networks (DNNs) and other data-driven techniques for approximating the solutions of non-linear PDEs has been garnering significant attention. In Raissi et al. [23], DNNs were used for both solving and data-driven discovery of the coefficients of non-linear PDEs popular in the physics literature, such as the Schrödinger, Allen-Cahn, Navier-Stokes and Burgers equations. They demonstrated that their DNN-based approach can surpass the performance of other data-driven methods such as the sparse linear regression proposed by Rudy et al. [24]. On the other hand, using DNNs for end-to-end Model Predictive Optimal Control (MPOC) has also become a popular research area. Pereira et al. [25] introduced a DNN architecture for Imitation Learning (IL), inspired by MPOC and based on the Path Integral (PI) Control approach, while Amos et al. [26] introduced an end-to-end MPOC architecture that uses the KKT conditions of the convex approximation. Pan et al. [27] demonstrated the MPOC capabilities of a DNN control policy, using only camera and wheel speed sensors, through IL. Morton et al. [28] used a Koopman operator based DNN model for learning the dynamics of fluids and performing MPOC for suppressing vortex shedding in the wake of a cylinder.

This tremendous success of DNNs as universal function approximators [29] inspires an alternative scheme to solve systems of FBSDEs. Recently, Han et al. [30] introduced a Deep Learning based algorithm to solve FBSDEs associated with nonlinear parabolic PDEs. Their framework was applied to solve the HJB equation for a white-noise driven linear system to obtain the value function at the initial time step. This framework, although effective for solving parabolic PDEs, cannot be applied directly to solve the HJB for optimal control of unstable nonlinear systems since it lacks sufficient exploration and is limited to only states that can be reached by purely noise driven dynamics. This problem was addressed in [20] through application of Girsanov’s theorem, which allows for the modification of the drift terms in the FBSDE system thereby facilitating efficient exploration through controlled forward dynamics.

In this paper, we propose a novel framework for solving SOC problems of nonlinear systems in robotics. The resulting algorithms overcome limitations of previous work in [30] by exploiting Girsanov’s theorem as in [20] to enable efficient exploration and by utilizing the benefits of recurrent neural networks in learning temporal dependencies. We begin by proposing essential modifications to the existing framework of FBSDEs to utilize the solutions of the HJB equation at every timestep to compute an optimal feedback control which thereby drives the exploration to optimal areas of the state space. Additionally, we propose a novel architecture that utilizes Long-Short Term Memory (LSTM) networks to capture the underlying temporal dependency of the problem. In contrast to the individual Fully Connected (FC) networks in [30], our proposed architecture uses fewer parameters, is faster to train, scales to longer time horizons and produces smoother control trajectories. We also extend our framework to problems with control-constraints which are very relevant to most applications in Robotics wherein actuation torques must not violate specified box constraints. Finally, we compare the performance of both network architectures on systems with nonlinear dynamics such as pendulum, cartpole and quadcopter in simulation.

The rest of this paper is organized as follows: in Section II we reformulate the stochastic optimal control problem in the context of FBSDEs. In Section III we extend the same FBSDE framework to the control constrained case. Then we provide the Deep FBSDE Control algorithm in Section IV. The experimental results are included in Section V. Finally, in Section VI we conclude the paper and discuss future research directions.

II Stochastic Optimal Control through FBSDE

II-A Problem Formulation

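Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, \mathbb{Q})$ be a complete, filtered probability space on which is defined a $\nu$-dimensional standard Brownian motion $w(t)$, such that $\{\mathcal{F}_t\}_{t \geq 0}$ is the normal filtration of $w(t)$. Consider a general stochastic non-linear system with control affine dynamics,

$$dx(t) = f(x(t),t)\,dt + G(x(t),t)\,u(t)\,dt + \Sigma(x(t),t)\,dw(t), \tag{1}$$

where $0 < t < T < \infty$, $T$ is the time horizon, $x \in \mathbb{R}^n$ is the state vector, $u \in \mathbb{R}^m$ is the control vector, $f: \mathbb{R}^n \times [0,T] \to \mathbb{R}^n$ represents the drift, $G: \mathbb{R}^n \times [0,T] \to \mathbb{R}^{n \times m}$ represents the actuator dynamics, and $\Sigma: \mathbb{R}^n \times [0,T] \to \mathbb{R}^{n \times \nu}$ represents the diffusion. The Stochastic Optimal Control problem can be formulated as minimization of an expected cost functional given by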


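$$J(x(t),t) = \mathbb{E}_{\mathbb{Q}}\left[ g(x(T)) + \int_t^T \left( q(x(s)) + \tfrac{1}{2}\, u(s)^\top R\, u(s) \right) ds \right], \tag{2}$$

where $g: \mathbb{R}^n \to \mathbb{R}^+$ is the terminal state cost, $q: \mathbb{R}^n \to \mathbb{R}^+$ is the running state cost and $R$ is an $m \times m$ positive definite matrix. The expectation is taken with respect to the probability measure $\mathbb{Q}$ over the space of trajectories induced by the controlled stochastic dynamics. With the set of all admissible controls $\mathcal{U}$, we can define the value function as,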


$$V(x(t),t) = \inf_{u(\cdot) \in \mathcal{U}} J(x(t),t), \qquad V(x(T),T) = g(x(T)). \tag{3}$$

Using stochastic Bellman’s principle, as shown in [13], if the value function is in $C^{1,2}$, then Ito’s differentiation rule shows that it satisfies the Hamilton-Jacobi-Bellman equation,


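$$V_t + \inf_{u \in \mathcal{U}} \left\{ \tfrac{1}{2}\,\mathrm{tr}\!\left(V_{xx}\Sigma\Sigma^\top\right) + V_x^\top (f + G u) + q + \tfrac{1}{2}\, u^\top R\, u \right\} = 0, \qquad V(x(T),T) = g(x(T)), \tag{4}$$

where $V_x$, $V_{xx}$ denote the gradient and Hessian of $V$ respectively. The explicit dependence on independent variables in the PDE above and henceforth all PDEs in this paper is omitted for the sake of conciseness, but will be maintained for their corresponding SDEs for clarity. For the chosen form of the cost functional integrand, the infimum operation can be carried out by taking the gradient of the terms inside, known as the Hamiltonian, with respect to $u$ and setting it to zero,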


$$G^\top V_x + R\, u = 0. \tag{5}$$

Therefore, the optimal control is obtained as


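$$u^* = -R^{-1} G^\top V_x. \tag{6}$$

Plugging the optimal control back into the original HJB equation, the following form of the equation is obtained,

$$V_t + \tfrac{1}{2}\,\mathrm{tr}\!\left(V_{xx}\Sigma\Sigma^\top\right) + V_x^\top f + q - \tfrac{1}{2}\, V_x^\top G R^{-1} G^\top V_x = 0, \qquad V(x(T),T) = g(x(T)). \tag{7}$$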
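As a quick numerical illustration of the stationarity condition, the sketch below evaluates the control-dependent part of the Hamiltonian, $V_x^\top G u + \tfrac{1}{2} R u^2$, for a single control channel and checks that $u^* = -R^{-1} G^\top V_x$ minimizes it. The function names and all numbers (`grad_v`, `G`, `R`) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def hamiltonian_control_terms(u, grad_v, G, R):
    """Control-dependent part of the Hamiltonian: V_x^T G u + 0.5 * R * u^2."""
    gt_vx = sum(g_i * vx_i for g_i, vx_i in zip(G, grad_v))  # G^T V_x
    return gt_vx * u + 0.5 * R * u * u


def optimal_control(grad_v, G, R):
    """Unconstrained minimizer u* = -R^{-1} G^T V_x for one control channel."""
    gt_vx = sum(g_i * vx_i for g_i, vx_i in zip(G, grad_v))
    return -gt_vx / R


# Hypothetical numbers: two states, one control channel.
grad_v = [0.8, -1.5]  # assumed value-function gradient V_x at some state
G = [0.0, 2.0]        # assumed actuator column (control enters the second state)
R = 0.5               # assumed positive control weight

u_star = optimal_control(grad_v, G, R)
h_star = hamiltonian_control_terms(u_star, grad_v, G, R)
print(u_star, h_star)
```

Because $R$ is positive definite the control terms are strictly convex in $u$, so any perturbation of `u_star` increases the value returned by `hamiltonian_control_terms`.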


II-B Non-linear Feynman-Kac lemma

Here we restate the non-linear Feynman-Kac lemma from [20]. Consider the Cauchy problem,


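$$v_t + \tfrac{1}{2}\,\mathrm{tr}\!\left(v_{xx}\Sigma\Sigma^\top\right) + v_x^\top b(x,t) + h(x, v, \Sigma^\top v_x, t) = 0, \qquad v(x,T) = g(x), \tag{8}$$

wherein the functions $\Sigma$, $b$, $h$ and $g$ satisfy mild regularity conditions [20]. Then, (8) admits a unique (viscosity) solution $v: \mathbb{R}^n \times [0,T] \to \mathbb{R}$, which has the following probabilistic representation,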


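$$v(x(t),t) = y(t), \tag{9}$$

$$\Sigma^\top(x(t),t)\, v_x(x(t),t) = z(t), \tag{10}$$

wherein $(x(\cdot), y(\cdot), z(\cdot))$ is the unique solution of the FBSDE system given by,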


$$dx(t) = b(x(t),t)\,dt + \Sigma(x(t),t)\,dw(t), \qquad x(0) = \xi, \tag{11}$$

where, without loss of generality, $w$ is chosen as an $n$-dimensional Brownian motion. The process $x(t)$, satisfying the above forward SDE, is also called the state process. And,


$$dy(t) = -h(x(t), y(t), z(t), t)\,dt + z(t)^\top dw(t), \qquad y(T) = g(x(T)), \tag{12}$$

is the associated backward SDE. The function $h$ is called the generator or driver.

We assume that there exists a matrix-valued function $\Gamma: \mathbb{R}^n \times [0,T] \to \mathbb{R}^{\nu \times m}$ such that the controls matrix $G$ in (1) can be decomposed as $G(x,t) = \Sigma(x,t)\,\Gamma(x,t)$ for all $(x,t)$, satisfying the same mild regularity conditions. This decomposition can be justified as the case of stochastic actuators, where noise enters the system through the control channels. Under this assumption, we can apply the nonlinear Feynman-Kac lemma to the HJB PDE (7) and establish equivalence to (8) with coefficients of (8) given by

$$b(x,t) = f(x,t), \qquad h(x, y, z, t) = q(x) - \tfrac{1}{2}\, z^\top \Gamma R^{-1} \Gamma^\top z. \tag{13}$$
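The equivalence can be checked in two lines. The worked substitution below is a sketch that assumes only the decomposition $G = \Sigma\Gamma$ stated above and the identification $z = \Sigma^\top V_x$ from the probabilistic representation; it rewrites the quadratic term of the HJB PDE (7) and the optimal control (6) in terms of $z$.

```latex
\begin{aligned}
-\tfrac{1}{2}\, V_x^\top G R^{-1} G^\top V_x
  &= -\tfrac{1}{2}\, V_x^\top \Sigma\, \Gamma R^{-1} \Gamma^\top \Sigma^\top V_x
     && (G = \Sigma\Gamma) \\
  &= -\tfrac{1}{2}\, z^\top \Gamma R^{-1} \Gamma^\top z
     && (z = \Sigma^\top V_x), \\
u^* = -R^{-1} G^\top V_x
  &= -R^{-1} \Gamma^\top z .
\end{aligned}
```

Matching the remaining terms of (7) against (8) then leaves the drift $b = f$ and the driver $h = q - \tfrac{1}{2} z^\top \Gamma R^{-1} \Gamma^\top z$, which depends on the value function only through $z$.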


II-C Importance Sampling for Efficient Exploration

There are several cases of systems in which the goal state practically cannot be reached by the uncontrolled stochastic system dynamics. This issue can be eliminated if one is given the ability to modify the drift term of the forward SDE. Specifically, by changing the drift, we can direct the exploration of the state space towards the given goal state, or any other state of interest, reachable by control. Through Girsanov’s theorem [31] on change of measure, the drift term in the forward SDE (11) can be changed if the backward SDE (12) is compensated accordingly. This is known as importance sampling for FBSDEs. It results in a new system of FBSDEs that is, in a certain sense, equivalent to the original one,


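$$d\tilde{x}(t) = \left[ b(\tilde{x}(t),t) + \Sigma(\tilde{x}(t),t)\, K(t) \right] dt + \Sigma(\tilde{x}(t),t)\, d\tilde{w}(t), \qquad \tilde{x}(0) = \xi, \tag{14}$$

along with the compensated BSDE,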


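$$d\tilde{y}(t) = \left[ -h(\tilde{x}(t), \tilde{y}(t), \tilde{z}(t), t) + \tilde{z}(t)^\top K(t) \right] dt + \tilde{z}(t)^\top d\tilde{w}(t), \qquad \tilde{y}(T) = g(\tilde{x}(T)), \tag{15}$$

for any measurable, bounded and adapted process $K: [0,T] \to \mathbb{R}^n$. We refer the readers to the proof of Theorem 1 in [20] for the full derivation of the change of measure for FBSDEs. The PDE associated with this new system is given by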


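$$v_t + \tfrac{1}{2}\,\mathrm{tr}\!\left(v_{xx}\Sigma\Sigma^\top\right) + v_x^\top \left( b + \Sigma K \right) + h - v_x^\top \Sigma K = 0, \qquad v(x,T) = g(x), \tag{16}$$

which is identical to the original problem (8) as we have merely added and subtracted the term $v_x^\top \Sigma K$. Recalling the decomposition of the control matrix in the case of stochastic actuators, the modified drift term can be applied with any nominal control $\bar{u}$ to achieve the controlled dynamics,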


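$$d\tilde{x}(t) = \left[ f(\tilde{x}(t),t) + \Sigma(\tilde{x}(t),t)\,\Gamma(\tilde{x}(t),t)\,\bar{u}(t) \right] dt + \Sigma(\tilde{x}(t),t)\, d\tilde{w}(t), \qquad \tilde{x}(0) = \xi, \tag{17}$$

with $K(t) = \Gamma(\tilde{x}(t),t)\,\bar{u}(t)$. The nominal control $\bar{u}$ can be any open or closed-loop control, a random control, or a control calculated from a previous run of the algorithm.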
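The effect of the drift change on exploration can be seen in a minimal simulation. The sketch below is not the paper's implementation: it assumes a scalar system with zero drift $f$, hypothetical constants `sigma` and `gamma` (so that $G = \Sigma\Gamma = 1$), and a constant nominal control, and it integrates the forward SDE with the Euler-Maruyama scheme.

```python
import math
import random


def simulate_terminal_states(nominal_control, n_paths=2000, n_steps=50,
                             horizon=1.0, seed=0):
    """Euler-Maruyama rollout of dx = sigma * (gamma * u_bar dt + dw), zero drift f."""
    sigma = 0.1   # assumed diffusion coefficient
    gamma = 10.0  # assumed Gamma, so that G = sigma * gamma = 1
    dt = horizon / n_steps
    rng = random.Random(seed)
    terminal = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment, variance dt
            x += sigma * (gamma * nominal_control * dt + dw)
        terminal.append(x)
    return terminal


def mean(values):
    return sum(values) / len(values)


uncontrolled = simulate_terminal_states(nominal_control=0.0)
controlled = simulate_terminal_states(nominal_control=1.0)
print(mean(uncontrolled), mean(controlled))
```

With these assumed numbers the uncontrolled samples stay within a few tenths of the origin, so a goal state near $x = 1$ is essentially never visited, while the samples with the nominal control are centered on it.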

II-D FBSDE Reformulation

Solutions to BSDEs need to satisfy a terminal condition, and thus, integration needs to be performed backwards in time, yet the filtration still evolves forward in time. It turns out that a terminal value problem involving BSDEs admits an adapted solution if one back-propagates the conditional expectation of the process. This was the basis of the approximation scheme and corresponding algorithm introduced in [20]. However, this scheme is prone to approximation errors introduced by least squares estimates which compound over time steps. On the other hand, the Deep Learning (DL)-based approach in [30] uses the terminal condition of the BSDE as a prediction target for a self-supervised learning problem with the goal of using back-propagation to estimate the value function at the initial timestep. This was achieved by treating the value at the initial timestep, $V(x(0), 0)$, as one of the trainable parameters of a DL model. There is a two-fold advantage of this approach: (i) starting with a random guess of $V(x(0), 0)$, the backward SDE can be forward propagated instead, which eliminates the need to back-propagate a least-squares estimate of the conditional expectation and lets the BSDE be treated like the FSDE, and (ii) the approximation errors at every time step are compensated by the backpropagation training process of DL. This is because the individual networks, at every timestep, contribute to a common goal of predicting the target terminal condition and are jointly trained.

Fig. 1: Proposed FC network architecture.

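In this work, we combine the importance sampling concepts for FBSDEs with the Deep Learning techniques that allow for the forward sampling of the BSDE and propose a new algorithm for Stochastic Optimal Control problems. The novelty of our approach is to incorporate importance sampling for efficient exploration in the DL model. Instead of the original HJB equation (7), we focus on obtaining solutions for the modified HJB PDE in (16) by using the modified FBSDE system (14), (15). Additionally, we explicitly compute the control at every time step using the analytical expression for optimal control (6) in the computational graph. Similar to [30], the FBSDE system is solved by integration of both the SDEs forward in time as follows,

$$d\tilde{x}(t) = \left[ f(\tilde{x}(t),t) + \Sigma\,\Gamma\, u^*(t) \right] dt + \Sigma\, d\tilde{w}(t), \qquad \tilde{x}(0) = \xi, \tag{18}$$

$$d\tilde{y}(t) = \left[ -h(\tilde{x}(t), \tilde{y}(t), \tilde{z}(t), t) + \tilde{z}(t)^\top \Gamma\, u^*(t) \right] dt + \tilde{z}(t)^\top d\tilde{w}(t), \qquad \tilde{y}(0) = V(x(0), 0),$$

with $\tilde{z} = \Sigma^\top V_x$ and $u^* = -R^{-1}\Gamma^\top \tilde{z}$, which is (6) rewritten using $G = \Sigma\Gamma$.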
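A single forward step of this pair can be sketched as follows. This is a minimal scalar illustration under stated assumptions, not the authors' code: the function name `fbsde_euler_step`, the drift and cost lambdas, and all numerical values are hypothetical, and the value gradient `grad_v` is simply passed in where a network prediction would be used.

```python
def fbsde_euler_step(x, y, grad_v, dw, dt, f, q, sigma, gamma, r):
    """One forward Euler step of the scalar controlled FBSDE pair (x, y)."""
    z = sigma * grad_v                                   # z = Sigma^T V_x
    u_star = -(gamma * z) / r                            # u* = -R^{-1} Gamma^T z
    h = q(x) - 0.5 * z * gamma * (1.0 / r) * gamma * z   # driver, unconstrained case
    x_next = x + (f(x) + sigma * gamma * u_star) * dt + sigma * dw
    y_next = y + (-h + z * gamma * u_star) * dt + z * dw
    return x_next, y_next, u_star


# Hypothetical scalar example; the noise increment is set to zero for readability.
x_next, y_next, u_star = fbsde_euler_step(
    x=1.0, y=3.0, grad_v=2.0, dw=0.0, dt=0.1,
    f=lambda x: -x, q=lambda x: x * x, sigma=0.5, gamma=2.0, r=1.0,
)
print(x_next, y_next, u_star)
```

Both updates use the same noise increment `dw` and the same control, which is what couples the value process to the explored state trajectory.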




III Stochastic Control Problems with Control Constraints

The framework we have considered so far can be suitably modified to accommodate a certain type of control constraints, namely upper and lower bounds $\pm u^{\max}$. Specifically, each control component satisfies $|u_j| \leq u_j^{\max}$ for all $j = 1, \dots, m$. Such control constraints are common in mechanical systems, where control forces and/or torques are bounded, and may be readily introduced in our framework via the addition of a “soft” constraint, integrated within the cost functional.

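Prior work on constrained trajectory optimization typically dealt with deterministic problems and made use of tools from constrained quadratic programming [32] to compute the optimal controls. Here we take a different approach that incorporates the control constraints in the HJB equation by defining an appropriate control cost function. Indeed, one can replace the cost functional given by (2) with

$$J(x(t),t) = \mathbb{E}_{\mathbb{Q}}\left[ g(x(T)) + \int_t^T \Big( q(x(s)) + \sum_{j=1}^{m} S_j(u_j(s)) \Big) ds \right],$$

where

$$S_j(u_j) = c_j \int_0^{u_j} \mathrm{sig}^{-1}\!\left( \frac{v}{u_j^{\max}} \right) dv, \qquad j = 1, \dots, m,$$

$c_j$ are constant weights, $\mathrm{sig}(\cdot)$ denotes the sigmoid (tanh-like) function that saturates at infinity, i.e., $\mathrm{sig}(\pm\infty) = \pm 1$, while $v$ is a dummy variable of integration. A suitable example along with its inverse is

$$\mathrm{sig}(v) = \frac{2}{1 + e^{-v}} - 1, \qquad \mathrm{sig}^{-1}(\mu) = \log\!\left( \frac{1 + \mu}{1 - \mu} \right), \quad |\mu| < 1.$$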


Following the same procedure as in Section II, we set the derivative of the Hamiltonian equal to zero and obtain


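$$G^\top V_x + \left[ c_1\, \mathrm{sig}^{-1}\!\left( \tfrac{u_1}{u_1^{\max}} \right), \; \dots, \; c_m\, \mathrm{sig}^{-1}\!\left( \tfrac{u_m}{u_m^{\max}} \right) \right]^\top = 0.$$

By introducing the notation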

$$G(x,t) = \left[\, g_1(x,t) \;\; g_2(x,t) \;\; \cdots \;\; g_m(x,t) \,\right],$$

where $g_i$ (not to be confused with the terminal cost $g$) denotes the $i$-th column of $G$, we may write the optimal control in component-wise notation as


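$$u_j^* = u_j^{\max}\, \mathrm{sig}\!\left( -\frac{1}{c_j}\, g_j^\top V_x \right), \qquad j = 1, \dots, m.$$

The optimal control can be written equivalently in vector form. Indeed, if $u^{\max} \in \mathbb{R}^m$ is the vector of bounds, $R^{-1} = \mathrm{diag}(1/c_1, \dots, 1/c_m)$ is a diagonal matrix of the reciprocals of the weights and $U^{\max} = \mathrm{diag}(u^{\max})$ is a diagonal matrix of the bounds, one readily obtains

$$u^* = U^{\max}\, \mathrm{sig}\!\left( -R^{-1} G^\top V_x \right), \tag{26}$$

with $\mathrm{sig}(\cdot)$ applied element-wise.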
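The saturation behaviour of this control law is easy to check numerically. The following sketch assumes the example sigmoid $\mathrm{sig}(v) = 2/(1 + e^{-v}) - 1$, which equals $\tanh(v/2)$, and uses hypothetical values for $G^\top V_x$, the weights $c_j$ and the bounds $u_j^{\max}$; none of these numbers come from the paper.

```python
import math


def sig(v):
    """Sigmoid with range (-1, 1): 2 / (1 + exp(-v)) - 1, computed as tanh(v / 2)."""
    return math.tanh(0.5 * v)


def sig_inv(mu):
    """Inverse of sig on (-1, 1): log((1 + mu) / (1 - mu))."""
    return math.log((1.0 + mu) / (1.0 - mu))


def constrained_control(gt_vx, weights, u_max):
    """Component-wise u_j* = u_j^max * sig(-(1 / c_j) * g_j^T V_x)."""
    return [um * sig(-g / c) for g, c, um in zip(gt_vx, weights, u_max)]


gt_vx = [-40.0, 0.2]   # hypothetical entries of G^T V_x
weights = [1.0, 1.0]   # hypothetical weights c_j
u_max = [10.0, 3.0]    # hypothetical bounds u_j^max
u_star = constrained_control(gt_vx, weights, u_max)
print(u_star)
```

A large value-function gradient drives the first channel to its bound, while the second channel stays in the nearly linear region of the sigmoid. Writing the sigmoid as `tanh` is a design choice of this sketch that avoids overflow in `exp` for large arguments.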


Substituting the constrained optimal control into (16) results in


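$$v_t + \tfrac{1}{2}\,\mathrm{tr}\!\left(v_{xx}\Sigma\Sigma^\top\right) + v_x^\top \left( f + \Sigma K \right) + h - v_x^\top \Sigma K = 0, \qquad v(x,T) = g(x),$$

where $h$ is specified by the expression that follows:

$$h = q + v_x^\top G\, u^* + \sum_{j=1}^{m} c_j \int_0^{u_j^*} \mathrm{sig}^{-1}\!\left( \frac{v}{u_j^{\max}} \right) dv, \tag{28}$$

with $u^*$ the constrained optimal control.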

Fig. 2: Proposed LSTM network architecture.

IV Deep FBSDE Controller

In this section we introduce a simple Euler time discretization scheme and formulate algorithms for solution of stochastic optimal control using two neural network architectures.

IV-A Algorithm

The task horizon $0 < t < T$ in continuous-time can be discretized as $t = \{0, 1, \dots, N\}$, where $T = N \Delta t$. Here we abuse the notation $t$ as both the continuous time variable and the discrete time index. With this we can also discretize all the variables as step functions such that $x_t = x(t)$ if the continuous time lies in the interval $[t\Delta t, (t+1)\Delta t)$ of the discrete time index $t$.

The Deep FBSDE algorithm, as shown in Alg. 1, solves the finite time horizon control problem by approximating the gradient of the value function $V_{x,t}^i$ at every time step with a DNN parameterized by $\theta$. Note that the superscript $i$ is the batch index, and the batch-wise calculation can be implemented in parallel. The initial value $V_0$ and its gradient $V_{x,0}$ are parameterized by trainable variables $\phi$ and are randomly initialized. The optimal control action is calculated using the discretized version of (6) (or (26) for the control constrained case). The dynamics and value function are propagated using the Euler integration scheme, as shown in the algorithm. The function $h$ is calculated using (13) (or (28) for the control constrained case). The predicted final value $V_N^i$ is compared against the true final value $V_N^{i*}$ to calculate the loss. The networks can be trained with any one of the variants of Stochastic Gradient Descent (SGD) such as the Adam optimizer [33] until convergence with custom learning rate scheduling. The trained networks can then be used to predict the optimal control at every time step starting from the given initial condition $\xi$.

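Given:
  $x_0 = \xi$, $f$, $G$, $\Sigma$, $\Gamma$: Initial state and system dynamics;
  $g$, $q$, $R$: Cost function parameters;
  $N$: Task horizon, $K$: Number of iterations, $M$: Batch size; bool: Boolean for constrained control case;
  $u^{\max}$: maximum controls per input channel;
  $\Delta t$: Time discretization; $\lambda$: weight-decay parameter;
Parameters:
  $V_0$: Value function at $t = 0$ (trainable variable $\phi$);
  $V_{x,0}$: Gradient of value function at $t = 0$ (trainable variable $\phi$);
  $\theta$: Weights and biases of all fully-connected and/or LSTM layers;
Initialize neural network parameters;
Initialize states: $x_0^i = \xi$, $V_0^i = V_0$, $V_{x,0}^i = V_{x,0}$;
for $k = 1$ to $K$ do
   for $i = 1$ to $M$ do
      for $t = 0$ to $N - 1$ do
         Compute gamma matrix: $\Gamma_t^i = \Gamma(x_t^i, t)$;
         Compute optimal control: $u_t^{i*} = -R^{-1} (\Gamma_t^i)^\top \Sigma^\top V_{x,t}^i$;
         if bool == True then
            $u_t^{i*} = U^{\max}\, \mathrm{sig}\!\left( -R^{-1} (\Gamma_t^i)^\top \Sigma^\top V_{x,t}^i \right)$;
         end if
         Sample Brownian noise: $\Delta w_t^i \sim \mathcal{N}(0, \Delta t\, I)$;
         Update value function: $V_{t+1}^i = V_t^i - h_t^i \Delta t + (V_{x,t}^i)^\top \Sigma \left( \Gamma_t^i u_t^{i*} \Delta t + \Delta w_t^i \right)$;
         Update system state: $x_{t+1}^i = x_t^i + f(x_t^i, t)\, \Delta t + \Sigma \left( \Gamma_t^i u_t^{i*} \Delta t + \Delta w_t^i \right)$;
         Predict gradient of value function: $V_{x,t+1}^i = \mathrm{NN}(x_{t+1}^i; \theta)$;
      end for
      Compute target terminal value: $V_N^{i*} = g(x_N^i)$;
   end for
   Compute mini-batch loss: $\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \| V_N^{i*} - V_N^i \|_2^2 + \lambda \|\theta\|_2^2$;
   $\theta \leftarrow$ Adam.step($\mathcal{L}$, $\theta$); $\phi \leftarrow$ Adam.step($\mathcal{L}$, $\phi$)
end for
Algorithm 1 Finite Horizon Deep FBSDE Controller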
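The forward pass and loss of the algorithm can be sketched for a scalar system as follows. This is an illustrative stand-in, not the paper's implementation: the FC/LSTM networks are replaced by a hypothetical one-parameter linear model $V_x(x) = \theta x$, the dynamics and costs are assumed, weight decay and the optimizer step are omitted, and only the terminal-condition loss is computed.

```python
import math
import random


def rollout_loss(v0, theta, n_steps=20, batch=64, dt=0.05, seed=0):
    """Scalar forward rollout in the style of Alg. 1; returns the mean terminal loss.

    v0 is the trainable initial value; theta is the slope of a stand-in linear
    model V_x(x) = theta * x used in place of the FC/LSTM networks.
    """
    sigma, gamma, r = 0.5, 2.0, 1.0   # assumed dynamics and control weight
    f = lambda x: -x                  # assumed stable linear drift
    q = lambda x: x * x               # assumed running state cost
    g = lambda x: x * x               # assumed terminal cost
    rng = random.Random(seed)
    total = 0.0
    for _ in range(batch):
        x, v = 1.0, v0                # initial state and initial value guess
        for _ in range(n_steps):
            grad_v = theta * x        # stand-in for the network prediction
            z = sigma * grad_v
            u = -(gamma * z) / r      # unconstrained optimal control
            h = q(x) - 0.5 * z * gamma * gamma * z / r
            dw = rng.gauss(0.0, math.sqrt(dt))
            v = v + (-h + z * gamma * u) * dt + z * dw            # value update
            x = x + (f(x) + sigma * gamma * u) * dt + sigma * dw  # state update
        total += (g(x) - v) ** 2      # squared terminal-condition error
    return total / batch


loss_reasonable = rollout_loss(v0=0.5, theta=1.0)
loss_bad_guess = rollout_loss(v0=100.0, theta=1.0)
print(loss_reasonable, loss_bad_guess)
```

In a full implementation the loss would be differentiated with respect to `v0` and the network weights by an automatic differentiation framework; here a poor initial value guess simply shows up as a much larger terminal loss.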

IV-B Network Architecture

The network architecture proposed in fig. 1, is an extension of that proposed in [30] with additional connections that use the gradient of the value function at every time step for optimal feedback control. A similar architecture was introduced in [34] to solve model-based Reinforcement Learning (RL) problems posed as finite time horizon SOC problems. This was designed to predict time varying controls by parameterizing the controller at every time step by an independent FC network as shown in fig. 3. The networks are stacked together to form one large deep network which is then trained in an end-to-end fashion.

In our proposed architecture, we choose to apply the optimal control (see (18)) calculated using the value function gradient predicted by the network as the nominal control. This, however, creates a new path for gradient backpropagation through time [35] which introduces both advantages and challenges for training the networks. The advantage is a direct influence of the weights on the state cost, leading to accelerated convergence. Nonetheless, this path also leads to the vanishing gradient problem, which has been known to plague training of Recurrent Neural Networks for long sequences (or time horizons).

Fig. 3: Diagram of the proposed FC network at one time step.
Fig. 4: Diagram of the proposed LSTM network at one time step.

To tackle this problem, we propose a new LSTM-based network architecture, as shown in fig. 2 and fig. 4, which can effectively deal with the vanishing gradient problem [36] as it allows for the gradient to flow unchanged. Additionally, since the weights are shared across all time steps, the total number of parameters to train is far smaller than for the FC structure. These features allow the algorithm to scale to optimal control problems with long time horizons. Intuitively, one can also think of the use of LSTM as modeling the time evolution of the value function gradient, in contrast to the FC structure, which acts independently at every time step.
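The difference in parameter count can be made concrete with a back-of-the-envelope calculation. The sketch below assumes 2-hidden-layer FC networks (one per time step) and a 2-layer LSTM with a linear output layer shared across time, using the standard LSTM count of four gates with input weights, recurrent weights and one bias each. The state dimension, hidden width and number of time steps are hypothetical and not taken from the paper.

```python
def fc_params(n_in, hidden, n_out):
    """Weights and biases of a fully connected network with 2 hidden layers."""
    return ((n_in * hidden + hidden)
            + (hidden * hidden + hidden)
            + (hidden * n_out + n_out))


def lstm_layer_params(n_in, hidden):
    """Standard LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (hidden * (n_in + hidden) + hidden)


def lstm_params(n_in, hidden, n_out):
    """Two stacked LSTM layers followed by a linear output layer."""
    return (lstm_layer_params(n_in, hidden)
            + lstm_layer_params(hidden, hidden)
            + (hidden * n_out + n_out))


n_state, hidden, n_steps = 4, 16, 75  # hypothetical sizes for illustration
fc_total = n_steps * fc_params(n_state, hidden, n_state)  # one FC net per step
lstm_total = lstm_params(n_state, hidden, n_state)        # shared across time
print(fc_total, lstm_total)
```

Under these assumptions the FC total grows linearly with the number of time steps, while the LSTM total is independent of the horizon.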

V Experiments

We applied the Deep FBSDE controller to pendulum, cartpole and quadcopter systems for the task of reaching a target final state. The trained networks are evaluated over 128 trials and the results are compared between the different network architectures for both the unconstrained and control constrained case. We use FC to denote experiments with the network architecture in fig. 1 and LSTM for the architecture in fig. 2. We use 2 layer FC and LSTM networks and tanh activation for all experiments. All experiments were conducted in TensorFlow [37] on an Intel i7-4820k CPU Processor.

In all plots, the solid line represents the mean trajectory, and shaded region shows the 95% confidence region. To differentiate between the 4 cases, we use blue for unconstrained FC, green for unconstrained LSTM, cyan for constrained FC and magenta for constrained LSTM.

V-A Pendulum

The algorithm was applied to the pendulum system for the swing-up task with a time horizon of 1.5 seconds. The equation of motion for the pendulum is given by


The initial pendulum angle is 0 , and the target pendulum angle and rate are and 0 respectively. A maximum torque constraint of is used for the control constrained cases.

Fig. 5: Pendulum states. Left: Pendulum Angle; Right: Pendulum Rate.
Fig. 6: Pendulum controls.

Fig. 5 shows the state trajectories across the 4 cases. It can be observed that the swing-up task is completed in all cases with low variance. However, the pendulum rate does not return to 0 for unconstrained FC, as compared to unconstrained LSTM. When the control is constrained, the pendulum angular rate becomes serrated for FC while remaining smooth for LSTM. This is also more noticeable in the control torques (fig. 6). The control torques become very spiky for FC due to the independent networks at each time step. On the other hand, the hidden temporal connection within LSTM allows for a smooth and optimally behaved control policy.

V-B Cart Pole

The algorithm was applied to the cart-pole system for the swing-up task with a time horizon of 1.5 seconds. The equations of motion for the cart-pole are given by


The initial pole angle is 0 , and the target pole angle is with target pole and cart velocities of 0 and 0 respectively. Note that despite the target of 0 for cart position, we do not penalize non-zero cart position in training. A maximum force constraint of 10 is used for the control constrained case.

Fig. 7: Cart Pole states. Top Left: Pole Angle; Top Right: Pole Rate; Bottom Left: Cart Position; Bottom Right: Cart Velocity.
Fig. 8: Cart Pole controls.

The cart-pole states are shown in fig. 7. Similar to the pendulum experiment, the swing-up task is completed with low variance across all cases. Interestingly, when control is constrained, both FC and LSTM swing the pole in the direction opposite to the target at first and utilize momentum to complete the task. Another interesting observation is that in the unconstrained case, the LSTM-policy is able to exploit long-term temporal connections to initially apply large controls to swing up the pole and then focus on decelerating the pole for the rest of the time horizon, whereas the FC-policy appears to be more myopic, resulting in a delayed swing-up action. Similar to the pendulum experiment, under control constraints the FC-policy results in sawtooth-like controls while the LSTM-policy outputs smooth control trajectories.

Fig. 9: Quadcopter states. Top Left: X Position; Top Right: X Velocity; Bottom Left: Y Position; Bottom Right: Y Velocity.
Fig. 10: Quadcopter states. Top Left: Z Position; Top Right: Z Velocity; Bottom Left: Roll Angle; Bottom Right: Roll Velocity.
Fig. 11: Quadcopter states. Top Left: Pitch Angle; Top Right: Pitch Velocity; Bottom Left: Yaw Angle; Bottom Right: Yaw Velocity.
Fig. 12: Quadcopter controls.

V-C Quadcopter

The algorithm was applied to the quadcopter system for the task of flying from its initial position to a target final position with a time horizon of 2 seconds. The quadcopter dynamics used is described in detail by Habib et al. [38]. The initial condition is 0 across all states, and the target is 1 upward, forward and to the right from the initial location with zero velocities and attitude. The controls are motor torques. A maximum torque constraint of 3 is imposed for the control constrained case.

This task required individual FC networks. After extensive experimentation, we conclude that tuning the FC-based policy becomes significantly more difficult and cumbersome as the time horizon of the task increases. On the other hand, tuning our proposed LSTM-based policy required no more effort than for the cart-pole and pendulum experiments. Moreover, the shared weights across all time steps result in faster build-times and run-times of the TensorFlow computational graph. As seen in figs. 9-12, the performance of the LSTM-based policies surpassed that of the FC-based policies (especially for the attitude states) due to exploiting long term temporal dependence and ease of tuning.

VI Conclusions

In this paper, we proposed the Deep FBSDE Control algorithm, with both FC-based and a novel LSTM-based architecture, to solve finite time horizon Stochastic Optimal Control problems for nonlinear systems with control-affine dynamics. Our work relies on prior work on importance sampling of FBSDEs and the efficiency of recurrent neural networks in the ability to capture the temporal dependence of the value function and its gradient.

There are three observations that are essential for the application of the proposed methods to robotic and autonomous systems. First, the LSTM-based architecture is capable of providing smooth controls, in stark contrast to the FC-based architecture. This feature makes the LSTM-based architecture suitable for deployment to real robotic systems. The second observation is that the importance sampling approach is key for scaling the proposed algorithms to high dimensional systems. While the aforementioned importance sampling scheme was first introduced in [21], the LSTM-based architecture introduced in this work significantly increases its effectiveness for high dimensional systems.

Finally, our control-constrained stochastic optimal control formulation is essential for robotic control applications, since robotic systems very often operate under input saturation limits and control constraints. Future research directions include alternative neural network architectures and stochastic control problem formulations.