The theory of forward-backward stochastic differential equations (FBSDEs) has been gaining traction as a framework for solving nonlinear stochastic control problems, including optimal control problems with quadratic cost [exarchos2018stochastic], minimum-fuel (L1 running cost) problems [Exarchos2018], differential games [Exarchos2016], and reachability problems [exarchos2018stochastic, mete2002stochastic]. Although initial results demonstrate promise in terms of flexibility and theoretical validity, numerical algorithms which leverage this theory have not yet matured. Even for modest problems, state-of-the-art algorithms often suffer from slow and unstable convergence to the optimal policy. Producing more robust numerical methods is critical for the broader adoption of FBSDE methods in real-world tasks.
FBSDE numerical solution methods broadly consist of two steps: a forward pass, which generates Monte Carlo samples of the forward stochastic process, and a backward pass, which iteratively approximates the value function backwards in time. Typically, FBSDE methods perform this approximation with a scheme called least-squares Monte Carlo (LSMC), which uses conditional expectation projection to implicitly solve the backward SDE by turning the problem into a parametric function approximation problem [Longstaff2001]. Often, the approximate value function fit in the backward pass is used to improve sampling in an updated forward pass, leading to an iterative algorithm which, ideally, improves the approximation until convergence. Although FBSDE methods bear a distinct similarity to differential dynamic programming (DDP) techniques [jacobson1970differential, theodorou2010stochastic, NIPS2007_3297], DDP is generally less flexible. For most DDP applications, a running cost that is strictly positive definite with respect to the control is required for convergence [tassa2011theory, Section 2.2.3]. Furthermore, in DDP, the computation of first- and second-order derivatives of both the dynamics and the costs is necessary for the backward pass, making it challenging to apply this approach to problems where these derivatives are not known analytically. In contrast, FBSDE techniques only require a good fit of the value function and evaluation of the gradient of this value function to obtain the optimal control.
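The LSMC projection step described above can be sketched as a simple regression: the conditional expectation is approximated by projecting sampled backward targets onto a basis of functions of the state. The following is a minimal illustration under our own assumptions (the function name `lsmc_step` and the monomial basis are placeholder choices, not the paper's implementation):

```python
import numpy as np

# Hypothetical one-step LSMC projection: approximate the conditional
# expectation E[target | X = x] by regressing sampled targets onto
# polynomial features of the state samples.
def lsmc_step(X, targets, degree=2):
    """X: (N, n) state samples at one time step; targets: (N,) backward values."""
    # Build a simple monomial feature matrix [1, x, x^2, ...] per dimension.
    feats = [np.ones(len(X))]
    for d in range(X.shape[1]):
        for p in range(1, degree + 1):
            feats.append(X[:, d] ** p)
    Phi = np.stack(feats, axis=1)
    # Least squares solves the projection onto the span of the basis.
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta, Phi

# Usage: if the targets are exactly quadratic in the state, the projection
# recovers the quadratic exactly.
X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
theta, Phi = lsmc_step(X, 3.0 + 2.0 * X[:, 0] ** 2)
```

In the full method, this regression is repeated backwards over every time step, with the fitted function at one step supplying the targets for the previous step.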
Fig. 1: Inverted pendulum problem. Hue corresponds to the path integral heuristic used for weighing particles in the backward pass and for pruning the tree (green values are smaller). The blue and black dashed lines are the means of trajectory rollouts following the policies computed at the end of the 1st and 6th iterations, respectively. Control counts are based on trajectory rollouts of the 6th-iteration policy computed by FBRRT. The hue of each rectangle indicates the relative frequency of each control signal at each time step.
The flexibility of FBSDE algorithms stems from their relation to Feynman-Kac-type formulae, which connect the solution of a broad class of second-order parabolic or elliptic PDEs to the solution of FBSDEs [yong1999stochastic], brought to prominence in [Pardoux1990, peng1993backward, el1997backward]. Both the Hamilton-Jacobi-Bellman (HJB) and Hamilton-Jacobi-Isaacs (HJI) second-order PDEs, utilized for solving stochastic optimal control problems and stochastic differential games respectively, can thus be solved via FBSDE methods, even when the costs and dynamics are nonlinear. This theoretically allows for the solution of PDEs typically solved using grid-based methods such as the Level Set Toolbox [Mitchell2004], which are known to scale poorly to higher-dimensional systems.
Although FBSDE methods have received some attention in the past, especially in the mathematical finance community [Bender2007, Longstaff2001, ma2007forward], the application of these methods to physical systems has only seen significant progress recently [exarchos2018stochastic, Exarchos2018a, Exarchos2018]. The primary advantage of these methods is that they produce an unbiased estimator of the value function associated with the HJB equation. The challenge in their utilization is that naïve application of the theory leads to estimators with high variance. Simply put, if the forward pass never produces sample trajectories close to the optimal trajectories, then the value function cannot be estimated accurately in those regions. Recent work has shown that Girsanov's theorem can be used to change the sampling measure of the forward pass without adding intrinsic bias to the estimator [exarchos2018stochastic, Exarchos2018, Exarchos2018a]. That is, a change of probability measure corresponds to the introduction of a drift term in the forward SDE that can be employed to modify the sampling in the forward pass; this, in turn, requires an appropriate accommodation of the change of measure in the backward pass.
In this work we expand upon this result by showing that the forward sampling measure can be modified at will, which enables us to incorporate methods from other domains, namely, RRTs [lavalle2001randomized] (see [noreen2016optimal] for a recent survey), in order to explore the state space more efficiently in the forward pass. RRTs are frequently applied to reachability-type motion planning problems, biasing the samples towards regions of the state space that have low density. Using RRTs in the forward sampling allows us to spread samples evenly over the reachable state space, increasing the likelihood that near-optimal samples are well-represented in the forward pass sample distribution. By sampling more efficiently and relying less on incremental approximations of the value function to guide our search, we can achieve faster and more robust convergence than previous FBSDE methods.
The primary contributions of this paper are as follows:
Framing forward drift selection as a McKean-Markov branched particle sampler, allowing the use of RRTs and similar methods in the forward pass of FBSDE methods.
Showing how local-entropy-weighted LSMC techniques can be used to concentrate value function approximation accuracy in regions likely to have optimal trajectories, to compensate for the broader search induced by the forward pass.
Demonstrating how the forward sampling RRT tree can be pruned and re-expanded to create an iterative method.
We call the proposed method forward-backward rapidly exploring random trees (FBRRT). After describing the approach in both theory and numerical implementation, we apply FBRRT to a few problems, compare it to [Exarchos2018], and demonstrate its ability to solve nonlinear stochastic optimal control problems with non-quadratic running costs. Although local-entropy path integral theory and RRTs have previously been combined in PI-RRT [arslan2014], that method is more closely related to the path-integral approach to control [theodorou2010stochastic]. Our method similarly performs forward passes to broadly sample the state space, but follows them with backward passes to obtain approximations of the value function, thereby obtaining closed-loop policies over the full horizon.
II Forward-Backward Stochastic Differential Equations and Optimal Control
In this section, we briefly introduce the theoretical foundation behind FBSDE methods and demonstrate how they allow us to solve stochastic optimal control problems. We start with a complete, filtered probability space, on which a standard multi-dimensional Brownian (Wiener) process is defined with respect to the reference probability measure and adapted to the filtration. Consider a stochastic nonlinear system governed by the control-affine Itô differential equation
where the state process is progressively measurable on the time interval, the control input process is progressively measurable and takes values in a convex compact set, and the drift and diffusion coefficient functions are bounded, continuously differentiable in time, and twice continuously differentiable in the state. We further assume that the diffusion matrix is uniformly strictly positive definite.
Associated with (1) is the stochastic optimal control problem, which seeks to determine the value function defined by
where the expectation is conditioned on the current state under the reference probability measure, the running cost is bounded, continuously differentiable in time, and twice continuously differentiable in the state, and the terminal cost is bounded and thrice continuously differentiable in the state. Below, and for brevity of exposition, arguments are omitted from the functions where they are clear from context.
It is well known that the solution to the (SOP) problem satisfies the Hamilton-Jacobi-Bellman (HJB) PDE [fleming2006controlled, Chapter 4, Corollary 3.2]
where, by a slight abuse of notation, we denote by subscripts the partial derivative with respect to time, the gradient with respect to the state, and the Hessian with respect to the state. By [fleming2006controlled], under the parameter assumptions above, and especially the uniform positive definiteness of the diffusion matrix, the (HJB) equation possesses a unique classical solution which coincides with the value function defined by (SOP). (Under the boundedness assumptions on the parameters, the value function and its partial derivatives are bounded in addition to being continuous.) To satisfy the theorem in the following paragraph, in addition to the assumptions above, we further assume that the minimizing control is selected such that the resulting drift is uniformly continuous in time and Lipschitz continuous in the state.
By the Feynman-Kac theorem [yong1999stochastic, Chapter 7, Theorem 4.5, (4.29)], the solution to (HJB) has a representation
almost surely (a.s.), where the associated forward and backward processes solve the forward-backward SDEs
and the driving noise is a Brownian process in the filtered probability space.
II-A Drifted FBSDEs
We now show a result first discussed in [Bender2010] and later in [exarchos2018stochastic], which states that adding a drift term to the forward SDE (4), and compensating appropriately in the backward SDE (5), only changes the measure under which (3) holds. To this end, let a Brownian process be given on the original space, and let the drift and its compensating process be any bounded adapted processes.
Define the Doléans-Dade exponential process
a form which has the property that the resulting process is a strictly positive martingale with unit mean for all times. The new measure is defined via the relationship
where the exponential process is the Radon-Nikodym derivative between the restrictions of the two measures to trajectories over the interval. By Girsanov's theorem (see, e.g., [fleming1976deterministic]), the new measure defines a probability space and, furthermore,
defines a standard Brownian process with respect to the new measure. A further consequence of Girsanov's theorem is illustrated, through an abuse of notation, by substituting this relationship into equations (4, 5), with the compensating process chosen to match the drift. Performing this substitution shows that the solution of the FBSDE defined by (4, 5) in the original filtered probability space corresponds to the solution of the drifted FBSDE
in the new probability space. Henceforth, we refer to the added term as the drift process.
II-B Equivalent Measures
The Radon-Nikodym random variable can be interpreted as the density of the drifted path measure with respect to the original path measure. Further, since it is strictly positive with unit mean, the new measure is called an equivalent measure to the original, that is, each is absolutely continuous with respect to the other [fleming2006controlled, Remark, p. 142]. Equivalent probability measures have the characteristic that whenever a property holds a.s. in terms of one measure, the same property holds a.s. in the equivalent measure. Thus, the equalities in (3) hold a.s. under the new measure, in addition to the original one. An important consequence of the equivalence under this change of measure is that, by applying the conditional expectation to both sides of the equalities,
for all times in the interval and for any equivalent measure defined according to (7).
A result related to Girsanov's theorem (see Proposition I.1 in the appendix) shows that if we, conversely, define the density random variable directly, there exists some drift with the same relationships discussed above. This demonstrates that we can produce measures equivalent to the original in two different ways: either by choosing any drift process, or by reweighing the measure directly via a normalized likelihood random variable in an importance sampling-like scheme. Further, using the transitive property of equivalent measures (following from [Shiryaev1996]), we can choose a drift process to produce one measure, then reweigh this measure to produce another, both equivalent to the original.
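The drift-and-reweigh mechanism above can be checked numerically on a single Euler step with unit diffusion: sample the increment under a drifted proposal, weight each sample by the discretized Doléans-Dade exponential, and recover expectations under the undrifted measure. This is an illustrative sketch under our own assumptions (the names `K`, `dt`, and the scalar unit-diffusion setting are ours, not the paper's notation):

```python
import numpy as np

# One-step importance-sampling check of the change of measure.
rng = np.random.default_rng(0)
dt, K, N = 0.1, 1.0, 200_000
dW = rng.normal(scale=np.sqrt(dt), size=N)   # Brownian increments under the drifted measure
dX = K * dt + dW                             # drifted forward step (unit diffusion)
w = np.exp(-K * dW - 0.5 * K**2 * dt)        # discretized Doleans-Dade (Radon-Nikodym) weight
# Under the original measure the increment is zero-mean, and the weights
# average to one, so the weighted sample statistics should reflect that.
```

The weights have unit expectation by construction, and reweighing the drifted increments removes the bias introduced by the sampling drift, which is exactly the property exploited in the backward pass.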
III McKean-Markov Formulation
We now demonstrate how FBSDEs can be approximated by a branching particle process, drawing on the McKean-Markov theory presented in [del2013mean]. We begin by introducing a discretization of the SDEs and discuss how the forward process can be approximated by a scheme similar to a particle filter, where some particle trajectories terminate before the horizon and others branch into multiple paths. Next, we show how the backward pass estimates the value function by solving a series of weighted least squares problems in a path integral-like method.
III-A Discretization of the FBSDE
The (FBSDE) system is discretized according to a given partition of the time interval. For each process, we shorten the notation at the partition times with a time-step subscript for brevity. Let any solution to (FBSDE) be computed using the chosen drift. By integrating the SDEs over each time step, we have that the relations
hold a.s. for each step of the partition, where
Using (3), the boundary conditions at the end of the interval yield
The conditional expectation eliminates the Itô integral in (11), because the integral is zero-mean when conditioned on the beginning of the step. For the purposes of approximating (12), note that, given the value function at the end of the step, (3) allows us to deduce the value
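A single time step of the discretized, drifted forward SDE reduces to an Euler-Maruyama update. The sketch below is a minimal illustration under assumed names (`b`, `Sigma`, and `K` stand in for the problem's dynamics drift, diffusion, and sampling drift; the drifted form dX = b dt + Sigma (K dt + dW) follows the structure described in Section II-A):

```python
import numpy as np

def em_step(x, b, Sigma, K, dt, dW):
    """One Euler-Maruyama step of dX = b(X)dt + Sigma(X)(K dt + dW).

    x: state (n,); b(x): (n,); Sigma(x): (n, m); K: sampling drift (m,); dW: (m,).
    """
    return x + b(x) * dt + Sigma(x) @ (K * dt + dW)

# With zero sampling drift and zero noise, the step reduces to plain Euler
# integration of the nominal dynamics (here an illustrative linear oscillator).
x1 = em_step(np.array([1.0, 0.0]),
             lambda x: np.array([x[1], -x[0]]),
             lambda x: np.eye(2),
             np.zeros(2), 0.1, np.zeros(2))
```

In the forward pass, the Brownian increment `dW` is sampled per particle, while `K` is the design freedom discussed above.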
III-B McKean-Markov Branched Sampling
In this subsection, we show how the spatial branching models presented in [del2013mean] can be used to represent the distribution constructed by the forward SDE (10). We consider the time-discrete approximation of the forward process and its restriction to each subinterval, and introduce the joint random variable of drift and state. We approximate the distribution of this variable with the empirical measure
for each time step, where the empirical measure is supported on a finite set of independent sample particles and built from Dirac-delta measures. We interpret the drift component as the constant value the piecewise-constant drift process takes on the interval, and the state component as the instantaneous value the state process takes at the end of this interval.
The distribution of the Markov chain is approximated by a finite set of sample paths
Each sample path begins at the initial state and ends with one of the particles constituting the current empirical measure. For each particle, the particles along its path constitute its ancestral history.
In previous FBSDE methods each path was sampled independently, such that the particles never affect each other's movement, a design choice we call parallel drift. Although this is known to produce a high-fidelity approximation of an SDE whose dynamics are well known (see, for example, the discussion in [del2013mean]), for solving optimal control problems FBSDE methods may benefit greatly from relaxing this restriction. More specifically, having a good approximation of the forward SDE is not useful if the drift does not resemble the optimal control trajectory distribution. We generalize this choice by allowing paths to branch, and represent the probability measures with a McKean-Markov chain model.
McKean-Markov chains are a generalization of the construction used in particle filters, genetic algorithms, etc. Over each time interval, a measure is constructed by first starting with the previous measure and applying a selection step followed by a mutation step, as follows
In the selection step, particles are chosen for expansion via an arbitrarily-designed heuristic method, some multiple times and others not at all. For each chosen particle a "mutation" is applied, that is, a drift value is produced arbitrarily, which can incorporate both state information from the originating particle and randomly sampled auxiliary values. Using the selected particle and corresponding drift, the forward SDE (10) is approximated with an Euler-Maruyama step to produce the next state. Combining these two values yields one of the particles in the next distribution. Tracing back the selection history of each particle yields the set of sample paths used in the approximate path measure.
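One selection/mutation level of the branched sampler can be sketched as follows. This is a simplified stand-in under assumed names (`grow_level`, uniform selection in place of an RRT heuristic, a constant-drift `drift_fn`, and scalar unit diffusion), intended only to show how the parent indices record the branching ancestry:

```python
import numpy as np

def grow_level(particles, width, drift_fn, dt, rng):
    """Select parents (here uniformly, as a stand-in for any heuristic),
    mutate each with a chosen drift and an Euler-Maruyama step, and return
    the children together with the parent indices that encode ancestry."""
    idx = rng.integers(len(particles), size=width)   # selection step (with replacement)
    parents = particles[idx]
    drifts = drift_fn(parents, rng)                  # mutation: choose a drift per child
    noise = rng.normal(scale=np.sqrt(dt), size=parents.shape)
    children = parents + drifts * dt + noise         # Euler-Maruyama step (unit diffusion)
    return children, idx

rng = np.random.default_rng(0)
parts = np.zeros((5, 2))                             # five 2-D particles at one level
kids, idx = grow_level(parts, 8, lambda p, r: np.ones_like(p), 0.1, rng)
```

Because selection is with replacement, some parents branch into several children and others are dropped, which is exactly the tree structure exploited later for pruning and regrowth.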
Figure 2 illustrates how the representations of the forward measures in parallel-sampled FBSDE compare to the proposed branched-sampled method. Some particles may share parents (e.g., in Figure 2, two particles of the branch-sampled path measure share an ancestor), causing the history used to define path measures to have a branching tree structure. Note that this feature causes the path measures at different times to be inconsistent in the distribution of the same random variables. For example, in Figure 2, due to selection, the distribution of a random variable may exclude a particle in one path measure but include it in another. The interpretation of the selection procedure is that, since the drift variable can covary with the state, the drift can be chosen in such a way as to concentrate the joint distribution differently. Thus, the selection procedure does not alter the distribution of the state, as prescribed by the forward SDE, but rather the joint distribution of drift and state. A finite number of i.i.d. samples from this joint distribution may fail to select all particles in the representation of the previous measure if the joint distribution has low density near some particles. Even though path measures at different times might be inconsistent, the next section will demonstrate that each backward step only uses one of them at a time.
We briefly justify the selection procedure as the approximation of a change of measure. The path measure converges, with a large number of sample paths, to an ideal Feynman-Kac path integral measure [del2013mean, Sec. 4.1.2], where this measure is defined as
where the first factor is a normalization constant, the second is a potential weighing function, and the reference measure is one in which no selection is performed. As noted in Section II-B, choosing the density weighing function directly like this produces an equivalent measure, so the selection procedure produces a measure equivalent to one with no selection. This confirms the validity of this procedure for FBSDE problems. A direct consequence of the convergence of the path measure is that, for any function on paths,
is a good numerical approximation for expectations.
III-C Weighted Path Integral LSMC
In the backward pass, we use the sequence of path integral measures to fit a value function. Each backward step assumes that an approximation of the value function is known at the end of the step, where
is a vector of parameters and the parameterized function is continuously differentiable. The goal of the backward step is to find the parameters which approximate the value function at the beginning of the step.
Suppose we are provided a potential weighing function. We can again use the analysis of Section II-B to show that the corresponding weight random variables, when normalized, produce an equivalent measure
The weighted LSMC method builds on the relationship (13), solving for an approximation of the value function via the least squares problem
by applying a well-known projective property of conditional expectations [resnick2003probability, Chapter 10.3, Property 11]. Here, we use an approach similar to importance sampling. That is, although we take the expectation of (18) in the weighted measure (i.e., the desired distribution), we evaluate it in the forward sampling measure (i.e., the proposal distribution), arriving at the weighted least squares problem
The approximation in (17) demonstrates how the expectation can be evaluated by averaging over sample paths, though we note that only one step backward in the tree is needed for each path.
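The weighted least squares problem above can be solved with the standard square-root-of-weights reduction to an ordinary least squares solve. A minimal sketch under assumed names (`weighted_lsmc`; `Phi` is the basis matrix evaluated at the particles, `y` the backward targets, `w` the normalized weights):

```python
import numpy as np

def weighted_lsmc(Phi, y, w):
    """Solve min_theta sum_i w[i] * (Phi[i] @ theta - y[i])**2 by scaling
    rows with sqrt(w) and calling an ordinary least squares solver."""
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(Phi * sw[:, None], y * sw, rcond=None)
    return theta

# When the targets lie exactly in the span of the basis, any strictly
# positive weighing recovers the same coefficients.
Phi = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
y = Phi @ np.array([1.0, -2.0])
theta = weighted_lsmc(Phi, y, np.linspace(0.1, 1.0, 20))
```

The weights only matter where the basis cannot fit the targets exactly; there they concentrate the residual error away from heavily weighted particles, which is the intended effect of the local-entropy weighing introduced in Section IV-B.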
IV Forward-Backward RRT
In the previous section, we proposed a generalization of FBSDE methods which greatly expands their flexibility in searching for an optimal control policy. In doing so, we have exposed a few design choices, namely, the particle selection procedure, the drift values, and the weighted LSMC potential function. In this section, we propose forward-backward RRT (FBRRT) as a particular implementation of this generalization, although other methods could be developed within the same framework discussed in the previous section.
The FBRRT algorithm takes as input an initial state and produces an array of parameter vectors, each of which parameterizes an approximation of the value function at one of the discrete time instances. The algorithm begins with a forward pass, producing a branching tree in a manner similar to kinodynamic RRT. This is followed by a backward pass, approximating the value function at each time step with weighted least squares polynomial regression. Following the backward pass, particles with low weight, as determined by (25), are pruned from the tree, removing a significant fraction of them. The pruned tree is again fed to the forward pass procedure to grow the tree to its original width, alternating forward and backward passes until convergence.
IV-A Kinodynamic RRT Forward Sampling
In general, we desire resampling methods which seek to explore the whole state space, increasing the likelihood of sampling in the proximity of optimal trajectories. For this reason, we chose kinodynamic RRT methods, proposed in [lavalle2001randomized], as a way to achieve this goal. The selection procedure for this method ensures that the distribution of chosen particles is more uniform over a user-supplied region of interest: it is more likely to select particles which explore empty space and less likely to oversample dense clusters of particles.
With some probability we use the RRT sampling procedure; otherwise, we choose a particle uniformly at random, each particle with equal weight. This ensures that dense particle clusters still receive attention. Thus, the choice of this probability parameter balances exploring the state space against refining the area around the current distribution.
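The mixed selection rule above can be sketched as follows (an illustrative sketch; the names `select_particle`, `bounds`, and `p_rrt` are ours). With probability `p_rrt` an RRT-style step samples a random target in the region of interest and expands the nearest particle, whose Voronoi bias pulls samples toward empty space; otherwise a particle is chosen uniformly, refining dense clusters:

```python
import numpy as np

def select_particle(particles, bounds, p_rrt, rng):
    """particles: (N, n); bounds: (low, high) arrays for the region of interest."""
    if rng.random() < p_rrt:
        # RRT-style selection: expand the particle nearest a random target.
        target = rng.uniform(bounds[0], bounds[1])
        return int(np.argmin(np.linalg.norm(particles - target, axis=1)))
    # Uniform selection: every particle equally likely.
    return int(rng.integers(len(particles)))

rng = np.random.default_rng(1)
parts = np.array([[0.0, 0.0], [10.0, 10.0]])
box = (np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
```

With the region of interest placed around the origin, the RRT branch always expands the particle at the origin here, illustrating how the target distribution steers selection.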
For drift generation we again choose a random combination of exploration and exploitation. The value function approximations define a closed-loop policy
over the full time horizon, coinciding with the optimal control policy when the value function approximation is exact [fleming2006controlled, Chapter 4, Corollary 3.2]. Thus, the optimal drift approximation proposed in [exarchos2018stochastic],
is good for exploitation. However, for exploration, we choose random controls drawn from a distribution provided as a design parameter. For example, for minimum-fuel (L1) problems, where the control is bounded and the running cost is the L1 norm of the control, we select a distribution over the extreme and zero control values, because the optimal policy (20) is guaranteed to return only values in this discrete set.
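The exploration/exploitation mixture for drift generation might look like the following sketch (assumed names throughout: `sample_control`, `p_exploit`, `u_max`; the bang-off-bang exploration set {-u_max, 0, +u_max} is the minimum-fuel example from the text):

```python
import numpy as np

def sample_control(x, policy, u_max, p_exploit, rng):
    """Mix exploitation of the current policy with bang-off-bang exploration."""
    if rng.random() < p_exploit and policy is not None:
        return policy(x)                         # exploit the fitted value function
    return rng.choice([-u_max, 0.0, u_max])      # explore: minimum-fuel control set

rng = np.random.default_rng(0)
# With no fitted policy yet (first iteration), every draw explores.
us = [sample_control(np.zeros(2), None, 2.0, 0.5, rng) for _ in range(50)]
```

Restricting exploration to the discrete set the optimal policy can actually return keeps exploratory rollouts representative of optimal behavior for this cost structure.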
Algorithm 1 sketches the implementation of particle selection and drift generation, producing the forward sampling tree. The algorithm takes as input any tree of smaller width and adds nodes at each depth until the desired width is reached. On the first iteration there are no value function estimate parameters available to exploit, so we maximize exploration using the RRT sampling.
IV-B Local Entropy Backwards Weighing
We now propose a heuristic design choice for the backward pass potential weighing functions and justify it with some theoretical results. We pose a measure-theoretic optimization problem which selects a measure, balancing between matching a target measure and keeping the selected measure close to the forward sampling measure for regularization. We follow an approach which draws from the theory presented in [theodorou2012relative] and apply it to the problem of weighted LSMC.
A dynamic programming principle result following directly from [fleming2006controlled] indicates that
where the infimum is taken over admissible control processes on the interval, each inducing a corresponding measure, and the minimizing measure is the distribution of trajectories controlled by the optimal policy. The law of total expectation further yields that
The heuristic random variable can be numerically estimated at any particle by computing the path integral over the running cost for the first term and using the current approximation of the value function for the second term. By "path integral", we indicate that the integral is approximated along the ancestral path to produce the value associated with the particle, as shown in line 20 of Algorithm 1. The above analysis suggests that minimizing this expectation over the possible measures produces the optimal measure. Thus, we use this random variable as a heuristic value to be minimized.
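The per-particle heuristic just described can be sketched as an accumulated running cost along the ancestral path plus the current value function approximation at the particle (a minimal sketch with assumed names; the left-endpoint quadrature is our simplification):

```python
# Heuristic value of a particle: running cost accumulated along its ancestral
# path plus the current value function approximation at the endpoint.
def heuristic_value(states, controls, run_cost, dt, value_fn):
    """states: list of length T+1; controls: list of length T (one per step)."""
    S = sum(run_cost(x, u) * dt for x, u in zip(states[:-1], controls))
    return S + value_fn(states[-1])

# Usage: constant unit running cost over 5 steps of dt = 0.1, terminal value 2.
h = heuristic_value([0.0] * 6, [0.0] * 5, lambda x, u: 1.0, 0.1, lambda x: 2.0)
```

This is the quantity whose softmin over the particle population supplies the weights in the next paragraphs.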
Consider the local entropy minimization problem
where the temperature is a design parameter and
is the relative entropy, also called the Kullback-Leibler divergence, a semi-metric which measures the closeness of measures (the equality is proved in Proposition I.3 in the appendix). The reference measure is approximated by the forward path measure produced in the previous section, while a minimizer of the problem, likewise approximated empirically, is the measure considered as the desired distribution used for weighing in Section III-C. Keeping the weighted measure relatively close to the forward measure acts as a regularization technique, ensuring the least squares regression does not become ill-conditioned by the introduction of too many small weights. Indeed, as the temperature parameter grows, the weights flatten, so the weighted LSMC becomes equally weighted for every particle.
The problem (23) can be shown (Proposition I.5 in the appendix) to have a minimizing measure when the equality
is satisfied. Thus, we choose the weighing potential functions to be
Algorithm 2 details the implementation of the backward pass with local entropy weighing. The value function is represented by a linear combination of multivariate Chebyshev polynomials up to the 2nd order. The proposed local weighing method is structurally similar to a softmin operation, often used in the deep learning literature [Goodfellow-et-al-2016]. Line 13 does not, theoretically, have an effect on the optimization, since the subtracted term comes out of the exponential as a constant multiplier, but it has the potential to improve the numerical conditioning of the exponential function computation, as discussed in [Goodfellow-et-al-2016]. The temperature is, in general, a parameter which must be selected by the user. For some problems we search over a series of possible parameter values, evaluating each with a backward pass and using the one which produces the smallest expected cost over a batch of trajectory rollouts executing the computed policy.
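The softmin-style weight computation, including the stabilizing shift of line 13, can be sketched as follows (an illustrative sketch; `softmin_weights` and `lam` are our names for the weighing function and the temperature):

```python
import numpy as np

def softmin_weights(S, lam):
    """Local-entropy weights w_i proportional to exp(-(S_i - min_j S_j)/lam).

    Subtracting the minimum leaves the normalized weights unchanged (it factors
    out of the exponential as a constant) but prevents underflow when the
    heuristic values S are large.
    """
    z = np.exp(-(np.asarray(S) - np.min(S)) / lam)
    return z / z.sum()

w = softmin_weights([1.0, 2.0, 3.0], 1.0)
```

Particles with smaller heuristic values receive larger weights, concentrating the regression accuracy where near-optimal trajectories are likely to lie, while a large temperature flattens the weights toward the uniform case.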
V Numerical Results
We evaluated the FBRRT algorithm by applying it to a pair of nonlinear stochastic optimal control problems. For both problems, we used a minimum-fuel (L1) running cost, with a quadratic terminal cost centered at the origin. We implemented the FBRRT algorithm and all examples in Matlab 2019b and ran them on an Intel G4560 3.50GHz processor with 8GB RAM.
Figure 1 illustrates our method applied to the inverted pendulum problem. Note that even though there were no paths in the tree that continued along the 1st iteration’s mean trajectory (blue line) from beginning to end, the algorithm was still able to produce a policy in regions where no particles were produced. The green particles along the backward swing inform the policy in the beginning of the trajectory while the green particles near the origin inform it near the end, despite taking different paths in the tree.
The policies computed after the first few iterations are visualized in Figure 3. Of significant note is that the policy obtained after only one iteration (red hue) already performs quite well in general. For the inverted pendulum problem evaluated in [Exarchos2018], convergence required a large number of iterations, but our method needed only a handful of iterations to achieve comparable performance.
We also compared the convergence speed and robustness of FBRRT to parallel-sampled FBSDE [Exarchos2018] by randomly sampling different starting states and evaluating their relative performance over a number of trials. We normalized the final costs across the initial states by dividing all costs for a particular initial state by the largest cost obtained across both methods. For each iteration, we record the accumulated minimum cost across all previous iterations of that trial, i.e., the current best cost after running that many iterations, regardless of the cost at that iteration. We aggregated these values across initial states and trials into the box plots in Figure 4. Since FBRRT is significantly slower than FBSDE per iteration, due to the RRT nearest-neighbor calculation, we scale each iteration by its runtime. By nearly every comparison, FBRRT converges faster and in fewer iterations than FBSDE, and does so with half as many particle samples.
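The per-trial convergence statistic described above (best cost so far, normalized by the worst final cost) amounts to a running minimum. A minimal sketch with assumed names:

```python
import numpy as np

def best_so_far(costs, worst):
    """Accumulated minimum cost across iterations of one trial, normalized
    by the largest cost observed across both methods for that initial state."""
    return np.minimum.accumulate(np.asarray(costs, dtype=float)) / worst

curve = best_so_far([4.0, 2.0, 3.0, 1.0], 4.0)
```

Each trial's curve is nonincreasing by construction, so aggregating the curves across trials and initial states gives the monotone box plots of Figure 4.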
VI Conclusions and Future Work
In this work, we have proposed a novel generalization of the FBSDE approach to solving stochastic optimal control problems, combining branched sampling techniques with weighted least squares function approximation to greatly expand the flexibility of these methods. Leveraging the efficient space-filling properties of RRT methods, we have demonstrated that our method significantly improves the convergence properties of previous FBSDE methods. We have shown how the proposed method works hand in hand with the proposed local entropy-weighted LSMC method, concentrating function approximation in the regions where optimal trajectories are most likely to be dense. We have demonstrated that FBRRT can generate feedback control policies for nonlinear stochastic optimal control problems with non-quadratic costs.
In future work, we plan to incorporate more recent RRT algorithms, since nearly any such method could be adapted to this approach with the proper bookkeeping. Further, though we did not explicitly address obstacles, which are handled naturally by RRT algorithms, we plan to investigate how FBSDE theory can represent obstacle avoidance. Another significant area of research worth investigating is finding better methods of value function representation. Although 2nd-order polynomials generally produce well-behaved policy functions, they are unlikely to produce a good approximation of the value function outside of a local region. In addition, we plan to evaluate our method on higher-dimensional problems to demonstrate how it can solve problems that are out of reach for most methods.