Planning under Uncertainty to Goal Distributions

by Adam Conkey, et al.

Goal spaces for planning problems are typically conceived of as subsets of the state space. It is common to select a particular goal state to plan to, and the agent monitors its progress to the goal with a distance function defined over the state space. Due to numerical imprecision, state uncertainty, and stochastic dynamics, the agent will be unable to arrive at a particular state in a verifiable manner. It is therefore common to consider a goal achieved if the agent reaches a state within a small distance threshold of the goal. This approximation fails to explicitly account for the agent's state uncertainty. Point-based goals further do not accommodate the goal uncertainty that arises when goals are estimated in a data-driven way. We argue that goal distributions are a more appropriate goal representation and present a novel approach to planning under uncertainty to goal distributions. We use the unscented transform to propagate state uncertainty under stochastic dynamics and the cross-entropy method to minimize the Kullback-Leibler (KL) divergence between the current state distribution and the goal distribution. We derive reductions of our cost function to commonly used goal-reaching costs such as weighted Euclidean distance, goal-set indicators, chance-constrained goal sets, and maximum expectation of reaching a goal point. We explore different combinations of goal distributions, planner distributions, and divergences to illustrate the behaviors achievable in our framework.




I Introduction

Representing goals is a fundamental step in formalizing problems in robotics [30, 50]. Goals provide an intuitive means by which humans communicate task objectives to an agent, and the choice of representation can greatly impact what behaviors the agent may achieve. Goals are commonly considered abstractly as elements of a (possibly infinite) subset $\mathcal{X}_g$ of the agent's state space $\mathcal{X}$. To achieve a task, the robot must act so as to arrive at a state $x \in \mathcal{X}_g$.

In practice, it is common to have only a single desired goal state $g$ such that $\mathcal{X}_g = \{g\}$. A planning problem can then be formalized to have the robot minimize the distance between the terminal state $x_T$ in its plan and $g$ according to a goal-parameterized distance metric¹ [35, 54]. Due to numerical imprecision in floating point arithmetic, standard approaches consider the goal achieved when the agent reaches a state within an $\epsilon$-ball around the goal state. Even if numeric precision were perfect, the same issue arises under stochastic dynamics or state uncertainty, where an agent is unlikely to ever arrive at a specific state in a verifiable manner [51]. Note that in these stochastic and belief space control settings, the single-goal case directly extends to the expected cost formulation [51], while the multi-goal set formulation extends to chance-constrained control (i.e. maximizing the probability of reaching the goal set).

¹An alternative formulation achieves the same result by setting $\mathcal{X}_g = \{g\}$ and having the robot minimize $d(x_T, g)$, where $d$ is a distance function defined between arbitrary states in $\mathcal{X}$ [3, 26, 37].


Fig. 1: (Top Row) An example plan execution for a robot travelling from a start state (red box) to a Gaussian goal distribution (left room) under Gaussian state uncertainty. The orange paths show the state uncertainty propagated forward from the current state using the unscented transform and the mean action sequence generated by CEM optimization. (Bottom Row) KL divergence between the start state distribution and the goal distribution converging towards zero as the plan executes over time with MPC.

While goal states are suitable for many domains and algorithms, these common relaxations highlight their shortcomings as a goal representation for planning and control under uncertainty in continuous state spaces. We advocate for probability distributions as a more suitable goal representation. For planning with stochastic dynamics, we can estimate the robot's state distribution at future points in time and use a divergence function to quantify how far the predicted state distribution is from the goal distribution. We can then formalize a planning problem as minimizing this divergence. Note this approach naturally extends to partially observable domains, where the robot maintains a belief distribution over its current state (e.g. using a recursive filter [51]) and must replan online.

Goal distributions subsume point-based goals; a Dirac delta distribution at the goal state defines an indicator of being in that state [38]. Uniform distributions probabilistically encode the classical notion of goal sets. Gaussian goal distributions naturally encode smoothly increasing costs near a goal state. Mixture models (e.g. Gaussian mixture models) can represent disjoint goal regions of varying importance and achievability. Goal distributions are indispensable in learning from demonstration settings [39], where goals are often estimated in a data-driven fashion from noisy samples provided by sub-optimal experts [2, 13, 27].

In this paper, we present a modular algorithm for planning to goal distributions under state uncertainty and stochastic dynamics. We implement a specific instantiation of this algorithm using the unscented transform [52] to propagate state uncertainty, the cross-entropy method (CEM) [26, 45] for derivative-free optimization to minimize the Kullback-Leibler (KL) divergence between the current state distribution and the goal distribution, and model predictive control (MPC) for robust execution. An example execution is shown in Figure 1. We provide a number of illustrative examples for how the behavior of the planner can be beneficially modulated by the choice of the distribution modeled by the planner and the choice of goal distribution. We additionally derive reductions of our cost function to commonly utilized cost functions in the literature.

The paper is organized as follows. In Section II we describe some preliminary concepts that we utilize in our approach. We show reductions of KL divergence cost to standard cost functions used in planning in Section III. In Section IV we formally define our method and identify the specific algorithmic choices we make. We show illustrative examples of our method applied to a 2D navigation problem and motion planning for a 7-DOF arm in Section V. We delay our discussion of related work to Section VI to contextualize other approaches in the framework we formalize. We conclude the paper in Section VII with a brief discussion of our results and some interesting directions for future research.

All software, videos, and supplemental material related to this paper can be found at the associated website:

II Preliminaries

Divergence functions, which encode dissimilarity between probability distributions, are central to our planning to goal distributions formulation. Specifically, we explore using the Kullback-Leibler (KL) divergence as the primary cost in our objective function for a stochastic trajectory optimization view of planning. Formally, KL divergence is defined as

$$D_{KL}(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, dx \qquad (1)$$

where $p$ and $q$ are distributions of a continuous random variable $x$. Following standard convention, we define $0 \log \frac{0}{q}$ to be $0$ if $q \geq 0$, and $p \log \frac{p}{0}$ to be $\infty$ if $p > 0$. The KL divergence is an asymmetric measure. This asymmetry induces two different optimization problems when optimizing a distribution $q$ to minimize KL divergence from a target distribution $p$:

$$\arg\min_{q} D_{KL}(q \,\|\, p) \quad \text{(I-projection)} \qquad\qquad \arg\min_{q} D_{KL}(p \,\|\, q) \quad \text{(M-projection)}$$

The information or I-projection exhibits mode-seeking behavior, while the moment or M-projection performs moment-matching behavior, seeking coverage of all regions where $p > 0$ [36]. We demonstrate in Sections III and V that both of these projections are necessary for planning to goal distributions, depending on the choice of goal distribution family and state uncertainty distribution.
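For intuition, the qualitative difference between the two projections can be reproduced with a short numerical sketch (our illustration, not the paper's code): fitting a single Gaussian to a bimodal target by grid search under each direction of the KL divergence.

```python
# Illustrative sketch: fit a single Gaussian q to a bimodal target p by
# minimizing each direction of the KL divergence on a 1D grid.
# Shows the mode-seeking I-projection vs. the mode-covering M-projection.
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: equally weighted modes at -2 and +2.
def p(x):
    return 0.5 * gauss_pdf(x, -2.0, 0.4) + 0.5 * gauss_pdf(x, 2.0, 0.4)

xs = np.linspace(-6, 6, 2001)
dx = xs[1] - xs[0]
px = p(xs)

def kl(a, b):
    # D_KL(a || b) on a shared grid, with the 0 log 0 = 0 convention.
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

best_I, best_M = None, None
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.2, 3.0, 29):
        qx = gauss_pdf(xs, mu, sigma)
        i_cost = kl(qx, px)   # I-projection objective: D_KL(q || p)
        m_cost = kl(px, qx)   # M-projection objective: D_KL(p || q)
        if best_I is None or i_cost < best_I[0]:
            best_I = (i_cost, mu, sigma)
        if best_M is None or m_cost < best_M[0]:
            best_M = (m_cost, mu, sigma)

print("I-projection (mode-seeking):  mu=%.2f sigma=%.2f" % best_I[1:])
print("M-projection (mode-covering): mu=%.2f sigma=%.2f" % best_M[1:])
```

The I-projection places its mass on one mode, while the M-projection matches the mixture's overall moments and straddles both modes.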

III Cost Function Reductions

Before providing our planning algorithm in Section IV, we show reductions of KL divergence between state and goal distributions to standard cost functions used in planning. Please see our supplementary material for full derivations.

Goal Set Indicator: We set the terminal state uncertainty distribution when following trajectory $\tau$ to be the Dirac delta distribution (i.e. known state/deterministic dynamics), $p(x_T \mid \tau) = \delta(x_T - x_\tau)$, and the goal distribution to be a uniform distribution, $U(\mathcal{X}_g)$, over the goal set $\mathcal{X}_g$. Then minimizing the I-projection we recover

$$\min_\tau D_{KL}\big(\delta(x_T - x_\tau) \,\|\, U(\mathcal{X}_g)\big) = \min_\tau \begin{cases} \log |\mathcal{X}_g| & x_\tau \in \mathcal{X}_g \\ \infty & \text{otherwise} \end{cases}$$

where $|\mathcal{X}_g|$ denotes the volume of the goal set. Hence the minimum is obtained with a constant cost if the terminal state reaches any point in the goal set, while any state outside the goal set receives infinite cost. Note the function is non-differentiable, as expected from the set-based goal definition. We can treat a single goal state naturally as a special case of this function.

(Weighted) Euclidean Distance: By setting the goal distribution to a Gaussian $\mathcal{N}(g, \Sigma_g)$ with mean centered at the desired goal state $g$ and assuming known state as above, minimizing the I-projection recovers a weighted squared Euclidean distance:

$$\min_\tau D_{KL}\big(\delta(x_T - x_\tau) \,\|\, \mathcal{N}(g, \Sigma_g)\big) = \min_\tau \tfrac{1}{2}(x_\tau - g)^\top \Lambda_g (x_\tau - g) + \text{const} \qquad (3)$$

where $\Lambda_g = \Sigma_g^{-1}$ defines the precision matrix of the goal distribution. By setting the precision matrix to be the identity matrix, we recover the standard squared Euclidean distance. We recover the same cost by minimizing the M-projection between a Gaussian state and a Dirac goal, albeit weighted by the belief state precision. For any dimension we wish to ignore, we can set the precision to 0 (i.e. variance $\to \infty$).
Maximum Probability of Reaching Goal Point: By minimizing the M-projection with a Dirac delta distribution at the goal point, $\delta(x - g)$, and an arbitrary belief distribution $b(x_T)$ over the state, the KL divergence reduces to

$$\min_\tau D_{KL}\big(\delta(x - g) \,\|\, b(x_T)\big) = \min_\tau -\log b(g) \qquad (5)$$

which is equivalent to minimizing the negative log-probability (maximizing the probability or expectation) of reaching the point-based goal.

Chance-Constrained Goal Set: If we take a uniform goal distribution $U(\mathcal{X}_g)$ over $\mathcal{X}_g$ and an arbitrary distribution $b(x_T)$ over the state, the resulting minimization of the M-projection can equivalently be stated as the following maximization problem

$$\max_\tau \int_{\mathcal{X}_g} b(x_T) \, dx_T$$

which defines the probability of reaching any state in the goal set $\mathcal{X}_g$, a commonly used term for reaching a goal set in chance-constrained control (e.g. Equation (6) in [6]).
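The Gaussian reductions above can be checked numerically. The sketch below (our illustration, not the paper's code) uses the closed-form Gaussian-Gaussian KL divergence and shows that as the state belief collapses toward a Dirac delta, subtracting the state-independent terms leaves exactly the precision-weighted squared distance to the goal mean, as in Eq. (3).

```python
# Numerical check of the weighted Euclidean distance reduction: the
# mean-dependent part of the Gaussian-Gaussian KL is the precision-weighted
# squared distance to the goal mean.
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form D_KL(N(mu0, S0) || N(mu1, S1))."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

goal_mu = np.array([1.0, -2.0])
goal_cov = np.array([[0.5, 0.1], [0.1, 0.3]])
x = np.array([2.0, 0.5])          # terminal state of a candidate plan

# Precision-weighted squared distance, as in the Euclidean reduction.
quad = 0.5 * (x - goal_mu) @ np.linalg.inv(goal_cov) @ (x - goal_mu)

for eps in [1e-2, 1e-4, 1e-6]:
    state_cov = eps * np.eye(2)   # belief collapsing toward a Dirac delta
    kl = kl_gauss(x, state_cov, goal_mu, goal_cov)
    # State-independent (constant) terms of the closed-form KL.
    const = 0.5 * (np.trace(np.linalg.inv(goal_cov) @ state_cov) - 2
                   + np.log(np.linalg.det(goal_cov) / np.linalg.det(state_cov)))
    print(eps, kl - const, quad)
```

The constant diverges as the belief collapses, but it does not depend on the plan, so only the quadratic term matters for optimization.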

IV Methods

We now present the details of our approach to planning under uncertainty to goal distributions via KL divergence minimization. The full procedure is detailed in Algorithm 1.

Consider an agent with state space $\mathcal{X}$ that maintains a belief distribution $b(x_t; \theta_t)$ for its current state, where $\theta_t$ are sufficient statistics defining the distribution. The distribution is estimated from a history of observations from the agent's sensors, e.g. using a recursive filter [51].² The transition dynamics are stochastic with additive Gaussian noise

$$x_{t+1} = f(x_t, a_t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)$$

where $f$ is a nonlinear function. The objective is to plan and execute a sequence of actions that transform the current state belief distribution into a goal distribution $p_g$ while accounting for the state uncertainty and stochastic dynamics. The goal distribution may be manually specified, e.g. a Gaussian centered about a desired state to be reached, or estimated from data with a density estimation procedure.

²In fully observable domains, where the robot directly observes its state, we can define a Dirac distribution at the observed point or set the state as the mean of a Gaussian with arbitrarily tight variance.

There are four modular components to our approach, each detailed below in IV-A through IV-D. While we have not performed a comprehensive evaluation of these different design choices, we justify and note specific features of our choices and highlight that other variants are possible. First, the combined components provide sufficiently fast computation for use inside a model predictive controller. Second, we use gradient-free optimization, which enables us to incorporate arbitrary costs and constraints in addition to our goal distribution costs. Finally, our planner can naturally encode multi-modal trajectory distributions, enabling us to comprehensively reason over multi-modal goal representations.

while c > δ do                                     /* Model predictive control */
    a_{1:T} ← PlanCEM(b(x_t), p_g)
    ExecuteAction(a_1)
    b(x_t) ← StateEstimation()

Function PlanCEM(b(x_t), p_g):
    repeat
        for k ← 1 to K do
            sample action sequence a^k_{1:T} from the planner distribution q
            b^k(x_{t+1}), …, b^k(x_{t+T}) ← UTF(b(x_t), a^k_{1:T})
            c_k ← cost of sample k (Eq. 7 plus auxiliary costs)
        q ← FitElite(elite set of lowest-cost samples)
    until change in q between iterations < ε
    return optimized action sequence a_{1:T}

Function UTF(b(x_t), a_{1:T}):
    L ← Cholesky(Σ_t)
    for i ← 0 to 2n do
        construct sigma point X_i from μ_t and the columns of L
    for t' ← t to t + T − 1 do
        propagate each sigma point through the stochastic dynamics
    for t' ← t + 1 to t + T do
        recover μ_{t'}, Σ_{t'} from the weighted sigma points
    return b(x_{t+1}), …, b(x_{t+T})

Algorithm 1 DistributionPlanning
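To make the structure of Algorithm 1 concrete, here is a minimal runnable sketch of the CEM planning loop under strong simplifying assumptions of ours (1D linear-Gaussian dynamics, for which mean and variance propagate exactly, and a Gaussian goal); it illustrates the approach and is not the paper's implementation.

```python
# Minimal CEM planning sketch: sample action sequences, propagate the belief,
# score each sequence by time-weighted KL divergence to the goal, refit to the
# elite set, and return the mean action sequence.
import numpy as np

rng = np.random.default_rng(0)

def step_mean_cov(mu, var, a, noise_var=0.01):
    # x_{t+1} = x_t + a + eps: for linear dynamics the belief update is exact.
    return mu + a, var + noise_var

def kl_gauss_1d(mu0, var0, mu1, var1):
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + np.log(var1 / var0))

def plan_cem(mu0, var0, goal_mu, goal_var, horizon=10, n_samples=200,
             n_elite=20, iters=30):
    act_mu = np.zeros(horizon)     # CEM sampling distribution over
    act_std = np.ones(horizon)     # action sequences (diagonal Gaussian)
    for _ in range(iters):
        acts = act_mu + act_std * rng.standard_normal((n_samples, horizon))
        costs = np.empty(n_samples)
        for k in range(n_samples):
            mu, var, cost = mu0, var0, 0.0
            for t, a in enumerate(acts[k]):
                mu, var = step_mean_cov(mu, var, a)
                # Later timesteps are weighted more heavily, as in the paper.
                cost += (t + 1) / horizon * kl_gauss_1d(mu, var, goal_mu, goal_var)
            costs[k] = cost
        elite = acts[np.argsort(costs)[:n_elite]]    # refit to elite set
        act_mu, act_std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return act_mu

plan = plan_cem(mu0=0.0, var0=0.05, goal_mu=5.0, goal_var=0.2)
print("planned total displacement:", plan.sum())
```

The planned actions should sum to roughly the distance between the start mean and the goal mean, with most of the displacement early in the horizon due to the increasing cost weights.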

IV-A Planning Cost to Goal Distributions

We use the KL divergence between the current state distribution and the goal distribution as the cost to be minimized. We accumulate cost over time such that the costs later in the plan have higher weight (see Alg. 1):

$$c(\tau) = \sum_{t'=t+1}^{t+T} w_{t'} \, D_{KL}\big(b(x_{t'}) \,\|\, p_g\big) \qquad (7)$$

where $t$ is the current timestep, $T$ is the planning horizon, $w_{t'}$ are weights increasing in $t'$, and $D_{KL}$ denotes KL divergence as in Equation 1. We additionally utilize an application-specific cost function on the sigma points computed for the unscented transform. For example, in our 2D navigation experiments in Section V we use collision costs on the sigma points to avoid collisions under uncertain dynamics. We can optionally incorporate common action costs to, for example, enforce smoothness or low-energy solutions.

Note that Equation 7 is formulated as an I-projection as discussed in Section II. The mode-seeking behavior of the I-projection is typically desired for planning to goal distributions. However, for distributions with bounded support like uniform and Dirac-delta distributions, the M-projection is necessary to avoid division by zero. We explore this issue further in our experiments in Section V.

IV-B Goal Distribution

Our goal formulation permits any distribution for which KL divergence can be computed with respect to a Gaussian state distribution. In our experiments in Section V we explore Dirac-delta, uniform, Gaussian, and Gaussian mixture model distributions. Approximations are also suitable when a closed form solution to KL divergence cannot be computed. For example, we use the unscented transform to efficiently compute an accurate approximation of KL divergence between GMMs [18] for which there is no closed-form solution.
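When no closed form exists, as with a Gaussian state and a GMM goal, the divergence must be approximated. A simple (if sample-hungry) Monte Carlo alternative to the unscented approximation of [18] looks like the following sketch of ours:

```python
# Monte Carlo estimate of D_KL(q || p) for a Gaussian q and a GMM p:
# draw samples from q and average the log-density ratio.
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, weights, mus, vars_):
    return sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, vars_))

# D_KL(q || p) ≈ (1/N) Σ [log q(x_i) - log p(x_i)],  x_i ~ q
mu_q, var_q = 0.0, 1.0
xs = mu_q + np.sqrt(var_q) * rng.standard_normal(100_000)
log_q = np.log(gauss_pdf(xs, mu_q, var_q))
log_p = np.log(gmm_pdf(xs, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]))
kl_mc = np.mean(log_q - log_p)
print("MC estimate of D_KL(q || p):", kl_mc)
```

The unscented approximation used in the paper needs only a handful of deterministic sigma-point evaluations rather than many random samples, which matters when the divergence sits inside an inner planning loop.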

IV-C Planner

We use the cross-entropy method (CEM) [26, 45] for planning. CEM (the PlanCEM function in Alg. 1) is a widely used gradient-free optimization method favored for the ease with which it can optimize over arbitrary costs that may be expensive or impossible to compute derivatives for. In a planning context [26], CEM generates action sequences at each iteration sampled from a distribution over action sequences, typically a Gaussian or Gaussian mixture model (GMM). The distribution is initialized to be broad enough to achieve coverage of the state space. Every sample is evaluated under the cost function, and the distribution is refit via maximum likelihood estimation to the lowest-cost samples, the so-called elite set. This procedure iterates until the KL divergence between distributions at consecutive iterations falls below a tolerance. We execute plans in a model predictive control (MPC) style to ensure robustness (the MPC loop in Alg. 1), which terminates when the total cost falls below a tolerance.

The planner can easily represent multi-modal solutions by using a GMM distribution over trajectories. This interplays nicely with different goal distribution choices as we show in our experiments in Section V.

IV-D Uncertainty Propagation

We use the unscented transform (the UTF function in Alg. 1) to propagate state uncertainty under stochastic nonlinear dynamics during planning. The unscented transform is best known for its use in the unscented Kalman filter (UKF) [53], a more accurate variant of the Kalman filter for nonlinear dynamics than the extended Kalman filter (EKF) [51]. However, it has also been successfully applied to uncertainty propagation through learned, nonlinear dynamics functions [7], to trajectory optimization [35], and to robot arm motion planning [31]. The unscented transform computes a small number of sigma points that are passed through the nonlinear function to systematically probe how the state distribution transforms under the nonlinearity. Typically $2n + 1$ points are computed for state dimension $n$: one point at the distribution mean and two points per dimension symmetrically dispersed about the mean. We use the sigma point algorithm presented in [35], as it has only one free parameter to tune the sigma point dispersion, compared to the three free parameters of the original formulation [52]. We choose the unscented transform primarily for its accuracy and efficiency over a Monte Carlo sampling approach. We assume a Gaussian belief state $b(x_t) = \mathcal{N}(\mu_t, \Sigma_t)$.
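A minimal sketch of sigma-point propagation, using the common $2n+1$ scheme with a single dispersion parameter $\kappa$ (the paper uses the variant from [35]; this is our illustrative version):

```python
# Unscented propagation of a Gaussian belief through a nonlinear function:
# build 2n+1 sigma points, push them through f, and recombine mean/covariance.
import numpy as np

def unscented_propagate(mu, cov, f, noise_cov, kappa=1.0):
    n = len(mu)
    L = np.linalg.cholesky((n + kappa) * cov)
    # Sigma points: the mean plus symmetric dispersions along each column of L.
    pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    ys = np.array([f(p) for p in pts])
    mu_y = w @ ys
    cov_y = sum(wi * np.outer(y - mu_y, y - mu_y) for wi, y in zip(w, ys))
    return mu_y, cov_y + noise_cov   # additive Gaussian process noise

# Nonlinear 2D example: polar-style warp of a Gaussian belief.
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])
mu, cov = unscented_propagate(np.array([2.0, 0.5]), 0.01 * np.eye(2),
                              f, 1e-4 * np.eye(2))
print(mu, cov)
```

For mild nonlinearity and tight input covariance, the propagated mean stays close to $f$ applied to the input mean, while the covariance captures how the warp stretches the belief.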

V Experiments

Fig. 2: Examples of planning to goal distributions in a 2D robot navigation domain. The robot (black box) navigates from the start position (red square) to the goal (green indicator); state uncertainty propagation shown in orange-purple gradient. (a) Gaussian CEM plan to Gaussian goal. (b) Gaussian CEM plan with I-projection to asymmetrically-weighted GMM goal; multiple MPC runs go to left goal (blue lines) and right goal (yellow lines). (c) Gaussian CEM plan with I-projection to symmetrically-weighted GMM goal. (d) Gaussian CEM plan with M-projection to symmetrically-weighted GMM goal. (e) Gaussian CEM plan to Dirac-delta goal. (f) Gaussian CEM plan to uniform goal (green square). (g) GMM CEM plan to Gaussian goal; multi-modal solution to goal around an obstacle. (h) GMM CEM plan to symmetrically-weighted GMM goal; multi-modal paths to a multi-modal goal.

We demonstrate some of the possibilities achievable with our framework using different goal distributions, CEM planner distributions, and cost objectives. We use a simple 2D navigation environment for these examples in V-A to clearly illustrate the different behaviors. We additionally show an example on a larger state space for motion planning with a 7-DOF arm in V-B. All of our code is open-source and implemented in PyTorch for fast batch computation.

V-A Dubins Car 2D Navigation

We use the Dubins car model from [26], a simple vehicle model with non-holonomic constraints in the state space $\mathcal{X} = \mathbb{R}^2 \times \mathbb{S}^1$. The state $x = (p_x, p_y, \phi)$ denotes the car's planar position $(p_x, p_y)$ and orientation $\phi$. The dynamics obey

$$\dot{p}_x = v \cos\phi, \qquad \dot{p}_y = v \sin\phi, \qquad \dot{\phi} = \omega$$

where $v$ is a linear speed and $\omega$ is the turn rate.

We use an arc primitive parameterization similar to [26] to generate trajectory samples for CEM. Actions $(v, \omega)$ are applied at each timestep for a fixed duration such that the robot moves in a straight line with velocity $v$ if $\omega = 0$ and arcs with radius $v / \omega$ otherwise. A trajectory with $K$ arc primitives has the form $(v_1, \omega_1, \ldots, v_K, \omega_K)$, which are sampled during CEM optimization. We extend the model in [26] to have stochastic dynamics by adding Gaussian noise to the state updates. Please see our supplementary material for further details.
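A possible implementation of such a stochastic arc-primitive step (our sketch; the exact parameterization in [26] and the supplementary material may differ):

```python
# One Dubins arc-primitive step: straight line when omega == 0, otherwise a
# circular arc of radius v / omega, with optional additive Gaussian noise.
import numpy as np

rng = np.random.default_rng(2)

def arc_step(state, v, omega, dt=1.0, noise_std=0.0):
    px, py, phi = state
    if abs(omega) < 1e-9:                  # straight-line primitive
        px += v * dt * np.cos(phi)
        py += v * dt * np.sin(phi)
    else:                                  # arc of radius v / omega
        r = v / omega
        px += r * (np.sin(phi + omega * dt) - np.sin(phi))
        py += r * (np.cos(phi) - np.cos(phi + omega * dt))
        phi += omega * dt
    noise = noise_std * rng.standard_normal(3)
    return np.array([px, py, phi]) + noise

s = arc_step(np.array([0.0, 0.0, 0.0]), v=1.0, omega=0.0)  # drive straight
print(s)
```

Setting `noise_std > 0` gives the stochastic variant used for planning; with it at zero, the step is the deterministic Dubins update.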

Costs in each example are computed according to Eq. 7. We additionally employ a collision cost on the sigma points

$$c_{coll}(\tau) = \alpha \sum_{t', i} \mathbb{1}\big[x_{t'}^{(i)} \in \text{collision}\big]$$

where $\mathbb{1}$ is an indicator function determining if a state is in collision and $\alpha$ is a gain factor on the collision cost. We note that while it is possible to simply reject samples that have any point in collision [26], we empirically found it better to accumulate collision costs in an MPC setting, as it can take too long to find collision-free samples without a good prior for action trajectory initialization.

We explore different combinations of goal distribution, CEM planner distribution, and KL projection to illustrate the behavior of our planner. Each example is visualized in Fig. 2. In every case the robot has a Gaussian belief distribution over its current pose in the environment. We manually specify this distribution as a scaled unit Gaussian centered at the robot’s state for simplicity, but note it can be acquired through state estimation with recursive filters in more realistic settings [51]. The images show an initial plan generated from the robot’s starting pose. Please see the supplementary video for MPC executions.

Fig. 2a-d illustrate why the mode-seeking behavior of the KL divergence I-projection is preferred for matching a Gaussian state distribution to Gaussian and GMM goal distributions. Fig. 2a shows optimizing the I-projection between a Gaussian state distribution and a Gaussian goal distribution, where the path converges towards the mean of the goal distribution in the left room. A multi-modal goal modeled as a GMM is shown in Fig. 2b with asymmetrically weighted Gaussian components in the left and right rooms (weights 0.2 and 0.8, respectively). The lines illustrate the paths the robot took in 10 different MPC executions. Note that the number of paths leading to each goal is proportional to the component weightings, i.e. 2 paths go to the left goal (blue paths) while 8 paths go to the right goal (yellow paths) under the asymmetric weighting. Fig. 2c has equally-weighted components, resulting in 5 paths going to each goal in 10 MPC executions. The different weightings may represent varying levels of achievability, or that some goals are better estimated than others in a data-driven goal estimation procedure [13].

We see in Fig. 2d that optimizing the M-projection instead of the I-projection for a multi-modal goal results in the robot driving in between the two goals, getting stuck in the middle room. These results may on the surface suggest the I-projection should always be used. However, the M-projection is necessary when the goal distribution has bounded support, as is the case for uniform and Dirac-delta distributions. The I-projection in these cases would cause division by zero at every point outside the goal distribution’s support, resulting in an infinite cost that cannot be optimized over the robot’s state space. Planning to a Dirac-delta goal distribution is shown in Fig. 2e, where the KL divergence in this case reduces to maximizing the probability of reaching the goal state in the left room, as shown in Section III. Since the state belief distribution in this case is Gaussian, the cost further reduces to a precision-weighted Euclidean distance between the robot’s state and the goal state. Planning to a uniform goal distribution is shown in Fig. 2f, which probabilistically encodes the set-based goal of arriving in the left room, irrespective of the precise end pose in that room.

Utilizing a GMM distribution in the CEM planner and planning to a Gaussian goal yields multi-modal trajectories, as illustrated in Fig. 2g. The blue lines are trajectory samples generated from the planner GMM that take different routes around the central obstacle. This allows for generating multiple solutions for arriving at a goal, which can then be pursued according to a selection strategy, e.g. further refining for shortest-path optimality [26]. We can use a GMM as both the planning distribution and the goal state distribution to achieve multi-modal trajectories that each arrive at a different goal, as shown in Fig. 2h. This multi-objective planning can be useful if, for example, a goal becomes infeasible during execution due to a dynamic obstacle; the alternative path to the other goal can then be immediately pursued. In our MPC executions with the GMM as the planner distribution, one of the solutions ends up being selected by chance due to one mode becoming biased in CEM sampling. We comment further on this point in our discussion in Section VII.

These examples illustrate that our framework subsumes point-based and set-based goal-directed behavior while explicitly accounting for the uncertainty in the robot’s state. Further, we can optimize for multi-modal goals and multi-modal solutions to goals. While the literature elucidates the M-projection/I-projection well [36], we find it illuminating to see this demonstrated in a planning context general enough to accommodate different goal distribution families, including probabilistic variants of classical set- and point-based goals.

V-B 7-DOF Arm Motion Planning

We further demonstrate our method on a 7-DOF arm, which has a larger state space than the 2D navigation environment in V-A. As illustrated in the left of Fig. 3, the robot is given a target position (yellow sphere) to reach with its end-effector. An inverse kinematics solution is computed (red robot) to serve as its joint space goal. However, that solution is in collision with the gray pole. In spite of this, we can set the IK solution as the mean of a Gaussian goal distribution over joint configurations. In addition to the KL divergence and collision costs from the 2D environment, we add a sigma-point cost for the position error between the end-effector and the target position. Running our planner with MPC results in the path shown in the right of Fig. 3. The robot finds a path that reaches its end-effector around the obstacle to the target position without collision. Note the dynamics are stochastic with additive Gaussian noise and we use kinematic control only (i.e. commanded joint positions). Please see our supplementary material for further details.

Fig. 3: A 7-DOF arm reaching a target end-effector position while avoiding collisions. (Left) An inverse kinematics (IK) solution (red robot) to reach the target position (yellow sphere) is in collision with an obstacle (gray pole). (Right) Using the IK solution as the mean of a joint space Gaussian goal distribution, our approach finds a collision-free path around the obstacle to reach the target.

VI Related Work

Planning and control under uncertainty have a long history in robotics [22, 30, 51]. The problem can most generally be formalized in the language of partially observable Markov decision processes (POMDPs) [4, 22, 49], which account for both the agent's measurement uncertainty and stochastic controls [51]. POMDPs are infamously intractable to solve exactly [51], so most solutions are approximate [33]. Our approach falls into this approximate class.

Our approach closely relates to the fully probabilistic control design [24, 23, 48] and path integral [54] classes of control as inference [43]. Both methods minimize an M-projection between the controlled state-action distribution and a desired state-action distribution over the time horizon. Probabilistic control design [24, 23, 48] provides a theoretical optimal control derivation but offers no concrete algorithms, while also noting the connection between Gaussian target distributions under KL divergence and quadratic costs (Eq. (3)) in the context of LQR. Model predictive path integral control optimizes an open-loop control sequence toward an unknown optimal control distribution via importance sampling [54]. In contrast, our approach enables direct construction of desired state-based goal distributions for use as targets in the KL divergence minimization, and can use either M- or I-projections depending on the relevant distributions instead of inducing an optimal distribution from a target cost function. As such, we find our approach simpler to interpret with respect to state-based costs and motion planning, at the limitation of having to add other costs when performing control.

Similar to our approach, state distribution matching to target uniform distributions via KL divergence minimization guides policy learning in [32]. However, this approach automatically generates distributions for use in exploration and only examines a subclass of the distributions and optimizations in this paper. Chen et al. [11] examine divergences as distance functions for minimum path length planning to point-based goals but ignore explicit goal distributions.

Goal distributions have also been used to sample a state from the distribution and then continue using point-based planning algorithms [37, 42]. Recent improvements on hindsight experience replay (HER) [3] have sought to estimate goal distributions that generate samples from low-density areas of the replay buffer state distribution [28, 41, 55] or that form a curriculum of achievable goals [44, 41]. We note that HER-based methods are typically used in multi-goal reinforcement learning, where states are interpreted as goals irrespective of their semantic significance as a task goal. Goal distributions are considered in [8], where KL divergence is used to guide local policies towards the goal distribution. However, only Dirac delta distributions were considered, which is tantamount to maximizing the expectation of reaching a point-based goal as shown in Eq. (5).

Distributed multi-robot systems leverage goal distributions to control a network of robots to a desired configuration [16, 46]. Goal distributions over a known state space are advocated for in [38] as a generalization to goal-conditioned reinforcement learning. Our approach is parallel to [38] in that we explore goal distributions in a planning under uncertainty setting and provide a wider set of reductions.

KL divergence (and f-divergences more broadly) has been commonly used to constrain iterations of optimization in planning [1] and policy search [10, 40, 47]. Stochastic optimal control and planning techniques typically attempt to minimize the expected cost of the trajectory [15, 54] or maximize the probability of reaching a goal set [6, 34]. Information theoretic costs are included in the maximum entropy formulation of reinforcement learning and control [19, 20, 56, 1] which reward maintaining high information in a stochastic policy in order to avoid unnecessarily biasing the policy class. Another common use of information theoretic cost comes from active exploration in SLAM [9] and object reconstruction [21] where the information gain is computed over the uncertainty of the current world model. In spite of the myriad uses of KL divergence in the literature, it has been scarcely utilized explicitly for defining cost between a predicted state distribution and a goal distribution. We believe this is because probability distributions have remained an under-utilized representation for goal spaces.

VII Discussion and Conclusion

We are excited to investigate many extensions to this work. First, as noted in the text, we could easily extend our work to include other $f$-divergences, as has recently been examined in policy search and imitation learning [5, 17, 25]. A more thorough study of alternative uncertainty propagation techniques should also be examined, such as Monte Carlo sampling [6, 54, 12], moment matching [14], or bounded uncertainties [34]. These approaches, together with learned dynamics, could help expand the representational power of the dynamics to include multi-modal distributions (e.g. ensembles of neural networks) or learn complex models from few samples (e.g. Gaussian processes).

Additionally, while we demonstrated the ability of our planner to generate multi-modal trajectories to reach different goal regions, or to find paths in distinct homotopy classes to a single goal region, our MPC control approach quickly collapses to a unimodal trajectory distribution. In cases where users deem it important to maintain multiple distinct trajectory classes, our method could be extended to incorporate the recently introduced Stein variational MPC to achieve this [29]. We believe significant work could go into designing specialized solvers for planning based on the chosen divergence, cost, and uncertainty representation to further improve planning efficiency.


  • [1] H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann (2017) State-regularized policy search for linearized dynamical systems. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), External Links: Link Cited by: §VI.
  • [2] B. Akgun and A. Thomaz (2016) Simultaneously learning actions and goals from demonstration. Autonomous Robots 40 (2), pp. 211–227. Cited by: §I.
  • [3] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in neural information processing systems, pp. 5048–5058. Cited by: §VI, footnote 1.
  • [4] K. J. Astrom (1965) Optimal control of markov processes with incomplete state information. Journal of mathematical analysis and applications 10 (1), pp. 174–205. Cited by: §VI.
  • [5] B. Belousov and J. Peters (2017) F-divergence constrained policy improvement. arXiv. External Links: Link Cited by: §VII.
  • [6] L. Blackmore, M. Ono, A. Bektassov, and B. C. Williams (2010) A Probabilistic Particle-Control Approximation of Chance-Constrained Stochastic Predictive Control. IEEE Transactions on Robotics. Cited by: §A-D, §I, §III, §VI, §VII.
  • [7] E. V. Bonilla, D. Steinberg, and A. Reid (2016) Extended and Unscented Kitchen Sinks. In International Conference on Machine Learning (ICML), Cited by: §IV-D.
  • [8] S. Candido and S. Hutchinson (2011) Minimum uncertainty robot navigation using information-guided pomdp planning. In 2011 IEEE International Conference on Robotics and Automation, pp. 6102–6108. Cited by: §VI.
  • [9] B. Charrow, G. Kahn, S. Patil, S. Liu, K. Goldberg, P. Abbeel, N. Michael, and V. Kumar (2015) Information-theoretic planning with trajectory optimization for dense 3d mapping.. In Robotics: Science and Systems, Vol. 11. Cited by: §VI.
  • [10] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine (2017) Path integral guided policy search. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3381–3388. Cited by: §VI.
  • [11] R. Chen, C. Gotsman, and K. Hormann (2017-08) Path Planning with Divergence-Based Distance Functions. Computer Aided Geometric Design 66, pp. 52–74. External Links: Document, Link Cited by: §VI.
  • [12] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. In Advances in Neural Information Processing, External Links: 1805.12114, Link Cited by: §VII.
  • [13] A. Conkey and T. Hermans (2019) Active learning of probabilistic movement primitives. In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pp. 1–8. Cited by: §I, §V-A.
  • [14] M. P. Deisenroth, D. Fox, and C. E. Rasmussen (2015) Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2), pp. 408–423. External Links: Document, arXiv:1502.02860v1, ISBN 0162-8828 VO - 37, ISSN 01628828, Link Cited by: §VII.
  • [15] M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472. Cited by: §VI.
  • [16] G. Foderaro, S. Ferrari, and M. Zavlanos (2012) A decentralized kernel density estimation approach to distributed robot path planning. In Proceedings of the Neural Information Processing Systems Conference, Cited by: §VI.
  • [17] S. K. S. Ghasemipour, R. Semel, and S. Gu (2019) A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning (CoRL), Cited by: §VII.
  • [18] J. Goldberger and H. Aronowitz (2005) A distance measure between gmms based on the unscented transform and its application to speaker recognition. In Ninth European Conference on Speech Communication and Technology, Cited by: §IV-B.
  • [19] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165. Cited by: §VI.
  • [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §VI.
  • [21] K. Huang and T. Hermans (2019) Building 3d object models during manipulation by reconstruction-aware trajectory optimization. arXiv preprint arXiv:1905.03907. Cited by: §VI.
  • [22] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: §VI.
  • [23] M. Kárný and T. V. Guy (2006) Fully probabilistic control design. Systems & Control Letters 55 (4), pp. 259 – 265. External Links: ISSN 0167-6911, Document, Link Cited by: §VI.
  • [24] M. Kárný (1996) Towards fully probabilistic control design. Automatica 32 (12), pp. 1719 – 1722. External Links: ISSN 0005-1098, Document, Link Cited by: §VI.
  • [25] L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa (2020) Imitation Learning as f-Divergence Minimization. In Workshop on the Algorithmic Foundations of Robotics, Cited by: §VII.
  • [26] M. Kobilarov (2012) Cross-entropy randomized motion planning. In Robotics: Science and Systems, Vol. 7, pp. 153–160. Cited by: §B-A, §B-A, §B-A, §I, §IV-C, §V-A, §V-A, §V-A, §V-A, footnote 1.
  • [27] D. Koert, S. Trick, M. Ewerton, M. Lutter, and J. Peters (2020) Incremental learning of an open-ended collaborative skill library. International Journal of Humanoid Robotics 17 (01), pp. 2050001. Cited by: §I.
  • [28] Y. Kuang, A. I. Weinberg, G. Vogiatzis, and D. R. Faria (2020) Goal density-based hindsight experience prioritization for multi-goal robot manipulation reinforcement learning. In 2020 29th IEEE International Symposium on Robot and Human Interactive Communication (ROMAN). IEEE, Cited by: §VI.
  • [29] A. Lambert, A. Fishman, D. Fox, B. Boots, and F. Ramos (2020) Stein Variational Model Predictive Control. In Conference on Robot Learning (CoRL), Cited by: §VII.
  • [30] S. M. LaValle (2006) Planning algorithms. Cambridge university press. Cited by: §I, §VI.
  • [31] A. Lee, Y. Duan, S. Patil, J. Schulman, Z. McCarthy, J. Van Den Berg, K. Goldberg, and P. Abbeel (2013) Sigma hulls for gaussian belief space planning for imprecise articulated robots amid obstacles. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5660–5667. Cited by: §IV-D.
  • [32] L. Lee, B. Eysenbach, E. Parisotto, E. Xing, S. Levine, and R. Salakhutdinov (2018) Efficient Exploration via State Marginal Matching. In ICLR Workshop, Cited by: §VI.
  • [33] W. S. Lee, N. Rong, and D. Hsu (2008) What makes some pomdp problems easy to approximate?. In Advances in neural information processing systems, pp. 689–696. Cited by: §VI.
  • [34] T. Lew, A. Sharma, J. Harrison, and M. Pavone (2020) Safe model-based meta-reinforcement learning: a sequential exploration-exploitation framework. arXiv preprint arXiv:2008.11700. Cited by: §VI, §VII.
  • [35] Z. Manchester and S. Kuindersma (2016) Derivative-free trajectory optimization with unscented dynamic programming. In IEEE Conference on Decision and Control (CDC), pp. 3642–3647. Cited by: §I, §IV-D.
  • [36] K. P. Murphy (2012) Machine Learning: A Probabilistic Perspective. MIT Press. External Links: Document, 0-387-31073-8, ISBN 0070428077, ISSN 0036-8733 Cited by: §II, §V-A.
  • [37] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §VI, footnote 1.
  • [38] S. Nasiriany (2020-08) DisCo rl: distribution-conditioned reinforcement learning for general-purpose policies. Master’s Thesis, EECS Department, University of California, Berkeley. External Links: Link Cited by: §I, §VI.
  • [39] A. Paraschos, C. Daniel, J. Peters, and G. Neumann (2018) Using probabilistic movement primitives in robotics. Autonomous Robots 42 (3), pp. 529–551. Cited by: §I.
  • [40] J. Peters, K. Mülling, and Y. Altun (2010) Relative entropy policy search.. In AAAI, Vol. 10, pp. 1607–1612. Cited by: §VI.
  • [41] S. Pitis, H. Chan, S. Zhao, B. Stadie, and J. Ba (2020) Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. arXiv preprint arXiv:2007.02832. Cited by: §VI.
  • [42] V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2020) Skew-fit: state-covering self-supervised reinforcement learning. International Conference on Machine Learning (ICML). Cited by: §VI.
  • [43] K. Rawlik, M. Toussaint, and S. Vijayakumar (2012) On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. In Robotics: Science and Systems (RSS), Cited by: §VI.
  • [44] Z. Ren, K. Dong, Y. Zhou, Q. Liu, and J. Peng (2019) Exploration via hindsight goal generation. In Advances in Neural Information Processing Systems, pp. 13485–13496. Cited by: §VI.
  • [45] R. Y. Rubinstein and D. P. Kroese (2013) The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media. Cited by: §I, §IV-C.
  • [46] K. Rudd, G. Foderaro, P. Zhu, and S. Ferrari (2017) A generalized reduced gradient method for the optimal control of very-large-scale robotic systems. IEEE Transactions on Robotics 33 (5), pp. 1226–1232. Cited by: §VI.
  • [47] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897. Cited by: §VI.
  • [48] J. Sindelar, I. Vajda, and M. Karny (2008) Stochastic Control Optimal in the Kullback Sense. Kybernetika 44 (1), pp. 53–60. Cited by: §VI.
  • [49] E. J. Sondik (1978) The optimal control of partially observable markov processes over the infinite horizon: discounted costs. Operations research 26 (2), pp. 282–304. Cited by: §VI.
  • [50] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §I.
  • [51] S. Thrun, W. Burgard, and D. Fox (2005) Probabilistic robotics. MIT press. Cited by: §I, §I, §IV-D, §IV, §V-A, §VI.
  • [52] J. K. Uhlmann (1995) Dynamic map building and localization: new theoretical foundations. Ph.D. Thesis, University of Oxford Oxford. Cited by: §I, §IV-D.
  • [53] E. A. Wan and R. Van Der Merwe (2000) The unscented kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pp. 153–158. Cited by: §IV-D.
  • [54] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou (2017) Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. Cited by: §I, §VI, §VI, §VII.
  • [55] R. Zhao, X. Sun, and V. Tresp (2019) Maximum entropy-regularized multi-goal reinforcement learning. arXiv preprint arXiv:1905.08786. Cited by: §VI.
  • [56] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning.. In Aaai, Vol. 8, pp. 1433–1438. Cited by: §VI.

Appendix A Cost Reduction Derivations

We provide more involved derivations of all of the cost function reductions presented in Section III of the paper. In the following, the density function for the uniform distribution over a set $G$ is defined as

$U_G(\mathbf{x}) = \frac{1}{Z} \mathbb{1}[\mathbf{x} \in G]$

where $Z = \int_{\mathbf{x} \in G} d\mathbf{x}$ defines the volume of the set. A Dirac-delta distribution for state $\mathbf{x}'$ is a special case of a uniform distribution where the support is limited to the single state $\mathbf{x}'$. In this case $\delta_{\mathbf{x}'}(\mathbf{x}) = 1$ if $\mathbf{x} = \mathbf{x}'$ and $0$ otherwise.

A-A Goal Set Indicator

We set the terminal state uncertainty distribution when following trajectory $\tau$ to be the Dirac-delta distribution (i.e. known state/deterministic dynamics) $\delta_{\mathbf{x}_T}$, where $\mathbf{x}_T$ is the state reached after following trajectory $\tau$. We set the goal distribution to be the uniform distribution $U_G$ over the goal set $G$. Minimizing the I-projection of the KL divergence between these distributions we recover

$D_{\mathrm{KL}}(\delta_{\mathbf{x}_T} \,\|\, U_G) = -\log U_G(\mathbf{x}_T) = \begin{cases} \log Z & \text{if } \mathbf{x}_T \in G \\ \infty & \text{otherwise} \end{cases}$

where the second equality follows from the fact that $U_G(\mathbf{x}) = 1/Z$ for $\mathbf{x} \in G$ and $U_G(\mathbf{x}) = 0$ for $\mathbf{x} \notin G$. Hence the minimum is obtained with a constant cost if the terminal state of trajectory $\tau$ reaches any point in the goal set, while any state outside of $G$ receives infinite cost. This function is non-differentiable, as expected from the set-based goal definition. We can treat a single goal state naturally as a special case of this function. While the KL divergence provides these specific costs, this cost function is equivalent to any classifier or indicator that provides distinct costs inside the goal set versus outside. Another common equivalent formulation provides a reward of $1$ (true) inside the goal set and $0$ (false) outside and defines a maximization.
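A minimal sketch of this indicator cost, assuming an axis-aligned box goal set and illustrative function names not taken from the paper:

```python
import numpy as np

def goal_set_indicator_cost(x, lo, hi):
    """Goal-set indicator cost: finite (0 here, up to the constant from
    the derivation) inside the box goal set G = [lo, hi], infinite outside."""
    inside = np.all((x >= lo) & (x <= hi))
    return 0.0 if inside else np.inf

# A unit-square goal region in 2D.
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
c_in = goal_set_indicator_cost(np.array([0.5, 0.5]), lo, hi)   # -> 0.0
c_out = goal_set_indicator_cost(np.array([2.0, 0.5]), lo, hi)  # -> inf
```

The step from finite to infinite cost makes the non-differentiability of the set-based formulation concrete.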

A-B (Weighted) Euclidean Distance

Consider a Gaussian goal distribution $\mathcal{N}(\mathbf{x}_g, \Sigma)$ with mean centered at a desired goal state $\mathbf{x}_g$, and assume a Dirac-delta distribution over the current state (i.e. known state/deterministic dynamics) $\delta_{\mathbf{x}_T}$, where again $\mathbf{x}_T$ is the resulting state from following trajectory $\tau$. Minimizing the I-projection recovers a weighted squared Euclidean distance:

$D_{\mathrm{KL}}(\delta_{\mathbf{x}_T} \,\|\, \mathcal{N}(\mathbf{x}_g, \Sigma)) = \frac{1}{2}(\mathbf{x}_T - \mathbf{x}_g)^\top \Lambda (\mathbf{x}_T - \mathbf{x}_g) + \text{const}$

where $\Lambda = \Sigma^{-1}$ defines the precision matrix of the goal distribution. By setting the precision matrix to the identity matrix, we recover the standard squared Euclidean distance between the terminal state $\mathbf{x}_T$ and goal state $\mathbf{x}_g$. We also note that setting the precision to $0$ for any dimension, in order to ignore that dimension, corresponds to an associated variance of $\infty$.
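The reduction above can be sketched directly; the function name is illustrative, and the precision matrix is passed in place of the covariance it inverts:

```python
import numpy as np

def weighted_euclidean_cost(x, goal_mean, goal_precision):
    """Squared Mahalanobis distance to the goal mean, weighted by the
    goal distribution's precision matrix (inverse covariance)."""
    diff = x - goal_mean
    return float(diff @ goal_precision @ diff)

g = np.array([1.0, 2.0])
# Identity precision recovers the standard squared Euclidean distance.
c = weighted_euclidean_cost(np.array([2.0, 4.0]), g, np.eye(2))  # -> 5.0
# Zero precision on the second dimension ignores it entirely.
c_ignore = weighted_euclidean_cost(np.array([2.0, 99.0]), g, np.diag([1.0, 0.0]))  # -> 1.0
```

The second call illustrates the remark above: a zero precision entry (infinite variance) removes that dimension from the cost.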

A-C Maximum Probability of Reaching Goal Point

Minimizing the M-projection with a Dirac-delta distribution $\delta_{\mathbf{x}_g}$ at a goal state $\mathbf{x}_g$ and an arbitrary belief distribution $p(\mathbf{x}_T \mid \tau)$ over the terminal state when following trajectory $\tau$, the KL divergence reduces to

$D_{\mathrm{KL}}(\delta_{\mathbf{x}_g} \,\|\, p(\mathbf{x}_T \mid \tau)) = -\log p(\mathbf{x}_T = \mathbf{x}_g \mid \tau)$

so that minimizing the divergence maximizes the probability of reaching the point-based goal $\mathbf{x}_g$ following trajectory $\tau$.

In the special case where the probability distribution over the state is a Gaussian, we recover the same weighted Euclidean distance cost as above, albeit weighted by the belief state precision instead of the goal distribution precision.
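For the Gaussian special case, this cost is the negative Gaussian log-density of the goal point under the terminal belief. A minimal sketch (the function name is an assumption for the example):

```python
import numpy as np

def neg_log_goal_density(goal, belief_mean, belief_cov):
    """Negative Gaussian log-density of the goal point under the belief
    over the terminal state; minimizing it maximizes the probability of
    reaching the goal point. The quadratic term is the Mahalanobis
    distance weighted by the belief precision, as noted in the text."""
    d = goal - belief_mean
    prec = np.linalg.inv(belief_cov)
    _, logdet = np.linalg.slogdet(belief_cov)
    k = goal.shape[0]
    return 0.5 * (d @ prec @ d + logdet + k * np.log(2.0 * np.pi))

# A belief centered on the goal scores better than one centered away from it.
near = neg_log_goal_density(np.zeros(2), np.zeros(2), np.eye(2))
far = neg_log_goal_density(np.zeros(2), np.array([3.0, 0.0]), np.eye(2))
```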

A-D Chance-Constrained Goal Set

Consider the uniform distribution $U_G$ over goal set $G$ and an arbitrary distribution $p(\mathbf{x}_T \mid \tau)$ over the terminal state after following trajectory $\tau$. Minimizing the M-projection of the KL divergence between these distributions recovers the term $\int_{\mathbf{x} \in G} p(\mathbf{x}_T = \mathbf{x} \mid \tau)\, d\mathbf{x}$, which defines the probability of reaching any state in the goal set $G$, a commonly used term for reaching a goal set in chance-constrained control (e.g. Equation (6) in [6]).
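This goal-set probability can be estimated by Monte Carlo when the terminal distribution can be sampled. A sketch assuming a box goal set and an illustrative function name:

```python
import numpy as np

def prob_reach_goal_set(samples, lo, hi):
    """Monte Carlo estimate of P(x_T in G) for the box goal set
    G = [lo, hi], given samples from the terminal-state distribution."""
    inside = np.all((samples >= lo) & (samples <= hi), axis=1)
    return float(inside.mean())

# Terminal belief tightly concentrated inside the unit-square goal set.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0.5, 0.5], 0.01 * np.eye(2), size=5000)
p = prob_reach_goal_set(samples, np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

This is the same particle-style approximation used for chance-constrained control in [6].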

Appendix B Experiment Environments

B-a Dubins Car 2D Navigation

We use the Dubins car model from [26], a simple vehicle model with non-holonomic constraints and state space $\mathbb{R}^2 \times S^1$. The state $\mathbf{x} = (x, y, \theta)$ denotes the car's planar position $(x, y)$ and orientation $\theta$. The dynamics obey

$\dot{x} = v \cos\theta, \quad \dot{y} = v \sin\theta, \quad \dot{\theta} = \omega$

where $v$ is a linear speed and $\omega$ is the turn rate.

We use an arc primitive parameterization similar to [26] to generate trajectory samples for CEM. Actions are applied at each timestep for a fixed duration such that the robot moves in a straight line if the turn rate is zero and otherwise arcs with radius equal to the ratio of its linear speed to its turn rate. Each primitive's speed, turn rate, and duration are sampled during CEM optimization.

The state space under this parameterization evolves in closed form as

$x_{k+1} = x_k + \frac{v_k}{\omega_k}\left(\sin(\theta_k + \omega_k \delta t_k) - \sin\theta_k\right)$
$y_{k+1} = y_k - \frac{v_k}{\omega_k}\left(\cos(\theta_k + \omega_k \delta t_k) - \cos\theta_k\right)$
$\theta_{k+1} = \theta_k + \omega_k \delta t_k$

where $v_k$, $\omega_k$, and $\delta t_k$ are the parameters of the $k$th primitive. Note we add a small value to each $\omega_k$ to avoid division by zero, which simplifies the computation described in [26]. We extend the model in [26] to have stochastic dynamics by adding zero-mean Gaussian noise with a small fixed variance to the state updates in Equations 31-33.

B-B 7-DOF Arm Environment

We use the Franka Emika Panda arm rendered in rviz. The state $\mathbf{q} \in \mathbb{R}^7$ is the arm's 7-D joint configuration in radians, and we compute state updates simply as

$\mathbf{q}_{t+1} = \mathbf{q}_t + \Delta\mathbf{q}_t$

where $\Delta\mathbf{q}_t$ is the commanded change in configuration. We use PyBullet for checking collisions between the robot and the environment.

We use the same KL divergence cost and collision cost as in the Dubins car environment. We add one additional cost term for the arm specifying that the arm's end-effector should reach a desired position in its workspace:

$c_{ee}(\mathbf{q}) = \lVert \phi(\mathbf{q}) - \mathbf{x}_{des} \rVert^2$

where $\mathbf{x}_{des}$ is the desired end-effector position to be reached and $\phi(\cdot)$ is the robot's forward kinematics function. We compute the forward kinematics using PyBullet.
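The end-effector cost can be sketched with a stand-in forward kinematics function; the paper queries PyBullet for the 7-DOF Panda, whereas the planar 2-link model and all names below are assumptions made purely for illustration:

```python
import numpy as np

def fk_planar_2link(q, l1=1.0, l2=1.0):
    """Illustrative stand-in for the forward kinematics phi(q): the
    end-effector position of a planar 2-link arm with link lengths l1, l2."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def end_effector_cost(q, x_des, fk=fk_planar_2link):
    """Squared distance between the end-effector position and the
    desired workspace position x_des."""
    return float(np.sum((fk(q) - x_des) ** 2))

# A fully extended arm reaches (2, 0) exactly, so the cost there is zero.
c0 = end_effector_cost(np.zeros(2), np.array([2.0, 0.0]))
```

Swapping `fk` for a simulator-backed kinematics query leaves the cost term unchanged, which is what makes this workspace cost easy to combine with the divergence and collision costs.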