I. Introduction
Representing goals is a fundamental step in formalizing problems in robotics [30, 50]. Goals provide an intuitive means by which humans communicate task objectives to an agent, and the choice of representation can greatly impact what behaviors the agent may achieve. Goals are commonly considered abstractly as elements of a (possibly infinite) subset $\mathcal{X}_G$ of the agent's state space $\mathcal{X}$. To achieve a task, the robot must act so as to arrive at a state $x \in \mathcal{X}_G$.
In practice, it is common to have only a single desired goal state $x_g$ such that $\mathcal{X}_G = \{x_g\}$. A planning problem can then be formalized to have the robot minimize the distance between the terminal state $x_T$ in its plan and $x_g$ according to a goal-parameterized distance metric [35, 54].^1 Due to numerical imprecision in floating point arithmetic, standard approaches consider the goal achieved when the agent reaches a state within an $\epsilon$-ball around the goal state. Even if numeric precision were perfect, the same issue arises under stochastic dynamics or state uncertainty, where an agent is unlikely to ever arrive at a specific state in a verifiable manner [51]. Note that in these stochastic and belief space control settings, the single-goal case directly extends to the expected cost formulation [51], while the multi-goal set formulation extends to chance-constrained control (i.e. maximizing the probability of reaching the goal set) [6].

^1 An alternative formulation achieves the same result by having the robot minimize $d(x_T, x_g)$, where $d$ is a distance function defined between arbitrary states in $\mathcal{X}$ [3, 26, 37].

While goal states are suitable for many domains and algorithms, these common relaxations highlight their shortcomings as an appropriate goal representation for planning and control under uncertainty in continuous state spaces. We advocate for probability distributions as a more suitable goal representation. For planning with stochastic dynamics, we can estimate the robot's state distribution at future points in time and use a divergence function to quantify how far the predicted state distribution is from the goal distribution. We can then formalize a planning problem as minimizing this divergence. Note this approach naturally extends to partially observable domains, where the robot maintains a belief distribution over its current state (e.g. using a recursive filter [51]) and must replan online.

Goal distributions subsume point-based goals; a Dirac delta distribution at the goal state defines an indicator of being in that state [38]. Uniform distributions probabilistically encode the classical notion of goal sets. Gaussian goal distributions naturally encode smoothly increasing costs near a goal state. Mixture models (e.g. Gaussian mixture models) can represent disjoint goal regions of varying importance and achievability. Goal distributions are indispensable in learning from demonstration settings [39], where goals are often estimated in a data-driven fashion from noisy samples provided by suboptimal experts [2, 13, 27].

In this paper, we present a modular algorithm for planning to goal distributions under state uncertainty and stochastic dynamics. We implement a specific instantiation of this algorithm using the unscented transform [52] to propagate state uncertainty, the cross-entropy method (CEM) [26, 45] for derivative-free optimization to minimize the Kullback-Leibler (KL) divergence between the current state distribution and the goal distribution, and model predictive control (MPC) for robust execution. An example execution is shown in Figure 1. We provide a number of illustrative examples of how the behavior of the planner can be beneficially modulated by the choice of the distribution modeled by the planner and the choice of goal distribution. We additionally derive reductions of our cost function to commonly utilized cost functions in the literature.
The paper is organized as follows. In Section II we describe preliminary concepts that we utilize in our approach. We show reductions of the KL divergence cost to standard cost functions used in planning in Section III. In Section IV we formally define our method and identify the specific algorithmic choices we make. We show illustrative examples of our method applied to a 2D navigation problem and motion planning for a 7-DOF arm in Section V. We delay our discussion of related work to Section VI in order to contextualize other approaches in the framework we formalize. We conclude the paper in Section VII with a brief discussion of our results and some interesting directions for future research.
All software, videos, and supplemental material related to this paper can be found at the associated website:
II. Preliminaries
Divergence functions, which encode dissimilarity between probability distributions, are central to our planning to goal distributions formulation. Specifically, we explore using the Kullback-Leibler (KL) divergence as the primary cost in our objective function for a stochastic trajectory optimization view of planning. Formally, KL divergence is defined as
    D_{KL}(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, dx    (1)

where $p$ and $q$ are distributions of a continuous random variable $x \in \mathcal{X}$. Following standard convention, we define $p(x) \log \frac{p(x)}{q(x)}$ to be $0$ if $p(x) = 0$, and $\infty$ if $p(x) > 0$ and $q(x) = 0$. The KL divergence is an asymmetric measure. This asymmetry induces two different optimization problems when optimizing a distribution $q$ to minimize KL divergence from a target distribution $p$:

    M-projection:  \arg\min_q D_{KL}(p \,\|\, q)
    I-projection:  \arg\min_q D_{KL}(q \,\|\, p)
The information or I-projection exhibits mode-seeking behavior, while the moment or M-projection performs moment matching, seeking coverage of all regions where the target distribution has support [36]. We demonstrate in Sections III and V that both of these projections are necessary for planning to goal distributions, depending on the choice of goal distribution family and state uncertainty distribution.

III. Cost Function Reductions
Before providing our planning algorithm in Section IV, we show reductions of KL divergence between state and goal distributions to standard cost functions used in planning. Please see our supplementary material for full derivations.
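Several of the reductions that follow evaluate the KL divergence between Gaussians, which has a well-known closed form. A minimal sketch in Python (using NumPy; the function and variable names are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(N(mu_p, cov_p) || N(mu_q, cov_q)) in d dimensions."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)   # covariance mismatch
                  + diff @ cov_q_inv @ diff     # Mahalanobis mean mismatch
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))
```

Evaluating both orderings on the same pair of Gaussians makes the asymmetry of Section II concrete: $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$ generally differ.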
Goal Set Indicator: We set the terminal state uncertainty distribution when following trajectory $\tau$ to be the Dirac delta distribution (i.e. known state/deterministic dynamics), $p(x_T) = \delta(x - x_T)$, and the goal distribution to be a uniform distribution, $U_{\mathcal{X}_G}$, over the goal set $\mathcal{X}_G$. Then minimizing the I-projection we recover

    D_{KL}(p(x_T) \,\|\, U_{\mathcal{X}_G}) = \begin{cases} c & \text{if } x_T \in \mathcal{X}_G \\ \infty & \text{otherwise} \end{cases}    (2)

where $c$ is a constant independent of $x_T$. Hence the minimum is obtained with a constant cost if the terminal state reaches any point in the goal set, while any state outside the goal set receives infinite cost. Note the function is non-differentiable, as expected from the set-based goal definition. We can treat a single goal state naturally as a special case of this function.
(Weighted) Euclidean Distance: By setting the goal distribution to a Gaussian with mean centered at the desired goal state, $\mathcal{N}(x_g, \Sigma_g)$, and assuming known state as above, minimizing the I-projection recovers a weighted squared Euclidean distance:

    \frac{1}{2}(x_T - x_g)^\top \Sigma_g^{-1} (x_T - x_g) + \mathrm{const}    (3)

where $\Sigma_g^{-1}$ defines the precision matrix of the goal distribution. By setting the precision matrix to be the identity matrix, we recover the standard squared Euclidean distance. We recover the same cost by minimizing the M-projection between a Gaussian state and a Dirac goal, albeit weighted by the belief state precision. For any dimension we wish to ignore, we can set the precision to 0 (i.e. variance $\to \infty$).

Maximum Probability of Reaching Goal Point: By minimizing the M-projection with a Dirac delta distribution at the goal point, $\delta(x - x_g)$, and an arbitrary belief distribution over the state, the KL divergence reduces to

    -\log p(x_T = x_g)    (4)

which is equivalent to minimizing the negative log-probability (maximizing the probability or expectation) of reaching the point-based goal.
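The weighted Euclidean reduction (Eq. (3)) can be checked numerically: with a near-Dirac state belief, differences in the I-projection KL to a Gaussian goal equal differences in the precision-weighted squared Euclidean distance, since the shared constant drops out when comparing candidate terminal states. A sketch under our own illustrative names:

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - d + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

# Gaussian goal with mean x_g and precision matrix prec_g.
x_g = np.array([1.0, -2.0])
prec_g = np.diag([4.0, 1.0])
cov_g = np.linalg.inv(prec_g)

# Two candidate terminal states with near-Dirac (tiny covariance) beliefs.
xa, xb = np.array([0.0, 0.0]), np.array([2.0, 1.0])
tiny = 1e-10 * np.eye(2)
kl_a = kl_gaussian(xa, tiny, x_g, cov_g)
kl_b = kl_gaussian(xb, tiny, x_g, cov_g)

# Precision-weighted half squared Euclidean distances to the goal mean.
quad_a = 0.5 * (xa - x_g) @ prec_g @ (xa - x_g)
quad_b = 0.5 * (xb - x_g) @ prec_g @ (xb - x_g)
```

The candidate further from the goal mean (in the precision-weighted metric) receives the higher KL cost, exactly as Eq. (3) predicts.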
Chance-Constrained Goal Set: If we take a uniform goal distribution over $\mathcal{X}_G$ and an arbitrary distribution over the state, the resulting minimization of the M-projection can equivalently be stated as the following maximization problem:

    \max_\tau \int_{\mathcal{X}_G} p(x_T) \, dx_T    (5)

which defines the probability of reaching any state in the goal set $\mathcal{X}_G$, a commonly used term for reaching a goal set in chance-constrained control (e.g. Equation (6) in [6]).
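The goal-set probability in Equation (5) can be estimated by simple Monte Carlo when the state belief is easy to sample. A sketch assuming a Gaussian belief and an axis-aligned rectangular goal set (this setup and all names are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Axis-aligned rectangular goal set X_G = [lo, hi] (an illustrative choice).
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])

def goal_set_prob(mu, cov, n=200_000):
    """Monte Carlo estimate of P(x_T in X_G) under a Gaussian belief."""
    samples = rng.multivariate_normal(mu, cov, size=n)
    inside = np.all((samples >= lo) & (samples <= hi), axis=1)
    return inside.mean()
```

A belief concentrated inside the goal set yields a probability near one, while a belief centered far outside yields a probability near zero, matching the chance-constrained objective.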
IV. Methods
We now present the details of our approach to planning under uncertainty to goal distributions via KL divergence minimization. The full procedure is detailed in Algorithm 1.
Consider an agent with state space $\mathcal{X}$ that maintains a belief distribution $p(x_t \mid \theta_t)$ for its current state, where $\theta_t$ are sufficient statistics defining the distribution. The distribution is estimated from a history of observations from the agent's sensors, e.g. using a recursive filter [51].^2 The transition dynamics are stochastic with additive Gaussian noise:

^2 In fully observable domains, where the robot directly observes its state, we can define a Dirac distribution at the observed point, or set the state as the mean of a Gaussian with an arbitrarily tight variance.

    x_{t+1} = f(x_t, u_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)    (6)
where $f$ is a nonlinear function. The objective is to plan and execute a sequence of actions that transforms the current state belief distribution into a goal distribution, while accounting for the state uncertainty and stochastic dynamics. The goal distribution may be manually specified, e.g. a Gaussian centered about a desired state to be reached, or estimated from data with a density estimation procedure.
There are four modular components to our approach, each detailed below in Sections IV-A through IV-D. While we have not performed a comprehensive evaluation of these different design choices, we justify and note specific features of our choices and highlight that other variants are possible. First, the combined components provide sufficiently fast computation for use inside a model predictive controller. Second, we use gradient-free optimization, which enables us to incorporate arbitrary costs and constraints in addition to our goal distribution costs. Finally, our planner can naturally encode multimodal trajectory distributions, enabling us to comprehensively reason over multimodal goal representations.
IV-A Planning Cost to Goal Distributions
We use the KL divergence between the current state distribution and the goal distribution as the cost to be minimized. We accumulate cost over time such that the costs later in the plan have higher weight (see also line 13 in Alg. 1):
    C = \sum_{i=1}^{T} w_i \, D_{KL}\big(p(x_{t+i}) \,\|\, p_g\big)    (7)

where $t$ is the current timestep, $T$ is the planning horizon, $w_i$ are weights increasing in $i$, and $D_{KL}$ denotes KL divergence as in Equation 1. We additionally utilize an application-specific cost function on the sigma points computed for the unscented transform. For example, in our 2D navigation experiments in Section V we explore collision costs on the sigma points to avoid collisions under uncertain dynamics. We can optionally incorporate common action costs to, for example, enforce smoothness or low-energy solutions.
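A sketch of the accumulated cost of Equation (7) for a Gaussian state trajectory and a Gaussian goal. The linearly increasing weights here are one illustrative choice for upweighting later timesteps, not necessarily the paper's exact scheme; all names are our own:

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - d + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def trajectory_kl_cost(means, covs, mu_g, cov_g):
    """Sum KL(p(x_{t+i}) || p_g) over the horizon, weighting step i by i
    so that deviations from the goal distribution late in the plan cost more."""
    return sum(i * kl_gaussian(m, c, mu_g, cov_g)
               for i, (m, c) in enumerate(zip(means, covs), start=1))
```

A predicted belief trajectory that ends on the goal distribution scores lower than one that drifts away from it.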
Note that Equation 7 is formulated as an I-projection as discussed in Section II. The mode-seeking behavior of the I-projection is typically desired for planning to goal distributions. However, for distributions with bounded support like uniform and Dirac delta distributions, the M-projection is necessary to avoid division by zero. We explore this issue further in our experiments in Section V.
IV-B Goal Distribution
Our goal formulation permits any distribution for which KL divergence can be computed with respect to a Gaussian state distribution. In our experiments in Section V we explore Dirac delta, uniform, Gaussian, and Gaussian mixture model distributions. Approximations are also suitable when a closed-form solution to the KL divergence cannot be computed. For example, we use the unscented transform to efficiently compute an accurate approximation of the KL divergence between GMMs [18], for which there is no closed-form solution.
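A sketch of a sigma-point approximation of the KL divergence between a Gaussian belief and a GMM goal, in the spirit of the unscented estimate of [18]; the specific point set, weights, and names here are our own simplified choices:

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    d = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_pdf(x, weights, mus, covs):
    return sum(w * gauss_pdf(x, m, c) for w, m, c in zip(weights, mus, covs))

def kl_gauss_gmm_unscented(mu, cov, weights, mus, covs):
    """Approximate KL(N(mu, cov) || GMM) by averaging the log density ratio
    over symmetric sigma points of the Gaussian belief."""
    d = mu.shape[0]
    L = np.linalg.cholesky(d * cov)   # columns scaled so moments match
    pts = [mu + s * L[:, i] for i in range(d) for s in (1.0, -1.0)]
    return np.mean([np.log(gauss_pdf(x, mu, cov)
                           / gmm_pdf(x, weights, mus, covs)) for x in pts])
```

When the GMM collapses to a single component equal to the belief, the estimate is zero, as the true KL divergence is.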
IV-C Planner
We use cross-entropy method (CEM) [26, 45] optimization for planning. CEM (lines 6-18 in Alg. 1) is a widely used gradient-free optimization method favored for the ease with which it can optimize arbitrary costs that may be expensive or impossible to compute derivatives for. In a planning context [26], CEM generates action sequences at each iteration sampled from a distribution over action sequences, typically a Gaussian or Gaussian mixture model (GMM). The distribution is initialized to be broad enough to achieve coverage of the state space. Every sample is evaluated under the cost function and the distribution is refit using maximum likelihood estimation to the top low-cost samples, the so-called elite set. This procedure iterates until the KL divergence between distributions at consecutive iterations falls below a tolerance. We execute plans in a model predictive control (MPC) style to ensure robustness (lines 3-4 in Alg. 1), terminating when the total cost falls below a tolerance.
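A minimal sketch of the CEM inner loop described above, using a Gaussian sampling distribution over the decision variables. A toy cost stands in for the KL trajectory cost, and the mean-shift convergence test is a simplification of the KL-based test in Alg. 1; all names are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def cem(cost_fn, dim, n_samples=200, n_elite=20, iters=50, tol=1e-4):
    """Cross-entropy method: sample, score, refit to the elite set, repeat."""
    mu, sigma = np.zeros(dim), 5.0 * np.ones(dim)   # broad initial coverage
    for _ in range(iters):
        samples = mu + sigma * rng.standard_normal((n_samples, dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]   # lowest-cost samples
        new_mu = elites.mean(axis=0)                    # maximum likelihood refit
        new_sigma = elites.std(axis=0) + 1e-6
        converged = np.abs(new_mu - mu).max() < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu
```

On a quadratic cost the returned mean converges near the minimizer, illustrating how the distribution contracts around low-cost action sequences.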
The planner can easily represent multimodal solutions by using a GMM distribution over trajectories. This interplays nicely with different goal distribution choices as we show in our experiments in Section V.
IV-D Uncertainty Propagation
We use the unscented transform (lines 19-32 in Alg. 1) to propagate state uncertainty under stochastic nonlinear dynamics during planning. The unscented transform is best known for its use in the unscented Kalman filter (UKF) [53], a more accurate variant of the Kalman filter for nonlinear dynamics than the extended Kalman filter (EKF) [51]. However, it has also been successfully applied to uncertainty propagation through learned, nonlinear dynamics functions [7], trajectory optimization [35], and robot arm motion planning [31]. The unscented transform computes a small number of sigma points that are passed through the nonlinear function to systematically probe how the state distribution transforms under the nonlinearity. Typically $2d+1$ points are computed for state dimension $d$: one point at the distribution mean and two points per dimension symmetrically dispersed from the mean. We use the sigma point algorithm presented in [35], as it has only one open parameter to tune the sigma point dispersion, compared to the three open parameters of the original formulation [52]. We choose the unscented transform primarily for its accuracy and efficiency over a Monte Carlo sampling approach. We assume a Gaussian belief state.

V. Experiments
We demonstrate some of the possibilities achievable with our framework using different goal distributions, CEM planner distributions, and cost objectives. We use a simple 2D navigation environment for these examples in Section V-A to clearly illustrate the different behaviors. We additionally show an example on a larger state space for motion planning with a 7-DOF arm in Section V-B. All of our code is open-source and implemented in PyTorch for fast batch computation.
V-A Dubins Car 2D Navigation
We use the Dubins car model from [26], a simple vehicle model with nonholonomic constraints. The state denotes the car's planar position $(x, y)$ and orientation $\theta$. The dynamics obey

    \dot{x} = v \cos\theta, \quad \dot{y} = v \sin\theta, \quad \dot{\theta} = \omega    (8)

where $v$ is a linear speed and $\omega$ is the turn rate.
We use an arc primitive parameterization similar to [26] to generate trajectory samples for CEM. Actions are applied at each timestep for a fixed duration such that the robot moves in a straight line with velocity $v$ if $\omega = 0$ and arcs with radius $v/\omega$ otherwise. A trajectory is composed of a sequence of arc primitives whose parameters are sampled during CEM optimization. We extend the model in [26] to have stochastic dynamics by adding Gaussian noise to the state updates. Please see our supplementary material for further details.
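A sketch of the arc-primitive rollout under the Dubins dynamics, using simple Euler integration. The parameter names and the fixed per-primitive duration are our own illustrative choices:

```python
import numpy as np

def dubins_step(state, v, omega, dt, noise_std=0.0, rng=None):
    """One Euler step of the Dubins car (x, y, theta), with optional
    additive Gaussian noise as in the stochastic extension of [26]."""
    x, y, th = state
    nxt = np.array([x + v * np.cos(th) * dt,
                    y + v * np.sin(th) * dt,
                    th + omega * dt])
    if noise_std > 0.0:
        nxt = nxt + noise_std * (rng or np.random.default_rng()).standard_normal(3)
    return nxt

def rollout(state, primitives, v, dt, steps_per_primitive=10):
    """Apply a sequence of arc primitives (turn rates), each held for a fixed
    duration; omega = 0 gives a straight line, otherwise an arc."""
    traj = [np.array(state, dtype=float)]
    for omega in primitives:
        for _ in range(steps_per_primitive):
            traj.append(dubins_step(traj[-1], v, omega, dt))
    return np.array(traj)
```

With zero turn rate the car drives straight ahead; a nonzero turn rate accumulates heading change linearly over the primitive's duration.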
Costs in each example are computed according to Eq. 7. We additionally employ a collision cost on the sigma points,

    C_{coll} = \alpha \sum_{i} \mathbb{1}_{coll}(x^{(i)})    (9)

where $\mathbb{1}_{coll}$ is an indicator function determining whether a state is in collision and $\alpha$ is a gain factor on the collision cost. We note that while it is possible to simply reject samples that have any point in collision [26], we empirically found it better to accumulate collision costs in an MPC setting, as it can take too long to find collision-free samples without a good prior for action trajectory initialization.
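The sigma points on which this collision cost is accumulated come from the unscented transform (Section IV-D). A sketch of sigma-point generation and propagation through a nonlinear function with additive noise; for simplicity we use a symmetric point set with uniform weights rather than the tunable parameterization of [35], and all names are our own:

```python
import numpy as np

def sigma_points(mu, cov):
    """2d symmetric sigma points of N(mu, cov), each with weight 1/(2d)."""
    d = mu.shape[0]
    L = np.linalg.cholesky(d * cov)   # columns scaled so sample moments match
    return np.array([mu + s * L[:, i] for i in range(d) for s in (1.0, -1.0)])

def unscented_propagate(mu, cov, f, noise_cov):
    """Approximate the mean and covariance of f(x) + noise for x ~ N(mu, cov)
    by pushing the sigma points through the nonlinear function f."""
    pts = sigma_points(mu, cov)
    out = np.array([f(p) for p in pts])
    new_mu = out.mean(axis=0)
    diff = out - new_mu
    new_cov = diff.T @ diff / len(pts) + noise_cov
    return new_mu, new_cov
```

For a linear function the propagated mean and covariance are exact, which makes the scheme easy to sanity-check before applying it to nonlinear dynamics.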
We explore different combinations of goal distribution, CEM planner distribution, and KL projection to illustrate the behavior of our planner. Each example is visualized in Fig. 2. In every case the robot has a Gaussian belief distribution over its current pose in the environment. We manually specify this distribution as a scaled unit Gaussian centered at the robot’s state for simplicity, but note it can be acquired through state estimation with recursive filters in more realistic settings [51]. The images show an initial plan generated from the robot’s starting pose. Please see the supplementary video for MPC executions.
Fig. 2a-d illustrate why the mode-seeking behavior of the KL divergence I-projection is preferred for matching a Gaussian state distribution to Gaussian and GMM goal distributions. Fig. 2a shows optimizing the I-projection between a Gaussian state distribution and a Gaussian goal distribution, where the path converges towards the mean of the goal distribution in the left room. A multimodal goal modeled as a GMM is shown in Fig. 2b with asymmetrically weighted Gaussian components in the left and right rooms (weights 0.2 and 0.8, respectively). The lines illustrate the paths the robot took in 10 different MPC executions. Note that the number of paths leading to each goal is proportional to the component weightings, i.e. 2 paths go to the left goal (blue paths) while 8 paths go to the right goal (yellow paths) in the asymmetric weighting. Fig. 2c has equally-weighted components, resulting in 5 paths going to each goal in 10 MPC executions. The different weightings may represent varying levels of achievability, or that some goals are better estimated than others in a data-driven goal estimation procedure [13].
We see in Fig. 2d that optimizing the M-projection instead of the I-projection for a multimodal goal results in the robot driving in between the two goals, getting stuck in the middle room. These results may on the surface suggest the I-projection should always be used. However, the M-projection is necessary when the goal distribution has bounded support, as is the case for uniform and Dirac delta distributions. The I-projection in these cases would cause division by zero at every point outside the goal distribution's support, resulting in an infinite cost that cannot be optimized over the robot's state space. Planning to a Dirac delta goal distribution is shown in Fig. 2e, where the KL divergence in this case reduces to maximizing the probability of reaching the goal state in the left room, as shown in Section III. Since the state belief distribution in this case is Gaussian, the cost further reduces to a precision-weighted Euclidean distance between the robot's state and the goal state. Planning to a uniform goal distribution is shown in Fig. 2f, which probabilistically encodes the set-based goal of arriving in the left room, irrespective of the precise end pose in that room.
Utilizing a GMM distribution in the CEM planner while planning to a Gaussian goal exhibits multimodal trajectories, as illustrated in Fig. 2g. The blue lines are trajectory samples generated from the planner GMM that take different routes around the central obstacle. This allows for generating multiple solutions for arriving at a goal, which can then be pursued according to a selection strategy, e.g. further refining for shortest path optimality [26]. We can use a GMM as both the planning distribution and the goal distribution to achieve multimodal trajectories that each arrive at a different goal, as shown in Fig. 2h. This multi-objective planning can be useful if, for example, a goal becomes infeasible during execution due to a dynamic obstacle; the alternative path to the other goal can then be immediately pursued. In our MPC executions with the GMM as the planner distribution, one of the solutions ends up being selected by chance due to one mode becoming biased in CEM sampling. We comment further on this point in our discussion in Section VII.
These examples illustrate that our framework subsumes point-based and set-based goal-directed behavior while explicitly accounting for the uncertainty in the robot's state. Further, we can optimize for multimodal goals and multimodal solutions to goals. While the literature elucidates the M-projection/I-projection distinction well [36], we find it illuminating to see it demonstrated in a planning context general enough to accommodate different goal distribution families, including probabilistic variants of classical set- and point-based goals.
V-B 7-DOF Arm Motion Planning
We further demonstrate our method on a 7-DOF arm, which has a larger state space than the 2D navigation environment in Section V-A. As illustrated in the left of Fig. 3, the robot is given a target position (yellow sphere) for its end-effector to reach. An inverse kinematics solution is computed (red robot) to serve as its joint space goal. However, that solution is in collision with the gray pole. In spite of this, we can set the solution as the mean of a Gaussian goal distribution. In addition to the KL divergence and collision costs from the 2D environment, we add a sigma-point cost for the position error between the end-effector and the target position. Running our planner with MPC results in the path shown in the right of Fig. 3. The robot finds a path that reaches its end-effector around the obstacle to the target position without collision. Note the dynamics are stochastic with additive Gaussian noise and we use kinematic control only (i.e. commanded joint positions). Please see our supplementary material for further details.
VI. Related Work
Planning and control under uncertainty have a long history in robotics [22, 30, 51]. The problem can most generally be formalized in the language of partially observable Markov decision processes (POMDPs) [4, 22, 49], which account for both the agent's measurement uncertainty and stochastic controls [51]. POMDPs are infamously intractable to solve [51], so most solutions are approximate [33]. Our approach falls into this approximate class.

Our approach closely relates to the fully probabilistic control design [24, 23, 48] and path integral [54] classes of control as inference [43]. Both methods minimize an M-projection between the controlled state-action distribution and a desired state-action distribution over the time horizon. Probabilistic control design [24, 23, 48] provides a theoretical optimal control derivation but offers no concrete algorithms, while also noting the connection between Gaussian target distributions under KL and quadratic costs (Eq. (3)) in the context of LQR. Model predictive path integral control optimizes an open-loop control sequence and an unknown optimal control distribution via importance sampling [54]. In contrast, our approach enables direct construction of desired state-based goal distributions for use as targets in the KL divergence minimization, and can use either M- or I-projections depending on the relevant distributions, instead of inducing an optimal distribution based on a target cost function. As such, we find our approach simpler to interpret with respect to state-based costs and motion planning, with the limitation that other costs must be added in explicitly when performing control.
Similar to our approach, state distribution matching to target uniform distributions via KL divergence minimization guides policy learning in [32]. However, this approach automatically generates distributions for use in exploration and only examines a subclass of the distributions and optimizations in this paper. Chen et al. [11] examine divergences as distance functions for minimum path length planning to pointbased goals but ignore explicit goal distributions.
Goal distributions have also been used to sample a state from the distribution and then continue using point-based planning algorithms [37, 42]. Recent improvements on hindsight experience replay (HER) [3] have sought to estimate goal distributions that generate samples from low-density areas of the replay buffer state distribution [28, 41, 55] or that form a curriculum of achievable goals [44, 41]. We note that HER-based methods are typically used in multi-goal reinforcement learning, where states are interpreted as goals irrespective of their semantic significance as a task goal. Goal distributions are considered in [8], where KL divergence is used to guide local policies towards the goal distribution. However, only Dirac delta distributions were considered, which is tantamount to maximizing the expectation of reaching a point-based goal as shown in Eq. (4). Distributed multi-robot systems leverage goal distributions to control a network of robots to a desired configuration [16, 46]. Goal distributions over a known state space are advocated for in [38] as a generalization of goal-conditioned reinforcement learning. Our approach is parallel to [38] in that we explore goal distributions in a planning under uncertainty setting and provide a wider set of reductions.
KL divergence (and f-divergences more broadly) has commonly been used to constrain iterations of optimization in planning [1] and policy search [10, 40, 47]. Stochastic optimal control and planning techniques typically attempt to minimize the expected cost of the trajectory [15, 54] or maximize the probability of reaching a goal set [6, 34]. Information theoretic costs are included in the maximum entropy formulation of reinforcement learning and control [19, 20, 56, 1], which rewards maintaining high information in a stochastic policy in order to avoid unnecessarily biasing the policy class. Another common use of information theoretic cost comes from active exploration in SLAM [9] and object reconstruction [21], where the information gain is computed over the uncertainty of the current world model. In spite of the myriad uses of KL divergence in the literature, it has scarcely been utilized explicitly for defining cost between a predicted state distribution and a goal distribution. We believe this is because probability distributions have remained an underutilized representation for goal spaces.
VII. Discussion and Conclusion
We are excited to investigate many extensions to this work. First, as noted in the text, we could easily extend our work to include other divergences, as has recently been examined in policy search and imitation learning [5, 17, 25]. Alternative uncertainty propagation techniques should also be examined, such as Monte Carlo sampling [6, 54, 12], moment matching [14], or bounded uncertainties [34]. These approaches, together with learned dynamics, could help expand the representational power of the dynamics to include multimodal distributions (e.g. ensembles of neural networks) or learn complex models from few samples (e.g. Gaussian processes).
Additionally, while we demonstrated the ability of our planner to generate multimodal trajectories to reach different goal regions, or paths in distinct homotopy classes reaching a single goal region, our MPC control approach quickly collapses to a unimodal trajectory distribution. In cases where users determine it important to maintain multiple distinct trajectory classes, our method could be extended to incorporate the recently introduced Stein variational MPC [29]. We believe significant work could go into designing specialized solvers for planning based on the chosen divergence, cost, and uncertainty representation to further improve planning efficiency.
References
 [1] (2017) State-regularized policy search for linearized dynamical systems. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS).
 [2] (2016) Simultaneously learning actions and goals from demonstration. Autonomous Robots 40 (2), pp. 211–227.
 [3] (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
 [4] (1965) Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10 (1), pp. 174–205.
 [5] (2017) f-Divergence constrained policy improvement. arXiv.
 [6] (2010) A Probabilistic Particle-Control Approximation of Chance-Constrained Stochastic Predictive Control. IEEE Transactions on Robotics.

 [7] (2016) Extended and Unscented Kitchen Sinks. In International Conference on Machine Learning (ICML).
 [8] (2011) Minimum uncertainty robot navigation using information-guided POMDP planning. In 2011 IEEE International Conference on Robotics and Automation, pp. 6102–6108.
 [9] (2015) Information-theoretic planning with trajectory optimization for dense 3D mapping. In Robotics: Science and Systems, Vol. 11.
 [10] (2017) Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381–3388.
 [11] (2017) Path Planning with Divergence-Based Distance Functions. Computer Aided Geometric Design 66, pp. 52–74.
 [12] (2018) Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. In Advances in Neural Information Processing Systems.
 [13] (2019) Active learning of probabilistic movement primitives. In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pp. 1–8.
 [14] (2015) Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2), pp. 408–423.
 [15] (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.

 [16] (2012) A decentralized kernel density estimation approach to distributed robot path planning. In Proceedings of the Neural Information Processing Systems Conference.
 [17] (2019) A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning (CoRL).
 [18] (2005) A distance measure between GMMs based on the unscented transform and its application to speaker recognition. In Ninth European Conference on Speech Communication and Technology.
 [19] (2017) Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
 [20] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 [21] (2019) Building 3D object models during manipulation by reconstruction-aware trajectory optimization. arXiv preprint arXiv:1905.03907.
 [22] (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1-2), pp. 99–134.
 [23] (2006) Fully probabilistic control design. Systems & Control Letters 55 (4), pp. 259–265.
 [24] (1996) Towards fully probabilistic control design. Automatica 32 (12), pp. 1719–1722.
 [25] (2020) Imitation Learning as f-Divergence Minimization. In Workshop on the Algorithmic Foundations of Robotics.
 [26] (2012) Crossentropy randomized motion planning. In Robotics: Science and Systems, Vol. 7, pp. 153–160. Cited by: §BA, §BA, §BA, §I, §IVC, §VA, §VA, §VA, §VA, footnote 1.
 [27] (2020) Incremental learning of an openended collaborative skill library. International Journal of Humanoid Robotics 17 (01), pp. 2050001. Cited by: §I.
 [28] (2020) Goal densitybased hindsight experience prioritization for multigoal robot manipulation reinforcement learning. In 2020 29th IEEE International Symposium on Robot and Human Interactive Communication (ROMAN). IEEE, Cited by: §VI.
 [29] (2020) Stein Variational Model Predictive Control. In Conference on Robot Learning (CoRL), Cited by: §VII.
[30] (2006) Planning Algorithms. Cambridge University Press.
 [31] (2013) Sigma hulls for Gaussian belief space planning for imprecise articulated robots amid obstacles. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5660–5667.
 [32] (2018) Efficient exploration via state marginal matching. In ICLR Workshop.
 [33] (2008) What makes some POMDP problems easy to approximate? In Advances in Neural Information Processing Systems, pp. 689–696.
 [34] (2020) Safe model-based meta-reinforcement learning: a sequential exploration-exploitation framework. arXiv preprint arXiv:2008.11700.
 [35] (2016) Derivative-free trajectory optimization with unscented dynamic programming. In IEEE Conference on Decision and Control (CDC), pp. 3642–3647.
 [36] (2012) Machine Learning: A Probabilistic Perspective. MIT Press.
 [37] (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200.
 [38] (2020) DisCo RL: distribution-conditioned reinforcement learning for general-purpose policies. Master's Thesis, EECS Department, University of California, Berkeley.
 [39] (2018) Using probabilistic movement primitives in robotics. Autonomous Robots 42 (3), pp. 529–551.
 [40] (2010) Relative entropy policy search. In AAAI, Vol. 10, pp. 1607–1612.
 [41] (2020) Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. arXiv preprint arXiv:2007.02832.
 [42] (2020) Skew-Fit: state-covering self-supervised reinforcement learning. In International Conference on Machine Learning (ICML).
 [43] (2012) On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS).
 [44] (2019) Exploration via hindsight goal generation. In Advances in Neural Information Processing Systems, pp. 13485–13496.

[45] (2013) The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer Science & Business Media.
 [46] (2017) A generalized reduced gradient method for the optimal control of very-large-scale robotic systems. IEEE Transactions on Robotics 33 (5), pp. 1226–1232.
 [47] (2015) Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897.
 [48] (2008) Stochastic control optimal in the Kullback sense. Kybernetika 44 (1), pp. 53–60.
 [49] (1978) The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Operations Research 26 (2), pp. 282–304.
 [50] (2018) Reinforcement Learning: An Introduction. MIT Press.
 [51] (2005) Probabilistic Robotics. MIT Press.
 [52] (1995) Dynamic map building and localization: new theoretical foundations. Ph.D. Thesis, University of Oxford.
 [53] (2000) The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium, pp. 153–158.
 [54] (2017) Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721.
 [55] (2019) Maximum entropy-regularized multi-goal reinforcement learning. arXiv preprint arXiv:1905.08786.
 [56] (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438.
Appendix A Cost Reduction Derivations
We provide more involved derivations of all of the cost function reductions presented in Section III of the paper. In the following, the density function for the uniform distribution over a set $S^g$ is defined as

$U_{S^g}(s) = \begin{cases} \frac{1}{|S^g|} & \text{if } s \in S^g \\ 0 & \text{otherwise} \end{cases}$  (10)

where $|S^g|$ defines the volume of the set. A Dirac-delta distribution for state $s'$ is a special case of a uniform distribution where the support is limited to the single state $s'$. In this case $\delta_{s'}(s) = \infty$ if $s = s'$ and $\delta_{s'}(s) = 0$ otherwise, while the density still integrates to 1.
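For concreteness, the uniform goal-set density in Equation 10 can be sketched in code for an axis-aligned box goal set; the function name and the box parameterization below are illustrative assumptions, not from the paper:

```python
import numpy as np

def uniform_density(s, goal_set_lo, goal_set_hi):
    """Density of a uniform distribution over an axis-aligned goal set.

    The set volume |S^g| is the product of the box side lengths; the
    density is 1/|S^g| inside the set and 0 outside (Equation 10).
    """
    lo = np.asarray(goal_set_lo, dtype=float)
    hi = np.asarray(goal_set_hi, dtype=float)
    volume = float(np.prod(hi - lo))
    inside = bool(np.all((s >= lo) & (s <= hi)))
    return 1.0 / volume if inside else 0.0
```

For a box $[0, 1] \times [0, 2]$ the volume is 2, so the density at any interior state is 0.5 and 0 elsewhere.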
A-A Goal Set Indicator
We set the terminal state uncertainty distribution when following trajectory $\tau$ to be the Dirac-delta distribution (i.e. known state/deterministic dynamics), $p_\tau(s) = \delta_{s_T^\tau}(s)$, where $s_T^\tau$ is the state reached after following trajectory $\tau$. We set the goal distribution to be a uniform distribution over the goal set, $p_g(s) = U_{S^g}(s)$. Minimizing the I-projection of the KL divergence between these distributions we recover

$\min_\tau D_{KL}\left(\delta_{s_T^\tau} \,\|\, U_{S^g}\right) = \min_\tau \int_S \delta_{s_T^\tau}(s) \log \frac{\delta_{s_T^\tau}(s)}{U_{S^g}(s)} \, ds$  (11)

$= \min_\tau -\log U_{S^g}(s_T^\tau) + C$  (12)

$= \min_\tau \begin{cases} \log |S^g| + C & \text{if } s_T^\tau \in S^g \\ \infty & \text{otherwise} \end{cases}$  (13)

where $C$ is the constant entropy term of the Dirac-delta distribution, Equation 12 follows from the sifting property of the Dirac-delta, and Equation 13 follows from the fact that $U_{S^g}(s) = \frac{1}{|S^g|}$ for $s \in S^g$ and $U_{S^g}(s) = 0$ for $s \notin S^g$. Hence the minimum is obtained with a constant cost if the terminal state of trajectory $\tau$ reaches any point in the goal set, while any state outside of $S^g$ receives infinite cost. This function is non-differentiable as expected from the set-based goal definition. We can treat a single goal state naturally as a special case of this function. While the KL divergence provides these specific costs, this cost function is equivalent to any classifier or indicator that provides distinct costs for being inside the goal set versus outside (e.g. a 0/1 indicator). Another common equivalent formulation provides a value of 1 (true) inside the goal set and 0 (false) outside the set and defines a maximization.

A-B (Weighted) Euclidean Distance
Consider a Gaussian goal distribution $p_g(s) = \mathcal{N}(s \mid s^g, \Sigma_g)$ with mean centered at a desired goal state $s^g$, and assume a Dirac-delta distribution over the current state (i.e. known state/deterministic dynamics), $p_\tau(s) = \delta_{s_T^\tau}(s)$, where again $s_T^\tau$ is the resulting state from following trajectory $\tau$. Minimizing the I-projection recovers a weighted squared Euclidean distance:

$\min_\tau D_{KL}\left(\delta_{s_T^\tau} \,\|\, \mathcal{N}(s^g, \Sigma_g)\right) = \min_\tau \int_S \delta_{s_T^\tau}(s) \log \frac{\delta_{s_T^\tau}(s)}{\mathcal{N}(s \mid s^g, \Sigma_g)} \, ds$  (14)

$= \min_\tau -\log \mathcal{N}(s_T^\tau \mid s^g, \Sigma_g) + C$  (15)

$= \min_\tau \frac{1}{2}(s_T^\tau - s^g)^\top \Sigma_g^{-1} (s_T^\tau - s^g) + \frac{1}{2}\log\left((2\pi)^d |\Sigma_g|\right) + C$  (16)

$= \min_\tau \frac{1}{2}(s_T^\tau - s^g)^\top \Lambda_g (s_T^\tau - s^g)$  (17)

$= \min_\tau (s_T^\tau - s^g)^\top \Lambda_g (s_T^\tau - s^g)$  (18)

where $\Lambda_g = \Sigma_g^{-1}$ defines the precision matrix of the goal distribution, and Equations 17 and 18 drop constant terms and scale factors that do not affect the minimization. By setting the precision matrix to be the identity matrix, we recover the standard squared Euclidean distance between the terminal state $s_T^\tau$ and goal state $s^g$. We also note that setting the precision of any dimension to 0 to ignore that dimension corresponds to an associated variance $\sigma^2 \rightarrow \infty$.
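The reduced cost in Equation 18 amounts to a precision-weighted squared distance and is straightforward to compute; a minimal sketch (function and argument names are our own):

```python
import numpy as np

def weighted_sq_euclidean(s_T, s_goal, precision):
    """Weighted squared Euclidean distance (Equation 18):
    (s_T - s^g)^T Lambda_g (s_T - s^g), with Lambda_g = Sigma_g^{-1}.
    An identity precision recovers the standard squared Euclidean
    distance; a zero on the diagonal ignores that dimension."""
    diff = np.asarray(s_T, dtype=float) - np.asarray(s_goal, dtype=float)
    return float(diff @ precision @ diff)
```

With `precision = np.eye(d)` this is the plain squared Euclidean distance; zeroing one diagonal entry drops that dimension from the cost, matching the infinite-variance interpretation above.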
A-C Maximum Probability of Reaching Goal Point
Minimizing the M-projection with a Dirac-delta distribution at a goal state, $p_g(s) = \delta_{s^g}(s)$, and an arbitrary belief distribution $p_\tau(s)$ over the state when following trajectory $\tau$, the KL divergence reduces to

$\min_\tau D_{KL}\left(\delta_{s^g} \,\|\, p_\tau\right) = \min_\tau \int_S \delta_{s^g}(s) \log \frac{\delta_{s^g}(s)}{p_\tau(s)} \, ds$  (19)

$= \min_\tau -\log p_\tau(s^g) + C$  (20)

$= \min_\tau -\log p_\tau(s^g)$  (21)

$= \max_\tau p_\tau(s^g)$  (22)

which is maximizing the probability of reaching the point-based goal $s^g$ following trajectory $\tau$. In the special case where the probability distribution over the state is a Gaussian, $p_\tau(s) = \mathcal{N}(s \mid \mu_\tau, \Sigma_\tau)$, we recover the same weighted Euclidean distance cost as above, albeit weighted by the belief state precision $\Lambda_\tau = \Sigma_\tau^{-1}$ instead of the goal distribution precision.
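The reduction in Equations 19–22 says the cost is the negative log density of the goal point under the belief. For a Gaussian belief this can be sketched as follows (names are illustrative, not from the paper):

```python
import numpy as np

def neg_log_prob_goal(s_goal, belief_mean, belief_cov):
    """Negative log density of a point goal under a Gaussian belief.

    Minimizing this maximizes p_tau(s^g); up to additive constants it
    is the squared distance to the belief mean weighted by the belief
    precision, as noted in Section A-C.
    """
    d = len(belief_mean)
    diff = np.asarray(s_goal, dtype=float) - np.asarray(belief_mean, dtype=float)
    precision = np.linalg.inv(belief_cov)
    quad = float(diff @ precision @ diff)
    log_norm = 0.5 * (d * np.log(2.0 * np.pi) + np.log(np.linalg.det(belief_cov)))
    return 0.5 * quad + log_norm
```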
A-D Chance-Constrained Goal Set
Consider a uniform distribution over the goal set, $p_g(s) = U_{S^g}(s)$, and an arbitrary distribution $p_\tau(s)$ over the terminal state after following trajectory $\tau$. Minimizing the M-projection of the KL divergence between these distributions we get

$\min_\tau D_{KL}\left(U_{S^g} \,\|\, p_\tau\right) = \min_\tau \int_S U_{S^g}(s) \log \frac{U_{S^g}(s)}{p_\tau(s)} \, ds$  (23)

$= \min_\tau \int_{S^g} \frac{1}{|S^g|} \log \frac{1}{|S^g| \, p_\tau(s)} \, ds$  (24)

$= \min_\tau -\log |S^g| - \frac{1}{|S^g|} \int_{S^g} \log p_\tau(s) \, ds$  (25)

$= \min_\tau -\frac{1}{|S^g|} \int_{S^g} \log p_\tau(s) \, ds$  (26)

$\geq \min_\tau -\log \left( \frac{1}{|S^g|} \int_{S^g} p_\tau(s) \, ds \right)$  (27)

$= \min_\tau -\log \int_{S^g} p_\tau(s) \, ds + \log |S^g|$  (28)

$= \max_\tau \int_{S^g} p_\tau(s) \, ds = \max_\tau \Pr\left(s_T^\tau \in S^g\right)$  (29)

where Equation 26 drops a constant term, Equation 27 applies Jensen's inequality, and Equation 28 again separates out a constant that does not affect the optimization. Equation 29 defines the probability of reaching any state in the goal set $S^g$, a commonly used term for reaching a goal set in chance-constrained control (e.g. Equation (6) in [6]); minimizing the KL divergence thus maximizes a lower bound on this probability.
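The objective in Equation 29 is typically estimated from samples of the terminal state. A minimal Monte-Carlo sketch, assuming an axis-aligned box goal set for simplicity (the set representation and names are our own):

```python
import numpy as np

def estimate_goal_set_prob(samples, goal_lo, goal_hi):
    """Monte-Carlo estimate of P(s_T in S^g) for an axis-aligned goal
    set: the fraction of sampled terminal states inside the box, i.e.
    the chance-constrained objective recovered in Equation 29."""
    samples = np.asarray(samples, dtype=float)
    inside = np.all((samples >= goal_lo) & (samples <= goal_hi), axis=1)
    return float(inside.mean())
```

With sampled terminal states from rollouts of a candidate trajectory, this estimate can serve directly as a (negated) cost inside a sampling-based optimizer such as CEM.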
Appendix B Experiment Environments
B-A Dubins Car 2D Navigation
We use the Dubins car model from [26], a simple vehicle model with nonholonomic constraints in the state space $SE(2)$. The state $q = (x, y, \theta)$ denotes the car's planar position $(x, y)$ and orientation $\theta$. The dynamics obey

$\dot{x} = v \cos\theta, \qquad \dot{y} = v \sin\theta, \qquad \dot{\theta} = \omega$  (30)

where $v$ is a linear speed and $\omega$ is the turn rate.
We use an arc primitive parameterization similar to [26] to generate trajectory samples for CEM. Actions $(v_i, \omega_i)$ are applied at each timestep for duration $\Delta t$ such that the robot moves in a straight line with velocity $v_i$ if $\omega_i = 0$ and arcs with radius $r_i = v_i / \omega_i$ otherwise. A trajectory with $N$ arc primitives has the form $\tau = \left((v_1, \omega_1), \ldots, (v_N, \omega_N)\right)$, whose parameters are sampled during CEM optimization.
The state space under this parameterization evolves as

$x_{i+1} = x_i + r_i \left( \sin\theta_{i+1} - \sin\theta_i \right)$  (31)

$y_{i+1} = y_i - r_i \left( \cos\theta_{i+1} - \cos\theta_i \right)$  (32)

$\theta_{i+1} = \theta_i + \omega_i \Delta t$  (33)

where $r_i = v_i / \omega_i$ and $(v_i, \omega_i)$ parameterize the $i^{\text{th}}$ primitive. Note we add a small value $\epsilon$ to each $\omega_i$ to avoid division by zero, which simplifies the computation described in [26]. We extend the model in [26] to have stochastic dynamics by adding zero-mean Gaussian noise with a small fixed standard deviation to the state updates in Equations 31–33.
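The arc-primitive updates in Equations 31–33, including the small $\epsilon$ added to each $\omega_i$ and the optional Gaussian noise, can be sketched as follows (function and parameter names, and the exact noise placement, are our reading of the description, not the paper's code):

```python
import numpy as np

def rollout_arc_primitives(q0, primitives, dt, noise_std=0.0, eps=1e-6, rng=None):
    """Roll out (v_i, w_i) arc primitives with the closed-form updates
    of Equations 31-33, optionally perturbing each update with
    zero-mean Gaussian noise to model stochastic dynamics."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, y, th = q0
    states = [np.array([x, y, th])]
    for v, w in primitives:
        w = w + eps           # avoid division by zero, as in the paper
        r = v / w             # arc radius r_i = v_i / w_i
        th_next = th + w * dt
        x = x + r * (np.sin(th_next) - np.sin(th))
        y = y - r * (np.cos(th_next) - np.cos(th))
        th = th_next
        if noise_std > 0.0:
            x, y, th = np.array([x, y, th]) + rng.normal(0.0, noise_std, 3)
        states.append(np.array([x, y, th]))
    return np.stack(states)
```

With $\omega_i = 0$ the $\epsilon$-shifted update degenerates numerically to straight-line motion of length $v_i \Delta t$, which is the intended limiting behavior.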
B-B 7-DOF Arm Environment
We use the Franka Emika Panda arm rendered in RViz. The state $q \in \mathbb{R}^7$ is the arm's 7-D joint configuration in radians, where we compute state updates simply as

$q_{t+1} = q_t + \dot{q}_t \, \Delta t$  (34)
We use PyBullet for checking collisions between the robot and the environment.
We use the same KL divergence cost and collision cost as in the Dubins car environment. We add one additional cost term for the arm that specifies the arm's end-effector should reach a desired position in its workspace:

$c_{\text{EE}}(\tau) = \left\| x^{d} - \phi\left(q_T^\tau\right) \right\|$  (35)

where $x^{d}$ is the desired end-effector position to be reached, $\phi(\cdot)$ is the robot's forward kinematics function, and $q_T^\tau$ is the terminal configuration of trajectory $\tau$. We compute the forward kinematics using PyBullet.
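Equation 35 can be sketched with the forward kinematics abstracted behind a callable; the paper computes FK with PyBullet, which we do not reproduce here, so `forward_kinematics` below is a placeholder for any routine mapping a joint configuration to an end-effector position:

```python
import numpy as np

def end_effector_cost(q_T, x_desired, forward_kinematics):
    """Distance between a desired workspace position and the
    end-effector position of the terminal configuration (Equation 35).

    `forward_kinematics` is a placeholder callable q -> xyz standing in
    for the paper's PyBullet-based FK computation.
    """
    x_ee = np.asarray(forward_kinematics(q_T), dtype=float)
    return float(np.linalg.norm(np.asarray(x_desired, dtype=float) - x_ee))
```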