I Introduction
Imagine an unmanned aerial vehicle (UAV) navigating through a dense environment using an RGB-D sensor (Figure 1). Vision-based planning of this kind has been the subject of decades of work in the robotics literature. Traditional approaches to this problem use RGB-D sensors to perform state estimation and create a (potentially local) map of obstacles in the environment; the resulting state estimate and map are used in conjunction with motion planning techniques (e.g., rapidly-exploring random trees [16] or motion primitives [31, 2]) to perform navigation [31, 26, 21]. Recent approaches have sought to harness the power of deep learning in order to forego explicit geometric representations of the environment and learn to perform vision-based planning [2, 33, 12]. These approaches learn to map raw sensory inputs to a latent representation of the robot's state and environment using neural networks; planning is performed in this latent space. Learning-based approaches to planning have two primary advantages over more traditional methods: (1) the use of convolutional neural networks allows one to elegantly handle RGB-D inputs, and (2) one can learn to exploit statistical regularities of natural environments to improve planning performance. However, deep learning-based approaches to planning currently provide
no explicit guarantees on generalization performance. In other words, such approaches are unable to provide bounds on the performance of the learned planner when placed in a novel environment (i.e., an environment that was not seen during training).

The goal of this paper is to address this challenge by developing an approach that learns to plan using RGB-D sensors while providing explicit bounds on generalization performance. In particular, the guarantees associated with our approach take the form of probably approximately correct (PAC) [23] bounds on the performance of the learned vision-based planner in novel environments. Concretely, given a set of training environments, we learn a planner with a provable bound on its expected performance in novel environments (e.g., a bound on the probability of collision). This bound holds with high probability (over sampled training environments) under the assumption that training environments and novel environments are drawn from the same (but unknown) underlying distribution (see Section II for a formal description of our problem formulation).
Statement of Contributions. To our knowledge, the results in this paper constitute the first attempt to provide guarantees on generalization performance for vision-based planning using neural networks. To this end, we make three primary contributions. First, we develop a framework that leverages PAC-Bayes generalization theory [23] for learning to plan in a receding-horizon manner using a library of motion primitives. The planners trained using this framework are accompanied by certificates of performance on novel environments in the form of PAC bounds. Second, we present algorithms based on evolutionary strategies [38] and convex optimization (relative entropy programming [5]) for learning to plan with high-dimensional sensory feedback (e.g., RGB-D) by explicitly optimizing the PAC-Bayes bounds on performance. Finally, we demonstrate the ability of our approach to provide strong generalization bounds for vision-based motion planners through two examples: navigation of a UAV across obstacle fields (see Fig. 1) and locomotion of a quadruped across rough terrain (see Fig. 1). In both examples, we obtain PAC-Bayes bounds that guarantee successful traversal of roughly 80% of novel environments on average.
I-A Related Work
Planning with Motion Primitive Libraries. Motion primitive libraries comprise precomputed primitive trajectories that can be sequentially composed to generate a rich class of motions. The use of such libraries is prevalent in the motion planning literature; a non-exhaustive list of examples includes navigation of UAVs [21], balancing [18] and navigation [26] of humanoids, grasping [1], and navigation of autonomous ground vehicles [31]. Several approaches furnish theoretical guarantees on the composition of primitive motions, such as: maneuver automata [10], composition of funnel libraries by estimating regions of attraction [3, 34, 21], and leveraging the theory of switched systems [35]. However, unlike the present paper, none of the above provide theoretical guarantees for vision-based planning by composing motion primitives.
Planning with Vision.
The advent of Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) has recently boosted interest in learning vision-based motion planners. Recent methods can be classified into one of three categories: (1) self-supervised learning approaches [37, 12] that uncover low-dimensional latent-space representations of the visual data before planning; (2) imitation learning approaches [2, 33, 28] that leverage an expert's motion plans; and (3) deep reinforcement learning (RL) approaches [39, 9, 11, 17] that perform RL with visual data and uncover the latent-space representations relevant to the planning task. Planning guarantees presented in the above papers are limited to the learned low-dimensional latent-space embeddings and do not necessarily translate to the actual execution of the plan. In this paper, we provide generalization guarantees for the execution of our neural network planning policies in novel environments.

Generalization Guarantees on Motion Plans. Generalization theory was developed in the context of supervised learning to provide bounds on a learned model's performance on novel data [32]. In the domain of robotics, PAC generalization bounds were used in [13] to learn a stochastic robot model from experimental data. PAC bounds were also adopted by the controls community to learn robust controllers [36, 4]; however, their use has not extended to vision-based DNN controllers/policies. In this paper, we use the PAC-Bayes framework, which has recently been successful in providing generalization bounds for DNNs in supervised learning [8, 29]. Our previous work [20, 19] developed the PAC-Bayes Control framework to provide generalization bounds on learned control policies. This paper differs from our previous work in three main ways. (1) We plan using motion primitive libraries instead of employing reactive control policies; this allows us to embed our knowledge of the robot's dynamics while simultaneously reducing the complexity of the policies. (2) We perform deep RL with PAC-Bayes guarantees using rich sensory feedback (e.g., depth maps) on DNN policies. (3) Finally, this paper contributes various algorithmic developments. We develop a training pipeline using Evolutionary Strategies (ES) [38] to obtain a prior for the PAC-Bayes optimization.
Furthermore, we develop an efficient Relative Entropy Program (REP)-based PAC-Bayes optimization with the recent quadratic PAC-Bayes bound [29, Theorem 1], which was shown to be tighter than the bound used in [20, 19].
II Problem Formulation
We consider robotic systems with discrete-time dynamics:

$x_{t+1} = f(x_t, u_t; E), \qquad (1)$

where $t \in \{0, 1, 2, \dots\}$ is the timestep, $x_t \in \mathcal{X}$ is the robot's state, $u_t \in \mathcal{U}$ is the control input, and $E \in \mathcal{E}$ is the robot's "environment". We use the term "environment" broadly to refer to any exogenous effects that influence the evolution of the robot's state; e.g., the geometry of an obstacle environment that a UAV must navigate or the geometry of terrain that a legged robot must traverse. In this paper we will make the following assumption.
Assumption 1
There is an underlying (unknown) distribution $\mathcal{D}$ over the space $\mathcal{E}$ of all environments that the robot may be deployed in. At training time, we are provided with a dataset $S = \{E_1, \dots, E_N\}$ of $N$ environments drawn i.i.d. from $\mathcal{D}$.
It is important to emphasize that we do not assume any explicit characterization of $\mathcal{E}$ or $\mathcal{D}$. We only assume indirect access to $\mathcal{D}$ in the form of the training dataset $S$ (e.g., a dataset of building geometries for the problem of UAV navigation).
Let $h: \mathcal{X} \times \mathcal{E} \to \mathcal{O}$ be the robot's exteroceptive sensor (e.g., vision) that furnishes an observation $o = h(x, E)$ from a state $x$ and an environment $E$. Further, let $h_p$ be the robot's proprioceptive sensor mapping that maps the robot's state $x$ to a sensor output $o_p = h_p(x)$. We aim to learn control policies that have a notion of planning embedded in them. In particular, we will work with policies that utilize rich sensory observations $o$, e.g., vision or depth, to plan the execution of a motion primitive from a library $\mathcal{G} = \{g_\ell : \ell \in \mathcal{L}\}$ in a receding-horizon manner. Each member of $\mathcal{G}$ is a (potentially time-varying) proprioceptive controller and the index set $\mathcal{L}$ is compact.
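A minimal sketch of this receding-horizon scheme follows, with hypothetical stand-ins for the dynamics, sensor, policy, and primitive library (here a primitive is encoded simply as a list of control inputs; the paper's primitives are proprioceptive controllers):

```python
# Hedged sketch of receding-horizon planning over a motion primitive library.
# `step`, `sense`, `policy`, and the list-of-inputs primitive encoding are
# illustrative assumptions, not the paper's implementation.

def select_primitive(policy, observation, library):
    """Score every primitive in the library and pick the highest-scoring one."""
    scores = [policy(observation, k) for k in range(len(library))]
    return library[scores.index(max(scores))]

def receding_horizon_rollout(step, sense, policy, library, x0, horizon):
    """Repeatedly sense, select a primitive, and execute it to completion."""
    x = x0
    trajectory = [x]
    for _ in range(horizon):
        o = sense(x)                        # exteroceptive observation o = h(x, E)
        primitive = select_primitive(policy, o, library)
        for u in primitive:                 # execute the chosen primitive
            x = step(x, u)                  # x_{t+1} = f(x_t, u_t; E)
            trajectory.append(x)
    return trajectory
```

For instance, a 1-D toy system with two single-step primitives (move +1 or -1) and a policy that scores primitives by proximity to the origin drives the state to zero.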
We assume the availability of a cost function $C(\pi; E)$ that defines the robot's task. For the sake of simplicity, we will assume that the environment $E$ captures all sources of stochasticity (including random initial conditions); thus, the cost associated with deploying a policy $\pi$ on a particular environment $E$ (over a given time horizon $T$) is deterministic. In order to apply PAC-Bayes theory, we assume that the cost is bounded; without further loss of generality, we assume $C(\pi; E) \in [0, 1]$. As an example in the context of navigation, the cost function may assign a cost of 1 for colliding with an obstacle in a given environment (during a finite time horizon) and a cost of 0 otherwise.
The goal of this work is to learn policies that minimize the expected cost across novel environments drawn from $\mathcal{D}$:

$\min_{\pi} \; C_{\mathcal{D}}(\pi) := \mathbb{E}_{E \sim \mathcal{D}} \left[ C(\pi; E) \right]. \qquad (2)$
As the distribution $\mathcal{D}$ over environments is unknown, a direct computation of $C_{\mathcal{D}}$ for the purpose of the minimization in (2) is infeasible. The PAC-Bayes framework [22, 23] provides us an avenue to alleviate this problem. However, in order to leverage it, we will work with a slightly more general problem formulation. In particular, we learn a distribution $P$ over the space of policies instead of finding a single policy. When the robot is faced with a given environment, it first randomly selects a policy $\pi \sim P$ and then executes this policy. The corresponding optimization problem is:

$\min_{P \in \mathcal{P}} \; C_{\mathcal{D}}(P) := \mathbb{E}_{E \sim \mathcal{D}} \, \mathbb{E}_{\pi \sim P} \left[ C(\pi; E) \right],$

where $\mathcal{P}$ is the space of probability distributions over the policy space. We emphasize that the distribution $\mathcal{D}$ over environments is unknown to us. We are only provided a finite training dataset $S$ to learn from; solving the problem above thus requires finding (distributions over) policies that generalize to novel environments.

III PAC-Bayes Control
We now describe the PAC-Bayes Control approach developed in [20, 19] and perform suitable extensions for vision-based planning using motion primitives. Let $\Pi = \{\pi_w : w \in \mathcal{W}\}$ denote a space of policies parameterized by weight vectors $w$ that determine the mapping from observations in $\mathcal{O}$ to primitives in $\mathcal{G}$. Specifically, the parameters $w$ will correspond to weights of a neural network. Let $P_0 \in \mathcal{P}$ represent a "prior" distribution over control policies obtained by specifying a distribution over the parameter space $\mathcal{W}$. The PAC-Bayes approach requires this prior to be chosen independently of the dataset $S$ of training environments. As described in Section II, our goal is to learn a distribution $P$ over policies that minimizes the expected cost $C_{\mathcal{D}}(P)$. We will refer to $P$ as the "posterior". We note that the prior and the posterior need not be Bayesian. We define the empirical cost associated with a particular choice of posterior $P$ as the average (expected) cost across the training environments in $S$:

$C_S(P) := \frac{1}{N} \sum_{E \in S} \mathbb{E}_{\pi \sim P} \left[ C(\pi; E) \right]. \qquad (3)$
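For a finite policy space, the empirical cost in (3) is a simple average; a minimal sketch follows, in which the `cost` rollout function and the finite posterior are illustrative assumptions:

```python
# Sketch of the empirical cost C_S(P): the average, over training environments,
# of the expected cost under the posterior. Here the posterior is a finite
# distribution over policies and `cost(i, env)` is a hypothetical rollout cost
# of policy i on environment env; for continuous posteriors a Monte-Carlo
# average over sampled policies would replace the exact expectation.

def empirical_cost(posterior, cost, environments):
    """C_S(P) = (1/N) * sum over E in S of E_{pi ~ P}[ C(pi; E) ]."""
    total = 0.0
    for env in environments:
        # exact expectation over a finite policy space
        total += sum(p * cost(i, env) for i, p in enumerate(posterior))
    return total / len(environments)
```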
The PAC-Bayes Control result can then be stated as follows.
Theorem 1
For any $\delta \in (0, 1)$ and any posterior $P \in \mathcal{P}$, with probability at least $1 - \delta$ over sampled environments $S \sim \mathcal{D}^N$, the following inequalities hold:

$C_{\mathcal{D}}(P) \le C_S(P) + \sqrt{R(P, P_0)}, \qquad (4)$

$C_{\mathcal{D}}(P) \le \left( \sqrt{C_S(P) + R(P, P_0)} + \sqrt{R(P, P_0)} \right)^2, \qquad (5)$

where $R(P, P_0)$ is defined as:

$R(P, P_0) := \frac{\mathrm{KL}(P \,\|\, P_0) + \log\!\left( \frac{2\sqrt{N}}{\delta} \right)}{2N}. \qquad (6)$
The bound (4) was proved in [20, Theorem 2]. The proof of (5) is analogous to that of [20, Theorem 2], with the only difference being that we use [29, Theorem 1] in place of [20, Corollary 1].
This result provides an upper bound (that holds with probability $1 - \delta$) on our primary quantity of interest: the true expected cost $C_{\mathcal{D}}(P)$ of a posterior policy distribution across environments drawn from the (unknown) distribution $\mathcal{D}$. Theorem 1 suggests an approach for choosing a posterior over policies; specifically, one should choose a posterior that minimizes the bounds on $C_{\mathcal{D}}(P)$. The bounds are a composite of two quantities: the empirical cost $C_S(P)$ and a "regularization" term $R(P, P_0)$, both of which can be computed given the training dataset $S$ and a prior $P_0$. Intuitively, minimizing these bounds corresponds to minimizing a combination of the empirical cost and a regularizer that prevents one from overfitting to the specific training environments.
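Both bounds are inexpensive to evaluate once the empirical cost and the KL divergence are available. A sketch for a finite policy space (where the KL divergence has a closed form), assuming the additive form of [20, Theorem 2] for (4) and the quadratic form of [29, Theorem 1] for (5) with regularizer $R = (\mathrm{KL} + \log(2\sqrt{N}/\delta))/(2N)$:

```python
import math

# Sketch of the two PAC-Bayes bounds for a finite policy space. `p` and `p0`
# are probability vectors (posterior and prior). We return whichever bound is
# tighter along with both individual bounds.

def kl_divergence(p, p0):
    """KL(p || p0) for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0) if pi > 0)

def pac_bayes_bounds(emp_cost, p, p0, n_envs, delta):
    r = (kl_divergence(p, p0) + math.log(2 * math.sqrt(n_envs) / delta)) / (2 * n_envs)
    bound4 = emp_cost + math.sqrt(r)                        # additive bound (4)
    bound5 = (math.sqrt(emp_cost + r) + math.sqrt(r)) ** 2  # quadratic bound (5)
    return min(bound4, bound5), bound4, bound5
```

Note that the quadratic bound wins exactly when it evaluates below 1/4, consistent with Proposition 1 below.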
To solve the problem of minimizing $C_{\mathcal{D}}(P)$, we can minimize either (4) or (5). Intuitively, we would like to use the tighter of the two. The following proposition addresses this concern by analytically identifying the regimes in which (5) is tighter than (4) and vice versa.
Proposition 1

The upper bound in (5) is tighter than the upper bound in (4) if and only if the right-hand side of (5) is smaller than 1/4.
The proof of this proposition is detailed in Appendix A.
Proposition 1 shows that (5) is tighter than (4) if and only if the upper bound in (5) is smaller than 1/4; conversely, (4) is tighter than (5) if and only if the upper bound in (5) is greater than 1/4. Hence, in our PAC-Bayes training algorithm we use (4) when the right-hand side of (5) is at least 1/4 and use (5) otherwise.
IV Training
In this section we present our methodology for training vision-based planning policies that provably perform well on novel environments using the PAC-Bayes Control framework. The PAC-Bayes framework permits the use of any prior distribution $P_0$ (independent of the training data) on the policy space. However, an uninformed choice of $P_0$ could result in vacuous bounds [8]. Therefore, obtaining strong PAC-Bayes bounds with reasonable sample complexity calls for a good prior on the policy space. For DNNs, the choice of a good prior is often unintuitive. To remedy this, we split a given training dataset into two parts: $S_1$ and $S_2$. We use the Evolutionary Strategies (ES) framework to train a prior $P_0$ using the training data in $S_1$; more details are provided in Section IV-A. Leveraging this prior, we perform PAC-Bayes optimization on the training data in $S_2$; further details on the PAC-Bayes optimization are presented in Section IV-B.
IV-A Training a PAC-Bayes Prior with ES
We train the prior distribution $P_0$ on the policy space by minimizing the empirical cost on environments belonging to the set $S_1$ with cardinality $N_1$. In particular, we choose $P_0$ to be a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with mean $\mu$ and a diagonal covariance $\Sigma$. Let $\sigma$ be the element-wise square root of the diagonal of $\Sigma$. Our training is performed using the class of RL algorithms known as Evolutionary Strategies (ES) [38]. ES provides multiple benefits in our setting: (a) the presence of a physics engine in the training loop prohibits backpropagation of analytical gradients, and since the policies in our setting are DNNs with a large number of parameters, a naive finite-difference estimate of the gradient would be computationally prohibitive; ES permits gradient estimation with a significantly smaller number of rollouts (albeit resulting in noisy estimates). (b) ES directly supplies us a distribution on the policy space, thereby meshing well with the PAC-Bayes Control framework. (c) The ES gradient estimation can be conveniently parallelized in order to leverage cloud computing resources.

Adapting (3) for $S_1$, we can express the gradient of the empirical cost with respect to (w.r.t.) the parameters $(\mu, \sigma)$ as:

$\nabla_{(\mu, \sigma)} \, C_{S_1}(P_{(\mu, \sigma)}) = \frac{1}{N_1} \sum_{E \in S_1} \nabla_{(\mu, \sigma)} \, \mathbb{E}_{w \sim \mathcal{N}(\mu, \Sigma)} \left[ C(\pi_w; E) \right]. \qquad (7)$
Following the ES framework from [30, 38], the gradient of the empirical cost for any $E \in S_1$ w.r.t. the mean $\mu$ is:

$\nabla_{\mu} \, \mathbb{E}_{w} \left[ C(\pi_w; E) \right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ C(\pi_{\mu + \sigma \odot \epsilon}; E) \, (\epsilon \oslash \sigma) \right], \qquad (8)$

where $\oslash$ is the Hadamard division (element-wise division). Similarly, the gradient of the empirical cost w.r.t. $\sigma$ is:

$\nabla_{\sigma} \, \mathbb{E}_{w} \left[ C(\pi_w; E) \right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ C(\pi_{\mu + \sigma \odot \epsilon}; E) \, \left( (\epsilon \odot \epsilon - \mathbf{1}) \oslash \sigma \right) \right], \qquad (9)$

where $\odot$ is the Hadamard product (element-wise product) and $\mathbf{1}$ is a vector of ones with the same dimension as $\sigma$. Hence, (8) and (9) allow a Monte-Carlo estimation of the gradient. One can derive (8) and (9) using the diagonal covariance structure of $\Sigma$ and the reparameterization trick $w = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, in [38, Equations (3), (4)].
Estimating the gradients directly from (8) and (9) leads to poor convergence of the cost function due to the high variance of the gradient estimate. Indeed, this observation has been made in the prior literature [30, 38]. To reduce the variance of the gradient estimate we perform antithetic sampling, as suggested in [30]; i.e., for every sampled $\epsilon$ we also evaluate the policy corresponding to $-\epsilon$ when estimating the gradient. If we sample $m$ values $\epsilon_1, \dots, \epsilon_m$, then the Monte-Carlo estimate of the gradient with antithetic sampling is:

$\nabla_{\mu} \, \mathbb{E}_{w} \left[ C(\pi_w; E) \right] \approx \frac{1}{2m} \sum_{j=1}^{m} \left[ C(\pi_{\mu + \sigma \odot \epsilon_j}; E) - C(\pi_{\mu - \sigma \odot \epsilon_j}; E) \right] (\epsilon_j \oslash \sigma), \qquad (10)$

$\nabla_{\sigma} \, \mathbb{E}_{w} \left[ C(\pi_w; E) \right] \approx \frac{1}{2m} \sum_{j=1}^{m} \left[ C(\pi_{\mu + \sigma \odot \epsilon_j}; E) + C(\pi_{\mu - \sigma \odot \epsilon_j}; E) \right] \left( (\epsilon_j \odot \epsilon_j - \mathbf{1}) \oslash \sigma \right). \qquad (11)$
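The antithetic estimator can be sketched in a few lines for a diagonal Gaussian; the `cost` rollout function is a hypothetical stand-in, and the score-function forms follow the standard ES estimators for the mean and standard deviation:

```python
import math, random

# Hedged sketch of the antithetic ES gradient estimator: for each sampled
# epsilon, evaluate the cost at mu + sigma*eps and mu - sigma*eps. The
# mean-gradient uses the cost difference and the sigma-gradient the cost sum,
# matching the standard score-function forms for a diagonal Gaussian.

def es_gradients(cost, mu, sigma, n_samples, rng=random):
    d = len(mu)
    g_mu = [0.0] * d
    g_sigma = [0.0] * d
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        w_plus = [m + s * e for m, s, e in zip(mu, sigma, eps)]
        w_minus = [m - s * e for m, s, e in zip(mu, sigma, eps)]
        c_plus, c_minus = cost(w_plus), cost(w_minus)
        for i in range(d):
            g_mu[i] += (c_plus - c_minus) * eps[i] / sigma[i]
            g_sigma[i] += (c_plus + c_minus) * (eps[i] ** 2 - 1.0) / sigma[i]
    return ([g / (2 * n_samples) for g in g_mu],
            [g / (2 * n_samples) for g in g_sigma])
```

As a sanity check, for the linear cost $C(w) = w_1$ the mean-gradient estimate concentrates around 1 and the sigma-gradient around 0.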
Now we turn our attention towards the implementation aspects of ES. We exploit the high parallelizability of ES by splitting the environments into minibatches, one per available CPU worker. Each worker computes the gradient of the cost associated with each environment in its minibatch using Algorithm 1 and returns their sum to the main process, where the gradients are averaged over all environments; see Algorithm 3 in the appendix. These estimated gradients are then passed to a gradient-based optimizer for updating the distribution; we use the Adam optimizer [15]. We circumvent the non-negativity constraint on the standard deviation by optimizing $\log \sigma$ (element-wise) instead of $\sigma$; accordingly, the gradients w.r.t. $\sigma$ must be converted to gradients w.r.t. $\log \sigma$ before supplying them to the optimizer. We empirically observed that training with ES sometimes gets stuck at a local minimum. To remedy this, we replace the cost function with the utility function [38, Section 3.1], as was done in [30]. We refer to the use of the utility function as ES-utility. The difference between our implementation and [30] is that we use ES-utility only when we get stuck in a local minimum; once we escape it, we revert back to ES, whereas [30] uses ES-utility for the entire duration of training. This switching strategy allows us to benefit from the faster convergence of ES and the local-minimum avoidance of ES-utility, thereby saving a significant amount of computation.

IV-B Training a PAC-Bayes Policy
This section details our approach for minimizing the PAC-Bayes upper bounds in Theorem 1 to obtain provably generalizable posterior distributions on the policy space. We begin by restricting our policy space to a finite set as follows: let $P_0 = \mathcal{N}(\mu^*, \Sigma^*)$ be a prior on the policy space obtained using ES (as before, $\Sigma^*$ is a diagonal covariance matrix whose diagonal entries are the element-wise square of $\sigma^*$). Draw $m$ i.i.d. policies $\pi_{w_1}, \dots, \pi_{w_m}$ from $P_0$ and restrict the policy space to $\{\pi_{w_1}, \dots, \pi_{w_m}\}$. Finally, define a prior $p_0$ over this finite space as a uniform distribution.
The primary benefit of working over a finite policy space is that it allows us to formulate the problem of minimizing the PAC-Bayes bounds (4) and (5) using convex optimization. As described in [19, Section 5.1], optimization of the PAC-Bayes bound (4) can be achieved using a relative entropy program (REP); REPs are efficiently-solvable convex programs in which a linear functional of the decision variables is minimized subject to constraints that are linear or of relative entropy type [5, Section 1.1]. The remainder of this section formulates the optimization of the bound (5) as a parametric REP. The resulting algorithm (Algorithm 2) finds a posterior distribution over the finite policy space that minimizes the PAC-Bayes bound (arbitrarily close to the global infimum).
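To illustrate the structure of this optimization without a conic solver, the sketch below replaces the REP with a closed-form shortcut: for a fixed trade-off parameter $\lambda$, the minimizer of $p^\top C + \lambda \, \mathrm{KL}(p \,\|\, p_0)$ over the simplex is the Gibbs posterior $p \propto p_0 \odot \exp(-C/\lambda)$, and sweeping $\lambda$ over a grid while scoring each candidate with the quadratic bound mimics the grid-search idea. The Gibbs sweep, the bound constants, and all names here are illustrative assumptions; the actual pipeline solves an REP per parameter tuple.

```python
import math

def gibbs_posterior(costs, p0, lam):
    """Closed-form minimizer of p.C + lam*KL(p||p0) over the simplex."""
    weights = [q * math.exp(-c / lam) for c, q in zip(costs, p0)]
    z = sum(weights)
    return [w / z for w in weights]

def quadratic_bound(costs, p, p0, n_envs, delta):
    """Quadratic PAC-Bayes bound (5) for a finite policy space."""
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0) if pi > 0)
    r = (kl + math.log(2 * math.sqrt(n_envs) / delta)) / (2 * n_envs)
    emp = sum(pi * c for pi, c in zip(p, costs))
    return (math.sqrt(emp + r) + math.sqrt(r)) ** 2

def sweep(costs, p0, n_envs, delta, lambdas):
    """Return the candidate posterior with the smallest quadratic bound."""
    candidates = [gibbs_posterior(costs, p0, lam) for lam in lambdas]
    return min(candidates, key=lambda p: quadratic_bound(costs, p, p0, n_envs, delta))
```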
Let $C \in \mathbb{R}^m$ be the policy-wise cost vector, each entry of which holds the average cost of running the corresponding policy on the environments in $S_2$. Then, for a posterior $p$ in the probability simplex, the empirical cost can be expressed linearly in $p$ as $p^\top C$. Hence, the minimization of the PAC-Bayes bound (5) can be written as the minimization of the right-hand side of (5) over all $p$ in the simplex.
Introducing auxiliary scalars, this optimization can be equivalently expressed as follows (if the resulting program is infeasible for a given choice of the scalars, we take its optimal value to be $+\infty$):
(12)  
(13)  
(14)  
(15) 
This is an REP for a fixed choice of the scalars. Hence, we can perform a grid search on these scalars and solve an REP for each fixed tuple of parameters. We can control the density of the grid to get arbitrarily close to:
(16) 
The search space for these scalars in (16) is impractically large. The following proposition remedies this by shrinking the search space to compact intervals, thereby allowing for an efficient algorithm to solve (16).
Proposition 2
Our implementation of the PAC-Bayes optimization is detailed in Algorithm 2. In practice, we sweep across one scalar and, for each fixed value, perform a bisection search on the other (lines 13-14 in Algorithm 2). We make the bounds in Proposition 2 tighter by replacing the relevant quantities with the chosen values in the corresponding expressions (lines 10-11 in Algorithm 2). Furthermore, we make a specific parameter choice in line 6 of Algorithm 2. The REP that arises for each fixed tuple is solved using CVXPY [7] with the MOSEK solver [25]. Finally, we post-process the solution of Algorithm 2 to obtain a tighter PAC-Bayes bound by computing the KL-inverse between the empirical cost and the regularizer using an REP; further details are provided in Appendix B.
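The KL-inverse post-processing can also be sketched with a simple bisection search: for a Bernoulli-style PAC-Bayes bound, the tightest guarantee is the largest $q$ such that $\mathrm{kl}(C_S \,\|\, q)$ stays below the available slack $r$. The exact constant inside $r$ follows the paper's Appendix B and is not reproduced here; the bisection works because $\mathrm{kl}(c \,\|\, q)$ is increasing in $q$ for $q \ge c$.

```python
import math

def bernoulli_kl(c, q):
    """KL divergence between Bernoulli(c) and Bernoulli(q)."""
    eps = 1e-12
    c = min(max(c, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return c * math.log(c / q) + (1 - c) * math.log((1 - c) / (1 - q))

def kl_inverse(c, r, tol=1e-9):
    """Largest q in [c, 1) with kl(c || q) <= r, found by bisection."""
    lo, hi = c, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(c, mid) <= r:
            lo = mid
        else:
            hi = mid
    return lo
```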
V Examples
In this section we use the algorithms developed in Section IV on two examples: (1) vision-based navigation of a UAV in novel obstacle fields, and (2) locomotion of a quadrupedal robot across novel rough terrains using proprioceptive and exteroceptive sensing. Through these examples, we demonstrate the ability of our approach to train vision-based DNN policies with strong generalization guarantees in a deep RL setting. The simulation and training are performed with PyBullet [6] and PyTorch [27], respectively. Our code is available at: https://github.com/irom-lab/PAC-Vision-Planning

V-A Vision-Based UAV Navigation
In this example we train a quadrotor to navigate across an obstacle field without collision using depth maps from an onboard vision sensor; see Fig. 1 for an illustration.
Environment. For the sake of visualization, we made the "roof" of the obstacle course transparent in Fig. 1 and the videos. The true obstacle course is a (red) tunnel cluttered with cylindrical obstacles; see Fig. 4. The (unknown) distribution over environments is generated by drawing obstacle radii and locations from uniform distributions; the orientation is generated by drawing a quaternion from a normal distribution.
Motion Primitives. We work with a library of motion primitives for the quadrotor. The motion primitives are generated by connecting the initial position and the final desired position of the quadrotor with a smooth sigmoidal trajectory; Fig. 2 illustrates the sigmoidal trajectory for each primitive in our library. The robot moves along these trajectories at a constant speed and yaw, and the roll and pitch are recovered by exploiting the differential flatness of the quadrotor [24, Section III].
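One plausible parameterization of such a primitive (an assumption; the paper's exact sigmoid is not reproduced) moves forward at a constant rate while blending the lateral position with a rescaled logistic curve:

```python
import math

# Hedged sketch of a sigmoidal motion primitive: constant forward progress,
# logistic lateral blend from the start to the goal position. The `sharpness`
# parameter and the logistic form are illustrative assumptions.

def sigmoid_primitive(start, goal, n_points=50, sharpness=8.0):
    """Return (x, y) waypoints of a smooth sigmoidal trajectory."""
    waypoints = []
    for i in range(n_points + 1):
        s = i / n_points                                   # progress in [0, 1]
        # logistic blend, rescaled so it is exactly 0 at s=0 and 1 at s=1
        raw = 1.0 / (1.0 + math.exp(-sharpness * (s - 0.5)))
        lo = 1.0 / (1.0 + math.exp(sharpness * 0.5))
        blend = (raw - lo) / (1.0 - 2.0 * lo)
        x = start[0] + s * (goal[0] - start[0])            # constant forward rate
        y = start[1] + blend * (goal[1] - start[1])        # sigmoidal lateral motion
        waypoints.append((x, y))
    return waypoints
```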
Planning Policy. Our control policy maps a depth map to a score vector and then selects the motion primitive with the highest score; see the last layer in Fig. 3. A typical depth map from the onboard sensor is visualized in Fig. 4. We model our policy as a DNN with a ResNet-like architecture (illustrated in Fig. 3). The policy processes the depth map along two parallel branches: the Depth Filter (which is fixed and has no parameters to learn) and the Residual Network (which is a DNN). Both branches generate a score for each primitive in the library, and the two scores are summed to obtain the final aggregate score. The Depth Filter embeds the intuition that the robot can avoid collisions with obstacles by moving towards the "deepest" part of the depth map. We construct the Depth Filter by projecting the quadrotor's position after executing a primitive onto the depth map to identify the pixels where the quadrotor would end up; see the grid in Fig. 4, where each cell corresponds to the ending position of a motion primitive. The Depth Filter then applies a mask on the depth map that zeros out all pixels outside this grid and computes the average depth of each cell in the grid, which is treated as the primitive's score from that branch of the policy. Note that this score is based only on the ending position of the quadrotor; the entire primitive trajectory, when projected onto the depth map, can lie outside the grid in Fig. 4. Therefore, to improve the policy's performance, we train the Residual Network, which augments the scores from the Depth Filter branch by processing the entire depth map.
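The Depth Filter branch reduces to per-cell averaging of the depth map; a sketch follows, in which the per-primitive pixel windows are illustrative assumptions standing in for the projection of each primitive's terminal position:

```python
# Hedged sketch of the Depth Filter: score each primitive by the average depth
# inside its grid cell on the depth map. The cell boundaries here are assumed
# inputs; the paper derives them by projecting the quadrotor's ending position
# onto the image.

def depth_filter_scores(depth_map, cells):
    """cells: list of (row0, row1, col0, col1) pixel windows, one per primitive."""
    scores = []
    for r0, r1, c0, c1 in cells:
        patch = [depth_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        scores.append(sum(patch) / len(patch))  # average depth = primitive score
    return scores
```

The primitive whose cell is "deepest" receives the highest score, matching the intuition described above.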
Training Summary. We choose the cost as $1 - T_c / T$, where $T_c$ is the time at which the robot collides with an obstacle ($T_c = T$ if there is no collision) and $T$ is the total time horizon. For the primitives in Fig. 2, the quadrotor moves monotonically forward; hence, time is analogous to the forward displacement. The prior is trained using the method described in Section IV-A on an AWS EC2 g3.16xlarge instance with a runtime of 22 hours. PAC-Bayes optimization is performed with Algorithm 2 on a desktop with a 3.30 GHz i9-7900X CPU with 10 cores, 32 GB RAM, and a 12 GB NVIDIA Titan XP GPU. The bulk of the time in executing Algorithm 2 is spent on computing the costs in line 5 (i.e., 400 sec, 800 sec, and 1600 sec for the results in Table I from top to bottom), whereas solving (16) takes a negligible amount of time in comparison; Table III in the appendix provides the hyperparameters.
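One cost form consistent with the description in the Training Summary (an assumption about the exact expression: cost 0 if the robot survives the full horizon, growing toward 1 the earlier it collides) is:

```python
# Hedged sketch of the per-rollout cost: the fraction of the horizon (and,
# for forward-moving primitives, of the course) that remains untraversed at
# the time of collision. t_collide == horizon encodes "no collision".

def traversal_cost(t_collide, horizon):
    """Cost in [0, 1]: 0 for full traversal, 1 for an immediate collision."""
    assert 0.0 <= t_collide <= horizon
    return 1.0 - t_collide / horizon
```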
Results. The PAC-Bayes results are detailed in Table I. Here we choose $\delta = 0.01$ (implying that the PAC bounds hold with probability $0.99$) and vary the number of training environments $N$; see Appendix B for details on the KL-inverse PAC-Bayes bound provided in Table I. We also perform an exhaustive simulation of the trained posterior on novel environments to empirically estimate the true cost. It can be observed that our PAC-Bayes bounds show close correspondence to the empirical estimate of the true cost. To facilitate the physical interpretation of the results in Table I, consider the last row with $N = 4000$: according to our PAC-Bayes guarantee, with probability $0.99$ the quadrotor will (on average) get through $78.89\%$ of previously unseen obstacle courses. Videos of representative trials on the test environments can be found at: https://youtu.be/03qq4sLU34o
# Environments N | PAC-Bayes Cost (4) | PAC-Bayes Cost (5) | PAC-Bayes Cost (KL Inv.) | True Cost (Estimate)
1000 | 26.02% | - | 24.89% | 18.34%
2000 | - | 22.99% | 22.28% | 18.42%
4000 | - | 21.55% | 21.11% | 18.43%
V-B Quadrupedal Locomotion on Uneven Terrain
Environment. In this example, we train the quadrupedal robot Minitaur [14] to traverse an uneven terrain characterized by uniformly sampled slopes; see Fig. 1 for an illustration of a representative environment. We use the minitaur_gym_env in PyBullet to simulate the full nonlinear/hybrid dynamics of the robot. The objective here is to train a posterior distribution on policies that enables the robot to cross a finish line (depicted in red in Fig. 1) situated a fixed distance away along the initial heading direction.
Motion Primitives. We use the sine controller that is available in the minitaur_gym_env as well as in Minitaur's SDK developed by Ghost Robotics. The sine controller generates desired motor angles based on a sinusoidal function of two stepping amplitudes, a steering amplitude, and an angular velocity. The four desired angles are communicated to the eight motors of the Minitaur (two per leg). Hence, our primitives are characterized by these scalars, rendering the library of motion primitives uncountable but compact. Each primitive is executed for a time horizon of 0.5 seconds, as this roughly corresponds to one step for the robot.
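A hedged sketch of a sine-controller-style gait generator follows. The phase offsets, the steering sign convention, and the function and parameter names are illustrative assumptions, not the minitaur_gym_env implementation:

```python
import math

# Hedged sketch of a sine controller: four desired leg angles generated from
# sinusoids parameterized by stepping amplitudes (amp1, amp2), a steering
# amplitude (steer), and an angular velocity (omega). Each angle is then
# duplicated to two motors (eight motors total, two per leg).

def sine_controller(t, amp1, amp2, steer, omega):
    """Return four desired leg angles at time t."""
    a_left = amp1 + steer            # steering biases the two sides oppositely
    a_right = amp1 - steer
    phase = omega * t
    return [
        a_left * math.sin(phase),             # left-side swing angle
        amp2 * math.sin(phase + math.pi),     # extension angle, out of phase
        a_right * math.sin(phase),            # right-side swing angle
        amp2 * math.sin(phase + math.pi),     # extension angle, out of phase
    ]
```

With zero steering amplitude the two swing angles coincide, producing a straight gait; a nonzero steering amplitude biases the sides oppositely to turn.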
Planning Policy. The policy selects a primitive based on the depth map and the proprioceptive feedback (8 motor angles, 8 motor angular velocities, and the six-dimensional position and orientation of the robot's torso). A visualization of the depth map used by the policy can be found in Fig. 5. The policy we train is a DNN: the depth map is passed through a convolutional layer, and the resulting features are concatenated with the proprioceptive feedback before being processed by fully-connected layers. The four network outputs are then mapped to the primitive parameters. Intuitively, this transformation ensures that the robot has a minimum forward speed and a maximum steering speed.
Training Summary. The cost we use is determined by the robot's displacement at the end of the rollout along the initial heading direction, normalized by the distance to the finish line from the robot's initial position along that direction. A rollout is considered complete when the robot crosses the finish line or when the rollout time exceeds a fixed limit. All the training in this example is performed on a desktop with a 3.30 GHz i9-7900X CPU with 10 cores, 32 GB RAM, and a 12 GB NVIDIA Titan XP GPU. As before, we train the prior using the method described in Section IV-A, the execution of which takes 2 hours. PAC-Bayes optimization is performed using Algorithm 2. The execution of line 5 in Algorithm 2 takes 65 min, 130 min, and 260 min for the results in Table II from top to bottom, respectively, whereas solving (16) takes 1 sec. Table III in the appendix details the relevant hyperparameters.
Results. The PAC-Bayes results are detailed in Table II, where $\delta = 0.01$ and the number of environments $N$ is varied; see Appendix B for details on the KL-inverse PAC-Bayes bound provided in Table II. The PAC-Bayes cost can be interpreted in a manner similar to the quadrotor example; e.g., for the PAC-Bayes optimization with $N = 2000$, with probability $0.99$, the quadruped will (on average) traverse $79.77\%$ of the previously unseen environments. Videos of representative trials on the test environment can be found at: https://youtu.be/03qq4sLU34o
# Environments N | PAC-Bayes Cost (4) | PAC-Bayes Cost (5) | PAC-Bayes Cost (KL Inv.) | True Cost (Estimate)
500 | 25.65% | - | 23.75% | 16.89%
1000 | - | 22.92% | 21.60% | 16.78%
2000 | - | 21.40% | 20.23% | 16.80%
VI Conclusions
We presented a deep reinforcement learning approach for synthesizing vision-based planners with certificates of performance on novel environments. We achieved this by directly optimizing a PAC-Bayes generalization bound on the average cost of the policies over all environments. To obtain strong generalization bounds, we devised a two-step training pipeline. First, we use ES to train a good prior distribution on the space of policies. Then, we use this prior in a PAC-Bayes optimization to find a posterior that minimizes the PAC-Bayes bound. The PAC-Bayes optimization is formulated as a parametric REP that can be solved efficiently. Our examples demonstrate the ability of our approach to train DNN policies with strong generalization guarantees.
Future Work. There are a number of exciting future directions that arise from this work. We believe that our approach can be extended to provide PAC-Bayes certificates of generalization for long-horizon vision-based motion plans. In particular, we are exploring the augmentation of our policy with a generative network that can predict future visual observations conditioned on the primitive to be executed. Another direction we are excited to pursue is training vision-based policies that are robust to unknown disturbances (e.g., wind gusts) which are not a part of the training data. Specifically, we hope to address this challenge by bridging the approach in this paper with the model-based robust planning approaches in the authors' previous work [21, 35]. Finally, we are also working towards a hardware implementation of our approach on a UAV and the Minitaur. We hope to leverage recent advances in sim-to-real transfer to minimize training on the actual hardware.
References
[1] (2007) Grasp planning in complex scenes. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pp. 42–48. Cited by: §I-A.
[2] (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I-A, §I.
[3] (1999) Sequential composition of dynamically dexterous robot behaviors. The International Journal of Robotics Research 18 (6), pp. 534–555. Cited by: §I-A.
[4] (2019) Scenario optimization for MPC. In Handbook of Model Predictive Control, S. V. Raković and W. S. Levine (Eds.), pp. 445–463. External Links: ISBN 978-3-319-77489-3. Cited by: §I-A.
[5] (2017) Relative entropy optimization and its applications. Mathematical Programming 161 (1–2), pp. 1–32. Cited by: §B, §I, §IV-B.
[6] (2018) PyBullet, a Python module for physics simulation for games, robotics and machine learning. Cited by: §V.
[7] (2016) CVXPY: a Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 (83), pp. 1–5. Cited by: §IV-B.
[8] (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §I-A, §IV.
[9] (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §I-A.
[10] (2005) Maneuver-based motion planning for nonlinear systems with symmetries. IEEE Transactions on Robotics 21 (6), pp. 1077–1091. Cited by: §I-A.
[11] (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462. Cited by: §I-A.
[12] (2019) Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters 4 (3), pp. 2407–2414. Cited by: §I-A, §I.
[13] (2015) Probabilistically valid stochastic extensions of deterministic models for systems with uncertainty. The International Journal of Robotics Research 34 (10), pp. 1278–1295. Cited by: §I-A.
[14] (2016) Design principles for a family of direct-drive legged robots. IEEE Robotics and Automation Letters 1 (2), pp. 900–907. Cited by: §V-B.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
[16] (2001) Randomized kinodynamic planning. The International Journal of Robotics Research 20 (5), pp. 378–400. Cited by: §I.
[17] (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Cited by: §I-A.
[18] (2009) Standing balance control using a trajectory library. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3031–3036. Cited by: §I-A.
[19] (2019) PAC-Bayes control: learning policies that provably generalize to novel environments. arXiv preprint arXiv:1806.04225. Cited by: §B, §I-A, §III, §IV-B.
[20] (2018) PAC-Bayes control: synthesizing controllers that provably generalize to novel environments. In Proceedings of the Conference on Robot Learning. Cited by: §B, §I-A, §III.
[21] (2017) Funnel libraries for real-time robust feedback motion planning. The International Journal of Robotics Research 36 (8), pp. 947–982. Cited by: §I-A, §I, §VI.
[22] (2004) A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099. Cited by: §II, Theorem 2.
[23] (1999) Some PAC-Bayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §I, §II, Theorem 2.
[24] (2011) Minimum snap trajectory generation and control for quadrotors. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2520–2525. Cited by: §V-A.
[25] (2019) MOSEK Fusion API for Python 9.0.84 (beta). External Links: Link. Cited by: §IV-B.
[26] (2016) Composing limit cycles for motion planning of 3D bipedal walkers. In Proceedings of the IEEE Conference on Decision and Control, pp. 6368–6374. Cited by: §I-A, §I.
[27] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §V.
[28] (2019) Motion planning networks. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2118–2124. Cited by: §I-A.
[29] (2019) PAC-Bayes with backprop. arXiv preprint arXiv:1908.07380. Cited by: §I-A, §III.
[30] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §IV-A.
[31] (2008) Learning maneuver dictionaries for ground robot planning. In Proceedings of the International Symposium on Robotics (ISR). Cited by: §I-A, §I.
[32] (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press. Cited by: §I-A.
[33] (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §I-A, §I.
[34] (2010) LQR-trees: feedback motion planning via sums-of-squares verification. The International Journal of Robotics Research 29 (8), pp. 1038–1052. Cited by: §I-A.
[35] (2019) Switched systems with multiple equilibria under disturbances: boundedness and practical stability. IEEE Transactions on Automatic Control. Cited by: §I-A, §VI.
[36] (2001) Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica 37 (10), pp. 1515–1528. Cited by: §I-A.
[37] (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754. Cited by: §I-A.
[38] (2014) Natural evolution strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §I-A, §I, §IV-A.
[39] (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3357–3364. Cited by: §I-A.
Appendix A Derivation of Optimization Problem for Finite Policy Spaces
A-A Hyperparameters
The hyperparameters used for the examples in this paper are detailed in Table III.
A-B Complete ES Algorithm
A complete implementation of the ES algorithm used to train the prior is provided in Algorithm 3.