# PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments

Our goal is to synthesize controllers for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between robustness of controllers to novel environments and generalization of hypotheses in supervised learning. In particular, we utilize the Probably Approximately Correct (PAC)-Bayes framework, which allows us to obtain upper bounds (that hold with high probability) on the expected cost of (stochastic) controllers across novel environments. We propose control synthesis algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (Relative Entropy Programming in particular) in the setting where we are optimizing over a finite control policy space. In the more general setting of continuously parameterized controllers, we minimize this upper bound using stochastic gradient descent. We present examples of our approach in the context of obstacle avoidance control with depth measurements. Our simulated examples demonstrate the potential of our approach to provide strong generalization guarantees on controllers for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, and rich sensory inputs (e.g., depth measurements).

## 1 Introduction

Imagine an unmanned aerial vehicle that successfully navigates a thousand different obstacle environments or a robotic manipulator that successfully grasps a million objects in our dataset. How likely are these systems to succeed on a novel (i.e., previously unseen) environment or object? How can we explicitly synthesize controllers that provably generalize well to environments or objects that our robot has not previously encountered? Current approaches for designing controllers for robotic systems either do not provide such guarantees on generalization or provide guarantees only under extremely restrictive assumptions (e.g., strong assumptions on the geometry of a novel environment [53, 23, 3, 40]).

The goal of this paper is to develop an approach for synthesizing controllers for robotic systems that provably generalize well with high probability to novel environments given a dataset of example environments. The key conceptual idea for enabling this is to exploit a precise analogy between robustness of controllers to novel environments and generalization in supervised learning. This analogy allows us to translate techniques for learning hypotheses with generalization guarantees in the supervised learning setting into techniques for synthesizing control policies for robot tasks with performance guarantees on novel environments.

In order to obtain more insight into this analogy, suppose we have a dataset of objects. A simple approach to designing a grasping controller is to synthesize a controller that achieves the best possible performance on these objects. However, such a strategy might result in an overly complex controller that “overfits” to the specific objects at hand. This is a particularly important challenge for robotics applications since datasets are generally quite small (e.g., as compared to training sets for image classification tasks). In order to design a controller that generalizes well to novel environments, we may need to add a “regularizer” that penalizes the “complexity” of the controller. This raises the following questions: (1) what form should this regularizer take?; and (2) can we provide a formal guarantee on the performance of the resulting controller on novel environments?

The analogous questions for supervised learning algorithms have been extensively studied in the literature on generalization theory in machine learning. Here we leverage PAC-Bayes theory (Probably Approximately Correct Bayes) [42], which provides some of the tightest known generalization bounds for classical supervised learning approaches [36, 55, 25]. Very recently, PAC-Bayes analysis has also been used to train deep neural networks with guarantees on generalization performance [17, 45, 46]. As we will see, we can leverage PAC-Bayes theory to provide precise answers to both questions posed above; it will allow us to specify a regularizer for designing (stochastic) controllers that generalize well (with high probability) to novel environments.

Statement of Contributions: To our knowledge, the results in this paper constitute the first attempt to provide generalization guarantees on controllers for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, and rich sensory inputs (e.g., depth measurements). To this end, this paper makes three primary contributions. First, we provide a reduction that allows us to translate generalization bounds for supervised learning problems to generalization bounds for controllers. We apply this reduction in order to translate PAC-Bayes bounds to the control setting we consider here (Section 4). Second, we propose solution algorithms for minimizing the regularized cost functions specified by PAC-Bayes theory in order to synthesize controllers with generalization guarantees (Section 5). In the setting where we are optimizing over a finite policy space (Section 5.1), the corresponding optimization problem can be solved using convex optimization techniques (Relative Entropy Programs (REPs) in particular). In the more general setting of continuously parameterized controllers (Section 5.2), we rely on stochastic gradient descent to perform the optimization. Third, we demonstrate our approach on the problem of synthesizing depth sensor-based reactive obstacle avoidance controllers for the ground robot model shown in Figure 1 (Section 6). Our simulation results demonstrate that we are able to obtain strong performance guarantees even with a relatively small number of training environments. We compare the bounds obtained from PAC-Bayes theory with exhaustive sampling to illustrate the tightness of the bounds.

### 1.1 Related Work

One possible approach to synthesizing controllers with guaranteed performance in novel environments is to assume that a novel environment satisfies conditions that allow a real-time planner to always succeed. For example, in the context of navigation, this constraint could be satisfied by hand-coding emergency maneuvers (e.g., stopping maneuvers or loiter circles) that are always guaranteed to succeed [53, 23, 3]. However, requiring the existence of such emergency maneuvers can lead to extremely conservative behavior. Another approach is to assume that the environment satisfies certain geometric conditions (e.g., large separation between obstacles) that allow for safe navigation [40]. However, such geometric conditions are rarely satisfied by real-world environments. Moreover, such conditions are domain specific; it is not clear how one would specify such constraints for problems other than navigation (e.g., grasping).

Another conceptually appealing approach for synthesizing controllers with guaranteed performance on a priori unknown environments is to model the problem as a Partially Observable Markov Decision Process (POMDP) [29], where the environment is part of the (partially observed) state of the system [51]. Computational considerations aside, such an approach is made infeasible by the need to specify a distribution over environments the robot might encounter. Unfortunately, specifying such a distribution over real-world environments is an extremely challenging endeavor. Thus, many approaches (including ours) assume that we only have indirect access to the true underlying distribution over environments in the form of examples. For example, Richter et al. [51, 50] propose an approximation to the POMDP framework in the context of navigation by learning to predict future collision probabilities from past data. The work on deep-learning based approaches for manipulation represents another prominent set of techniques where interactions with example environments (objects in this case) are used to learn control policies [37, 38, 1, 39, 56]. While the approaches mentioned above have led to impressive empirical demonstrations, it is very challenging to guarantee that such methods will perform well on novel environments that are not part of the training data (especially when a limited number of training examples are available, as is often the case for robotics applications).

The primary theoretical framework we utilize in this paper is PAC-Bayes generalization theory [42]. PAC-Bayes theory provides some of the tightest known generalization bounds for classical supervised learning problems [36, 55, 25] and has recently been applied to explain and promote generalization in deep learning [17, 45, 46]. PAC-Bayes theory has also been applied to learn control policies for Markov Decision Processes (MDPs) with provable sample complexity bounds [19, 20]. These approaches also exploit the intuition (ref. Section 1) that “regularizing” controllers in an appropriate manner can prevent overfitting and lead to sample efficiency (see also [44, 32, 7, 6, 54] for other approaches that exploit this intuition in the reinforcement learning context). However, we note that the focus of our work is quite different from the work on PAC-Bayes MDP bounds (and the more general framework of PAC MDP bounds [31, 11, 24]), which consider the standard reinforcement learning setup where a control policy must be learned through multiple interactions with a given MDP (with unknown transition dynamics and/or rewards). In contrast, here we focus on zero-shot generalization to a novel environment (e.g., obstacle environments or objects). In other words, a controller learned from examples of different environments must immediately perform well on a new one (i.e., without further exploratory interactions with the new environment). We further note that [19] considers finite state and action spaces along with policies that depend on full state feedback, while [20] relaxes the assumption on finite state spaces but retains the other modeling assumptions. In contrast, we target systems with continuous state and action spaces and synthesize control policies that rely on rich sensory inputs.

On the algorithmic front, we make significant use of Relative Entropy Programs (REPs) [12]. REPs constitute a rich class of convex optimization problems that generalize many other problems including Linear Programs, Geometric Programs, and Second-Order Cone Programs [10]. REPs are optimization problems in which a linear functional of the decision variables is minimized subject to linear constraints and conic constraints given by a relative entropy cone. REPs are amenable to efficient solution techniques (e.g., interior point methods [43]) and can be solved using existing software packages (e.g., SCS [48, 47] and ECOS [16]). We refer the reader to [12] for a more thorough introduction to REPs. Importantly for us, REPs can handle constraints of the form $\mathrm{KL}(p \| q) \leq c$, where $p$ and $q$ are decision variables corresponding to probability vectors and $\mathrm{KL}(\cdot \| \cdot)$ represents the Kullback-Leibler divergence. As we will see, this allows us to use REPs to synthesize controllers using the PAC-Bayes framework in the setting where we are optimizing over a finite set of control policies.

### 1.2 Notation

We use the notation $v[i]$ to refer to the i-th component of a vector $v$. We use $\mathbb{R}^n_+$ to denote the set of elementwise nonnegative vectors in $\mathbb{R}^n$ and $\odot$ to denote element-wise multiplication.

## 2 Problem Formulation

We assume that the robot’s dynamics are described by a discrete-time system:

$$x(t+1) = f(x(t), u(t); E), \tag{1}$$

where $t$ is the time index, $x(t)$ is the state, $u(t)$ is the control input, and $E$ is the environment that the robot operates in. We use the term “environment” here broadly to refer to any factors that are external to the robot. For example, $E$ could refer to an obstacle field that a mobile robot is attempting to navigate through, external disturbances (e.g., wind gusts) that a UAV is subjected to, or an object that a manipulator is attempting to grasp.

Let $\mathcal{E}$ denote the space of all possible environments. We assume that there is an underlying distribution $\mathcal{D}$ over $\mathcal{E}$ from which environments are drawn. Importantly, we do not assume that we have explicit descriptions of $\mathcal{E}$ or $\mathcal{D}$. Instead, we only assume indirect access to $\mathcal{D}$ in the form of a data set $S = \{E_1, \dots, E_N\}$ of $N$ training environments drawn i.i.d. from $\mathcal{D}$.

The robot’s sensor mapping takes a state $x$ and an environment $E$ to an observation. Let $\pi \in \Pi$ denote a control policy that maps sensor measurements to control inputs, where $\Pi$ is the space of control policies. Note that this is a very general model and can capture control policies that depend on histories of sensor measurements (by simply augmenting the state to keep track of histories of states and letting the observation space be the space of histories of sensor measurements).

We assume that the robot’s desired behavior is encoded through a cost function. In particular, let $r_\pi$ denote the function that “rolls out” the system with control policy $\pi$, i.e., $r_\pi$ maps an environment $E$ to the state-control trajectory one obtains by applying the control policy $\pi$ (up to a time horizon $T$). We will assume (without loss of generality) that the environment $E$ captures all sources of stochasticity (including random initial conditions), so that the rollout function for a particular environment is deterministic. We then let $C(r_\pi; E)$ denote the cost incurred by control policy $\pi$ when operating in environment $E$ over the time horizon $T$. We assume that the cost is bounded and will assume (without further loss of generality) that $C(r_\pi; E) \in [0, 1]$.

The primary assumption we make in this work is the following.

###### Assumption 1.

Given any control policy $\pi$, we can compute the cost $C(r_\pi; E)$ for the training environments $E \in S$.

This assumption is satisfied if one can simulate the robot’s operation in the environments $E_1, \dots, E_N$. We note that, computational considerations aside, we do not place any restrictions on the dynamics or the sensor mapping beyond the ability to simulate them. The models that our approach can handle are thus extremely rich in principle (e.g., nonlinear or hybrid dynamics, sensor models involving raycasting or simulated vision, etc.).

Another possibility for satisfying Assumption 1 is to run the controller on the hardware system itself in the given environments. This may be a feasible option for problems such as grasping, which are not safety-critical in nature. In such cases, our approach does not require models of the dynamics, sensor mapping, or the rollout function.

Goal: Our goal is to design a control policy that minimizes the expected value of the cost across environments:

$$\min_{\pi \in \Pi} \; C_{\mathcal{D}}(\pi) := \mathbb{E}_{E \sim \mathcal{D}} \left[ C(r_\pi; E) \right]. \tag{2}$$

In this work, it will be useful to consider a more general form of this problem where we choose a distribution $P$ over the control policy space $\Pi$ instead of making a single deterministic choice. Our goal is then to solve the following optimization problem, which we refer to as (OPT):

$$C^\star := \min_{P \in \mathcal{P}} \; C_{\mathcal{D}}(P) := \mathbb{E}_{E \sim \mathcal{D}} \, \mathbb{E}_{\pi \sim P} \left[ C(r_\pi; E) \right], \tag{OPT}$$

where $\mathcal{P}$ denotes the space of probability distributions over $\Pi$. Note that the outer expectation here is taken with respect to the unknown distribution $\mathcal{D}$. This constitutes the primary challenge in tackling this problem.

## 3 Background

The primary technical framework we leverage in this paper is PAC-Bayes theory. In Section 3.2, we provide a brief overview of the key results from PAC-Bayes theory in the context of supervised learning. We first provide some brief background on the properties of the Kullback-Leibler (KL) divergence in Section 3.1 and show how we can compute its inverse using Relative Entropy Programming (REP) in Section 3.1.1.

### 3.1 KL divergence

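Given two discrete probability distributions $P$ and $Q$ defined over a common set, the KL divergence from $Q$ to $P$ is defined as

$$\mathrm{KL}(P \| Q) := \sum_i P[i] \log\left(\frac{P[i]}{Q[i]}\right). \tag{3}$$

For scalars $p, q \in [0, 1]$, we define

$$\mathrm{KL}(p \| q) := \mathrm{KL}(\mathcal{B}(p) \| \mathcal{B}(q)) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}, \tag{4}$$

where $\mathcal{B}(p)$ denotes a Bernoulli distribution on $\{0, 1\}$ with parameter (i.e., mean) $p$.

For distributions $P$ and $Q$ of a continuous random variable, the KL divergence is defined to be

$$\mathrm{KL}(P \| Q) = \int p(x) \log\frac{p(x)}{q(x)} \, dx, \tag{5}$$

where $p$ and $q$ denote the densities of $P$ and $Q$. Importantly, if $P$ and $Q$ correspond to normal distributions $\mathcal{N}_p = \mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}_q = \mathcal{N}(\mu_q, \Sigma_q)$ over $\mathbb{R}^d$, the KL divergence can be computed in closed form as

$$\mathrm{KL}(\mathcal{N}_p \| \mathcal{N}_q) = \frac{1}{2}\left( \mathrm{Tr}(\Sigma_q^{-1} \Sigma_p) + (\mu_q - \mu_p)^T \Sigma_q^{-1} (\mu_q - \mu_p) + \log\frac{\det(\Sigma_q)}{\det(\Sigma_p)} - d \right). \tag{6}$$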
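As a small illustration of equations (4) and (6), the following sketch evaluates the Bernoulli KL divergence and the Gaussian KL divergence. It is a minimal example under stated assumptions: the function names are hypothetical, the Gaussian case is specialized to diagonal covariances (the case used later for continuously parameterized policies), and the Bernoulli inputs are clipped slightly away from 0 and 1 to avoid evaluating log(0).

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(B(p) || B(q)) as in equation (4).

    Assumption: inputs are clipped to [eps, 1 - eps] so the logs stay finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Equation (6) specialized to diagonal covariances.

    Each argument is a list of length d; var_p and var_q hold the variances
    (the diagonal entries of Sigma_p and Sigma_q).
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # trace term + quadratic term + log-determinant ratio - 1 (per dimension)
        total += vp / vq + (mq - mp) ** 2 / vq + math.log(vq / vp) - 1.0
    return 0.5 * total
```

Both functions return 0 when the two distributions coincide, which is a quick sanity check on any implementation of these formulas.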

#### 3.1.1 Computing KL inverse using Relative Entropy Programming

PAC-Bayes bounds (Section 3.2) are typically expressed as bounds of the form $\mathrm{KL}(p \| q^\star) \leq c$ (for some $p, q^\star \in [0, 1]$ and $c \geq 0$). These bounds can then be used to upper bound $q^\star$ by the KL inverse as follows:

$$q^\star \leq \mathrm{KL}^{-1}(p \| c) := \sup\{ q \in [0, 1] \mid \mathrm{KL}(p \| q) \leq c \}. \tag{7}$$

In prior work on PAC-Bayes theory, the KL inverse was numerically approximated using local root-finding techniques such as Newton’s method [17, 18], which do not have a priori guarantees on convergence to a global solution. Here we observe that the KL inverse is readily expressed as the optimal value of a simple Relative Entropy Program (ref. Section 1.1). In particular, the expression for the KL inverse in (7) corresponds to an optimization problem with a (scalar) decision variable $q$, a linear cost function (i.e., $-q$), linear inequality constraints (i.e., $0 \leq q \leq 1$), and a constraint on the KL divergence between the decision variable $q$ and the constant $p$. We can thus compute the KL inverse exactly (up to numerical tolerances) using convex optimization (e.g., interior point methods [12]).
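For readers who want to experiment with the KL inverse in (7) without a conic solver, the sketch below swaps in plain bisection in place of the Relative Entropy Program described above. This is not the REP-based method of the text; it relies on the fact that $\mathrm{KL}(p \| q)$ is continuous and nondecreasing in $q$ on $[p, 1]$, so the feasible set is an interval and bisection converges to its right endpoint. The function names and the tolerance are assumptions made for illustration.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(B(p) || B(q)); inputs are clipped to keep the logs finite."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_inverse(p, c, tol=1e-10):
    """Approximate sup{q in [0, 1] : KL(p || q) <= c} by bisection on [p, 1]."""
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p, mid) <= c:
            lo = mid  # mid is feasible, so the supremum is at least mid
        else:
            hi = mid  # mid is infeasible, so the supremum lies below mid
    return lo


# Hypothetical usage: empirical value 0.1 and divergence budget 0.05.
q_star_bound = kl_inverse(0.1, 0.05)
```

The returned value always lies between $p$ and the looser bound $p + \sqrt{c/2}$.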

### 3.2 PAC-Bayes Theory in Supervised Learning

We now provide a brief overview of the key results from PAC-Bayes theory in the context of supervised learning. Let $\mathcal{Z}$ be an input space and $\mathcal{Z}'$ be a set of labels. Let $\mathcal{D}$ be the (unknown) true distribution on $\mathcal{Z}$. Let $\mathcal{H} = \{h_w \mid w \in \mathbb{R}^d\}$ be a hypothesis class consisting of functions $h_w: \mathcal{Z} \to \mathcal{Z}'$ parameterized by $w \in \mathbb{R}^d$ (e.g., neural networks parameterized by weights $w$). Let $l(h_w; z)$ be a loss function (see Footnote 1). We will denote by $\mathcal{P}$ the space of probability distributions on the parameter space $\mathbb{R}^d$. Informally, we will refer to distributions on $\mathcal{H}$ when we mean distributions over the underlying parameter space.

Footnote 1: Note that we are considering a slightly restricted form of the supervised learning problem where each input $z \in \mathcal{Z}$ has only one correct label $z' \in \mathcal{Z}'$. The loss thus only depends on the input $z$ and the label $h_w(z)$. The PAC-Bayes framework applies to the more general setting where there is an underlying true distribution on $\mathcal{Z} \times \mathcal{Z}'$ and the loss thus has the form $l(h_w; z, z')$. However, the more restricted setting is sufficient for our needs here.

PAC-Bayes analysis then applies to learning algorithms that output a distribution over hypotheses. Specifically, the PAC-Bayes framework applies to learning algorithms with the following structure:

1. Choose a prior distribution $P_0 \in \mathcal{P}$ before observing any data.

2. Observe data samples $S = \{z_1, \dots, z_N\}$ and choose a posterior distribution $P \in \mathcal{P}$. This posterior can depend on the data and the prior.

It is important to note that the posterior distribution $P$ need not be the Bayesian posterior. PAC-Bayes theory applies to any distribution $P \in \mathcal{P}$.

Let us denote the training loss associated with the posterior distribution as:

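$$l_S(P) := \frac{1}{N} \sum_{z \in S} \mathbb{E}_{w \sim P} \left[ l(h_w; z) \right], \tag{8}$$

and the true expected loss as:

$$l_{\mathcal{D}}(P) := \mathbb{E}_{z \sim \mathcal{D}} \, \mathbb{E}_{w \sim P} \left[ l(h_w; z) \right]. \tag{9}$$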

The following theorem is the primary result from PAC-Bayes theory (see Footnote 2).

Footnote 2: The bound we state here is due to Maurer [41] and improves slightly upon the original PAC-Bayes bounds [42]. The stated bound holds when costs are bounded in the range $[0, 1]$ (as assumed here) and we have $N \geq 8$ samples.

###### Theorem 1 (PAC-Bayes Bound for Supervised Learning [42, 41]).

For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over samples $S \sim \mathcal{D}^N$, the following inequality holds:

$$\mathrm{KL}(l_S(P) \| l_{\mathcal{D}}(P)) \leq \frac{\mathrm{KL}(P \| P_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{N}. \tag{10}$$

Here, $\mathrm{KL}(l_S(P) \| l_{\mathcal{D}}(P))$ is interpreted as a KL divergence between Bernoulli distributions and computed using (4) (this is meaningful since $l_S(P)$ and $l_{\mathcal{D}}(P)$ are scalars bounded within $[0, 1]$).

Intuitively, Theorem 1 provides a bound on how “close” the training loss $l_S(P)$ and the true expected loss $l_{\mathcal{D}}(P)$ are. However, in practice, one would like to find an upper bound on the true expected loss $l_{\mathcal{D}}(P)$. Such an upper bound can be obtained by computing the KL inverse (ref. Section 3.1.1):

$$l_{\mathcal{D}}(P) \leq \mathrm{KL}^{-1}\left( l_S(P) \,\Big\|\, \frac{\mathrm{KL}(P \| P_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{N} \right). \tag{11}$$

Another upper bound that is useful for the purpose of optimization is provided by the following corollary, which follows from Theorem 1 by applying the well known upper bound for the KL inverse: $\mathrm{KL}^{-1}(p \| c) \leq p + \sqrt{c/2}$.

###### Corollary 1 (PAC-Bayes Upper Bound for Supervised Learning [42, 41]).

For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over samples $S \sim \mathcal{D}^N$, the following inequality holds:

$$\underbrace{l_{\mathcal{D}}(P)}_{\text{True expected loss}} \leq \underbrace{l_S(P)}_{\text{Training loss}} + \underbrace{\sqrt{\frac{\mathrm{KL}(P \| P_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}}}_{\text{“Regularizer”}}. \tag{12}$$

Corollary 1 provides a strategy for choosing a distribution over hypotheses: minimize the right hand side (RHS) of inequality (12) consisting of the training loss and a “regularization” term.
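To make the right hand side of inequality (12) concrete, here is a minimal sketch that evaluates the regularizer and the resulting upper bound for given values of the training loss, the divergence $\mathrm{KL}(P \| P_0)$, the sample count $N$, and the confidence parameter $\delta$. The function names and the numbers in the usage lines are illustrative assumptions, not values from the experiments.

```python
import math


def pac_bayes_regularizer(kl_post_prior, n_samples, delta):
    """Regularization term of inequality (12)."""
    numerator = kl_post_prior + math.log(2.0 * math.sqrt(n_samples) / delta)
    return math.sqrt(numerator / (2.0 * n_samples))


def pac_bayes_upper_bound(train_loss, kl_post_prior, n_samples, delta):
    """Right hand side of inequality (12): training loss plus regularizer."""
    return train_loss + pac_bayes_regularizer(kl_post_prior, n_samples, delta)


# Hypothetical numbers: the regularizer shrinks as the number of samples grows
# and grows as the posterior moves away from the prior.
bound_small_n = pac_bayes_upper_bound(0.10, kl_post_prior=5.0, n_samples=1000, delta=0.01)
bound_large_n = pac_bayes_upper_bound(0.10, kl_post_prior=5.0, n_samples=10000, delta=0.01)
```

The two usage lines show the trade-off the bound encodes: for a fixed training loss, more data or a posterior closer to the prior yields a tighter guarantee.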

## 4 PAC-Bayes Controllers

We now describe our approach for adapting the PAC-Bayes framework in order to tackle the control synthesis problem and synthesize (stochastic) control policies with guaranteed expected performance across novel environments. Our key idea for doing this is to exploit a precise analogy between the supervised learning setting from Section 3.2 and the control synthesis setting described in Section 2. Table 1 presents this relationship.

One can think of the relationship in Table 1 as providing a reduction from the control synthesis problem to a supervised learning problem. We are provided input data in the form of a data set $S$ of example environments. Choosing a “hypothesis” $r_\pi$ corresponds to choosing a control policy $\pi$ (since the rollout function $r_\pi$ is determined by $\pi$). A “hypothesis” $r_\pi$ maps an environment $E$ to a “label”, corresponding to the state-control trajectory obtained by applying $\pi$ on $E$. This “label” incurs a loss $C(r_\pi; E)$.

We can use this reduction to translate the PAC-Bayes theorems for supervised learning (Theorem 1 and Corollary 1) to the control setting. Similar to the supervised learning setting, we assume that the space $\Pi$ of control policies is parameterized by $w \in \mathbb{R}^d$. This in turn produces a parameterization of rollout functions. With a slight abuse of notation, we will refer to rollout functions $r_w$ instead of $r_{\pi_w}$ (with the understanding that $w$ is the parameter vector for the control policy $\pi_w$).

Let $P_0$ be a prior distribution over the parameter space $\mathbb{R}^d$. The prior can be used to encode domain knowledge, but need not be “true” in any sense (i.e., results hold for any prior). Let $P$ be a (possibly data-dependent) posterior. Following the notation from Section 2, we denote the true expected cost across environments by $C_{\mathcal{D}}(P)$. We will denote the cost on the training environments as

$$C_S(P) := \frac{1}{N} \sum_{E \in S} \mathbb{E}_{w \sim P} \left[ C(r_w; E) \right]. \tag{13}$$

The following theorem is then an exact analogy of Corollary 1.

###### Theorem 2 (PAC-Bayes Bound for Control Policies).

For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over sampled environments $S \sim \mathcal{D}^N$, the following inequality holds:

$$\underbrace{C_{\mathcal{D}}(P)}_{\text{True expected cost}} \leq C_{PAC}(P) := \underbrace{C_S(P)}_{\text{Training cost}} + \underbrace{\sqrt{\frac{\mathrm{KL}(P \| P_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}}}_{\text{“Regularizer”}}. \tag{14}$$

###### Proof.

The proof follows immediately from Corollary 1 given the reduction in Table 1. ∎

This theorem will constitute our primary tool for designing controllers with guarantees on their expected performance across novel environments. In particular, the left hand side of inequality (14) is the cost function of the optimization problem (OPT). Theorem 2 thus provides an upper bound (that holds with probability $1 - \delta$) on the true expected performance across environments of any controller distribution $P$ in terms of the loss on the sampled environments in $S$ and a “regularizer”. Our approach for choosing $P$ is to minimize this upper bound. Algorithm 1 outlines the steps involved in our approach.

We note that while $P$ is chosen by optimizing $C_{PAC}$ (i.e., the RHS of inequality (14)), the final upper bound on $C_{\mathcal{D}}(P)$ is not computed as $C_{PAC}(P)$. While $C_{PAC}(P)$ is a valid upper bound, a tighter bound is provided by inequality (11). The observations made in Section 3.1.1 allow us to compute this final bound using a REP. This is the bound we report in the results presented in Section 6.

## 5 Computing PAC-Bayes Controllers

We now describe how to tackle the optimization problem in Algorithm 1 for minimizing the upper bound on the true expected cost. We will first discuss the setting where the control policy space $\Pi$ is finite (Section 5.1). For this setting, the optimization problem can be solved to global optimality via Relative Entropy Programming. We then tackle the more general setting where $\Pi$ is continuously parameterized in Section 5.2.

### 5.1 Finite Control Policy Space

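Let the space of policies be $\Pi = \{\pi_1, \dots, \pi_L\}$. Our goal is then to optimize a discrete probability distribution $P$ (with corresponding probability vector $p \in \mathbb{R}^L$) over the space $\Pi$. Thus, $p[j]$ denotes the probability assigned to controller $\pi_j$. Define a matrix $\hat{C} \in \mathbb{R}^{N \times L}$ of costs, where each element

$$\hat{C}[i, j] = C(r_{\pi_j}; E_i) \tag{15}$$

corresponds to the cost incurred on environment $E_i$ by controller $\pi_j$ (recall that Assumption 1 implies that we can compute each $\hat{C}[i, j]$). The training cost from inequality (14) can then be written as:

$$\frac{1}{N} \sum_{E \in S} \mathbb{E}_{\pi \sim P} \left[ C(r_\pi; E) \right] = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} \hat{C}[i, j] \, p[j] := \bar{C} p, \tag{16}$$

where the matrix $\bar{C} \in \mathbb{R}^{1 \times L}$ is defined as:

$$\bar{C} := \frac{1}{N} \mathbf{1}^T \hat{C}. \tag{17}$$

Here, $\mathbf{1}$ is the all-ones vector of size $N \times 1$. We note that finding a vector $p$ that minimizes the training cost $\bar{C} p$ corresponds to solving a Linear Program.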

Minimizing the PAC-Bayes upper bound corresponds to solving the following optimization problem:

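$$\begin{aligned} \min_{p \in \mathbb{R}^L} \quad & \bar{C} p + \sqrt{\frac{\mathrm{KL}(p \| p_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}} \\ \text{s.t.} \quad & 0 \leq p \leq 1, \quad \sum_j p[j] = 1. \end{aligned} \tag{18}$$

This optimization problem can be equivalently reformulated via an epigraph constraint [10] as:

$$\begin{aligned} \min_{p \in \mathbb{R}^L, \, \tau} \quad & \tau \\ \text{s.t.} \quad & \tau \geq \bar{C} p + \sqrt{\frac{\mathrm{KL}(p \| p_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}}, \\ & 0 \leq p \leq 1, \quad \sum_j p[j] = 1. \end{aligned}$$

We further rewrite the problem as:

$$\begin{aligned} \min_{p \in \mathbb{R}^L, \, \tau, \, \lambda} \quad & \tau \\ \text{s.t.} \quad & \lambda^2 \geq \frac{\mathrm{KL}(p \| p_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}, \\ & \lambda = \tau - \bar{C} p, \quad \lambda \geq 0, \\ & 0 \leq p \leq 1, \quad \sum_j p[j] = 1. \end{aligned} \tag{19}$$

Our key observation here is that for a fixed $\lambda$, the above problem is a Relative Entropy Program (REP), since it consists of minimizing a linear cost function subject to linear equality and inequality constraints and an additional inequality constraint of the form $\mathrm{KL}(p \| p_0) \leq c$.

We note that $\lambda \in [0, 1]$ since $\lambda = \tau - \bar{C} p$, where $\tau \in [0, 1]$ (because $\tau$ upper bounds the true expected cost) and $\bar{C} p \in [0, 1]$ (recall that we assumed that costs are bounded between $0$ and $1$). In order to solve problem (19) to global optimality, we can thus simply search over the one-dimensional parameter $\lambda$ (e.g., by discretizing the interval $[0, 1]$, performing a bisection search, etc.) and find the setting of $\lambda$ that leads to the lowest optimal value for the corresponding REP.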
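The quantities in this section are easy to evaluate numerically for a candidate distribution. The sketch below computes the averaged cost vector of (17), the training cost of (16), and the objective of problem (18) for a given probability vector. It does not solve the REP itself (that requires a conic solver); it only evaluates the objective, which is useful for checking a solver's output. All function names and the small cost matrix are hypothetical.

```python
import math


def column_means(cost_matrix):
    """C_bar from equation (17): average each policy's cost over the N environments."""
    n = len(cost_matrix)
    return [sum(row[j] for row in cost_matrix) / n for j in range(len(cost_matrix[0]))]


def kl_discrete(p, p0):
    """KL(p || p0) from equation (3), using the convention 0 * log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0) if pi > 0.0)


def pac_bayes_objective(cost_matrix, p, p0, delta):
    """Objective of problem (18) for a candidate probability vector p."""
    n = len(cost_matrix)
    c_bar = column_means(cost_matrix)
    train_cost = sum(c * pj for c, pj in zip(c_bar, p))  # equation (16)
    reg = math.sqrt((kl_discrete(p, p0) + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return train_cost + reg


# Hypothetical data: rows are environments, columns are policies, entries in [0, 1].
costs = [[0.0, 1.0, 0.2], [1.0, 0.0, 0.3], [0.0, 0.0, 0.1], [1.0, 1.0, 0.2]]
prior = [1.0 / 3.0] * 3
value_at_prior = pac_bayes_objective(costs, prior, prior, delta=0.01)
value_concentrated = pac_bayes_objective(costs, [0.0, 0.0, 1.0], prior, delta=0.01)
```

Comparing `value_at_prior` with `value_concentrated` shows the trade-off being optimized: concentrating mass on the best policy lowers the training cost but raises the KL term.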

### 5.2 Continuously-Parameterized Control Policy Space

We now consider policies parameterized by the vector $w \in \mathbb{R}^d$ (e.g., neural networks parameterized by weights). We will consider stochastic policies defined by probability distributions over the parameters $w$. Here, we choose Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ with diagonal covariance $\Sigma = \mathrm{diag}(s)$ (with $s \in \mathbb{R}^d_+$) and use the shorthand $\mathcal{N}_{\mu, s} := \mathcal{N}(\mu, \mathrm{diag}(s))$. Using Gaussians makes computations easier since we can express the KL divergence between Gaussians in closed form. We can then apply Algorithm 1 and choose $\mu$ and $s$ to minimize the PAC-Bayes upper bound $C_{PAC}$. In order to turn this into a practical algorithm, there are two primary issues we need to address.

First, in order to minimize the bound $C_{PAC}$, one would like to apply gradient-based methods (e.g., stochastic gradient descent). However, the cost function may not be a differentiable function of the parameters $w$. For example, in the case of designing obstacle avoidance controllers, a natural (but non-differentiable) cost function is the one that assigns a cost of $1$ if the robot collides (and $0$ otherwise). To tackle this issue, we employ a differentiable surrogate for the cost function during optimization (note that the final bound is still evaluated for the original cost function). This surrogate will necessarily depend on the application at hand; we will present an example of this in the context of obstacle avoidance in Section 6.

The second challenge is the fact that computing the training cost requires computing the following expectation over controllers:

$$\mathbb{E}_{w \sim \mathcal{N}_{\mu, s}} \left[ C(r_w; E) \right]. \tag{20}$$

For most realistic settings, this expectation cannot be computed in closed form. We address this issue in a manner similar to [17]. In particular, in order to optimize $\mu$ and $s$ using gradient descent, we take gradient steps with respect to the following unbiased estimator of $C_S(\mathcal{N}_{\mu, s})$:

$$\frac{1}{N} \sum_{E \in S} C(r_{\mu + \sqrt{s} \odot \xi}; E), \quad \xi \sim \mathcal{N}(0, I_d). \tag{21}$$

In other words, in each gradient step we use an i.i.d. sample of $\xi$ and compute the gradient of (21) with respect to $\mu$ and $s$.
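The sampling step in (21) can be sketched as follows. This example only evaluates the one-sample estimator; it does not compute gradients, which in practice would come from an automatic differentiation library applied to a differentiable surrogate cost. The function names, the scalar "environments", and `toy_cost` are hypothetical stand-ins for the rollout cost $C(r_w; E)$.

```python
import math
import random


def sample_policy_params(mu, s, rng):
    """Draw w = mu + sqrt(s) ⊙ xi with xi ~ N(0, I_d) (reparameterized sample)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, s)]


def single_sample_training_cost(cost_fn, environments, mu, s, rng):
    """Estimator (21): average cost over the training environments for one draw of xi."""
    w = sample_policy_params(mu, s, rng)
    return sum(cost_fn(w, env) for env in environments) / len(environments)


def toy_cost(w, env):
    """Hypothetical stand-in for C(r_w; E): squared distance to env, clipped to [0, 1]."""
    return min(1.0, (w[0] - env) ** 2)


# Hypothetical usage with two scalar "environments" and a 1-D parameter vector.
rng = random.Random(0)
estimate = single_sample_training_cost(toy_cost, [0.0, 1.0], mu=[0.5], s=[0.01], rng=rng)
```

Because the same draw of $\xi$ is shared across all environments in one evaluation, averaging many such evaluations with fresh draws recovers the training cost $C_S(\mathcal{N}_{\mu, s})$ in expectation.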

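At the end of the optimization procedure, we fix the optimal $\mu^\star$ and $s^\star$ and estimate the training cost $C_S(\mathcal{N}_{\mu^\star, s^\star})$ by producing a large number $L$ of samples $w_1, \dots, w_L$ drawn from $\mathcal{N}_{\mu^\star, s^\star}$:

$$\hat{C}_S(\mathcal{N}_{\mu^\star, s^\star}) := \frac{1}{NL} \sum_{E \in S} \sum_{i=1}^{L} C(r_{w_i}; E). \tag{22}$$

We can then use a sample convergence bound (see [35]) to bound the error between $C_S(\mathcal{N}_{\mu^\star, s^\star})$ and $\hat{C}_S(\mathcal{N}_{\mu^\star, s^\star})$. In particular, the following bound is an application of the relative entropy version of the Chernoff bound for random variables (i.e., costs) bounded in $[0, 1]$ and holds with probability $1 - \delta'$:

$$C_S(\mathcal{N}_{\mu^\star, s^\star}) \leq \bar{C}_S(\mathcal{N}_{\mu^\star, s^\star}; L, \delta') := \mathrm{KL}^{-1}\left( \hat{C}_S(\mathcal{N}_{\mu^\star, s^\star}) \,\Big\|\, \frac{1}{L} \log\left(\frac{2}{\delta'}\right) \right). \tag{23}$$

Combining inequalities (10) and (23) using the union bound, we see that the following bound holds with probability at least $1 - \delta - \delta'$:

$$C_{\mathcal{D}}(\mathcal{N}_{\mu^\star, s^\star}) \leq C^\star_{\text{bound}} := \mathrm{KL}^{-1}\left( \bar{C}_S(\mathcal{N}_{\mu^\star, s^\star}; L, \delta') \,\Big\|\, \frac{\mathrm{KL}(\mathcal{N}_{\mu^\star, s^\star} \| P_0) + \log\left(\frac{2\sqrt{N}}{\delta}\right)}{N} \right). \tag{24}$$

This is the final version of our bound on the expected performance of controllers (drawn from $\mathcal{N}_{\mu^\star, s^\star}$).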
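The two-stage computation in (23) and (24) can be sketched numerically as below. As an assumption made for simplicity, the KL inverse is approximated by bisection here and not by a Relative Entropy Program; the function names and the input numbers are illustrative only.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(B(p) || B(q)); inputs are clipped to keep the logs finite."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_inverse(p, c, tol=1e-10):
    """Bisection approximation of sup{q in [0, 1] : KL(p || q) <= c}."""
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo


def final_bound(sampled_train_cost, n_policy_samples, delta_prime,
                kl_post_prior, n_envs, delta):
    """Apply (23) to bound the training cost, then (24) to bound the true cost."""
    # Stage 1, inequality (23): account for estimating C_S from L policy samples.
    c_bar = kl_inverse(sampled_train_cost, math.log(2.0 / delta_prime) / n_policy_samples)
    # Stage 2, inequality (24): PAC-Bayes bound over N training environments.
    budget = (kl_post_prior + math.log(2.0 * math.sqrt(n_envs) / delta)) / n_envs
    return kl_inverse(c_bar, budget)


# Hypothetical inputs; the result holds with probability at least 1 - delta - delta_prime.
bound = final_bound(0.10, n_policy_samples=10000, delta_prime=0.001,
                    kl_post_prior=5.0, n_envs=1000, delta=0.009)
```

Splitting the failure probability between `delta` and `delta_prime` is a design choice; a small `delta_prime` is cheap when the number of policy samples is large.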

Algorithm 2 summarizes our approach from this section. Note that in order to ensure positivity of $s$, we perform the optimization with respect to $\log s$.

## 6 Example: Reactive Obstacle Avoidance Control

In this section, we demonstrate our approach on the problem of synthesizing reactive obstacle avoidance controllers for a ground vehicle model equipped with a depth sensor. We first consider a finite policy space and leverage the REP-based framework described in Section 5.1 to provide guarantees on the performance of the controller across novel obstacle environments. We then consider continuously parameterized policies and apply the approach from Section 5.2.

Dynamics. A pictorial depiction of the ground vehicle model is provided in Figure 1. The state of the system is given by , where and are the x and y positions of the vehicle respectively, and is the yaw angle. We model the system as a differential drive vehicle with the following nonlinear dynamics:

 ⎡⎢⎣˙x˙y˙ψ⎤⎥⎦=⎡⎢ ⎢⎣−r2(ul+ur)sin(ψ)r2(ul+ur)cos(ψ)rL(ur−ul)⎤⎥ ⎥⎦, (25)

where $u_l$ and $u_r$ are the control inputs (corresponding to the left and right wheel speeds respectively), $r$ (in m) is the radius of the wheels, and $L$ (in m) is the width of the base of the vehicle. We set:

$$u_l = u_0 - u_{\text{diff}},\qquad u_r = u_0 + u_{\text{diff}}, \tag{26}$$

where with m/s. This ensures that the robot has a fixed speed . We limit the turning rate by constraining . The system is simulated as a discrete-time system with time-step s.
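As a minimal sketch of the discrete-time model, the dynamics (25) with the inputs (26) can be integrated with a forward-Euler step. The integration scheme and every numerical value below (`r`, `L`, `u0`, `dt`) are assumptions chosen for illustration; our results use the PyBullet simulator rather than this hand-written integrator.

```python
import math

def wheel_speeds(u0, u_diff):
    """Left and right wheel speeds from equation (26)."""
    return u0 - u_diff, u0 + u_diff

def step(state, u_diff, u0, r, L, dt):
    """One forward-Euler step of the differential-drive dynamics (25)."""
    x, y, psi = state
    ul, ur = wheel_speeds(u0, u_diff)
    v = 0.5 * r * (ul + ur)               # forward speed; equals r * u0 for any u_diff
    x_next = x - v * math.sin(psi) * dt
    y_next = y + v * math.cos(psi) * dt
    psi_next = psi + (r / L) * (ur - ul) * dt
    return (x_next, y_next, psi_next)

def rollout(state, u_diffs, u0, r, L, dt):
    """Apply a sequence of turning commands and return the visited states."""
    states = [state]
    for u_diff in u_diffs:
        state = step(state, u_diff, u0, r, L, dt)
        states.append(state)
    return states

# Placeholder parameters (hypothetical values, not the ones used in our experiments).
params = dict(u0=10.0, r=0.1, L=0.5, dt=0.1)
trajectory = rollout((0.0, 0.0, 0.0), [0.0, 0.5, 0.5, 0.0], **params)
print(trajectory[-1])
```

Since $u_l + u_r = 2u_0$ regardless of $u_{\text{diff}}$, the forward speed stays constant and $u_{\text{diff}}$ affects only the yaw rate.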

Obstacle environments. A typical obstacle environment is shown in Figure 1 and consists of cylinders of varying radii along with three walls that bound the environment between m and m. Environments are generated by first sampling the integer uniformly between and , and then independently sampling the x-y positions of the cylinders from a uniform distribution over the ranges m and m. The radius of each obstacle is sampled independently from a uniform distribution over the range m. The robot’s state is always initialized at .

Obstacle Avoidance Controllers. We assume that the robot is equipped with a depth sensor that provides distances along rays in the range radians (+ve is clockwise) up to a sensing horizon of m (as shown in Figure 1). A given sensor measurement $y$ thus belongs to the space . Let $\hat{y}$ be the inverse distance vector computed by taking an element-wise reciprocal of $y$. We then choose $u_{\text{diff}}$ as the following dot product:

$$u_{\text{diff}} = K\cdot\hat{y}. \tag{27}$$

An example of $K$ is:

$$K[i] = \begin{cases} (y_0/x_0)\,(x_0-\theta[i]) & \text{if } \theta[i]\ge 0,\\ (y_0/x_0)\,(-x_0-\theta[i]) & \text{if } \theta[i]<0. \end{cases} \tag{28}$$

Such a $K$ is shown in Figure 2. For $\theta[i]\ge 0$, $K[i]$ is a linear function of $\theta[i]$ with x- and y-intercepts equal to $x_0$ and $y_0$ respectively. This linear function is reflected about the origin for $\theta[i]<0$.
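Equations (27) and (28) can be sketched in a few lines. The ray angles, intercepts, and depth readings below are hypothetical placeholders (in particular, the four-ray sensor is far coarser than the one used in our experiments), and the function names are ours for illustration.

```python
def gain_vector(thetas, x0, y0):
    """Gains K[i] from equation (28), given ray angles theta[i] and intercepts x0, y0."""
    slope = y0 / x0
    return [slope * (x0 - t) if t >= 0 else slope * (-x0 - t) for t in thetas]

def turn_command(K, depths):
    """u_diff = K . y_hat from equation (27), with y_hat the element-wise reciprocal of depths."""
    y_hat = [1.0 / d for d in depths]
    return sum(k * yh for k, yh in zip(K, y_hat))

# Placeholder sensor layout: four rays placed symmetrically about the heading (radians).
thetas = [-0.6, -0.2, 0.2, 0.6]
K = gain_vector(thetas, x0=1.0, y0=2.0)

# Symmetric depths cancel out, so the robot keeps its heading.
print(turn_command(K, [3.0, 3.0, 3.0, 3.0]))

# A closer obstacle on the positive-angle side yields a positive turning command.
print(turn_command(K, [5.0, 5.0, 1.0, 5.0]))
```

Because the gains are antisymmetric in $\theta$, only the imbalance between the two sides of the depth scan drives the turn; under (25) and (26) a positive $u_{\text{diff}}$ increases $\psi$.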

Intuitively, this corresponds to a simple reactive controller that computes a weighted combination of inverse distances in order to turn away from obstacles that are close. As a simple example, consider the case where we have two obstacles: one located m away along (i.e., to the robot’s left) and the other located m away along (i.e., to the robot’s right). The computed control input will then be (i.e., robot turns left) since the inverse depth for the obstacle to the right is larger than that of the obstacle to the left. Simple reactive controllers of this kind have been shown to be quite effective in practice [4, 8, 52, 13], but can often be challenging to tune by hand in order to achieve good expected performance across all environments. We tackle this challenge by applying the PAC-Bayes control framework proposed here.

Results (finite policy space). In order to obtain a finite policy space, we choose different $K$'s of the form (28) by choosing different x and y intercepts $x_0$ and $y_0$. In particular, $(x_0, y_0)$ is chosen by discretizing the space of intercepts into 5 values for one intercept and 10 values for the other. Our control policy space thus contains $5\times 10 = 50$ controllers, where each controller corresponds to a particular choice of $(x_0, y_0)$.

We consider a time horizon of and assign a cost of 1 if the robot collides with an obstacle during this period and a cost of 0 otherwise. We choose a uniform prior over the policy space and apply the REP framework from Section 5.1 in order to optimize a distribution over controllers. The PyBullet package [14] is used to simulate the dynamics and depth sensor; we use these simulations to compute the elements of the cost matrix (ref. Section 5.1). Each simulation takes s to execute in our implementation (note that the computation of the different elements of can be entirely parallelized). Given the matrix with 100 sampled environments, each REP (corresponding to a fixed value of in Problem (19)) takes s to solve using the CVXPY package [15] and the SCS solver [48]. We discretize the interval into 100 values to find the optimal . Complete code for this implementation is freely available on GitHub.

Table 2 presents the upper bound on the true expected cost of the PAC-Bayes controller (ref. Algorithm 1) for different sample sizes with . The table also presents an estimate of the true expected cost obtained by sampling environments. As the table illustrates, the PAC-Bayes bound provides strong guarantees even for relatively small sample sizes. For example, using only samples, the PAC-Bayes controller is guaranteed (with probability ) to have an expected success rate of (i.e., an expected cost of ). Exhaustive sampling indicates that the expected success rate for the PAC-Bayes controller is approximately for this case. Videos of representative trials on test environments can be found at https://youtu.be/j-uMK-7tF2s.

Results (continuous policy space). Next, we consider a continuously parameterized policy space and apply the approach described in Section 5.2. In particular, we parameterize our policy using the matrix in equation (27) while ensuring symmetry of the control law, i.e., we constrain for (note that is no longer constrained to have the linear form from equation (28)). The dimensionality of the parameter space is thus . We apply Algorithm 2 to optimize a distribution over controllers. For the purpose of optimization, we employ a continuous surrogate cost function in place of the discontinuous 0-1 cost. We choose this to be the negative of the minimum distance to an obstacle along a trajectory (appropriately scaled to lie within ). Note that we employ this surrogate cost only for optimization; all results are presented for the 0-1 cost. Gradients in Algorithm 2 are estimated numerically. We choose a prior with ; the mean is given by a vector of the form (28) with x-intercept and y-intercept .

We use training environments and choose confidence parameters , , and samples to evaluate the sample convergence bound in equation (23). Figure 3 shows the mean of the optimized controller obtained using Algorithm 2. The corresponding PAC-Bayes bound is . Thus, with probability over sampled training data, the optimized PAC-Bayes controller is guaranteed to have an expected success rate of . Exhaustive sampling with environments indicates that the expected success rate is approximately .

## 7 Discussion and Conclusions

We have presented an approach for synthesizing controllers that provably generalize well to novel environments given a dataset of example environments. Our approach leverages PAC-Bayes theory to obtain upper bounds on the expected cost of (stochastic) controllers on novel environments and can be applied to robotic systems with continuous state and action spaces, complicated dynamics, and rich sensory inputs. We synthesize controllers by explicitly minimizing this upper bound using convex optimization in the case of a finite policy space, and using stochastic gradient descent in the more general case of continuously parameterized policies. We demonstrated our approach by synthesizing depth sensor-based obstacle avoidance controllers with guarantees on collision-free navigation in novel environments. Our simulation results compared the generalization guarantees provided by our technique with exhaustive numerical evaluations in order to demonstrate that our approach is able to provide strong bounds even with relatively few training environments.

Challenges and future work: On the practical front, our future work will focus on scaling up the size and richness of the datasets we use along with the complexity of our controller parameterizations. In particular, we believe that the approach presented here holds promise for being able to provide strong guarantees on neural-network based controllers for vision-based robotic tasks such as navigation and grasping by leveraging existing datasets such as the Stanford large-scale 3D Indoor Spaces (S3DIS) dataset [5] and DexNet [39].

There are also a number of challenges and exciting opportunities for future work on the theoretical front. First, it may be desirable in many cases (e.g., safety-critical settings) to synthesize deterministic policies instead of stochastic ones. Techniques for converting stochastic hypotheses into deterministic hypotheses have been developed within the PAC-Bayes framework (e.g., using majority voting in the classification setting [36, 33]); an interesting avenue for future work is to extend such techniques to the control synthesis setting we consider.

Our approach inherits the challenges associated with generalization theory in the supervised learning setting. For example, here we assumed that training and test environments are drawn independently from the same underlying distribution. There has been a large effort towards relaxing these assumptions in the supervised learning context (e.g., domain adaptation techniques [26, 27] and PAC-Bayes bounds that do not assume i.i.d. data [49, 2]). An important feature of the reduction-based perspective we presented in Section 4 is that it immediately allows us to port over such improvements from the supervised learning setting to the control setting (since the reduction is general and not tied to the PAC-Bayes framework). We will further exploit this feature of our approach in future work and consider generalization frameworks beyond PAC-Bayes (e.g., bounds based on algorithmic stability [9, 30, 28] and sample compression bounds [22, 34]).

Another exciting future direction is to combine the techniques presented here with meta-learning techniques in order to achieve provably data-efficient control on novel tasks. Specifically, we will investigate using a PAC-Bayes bound as part of the objective of a meta-learning algorithm such as MAML [21] to achieve improved generalization performance and few-shot learning.

We believe that the approach presented here along with the indicated future directions represent an important step towards synthesizing controllers with provable guarantees for challenging robotic platforms with rich sensory inputs operating in novel environments.

## References

• Agrawal et al. [2016] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.
• Alquier and Guedj [2018] P. Alquier and B. Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
• Althoff et al. [2015] D. Althoff, M. Althoff, and S. Scherer. Online safety verification of trajectories for unmanned flight with offline computed robust invariant sets. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3470–3477. IEEE, 2015.
• Arkin [1998] R. C. Arkin. Behavior-based robotics. MIT press, 1998.
• Armeni et al. [2016] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.
• Bagnell [2004] J. A. Bagnell. Learning Decisions: Robustness, Uncertainty, and Approximation. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, August 2004.
• Bagnell and Schneider [2001] J. A. Bagnell and J. G. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 2, pages 1615–1620. IEEE, 2001.
• Beyeler et al. [2009] A. Beyeler, J.-C. Zufferey, and D. Floreano. Vision-based control of near-obstacle flight. Autonomous robots, 27(3):201, 2009.
• Bousquet and Elisseeff [2002] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
• Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
• Brafman and Tennenholtz [2002] R. I. Brafman and M. Tennenholtz. R-max – A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
• Chandrasekaran and Shah [2017] V. Chandrasekaran and P. Shah. Relative entropy optimization and its applications. Mathematical Programming, 161(1-2):1–32, 2017.
• Conroy et al. [2009] J. Conroy, G. Gremillion, B. Ranganathan, and J. S. Humbert. Implementation of wide-field integration of optic flow for autonomous quadrotor navigation. Autonomous robots, 27(3):189, 2009.
• Coumans and Bai [2018] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2018.
• Diamond and Boyd [2016] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
• Domahidi et al. [2013] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.
• Dziugaite and Roy [2017a] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017a.
• Dziugaite and Roy [2017b] G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Data-dependent PAC-Bayes priors via differential privacy. arXiv preprint arXiv:1712.09376, 2017b.
• Fard and Pineau [2010] M. M. Fard and J. Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1624–1632, 2010.
• Fard et al. [2012] M. M. Fard, J. Pineau, and C. Szepesvári. PAC-Bayesian policy evaluation for reinforcement learning. arXiv preprint arXiv:1202.3717, 2012.
• Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
• Floyd and Warmuth [1995] S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine learning, 21(3):269–304, 1995.
• Fraichard [2007] T. Fraichard. A short paper about motion safety. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1140–1145. IEEE, 2007.
• Fu and Topcu [2014] J. Fu and U. Topcu. Probably approximately correct MDP learning and control with temporal logic constraints. arXiv preprint arXiv:1404.7073, 2014.
• Germain et al. [2009] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360. ACM, 2009.
• Germain et al. [2016] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning, pages 859–868, 2016.
• Germain et al. [2017] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. PAC-Bayes and domain adaptation. arXiv preprint arXiv:1707.05712, 2017.
• Hardt et al. [2015] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
• Kaelbling et al. [1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
• Kearns and Ron [1999] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural computation, 11(6):1427–1453, 1999.
• Kearns and Singh [2002] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
• Kearns et al. [2000] M. J. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, pages 1001–1007, 2000.
• Lacasse et al. [2007] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776, 2007.
• Langford [2005] J. Langford. Tutorial on practical prediction theory for classification. Journal of machine learning research, 6(Mar):273–306, 2005.
• Langford and Caruana [2002] J. Langford and R. Caruana. (not) bounding the true error. In Advances in Neural Information Processing Systems, pages 809–816, 2002.
• Langford and Shawe-Taylor [2003] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.
• Lenz et al. [2015] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
• Levine et al. [2016] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
• Mahler et al. [2017] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
• Majumdar and Tedrake [2017] A. Majumdar and R. Tedrake. Funnel libraries for real-time robust feedback motion planning. The International Journal of Robotics Research (IJRR), 36(8):947–982, July 2017.
• Maurer [2004] A. Maurer. A note on the PAC Bayesian theorem. arXiv preprint cs/0411099, 2004.
• McAllester [1999] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
• Nesterov and Nemirovskii [1994] Y. Nesterov and A. Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.
• Neu et al. [2017] G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
• Neyshabur et al. [2017a] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. preprint arXiv:1707.09564, 2017a.
• Neyshabur et al. [2017b] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5949–5958, 2017b.
• O’Donoghue et al. [2016] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016.
• O’Donoghue et al. [2017] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.0.2. https://github.com/cvxgrp/scs, Nov. 2017.
• Ralaivola et al. [2010] L. Ralaivola, M. Szafranski, and G. Stempfel. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research, 11(Jul):1927–1956, 2010.
• Richter and Roy [2017] C. Richter and N. Roy. Safe visual navigation via deep learning and novelty detection. In Proceedings of Robotics: Science and Systems (RSS), 2017.
• Richter et al. [2015] C. Richter, W. Vega-Brown, and N. Roy. Bayesian learning for safe high-speed navigation in unknown environments. In Proceedings of the International Symposium on Robotics Research (ISRR), 2015.
• Ross et al. [2013] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert. Learning monocular reactive UAV control in cluttered natural environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1765–1772. IEEE, 2013.
• Schouwenaars et al. [2004] T. Schouwenaars, J. How, and E. Feron. Receding horizon path planning with implicit safety guarantees. In Proceedings of the IEEE American Control Conference (ACC), volume 6, pages 5576–5581. IEEE, 2004.
• Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
• Seeger [2002] M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269, 2002.
• Tobin et al. [2017] J. Tobin, W. Zaremba, and P. Abbeel. Domain randomization and generative models for robotic grasping. arXiv preprint arXiv:1710.06425, 2017.