Learning Task-Driven Control Policies via Information Bottlenecks

02/04/2020 ∙ by Vincent Pacelli, et al. ∙ Princeton University

This paper presents a reinforcement learning approach to synthesizing task-driven control policies for robotic systems equipped with rich sensory modalities (e.g., vision or depth). Standard reinforcement learning algorithms typically produce policies that tightly couple control actions to the entirety of the system's state and rich sensor observations. As a consequence, the resulting policies can often be sensitive to changes in task-irrelevant portions of the state or observations (e.g., changing background colors). In contrast, the approach we present here learns to create a task-driven representation that is used to compute control actions. Formally, this is achieved by deriving a policy gradient-style algorithm that creates an information bottleneck between the states and the task-driven representation; this constrains actions to only depend on task-relevant information. We demonstrate our approach in a thorough set of simulation results on multiple examples including a grasping task that utilizes depth images and a ball-catching task that utilizes RGB images. Comparisons with a standard policy gradient approach demonstrate that the task-driven policies produced by our algorithm are often significantly more robust to sensor noise and task-irrelevant changes in the environment.




I Introduction

The increasing availability of high-resolution sensors has significantly contributed to the recent explosion of robotics applications. For example, high-precision cameras and LIDARs now allow autonomous vehicles to perform tasks ranging from navigating busy city streets to mapping mines and buildings. By far the most common approach adopted by such systems is to utilize as much sensor information as is available in order to estimate the full state of these complex environments; the resulting state estimates are then used by the robot to choose control actions necessary to complete its task. However, this ubiquitous approach does not distinguish between the task-relevant and task-irrelevant portions of the system’s state. The result is an unnecessarily tight coupling between sensor observations and control actions that is often not robust to noise or uncertainty in irrelevant portions of the state. Moreover, these policies tend to have a large computational burden. This computational requirement can manifest either as a state estimator that needs to process large amounts of data in real time, or as a policy learned from sampling a large number of diverse operating environments. The goal of this paper is to address these challenges by synthesizing control policies that only depend on task-driven representations of the robot’s state. Doing so reduces the coupling between the states and control actions, improves robustness, and reduces computational requirements.

To illustrate the advantages of a task-driven policy, consider a ball-catching example. An agent equipped with a high-resolution camera is tasked with catching a ball (Figure 1). Using its sensor, the agent can attempt to estimate the full state of the system (e.g., ball velocity, ego-motion, wind speed, etc.), feed this information into a physical model for the ball’s flight, integrate the model to find where the ball will land, and move to this location to catch it. This approach requires estimating every parameter involved in the ball’s motion and can easily be compromised when the ball is far away and difficult to see or when there are visual artifacts such as glare.

Fig. 1: A depiction of our approach applied to the ball-catching problem described in Section IV-B. Our policy implements an information bottleneck that limits the amount of state information (in this case, the robot and ball position and velocities) extracted from the sensor observations (RGB images) to create a set of task-relevant variables (TRVs) on which the control action (robot velocity) depends. The result is a policy that is more robust to image noise and changes to the brick background texture.

Instead, extensive cognitive psychology experiments [11, 29, 36] have demonstrated that humans employ a task-driven strategy known as the gaze heuristic to catch projectiles. This strategy simply modulates the human’s speed in order to fix the position of the ball in their visual field. This policy, which only requires minimal sensor information and internal computation, naturally drives the human to the ball’s landing position. The gaze heuristic highlights that task-driven control policies are often robust control policies that only depend on small amounts of salient state information. Specifically, the heuristic is naturally robust to distributional shifts in irrelevant portions of the environment (e.g., the visual backdrop) while also being adaptable to perturbations like a gust of wind altering the ball’s course. Traditionally, task-driven policies have required hand-engineering for each control problem. This process can be difficult and time consuming. The goal of this paper is to propose a reinforcement learning framework that automatically synthesizes task-driven control policies for systems with nonlinear dynamics and high-dimensional observations (e.g., RGB or depth images).

Statement of Contributions. The main technical contribution of this paper is to formulate a reinforcement learning algorithm that synthesizes task-driven control policies. This synthesis is achieved by creating an information bottleneck [43] that limits how much state information the policy is allowed to use. We present a reinforcement learning algorithm, referred to as task-driven policy gradient (TDPG), that leverages the recently-proposed mutual information neural estimator [4] to tractably search for an effective task-driven policy. Finally, we demonstrate that this formulation provides the key advantage of a task-driven approach: robustness to perturbations in task-irrelevant state and sensor variables. This benefit is demonstrated in three examples featuring nonlinear dynamics and high-dimensional sensor models: an adaptation of the lava problem from the literature on partially observable Markov decision processes (POMDPs), ball catching using RGB images, and grasping an object using depth images.

I-A Related Work

State Estimation and Differentiable Filtering. Classical approaches to controlling robotic systems typically involve two distinct pipelines: one that uses the robot’s sensors to estimate its state and another that uses this state estimate to choose control actions. Such an architecture is motivated by the separation principle [3] from (linear) control theory and allows one to leverage powerful control-theoretic techniques for robust estimation and control [9, 48]. A recent line of work on differentiable filtering [16, 18, 20] has extended this traditional pipeline to elegantly handle rich sensor observations (e.g., images) by learning state estimators in an end-to-end manner via deep learning. However, as the gaze heuristic example from Section I demonstrates, full state representations are often overly rich when viewed from the perspective of the task at hand. This is particularly true in settings where representing the full state requires capturing the state of the robot’s environment. Instead, our goal is to learn minimalistic task-driven representations that are sufficient for control. Such representations can be highly compact as compared to full state representations. Moreover, carefully-constructed task-driven representations have the potential to be robust to sensor noise and changes to irrelevant portions of the robot’s environment. Intuitively, this is because uncertainty or noise in irrelevant portions of the sensor observations is filtered out and thus no longer corrupts the robot’s actions. We present a theoretical result in Section II along with simulation experiments in Section IV to support this intuition.

End-to-end Learning of Policies. Deep reinforcement learning approaches have the ability to learn control policies in an end-to-end manner [14, 15, 21, 26, 27, 40, 49]. Such end-to-end approaches learn to create representations that are tuned to the task at hand by exploiting statistical regularities in the robot’s observations, dynamics, and environment. However, these methods do not explicitly attempt to learn representations that are task-driven (i.e., representations that filter out portions of the sensor observations that are irrelevant to the task). As a result, policies trained via standard deep RL techniques may be sensitive to changes in irrelevant portions of the robot’s environment (e.g., changes to the background color in a ball-catching task). In contrast, the deep RL-based approach we present seeks to explicitly learn task-driven representations that filter out irrelevant factors. Our simulation results in Section IV empirically demonstrate that our approach is robust to such distributional shifts.

Information Bottlenecks. Originally developed in the information theory literature, information bottlenecks [43, 2, 25, 44] allow one to formalize the notion of a “minimal-information representation” that is sufficient for a given task (e.g., a prediction task in the context of supervised learning). Given an input random variable $X$ and a target random variable $Y$, one seeks a representation $\tilde{X}$ that forms a Markovian structure $X \to \tilde{X} \to Y$. The representation is chosen to minimize the mutual information between $X$ and $\tilde{X}$ while still maintaining enough information to predict $Y$ from $\tilde{X}$. Recent work has sought to adapt this theory for synthesizing task-driven representations for control [30, 1]. In [30], a model-based approach for automatically synthesizing task-driven representations via information bottlenecks is presented. However, this approach is limited to settings with an explicit model of the robot’s sensor and dynamics. This prevents the approach from being applied to systems with rich sensing modalities (e.g., RGB or depth images), for which one cannot assume a model. In contrast, the approach we present here learns to create a task-driven representation via reinforcement learning and is directly applicable to settings with rich sensing. In [1], the authors use information bottlenecks to define minimal state representations for control tasks involving high-dimensional sensor observations (this approach is also related to the notion of actionable information [38, 39] in vision; see [1] for a discussion). However, [1] does not present concrete algorithms for learning task-driven representations. In contrast, we present a policy gradient algorithm based on mutual information neural estimation (MINE) [4]. Moreover, we note that our definition of a task-driven policy differs from that of [1]; the representations we learn seek to create a bottleneck between sensor observations and control actions as opposed to finding a minimal representation for predicting costs.

Exploration in RL via Mutual Information Regularization. The information bottleneck principle has also been used to improve sample efficiency in RL by encouraging exploration [12, 45]. More generally, there is a line of work on endowing RL agents with intrinsic motivation by maximizing the mutual information between the agent’s control actions and states [24, 32, 33, 42, 46, 47]. Our focus instead is on improving robustness and generalization of policies; the information bottleneck-based approach we present is aimed at learning task-driven representations for control. This is achieved by minimizing the mutual information between the agent’s states and learned task-relevant variables.

II Defining Task-Driven Policies

In this section, we formalize our notion of a task-driven control policy. Our definition is in terms of a reinforcement learning problem whose solution produces a set of task-relevant variables (TRVs) and a policy that depends only on these variables. We begin by formulating the problem as a finite-horizon partially-observable Markov decision process (POMDP) [41]. The robot’s states, control actions, and sensor observations at time $t$ are denoted by $x_t$, $u_t$, and $y_t$ respectively. The robot dynamics and sensor model are denoted by the conditional distributions $p(x_{t+1} \mid x_t, u_t)$ and $\sigma(y_t \mid x_t)$ respectively. We do not assume knowledge of either of these distributions. The functions $c_t(x_t, u_t)$ describe the cost at each time step $t = 0, \dots, T-1$, with a terminal cost specified by $c_T(x_T)$. Our goal is to find a policy $\pi$ that solves

$$\min_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T-1} c_t(x_t, u_t) + c_T(x_T) \right] \tag{1}$$

when run online in a test environment. Throughout this paper, we use $\mathbb{E}$ to denote the expectation; it is subscripted with a distribution when necessary for clarity.

To achieve our goal of learning a task-driven policy, we choose the policy to have the recurrent structure illustrated in Figure 1. We refer to $\tilde{x}_t$ as the task-relevant variables (TRVs) and search for two conditional distributions: $q(\tilde{x}_t \mid \tilde{x}_{t-1}, y_t)$ and $\pi(u_t \mid \tilde{x}_t)$. The first specifies a (stochastic) mapping from the previous TRVs and the current observation to the current TRVs. The second conditional distribution computes the control actions given the current TRVs. These will be parameterized by neural networks in the following sections. The overall policy can thus be expressed as the composition of the two, with the current TRVs marginalized out:

$$u_t \sim \int \pi(u_t \mid \tilde{x}_t)\, q(\tilde{x}_t \mid \tilde{x}_{t-1}, y_t)\, d\tilde{x}_t.$$

Our goal is to learn TRVs that form a compressed representation of the state that is sufficient for the purpose of control. To formalize this, we leverage the theory of information bottlenecks [43] to limit how much information the TRVs $\tilde{x}_t$ contain about the state $x_t$. This is quantified using the mutual information

$$I(x_t; \tilde{x}_t) := D_{KL}\big( p(x_t, \tilde{x}_t) \,\|\, p(x_t)\, p(\tilde{x}_t) \big) \tag{2}$$

between the state and TRVs at each time step. Here, $D_{KL}$ is the Kullback-Leibler (KL) divergence [8]. Intuitively, minimizing the mutual information corresponds to learning TRVs that are as independent of the state as possible. Thus, the TRVs “filter out” irrelevant information from the state. Therefore, we would like to find a policy that solves:

$$\min_{q,\, \pi} \; \mathbb{E}\left[ \sum_{t=0}^{T-1} c_t(x_t, u_t) + c_T(x_T) \right] + \frac{1}{\beta} \sum_{t} I(x_t; \tilde{x}_t). \tag{3}$$

Here, $1/\beta$ can be viewed as a Lagrange multiplier, and the above problem can be interpreted as minimizing the total information contained in the TRVs subject to an upper bound on the expected cost over the time horizon. In Section III, we develop a reinforcement learning algorithm for tackling Eq. 3. We refer to the resulting policies as task-driven policies. Note that the mutual information is invariant under bijective transforms of the random variables for which it is computed [4, 8]. As a result, the optimum value of Eq. 3 remains unchanged for equivalent representations of the robot and its environment.
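To make the mutual information term concrete, the following is a minimal sketch for discrete random variables, where the KL divergence between the joint distribution and the product of its marginals can be evaluated exactly. The function name `mutual_information` and the small example distributions are illustrative assumptions and do not come from the paper, which works with continuous states and TRVs.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Z) in nats for a discrete joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    pz = joint.sum(axis=0, keepdims=True)   # marginal of Z (columns)
    product = px * pz                       # product of the marginals
    mask = joint > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Z is a noisy copy of a binary X: some state information leaks through.
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
# Z ignores X entirely: the joint factorizes, so I(X; Z) = 0.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Relabeling the values of X (a bijection) permutes rows and leaves I unchanged.
relabeled = dependent[::-1, :]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
mi_relabeled = mutual_information(relabeled)
print(mi_dependent, mi_independent, mi_relabeled)
```

The independent case illustrates the intuition that TRVs carrying no state information have zero mutual information with the state, and the relabeled case illustrates the invariance under bijective transforms noted above.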

Similar to the gaze heuristic (see Section I), we expect that a task-driven policy as defined above will be a robust policy. This benefit is conferred to our policy by minimizing the mutual information between states and TRVs. Intuitively, a policy is closer to being open-loop when less state information is present in the TRVs. The more open-loop the policy is, the less it is impacted by changes in the state or sensor distributions between environments. In [30], this intuition is formalized using the theory of risk metrics with the following theorem:

Theorem II.1.

Define the entropic risk metric [37, Example 6.20]:


Let be any distribution satisfying:


Then, the online expected cost is bounded by a combination of the entropic risk and mutual information:


The entropic risk is a functional that is similar to the expectation, but also accounts for higher moments of the distribution, e.g., its variance. The parameter $\beta$ controls how much the metric weights the expected cost versus the higher moments, and the entropic risk approaches the expected cost as $\beta \to 0$. Optimizing the bound in Eq. 6 directly can be difficult because its gradient requires additional computation at each time step. Therefore, we optimize the objective in Eq. 3, a first-order approximation of Eq. 6. In Section IV, we demonstrate in multiple examples that minimizing our objective produces a robust policy that generalizes beyond its training environment.

III Learning Task-Driven Policies

This section discusses finding a policy that approximately solves Eq. 3 within a reinforcement learning (RL) framework. If the mutual information term is removed from the objective in Eq. 3, standard RL techniques such as policy gradient (PG) (e.g., [34, 35]) would be sufficient. However, the mutual information and its gradient are known to be difficult quantities to estimate. A number of tractable upper and lower bounds have been proposed recently to provide means for optimizing objectives containing a mutual information term [31]. We elected to use the recently-proposed mutual information neural estimator (MINE) [4], described in Section III-A, due to its accuracy and ease of implementation. We then derive a PG-style algorithm in Section III-B.

III-A Mutual Information Neural Estimator (MINE)

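The MINE is based on the Donsker-Varadhan (DV) variational representation of the KL divergence [13, Theorem 2.3.2]. For any two distributions $P$ and $Q$ defined on a sample space $\Omega$, the KL divergence between them can be expressed as:

$$D_{KL}(P \,\|\, Q) = \sup_{T : \Omega \to \mathbb{R}} \; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\left[e^{T}\right]. \tag{7}$$

The supremum is taken over all functions $T$ such that the two expectations are finite. Let $\hat{\mathbb{E}}_{P}$ denote the empirical expectation computed with $n$ i.i.d. samples from $P$. We can estimate the KL divergence by replacing the function class over which the supremum is taken by a family of neural networks $T_\xi$ parameterized by $\xi$:

$$\hat{D}(P \,\|\, Q) := \sup_{\xi} \; \hat{\mathbb{E}}_{P}[T_\xi] - \log \hat{\mathbb{E}}_{Q}\left[e^{T_\xi}\right]. \tag{8}$$

This approximation is a strongly consistent underestimate of Eq. 7 (see [4]). In short, this means that through choice of appropriate network structure and sample size, $\hat{D}(P \,\|\, Q)$ can approximate $D_{KL}(P \,\|\, Q)$ arbitrarily closely.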
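The Donsker-Varadhan representation can be checked numerically in a case where the KL divergence is known in closed form. The sketch below assumes two unit-variance Gaussians, $P = \mathcal{N}(1, 1)$ and $Q = \mathcal{N}(0, 1)$, for which $D_{KL}(P \,\|\, Q) = 0.5$ and the optimal test function is the log-density ratio $T(x) = x - 0.5$. The distributions, sample size, and the helper name `dv_bound` are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p_samples = rng.normal(1.0, 1.0, size=n)   # samples from P = N(1, 1)
q_samples = rng.normal(0.0, 1.0, size=n)   # samples from Q = N(0, 1)
true_kl = 0.5                               # KL(N(1,1) || N(0,1)) = (1 - 0)^2 / 2

def dv_bound(critic):
    """Empirical Donsker-Varadhan value E_P[T] - log E_Q[exp(T)]."""
    return critic(p_samples).mean() - np.log(np.exp(critic(q_samples)).mean())

# T = log(dP/dQ) attains the supremum, so the value is close to the true KL.
optimal = dv_bound(lambda x: x - 0.5)
# Any other test function gives a smaller value (here 0.375 in expectation).
suboptimal = dv_bound(lambda x: 0.5 * x)
print(optimal, suboptimal, true_kl)
```

Restricting the supremum to a parametric family, as in Eq. 8, can only lower the attainable value, which is why the estimate is described as an underestimate.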

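Since the mutual information is a KL divergence, we can approximate Eq. 2 using the above KL approximation as:

$$\hat{I}(x_t; \tilde{x}_t) := \sup_{\xi} \; \hat{\mathbb{E}}\left[T_\xi(x_t, \tilde{x}_t)\right] - \log \hat{\mathbb{E}}_{p(x_t) p(\tilde{x}_t)}\left[e^{T_\xi(x_t, \tilde{x}_t)}\right]. \tag{9}$$

This is the MINE. The algorithm for computing the MINE estimate is presented in Algorithm 1. At a high level, the algorithm attempts to find neural network parameters $\xi$ that maximize the objective in (8) via stochastic gradient methods. We use the subscript $p(x_t) p(\tilde{x}_t)$ to denote the expectation taken using the product of the marginal distributions. The expectation computed using the joint distribution $p(x_t, \tilde{x}_t)$ is unsubscripted. These expectations are approximated by sampling two minibatches of equal size uniformly from the training batch. The minibatches are referred to in Algorithm 1 as the joint and marginal minibatches respectively. The gradient of the objective is

$$\nabla_\xi \left( \mathbb{E}[T_\xi] - \log \mathbb{E}_{p(x_t) p(\tilde{x}_t)}\left[e^{T_\xi}\right] \right) = \mathbb{E}\left[\nabla_\xi T_\xi\right] - \frac{\mathbb{E}_{p(x_t) p(\tilde{x}_t)}\left[\nabla_\xi T_\xi \, e^{T_\xi}\right]}{\mathbb{E}_{p(x_t) p(\tilde{x}_t)}\left[e^{T_\xi}\right]}. \tag{10}$$

This gradient, which is computed using the minibatches, is used to update $\xi$ with stochastic gradient steps (or a similar optimizer like Adam [22]). For additional details on training the MINE, see [4].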

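procedure Train-Mine(training batch, initial MINE parameters)
    repeat
        Sample joint minibatch of indices from the training batch
        Sample marginal minibatch of indices from the training batch
        Compute the joint expectation in Eq. 10 with the joint minibatch
        Compute the marginal expectations in Eq. 10 with the marginal minibatch
        Update the MINE parameters using the minibatches and Eq. 10
    until convergence of the MINE estimate
end procedure
Algorithm 1 Mutual Information Neural Estimator (MINE)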
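The loop in Algorithm 1 can be sketched in a few lines under simplifying assumptions. The example below replaces the neural network with a hypothetical critic that is linear in three quadratic features, so the gradient in Eq. 10 can be written by hand, and it uses a pair of correlated Gaussian variables whose mutual information is known in closed form. The feature map, step size, batch sizes, and names such as `train_mine` are assumptions for illustration and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.5, 20_000
x = rng.normal(size=n)
z = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=n)  # corr(x, z) = rho
true_mi = -0.5 * np.log(1.0 - rho**2)   # closed form for jointly Gaussian x, z

def features(x, z):
    # Hypothetical critic T_xi(x, z) = xi . [x^2, z^2, x*z], linear in xi.
    return np.stack([x**2, z**2, x * z], axis=1)

def dv_value(xi, f_joint, f_marg):
    """Empirical objective E_joint[T] - log E_marg[exp(T)]."""
    return (f_joint @ xi).mean() - np.log(np.exp(f_marg @ xi).mean())

def train_mine(x, z, epochs=2000, batch=1000, lr=0.05):
    xi = np.zeros(3)
    for _ in range(epochs):
        j = rng.integers(0, len(x), size=batch)   # joint minibatch indices
        m = rng.integers(0, len(x), size=batch)   # marginal minibatch indices
        f_joint = features(x[j], z[j])            # paired samples ~ p(x, z)
        f_marg = features(x[j], z[m])             # mismatched pairs ~ p(x) p(z)
        w = np.exp(f_marg @ xi)
        # Eq. 10 for a linear critic: grad T equals the feature vector.
        grad = f_joint.mean(axis=0) - (w[:, None] * f_marg).sum(axis=0) / w.sum()
        xi += lr * grad                           # ascent step on the objective
    return xi

xi = train_mine(x, z)
mi_estimate = dv_value(xi, features(x, z), features(x, rng.permutation(z)))
print(mi_estimate, true_mi)
```

A quadratic feature map is used here because the optimal critic for jointly Gaussian variables is itself quadratic; for image observations a neural network critic, as in the paper, would take its place.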

III-B Task-Driven Policy Gradient Algorithm

We now describe our procedure for leveraging MINE within a policy gradient (PG) algorithm for tackling Eq. 3. As described in Section II, the policy is parameterized by two neural networks: a recurrent network $q_\phi$ that outputs the TRVs and a feedforward network $\pi_\theta$ that outputs the actual control action. Here $\phi$ and $\theta$ represent the parameters for each network. In our implementation, both networks output the mean and diagonal covariance of multivariate Gaussian distributions.
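The two-network structure can be sketched as follows. This is only a structural illustration: random linear maps stand in for the neural networks, the dimensions are arbitrary, and names such as `q_step` and `pi_step` are hypothetical. It shows the recurrence on the TRVs and the fact that the control action is sampled from the TRVs alone.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, trv_dim, act_dim = 4, 2, 1   # hypothetical sizes

# Hypothetical linear stand-ins for the two networks; each returns a mean and
# the log of a diagonal standard deviation.
W_q = rng.normal(scale=0.1, size=(2 * trv_dim, trv_dim + obs_dim))
W_pi = rng.normal(scale=0.1, size=(2 * act_dim, trv_dim))

def gaussian_sample(params, dim):
    """Sample from a diagonal Gaussian given stacked [mean, log_std]."""
    mean, log_std = params[:dim], params[dim:]
    return mean + np.exp(log_std) * rng.normal(size=dim)

def q_step(prev_trv, obs):
    """Sample the current TRVs from the previous TRVs and current observation."""
    return gaussian_sample(W_q @ np.concatenate([prev_trv, obs]), trv_dim)

def pi_step(trv):
    """Sample the control action from the current TRVs only."""
    return gaussian_sample(W_pi @ trv, act_dim)

trv = np.zeros(trv_dim)
actions = []
for t in range(5):
    obs = rng.normal(size=obs_dim)    # placeholder for a sensor observation
    trv = q_step(trv, obs)            # information bottleneck: obs -> TRVs
    actions.append(pi_step(trv))      # action depends on the TRVs only
actions = np.array(actions)
print(actions.shape)
```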

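Computing the gradient of our objective Eq. 3 is difficult due to the presence of the mutual information, so the MINE is used to approximate this term. (Footnote 1: We note that, in order to employ the bound Eq. 6, we need to employ an overestimate of the mutual information. In Section IV, we demonstrate that in practice, minimizing the MINE is a practical approximation that produces robust policies.) This approximation allows us to derive a learning algorithm similar to policy gradient. Using the well-known identity

$$\nabla_\theta\, p_\theta = p_\theta \, \nabla_\theta \log p_\theta, \tag{11}$$

the gradient of the expected cost with respect to $\theta$ is given by

$$\nabla_\theta \, \mathbb{E}[C] = \mathbb{E}\left[ C \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid \tilde{x}_t) \right], \qquad C := \sum_{t=0}^{T-1} c_t(x_t, u_t) + c_T(x_T). \tag{12}$$

Repeating this process for $\phi$ yields an analogous formula with $q_\phi$ replacing $\pi_\theta$:

$$\nabla_\phi \, \mathbb{E}[C] = \mathbb{E}\left[ C \sum_{t=0}^{T-1} \nabla_\phi \log q_\phi(\tilde{x}_t \mid \tilde{x}_{t-1}, y_t) \right]. \tag{13}$$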
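The likelihood-ratio identity used above can be checked numerically on a one-dimensional problem where the gradient is known exactly. The sketch assumes a sample drawn from a unit-variance Gaussian with mean $\theta$ and a quadratic cost, so that the expected cost is $\theta^2 + 1$ and its derivative is $2\theta$. The toy distribution and cost are assumptions chosen for illustration and are unrelated to the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000
x = rng.normal(theta, 1.0, size=n)     # x ~ p_theta = N(theta, 1)
cost = x**2                            # E[cost] = theta^2 + 1
score = x - theta                      # grad_theta log p_theta(x) for unit variance
grad_estimate = np.mean(cost * score)  # likelihood-ratio (score-function) estimator
grad_exact = 2.0 * theta               # derivative of theta^2 + 1
print(grad_estimate, grad_exact)
```

The estimator only needs samples of the cost and the gradient of the log-density, which is why it applies when the dynamics and sensor model are unknown.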


Finally, it remains to calculate the gradient of the mutual information term with respect to the policy parameters. Though it is more tractable to optimize the MINE instead of the true mutual information, computing the gradient of the MINE with respect to the policy parameters is not entirely straightforward. The complexity lies in the fact that the gradient of the MINE with respect to these parameters depends on knowing or estimating the marginal distributions $p(x_t)$ and $p(\tilde{x}_t)$, both of which depend on the policy parameters $\theta$ and $\phi$. To approximate the MINE gradient we fix the MINE parameters to their converged values from Algorithm 1, which yields:


Since the neural networks used to parameterize the policy produce Gaussian distributions, we can represent the TRVs $\tilde{x}_t$ as

$$\tilde{x}_t = \mu_t + \Sigma_t^{1/2} \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I). \tag{15}$$

This is similar to the reparameterization trick for variational autoencoders [23]. Here, $\mu_t$ and $\Sigma_t$ are computed using the mean and covariance output from the TRV network. Since the dynamics and sensor model are unknown, several quantities are approximated as independent of the policy parameters. With these fixed, the gradient of Eq. 14 with respect to the policy parameters can be computed by storing the noise $\epsilon_t$ as a part of each rollout and backpropagating using Eq. 15.
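The reparameterization can be illustrated with a small numerical sketch. It assumes a two-dimensional Gaussian with diagonal standard deviations and a simple quadratic function of the sample, for which the exact gradients are $2\mu$ and $2\sigma$. The variable names and the choice of downstream function are assumptions for illustration; in the paper's setting the downstream quantity would be the MINE objective and the gradients would be obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
sigma = np.array([0.2, 2.0])               # diagonal standard deviations
eps = rng.normal(size=(100_000, 2))        # noise stored alongside the rollout
trv = mu + sigma * eps                     # differentiable in mu and sigma for fixed eps

# With eps held fixed, d trv / d mu = 1 and d trv / d sigma = eps, so the
# gradient of a downstream quantity follows from the chain rule.
# Downstream quantity here: E[sum(trv^2)] = sum(mu^2 + sigma^2).
grad_mu = np.mean(2.0 * trv, axis=0)           # exact value: 2 * mu
grad_sigma = np.mean(2.0 * trv * eps, axis=0)  # exact value: 2 * sigma
print(grad_mu, grad_sigma)
```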

The task-driven policy gradient (TDPG) algorithm is outlined in Algorithm 2. Our learning process involves training a MINE network for each time step, whose parameters are denoted $\xi_t$. For clarity in future sections, we refer to an iteration of the outer optimization loop as a policy epoch and an iteration of the optimization loop in Algorithm 1 as a MINE epoch. During each policy epoch, we roll out a batch of trajectories using the current policy parameters. We then update each set of MINE parameters using Algorithm 1. Once the MINE networks converge, they are used to approximate the mutual information terms and optimize the policy. This is repeated until the empirical estimate of the objective in Eq. 3 converges.

Implementation Details. It remains to specify how to select $\beta$ in a principled manner. Returning to the perspective described in Section II, we treat $\beta$ as determining the Lagrange multiplier for minimizing the information shared between the states $x_t$ and the TRVs $\tilde{x}_t$ subject to an upper bound on the maximum expected cost the policy is allowed. Then, we sweep through a set of values for $\beta$ and select the policy from the epoch with the lowest MINE estimate that also satisfies the specified limit on the empirical expected cost. This strategy produces the policy estimated to have the least state information present in the TRVs while satisfying our performance constraint.
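The selection rule described above amounts to a constrained minimum over the policies recorded during the sweep. A minimal sketch follows; the `Candidate` record, the function `select_policy`, and all numerical values are hypothetical and serve only to illustrate the rule.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    beta: float            # trade-off parameter used for this training run
    epoch: int             # policy epoch at which the snapshot was taken
    expected_cost: float   # empirical expected cost of the snapshot
    mine_estimate: float   # MINE estimate of the state/TRV information

def select_policy(candidates, cost_limit):
    """Return the feasible candidate with the lowest MINE estimate, or None."""
    feasible = [c for c in candidates if c.expected_cost <= cost_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.mine_estimate)

# Hypothetical snapshots from a sweep over two values of beta.
candidates = [
    Candidate(beta=0.1, epoch=40, expected_cost=35.0, mine_estimate=0.8),
    Candidate(beta=0.1, epoch=80, expected_cost=38.0, mine_estimate=0.3),
    Candidate(beta=1.0, epoch=60, expected_cost=45.0, mine_estimate=0.1),  # violates the limit
    Candidate(beta=1.0, epoch=90, expected_cost=31.0, mine_estimate=1.5),
]
selected = select_policy(candidates, cost_limit=40.0)
print(selected)
```

The snapshot with the lowest information estimate overall is rejected here because it exceeds the cost limit, which mirrors the constrained interpretation of Eq. 3.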

It is likely that each MINE network is initialized to a poor estimate of the mutual information. In order to improve the initial estimate of the mutual information and its gradient, additional MINE epochs are used during the first policy epoch. In the following section, we will specify the number of additional epochs used. Moreover, as discussed in [4], using minibatches to estimate the MINE gradient in Eq. 10 leads to a biased estimate of the gradient. Replacing the denominator in Eq. 10 with the exponential moving average (EMA) of its value compensates for this bias. (Footnote 2: The EMA is a filter defined on a sequence by a recursive relationship.) This technique is used in some examples in the following section. We also limit the policy networks to output Gaussian distributions with diagonal covariances.
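For reference, one common form of the EMA recursion is sketched below. The exact recursion and initialization used by the authors are not recoverable from this text, so the update rule, the parameter name `alpha`, and the initial value are assumptions; in the MINE setting the input sequence would be the minibatch values of the denominator in Eq. 10.

```python
def ema(values, alpha, initial):
    """Exponential moving average: m_k = (1 - alpha) * m_{k-1} + alpha * a_k."""
    smoothed, current = [], initial
    for v in values:
        current = (1.0 - alpha) * current + alpha * v
        smoothed.append(current)
    return smoothed

# A large alpha is used here only to make the arithmetic easy to follow.
smoothed = ema([1.0, 3.0, 2.0, 4.0], alpha=0.5, initial=1.0)
print(smoothed)
```

A small rate makes the smoothed denominator change slowly across minibatches, which is the property that reduces the minibatch bias.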

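repeat
    Rollout batch of trajectories using the current policy parameters.
    for each MINE network do
        Train-Mine
    end for
    Update the policy parameters using rollout batch and Eq. 12, Eq. 13, Eq. 15.
until convergence of the empirical estimate of the objective.
Algorithm 2 Task-Driven Policy Gradient (TDPG)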

IV Examples

In this section, we apply the algorithm described in Section III to three problems: (i) a continuous state and action version of the “Lava Problem” from the POMDP literature, (ii) a vision-based ball-catching example, and (iii) a grasping problem with depth-image observations. For each of these problems, we present thorough simulation results demonstrating that the task-driven policy is robust to distributional shifts in the sensor model and testing environment. We compare our method against a policy with the same parameterization as ours, but trained using a standard policy gradient method in order to minimize the expected cost associated with the problem.

IV-A Lava Problem

Fig. 2: An illustration of the Lava Problem described in Section IV-A. The robot (a double integrator) needs to navigate to a target state without moving so far right that the robot falls into the lava.

The first problem we consider is a continuous state and action version of the Lava problem (Figure 2) [5, 10, 19, 30], which is a common example for evaluating robust solutions to POMDPs. This scenario involves a robot navigating to a goal location along a line segment between a wall and a lava pit. The robot is modeled as a time-discretized double integrator, i.e. its state evolves with dynamics . Here is the displacement from the wall (in meters) and is the robot’s velocity (in meters per second). The goal is to navigate to the state within a time horizon of steps. However, is limited to the interval m. If the robot collides with the wall located at , then its velocity is set to m/s as well. If the robot’s position exceeds m, then the robot falls into hot lava, where it is unable to move any further. At training time, the robot is provided with a high-quality estimate of its state. This is modeled by the choice , where and denotes the Gaussian distribution. The cost function for this problem is for and . The robot is initialized with and uniformly distributed between 0 and 5 meters.

Training Summary. Both the TRV network $q_\phi$ and the control network $\pi_\theta$ have two hidden layers with 64 units and use exponential linear unit (ELU) nonlinearities [6]. We choose a two-dimensional space of TRVs. In this example, we found the performance of all learned policies improved when the parameters $\phi$ and $\theta$ were allowed to be time-varying. The MINE networks used in this example contain two hidden layers with 32 units each and ELU nonlinearities. A batch of 500 rollouts is used for each epoch, and a minibatch size of 50 is used for each MINE epoch. An EMA with 5e-5 was used to compute the MINE gradient. The learning rates in this problem are 8e-4 for the policy networks and 5e-5 for the MINE network. All policies were trained for 300 epochs, with a computation time of about 10 seconds per epoch (about 50 minutes total for each policy). For this example, all computation was carried out on an Intel i9-7940X. Policies were trained with and the upper limit on the expected cost for selecting the policy was 40.

Fig. 3: These histograms compare our policy (TDPG) to the policy found by policy gradient (PG) in the Lava Problem described in Section IV-A by showing the frequencies at which each policy incurs different costs. The sensor noise of the test environment is increased from left to right. As the noise increases, the PG policy performance degrades, while the TDPG policy performance remains almost constant. This is because the TDPG algorithm found a task-relevant open-loop policy that is robust to this kind of disturbance.
Fig. 4: Textures used as backdrops in different ball-catching environments as part of the example described in Section IV-B. Panels: (a) Training; (b)-(h) Test 1 through Test 7.
Scenario             Policy Gradient        Task-Driven PG
                     Mean      Std.         Mean      Std.
Training             31.04     7.191        36.18     23.22
Sensor Noise 1e-3    31.58     8.632        36.55     25.02
Sensor Noise 1e-2    35.69     15.68        36.00     22.32
Sensor Noise 1e-1    66.29     36.48        36.20     21.85
Sensor Noise 1e0     172.40    88.08        37.39     25.60
TABLE I: Performance of PG and TDPG Policies in Training and Testing Lava Environments

Policy Evaluation. We compare the resulting policy with one trained to minimize only the expected cost using policy gradient. The two policies are compared in Figure 3, and the statistics of their performance in different environments (corresponding to different levels of sensor noise) are presented in Table I. Due to the presence of low sensor noise in the training environment, the PG policy finds it optimal to drive the robot directly towards the goal state. Interestingly, the TDPG algorithm finds a qualitatively different strategy. In particular, TDPG recovers the robust open-loop behavior described in [5, 10, 30]: regardless of initial position, the robot moves left until it collides with the wall, then moves right to the goal state. As the sensor noise is increased, the performance of the PG policy degrades rapidly. For example, if the sensor reports the robot is to the left of the goal when it is really to the right of the goal, it is likely to fall in the lava. In contrast, the TDPG policy is virtually unaffected by the increased sensor noise.

IV-B Vision-Based Ball Catching

Next, we consider a ball-catching example inspired by the gaze heuristic discussed in Section I (see Figure 1). We formalize this problem by considering a ball confined to a plane with and coordinates . The robot is confined to the -axis and must navigate to . The state of the system in this example is given by , where represents the robot’s displacement along the -axis, represents the ball’s velocity along the -axis, and represents the ball’s velocity along the -axis (i.e., the ball’s vertical velocity). The dynamics are given by:


Here, is gravity, and is used to discretize the dynamics of the system in time. The robot’s initial position is uniformly distributed over the interval m. The ball is launched from m and m with fixed initial velocities m/s and m/s. These initial conditions are chosen such that the ball always spends a fixed number of time steps above the -axis. The time horizon is chosen such that is the last time step the ball remains above the -axis. The cost function that we are trying to minimize is for and , which encourages the robot to be very close to the ball when it lands at the end of the time horizon. The sensor in this scenario is a camera mounted above the robot. This camera provides RGB images with values scaled between 0 and 1. We also placed a wall with a red brick texture centered at 10m along the -axis. All simulations are carried out using PyBullet [7]. A sample observation from the camera is presented in Figure 1, and a video depicting both policies operating in training and testing environments for this scenario is available here: https://youtu.be/Mwv0kkRveas.

Training Summary. In this example, the TRV network $q_\phi$ is parameterized by a network with 2 convolutional layers with 6 output channels, kernel size of 4, and stride of length 2, followed by two fully-connected layers using 32 units each. An ELU nonlinearity is applied between convolutional layers. After each fully-connected layer, bounded nonlinearities were used instead of ELU nonlinearities to prevent the values of the TRVs $\tilde{x}_t$ from growing unbounded. The dimension of $\tilde{x}_t$ is 8. The control network $\pi_\theta$ contains a single linear layer. The MINE networks used in this example contain two hidden layers with 64 units each and ELU nonlinearities. A batch of 200 rollouts is used for each epoch, and a minibatch size of 20 is used to train the MINE network. The learning rates in this problem are 1e-3 for the policy networks and 5e-5 for the MINE network. When training the TDPG policy, it was initialized with the PG solution and trained for 100 policy epochs. Each policy epoch contained 100 MINE epochs, with 100,000 MINE epochs used on the first policy epoch, and an EMA with 5e-5 was used to compute the MINE gradient. Policies with were evaluated, with the upper limit on the expected cost placed at 24. Rollouts were computed on a 3.7 GHz i7-8700K CPU while all optimization was done using an Nvidia Titan Xp. Each policy epoch took about 45 seconds to compute.

Policy Evaluation. We consider two different kinds of testing environments. All statistics reported for test environments were computed using rollouts. In the first group of examples, we add random noise to each pixel; the noise is sampled from a Gaussian distribution , where is varied between experiments. In order to ensure that the observed image is a valid RGB image after adding noise, we normalize the values to lie between 0 and 1. For each experiment, the mean cost and distance between the robot and the ball on the -axis is presented in Table II. As the level of image noise increases, the performance of the PG policy deteriorates dramatically while the TDPG policy’s performance remains largely unchanged.

In the second set of experiments, we qualitatively change the nature of the environment by changing the texture of the backdrop (i.e., the wall in the background). A representative portion of each texture is presented in Figure 4. The TDPG policy outperforms the PG policy in each of these seven testing environments except background 7. This texture has a hue (red) similar to that of the texture in the training environment, which allows PG to perform well. The TDPG policy, however, outperforms PG in environments whose textures consist of hues that differ from the ones seen during training. Together with the previous set of experiments, this result suggests that the TDPG policy is more robust to observations that are qualitatively different from those provided to the policy during training.

Scenario | Policy Gradient: Cost | Policy Gradient: Dist. (m) | Task-Driven PG: Cost | Task-Driven PG: Dist. (m)
Training | 8.702 | 0.085 | 18.577 | 0.184
Sensor Noise | 19.12 | 0.189 | 19.48 | 0.193
Sensor Noise | 46.68 | 0.464 | 20.12 | 0.199
Sensor Noise | 124.63 | 1.244 | 21.59 | 0.214
Sensor Noise | 187.53 | 1.874 | 27.11 | 0.274
Test Background 1 | 182.1 | 1.828 | 122.7 | 1.225
Test Background 2 | 214.3 | 2.141 | 79.90 | 0.797
Test Background 3 | 138.7 | 1.385 | 26.62 | 0.264
Test Background 4 | 95.38 | 0.952 | 28.11 | 0.279
Test Background 5 | 82.01 | 0.818 | 36.95 | 0.367
Test Background 6 | 208.5 | 2.083 | 166.4 | 0.166
Test Background 7 | 9.180 | 0.090 | 14.92 | 0.147
TABLE II: Mean Cost and Final Distance of PG and TDPG Policies in Training and Testing Ball-Catching Environments

IV-C Grasping using a Depth Camera

Scenario | Policy Gradient: Mean | Policy Gradient: Std. | Task-Driven PG: Mean | Task-Driven PG: Std.
Training | 0.107 | 0.309 | 0.094 | 0.292
Increased Rotation | 0.182 | 0.386 | 0.132 | 0.339
Increased Translation | 0.367 | 0.482 | 0.358 | 0.480
TABLE III: Performance of PG and TDPG Policies in Training and Testing Grasping Environments
(a) Franka Emika Panda
(b) Sensor Observation
Fig. 5: (a) A third-person perspective of the grasping scenario described in Section IV-C using PyBullet. (b) An upscaled example of the depth image forming the sensor observation for this scenario.

In this final example, we consider the task of grasping and lifting an object (a mug) from a table using a Franka Emika Panda simulated with PyBullet (see Figure 5). This example is particularly interesting for studying task-driven policies due to the (approximate) radial symmetry of the mug about the vertical axis. Intuitively, this symmetry renders the orientation of the mug about the vertical axis largely irrelevant to the task of grasping the mug. This is due to the fact that the robot does not need to know the precise orientation of the mug; it only needs to know enough about the orientation in order to estimate if the handle will directly interfere with the gripper’s location when the grasp is performed. We thus expect that a task-driven policy will remain largely unaffected by changes in the mug’s orientation. Hence, a task-driven policy trained on a small set of mug rotations should generalize well to the full set of rotations.

In this grasping problem, the state of the system contains the position of the object and its rotation about the -axis (which is oriented normal to the table surface) in radians. The control action specifies the position and orientation of the end effector of the arm when the grasp occurs. After grasping, the arm attempts to move the end effector vertically. A cost of 0 is awarded if the arm successfully lifts the object more than 0.05 meters above the table and a cost of 1 is assigned otherwise. The observation is a 128 × 128 depth image. An example observation is included in Figure 5. The initial state of the object is sampled uniformly from the set .
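Because the cost is binary, the mean cost over a set of rollouts equals the grasp failure rate. The following minimal sketch illustrates this; the function name `grasp_cost` and the lift heights are hypothetical values chosen for illustration, and only the 0.05 m threshold comes from the text.

```python
def grasp_cost(lift_height_m, threshold_m=0.05):
    """Return 0 if the object was lifted more than `threshold_m` above the table, else 1."""
    return 0 if lift_height_m > threshold_m else 1


# Hypothetical final lift heights (in meters) from five rollouts.
heights = [0.12, 0.00, 0.07, 0.03, 0.20]
costs = [grasp_cost(h) for h in heights]

# With a 0/1 cost, the mean cost is exactly the fraction of failed grasps.
failure_rate = sum(costs) / len(costs)
```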

Training Summary. We parameterized the policy in the following manner. The network contains two convolutional layers with 6 output channels. The first uses a kernel size of 6 and the second uses a kernel size of 4; both use a stride length of 2. The convolutional layers are followed by two fully-connected layers of sizes 128 and 64. An ELU nonlinearity is used between convolutional layers and nonlinearities are used after linear layers. The size of is 16. The network contains two fully connected layers with 64 units each and an ELU nonlinearity between them. The output of this network is then divided by 10 and added to a nominal control action of . This scaling and translation is done to bias the policy to select grasping locations near the object to speed up learning early in the optimization process. Again, as in Section IV-B, we initialized the TDPG solution to the PG solution at the start of training and learned policies with values of for 30 epochs.
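As a rough check on the size of this network, the spatial dimensions after the two convolutional layers can be worked out from the 128 × 128 observation. The sketch below assumes zero padding and unit dilation, neither of which is stated in the text; `conv_out` is a hypothetical helper implementing the standard output-size formula.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 128  # the observation is a 128 x 128 depth image
for kernel, stride in [(6, 2), (4, 2)]:  # the two convolutional layers described above
    side = conv_out(side, kernel, stride)

# Six output channels, flattened before the first fully-connected layer.
flat_features = 6 * side * side
```

Under these assumptions the feature map is 62 × 62 after the first layer and 30 × 30 after the second, so the first fully-connected layer would receive 5400 inputs. A different padding choice would change these numbers.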

In this example, the MINE proved particularly noisy and took longer to converge. To combat this, we increased the number of MINE epochs to 500,000 on the first policy epoch with 100,000 MINE epochs used on following policy epochs. To determine the value of the MINE, we applied an EMA with parameter . No EMA was used for computing MINE gradients. The epoch with the lowest filtered MINE estimate with an expected cost below 0.15 was used for testing. Each epoch took approximately 5 minutes to compute. Rollouts were computed on an Intel 3.7GHz i7-8700K in parallel and optimization was done on a Titan Xp.
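The selection rule described above (smooth the MINE estimates, then pick the feasible epoch with the lowest smoothed value) can be sketched as follows. This is an illustration under assumptions: the EMA convention `state = (1 - alpha) * state + alpha * value`, the value of `alpha`, and the per-epoch numbers are all hypothetical. Only the 0.15 cost limit comes from the text.

```python
def ema(values, alpha):
    """Exponential moving average, initialized at the first value."""
    filtered = []
    state = values[0]
    for v in values:
        state = (1.0 - alpha) * state + alpha * v
        filtered.append(state)
    return filtered


def select_epoch(mine_estimates, expected_costs, alpha, cost_limit=0.15):
    """Index of the epoch with the lowest filtered MINE estimate whose cost is below the limit."""
    filtered = ema(mine_estimates, alpha)
    feasible = [i for i, c in enumerate(expected_costs) if c < cost_limit]
    if not feasible:
        return None  # no epoch satisfies the cost constraint
    return min(feasible, key=lambda i: filtered[i])


# Hypothetical per-epoch MINE estimates and expected costs.
mine_estimates = [2.0, 1.0, 0.2, 0.9, 1.2]
expected_costs = [0.10, 0.12, 0.30, 0.14, 0.11]
best = select_epoch(mine_estimates, expected_costs, alpha=0.5)
```

In this toy data the epoch with the lowest filtered estimate (index 2) violates the cost limit, so the rule falls back to index 3, the best epoch among those that satisfy it.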

Policy Evaluation. The test results for this example are summarized in Table III. Again, all reported statistics are computed using 1000 trials. In the first testing environment, the set of angles the mug is placed at is expanded from to . The expected cost (i.e., grasp failure rate) of the PG policy increased roughly twice as much as that of the TDPG policy in these new testing environments (0.075 versus 0.038). In the second testing environment, the set of angles the object is placed at is again , but the and values were sampled from the larger set of . In this setting, both policies perform comparably poorly. This result supports our hypothesis that the rotation of the mug is largely unimportant for the task of lifting the mug: the TDPG policy was able to generalize to new mug angles, but not to the task-relevant translational coordinates of the mug. Meanwhile, the PG policy exhibits overfitting to task-irrelevant state information, and its performance degrades when that information changes.
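The comparison of cost increases can be reproduced directly from the entries of Table III. The sketch below also checks that the reported standard deviations are consistent with a binary cost over 1000 trials; using the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math

# (mean cost, standard deviation) pairs copied from Table III.
table = {
    "Training": {"PG": (0.107, 0.309), "TDPG": (0.094, 0.292)},
    "Increased Rotation": {"PG": (0.182, 0.386), "TDPG": (0.132, 0.339)},
    "Increased Translation": {"PG": (0.367, 0.482), "TDPG": (0.358, 0.480)},
}

# Increase in failure rate when moving from training to the rotated test set.
pg_increase = table["Increased Rotation"]["PG"][0] - table["Training"]["PG"][0]
tdpg_increase = table["Increased Rotation"]["TDPG"][0] - table["Training"]["TDPG"][0]
ratio = pg_increase / tdpg_increase


def bernoulli_sample_std(p, n):
    """Sample standard deviation of n binary outcomes with empirical mean p."""
    return math.sqrt(p * (1.0 - p) * n / (n - 1))


# Largest gap between a reported Std. and the value implied by its Mean.
max_std_gap = max(
    abs(std - bernoulli_sample_std(mean, 1000))
    for row in table.values()
    for mean, std in row.values()
)
```

The PG failure rate rises by 0.075 and the TDPG failure rate by 0.038, a ratio of just under 2, and every reported standard deviation agrees with the binary-cost value to within rounding.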

V Discussion and Conclusion

We presented a novel reinforcement learning algorithm for computing task-driven control policies for systems equipped with rich sensor observations (e.g., RGB or depth images). The key idea behind our approach is to learn a task-relevant representation that contains as little information as possible about the state of the system while being sufficient for achieving a low cost on the task at hand. Formally, this is achieved by using an information bottleneck criterion that minimizes the mutual information between the state of the system and a set of task-relevant variables (TRVs) used for computing control actions. We parameterize our policies using neural networks and present a novel policy gradient algorithm that leverages the recently-proposed mutual information neural estimator (MINE) for optimizing our objective. We refer to the resulting algorithm as task-driven policy gradient (TDPG).
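To make the role of MINE concrete, the sketch below evaluates the Donsker-Varadhan lower bound on mutual information, which MINE maximizes over a neural network critic. It is a toy illustration rather than the procedure used in the paper: the critic is a fixed hand-picked function instead of a trained network, the variables are scalars, and samples from the product of marginals are approximated by shuffling. All names (`dv_lower_bound`, `pairs`, the critics) are hypothetical.

```python
import math
import random


def dv_lower_bound(joint_pairs, critic, rng):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)].

    Samples from the product of marginals are approximated by shuffling
    the second coordinate of the joint samples.
    """
    xs = [x for x, _ in joint_pairs]
    zs = [z for _, z in joint_pairs]
    shuffled = zs[:]
    rng.shuffle(shuffled)
    joint_term = sum(critic(x, z) for x, z in joint_pairs) / len(joint_pairs)
    marginal_term = sum(math.exp(critic(x, z)) for x, z in zip(xs, shuffled)) / len(xs)
    return joint_term - math.log(marginal_term)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    pairs.append((x, x + 0.1 * rng.gauss(0.0, 1.0)))  # z is a noisy copy of x

# A critic that rewards x and z being close gives a positive bound here;
# a constant critic gives a bound of exactly zero.
informative = dv_lower_bound(pairs, lambda x, z: -(x - z) ** 2, rng)
trivial = dv_lower_bound(pairs, lambda x, z: 0.0, rng)
```

Any critic yields a valid lower bound in expectation, which is why training the critic to maximize the bound tightens the estimate. TDPG then adjusts the policy to reduce this estimated mutual information between the state and the TRVs.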

We compare PG and TDPG policies in three experiments: an adaptation of the canonical lava problem to continuous state spaces, a ball-catching scenario inspired by the gaze heuristic from cognitive psychology, and a depth image-based grasping problem. In the lava example, the TDPG policy exploits the nonlinear dynamics to find a minimal information (open-loop) control policy that is robust to increased sensor noise. In the ball-catching example, the TDPG policy is also more robust to changes in the sensing model at test time than the PG policy. These changes include both random noise corrupting the images and task-irrelevant structural changes, e.g., altering the textures in the robot’s environment. Finally, in the grasping scenario, the TDPG policy generalizes to rotated states of the object not seen during training, on which the PG policy struggles. Together, these scenarios validate that our approach to designing task-driven control policies produces robust policies that can operate in environments unseen during training.

Future Work. There are a number of challenges and exciting directions for further exploration. First, we have observed that MINE often results in noisy estimates of the mutual information and can take many epochs to converge. This results in long training times for TDPG. We also employed an approximation of the gradient of MINE with respect to policy parameters (due to the challenges associated with estimating state distributions at each time step, as described in Section III). These observations motivate the exploration of other methods for minimizing the mutual information, e.g., Stein variational gradient methods [17, 28]. Another direction for future work is to adapt more advanced on-policy methods (e.g., proximal policy optimization (PPO) [35]) to work with our approach, and potentially explore off-policy methods. We expect that these methods will be even more effective at learning task-driven policies. Finally, a particularly exciting direction for future work is to explore the benefits that our approach affords in terms of sim-to-real transfer. The simulation results in this paper suggest that TDPG can be more robust to sensor noise and perturbations to task-irrelevant features. Whether this translates to more robust sim-to-real transfer is a promising question we hope to investigate in the future.


This work is partially supported by the National Science Foundation [IIS-1755038], the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, and the Amazon Research Award. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.


  • [1] A. Achille and S. Soatto (2018) A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems 1, pp. 287–307. Cited by: §I-A.
  • [2] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §I-A.
  • [3] B. Anderson and J. Moore (2007) Optimal control: linear quadratic methods. Courier Corporation. Cited by: §I-A.
  • [4] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, pp. 531–540. Cited by: §I-A, §I, §II, §III-A, §III-A, §III-B, §III.
  • [5] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman (1994) Acting optimally in partially observable stochastic domains. In AAAI, Vol. 94, pp. 1023–1028. Cited by: §IV-A, §IV-A.
  • [6] D. Clevert, T. Unterthiner, and S. Hochreiter (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289. Cited by: §IV-A.
  • [7] E. Coumans and Y. Bai (2018) Pybullet, a python module for physics simulation for games, robotics and machine learning. Cited by: §IV-B.
  • [8] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §II.
  • [9] G. E. Dullerud and F. Paganini (2013) A course in robust control theory: a convex approach. Vol. 36, Springer Science & Business Media. Cited by: §I-A.
  • [10] P. R. Florence (2017) Integrated perception and control at high speed. Master’s Thesis, Massachusetts Institute of Technology. Cited by: §IV-A, §IV-A.
  • [11] G. Gigerenzer (2007) Gut feelings: the intelligence of the unconscious. Penguin. Cited by: §I.
  • [12] A. Goyal, R. Islam, D. Strouse, Z. Ahmed, H. Larochelle, M. Botvinick, S. Levine, and Y. Bengio (2019) Transfer and exploration via the information bottleneck. In Proceedings of the International Conference on Learning Representations, Cited by: §I-A.
  • [13] R. M. Gray (2011) Entropy and information theory. 2nd edition, Springer Science & Business Media. Cited by: §III-A.
  • [14] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §I-A.
  • [15] S. Gupta, D. Fouhey, S. Levine, and J. Malik (2017) Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125. Cited by: §I-A.
  • [16] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel (2016) Backprop KF: learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, pp. 4376–4384. Cited by: §I-A.
  • [17] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pp. 1352–1361. Cited by: §V.
  • [18] R. Jonschkowski, D. Rastogi, and O. Brock (2018-06) Differentiable particle filters: end-to-end learning with algorithmic priors. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. Cited by: §I-A.
  • [19] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1-2), pp. 99–134. Cited by: §IV-A.
  • [20] P. Karkus, D. Hsu, and W. S. Lee (2018) Particle filter networks: end-to-end probabilistic localization from visual observations. arXiv preprint arXiv:1805.08975. Cited by: §I-A.
  • [21] P. Karkus, X. Ma, D. Hsu, L. P. Kaelbling, W. S. Lee, and T. Lozano-Pérez (2019) Differentiable algorithm networks for composable robot learning. arXiv preprint arXiv:1905.11602. Cited by: §I-A.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-A.
  • [23] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §III-B.
  • [24] A. S. Klyubin, D. Polani, and C. L. Nehaniv (2005) Empowerment: a universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, Vol. 1, pp. 128–135. Cited by: §I-A.
  • [25] A. Kolchinsky, B. D. Tracey, and D. H. Wolpert (2019) Nonlinear information bottleneck. Entropy 21 (12), pp. 1181. Cited by: §I-A.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §I-A.
  • [27] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR) 37 (4-5), pp. 421–436. Cited by: §I-A.
  • [28] Q. Liu and D. Wang (2016) Stein variational gradient descent: a general purpose bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386. Cited by: §V.
  • [29] P. McLeod, N. Reed, and Z. Dienes (2003) Psychophysics: how fielders arrive in time to catch the ball. Nature 426 (6964), pp. 244. Cited by: §I.
  • [30] V. Pacelli and A. Majumdar (2019) Task-driven estimation and control via information bottlenecks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Cited by: §I-A, §II, §IV-A, §IV-A.
  • [31] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. arXiv preprint arXiv:1905.06922. Cited by: §III.
  • [32] C. Salge, C. Glackin, and D. Polani (2013) Empowerment and state-dependent noise-an intrinsic motivation for avoiding unpredictable agents. In Proceedings of the Artificial Life Conference, pp. 118–125. Cited by: §I-A.
  • [33] C. Salge, C. Glackin, and D. Polani (2014) Empowerment–an introduction. In Guided Self-Organization: Inception, pp. 67–114. Cited by: §I-A.
  • [34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, pp. 1889–1897. Cited by: §III.
  • [35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III, §V.
  • [36] D. M. Shaffer, S. M. Krauchunas, M. Eddy, and M. K. McBeath (2004) How dogs navigate to catch frisbees. Psychological Science 15 (7), pp. 437–441. Cited by: §I.
  • [37] A. Shapiro, D. Dentcheva, and A. Ruszczyński (2009) Lectures on stochastic programming: modeling and theory. SIAM. Cited by: Theorem II.1.
  • [38] S. Soatto (2011) Steps towards a theory of visual information: active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv preprint arXiv:1110.2053. Cited by: §I-A.
  • [39] S. Soatto (2013) Actionable information in vision. In Machine Learning for Computer Vision, pp. 17–48. Cited by: §I-A.
  • [40] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, and P. Corke (2018) The limits and potentials of deep learning for robotics. The International Journal of Robotics Research (IJRR) 37 (4-5), pp. 405–420. Cited by: §I-A.
  • [41] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §II.
  • [42] S. Tiomkin, D. Polani, and N. Tishby (2017) Control capacity of partially observable dynamic systems in continuous time. arXiv preprint arXiv:1701.04984. Cited by: §I-A.
  • [43] N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, Cited by: §I-A, §I, §II.
  • [44] N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop, pp. 1–5. Cited by: §I-A.
  • [45] P. Yingjun and H. Xinwen (2019) Learning representations in reinforcement learning: an information bottleneck approach. arXiv preprint arXiv:1911.05695. Cited by: §I-A.
  • [46] T. Yu, G. Shevchuk, D. Sadigh, and C. Finn (2019) Unsupervised visuomotor control through distributional planning networks. arXiv preprint arXiv:1902.05542. Cited by: §I-A.
  • [47] R. Zhao, S. Tiomkin, and P. Abbeel (2019) Learning efficient representation for intrinsic motivation. arXiv preprint arXiv:1912.02624. Cited by: §I-A.
  • [48] K. Zhou and J. C. Doyle (1998) Essentials of robust control. Vol. 104, Prentice hall Upper Saddle River, NJ. Cited by: §I-A.
  • [49] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. Cited by: §I-A.