I Introduction
The increasing availability of high-resolution sensors has significantly contributed to the recent explosion of robotics applications. For example, high-precision cameras and LIDARs now allow autonomous vehicles to perform tasks ranging from navigating busy city streets to mapping mines and buildings. By far the most common approach adopted by such systems is to utilize as much sensor information as is available in order to estimate the full state of these complex environments; the resulting state estimates are then used by the robot to choose control actions necessary to complete its task. However, this ubiquitous approach does not distinguish between the task-relevant and task-irrelevant portions of the system’s state. The result is an unnecessarily tight coupling between sensor observations and control actions that is often not robust to noise or uncertainty in irrelevant portions of the state. Moreover, these policies tend to have a large computational burden. This computational requirement can manifest either as a state estimator that needs to process large amounts of data in real time, or as a policy learned from sampling a large number of diverse operating environments. The goal of this paper is to address these challenges by synthesizing control policies that only depend on task-driven representations of the robot’s state. Doing so reduces the coupling between the states and control actions, improves robustness, and reduces computational requirements.

To illustrate the advantages of a task-driven policy, consider a ball-catching example. An agent equipped with a high-resolution camera is tasked with catching a ball (Figure 1). Using its sensor, the agent can attempt to estimate the full state of the system (e.g., ball velocity, ego-motion, wind speed, etc.), feed this information into a physical model for the ball’s flight, integrate the model to find where the ball will land, and move to this location to catch it. This approach requires estimating every parameter involved in the ball’s motion and can easily be compromised when the ball is far away and difficult to see or when there are visual artifacts such as glare.
Instead, extensive cognitive psychology experiments [11, 29, 36] have demonstrated that humans employ a task-driven strategy known as the gaze heuristic to catch projectiles. This strategy simply modulates the human’s speed in order to fix the position of the ball in their visual field. This policy, which only requires minimal sensor information and internal computation, naturally drives the human to the ball’s landing position. The gaze heuristic highlights that task-driven control policies are often robust control policies that only depend on small amounts of salient state information. Specifically, the heuristic is naturally robust to distributional shifts in irrelevant portions of the environment (e.g., the visual backdrop) while also being adaptable to perturbations like a gust of wind altering the ball’s course. Traditionally, task-driven policies have required hand-engineering for each control problem. This process can be difficult and time-consuming. The goal of this paper is to propose a reinforcement learning framework that automatically synthesizes task-driven control policies for systems with nonlinear dynamics and high-dimensional observations (e.g., RGB or depth images).

Statement of Contributions. The main technical contribution of this paper is to formulate a reinforcement learning algorithm that synthesizes task-driven control policies. This synthesis is achieved by creating an information bottleneck [43] that limits how much state information the policy is allowed to use. We present a reinforcement learning algorithm, referred to as task-driven policy gradient (TDPG), that leverages the recently proposed mutual information neural estimator [4] to tractably search for an effective task-driven policy. Finally, we demonstrate that this formulation provides the key advantage of a task-driven approach: robustness to perturbations in task-irrelevant state and sensor variables. This benefit is demonstrated in three examples featuring nonlinear dynamics and high-dimensional sensor models: an adaptation of the lava problem from the literature on partially observable Markov decision processes (POMDPs), ball catching using RGB images, and grasping an object using depth images.
I-A Related Work
State Estimation and Differentiable Filtering. Classical approaches to controlling robotic systems typically involve two distinct pipelines: one that uses the robot’s sensors to estimate its state and another that uses this state estimate to choose control actions. Such an architecture is motivated by the separation principle [3] from (linear) control theory and allows one to leverage powerful control-theoretic techniques for robust estimation and control [9, 48]. A recent line of work on differentiable filtering [16, 18, 20] has extended this traditional pipeline to elegantly handle rich sensor observations (e.g., images) by learning state estimators in an end-to-end manner via deep learning. However, as the gaze heuristic example from Section I demonstrates, full state representations are often overly rich when viewed from the perspective of the task at hand. This is particularly true in settings where representing the full state requires capturing the state of the robot’s environment. Instead, our goal is to learn minimalistic task-driven representations that are sufficient for control. Such representations can be highly compact as compared to full state representations. Moreover, carefully constructed task-driven representations have the potential to be robust to sensor noise and changes to irrelevant portions of the robot’s environment. Intuitively, this is because uncertainty or noise in irrelevant portions of the sensor observations is filtered out and thus no longer corrupts the robot’s actions. We present a theoretical result in Section II along with simulation experiments in Section IV to support this intuition.

End-to-end Learning of Policies. Deep reinforcement learning approaches have the ability to learn control policies in an end-to-end manner [14, 15, 21, 26, 27, 40, 49]. Such end-to-end approaches learn to create representations that are tuned to the task at hand by exploiting statistical regularities in the robot’s observations, dynamics, and environment. However, these methods do not explicitly attempt to learn representations that are task-driven (i.e., representations that filter out portions of the sensor observations that are irrelevant to the task). As a result, policies trained via standard deep RL techniques may be sensitive to changes in irrelevant portions of the robot’s environment (e.g., changes to the background color in a ball-catching task). In contrast, the deep RL-based approach we present seeks to explicitly learn task-driven representations that filter out irrelevant factors. Our simulation results in Section IV empirically demonstrate that our approach is robust to such distributional shifts.
Information Bottlenecks. Originally developed in the information theory literature, information bottlenecks [43, 2, 25, 44] allow one to formalize the notion of a “minimal-information representation” that is sufficient for a given task (e.g., a prediction task in the context of supervised learning). Given an input random variable $X$ and a target random variable $Y$, one seeks a representation $Z$ that forms the Markov chain $Y \to X \to Z$. The representation is chosen to minimize the mutual information between $X$ and $Z$ while still maintaining enough information to predict $Y$ from $Z$. Recent work has sought to adapt this theory for synthesizing task-driven representations for control [30, 1]. In [30], a model-based approach for automatically synthesizing task-driven representations via information bottlenecks is presented. However, this approach is limited to settings with an explicit model of the robot’s sensor and dynamics. This prevents the approach from being applied to systems with rich sensing modalities (e.g., RGB or depth images), for which one cannot assume a model. In contrast, the approach we present here learns to create a task-driven representation via reinforcement learning and is directly applicable to settings with rich sensing. In [1], the authors use information bottlenecks to define minimal state representations for control tasks involving high-dimensional sensor observations (this approach is also related to the notion of actionable information [38, 39] in vision; see [1] for a discussion). However, [1] does not present concrete algorithms for learning task-driven representations. In contrast, we present a policy gradient algorithm based on mutual information neural estimation (MINE) [4]. Moreover, we note that our definition of a task-driven policy differs from that of [1]; the representations we learn seek to create a bottleneck between sensor observations and control actions as opposed to finding a minimal representation for predicting costs.

Exploration in RL via Mutual Information Regularization. The information bottleneck principle has also been used to improve sample efficiency in RL by encouraging exploration [12, 45]. More generally, there is a line of work on endowing RL agents with intrinsic motivation by maximizing the mutual information between the agent’s control actions and states [24, 32, 33, 42, 46, 47]. Our focus instead is on improving robustness and generalization of policies; the information-bottleneck-based approach we present is aimed at learning task-driven representations for control. This is achieved by minimizing the mutual information between the agent’s states and learned task-relevant variables.
II Defining Task-Driven Policies
In this section, we formalize our notion of a task-driven control policy. Our definition is in terms of a reinforcement learning problem whose solution produces a set of task-relevant variables (TRVs) and a policy that depends only on these variables. We begin by formulating the problem as a finite-horizon partially observable Markov decision process (POMDP) [41]. The robot’s states, control actions, and sensor observations at time $t$ are denoted by $x_t$, $u_t$, and $y_t$ respectively. The robot dynamics and sensor model are denoted by the conditional distributions $p(x_{t+1} \mid x_t, u_t)$ and $\sigma(y_t \mid x_t)$ respectively. We do not assume knowledge of either of these distributions. The functions $c_t(x_t, u_t)$ describe the cost at each time step $t < T$, with a terminal cost specified by $c_T(x_T)$. Our goal is to find a policy $\pi$ that solves

$$\min_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(x_t, u_t) + c_T(x_T)\right] \tag{1}$$

when run online in a test environment. Throughout this paper, we use $\mathbb{E}$ to denote the expectation; it is subscripted with a distribution when necessary for clarity.

To achieve our goal of learning a task-driven policy, we choose the policy to have the recurrent structure illustrated in Figure 1. We refer to $\tilde{x}_t$ as the task-relevant variables (TRVs) and search for two conditional distributions: $q(\tilde{x}_t \mid \tilde{x}_{t-1}, y_t)$ and $\pi(u_t \mid \tilde{x}_t)$. The first specifies a (stochastic) mapping from the previous TRVs and the current observation to the current TRVs. The second conditional distribution computes the control actions given the current TRVs. These will be parameterized by neural networks in the following sections. The overall policy is thus the composition of these two distributions.

Our goal is to learn TRVs that form a compressed representation of the state that is sufficient for the purpose of control. To formalize this, we leverage the theory of information bottlenecks [43] to limit how much information $\tilde{x}_t$ contains about $x_t$. This is quantified using the mutual information

$$\mathbb{I}(x_t; \tilde{x}_t) = \mathbb{D}\big(p(x_t, \tilde{x}_t)\,\big\|\,p(x_t)\,p(\tilde{x}_t)\big) \tag{2}$$

between the state and TRVs at each time step. Here, $\mathbb{D}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler (KL) divergence [8]. Intuitively, minimizing the mutual information corresponds to learning TRVs that are as independent of the state as possible. Thus, the TRVs “filter out” irrelevant information from the state. Therefore, we would like to find a policy that solves:

$$\min_{q,\,\pi}\ J \triangleq \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(x_t, u_t) + c_T(x_T)\right] + \frac{1}{\beta}\sum_{t} \mathbb{I}(x_t; \tilde{x}_t). \tag{3}$$

Here, $\beta$ can be viewed as a Lagrange multiplier, and the above problem can be interpreted as minimizing the total information contained in the TRVs $\tilde{x}_t$ subject to an upper bound on the expected cost over the time horizon. In Section III, we develop a reinforcement learning algorithm for tackling Eq. 3. We refer to the resulting policies as task-driven policies. Note that the mutual information is invariant under bijective transforms of the random variables for which it is computed [4, 8]. As a result, the optimum value of $J$ remains unchanged for equivalent representations of the robot and its environment.
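To make the role of the mutual information concrete, the following minimal sketch evaluates it as a KL divergence between a joint distribution and the product of its marginals, for small discrete tables. The tables and the names `mutual_information`, `independent`, `copy`, and `noisy` are illustrative assumptions; the paper itself works with continuous states and TRVs.

```python
import math

def mutual_information(joint):
    """I(X; Z) in nats for a joint probability table joint[i][j] = p(x_i, z_j)."""
    px = [sum(row) for row in joint]            # marginal of X
    pz = [sum(col) for col in zip(*joint)]      # marginal of Z
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxz in enumerate(row):
            if pxz > 0.0:                       # 0 * log(0) is taken to be 0
                mi += pxz * math.log(pxz / (px[i] * pz[j]))
    return mi

# Hypothetical joint tables over a binary "state" X and a binary "TRV" Z.
independent = [[0.25, 0.25], [0.25, 0.25]]      # Z carries no information about X
copy = [[0.5, 0.0], [0.0, 0.5]]                 # Z is an exact copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]                # Z is a noisy copy of X

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
mi_noisy = mutual_information(noisy)
```

In this toy setting, a representation that is independent of the state has zero mutual information, an exact copy attains the entropy of the state (log 2 nats), and a noisy copy lies in between, which is the sense in which minimizing the mutual information pushes the TRVs to discard state information.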
Similar to the gaze heuristic (see Section I), we expect that a task-driven policy as defined above will be a robust policy. This benefit is conferred to our policy by minimizing the mutual information between states and TRVs. Intuitively, a policy is closer to being open-loop when less state information is present in the TRVs. The more open-loop the policy is, the less it is impacted by changes in the state or sensor distributions between environments. In [30], this intuition is formalized using the theory of risk metrics with the following theorem:
Theorem II.1.
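Define the entropic risk metric [37, Example 6.20] of a random cost $C$ with parameter $\beta > 0$:

$$\rho_\beta(C) \triangleq \frac{1}{\beta}\log \mathbb{E}\left[\exp(\beta C)\right]. \tag{4}$$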
Let be any distribution satisfying:
(5)  
Then, the online expected cost is bounded by a combination of the entropic risk and mutual information:
(6) 
The entropic risk is a functional that is similar to the expectation, but also accounts for higher moments of the distribution, e.g., its variance. The parameter $\beta$ controls how much the metric weights the expected cost versus the higher moments, and the entropic risk approaches the expected cost as $\beta \to 0$. Optimizing the bound in Eq. 6 directly can be difficult because of the quantities its gradient requires at each time step. Therefore, we optimize the objective in Eq. 3, a first-order approximation of Eq. 6. In Section IV, we demonstrate in multiple examples that minimizing our objective produces a robust policy that generalizes beyond its training environment.

III Learning Task-Driven Policies
This section discusses finding a policy that approximately solves (3) within a reinforcement learning (RL) framework. If the mutual information term is removed from the objective, standard RL techniques such as policy gradient (PG) (e.g., [34, 35]) would be sufficient. However, the mutual information and its gradient are known to be difficult quantities to estimate. A number of tractable upper and lower bounds have been proposed recently to provide means for optimizing objectives containing a mutual information term [31]. We elected to use the recently proposed mutual information neural estimator (MINE) [4] due to its accuracy and ease of implementation. We then derive a PG-style algorithm in Section III-B.
III-A Mutual Information Neural Estimator (MINE)
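The MINE is based on the Donsker-Varadhan (DV) variational representation of the KL divergence [13, Theorem 2.3.2]. For any two distributions $P$ and $Q$ defined on a sample space $\Omega$, the KL divergence between them can be expressed as:

$$\mathbb{D}(P\,\|\,Q) = \sup_{f:\,\Omega \to \mathbb{R}}\ \mathbb{E}_{P}\left[f\right] - \log \mathbb{E}_{Q}\left[e^{f}\right]. \tag{7}$$

The supremum is taken over all functions $f$ such that the two expectations are finite. Let $\hat{\mathbb{E}}^{(n)}_{P}$ denote the empirical expectation computed with $n$ i.i.d. samples from $P$. We can estimate the KL divergence by replacing the function class over which the supremum is taken by a family of neural networks $f_\psi$ parameterized by $\psi$:

$$\hat{\mathbb{D}}_n(P\,\|\,Q) = \sup_{\psi}\ \hat{\mathbb{E}}^{(n)}_{P}\left[f_\psi\right] - \log \hat{\mathbb{E}}^{(n)}_{Q}\left[e^{f_\psi}\right]. \tag{8}$$

This approximation is a strongly consistent underestimate of Eq. 7 (see [4]). In short, this means that through choice of appropriate network structure and sample size, $\hat{\mathbb{D}}_n(P\,\|\,Q)$ can approximate $\mathbb{D}(P\,\|\,Q)$ arbitrarily closely.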
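The Donsker-Varadhan estimate can be illustrated on a toy problem where the answer is known in closed form. The sketch below is an assumption-laden stand-in, not the estimator used in the paper: it takes $P = \mathcal{N}(\mu, 1)$ and $Q = \mathcal{N}(0, 1)$, whose KL divergence is $\mu^2/2$, and replaces the neural network critic with a one-parameter linear critic $f(x) = a x$, which happens to contain the optimal critic for this pair up to an additive constant. The names `dv_objective` and `dv_gradient`, the step size, and the sample size are all hypothetical choices.

```python
import math
import random

random.seed(0)
mu = 1.0                      # P = N(mu, 1), Q = N(0, 1); true KL = mu**2 / 2
n = 5000
p_samples = [random.gauss(mu, 1.0) for _ in range(n)]
q_samples = [random.gauss(0.0, 1.0) for _ in range(n)]

def dv_objective(a):
    """Empirical DV objective for the linear critic f(x) = a * x."""
    mean_p = sum(a * x for x in p_samples) / n
    mean_exp_q = sum(math.exp(a * x) for x in q_samples) / n
    return mean_p - math.log(mean_exp_q)

def dv_gradient(a):
    """Derivative of the empirical DV objective with respect to a."""
    mean_p = sum(p_samples) / n
    weights = [math.exp(a * x) for x in q_samples]
    # Ratio of E_Q[exp(f) * df/da] to E_Q[exp(f)], with df/da = x.
    weighted = sum(w * x for w, x in zip(weights, q_samples)) / sum(weights)
    return mean_p - weighted

a = 0.0
for _ in range(200):          # plain gradient ascent on the critic parameter
    a += 0.1 * dv_gradient(a)

kl_estimate = dv_objective(a)
kl_true = mu ** 2 / 2
```

With these settings the learned slope `a` should land close to `mu` and `kl_estimate` close to `kl_true`; a richer critic family (such as a neural network) is needed when the optimal critic is not linear.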
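Since the mutual information is a KL divergence, we can approximate Eq. 2 using the above KL approximation. For a state $x_t$, TRVs $\tilde{x}_t$, and a neural network $f_\psi$ with parameters $\psi$, this gives:

$$\hat{\mathbb{I}}(x_t; \tilde{x}_t) = \sup_{\psi}\ \hat{\mathbb{E}}\left[f_\psi(x_t, \tilde{x}_t)\right] - \log \hat{\mathbb{E}}_{p(x_t)p(\tilde{x}_t)}\left[e^{f_\psi(x_t, \tilde{x}_t)}\right]. \tag{9}$$

This is the MINE. The algorithm for computing the MINE estimate is presented in Algorithm 1. At a high level, the algorithm attempts to find neural network parameters $\psi$ in order to maximize the objective in (8) via stochastic gradient descent. We subscript the expectation taken using the product of the marginals, $p(x_t)\,p(\tilde{x}_t)$. The expectation computed using the joint distribution $p(x_t, \tilde{x}_t)$ is unsubscripted. These expectations are approximated by sampling two minibatches of the same size uniformly from the training batch. The minibatches are denoted in Algorithm 1 by separate sets of indices. The gradient of the objective with respect to $\psi$ is

$$\hat{\mathbb{E}}\left[\nabla_\psi f_\psi\right] - \frac{\hat{\mathbb{E}}_{p(x_t)p(\tilde{x}_t)}\left[e^{f_\psi}\,\nabla_\psi f_\psi\right]}{\hat{\mathbb{E}}_{p(x_t)p(\tilde{x}_t)}\left[e^{f_\psi}\right]}. \tag{10}$$

This gradient, which is computed using the minibatches, is used to update $\psi$ with stochastic gradient descent (or a similar optimizer like Adam [22]). For additional details on training the MINE, see [4].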
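The two expectations in the MINE can be sketched with paired samples alone: the joint expectation uses the pairs as drawn, and samples from the product of the marginals can be approximated by breaking the pairing. The example below is a hedged illustration rather than Algorithm 1. It uses a correlated Gaussian pair with correlation `rho`, for which the mutual information is $-\tfrac{1}{2}\log(1-\rho^2)$, and it evaluates the bound with the analytically optimal critic instead of a trained network. Shuffling to obtain marginal samples, and the names `critic` and `mine_estimate`, are assumptions made for this sketch.

```python
import math
import random

random.seed(1)
rho = 0.6                                   # correlation between x and z
n = 20000
pairs = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    z = rho * x + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    pairs.append((x, z))

def critic(x, z):
    """Log density ratio log p(x, z) / (p(x) p(z)) for this Gaussian pair."""
    quad = rho ** 2 * x * x - 2.0 * rho * x * z + rho ** 2 * z * z
    return -0.5 * math.log(1.0 - rho ** 2) - quad / (2.0 * (1.0 - rho ** 2))

def mine_estimate(pairs, critic):
    """Evaluate E_joint[f] - log E_marginals[exp(f)] for a fixed critic f."""
    xs = [x for x, _ in pairs]
    zs = [z for _, z in pairs]
    shuffled = zs[:]
    random.shuffle(shuffled)                # break the pairing: approx. p(x) p(z)
    joint_term = sum(critic(x, z) for x, z in pairs) / len(pairs)
    marg_term = sum(math.exp(critic(x, z)) for x, z in zip(xs, shuffled)) / len(pairs)
    return joint_term - math.log(marg_term)

mi_estimate = mine_estimate(pairs, critic)
mi_true = -0.5 * math.log(1.0 - rho ** 2)
```

In practice the critic is unknown and must be learned by ascending the gradient in (10); any suboptimal critic yields a smaller value of the same expression.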
III-B Task-Driven Policy Gradient Algorithm
We now describe our procedure for leveraging MINE within a policy gradient (PG) algorithm for tackling (3). As described in Section II, the policy is parameterized by two neural networks: a recurrent network $q_\phi$ that outputs the TRVs $\tilde{x}_t$ and a feedforward network $\pi_\theta$ that outputs the actual control action $u_t$. Here $\phi$ and $\theta$ represent the parameters for each network. In our implementation, both networks output the mean and diagonal covariance of multivariate Gaussian distributions.

Computing the gradient of our objective Eq. 3 is difficult due to the presence of the mutual information, so the MINE is used to approximate this term. (Footnote 1: We note that, in order to employ the bound Eq. 6, we need to employ an overestimate of the mutual information. In Section IV, we demonstrate that in practice, minimizing the MINE is a practical approximation that produces robust policies.) This approximation allows us to derive a learning algorithm similar to policy gradient. Writing $C$ for the total cost of a rollout and using the well-known identity

$$\nabla_\theta\, \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta, \tag{11}$$

the gradient of the expected cost with respect to $\theta$ is given by

$$\nabla_\theta\, \mathbb{E}\left[C\right] = \mathbb{E}\Big[\, C \sum_{t} \nabla_\theta \log \pi_\theta(u_t \mid \tilde{x}_t) \Big]. \tag{12}$$

Repeating this process for $\phi$ yields an analogous formula with $q_\phi$ replacing $\pi_\theta$:

$$\nabla_\phi\, \mathbb{E}\left[C\right] = \mathbb{E}\Big[\, C \sum_{t} \nabla_\phi \log q_\phi(\tilde{x}_t \mid \tilde{x}_{t-1}, y_t) \Big]. \tag{13}$$
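The log-derivative identity behind these gradient formulas can be checked numerically on a one-step problem. The following sketch is illustrative only: it assumes a scalar Gaussian "policy" $u \sim \mathcal{N}(\theta, 1)$ and a hypothetical quadratic cost, so that the exact gradient of the expected cost is available for comparison. The function names and sample count are assumptions.

```python
import random

random.seed(2)

def cost(u):
    """Hypothetical quadratic cost with its minimum at u = 3."""
    return (u - 3.0) ** 2

def score_function_gradient(theta, n_samples):
    """Monte Carlo estimate of d/d(theta) E[cost(u)] for u ~ N(theta, 1)."""
    total = 0.0
    for _ in range(n_samples):
        u = random.gauss(theta, 1.0)
        grad_log_prob = u - theta        # d/d(theta) log N(u; theta, 1)
        total += cost(u) * grad_log_prob
    return total / n_samples

theta = 0.0
grad_estimate = score_function_gradient(theta, 200000)
grad_exact = 2.0 * (theta - 3.0)         # since E[(u - 3)^2] = (theta - 3)^2 + 1
```

The estimator only needs samples of the cost and the gradient of the log-probability of the sampled action, which is why unknown dynamics and sensor models do not obstruct it. Its variance is high, so practical implementations typically use many rollouts or a baseline.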
Finally, it remains to calculate the gradient of the mutual information term with respect to the policy parameters. Though it is more tractable to optimize the MINE instead of the true mutual information, computing the gradient of the MINE with respect to the policy parameters is not entirely straightforward. The complexity lies in the fact that the gradient of the MINE with respect to these parameters depends on knowing or estimating the marginal distributions of the state $x_t$ and the TRVs $\tilde{x}_t$, both of which depend on the policy parameters. To approximate the MINE gradient we fix the MINE network parameters to the converged values $\psi^\star$ from Algorithm 1, which yields:

$$\hat{\mathbb{I}}(x_t; \tilde{x}_t) \approx \hat{\mathbb{E}}\left[f_{\psi^\star}(x_t, \tilde{x}_t)\right] - \log \hat{\mathbb{E}}_{p(x_t)p(\tilde{x}_t)}\left[e^{f_{\psi^\star}(x_t, \tilde{x}_t)}\right], \tag{14}$$

where $f_{\psi^\star}$ is the MINE network and the unsubscripted expectation is taken over the joint distribution of $x_t$ and $\tilde{x}_t$. Since the neural networks used to parameterize the policy produce Gaussian distributions, we can represent $\tilde{x}_t$ as

$$\tilde{x}_t = \mu_t + \Sigma_t^{1/2}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I). \tag{15}$$

This is similar to the reparameterization trick for variational autoencoders [23]. Here, $\mu_t$ and $\Sigma_t$ are computed using the mean and covariance output from the TRV network. Since the dynamics and sensor model are unknown, additional independence approximations are made so that the remaining rollout quantities can be held fixed. With these fixed, the gradient of Eq. 14 with respect to the TRV network parameters can be computed by storing $\epsilon_t$ as a part of each rollout and backpropagating using Eq. 15.

The task-driven policy gradient (TDPG) algorithm is outlined in Algorithm 2. Our learning process involves training a set of MINE networks, each with its own parameters. For clarity in future sections, we refer to an iteration of the outer optimization loop as a policy epoch and an iteration of the optimization loop in Algorithm 1 as a MINE epoch. During each policy epoch, we roll out trajectories using the current policy parameters. We then update each set of MINE parameters using Algorithm 1. Once the MINE networks converge, they are used to approximate and optimize the policy. This is repeated until the empirical estimate of the objective in (3) converges.

Implementation Details. It remains to specify how to select the Lagrange multiplier $\beta$ in (3) in a principled manner. Returning to the perspective described in Section II, we treat the objective in (3) as the Lagrangian for minimizing the information shared between the state and the TRVs subject to an upper bound on the maximum expected cost the policy is allowed. Then, we sweep through a set of values for $\beta$ and select the policy from the epoch with the lowest MINE estimate that also satisfies the specified limit on the empirical expected cost. This strategy produces the policy estimated to have the least state information present in the TRVs while satisfying our performance constraint.
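The selection rule described above amounts to a constrained minimum over logged training epochs. A minimal sketch follows, assuming each candidate is stored as a record with hypothetical keys `beta`, `epoch`, `expected_cost`, and `mine_estimate`; the numbers are made up for illustration and treating the limit as inclusive is also an assumption.

```python
def select_policy(records, cost_limit):
    """Return the record with the lowest MINE estimate among those meeting the cost limit.

    Each record is a dict with hypothetical keys: 'beta', 'epoch',
    'expected_cost' (empirical), and 'mine_estimate'.
    Returns None when no record satisfies the cost limit.
    """
    feasible = [r for r in records if r["expected_cost"] <= cost_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r["mine_estimate"])

# Made-up numbers for illustration only.
records = [
    {"beta": 0.1, "epoch": 40, "expected_cost": 55.0, "mine_estimate": 0.05},
    {"beta": 1.0, "epoch": 120, "expected_cost": 38.0, "mine_estimate": 0.30},
    {"beta": 1.0, "epoch": 250, "expected_cost": 36.5, "mine_estimate": 0.22},
    {"beta": 10.0, "epoch": 300, "expected_cost": 31.0, "mine_estimate": 0.90},
]
best = select_policy(records, cost_limit=40.0)
```

Here the first record has the lowest MINE estimate but violates the cost limit, so the rule returns the epoch-250 record instead of the lowest-cost one.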
It is likely that each MINE network is initialized to a poor estimate of the mutual information. In order to improve the initial estimate of the mutual information and its gradient, additional MINE epochs are used during the first policy epoch. In the following section, we will specify the number of additional epochs used. Moreover, as discussed in [4], using minibatches to estimate the MINE gradient in Eq. 10 leads to a biased estimate of the gradient. Replacing the denominator in Eq. 10 with the exponential moving average (EMA) of its value compensates for this bias. (Footnote 2: The EMA is a filter defined on a sequence by a recursive relationship.) This technique is used in some examples in the following section. We also limit the policy networks to output Gaussian distributions with diagonal covariances.
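An exponential moving average of the kind mentioned here can be sketched in a few lines. The convention below, in which `alpha` weights the newest value, is an assumption made for illustration, since the source does not preserve the exact recursion; the sequence of denominators is made up.

```python
def ema_update(previous, value, alpha):
    """One EMA step (assumed convention: alpha weights the newest value)."""
    return (1.0 - alpha) * previous + alpha * value

def ema_filter(values, alpha):
    """Filter a sequence, starting the running average at the first value."""
    filtered = []
    average = values[0]
    for value in values:
        average = ema_update(average, value, alpha)
        filtered.append(average)
    return filtered

# Noisy minibatch estimates of the denominator E[exp(f)] (made-up numbers).
denominators = [1.0, 3.0, 0.5, 2.5, 1.5, 0.8, 2.2]
smoothed = ema_filter(denominators, alpha=0.1)
```

Using the smoothed value in place of the per-minibatch denominator makes the gradient estimate depend less on any single minibatch, which is the intent of the bias correction discussed in [4].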
IV Examples
In this section, we apply the algorithm described in Section III to three problems: (i) a continuous state and action version of the “Lava Problem” from the POMDP literature, (ii) a vision-based ball-catching example, and (iii) a grasping problem with depth-image observations. For each of these problems, we present thorough simulation results demonstrating that the task-driven policy is robust to distributional shifts in the sensor model and testing environment. We compare our method against a policy with the same parameterization as ours, but trained using a standard policy gradient method in order to minimize the expected cost associated with the problem.
IV-A Lava Problem
The first problem we consider is a continuous state and action version of the Lava problem (Figure 2) [5, 10, 19, 30], which is a common example for evaluating robust solutions to POMDPs. This scenario involves a robot navigating to a goal location along a line segment between a wall and a lava pit. The robot is modeled as a time-discretized double integrator whose state consists of its displacement from the wall (in meters) and its velocity (in meters per second). The goal is to navigate to a specified goal state within a fixed time horizon. However, the robot’s position is limited to a bounded interval. If the robot collides with the wall, then its velocity is reset as well. If the robot’s position exceeds a threshold distance from the wall, then the robot falls into hot lava, where it is unable to move any further. At training time, the robot is provided with a high-quality estimate of its state. This is modeled by a Gaussian sensor model with a small amount of noise. The cost function for this problem consists of a running cost at each time step and a terminal cost. The robot is initialized with its position uniformly distributed between 0 and 5 meters.
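The transition rules described above can be sketched as a single step function. Every numeric value below (time step, wall position, lava threshold) is a hypothetical placeholder, as is the choice to zero the velocity on a wall collision and the use of acceleration as the control input; the sketch only illustrates the structure of a double integrator with a wall on one side and an absorbing lava region on the other.

```python
from dataclasses import dataclass

# All numeric values below are hypothetical placeholders.
DT = 0.1            # time step (s)
WALL = 0.0          # wall position (m)
LAVA_START = 5.0    # positions beyond this are in the lava (m)

@dataclass
class LavaState:
    position: float   # displacement from the wall (m)
    velocity: float   # velocity (m/s)
    in_lava: bool = False

def step(state, acceleration):
    """Advance the double integrator by one time step."""
    if state.in_lava:
        return state                      # the robot cannot move once in the lava
    position = state.position + DT * state.velocity
    velocity = state.velocity + DT * acceleration
    if position <= WALL:                  # collision with the wall
        position, velocity = WALL, 0.0    # assumed: velocity reset to zero
    in_lava = position > LAVA_START
    if in_lava:
        velocity = 0.0
    return LavaState(position, velocity, in_lava)

# Push left for 50 steps: the robot ends up resting against the wall.
at_wall = LavaState(1.0, 0.0)
for _ in range(50):
    at_wall = step(at_wall, -1.0)

# Drift right past the threshold: the robot is absorbed by the lava.
sunk = step(LavaState(4.9, 2.0), 0.0)
stuck = step(sunk, -5.0)
```

The wall acts as a localizing landmark: after pushing left for long enough the position is known regardless of where the robot started, which is what makes an open-loop strategy viable in this problem.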
Training Summary. Both policy networks have two hidden layers with 64 units and use exponential linear unit (ELU) nonlinearities [6]. We choose a two-dimensional space of TRVs. In this example, we found the performance of all learned policies improved when the parameters of both networks were allowed to be time-varying. The MINE networks used in this example contain two hidden layers with 32 units each and ELU nonlinearities. A batch of 500 rollouts is used for each epoch, a minibatch size of 50 is used for each MINE epoch, and an EMA with parameter 5e-5 was used to compute the MINE gradient. The learning rates in this problem are 8e-4 for the policy networks and 5e-5 for the MINE network. All policies were trained for 300 epochs, with a computation time of about 10 seconds per epoch (about 50 minutes total for each policy). For this example, all computation was carried out on an Intel i9-7940X. Policies were trained for several values of the Lagrange multiplier in (3), and the upper limit on the expected cost for selecting the policy was 40.
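Table I: Lava problem cost statistics.

| Scenario | Policy Gradient: Mean | Policy Gradient: Std. | Task-Driven PG: Mean | Task-Driven PG: Std. |
|---|---|---|---|---|
| Training | 31.04 | 7.191 | 36.18 | 23.22 |
| Sensor Noise 1e-3 | 31.58 | 8.632 | 36.55 | 25.02 |
| Sensor Noise 1e-2 | 35.69 | 15.68 | 36.00 | 22.32 |
| Sensor Noise 1e-1 | 66.29 | 36.48 | 36.20 | 21.85 |
| Sensor Noise 1e0 | 172.40 | 88.08 | 37.39 | 25.60 |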
Policy Evaluation. We compare the resulting policy with one trained to minimize only the expected cost using policy gradient. The two policies are compared in Figure 3, and the statistics of their performance in different environments (corresponding to different levels of sensor noise) are presented in Table I. Due to the presence of low sensor noise in the training environment, the PG policy finds it optimal to drive the robot directly towards the goal state. Interestingly, the TDPG algorithm finds a qualitatively different strategy. In particular, TDPG recovers the robust open-loop behavior described in [5, 10, 30]: regardless of initial position, the robot moves left until it collides with the wall, then moves right to the goal state. As the sensor noise is increased, the performance of the PG policy degrades rapidly. For example, if the sensor reports the robot is to the left of the goal when it is really to the right of the goal, it is likely to fall in the lava. In contrast, the TDPG policy is virtually unaffected by the increased sensor noise.
IV-B Vision-Based Ball Catching
Next, we consider a ball-catching example inspired by the gaze heuristic discussed in Section I (see Figure 1). We formalize this problem by considering a ball confined to a vertical plane. The robot is confined to the horizontal axis and must navigate to the ball’s position along that axis. The state of the system in this example includes the robot’s displacement along the horizontal axis, the ball’s velocity along the horizontal axis, and the ball’s velocity along the vertical axis (i.e., the ball’s vertical velocity). The dynamics are given by:
(16) 
Here, the constants are the gravitational acceleration and the time step used to discretize the dynamics of the system in time. The robot’s initial position is uniformly distributed over a fixed interval. The ball is launched from a fixed position with fixed initial velocities. These initial conditions are chosen such that the ball always spends a fixed number of time steps above the horizontal axis. The time horizon is chosen such that its final step is the last time step the ball remains above the horizontal axis. The cost function that we are trying to minimize encourages the robot to be very close to the ball when it lands at the end of the time horizon. The sensor in this scenario is a camera mounted above the robot. This camera provides RGB images with values scaled between 0 and 1. We also placed a wall with a red brick texture centered at 10 m along one coordinate axis. All simulations are carried out using PyBullet [7]. A sample observation from the camera is presented in Figure 1, and a video depicting both policies operating in training and testing environments for this scenario is available here: https://youtu.be/Mwv0kkRveas.
Training Summary. In this example, the TRV network is parameterized by 2 convolutional layers with 6 output channels, kernel size of 4, and stride of length 2, followed by two fully-connected layers using 32 units each. An ELU nonlinearity is applied between convolutional layers. After each fully-connected layer, a different nonlinearity was used instead of ELU to prevent the values of the TRVs from growing unbounded. The dimension of the TRVs is 8. The action network contains a single linear layer. The MINE networks used in this example contain two hidden layers with 64 units each and ELU nonlinearities. A batch of 200 rollouts is used for each epoch, and a minibatch size of 20 is used to train the MINE network. The learning rates in this problem are 1e-3 for the policy networks and 5e-5 for the MINE network. When training the TDPG policy, it was initialized with the PG solution and trained for 100 policy epochs. Each policy epoch contained 100 MINE epochs, with 100,000 MINE epochs used on the first policy epoch, and an EMA with parameter 5e-5 was used to compute the MINE gradient. Policies for several values of the Lagrange multiplier in (3) were evaluated, with the upper limit on the expected cost placed at 24. Rollouts were computed on a 3.7 GHz i7-8700K CPU while all optimization was done using an Nvidia Titan Xp. Each policy epoch took about 45 seconds to compute.

Policy Evaluation. We consider two different kinds of testing environments. All statistics reported for test environments were computed over multiple rollouts. In the first group of examples, we add random noise to each pixel; the noise is sampled from a Gaussian distribution whose scale is varied between experiments. In order to ensure that the observed image is a valid RGB image after adding noise, we normalize the values to lie between 0 and 1. For each experiment, the mean cost and the distance between the robot and the ball along the robot’s axis are presented in Table II. As the level of image noise increases, the performance of the PG policy deteriorates dramatically while the TDPG policy’s performance remains largely unchanged.
In the second set of experiments, we qualitatively change the nature of the environment by changing the texture of the backdrop (i.e., the wall in the background). A representative portion of each texture is presented in Figure 4. The TDPG policy outperforms the PG policy in each of these seven testing environments except background 7. This texture has a similar hue (red) to the texture in the training environment, allowing PG to perform well. The TDPG policy, however, outperforms PG in environments whose textures consist of hues that are different from the ones seen during training. Together with the previous set of experiments, this result suggests that the TDPG policy is more robust to observations that are qualitatively different from those provided to the policy during training.
Table II: Ball-catching cost and robot-to-ball distance.

| Scenario | Policy Gradient: Cost | Policy Gradient: Dist. (m) | Task-Driven PG: Cost | Task-Driven PG: Dist. (m) |
|---|---|---|---|---|
| Training | 8.702 | 0.085 | 18.577 | 0.184 |
| Sensor Noise | 19.12 | 0.189 | 19.48 | 0.193 |
| Sensor Noise | 46.68 | 0.464 | 20.12 | 0.199 |
| Sensor Noise | 124.63 | 1.244 | 21.59 | 0.214 |
| Sensor Noise | 187.53 | 1.874 | 27.11 | 0.274 |
| Test Background 1 | 182.1 | 1.828 | 122.7 | 1.225 |
| Test Background 2 | 214.3 | 2.141 | 79.90 | 0.797 |
| Test Background 3 | 138.7 | 1.385 | 26.62 | 0.264 |
| Test Background 4 | 95.38 | 0.952 | 28.11 | 0.279 |
| Test Background 5 | 82.01 | 0.818 | 36.95 | 0.367 |
| Test Background 6 | 208.5 | 2.083 | 166.4 | 0.166 |
| Test Background 7 | 9.180 | 0.090 | 14.92 | 0.147 |
IV-C Grasping Using a Depth Camera
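Table III: Grasping cost statistics.

| Scenario | Policy Gradient: Mean | Policy Gradient: Std. | Task-Driven PG: Mean | Task-Driven PG: Std. |
|---|---|---|---|---|
| Training | 0.107 | 0.309 | 0.094 | 0.292 |
| Increased Rotation | 0.182 | 0.386 | 0.132 | 0.339 |
| Increased Translation | 0.367 | 0.482 | 0.358 | 0.480 |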
In this final example, we consider the task of grasping and lifting an object (a mug) from a table using a Franka Emika Panda simulated with PyBullet (see Figure 5). This example is particularly interesting for studying task-driven policies due to the (approximate) radial symmetry of the mug about the vertical axis. Intuitively, this symmetry renders the orientation of the mug about the vertical axis largely irrelevant to the task of grasping the mug. This is due to the fact that the robot does not need to know the precise orientation of the mug; it only needs to know enough about the orientation in order to estimate if the handle will directly interfere with the gripper’s location when the grasp is performed. We thus expect that a task-driven policy will remain largely unaffected by changes in the mug’s orientation. Hence, a task-driven policy trained on a small set of mug rotations should generalize well to the full set of rotations.
In this grasping problem, the state of the system contains the position of the object and its rotation about the vertical axis (which is oriented normal to the table surface) in radians. The control action specifies the position and orientation of the end effector of the arm when the grasp occurs. After grasping, the arm attempts to move the end effector vertically. A cost of 0 is awarded if the arm successfully lifts the object more than 0.05 meters above the table and a cost of 1 is assigned otherwise. The observation is a 128 × 128 depth image. An example observation is included in Figure 5. The initial state of the object is sampled uniformly from a fixed set.
Training Summary. We parameterized the policy in the following manner. The TRV network contains two convolutional layers with 6 output channels. The first uses a kernel size of 6 and the second uses a kernel size of 4; both use a stride length of 2. The convolutional layers are followed by two fully-connected layers of sizes 128 and 64. An ELU nonlinearity is used between convolutional layers and a second type of nonlinearity is used after linear layers. The size of the TRVs is 16. The action network contains two fully-connected layers with 64 units each and an ELU nonlinearity between them. The output of this network is then divided by 10 and added to a nominal control action. This scaling and translation is done to bias the policy to select grasping locations near the object to speed up learning early in the optimization process. Again, as in Section IV-B, we initialize the TDPG solution to the PG solution at the start of training and learn policies for several values of the Lagrange multiplier in (3) for 30 epochs.
In this example, the MINE estimate proved particularly noisy and took longer to converge. To combat this, we increased the number of MINE epochs to 500,000 on the first policy epoch, with 100,000 MINE epochs used on subsequent policy epochs. To determine the value of the MINE, we applied an EMA with parameter . No EMA was used for computing MINE gradients. The epoch with the lowest filtered MINE estimate among those with an expected cost below 0.15 was used for testing. Each epoch took approximately 5 minutes to compute. Rollouts were computed in parallel on an Intel i7-8700K CPU (3.7 GHz) and optimization was done on a Titan Xp GPU.
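The model-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the EMA convention (weight `alpha` on the newest sample), the value of `alpha`, and all numbers in the example are assumptions, since the EMA parameter is not given here. Only the 0.15 cost threshold comes from the text.

```python
def ema_filter(values, alpha):
    """Exponential moving average; alpha weights the newest sample (assumed convention)."""
    filtered = []
    running = None
    for v in values:
        running = v if running is None else alpha * v + (1 - alpha) * running
        filtered.append(running)
    return filtered


def select_epoch(mi_estimates, expected_costs, alpha, cost_threshold=0.15):
    """Pick the epoch with the lowest filtered MI estimate among low-cost epochs."""
    filtered = ema_filter(mi_estimates, alpha)
    eligible = [i for i, c in enumerate(expected_costs) if c < cost_threshold]
    if not eligible:
        return None  # no epoch met the cost threshold
    return min(eligible, key=lambda i: filtered[i])


# Hypothetical per-epoch values (illustrative only).
mi = [2.0, 1.0, 0.5, 1.5]
costs = [0.30, 0.10, 0.20, 0.12]
best = select_epoch(mi, costs, alpha=0.5)
print(best)
```

In this toy example, epoch 2 has the lowest filtered estimate overall but is excluded because its expected cost is not below the threshold, so the rule falls back to the best eligible epoch.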
Policy Evaluation. The test results for this example are summarized in Table III. Again, all reported statistics are computed using 1000 trials. In the first testing environment, the set of angles at which the mug is placed is expanded from to . The expected cost (i.e., grasp failure rate) of the PG policy increased twice as much as that of the TDPG policy in these new testing environments. In the second testing environment, the set of angles at which the object is placed is again , but the and values were sampled from the larger set of . In this setting, both policies perform equally poorly. This result supports our hypothesis that the rotation of the mug is largely unimportant for the task of lifting it: the TDPG policy was able to generalize to new mug angles, but not to new values of the task-relevant translational coordinates of the mug. Meanwhile, the PG policy overfits to task-irrelevant state information and is adversely affected when that information changes.
V Discussion and Conclusion
We presented a novel reinforcement learning algorithm for computing task-driven control policies for systems equipped with rich sensor observations (e.g., RGB or depth images). The key idea behind our approach is to learn a task-relevant representation that contains as little information as possible about the state of the system while being sufficient for achieving a low cost on the task at hand. Formally, this is achieved by using an information bottleneck criterion that minimizes the mutual information between the state of the system and a set of task-relevant variables (TRVs) used for computing control actions. We parameterize our policies using neural networks and present a novel policy gradient algorithm that leverages the recently proposed mutual information neural estimator (MINE) for optimizing our objective. We refer to the resulting algorithm as task-driven policy gradient (TDPG).
We compare PG and TDPG policies in three experiments: an adaptation of the canonical lava problem to continuous state spaces, a ball-catching scenario inspired by the gaze heuristic from cognitive psychology, and a depth-image-based grasping problem. In the lava example, the TDPG policy exploits the nonlinear dynamics to find a minimal-information (open-loop) control policy that is robust to increased sensor noise. In the ball-catching example, the TDPG policy is also more robust than the PG policy to changes in the sensing model at test time. These changes include both random noise corrupting the images and task-irrelevant structural changes, e.g., altering the textures in the robot’s environment. Finally, in the grasping scenario, the TDPG policy generalizes to rotated states of the object not seen during training, on which the PG policy struggles. Together, these scenarios validate that our approach to designing task-driven control policies produces robust policies that can operate in environments unseen during training.
Future Work. There are a number of challenges and exciting directions for further exploration. First, we have observed that MINE often results in noisy estimates of the mutual information and can take many epochs to converge. This results in long training times for TDPG. We also employed an approximation of the gradient of MINE with respect to policy parameters (due to the challenges associated with estimating state distributions at each time step, as described in Section III). These observations motivate the exploration of other methods for minimizing the mutual information, e.g., Stein variational gradient methods [17, 28]. Another direction for future work is to adapt more advanced on-policy methods (e.g., proximal policy optimization (PPO) [35]) to work with our approach, and potentially to explore off-policy methods. We expect that these methods will be even more effective at learning task-driven policies. Finally, a particularly exciting direction for future work is to explore the benefits that our approach affords in terms of sim-to-real transfer. The simulation results in this paper suggest that TDPG can be more robust to sensor noise and perturbations to task-irrelevant features. Whether this translates to more robust sim-to-real transfer is a promising question we hope to explore in the future.
Acknowledgements
This work is partially supported by the National Science Foundation [IIS-1755038], the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, and the Amazon Research Award. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
 [1] (2018) A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems 1, pp. 287–307. Cited by: §IA.
 [2] (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §IA.
 [3] (2007) Optimal control: linear quadratic methods. Courier Corporation. Cited by: §IA.
 [4] (2018) Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, pp. 531–540. Cited by: §IA, §I, §II, §IIIA, §IIIA, §IIIB, §III.
 [5] (1994) Acting optimally in partially observable stochastic domains. In AAAI, Vol. 94, pp. 1023–1028. Cited by: §IVA, §IVA.
 [6] (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289. Cited by: §IVA.
 [7] (2018) Pybullet, a python module for physics simulation for games, robotics and machine learning. Cited by: §IVB.
 [8] (2012) Elements of information theory. John Wiley & Sons. Cited by: §II.
 [9] (2013) A course in robust control theory: a convex approach. Vol. 36, Springer Science & Business Media. Cited by: §IA.
 [10] (2017) Integrated perception and control at high speed. Master’s Thesis, Massachusetts Institute of Technology. Cited by: §IVA, §IVA.
 [11] (2007) Gut feelings: the intelligence of the unconscious. Penguin. Cited by: §I.
 [12] (2019) Transfer and exploration via the information bottleneck. In Proceedings of the International Conference on Learning Representations, Cited by: §IA.
 [13] (2011) Entropy and information theory. 2nd edition, Springer Science & Business Media. Cited by: §IIIA.
 [14] (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §IA.
 [15] (2017) Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125. Cited by: §IA.
 [16] (2016) Backprop KF: learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, pp. 4376–4384. Cited by: §IA.
 [17] (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pp. 1352–1361. Cited by: §V.
 [18] (2018) Differentiable particle filters: end-to-end learning with algorithmic priors. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. Cited by: §IA.
 [19] (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1–2), pp. 99–134. Cited by: §IVA.
 [20] (2018) Particle filter networks: end-to-end probabilistic localization from visual observations. arXiv preprint arXiv:1805.08975. Cited by: §IA.
 [21] (2019) Differentiable algorithm networks for composable robot learning. arXiv preprint arXiv:1905.11602. Cited by: §IA.
 [22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IIIA.
 [23] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §IIIB.
 [24] (2005) Empowerment: a universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, Vol. 1, pp. 128–135. Cited by: §IA.
 [25] (2019) Nonlinear information bottleneck. Entropy 21 (12), pp. 1181. Cited by: §IA.
 [26] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §IA.
 [27] (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR) 37 (4–5), pp. 421–436. Cited by: §IA.
 [28] (2016) Stein variational gradient descent: a general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386. Cited by: §V.
 [29] (2003) Psychophysics: how fielders arrive in time to catch the ball. Nature 426 (6964), pp. 244. Cited by: §I.
 [30] (2019) Task-driven estimation and control via information bottlenecks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Cited by: §IA, §II, §IVA, §IVA.
 [31] (2019) On variational bounds of mutual information. arXiv preprint arXiv:1905.06922. Cited by: §III.
 [32] (2013) Empowerment and state-dependent noise - an intrinsic motivation for avoiding unpredictable agents. In Proceedings of the Artificial Life Conference, pp. 118–125. Cited by: §IA.
 [33] (2014) Empowerment–an introduction. In Guided Self-Organization: Inception, pp. 67–114. Cited by: §IA.
 [34] (2015) Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, pp. 1889–1897. Cited by: §III.
 [35] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III, §V.
 [36] (2004) How dogs navigate to catch frisbees. Psychological Science 15 (7), pp. 437–441. Cited by: §I.
 [37] (2009) Lectures on stochastic programming: modeling and theory. SIAM. Cited by: Theorem II.1.
 [38] (2011) Steps towards a theory of visual information: active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv preprint arXiv:1110.2053. Cited by: §IA.
 [39] (2013) Actionable information in vision. In Machine Learning for Computer Vision, pp. 17–48. Cited by: §IA.
 [40] (2018) The limits and potentials of deep learning for robotics. The International Journal of Robotics Research (IJRR) 37 (4–5), pp. 405–420. Cited by: §IA.
 [41] (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §II.
 [42] (2017) Control capacity of partially observable dynamic systems in continuous time. arXiv preprint arXiv:1701.04984. Cited by: §IA.
 [43] (1999) The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, Cited by: §IA, §I, §II.
 [44] (2015) Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop, pp. 1–5. Cited by: §IA.
 [45] (2019) Learning representations in reinforcement learning: an information bottleneck approach. arXiv preprint arXiv:1911.05695. Cited by: §IA.
 [46] (2019) Unsupervised visuomotor control through distributional planning networks. arXiv preprint arXiv:1902.05542. Cited by: §IA.
 [47] (2019) Learning efficient representation for intrinsic motivation. arXiv preprint arXiv:1912.02624. Cited by: §IA.
 [48] (1998) Essentials of robust control. Vol. 104, Prentice hall Upper Saddle River, NJ. Cited by: §IA.
 [49] (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. Cited by: §IA.