Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks

12/31/2018 ∙ by Alexander Sax, et al.

One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks? We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch and able to generalize in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a rather generic perception module for active frameworks.




1 Introduction

The renaissance of deep Reinforcement Learning (RL) started with the Atari DQN paper [28], which showed that a wide set of video games could be learned directly from pixels using RL. Robotics quickly adopted deep RL for learning control from the frames of an onboard camera, an approach commonly referred to as pixel-to-torque. This interdisciplinary success has led to remarkable recent progress in RL research and has implications for various other fields, in particular perception.

The basic premise of this approach, pertinent to perception, is: performing an active task can be effectively learned from scratch directly from images. The premise poses an existential question for computer vision: what is the use of computer vision, if all one needs from images can be learned from scratch using RL? In this paper, we focus on this question and try to identify what the consequences of this paradigm are, if and when vision can be learned from scratch using RL, and how computer vision could actually help with learning active tasks.

Figure 1: Mid-level perception module in an end-to-end framework for learning active robotic tasks. We systematically study if/how a set of generic mid-level vision features can help with learning arbitrary downstream active tasks. We report significant advantages in sample efficiency and generalization.

To take stock of the situation, it is important to note two common themes when learning from scratch using RL: I. the policies are sample inefficient, requiring a massive number of data points to learn (e.g. DQN requires tens of millions of frames [28], and even state-of-the-art Q-learning methods like Rainbow [18] still require millions of samples). II. the policies are often tested in the same environment as training, as they exhibit difficulties generalizing across environments with even modest differences. Yet generalization and sample efficiency are essential requirements for any practically useful system operating in the real world; biological organisms, for example, are known to learn active tasks from few samples and effortlessly generalize them to places other than where learning occurred. We show that the lack of an appropriate perception module is indeed one of the primary causes of these two phenomena, and that both can consequently be alleviated by adopting a perception solution. We recognize that conventional computer vision consists of defining a set of offline recognition problems, e.g. object classification, and tackling them on their own without a clear path towards effective integration in active frameworks. We will demonstrate that such standard vision tasks can be turned into mid-level vision skills and then integrated into RL frameworks for learning arbitrary active tasks.

To be more specific: our goal is to learn an arbitrary downstream active task, like visual navigation. We assume a set of standard imperfect visual estimators (e.g. depth, orientation, objects, etc.) are available; we refer to them as mid-level vision tasks. We then study if and how mid-level vision can provide benefits towards learning the downstream active task, compared to not adopting a perception module. Our metrics are how quickly the active task is learned and how well the policies generalize to unseen test spaces. We do not care about the task-specific performance of mid-level visual estimators or their vision-based metrics, as our sole goal is the downstream active task and mid-level vision is only in service to that.

We test three core hypotheses: I. whether mid-level vision provides an advantage in terms of sample efficiency of learning an active task (answer: yes); II. whether mid-level vision provides an advantage towards generalization to unseen spaces (answer: yes); III. whether a single fixed mid-level vision feature could suffice, or a set of features is essential to support arbitrary active tasks (answer: a set is essential). We use statistical tests where appropriate. Finally, we put forth a simple and practical perception module based on the findings of our study, which can be adopted in lieu of raw pixels to gain the advantages of mid-level vision.

To perform our study, we needed to adopt an end-to-end framework for learning arbitrary active tasks, and we chose deep RL—however, any of the common alternatives, such as imitation learning or classic control, would be viable choices as well. In our experiments, we use neural networks from existing vision techniques [54, 58, 6, 52] trained on real images for each mid-level task and use their internal representations as the observed state provided to the RL policy. We do not use synthetic data to train the visual estimators and do not assume they are perfect. The code and trained models are available on our website.

2 Related Work

Our study has connections to a broad set of topics, including lifelong learning, un/self-supervised learning, transfer learning, reinforcement and imitation learning, control theory, active vision, and several others. We overview the most relevant ones within the constraints of space.

Figure 2: Illustration of experimental setup. Left: Plate-notation view of the transfer learning setup where internal representations from the encoder network(s) are used as inputs to various RL policies. Right: Illustrations of the hypotheses. Features (also illustrated by the readout images) are ranked by performance on the downstream task. Red lines identify features that have a higher rank for task 1 while blue lines connect features that have a higher rank for task 2. For HI and HII: some features are ranked significantly above scratch. For HIII: the feature ranking reorders between tasks.

Conventional (Offline) Computer Vision encompasses a wide set of approaches—e.g. fully supervised learning [23], self-supervised learning [32, 35, 7, 56, 55, 33], unsupervised learning [9, 4, 7, 56, 8, 41]—adopted to solve various standard perception tasks, e.g. depth estimation [24], object detection [23] and segmentation [43], pose estimation [57], etc. The common characteristic shared across these methods is that they are offline (i.e. trained and tested on pre-recorded datasets) and evaluated in terms of their immediate tasks. In this paper we study how such methods can be plugged into a bigger framework targeted towards solving downstream active tasks. In recent years the computer vision community has become increasingly interested in robotic tasks.

Reinforcement Learning [46] and its variants like Meta-RL [10, 14, 30, 12, 21], or its sister fields such as Imitation Learning [2], commonly focus on the last part of the end-to-end active task pipeline: how to choose an action given a “state” from the world. These methods can be viewed as users of our study, as we essentially upgrade their input state from pixels (or single fixed features) to a set of generic mid-level vision features.

Feature Learning literature shares its goal with our study: how to encode images in a way that provides benefits over just using raw pixels. There are a number of successful works in this area. Compression techniques like autoencoders [19] squeeze images into a lower-dimensional representation. Extensions [49, 22] to autoencoders may enforce desirable properties on the latent space. Another family, Generative Adversarial Networks [47, 13], formulates the problem as a game where the empirical data distribution is an optimal solution. Another method that often reduces the sample complexity compared to using raw pixels is “self-supervised” learning, which uses a novel loss function designed to encourage learning a single useful feature [56, 33, 32, 45, 50]. This can be considered a learned variant of hand-designed features like [25, 3]. Self-supervised approaches in particular have been used to reduce sample complexity for active tasks [16, 34, 29, 15]. Here we show that the appropriate choice of feature for active tasks actually depends on the desired downstream task; hence no single feature was able to sufficiently support arbitrary tasks, and a set of mid-level features is necessary. This is consistent with recent work in computer vision showing that no single vision feature is the perfect transfer source for all other vision tasks, echoing the need for a set [54].

Transfer Learning reuses pre-learned knowledge to benefit learning a new task outside of the original training distribution [48, 36, 44, 11, 27, 31, 26, 37, 53, 38]. Our study is a case of transfer learning, with the important characteristic that we transfer from mid-level static recognition tasks to downstream sequential active tasks.

3 Methodology

3.1 Transferring from Mid-Level Vision to Downstream Active Tasks

How might we use mid-level features to support a downstream task? We choose a typical transfer learning setup in which we use a pretrained neural network as a feature extractor. For example, we might take a network trained for reshading estimation, pass in the observed image, and use the intermediate “reshaded” representation as the sole input to a second neural network; only the second network is updated during training. This setup is shown graphically in figure 2. Other network configurations are possible, so our observed performance gain is a lower bound, and any better method would only increase the significance of our conclusions. Freezing the features has the advantage that they can be reused without degrading performance on already-learned tasks, in order to learn multiple tasks over the lifetime of the agent.
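The frozen-encoder setup can be sketched in a few lines. This is a minimal, hypothetical stand-in (not the paper's architecture): the "encoder" is a fixed random nonlinear projection standing in for a pretrained mid-level network, and only the policy head receives gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained mid-level encoder: its weights are
# frozen and never receive gradient updates during policy training.
W_enc = rng.standard_normal((8, 16))

def encode(x):
    """Frozen feature extractor: a fixed nonlinear projection of the observation."""
    return np.maximum(W_enc @ x, 0.0)          # ReLU features, shape (8,)

# Only the policy head is trainable.
W_pol = np.zeros((3, 8))                        # 3 discrete actions

def policy_logits(x):
    return W_pol @ encode(x)                    # action scores, shape (3,)

def update_policy(x, grad_logits, lr=0.1):
    """A schematic gradient step: W_pol changes, W_enc stays frozen."""
    global W_pol
    W_pol = W_pol - lr * np.outer(grad_logits, encode(x))
```

In the actual experiments the encoder is a pretrained ResNet-50 and the policy is trained with PPO; the point here is only the division of labor between a frozen perception module and a trainable policy.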

Figure 3: Feature readouts in Gibson. Sample outputs from the trained perception networks on input from Gibson. See more frame-by-frame results in the supplementary material.

While the features could be trained on the same image distribution that the agent will see at test time, our features were trained on standard computer vision datasets. This induces a problem of domain shift, where the perception networks are not necessarily well-calibrated in the new environment. However, our network outputs are still informative, and we show qualitative examples in figure 3 and the supplementary material.

3.2 Hypothesis Testing

The following section details how we test our three core hypotheses (see Fig. 2). We use nonparametric significance tests where possible, since nonparametric approaches avoid unnecessary assumptions on the shape or type of the population distributions. For pairwise tests, we use the Wilcoxon rank-sum test, and we correct for multiple comparisons by controlling the False Discovery Rate (FDR) through Benjamini-Hochberg [5].¹

¹ All RL evaluation suffers from some additional estimation error stemming from the fact that we estimate the average performance of a particular random seed, and then use that to further estimate the expected performance of a particular training approach. We use enough episodes so that these cluster effects are small, but we perform a more sophisticated analysis using [1] in the supplementary material, and include the code on our website.
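The Benjamini-Hochberg procedure used here is short enough to sketch directly. Sort the m p-values ascending; the largest rank k with p_(k) ≤ (k/m)·FDR sets the rejection threshold, and every hypothesis with a p-value at or below it is declared significant.

```python
def benjamini_hochberg(p_values, fdr=0.20):
    """Benjamini-Hochberg FDR control: return one significance flag per
    hypothesis. Rejects all hypotheses with p <= p_(k), where k is the
    largest rank such that p_(k) <= (k/m) * fdr."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    passed = False
    threshold = 0.0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            passed = True
            threshold = p_values[i]   # largest qualifying p-value so far
    return [passed and p <= threshold for p in p_values]
```

With an FDR of 20%, as in the paper, this admits more features as "significant" than a family-wise correction like Bonferroni would, at the cost of allowing roughly one in five rejections to be a false discovery in expectation.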

Hypothesis I: Does mid-level vision provide an advantage in terms of sample efficiency when learning an active task? We want to examine whether using an agent equipped with mid-level vision can learn faster than a comparable agent learning tabula rasa, from images alone. Since there is significant variation in the asymptotic performance between different random seeds, any given agent may never achieve a given level of performance. We could not come up with a satisfying metric of relative sample efficiency that could be robustly evaluated from just a few seeds. We therefore provide the curves and ask the readers to use their judgment, which is currently standard practice. Unless designated otherwise, all of our curves are evaluated in unseen test environments.

Hypothesis II: Can mid-level vision features generalize better to unseen spaces? If our mid-level perception networks provide standard encodings of the visual observations that are less space-specific, then we might expect that agents using these features will also learn policies that are robust to differences between the training and testing environments and consequently generalize better. We test this in H2 by evaluating the performance of scratch and feature-based agents at the end of training, after convergence. Specifically we test which, if any, feature-based agents outperform scratch in unseen test environments, correcting for multiple hypothesis testing.

Hypothesis III: Can a single feature support all arbitrary downstream tasks, or is a set of features required? We show there is no universal visual task that provides the best features regardless of the downstream activity. We show this by demonstrating rank reversal: the features that are best for one task are not ideal for another (and vice versa). Concretely, we find that depth estimation features perform best for exploration while object classification features are ideal for target-driven navigation, and we demonstrate rank reversal by showing that the depth features significantly outperform the object classification features for exploration while the ordering flips for navigation (performing hypothesis tests for both of these).

3.3 Mid-Level Feature Selection Module

If the three formulated hypotheses are correct, as we will demonstrate in Sec. 4.4, then a proper perceptual support demands integrating one or a few mid-level visual representations from a larger set, conditional on the actual downstream task. Though it is viable and probably advantageous to define a complicated model, we propose an extremely simple one as a jumping-off point. Our feature-selection module simply takes a sparse linear combination of the pretrained features.

Specifically, the module learns a percept φ(I) of an input image I. The learned percept is a sparse blend of the outputs from k pretrained feature extractors f_1, ..., f_k. Our module chooses blending weights w_1, ..., w_k for the mid-level features, and we enforce that at most j of these are nonzero. This sparsity reduces noise in the features and allows us to evaluate only a few of the f_i at any given time, saving significant computational complexity.

The learned percept is then the weighted combination of the features: φ(I) = Σ_i w_i f_i(I).


There are a myriad of ways to train such a module. For example, one could train the selection and blending weights via policy gradients, with gradient boosting, as a bandit problem (Thompson sampling or upper confidence bounds), or using supervised learning (e.g. noisy gates or the Gumbel-Softmax trick). We choose supervised learning with the noisy gating formulation.


The module can additionally be conditioned on the input image I, so that the blending weights vary per observation. The formula is then: φ(I) = Σ_i w_i(I) f_i(I).

Figure 4: Visualization of training and test buildings from Gibson database [52]. The training space (highlighted in red and zoomed on the left) and the testing spaces (remaining on the right). Actual sample observations from agents virtualized in Gibson framework [52] are shown in the bottom of each box.

Such a setup encourages the agent to learn perception dynamics, because it must choose a limited number of percepts to use for a given observation. One can imagine myriad useful improvements, such as finetuning the feature extractors to support a set of tasks, or designing the gating function to be especially adaptable (e.g. meta-learning the gating).

4 Experiments

In this section we describe our experimental setup and present the results of our hypothesis tests and of the selection module. With 20 vision features and 4 baselines, we train between 3 and 8 seeds per scenario in order to control the false discovery rate. The total number of policies used in the evaluation is about 600, which took 90,000 GPU hours to train.

4.1 Experimental Setup

Environments: We are ultimately interested in how to reduce sample complexity for agents learning in the real world. There are two options for training—either on a real robot or in simulation. Training on a real robot is slow and tedious in practice, but more importantly does not provide an easy way to control experiments and reproduce results for a proper statistical hypothesis test. Thus, due to the scale of the study, we opt to train in simulation.

The downside of training in simulation is that there is a realism gap between the simulator and the physical world. We attempt to mitigate this issue in two ways. First, we choose a recent simulator (Gibson [52]) that is designed to be perceptually similar to the real world as it operates by virtualizing scans of real buildings. Gibson is also integrated with the PyBullet physics engine which contains a fast collision-handling system used to simulate dynamics.

Second, we also perform universality experiments in a second simulator, VizDoom [20]. VizDoom [20] is based on the 1992 game Doom and is one of the simplest examples of a 3D environment. It allows the agent to move around and contains a rudimentary physics engine that handles momentum and enables some basic interactions. The latter is of interest to us since it unlocks certain tasks that are not currently feasible in Gibson (e.g. opening a door, removing an enemy, etc). VizDoom is visually distinct from Gibson and we include it to show that our findings are rather robust to the idiosyncrasies of the particular environment.

4.1.1 Train/Test split

For each environment we define a clear train/test split. In Gibson we train in one building and test in different and completely unseen buildings used only for evaluation (fig. 4). We also test in 10 additional unseen spaces of comparable size and the results are in section 4.5. In Gibson, the training space for the visual navigation task covers 40.2 square meters and the testing space covers 415.6 square meters. The training space for the local planning and exploration tasks covers 154.9 square meters and the testing space covers 1270.1 square meters. The universality experiments in Doom also use a train/test split of textures which is provided in the supplementary material.

4.1.2 Downstream Active Tasks

We try to choose practically useful tasks in order to test our hypotheses. The tasks are visual target-driven local navigation, visual exploration, and local planning. These are depicted in figure 5 and described below.

Figure 5: Task definitions. Visual descriptions of the selected active tasks and their implementations in Gibson (right two columns). Reward functions () and max episode lengths are shown in the left-hand column. Additional observations besides the RGB image are shown in the obs column. Exploration receives only the revealed occupancy grid and not the actual mesh boundaries.

Visual Target-Driven Local Navigation:

In this scenario the agent must locate a target object as fast as possible with only sparse rewards. Upon touching the target there is a large positive reward and the episode ends; otherwise there is a small negative reward for living. The target remains visually the same between episodes, although the locations and orientations of both the agent and the target are randomized according to a uniform distribution over a predefined boundary within the floor plan of the space. In Gibson the target is a box, and in Doom the target is a green torch, but the agent must learn to identify the target during the course of training. The maximum episode length is 400 timesteps, and the shortest path averages around 30 steps.

Visual Exploration: For Visual Exploration, the agent is tasked with visiting as many new parts of the space as quickly as possible. The environment is partitioned into small occupancy cells, which are “unlocked” upon being seen by the agent. The reward at each timestep is proportional to the number of newly revealed occupancy cells. The episode ends after 1000 timesteps. The agent is equipped with a myopic range scanner that reveals the area directly in front of the agent for up to 1.5 meters. Since our agents are memoryless, we provide them with an odometric map of the unlocked cells.

Local Planning: In Local Planning the agent must direct itself to a given nonvisual target destination using visual inputs, avoiding obstacles and walls as it navigates to the target. Since our agent is memoryless, we keep the problem well-posed by specifying the current target direction.²

² This problem formulation is equivalent to assuming that the initial coordinates of the target are given and the robot has a perfect localization system (ideal IMU). In a deployment setting, noise could be added to the target vector to simulate real-world conditions.

The agent receives dense positive reward proportional to the progress it makes (in Euclidean distance) towards the goal, and is penalized for colliding with walls and objects. There is also a small negative reward for living, as in visual navigation. This task represents the practical skill of local planning, where an agent may be given sparse waypoints along a desired path and must navigate gracefully along it in a cluttered space. The maximum episode length is 400 timesteps, and the target distance is sampled from a Gaussian distribution with a mean of 5 meters and a standard deviation of 2 meters.
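The local-planning reward structure can be sketched as below. All coefficients are illustrative placeholders, not the paper's values; the shape of the reward (distance progress, collision penalty, cost of living) follows the text.

```python
import numpy as np

def planning_reward(prev_pos, pos, goal, collided,
                    progress_coef=1.0, collision_pen=0.1, living_pen=0.01):
    """Dense local-planning reward: Euclidean progress toward the goal,
    minus a collision penalty and a small negative reward for living.
    Coefficients are illustrative assumptions."""
    progress = (np.linalg.norm(np.subtract(prev_pos, goal))
                - np.linalg.norm(np.subtract(pos, goal)))
    return progress_coef * progress - collision_pen * collided - living_pen
```

Because the progress term can be negative, moving away from the goal is directly penalized, which makes the signal dense relative to the sparse navigation reward.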

4.1.3 State Space

In all tasks, the state space contains the RGB image and the minimum amount of side information needed for the task to be solvable. We stack the most recent 4 RGB frames as input and do not share weights between these frames, to allow the agent to infer its local dynamics. Unlike the common practice in reinforcement learning, we do not include any proprioceptive information such as the agent’s joint positions or velocities, or any other side information that could be useful but is not essential to solving the task, such as a map of obstacles or the floor layout. For visual navigation, the state space is only the image. For local planning, the agent also receives the target in its own inertial reference frame as a vector (d, θ), where d is the Euclidean distance to the target and θ is the angle relative to the agent’s heading in the ground plane. For visual exploration, the task requires some form of memory for the agent to know where it has already been. Since our neural network architecture is memoryless (aside from the frame stacking), we encode the memory as an occupancy grid, translated and rotated to align with the agent’s inertial reference frame, whose cell values are 1 if the agent has already observed a given cell and 0 otherwise. The occupancy grid contains no global information about the scene such as walls or obstacles; it is only the previous output of the robot’s laser sensor.
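Computing the (d, θ) target vector in the agent's frame is a standard transform; a minimal sketch (function name is ours):

```python
import math

def target_in_agent_frame(agent_xy, agent_heading, target_xy):
    """Express the goal as (d, theta) in the agent's inertial frame:
    d is the Euclidean distance to the target, theta the bearing
    relative to the agent's heading in the ground plane."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    d = math.hypot(dx, dy)
    theta = math.atan2(dy, dx) - agent_heading
    theta = (theta + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi)
    return d, theta
```

An agent at the origin facing along +x sees a goal at (1, 1) as distance √2 at bearing π/4; the same goal straight ahead of a rotated agent has bearing 0.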

Figure 6: Agent trajectories in test environment. Left: Average rewards in the test environments in Doom and Gibson for features and scratch. Feature-based policies generalize better in Gibson; in Doom, the features generalize better to novel texture variation (see Fig. 10 for additional generalization results in Doom). Right: Visualizations of the paths that agents take in Gibson. Top: The policy trained with object detection features learns to recognize the target and, once it does so, heads for the goal, but fails to cover the entire space in exploration. Middle: Distance estimation features learn a rough approximation of the target location—the agent runs around until it is nearly on top of the target—while covering the entire space for exploration. Bottom: The scratch policy completely fails to generalize to the test space and wanders about almost randomly. More visualizations are available on the website.

4.1.4 Action Space

In all tasks in this section we assume that there is a low-level controller for robot actuation. Therefore the policies have a discrete action space of {move_forward, turn_left, turn_right}: move_forward corresponds to a 0.1 m translation in the direction of the robot’s heading in the ground plane, and turn_left and turn_right correspond to in-place rotations of the robot’s heading by 0.14 radians. No frame skipping is used for Gibson; for Doom, actions are selected and repeated for 4 frames. All actions are available at every timestep, with the physics engines responsible for enforcing the physical boundaries of the spaces.
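The effect of each action on a planar pose can be sketched directly from the step sizes above (collision handling, which the physics engine provides, is omitted):

```python
import math

# Discrete action set; step sizes follow the text: 0.1 m forward, 0.14 rad turns.
ACTIONS = ("move_forward", "turn_left", "turn_right")

def step_pose(x, y, heading, action):
    """Apply one action to a planar pose (x, y, heading). A simplified
    kinematic sketch; the simulator's physics engine enforces boundaries."""
    if action == "move_forward":
        x += 0.1 * math.cos(heading)
        y += 0.1 * math.sin(heading)
    elif action == "turn_left":
        heading += 0.14
    elif action == "turn_right":
        heading -= 0.14
    return x, y, heading
```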

4.2 Learning Setup

In all experiments we use the common Proximal Policy Optimization (PPO) [40] algorithm with Generalized Advantage Estimation [39]. PPO is a stable and well-tested algorithm. Due to the computational load of rendering perceptually realistic images in Gibson, we are only able to use a single rollout worker, and we therefore decorrelate our batches using experience replay and an off-policy variant of PPO. The formulation is similar to Actor-Critic with Experience Replay (ACER) [51] in that full trajectories are sampled from the replay buffer and reweighted using the first-order approximation for importance sampling. We include the full formulation in the supplementary material. For the universality experiments in Doom, we use standard PPO with 16 rollout environments. The standard PPO objective is


L^CLIP(θ) = E_τ [ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ],

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio, Â_t is the advantage function at timestep t (some sufficient statistic for the value of the policy at timestep t; in our experiments we choose the generalized advantage estimator [39]), and τ is a trajectory drawn from the current policy.
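The clipped surrogate can be sketched numerically (a minimal illustration of the standard objective, not the paper's full off-policy variant):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate (to be maximized): per-sample
    min(r * A, clip(r, 1-eps, 1+eps) * A), averaged over the batch.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sampled transition.
    advantage: advantage estimates (e.g. GAE) for the same transitions.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

The elementwise minimum makes the objective pessimistic: large policy ratios cannot inflate the surrogate beyond the clip range, which is what keeps PPO updates stable.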

For each task and each environment we conduct a hyperparameter search optimized for the scratch baseline (see section 4.3). We then fix this setting and reuse it for every feature. This setup should favor scratch, and possibly other baselines that use the same architecture.

For our experiments we use a set of 20 different computer vision tasks. This set covers various common modes of computer vision tasks, from texture-based tasks like denoising, to 3D pixel-level tasks like depth estimation, to low-dimensional geometric tasks like room layout estimation, to semantic tasks like object classification. For a full list of the tasks as well as descriptions and some sample videos, please see the supplementary material.

Our feature networks were trained on a dataset of 4 million static images of indoor scenes [54]. We use the pretrained networks of [54]. Each network encoder consists of a ResNet-50 [17] without a global average-pooling layer, which preserves spatial information in the image. The feature networks were all trained using identical hyperparameters.

All network architectures and full experimental details, as well as videos of the pretrained networks evaluated in our environments are included in the supplementary material.

4.3 Baselines

We include several controls to provide a baseline for the visual-feature-based agents and to address possible confounding factors.

Navigation
  Feature         r       p-val
  Obj. Cls.       5.91    .001
  Sem. Segm.      5.87    .001
  Curvature       4.75    .002
  Scene Cls.      3.07    .003
  2.5D Segm.      3.01    .002
  2D Segm.        1.99    .003
  Distance        1.74    .003
  Occ. Edges       .38    .009
  Vanish. Pts.     .39    .019
  Reshading        .21    .021
  2D Edges         .12    .006
  Normals         -.50    .035
  Jigsaw          -.86    .122
  3D Keypts.     -1.08    .112
  Layout         -1.14    .057
  Autoenc.       -1.16    .043
  Rand. Proj.    -2.12    .083
  Blind          -3.20    .755
  Pix-as-state   -4.30    .856
  2D Keypts.     -6.10    .922
  In-painting    -6.57    .971
  Denoising      -6.47    .981

Exploration
  Feature         r       p-val
  Distance        5.90    .015
  Reshading       5.79    .003
  3D Keypts.      5.27    .004
  Curvature       5.12    .027
  2.5D Segm.      5.60    .056
  Layout          4.78    .108
  2D Edges        4.87    .120
  Normals         5.26    .143
  Scene Cls.      4.67    .152
  Obj. Cls.       4.80    .187
  2D Segm.        4.47    .406
  Jigsaw          4.47    .455
  Rand. Proj.     4.33    .500
  Vanish. Pts.    4.24    .500
  Pix-as-state    4.20    .531
  Blind           4.21    .545
  2D Keypts.      4.21    .682
  In-painting     4.30    .697
  Autoenc.        4.11    .815
  Sem. Segm.      3.67    .857
  Occ. Edges      3.85    .864
  Denoising       3.59    .962

Local Planning
  Feature         r       p-val
  3D Keypts.     15.45    .015
  Normals        15.10    .000
  Curvature      14.84    .003
  Distance       14.56    .001
  2.5D Segm.     14.50    .001
  Sem. Segm.     14.49    .000
  Scene Cls.     14.20    .001
  Occ. Edges     14.20    .001
  Reshading      14.12    .000
  Layout         14.12    .015
  Obj. Cls.      13.95    .000
  2D Segm.       13.86    .001
  Denoising      13.54    .000
  In-painting    13.28    .000
  Jigsaw         13.17    .012
  2D Edges       13.16    .008
  Vanish. Pts.   12.14    .028
  2D Keypts.     11.99    .050
  Autoenc.       11.39    .155
  Pix-as-state   10.22    .654
  Rand. Proj.     8.93    .892
  Blind           9.83    .929
Figure 7: Features vs. scratch. The plots show training and test performance of scratch vs. selected features throughout training: for all tasks there is a significant gap between train and test performance for scratch, and a much smaller one for the best feature. Scratch often fails to generalize (bottom), while feature-based agents generalize better (top); sometimes models appear to learn in the training environment but fail at test time, underscoring the importance of a good test environment in RL. The tables show significance tests of the performance of feature-based agents vs. scratch in Gibson. P-values come from a Wilcoxon rank-sum test, adjusted for multiple hypothesis testing with an FDR of 20%. Significant rows are in white, and rows are ordered by average episode reward.

Scratch Learning: Learning from scratch, or “vanilla” RL for the perception aspect, is among the common practices today. In this condition the agent starts with an appropriate random initialization and receives the raw RGB image as input. This baseline uses the common AtariNet [28] tower.

Blind Intelligent Actor: The Blind Intelligent Actor (blind) baseline is the same as scratch except that the visual input is constant and does not depend on the state of the environment. The blind agent indicates how much performance can be squeezed out of the nonvisual biases, correlations, and overall structure of the environment. If our tasks were essentially nonvisual, e.g. a narrow maze where the layout leads the agent to the target, then that would manifest as a small performance gap between blind and scratch.

Random Nonlinear Projections: To rule out the possibility that the perceptual architecture, not the source task, is the primary factor for good representations we include the Random Nonlinear Projection (random) baseline. This condition is the same as the pretrained features condition, except that this network is randomly initialized and then frozen. As a result, the policy network learns from a random nonlinear projection of the input image. These features contain much of the information in the original image.

Pixels as State: This baseline considers the possibility that a smaller representation size is simply easier to learn from. Pixels-as-state downsamples the input image to a 16x16x3 image, then stacks two copies of it with two copies of the greyscale version to produce a 16x16x8 tensor, the same shape as the pretrained activations. This tensor is then passed as the representation.
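The construction above can be sketched as follows (a minimal numpy version; the exact downsampling filter used in the paper is an assumption here — we use simple block averaging):

```python
import numpy as np

def pixels_as_state(img):
    """Downsample an HxWx3 image to 16x16x3 by block averaging, then
    stack two RGB copies and two greyscale copies into a 16x16x8 tensor."""
    h, w, _ = img.shape
    assert h % 16 == 0 and w % 16 == 0
    # Block-average downsample to 16x16x3.
    small = img.reshape(16, h // 16, 16, w // 16, 3).mean(axis=(1, 3))
    grey = small.mean(axis=-1, keepdims=True)  # 16x16x1 greyscale
    # Two RGB copies (6 channels) + two greyscale copies (2 channels) = 8.
    return np.concatenate([small, small, grey, grey], axis=-1)

state = pixels_as_state(np.zeros((256, 256, 3)))
```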

4.4 Experimental results on hypothesis testing I-III

We report our findings for the effect of intermediate representations on sample efficiency and generalization. All the results are evaluated in the test environment with multiple random seeds, unless otherwise explicitly stated.

4.4.1 Hypothesis I: Sample Complexity Results

In this experiment we check whether an agent can learn faster using pretrained visual features than it can from scratch. We evaluate 20 different features against the four control groups on each of our tasks: visual target-driven local navigation, visual exploration, and local planning. As shown in Fig. 6, we find that in all cases the feature-based agents learn significantly faster and may achieve higher final performance than agents trained from scratch, even after averaging over many random seeds. We explore when an agent may not achieve higher test performance in section 4.6.

4.4.2 Hypothesis II: Generalization Results

Do policies trained with pretrained features generalize better to unseen test environments? The previous experiment tested how quickly learning saturated; this experiment tests for superior generalization performance at a given level of data. We find that specific feature-based policies exhibit superior generalization compared to scratch when tested in environments unseen at training time.

Generalization Significance Analysis: As shown in Fig. 7, for each of the tasks there are some features that generalize significantly better than scratch. We used a nonparametric significance test and adjusted for multiple comparisons using a False Discovery Rate of 20%. If there were no actual difference, the probability of all of these results being spurious would be negligible for navigation and local planning (fig. 7: left, right); after including the additional seeds from the follow-up experiment in the next section, the p-value for exploration (fig. 7, center) is also negligible.
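The multiple-comparison correction can be sketched with the standard Benjamini-Hochberg procedure (the p-values below are hypothetical; in the paper they come from Wilcoxon rank-sum tests on episode rewards):

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.20):
    """Return a boolean mask of hypotheses rejected at the given FDR."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * fdr; reject hypotheses 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * fdr
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Hypothetical p-values for four feature-vs-scratch comparisons.
mask = benjamini_hochberg([0.001, 0.02, 0.04, 0.30], fdr=0.20)
```

With these inputs the first three comparisons survive the 20% FDR threshold while the fourth does not.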

Generalization Gap: We found a large generalization gap between agent performance in the training vs. test environments, shown in the plots in fig. 7. All our policies exhibit some gap, but agents trained from scratch seem to overfit completely—they show no test improvement during training. We note that the Autoencoder feature learns quickly in the training environment but fails to carry that over to the test set. Since Variational Autoencoders are a commonly used form of perception in RL, we caution that such methods must be evaluated on performance in a test environment, not solely on the training set.

Qualitative Generalization Results: Feature-based policies behave qualitatively differently than those trained from scratch, and different features exhibit different types of behavior. Fig. 6 highlights these differences by plotting the trajectories of random rollouts for various policies. We see that the navigation agent trained with semantic features is able to effectively identify and then navigate to the target. However, the semantic agent is not very adept at exploring through hallways and doors, as evidenced in the exploration task. This is where the depth-based agent shines: despite lacking fine-grained knowledge about the objects in the scene, as is clear in the navigation task, the depth-based agent can cover much more ground in the exploration task by seeking paths that lead to wide open spaces. Both agents perform noticeably better than scratch, which wanders about the test environment seemingly at random.

4.4.3 Hypothesis III: Rank Reversal Results

It is well-known that ImageNet-based features transfer well; indeed, such pretraining is often the default choice. We find, however, that no single feature (or pair of features) consistently outperforms all the others. Instead, the choice of pretrained features should depend upon the downstream task. This experiment exhibits a case of rank reversal, where features that work well on one task are not ideal for another, and vice versa.

Figure 8: Rank reversal in visual tasks and no universal feature. Scatterplots of the rank of feature performance on navigation (x-axis) and exploration (y-axis) in both Gibson (right) and Doom (left). That no feature lies in the bottom-left corner means there is no single universal feature—and the blank region there means there is no almost-universal feature either. The feature with the maximum F-score requires giving up 3-4 ranks on each task.

Figure 9: Rank reversal significance graphs. Arrows indicate which features are better for a given downstream task. Heavier arrows indicate more significant results (lower α-level). Blue arrows point toward features that are better for navigation and red arrows toward features that better support exploration. The absence of an arrow indicates the performance difference was not statistically significant. That no node has all incoming arrows demonstrates the lack of a universal feature. The essentially complete bipartite structure in the Gibson graph shows that navigation is characteristically semantic while exploration is geometric.

Rank Reversal Significance Analysis: We compare the top-performing navigation feature against the top-performing exploration feature. The best feature for Gibson navigation was indeed an Object Classification network based on ImageNet, but the best feature for Gibson exploration was Distance Estimation. For navigation, Object Classification was better than Distance Estimation at a statistically significant level; for exploration the order is reversed, also at a statistically significant level.

Rank Reversal Among Related Visual Tasks: The trend of rank reversal appears to be a widespread phenomenon. Fig. 9 shows that semantic features seem to be useful for navigation while geometric features are useful for exploration. In Gibson, the graph is nearly complete bipartite (indicating that this distinction is quite useful). In the universality experiments (partially shown in figure 9), a similar trend holds despite the fact that Doom is visually quite distinct from Gibson.

Similarly, we see in figure 8 that the trend is not just among the top few features, but holds for families of computer vision tasks found in [54].

4.5 Robustness in Additional Environments

We repeated our testing in 9 other buildings to account for the possibility that our main test building is anomalous in some way. We found that the average reward in our main test building and in the 9 other buildings was very strongly correlated, with a Spearman's rho of 0.93 for navigation and 0.85 for exploration. The full experimental setup and results are included in the supplementary material.
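Spearman's rho is simply the Pearson correlation of the rank vectors; a minimal version with illustrative data (assuming no ties, as when ranking distinct average rewards):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties among the values."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Monotonically related rewards yield rho = 1 regardless of scale.
rho = spearman_rho([1.0, 3.0, 2.0, 5.0], [10.0, 30.0, 20.0, 50.0])
```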

4.6 Universality Experiments in Doom

Figure 10: Features generalize to new axes of variation. In ViZDoom, feature-based agents (right two columns) generalize to new textures even when not exposed to texture variation in training (top row), while agents trained from scratch suffer a significant drop in performance (left, top).

We also implemented navigation and exploration in Doom to evaluate whether similar effects hold across environments. We found (fig. 8) that features which perform well in Gibson also tend to perform well in Doom, and that similar tradeoffs exist between tasks regardless of environment. Specifically, the geometric/semantic distinction from Gibson appears again in Doom, and the results are highly statistically significant (figure 9). We also find that there is no universal feature in either Doom or Gibson, and that maximizing the combined score (see figure 8) requires choosing the third- or fourth-best feature for any given task.
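If the combined score is taken to be the harmonic mean (F-score) of a feature's per-task scores, as figure 8 suggests, the tradeoff can be sketched as follows (the feature names and score values are hypothetical, for illustration only):

```python
def best_combined_feature(nav_scores, exp_scores):
    """Pick the feature maximizing the harmonic mean (F-score) of its
    navigation and exploration scores (values in [0, 1], higher is better)."""
    def f_score(a, b):
        return 2 * a * b / (a + b) if (a + b) > 0 else 0.0
    return max(nav_scores, key=lambda k: f_score(nav_scores[k], exp_scores[k]))

# Hypothetical normalized scores: the balanced feature wins the F-score
# even though it tops neither individual ranking.
nav = {"semantic": 0.9, "depth": 0.2, "curvature": 0.6}
exp = {"semantic": 0.2, "depth": 0.9, "curvature": 0.6}
winner = best_combined_feature(nav, exp)
```

This mirrors the paper's observation: the feature that maximizes the combined score is not the top-ranked feature on either task alone.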

We also found that features were more robust to changes in texture than learning from scratch. While scratch achieves the highest final performance when the agent learns in a video game environment where many training textures emulate the test textures (fig. 6), scratch fails to generalize when there is little or no texture variation during training. Feature-based agents, on the other hand, generalize even without texture randomization, as shown in fig. 10.

4.7 Evaluation of Representation Selection Module

Since the choice of feature has a significant impact on the agent's test performance, we show that the feature selection can be learned stably using our perception module. We evaluated the stability of feature selection in Doom, since we could not fit all 20 perception networks together with the Gibson environment onto a V100. This is not a fundamental limitation of the method, and the feature rankings in Doom were shown in section 4.6 to be similar to those in Gibson. To fit the model into memory in Doom, we used a single rollout worker; we also reduced the learning rate to give the policy network time to adapt to the changing perception.

In figure 11 we examine which feature the module eventually selects from among 20 options. The module selects the same final feature even when presented with smaller subsets of perception features (11, 6, and 3 options). We note that with only one rollout worker (the setup for this experiment), scratch was wholly unable to learn either task, while the module-based agent improved meaningfully during training.
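One simple way to realize such a selection module—a sketch under our own assumptions, not the paper's exact implementation—is to keep a learnable logit per frozen feature bank, combine the banks with softmax weights during training, and read off the argmax as the selected feature:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

class FeatureSelector:
    """Soft attention over K frozen feature banks; the logits are the
    only trainable part (illustrative sketch)."""
    def __init__(self, num_features):
        self.logits = np.zeros(num_features)

    def combine(self, feature_stack):
        # feature_stack: K x D array of per-feature activations.
        w = softmax(self.logits)
        return w @ feature_stack

    def selected(self):
        # The feature the module has committed to so far.
        return int(np.argmax(self.logits))

sel = FeatureSelector(num_features=3)
sel.logits = np.array([0.1, 2.0, -1.0])  # e.g. after some training
combined = sel.combine(np.ones((3, 8)))
```

Because the combination is a convex mixture, a weighted average of identical banks returns the bank itself, and training can sharpen the logits toward a single winner.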

Figure 11: Feature selection. The charts show which feature the model selects during training in Doom. The model selects scene classification for navigation (left) and room layout for exploration (right).

5 Conclusion and Limitations

We investigated the role of mid-level perception for learning active tasks using an RL platform. Our study suggested a notable benefit to adopting proper perceptual features, in contrast to using raw pixels as the state of the world or learning perception entirely from scratch via RL. The benefits were particularly stark in terms of sample complexity and generalization to unseen spaces. We consequently put forth a simple and efficient module for mid-level feature selection based on the findings of our study.

In retrospect, the finding that mid-level vision improves learning speed and generalization to new places is somewhat expected. Mid-level features encapsulate a readable, easy-to-interpret state of the world (e.g. 3D features discard shadows and texture to convey the true underlying geometry) and are designed to provide generic, abstract information about the world that is not specific to any particular place. We found that by kickstarting the perception of an active system with such representations, instead of bewildering the entire system with raw unprocessed sensory data that is full of information but hard to parse, the system develops rewarding behavior faster.

It is worth noting a number of limitations of our framework. Our selection of active tasks was primarily oriented around locomotion. Though locomotion is a significant problem in its own right, our study does not necessarily support conclusions about other important active tasks, such as manipulation. Also, given that RL was our experimental platform, our findings are bounded by the limitations of existing RL methods, e.g. difficulties with long-range exploration or credit assignment under sparse reward functions. We also use a fixed dictionary of mid-level features during learning.

What exactly these mid-level tasks should be, how to learn their estimators efficiently, and how to incrementally expand the dictionary or improve each element are important research questions. Answering them would have benefits towards adapting the agents for the perceptual characteristics of new spaces, reducing computational cost, and would bring the problem closer to true life-long learning.

Acknowledgements We gratefully acknowledge the support of ONR MURI (N00014-14-1-0671), NVIDIA NGC beta, and TRI. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.