Sidekick Policy Learning for Active Visual Exploration

07/29/2018 ∙ by Santhosh K. Ramakrishnan, et al. ∙ Facebook The University of Texas at Austin 6

We consider an active visual exploration scenario, where an agent must intelligently select its camera motions to efficiently reconstruct the full environment from only a limited set of narrow field-of-view glimpses. While the agent has full observability of the environment during training, it has only partial observability once deployed, being constrained by what portions it has seen and what camera motions are permissible. We introduce sidekick policy learning to capitalize on this imbalance of observability. The main idea is a preparatory learning phase that attempts simplified versions of the eventual exploration task, then guides the agent via reward shaping or initial policy supervision. To support interpretation of the resulting policies, we also develop a novel policy visualization technique. Results on active visual exploration tasks with 360° scenes and 3D objects show that sidekicks consistently improve performance and convergence rates over existing methods. Code, data and demos are available.


1 Introduction

Visual recognition has witnessed dramatic successes in recent years. Fueled by benchmarks composed of Web photos, the focus has been inferring semantic labels from human-captured images, whether classifying scenes, detecting objects, or recognizing activities [51, 41, 57]. By relying on human-taken images, the common assumption is that an intelligent agent will have already decided where and how to capture the input views. While sufficient for handling static repositories of photos (e.g., auto-tagging Web photos and videos), assuming informative observations glosses over a very real hurdle for embodied vision systems.

A resurgence of interest in perception tied to action takes aim at that hurdle. In particular, recent work explores agents that optimize their physical movements to achieve a specific perception goal, e.g., for active recognition [43, 29, 31, 2, 28], visual exploration [30], object manipulation [40, 49, 46], or navigation [70, 21, 2]. In any such setting, deep reinforcement learning (RL) is a promising approach. The goal is to learn a policy that dictates the best action for the given state, thereby integrating sequential control decisions with visual perception.

Figure 1: Embodied agents that actively explore novel objects (left) or environments (right) intelligently select camera motions to gain as much information as possible with very few glimpses. While they naturally face limited observability of the environment, during learning fuller observability may be available. We propose sidekicks to guide policy learning for active visual exploration.

However, costly exploration stages and partial state observability are well-known impediments to RL. In particular, an active visual agent [70, 21, 71, 30] has to take a long series of actions purely based on the limited information available from its first person view. Due to poor action selection based on limited information, the most effective viewpoint trajectories are buried among many mediocre ones, impeding the agent’s exploration in complex state-action spaces.

We observe that agents lacking full observability when deployed may nonetheless possess full observability during training, in some cases. Overall, the imbalance occurs naturally when an agent is trained with a broader array of sensors than available at test-time, or trained free of the hard time pressures that limit test-time exploration. In particular, as we will examine in this work, once deployed, an active exploration agent can only move the camera to “look-around” nearby [30], yet if trained with omnidirectional panoramas, could access any possible viewpoint while learning. Similarly, an active object recognition system [29, 31, 2, 65, 28] can only see its previously selected views of the object; yet if trained with CAD models, it could observe all possible views while learning. Additionally, agents can have access to multiple sensors during training in simulation environments [13, 48, 10], yet operate on first-person observations during test-time. However, existing methods restrict the agent to the same partial observability during training [65, 31, 29, 30, 70, 28].

We propose to leverage the imbalance of observability. To this end, we introduce sidekick policy learning. We use the name “sidekick” to signify how a sidekick to a hero (e.g., in a comic or movie) provides alternate points of view, knowledge, and skills that the hero does not have. In contrast to an expert [19, 61], a sidekick complements the hero (agent), yet cannot solve the main task at hand.

We propose two sidekick variants. Both use access to the full state during a preparatory training period to facilitate the agent’s ultimate learning task. The first sidekick previews individual states, estimates their value, and shapes rewards to the agent for visiting valuable states during training. The second sidekick provides initial supervision via trajectory selections to accelerate the agent’s training, while gradually permitting the agent to act on its own. In both cases, the sidekicks learn to solve simplified versions of the main task with full observability, and use insights from those solutions to aid the training of the agent. At test time, the agent has to act without the sidekick.

We validate sidekick policy learning for active visual exploration [30]. The agent enters a novel environment and must select a sequence of camera motions to rapidly understand its entire surroundings. For example, an agent that has explored various grocery stores should enter a new one and, with a couple of glimpses, 1) conjure a belief state for where different objects are located, then 2) direct its camera to flesh out the harder-to-predict objects and contexts. The task is like active recognition [65, 31, 29, 2], except that the training signal is pixelwise reconstruction error for the full environment rather than labeling error. Our sidekicks can look at any part of the environment in any sequence during training, whereas the actual agent is limited to physically feasible camera motions and sees only those views it has selected. On two standard datasets [66, 65], we show how sidekicks accelerate training and promote better look-around policies.

As a secondary contribution, we present a novel policy visualization technique. Our approach takes the learned policy as input, and displays a sequence of heatmaps showing regions of the environment most responsible for the agent’s selected actions. The resulting visualizations help illustrate how sidekick policy learning differs from traditional training.

2 Related Work

Active vision and attention: Linking intelligent control strategies to perception has early foundations in the field [1, 6, 5, 63]. Recent work explores new strategies for active object recognition [65, 31, 29, 2, 28], object localization [9, 20, 71], and visual SLAM [32, 58], in order to minimize the number of sampled views required to perform accurate recognition or reconstruction. Our work is complementary to any of the above: sidekick policy learning is a means to accelerate and improve active perception when observability is greater during training.

Models of saliency and attention allow a system to prioritize portions of its observation to reduce clutter or save computation [42, 4, 45, 68, 67]. However, unlike both our work and the active methods above, they assume full observability at test time, selecting among already-observed regions. Work in active sensor placement aims to place sensors in an environment to maximize coverage [11, 36, 62]. We introduce a model for coverage in our policy learning solution (Sec. 3.3.2). However, rather than place and fix static sensors, the visual exploration tasks entail selecting new observations dynamically and in sequence.

Supervised learning with observability imbalance: Prior work in supervised learning investigates ways to leverage greater observability during training, despite more limited observability during test time. Methods for depth estimation [22, 16, 60] and/or semantic segmentation [56, 25, 26] use RGBD depth data, multiple views, and/or auxiliary annotations during training, then proceed with single image observations at test time. Similarly, self-supervised losses [44, 27] based on auxiliary prediction tasks at training time have been used to aid representation learning for control tasks. Knowledge distillation [24] lets a “teacher” network guide a “student” with the motivation of network compression. In learning with privileged information, an “expert” provides the student with training data having extra information (unavailable during testing) [61, 53, 37]. At a high level, all the above methods relate to ours in that a simpler learning task facilitates a harder one. However, in strong contrast, they tackle supervised classification/regression/representation learning, whereas our goal is to learn a policy for selecting actions. Accordingly, we develop a very different strategy, introducing rewards and trajectory suggestions rather than auxiliary labels/modalities.

Guiding policy learning: There is a wide body of work aimed at addressing sparse rewards and partial observability. Several works explore reward shaping motivated by different factors. The intrinsic motivation literature develops parallel reward mechanisms, e.g., based on surprise [47, 7], to direct exploration. The TAMER framework [33, 34, 35] utilizes expert human rewards about the end-task. Potential-based reward shaping [23] incorporates expert knowledge grounded in potential functions to ensure policy invariance. Others convert control tasks into a supervised measurement prediction task by defining goals and rewards as functions of measurements [12]. In contrast to all these approaches, our sidekicks exploit the observability difference between training and testing to transfer knowledge from a simpler version of the task. This external knowledge directly impacts the final policy learned by augmenting task-related knowledge via reward shaping.

Behavior cloning provides expert-generated trajectories as supervised (state, action) pairs [8, 17, 14, 50]. Offline planning, e.g., with tree search, is another way to prepare good training episodes by investing substantial computation offline [19, 3, 54], but observability is assumed to be the same between training and testing. Guided policy search uses importance sampling to optimize trajectories within high-reward regions [39] and can utilize full observability [38], yet transfers from an expert in a purely supervised fashion. Our second sidekick also demonstrates good action sequences, but we specifically account for the observability imbalance by annealing supervision over time.

More closely related to our goal is the asymmetric actor critic, which leverages synthetic images to train a robot to pick/push an object [48]. Full state information from the graphics engine is exploited to better train the critic. While this approach modifies the advantage expected for a state like our first sidekick, this is only done at the task level. Our sidekick injects a different perspective by solving simpler versions of the task, leading to better performance (Sec. 4.2).

Policy visualization: Methods for post-hoc explanation of deep networks are gaining attention due to their complexity and limited interpretability. In supervised learning, heatmaps indicating regions of an image most responsible for a decision are generated via backprop of the gradient for a class label [55, 15, 52]. In reinforcement learning, policies for visual tasks (like Atari) are visualized using t-SNE maps [69] or heatmaps highlighting the parts of a current observation that are important for selecting an action [18]. We introduce a policy visualization method that reflects the influence of an agent’s cumulative observations on its action choices, and use it to illuminate the role of sidekicks.

3 Approach

Our goal is to learn a policy for controlling an agent’s camera motions such that it can explore novel environments and objects efficiently. Our key insight is to facilitate policy learning via sidekicks that exploit 1) full observability and 2) unlimited time steps to solve a simpler problem in a preparatory training phase.

We first formalize the problem setup in Sec. 3.1. After overviewing observation completion as a means of active exploration in Sec. 3.2, we introduce our sidekick learning framework in Sec. 3.3. We tie together the observation completion and sidekick components with the overall learning objective in Sec. 3.4. Finally, we present our policy visualization technique in Sec. 3.5.

3.1 Problem setup: active visual exploration

The problem setting builds on the “learning to look around” challenge introduced in [30]. Formally, the task is as follows. The agent starts by looking at a novel environment (or object) from some unknown viewpoint (see footnote 2). It has a budget of time to explore the environment. The learning objective is to minimize the error in the agent’s pixelwise reconstruction of the full, mostly unobserved, environment using only the sequence of views selected within that budget.

Footnote 2: For simplicity of presentation, we represent an environment as where the agent explores a novel scene, looking outward in new viewing directions. However, experiments will also use as an object where the agent moves around an object, looking inward at it from new viewing angles.

Following [30], we discretize the environment into a set of candidate viewpoints. In particular, the space of viewpoints is a viewgrid indexed by elevations and azimuths, denoted by , where is the 2D view of from viewpoint , which is comprised of two angles. More generally, could capture both camera angle and position; however, to best exploit existing datasets, we limit camera motions to rotations.

The agent expends the budget in discrete increments, called “glimpses”, by selecting camera motions in sequence. At each time step, the agent gets observation from the current viewpoint. The agent makes an exploratory rotation () based on its policy . When the agent executes action , the viewpoint changes according to . For each camera motion executed by the agent, a reward is provided by the environment (Sec. 3.3.1 and 3.4). Using the view , the agent updates its internal representation of the environment, denoted . Because camera motions are restricted to have proximity to the current camera angle (Sec. 4.1) and candidate viewpoints partially overlap, the discretization promotes efficiency without neglecting the physical realities of the problem (following [43, 29, 30, 31]).

3.2 Recurrent observation completion network

We start with the deep RL neural network architecture proposed in [30] to represent the agent’s recurrent observation completion. The process is deemed “completion” because the agent strives to hallucinate portions of the environment it has not yet seen. It consists of five modules: Sense, Fuse, Aggregate, Decode, and Act with parameters , , , and respectively.

  • Sense: Independently encodes the view () and proprioception () consisting of elevation at time and relative motion from time to , and returns the encoded tuple .

  • Fuse: Consists of fully connected layers that jointly encode the tuple and output a fused representation .

  • Aggregate: An LSTM that aggregates fused inputs over time to build the agent’s internal representation of .

  • Decode: A convolutional decoder which reconstructs the viewgrid as a set of feature maps ( for 3 channeled images) corresponding to each view of the viewgrid.

  • Act: Given the aggregated state and proprioception , the Act module outputs a probability distribution over the candidate camera motions . An action sampled from this distribution is executed.

At each time step, the agent receives and encodes a new view , then updates its internal representation by sensing, fusing, and aggregating. It decodes the viewgrid and executes to change the viewpoint. It repeats the above steps until the time budget is reached (see Fig. 2). See Supp. for implementation details and architecture diagram.

Figure 2: Active observation completion. The agent receives one view (shown in red), updates its belief and reconstructs the viewgrid at each time step. It executes an action (red arrows) according to its policy to obtain the next view. The active agent must rapidly refine its belief with well-chosen views.

3.3 Sidekick definitions

Sidekicks provide a preparatory learning phase that informs policy learning. Sidekicks have full observability during training: in particular, they can observe the results of arbitrary camera motions in arbitrary sequence. This is impossible for the actual look-around agent—who must enter novel environments and respect physical camera motion and budget constraints—but it is practical for the sidekick with fully observed training samples (e.g., a panoramic image or 3D object model, cf. Sec. 4.1). Sidekicks are trained to solve a simpler problem with relevance to the ultimate look-around agent, serving to accelerate training and help the agent converge to better policies. In the following, we define two sidekick variants: a reward-based sidekick and a demonstration-based sidekick.

3.3.1 Reward-based sidekick

The reward-based sidekick aims to identify a set of views which can provide maximal information about the environment . The sidekick is allowed to access and select views without any restrictions. Hence, it addresses a simplified completion problem.

A candidate view is scored based on how informative it is, i.e., how well the entire environment can be reconstructed given only that view. We train a completion model (cf. Sec. 3.2) that can reconstruct from any single view (i.e., we set ). Let denote the decoded reconstruction for given only view as input. The sidekick scores the information in observation as:

(1)

where denotes the reconstruction error and is the fully observed environment. We use a simple loss on pixels for to quantify information. Higher-level losses, e.g., for detected objects, could be employed when available. The scores are normalized to lie in across the different views of . The sidekick scores each candidate view. Then, in order to sharpen the effects of the scoring function and avoid favoring redundant observations, the sidekick selects the top most informative views with greedy non-maximal suppression. It iteratively selects the view with the highest score and suppresses all views in the neighborhood of that view until views are selected (see Supp. for details). This yields a map of favored views for each training environment. See Fig 3, top row.
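The scoring and greedy non-maximal suppression steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the inverse-error scoring, the min-max normalization, the suppression radius, and the azimuth wrap-around are all assumptions, and the function names `normalize_scores` and `greedy_nms` are hypothetical (the paper defers the exact procedure to the Supp.).

```python
import numpy as np

def normalize_scores(errors):
    """Turn per-view reconstruction errors into scores in [0, 1].

    Assumption: informativeness is taken as inverse error, then min-max
    normalized across the views of one environment.
    """
    info = 1.0 / (np.asarray(errors, dtype=float) + 1e-8)
    return (info - info.min()) / (info.max() - info.min() + 1e-8)

def greedy_nms(scores, k, radius=1):
    """Select up to k views on an (elevation, azimuth) grid.

    Repeatedly takes the highest-scoring remaining view and suppresses
    every view within `radius` grid cells of it. Azimuth is assumed to
    wrap around; elevation is not.
    """
    n_elev, n_azim = scores.shape
    remaining = scores.astype(float).copy()
    selected = []
    for _ in range(k):
        if np.all(np.isneginf(remaining)):
            break  # every view is already selected or suppressed
        e, a = np.unravel_index(np.argmax(remaining), remaining.shape)
        selected.append((int(e), int(a)))
        for de in range(-radius, radius + 1):
            for da in range(-radius, radius + 1):
                ee = e + de
                if 0 <= ee < n_elev:
                    remaining[ee, (a + da) % n_azim] = -np.inf
    return selected
```

The returned list plays the role of the map of favored views: a view adjacent to an already selected one is skipped even if its own score is high, which is what discourages redundant observations.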

The sidekick conveys the results to the agent during policy learning in the form of an augmented reward (to be defined in Sec. 3.4). Thus, the reward-based sidekick previews observations and encourages the selection of those individually valuable for reconstruction. Note that while the sidekick indexes views in absolute angles, the agent will not; all its observations are relative to its initial (random) glimpse direction. This works because the sidekick becomes a part of the environment, i.e., it attaches rewards to the true views of the environment. In short, the reward-based sidekick shapes rewards based on its exploration with full observability.

Figure 3: Top left shows the environment’s viewgrid, indexed by viewing elevation and azimuth. Top: Reward sidekick scores individual views based on how well they alone permit inference of the viewgrid (Eq 1). The grid of scores (center) is post-processed with non-max suppression to prioritize non-redundant views (right), then is used to shape the agent’s rewards. Bottom: Demonstration sidekick. Left “grid-of-grids” displays example coverage score maps (Eq 2) for all view pairs. The outer grid considers each , and each inner grid considers each for the given (bottom left). A pixel in that grid is bright if coverage is high for given , and dark otherwise. Each denotes an (elevation, azimuth) pair. While observed views and their neighbors are naturally recoverable (brighter), the sidekick uses broader environment context to also anticipate distant and/or different-looking parts of the environment, as seen by the non-uniform spread of scores in the left grid. Given the coverage function and a starting position, this sidekick selects actions to greedily optimize the coverage objective (Eq 3). The bottom right strip shows the cumulative coverage maps as each of the =4 glimpses is selected.

3.3.2 Demonstration-based sidekick

Our second sidekick generates trajectories of informative views. Given a starting view in , the demonstration sidekick selects a trajectory of views that are deemed to be most informative about . Unlike the reward-based sidekick above, this sidekick offers guidance with respect to a starting state, and it is subject to the same camera motion restrictions placed on the main agent. Such restrictions model how an agent cannot teleport its camera using one unit of effort.

To identify informative trajectories, we first define a scoring function that captures coverage. Coverage reflects how much information contains about each view in . The coverage score for view upon selecting view is:

(2)

where denotes an inferred view within , as estimated using the same completion network used by the reward-based sidekick. Coverage scores are normalized to lie in for .

(3)

The goal of the demonstration sidekick is to maximize the coverage objective (Eqn. 3), where denotes the sequence of selected views, and saturates at 1. In other words, it seeks a sequence of reachable views such that all views are “explained” as well as possible. See Fig. 3, bottom panel.

The policy of the sidekick () is to greedily select actions based on the coverage objective. The objective encourages the sidekick to select views such that the overall information obtained about each view in is maximized.

(4)
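A rough sketch of the greedy demonstration sidekick follows. It is one reading of the text rather than the authors' code: the coverage matrix layout `cov[j, i]` (coverage of view i given observed view j), the per-view saturation at 1 after summing over selected views, the neighborhood size, and the azimuth wrap-around are assumptions, and all function names are hypothetical.

```python
import numpy as np

def coverage_objective(cov, selected):
    """Total coverage of all views given the selected views.

    cov[j, i] is the (normalized) coverage of view i when view j is
    observed. Per-view coverage is accumulated over the selected views
    and saturates at 1 (assumed form of the objective).
    """
    total = cov[selected].sum(axis=0)
    return float(np.minimum(total, 1.0).sum())

def reachable(view, n_elev, n_azim, d_elev=1, d_azim=2):
    """Flat indices of views reachable in one unit-cost camera motion."""
    e, a = divmod(view, n_azim)
    out = set()
    for de in range(-d_elev, d_elev + 1):
        for da in range(-d_azim, d_azim + 1):
            ee = e + de
            if 0 <= ee < n_elev:
                out.add(ee * n_azim + (a + da) % n_azim)
    return sorted(out)

def greedy_trajectory(cov, start, n_elev, n_azim, budget):
    """Greedily extend the trajectory with the reachable view that
    maximizes the coverage objective of the views selected so far."""
    traj = [start]
    for _ in range(budget - 1):
        candidates = reachable(traj[-1], n_elev, n_azim)
        best = max(candidates, key=lambda v: coverage_objective(cov, traj + [v]))
        traj.append(best)
    return traj
```

Because of the saturation, revisiting a view that is already fully explained adds nothing to the objective, so the greedy choice is pushed toward views that explain still-uncovered parts of the viewgrid.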

We use these sidekick-generated trajectories as supervision to the agent for a short preparatory period. The goal is to initialize the agent with useful insights learned by the sidekick to accelerate training of better policies. We achieve this through a hybrid training procedure that combines imitation and reinforcement. In particular, for the first time steps, we let the sidekick drive the action selection and train the policy based on a supervised objective. For steps to , we let the agent’s policy drive the action selection and use REINFORCE [64] or Actor-Critic [59] to update the agent’s policy (see Sec. 4). We start with and gradually reduce it to in the preparatory sidekick phase (see Supp.). This step relates to behavior cloning [8, 17, 14], which formulates policy learning as supervised action classification given states. However, unlike typical behavior cloning, the sidekick is not an expert. It solves a simpler version of the task, then backs away as the agent takes over to train with partial observability.
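The hybrid schedule can be illustrated with a short sketch. The linear decay, the epoch-based granularity, and the names `supervised_steps`, `rollout_plan` and `prep_epochs` are assumptions made for illustration; the actual annealing schedule is described in the Supp.

```python
def supervised_steps(epoch, prep_epochs, budget):
    """Number of initial time steps driven by the sidekick in this epoch.

    Assumption: the count decays linearly from `budget` to 0 over the
    preparatory phase of `prep_epochs` epochs, then stays at 0.
    """
    if epoch >= prep_epochs:
        return 0
    return round(budget * (1 - epoch / prep_epochs))

def rollout_plan(epoch, prep_epochs, budget):
    """Who selects the action at each time step of an episode.

    'sidekick' steps would be trained with the supervised objective,
    'agent' steps with the policy-gradient update.
    """
    t_sup = supervised_steps(epoch, prep_epochs, budget)
    return ["sidekick"] * t_sup + ["agent"] * (budget - t_sup)
```

For example, with a budget of 4 steps and a 4-epoch preparatory phase, the first epoch is fully sidekick-driven, and the agent takes over one additional step per epoch until it acts entirely on its own.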

3.4 Policy learning with sidekicks

Having defined the two sidekick variants, we now explain how they influence policy learning. The goal is to learn the policy which returns a distribution over actions for the aggregated internal representation at time . Let denote the set of camera motions available to the agent.

Our agent seeks the policy that minimizes reconstruction error for the environment given a budget of camera motions (views). If we denote the set of weights of the network by and excluding by and excluding by , then the overall weight update is:

(5)

where is the number of training samples, indexes over the training samples, and are constants and and update all parameters except and , respectively. The pixel-wise MSE reconstruction loss () and corresponding weight update at time are given in Eqn. 6, where denotes the reconstructed view at viewpoint and time , and denotes the offset to account for the unknown starting azimuth (see [30]).

(6)

The agent’s reward at time (see Eqn. 7) consists of the intrinsic reward from the sidekick (see Sec. 3.3.1) and the negated final reconstruction loss ().

(7)
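As a hedged illustration of the reward structure just described, the sketch below adds the sidekick's per-step reward at every step and subtracts the final reconstruction loss at the last step. Placing the loss term only at the final step and using undiscounted returns are assumptions; the function names are hypothetical.

```python
def episode_rewards(sidekick_rewards, final_recon_loss):
    """Per-step rewards for one episode.

    Each step receives the sidekick's intrinsic reward for the view
    reached; the last step additionally receives the negated final
    reconstruction loss (assumed placement).
    """
    rewards = [float(r) for r in sidekick_rewards]
    rewards[-1] -= final_recon_loss
    return rewards

def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]
```

Without the sidekick term, every step of the episode would see only the single end-of-episode signal; the shaped reward gives earlier steps feedback of their own.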

The update from the policy (see Eqn. 8) consists of the REINFORCE update, with a baseline to reduce variance, and supervision from the demonstration sidekick (see Eqn. 9). We consider both REINFORCE [64] and Actor-Critic [59] methods to update the Act module. For the latter, the policy term additionally includes a loss to update a learned Value Network (see Supp.). For both, we include a standard entropy term to promote diversity in action selection and avoid converging too quickly to a suboptimal policy.

(8)

The demonstration sidekick influences policy learning via a cross entropy loss between the sidekick’s policy (cf. Sec. 3.3.2) and the agent’s policy :

(9)

We pretrain the Sense, Fuse, and Decode modules with . The full network is then trained end-to-end (with Sense and Fuse frozen). For training with sidekicks, the agent is augmented either with additional rewards from the reward sidekick (Eqn. 7) or an additional supervised loss from the demonstration sidekick (Eqn. 9). As we will show empirically, training with sidekicks helps overcome uncertainty due to partial observability and learn better policies.

3.5 Visualizing the learned motion policies

Finally, we propose a visualization technique to qualitatively understand the policy that has been learned. The aggregated state is used by the policy network to determine the action probabilities. To analyze which part of the agent’s belief () is important for the current selected action , we solve for the change in the aggregated state () which maximizes the change in the predicted action distribution ():

(10)

where is a constant that limits the deviation in norm from the true belief. Eqn. 10 is maximized using gradient ascent (see Supp.). This change in belief is visualized in the viewgrid space by forward propagating through the Decode module. The visualized heatmap intensities () are defined as follows:

(11)

The heatmap indicates which parts of the agent’s belief would have to change to affect its action selection. The views with high intensity are those that affect the agent’s action selection the most.
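The belief-perturbation idea can be sketched on a toy policy. Everything here is an assumption made for illustration: the linear-softmax policy standing in for the Act module, the squared-difference measure of change in the action distribution, the finite-difference gradient, and the projection used to enforce the norm constraint. The step of decoding the perturbation into viewgrid space is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy(h, W):
    """Toy stand-in for the Act module: linear layer plus softmax."""
    return softmax(W @ h)

def objective(delta, h, W):
    """Squared change in the action distribution caused by delta (assumed form)."""
    return float(np.sum((policy(h + delta, W) - policy(h, W)) ** 2))

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

def belief_perturbation(h, W, c=0.1, lr=0.5, iters=200, seed=0):
    """Projected gradient ascent on the objective, keeping
    ||delta|| <= c * ||h||. Starts from a small random delta because
    the gradient of the squared difference is zero at delta = 0."""
    rng = np.random.default_rng(seed)
    bound = c * np.linalg.norm(h)
    delta = 1e-3 * rng.standard_normal(h.shape)
    for _ in range(iters):
        delta = delta + lr * numerical_grad(lambda d: objective(d, h, W), delta)
        norm = np.linalg.norm(delta)
        if norm > bound:
            delta = delta * (bound / norm)  # project back onto the norm ball
    return delta
```

In the full method, the resulting perturbation would be passed through the Decode module, and the per-view magnitude of the change in the decoded viewgrid would give the heatmap intensities.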

4 Experiments

In Sec. 4.1 and 4.2, we describe our experimental setup and analyze the learning efficiency and test-time performance of different methods. In Sec. 4.3, we visualize learned policies and demonstrate the superiority of our policies over a baseline.

4.1 Experimental Setup

Datasets: We use two popular datasets to benchmark our models.

  • SUN360: SUN360 [66] consists of high-resolution spherical panoramas from multiple scene categories. We restrict our experiments to the 26-category subset used in [66, 30]. The viewgrid consists of 32×32 views captured across 4 elevations (- to ) and 8 azimuths ( to ). At each step, the agent sees a field-of-view. This dataset represents an agent looking out at a scene in a series of narrow field-of-view glimpses.

  • ModelNet Hard: ModelNet [65] provides a collection of 3D CAD models for different categories of objects. ModelNet-40 and ModelNet-10 are provided subsets consisting of 40 and 10 object categories respectively, the latter being a subset of the former. We train on objects from the 30 categories not present in ModelNet-10 and test on objects from the unseen 10 categories. We increase completion difficulty in “ModelNet Hard” by rendering with more challenging lighting conditions, textures and viewing angles than [30]; see Supp. It consists of views sampled from 5 elevations and 9 azimuths. This dataset represents an agent looking in at a 3D object and moving it to a series of selected poses.

For both datasets, the candidate motions are restricted to a 3 elevations × 5 azimuths neighborhood, representing the set of unit-cost actions. Neighborhood actions mimic real-world scenarios where the agent’s physical motions are constrained (i.e., no teleporting) and are consistent with recent active vision work [30, 43, 29, 28, 2]. The budget for number of steps is fixed to .
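The 3 × 5 neighborhood of unit-cost actions can be enumerated as below. This is a sketch under stated assumptions: clamping at the elevation limits and wrapping around in azimuth are not specified in the text, and `ACTIONS` and `step` are hypothetical names.

```python
from itertools import product

# 3 elevation offsets x 5 azimuth offsets -> 15 unit-cost camera motions
ACTIONS = list(product((-1, 0, 1), (-2, -1, 0, 1, 2)))

def step(viewpoint, action, n_elev, n_azim):
    """Apply a (d_elev, d_azim) motion to an (elev, azim) grid index.

    Assumptions: elevation is clamped to the grid, azimuth wraps around.
    """
    e, a = viewpoint
    de, da = action
    e = min(max(e + de, 0), n_elev - 1)
    a = (a + da) % n_azim
    return (e, a)
```

On the 4 × 8 SUN360 viewgrid, for instance, `step((3, 7), (1, 2), 4, 8)` stays at the top elevation and wraps to azimuth index 1.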

Baselines: We benchmark our methods against several baselines:

  • one-view: the agent trained to reconstruct from one view ().

  • rnd-actions: samples actions uniformly at random.

  • ltla [30]: our implementation of the “learning to look around” approach [30]. We verified our code reproduces results from [30].

  • rnd-rewards: naive sidekick where rewards are assigned uniformly at random on the viewgrid.

  • asymm-ac [48]: approach from [48] adapted for discrete actions. Critic sees the entire panorama/object and true camera poses (no experience replay).

  • demo-actions: actions selected by demo-sidekick while training / testing.

  • expert-clone: imitation from an expert policy that uses full observability (similar to critic in Fig. 2 of Supp.)

Evaluation: We evaluate reconstruction error averaged over uniformly sampled elevations, azimuths and all test samples (avg). To provide a worst case analysis, we also report an adversarial metric (adv), which evaluates each agent on its hardest starting positions in each test sample and averages over the test data.
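The two metrics can be computed as in the following sketch. It assumes an error array `errors[s, e, a]` holding the reconstruction error for test sample s when the agent starts at elevation e and azimuth a, and it takes the single worst starting position per sample for the adversarial metric; both the array layout and that reading of "hardest starting positions" are assumptions.

```python
import numpy as np

def avg_and_adv(errors):
    """Return (avg, adv) for errors indexed by [sample, elevation, azimuth].

    avg: mean over all starting positions and all test samples.
    adv: per-sample worst-case error over starting positions,
         averaged over test samples.
    """
    errors = np.asarray(errors, dtype=float)
    avg = errors.mean()
    adv = errors.reshape(errors.shape[0], -1).max(axis=1).mean()
    return float(avg), float(adv)
```

By construction adv is never smaller than avg, which matches the ordering of the two columns in the results.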

4.2 Active Exploration Results

Method          SUN360                              ModelNet Hard
                avg (×1000)       adv (×1000)       avg (×1000)       adv (×1000)
                mean    impr. %   mean    impr. %   mean    impr. %   mean    impr. %
one-view        38.31   -         55.12   -         9.63    -         17.10   -
rnd-actions     30.99   19.09     44.85   18.63     7.32    23.93     12.38   27.56
rnd-rewards     25.55   33.30     30.20   45.21     7.04    26.89     9.66    43.50
ltla [30]       24.94   34.89     31.86   42.19     6.30    34.57     8.78    48.65
asymm-ac [48]   23.74   38.01     29.92   45.72     6.24    35.20     8.55    50.00
expert-clone    23.98   37.38     28.50   48.28     6.41    33.44     8.52    50.13
ours(rew)       23.44   38.82     28.54   48.22     5.80    39.79     7.17    58.04
ours(demo)      24.24   36.73     29.01   47.36     6.32    34.37     8.64    49.47
ours(rew)+ac    23.36   39.01     28.26   48.72     5.75    40.26     7.10    58.44
ours(demo)+ac   24.05   37.22     28.52   48.26     6.13    36.31     8.26    51.64
demo-actions*   26.12   31.82     31.53   42.76     5.82    39.50     7.46    56.40

Table 1: Avg/Adv MSE errors (lower is better) and corresponding improvements (%) over the one-view model (higher is better), for the two datasets. The best and second best performing models are highlighted in green and blue respectively. Standard errors range from 0.2 to 0.3 on SUN360 and 0.1 to 0.2 on ModelNet Hard. (* - requires full observability at test time)

Table 1 shows the results on both datasets. For each metric, we report the mean error along with the percentage improvement over the one-view baseline. Our methods are abbreviated ours(rew) and ours(demo) referring to the use of our reward- and demonstration-based sidekicks, respectively. We denote the use of Actor-Critic instead of REINFORCE with +ac.
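The percentage improvement reported next to each mean error appears to be the relative error reduction with respect to one-view. A small sketch (the function name is hypothetical) makes the relationship explicit:

```python
def improvement_over_one_view(err, one_view_err):
    """Relative error reduction (%) with respect to the one-view baseline."""
    return 100.0 * (one_view_err - err) / one_view_err
```

Applied to the rounded SUN360 avg errors, ltla (24.94 vs. 38.31) gives roughly 34.9%, close to the tabulated 34.89. Small differences in the last digit are to be expected if the table was computed from unrounded errors.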

We observe that ours(rew) and ours(demo) with REINFORCE generally perform better than ltla with REINFORCE [30]. In particular, ours(rew) performs significantly better than ltla on both datasets on all metrics. ours(demo) performs better on SUN360, but is only slightly better on ModelNet Hard. Figure 4 shows the validation loss plots; using the sidekicks leads to significant improvement in the convergence rate over ltla.

Figure 5 compares example decoded reconstructions. We stress that the vast majority of pixels are unobserved when decoding the belief state, i.e., only views out of the entire viewing sphere are observed. Accordingly, they are blurry. Regardless, their differences indicate the differences in belief states between the two methods. A better policy more quickly fleshes out the general shape of the scene or object.

Next, we compare our model to asymm-ac, which is an alternate paradigm for exploiting full observability during training. First, we note that asymm-ac performs better than ltla across all datasets and metrics, making it a strong baseline. Comparing asymm-ac with ours(rew)+ac and ours(demo)+ac, we see our methods still perform considerably better on all metrics and datasets. As we show in the Supp., our methods also lead to faster convergence.

In order to contrast learning from sidekicks with learning from experts, we additionally compare our models to behavior cloning an expert that exploits full observability at training time. As shown in Tab. 1, ours(rew) outperforms expert-clone on both datasets, validating the strength of our approach. This is particularly interesting because training an expert takes a lot longer than training sidekicks (see Supp.). When compared with demo-actions, an ablated version of ours(demo) that requires full observability at test time, our performance is still significantly better on SUN360 and slightly better on ModelNet Hard. ours(rew) and ours(demo) also beat the remaining baselines by a significant margin. These results verify our hypothesis that sidekick policy learning can improve over strong baselines by exploiting full observability during training.

Figure 4: Validation errors vs. epochs on SUN360 (left) and ModelNet Hard (right). All models shown here use REINFORCE (see Supp. for more curves). Our approach accelerates convergence.

Figure 5: Qualitative comparison of ours(rew) vs. ltla [30] on SUN360 (first 2 rows) and ModelNet Hard (last 2 rows). The first column shows the groundtruth viewgrid and a randomly selected starting point (marked in red). The 2nd and 3rd columns contain the decoded viewgrids from ltla and ours(rew) after time steps. The reconstructions from ours(rew) are visibly better. For example, in the row, our model reconstructs the protrusion more clearly; in the row, our model reconstructs the sky and central hills more effectively. Best viewed on pdf with zoom.

4.3 Policy Visualization

We present our policy visualizations for ltla and ours(rew) on SUN360 in Figure 6; see Supp. for examples with ours(demo). The heatmap from Eq 10 is shown in pink and overlaid on the reconstructed viewgrids. For both models, the policies tend to take actions that move them towards views which have low heatmap density, as witnessed by the arrows / actions pointing to lower density regions. Intuitively, the agents move towards the views that are not contributing effectively to their action selection to increase their understanding of the scene. It can be observed in many cases that the ours(rew) model has a much denser heatmap across time when compared to ltla. Therefore, ours(rew) takes more views into account for selecting its actions earlier in the trajectory, suggesting that a better policy and history aggregation leads to more informed action selection.

Figure 6: Policy visualization: The viewgrid reconstructions of ours(rew) and ltla [30] are shown on two examples from SUN360. The first column shows the viewgrid with a randomly selected view (in red). Subsequent columns show the view received (in red), viewgrid reconstructed, action selected (red arrow), and the parts of the belief space our method deems responsible for the action selection (pink heatmap). Both agents tend to move towards sparser regions of the heatmap, attempting to improve their beliefs about views that do not contribute to their action selection. ours(rew) improves its beliefs much more rapidly and, as a result, performs more informed action selection.

5 Conclusion

We propose sidekick policy learning, a framework to leverage extra observability or fewer restrictions on an agent’s motion during training to learn better policies. We demonstrate the superiority of policies learned with sidekicks on two challenging datasets, improving over existing methods and accelerating training. Further, we utilize a novel policy visualization technique to illuminate the different reasoning behind policies trained with and without sidekicks. In future work, we plan to investigate the effectiveness of our framework on other active vision tasks such as recognition and navigation.

Acknowledgements

The authors thank Dinesh Jayaraman, Thomas Crosley, Yu-Chuan Su, and Ishan Durugkar for helpful discussions. This research is supported in part by DARPA Lifelong Learning Machines, a Sony Research Award, and an IBM Open Collaborative Research Award.

References

  • [1] Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. International Journal of Computer Vision (1988)
  • [2] Ammirato, P., Poirson, P., Park, E., Košecká, J., Berg, A.C.: A dataset for developing and benchmarking active vision. In: Robotics and Automation, 2017 IEEE International Conference on (2017)
  • [3] Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems (2017)
  • [4] Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)
  • [5] Bajcsy, R.: Active perception. Proceedings of the IEEE (1988)
  • [6] Ballard, D.H.: Animate vision. Artificial intelligence (1991)
  • [7] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems (2016)
  • [8] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
  • [9] Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: Computer Vision, 2015 IEEE International Conference on (2015)
  • [10] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied Question Answering. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
  • [11] Dhillon, S.S., Chakrabarty, K.: Sensor placement for effective coverage and surveillance in distributed sensor networks. In: Wireless Communications and Networking, 2003. WCNC 2003. 2003 IEEE (2003)
  • [12] Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. In: International Conference on Learning Representations (2017)
  • [13] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on Robot Learning (2017)
  • [14] Duan, Y., Andrychowicz, M., Stadie, B., Ho, O.J., Schneider, J., Sutskever, I., Abbeel, P., Zaremba, W.: One-shot imitation learning. In: Advances in Neural Information Processing Systems (2017)
  • [15] Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Computer Vision, 2017 IEEE International Conference on (2017)
  • [16] Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (2016)
  • [17] Giusti, A., Guzzi, J., Cireşan, D.C., He, F.L., Rodríguez, J.P., Fontana, F., Faessler, M., Forster, C., Schmidhuber, J., Di Caro, G., et al.: A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters (2016)
  • [18] Greydanus, S., Koul, A., Dodge, J., Fern, A.: Visualizing and understanding atari agents. CoRR (2017)
  • [19] Guo, X., Singh, S., Lee, H., Lewis, R.L., Wang, X.: Deep learning for real-time atari game play using offline monte-carlo tree search planning. In: Advances in Neural Information Processing Systems (2014)
  • [20] Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: Computer Vision and Pattern Recognition, 2017 IEEE Conference on (2017)
  • [21] Gupta, S., Fouhey, D., Levine, S., Malik, J.: Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125 (2017)
  • [22] Gupta, S., Hoffman, J., Malik, J.: Cross modal distillation for supervision transfer. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
  • [23] Harutyunyan, A., Devlin, S., Vrancx, P., Nowe, A.: Expressing arbitrary reward functions as potential-based advice. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
  • [24] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  • [25] Hong, S., Noh, H., Han, B.: Decoupled deep neural network for semi-supervised semantic segmentation. In: Advances in neural information processing systems (2015)
  • [26] Hong, S., Oh, J., Lee, H., Han, B.: Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
  • [27] Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D., Kavukcuoglu, K.: Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397 (2016)
  • [28] Jayaraman, D., Grauman, K.: End-to-end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
  • [29] Jayaraman, D., Grauman, K.: Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion. In: European Conference on Computer Vision (2016)
  • [30] Jayaraman, D., Grauman, K.: Learning to look around: Intelligently exploring unseen environments for unknown tasks. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
  • [31] Johns, E., Leutenegger, S., Davison, A.J.: Pairwise decomposition of image sequences for active multi-view recognition. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
  • [32] Kim, A., Eustice, R.M.: Perception-driven navigation: Active visual slam for robotic area coverage. In: Robotics and Automation, 2013 IEEE International Conference on (2013)
  • [33] Knox, W.B., Stone, P.: Interactively shaping agents via human reinforcement: The tamer framework. In: Proceedings of the fifth international conference on Knowledge capture (2009)
  • [34] Knox, W.B., Stone, P.: Combining manual feedback with subsequent mdp reward signals for reinforcement learning. In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (2010)
  • [35] Knox, W.B., Stone, P.: Reinforcement learning from simultaneous human and mdp reward. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (2012)
  • [36] Krause, A., Guestrin, C.: Near-optimal observation selection using submodular functions. In: AAAI (2007)
  • [37] Lapin, M., Hein, M., Schiele, B.: Learning using privileged information: Svm+ and weighted svm. Neural Networks (2014)
  • [38] Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research (2016)
  • [39] Levine, S., Koltun, V.: Guided policy search. In: International Conference on Machine Learning (2013)
  • [40] Levine, S., Pastor, P., Krizhevsky, A., Quillen, D.: Learning hand-eye coordination for robotic grasping with large-scale data collection. In: Kulić, D., Nakamura, Y., Khatib, O., Venture, G. (eds.) 2016 International Symposium on Experimental Robotics (2017)
  • [41] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (2014)
  • [42] Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine intelligence (2011)
  • [43] Malmir, M., Sikka, K., Forster, D., Movellan, J.R., Cottrell, G.: Deep q-learning for active recognition of germs: Baseline performance on a standardized dataset for active learning. In: British Machine Vision Conference (2015)
  • [44] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016)
  • [45] Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems (2014)
  • [46] Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., Levine, S.: Combining self-supervised learning and imitation for vision-based rope manipulation. In: Robotics and Automation, 2017 IEEE International Conference on (2017)
  • [47] Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning (2017)
  • [48] Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., Abbeel, P.: Asymmetric actor critic for image-based robot learning. Robotics: Science and Systems (2018)
  • [49] Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: Robotics and Automation, 2016 IEEE International Conference on (2016)
  • [50] Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (2011)
  • [51] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015)
  • [52] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Computer Vision, 2017 IEEE International Conference on (2017)
  • [53] Sharmanska, V., Quadrianto, N., Lampert, C.H.: Learning to rank using privileged information. In: Computer Vision, 2013 IEEE International Conference on. IEEE (2013)
  • [54] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature (2017)
  • [55] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
  • [56] Song, S., Zeng, A., Chang, A.X., Savva, M., Savarese, S., Funkhouser, T.: Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
  • [57] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  • [58] Spica, R., Giordano, P.R., Chaumette, F.: Active structure from motion: Application to point, sphere, and cylinder. IEEE Transactions on Robotics (2014)
  • [59] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction
  • [60] Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: Computer Vision and Pattern Recognition, 2017 IEEE Conference on (2017)
  • [61] Vapnik, V., Izmailov, R.: Learning with intelligent teacher. In: Symposium on Conformal and Probabilistic Prediction with Applications (2016)
  • [62] Wang, B.: Coverage problems in sensor networks: A survey. ACM Computing Surveys (CSUR) (2011)
  • [63] Wilkes, D., Tsotsos, J.K.: Active object recognition. In: Computer Vision and Pattern Recognition, 1992. IEEE Computer Society Conference on (1992)
  • [64] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. In: Reinforcement Learning (1992)
  • [65] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Computer Vision and Pattern Recognition, 2015 IEEE Conference on (2015)
  • [66] Xiao, J., Ehinger, K.A., Oliva, A., Torralba, A.: Recognizing scene viewpoint using panoramic place representation. In: Computer Vision and Pattern Recognition, 2012 IEEE Conference on (2012)
  • [67] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning (2015)
  • [68] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Computer Vision and Pattern Recognition, 2013 IEEE Conference on (2013)
  • [69] Zahavy, T., Ben-Zrihem, N., Mannor, S.: Graying the black box: Understanding dqns. In: International Conference on Machine Learning (2016)
  • [70] Zhu, Y., Gordon, D., Kolve, E., Fox, D., Fei-Fei, L., Gupta, A., Mottaghi, R., Farhadi, A.: Visual Semantic Planning using Deep Successor Representations. In: Computer Vision, 2017 IEEE International Conference on (2017)
  • [71] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: Robotics and Automation, 2017 IEEE International Conference on (2017)

6 Architectures and implementation details

Figure 7: Architecture for ltla baseline [30]. Note: G = M*N*C
Figure 8: Architecture for critics used in our Actor Critic training. The Partial Observability critic is used for ours(rew)+ac, ours(demo)+ac and the Full Observability critic is used for asymm-ac [48]. Note:

Before we review the architecture, we list out some key notations:

  • - proprioception input, consists of the relative change in elevation, azimuth from to and the absolute elevation at .

  • - augmented with absolute azimuth

  • - input view, dimensionality is where is the number of channels, is the image height and is the image width. For SUN360, and for ModelNet Hard, .

  • - number of azimuths in ( for SUN360 and for ModelNet Hard).

  • - number of elevations in ( for SUN360 and for ModelNet Hard).

We follow the same architecture (see Fig. 7) for the modules described in [30]. Models are implemented in PyTorch and layer naming conventions are accordingly followed (refer to http://pytorch.org/docs/master/nn.html). For all the Conv layers, filter size = 5, stride = 1 and zero padding = 2; for all the Deconv (aka transposed convolution) layers, filter size = 5, stride = 2, zero padding = 2 and output padding = 1.
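The effect of these layer settings on spatial resolution can be checked with the standard output-size formulas for convolution and transposed convolution (dilation 1). The sketch below is plain Python rather than the authors' code, and the helper names are hypothetical; it only illustrates that the stated Conv settings preserve the spatial size while the stated Deconv settings double it.

```python
def conv_out(size: int, kernel: int = 5, stride: int = 1, padding: int = 2) -> int:
    """Output length of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int = 5, stride: int = 2,
               padding: int = 2, output_padding: int = 1) -> int:
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# With the defaults above, a Conv layer keeps the resolution
# and a Deconv layer doubles it (input sizes here are illustrative).
same = conv_out(32)
doubled = deconv_out(16)
```

This is why the padding of 2 pairs with a filter size of 5, and why the Deconv layers need an output padding of 1 to reach exactly twice the input size.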

We have two critic architectures for our experiments (see Fig. 8). The critic with partial observability consists of a similar architecture as the Act module. The critic with full observability takes in the absolute position on the viewgrid and the entire viewgrid as additional inputs. Each view of the viewgrid is processed by the Sense module (to give ) and the encoded views are fused together using two FC layers. This aggregated state, proprioception input, absolute position, and fused viewgrid are concatenated and processed by the critic to obtain the value of the current view.

We use the Adam optimizer with a learning rate of , weight decay of 1e-6, and other default settings from PyTorch (refer to http://pytorch.org/docs/master/optim.html). We also set and based on grid search. In the case of the demonstration-based sidekick, we decay from to after every 50 epochs. For the reward-based sidekick, we decay the rewards by a factor of after every epochs (selected based on grid search). All the models are trained for epochs. For the reward-based sidekick, we use a non-maximal suppression neighborhood of and views for SUN360, and neighborhood of and views for ModelNet Hard. The neighborhood and number of views were selected manually upon brief visual inspection to ensure sufficient spread of rewards on the viewgrid.
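The reward decay described above amounts to a step schedule: the sidekick reward scale is multiplied by a fixed factor once every fixed number of epochs. A minimal sketch follows, assuming a multiplicative step decay; the function name, the factor, and the interval are placeholders, since the actual values are not given in this text.

```python
def step_decay(initial: float, factor: float, interval: int, epoch: int) -> float:
    """Scale `initial` by `factor` once for every `interval` completed epochs."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return initial * factor ** (epoch // interval)


# Hypothetical settings: halve the sidekick reward scale every 50 epochs.
scales = [step_decay(1.0, 0.5, 50, e) for e in (0, 49, 50, 125)]
```

The scale stays constant within each interval, so the agent sees a piecewise-constant sidekick reward that fades over training and leaves the task reward to dominate later epochs.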

To solve for from Eq. in the main paper, we use stochastic gradient descent with learning rate of , weight decay of and momentum of . We run the optimization for a maximum of 200 iterations, and perform early stopping if crosses . The parameters were selected to increase the chances of the probability change being maximised.

7 Additional policy learning details

Let the weights of the critic be denoted by . Following standard actor-critic training, a regression loss over the critic’s value prediction is additionally used to update the agent’s parameters, specifically, :

(12)

where is the number of data samples and is the value estimated by the Value network at time for the data sample. We additionally include a standard entropy term to promote diversity in action selection and avoid converging too quickly to a suboptimal policy. The loss term and the corresponding weight update (on ) are as follows:

(13)
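For readers unfamiliar with these two terms, the sketch below shows their usual textbook form: a mean squared error between observed returns and the critic's value predictions, and the entropy of the action distribution. This is an assumption about the general shape of such losses, not a reproduction of Eq. 12 and 13, and the function names are hypothetical.

```python
import math


def value_regression_loss(returns, values):
    """Mean squared error between observed returns and critic value estimates."""
    if not returns or len(returns) != len(values):
        raise ValueError("returns and values must be non-empty and equal length")
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative values only.
critic_loss = value_regression_loss([1.0, 2.0], [1.0, 0.0])
uniform_entropy = policy_entropy([0.5, 0.5])
```

An entropy bonus is largest for a uniform policy and zero for a deterministic one, which is why adding it to the objective discourages premature collapse onto a single action.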

8 Validation plots

Fig. 4 in the main paper shows the validation error plots for both datasets to compare the speed of learning for our method trained with REINFORCE vs. ltla [30] trained with REINFORCE. Here, Fig. 9 shows the parallel validation error plots comparing ours(rew), ours(demo) and asymm-ac [48] using Actor-Critic. Note that separating by REINFORCE vs. Actor-Critic ensures both sets of plots are apples-to-apples. The bump in the yellow curve on the SUN360 plot reflects how the demonstration schedule changes over epochs.

Figure 9: Validation errors vs. epochs on SUN360 and ModelNet Hard (Actor Critic methods)

9 ModelNet Hard construction

As noted in the paper, we altered the sampling angles, lighting conditions, and object materials to increase the reconstruction difficulty of the rendered images. In Fig. 10, we render the same object using settings similar to [30] and our settings from ModelNet Hard.

The rendering details are as follows. We sampled the angles at intervals of (as opposed to in [30]) to reduce the number of views which were similar in appearance and geometry. We further altered the lighting positions to be non-uniform across views and used higher specularity to generate complex renderings. Specifically, we use two light sources, each placed below and above the object. The exact coordinates are selected relative to the size of the object. Each light source is placed randomly at one out of two locations for a given object. Using MATLAB’s rendering toolbox (refer to https://www.mathworks.com/help/matlab/visualize/lighting-overview.html), we render the objects with “interp” shading, “dull” material, “gouraud” face lighting, ambient strength of 0.4, diffuse strength of 0.9, specular strength of 0.7 and specular exponent of 15. The data is available to ensure reproducibility (http://vision.cs.utexas.edu/projects/sidekicks/).

Figure 10: An example to qualitatively compare the renderings of ModelNet Hard vs. ModelNet from [30]

10 Policy visualization examples

Fig. 11 visualizes the policy beliefs for ours(demo), ours(rew) and ltla on the SUN360 examples from the main paper and an additional example. Fig. 13 shows examples for ModelNet Hard.

Fig. 11 shows how all the models follow a similar behaviour of visiting regions with low heatmap densities, as indicated by the red arrows. This shows how the agents are often moving towards the views that are not yet contributing effectively to their action selection, to improve their understanding of the scene. The heatmap “density” serves as a high-level visual for the spread of the agent’s reasoning about its belief state as it influences its action selection: the greater the spread, the more its belief about the full unobserved environment is directing camera motion selections. The heatmap density of ours(demo) lies between that of ours(rew) and ltla, which is consistent with the quantitative performance observed (refer to the main paper). We also note the qualitative difference between the heatmaps of ours(demo) and ours(rew). While both have dense heatmaps across the entire viewgrid, ours(demo) appears to rely significantly more on its beliefs about the ground plane of the scene. However, there are cases where the visualizations are not conclusive in differentiating between the policies. As shown in Fig. 12, we can see that visualizations are dense across all models, and therefore, less conclusive.

In Fig. 13, we see our visualization is less effective in differentiating the policies on ModelNet Hard, possibly due to the narrower margins in the reconstruction errors for this dataset. However, it is interesting to note that the heatmap densities are better concentrated on the object for ours(rew) and ours(demo), whereas it often unnecessarily leaks to the background pixels for ltla.

Figure 11: Policy visualizations of ltla, ours(rew) and ours(demo) on four examples from SUN360. The policies tend to visit regions on the viewgrid with low heatmap densities in order to improve their belief about the environment. Better policies tend to more rapidly improve their beliefs, as witnessed by denser heatmaps. Best viewed on pdf with zoom.
Figure 12: Examples of less conclusive visualizations on SUN360, where ltla, ours(rew) and ours(demo) have similar heatmap densities. Best viewed on pdf with zoom.
Figure 13: Policy visualizations of ltla and ours(rew) on three examples from ModelNet Hard. Best viewed on pdf with zoom.

11 Additional training time for sidekicks

In order to account for additional training time required to train the sidekicks, we analyze the time taken for training various models and sidekicks. Since all models, including ltla, are pretrained with , the training overhead ( min) is identical for the baseline. Both sidekicks use the model to compute scores (see Sec. 3.4 from main paper), a one-time cost of min. To train for 500 epochs, ours(rew) and ours(demo) require and min, respectively, while ltla and asymm-ac take and min, respectively (averaged over 3 runs; experiments were run on an Intel(R) Xeon(R) CPU @ 1.70GHz system with a GeForce GTX 1080 GPU). Therefore, the additional training time for sidekicks is nominal in comparison to the overall training process. However, training the expert for expert-clone takes as long as it takes to train a full model ( minutes for 1000 epochs), which is the time required to pre-train at and pre-compute the sidekick scores.