Early Fusion for Goal Directed Robotic Vision

by Aaron Walsman, et al.

Increasingly, perceptual systems are being codified as strict pipelines wherein vision is treated as a pre-processing step that provides planners with a dense representation of the scene for high-level reasoning downstream. Problematically, this paradigm forces models to represent nearly every aspect of the scene even if it has no bearing on the task at hand. In this work, we flip this paradigm by introducing vision models whose feature representations are conditioned on embedded representations of the agent's goal. This allows the model to build scene descriptions that are specifically designed to help achieve that goal. We find this leads to models that learn faster, are substantially more parameter efficient, and are more robust than existing attention mechanisms in our domain. Our experiments are performed on a simulated robot item retrieval problem and trained in a fully end-to-end manner via imitation learning.





I Introduction

Robotics has benefited greatly from advances in computer vision, but sometimes our objectives have been misaligned. While the central question of computer vision is “tell me what you see”, ours is “do what I say.” In goal directed tasks, most of the scene is a distraction. When grabbing an apple, an agent only needs to care about the counter, table, or chairs if they interfere with accomplishing the goal. Additionally, when a robot learns through grounded interactions, architectures must be sample efficient in order to learn visual representations quickly for new environments. In this work we show how inverting the traditional perception pipeline (Vision → reasoning) to incorporate goal information early into the visual stream allows agents to jointly reason and perceive (Vision + Goal → action), yielding faster and more robust learning. While our approach is relatively simple, early fusion approaches have not been explored to the same extent as late fusion attention models and present many advantages in episodic goal-oriented tasks.

In this work we focus on the task of retrieving objects in a 3D environment. This task includes vocabulary learning, navigation, and scene understanding. Task completion requires computing action trajectories that satisfy the user’s requests and resolving 3D occlusions from a 2D image. Fast and efficient planners work well in the presence of ground-truth knowledge of the world. However, in practice, this ground-truth knowledge is difficult to obtain, and we must often settle for noisy estimates. Additionally, when many objects need to be collected or moved, the planning problem’s search space grows rapidly.

Our work is most closely related to recent advances in instruction following and visual attention [2, 3]. However, we focus on learning visual representations for goal-specific task completion without explicit supervision of object detection or classification. In order to isolate the goal-oriented visual learning component of this problem, we provide explicit goals instead of natural language questions and use imitation learning to provide action supervision. We show that early fusion of goal information in the visual processing pipeline (Early Fusion) outperforms more traditional approaches and learns faster. Furthermore, model accuracy does not degrade even when the number of parameters is reduced by several orders of magnitude (from 6M to 25K).

II Task Definition

Fig. 1: The agent was instructed to collect one spam can and one soap bottle. The agent aligns the camera (1–17) before executing the collect action on the jello box in frame 18. From here, the soap box is visible, so the agent aligns to it (19–28) and collects it on frame 29. Overlaid circles, arrows and boxes are for paper visualization purposes only.

Our task is to collect objects in a 3D scene as efficiently as possible. The agent is presented with a cluttered scene and a list of requested objects. Often there are multiple instances of the same object, or unrequested objects blocking the agent’s ability to reach a target. This forces the agent to develop detailed spatial reasoning to recognize the target objects, determine which is closest and finally remove obstructions as necessary. The list of remaining requested objects is presented to the agent at every time step, to avoid conflating scene understanding performance with issues of memory. The goal (Figure 1) is to train an agent that receives an image and a list of requested objects and produces the optimal next action.

II-A Simulation Environment: CHALET

Fig. 2: Our 16 object types.

Our environment consists of a tabletop setting with randomly placed objects, within a kitchen from the CHALET [4] house environment. Every episode consists of a randomly sampled environment which determines the set of objects (number, position, orientation and type) in addition to which subset will be requested. When there is more than one instance of a particular object, collecting any instance will satisfy the collection criteria, but one may be closer and require fewer steps to reach. Figure 2 shows the sixteen object types that we use for this task (six from CHALET and ten from the YCB dataset).

The objects are chosen randomly and placed at a random location on the table with a random upright orientation. Positions and orientations are sampled until a non-colliding configuration is found. A random subset of the instances on the table are used for the list of requested objects. This process allows the same object type to be requested multiple times if multiple of those objects exist in the scene. Additionally, random sampling means an object may serve as a target in one episode and a distractor in the next. The agent receives 128×128 pixel images of the world and has a 60° horizontal field of view, allowing it to see most of the workspace.

Fig. 3: Collecting the Jello (blue box) requires more steps than the peach (orange box) due to occluding objects.

Our agent consists of a first-person camera that can tilt up and down and pan left and right with additional collect, remove and idle actions. Each of the pan and tilt actions deterministically rotates the camera 2° in the specified direction. The collect action removes the nearest object that is within 3° of the center axis of the camera and registers the object as having been collected for the purposes of calculating the agent’s score. This region is visualized in Figure 1 as a magenta circle in the center of the frame. The remove action does the same thing as collect, but does not register the item as having been collected. This is used to remove superfluous items occluding the requested target. Finally, the idle action performs no action and should only be used once all requested items have been collected. All actions require one time step, therefore objects which are physically closer to the center of the camera may take more time to reach if they are occluded. For example, in Figure 3 the peach (orange box) requires fewer steps to collect than the Jello box (blue box). The precision required to successfully collect an object makes this a difficult task to master from visual data alone.

III Models

In our task, models must learn to ground visual representations of the world to the description of what to collect. How to best combine this information is a crucial modelling decision. Most multimodal approaches compute a visual feature map representing the contents of the entire image before selectively filtering based on the goal. This is commonly achieved using soft attention mechanisms developed in the language [5] and vision [6, 7, 8] communities.

Attention re-weights the image representation and leads to more informative gradients, helping models learn quickly and efficiently. Despite its successes, attention has important limitations. Most notably, because task specific knowledge (e.g. target information) is only incorporated late in the visual processing pipeline, the model must first build dense image representations that encode anything the attention might want to extract for all possible future goals. In complex scenes and tasks, this places a heavy burden on the initial stages of the vision system. In contrast, we present a technique that injects goal information early into the visual pipeline in order to build a task specific representation of the image from the bottom up. Our approach avoids the traditional bottleneck imposed on perception systems, and allows the model to discard irrelevant information immediately.

Below, we briefly describe the three models (Figure 4) we compare: Traditional approaches with delayed goal information (Late Fusion & Attention Map) versus our goal conditioned Early Fusion architecture.

Fig. 4: We compare a simple concatenation of visual and goal representations (Late Fusion), against two variations of the attention mechanism above, and Early Fusion to isolate the effects of when multimodal representations are formed.

III-A Late Fusion

Late Fusion constructs a single holistic representation of the entire image via a stack of convolution and pooling layers before concatenating an embedding of the requested objects in order to predict an action. An object embedding is computed using a simple linear layer designed to turn a one-hot encoding of the object into a dense representation. The complete request for multiple objects is computed as the sum of these individual object embeddings. This architecture forces the vision module to store semantic and spatial information about every object in the scene so the final fully connected layers can ground target objects and reason about actions.
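A minimal PyTorch sketch may make this structure concrete. The 9-channel stacked-frame input and the sum-of-embeddings request encoding follow the paper; the specific layer widths, number of layers, and action count here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Sketch of the Late Fusion baseline: vision and goal meet only at the end."""

    def __init__(self, num_object_types=16, embed_dim=32, channels=32, num_actions=7):
        super().__init__()
        # Conv + pool stack builds one holistic representation of the image.
        self.vision = nn.Sequential(
            nn.Conv2d(9, channels, 3, padding=1), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
        )
        # A single linear layer turns a one-hot object into a dense embedding;
        # a multi-object request is the sum of its object embeddings.
        self.embed = nn.Linear(num_object_types, embed_dim, bias=False)
        # Goal information is concatenated only here, after all vision layers.
        self.head = nn.Sequential(
            nn.Linear(channels * 32 * 32 + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, image, request_onehots):
        visual = self.vision(image)                    # (B, channels * 32 * 32)
        goal = self.embed(request_onehots).sum(dim=1)  # (B, embed_dim)
        return self.head(torch.cat([visual, goal], dim=1))
```

Because the goal arrives after the flatten, the convolutional stack must encode every object in the scene, which is exactly the burden the Early Fusion model below avoids.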

III-B Attention Map

We test traditional attention mechanisms over image regions. As with Late Fusion, the first step of this model is to pass the image through a stack of convolution layers. Rather than concatenate the request embedding directly onto the resulting representation, these models first compute an attention map over the spatial dimensions of the convolution output. This is accomplished by comparing the embedded target vector with each region of the convolutional feature map via a simple dot product. This provides a distribution over regions which can then be used to form the final image representation. This weighted representation is concatenated to the request in order to make an action decision. We test two attention models: Softmax Attention Map, which is defined above, and Attention Map, which is unnormalized. Using a softmax causes the model to remove the contribution of small entries and focus on fewer regions of the image. This is visualized in Figure 7.

In contrast to the Late Fusion model, the attention mechanism provides a filter on extraneous aspects of the image to simplify the control processing. In these models the grounding from image features to goal objects is done with a direct comparison operator (the dot product). These models are widely used for Visual Question Answering (VQA) problems on static images. We also explored more complex models [8] for computing attention maps, but found this traditional version worked the best in our setting and provided a strong baseline for comparison.

When compressing an entire image via a weighted sum, spatial information is lost from the final vector. To address this we concatenate a normalized grid ranging from -1 to 1 in 2D to our image representation for these models [9]. We only report results from the best performing architecture, which concatenates the grid directly to the initial image rather than later in the convolution stack. Grid information did not yield gains for the Early Fusion or Late Fusion models.
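The grid itself is simple to construct; a sketch of the variant we use, concatenating two normalized coordinate channels directly onto the input image [9]:

```python
import torch

def concat_coord_grid(image):
    """Append y and x coordinate channels, each normalized to [-1, 1].

    image: (B, C, H, W) -> (B, C + 2, H, W).
    Restores spatial information otherwise lost in the attention-weighted sum.
    """
    B, C, H, W = image.shape
    ys = torch.linspace(-1.0, 1.0, H).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.linspace(-1.0, 1.0, W).view(1, 1, 1, W).expand(B, 1, H, W)
    return torch.cat([image, ys, xs], dim=1)
```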

III-C Early Fusion

Finally, we present our most successful approach, Early Fusion, which concatenates the request embedding to every region of a convolutional filter map. This feature is then processed normally by a set of convolution kernels that have been augmented to account for the extra channels. Figure 5 shows this process. All further processing in the network is computed normally. The model’s subsequent convolution and fully connected layers may filter the visual information according to the goal description that is now combined with the visual input. This results in an image representation which contains only the necessary information for deciding on the next action, effectively gaining the benefits of a bottleneck while dispersing the logic throughout the network. Critically, this means that the network does not have to build a semantic representation of the entire image (See section IV-D for details).

Fig. 5: Fusion of goal information with visual data in the Early Fusion model. The goal is concatenated with each block of visual data.
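The fusion step in Figure 5 is a tile-and-concatenate operation; a minimal sketch:

```python
import torch

def early_fuse(feature_map, goal):
    """Tile the goal embedding across every spatial location and concatenate
    it to the feature map along the channel axis, as in Figure 5.

    feature_map: (B, C, H, W); goal: (B, D) -> (B, C + D, H, W).
    """
    B, C, H, W = feature_map.shape
    tiled = goal.view(B, -1, 1, 1).expand(B, goal.shape[1], H, W)
    return torch.cat([feature_map, tiled], dim=1)
```

The "augmented" kernels mentioned above then need no special machinery: the next convolution simply declares C + D input channels (e.g. `nn.Conv2d(C + D, C, 3, padding=1)`) and learns to mix visual and goal channels jointly.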

Two important results of this architecture are: 1. Because the goal information is incorporated early, the network can learn to ground the image features to the goal objects at any point in the model without additional machinery (like attention); and 2. The model can compute and retain the spatial information needed for its next action without requiring the addition of a spatial grid. These benefits allow us to obviate the complexity of other approaches, minimize parameters, and outperform other approaches on our task.

III-D Imitation Learning

We train these models with imitation learning using an oracle with direct access to the simulator’s state. Similar to the late stages of DAgger [10] and Scheduled Sampling [11], we roll out trajectories using the current model while collecting supervision from the expert. We then use batches of recent trajectories to train the model for a small number of epochs and repeat this process [12]. We found that for our item retrieval problem this was faster to train than a more faithful implementation of DAgger, which trains a new policy on all previous data at each step, and offered significant improvements over behavior cloning (training on trajectories demonstrated by the expert policy). The models performed best when trained on only the most recent 150 trajectories. When training, we make three passes through the data in these 150 trajectories before rolling out 50 new trajectories, discarding the oldest 50, and repeating this process.
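This schedule amounts to a fixed-size rolling buffer; a sketch in plain Python, where `rollout` and `train_epoch` are hypothetical stand-ins for collecting one expert-labeled trajectory and making one supervised pass over the buffer:

```python
from collections import deque

BUFFER_SIZE = 150        # most recent trajectories kept for training
NEW_PER_ROUND = 50       # new rollouts per round; the oldest 50 are discarded
EPOCHS_PER_ROUND = 3     # passes through the buffer before rolling out again

def train(rollout, train_epoch, num_rounds):
    # deque(maxlen=...) drops the oldest trajectory once the buffer is full.
    buffer = deque(maxlen=BUFFER_SIZE)
    while len(buffer) < BUFFER_SIZE:     # fill the initial 150 trajectories
        buffer.append(rollout())
    for _ in range(num_rounds):
        for _ in range(EPOCHS_PER_ROUND):   # three passes over the 150
            train_epoch(list(buffer))
        for _ in range(NEW_PER_ROUND):      # 50 new rollouts replace the oldest 50
            buffer.append(rollout())
    return buffer
```

Keeping only recent trajectories means supervision always reflects states the current policy actually visits, which is the property borrowed from DAgger.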

Rather than teach our agents to find the shortest path to multiple objects, which is intractable in general, we design our expert policy to behave greedily and move to collect the requested object that would take the fewest steps to reach (including the time necessary to remove occluding objects).

III-E Implementation Details

All convolutions have 3×3 kernels with a padding of one, followed by 2×2 max-pooling, a ReLU nonlinearity, and batch normalization [14]. This produces a feature map with half the spatial dimensions of the input. The number of convolution channels and hidden dimensions in the fully connected layers vary by experiment (see Section IV-B).


Our images are RGB and 128×128 pixels, but as is common practice in visual episodic settings [15] we found our models performed best when we concatenated the most recent three frames to create a 9×128×128 input. All models use four convolution layers.


Models are provided the complete set of remaining items to collect as a list of one-hot vectors. These are encoded into a single dense vector by summing their learned embeddings. Because the sequence order is not important to our task, we found no benefit from RNN based encodings, though the use of an embedding layer, rather than a count vector, proved essential to model performance.
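A minimal sketch of this request encoder (the embedding dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Learned linear embedding of each one-hot object; 16 object types -> dense.
embed = nn.Linear(16, 32, bias=False)

def encode_request(onehots):
    """onehots: (num_items, 16) one-hot rows -> (32,) dense goal vector.

    Summing makes the encoding order-invariant, while duplicate requests
    still add twice (unlike a binary presence vector).
    """
    return embed(onehots).sum(dim=0)

request = torch.zeros(3, 16)
request[0, 4] = request[1, 9] = request[2, 4] = 1.0   # two of type 4, one of type 9
goal = encode_request(request)                         # shape (32,)
```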


All of our models are optimized with Adam [16] with a learning rate of 1e-4 and trained with dropout [17]. The training loss was computed with cross-entropy over the action space. All models and code were written in PyTorch [18] and are available at redacted.

Fig. 6: Model performance on Simple, Medium, and Hard learning paradigms. Models were run to convergence.

IV Experiments

We tested all four models on a series of increasingly cluttered and difficult problems. We also tested these models with varying network capacity by reducing the number of convolution channels and features in the fully connected layers. In all of these experiments, our Early Fusion model performs as well or better than the others, while typically training faster and with fewer parameters.

IV-A Varying Problem Difficulty

To test models on problems of increasing difficulty, we built three variations of the basic task by varying clutter and the number of requested items. In the simplest task (Simple), each episode starts with four instances randomly placed on the table and one object type is requested. Next, for Medium, eight instances are placed and two are requested. Finally, for Hard, twelve instances are placed and three are requested. The agent’s goal is to collect only the requested items in the allotted time. To evaluate peak performance for these experiments we fixed the number of convolution channels and the hidden dimensions of the fully connected layers to 128.

Each episode runs for forty-eight steps, during which it is possible for the agent to both successfully collect requested objects and erroneously collect items that were not requested. We therefore measure task completion using an F1 score. Precision is the percentage of collected objects that were actually requested, and recall is the percentage of requested objects that were collected. The F1 score is computed at the end of each episode. In addition, we report overall agreement between the model and the expert’s actions over the entire episode. Figure 6 plots the results of all four models on each of these problems as a function of training time.
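The metric can be written as a small helper; a sketch, matching items as multisets since the same object type can be requested more than once (the item names in the tests are hypothetical):

```python
from collections import Counter

def episode_f1(collected, requested):
    """End-of-episode F1 score.

    Precision: fraction of collected items that were actually requested.
    Recall:    fraction of requested items that were collected.
    Multiset intersection counts duplicates correctly.
    """
    hits = sum((Counter(collected) & Counter(requested)).values())
    if hits == 0:
        return 0.0
    precision = hits / len(collected)
    recall = hits / len(requested)
    return 2 * precision * recall / (precision + recall)
```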

Fig. 7: Attention visualizations for Attention Map and Softmax Attention Map models. Targets are indicated here with magenta boxes in the top row.
Fig. 8: Ablation analysis showing the effect of the number of convolution channels and fully-connected hidden units on network performance. Note that the scale of the x-axes in these plots varies due to longer training times for smaller networks to converge. Dashed Early Fusion lines plot the performance of the model with half the reported number of filters, to ensure a comparison where the model has fewer parameters than the attention based approaches.


Except for the Late Fusion model, which performs poorly in all scenarios, all models are able to master the easiest task. The Early Fusion and Softmax Attention Map models learn quickly, but Attention Map eventually catches up to them. The failure of the Late Fusion baseline on this task shows that even the simplest version of this problem is non-trivial.


The intermediate problem formulation is clearly more difficult, as no model performs as well on it as on the easiest problem. The Early Fusion model holds a small but significant performance advantage, while Softmax Attention Map and Attention Map are slightly worse, but comparable to each other.


In this case the networks must deal with more cluttered images and more complex goal descriptions. The Early Fusion model is clearly superior, learning significantly faster than the other models and achieving greater overall performance.

It is also worth comparing the Attention Map and Softmax Attention Map models. While these models perform similarly on these tasks, the Softmax Attention Map model learns faster than the Attention Map model on the easiest task, but slightly slower on the more difficult ones. We posit that the softmax focuses the attention heavily on only a few regions which is useful for sparse uncluttered environments, but less appropriate when the network must reason about multiple objects in different regions.

Figure 7 provides a comparison of four attention maps. Unsurprisingly, the Softmax Attention Map model produces a sharper distribution around the requested objects, but both methods correctly highlight the objects of interest. In this work, we have limited our definition of clutter to 12 items per scene, in part for ease of visualization and compute time. We have not investigated the upper limit for saturating our models, but rather focused on how their learning curves diverged as a function of scene complexity. We anticipate a further widening in arbitrarily complex real world images.

IV-B Varying Network Capacity

Having demonstrated that Early Fusion is at least as powerful as attention based approaches while being simpler (no grid information or attention logic), we next explore how these approaches perform on varying parameter budgets. Real-time and embedded systems require efficiency both when training and during inference. Since Early Fusion filters and removes irrelevant information early in the processing pipeline, we expect it to require less network capacity than the other methods. To test this claim, we re-run our Medium difficulty setting (because attention models performed well) and compare performance when models have access to 256, 128, 64, 32, or only 16 channel convolutions and fully connected layers, reducing our model sizes by several orders of magnitude.

In Figure 8, we see that training time increases for small networks, but Early Fusion quickly achieves roughly the same final performance even at extremely small network capacities. This allows for dramatically more efficient inference and parameter/memory usage. The same cannot be said for the other models, which degrade substantially as the number of parameters in the network decreases. Note that after 50,000 trajectories the attention based models appear to still be improving slowly, but there is a stark contrast in learning rates. In particular, for the smallest models (16) we see that Attention Map, even after training for twice as long as Early Fusion, still has half the performance.

Because attention mechanisms collapse their final representations, they have a smaller fully connected layer and therefore fewer parameters for the same number of channels. To account for this, we have also included a dashed orange line in Figure 8 which shows the performance of Early Fusion with half the channels as the other models and fewer parameters. We see again that smaller Early Fusion networks outperform and learn faster than the other approaches.

IV-C Generalization

To determine how well the agent can generalize and represent the compositionality inherent in the requests we conduct experiments in which the agent is trained on a subset of the possible request combinations and then tested on unseen requests. Here the agent is trained with 128 different two-item combinations for 15,000 trajectories, and then tested on the held out 128 two-item combinations (Rows 1 and 2 below). In this setting, performance degrades only slightly, indicating that the agent is not merely memorizing combinations, but learning to recognize the structure of requests composed of individual objects.

                  Agreement      F1
Two-Item Train      0.8918    0.9215
Two-Item Test       0.8695    0.8938
Three-Item Test     0.8140    0.8243

In the second experiment, the same agent was tested on a random collection of three-item combinations to determine if the agent can generalize to higher counts than during training (Row 3). Performance degrades slightly more in this case, but the agent remains quite robust. In all cases, results were generated by testing on 1,000 new episodes after training.

IV-D Information Retention

We have argued above that knowing the request allows the network to discard information about irrelevant objects in the scene. To investigate how much information is retained in the intermediate stages of the network, we take the hidden states from models trained on the Simple task and assess whether they can be used to predict the correct action for a new query that is different from the one they were conditioned on. This is implemented by freezing the original model and training a new set of final layers with a second conditional (Figure 9). In this experiment, we use the Late Fusion model as a proxy for the layer prior to attention in those models.

For all models, we find that if the same request is fed to both the original network and the new branch, we quickly achieve performance comparable to the original model (dotted lines). On the other hand, if mismatched requests are fed into the two branches, all models suffer a substantial degradation in performance, with most unable to collect a single object (solid lines). Both Early Fusion and the attention models have completely removed irrelevant information, while the Late Fusion approach appears to retain only some of it.

Fig. 9: On top is the frozen Early Fusion network with a new trainable branch for collecting an alternate test object. On the bottom is the performance of this approach when the old target and the new target are aligned (dotted lines) and when they are different (solid lines). (Note: agreement here matches the majority class baseline, as movement is more common than collection.)

V Related Work

Learning to recognize objects, perform actions, and ground instructions to observed phenomena is core to many AI domains, and advances are spread across the Robotics, Vision and Natural Language communities. Most immediately relevant to our work is the recent proliferation of goal directed visual learning in simulated worlds [19, 20, 21, 22, 23, 24], which each aim to bring different amounts of language, vision and interaction to the task of navigating a 3D environment. These systems are often built using one of several open environment simulators based on 3D game engines [25, 26, 27, 28]. This has also been attempted in real 3D environments [29]. Importantly, in contrast to our work, these approaches often pretrain as much of their networks as possible. [21] do not pretrain for their RL based language learning, but their work focuses on a limited compositional vocabulary and does not address learning with occlusion or larger vocabularies.

Visual Question Answering and Visual Referring Expressions have recently emerged as challenging problems in the computer vision community. In these settings a model is trained to either answer questions about an image or detect parts of an image referred to in a natural language expression. Challenging benchmark data sets have been proposed using both real [30, 31, 32] and simulated [33] images. Many successful approaches to these problems use visual attention similar to our baseline models [34, 35, 36].

Within the Natural Language community, instruction following [37] is often studied as semantic parsing [38, 39, 40] which aims to convert sentences into executable logical forms. Recent work has also investigated mapping language in referring expressions [41, 42]. In parallel, the robotics literature has investigated grounding instructions directly to robotic control [43, 44, 45, 24, 12, 46].

Training end-to-end visual and control networks [47] has proven difficult due to long roll outs and large action spaces. Within reinforcement learning, several approaches for mapping natural language instructions to actions rely on reward shaping [2, 3] and imitation learning [12, 46]. Imitation learning has also proven effective for fine grained activities like grasping [48], leading to state-of-the-art results on a broad set of tasks [49]. The difficulty encountered in these scenarios emphasizes the need to explore new methods for efficient learning of multimodal representations. [8] explored attention model architectures, but did not include early fusion techniques. Early fusion of goal information has shown promise with small observation spaces [50], but our work begins to explore this method for high-dimensional visual domains. In this paper, we hope to provide some insight into this approach and highlight its power in interactive settings.

VI Conclusion

Goal directed computer vision is an important area for robotics research. While our community has benefited greatly by borrowing results from computer vision, we often have different goals and need to specialize our architectures appropriately. Minimally, in many robotic systems vision alone is not enough to direct behavior, and instead additional goal or task information must be taken into account. This paper argues that vision systems perform best when they are aware of goal and task information as early as possible. So far we have shown this to be true on a simulated robotic object retrieval task, but to the extent that this holds more generally, it motivates a line of research that moves away from vision systems that produce broadly general descriptions of images and environments and towards systems that build contextual representations that are specific to the task at hand.

We have compared four models on a simplified robotic retrieval task to show both the necessity of selective reasoning in these problems and the effectiveness of the Early Fusion technique. We see how it gradually filters unnecessary information, unlike the hard bottleneck of attention. Further, because of the relative simplicity of our approach, we observe substantially better scaling and parameter efficiency for Early Fusion, making it particularly well suited to low power and embedded systems.