Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

04/06/2019 · by Erik Wijmans, et al. · Facebook, Georgia Institute of Technology

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task -- Embodied Question Answering [1] -- in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel loss-weighting scheme we call Inflection Weighting to be important when training recurrent models for navigation with behavior cloning, and we are able to outperform the baselines with this technique. We find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.


1 Introduction

Imagine asking a home robot ‘Hey - can you go check if my laptop is on my desk? And if so, bring it to me.’

In order to be successful, such an agent would need a range of artificial intelligence (AI) skills – visual perception (to recognize objects, scenes, obstacles), language understanding (to translate questions and instructions into actions), and navigation of potentially novel environments (to move and find things in a changing world). Much of the recent success in these areas is due to large neural networks trained on massive human-annotated datasets collected from the web. However, this static paradigm of

‘internet vision’ is poorly suited for training embodied agents. By their nature, these agents engage in active perception – observing the world and then performing actions that in turn dynamically change what the agent perceives. What are needed then are richly annotated, photo-realistic environments where agents may learn about the consequence of their actions on future perceptions while performing high-level goals.

Figure 1: We extend EmbodiedQA [1] to photorealistic environments; our agent is spawned in a perceptually and semantically novel environment and tasked with answering a question about that environment. We examine the agent’s ability to navigate the environment and answer the question by perceiving its surroundings through point clouds, RGB images, or a combination of the two.

To this end, a number of recent works have proposed goal-driven, perception-based tasks situated in simulated environments to develop such agents [2, 3, 4, 5, 6, 7, 8, 1, 9, 10]. While these tasks are set in semantically realistic environments (i.e. having realistic layouts and object occurrences), most are based in synthetic environments (on SUNCG [11] or Unity 3D models [12]) that are perceptually quite different from what agents embodied in the real world might experience. Firstly, these environments lack visual realism, both in terms of the fidelity of textures, lighting, and object geometries and with respect to the rich in-class variation of objects (to pervert Tolstoy, each ugly lamp is ugly in its own way). Secondly, these problems are typically approached with 2D perception (RGB frames) despite the widespread use of depth-sensing cameras (RGB-D) on actual robotic platforms [13, 14, 15].

Contributions. We address these points of disconnect by instantiating a large-scale, language-based navigation task in photorealistic environments and by developing end-to-end trainable models with point cloud perception – from raw 3D point clouds to goal-driven navigation policies.

Specifically, we generalize the recently proposed Embodied Question Answering (EmbodiedQA) [1] task (originally proposed in synthetic SUNCG scenes [11]) to the photorealistic 3D reconstructions from Matterport 3D (MP3D) [16]. In this task, an agent is spawned at a random location in a novel environment (e.g. a house) and asked to answer a question (‘What color is the car in the garage?’). In order to succeed, the agent needs to navigate from egocentric vision alone (without an environment map), locate the entity in question (‘car in the garage’), and respond with the correct answer (e.g. ‘orange’).

We introduce the MP3D-EQA dataset, consisting of 1136 questions and answers grounded in 83 environments. Similar to [1], our questions are generated from functional programs operating on the annotations (objects, rooms, and their relationships) provided in MP3D; however, MP3D lacks color annotations for objects, which we collect from Amazon Mechanical Turk in order to generate ‘What color …?’ questions. The MP3D environments are significantly more challenging for our agents to learn to navigate due to the increased visual variation.

We present a large-scale exhaustive evaluation of design decisions, training a total of 16 navigation models (2 architectures, 2 language variations, and 4 perception variations), 3 visual question answering models, and 2 perception models – ablating the effects of perception, memory, and goal-specification. Through this comprehensive analysis we demonstrate the complementary strengths of these perception modalities and highlight surprisingly strong baselines in the EmbodiedQA experimental setting.

Our analysis reveals that the seemingly naive baselines, forward-only and random, are strong navigators in the default evaluation setting presented in [1] and challenging to beat, providing insight to others working in this space that models can perform surprisingly well without learning any meaningful behavior. We also find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating continued study of utilizing depth information in embodied navigation tasks.

We find a novel weighting scheme we call Inflection Weighting – balancing the contributions to the cross-entropy loss between inflections, where the ground truth action differs from the previous one, and non-inflections – to be an effective technique when performing behavior cloning with a shortest path expert. We believe this technique will be broadly useful any time a recurrent model is trained on long sequences with an imbalance in symbol continuation versus symbol transition probabilities, i.e. when $P(s_{t+1} = s_t) \gg P(s_{t+1} \neq s_t)$.

To the best of our knowledge, this is the first work to explore end-to-end-trainable 3D perception for goal-driven navigation in photo-realistic environments. With the use of point clouds and realistic indoor scenes, our work lays the groundwork for a tighter connection between embodied vision and goal-driven navigation, provides a testbed for benchmarking 3D perception models, and hopefully brings embodied agents trained in simulation one step closer to real robots equipped with 2.5D RGB-D cameras.

2 Related Work

Embodied Agents and Environments. End-to-end learning methods – predicting actions directly from raw pixels [17] – have recently demonstrated strong performance. Gupta et al. [2] learn to navigate via mapping and planning. Sadeghi et al. [18] teach an agent to fly using simulated data and deploy it in the real world. Gandhi et al. [19] collect a dataset of drone crashes and train self-supervised agents to avoid obstacles. A number of new challenging tasks have been proposed, including instruction-based navigation [6, 7], target-driven navigation [4, 2], embodied/interactive question answering [1, 9], and task planning [5].

A prevailing problem in embodied perception is the lack of a standardized, large-scale, diverse, real-world benchmark – essentially, there does not yet exist a COCO [20] for embodied vision. A number of synthetic 3D environments have been introduced, such as DeepMind Lab [21] and VizDoom [22]. Recently, more visually stimulating and complex datasets have emerged which contain actionable replicas of 3D indoor scenes [3, 23, 24, 25]. These efforts make use of synthetic scenes [26, 25], or scans of real indoor houses [16, 27] and are equipped with a variety of input modalities, i.e. RGB, semantic annotations, depth, etc.

The closest to our work is the EmbodiedQA work of Das et al. [1], who train agents to predict actions from egocentric RGB frames. While RGB datasets are understandably popular for ‘internet vision’, it is worth stepping back and asking – why must an embodied agent navigating in 3D environments be handicapped to perceive with a single RGB camera? We empirically show that point cloud representations are more effective for navigation in this task. Moreover, contrary to [1, 9], which use synthetic environments, we extend the task to real environments sourced from [16].

3D Representations and Architectures.

Deep learning has been slower to impact 3D computer vision than its 2D counterpart, in part due to the increased complexity of representing 3D data. Initial success was seen with volumetric CNNs [28, 29, 30]. These networks first discretize 3D space with a volumetric representation and then apply 3D variants of operations commonly found in 2D CNNs – convolutions, pooling, etc. Volumetric representations are greatly limited by the sparsity of 3D data and the computational cost of 3D convolutions. Recent works on 3D deep learning have proposed architectures that operate directly on point clouds. Point clouds are a challenging input for deep learning as they are naturally a set of points with no canonical ordering. To overcome the ordering issue, some approaches utilize symmetric functions, e.g. PointNet(++) [31, 32] and A-SCN [33]. Others have used clever internal representations, such as SplatNet [34], Kd-Net [35], and O-CNN [36].

3 Questions in Environments


(a) RGB Panorama  (b) Mesh Reconstruction  (c) Point Cloud  (d) RGB-D Render

Figure 2: Illustration of mesh construction errors and what point clouds are able to correct. Notice the warping of flat surfaces, the extreme differences in color, and texture artifacts from reflections.

In this work, we instantiate the Embodied Question Answering (EQA) [1] task in realistic environments from the Matterport3D dataset [16].

3.1 Environments

The Matterport3D dataset consists of 90 home environments captured through a series of panoramic RGB-D images taken by a Matterport Pro camera (see the sample panorama in Fig. 2a). The resulting point clouds are aligned and used to reconstruct a 3D mesh (like the one shown in Fig. 2b) that is then annotated with semantic labels. The Matterport3D dataset is densely annotated with semantic segmentations of 40 object categories for 50,000 instances. Room type is annotated for over 2050 individual rooms.

These reconstructions offer a high degree of perceptual realism but are not perfect, sometimes suffering from discoloration and unusual geometries such as holes in surfaces. In this work, we examine both RGB and RGB-D perception in these environments. For RGB, we take renders from the mesh reconstructions, and for point clouds we operate directly on the aligned point clouds. Fig. 2c and Fig. 2d show the point cloud and RGB-D render for an agent looking at the scene shown in Fig. 2a.

Simulator. To enable agents to navigate in Matterport3D environments, we develop a simulator based on MINOS [23]. Among other things, MINOS provides occupancy checking, RGB frame rendering from the mesh, and shortest path calculation (though we reimplement this for higher accuracy and speed). It does not, however, provide access to the underlying point clouds. In order to render 2.5D RGB-D frames, we first construct a global point cloud from all of the panoramas provided for an environment in the dataset. Then, the agent’s current position, camera parameters (field of view and aspect ratio), and the mesh reconstruction are used to determine which points are within its view. See Sec. 8 in the supplement for full details.

3.2 Questions

Following [1], we programmatically generate templated questions based on the Matterport3D annotations, generating questions of the following three types:

  • location: What room is the <OBJ> located in?

  • color: What color is the <OBJ> ?

  • color_room: What color is the <OBJ> in the <ROOM> ?

While EQA [1] included a fourth question type, prepositions, we found those questions in MP3D to be relatively few and to have strongly biased answers, so we do not include them.

While room and object annotations and positions supporting the three question types above are available in MP3D, human names for object colors are not. To rectify this, we collect the dominant color of each object from workers on Amazon Mechanical Turk (AMT). Workers are asked to select one of 24 colors for each object. The color palette was created by starting with Kenneth Kelly’s 22 colors of maximum contrast [37] and adding 2 additional colors (off-white and slate-grey) due to their prevalence in indoor scenes. Overall, the most reported color was gray. For each object, we collect 5 annotations and take the majority vote, breaking ties based on object color priors. We include details of the AMT interface in the supplementary.
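To make the aggregation step concrete, below is a minimal sketch of majority voting with prior-based tie-breaking; the function name and the color_prior dictionary are illustrative, not part of our annotation pipeline.

```python
from collections import Counter

def aggregate_color(annotations, color_prior):
    """Majority vote over the 5 AMT color labels for one object; ties are
    broken by a prior over colors (e.g. their overall frequency)."""
    counts = Counter(annotations)
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    # Prefer the tied color with the highest prior probability.
    return max(tied, key=lambda c: color_prior.get(c, 0.0))
```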

Following the protocol in [1], we filter out questions that have a low entropy in distribution over answers across environments i.e. peaky answer priors – e.g. the answer to ‘What room is the shower in?’ is nearly always ‘bathroom’ – to ensure that questions in our dataset require the agent to navigate and perceive to answer accurately. We remove rooms or objects that are ambiguous (e.g. “misc” rooms) or structural (e.g. “wall” objects). Below are the objects and rooms that appear in our generated questions:

Objects: shelving, picture, sink, clothes, appliances, door, plant, furniture, fireplace, chest of drawers, seating, sofa, table, curtain, shower, towel, cushion, blinds, counter, stool, bed, chair, bathtub, toilet, cabinet

Rooms: family room, closet, spa, dining room, lounge, gym, living room, office, laundry room, bedroom, foyer, bathroom, kitchen, garage, rec room, meeting room, hallway, tv room

In total, we generate 1136 questions across 83 home environments (7 environments resulted in no questions after filtering). Note that this amounts to roughly 13.7 question-answer pairs per environment, compared to 12 per scene in [1]. Color_room questions make up the majority of questions. These questions require searching the environment to find the specified object in the specified room. Whereas [1] requires both the object and the room to be unique within the environment, we only require the (object, room) pair to be unique, thereby giving the navigator significantly less information about the location of the object.

We use the same train/val/test split of environments as in MINOS [23]. Note that in [1], the test environments differ from train only in the layout of the objects; the objects themselves have been seen during training. In MP3D-EQA, the agents are tested on entirely new homes, thus may come across entirely new objects – testing semantic and perceptual generalization. Tab. 1 shows the distribution of homes, floors, and questions across these splits. We restrict agent start locations to lie on the same floor as question targets and limit episodes to single floors.

Homes Floors Total Qns. Unique Qns.
train 57 102 767 174
val 10 16 130 88
test 16 28 239 112
Table 1: Statistics of splits for EQA in Matterport3D

4 Perception for Embodied Agents

Agents for EmbodiedQA must understand the given question, perceive and navigate their surroundings collecting information, and answer correctly in order to succeed. Consider an EmbodiedQA agent that navigates by predicting an action $a_t$ at each time step based on its trajectory of past observations and actions $(o_1, a_1, \ldots, o_{t-1}, a_{t-1})$, the current state $s_t$, and the question $Q$. There are many important design decisions for such a model – action selection policy, question representation, trajectory encoding, and observation representation. In this work, we focus on the observation representation – i.e. perception – in isolation and follow the architectural pattern in [1] for the remaining components. In this section, we describe our approach and recap existing model details.

4.1 Learning Point Cloud Representations

Figure 3: The visual encoders are trained using three pretraining tasks to imbue their scene representations with information about semantics (segmentation), color (autoencoding), and structure (depth). All decoder heads share the same encoder. Up-sampling for RGB (Up #) is done with bi-linear interpolation. Up-sampling for point clouds (FP #) is achieved with Feature Propagation layers [32]. After pretraining, the decoders are discarded and the encoder is treated as a static feature extractor.

Consider a point cloud $P = \{p_1, \ldots, p_N\}$, an unordered set of points in 3D space with associated colors, i.e. $p_i = (x_i, y_i, z_i, r_i, g_i, b_i)$. To enable a neural agent to perceive the world using point clouds, we must learn a function that maps a point cloud to an observation representation. To do this, we leverage a widely used 3D architecture, PointNet++ [32].

PointNet++.

At a high-level, PointNet++ alternates between spatial clustering and feature summarization – resulting in a hierarchy of increasingly coarse point clusters with associated feature representations summarizing their members. This approach draws an analogy to convolution and pooling layers in standard convolutional neural networks.

More concretely, let $P^{(l)}$ be the set of points at the $l$-th level of a PointNet++ architecture and $F^{(l)}$ be their associated feature representations (e.g. RGB values for the input level). To construct the $(l{+}1)$-th level, centroids are sampled from level $l$ via iterative farthest point sampling (FPS) – ensuring even representation of the previous layer. These centroids make up the points in level $l{+}1$ and represent their local areas. For each centroid, the closest points within a maximum radius are found and a symmetric learnable neural architecture [31], composed of a series of per-point operations (essentially 1-by-1 convolutions) and a terminal max-pool operation, is applied to this set of associated points to produce the centroid's summary representation. These clustering and summarization steps (referred to as Set Abstractions in [32]) can be repeated arbitrarily many times; in this work we use a three-level architecture. We compute a final feature with a set of 1-by-1 convolutions and a max-pool over the third-level point features and treat this network as our point cloud encoder.

Given an input point cloud $P_t$ from an agent's view at time $t$, we produce a representation $f_t$ with this encoder. However, point clouds have an interesting property – as an agent navigates an environment, the number of points it perceives can vary. This may be due to sensor limitations (e.g. being too close or too far from objects) or properties of the observed surfaces (e.g. specularity). While the encoder is invariant to the number of input points, representations drawn from few supporting points are not likely to be good representations of a scene. For a navigation or question-answering agent, this means there is no way to discern between confident and unconfident or erroneous observations. To address this, we divide the range spanning the possible number of points in any given point cloud into 5 equal-sized bins and represent these bins as 32-d feature vectors that encode the sparsity of a point cloud. Now, given a point cloud with $n$ points, we retrieve its corresponding sparsity embedding and produce a final encoding that is used by the agent for navigation and question answering.
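A minimal PyTorch sketch of the sparsity embedding is shown below; the equal-width bin boundaries over a maximum point count and the concatenation with the PointNet++ feature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparsityEmbedding(nn.Module):
    """Bucket the number of points in a cloud into 5 bins and embed each bin as
    a 32-d vector (sizes follow the text; the bin boundaries are an assumption)."""
    def __init__(self, max_points, n_bins=5, dim=32):
        super().__init__()
        self.bin_width = max_points / n_bins
        self.embed = nn.Embedding(n_bins, dim)

    def forward(self, num_points):
        # num_points: (B,) tensor with the point count of each cloud.
        bins = torch.clamp((num_points / self.bin_width).long(),
                           max=self.embed.num_embeddings - 1)
        return self.embed(bins)

# One natural way (an assumption here) to form the final observation is to
# concatenate the PointNet++ feature with the sparsity embedding:
#   obs = torch.cat([pointnet_feature, sparsity(num_points)], dim=-1)
```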

Visual Pretraining Tasks. To train the encoder architecture to extract semantically and spatially meaningful representations of agent views, we introduce three pretraining tasks based on the annotations provided in Matterport3D. Specifically, these tasks are:

  1. Semantic Segmentation, in which the model predicts the object annotation for each point from the summarized representation. We train a PointNet++ feature propagation network to minimize the cross-entropy between the predicted and ground-truth labels [32]. This encourages the encoder to include information about which objects are in the frame.

  2. Color Autoencoding, which mirrors the semantic segmentation task; however, the network is now trained to minimize the smooth-L1 loss between the predicted and input colors. This task encourages the encoder to capture holistic information about the colors in the scene.

  3. Structure Autoencoding, where point coordinates must be recovered from the summarized representation. We implement this decoder as a multi-layer perceptron that regresses to the spatial coordinates. As in [38], we use the earth mover's distance as the loss function.

We demonstrate these tasks in Fig. 3. These tasks encourage the model features to represent colors, objects, and spatial information including free-space and depth that are essential to navigation. We collect 100,000 frames from Matterport3D using our simulator and train the point cloud encoder for these tasks. We discard the decoder networks after training, and use the encoder as a fixed feature extractor.

RGB Image representations. We utilize ResNet50 [39] trained using an analogous set of tasks (semantic segmentation, autoencoding, and depth prediction) to learn a representation for egocentric RGB images as in [1]. We find that ResNet50 is better able to handle the increased visual complexity of the Matterport3D environments than the shallow CNN model used in Das et al. We provide further details about the perception models and decoder architectures in the supplement.

4.2 Navigation and Question Answering

We now provide an overview of the navigation and question answering models we use in this work.

Question Encoding. In order to succeed at navigation and question answering, it is important for an embodied agent to understand the queries it is being tasked with answering. We use two-layer LSTMs with 128-d hidden states to encode questions. The question encodings for navigation and question answering are learned separately.

Question Answering Models. We experimented with three classes of question answering models:

  • Question-only We examine the question-only baselines proposed in [1] – a small classification network that predicts the answer using just the question encoding. We also examine the recently proposed question-only baselines of [40] – a simple nearest-neighbors approach and a bag-of-words model with a softmax classifier.

  • Attention This is the best performing VQA model from [1]. It computes question-guided attention over the features of the last five frames observed by the agent before stopping, followed by an element-wise product between the attended feature and the question encoding to predict the answer; and

  • Spatial Attention utilizes the bag-of-words encoder proposed in [40] to compute spatial attention over the last frame. We use scaled dot-product attention [41] over the feature map, perform an element-wise product between the attended features and the question feature, and predict an answer. This model only uses RGB inputs.

Navigation Models. We consider two baseline navigators:

  • Forward-only (Fwd) which always predicts forward.

  • Random which uniformly chooses one of forward, turn-left, and turn-right at every time step.

We consider two navigation architectures:

  • Reactive (R) which is a simple feed-forward network that takes a concatenation of the embedding of the five most recent visual observations as input to predict an action. As such, this is a memory-less navigator.

  • Memory (M) which is a two-layer GRU-RNN that takes the encoding(s) of the current observation and previous action as inputs to predict the current action.

For each navigation architecture, we examine the combination of our 4 different perception variations, None (i.e. a blind model as suggested by Thomason et al. [42]), PC, RGB, and PC+RGB, with the 2 different language variations, None and Question. For reactive models that utilize the question, we incorporate the question embedding by concatenation with the visual embedding. For memory models, the question embedding is an additional input to the GRU. Due to the highly correlated observations our agents see during training, we utilize Group Normalization layers [43]  in our navigation models. The action space for all our navigation models is forward, turn-left, turn-right, and stop.
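For concreteness, a minimal PyTorch sketch of the memory (M) navigator is given below; the previous-action embedding size and the GRU hidden size are illustrative choices, not the exact architecture.

```python
import torch
import torch.nn as nn

class MemoryNavigator(nn.Module):
    """Two-layer GRU navigator: at each step it consumes the current observation
    embedding, the previous action, and (optionally) the question embedding, and
    predicts one of {forward, turn-left, turn-right, stop}."""
    def __init__(self, obs_dim, q_dim=0, hidden=128, n_actions=4):
        super().__init__()
        self.prev_action_embed = nn.Embedding(n_actions + 1, 32)  # +1 for "no action yet"
        self.rnn = nn.GRU(obs_dim + 32 + q_dim, hidden, num_layers=2)
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_action, question=None, hidden=None):
        # obs: (T, B, obs_dim); prev_action: (T, B); question: (B, q_dim) or None
        x = torch.cat([obs, self.prev_action_embed(prev_action)], dim=-1)
        if question is not None:
            x = torch.cat([x, question.unsqueeze(0).expand(x.size(0), -1, -1)], dim=-1)
        out, hidden = self.rnn(x, hidden)
        return self.policy(out), hidden
```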

4.3 Imitation Learning from Expert Trajectories

To train our models, we first create a static dataset of agent trajectories by generating training episodes based on shortest paths from agent spawn locations to the best view of the object of interest. For example, if a question asks ‘What color is the sofa in the living room?’, we spawn an agent randomly in the environment on the same floor as the target object – the sofa – and compute the shortest navigable path to the best view of the sofa. The best view of the sofa is determined by exhaustively searching over possible view positions within a reasonable radius of the target object. The quality of a view is determined by the intersection over union of a pre-determined bounding box and the segmentation mask of the target. In normalized image coordinates, the bounding box’s top left corner is at (0.25, 0.25) and it has a height of 0.6 and a width of 0.5. We use this metric instead of simply maximizing the number of visible pixels in the segmentation mask to maintain context of the object’s surroundings.
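A minimal sketch of this view-scoring computation, assuming a binary segmentation mask for the target, is shown below; the function name and the optional normalization argument are illustrative.

```python
import numpy as np

def view_quality(seg_mask, best_attainable_iou=None):
    """IoU between the target's segmentation mask and a fixed centered box
    (top-left (0.25, 0.25), height 0.6, width 0.5 in normalized coordinates).
    Pass best_attainable_iou to normalize for object size, as in the evaluation
    metric; pass None to get the raw IoU used for view selection."""
    h, w = seg_mask.shape
    box = np.zeros_like(seg_mask, dtype=bool)
    box[int(0.25 * h):int(0.85 * h), int(0.25 * w):int(0.75 * w)] = True
    target = seg_mask.astype(bool)
    inter = np.logical_and(box, target).sum()
    union = np.logical_or(box, target).sum()
    iou = inter / union if union > 0 else 0.0
    return iou if best_attainable_iou is None else iou / best_attainable_iou
```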

To provide enough data to overcome the complexity of Matterport3D environments, we generate 11,796 such paths in total (corresponding to approximately 15 episodes per question-environment pair, each for a different spawn location of the agent). For computational efficiency in the large Matterport3D environments, we compute shortest paths in continuous space using LazyTheta* [44] and greedily generate agent actions to follow that path, rather than directly searching in the agent’s action space.

Perception. We use the frozen pre-trained perception models as described in Sec. 4.1. For PC+RGB models, we simply concatenate both visual features.

Question Answering. The question answering models are trained to predict the ground truth answer from a list of answers using a cross-entropy loss. The models with vision use the ground-truth navigator during training.

4.4 Imitating Long Trajectories Effectively

All navigation models are trained with behavior cloning where they are made to mimic the ground truth, shortest path agent trajectories. That is to say the agents are walked through the ground truth trajectory observing the corresponding frames (though reactive models retain only the last five) up until a given time step and then make an action prediction. Regardless of the decision, the agent will be stepped along the ground truth trajectory and repeat this process. One challenge with this approach is that relatively unintelligent policies can achieve promising validation loss without really learning anything useful – one such strategy simply repeats the previous ground truth action. Insidiously, these models achieve very high validation accuracy for action prediction but miss every transition between actions!

Inflection Weighting. To combat this problem and encourage agents to focus on important decisions along the trajectory, we introduce a novel weighting scheme we call Inflection Weighting. Conceptually, we weight predictions at time steps more heavily if the ground truth action differs from the previous one – that is, if the time step is an inflection point in the trajectory. More formally, we define a per-time-step weight

$w_t = \begin{cases} \frac{1}{p_I} & \text{if } a^*_t \neq a^*_{t-1} \\ 1 & \text{otherwise} \end{cases}$   (1)

where $\frac{1}{p_I}$ is the inverse frequency of inflection points (approximately 5.7 in our dataset). We can then write an inflection weighted loss between a sequence of predictions $a_{1:T}$ and a ground truth trajectory $a^*_{1:T}$ as:

$\mathcal{L}_{IW}(a_{1:T}, a^*_{1:T}) = \frac{1}{\sum_{t=1}^{T} w_t} \sum_{t=1}^{T} w_t\, \ell(a_t, a^*_t)$   (2)

where $\ell$ is the task loss – cross-entropy in our setting. We define the first action, $a^*_1$, to be an inflection. In practice, we find inflection weighting leads to significant gains in performance for recurrent models.

Inflection weighting may be viewed as a generalization of the class-balanced loss methods that are commonly used in supervised learning under heavily imbalanced class distributions (e.g. in semantic segmentation [45]) for a particular definition of a ‘class’ (inflection or not).
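A minimal PyTorch sketch of the inflection-weighted cross-entropy for a single trajectory is shown below; the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def inflection_weighted_ce(logits, gt_actions, inflection_weight):
    """Cross-entropy over a trajectory, up-weighting inflection points.

    logits:            (T, num_actions) predicted action scores
    gt_actions:        (T,) ground-truth (shortest-path) actions
    inflection_weight: 1 / frequency of inflections (~5.7 in our data)
    """
    # A step is an inflection if the expert action differs from the previous
    # expert action; the first step is defined to be an inflection.
    prev = torch.cat([gt_actions.new_full((1,), -1), gt_actions[:-1]])
    weights = torch.ones_like(gt_actions, dtype=torch.float)
    weights[gt_actions != prev] = inflection_weight
    per_step = F.cross_entropy(logits, gt_actions, reduction="none")
    # Normalize by the total weight so the loss scale stays comparable.
    return (weights * per_step).sum() / weights.sum()
```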

5 Experiments and Analysis

We closely follow the experimental protocol of Das et al[1]. All results here are reported on novel test environments. Agents are evaluated on their performance 10, 30, or 50 primitive actions away from the question target, corresponding to distances of 0.35, 1.89, and 3.54 meters respectively. One subtle but important point is that to achieve these distances the agent is first randomly spawned within the environment, and then the agent is walked along the shortest path to the target until it is the desired distance from the target (10, 30, or 50 steps).

We perform an exhaustive evaluation of design decisions, training a total of 16 navigation models (2 architectures, 2 language variations, and 4 perception variations), 3 visual question answering models, and 2 perception models.

5.1 Metrics

Question Answering. For measuring question answering performance, we report the top-1 accuracy, i.e. did the agent’s predicted answer match the ground truth or not.

Navigation. For navigation, we report, for reference, the distance from the agent's starting position to the target object; the distance to the target object upon navigation completion (lower is better); and the percentage of actions that result in a collision with an obstacle (lower is better). All distances are geodesic, i.e. measured along the shortest path.

We propose a new metric (higher is better) to evaluate the quality of the view of the target the agent obtains at the end of navigation. We compute the intersection-over-union (IoU) score between the ground-truth target segmentation and the same centered bounding box used to select views during dataset generation (see Sec. 4.3). To compensate for object size, we divide by the best attainable IoU for the target object. We define the metric as the maximum of this normalized IoU score over the last five frames, as the VQA model receives the last five frames.

5.2 Results and Analysis

Figure 4: Models with memory significantly outperform their memory-less counterparts. Surprisingly, the baselines, random and forward-only, and a vision-less navigator with memory perform very well.
Figure 5: Models trained with inflection-weighted cross-entropy loss significantly outperform their unweighted cross-entropy counterparts and the baselines.
Figure 6: Vision generally hurts distance-based navigation metrics. However, metrics that depend on the navigator's ability to look in a particular direction generally improve, and the models collide with the environment less.
Figure 7: Comparison of memory navigation models with and without the question. Interestingly, adding the question doesn’t appear to aid models trained with behavior cloning.

Question Answering. The top-1 accuracy for different answering modules on the validation set using the ground-truth navigator is shown below.

Top-1 (%)
spatial+RGB+Q 46.2
attention+RGB+Q 40.0
attention+PC+RGB+Q 38.4
attention+PC+Q 36.1
lstm-question-only 32.8
nn-question-only [40] 35.4
bow-question-only [40] 38.3

In order to compare performance between navigators, we report all results with the best-performing answering module – spatial+RGB+Q – regardless of the navigator.

Navigation. We use the following notation to specify our models: For the base architecture, R denotes reactive models and M denotes memory models. The base architectures are then augmented with their input types, +PC, +RGB, and +Q. So a memory model that utilizes point clouds (but no question) is denoted as M+PC. Unless otherwise specified (by the prefix NoIW), models are trained with inflection weighting. We denote the two baseline navigators, forward-only and random, as Fwd and Random, respectively.

Due to the large volume of results, we present key findings and analysis here and, for the intrepid reader, provide the full tables (with 300+ numbers!) in the supplement. We make the following observations:

Forward-only is a strong baseline. One of the side-effects of the evaluation procedure proposed in [1] is that the agent is commonly facing the correct direction when it is handed control. This means the right thing to do to make progress is to go forward. As a result, a forward-only navigator does quite well, see Fig. 4. Forward-only also tends to not overshoot too much due to its ‘functional stop’: continually running into an obstacle until the max step limit is reached. Our vision-less reactive models (R/Fwd and R+Q/Fwd) learn to only predict forward, the most frequent action.

Fig. 4 also shows that random is a deceptively strong baseline. The lack of a backward action, and left and right cancelling each other out in expectation, result in random essentially becoming forward-only.

Inflection weighting improves navigation. We find inflection weighting to be crucial for training navigation models with behavior cloning of a shortest-path expert; see Fig. 5. While we see some improvements with inflection weighting for most models, memory models reap the greatest benefits – improving significantly across navigation metrics. Interestingly, these gains do not translate into improved answering accuracy. While we have only utilized this loss for behavior cloning, we suspect the improvements seen from inflection weighting will transfer to models that are fine-tuned with reinforcement learning, as they begin with better performance.

Memory helps. Fig. 4 shows that models with memory are better navigators than their reactive counterparts. Surprisingly, a vision-less navigator with memory performs very well on distance-based navigation metrics. Like a vision-less reactive navigator (forward-only), a vision-less memory navigator is only able to learn priors on how shortest paths in the dataset tend to look; however, memory allows the model to count, and therefore it is able to stop and turn.

Vision helps gaze direction metrics. Fig. 6 shows the effect of adding vision to both reactive and memory models. The addition of vision leads to improvements on the gaze-direction metrics; however, these improvements do not translate directly into improved answering accuracy, likely due to the naive VQA models. Models with vision also tend to collide with the environment less often, as seen from their generally lower collision rates.

Vision hurts distance metrics. Surprisingly, adding vision hurts distance-based navigation metrics. For reactive models, adding vision causes the models to collide significantly less frequently, resulting in a loss of the ‘functional stop’ that forward-only uses, i.e. continually colliding until the step limit is reached. For memory models, the story isn’t as clear; however, memory models with vision stop less often and thus have a higher average episode length than their vision-less counterparts, which causes them to overshoot more often. We suspect this is because they learn a more complex function for stopping than the simple counting method used by vision-less memory models, and this function is less able to handle errors during navigation.

Question somewhat helps. Fig. 7 provides a comparison of M+PC, M+RGB, and M+PC+RGB with and without the question (Q). Interestingly, we do not see large improvements when providing models with the question. Given how much color_room dominates our dataset, it seems reasonable to expect that telling the navigation models which room to go to would be a large benefit. We suspect that our models are not able to properly utilize this information due to limitations of behavior cloning. Models trained with behavior cloning never see mistakes or exploration and therefore never learn to correct mistakes or explore.

PC+RGB provides the best of both worlds. Fig. 6 also provides a comparison of the three different vision modalities. The general trend is that point clouds provide a richer signal for obstacle avoidance (corresponding to lower collision rates), while RGB provides richer semantic information. Combining point clouds and RGB provides improvements to both obstacle avoidance and the use of semantic information.

6 Conclusion

We present an extension of the task of EmbodiedQA to photorealistic environments utilizing the Matterport 3D dataset and propose the MP3D-EQA v1 dataset. We then present a thorough study of 2 navigation baselines and 2 different navigation architectures with 8 different input variations. We develop an end-to-end trainable navigation model capable of learning goal-driven navigation policies directly from 3D point clouds. We provide analysis and insight into the factors that affect navigation performance and propose a novel weighting scheme – Inflection Weighting – that increases the effectiveness of behavior cloning. We demonstrate that the two navigation baselines, random and forward-only, are quite strong under the evaluation settings presented by [1]. Our work serves as a step towards bridging the gap between internet vision-style problems and the goal of vision for embodied perception.

Acknowledgements. This work was supported in part by NSF (Grant # 1427300), AFRL, DARPA, Siemens, Samsung, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

References

  • [1] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
  • [2] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
  • [3] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. In ICLR Workshop, 2018.
  • [4] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
  • [5] Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. Visual Semantic Planning using Deep Successor Representations. In ICCV, 2017.
  • [6] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. arXiv preprint arXiv:1706.07230, 2017.
  • [7] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • [8] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
  • [9] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • [10] Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Neural Modular Control for Embodied Question Answering. In Proceedings of the Conference on Robot Learning (CoRL), 2018.
  • [11] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [12] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents, 2018.
  • [13] Albert S Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an rgb-d camera. In Robotics Research. 2017.
  • [14] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois Robert Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, and Alberto Rodriguez. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In ICRA, 2018.
  • [15] Shiqi Zhang, Yuqian Jiang, Guni Sharon, and Peter Stone. Multirobot symbolic planning under temporal uncertainty. In Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Sytems (AAMAS), May 2017.
  • [16] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017.
  • [17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, Jan. 2016.
  • [18] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. RSS, 2017.
  • [19] Dhiraj Gandhi, Lerrel Pinto, and Abhinav Gupta. Learning to fly by crashing. In IROS, 2017.
  • [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
  • [21] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. Deepmind lab. arXiv.
  • [22] Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. Vizdoom: A doom-based AI research platform for visual reinforcement learning. arXiv 1605.02097, 2016.
  • [23] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931, 2017.
  • [24] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron C. Courville. Home: a household multimodal environment. arXiv 1711.11017, 2017.
  • [25] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.
  • [26] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. CVPR, 2017.
  • [27] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  • [28] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
  • [29] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
  • [30] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
  • [31] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  • [32] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • [33] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In CVPR, 2018.
  • [34] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
  • [35] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV. IEEE, 2017.
  • [36] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
  • [37] Kenneth L Kelly. Twenty-two colors of maximum contrast. Color Engineering, 3(26):26–27, 1965.
  • [38] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
  • [39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [40] Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, and Aaron Courville. Blindfold Baselines for Embodied QA. arXiv preprint arXiv:1811.05013, 2018.
  • [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [42] Jesse Thomason, Daniel Gordan, and Yonatan Bisk. Shifting the baseline: Single modality performance on visual navigation & qa. arXiv preprint arXiv:1811.00613, 2018.
  • [43] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
  • [44] Alex Nash, Sven Koenig, and Craig Tovey. Lazy theta*: Any-angle path planning and path length analysis in 3d. In Third Annual Symposium on Combinatorial Search, 2010.
  • [45] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
  • [46] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.

7 Color Label Collection Interface

Figure 8: Interface shown to AMT workers for collecting the dominant color name for objects in the Matterport3D dataset [16]. Workers were shown up to two good views of the object (some objects only had one good view) and the corresponding instance segmentation mask, and were asked to select the dominant color of the object from a predefined list of colors.

Fig. 8 shows the interface we used to collect dominant color annotations from workers on Amazon Mechanical Turk.

8 Point Cloud Rendering

In order to render 2.5D RGB-D frames, we first construct a global point cloud from all of the panoramas provided for an environment in the Matterport3D dataset [16]. Next, we determine which parts of the global point cloud are in the agent’s current view. The agent’s current position, camera parameters (field of view and aspect ratio), and the mesh reconstruction are used to determine which points are within its view pyramid (frustum).

For each point $p$, we first check whether it lies within the agent's view frustum using the extrinsic and intrinsic camera matrices. After determining which points lie within the view frustum, we check for occlusions. We draw a ray from the agent's camera to $p$ and intersect it with the provided mesh, keeping the intersection closest to the agent. If the distance to this closest intersection with the mesh is less than the distance to the point, indicating that there is a closer (occluding) surface, we remove the point. We also perform this check in the other direction and remove the point if it is closer than the mesh intersection, indicating that the point lies in free space. Points in free space are a result of panorama alignment errors and scanning oddities from reflections. In practice, we find that exact equality is too strict a criterion due to mesh reconstruction errors; we instead remove a point only if the absolute difference between the two distances exceeds a small threshold.

The implementation implied by the description above would be very slow – it amounts to ray-tracing the scene. As in standard graphics pipelines, we can significantly speed this up by approximating it with rasterization. We rasterize the provided mesh and capture the depth buffer. The depth buffer provides the distance from the agent to the closest intersection with the mesh for a finite number of rays. We can then approximate the closest mesh intersection for every point in the point cloud by finding the rasterized ray that is closest to parallel with the ray through that point and using the corresponding depth-buffer value.
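A minimal NumPy sketch of the depth-buffer visibility check is shown below; it assumes points already transformed into the camera frame, a pinhole camera model, and a z-depth buffer, all of which are simplifications of the procedure described above.

```python
import numpy as np

def filter_visible(points_cam, depth_buffer, fx, fy, cx, cy, eps=0.1):
    """Keep only points that are (approximately) visible given a rasterized depth buffer.

    points_cam:   (N, 3) points in the camera frame (+z forward).
    depth_buffer: (H, W) z-depth of the closest mesh intersection per pixel.
    fx, fy, cx, cy: pinhole intrinsics used to rasterize the mesh.
    eps: tolerance on the depth comparison; the exact value is an assumption.
    """
    z = points_cam[:, 2]
    visible = np.zeros(points_cam.shape[0], dtype=bool)
    idx = np.where(z > 0)[0]
    # Project each point to pixel coordinates to find its (approximately) parallel ray.
    u = np.round(fx * points_cam[idx, 0] / z[idx] + cx).astype(int)
    v = np.round(fy * points_cam[idx, 1] / z[idx] + cy).astype(int)
    h, w = depth_buffer.shape
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx, u, v = idx[ok], u[ok], v[ok]
    # A point is kept when its depth matches the mesh depth along its ray,
    # i.e. it is neither occluded by the mesh nor floating in free space.
    visible[idx] = np.abs(z[idx] - depth_buffer[v, u]) <= eps
    return visible
```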

Further, a single global point cloud in Matterport3D environments has hundreds of millions of points and, in theory, we would need to check every single one to determine which points are visible. This would be quite slow and would use a prohibitive amount of memory. We alleviate this by creating a significantly sparser point cloud and performing an initial visibility check on that instead. We then recheck the dense point cloud only in areas of the sparse point cloud that passed the initial visibility check.

9 Perception Models

Here we provide full details on the architectures of our perceptual encoders and the decoder heads used for their pre-training tasks.

9.1 PC – PointNet++

We use notation similar to that of Qi et al. [32] to specify our PointNet++ architecture. A set abstraction module with multi-scale grouping takes as input the $l$-th level point cloud and produces the $(l{+}1)$-th level point cloud with a specified number of centroid points. See Section 4.1 in the main paper for a full description of a set abstraction module with single-scale grouping; a set abstraction module with multi-scale grouping is a logical extension. The feature descriptor for each point in the new level is calculated across several scales (balls with different radii), with a shared-weight multi-layer perceptron (MLP) applied at each scale; the feature descriptor for any point is then the concatenation of the feature descriptors calculated at each scale.

As an analogy, a multi-scale convolution would be achieved by first convolving the input with convolutions of different kernel sizes and then concatenating the outputs from each convolution.

A global set abstraction module summarizes the entire point cloud into a single feature vector. Our encoder is composed of three multi-scale set abstraction modules followed by a global set abstraction module.

A feature propagation layer transfers features from a coarser level to a finer one [32]. For each point in the finer level, its three nearest neighbors in the coarser level are found and their feature descriptors are combined via inverse-distance weighted interpolation. The interpolated feature descriptor is then processed by a shared-weight MLP with a specified number of output channels. A boolean flag determines whether any existing features of the finer level are used: if it is true, the interpolated feature and the existing feature are concatenated before the shared-weight MLP; if it is false, only the interpolated feature is processed by the shared-weight MLP.
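A minimal NumPy sketch of the inverse-distance interpolation used inside a feature propagation layer is shown below; the subsequent shared-weight MLP is omitted.

```python
import numpy as np

def propagate_features(coarse_pts, coarse_feats, fine_pts, k=3, eps=1e-8):
    """Transfer features from a coarse point set to a finer one via
    inverse-distance weighted interpolation over the k nearest neighbors."""
    out = np.zeros((fine_pts.shape[0], coarse_feats.shape[1]))
    for i, p in enumerate(fine_pts):
        d = np.linalg.norm(coarse_pts - p, axis=1)
        nn_idx = np.argsort(d)[:k]
        w = 1.0 / (d[nn_idx] + eps)        # inverse-distance weights
        w = w / w.sum()
        out[i] = (w[:, None] * coarse_feats[nn_idx]).sum(axis=0)
    return out
```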

The semantic segmentation head is composed of feature propagation layers that upsample features back to the full input point cloud, with one output channel per semantic class.

The color prediction head mirrors the semantic segmentation head, with three output channels (RGB). The structure prediction head is applied to the global feature: a fully connected layer produces a flat vector, and the output of the structure decoding head is then reshaped into a point cloud of 3D coordinates.

Training details. We train with Adam [46] and decay the learning rate on a fixed schedule. We select the checkpoint by performance on a held-out validation set.

9.2 RGB – ResNet50

We use the initial convolution and residual blocks from ResNet50 [39] as the encoder for RGB images. Each of the three decoders (semantic, depth, and color) is identical and consists of 1x1 convolutions and bi-linear interpolation.

The outputs from the second, third, and fourth residual blocks in ResNet50 serve as inputs to each decoder. Each decoder is built from 1-by-1 convolutional layers that transform their input to the desired number of channels and bi-linear interpolation layers that up-sample their input to the necessary size. The number of output channels depends on the decoder type: the number of semantic classes for segmentation, one for depth, and three for color.

Training details. Due to the high prevalence of walls and floors in indoor scenes, we use a class-weighted cross-entropy loss for semantic segmentation. For depth and autoencoding, we use the smooth-L1 loss. The total loss is then the weighted sum of the individual task losses:

$\mathcal{L} = \lambda_{sem}\,\mathcal{L}_{sem} + \lambda_{depth}\,\mathcal{L}_{depth} + \lambda_{ae}\,\mathcal{L}_{ae}$

We find weights that balance the magnitudes of the various losses work well.
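A minimal PyTorch sketch of this weighted multi-task loss is shown below; the default weight values are placeholders rather than the tuned values.

```python
import torch.nn.functional as F

def pretraining_loss(pred_sem, gt_sem, pred_depth, gt_depth, pred_rgb, gt_rgb,
                     w_sem=1.0, w_depth=1.0, w_ae=1.0, class_weights=None):
    """Weighted sum of the three pre-training losses: class-weighted cross-entropy
    for segmentation, smooth-L1 for depth and color autoencoding."""
    loss_sem = F.cross_entropy(pred_sem, gt_sem, weight=class_weights)
    loss_depth = F.smooth_l1_loss(pred_depth, gt_depth)
    loss_ae = F.smooth_l1_loss(pred_rgb, gt_rgb)
    return w_sem * loss_sem + w_depth * loss_depth + w_ae * loss_ae
```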

Tab. 2 provides a comparison of the pre-training tasks when trained separately vs. jointly, and a comparison with the Shallow CNN used in Das et al. [1]. The experiments show that training jointly doesn’t sacrifice performance on depth and autoencoding.

We train with Adam [46], sampling every third frame from the shortest paths. The checkpoint was selected by performance on a held-out validation set.

Model Total loss AE loss Depth loss Sem. loss PA MPA MIOU
Shallow CNN [1] – All Tasks 0.369 0.0047 0.0077 2.45 0.384 0.14 0.070
ResNet50 – AE only 1.774 0.0030 0.0873 8.71 0.008 0.02 0.002
ResNet50 – Depth only 2.311 0.1051 0.0072 11.89 0.005 0.02 0.002
ResNet50 – AE and depth only 1.003 0.0034 0.0067 9.02 0.006 0.03 0.002
ResNet50 – All Tasks 0.356 0.0040 0.0069 2.48 0.390 0.15 0.078
Table 2: Performance on the MP3D-EQA v1 validation set for Shallow CNN [1] and for ResNet50 when trained for autoencoding only, for depth only, for autoencoding and depth, and for all tasks jointly. For segmentation, we report the overall pixel accuracy (PA), the mean pixel accuracy (MPA) averaged over all semantic classes, and the mean intersection over union (MIOU). For depth and autoencoding, we report the smooth-L1 loss on the validation set.

10 Question answering model training

  1. lstm-question-only: we train with a batch size of 40 and utilize Adam;

  2. nn-question-only and bow-question-only we use the code provided in [40];

  3. attention+*: we train with a batch size of 20 and utilize Adam;

  4. spatial+RGB+Q: we train with a batch size of 32 and utilize Adam.

For all models, the best checkpoint is selected via performance on a held out validation set.

11 Navigation model training

For models trained with and without inflection weighted loss, we train and select checkpoints with the same procedure.

  1. R+*: we train with a batch size of 20 and utilize Adam;

  2. M+*: we train with a batch size of 5 full sequences and utilize Adam. Note that due to the length of sequences, we compute the forward and backward pass for each sequence individually and average the gradients.

Due to the difference between end-to-end evaluation and teacher-forcing accuracy, we use the following method to select checkpoints for navigation models: first we run non-maximum suppression with a window size of 5 on the teacher-forcing validation loss (vanilla cross-entropy for NoIW-* models, and inflection-weighted cross-entropy otherwise). After NMS, we select the top 5 checkpoints by validation loss and run them through end-to-end evaluation on the validation set. The best checkpoint is the model with the best end-to-end performance on the validation set. We find that inflection-weighted cross-entropy is a significantly better predictor of end-to-end performance than vanilla cross-entropy. Interestingly, we also find that teacher-forcing validation accuracy is a good predictor of end-to-end performance when training with inflection weighting, and we don't see a significant difference in performance if the procedure above is run with teacher-forcing validation error instead of loss.
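A minimal sketch of the checkpoint-selection step (window-based suppression on the validation-loss curve followed by a top-k cut), assuming one scalar validation loss per checkpoint, is shown below; the exact windowing details are an assumption.

```python
import numpy as np

def select_candidate_checkpoints(val_losses, window=5, top_k=5):
    """Keep checkpoints that are local minima of the validation-loss curve
    within a +/- `window` neighborhood, then keep the top_k by loss.
    The surviving candidates are then compared by full end-to-end evaluation."""
    losses = np.asarray(val_losses, dtype=float)
    keep = []
    for i, loss in enumerate(losses):
        lo, hi = max(0, i - window), min(len(losses), i + window + 1)
        if loss == losses[lo:hi].min():
            keep.append(i)
    keep.sort(key=lambda i: losses[i])
    return keep[:top_k]
```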

12 Results

We provide the full tables we first analyzed in their entirety and then sliced for the analysis provided in the main paper. Tab. 3 shows our primary results with inflection weighting and Tab. 4 shows the same set of models trained without inflection weighting.

We also provide the full tables with confidence intervals; these are 90% confidence intervals calculated with empirical bootstrapping. See Tab. 5 and Tab. 6.

As a reminder, we use the following notation to specify our models: For the base architecture, R denotes reactive models and M denotes memory models. The base architectures are then augmented with their input types, +PC, +RGB, and +Q. So a memory model that utilizes point clouds (but no question) is denoted as M+PC. Unless otherwise specified (by the prefix NoIW), models are trained with inflection weighting. We denote the two baseline navigators, forward-only and random, as Fwd and Random, respectively.

12.1 Navigation Performance Correlation Analysis

Figure 9: Correlation, and clustering of the correlations, of per-episode navigation performance. Reactive models form a strong cluster, while memory models are less correlated. Usage of the question changes navigators with vision significantly.

Here we provide some additional insight on the effect the question has on our navigation models. We find that while the question does not significantly impact performance on average, it does significantly change a model's behavior.

Fig. 9 shows the Pearson correlation coefficient between the per-episode performance of all models, along with cluster discovery using the Nearest Point Algorithm. All navigators that use the reactive base architecture form a strong cluster and are highly correlated with each other. The vision-less memory models are also well correlated with each other, indicating that the question has little relation to the distribution of actions along a shortest path. Memory models that use vision tend not to be well correlated with their counterparts that also use the question, in spite of the question having little effect on overall navigation metrics.

Navigation QA
Navigator (For reference) (Lower is better) (Lower is better) (Higher is better) (Lower is better) (Higher is better) (Higher is better)
R
R+Q
R+RGB
R+RGB+Q
R+PC
R+PC+Q
R+PC+RGB
R+PC+RGB+Q
M
M+Q
M+RGB
M+RGB+Q
M+PC
M+PC+Q
M+PC+RGB
M+PC+RGB+Q
Random
ShortestPath

Table 3: Evaluation of EmbodiedQA agents trained with inflection weighting on navigation and answering metrics for the MP3D-EQA v1 test set. RGB models perceive the world via RGB images and use ResNet50. PC models perceive the world via point clouds and use PointNet++. PC+RGB models use both perception modalities and their respective networks.
Navigation QA
Navigator (For reference) (Lower is better) (Lower is better) (Higher is better) (Lower is better) (Higher is better) (Higher is better)
NoIW-R
NoIW-R+Q
NoIW-R+RGB
NoIW-R+RGB+Q
NoIW-R+PC
NoIW-R+PC+Q
NoIW-R+PC+RGB
NoIW-R+PC+RGB+Q
NoIW-M
NoIW-M+Q
NoIW-M+RGB
NoIW-M+RGB+Q
NoIW-M+PC
NoIW-M+PC+Q
NoIW-M+PC+RGB
NoIW-M+PC+RGB+Q
NoIW-Random
NoIW-ShortestPath

Table 4: Evaluation of EmbodiedQA agents trained without inflection weighting on navigation and answering metrics for the MP3D-EQA v1 test set.
Navigation QA
Navigator (For reference) (Lower is better) (Lower is better) (Higher is better) (Lower is better) (Higher is better) (Higher is better)
NoIW-R
NoIW-R+Q