RobustNav: Towards Benchmarking Robustness in Embodied Navigation

As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual corruptions (affecting RGB inputs) and dynamics corruptions (affecting transition dynamics). Most recent efforts in visual navigation have typically focused on generalizing to novel target environments with similar appearance and dynamics characteristics. With RobustNav, we find that some standard embodied navigation agents significantly underperform (or fail) in the presence of visual or dynamics corruptions. We systematically analyze the kinds of idiosyncrasies that emerge in the behavior of such agents when operating under corruptions. Finally, for visual corruptions in RobustNav, we show that while standard techniques to improve robustness, such as data augmentation and self-supervised adaptation, offer some zero-shot resistance and improvements in navigation performance, there is still a long way to go in terms of recovering lost performance relative to clean “non-corrupt” settings, warranting more research in this direction. Our code is available at https://github.com/allenai/robustnav.

1 Introduction

Figure 1: RobustNav. (a) A navigation agent pretrained in clean environments is asked to navigate to targets in unseen environments in the presence of (b) visual and (c) dynamics-based corruptions. Visual corruptions (e.g., a camera crack) affect the agent's egocentric RGB observations, while dynamics corruptions (e.g., drift in translation) affect transition dynamics in the unseen environment.

A longstanding goal of the artificial intelligence community has been to develop algorithms for embodied agents that are capable of reasoning about rich perceptual information and thereby accomplishing tasks by navigating in and interacting with their environments. In addition to being able to exhibit these capabilities, it is equally important that such embodied agents are able to do so in a robust and generalizable manner.

A major challenge in Embodied AI is to ensure that agents can generalize to environments with different appearance statistics and motion dynamics than the environment used for training. For instance, an agent trained to navigate in “sunny” weather should continue to operate in rain despite the drastic changes in appearance, and an agent trained to move on carpet should still navigate reliably on a hardwood floor despite the discrepancy in friction. While a potential solution may be to calibrate the agent for a specific target environment, it is not a scalable one since there is an enormous variety of unseen environments and situations. A more robust, efficient and scalable solution is to equip agents with the ability to autonomously adapt to new situations by interaction, without having to train for every possible target scenario. Despite the remarkable progress in Embodied AI, especially in embodied navigation [62, 48, 50, 57, 8], most efforts focus on generalizing trained agents to unseen environments, but critically assume similar appearance and dynamics attributes across train and test environments.

As a first step towards assessing general purpose robustness of embodied agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of common visual (vis) and dynamics (dyn) corruptions – artifacts that affect the egocentric RGB observations and transition dynamics, respectively. We envision RobustNav as a testbed for adapting agent behavior across different perception and actuation properties. While assessing robustness to changes (stochastic or otherwise) in environments has been investigated in the robotics community [33, 14, 15, 22], the simulated nature of RobustNav enables practitioners to explore robustness against a rich and very diverse set of changes, while inheriting the advantages of working in simulation – speed, safety, low cost and reproducibility.

RobustNav consists of two widely studied embodied navigation tasks, Point-Goal Navigation (PointNav) [3] and Object-Goal Navigation (ObjectNav) [5] – the tasks of navigating to a goal coordinate in a global reference frame or to an instance of a specified object, respectively. Following the standard protocol, agents learn using a set of training scenes and are evaluated within a set of held-out test scenes, but differently, RobustNav test scenes are subject to a variety of realistic visual and dynamics corruptions. These corruptions can emulate real-world scenarios such as a malfunctioning camera or drift (see Fig. 1).

As zero-shot adaptation to test-time corruptions may be out of reach for current algorithms, we provide agents with a fixed “calibration budget” (a number of interactions) within the target world for unsupervised adaptation. This mimics a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. Post calibration, agents are evaluated on the two tasks in the corrupted test environments using standard navigation metrics.

Our extensive analysis reveals that both PointNav and ObjectNav agents experience significant degradation in performance across the range of corruptions, particularly when multiple corruptions are applied together. We show that this degradation is reduced in the presence of a clean depth sensor, suggesting the advantage of incorporating multiple sensing modalities to improve robustness. We find that data augmentation and self-supervised adaptation strategies offer some zero-shot resistance and improvement over degraded performance, but are unable to fully close this gap. Interestingly, we also note that visual corruptions affect embodied tasks differently from static tasks like object recognition – suggesting that visual robustness should be explored within an embodied task. Finally, we analyze several interesting behaviors our agents exhibit in the presence of corruptions – such as an increase in the number of collisions and an inability to terminate episodes successfully.

In summary, our contributions include: (1) We present RobustNav– a framework for benchmarking and assessing the robustness of embodied navigation agents to visual and dynamics corruptions. (2) Our findings show that present day navigation agents trained in simulation underperform severely when evaluated in corrupt target environments. (3) We systematically analyze the kinds of mistakes embodied navigation agents make when operating under such corruptions. (4) We find that although standard data-augmentation techniques and self-supervised adaptation strategies offer some improvement, much remains to be done in terms of fully recovering lost performance.

RobustNav provides a fast framework to develop and test robust embodied policies before they are deployed onto real robots. While RobustNav currently supports navigation-heavy tasks, the supported corruptions can easily be extended to more tasks as they become popular within the Embodied AI community.

2 Related Work

Visual Navigation. Tasks involving navigation based on egocentric visual inputs have witnessed exciting progress in recent years [50, 11, 25, 9, 20, 10]. Some of the widely studied tasks in this space include PointNav [3], ObjectNav [5] and goal-driven navigation where the target is specified by a goal image [62]. Approaches to solve PointNav and ObjectNav can broadly be classified into two categories – (1) learning neural policies end-to-end using RL [56, 60, 48, 50, 57] or (2) decomposing navigation into a mapping (building a semantic map) and path planning stage [7, 8, 26, 44]. Recent research has also focused on assessing the ability of policies trained in simulation to transfer to real-world robots operating in physical spaces [34, 13].

Robustness Benchmarks. Assessing robustness of deep neural models has received quite a bit of attention in recent years [31, 47, 32, 4]. Most relevant and closest to our work is [31], where the authors show that computer vision models are susceptible to several synthetic visual corruptions, as measured in the proposed ImageNet-C benchmark. In [35, 40], the authors study the effect of similar visual corruptions for semantic segmentation and object detection on standard static benchmarks. RobustNav integrates several visual corruptions from [31] and adds new ones such as low-lighting and a crack in the camera lens, but within an embodied scenario. Our findings (see Sec. 5) show that visual corruptions affect embodied tasks differently from static tasks like object recognition. In [53], the authors repurpose the ImageNet validation split to be used as a benchmark for assessing robustness to natural distribution shifts (unlike the ones introduced in [31]), and [18] identifies statistical biases in the same. Recently, [30] proposed three extensive benchmarks assessing robustness to image style, geographical location and camera operation.

Real-world RL Suite. Efforts similar to RobustNav have been made in [17], where the authors formalize different challenges holding back RL from real-world use – including actuator delays, high-dimensional state and action spaces, latency, and others. In contrast, RobustNav focuses on challenges in visually rich domains and the complexities associated with visual observations. Recently, Habitat [50] also introduced actuation (from [41]) and visual noise models for navigation tasks. In contrast, RobustNav is designed to benchmark the robustness of models against a variety of visual and dynamics corruptions (vis and dyn corruptions for both PointNav and ObjectNav).

Adapting Visuo-Motor Policies. Significant progress has been made on adapting policies trained with RL from a source to a target environment. Unlike RobustNav, major assumptions involved in such transfer settings are either access to task supervision in the target environment [24] or access to paired data from the source and target environments [23, 54]. Domain Randomization (DR) [2, 46, 38, 42] is another common approach to train policies robust to various environmental factors. Notably, [38] perturbs features early in the visual encoders of the policy network so as to mimic DR, and [42] selects optimal DR parameters during training based on sparse data obtained from the real world. In the absence of task supervision, another common approach is to optimize self-supervised objectives in the target environment [57, 49], which has been used to adapt policies to visual disparities (see Sec. 5) in new environments [27]. To adapt to changes in transition dynamics, a common approach is to train on a broad family of dynamics models and perform system identification (e.g., with domain classifiers [19]) in the target environment [59, 61]. [34, 13] study the extent to which embodied navigation agents transfer from simulated environments to real-world physical spaces. Among these, we investigate two of the most popular approaches – self-supervised adaptation [27] and aggressive data augmentation – and measure whether they can help build resistance to vis corruptions.

3 RobustNav

We present RobustNav, a benchmark to assess the robustness of embodied agents to common visual (vis) and dynamics (dyn) corruptions. RobustNav is built on top of RoboTHOR [12]. In this work, we study the effects corruptions have on two kinds of embodied navigation agents – namely, PointNav (navigate to a specified goal coordinate) and ObjectNav (navigate to an instance of an object category). While we restrict our experiments to navigation, in practice, our vis and dyn corruptions can also be extended to other embodied tasks that share the same modalities, for instance, tasks involving interaction with objects.

Figure 2: Visual Corruptions. Visual corruptions RobustNav supports in the unseen target environments. Top-left shows a clean RGB frame and rest show corrupted versions of the same.

In RobustNav, agents are trained within the training scenes and evaluated on “corrupt” unseen target scenes. Corruptions in target scenes are drawn from a set of predefined vis and dyn corruptions. As is the case with any modeling of corruptions (or noise) in simulation [33, 12], there will always be an approximation error when the vis and dyn corruptions are compared to their real-world counterparts. Our aim is to ensure that the RobustNav benchmark acts as a stepping stone towards the larger goal of obtaining robust agents ready to be deployed in the real world.

To adapt to a corrupt target scene, we provide agents with a “calibration budget” – an upper bound on the number of interactions an agent is allowed to have with the target environment without any external task supervision. This is done to mimic a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. We adopt a modest definition of the calibration budget based on the number of steps it takes an agent to reasonably recover degraded performance in the most severely corrupted environments when finetuned under complete supervision (see Table 3) – set to k steps for all our experiments. We attempt to understand if self-supervised adaptation approaches [27] improve performance when allowed to adapt under this calibration budget (see Sec. 5, resisting corruptions). We now describe in detail the vis and dyn corruptions present in RobustNav.

Visual Corruptions. Visual corruptions are artifacts that degrade the navigation agent's egocentric RGB observations (see Fig. 2). We provide seven visual corruptions within RobustNav, four of which are drawn from the set of corruptions and perturbations proposed in [31] – Spatter, Motion Blur, Defocus Blur and Speckle Noise; realistic corruptions that one might expect to see on a real robot. Spatter emulates occlusion in images due to particles of dirt, water droplets, etc. residing on the camera lens. Motion Blur emulates blur in images due to jittery movement of the robot. Defocus Blur occurs when the RGB image is out of focus. Speckle Noise emulates granular interference that inherently exists in and degrades the quality of images obtained by the camera (modeled as additive noise with the noise being proportional to the original pixel intensity). Each of these corruptions can manifest at five levels of severity, indicating increasing extents of visual degradation.

In addition to these, we also add low-lighting (low lighting conditions in the target environment, with associated severity levels), lower-FOV (agents operating with a lower camera field of view compared to the one used during training) and camera-crack (a randomized crack in the camera lens). For camera-crack, we use fixed random seeds for the validation scenes which dictate the location and kind of crack on the camera lens.
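To make the corruption models concrete, the sketch below applies two of them (low lighting and speckle noise) to an egocentric RGB frame. The severity-to-parameter mapping here is an assumption for illustration, not RobustNav's exact calibration.

```python
import numpy as np

def low_lighting(rgb, severity=5):
    """Darken an HxWx3 uint8 frame; higher severity means darker.
    The scaling factors are illustrative, not RobustNav's exact values."""
    factor = [0.8, 0.65, 0.5, 0.35, 0.2][severity - 1]
    return (rgb.astype(np.float32) * factor).clip(0, 255).astype(np.uint8)

def speckle_noise(rgb, severity=5):
    """Granular, signal-proportional (multiplicative) noise."""
    scale = [0.05, 0.1, 0.2, 0.3, 0.45][severity - 1]
    x = rgb.astype(np.float32) / 255.0
    noisy = x + x * np.random.normal(scale=scale, size=x.shape)
    return (noisy.clip(0.0, 1.0) * 255).astype(np.uint8)

# Corrupt a dummy 224x224 egocentric observation at two severities.
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
corrupted = speckle_noise(low_lighting(frame, severity=3), severity=5)
```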

Figure 3: Dynamics Corruptions. We show the kinds of dynamics corruptions supported in RobustNav. Motion Bias (C & S) is modeled to mimic friction. Motion Drift models a setting where translation actions have a slight bias towards rotating right (or left). In Motor Failure, one of the rotation actions fails.

Dynamics Corruptions. Dynamics corruptions affect the transition dynamics of the agents in the target environment (see Fig. 3). We consider three classes of dynamics corruptions – Motion Bias, Motion Drift and Motor Failure. Our dyn corruptions are motivated from and in line with the well-known systematic and/or stochastic drifts (due to error accumulation) and biases in robot motion [37, 6, 21, 43].

A common dynamics corruption observed in the real world is friction. Unfortunately, RoboTHOR does not yet natively support multiple friction zones within a scene, as may commonly be observed in a real physical environment (for instance, the kitchen floor in a house may have smooth tiles while the bedroom may have rough hardwood floors). In lieu of this, we present the Motion Bias corruption. In the absence of this corruption, the move_ahead action moves an agent forward by a fixed distance, and the rotate_left and rotate_right actions rotate an agent by a fixed angle to the left and right, respectively. Motion Bias can induce either (a) a constant bias in these amounts, drawn uniformly per-episode, or (b) stochastic translation and rotation amounts drawn per-step. Motion Bias (C) is intended to model scene-level friction, i.e., a different floor material in the target environment, while Motion Bias (S) is intended to model high and low friction zones within a scene. Including more sophisticated models of friction is on the feature roadmap for RobustNav.

Motion Drift models a setting where an agent's translation movements in the environment include a slight bias towards turning left or right. Specifically, the move_ahead action, instead of moving an agent forward in the direction of its heading (the intended behavior), drifts stochastically towards the left or right direction (fixed for an episode), taking the agent to a location that deviates from the original heading in the perpendicular direction by a bounded amount. Motor Failure is the setting where either the rotate_left or the rotate_right action malfunctions throughout an evaluation episode.
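The sketch below illustrates how these dynamics corruptions can be thought of as perturbations of an action's nominal effect. The step size, turn angle, drift angle and noise range are assumptions chosen for illustration; they are not RobustNav's calibrated values.

```python
import math
import random

# Nominal action effects (RoboTHOR-style values, assumed here for illustration).
STEP_M, TURN_DEG = 0.25, 30.0

def corrupted_effect(action, corruption, drift_deg=10.0, failed_action="rotate_left"):
    """Return (forward_m, lateral_m, rotation_deg) produced by `action`
    under a dynamics corruption. Purely illustrative parameterization."""
    forward = STEP_M if action == "move_ahead" else 0.0
    lateral = 0.0
    rotation = {"rotate_left": -TURN_DEG, "rotate_right": TURN_DEG}.get(action, 0.0)

    if corruption == "motion_drift" and action == "move_ahead":
        # Translation deviates sideways by a per-episode drift angle.
        lateral = forward * math.tan(math.radians(drift_deg))
    elif corruption == "motion_bias_stochastic" and action == "move_ahead":
        forward *= random.uniform(0.7, 1.3)   # assumed per-step noise range
    elif corruption == "motor_failure" and action == failed_action:
        rotation = 0.0                        # the affected rotation action does nothing
    return forward, lateral, rotation

print(corrupted_effect("move_ahead", "motion_drift"))
print(corrupted_effect("rotate_left", "motor_failure"))
```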

With the exception of Motion Bias (S) – the stochastic version – the agent also operates under standard actuation noise models calibrated from a LoCoBot in [13]. Recently, PyRobot [41] has also introduced LoCoBot-calibrated noise models that demonstrate strafing and drifting. While we primarily rely on the noise models calibrated in [12], for completeness, we also include results with the PyRobot noise models.

Tasks. RobustNav consists of two major embodied navigation tasks – namely, PointNav and ObjectNav. In PointNav, an agent is initialized at a random spawn location and orientation in an environment and is asked to navigate to target coordinates specified relative to the agent's position. The agent must navigate based only on sensory inputs from an RGB (or RGB-D) and a GPS + Compass sensor. An episode is declared successful if the agent stops within a threshold distance of the goal location (by intentionally invoking an end action). In ObjectNav, an agent is instead asked to navigate to an instance of a specified object category (e.g., Television, out of 12 total object categories) given only egocentric sensory inputs – RGB or RGB-D. An episode is declared successful if the agent stops within a threshold distance of the target object (by invoking an end action) and has the target object in its egocentric view. Due to the lack of perfect localization (no GPS + Compass sensor) and the implicit need to ground the specified object within its view, ObjectNav may be considered a harder task compared to PointNav – also evident in the lower ObjectNav performance (Table 2).

Metrics. We report performance in terms of the following well-established navigation metrics used in past works – Success Rate (SR) and Success weighted by Path Length (SPL) [3]. SR indicates the fraction of successful episodes. SPL scores the agent's path based on how close its length is to the shortest path from the spawn location to the target. If $S_i$ denotes whether episode $i$ is successful (binary indicator), $l_i$ is the shortest path length, and $p_i$ is the agent's path length, then over $N$ evaluation episodes, $\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)}$.
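For concreteness, a minimal sketch of how SR and SPL can be computed from per-episode records (the record field names here are illustrative):

```python
def success_rate(episodes):
    """SR: fraction of successful episodes."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length [3]. Each record holds:
    success (0/1), shortest_path (l_i), agent_path (p_i)."""
    total = 0.0
    for e in episodes:
        l, p = e["shortest_path"], e["agent_path"]
        total += e["success"] * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"success": 1, "shortest_path": 4.0, "agent_path": 5.0},
    {"success": 0, "shortest_path": 6.0, "agent_path": 9.0},
]
print(success_rate(episodes), spl(episodes))  # 0.5 0.4
```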

Scenes. RobustNav is built on top of the RoboTHOR scenes [13]. RoboTHOR consists of training and validation environments based on indoor apartment scenes drawn from different layouts. To assess robustness in the presence of corruptions, we evaluate PointNav (and ObjectNav) agents on episodes of varying difficulty – easy, medium and hard, based on shortest path lengths – across the val scenes.

Benchmarking. Present day embodied navigation agents are typically trained without any corruptions. However, we anticipate that researchers may incorporate corruptions as augmentations at training time to improve the robustness of their algorithms in order to make progress on our RobustNav framework. For the purposes of fair benchmarking, we recommend that future comparisons using RobustNav do not draw from the set of corruptions reserved for the target scenes – ensuring the corruptions encountered in the target scenes are indeed “unseen”.

4 Experimental Setup

Corruptions Top-1 Acc. Top-5 Acc.
1 Clean 69.76 89.08
2 Camera Crack 57.71 ± 5.82 80.27 ± 4.54
3 Lower FOV 45.44 69.53
4 Low Lighting 35.76 58.54
5 Spatter 19.73 39.34
6 Motion Blur 10.11 22.66
7 Defocus Blur 9.39 22.25
8 Speckle Noise 7.79 18.84
Table 1: ImageNet Performance Degradation. Degradation in classification performance on the ImageNet validation split under visual corruptions for a ResNet-18 [29] trained on ImageNet (used as the agent's visual encoder). Corruptions 2-8 are present in RobustNav. Since mimicking a lower FOV requires access to camera intrinsics, unavailable for static datasets, we mimic it via aggressive center-cropping. For camera-crack, we report performance (mean ± std) over all possible crack settings present in RobustNav.
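Numbers like those in Table 1 can be reproduced along the following lines, assuming a locally available ImageNet validation folder (the path is a placeholder) and a corruption transform such as the one sketched in Sec. 3; this is an illustrative evaluation loop, not the exact protocol used for the table.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

model = models.resnet18(pretrained=True).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    # A corruption transform (e.g., speckle noise) would be applied here.
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val = ImageFolder("/path/to/imagenet/val", transform=preprocess)  # placeholder path
loader = DataLoader(val, batch_size=64, num_workers=4)

top1 = top5 = n = 0
with torch.no_grad():
    for images, labels in loader:
        pred5 = model(images).topk(5, dim=1).indices
        top1 += (pred5[:, 0] == labels).sum().item()
        top5 += (pred5 == labels.unsqueeze(1)).any(dim=1).sum().item()
        n += labels.numel()
print(f"Top-1: {100 * top1 / n:.2f}  Top-5: {100 * top5 / n:.2f}")
```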

Agent. Our PointNav agents have 4 actions available to them – namely, move_ahead, rotate_left, rotate_right and end. The action end indicates that the agent believes that it has reached the goal, thereby terminating the episode. During evaluation, we allow an agent to execute a maximum of 300 steps – if an agent does not call end within 300 steps, we forcefully terminate the episode. For ObjectNav, in addition to the aforementioned actions, the agent also has the ability to look_up or look_down – indicating a change in the agent's view above or below the forward camera horizon. The agent receives egocentric observations (RGB or RGB-D). All agents are trained under LoCoBot-calibrated actuation noise models from [13] for translation and rotation. Our agent architectures (akin to [56]) are composed of a CNN head to process input observations followed by a recurrent (GRU) policy network (more details in Sec. A.3 of the appendix).
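A minimal sketch of the evaluation protocol described above; `env` and `agent` are hypothetical interfaces rather than the actual RobustNav / AllenAct API.

```python
MAX_STEPS = 300

def run_episode(env, agent):
    """Roll out one evaluation episode. Success is judged only on an
    intentional `end` call; otherwise the episode is forcefully terminated."""
    obs, hidden = env.reset(), None
    for _ in range(MAX_STEPS):
        action, hidden = agent.act(obs, hidden)
        if action == "end":            # agent believes it has reached the goal
            return env.goal_reached()  # hypothetical success check
        obs = env.step(action)
    return False                       # no `end` within 300 steps
```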

Training. We train our agents using DD-PPO [56] – a decentralized, distributed and synchronous version of the Proximal Policy Optimization (PPO) [52] algorithm. If $r_{\text{success}}$ denotes the terminal reward obtained at the end of a successful episode (with $\mathbb{1}_{\text{success}}$ being an indicator variable denoting whether the episode was successful), $d_{t-1} - d_t$ denotes the change in geodesic distance to the target at timestep $t$ from timestep $t-1$, and $r_{\text{slack}}$ denotes a slack penalty to encourage efficiency, then the reward received by the agent at timestep $t$ can be expressed as $r_t = \mathbb{1}_{\text{success}} \cdot r_{\text{success}} + (d_{t-1} - d_t) + r_{\text{slack}}$.
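As a worked instance of this reward, a small sketch (the success bonus and slack penalty values are assumptions, not the exact constants used in training):

```python
def step_reward(success, d_prev, d_curr, r_success=10.0, r_slack=-0.01):
    """Terminal success bonus + change in geodesic distance + slack penalty."""
    return r_success * float(success) + (d_prev - d_curr) + r_slack

# Agent moved 0.2 m closer to the goal and the episode is not yet done:
print(step_reward(False, d_prev=3.4, d_curr=3.2))  # ~0.19
```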

We train our agents using the AllenAct [55] framework.

5 Results and Findings

# Corruption | V | D | PointNav RGB: SR, SPL | PointNav RGB-D: SR, SPL | ObjectNav RGB: SR, SPL | ObjectNav RGB-D: SR, SPL
1 Clean 98.82 83.13 98.54 84.60 31.05 14.26 35.62 17.20
2 Low Lighting 94.36 75.15 99.45 84.97 10.78 4.59 21.64 9.98
3 Motion Blur 95.72 73.37 99.36 85.36 10.59 4.03 20.27 8.29
4 Camera Crack 82.07 63.83 95.72 81.21 7.21 3.57 24.29 12.50
5 Defocus Blur 75.89 53.55 99.09 85.54 5.02 2.42 19.18 7.90
6 Speckle Noise 67.42 48.57 98.73 84.66 9.04 3.66 18.63 7.52
7 Lower-FOV 42.49 31.73 89.08 73.59 9.77 3.90 9.86 4.77
8 Spatter 33.58 24.72 98.91 84.81 6.76 2.93 21.10 9.06
9 Motion Bias (C) 92.81 77.83 93.36 79.46 31.51 14.09 31.96 15.38
10 Motion Bias (S) 94.72 76.95 96.72 79.08 30.87 14.15 35.62 16.39
11 Motion Drift 95.72 76.19 93.36 75.08 29.68 13.58 34.06 17.03
12 PyRobot [41] (ILQR) Mul. = 1.0 96.00 67.79 95.45 69.27 32.51 11.26 36.35 13.62
13 Motor Failure 20.56 17.63 20.56 17.62 4.20 2.43 6.39 3.67
14 Defocus Blur + Motion Bias (S) 76.52 51.08 97.18 79.46 5.57 2.00 18.54 7.23
15 Speckle Noise + Motion Bias (S) 62.69 43.31 95.81 78.27 7.85 3.73 18.54 8.16
16 Spatter + Motion Bias (S) 33.30 23.33 95.81 78.85 7.85 3.09 21.28 9.26
17 Defocus Blur + Motion Drift 74.25 50.99 95.54 76.66 4.57 1.93 17.35 6.97
18 Speckle Noise + Motion Drift 64.42 44.73 94.36 75.23 8.49 3.67 19.82 8.61
19 Spatter + Motion Drift 32.94 23.44 95.45 76.61 6.85 2.68 19.54 8.86
Table 2: PointNav and ObjectNav Performance. Degradation in task performance of pretrained PointNav (trained for M frames) and ObjectNav (trained for M frames) agents when evaluated under vis and dyn corruptions present in RobustNav. PointNav agents have additional access to a GPS + Compass sensor. For visual corruptions with controllable severity levels, we report results with severity set to 5 (worst). Performance is measured across tasks of varying difficulties (easy, medium and hard). Rows are sorted based on SPL values for RGB PointNav agents. Success and SPL values are reported as percentages. (V = Visual, D = Dynamics)

In this section, we show that the performance of PointNav and ObjectNav agents degrades in the presence of corruptions (see Table 2). We first highlight how vis corruptions affect static vision and embodied navigation tasks differently (see Table 1). Following this, we analyze behaviors that emerge in these agents when operating in the presence of vis, dyn, and vis+dyn corruptions. Finally, we investigate whether standard data-augmentation and self-supervised adaptation [27] techniques help recover the degraded performance (see Table 3).

5.1 Degradation in Performance

We now present our findings regarding degradation in performance relative to agents being evaluated in clean (no corruption) target environments (row 1 in Table 2).

Visual corruptions affect static and embodied tasks differently. In Table 1, we report object recognition performance for models trained on the ImageNet [16] train split and evaluated on the corrupt validation splits. In Table 2, we report performance degradation of PointNav and ObjectNav agents under corruptions (row 1, clean & rows 2-8, corrupt). It is important to note that the nature of the tasks (one-shot prediction vs. sequential decision making) is different enough that the difficulty of corruptions for classification may not indicate the difficulty of corruptions for navigation. We verify this hypothesis by comparing results in Tables 1 and 2 – for instance, corruptions which are severe for classification (Defocus Blur and Speckle Noise) are not as severe for PointNav-RGB agents in terms of relative drop from clean performance. Additionally, for a Mask-RCNN [28] trained on AI2-THOR images, we note that detection (segmentation) performance degrades under vis corruptions in a different relative order than navigation performance – for the 12 ObjectNav target classes, mAP drops under both Spatter (S5) and Low-Lighting (S5), but unlike rows 2 & 8 in Table 2, Spatter does not appear to be much more severe than Low-Lighting. This difference in relative degradation suggests that techniques for visual adaptation or robustness in static settings may not transfer out-of-the-box to embodied tasks, warranting more research in this direction.

Not all corruptions are equally bad. While we note that PointNav and ObjectNav agents suffer a drop in performance from clean settings, not all corruptions are equally severe. For instance, in PointNav-RGB, while Low Lighting, Motion Blur and Motion Bias (C) (rows 2, 3, 9 in Table 2) lead to comparatively modest worst-case absolute drops in SPL (and SR), corruptions like Spatter and Motor Failure (rows 8, 13) are more extreme and significantly affect task performance. For ObjectNav, however, the drop in performance is more gradual across corruptions (partly because it is a harder task and even clean performance is fairly low).

A “clean” depth sensor helps resist degradation. We compare the RGB and RGB-D variants of the trained PointNav and ObjectNav agents (RGB corrupt, Depth clean) in Table 2 (corresponding RGB & RGB-D columns). We observe that including a “clean” depth sensor consistently improves resistance to vis, dyn and vis+dyn corruptions for both PointNav and ObjectNav. For PointNav, we note that while the RGB and RGB-D variants have comparable clean performance (row 1), under severe corruptions (Spatter, Lower-FOV and Speckle Noise) the RGB-D counterparts are ahead by a sizeable absolute SPL margin. We further observe that, barring exceptions, PointNav RGB-D agents are generally affected minimally by corruptions – for instance, Low-Lighting and Motion Blur barely result in any drop in performance. We hypothesize that this is likely because RGB-D navigation agents are much less reliant on the RGB sensor compared to their RGB counterparts. In ObjectNav, an additional depth sensor generally improves clean performance (row 1 in Table 2), which is likely the major contributing factor to the increased resistance to corruptions. Sensors of different modalities are likely to degrade in different scenarios – e.g., a depth sensor may continue to perceive details in low-lighting settings. The obtained results suggest that adding multiple sensors, while expensive, can help train robust models. Additional sensors can also be helpful for unsupervised adaptation during the calibration phase. For instance, in the presence of a “clean” depth sensor, one can consider comparing depth-based egomotion estimates with expected odometry readings in the target environment to infer changes in dynamics.

Figure 4: Agent Behavior Analysis. To understand agent behaviors, we report the breakdown of four metrics: the number of collisions as observed through Failed Actions (first column), how close the agent gets to the target as measured by Min. Dist. to Target in meters (second column), and failure to appropriately end an episode either when out of range – Stop-Fail (Pos), third column – or when in range – Stop-Fail (Neg), fourth column. Each behavior is reported for both PointNav (top row) and ObjectNav (bottom row) RGB agents within a clean and five corrupt settings: Defocus Blur (D.B.), Speckle Noise (S.N.), Motion Drift (M.D.), Defocus Blur + Motion Drift, and Speckle Noise + Motion Drift. Bars are color-coded by setting – clean, vis corruptions, dyn corruptions and vis+dyn corruptions. The blue line in column 2 indicates the distance threshold for the goal being in range. Severities for S.N. and D.B. are set to 5 (worst).

In Sec. A.5 of the appendix, we further investigate the degree to which more sophisticated PointNav agents, composed of map-based architectures, are susceptible to vis corruptions. Specifically, we evaluate the performance of the winning PointNav entry of the Habitat Challenge (HC) 2020 [1] – Occupancy Anticipation (OccAnt) [45] – on Gibson [58] val scenes under noise-free conditions, Habitat Challenge conditions and vis corruptions. We find that introducing corruptions under noise-free conditions degrades navigation performance significantly only for RGB agents. Under HC conditions, RGB-D agents suffer a drop in performance as RGB noise is replaced with progressively severe vis corruptions.

Presence of vis+dyn corruptions further degrades performance. Rows 14-19 in Table 2 indicate the extent of performance degradation when vis+dyn corruptions are present. With the exception of a few cases, as expected, the drop in performance is slightly more pronounced compared to the presence of just vis or dyn corruptions. The relative drop in performance from vis to vis+dyn is more pronounced for ObjectNav as opposed to PointNav.

Navigation performance for RGB agents degrades consistently with escalating episode difficulty. Recall that we evaluate navigation performance over episodes of varying difficulty levels (see Sec. 3). We break down the performance of PointNav & ObjectNav agents by episode difficulty level (in Sec. A.5 of the appendix). Under “clean” settings, we find that PointNav agents (RGB and RGB-D) have comparable performance across all difficulty levels. Under corruptions, we note that unlike the RGB-D counterparts, the performance of PointNav-RGB agents consistently deteriorates as the episodes become harder. ObjectNav agents (both RGB & RGB-D) show a similar trend of decreasing navigation performance with increasing episode difficulty.

5.2 Behavior of Visual Navigation Agents

We now study the idiosyncrasies (see Fig. 4) exhibited by these agents (PointNav-RGB and ObjectNav-RGB) which lead to their degraded performance.

Agents tend to collide more often. Fig. 4 (first column, bars color-coded based on the kind of corruption) shows the average number of failed actions under corrupt settings. In our framework, failed actions occur as a consequence of colliding with objects, walls, etc. While corruptions generally lead to increased collisions, we note that adding a dyn corruption on top of a vis one (D.B. to D.B. + M.D., and S.N. to S.N. + M.D.) increases the number of collisions over having only vis or dyn corruptions – dyn corruptions lead to unforeseen changes in dynamics (actions working unexpectedly), which likely contributes to an uptick in collisions.

# Approach | Clean: SR, SPL | Lower-FOV: SR, SPL | Defocus Blur: SR, SPL | Camera Crack: SR, SPL | Spatter: SR, SPL
1 Nav. Loss 98.82 83.13 42.49 31.73 75.89 53.55 82.07 63.83 33.58 24.72
2 Nav. Loss + AP 98.45 83.28 45.68 35.14 83.35 61.51 72.70 56.82 20.38 15.70
3 Nav. Loss + AP + SS-Adapt 37.31 31.03 32.94 26.09 40.95 33.35 57.87 46.72 14.19 10.29
4 Nav. Loss + RP 98.73 82.53 44.95 32.74 32.21 22.47 67.06 53.70 23.48 18.63
5 Nav. Loss + RP + SS-Adapt 94.63 77.25 50.59 36.10 79.16 62.74 60.42 49.37 61.06 47.16
6 Nav. Loss + Data Aug 98.45 81.08 71.70 54.54 81.26 61.32 88.44 71.57 23.93 18.41
7 Finetune Nav. Loss on Target - - 72.88 61.82 97.18 80.32 96.54 80.92 91.81 77.38
Table 3: Resisting Visual Corruptions. To assist near-term progress, we study whether standard approaches towards training visually robust models or adapting to visual disparities can help resist visual corruptions. All agents in rows 1-7 are PointNav RGB agents pre-trained for M frames. Agents in rows 3 & 5 are obtained by additionally running self-supervised adaptation for k steps. Agents in row 7 provide an anecdotal upper bound indicating attainable improvements when finetuned with task supervision under the calibration budget – set to k steps. For visual corruptions with controllable severity levels, we report results with severity set to 5 (worst).

Agents tend to be farther from the target. Fig. 4 (second column) shows the minimum distance from the target over the course of an episode. While we note that as corruptions become progressively severe, agents tend to terminate farther away from the target (see Sec. A.4 of the appendix), Fig. 4 (second column) indicates that the overall proximity of the agent to the goal over an episode also decreases – the minimum distance to target increases as we go from Clean to vis or dyn corruptions, and from vis or dyn to vis+dyn corruptions. While this may be intuitive in the presence of a dyn corruption, it is interesting to note that this trend also holds for vis corruptions (Clean to D.B. or S.N.).

Corruptions hurt the ObjectNav stopping mechanism. Recall that for both PointNav and ObjectNav, success depends on the notion of “intentionality” [5] – the agent calls an end action when it believes it has reached the goal. In Fig. 4 (last two columns) we aim to understand how corruptions affect this stopping mechanism. Specifically, we look at two quantitative measures – (1) Stop-Failure (Positive), the proportion of times the agent invokes an end action when the goal is not in range; and (2) Stop-Failure (Negative), the proportion of times the agent does not invoke an end action when the goal is in range, out of the number of times the goal is in range. (The goal-in-range criterion for PointNav checks if the target is within the threshold distance; for ObjectNav, this includes an additional visibility criterion.)

We observe that prematurely calling an end action is a significant issue only for ObjectNav (Fig. 4, third column) – and it becomes more pronounced as corruptions become progressively severe (Clean to D.B. or S.N.; M.D. to D.B. + M.D. or S.N. + M.D.). Similarly, the inability of an agent to invoke an end action is also more pronounced for ObjectNav as opposed to PointNav (Fig. 4, fourth column). To investigate the extent to which this impacts the agent's performance, we compare the agent's Success Rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (call end as soon as the goal is in range). We find that this makes a significant difference only for ObjectNav (across the Clean, M.D. and D.B. + M.D. settings). We hypothesize that equipping agents with robust stopping mechanisms can significantly improve performance on RobustNav. For instance, equipping the agent with a progress monitor module [39] (estimating progress made towards the goal in terms of distance) that is robust to vis corruptions can potentially help decide when to explicitly invoke an end action in the target environment.
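A sketch of how the two stop-failure measures can be computed from a per-step trajectory log; the field names are hypothetical.

```python
def stop_failure_rates(trajectory):
    """trajectory: list of per-step dicts with keys 'goal_in_range' (bool)
    and 'action' (str).

    Stop-Fail (Pos): fraction of `end` calls issued while the goal is NOT in range.
    Stop-Fail (Neg): fraction of in-range steps where the agent did NOT call `end`."""
    end_calls = [s for s in trajectory if s["action"] == "end"]
    in_range = [s for s in trajectory if s["goal_in_range"]]

    pos = sum(not s["goal_in_range"] for s in end_calls) / max(len(end_calls), 1)
    neg = sum(s["action"] != "end" for s in in_range) / max(len(in_range), 1)
    return pos, neg
```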

5.3 Resisting Corruptions

To assist near-term progress, we investigate whether some standard approaches towards training robust models or adapting to visual disparities can help resist vis corruptions under a calibration budget (Sec. 3) – set to k steps, based on the number of steps it takes an agent to reasonably recover degraded performance in corrupted environments when finetuned with complete task supervision.

Extent of attainable improvement by finetuning under task supervision. As an anecdotal upper bound on attainable improvements under the calibration budget, we also report the extent to which degraded performance can be recovered when finetuned under complete task supervision. We report these results for vis corruptions in Table 3 (row 7). We note that unlike Lower-FOV, the agent is able to almost fully recover performance for Defocus Blur, Camera-Crack and Spatter (Table 3, rows 1 & 7).

Do data-augmentation strategies help? In Table 3, we study whether data-augmentation strategies improve zero-shot resistance to vis corruptions (rows 1, 6). We compare PointNav RGB agents trained with Random-Crop, Random-Shift and Color-Jitter (row 6) with the vanilla versions (row 1) and find that while data augmentation (row 6) offers some improvement (Spatter being an exception) over degraded performance (row 1) – absolute improvements in SPL and SR for Lower-FOV, Defocus Blur and Camera-Crack – the obtained performance is still significantly below Clean settings (row 1, Clean col). Improvements are more pronounced for Lower-FOV compared to the others (likely due to Random-Shift and Random-Crop). Overall, data augmentation provides improvements only for a subset of vis corruptions, and when it does, the obtained improvements are still not sufficient to recover the lost performance.
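A sketch of such an augmentation pipeline applied to egocentric observations during training; the specific magnitudes are assumptions, not the exact settings used for row 6.

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomCrop(224, padding=8),                       # random crop with padding
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # small random shift
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
])

obs = torch.rand(3, 224, 224)  # dummy egocentric RGB observation in [0, 1]
augmented = augment(obs)
```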

Do self-supervised adaptation approaches help? In the absence of reward supervision in the target environment, Hansen et al. [27] proposed Policy Adaptation during Deployment (PAD) – source pretraining with an auxiliary self-supervised objective and optimizing only the self-supervised objective when deployed in the target environment. We investigate the degree to which PAD helps in adapting to the target environments in RobustNav. The adopted self-supervised tasks are (1) Action-Prediction (AP) – given two successive observations in a trajectory, predict the intermediate action – and (2) Rotation-Prediction (RP) – rotate the input observation by one of a fixed set of angles before feeding it to the agent and task an additional auxiliary head with predicting the rotation. We report numbers with AP (rows 2, 3) and RP (rows 4, 5) in Table 3. For AP, we find that (1) pre-training (row 2 vs row 1) results in little or no improvement over degraded performance (the maximum absolute improvement in SPL and SR is for Defocus Blur) and (2) further adaptation (row 3 vs rows 2, 1) under the calibration budget consistently degrades performance. For RP, we observe that (1) with the exception of Clean and Lower-FOV, pre-training (row 4 vs row 1) results in worse performance and (2) while self-supervised adaptation under corruptions improves performance over pre-training (row 5 vs row 4), it is still significantly below Clean settings (row 1, Clean col) – the smallest absolute gap in SPL and SR is between Defocus Blur (row 5) and Clean (row 1). While improvements over degraded performance might highlight the utility of PAD (with AP / RP) as a potential unsupervised adaptation approach, there is still a long way to go in terms of closing the performance gap between clean and corrupt settings.
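A minimal sketch of a PAD-style adaptation step with the RP objective: rotate the current observation, predict the rotation bin with an auxiliary head, and update only the visual encoder and auxiliary head with no task reward. The module definitions, rotation bins and learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_rp_step(encoder, rp_head, optimizer, obs):
    """One self-supervised rotation-prediction update at deployment time.
    Only `encoder` and `rp_head` receive gradients; the policy head is untouched."""
    k = torch.randint(0, 4, (1,)).item()          # rotation bin (0/90/180/270 deg assumed)
    rotated = torch.rot90(obs, k, dims=(-2, -1))  # obs: 1x3xHxW tensor
    logits = rp_head(encoder(rotated).flatten(1))
    loss = F.cross_entropy(logits, torch.tensor([k]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with placeholder modules.
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
rp_head = nn.Linear(8 * 4 * 4, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(rp_head.parameters()), lr=1e-4)
pad_rp_step(encoder, rp_head, opt, torch.rand(1, 3, 224, 224))
```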

6 Conclusion

In summary, as a step towards assessing the general-purpose robustness of embodied navigation agents, we propose RobustNav, a challenging framework well suited to benchmark the robustness of embodied navigation agents under a wide variety of visual and dynamics corruptions. To succeed on RobustNav, an agent must be insensitive to corruptions and also be able to adapt to unforeseen changes in new environments with minimal interaction. We find that standard PointNav and ObjectNav agents underperform (or fail) significantly in the presence of corruptions, and while standard techniques to improve robustness or adapt to environments with visual disparities (data augmentation, self-supervised adaptation) provide some improvements, considerable room for improvement remains in terms of fully recovering lost navigation performance. Lastly, we plan on evolving RobustNav in terms of the sophistication and diversity of corruptions as more features are supported in the underlying simulator. We release RobustNav in RoboTHOR, and hope that our findings provide insights into developing more robust navigation agents.

Acknowledgements. We thank Klemen Kotar, Luca Weihs, Martin Lohmann, Harsh Agrawal and Rama Vedantam for fruitful discussions and valuable feedback. We thank Winson Han for helping out with the Camera-Crack vis corruption. We also thank Vishvak Murahari for helping out with the ImageNet experiments and for sharing code for the Mask-RCNN experiments. This work is supported by the NASA University Leadership Initiative (ULI) under grant number 80NSSC20M0161.

References

  • [1] Habitat Challenge, 2020. https://aihabitat.org/challenge/2020/.
  • [2] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv, 2019.
  • [3] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv, 2018.
  • [4] Maksym Andriushchenko and Nicolas Flammarion. Understanding and improving fast adversarial training. In NeurIPS, 2020.
  • [5] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv, 2020.
  • [6] Kostas E. Bekris, Andrew Ladd, and Lydia E. Kavraki. Efficient motion planners for systems with dynamics. In ICRA Workshop, 2007.
  • [7] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In ICLR, 2019.
  • [8] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In NeurIPS, 2020.
  • [9] Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. In ECCV, 2020.
  • [10] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. In ICLR, 2021.
  • [11] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
  • [12] Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, et al. Robothor: An open simulation-to-real embodied ai platform. In CVPR, 2020.
  • [13] Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform. In CVPR, 2020.
  • [14] Andrea Del Prete and Nicolas Mansard. Addressing constraint robustness to torque errors in task-space inverse dynamics. In RSS, 2015.
  • [15] Andrea Del Prete and Nicolas Mansard. Robustness to joint-torque-tracking errors in task-space inverse dynamics. IEEE Trans. on Robotics, 2016.
  • [16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [17] Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv, 2020.
  • [18] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying statistical bias in dataset replication. In ICML, 2020.
  • [19] Benjamin Eysenbach, Swapnil Asawa, Shreyas Chaudhari, Ruslan Salakhutdinov, and Sergey Levine. Off-dynamics reinforcement learning: Training for transfer with domain classifiers. arXiv, 2020.
  • [20] Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In ICRA, 2020.
  • [21] Carlos Garcia-Saura. Self-calibration of a differential wheeled robot using only a gyroscope and a distance sensor. arXiv, 2015.
  • [22] Nirmal Giftsun, Andrea Del Prete, and Florent Lamiraux. Robustness to inertial parameter errors for legged robots balancing on level ground. In International Conference on Informatics in Control, Automation and Robotics, 2017.
  • [23] Florian Golemo, Adrien Ali Taiga, Aaron Courville, and Pierre-Yves Oudeyer. Sim-to-real transfer with neural-augmented robot simulation. In CoRL, 2018.
  • [24] Daniel Gordon, Abhishek Kadian, Devi Parikh, Judy Hoffman, and Dhruv Batra. Splitnet: Sim2sim and task2task transfer for embodied visual navigation. In ICCV, 2019.
  • [25] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. Iqa: Visual question answering in interactive environments. In CVPR, 2018.
  • [26] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
  • [27] Nicklas Hansen, Yu Sun, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. In ICLR, 2021.
  • [28] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [30] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv, 2020.
  • [31] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
  • [32] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. In ACL, 2020.
  • [33] Sebastian Höfer, Kostas Bekris, Ankur Handa, Juan Camilo Gamboa, Florian Golemo, Melissa Mozifian, Chris Atkeson, Dieter Fox, Ken Goldberg, John Leonard, et al. Perspectives on sim2real transfer for robotics: A summary of the r:ss 2020 workshop. arXiv, 2020.
  • [34] Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Are we making real progress in simulated environments? measuring the sim2real gap in embodied visual navigation. In IROS, 2020.
  • [35] Christoph Kamann and Carsten Rother. Benchmarking the robustness of semantic segmentation models. In CVPR, 2020.
  • [36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [37] Andrew M. Ladd and Lydia E. Kavraki. Motion planning in the presence of drift, underactuation and discrete system changes. In RSS, 2005.
  • [38] Kimin Lee, Kibok Lee, Jinwoo Shin, and Honglak Lee. Network randomization: A simple technique for generalization in deep reinforcement learning. In ICLR, 2020.
  • [39] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019.
  • [40] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv, 2019.
  • [41] Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. Pyrobot: An open-source robotics framework for research and benchmarking. arXiv, 2019.
  • [42] Fabio Muratore, Christian Eilers, Michael Gienger, and Jan Peters. Data-efficient domain randomization with bayesian optimization. IEEE Robotics Automation and Letters, 2021.
  • [43] J. Müller, N. Kohler, and W. Burgard. Autonomous miniature blimp navigation with online motion planning and re-planning. In IROS, 2011.
  • [44] Santhosh K Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. In ECCV, 2020.
  • [45] Santhosh K. Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation, 2020.
  • [46] Sharath Chandra Raparthy, Bhairav Mehta, Florian Golemo, and Liam Paull. Generating automatic curricula via self-supervised active domain randomization. arXiv, 2020.
  • [47] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019.
  • [48] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • [49] Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. In ICLR, 2019.
  • [50] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In ICCV, 2019.
  • [51] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • [52] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
  • [53] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. NeurIPS, 2020.
  • [54] Joanne Truong, Sonia Chernova, and Dhruv Batra. Bi-directional domain adaptation for sim2real transfer of embodied navigation agents. IEEE Robotics and Automation Letters, 2021.
  • [55] Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, and Aniruddha Kembhavi. Allenact: A framework for embodied ai research. arXiv, 2020.
  • [56] Erik Wijmans, Abhishek Kadian, Ari S. Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. In ICLR, 2020.
  • [57] Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019.
  • [58] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In CVPR, 2018.
  • [59] Jiachen Yang, Brenden Petersen, Hongyuan Zha, and Daniel Faissol. Single episode policy transfer in reinforcement learning. In ICLR, 2020.
  • [60] Joel Ye, Dhruv Batra, Erik Wijmans, and Abhishek Das. Auxiliary tasks speed up learning pointgoal navigation. In CoRL, 2020.
  • [61] Wenxuan Zhou, Lerrel Pinto, and Abhinav Gupta. Environment probing interaction policies. In ICLR, 2019.
  • [62] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.

Appendix A

A.1 Overview

This appendix is organized as follows. In Sec. A.2, we describe in detail the task specifications for PointNav and ObjectNav. In Sec. A.3, we provide details about the architecture adopted for PointNav and ObjectNav agents and how they are trained. In Sec. A.4, we include more plots demonstrating the kinds of behaviors PointNav and ObjectNav agents exhibit under corruptions (RGB-D variants in addition to the RGB variants in Sec. 5.2 of the main paper). In Sec. A.5, we provide more results demonstrating degradation in performance at an additional severity setting (for vis corruptions with controllable severity levels; excluded from the main paper due to space constraints) and break down performance degradation by episode difficulty.

A.2 Task Specifications

We describe in detail the specifications of the tasks included in RobustNav (as outlined in Sec. 3 of the main paper). Note that while RobustNav currently supports navigation-heavy tasks, the included corruptions can easily be extended to other embodied tasks that share the same modalities, for instance, tasks involving vision-and-language guided navigation or interaction components.

PointNav. In PointNav, an agent is spawned at a random location and orientation in an environment and asked to navigate to goal coordinates specified relative to the agent's position. This is equivalent to the agent being equipped with a GPS+Compass sensor (providing relative location and orientation with respect to the agent's current position). Note that the agent does not have access to any “map” of the environment and must navigate based solely on sensory inputs from a visual RGB (or RGB-D) and GPS+Compass sensor. An episode is declared successful if the PointNav agent stops (by “intentionally” invoking an end action) within a threshold distance of the goal location.

Figure 5: ObjectNav Target Objects. We present a few examples of the target objects considered for ObjectNav agents in RoboTHOR as viewed from the agent’s ego-centric RGB frame under successful episode termination conditions.

ObjectNav. In ObjectNav, an agent is spawned at a random location and orientation in an environment and is asked to navigate to a specified “object” category (e.g., Television) that exists in the environment. Unlike PointNav, an ObjectNav agent does not have access to a GPS+Compass sensor and must navigate based solely on the specified target and visual sensor inputs – RGB (or RGB-D). An episode is declared successful if the ObjectNav agent (1) stops (by “intentionally” invoking an end action) within a threshold distance of the target object and (2) has the target object within its egocentric view. We consider 12 object categories present in the RoboTHOR scenes for our ObjectNav experiments. These are AlarmClock, Apple, BaseballBat, BasketBall, Bowl, GarbageCan, HousePlant, Laptop, Mug, SprayBottle, Television and Vase (see Fig. 5 for a few examples in the agent's egocentric frame).

A.3 Navigation Agents

Figure 6: Agent Architecture. We show the general architecture adopted for our PointNav and ObjectNav agents – convolutional units to encode observations followed by recurrent policy networks. The auxiliary task heads are used when we consider pre-training or adaptation using PAD [27].

We describe the architecture of the agents studied in RobustNav and provide additional training details.

Base Architecture. We consider standard neural architectures (akin to [56, 55]) for both PointNav and ObjectNav – convolutional units to encode observations followed by recurrent policy networks to predict action distributions. Concretely, our agent architecture consists of four major components – a visual encoder, a goal encoder, a target observation combiner and a policy network (see Fig. 6). The visual encoder (for RGB and RGB-D agents) consists of a frozen ResNet-18 [29] encoder (up to the last residual block) pretrained on ImageNet [16], followed by a learnable compressor network consisting of two convolutional layers of kernel size 1, each followed by ReLU activations. The goal encoder encodes the specified target – a goal location in polar coordinates for PointNav and the target object token (e.g., Television) for ObjectNav. For PointNav, the goal is encoded via a linear layer; for ObjectNav, via an embedding layer set to encode one of the 12 object categories. The goal embedding and the output of the visual encoder are then concatenated and passed through the target observation combiner network, consisting of two convolutional layers of kernel size 1. The output of the target observation combiner is flattened and fed to the policy network – specifically, a single-layer GRU followed by linear actor and critic heads used to predict action distributions and value estimates.
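
As a rough illustration of this pipeline (a sketch only – the layer widths, goal-embedding size, GRU hidden size, action-space size and other hyper-parameters below are placeholders rather than the values used in RobustNav), such an agent could be written in PyTorch as follows, here for the RGB ObjectNav case:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class NavAgent(nn.Module):
    """Sketch: frozen ResNet-18 visual encoder + goal embedding +
    1x1-conv combiner + GRU policy with actor and critic heads.
    All sizes are illustrative placeholders."""

    def __init__(self, num_goal_classes=12, num_actions=6, hidden_size=512):
        super().__init__()
        backbone = resnet18(pretrained=True)
        # Keep everything up to (and including) the last residual block.
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.visual_encoder.parameters():
            p.requires_grad = False  # frozen, as described in the text
        # Learnable compressor: two 1x1 convs, each followed by ReLU.
        self.compressor = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 32, kernel_size=1), nn.ReLU(),
        )
        # Goal encoder: an embedding over object categories (ObjectNav).
        self.goal_encoder = nn.Embedding(num_goal_classes, 32)
        # Target-observation combiner: two 1x1 convs over the concatenation.
        self.combiner = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 32, kernel_size=1), nn.ReLU(),
        )
        self.policy = nn.GRU(32 * 7 * 7, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, num_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, rgb, goal, hidden=None):
        # rgb: (B, 3, 224, 224) observations, goal: (B,) integer category ids.
        feats = self.compressor(self.visual_encoder(rgb))     # (B, 32, 7, 7)
        g = self.goal_encoder(goal)[..., None, None]           # (B, 32, 1, 1)
        g = g.expand(-1, -1, feats.shape[2], feats.shape[3])   # broadcast over space
        x = self.combiner(torch.cat([feats, g], dim=1))        # (B, 32, 7, 7)
        x = x.flatten(1).unsqueeze(1)                          # (B, 1, 32*7*7)
        out, hidden = self.policy(x, hidden)
        return self.actor(out[:, -1]), self.critic(out[:, -1]), hidden
```

The RGB-D and PointNav variants differ only in the number of input channels to the visual encoder and in replacing the goal embedding with a linear layer over the polar goal coordinates.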

Auxiliary Task Heads. In Sec. 5.3 of the main paper, we investigate whether self-supervised approaches, in particular Policy Adaptation during Deployment (PAD) [27], help in resisting performance degradation due to vis corruptions. Incorporating PAD involves training the vanilla agent architectures (as highlighted above) with self-supervised tasks (for pre-training as well as adaptation in a corrupt target environment) – namely, Action Prediction (AP) and Rotation Prediction (RP). In Action Prediction (AP), given two successive observations in a trajectory, an auxiliary head is tasked with predicting the intermediate action; in Rotation Prediction (RP), the input observation is rotated by one of four fixed angles, chosen uniformly at random, before being fed to the agent, and an auxiliary head is tasked with predicting the rotation bin. For both AP and RP, the auxiliary task heads operate on the encoded visual observation (as shown in Fig. 6). To gather samples in the target environment (corrupt or otherwise), we use data collected from trajectories under the source (clean) pre-trained policy – i.e., the visual encoder is updated online as observations are encountered under the pre-trained policy.
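
A minimal sketch of a Rotation Prediction head and its self-supervised loss is shown below; the layer sizes, the 4-way (90-degree) rotation binning and the feature shape (matching the architecture sketch above) are assumptions rather than the exact PAD [27] / RobustNav configuration, and the Action Prediction head is analogous (classifying the action between two consecutive encoded observations).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RotationPredictionHead(nn.Module):
    """Sketch: predicts which of 4 rotation bins was applied to the input
    observation, operating on the encoded visual features."""
    def __init__(self, feat_dim=32 * 7 * 7, num_bins=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_bins),
        )

    def forward(self, visual_feats):
        return self.classifier(visual_feats.flatten(1))


def rotation_prediction_loss(encoder, rp_head, rgb):
    # `encoder`: any module mapping (B, 3, H, W) observations to visual features.
    # Rotate each observation by a random multiple of 90 degrees and ask the
    # auxiliary head to recover the rotation bin.
    bins = torch.randint(0, 4, (rgb.shape[0],), device=rgb.device)
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(rgb, bins)])
    logits = rp_head(encoder(rotated))
    return F.cross_entropy(logits, bins)
```

During adaptation, gradients from this loss would update the (learnable parts of the) visual encoder online as new corrupted observations are collected under the pre-trained policy.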

Training and Evaluation Details. As mentioned earlier, we train our agents with DD-PPO [56] (a decentralized, distributed version of the Proximal Policy Optimization algorithm [52]) with Generalized Advantage Estimation [51]. We use fixed-length rollouts and 4 epochs of PPO with 1 mini-batch per epoch, and clip the gradient norms; the discount factor, GAE factor, PPO clip parameter and value loss coefficient are held fixed throughout training. We use the Adam optimizer [36] with a linearly decaying learning rate. The reward structure used is as follows – if r_success denotes the terminal reward obtained at the end of a “successful” episode and r_slack denotes a slack penalty to encourage efficiency, then the reward received by the agent at time-step t can be expressed as

r_t = r_success · 1[a_t = end, s = 1] + Δ_geo(t) + r_slack,        (1)

where Δ_geo(t) = d_geo(t−1) − d_geo(t) is the change in geodesic distance to the goal, a_t is the action taken by the agent and s indicates whether the episode was successful (s = 1) or not (s = 0). During evaluation, we allow an agent to execute a fixed maximum number of steps – if an agent does not call end within this budget, we forcefully terminate the episode. All agents are trained under LocoBot-calibrated actuation noise models from [13] for translation and rotation. During evaluation, with the exception of circumstances when Motion Bias (S) is present, we use the same actuation noise models (in addition to dyn corruptions when applicable). We train our PointNav agents for 75M steps and ObjectNav agents for 300M steps (both RGB and RGB-D variants).
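
As a small worked example of the reward structure in Eq. (1) (the constants r_success and r_slack below are placeholders, not the values used to train the RobustNav agents):

```python
def navigation_reward(prev_geo_dist, curr_geo_dist, called_end, success,
                      r_success=10.0, r_slack=-0.01):
    # Shaped reward: progress toward the goal (reduction in geodesic distance),
    # a per-step slack penalty to encourage efficiency, and a terminal bonus
    # when the agent successfully ends the episode.
    reward = (prev_geo_dist - curr_geo_dist) + r_slack
    if called_end and success:
        reward += r_success
    return reward


# Example: the agent moves 0.25 m closer to the goal on a non-terminal step.
print(navigation_reward(prev_geo_dist=3.0, curr_geo_dist=2.75,
                        called_end=False, success=False))  # 0.24
```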

A.4 Behavior Analysis

Figure 7: Agent Behavior Analysis. To understand agent behaviors, we report the breakdown of four metrics: number of collisions as observed through Failed Actions (first column), distance to target at episode termination as measured by Term. Dist. to Target (second column), the closest the agent was to the target as measured by Min. Dist. to Target (third column), and failure to appropriately end an episode, either by stopping when the goal is out of range – Stop-Fail (Pos) (fourth column) – or by not stopping when it is in range – Stop-Fail (Neg) (fifth column). Each behavior is reported for both PointNav (RGB – first row, RGB-D – second row) and ObjectNav (RGB – third row, RGB-D – fourth row) under one clean and five corrupt settings: Defocus Blur (D.B.), Speckle Noise (S.N.), Motion Drift (M.D.), Defocus Blur + Motion Drift, and Speckle Noise + Motion Drift. Bar colors distinguish the clean, vis, dyn and vis+dyn corruption settings. Blue lines in columns 2 and 3 indicate the distance threshold for the goal being in range. Severities for S.N. and D.B. are set to 5 (worst).
Figure 8: Effect of Degraded Stopping Mechanism. To understand the extent to which a degraded stopping mechanism under corruptions affects ObjectNav RGB and RGB-D agent performance, we look at the difference between the agent’s success rate (SR) and its success rate under an oracle stopping mechanism, i.e., when an end action is forcefully called in an episode whenever the goal is in range. We consider one clean and five corrupt settings: Defocus Blur (D.B.), Speckle Noise (S.N.), Motion Drift (M.D.), Defocus Blur + Motion Drift, and Speckle Noise + Motion Drift. Bar colors distinguish the clean, vis, dyn and vis+dyn corruption settings. Severities for S.N. and D.B. are set to 5 (worst).

In Sec. 5.2 of the main paper, we try to understand the idiosyncrasies exhibited by the navigation agents under corruptions. Specifically, we look at the number of collisions as observed through the number of failed actions in RoboTHOR, the closest the agent gets to the target in an episode, and Stop-Fail (Pos) and Stop-Fail (Neg). Since for both PointNav and ObjectNav, success depends on a notion of “intentionality” [5] – the agent calls an end action when it believes it has reached the goal – we use both Stop-Fail (Pos) and Stop-Fail (Neg) to assess how corruptions impact this “stopping” mechanism of the agents. Stop-Fail (Pos) measures the fraction of times the agent calls an end action when the goal is not in range (the goal-in-range criterion for PointNav checks whether the target is within the threshold distance; for ObjectNav, it additionally includes a visibility criterion), out of the number of times the agent calls an end action. Stop-Fail (Neg) measures the fraction of times the agent fails to invoke an end action when the goal is in range, out of the number of steps the goal is in range in an episode. Both are averaged across evaluation episodes. In addition to the above aspects, we also measure the average distance to the goal at episode termination. We report these measures for PointNav and ObjectNav agents trained with RGB and RGB-D sensors in Fig. 7 (RGB-D variants in addition to the RGB agents in Fig. 4 of the main paper).
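
As a concrete reading of the two Stop-Fail definitions, the following sketch computes them for a single episode from hypothetical per-step logs (illustrative only, not the RobustNav evaluation code):

```python
def stop_fail_metrics(steps):
    """`steps` is a list of dicts with boolean fields
    'called_end' (agent invoked the end action this step) and
    'goal_in_range' (the goal-in-range criterion holds this step).
    Returns (stop_fail_pos, stop_fail_neg) for one episode; either value
    may be None if the corresponding denominator is zero."""
    end_calls = [s for s in steps if s["called_end"]]
    in_range_steps = [s for s in steps if s["goal_in_range"]]

    # Stop-Fail (Pos): fraction of end calls made while the goal is NOT in range.
    pos = (sum(not s["goal_in_range"] for s in end_calls) / len(end_calls)
           if end_calls else None)
    # Stop-Fail (Neg): fraction of in-range steps where the agent did NOT call end.
    neg = (sum(not s["called_end"] for s in in_range_steps) / len(in_range_steps)
           if in_range_steps else None)
    return pos, neg


# Example: the agent calls end once, out of range, and never stops while in range.
episode = [{"called_end": False, "goal_in_range": True},
           {"called_end": False, "goal_in_range": True},
           {"called_end": True,  "goal_in_range": False}]
print(stop_fail_metrics(episode))  # (1.0, 1.0)
```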

We find that across RGB and RGB-D variants, (1) agents tend to collide more often under corruptions (Fig. 7, col 1), (2) agents generally end up farther from the target at episode termination under corruptions (Fig. 7, col 2) and (3) agents tend to be farther from the target under corruptions even in terms of minimum distance over an episode (Fig. 7, col 3). We further note that the effect of corruptions on the agent’s stopping mechanism is more pronounced for ObjectNav as opposed to PointNav (Fig. 7, cols 4 & 5).

                              RGB                              RGB-D
                              SR       SPL      Len.           SR       SPL      Len.
Noise-Free Conditions
(1) Clean                     88.90    70.70    240.897        92.50    78.20    185.255
(2) Low-Light.                63.30    32.90    566.286        91.90    75.10    207.994
(3) Spatter                   47.50    18.40    728.074        94.80    78.50    181.732
HC Conditions
(4) HC RGB Noise              N/A      N/A      N/A            65.90    49.50    104.167
(5) Low-Light.                N/A      N/A      N/A            60.50    45.50    107.932
(6) Spatter                   N/A      N/A      N/A            41.60    32.10    110.630
Table 4: OccAnt [45] results on Gibson [58] val. Rows 1-3 report performance under noise-free conditions, without (Clean) and with vis corruptions introduced over the clean setting, based on the checkpoint used to report results in the publication. Rows 4-6 report performance when the RGB noise under Habitat Challenge (HC) conditions is replaced with vis corruptions, based on the HC submission checkpoint. Len. indicates episode length. N/A implies the checkpoint is not available. Severity for Low-Lighting and Spatter is set to 5 (worst).

To further understand the extent to which a degraded stopping mechanism impacts the agent’s performance, in Fig. 8 we compare the agents’ success rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (end is forcefully called whenever the goal is in range). For both ObjectNav RGB and RGB-D, we find that the presence of vis and vis+dyn corruptions affects success significantly compared to the clean settings (Fig. 8, black bars).

A.5 Degradation Results

Habitat Challenge Results. As stated in Sec. 5 of the main paper, here we investigate the degree to which more sophisticated, map-based PointNav agents are susceptible to vis corruptions. Specifically, we evaluate the winning entry of Habitat Challenge (HC) 2020 [1] – Occupancy Anticipation (OccAnt) [45] – on the Gibson [58] validation scenes (see Table 4). We evaluate the performance of OccAnt (for RGB and RGB-D; based on the provided checkpoints) when vis corruptions are introduced (1) over clean settings under noise-free conditions (rows 1-3 in Table 4) and (2) by replacing the RGB noise under Habitat Challenge (HC) conditions (rows 4-6 in Table 4). Under noise-free conditions, the degradation in performance from the clean setting is more pronounced for the RGB agent than for the RGB-D variant. Under HC conditions, the RGB-D variant suffers significant degradation in performance when the RGB noise is replaced with vis corruptions.

                                            PointNav                                ObjectNav
                                      RGB               RGB-D               RGB               RGB-D
# Corruption                          SR      SPL       SR      SPL        SR      SPL       SR      SPL
1 Clean 98.97±0.18 83.45±0.27 99.24±0.15 85.00±0.25 31.78±0.81 14.50±0.47 35.43±0.83 17.57±0.52
2 Low Lighting (S3) 97.45±0.27 80.53±0.33 99.09±0.17 84.91±0.26 21.55±0.72 8.91±0.38 27.55±0.78 13.08±0.47
3 Low Lighting (S5) 93.42±0.43 74.88±0.43 99.27±0.15 85.04±0.25 11.69±0.56 4.90±0.30 23.26±0.74 10.61±0.43
4 Motion Blur (S3) 98.82±0.19 80.64±0.29 98.91±0.18 84.62±0.26 18.57±0.68 8.18±0.37 24.32±0.75 11.52±0.44
5 Motion Blur (S5) 96.15±0.34 73.24±0.38 99.06±0.17 85.01±0.25 10.50±0.93 4.71±0.51 18.26±0.83 7.87±0.45
6 Camera Crack 81.56±0.68 63.48±0.59 96.00±0.34 81.48±0.36 7.06±0.45 3.54±0.28 27.28±0.78 13.49±0.48
7 Defocus Blur (S3) 94.63±0.39 73.28±0.41 98.79±0.19 84.47±0.26 15.59±0.63 6.99±0.36 22.07±0.72 9.75±0.41
8 Defocus Blur (S5) 75.83±0.75 53.48±0.59 99.03±0.17 85.44±0.25 4.20±0.35 1.81±0.20 17.66±0.67 7.47±0.36
9 Speckle Noise (S3) 89.23±0.54 68.18±0.53 98.85±0.19 84.58±0.27 14.92±0.62 6.54±0.34 24.05±0.75 10.34±0.42
10 Speckle Noise (S5) 66.70±0.82 47.92±0.67 98.97±0.18 84.79±0.26 8.68±0.49 3.90±0.28 17.69±0.67 7.42±0.36
11 Lower-FOV 43.25±0.86 32.39±0.68 89.44±0.54 73.92±0.50 10.17±0.53 2.50±0.19 10.02±0.52 4.89±0.31
12 Spatter (S3) 38.40±0.85 25.53±0.59 98.64±0.20 84.00±0.28 7.06±0.45 3.81±0.29 23.20±0.74 9.71±0.40
13 Spatter (S5) 34.64±0.83 25.55±0.64 99.30±0.14 84.68±0.25 7.79±0.47 2.71±0.22 21.43±0.72 9.98±0.42
14 Motion Bias (S) 95.69±0.35 77.05±0.38 96.60±0.32 79.22±0.35 33.21±0.82 14.88±0.47 34.98±0.83 16.70±0.51
15 Motion Drift 95.94±0.34 76.32±0.35 93.57±0.43 75.09±0.40 28.55±0.79 13.30±0.46 34.37±0.83 16.43±0.50
16 Motion Bias (C) 92.27±0.47 77.48±0.46 93.11±0.44 79.04±0.45 30.47±0.80 13.20±0.45 32.72±0.82 15.67±0.50
17 PyRobot [41] (ILQR) Mul. = 1.0 95.18±0.37 67.45±0.37 96.18±0.33 69.48±0.35 32.54±0.82 11.65±0.39 36.86±0.84 14.24±0.44
18 Motor Failure 20.84±0.71 17.91±0.62 21.41±0.71 18.39±0.62 4.60±0.37 2.88±0.26 6.06±0.42 3.65±0.28
19 Defocus Blur (S3) + Motion Bias (S) 92.72±0.45 68.61±0.43 97.45±0.27 79.70±0.32 14.40±0.61 6.15±0.33 22.40±0.73 9.20±0.39
20 Defocus Blur (S5) + Motion Bias (S) 75.80±0.75 50.76±0.58 97.00±0.30 79.81±0.33 5.66±0.40 2.34±0.22 17.53±0.66 7.07±0.35
21 Speckle Noise (S3) + Motion Bias (S) 86.62±0.59 63.20±0.54 96.85±0.30 79.23±0.34 14.92±0.62 6.31±0.34 24.72±0.75 10.04±0.41
22 Speckle Noise (S5) + Motion Bias (S) 64.36±0.83 44.38±0.66 96.78±0.31 79.49±0.34 8.95±0.50 3.85±0.27 18.39±0.68 7.49±0.36
23 Spatter (S3) + Motion Bias (S) 37.25±0.84 23.83±0.57 96.60±0.32 78.62±0.35 7.18±0.45 3.60±0.28 24.44±0.75 9.80±0.40
24 Spatter (S5) + Motion Bias (S) 33.85±0.82 23.98±0.61 95.94±0.34 78.64±0.36 7.64±0.46 2.93±0.23 20.91±0.71 9.39±0.40
25 Defocus Blur (S3) + Motion Drift 89.72±0.53 65.84±0.47 94.84±0.39 75.97±0.37 14.16±0.61 6.26±0.34 23.56±0.74 10.65±0.43
26 Defocus Blur (S5) + Motion Drift 73.92±0.76 50.84±0.59 94.72±0.39 76.21±0.37 4.57±0.36 2.10±0.21 17.26±0.66 7.04±0.35
27 Speckle Noise (S3) + Motion Drift 86.65±0.59 62.44±0.53 93.99±0.41 75.02±0.39 13.46±0.60 5.95±0.33 23.01±0.73 9.96±0.41
28 Speckle Noise (S5) + Motion Drift 63.18±0.84 43.29±0.65 94.51±0.40 75.34±0.38 7.49±0.80 3.63±0.46 18.93±0.68 7.85±0.36
29 Spatter (S3) + Motion Drift 37.70±0.84 24.27±0.57 94.57±0.39 75.34±0.38 7.15±0.45 3.59±0.27 23.44±0.74 9.72±0.40
30 Spatter (S5) + Motion Drift 33.36±0.82 23.59±0.60 95.03±0.38 75.84±0.37 7.21±0.55 2.77±0.28 18.69±0.68 8.37±0.38
31 Defocus Blur (S3) + PyRobot [41] (ILQR) Mul. = 1.0 93.99±0.41 58.88±0.40 97.66±0.26 70.54±0.32 16.13±0.64 5.22±0.28 22.68±0.73 7.33±0.32
32 Defocus Blur (S5) + PyRobot [41] (ILQR) Mul. = 1.0 79.34±0.71 42.29±0.49 97.24±0.29 70.35±0.33 5.81±0.41 1.04±0.11 18.48±0.68 5.86±0.29
33 Speckle Noise (S3) + PyRobot [41] (ILQR) Mul. = 1.0 88.38±0.56 54.60±0.49 96.12±0.34 68.67±0.35 14.95±0.62 4.71±0.26 24.11±0.75 7.51±0.32
34 Speckle Noise (S5) + PyRobot [41] (ILQR) Mul. = 1.0 67.12±0.82 37.77±0.57 96.36±0.33 69.44±0.34 8.89±0.50 2.66±0.20 18.72±0.68 5.73±0.29
35 Spatter (S3) + PyRobot [41] (ILQR) Mul. = 1.0 40.70±0.86 18.26±0.45 96.09±0.34 68.25±0.36 8.31±0.48 1.76±0.16 23.17±0.74 7.76±0.33
36 Spatter (S5) + PyRobot [41] (ILQR) Mul. = 1.0 36.37±0.84 19.70±0.51 96.03±0.34 68.98±0.36 8.58±0.49 2.09±0.17 20.85±0.71 7.41±0.33
Table 5: PointNav and ObjectNav Performance. Degradation in task performance of pretrained PointNav (trained for 75M frames) and ObjectNav (trained for 300M frames) agents when evaluated under vis and dyn corruptions present in RobustNav. PointNav agents have additional access to a GPS+Compass sensor. For visual corruptions with controllable severity levels, we report results with severity set to 3 and 5. Performance is measured across tasks of varying difficulties (easy, medium and hard). Reported results are mean and standard error across 3 evaluation runs with different seeds. Rows are sorted based on SPL values for RGB PointNav agents. Success and SPL values are reported as percentages.

More Degradation Results. In Table 5, we report the degradation in performance (relative to clean settings) of PointNav and ObjectNav agents when operating under vis, dyn and vis+dyn corruptions. We report mean and standard error values across evaluation runs under actuation noise (wherever applicable). For vis corruptions with controllable severity levels – Motion Blur, Low-Lighting, Defocus Blur, Speckle Noise and Spatter – we report results with severities set to 3 and 5 (identified by S3 and S5; excluded from the main paper due to space constraints), for both vis and vis+dyn settings. We note that, unlike the RGB-D variants, PointNav RGB agents suffer larger performance drops as severity increases from 3 to 5. For ObjectNav, we find that for both RGB and RGB-D variants, performance decreases with increasing severity of corruptions.
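
For reference, vis corruptions of this kind (e.g., Speckle Noise, Defocus Blur, Spatter) at severities 1-5 can be applied to egocentric RGB frames with the publicly available imagecorruptions package; the snippet below is only an illustration and not necessarily the exact corruption implementation used in RobustNav.

```python
import numpy as np
from imagecorruptions import corrupt  # pip install imagecorruptions

# A stand-in egocentric RGB frame (H x W x 3, uint8).
frame = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Apply Speckle Noise at severity 3 and Defocus Blur at severity 5 (worst).
speckled = corrupt(frame, corruption_name="speckle_noise", severity=3)
blurred = corrupt(frame, corruption_name="defocus_blur", severity=5)
```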

Increasing Episode Difficulty              Easy                Medium              Hard
# Corruption                               SR       SPL        SR       SPL        SR       SPL
PointNav-RGB
1 Clean 99.64±0.18 82.80±0.38 99.36±0.24 84.21±0.47 97.91±0.43 83.34±0.54
2 Low Lighting 99.36±0.24 80.59±0.45 95.54±0.62 75.83±0.70 85.34±1.07 68.22±0.94
3 Camera Crack 94.10±0.71 75.81±0.70 80.05±1.21 62.15±1.06 70.49±1.38 52.44±1.14
4 Spatter 74.93±1.31 57.85±1.08 18.67±1.18 12.03±0.80 10.20±0.91 6.69±0.63
5 Speckle Noise + Motion Bias (S) 86.74±1.02 60.86±0.96 61.11±1.47 41.30±1.15 45.17±1.50 30.93±1.11
6 Spatter + Motion Bias (S) 72.48±1.35 53.25±1.08 18.85±1.18 11.99±0.79 10.11±0.91 6.63±0.62
7 Speckle Noise + Motion Drift 88.74±0.95 63.57±0.89 59.47±1.48 38.75±1.11 41.26±1.49 27.50±1.07
8 Spatter + Motion Drift 73.21±1.34 54.16±1.06 17.12±1.14 10.50±0.74 9.65±0.89 6.04±0.58
PointNav-RGBD
9 Clean 99.55±0.20 82.36±0.41 99.45±0.22 85.38±0.47 98.72±0.34 87.27±0.40
10 Low Lighting 99.55±0.20 82.25±0.42 99.36±0.24 86.15±0.43 98.91±0.31 86.73±0.41
11 Camera Crack 99.27±0.26 81.79±0.43 97.18±0.50 83.19±0.59 91.53±0.84 79.45±0.80
12 Spatter 99.82±0.13 82.40±0.40 99.09±0.29 84.69±0.48 99.00±0.30 86.96±0.41
13 Speckle Noise + Motion Bias (S) 96.28±0.57 75.59±0.62 97.27±0.49 80.77±0.56 96.81±0.53 82.11±0.55
14 Spatter + Motion Bias (S) 96.46±0.56 76.02±0.62 94.99±0.66 78.61±0.67 96.36±0.57 81.29±0.59
15 Speckle Noise + Motion Drift 99.27±0.26 77.85±0.41 96.17±0.58 76.77±0.61 88.07±0.98 71.39±0.86
16 Spatter + Motion Drift 99.18±0.27 77.24±0.44 97.36±0.48 78.42±0.53 88.52±0.96 71.87±0.85
Table 6: Breakdown of PointNav Performance Degradation by Episode Difficulty. Degradation in task performance of pre-trained PointNav RGB and RGB-D agents (trained for 75M frames) for episodes of varying difficulties (based on shortest path lengths) when evaluated under vis and dyn corruptions present in RobustNav. For visual corruptions with controllable severity levels, severity is set to 5 (worst). Reported results are mean and standard error across 3 evaluation runs under noisy actuations (wherever applicable). Success and SPL values are reported as percentages.
Increasing Episode Difficulty              Easy                Medium              Hard
# Corruption                               SR       SPL        SR       SPL        SR       SPL
ObjectNav-RGB
1 Clean 40.50±1.94 12.43±1.04 33.48±1.29 15.51±0.73 25.75±1.21 14.49±0.75
2 Low Lighting 22.59±1.65 8.50±0.96 13.23±0.93 5.60±0.46 4.75±0.59 2.40±0.33
3 Camera Crack 21.65±1.63 10.10±1.05 5.38±0.62 2.72±0.34 1.61±0.35 1.15±0.26
4 Spatter 21.18±1.61 6.39±0.78 6.65±0.68 2.66±0.32 2.38±0.42 0.95±0.18
5 Speckle Noise + Motion Bias (S) 20.56±1.60 7.01±0.85 9.79±0.81 4.89±0.47 2.38±0.42 1.23±0.23
6 Spatter + Motion Bias (S) 20.56±1.60 6.71±0.83 7.32±0.71 3.35±0.38 1.61±0.35 0.65±0.15
7 Speckle Noise + Motion Drift 18.69±2.67 8.55±1.59 7.62±1.26 3.86±0.74 1.84±0.64 0.97±0.36
8 Spatter + Motion Drift 21.50±1.99 7.08±1.01 6.84±0.85 3.18±0.45 0.57±0.26 0.23±0.11
ObjectNav-RGBD
9 Clean 46.73±1.97 15.22±1.15 35.72±1.31 18.09±0.80 29.58±1.26 18.18±0.86
10 Low Lighting 28.82±1.79 10.55±1.04 25.41±1.19 11.56±0.68 18.31±1.07 9.68±0.65
11 Camera Crack 35.51±1.89 11.53±1.03 28.03±1.23 13.90±0.73 22.45±1.16 14.03±0.79
12 Spatter 29.75±1.81 9.79±0.99 18.76±1.07 9.06±0.62 20.08±1.11 11.00±0.67
13 Speckle Noise + Motion Bias (S) 22.12±1.64 5.93±0.80 18.54±1.06 8.29±0.58 16.40±1.03 7.43±0.54
14 Spatter + Motion Bias (S) 27.26±1.76 8.66±0.92 19.81±1.09 8.81±0.60 18.93±1.08 10.35±0.66
15 Speckle Noise + Motion Drift 22.74±1.66 6.35±0.84 19.13±1.08 7.86±0.56 16.86±1.04 8.56±0.59
16 Spatter + Motion Drift 25.08±1.71 8.16±0.89 17.79±1.05 8.16±0.57 16.48±1.03 8.69±0.60
Table 7: Breakdown of ObjectNav Performance Degradation by Episode Difficulty. Degradation in task performance of pre-trained ObjectNav RGB and RGB-D agents (trained for 300M frames) for episodes of varying difficulties (based on shortest path lengths) when evaluated under vis and dyn corruptions present in RobustNav. For visual corruptions with controllable severity levels, severity is set to 5 (worst). Reported results are mean and standard error across 3 evaluation runs under noisy actuations (wherever applicable). Success and SPL values are reported as percentages.

Performance Breakdown by Episode Difficulty. In Tables 6 and 7, we break down the performance of PointNav and ObjectNav agents by the difficulty of evaluation episodes (based on shortest path lengths). We report results for a subset of vis, dyn and vis+dyn corruptions (mean across evaluation runs under noisy actuations, wherever applicable). For PointNav RGB agents, we find that while performance is comparable across easy, medium and hard episodes under clean settings, under corruptions navigation performance decreases significantly with increasing episode difficulty – indicating that under corruptions, PointNav-RGB agents are more successful at reaching goal locations closer to the spawn location. This is not the case for PointNav RGB-D agents, where the drop in performance with increasing episode difficulty is much less pronounced. For ObjectNav-RGB agents, we observe that performance (in terms of SR and SPL) drops as episodes become more difficult. For ObjectNav-RGB-D agents, although we find comparable SPL across episode difficulties in some cases, the trends are mostly the same – decreasing performance (in terms of SR and SPL) with increasing episode difficulty.