Evaluating pre-trained navigation agents under corruptions
As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual corruptions (affecting RGB inputs) and dynamics corruptions (affecting transition dynamics). Most recent efforts in visual navigation have typically focused on generalizing to novel target environments with similar appearance and dynamics characteristics. With RobustNav, we find that some standard embodied navigation agents significantly underperform (or fail) in the presence of visual or dynamics corruptions. We systematically analyze the idiosyncrasies that emerge in the behavior of such agents when operating under corruptions. Finally, for visual corruptions in RobustNav, we show that while standard techniques to improve robustness, such as data augmentation and self-supervised adaptation, offer some zero-shot resistance and improvements in navigation performance, there is still a long way to go in terms of recovering lost performance relative to clean "non-corrupt" settings, warranting more research in this direction. Our code is available at https://github.com/allenai/robustnav
A longstanding goal of the artificial intelligence community has been to develop algorithms for embodied agents that are capable of reasoning about rich perceptual information and thereby accomplishing tasks by navigating in and interacting with their environments. In addition to being able to exhibit these capabilities, it is equally important that such embodied agents are able to do so in a robust and generalizable manner.
A major challenge in Embodied AI is to ensure that agents can generalize to environments with different appearance statistics and motion dynamics than the environment used for training those agents. For instance, an agent that is trained to navigate in “sunny” weather should continue to operate in rain despite the drastic changes in appearance, and an agent that is trained to move on carpet should still navigate reliably on a hardwood floor despite the discrepancy in friction. While a potential solution may be to calibrate the agent for a specific target environment, it is not a scalable one, since there can be enormous varieties of unseen environments and situations. A more robust, efficient and scalable solution is to equip agents with the ability to autonomously adapt to new situations through interaction, without having to train for every possible target scenario. Despite the remarkable progress in Embodied AI, especially in embodied navigation [62, 48, 50, 57, 8], most efforts focus on generalizing trained agents to unseen environments, but critically assume similar appearance and dynamics attributes across train and test environments.
As a first step towards assessing general purpose robustness of embodied agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of common visual (vis) and dynamics (dyn) corruptions – artifacts that affect the egocentric RGB observations and transition dynamics, respectively. We envision RobustNav as a testbed for adapting agent behavior across different perception and actuation properties. While assessing robustness to changes (stochastic or otherwise) in environments has been investigated in the robotics community [33, 14, 15, 22], the simulated nature of RobustNav enables practitioners to explore robustness against a rich and very diverse set of changes, while inheriting the advantages of working in simulation – speed, safety, low cost and reproducibility.
RobustNav consists of two widely studied embodied navigation tasks, Point-Goal Navigation (PointNav)  and Object-Goal Navigation (ObjectNav)  – the tasks of navigating to a goal-coordinate in a global reference frame or an instance of a specified object, respectively. Following the standard protocol, agents learn using a set of training scenes and are evaluated within a set of held out test scenes, but differently, RobustNav test scenes are subject to a variety of realistic visual and dynamics corruptions. These corruptions can emulate real world scenarios such as a malfunctioning camera or drift (see Fig.1).
As zero shot adaptation to test time corruptions may be out of reach for our current algorithms, we provide agents with a fixed “calibration budget” (number of interactions) within the target world for unsupervised adaptation. This mimics a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. Post calibration, agents are evaluated on the two tasks in the corrupted test environments using standard navigation metrics.
Our extensive analysis reveals that both PointNav and ObjectNav agents experience significant degradation in performance across the range of corruptions, particularly when multiple corruptions are applied together. We show that this degradation reduces in the presence of a clean depth sensor suggesting the advantages of incorporating multiple sensing modalities, to improve robustness. We find that data augmentation and self-supervised adaptation strategies offer some zero-shot resistance and improvement over degraded performance, but are unable to fully recover this gap in performance. Interestingly, we also note that visual corruptions affect embodied tasks differently from static tasks like object recognition – suggesting that visual robustness should be explored within an embodied task. Finally, we analyze several interesting behaviors our agents exhibit in the presence of corruptions – such as increase in the number of collisions and inability to terminate episodes successfully.
In summary, our contributions include: (1) We present RobustNav– a framework for benchmarking and assessing the robustness of embodied navigation agents to visual and dynamics corruptions. (2) Our findings show that present day navigation agents trained in simulation underperform severely when evaluated in corrupt target environments. (3) We systematically analyze the kinds of mistakes embodied navigation agents make when operating under such corruptions. (4) We find that although standard data-augmentation techniques and self-supervised adaptation strategies offer some improvement, much remains to be done in terms of fully recovering lost performance.
RobustNav provides a fast framework to develop and test robust embodied policies, before they can be deployed onto real robots. While RobustNav currently supports navigation heavy tasks, the supported corruptions can be easily extended to more tasks, as they get popular within the Embodied AI community.
Visual Navigation. Tasks involving navigation based on egocentric visual inputs have witnessed exciting progress in recent years [50, 11, 25, 9, 20, 10]. Some of the widely studied tasks in this space include PointNav, ObjectNav and goal-driven navigation, where the target is specified by a goal-image. Approaches to solve PointNav and ObjectNav can broadly be classified into two categories – (1) learning neural policies end-to-end using RL [56, 60, 48, 50, 57] or (2) decomposing navigation into a mapping (building a semantic map) and path planning stage [7, 8, 26, 44]. Recent research has also focused on assessing the ability of policies trained in simulation to transfer to real-world robots operating in physical spaces [34, 13].
Prior work repurposes the ImageNet validation split as a benchmark for assessing robustness to natural distribution shifts (unlike synthetically introduced ones) and identifies statistical biases in the same. Recently, three extensive benchmarks have been proposed for assessing robustness to image style, geographical location and camera operation.
Real-world RL Suite. Efforts similar to RobustNav have been made in prior work, where the authors formalize different challenges holding back RL from real-world use – including actuator delays, high-dimensional state and action spaces, latency, and others. In contrast, RobustNav focuses on challenges in visually rich domains and the complexities associated with visual observations. Recently, Habitat also introduced actuation and visual noise models for navigation tasks. In contrast, RobustNav is designed to benchmark robustness of models against a variety of visual and dynamics corruptions (vis and dyn corruptions for both PointNav and ObjectNav).
Adapting Visuo-Motor Policies. Significant progress has been made on the problem of adapting policies trained with RL from a source to a target environment. Unlike RobustNav, major assumptions involved in such transfer settings are either access to task supervision in the target environment or access to paired data from the source and target environments [23, 54]. Domain Randomization (DR) [2, 46, 38, 42] is another common approach to train policies robust to various environmental factors. Notably, some works perturb features early in the visual encoders of the policy network so as to mimic DR, or select optimal DR parameters during training based on sparse data obtained from the real world. In the absence of task supervision, another common approach is to optimize self-supervised objectives in the target environment [57, 49]; this has been used to adapt policies to visual disparities (see Sec. 5) in new environments. To adapt to changes in transition dynamics, a common approach is to train on a broad family of dynamics models and perform system identification (e.g., with domain classifiers) in the target environment [59, 61]. [34, 13] study the extent to which embodied navigation agents transfer from simulated environments to real-world physical spaces. Among these, we investigate two of the most popular approaches – self-supervised adaptation and aggressive data augmentation – and measure if they can help build resistance to vis corruptions.
We present RobustNav, a benchmark to assess the robustness of embodied agents to common visual (vis) and dynamics (dyn) corruptions. RobustNav is built on top of RoboTHOR . In this work, we study the effects corruptions have on two kinds of embodied navigation agents – namely, PointNav (navigate to a specified goal coordinate) and ObjectNav (navigate to an instance of an object category). While we restrict our experiments to navigation, in practice, our vis and dyn corruptions can also be extended to other embodied tasks that share the same modalities, for instance tasks involving interacting with objects.
In RobustNav, agents are trained within the training scenes and evaluated on “corrupt” unseen target scenes. Corruptions in target scenes are drawn from a set of predefined vis and dyn corruptions. As is the case with any form of modeling of corruptions (or noise) in simulation [33, 12], there will always be an approximation error when the vis and dyn corruptions are compared to their real world counterparts. Our aim is to ensure that the RobustNav benchmark acts as a stepping stone towards the larger goal of obtaining robust agents, ready to be deployed in real world.
To adapt to a corrupt target scene, we provide agents with a “calibration budget” – an upper bound on the number of interactions an agent is allowed to have with the target environment without any external task supervision. This is done to mimic a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. We adopt a modest definition of the calibration budget based on the number of steps it takes an agent to reasonably recover degraded performance in the most severely corrupted environments when finetuned under complete supervision (see Table 3) – fixed for all our experiments. We attempt to understand whether self-supervised adaptation approaches improve performance when allowed to adapt under this calibration budget (see Sec. 5, resisting corruptions). We now describe in detail the vis and dyn corruptions present in RobustNav.
Visual Corruptions. Visual corruptions are artifacts that degrade the navigation agent’s egocentric RGB observation (see Fig. 2). We provide seven visual corruptions within RobustNav, four of which are drawn from the set of corruptions and perturbations proposed in prior work – Spatter, Motion Blur, Defocus Blur and Speckle Noise; realistic corruptions that one might expect to see on a real robot. Spatter emulates occlusion in images due to particles of dirt, water droplets, etc. residing on the camera lens. Motion Blur emulates blur in images due to jittery movement of the robot. Defocus Blur occurs when the RGB image is out of focus. Speckle Noise emulates granular interference that inherently exists in and degrades the quality of images obtained by the camera (modeled as additive noise, with the noise being proportional to the original pixel intensity). Each of these corruptions can manifest at five levels of severity, indicating an increase in the extent of visual degradation.
In addition to these, we also add Low-Lighting (low-lighting conditions in the target environment, with associated severity levels), Lower-FOV (agents operating with a lower camera field of view compared to the one used during training) and Camera-Crack (a randomized crack in the camera lens). For Camera-Crack, we use fixed random seeds for the validation scenes, which dictate the location and kind of crack on the camera lens.
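As an illustrative sketch of how one such corruption can be modeled, the speckle-noise description above (additive noise proportional to pixel intensity) could be implemented as below; the severity-to-noise-scale mapping is an assumption for illustration, not the benchmark's actual parameters:

```python
import numpy as np

def speckle_noise(img, severity=1, seed=None):
    """Speckle noise: additive Gaussian noise scaled by the pixel intensity.

    img: float array in [0, 1]; severity: 1-5 (scale values are assumed)."""
    scales = [0.06, 0.1, 0.19, 0.37, 0.6]  # assumed severity -> noise std mapping
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scales[severity - 1], size=img.shape)
    # noise magnitude is proportional to the original intensity, then clipped
    return np.clip(img + img * noise, 0.0, 1.0)
```

Brighter pixels receive proportionally larger perturbations, which is what distinguishes speckle noise from plain additive Gaussian noise.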
Dynamics Corruptions. Dynamics corruptions affect the transition dynamics of the agents in the target environment (see Fig. 3). We consider three classes of dynamics corruptions – Motion Bias, Motion Drift and Motor Failure. Our dyn corruptions are motivated by and in line with the well-known systematic and/or stochastic drifts (due to error accumulation) and biases in robot motion [37, 6, 21, 43].
A common dynamics corruption observed in the real world is friction. Unfortunately, RoboTHOR does not yet natively support multiple friction zones within a scene, as may be commonly observed in a real physical environment (for instance, the kitchen floor in a house may have smooth tiles while the bedroom may have rough hardwood floors). In lieu of this, we present the Motion Bias corruption. In the absence of this corruption, the move_ahead action moves an agent forward by a fixed distance, and the rotate_left and rotate_right actions rotate an agent left and right by a fixed angle, respectively. Motion Bias can induce either (a) a constant bias drawn uniformly per-episode, or (b) stochastic translation and rotation amounts drawn per-step. Motion Bias (C) is intended to model scene-level friction, i.e., a different floor material in the target environment; Motion Bias (S) is intended to model high and low friction zones within a scene. Including more sophisticated models of friction is on the future roadmap for RobustNav.
Motion Drift models a setting where an agent’s translation movements in the environment include a slight bias towards turning left or right. Specifically, the move_ahead action, instead of moving an agent forward in the direction of its heading (the intended behavior), drifts stochastically towards the left or right (fixed for an episode), taking the agent to a location that deviates in a direction perpendicular to the original heading by a bounded amount. Motor Failure is the setting where either the rotate_left or the rotate_right action malfunctions throughout an evaluation episode.
With the exception of Motion Bias (S) – the stochastic version – the agent also operates under standard actuation noise models as calibrated from a LoCoBot. Recently, PyRobot has also introduced LoCoBot-calibrated noise models that demonstrate strafing and drifting. While we primarily rely on the former noise models, for completeness, we also include results with the PyRobot noise models.
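To make the dynamics corruptions concrete, here is a minimal sketch of how a Motion Bias-style action wrapper could behave; all magnitudes (nominal step size and rotation, bias ranges) and the interface are illustrative assumptions, not RobustNav's implementation:

```python
import random

class MotionBias:
    """Illustrative sketch of the Motion Bias corruption: a constant per-episode
    offset (variant "C") or a per-step stochastic perturbation (variant "S")
    added to the nominal translation/rotation of each action. All magnitudes
    here are assumed placeholder values."""

    def __init__(self, variant="C", trans=0.25, rot=30.0,
                 bias_scale=0.05, rot_bias=5.0, seed=None):
        self.variant, self.trans, self.rot = variant, trans, rot
        self.bias_scale, self.rot_bias = bias_scale, rot_bias
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Variant "C" draws one constant bias held for the whole episode.
        self.t_bias = self.rng.uniform(-self.bias_scale, self.bias_scale)
        self.r_bias = self.rng.uniform(-self.rot_bias, self.rot_bias)

    def step(self, action):
        if self.variant == "S":  # variant "S" re-samples the bias every step
            self.t_bias = self.rng.uniform(-self.bias_scale, self.bias_scale)
            self.r_bias = self.rng.uniform(-self.rot_bias, self.rot_bias)
        if action == "move_ahead":
            return ("move_ahead", self.trans + self.t_bias)
        if action in ("rotate_left", "rotate_right"):
            sign = -1.0 if action == "rotate_left" else 1.0
            return (action, sign * self.rot + self.r_bias)
        return (action, 0.0)  # e.g. "end": no motion to perturb
```

Under variant "C" every move_ahead in an episode is perturbed identically (a scene-level effect), while variant "S" perturbs each step independently (zone-level effects).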
Tasks. RobustNav consists of two major embodied navigation tasks – namely, PointNav and ObjectNav. In PointNav, an agent is initialized at a random spawn location and orientation in an environment and is asked to navigate to target coordinates specified relative to the agent’s position. The agent must navigate based only on sensory inputs from an RGB (or RGB-D) and a GPS + Compass sensor. An episode is declared successful if the agent stops within a threshold distance of the goal location (by intentionally invoking an end action). In ObjectNav, an agent is instead asked to navigate to an instance of a specified object category (e.g., Television) given only egocentric sensory inputs – RGB or RGB-D. An episode is declared successful if the agent stops within a threshold distance of the target object (by invoking an end action) and has the target object in its egocentric view. Due to the lack of perfect localization (no GPS + Compass sensor) and the implicit need to ground the specified object within its view, ObjectNav may be considered a harder task than PointNav – also evident in lower ObjectNav performance (Table 2).
Metrics. We report performance in terms of the following well-established navigation metrics reported in past works – Success Rate (SR) and Success weighted by Path Length (SPL). SR indicates the fraction of successful episodes. SPL provides a score for the agent’s path based on how close its length is to the shortest path from the spawn location to the target. If S_i denotes whether episode i is successful (a binary indicator), l_i is the shortest path length, and p_i is the agent’s path length, then SPL = (1/N) Σ_i S_i · l_i / max(p_i, l_i).
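The SPL definition above translates directly to code; a minimal sketch over per-episode lists:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the binary success indicator, l_i the shortest path length,
    and p_i the agent's actual path length for episode i."""
    n = len(successes)
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * (l / max(p, l))  # failed episodes (s = 0) contribute 0
    return total / n
```

A successful episode that takes twice the shortest path scores 0.5, so SPL penalizes inefficient (not just unsuccessful) navigation.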
Scenes. RobustNav is built on top of the RoboTHOR scenes. RoboTHOR consists of training and validation environments based on indoor apartment scenes drawn from different layouts. To assess robustness in the presence of corruptions, we evaluate on episodes of varying difficulty (easy, medium and hard, based on shortest path length) for both PointNav and ObjectNav across the val scenes.
Benchmarking. Present day embodied navigation agents are typically trained without any corruptions. However, we anticipate that researchers may incorporate corruptions as augmentations at training time to improve the robustness of their algorithms in order to make progress on our RobustNav framework. For the purposes of fair benchmarking, we recommend that future comparisons using RobustNav do not draw from the set of corruptions reserved for the target scenes – ensuring the corruptions encountered in the target scenes are indeed “unseen”.
[Table 1: object recognition performance (Top-1 / Top-5 accuracy) under corruptions]
Agent. Our PointNav agents have 4 actions available to them – namely, move_ahead, rotate_left, rotate_right and end. The action end indicates that the agent believes it has reached the goal, thereby terminating the episode. During evaluation, we allow an agent to execute a maximum of 300 steps – if an agent does not call end within 300 steps, we forcefully terminate the episode. For ObjectNav, in addition to the aforementioned actions, the agent can also look_up or look_down – changing the agent’s view above or below the forward camera horizon. The agent receives egocentric observations (RGB or RGB-D). All agents are trained under LoCoBot-calibrated actuation noise models for translation and rotation. Our agent architectures are composed of a CNN head to process input observations followed by a recurrent (GRU) policy network (more details in Sec. A.3 of the appendix).
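A minimal sketch of the CNN-plus-GRU architecture described above; the layer sizes, input resolution, and action count below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class NavPolicy(nn.Module):
    """Sketch of a navigation agent: a small CNN encoder over egocentric RGB
    frames followed by a GRU policy with actor (action logits) and critic
    (state value) heads. All dimensions are assumed placeholder values."""

    def __init__(self, num_actions=4, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy frame
            feat = self.encoder(torch.zeros(1, 3, 84, 84)).shape[1]
        self.gru = nn.GRU(feat, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, num_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs, h=None):
        # obs: (B, T, 3, H, W) sequence of egocentric frames
        b, t = obs.shape[:2]
        feats = self.encoder(obs.flatten(0, 1)).view(b, t, -1)
        out, h = self.gru(feats, h)
        return self.actor(out), self.critic(out), h
```

The recurrent state carries information across timesteps, which is what lets a reactive visual encoder support sequential decision making.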
Training. We train our agents using DD-PPO – a decentralized, distributed and synchronous version of the Proximal Policy Optimization (PPO) algorithm. If r_s denotes the terminal reward obtained at the end of a successful episode (with s being an indicator variable denoting whether the episode was successful), Δ_t = d_{t−1} − d_t denotes the change in geodesic distance to the target between timesteps t−1 and t, and λ denotes a slack penalty to encourage efficiency, then the reward received by the agent at timestep t can be expressed as r_t = r_s · s · 1[t = T] + Δ_t − λ.
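A minimal sketch of this shaped reward; the terminal-bonus and slack magnitudes below are assumed placeholder values, not the paper's coefficients:

```python
def nav_reward(success, terminal, d_prev, d_curr, r_success=10.0, slack=0.01):
    """Shaped navigation reward sketch: progress toward the goal (decrease in
    geodesic distance), a constant slack penalty per step, and a terminal
    bonus on success. r_success and slack are assumed values."""
    r = (d_prev - d_curr) - slack  # positive when the agent moved closer
    if terminal and success:
        r += r_success             # terminal reward only on a successful stop
    return r
```

The slack term makes idling strictly negative, so the agent is rewarded for reaching the goal quickly rather than merely eventually.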
We train our agents using the AllenAct  framework.
| # | Corruption | vis | dyn | PointNav RGB SR | PointNav RGB SPL | PointNav RGB-D SR | PointNav RGB-D SPL | ObjectNav RGB SR | ObjectNav RGB SPL | ObjectNav RGB-D SR | ObjectNav RGB-D SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Motion Bias (C) | | ✓ | 92.81 | 77.83 | 93.36 | 79.46 | 31.51 | 14.09 | 31.96 | 15.38 |
| 10 | Motion Bias (S) | | ✓ | 94.72 | 76.95 | 96.72 | 79.08 | 30.87 | 14.15 | 35.62 | 16.39 |
| 12 | PyRobot (ILQR) Mul. = 1.0 | | ✓ | 96.00 | 67.79 | 95.45 | 69.27 | 32.51 | 11.26 | 36.35 | 13.62 |
| 14 | Defocus Blur + Motion Bias (S) | ✓ | ✓ | 76.52 | 51.08 | 97.18 | 79.46 | 5.57 | 2.00 | 18.54 | 7.23 |
| 15 | Speckle Noise + Motion Bias (S) | ✓ | ✓ | 62.69 | 43.31 | 95.81 | 78.27 | 7.85 | 3.73 | 18.54 | 8.16 |
| 16 | Spatter + Motion Bias (S) | ✓ | ✓ | 33.30 | 23.33 | 95.81 | 78.85 | 7.85 | 3.09 | 21.28 | 9.26 |
| 17 | Defocus Blur + Motion Drift | ✓ | ✓ | 74.25 | 50.99 | 95.54 | 76.66 | 4.57 | 1.93 | 17.35 | 6.97 |
| 18 | Speckle Noise + Motion Drift | ✓ | ✓ | 64.42 | 44.73 | 94.36 | 75.23 | 8.49 | 3.67 | 19.82 | 8.61 |
| 19 | Spatter + Motion Drift | ✓ | ✓ | 32.94 | 23.44 | 95.45 | 76.61 | 6.85 | 2.68 | 19.54 | 8.86 |
In this section, we show that the performance of PointNav and ObjectNav agents degrades in the presence of corruptions (see Table 2). We first highlight how vis corruptions affect static vision and embodied navigation tasks differently (see Table 1). Following this, we analyze behaviors that emerge in these agents when operating in the presence of vis, dyn, and vis+dyn corruptions. Finally, we investigate whether standard data-augmentation and self-supervised adaptation techniques help recover the degraded performance (see Table 3).
We now present our findings regarding degradation in performance relative to agents being evaluated in clean (no corruption) target environments (row 1 in Table 2).
Visual corruptions affect static and embodied tasks differently. In Table 1, we report object recognition performance for models trained on the ImageNet train split and evaluated on the corrupt validation splits. In Table 2, we report performance degradation of PointNav and ObjectNav agents under corruptions (row 1, clean & rows 2-8, corrupt). It is important to note that the natures of the tasks (one-shot prediction vs. sequential decision making) are different enough that the difficulty of corruptions for classification may not indicate the difficulty of corruptions for navigation. We verify this hypothesis by comparing results in Tables 1 and 2 – for instance, corruptions which are severe for classification (Defocus Blur and Speckle Noise) are not as severe for PointNav-RGB agents in terms of relative drop from clean performance. Additionally, for a Mask-RCNN trained on AI2-THOR images, we note that detection (segmentation) performance for the 12 ObjectNav target classes drops under both Spatter (S5) and Low-Lighting (S5) – unlike rows 2 & 8 in Table 2, where Spatter appears much more severe than Low-Lighting. This difference in relative degradation suggests that techniques for visual adaptation or robustness in static settings may not transfer out-of-the-box to embodied tasks, warranting more research in this direction.
Not all corruptions are equally bad. While we note that PointNav and ObjectNav agents suffer a drop in performance from clean settings, not all corruptions are equally severe. For instance, in PointNav-RGB, while Low-Lighting, Motion Blur and Motion Bias (C) (rows 2, 3, 9 in Table 2) lead to comparatively modest worst-case absolute drops in SPL and SR, corruptions like Spatter and Motor Failure (rows 8, 13) are more extreme and significantly affect task performance. For ObjectNav, however, the drop in performance is more gradual across corruptions (partly because it is a harder task and even clean performance is fairly low).
A “clean” depth sensor helps resist degradation. We compare the RGB and RGB-D variants of the trained PointNav and ObjectNav agents (RGB corrupt, depth clean) in Table 2 (corresponding RGB & RGB-D columns). We observe that including a “clean” depth sensor consistently improves resistance to vis, dyn and vis+dyn corruptions for both PointNav and ObjectNav. For PointNav, we note that while the RGB and RGB-D variants have comparable clean performance (row 1), under severe corruptions (Spatter, Lower-FOV and Speckle Noise), the RGB-D counterparts are ahead by a notable absolute margin in SPL. We further observe that, barring exceptions, PointNav RGB-D agents are generally affected minimally by corruptions – for instance, Low-Lighting and Motion Blur barely result in any drop in performance. We hypothesize that this is likely because RGB-D navigation agents are much less reliant on the RGB sensor than their RGB counterparts. In ObjectNav, an additional depth sensor generally improves clean performance (row 1 in Table 2), which is likely the major contributing factor for increased resistance to corruptions. Sensors of different modalities are likely to degrade in different scenarios – e.g., a depth sensor may continue to perceive details in low-lighting settings. The obtained results suggest that adding multiple sensors, while expensive, can help train robust models. Additional sensors can also be helpful for unsupervised adaptation during the calibration phase. For instance, in the presence of a “clean” depth sensor, one can consider comparing depth-based egomotion estimates with expected odometry readings in the target environment to infer changes in dynamics.
[Fig. 4 columns: Failed Actions | Min. Dist. to Target (m) | Stop-Failure (Pos) (%) | Stop-Failure (Neg) (%)]
In Sec. A.5 of the appendix, we further investigate the degree to which more sophisticated PointNav agents, composed of map-based architectures, are susceptible to vis corruptions. Specifically, we evaluate the performance of the winning PointNav entry of Habitat-Challenge (HC) 2020 – Occupancy Anticipation (OccAnt) – on Gibson val scenes under noise-free conditions, Habitat Challenge conditions and vis corruptions. We find that introducing corruptions under noise-free conditions degrades navigation performance significantly only for RGB agents. Under HC conditions, RGB-D agents suffer a drop in performance as RGB noise is replaced with progressively severe vis corruptions.
Presence of vis+dyn corruptions further degrades performance. Rows 14-19 in Table 2 indicate the extent of performance degradation when vis+dyn corruptions are present. With the exception of a few cases, as expected, the drop in performance is slightly more pronounced compared to the presence of just vis or dyn corruptions. The relative drop in performance from vis → vis+dyn is more pronounced for ObjectNav as opposed to PointNav.
Navigation performance for RGB agents degrades consistently with escalating episode difficulty. Recall that we evaluate navigation performance over episodes of varying difficulty levels (see Sec. 3). We break down the performance of PointNav & ObjectNav agents by episode difficulty level (in Sec. A.5 of the appendix). Under “clean” settings, we find that PointNav agents (RGB and RGB-D) have comparable performance across all difficulty levels. Under corruptions, we note that unlike the RGB-D counterparts, the performance of PointNav-RGB agents consistently deteriorates as episodes become harder. ObjectNav agents (both RGB & RGB-D) show a similar trend of decreasing navigation performance with increasing episode difficulty.
We now study the idiosyncrasies (see Fig 4) exhibited by these agents (PointNav-RGB and ObjectNav-RGB) which lead to their degraded performance.
Agents tend to collide more often. Fig 4 (first column, bars color-coded based on the kind of corruption) shows the average number of failed actions under corrupt settings. In our framework, failed actions occur as a consequence of colliding with objects, walls, etc. While corruptions generally lead to increased collisions, we note that adding a dyn corruption on top of a vis one (D.B. → D.B. + M.D. & S.N. → S.N. + M.D.) increases the number of collisions over vis or dyn corruptions alone – dyn corruptions lead to unforeseen changes in dynamics (actions working unexpectedly), which likely contributes to an uptick in collisions.
| # | Method | Clean SR | Clean SPL | Lower-FOV SR | Lower-FOV SPL | Defocus Blur SR | Defocus Blur SPL | Camera Crack SR | Camera Crack SPL | Spatter SR | Spatter SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nav. Loss | 98.82 | 83.13 | 42.49 | 31.73 | 75.89 | 53.55 | 82.07 | 63.83 | 33.58 | 24.72 |
| 2 | Nav. Loss + AP | 98.45 | 83.28 | 45.68 | 35.14 | 83.35 | 61.51 | 72.70 | 56.82 | 20.38 | 15.70 |
| 3 | Nav. Loss + AP + SS-Adapt | 37.31 | 31.03 | 32.94 | 26.09 | 40.95 | 33.35 | 57.87 | 46.72 | 14.19 | 10.29 |
| 4 | Nav. Loss + RP | 98.73 | 82.53 | 44.95 | 32.74 | 32.21 | 22.47 | 67.06 | 53.70 | 23.48 | 18.63 |
| 5 | Nav. Loss + RP + SS-Adapt | 94.63 | 77.25 | 50.59 | 36.10 | 79.16 | 62.74 | 60.42 | 49.37 | 61.06 | 47.16 |
| 6 | Nav. Loss + Data Aug | 98.45 | 81.08 | 71.70 | 54.54 | 81.26 | 61.32 | 88.44 | 71.57 | 23.93 | 18.41 |
| 7 | Finetune Nav. Loss on Target | - | - | 72.88 | 61.82 | 97.18 | 80.32 | 96.54 | 80.92 | 91.81 | 77.38 |
Agents tend to be farther from the target. Fig 4 (second column) shows the minimum distance from the target over the course of an episode. While we note that as corruptions become progressively severe, agents tend to terminate farther away from the target (see Sec. A.4 of the appendix), Fig 4 (second column) indicates that the overall proximity of the agent to the goal over an episode also decreases – the minimum distance to the target increases as we go from Clean → vis or dyn, and from vis or dyn → vis+dyn. While this may be intuitive in the presence of a dyn corruption, it is interesting to note that this trend is also consistent for vis corruptions (Clean → D.B. or S.N.).
Corruptions hurt the ObjectNav stopping mechanism. Recall that for both PointNav and ObjectNav, success depends on the notion of “intentionality” – the agent calls an end action when it believes it has reached the goal. In Fig 4 (last two columns) we aim to understand how corruptions affect this stopping mechanism. Specifically, we look at two quantitative measures – (1) Stop-Failure (Positive), the proportion of times the agent invokes an end action when the goal is not in range; and (2) Stop-Failure (Negative), the proportion of times the agent does not invoke an end action when the goal is in range, out of the number of times the goal is in range. (The goal-in-range criterion for PointNav checks if the target is within the threshold distance; for ObjectNav, this includes an additional visibility criterion.)
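The two stop-failure measures above can be sketched as follows; the per-step bookkeeping of (goal-in-range, called-end) pairs is an assumed episode representation for illustration, not the paper's evaluation code:

```python
def stop_failure_rates(episodes):
    """Sketch of the two stopping-mechanism measures. Each episode is a list of
    (in_range, called_end) booleans per step. Returns (pos, neg):
    pos = fraction of end calls issued while the goal was NOT in range;
    neg = fraction of in-range steps where the agent did NOT call end."""
    end_calls = end_out_of_range = 0
    in_range_steps = in_range_no_end = 0
    for ep in episodes:
        for in_range, called_end in ep:
            if called_end:
                end_calls += 1
                if not in_range:
                    end_out_of_range += 1  # premature stop
            if in_range:
                in_range_steps += 1
                if not called_end:
                    in_range_no_end += 1   # missed stopping opportunity
    pos = end_out_of_range / end_calls if end_calls else 0.0
    neg = in_range_no_end / in_range_steps if in_range_steps else 0.0
    return pos, neg
```

The positive rate captures premature termination, while the negative rate captures the agent walking through (or near) the goal without recognizing it.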
We observe that prematurely calling an end action is a significant issue only for ObjectNav (Fig 4, third column) – and it becomes more pronounced as corruptions become progressively severe (Clean → D.B. or S.N.; M.D. → D.B. + M.D. or S.N. + M.D.). Similarly, the inability of an agent to invoke an end action is also more pronounced for ObjectNav as opposed to PointNav (Fig 4, fourth column). To investigate the extent to which this impacts the agent’s performance, we compare the agent’s Success Rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (call end as soon as the goal is in range). We find that this makes a significant difference only for ObjectNav – across the Clean, M.D. and D.B. + M.D. settings. We hypothesize that equipping agents with robust stopping mechanisms can significantly improve performance on RobustNav. For instance, equipping the agent with a progress monitor module (estimating progress made towards the goal in terms of distance) that is robust to vis corruptions can potentially help decide when to explicitly invoke an end action in the target environment.
To assist near-term progress, we investigate whether some standard approaches for training robust models or adapting to visual disparities can help resist vis corruptions under a calibration budget (Sec. 3), defined based on the number of steps it takes an agent to reasonably recover degraded performance in corrupted environments when finetuned with complete task supervision.
Extent of attainable improvement by finetuning under task supervision. As an anecdotal upper bound on attainable improvements under the calibration budget, we also report the extent to which degraded performance can be recovered when fine-tuned under complete task supervision. We report these results for vis corruptions in Table 3 (row 7). We note that unlike Lower-FOV, the agent is able to almost fully recover performance for Defocus Blur, Camera-Crack and Spatter (Table 3, rows 1, 7).
Do data-augmentation strategies help? In Table 3, we study whether data-augmentation strategies improve zero-shot resistance to vis corruptions (rows 1,6). We compare PointNav RGB agents trained with Random-Crop, Random-Shift and Color-Jitter (row 6) against the vanilla versions (row 1) and find that while data augmentation (row 6) offers some absolute improvements in SPL and SR over degraded performance (row 1) for Lower-FOV, Defocus Blur and Camera-Crack (Spatter being an exception), obtained performance is still significantly below Clean settings (row 1, Clean col). Improvements are more pronounced for Lower-FOV compared to the other corruptions (likely due to Random-Shift and Random-Crop). We note that data augmentation provides improvements only for a subset of vis corruptions, and even when it does, the obtained improvements are not sufficient to recover lost performance.
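As an illustration of the kind of augmentation involved, below is a minimal random-shift augmentation in the spirit of Random-Shift / Random-Crop; the pad size and edge-replication padding mode are illustrative choices, not the exact training configuration:

```python
import numpy as np

def random_shift(obs, pad=4, rng=None):
    """Random-shift augmentation: pad an H x W x C observation by `pad`
    pixels on each side (edge replication) and crop back to the original
    size at a random offset."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = obs.shape
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = int(rng.integers(0, 2 * pad + 1))
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w]
```
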
Do self-supervised adaptation approaches help? In the absence of reward supervision in the target environment, Hansen et al. proposed Policy Adaptation during Deployment (PAD) – source pretraining with an auxiliary self-supervised objective and optimizing only the self-supervised objective when deployed in the target environment. We investigate the degree to which PAD helps adapt to the target environments in RobustNav. The adopted self-supervised tasks are (1) Action-Prediction (AP) – given two successive observations in a trajectory, predict the intermediate action – and (2) Rotation-Prediction (RP) – rotate the input observation by 0°, 90°, 180° or 270° before feeding it to the agent and task an additional auxiliary head with predicting the rotation. We report numbers with AP (rows 2,3) and RP (rows 4,5) in Table 3. For AP, we find that (1) pre-training (row 2 vs row 1) results in little or no improvement over degraded performance (maximum absolute improvements in SPL and SR for Defocus Blur) and (2) further adaptation (row 3 vs rows 2,1) under the calibration budget consistently degrades performance. For RP, we observe that (1) with the exception of Clean and Lower-FOV, pre-training (row 4 vs row 1) results in worse performance and (2) while self-supervised adaptation under corruptions improves performance over pre-training (row 5 vs row 4), it is still significantly below Clean settings (row 1, Clean col) – with a minimum absolute gap in SPL and SR between Defocus Blur (row 5) and Clean (row 1). While improvements over degraded performance might highlight the utility of PAD (with AP / RP) as a potential unsupervised adaptation approach, there is still a long way to go in terms of closing the performance gap between clean and corrupt settings.
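For concreteness, the RP pretext sample construction can be sketched as follows – a copy of the observation rotated by a random multiple of 90° plus a 4-way rotation label, on which an auxiliary head would then be trained with cross-entropy (the function name and data format are illustrative):

```python
import numpy as np

def rotation_pretext_sample(obs, rng=None):
    """Build a Rotation-Prediction (RP) pretext sample: rotate an H x W x C
    observation by a random multiple of 90 degrees and return (rotated, label),
    where label in {0, 1, 2, 3} indexes the rotation bin. At deployment,
    only this self-supervised objective is optimized (PAD-style)."""
    if rng is None:
        rng = np.random.default_rng()
    k = int(rng.integers(0, 4))  # 0, 90, 180 or 270 degrees
    return np.rot90(obs, k=k, axes=(0, 1)).copy(), k
```
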
In summary, as a step towards assessing the general-purpose robustness of embodied navigation agents, we propose RobustNav, a challenging framework well-suited to benchmark the robustness of embodied navigation agents under a wide variety of visual and dynamics corruptions. To succeed on RobustNav, an agent must be insensitive to corruptions and also be able to adapt to unforeseen changes in new environments with minimal interaction. We find that standard PointNav and ObjectNav agents underperform (or fail) significantly in the presence of corruptions, and while standard techniques to improve robustness or adapt to environments with visual disparities (data augmentation, self-supervised adaptation) provide some improvements, considerable room for improvement remains in terms of fully recovering lost navigation performance. Lastly, we plan on evolving RobustNav in terms of the sophistication and diversity of corruptions as more features are supported in the underlying simulator. We release RobustNav in RoboTHOR, and hope that our findings provide insights into developing more robust navigation agents.
Acknowledgements. We thank Klemen Kotar, Luca Weihs, Martin Lohmann, Harsh Agrawal and Rama Vedantam for fruitful discussions and valuable feedback. We thank Winson Han for helping out with the Camera-Crack vis corruption. We also thank Vishvak Murahari for helping out with the ImageNet experiments and for sharing code for the Mask-RCNN experiments. This work is supported by the NASA University Leadership Initiative (ULI) under grant number 80NSSC20M0161.
This appendix is organized as follows. In Sec. A.2, we describe in detail the task specifications for PointNav and ObjectNav. In Sec. A.3, we provide details about the architecture adopted for PointNav and ObjectNav agents and how they are trained. In Sec. A.4, we include more plots demonstrating the kinds of behaviors PointNav and ObjectNav agents exhibit under corruptions (RGB-D variants in addition to the RGB variants in Sec. 5.2 of the main paper). In Sec. A.5, we provide more results demonstrating degradation in performance at severity level 3 (for vis corruptions with controllable severity levels; excluded from the main paper due to space constraints) and break down performance degradation by episode difficulty.
We describe in detail the task specifications (as outlined in Sec. 3 of the main paper) for the tasks included in RobustNav. Note that while RobustNav currently supports navigation-heavy tasks, the corruptions included can easily be extended to other embodied tasks that share the same modalities – for instance, tasks involving vision-and-language guided navigation or interaction components.
PointNav. In PointNav, an agent is spawned at a random location and orientation in an environment and asked to navigate to goal coordinates specified relative to the agent’s position. This is equivalent to the agent being equipped with a GPS+Compass sensor (providing relative location and orientation with respect to the agent’s current position). Note that the agent does not have access to any “map” of the environment and must navigate based solely on sensory inputs from a visual RGB (or RGB-D) sensor and the GPS+Compass sensor. An episode is declared successful if the PointNav agent stops (by “intentionally” invoking an end action) within a threshold distance of the goal location.
ObjectNav. In ObjectNav, an agent is spawned at a random location and orientation in an environment and is asked to navigate to a specified “object” category (e.g., Television) that exists in the environment. Unlike PointNav, an ObjectNav agent does not have access to a GPS+Compass sensor and must navigate based solely on the specified target and visual sensor inputs – RGB (or RGB-D). An episode is declared successful if the ObjectNav agent (1) stops (by “intentionally” invoking an end action) within a threshold distance of the target object and (2) has the target object within its ego-centric view. We consider the 12 object categories present in the RoboTHOR scenes for our ObjectNav experiments. These are AlarmClock, Apple, BaseballBat, BasketBall, Bowl, GarbageCan, HousePlant, Laptop, Mug, SprayBottle, Television and Vase (see Fig. 5 for a few examples in the agent’s ego-centric frame).
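The two success criteria can be summarized in a small predicate; the `threshold` success radius is left as a parameter here, since the exact values follow the RoboTHOR task definitions:

```python
def episode_success(task, called_end, dist_to_target, threshold,
                    target_visible=False):
    """Success check sketch for RobustNav-style tasks. Success always
    requires an *intentional* end call within the success radius;
    ObjectNav additionally requires the target in the ego-centric view."""
    if not called_end or dist_to_target > threshold:
        return False
    if task == "ObjectNav":
        return target_visible  # additional visibility criterion
    return True
```
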
We now describe the architecture of the agents studied in RobustNav and provide additional training details.
Base Architecture. We consider standard neural architectures (akin to [56, 55]) for both PointNav and ObjectNav – convolutional units to encode observations followed by recurrent policy networks to predict action distributions. Concretely, our agent architecture consists of four major components – a visual encoder, a goal encoder, a target observation combiner and a policy network (see Fig. 6). The visual encoder (for both RGB and RGB-D agents) consists of a frozen ResNet-18 encoder (up to the last residual block) pretrained on ImageNet, followed by a learnable compressor network consisting of two convolutional layers of kernel size 1, each followed by ReLU activations. The goal encoder encodes the specified target – a goal location in polar coordinates for PointNav and the target object token (e.g., Television) for ObjectNav. For PointNav, the goal is encoded via a linear layer. For ObjectNav, the goal is encoded via an embedding layer set to encode one of the 12 object categories. The goal embedding and the output of the visual encoder are then concatenated and passed through the target observation combiner network, consisting of two convolutional layers of kernel size 1. The output of the target observation combiner is flattened and then fed to the policy network – specifically, a single-layer GRU, followed by linear actor and critic heads used to predict action distributions and value estimates.
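A minimal PyTorch sketch of this architecture is given below; the channel widths, feature resolution, hidden size and the stand-in backbone are illustrative placeholders rather than the exact configuration:

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Sketch of a RobustNav-style agent: frozen visual backbone + 1x1-conv
    compressor, goal encoder, target-observation combiner and a GRU policy."""
    def __init__(self, n_actions=6, n_objects=12, task="ObjectNav"):
        super().__init__()
        # stand-in for the frozen ImageNet ResNet-18 trunk (512-ch features)
        self.backbone = nn.Conv2d(3, 512, kernel_size=7, stride=7)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.compressor = nn.Sequential(
            nn.Conv2d(512, 128, 1), nn.ReLU(),
            nn.Conv2d(128, 32, 1), nn.ReLU())
        # object-token embedding for ObjectNav, polar-coordinate linear
        # layer for PointNav
        self.goal_encoder = (nn.Embedding(n_objects, 32)
                             if task == "ObjectNav" else nn.Linear(2, 32))
        self.combiner = nn.Sequential(
            nn.Conv2d(32 + 32, 64, 1), nn.ReLU(), nn.Conv2d(64, 32, 1))
        self.policy = nn.GRU(32 * 7 * 7, 512, batch_first=True)
        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def forward(self, rgb, goal, hidden=None):
        feats = self.compressor(self.backbone(rgb))        # B x 32 x 7 x 7
        g = self.goal_encoder(goal)                        # B x 32
        g = g[:, :, None, None].expand(-1, -1, *feats.shape[2:])
        x = self.combiner(torch.cat([feats, g], dim=1)).flatten(1)
        out, hidden = self.policy(x[:, None, :], hidden)   # 1-step rollout
        h = out[:, 0]
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h), hidden
```
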
Auxiliary Task Heads. In Sec. 5.3 of the main paper, we investigate whether self-supervised approaches, particularly Policy Adaptation during Deployment (PAD), help in resisting performance degradation due to vis corruptions. Incorporating PAD involves training the vanilla agent architectures (as highlighted before) with self-supervised tasks (for pre-training as well as adaptation in a corrupt target environment) – namely, Action-Prediction (AP) and Rotation-Prediction (RP). In Action-Prediction (AP), given two successive observations in a trajectory, an auxiliary head is tasked with predicting the intermediate action; in Rotation-Prediction (RP), the input observation is rotated by 0°, 90°, 180° or 270° uniformly at random before being fed to the agent, and an auxiliary head is asked to predict the rotation bin. For both AP and RP, the auxiliary task heads operate on the encoded visual observation (as shown in Fig. 6). To gather samples in the target environment (corrupt or otherwise), we use data collected from trajectories under the source (clean) pre-trained policy – i.e., the visual encoder is updated online as observations are encountered under the pre-trained policy.
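The AP head itself is just a classifier over the concatenated encodings of two successive frames; a minimal numpy sketch (with hypothetical head parameters `W`, `b`) is:

```python
import numpy as np

def action_prediction_logits(f_t, f_t1, W, b):
    """Action-Prediction (AP) auxiliary head sketch: a linear layer over the
    concatenated visual encodings of two successive observations, producing
    logits over the action that connects them."""
    x = np.concatenate([f_t, f_t1])
    return W @ x + b

def ap_loss(logits, action):
    """Cross-entropy for the AP pretext task -- the self-supervised signal
    optimized during PAD-style adaptation."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[action]
```
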
Training and Evaluation Details. As mentioned earlier, we train our agents with DD-PPO (a decentralized, distributed version of the Proximal Policy Optimization algorithm) with Generalized Advantage Estimation. We use fixed-length rollouts and 4 epochs of PPO with 1 mini-batch per epoch. Standard values are used for the discount factor, the GAE parameter, the PPO clip parameter and the value-loss coefficient, and gradient norms are clipped. We use the Adam optimizer with a linearly decaying learning rate. The reward structure used is as follows – if r_success denotes the terminal reward obtained at the end of a “successful” episode and r_slack denotes a slack penalty to encourage efficiency, then the reward received by the agent at time-step t can be expressed as

r_t = 1[a_t = end] · s · r_success − Δ_geo + r_slack
where Δ_geo is the change in geodesic distance to the goal, a_t is the action taken by the agent and s indicates whether the episode was successful (s = 1) or not (s = 0). During evaluation, we allow an agent to execute a maximum number of steps per episode – if an agent doesn’t call end within this limit, we forcefully terminate the episode. All agents are trained under LocoBot-calibrated actuation noise models (for both translation and rotation). During evaluation, with the exception of circumstances when Motion Bias (S) is present, we use the same actuation noise models (in addition to dyn corruptions when applicable). We train our PointNav and ObjectNav agents (both RGB and RGB-D variants) for several million steps.
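The reward structure above can be sketched as a small function; the r_success and r_slack magnitudes below are illustrative defaults, not the exact values used:

```python
def step_reward(delta_geo, success, is_terminal,
                r_success=10.0, r_slack=-0.01):
    """Reward sketch matching the structure described above: a terminal
    success reward, minus the change in geodesic distance to the goal
    (negative delta = progress = positive reward), plus a constant slack
    penalty encouraging efficiency."""
    reward = -delta_geo + r_slack
    if is_terminal and success:
        reward += r_success
    return reward
```
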
In Sec. 5.2 of the main paper, we try to understand the idiosyncrasies exhibited by the navigation agents under corruptions. Specifically, we look at the number of collisions as observed through the number of failed actions in RoboTHOR, the closest the agent arrives to the target in an episode, and Stop-Fail (Pos) and Stop-Fail (Neg). Since for both PointNav and ObjectNav, success depends on a notion of “intentionality” – the agent calls an end action when it believes it has reached the goal – we use both Stop-Fail (Pos) and Stop-Fail (Neg) to assess how corruptions impact this “stopping” mechanism of the agents. Stop-Fail (Pos) measures the fraction of times the agent calls an end action when the goal is not in range (the goal-in-range criterion for PointNav checks whether the target is within the threshold distance; for ObjectNav, this includes a visibility criterion in addition to distance), out of the number of times the agent calls an end action. Stop-Fail (Neg) measures the fraction of times the agent fails to invoke an end action when the goal is in range, out of the number of steps the goal is in range in an episode. Both are averaged across evaluation episodes. In addition to the above aspects, we also measure the average distance to the goal at episode termination. Here we report these measures for PointNav and ObjectNav agents trained with RGB and RGB-D sensors in Fig. 7 (RGB-D variants in addition to the RGB agents in Fig. 4 of the main paper).
We find that across RGB and RGB-D variants, (1) agents tend to collide more often under corruptions (Fig. 7, col 1), (2) agents generally end up farther from the target at episode termination under corruptions (Fig. 7, col 2) and (3) agents tend to be farther from the target under corruptions even in terms of minimum distance over an episode (Fig. 7, col 3). We further note that the effect of corruptions on the agent’s stopping mechanism is more pronounced for ObjectNav as opposed to PointNav (Fig. 7, cols 4 & 5).
To further understand the extent to which a worse stopping mechanism impacts the agent’s performance, in Fig. 8, we compare the agents’ success rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (forcefully call end when goal is in range). For both ObjectNav RGB and RGB-D, we find that the presence of vis and vis+dyn corruptions affects success significantly compared to the clean settings (Fig. 8, black bars).
Habitat Challenge Results. As stated in Sec. 5 of the main paper, here we investigate the degree to which more sophisticated PointNav agents, composed of map-based architectures, are susceptible to vis corruptions. Specifically, we evaluate the performance of the winning entry of Habitat Challenge (HC) 2020 – Occupancy Anticipation – on the Gibson validation scenes (see Table 4). We evaluate the performance of OccAnt (for RGB and RGB-D; based on provided checkpoints) when vis corruptions are introduced (1) over clean settings under noise-free conditions (rows 1-3 in Table 4) and (2) by replacing the RGB noise under Habitat Challenge (HC) conditions (rows 4-6 in Table 4). Under noise-free conditions, we note that degradation in performance from the clean settings is more pronounced for the RGB agents as opposed to the RGB-D variants. Under HC conditions, we note that the RGB-D variants suffer significant degradation in performance when RGB noise is replaced with vis corruptions.
| # | Corruption | V | D | PointNav RGB Success | PointNav RGB SPL | PointNav RGB-D Success | PointNav RGB-D SPL | ObjectNav RGB Success | ObjectNav RGB SPL | ObjectNav RGB-D Success | ObjectNav RGB-D SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Low Lighting (S3) | ✓ | | 97.45 ± 0.27 | 80.53 ± 0.33 | 99.09 ± 0.17 | 84.91 ± 0.26 | 21.55 ± 0.72 | 8.91 ± 0.38 | 27.55 ± 0.78 | 13.08 ± 0.47 |
| 3 | Low Lighting (S5) | ✓ | | 93.42 ± 0.43 | 74.88 ± 0.43 | 99.27 ± 0.15 | 85.04 ± 0.25 | 11.69 ± 0.56 | 4.90 ± 0.30 | 23.26 ± 0.74 | 10.61 ± 0.43 |
| 4 | Motion Blur (S3) | ✓ | | 98.82 ± 0.19 | 80.64 ± 0.29 | 98.91 ± 0.18 | 84.62 ± 0.26 | 18.57 ± 0.68 | 8.18 ± 0.37 | 24.32 ± 0.75 | 11.52 ± 0.44 |
| 5 | Motion Blur (S5) | ✓ | | 96.15 ± 0.34 | 73.24 ± 0.38 | 99.06 ± 0.17 | 85.01 ± 0.25 | 10.50 ± 0.93 | 4.71 ± 0.51 | 18.26 ± 0.83 | 7.87 ± 0.45 |
| 7 | Defocus Blur (S3) | ✓ | | 94.63 ± 0.39 | 73.28 ± 0.41 | 98.79 ± 0.19 | 84.47 ± 0.26 | 15.59 ± 0.63 | 6.99 ± 0.36 | 22.07 ± 0.72 | 9.75 ± 0.41 |
| 8 | Defocus Blur (S5) | ✓ | | 75.83 ± 0.75 | 53.48 ± 0.59 | 99.03 ± 0.17 | 85.44 ± 0.25 | 4.20 ± 0.35 | 1.81 ± 0.20 | 17.66 ± 0.67 | 7.47 ± 0.36 |
| 9 | Speckle Noise (S3) | ✓ | | 89.23 ± 0.54 | 68.18 ± 0.53 | 98.85 ± 0.19 | 84.58 ± 0.27 | 14.92 ± 0.62 | 6.54 ± 0.34 | 24.05 ± 0.75 | 10.34 ± 0.42 |
| 10 | Speckle Noise (S5) | ✓ | | 66.70 ± 0.82 | 47.92 ± 0.67 | 98.97 ± 0.18 | 84.79 ± 0.26 | 8.68 ± 0.49 | 3.90 ± 0.28 | 17.69 ± 0.67 | 7.42 ± 0.36 |
| 14 | Motion Bias (S) | | ✓ | 95.69 ± 0.35 | 77.05 ± 0.38 | 96.60 ± 0.32 | 79.22 ± 0.35 | 33.21 ± 0.82 | 14.88 ± 0.47 | 34.98 ± 0.83 | 16.70 ± 0.51 |
| 16 | Motion Bias (C) | | ✓ | 92.27 ± 0.47 | 77.48 ± 0.46 | 93.11 ± 0.44 | 79.04 ± 0.45 | 30.47 ± 0.80 | 13.20 ± 0.45 | 32.72 ± 0.82 | 15.67 ± 0.50 |
| 17 | PyRobot (ILQR) Mul. = 1.0 | | ✓ | 95.18 ± 0.37 | 67.45 ± 0.37 | 96.18 ± 0.33 | 69.48 ± 0.35 | 32.54 ± 0.82 | 11.65 ± 0.39 | 36.86 ± 0.84 | 14.24 ± 0.44 |
| 19 | Defocus Blur (S3) + Motion Bias (S) | ✓ | ✓ | 92.72 ± 0.45 | 68.61 ± 0.43 | 97.45 ± 0.27 | 79.70 ± 0.32 | 14.40 ± 0.61 | 6.15 ± 0.33 | 22.40 ± 0.73 | 9.20 ± 0.39 |
| 20 | Defocus Blur (S5) + Motion Bias (S) | ✓ | ✓ | 75.80 ± 0.75 | 50.76 ± 0.58 | 97.00 ± 0.30 | 79.81 ± 0.33 | 5.66 ± 0.40 | 2.34 ± 0.22 | 17.53 ± 0.66 | 7.07 ± 0.35 |
| 21 | Speckle Noise (S3) + Motion Bias (S) | ✓ | ✓ | 86.62 ± 0.59 | 63.20 ± 0.54 | 96.85 ± 0.30 | 79.23 ± 0.34 | 14.92 ± 0.62 | 6.31 ± 0.34 | 24.72 ± 0.75 | 10.04 ± 0.41 |
| 22 | Speckle Noise (S5) + Motion Bias (S) | ✓ | ✓ | 64.36 ± 0.83 | 44.38 ± 0.66 | 96.78 ± 0.31 | 79.49 ± 0.34 | 8.95 ± 0.50 | 3.85 ± 0.27 | 18.39 ± 0.68 | 7.49 ± 0.36 |
| 23 | Spatter (S3) + Motion Bias (S) | ✓ | ✓ | 37.25 ± 0.84 | 23.83 ± 0.57 | 96.60 ± 0.32 | 78.62 ± 0.35 | 7.18 ± 0.45 | 3.60 ± 0.28 | 24.44 ± 0.75 | 9.80 ± 0.40 |
| 24 | Spatter (S5) + Motion Bias (S) | ✓ | ✓ | 33.85 ± 0.82 | 23.98 ± 0.61 | 95.94 ± 0.34 | 78.64 ± 0.36 | 7.64 ± 0.46 | 2.93 ± 0.23 | 20.91 ± 0.71 | 9.39 ± 0.40 |
| 25 | Defocus Blur (S3) + Motion Drift | ✓ | ✓ | 89.72 ± 0.53 | 65.84 ± 0.47 | 94.84 ± 0.39 | 75.97 ± 0.37 | 14.16 ± 0.61 | 6.26 ± 0.34 | 23.56 ± 0.74 | 10.65 ± 0.43 |
| 26 | Defocus Blur (S5) + Motion Drift | ✓ | ✓ | 73.92 ± 0.76 | 50.84 ± 0.59 | 94.72 ± 0.39 | 76.21 ± 0.37 | 4.57 ± 0.36 | 2.10 ± 0.21 | 17.26 ± 0.66 | 7.04 ± 0.35 |
| 27 | Speckle Noise (S3) + Motion Drift | ✓ | ✓ | 86.65 ± 0.59 | 62.44 ± 0.53 | 93.99 ± 0.41 | 75.02 ± 0.39 | 13.46 ± 0.60 | 5.95 ± 0.33 | 23.01 ± 0.73 | 9.96 ± 0.41 |
| 28 | Speckle Noise (S5) + Motion Drift | ✓ | ✓ | 63.18 ± 0.84 | 43.29 ± 0.65 | 94.51 ± 0.40 | 75.34 ± 0.38 | 7.49 ± 0.80 | 3.63 ± 0.46 | 18.93 ± 0.68 | 7.85 ± 0.36 |
| 29 | Spatter (S3) + Motion Drift | ✓ | ✓ | 37.70 ± 0.84 | 24.27 ± 0.57 | 94.57 ± 0.39 | 75.34 ± 0.38 | 7.15 ± 0.45 | 3.59 ± 0.27 | 23.44 ± 0.74 | 9.72 ± 0.40 |
| 30 | Spatter (S5) + Motion Drift | ✓ | ✓ | 33.36 ± 0.82 | 23.59 ± 0.60 | 95.03 ± 0.38 | 75.84 ± 0.37 | 7.21 ± 0.55 | 2.77 ± 0.28 | 18.69 ± 0.68 | 8.37 ± 0.38 |
| 31 | Defocus Blur (S3) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 93.99 ± 0.41 | 58.88 ± 0.40 | 97.66 ± 0.26 | 70.54 ± 0.32 | 16.13 ± 0.64 | 5.22 ± 0.28 | 22.68 ± 0.73 | 7.33 ± 0.32 |
| 32 | Defocus Blur (S5) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 79.34 ± 0.71 | 42.29 ± 0.49 | 97.24 ± 0.29 | 70.35 ± 0.33 | 5.81 ± 0.41 | 1.04 ± 0.11 | 18.48 ± 0.68 | 5.86 ± 0.29 |
| 33 | Speckle Noise (S3) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 88.38 ± 0.56 | 54.60 ± 0.49 | 96.12 ± 0.34 | 68.67 ± 0.35 | 14.95 ± 0.62 | 4.71 ± 0.26 | 24.11 ± 0.75 | 7.51 ± 0.32 |
| 34 | Speckle Noise (S5) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 67.12 ± 0.82 | 37.77 ± 0.57 | 96.36 ± 0.33 | 69.44 ± 0.34 | 8.89 ± 0.50 | 2.66 ± 0.20 | 18.72 ± 0.68 | 5.73 ± 0.29 |
| 35 | Spatter (S3) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 40.70 ± 0.86 | 18.26 ± 0.45 | 96.09 ± 0.34 | 68.25 ± 0.36 | 8.31 ± 0.48 | 1.76 ± 0.16 | 23.17 ± 0.74 | 7.76 ± 0.33 |
| 36 | Spatter (S5) + PyRobot (ILQR) Mul. = 1.0 | ✓ | ✓ | 36.37 ± 0.84 | 19.70 ± 0.51 | 96.03 ± 0.34 | 68.98 ± 0.36 | 8.58 ± 0.49 | 2.09 ± 0.17 | 20.85 ± 0.71 | 7.41 ± 0.33 |

Table 5: Performance of PointNav and ObjectNav agents under corruptions; PointNav agents have additional access to a GPS+Compass sensor. For visual corruptions with controllable severity levels, we report results with severity set to 5 and 3. Performance is measured across tasks of varying difficulties (easy, medium and hard). Reported results are mean and standard error across 3 evaluation runs with different seeds. Rows are sorted based on SPL values for RGB PointNav agents. Success and SPL values are reported as percentages. (V = Visual, D = Dynamics)
More Degradation Results. In Table 5, we report the degradation in performance (relative to clean settings) of PointNav and ObjectNav agents when operating under vis, dyn and vis+dyn corruptions. We report mean and standard error values across evaluation runs under actuation noise (wherever applicable). For vis corruptions with controllable severity levels – Motion Blur, Low-Lighting, Defocus Blur, Speckle Noise and Spatter – we report results with severities set to 3 and 5 (identified by S3 and S5; severity 3 was excluded from the main paper due to space constraints) – for both vis and vis+dyn settings. We note that unlike the RGB-D variants, for PointNav RGB agents, performance drops more as severity levels increase (from severity 3 to severity 5). For ObjectNav, we find that for both RGB and RGB-D variants, performance decreases with increasing severity of corruptions.
Table 6: PointNav performance (Success / SPL, in %; mean ± standard error) broken down by episode difficulty. The first block corresponds to RGB agents and the second to RGB-D agents.

| # | Corruption | Easy Success | Easy SPL | Medium Success | Medium SPL | Hard Success | Hard SPL |
|---|---|---|---|---|---|---|---|
| 2 | Low Lighting | 99.36 ± 0.24 | 80.59 ± 0.45 | 95.54 ± 0.62 | 75.83 ± 0.70 | 85.34 ± 1.07 | 68.22 ± 0.94 |
| 3 | Camera Crack | 94.10 ± 0.71 | 75.81 ± 0.70 | 80.05 ± 1.21 | 62.15 ± 1.06 | 70.49 ± 1.38 | 52.44 ± 1.14 |
| 5 | Speckle Noise + Motion Bias (S) | 86.74 ± 1.02 | 60.86 ± 0.96 | 61.11 ± 1.47 | 41.30 ± 1.15 | 45.17 ± 1.50 | 30.93 ± 1.11 |
| 6 | Spatter + Motion Bias (S) | 72.48 ± 1.35 | 53.25 ± 1.08 | 18.85 ± 1.18 | 11.99 ± 0.79 | 10.11 ± 0.91 | 6.63 ± 0.62 |
| 7 | Speckle Noise + Motion Drift | 88.74 ± 0.95 | 63.57 ± 0.89 | 59.47 ± 1.48 | 38.75 ± 1.11 | 41.26 ± 1.49 | 27.50 ± 1.07 |
| 8 | Spatter + Motion Drift | 73.21 ± 1.34 | 54.16 ± 1.06 | 17.12 ± 1.14 | 10.50 ± 0.74 | 9.65 ± 0.89 | 6.04 ± 0.58 |
| 10 | Low Lighting | 99.55 ± 0.20 | 82.25 ± 0.42 | 99.36 ± 0.24 | 86.15 ± 0.43 | 98.91 ± 0.31 | 86.73 ± 0.41 |
| 11 | Camera Crack | 99.27 ± 0.26 | 81.79 ± 0.43 | 97.18 ± 0.50 | 83.19 ± 0.59 | 91.53 ± 0.84 | 79.45 ± 0.80 |
| 13 | Speckle Noise + Motion Bias (S) | 96.28 ± 0.57 | 75.59 ± 0.62 | 97.27 ± 0.49 | 80.77 ± 0.56 | 96.81 ± 0.53 | 82.11 ± 0.55 |
| 14 | Spatter + Motion Bias (S) | 96.46 ± 0.56 | 76.02 ± 0.62 | 94.99 ± 0.66 | 78.61 ± 0.67 | 96.36 ± 0.57 | 81.29 ± 0.59 |
| 15 | Speckle Noise + Motion Drift | 99.27 ± 0.26 | 77.85 ± 0.41 | 96.17 ± 0.58 | 76.77 ± 0.61 | 88.07 ± 0.98 | 71.39 ± 0.86 |
| 16 | Spatter + Motion Drift | 99.18 ± 0.27 | 77.24 ± 0.44 | 97.36 ± 0.48 | 78.42 ± 0.53 | 88.52 ± 0.96 | 71.87 ± 0.85 |
Table 7: ObjectNav performance (Success / SPL, in %; mean ± standard error) broken down by episode difficulty. The first block corresponds to RGB agents and the second to RGB-D agents.

| # | Corruption | Easy Success | Easy SPL | Medium Success | Medium SPL | Hard Success | Hard SPL |
|---|---|---|---|---|---|---|---|
| 2 | Low Lighting | 22.59 ± 1.65 | 8.50 ± 0.96 | 13.23 ± 0.93 | 5.60 ± 0.46 | 4.75 ± 0.59 | 2.40 ± 0.33 |
| 3 | Camera Crack | 21.65 ± 1.63 | 10.10 ± 1.05 | 5.38 ± 0.62 | 2.72 ± 0.34 | 1.61 ± 0.35 | 1.15 ± 0.26 |
| 5 | Speckle Noise + Motion Bias (S) | 20.56 ± 1.60 | 7.01 ± 0.85 | 9.79 ± 0.81 | 4.89 ± 0.47 | 2.38 ± 0.42 | 1.23 ± 0.23 |
| 6 | Spatter + Motion Bias (S) | 20.56 ± 1.60 | 6.71 ± 0.83 | 7.32 ± 0.71 | 3.35 ± 0.38 | 1.61 ± 0.35 | 0.65 ± 0.15 |
| 7 | Speckle Noise + Motion Drift | 18.69 ± 2.67 | 8.55 ± 1.59 | 7.62 ± 1.26 | 3.86 ± 0.74 | 1.84 ± 0.64 | 0.97 ± 0.36 |
| 8 | Spatter + Motion Drift | 21.50 ± 1.99 | 7.08 ± 1.01 | 6.84 ± 0.85 | 3.18 ± 0.45 | 0.57 ± 0.26 | 0.23 ± 0.11 |
| 10 | Low Lighting | 28.82 ± 1.79 | 10.55 ± 1.04 | 25.41 ± 1.19 | 11.56 ± 0.68 | 18.31 ± 1.07 | 9.68 ± 0.65 |
| 11 | Camera Crack | 35.51 ± 1.89 | 11.53 ± 1.03 | 28.03 ± 1.23 | 13.90 ± 0.73 | 22.45 ± 1.16 | 14.03 ± 0.79 |
| 13 | Speckle Noise + Motion Bias (S) | 22.12 ± 1.64 | 5.93 ± 0.80 | 18.54 ± 1.06 | 8.29 ± 0.58 | 16.40 ± 1.03 | 7.43 ± 0.54 |
| 14 | Spatter + Motion Bias (S) | 27.26 ± 1.76 | 8.66 ± 0.92 | 19.81 ± 1.09 | 8.81 ± 0.60 | 18.93 ± 1.08 | 10.35 ± 0.66 |
| 15 | Speckle Noise + Motion Drift | 22.74 ± 1.66 | 6.35 ± 0.84 | 19.13 ± 1.08 | 7.86 ± 0.56 | 16.86 ± 1.04 | 8.56 ± 0.59 |
| 16 | Spatter + Motion Drift | 25.08 ± 1.71 | 8.16 ± 0.89 | 17.79 ± 1.05 | 8.16 ± 0.57 | 16.48 ± 1.03 | 8.69 ± 0.60 |
Performance Breakdown by Episode Difficulty. In Tables 6 and 7 we break down the performance of PointNav and ObjectNav agents by the difficulty of evaluation episodes (based on shortest path lengths). We report results for a subset of vis, dyn and vis+dyn corruptions (mean across evaluation runs under noisy actuations, wherever applicable). For PointNav RGB agents, we find that while performance is comparable across easy, medium and hard episodes under clean settings, under corruptions navigation performance decreases significantly with increasing episode difficulty – indicating that under corruptions, PointNav RGB agents are more successful at reaching goal locations closer to the spawn location. However, this is not the case for PointNav RGB-D agents, where the drop in performance with increasing episode difficulty is much less pronounced. For ObjectNav RGB agents, we observe that performance (in terms of SR and SPL) drops as episodes become more difficult. For ObjectNav RGB-D agents, although we find comparable SPL across episode difficulties in some cases, the trends are mostly the same – decreasing performance (in terms of SR and SPL) with increasing episode difficulty.
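The difficulty breakdown itself amounts to bucketing episodes by shortest-path length and averaging success within each bucket; a sketch with illustrative cutoffs (the paper's exact thresholds may differ):

```python
def bucket_by_difficulty(episodes, easy_max=3.0, medium_max=6.0):
    """Group evaluation episodes into easy / medium / hard buckets by
    shortest-path length, then average success within each bucket.
    `episodes` is a list of (shortest_path_length, success) pairs."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for length, success in episodes:
        if length <= easy_max:
            key = "easy"
        elif length <= medium_max:
            key = "medium"
        else:
            key = "hard"
        buckets[key].append(float(success))
    # average success per bucket (None for empty buckets)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```
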