Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

11/20/2022
by Chia-Wen Kuo et al.

In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning it on the environments in which the agent will be trained or tested. However, the distribution shift between the ImageNet training images and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage data from the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Rigorous ablations show that our SEA pre-trained features better encode the structural information of the scenes, which ImageNet pre-trained features fail to encode properly but which is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, the VLN agents combined with our SEA pre-trained features achieve absolute success rate improvement of 12 Env-Dropout, and 4
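To make the idea of a jigsaw-style auxiliary task on navigation views more concrete, the following is a minimal sketch of how training labels could be generated from a single panorama. It assumes the 12-heading × 3-elevation (36-view) discretization commonly used by R2R-style VLN agents; the names `make_jigsaw_example`, `view_index`, and the `num_shuffled` parameter are hypothetical, and the paper's actual 3D jigsaw formulation may differ. Only label construction is shown, not the image encoder or its loss.

```python
import random
from typing import List, Tuple

# Assumption (not from the paper): a panorama split into 12 headings x 3 elevations,
# the 36-view layout commonly used by R2R-style VLN agents.
NUM_HEADINGS = 12
NUM_ELEVATIONS = 3
NUM_VIEWS = NUM_HEADINGS * NUM_ELEVATIONS


def view_index(heading_bin: int, elevation_bin: int) -> int:
    """Map a (heading, elevation) bin pair to a flat view index in [0, NUM_VIEWS)."""
    return elevation_bin * NUM_HEADINGS + heading_bin


def make_jigsaw_example(
    views: List[str], num_shuffled: int, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Hypothetical jigsaw label generator.

    Draws `num_shuffled` distinct view positions in random order and returns the
    corresponding views together with their original flat positions. A model would
    be trained to recover targets[i] (a 36-way classification) from shuffled_views[i]
    and its context, forcing it to encode the spatial layout of the scene.
    """
    if len(views) != NUM_VIEWS:
        raise ValueError(f"expected {NUM_VIEWS} views, got {len(views)}")
    # random.Random.sample returns distinct elements in random order,
    # so the returned positions already constitute the shuffle.
    positions = rng.sample(range(NUM_VIEWS), num_shuffled)
    shuffled_views = [views[p] for p in positions]
    return shuffled_views, positions


if __name__ == "__main__":
    rng = random.Random(0)
    views = [f"view_{i}" for i in range(NUM_VIEWS)]  # placeholders for image crops
    shuffled, targets = make_jigsaw_example(views, num_shuffled=4, rng=rng)
    for view, target in zip(shuffled, targets):
        # Split the flat target into (elevation, heading) bins, e.g. for two classifier heads.
        elevation_bin, heading_bin = divmod(target, NUM_HEADINGS)
        print(view, "->", target, (elevation_bin, heading_bin))
```

Splitting the flat target with `divmod` is one possible design choice: it lets a model predict heading and elevation with separate heads instead of a single 36-way output.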
