Zero-shot object goal visual navigation

by Qianfan Zhao, et al.

Object goal visual navigation is a challenging task that aims to guide a robot to find a target object based only on its visual observations, where the target is limited to the classes specified in the training stage. However, in real households there may exist numerous object classes that the robot needs to deal with, and it is hard for all of these classes to be covered in the training stage. To address this challenge, we propose a zero-shot object navigation task by combining zero-shot learning with object goal visual navigation, which aims to guide robots to find objects belonging to novel classes without any training samples. This task gives rise to the need to generalize the learned policy to novel classes, which is a less-addressed issue in object navigation using deep reinforcement learning. To address this issue, we utilize "class-unrelated" data as input to alleviate overfitting to the classes specified in the training stage. The class-unrelated input consists of detection results and the cosine similarity of word embeddings, and does not contain any class-related visual features or knowledge graphs. Extensive experiments on the AI2-THOR platform show that our model outperforms the baseline models on both seen and unseen classes, which proves that our model is less class-sensitive and generalizes better. Our code is available at



1 Introduction

Object goal visual navigation is an important skill for robots performing real-world tasks: its aim is to guide an agent to an instance of a given target class based on its observations. Learning an effective navigation policy is a complicated problem that involves many fields of robotics, such as visual perception, scene understanding, and motion planning. While researchers have achieved promising object navigation results [article1, article2, article3, article4, article5], this task mainly focuses on the classes specified in the training phase, which are called "seen classes". In real households, there may exist numerous object classes that cannot all be covered in the training stage, which are called "unseen classes". This real-world condition requires robots to be able to handle novel classes.

To address this challenge of real-world applications, zero-shot learning is essential: it aims to successfully process unseen classes with zero training samples. Zero-shot learning has achieved impressive results in image recognition [article11, article12], object detection [article9, article10, article13], and image segmentation [article14, article15]. This line of work mainly uses images and word embeddings to align the visual features and semantic knowledge of seen classes in a common space, and then uses the aligned space to process unseen classes.

Figure 1: Visualization of an example of the Zero-Shot Object Navigation task. The robot learns a navigation policy in training scenes with observations and word embeddings of the specified seen categories. The robot then uses the word embeddings of unseen categories to generalize the navigation policy and successfully finds the target object of the unseen category in testing scenes.

Recently, some researchers have combined zero-shot learning with object navigation and proposed the Zero-Shot Object Navigation task [article6, article39], which aims to guide the agent to navigate to targets of unseen classes. These works directly use a trained CLIP (Contrastive Language-Image Pre-training) model [article7] to process unseen classes. However, CLIP is trained on a large image-caption dataset that may contain image and word-embedding samples of the unseen classes. With this kind of dataset, CLIP may have already learned the semantic knowledge of unseen classes, so this line of work is more consistent with the open vocabulary setting [article8, article37, article38] than with the zero-shot setting [article9, article10, article11].

To make our research more rigorous, we introduce a stricter task setting called Strict Zero-Shot Object Navigation (S-ZSON), which requires navigating the robot to a target belonging to unseen classes in testing scenes while the navigation model cannot learn semantic knowledge of unseen classes through image or word-embedding samples during the training phase. A concise example of the S-ZSON task is illustrated in Figure 1. In the training scenes, an agent is trained with images and word embeddings of seen classes. In the testing scenes, the agent takes the visual observations and the word embedding of the unseen class as input and outputs motion plans to find the unseen target. To complete the task successfully, the agent needs to get close enough to the target and adjust its observation angle until the target is visible. This requires the robot to infer the possible location of the target, whether seen or not, from visual observations and word embeddings.

Furthermore, we propose a zero-shot navigation model based on a deep reinforcement learning (DRL) algorithm and the self-attention mechanism. We believe that overfitting to the seen classes in the training stage is an important factor that prevents models from handling unseen classes. Therefore, we discard the widely used class-related visual features and knowledge graphs [article1, article2, article3, article4, article5], and instead take as input the detection results of the current observation and the cosine similarity of word embeddings between the target and other visible objects, which eliminates most of the information directly related to the class. With this approach, our model avoids overfitting to seen classes and generalizes the learned policy to unseen classes. In addition, we propose a new semantic reward function, which uses the cosine similarity of word embeddings to help the agent learn the navigation policy.

We evaluate our method in the widely used navigation environment AI2-THOR [article1] and re-split the commonly used 22 target classes [article3, article4, article5] into different numbers of seen and unseen classes. In the training stage, the agent learns the navigation policy only through seen classes in training scenes. In the testing stage, the agent deals with both seen and unseen classes in testing scenes. The experimental results confirm our view on the relationship between input and generalization ability: the proposed model outperforms all baseline models on both seen and unseen classes. Furthermore, our model also shows considerable performance under the normal object navigation setting. Our main contributions are summarized as follows:

1. We propose a stricter task setting called Strict Zero-Shot Object Navigation (S-ZSON). The goal of S-ZSON is to navigate the robot to a target belonging to unseen classes. Unlike tasks using CLIP, our task does not allow image or word-embedding samples of unseen classes to appear during the training phase.

2. We propose a novel zero-shot object navigation model. This model is based on a DRL algorithm and uses class-unrelated data as input, which allows it to generalize the learned policy to unseen classes.

3. We propose a novel semantic reward. The proposed reward uses the cosine similarity of word embeddings between the target and other visible objects to guide the robot in learning the zero-shot object navigation policy.

2 Related Work

This study is related to visual navigation, zero-shot learning, and open vocabulary learning, which are briefly discussed below.

2.1 Visual Navigation

As an important robotic task, visual navigation has attracted attention for a long time, and after years of study a large body of work exists, which we briefly review. Traditional navigation methods typically rely on offline or online maps [article16, article17, article18, article19, article20, article21] built through Simultaneous Localization and Mapping (SLAM) techniques. They treat navigation as an obstacle avoidance problem and focus on the path-planning algorithm. This kind of work is only suitable for basic navigation applications and cannot perform complex tasks.

Recently, with the development of deep reinforcement learning, more advanced tasks have been proposed in the field of visual navigation. According to their input and target types, recent visual navigation tasks can be divided into point goal visual navigation [article22, article23, article24], object goal visual navigation [article1, article2, article3, article4, article5], and vision-language navigation [article25, article26]. Since our work concerns object goal visual navigation, we only introduce related work in that area. Object goal visual navigation refers to the task in which the robot must learn a navigation policy to find an instance of a specified target class while avoiding collisions with obstacles. Scene-prior [article2] uses a knowledge graph to extract semantic priors and object relationships to navigate the agent; it uses a Graph Convolutional Network (GCN) [article28] to extract prior knowledge from the Visual Genome dataset [article27]. MJOLNIR [article4] uses a hierarchical object relationship reward, a context matrix, and a GCN to learn a navigation policy. VTNet [article5] uses a transformer to extract relationships among objects and establishes strong connections with the navigation policy. The target-driven work [article1] uses an image of the target object as input to train a navigation policy and builds the widely used AI2-THOR framework, which provides realistic 3D scenes and a physics engine. Although this line of work achieves promising results in visual navigation, it is limited to the classes specified in the training stage: once these models encounter targets belonging to novel classes in the testing stage, it is hard for them to complete the navigation task. Our work focuses on the zero-shot setting and achieves better results when dealing with unseen targets in test scenes.

2.2 Zero-Shot Learning

Zero-shot learning aims to use word embeddings (Word2vec [article30] or GloVe [article29]) to handle unseen classes. In the early years, zero-shot learning research mainly focused on the classification problem [article11, article12]. With the emergence and development of other computer vision tasks, zero-shot learning has also been applied in other fields such as object detection [article9, article10, article13] and image segmentation [article14, article15]. Zero-shot learning methods fall into two main categories: projection methods and generation methods. Projection methods project the visual features and word embeddings of seen classes into a common space and align them by category [article9, article12, article13, article14, article15]; when dealing with unseen classes, their word embeddings can be used to infer their visual features in the common space. Generation methods [article10, article11] use the word embeddings and visual features of seen classes to train a generative model such as Generative Adversarial Networks (GAN) or Conditional Variational Autoencoders (CVAE) [article32]. The generative model can then generate samples of unseen classes from their word embeddings, and the generated samples can be used to finetune the visual model. Our work is closer to the projection methods, but we do not use word embeddings directly; instead, we use the cosine similarity of the word embeddings between the target and other categories to generalize the navigation policy to unseen classes.

2.3 Open Vocabulary Learning

In open vocabulary learning, a model is first trained on a very large image-caption dataset and then uses the knowledge learned from that dataset to perform downstream tasks [article8, article37, article38]. The CLIP-based work [article6] uses the trained CLIP model [article7] to directly extract semantic similarity from observations. It is worth noting that CLIP is trained on a large image-caption dataset [article7] that may contain image and word-embedding samples of unseen classes. Therefore, this line of work is more similar to the open vocabulary setting than to zero-shot learning. Nevertheless, we still conduct a (non-rigorous) comparison experiment with the CLIP-based work.

3 Zero-Shot Object Navigation

3.1 Task Definition

Consider a set of 'seen' target classes denoted by $\mathcal{C}_s = \{c_1, \dots, c_{N_s}\}$, which are available during the training process, where $N_s$ stands for the total number of seen classes. Consider another set of 'unseen' target classes denoted by $\mathcal{C}_u$, which are only available during the testing process, and a set of 'irrelevant' classes denoted by $\mathcal{C}_i$, which exist in training and testing scenes but are never selected as targets. In addition, we denote the set of all target classes by $\mathcal{C}_a = \mathcal{C}_s \cup \mathcal{C}_u$. The set of all training scenes is denoted by $\mathcal{S}_{train}$ and the set of test scenes by $\mathcal{S}_{test}$. We also provide word embeddings (e.g., GloVe [article29] or Word2vec [article30]): $E_s$ for seen classes, $E_u$ for unseen classes, and $E_i$ for irrelevant classes.

In the training phase, the DRL-based agent is trained in the training scenes $\mathcal{S}_{train}$ with the seen target classes $\mathcal{C}_s$. The agent is given visual observations of training scenes and the word embedding of the target to learn a navigation policy $\pi$. During the testing phase, the agent is given visual observations of test scenes and the word embedding of a target belonging to any of the target classes $\mathcal{C}_a$. When dealing with a target of an unseen class in $\mathcal{C}_u$, the agent needs to generalize the learned navigation policy to that target using its word embedding in $E_u$. As in most object navigation works [article1, article2, article3, article4, article5], a zero-shot object navigation episode is considered successful when the target object is visible in the current observation and within a distance threshold (1.5 m).
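The success criterion above can be made concrete with a small sketch. The helper name and position representation here are our own illustration; only the visibility requirement and the 1.5 m threshold come from the task definition.

```python
import math

SUCCESS_DISTANCE = 1.5  # metres, per the task definition

def episode_success(target_visible: bool, agent_pos, target_pos) -> bool:
    """An episode succeeds when the target is visible in the current
    observation AND the agent is within the distance threshold."""
    dist = math.dist(agent_pos, target_pos)
    return target_visible and dist <= SUCCESS_DISTANCE

# Seen/unseen split: the policy is trained only on seen-class targets,
# but evaluated on both (unseen targets are reached via word embeddings).
seen = {"Mug", "Vase", "Bread"}      # illustrative subset
unseen = {"Toaster", "Pillow"}       # illustrative subset
assert seen.isdisjoint(unseen)

assert episode_success(True, (0.0, 0.0), (1.0, 1.0))       # ~1.41 m away
assert not episode_success(True, (0.0, 0.0), (2.0, 0.0))   # too far
assert not episode_success(False, (0.0, 0.0), (0.5, 0.0))  # not visible
```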

3.2 Model Architecture

The architecture of our model is shown in Figure 2. To avoid overfitting to seen classes in the training stage, we discard commonly used visual features and knowledge graphs [article1, article2, article4], and only use a highly abstract detection-results matrix and the cosine similarity of word embeddings as input. Inspired by [article4], the detection matrix contains detection results of seen classes and irrelevant classes in the current observation. (Because our work mainly focuses on policy learning, we directly use the ground-truth detection results as a perfect detector, the same as other papers [article3, article4, article33].) Each row of the detection matrix can be represented as $d_j = [b_j, x_j, y_j, s_j]$ for class $j$. The first element $b_j$ is binary: if an object of class $j$ is visible in the current observation, $b_j = 1$; otherwise $b_j = 0$. The second and third elements $(x_j, y_j)$ are the center coordinates of the detection bounding box of class $j$, and the last element $s_j$ represents the area of the bounding box. In addition, there is an embedding matrix containing the word embeddings of seen classes and irrelevant classes, obtained from GloVe [article29]. The embedding matrix is used to calculate the cosine similarity (CS) with the word embedding of the target:

$$\mathrm{CS}_j = \frac{e_j \cdot e_t}{\lVert e_j \rVert \, \lVert e_t \rVert},$$

where $e_j$ denotes the word embedding of class $j$ and $e_t$ denotes the word embedding of the target. Finally, the CS values are concatenated with the detection matrix as input to the subsequent modules. As this description shows, our input rarely contains information directly related to the class: most of it consists of similarity values and detection bounding-box parameters, which are class-unrelated.
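A minimal sketch of building this class-unrelated input follows. The matrix shapes, the 300-dimensional GloVe vectors, and the function names are our assumptions for illustration; only the row layout [visible, cx, cy, area] and the concatenation with cosine similarity come from the text above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_input(detections: np.ndarray, embeddings: np.ndarray,
                target_embedding: np.ndarray) -> np.ndarray:
    """detections: (K, 4) rows [visible, cx, cy, area] for K classes;
    embeddings: (K, D) word vectors for seen + irrelevant classes.
    Returns the (K, 5) class-unrelated input matrix."""
    cs = np.array([[cosine_similarity(e, target_embedding)]
                   for e in embeddings])
    return np.concatenate([detections, cs], axis=1)

rng = np.random.default_rng(0)
det = rng.random((22, 4))                # 22 classes, dummy detections
emb = rng.standard_normal((22, 300))     # 300-d embeddings (assumption)
x = build_input(det, emb, emb[3])        # target shares class 3's embedding
assert x.shape == (22, 5)
assert abs(x[3, 4] - 1.0) < 1e-6         # self-similarity is 1
```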

Figure 2:

Visualization of the proposed model architecture. The inputs of the model are only the detection results and the cosine similarity of word embeddings. The input is split along the category axis and processed by the self-attention module. The outputs of the self-attention module are concatenated into a 1-D vector; an LSTM module then extracts and stores information from previous actions, and an A3C module generates actions.

After building the matrix, we introduce a self-attention module and use the concatenated matrix, split along the class axis, as its input. The self-attention module adaptively learns the relationships between the classes according to their CS values and detection results, and uses the learned attention parameters to fuse and output features for each class. We then concatenate the output features into a 1-D vector and use a Long Short-Term Memory (LSTM) network [article34] to extract and store useful information from previous and current states. After the LSTM network, we adopt the A3C algorithm [article35] to learn the visual navigation policy and output motion plans.

After sufficient training, our model learns the relationships between different classes that facilitate the navigation policy. In the testing phase, the testing scenes contain both seen and unseen classes that can be selected as the target. When dealing with an unseen target, our model only needs the word embedding of the target to calculate its cosine similarity with the embedding matrix, and it can then generalize the learned navigation policy to the unseen class without any other measures.
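The per-class fusion performed by the self-attention module can be sketched as follows. The single-head formulation, the weight shapes, and the 16-dimensional output are our assumptions, not the authors' exact configuration; the sketch only illustrates how attention over class rows produces fused per-class features that are flattened for the LSTM.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """x: (K, F) per-class input rows. Returns (K, H) fused features."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (K, K) class-to-class weights
    return scores @ v

rng = np.random.default_rng(1)
x = rng.random((22, 5))                       # 22 classes x 5 input features
wq, wk, wv = (rng.standard_normal((5, 16)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
assert out.shape == (22, 16)
flat = out.reshape(-1)                        # 1-D vector fed to LSTM / A3C head
assert flat.shape == (22 * 16,)
```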

3.3 Learning Set-Up

After introducing the model architecture, we will describe the reinforcement learning settings: action space, observations, and reward.

Action space : We use the same discrete action space as other papers [article1, article2, article4, article5] when simulating on a virtual platform (AI2-THOR). The discrete action space is {MoveAhead, RotateLeft, RotateRight, LookUp, LookDown, Done}. The MoveAhead action moves the robot forward 0.25 meters. The RotateLeft and RotateRight actions rotate the robot by 45 degrees. The LookUp and LookDown actions tilt the camera up or down by 30 degrees. The Done action signals that the robot believes it has found the target, ending the episode.
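Using AI2-THOR-style action names (the exact strings here are our assumption; the original only describes the effects), the action space can be enumerated as:

```python
# Discrete action space and the effect each action has on the agent.
ACTIONS = {
    "MoveAhead": "translate forward 0.25 m",
    "RotateLeft": "rotate 45 degrees counter-clockwise",
    "RotateRight": "rotate 45 degrees clockwise",
    "LookUp": "tilt camera up 30 degrees",
    "LookDown": "tilt camera down 30 degrees",
    "Done": "declare the target found and end the episode",
}
assert len(ACTIONS) == 6
assert "Done" in ACTIONS  # the only episode-terminating action
```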

Observations : The robot in the AI2-THOR framework takes RGB images as observations.

Reward : We propose a novel semantic reward that uses the cosine similarity of word embeddings to learn the navigation policy. The reward value is obtained by calculating the cosine similarity of the word embeddings between the visible objects and the target; if multiple objects are visible in the current observation, we take the maximum cosine similarity value as the reward. This encourages the agent to find objects that are semantically similar to the target. Since the entire model is trained end-to-end, the reward propagates back to the self-attention module and guides it to learn the correct attention parameters between the objects. In addition, the current similarity value is used as a reward only when it is greater than the last similarity value, which encourages the robot to keep finding objects with higher semantic similarity until the target is located. Finally, if no objects are visible in the current observation, the robot receives a reward of -0.01 as a penalty to reduce the trajectory length. The reward calculation process is summarized in Algorithm 1 (in the supplementary material).
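A sketch of this semantic reward for a non-terminal step follows. The -0.01 penalty comes from the text; the function signature and the bookkeeping of the "last" similarity value are our own simplification of Algorithm 1, and the behavior when objects are visible but no improvement occurs (fall back to the penalty) is an assumption.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_reward(visible_embeddings, target_embedding, last_cs):
    """Returns (reward, updated last_cs) for a non-terminal step."""
    if not visible_embeddings:
        return -0.01, last_cs          # nothing visible: path-length penalty
    # Max cosine similarity between any visible object and the target.
    best = max(cosine(e, target_embedding) for e in visible_embeddings)
    if best > last_cs:
        return best, best              # reward progress toward semantically
    return -0.01, last_cs              # similar objects; otherwise penalize

e_target = np.array([1.0, 0.0])
r, cs = semantic_reward([np.array([0.6, 0.8])], e_target, last_cs=0.0)
assert abs(r - 0.6) < 1e-9             # cosine([0.6, 0.8], [1, 0]) = 0.6
r2, cs2 = semantic_reward([np.array([0.6, 0.8])], e_target, last_cs=cs)
assert r2 == -0.01 and cs2 == cs       # no improvement: penalty, cs unchanged
```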

4 Experiment and Result

4.1 Experiment Setting

We use the AI2-THOR embodied AI environment as the platform for the zero-shot object navigation task. This environment contains 120 photo-realistic floorplans covering 4 room types: Kitchen, Livingroom, Bedroom, and Bathroom. Each room contains a number of objects that the agent can observe and interact with. Like other papers [article3, article4, article5], we use the first 20 rooms as training scenes and the remaining 10 rooms as testing scenes. We re-split the widely used 22 target classes [article3, article4, article5] into 18/4 and 14/8 seen/unseen splits, and evaluate models on both class splits.

4.2 Baseline Models

Random : the robot randomly selects actions from its action space.

MJOLNIR : we train this off-the-shelf object navigation model [article4] under the zero-shot object navigation setting. It uses a GCN stream and an observation stream to extract context information for learning the navigation policy.

Zero-Shot Baseline (ZS-Baseline) : we build another zero-shot navigation baseline model; its architecture is shown in Figure 3 (in the supplementary material). We directly use a pre-trained ResNet to extract a 1-D visual feature from the current observation and concatenate it with the word embedding of the target class as the input of the policy network. The policy network is composed of an LSTM and A3C, similar to the proposed model in Section 3.2.

4.3 Metrics

To facilitate fair comparison with the baseline models, we use the evaluation metrics proposed by [article36], which are also widely used in other object navigation algorithms [article3, article4, article5]: Success Rate (SR) and Success weighted by Path Length (SPL). The SR is defined as $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} S_i$, and the SPL is defined as $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{L_i}{\max(P_i, L_i)}$, where $N$ is the number of episodes, $S_i$ is a binary indicator of the success of the $i$-th episode, $P_i$ is the path length of the agent in the episode, and $L_i$ is the length of the optimal path to the target. We divide the paths planned by the model into two categories: paths of length at least 1 (L>=1) and paths of length at least 5 (L>=5). Consistent with most algorithms [article3, article4, article5], we evaluate the two categories separately.
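Computing SR and SPL over a batch of evaluation episodes can be sketched as below; the dictionary field names are illustrative, while the formulas follow the standard definitions of [article36] given above.

```python
def sr_spl(episodes):
    """episodes: list of dicts with 'success' (0/1), 'path' (agent path
    length P_i), and 'optimal' (shortest path length L_i)."""
    n = len(episodes)
    sr = sum(ep["success"] for ep in episodes) / n
    spl = sum(
        ep["success"] * ep["optimal"] / max(ep["path"], ep["optimal"])
        for ep in episodes
    ) / n
    return sr, spl

eps = [
    {"success": 1, "path": 10.0, "optimal": 5.0},  # succeeded, 2x optimal
    {"success": 0, "path": 8.0, "optimal": 4.0},   # failed
]
sr, spl = sr_spl(eps)
assert sr == 0.5
assert spl == 0.25   # (1 * 5/10 + 0) / 2
```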

4.4 Implementation Details

We build our models on the open-source code of [article4, article3]. All models were trained for 900,000 episodes on offline data from AI2-THOR [article1] with a learning rate of 0.0001. During evaluation, we run 250 episodes for each room type; in each evaluation episode, the floorplan, initial position, and target are randomly chosen.

4.5 Experimental Results

Table 1 shows the performance of the baseline models and ours in testing scenes under different seen/unseen class splits. The testing scenes contain rooms that did not appear during the training phase, so the locations of objects in the testing scenes are completely unknown. Our model outperforms all baseline models on both unseen and seen classes. These results show that using the detection results and the cosine similarity of word embeddings allows the model to better generalize the navigation policy to unseen classes, and that the self-attention module better extracts information useful for the navigation policy. Our model remains effective under the different split settings, although the success rate decreases as the number of unseen classes increases: as the number of seen classes shrinks, it becomes difficult for the model to learn the relationship between semantic knowledge and the navigation policy. (Examples of paths planned by our model can be found in the supplementary material.)

Figure 4 (in the supplementary material) shows the success rate over training for some baseline models and ours. Our model achieves a better success rate as the number of episodes grows, and it also learns faster than the other models, as shown by the faster growth of its success rate. It is worth noting that the baseline models achieve their best performance at the beginning of the training stage and gradually degrade in subsequent training, while the performance of our model grows steadily. We believe this phenomenon is caused by overfitting from end-to-end learning on redundant inputs, and it again indicates that using abstract input reduces overfitting and enhances generalization.

| Model | Split | Unseen L>=1 SR / SPL (%) | Unseen L>=5 SR / SPL (%) | Seen L>=1 SR / SPL (%) | Seen L>=5 SR / SPL (%) |
|---|---|---|---|---|---|
| Random | 18/4 | 10.8 / 2.1 | 0.9 / 0.3 | 9.5 / 3.3 | 1.0 / 0.4 |
| ZS-Baseline | 18/4 | 16.9 / 8.7 | 5.3 / 3.1 | 17.7 / 8.3 | 5.3 / 2.6 |
| MJOLNIR [article4] | 18/4 | 20.7 / 7.1 | 10.6 / 4.5 | 51.9 / 16.5 | 33.0 / 14.2 |
| Ours | 18/4 | 28.6 / 9.0 | 12.5 / 5.6 | 59.0 / 19.7 | 38.6 / 18.3 |
| Random | 14/8 | 8.2 / 3.5 | 0.5 / 0.1 | 8.9 / 3.0 | 0.5 / 0.3 |
| ZS-Baseline | 14/8 | 14.6 / 4.9 | 4.9 / 2.8 | 30.4 / 9.7 | 11.5 / 5.2 |
| MJOLNIR [article4] | 14/8 | 12.3 / 5.1 | 6.0 / 3.6 | 52.7 / 22.3 | 26.8 / 14.9 |
| Ours | 14/8 | 21.5 / 7.0 | 13.0 / 6.7 | 59.3 / 24.5 | 35.2 / 19.3 |

Table 1: Performances of the baseline models and ours in testing scenes.

Table 2 shows the performance of the models under the normal object navigation setting. Our method remains effective and outperforms the baseline models.

Moreover, we compare our model with a CLIP-based zero-shot object navigation model [article6], which we consider closer to the open vocabulary setting. This comparison is not rigorous and serves only as a simple contrast: the two models use different experiment settings, different seen classes, and different unseen classes, although the number of unseen classes is the same. Table 3 shows the performance of the CLIP-based model and ours in testing scenes. On both seen and unseen classes, our model outperforms the CLIP-based model without requiring a large image-caption dataset for training.

| Model | L>=1 SR / SPL (%) | L>=5 SR / SPL (%) |
|---|---|---|
| ZS-Baseline | 18.7 / 7.9 | 4.3 / 2.3 |
| MJOLNIR [article4] | 58.9 / 17.0 | 39.1 / 16.0 |
| Ours | 63.7 / 22.8 | 42.9 / 21.3 |

Table 2: Performances of the models under the normal setting.

| Model | Unseen SR / SPL (%) | Seen SR / SPL (%) |
|---|---|---|
| CLIP-Based | 8.1 / – | 17.0 / – |
| Ours | 28.6 / 9.0 | 59.0 / 19.7 |

Table 3: Performances of the CLIP-based model and our model.

4.6 Experimental Limitation

In this paper, we only conduct experiments on an embodied AI platform and do not verify our approach on a physical robot. The room layouts in AI2-THOR are relatively simple, without complex multi-room suites, whereas real houses have very complicated layouts and require a more robust policy to guide robots. Future work will explore the zero-shot object navigation task in more complex room layouts and evaluate our policy on a physical robot.

5 Conclusion and Future Work

In this paper, we introduce the Strict Zero-Shot Object Navigation task setting, which aims to navigate the robot to a novel target in testing scenes while the navigation model cannot learn semantic knowledge of unseen classes through image or word-embedding samples in the training stage. We also propose a novel zero-shot object navigation model based on a DRL algorithm and a self-attention module. Our model utilizes class-unrelated data as input to alleviate overfitting to seen classes. Extensive experiments show that our model successfully generalizes the learned navigation policy to unseen classes and significantly outperforms the baseline models on both seen and unseen classes.

Since the input of our model is highly abstract, we expect it to have good sim-to-real transfer ability. In future work, we will apply our model to a physical robot platform and build a digital-twin environment for sim-to-real transfer experiments. Furthermore, we will combine our model with lifelong learning methods to explore the lifelong learning capabilities of robots.


6 Supplementary Material

6.1 Semantic Reward Algorithm

Input: state, action, target, objects, CS_last, SeenClasses, IrrelevantClasses
Output: reward

function S-Reward(state, action, target, objects, CS_last, SeenClasses, IrrelevantClasses)
    if action = Done then
        if target is visible then
            reward ← R_success                ▷ The reward value for the success case
        else
            reward ← R_failure                ▷ The reward value for the failure case
        end if
        return reward
    else
        reward ← −0.01                        ▷ The reward value for the path-length penalty
        for each object o in SeenClasses ∪ IrrelevantClasses do   ▷ Select every object belonging to these classes
            if o is visible then
                CS ← cosine similarity of o and the target        ▷ Use formula 1 from Section 3.2
                if CS > CS_last then
                    reward ← CS; CS_last ← CS                     ▷ Encourage the agent to find objects with greater CS
                end if
            end if
        end for
        return reward
    end if
end function
Algorithm 1 Semantic Reward

6.2 Class Split

18 seen classes: "Spatula", "Bread", "Mug", "CoffeeMachine", "Apple", "Painting", "RemoteControl", "Vase", "ArmChair", "Laptop", "Blinds", "DeskLamp", "CD", "AlarmClock", "SoapBar", "Towel", "SprayBottle", "ToiletPaper"

4 unseen classes: "Toaster", "Laptop", "Pillow", "ToiletPaper"

14 seen classes: "Spatula", "Bread", "Mug", "CoffeeMachine", "Painting", "RemoteControl", "Vase", "ArmChair", "Blinds", "DeskLamp", "CD", "SoapBar", "Towel", "SprayBottle"

8 unseen classes: "Apple", "Toaster", "Television", "Laptop", "AlarmClock", "Pillow", "ToiletPaper", "Mirror"

6.3 Architecture of Baseline Model

Figure 3: The visualization of the ZS-Baseline model architecture.

6.4 Success Rate

Figure 4: The visualization of the success rate (L>=1) of different models under different splits.

6.5 Path Samples Visualization

Figure 5: The room type is living room and the target is laptop.
Figure 6: The room type is kitchen and the target is toaster.
Figure 7: The room type is bedroom and the target is pillow.