Exploiting Scene-specific Features for Object Goal Navigation

08/21/2020 ∙ by Tommaso Campari, et al. ∙ Università di Padova

Can the intrinsic relation between an object and the room in which it is usually located help agents in the Visual Navigation Task? We study this question in the context of Object Navigation, a problem in which an agent has to reach an object of a specific class while moving in a complex domestic environment. In this paper, we introduce a new reduced dataset that speeds up the training of navigation models, a notoriously complex task. Our proposed dataset permits the training of models that do not exploit online-built maps in reasonable times even without the use of huge computational resources. Therefore, this reduced dataset guarantees a significant benchmark and it can be used to identify promising models that could be then tried on bigger and more challenging datasets. Subsequently, we propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and objects contained in them, highlighting quantitatively how the idea is correct.




1 Introduction

Visual Navigation is a trending topic in the Computer Vision research community. This growth in interest is undoubtedly due to the important practical implications that the development of agents capable of moving in complex environments can have on our society. For example, in the near future we will be able to ask robotic assistants to perform the most disparate tasks in our homes. Before we can ask a robot to take something out of the refrigerator, however, we need to make sure that it is able to find the refrigerator and get to it while avoiding the complex tangle of obstacles that a domestic environment can contain. For this reason, this work focuses on the Object Navigation task, defined in [1] as the search for objects belonging to a specific class by a robotic agent. For humans, this is a very simple task whatever the object to be found is. A human can build a mental link between the object and the room where it is most likely to be found. In this way a human is able to simplify the problem by first searching for the room and then for the required object inside the room. For example, when looking for a sink, we unconsciously search first for the kitchen or the bathroom, and then for a sink inside them. To implement this intuition we have developed an attention-based [28] policy that exploits a joint representation integrating visual information extracted with a scene classifier and an encoding of the semantic goal to be searched. This representation yields a significant increase in performance compared to the other models taken into consideration.

Several works tried to tackle Navigation through Learning models [7][6]. In particular, [7] leveraged depth images to construct semantic maps of the environment in an online fashion. From these maps they tried to maximize the exploration of the scene. To do this, they placed intermediate subgoals in unexplored areas of the map that the agent was encouraged to reach through planning.
This type of approach inevitably tends to lengthen the agent's paths, at least until the object sought is clearly visible, since maximizing exploration requires a significant number of moves by the agent. For the simpler PointGoal Navigation [1] task, proposed in the “Habitat Challenge 2019” (https://aihabitat.org/challenge/2019/), a simple model [29] based on LSTM was able to perform better than competitors based on more complex architectures that exploit map creation [11][8]. This was made possible by the DD-PPO algorithm [29], a distributed version of the PPO [25] Reinforcement Learning algorithm capable of massively parallelizing learning. In fact, they used 64 NVIDIA V100 GPUs for 3 consecutive days of training. In other words, the model was trained for the equivalent of about 180 days in a single GPU setting. Not all researchers, however, have access to such massive hardware resources. For this reason, we have generated a reduced version of the dataset produced for the “Habitat Challenge 2020” (https://aihabitat.org/challenge/2020/) for the Object Navigation task that allows the training of Deep Reinforcement Learning models in a few hours. In this way, even with few computational resources, it is possible to train complex models that would take days on the original dataset.
Furthermore, in [10] it was pointed out that the use of Recurrent Neural Networks is not recommended in Navigation tasks. These structures usually treat the recent past as more important while gradually forgetting the remote past. On the contrary, by exploiting the principle of attention, described in [28], it is possible to record all the past observations in a memory from which to extract information on every single step that the agent has taken, mitigating this highly penalizing intrinsic aspect of the behavior of Recurrent Neural Networks.

Summarizing, our contribution is twofold:

  • We propose a new reduced dataset for the Object Navigation task, extracted from the one proposed in the “Habitat Challenge 2020”. It makes it possible to test algorithms that would otherwise require a lot of resources for training, and it preserves as far as possible the main characteristics of the original dataset;

  • We propose the SMTSC model, an attention-based policy for Object Navigation that is able to exploit, starting from RGB images only, the idea mentioned above, namely that there exists a correlation that binds objects to specific rooms. This intuition improves performance, as demonstrated by the results of a preliminary study performed on the aforementioned dataset.

2 Related Works

Visual Navigation is an increasingly central topic within Computer Vision. However, Visual Navigation involves several different sub-problems and, in this section, we summarize the most relevant works related to Simulators and 3D Datasets for Visual Navigation and to other research areas connected to Visual Navigation.

2.1 Simulators and 3D Datasets for Visual Navigation

In recent years a large number of different simulators have been developed. GibsonEnv [32][33] and AI2Thor [14] both allow simulating multi-agent situations and interacting with objects, for example lifting them, pushing them, etc. Matterport3DSimulator [2] can provide the agent with photorealistic images extracted from Matterport3D [5] and is mainly used for the Room2Room Navigation problem. HabitatAI [23], instead, provides support for 3D datasets such as Gibson, Matterport3D and Replica [27].

2.2 Visual Navigation

Thanks also to the new possibilities offered by these simulators, numerous tasks are available today, as pointed out in [1]. Common Navigation tasks are mainly divided into two categories: those that require active exploration of the environment and those that, on the other hand, provide tools that can signal, for example via GPS sensors, the direction to be taken to reach the required goal.
In Classical Navigation, there are numerous approaches that perform path planning on explicit maps [13][4][16].
More recently, however, approaches based on Reinforcement Learning have been presented through policies based on Recurrent Neural Networks [17][15][22][10][20]. Mirowski et al. [17] define an approach that jointly learns the goal-driven Reinforcement Learning problem with auxiliary depth prediction and loop closure classification tasks by exploiting the A3C algorithm [18]. Mousavian et al. [20] propose a Deep Reinforcement Learning framework that uses an LSTM-based policy for Semantic Target Driven Navigation. However, when LSTMs have to analyze very long data sequences they tend to focus on the most recent observations, giving less importance to the earliest ones. On the contrary, Fang et al. [10] propose the Scene Memory Transformer, a policy based on attention [28] that is able to exploit even the least recent steps performed by the agent. In this case, the policy is trained through the Deep Q-Learning algorithm [19]. Starting from the work done in [35], Sax et al. [24] show that using Mid-Level Vision results in policies that learn faster and generalize better than policies learned from scratch. The Mid-Level model achieved high results in the PointGoal Navigation task. In [29], a Reinforcement Learning algorithm that scales to multiple GPUs and solves the PointGoal Navigation task almost perfectly has been presented. This solution, in particular, shows how Visual Navigation is a really complex task that requires an impressive amount of resources. In fact, their training was conducted on 64 GPUs for 3 days. Unfortunately, these resources are not within the reach of the whole scientific community and training similar models remains almost prohibitive for most researchers.
For Target Driven Navigation some works have recently been presented, such as [30][34][31]. Wu et al. [31] construct a probabilistic graphical model over the semantic information to explore structural similarities between environments. Yang et al. [34], instead, propose a Deep Reinforcement Learning model that exploits the relationships between objects, encoded through a Graph Convolutional Network, to incorporate semantic priors within the policy. Chaplot et al. [7] propose a model for ObjectGoal Navigation that constructs, during the exploration, a map with the semantic information of the scene extracted through a semantic segmentation model. From the generated map a long-term goal is selected to maximize exploration, through a policy trained with Reinforcement Learning. When the searched object becomes visible, it is set as the new long-term goal. The actions of the agent are selected through the FastMarching algorithm [26].

3 Dataset

Previous works [29] have been able to achieve excellent results on the PointGoal Navigation task [1], a simpler Navigation problem that doesn't require Semantic capabilities, while using an architecture that didn't include complex components such as occupancy maps [9]. However, they leveraged massive parallelism using hardware resources that are inaccessible to most institutions. We investigate the possibility of solving the same problem on a reduced dataset with a minimal set of computational and time resources. For this reason we decided to concentrate on a subset of the Matterport3D Dataset [5]. We argue that this subset still offers significant results, as its statistical indicators are similar to those of the original Matterport3D.
The subset was extracted by restricting the problem to 5 out of 21 objects: Chair, Cushion, Table, Cabinet and Sink. These objects are among the most frequent in the original set, as shown in Figure 1. We decided to include the Sink object as it is a characterising element of the bathroom. This increases the diversity of the domestic environments represented by our proposed Dataset.
Regarding the distribution of objects, one of the main issues in the evaluation of Object Navigation agents on Matterport3D is the Long Tail Distribution, which limits the number of instances of infrequent objects that are seen during training. This, combined with metrics that ignore the precision on single classes, may undermine the development of agents with truly semantic capabilities.

Figure 1: Distribution of objects in the Matterport3D Dataset.

Furthermore, we decided to extract our new dataset (from here on we refer to it as Small MP3D) using 6 out of 56 scenes of the official Training split. These scenes are r47D5H71a5s, i5noydFURQK, ZMojNkEp431, jh4fc5c5qoQ, HxpKQynjfin and GdvgFV5R1Z5. For each object in each scene we extracted 100 episodes for the training split and 20 for both validation and test. In total, Small MP3D contains 3000 training episodes and 600 episodes for both validation and test. To assess the ability to generalize to unseen scenes, we also generated an Unseen Test Set and an Unseen Validation Set, extracted from the D7N2EKCX4Sj and aayBHfsNo7d scenes, each with 100 episodes (10 episodes for each object class in each scene).
The characteristics of the new dataset are shown in Table 1. We report the average values of Euclidean and Geodesic Distance as well as the number of Steps required to complete the episodes. Overall, the complexity of the training split of the proposed reduced dataset is lower than that of the original data, as both the Geodesic distance and the required Number of Steps are lower. However, the complexity of the Unseen Test split is significantly higher, with 30% more required steps on average. We think that this additional complexity can guarantee a fair and meaningful benchmark.
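The split sizes quoted above follow directly from the number of scenes, object classes and episodes per object. The short sketch below only re-derives that arithmetic; the helper `count_episodes` and the dictionary keys are hypothetical names introduced for illustration and are not part of the released dataset.

```python
# Scene identifiers and object classes as listed in the text.
SEEN_SCENES = ["r47D5H71a5s", "i5noydFURQK", "ZMojNkEp431",
               "jh4fc5c5qoQ", "HxpKQynjfin", "GdvgFV5R1Z5"]
UNSEEN_SCENES = ["D7N2EKCX4Sj", "aayBHfsNo7d"]
OBJECTS = ["chair", "cushion", "table", "cabinet", "sink"]


def count_episodes(scenes, objects, episodes_per_object):
    """Total episodes when every object class gets the same number of episodes in every scene."""
    return len(scenes) * len(objects) * episodes_per_object


# Hypothetical split names; the per-object counts (100, 20, 10) come from the text.
split_sizes = {
    "train": count_episodes(SEEN_SCENES, OBJECTS, 100),
    "seen_val": count_episodes(SEEN_SCENES, OBJECTS, 20),
    "seen_test": count_episodes(SEEN_SCENES, OBJECTS, 20),
    "unseen_val": count_episodes(UNSEEN_SCENES, OBJECTS, 10),
    "unseen_test": count_episodes(UNSEEN_SCENES, OBJECTS, 10),
}

for name, size in split_sizes.items():
    print(f"{name}: {size} episodes")
```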

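| Split | Chair Euc | Chair Geo | Chair Steps | Cushion Euc | Cushion Geo | Cushion Steps | Table Euc | Table Geo | Table Steps |
|---|---|---|---|---|---|---|---|---|---|
| Original Train | 5.17 | 6.45 | 40.05 | 5.39 | 7.23 | 45.56 | 4.98 | 6.19 | 38.82 |
| New Train | 3.50 | 4.00 | 27.71 | 5.79 | 7.23 | 49.14 | 3.12 | 3.85 | 28.42 |
| Original Val | 5.09 | 7.21 | 43.32 | 4.50 | 6.18 | 38.77 | 3.66 | 5.29 | 33.83 |
| New Seen Val | 3.43 | 3.89 | 27.01 | 6.09 | 7.58 | 51.37 | 2.77 | 3.42 | 26.42 |
| New Unseen Val | 4.07 | 5.79 | 37.15 | 9.00 | 12.19 | 68.40 | 4.53 | 5.18 | 32.95 |
| New Seen Test | 3.64 | 4.15 | 28.76 | 6.25 | 7.61 | 50.17 | 3.15 | 3.79 | 27.82 |
| New Unseen Test | 4.06 | 5.11 | 32.75 | 10.13 | 12.94 | 70.50 | 4.55 | 5.37 | 34.45 |

| Split | Cabinet Euc | Cabinet Geo | Cabinet Steps | Sink Euc | Sink Geo | Sink Steps | Total Average Euc | Total Average Geo | Total Average Steps |
|---|---|---|---|---|---|---|---|---|---|
| Original Train | 5.24 | 6.84 | 42.63 | 6.33 | 8.54 | 51.80 | 5.27 | 6.78 | 42.27 |
| New Train | 4.02 | 4.91 | 34.15 | 4.94 | 5.99 | 39.80 | 4.27 | 5.19 | 35.84 |
| Original Val | 5.40 | 7.46 | 46.10 | 6.52 | 8.83 | 53.50 | 4.73 | 6.65 | 40.98 |
| New Seen Val | 4.02 | 4.87 | 34.07 | 4.62 | 5.69 | 38.22 | 4.19 | 5.09 | 35.42 |
| New Unseen Val | 5.80 | 6.76 | 46.65 | 7.17 | 8.87 | 52.25 | 6.11 | 7.76 | 47.48 |
| New Seen Test | 4.37 | 5.33 | 36.58 | 4.96 | 6.06 | 40.16 | 4.47 | 5.39 | 36.70 |
| New Unseen Test | 8.40 | 9.58 | 56.85 | 8.87 | 10.93 | 63.60 | 7.20 | 8.79 | 51.63 |

Table 1: Statistics of our dataset. Euc and Geo are the average Euclidean and Geodesic distances (in meters); Steps is the average number of steps required to complete an episode.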

4 Method

In this section we first describe the Problem Setup. Then we introduce our SMTSC model, shown in Fig. 2.

4.1 Problem Setup

Our interests fall within the task of Object Navigation. In particular, this task requires finding an occurrence of a certain class starting from a random position in the environment.

This task can be viewed as a Partially Observable Markov Decision Process (POMDP) [12][10], defined by the tuple $(S, A, O, R, T, P)$ in which:

  • $S$ is a finite set of states of the world;

  • $A$ is a finite set of actions;

  • $O$ is the observation space;

  • $R(s, a)$ is a reward function which, given a state $s$ and an action $a$ to be performed in it, returns a reward for the execution of $a$ in $s$;

  • $T(s' \mid s, a)$ is a transition function which, given a state $s$ and an action $a$ to be performed in it, returns the probability of reaching the state $s'$ by executing $a$ in $s$;

  • $P(o \mid s)$ is a probability density function that defines the likelihood of observing $o$ in $s$.

In the setup taken into consideration the set of possible actions is defined as $A = \{$go_forward, turn_left, turn_right, stop$\}$. The actions are deterministic, that is, apart from possible collisions with objects in the scene, the agent moves in the desired direction without deviations due to noisy dynamics. In particular, with a go_forward action the agent moves forward by 0.25 m, while with the two turn actions it rotates by 30° in the desired direction. Finally, a stop action ends the navigation episode and, if the agent is less than 0.1 m from an object of the type sought, the episode is deemed successfully concluded.
The observation $o = (RGB, p, a, goal)$ is the set of features collected by the agent at each step in the environment and passed to the model. $RGB$ is what the agent sees from a given position, an RGB image of size 640x480; $p$ is the agent's position with respect to the starting point; $a$ is the action performed in the previous step; and finally $goal$ is the object class to be sought.
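As an illustration of the setup just described, the action set and the per-step observation can be written down as plain data structures. This is a minimal sketch and not the authors' code: the names `Action` and `Observation`, the raw-bytes image layout and the `(x, y, heading)` form of the position are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

IMAGE_WIDTH, IMAGE_HEIGHT = 640, 480  # RGB frame size stated in the text


class Action(Enum):
    """The four discrete actions available to the agent."""
    GO_FORWARD = "go_forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


@dataclass(frozen=True)
class Observation:
    """One step of agent input: (RGB, p, a, goal)."""
    rgb: bytes                            # assumed layout: raw 8-bit RGB, width * height * 3 bytes
    position: Tuple[float, float, float]  # assumed (x, y, heading) relative to the starting point
    prev_action: Action                   # action executed at the previous step
    goal: str                             # object class to be found, e.g. "sink"


# Example observation with an all-black placeholder frame.
obs = Observation(
    rgb=bytes(IMAGE_WIDTH * IMAGE_HEIGHT * 3),
    position=(0.0, 0.0, 0.0),
    prev_action=Action.TURN_LEFT,
    goal="sink",
)
print(obs.goal, obs.prev_action.value, len(obs.rgb))
```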

4.2 Model

Figure 2: SMTSC model. a) Features processing: all features are processed and a shared representation is created. b) Scene Memory Transformer: this representation is then inserted into a memory, from which the action that the agent will perform is extracted with a Scene Memory Transformer model.

The proposed model, shown in Figure 2, is composed of two main parts: a first module in which the features are extracted and brought into a joint representation, and a second module in which the features of the current observation are added to a memory that keeps track of all past observations and an attention-based policy network extracts a distribution over the possible actions.

4.2.1 Features Processing Module

Starting from an observation $o = (RGB, p, a, goal)$ we first define 5 different encoders:

  1. $\phi_{ss}$: encodes an RGB observation into a vector of size 256 by using features extracted from a semantic segmentation model.

  2. $\phi_{sc}$: encodes an RGB observation into a vector of size 128 by using features extracted from a scene classification model.

  3. $\phi_{g}$: encodes the goal into a vector of size 32.

  4. $\phi_{p}$: encodes the relative position into a vector of size 32.

  5. $\phi_{a}$: encodes the previous action executed into a vector of size 32.

Now we define:

$$\phi_{goal}(o) = FC(\{\phi_{sc}(RGB), \phi_{g}(goal)\}) \tag{1}$$

$$\psi(o) = FC(\{\phi_{ss}(RGB), \phi_{p}(p), \phi_{a}(a), \phi_{goal}(o)\}) \tag{2}$$

where $\{\cdot\}$ denotes concatenation and $FC$ is a fully connected layer.

Eq. 1 encodes the previously illustrated idea that a goal is intrinsically associated with a specific room. To do this, a representation of dimension 32 is extracted from the goal through the function $\phi_{g}$. In parallel, a scene classifier is used to extract features from the RGB image. These two modalities are concatenated and a joint representation is created using a fully connected layer. In this way, we obtain a representation of the goal conditioned step by step on the room in which the agent is located, which adds useful information for the agent to understand how to move in the environment.
Finally, Eq. 2 generates a joint representation of all the modalities described above. It concatenates the representation obtained from the semantic segmentation model with the representations of the previous action, the position and the goal (as defined by Eq. 1). This vector is passed to a fully connected layer which returns a vector of size 256.
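The data flow of Eq. 1 and Eq. 2 can be traced with a small standard-library sketch that only tracks vector sizes. It is not the authors' implementation: the layers are randomly initialised stand-ins for trained ones, the helper names (`make_linear`, `linear`, `joint_representation`) are hypothetical, no activation function is applied, and the output size of the fully connected layer in Eq. 1 (`GOAL_JOINT_DIM = 32`) is an assumption, since the text does not state it.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) of a randomly initialised fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply a fully connected layer to a vector given as a list of floats."""
    weights, bias = layer
    if len(x) != len(weights[0]):
        raise ValueError("input size does not match the layer")
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)

# Encoder output sizes stated in the text.
SEG_DIM, SCENE_DIM, GOAL_DIM, POS_DIM, ACT_DIM = 256, 128, 32, 32, 32
GOAL_JOINT_DIM = 32  # assumption: the output size of the layer in Eq. 1 is not given
JOINT_DIM = 256      # output size of Eq. 2 stated in the text

fc_goal = make_linear(SCENE_DIM + GOAL_DIM, GOAL_JOINT_DIM, rng)
fc_joint = make_linear(SEG_DIM + POS_DIM + ACT_DIM + GOAL_JOINT_DIM, JOINT_DIM, rng)


def joint_representation(seg, scene, goal, pos, act):
    """Eq. 1 followed by Eq. 2: concatenate features, then apply a fully connected layer."""
    goal_joint = linear(fc_goal, scene + goal)             # Eq. 1: scene-conditioned goal
    return linear(fc_joint, seg + pos + act + goal_joint)  # Eq. 2: joint embedding


def dummy(dim):
    """Random placeholder for the output of a feature encoder."""
    return [rng.random() for _ in range(dim)]


embedding = joint_representation(
    dummy(SEG_DIM), dummy(SCENE_DIM), dummy(GOAL_DIM), dummy(POS_DIM), dummy(ACT_DIM)
)
print(len(embedding))
```

Because the goal embedding passes through Eq. 1 before entering Eq. 2, the same goal class yields a different vector in a kitchen than in a bedroom, which is the mechanism the section describes.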

4.2.2 Scene Memory Transformer Module

Figure 3: The structure of the encoder-decoder model based on Multi-Head Attention, as presented in Fang et al. [10].

The SMT module is based on the one proposed by [10]. Given a new episode, we build an initially empty memory $M$ in which we save the joint representations $\psi(o)$ obtained from each observation through the function defined in Eq. 2. This memory is then passed along with $\psi(o)$ to an attention-based encoder-decoder policy [28] which extracts a probability distribution over the actions. The encoder-decoder structure is shown in Figure 3. The SMT encoder uses self-attention to encode the memory $M$: $M$ is passed to a MultiHeadAttention with 8 heads in which $M$ acts as Query, Key and Value, as shown in Eq. 3.

$$M = \mathrm{MultiHeadAttention}(M, M, M) \tag{3}$$

Subsequently, the encoded $M$ is passed together with the joint representation $\psi(o)$ to the decoding structure. It again uses the attention mechanism to produce a representation of $\psi(o)$ conditioned on past observations. Attention is again implemented through an 8-head MultiHeadAttention mechanism, as shown in Eq. 4.

$$Q = \mathrm{MultiHeadAttention}(\psi(o), M, M) \tag{4}$$

Finally, $Q$ is reduced to the dimensionality of the action space through a Linear Layer, and a Categorical Distribution over the action space is extracted from the latter representation, as defined in Eq. 5.

$$\pi(a \mid o, M) = \mathrm{Cat}(\mathrm{Softmax}(\mathrm{Linear}(Q))) \tag{5}$$
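The three steps of Eq. 3 to Eq. 5 (self-attention over the memory, attention of the current embedding over the encoded memory, and a linear layer followed by a categorical distribution) can be illustrated with a deliberately simplified sketch. It uses a single attention head with no learned projections, residual connections or normalisation, whereas the model described above uses 8-head attention; the vector size, the random weights and the function names are assumptions chosen for the example.

```python
import math
import random

ACTIONS = ["go_forward", "turn_left", "turn_right", "stop"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def policy_step(memory, current, action_layer):
    """Return action probabilities for the current embedding given the memory."""
    encoded = attention(memory, memory, memory)          # Eq. 3: memory attends to itself
    query = attention([current], encoded, encoded)[0]    # Eq. 4: current step attends to memory
    weights, bias = action_layer
    logits = [sum(w * x for w, x in zip(row, query)) + b for row, b in zip(weights, bias)]
    return softmax(logits)                               # Eq. 5: categorical distribution


rng = random.Random(0)
DIM = 8  # toy embedding size; the joint representation in the text has size 256

# Memory of five past embeddings; the last one belongs to the current observation.
memory = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(5)]
current = memory[-1]
action_layer = (
    [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(len(ACTIONS))],
    [0.0] * len(ACTIONS),
)

probs = policy_step(memory, current, action_layer)
action = rng.choices(ACTIONS, weights=probs, k=1)[0]
print(probs, action)
```

Since every memory entry contributes to the attention weights directly, an observation from hundreds of steps earlier can influence the action as much as the latest one, unlike the hidden state of a recurrent network.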


4.2.3 Implementation and Training Details

We implemented all the models using Python 3.6 and PyTorch [21]. We used the PPO [25] algorithm to train the model on a 32 GB Tesla V100 GPU. We used a batch size of 64 and the Adam Optimizer with a Learning Rate of . The visual features are extracted from the images using the Taskonomy networks [35] as done in [24]. This gives us consistent features, as Taskonomy networks have been trained on indoor environments simulated by Gibson [32], and it also drastically reduces the number of parameters to be trained in the network. All the activation functions within the encoder-decoder structure are ReLU functions and the pose vector is encoded as a quadruple


5 Experimental Results

In this section we first describe the experimental setup in detail and then present the results obtained.

5.1 Experimental setup

We used the Small MP3D dataset presented in Section 3. We decided to test the SMTSC model on both the seen and unseen test sets. Results on the seen set give us a measure of its capacity to memorize environments seen during training. Conversely, results obtained on the unseen set quantify the degree of adaptation to unseen scenes that the agent possesses. However, as the number of training scenes is only six, we don't expect our agents to develop strong generalization behaviours. The simulator used for all the experiments was Habitat, which provides the agent with 640x480 RGB images. In addition to the images, an odometry system is available that provides the x and y coordinates and the orientation of the agent with respect to the starting point. The simulated robot has a height of 88 cm from the ground and a radius of 18 cm. The camera with which the agent acquires the images is placed 88 cm from the ground and captures images with a 79° HFOV. Sliding against objects is not allowed: once a collision has occurred, the agent must rotate before being able to proceed again in the environment. The moves that the agent can perform at each step are: move forward by 25 cm, turn left by 30°, turn right by 30°, and stop. In particular, when the stop action is called the current episode is declared successful only if the object sought is less than 0.1 m from the agent. For each episode the agent has a maximum of 500 steps to call the stop action, otherwise the episode is automatically considered a failure.
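The motion and termination rules above can be summarised in a simulator-free sketch. It is a simplification under stated assumptions and not Habitat code: collisions are ignored, the distance to the goal is Euclidean on an empty plane rather than geodesic, a left turn is taken to be counter-clockwise, and the function names are hypothetical.

```python
import math

FORWARD_STEP_M = 0.25      # 25 cm forward move
TURN_ANGLE_DEG = 30.0      # 30 degree turns
SUCCESS_DISTANCE_M = 0.1   # stop must be called closer than this to the object
MAX_STEPS = 500            # step budget per episode


def step_pose(pose, action):
    """Collision-free pose update; pose is (x, y, heading in degrees)."""
    x, y, heading = pose
    if action == "go_forward":
        rad = math.radians(heading)
        return (x + FORWARD_STEP_M * math.cos(rad), y + FORWARD_STEP_M * math.sin(rad), heading)
    if action == "turn_left":   # assumption: left means counter-clockwise
        return (x, y, (heading + TURN_ANGLE_DEG) % 360.0)
    if action == "turn_right":
        return (x, y, (heading - TURN_ANGLE_DEG) % 360.0)
    raise ValueError(f"unknown action: {action}")


def run_episode(actions, goal_xy, start=(0.0, 0.0, 0.0)):
    """Return True only if stop is called within the step budget and near the goal."""
    pose = start
    for step_index, action in enumerate(actions, start=1):
        if step_index > MAX_STEPS:
            return False  # budget exhausted before stop was called
        if action == "stop":
            distance = math.hypot(pose[0] - goal_xy[0], pose[1] - goal_xy[1])
            return distance < SUCCESS_DISTANCE_M
        pose = step_pose(pose, action)
    return False  # the agent never called stop


# Four forward moves cover one meter, so stopping there reaches a goal at (1, 0).
print(run_episode(["go_forward"] * 4 + ["stop"], goal_xy=(1.0, 0.0)))
```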
The metrics used to evaluate the proposed models are three: Success Rate, Success weighted by Path Length (SPL) and Distance To Success (DTS). The Success Rate is simply the ratio between the number of successful episodes and the total number of episodes. The SPL, on the other hand, measures the efficiency in reaching the goal with a numerical value between 0 and 1. An episode that is not successfully completed contributes 0 to this metric; otherwise its contribution is computed as in Eq. 6, in which $N$ is the number of test episodes, $l_i$ is the shortest-path distance from the agent's starting position to the goal in episode $i$, $p_i$ is the length of the path actually taken by the agent in episode $i$, and $S_i$ is a binary indicator of success.

$$SPL = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)} \tag{6}$$
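A small worked example makes the behaviour of SPL concrete. The sketch below follows the definition given above (success indicator times the ratio of shortest-path length to the larger of the two path lengths, averaged over episodes); the function name and the episode tuples are made up for illustration, and shortest-path lengths are assumed to be strictly positive.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length) tuples,
    with strictly positive shortest_path_length.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


example_episodes = [
    (True, 4.0, 4.0),    # optimal path: contributes 1.0
    (True, 4.0, 8.0),    # twice the shortest path: contributes 0.5
    (False, 5.0, 12.0),  # failure: contributes 0.0 regardless of the path
]
print(spl(example_episodes))  # 0.5
```

The third episode shows the limitation discussed next: a failure contributes zero regardless of how close the agent got to the object.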


The SPL is considered today the main metric for the Object Navigation task but, as described in [3], it has numerous problems. In fact, not all failures should be considered equal: an agent that ends 0.2 m from the object sought receives the same score as an agent that has rotated on the spot for the entire duration of the episode. Finally, DTS is defined as the agent's distance from the threshold boundary around the nearest object. This is mathematically defined as:

$$DTS = \max(Geo(\mathrm{agent}, \mathrm{object}) - d,\ 0)$$

in which the $Geo$ function measures the geodesic distance between the agent at the end of the episode and the nearest object, while $d$ is the success distance, in this case 0.1 m.
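The sketch below computes the Success Rate and the average DTS for a handful of invented episodes. It assumes that the final geodesic distance to the nearest goal object is already available for each episode, and it clips DTS at zero, which is how "distance from the threshold boundary" is read here; the function names and the episode values are illustrative only.

```python
SUCCESS_DISTANCE_M = 0.1  # success threshold d used in the text


def distance_to_success(final_distance, success_distance=SUCCESS_DISTANCE_M):
    """Distance left to the success boundary; zero when the agent is already inside it."""
    return max(final_distance - success_distance, 0.0)


def summarize(episodes, success_distance=SUCCESS_DISTANCE_M):
    """episodes: list of (called_stop, final_geodesic_distance_in_meters) tuples."""
    n = len(episodes)
    successes = sum(1 for stopped, dist in episodes if stopped and dist < success_distance)
    mean_dts = sum(distance_to_success(dist, success_distance) for _, dist in episodes) / n
    return successes / n, mean_dts


example = [
    (True, 0.05),   # stopped inside the threshold: success, DTS 0
    (True, 0.17),   # stopped 7 cm beyond the boundary: failure, DTS 0.07
    (False, 3.10),  # ran out of steps far from the object: failure, DTS 3.0
]
success_rate, mean_dts = summarize(example)
print(success_rate, mean_dts)
```

Unlike SPL, the second and third episodes are scored very differently here, which is why DTS is reported alongside the other two metrics.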
Apart from the model presented in Section 4, five baselines have been tested to assess the performance of the proposed model.

  • Random Agent: an agent that performs random actions drawn from a uniform distribution.

  • Forward-Only Agent: an agent that only performs forward actions (with a 1% probability of calling the stop action).

  • Reactive Agent: a policy that extracts semantic segmentation features through Taskonomy and merges them with position and goal as was done in [24]. The action to be performed is extracted directly from this representation, therefore no type of memory is used.

  • LSTM Agent: the representation is extracted as for the reactive agent, but in this case it is passed to an LSTM and the action to be performed is extracted from its output. The model is very similar to the RGB-Only model presented in [29].

  • SMT without Scene Classification Features (SMT w/o SC): this is the same model presented in Section 4, except that the goal is encoded only through an embedding layer, without exploiting the joint representation with the features of the Scene Classifier.

The first two baselines were included to demonstrate the non-triviality of the proposed dataset. The last three models were trained with the PPO Reinforcement Learning algorithm.
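The two non-learned baselines are simple enough to sketch in a few lines. The class names, the `act` interface and the seeding are assumptions made for the example; only the uniform action choice and the 1% stop probability come from the descriptions above.

```python
import random

ACTIONS = ["go_forward", "turn_left", "turn_right", "stop"]


class RandomAgent:
    """Draws every action uniformly at random and ignores the observation."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def act(self, observation=None):
        return self.rng.choice(ACTIONS)


class ForwardOnlyAgent:
    """Always moves forward, except for a small probability of calling stop."""

    def __init__(self, stop_probability=0.01, seed=None):
        self.stop_probability = stop_probability
        self.rng = random.Random(seed)

    def act(self, observation=None):
        if self.rng.random() < self.stop_probability:
            return "stop"
        return "go_forward"


random_agent = RandomAgent(seed=0)
forward_agent = ForwardOnlyAgent(seed=0)
print([random_agent.act() for _ in range(5)])
print([forward_agent.act() for _ in range(5)])
```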

5.2 Results

| Model | SPL | Success | DTS |
|---|---|---|---|
| Random | 0.00 | 0.00 | 5.126 |
| Forward-Only | 0.009 | 0.026 | 5.094 |
| Reactive [24] | 0.041 | 0.126 | 4.523 |
| LSTM [29] | 0.131 | 0.247 | 3.199 |
| SMT w/o SC (ours) | 0.345 | 0.595 | 1.562 |
| SMTSC (ours) | 0.649 | 0.883 | 0.403 |

Table 2: Results on the Seen Test Set

| Model | SPL | Success | DTS |
|---|---|---|---|
| Random | 0.00 | 0.00 | 7.842 |
| Forward-Only | 0.004 | 0.01 | 7.922 |
| Reactive [24] | 0.001 | 0.04 | 7.797 |
| LSTM [29] | 0.002 | 0.04 | 7.648 |
| SMT w/o SC (ours) | 0.008 | 0.04 | 7.518 |
| SMTSC (ours) | 0.039 | 0.080 | 6.817 |

Table 3: Results on the Unseen Test Set

Table 2 shows the results obtained on the test set in seen environments. The low performance of the Random and Forward-Only baselines highlights that the dataset is not trivial, showing that the object to be found is almost never in the immediate proximity of the starting point of the episode. Looking instead at the baselines based on Reinforcement Learning, we can see that the two models that use the Scene Memory Transformer perform much better on all metrics. This big difference is probably attributable to the fact that the SMT can extract crucial information even from actions performed in a fairly remote past, while in the case of the LSTM, for example, this is very difficult, as more recent information tends to supplant older information, especially in very long sequences.
It is also interesting to note that the model that creates a joint representation between the goal encoding and the visual features for scene classification performs much better than the SMT w/o SC model. The SPL increases by 88%, the success rate increases by 48.4% and the average DTS falls by over a meter.
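The relative gains quoted above can be re-derived from the SMT w/o SC and SMTSC rows of Table 2. The snippet below is only a check of that arithmetic; the helper name `relative_change` is introduced for illustration.

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline


# Values taken from Table 2 (seen test set): SMT w/o SC versus SMTSC.
spl_gain = relative_change(0.345, 0.649)
success_gain = relative_change(0.595, 0.883)
dts_drop_m = 1.562 - 0.403

print(f"SPL: +{spl_gain:.1%}")
print(f"Success: +{success_gain:.1%}")
print(f"DTS: -{dts_drop_m:.3f} m")
```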
The trend observed on the seen test set is also found on the unseen test set, whose results are shown in Table 3. In fact, taking as a comparison the average geodesic distance of the unseen test set (7.76m), we can see that the Forward-Only and Random agents tend to conclude their episodes even further away from the goal than where they started. The reactive model instead ends its episodes at a distance in line with the average distance reported in Table 1 for the unseen test set. Finally, the two models that exploit the SMT are the ones that perform better, even in an unseen environment. In particular, the proposed model, which exploits the joint representation between visual features and goal encoding, lowers the average distance from the goal by almost a meter and succeeds in 8% of the episodes.
In the next section we report some qualitative examples of the SMTSC model on the seen and unseen test sets.

5.2.1 Qualitative Results

In Figure 4 it is possible to see, in blue, the path taken by the agent to reach an object of the “cushion” class on the seen test set. The shortest path to the object is shown in green. The agent successfully reached the object sought, stopping less than 0.1 m from it. On the contrary, Figure 5 shows an example of failure, still on the seen test set, in which the agent was unable to reach the object sought. From its initial position the agent was able to recognize an object of the class sought and to head towards it, even though it was not the closest chair. At the time of calling the stop action, however, the agent was 7 cm beyond the boundary that would have made the episode a success. This case shows the strong penalty given by a metric like the SPL: in our opinion, this episode cannot be considered totally wrong.
In Figures 6 and 7, on the other hand, it is possible to see two examples extracted from the evaluation of the model on the unseen test set. In the first image, the agent was able to take a correct path to a “cabinet”. However, the cabinet that was found was not the nearest one, so the SPL of this episode was only 0.57. In the second example, on the other hand, a very long route of over 14 meters of navigation is presented. Here the agent was able to follow the optimal path for about half of its length, then it reached the maximum number of actions allowed. This means that it “wasted” many actions to increase its understanding of the scene.

Figure 4: Successful navigation episode with SMTSC model on seen test set.
Figure 5: Unsuccessful navigation episode with SMTSC model on seen test set.
Figure 6: Successful navigation episode with SMTSC model on unseen test set.
Figure 7: Unsuccessful navigation episode with SMTSC model on unseen test set.

6 Conclusion

We proposed a subset of the dataset developed for the Habitat 2020 ObjectNav Challenge [23]. This was done to allow models that do not rely on planning over online-built maps to be trained with few computational resources (e.g. a single GPU). Furthermore, we proposed a model capable of exploiting the subtle relationship existing between objects and the rooms in which they are usually located. This intuition, combined with the use of a Scene Memory Transformer, showed good results on the proposed dataset. In the future, it would be very interesting to test this model on the complete dataset using the distributed Reinforcement Learning algorithm DD-PPO [29] and a greater number of GPUs in order to perform training in a reasonable time.