Visual Navigation is a trending topic in the Computer Vision research community. This growth in interest is undoubtedly due to the important practical implications that agents capable of moving in complex environments can have on our society. For example, in an ever closer future we will be able to ask robotic assistants to perform the most disparate tasks in our homes. Before we can ask a robot to take something out of the refrigerator, however, we need to make sure that it is able to find the refrigerator and reach it while avoiding the complex tangle of obstacles that a domestic environment can contain. For this reason, this work focuses on the Object Navigation task, defined as the search for objects belonging to a specific class by a robotic agent. For humans this is a very simple task, whatever the object to be found is: a human can build a mental link between an object and the room where it is most likely to be found, and can therefore simplify the problem by first searching for the room and then for the required object inside it. For example, when looking for a sink, we unconsciously search first for the kitchen or the bathroom, and then find a sink in one of them. To implement this intuition we developed an attention-based policy that exploits a joint representation integrating visual information extracted with a scene classifier and an encoding of the semantic goal to be searched. This representation yields a significant increase in performance compared to the other models taken into consideration.
Several works tried to tackle Navigation through learning models. In particular, one line of work leveraged depth images to construct semantic maps of the environment in an online fashion. From these maps they tried to maximize the exploration of the scene by placing intermediate subgoals in unexplored areas of the map, which the agent was encouraged to reach through planning.
This type of approach inevitably tends to lengthen the agent's paths, at least until the sought object is clearly visible, since maximizing exploration involves a significant number of moves by the agent. For the simpler PointGoal Navigation task, proposed in the "Habitat Challenge 2019" (https://aihabitat.org/challenge/2019/), it was possible to observe how a simple LSTM-based model was able to outperform competitors based on more complex architectures that exploit map creation. This was made possible by the DD-PPO algorithm, a distributed version of the PPO Reinforcement Learning algorithm capable of massively parallelizing learning. In fact, the authors used 64 NVIDIA V100 GPUs for 3 consecutive days of training; in other words, the model was trained for about 180 days in a single-GPU setting. Not all researchers, however, have access to such massive hardware resources. For this reason, we have generated a reduced version of the dataset produced for the "Habitat Challenge 2020" (https://aihabitat.org/challenge/2020/) Object Navigation task, which allows the training of Deep Reinforcement Learning models in a few hours. In this way, even with few computational resources, it is possible to train complex models that would take days on the original dataset.
Furthermore, it has been pointed out that the use of Recurrent Neural Networks is not recommended in Navigation tasks. These structures tend to consider the recent past as more important while gradually forgetting the remote past. On the contrary, by exploiting the principle of attention, it is possible to record all past observations in a memory from which information about every single step the agent has taken can be retrieved, overcoming this highly penalizing intrinsic limitation of Recurrent Neural Networks.
Summarizing, our contribution is twofold:
We propose a new reduced dataset for the Object Navigation task, extracted from the one proposed in the "Habitat Challenge 2020". It makes it possible to test algorithms that would otherwise require a lot of resources for training, while preserving as far as possible the main characteristics of the original dataset;
We propose the SMTSC model, an attention-based policy for Object Navigation that is able to exploit, starting from RGB images only, the idea mentioned above, namely that there exists a correlation binding objects to specific rooms. This intuition improves performance, as demonstrated by the results obtained in a preliminary study performed using the aforementioned dataset.
2 Related Works
Visual Navigation is an increasingly central topic within Computer Vision. It involves several different sub-problems and, in this section, we summarize the works most relevant to simulators and 3D datasets for Visual Navigation, as well as other research areas connected to it.
2.1 Simulators and 3D Datasets for Visual Navigation
In recent years a large number of different simulators have been developed. GibsonEnv and AI2Thor both make it possible to simulate multi-agent situations and to interact with objects, for example by lifting or pushing them. Matterport3DSimulator can provide the agent with photorealistic images extracted from Matterport3D and is mainly used for the Room2Room Navigation problem. HabitatAI, instead, provides support for 3D datasets such as Gibson, Matterport3D and Replica.
2.2 Visual Navigation
Thanks also to the new possibilities offered by these simulators, numerous tasks are available today, as pointed out in . Common Navigation tasks are mainly divided into two categories: those that require active exploration of the environment, and those that provide tools, such as GPS sensors, that can signal the direction to be taken to reach the required goal.
In Classical Navigation, there are numerous approaches that perform path planning on explicit maps .
More recently, however, approaches based on Reinforcement Learning have been presented, with policies based on Recurrent Neural Networks. Mirowski et al. define an approach that jointly learns the goal-driven Reinforcement Learning problem together with auxiliary depth-prediction and loop-closure classification tasks, exploiting the A3C algorithm. Mousavian et al. propose a Deep Reinforcement Learning framework that uses an LSTM-based policy for Semantic Target Driven Navigation. However, when LSTMs have to analyze very long data sequences, they tend to focus on the most recent observations, giving less importance to the first ones seen. On the contrary, Fang et al. propose the Scene Memory Transformer, a policy based on attention that is able to exploit even the least recent steps performed by the agent; in this case, the policy is trained through the Deep Q-Learning algorithm. Starting from the work done in , Sax et al. show that using Mid-Level Vision results in policies that learn faster and generalize better compared to learning from scratch; the Mid-Level model achieved high results in the PointGoal Navigation task. In , a Reinforcement Learning algorithm that scales across multiple GPUs and solves the PointGoal Navigation task almost perfectly was presented. This solution, in particular, shows how Visual Navigation is a really complex task that requires an impressive amount of resources: their training was conducted on 64 GPUs for 3 days. Unfortunately, these resources are not within the reach of the whole scientific community, and training similar models remains almost prohibitive for most researchers.
Some works on Target Driven Navigation have also recently been presented, such as . Wu et al. construct a probabilistic graphical model over the semantic information to exploit structural similarities between environments. Yang et al., instead, propose a Deep Reinforcement Learning model that exploits the relationships between objects, encoded through a Graph Convolutional Network, to incorporate semantic priors within the policy. Chaplot et al. propose a model for ObjectGoal Navigation that constructs, during exploration, a map with the semantic information of the scene extracted through a semantic segmentation model. From the generated map, a long-term goal is selected to maximize exploration through a policy trained with Reinforcement Learning; when the searched object becomes visible, it is set as the new long-term goal. The actions of the agent are then selected through the FastMarching algorithm .
Previous works have been able to achieve excellent results on the PointGoal Navigation task, a simpler Navigation problem that doesn't require semantic capabilities, using an architecture without complex components such as occupancy maps. However, they leveraged massive parallelism using hardware resources that are inaccessible to most institutions.
3 Proposed Dataset

We investigate the possibility of solving the same problem on a reduced dataset with a minimal amount of computational and time resources. For this reason we decided to concentrate on a subset of the Matterport3D dataset. We argue that our choice of subset still offers significant results, as its statistical indicators are similar to those of the original Matterport3D.
The extraction of the subset was done by restricting the problem to 5 of the 21 object classes: Chair, Cushion, Table, Cabinet and Sink. These objects are among the most frequent in the original set, as shown in Figure 1. We decided to include the Sink class as it is a characterizing element of the bathroom, which increases the diversity of the domestic environments represented in our proposed dataset.
Related to the distribution of objects, one of the main issues in the evaluation of Object Navigation agents on Matterport3D is its long-tail distribution, which limits the number of instances of infrequent objects seen during training. This, combined with metrics that ignore the precision on single classes, may undermine the development of agents with truly semantic capabilities.
Furthermore, we decided to extract our new dataset - from here on referred to as Small MP3D - using 6 of the 56 scenes of the official training split: r47D5H71a5s, i5noydFURQK, ZMojNkEp431, jh4fc5c5qoQ, HxpKQynjfin and GdvgFV5R1Z5. For each object in each scene we extracted 100 episodes for the training split and 20 for both validation and test. In total, Small MP3D contains 3000 training episodes and 600 episodes each for validation and test. To assess the ability to generalize to unseen scenes, we also generated an unseen test and validation set, extracted from the D7N2EKCX4Sj and aayBHfsNo7d scenes, each with 100 episodes (10 episodes for each object class in each scene).
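The bookkeeping of these splits can be sketched as follows; the scene and class names come from the text above, while the helper function itself is only illustrative and is not the actual extraction code:

```python
# Sketch of the Small MP3D split sizes described above.
TRAIN_SCENES = ["r47D5H71a5s", "i5noydFURQK", "ZMojNkEp431",
                "jh4fc5c5qoQ", "HxpKQynjfin", "GdvgFV5R1Z5"]
UNSEEN_SCENES = ["D7N2EKCX4Sj", "aayBHfsNo7d"]
OBJECT_CLASSES = ["Chair", "Cushion", "Table", "Cabinet", "Sink"]

# Episodes extracted per (scene, object class) pair in the seen splits.
EPISODES_PER_OBJECT = {"train": 100, "val": 20, "test": 20}
# Episodes per (scene, object class) pair in the unseen splits.
UNSEEN_EPISODES_PER_OBJECT = 10

def split_sizes():
    """Return the number of episodes in each split of Small MP3D."""
    sizes = {split: len(TRAIN_SCENES) * len(OBJECT_CLASSES) * n
             for split, n in EPISODES_PER_OBJECT.items()}
    # Unseen splits: 10 episodes per object class in each of the 2 scenes.
    unseen = len(UNSEEN_SCENES) * len(OBJECT_CLASSES) * UNSEEN_EPISODES_PER_OBJECT
    sizes["unseen_val"] = unseen
    sizes["unseen_test"] = unseen
    return sizes
```

Running `split_sizes()` reproduces the totals stated above: 3000 training episodes, 600 each for seen validation and test, and 100 each for the unseen sets.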
The characteristics of the new dataset are shown in Table 1. We report the average values of Euclidean and geodesic distance as well as the number of steps required to complete the episodes. Overall, the complexity of the training split of the proposed reduced dataset is lower than that of the original data, as both the geodesic distance and the required number of steps are lower. However, the complexity of the unseen test split is significantly higher, with 30% more required steps on average. We think that this additional complexity can guarantee a fair and meaningful benchmark.
|Split|Euclidean|Geodesic|Steps|Euclidean|Geodesic|Steps|Euclidean|Geodesic|Steps|
|New Seen Val|3.43|3.89|27.01|6.09|7.58|51.37|2.77|3.42|26.42|
|New Unseen Val|4.07|5.79|37.15|9.00|12.19|68.40|4.53|5.18|32.95|
|New Seen Test|3.64|4.15|28.76|6.25|7.61|50.17|3.15|3.79|27.82|
|New Unseen Test|4.06|5.11|32.75|10.13|12.94|70.50|4.55|5.37|34.45|

|Split|Euclidean|Geodesic|Steps|Euclidean|Geodesic|Steps|Euclidean|Geodesic|Steps|
|New Seen Val|4.02|4.87|34.07|4.62|5.69|38.22|4.19|5.09|35.42|
|New Unseen Val|5.80|6.76|46.65|7.17|8.87|52.25|6.11|7.76|47.48|
|New Seen Test|4.37|5.33|36.58|4.96|6.06|40.16|4.47|5.39|36.70|
|New Unseen Test|8.40|9.58|56.85|8.87|10.93|63.60|7.20|8.79|51.63|
4 Proposed Method

In this section we first describe the problem setup. Then we introduce our SMTSC model, shown in Fig. 2.
4.1 Problem Setup
Our interests fall within the task of Object Navigation. In particular, this task requires finding an instance of a given object class starting from a random position in the environment.
This task can be viewed as a Partially Observable Markov Decision Process (POMDP) (S, A, Ω, R, T, O) in which:

S is a finite set of states of the world;

A is a finite set of actions;

Ω is the observation space;

R(s, a) is a reward function which, given a state s and an action a to be performed in it, returns a reward for the execution of a in s;

T(s, a, s') is a transition function which, given a state s and an action a to be performed in it, returns the probability of reaching the state s' by executing a in s;

O(o, s) is a probability density function that defines the likelihood of observing o in s.
In the setup taken into consideration, the set of possible actions is defined as A = {go_forward, turn_left, turn_right, stop}.
The actions are deterministic, that is, apart from possible collisions with objects in the scene, the agent will move in the desired direction without deviations due to noisy dynamics. In particular, with a go_forward action the agent will move forward by 0.25m, while with the two turn actions it will rotate by 30° in the desired direction. Finally, a stop action causes the navigation episode to end and, if the agent is less than 0.1m from an object of the type sought, the episode is deemed successfully concluded.
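Under these assumptions, the deterministic dynamics can be sketched as follows; the pose representation (x, y, heading) and the function names are our own, and collision handling is omitted:

```python
import math

# Minimal sketch of the deterministic action dynamics described above:
# 0.25 m forward steps and 30-degree turns.
FORWARD_STEP = 0.25          # metres
TURN_ANGLE = math.radians(30)

def step(pose, action):
    """Apply one action to a (x, y, heading) pose; heading in radians."""
    x, y, heading = pose
    if action == "go_forward":
        return (x + FORWARD_STEP * math.cos(heading),
                y + FORWARD_STEP * math.sin(heading),
                heading)
    if action == "turn_left":
        return (x, y, heading + TURN_ANGLE)
    if action == "turn_right":
        return (x, y, heading - TURN_ANGLE)
    return pose  # "stop" ends the episode; the pose is unchanged
```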
The observation (RGB, p, a, goal) is the set of features collected by the agent at each step in the environment and passed to the model. RGB is what the agent sees from a given position, a 640×480 RGB image; p is the agent's position w.r.t. the starting point; a is the action performed in the previous step; and goal is the object class to be sought.
4.2 SMTSC Model

The proposed model, shown in Figure 2, is composed of two main parts: a first module in which the features are extracted and brought into a joint representation, and a second module in which the features of the current observation are added to a memory that keeps track of all past observations and an attention-based policy network extracts a distribution over possible actions.
4.2.1 Features Processing Module
Starting from an observation (RGB, p, a, goal), we first define the following encoders:

: encodes an RGB observation into a vector of size 128 using features extracted from a scene classification model;

: encodes the goal into a vector of size 32;

: encodes the relative position into a vector of size 32;

: encodes the previously executed action into a vector of size 32.
Now we define:
Eq. 1 encodes the previously illustrated idea that a goal is intrinsically associated with a specific room. Starting from the goal, a representation of dimension 32 is extracted through the goal encoder. In parallel, the scene classifier is used to extract features from the RGB image; these two modalities are concatenated and a joint representation is created using a fully connected layer. In this way we obtain a representation of the goal conditioned, step by step, on the room in which the agent is located, which adds information useful for the agent to understand how to move in the environment.
Finally, Eq. 2 generates a joint representation of all the modalities described above. It concatenates the representation obtained from a semantic segmentation model with the representations of the previous action, the position and the goal (as defined by Eq. 1). This vector is passed to a fully connected layer which returns a vector of size 256.
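The two equations can be sketched in PyTorch roughly as follows. The 128/32/256 sizes come from the text; the layer layouts, the input feature sizes, the dimensionality of the goal-scene joint vector (here 32) and the presence of a separate encoder for the segmentation features are our assumptions:

```python
import torch
import torch.nn as nn

class FeatureProcessing(nn.Module):
    """Illustrative sketch of the features-processing module (Eqs. 1-2)."""

    def __init__(self, n_goals=5, n_actions=4,
                 scene_feat_dim=512, seg_feat_dim=512):
        super().__init__()
        self.scene_enc = nn.Linear(scene_feat_dim, 128)  # scene-classifier feats
        self.seg_enc = nn.Linear(seg_feat_dim, 128)      # segmentation feats
        self.goal_emb = nn.Embedding(n_goals, 32)
        self.pose_enc = nn.Linear(4, 32)                 # pose as a quadruple
        self.act_emb = nn.Embedding(n_actions, 32)
        # Eq. 1: goal representation conditioned on the current scene.
        self.goal_scene_fc = nn.Linear(128 + 32, 32)
        # Eq. 2: joint representation of all modalities.
        self.joint_fc = nn.Linear(128 + 32 + 32 + 32, 256)

    def forward(self, scene_feats, seg_feats, pose, prev_action, goal):
        # Eq. 1: concatenate scene features and goal embedding, then an FC.
        g = torch.relu(self.goal_scene_fc(torch.cat(
            [self.scene_enc(scene_feats), self.goal_emb(goal)], dim=-1)))
        # Eq. 2: concatenate all modalities and project to 256 dimensions.
        joint = torch.cat([self.seg_enc(seg_feats), self.act_emb(prev_action),
                           self.pose_enc(pose), g], dim=-1)
        return torch.relu(self.joint_fc(joint))
```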
4.2.2 Scene Memory Transformer Module
The SMT module is based on the one proposed in . Given a new episode, we build an initially empty memory in which we save the joint representations obtained from the observations through the function defined in Eq. 2. This memory is then passed, along with the current joint representation, to an attention-based encoder-decoder policy which extracts a probability distribution over the actions. The encoder-decoder structure is shown in Figure 3. The SMT encoder uses self-attention to encode the memory M: M is passed to a MultiHeadAttention with 8 heads in which M acts as Query, Key and Value, as shown in Eq. 3.

Subsequently, M is passed together with the current joint representation to the decoding structure, which again uses the attention mechanism to produce a representation of the current observation conditioned on past observations. Attention is again implemented through an 8-head MultiHeadAttention, as shown in Eq. 4.
Finally, Q is reduced to the dimensionality of the action space through a linear layer, and a categorical distribution over the action space is extracted from this representation, as defined in Eq. 5.
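A minimal sketch of this encoder-decoder policy, assuming PyTorch's nn.MultiheadAttention and the 256-dimensional joint representations described above (any detail beyond the 8 heads and the linear action head is an assumption):

```python
import torch
import torch.nn as nn

class SMTPolicy(nn.Module):
    """Illustrative sketch of the attention-based policy (Eqs. 3-5)."""

    def __init__(self, d_model=256, n_actions=4, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, memory, obs):
        # Eq. 3: the memory M attends over itself (M is Query, Key and Value).
        enc, _ = self.self_attn(memory, memory, memory)
        # Eq. 4: the current observation attends over the encoded memory.
        q, _ = self.cross_attn(obs.unsqueeze(1), enc, enc)
        # Eq. 5: a linear layer maps Q to logits over the action space,
        # from which a categorical distribution is built.
        logits = self.action_head(q.squeeze(1))
        return torch.distributions.Categorical(logits=logits)
```

Here `memory` is a (batch, steps, 256) tensor of stored joint representations and `obs` is the (batch, 256) representation of the current observation.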
4.2.3 Implementation and Training Details
We implemented all the models using Python 3.6 and PyTorch . We used the PPO  algorithm to train the model on a 32 GB Tesla V100 GPU, with a batch size of 64 and the Adam optimizer with a learning rate of . The visual features are extracted from the images using the Taskonomy networks , as done in . This allows us to have consistent features, as the Taskonomy networks have been trained on indoor environments simulated by Gibson, and it also drastically reduces the number of parameters to be trained in the network. All the activation functions within the encoder-decoder structure are ReLU functions, and the pose vector is encoded as a quadruple.
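For reference, PPO optimizes the standard clipped surrogate objective (this is the general formulation from the PPO paper, not a detail specific to our training):

```latex
L^{\mathrm{CLIP}}(\theta) =
\hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
```

where $\hat{A}_t$ is the estimated advantage at step $t$ and $\epsilon$ is the clipping parameter.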
5 Experimental Results
In this section we provide a description of the experimental setup and then present the results obtained.
5.1 Experimental setup
We used the Small MP3D dataset presented in Section 3.
We decided to test the SMTSC model on both the seen and unseen test sets. Results on the seen set give us a measure of its capacity to memorize environments seen during training. Conversely, results obtained on the unseen set quantify the degree of adaptation to unseen scenes that the agent possesses. However, as the number of training scenes is only six, we do not expect our agents to develop strong generalization behaviours.
The simulator used for all the experiments was Habitat, which provides the agent with 640×480 RGB images. In addition to the images, an odometry system is available that provides the x and y coordinates and the orientation of the agent with respect to the starting point. The simulated robot has a height of 88cm and a radius of 18cm. The camera with which the agent acquires images is placed 88cm from the ground and captures images with a 79° HFOV. Sliding against objects is not allowed: once a collision has occurred, the agent must rotate before being able to proceed again in the environment. The moves that the agent can perform at each step are: 25cm move forward, 30° turn left, 30° turn right and stop. In particular, when the stop action is called, the current episode is declared successful only if the sought object is less than 0.1m from the agent. For each episode the agent has a maximum of 500 steps to call the stop action, otherwise the episode is automatically considered a failure.
The metrics used to evaluate the proposed models are three: Success Rate, Success weighted by Path Length (SPL) and Distance To Success (DTS). The Success Rate is simply the ratio between the number of successful episodes and the total number of episodes. The SPL, on the other hand, measures the efficiency in reaching the goal with a numerical value between 0 and 1. When the episode is not successfully completed, this metric is 0; otherwise it is calculated using Eq. 6, in which N is the number of test episodes, l_i is the shortest-path distance from the agent's starting position to the goal in episode i, p_i is the length of the path actually taken by the agent in episode i, and S_i is a binary indicator of success.
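With the quantities just defined, Eq. 6 corresponds to the standard SPL formula of Anderson et al.:

```latex
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{l_i}{\max(p_i,\, l_i)}
```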
The SPL is considered today the main metric for the Object Navigation task, but as described in  it has numerous problems. In fact, not all failures should be considered equal: an agent that stopped 0.2m from the sought object is evaluated with the same score as an agent that rotated in place for the entire duration of the episode. Finally, DTS is defined as the agent's distance from the threshold boundary around the nearest object. This is mathematically defined as:
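Following the definition above, the DTS of an episode can be written as (the clipping at zero, which makes DTS vanish for successful episodes, is our reading of the threshold-boundary definition):

```latex
\mathrm{DTS} = \max\!\left(\mathrm{Geo}\!\left(a_T,\, o^{*}\right) - d,\; 0\right)
```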
Here the Geo function measures the geodesic distance between the agent's position at the end of the episode and the nearest object, while d is the success distance, in this case 0.1m.
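The three metrics can be sketched together as follows; the episode records and their field names are illustrative, not the actual evaluation code:

```python
# Illustrative sketch of the three evaluation metrics defined above.
def evaluate(episodes, d_success=0.1):
    """Return (success rate, SPL, average DTS) over a list of episodes.

    Each episode is a dict with: "success" (0/1), "l" (shortest-path
    length), "p" (taken-path length), "geo_end" (final geodesic distance
    to the nearest goal object).
    """
    n = len(episodes)
    success = sum(ep["success"] for ep in episodes) / n
    # SPL (Eq. 6): efficiency term counted only on successful episodes.
    spl = sum(ep["success"] * ep["l"] / max(ep["p"], ep["l"])
              for ep in episodes) / n
    # DTS: final distance to the success threshold, clipped at zero.
    dts = sum(max(ep["geo_end"] - d_success, 0.0) for ep in episodes) / n
    return success, spl, dts
```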
Apart from the model presented in Section 4, four other baselines have been tested to assess its performance.
Random Agent: an agent that performs actions sampled from a uniform distribution.
Forward-Only Agent: an agent that only performs forward actions (with a 1% probability of calling the stop action). These first two baselines were included to demonstrate the non-triviality of the proposed dataset.
Reactive Agent: a policy that extracts semantic segmentation features through Taskonomy and merges them with position and goal, as done in . The action to be performed is directly extracted from this representation; therefore no memory is used.
LSTM Agent: the representation is extracted as for the Reactive Agent, but in this case it is passed to an LSTM and the action to be performed is extracted from its output. The model is similar to the RGB-Only model presented in .
SMT without Scene Classification Features (SMT w/o SC): the same model presented in Section 4, except that the goal is encoded only through an embedding layer, without exploiting the joint representation with the scene classifier features.
The last three models were trained with the PPO Reinforcement Learning algorithm.
|SMT w/o SC (our)||0.345||0.595||1.562|
|SMT w/o SC (our)||0.008||0.04||7.518|
Table 2 shows the results obtained on the test set in seen environments. The low performance of the Random and Forward-Only baselines highlights that the dataset is not trivial, showing that the object to be found is almost never in the immediate proximity of the episode's starting point. Looking at the other Reinforcement Learning baselines, we can see that the two models using the Scene Memory Transformer perform much better on almost all metrics. This large difference is probably attributable to the fact that the SMT can extract crucial information even from actions performed in a fairly remote past, while this is very difficult for the LSTM, as more recent information tends to supplant older information, especially in very long sequences.
It is also interesting to note that the model that creates a joint representation between the goal encoding and the scene-classification visual features performs much better than the SMT w/o SC model: the SPL increases by 88%, the success rate by 48.4%, and the average DTS falls by over a meter.
The behavior observed on the seen test set follows the same trend on the unseen test set, whose results are shown in Table 3. In fact, taking as reference the average geodesic distance of the unseen test set (7.76m), we can see that the Forward-Only and Random agents tend to conclude their episodes even further from the goal than the starting point. The Reactive model's performance is instead in line with the average distance reported in Table 1 for the unseen test set. Finally, the two models that exploit the SMT are the ones that perform better, even in unseen environments. In particular, the proposed model, which exploits the joint representation between visual features and goal encoding, lowers the average distance from the goal by almost a meter and succeeds in 8% of the episodes.
In the next section we report some qualitative examples of the SMTSC model on the seen and unseen test sets.
5.2.1 Qualitative Results
In Figure 4 it is possible to see, in blue, the path taken by the agent to reach an object of the "cushion" class on the seen test set; in green, the shortest path to the object. The agent successfully reached the sought object, stopping less than 0.1m from it. On the contrary, Figure 5 shows an example of failure, still on the seen test set, in which the agent was unable to reach the sought object. From its initial position, the agent was able to recognize an object of the sought class and head towards it, despite it not being the closest chair. At the time of calling the stop action, however, the agent was 7cm beyond the boundary that would have made the episode successful. In this case we can see the strong penalty given by a metric like the SPL; in our opinion, this episode cannot be considered totally wrong.
In Figures 6 and 7, on the other hand, it is possible to see two examples extracted from the evaluation of the model on the unseen test set. In the first image, the agent was able to take a correct path to a "cabinet". However, the cabinet found was not the nearest one, so the SPL of this episode was only 0.57. The second example shows a very long route of over 14 meters of navigation. The agent was able to follow the optimal path for about half of its length, but then reached the maximum number of allowed actions. This means that it "wasted" many actions trying to increase its understanding of the scene.
6 Conclusions

We proposed a subset of the dataset developed for the Habitat 2020 ObjectNav Challenge . This was done to allow the training, with few computational resources (e.g. a single GPU), of models that do not rely on planning combined with map construction. Furthermore, we proposed a model capable of exploiting the subtle relationship between objects and the rooms in which they are usually located. This intuition, combined with the use of a Scene Memory Transformer, showed good results on the proposed dataset. In the future, it would be very interesting to test this model on the complete dataset using the distributed Reinforcement Learning algorithm DD-PPO  and a greater number of GPUs, in order to perform training in a reasonable time.
-  Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
-  Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3674–3683 (2018)
-  Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., Wijmans, E.: Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171 (2020)
-  Canny, J.: The complexity of robot motion planning. MIT press (1988)
-  Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV) (2017)
-  Chang, M., Gupta, A., Gupta, S.: Semantic visual navigation by watching youtube videos. arXiv preprint arXiv:2006.10034 (2020)
-  Chaplot, D.S., Gandhi, D., Gupta, A., Salakhutdinov, R.: Object goal navigation using goal-oriented semantic exploration. arXiv preprint arXiv:2007.00643 (2020)
-  Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural slam. arXiv preprint arXiv:2004.05155 (2020)
-  Chen, T., Gupta, S., Gupta, A.: Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959 (2019)
-  Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 538–547 (2019)
-  Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2616–2625 (2017)
-  Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1-2), 99–134 (1998)
-  Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE transactions on Robotics and Automation 12(4), 566–580 (1996)
-  Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017)
-  Lample, G., Chaplot, D.S.: Playing fps games with deep reinforcement learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
-  LaValle, S.M., Kuffner, J.J.: Rapidly-exploring random trees: Progress and prospects. Algorithmic and computational robotics: new directions (5), 293–308 (2001)
-  Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016)
-  Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International conference on machine learning. pp. 1928–1937 (2016)
-  Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. nature 518(7540), 529–533 (2015)
-  Mousavian, A., Toshev, A., Fišer, M., Košecká, J., Wahid, A., Davidson, J.: Visual representations for semantic target driven navigation. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 8846–8852. IEEE (2019)
-  Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
-  Savva, M., Chang, A.X., Dosovitskiy, A., Funkhouser, T., Koltun, V.: Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931 (2017)
-  Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied ai research. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9339–9347 (2019)
-  Sax, A., Zhang, J.O., Emi, B., Zamir, A., Savarese, S., Guibas, L., Malik, J.: Learning to navigate using mid-level visual priors. arXiv preprint arXiv:1912.11121 (2019)
-  Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
-  Sethian, J.A.: Fast-marching level-set methods for three-dimensional photolithography development. In: Optical Microlithography IX. vol. 2726, pp. 262–272. International Society for Optics and Photonics (1996)
-  Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe, R.: The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)
-  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
-  Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv pp. arXiv–1911 (2019)
-  Wortsman, M., Ehsani, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6750–6759 (2019)
-  Wu, Y., Wu, Y., Tamar, A., Russell, S., Gkioxari, G., Tian, Y.: Learning and planning with a semantic model. arXiv preprint arXiv:1809.10842 (2018)
-  Xia, F., R. Zamir, A., He, Z.Y., Sax, A., Malik, J., Savarese, S.: Gibson env: real-world perception for embodied agents. In: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE (2018)
-  Xia, F., Shen, W.B., Li, C., Kasimbeg, P., Tchapmi, M.E., Toshev, A., Martín-Martín, R., Savarese, S.: Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters 5(2), 713–720 (2020)
-  Yang, W., Wang, X., Farhadi, A., Gupta, A., Mottaghi, R.: Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543 (2018)
-  Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3712–3722 (2018)