Log In Sign Up

NavigationNet: A Large-scale Interactive Indoor Navigation Dataset

by   He Huang, et al.
Shanghai Jiao Tong University

Indoor navigation aims at performing navigation within buildings. In scenes like home and factory, most intelligent mobile devices require an functionality of routing to guide itself precisely through indoor scenes to complete various tasks in order to serve human. In most scenarios, we expected an intelligent device capable of navigating itself in unseen environment. Although several solutions have been proposed to deal with this issue, they usually require pre-installed beacons or a map pre-built with SLAM, which means that they are not capable of working in novel environments. To address this, we proposed NavigationNet, a computer vision dataset and benchmark to allow the utilization of deep reinforcement learning on scene-understanding-based indoor navigation. We also proposed and formalized several typical indoor routing problems that are suitable for deep reinforcement learning.


The AdobeIndoorNav Dataset: Towards Deep Reinforcement Learning based Real-world Indoor Robot Visual Navigation

Deep reinforcement learning (DRL) demonstrates its potential in learning...

MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

We present MINOS, a simulator designed to support the development of mul...

Localization and Navigation System for Indoor Mobile Robot

Visually impaired people usually find it hard to travel independently in...

Embodied Navigation at the Art Gallery

Embodied agents, trained to explore and navigate indoor photorealistic e...

Indoor Navigation Using Information From A Map And A Rangefinder

The problem of indoor navigation of mobile objects, using a map and meas...

Robust Neural Routing Through Space Partitions for Camera Relocalization in Dynamic Indoor Environments

Localizing the camera in a known indoor environment is a key building bl...

IBeaconMap: Automated Indoor Space Representation for Beacon-Based Wayfinding

Traditionally, there have been few options for navigational aids for the...

1 Introduction

Indoor navigation aims at preforming navigation within buildings, such as train stations, airports, shopping centers, offices and museums. Most intelligent devices require an indoor routing functionality to guide itself precisely through rooms to complete various tasks. In most of the scenarios, we expect an intelligent device capable of navigating itself in new environments to serve human.

In the past years, several solutions have been proposed to deal with this problem [1]

. Although they can be applied in some special cases, none of them can work smartly in novel environments or are based on scene understanding. Those solutions require either dense pre-installed beacons (e.g. WiFi/Bluetooth chips) over the scene or a map pre-built with SLAM in which human workers are required to carry cameras to scan the scene in advance. Such requirements significantly limit the applications of these solutions. New scenes have to be well prepared before the robots can work properly in it, which makes it impossible to be put into large-scale application in various indoor scenarios. Additionally, neither beacons nor maps are helpful in semantically understanding indoor scenes and thus will lead to the failure of the robot to perform many practical tasks. For example, to complete a regular command like ’to take a cup’, a housekeeping android should understand what the scene is like, what a cup is and where a cup usually be, to navigate properly.

We expected a more general solution of indoor navigation problems that requires no preparation of the scene in advance. Such a high demand requires a more human-like robot that is much more intelligent to implement self-exploration in novel environments. Recent years, we have witnessed the rapid development of deep reinforcement learning (DRL) [2, 3] and computer vision [4, 5]. Therefore, smart indoor navigation enabled by DRL with visual inputs has been introduced [6]. In this setting, pre-installed beacons or pre-built map are no longer required. A smart robot with vision will be enough to do the navigation job in different scenes.

To apply reinforcement learning methods on navigation problems, massive trial-and-error is necessary in training the model. A straight-forward way is to construct a real robot and train Reinforcement Learning (RL) models on real-world scenes. However, such process is extremely slow and costly. Additionally, it is often accompanied by robot collision and damage, which means that more time and resources would be wasted. Even worse, due to the bottleneck of physical movements, to execute any action in the real world requires at least several seconds, which is unacceptable for any machine learning methods since that they usually require millions of times of trial-and-error. Aside from those drawbacks, such a process raises the problem of reproducibility. Even if all the background settings of a research are published in detail, it is still very hard for third parties to reproduce the experiment result since it is costly and time-consuming to find or construct a new environment that is identical to the one the original author used. That makes it impossible for third parties to examine and evaluate a new research result.

We believe, an open-source, low-cost, large-scale dataset and corresponding benchmarks would be a key to largely advance this research field. Some previous work has tried to train the robot in computer graphics (CG) scenes like AI2-THOR[6], SUNCG[7]. However, we do have noticed that training models on CG environments raised the problem of model transferring as the gap between hand-crafted scenes and the real world is significant. The visual appearance of images rendered by CG system always look non-realistic. Existing 3D models are limited in representing real-world scenes and objects.

To deal with this problem, we proposed NavigationNet, a large-scale, real-world, interactive dataset. In each scene, we use human-size robot with 8 cameras to capture photos towards 8 different directions at all walk-able positions. Given all the collected images, we can easily tour the room like a CG system by moving among those walk-able positions. However, different from CG systems, what we observed from this dataset are all real images captured from the real world. Therefore, we can virtually command a robot to walk in the scene and get the first-perspective view of the robot. Our dataset is interactive. That is to say, as we send a command of movement (e.g. moving forward, turning right), the robot will perform as required and return a new view it sees from the new position. To our best knowledge, our NavigationNet is the first real-world indoor scene dataset that is interactive in computer vision field. Our dataset covers about 1500 indoor area with various bedrooms, studies, meeting room and etc. This makes the bias less in experiments on our dataset.

Based on NavigationNet, we proposed four applications that require indoor navigation. We will explained them in detail in chapter 5.

To summarize, we presented a large-scale dataset consisting of visual data collected from real scenarios to allow a low-cost, reproducible training of indoor navigation robots based on reinforcement learning methods. We also proposed and formalized several practical indoor navigation tasks in the framework of reinforcement learning.

2 Related Work

In this chapter, we will discuss some of the previous works that are related to our work or act as a tool in this project. First we will talk about the technology of Visual Semantic Planning. A brief introduction to the popular datasets in the field of computer vision, after which we will explain three techniques used in our work in detail: Reinforcement Learning, Convolutional Neural Network(CNN) and Simultaneous Localization and Mapping.

2.1 Visual Semantic Planning

Fundamental tasks in mobile robot controlling include navigation, mapping, grasping, path planning and so on. Such task of interacting with a visual world and planning a sequence of actions to achieve a certain goal is addressed as Visual Semantic Planning [8]. A series of work [9, 10, 11, 12] has addressed the task-level planning and learning from dynamics. Classical approaches [13, 14] of motion planning require building map purely geometrically using Light Detection and Ranging (LiDAR) or Structure from Motion (SfM) [15, 16, 17]. Srivastava et al. [18] provides an interface between task and motion planning, such that the task planner can effectively operate in an abstracted state space that ignores geometry. Recently proposed end-to-end learning-based approaches [19, 10, 20] can go directly from pixels to actions. For instance Zhu et al. [19]

adopt feed-forward neural network to target-driven visual navigation in indoor scenes. Gupta

et al. [10] use online Cognitive Mapping and Planning (CMP) approach for goal direction visual navigation without requiring a pre-constructed map. Jang et al. [9] train a deep neural network to perform the semantic grasping task inspired by the ”two-stream hypothesis” of human vision. In terms of Complete Path Planning, Kollar et al. [21] use reinforcement learning to find trajectories and the learnt policy transfers successfully to a real environment.

2.2 Datasets for Computer Vision

Datasets and corresponding benchmarks have played a significant role in many areas such as computer vision and speech recognition. The evolution from WordNet[22]

to ImageNet

[23] was one of the many successful examples that proved the power of an effective dataset to accelerate the development of one area. On the contrary, the lack of appropriate data has become one of the most crucial challenges for deep reinforcement learning, especially when it is dealing with real world problems like visual semantic learning. Since training and quantitatively evaluating DRL algorithms in real environments is either impractical or costly, this problem has become more urgent. For RL algorithms dealing with virtual-world problems, the Arcade Learning Environment (ALE) [24] exposed Atari 2600 games as reinforcement learning problems. Works such as Duan et al. [25] and Brockman et al. [26] propose toolkits to qualify progress in reinforcement learning. Other benchmarks [27, 28, 29, 30, 31], simulators [32] or physics engines [33] are also designed for the development of deep reinforcement learning. Also, scientists are trying to model the real world for DRL or other visual tasks. Chang et al. [34] provides 90 building-scale reconstructed scenes for supervised and self-supervised computer vision tasks such as keypoint matching, view overlap prediction semantic segmentation, scene classification and so on. AI2-THOR[19], SUNCG[7] and House3D are all CG scenes designed especially for deep reinforcement learning. The advanced version of AI2-THOR[35] even provides the functionality to let the robot directly interact with the objects in the scenarios. Different from our NavigationNet, these datasets are either synthetic or reconstructed using computer graphics techniques, which introduce inconsistency distribution between the real-world scenarios and the produced ones.

2.3 Reinforcement Learning

Reinforcement learning (RL) [36] method was proposed in the late 90s. It provides a technique to allow the robot or any other intelligent systems to build a value function to evaluate policies in given situations from an interactive way of trail-and-error. Pioneering works from Mnih et al. [37], Lillicrap et al. [38], Schulman et al. [39] and Silver et al [40]

introduced and ignited the idea of join Reinforcement Learning and Deep Learning into Deep Reinforcement Learning.

In learning complex behavior skills and solving challenging control tasks in high-dimensional raw sensory state-space, DRL methods have shown tremendous success [40, 41, 3, 42]. Trust Region Policy Optimization (TRPO) [41] makes a series of approximations to the theoretically-justified procedure and has robust performance on a wide range of tasks. Asynchronous Advantage Actor-Critic (A3C) Algorithm [3], which uses asynchronous gradient descent for optimization, succeeds on task of navigating random 3D mazes as well as a wide variety of continuous motor control problems.

UNsupervised REinforcement and Auxiliary Learning (UNREAL) [43] brings together the A3C framework with auxiliary control tasks and auxiliary reward tasks. Actor Critic using Kronecker-factored Trust Region (ACKTR) [42] Algorithm uses a Kronecker-factored approximation to natural policy gradient that allows the covariance of the gradient to be inverted efficiently. Proximal Policy Optimization (PPO) [44]

Algorithms use multiple epochs of stochastic gradient ascent to perform each policy update. Reinforcement Learning (RL) provides a powerful and flexible framework to several applications. Andrew Y. Ng

et al. [45] describe a successful application of autonomous helicopter. Finn et al. [12]

present an approach that automates state-space construction by learning a state representation directly from camera images with deep spatial autoencoder. Heess

et al. [46]

explore a rich environment can help to promote the learning of complex behavior. They normalize observations, scale the reward and use per-batch normalization of the advantages. Gu

et al. [47] demonstrate a DRL algorithm based on off-policy training of deep Q-function can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. Denil et al. [48] find DRL methods can learn to perform the experiments necessary to discover properties such as mass and cohesion of objects.

2.4 Convolutional Neural Network

Deep convolutional neural networks [49, 50] have led to a series of breakthrough for visual tasks. He et al. [5] present a Residual Network (ResNet) to ease the training of deep neural network. We extract feature map with ResNet 101. Huang et al. [51] presents Dense Convolutional Network (DenseNet) which connects each layer to every other layer in a feed-forward fashion. Chen et al. [52]

propose Dual Path Net (DPN) by revealing ResNet and DenseNet within the higher order recurrent neural network framework. Deep Learning models have been successful in learning powerful representations and understanding stereo context

[53, 54]. Shaked et al. [55] present a three-step pipeline for the stereo matching problem and a network architecture (ResMatch) for computing the matching cost at each possible disparity. Kendall et al. [56] propose an end-to-end model (GC-Net) for regressing disparity from a rectified pair of stereo images.

2.5 Simultaneous Localization and Mapping

Visual SLAM plays an important role in autonomous navigation of mobile robots [57]. The emergence of MonoSLAM [58] makes the idea of utilizing one camera become popular. Parallel Tracking and Mapping (PTaM) [59] uses an approach based on keyframes with two parallel processing threads. ORB [60] becomes a computationally-efficient replacement to SIFT keypoint detector and descriptor that has similar matching performance. Keyframe bundle adjustment outperforms filtering, since it gives the most accuracy per unit of computing time [61]. S-PTAM [62] allows accuracy improvement of the mapping process with respect to monocular SLAM and avoiding the well-known bootstrapping problem. ORB-SLAM2 [63] is a complete SLAM system for monocular, stereo and RGB-D cameras. RGB-D SLAM [64] can producing high quality globally consistent surface reconstruction with only a low-cost commodity RGB-D sensor. Mur-Artal et al. [65] propose a visual-inertial monocular SLAM which is able to close loops and reuse its map to achieve zero-drift localization in already mapped areas.

3 NavigationNet

NavigationNet is specifically designed for applying reinforcement learning methods to indoor navigation tasks. We collected data from the real world and organized it properly to allow the robot to roam in the dataset as if they are in the corresponding real-world scenario. In this chapter, we will explain its organization and applications in detail to show you the world of NavigationNet.

3.1 Data Organization

Figure 1: NavigationNet Structure

Images in NavigationNet are organized hierarchically. The hightest node is the root of NavigationNet. The second level nodes, also the primary elements, are scenes. A scene is a collection of images collected from the same indoor space. In the next level, a scene is divided into rooms, space connected with doors. In the current version of NavigationNet, we have 15 scenes, each with 1-3 rooms. The origin room for each scene is at least 50 in area.

The fourth level is called postion. A position is from where images are collected. A room contains hundreds of postions. When being trained, the robot could move among the adjacent positions to ’see’ from that perspective. When constructing the dataset, our mobile robot iterated over all the walkable points in the room with the granularity of 20cm. Hence, in the dataset, two adjacent positions are 20cm away from each other. With such a granularity, one room usually contains thounsands of positions.

From the perspective of information, a position can be seen as a bundle of information that the robot will receive when it is placed at that position. Usually, we offer 8 images in this bundle. These images are divided into four directions: front, back, left and right. Towards each direction, two parallel images are taken to complete a binocular vision system. Such a setting enables the possibilities like stereo matching and panorama reconstruction.

In addition to the images, we also provide a ground-truth map attached to the scene. The map can be considered as a binary map indicating where is walkable or non-walkable. This is essential especially to the Auto-SLAM problem.

3.2 Robot Moving Control

To satisfy the different requirements of potential tasks, we tried our best to provide a more generalizable moving control SDK for NavigationNet.

As is stated in section 3.1, the fundamental element in NavigationNet is scene, virtual robots are allowed to roam in scenes. To serve well as an action in a reinforcement learning task, movements should be standard, discrete, limited and complete.

  • An action should be standard, so that the robot should move the same distance towards the same direction or turning the same angle when a specific instruction of a movement is given.

  • An action should be discrete, so that the action space can be finite and discrete.

  • Actions should be limited, so that the algorithm do not need to choose actions from an infinite action space.

  • Actions should be complete, so that all the actions joined together could describe the whole real-world action space.

In practice, it is impossible to use a finite action space to cover an infinite one. We need some sort of compromise. The one we take in this case is to discrete the movement length and the turning angle. So that we defined six types of movements.

  • MOVE FORWARD Ask the robot to move forward one position. For example, go from (9, 12) to (9, 13) when facing north but go from (11, 8) to (10, 8) when facing west. Since two adjacent nodes are 20cm away, this is to move forward 20cm in real-world scenes.

  • MOVE BACKWARD Ask the robot to move backward one position. This is the opposite movement of MOVE FORWARD and can revert the effect of MOVE FORWARD.

  • MOVE LEFT Ask the robot to move left one position. For example, go from (5, 4) to (4, 4) when facing north but go from (7, 4) to (7, 3) when facing west. The real world distance would be the same as MOVE FORWARD and MOVE BACKWARD. It should be paid attention to that MOVE LEFT is very different from TURNING LEFT as it will move the position of the robot but will not change the facing direction.

  • MOVE RIGHT Ask the robot to move right one position. This is the opposite movement of MOVE LEFT and can revert the effect of MOVE LEFT.

  • TURN LEFT Ask the robot to turn left at 90-degree. For example, when the robot receives this request when at (5, 7) facing north, it should then facing west still at (5, 7). It should be paid extra attention that this movement is different from MOVE LEFT that the former changes the facing while the latter change the position.

  • TURN RIGHT Ask the robot to turn right at 90-degree. This is the opposite movement of TURN LEFT.

Figure 2: Movements

At the same time, we should make it clear that, in any given tasks, according to the specific setting, not all the actions (movement) need to be taken into consideration. For example, when we would like to simulate a two-wheel one-eye robot, TURN LEFT and TURN RIGHT are a must, otherwise seventy-five percent of the images can never be perceived while MOVE LEFT and MOVE RIGHT should be eliminated since a two-wheel robot can not make the exact movement of moving left 20cm. Meanwhile, when simulating a robot with panorama vision, we could only take MOVE LEFT and MOVE RIGHT but not TURN LEFT and TURN RIGHT as they are not necessary but can only enlarge the action space unnecessarily and make the task more difficult.

4 NavigationNet Construction

NavigationNet is a large-scale dataset with hundreds of thousands of images. The large amount makes the data collection a task of impossible. What is worst is that, as to each image, it is required that the position, angle and height where it is taken should be exactly where it is expected so long that the simulation of reality will not offset. These two factors together make the task of collection much more difficult than expected. In this chapter, we will talk about how we constructed this dataset efficiently and accurately.

4.1 Collector Mobile Robot

Robots are designed to liberate human beings from the repetitive boring work. NavigationNet is built for intelligent robots, also built with the help of intelligent robots. Collecting data for NavigationNet requires heavy labor and high accuracy but little flexibility (most of the possible conditions are under control), which is surprisingly suitable for robots.

In order to reduce the labor requirement, we exploited the possibilities of smart hardware. Our team developed a dedicated data-collecting mobile robot with Arduino Mega2560 and Raspberry Pi 3 Model B codenamed GoodCar.

Arduino[66] is an open-source physical computing platform based on a simple I/O board and a development environment that implements the Processing/Wiring language. Arduino boards can be programmed in the same name program language Arduino, which is a C-like area-specific language to receive and send signals from tens of low-voltage I/O ports. It is specifically suitable for robot controlling. In this project, we use a Arduino Mega2560 to control the movement of the robot.

Raspberry Pi[67] is a cheap single-board computer system running a specificialy made Linux distribution. It is originally built for computer programming education but we find it suitable for robot controlling. It consumes little electricity which allows us to strip the 220V power wires. It contains all kinds of I/O ports such that we could plug it with Arduinos, cameras, other full-sized computers and many other modules to make it the center of the system. It runs standard linux distribution such that the possibility of software extension is beyond imagine. Last but not least, it is cheap so that we could us as many as we want on the robot. In practice, we use two Raspberry Pi 3B for controlling all other modules and communicating with the upper computer.

To build this robot, we used a two-level structure. The upper-level is as high as 1.4m to simulate the the height of the eyes of an average adult. We plugged-in eight cameras on this level, two for each direction to make a stereo vision system and to avoid unnecessary turning-around.

The lower-level is for the motion systems. The Raspberry Pi and Arduino mentioned above are all on this level. The Arduino is to control the mobile devices and sensors. It is linked with four motors to control the movement. In the meantime, it is linked with a sonar and code plate counters to avoid collision and measure the distance of movement. Also it is linked with a Raspberry Pi and under its control. We used two Raspberry Pi 3B in each robot, one is master and the other is slave. The master computer controls many things. First it communicates with the Arduino to control the movement indirectly. Second it is responsible for evaluating the signal Arduino sent back to avoid collision. Third it is master computer’s duty to control four of the cameras to take photo at given position and store the images on the SD card. Fourth the slave computer is also under its control to take photos when required. Lastly it communicates with the upper computer, which is controlled by human, via Wi-Fi connection. That is the control core of our robot.

Under the lower-level is the mobile devices. We have tried on many types of mobile devices and at last we used four track structures to reduce error.

We had constructed six mobile robots before moving on to collect data.

Figure 3: Robot

4.2 Data Collection

To make the data collecting process efficient and accurate, also to make the whole process manageable and under control, we designed and followed the following instructions:

  1. Measure the size of the target room(s). Find the most southwestern corner as the starting point.

  2. Move the robot to the starting point. Evaluate and record its distance towards the borders.

  3. Let the robot face north and start to collect required photos at this point.

  4. After photos are taken, let the robot run 20cm towards north and collect data.

  5. Keep iterating over Step 3-4 until the robot reaches the end of the line.

  6. Move the robot to the starting point of the line and then move 20cm towards east.

  7. Iterating over Step3-6 until the robot reaches the east corner.

  8. Complete this scenario by taking photos from those points that the robot did not reach.

4.3 Data Processing

After the collection, the original data should be pre-processed before taken into production. The data processing steps we taken includes:

  • Data Cleaning When collecting data with robot, there are always inevitable errors occurring especially when dealing with room corners. For each room we collect, we will scan over all the collected photos and find errors. If necessary, we would go and retry collecting data to complete the set.

  • Map Building When collecting data, we also recorded whether a point is accessible. With this information, we are able to rebuild a walkable-or-not map for reference.

  • Stereo Vision For some of the task, to simulate a two-eye system or a RGB-D system, it would be of much help if we could produce depth channel in the very beginning. To serve this purpose, we use MC-CNN[54] to produce some pre-processed depth channel data.

5 Applications

NavigationNet is designed for indoor navigation problems. In this chapter, we would introduce four applications base on our dataset and formalize them with the idea of reinforcement learning. Apart from these four applications, we believe our dataset will spawn more applications.

5.1 Target-driven Navigation


In the real scenarios, as for robots, the key problem of indoor navigation is not about going to some place but finding some specific item. For example, as a home android, we do not ask it to get to the kitchen safely but to fetch the milk for the master. Also, as a robot in a warehouse, the ability to go to some specific location is never more important than finding a given good. To achieve this goal, a poorly trained robot control system may have to traverse over everywhere it can reach to search target object, which is neither safe nor efficient, since it has no prior knowledge like milk should be in the kitchen, or tables are usually aligned to the wall. This usually falls in the field of visual semantic learning. That is to say, to be well prepared for this kind of tasks, the robots should be able to understand the scene, knowing, for example, what the object it perceived is.

Figure 4: Target-driven Navigation(Upper: Random walking, Lower: DRL-based)


The formalization of this issue is inspired by Yuke Zhu et al.[6]. To complete the task of target-driven navigation, a robot control system is given two images. One is the target image, the other is the current first-person perspective image perceived from the environment. The robot can choose to walk forward, backward, turn left or right according to the input images. The environment will update the states accordingly and give out a reward. This process loops until the target image is the same with the perceived one.

To formalize that:

  • Goal The goal is straight-forward, to find the given object. In practice, it can be converted into make the perceived image and the given image identical.

  • Reward Different from the the goal, the design of the goal is not that straight-forward and requires more discussion. One thing that is sure is that a large positive reward will be given when the goal is achieved. To improve time efficiency, each step it taken should be given a small negative reward. In this paper, we use the setting of one large positive reward at the goal and minor negative in each step.

  • Action Space To find an object, the understanding of the scene is the most important factor of the task. Thus we would like to simulate a one-eye robot in this problem and the Action space would be defined as { MOVE FORWARD, MOVE BACKWARD, TURN LEFT, TURN RIGHT }

  • State Space In this problem, we need not only the current perception as input but also the target image. Hence the State space would be the set of all possible image pairs.


To evaluate such a system, many aspects should be taken into consideration. The most important two are, trajectory should be short and the robot should not run into the obstacles. Collisions should be limited strictly as only one collision could greatly damage the robot. To make a trajectory finite, we consider a trajectory longer than 10000 a loop. To calculate a overall score on all the test results, we utilize the idea of histogram. A success trajectory with steps and is a vote weighting to . A overall score is calculated by finding the average over this histogram.

5.2 Multi-target Navigation


The target-driven navigation task is focused on finding one object efficiently and safely. However, in practical scenarios, this is often not enough. In many cases, we need to find more than one items and we do not care about the order. For example, in the last example about collecting goods from the different rooms and then putting them all into a box, although it is required that the robot should get to the box in the very end, we do not really care about which item it goes to get first. This has become a rather different problem since it usually requires the knowledge of adjacency. For example, milk is usually placed near breads while books are usually far away, so although the item order in the list maybe milk, books and breads, robot should take reads instead of books first to advance the efficiency.

Figure 5: Multi-target Navigation (Left: Random walking, Right: DRL-based)


We also formalized this problem into a reinforcement learning problem. To find all the given items, the robot is given a list of items to collect, each one is represented by an image. The rest will be similar with the target-driven task. A first-person perspective image is given, an action is executed and then state is updated, reward is given.

To formalize that:

  • Goal The goal is also straight-forward, that the agent should find all the required object no matter the order.

  • Reward The reward design would be similar with the target-driven navigation task though we would like to the finding of each object a separated but relatively smaller reward. We believe that such a design would be better for deep reinforcement training[43]. Also a minor negative time time punishment is necessary.

  • Action Space Since the task type is similar with the target-driven problem. The same action space remains. { MOVE FORWARD, MOVE BACKWARD, TURN LEFT, TURN RIGHT }

  • State Space Different from the previous task, this task requires multiple images of objects to be perceived as targets. However, a set images with unknown-length is hard to be processed. Thus in this situation, we would limit the number of target inputs to 2 and the space of states would be the set of all possible package of three images


Although the task is different from the previous task, the two major meters are the same. Hence, we use the same metric system in this task as is in the target-driven navigation task.

5.3 Sweeper Route Planning


Apart from target-driven navigation, there are also other valuable tasks in the field of indoor navigation. One widely-discussed but never well-addressed problem is the sweeper route planning problem. Contrary to the previous target-driven tasks, a sweeper robot should traverse over all the possible places in a given room without collision and less repetition. The current solutions often lead to stuck at some small space. Besides sweeper robot, it is general problem we will meet in many scenarios.

This problem is usually considered as coverage path planning (CPP) problem, which is usually dealt with using Boustrophedon method. However, the sweeper routing planning problem using visual semantic learning is actually very different from CPP problem. The most important factor is that the robot will not know that some movement is blocked until the collision really happens. It must see and understand to avoid collisions.

Figure 6: Sweeper Route Planning (Upper: Random walking, Lower: DRL-based)


Again, we formalized this problem into a reinforcement learning task. At any point, the robot is given an image, the first-person perception of the robot and an action should be given to the environment indicating which way the robot is going the next time slice. The environment should give out the reward then according to how much area has been covered by the sweeper. Typically, the robot should maintain a map of the room inside its system.

To formalize that:

  • Goal As a sweeper robot, surely the goal is to cover all the points without obstacles in least time. However, this goal is straight-forward but practically impossible to achieve as that many corners are hard to get and there is no need to sweep and to achieve a least time path it usually requires the god’s perspective.

  • Reward Reward design is simple in this problem. Since we do not pursuit a real all-cover path, we should reward every new position it gets and punish every fruitless movement. That is to say, we should give a large positive reward for a new position and a minor negative reward for an old position.

  • Action Space In this task, a sweeper does not really care about which direction it is facing. The only thing it cares is that whether the place it is positioned and it will be positioned have been cleaned (went before). So we will not give the robot the ability to turn left/right. As the result, the action space shrinks to { MOVE FORWARD, MOVE BACKWARD, MOVE LEFT, MOVE RIGHT }.

  • State Space Different form the previous two tasks, sweeper path planning problem does not require a specific target as an input so that the state space would be as simple as a set of all the images that the robot may see.


The target for a sweeper is to sweep as much area as possible as fast as possible though these two aspect are usually contradict to each other. To deal with this, we defined a score similar to mAP score[68] in object detection tasks to evaluate models. For the step in a trajectory, we define that in the first steps, points are covered. If the step is a new hit, a score of is given to the step, otherwise 0. An average over scores of all steps in a trajectory is the final score of the trajectory.

function mAP(steps)
     for  in  do
         if  is  then
         end if
     end for
end function
Algorithm 1 Mean Average Precision (mAP)

6 Auto-SLAM with Deep Reinforcement Learning


Simultaneous localization and mapping (SLAM) problem has been paid much attention to since it is introduced decades ago. Many solutions have been proposed to increase the mapping accuracy. However, due to the lack of smart robots capable of navigating itself in alien environments, robots for SLAM tasks are still human controlled, which limited the application of SLAM technology significantly in dangerous places. Additionally, applying SLAM for some large-scale scenes (e.g. Museum) is effort consuming, which requires us to carry camera walked about all corners. So, we hope that robots can preform SLAM autonomously. With the help of NavigationNet, we proposed a Auto-SLAM task, to train a robot capable of navigating itself in an alien environment and complete the SLAM task without collision.


This task is also formalized as a reinforcement learning problem. On each step, the robot is fed with images perceived at its current location. The system could use the visual inputs and history information to improve the map. Reward should be calculated according to the difference between the constructed map and the ground truth. An action to lead the robot should be given on the basis of the reward and the current map, upon which the perception is updated.

To formally put it:

  • Goal Surely the goal is to build a 3D reconstruction efficiently and accurately. Though this would become a much more sophisticated problem than the other three due to the difficulties in evaluating the correctness of a new reconstruction.

  • Reward Due to the difficulty of evaluation in SLAM problems, we would like to consider more about progress. By talking about progress, we are assuming total trust in legacy SLAM algorithms. We would like to give a positive reward for any new reconstructed area after each steps. In the meantime, a fruitless step is also punished with a minor negative reward.

  • Action Space Due to the problem which will be talked about in State Space part, the function of turning-around is no longer required. In the meantime, we need to add the function of moving left/right to the robot. Hence the action space would become { MOVE FORWARD, MOVE BACKWARD, MOVE LEFT, MOVE RIGHT }

  • State Space Unlike the former three problems, SLAM is a problem far more than path planning. Most of it is about 3D reconstruction. Thus only a first-person single-eye perception is far from enough. In this setting, the environment should provide stereo vision from the four direction with depth information.


SLAM tasks are divided into two part, mapping and positioning. Building maps is a way for positioning. Hence, we use the accuracy of positioning but not the of mapping as the benchmark. To evaluate a Auto-SLAM model, we ask the robot to run SLAM task in an alien enviroment with limited time. After that, 10 images are given as query to the positioning system. The score is calculated by average over the positioning success rate.

7 Conclusion

We believe, to advance a research field, a suitable training environment and corresponding benchmarks will be the best catalyst. In this article, we proposed NavigationNet, a large-scale, open-source, low-cost, real-world dataset for indoor navigation. By introducing this dataset, we hope to construct a platform where researchers can train and evaluate their‘ own robot controlling system without really constructing the robot. We extract robot controlling system out from robot itself. We hope to eliminate the long and high-cost preparation process before really training the system. Also, by constructing this dataset, we hope to make deep learning methods work on real robots. On NavigationNet, a robot will see as if it is in the physical world but the speed of action will become acceptable for massive trail-and-error.

We also proposed four possible tasks that can be conducted on our dataset. We are hoping that more tasks can be proposed to exploit its maximum potential.

Our future work includes increasing the number of scenes in the dataset. Also we will construct more attributes in the dataset (object segmentation for example).