Recognizing and Tracking High-Level, Human-Meaningful Navigation Features of Occupancy Grid Maps

03/08/2019 ∙ by Payam Nikdel, et al. ∙ Simon Fraser University

This paper describes a system whereby a robot detects and tracks human-meaningful navigational cues as it navigates in an indoor environment. It is intended as the sensor front-end for a mobile robot system that can communicate its navigational context to human users. From simulated LiDAR scan data we construct a set of 2D occupancy grid bitmaps, then hand-label these with human-scale navigational features such as closed doors, open corridors and intersections. We train a Convolutional Neural Network (CNN) to recognize these features on input bitmaps. In our demonstration system, these features are detected at every time step and then passed to a tracking module that does frame-to-frame data association to improve detection accuracy and identify stable unique features. We evaluate the system in both simulation and the real world. We compare the performance obtained with input occupancy grids built directly from LiDAR data, built incrementally with SLAM, and their combination.




I Introduction

Humans and robots can navigate by detecting and referring to distinctive features in their environment. When giving each other indoor navigation instructions, humans tend to refer to sparse structural features such as doors, corridors, staircases, etc. Most current robot navigation systems use very different features, such as occupancy grids in 2D metric space, corner features in image space, or nodes and edges in topological space. Recent work in semantic mapping allows robots to reliably identify features that are salient to humans. In this work we seek to bridge a gap in navigation representations between humans and robots, so that they can be shared in joint human-robot applications. Specifically, we train a CNN to take as input traditional occupancy grids and localize some structural features that humans use for navigation: open doorways, closed doorways, and corridor openings at intersections. As a demonstration, we track these over time to obtain a sparse feature map that could be announced verbally to a visually-impaired user to assist them in navigation.

For a robot navigating in a known environment, it is possible to provide it with a map annotated with these semantic labels. For example, using a prior annotated map containing the locations of hazards, doors and intersections, the robot can warn the user not to enter dangerous places while localizing itself in the environment.

However, if the robot is located in an unknown environment, we can exploit machine learning methods to recognize different environmental features. For instance, Capi [2] analyzed 2-dimensional (2D) LiDAR data using a clustering technique to detect obstacles, steps and stairs, which makes it possible to warn the user about collisions or hazards.

In this study, we build a system to describe the navigational options ahead of the robot using occupancy grids obtained from either the 2D LiDAR data, SLAM or their combination. To do so, we design a convolutional neural network (CNN) to detect open-rooms, closed-rooms and intersections around the robot in an unknown environment. A tracking module then associates the model's predictions frame-to-frame to locate the target classes more precisely and obtain a sparse metric map of semantic features. Together, these components can describe the environmental navigational cues for the user (e.g. open-room on the left).

This approach can be used as a real-time detector of navigational options around a robot. It can also be used offline to annotate unlabelled occupancy grid maps.

The contributions of this paper are:

  • A labelled dataset for training and evaluation;

  • A system that is able to detect multiple navigational options around the robot in real-time using 2D LiDAR data, SLAM maps or their combination;

  • A tracking module that does data association over time to maintain a sparse metric-semantic map;

  • An experimental evaluation of the system in both simulation and the real world.

Fig. 1: An example of a mobile robot describing the navigational options as it traverses through an indoor environment. Navigational options include the positions of open-rooms, closed-rooms or paths in an intersection.

II Related Work

There is a long history of robots and humans sharing spatial descriptions, going back to SRI's Shakey in the 1960s and 1970s [13]. There is a large body of work on robots that can understand natural human commands [19, 21, 23], while fewer studies have focused on robots that can translate their observations to descriptions or instructions. In work explicitly considering human-robot interaction systems, Skubic et al. [18] investigated spatial semantic models for human-robot dialogue where the robot describes the spatial relation of objects with respect to itself. Daniele et al. proposed a navigational guide system able to generate instructions for navigating from point to point given a known map using a sequence-to-sequence Recurrent Neural Network (RNN).

In the context of multi-agent systems where agents collaborate towards a goal, communication plays a significant role. Andreas et al. [1] formulated the problem of interpreting the policy of agents as translating their messages to human language. They build a translation model upon a similarity criterion to facilitate interpretation of communications between collaborative agents.

Various studies have used 2D [2, 5, 15] and 3D [7, 8] LiDAR scanners for object detection. Among these studies, 3D LiDAR scanners on mobile vehicles are able to collect more accurate and efficient 3D information about the surrounding environment. This type of mobile LiDAR scanning has gained popularity in road mapping studies. Guan et al. [8] surveyed recent studies that use LRF data to detect and extract road surfaces, on-road structures and pole-like objects. 3D LiDAR classification is also being used in ecological analysis. For instance, Guan et al. [7] proposed a tree classification method to classify different species of trees using mobile LiDAR data.

2D LiDAR data is often used to classify locations (e.g. corridors, doorways or different rooms) [5, 15] or detect big environmental objects (e.g. stairs, steps or obstacles) [2]. Goeddel and Olson [5] used CNN models to classify objects in the environment, taking 2D LiDAR data as the model's input. Likewise, in another study, Pronobis and Rao [15] propose a probabilistic framework using Sum-Product Networks (SPNs) and deep learning to classify locations in an autonomous robot application. Although these two studies are similar to the system we present in this paper, they only produce one target class per 2D LiDAR frame. By contrast, we detect multiple target classes from each data frame (occupancy grid obtained from either 2D LiDAR, SLAM or their combination) and we localize them with sub-pixel accuracy on the map, i.e. in navigation space. A tracking module then aggregates the frame-to-frame predictions to improve the accuracy and position estimates of detected target classes.

III Approach

Given a sequence of a robot's sensor data (2D LiDAR data, IMU and odometry), we create and annotate a map of the environment while describing the navigational options for the user. Our system can be used in a robotic system designed to guide a user in an unknown indoor environment. This system could provide feedback about the location of closed-rooms, open-rooms and corridors in an intersection. For instance, the robot can interact with the user by uttering sentences such as "Open-room on the right" or "corridors on the left and right" (for a three-way intersection with two navigational options excluding the corridor occupied by the robot). To achieve this, the robot should detect and track the positions of target classes as it navigates through the indoor environment.

III-A System Overview

Our system is implemented in the Robot Operating System [16] (ROS) and tested in simulation in Stage [20] and in the real world on a Clearpath Husky robot.

We investigate three variations of local occupancy grid maps, one obtained from a single LiDAR scan, one obtained by metric LiDAR/inertial SLAM, and their combination. The laser scans are obtained from a Sick LMS-111 LiDAR with 270° field of view. Laser scans are rendered into a 2D occupancy grid local-map, and we use the ROS implementation of grid mapping (GMapping) for SLAM [6] to build an occupancy grid map of the environment. We will refer to this map as the "GMap". This occupancy grid map is created incrementally using the odometry data, the inertial measurement unit (IMU) data and the 2D LiDAR data. By cropping the 16-by-16 meter local-map in front of the robot, we obtained the GMap local-map.
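As an illustration of how one laser scan can be rendered into a local-map, the following is a minimal sketch rather than the implementation used in our system. It assumes a robot-centred square grid, marks only the cell hit by each beam (free-space ray tracing is omitted), and uses hypothetical names and a hypothetical cell resolution.

```python
import math

FREE, OCCUPIED = 0, 1  # hypothetical cell values for this sketch


def scan_to_local_map(angles, ranges, size_m=16.0, resolution=0.1):
    """Mark the cell hit by each laser beam in a robot-centred square grid.

    angles are beam directions in radians (0 = straight ahead, positive = left),
    ranges are hit distances in metres. Both conventions are assumptions.
    """
    n = int(round(size_m / resolution))      # cells per side
    grid = [[FREE] * n for _ in range(n)]
    half = size_m / 2.0
    for theta, r in zip(angles, ranges):
        if not math.isfinite(r):
            continue                         # beam returned no hit
        x = r * math.cos(theta)              # forward offset in metres
        y = r * math.sin(theta)              # leftward offset in metres
        col = math.floor((x + half) / resolution)
        row = math.floor((half - y) / resolution)
        if 0 <= row < n and 0 <= col < n:    # ignore hits outside the local-map
            grid[row][col] = OCCUPIED
    return grid
```

A GMap local-map would instead be cropped from the global SLAM grid around the robot pose, so only the laser branch needs this rendering step.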

To navigate in the environment, the robot uses the ROS navigation stack along with a trajectory planning module based on the Dynamic Window Approach [4].

For our network model, we use a combination of convolutional, Batch Normalization and ReLU layers with residual connections (see Section III-D for more details). The inputs to our model are the laser local-map and the GMap local-map.

On top of this network, our system uses a tracking module that aggregates predictions from individual frames and improves the location accuracy of target classes. The tracked predictions are then passed to our describe module that narrates the navigational options to the user.

III-B Dataset

Fig. 2: Visualization of one frame of data during dataset generation. Left: visualization from the Stage simulator showing a robot (top) in a corridor of a map obtained from a real building, overlaid with ground-truth annotations of closed-rooms (green cylinders), open-rooms (blue spheres) and corridor intersections (purple cubes). Centre: corresponding 2D LiDAR data in RViz. Right: local occupancy map constructed by GMap SLAM.

We generate our own dataset containing the raw laser data (an array containing the angle of each hit and its distance from the LRF), the current local-map (a 16-by-16 meter map in front of the robot) from the GMap, and positions and labels of the target classes (navigational objects). To gather the data, we used the Stage robot simulator [20] loaded with real world occupancy grid maps. We include six preexisting occupancy grid maps [10, 11] (with small modifications) and two new ones that were created by performing SLAM in the Applied Science Building at Simon Fraser University. Figure 2 shows one frame of the dataset generation process using the Stage simulator.

For each map, we annotate the open-rooms, closed-rooms and the beginning of each corridor at all intersections. The total number of target labels in these eight maps is shown in Table II. To generate our dataset, we divide the maps into training and testing portions. Two of our maps are used only for training, one map only for testing, and the other five are divided into two parts. We then move the robot along predetermined paths in each map, recording the coordinates of the robot until it has explored each corridor in that map at least twice. These consecutive recorded coordinates form trajectories that are used to generate data. The distribution of labels per target class is shown in Table I.

This small amount of training data was extensively augmented to be sufficient for training. One unusual aspect is that we synthesized partially-open doors at various angles in doorways, since we required the model to robustly identify doorways irrespective of door angle despite large difference in appearance in a 2D top down view.

             Closed Room (%)   Open Room (%)   Corridor (%)   Total
Train              29               46              25        33316
Validation         29               46              25         6663
Test               28               39              33        15610
TABLE I: Percentage of labels per target class and total number of labels in the train, validation and test portions of the dataset.

Closed Room   Open Room   Corridor   Total
    232          121         95       448
TABLE II: Total number of annotations for each one of our target classes.
Fig. 3: Our network architecture is a fully convolutional model with residual connections. The inputs are two local-map images, one generated from 2D LiDAR and the other one from the GMap during run-time. The output is a tensor, which is the grid representation of predictions. Each grid cell contains the confidence score, coordinates and probability of target classes for that grid cell.

III-C Data Augmentation

To make the model more robust to unseen data, we apply data augmentation that includes rotating, translating and re-sizing the local-map images (both the laser and the GMap local-maps). We also add different orientations of the door for the open-room class.

III-C1 Rotation, Translation or Re-size

Fig. 4: Data augmentation process. Top: the laser local-maps, Bottom: the GMap local-maps. (I) the original 16-by-16 meter local-maps, (II) the cropped 8-by-8 meter local-maps centered at the robot, (III) translation of +1.6 meter in both x and y directions, (IV) rotation by -30°, (V) resize by 1.2 times and (VI) the obtained augmentation by applying these operation in order. (III) to (VI) are cropped 8-by-8 meter local-maps.

During training, whenever we fetch a GMap local-map or 2D LiDAR data from the dataset, we randomly rotate, translate or re-size them. Figure 4 shows the possible data augmentation operations. The resulting images are fed into our neural network.

The primary GMap local-map is saved with dimensions twice as large as the images the network needs. This extra margin keeps unknown pixels out of the final map after augmentation. After the operations, we crop the GMap local-map to an 8-by-8 meter map and resize it to the network's input size in pixels. The same operations (rotation, translation and re-size) are also applied to the laser scan data by first converting the 1D laser array to 2D local-map coordinates.
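The geometric part of this augmentation can be sketched on 2D points (for example laser hits or label positions in the robot frame). This is a minimal sketch under stated assumptions: the order translate, rotate, re-size follows Figure 4, while the random ranges and all function names are hypothetical.

```python
import math
import random


def augment_points(points, angle_deg, shift_xy, scale):
    """Translate, rotate about the robot, then re-size 2D points (metres)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    out = []
    for x, y in points:
        x, y = x + shift_xy[0], y + shift_xy[1]    # translation
        x, y = c * x - s * y, s * x + c * y        # rotation
        out.append((x * scale, y * scale))         # re-size
    return out


def crop_centre(points, size_m=8.0):
    """Keep only points inside the robot-centred size_m-by-size_m window."""
    half = size_m / 2.0
    return [(x, y) for x, y in points
            if -half <= x <= half and -half <= y <= half]


def random_augmentation(points, rng=random):
    """Draw one random augmentation; the ranges below are assumptions."""
    angle = rng.uniform(-30.0, 30.0)
    shift = (rng.uniform(-1.6, 1.6), rng.uniform(-1.6, 1.6))
    scale = rng.uniform(0.8, 1.2)
    return crop_centre(augment_points(points, angle, shift, scale))
```

The same sampled parameters would have to be applied to the laser points, the GMap image and the label positions so that annotations stay aligned with both inputs.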

III-C2 Adding New Doors' Orientations

Fig. 5: Four different orientations of four doors. Doors are annotated by red circles. During data augmentation, we add different orientations for the door panels to avoid over-fitting. We used a random angle between 30 and 100 degrees for open doors.

In this step, we create new maps by modifying the previous maps. First, we annotate the position and width of each door panel in open-rooms; then we remove the door panel from the occupancy grid maps. Next, we generate new maps by randomly adding open-rooms. For open-rooms, we assume that the angle between the door frame and door panel is between 30° and 100°. An example of four different doors’ orientations is illustrated in Figure 5.
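To make the door synthesis concrete, the sketch below samples points along a door panel that is swung open about its hinge. It is an illustration only: the hinge position, the direction of the door frame and the sampling step are hypothetical inputs, and the swing side is an assumption. Only the 30° to 100° range comes from the text above.

```python
import math
import random


def door_panel_points(hinge, frame_angle_rad, width_m, open_angle_deg, step_m=0.05):
    """Sample points along a door panel rotated away from the door frame.

    frame_angle_rad is the direction of the door frame seen from the hinge;
    open_angle_deg is the angle between the frame and the panel.
    """
    a = frame_angle_rad + math.radians(open_angle_deg)
    n = max(1, int(round(width_m / step_m)))   # number of segments along the panel
    return [(hinge[0] + width_m * i / n * math.cos(a),
             hinge[1] + width_m * i / n * math.sin(a))
            for i in range(n + 1)]


def random_open_angle(rng=random):
    """Draw an opening angle in the 30 to 100 degree range used for open doors."""
    return rng.uniform(30.0, 100.0)
```

The sampled points would then be written into the occupancy grid as occupied cells after the original panel has been erased.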

Fig. 6: Our network architecture is a fully convolutional model with residual connections. The inputs are two local-map images, one generated from 2D LiDAR and the other one from the GMap during run-time. The output is a tensor, which is the grid representation of predictions. Each grid cell contains the confidence score, coordinates and probability of target classes for that grid cell.

III-D Architecture

We trained three different models to predict the position of the target classes around the robot. All of these models use a similar network architecture but with different input data. We define these three models as follows:

  • Laser model, which uses only the laser local-map

  • Map model, which uses only the GMap local-map

  • Combined model, which uses both GMap local-map and laser local-map

The network architecture is presented in Figure 6. The first part of the network consists of two parallel networks where one extracts features from the GMap local-map and the other from the laser local-map. Each part of the parallel network consists of three ResNet blocks with residual connections. Each ResNet block contains convolutional, batch normalization and ReLU layers. In the next stage, the concatenation of these two parallel networks forms the input of the next two ResNet blocks followed by a convolutional layer.

III-E Implementation details

Our architecture is inspired by ResNet34 [9] and YOLO9000 [17]. The inputs of our network are two images of the surrounding environment, one rendered from the laser local-map and one from the GMap local-map. Inspired by YOLO's methodology, we divide each input image into a 5-by-5 grid. For each grid cell, we define six properties: a confidence score, two coordinates and the conditional probabilities of the three target classes. Therefore, the model predicts a 5 × 5 × 6 tensor, where the numbers along the third axis correspond to the predicted properties of each grid cell. Each six-element vector is responsible for detecting the object that resides in its grid cell. The confidence score C is the model's estimate of its certainty that the center of an object resides in this grid cell. The coordinates (x, y) are the position of the object's center relative to the top-left corner of the grid cell in the input image. The last three values are the conditional class probabilities P(class_i | object), i.e. the probability of each target class given that an object exists in this grid cell. In our case, the object classes are open-room, closed-room and the beginning of corridors. To calculate the final probability of each target class, we multiply its conditional class probability by the cell confidence score:

$$P(\text{class}_i) = P(\text{class}_i \mid \text{object}) \cdot C \qquad (1)$$
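The decoding of the grid output into detections can be sketched as follows. This is a minimal sketch, not our implementation: the ordering of the six values in each cell, the class order, the detection threshold and the convention that coordinates are fractions of the cell size are all assumptions made for illustration.

```python
CLASSES = ("closed-room", "open-room", "corridor")  # class order is an assumption


def decode_predictions(output, cell_size_px, threshold=0.5):
    """Turn a grid of [confidence, x, y, p_1, p_2, p_3] cells into detections.

    output is a nested list indexed as output[row][col]; x and y are assumed
    to be offsets from the cell's top-left corner, in units of one cell.
    """
    detections = []
    for row, cells in enumerate(output):
        for col, (conf, x, y, *cond_probs) in enumerate(cells):
            # Final probability: cell confidence times conditional class probability.
            final = [conf * p for p in cond_probs]
            best = max(range(len(final)), key=final.__getitem__)
            if final[best] < threshold:
                continue
            detections.append({
                "label": CLASSES[best],
                "probability": final[best],
                "x_px": (col + x) * cell_size_px,   # position in the input image
                "y_px": (row + y) * cell_size_px,
            })
    return detections
```

Because the offsets are continuous, the recovered position is not restricted to cell or pixel boundaries.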


To train our network to predict this tensor, we use a combination of Cross-Entropy loss and L2 loss (mean squared error). The Cross-Entropy loss is used to learn the conditional class probabilities, and the L2 loss is used to learn both the coordinates and the confidence score of each target class. Our final loss function is a weighted sum of these losses. Each loss term has its own weight; we fixed these weights after testing other values empirically and observing only minor variation in performance. All three models are trained using the Adadelta optimizer [22] with a batch size of 60 and randomly initialized weights on the training dataset. During training, we evaluate the model at each epoch on the validation dataset and keep the weights of the model that has the minimum total loss. These models were implemented with the PyTorch library [14] and were trained on an Nvidia GeForce GTX 1080 Ti GPU.
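One plausible form of such a weighted loss is sketched below in plain Python for a single frame. It is an illustration under assumptions, not the loss used for the reported results: the weight names and their default values are hypothetical, and restricting the coordinate and class terms to cells that contain an object is a YOLO-style assumption.

```python
import math


def detection_loss(pred, target, w_coord=1.0, w_conf=1.0, w_class=1.0):
    """Weighted sum of L2 and cross-entropy terms over a list of grid cells.

    Each cell is [confidence, x, y, p_1, p_2, p_3]; in target, the confidence
    is 1 for cells containing an object and the class entries are one-hot.
    """
    conf_loss = coord_loss = class_loss = 0.0
    for p, t in zip(pred, target):
        conf_loss += (p[0] - t[0]) ** 2                      # L2 on confidence
        if t[0] > 0:                                         # cell contains an object
            coord_loss += (p[1] - t[1]) ** 2 + (p[2] - t[2]) ** 2
            # Cross-entropy between one-hot target and predicted class probabilities.
            class_loss -= sum(tk * math.log(max(pk, 1e-12))
                              for pk, tk in zip(p[3:], t[3:]))
    n = len(target)
    return (w_coord * coord_loss + w_conf * conf_loss + w_class * class_loss) / n
```

In a PyTorch implementation the same structure would typically be built from tensor operations so that gradients flow through every term.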

III-F Tracking Module

Fig. 7: An example of local-maps when the robot approaches a closed-room. The closed-room is annotated by a red circle. The closed-room becomes clearer as the robot comes closer to the closed-room. In the beginning, the closed-room looks like an open-room but later (on the second frame for the laser local-map and the third frame for the GMap local-map) it can be distinguished as a closed-room.

In our application, a robot should be able to describe the navigational options for the user at the appropriate moments. Consequently, our system should be able to track the surrounding objects as the robot approaches them, and the robot should narrate the labels at the appropriate time. Our tracking module ensures that our system reliably tracks detected objects between frames as the local map changes.

When the robot is moving along a path and approaches a room or an intersection, the target object may be in the field of view of the robot for a few frames. In each frame, the robot sees the object with a slightly different shape. The reasons for these differences are the movement of the robot, error in the robot’s sensors and the incomplete map created by SLAM, which is being updated incrementally as the robot gets closer to the object. For example, a closed-room in some cases (especially at a distance) might look like an open-room (see Figure 7). Open-room and corridor also look similar in many cases. Thus, the network may change its prediction when the robot approaches an object. Moreover, when the shape of the object changes, the network finds slightly different coordinates for the labels. In other words, the system should not change the coordinates or class of an object based on a single network prediction. Instead, it should consider all the prediction labels and positions for that object.

Our tracking module works by first calculating and saving the position of the target labels on the GMap global map (conversion between different coordinate frames is done by the ROS TF package), and then clustering the nearest target points using the k-Nearest Neighbors algorithm (with a maximum distance of one meter). Next, for every target class in each cluster, we average over the final probability (obtained from Equation 1) and assign the most probable target to the cluster. Finally, the coordinates of each cluster are calculated by averaging the coordinates of the most probable target class for that cluster.
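The aggregation steps above can be sketched as follows. Note the substitutions: instead of a k-Nearest Neighbors search, this sketch uses a simple greedy clustering with the same one-meter distance limit, and it assumes each detection stores a final probability for every class. The dictionary layout and function names are hypothetical.

```python
import math


def cluster_detections(detections, max_dist=1.0):
    """Greedily group map-frame detections that lie within max_dist of a cluster centre."""
    clusters = []
    for det in detections:
        for members in clusters:
            cx = sum(m["x"] for m in members) / len(members)
            cy = sum(m["y"] for m in members) / len(members)
            if math.hypot(det["x"] - cx, det["y"] - cy) <= max_dist:
                members.append(det)
                break
        else:
            clusters.append([det])      # no cluster close enough: start a new one
    return clusters


def summarize_cluster(members):
    """Return (label, mean probability, mean position) for one cluster."""
    classes = members[0]["probs"].keys()
    mean = {c: sum(m["probs"][c] for m in members) / len(members) for c in classes}
    label = max(mean, key=mean.get)     # most probable class on average
    # Average only the positions of detections that voted for the winning class.
    voters = [m for m in members
              if max(m["probs"], key=m["probs"].get) == label] or members
    x = sum(m["x"] for m in voters) / len(voters)
    y = sum(m["y"] for m in voters) / len(voters)
    return label, mean[label], (x, y)
```

With this scheme a single early misclassification, such as a closed-room seen as an open-room from a distance, is outvoted once closer frames agree on the correct class.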

The tracking module relies on the global coordinate frame that is updated by the GMapping algorithm. Although the SLAM algorithm usually builds a reliable map, sometimes it shifts the global coordinate frame or part of the occupancy grid maps. The reason for this is the cumulative errors in the odometry of the robot or delays in the sensor’s data. These shifts add cumulative errors to the previous predictions of our tracking module. To reduce these cumulative errors, each point in our tracking module has a lifetime of 30 seconds after its detection.
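The 30-second lifetime amounts to a simple filter on the stored points. The sketch below assumes each stored point carries a detection timestamp in seconds under the hypothetical key `stamp_s`.

```python
def prune_expired(points, now_s, lifetime_s=30.0):
    """Drop tracked points whose detection is older than lifetime_s seconds."""
    return [p for p in points if now_s - p["stamp_s"] <= lifetime_s]
```

Running this filter before each clustering pass keeps stale points, which may have been placed before a map shift, from pulling cluster centres away from newer detections.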

Ultimately, using the most probable target label and position, the describe module explains the environment to the user with the following procedure. First, it finds the coordinates of the detected targets within five meters of the robot's position. Next, it calculates the angle between the robot-object line and the robot orientation. Finally, the describe module says the target class name followed by its position around the robot (e.g. closed-room left). The position of the object with respect to the robot is defined as follows:

  • left side, if the angle is between -120° and -50°

  • front, if the angle is between -30° and 30°

  • right side, if the angle is between 50° and 120°
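The describe procedure and the angle ranges listed above can be sketched as follows. The sign convention is an assumption: the angle is computed as robot heading minus bearing to the target, so that targets on the robot's left yield negative angles, matching the ranges above. Function and parameter names are hypothetical.

```python
import math


def describe(robot_xy, robot_heading_rad, targets, max_range_m=5.0):
    """Return phrases such as 'closed-room left' for targets near the robot.

    targets is a list of (label, (x, y)) pairs in the same frame as robot_xy.
    """
    phrases = []
    for label, (tx, ty) in targets:
        dx, dy = tx - robot_xy[0], ty - robot_xy[1]
        if math.hypot(dx, dy) > max_range_m:
            continue                                   # farther than five meters
        bearing = math.atan2(dy, dx)
        diff = robot_heading_rad - bearing             # assumed sign convention
        # Wrap the difference to the interval (-180, 180] degrees.
        angle = math.degrees(math.atan2(math.sin(diff), math.cos(diff)))
        if -120.0 <= angle <= -50.0:
            side = "left"
        elif -30.0 <= angle <= 30.0:
            side = "front"
        elif 50.0 <= angle <= 120.0:
            side = "right"
        else:
            continue                                   # outside all three sectors
        phrases.append(f"{label} {side}")
    return phrases
```

Targets that fall in the gaps between sectors, or behind the robot, are not announced in this sketch.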

IV Results and Discussion

We evaluate our methodology by conducting three experiments. The results of each experiment are compared across the three models described in Section III-D. These models are trained on the training dataset (see Section III-B), and the best performing model is selected by evaluating on the validation dataset. We tested the same trained models in all three experiments. For each experiment, we report the recall, precision and F1 score, which are three standard metrics for classification.
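For reference, the three metrics follow the standard definitions and can be computed from true-positive, false-positive and false-negative counts as in this short sketch (function names are hypothetical):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)
```

Because F1 is a harmonic mean, it is dominated by the weaker of the two values: a precision of 0.27 with a recall of 1 gives an F1 score of about 0.43.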

IV-A Experiment One

Fig. 8: Comparison between 2D LiDAR local-maps and GMap local-maps. The top row shows the GMap local-map; the middle row shows the laser local-maps and the bottom row is the screenshot of the robot in the Stage simulator. On the left side, there are three cases in which laser local-maps are more detailed; this usually happens in an unexplored map when the robot changes its orientation. The right side shows three cases where the GMap data is more detailed. This often occurs when the robot is in a previously explored place.
(a) Validation

(b) Recall
            Closed Room   Open Room   Corridor
Laser           0.84         0.91       0.77
Map             0.82         0.89       0.77
Combined        0.90         0.95       0.89

(c) Precision
            Closed Room   Open Room   Corridor
Laser           0.83         0.99       0.98
Map             0.74         0.99       0.97
Combined        0.84         0.98       0.99

(d) F1 Score
            Closed Room   Open Room   Corridor
Laser           0.84         0.95       0.86
Map             0.78         0.94       0.86
Combined        0.87         0.96       0.94

(e) Test

(f) Recall
            Closed Room   Open Room   Corridor
Laser           0.72         0.85       0.47
Map             0.68         0.68       0.46
Combined        0.78         0.86       0.48

(g) Precision
            Closed Room   Open Room   Corridor
Laser           0.82         0.99       0.96
Map             0.65         0.98       0.95
Combined        0.80         0.99       0.97

(h) F1 Score
            Closed Room   Open Room   Corridor
Laser           0.77         0.91       0.63
Map             0.67         0.80       0.62
Combined        0.79         0.92       0.64

TABLE III: Recall, precision and F1 score of three models running on the validation and testing dataset. The numbers indicate difficulties of detecting corridors (at the beginning of each intersection) compared to closed-rooms and open-rooms.

Experiment One evaluates the accuracy of our network without the tracking module. We run each of the three models on the testing dataset. Table III shows the recall, precision and F1 score of each class for all three models on the validation and the testing datasets. These results show that the system obtains the highest recall for open-rooms and the lowest for corridors. Open-rooms are probably easier to detect for several reasons. First, we train our network with different orientations of doors (for details see Section III-C), which makes it more robust to open-rooms. Secondly, door frames usually have similar dimensions across different buildings. Lastly, if the network has learned the relation between free spaces, it may easily detect an open-room as a narrow space that connects two wide open areas, as opposed to a closed-room, which is a small recess inside the wall. The poor recall for corridors is likely due to the variety of intersection types. For example, the size of intersection corridors can vary widely. In some cases, three-way intersections are very similar to a corridor with an open-room on the side. All the models are less precise at detecting closed-rooms compared to other classes. This might be due to some recesses in the wall being falsely identified as closed-rooms, resulting in lower precision.

We also observe that the F1 score of the Map model is lower than the other two models. We find that the GMap local-maps have fewer details compared to local-map from LiDAR data. Besides, when the robot is turning into an unknown area of the map, the occupancy grid map will only be updated a few frames later, because SLAM requires multiple measurements for each one of the obstacles in the LiDAR view. Although the GMap local-map may have fewer details compared to the laser local-map, a combination of them in the Combined model slightly enhances the F1 score in most cases.

We measure the run-time speed of our network, as it is essential for a robotic application to have a fast network that can predict in real-time. Our network can run as fast as 140 Hz (around seven milliseconds for each run) on a system equipped with a GeForce GTX 1080 Ti, or 12 Hz (around 83 milliseconds for each run) using an Intel i7-7700 CPU.

IV-B Experiment Two

In Experiment Two, we evaluate the addition of our tracking module on top of the model predictions. We also investigate the performance of our system in an unexplored environment versus a previously explored environment. We annotate new trajectories in our occupancy grid maps. These trajectories are chosen in a way to visit the whole map with minimal crossings (See Figures 9 and 10). As a result, the robot would see most of the map for the first time.

In the unexplored experiment, the robot follows a trajectory similar to the data generation phase. As the robot moves, SLAM GMapping updates the global occupancy grid map. In the explored experiment, the robot follows the same trajectory for the second time using an already explored map.

(a) Unexplored Maps

(b) Recall
            Closed Room   Open Room   Corridor
Laser           0.75         0.83       0.74
Map             0.63         0.79       0.61
Combined        0.79         0.84       0.69

(c) Precision
            Closed Room   Open Room   Corridor
Laser           0.83         0.94       0.95
Map             0.73         0.80       0.87
Combined        0.80         0.92       0.92

(d) F1 Score
            Closed Room   Open Room   Corridor
Laser           0.79         0.88       0.83
Map             0.68         0.80       0.72
Combined        0.80         0.88       0.79

(e) Explored Maps

(f) Recall
            Closed Room   Open Room   Corridor
Laser           0.81         0.85       0.75
Map             0.72         0.83       0.69
Combined        0.85         0.90       0.75

(g) Precision
            Closed Room   Open Room   Corridor
Laser           0.87         0.95       0.95
Map             0.82         0.87       0.98
Combined        0.86         0.93       0.93

(h) F1 Score
            Closed Room   Open Room   Corridor
Laser           0.84         0.90       0.84
Map             0.77         0.85       0.81
Combined        0.86         0.91       0.83

TABLE IV: Recall, precision and F1 Score for all three models for Experiment Two in the unexplored and the explored maps using test occupancy grid maps.

As the robot navigates in the environment, for each update of GMapping we use three models to predict the target classes around the robot. The tracking module then uses these predictions to update the accuracy and location of the target classes on the global occupancy grid map.

FR79 building (train and validation)
FR52 building (test)
Fig. 9: Prediction of our system for the FR52 and FR79 building maps. Closed-rooms predictions are annotated by green cylinders and open-rooms by blue spheres.
SAIC building (train and validation portion)
SAIC building (test portion)
Fig. 10: Prediction of our system for the SAIC building map. Closed-rooms predictions are annotated by green cylinders, open-rooms by blue spheres and the start of each corridor in the intersections by purple cubes. False-positives and false-negatives are marked over the predictions.

We evaluate the final results of our tracking module by finding the closest ground-truth point to every predicted object within a maximum distance of 0.5 meters. Because we use the LiDAR and GMap local-maps, the details of the obstacles in each local-map improve as the robot gets closer to them. Thus, an object can be mistakenly classified as an open-room from a distance, but as the robot comes closer, more detailed local-maps can lead to a correct prediction of a closed-room (see Figure 7).
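The matching step can be sketched as below. This is a minimal sketch with assumptions that go beyond the text: each ground-truth point is matched at most once, and a prediction whose nearest ground-truth point has a different class is counted as a false positive while that ground-truth point stays unmatched.

```python
import math


def match_to_ground_truth(predictions, ground_truth, max_dist=0.5):
    """Match (label, x, y) predictions to ground truth and return (tp, fp, fn)."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for label, x, y in predictions:
        best, best_d = None, max_dist
        for gt in unmatched:                  # closest ground-truth point in range
            d = math.hypot(x - gt[1], y - gt[2])
            if d <= best_d:
                best, best_d = gt, d
        if best is not None and best[0] == label:
            tp += 1
            unmatched.remove(best)            # each ground-truth point is used once
        else:
            fp += 1
    return tp, fp, len(unmatched)             # leftovers are false negatives
```

The resulting counts can be converted to precision, recall and F1 score per class by filtering both lists to one class before matching.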

As mentioned in Section III-F, the tracking module updates the probability of target classes using all of the model's previous predictions in that region. This module keeps updating the predictions until the object leaves the robot's field of view. Only then do we evaluate the predicted location.

If the robot tries to follow the same trajectory multiple times, the constructed local-maps (either the GMap or laser local-map) might be slightly different for the same region due to small errors and delays in navigation and performing SLAM. To address this difference, for each setting (the explored or unexplored), we repeat the experiment 30 times and report the averaged results.

Some examples of our system predictions for a few maps are shown in Figures 9 and 10. In these figures, we also annotated the false-positives and false-negative predictions of our system.

The recall, precision and F1 score for each target class for unexplored and explored maps are shown in Table IV. In all three types of input data, the explored maps show an increased F1 score compared to unexplored ones. Part of this increase is due to using a global frame as the origin of our tracking module. The tracking module uses the global GMap occupancy grid maps to do the frame-to-frame data association and subsequently improves the detection accuracy of the system. As a result, if we have a more stable GMap, the accuracy of tracking increases (the tracking module is provided with a more stable global map). In explored maps, the F1 score increase is more pronounced in the Map model, which is due to a more detailed GMap local-map. Figure 8 depicts a few examples comparing the laser and GMap local-map in the explored and unknown maps.

IV-C Experiment Three

Experiment Three evaluates our system in the real world. In this experiment, we use a Clearpath Husky robot equipped with a Sick LMS-111 LiDAR and an IMU sensor. The experiment is done in two corridors of the TASC1 building at Simon Fraser University. We moved the robot in these two corridors using a joystick while recording the robot's sensor data. These corridors include a total of 26 closed-rooms, ten corridors (corridors are paths in each intersection) and three open-rooms. The trajectories and environment maps along with the predictions of the "Laser" model are shown in Figure 11.


(a) Recall
            Closed Room   Open Room   Corridor
Laser           0.96         1.00       0.60
Map             0.81         0.67       0.40
Combined        0.77         1.00       0.20

(b) Precision
            Closed Room   Open Room   Corridor
Laser           0.96         0.27       1.00
Map             0.81         0.11       0.57
Combined        0.91         0.33       0.67

(c) F1 Score
            Closed Room   Open Room   Corridor
Laser           0.96         0.43       0.71
Map             0.81         0.19       0.47
Combined        0.83         0.50       0.31

TABLE V: Recall, precision and F1 Score for all three models for Experiment Three.

Table V shows the precision, recall and F1 score of each target class for our three models. The Laser model gives the highest F1 score for closed-rooms (0.96) and corridors (0.71), while the Combined model scores highest for open-rooms (0.50). The weaker performance of the Map and Combined models on closed-rooms and corridors is probably due to the increased delays and accumulated sensor error when generating the GMap in the real world. We also observe that all models have difficulty classifying corridors, which in some cases are classified as open-rooms. Although open-room recall is high (0.67 to 1), precision is much lower (0.11 to 0.33) because the network produces many false-positive predictions. These false positives may be caused by the additional error associated with real-world sensors.
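For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes the F1 entries of the Map row of Table V from the reported precision and recall; the function and variable names are ours and chosen only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for the Map model, taken from Table V.
map_row = {
    "closed_room": (0.81, 0.81),
    "open_room": (0.11, 0.67),
    "corridor": (0.57, 0.40),
}

map_f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in map_row.items()}
print(map_f1)  # {'closed_room': 0.81, 'open_room': 0.19, 'corridor': 0.47}
```

Because the harmonic mean is dominated by the smaller of the two values, the low open-room precision pulls the open-room F1 score down even when recall is high.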

(a) First part of the experiment
(b) Second part of the experiment
Fig. 11: Predictions of our system for the real-world (TASC-1 building) experiment using the Laser model. Closed-room predictions are annotated by green cylinders, open-rooms by blue spheres and the start of each corridor by purple cubes. False positives and false negatives are marked by symbols overlaid on the predictions.

V Conclusion and Future Work

In this paper, we presented a system that describes the navigational cues around a robot using a combination of 2D LiDAR data and occupancy grid maps. We trained a CNN to predict the closed-rooms, open-rooms and intersections around the robot. A tracking module aggregates the predictions to locate and classify the navigational cues more accurately. We evaluated our system in different settings in both simulation and the real world. In simulation, our results suggest that combining LiDAR data with occupancy grid maps helps to achieve a higher score. In the real-world experiment, however, using only 2D LiDAR data performed better. This may be due to the increased delay and sensor error in the real world, which produce a less accurate GMap than in simulation.

We created our own dataset using eight occupancy grid maps. In the future, the accuracy and robustness of this approach could be improved with a dataset that covers a wider variety of environments. The environmental feedback could also be extended to include more useful guidance. For instance, the system could detect steps or stairs and notify the user. Moreover, this work could be combined with our follow-ahead system [12] to improve the following behaviour. In particular, the robot could use our detection system to slow down near intersections and watch for the user’s reaction in order to choose which path to follow.

References


  • Andreas et al. [2017] J. Andreas, A. D. Dragan, and D. Klein. Translating neuralese. CoRR, abs/1704.06960, 2017.
  • Capi [2012] G. Capi. Assisting and guiding visually impaired in indoor environments. Journal ISSN, 1929:2724, 2012.
  • Daniele et al. [2017] A. F. Daniele, M. Bansal, and M. R. Walter. Navigational Instruction Generation As Inverse Reinforcement Learning with Neural Machine Translation. In Proceedings of the 2017 ACM/IEEE Int. Conf. on Human-Robot Interaction, HRI ’17, pages 109–118, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4336-7.
  • Fox et al. [1997] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics Automation Magazine, 4(1):23–33, March 1997. ISSN 1070-9932.
  • Goeddel and Olson [2016] R. Goeddel and E. Olson. Learning semantic place labels from occupancy grids using CNNs. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ Int. Conf. on, pages 3999–4004. IEEE, 2016.
  • Grisetti et al. [2007] G. Grisetti, C. Stachniss, and W. Burgard. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters. IEEE Transactions on Robotics, 23(1):34–46, Feb 2007. ISSN 1552-3098.
  • Guan et al. [2015] H. Guan, Y. Yu, Z. Ji, J. Li, and Q. Zhang. Deep learning-based tree classification using mobile lidar data. Remote Sensing Letters, 6(11):864–873, 2015.
  • Guan et al. [2016] H. Guan, J. Li, S. Cao, and Y. Yu. Use of mobile LiDAR in road information inventory: A review. International Journal of Image and Data Fusion, 7(3):219–242, 2016. ISSN 19479824.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Howard and Roy [2003] A. Howard and N. Roy. The robotics data set repository (radish), 2003.
  • Mozos et al. [2005] O. M. Mozos, C. Stachniss, and W. Burgard. Supervised learning of places from range data using AdaBoost. In Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE Int. Conf. on, pages 1730–1735. IEEE, 2005.
  • Nikdel et al. [2018] P. Nikdel, R. Shrestha, and R. Vaughan. The hands-free push-cart: Autonomous following in front by predicting user trajectory around obstacles. In Proceedings of the IEEE Int. Conf. on Robotics and Automation (ICRA), May 2018.
  • Nilsson [1984] N. J. Nilsson. Shakey the robot. Technical Report 323, AI Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, Apr 1984.
  • Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • Pronobis and Rao [2017] A. Pronobis and R. P. Rao. Learning deep generative spatial models for mobile robots. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ Int. Conf. on, pages 755–762. IEEE, 2017.
  • Quigley et al. [2009] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. ROS: an open-source Robot Operating System. In ICRA workshop on open source software, volume 3, page 5. Kobe, 2009.
  • Redmon and Farhadi [2017] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, July 2017.
  • Skubic et al. [2004] M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock. Spatial language for human-robot dialogs. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2):154–167, 2004.
  • Tellex et al. [2011] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
  • Vaughan [2008] R. Vaughan. Massively Multi-Robot Simulations in Stage. Swarm Intelligence, 2(2-4):189–208, December 2008.
  • Yi et al. [2016] D. Yi, T. M. Howard, M. A. Goodrich, and K. D. Seppi. Expressing homotopic requirements for mobile robot navigation through natural language instructions. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ Int. Conf. on, pages 1462–1468. IEEE, 2016.
  • Zeiler [2012] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Zheng et al. [2017] Y. Zheng, Y. Liu, and J. H. Hansen. Navigation-orientated natural spoken language understanding for intelligent vehicle dialogue. In Intelligent Vehicles Symposium (IV), 2017 IEEE, pages 559–564. IEEE, 2017.