Visual Global Localization with a Hybrid WNN-CNN Approach

05/08/2018 ∙ by Avelino Forechi, et al. ∙ Ifes Ufes

Currently, self-driving cars rely greatly on the Global Positioning System (GPS) infrastructure, although there is an increasing demand for alternative methods for GPS-denied environments. One of them is known as place recognition, which associates images of places with their corresponding positions. We previously proposed systems based on Weightless Neural Networks (WNN) to address this problem as a classification task. This covers only one part of global localization, which is not precise enough for driverless cars. Instead of just recognizing past places and outputting their poses, a global localization system should estimate the pose of current place images. In this paper, we propose to tackle this problem as follows. Firstly, given a live image, the place recognition system returns the most similar image and its pose. Then, given live and recollected images, a visual localization system outputs the relative camera pose represented by those images. To estimate the relative camera pose between the recollected and the current images, a Convolutional Neural Network (CNN) is trained with the two images as input and a relative pose vector as output. Together, these systems solve the global localization problem using the topological and metric information to approximate the current vehicle pose. The full approach is compared to a Real-Time Kinematic GPS system and a Simultaneous Localization and Mapping (SLAM) system. Experimental results show that the proposed approach correctly localizes a vehicle 90% of the time with a mean error of 1.20m, compared to 1.12m of the SLAM system and 0.37m of the GPS, 89% of the time.




I Introduction

The Global Positioning System (GPS) has been widely used in ground vehicle positioning. When used in conjunction with Real-Time Kinematic (RTK) data or other sensors, such as Inertial Measurement Units (IMU), it can achieve military-grade precision even in the positioning of urban vehicles. Despite its widespread use in urban vehicle navigation, it suffers from signal unavailability in cluttered environments, such as urban canyons, under a canopy of trees and in indoor areas. Such GPS-denied environments require alternative approaches, like solving the global positioning problem in the space of appearance, which involves computing the position given images from an in-vehicle camera. This problem is commonly approached as a visual place recognition problem and it is often modeled as a classification task [1, 2]. However, that solves just the first part of the global localization problem, given that it learns past place images and returns the most similar past location rather than the current position. A complementary approach to solve the global localization problem would be to compute a transformation from the past place to the current one given their corresponding images, as illustrated in Figure 1. This problem differs in some aspects from the Visual Odometry (VO) [3] and Structure from Motion (SfM) [4] problems. Although all of them can be characterized as visual localization methods, estimating a relative camera pose is more general than the other two. VO computes motion from subsequent image frames, while the relative camera pose method computes motion from non-contiguous image frames. SfM computes motion and structure of rigid objects based on feature matching at different times or from different camera viewpoints, while the relative camera pose method, most of the time, does not benefit from the structure of rigid objects, given that the camera motion is roughly orthogonal to the image plane.
In addition to the place recognition task previously addressed with Weightless Neural Networks (WNN), the relative camera pose estimation is approached here using a Convolutional Neural Network (CNN) in order to regress the relative pose between past and current image views of a place. The full WNN-CNN approach is compared to a Real-Time Kinematic GPS system and a Visual Simultaneous Localization and Mapping (SLAM) system [5]. Experimental results show that the proposed combined approach is able to correctly localize an autonomous vehicle 90% of the time with a mean error of 1.20m compared to 1.12m of a Visual SLAM system and to 0.37m of the GPS, 89% of the time.

Fig. 1: Illustration of IARA 3D viewer depicting two images (the left one is a recollected image and the right one is the current image) in their actual world positions and an orange dot trail indicating IARA’s previous positions.

II Related Work

In the following, we discuss how the global localization task is addressed in the literature with regard to place recognition and visual localization systems. Place recognition systems serve a large number of applications, ranging from robot localization [6, 7, 5], navigation [8, 9, 10] and loop closure detection [11, 12, 13, 14] to geo-tagged services [15, 16, 17]. Visual localization refers to the problem of inferring the 6 Degree of Freedom (DoF) camera pose associated with where images were taken.

II-1 Visual Place Recognition

Herein, we discuss just the latest approaches to the problem. For a comprehensive review please refer to [18]. In [19], the authors presented a place recognition technique based on CNNs that combines the features learned by a CNN with a spatial and sequential filter. They employed a pre-trained network called Overfeat [20] to extract the output vectors from all layers (21 in total). For each output vector, they built a confusion matrix from all images of the training and test datasets, using the Euclidean distance to compare the output feature vectors. They then applied a spatial and sequential filter before comparing the precision-recall curves of all layers against FAB-MAP [21] and SeqSLAM [22]. It was found in [19] that middle layers 9 and 10 perform considerably better (85.7% recall at 100% precision) than SeqSLAM, with its 51% recall rate. In [2, 1], our group employed two different WNN architectures to globally localize a self-driving car. Our methods have linear complexity like FAB-MAP and do not require any a priori pre-processing (e.g. to build a visual vocabulary). One of these methods, called VibGL, achieves an average pose precision of about 3m along a route a few kilometers long. VibGL was later integrated into a Visual SLAM system named VIBML [5]. To this end, VibGL stores landmarks along with GPS coordinates for each image during mapping. VIBML performs position tracking by using stored landmarks to search for corresponding ones in currently observed images. Additionally, VIBML employs an Extended Kalman Filter (EKF) [23] to predict the robot state based on a car-like motion model and corrects it using a landmark measurement model [23]. VIBML was able to localize an autonomous car with an average positioning error of 1.12m and with 75% of the poses with an error below 1.5m in a 3.75km path [5].

II-2 Visual Localization

In [24], the authors proposed a system to localize an input image according to the position and orientation information of multiple similar images retrieved from a large reference dataset using nearest neighbor search and SURF features [25]. The retrieved images and their associated poses are used to estimate their relative pose with respect to the input image and to find a set of candidate poses for the input image. Since each candidate pose encodes the relative rotation and direction of the input image with respect to a specific reference image, all candidate poses can be fused so that the 6-DoF location of the input image is derived through least-squares optimization. Experimental results showed that their approach performed comparably with civilian GPS devices in image localization. If applied to front-facing camera movements, as is the case here, their approach might not work properly, since most images would lie along a line, causing the least-squares optimization to fail. In [26], the task of predicting the transformation between a pair of images was reduced to 2D and posed as a classification problem. For training, they followed the Slow Feature Analysis (SFA) method [27] by imposing the constraint that temporally close frames should have similar feature representations, disregarding both the camera motion and the motion of objects in the scene. This may explain the authors’ decision to treat visual odometry as a classification problem, since the adoption of SFA should discard "motion features" and retain the scale-invariant features, which are more relevant to the classification formulation than to the original regression problem. Kendall et al. [28] proposed a monocular 6-DoF relocalization system named PoseNet, which employs a CNN to regress a 6-DoF camera pose from a single RGB image in an end-to-end manner. The system obtains a positional accuracy of approximately 3m for large-scale (500 x 100m) outdoor scenes and 0.6m indoors.
Its most salient features relate greatly to the environment structure, since the camera moves around buildings and faces them most of the time. The task of locating the camera from single input images can be modeled as a Structure from Motion (SfM) problem in a known 3D scene. State-of-the-art approaches do this in two steps: first, they project every pixel in the image to its 3D scene coordinate and, subsequently, they use these coordinates to estimate the final 6D camera pose via RANSAC. In [29], the authors proposed a differentiable RANSAC method called DSAC in an end-to-end learning approach applied to the problem of camera localization. They achieved an increase in accuracy by directly minimizing the expected loss of the output camera pose estimated by DSAC. In [30], the authors proposed a metric-topological localization system, based on images, that consists of two CNNs trained for visual odometry and place recognition, respectively, whose predictions are combined by a successive optimization. Those networks are trained using the output data of a high-accuracy LiDAR-based localization system similar to ours. The VO network is a Siamese-like CNN, trained to regress the translational and rotational relative motion between two consecutive camera images using a loss function similar to [28], which requires an extra balancing parameter to normalize the different scales of rotation and translation values. For the topological part of the system, they discretized the trajectory into a finite set of locations and trained a deep network based on DenseNet [31] to learn the probability distribution over the discrete set of locations.

II-3 Visual Global Localization

Summing up, a place recognition system capable of correctly locating a robot among discrete locations serves as an anchor system because it bounds the error of the odometry system, which tends to drift. Recalling that our place recognition system stores image-pose pairs of locations, we just need to estimate a 6-DoF relative camera pose, taking as input the recollected image-pose from the place recognition system and a live camera image. The 6-DoF relative pose applied to the recollected camera pose yields the live camera pose in the global frame of reference. The Place Recognition system presented here is based on a WNN such as the one first proposed in [2], whilst the Visual Localization system is a new approach based on a CNN architecture similar to [32]. In the end, our hybrid WNN-CNN approach does not require any additional system to merge the results of the topological and metric systems, as the output of one subsystem is the input to the other. Our relative camera pose system was developed concurrently to [33] and differs in network architecture, loss function, training regime, and application purpose. While they regress translation and quaternion vectors, we regress translation and rotation vectors; while they use the ground-truth pose as supervisory signal in a norm-based loss with an extra balancing parameter [28], we employ the Euclidean norm on 3D point clouds transformed by the relative pose; and while they use transfer learning on a pre-trained Hybrid-CNN [34] topped with two fully-connected layers for regression, we train a Fully Convolutional Network [35] from scratch. Finally, their application is more related to Structure from Motion, where images are object-centric, whereas ours is route-centric. We validated our approach on real-world images collected using a self-driving car, while theirs was validated solely using indoor images from a camera mounted on a high-precision robotic arm.

III Visual Global Localization with a Hybrid WNN-CNN Approach

This section presents the proposed approach to solve the global localization problem in the space of appearance, in contrast to traditional Global Positioning System (GPS) based systems. The proposed approach is twofold: (i) a WNN to solve place recognition as a classification problem and (ii) a CNN to solve visual localization as a metric regression problem. Given a live camera image, the first system recollects the most similar image and its associated pose, whilst the latter compares the recollected image to the live camera image in order to output a 6-DoF relative pose. The final outcome is an estimate of the current robot pose, obtained by applying the relative pose, given by system (ii), to the recollected image pose, given by system (i).

III-A Place Recognition System

Our previous place recognition system [2], named VibGL, employed a WNN architecture designed to solve image-related classification problems. VibGL employs a committee of Virtual Generalized Random Access Memory (VG-RAM) WNN units, called VG-RAM neurons for short, which are individually responsible for representing binary patterns extracted from the input and associating them with their corresponding labels. VG-RAM neurons are organized in layers and can also serve as input to further layers in the architecture. As a machine learning method, VG-RAM has a supervised training phase that stores pairs of inputs and labels, and a test phase that compares stored inputs to new ones. Its testing time increases linearly with the number of training samples. To overcome this problem, one can employ the Fat-Fast VG-RAM neuron for faster neuron memory search, which relies on an indexed data structure with sub-linear runtime that assumes uniformly distributed patterns.
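
The store-and-compare behavior described above can be illustrated with a minimal sketch. It is a toy, not the VibGL implementation: the class name `VGRAMNeuron`, the helper `committee_vote`, the use of Hamming distance as the comparison, and the majority vote are all assumptions made here for illustration.

```python
from collections import Counter


class VGRAMNeuron:
    """Toy VG-RAM-style neuron: memorizes (bit pattern, label) pairs."""

    def __init__(self):
        self.memory = []  # list of (tuple of bits, label)

    def train(self, bits, label):
        # Training is one-shot: the pair is simply stored.
        self.memory.append((tuple(bits), label))

    def predict(self, bits):
        # Testing scans the whole memory, so its cost grows linearly with
        # the number of stored pairs. The label of the stored pattern with
        # the smallest Hamming distance to the input is returned
        # (assumed comparison rule).
        def hamming(entry):
            return sum(a != b for a, b in zip(entry[0], bits))

        return min(self.memory, key=hamming)[1]


def committee_vote(neurons, inputs):
    """Majority vote over the labels returned by each neuron (assumed rule)."""
    votes = Counter(n.predict(x) for n, x in zip(neurons, inputs))
    return votes.most_common(1)[0][0]
```

The linear scan in `predict` is the cost that an indexed memory search, such as the Fat-Fast variant mentioned above, is meant to reduce.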

Our latest place recognition system [1], named SABGL, demonstrated that taking as input a sequence of images is more accurate than taking a single-image as VibGL. The reason is that a sequence of images provides temporal consistency. Although SABGL demonstrated better classification performance than VibGL, in this paper, we will further experiment with VibGL, but in a different scenario, where multiple similar images of a place are acquired over time and used for training. In this context, we are interested in exploiting data spatial consistency.

III-B Visual Localization System

This section describes the system proposed to solve the visual localization problem by training a Siamese-like CNN architecture to regress a 6-DoF pose vector.

III-B1 Architecture

The CNN architecture adopted here is similar to the one proposed by Handa et al. [32]. Their architecture takes inspiration from the VGG-16 network [37] and uses convolutions of a single kernel size in all but the last two layers, where two other kernel sizes are used to compensate for the input resolution, which differs from the one used in the original VGG-16. Also, the top fully connected layers were replaced by convolutional layers, which turned the network into a Fully Convolutional Network (FCN) [35].

The network proposed by Handa et al. [32] takes in a pair of consecutive frames, captured at successive time instances, in order to learn a 6-DoF visual odometry pose. The CNN adopted here takes in a pair of image frames that are at most a bounded number of meters apart. More specifically, the siamese network branches are fed with pairs of keyframes and live frames, in which the distance from a live frame to its keyframe stays within that bound. The network’s output is a 6-DoF pose vector that transforms one image coordinate frame to the other. The first three components of this vector correspond to the rotation and the last three to the translation.

Similarly to the architecture proposed by Handa et al. [32], the one adopted here fuses the two siamese network branches early, in order to ensure that spatial information is not lost through the depth of the network. Although the vast majority of CNN architectures alternate convolution and max-pooling layers, it is possible to replace max-pooling layers with a larger-stride convolution kernel without loss in accuracy on several image recognition benchmarks [38]. Based on this finding, and seeking to preserve spatial information throughout the network, there are no pooling layers in the network adopted here.

Figure 2 shows the network adopted in this work. All convolutional layers, with the exception of the last three, are followed by a PReLU non-linearity [39]. The major differences from the network proposed by Handa et al. [32] are the dropout layers [40] added before the last two layers and a slightly larger receptive field in earlier layers. Dropout was chosen for regularization, since the original network [32] was trained on synthetic data and does not generalize. The receptive field of early layers was made larger in order to filter out high-frequency components of real-world data.

Fig. 2: Siamese CNN architecture for relative pose regression. The siamese network branches take in pairs of a keyframe and a live frame and output a 6-DoF relative pose between those two frames. The Siamese network is a Fully Convolutional Network (FCN) built solely with convolutional layers followed by PReLU non-linearity. Moreover, there is one dropout layer before each of the last two layers.

III-B2 Loss Function

Perhaps the most demanding aspect of learning camera poses is defining a loss function that is capable of learning both position and orientation. Kendall et al. [28] noted that a model trained to regress both the position and orientation of the camera is better than separate models supervised on each task individually. They first proposed a method to combine position and orientation into a single loss function with a linear weighted sum. However, since position and orientation are represented in different units, a scaling factor was needed to balance the losses. Kendall and Cipolla [41] recommend the reprojection error as a more robust loss function. The reprojection error is given by the difference between the 3D world points reprojected onto the 2D image plane using the ground truth and the predicted camera pose. Instead, we chose the 3D projection error of the scene geometry, for the following two reasons. First, it is a representation that combines rotation and translation naturally in a single scalar loss, similarly to the reprojection error. Second, the 3D projection error is expressed in the same metric space as the camera pose and thus provides more interpretable results than the reprojection error, which pairs a loss in pixels with an error in meters.

Basically, a 3D projection loss converts rotation and translation measurements into 3D point coordinates, as defined in Equation 1. For every homogenized 2D pixel coordinate in the live image, the corresponding homogenized 3D point is obtained by projecting the ray from that pixel location into the 3D world using the inverse camera projection and the live depth at that pixel location. The loss then compares each such point transformed by the predicted relative pose with the same point transformed by the ground truth relative pose. The norm is the Euclidean norm, applied to as many points as the depth map has pixels (its width times its height). Moreover, it is worth mentioning that the intrinsic camera parameters are not required to compute the 3D geometry in the loss function described by Equation 1. The reason is that the same projection is applied to both prediction and ground truth measurements.
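
A plausible written-out form of the loss in Equation 1, reconstructed from the description above and from the caption of Figure 5 (mean Euclidean distance between ground truth and predicted 3D point projections), is sketched below. The symbols are notation assumed here and are not necessarily the authors': $\mathbf{p}_i$ is a homogenized 2D pixel coordinate in the live image, $d_i$ the live depth at that pixel, $K$ the camera intrinsic matrix, $\mathbf{x}_i$ the homogenized 3D point, $\hat{T}$ and $T$ the predicted and ground truth relative transformations, and $W$, $H$ the depth map width and height.

```latex
\mathcal{L} \;=\; \frac{1}{W H} \sum_{i=1}^{W H}
  \left\lVert \hat{T}\,\mathbf{x}_i - T\,\mathbf{x}_i \right\rVert_2 ,
\qquad
\mathbf{x}_i \;=\;
\begin{bmatrix} d_i \, K^{-1} \mathbf{p}_i \\ 1 \end{bmatrix}
```

Whether the original averages the per-point norms, as written here, or normalizes differently cannot be confirmed from this text.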

Therefore, this loss function naturally balances translation and rotation quantities, depending on the scene and camera geometry. The key advantage of this loss is that it allows the model to vary the weighting between position and orientation, depending on the specific geometry in the training image. For example, training images with geometry far away from the camera would balance rotational and translational error differently to images with geometry very close to the camera. If the scene is very far away, then rotation is more significant than translation and vice versa [41].

The projective geometry [42] applied to neural network models consists of differentiable operations that involve matrix multiplication. Handa et al. [32] provide a 3D Spatial Transformer module that explicitly defines these transformations as layers with no learnable parameters; these layers only allow gradients to be backpropagated from the loss function to the input layers. Figure 3 illustrates how the geometry layers fit the siamese network. There is a relative camera pose, given either by the siamese network or by the ground truth. There are also geometry layers to compute 3D world points from both the predicted relative camera pose and the ground truth. On the top left, the figure shows the base and top branches of the siamese network, which receives as input a pair of a live frame and a keyframe and outputs a relative camera pose vector. This predicted vector is then passed to the SE3 Layer, which outputs a transformation matrix. Next, the 3D Grid Layer receives as input a live depth map, the camera intrinsic matrix and the predicted transformation matrix. It projects the ray at every pixel location into the 3D world (by multiplying the inverse camera matrix by the homogenized 2D pixel coordinate) and multiplies the result by the corresponding live depth. Finally, the resulting homogenized 3D point is transformed by the relative camera transformation encoded in the predicted matrix. The ground truth relative pose is also passed through the SE3 Layer, and the resulting transformation matrix is applied to the output of the 3D Grid Layer produced before, in order to get a view of the 3D point cloud according to the ground truth.

Fig. 3: Convolution and Geometry layers jointly applied for learning relative camera poses. On the top left, the base and top branches of the siamese network are shown. Its predicted vector is passed to the SE3 Layer, which outputs a transformation matrix. Then, the 3D Grid Layer receives as input a live depth map, the camera intrinsic matrix and the predicted transformation matrix. It projects the ray at every pixel location into the 3D world and multiplies the result by the corresponding live depth. Finally, the resulting homogenized 3D point is transformed by the predicted matrix. The ground truth relative pose is also passed through the SE3 Layer, and the resulting transformation matrix is applied to the output of the 3D Grid Layer produced before, in order to get a view of the 3D point cloud according to the ground truth.
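
To make the data flow through the geometry layers concrete, the following is a minimal, non-differentiable sketch in plain Python. It is an illustration under stated assumptions rather than the authors' or Handa et al.'s implementation: the function names are hypothetical, a pinhole camera with parameters `fx, fy, cx, cy` is assumed, and the pose vector is treated as a rotation vector (converted with Rodrigues' formula) followed by a plain translation, which simplifies a full SE(3) exponential map.

```python
import math


def se3_matrix(pose):
    """Map (rx, ry, rz, tx, ty, tz) to a 3x4 [R|t] matrix.

    Assumption: the first three components are an axis-angle rotation
    vector and the last three are used directly as the translation.
    """
    rx, ry, rz, tx, ty, tz = pose
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    else:
        kx, ky, kz = rx / theta, ry / theta, rz / theta
        c, s = math.cos(theta), math.sin(theta)
        v = 1.0 - c
        R = [
            [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
            [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
            [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
        ]
    return [R[0] + [tx], R[1] + [ty], R[2] + [tz]]


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection: depth * K^-1 * (u, v, 1)."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def transform(T, p):
    """Apply a 3x4 rigid transform to a 3D point."""
    return [r[0] * p[0] + r[1] * p[1] + r[2] * p[2] + r[3] for r in T]


def projection_loss(pred_pose, gt_pose, depth_map, fx, fy, cx, cy):
    """Mean Euclidean distance between the point cloud moved by the
    predicted pose and the same cloud moved by the ground truth pose."""
    T_pred, T_gt = se3_matrix(pred_pose), se3_matrix(gt_pose)
    total, count = 0.0, 0
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            p = backproject(u, v, d, fx, fy, cx, cy)
            a, b = transform(T_pred, p), transform(T_gt, p)
            total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            count += 1
    return total / count
```

With this sketch, a pure translation error yields the same loss whatever the depth, while a pure rotation error yields a loss that grows with scene depth, which matches the balancing behavior attributed to the loss in the text.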

III-C Global Localization System

Lastly, this section presents the integration of the system described in Section III-A for solving the place recognition problem with the system described in Section III-B for solving the visual localization problem. Together, both systems solve the global localization problem, which consists of inferring the current live camera pose given just a single live camera image. Note that the only input to the whole system is the live image, as depicted by the smallest square in Figure 4.

Figure 4 shows the workflow between the place recognition system (WNN approach) and the visual localization system (CNN approach), which work together to provide the live global pose given the live camera image. The live image is sent to both the WNN and CNN subsystems, while the image recollected by the WNN is passed only to the CNN subsystem as the keyframe image. Given the image pair, the CNN subsystem outputs the relative camera pose, which is applied to the key global pose in order to give the live global pose.

Fig. 4: The combined WNN-CNN system. The live image is the only input to the whole WNN-CNN system, which outputs the corresponding live global pose. The WNN subsystem outputs the keyframe image, which, together with the live image, is input to the CNN subsystem, which outputs the relative camera pose. The latter is applied to the key global pose to give the live global pose.
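
The final step, applying the relative pose to the key global pose, amounts to composing two rigid transformations. The sketch below shows this with 4x4 homogeneous matrices. The helper names are hypothetical, and the convention that the relative pose expresses the live camera in the keyframe's coordinate frame (so that the composition is a right-multiplication) is an assumption, since the text does not state it. For brevity the usage is restricted to planar poses, although the composition itself applies to any 4x4 rigid transform.

```python
import math


def matmul4(A, B):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def planar_pose(x, y, yaw):
    """4x4 homogeneous transform for a pose on the ground plane (z = 0)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def live_global_pose(key_global, relative):
    # Assumed convention: `relative` is the live camera pose expressed in
    # the keyframe's frame, so the global live pose is key_global * relative.
    return matmul4(key_global, relative)


# Keyframe at (10, 5) heading along +y; live camera 2 m ahead of it.
key = planar_pose(10.0, 5.0, math.pi / 2)
rel = planar_pose(2.0, 0.0, 0.0)
live = live_global_pose(key, rel)
print(round(live[0][3], 6), round(live[1][3], 6))
```

If the opposite convention held (the relative pose mapping live-frame coordinates from the global side), the product order or an inverse would change accordingly.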

IV Experimental Methodology

This section presents the experimental setup used to evaluate the proposed system. It starts by describing the autonomous vehicle platform used to acquire the datasets, then presents the datasets themselves, and finishes by describing the methodology used in the experiments.

IV-A Autonomous Vehicle Platform

The data used to evaluate the performance of the proposed system was collected using the Intelligent and Autonomous Robotic Automobile – IARA (Figure 1). IARA is an experimental robotic platform with several high-end sensors based on a Ford Escape Hybrid that is currently being developed at the Laboratório de Computação de Alto Desempenho – LCAD (acronym in Portuguese for High-Performance Computing Laboratory) of the Universidade Federal do Espírito Santo (UFES) in Brazil. For details about IARA specifications please refer to [2, 5].

The datasets used in this work were built using IARA’s frontal Bumblebee XB3 stereo camera to capture VGA-sized images at 16fps, and IARA’s localization module [43] to capture the associated poses (6 Degrees of Freedom, 6-DoF). IARA’s localization module is based on Monte Carlo Localization (MCL) [23] with Occupancy Grid Mapping (OGM) [44], built with a grid cell resolution of 0.2m, as detailed in [45]. Poses computed by IARA’s Monte Carlo Localization - Occupancy Grid Mapping (MCL-OGM) system have a precision of about the grid map resolution, as verified in [43].

IV-B Datasets

For the experiments, data from several laps were collected on different dates. For each lap, IARA was driven at speeds up to 60 km/h around the UFES campus. An entire lap around the university campus is about 3.57 km long. During the laps, both image and pose data of IARA were acquired synchronously, amounting to more than 75 thousand image-pose pairs. Table I summarizes all lap data in ten different sequences. Sequence 8 accounts for two laps, laps 9 and 10 are partial laps and all the others are full laps. The time between sequences 1 and 10 spans more than a year. Such a time difference resulted in a challenging testing scenario, since it captured substantial changes in the campus environment. These changes include differences in traffic conditions, number of pedestrians, and alternative routes taken due to obstructions on the road. Also, there were substantial infrastructure modifications of buildings alongside the roads between dataset recordings. The complete set of sequences selected for the experiments is called UFES-LAPS and is available for download.

Lap Sequence   Lap Date (mm-dd-yyyy)    Lap Sampling Spacing
                                        None      1m       5m
UFES-LAP-01    08-25-2016 03:31 PM      7,165     2,868    682
UFES-LAP-02    08-25-2016 03:47 PM      6,939     2,726    679
UFES-LAP-03    08-25-2016 04:17 PM      6,404     2,663    680
UFES-LAP-04    08-30-2016 05:40 PM      1,808     725      170
UFES-LAP-05    10-21-2016 04:15 PM      9,405     2,855    669
UFES-LAP-06    01-19-2017 07:23 PM      1,869     704      171
UFES-LAP-07    11-22-2017 05:20 PM      7,965     2,832    665
UFES-LAP-08    12-05-2017 09:35 AM      17,935    6,012    1,398
UFES-LAP-09    01-12-2018 04:30 PM      7,996     2,868    669
UFES-LAP-10    01-12-2018 04:40 PM      7,605     2,899    662
TOTAL                                   75,091    27,152   6,445
TABLE I: UFES Dataset Sequences (number of image-pose pairs per sequence at each sampling spacing)

To validate the proposed system for both the place recognition and visual localization problems, a set of experiments was run with the UFES-LAPS dataset mentioned above. UFES-LAPS was further split into training, validation and test datasets. The training dataset is named UFES-LAPS-TRAIN and comprises all sequences from UFES-LAPS but the following three: UFES-LAP-04, UFES-LAP-06, and UFES-LAP-07. The UFES-LAP-07 sequence is renamed UFES-LAPS-TEST and used for testing. The remaining two sequences, UFES-LAP-04 and UFES-LAP-06, make up the validation dataset, called UFES-LAPS-VALID. In this way, the UFES-LAPS-TRAIN dataset is used for training and UFES-LAPS-VALID is used during CNN training to select the best model. Lastly, the UFES-LAPS-TEST dataset is used to test the accuracy of the whole system.

The dataset sequences were sampled at different spacings. For training, the sequences from UFES-LAPS-TRAIN are sampled at a 5-meter spacing, and the resulting dataset is called UFES-LAPS-TRAIN-5M. For testing, the sequences from UFES-LAPS-TEST are sampled at a 1-meter spacing, and the resulting dataset is called UFES-LAPS-TEST-1M. The same procedure applies to the validation dataset, resulting in UFES-LAPS-VALID-5M and UFES-LAPS-VALID-1M.
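
One simple way to obtain such spacings from a densely recorded sequence is sketched below. The exact sampling rule used to build the datasets is not given in the text, so the rule here (keep a frame once the vehicle is at least `spacing` meters, in straight-line distance, from the last kept frame) and the function name `sample_by_spacing` are assumptions.

```python
import math


def sample_by_spacing(positions, spacing):
    """Return indices of frames kept at roughly `spacing` meters apart.

    `positions` is a list of (x, y) vehicle positions in meters, in
    acquisition order. The first frame is always kept.
    """
    if not positions:
        return []
    kept = [0]
    last = positions[0]
    for i in range(1, len(positions)):
        x, y = positions[i]
        if math.hypot(x - last[0], y - last[1]) >= spacing:
            kept.append(i)
            last = positions[i]
    return kept


# A straight 10 m stretch recorded every 0.25 m.
track = [(0.25 * i, 0.0) for i in range(41)]
print(len(sample_by_spacing(track, 1.0)))  # frames kept at 1 m spacing
print(len(sample_by_spacing(track, 5.0)))  # frames kept at 5 m spacing
```

On this toy track the 1 m sampling keeps 11 frames and the 5 m sampling keeps 3, mirroring the roughly fourfold to fivefold reduction between the 1m and 5m columns of Table I.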

To validate the proposed system for the place recognition problem, a set of experiments was run using the UFES-LAPS-TRAIN-5M and UFES-LAPS-TEST-1M datasets for, respectively, training and testing the weightless network. To validate the proposed system for the visual localization problem, the convolutional network is trained with keyframes selected from UFES-LAPS-TRAIN-5M, while the live frames are picked from UFES-LAPS-TRAIN-1M. The same procedure applies to the validation and test datasets, resulting in the following datasets, respectively: UFES-LAPS-VALID-5M/1M and UFES-LAPS-TEST-5M/1M. To define the ground-truth labels between places, correspondences between every two laps were established using the Euclidean distance between image-pose pairs from each lap of the training and test datasets and a third dataset, used for pose registration purposes only. Accordingly, the UFES-LAP-05 sequence was reserved for pose registration and none of its images were considered for place recognition. First, it was sampled at a fixed 1m spacing to create UFES-LAPS-REG-1M. Next, the UFES-LAPS-TRAIN-1M dataset was matched with the registration dataset UFES-LAPS-REG-1M using the Euclidean distance as proximity measure. Finally, the same procedure was applied to the UFES-LAPS-TEST-1M dataset. The final sizes of the registered training and test datasets for place recognition are 4,415 and 2,784, respectively.

In order to define the ground-truth relative vector between camera poses, the relative distances between every two sequences were established using the Euclidean distance between pairs of keyframes and live frames, along with their corresponding poses. The UFES-LAPS-TRAIN-1M dataset was matched with UFES-LAPS-TRAIN-5M using the Euclidean distance as a proximity measure to select the closest keyframe from the 5m spacing dataset. The same procedure applies to the UFES-LAPS-TEST-1M/5M and UFES-LAPS-VALID-1M/5M datasets. The data are crossed for each dataset as follows. For the training data, every live frame in UFES-LAPS-TRAIN-1M is crossed with the keyframes in UFES-LAPS-TRAIN-5M. For the validation data, live frames come from sequences in the UFES-LAPS-VALID-1M dataset, while the keyframes can come from any sequence in the UFES-LAPS-VALID-5M or UFES-LAPS-TRAIN-5M datasets. The same procedure applies to the test data: live frames are selected from sequences in UFES-LAPS-TEST-1M and keyframes from sequences in the UFES-LAPS-TEST-5M or UFES-LAPS-TRAIN-5M datasets. The final sizes after crossing the sequence data of the training, validation and test datasets are, respectively, 98,404, 3,471 and 14,249.
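
The keyframe selection by proximity can be illustrated with a brute-force nearest-neighbor search. This is only a sketch under assumptions: positions are reduced to 2D `(x, y)` tuples, the function names are hypothetical, and the handling of multiple sequences and of the full 6-DoF relative pose label is left out.

```python
import math


def closest_keyframe(live_xy, keyframes_xy):
    """Index of the keyframe whose position is nearest to the live frame."""
    return min(
        range(len(keyframes_xy)),
        key=lambda k: math.hypot(live_xy[0] - keyframes_xy[k][0],
                                 live_xy[1] - keyframes_xy[k][1]),
    )


def build_pairs(live_positions, keyframe_positions):
    """Pair every live frame with its closest keyframe.

    Returns a list of (live index, keyframe index, distance in meters).
    """
    pairs = []
    for i, p in enumerate(live_positions):
        k = closest_keyframe(p, keyframe_positions)
        q = keyframe_positions[k]
        pairs.append((i, k, math.hypot(p[0] - q[0], p[1] - q[1])))
    return pairs


# Keyframes every 5 m, live frames every 1 m along the same straight road.
keys = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
live = [(float(x), 0.0) for x in range(11)]
pairs = build_pairs(live, keys)
print(max(d for _, _, d in pairs))  # largest live-to-keyframe distance
```

In this idealized single-track case no live frame is more than half the keyframe spacing away from its keyframe; across different laps the lateral offset between trajectories adds to that distance.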

IV-C Network Training

This subsection describes the training procedure and parameter selection for the WNN and the CNN. For the WNN, the parameters follow the tuning done in [2]: one neural layer in which each neuron reads a fixed-size binary feature vector from the input layer.

The WNN is trained on the UFES-LAPS-TRAIN-5M dataset using images from the left camera of Bumblebee XB3 cropped to . The same crop window applies for UFES-LAPS-TEST-1M. For more details about the weightless network parameters and training procedure please refer to [2] . The CNN is trained on the UFES-CNN-LAPS-TRAIN-5M/1M dataset using images from left camera of Bumblebee XB3 and the depth image computed with SPS stereo [46], being both image and depth cropped to . No data augmentation was used.

The CNN was trained with the Adam optimizer [47] using mini-batches of size 24. The Adam hyper-parameters β1 and β2 were set to 0.9 and 0.999, respectively. The learning rate is initially set to and decreased by a factor of 2 at each epoch. The network was trained for 7 epochs, with 4,101 iterations per epoch. To prevent the network from overfitting, Dropout layers [40] and Early Stopping [48] are employed. There are two Dropout layers in the convolutional network architecture presented in Section III-B. Both have probability of units being randomly dropped at each training iteration. Following the early stopping criterion, training was interrupted and the best model, the one with the smallest positioning error on the validation data, was saved.
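The schedule above can be sketched in a few lines. This is not the authors' training code: the initial learning rate is not recoverable from the text, so `initial_lr` is a placeholder, and the patience value and validation errors are likewise hypothetical. The iteration count, on the other hand, follows directly from the 98,404 training pairs and the batch size of 24.

```python
import math

def learning_rate(initial_lr, epoch, decay_factor=2.0):
    """Learning rate for a 0-based epoch when it is divided by decay_factor every epoch."""
    return initial_lr / (decay_factor ** epoch)

def early_stopping(val_errors, patience=2):
    """Return (best_epoch, stop_epoch): training stops once the validation error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

# Iterations per epoch implied by the dataset size and batch size reported above.
iterations_per_epoch = math.ceil(98404 / 24)
total_iterations = 7 * iterations_per_epoch

initial_lr = 1e-4  # placeholder value, not taken from the paper
rates = [learning_rate(initial_lr, e) for e in range(7)]

# Hypothetical per-epoch validation positioning errors in meters.
val_errors = [2.0, 1.5, 1.2, 1.3, 1.4, 1.1, 1.0]
best_epoch, stop_epoch = early_stopping(val_errors, patience=2)
print(iterations_per_epoch, total_iterations, best_epoch, stop_epoch)
```

With these made-up numbers the model saved at epoch 2 would be kept, even though later epochs were run before the stop was triggered.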

The curves in Figure 5 show the evolution of CNN training, using the UFES-LAPS-TRAIN-5M/1M dataset for training and the UFES-LAPS-VALID-5M/1M dataset for validation. The vertical axis represents the error in meters and the horizontal axis represents the number of iterations. The indigo curve shows the loss function error as defined in Equation 1, the green curve shows the positioning error on the training data, measured with the Euclidean distance, and the red curve shows the same positioning error on the validation data.

Fig. 5: CNN training. The vertical axis represents the error in meters and the horizontal axis represents the training iterations. The curves in indigo, green and red represent the error, measured in meters, of the loss function, the training data and the validation data, respectively. The loss function error is defined as in Equation 1 and represents the mean Euclidean distance between the ground-truth and the predicted 3D point projections, while the training and validation metrics measure the mean Euclidean distance between the 3D camera position given by the network and the ground truth.
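The two kinds of error plotted in Figure 5 can be contrasted with a small sketch. It is only an approximation of the idea in the caption: Equation 1 is not reproduced here, the sketch compares 3D points transformed by the predicted and ground-truth poses rather than their image projections, and the pose representation (a rotation matrix plus a translation) and all values are assumptions made for illustration.

```python
import math

def yaw_rotation(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform(R, t, p):
    """Apply the rigid transform (R, t) to the 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def geometric_loss(pred_pose, gt_pose, points):
    """Mean Euclidean distance between points moved by the predicted and true poses."""
    total = sum(math.dist(transform(*pred_pose, p), transform(*gt_pose, p))
                for p in points)
    return total / len(points)

def position_error(pred_pose, gt_pose):
    """Euclidean distance between predicted and true camera positions."""
    return math.dist(pred_pose[1], gt_pose[1])

gt = (yaw_rotation(0.0), (1.0, 0.0, 0.0))
points = [(0.0, 0.0, 5.0), (1.0, 2.0, 10.0)]

# Pure translation error: both measures agree.
pred_shifted = (yaw_rotation(0.0), (1.0, 0.3, 0.4))
loss_shifted = geometric_loss(pred_shifted, gt, points)

# Pure rotation error: the position error is zero but the point-based loss is not.
pred_rotated = (yaw_rotation(math.pi / 2), (1.0, 0.0, 0.0))
loss_rotated = geometric_loss(pred_rotated, gt, [(1.0, 0.0, 0.0)])
print(loss_shifted, loss_rotated, position_error(pred_rotated, gt))
```

Because a point-based loss also penalizes rotation error, it can exceed the position-only metric. This is one plausible reading of why the loss curve lies above the other two, not a claim made by the paper.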

As Figure 5 shows, the loss function curve stays consistently above the others, while the validation error curve crosses the training error curve after 25,000 iterations, which indicates that the network is overfitting. Accordingly, training was stopped at that point and the model with the smallest positioning error on the validation data was kept.

V Results and Discussions

This section presents and discusses the outcomes of the experiments. It first describes the performance of the WNN subsystem in terms of classification accuracy and then presents the performance of the CNN subsystem in terms of positioning error. A demo video showing the performance of the WNN-CNN system on the UFES-LAPS-TEST-1M test dataset is available at

V-A Classification Accuracy

This subsection compares the performance of the WNN subsystem by means of the relationship between the number of frames learned by the system and its classification accuracy. The system classification accuracy is measured in terms of how close the estimated image-pose pair is to the ground-truth image-pose pair.

Figure 6 shows the classification accuracy obtained on the UFES-LAPS-TEST-1M test dataset when training with either one sequence (UFES-LAP-01 at the fixed 5m spacing) or all sequences (UFES-LAPS-TRAIN-5M). The vertical axis represents the percentage of estimated image-pose pairs that were within an established Maximum Allowed Error (MAE), in frames, from the ground-truth image-pose pair. The MAE is equal to the number of image-pose pairs that one has to go forward or backward in the test dataset to find the corresponding query image. The horizontal axis represents the MAE in frames. Finally, the curves represent the results for the different training datasets: one sequence or all sequences.

Fig. 6: Classification accuracy of VG-RAM WNN for different Maximum Allowed Error (MAE) in frames when training with one sequence (UFES-LAP-01-5M) or with all sequences (UFES-LAPS-TRAIN-5M) and testing with the UFES-LAPS-TEST-1M dataset in both cases.

As Figure 6 shows, the classification accuracy of the WNN subsystem increases with MAE for both datasets, but reaches a plateau at about 3 frames for the all-sequences dataset and at about 10 frames for the one-sequence dataset. For the latter, if no system error is accepted (MAE equal to zero), the accuracy is about 68%. If an error of up to 3 frames is accepted (MAE equal to 3), the accuracy increases to about 82%. On the other hand, when using the all-sequences dataset for training, the system accuracy increases more sharply. For example, with MAE equal to 1, the classification rate is about 98%. Although the system shows better accuracy as MAE increases, its positioning error also increases, because one frame of error in the training datasets represents 5 meters.

Comparing the curves in Figure 6, it can be observed that, with the all-sequences training dataset, the WNN subsystem achieves a classification accuracy of up to 90.3% with MAE equal to 0.
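The accuracy measure discussed above can be written down compactly. The sketch below assumes that each prediction and each ground-truth label is a frame index, so that the error in frames is the absolute difference of the indices; the function names and the sample indices are hypothetical.

```python
KEYFRAME_SPACING_M = 5.0  # one frame of error corresponds to 5 m, as stated above

def accuracy_at_mae(predicted, ground_truth, max_allowed_error):
    """Fraction of predictions within `max_allowed_error` frames of the ground truth."""
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if abs(p - g) <= max_allowed_error)
    return hits / len(ground_truth)

def tolerated_error_m(max_allowed_error):
    """Positioning error in meters tolerated by a given MAE in frames."""
    return max_allowed_error * KEYFRAME_SPACING_M

# Hypothetical predicted and ground-truth frame indices for six query images.
predicted = [10, 11, 13, 15, 14, 30]
ground_truth = [10, 12, 13, 14, 15, 16]

for mae in (0, 1, 3):
    acc = accuracy_at_mae(predicted, ground_truth, mae)
    print(mae, round(acc, 3), tolerated_error_m(mae))
```

The loop makes the trade-off explicit: raising the MAE counts more predictions as correct, but each additional frame of tolerance accepts 5 m more positioning error.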

V-B Positioning Error

This subsection analyzes the performance of the CNN when given the ground-truth keyframe (GT+CNN) and when the keyframe is output by the WNN (WNN+CNN). Both are compared against GPS on the UFES-LAPS-TEST-1M dataset, in the regions where both the WNN and GPS systems are more accurate. As seen before, the WNN subsystem is more accurate 90.3% of the time, assuming a Maximum Allowed Error (MAE) equal to zero. The GPS subsystem is more accurate where the signal quality is stable, which occurs 89.65% of the time. For this experiment, the signal is considered stable when the GPS quality indicator is greater than 0 (footnote 1: ReceiverHelp/V4.44/en/NMEA-0183messages_GGA.html).

We measured the positioning error of the proposed system and of GPS in terms of how close their estimated trajectories are to the trajectory estimated by the OGM-MCL system (our ground truth) on the UFES-LAPS-TEST-1M test dataset. Figure 7 shows the results as box plots with the median, inter-quartile range and whiskers of the error distribution for each system.

Fig. 7: Comparison between hybrid WNN-CNN system and GPS positioning error.

As shown in Figure 7, the positioning errors of the GPS and GT+CNN systems are equivalent. For the WNN+CNN system, the positioning error is under 1m for 50% of the poses and under 2.3m for 75% of the poses. Extending the comparison to the Visual SLAM system [5] in a similar context, the combined approach has a mean positioning error of 1.20m, slightly higher than the 1.12m achieved by the Visual SLAM system on the same trajectory. Considering that they serve different purposes, the combined results of the hybrid WNN-CNN approach look promising.

VI Conclusions and Future Work
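The statistics quoted above (median, quartiles and mean of the positioning error) can be computed from per-pose Euclidean errors as in the sketch below. The trajectories are made up for illustration, and the choice of the inclusive quartile method is an assumption, since the paper does not state how the box plots were computed.

```python
import math
import statistics

def positioning_errors(estimated, ground_truth):
    """Per-pose Euclidean distance between estimated and ground-truth positions."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]

def box_plot_summary(errors):
    """Quartiles and mean of an error distribution, as used for a box plot."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": q3 - q1, "mean": statistics.fmean(errors)}

# Hypothetical (x, y) positions in meters for five poses.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
estimated = [(0.0, 0.5), (1.0, 1.0), (2.0, 1.5), (3.0, 2.0), (4.0, 6.0)]

errors = positioning_errors(estimated, ground_truth)
summary = box_plot_summary(errors)
print(summary)
```

The single 6 m outlier in this toy data pulls the mean well above the median, which is why a box plot and a mean error can tell different stories about the same system.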

It was shown that solving the global localization problem requires more than just outputting the position where the robot was during the mapping phase. It is desirable to approximate the actual robot position and orientation with respect to the past pose. This problem was tackled here by training a Siamese-like CNN that takes two images as input and regresses a 6-DoF relative pose. It was advocated that using a geometry loss, which projects the 3D points transformed by the network's output pose, is a better approach than using the ground-truth pose as the backpropagation signal. For instance, it naturally balances the differences in scale between the position and rotation units. It was also verified that the loss function error is consistently above the training and validation errors. The geometry loss proved to be a robust loss for the task of regressing the relative pose. For the final experiment, the best trained model was applied in conjunction with the WNN subsystem to solve the global localization problem. It was shown that the combined results of the hybrid WNN-CNN approach were on par with a Visual SLAM system, although they still need improvement compared to RTK-GPS precision.

Directions for future work include extending this work with larger datasets and evaluating the network performance using transfer learning and fine-tuning on the UFES dataset. As more data are provided [49], an increase in accuracy and regularization is expected for Deep Learning models. For instance, the localization accuracy of PoseNet [28] was improved by increasing the number of training trajectories while maintaining a constant-size CNN. Conversely, for the WNN subsystem, larger datasets can degrade runtime performance, since the runtime during testing scales with the number of training samples.

Deep Learning models have demonstrated superhuman performance in some tasks but at the expense of large amounts of correctly labeled data for training models using standard supervised techniques, which is costly in robotics. To overcome this issue, an alternative is to train Deep Learning models using weakly supervised techniques [50] with noisy labeled data, or even unlabeled data. This could open doors to many new applications in robotics, such as Visual SLAM using end-to-end Deep Learning techniques. For the relative pose estimation problem studied here, another alternative to overcome noisy labeled data is to incorporate its uncertainty in the loss function as an extra parameter [41].


The authors would like to thank NVIDIA Corporation for the donation of some GPUs used in this work and Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq, Brazil (grants 311120/2016-4 and 311504/2017-5) for their financial support to this research work.


  • [1] A. Forechi, A. F. De Souza, C. Badue, and T. Oliveira-Santos, “Sequential appearance-based Global Localization using an ensemble of kNN-DTW classifiers,” in 2016 International Joint Conference on Neural Networks (IJCNN), jul 2016, pp. 2782–2789.
  • [2] L. J. Lyrio Júnior, T. Oliveira-Santos, A. Forechi, L. Veronese, C. Badue, and A. F. A. De Souza, “Image-based global localization using VG-RAM Weightless Neural Networks,” in 2014 International Joint Conference on Neural Networks (IJCNN).   IEEE, jul 2014, pp. 3363–3370.
  • [3] D. Nister, O. Naroditsky, and J. Bergen, “Visual odometry,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1.   IEEE, jun 2004, pp. 652–659.
  • [4] T. S. Huang and A. N. Netravali, “Motion and Structure from Feature Correspondences: A Review,” Proceedings of the IEEE, vol. 82, no. 2, pp. 252–268, 1994.
  • [5] L. J. Lyrio, T. Oliveira-Santos, C. Badue, and A. F. De Souza, “Image-based mapping, global localization and position tracking using VG-RAM weightless neural networks,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, may 2015, pp. 3603–3610.
  • [6] S. Engelson, “Passive Map Learning and Visual Place Recognition,” Ph.D. dissertation, 1994.
  • [7] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [8] Y. Matsumoto, M. Inaba, and H. Inoue, “Visual navigation using view-sequenced route representation,” in International Conference on Robotics and Automation, vol. 1, no. April.   IEEE, 1996, pp. 83–88.
  • [9] M. Cummins and P. Newman, “Probabilistic Appearance Based Navigation and Loop Closing,” in Proceedings 2007 IEEE International Conference on Robotics and Automation.   IEEE, apr 2007, pp. 2042–2048.
  • [10] M. Milford and G. Wyeth, “Persistent Navigation and Mapping using a Biologically Inspired SLAM System,” The International Journal of Robotics Research, vol. 29, no. 9, pp. 1131–1153, jul 2009.
  • [11] S. Lynen, M. Bosse, P. Furgale, and R. Siegwart, “Placeless place-recognition,” in Proceedings - 2014 International Conference on 3D Vision, 3DV 2014.   IEEE, dec 2015, pp. 303–310.
  • [12] M. Labbe and F. Michaud, “Appearance-based loop closure detection for online large-scale and long-term operation,” IEEE Transactions on Robotics, vol. 29, no. 3, pp. 734–745, jun 2013.
  • [13] D. Galvez-Lopez and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [14] Y. Hou, H. Zhang, and S. Zhou, “Convolutional neural network-based image representation for visual loop closure detection,” in 2015 IEEE International Conference on Information and Automation, ICIA 2015 - In conjunction with 2015 IEEE International Conference on Automation and Logistics, vol. 39, no. 3, oct 2015, pp. 2238–2245.
  • [15] T. Weyand, I. Kostrikov, and J. Philbin, “Planet - photo geolocation with convolutional neural networks,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9912 LNCS, 2016, pp. 37–55.
  • [16] T. Y. Lin, Y. Cui, S. Belongie, and J. Hays, “Learning deep representations for ground-to-aerial geolocalization,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, 2015, pp. 5007–5015.
  • [17] J. Hays and A. A. Efros, “IM2GPS: estimating geographic information from a single image,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, vol. 05.   IEEE, jun 2008, pp. 1–8.
  • [18] S. Lowry, N. Sunderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. Milford, “Visual Place Recognition: A Survey,” IEEE Transactions on Robotics, vol. PP, no. 99, pp. 1–19, 2015.
  • [19] Z. Chen, O. Lam, A. Jacobson, and M. Milford, “Convolutional Neural Network-based Place Recognition,” in 2014 Australasian Conference on Robotics and Automation (ACRA 2014), nov 2014, p. 8.
  • [20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in International Conference on Learning Representations (ICLR2014), dec 2014, p. 1312.6229.
  • [21] M. Cummins and P. Newman, “FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
  • [22] M. Milford and G. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” 2012 IEEE International Conference on Robotics and Automation, pp. 1643–1649, may 2012.
  • [23] S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics.   Cambridge: MIT press, 2005.
  • [24] Y. Song, X. Chen, X. Wang, Y. Zhang, and J. Li, “Fast Estimation of Relative Poses for 6-DOF Image Localization,” in Proceedings - 2015 IEEE International Conference on Multimedia Big Data, BigMM 2015.   IEEE, apr 2015, pp. 156–163.
  • [25] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, jun 2008.
  • [26] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, may 2015, pp. 37–45.
  • [27] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 349–356.
  • [28] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, may 2015, pp. 2938–2946.
  • [29] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC — Differentiable RANSAC for Camera Localization,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, jul 2017, pp. 2492–2500.
  • [30] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, “Topometric Localization with Deep Learning,” in Proceedings of the International Symposium on Robotics Research (ISRR), Puerto Varas, Chile, 2017.
  • [31] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, jul 2017, pp. 2261–2269.
  • [32] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison, “Gvnn: Neural network library for geometric computer vision,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9915 LNCS, 2016, pp. 67–82.
  • [33] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, “Relative Camera Pose Estimation Using Convolutional Neural Networks,” pp. 675–687, feb 2017.
  • [34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning Deep Features for Scene Recognition using Places Database,” Advances in Neural Information Processing Systems 27, pp. 487–495, 2014.
  • [35] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, 2015, pp. 3431–3440.
  • [36] A. Forechi, A. F. A. De Souza, J. J. de Oliveira Neto, E. d. E. Aguiar, C. Badue, A. d’Avila Garcez, and T. Oliveira-Santos, “Fat-Fast VG-RAM WNN: A High Performance Approach,” Neurocomputing, vol. 183, no. Weightless Neural Systems, pp. 56–69, dec 2015.
  • [37] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR2015), sep 2015.
  • [38] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Simplicity: The All Convolutional Net,” in International Conference on Learning Representations (ICLR2015), dec 2014.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, 2015, pp. 1026–1034.
  • [40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, jan 2014.
  • [41] A. Kendall and R. Cipolla, “Geometric Loss Functions for Camera Pose Regression with Deep Learning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, jul 2017, pp. 6555–6564.
  • [42] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2004.
  • [43] L. d. P. Veronese, J. Guivant, F. A. A. Cheein, T. Oliveira-Santos, F. Mutz, E. de Aguiar, C. Badue, and A. F. De Souza, “A light-weight yet accurate localization system for autonomous cars in large-scale and complex environments,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, nov 2016, pp. 520–525.
  • [44] A. Elfes, “Using Occupancy Grids for Mobile Robot Perception and Navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
  • [45] F. Mutz, L. P. Veronese, T. Oliveira-Santos, E. de Aguiar, F. A. Auat Cheein, and A. Ferreira De Souza, “Large-scale mapping in complex field scenarios using an autonomous car,” Expert Systems with Applications, vol. 46, pp. 439–462, mar 2016.
  • [46] K. Yamaguchi, D. McAllester, and R. Urtasun, “Efficient joint segmentation, occlusion labeling, stereo and flow estimation,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8693 LNCS, no. PART 5, 2014, pp. 756–771.
  • [47] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of the 3rd International Conference for Learning Representations (ICLR), San Diego, USA, dec 2015.
  • [48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   The MIT Press, 2016.
  • [49] L. Contreras and W. Mayol-Cuevas, “Towards CNN Map Compression for camera relocalisation,” mar 2017.
  • [50] Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, aug 2017.