Practical Visual Localization for Autonomous Driving: Why Not Filter?

11/20/2018 ∙ by Anh-Dzung Doan, et al. ∙ 28

A major focus of current research on place recognition is visual localization for autonomous driving. However, while many visual localization algorithms for autonomous driving have achieved impressive results, it seems not all previous works have been set in a realistic setting for the problem, namely using training and testing videos that were collected in a distributed manner from multiple vehicles, all traversing through a road network in an urban area under different environmental conditions (weather, lighting, etc.). More importantly, in this setting, we show that exploiting temporal continuity in the testing sequence significantly improves visual localization - qualitatively and quantitatively. Although intuitive, this idea has not been fully explored in recent works. Our main contribution is a novel particle filtering technique that works in conjunction with a visual localization method to achieve accurate city-scale localization that is robust against environmental variations. We provide convincing results on synthetic and real datasets.



There are no comments yet.


page 2

page 3

page 11

page 12

Code Repositories


MATLAB Code for Visual Localization using Particle Filter

view repo


Visual Localization Under Appearance Change: A Filtering Approach(DICTA 2019 Best paper)

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: In the proposed framework, the training data consists of image sequences collected in a distributed manner from multiple vehicles under different environmental conditions. The red lines indicate the road network of a part of a city, while the green lines indicate a few of the trajectories taken by different data collection vehicles. Sample frames at similar locations are shown—note the differences in pose and environmental conditions between the images (this data was generated synthetically using G2D [18]). Testing data will also be an image sequence recorded from a vehicle along the road network.

Localization in autonomous vehicles is the task of identifying the position of the vehicle on a map, possibly up to 6DOF precision. Realistically, a practical self-driving system will make use of multiple types of sensors to achieve localization, such as satellite navigation systems (GNSS), inertial navigation systems (INS), and vision-based localization. The different sensors have complementary strengths, which can give higher overall accuracy if properly fused.

In computer and robotic vision, visual place recognition is the broad problem of recognizing “places” based on visual inputs [33, 8]. We regard visual localization (VL) for autonomous driving a special case of this problem [44]. Impressive results have been obtained in VL for autonomous driving [43, 28, 6, 14, 7, 9]. However, as alluded to above, vision is only one component in the sensing package, and the “raw” output of a VL algorithm will likely be fused with the results of other localization techniques, e.g., using Bayes filter. While we do not investigate the fusion aspect in our work, we stress the importance of developing and evaluating VL algorithms in the broader systems context.

To this end, we ask the question: what is a realistic framework for VL for autonomous driving? As with other vision capabilities, having a large training image database is essential to gain robustness against environmental variability (weather, time of day, season, etc.). Conceivably, data collection for a practical VL system must also be an ongoing process, since there will be continual changes in the environment (e.g., construction works, updating of façades and billboards), particularly in urban areas. Also, it makes sense to acquire data in a distributed manner using a fleet of mapping vehicles to increase spatial and temporal coverage. Indeed, there are datasets that grow continuously using data from taxis and freelance mapping vehicles [36, 22].

Based on the above data collection methodology, it is reasonable to assume that, while there are possibly an infinite number of ways in which the testing conditions can differ from the training conditions, a decent sample of the environmental variability is nonetheless captured in the training images. Thus, while VL algorithms still need to be robust against environment changes, severely adversarial testing regimes (e.g., training images all from day versus testing images all from night [44, Sec. 3.1]) would be uncommon.

Lastly, in other important vision-based functionalities for autonomous driving (e.g., pedestrian detection, lane detection), the camera will be operating continuously. Hence, it is realistic to expect videos (continuous image sequences) as inputs to the VL algorithm, as opposed to the single-image querying approach used in other place recognition works (e.g., for augmented reality [12, 52]).

The following summarizes the proposed VL framework:

  • [leftmargin=1em,parsep=1pt,itemsep=1pt,topsep=1pt]

  • Continuous data collection that uses multiple vehicles to record street-level imagery in the road network.

  • Testing data will be in the form of image sequences.

  • VL is just one component in the localization solution for autonomous driving; fusion is to be expected.

Fig. 1 illustrates the proposed framework.

Our contributions

Under the above framework, we propose a novel particle filter algorithm for 6DOF VL. We show how our particle filter can work in conjunction with existing learning- and retrieval-based VL algorithms. By using one of these VL algorithms as a noisy estimator, the particle filter can be used to enforce the temporal consistency via a stochastic motion model.

Our usage of particle filtering is justified as follows:

  • [leftmargin=1em,parsep=1pt,itemsep=1pt,topsep=2pt]

  • When navigating on the roads of an urban area111

    In more “localized” operations such as parking, where highly accurate 6DOF estimation is required, it is probably better to rely on the INS.

    , the motion of the car is fairly restricted, i.e., it largely stays on a road network on an approximately 2D plane222More fundamentally, the car is a non-holonomic system [1]. [42]. While the road surface still allows significant scope for movement (cf. Fig. 1), relative to the size of the map, the motion of the car is rather limited333On uneven or hilly roads, accelerometers can be used to estimate the vertical motion, hence, VL can focus on map-scale navigation.. This is echoed by the recent work of [44], who observed that there is “lower variation in viewpoints as the car follows the same road”. Hence, a particle filter with a simple motion model suffices to track the 6DOF rigid motion of the car.

  • In many cities, the road networks are complex Euclidean graphs. In fact, it is well known that using (visual) odometry alone, it is possible to accurately track a car on a 2D road map [11]444The method of [11] will give ambiguous results on non-informative trajectories, e.g., largely straight routes. Hence, VL is still crucial.. More fundamentally, this suggests that temporal continuity in the testing sequence (which is fully exploited by our method) strongly benefits VL.

To handle environmental variations between the testing and training images, we use existing robust VL algorithms (DenseVLAD [52], PoseNet [28] and MapNet [9]) as the “front end” of our particle filter. More importantly, we demonstrate how our particle filter is able to correct for the errors made by the front-end, and greatly improve its accuracy. This is despite some of these methods (e.g., MapNet) already exploiting temporal constraints in the training.

We test our approach on real [34] and synthetic datasets. Fig. 1 provides an overview of our synthetic data, which was produced using G2D [18]. Since our synthetic dataset is the first that subscribes to the realistic VL framework, this is also a major contribution of our work555See supplementary material for sample images.. Note that there is an increasing recognition of the value of synthetic data towards building autonomous driving systems [53].

2 Related work

Figure 2: An example of our proposed particle filter. 1st row is the distribution of particles, 2nd row is the query image, and from 3rd to 7th rows are the retrieval results of DenseVLAD. Red bounding boxes indicate mistakes. The color bar describes the particle weights. Red lines are the states predicted by the particle filter. At the first frame, particle filter randomly generates 1,000 particles with weight , then particles which are inconsistent with measurements of DenseVLAD are vanished. Eventually, the ambiguity is resolved, and the particles track the movement of the vehicle. It is also worth mentioning that the query and retrieved images are from different time and weather conditions. In particular, the query images are from 6:26pm-cloudy, and the correct retrieved images are from 11:24am-rainy, 9:15pm-clear, and 1:28pm-sunny conditions.

2.1 Datasets for autonomous driving

In the recent work of Sattler et al. [44] which is aimed at autonomous driving, several state-of-the-art algorithms [45, 50, 52, 4, 16] were compared on three datasets: Aachen Day-Night, RobotCar Seasons and CMU Seasons. Although the datasets include substantial environmental variations, and the testing images are sequential in RobotCar Seasons and CMU Seasons, the testing regime does not examine the benefit of temporal continuity for VL. Although there is a section in [44] on “multi-image queries”, it is a relatively small part of the benchmark, with only two methods tested; seqSLAM [35] and an extension of Active Search [45] that uses triplets of images. The former is known to suffer significantly from slight pose variation [49], while the latter does not fully take into account temporal continuity. It is worth mentioning that [44] suggests “multi-image localization” as a promising future research.

Another recent dataset for VL is ApolloScape [24]

, which was collected in various weather conditions, but it only covers 7 roads in 4 different cities, and the video sequences were recorded in daytime. In addition, their evaluation metric, which is the median errors of translation and rotation, does not completely reflect the accuracy of VL in the scenario of autonomous driving. In our experiment, we show that even when the median errors between different methods are comparable, the smoothness of the predicted poses can be different (See section

5.2 for more details).

Apart from VL, there are many datasets aimed towards tasks related to autonomous driving. CamVid [10] is presented for semantic segmentation and camera pose estimation, but it is relatively small with 4 video sequences in total duration of 20 minutes. KITTI benchmark suite [20]

provides ground truth for various computer vision tasks (e.g., stereo reconstruction, optical flow, visual odometry, object detection and tracking). Cityscapes

[15] and Mapillary Vistas [36] target the semantic sementation problem.

Realistic datasets need a lot of effort of annotating the ground truth, thus synthetic data generation tools are presented to overcome that difficulty. SYNTHIA [40] is a large-scale synthetic image generator based on a virtual world constructed via the Unity platform [2]. Similarly, Playing for benchmarks [39] and G2D [18] extract the image data and ground truth from GTA.

2.2 VL algorithms

Current algorithms can be categorized into local feature-based methods and learning-based methods.

Broadly speaking, local feature-based methods estimate the pose of the query image from 2D-3D correspondences by solving the PnP problem [30] or an equivalent step [44, 41]. The main difficulty is establishing and validating the 2D-3D correspondences with the large (city-scale) 3D model. Many methods attempt to improve this step by using geometric constraints [31, 43], semantic information in the images [47, 51], and learning-based matching [48, 6, 7].

Learning-based methods perform 6DOF pose estimation as an image-to-SE(3) regression problem. PoseNet [28]

uses a convolutional neural network (CNN) to learn the regression mapping. LSTM-Pose 


argues that the usage of a fully connected layer in PoseNet possibly leads to the overfitting, despite the usage of dropout. Therefore, they use Long-Short Term Memory (LSTM) units. Other variants of PoseNet include 

[26] who incorporates uncertainty in the pose prediction, and [27]

who considers geometric (re-projection) errors in the loss function (the latter also replace the tunable parameters in PoseNet by learnable ones).

We remark that our particle filtering algorithm can work in conjunction with all the algorithms above.

2.3 Exploiting temporal information

The methods surveyed above conduct mainly per-image camera pose estimation. Recently, there are methods that also take into account temporal continuity in the image sequence. These have mostly appeared in the learning framework. Prominent methods include VidLoc [14], which uses bi-directional LSTM architecture to exploit temporal constraints in the data, and MapNet [9], which incorporates the relative poses between consecutive frames in the training loss function. For testing, both of these networks take an input as sequence of images, and output a sequence of poses. Interestingly, [9] also proposes to conduct pose graph optimization [21] to post-process the output pose sequences. Similar to MapNet, which uses visual odometry (VO) for pose graph optimization, works in [32, 37, 17, 13] also leverage VO to verify the consistency between consecutive pose estimates and ego-motion from VO.

More recently, in DeLS-3D [55]

, a recurrent neural network (RNN) is trained to model the temporal dependency in the image sequence. The RNN is then used to post-process the sequence of poses predicted by a CNN. Interestingly, they also compared the RNN with a Kalman filter, and observed that “

the motion predicted from RNN is better than using a Kalman filter with a constant speed assumption”.

2.4 Advantages of particle filter

Based on the above, the post-processing techniques that have been applied to VL for autonomous driving are:

  • [leftmargin=1em,parsep=1pt,itemsep=1pt,topsep=2pt]

  • Visual odometry (VO) (e.g., using pose graph optimization [9]).

  • Recurrent neural network (RNN) [14] [55].

  • Kalman filter (KF) [55].

The conceptual advantages of our proposed particle filter (PF) method to the above techniques are:

  • [leftmargin=1em,parsep=1pt,itemsep=1pt,topsep=2pt]

  • The dependence on visual odometry (VO) increases computational load and introduces an extra failure point. In contrast, our PF works well with a simple motion model - as mentioned in Section 1, the motion of a car on the roads is relatively restricted.

  • While RNN requires parameter learning, PF does not - this represents a significant benefit in the VL framework where new training images are acquired continuously.

  • KF is only valid for only linear-Gaussian Bayesian filtering problems [5], which do not include 6DOF rigid motion estimation. This could be a reason behind the poorer result of KF in [55]

    . Moreover, in our PF method, outlier rejection is performed to discard bad pose estimates, which invariably arise due to environmental variability. It is unknown whether this was done in


3 Problem formulation

With the proposed framework, we define the map , where is an image sequence comprising the image frame and its 6DOF camera pose at time step with . Every can be separately collected by multiple vehicles running in different weathers, time and routes. The pose for each frame can be attained by conducting collaborative SLAM [19, 38, 46] as an off-line stage. In particular, every vehicle performs its VO before sending its map to a central server. After that, the server detects the overlaps between individual maps, and then fuses them together to get a global map .

For the on-line inference, given a query sequence , our goal is to estimate the camera pose for each with respect to the map .

4 Particle filter for VL

We define the 6DOF camera pose as a hidden state:

where, is the hidden state at time , and respectively are 3D position and Euler orientation. Following the first-order Markov process, Bayes filter recursively determines the hidden state as follows:

where, is the scale factor, and are motion prediction and measurement update respectively, state is distributed according to . This process consists of two parts: the current state is predicted by the motion model, and then it is updated according to the sensor measurement. In this paper, we select particle filter, which is a popular implementation of Bayes filter.

The idea of particle filter is to leverage the Monte Carlo principle to approximate the probability distribution

. In particular, each time step maintains a set of particles , each of which has a state and a weight: . As the number of particles

are sufficiently large, this set of particles is equivalent to a representation of a probability density function.

In the following sections, we present our proposed motion prediction and measurement update which can be integrated into the particle filter implementation. Our method offers three advantages. Firstly, it correctly models the temporal constraint through the proposed motion model. Second, the uncertainty of VL methods is parameterized appropriately by the noise sensor model. Finally, it can be easily incorporated with a wide range of VL approaches [52, 28, 27, 9].

4.1 Motion prediction

In the autonomous driving scenario, if the visual data is an image sequence, whose motion is driven with a control input. In practive, that input can be attained from other sensors (e.g., INS). However, as mentioned in Section 1, the motion of a car is fairly restricted, hence we use hand-crafted motion model in this work.

Mathematically, for each particle, we draw the velocity and angular velocity , where and respectively represent the linear and angular velocities. The accelerations are modeled by the noise covariance matrices and . Then, for each particle, their motion in each time step is:

In practice, the , , , and can be either manually tuned, or estimated from training data [29, 3].

It is worth noting that while 3D positions can be easily updated by using simple additions, it does not make sense for Euler angles. This is because the space of Euler angles is not a vector space. Hence, in the Kinematics control

[25], one will convert two Euler angles to the Direction Cosine Matrix (DCM), multiply two matrices and convert the result back to the Euler representation.

Define is a function that maps an Euler representation to DCM and is its inverse mapping. Our motion model for each particle is the following:


The experimental results will show that our motion model can properly handle the temporal evolution of camera pose in an image sequence.

4.2 Measurement update

VL methods, regardless of how accurate, are always subject to uncertainty. Therefore, we consider the prediction of an VL approach as a noisy measurement from a sensor.

Given an image at time , we define the measurement , where is a nonlinear function, which can be one of existing VL methods [28] [6] [52] [50] [43]. Particularly, for the retrieval-based methods (e.g., DenseVLAD [52]), one takes an input image, outputs a set of similar database images, performs clustering, and then calculates the mean of 6DOF poses. For learning-based approaches (e.g., PoseNet [28] or MapNet [9]), the function is a deep neural network, which receives a query image, then directly outputs the 6DOF pose (See section 5.1 for more details). However, because the prediction of these methods is viewed as a noisy measurement, we define a measurement model:


is a noise variable distributed according to the Gaussian distribution, thus

is also distributed in the Gaussian with mean and covariance . Every particle is assigned a weight as the following:


Our particle filter is described in Algorithm 1. The probability distribution is approximated by the particle set . The purpose of the lines in the algorithm is to resample the particles based on their weights to prevent from the degeneracy problem, which likely occurs in the life-long localization scenario. Figure 2 shows the filtering performed by our proposed particle filter, this example uses DenseVLAD [52] as a noisy measurement sensor. At the first iteration, particle filter randomly generates many hypotheses. Then, hypotheses with small weights vanish if they are inconsistent with the observation measured by DenseVLAD. Finally, the ambiguity is resolved, and the particles successfully track the vehicle. Furthermore, the query and retrieved images are from different times and weather conditions.

1:function ParticleFilter()
3:     for  to  do
4:         sample ,
5:         compute as (1)
6:         compute as (2)
8:     normalize
9:     for  to  do
10:         draw with probability
11:         add to      
12:     return
Algorithm 1 Particle filter for visual localization

5 Experiments

5.1 Implementation details

In our experiment, we select DenseVLAD [52], PoseNet [28] and MapNet [9] as baseline. DenseVLAD+PF, PoseNet+PF, and MapNet+PF respectively are the abbreviations of DenseVLAD, PoseNet and MapNet working in conjunction with our particle filter.

For DenseVLAD, given a query image, we retrieve top 20 database images, then apply mean shift algorithm to perform clustering based on 3D positions of retrieved images. To estimate the 6DOF pose of the query, we select the cluster with the largest number of images, and then compute the mean of 3D positions and orientations of those images. For the learning approach, following suggestion of [9], PoseNet uses ResNet-34 [23]

, adds a global average pooling layer, and parameterizes camera orientation as logarithm of a unit quaternion. PoseNet is implemented in Tensorflow, and MapNet implementation is provided by the authors. Parameters of PoseNet and MapNet are set as the suggestion of the authors.

Regarding our framework, we initialize particles from Gaussian distribution with the mean from the prediction of DenseVLAD, PoseNet or MapNet in the first frame. The covariance matrices for initializing 3D location and orientation respectively are and . The number of particles are .

5.2 Oxford RobotCar

DenseVLAD 27.66m, 15.35
DenseVLAD+PF 7.03m, 3.89
PoseNet 10.45m, 4.45
PoseNet+PF 6.84m, 3.75
MapNet 6.02m, 2.38 28.16m, 8.50
MapNet+PGO 5.76m, 2.27 27.75m, 8.47
MapNet+PF 4.73m, 2.20 17.05m, 8.82
DenseVLAD 6.18m, 1.61
DenseVLAD+PF 6.61m, 1.71
PoseNet 5.01m, 1.67
PoseNet+PF 4.74m, 1.74
MapNet 3.99m, 1.02 7.08m, 1.02
MapNet+PGO 3.88m, 1.01 6.99m, 1.01
MapNet+PF 3.68m, 1.10 7.77m, 1.35
Table 1: Mean (top) and Median (bottom) rotation and translation errors on the Oxford RobotCar dataset. Although the median errors are comparable between VL methods before and after using our particle filter, the mean errors after smoothing are considerably improved. The visualization of predicted trajectories is shown in Figures 3.
Figure 3: Results on LOOP (top) and FULL (bottom) routes in Oxford RobotCar dataset. The length of LOOP and FULL routes are 1km and 9.5km respectively. The green lines are ground truth, and red lines are predicted trajectories.
DenseVLAD and DenseVLAD+PF
MapNet and MapNet+PF
Figure 4: Prediction of 6D camera pose of DenseVLAD [52] and MapNet [9] before and after integrated to our particle filter framework. Before filtering, the variation of DenseVLAD and MapNet predictions (red lines) is large. However, after conducting particle filter, the predicted poses (blue lines) are more smooth. Green lines are the ground truth.

Firstly, we test the performance of the proposed particle filter working in conjunction with state-of-the-art methods (e.g., PoseNet [28], MapNet [9], and DenseVLAD [52]) on Oxford RobotCar dataset [34]. We follow the configuration, and the split of training/testing sequences suggested by [9].

In particular, the experiment is conducted on the LOOP and FULL scenes with the length of 1km and 9.5km respectively. PoseNet is learned with the training set. MapNet is initially trained with the training set, and then fine-tuned in an unlabeled dataset. DenseVLAD uses the training set as the reference database. The parameters for particle filter are set as the following: , , for LOOP scene, and , , for FULL scene, where is a function that outputs a square diagonal matrix with the elements of input vector on the main diagonal. In both scenes, , .

For comparison, we select pose graph optimization (PGO) [9], which fuses the absolute poses and relative odometry (VO) during the inference.


DenseVLAD, PoseNet, and MapNet are used as noisy measurement sensors in our framework. Note that during the inference, DenseVLAD, PoseNet, and MapNet do not take into account the temporal smoothness, hence the particle filter framework helps them through our proposed motion prediction. As shown in Figure 4, the 6DOF poses are smoother after filtering the outputs of DenseVLAD and MapNet. As a result, Figure 3 shows that the proposed particle filter smooths the trajectory of DenseVLAD, PoseNet and MapNet. Figure 3 also demonstrates that trajectory produced by PGO is less smooth than particle filter. One possible reason could be there are some existing outliers from the predictions of both MapNet and VO, which will probably make PGO stuck in a local minimum.

Mean vs median errors

Due to the smoothness offered by particle filter, the outliers produced by DenseVLAD, PoseNet and MapNet are removed. It leads to a drastic improvement in the mean of translation and rotation errors as shown in Table 1, while the median errors between methods remain stable. Furthermore, the results of Table 1 and Figure 3 suggest that smoothness does not greatly influence to the median errors, whilst it has a relative connection to mean errors.

5.3 Synthetic dataset

In this section, we test the feasibility of our solution which is building the map with a distributed manner in the VL for autonomous driving.

To do that, we adapt G2D [18] to construct a large-scale map. Particularly, we simulate that there are 59 vehicles which run independently in different routes. The bird’eye view of our synthetic dataset is illustrated in Figure 5, the coverage area is about 3.36km. Also, those vehicles are simulated to run in different times of day and weather conditions. The distributions of times and weather conditions are shown in two histograms of Figure 5

. The times and weathers in training sequences are uniformly distributed from 1am to 11pm, and in 7 different weather conditions (snowy, foggy, clear, overcast, cloudy, sunny and rainy). Five sequences in different times and weathers are also collected for the testing phase. The statistics information and trajectory of the testing sequences are described in Table

2 and Figure 5 respectively. We sub-sampled on the training sequences at 2Hz and the testing sequences at 4Hz666See supplementary material for more sample images.

Figure 5: Bird’s eye view and distribution of environmental conditions of our synthetic dataset. We simulate that there are 59 vehicles running in different routes, distinct time and weathers. Grid lines are every 100m. In training set, the coverage area is about 3.36km, the time and weather conditions are uniformly distributed. The statistics of testing sequences is shown in Table 2
Sequences # images Time & Traversal
Weather distance
Test seq-1 1451 9:36 am, snowy 1393.03m
Test seq-2 360 10:41 pm, clear 359.74m
Test seq-3 1564 11:11 am, rainy 1566.93m
Test seq-4 805 6:26 pm, cloudy 806.56m
Test seq-5 1013 3:05 pm, overcast 1014.91m
Table 2: Statistics of the testing sequences in the synthetic dataset
Figure 6: Results on testing sequences of the synthetic dataset. The green and red lines respectively are the ground truth and the predicted trajectory. Without smoothing, DenseVLAD produces many outliers, while the trajectory is more smooth after conducting filtering.
Sequence DenseVLAD DenseVLAD+PF
Test seq-1 7.79m, 4.44 3.89m, 4.17
Test seq-2 90.59m, 30.70 23.67m, 24.44
Test seq-3 15.51m, 6.46 3.91m, 4.66
Test seq-4 6.47m, 2.31 2.97m, 2.39
Test seq-5 43.14m, 10.75 4.03m, 9.07
Sequence DenseVLAD DenseVLAD+PF
Test seq-1 2.57m, 3.31 2.63m, 3.46
Test seq-2 4.31m, 1.38 5.03m, 2.12
Test seq-3 3.29m, 3.47 3.21m, 4.03
Test seq-4 2.73m, 1.17 2.58m, 1.82
Test seq-5 1.78m, 7.06 1.83m, 7.29
Table 3: Mean (top) and median (bottom) errors in rotation and translation on testing sequences of the synthetic dataset. By performing the proposed particle filter, the mean errors are dramatically decreased while the median errors remain roughly the same.

In particle filter, we set , , , and .

The benefit of smoothing 6DOF poses is illustrated in Figure 6. The variation of the trajectory predicted by DenseVLAD is very large, but it becomes fairly smooth after filtering. As a consequence, quantitative results in Table 3 show that smoothness leads to a substantial improvement in terms of mean errors, while median errors are slightly influenced.

6 Conclusion

In summary, this paper presents a practical solution that constructs a city-scale map in the distributed manner for the application of self-localization in autonomous driving. The feasibility of our solution is tested and confirmed on a large-scale synthetic dataset. Furthermore, our proposed particle filter is shown as an efficient method, which can exploit the temporal smoothness of an image sequence. Our approach can be effectively and easily incorporated with a broad range of camera localization methods.


  • [1]
  • [2] Unity homepage.
  • [3] P. Abbeel, A. Coates, M. Montemerlo, A. Y. Ng, and S. Thrun. Discriminative training of kalman filters. In Robotics: Science and systems, 2005.
  • [4] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In CVPR, 2016.
  • [5] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on signal processing, 2002.
  • [6] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC-differentiable RANSAC for camera localization. In CVPR, 2017.
  • [7] E. Brachmann and C. Rother. Learning less is more-6d camera localization via 3d surface regression. In CVPR, 2018.
  • [8] E. Brachmann and T. Sattler. Visual Localization: Feature-based vs. Learned Approaches., 2018.
  • [9] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-aware learning of maps for camera localization. In CVPR, 2018.
  • [10] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2009.
  • [11] M. A. Brubaker, A. Geiger, and R. Urtasun. Lost! leveraging the crowd for probabilistic visual self-localization. In CVPR, 2013.
  • [12] D. M. Chen, G. Baatz, B. Girod, R. Grzeszczuk, K. Koser, S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, et al. City-scale landmark identification on mobile devices. 2011.
  • [13] W. Churchill and P. Newman. Experience-based navigation for long-term localisation. The International Journal of Robotics Research, 2013.
  • [14] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization. In CVPR, 2017.
  • [15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele.

    The cityscapes dataset for semantic urban scene understanding.

    In CVPR, 2016.
  • [16] M. Cummins and P. Newman. Fab-map: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 2008.
  • [17] J. Dequaire, C. H. Tong, W. Churchill, and I. Posner. Off the beaten track: Predicting localisation performance in visual teach and repeat. In ICRA, 2016.
  • [18] A.-D. Doan, A. M. Jawaid, T.-T. Do, and T.-J. Chin. G2D: from GTA to Data. arXiv preprint arXiv:1806.07381, pages 1–9, 2018.
  • [19] C. Forster, S. Lynen, L. Kneip, and D. Scaramuzza. Collaborative monocular slam with multiple micro aerial vehicles. In IROS, 2013.
  • [20] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012.
  • [21] G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard. A tutorial on graph-based slam. IEEE Intelligent Transportation Systems Magazine, 2010.
  • [22] M. Haklay and P. Weber. OpenStreetMap: User-generated street maps. IEEE Pervasive Computing, 2008.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [24] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The apolloscape dataset for autonomous driving. arXiv: 1803.06184, 2018.
  • [25] J. L. Junkins and H. Schaub. Analytical mechanics of space systems. American Institute of Aeronautics and Astronautics, 2009.
  • [26] A. Kendall and R. Cipolla.

    Modelling uncertainty in deep learning for camera relocalization.

    ICRA, 2015.
  • [27] A. Kendall, R. Cipolla, et al. Geometric loss functions for camera pose regression with deep learning. In CVPR, 2017.
  • [28] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In CVPR, 2015.
  • [29] J. Ko and D. Fox. Gp-bayesfilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 2009.
  • [30] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate o(n) solution to the PnP problem. IJCV, 2009.
  • [31] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In ECCV, 2010.
  • [32] C. Linegar, W. Churchill, and P. Newman. Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera. In ICRA, 2016.
  • [33] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 2016.
  • [34] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 2017.
  • [35] M. J. Milford and G. F. Wyeth. Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights. In ICRA, 2012.
  • [36] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
  • [37] G. Pascoe, W. Maddern, and P. Newman. Direct visual localisation and calibration for road vehicles in changing city environments. In ICCV Workshops, 2015.
  • [38] L. Riazuelo, J. Civera, and J. M. Montiel. C2TAM: A cloud framework for cooperative tracking and mapping. Robotics and Autonomous Systems, 2014.
  • [39] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In ICCV, 2017.
  • [40] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [41] A. Roshan Zamir, S. Ardeshir, and M. Shah. GPS-tag refinement using random walks with an adaptive damping factor. In CVPR, 2014.
  • [42] C. Rubino, A. Del Bue, and T.-J. Chin. Practical motion segmentation for urban street view scenes. In ICRA, 2018.
  • [43] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. TPAMI, 2017.
  • [44] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In CVPR, 2018.
  • [45] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are large-scale 3d models really necessary for accurate visual localization? In CVPR, 2017.
  • [46] P. Schmuck and M. Chli. Multi-UAV collaborative monocular slam. In ICRA, pages 3863–3870. IEEE, 2017.
  • [47] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler. Semantic visual localization. CVPR, 2018.
  • [48] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, 2013.
  • [49] N. Sünderhauf, P. Neubert, and P. Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In ICRA Workshop on Long-Term Autonomy, 2013.
  • [50] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson. City-scale localization for cameras with known vertical direction. TPAMI, 2017.
  • [51] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl. Semantic match consistency for long-term visual localization. In ECCV, 2018.
  • [52] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In CVPR, 2015.
  • [53] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPR Workshop on Autonomous Driving, 2018.
  • [54] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using lstms for structured feature correlation. In ICCV, 2017.
  • [55] P. Wang, R. Yang, B. Cao, W. Xu, and Y. Lin. DeLS-3D: Deep localization and segmentation with a 3d semantic map. In CVPR, 2018.

7 Supplementary Material

Training samples in our synthetic dataset is shown in Figure (f)f, and Figure S2 are some testing samples.

Figure S1: Training samples.
Figure S2: Testing samples.