MATLAB Code for Visual Localization using Particle Filter
A major focus of current research on place recognition is visual localization for autonomous driving. However, while many visual localization algorithms for autonomous driving have achieved impressive results, it seems not all previous works have been set in a realistic setting for the problem, namely using training and testing videos collected in a distributed manner from multiple vehicles, all traversing a road network in an urban area under different environmental conditions (weather, lighting, etc.). More importantly, in this setting, we show that exploiting temporal continuity in the testing sequence significantly improves visual localization, both qualitatively and quantitatively. Although intuitive, this idea has not been fully explored in recent works. Our main contribution is a novel particle filtering technique that works in conjunction with a visual localization method to achieve accurate city-scale localization that is robust against environmental variations. We provide convincing results on synthetic and real datasets.
Visual Localization Under Appearance Change: A Filtering Approach (DICTA 2019 Best Paper): https://arxiv.org/abs/1811.08063
Localization in autonomous vehicles is the task of identifying the position of the vehicle on a map, possibly up to 6DOF precision. Realistically, a practical self-driving system will make use of multiple types of sensors to achieve localization, such as satellite navigation systems (GNSS), inertial navigation systems (INS), and vision-based localization. The different sensors have complementary strengths, which can give higher overall accuracy if properly fused.
In computer and robotic vision, visual place recognition is the broad problem of recognizing “places” based on visual inputs [33, 8]. We regard visual localization (VL) for autonomous driving as a special case of this problem. Impressive results have been obtained in VL for autonomous driving [43, 28, 6, 14, 7, 9]. However, as alluded to above, vision is only one component in the sensing package, and the “raw” output of a VL algorithm will likely be fused with the results of other localization techniques, e.g., using a Bayes filter. While we do not investigate the fusion aspect in our work, we stress the importance of developing and evaluating VL algorithms in this broader systems context.
To this end, we ask the question: what is a realistic framework for VL for autonomous driving? As with other vision capabilities, having a large training image database is essential to gain robustness against environmental variability (weather, time of day, season, etc.). Conceivably, data collection for a practical VL system must also be an ongoing process, since there will be continual changes in the environment (e.g., construction works, updating of façades and billboards), particularly in urban areas. Also, it makes sense to acquire data in a distributed manner using a fleet of mapping vehicles to increase spatial and temporal coverage. Indeed, there are datasets that grow continuously using data from taxis and freelance mapping vehicles [36, 22].
Based on the above data collection methodology, it is reasonable to assume that, while there are possibly an infinite number of ways in which the testing conditions can differ from the training conditions, a decent sample of the environmental variability is nonetheless captured in the training images. Thus, while VL algorithms still need to be robust against environment changes, severely adversarial testing regimes (e.g., training images all from day versus testing images all from night [44, Sec. 3.1]) would be uncommon.
Lastly, in other important vision-based functionalities for autonomous driving (e.g., pedestrian detection, lane detection), the camera will be operating continuously. Hence, it is realistic to expect videos (continuous image sequences) as inputs to the VL algorithm, as opposed to the single-image querying approach used in other place recognition works (e.g., for augmented reality [12, 52]).
The following summarizes the proposed VL framework:
Continuous data collection that uses multiple vehicles to record street-level imagery in the road network.
Testing data will be in the form of image sequences.
VL is just one component in the localization solution for autonomous driving; fusion is to be expected.
Fig. 1 illustrates the proposed framework.
Under the above framework, we propose a novel particle filter algorithm for 6DOF VL. We show how our particle filter can work in conjunction with existing learning- and retrieval-based VL algorithms. By using one of these VL algorithms as a noisy estimator, the particle filter enforces temporal consistency via a stochastic motion model.
Our usage of particle filtering is justified as follows:
When navigating the roads of an urban area, the motion of the car is fairly restricted: it largely stays on a road network on an approximately 2D plane (more fundamentally, the car is a non-holonomic system). While the road surface still allows significant scope for movement (cf. Fig. 1), relative to the size of the map the motion of the car is rather limited (on uneven or hilly roads, accelerometers can be used to estimate the vertical motion, so VL can focus on map-scale navigation). This is echoed by recent work observing that there is “lower variation in viewpoints as the car follows the same road”. Hence, a particle filter with a simple motion model suffices to track the 6DOF rigid motion of the car. (In more “localized” operations such as parking, where highly accurate 6DOF estimation is required, it is probably better to rely on the INS.)
In many cities, the road networks are complex Euclidean graphs. In fact, it is well known that, using (visual) odometry alone, it is possible to accurately track a car on a 2D road map (though odometry alone gives ambiguous results on non-informative trajectories, e.g., largely straight routes, hence VL is still crucial). More fundamentally, this suggests that temporal continuity in the testing sequence (which is fully exploited by our method) strongly benefits VL.
To handle environmental variations between the testing and training images, we use existing robust VL algorithms (DenseVLAD, PoseNet and MapNet) as the “front end” of our particle filter. More importantly, we demonstrate how our particle filter is able to correct the errors made by the front end and greatly improve its accuracy. This is despite some of these methods (e.g., MapNet) already exploiting temporal constraints during training.
We test our approach on real and synthetic datasets. Fig. 1 provides an overview of our synthetic data, which was produced using G2D. Since our synthetic dataset is the first that subscribes to the realistic VL framework, it is also a major contribution of our work (see the supplementary material for sample images). Note that there is increasing recognition of the value of synthetic data for building autonomous driving systems.
In the recent work of Sattler et al., which is aimed at autonomous driving, several state-of-the-art algorithms [45, 50, 52, 4, 16] were compared on three datasets: Aachen Day-Night, RobotCar Seasons and CMU Seasons. Although the datasets include substantial environmental variations, and the testing images are sequential in RobotCar Seasons and CMU Seasons, the testing regime does not examine the benefit of temporal continuity for VL. Although there is a section on “multi-image queries”, it is a relatively small part of the benchmark, with only two methods tested: seqSLAM, and an extension of Active Search that uses triplets of images. The former is known to suffer significantly from slight pose variations, while the latter does not fully take temporal continuity into account. It is worth mentioning that the benchmark suggests “multi-image localization” as a promising future research direction.
Another recent dataset for VL is ApolloScape, which was collected in various weather conditions, but it only covers 7 roads in 4 different cities, and the video sequences were recorded in daytime. In addition, their evaluation metric, the median translation and rotation errors, does not completely reflect the accuracy of VL in the autonomous driving scenario. In our experiments, we show that even when the median errors of different methods are comparable, the smoothness of the predicted poses can differ (see Section 5.2 for more details).
Apart from VL, there are many datasets aimed at tasks related to autonomous driving. CamVid is presented for semantic segmentation and camera pose estimation, but it is relatively small, with 4 video sequences totalling 20 minutes. The KITTI benchmark suite provides ground truth for various computer vision tasks (e.g., stereo reconstruction, optical flow, visual odometry, object detection and tracking). Cityscapes and Mapillary Vistas target the semantic segmentation problem.
Realistic datasets require substantial ground-truth annotation effort, so synthetic data generation tools have been presented to overcome that difficulty. SYNTHIA is a large-scale synthetic image generator based on a virtual world constructed via the Unity platform. Similarly, Playing for Benchmarks and G2D extract image data and ground truth from GTA.
Current algorithms can be categorized into local feature-based methods and learning-based methods.
Broadly speaking, local feature-based methods estimate the pose of the query image from 2D-3D correspondences by solving the PnP problem or an equivalent step [44, 41]. The main difficulty is establishing and validating the 2D-3D correspondences against the large (city-scale) 3D model. Many methods attempt to improve this step by using geometric constraints [31, 43], semantic information in the images [47, 51], and learning-based matching [48, 6, 7].
Learning-based methods treat 6DOF pose estimation as an image-to-SE(3) regression problem. PoseNet uses a convolutional neural network (CNN) to learn the regression mapping. LSTM-Pose argues that the fully connected layer in PoseNet can lead to overfitting, despite the use of dropout, and therefore replaces it with Long Short-Term Memory (LSTM) units. Other variants of PoseNet incorporate uncertainty in the pose prediction, or consider geometric (re-projection) errors in the loss function (the latter also replaces the tunable parameters in PoseNet with learnable ones).
We remark that our particle filtering algorithm can work in conjunction with all the algorithms above.
The methods surveyed above mainly conduct per-image camera pose estimation. Recently, methods have appeared that also take into account temporal continuity in the image sequence, mostly within the learning framework. Prominent methods include VidLoc, which uses a bi-directional LSTM architecture to exploit temporal constraints in the data, and MapNet, which incorporates the relative poses between consecutive frames in the training loss function. For testing, both of these networks take a sequence of images as input and output a sequence of poses. Interestingly, MapNet also proposes pose graph optimization to post-process the output pose sequences. Similar to MapNet, which uses visual odometry (VO) for pose graph optimization, the works in [32, 37, 17, 13] also leverage VO to verify the consistency between consecutive pose estimates and the ego-motion from VO.
More recently, in DeLS-3D, a recurrent neural network (RNN) is trained to model the temporal dependency in the image sequence. The RNN is then used to post-process the sequence of poses predicted by a CNN. Interestingly, the authors also compared the RNN with a Kalman filter, and observed that “the motion predicted from RNN is better than using a Kalman filter with a constant speed assumption”.
Based on the above, the post-processing techniques that have been applied to VL for autonomous driving are:
Visual odometry (VO) (e.g., using pose graph optimization).
Kalman filter (KF).
The conceptual advantages of our proposed particle filter (PF) method over the above techniques are:
The dependence on visual odometry (VO) increases computational load and introduces an extra failure point. In contrast, our PF works well with a simple motion model - as mentioned in Section 1, the motion of a car on the roads is relatively restricted.
While an RNN requires parameter learning, a PF does not - this represents a significant benefit in our VL framework, where new training images are acquired continuously.
Moreover, in our PF method, outlier rejection is performed to discard bad pose estimates, which invariably arise due to environmental variability. It is unclear whether this was done in the above works.
With the proposed framework, we define the map $\mathcal{M} = \{\mathcal{S}_1, \dots, \mathcal{S}_M\}$, where $\mathcal{S}_i = \{(I_t, \mathbf{s}_t)\}$ is an image sequence comprising the image frame $I_t$ and its 6DOF camera pose $\mathbf{s}_t$ at time step $t$. Every $\mathcal{S}_i$ can be separately collected by one of multiple vehicles driving under different weather conditions, at different times, and along different routes. The pose for each frame can be obtained by conducting collaborative SLAM [19, 38, 46] as an off-line stage. In particular, every vehicle performs its VO before sending its map to a central server. After that, the server detects the overlaps between individual maps, and then fuses them together to obtain a global map $\mathcal{M}$.
For on-line inference, given a query sequence $\mathcal{Q} = \{I_t\}$, our goal is to estimate the camera pose for each $I_t$ with respect to the map $\mathcal{M}$.
We define the 6DOF camera pose as a hidden state:

$\mathbf{s}_t = [\mathbf{x}_t^\top, \mathbf{r}_t^\top]^\top,$

where $\mathbf{s}_t$ is the hidden state at time $t$, and $\mathbf{x}_t \in \mathbb{R}^3$ and $\mathbf{r}_t \in \mathbb{R}^3$ are respectively the 3D position and the Euler orientation. Following the first-order Markov assumption, the Bayes filter recursively determines the hidden state as follows:

$p(\mathbf{s}_t \mid \mathbf{z}_{1:t}) = \eta \, p(\mathbf{z}_t \mid \mathbf{s}_t) \int p(\mathbf{s}_t \mid \mathbf{s}_{t-1}) \, p(\mathbf{s}_{t-1} \mid \mathbf{z}_{1:t-1}) \, \mathrm{d}\mathbf{s}_{t-1},$

where $\eta$ is a scale (normalizing) factor, and $p(\mathbf{s}_t \mid \mathbf{s}_{t-1})$ and $p(\mathbf{z}_t \mid \mathbf{s}_t)$ are the motion prediction and measurement update respectively; the state is distributed according to $p(\mathbf{s}_t \mid \mathbf{z}_{1:t})$. This process consists of two parts: the current state is predicted by the motion model, and then it is updated according to the sensor measurement. In this paper, we select the particle filter, a popular implementation of the Bayes filter.
The idea of the particle filter is to leverage the Monte Carlo principle to approximate the probability distribution. In particular, at each time step $t$ we maintain a set of particles $\mathcal{P}_t = \{(\mathbf{s}_t^{(i)}, w_t^{(i)})\}_{i=1}^{N}$, each of which has a state $\mathbf{s}_t^{(i)}$ and a weight $w_t^{(i)}$. When the number of particles $N$ is sufficiently large, this set of particles is a good representation of the probability density function.
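To make the later sketches concrete, here is one way the particle set could be held in MATLAB. This is a minimal sketch; the struct fields (s, v, omega, w) and the particle count are our own choices, not prescribed by the paper:

```matlab
% Minimal sketch of the particle set as a MATLAB struct array.
% s:     6x1 state [x; y; z; roll; pitch; yaw]
% v:     3x1 linear velocity,  omega: 3x1 angular velocity
% w:     scalar importance weight
N = 2000;                                          % assumed particle count
p0 = struct('s', zeros(6,1), 'v', zeros(3,1), ...
            'omega', zeros(3,1), 'w', 1/N);
particles = repmat(p0, 1, N);                      % uniformly weighted set
```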
In the following sections, we present our proposed motion prediction and measurement update, which can be integrated into the particle filter implementation. Our method offers three advantages. First, it correctly models the temporal constraint through the proposed motion model. Second, the uncertainty of VL methods is parameterized appropriately by the sensor noise model. Third, it can easily be incorporated with a wide range of VL approaches [52, 28, 27, 9].
In the autonomous driving scenario, the visual data is an image sequence whose motion is driven by a control input. In practice, that input can be obtained from other sensors (e.g., INS). However, as mentioned in Section 1, the motion of a car is fairly restricted, hence we use a hand-crafted motion model in this work.
Mathematically, for each particle we draw the velocity $\mathbf{v}_t \sim \mathcal{N}(\mathbf{v}_{t-1}, \Sigma_v)$ and angular velocity $\boldsymbol{\omega}_t \sim \mathcal{N}(\boldsymbol{\omega}_{t-1}, \Sigma_\omega)$, where $\mathbf{v}$ and $\boldsymbol{\omega}$ respectively represent the linear and angular velocities. The accelerations are modeled by the noise covariance matrices $\Sigma_v$ and $\Sigma_\omega$. The motion of each particle in each time step is then obtained as follows.
It is worth noting that while 3D positions can easily be updated using simple addition, this does not make sense for Euler angles, because the space of Euler angles is not a vector space. Hence, in kinematics, one converts the two Euler angle representations to Direction Cosine Matrices (DCMs), multiplies the two matrices, and converts the result back to the Euler representation.
Let $R(\cdot)$ be a function that maps an Euler representation to a DCM, and $R^{-1}(\cdot)$ its inverse mapping. Our motion model for each particle is the following:

$\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{v}_t \Delta t, \qquad \mathbf{r}_t = R^{-1}\big(R(\boldsymbol{\omega}_t \Delta t)\, R(\mathbf{r}_{t-1})\big).$
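A minimal MATLAB sketch of this motion model follows, using the particle struct from above. The ZYX Euler convention, the function names, and the time step dt are our assumptions:

```matlab
function particles = motion_predict(particles, Sigma_v, Sigma_w, dt)
% One prediction step: perturb velocities, then integrate each pose.
  Lv = chol(Sigma_v, 'lower');                 % for sampling velocity noise
  Lw = chol(Sigma_w, 'lower');                 % for angular-velocity noise
  for i = 1:numel(particles)
    particles(i).v     = particles(i).v     + Lv * randn(3,1);
    particles(i).omega = particles(i).omega + Lw * randn(3,1);
    % Position: simple additive update
    particles(i).s(1:3) = particles(i).s(1:3) + particles(i).v * dt;
    % Orientation: compose rotations via DCMs, then map back to Euler
    R_prev  = eul2dcm(particles(i).s(4:6));
    R_delta = eul2dcm(particles(i).omega * dt);
    particles(i).s(4:6) = dcm2eul(R_delta * R_prev);
  end
end

function R = eul2dcm(e)
% [roll; pitch; yaw] (ZYX convention) to direction cosine matrix.
  cr = cos(e(1)); sr = sin(e(1));
  cp = cos(e(2)); sp = sin(e(2));
  cy = cos(e(3)); sy = sin(e(3));
  R = [cy -sy 0; sy cy 0; 0 0 1] * ...         % yaw
      [cp 0 sp; 0 1 0; -sp 0 cp] * ...         % pitch
      [1 0 0; 0 cr -sr; 0 sr cr];              % roll
end

function e = dcm2eul(R)
% Inverse mapping of eul2dcm (ZYX convention).
  e = [atan2(R(3,2), R(3,3));                  % roll
       -asin(R(3,1));                          % pitch
       atan2(R(2,1), R(1,1))];                 % yaw
end
```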
The experimental results will show that our motion model can properly handle the temporal evolution of camera pose in an image sequence.
VL methods, regardless of how accurate they are, are always subject to uncertainty. Therefore, we treat the prediction of a VL approach as a noisy measurement from a sensor.
Given an image $I_t$ at time $t$, we define the measurement $\mathbf{z}_t = g(I_t)$, where $g(\cdot)$ is a nonlinear function that can be any of the existing VL methods. In particular, for retrieval-based methods (e.g., DenseVLAD), $g$ takes an input image, outputs a set of similar database images, performs clustering, and then calculates the mean of the 6DOF poses. For learning-based approaches (e.g., PoseNet or MapNet), $g$ is a deep neural network that receives a query image and directly outputs the 6DOF pose (see Section 5.1 for more details). However, because the prediction of these methods is viewed as a noisy measurement, we define a measurement model:
$\mathbf{z}_t = \mathbf{s}_t + \boldsymbol{\epsilon}_t,$ where $\boldsymbol{\epsilon}_t$ is a noise variable distributed according to a Gaussian distribution $\mathcal{N}(\mathbf{0}, \Sigma_z)$; thus $\mathbf{z}_t$ is also Gaussian distributed, with mean $\mathbf{s}_t$ and covariance $\Sigma_z$. Every particle is assigned a weight as follows:

$w_t^{(i)} \propto \exp\!\big(-\tfrac{1}{2}(\mathbf{z}_t - \mathbf{s}_t^{(i)})^\top \Sigma_z^{-1} (\mathbf{z}_t - \mathbf{s}_t^{(i)})\big).$
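A sketch of the corresponding weight computation in MATLAB; Sigma_z is the assumed 6x6 measurement covariance, and the angle wrapping is our own precaution for Euler-angle residuals:

```matlab
function particles = measurement_update(particles, z, Sigma_z)
% Weight each particle by the Gaussian likelihood of the VL prediction z (6x1).
  N = numel(particles);
  w = zeros(1, N);
  for i = 1:N
    d = z - particles(i).s;                     % innovation (6x1)
    d(4:6) = mod(d(4:6) + pi, 2*pi) - pi;       % wrap angle residuals to (-pi, pi]
    w(i) = exp(-0.5 * (d' * (Sigma_z \ d)));    % unnormalized likelihood
  end
  w = w / sum(w);                               % normalize the weights
  for i = 1:N
    particles(i).w = w(i);
  end
end
```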
Our particle filter is described in Algorithm 1. The probability distribution is approximated by the particle set $\mathcal{P}_t$. The resampling lines in the algorithm redraw particles based on their weights to prevent the degeneracy problem, which is likely to occur in the life-long localization scenario. Figure 2 shows the filtering performed by our proposed particle filter; this example uses DenseVLAD as a noisy measurement sensor. At the first iteration, the particle filter randomly generates many hypotheses. Then, hypotheses with small weights vanish because they are inconsistent with the observation measured by DenseVLAD. Finally, the ambiguity is resolved, and the particles successfully track the vehicle, even though the query and retrieved images are from different times and weather conditions.
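Since Algorithm 1 is not reproduced here, the following shows one standard choice for the resampling step, low-variance (systematic) resampling; we assume it serves the same purpose as the resampling lines the text refers to:

```matlab
function particles = resample(particles)
% Low-variance (systematic) resampling to combat particle degeneracy.
  N = numel(particles);
  c = cumsum([particles.w]);                 % cumulative weights (sum to 1)
  u = (rand + (0:N-1)) / N;                  % N evenly spaced pointers
  idx = zeros(1, N);
  j = 1;
  for i = 1:N
    while j < N && u(i) > c(j)
      j = j + 1;
    end
    idx(i) = j;                              % survivor index for particle i
  end
  particles = particles(idx);
  for i = 1:N
    particles(i).w = 1 / N;                  % reset weights after resampling
  end
end
```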
In our experiments, we select DenseVLAD, PoseNet and MapNet as baselines. DenseVLAD+PF, PoseNet+PF, and MapNet+PF denote DenseVLAD, PoseNet and MapNet working in conjunction with our particle filter.
For DenseVLAD, given a query image we retrieve the top 20 database images, then apply the mean shift algorithm to cluster them based on the 3D positions of the retrieved images. To estimate the 6DOF pose of the query, we select the cluster with the largest number of images, and then compute the mean of the 3D positions and orientations of those images. For the learning approaches, following published suggestions, PoseNet uses ResNet-34, adds a global average pooling layer, and parameterizes the camera orientation as the logarithm of a unit quaternion. PoseNet is implemented in TensorFlow, and the MapNet implementation is provided by its authors. The parameters of PoseNet and MapNet are set as suggested by their authors.
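As an illustration of the retrieval-based front end described above, here is a simplified MATLAB sketch. The function name, the fixed-iteration Gaussian-kernel mean shift, the use of uniquetol to group modes, and the naive averaging of Euler angles are our own simplifications:

```matlab
function pose = retrieval_to_pose(poses, bw)
% poses: 20x6 matrix ([x y z roll pitch yaw]) of the top-20 retrieved
% database images. bw: hand-chosen mean-shift bandwidth (metres).
  P = poses(:, 1:3);                      % cluster on 3D positions only
  modes = zeros(size(P));
  for i = 1:size(P, 1)                    % shift each point to its mode
    m = P(i, :);
    for iter = 1:50
      d2 = sum((P - m).^2, 2);
      k = exp(-d2 / (2*bw^2));            % Gaussian kernel weights
      m_new = (k' * P) / sum(k);
      if norm(m_new - m) < 1e-3, break; end
      m = m_new;
    end
    modes(i, :) = m;
  end
  % Group points whose modes coincide (within bw); pick the largest cluster
  [~, ~, labels] = uniquetol(modes, bw, 'ByRows', true, 'DataScale', 1);
  members = (labels == mode(labels));     % most frequent cluster label
  pose = mean(poses(members, :), 1);      % mean of positions and orientations
end
```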
Regarding our framework, we initialize the particles from a Gaussian distribution whose mean is the prediction of DenseVLAD, PoseNet or MapNet on the first frame, with covariance matrices $\Sigma_{x_0}$ and $\Sigma_{r_0}$ for the 3D location and orientation respectively; $N$ denotes the number of particles.
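A sketch of this initialization step, assuming z1 holds the 6DOF prediction of the front end on the first frame and Sigma_x0, Sigma_r0 are the chosen covariances:

```matlab
% Scatter all particles around the first VL prediction z1 (6x1).
Lx = chol(Sigma_x0, 'lower');
Lr = chol(Sigma_r0, 'lower');
for i = 1:N
  particles(i).s(1:3) = z1(1:3) + Lx * randn(3,1);   % position hypothesis
  particles(i).s(4:6) = z1(4:6) + Lr * randn(3,1);   % orientation hypothesis
  particles(i).w = 1 / N;
end
```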
[Figure 4 panels: DenseVLAD and DenseVLAD+PF; MapNet and MapNet+PF]
First, we test the performance of the proposed particle filter working in conjunction with state-of-the-art methods (PoseNet, MapNet, and DenseVLAD) on the Oxford RobotCar dataset. We follow the suggested configuration and split of training/testing sequences.
In particular, the experiment is conducted on the LOOP and FULL scenes, with lengths of 1 km and 9.5 km respectively. PoseNet is trained on the training set. MapNet is initially trained on the training set and then fine-tuned on an unlabeled dataset. DenseVLAD uses the training set as the reference database. The particle filter parameters $\Sigma_v$, $\Sigma_\omega$ and $\Sigma_z$ are set separately for the LOOP and FULL scenes as diagonal covariance matrices, where $\mathrm{diag}(\cdot)$ denotes a function that outputs a square diagonal matrix with the elements of the input vector on the main diagonal; the remaining parameters are shared between both scenes.
For comparison, we select pose graph optimization (PGO), which fuses the absolute poses and relative odometry (VO) during inference.
DenseVLAD, PoseNet, and MapNet are used as noisy measurement sensors in our framework. Note that during inference, DenseVLAD, PoseNet, and MapNet do not take temporal smoothness into account, hence the particle filter framework assists them through our proposed motion prediction. As shown in Figure 4, the 6DOF poses are smoother after filtering the outputs of DenseVLAD and MapNet. Accordingly, Figure 3 shows that the proposed particle filter smooths the trajectories of DenseVLAD, PoseNet and MapNet. Figure 3 also demonstrates that the trajectory produced by PGO is less smooth than that of the particle filter. One possible reason is that outliers exist in the predictions of both MapNet and VO, which likely cause PGO to get stuck in a local minimum.
Due to the smoothness offered by the particle filter, the outliers produced by DenseVLAD, PoseNet and MapNet are removed. This leads to a drastic improvement in the mean translation and rotation errors, as shown in Table 1, while the median errors of the different methods remain stable. The results of Table 1 and Figure 3 thus suggest that smoothness does not greatly influence the median errors, whereas it is closely connected to the mean errors.
In this section, we test the feasibility of our solution of building the map in a distributed manner for VL in autonomous driving.
To do so, we adapt G2D to construct a large-scale map. In particular, we simulate 59 vehicles running independently along different routes. The bird's-eye view of our synthetic dataset is illustrated in Figure 5; the covered area is about 3.36 km². The vehicles are also simulated to run at different times of day and in different weather conditions; the distributions of times and weather conditions are shown in the two histograms of Figure 5. The times and weathers in the training sequences are uniformly distributed from 1am to 11pm and over 7 different weather conditions (snowy, foggy, clear, overcast, cloudy, sunny and rainy). Five sequences at different times and in different weathers are also collected for the testing phase. The statistics and trajectories of the testing sequences are given in Table 2 and Figure 5 respectively. We sub-sampled the training sequences at 2 Hz and the testing sequences at 4 Hz (see the supplementary material for more sample images).
| Sequence   | # images | Time & weather    | Traversal distance |
|------------|----------|-------------------|--------------------|
| Test seq-1 | 1451     | 9:36 am, snowy    | 1393.03 m          |
| Test seq-2 | 360      | 10:41 pm, clear   | 359.74 m           |
| Test seq-3 | 1564     | 11:11 am, rainy   | 1566.93 m          |
| Test seq-4 | 805      | 6:26 pm, cloudy   | 806.56 m           |
| Test seq-5 | 1013     | 3:05 pm, overcast | 1014.91 m          |
For the particle filter, we set the parameters $\Sigma_v$, $\Sigma_\omega$, $\Sigma_z$ and $N$ as appropriate for this dataset.
The benefit of smoothing the 6DOF poses is illustrated in Figure 6. The variation of the trajectory predicted by DenseVLAD is very large, but it becomes fairly smooth after filtering. Consequently, the quantitative results in Table 3 show that smoothness leads to a substantial improvement in mean errors, while the median errors are only slightly affected.
In summary, this paper presents a practical solution that constructs a city-scale map in a distributed manner for self-localization in autonomous driving. The feasibility of our solution is tested and confirmed on a large-scale synthetic dataset. Furthermore, our proposed particle filter is shown to be an efficient method that exploits the temporal smoothness of an image sequence. Our approach can be effectively and easily incorporated with a broad range of camera localization methods.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Modelling uncertainty in deep learning for camera relocalization. In ICRA, 2016.