Coarse-To-Fine Visual Localization Using Semantic Compact Map

10/11/2019 ∙ by Ziwei Liao, et al. ∙ Beihang University 19

Robust visual localization for urban vehicles remains challenging and unsolved. The limitation of computation efficiency and memory size has made it harder for large-scale applications. Since semantic information serves as a stable and compact representation of the environment, we propose a coarse-to-fine localization system based on a semantic compact map. Pole-like objects are stored in the compact map, then are extracted from semantically segmented images as observations. Localization is performed by a particle filter, followed by a pose alignment module decoupling translation and rotation to achieve better accuracy. We evaluate our system both on synthetic and realistic datasets and compare it with two baselines, a state-of-art semantic feature-based system and a traditional SIFT feature-based system. Experiments demonstrate that even with a significantly small map, such as a 10 KB map for a 3.7 km long trajectory, our system provides a comparable accuracy with the baselines.



There are no comments yet.


page 1

page 2

page 5

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Robust visual localization for autonomous vehicles has gained wide attention recently. The traditional visual localization pipeline builds up a map from texture features [1, 2]

of the environment or uses image retrieval techniques

[3, 4]

for camera state estimation. To ensure robustness in urban scenarios, researchers must carefully design features

[5, 6] unaffected by viewpoints and highly dynamic objects. However, almost all popular features still cannot adapt well to drastic changes in appearance [7] or rely too much on data used for training [8]. Besides, the complex descriptor makes the map storage too large and limits computation efficiency.

With the development of deep-learning techniques, many researchers have turned to semantic segmentation for an advanced representation of the environment. Toft et al.

[9] use semantic information to assign a weight to points in the RANSAC process, which improves the robustness during the PnP process. Schonberger et al. [10] learn a descriptor that encodes both 3D geometry and semantic information. Yet, the approach requires depth maps. These methods treat semantic information as auxiliary information for traditional features. They may improve robustness and accuracy, but can not fulfill our need for using as little storage as possible.

Fig. 1: This figure demonstrates the fine positioning process. In the current frame, the picture on the upper left is the result of semantic segmentation. In the picture on the upper right, the extracted poles are marked with a long red line. The poles projected from the compact map are marked with a short white line, and their distances are marked in meters with black numbers. In the figure below, the poles with successful data associations are marked with a blue cylinder. Two green circles are calculated based on the positions of poles and observation angles. Then the camera translation is estimated as the intersection of the circles. This method solves the camera state decoupling translation and rotation.

Some researchers have proposed systems that require only semantic information. Toft et al. [11] store a semantic point cloud and extract semantic boundaries on semantically segmented images to build a cost map for the optimization framework. Most of the methods we mentioned above follow the process of a typical SLAM method, with graph optimization process optimizing the results. However, optimization is not efficient enough. Stenborg et al. [12] form a particle filter system, using the match consistency between the semantic point cloud and semantically segmented images as particle weights. The storage for every semantic landmark has greatly reduced compared with the complex feature descriptor. However, the storage needed for the point cloud remains large as a huge number of semantic points are needed, making the system hard to run online.

Some researchers have proposed compact maps that store objects instead of semantic points, which requires much less storage space. Weng et al. [13] consider poles extracted from lidar observations for scan-matching, and fuse an RTK GPS to achieve high accuracy. Spangenberg et al. [14] consider tree trunks and use a particle filter with the help of an RGB-D camera. They all contribute a lot to the development of semantic compact maps but ask for some expensive equipment for object detection.

Our work differs from all the works above. To address the problem of making map storage as small as possible, we choose object-level landmarks to form a semantic compact map. We believe pole-like objects, such as traffic light, lamp, and tree trunks, are everyday objects in urban sceneries and can be used as reliable landmarks for localization. We directly extract object observations from semantically segmented images. To consider both efficiency and accuracy, we propose a coarse-to-fine localization system. The coarse localization system forms a particle filter for efficient state estimation. After that, we propose the pose alignment module to adjust the pose from the particle filter to a finer position. With accurate landmarks in the compact map, we can use geometry solution that decouples translation and rotation [15, 16] to achieve higher accuracy.

We compare the system performance with two baselines using both synthetic and realistic datasets covering urban, suburban, bridge and highway scenarios. We aim to show that even with a significantly small map, the system could achieve comparable accuracy with traditional point-cloud based systems.

Ii System Framework

Fig. 2: System Framework.

Ii-a Overview

Our coarse-to-fine localization system consists of three dependent modules: Segmentation, Coarse Localization, and Pose Alignment, as in Fig. 2.

The whole system is initialized using GPS or other rough localization systems like [17] only once. After localizing the initial pose of the camera, we will operate the segmentation module and the coarse localization module in parallel. The segmentation module outputs poles extracted through CNN models. In the meantime, the camera pose is calculated roughly in Motion Update step of the coarse localization module via the odometry provided. The compact map is then projected to the camera pose to establish a data association between poles extracted by semantic segmentation and poles saved in the compact map. After every particle is assigned a weight through data associations, resampling is done when it is necessary. Then, poses are estimated by state estimation.

The pose alignment module is designed to optimize the pose provided by the coarse localization module and can be closed to save time or operated at a given frequency, such as every five frames. This module uses a geometry-based pose solver and decouples translation and rotation [15] to solve an accurate translation of the camera. Then rotation is optimized with translation fixed. The results of this module will be evaluated. If it is decided to be better than the result of coarse localization, another resample step will follow before outputting the final result.

Ii-B Pole Representation

We assume that the road surface is flat and the camera is parallel to the road, which is reasonable in most autonomous driving conditions. We will illustrate our pipeline through pole-like objects in the experiments, but it should be pointed out that our system can be extended to other semantic objects according to the requirements of different tasks.

Based on the assumption above, we model all semantic objects in our map as infinite poles vertical to the ground. Such representation also makes our map further compressed and light-weight. Since we do not consider their height, we can record landmarks in a two-dimensional state vector. Therefore, the compact map is composed of poles with positions and semantic labels as

, where and {Pole, Lamp, Tree Trunk…}. As each pole is an infinite line vertical to the x-coordinate in the image plane, we only record the x coordinate of poles in each frame.

(a) semantic segmentation
(b) binarized image
(c) middle result
(d) pole extraction
Fig. 3: Pole Extraction and Modeling.

In the segmentation module, we input RGB images and process them using semantic segmentation methods [18]. For each frame , we filter output of the network to extract poles, and store their x coordinates and semantic labels as . We divide the filtering algorithm into three steps, as shown in Fig. 3.

  • Binarize the segmentation image, marking poles and other regions as 1 and 0.

  • Count how many pixels are labeled as poles in each column and filter out columns that do not reach the threshold . Successive columns that have not been filtered out will be grouped.

  • Calculate the width of each column group. If the width is larger than , we will extract a pole located in the middle of this group.

Iii Coarse Localization

The Coarse Localization module is based on particle filters [14]. We frequently sample several particles at each location and update the location of the camera through the calculation of these particles. For every frame the module receives, it will get the camera pose after motion update, measurement update, resampling, and state estimation.

Iii-a Motion Update

We use UTM(Universal Transverse Mercator) coordinate, which takes East and North as two vertical axes and records position together with orientation, to describe camera pose. To make it more detailed, we note the i-th particle in the t-th frame as .

During the initialization of the system, we read in GPS data and its uncertainty to produce an initial distribution of particles. When a new frame comes, we will update the state of every particle via its linear velocity and angular velocity :


where , , and

obeys the Gauss distribution with a mean of 0 and a standard deviation of

. We add to the motion model so that we can do sampling without degradation in configuration space. The parameters are decided by the configuration of vehicles or robots we use, as in [19].

Iii-B Measurement Update

With the help of poles we extract in the Segmentation Module and the constructed map, we update the state of each particle independently.

We first assume that at each time point, there are poles( observations) in the current frame and poles in our map. As is mentioned above, we note the -th pole extracted in the -th particle of the -th frame as and the projection of -th pole in the map as . Since the operation is similar for every frame, we omit and note to simplify the notation.

To locate the position of the camera, we need to establish the data association between and , where indicates that the observation does not belong to any pole in our map.

Our task can be optimized as an Optimal Mapping problem [20]. We first calculate the projection of to the -th particle position. Taken that we have already found a mapping

, the loss function is defined by adding all the distance between corresponding

and together:


By minimizing the loss function , we can acquire the optimum association between poles and the map. Since every particle represents a different state of the camera, together with is also used to assign a weight to different particles.

We divide the weight calculation of poles into two categories.

  • The observation is mapped to an established pole in the map . Similar to [14], we define that


    where is of a positive correlation with the distance between pole and camera in 3D space. is used to model the uncertainty of observations and is decided by the pole extraction module. According to [14], the weight provided by can be formulated as


    where is the possibility that one specific kind of pole is identified and is also related to the pole segmentation model. is intensity of a Poisson clutter process.

  • The pole is not mapped to an established pole . Under this circumstance, we define


The weight of -th particle is calculated by considering all the observations:


Iii-C State Estimation

Inspired by the strategy of GMAPPING[21], we use the number of efficient particles to judge whether we should do resample. For example, if there exists a moving obstacle that influences our pole extraction and update procedure, we should avoid resampling at such frames to minimize errors brought by such barriers. Therefore, we define that


where is the number of particles. We do resample if is smaller than the threshold we define. And the camera state is estimated by the weighted average of all particles.

Iv Pose Alignment

Given the result of coarse localization, pose alignment will make use of geometric constraints to decouple translation and rotation. If the pose alignment procedure gives out another group of particle weights that better satisfy our requirements, a more accurate camera state will be calculated.

Iv-a Translation Calculation

First, we solve the data association based on the result of coarse localization, as is mentioned in III.A. Then we can assign every pole in the image to one landmark in the map.

Sugihara et al. [15] and Kanatani et al. [16] describe the problem in detail. Suppose we have two landmarks , , their projection to the image plane , and the intrinsic matrix of our camera. We can treat the monocular camera as a protractor and calculate horizontal angle :


where we assume and , are intrinsic parameters of the camera.

Fig. 4: Translation calculation. Given the landmarks , , , and the horizontal angles , , camera sits at the intersection of two circles and .

As is shown in Fig. 4, can be seen as an angle in a circular segment and , are two points in this circle . Based on the circumferential angle theorem, the central angle is two times the corresponding angle in the circular segment. Therefore, we can easily get the radius of this circle:


By the same reason, we acquire the circle and its radius through landmark and the viewing angle . and will insert at two points: and the camera . Since and are symmetric about , we can obtain the camera translation now.

The camera rotation is not used in the calculation of camera translation above, which means that rotation has nothing to do with translation in our method. Thus, we have decoupled translation and rotation.

Iv-B Rotation Optimization

We construct an optimization function to solve camera rotation with translation fixed:


In this function, we hold the same assumption that frame is at position in which has been optimized and fixed. is the data association function calculated at the beginning of this section and is used to calculate the distance between associated observations and landmarks(defined in III.A). It is obvious there is only one parameter to be optimized in this nonlinear optimization problem. We solve this problem using the Gaussian-Newton method. After all, we get the candidate pose with weight .

Iv-C Evaluation

We need three landmarks once to adjust the pose of the camera. Suppose that we have landmarks that are successfully associated with observations in the image plane, we obtain possible landmark groups. After calculating all these groups and sorting them by weight, we get the best candidate and start the evaluation procedure.

The evaluation procedure compares got in the coarse localization procedure and of the best candidate . If is larger, and pose difference of accurate state and coarse state is smaller than a threshold, we treat pose refinement as successful and activate the accurate resample step. Otherwise, we output the result of coarse localization.

Iv-D Accurate Resample

We adjust the probability distribution of all particles according to the result of pose refinement. We will generate a new Gauss distribution with a mean of

and a standard deviation of and sample this distribution for new particles. is calculated by:


where is used to evaluate the reliability of the pose alignment module.

Different from traditional resampling methods that take advantage of existing particles in motion update, the accurate resample step generates distributions from observations and constructed maps. Such approaches are more like a global localization technique and reduce the effect of drifts in the long run.

V Experiments

To thoroughly evaluate the performance of our system, we employ both virtual and real datasets. For comparison, we realize a state-of-art semantic localization system that uses semantic point cloud based on a Particle Filter, and a traditional SIFT-UKF system using SIFT features based on an Unscented Kalman Filter (UKF). Both baselines are described in

[12]. Our experiments are carried out using a PC with an Intel Core i5-4590 CPU with 3.3GHz, and an Nvidia GeForce GTX TITAN X GPU.

Fig. 5: Pole-extraction results of suburban, urban, bridge and highway, which are marked with red lines.

We choose the KAIST Urban dataset [22] to test our system in realistic data. KAIST provides detailed 3D lidar point clouds, stereo camera images, and wheel odometry data covering urban, suburban, highway and bridge scenes, which is an ideal dataset to test how our pole-based system will work in common scenes for autonomous driving. Since the trajectories are too long and repeatable, we select Urban23, Urban26, and Urban34 from these sequences, and divide continuous sections with clear poles into 7 parts according to their scene types.

We choose the VirtualKitti dataset [23] to evaluate the performance of our system in synthetic datasets further. VirtualKitti provides the ground truth of semantic segmentation images and camera odometry. The ground truth of segmentation images can verify the accuracy limit of our proposed algorithm. In our experiments, we select continuous sections from trajectory 0001 that contains clear poles, whose timestamp range is 249-356.

In our experiments, we take altogether four benchmarks into consideration. RMSE Trans(m) and RMSE Rot(deg) are Root Mean Squared Error of camera translation and rotation. We also consider two benchmarks from [12] and [24]. The former one, which we note as 0.5/1/2, counts the percentage of camera positions that are within 0.5/1/2 meters from the ground-truth. This benchmark can clearly illustrate whether the system is correct most of the time. However, it only considers the camera translation. Therefore, we use the latter benchmark, noted as 0.25/0.5/5, to count the percentage of camera positions that are within 0.25/0.5/5 meters and 2/5/10 degrees from the ground-truth.

V-a Map Creation

The focus of this paper is on the localization models, but we will give a brief introduction of how we build the map. In both datasets, we manually annotate poles from point clouds to generate the compact map. In the KAIST dataset, the point clouds are given directly. In the Virtual Kitti dataset, as no point clouds are given, we use a structure-from-motion pipeline to build point clouds using ground-truth depth images. For the maps of the baselines, we build them using the structure-from-motion pipeline as the same as [12] for both datasets.

The map storage is shown in Table I. The SIFT descriptor needs large storage space, and the semantic point cloud needs a large number of points. Our compact map exceeds other maps in map storage significantly, which shows the obvious advantages of the object-level compact map.

Trajectory Length Map Storage
UKF+SIFT PC Semantic Ours
KAIST26 3.7 km 5.9 GB 24.0 MB KB
Highway 1.8 km 2.1 GB 8.2 MB KB
Bridge 1.4 km 3.3 GB 5.0 MB KB
VirtualKitti 0.1 km 36.8 MB 0.14 MB KB
TABLE I: Comparison of Map Storage


In the KAIST Urban dataset, we consider four typical scenes in autonomous driving tasks: urban, suburban, highway and bridge. For each scene, we select continuous sections in the dataset and compare our system with the baselines we construct. We also test how changes in the scene will influence the performance of our system. We use images generated by the left camera. And then we use the Bisenet [18] trained on Cityscapes [25] to segment the images.

The results show that our system achieves comparable translation and rotation accuracy with the baselines, even though it uses a much more light-weighted map. When the Pose Alignment module is added, the translation accuracy is improved significantly and even exceeds the baselines on some trajectories. We need to mention that on the KAIST26 trajectory, the map size of the UKF-SIFT system is 5.9 GB, while the map size of our semantic compact map is just 10.0 KB. Fig. 6 shows that the proposed system localizes the vehicle with an error below 1 m most of the time.

Coarse Loc CL+PA PC Semantic[12] UKF-SIFT
Suburban 1
Trans (m) 0.686 0.604 1.798
Rot (deg) 0.882 0.882 0.464
Suburban 2
Trans (m) 1.344 2.034 1.088
Rot (deg) 1.155 0.868 1.205
Suburban 3
Trans (m) 0.750 0.631 1.774
Rot (deg) 0.841 0.760 1.350
Urban 1
Trans (m) 1.589 0.893 1.051
Rot (deg) 1.195 1.080 0.914
Urban 2
Trans (m) 1.955 1.669 1.045
Rot (deg) 2.358 0.838 1.203
Trans (m) 1.948 2.494 1.858
Rot (deg) 1.139 0.935 0.907
Trans (m) 0.737 2.274 1.139
Rot (deg) 1.031 0.845 1.577
KAIST 26 Complete Seq*
Trans (m) 1.539 1.794 0.957
Rot (deg) 1.634 0.902 1.054
  • KAIST 26 contains all the suburban and urban trajectories.

Fig. 6: Comparison of KAIST dataset.
Fig. 7: Comparison of run-times(ms).

In the bridge and highway scenes, the feature-based SIFT-UKF system lacks texture information. And it’s hard for PC Semantic system to localize since there is little semantic difference in the longitudinal direction because of geometric configuration [12]. However, poles are clear for our system as in Fig. 5. The results in the urban scenes demonstrate the robustness of our system in dynamic environments. Feature-based localization is susceptible to dynamic object interference. However, even under the occlusion of vehicles and people, the height of the pole makes it still very clear in the image. The proposed system performs not so well in suburban1 and suburban3. We find that the poles are mixed with vegetation, reducing the accuracy of the segmentation algorithm.

The semantic baseline system performs not so well. One possible reason may come from the orientation of the camera. In the KAIST dataset, the camera points to the front of the vehicle, while Stenborg et al. [12] point the camera at the sides of the vehicle, where more semantic diversity can be found. In our experiments, the localization accuracy quickly recovers when the vehicle turns.

We also compare the time efficiency of our system with the semantic point cloud system. All the particle filter systems use 1000 particles. The average time consumption of each frame on the KAIST26 trajectory is presented in Fig. 7. Our system is faster than the semantic point cloud system thanks to much fewer map points in the Measure Update process.

V-C Virtual Kitti

Since no odometry data is given in the Virtual Kitti dataset, we add some noise to the ground-truth provided by the dataset and use it as the odometry for all three systems. As is mentioned by [12], such odometry is reasonable and can be used to simulate the motion model of autonomous vehicles.

Coarse Loc CL+PA PC Semantic[12] UKF-SIFT
Trans (m) 0.346 0.326 0.302
Rot (deg) 0.366 0.322 0.457

The results in Table III demonstrate that, even though we use much less information than the baselines, our system achieves comparable accuracy. Given the true segmentation result of the synthetic dataset, the translation accuracy with the pose alignment module can exceed the UKF-SIFT baseline and achieves an RMSE of 0.289m. The baselines use much bigger maps, as is shown in table I.

V-D System Analysis

There are some failure cases in our experiments. For example, barricades in the middle of roads are easy to be classified as poles. When the vehicle passes through sections with few poles, or the segmentation algorithm fails to segment enough poles, the error will accumulate since only the odometry is useful. These problems could be improved by improving the segmentation algorithm and expanding semantic objects from poles only into more types, like tree-trunks, traffic signs and so on.

For the rotation accuracy, we can tell that ours is better compared with the semantic baseline. However, there is still a certain gap compared with the UKF-SIFT baseline. Obviously, the feature-based system has much more observations to fuse. Since our geometric solver is based on the 2D assumptions, we assume the result will be improved if we expand it into 3D space.

Vi Conclusion

We propose a coarse-to-fine localization system based on pole-like objects extracted from semantically segmented images. The experiments demonstrated that even using a significantly small compact map, it is possible to achieve comparable accuracy with traditional point-cloud based localization. The pose alignment module decouples translation and rotation, achieving even better translation accuracy than the baselines. We prepare to explore more types of objects and its geometry representation to full fill the conditions where there are no poles, and explore 3D constrains to adapt 6-DOF state estimation in the future.


  • [1] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson, “City-scale localization for cameras with known vertical direction,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1455–1461, 2016.
  • [2] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1744–1756, 2016.
  • [3] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , 2015, pp. 1808–1817.
  • [4] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
  • [5] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [6] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision.   Springer, 2006, pp. 404–417.
  • [7] H. Aanæs, A. L. Dahl, and K. S. Pedersen, “Interesting interest points,” International Journal of Computer Vision, vol. 97, no. 1, pp. 18–35, 2012.
  • [8] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European Conference on Computer Vision.   Springer, 2016, pp. 467–483.
  • [9] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl, “Semantic match consistency for long-term visual localization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 383–399.
  • [10] J. L. Schonberger, M. Pollefeys, A. Geiger, and T. Sattler, “Semantic Visual Localization,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6896–6906, 2018. [Online]. Available:
  • [11] C. Toft, C. Olsson, and F. Kahl, “Long-term 3d localization and pose from semantic labellings,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 650–659.
  • [12] E. Stenborg, C. Toft, and L. Hammarstrand, “Long-term visual localization using semantically segmented images,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 6484–6490.
  • [13] L. Weng, M. Yang, L. Guo, B. Wang, and C. Wang, “Pole-based real-time localization for autonomous driving in congested urban scenarios,” in 2018 IEEE International Conference on Real-time Computing and Robotics (RCAR).   IEEE, 2018, pp. 96–101.
  • [14] R. Spangenberg, D. Goehring, and R. Rojas, “Pole-based localization for autonomous vehicles in urban scenarios,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 2161–2166.
  • [15] K. Sugihara, “Some Location Problems for Robot Navigation Using a Single Camera.” Computer vision, graphics, and image processing, vol. 42, no. 1, pp. 112–129, 1988.
  • [16] K. Kanatani and N. Ohta, “Accuracy bounds and optimal computation of robot localization,” Machine Vision and Applications, vol. 13, no. 2, pp. 51–60, 2001.
  • [17] S. Ardeshir, A. R. Zamir, A. Torroella, and M. Shah, “GIS-assisted object detection and geospatial localization,”

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    , vol. 8694 LNCS, no. PART 6, pp. 602–617, 2014.
  • [18] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
  • [19] S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics, 2005.
  • [20] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, vol. 38, no. 4, pp. 325–340, 1987.
  • [21] G. Grisetti, C. Stachniss, W. Burgard et al., “Improved techniques for grid mapping with rao-blackwellized particle filters.”
  • [22] J. Jeong, Y. Cho, Y.-S. Shin, H. Roh, and A. Kim, “Complex urban dataset with multi-level sensors from highly diverse urban environments,” The International Journal of Robotics Research, p. 0278364919843996, 2019.
  • [23] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in CVPR, 2016.
  • [24] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic et al., “Benchmarking 6dof outdoor visual localization in changing conditions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8601–8610.
  • [25]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in

    Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.