Visual localization systems for mobile robots constitute an attractive alternative to laser-based systems, as the former can offer accurate localization performance with a low-cost sensor setup. This is especially true for future autonomous cars, where mass-production renders the sensor suite a sensitive matter of expense. However, visual localization systems generate large amounts of data that need to be processed both online on the vehicles as well as offline through map building and map maintenance. The inherent sensitivity of visual systems with respect to changing appearance conditions further exacerbates this problem in the context of long-term autonomy, as multiple appearances of the mapped places need to be stored and managed in order to be able to localize with satisfying accuracy across all conditions.
In recent years, methods have been presented to address the scalability and efficiency issues of individual components of a visual localization system ([1, 2, 3, 4, 5, 6]). However, little attention has been paid to how to combine the different components to form a complete and tractable localization framework, how the differently optimized methods interact, how visual maps are to be built and managed over indefinite time spans, and most importantly: how the large amount of data accumulated over time can be processed and incorporated in an optimal way; everything with the purpose of allowing precise visual localization in outdoor environments at all times. It is the aim of this paper to address these questions. For this, we have built a scalable and efficient visual localization system for multi-vehicle outdoor shared-map scenarios as depicted in Figure 1 and anticipated for autonomous cars in the near future. It employs both appearance-based landmark selection on the vehicle side, as well as offline map summarization on the cloud-based mapping backend. We demonstrate how the visual maps can be managed and improved over time as the vehicles are exposed to vastly different appearance conditions. With the novel concept of an observation session, together with a modified formulation of the ranking function for appearance-based landmark selection, we propose a lightweight procedure to handle large quantities of frequently collected sensor data with the aim of improving the landmark selection performance without increasing the size of the map.
Our main contributions can be summarized as follows:
- Demonstration of a complete map management procedure for an efficient visual localization and mapping system designed for long-term outdoor use.
- Introduction of observation sessions and a new formulation of the ranking function for appearance-based landmark selection, which makes it possible to exploit frequently collected sensor data to significantly increase the landmark selection performance without increasing the map size.
- An extensive evaluation in two real-world scenarios, covering weather and seasonal changes at daylight over the course of one year, and the extreme illumination change from day-time to night-time over the course of one day. It validates the practicability of the proposed map management procedure in challenging outdoor conditions, and shows how additional co-observability statistics can improve the appearance-based online localization, and where their limitations lie.
The rest of this paper is structured as follows: After an overview of related literature in Section II, we present our map management procedure in detail in Section III, followed by an extensive evaluation of the system’s performance in Section IV. Summarizing remarks about our key findings conclude the paper in Section V.
II Related Work
Ever since the advent of SLAM systems, maintaining map representations that enable long-term operation has been a key focus, with a variety of approaches evolving over time. The methods described in ,  and  aim at maintaining an up-to-date representation of the environment over longer periods of time. These approaches, however, reach their limits whenever a map is required to hold multiple different representations of the same place at the same time, as is the case for outdoor visual localization applications.
For this reason, substantial efforts have been made to permanently augment visual maps with data from differing environment conditions. Churchill et al.  introduced the Experience-Based mapping framework, which adds new “sub-maps” (referred to as “Experiences”) of the environment on demand under newly observed appearance conditions. In a similar vein, the Multi-Session mapping proposed by Mühlfellner et al.  and the Multi-Experience Localization proposed by Paton et al.  both allow adding multiple datasets of an environment, collected during differing appearance conditions, to a common map representation. All of these approaches in their basic form, however, suffer from increased and ultimately unbounded storage, memory and computational resource requirements. To address this deficit, map summarization techniques have been developed, aiming at maintaining as small a map representation as possible while still providing as good a localization performance across as wide a range of appearance conditions as possible. Early works in this field include identifying reliable, geometrically-consistent features in the image retrieval context, and suppressing confusing features from certain regions of the database images . Follow-up approaches vary from clustering  or random pruning  of visual “Views”, to selection at the landmark level based on various landmark ranking functions , , , , .
The selection need not necessarily be carried out on the backend side, but may instead be performed online on the robot, prior to uploading new map data [17, 4, 18]. In general, these contributions focus on constructing a reliable set of landmarks for all possible environment states, reducing the runtime of tracking and localization, and/or the uplink bandwidth requirements.
In contrast to that, online landmark selection algorithms further allow decreasing the resource demands on the vehicle and on the communication downlink by having the vehicle query the map only for a selective fraction of landmarks which are deemed useful under the current operating conditions. Previous work by the authors  and by Linegar et al.  have successfully demonstrated such algorithms in the context of Autonomous Driving – a use-case especially prone to visual appearance change and applicable to the shared-map scenario. In relation to that, the work by Krajník ,  aims at predicting the current state of the environment based on previously observed and learned temporal patterns. While this approach is promising for dynamic indoor applications, it is only partially applicable to outdoor environments with often non-periodic changes.
We believe that an ultimately efficient visual localization system must do justice to constrained resources along the whole pipeline, that is, on the mapping backend side, as well as on the mobile platform and the communication link in-between. None of the aforementioned works, however, address all of these constraints simultaneously, whereas in this paper, we present a map management procedure that allows reaching an entirely scalable and efficient visual localization system for long-term use. Furthermore, and in contrast to  and , our metric multi-session map representation (see , ) keeps all map data (vertices, landmarks), even from multiple appearance conditions, expressed in a single map reference frame. This not only facilitates higher level tasks of autonomous operation, such as path planning and control, but also allows implementing the online landmark-selection and the offline map summarization on the level of individual landmarks.
In this section, we present the theoretical concepts of the three main components of our localization and mapping system: (i) the map update, (ii) appearance-based landmark selection with observation sessions, and (iii) offline map summarization.
III-A Map Update
The methodology of our map management procedure is based on a map update process as depicted in Figure 2. Sensor data, consisting of camera images and wheel-odometry measurements, is collected during a sortie of a vehicle and processed after the vehicle has returned to its home-base. The newly collected dataset is first localized in an offline process against the map available at that time. If the performance of this localization is worse than a pre-defined threshold, the map is considered not to cover the appearance condition encountered during this sortie sufficiently well, and new landmarks are tracked and triangulated from the dataset. A dataset added to the map in this fashion is referred to as a rich session. A subsequent map summarization step ensures that the total number of landmarks remains below a fixed number, guaranteeing a bounded map size at all times. If, on the other hand, localization has performed sufficiently well, the map is considered to cover the encountered conditions and no new landmarks are added to the map. In this case, however, the localization still reveals useful information about which landmarks in the map have been observed during the sortie. This information is added to the map in the form of an observation session. In contrast to the rich session described above, adding an observation session amounts to merely marking existing landmarks as observed in the respective sortie. However, both for future online localization and for future map summarization steps, this additional statistical data is valuable, as it allows a better distinction between useful and not useful landmarks. The resulting updated map is then used for localization of subsequent sorties.
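The decision logic of this update step can be illustrated with a minimal sketch. It is not the implementation used in our system: the `LandmarkMap` container, the function names and the placeholder `summarize` (which simply keeps the landmarks seen in the most sessions) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class LandmarkMap:
    """Toy map (hypothetical): landmark id -> set of session ids that observed it."""
    observed_by: dict = field(default_factory=dict)
    max_landmarks: int = 1000  # assumed fixed bound enforced by summarization


def summarize(map_):
    """Placeholder summarization: keep the landmarks seen in the most sessions."""
    ranked = sorted(map_.observed_by.items(),
                    key=lambda item: len(item[1]), reverse=True)
    map_.observed_by = dict(ranked[:map_.max_landmarks])


def update_map(map_, session_id, rms_error, rms_threshold,
               observed_ids, new_landmark_ids):
    """Add a sortie either as a rich session or as an observation session."""
    if rms_error > rms_threshold:
        # Localization was too imprecise: the appearance condition is not
        # covered yet, so newly triangulated landmarks are added.
        for landmark in new_landmark_ids:
            map_.observed_by[landmark] = {session_id}
        summarize(map_)  # keep the map size bounded
        return "rich"
    # Localization was good enough: only mark existing landmarks as observed.
    for landmark in observed_ids:
        if landmark in map_.observed_by:
            map_.observed_by[landmark].add(session_id)
    return "observation"
```

Note that an observation session never changes the number of landmarks in the map; it only extends the per-landmark session sets.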
In order to benefit from the observation sessions during online localization with appearance-based landmark selection, a modified formulation of the landmark ranking function is required and described in the following subsection.
III-B Appearance-Based Landmark Selection with Observation Sessions
In our previous work presented in , a method to tackle the problem of only querying useful map data during outdoor operation has been introduced on the basis of appearance-based landmark selection. Following an iterative localization paradigm, such as the ones described in  and , a ranking function assigns a score to each landmark of a candidate set (pre-selected based on spatial proximity), according to how likely the landmark is to be observable under the current appearance condition. Then, a small subset of top-ranked landmarks is selected using a selection policy, transmitted to the vehicle, and used for localization at the current iteration. The ranking function described in  adaptively weights the different sessions present in the pre-built map based on the session-affiliation of recently observed landmarks along the traversal. Although it successfully reduces the number of landmarks used for localization, it relies on the map being created a priori with all sessions approximately uniformly distributed across the appearance space.
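A selection policy that keeps a fixed fraction of the top-ranked candidates can be sketched as follows. The function name, the rounding rule and the choice of always keeping at least one landmark are assumptions made for this illustration.

```python
import math


def select_top_fraction(scores, ratio):
    """Return the ids of the top-ranked fraction of candidate landmarks.

    scores: dict mapping landmark id -> ranking score (higher is better)
    ratio:  fraction of candidates to keep, in (0, 1]
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Assumed rounding rule: round up and keep at least one landmark.
    count = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:count]
```

Only the returned subset would be transmitted to the vehicle, which is what reduces the load on the communication downlink.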
In a practical scenario, however, the map sessions may not be uniformly distributed, but they are rather added once a “new” appearance condition is encountered for the first time. In addition to that, whenever the vehicle traverses through the mapped area under an appearance condition already well-covered in the map, additional co-observability information can be gathered in the form of observation sessions.
As our experiments presented later show (see Figure 6), the original ranking function from , denoted by , is not well suited to incorporate these additional observation sessions. We thus propose a new formulation of the ranking function that is agnostic to how the mapping sessions are distributed across the appearance space, and the number and distribution of additional observation sessions present in the map.
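Let $c$ denote the current appearance condition. We are interested in evaluating $P(l \mid c)$, corresponding to the probability of observing landmark $l$ under the current appearance condition. Let further $\mathcal{S}$ denote the set of all sessions present in the map, both rich sessions and observation sessions, and $\mathcal{L}$ denote the set of all landmarks in the map. With every landmark $l \in \mathcal{L}$, we associate the set $\mathcal{S}_l \subseteq \mathcal{S}$, corresponding to all sessions that have observed landmark $l$.
We note that $P(l \mid c)$ directly depends on $\mathcal{S}_l$, that is, $P(l \mid c) = P(\mathcal{S}_l \mid c)$. We can thus group landmarks into appearance equivalence classes, according to:
$$\mathcal{E}(l) = \{\, l' \in \mathcal{L} : \mathcal{S}_{l'} = \mathcal{S}_{l} \,\}$$
Hence, evaluating $P(l \mid c)$ amounts to evaluating $P(\mathcal{E}(l) \mid c)$, which can be interpreted as the relevance of appearance equivalence class $\mathcal{E}(l)$ under appearance condition $c$.
The abstract appearance condition $c$ is not directly observable. However, it can be approximated by means of recently selected and observed landmarks as follows:
$$P(\mathcal{E} \mid c) \approx \frac{|\mathcal{O}_{\mathcal{E}}|}{|\mathcal{V}_{\mathcal{E}}|}$$
where $\mathcal{O}_{\mathcal{E}}$ and $\mathcal{V}_{\mathcal{E}}$ denote the sets of recently observed and selected landmarks of appearance equivalence class $\mathcal{E}$ respectively. Accordingly, we define our new landmark ranking function as:
$$f(l) = \frac{|\mathcal{O}_{\mathcal{E}(l)}|}{|\mathcal{V}_{\mathcal{E}(l)}|}$$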
In Figure 3, a comparison of our modified ranking function with the originally proposed formulation on the experimental set-up used in  is shown, ensuring that our modified formulation does not reveal regressive performance under these conditions. Further evaluation results demonstrating the merit of our new formulation are presented in Section IV-B.
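The grouping into appearance equivalence classes and the resulting ranking can be sketched as follows. The sketch assumes that the relevance of a class is estimated as the fraction of its recently selected landmarks that were actually observed, and that classes without any recently selected landmark receive a default score of zero; both choices, as well as all names, are assumptions of this illustration rather than a description of our implementation.

```python
from collections import defaultdict


def equivalence_class(sessions):
    """Landmarks observed by exactly the same set of sessions share a class."""
    return frozenset(sessions)


def class_relevance(observed_by, recently_selected, recently_observed):
    """Estimate the relevance of each class as |observed| / |selected|."""
    selected = defaultdict(int)
    observed = defaultdict(int)
    for landmark in recently_selected:
        selected[equivalence_class(observed_by[landmark])] += 1
    for landmark in recently_observed:
        observed[equivalence_class(observed_by[landmark])] += 1
    return {cls: observed[cls] / count for cls, count in selected.items()}


def rank(candidates, observed_by, relevance, default=0.0):
    """Sort candidate landmarks by the relevance of their class, best first."""
    def score(landmark):
        cls = equivalence_class(observed_by[landmark])
        return relevance.get(cls, default)  # assumed default for unseen classes
    return sorted(candidates, key=score, reverse=True)
```

Because a class is identified only by the set of sessions that observed its landmarks, the score does not depend on how many sessions of a given kind are present in the map, which reflects the intent of being agnostic to the session distribution.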
III-C Offline Map-Summarization
Whenever a rich session is added to the map, the total number of landmarks increases, and therewith also the size of the map. We therefore apply the map summarization techniques proposed by Dymczyk et al. in  to keep the map size bounded at all times.
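They suggest reducing the number of landmarks by solving the following integer optimization problem:
$$\begin{aligned}
\min_{\mathbf{x},\, \boldsymbol{\zeta}} \quad & \mathbf{q}^\top \mathbf{x} + \lambda\, \mathbf{1}^\top \boldsymbol{\zeta} \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} + \boldsymbol{\zeta} \geq b\, \mathbf{1}, \\
& \sum_{i=1}^{N} x_i = n_{\text{desired}}, \\
& \mathbf{x} \in \{0, 1\}^{N}, \quad \boldsymbol{\zeta} \in \mathbb{Z}_{\geq 0}^{M}
\end{aligned}$$
Here, $N$ and $M$ denote the number of landmarks and vertices in the map. Each landmark $i$ is assigned a corresponding binary switch variable $x_i$, indicating whether the landmark should be kept or removed. The landmarks are selected based on the cost vector $\mathbf{q}$, estimated using the number of sessions a landmark was observed in and the total number of observations. Additional constraints ensure that some desired total number of landmarks ($n_{\text{desired}}$) remains in the map, and that a sufficient number of landmarks ($b$) is visible from every vertex. Matrix $\mathbf{A}$ encodes the vertex-landmark co-observability, while the slack variable $\boldsymbol{\zeta}$ allows relaxing this constraint (at cost $\lambda$), ensuring that a solution to the optimization problem can be found in all cases.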
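To make the roles of the cost vector, the coverage constraint and the slack variable concrete, the following sketch solves a tiny instance by exhaustive search. It assumes that the objective is the summed cost of the kept landmarks plus a fixed penalty per unit of slack, and it is only meant for toy problems: a real map requires a proper integer programming solver. All names are hypothetical.

```python
from itertools import combinations


def summarize_exhaustive(q, A, n_desired, b, lam):
    """Exhaustively solve a tiny instance of the summarization problem.

    q[i]      : cost of keeping landmark i (lower means more valuable)
    A[j][i]   : 1 if landmark i is visible from vertex j, else 0
    n_desired : number of landmarks to keep
    b         : desired number of kept landmarks visible from every vertex
    lam       : penalty per missing landmark at a vertex (cost of slack)
    """
    if not 0 < n_desired <= len(q):
        raise ValueError("n_desired must be between 1 and the number of landmarks")
    best_cost, best_keep = float("inf"), None
    for keep in combinations(range(len(q)), n_desired):
        cost = sum(q[i] for i in keep)
        for row in A:
            visible = sum(row[i] for i in keep)
            # The smallest slack satisfying visible + slack >= b.
            cost += lam * max(0, b - visible)
        if cost < best_cost:
            best_cost, best_keep = cost, keep
    return set(best_keep), best_cost
```

With a large penalty, the search prefers subsets that keep every vertex covered, while the slack term guarantees that a solution exists even when the coverage requirement cannot be met.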
Our evaluation is structured into two parts as follows: We first present our findings related to the map management process and offline summarization, thereby looking into how many rich sessions are added over time, and how the degree of map summarization affects the localization performance. In the second part, we show the performance of the online appearance-based landmark selection on the incrementally improved maps over time, focusing on a comparison between our modified, more generic ranking function proposed in Section III-B, and the original ranking function proposed in  under the influence of additional observation sessions.
The data for the evaluation has been collected in two complementary real-world scenarios. The first one covers weather and seasonal change over the course of a full year at day-time on an outdoor parking-lot, while the second one covers extreme lighting change from full day-light to complete night-time in a city environment. The sensor suite consists of four fish-eye cameras, one facing in each cardinal direction, running at Hz, and wheel-odometry. The images are scaled down to pixels prior to processing. Example images can be found in the video contributions available online (https://youtu.be/TJMQCSHTIjU, https://youtu.be/JL_5zMEQKYc). All computations have been performed on a consumer-grade laptop with an Intel i7 CPU. Localization runs in real-time at Hz.
IV-A Map Update and Summarization
As a metric for assessing the quality of the generated maps over time, we employ translation RMS errors between the rough pose estimate from forward-propagated wheel-odometry and the refined pose after optimization. In accordance with the results found in , we omit the presentation of RMS errors in orientation, as they highly correlate with the translation errors and are of negligible magnitude in any case. Note that the refined pose at each iteration is obtained from solving a vision-only non-linear least-squares optimization problem. The resulting RMS error thus approximates the standard deviation of the localization along the trajectory.
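The metric can be computed as in the following sketch, which assumes that both pose sequences are given as equally long lists of positions expressed in the same reference frame. The function name and the input layout are assumptions of this illustration.

```python
import math


def translation_rms(rough_positions, refined_positions):
    """RMS of the translational difference between two pose sequences.

    Both arguments are sequences of (x, y, z) tuples of equal length,
    one entry per localization iteration.
    """
    if not rough_positions or len(rough_positions) != len(refined_positions):
        raise ValueError("expected two non-empty sequences of equal length")
    squared = [
        sum((a - b) ** 2 for a, b in zip(rough, refined))
        for rough, refined in zip(rough_positions, refined_positions)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

A dataset would then be compared against the precision threshold by evaluating this value over the whole trajectory.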
In outdoor environments, the updated map must still be able to cover the range of appearance conditions represented by the incorporated rich sessions, even after summarization. The number of landmarks required to achieve this not only depends on the sensor setup and the spatial extent of the map, but also on the variance in appearance conditions encountered. The City-Environment scenario covers extreme lighting changes from day-time to night-time, but the overall variance is still considerably smaller than in the year-long day-time Parking-Lot scenario. To guide our map update process described in Section III-A and Figure 2, we have therefore chosen to perform map summarization with a maximum number of k () and k landmarks in the City-Environment and the Parking-Lot scenario respectively, and use a cm threshold on the translation RMS error on these maps as a decision criterion to add the dataset at hand either as a rich session, or as an observation session. The choice of the cm threshold is motivated by recent work (, ) suggesting this to be a reasonable and realistic upper bound for localization precision with the given sensor suite.
The evolution of the localization performance resulting from this map update regime is shown in Figure 5, where localization has been evaluated with the following three combinations of selection policies and ranking functions: a) , using all candidate landmarks for localization, b) , appearance-based landmark selection with a selection ratio of , and c) , the corresponding random selection. As described in III-B and , the selection policy selects some fraction of top-ranked landmarks from which are then used for localization at the given iteration .
In order to thoroughly assess the influence of the offline map summarization on the localization performance, we have further evaluated the latter against more strictly summarized maps (with for the City-Environment, and landmarks for the Parking-Lot environment respectively), as well as against indefinitely growing unsummarized maps.
In the City-Environment scenario, appearance conditions appear stable throughout the afternoon until the beginning of dusk shortly after pm. At :pm, :pm and :pm, additional rich sessions are added, gradually expanding the appearance coverage of the map until, finally, night-time localization is feasible at pm.
In contrast to that, in the Parking-Lot scenario, the appearance patterns are much less clear. Already the initial map, built from the first dataset from August , does not allow sufficient localization performance for the second dataset from September . In general, it seems to be necessary to have a rich session present for every month of the year. Nevertheless, for the second half of the year, the map clearly shows converging tendencies, with only occasional datasets just barely above the cm precision threshold, and the spread between reference localization using all landmarks, and the random selection decreasing.
Figure 4 further shows the number of landmarks associated with each rich session at each stage of the incremental map building process, both for the summarized map and the unsummarized map. The rich sessions added in the City-Environment scenario in dusk naturally contain considerably fewer landmarks, which is also reflected in both the summarized and the unsummarized map. In contrast to that, the rich sessions added in the Parking-Lot scenario all contain a similar number of landmarks, and summarization reduces already present sessions more or less equally as new sessions are added.
Ideally, the localization precision is preserved after map summarization. If this is the case, the summarization only removes noisy landmarks from the map which are not re-observable under any of the encountered appearance conditions. In the City-Environment scenario, the performance difference related to map summarization is best visible at night-time, where the unsummarized map shows significantly better performance in case of the random selection. However, with the appearance-based selection, almost the same precision is attainable as if all landmarks were used. This shows that the summarization algorithm successfully removes redundant and noisy landmarks while maintaining a good coverage over the different appearance conditions. Similar results can be observed also for the Parking-Lot scenario. The more rich sessions are added, and hence the fewer landmarks of an individual rich session can be present in the summarized map, the larger the performance gap between the differently summarized and the unsummarized map becomes. It can further be observed that for the k map, the appearance-based localization cannot keep up with the performance compared to the k map, indicating that for this scenario and this time span, more than k landmarks are required in order to maintain sufficient appearance space coverage.
Since in these experiments we deliberately choose to build the maps incrementally and in chronological fashion, the performance evaluation of a certain dataset only uses the map available at that point in time. To demonstrate that the summarization algorithm in fact creates maps that maintain usability across all previously encountered appearance conditions, we evaluate the performance of all datasets in retrospect using the final map created after having processed the last dataset in chronological order. The results of this “regression” test are shown in Figure 5, with the corresponding map labelled with “Regression”. As can be seen, all datasets achieve at least as high a precision as if the map available by that time is used instead. Note, however, that for all datasets added as rich sessions this “regression” test in principle corresponds to self-localization. Hence the artificially high precision in these cases.
IV-B Appearance-Based Landmark Selection
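The goal of the appearance-based landmark selection is to achieve a high online localization performance with as few landmarks selected as possible. This can best be evaluated by comparing the respective selection ratios $r_{\text{sel}}$ with the corresponding observation ratios $r_{\text{obs}}$:
$$r_{\text{sel}} = \frac{n_{\text{selected}}}{n_{\text{candidates}}}, \qquad r_{\text{obs}} = \frac{n_{\text{obs}}(f, r_{\text{sel}})}{n_{\text{obs}}^{\text{all}}}$$
The selection ratio $r_{\text{sel}}$ denotes the fraction of candidate landmarks used for localization at a given iteration. In contrast to that, the observation ratio $r_{\text{obs}}$ compares the number of observed landmarks $n_{\text{obs}}(f, r_{\text{sel}})$, using a specified ranking function $f$ and selection ratio $r_{\text{sel}}$, to the number of observed landmarks $n_{\text{obs}}^{\text{all}}$ when all candidate landmarks are used.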
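Both ratios follow directly from landmark counts at a given iteration, as the following sketch shows. It assumes that the landmark sets are available as plain collections of ids; the function names are hypothetical.

```python
def selection_ratio(selected, candidates):
    """Fraction of candidate landmarks that were used for localization."""
    return len(selected) / len(candidates)


def observation_ratio(observed_with_selection, observed_with_all):
    """Observed landmarks under selection, relative to using all candidates."""
    return len(observed_with_selection) / len(observed_with_all)
```

A ranking function performs well if the observation ratio stays close to one while the selection ratio is small, that is, if most of the landmarks that could be observed at all are among the few that were selected.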
In , it has been shown that with a pre-built map and selection ratios between -, a localization performance comparable to using all landmarks can be achieved. In contrast to that, in this paper, we aim at investigating how the relation between selection ratio and localization performance evolves in a scenario where datasets are chronologically processed and the map is built-up incrementally, with both rich sessions and observation sessions.
Figure 6 shows the observation percentage for a selection ratio of for the three cases of using the ranking function originally proposed in , and using our new ranking function introduced in Section III-B with and without observation sessions.
In early stages of the map building with only few rich sessions present in the map, the benefit of using the observation sessions is most pronounced. As soon as more rich sessions become available though, the selection not using the observation sessions performs more and more similarly. In case of the City-Environment scenario, this is due to the fact that towards night-time even a pure selection on the :pm and :pm rich sessions allows achieving virtually observation percentage already. In contrast to that, the appearance conditions in the Parking-Lot scenario are much more diverse and unrelated from one dataset to the next one. After having some number of rich sessions present in the map, additional co-observability information in potentially only weakly related appearance conditions is only of minor or no help anymore.
We have presented a complete map management process for a visual localization system tailored to long-term operations in resource-constrained outdoor environments. Offline map summarization guarantees maps of bounded size at all times, while online localization with appearance-based landmark selection allows transmitting and processing only the map data that is required and useful under the current appearance condition. With the incorporation of landmark co-observation statistics in the form of observation sessions, in combination with a new formulation of the appearance-based landmark ranking function, we have proposed a lightweight mechanism to improve the appearance-based landmark selection during online localization at negligible storage or computational costs. An extensive evaluation in real-world conditions has shown that these additional observation sessions have the potential to significantly improve the landmark selection performance. However, their usefulness degrades as more and more rich sessions become available in the map. We have further evaluated the localization performance on the maps with different degrees of summarization resulting from the proposed map management paradigm, and shown that precise localization is possible over long time frames and across vastly different appearance conditions while keeping the map size bounded.
This project has received funding from the EU H2020 research project under grant agreement No 688652 and from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 15.0284.
-  P. Mühlfellner, M. Bürki, M. Bosse, W. Derendarz, R. Philippsen, and P. Furgale, “Summary maps for lifelong visual localization,” JFR, 2015.
-  M. Dymczyk, S. Lynen, M. Bosse, and R. Siegwart, “Keep it brief: Scalable creation of compressed localization maps,” in IROS, 2015.
-  M. Bürki, I. Gilitschenski, E. Stumm, R. Siegwart, and J. Nieto, “Appearance-based landmark selection for efficient long-term visual localization,” in IROS, 2016.
-  M. Dymczyk, T. Schneider, I. Gilitschenski, R. Siegwart, and E. Stumm, “Erasing bad memories: Agent-side summarization for long-term mapping,” in IROS, 2016.
-  W. Churchill and P. Newman, “Experience-based navigation for long-term localisation,” IJRR, vol. 32, no. 14, 2013.
-  M. Paton, K. MacTavish, M. Warren, and T. D. Barfoot, “Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat,” in IROS, 2016.
-  A. Walcott-Bryant, M. Kaess, H. Johannsson, and J. J. Leonard, “Dynamic pose graph SLAM: Long-term mapping in low dynamic environments,” in IROS, 2012.
-  F. Dayoub, G. Cielniak, and T. Duckett, “Long-term experiments with an adaptive spherical view representation for navigation in changing environments,” Robotics and Autonomous Systems, vol. 59, no. 5, pp. 285–295, 2011.
-  F. Dayoub and T. Duckett, “An adaptive appearance-based map for long-term topological localization of mobile robots,” in IROS, 2008.
-  P. Turcot and D. G. Lowe, “Better matching with fewer features: The selection of useful features in large database recognition problems,” in ICCV Workshops, 2009.
-  J. Knopp, J. Sivic, and T. Pajdla, “Avoiding Confusing Features in Place Recognition,” in ECCV, 2010.
-  K. Konolige and J. Bowman, “Towards lifelong visual maps,” in IROS, 2009.
-  M. Milford and G. Wyeth, “Persistent navigation and mapping using a biologically inspired SLAM system,” IJRR, vol. 29, no. 9, 2010.
-  M. Dymczyk, S. Lynen, T. Cieslewski, M. Bosse, R. Siegwart, and P. Furgale, “The gist of maps-summarizing experience for lifelong localization,” in ICRA, 2015.
-  K. Pirker, M. Ruther, and H. Bischof, “CD SLAM - continuous localization and mapping in a dynamic world,” in IROS, 2011.
-  A. Loquercio, M. Dymczyk, B. Zeisl, S. Lynen, I. Gilitschenski, and R. Siegwart, “Efficient descriptor learning for large scale localization,” in ICRA, 2017.
-  W. Hartmann, M. Havlena, and K. Schindler, “Predicting Matchability,” in CVPR, 2014.
-  D. M. Rosen, J. Mason, and J. J. Leonard, “Towards lifelong feature-based mapping in semi-static environments,” in ICRA, 2016.
-  C. Linegar, W. Churchill, and P. Newman, “Work Smart, Not Hard: Recalling Relevant Experiences for Vast-Scale but Time-Constrained Localisation,” in ICRA, 2015.
-  T. Krajnik, J. Pulido Fentanes, M. Hanheide, and T. Duckett, “Persistent localization and life-long mapping in changing environments using the Frequency Map Enhancement,” in IROS, 2016.
-  T. Krajnik, J. P. Fentanes, O. M. Mozos, T. Duckett, J. Ekekrantz, and M. Hanheide, “Long-term topological localisation for service robots in dynamic environments using spectral maps,” in IROS, 2014.
-  H. Lategahn and C. Stiller, “City GPS using stereo vision,” in ICVES, 2012.