
LaMAR: Benchmarking Localization and Mapping for Augmented Reality

by   Paul-Edouard Sarlin, et al.

Localization and mapping is the foundational technology for augmented reality (AR) that enables sharing and persistence of digital content in the real world. While significant progress has been made, researchers are still mostly driven by unrealistic benchmarks not representative of real-world AR scenarios. These benchmarks are often based on small-scale datasets with low scene diversity, captured from stationary cameras, and lack other sensor inputs like inertial, radio, or depth data. Furthermore, their ground-truth (GT) accuracy is mostly insufficient to satisfy AR requirements. To close this gap, we introduce LaMAR, a new benchmark with a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes. To establish an accurate GT, our pipeline robustly aligns the trajectories against laser scans in a fully automated manner. As a result, we publish a benchmark dataset of diverse and large-scale scenes recorded with head-mounted and hand-held AR devices. We extend several state-of-the-art methods to take advantage of the AR-specific setup and evaluate them on our benchmark. The results offer new insights on current research and reveal promising avenues for future work in the field of localization and mapping for AR.





1 Introduction

Placing virtual content in the physical 3D world, persisting it over time, and sharing it with other users are typical scenarios for Augmented Reality (AR). In order to reliably overlay virtual content in the real world with pixel-level precision, these scenarios require AR devices to accurately determine their 6-DoF pose at any point in time. While visual localization and mapping is one of the most studied problems in computer vision, its use for AR entails specific challenges and opportunities. First, modern AR devices, such as mobile phones or the Microsoft HoloLens or MagicLeap One, are often equipped with multiple cameras and additional inertial or radio sensors. Second, they exhibit characteristic hand-held or head-mounted motion patterns. The on-device real-time tracking systems provide spatially-posed sensor streams. However, many AR scenarios require positioning beyond local tracking, both indoors and outdoors, and robustness to common temporal changes of appearance and structure. Furthermore, given the plurality of temporal sensor data, the question is often not whether, but how quickly the device can localize at any time to ensure a compelling end-user experience. Finally, as AR adoption grows, crowd-sourced data captured by users with diverse devices can be mined for building large-scale maps without a manual and costly scanning effort. Crowd-sourcing offers great opportunities but poses additional challenges on the robustness of algorithms, e.g., to enable cross-device localization [dusmanu2021cross], mapping from incomplete data with low accuracy [Schoenberger2016Structure, brachmann2021limits], privacy-preservation of data [speciale2019a, geppert2020privacy, shibuya2020privacy, geppert2021privacy, dusmanu2021privacy], etc.

However, the academic community is mainly driven by benchmarks that are disconnected from the specifics of AR. They mostly evaluate localization and mapping using single still images and either lack temporal changes [shotton2013scene, advio] or accurate ground truth (GT) [sattler2018benchmarking, kendall2015, taira2018inloc], are restricted to small scenes [Balntas2017HPatches, shotton2013scene, kendall2015, wald2020, schops2017multi] or landmarks [Jin2020Image, Schonberger2017Comparative] with perfect coverage and limited viewpoint variability, or disregard temporal tracking data or additional visual, inertial, or radio sensors [sattler2012aachen, sattler2018benchmarking, taira2018inloc, lee2021naver, nclt, sun2017dataset].

Our first contribution is to introduce a large-scale dataset captured using AR devices in diverse environments, notably a historical building, a multi-story office building, and part of a city center. The initial data release contains both indoor and outdoor images with illumination and semantic changes as well as dynamic objects. Specifically, we collected multi-sensor data streams (images, depth, tracking, IMU, BT, WiFi) totalling more than 100 hours using head-mounted HoloLens 2 and hand-held iPhone / iPad devices covering 45’000 square meters over the span of one year (Fig. 1).

Second, we develop a GT pipeline to automatically and accurately register AR trajectories against large-scale 3D laser scans. Our pipeline does not require any manual labelling or setup of custom infrastructure (e.g., fiducial markers). Furthermore, the system robustly handles crowd-sourced data from heterogeneous devices captured over longer periods of time and can be easily extended to support future devices.

Finally, we present a rigorous evaluation of localization and mapping in the context of AR and provide novel insights for future research. Notably, we show that the performance of state-of-the-art methods can be drastically improved by considering additional data streams generally available in AR devices, such as radio signals or sequence odometry. Thus, future algorithms in the field of AR localization and mapping should always consider these sensors in their evaluation to show real-world impact.

The LaMAR dataset, benchmark, GT pipeline, and the implementations of baselines integrating additional sensory data are all publicly available. We hope that this will spark future research addressing the challenges of AR.

2 Related work

| dataset | scale | density | camera motion | imaging devices | additional sensors | ground truth | accuracy |
|---|---|---|---|---|---|---|---|
| Aachen [sattler2012aachen, sattler2018benchmarking] | ★★½ | ★★☆ | still images | DSLR | | SfM | dm |
| Phototourism [Jin2020Image] | ½☆☆ | ★★★ | still images | DSLR, phone | | SfM | m |
| San Francisco [sanfrancisco] | ★★★ | ★☆☆ | still images | DSLR, phone | GNSS | SfM+GNSS | m |
| Cambridge [kendall2015] | ½☆☆ | ★★☆ | handheld | mobile | | SfM | dm |
| 7Scenes [shotton2013scene] | ½☆☆ | ★★★ | handheld | mobile | depth | RGB-D | cm |
| RIO10 [wald2020] | ½☆☆ | ★★★ | handheld | Tango tablet | depth | VIO | dm |
| InLoc [taira2018inloc] | ★½☆ | ½☆☆ | still images | panoramas, phone | lidar | manual+lidar | dm |
| Baidu mall [sun2017dataset] | ★½☆ | ★★☆ | still images | DSLR, phone | lidar | manual+lidar | dm |
| Naver Labs [lee2021naver] | ★★☆ | ★★☆ | robot-mounted | fisheye, phone | lidar | lidar+SfM | dm |
| NCLT [nclt] | ★★☆ | ★★☆ | robot-mounted | wide-angle | lidar, IMU, GNSS | lidar+VIO | dm |
| ADVIO [advio] | ★★☆ | ½☆☆ | handheld | phone, Tango | IMU, depth, GNSS | manual+VIO | m |
| ETH3D [schops2017multi] | ½☆☆ | ★★☆ | handheld | DSLR, wide-angle | lidar | manual+lidar | mm |
| LaMAR (ours) | ★★½ (3 locations, 45’000 m²) | ★★★ (100 hours, 40 km) | handheld, head-mounted | phone, headset, backpack, trolley | lidar, IMU, WiFi, Bluetooth, depth, infrared | lidar+SfM+VIO (automated) | cm |

Table 1: Overview of existing datasets. No dataset, besides ours, exhibits at the same time short-term appearance and structural changes due to moving people, weather, or day-night cycles, but also long-term changes due to displaced furniture or construction work. [The out/indoor and changes columns were icon-only and are omitted.]

Image-based localization is classically tackled by estimating a camera pose from correspondences established between sparse local features [lowe2004distinctive, bay2008speeded, Rublee2011ORB, mikolajczyk2004ijcv] and a 3D Structure-from-Motion (SfM) [Schoenberger2016Structure] map of the scene [fischler1981random, li2012worldwide, sattler2012improving]. This pipeline scales to large scenes using image retrieval [arandjelovic2012three, vlad, apgem, asmk, cao2020unifying, rau2020imageboxoverlap, densevlad]. Recently, many of these steps or even the end-to-end pipeline have been successfully learned with neural networks [detone2018superpoint, sarlin2020superglue, Dusmanu2019CVPR, schoenberger2018semantic, arandjelovic2016netvlad, NIPS2017_831caa1b, tian2019sosnet, sarlin2019, yi2016lift, Hyeon2021, sarlin21pixloc, lindenberger2021pixsfm]. Other approaches regress absolute camera pose [kendall2015, kendall2017geometric, ng2021reassessing] or scene coordinates [shotton2013scene, valentin2015cvpr, meng2017backtracking, massiceti2017random, angle_scr, brachmann2019esac, Wang2021, brachmann2021dsacstar]. However, all these approaches typically fail whenever there is a lack of context (e.g., limited field-of-view) or the map has repetitive elements. Leveraging the sequential ordering of video frames [seqslam, johns2013feature] or modelling the problem as a generalized camera [pless2003using, hee2016minimal, sattler2018benchmarking, speciale2019a] can improve results.

Radio-based localization:  Radio signals, such as WiFi and Bluetooth, are spatially bounded (logarithmic decay) [radar, khalajmehrabadi2016modern, radio_fingerprint] and can thus distinguish similar-looking (but spatially distant) locations. Their unique identifiers can be hashed, which makes them computationally attractive compared with high-dimensional image descriptors. Several methods use the signal strength, angle, direction, or time of arrival [radio_aoa, radio_toa, radio_tdoa], but the most popular is model-free map-based fingerprinting [khalajmehrabadi2016modern, radio_fingerprint, fault_tolerant], as it only requires collecting the unique identifiers of nearby radio sources and their received signal strength. GNSS provides absolute 3-DoF positioning but is not applicable indoors and has insufficient accuracy for AR scenarios, especially in urban environments due to multi-pathing, etc.
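As a rough illustration of map-based fingerprinting, the sketch below ranks reference scans by the similarity of their signal-strength fingerprints to a query scan. It is a minimal sketch, not the benchmark's implementation: the cosine similarity, the `floor_dbm` offset, and the identifiers and location names are all assumptions made for the example.

```python
import math

def fingerprint_similarity(query, reference, floor_dbm=-100.0):
    """Cosine similarity between two {identifier: RSSI in dBm} scans.

    Signal strengths are shifted by floor_dbm (an assumed noise floor) so that
    stronger signals get larger weights; unseen identifiers contribute zero.
    """
    q = {k: max(v - floor_dbm, 0.0) for k, v in query.items()}
    r = {k: max(v - floor_dbm, 0.0) for k, v in reference.items()}
    shared = q.keys() & r.keys()  # cheap lookup by hashed identifier
    dot = sum(q[k] * r[k] for k in shared)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)

def retrieve(query, reference_map, top_k=1):
    """Names of the top_k reference scans most similar to the query scan."""
    ranked = sorted(
        reference_map,
        key=lambda name: fingerprint_similarity(query, reference_map[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical reference scans and query.
reference_map = {
    "lobby": {"ap:01": -45.0, "ap:02": -70.0},
    "lab": {"ap:02": -50.0, "ap:03": -55.0},
}
query = {"ap:01": -50.0, "ap:02": -75.0}
best = retrieve(query, reference_map)
```

Because only identifiers shared by both scans contribute, two visually similar but distant places with disjoint radio sources score zero.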

Datasets and ground-truth:  Many of the existing benchmarks (cf. Tab. 1) are captured in small-scale environments [shotton2013scene, wald2020, dai2017scannet, hodan2018bop], do not contain sequential data [sattler2012aachen, Jin2020Image, sanfrancisco, taira2018inloc, sun2017dataset, schops2017multi, Balntas2017HPatches, Schonberger2017Comparative], lack characteristic hand-held/head-mounted motion patterns [sattler2018benchmarking, Badino2011, RobotCarDatasetIJRR, wenzel2020fourseasons], or their GT is not accurate enough for AR [advio, kendall2015]. None of these datasets contain WiFi or Bluetooth data (Tab. 1). The closest to our work are Naver Labs [lee2021naver], NCLT [nclt] and ETH3D [schops2017multi]. Both Naver Labs [lee2021naver] and NCLT [nclt] are less accurate than ours and do not contain AR-specific trajectories or radio data. The Naver Labs dataset [lee2021naver] also does not contain any outdoor data. ETH3D [schops2017multi] is highly accurate; however, it is only small-scale and contains neither significant changes nor any radio data.

To establish ground-truth, many datasets rely on off-the-shelf SfM algorithms [Schoenberger2016Structure] for unordered image collections [sattler2012aachen, Jin2020Image, kendall2015, wald2020, advio, sun2017dataset, taira2018inloc]. Pure SfM-based GT generation has limited accuracy [brachmann2021limits] and completeness, which biases the evaluations to scenarios in which visual localization already works well. Other approaches rely on RGB(-D) tracking [wald2020, shotton2013scene], which usually drifts in larger scenes and cannot produce GT in crowd-sourced, multi-device scenarios. Specialized capture rigs that combine an AR device with a more accurate sensor (lidar) [lee2021naver, nclt] prevent capturing realistic AR motion patterns. Furthermore, scalability is limited for these approaches, especially if they rely on manual selection of reference images [sun2017dataset], laborious labelling of correspondences [sattler2012aachen, taira2018inloc], or placement of fiducial markers [hodan2018bop]. For example, the accuracy of ETH3D [schops2017multi] is achieved by using a single stationary lidar scan, manual cleaning, and aligning very few images captured by tripod-mounted DSLR cameras. Images thus obtained are not representative of AR devices and the process cannot scale or take advantage of crowd-sourced data. In contrast, our fully automatic approach does not require any manual labelling or special capture setups and thus enables light-weight and repeated scanning of large locations.

| device | motion type | # cameras | FOV | frequency | resolution | camera specs | radios | other data | poses |
|---|---|---|---|---|---|---|---|---|---|
| M6 | trolley | 6 | 113° | 1-3 m | 1080p | RGB, sync | WiFi, Bluetooth | lidar points+mesh | lidar SLAM |
| VLX | backpack | 4 | 90° | 1-3 m | 1080p | RGB, sync | Bluetooth | lidar points+mesh | lidar SLAM |
| HoloLens2 | head-mounted | 4 | 83° | 30 Hz | VGA | gray, GS | WiFi, Bluetooth | ToF depth/IR 1 Hz, IMU | head-tracking |
| iPad/iPhone | hand-held | 1 | 64° | 10 Hz | 1080p | RGB, RS, AF | Bluetooth | lidar depth 10 Hz, IMU | ARKit |

Table 2: Sensor specifications. Our dataset has visible light images (global shutter GS, rolling shutter RS, auto-focus AF), depth data (ToF, lidar), radio signals, dense lidar point clouds, and poses with intrinsics from on-device tracking.

3 Dataset

We first give an overview of the setup and content of our dataset.

Locations:  The initial release of the dataset contains 3 large locations representative of AR use cases: 1) HGE (18’000 m²) is the ground floor of a historical university building composed of multiple large halls and large esplanades on both sides. 2) CAB (12’000 m²) is a multi-floor office building composed of multiple small and large offices, a kitchen, storage rooms, and 2 courtyards. 3) LIN (15’000 m²) is a few blocks of an old town with shops, restaurants, and narrow passages. HGE and CAB contain both indoor and outdoor sections with many symmetric structures. Each location underwent structural changes over the span of a year, e.g., the front of HGE turned into a construction site and the indoor furniture was rearranged. See Fig. 2 and Appendix 0.A for visualizations.

Data collection:  We collected data using Microsoft HoloLens 2 and Apple iPad Pro devices with custom raw sensor recording applications. 10 participants were each given one device and asked to walk through a common designated area. They were only given the instructions to freely walk through the environment to visit, inspect, and find their way around. This yielded diverse camera heights and motion patterns. Their trajectories were not planned or restricted in any way. Participants visited each location, both during the day and at night, at different points in time over the course of up to 1 year. In total, each location is covered by more than 100 sessions of 5 minutes. We did not need to prepare the capturing site in any way before recording. This enables easy barrier-free crowd-sourced data collections. Each location was also captured two to three times by NavVis M6 trolley or VLX backpack mapping platforms, which generate textured dense 3D models of the environment using laser scanners and panoramic cameras.

Privacy:  We paid special attention to comply with privacy regulations. Since the dataset is recorded in public spaces, our pipeline anonymizes all visible faces and licence plates.

Sensors:  We provide details about the recorded sensors in Tab. 2. The HoloLens has a specialized large field-of-view (FOV) multi-camera tracking rig (low resolution, global shutter) [ungureanu2020hololens], while the iPad has a single, higher-resolution camera with rolling shutter and more limited FOV. We also recorded outputs of the real-time AR tracking algorithms available on each device, which includes relative camera poses and sensor calibration. All images are undistorted. All sensor data is registered into a common reference frame with accurate absolute GT poses using the pipeline described in the next section.

Figure 2: The locations feature diverse indoor and outdoor spaces. High-quality meshes, obtained from lidar, are registered with numerous AR sequences, each shown here as a different color.

4 Ground-truth generation

The GT estimation process takes as input the raw data from the different sensors. The entire pipeline is fully automated and does not require any manual alignment or input.

Overview:  We start by aligning different sessions of the laser scanner by using the images and the 3D lidar point cloud. When registered together, they form the GT reference map, which accurately captures the structure and appearance of the scene. We then register each AR sequence individually to the reference map using local feature matching and relative poses from the on-device tracker. Finally, all camera poses are refined jointly by optimizing the visual constraints within and across sequences.

Notation:  We denote by ${}_i\mathbf{T}_j \in \mathrm{SE}(3)$ the 6-DoF pose, encompassing rotation and translation, that transforms a point in frame $j$ to another frame $i$. Our goal is to compute globally-consistent absolute poses ${}_w\mathbf{T}_i$ for all cameras $i$ of all sequences and scanning sessions into a common reference world frame $w$.

4.1 Ground-truth reference model

Each capture session of the NavVis laser-scanning platform is processed by a proprietary inertial-lidar SLAM that estimates, for each image, a pose relative to the beginning of the session. The software filters out noisy lidar measurements, removes dynamic objects, and aggregates the remainder into a globally-consistent colored 3D point cloud with a grid resolution of 1cm. To recover visibility information, we compute a dense mesh using the Advancing Front algorithm [cohen2004greedy].

Our first goal is to align the sessions into a common GT reference frame. We assume that the scan trajectories are drift-free and only need to register each with a rigid transformation. Scan sessions can be captured long periods of time apart and therefore exhibit large structural and appearance changes. We use a combination of image and point cloud information to obtain accurate registrations without any manual initialization. The steps are inspired by the reconstruction pipeline of Choi et al. [choi2015robust, Zhou2018].

Pair-wise registration:  We first estimate a rigid transformation for each pair of scanning sessions. For each image in one session, we select the most similar images in the other session based on global image descriptors [vlad, arandjelovic2016netvlad, apgem], which helps the registration scale to large scenes. We extract sparse local image features and establish 2D-2D correspondences for each image pair. The 2D keypoints are lifted to 3D by tracing rays through the dense mesh of the corresponding session. This yields 3D-3D correspondences, from which we estimate an initial relative pose [umeyama1991least] using RANSAC [fischler1981random]. This pose is refined with the point-to-plane Iterative Closest Point (ICP) algorithm [rusinkiewicz2001efficient] applied to the pair of lidar point clouds.
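The initial pose estimation from 3D-3D correspondences can be sketched as a least-squares rigid fit (in the spirit of [umeyama1991least], without scale) wrapped in a basic RANSAC loop. This is a minimal sketch under assumptions: the function names, the inlier threshold, the iteration count, and the synthetic data are illustrative and not taken from the paper, and the ICP refinement is left out.

```python
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rotation R and translation t such that dst ~ R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s)
    U, _, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    return R, mu_d - R @ mu_s

def ransac_rigid(src, dst, threshold=0.05, iterations=200, seed=0):
    """Robust rigid fit: sample 3 correspondences, keep the largest consensus."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(len(src), size=3, replace=False)
        R, t = fit_rigid(src[sample], dst[sample])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("no consistent hypothesis found")
    R, t = fit_rigid(src[best], dst[best])  # refit on all inliers
    return R, t, best

# Synthetic correspondences: 35 exact matches and 15 gross outliers.
rng = np.random.default_rng(1)
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([2.0, -1.0, 0.5])
src = rng.uniform(-5.0, 5.0, size=(50, 3))
dst = src @ R_true.T + t_true
dst[:15] += 5.0 + rng.uniform(0.0, 1.0, size=(15, 3))
R_est, t_est, inliers = ransac_rigid(src, dst)
```

Fitting on the minimal sample and refitting on the consensus set keeps wrong matches, which are expected across large appearance changes, from biasing the estimate.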

We use state-of-the-art local image features that can match across drastic illumination and viewpoint changes [sarlin2019, detone2018superpoint, revaudr2d2]. Combined with the strong geometric constraints in the registration, our system is robust to long-term temporal changes and does not require manual initialization. Using this approach, we have successfully registered building-scale scans captured more than a year apart with large structural changes.

Global alignment:  We gather all pairwise constraints and jointly refine all absolute scan poses by optimizing a pose graph [grisetti2010tutorial]. The edges are weighted with the covariance matrices of the pair-wise ICP estimates. The images of all scan sessions are finally combined into a unique reference trajectory. The point clouds and meshes are aligned according to the same transformations. They define the reference representation of the scene, which we use as a basis to obtain GT for the AR sequences.

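Ground-truth visibility:  The accurate and dense 3D geometry of the mesh allows us to compute accurate visual overlap between two cameras with known poses and calibration. Inspired by Rau et al. [rau2020imageboxoverlap], we define the overlap of image $i$ wrt. a reference image $j$ by the ratio of pixels in $i$ that are visible in $j$:

$$O(i \rightarrow j) = \frac{\sum_{k \in (W,H)} V\left(\Pi_j\left({}_w\mathbf{T}_j,\ \Pi_i^{-1}\left({}_w\mathbf{T}_i, \mathbf{p}_k, z_k\right)\right)\right) \alpha_k}{W \cdot H}, \tag{1}$$

where $\Pi_i$ projects a 3D point to camera $i$ with absolute pose ${}_w\mathbf{T}_i$, $\Pi_i^{-1}$ conversely backprojects pixel $\mathbf{p}_k$ using its known depth $z_k$, and $V(\cdot)$ indicates whether the projection is visible in image $j$, with $(W, H)$ as the image dimensions. The contribution of each pixel is weighted by the angle $\alpha_k$ between the two rays. To handle scale changes, it is averaged both ways, $i \rightarrow j$ and $j \rightarrow i$. This score is efficiently computed by tracing rays through the mesh and checking for occlusion for robustness.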
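The overlap score can be approximated with per-image depth maps instead of ray tracing through a mesh. The sketch below makes several assumptions that are not in the paper: both images share the intrinsics `K` and the same size, depth maps stand in for the mesh, the ray-angle weight is taken as the clipped cosine between the two viewing rays, and `depth_tol` is an arbitrary occlusion tolerance.

```python
import numpy as np

def directed_overlap(depth_i, depth_j, K, T_j_from_i, depth_tol=0.05):
    """Weighted fraction of pixels of image i that are visible in image j."""
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=1)
    # Backproject every pixel of image i using its known depth.
    points_i = (pixels @ np.linalg.inv(K).T) * depth_i.reshape(-1, 1)
    R, t = T_j_from_i[:3, :3], T_j_from_i[:3, 3]
    points_j = points_i @ R.T + t
    z = points_j[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)
    uv = (points_j @ K.T)[:, :2] / safe_z[:, None]
    col = np.floor(uv[:, 0]).astype(int)
    row = np.floor(uv[:, 1]).astype(int)
    in_image = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
    # Occlusion check: the point must agree with the depth observed in image j.
    visible = np.zeros(H * W, dtype=bool)
    idx = np.flatnonzero(in_image)
    visible[idx] = np.abs(depth_j[row[idx], col[idx]] - z[idx]) < depth_tol
    # Weight each pixel by the cosine of the angle between the two viewing rays.
    rays_i = points_i @ R.T
    cos = np.sum(rays_i * points_j, axis=1) / (
        np.linalg.norm(rays_i, axis=1) * np.linalg.norm(points_j, axis=1) + 1e-12
    )
    return float(np.sum(np.clip(cos, 0.0, 1.0) * visible) / (H * W))

def visual_overlap(depth_i, depth_j, K, T_j_from_i):
    """Average of both directions, to handle scale changes."""
    forward = directed_overlap(depth_i, depth_j, K, T_j_from_i)
    backward = directed_overlap(depth_j, depth_i, K, np.linalg.inv(T_j_from_i))
    return 0.5 * (forward + backward)

# Two cameras looking at a fronto-parallel plane 2 m away.
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
same_view = visual_overlap(depth, depth, K, np.eye(4))
far_away = np.eye(4)
far_away[0, 3] = 100.0
no_overlap = visual_overlap(depth, depth, K, far_away)
```

Identical views give a score close to one, while views that share no visible surface give zero, independently of how much texture the scene contains.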

This score favors images that observe the same scene from similar viewpoints. Unlike sparse co-visibility in an SfM model [radenovic2018fine], our formulation is independent of the amount of texture and the density of the feature detections. This score correlates with matchability – we thus use it as GT when evaluating retrieval and to determine an upper bound on the theoretically achievable performance of our benchmark.

4.2 Sequence-to-scan alignment

We now aim to register each AR sequence individually into the dense GT reference model (see Fig. 3). Given a sequence of frames, we introduce a simple algorithm that estimates the absolute pose of each frame. A frame refers to an image taken at a given time or, when the device is composed of a camera rig with known calibration (e.g., HoloLens), to a collection of simultaneously captured images.

Figure 3: Sequence-to-scan alignment. We first estimate the absolute pose of each sequence frame using image retrieval and matching. This initial localization prior is used to obtain a single rigid alignment between the input trajectory and the reference 3D model via voting. The alignment is then relaxed by optimizing the individual frame poses in a pose graph based on both relative and absolute pose constraints. We bootstrap this initialization by mining relevant image pairs and re-localizing the queries. Given these improved absolute priors, we optimize the pose graph again and finally include reprojection errors of the visual correspondences, yielding a refined trajectory.

Inputs:  We assume that trajectories estimated by a visual-inertial tracker are given; we use ARKit for iPhone/iPad and the on-device tracker for HoloLens. The tracker also outputs per-frame camera intrinsics, which account for auto-focus or calibration changes and are for now kept fixed.

Initial localization:  For each frame of a sequence, we retrieve a fixed number of relevant reference images using global image descriptors. We match sparse local features [lowe2004distinctive, detone2018superpoint, revaudr2d2] extracted in the query frame to each retrieved image, obtaining a set of 2D-2D correspondences. The 2D reference keypoints are lifted to 3D by tracing rays through the mesh of the reference model, yielding a set of 2D-3D correspondences. We combine all matches per query frame and estimate an initial absolute pose using the (generalized) P3P algorithm [hee2016minimal] within a LO-RANSAC scheme [chum2003locally] followed by a non-linear refinement [Schoenberger2016Structure]. Because of challenging appearance conditions, structural changes, or lack of texture, some frames cannot be localized in this stage. We discard all poses that are supported by a low number of inlier correspondences.

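Rigid alignment:  We next recover a coarse initial pose for all frames, including those that could not be localized. Using the tracking, which is for now assumed drift-free, we find the rigid alignment ${}_w\mathbf{T}_0$ that maximizes the consensus among localization poses. This voting scheme is fast and effectively rejects poses that are incorrect, yet confident, due to visual aliasing and symmetries. Let ${}_w\mathbf{T}_i^{\text{loc}}$ be the localization pose of frame $i$ and ${}_0\mathbf{T}_i^{\text{track}}$ its tracker pose relative to the start of the sequence. Each estimate is a candidate transformation ${}_w\mathbf{T}_0^{j} = {}_w\mathbf{T}_j^{\text{loc}} \left({}_0\mathbf{T}_j^{\text{track}}\right)^{-1}$, for which other frames can vote, if they are consistent within a threshold $\tau_{\text{rigid}}$. We select the candidate with the highest count of inliers:

$${}_w\mathbf{T}_0 = \underset{\mathbf{T} \in \{{}_w\mathbf{T}_0^{j}\}_{1 \le j \le n}}{\arg\max} \sum_{1 \le i \le n} \mathbb{1}\left[\mathrm{dist}\left({}_w\mathbf{T}_i^{\text{loc}},\ \mathbf{T} \cdot {}_0\mathbf{T}_i^{\text{track}}\right) < \tau_{\text{rigid}}\right], \tag{2}$$

where $n$ is the number of frames in the sequence, $\mathbb{1}[\cdot]$ is the indicator function, and $\mathrm{dist}(\cdot, \cdot)$ returns the magnitude, in terms of translation and rotation, of the difference between two absolute poses. We then recover the per-frame initial poses as ${}_w\mathbf{T}_i^{\text{init}} = {}_w\mathbf{T}_0 \cdot {}_0\mathbf{T}_i^{\text{track}}$.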
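The voting step can be sketched as follows. This is a minimal sketch under stated assumptions: poses are 4x4 homogeneous matrices, the thresholds `max_t` and `max_r` are placeholder values rather than the paper's, and the helper `make_pose` and the toy trajectory are hypothetical.

```python
import numpy as np

def make_pose(yaw_deg, translation):
    """4x4 pose with a rotation about the z axis (helper for the example)."""
    a = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def pose_distance(T_a, T_b):
    """Translation (m) and rotation (deg) magnitude of the pose difference."""
    delta = np.linalg.inv(T_a) @ T_b
    dt = np.linalg.norm(delta[:3, 3])
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return dt, np.degrees(np.arccos(cos_angle))

def vote_rigid_alignment(T_loc, T_track, max_t=1.0, max_r=5.0):
    """T_loc: {frame index: world-from-frame localization pose}, possibly sparse.
    T_track: list of tracker poses (sequence-origin-from-frame) for all frames.
    Returns the world-from-origin candidate with the most votes."""
    best_T, best_votes = None, -1
    for j, T_wj in T_loc.items():
        candidate = T_wj @ np.linalg.inv(T_track[j])
        votes = 0
        for i, T_wi in T_loc.items():
            dt, dr = pose_distance(T_wi, candidate @ T_track[i])
            votes += int(dt < max_t and dr < max_r)
        if votes > best_votes:
            best_T, best_votes = candidate, votes
    return best_T, best_votes

# Toy sequence: 5 frames moving along x; frame 3 is not localized and
# frame 2 is localized confidently but wrongly (e.g., visual aliasing).
T_track = [make_pose(0.0, [float(i), 0.0, 0.0]) for i in range(5)]
T_true = make_pose(90.0, [10.0, 0.0, 0.0])
T_loc = {i: T_true @ T_track[i] for i in (0, 1, 4)}
T_loc[2] = make_pose(45.0, [50.0, 50.0, 0.0])
T_align, votes = vote_rigid_alignment(T_loc, T_track)
T_init = [T_align @ T for T in T_track]  # initial pose for every frame
```

Frame 3 receives an initial pose even though it was never localized, and the wrong estimate for frame 2 is outvoted.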

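Pose graph optimization:  We refine the initial absolute poses by maximizing the consistency of tracking and localization cues within a pose graph. For a sequence of $n$ frames, the refined poses minimize the energy function

$$E\left(\{{}_w\mathbf{T}_i\}\right) = \sum_{i=1}^{n-1} C_{\text{PGO}}\left({}_w\mathbf{T}_{i+1}^{-1}\, {}_w\mathbf{T}_i,\ {}_{i+1}\mathbf{T}_i^{\text{track}}\right) + \sum_{i=1}^{n} C_{\text{PGO}}\left({}_w\mathbf{T}_i,\ {}_w\mathbf{T}_i^{\text{loc}}\right), \tag{3}$$

where ${}_{i+1}\mathbf{T}_i^{\text{track}}$ is the relative pose between consecutive frames given by the tracker, ${}_w\mathbf{T}_i^{\text{loc}}$ is the absolute localization pose of frame $i$, and $C_{\text{PGO}}(\mathbf{T}_1, \mathbf{T}_2) := \left\| \mathrm{Log}\left(\mathbf{T}_1 \mathbf{T}_2^{-1}\right) \right\|^2_{\Sigma, \gamma}$ is the distance between two absolute or relative poses, weighted by covariance matrix $\Sigma$ and loss function $\gamma$. Here, Log maps from the Lie group $\mathrm{SE}(3)$ to the corresponding algebra $\mathfrak{se}(3)$.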

We robustify the absolute term with the Geman-McClure loss function and anneal its scale via a Graduated Non-Convexity scheme [yang2020graduated]. This ensures convergence in case of poor initialization, e.g., when the tracking exhibits significant drift, while remaining robust to incorrect localization estimates. The covariance of the absolute term is propagated from the preceding non-linear refinement performed during localization. The covariance of the relative term is recovered from the odometry pipeline, or, if not available, approximated as a factor of the motion magnitude.

This step can fill the gaps from the localization stage using the tracking information and conversely correct for tracker drift using localization cues. In rare cases, the resulting poses might still be inaccurate when both the tracking drifts and the localization fails.
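To illustrate how a Geman-McClure loss with graduated non-convexity suppresses wrong absolute estimates, the sketch below applies the weight update of [yang2020graduated] to a scalar location estimate instead of a full pose graph. It is a minimal sketch, assuming a noise scale `c`, a reduction factor of 1.4 for the control parameter `mu`, and toy measurements; none of these values come from the paper.

```python
import numpy as np

def gnc_geman_mcclure_mean(values, c=1.0, mu_factor=1.4, max_iters=200, tol=1e-9):
    """Robust location estimate via iteratively reweighted least squares.

    mu starts large (loss close to quadratic) and is annealed towards 1
    (the original Geman-McClure loss), so a poor start can still converge.
    """
    values = np.asarray(values, dtype=float)
    estimate = values.mean()  # non-robust initialization
    max_sq = np.max((values - estimate) ** 2)
    mu = max(2.0 * max_sq / c**2, 1.0)
    weights = np.ones_like(values)
    for _ in range(max_iters):
        previous = estimate
        # Weighted least-squares step with the current weights.
        estimate = np.sum(weights * values) / np.sum(weights)
        residuals_sq = (values - estimate) ** 2
        # Closed-form weight update for the annealed Geman-McClure loss.
        weights = (mu * c**2 / (residuals_sq + mu * c**2)) ** 2
        if mu == 1.0 and abs(estimate - previous) < tol:
            break
        mu = max(mu / mu_factor, 1.0)
    return float(estimate), weights

# Five consistent measurements near 2.0 and two gross outliers.
measurements = [1.9, 2.0, 2.1, 2.05, 1.95, 30.0, 40.0]
estimate, weights = gnc_geman_mcclure_mean(measurements)
```

In the pose graph, the same weighting would apply to the absolute terms only, which lets tracking carry the trajectory through frames whose localization is down-weighted.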

Guided localization via visual overlap:  To further increase the pose accuracy, we leverage the current pose estimates to mine for additional localization cues. Instead of relying on global visual descriptors, which are easily affected by aliasing, we select reference images with a high overlap using the score defined in Sec. 4.1. For each sequence frame, we select the reference images with the largest overlap and again match local features and estimate an absolute pose. These new localization priors improve the pose estimates in a second optimization of the pose graph.

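Bundle adjustment:  For each frame $i$, we recover the set $\mathcal{M}_i$ of 2D-3D correspondences used by the guided re-localization. We now refine the poses by jointly minimizing a bundle adjustment problem with relative pose graph costs:

$$E\left(\{{}_w\mathbf{T}_i\}\right) = \sum_{i=1}^{n-1} C_{\text{PGO}}\left({}_w\mathbf{T}_{i+1}^{-1}\, {}_w\mathbf{T}_i,\ {}_{i+1}\mathbf{T}_i^{\text{track}}\right) + \sum_{i=1}^{n} \sum_{(\mathbf{p}_k, \mathbf{P}_k) \in \mathcal{M}_i} \left\| \Pi\left({}_w\mathbf{T}_i, \mathbf{P}_k\right) - \mathbf{p}_k \right\|^2_{\sigma^2}, \tag{4}$$

where the first term is the relative pose-graph cost $C_{\text{PGO}}$ between consecutive frames, as in the pose graph optimization, and the second term evaluates the reprojection error of a 3D point $\mathbf{P}_k$ for observation $\mathbf{p}_k$ to frame $i$. The covariance $\sigma^2$ is the noise of the keypoint detection algorithm. We pre-filter correspondences that are behind the camera or have an initial reprojection error greater than a threshold $\tau_{\text{reproj}}$. As the 3D points are sampled from the lidar, we also optimize them with a prior noise corresponding to the lidar specifications. We use the Ceres [ceres-solver] solver.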
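The pre-filtering of correspondences can be sketched with a pinhole camera model. This is a minimal sketch: it uses a camera-from-world pose for convenience, ignores lens distortion, and the 5-pixel threshold and the toy correspondences are assumptions, not values from the paper.

```python
import numpy as np

def reprojection_errors(K, T_cam_from_world, points_3d, keypoints_2d):
    """Pixel errors per correspondence and a mask of points in front of the camera."""
    R, t = T_cam_from_world[:3, :3], T_cam_from_world[:3, 3]
    p_cam = points_3d @ R.T + t
    in_front = p_cam[:, 2] > 0
    z = np.where(in_front, p_cam[:, 2], 1.0)  # dummy depth, masked out later
    focal = np.array([K[0, 0], K[1, 1]])
    center = np.array([K[0, 2], K[1, 2]])
    projected = (p_cam[:, :2] / z[:, None]) * focal + center
    errors = np.linalg.norm(projected - keypoints_2d, axis=1)
    return errors, in_front

def prefilter(K, T_cam_from_world, points_3d, keypoints_2d, max_error_px):
    """Keep correspondences in front of the camera with a small initial error."""
    errors, in_front = reprojection_errors(K, T_cam_from_world, points_3d, keypoints_2d)
    return in_front & (errors < max_error_px)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points_3d = np.array([[0.0, 0.0, 5.0],    # projects to the image center
                      [1.0, 0.0, 5.0],    # projects to (420, 240)
                      [0.0, 0.0, -5.0]])  # behind the camera
keypoints_2d = np.array([[320.0, 240.0], [430.0, 240.0], [320.0, 240.0]])
keep = prefilter(K, np.eye(4), points_3d, keypoints_2d, max_error_px=5.0)
```

The same residual, divided by the keypoint noise, is what the second term of the bundle adjustment cost accumulates for the retained correspondences.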

4.3 Joint global refinement

Once all sequences are individually aligned, we refine them jointly by leveraging sequence-to-sequence visual observations. This is helpful when sequences observe parts of the scene not mapped by the lidar. We first triangulate a sparse 3D model from scan images, aided by the mesh. We then triangulate additional observations, and finally jointly optimize the whole problem.

Reference triangulation:  We estimate image correspondences of the reference scan using pairs selected according to the visual overlap defined in Sec. 4.1. Since the image poses are deemed accurate and fixed, we filter the correspondences using the known epipolar geometry. We first consider feature tracks consistent with the reference surface mesh before triangulating more noisy observations within LO-RANSAC using COLMAP [Schoenberger2016Structure]. The remaining feature detections, which could not be reliably matched or triangulated, are lifted to 3D by tracing through the mesh. This results in an accurate, sparse SfM model with tracks across reference images.

Sequence optimization:  We then add each sequence to the sparse model. We first establish correspondences between images of the same and of different sequences. The image pairs are again selected by highest visual overlap computed using the aligned poses. The resulting tracks are sequentially triangulated, merged, and added to the sparse model. Finally, all 3D points and poses are jointly optimized by minimizing the joint pose-graph and bundle adjustment (Eq. 4). As in COLMAP [Schoenberger2016Structure], we alternate optimization and track merging. To scale to large scenes, we subsample keyframes from the full frame-rate captures and only introduce absolute pose and reprojection constraints for keyframes while maintaining all relative pose constraints from tracking.

Figure 4: Uncertainty of the GT poses for the CAB scene. Left: The overhead map shows that the translation uncertainties are larger in long corridors and outdoor spaces. Right: Pairs of captured images (left) and renderings of the mesh at the estimated camera poses (right). They are pixel-aligned, which confirms that the poses are sufficiently accurate for our evaluation.

4.4 Ground-truth validation

Potential limits:  Brachmann et al. [brachmann2021limits] observe that algorithms generating pseudo-GT poses by minimizing either 2D or 3D cost functions alone can yield noticeably different results. We argue that there exists a single underlying, true GT. Reaching it requires fusing large amounts of redundant data with sufficient sensors of sufficiently low noise. Our GT poses optimize complementary constraints from visual and inertial measurements, guided by an accurate lidar-based 3D structure. Careful design and propagation of uncertainties reduces the bias towards one of the sensors. All sensors are factory- and self-calibrated during each recording by the respective commercial, production-grade SLAM algorithms. We do not claim that our GT is perfect but analyzing the optimization uncertainties sheds light on its degree of accuracy.

Pose uncertainty:  We estimate the uncertainties of the GT poses by inverting the Hessian of the refinement. To obtain calibrated covariances, we scale them by the empirical keypoint detection noise, estimated as pixels for the CAB scene. The maximum noise in translation is the size of the major axis of the uncertainty ellipsoids, which is the largest eigenvalue of the covariance matrices. Figure 4 shows its distribution for the CAB scene. We retain images whose poses are correct within cm with a confidence of %. For normally distributed errors, this corresponds to a maximum uncertainty and discards % of all frames. For visual inspection, we render images at the estimated GT camera poses using the colored mesh. They appear pixel-aligned with the original images, supporting that the poses are accurate. We provide additional visualizations in Appendix 0.B.
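The relation between an error bound, a confidence level, and a maximum admissible uncertainty can be sketched for a one-dimensional Gaussian error. This is a minimal sketch: the 10 cm bound, the 95% confidence, and the example covariances are placeholders and not the values used for the benchmark, and the major-axis length is taken here as the square root of the largest eigenvalue, i.e., a standard deviation.

```python
import numpy as np
from statistics import NormalDist

def max_translation_std(cov_t):
    """Standard deviation along the major axis of a 3x3 translation covariance."""
    return float(np.sqrt(np.linalg.eigvalsh(cov_t)[-1]))  # eigenvalues ascend

def sigma_limit(max_error, confidence):
    """Largest std for which a zero-mean 1D Gaussian error stays within
    +/- max_error with the given two-sided confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return max_error / z

limit = sigma_limit(max_error=0.10, confidence=0.95)  # placeholder values
covariances = [np.diag([1e-4, 1e-4, 4e-4]),  # std of 2 cm along the major axis
               np.diag([1e-2, 1e-4, 1e-4])]  # std of 10 cm along the major axis
keep = [max_translation_std(c) <= limit for c in covariances]
```

Frames whose major-axis uncertainty exceeds the limit would be dropped from the evaluation set.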

4.5 Selection of mapping and query sequences

We divide the set of sequences into two disjoint groups for mapping and localization. Mapping sequences are selected such that they have a minimal overlap between each other yet cover the area visited by all remaining sequences. This simulates a scenario of minimal coverage and maximizes the number of localization query sequences. We cast this as a combinatorial optimization problem solved with a depth-first search. We provide more details in Appendix 0.C.

5 Evaluation

We evaluate state-of-the-art approaches in both single-frame and sequence settings and summarize our results in Fig. 5. We build maps using both types of AR devices and evaluate the localization accuracy for 1000 randomly-selected queries of each device for each location. All results are averaged across all locations. Appendix 0.D provides more details about the distribution of the evaluation data.

Single-frame:  We first consider in Sec. 5.1 the classical academic setup of single-frame queries (single image for phones and single rig for HoloLens 2) without additional sensors. We then look at how radio signals can be beneficial. We also analyze the impact of various settings: FOV, type of mapping images, and mapping algorithm.

Sequence:  Second, by leveraging the real-time AR tracking poses, we consider the problem of sequence localization in Sec. 5.2. This corresponds to a real-world AR application retrieving the content attached to a target map using the real-time sensor stream from the device. In this context, we care not only about accuracy and recall but also about the time required to localize accurately, which we call the time-to-recall.

Figure 5: Main results. We show results for Fusion image retrieval with SuperPoint local features and SuperGlue matcher on both HoloLens 2 and phone queries. We consider several tracks: single-image / single-rig localization with / without radios and similarly for sequence (10 seconds) localization. In addition, we report the percentage of queries with at least 5% ground-truth overlap with respect to the best mapping image.

5.1 Single-frame localization

We first evaluate several algorithms representative of the state of the art in the classical single-frame academic setup. We consider the hierarchical localization framework with different approaches for image retrieval and matching. Each of them first builds a sparse SfM map from reference images. For each query frame, we then retrieve relevant reference images, match their local features, lift the reference keypoints to 3D using the sparse map, and finally estimate a pose with PnP+RANSAC. We report the recall of the final pose at two thresholds [sattler2018benchmarking]: 1) a fine threshold at (cm), which we see as the minimum accuracy required for a good AR user experience in most settings. 2) a coarse threshold at (m) to show the room for improvement for current approaches.
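As an illustration, the recall at a pair of rotation and translation thresholds could be computed as in the minimal sketch below. This is not the benchmark's evaluation code: the function name, the input format, and the example errors are hypothetical, and the rotation thresholds (1° and 5°) are assumed placeholder values. A failed localization is represented by `None` and counts as a miss.

```python
def recall_at(errors, max_rot_deg, max_trans_m):
    """Fraction of queries whose rotation AND translation errors are within the thresholds.

    errors: list of (rotation_error_deg, translation_error_m), or None for a failed query.
    """
    if not errors:
        return 0.0
    hits = sum(
        1
        for e in errors
        if e is not None and e[0] <= max_rot_deg and e[1] <= max_trans_m
    )
    return hits / len(errors)


# Hypothetical pose errors for five queries (the last one failed to localize).
errors = [(0.3, 0.04), (0.8, 0.09), (2.0, 0.50), (0.5, 0.30), None]

FINE = (1.0, 0.10)    # assumed: 1 degree, 10 cm
COARSE = (5.0, 1.00)  # assumed: 5 degrees, 1 m

fine_recall = recall_at(errors, *FINE)
coarse_recall = recall_at(errors, *COARSE)
print(f"recall: {100 * fine_recall:.1f} / {100 * coarse_recall:.1f}")
```

The two numbers are printed in the same "fine / coarse" layout as the entries of the result tables.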

We evaluate global descriptors computed by NetVLAD [arandjelovic2016netvlad] and by a fusion [humenberger2020robust] of NetVLAD and APGeM [apgem], which are representative of the field [pion2020benchmarking]. We retrieve the 10 most similar images. For matching, we evaluate handcrafted SIFT [lowe2004distinctive], SOSNet [tian2019sosnet] as a learned patch descriptor extracted from DoG [lowe2004distinctive] keypoints, and a robust deep-learning based joint detector and descriptor R2D2 [revaudr2d2]. Those are matched by exact mutual nearest neighbor search. We also evaluate SuperGlue [sarlin2020superglue], a learned matcher based on SuperPoint [detone2018superpoint] features. To build the map, we retrieve neighboring images filtered by frustum intersection from reference poses, match these pairs, and triangulate a sparse SfM model using COLMAP [Schoenberger2016Structure].

We report the results in Tab. 3 (left). Even the best methods have a large gap to perfect scores and much room for improvement. In the remaining ablation, we solely rely on SuperPoint+SuperGlue [detone2018superpoint, sarlin2020superglue] for matching as it clearly performs the best.

Hierarchical localization     Query device
Retrieval   Matching          HL2            Phone
NetVLAD     SIFT              48.3 / 63.7    38.0 / 54.8
NetVLAD     DoG+SOSNet        52.3 / 67.3    37.9 / 55.4
NetVLAD     R2D2              48.2 / 63.9    42.1 / 58.4
NetVLAD     SP+SG             59.9 / 73.0    50.1 / 63.3
Fusion      SIFT              51.2 / 67.9    38.5 / 56.9
Fusion      DoG+SOSNet        55.2 / 71.2    39.3 / 57.4
Fusion      R2D2              52.0 / 68.4    43.5 / 60.2
Fusion      SP+SG             64.2 / 77.4    52.2 / 65.8
Table 3: Left: single-frame localization. We report the recall at (cm)/(m) for baselines representative of the state of the art. Our dataset is challenging while most others are saturated. There is clear progress from SIFT but also large room for improvement. Right: localization with radio signals. Increasing the number {5, 10, 20} of retrieved images increases the localization recall at (cm). The best-performing visual retrieval (Fusion, orange) is however far worse than the GT overlap. Filtering with radio signals (blue) improves the performance in all settings.

Leveraging radio signals:  In this experiment, we show that radio signals can be used to constrain the search space for image retrieval. This has two main benefits: 1) it reduces the risk of incorrectly considering visual aliases, and 2) it lowers the compute requirements by reducing the number of images that need to be retrieved and matched. We implement this filtering as follows. We first split the scene into a sparse 3D grid considering only voxels containing at least one mapping frame. For each frame, we gather all radio signals in a s window and associate them to the corresponding voxel. If the same endpoint is observed multiple times in a given voxel, we average the received signal strengths (RSSI) in dBm. For a query frame, we similarly aggregate signals over the past 2s and rank voxels by their L2 distance between RSSIs, considering those with at least one common endpoint. We thus restrict image retrieval to of the map.
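The filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the function names, the voxel size, and the toy data are hypothetical, the temporal aggregation window is assumed to have been applied when building each frame's signal dictionary, and the L2 distance is computed only over the endpoints that a voxel and the query have in common.

```python
import math
from collections import defaultdict


def voxel_key(position, voxel_size):
    """Index of the voxel containing a 3D position."""
    return tuple(math.floor(c / voxel_size) for c in position)


def build_radio_map(mapping_frames, voxel_size):
    """mapping_frames: iterable of (position_xyz, {endpoint_id: rssi_dbm}).

    Returns {voxel: {endpoint_id: mean_rssi_dbm}}. The grid is sparse:
    only voxels containing at least one mapping frame are created.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for position, signals in mapping_frames:
        per_endpoint = sums[voxel_key(position, voxel_size)]
        for endpoint, rssi in signals.items():
            acc = per_endpoint[endpoint]
            acc[0] += rssi  # running sum of RSSI values in dBm
            acc[1] += 1     # number of observations
    return {
        voxel: {e: total / count for e, (total, count) in endpoints.items()}
        for voxel, endpoints in sums.items()
    }


def rank_voxels(radio_map, query_signals):
    """Rank voxels by L2 distance between RSSIs over common endpoints.

    Voxels without any endpoint in common with the query are skipped.
    """
    ranked = []
    for voxel, signals in radio_map.items():
        common = signals.keys() & query_signals.keys()
        if not common:
            continue
        dist = math.sqrt(sum((signals[e] - query_signals[e]) ** 2 for e in common))
        ranked.append((dist, voxel))
    ranked.sort()
    return [voxel for _, voxel in ranked]


# Hypothetical mapping frames: (position in meters, aggregated radio signals).
mapping_frames = [
    ((0.5, 0.5, 0.0), {"ap1": -40.0, "ap2": -70.0}),
    ((0.7, 0.2, 0.0), {"ap1": -50.0}),
    ((5.5, 0.5, 0.0), {"ap1": -80.0, "ap2": -45.0}),
    ((9.5, 9.5, 0.0), {"ap3": -60.0}),
]
radio_map = build_radio_map(mapping_frames, voxel_size=2.0)
ranking = rank_voxels(radio_map, {"ap1": -48.0, "ap2": -72.0})
print(ranking)
```

Image retrieval would then be restricted to the mapping images that fall in the top-ranked voxels.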

Table 3 (right) shows that radio filtering always improves the localization accuracy over vanilla vision-only retrieval, irrespective of how many images are matched. The upper bound based on the GT overlap, defined in Sec. 4.1, shows that there is still much room for improvement for both image and radio retrieval. As the GT overlap baseline is far from the perfect 100% recall, frame-to-frame matching and pose estimation also have much room to improve.

Varying field-of-view:  We study the impact of the FOV of the HoloLens 2 device via two configurations: 1) Each camera in a rig is seen as a single-frame and localized using LO-RANSAC + P3P. 2) We consider all four cameras in a frame and localize them together using the generalized solver GP3P. With fusion retrieval, SuperPoint, and SuperGlue, single images (1) only achieve 45.6% / 61.3% recall, while using rigs (2) yields 64.2% / 77.4% (Tab. 3). Rig localization is thus highly beneficial, especially in hard cases where single cameras face texture-less areas, such as the ground and walls.

Mapping modality:  We study whether the high-quality lidar mesh can be used for localization. We consider two approaches to obtain a sparse 3D point cloud: 1) By triangulating sparse visual correspondences across multiple views. 2) By lifting 2D keypoints in reference images to 3D by tracing rays through the mesh. Lifting can leverage dense correspondences, which cannot be efficiently triangulated with conventional multi-view geometry. We thus compare 1) and 2) with SuperGlue to 2) with LoFTR [sun2021loftr], a state-of-the-art dense matcher. The results in Tab. 4 (right) show that the mesh brings some improvements. Points could also be lifted by dense depth from multi-view stereo. We however did not obtain satisfactory results with a state-of-the-art approach [wang2020patchmatchnet] as it cannot handle very sparse mapping images.
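To make the lifting step concrete, the sketch below casts the ray of a 2D keypoint through a pinhole camera and intersects it with mesh triangles using the standard Möller-Trumbore test. It is an illustration only: the function names, the pinhole model without distortion, the brute-force loop over triangles, and the toy camera and triangle are all assumptions, and a practical implementation would rely on an accelerated ray caster.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: ray parameter t of the hit, or None if there is none."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:  # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = _sub(origin, v0)
    u = _dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = _dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(e2, q) * inv
    return t if t > eps else None  # keep only hits in front of the camera


def lift_keypoint(keypoint, intrinsics, rotation_c2w, center, triangles):
    """Lift a 2D keypoint to the closest 3D point on the mesh, or None if no hit."""
    fx, fy, cx, cy = intrinsics
    d_cam = ((keypoint[0] - cx) / fx, (keypoint[1] - cy) / fy, 1.0)
    d_world = tuple(
        sum(rotation_c2w[r][c] * d_cam[c] for c in range(3)) for r in range(3)
    )
    hits = [ray_triangle(center, d_world, *tri) for tri in triangles]
    hits = [t for t in hits if t is not None]
    if not hits:
        return None
    t = min(hits)
    return tuple(center[i] + t * d_world[i] for i in range(3))


# Hypothetical camera at the origin looking along +z, and one triangle at z = 2.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy
mesh = [((-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (0.0, 1.0, 2.0))]

point = lift_keypoint((320.0, 240.0), intrinsics, identity, (0.0, 0.0, 0.0), mesh)
miss = lift_keypoint((820.0, 240.0), intrinsics, identity, (0.0, 0.0, 0.0), mesh)
print(point, miss)
```

Unlike triangulation, this assigns a 3D point to every keypoint whose ray hits the mesh, whether or not the keypoint is matched across several views.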

Mapping scenario:  We study the accuracy of localization against maps built from different types of images: 1) crowd-sourced, dense AR sequences; 2) curated, sparser HD 360 images from the NavVis device; 3) a combination of the two. The results are summarized in Tab. 4 (left), showing that the mapping scenario has a large impact on the final numbers. On the other hand, image pair selection for mapping matters little. Crowd-sourcing and manual scans can complement each other well to address an imperfect scene coverage. We hope that future work can close the gap between the scenarios to achieve better metrics from crowd-sourced data without curation.

Mapping images                 HL2 + Phone                       HD 360              Both
Image pairs from               Retrieval + Poses   GT overlap    Retrieval + Poses   Retrieval + Poses
Matching    Device
SP + SG     HL2                64.2 / 77.4         64.2 / 77.3   70.1 / 83.6         64.1 / 77.5
SP + SG     Phone              52.2 / 65.8         52.9 / 66.3   47.4 / 64.9         60.6 / 72.1
Table 4: Impact of mapping. Left: Scenarios. Building the map with HD 360 images from NavVis scanners, instead of or with dense AR sequences, does not consistently boost the performance as they are usually sparser, do not fully cover each location, and have different characteristics than AR images. Right: Modalities. Lifting 2D points to 3D using the lidar mesh instead of triangulating with SfM is beneficial. This can also leverage dense matching, e.g. with LoFTR.

5.2 Sequence localization

In this section, inspired by typical AR use cases, we consider the problem of sequence localization. The task is to align multiple consecutive frames using sensor data aggregated over short time intervals. Our baseline for this task is based on the ground-truthing pipeline and has as such relatively high compute requirements. However, we are primarily interested in demonstrating the potential performance gains by leveraging multiple frames. First, we run image retrieval and single-frame localization, followed by a first PGO with tracking and localization poses. Then, we do a second localization with retrieval guided by the poses of the first PGO, followed by a second PGO. Finally, we run a pose refinement by considering reprojections to query frames and tracking cost. We can also use radio signals to restrict image retrieval throughout the pipeline. As previously, we consider the localization recall but only of the last frame in each sequence, which is the one that influences the current AR user experience in a real-time scenario.

Device   Radios?   TTR@70%   TTR@80%   TTR@90%
HL2      no        <1s       >10s      >10s
HL2      yes       <1s       1.40s     >10s
Phone    no        >10s      >10s      >10s
Phone    yes       3.58s     >10s      >10s
Figure 6: Sequence localization. We report the localization recall at (cm) of SuperPoint features with SuperGlue matcher as we increase the duration of each sequence. The pipeline leverages both on-device tracking and absolute localization, as vision-only (solid) or combined with radio signals (dashed). We show the time-to-recall (TTR) at 80% for HL2 and at 70% for phone queries. Using radio signals reduces the TTR from over 10s to 1.40s and 3.58s, respectively.

We evaluate various query durations and introduce the time-to-recall metric as the sequence length (time) required to successfully localize X% (recall) of the queries within (1, 10cm), or, in short, TTR@X%. Localization algorithms should aim to minimize this metric to render retrieved content as quickly as possible after starting an AR experience. Figure 6 reports the results averaged over all locations. While the performance of current methods is not satisfactory yet to achieve a TTR@90% under 10 seconds, using sequence localization leads to significant gains of 20%. The radio signals improve the performance in particular with shorter sequences and thus effectively reduce the time-to-recall.
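As an illustration of the metric, TTR@X% can be read off a recall-versus-duration curve as in the sketch below. The function name and the sample curve are hypothetical, and the linear interpolation between evaluated durations is an assumption rather than a description of how the reported values were obtained.

```python
def time_to_recall(durations_s, recalls, target):
    """Shortest sequence duration at which the recall reaches `target`.

    The curve is linearly interpolated between evaluated durations.
    Returns None if the target is never reached.
    """
    previous = None
    for duration, recall in sorted(zip(durations_s, recalls)):
        if recall >= target:
            if previous is None:
                return duration  # already reached at the shortest duration
            d0, r0 = previous
            return d0 + (target - r0) * (duration - d0) / (recall - r0)
        previous = (duration, recall)
    return None


# Hypothetical recall curve: recall at the fine threshold for increasing durations.
durations = [1.0, 2.0, 5.0, 10.0]
recalls = [0.60, 0.68, 0.74, 0.79]

ttr_70 = time_to_recall(durations, recalls, 0.70)
ttr_90 = time_to_recall(durations, recalls, 0.90)
print(ttr_70, ttr_90)
```

A result of `None` corresponds to the ">10s" entries when the longest evaluated duration is 10 seconds.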

6 Conclusion

LaMAR is the first benchmark that faithfully captures the challenges and opportunities of AR for visual localization and mapping. We first identified several key limitations of current benchmarks that make them unrealistic for AR. To address these limitations, we developed a new ground-truthing pipeline to accurately and robustly register AR sensor streams in large and diverse scenes aided by laser scans without any manual labelling or custom infrastructure. With this new benchmark, initially covering 3 large locations, we revisited the traditional academic setup and showed a large performance gap for existing state-of-the-art methods when evaluated using more realistic and challenging data.

We implemented simple yet representative baselines to take advantage of the AR-specific setup and we presented new insights that pave promising avenues for future works. We showed the large potential of leveraging other sensor modalities like radio signals, depth, or query sequences instead of single images. We also hope to direct the attention of the community towards improving map representations for crowd-sourced data and towards considering the time-to-recall metric, which is currently largely ignored. We publicly release at the complete LaMAR dataset, our ground-truthing pipeline, and the implementation of all baselines. The evaluation server and public leaderboard facilitate the benchmarking of new approaches to keep track of the state of the art. We hope this will spark future research addressing the challenges of AR.

Acknowledgements.  LaMAR would not have been possible without the hard work and contributions of Gabriela Evrova, Silvano Galliani, Michael Baumgartner, Cedric Cagniart, Jeffrey Delmerico, Jonas Hein, Dawid Jeczmionek, Mirlan Karimov, Maximilian Mews, Patrick Misteli, Juan Nieto, Sònia Batllori Pallarès, Rémi Pautrat, Songyou Peng, Iago Suarez, Rui Wang, Jeremy Wanner, Silvan Weder and our colleagues in CVG at ETH Zurich and the wider Microsoft Mixed Reality & AI team.


Appendix 0.A Visualizations

Diversity of devices:  We show in Fig. 7 some samples of images captured in the HGE location. NavVis and phone images are colored while HoloLens2 images are grayscale. NavVis images are always perfectly upright, while the viewpoint and height of HoloLens2 and phone images vary significantly. Despite the automatic exposure, phone images easily appear dark in night-time low-light conditions.

Diversity of environments:  We show an extensive overview of the three locations CAB, HGE, and LIN in Figures 8, 9, and 10, respectively. In each image, we show a rendering of the lidar mesh along with the ground truth trajectories of a few sequences.

Figure 7: Sample of images from the different devices: NavVis M6, HoloLens2, phone. Each column shows a different scene of the HGE location with large illumination changes.
Figure 8: The CAB location features 1-2) a staircase spanning 5 similar-looking floors, 3) large and small offices and meeting rooms, 4) long corridors, 5) large halls, and 6) outdoor areas with repeated structures. This location includes the Facade, Courtyard, Lounge, Old Computer, Storage Room, and Office scenes of the ETH3D [schops2017multi] dataset and is thus much larger than each of them.
Figure 9: The HGE location features a highly-symmetric building with 1-2) hallways, 3) long corridors, 4) two esplanades, and 5) a section of sidewalk. This location includes the Relief, Door, and Statue scenes of the ETH3D [schops2017multi] dataset.
Figure 10: The LIN location features large outdoor open spaces (top row), narrow passages with stairs (middle row), and both residential and commercial street-level facades.

Long-term changes:  Because spaces are actively used and managed, they undergo significant appearance and structural changes over the year-long data recording. This is captured by the laser scans, which are aligned based on elements that do not change, such as the structure of the buildings. We show in Fig. 11 a visual comparison between scans captured at different points in time.

Figure 11: Long-term structural changes. Lidar point clouds captured over a year reveal the geometric changes that spaces undergo at different time scales: 1) very rarely (construction work), 2-4) sparsely (displacement of furniture), or even 5-6) daily due to regular usage (people, objects).

Appendix 0.B Uncertainties of the ground truth

We show overhead maps and histograms of uncertainties for all scenes in Fig. 12. We also show additional rendering comparisons in Fig. 13. Since we do not use the mesh for any color-accurate tasks (e.g., photometric alignment), we use a simple vertex-coloring based on the NavVis colored lidar point cloud. The renderings are therefore not realistic but nevertheless allow an inspection of the final alignment. The proposed ground-truthing pipeline yields poses that allow pixel-accurate rendering.

Figure 12: Translation uncertainties of the ground truth camera centers for the CAB (top), LIN (middle) and HGE (bottom) scenes. Left: The overhead map shows that the uncertainties are larger in areas that are not well covered by the 3D scanners or where the scene is further away from the camera, such as in long corridors and large outdoor spaces. Right: The histogram of uncertainties shows that most images have an uncertainty far lower than .
Figure 13: Qualitative renderings from the mesh. Using the estimated ground-truth poses, we render images from the vertex-colored mesh (right) and compare them to the originals (left). The first two rows show six HoloLens 2 images while the next two show six phone images. We overlay a regular grid to facilitate the comparison. The bottom row shows 2x2 mosaics alternating between originals (top-left, bottom-right) and renders (top-right, bottom-left). Best viewed when zoomed in.

Appendix 0.C Selection of mapping and query sequences

We now describe in more detail the algorithm that automatically selects mapping and query sequences, whose distributions are shown in Fig. 14.

The coverage is a boolean that indicates whether the image of sequence shares sufficient covisibility with at least one image in sequence . Here two images are deemed covisible if they co-observe a sufficient number of 3D points in the final, full SfM sparse model [radenovic2018fine] or according to the ground truth mesh-based visual overlap. The coverage of sequence with a set of other sequences is the ratio of images in that are covered by at least one image in :

Figure 14: Spatial distribution of AR sequences for the CAB (top), HGE (middle), and LIN (bottom) locations. We show the ground truth trajectories overlaid on the lidar point clouds along 3 orthogonal directions. All axes are in meters and is aligned with the gravity. Left: Types of AR devices among all registered sequences. Right: Map and query sequences selected for evaluation. CAB spans multiple floors while HGE and LIN are mostly 2D but include a range of ground heights. The space is well covered by both types of devices and sequences.

We seek to find the set of mapping sequences and remaining query sequences that minimize the coverage between map sequences while ensuring that each query is sufficiently covered by the map:


where is the minimum query coverage. We ensure that query sequences are out of coverage for at most consecutive seconds, where can be tuned to adjust the difficulty of the localization and generally varies from 1 to 5 seconds. This problem is combinatorial and without exact solution. We solve it approximately with a depth-first search that iteratively adds new images and checks for the feasibility of the solution. At each step, we consider the query sequences that are the least covisible with the current map.
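The selection problem can be illustrated with the toy sketch below. It is not the released algorithm: it enumerates subsets exhaustively by depth-first search instead of using the least-covisible-first heuristic, it stops extending a subset as soon as it is feasible (an assumption that favors keeping more query sequences), and it ignores the limit on consecutive out-of-coverage seconds. All names, the covisibility sets, and the threshold value are hypothetical.

```python
def coverage(seq, others, covered, num_images):
    """Ratio of images of `seq` covered by at least one image of the sequences in `others`."""
    hit = set()
    for other in others:
        if other != seq:
            hit |= covered.get((seq, other), set())
    return len(hit) / num_images[seq]


def map_overlap(mapping, covered, num_images):
    """Mean coverage of each mapping sequence by the other mapping sequences."""
    if len(mapping) < 2:
        return 0.0
    return sum(coverage(s, mapping, covered, num_images) for s in mapping) / len(mapping)


def is_feasible(mapping, sequences, covered, num_images, min_query_coverage):
    """True if every remaining (query) sequence is sufficiently covered by the map."""
    return all(
        coverage(q, mapping, covered, num_images) >= min_query_coverage
        for q in sequences
        if q not in mapping
    )


def select_mapping(sequences, covered, num_images, min_query_coverage):
    """Depth-first search over subsets; returns (overlap, mapping) of the best feasible subset."""
    best = [float("inf"), None]

    def dfs(mapping, start):
        if mapping and is_feasible(mapping, sequences, covered, num_images, min_query_coverage):
            cost = map_overlap(mapping, covered, num_images)
            if cost < best[0]:
                best[0], best[1] = cost, mapping
            return  # assumption: do not grow a map that is already feasible
        for idx in range(start, len(sequences)):
            dfs(mapping | {sequences[idx]}, idx + 1)

    dfs(frozenset(), 0)
    return best[0], best[1]


# Hypothetical example: covered[(i, j)] holds the image indices of sequence i
# that are covisible with at least one image of sequence j.
sequences = ["A", "B", "C", "D"]
num_images = {s: 10 for s in sequences}
covered = {
    ("A", "B"): {0}, ("B", "A"): {0},
    ("A", "C"): set(range(2, 10)), ("C", "A"): set(range(8)),
    ("B", "D"): set(range(1, 10)), ("D", "B"): set(range(9)),
}
overlap, mapping = select_mapping(sequences, covered, num_images, min_query_coverage=0.7)
print(overlap, sorted(mapping))
```

In this toy case, several two-sequence maps with zero mutual overlap cover all remaining sequences, and the search returns the first one it encounters.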

Appendix 0.D Data distribution

We show in Fig. 14 (left) the spatial distribution of the registered sequences for the two device types HoloLens2 and phone.

We select a subset of all registered sequences for evaluation and split it into mapping and localization groups according to the algorithm described in Appendix 0.C. The spatial distribution of these groups is shown in Fig. 14 (right). We enforce that night-time sequences are not included in the map, which is a realistic assumption for crowd-sourced scenarios. We do not enforce an equal distribution of device types in either group but observe that this occurs naturally. For the evaluation, mapping images are sampled at intervals of at most 2.5 FPS, 50 cm of distance, and 20° of rotation. This ensures a sufficient covisibility between subsequent frames while reducing the computational cost of creating maps. The queries are sampled every 1s/1m/20° and, for each device type, 1000 poses are selected out of those with sufficiently low uncertainty.
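One plausible reading of such a sampling rule is sketched below for the query setting (1s/1m/20°): a frame is kept as soon as the elapsed time, the traveled distance, or the rotation since the last kept frame reaches its limit. This is a hedged illustration, not the released code. The function name and the toy trajectory are hypothetical, and the rotation is reduced to a single heading angle for brevity, whereas a full implementation would compare 3D orientations.

```python
import math


def sample_frames(frames, max_dt_s=1.0, max_dist_m=1.0, max_rot_deg=20.0):
    """Select frame indices so that consecutive selections differ by at least
    one of the limits in time, distance, or rotation.

    frames: list of (timestamp_s, position_xyz, heading_deg), sorted by time.
    """
    selected = []
    last = None
    for idx, (t, position, heading) in enumerate(frames):
        if last is not None:
            last_t, last_position, last_heading = last
            dt = t - last_t
            dist = math.dist(position, last_position)
            # Smallest absolute difference between two angles, in [0, 180].
            rot = abs((heading - last_heading + 180.0) % 360.0 - 180.0)
            if dt < max_dt_s and dist < max_dist_m and rot < max_rot_deg:
                continue  # too close to the last selected frame in every respect
        selected.append(idx)
        last = (t, position, heading)
    return selected


# Hypothetical trajectory: (timestamp in s, position in m, heading in degrees).
frames = [
    (0.0, (0.0, 0.0, 0.0), 0.0),
    (0.4, (0.1, 0.0, 0.0), 5.0),
    (0.8, (0.2, 0.0, 0.0), 25.0),  # rotated by more than 20 degrees
    (1.2, (0.3, 0.0, 0.0), 27.0),
    (1.9, (0.4, 0.0, 0.0), 30.0),  # more than 1 s since the last selection
    (2.0, (1.5, 0.0, 0.0), 30.0),  # moved by more than 1 m
]
selected = sample_frames(frames)
print(selected)
```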

Appendix 0.E Additional evaluation results

0.E.1 Impact of the condition and environment

We now investigate the impact of different capture conditions (day vs night) and environment (indoor vs outdoor) of the query images. Query sequences are labeled as day or night based on the time and date of capture. We manually annotate overhead maps into indoor and outdoor areas. We report the results for single-image localization of phone images in Tab. 5.

In regular day-time conditions, outdoor areas exhibit distinctive texture and are thus easier to coarsely localize in than texture-less, repetitive indoor areas. The scene structure is however generally further away from the camera, so optimizing reprojection errors yields less accurate camera poses.

Indoor scenes generally benefit from artificial light and are thus minimally affected by the night-time drop of natural light. Outdoor scenes benefit from little artificial light, mostly due to sparse street lighting, and thus widely change in appearance between day and night. As a result, the localization performance drops to a larger extent outdoors than indoors.

         CAB scene                    HGE scene                    LIN scene
         Indoor        Outdoor        Indoor        Outdoor        Outdoor
day      66.5 / 74.7   73.9 / 88.1    52.7 / 65.9   43.0 / 64.3    71.2 / 82.5
night    30.3 / 44.8   18.8 / 40.6    47.9 / 59.4   12.1 / 33.6    38.6 / 55.6
Table 5: Impact of the condition and environment on single-image phone localization. During the day, localizing indoors can be more accurate (10cm threshold) but less robust (1m threshold) than outdoors due to visual aliasing and a lack of texture. Night-time localization is more challenging outdoors than indoors because of a larger drop of illumination.

0.E.2 Additional results on sequence localization

We run a detailed ablation of the sequence localization on an extended set of queries and report our findings below.

Ablation:  We ablate the different parts of our proposed sequence localization pipeline on sequences of 20 seconds. The localization recall at {cm} and {m} can be seen in Tab. 6 for both HoloLens 2 and Phone queries. The initial PGO with tracking and absolute constraints already offers a significant boost in performance compared to single-frame localization. We notice that the re-localization with image retrieval guided by the PGO poses achieves much better results than the first localization, which points to retrieval, not feature matching, being the bottleneck. Next, the second PGO is able to leverage the improved absolute constraints and yields better results. Finally, the pose refinement optimizing reprojection errors while also taking into account tracking constraints further improves the performance, notably at the tighter threshold.

                  Steps
Device   Radios   Loc.          Init.         PGO1          Re-loc.       PGO2          BA
HL2      no       66.0 / 79.9   66.1 / 92.5   71.8 / 92.4   74.2 / 88.0   74.9 / 92.5   79.3 / 92.8
HL2      yes      67.7 / 82.3   66.4 / 94.5   74.3 / 94.3   76.2 / 90.1   76.7 / 94.4   81.6 / 94.9
Phone    no       54.2 / 65.5   52.4 / 88.0   62.7 / 87.7   61.8 / 77.4   66.1 / 88.4   69.0 / 88.6
Phone    yes      56.7 / 71.5   54.1 / 90.2   64.4 / 89.8   63.1 / 79.5   66.9 / 90.1   71.0 / 90.2
Table 6: Ablation of the sequence localization. We report recall for the different steps of the sequence localization pipeline for 10s sequences on the CAB location. The second localization, guided by the poses of the first PGO, drastically improves over the initial localization, especially when no radio signals are used. The final pose refinement optimizing reprojection errors while also taking into account tracking constraints offers a significant boost for the tighter threshold.

Appendix 0.F Phone capture application

We wrote a simple iOS Swift application that saves the data exposed by the ARKit, CoreBluetooth, CoreMotion, and CoreLocation services. The user can adjust the frame rate prior to capturing. The interface displays the current input image and the trajectory estimated by ARKit as an AR overlay. It also displays the amount of free disk space on the device, the recording time, the number of captured frames, and the status of the ARKit tracking and mapping algorithms. The data storage is optimized such that a single device can capture hours of data without running out of space. After capture, the data can be inspected on-device and shared over AirDrop or cloud storage. Screenshots of the user interface are shown in Fig. 15.

Figure 15: iOS capture application.

Appendix 0.G Implementation details

Scan-to-scan alignment:  The pairwise alignment is initialized by matching the most similar images and running 3D-3D RANSAC with a threshold . The ICP refinement searches for correspondences within a cm radius.

Sequence-to-scan alignment:  The single-image localization is only performed for keyframes, which are selected every second, ° of rotation, or meter of traveled distance. In RANSAC, the inlier threshold depends on the detection noise and thus on the image resolution: for PnP and GPnP, as camera rigs are better constrained. Poses with fewer than inliers are discarded. In the rigid alignment, inliers are selected for pose errors lower than . Sequences with fewer than 4 inliers are discarded. When optimizing the pose graph, we apply the arctan robust cost function to the absolute pose term, with a scale of after covariance whitening. In the bundle adjustment, reprojection errors larger than px are discarded at initialization. A robust Huber loss function is applied to the remaining ones with a scale of after covariance whitening.

Radio transfer:  As mentioned in the main paper, currently, Apple devices only expose partial radio signals. Notably, WiFi signals cannot be recovered, Bluetooth beacons are not exposed, and the remaining Bluetooth signals are anonymized. This makes it impossible to match them to those recovered by other devices (e.g., HoloLens, NavVis). To show the potential benefit of exposing these radios, we implement a simple radio transfer algorithm from HoloLens 2 devices to phones. First, we estimate the location of each HoloLens radio detection by linearly interpolating the poses of temporally adjacent frames. For each phone query, we aggregate all radios within at most 3m in any direction and 1.5m vertically (to avoid cross-floor transfer) with respect to the ground-truth pose. If the same radio signal is observed multiple times in this radius, we only keep the spatially closest 5 detections. The final RSSI estimate for each radio signal is then obtained by a distance-based weighted-average of these observations. Note that this step is done on the raw data, after the alignment. Thus, we can safely release radios for phone queries without divulging the ground-truth poses.
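The transfer procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the function names and toy data are hypothetical, only positions (not full poses) are interpolated, the third coordinate is assumed to be the gravity-aligned vertical axis, the 3m limit is applied to the Euclidean distance, and the weights are assumed to be inverse distances.

```python
import math


def interpolate_position(t, t0, p0, t1, p1):
    """Linearly interpolate a position at time t between two timestamped positions."""
    alpha = (t - t0) / (t1 - t0)
    return tuple(a + alpha * (b - a) for a, b in zip(p0, p1))


def transfer_radios(query_position, detections, max_dist=3.0, max_dz=1.5, k=5, eps=1e-6):
    """Estimate RSSI values at a phone query from localized HoloLens detections.

    detections: list of (radio_id, position_xyz, rssi_dbm).
    Returns {radio_id: estimated_rssi_dbm}.
    """
    nearby = {}
    for radio_id, position, rssi in detections:
        if abs(position[2] - query_position[2]) > max_dz:
            continue  # likely on another floor
        dist = math.dist(position, query_position)
        if dist > max_dist:
            continue
        nearby.setdefault(radio_id, []).append((dist, rssi))

    estimates = {}
    for radio_id, observations in nearby.items():
        closest = sorted(observations)[:k]  # keep the k spatially closest detections
        weights = [1.0 / (dist + eps) for dist, _ in closest]
        weighted = sum(w * rssi for w, (_, rssi) in zip(weights, closest))
        estimates[radio_id] = weighted / sum(weights)
    return estimates


# Hypothetical HoloLens detection localized between two frames.
detection_position = interpolate_position(0.5, 0.0, (0.0, 0.0, 0.0), 1.0, (2.0, 0.0, 0.0))

detections = [
    ("ap1", detection_position, -50.0),  # 1 m away from the query
    ("ap1", (2.0, 0.0, 0.0), -60.0),     # 2 m away
    ("ap1", (0.0, 0.0, 2.0), -30.0),     # rejected: 2 m above the query
    ("ap2", (4.0, 0.0, 0.0), -70.0),     # rejected: farther than 3 m
]
estimates = transfer_radios((0.0, 0.0, 0.0), detections)
print(estimates)
```

With inverse-distance weights, the detection at 1 m counts twice as much as the one at 2 m, so the estimate for `ap1` is close to -53.3 dBm.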