1 Related Work
The literature in multi-agent perception systems distinguishes between centralized and distributed architectures.
One of the first works to tackle collaborative SLAM in a fully decentralized manner was DDF-SAM [cunningham2013ddf], evaluating robotic collaboration in a simulated setup using visual, inertial and GPS data.
Choudhary et al. [choudhary2017:IJRR:distributed] show a decentralized SLAM system with pre-trained objects enabling co-localization of the participating robots.
Cieslewski et al. [cieslewski2018:ICRA:data] combine this optimization approach with an efficient and scalable distributed solution for place recognition.
Most recently, Lajoie et al. [lajoie2020:RAL:door] and Chang et al. [chang2020:ARXIV:kimera] proposed systems for distributed SLAM both using a distributed PGO scheme, demonstrating superior performance compared to the Gauss-Seidel approach of [choudhary2017:IJRR:distributed].
While this architecture enables a wide range of applications and scales well to large numbers of agents, guaranteeing data consistency and avoiding double-counting of information are its biggest challenges. Centralized systems, in contrast, manage information in a more straightforward way and usually exhibit significantly higher accuracy [campos2021:TRO:orb3, karrer2018cvi, qin2018:TRO:vins] than state-of-the-art distributed SLAM approaches, such as [chang2020:ARXIV:kimera, lajoie2020:RAL:door].
Zou and Tan introduced CoSLAM [zou2013coslam], a powerful vision-only collaborative SLAM system that groups cameras with scene overlap in order to handle dynamic environments. Forster et al. [forster2013collaborative] demonstrated collaboration of up to three UAVs by extending an SfM pipeline to collaborative SLAM. With C²TAM [riazuelo2014c], Riazuelo et al. proposed a multi-agent system performing only position tracking onboard each agent, while all mapping tasks are offloaded to the server. This enables agents to cope with very limited computational resources, but also heavily restricts each agent’s autonomy. CCM-SLAM [schmuck2018:JFR:ccm] makes efficient use of the server by offloading computationally expensive tasks, while still ensuring each agent’s autonomy at low computational resource requirements by running a visual odometry system onboard. CVI-SLAM [karrer2018cvi] extends the approach of [schmuck2018:JFR:ccm] to a visual-inertial setup, enabling higher accuracy as well as metric scale estimation and gravity alignment of the collaborative SLAM estimate, demonstrated with real data from up to four agents. However, while it was the first full VI collaborative SLAM system with two-way communication, CVI-SLAM has practical limitations, such as limited flexibility, e.g. in terms of interfacing with custom VIO front-ends, and limited scalability to larger teams.
The ability to leverage VIO to enable AR experiences with mobile devices has been shown by multiple works over the last years, such as [li2017:ISMAR:monocular, qin2018:TRO:vins]. Just recently, Platinsky et al. [platinsky2020:ISMAR:collaborative] have demonstrated a pipeline supporting city-scale shared augmented reality experiences on mobile devices. However, their approach relies on time-consuming and costly preparatory work, comprising extensive data collection using a car-based platform and offline map generation. On the other hand, the concept of spatial anchors can enable ad hoc multi-user AR, through co-localization with respect to the same anchor. However, this requires at least coarse prior knowledge of the location of the users, for example knowing all devices are in the same room, and does not directly give individual agents the ability to re-use maps created by other agents. Collaborative SLAM can bridge the gap between these two approaches, enabling shared AR experience in larger environments, such as entire factory halls or department stores in an ad hoc fashion, only using the AR devices’ built-in sensors and without pre-mapping.
Multi-agent global collaborative estimates similar to those from collaborative SLAM can also be achieved by recent SLAM systems with multi-session capabilities, such as [campos2021:TRO:orb3, qin2018:TRO:vins]. The ability to re-use SLAM maps created in a previous run enables multi-session SLAM systems to achieve impressive levels of robustness [qin2018:TRO:vins] and accuracy [campos2021:TRO:orb3], outperforming the single-session case. However, only one agent at a time can be active in multi-session SLAM, which heavily restricts the level of collaboration amongst agents. Moreover, because all parts of these multi-session SLAM systems run onboard the same computing unit, information and computational load cannot be offloaded to a powerful server.
COVINS extends the well-established architecture for collaborative SLAM deployed in [karrer2018cvi] towards a more flexible and efficient setup. Agents can connect to the system on-the-fly, the number of agents does not need to be known a priori, and a generic communication interface allows the server back-end to be interfaced with different VIO systems. More efficient map management and optimization schemes and a state-of-the-art redundancy detection scheme translate into improved accuracy and better scalability. This allows us to demonstrate collaborative SLAM with 12 agents, while, to the best of our knowledge, other recent collaborative [forster2013collaborative, karrer2018cvi, schmuck2018:JFR:ccm] and multi-session [campos2021:TRO:orb3, qin2018:TRO:vins] SLAM systems with a comparable accuracy level have so far been tested with no more than five agents.
2.1 Notation, IMU Model and System States
In this work, we adopt the mathematical notation from [karrer2018cvi]. Small (e.g. $\mathbf{a}$) and large (e.g. $\mathbf{A}$) bold letters denote vectors and matrices, respectively. Coordinate frames are denoted as plain capital letters (e.g. $A$). For a vector $\mathbf{x}$ expressed in $A$, the notation ${}_{A}\mathbf{x}$ is used. A rigid body transformation from frame $B$ to $A$ is denoted as $\mathbf{T}_{AB}$, with $\mathbf{p}_{AB}$ and $\mathbf{q}_{AB}$ denoting the translational and rotational part, respectively. Throughout this work we denote the world frame as $W$, the IMU body frame as $S$ and the camera frame as $C$.
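In order to incorporate IMU information into COVINS, we model the IMU using a standard model, assuming additive Gaussian noise and unknown, time-varying sensor bias (cf. [covins:supp]). To account for this IMU model, the system state includes, besides the KF poses and LM positions, also the linear velocities ${}_{W}\mathbf{v}$ as well as the bias variables $\mathbf{b}_a$ and $\mathbf{b}_g$ for each KF $k$:
$$\Theta := \left\{ \mathbf{T}_{WS}^{k},\ {}_{W}\mathbf{v}^{k},\ \mathbf{b}_a^{k},\ \mathbf{b}_g^{k},\ {}_{W}\mathbf{l}^{i} \;\middle|\; k \in \mathcal{V},\ i \in \mathcal{L} \right\} \tag{1}$$
where the sets $\mathcal{V}$ and $\mathcal{L}$ denote the set of all KF and LM, respectively. In the following, whenever the context allows for it, we use $\theta$ to denote an individual state variable.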
3.1 System Overview
The system architecture of COVINS is illustrated in Fig. 2. Running a VIO front-end that maintains a local map of limited size onboard each agent ensures basic autonomy of the individual agent. At the same time, global map maintenance and computation-heavy processes are transferred to the more powerful server. This underlying architectural principle was first introduced in [schmuck2017multiuav] and is extended in this work towards a more flexible and efficient handling and maintenance of collaborative map data on the server. On both the agent and the server side, we deploy a communication module for data exchange. The communication module establishes a peer-to-peer (p2p) connection, allowing the server to run on a locally deployed computer as well as on a remote cloud service, as demonstrated in Sec. 4, and removes the previous ROS dependency of the communication module in [karrer2018cvi]. COVINS implements and exports a generic communication interface, providing the freedom to interface it with any custom-built keyframe-based VIO system in order to enable collaboration amongst multiple agents. At the core of the server modules is a map manager, which controls access to the global map data present in the system. It maintains this map data in one or multiple maps, as well as a KF database for efficient place recognition. Moreover, it provides algorithms to merge maps once overlap is detected and the functionality to remove redundant KF, altogether facilitating data routing compared to [karrer2018cvi]. Place recognition modules process all incoming KF from the agents to detect visual overlap between re-visited parts of the environment. As opposed to [karrer2018cvi], COVINS does not distinguish between different place recognition modules for loop closure and map fusion, so a single place recognition query for a KF serves both purposes, reducing workload and system complexity. The server furthermore provides optimization routines, namely PGO and GBA.
In contrast to [karrer2018cvi], COVINS implements an optimization strategy regularly performing PGO, while executing GBA to further refine maps less frequently, in order to better balance restricted map access due to ongoing optimization with desired high map accuracy. In addition, the server provides an external interface, allowing a user to interact with the system.
3.2 Map Structure
The map structure used by the server back-end of COVINS (termed server map) maintains the data of the collaborative estimation process. A server map is a SLAM graph, holding a set of KF and a set of LM as vertices, and edges induced either between two KF through IMU constraints or between a KF and a LM as a landmark observation. While multiple agents can contribute to one server map, multiple server maps exist simultaneously until all participating agents are co-localized. Together with the state definition from Sec. 2.1, this underlying SLAM estimation problem induces a factor graph [covins:supp], which forms the basis of the GBA scheme explained in Sec. 3.8. The shared LM observations create dependencies between KF from multiple agents, while IMU factors are only inserted between consecutive KF created by the same agent. In an AR use-case, the map would furthermore store AR content created by users contributing to this map.
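The graph structure described above can be illustrated with a minimal Python sketch. The class name `ServerMap`, its fields and its methods are hypothetical and chosen only for illustration; they are not the data structures of the actual implementation. Poses and positions are kept as opaque placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ServerMap:
    """Toy SLAM graph: KF and LM vertices plus two edge types (illustrative only)."""
    keyframes: dict = field(default_factory=dict)     # (agent_id, kf_idx) -> pose placeholder
    landmarks: dict = field(default_factory=dict)     # lm_id -> position placeholder
    observations: set = field(default_factory=set)    # (kf_key, lm_id) observation edges
    imu_edges: set = field(default_factory=set)       # (prev_kf_key, kf_key) IMU edges

    def add_keyframe(self, agent_id, kf_idx, pose=None):
        key = (agent_id, kf_idx)
        self.keyframes[key] = pose
        prev = (agent_id, kf_idx - 1)
        # IMU edges only connect consecutive KF created by the same agent.
        if prev in self.keyframes:
            self.imu_edges.add((prev, key))
        return key

    def add_observation(self, kf_key, lm_id, position=None):
        self.landmarks.setdefault(lm_id, position)
        self.observations.add((kf_key, lm_id))

    def agents_observing(self, lm_id):
        """Agents whose KF observe this LM (shared LM couple agents in the graph)."""
        return {kf_key[0] for kf_key, lm in self.observations if lm == lm_id}
```

In this sketch, a LM observed by KF of two different agents is what ties their trajectories together, whereas IMU edges never cross agents.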
3.3 Error Residuals Formulation
By formulating a set of residuals, the optimization of state variables occurring in KF-based VI SLAM can be expressed as a weighted nonlinear least-squares problem. Each such residual $\mathbf{e}_i$ expresses the difference between the expected measurement based on the current state of the system and the actual measurement $\mathbf{z}_i$:
$$\mathbf{e}_i = \mathbf{z}_i - h_i(\Theta_i) \tag{2}$$
where $\Theta_i$ is the set of state variables relevant for measurement $\mathbf{z}_i$, and $h_i(\cdot)$ is the measurement function, predicting the measurement according to these state variables in $\Theta_i$. By collecting all occurring residual terms, the objective of the optimization can be expressed as:
$$\Theta^{*} = \arg\min_{\Theta} \sum_{i} \left\| \mathbf{e}_i \right\|^{2}_{\mathbf{W}_i} \tag{3}$$
where $\left\| \mathbf{e} \right\|^{2}_{\mathbf{W}} = \mathbf{e}^{T} \mathbf{W} \mathbf{e}$ denotes the squared Mahalanobis distance with the information matrix $\mathbf{W}$. Within our system, we essentially use three different types of residuals: reprojection residuals $\mathbf{e}_r$, relative pose residuals $\mathbf{e}_\Delta$, and IMU pre-integration residuals $\mathbf{e}_I$. A detailed description of the individual residuals can be found in [covins:supp].
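As a small numerical illustration of this weighted least-squares cost, the following Python sketch evaluates the sum of squared Mahalanobis distances for a list of residual terms. The function names and the tuple layout `(z, h_pred, W)` are assumptions made for this example; a real system would hand these terms to a nonlinear solver rather than evaluate them once.

```python
import numpy as np


def mahalanobis_sq(e, W):
    """Squared Mahalanobis distance e^T W e for information matrix W."""
    e = np.asarray(e, dtype=float)
    return float(e @ np.asarray(W, dtype=float) @ e)


def objective(terms):
    """Sum of weighted squared residuals.

    terms: iterable of (z, h_pred, W), where z is the actual measurement,
    h_pred the measurement predicted from the current state, and W the
    information matrix of that measurement.
    """
    total = 0.0
    for z, h_pred, W in terms:
        residual = np.asarray(z, dtype=float) - np.asarray(h_pred, dtype=float)
        total += mahalanobis_sq(residual, W)
    return total
```

For example, a 2D residual of `[1, 2]` with identity information contributes 5, and a scalar residual of 2 with information 0.5 contributes 2.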
3.4 Visual-Inertial Odometry Front-End
COVINS is able to generate accurate collaborative global estimates from map data contributed by a keyframe-based VIO system (also referred to as VIO front-end). To enable the sharing of map information between this VIO front-end running onboard the agent and the server, the framework provides a communication interface that enables the combination of the server back-end with any indirect VIO system (i.e. one using feature-based landmark correspondences, as required for the reprojection residuals), as explained in Sec. 3.5. For handling the inertial data in GBA, we use the VIO’s estimates of the metric scale, the IMU biases and the velocities as an initialization point. To evaluate the performance of COVINS in the experiments conducted for this paper, we employ the VIO front-end of ORB-SLAM3 [campos2021:TRO:orb3] as a nominal open-source option.
3.5 Communication
The communication module is based on socket programming using the TCP protocol, and uses the header-only library cereal [code:cereal] for the serialization of messages. This allows the server to be deployed on a local computational unit as well as on remote cloud servers. The communication module on the server side listens on a pre-defined port for incoming connection requests from the agents, allowing them to join dynamically during the mission without any prior specification of the number of participating agents. A generic communication interface for usage on the agent side is exported by COVINS as a shared library, enabling an existing VIO system to share map data with the server back-end using pre-defined KF and LM messages.
3.5.1 Agent-to-Server Communication
For sharing map data from the agent to the server, COVINS adopts the efficient message passing scheme from [schmuck2018:JFR:ccm], which accounts for static parts of KF and LM, such as extracted 2D feature keypoints and related descriptors, and ensures this information is not repeatedly sent, in order to reduce the required network bandwidth. The communication scheme distinguishes between so-called ‘full’ messages, comprising all relevant information for a KF and LM, including also static measurements (e.g. 2D keypoints), and significantly smaller update messages, where only changes in the state (e.g. modified KF pose) are transmitted. All map data to be shared with the server is accumulated over a short time window, and communicated to the server batch-wise at a fixed frequency. The communication counterpart on the server side integrates the transmitted map information into the collaborative SLAM estimate.
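The distinction between full and update messages, together with batch-wise sending, can be sketched as follows. This is a simplified Python illustration under stated assumptions: the class `AgentOutbox`, its method names and the dictionary-based message layout are hypothetical and do not reflect the actual message definitions or serialization used by COVINS.

```python
from dataclasses import dataclass, field


@dataclass
class AgentOutbox:
    """Collects changed KF and emits one batch per flush (illustrative only)."""
    sent_ids: set = field(default_factory=set)    # KF whose static data was already sent
    pending: dict = field(default_factory=dict)   # kf_id -> latest local data

    def mark_changed(self, kf_id, pose, keypoints):
        # Later changes to the same KF overwrite earlier ones within a time window.
        self.pending[kf_id] = {"pose": pose, "keypoints": keypoints}

    def flush(self):
        """Build the batch for this time window; intended to be called at a fixed rate."""
        batch = []
        for kf_id, kf in self.pending.items():
            if kf_id in self.sent_ids:
                # Static data is already on the server: send the changed state only.
                batch.append({"type": "update", "id": kf_id, "pose": kf["pose"]})
            else:
                batch.append({"type": "full", "id": kf_id,
                              "pose": kf["pose"], "keypoints": kf["keypoints"]})
                self.sent_ids.add(kf_id)
        self.pending.clear()
        return batch
```

The first flush after a KF is created carries its keypoints; any later pose change for the same KF produces only a small update message.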
3.5.2 Server-to-Agent Communication (Map Re-Use)
The communication interface of COVINS supports two-way communication between the agents and the server, which in this article is applied to estimate the drift of an agent’s VIO. On the server, drift can be accounted for on a global scope through loop closure and subsequent optimization-based map refinement. In order to enable the agent to also account for this drift, we regularly share with each agent the server’s estimated pose of the KF most recently created by that agent. Comparing this drift-corrected pose estimate from the server with the estimated pose of the same KF in the local map allows estimating a local odometry transformation onboard the agent, quantifying the drift in the current pose estimate. With this scheme, the map of the local VIO is not modified, leaving the smoothness of the VIO unaffected, which is of substantial importance, for example, when using the pose estimate in a feedback system for controlling a robot.
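One way to realize such a drift estimate is sketched below in Python with 4×4 homogeneous transforms. The exact formulation used by COVINS is not given in the text, so the composition `T_drift = T_server_kf · T_local_kf⁻¹` is an assumption, and all function names are hypothetical.

```python
import numpy as np


def make_pose(yaw, translation):
    """Homogeneous 4x4 pose from a yaw angle (rad) and a translation (helper for the example)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def estimate_drift(T_server_kf, T_local_kf):
    """Transformation mapping the local odometry frame onto the server's corrected frame."""
    return T_server_kf @ np.linalg.inv(T_local_kf)


def correct_pose(T_drift, T_local_current):
    """Apply the drift correction to a local pose without modifying the local map."""
    return T_drift @ T_local_current
```

Because the correction is applied on top of the VIO output, the local map and the VIO trajectory themselves stay untouched, which matches the smoothness argument above.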
3.6 Multi-Map Management
The map manager maintains the data contributed by all agents in one or more server maps, as described in Sec. 3.2. A new map is initialized for every agent that enters the system. As soon as place recognition detects overlap between two distinct maps, the map fusion routine of the map manager is triggered. Furthermore, the map manager holds and maintains the KF database necessary for efficient place recognition. Besides providing routines for map fusion and graph compression through removal of redundant KF (Sec. 3.8), the map manager is in charge of controlling access to the server maps and the KF database, in order to ensure global consistency. Storing all maps at a central point in the system, with individual modules requesting access to either read from or modify a specific map, makes it easier to coordinate map access from different system modules and to keep maps consistent, e.g. when multiple agents contribute to a single server map, or to restrict map access in order to perform map fusion or optimization.
3.7 Place Recognition, Loop Closure & Map Fusion
To detect repeatedly visited locations with high precision, we employ a standard multi-stage place recognition pipeline, which we briefly summarize here. For a query KF $k_q$, a bag-of-words approach [galvez2012bags] is employed to select a set $\mathcal{C}$ of potential matching candidates from all KF in the system. After establishing feature correspondences between $k_q$ and all KF in $\mathcal{C}$, a 3D-2D RANSAC scheme followed by a refinement step minimizing the reprojection error is applied to find a relative transformation $\mathbf{T}_{cq}$ between $k_q$ and a potential match $k_c \in \mathcal{C}$. Finally, $\mathbf{T}_{cq}$ is used to find additional LM connections between $k_q$ and $k_c$. A place recognition match is accepted if, for a candidate $k_c$, enough inliers are found throughout all stages. In the case that $k_q$ and $k_c$ are part of the same server map, we perform loop closure, carrying out a PGO in order to optimize the poses of the KF in the map, improving accuracy and reducing drift in the estimate. In the case that $k_q$ and $k_c$ reside in different server maps, the map fusion routine of the map manager is triggered, aligning the map $\mathcal{M}_q$ of the query KF and the map $\mathcal{M}_c$ of the candidate KF using $\mathbf{T}_{cq}$, and finally replacing both maps with one new server map containing all KF from $\mathcal{M}_q$ and $\mathcal{M}_c$. This also involves the fusion of duplicate LM. In the process, potential AR content in $\mathcal{M}_q$ would be transformed into the coordinate frame of $\mathcal{M}_c$ and combined with the AR content contained in $\mathcal{M}_c$. This also entails that after map fusion, AR content of $\mathcal{M}_q$ is available to all users previously associated with $\mathcal{M}_c$, and vice versa.
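The decision between loop closure and map fusion after an accepted match can be sketched as a small routing function. This Python example is a simplification under stated assumptions: the function name and the dictionary-based bookkeeping are hypothetical, the fused map re-uses the candidate's map id instead of creating a new map object, and the geometric alignment and the fusion of duplicate LM are omitted.

```python
def handle_match(maps, kf_to_map, query_kf, cand_kf):
    """Route an accepted place recognition match (illustrative only).

    maps:      map_id -> set of KF ids contained in that server map
    kf_to_map: KF id  -> map_id
    """
    m_q, m_c = kf_to_map[query_kf], kf_to_map[cand_kf]
    if m_q == m_c:
        # Same server map: add a loop edge and run PGO on this map.
        return "loop_closure", m_q
    # Different server maps: fuse them into a single map.
    for kf in maps[m_q]:
        kf_to_map[kf] = m_c
    maps[m_c] |= maps.pop(m_q)
    return "map_fusion", m_c
```

A single query therefore covers both cases, which mirrors the use of one place recognition module for loop closure and map fusion.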
3.8 Map Refinement
3.8.1 Pose-Graph Optimization
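PGO is applied to a map when a new loop constraint between two KF is added to this map after successful loop closure detection (all optimization schemes of COVINS use the Ceres solver). We use the following objective function for PGO, optimizing the pose of all KFs of the server map:
$$\Theta^{*} = \arg\min_{\Theta} \sum_{k \in \mathcal{V}} \sum_{j \in \mathcal{V}} \delta_{k,j} \left\| \mathbf{e}_{\Delta}^{k,j} \right\|^{2}_{\mathbf{W}_{\Delta}} \tag{4}$$
where $\delta_{k,j}$ is an indicator function defined by
$$\delta_{k,j} = \begin{cases} 1 & \text{if } (k,j) \in \mathcal{E}_{cov} \cup \mathcal{E}_{loop} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
and $\mathbf{e}_{\Delta}^{k,j}$ denotes the relative pose residuals and $\mathbf{W}_{\Delta}$ the information matrix of the relative pose constraints (Sec. 3.3). The sets $\mathcal{E}_{cov}$ and $\mathcal{E}_{loop}$ denote the covisibility edges between KF and loop closure edges, respectively. After the optimization, the positions of all LM in the server map are propagated using the optimized KF poses.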
3.8.2 Global Bundle Adjustment
COVINS performs on-demand GBA, e.g. at the end of the mission when the agents are not actively sending further information to the server. This creates a highly accurate estimate, which can be re-used in a multi-session fashion for further collaborative SLAM sessions. For a specific server map, we perform GBA taking into account all KF and LM in the map, using the following objective function:
$$\Theta^{*} = \arg\min_{\Theta} \left\{ \left\| \mathbf{e}_{0} \right\|^{2}_{\mathbf{W}_{0}} + \sum_{k \in \mathcal{V}} \sum_{i \in \mathcal{L}(k)} \rho\!\left( \left\| \mathbf{e}_{r}^{k,i} \right\|^{2}_{\mathbf{W}_{r}} \right) + \sum_{k} \left( \left\| \mathbf{e}_{I}^{k-1,k} \right\|^{2}_{\mathbf{W}_{I}} + \left\| \mathbf{e}_{b}^{k-1,k} \right\|^{2}_{\mathbf{W}_{b}} \right) \right\} \tag{6}$$
where the first term corresponds to a prior added to the first KF in order to remove the gauge freedom, and $\mathcal{L}(k)$ denotes the set of LM observed by KF $k$. The function $\rho(\cdot)$ denotes the use of a robust cost function to reduce the influence of outlier observations; in our case, the Cauchy loss is used. The terms $\mathbf{e}_{r}$ and $\mathbf{e}_{I}$ correspond to the reprojection and IMU pre-integration residuals (Sec. 3.3). The term $\mathbf{e}_{b}$ penalizes changes in the bias variables between successive KF. After the optimization, outliers with large reprojection residuals are removed from the map. Note that the IMU constraints in Eq. (6) are only inserted between consecutive KFs created by the same agent.
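To illustrate why a robust cost reduces the influence of outliers, the following Python sketch evaluates a Cauchy loss in the form $\rho(s) = a^2 \log(1 + s / a^2)$ applied to a squared residual norm $s$, which is the form documented for the Cauchy loss in the Ceres solver. The scale parameter `a` used by COVINS is not stated in the text, so the default of 1.0 is an assumption for this example.

```python
import math


def cauchy_loss(s, a=1.0):
    """Cauchy robust cost applied to a squared residual norm s >= 0.

    Behaves like s for small residuals and grows only logarithmically for
    large ones, so gross outliers contribute far less than with a plain
    squared cost.
    """
    b = a * a
    return b * math.log1p(s / b)
```

For a small squared residual the loss is almost identical to the squared cost, while a squared residual of 100 is reduced to roughly 4.6 for `a = 1`.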
3.8.3 Redundancy Removal (KF Pruning)
Creating a large number of KF is beneficial for VIO to achieve a high level of robustness and accuracy. However, for the global SLAM estimate, an increasing number of KF results in increasing runtime of the employed algorithms, notably of the optimization algorithms, which scale cubically with the number of KF in the worst case. Therefore, it is desirable to remove redundant KF from the SLAM graph to increase the scalability of the system. For this reason, we employ the structure-based heuristic introduced in [schmuck2019:3DV:redundancy] to identify and remove redundant KF. The underlying assumption of the heuristic is that with an increasing number of observations of a specific LM (by different KF), the information gained by an individual observation decreases. Therefore, with $o_i$ denoting the number of observations of LM $i$, a function $\tau(o_i)$ assigns a value to each LM depending on its number of observations, with an increasing number encoding increasing redundancy of an individual observation of this LM. The complete definition of $\tau$ can be found in [schmuck2019:3DV:redundancy, covins:supp]. Using $\mathcal{L}(k)$, the set of LM observed by KF $k$, we can calculate a redundancy value $\phi_k$ for each KF $k$ in the map as
$$\phi_k = \frac{1}{|\mathcal{L}(k)|} \sum_{i \in \mathcal{L}(k)} \tau(o_i) \tag{7}$$
By assigning a value to each KF that estimates the information it contributes to the SLAM estimate, the most redundant KF can be removed from the estimate. Redundancy removal is performed before GBA, since the runtime of GBA is affected the most by the number of KF. LM pruning is implicitly handled: whenever a LM becomes under-observed (i.e. fewer than 2 observations), either through removal of outlier observations or through removal of KF, this LM is removed from the map.
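The pruning procedure can be sketched in Python as a greedy loop. This is only an illustration under stated assumptions: the saturating function `tau` below is a placeholder and not the definition from [schmuck2019:3DV:redundancy], the per-KF value is taken as the average of the LM values, and the function and argument names are hypothetical.

```python
def tau(num_obs, saturation=5):
    """Placeholder LM redundancy value in [0, 1]; NOT the published definition."""
    return min(num_obs, saturation) / saturation


def observation_counts(kf_obs):
    """Number of KF observing each LM."""
    counts = {}
    for lms in kf_obs.values():
        for lm in lms:
            counts[lm] = counts.get(lm, 0) + 1
    return counts


def redundancy(kf, kf_obs, counts):
    """Average LM redundancy over the LM observed by this KF."""
    lms = kf_obs[kf]
    if not lms:
        return 1.0  # a KF without observations carries no structure information
    return sum(tau(counts[lm]) for lm in lms) / len(lms)


def prune(kf_obs, target_kfs):
    """Greedily remove the most redundant KF until target_kfs remain.

    kf_obs: KF id -> set of observed LM ids. Returns the pruned mapping,
    with LM observed by fewer than 2 remaining KF dropped.
    """
    kf_obs = {k: set(v) for k, v in kf_obs.items()}
    while len(kf_obs) > target_kfs:
        counts = observation_counts(kf_obs)
        worst = max(sorted(kf_obs), key=lambda k: redundancy(k, kf_obs, counts))
        del kf_obs[worst]
    counts = observation_counts(kf_obs)
    return {k: {lm for lm in lms if counts[lm] >= 2} for k, lms in kf_obs.items()}
```

Recomputing the observation counts after every removal keeps the values consistent with the shrinking graph, at the cost of extra work per iteration.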
4 Experimental Results
We evaluate COVINS in a thorough set of experiments, investigating its accuracy on the EuRoC benchmark dataset [dataset:burri2016euroc] employing a local PC as well as an AWS cloud server (Sec. 4.1), its scalability in a large-scale experiment with 12 agents (Sec. 4.2), drift correction (Sec. 4.3), the influence of the redundancy removal (Sec. 4.4) and communication statistics (Sec. 4.5). All results were obtained by re-playing data in real-time, and values in this section are averaged over 5 runs for each experiment if not stated otherwise. For these experiments, the following setup is used:
Local Server: Lenovo T480s (1.80 GHz × 8 (max 4.00 GHz))
Cloud Server: AWS c5a.8xlarge (32 vCPUs at 3.3 GHz)
Agents: Intel NUC 7i7BNH with 3.5 GHz × 4
Throughout all experiments, the pre-recorded datasets are processed onboard the agents, which are connected to the server via a wireless network, so that real communication takes place. Using pre-recorded data makes our evaluation across different runs more comparable and provides us with ground truth, while still using real network communication as would be the case in a real-world application.
4.1 Collaborative SLAM Estimation Accuracy
We evaluate the accuracy of the global collaborative SLAM estimate of COVINS using [dataset:burri2016euroc], where we use the five Machine Hall (MH) sequences, and the three Vicon Room 1 (V1) sequences to establish a collaborative estimation scenario with three and five participating agents. A global estimate jointly created by five agents is shown in Fig. 3. Table 1 reports the accuracy of the aligned global estimate in terms of ATE and scale error, as well as a comparison to ORB-SLAM3 [campos2021:TRO:orb3], VINS-mono [qin2018:TRO:vins] (both having multi-session functionalities) and CVI-SLAM [karrer2018cvi] using the Local Server for all experiments.
COVINS shows generally high accuracy across all datasets, achieving similar or better performance than the state of the art. The high quality of COVINS’ estimate in multi-agent scenarios is due to the fact that the framework is able to establish a large number of accurate constraints between the data contributed by the individual agents, as visible from Fig. 3, where the red lines encode covisibility edges between separate trajectories. Furthermore, Table 1 reports the results for the same experimental setup for COVINS, except that an AWS cloud server is used to run the server back-end of COVINS. The accuracy of this cloud-based collaborative SLAM estimate is similar to the accuracy using a local server, attesting to the capability of the COVINS back-end to be executed on remote cloud compute, with potentially much higher computational resources than a locally deployed PC.
4.2 Large-Scale Collaborative SLAM with 12 Agents
In this experiment, we evaluate the applicability of COVINS to a large-scale scene and a large team of participating agents. For this, we use a newly generated dataset with 12 UAVs, each equipped with a downward-looking camera, flying over a small village. In order to obtain accurate ground truth, the dataset was created using the visual-inertial simulator from [teixeira2020:RAL:aerial], which creates photo-realistic vision datasets for UAV flights using a high-quality 3D model of the scene. It comprises 12 circular UAV trajectories of radius, covering an area of about with total trajectory length. Fig. 1 shows the final collaborative estimate generated by COVINS, consisting of over 3200 KF and about 200k LM. The average ATE of the estimate is , the average scale error is 0.44%. An illustration of the 3D scene is shown in [covins:supp] and the accompanying video (https://youtu.be/FxJTY5x1fGE).
4.3 Drift Correction
Fig. 4 visually demonstrates the effect of the drift estimation and correction scheme. Fig. 4(a) shows the final trajectory estimated by the agent’s onboard VIO system using the MH3 sequence. Note that although the full trajectory is displayed, this trajectory was never globally optimized by the agent itself. As visible from Fig. 4(a), the estimated trajectory (gold) is affected by some drift, so that the estimated final location of the agent deviates from the true location (red). However, based on the continuous estimation of the drift using the information received from the server, the agent can estimate a corrected trajectory (white), which is noticeably closer to the true trajectory. A similar effect can be seen in Fig. 4(b), which shows a snapshot of the VIO estimate during an experiment using the MH5 sequence, with the green box highlighting a significant drift correction with information received from the server after loop closure.
4.4 Redundancy Detection & Removal
Using the maps created during the 5-agent experiments in Sec. 4.1, we evaluate the influence of the redundancy detection scheme implemented in COVINS. The evaluation is performed as follows: the five multi-agent maps created during the five 5-agent experiments on MH1-MH5 of Sec. 4.1 were saved to file storage after each experiment, so that they can be reloaded into COVINS for further experiments. For each map, which contains on average approximately 1700 KF each, a reduction of the map to 1250, 1000 and 750 KF is performed in separate experiments. The results (average ATE over all five maps) are reported in Table 2, demonstrating that COVINS is able to significantly reduce the number of KF in the estimate at only a small loss in accuracy and a significant reduction of the GBA time. Even when compressing the map by more than 50%, the mean error increases by only .
| Num. KFs | 1681 (init. state) | 1250 | 1000 | 750 |
|---|---|---|---|---|
| GBA Time [s] | 133 | 53 | 34 | 20 |
4.5 Communication Statistics
Table 3 reports the network traffic generated by the communication between server and agent. Each agent informs the server at a frequency of 5 Hz about new and modified map data since the previously sent data bundle. The server shares information with the agent at a frequency of 2 Hz. As visible from Table 3, the generated network traffic from an individual agent to the server lies approximately between 400 and 600 kB/s, which can comfortably be covered by typical WiFi infrastructure. More challenging sequences require VIO to create more KF for successful operation, translating to more transmitted data (e.g. MH1 (easiest): 2.9 KF/s created, V103 (hardest): 6.9 KF/s). The traffic from the server to the individual agent is significantly lower in our implementation, since only pose information for a single KF needs to be shared for the drift correction scheme. The average size of the individual message types is as follows: KF full: 97 kB; KF update: 273 bytes; LM full: 162 bytes; LM update: 65 bytes.
| Sequence | Agent → Server | Server → Agent | Comm Time |
|---|---|---|---|
| Avg. (8 seq.) | 493.36 kB/s | 2.31 kB/s | 792.91 ms |
| MH1 | 422.83 kB/s | 2.29 kB/s | 939.92 ms |
| MH5 | 540.37 kB/s | 2.32 kB/s | 776.33 ms |
| V103 | 609.22 kB/s | 2.31 kB/s | 850.24 ms |
Table 3 also reports the timings of the communication module on the agent. With approximately 1 s of total communication time for each sequence, and the sequences containing flights between 84 s (V102) and 144 s (V101), the overhead of the communication is marginal (on the order of 1%) compared to the total time of the estimation process and does not compromise the real-time capability of the VIO system.
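The size of this overhead can be bounded with simple arithmetic from the numbers reported above. The Python sketch below pairs the largest communication time listed in Table 3 with the shortest sequence duration, and the smallest listed time with the longest duration; this pairing is an assumption made to obtain rough bounds, since per-sequence durations are not all given.

```python
def overhead_percent(comm_time_s, duration_s):
    """Communication time as a percentage of the sequence duration."""
    return 100.0 * comm_time_s / duration_s


# Rough upper bound: largest listed comm time (MH1) over the shortest flight (84 s).
upper = overhead_percent(0.93992, 84.0)
# Rough lower bound: smallest listed comm time (MH5) over the longest flight (144 s).
lower = overhead_percent(0.77633, 144.0)
```

Both bounds stay close to one percent of the sequence duration, consistent with the statement that the communication does not compromise real-time operation.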
5 Conclusion
In this paper, we present COVINS, a powerful and accurate back-end for collaborative SLAM. COVINS allows multiple agents to generate collaborative global SLAM estimates from their simultaneously contributed data online during the mission, eliminating the need for external infrastructure or pre-built maps in order to enable multi-agent applications. The efficient architecture and system design of COVINS allow the framework to process data contributed by many agents simultaneously. Our experiments attest to the high accuracy of the collaborative SLAM estimates in large-scale multi-agent missions, in particular demonstrating collaborative SLAM with up to 12 agents contributing to the system, which, to the best of our knowledge, is the highest number of participants demonstrated by any comparable system in the literature. The framework can run locally on a PC as well as on a remote cloud server, which boosts the applicability and scalability of the system. It is further supported by a redundancy detection scheme that was demonstrated to significantly reduce the number of KF in the estimate while keeping a similar level of accuracy. Future work will focus on further extending the applicability and scalability of the system to potentially hundreds of agents, and on interfacing COVINS with a front-end that is able to run on mobile devices, such as VINS-Mobile [qin2018:TRO:vins], and to display AR content, in order to leverage COVINS’ collaborative scene understanding to create and demonstrate a shared AR experience for multiple users.