1 Introduction
The SLAM problem was introduced nearly 30 years ago [1] and continues to be a fundamental challenge in robotics today. Solutions to the SLAM problem require algorithms that solve three challenging subproblems: (1) how to efficiently recognize salient geometry, (2) how to associate and integrate map data, and (3) how to track the position and orientation of the agents in the map. Using SLAM to generate high-quality 2D or 3D maps is a classical subject, and recent work focuses on real-time applications of the algorithm in distributed environments to generate and update large-scale maps. Although there are many state-of-the-art RGB-D SLAM solutions [2, 3, 4, 5], two key shortcomings remain: high computational cost (often requiring GPU acceleration) and high bandwidth requirements. These requirements prohibit SLAM from being used on robots that have limited computational or memory resources, e.g., light-duty UAVs and swarm-style robots. The computational bottleneck comes from three sources: (1) tracking the robot pose via the RGB-D camera data, (2) updating the local and global graph-SLAM map with new RGB-D map information and (3) detecting loop closures and updating the SLAM map to include detected closures [6]. Our goal is to address the SLAM computational challenge by leveraging and improving upon a leading state-of-the-art 3D SLAM algorithm [7].
As one of the state-of-the-art dense visual SLAM methods for RGB-D cameras, DVO SLAM [7] represents a leading approach that outperforms many competing methods [8, 9]. DVO SLAM incorporates the Direct Visual Odometry (DVO) algorithm to solve the RGB-D camera tracking problem. Visual odometry seeks to compute the changes in the camera's 3D pose from the sensed RGB-D images. The DVO algorithm poses this problem as a search for the pose that minimizes the point cloud photometric (intensity) and depth differences. In conjunction with robust outlier weighting, DVO continues to set the performance benchmark for RGB-D camera tracking. DVO SLAM constructs maps as a connected graph of keyframes and optimizes this graph using the g^{2}o framework [10]. While DVO SLAM performs well for building accurate maps in near real-time, it is a standalone application to be run on a single CPU. Hence, DVO SLAM cannot be used for distributed or collaborative SLAM applications. Further, DVO SLAM requires large computational and memory resources to process RGB-D data in real-time.
In this paper, we extend DVO SLAM [7] to distributed scenarios and simultaneously address the computational burden and bandwidth challenge of using RGB-D data. Our approach is to separate real-time essential tasks, which we refer to as the SLAM frontend, from non-essential tasks, e.g., SLAM optimization and map building, which we refer to as the SLAM backend. We then minimize communication requirements between the frontend and backend by sharing compressed versions of the sensed RGB-D data. To summarize, the main contributions of this paper are:

We decouple the frontend and backend of graph-SLAM systems, giving our approach unprecedented flexibility: agents may create onboard maps (frontend and backend), contribute to other maps (frontend only), or only integrate maps (backend only) using over-network SLAM.

We propose a new compression algorithm that transforms 3D RGB-D point cloud data into planar patches, compressing the geometric data to reduce the computational and memory resources required for DVO SLAM map building.

We integrate parameters into our compression algorithm so it can adapt to satisfy dynamic computational and/or bandwidth constraints as specified by the host computer. This makes it possible to run this state-of-the-art SLAM approach on hosts with limited resources.

We incorporate plane cloud surface representations as new elements for keyframe-based mapping, retaining many of the benefits of RGB-D keyframes at smaller computational and memory cost.
These contributions significantly advance the state of the art for RGB-D 3D SLAM. By taking advanced algorithms used by many researchers and extending them to run in distributed applications with much lower bandwidth and resource usage, our method becomes accessible to robots with limited resources such as UAVs and swarm robots. Our approach compresses 1 MB RGB-D scenes in real-time into smaller plane clouds that typically require 10 KB. This makes these representations more suitable for large-scale real-time mapping.
The key enabling component of our approach is the integration of a compressed map representation into the RGB-D SLAM program. This provides a new representation for point clouds that can be quickly calculated and represents the data to a similar degree of geometric accuracy using far fewer parameters. This allows large collections of point measurements to be substituted with a small collection of local planar fits. Fig. 1 depicts the difference between the two representations. Fig. 1(a) and (b) show a close-up scene and the dense point cloud representation used by many state-of-the-art SLAM works. Fig. 1(c) shows the sparse reconstruction produced by our SLAM system, in which a limited number of planar patches are transformed from dense 3D surface data.
2 Related Work
Our review of closely related research focuses on four areas: (1) SLAM solutions that address SLAM computational cost, (2) SLAM methods that incorporate plane models, (3) over-network SLAM solutions and (4) DVO SLAM.
2.1 Computational Cost
One major challenge for current 3D SLAM approaches is to overcome the burden that SLAM and GNC using RGB-D sensors place on the host's available computational resources. Centralized approaches [11, 12, 13] address computational cost by aggregating data from multiple robots at a central server with more computational resources, where SLAM estimates can be calculated. Yet such approaches are not viable for RGB-D data, since sharing this data requires prohibitively large network bandwidth. Further, this computational model does not scale as the number of robots increases. A second approach solves the SLAM optimization problem via distributed computation. In this context, robots use only local computation and communication to optimize the SLAM pose graph and estimate robot trajectories and environmental maps. Another distributed mapping algorithm [14] optimizes the SLAM algorithm by sharing key informative features. Recent research [15, 16] has extended this implementation as a backend to larger SLAM solutions as a method to reduce the computational burden of solving multiple robot trajectories in multi-robot systems. While these solutions can be data-efficient, they are not applicable to RGB-D 3D mapping scenarios. Further, these approaches lack the representations required to approach 3D map building in bandwidth-constrained contexts.
2.2 Planar Shape Models for SLAM
There has also been significant interest in probabilistic mathematical methods that model trajectory and geometric shape to mitigate computational issues. Some work [17, 18, 19] succeeds in mitigating the propagation of SLAM uncertainties and controlling the computational complexity using probabilistic graphical models [20]. Compact shape models are also used for efficient surface data representation that can be associated and integrated with low computational complexity. These plane-based SLAM solutions, however, either extend feature-based SLAM with the use of planar surfaces [21] or assume an orthogonal environment, which limits their applicability [22]. Dense planar SLAM [23] uses a dense ICP method to estimate sensor poses, which requires a GPU for real-time computation. A quaternion-based minimal plane representation [24] is used to update the planes during optimization without a GPU, but the system does not perform in real-time. A real-time CPU-only dense planar SLAM algorithm [25] exceeds current popular online 3D reconstruction methods in pose estimation accuracy, and further computational cost is saved by aligning planes for loop closures instead of searching for 3D point pairs. Moreover, most of these planar SLAM systems are not distributed.
2.3 Bandwidth
Collaborative mapping in multi-agent environments requires each agent to share information about its past and current state. Due to bandwidth constraints and limited communication range, it is challenging to share raw data or other large amounts of data among the agents in a distributed SLAM system [26]. A distributed communication network [27] is proposed in which every robot exchanges only its local matches with its neighbors. The algorithm propagates local to global contexts to solve for a global correspondence. Some work [28, 29] reduces bandwidth requirements by transmitting subsets of the map information as a collection of sparse features. These sparse representations give rise to sparse map data that may contain large holes and ambiguous regions.
A proposed multi-agent system reduces the required communication bandwidth and complexity by partitioning point clouds into parts and then compactly describing each part using discriminating features to efficiently represent the environment [30]. Some work [31, 32, 15] minimizes bandwidth usage by running place recognition on image frames and sending only the extracted place feature vectors to the robots. A resource-adaptive framework [33, 34] is proposed for distributed loop closure detection, seeking to maximize task-oriented objectives subject to a budget constraint on total data transmission in the frontends. While all of these SLAM systems reduce bandwidth efficiently, none of them are designed for RGB-D data, nor do they provide dense 3D geometric surface maps as part of their SLAM solution.
2.4 DVO SLAM
Much recent work [35, 36, 37, 38] in real-time RGB-D visual SLAM involves direct methods, where input data is used directly with minimal processing and maps are dense. DVO SLAM, originally proposed by Kerl et al. [7] in 2013, is still in popular use today. It benefits from low-drift odometry via minimization of both geometric and photometric reprojection error between RGB-D image pairs. The odometry is made robust by the incorporation of motion priors and the weighting of errors. As in many 3D RGB-D odometry algorithms, the map in this approach is modeled by a pose graph, where every vertex is a camera pose with an associated keyframe (intensity + depth), and edges represent the relative Euclidean transformations between keyframe poses. New keyframes are generated based on an entropy metric, and a similar metric is used to identify loop closures. For DVO SLAM the map is taken as the RGB-D keyframe point clouds whose poses are determined by their current "best estimate" in the optimized SLAM camera trajectory. While this method is highly accurate, it has several limitations: (1) the algorithm must be run on a single host, (2) other robots cannot benefit from estimated SLAM map data and (3) if the DVO map data were available, it would be prohibitively large to share even over high-bandwidth network connections [39, 25, 24].
This work represents a significantly distinct point in the landscape of SLAM algorithms. While the proposed approach is derived from DVO SLAM [7], extensive modifications to the programmatic structure, visual odometry tracking and SLAM optimization set this work apart. In combination, these modifications create a high-performance graph-based distributed SLAM system that can dynamically adapt to the changing computational resources and bandwidth conditions available to the host computer. This is accomplished by separating the DVO SLAM algorithm into independent and asynchronous frontend and backend applications. These applications can run on the same or different CPUs, or on completely distinct hosts connected over a low-bandwidth network. The frontend is responsible for camera tracking over small changes in view and geographic region. The backend receives compressed 3D geometric data from one or more frontends and integrates these data to construct maps. By operating directly on compact representations of the RGB-D data, the computational and bandwidth resources consumed by processing and sharing map data are significantly reduced.
3 Methodology
The goal of the proposed approach is to allow hosts to "fit" a version of DVO SLAM to their context. This enables robots with limited resources to contribute to, and potentially benefit from, 3D maps in a distributed 3D SLAM system. To achieve this goal, we introduce significant modifications to DVO SLAM [7] to produce a new SLAM algorithm that dynamically adapts to satisfy computational and bandwidth constraints set by the host computer. This goal is achieved by the following four modifications to the DVO SLAM algorithm:

Computational isolation of frontend tasks and backend tasks into separate execution units.

Replacement of 3D point clouds with plane clouds in the graph-SLAM keyframe data.

An RGB-D point cloud to plane cloud compression algorithm.

Optimization methods for plane clouds for delta pose and loop closure estimation.
The cumulative effect of these modifications creates a completely new SLAM implementation for 3D map building from RGB-D sensor data.
3.1 Computational isolation of frontend tasks and backend tasks
Our approach starts with the state-of-the-art multi-threaded, single-CPU implementation of graph-SLAM provided by DVO SLAM [7]. As implied by its name, it uses Direct Visual Odometry to track camera pose changes from sensed RGB-D image data by minimizing both intensity and depth reprojection error between RGB-D image pairs, resulting in estimates of camera movement and associated uncertainty. DVO SLAM implements a keyframe approach to 3D RGB-D mapping. Here, the 3D map is taken as a collection of sensed 3D keyframes, each placed at the estimated pose of the RGB-D camera when the keyframe was sensed. DVO SLAM solves the mapping problem using two interconnected mapping components: (1) a local map consisting of a single keyframe that is updated by live RGB-D sensor data and (2) a global map which stores all other historically recorded keyframes to represent the history of the RGB-D camera trajectory.
Our new SLAM system frontend consists of the DVO camera tracking algorithm and the local map component of the SLAM mapping, which keeps only the most recently measured keyframe. This allows the frontend to construct maps of small geographic regions without needing to communicate this information to the backend. Once the local map is "completed" due to a significant (tunable) change in viewpoint, e.g., rotation, or a change in position, i.e., translation, the local map is serialized and communicated across the network (or host) to the backend application and a new local map is created.
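The local map completion test can be sketched as follows. This is an illustrative sketch, not the authors' code: the threshold values, and the use of a translation norm plus a trace-based rotation angle as the "significant change" test, are our assumptions.

```python
import numpy as np

# Hypothetical tuning parameters -- the paper only states the thresholds are tunable.
TRANSLATION_THRESH = 0.25            # meters
ROTATION_THRESH = np.deg2rad(15.0)   # radians

def local_map_complete(T_keyframe_to_current):
    """Return True when the 4x4 relative pose exceeds either threshold."""
    t = T_keyframe_to_current[:3, 3]
    R = T_keyframe_to_current[:3, :3]
    # rotation angle from the trace of R: cos(theta) = (tr(R) - 1) / 2
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.linalg.norm(t) > TRANSLATION_THRESH or theta > ROTATION_THRESH

# small motion: keep extending the current local map
T_small = np.eye(4)
T_small[:3, 3] = [0.05, 0.0, 0.02]
# a 30-degree turn: serialize the local map and start a new one
c, s = np.cos(np.deg2rad(30)), np.sin(np.deg2rad(30))
T_turn = np.eye(4)
T_turn[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
```

When the test fires, the frontend serializes the local map for the backend and re-initializes tracking against a fresh keyframe.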
Our new SLAM system backend holds the global map data from DVO SLAM. The backend is responsible for potentially large-scale map integration. Map integration is accomplished by embedding the graph-SLAM problem into the generic g^{2}o graph optimization library [10]. The g^{2}o graph construction creates graph vertices from estimated camera poses and their uncertainty and attributes each vertex with keyframe (plane cloud) data. Edges connect vertices and imply that the sensor data from the connected keyframes are constrained, i.e., the two keyframes include overlapping views/measurements of the same underlying geographic map regions. The backend can attempt to create new edges, i.e., loop closures, by attempting to detect instances of this circumstance. In DVO SLAM, loop closures are found by speculative matching, using spatial locality as a guide for which keyframe pairs are candidate matches.
The resulting isolated frontend and backend applications allow completely new modular SLAM deployments. For example, robots having RGB-D sensors and low computational resources might run only a frontend, or perhaps only a backend. Further, separate processors on a host might independently take on these roles, allowing new energy-saving strategies. For our application, we wish to perform centralized mapping on a desktop host in near real-time as one or more remote vehicles contribute map data. Note this modularity also allows for decentralized mapping by endowing each robot in a multi-robot network with a backend.
3.2 Graph SLAM
The graph-SLAM approach to solving the SLAM problem uses a graph model to capture state evolution over time. It is typically more efficient in space and time than other approaches because robot states are often described by a sparse collection of locally correlated states (poses). As mentioned previously, graph nodes are robot poses and edges are relative Euclidean transformations that connect poses. Optimization of the graph estimates the full trajectory of the agent, i.e., the sequence of poses that minimizes errors in the observed constraints. Constraints come in the form of pairwise correspondences, i.e., matches between observed map keyframe data, and are realized as edges in the graph-SLAM model. This is also known as the full SLAM problem, which estimates all previous states of the agent trajectory. This is in contrast to filtering-based approaches, which seek to solve the incremental or online SLAM problem, whereby previous state estimates are marginalized out.
As the agent moves through the unknown scene, it generates vertices as keyframes, and sequential keyframes are always connected by odometry edges that estimate the relative pose from one keyframe to the next. Non-sequential keyframes may also be connected by these edges when the camera rotates in circles or loiters in an area.
Each inclusion of a new edge changes the globally optimal solution, which typically affects poses close to the new observation more than geographically distant ones. Loop closures between distant nodes (via edge traversal) are particularly valuable for controlling the error and uncertainty in map estimation, and new methods for remote loop closure are a topic of great interest for accurate SLAM map building.
Finally, the graph-SLAM problem also permits the inclusion of features or landmarks in the environment as constraints. Here, salient aspects of the observed data are used to define new nodes with their own metrics for value and uncertainty.
The full SLAM problem is then formulated as the estimation of the posterior probability of the trajectory, x_{1:T}, and the environmental map, m, given a sequence of motion estimates, u_{1:T}, landmark measurements, z_{1:T}, and the initial pose, x_0, as in Equation (1).
(1) p(x_{1:T}, m | u_{1:T}, z_{1:T}, x_0)
Modeling the full SLAM problem in a graphical way leads to an optimization problem over a sum of nonlinear quadratic constraints in the graph. These constraints are built up over time as the agent explores the environment, in the form of both motion and measurement constraints. In this formulation, agent and landmark poses correspond to graph vertices, while the edges represent spatial constraints (transformations) between these vertices, modeled as Gaussian distributions.
Motion estimates form a constraint directly between agent poses, while landmark measurements form constraints between agent poses through the mutual observation of a landmark. Individual measurements therefore link poses to the landmarks in the graph, and during the optimization these measurements are then mapped to constraints between poses that observed the same landmark at different points in time.
Once the graph is constructed, we seek the configuration of the graph vertices that best satisfies the set of constraints. That is, we seek a Gaussian approximation of the posterior distribution of Equation (1). The optimal trajectory, x*, is estimated by minimizing the joint negative log-likelihood of all constraints as described by Equation (2) [40].
(2) x* = argmin_x { x_0^T Ω_0 x_0 + Σ_t [x_t − g(u_t, x_{t−1})]^T Ω_{u_t} [x_t − g(u_t, x_{t−1})] + Σ_t [m_{c_t} − h(z_t, x_t)]^T Ω_{z_t} [m_{c_t} − h(z_t, x_t)] }
The leftmost term, x_0^T Ω_0 x_0, represents our prior knowledge of the initial pose. Often, x_0 is set to zero, anchoring the beginning of the trajectory to the global coordinate system origin.
The middle term describes a sum over the motion constraints, where g(u_t, x_{t−1}) is a function that computes the pose resulting from applying the odometry estimate obtained at time t, u_t, to the pose x_{t−1}. The residual vector, x_t − g(u_t, x_{t−1}), is then the difference between the expected pose, x_t, and the pose obtained by applying the motion estimate. The matrix Ω_{u_t} is the information matrix, or inverse covariance matrix, associated with the motion constraint at time t. This matrix encodes the precision or certainty of the measurement and serves to weight the constraint appropriately in the optimization.
The rightmost term describes a sum over landmark constraints. The function h(z_t, x_t) computes a global pose estimate of a given landmark, m_{c_t}, using a local measurement of the landmark, z_t, and the pose of the agent, x_t, when the landmark was observed. The residual, m_{c_t} − h(z_t, x_t), is then the difference between the expected landmark pose in the global coordinate system and the estimated global landmark pose resulting from the local landmark measurement. As before, the matrix Ω_{z_t} is the information matrix, or inverse covariance matrix, associated with the landmark constraint at time t.
Minimization of this error function is a nonlinear least squares problem and can be solved using standard nonlinear optimization approaches, e.g., Levenberg-Marquardt, Gauss-Newton, etc. It can also be solved efficiently by exploiting the sparse structure of the graph-SLAM formulation [41], thereby allowing the use of sparse methods such as sparse Cholesky decomposition or the method of conjugate gradients. Many graph optimization libraries, such as the g^{2}o library [10], leverage these sparse methods in order to quickly optimize large graphs.
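The sparse least-squares structure can be made concrete with a toy example. The sketch below (illustrative, not the paper's SE(3) problem) solves a 1-D pose graph with two odometry edges, one "loop closure" edge, and a prior anchoring the first pose; because 1-D constraints are linear, a single normal-equations solve gives the least-squares trajectory, and libraries like g^{2}o generalize exactly this pattern to nonlinear SE(3) constraints.

```python
import numpy as np

# edges: (i, j, measurement of x_j - x_i), all with unit information weight
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.8)]   # last edge is a loop closure
n = 3
A = np.zeros((len(edges) + 1, n))
z = np.zeros(len(edges) + 1)
for k, (i, j, meas) in enumerate(edges):
    A[k, i], A[k, j], z[k] = -1.0, 1.0, meas       # residual: (x_j - x_i) - meas
A[-1, 0], z[-1] = 1.0, 0.0                         # prior anchoring x_0 at the origin
# normal equations A^T A x = A^T z; A^T A is the sparse information matrix
x = np.linalg.solve(A.T @ A, A.T @ z)
```

The inconsistent loop closure (1.8 instead of 2.0) is spread across the trajectory, yielding x ≈ [0, 0.933, 1.867]: exactly the error-distributing behavior that makes loop closures valuable.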
DVO SLAM's formulation of graph-SLAM incorporates only motion constraints into the optimization (the third term in Equation (2) is 0). Motion constraints originate from two sources: (1) appending/integrating new local map keyframes from the frontend into the global map and (2) detecting loop closures by speculatively evaluating correspondence hypotheses between unconnected keyframe pairs. In both cases, the function g from Equation (2) produces a pairwise motion estimate for the edge-connected keyframe pair by computing the relative motion that minimizes the corresponding keyframe data error. The function g in this case is the DVO odometry algorithm which, when invoked on the RGB-D data stored in the keyframe pair, provides the motion estimate and its associated error.
Similar to DVO SLAM, our extension performs graph-SLAM incorporating only motion constraints into the optimization. Motion constraints also originate from the same two sources: (1) appending/integrating new local map keyframes from the frontend into the global map and (2) detecting loop closures by speculatively evaluating correspondence hypotheses between unconnected keyframe pairs. However, in contrast to DVO SLAM, our backend keyframes store compressed plane clouds in lieu of RGB-D data. This fundamentally changes the function g in the second term of the graph-SLAM optimization problem in Equation (2) from an RGB-D point cloud alignment problem to that of aligning compressed plane clouds. The optimal pairwise motion estimate for a keyframe pair is produced by a new odometry algorithm that estimates the relative motion of a pair of keyframes by minimizing the corresponding pairwise plane cloud data error. The plane cloud odometry algorithm applied is described in §3.4.
3.3 An RGB-D point cloud to plane cloud compression algorithm
We seek to compress the sensed RGB-D image by substituting 3D planar regions for collections of locally flat depth measurements. This approach to 3D scene compression significantly reduces the size of the RGB-D image data, which serves to reduce both memory usage and computational costs.
RGB-D point cloud images are converted to plane clouds using a recursive plane fitting algorithm. The algorithm consists of three conceptual steps:

Tile the image using rectangles from a quadtree decomposition and add them to the plane fitting queue.

Remove the topmost tile data in the plane fitting queue and fit a 3D plane to the data.

If the fit is inaccurate, subdivide the tile and add the 4 new subtiles to the plane fitting queue.

If the fit is accurate, add the computed plane coefficients to the plane cloud.


Continue iterating until either the bandwidth or computational time constraints are exceeded or all tiles have been processed.
This algorithm is detailed further in Algorithm 1.
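The tile-and-subdivide loop can be sketched as follows. This is an illustrative stand-in for Algorithm 1, not the authors' implementation: the explicit fit z = a·u + b·v + c, the thresholds, and the tile sizes are assumptions (the paper uses the implicit fit of [42] with depth-adaptive thresholds and 80x60 initial tiles).

```python
from collections import deque
import numpy as np

def fit_plane(depth, u0, v0, w, h):
    """Stand-in explicit fit z = a*u + b*v + c over one tile; returns (coeffs, rmse)."""
    us, vs = np.meshgrid(np.arange(u0, u0 + w), np.arange(v0, v0 + h))
    z = depth[v0:v0 + h, u0:u0 + w]
    A = np.column_stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    coeffs, *_ = np.linalg.lstsq(A, z.ravel(), rcond=None)
    rmse = np.sqrt(np.mean((A @ coeffs - z.ravel()) ** 2))
    return coeffs, rmse

def compress(depth, tile=80, min_tile=10, err_thresh=0.01):
    H, W = depth.shape
    queue = deque((u, v, tile, tile) for v in range(0, H, tile)
                                     for u in range(0, W, tile))
    planes = []
    while queue:
        u, v, w, h = queue.popleft()              # breadth-first processing
        coeffs, err = fit_plane(depth, u, v, w, h)
        if err <= err_thresh:
            planes.append(coeffs)                 # accurate fit: keep the plane
        elif w > min_tile:                        # inaccurate: subdivide in four
            hw, hh = w // 2, h // 2
            queue.extend([(u, v, hw, hh), (u + hw, v, hw, hh),
                          (u, v + hh, hw, hh), (u + hw, v + hh, hw, hh)])
    return planes

# a perfectly planar 160x80 depth image compresses to one plane per initial tile
us, vs = np.meshgrid(np.arange(160), np.arange(80))
depth = 2.0 + 0.001 * us + 0.002 * vs
planes = compress(depth)
```

A real implementation would also stop early when the bandwidth or time budget is exhausted, as step three above requires.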
Each plane fit is accomplished very quickly using an ultra-fast least squares plane fitting formulation [42] for RGB-D sensors that uses the RGB-D camera calibration parameters to fit planes directly to the measured depth image without reconstructing 3D point cloud coordinates.
This accelerated fitting of planes is made possible by a rearrangement of the standard plane fitting error functions for RGB-D data. While standard planar representations adopt the form ax + by + cz + d = 0, we substitute the RGB-D 3D reconstruction equations described in previous work [42], i.e., x = (u − c_x)z/f_x and y = (v − c_y)z/f_y, for the variables x and y, and then simplify the resulting equation to fit directly to only the measured RGB-D depths, saving significant computational cost.
The rearrangement of the terms gives (aα + bβ + c)z + d = 0, where α = (u − c_x)/f_x and β = (v − c_y)/f_y are values that can be precomputed from the known integer pixel coordinates and the calibration parameters. Multiplying this equation by 1/z gives Equation (3).
(3) aα + bβ + c + d(1/z) = 0
In (3) we notice that α and β are functions of only the camera calibration parameters, which determine, for each pixel location (u, v), their values. As a result, the plane equation is linear in the monomials, and only the term involving the plane coefficient d multiplies a function of the measured depth.
For implicit plane fitting using the reorganized Equation (3), the new monomial vector becomes m = [α, β, 1, 1/z]^T and the scatter matrix M = Σ m m^T has elements:
(4) M = Σ [ α^{2}  αβ     α    α/z
            ·      β^{2}  β    β/z
            ·      ·      1    1/z
            ·      ·      ·    1/z^{2} ]
where the symmetric elements of the upper-triangular matrix have been omitted to preserve space. It is important to note that only 4 elements of this matrix, those involving 1/z, depend on the measured depth data and, as such, this matrix requires fewer (~50% fewer) operations to compute. The upper-left 3x3 block of M does not depend upon the measured sensor data. As such, once the calibration parameters of the camera are known, i.e., f_x, f_y, c_x, c_y, many of the elements of M are determined and may be precomputed before any data is measured, with the exception of the 4 elements in the bottom row. The plane coefficients of a fit can be found as the eigenvector, v_min, corresponding to the smallest eigenvalue, λ_min, of the matrix M.
The implementation of our planar region extraction is shown in Algorithm 1. Some considerations are:

The initial RGB-D image is covered by an array of tiles (tile size of 80x60 pixels). This size is important and implicitly defines the size of the largest planar surfaces we expect to observe in the RGB-D image data. Incorrect initial sizes waste computation by either requiring subdivision (too big) or needlessly splitting planes (too small).

Before fitting, a quick check on the tile depth is applied to ensure the existence of depth values at the four corners of the tile. This avoids fitting plane models to tiles which may have significant portions of missing depth data.

Adaptive subdivision using the quadtree [43] model seeks to cover all the depth data with planar surfaces. Thresholds use the average fitting error for the tile to determine which planes are accepted. Thresholds change based on depth to accommodate the depth noise model of the RGB-D camera. This model indicates that errors at maximum range (7m) can be nearly an order of magnitude larger than those at short range (0.5m).

Before fitting, a planarity check establishes the likelihood that a plane will fit the data. The planarity check uses the depth at three of the four corners to define a plane and then evaluates the Euclidean distance of the fourth corner to this plane. A threshold on the error determines whether the tile data is subdivided (large error) or a plane fit is attempted (small error).
It is also noted that our execution of Algorithm 1 is breadth-first. It first detects and fits all of the valid surface area with planar tiles at the current subdivision level and then, if possible, applies detection and fitting to all of the next subdivided elements. The planar map, as a compact representation, provides significant resource savings, while a point-based representation ignores the scene geometry and hence suffers from considerable redundancy. With all planes being approximated, it is essential to have an accurate and concise representation of the scene, that is, a rate-distortion optimal 3D scene structure, estimated by a joint optimization of the number of bits/points used to encode the extracted dense depth field and the quality/accuracy of the representation rendered from the 3D structure [44].
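Our reading of the implicit fit of [42] can be sketched numerically as follows. The calibration values and the synthetic constant-depth plane are assumptions for illustration; the point is that the coefficient vector falls out as the smallest-eigenvalue eigenvector of the 4x4 scatter matrix of monomials [α, β, 1, 1/z], with no 3D reconstruction of the points.

```python
import numpy as np

fx = fy = 500.0
cx, cy = 320.0, 240.0   # assumed calibration parameters (not the paper's)

def fit_plane_implicit(u, v, z):
    """Implicit fit of a*x + b*y + c*z + d = 0 directly from pixel depths."""
    alpha, beta = (u - cx) / fx, (v - cy) / fy
    m = np.column_stack([alpha, beta, np.ones_like(z), 1.0 / z])
    M = m.T @ m                      # 4x4 scatter matrix of Equation (4)
    w, V = np.linalg.eigh(M)         # ascending eigenvalues
    return V[:, 0]                   # eigenvector of the smallest eigenvalue

# synthetic pixels lying on the plane z = 2 m, i.e., 0*x + 0*y + 1*z - 2 = 0
u, v = np.meshgrid(np.arange(300, 340), np.arange(220, 260))
u, v = u.ravel().astype(float), v.ravel().astype(float)
z = np.full_like(u, 2.0)
coeffs = fit_plane_implicit(u, v, z)
coeffs /= np.linalg.norm(coeffs[:3])  # scale so (a, b, c) is a unit normal
if coeffs[2] < 0:
    coeffs = -coeffs                  # resolve the eigenvector sign ambiguity
```

Since only the 1/z column of the monomial matrix touches measured data, the remaining scatter-matrix entries can be precomputed per pixel once the calibration is known, which is the source of the claimed speedup.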
3.4 Compressed Plane Cloud Odometry Estimation and Loop Closure
Each pairwise plane cloud hypothesis is evaluated using the Iterative Closest Algebraic Plane (ICaP) algorithm (Algorithm 2). This algorithm uses the compact representation of the world in terms of planar algebraic surfaces, i.e., surfaces having equation ax + by + cz + d = 0, to establish the likelihood that a given hypothesized plane cloud pair can be aligned, with the intent of aligning the overlapping observed geometric map regions in a manner similar to puzzle-solving. This is accomplished by minimizing an error functional that solves for both the correspondence of planar surfaces between the maplets and the Euclidean transform that aligns these algebraic surfaces. The magnitude of the algebraic alignment error then serves as a goodness-of-fit metric to validate or refute plane pair hypotheses and as a covariance metric for the g^{2}o optimization library.
While Algorithm 2 describes the program flow, our mathematical solution for the alignment of corresponding algebraic plane equations is described in the following paragraphs. These steps occur inside the alignPlanePairs() function from Algorithm 2, which includes the following statement:
(5) 
This function computes the optimal alignment (in the least squares sense) between planes that are hypothesized to have the same equation up to an unknown Euclidean transformation. For our derivation we denote Π = {π_i} and Π' = {π'_i} as two collections of corresponding plane equations, each written as a coefficient vector π = [a, b, c, d]^T. Equation (6) expresses the optimization problem at hand. Here we seek to estimate the transformation, T, that takes the planes of Π' into the coordinate system of the planes Π. Note that Euclidean transformations, when applied to planes, follow the transformation rule π → T^{-T}π for T ∈ SE(3).
(6) (R*, t*) = argmin_{R,t} Σ_i || π_i − T^{-T} π'_i ||^{2}
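The plane transformation rule can be sanity-checked numerically: if points transform as p → Tp (in homogeneous coordinates), planes must transform as π → T^{-T}π so that transformed points still lie on the transformed plane. The transform and plane below are arbitrary illustrative values, not data from the paper.

```python
import numpy as np

# an arbitrary rigid transform: 40-degree yaw plus a translation
theta = np.deg2rad(40.0)
T = np.eye(4)
T[:3, :3] = [[np.cos(theta), -np.sin(theta), 0],
             [np.sin(theta),  np.cos(theta), 0],
             [0, 0, 1]]
T[:3, 3] = [0.3, -0.1, 1.2]

pi = np.array([0.0, 0.0, 1.0, -2.0])   # plane z = 2 (coefficients a, b, c, d)
p = np.array([0.7, -1.4, 2.0, 1.0])    # a homogeneous point on that plane
pi_new = np.linalg.inv(T).T @ pi       # plane transformed by T^{-T}
p_new = T @ p                          # point transformed by T
residual = pi_new @ p_new              # incidence is preserved: stays zero
```

This identity, π^T T^{-1} T p = π^T p, is what lets the alignment in Equation (6) be posed directly on plane coefficient vectors.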
As in least squares point alignment [45], the solution is obtained by decomposing the problem into two steps: solve for the rotation, R, and then solve for the translation, t. After solving for these variables separately, the optimal transformation can be constructed as shown in Equation (7).
(7) T = [ R  t ; 0^T  1 ]
We solve for the rotation, R, that aligns the orientations of corresponding planes by finding the rotation that maximally aligns their normals, i.e., the vectors n_i and n'_i formed by the first three coefficients of each matching plane pair. To do so, we form a covariance matrix of the matching plane normal vectors as shown in Equation (8).
(8) H = Σ_i n'_i n_i^T
We then decompose the covariance matrix using singular value decomposition, H = UΣV^T, to compute the orthogonal transformation that minimizes the squared error, aligning corresponding plane normals as shown in Equation (9).
(9) R = V U^T
As pointed out in the related literature on point alignment [45], the optimal orthogonal transformation from Equation (9) could include a reflection. To find the optimal right-handed rotation, we replace the contribution of the smallest singular value with det(VU^T), obtaining the right-handed rotation incurring the smallest additional error beyond our original solution, as shown in Equation (10).
(10) R = V diag(1, 1, det(VU^T)) U^T
Using our solution for the optimal rotation, we then solve for the translation component of the transform that brings the two plane sets into final alignment. To do so, we form a matrix of the aligned normal vectors of the set , . We also form vectors of the fourth coefficients of the corresponding planes to be aligned; and . The mathematical representation for these variables are shown in the Equations (11).
$$ N = \begin{bmatrix} (R\mathbf{n}_1)^T \\ \vdots \\ (R\mathbf{n}_k)^T \end{bmatrix}, \qquad \mathbf{d} = \begin{bmatrix} d_1 \\ \vdots \\ d_k \end{bmatrix}, \qquad \mathbf{d}' = \begin{bmatrix} d'_1 \\ \vdots \\ d'_k \end{bmatrix} \qquad (11) $$
Using these matrices and vectors, we can write the explicit least-squares solution for the translation component, $\mathbf{t}$, as shown in Equation (12).
$$ \mathbf{t} = (N^T N)^{-1} N^T (\mathbf{d} - \mathbf{d}') \qquad (12) $$
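Equation (12) can be sketched as an ordinary linear least-squares solve. The synthetic data below assumes the plane-offset convention $d'_i = d_i - (R\mathbf{n}_i)^T\mathbf{t}$, which follows from the plane transformation rule $\pi' = H^{-T}\pi$; the data and names are illustrative:

```python
import numpy as np

def translation_from_planes(N_aligned, d_src, d_dst):
    # N_aligned rows are the rotated source normals (R n_i)^T; d_src, d_dst
    # hold the fourth plane coefficients. Solves N t = d - d' in least squares.
    b = d_src - d_dst
    return np.linalg.lstsq(N_aligned, b, rcond=None)[0]

rng = np.random.default_rng(1)
t_true = np.array([0.5, -1.0, 2.0])
N = rng.normal(size=(8, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)   # unit plane normals
d_src = rng.normal(size=8)
d_dst = d_src - N @ t_true                      # offset transformation rule

t_est = translation_from_planes(N, d_src, d_dst)
print(np.allclose(t_est, t_true))               # True
```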
4 Results
Our experiments designate a path for TurtleBots to follow on the 2nd floor of the EPIC building on the UNC Charlotte campus, where the ECE Department and our laboratory are located. The TurtleBots are equipped with an Orbbec Astra Pro RGB-D camera (depth images at 30 fps), and our SLAM system runs on an Intel Core i5-6200U 2.3 GHz laptop without a GPU. We run our TurtleBots in five different hallways, each at least 12 meters long, and generate a total of 1008 keyframes. For concision, we randomly sample 100 keyframes and average the results from these images for the analysis in this section.
Our first analysis examines the memory or, equivalently, bandwidth resources needed for the system to build a "well-described" map, and how much data is required for this compressed representation. A well-described map should contain enough geometric information (e.g., point clouds, planes, etc.) to represent a scene. Ultimately, we seek a balance between the computational cost and the bandwidth needed.
4.1 Bandwidth
Intuitively, one advantage of planes over point clouds for describing a scene is the reduction in the size of the map data transmitted among agents. Table 1 shows the bandwidth required for the data streams of the planar and point-cloud implementations. In our approach, planes are stored in Hessian normal form, so as few as 8 float variables suffice to describe a finite plane, and further redundancy reduction may be possible. Our experiments show that a naive implementation requires approximately 0.044 MB per sensed image for the planar representation versus 1.2 MB for the dense point cloud; while additional savings could be found in either case, it is clear that the planar representation is significantly more compact. Note that RGB data is excluded here because we only consider geometric information. The result shows that the planar map representation requires about 27 times less data to be transmitted than the raw point-cloud representation otherwise would. When communication is required at high frequency, the ability to share more data per unit time becomes increasingly appealing.
                          mean    min     max     std dev
depth images (MB/image)   1.20    1.20    1.20    0.00
planar maps (MB/image)    0.044   0.036   0.050   0.004
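The figures in Table 1 are consistent with a 640x480 depth image at 4 bytes per pixel and 8 floats (32 bytes) per plane fit; the following back-of-the-envelope check treats those parameters as assumptions on our part (they are consistent with the per-map pixel counts reported in Section 4.4, not stated in the table itself):

```python
# Dense depth image: 640x480 pixels at 4 bytes each.
depth_bytes = 640 * 480 * 4
print(depth_bytes / 1e6)             # ~1.23 MB, matching the ~1.2 MB figure

# Planar map: 8 floats (32 bytes) per finite plane.
plane_bytes = 8 * 4
planes_per_image = 0.044e6 / plane_bytes
print(round(planes_per_image))       # ~1375 plane fits in a 0.044 MB map
```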
4.2 Tradeoff and RateDistortion
In our experiments, the computation cost is measured in terms of plane detection and fitting time. We measure the ability of the system to explain a scene by the percentage of surface area (pixels) in the scene that is fitted by planar blocks, and we compare the bandwidth required for sharing the planar map data. We set a time budget of 80, an initial tile size of 24x24 pixels, and a minimal tile size. Fig. 3(a) shows the tradeoff relationship given a fitting-error tolerance. The tolerance is the maximal allowable fitting error: we calculate the average pointwise distance to the planar fit, and a fit is accepted when its error is below this threshold. As we can see, for the solid curve (error tolerance 13.1) the surface area in the scene can be nearly 100% fitted within the given time budget, while for the dashed curve (error tolerance 2.7) not the entire scene is interpreted by planes. The tendency, however, indicates that when ample time is permitted, the algorithm is capable of explaining the scene in fine detail. More generally, as detection time increases, the scene area that has not been fitted by planes decreases and the bandwidth increases. The shading around the curves indicates the standard deviation of our experimental data. The intersection of the two solid curves (or the two dashed curves) is recommended to balance this tradeoff: sufficient surface area of the scene is described by planes at a favorable detection time and bandwidth. In our results for the dashed curves, when 30 of fitting time is provided, 75% of the surface area in the scene can be represented while the bandwidth occupancy remains low (10 KB). Fig. 3(b) shows the total number of planar fits per image, which corresponds with the bandwidth.

We also explore the effect of the error tolerance on our SLAM system. In Fig. 3(a), the solid curves have an error tolerance of 13.1 while the dashed curves have 2.7; the actual average fitting errors are 1.7 and 0.87, respectively. For the same scene and a fixed bandwidth capacity, the larger the error tolerance, the less computational resource is required: for example, at 10 KB, the time comparison is 60 versus 30. On the other hand, when the computational capacity is fixed, a larger error tolerance allows a higher scene-fitted rate with only slightly larger bandwidth occupancy. These results demonstrate the adaptive geometric sensitivity of our approach: we can adjust the computational burden according to the bandwidth capacity and fitting accuracy. Through our distributed graph-SLAM, the agents can adaptively and accurately compress 3D surface data and build maps.

A rate-distortion-efficient map representation plays a significant role in SLAM systems. To further explore the relationship between fitting error and bandwidth, we develop a rate-distortion curve as shown in Fig. 5. The distortion here is interpreted as the fitting error, while the rate is dictated by the bandwidth occupancy of the planar representation. We start by fitting only the largest blocks to the images and then progressively reduce the block size, analyzing the error and the bandwidth throughout this process, as depicted in Fig. 5. The "elbow" of the curve (marked with a red circle) indicates the optimal balance, at about 1.2 pointwise error and around 40 KB of bandwidth. It should be noted that the computation cost is not considered here.
Depending on the constraints on bandwidth capability, the reconstruction can be stopped at any point, i.e., the algorithm is scalable.
4.3 Quadtree Level
In Algorithm 1, we explained that we increase the adaptability of our approach through the following fitting strategy: when there are tile candidates at different Quadtree levels, our algorithm recursively divides the planar tiles into smaller cells and correspondingly shrinks the current fitting size, fitting the awaiting tiles with the current fitting blocks. Moreover, once a planar block fits a surface within the error tolerance, we stop subdividing even if the current block size is larger than the minimal size. In this way, our approach can reduce computation and bandwidth occupancy by lowering the Quadtree level. We choose 24x24 pixels as our initial tile size, so the maximal possible Quadtree level is 5 (tile sizes 24, 12, 6, 3, 2). We investigate 50 images with Quadtree level 5 and show the average experimental statistics in Fig. 5. As before, the bars show the standard deviation of the data. Here, we restrict the fitting Quadtree to different levels: level 1 indicates that all of the fitting blocks are at the initial size (24x24), and level 2 indicates that only one recursion (for both tile candidates and fitting size) is allowed. As the Quadtree goes deeper, less surface is left unfitted while more bandwidth is consumed. We also denote the surface-fitting time for each level on the horizontal axis. As we can see, for images invoking all recursion levels, the scene-fitted rate and the bandwidth occupancy are balanced when the average Quadtree level is between 3 and 4. Level 5 contributes only marginally to the scene-fitted rate while still having a major impact on the bandwidth, which is understandable considering that level 5 contains the smallest fitting tiles yet each tile still takes 8 floats to define. Evidently, trimming the Quadtree to lower levels can save computational cost as well as bandwidth.
Note that here we show the results for images invoking all levels; in real-life scene fitting this is not always the case.
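The recursive fitting strategy described above can be sketched as follows; the plane-fitting routine, tile sizes, and tolerance are illustrative stand-ins rather than our exact implementation:

```python
import numpy as np

def fit_plane(points):
    # Least-squares plane through 3-D points: centroid plus the direction of
    # smallest variance; returns (normal, d, mean absolute point-plane error).
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c)
    n = Vt[-1]
    d = -n @ c
    return n, d, np.abs(points @ n + d).mean()

def fit_tiles(points, size, min_size, tol, r0=0, c0=0, out=None):
    # points: (H, W, 3) grid of back-projected depth samples.
    if out is None:
        out = []
    tile = points[r0:r0 + size, c0:c0 + size].reshape(-1, 3)
    n, d, err = fit_plane(tile)
    if err <= tol or size // 2 < min_size:
        out.append((n, d, size))       # accept (or stop subdividing)
        return out
    h = size // 2
    for dr in (0, h):
        for dc in (0, h):              # recurse on the four quadrants
            fit_tiles(points, h, min_size, tol, r0 + dr, c0 + dc, out)
    return out

# A 24x24 tile over a perfectly planar patch is explained by a single block.
u, v = np.meshgrid(np.arange(24.0), np.arange(24.0))
flat = np.dstack([u, v, 0.1 * u + 0.2 * v])
print(len(fit_tiles(flat, 24, 3, tol=0.1)))   # 1
```

On curved or cluttered tiles the recursion descends, producing more, smaller planar fits, which is the source of the level-versus-bandwidth tradeoff discussed above.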
4.4 Loop Closure
Collaborative mapping in multi-agent environments requires each agent to share information about its past and current states. To detect loop closures in multi-robot contexts, one robot needs to transfer its keyframe data to other agents and also hold keyframes from other agents. Sharing raw sensor data, such as point clouds or other large volumes of data, among the agents of a distributed SLAM system is challenging; hence, transferring limited-size data is critical when dealing with a large collection of robots. We reconstruct one of the hallways on the 2nd floor of the EPIC building at the UNC Charlotte campus and show our comparison in Table 2. In total, 189 keyframes are generated for this map of a 152.4 hallway. Representing this map takes 221.48 MB of memory if point clouds are used but only 7.48 MB if plane fits are used, a nearly 30-fold resource saving. The amount of data to be compared is 58,060,800 pixels versus 245,249 plane fits. The reduction shown here is very significant considering that, in a distributed system, resource consumption grows polynomially with the number of agents.
                  depth image      compact representation
amount of data    58.06M pixels    0.25M plane fits
memory            221.48 MB        7.48 MB
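As a consistency check on Table 2 (assuming, on our part, 640x480 depth images at 4 bytes per pixel and 8 floats per plane fit):

```python
# 189 keyframes of 640x480 depth pixels, and 245,249 plane fits of 32 bytes.
pixels = 189 * 640 * 480
print(pixels)                          # 58060800 pixels, as in Table 2
print(round(pixels * 4 / 2**20, 2))    # 221.48 (MB) for the raw depth data
print(round(245249 * 32 / 2**20, 2))   # 7.48 (MB) for the plane fits
```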
The map loop closures in our approach are generated by identifying overlapping scenes in shared plane sets. Fig. 6 shows 4 loop closures among 5 maps generated by our robots running on the 2nd floor of the EPIC building. The hallway (including the walls and the floor) is described by planar fits, and all maps and corresponding agent trajectories are color-coded. Even a limited number of loop closures or range measurements can provide strong ties between the agents. If two agents, for example the orange and the cyan in Fig. 6, never communicate directly, it is quite feasible for them to be connected through a mutual neighbor (or chain of neighbors) such as the green. Benefiting from this loop-closure strategy, our approach has the potential to form constraints from non-overlapping observations: two non-overlapping parts of a flat surface may share the same plane-representation coefficients (e.g., in Hessian normal form) and thus can be detected as the same surface without overlapping observations.
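The plane-coefficient matching underlying this loop-closure strategy can be sketched as below; the thresholds and matching rule are illustrative assumptions, not our exact criteria, and both plane sets are presumed already expressed in a common frame:

```python
import numpy as np

def matching_planes(planes_a, planes_b, ang_tol_deg=5.0, d_tol=0.05):
    # Flag pairs of Hessian-form planes (n, d) whose unit normals agree
    # within an angular tolerance and whose offsets agree within d_tol.
    matches = []
    cos_tol = np.cos(np.radians(ang_tol_deg))
    for i, (na, da) in enumerate(planes_a):
        for j, (nb, db) in enumerate(planes_b):
            if na @ nb >= cos_tol and abs(da - db) <= d_tol:
                matches.append((i, j))
    return matches

# Two non-overlapping patches of the same wall share one plane equation.
wall = (np.array([1.0, 0.0, 0.0]), -2.0)      # the plane x = 2
floor_ = (np.array([0.0, 0.0, 1.0]), 0.0)     # the plane z = 0
ceiling = (np.array([0.0, 0.0, 1.0]), -3.0)   # the plane z = 3
print(matching_planes([wall, floor_], [wall, ceiling]))   # [(0, 0)]
```

The floor and ceiling share a normal but not an offset, so only the wall is matched, mirroring how coincident planar surfaces can be associated without overlapping views.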
5 Conclusion
Using an RGB-D sensor onboard TurtleBots, we have demonstrated our approach to adaptive compression of 3D surface data and map-building as an extension to DVO SLAM, creating a new form of graph-SLAM. Free choice in the deployment of front-end and back-end applications allows this new form of SLAM to be realized as either a centralized or a decentralized SLAM system, as required by the application. We have shown how the SLAM operations can be distributed across a front-end and back-end system. The front end processes the observations into compact planar representations and then packages the information for the back-end system without requiring high-bandwidth communication. The front end maintains a short-term history which notifies the back end of potential loop closures. The back-end system then operates as the workhorse: it has additional computational resources available to maintain a history of observations and confirm loop closures. The front-end system could be extended to multiple front-end systems, thereby allowing multiple viewpoints to be shared in a collaborative effort. Our approach minimizes the memory burden by supplying a compact data representation and ensures computational efficiency by adapting to processing availability.
Representing the observed scene with compact planar equations naturally compresses the information required for SLAM and map-building. Although fitting surfaces with planes takes very little computation time and introduces some approximation uncertainty, by representing the 3D scene data with planar representations we minimize the computational burden of operating on dense point clouds. With this approach, we are able to quickly interpret a scene by operating on the planar equations to determine how the planes interact with one another, such as whether they intersect to form edges or corners, as well as determining concavity or convexity. This understanding provides insight into what potential objects could exist given the configuration of the observed planes, which also helps identify characteristics of the environment such as open spaces, tight corridors, or obstructions.
Another notable contribution of this work is the adaptive compression capability. SLAM algorithms have always sought computational efficiency, and our proposed method adapts to the user's time and bandwidth requirements. If ample time is permitted, the algorithms have the opportunity to interrogate the scene in finer detail; otherwise, larger generalizations are made that portray a rough estimate of the scene. This allows the architecture to adapt rather than become overwhelmed with sensory data. The algorithms also incorporate detected planar regions as landmarks and use them to improve depth estimation, leading to improvements in both camera-tracking and dense-reconstruction accuracy. The estimated planar regions also provide invaluable semantic information about the scene.
References
 [1] R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," International Journal of Robotics Research 5, pp. 56–68, Dec. 1986.
 [2] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136, IEEE, 2011.

 [3] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous localisation and mapping at the level of objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359, 2013.
 [4] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, "BundleFusion," ACM Transactions on Graphics 36, pp. 1–18, July 2017.
 [5] T. Whelan, R. F. SalasMoreno, B. Glocker, A. J. Davison, and S. Leutenegger, “Elasticfusion: Realtime dense SLAM and light source estimation,” The International Journal of Robotics Research 35(14), pp. 1697–1716, 2016.
 [6] J. Aulinas, Y. Petillot, J. Salvi, and X. Lladó, “The SLAM problem: A survey,” in Artificial Intelligence Research and Development, Proceedings of the 2008 Conference on, pp. 363–371, IOS Press, (Amsterdam, The Netherlands, The Netherlands), 2008.
 [7] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Nov. 2013.
 [8] A. I. Comport, E. Malis, and P. Rives, “Accurate quadrifocal tracking for robust 3d visual odometry,” in Proceedings 2007 IEEE International Conference on Robotics and Automation, pp. 40–45, IEEE, 2007.
 [9] F. Steinbrücker, J. Sturm, and D. Cremers, “Realtime visual odometry from dense rgbd images,” in 2011 IEEE international conference on computer vision workshops (ICCV Workshops), pp. 719–722, IEEE, 2011.
 [10] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “G2o: A general framework for graph optimization,” in 2011 IEEE International Conference on Robotics and Automation, pp. 3607–3613, May 2011.
 [11] T. Bailey, M. Bryson, H. Mu, J. Vial, L. McCalman, and H. DurrantWhyte, “Decentralised cooperative localisation for heterogeneous teams of mobile robots,” in 2011 IEEE International Conference on Robotics and Automation, pp. 2859–2865, IEEE, 2011.
 [12] M. T. Lazaro, L. M. Paz, P. Pinies, J. A. Castellanos, and G. Grisetti, “Multirobot SLAM using condensed measurements,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1069–1076, IEEE, 2013.

[13]
J. Dong, E. Nelson, V. Indelman, N. Michael, and F. Dellaert, “Distributed realtime cooperative localization and mapping using an uncertaintyaware expectation maximization approach,” in
2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 5807–5814, IEEE, 2015.  [14] S. Choudhary, L. Carlone, C. Nieto, J. Rogers, H. I. Christensen, and F. Dellaert, “Distributed mapping with privacy and communication constraints: Lightweight algorithms and objectbased models,” The International Journal of Robotics Research 36(12), pp. 1286–1311, 2017.
 [15] T. Cieslewski, S. Choudhary, and D. Scaramuzza, “Dataefficient decentralized visual SLAM,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2466–2473, IEEE, 2018.
 [16] W. Wang, N. Jadhav, P. Vohs, N. Hughes, M. Mazumder, and S. Gil, “Active Rendezvous for MultiRobot Pose Graph Optimization using Sensing over WiFi,” 2019.
 [17] F. Dellaert, “Factor graphs and GTSAM: A handson introduction,” Tech. Rep. GTRIMCP&R2012002, Georgia Tech. RIM, Sept. 2012.
 [18] A. Cunningham, V. Indelman, and F. Dellaert, “DDFSAM 2.0: Consistent distributed smoothing and mapping,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, May 2013.
 [19] B. Kim, M. Kaess, L. Fletcher, J. Leonard, A. Bachrach, N. Roy, and S. Teller, “Multiple relative pose graphs for robust cooperative mapping,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 3185–3192, May 2010.
 [20] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sumproduct algorithm,” IEEE Transactions on Information Theory 47, pp. 498–519, Sept. 2006.
 [21] A. Trevor, J. Rogers, and H. Christensen, “Planar surface SLAM with 3D and 2D sensors,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 3041–3048, May 2012.
 [22] V. Nguyen, A. Harati, A. Martinelli, N. Tomatis, and B. Sa, “Orthogonal SLAM: a step toward lightweight indoor autonomous navigation,” in In: Proceedings of the IEEE/RSJ Intenational Conference on Intelligent Robots and Systems, IROS, 2006.
 [23] R. F. SalasMoreno, B. Glocken, P. H. J. Kelly, and A. J. Davison, “Dense planar SLAM,” in 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 157–164, 2014.
 [24] M. Kaess, “Simultaneous localization and mapping with infinite planes,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4605–4611, 2015.
 [25] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, “Keyframebased dense planar SLAM,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5110–5117, IEEE, 2017.
 [26] P.-Y. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, and G. Beltrame, "DOOR-SLAM: Distributed, online, and outlier resilient SLAM for robotic teams," arXiv preprint arXiv:1909.12198, 2019.
 [27] E. Montijano, R. Aragues, and C. Sagüés, “Distributed data association in robotic networks with cameras and limited communications,” IEEE Transactions on Robotics 29(6), pp. 1408–1423, 2013.
 [28] E. Nettleton, S. Thrun, H. DurrantWhyte, and S. Sukkarieh, “Decentralised SLAM with lowbandwidth communication for teams of vehicles,” in Field and Service Robotics, pp. 179–188, Springer, 2003.
 [29] D. Tardioli, E. Montijano, and A. R. Mosteo, “Visual data association in narrowbandwidth networks,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2572–2577, IEEE, 2015.
 [30] R. Dubé, A. Gawel, H. Sommer, J. Nieto, R. Siegwart, and C. Cadena, “An online multirobot SLAM system for 3d lidars,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1004–1011, IEEE, 2017.
 [31] T. Cieslewski and D. Scaramuzza, “Efficient decentralized visual place recognition using a distributed inverted index,” IEEE Robotics and Automation Letters 2(2), pp. 640–647, 2017.
 [32] T. Cieslewski and D. Scaramuzza, “Efficient decentralized visual place recognition from fullimage descriptors,” in 2017 International Symposium on MultiRobot and MultiAgent Systems (MRS), pp. 78–82, IEEE, 2017.
 [33] Y. Tian, K. Khosoussi, M. Giamou, J. P. How, and J. Kelly, "Near-optimal budgeted data exchange for distributed loop closure detection," arXiv preprint arXiv:1806.00188, 2018.
 [34] Y. Tian, K. Khosoussi, and J. P. How, "A resource-aware approach to collaborative loop closure detection with provable performance guarantees," arXiv preprint arXiv:1907.04904, 2019.
 [35] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in realtime,” in 2011 International Conference on Computer Vision, pp. 2320–2327, 2011.
 [36] J. Engel, T. Schöps, and D. Cremers, “LSDSLAM: Largescale direct monocular SLAM,” in European Conference on Computer Vision (ECCV), September 2014.
 [37] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” in arXiv:1607.02565, July 2016.
 [38] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect visual odometry for monocular and multicamera systems," IEEE Transactions on Robotics and Automation, Nov. 2016.
 [39] Y. Taguchi, Y.D. Jian, S. Ramalingam, and C. Feng, “Pointplane SLAM for handheld 3D sensors,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5182–5189, May 2013.
 [40] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), The MIT Press, 2005.
 [41] G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard, “A tutorial on graphbased SLAM,” IEEE Intelligent Transportation Systems Magazine 2, pp. 31–43, winter 2010.
 [42] J. Papadakis and A. R. Willis, "Real-time surface fitting to RGB-D sensor data," in SoutheastCon 2017, pp. 1–7, March 2017.
 [43] R. Finkel, “Quad trees: A data structure for retrieval on composite keys.,” Acta Inf. 4, pp. 1–9, 03 1974.
 [44] E. Imre, A. A. Alatan, and U. Gudukbay, “Ratedistortion based piecewise planar 3d scene geometry representation,” in 2007 IEEE International Conference on Image Processing, 5, pp. V – 113–V – 116, 2007.
 [45] S. Umeyama, “Leastsquares estimation of transformation parameters between two point patterns,” IEEE Trans. Pattern Anal. Mach. Intell. 13, pp. 376–380, Apr. 1991.