Abstract scene graphs of environments represent the key structures, objects and interactions in a semantically rich and compact way. Ideally, an intelligent embodied device should build a scene graph rapidly and on-the-fly in a new environment, using on-board sensing and processing to enable immediate intelligent action. Identifying high-level abstractions is challenging and can require an expensive search-and-test over both the type of abstraction and the subset of elements to which it applies. Pre-trained neural networks can amortize this cost by directly proposing candidate abstractions. Most algorithms for scene graph construction operate either by post-processing a low-level representation [McCormac:etal:ICRA2017] or by committing to abstractions of the low-level data at measurement time [Salas-Moreno:etal:CVPR2013][Sucar:etal:3DV2020].
A more ambitious target is general incremental abstraction. Where abstract scene elements can be identified from single observations, they should be immediately added to a scene representation. More commonly, several observations may be needed to identify abstractions with high confidence, requiring the system to temporarily store low-level information (e.g. raw geometry as point clouds). As exploration continues, abstraction should operate continually and hierarchically on both stored and incoming observations, gradually replacing the raw elements in the map. A scene graph should therefore at any point in time be a hybrid mix of raw and abstract elements, potentially of many kinds.
The correct way to accumulate many different measurements and priors into coherent estimates is via probabilistic inference, and a factor graph represents the probabilistic structure of inference problems in SLAM [Dellaert:Kaess:Foundations2017]. Abstract scene elements can be combined into this estimation framework and probabilistic inference can refine and confirm or reject the abstractions. A hybrid, incrementally abstracting map is therefore represented by a complicated, heterogeneous and dynamically changing factor graph, where new raw structure is continually added while abstractions are tested, and replace raw structure if those tests are passed.
There have been few attempts to solve the true, complex inference problems these factor graphs represent in real-time systems. Most SLAM systems that go beyond sparse point cloud processing make severe approximations to the true inference problem by artificially layering estimation [McCormac:etal:ICRA2017], baking in specific variable orderings [Zhou:etal:ISMAR2020], or by using alternation to avoid joint estimation [Newcombe:etal:ISMAR2011]. We believe that these choices are often related to the rigidity of existing optimisation algorithms that need to exploit the fixed structure of an optimisation problem to achieve efficient performance on standard processing hardware.
Gaussian Belief Propagation (GBP) has recently been proposed as a strong candidate algorithm for real-time inference of arbitrary and dynamically changing factor graphs [Davison:Ortiz:ARXIV2019][Ortiz:etal:ARXIV2021]. Its computational structure is node-wise parallel and it operates by local message passing on a factor graph. GBP trades the optimality of global updates for more flexible distribution of compute and memory, meaning it can better exploit parallel hardware and operate without assumptions about the global structure of an estimation problem. When implemented on a graph processor [Graphcore], GBP has already been shown to have speed advantages over global methods for inference on static bundle adjustment graphs [Ortiz:etal:CVPR2020].
In this work, we present a general method for incrementally constructing scene graphs in real-time based on two novel components: 1) our incremental abstraction framework and 2) distributed optimisation via GBP.
First, our incremental abstraction framework combines amortized inference from an off-the-shelf network with probabilistic inference to robustly identify abstract scene elements. Additionally, upon accepting a scene element we linearise and join factors connecting common nodes to yield a more semantic, dense and compressed representation.
Second, enabled by our novel routing procedure, we are the first to use GBP on a graph processor for inference of dynamic heterogeneous factor graphs. The time per iteration of GBP is independent of the graph structure and we demonstrate the resulting advantages over direct methods.
While our framework is general and can be used with any abstract feature detector, we experiment with planar abstractions, which provide a compact way to densely represent geometry in many human-made environments. In evaluations, our framework reconstructs accurate planar scene graphs of real indoor environments at different levels of granularity, demonstrates significant graph compression and improves tracking compared to ablated baselines.
II Related Work
For long-term SLAM operation it is vital to manage computational cost by limiting factor graph growth to match the spatial extent of the map. Towards this goal, [Johannsson:etal:ICRA2013] reuses existing keyframes for new measurements and [Ila:etal:TRO2009] uses filtering to add only non-redundant nodes and edges. Alternatively, compression by node removal [Carlevaris:etal:TRO2014][Mazuran:etal:IJRR2016], attempts to recover the best non-linear and sparse factor graph that approximates the marginalised distribution. These methods target compression with minimal information loss, while we focus on semantic-guided compression.
Recent SLAM research has investigated incremental scene reconstruction with semantic elements included in the factor graph. Object-centric SLAM methods use object recognition and correspondence in the front-end to build scene graphs of objects [Salas-Moreno:etal:CVPR2013][Sucar:etal:3DV2020]. Planar SLAM methods operate similarly, either detecting planes from an RGB-D input [Salas-Moreno:etal:ISMAR2014][Kaess:ICRA2015][Hsiao:etal:ICRA2018][Hosseinzadeh:etal:ACCV2018][Zhou:etal:ISMAR2020], or in the monocular case from CNN predictions [Yang:etal:IROS2016][Hosseinzadeh:etal:ICRA2019] or from the reconstructed points [Arndt:etal:IROS2020]. All previous methods rely on accurate initial plane predictions and, unlike our method, cannot confirm or reject planes within the inference process, nor compress the factor graph. Additionally, for real-time operation, these methods often layer optimisation [Yang:etal:IROS2016] or construct reduced systems to leverage specific problem structure [Zhou:etal:ISMAR2020].
Recent methods have built hierarchical 3D scene graphs containing layers of objects, rooms and buildings in which edges describe non-probabilistic relations [Armeni:etal:ICCV2019][Rosinol:etal:ARXIV2020]. [Armeni:etal:ICCV2019] constructs the graph from an annotated mesh model while [Rosinol:etal:ARXIV2020] is an incremental system that also tracks human meshes and has shown impressive results on simulated datasets.
III-A Factor Graphs
Factor graphs are bipartite graphs that consist of variable nodes $X = \{x_i\}$, factor nodes $F = \{f_s\}$ and edges $E$. They are commonly used as a graphical representation of the factorisation of probability distributions. Given the factorisation:
$$p(X) = \prod_s f_s(X_s),$$
where $X_s \subseteq X$, the factor graph is constructed by creating nodes for all variables and factors and then connecting each factor $f_s$ to the variables $X_s$ with an undirected edge.
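As an illustration of this construction, the following minimal Python sketch stores a factor graph as a set of variable names plus a mapping from each factor to the variables in its scope. The class name `FactorGraph`, its methods and the toy factorisation are hypothetical and chosen only for this example; they are not taken from the system described in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FactorGraph:
    """Bipartite graph: variable nodes, factor nodes and the edges between them."""
    variables: set = field(default_factory=set)
    factors: dict = field(default_factory=dict)  # factor name -> tuple of variable names

    def add_factor(self, name, variable_names):
        # Create any missing variable nodes, then record the factor's scope.
        self.variables.update(variable_names)
        self.factors[name] = tuple(variable_names)

    def edges(self):
        # One undirected edge per (factor, variable-in-scope) pair.
        return [(f, v) for f, scope in self.factors.items() for v in scope]

    def neighbours(self, variable):
        # Factors adjacent to a variable node, in insertion order.
        return [f for f, scope in self.factors.items() if variable in scope]


# Toy factorisation p(X) = f_a(x1, x2) f_b(x2, x3) f_c(x3).
graph = FactorGraph()
graph.add_factor("f_a", ["x1", "x2"])
graph.add_factor("f_b", ["x2", "x3"])
graph.add_factor("f_c", ["x3"])
```

Storing only the factor scopes keeps edits cheap: adding or removing a factor touches a single dictionary entry, which is convenient for a graph that changes during operation.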
III-B Belief Propagation
Belief propagation (BP) [Pearl:book1988] is a distributed inference algorithm that infers the marginal distribution $p(x_i)$ for each variable from the joint distribution. On each iteration, all factor nodes send messages to neighbouring variable nodes. These messages are aggregated to update all variable node beliefs before all variable nodes send messages to neighbouring factor nodes. BP only involves node-wise local compute and message passing, making it trivial to distribute on highly parallel hardware by placing nodes on separate cores. Factor to variable messages are computed as:
$$\mu_{f_s \to x_i}(x_i) = \sum_{X_s \setminus x_i} f_s(X_s) \prod_{x_j \in X_s \setminus x_i} \mu_{x_j \to f_s}(x_j),$$
where $X_s \setminus x_i$ denotes all elements in the set $X_s$ apart from $x_i$. Variable to factor messages are computed as:
$$\mu_{x_i \to f_s}(x_i) = \prod_{f_t \in n(x_i) \setminus f_s} \mu_{f_t \to x_i}(x_i),$$
where $n(x_i)$ is the set of factors adjacent to $x_i$, and variable beliefs are updated by simply taking the product of incoming messages from adjacent factors: $b(x_i) = \prod_{f_s \in n(x_i)} \mu_{f_s \to x_i}(x_i)$. See [Bishop:Book2006] for a full derivation.
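The message-passing rules can be checked on a small discrete example. The sketch below runs sum-product BP on a three-variable chain with binary variables and compares the resulting belief with the brute-force marginal. The chain, the factor tables and all variable names are made up for illustration; the paper itself works with Gaussian rather than discrete distributions.

```python
import numpy as np

# Chain x1 - f_a - x2 - f_b - x3 with binary variables and a unary prior on x1.
prior = np.array([0.7, 0.3])                  # f_0(x1)
f_a = np.array([[0.9, 0.1], [0.2, 0.8]])      # f_a(x1, x2)
f_b = np.array([[0.6, 0.4], [0.3, 0.7]])      # f_b(x2, x3)

# Variable-to-factor message: product of messages from the variable's other factors.
m_x1_fa = prior                               # x1 only hears from its prior factor
m_x3_fb = np.ones(2)                          # leaf variable x3 sends a uniform message

# Factor-to-variable messages: multiply in incoming messages, sum out the other variable.
m_fa_x2 = f_a.T @ m_x1_fa                     # sum over x1
m_fb_x2 = f_b @ m_x3_fb                       # sum over x3

# Belief at x2 is the normalised product of incoming factor messages.
belief_x2 = m_fa_x2 * m_fb_x2
belief_x2 /= belief_x2.sum()

# Brute-force marginal from the full joint, for comparison.
joint = prior[:, None, None] * f_a[:, :, None] * f_b[None, :, :]
exact_x2 = joint.sum(axis=(0, 2)) / joint.sum()
```

Because the chain is a tree, the belief and the brute-force marginal agree exactly.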
III-C GBP for Spatial Inference Problems
BP is exact for trees, but in general it is not guaranteed to converge on the true marginals when applied to graphs with loops. Nonetheless, empirical results strongly suggest that when all distributions are Gaussian, BP performs robustly across a range of inference tasks [Bickson:PhDThesis:2008][Davison:Ortiz:ARXIV2019].
In our GBP framework, we utilise the Gaussian information form, which is parameterised by the information vector $\eta$ and precision matrix $\Lambda$:
$$\mathcal{N}^{-1}(x; \eta, \Lambda) \propto \exp\left(-\tfrac{1}{2} x^\top \Lambda x + \eta^\top x\right).$$
Measurements are modelled as a deterministic function plus centred Gaussian noise, $z = h(X) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma)$. Spatial estimation problems tend to have the property of locality, in which each independent measurement $z_s$ depends on only a small subset of the variables $X_s$. The likelihood therefore has the factorised form:
$$p(z \mid X) = \prod_s p(z_s \mid X_s),$$
and the factor graph for the posterior consists of a variable node for each $x_i$ and a factor node for each $z_s$.
We are interested in estimating both the configuration $X_{\text{MAP}}$ that maximises the posterior distribution given observed measurements $z$ and the associated uncertainty given by the marginal covariances. Using Bayes' rule, we can write the maximisation as:
$$X_{\text{MAP}} = \arg\max_X p(X \mid z) = \arg\max_X \prod_s l_s(X_s),$$
where $l_s(X_s)$ is proportional to $p(z_s \mid X_s)$ but is a function of the variables $X_s$ with the measurements $z_s$ as parameters [Dellaert:Kaess:Foundations2017].
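Gaussian Belief Propagation performs posterior inference on the factor graph by computing the Gaussian marginals $p(x_i) = \mathcal{N}(x_i; \mu_i, \Sigma_i)$ for all variables, where $\mu_i$ and $\Sigma_i$ are the marginal mean and covariance. For factors with non-linear measurement functions $h(x)$, the density is not a Gaussian and so we use the first order Taylor expansion about a linearisation point $x_0$, $h(x) \approx h(x_0) + J(x - x_0)$ with Jacobian $J$, to yield the following Gaussian likelihood (dropping all subscripts) [Davison:Ortiz:ARXIV2019]:
$$\eta = J^\top \Sigma^{-1}\left[J x_0 + z - h(x_0)\right], \qquad \Lambda = J^\top \Sigma^{-1} J.$$
During inference, each non-linear factor is relinearised independently when the distance between the current belief means and the linearisation point exceeds a threshold.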
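A minimal NumPy sketch of the linearisation step follows, assuming the standard first-order form in which the factor's precision is $J^\top \Sigma^{-1} J$ and its information vector is $J^\top \Sigma^{-1}[J x_0 + z - h(x_0)]$. The function names, the Euclidean distance used in the relinearisation test and the toy measurement are assumptions made for illustration.

```python
import numpy as np


def linearise_factor(h, jacobian, x0, z, sigma):
    """Information-form parameters (eta, lam) of a factor linearised about x0."""
    J = jacobian(x0)
    sigma_inv = np.linalg.inv(sigma)
    lam = J.T @ sigma_inv @ J
    eta = J.T @ sigma_inv @ (J @ x0 + z - h(x0))
    return eta, lam


def needs_relinearisation(belief_mean, x0, threshold):
    # Hypothetical test: Euclidean distance between belief mean and linearisation point.
    return np.linalg.norm(belief_mean - x0) > threshold


# Toy linear measurement h(x) = A x, for which the linearisation is exact for any x0.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
z = np.array([4.0, 3.0])
eta, lam = linearise_factor(lambda x: A @ x, lambda x: A,
                            np.array([0.5, -0.5]), z, np.eye(2))
mean = np.linalg.solve(lam, eta)   # recovers the solution of A x = z
```

For a linear measurement function the recovered mean does not depend on the linearisation point, which makes this a convenient sanity check before moving to non-linear factors.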
IV Incremental Planar Abstraction Framework
We combine amortized inference via a neural network with distributed probabilistic inference via GBP to incrementally abstract factor graphs in SLAM. We emphasise that the master representation of the scene is always the factor graph and during online operation GBP is continually performing inference on this graph.
GBP is interrupted to perform editing of the factor graph, which occurs when: i) adding a new keyframe (Sec. IV-A, IV-B) ii) testing plane hypotheses (Sec. IV-C) or iii) merging planes (Sec. IV-D). These components are detailed in the following sections, and a high-level overview of the system-level operation is described in Algorithm 1.
IV-A Feature Extraction from Keyframes
We construct the abstract scene graph on top of a feature-based monocular SLAM system. We choose a sparse front-end for experimental purposes while we concentrate on the graphical back-end in this paper.
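Given a stream of live images, we use the ORB-SLAM2 [Mur-Artal:etal:TRO2017] front-end to build the raw factor graph composed of keyframes, points and reprojection factors (Figure 2 left). Reprojection factors penalise the distance between a matched feature at image coordinates $z_{ij}$ in keyframe $k_i$ and the projection of the corresponding point $p_j$ in the image plane. Reprojection factors have the form:
$$f_r(k_i, p_j) \propto \exp\left(-\tfrac{1}{2} \left\| z_{ij} - \text{proj}(R_i p_j + t_i) \right\|^2_{\Sigma_r}\right),$$
where proj is the projection operator, $R_i$ and $t_i$ are the rotation and translation derived from keyframe pose $k_i$, and $\|\cdot\|_{\Sigma_r}$ is the Mahalanobis norm under the measurement covariance $\Sigma_r$.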
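The residual inside a reprojection factor can be sketched as follows, assuming a pinhole camera and a keyframe pose that maps world coordinates into the camera frame. The function names and the intrinsics `(fx, fy, cx, cy)` are illustrative assumptions; the paper does not specify the camera model.

```python
import numpy as np


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point expressed in the camera frame."""
    x, y, z = point_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])


def reprojection_residual(z_obs, R, t, point_world, intrinsics):
    """Observed feature location minus the projection of the map point."""
    return z_obs - project(R @ point_world + t, *intrinsics)


intrinsics = (500.0, 500.0, 320.0, 240.0)     # hypothetical fx, fy, cx, cy
R, t = np.eye(3), np.zeros(3)                 # camera at the world origin
point = np.array([0.2, -0.1, 2.0])
predicted = project(R @ point + t, *intrinsics)
residual = reprojection_residual(np.array([371.0, 214.0]), R, t, point, intrinsics)
```

In this toy setup the point projects to pixel (370, 215), so a feature observed at (371, 214) gives a residual of one pixel in each direction.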
IV-B Integrating Plane Predictions
We denote the homogeneous plane vector $\pi = (n^\top, d)^\top$ [Hartley:Zisserman:Book2004]. Points $x$ lying in a plane satisfy $n^\top x + d = 0$, where $n$ is the normal vector and $d$ is the distance from the origin. The homogeneous plane vector is transformed between coordinate frames using the inverse transpose of the homogeneous point transform $T$: $\pi' = T^{-\top}\pi$. For optimisation, we represent planes using a minimal parametrisation and denote by $\ominus$ the operator that subtracts plane parameters in our chosen minimal form.
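The inverse-transpose rule for planes can be verified numerically: if a point lies on a plane, the transformed point must lie on the transformed plane. The sketch below assumes a 4x4 rigid point transform and the convention that a homogeneous point on the plane has zero inner product with the plane vector; the function name `transform_plane` and the numbers are hypothetical.

```python
import numpy as np


def transform_plane(T, plane):
    """Map a homogeneous plane (n, d) through the 4x4 point transform T."""
    return np.linalg.inv(T).T @ plane


# Plane z = 1 written as (n, d) with n . x + d = 0.
plane = np.array([0.0, 0.0, 1.0, -1.0])
point = np.array([0.3, -0.2, 1.0, 1.0])       # homogeneous point on the plane

# Rigid transform: rotation of 0.5 rad about the x axis plus a translation.
c, s = np.cos(0.5), np.sin(0.5)
T = np.eye(4)
T[:3, :3] = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
T[:3, 3] = [0.5, -1.0, 2.0]

new_plane = transform_plane(T, plane)
new_point = T @ point
```

For a rigid transform the normal is simply rotated, so it stays unit length and no renormalisation is needed.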
For each keyframe, we run a forward pass of the PlaneRCNN model [Liu:etal:CVPR2019] to predict a set of plane parameters and corresponding segmentation masks. The predictions are filtered by removing planes with especially small or disconnected segmentation masks. The resulting plane hypotheses are then integrated into the existing factor graph as plane hypothesis nodes (Figure 2 middle).
Using the segmentation mask for each plane, we determine the map points that are predicted to lie in the plane and introduce plane-point distance factors, connecting the hypothesised plane $\pi = (n^\top, d)^\top$ to each of these points $p$. Plane-point distance factors penalise the perpendicular distance from a point to a plane and, for a unit normal $n$, have the form:
$$f_d(\pi, p) \propto \exp\left(-\frac{(n^\top p + d)^2}{2\sigma_d^2}\right).$$
Each plane hypothesis is also connected to the keyframe in which it was predicted via a plane prediction factor. Plane prediction factors treat the network prediction $\hat{\pi}$ of the plane parameters as a measurement and take the form:
$$f_\pi(\pi, k) \propto \exp\left(-\tfrac{1}{2}\left\| T_{cw}^{-\top}\pi \ominus \hat{\pi} \right\|^2_{\Sigma_\pi}\right),$$
where $T_{cw}$ is the transformation from the global coordinate frame to the coordinate frame of the camera $k$ and $\ominus$ subtracts plane parameters in the minimal parametrisation. In experiments we set $\Sigma_\pi$ to be very large as the network plane parameter prediction can be unreliable.
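The perpendicular distance used by the plane-point distance factors, and a corresponding Gaussian log-density, can be sketched as below. The noise scale `sigma_d` and both function names are assumptions introduced for this example, and the log-density is left unnormalised.

```python
import numpy as np


def plane_point_distance(plane, point):
    """Signed perpendicular distance from a 3D point to the plane (n, d)."""
    n, d = plane[:3], plane[3]
    return (n @ point + d) / np.linalg.norm(n)


def plane_point_log_density(plane, point, sigma_d):
    """Unnormalised Gaussian log-density of the distance, with noise scale sigma_d."""
    r = plane_point_distance(plane, point)
    return -0.5 * (r / sigma_d) ** 2


plane = np.array([0.0, 0.0, 1.0, -1.0])        # the plane z = 1
distance = plane_point_distance(plane, np.array([5.0, 5.0, 3.0]))
log_density = plane_point_log_density(plane, np.array([5.0, 5.0, 3.0]), sigma_d=0.5)
```

Dividing by the norm of the normal keeps the distance metric even if the plane vector drifts away from unit length during optimisation.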
We emphasise that our framework is not specific to planes. Any abstract scene element with an appropriate compatibility factor and inference model to generate hypotheses could be used.
IV-C Plane Hypothesis Confirmation and Rejection
Having added the plane hypotheses to the raw factor graph, GBP carries out inference on the hybrid graph (Figure 2 middle) and converges to the configuration that minimises the energy of the constraints in the graph. To allow bad plane hypotheses to be treated as outlying measurements and have only a small contribution to the graph energy, we employ the robust Tukey loss function for all factors using covariance rescaling as in [Davison:Ortiz:ARXIV2019] and [Agarwal:etal:ICRA2013].
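One way to realise a robust loss through covariance rescaling is to scale a factor's precision so that its quadratic energy at the current Mahalanobis distance equals the robust loss at that distance. The sketch below assumes the standard Tukey biweight loss with cut-off `c`; the exact loss variant and cut-off used in the paper are not stated, so both are assumptions, as are the function names.

```python
import numpy as np


def tukey_loss(m, c):
    """Tukey biweight loss at Mahalanobis distance m with cut-off c."""
    if m >= c:
        return c**2 / 6.0                      # constant energy for outliers
    return (c**2 / 6.0) * (1.0 - (1.0 - (m / c) ** 2) ** 3)


def precision_scale(m, c):
    """Factor k such that the rescaled energy 0.5 * k * m^2 equals the Tukey loss."""
    if m == 0.0:
        return 1.0
    return 2.0 * tukey_loss(m, c) / m**2


# Equivalent view: the factor covariance is multiplied by 1 / k.
k_inlier = precision_scale(0.5, 3.0)
k_outlier = precision_scale(10.0, 3.0)
```

Residuals far beyond the cut-off receive a very small scale, so the corresponding factor contributes almost nothing to the belief updates, while near-zero residuals keep a scale close to one.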
After convergence, for each plane hypothesis, we go through all connected points and read off from the factor graph the likelihood that the point lies in the plane. The likelihood is the plane-point distance factor density evaluated at the converged belief means of the point and the plane. To determine whether to confirm or reject each plane hypothesis, we use the proportion of points with likelihood above a threshold and the number of iterations the hypothesis has been in the graph.
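The confirmation test can be sketched as a simple decision rule over per-point likelihoods. How the inlier proportion and the iteration count are combined is not specified in the text, so the rule below (wait for a minimum number of iterations, then compare the inlier fraction with a threshold) is an assumption, and every parameter name and value is hypothetical.

```python
import numpy as np


def plane_decision(distances, sigma_d, min_likelihood, min_inlier_fraction,
                   iterations, min_iterations):
    """Return 'pending', 'confirm' or 'reject' for one plane hypothesis."""
    # Unnormalised plane-point likelihood of each connected point.
    likelihoods = np.exp(-0.5 * (np.asarray(distances) / sigma_d) ** 2)
    inlier_fraction = np.mean(likelihoods > min_likelihood)
    if iterations < min_iterations:
        return "pending"                       # hypothesis has not been tested long enough
    return "confirm" if inlier_fraction >= min_inlier_fraction else "reject"


distances = [0.0, 0.01, 0.02, 0.5]             # three points near the plane, one far away
decision = plane_decision(distances, sigma_d=0.05, min_likelihood=0.5,
                          min_inlier_fraction=0.7, iterations=50, min_iterations=20)
```

With these toy values three of the four points pass the likelihood test, giving an inlier fraction of 0.75 and a confirmed plane.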
If rejected, the plane hypothesis node and all adjacent factors are removed from the graph, as in Figure 2. If confirmed, the plane hypothesis node and the points that pass the likelihood test are replaced by a single rigid body variable node with just 6 degrees of freedom, as in Figure 2 right. We use a rigid body to represent the plane as we assume that the relative configuration of the planar points has now been determined and that they can be optimised as a single rigid planar body. This compression could also be used for small rigid objects.
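The reprojection factors connected to the planar points, and the plane prediction factors connected to the plane hypothesis node, are transferred to the new rigid body variable node $b$. For these factors, the rigid body transformation becomes an argument of the measurement function while the fixed point positions $p_0$ and plane parameters $\pi_0$ are parameters as we replace:
$$p \to R_b p_0 + t_b, \qquad \pi \to T_b^{-\top}\pi_0,$$
where $T_b$, $R_b$ and $t_b$ are the transformation matrix, rotation matrix and translation derived from the minimal vector $b$. Plane prediction and reprojection factors become:
$$f_\pi(b, k) \propto \exp\left(-\tfrac{1}{2}\left\| T_{cw}^{-\top} T_b^{-\top}\pi_0 \ominus \hat{\pi} \right\|^2_{\Sigma_\pi}\right), \qquad f_r(k_i, b) \propto \exp\left(-\tfrac{1}{2}\left\| z_{ij} - \text{proj}\left(R_i (R_b p_0 + t_b) + t_i\right) \right\|^2_{\Sigma_r}\right),$$
with all other quantities as in the original factors.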
As all transferred reprojection factors connect to the same pair of nodes, they can be linearised and combined into a single factor by taking the product of the factors. This further compresses the graph and reduces the computation at each iteration of GBP. Note that relinearising this combined factor requires linearising the contributions from all of the original reprojection factors; however, this only occurs on a small proportion of GBP iterations.
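In the Gaussian information form, the product of linearised factors over the same variables is obtained by summing their information vectors and precision matrices. The short sketch below shows this for two one-dimensional factors; the function name `combine_factors` and the numbers are illustrative only.

```python
import numpy as np


def combine_factors(factors):
    """Product of Gaussian factors over the same variables: sum eta and lambda."""
    etas, lams = zip(*factors)
    return np.sum(etas, axis=0), np.sum(lams, axis=0)


# Two factors on one scalar variable, each given as (eta, lambda).
f1 = (np.array([1.0]), np.array([[1.0]]))      # mean 1, variance 1
f2 = (np.array([3.0]), np.array([[1.0]]))      # mean 3, variance 1
eta, lam = combine_factors([f1, f2])
mean = np.linalg.solve(lam, eta)               # mean of the combined factor
```

The combined factor has the same size as each original factor, so the per-iteration cost no longer grows with the number of reprojection factors that were merged.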
IV-D Merging Planes
As there may be multiple plane hypotheses for the same geometric plane, we merge confirmed planes that have: i) aligned normal vectors ii) small perpendicular separation and iii) large overlap. The latter two criteria are estimated by sampling points in the plane. Once two planes are chosen to be merged, we replace the two rigid body planes with a single rigid body plane. The new plane normal is the average of the merged normals and the factors connecting to the merged planes are transferred to the new plane.
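Two of the merging ingredients can be sketched directly: the normal alignment test and the averaged normal, together with a sampled estimate of perpendicular separation. The thresholds, the use of a mean absolute distance over sampled points and all function names are assumptions for illustration; the overlap criterion is omitted.

```python
import numpy as np


def normals_aligned(n1, n2, max_angle_deg):
    """True if the angle between the two normals is within max_angle_deg."""
    cos_angle = n1 @ n2 / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))) <= max_angle_deg


def mean_separation(plane, samples):
    """Mean perpendicular distance from points sampled on one plane to another plane."""
    n, d = plane[:3], plane[3]
    return np.mean(np.abs(samples @ n + d)) / np.linalg.norm(n)


def merged_normal(n1, n2):
    """Average of two normals, renormalised to unit length."""
    n = n1 + n2
    return n / np.linalg.norm(n)


angle = np.radians(5.0)
n1 = np.array([0.0, 0.0, 1.0])
n2 = np.array([0.0, np.sin(angle), np.cos(angle)])
samples = np.array([[0.0, 0.0, 1.02], [1.0, 2.0, 1.02]])   # points sampled on plane 2
separation = mean_separation(np.array([0.0, 0.0, 1.0, -1.0]), samples)
n_merged = merged_normal(n1, n2)
```

Here the two normals differ by 5 degrees, so they count as aligned under a 10 degree threshold but not under a 2 degree one, and the merged normal sits halfway between them.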
V Dynamic Routing on the IPU
We implement our incremental abstraction method on a single Graphcore MK1 IPU chip [Graphcore] as in [Ortiz:etal:CVPR2020]. The IPU is a large chip composed of 1216 independent cores arranged in a fully connected graph structure. Like a GPU it is highly parallel, but due to its interconnect structure and the local memory on each core, it has breakthrough performance for algorithms with a sparse message passing character.
Inference on static factor graphs is achieved in [Ortiz:etal:CVPR2020] by mapping the factor graph onto the IPU cores with the communication pattern matching the topology of the graph. This approach, however, cannot be applied to dynamic graphs, as the communication pattern must be precompiled and recompilation is expensive. To enable parallel inference of dynamic factor graphs, we develop a routing procedure that can manage arbitrary graph topologies while operating on a fixed precompiled communication pattern.
Our routing solution introduces densely connected routing nodes to mediate the transfer of messages through the factor graph. Routing nodes have knowledge about the structure of a part of the factor graph, stored in a routing matrix. As illustrated in Figure 3, when the factor graph is edited, the routing matrix can be updated to enable inference on the new factor graph without changing the compiled communication pattern.
To distribute the routing, we create a routing node for each pair of node types that may be connected. One additional requirement is that we specify the maximum number of nodes of each type in the graph and the maximum number of edges each type of node can have — these are weak requirements and the limits can be set generously.
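The idea of editing a routing matrix while keeping buffer shapes fixed can be illustrated with a small NumPy sketch. This is a simplified, single-process stand-in for the IPU implementation: the capacities, the use of `-1` for unused slots and the per-variable payloads are all assumptions, and real GBP messages are per edge rather than per variable.

```python
import numpy as np

# Hypothetical capacities, fixed once when the communication pattern is compiled.
MAX_FACTORS, MAX_EDGES_PER_FACTOR = 4, 2


def empty_routing():
    # routing[f, e] holds the variable index on edge slot e of factor f, or -1 if unused.
    return np.full((MAX_FACTORS, MAX_EDGES_PER_FACTOR), -1, dtype=int)


def route_to_factors(variable_payloads, routing):
    """Gather the payload of each connected variable into fixed-size factor slots."""
    gathered = np.zeros(routing.shape + variable_payloads.shape[1:])
    used = routing >= 0
    gathered[used] = variable_payloads[routing[used]]
    return gathered


payloads = np.array([[1.0], [2.0], [3.0]])     # one toy payload per variable
routing = empty_routing()
routing[0] = [0, 1]                            # factor 0 connects variables 0 and 1
routing[1] = [1, 2]                            # factor 1 connects variables 1 and 2
before = route_to_factors(payloads, routing)

routing[1] = [0, 2]                            # graph edit: rewire factor 1 only
after = route_to_factors(payloads, routing)
```

Only the integer entries of the routing matrix change when the graph is edited; the gathered buffer keeps the same shape, which is the property that allows a fixed compiled communication pattern to serve a changing graph.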
The routing procedure enables optimisation of arbitrary graph topologies at minimal time cost; only increasing the time per iteration from to . As GBP typically converges in less than 100 iterations [Ortiz:etal:CVPR2020], inference remains comfortably within real-time constraints.
VI Implementation Settings
For all experiments, we use the following default parameters: , pixels, , , , , , , , iterations, iterations, iterations and . We use message dropout of and message damping of to stabilise GBP [Bickson:PhDThesis:2008]. New keyframes are initialised with a constant velocity motion model and new points are initialised at an average depth from the camera.
VII Experimental Evaluation
For evaluation we use real-world sequences captured with the Kinect camera in varied indoor environments and sequences from the TUM dataset [Sturm:etal:IROS2012]. In compression and tracking evaluations, we use [Ortiz:etal:CVPR2020] as a baseline and call the method GBP-BL. Similar to our system, GBP-BL uses GBP for inference in SLAM but without planar abstractions, meaning that comparisons serve as ablations of our planar abstraction framework.
We first present convergence time evaluations and qualitative reconstructions before evaluating compression and tracking accuracy in ablation experiments.
VII-A Convergence Time Evaluation
We evaluate the convergence time for bundle adjustment problems based on the sitting room sequence for factor graphs with an increasing number of different factor types. We begin with a factor graph containing 35 keyframes, 3108 points and 10000 reprojection factors and measure the convergence time from noisy initialisations. We then add 200 additional factors of a new type to the graph along with any new variable nodes (for example, we add plane hypothesis variable nodes when adding plane prediction factors) and repeat the experiment. We conduct experiments adding 4 additional types of factors, meaning that in the final experiment there are 10800 factors in the graph.
Following [Ortiz:etal:CVPR2020], we compare GBP implemented on a single IPU chip [Graphcore] with Ceres [CeresManual], a non-linear least squares optimisation library, run on a 6 core i7-8700K CPU with 18 threads. Ceres uses Levenberg-Marquardt (LM), a Huber kernel and analytic derivatives (which we found maximise performance). In Figure 4 we compare the convergence time of GBP with the 3 fastest Ceres linear solvers.
As different types of factors are added to the graph, convergence time for Ceres increases greatly while for GBP there is only a slight increase. To understand these curves, it is instructive to break down convergence time as the product of the time per iteration and the number of iterations to converge. Ceres makes optimal global updates through LM and, across all experiments, converges in 5-10 iterations while GBP requires 20-25 iterations. The time per iteration however is the more significant factor and where the two methods differ greatly in performance.
The local distributed nature of GBP makes its time per iteration structure-agnostic; in other words, the compute per iteration depends only on the number of factors and not their structure. Consequently, the time per iteration for GBP is approximately constant for all experiments and the convergence time only increases slightly, due to extra factors and more iterations required to converge.
In contrast, as different types of factors are added to the graph, solving the linear system or normal equations for the Ceres LM update becomes considerably more expensive, increasing the time per iteration. Ceres linear solvers compute the update by exploiting sparsity structure in the Fisher information matrix, which depends on both the number of non-zero entries (or equivalently factors in the graph) and the variable ordering. In the base case, the Dense Schur solver is designed to leverage the large zero blocks in the information matrix to efficiently solve the normal equations without inverting the full information matrix. As different types of factors are added, even in very small numbers, these zero blocks are eroded and the time per iteration for Ceres increases by over 5x with only an additional 8% of factors (800 added to the original 10000).
These experiments expose the reliance of direct solvers on fixed sparsity structure and suggest that GBP is more efficient for optimising heterogeneous scene graphs without strong structure. Lastly, not only does GBP have the right computational properties, but it is also doing additional work by computing both the MAP and the marginal covariances while LM only computes the MAP with significant extra computation required to get the covariances.
VII-B Qualitative Reconstructions
We present planar reconstructions of 4 real-world sequences by our system in Figure 5. Our system captures the prominent planes in real scenes such as walls, beds (b), desks (b, c) and cupboards (a). In Figure 5 d) we verify that for a simple entirely planar scene, our system can achieve a complete reconstruction. Figure 1 shows an intermediate reconstruction of the same sequence with points.
The granularity of abstractions in our system is controlled by two interpretable parameters, which can be set according to the specific requirements of downstream processes which operate on the resulting reconstruction. For example, dense fine-grained geometry might be required for object manipulation, whereas coarse grained representations would be more efficient for path planning.
The noise scale $\sigma_d$ of the plane-point distance factors balances the contributions of the plane-point factors and the reprojection factors. As demonstrated in Figure 6, weak plane-point factors (large $\sigma_d$) prioritise minimising reprojection errors and consequently only very planar regions are abstracted. On the other hand, strong plane-point factors (small $\sigma_d$) encourage points in hypothesised planes to be co-planar and can lead to the abstraction of regions that are approximately planar, such as the top of the shampoo bottle in the bottom right panel of Figure 6. Additionally, the merging criteria can be varied to determine the level of granularity. In Figure 6, coarse merging thresholds join the posters into a single wall plane and merge the planar objects into the table plane, while finer thresholds leave these reconstructed planes separate.
VII-C Compression Evaluation
Graph compression yields a more dense, semantic and parameter-efficient representation and reduces the amount of computation per iteration of GBP. The computation is proportional to the number of edges or equivalently the number of factors when all factors are pairwise. To quantify the compression capabilities of our system, in Figure 7 we plot the number of factor nodes in the graph as a function of keyframes added and compare with GBP-BL [Ortiz:etal:CVPR2020].
For our planar system, there are initially more factors as many plane hypotheses are added, however the graph is quickly compressed as planes are confirmed. The compression is most significant in the Laboratory and TUM str_tex_far sequences in which there are large textured walls, while there is less compression in the sitting room sequence due to large curved objects such as the sofa.
VII-D Tracking Evaluation
We evaluate the absolute trajectory error (ATE) of our method on 3 TUM sequences (chosen for their prominent planar regions) in Table I. We compare our full method with two ablated systems: our planar method without compression (Ours-C) and GBP-BL, which is a GBP-based method for robust bundle adjustment with similar performance to Ceres.
Ours-C has lower ATE than GBP-BL, demonstrating that additional planar constraints help tracking. As expected, our full method performs slightly worse than Ours-C because compression is lossy and bakes in errors. The fact that compression only comes with a small price in accuracy is further evidence that we are finding correct abstractions.
We also demonstrate comparable trajectory error to ORB-SLAM2 [Mur-Artal:etal:TRO2017], even achieving lower ATE for the long office sequence in which there are many planar points on desks and monitors and slower camera motion. Both our method and GBP-BL lack a tracking system so begin from a more difficult initialisation than ORB-SLAM2 when a new keyframe is added, likely explaining the performance difference on the first two sequences. Designing a distributed tracking system that would fit in with the GBP framework remains an important research direction.
We have proposed a method for efficient incremental construction of probabilistic scene graphs from monocular input by introducing two novel components. First, our incremental scene abstraction framework combines amortized inference via an off-the-shelf network with probabilistic inference to identify abstract scene elements and build a more semantic, dense and compact representation. Second, our routing procedure enables the first application of GBP on a graph processor to inference of dynamic heterogeneous factor graphs. We demonstrate the advantage of GBP over direct methods for optimising complex factor graphs due to the structure-agnostic time per iteration. We implement our system with planar abstractions and show accurate reconstructions, significant compression and improved tracking over ablated systems.
In the near term, we would like to explore using our framework in a more practical SLAM system with an RGB-D input in which dense raw geometry is abstracted into a hierarchical factor graph including objects, planes and rooms. More generally, we hope that the SLAM community will begin to adopt more flexible and distributed inference frameworks, such as the system we present here. Such frameworks are vital to leverage novel parallel hardware and tackle the computational challenges brought about by slowing advancements in CPUs and GPUs and the move towards dynamic heterogeneous scene graphs for representing environments.