Semantic 3D Reconstruction with Finite Element Bases

by Audrey Richard, et al.

We propose a novel framework for the discretisation of multi-label problems on arbitrary, continuous domains. Our work bridges the gap between general FEM discretisations and labeling problems that arise in a variety of computer vision tasks, including, for instance, those derived from the generalised Potts model. Starting from the popular formulation of labeling as a convex relaxation by functional lifting, we show that FEM discretisation is valid for the most general case, where the regulariser is anisotropic and non-metric. While our findings are generic and applicable to different vision problems, we demonstrate their practical implementation in the context of semantic 3D reconstruction, where such regularisers have proved particularly beneficial. The proposed FEM approach leads to a smaller memory footprint as well as faster computation, and it constitutes a very simple way to enable variable, adaptive resolution within the same model.







1 Introduction

A number of computer vision tasks, such as segmentation, multiview reconstruction, stitching and inpainting, can be formulated as multi-label problems on continuous domains, by functional lifting [PockCBC10, Cremers11, Lellmann11, ChambolleCP12, NieuwenhuisTC13]. A recent example is semantic 3D reconstruction (e.g. [HaneZCAP13, BlahaVRWPS16]), which solves the following problem: given a set of images of a scene, reconstruct both its 3D shape and a segmentation into semantic object classes. The task is particularly challenging, because the evidence is irregularly distributed in the 3D domain; but it also possesses a rich, anisotropic prior structure that can be exploited. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g. building walls should be smooth and vertical, and vice versa, smooth vertical surfaces are likely to be building walls), leading to improved reconstruction results. So far, models for the mentioned multi-label problems, and in particular for semantic 3D reconstruction, have been limited to axis-aligned discretisations. Unless the scenes are aligned with the coordinate axes, this leads to an unnecessarily large number of elements. Moreover, since the evidence is (inevitably) distributed unevenly in 3D, it also causes biased reconstructions. Thus, it is desirable to adapt the discretisation to the scene content (as often done for purely geometric surface reconstruction, e.g. [Labatut07]).

Our formulation makes it possible to employ a finer tessellation in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our discretisation scheme leads to a smaller memory footprint and faster computation, and it constitutes a very simple technique to allow for arbitrary adaptive resolution levels within the same problem. That is, we can refine or coarsen the discretisation as appropriate, to adapt to the scene to be reconstructed. While our scheme is applicable to a whole family of finite element bases, we investigate two particularly interesting cases: Lagrange (P1) and Raviart-Thomas elements of first order. We further show that the grid-based voxel discretisation is a special case of our P1 basis, such that minimum energy solutions of “identical” discretisations (same vertex set) are equivalent.

Figure 1:

Semantic 3D model, estimated from aerial views with our FEM method.

2 Related Work

Since the seminal work [Curless1996], volumetric reconstruction from image data has evolved remarkably [Kazhdan:2006:PSR, VogiatzisETC07, Furu:2010:PMVS, LiuC10, Cremers-Kolev-pami11, kbc-pami11, KostrikovHL14, Ulusoy2016]. Most methods use depth maps or 2.5D range scans for evidence [Zach2007_Range, zach2008fast], represent the scene via an indicator or signed distance function in the volumetric domain, and extract the surface as its zero level set, e.g. [Lorensen1987, treece1999mtet].

Joint estimation of geometry and semantic labels, which had earlier been attempted only for single depth maps [Ladicky10], has recently emerged as a powerful extension of volumetric 3D reconstruction from multiple views [HaneZCAP13, Bao13, Kundu14, Savinov15, vineet2015icra, BlahaVRWPS16, Ulusoy2017CVPR]. A common trait of these works is the integration of depth estimates and appearance-based labeling information from multiple images, with class-specific regularisation via shape priors.

Multi-label problems are in general NP-hard, but under certain conditions on the pairwise interactions, the original non-convex problem can be converted into a convex one via functional lifting and subsequent relaxation, e.g. [ChambolleCP12]. This construction was further extended to anisotropic (direction-dependent) regularisers [Strekalovskiy11]. Moreover, [ZachHP14] also relaxed the requirement that the regulariser forms a metric on the label set, yet their construction can only be applied after discretisation [Lellmann11]. In this paper, we consider the relaxation in its most general form [ZachHP14], but are not restricted to it. The latter construction is also the basis of the model of [HaneZCAP13], whose energy we adapt for our semantic 3D reconstruction method. Their voxel-based formulation can be seen as a special case of our discretisation scheme.

For (non-semantic) surface reconstruction, several authors prefer a data-dependent discretisation, normally a Delaunay tetrahedralisation of a 3D point cloud [Labatut07, Jancosek11, vu12]. The occupancy states of the tetrahedra are found by discrete (binary) labeling, and the final surface is composed of the triangles that separate different labels. Loosely speaking, our proposed methodology can be seen either as an extension of [HaneZCAP13] to arbitrary simplex partitions of the domain; or as an extension of [Labatut07] to semantic (multi-label) reconstruction.

We note that regular voxel partitioning of the volume leads to high memory consumption and computation time. Yet, we are essentially reconstructing a 2D manifold in 3D space, and this can be exploited to reduce run-time and memory footprint. [Kundu14] use an octree instead of equally sized voxels to adapt to the local density of the input data. [BlahaVRWPS16] go one step further and propose an adaptive octree, where the discretisation is refined on-the-fly, during optimisation. In our framework the energy is independent of the discretisation; it can thus be combined directly with such an adaptive procedure.

Volumetric fusion via signed distance functions [Newcombe2011] also benefits from irregular tessellations of 3D space, e.g. octrees [Steinbruecker-etal-icra14] or hashmaps [Niessner2013]. In contrast to our work, these methods target real-time reconstruction and refrain from global optimisation, instead locally fusing depth maps. Their input is normally a densely sampled, overcomplete RGB-D video stream, whereas we deal with noisy and incomplete inputs. To achieve high-quality reconstructions in our setting, we incorporate semantic information, leading to a multi-label problem.

Our work is based on the finite element method (FEM), e.g. [reddy2005introduction, Brezzi1991]. Introduced by Ritz [Ritz1909] more than a century ago, and refined by Galerkin and Courant [courant1943], FEM serves to numerically solve variational problems by partitioning the domain into finite, parametrised elements. In computer vision, FEM has been applied in the context of level-set methods [Weber2004] and for Total Variation [Bartels12]. To our knowledge, we are the first to apply it to multi-labeling.

3 Method

The multi-labeling problem [ChambolleCP12, Strekalovskiy11, Lellmann11, ZachHP14] in the domain Ω ⊂ R^d is defined by finding labeling functions u_i : Ω → [0, 1], i = 1, …, L, as the solution of:

min_{u ∈ C}  Σ_i ∫_Ω u_i(x) ρ_i(x) dx + R(u),   with  C := { u : Σ_i u_i(x) = 1, u_i(x) ≥ 0 },    (1)

where ρ_i models the data term for a specific label i at location x and R denotes a convex regularisation functional that enforces the spatial consistency of the labels. One prominent example is to choose R(u) = (1/2) Σ_i ∫_Ω |Du_i|, known as Total Variation, which penalises the perimeter of the individual regions [ChambolleCP12, NieuwenhuisTC13]. Note that in the two-label case (Potts model), this relaxation is exact after thresholding with any threshold from the open unit interval [ChambolleCP12]. Although we are ultimately interested in non-metric regularisation, we start with the continuous, anisotropic model [Strekalovskiy11], and postpone the extension to the non-metric case to Sec. 3.5.
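As a toy illustration of the lifting (our own sketch, not code from the paper; the 1/2 factor follows the usual TV convention), a discrete 1D label field becomes per-label indicator functions, the unit-simplex constraint holds pointwise, and the discretised Potts/TV regulariser counts label transitions:

```python
import numpy as np

# Lift a 1D labeling with L = 3 labels into indicator functions u_i.
labels = np.array([0, 0, 1, 1, 2, 2, 2])
L = 3
u = np.eye(L)[labels]                 # shape (7, 3): one-hot per location

# Pointwise unit-simplex constraint: the u_i sum to one everywhere.
assert np.allclose(u.sum(axis=1), 1.0)

# Discrete Potts/TV regulariser: each label transition is counted once
# (the factor 1/2 compensates for the two indicators changing per jump).
tv = 0.5 * np.abs(np.diff(u, axis=0)).sum()
```

The field above has two label transitions (0→1 and 1→2), so `tv` evaluates to 2.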

3.1 Convex Relaxation

The continuous model allows for an anisotropic regulariser in R: label transitions can be penalised based on the area of the shared surface, as well as on the surface normal direction. This is achieved with problem-specific 1-homogeneous functions that emerge from convex sets W_ij, so-called Wulff-shapes. A relaxation of (1) then leads to a convex energy, which can be written as the following saddle point problem, with primal functions u_i and dual functions p_i:

min_{u ∈ C} max_p  Σ_i ∫_Ω u_i(x) ρ_i(x) dx + Σ_i ∫_Ω u_i(x) div p_i(x) dx,   s.t. (p_i − p_j)(x) ∈ W_ij(x).    (2)

The constraints have to be fulfilled for all x ∈ Ω and all pairs of labels i < j. In addition to the primal variables u_i, we have introduced the dual vector-fields p_i : Ω → R^d, whose pairwise differences are constrained to lie in the convex sets (Wulff-shapes) W_ij. By letting these shapes take an anisotropic form, one can encode scene structure, e.g. [Strekalovskiy11, HaneZCAP13]. For our problem we demand Neumann conditions at the boundary of Ω, i.e. ⟨p_i, n⟩ = 0 on ∂Ω, because the scene will continue beyond our domain (n is the normal of the domain boundary ∂Ω).

3.2 Finite Element Spaces

Here, we can only informally introduce the basic idea of FEM and explain its suitability for problems of the form (2). We refer to textbooks [Larson2013FEM, Brezzi1991, Duran2008] for a deeper and formal treatment.

One way to solve (2) is to approximate it at a finite number of regular grid points, using finite differences. FEM instead searches for a solution in a finite-dimensional vector space; this trial space is a subspace of the space in which the exact solution is defined. To that end, one chooses a suitable basis for the trial space, with basis functions of finite support, as well as an appropriate test function space. FEM then finds approximate solutions to variational problems by identifying the element of the trial space whose residual is orthogonal to all functions of the test space. For our saddle-point problem, we can instead identify the trial space with our primal function space and the test space with its dual counterpart. Now we can apply the same principles, and after discretisation our solution corresponds to the continuous solution defined by the respective basis. As (2) is already a relaxation of the original problem (1), we do not present an analysis of convergence at this point; instead, the reader is referred to [Bartels12] for an introduction to this somewhat involved topic.

In order to choose a space with good approximation properties and suitable basis functions, we tessellate our domain into simplices. More formally, we define M to be a simplex mesh with vertices V, faces F, and simplices T that partition Ω, i.e. two adjacent simplices share exactly one common face. In this work, for a specific set of vertices V, we select M to be the corresponding Delaunay tetrahedralisation and only consider explicit bases. In particular, we focus on the Lagrangian (P1) basis, which we use in the following to derive our framework, and on the Raviart-Thomas (RT) basis; details for the latter are given in the supplementary material. The main difference between them is that P1 leads to piecewise linear solutions, which must be thresholded, while RT leads to a constant labeling function per simplex, similar to discrete MAP solutions on CRFs. We note that constant labeling can lead to artefacts, such that the adaptiveness of the FEM model becomes even more important.
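Concretely, the mesh construction described above can be sketched with off-the-shelf tools (scipy's `Delaunay` is an assumption on our part; the paper does not prescribe a library):

```python
import numpy as np
from scipy.spatial import Delaunay

# Sketch: build the simplex mesh M from an arbitrary (adaptive) vertex set.
rng = np.random.default_rng(0)
V = rng.random((50, 3))     # vertex positions, e.g. densified near the surface
mesh = Delaunay(V)          # Delaunay tetrahedralisation of V

T = mesh.simplices          # (num_tetrahedra, 4) vertex indices per simplex
```

Adaptive resolution then amounts to choosing the vertex set V non-uniformly; the tetrahedralisation itself requires no regular grid.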

The idea of both derivations is similar: (i) select a basis for our primal (P1) or dual (RT) variable set, (ii) find a suitable form via the divergence theorem and Fenchel duality, (iii) extend to the non-metric case, following a principle we term “label mass preservation”.

3.3 Lagrange Elements

The Lagrange basis functions describe a conforming polynomial basis of order k on our simplex mesh M, i.e. its elements belong to the Hilbert space H^1(Ω) of square-integrable functions on the domain Ω whose weak derivatives are also square-integrable. We are interested in the Lagrange basis of first order, k = 1:

P1(M) := { v ∈ C^0(Ω) : v restricted to any simplex T ∈ M is affine } ⊂ H^1(Ω).    (3)
We construct our linear basis with functions φ_v that are defined for each vertex v of a simplex and can be described in local form with barycentric coordinates:

φ_v(x) = λ_v(x),   with  φ_v(v') = δ_{vv'},  Σ_v λ_v(x) = 1,  λ_v(x) ≥ 0  inside the simplex.    (4)
In each simplex T, one can define a scalar field u(x) = Σ_{v∈T} u_v φ_v(x) and compute a gradient in this basis that will be constant per simplex (cf. Fig. 2):

∇u|_T = Σ_{v∈T} u_v ∇φ_v,   with  ∇φ_v = n_v / (d V_T),    (5)

with coefficients u_v. Here, n_v denotes a vector that is normal to the face F_v opposite node v, has length equal to the face area A_v, and points towards the simplex centre; A_v denotes the area of the face F_v, V_T the volume of the simplex, and d the dimension of the domain (cf. Fig. 2 and supplementary material).

Figure 2: Left: Illustration of P1 basis function shape. Middle: Scalar field defined as a convex combination of basis coefficients. Right: Gradient definition in a simplex (5).
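To illustrate the per-simplex gradient of Eq. (5), a minimal 2D sketch (our own, assuming triangle elements) recovers the constant gradients of the P1 hat functions by solving for their affine coefficients:

```python
import numpy as np

def p1_gradients(verts):
    """Gradients of the three P1 (barycentric) basis functions on a triangle.

    verts: (3, 2) array of vertex coordinates. Returns (3, 2): the gradient
    of phi_v for each vertex v, constant over the triangle."""
    # Each phi_v(x) = a + <b, x> with phi_v(v_j) = delta_vj; solve for (a, b).
    A = np.hstack([np.ones((3, 1)), verts])   # rows: [1, x, y]
    coeffs = np.linalg.solve(A, np.eye(3))    # column j: coefficients of phi_j
    return coeffs[1:, :].T                    # (b_x, b_y) per basis function

tri = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
G = p1_gradients(tri)

# A P1 field u(x) = sum_v u_v phi_v has the constant gradient sum_v u_v G[v]:
u = np.array([1.0, 3.0, 2.0])   # nodal values of u(x, y) = 1 + x + y
grad_u = u @ G                  # expected: (1, 1)
```

Since the hat functions sum to one, their gradients sum to zero, which is a convenient sanity check for any element.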

3.4 Discretisation

To apply our Lagrange basis to (2) we first make use of the divergence theorem:

∫_Ω u_i div p_i dx = −∫_Ω ⟨∇u_i, p_i⟩ dx + ∫_∂Ω u_i ⟨p_i, n⟩ dS.    (6)
The latter summand vanishes by our choice of boundary conditions, ⟨p_i, n⟩ = 0 on ∂Ω. Our approach for a discretisation in the Lagrange basis is to choose the labeling functions u_i ∈ P1(M). This implies that our dual space consists of constant vector-fields per simplex: p_i|_T ≡ p_i^T. To fulfill the constraint set in (2) we have to verify that, per simplex, the differences p_i^T − p_j^T lie in the respective Wulff-shape. The simplex constraints on the u have to be modeled per vertex. According to (4), the labeling functions are convex combinations of their values at the vertices and thus stay within the simplex.

We also have to convert the continuous data costs ρ_i into a cost per vertex, which can be achieved by convolving the continuous cost with the respective basis function: ρ_i^v := ∫_Ω ρ_i(x) φ_v(x) dx. In practice, the integral can be computed by sampling ρ_i. Integrating the right-hand side in (6) over each simplex leads to a weighting with its volume, and the energy (2) in the discrete setting becomes:

min_u max_p  Σ_i Σ_{v∈V} ρ_i^v u_i^v − Σ_i Σ_{T∈M} V_T ⟨∇u_i|_T, p_i^T⟩,   s.t. ∀T: p_i^T − p_j^T ∈ W_ij;  ∀v: u^v in the unit simplex.    (7)
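The conversion of the continuous data cost into per-vertex costs — convolving the cost with each basis function — can be approximated by sampling. A minimal 2D sketch (our own, using the vertex quadrature rule, which is exact only for affine integrands):

```python
import numpy as np

# Sketch: per-vertex data cost on one triangle via the vertex quadrature
# rule  int_T f dx ~= (area/3) * sum_v f(v), exact for affine f.
# Since phi_v(v') = delta_{vv'}, the rule gives cost_v ~= (area/3) * rho(v).
def vertex_costs(tri, rho):
    e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
    area = 0.5 * abs(e1[0] * e2[1] - e1[1] * e2[0])
    return np.array([area / 3.0 * rho(v) for v in tri])

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
costs = vertex_costs(tri, lambda x: 1.0)   # constant cost rho = 1
```

For a constant cost the three vertex costs sum to the triangle area, as the integral demands; denser sampling schemes refine this for non-smooth costs.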
3.5 Non-metric extension

To start with, we note that a non-metric model does not exist in the continuous case [maggi2012sets]; our extension works only after the discretisation into the FEM basis. Please refer to the supplementary material for an in-depth discussion. Note that our label set of semantic classes does not have a natural order (in contrast to, e.g., stereo depth or denoised brightness); also the direction-dependent regulariser is unordered and does not induce a metric cost. To allow for non-metric regularisation we transform the constraint set, introducing auxiliary variables and Lagrange multipliers, and use Fenchel duality:


The dual functions of the indicator functions for the convex sets W_ij are 1-homogeneous support functions. Recall that our label costs are not metric: in general, the triangle inequality does not hold on the label set. It was shown in [ZachHP14] that a regulariser of the form (8) transforms any non-metric cost into the metric case. Figure 3 shows an example: here, an expensive transition between labels 0 and 2 will be replaced by two cheaper transitions 0–1 and 1–2. To prevent this, we replace the multipliers with direction-dependent variables. We rearrange (8) and combine the first summand with the regulariser from (7) to arrive at the following equations:

with denoting Iverson brackets. Let and , where . Expanding the gradient (5) we get, per simplex ,


which we analyse further to achieve non-metric costs. It was observed in [ZachHP14] that these variables can be interpreted as encoding the “label mass” that transitions from one label to another in a specific direction. Positivity constraints (by definition) avoid the transport of negative label mass. To anchor the transport of mass on the actual mass of a label present at a vertex, we introduce variables for the mass that remains at a label, and split the above constraints into two separate sets with the help of additional dual variables:

Figure 3: Left: Without our non-metric extension, optimisation w.r.t. (7) can lower transition costs by inserting another label (here 1 between 0 and 2). Right: A solution is to split the gradients of the indicator functions and use direction-dependent variables.

Note that this construction is only possible because our elements (simplices) are of strictly positive volume, in contrast to sets of measure zero w.r.t. the Lebesgue measure. Finally, we can write down our discrete energy in the Lagrange basis defined on the simplex mesh M:


where we have moved the weighting with the simplex volume from the constraint set to the regulariser, and denote by δ_Δ the indicator function of the unit simplex.

4 Semantic Reconstruction Model

A prime application scenario for our FEM multi-label energy model (11) is 3D semantic reconstruction. In particular, we focus on an urban scenario and let our labeling functions encode freespace, building wall, roof, vegetation or ground. Objects that are not explicitly modeled are collected in an extra clutter class. We define the data cost at a 3D point x as in [HaneZCAP13]: we project x into the camera views and retrieve the corresponding depth and class likelihoods from image space. The class likelihoods are obtained from a MultiBoost classifier. For the depth we look at the difference between the actual distance of x to the camera and the observed depth. For the freespace label we always set the cost to 0; for the remaining labels we define:


This model assumes independence of the per-pixel observations, and exponentially distributed inlier noise in the depth maps, bounded by a parameter (k = 3 in practice). It is essentially a continuous version of [HaneZCAP13]; see that paper for details. The parameter also sets a lower bound for the minimal height of the simplices in the tessellation, and thus defines the target resolution. The discretisation of the data cost involves a convolution with the respective basis functions, which can be approximated via sampling; please refer to the supplementary material for details. The Wulff-shapes in (11) are given as the Minkowski sum of a ball and an anisotropic shape. The isotropic part encodes the neighbourhood statistics of the classes; the anisotropic part models the likelihood of a transition between two classes in a certain direction. Fig. 4 (a,b) shows an example. For our case we prefer flat, horizontal surfaces at the following label transitions: ground-freespace, ground-building, building-roof, ground-vegetation and roof-freespace. A second prior prefers vertical boundaries for the transitions building-freespace and building-vegetation. More details on the exact form can be found in [HaneZCAP13].
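For intuition only, here is a hedged sketch of such a truncated, band-limited depth cost. The exact form of (12) follows [HaneZCAP13]; the parameters `sigma`, `k` and the linear falloff below are our assumptions, not the authors' formula:

```python
import numpy as np

def depth_cost(d, d_obs, sigma=1.0, k=3):
    """Hypothetical signed evidence for a point at distance d along a ray
    with observed depth d_obs: negative in front of the match (freespace
    evidence), positive behind it (occupied evidence), zero outside the
    k * sigma band around the observed depth."""
    r = d - d_obs                  # signed residual along the ray
    if abs(r) > k * sigma:
        return 0.0                 # outside the band: no evidence contributed
    return np.sign(r) * (1.0 - abs(r) / (k * sigma))
```

The truncation band is exactly the region where, as discussed later, control vertices must be sampled densely enough.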

Figure 4: (a): Wulff-shape (red) with isolines. (b): Minkowski sum of two Wulff-shapes. (c): Simplices are split after inserting new vertices (blue) close to the surface (green). Right: Initialisation of vertices after refinement. (d): Finite differences on a regular grid ([HaneZCAP13]) only cover constraints in the green areas; the P1 basis covers all of the domain.

The energy (11) is already in primal-dual form, so we can apply the minimisation scheme of [chambolle11] with pre-conditioning [Pock11]. That numerical scheme requires us to project onto shapes that are Minkowski sums of convex sets. In our case the sets are simple and the projection onto each shape can be performed in closed form. We employ a Dykstra-like projection scheme [Dykstra], which avoids storing additional variables and proves remarkably efficient; see the supplementary material. We also project the labeling functions directly onto the unit simplex [WangC13a]. To extract the transition surface, we employ a variant of marching tetrahedra (triangles in 2D) [treece1999mtet], using a fixed isolevel for each label.
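The projection of the labeling functions onto the unit simplex can be computed exactly with the sort-based algorithm of [WangC13a]; a compact sketch:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the unit (probability) simplex,
    via the O(n log n) sort-based algorithm."""
    u = np.sort(y)[::-1]                      # sort descending
    css = np.cumsum(u)
    k = np.arange(1, len(y) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]   # last index satisfying the rule
    theta = (css[rho] - 1.0) / (rho + 1.0)       # shift that enforces sum = 1
    return np.maximum(y - theta, 0.0)            # clip to the nonnegative orthant

x = project_simplex(np.array([0.6, 1.2, -0.3]))
```

Points already inside the simplex are mapped to themselves, so the projection is idempotent, a useful property inside a primal-dual iteration.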

We conclude with two interesting remarks. First, note that a tessellation with a regular grid [HaneZCAP13] can be seen as a simplified version of our discretisation in the Lagrange basis. In Fig. 4d we consider the 2D case of the regular grid used in [HaneZCAP13], where variables are defined at voxel level. In its dual graph, the vertex set consists of the corners of the primal grid cubes, leading to shifted indicator variables. Per vertex, the data term is mainly influenced by the cost in its Voronoi area. Similarly, [HaneZCAP13] evaluates the data cost at grid centers, approximately corresponding to integration within the respective Voronoi area of the grid cell. Furthermore, taking finite differences in this regular case corresponds to verifying constraints for only one of the two triangles (Fig. 4d). The supplementary material includes a more formal analysis.

Second, our formulation is adaptive, in the sense of [BlahaVRWPS16]: hierarchical refinement of the tessellation can only decrease our energy. Hence, our scheme is applicable when refining the model on-the-fly. We again must defer a formal proof to the supplementary material and give an intuitive, visual explanation in Fig. 4c. Assume we have an optimal solution for a certain tessellation, variable set and data cost. A refined tessellation can be found by introducing additional vertices, ideally on the label transition surfaces. To define a new set of simplices, we demand that no faces are flipped. Then one can find a new variable set and data cost with the same energy: we initialise the new variables from the continuous solution at the respective location, and find the new data costs by integration. Subsequent minimisation in the refined mesh can only decrease the energy. The argument works both ways: vertices that have the same solution as their adjacent neighbors can be removed without changing the energy. For now we stick to this simple scheme; future work might explore more sophisticated ideas, e.g. along the lines of [Grinspun02].
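The energy-preserving initialisation after refinement can be illustrated in 1D (our own sketch): a vertex inserted on an edge takes the linearly interpolated value, so the piecewise linear function, and hence the energy, is unchanged before re-optimisation:

```python
import numpy as np

def refine_edge(xs, us, i):
    """Split edge (i, i+1) of a 1D P1 field at its midpoint, initialising
    the new vertex from the continuous (interpolated) solution."""
    xm = 0.5 * (xs[i] + xs[i + 1])
    um = 0.5 * (us[i] + us[i + 1])      # interpolate the existing solution
    return np.insert(xs, i + 1, xm), np.insert(us, i + 1, um)

xs = np.array([0.0, 1.0, 2.0])
us = np.array([0.0, 0.4, 1.0])
xs2, us2 = refine_edge(xs, us, 0)

# The refined mesh represents exactly the same function everywhere:
probe = [0.25, 0.5, 1.5]
same = np.allclose(np.interp(probe, xs, us), np.interp(probe, xs2, us2))
```

The reverse direction holds as well: removing a vertex whose value equals the interpolation of its neighbours leaves the represented function untouched.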

5 Evaluation

Before we present results on challenging real 3D data we evaluate our method in 2D on a synthetic dataset. All results are obtained with a multi-core implementation, on a 12-core, 3.5 GHz machine with 64GB RAM. For clarity, we only present the Lagrange discretisation. We refer to the supplementary material for an evaluation of the Raviart-Thomas discretisation.

Input Data. We create a synthetic 2D scene composed of 4 labels: free space, building, ground and roof, surrounded by 17 virtual cameras. To replace depth maps and class-likelihood images, we extract 2D points on the boundary “surface” and assign ground truth label costs to each point. For the evaluation in 3D, we use three real-world aerial mapping data sets. Our method requires two types of input data: depth maps and pixel-wise class probabilities (cf. Sec. 4). Moreover, we build a control mesh around the initially predicted surface, to facilitate our FEM discretisation. Ideally, the control mesh enwraps the true surface, using a finer meshing close to it. We densely evaluate the data cost at the vertices of a regular data cost grid and let each control vertex accumulate the cost of its nearest neighbours in that grid, to approximate an integration over its Voronoi cell.
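The nearest-neighbour accumulation described above can be sketched with a kd-tree query (scipy's `cKDTree` is an assumption; any nearest-neighbour structure works):

```python
import numpy as np
from scipy.spatial import cKDTree

# Each control vertex accumulates the data cost of the grid samples for
# which it is the nearest control vertex, approximating an integral over
# its Voronoi cell.
grid = np.array([[0.1, 0.1], [0.2, 0.3], [0.8, 0.7], [0.9, 0.9]])
grid_cost = np.array([1.0, 2.0, 3.0, 4.0])
ctrl = np.array([[0.0, 0.0], [1.0, 1.0]])       # control vertices

nearest = cKDTree(ctrl).query(grid)[1]          # nearest control vertex per sample
vert_cost = np.bincount(nearest, weights=grid_cost, minlength=len(ctrl))
```

Here the first two grid samples fall into the Voronoi cell of the first control vertex and the rest into the second, so `vert_cost` is `[3, 7]`.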

Figure 5: Left: Synthetic 2D scene, colors indicate ground (gray), building (red) and roof (yellow). Middle: Control mesh. Right: Example reconstruction.
Table: Quantitative comparison with the octree model [BlahaVRWPS16] and MultiBoost (MB) input data.

overall acc. [%]   Tetra   Octree   MB
Scene 1            84.0    83.9     82.5
Scene 2            92.5    92.8     89.0

2D Lagrange results. Fig. 5 illustrates the result we obtain in a perfect setting. The original 2D image serves as ground truth for our quantitative evaluation. In this baseline setting, our method achieves high overall and average accuracy, confirming the soundness of our Lagrange discretisation. In order to evaluate how our model behaves in a more realistic setting, we conduct a series of experiments where we incrementally add different types of perturbations. Our algorithm is tested against: (i) noise in the initial 2D point cloud, respectively depth maps, (ii) wrong class probabilities, (iii) ambiguous class probabilities of random subsets, (iv) missing data, e.g. deleting part of a facade to simulate unobserved areas, and (v) sparsity of the initial point cloud. Fig. 6b illustrates the influence of defective inputs. Under reasonable assumptions on the magnitude of the investigated perturbations, we do not observe a significant loss in accuracy. The reconstruction quality starts to decrease only if more than half of the input data is misclassified or if the input point cloud is excessively sparse, i.e. if 50% of the input is wrong or nonexistent. Average accuracy is naturally more sensitive, due to the larger relative error in small classes.

Influence of the control mesh. Recall from (12) that the data cost of a control vertex is approximately equal to an integral of the continuous cost over its respective Voronoi area (cf. Fig. 6a, left). Therefore, but also because of the sign change in (12), vertices close to the surface receive small cost values and are mainly steered by the regulariser, i.e. these vertices realise a smooth surface. On the other hand, vertices that integrate only over areas with positive or negative sign determine the inside/outside decision, but are more or less agnostic about the exact location of the surface. We conclude that a sufficient number of control vertices should lie within the band defined by the truncation of the cost function around the observed depth (cf. (12) and [HaneZCAP13]). Ideally, control vertices are equally distributed along each line-of-sight in front of and behind the putative match (cf. Fig. 6a, middle column). Undersampling within the near band can lead to smooth, but inaccurate results (cf. Fig. 6a, top right). Unobserved transitions, e.g. building-ground or roof-building, can also lead to problems if the affected simplices are too large. To mitigate the effect, we add a few vertices (e.g. a sparse regular grid) on top of the control mesh (cf. Fig. 6a, bottom row). Finally, oversampling each line-of-sight in order to increase the resolution of the control mesh is not recommended; the right spacing is determined by the noise level and the parameters chosen in (12).

To conclude, it is an important advantage of the FEM framework that additional vertices can be inserted as required, without changing the energy. In future work we will use this flexibility to develop smarter control meshes, possibly as a function of the local noise level.

Figure 6: (a) Illustration of the control mesh foundation. Dots represent values of the data cost grid and crosses the control vertices. Voronoi cells of the control vertices are depicted with dashed grey lines and the control mesh with a solid black line. Colors indicate ground (purple), building (red), roof (yellow), free space (cyan) and no data cost (black). (b) Quantitative evaluation of the Lagrange FEM method w.r.t. different degradations of the input data.

3D Lagrange results. To test our algorithm on real-world data, we focus on a dataset from the city of Enschede; complementary results for other datasets are shown in the supplementary material. As baseline we use [BlahaVRWPS16], the current state of the art in large-scale semantic 3D reconstruction. Due to the lack of 3D ground truth, we follow their evaluation protocol and back-project subsets of the 3D reconstruction to image space, where they are compared to a manual segmentation. As can be seen in Fig. 7 and Tab. 5, the two results are similar in terms of quantitative correctness. We note that measuring labeling accuracy in the 2D projection does not consider the geometric quality of the reconstruction within regions of a single label.

Figs. 1 and 7 show city-modelling results obtained from (nadir and oblique) aerial images. Visually, our models are crisper and less “clay-like”. Compared to axis-aligned discretisation schemes, e.g. [HaneZCAP13, BlahaVRWPS16], our method appears to better represent surfaces not aligned with the coordinate axes, and exhibits reduced grid aliasing. Both effects are consistent with the main strength of the FEM framework: adapting the size and the orientation of the volume elements to the data. Small tetrahedra, and vertices that coincide with accurate 3D points on surface discontinuities, favour sharp surface details and crease edges (e.g. substructures on roofs). Faces that follow the data points rather than arbitrary grid directions mitigate aliasing on surfaces not aligned with the coordinate axes (e.g. building walls). The freedom of a local control mesh unleashes the power of the regulariser in regions where the evidence is weak or ambiguous (e.g. roads, weakly textured building parts).

As already mentioned, our FEM framework can readily be combined with on-the-fly adaptive computation, as used in the baseline [BlahaVRWPS16]. Compared to their voxel/octree model, adaptive refinement is straightforward, due to the flexibility of the FEM framework, which allows for the introduction of arbitrary new vertices. As a preliminary proof of concept, we have tested the naive refinement scheme described in Sec. 4. We execute three refinement steps, where we repeatedly reconstruct the scene and subsequently refine simplices that contain surface transitions, while halving the target resolution parameter. Compared to computing everything at the final resolution, this already yields substantial savings of 89–97% in memory and 82–93% in computation time, without any loss in accuracy. At the target resolution (measured w.r.t. a bounding box of 256 units), the runtimes for the tested scenes are 1h04m–1h47m and memory consumption is 573–764 MB, on a single machine.

Figure 7: Quantitative evaluation of Scene 1 from Enschede. Left: One of the input images. Middle left: Semantic 3D model. Middle right: Back-projected labels overlayed on the image. Right: Error map, misclassified pixels are marked in red.

6 Conclusion

We have proposed a novel framework for the discretisation of multi-label problems, and have shown that, in the context of semantic 3D reconstruction, the increased flexibility of our scheme allows one to better adapt the discretisation to the data at hand. Our basic idea is generic and not limited to semantic 3D reconstruction or the specific class of regularisers. We would like to explore other applications where it may be useful to abandon grid discretisations and move to a decomposition into simplices.

Acknowledgements: Audrey Richard was supported by SNF grant . Christoph Vogel and Thomas Pock acknowledge support from the ERC starting grant 640156, ’HOMOVIS’.

Supplementary Material

This document provides supplementary information to support the main paper. It is structured as follows: Sec. A gives more information about the data and pre-processing used in our experiments, not mentioned in the paper due to lack of space. We hope that the added details will help readers to better appreciate the experimental results. In Sec. B we show complementary results obtained with the proposed Lagrange FEM method on other datasets, as well as the full large-scale reconstruction of the city of Enschede. Sec. C contains technical details and formal proofs that had to be omitted in the paper. Finally, Sec. D discusses our formalism for the case of the Raviart-Thomas basis (instead of Lagrange P1), leading to piecewise constant labels. We also show results in 2D and 3D and a comparison to those obtained with the Lagrange basis.

Appendix A Input Data

For our real-world experiments, we start from aerial images, cf. Fig. 8. To mitigate foreshortening and occlusion, images are acquired in a Maltese cross configuration, with four oblique views in addition to the classical nadir view. We orient the images with VisualSFM [wu2011visualsfm], create depth maps from neighbouring views with Semi-global Matching [hirschmuller2008stereo, opencv], and predict pixel-wise class-conditional probabilities with a MultiBoost classifier [benbouzid_jml12]. The classifier is trained on a few hand-labeled images, using the same features as [BlahaVRWPS16]: raw RGB intensities in a window, and geometry features (height, normal direction, anisotropy of the structure tensor, etc.) derived from the depth map.

Figure 8: Input data. Left: aerial input images for one position (four oblique views to the north, east, south, west, and a nadir image). Middle: oriented image block. Right: depth map and class probabilities (visualised by maximum-likelihood labels).

Appendix B Additional Visualizations

We have tested our semantic reconstruction method on several (synthetic) 2D and (real) 3D datasets. Here we provide additional examples to give the reader an impression of the variety of cases tested in our evaluation. We apply the same prior models as for our 3D reconstructions. We prefer flat, horizontal structures in the model for the following label transitions: ground-freespace, ground-building, building-roof and roof-freespace. The second prior applies to the transition building-freespace and prefers vertical boundaries. Fig. 9 shows examples for different degradations of the synthetic input (many more cases were tested).

Figure 9: Example scenes of our 2D data set and results obtained with our Lagrange FEM method. Left: Data term at vertices of the control mesh. Colors for the data cost indicate: free space/empty space (cyan), building (red), ground (pink), roof (yellow), occupied space (green) and no data cost (black). Middle: Semantic 2D model. Right: Classification result; misclassified pixels are depicted in red.

In the top row, we simulate imperfect classifier input by adding noise to the semantic class likelihoods. In our experience, the method is still able to reconstruct the geometry quite well, but sometimes assigns the wrong label. A closer inspection reveals that, locally, the roof and building classes are confused in locations where the class likelihoods are significantly wrong. The global geometry and labels in other regions remain unaffected. The second row gives an example of missing input data, a frequent situation in the real world, due to occlusions and constraints on camera placement. Fortunately, missing data does not seem to greatly challenge our method. In fact, our method is specifically designed to work well for these cases and complete the outline, relying on the prior assumptions about pairwise class transitions and class-specific local shape. In the last row we utilise only a sparse control mesh, even near the surface. The method can still recover the geometry, but struggles to determine the correct semantic labeling near the (unobserved) roof-to-building transition. The adaptive version of our method is designed to avoid exactly that case. It refines the control mesh near the predicted transitions, effectively increasing the resolution at the most promising locations.

Fig. 10 shows city models obtained from two additional aerial datasets (Zürich, Switzerland and Dortmund, Germany), and a further patch from Enschede. These results qualitatively illustrate that our method works for different image sets and architectural layouts.

Figure 10: Additional datasets. First row: Original aerial images. Second row: Semantic 3D models obtained with the Lagrange FEM method for Enschede (left), Zürich (middle) and Dortmund (right). For the latter, light green denotes an additional class grass and agricultural fields.

Finally, we show the complete semantic 3D reconstruction of Enschede. Fig. 11 shows the model rendered in an oblique view, together with the corresponding viewpoint in Google Earth, to illustrate its accuracy and high level-of-detail.

Figure 11: Large-scale semantic 3D reconstruction of Enschede (Netherlands), computed from aerial images with our Lagrange FEM method. Top: View from Google Earth (not used during reconstruction). Bottom: Our model from matching viewpoint.

Appendix C Proofs

In this section we give the technical proofs promised in the main paper, as well as further details about the optimisation. We start with a discussion of the extension to non-metric energies, and its consequences on the equivalence of continuous and discrete models.

C.1 Non-metric priors: continuous vs. discrete

The main message of this section is that a non-metric model does not exist in the continuous view, unless one imposes additional constraints on the function spaces. We briefly explain why: consider the transition boundary between two labels and . Without additional constraints, one can always introduce a zero set with label between the two, i.e., a set with Lebesgue measure zero in the domain space. If the transition costs are not metric, then the cost for the label pair is potentially higher than the sum of the costs for and . Inserting the zero set will avoid that extra cost, and the energy will be under-estimated. In other words, let be a segmentation of into regions and , labeled with and respectively. Assume further that their costs do not fulfill the triangle inequality w.r.t. another label . Then one can find a sequence of segmentations with : the label disappears in the limit, such that . Hence, metric transition costs are a necessary condition for the lower semi-continuity of the energy functional. Methods that try to resolve the issue with additional constraints on the function space, for instance by demanding Lipschitz continuity of the labeling functions, are an active research area, e.g. [BreMas15], but are beyond the scope of this work.

The above conceptual problem does have consequences for a practical implementation: any discretisation of the domain will ultimately consist only of a finite number of elements of measurable (non-zero) volume. Thus, the label in the example will not disappear completely from the solution, and the computed energy matches the solution. In practice, one can simply prescribe a minimum edge length in the tessellation, since one cannot refine infinitely. Note that this also constrains the Lipschitz constant of the labeling functions; they are restricted to values between 0 and 1, such that the Lipschitz constant of functions defined on the mesh is bounded, cf. (13). Because we utilise a Delaunay triangulation/tetrahedralisation of the domain and also limit the minimal dihedral angle, a further constraint on the edge length implies a bound on the Lipschitz constant. Note also that our analysis implies that a discrete solution in the non-metric setting does not have a continuous counterpart; consequently, investigations of the limiting case, i.e., convergence analysis after infinite refinement of the tessellation, are futile.

C.2 Gradient in the Lagrange basis

We show that gradients of functions in the P1 (Lagrange) basis are constant per simplex and given by:


Here, the coefficients and denote a vector of length , normal to the face opposite to vertex , and pointing inwards towards the center of the simplex. Recall that is the area of face and is the volume of simplex .

The gradient can be obtained with basic algebra. First, notice that the gradient of in (13) has to fulfill , meaning that integration along the edge leads to the respective change in . After collecting a sufficient number of linear equations of this form, one can directly solve the resulting linear system. Since is, by definition, orthogonal to all edges that do not involve vertex , we arrive at (13).

Formally, we pick one vertex of simplex and compile for () equations of the form . By construction, . The vector is normal to face . The scalar product of the edge and the normal is the "height" within the simplex, so with the chosen scaling of we have for any .

Thus multiplying each side of our equation system by a matrix with the vectors as columns leads to: . If we can show that , then we arrive at the desired expression (13). For , and . All equations are also fulfilled by in place of , which concludes the proof.
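The edge-equation construction above can be sketched numerically. The snippet below (a Python sketch with hypothetical names, not our actual implementation) recovers the constant per-simplex gradient by solving the linear system of edge equations, and checks it against the exact gradient of a linear function:

```python
import numpy as np

def p1_gradient(verts, u):
    """Constant gradient of a P1 (piecewise linear) function on one simplex.

    Solves the edge equations  grad . (v_j - v_0) = u_j - u_0,
    i.e. the linear system described in Sec. C.2.
    verts: (d+1, d) array of vertex coordinates, u: (d+1,) nodal values.
    """
    E = verts[1:] - verts[0]        # d edge vectors emanating from vertex 0
    b = u[1:] - u[0]                # corresponding changes in u along the edges
    return np.linalg.solve(E, b)    # the gradient is constant on the simplex

# Sanity check: for nodal values sampled from a linear function a.x + c,
# the recovered gradient must equal a exactly.
verts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
a, c = np.array([2.0, -3.0]), 1.0
u = verts @ a + c
grad = p1_gradient(verts, u)        # -> array([ 2., -3.])
```

Solving the edge system directly is numerically equivalent to the closed-form expression with scaled inward face normals derived above.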

C.3 Data Term for the Lagrange basis

We again start from the ideas in the main paper. We have to convert the continuous data costs into discrete form (in a practical implementation, “continuous” means that the cost can be evaluated at any ). In our basis representation, we can get discrete cost values for the basis elements by convolving the continuous cost with the respective basis function. For simplicity, we consider the P1 basis function here. Thus, we seek a cost per vertex . In detail we obtain:


To numerically compute , we sample at a finite number of locations . For each we determine into which simplex it falls, and accumulate the contributions of over all , weighted by their barycentric coordinates. The final step is to scale by and divide by the sum of weights assigned to vertex . In other words, we compute the sample mean and scale it by the area covered by the vertex. In our current implementation is sampled at regular grid points, without importance sampling. This simple strategy is indeed very similar to the method employed in [HaneZCAP13, BlahaVRWPS16]. There, the data cost is evaluated on a regular grid, by reprojecting grid vertices into each image, computing the data term, and adding its respective contribution to the grid location. Such a “per-voxel accumulation” is equivalent to integrating the data cost within the respective Voronoi-area of a vertex in the dual grid: the latter is proportional to the number of regular samples that fall into a Voronoi-cell and therefore have the respective vertex as nearest neighbour. Hence, summing the individual contributions directly corresponds to integrating the data term within the Voronoi region.
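The per-vertex accumulation can be illustrated for a single simplex. The sketch below (assumed helper names; a simplification of the scheme above) accumulates barycentrically weighted cost samples per vertex and forms the weighted sample mean; the paper's additional scaling by the area covered by each vertex is omitted:

```python
import numpy as np

def barycentric(verts, p):
    """Barycentric coordinates of point p w.r.t. the simplex verts ((d+1, d))."""
    T = (verts[1:] - verts[0]).T
    lam = np.linalg.solve(T, np.asarray(p, float) - verts[0])
    return np.concatenate(([1.0 - lam.sum()], lam))

def accumulate_data_cost(verts, samples, cost):
    """Per-vertex data cost on one simplex: the barycentrically weighted
    sample mean of the continuous cost, evaluated at the given samples."""
    w = np.zeros(len(verts))            # accumulated barycentric weights
    c = np.zeros(len(verts))            # accumulated weighted cost samples
    for p in samples:
        lam = barycentric(verts, p)
        if (lam >= -1e-12).all():       # keep only samples inside the simplex
            c += lam * cost(p)
            w += lam
    return c / np.maximum(w, 1e-12)     # weighted sample mean per vertex
```

For a constant cost, every vertex receives exactly that constant, as expected of a weighted mean.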

C.4 Grid vs. P1

Here, we detail why the grid-based version with finite differences (corresponding to [HaneZCAP13]) can be seen as an approximation of our proposed FEM discretisation with P1 basis elements, if the vertices (cells) are aligned in a regular grid. Without loss of generality we consider a grid of edge length , and note that in this case the gradient for a function at a grid point x, evaluated with forward differences becomes:


with the unit vector in direction . We have used the identity , according to the definition in Sec. C.2, and obtain the last equality from , cf. Sec. C.2. This is exactly the formula for the gradient of the corresponding P1 function in the simplex defined by the vertices . Accordingly, if implemented as finite differences, the constraints on the dual vector field , see Eq. (2) from the paper, are only checked within the respective simplex, but not in the whole domain (1/2 of the domain in 2D; 1/6 in 3D). Note also that, with grid-aligned vertices, the simplex in question cannot be part of a partition of , unless : edges of adjacent faces would intersect.

Fig. 12 illustrates the specific case with . On the left, we show the regular grid (green) and the triangle (simplex, red). Grid centers correspond to vertices in the (triangle) mesh. The grid corresponds to the discretisation used in [HaneZCAP13], whereas the simplex mesh is used in this paper. The gradient of the lower-left triangle of the simplex mesh corresponds exactly to the one computed via forward differences, as shown in (15). Consequently, discretisation via finite differences is a special case of our method, where the elements are laid out on a regular grid and the constraints are tested only in the upper-right triangle (in the 2D case).
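The equivalence can also be checked numerically. In this sketch (hypothetical names), the forward-difference gradient at a grid point coincides, for any sampled function, with the gradient of the P1 interpolant on the axis-aligned simplex spanned by x and the points x + h e_k, since both reduce to the same divided differences:

```python
import numpy as np

def forward_diff_grad(u, x, h):
    """Forward-difference gradient of a function u at grid point x."""
    d = len(x)
    g = np.empty(d)
    for k in range(d):
        e = np.zeros(d); e[k] = h
        g[k] = (u(x + e) - u(x)) / h
    return g

def p1_grad_on_axis_simplex(u, x, h):
    """Gradient of the P1 interpolant on the simplex spanned by
    x, x + h*e_1, ..., x + h*e_d (the triangle of Fig. 12 in 2D)."""
    d = len(x)
    E = h * np.eye(d)                    # edge vectors of the simplex
    b = np.array([u(x + E[k]) - u(x) for k in range(d)])
    return np.linalg.solve(E, b)         # reduces to b / h for this simplex
```

The agreement is exact, not approximate: both routines evaluate u at the same d+1 points and divide the same differences by h.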

Figure 12: Left: Grid (green) and simplex mesh (red) cover the same domain, but are offset against each other. Grid centers correspond to vertex positions. The gradient in a triangle (middle) corresponds to the gradient computed with forward differences (right).

C.5 Adaptiveness

We have stated in the paper that our formulation is adaptive, in the sense that a hierarchical refinement of the tessellation can only decrease the energy. We have also explained a way to find the refined tessellation of , by introducing additional vertices and splitting simplices , such that no faces are flipped; and we have put forward a procedure to initialise the new variables. Here, we formally prove that the described scheme is sound.

Let be the solution for a triplet , and let be the refined mesh with and with . Furthermore, we define the sets and to denote newly introduced vertices and simplices. Our construction works by induction, i.e. we introduce one vertex at a time. The vertex is assumed to lie in simplex , , which is split into simplices with . By definition, vertex has the barycentric coordinates , i.e.

We initialize the labeling variables at new vertices via barycentric interpolation: . Dual variables of the new simplices , and also the transition variables , are simply copied from their enclosing simplex : , and . The data terms at the new vertex , as well as at the other vertices of simplex , are (re)computed following (C.3), see also Fig. 13. We call the new variables and claim that and that is feasible. The latter is trivially the case (cf. Eq. (11) from the paper); transition variables remain positive by construction, the newly introduced labeling variables still fulfill the simplex constraints, and they induce the same gradients in the new simplices as in the parent simplex. The same applies for the cost of the regulariser in the new simplices: . More interesting is the data cost. We introduce the notation for the data cost at vertex , originating from the integral over simplex . By induction, we only need to verify the following equality for simplex , which is split into :


Recall that we have "old" vertices , and a new vertex . According to (16), we must verify:


It is sufficient to show


The right-hand side represents a linear function for each . We can check whether both sides agree on points in each simplex, which is easy to verify. The locations we check – substituting on both sides of Eq. (18) – are and . These are the defining vertices of the simplices . Left- and right-hand side vanish, except for and . Finally, we get for and for on both sides.

In our adaptive version, we directly follow the proof and split simplices with the introduction of a single new vertex. We emphasise again that this splitting schedule is merely a proof of concept. The FEM discretisation allows for more sophisticated refinement schemes, e.g., along the lines of [Grinspun02], or flipping edges according to the energy functional, etc.
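The single-vertex split with barycentric initialisation used in the proof can be sketched as follows (a Python sketch with hypothetical names, not our actual implementation):

```python
import numpy as np

def p1_grad(verts, u):
    """Constant P1 gradient on a simplex, via the edge-equation system."""
    return np.linalg.solve(verts[1:] - verts[0], u[1:] - u[0])

def split_at_point(verts, u, p):
    """Split a simplex at an interior point p into d+1 children and
    initialise the new nodal value by barycentric interpolation."""
    verts = np.asarray(verts, float); u = np.asarray(u, float)
    T = (verts[1:] - verts[0]).T
    lam = np.linalg.solve(T, np.asarray(p, float) - verts[0])
    lam = np.concatenate(([1.0 - lam.sum()], lam))
    u_new = float(lam @ u)                  # barycentric interpolation
    children = []
    for i in range(len(verts)):             # child i: replace vertex i by p
        cv = verts.copy(); cu = u.copy()
        cv[i] = p; cu[i] = u_new
        children.append((cv, cu))
    return u_new, children
```

Because the new nodal value is the barycentric interpolate, the piecewise linear function is unchanged and every child simplex inherits the parent's gradient, which is exactly why the regulariser cost is preserved in the proof above.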

Figure 13: Updated data term after adding a new vertex.

C.6 Optimisation

The energy (11) from the paper is given in primal-dual form, so optimisation with existing tools is straightforward. We apply the minimisation scheme of [chambolle11], with pre-conditioning [Pock11]. Internally, however, that algorithm requires the projection onto the Wulff-shapes , which is slightly more involved.

C.6.1 Proximal map for the Minkowski sum of convex sets

Recall that, per label pair , our Wulff-shapes are of the form . They are the Minkowski sum of two simple convex sets, where the encode the direction-dependent likelihood of a certain label transition. In our case, all Wulff-shapes permit a closed-form projection scheme, such that we solve the following sub-problem as a proximal step, independently per simplex :


For the following derivation we rename the two sets and . In order to decouple the argument within the regulariser, we introduce auxiliary variables and additional Lagrange multipliers , and replace and respectively:


Optimality w.r.t. implies:


which, after reinserting into (20), leads to:


Applying Fenchel-duality yields:


The latter summand requires and :


In this last form, we can apply a few iterations of block coordinate descent on the dual variables and recover the update for from (21).
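As a toy instance of this block coordinate descent, the sketch below projects a point onto the Minkowski sum of two Euclidean balls, which stand in for the two Wulff-shape summands (each with its own closed-form projection); the radii rA and rB are hypothetical:

```python
import numpy as np

def proj_ball(x, r):
    """Euclidean projection onto the ball of radius r; a stand-in for the
    closed-form projection onto one summand set."""
    n = np.linalg.norm(x)
    return x if n <= r else x * (r / n)

def proj_minkowski_sum(q, rA, rB, iters=50):
    """Projection of q onto the Minkowski sum of two balls via block
    coordinate descent on  min_{a in A, b in B} ||q - (a + b)||^2."""
    a = np.zeros_like(q, dtype=float)
    b = np.zeros_like(q, dtype=float)
    for _ in range(iters):
        a = proj_ball(q - b, rA)        # minimise over a with b fixed
        b = proj_ball(q - a, rB)        # minimise over b with a fixed
    return a + b
```

For balls, the Minkowski sum is again a ball of radius rA + rB, so the alternating scheme can be verified against the closed-form projection.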

Appendix D Raviart-Thomas basis

D.1 Methodology

In this section, we show how to discretise the convex relaxation, Eq. (2) from the paper, for the case of the Raviart-Thomas basis. For convenience, we restate the energy:


The Raviart-Thomas basis is chosen as a strong contrast to the (preferred) Lagrange basis. With Raviart-Thomas functions, we model the dual functions in (25) within our trial space. The Raviart-Thomas basis functions describe a -conforming polynomial basis of order , i.e. the divergence of the modeled vector field is continuous across simplices. We again discretise on a simplex mesh with vertices ; faces defined by vertices; and simplices defined by vertices, which partition : .
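For intuition, the following is a minimal sketch of the lowest-order Raviart-Thomas space on a single triangle, using the textbook construction and normalisation (not necessarily those of our implementation; edge-orientation signs are omitted). Each basis field is attached to one edge, and its normal component is 1 on that edge and 0 on the other two, which is what makes the normal flux conform across simplices:

```python
import numpy as np

def rt0_basis(verts):
    """Lowest-order Raviart-Thomas (RT0) basis on a triangle: one vector
    field per edge, phi_i(x) = |e_i| / (2 A) * (x - v_i), with v_i the
    vertex opposite edge i."""
    A = 0.5 * abs((verts[1, 0] - verts[0, 0]) * (verts[2, 1] - verts[0, 1])
                  - (verts[1, 1] - verts[0, 1]) * (verts[2, 0] - verts[0, 0]))
    def phi(i, x):
        j, k = [m for m in range(3) if m != i]
        elen = np.linalg.norm(verts[k] - verts[j])      # length of edge i
        return elen / (2.0 * A) * (np.asarray(x, float) - verts[i])
    return phi

# On the unit triangle, the normal component of phi_0 equals 1 on its own
# edge (between vertices 1 and 2) and vanishes on the other two edges.
verts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
phi = rt0_basis(verts)
mid_e0 = 0.5 * (verts[1] + verts[2])          # midpoint of edge opposite v0
n_e0 = np.array([1.0, 1.0]) / np.sqrt(2.0)    # outward unit normal of that edge
flux = phi(0, mid_e0) @ n_e0                  # -> 1.0
```

Since the normal component of each basis field is constant on every edge, matching the edge degrees of freedom of neighbouring triangles is sufficient to make the discrete vector field divergence-conforming.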