## 1 Introduction

A number of computer vision tasks, such as segmentation, multiview
reconstruction, stitching and inpainting, can be formulated
as *multi-label problems on continuous domains*, by functional
lifting [PockCBC10, Cremers11, Lellmann11, ChambolleCP12, NieuwenhuisTC13].
A recent example is semantic 3D reconstruction
(*e.g.* [HaneZCAP13, BlahaVRWPS16]), which solves the following
problem: Given a set of images of a scene, reconstruct both its 3D
shape and a segmentation into semantic object classes.
The task is particularly challenging, because the evidence is
irregularly distributed in the 3D domain; but it also possesses a
rich, anisotropic prior structure that can be exploited.
Jointly reasoning about shape and class allows one to take into
account class-specific shape priors (*e.g.*, building walls should be
smooth and vertical; conversely, smooth vertical surfaces are likely
to be building walls), leading to improved reconstruction results.
So far, models for the mentioned multi-label problems, and in
particular for semantic 3D reconstruction, have been limited to
axis-aligned discretisations. Unless the scenes are aligned with the
coordinate axes, this leads to an unnecessarily large number of
elements. Moreover, since the evidence is (inevitably) distributed
unevenly in 3D, it also causes biased reconstructions.
Thus, it is desirable to adapt the discretisation to the scene content
(as often done for purely geometric surface reconstruction,
*e.g.* [Labatut07]).

Our formulation makes it possible to employ a finer tesselation in
regions that are likely to contain a surface, exploiting the fact that
both high spatial resolution and high numerical precision are only
required in those regions.
Our discretisation scheme leads to a smaller memory footprint and faster
computation, and it constitutes a very simple technique to allow for
arbitrary adaptive resolution levels within the same problem.
*I.e.*, we can refine or coarsen the discretisation as appropriate, to adapt
to the scene to be reconstructed.
While our scheme is applicable to a whole family of finite element
bases, we investigate two particularly interesting cases: Lagrange
(P1) and Raviart-Thomas elements of first order.
We further show that the grid-based voxel discretisation is a special
case of our P1 basis, such that minimum energy solutions of
“identical” discretisations (same vertex set) are equivalent.

## 2 Related Work

Since the seminal work of [Curless1996], volumetric reconstruction
from image data has evolved remarkably
[Kazhdan:2006:PSR, VogiatzisETC07, Furu:2010:PMVS, LiuC10, Cremers-Kolev-pami11, kbc-pami11, KostrikovHL14, Ulusoy2016].
Most methods use depth maps or 2.5D range scans for evidence
[Zach2007_Range, zach2008fast], represent the scene via an
indicator or signed distance function in the volumetric domain, and
extract the surface as its zero level set, *e.g.*,
[Lorensen1987, treece1999mtet].

Joint estimation of geometry and semantic labels, which had earlier been attempted only for single depth maps [Ladicky10], has recently emerged as a powerful extension of volumetric 3D reconstruction from multiple views [HaneZCAP13, Bao13, Kundu14, Savinov15, vineet2015icra, BlahaVRWPS16, Ulusoy2017CVPR]. A common trait of these works is the integration of depth estimates and appearance-based labeling information from multiple images, with class-specific regularisation via shape priors.

Multi-label problems are in general NP-hard, but under certain
conditions on the pairwise interactions, the original non-convex
problem can be converted into a convex one via functional lifting and
subsequent relaxation, *e.g.* [ChambolleCP12].
This construction was further extended to anisotropic
(direction-dependent) regularisers [Strekalovskiy11].
Moreover, [ZachHP14] also relaxed the requirement that the
regulariser form a metric on the label set, although that construction
can only be applied after discretisation [Lellmann11].
In this paper, we consider the relaxation in its most general form
[ZachHP14], but are not restricted to it.
The latter construction is also the basis of the model of
[HaneZCAP13], whose energy model we adapt for our semantic 3D
reconstruction method.
Their voxel-based formulation can be seen as a special case of our
discretisation scheme.

For (non-semantic) surface reconstruction, several authors prefer a data-dependent discretisation, normally a Delaunay tetrahedralisation of a 3D point cloud [Labatut07, Jancosek11, vu12]. The occupancy states of the tetrahedra are found by discrete (binary) labeling, and the final surface is composed of the triangles that separate different labels. Loosely speaking, our proposed methodology can be seen either as an extension of [HaneZCAP13] to arbitrary simplex partitions of the domain; or as an extension of [Labatut07] to semantic (multi-label) reconstruction.

We note that regular voxel partitioning of the volume leads to high memory consumption and computation time. Yet we are essentially reconstructing a 2D manifold in 3D space, and this can be exploited to reduce run-time and memory footprint. [Kundu14] use an octree instead of equally sized voxels to adapt to the local density of the input data. [BlahaVRWPS16] go one step further and propose an adaptive octree, where the discretisation is refined on-the-fly, during optimisation. In our framework the energy is independent of the discretisation; it can thus be combined directly with such an adaptive procedure.

Also volumetric fusion via signed distance
functions [Newcombe2011] benefits from irregular tesselations of
3D space, *e.g.*, octrees [Steinbruecker-etal-icra14] or
hashmaps [Niessner2013].
In contrast to our work, these target real-time reconstruction and
refrain from global optimisation, instead locally fusing depth maps.
Their input is normally a densely sampled, overcomplete RGB-D
video stream, whereas we deal with noisy and incomplete inputs.
To achieve high-quality reconstructions in our setting, we
incorporate semantic information, leading to a multi-label problem.

Our work is based on the finite element method (FEM), *e.g.* [reddy2005introduction, Brezzi1991]. Introduced by Ritz
[Ritz1909] more than a century ago, and refined by Galerkin and
Courant [courant1943], FEM serves to numerically solve
variational problems, by partitioning the domain into finite,
parametrised elements.
In computer vision FEM has been applied in the context of level-set
methods [Weber2004] and for Total Variation [Bartels12].
To our knowledge, we are the first to apply it to multi-labeling.

## 3 Method

The multi-labeling problem [ChambolleCP12, Strekalovskiy11, Lellmann11, ZachHP14] in the domain $\Omega \subset \mathbb{R}^d$ is defined by finding labeling functions $u_i\colon \Omega \to [0,1]$, $i = 1,\dots,L$, as the solution of:

$$\min_{u}\; \sum_{i=1}^{L} \int_{\Omega} \rho_i(x)\, u_i(x)\, \mathrm{d}x \;+\; R(u), \qquad u(x) \in \Delta \;\; \forall x \in \Omega, \tag{1}$$

where $\rho_i(x)$ models the data term for label $i$ at location $x$, and $R$ denotes a convex regularisation functional that enforces the spatial consistency of the labels. One prominent example is to choose the total variation as $R$, which penalises the perimeter of the individual regions [ChambolleCP12, NieuwenhuisTC13]. Note that in the two-label case (Potts model), this relaxation is exact after thresholding with any threshold from the open unit interval [ChambolleCP12]. Although we are ultimately interested in non-metric regularisation, we start with the continuous, anisotropic model [Strekalovskiy11], and postpone the extension to the non-metric case to Sec. 3.5.
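For intuition, the following sketch evaluates a grid-discretised instance of this lifted energy with an isotropic TV (Potts-type) regulariser via finite differences; this is the grid-based baseline that the rest of the paper generalises, and all names and the toy labeling are illustrative, not from the paper:

```python
import numpy as np

def potts_energy(u, rho, weight=1.0):
    """Discretised lifted multi-label energy on a regular 2D grid:
    data term <rho, u> plus an isotropic TV (Potts) regulariser,
    evaluated with forward differences. u, rho: (L, H, W) arrays,
    where u[:, y, x] is a relaxed label indicator on the unit simplex."""
    data = (rho * u).sum()
    gx = np.diff(u, axis=2)        # horizontal forward differences
    gy = np.diff(u, axis=1)        # vertical forward differences
    # isotropic gradient magnitude on the common (H-1, W-1) interior
    tv = np.sqrt(gx[:, :-1, :] ** 2 + gy[:, :, :-1] ** 2).sum()
    return data + weight * tv

# toy labeling: two labels on a 4x4 grid, left half label 0, right half label 1
u = np.zeros((2, 4, 4))
u[0, :, :2] = 1.0
u[1, :, 2:] = 1.0
E = potts_energy(u, rho=np.zeros_like(u))   # pure regulariser cost of the boundary
```

With zero data cost, the energy reduces to the (scaled) length of the label boundary, illustrating the perimeter-penalising behaviour mentioned above.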

### 3.1 Convex Relaxation

The continuous model allows for an anisotropic regulariser: label transitions can be penalised based on the area of the shared surface, as well as on the surface normal direction. This is achieved with problem-specific 1-homogeneous functions that emerge from convex sets, so-called Wulff shapes. A relaxation of the labeling functions to the unit simplex then leads to a convex energy, which can be written as the following saddle point problem, with primal functions $u_i$ and dual functions $\lambda_i$:

$$\min_{u}\,\max_{\lambda}\; \sum_{i=1}^{L} \int_{\Omega} \big( \rho_i\, u_i - u_i\, \operatorname{div} \lambda_i \big)\, \mathrm{d}x, \qquad \lambda_i(x) - \lambda_j(x) \in \mathcal{W}_{ij}, \quad u(x) \in \Delta. \tag{2}$$

The constraints have to be fulfilled for all $x \in \Omega$. In addition to the primal variables, we have introduced the dual vector-fields $\lambda_i$, whose pairwise differences are constrained to lie in the convex sets (Wulff shapes) $\mathcal{W}_{ij}$. By letting these shapes take an anisotropic form, one can encode scene structure, *e.g.* [Strekalovskiy11, HaneZCAP13]. For our problem we demand Neumann conditions at the boundary of $\Omega$, *i.e.* $\langle \lambda_i, n \rangle = 0$ on $\partial\Omega$, because the scene will continue beyond our domain ($n$ is the normal of the domain boundary $\partial\Omega$).

### 3.2 Finite Element Spaces

Here, we can only informally introduce the basic idea of FEM and explain its suitability for problems of the form (2). We refer to textbooks [Larson2013FEM, Brezzi1991, Duran2008] for a deeper and formal treatment.

One way to solve (2) is to approximate it at
a finite number of regular grid points, using finite
differences.
FEM instead searches for a solution in a finite-dimensional vector space;
this *trial* space is a subspace of the space in which the exact
solution is defined.
To that end, one chooses a suitable basis for the *trial* space,
with basis functions of finite support, as well as an
appropriate *test* function space.
FEM then finds approximate solutions to variational problems by
identifying the element of the *trial* space whose residual is orthogonal
to all functions of the *test* space.
For our saddle-point problem, we instead identify the *trial* space with
our primal function space and the *test* space with its dual counterpart.
Now, we can apply the same principles, and after discretisation our
solution corresponds to the continuous solution defined
by the respective basis.
As (2) is already a relaxation of the original
problem (1), we do not present an analysis of
convergence at this point.
Instead, the reader is referred to [Bartels12] for an
introduction to this somewhat involved topic.

In order to choose a space with good approximation properties and
suitable basis functions, we tesselate our domain into simplices.
More formally, we define $\mathcal{M}$ to be a simplex mesh with
vertices $\mathcal{V}$, faces $\mathcal{F}$, and simplices
$\mathcal{T}$, each defined by $d{+}1$ vertices, that partition the
domain; *i.e.*, two adjacent simplices share only a single face.
In this work, for a specific set of vertices $\mathcal{V}$, we select
$\mathcal{M}$ to be the corresponding Delaunay tetrahedralisation
and only consider explicit bases. In particular, we
focus on the Lagrangian (P1) basis, which we use in the following to
derive our framework; and on the Raviart-Thomas (RT) basis.
Details for the latter are given in the supplementary material.
The main difference between them is that P1 leads to piecewise linear
solutions, which must be thresholded, while RT leads to a constant
labeling function per simplex, similar to discrete MAP solutions on CRFs.
We note that constant labeling can lead to artefacts, such that the
adaptiveness of the FEM model becomes even more important.
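As a concrete illustration (not the authors' code), such a simplex mesh can be built with an off-the-shelf Delaunay tetrahedralisation, and the face-sharing property stated above can be checked directly; the vertex set here is random and purely illustrative:

```python
import numpy as np
from scipy.spatial import Delaunay

# Illustrative vertex set; in practice vertices would be placed
# adaptively, denser near the expected surface.
rng = np.random.default_rng(0)
V = rng.uniform(0.0, 1.0, size=(50, 3))

mesh = Delaunay(V)            # Delaunay tetrahedralisation of the convex hull
T = mesh.simplices            # (num_tets, 4) vertex indices per simplex

# Two adjacent simplices share exactly one face (three common vertices).
for t, nbs in enumerate(mesh.neighbors):
    for nb in nbs:
        if nb >= 0:
            assert len(set(T[t]) & set(T[nb])) == 3
```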

The idea of both derivations is similar:
*(i)* select a basis for our primal (P1) or dual (RT) variable set,
*(ii)* find a suitable form via the divergence theorem and Fenchel duality,
*(iii)* extend to the non-metric case, following a principle we term “label mass preservation”.

### 3.3 Lagrange Elements

The Lagrange basis functions describe a
*conforming* polynomial basis of
order $k$ on our simplex mesh $\mathcal{M}$, *i.e.* its elements
belong to the Hilbert space of differentiable functions with
finite Lebesgue measure on the domain.
We are interested in the Lagrange basis of first order, $k=1$:

$$\mathcal{L}^1 := \{\, v \in C^0(\Omega) \;:\; v|_{T} \in \mathcal{P}_1(T) \;\; \forall\, T \in \mathcal{M} \,\}. \tag{3}$$

We construct our linear basis with functions that are defined for each vertex of a simplex and can be described in local form with barycentric coordinates:

$$\phi_i(x_j) = \delta_{ij}, \qquad \phi_i(x) \geq 0, \qquad \textstyle\sum_{i} \phi_i(x) = 1 \quad \forall x \in T. \tag{4}$$

In each simplex, one can define a scalar field $u = \sum_i u_i\,\phi_i$ and compute its gradient in this basis, which is constant per simplex (*cf.* Fig. 2):

$$\nabla u\,\big|_{T} = \sum_{i=1}^{d+1} u_i\, \mathbf{n}_i, \tag{5}$$

with coefficients $u_i$. Here $\mathbf{n}_i$ denotes a vector that is
normal to the face opposite node $i$, has length $A_i/(d\,V)$, and points
towards the simplex centre; $A_i$ denotes the area of the face opposite
node $i$, and $V$ the volume of the simplex
(*cf.* Fig. 2 and the supplementary material).
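These constant per-simplex basis gradients can be computed by inverting the affine system defined by $\phi_i(x_j) = \delta_{ij}$; a minimal sketch (our own illustrative code, with vertex coordinates as matrix rows):

```python
import numpy as np

def p1_gradients(X):
    """Gradients of the d+1 linear (P1) basis functions on one simplex.
    X: (d+1, d) vertex coordinates. Returns G with rows G[i] = grad phi_i,
    constant over the simplex. Each phi_i is affine with phi_i(x_j) = delta_ij."""
    d = X.shape[1]
    A = np.hstack([X, np.ones((d + 1, 1))])   # rows: (x_j, 1)
    C = np.linalg.inv(A).T                    # rows: affine coefficients of phi_i
    return C[:, :d]                           # drop the constant offsets

# Unit triangle: gradients are (-1,-1), (1,0), (0,1); they sum to zero
# because the basis functions form a partition of unity.
G = p1_gradients(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
grad_u = np.array([0.0, 1.0, 0.0]) @ G        # constant gradient of u = sum_i u_i phi_i
```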

### 3.4 Discretisation

To apply our Lagrange basis to (2) we first make use of the divergence theorem:

$$-\int_{\Omega} u_i\, \operatorname{div} \lambda_i\, \mathrm{d}x \;=\; \int_{\Omega} \langle \nabla u_i, \lambda_i \rangle\, \mathrm{d}x \;-\; \int_{\partial\Omega} u_i\, \langle \lambda_i, n \rangle\, \mathrm{d}S. \tag{6}$$

The latter summand vanishes by our choice of boundary conditions. Our approach for a discretisation in the Lagrange basis is to choose the labeling functions from the first-order Lagrange space. This implies that our dual space consists of constant vector-fields per simplex. To fulfill the constraint set in (2) we have to verify that, per simplex, the dual variables lie in the respective Wulff shape. The simplex constraints on the labeling functions have to be modeled per vertex: according to (4), the labeling functions are convex combinations of their values at the vertices and thus stay within the simplex.

We also have to convert the continuous data costs $\rho_i$ into a cost per vertex $v$, which can be achieved by convolving the continuous cost with the respective basis function: $\rho_i^v = \int_\Omega \rho_i(x)\, \phi_v(x)\, \mathrm{d}x$. In practice, the integral can be computed by sampling. Integrating the right hand side in (6) over each simplex leads to a weighting with its volume, and the energy (2) in the discrete setting becomes:

$$\min_{u}\,\max_{\lambda}\; \sum_{i=1}^{L} \Big( \sum_{v \in \mathcal{V}} \rho_i^v\, u_i^v \;+\; \sum_{T \in \mathcal{M}} V_T\, \big\langle \nabla u_i\big|_{T},\, \lambda_i^T \big\rangle \Big), \qquad \lambda_i^T - \lambda_j^T \in \mathcal{W}_{ij}, \quad u^v \in \Delta. \tag{7}$$
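The sampling-based approximation of the per-vertex data cost described above can be sketched as follows; this is a Monte-Carlo version of our own devising (function and variable names are illustrative), which uses the fact that the P1 basis function of vertex $v$ evaluated at a sample point equals its barycentric weight:

```python
import numpy as np

def vertex_costs(X, tets, rho, samples_per_tet=32, seed=0):
    """Monte-Carlo estimate of the per-vertex cost
    rho_v ~ integral of rho(x) * phi_v(x), accumulated over all simplices.
    X: (n, 3) vertices; tets: (m, 4) indices; rho: callable on (k, 3) points."""
    rng = np.random.default_rng(seed)
    costs = np.zeros(len(X))
    for tet in tets:
        P = X[tet]                                    # (4, 3) corners
        # uniform barycentric weights via sorted-uniform spacings
        u = np.sort(rng.uniform(size=(samples_per_tet, 3)), axis=1)
        w = np.diff(np.concatenate([np.zeros((samples_per_tet, 1)), u,
                                    np.ones((samples_per_tet, 1))], axis=1), axis=1)
        pts = w @ P                                   # sample points inside the tet
        vol = abs(np.linalg.det(P[1:] - P[0])) / 6.0  # tetrahedron volume
        vals = rho(pts)                               # (samples,) data cost at samples
        # phi_v at a sample equals its barycentric weight w[:, v]
        costs[tet] += vol * (w * vals[:, None]).mean(axis=0)
    return costs
```

For a constant cost $\rho \equiv 1$, the vertex costs of one simplex sum exactly to its volume, since the basis functions form a partition of unity.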

### 3.5 Non-metric extension

To start with, we note that a non-metric model does not exist in the
continuous case [maggi2012sets] and our extension works
only *after* the discretisation into the FEM basis. Please refer
to the supplementary material for an in-depth discussion.
Note that our label set of semantic classes does not have a natural
order (in contrast to, *e.g.*, stereo depth or denoised brightness);
likewise, the direction-dependent regulariser does not induce a metric
cost.
To allow for non-metric regularisation we transform the constraint set
of (2), by introducing auxiliary variables and Lagrange multipliers,
and use Fenchel duality:

(8) |

The dual functions of the indicator functions for the convex sets are 1-homogeneous. Recall that our label costs are not metric: the triangle inequality does not hold in general. It was shown in [ZachHP14] that a regulariser of the form (8) transforms any non-metric cost into a metric one. Figure 3 shows an example: here, an expensive transition between labels 0 and 2 would be replaced by two cheaper transitions 0–1 and 1–2. To prevent this, we replace the Lagrange multipliers with direction-dependent variables: we rearrange (8) and combine the first summand with the regulariser from (7) to arrive at the following equations:

with $[\cdot]$ denoting Iverson brackets. Expanding the gradient (5), we get per simplex:

(9) |

which we analyse further to achieve non-metric costs. It was observed in [ZachHP14] that these variables can be interpreted as encoding the “label mass” that transitions from label $i$ to label $j$ in a specific direction. Positivity constraints (by definition) on them avoid the transport of negative label mass. To anchor the transport of mass on the actual mass of label $i$ present at a vertex, we introduce variables for the mass that remains at label $i$, and split the above constraints into two separate sets with the help of additional dual variables:

(10) |

Note that this construction is only possible because our elements (simplices) are of strictly positive volume, in contrast to zero sets w.r.t. the Lebesgue measure. Finally, we can write down our discrete energy in the Lagrange basis defined on the simplex mesh:

(11) |

where we have moved the weighting with the simplex volume from the constraint set to the regulariser, and the last term denotes the indicator function of the unit simplex.

## 4 Semantic Reconstruction Model

A prime application scenario for our FEM multi-label energy model
(11) is 3D semantic reconstruction.
In particular, we focus on an urban scenario and let our labeling
functions encode *freespace*, *building wall*,
*roof*, *vegetation* or *ground*.
Objects that are not explicitly modeled are collected in an
extra *clutter* class.
We define the data cost at a 3D-point as in [HaneZCAP13]: project the point into the camera views, and retrieve the corresponding depth and class likelihoods from image space. The class likelihoods are obtained from a MultiBoost classifier. For the depth we look at the difference between the actual distance of the point to the camera and the observed depth. For the *freespace* label we always set the cost to 0; for the other labels we define:

(12) |

This model assumes independence of the per-pixel observations, and exponentially distributed inlier noise in the depth maps, bounded by a truncation parameter (k=3 in practice). It is essentially a continuous version of [HaneZCAP13]; see that paper for details. The truncation parameter sets a lower bound for the minimal height of the simplices in the tesselation, and thus defines the target resolution. The discretisation of the data cost involves a convolution with the respective basis functions, which can be approximated via sampling; please refer to the supplementary material for details. The Wulff shapes in (11) are given as the Minkowski sum of a scaled ball, accounting for the isotropic part, and an anisotropic shape. The isotropic part contains the neighbourhood statistics of the classes; the anisotropic part models the likelihood of a transition between classes in a certain direction. Fig. 4 (a,b) shows an example. For our case we prefer flat, horizontal surfaces at the following label transitions: *ground-freespace*, *ground-building*, *building-roof*, *ground-vegetation* and *roof-freespace*. A second prior prefers vertical boundaries for the transitions *building-freespace* and *building-vegetation*. More details on the exact form can be found in [HaneZCAP13].

The energy (11) is already in primal-dual form, such that we can apply the minimisation scheme of [chambolle11], with pre-conditioning [Pock11]. That numerical scheme requires us to project onto shapes that are Minkowski sums of convex sets. In our case the sets are simple, and the projection onto each shape can be performed in closed form. We employ a Dykstra-like projection scheme [Dykstra], which avoids storing additional variables and proves remarkably efficient (see supplementary material). We also project the labeling functions directly onto the unit simplex [WangC13a]. In order to extract the transition surface, we employ a variant of marching tetrahedra (triangles in 2D) [treece1999mtet], using a fixed isolevel for each label.
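The projection onto the unit simplex admits a closed form after sorting; a minimal sketch of the sorting-based scheme of [WangC13a] (vectorised over rows, names illustrative):

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection of each row of u onto the unit simplex
    {x : x >= 0, sum(x) = 1}, via the sorting-based closed form."""
    u = np.atleast_2d(u)
    n = u.shape[1]
    s = -np.sort(-u, axis=1)                  # sort each row descending
    css = np.cumsum(s, axis=1) - 1.0          # shifted cumulative sums
    ks = np.arange(1, n + 1)
    cond = s - css / ks > 0                   # candidate support sizes
    rho = cond.sum(axis=1)                    # largest k with condition true
    theta = css[np.arange(len(u)), rho - 1] / rho
    return np.maximum(u - theta[:, None], 0.0)
```

Points already on the simplex are fixed points of the projection, which makes the routine safe to apply after every primal update.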

We conclude with two interesting remarks. First, note that a tesselation with a regular grid [HaneZCAP13] can be seen as a simplified version of our discretisation in the Lagrange basis. In Fig. 4d we consider the 2D case of the regular grid used in [HaneZCAP13]. Here, variables are defined at voxel level. In its dual graph, the vertex set consists of the corners of the primal grid cubes, leading to shifted indicator variables. Per vertex, the data term is mainly influenced by the cost in its Voronoi area. Similarly, [HaneZCAP13] evaluates the data cost at grid centers, approximately corresponding to integration within the respective Voronoi area of the grid cell. Furthermore, taking finite differences in this regular case corresponds to verifying constraints for only one of the two triangles (Fig. 4d). The supplementary material includes a more formal analysis.

Second, our formulation is adaptive, in the sense
of [BlahaVRWPS16]: Hierarchical refinement of the tesselation can
only decrease our energy. Hence, our scheme is applicable when refining
the model on-the-fly.
We again must defer a formal proof to the supplementary material and
give an intuitive, visual explanation in Fig. 4c. Assume an optimal
solution for a certain triplet of mesh, variable set and data cost.
Then a refined tesselation can be found by introducing additional
vertices (ideally on the label transition surfaces).
To define a new set of simplices, we demand that no faces are flipped.
Then one can find a new variable set and data cost with the same
energy: we initialise the new variables from the continuous solution at
the respective locations, and find the new data costs by integration.
Subsequent minimisation in the refined mesh can only decrease the
energy.
The argument works in both ways: Vertices that have the same solution
as their adjacent neighbors can be removed without changing the
energy.
For now we stick to this simple scheme; future work might explore more
sophisticated ideas, *e.g.* along the lines of [Grinspun02].
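A minimal sketch of a refinement criterion in this spirit, under our own illustrative reading of the scheme: mark simplices whose thresholded vertex labels disagree, as those straddle a transition surface and are candidates for inserting new vertices.

```python
import numpy as np

def mark_for_refinement(tets, u):
    """Mark simplices that straddle a label transition: threshold the P1
    solution at the vertices (arg-max label) and flag simplices whose
    corners disagree."""
    labels = np.argmax(u, axis=1)        # winning label per vertex
    corner_labels = labels[tets]         # (num_tets, d+1)
    return (corner_labels != corner_labels[:, :1]).any(axis=1)

# toy example: vertex 4 carries a different label than vertices 0-3
u = np.eye(2)[[0, 0, 0, 0, 1]]           # one-hot labeling of 5 vertices
marked = mark_for_refinement(np.array([[0, 1, 2, 3], [0, 1, 2, 4]]), u)
```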

## 5 Evaluation

Before we present results on challenging real 3D data we evaluate our method in 2D on a synthetic dataset. All results are obtained with a multi-core implementation, on a 12-core, 3.5 GHz machine with 64GB RAM. For clarity, we only present the Lagrange discretisation. We refer to the supplementary material for an evaluation of the Raviart-Thomas discretisation.

Input Data.
We create a synthetic 2D scene composed of 4 labels (*free space, building, ground* and *roof*), surrounded by 17 virtual cameras. To replace depth maps and class-likelihood images, we extract 2D points on the boundary “surface” and assign ground truth label costs to each point. For the evaluation in 3D, we use three real-world aerial mapping data sets. Our method requires two types of input data: depth maps and pixel-wise class probabilities (*cf.* Sec. 4). Moreover, we build a *control mesh* around the initially predicted surface, to facilitate our FEM discretisation. Ideally, the control mesh enwraps the true surface, using a finer meshing close to it. We densely evaluate the data cost at the vertices of a regular *data cost grid* and let each *control vertex* accumulate the cost of its nearest neighbours in that grid, to approximate an integration over its Voronoi cell.
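A minimal sketch of this nearest-neighbour accumulation (illustrative names; a KD-tree stands in for whatever spatial index is actually used):

```python
import numpy as np
from scipy.spatial import cKDTree

def accumulate_costs(control_V, grid_pts, grid_cost, cell_volume):
    """Let every control vertex collect the cost of its nearest-neighbour
    points on the regular data-cost grid, approximating an integral over
    the vertex's Voronoi cell."""
    _, owner = cKDTree(control_V).query(grid_pts)     # nearest control vertex
    costs = np.zeros(len(control_V))
    np.add.at(costs, owner, grid_cost * cell_volume)  # unbuffered scatter-add
    return costs

# 1D toy example: grid points at 0.1 and 0.2 belong to the vertex at 0.0
c = accumulate_costs(np.array([[0.0], [1.0]]),
                     np.array([[0.1], [0.9], [0.2]]),
                     np.array([1.0, 1.0, 2.0]), cell_volume=0.5)
```

The `np.add.at` scatter-add handles repeated owner indices correctly, which a plain fancy-indexed `+=` would not.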

2D Lagrange results.
Fig. 5 illustrates the result we obtain
in a *perfect* setting.
The original 2D image serves as ground truth for our quantitative evaluation.
In this baseline setting, our method achieves high *overall accuracy*
and *average accuracy*, confirming the soundness of our Lagrange
discretisation.
In order to evaluate how our model behaves in a more realistic setting,
we conduct a series of experiments where we incrementally add
different types of perturbations.
Our algorithm is tested against:
*(i)* noise in the initial 2D point cloud (respectively, the depth maps),
*(ii)* wrong class probabilities and
*(iii)* ambiguous class probabilities of random subsets,
*(iv)* missing data, *e.g.* deleting part of a facade to simulate unobserved areas,
*(v)* sparsity of the initial point cloud.
Fig. 6b
illustrates the influence of defective inputs.
Under reasonable assumptions on the magnitude of the investigated
perturbations, we do not observe a significant loss in accuracy.
The reconstruction quality starts to decrease if
more than half of the input data is misclassified
or if the input point cloud is excessively sparse, meaning that 50% of
the input is wrong or nonexistent. Average accuracy is naturally more
sensitive, due to the larger relative error in small classes.

Influence of the control mesh.
Recall from (12) that the data cost of a *control vertex* is
approximately equal to an integral of the continuous cost over its
respective Voronoi area (*cf.* Fig. 6a, left).
Therefore, but also because of the sign change in (12),
vertices close to the surface receive small cost values and are
mainly steered by the regulariser, *i.e.* these vertices realise a
smooth surface.
On the other hand, vertices that integrate only over areas with
positive or negative sign determine the inside/outside decision, but
are more or less agnostic about the exact location of the surface.
We conclude that a sufficient number of control vertices
should lie within the band defined by the truncation
of the cost function around the observed depth
(*cf.* (12) and [HaneZCAP13]).
Ideally, control vertices are equally distributed along each line-of-sight
in front of and behind the putative match
(*cf.* Fig. 6a, middle column).
Undersampling within the near-band can lead to smooth, but inaccurate results
(*cf.* Fig. 6a, top right).
Unobserved transitions, *e.g.* *building-ground* or *roof-building*,
can also lead to problems if the affected simplices are too large.
To mitigate the effect, we add a few vertices (*e.g.*, a sparse regular grid)
on top of the *control mesh* (*cf.* Fig. 6a, bottom row).
Finally, oversampling each line-of-sight in order to increase the resolution of
the *control mesh* is not recommended; the right spacing is determined by the
noise level and the truncation parameters chosen in (12).

To conclude, it is an important advantage of the FEM framework that
additional vertices can be inserted as required, without changing the energy.
In future work we will use this flexibility to develop smarter *control
meshes*, possibly as a function of the *local* noise level.

3D Lagrange results. To test our algorithm on real world data, we focus on a dataset from the city of Enschede. Complementary results for other datasets are shown in the supplementary material. As baseline we use [BlahaVRWPS16], the current state of the art in large-scale semantic 3D reconstruction. Due to the lack of 3D ground truth, we follow their evaluation protocol and back-project subsets of the 3D reconstruction to image space, where they are compared to a manual segmentation. As can be seen in Fig. 7 and Tab. 5, the two results are similar in terms of quantitative correctness. We note that measuring labeling accuracy in the 2D projection does not consider the geometric quality of the reconstruction within regions of a single label.

Figs. 1 and 7 show
city-modelling results obtained from (nadir and oblique) aerial
images.
Visually, our models are crisper and less “clay-like”.
Compared to axis-aligned discretisation schemes,
*e.g.* [HaneZCAP13, BlahaVRWPS16], our method
appears to better represent surfaces not aligned with the coordinate
axes, and exhibits reduced grid aliasing.
Both effects are consistent with the main strength of the FEM
framework, to adapt the size *and* the orientation of the volume
elements to the data.
Small tetrahedra, and vertices that coincide with accurate 3D points
on surface discontinuities, favour sharp surface details and crease
edges (*e.g.*, substructures on roofs).
Faces that follow the data points rather than arbitrary grid
directions mitigate aliasing on surfaces not aligned with the
coordinate axes (*e.g.*, building walls).
The freedom of a local *control mesh* unleashes the
power of the regulariser in regions where the evidence is
weak or ambiguous (*e.g.*, roads, weakly textured building parts).

As already mentioned, our FEM framework can be readily combined with on-the-fly adaptive computation, as used in the baseline [BlahaVRWPS16]. Compared to their voxel/octree model, adaptive refinement is straightforward, due to the flexibility of the FEM framework, which allows for the introduction of arbitrary new vertices. As a preliminary proof-of-concept, we have tested the naive refinement scheme described in Sec. 4. We execute three refinement steps, where we repeatedly reconstruct the scene and subsequently refine simplices that contain surface transitions, while halving the target resolution parameter. Compared to computing everything at the final resolution, this already yields substantial savings of 89–97% in memory and 82–93% in computation time, without any loss in accuracy. At the target resolution (measured w.r.t. a bounding box of 256 units), the runtimes for the tested scenes are 1h04m–1h47m and memory consumption is 573–764 MB, on a single machine.

## 6 Conclusion

We have proposed a novel framework for the discretisation of multi-label problems, and have shown that, in the context of semantic 3D reconstruction, the increased flexibility of our scheme allows one to better adapt the discretisation to the data at hand. Our basic idea is generic and not limited to semantic 3D reconstruction or the specific class of regularisers. We would like to explore other applications where it may be useful to abandon grid discretisations and move to a decomposition into simplices.

Acknowledgements: Audrey Richard was supported by an SNF grant. Christoph Vogel and Thomas Pock acknowledge support from the ERC starting grant 640156, ‘HOMOVIS’.

## Supplementary Material

This document provides supplementary information to support the main paper. It is structured as follows: Sec. A gives more information about the data and pre-processing used in our experiments, not mentioned in the paper due to lack of space. We hope that the added details will help readers to better appreciate the experimental results. In Sec. B we show complementary results obtained with the proposed Lagrange FEM method on other datasets, as well as the full large-scale reconstruction of the city of Enschede. Sec. C contains technical details and formal proofs that had to be omitted in the paper. Finally, Sec. D discusses our formalism for the case of the Raviart-Thomas basis (instead of Lagrange P1), leading to piecewise constant labels. We also show results in 2D and 3D and a comparison to those obtained with the Lagrange basis.

## Appendix A Input Data

For our real-world experiments, we start from aerial images, *cf.* Fig. 8.
To mitigate foreshortening and occlusion, images are acquired in
a *Maltese cross* configuration, with four oblique views in
addition to the classical nadir view.
We orient the images with VisualSFM [wu2011visualsfm], create
depth maps from neighbouring views with Semi-global
Matching [hirschmuller2008stereo, opencv], and predict pixel-wise
class-conditional probabilities with a MultiBoost
classifier [benbouzid_jml12]. The classifier is trained on a few
hand-labeled images, using the same features as [BlahaVRWPS16]:
raw RGB-intensities in a window, and geometry features (height, normal
direction, anisotropy of the structure tensor, *etc.*) derived from the
depth map.

## Appendix B Additional Visualizations

We have tested our semantic reconstruction method on several
(synthetic) 2D and (real) 3D datasets. Here we provide additional
examples to give the reader an impression of the variety of cases
tested in our evaluation.
We apply the same prior models as for our 3D reconstructions.
We prefer flat, horizontal structures in the model for the following label
transitions: *ground-freespace*, *ground-building*, *building-roof* and *roof-freespace*.
The second prior applies to the transition *building-freespace* and
prefers vertical boundaries.
Fig. 9 shows examples for different degradations of
the synthetic input (many more cases were tested).

In the top row, we simulate imperfect classifier input by adding noise
to the semantic class likelihoods. In our experience, the method is
still able to reconstruct the geometry quite well, but sometimes
assigns the wrong label. A closer inspection reveals that, locally,
the
*roof* and *building* classes are confused in locations where the
class likelihoods are significantly wrong. The global geometry and labels
in other regions remain unaffected.
The second row gives an example of missing input data, a frequent
situation in the real world, due to occlusions and constraints on
camera placement. Fortunately, missing data does not seem to greatly
challenge our method. In fact, our method is specifically designed to
work well for these cases and complete the outline, relying on the
prior assumptions about pairwise class transitions and class-specific local shape.
In the last row we utilise only a sparse control mesh, even near the
surface. The method can still recover the geometry, but struggles to
determine the correct semantic labeling near the
(unobserved) *roof-to-building* transition.
The adaptive version of our method is designed to avoid exactly that
case. It refines the control mesh near the predicted transitions,
effectively increasing the resolution at the most promising locations.

Fig. 10 shows city models obtained from two additional aerial datasets (Zürich, Switzerland and Dortmund, Germany), and a further patch from Enschede. These results qualitatively illustrate that our method works for different image sets and architectural layouts.

Finally, we show the complete semantic 3D reconstruction of Enschede.
Fig. 11 shows the model rendered in an
oblique view, together with the corresponding viewpoint
in *Google Earth*, to illustrate its accuracy and high
level-of-detail.

## Appendix C Proofs

In this section we give the technical proofs promised in the main paper, as well as further details about the optimisation. We start with a discussion of the extension to non-metric energies, and its consequences on the equivalence of continuous and discrete models.

### C.1 Non-metric priors: continuous vs. discrete

The main message of this section is that a non-metric model does not
exist in the continuous view, unless one imposes additional
constraints on the function spaces.
We briefly explain why:
Let us look at the transition boundary between two labels $i$ and $j$.
Without additional constraints, one can always introduce a zero set
with a third label $k$ between the two, *i.e.* a set with Lebesgue measure $0$
in the domain space.
If the transition costs are not metric, then the cost for the label
pair $(i,j)$ is potentially higher than the sum of the costs for $(i,k)$
and $(k,j)$. Inserting the zero set will avoid that extra
cost and the energy will be under-estimated.
In other words, let $S$ be a segmentation of the domain into regions
$\Omega_i$ and $\Omega_j$, labeled with $i$ and $j$ respectively. Assume further
that their transition costs do not fulfill the triangle inequality *w.r.t.* a third
label $k$.
Then one can find a sequence of segmentations $S_n \to S$ in which label $k$
occupies a strip of vanishing width: the label
disappears in the limit, such that $\lim_{n\to\infty} E(S_n) < E(S)$. Hence, metric transition costs
are a necessary condition for the lower semi-continuity of the energy
functional.
Methods that try to resolve the issue with additional constraints on
the function space, for instance by demanding Lipschitz continuity of
the labeling functions, are an active research area, *e.g.* [BreMas15], but are beyond the scope of this work.
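
The argument can be illustrated with a toy computation (all costs below are hypothetical, chosen only to violate the triangle inequality): a direct boundary between two labels costs more than routing the transition through a third label, while the area penalty of the inserted strip vanishes with its width.

```python
# Hypothetical transition costs violating the triangle inequality:
# c(a,b) = 3 > c(a,c) + c(c,b) = 2.
cost = {("a", "b"): 3.0, ("a", "c"): 1.0, ("c", "b"): 1.0}

direct = cost[("a", "b")]                          # a|b boundary
via_strip = cost[("a", "c")] + cost[("c", "b")]    # a|c and c|b boundaries

# Shrinking the strip of the third label drives its area cost to zero,
# so the strip segmentation under-cuts the direct one in the limit.
for eps in (1e-1, 1e-3, 1e-6):
    strip_energy = via_strip + eps                 # eps ~ strip area cost
    assert strip_energy < direct
```

In the limit the strip has measure zero, the boundary energy drops from 3 to 2, and no segmentation attains that infimum: exactly the failure of lower semi-continuity described above.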

The above conceptual problem does have consequences for a practical
implementation:
any discretisation of the domain will ultimately consist only of a
finite number of elements of non-zero ($>0$) volume. Thus, the
label $k$ in the example will not disappear completely from the
solution, and the computed energy matches the discrete solution.
In practice, one can simply prescribe a minimum edge length in the
tessellation, since one cannot refine infinitely.
Note that this also constrains the Lipschitz constant of the labeling
functions: they are restricted to values between $0$ and $1$,
such that the Lipschitz constant of functions defined on the
mesh is bounded in terms of the minimal height of the simplices, *cf.* (13).
Because we utilise a Delaunay triangulation/tetrahedralisation of the domain
and also limit the minimal dihedral angle, a further constraint on the edge
length implies a bound on the Lipschitz constant.
Note also that our analysis implies that a discrete solution in the
non-metric setting does not have a continuous counterpart;
consequently, investigations of the limiting case, *i.e.* convergence
analysis after infinite refinement of the tessellation, are futile.
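
A crude version of this bound can be sketched as follows (notation as in Sec. C.2; $h_v$ is our shorthand, not used elsewhere, for the height of simplex $\Delta$ over the face opposite vertex $v$):

```latex
% From (13), the P1 hat function of vertex v has constant gradient norm
%   |\nabla\phi_v| = A_v / (d |\Delta|) = 1 / h_v .
% The hat functions sum to one, so \sum_v \nabla\phi_v = 0, and with
% labeling values u_v in [0,1]:
\[
  \|\nabla u\|_{L^\infty(\Delta)}
    = \Big\| \sum_{v \neq v_0} (u_v - u_{v_0})\, \nabla\phi_v \Big\|
    \le \frac{d}{\min_{v \in \Delta} h_v} .
\]
% A minimal edge length together with the bounded dihedral angles keeps
% h_v away from zero, which bounds the Lipschitz constant from above.
```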

### C.2 Gradient in the Lagrange basis

We show that gradients of functions in the P1 (Lagrange) basis are constant per simplex and given by:

$$\nabla u\big|_{\Delta} \;=\; \sum_{v \in \Delta} u_v\, \nabla\phi_v \;=\; \sum_{v \in \Delta} u_v\, \frac{A_v}{d\,|\Delta|}\, n_v \qquad (13)$$

Here, the coefficients $u_v$ are the function values at the vertices, and $n_v$ denotes a vector of unit length, normal to the face opposite to vertex $v$, and pointing inwards towards the center of the simplex. Recall that $A_v$ is the area of the face opposite $v$ and $|\Delta|$ is the volume of simplex $\Delta$.

The gradient can be obtained with basic algebra. First, notice that the gradient of $u$ in (13) has to fulfill $\nabla u \cdot (x_w - x_v) = u_w - u_v$ for every edge between two vertices $v, w$ of the simplex, meaning that integration along the edge leads to the respective change in $u$. After collecting a sufficient number of linear equations of this form, one can directly solve the resulting linear system. Since $n_v$ is, by definition, orthogonal to all edges that do not involve vertex $v$, we arrive at (13).

Formally, we pick one vertex $x_0$ of simplex $\Delta$ and compile for the remaining vertices $x_i$ ($i = 1, \dots, d$) equations of the form $\nabla u \cdot (x_i - x_0) = u_i - u_0$. By construction, $n_i \perp (x_j - x_0)$ for $j \neq i$. The vector $n_i$ is normal to the face opposite $x_i$. The scalar product of the edge and the normal is the "height" within the simplex, so with the chosen scaling of $n_i$ we have $\frac{A_i}{d\,|\Delta|}\, n_i \cdot (x_i - x_0) = 1$ for any $i$.

Thus, multiplying each side of our equation system by a matrix with the vectors $\frac{A_i}{d|\Delta|} n_i$ as columns leads to the candidate $g = \sum_{i} (u_i - u_0)\, \frac{A_i}{d|\Delta|}\, n_i$. If we can show that $g = \nabla u$ as in (13), then we arrive at the desired expression. For $j \neq i$ we have $n_i \cdot (x_j - x_0) = 0$, so the equations decouple. All equations are also fulfilled with $u_i$ in place of $u_i - u_0$, since $\sum_{v \in \Delta} A_v\, n_v = 0$ over the closed simplex boundary, which concludes the proof.
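
The derivation above can be checked numerically. A minimal sketch in 2D (all names are ours): evaluate the face-normal formula (13) on a triangle for a function that is exactly linear, and compare against the known gradient.

```python
import numpy as np

# Triangle vertices and a linear function u(x) = g.x + 5, so the P1
# interpolant is exact and its gradient must equal g on the simplex.
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
g_true = np.array([3.0, -2.0])
u = V @ g_true + 5.0

d = 2
# simplex volume |Delta| (triangle area) via the determinant
area = 0.5 * abs((V[1, 0] - V[0, 0]) * (V[2, 1] - V[0, 1])
                 - (V[1, 1] - V[0, 1]) * (V[2, 0] - V[0, 0]))

grad = np.zeros(d)
for i in range(3):
    a, b = V[(i + 1) % 3], V[(i + 2) % 3]   # face opposite vertex i
    edge = b - a
    n = np.array([-edge[1], edge[0]])       # normal to that face
    if n @ (V[i] - a) < 0:                  # orient inwards, towards vertex i
        n = -n
    n /= np.linalg.norm(n)                  # unit inward normal n_i
    A_i = np.linalg.norm(edge)              # "area" of the opposite face
    grad += u[i] * A_i / (d * area) * n     # term u_i * A_i/(d|Delta|) * n_i

assert np.allclose(grad, g_true)
```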

### C.3 Data Term for Lagrange basis

We again start from the ideas in the main paper. We have to convert the continuous data costs into discrete form (in a practical implementation, "continuous" means that the cost $\rho$ can be evaluated at any point $x \in \Omega$). In our basis representation, we can get discrete cost values for the basis elements by convolving the continuous cost with the respective basis function. For simplicity, we consider the P1 basis function $\phi_v$ here. Thus, we seek a cost $\rho_v$ per vertex $v$. In detail we obtain:

$$\rho_v \;:=\; \int_{\Omega} \phi_v(x)\, \rho(x)\, \mathrm{d}x \qquad (14)$$

To numerically compute $\rho_v$, we sample $\rho$ at a finite number of locations $x_s$. For each $x_s$ we determine into which simplex it falls, and accumulate the contributions of $\rho(x_s)$ over all $s$, weighted by their barycentric coordinates. The final step is to scale by the area covered by the vertex and to divide by the sum of weights assigned to vertex $v$. In other words, we compute the sample mean and scale it by the area covered by the vertex. In our current implementation $\rho$ is sampled at regular grid points, without importance sampling. This simple strategy is indeed very similar to the method employed in [HaneZCAP13, BlahaVRWPS16]. There, the data cost is evaluated on a regular grid, by reprojecting grid vertices into each image, computing the data term, and adding its respective contribution to the grid location. Such a "per-voxel accumulation" is equivalent to integrating the data cost within the respective Voronoi-area of a vertex in the dual grid: the latter is proportional to the number of regular samples that fall into a Voronoi cell and therefore have the respective vertex as nearest neighbour. Hence, summing the individual contributions directly corresponds to integrating the data term within the Voronoi region.
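
The accumulation scheme can be sketched in one dimension (toy cost and mesh, all names ours): sample the cost, weight each sample by its barycentric coordinates, then rescale by the vertex support.

```python
import numpy as np

# 1D "mesh" with two elements, [0,1] and [1,3], and a toy continuous cost.
verts = np.array([0.0, 1.0, 3.0])
rho = lambda x: x ** 2

samples = np.linspace(0.0, 3.0, 3001)
cost = np.zeros(len(verts))
weight = np.zeros(len(verts))

for x in samples:
    j = 0 if x <= verts[1] else 1                     # enclosing element
    t = (x - verts[j]) / (verts[j + 1] - verts[j])    # barycentric coordinate
    for v, w in ((j, 1.0 - t), (j + 1, t)):
        cost[v] += w * rho(x)                         # weighted contribution
        weight[v] += w

# sample mean per vertex, scaled by the "area" covered by the vertex
# (half of the incident element lengths = integral of the hat function)
support = np.array([0.5, 1.5, 1.0])
rho_v = cost / weight * support

# rho_v approximates the convolution with the hat functions, e.g. the
# first vertex: integral of (1 - x) * x^2 over [0, 1] = 1/12
assert abs(rho_v[0] - 1.0 / 12.0) < 1e-3
```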

### C.4 Grid vs. P1

Here, we detail why the grid-based version with finite differences (corresponding to [HaneZCAP13]) can be seen as an approximation of our proposed FEM discretisation with P1 basis elements, if the vertices (cells) are aligned in a regular grid. Without loss of generality we consider a grid of edge length $1$, and note that in this case the gradient for a function $u$ at a grid point $\mathbf{x}$, evaluated with forward differences, becomes:

$$\nabla u(\mathbf{x}) \;=\; \sum_{i=1}^{d} \big(u(\mathbf{x}+e_i) - u(\mathbf{x})\big)\, e_i \;=\; \sum_{v \in \Delta} u_v\, \nabla\phi_v \qquad (15)$$

with $e_i$ the unit vector in direction $i$, and $\Delta$ the simplex spanned by $\mathbf{x}, \mathbf{x}+e_1, \dots, \mathbf{x}+e_d$.
We have used the identity
$\nabla\phi_{\mathbf{x}+e_i} = \frac{A_i}{d\,|\Delta|}\, n_i = e_i$, according to the definition in
Sec. C.2, and obtain the last equality from $\sum_{v \in \Delta} \nabla\phi_v = 0$, *cf.* Sec. C.2.
This is exactly the formula for the gradient of the corresponding P1 function
in the simplex defined by the vertices $\mathbf{x}, \mathbf{x}+e_1, \dots, \mathbf{x}+e_d$.
Accordingly, if implemented as finite differences, the constraints on
the dual vector field, see Eq. (2) from the paper,
are only checked within the respective simplex, but not in the whole domain
(the simplex covers 1/2 of the domain in 2D; 1/6 in 3D).
Note also that, with grid-aligned vertices, the simplex in question
cannot be part of a partition of the domain into translated copies of
itself: edges of adjacent faces would intersect.

Fig. 12 illustrates the specific case with $d = 2$. On
the left, the regular grid (*green*) and the triangle
(simplex, *red*). Grid centers correspond to vertices in the
(triangle-)mesh. The grid corresponds to the discretisation used in
[HaneZCAP13], whereas the simplex mesh is used in this paper.
The gradient of the lower left triangle for the simplex mesh
corresponds exactly to the one computed via forward differences as
shown in (15).
Consequently, discretisation via finite differences is a special case
of our method, where the elements are laid out on a regular grid, and
the constraints are tested only in the upper right triangle (in the 2D
case).
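
This equivalence is easy to verify numerically. A minimal sketch (our names), on one grid cell with edge length 1: the forward-difference gradient matches the P1 gradient of the triangle $(\mathbf{x}, \mathbf{x}+e_1, \mathbf{x}+e_2)$ computed with the face-normal formula of Sec. C.2.

```python
import numpy as np

# Values at x, x + e1, x + e2 (arbitrary numbers).
u0, u1, u2 = 0.7, 0.2, -0.4

# Forward differences, grid edge length 1.
fd_grad = np.array([u1 - u0, u2 - u0])

# P1 gradient on the lower-left triangle via the face-normal formula.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vals = np.array([u0, u1, u2])
area = 0.5                                   # |Delta| for the unit triangle
grad = np.zeros(2)
for i in range(3):
    a, b = V[(i + 1) % 3], V[(i + 2) % 3]    # face opposite vertex i
    edge = b - a
    n = np.array([-edge[1], edge[0]])
    if n @ (V[i] - a) < 0:                   # orient inwards
        n = -n
    n /= np.linalg.norm(n)
    grad += vals[i] * np.linalg.norm(edge) / (2 * area) * n

assert np.allclose(grad, fd_grad)
```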

### C.5 Adaptiveness

We have stated in the paper that our formulation is adaptive, in the sense that a hierarchical refinement of the tessellation can only decrease the energy. We have also explained a way to find the refined tessellation, by introducing additional vertices and splitting simplices, such that no faces are flipped; and we have put forward a procedure to initialise the new variables. Here, we formally prove that the described scheme is sound.

Let $x^*$ be the solution for a mesh given by the triplet
$(\mathcal{V}, \mathcal{F}, \mathcal{T})$ of vertices, faces and simplices. And let $(\mathcal{V}', \mathcal{F}', \mathcal{T}')$ be the
refined mesh with $\mathcal{V} \subset \mathcal{V}'$ and with $|\mathcal{V}'| > |\mathcal{V}|$.
Furthermore, we define the sets $\mathcal{V}^+ := \mathcal{V}' \setminus \mathcal{V}$ and
$\mathcal{T}^+$ to denote newly introduced vertices and
simplices.
Our construction works by induction, *i.e.* we introduce one vertex
at a time. The new vertex $v^+$ is assumed to lie in simplex $\Delta$,
$\Delta \in \mathcal{T}$, which is split into simplices
$\Delta_j \in \mathcal{T}^+$ with $\bigcup_j \Delta_j = \Delta$.
By definition, vertex $v^+$ has the barycentric coordinates
$(\beta_0, \dots, \beta_d)$ *w.r.t.* the vertices $x_0, \dots, x_d$ of $\Delta$, *i.e.* $x_{v^+} = \sum_i \beta_i x_i$.

We initialize the labeling variables at new vertices $v^+ \in \mathcal{V}^+$ via barycentric interpolation: $u_{v^+} = \sum_i \beta_i u_i$. Dual variables of the new simplices $\Delta_j$, and also the transition variables, are simply copied from their enclosing simplex $\Delta$. The data terms at the new vertex $v^+$, as well as at the other vertices of simplex $\Delta$, are (re)computed following (C.3), see also Fig. 13. We call the new variables $x'$ and claim that $E(x') = E(x^*)$ and that $x'$ is feasible. The latter is trivially the case (*cf.* Eq. (11) from the paper); transition variables remain positive by construction, the newly introduced labeling variables still fulfill the simplex constraints, and they induce the same gradients in the new simplices as in the parent simplex. The same applies for the cost of the regulariser in the new simplices. More interesting is the data cost. We introduce the notation $\rho_v^{\Delta}$ for the data cost at vertex $v$, originating from the integral over simplex $\Delta$. By induction, we need to only verify the following equality for simplex $\Delta$, which is split into the $\Delta_j$:

$$\sum_{v \in \Delta} u_v\, \rho_v^{\Delta} \;=\; \sum_{j} \sum_{v \in \Delta_j} u_v\, \rho_v^{\Delta_j} \qquad (16)$$

Recall we have "old" vertices $x_0, \dots, x_d$, and a new vertex $v^+$ with $u_{v^+} = \sum_v \beta_v u_v$. According to (16), we must verify (with $\rho_v^{\Delta_j} := 0$ whenever $v \notin \Delta_j$):

$$\rho_v^{\Delta} \;=\; \sum_{j} \rho_v^{\Delta_j} + \beta_v \sum_{j} \rho_{v^+}^{\Delta_j} \qquad \forall v \in \Delta \qquad (17)$$

It is sufficient to show

$$\phi_v^{\Delta}(x) \;=\; \phi_v'(x) + \beta_v\, \phi_{v^+}'(x) \qquad \forall x \in \Delta \qquad (18)$$

where the primed functions denote the P1 basis functions of the refined mesh; integrating (18) against $\rho$, *cf.* (14), yields (17).

The right hand side represents a linear function on each $\Delta_j$. We can check whether both sides agree on $d+1$ points in each simplex, which is easy to verify. The locations we check – substitute on both sides of Eq. (18) – are $x = x_w$ for the old vertices $w$ and $x = x_{v^+}$. These are the defining vertices of the simplices $\Delta_j$. Left and right hand side vanish, except for $x = x_v$ and $x = x_{v^+}$. Finally, we get $\beta_v$ for $x = x_{v^+}$ and $1$ for $x = x_v$ on both sides.
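
The interpolation step of the proof can also be exercised numerically. A small sketch (our names): split a triangle at an interior point, set the new value by barycentric interpolation, and confirm that every child triangle inherits the parent's P1 gradient.

```python
import numpy as np

V = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 2.0]])   # parent triangle
u = np.array([1.0, 4.0, -2.0])                       # values at its vertices

bary = np.array([0.2, 0.5, 0.3])                     # interior barycentric coords
x_new = bary @ V                                     # position of the new vertex
u_new = bary @ u                                     # barycentric interpolation

def p1_grad(tri, vals):
    # Fit the affine function vals = g . x + c through the three vertices.
    A = np.hstack([tri, np.ones((3, 1))])
    return np.linalg.solve(A, vals)[:2]

g_parent = p1_grad(V, u)
for i in range(3):                                   # the three child triangles
    child = np.array([V[(i + 1) % 3], V[(i + 2) % 3], x_new])
    vals = np.array([u[(i + 1) % 3], u[(i + 2) % 3], u_new])
    assert np.allclose(p1_grad(child, vals), g_parent)
```

Since the interpolated value keeps the function affine on the parent, the gradients, and hence the regulariser cost, are unchanged, in line with the argument above.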

In our adaptive version, we directly follow the proof and split
simplices with the introduction of a single new vertex. We emphasise
again that this splitting schedule is merely a proof of concept. The
FEM discretisation allows for more sophisticated refinement
schemes, *e.g.* along the lines of [Grinspun02], or flipping edges
according to the energy functional, *etc.*

### C.6 Optimisation

The energy (11) from the paper is given in primal-dual form, so optimisation with existing tools is straightforward. We apply the minimisation scheme of [chambolle11], with pre-conditioning [Pock11]. Internally, however, that algorithm requires the projection onto the Wulff shapes, which is slightly more involved.

#### C.6.1 Proxmap for the Minkowski sum of convex sets

Recall that, per label pair $(i,j)$, our Wulff shapes $W_{ij}$ are of the form of a Minkowski sum of two simple convex sets. Recall that the $W_{ij}$ encode the direction-dependent likelihood of a certain label transition. In our case, all Wulff shapes permit a closed-form projection scheme, such that we solve the following sub-problem as proximal step, independently per simplex:

$$\min_{\lambda}\; \tfrac{1}{2}\,\|\lambda - \hat{\lambda}\|_2^2 + \sigma_{W_{ij}}(\lambda) \qquad (19)$$

For the following derivation we rename the two sets $A$ and $B$, such that $\sigma_{W_{ij}} = \sigma_{A \oplus B} = \sigma_A + \sigma_B$. In order to decouple the argument within the regulariser, we introduce auxiliary variables $\lambda_a, \lambda_b$ and additional Lagrange multipliers $\mu_a, \mu_b$, and replace the arguments of $\sigma_A$ and $\sigma_B$ respectively:

$$\min_{\lambda, \lambda_a, \lambda_b}\; \max_{\mu_a, \mu_b}\; \tfrac{1}{2}\,\|\lambda - \hat{\lambda}\|_2^2 + \sigma_A(\lambda_a) + \sigma_B(\lambda_b) + \mu_a^{\top}(\lambda - \lambda_a) + \mu_b^{\top}(\lambda - \lambda_b) \qquad (20)$$

Optimality w.r.t. $\lambda$ implies:

$$\lambda \;=\; \hat{\lambda} - \mu_a - \mu_b \qquad (21)$$

which, after reinserting into (20), leads to:

$$\max_{\mu_a, \mu_b}\; \min_{\lambda_a, \lambda_b}\; -\tfrac{1}{2}\,\|\mu_a + \mu_b\|_2^2 + (\mu_a + \mu_b)^{\top}\hat{\lambda} + \sigma_A(\lambda_a) - \mu_a^{\top}\lambda_a + \sigma_B(\lambda_b) - \mu_b^{\top}\lambda_b \qquad (22)$$

Applying Fenchel duality yields:

$$\max_{\mu_a, \mu_b}\; -\tfrac{1}{2}\,\|\mu_a + \mu_b\|_2^2 + (\mu_a + \mu_b)^{\top}\hat{\lambda} - \delta_A(\mu_a) - \delta_B(\mu_b) \qquad (23)$$

The latter summands require $\mu_a \in A$ and $\mu_b \in B$; up to a constant, (23) is thus equivalent to:

$$\max_{\mu_a \in A,\; \mu_b \in B}\; -\tfrac{1}{2}\,\|\mu_a + \mu_b - \hat{\lambda}\|_2^2 \qquad (24)$$

In this last form, we can apply a few iterations of block coordinate descent on the dual variables $\mu_a, \mu_b$ and recover the update for $\lambda$ from (21).
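
The block coordinate descent can be sketched for a generic Minkowski sum (the two balls below are our stand-ins for the actual Wulff-shape factors): alternate closed-form projections onto the two summands to find the closest point in $A \oplus B$.

```python
import numpy as np

def proj_ball(x, center, radius):
    # Euclidean projection onto a ball: the closed-form step per summand.
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

# Project p onto A + B by block coordinate descent on
#   min_{a in A, b in B} 0.5 * || a + b - p ||^2 .
p = np.array([5.0, 1.0])
cA, rA = np.zeros(2), 1.0      # A: unit ball
cB, rB = np.zeros(2), 2.0      # B: ball of radius 2, so A + B has radius 3

a = np.zeros(2)
b = np.zeros(2)
for _ in range(50):
    a = proj_ball(p - b, cA, rA)   # minimise over a with b fixed
    b = proj_ball(p - a, cB, rB)   # minimise over b with a fixed

# For concentric balls the Minkowski sum is the radius-3 ball, so the
# projection of p is p rescaled to norm 3.
assert np.allclose(a + b, 3.0 * p / np.linalg.norm(p))
```

For these simple factor sets the iteration settles after two sweeps; in general each sweep decreases the (strongly convex in the sum) objective, which is what makes a few iterations per proximal step sufficient in practice.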

## Appendix D Raviart-Thomas basis

### D.1 Methodology

In this section, we show how to discretise the convex relaxation, Eq. (2) from the paper, for the case of the Raviart-Thomas basis. For convenience, we restate the energy:

(25) |

The Raviart-Thomas basis is chosen as a strong contrast to the
(preferred) Lagrange basis. With Raviart-Thomas functions, we model the dual
functions in (25)
within our trial space.
The Raviart-Thomas basis functions describe an
$H(\operatorname{div})$-conforming polynomial basis of first order, *i.e.* the divergence
of the modeled vector field is continuous across simplices.
We again discretise on a simplex mesh
with vertices $v$; faces $f$, defined by $d$
vertices; and simplices $\Delta$, defined by $d+1$ vertices, which
partition the domain: $\bigcup_{\Delta} \Delta = \Omega$.
