Learning Structural Graph Layouts and 3D Shapes for Long Span Bridges 3D Reconstruction

by   Fangqiao Hu, et al.
Harbin Institute of Technology

A learning-based 3D reconstruction method for long-span bridges is proposed in this paper. 3D reconstruction generates a 3D computer model of a real object or scene from images; it involves many stages and open problems. Existing point-based methods focus on generating 3D point clouds and reconstructing polygonal meshes or fitting-based geometrical models, and have achieved great results for civil structures in urban scenes under Manhattan-world constraints. Difficulties arise when these systems are transferred to structures with complex topology and part relations, such as steel trusses and long-span bridges. This can be attributed to the fact that point clouds are often unevenly distributed, noisy, occluded and incomplete; recovering a satisfactory 3D model from such highly unstructured point clouds in a bottom-up pattern, while preserving geometrical and topological properties, poses an enormous challenge to existing algorithms. Considering the prior human knowledge that these structures conform to regular spatial layouts in terms of components, a learning-based topology-aware 3D reconstruction method that obtains high-level structural graph layouts and low-level 3D shapes from images is proposed in this paper. We demonstrate the feasibility of this method by testing it on two real long-span steel truss cable-stayed bridges.





1 Introduction

The aim of this paper is to learn, fully automatically, a 3D model of long-span bridges (hereinafter referred to as "bridges") from images captured by unmanned aerial vehicles (UAVs).

Recent developments in UAVs have expanded their applications to a wide range of industrial scenarios. UAVs can explore inaccessible areas carrying various types of sensors, such as digital cameras, infrared cameras and laser scanners (LiDAR), and are thus expected to play an increasingly important role in civil structure visual inspection systems. Cameras mounted on UAVs can capture images of a structure from various viewpoints and record its exterior state. However, current visual inspection strategies often produce massive sets of unstructured and unaligned digital images, which take considerable human effort to filter and organize for further usage; fully automating this task has therefore become an urgent problem. One feasible solution is to integrate these images into a single 3D model of the target structure, which enables inspectors to engage with the images in a more intuitive manner and provides a better monitoring scheme by recording and visualizing the life cycle of the entire structure. However, existing 3D models are as-designed models, which differ from the current as-is state and fail to express the present exterior condition of the structure, so recording or scanning the structure on site is believed to be the only feasible solution.

Up to now, the research on civil structure 3D reconstruction has tended to focus on local points (i.e., pixels in image grids and points in 3D Euclidean space) rather than global features. These methods first obtain 3D coordinates of key-points and camera parameters from multiple images using structure-from-motion (SfM) [1, 2, 3], then generate dense 3D point clouds with multi-view stereo (MVS) [4, 5].

Point clouds are rarely used directly in practice, since they are unstructured point sets and are not suitable for texture mapping. Based on point clouds, there are two families of post-processing methods: surface reconstruction [6, 7, 8] and fitting-based point cloud modeling [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34].

Surface reconstruction generates a polygonal mesh (a triangular mesh in most cases) directly as the surface of a 3D point cloud. A triangular mesh is represented as a set of ordered 3D vertices and triangular faces, where each face is defined by 3 vertex indexes. However, a polygonal mesh is often overly complex for civil structures in urban scenes, where most surfaces are flat. Worse still, the meshing quality may degrade significantly because: (a) an MVS point cloud often suffers from severe noise and uneven distribution, unlike LiDAR point clouds; (b) the target 3D object often has complex geometry and topology, as with the cable-stayed bridge in this paper; (c) no structure priors are introduced in surface reconstruction methods.

Fitting-based modeling methods generate a compact geometric model to fit a 3D point cloud, where basic geometric shapes (e.g., planes, cones, cuboids, cylinders, pyramids, spheres, etc.) are employed, see Figure 1. Sometimes point cloud segmentation algorithms (see [35] for a literature review) are used for pre-processing. Fitting-based methods can be summarized as: (a) piece-wise plane fitting; (b) 3D geometric primitive fitting; (c) non-uniform rational b-spline (NURBS) curve and surface fitting; (d) hybrid fitting. They are briefly discussed in the following.

Figure 1: Basic geometric shapes.

Piece-wise planar fitting methods use planes to fit a point cloud. The Manhattan-world prior is introduced because it is common in real-world buildings [9, 10]. Planar fitting has achieved remarkable results in building exterior reconstruction [10, 11, 12]; Holzmann et al. [13] use lines in addition to point clouds, and Raposo et al. [14] use planes rather than being limited to point clouds, both achieving better results.

3D geometric primitive fitting methods use 3D geometric primitives to fit a point cloud; sometimes constructive solid geometry (CSG) models are employed. Representing a 3D object with 3D geometric primitives is the most common approach in the computer vision and computer graphics fields [15]. There are many methods to fit 3D geometric primitives, including RANSAC [16, 17, 22], the Hough transform [18], primitive-driven region growing [19] and learning-based methods [20, 21]. These have been widely used in building exterior reconstruction [22, 23] and building information modeling (BIM), e.g., of pipes [24, 25], and have achieved remarkable results.

NURBS fitting uses NURBS curves and surfaces to fit a point cloud, which is useful for curved lines and surfaces, including power lines [26, 27], complex building exteriors and curved pipes [28, 29], etc.

Hybrid fitting [30, 31, 32, 33] combines LiDAR point clouds, MVS point clouds, surface reconstruction, geometric primitive fitting, etc., as an engineering solution; Google Earth [34] is one example.

However, fitting-based methods fail to reconstruct the MVS point cloud in this paper due to: (a) noise, uneven distribution, missing points and occlusion in the MVS point cloud; (b) the absence of high-level structure priors, since fitting-based methods only consider bottom-up fitting.

In summary, surface reconstruction and point cloud modeling are both susceptible to noise and operate in a bottom-up fashion in which high-level structure priors are rarely introduced; this poses an enormous challenge to existing algorithms.

Machine learning has long been a question of great interest in a wide range of fields, including computer vision and computer graphics. In contrast to traditional point-based algorithms, which can be considered point-wise stereo vision measurements that rely on neither global features nor prior knowledge, learning-based methods [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] learn global features from images. In general, these algorithms first encode single or multiple images into a latent vector using convolutional neural networks (CNNs) and then decode it into 3D models represented, in terms of output 3D representation form, by: (a) polygonal mesh models [36, 37, 38, 39, 40]; (b) geometric primitive models [41, 42, 43]; (c) point cloud models [44, 45, 46]; or (d) volumetric models [47, 48, 49, 50]. These methods work well on synthesized benchmark datasets [51, 52] but are still far from meeting industrial engineering requirements.

Table 1 and Figure 2 show a comparison of different representation forms of a bridge model; all models are directly converted from the same original 3D bridge model. They are: (a) a triangular mesh, defined by a set of vertices and faces; (b) geometric primitives, defined by a set of cuboids and cylinders; (c) a point cloud, defined by a set of points; and (d) a volumetric model, defined by a 3D voxel grid.

Representation | Pros. | Cons.
Triangular mesh | High representation ability for details | Vertex connections are very hard to learn
Geometric primitives | Compact; parameterized; able to learn | Loses details
Point cloud | Manipulation flexibility; easy to learn | Low representation ability; high memory overhead
Volumetric model | Easy to learn | Low resolution; high memory overhead
Table 1: A comparison of different representation forms.
Figure 2: A comparison of different representation forms.

The objective of this paper is to learn a 3D model for bridges from UAV images. We use hybrid models including geometric primitives and volumetric models to represent a bridge object, and hybrid input data including images and point clouds.

2 Methodology

Only images of long-span bridges captured by UAVs are used for 3D reconstruction. Consider how a human understands the bridge in the picture shown in Figure 3: one can delicately distinguish the bridge object from the background, and then infer that this bridge has two towers arranged symmetrically, that the truss blocks are aligned repeatedly and that the cables are similar to one another. Inspired by this human recognition process, a hierarchical binary parsing tree is proposed in this paper to parse a bridge structure, as shown in Figure 4.

Figure 3: A picture of a long-span cable-stayed bridge. Humans can delicately distinguish the bridge object in the picture from the background, and then infer that this bridge has two towers arranged symmetrically, that the truss blocks are aligned repeatedly and that the cables are similar to one another.
Figure 4: The parsing tree for a representative simplified cable-stayed bridge. (a) Structural graph layouts used for parsing this bridge, and (b) Corresponding hierarchical binary parsing tree (four node types are used in total).

The entire procedure for 3D reconstruction of a bridge is summarized as: (a) first, a 3D point cloud and camera parameters are obtained from images as 3D global initial information through a general rigid SfM and MVS pipeline; (b) then, regions of interest (RoIs) in the 3D point cloud and 2D images are obtained to reduce the interference of background and noise; (c) next, the 3D point cloud, images and RoIs are fed into the proposed learning framework, and a binary tree describing the structural graph layouts is determined via a learning algorithm; (d) once the binary tree is obtained, 3D shapes are further learned from its shape nodes; and (e) the final 3D model is assembled from the structural graph layouts and 3D shapes.

2.1 3D Point Cloud and Camera Parameters

This step is to obtain a 3D point cloud and camera parameters (see Figure 5) from images as 3D global initial information through a general rigid SfM and MVS pipeline. SfM obtains 3D coordinates of key-points and camera parameters from multiple images. It can be summarized as (a) 2D key-point extraction and matching; (b) camera parameter verification; and (c) sparse 3D key-point reconstruction. MVS then takes the output of SfM (i.e., camera parameters and coordinates of 3D key-points) as input and generates depth-maps, dense 3D point clouds and so on. These steps are described in detail in the following.

Figure 5: 3D point cloud and camera parameters obtained by rigid SfM and MVS pipeline provided by [3]. Point clouds are represented by colored points, camera poses (i.e., locations and orientations) are represented by red blocks.

2.1.1 2D key-points extraction and matching

This step is to extract key-points and find key-point pairs in multi-view images. A 2D key-point is an image coordinate around which the pixel values vary greatly.

Given an image $I(x, y)$, the scale invariant feature transform (SIFT) [53] key-point extraction is formulated as

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma), \qquad L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where $(x, y)$ is a pixel coordinate within an image and $L(x, y, \sigma)$ is a Gaussian blurred image parameterized by a factor $\sigma$ and generated by convoluting a Gaussian kernel $G(x, y, \sigma)$ with the input image $I(x, y)$. A pyramid-like scale space is constructed in [53] to store Gaussian blurred images with different factors and different sizes to enable multi-scale detection. The factors are $(\sigma, k\sigma, k^2\sigma, k^3\sigma, \ldots)$ in each octave, where $k = 2^{1/s}$ and $s$ is the number of intervals in each octave. The image size is halved from a lower octave to a higher one. $D(x, y, \sigma)$ is a difference-of-Gaussian image produced by subtracting two adjacent Gaussian blurred images in each octave. The key-point candidates are extracted by comparing each pixel in $D$ to its neighbors at the current factor and the two adjacent factors in each octave; a pixel is chosen if it is larger or smaller than all of its neighbors.
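The pyramid construction described above can be sketched in a few lines; this is an illustrative re-implementation, not the paper's code (names such as `dog_pyramid` and `sigma0` are our own, and `sigma0 = 1.6` is the default suggested in [53]):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma0=1.6, intervals=3, octaves=4):
    """Difference-of-Gaussian pyramid as in SIFT.

    Returns a list of octaves; each octave is a list of DoG images
    obtained by subtracting adjacent Gaussian-blurred images.
    """
    k = 2.0 ** (1.0 / intervals)  # factor step within an octave
    pyramid = []
    img = image.astype(np.float64)
    for _ in range(octaves):
        # intervals + 3 blurred images give intervals + 2 DoG images per octave
        blurred = [gaussian_filter(img, sigma0 * k**i) for i in range(intervals + 3)]
        dogs = [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]
        pyramid.append(dogs)
        img = img[::2, ::2]  # halve the image size for the next octave
    return pyramid
```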

A key-point descriptor represented by a 128-D vector is needed to characterize a key-point. It assigns gradient magnitudes and orientations over a region of neighboring pixels, where each magnitude and orientation is expressed as

$$m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^2 + \big(L(x, y+1) - L(x, y-1)\big)^2}$$
$$\theta(x, y) = \tan^{-1}\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}$$

These gradient magnitudes and orientations are then accumulated into orientation histograms by summarizing the contents over subregions, which generates a descriptor in the form of a $4 \times 4$ array of histograms, each with 8 orientation bins. This results in a $4 \times 4 \times 8 = 128$ dimensional vector, which is then normalized to form the key-point descriptor.

Matching is to find all descriptor pairs in an image pair. Approximate nearest neighbors (ANN) [54] is used to reduce algorithm complexity compared with exhaustive descriptor matching. Given an image pair, for a query descriptor in one image, ANN finds the approximate closest descriptor in the other image by constructing a priority-search k-means tree w.r.t. a metric distance (here the Euclidean distance between descriptors).

In summary, given the input images, the SIFT detector extracts 2D key-points and their descriptors (128-D vectors in [53]) for all images, resulting in a key-point set indexed by image and key-point. These key-points are then matched via distances of descriptors for all image pairs, resulting in a set of matched key-point pairs.
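The matching step can be sketched as follows. Note the substitutions: a k-d tree stands in for the priority-search k-means tree of [54], and Lowe's ratio test is added as a common filtering heuristic not specified above:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match descriptors between two images with an approximate
    nearest-neighbour tree and a ratio test (ratio value assumed)."""
    tree = cKDTree(desc_b)
    dists, idx = tree.query(desc_a, k=2)  # two nearest neighbours per query
    matches = []
    for i, ((d1, d2), (j, _)) in enumerate(zip(dists, idx)):
        if d1 < ratio * d2:  # keep only distinctive matches
            matches.append((i, j))
    return matches
```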

2.1.2 Camera parameters verification

This step is to verify camera parameters for multiple views, including intrinsic parameters (e.g., focal length) and extrinsic parameters (camera motions). In this section, the homogeneous coordinate system is introduced to simplify calculations; e.g., a 2D key-point $(x, y)$ is expressed as $\mathbf{x} = (x, y, 1)^\mathsf{T}$ in homogeneous coordinates. In single view geometry, the camera matrix $P$ defines a projective mapping $\mathbf{x} \sim P\mathbf{X}$, where $\mathbf{X}$ and $\mathbf{x}$ are points in 3D world coordinates and 2D pixel coordinates, respectively, in homogeneous representation. The camera parameters include the intrinsic matrix $K$, which defines a projective mapping from 3D camera coordinates (Cartesian coordinate system) to 2D pixel coordinates (homogeneous coordinate system), and the extrinsic parameters $[R \mid \mathbf{t}]$, where $R$ is a rotation matrix and $\mathbf{t}$ is a translation vector, defining the rotation and translation from 3D world coordinates (homogeneous coordinate system) to 3D camera coordinates (Cartesian coordinate system). A camera lens radial distortion function is introduced to eliminate the image distortion caused by the camera lens; it is a nonlinear polynomial function of the distance from the current pixel to the image center, with radial distortion parameters as coefficients. As a result, the camera parameters are defined by the matrix $P = K[R \mid \mathbf{t}]$, together with the distortion parameters.

In multi-view geometry, $K$ and the distortion parameters are shared among different views if they are obtained with the same camera. Note that $K$ and the distortion parameters are known, calibrated or estimated approximately beforehand and optimized at the last stage. Images are then undistorted to avert nonlinear mapping functions, i.e., only $P = K[R \mid \mathbf{t}]$ is considered; hence only the camera motion $[R \mid \mathbf{t}]$ for each view needs to be solved. Generally, the camera coordinate system of the first view is used as the world coordinate system. In this study, the intrinsic parameters are estimated from exchangeable image file format (EXIF) tags and are shared among different views.

To solve the camera motion $[R \mid \mathbf{t}]$, consider an image pair (i.e., the $i$-th and $j$-th image) in epipolar geometry (i.e., two-view geometry). The camera parameters of the two views are encapsulated in the fundamental matrix $F = K^{-\mathsf{T}}[\mathbf{t}]_\times R K^{-1}$, where $\mathbf{t}$ and $R$ describe the relative translation and rotation of the two views, and $[\mathbf{t}]_\times$ is the skew-symmetric matrix of $\mathbf{t}$ expressing the cross product as a matrix product. The fundamental matrix gives constraints on how the scene changes under two views, or how the coordinates of a matched point pair change within an image pair. A matched key-point pair in homogeneous coordinates fulfills $\mathbf{x}_j^\mathsf{T} F \mathbf{x}_i = 0$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are two key-points in the $i$-th and $j$-th image. The first step is to solve $F$, subject to $\operatorname{rank}(F) = 2$, from key-point pairs; the 9 elements of $F$ can be solved by a least squares algorithm since an image pair has a large number of matched key-point pairs. The next step is to solve $R$ and $\mathbf{t}$ from the essential matrix $E = K^\mathsf{T} F K = [\mathbf{t}]_\times R$. Assume that the camera matrices of the first and second view are $P_1 = K[I \mid \mathbf{0}]$ and $P_2 = K[R \mid \mathbf{t}]$ (i.e., the world coordinate system is the first camera coordinate system); $R$ and $\mathbf{t}$ can then be retrieved from $E$ as follows. Suppose the singular value decomposition (SVD) of $E$ is $U \operatorname{diag}(1, 1, 0) V^\mathsf{T}$, since $E$ has two equal singular values (refer to [1] for proof). The solutions of $R$ and $\mathbf{t}$ are up to a scale and a four-fold ambiguity: $R = UWV^\mathsf{T}$ or $UW^\mathsf{T}V^\mathsf{T}$, and $[\mathbf{t}]_\times = UZU^\mathsf{T}$ (i.e., $\mathbf{t} = \pm\mathbf{u}_3$, the last column of $U$), where $W$ is orthogonal and $Z$ is skew-symmetric. A solution of $R$ and $\mathbf{t}$ is valid only when a reconstructed point is in front of both cameras.
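The four-fold ambiguity can be made concrete with a short numpy sketch (the determinant-sign fix-ups follow common practice; names are illustrative):

```python
import numpy as np

def decompose_essential(E):
    """Recover the four (R, t) candidates from an essential matrix via SVD.

    The valid pair is chosen afterwards as the one placing triangulated
    points in front of both cameras; t is recovered only up to scale.
    """
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1); E is defined only up to sign
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]  # last column of U, up to sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```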

Random sample consensus (RANSAC) [55] is an iterative method to remove outliers in key-point matching and improve epipolar geometry accuracy: the epipolar geometry is estimated from randomly and iteratively sampled key-point pairs rather than from all key-point pairs.

2.1.3 3D Key-points reconstruction

This step is to reconstruct 3D key-points from 2D key-point pairs. A 3D point $\mathbf{X}$ in homogeneous representation is obtained by the triangulation algorithm for each matched key-point pair via the camera matrices $P_i$ and $P_j$ of the $i$-th and $j$-th image, i.e., $\mathbf{x}_i \sim P_i\mathbf{X}$ and $\mathbf{x}_j \sim P_j\mathbf{X}$, where the sign $\sim$ implies a non-zero scale factor since homogeneous representation is involved.

For multiple views, incremental structure-from-motion [2, 3] first initializes with a two-view reconstruction, then registers other views to the current reconstruction one by one. A new view image is registered to the current reconstruction if it observes existing 3D points, i.e., if key-points in the new view overlap those of the current views w.r.t. key-point descriptors. The camera motion of the new view is then estimated with a Perspective-n-Point (PnP) algorithm from corresponding 3D and 2D points. New 3D points are then obtained with the triangulation algorithm in a pair-by-pair mode, resulting in a new reconstruction. The reconstruction is complete when no new image can be registered. To record the correspondence of 2D points among multiple views, a track is defined for each reconstructed 3D point, i.e., a list of corresponding (w.r.t. key-point descriptors) 2D points across the different views.

Bundle adjustment refines all camera parameters and the positions of all reconstructed 3D key-points to minimize the overall re-projection error, expressed as $\min \sum_{i}\sum_{j} \|\mathbf{x}_{ij} - P_i\mathbf{X}_j\|^2$, and is performed after every incremental reconstruction.
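The re-projection error that bundle adjustment minimizes can be evaluated, for one calibrated camera and with lens distortion omitted, as in this illustrative sketch:

```python
import numpy as np

def reprojection_error(points3d, points2d, K, R, t):
    """Total squared re-projection error for one camera.

    points3d: (n, 3) world points; points2d: (n, 2) observed pixels;
    K: 3x3 intrinsics; R, t: extrinsic rotation and translation.
    """
    cam = points3d @ R.T + t          # world -> camera coordinates
    proj = cam @ K.T                  # camera -> homogeneous pixel coordinates
    uv = proj[:, :2] / proj[:, 2:3]   # de-homogenize
    return float(np.sum((uv - points2d) ** 2))
```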

2.1.4 Multi-view stereo (MVS)

This step is to generate a dense point cloud. The outputs of SfM are sparse 3D key-points and camera parameters; to get dense 3D points, the depth values of all pixels need to be calculated via epipolar geometry.

Given an image pair and one pixel $\mathbf{x}_1$ within the first image, the preimage of $\mathbf{x}_1$ is a ray that goes from the camera center through $\mathbf{x}_1$. To find the corresponding pixel in the second image, consider epipolar geometry: all pixels $\mathbf{x}_2$ that fulfill the epipolar constraint $\mathbf{x}_2^\mathsf{T} F \mathbf{x}_1 = 0$ make up the epipolar line $\mathbf{l}_2 = F\mathbf{x}_1$. In fact, $\mathbf{x}_2$ should lie on the projection of the preimage onto the second image, i.e., on the epipolar line $\mathbf{l}_2$. The search for the corresponding pixel is to find the best-matched pixel to $\mathbf{x}_1$ along the epipolar line $\mathbf{l}_2$; this is done via a sliding window rather than a single pixel. Consider a fixed window around the point $\mathbf{x}_1$ in the first image and a sliding window along the epipolar line in the second image. Normalized Cross Correlation (NCC) compares the pixel values within the two windows, expressed as $\mathrm{NCC} = \frac{\langle \mathbf{v}_1 - \bar{v}_1,\ \mathbf{v}_2 - \bar{v}_2 \rangle}{\|\mathbf{v}_1 - \bar{v}_1\|\,\|\mathbf{v}_2 - \bar{v}_2\|}$, where $\mathbf{v}_1$ and $\mathbf{v}_2$ are the vectorized pixel values within the first and second window, $\bar{v}_1$ and $\bar{v}_2$ are their mean values, and $\langle \cdot, \cdot \rangle$ denotes the inner product. The corresponding pixel in the second image is selected where the NCC score reaches its maximum. The depth values of two corresponding pixels are then calculated by the triangulation algorithm mentioned in Section 2.1.3; that is, the depth value for $\mathbf{x}_1$ is the distance from the first camera center to its reconstructed 3D point, and likewise for $\mathbf{x}_2$.
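The NCC score between two windows can be computed as below (a minimal sketch; the sliding-window search along the epipolar line is omitted):

```python
import numpy as np

def ncc(win_a, win_b):
    """Normalized cross-correlation between two equally sized windows.

    Returns a score in [-1, 1]; 1 means identical up to brightness/contrast.
    """
    a = win_a.astype(np.float64).ravel()
    b = win_b.astype(np.float64).ravel()
    a -= a.mean()  # remove mean so the score is brightness-invariant
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```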

By calculating all pixel depth values within an image, a depth-map is obtained, which shares the identical camera parameters with its original RGB image. By treating a depth-map as a 2D array of 3D points, multiple depth-maps can be considered as a merged 3D point cloud. Patch-based multi-view stereo (PMVS) [5] is an alternative algorithm to generate dense point cloud. Figure 5 shows 3D key-points, dense point cloud and camera poses.

2.2 Region of interest (RoI) in 3D and 2D

This step is to obtain the region of interest (RoI) in the 3D point cloud and 2D images. The RoI is necessary since background and noise interfere with the algorithms significantly [56].

2.2.1 RoI in 3D: a 3D orientated bounding box (3D OBB)

A 3D orientated bounding box (3D OBB) is an approximate solution to find the RoI in a 3D point cloud. The raw point cloud covers a large range including many environmental points, so we first filter the point cloud with the criterion below:

$$P_{\text{out}} = \{\, \mathbf{p} \in P_{\text{in}} : \|\mathbf{p} - \boldsymbol{\mu}\| \le \lambda\sigma \,\}$$

where $P_{\text{in}}$ and $P_{\text{out}}$ denote the input and output 3D point sets, $\boldsymbol{\mu}$ and $\sigma$ stand for the mean and standard deviation of $P_{\text{in}}$, and $\lambda$ is a fixed scale factor. This results in a sphere boundary parameterized by center $\boldsymbol{\mu}$ and radius $\lambda\sigma$; the chosen $\lambda$ works well in our experiments.
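A minimal sketch of this filtering criterion, interpreting the standard deviation as that of the point-to-centroid distances (the paper's actual scale factor is not stated here, so `scale` is a placeholder):

```python
import numpy as np

def filter_point_cloud(points, scale=1.0):
    """Keep points inside a sphere centred at the centroid.

    The radius is `scale` times the standard deviation of the
    point-to-centroid distances (one reading of the criterion above).
    """
    mu = points.mean(axis=0)                      # sphere centre
    dists = np.linalg.norm(points - mu, axis=1)   # distance of each point to the centre
    return points[dists <= scale * dists.std()]   # drop far-away environmental points
```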

A 3D OBB has 9 degrees of freedom (DoFs): a 3D translation $(x, y, z)$ indicating the OBB center coordinates, 3 rotation angles, and 3 dimensional parameters, namely length, width and height $(l, w, h)$. The baseline of the camera is rectified to horizontal in advance (if not, many image processing tools can rectify the horizon line to horizontal by rotating images), so only one rotation angle $\theta$ instead of three is needed to determine the 3D OBB; hence a 7-DoF 3D OBB is denoted as $(x, y, z, \theta, l, w, h)$.

We design a simple yet efficient two-step convolutional neural network to obtain the 7-DoF 3D OBB. The configuration of the CNN (see Figure 6) follows the VGG [57] style. In Figure 6, the input of this network is an image and the output is a parameter vector. In the first step, the input is a gray-scale bird view image and the output is a 2D OBB. Applying the 2D OBB to filter and align the point cloud yields the front view image. In the second step, the input is a gray-scale front view image and the output contains the remaining parameters. After that, the 3D point cloud is filtered with the 7-DoF OBB, see Figure 7. Table 2 lists the detailed properties of layers and operators. The function of these layers is described in the following.

Figure 6: The proposed CNN. The input of this network is an image and the output is a parameter vector. From conv1 to conv7, each block contains 2 convolutional (Conv) layers, 2 batch normalization (BN) layers, 2 activation (PReLU) layers and 1 max pooling (MP) layer, composed in the sequence Conv – BN – PReLU – Conv – BN – PReLU – MP. Each of the last 3 fully connected (FC) layers, fc8 to fc10, is composed in the sequence FC – PReLU.

Figure 7: The obtained 3D OBB. In the first-step CNN, a 2D OBB is obtained from the bird view (left). Given the 2D OBB in the bird view, the second-step CNN obtains the remaining parameters from the front view of the bridge (right).
Layers Feature size Operators Kernel size No. Stride Padding
0 Input - - - -
1 Conv 8 1 1
2 BN - - - -
3 PReLU - - - -
4 Conv 8 1 1
5 BN - - - -
6 PReLU - - - -
7 MP - 2 -
8 Conv 16 1 1
9 BN - - - -
10 PReLU - - - -
11 Conv 16 1 1
12 BN - - - -
13 PReLU - - - -
14 MP - 2 -
15 Conv 32 1 1
16 BN - - - -
17 PReLU - - - -
18 Conv 32 1 1
19 BN - - - -
20 PReLU - - - -
21 MP - 2 -
22 Conv 64 1 1
23 BN - - - -
24 PReLU - - - -
25 Conv 64 1 1
26 BN - - - -
27 PReLU - - - -
28 MP - 2 -
29 Conv 128 1 1
30 BN - - - -
31 PReLU - - - -
32 Conv 128 1 1
33 BN - - - -
34 PReLU - - - -
35 MP - 2 -
36 Conv 256 1 1
37 BN - - - -
38 PReLU - - - -
39 Conv 256 1 1
40 BN - - - -
41 PReLU - - - -
42 MP - 2 -
43 Conv 512 1 1
44 BN - - - -
45 PReLU - - - -
46 Conv 512 1 1
47 BN - - - -
48 PReLU - - - -
49 MP - 2 -
50 Re-shape - - - -
51 FC - - -
52 Dropout (0.5) - - - -
53 BN - - - -
54 PReLU - - - -
55 FC - - -
56 Output - - - -
Table 2: Detailed properties of layers and operators.

The convolutional operation extracts features from an image through a convolutional kernel. One single channel of an image is viewed as a matrix, so convolution here indicates 2-dimensional discrete convolution, expressed as

$$O(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(i - m, j - n)\, K(m, n)$$

where $*$ means convolution, $I$, $O$ and $K$ represent the input image, output image and learnable convolutional kernel, respectively, and $(i, j)$, $(m, n)$ are pixel coordinates. Note that there are slight differences between convolution in mathematics and in recent deep learning libraries [58, 59], where channels, stride and padding are employed and no flip is required for convolutional kernels since they are randomly initialized.
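The library-style operation (cross-correlation with stride and zero padding, no kernel flip) can be sketched for a single channel as:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2-D convolution as deep learning libraries define it:
    a sliding dot product (cross-correlation, no kernel flip)."""
    img = np.pad(image, padding)  # zero padding on all sides
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the kernel
    return out
```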

Batch normalization fixes the means and variances of each layer's inputs; it facilitates network convergence and alleviates over-fitting. Given an input mini-batch, for each dimension of the BN layer's input, the output is expressed as

$$y = \gamma\,\frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$

where $x$ and $y$ are one dimension of the BN layer's input and output, $\mu_B$ and $\sigma_B^2$ are the mean and variance of the mini-batch, $\gamma$ and $\beta$ are learnable parameters, and a small constant $\epsilon$ is added for numerical stability.
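A minimal per-dimension batch normalization sketch (training-mode statistics only; running averages for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature dimension over the mini-batch axis (axis 0),
    then apply the learnable scale gamma and shift beta."""
    mu = x.mean(axis=0)                      # per-dimension batch mean
    var = x.var(axis=0)                      # per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized input
    return gamma * x_hat + beta
```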

The activation layer enables the non-linearity of networks and is attached after each layer in the network. We use the Parametric ReLU (PReLU) [60], expressed as

$$f(x) = \max(0, x) + a \min(0, x)$$

where $a$ is a learnable parameter and $x$ is the input.
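PReLU is a one-liner; `a = 0.25` is the common initial value of the learnable slope from [60], assumed here:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` for negative."""
    return np.where(x > 0, x, a * x)
```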

The max pooling operation down-samples an image while preserving its dominating features, expressed as

$$O(i, j) = \max_{(m, n) \in \mathcal{W}(i, j)} I(m, n)$$

where $I$ and $O$ represent the input and output images and $\mathcal{W}(i, j)$ is the pooling window at output coordinate $(i, j)$.
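Max pooling over non-overlapping windows can be sketched as:

```python
import numpy as np

def max_pool2d(image, size=2, stride=2):
    """Keep only the largest value inside each pooling window."""
    oh = (image.shape[0] - size) // stride + 1
    ow = (image.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()  # dominant feature of the window
    return out
```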

Fully connected (FC) layers reduce or increase the dimension of features using a simple matrix product, expressed as

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $\mathbf{x}$ and $\mathbf{y}$ are the input and output vectors and $W$ and $\mathbf{b}$ are learnable parameters. The first FC layer requires a huge number of learnable parameters (about 134M), causing a severe over-fitting problem, which is alleviated by dropout [61]. Dropout disconnects the connections of neurons randomly with a fixed probability. In this paper, dropout layers with probability 0.5 are attached after each FC layer that has more than 1M parameters.

The CNN is trained by minimizing the loss function below, considering the 2D OBB parameters in the first-step CNN and the remaining parameters in the second-step CNN, respectively,

where the two terms stand for the ground truth bounding box and the predicted bounding box respectively, and the sign denotes the Hadamard product for vectors (i.e., element-wise product). This weighting forces the network to also optimize the smaller values in the bounding box vector. Training data and details are described in Section 3.

2.2.2 RoI in 2D: A foreground-background segmentation

The RoI in 2D is a foreground-background segmentation mask (a gray-scale image); it is obtained from the 3D point cloud, the 3D OBB and the camera parameters, see Figure 8. The 3D point cloud is first filtered by the 3D OBB and then projected into 2D using the camera parameters. The projection is expressed as

$$\mathbf{x} \sim K\, D\big([R \mid \mathbf{t}]\,\mathbf{X}\big)$$

where $\mathbf{X}$ denotes a point of the filtered 3D point cloud, $D$ applies the camera lens radial distortion, $K$ and $[R \mid \mathbf{t}]$ are the intrinsic and extrinsic parameters, and $\mathbf{x}$ is the projection on the image plane in homogeneous representation. Note that the set of projected points is not ready to use; to get a foreground-background segmentation image, all 2D points should be accumulated onto the 2D image grid. A general method is distance-based accumulation: when accumulating a 2D point with value 1, each of its four neighboring pixels is incremented by a weight that decreases with the distance from the 2D point to that pixel.
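The projection of filtered 3-D points into an image can be sketched as below (lens distortion and the distance-based accumulation are omitted; names are illustrative):

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (n, 3) world points to (n, 2) pixel coordinates using the
    intrinsic matrix K and extrinsics [R | t]; distortion is omitted."""
    cam = points @ R.T + t           # world -> camera coordinates
    pix = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]  # de-homogenize by the depth
```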

Figure 8: The original RGB image (left) and the obtained foreground-background segmentation (right) from 3D point cloud, 3D OBB and camera parameters.

2.3 Learning structural graph layouts and 3D shapes

This step is to learn structural graph layouts and 3D shapes from the 3D point cloud and images with the learning framework we design, see Figure 9. A multi-view convolutional neural network (Multi-view CNN, Figure 11) combined with a point cloud network (Figure 12) encodes multi-view images and the point cloud into a latent feature, which is then decoded into structural graph layouts (i.e., a hierarchical structural parsing binary tree) by a recursive binary tree network (Recursive BiTreeNet, Figure 13); 3D shapes are decoded from the shape nodes in the binary parsing tree (Figure 14). We formulate the learning framework below.

Figure 9: The proposed learning framework. The Multi-view CNN and point cloud network learn features from images and point clouds; the Recursive BiTreeNet decodes these features into the final 3D model.

2.3.1 Multi-view convolutional neural network (Multi-view CNN)

CNNs learn features from images. The multi-view CNN in this section (Figure 11) learns a feature from multi-view images and can be seen as a mapping function from the views to a latent vector. A single view has two channels: a gray-scale image and its corresponding foreground-background segmentation. A 16-times down-sampled image resolution was chosen to balance algorithm performance and memory overhead, see Figure 10. Such high-resolution images with multiple views require huge numbers of learning parameters and much memory; to alleviate this, we use a multi-view CNN in which convolutional kernels are shared among different views. Compared with general frameworks such as VGG16 [57] (553M), ResNet101 [62] (178M) and DenseNet121 [63] (31M), the proposed multi-view CNN (6M) reduces learning parameters and memory cost significantly, and thus runs on a consumer-grade GPU with 11GB memory at batch size 1 with up to 6 views.

Figure 10: Choosing a resolution. 16 times down-sampling is chosen as a balance between memory cost and performance.
Figure 11: Multi-view Convolutional Neural Network (Multi-view CNN). Terminology: Conv, convolutional layer; Pooling, max pooling layer; Re-shape, convert a tensor to a vector; FC layers, fully connected layers, indicating a three-layer fully connected network in this figure.

Figure 11 illustrates the Multi-view CNN; each view branch contains convolutional layers, activation layers and max pooling layers. Batch normalization layers are not used in this section since all inputs have batch size 1. Learnable parameters are shared among the view branches. The view pooling layer compares features from different views and keeps the maximum, expressed as $F_{\text{out}}(p,c)=\max_{v=1,\dots,V}F^{(v)}_{\text{in}}(p,c)$, where $F^{(v)}_{\text{in}}$ and $F_{\text{out}}$ represent the input and output features, $p$ denotes a pixel coordinate, $c$ is the channel index, and $v$ is the view index over the $V$ input views. The output feature is then re-shaped to a vector and fed into three fully connected layers. The final feature is a 4096-dimensional vector. Table 3 lists the detailed properties of layers and operators.

Layers Feature size Operators Kernel size No. Stride Padding
0 Input - - - -
1 Conv 8 1 1
2 PReLU - - - -
3 MP - 2 -
4 Conv 16 1 1
5 PReLU - - - -
6 MP - 2 -
7 Conv 32 1 1
8 PReLU - - - -
9 MP - 2 -
10 Conv 64 1 1
11 PReLU - - - -
12 MP - 2 -
13 Conv 128 1 1
14 PReLU - - - -
15 MP - 2 -
16 Conv 256 1 1
17 PReLU - - - -
18 MP - 2 -
19 Conv 512 1 1
20 PReLU - - - -
21 MP - 2 -
22 Conv 1024 1 1
23 PReLU - - - -
24 MP - 2 -
25 View pooling - - - -
26 Re-shape - - - -
27 FC - - -
28 Dropout (0.5) - - - -
29 PReLU - - - -
30 FC - - -
31 Dropout (0.5) - - - -
32 PReLU - - - -
33 FC - - -
34 Dropout (0.5) - - - -
35 Output - - - -
Table 3: Detailed properties of layers and operators.
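The view-pooling operation above reduces to an element-wise maximum over the view axis. A minimal NumPy sketch (array shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def view_pooling(features):
    """Element-wise max over the view axis, as in the view pooling layer.

    features: array of shape (V, C, H, W) -- V views, C channels, H x W maps.
    Returns an array of shape (C, H, W) keeping the maximum response per
    channel and pixel position across all views.
    """
    return features.max(axis=0)

# Toy example: 3 views of a 2-channel 4x4 feature map.
feats = np.random.rand(3, 2, 4, 4)
pooled = view_pooling(feats)
assert pooled.shape == (2, 4, 4)
assert np.all(pooled >= feats[0])  # the pooled map dominates any single view
```

Because the maximum is taken per position, the pooled feature is invariant to the order in which views are fed in.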

2.3.2 Point cloud network

The point cloud network learns features from a point cloud (a point set). The point cloud network in this section (Figure 12) learns a feature vector from a 3D point set; it can be seen as a mapping function from a point set to a feature vector. A point cloud is a set of points with the following properties: (a) disorder; (b) unfixed number of points; (c) arbitrary rotation. Because of (a) and (b), symmetric functions (e.g., the max function) are introduced [64]. Unlike recent popular methods [64, 65, 66], the arbitrary rotation (c) was already resolved in Section 2.2. To handle (a) and (b), given an input point cloud of 3D points, a series of fully connected layers followed by a max pooling layer is employed in this section. Table 4 lists the detailed properties of layers and operators. We describe these layers below.

Figure 12: Point cloud network. Terminology: FC layer, fully connected layer; Max pooling, channel-wise (column-wise) max pooling.
Layers Feature size Operators Kernel size
0 Input -
1 FC
2 PReLU -
3 FC
4 PReLU -
5 FC
6 PReLU -
7 FC
8 Dropout (0.5) -
9 PReLU -
10 MP
11 Output -
Table 4: Detailed properties of layers and operators.

Fully connected layer. The unordered input requires weights that are shared among points while allowing an arbitrary number of points. A fully connected layer meets these requirements; it is a matrix product expressed as

$Y = XW + b,$

where $X \in \mathbb{R}^{N \times d}$ and $Y \in \mathbb{R}^{N \times d'}$ denote the $N$ input and output points in Euclidean $d$-space and $d'$-space, respectively, each forming a matrix, and $W \in \mathbb{R}^{d \times d'}$ and $b \in \mathbb{R}^{d'}$ are learnable parameters.

Channel-wise (column-wise) max pooling layer. The channel-wise max pooling in this section is the symmetric function of [64], in which the maximum value is selected in each channel (column) over all points, expressed as

$f_c = \max_{i=1,\dots,N} x_{i,c},$

where $x_{i,c}$ denotes the $c$-th feature of the $i$-th of $N$ input points, $f$ denotes the learned feature, and $c$ is the channel index.
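The two layers above combine into a point-set encoder that is invariant to point ordering. A minimal single-layer NumPy sketch under that design (dimensions and the ReLU stand-in for PReLU are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_fc(points, W, b):
    """One shared fully connected layer: the same (W, b) is applied to every
    point, so the layer accepts any number of points."""
    return np.maximum(points @ W + b, 0.0)  # ReLU stands in for PReLU here

def point_cloud_feature(points, W, b):
    """Shared FC layer followed by channel-wise (column-wise) max pooling;
    the pooling is a symmetric function, so the feature is order-invariant."""
    h = shared_fc(points, W, b)
    return h.max(axis=0)

n, d_out = 1000, 16
pts = rng.standard_normal((n, 3))
W = rng.standard_normal((3, d_out))
b = np.zeros(d_out)

f1 = point_cloud_feature(pts, W, b)
f2 = point_cloud_feature(pts[rng.permutation(n)], W, b)  # shuffled input
assert np.allclose(f1, f2)  # same feature regardless of point order
```

The shuffled input producing an identical feature is exactly the property required by (a) disorder; any number of points also works, covering (b).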

2.3.3 Recursive binary tree network (Recursive BiTreeNet)

A binary parsing tree is a rooted tree in which every node has at most two children. The notion of a parsing tree comes from the field of linguistics and is frequently used in learning-based natural language processing [67, 68] and computer graphics research [69, 70]. For example, the sentence "The cat sleeps on the carpet." is parsed into "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sleeps) (PP (IN on) (NP (DT the) (NN carpet)))) (. .)))", where ROOT, S, NP, DT, NN, VP, VBD, PP and IN are node types in the parsing tree.

The proposed recursive binary tree network (see Figure 13) learns a binary tree from image features and point cloud features. The binary tree describes the structural graph layouts: shape nodes can be seen as 3D shape codes and are decoded into 3D shapes by the shape decoder, while similar nodes describe the spatial distribution of the 3D shapes.

Figure 13: Recursive Binary Tree Network (Recursive BiTreeNet). Terminology: Concatenate, concatenating two vectors; FC layers, fully connected layers, all FC layers in this figure indicate a two-layer fully connected network.

In Figure 13, the Recursive BiTreeNet first fuses the image feature and the point cloud feature with two FC layers; the output is a 128-D root node feature. The root node is regarded as a parent node and is decoded into a binary tree by the BiTreeNet recursively. One recursive element is shown in Figure 13: a parent node is decoded into one or two child nodes, and the recursive element is reused in the next hierarchy. Given a root node, the recursive processing is written in Algorithm 1. We explain Algorithm 1 below in detail.

Function Recursion(node):
       if node is a shape node then
             return DecodeShape(node)
       else if node is a split node then
             (left, right) ← SplitNode(node)
             return Recursion(left) and Recursion(right)
       else if node is a similar node then
             (child, params) ← SimilarNode(node)
             return Recursion(child) and params
       end if
End Function
Algorithm 1 One recursive element function.

Algorithm 1 shows the recursion function. In one recursive element, the node class is known in the training stage but not in the test stage. The node classifier first classifies the parent node into one of three node types; in the training stage we use the known node class and calculate the node classification loss, while in the test stage we use the node class given by the node classifier. The node classifier is a two-layer fully connected network whose output is a 3-D one-hot vector indicating the probabilities of the 3 node types, e.g., [0, 1, 0] indicates the second type; we use the index form in writing for simplicity. The three node types are split node, similar node and shape node; Figure 4 shows the functions of these nodes.

A split node splits into 2 nodes; that is, the current shape splits into 2 adjacent shapes, realized by a two-layer FC network. The two output child nodes can be seen as two parent nodes for the next hierarchy and are fed into the next recursion function.

A similar node also splits into 2 nodes, meaning that the current shape splits into one shape and its "copies", obtained by a two-layer FC network. The two outputs can be seen as one parent node for the next hierarchy and one similar node carrying 16 similarity parameters. We use up to 16 parameters to represent similarity. The first 3 parameters are a one-hot vector indicating the similarity type: 1-D translation (rigid translation), 2-D translation (non-rigid translation) or reflection. For 1-D translation (e.g., a truss element), 1 parameter is the copy number, 1 parameter is the distance and 3 parameters are the translation direction specified by a 3-D vector. For 2-D translation (e.g., cables that are similar but vary in length), 1 parameter is the copy number, 2 parameters are the two endpoints' translation distances and 6 parameters are the two endpoints' translation directions specified by two 3-D vectors. For reflection, 1 parameter is the reflection distance and 3 parameters are the reflection direction specified by a 3-D vector. The remaining parameters are set to 0 in the labels. Here we use a small trick to predict a reasonable 3D model: the translation distances of cables and trusses are correlated, but there is no edge between the similar nodes of cables and trusses, so this correlation is not modeled in the binary tree. We simply merge close translation distances by taking their average, e.g., if two translation distances are 0.121 and 0.117, we use 0.119 for both.

A shape node indicates that the current shape is inseparable. The shape node is decoded into a 3D shape by the shape decoder in Section 2.3.4.
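The recursion over the three node types can be sketched as plain Python; the node-type constants, stub signatures and the toy tree below are hypothetical illustrations, not the authors' implementation:

```python
# Hypothetical node-type constants matching the three types in Algorithm 1.
SPLIT, SIMILAR, SHAPE = 0, 1, 2

def decode(node, classify, split, similar, decode_shape):
    """Recursively expand one node of the binary parsing tree.

    classify(node)     -> one of SPLIT, SIMILAR, SHAPE
    split(node)        -> (left_child, right_child)
    similar(node)      -> (child, similarity_params)
    decode_shape(node) -> a 3D shape (leaf of the recursion)
    """
    t = classify(node)
    if t == SHAPE:
        return decode_shape(node)
    if t == SPLIT:
        left, right = split(node)
        return ("split",
                decode(left, classify, split, similar, decode_shape),
                decode(right, classify, split, similar, decode_shape))
    child, params = similar(node)
    return ("similar",
            decode(child, classify, split, similar, decode_shape), params)

# Toy tree: a split node whose children are a shape node and a similar node.
tree = {"type": SPLIT,
        "children": ({"type": SHAPE, "shape": "cuboid"},
                     {"type": SIMILAR,
                      "child": {"type": SHAPE, "shape": "cable"},
                      "params": [1, 0, 0]})}

out = decode(tree,
             classify=lambda n: n["type"],
             split=lambda n: n["children"],
             similar=lambda n: (n["child"], n["params"]),
             decode_shape=lambda n: n["shape"])
assert out == ("split", "cuboid", ("similar", "cable", [1, 0, 0]))
```

In the real network each stub is a small FC network operating on 128-D node features; the control flow, however, is the same, which is why a dynamic computation graph is needed.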

One may notice that these operations require a dynamic computation graph, i.e., the computation graph varies between iterations. This is natively supported in [59] but not in [58]. Table 5 lists the detailed properties of layers and operators.

Layers Feature size Operators Kernel size
Feature fusion 4096+4096 Input -
8192 Concatenation -
1024 FC
1024 Dropout -
1024 PReLU -
128 FC
128 PReLU -
NodeClassifier 128 Input -
20 FC
20 PReLU
3 FC
3 Softmax
SplitNode 128 Input -
181 FC
181 PReLU -
256 FC
128+128 Output -
SimilarNode 128 Input -
136 FC
136 PReLU -
144 FC
128+16 Output -
ShapeNode 128 Input -
A 3D shape Output -
Table 5: Detailed properties of layers and operators.

2.3.4 Shape decoder network

The shape decoder network (Figure 14) decodes a 128-D shape node vector into a 3D shape. The shape decoder first classifies the input shape node into 3 types: cuboid, cylinder and irregular shapes such as bridge towers. As with the node classifier in Section 2.3.3, in the training stage we use the known shape type and calculate the classification loss, while in the test stage we use the shape type given by the classifier.

Figure 14: Shape decoder. Terminology: FC layers, fully connected layer; Re-shape, convert a vector to a tensor; 3D CNN, 3D convolutional layer; Marching cubes, differentiable marching cubes (MC) algorithm [71].

A cuboid requires 8 parameters: 6 for its 2 endpoints and 2 for the section width and height, obtained from a two-layer FC network. A cylinder requires 7 parameters: 6 for its 2 endpoints and 1 for the section radius. Table 6 lists the detailed properties of layers and operators except for the 3D CNNs.

Layers Feature size Operators Kernel size
NodeClassifier 128 Input -
20 FC
20 PReLU
3 FC
3 Softmax
Cuboid 128 Input -
32 FC
32 PReLU -
8 FC
8 Output -
Cylinder 128 Input -
30 FC
30 PReLU -
7 FC
7 Output -
Irregular 128 Input -
Output -
Table 6: Detailed properties of layers and operators.
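To make the cuboid parameterization concrete, the sketch below expands the 8 parameters (two endpoints plus section width and height) into 8 corner vertices. The choice of cross-section axes is an assumption for illustration; the paper does not specify this convention:

```python
import numpy as np

def cuboid_vertices(p):
    """Turn the 8-parameter cuboid code (endpoint a, endpoint b, section
    width w, section height h) into its 8 corner vertices."""
    a, b = np.asarray(p[0:3], float), np.asarray(p[3:6], float)
    w, h = p[6], p[7]
    axis = b - a
    axis = axis / np.linalg.norm(axis)
    # Two directions orthogonal to the cuboid axis span the cross-section.
    u = np.cross(axis, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-8:          # axis parallel to z: use another helper
        u = np.cross(axis, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    corners = []
    for end in (a, b):                    # 4 corners around each endpoint
        for su in (-0.5, 0.5):
            for sv in (-0.5, 0.5):
                corners.append(end + su * w * u + sv * h * v)
    return np.array(corners)              # shape (8, 3)

verts = cuboid_vertices([0, 0, 0, 1, 0, 0, 0.2, 0.1])
assert verts.shape == (8, 3)
```

A cylinder's 7 parameters work the same way, with the single radius replacing width and height; this is why the loss computation later substitutes a square-section cuboid for each cylinder.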

For an irregular shape, 3D CNNs are employed to decode the shape node into a 3D occupancy grid and a vertex displacement grid, and differentiable marching cubes (MC) then converts these two grids into a triangular mesh. The 128-D shape node is first mapped to a low-resolution voxel grid with 512 channels by an FC layer; the voxel grid is then up-sampled by a series of fractional-stride convolutional layers. A fractional-stride convolution first inserts zeros between existing pixels (or voxels) within a feature map to increase the spatial resolution, then performs a regular convolution; it is also known as transposed convolution. Table 7 lists the detailed properties of layers and operators in the 3D fractional-stride CNN.

Layers Feature size Operators Kernel size No. Dilation Stride Padding
0 Input - - - - -
1 FC 512 - - -
2 PReLU - - - - -
3 Re-shape - - - - -
4 Transposed Conv 256 1 1 2
5 PReLU - - - - -
6 Transposed Conv 128 1 1 2
7 PReLU - - - - -
8 Transposed Conv 64 1 1 2
9 PReLU - - - - -
10 Transposed Conv 4 1 1 2
11 Softmax - - - - -
12 Mesh Differentiable MC - - - - -
Table 7: Detailed properties of layers and operators in 3D fractional-stride CNN.
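The zero-insertion view of fractional-stride convolution can be demonstrated in one dimension; the kernel and stride below are illustrative, not values from Table 7:

```python
import numpy as np

def fractional_stride_conv1d(x, k, stride=2):
    """Transposed (fractional-stride) convolution, 1-D sketch: insert
    (stride - 1) zeros between input samples to raise the resolution,
    then run a regular convolution over the up-sampled signal."""
    up = np.zeros(len(x) * stride - (stride - 1))
    up[::stride] = x                        # original samples, zeros between
    return np.convolve(up, k, mode="same")  # plain convolution afterwards

# With a tent kernel the inserted zeros are filled by interpolation.
y = fractional_stride_conv1d(np.array([1.0, 2.0, 3.0]),
                             np.array([0.5, 1.0, 0.5]))
assert len(y) == 5  # 3 samples up-sampled to 5 positions
```

In the shape decoder the same idea is applied in 3D, doubling the voxel grid's resolution at each of the four transposed-convolution layers.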

To enable end-to-end training, the unsigned distance field is converted to a triangular mesh through differentiable marching cubes [71]. The final 3D model is obtained by "copying" the decoded 3D shapes according to the similarity parameters.

2.3.5 Loss functions

The structural graph layouts are modeled by minimizing the sum of the cross-entropy classification losses over all nodes, expressed as

$L_{\text{class}} = -\sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \log \hat{p}_{n,k},$

where $p_{n,k}$ and $\hat{p}_{n,k}$ denote the ground-truth and predicted probabilities of class $k$ among the $K$ node types for the $n$-th of $N$ nodes; $p_{n,k}$ is either 0 or 1, being an entry of the one-hot 3-D vector. It should be noted that the binary tree, including its node types, is known during training; hence the node classifier is trained during training but is only consulted at test time.
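As a concrete check of the node-classification loss, a minimal NumPy sketch with two nodes and 3 node types (probabilities are illustrative):

```python
import numpy as np

def node_classification_loss(p_true, p_pred, eps=1e-9):
    """Summed cross entropy over all nodes: p_true holds one-hot rows of
    shape (N, 3), p_pred the softmax probabilities for the 3 node types."""
    return -(p_true * np.log(p_pred + eps)).sum()

p_true = np.array([[1.0, 0.0, 0.0],   # node 1: split node
                   [0.0, 1.0, 0.0]])  # node 2: similar node
p_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1]])
loss = node_classification_loss(p_true, p_pred)
assert loss > 0.0  # only the true-class probabilities contribute
```

Because each row of `p_true` is one-hot, only the predicted probability of the correct node type enters the sum, so the loss drives the classifier toward the known tree structure during training.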

3D shapes are modeled by minimizing the distance loss over all shapes (taken as meshes) decoded by the shape decoder, expressed as

$L_{\text{shape}} = \sum_{m=1}^{M} d_{\text{CD}}(S_m, \hat{S}_m),$

where $S_m$ and $\hat{S}_m$ are the ground-truth and predicted shapes at the $m$-th of $M$ shape nodes, and $d_{\text{CD}}$ is the Chamfer distance, which handles two shapes with different numbers of vertices, expressed as

$d_{\text{CD}}(S, \hat{S}) = \frac{1}{|S|}\sum_{x \in S} \min_{\hat{x} \in \hat{S}} \lVert x - \hat{x} \rVert_2^2 + \frac{1}{|\hat{S}|}\sum_{\hat{x} \in \hat{S}} \min_{x \in S} \lVert x - \hat{x} \rVert_2^2,$

where $x$ and $\hat{x}$ are vertices of $S$ and $\hat{S}$ respectively, and $|S|$ and $|\hat{S}|$ are their vertex counts. For cuboids, the 8 vertices are parameterized with the 8 parameters mentioned above; for cylinders, we use cuboids with square sections for calculation and revert to cylinders for visualization. No smoothness or edge-length regularization terms are needed since the mesh is converted from a regular voxel grid.
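A brute-force NumPy version of the Chamfer distance (the paper uses a KD-tree for large vertex counts; this quadratic sketch is for clarity only):

```python
import numpy as np

def chamfer_distance(S1, S2):
    """Symmetric Chamfer distance between two vertex sets of arbitrary sizes:
    mean squared distance from each point to its nearest neighbour in the
    other set, summed over both directions."""
    d2 = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)  # (N1, N2) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(A, A) == 0.0  # identical sets have zero distance
```

Since the nearest-neighbour minima are taken independently per point, the two sets may have different vertex counts, which is exactly why this metric fits meshes decoded at varying resolutions.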

The final 3D model also requires minimizing the similar-node loss

$L_{\text{sim}} = \sum_{j=1}^{J} \lVert s_j - \hat{s}_j \rVert_2^2,$

where $s_j$ and $\hat{s}_j$ are the $j$-th ground-truth and predicted similarity parameters over the $J$ similar nodes.

The total loss function is the weighted sum of the above losses,

$L = \lambda_1 L_{\text{class}} + \lambda_2 L_{\text{shape}} + \lambda_3 L_{\text{sim}},$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are 3 weights composing the total loss. The overall loss updates the network in an end-to-end fashion.

3 Experiments

3.1 Training Details

We collected and generated about 2200 models of long span bridges, see Figure 15. Each training sample includes inputs and labels.

Figure 15: Some training samples.

3.1.1 Details for 3D OBB learning

The point clouds are synthesized from a densely sampled 3D bridge point cloud (about 1M points) with Gaussian noise and non-uniform sampling, plus a random terrain point cloud (about 6M points). The synthesized point clouds are first filtered with the sphere boundary mentioned in Section 2.2.1. Bird views are generated by projecting the point cloud within the sphere boundary to an image along the vertical (up) direction; front views are generated by projecting the point cloud within the 2D OBB to an image along its local front direction. The labels are the 3D OBBs. Images are normalized, and the point cloud within the sphere boundary is translated and scaled to a unit sphere.

The learning rate declines exponentially from its initial to its final value; training runs for 200 epochs with batch size 3. We use adaptive moment estimation (Adam) to optimize the network; no pre-training is used.

3.1.2 Details for structural graph layouts and 3D shapes learning

The inputs are composed of: (a) a sparsely sampled 3D point cloud (4K to 8K points) with Gaussian noise and non-uniform sampling; (b) 4 to 8 views, where each view includes a gray-scale image rendered from the 3D model with a random background, and a foreground-background segmentation re-projected from a densely sampled 3D point cloud (about 1M points) with Gaussian noise and non-uniform sampling. Images are normalized, the geometric centers of the point clouds are translated to the origin, and the point clouds are scaled to unit extent along the vertical (up) direction.

The labels are composed of: (a) a binary tree represented with nested tuples (in Python language) by assembling one of the three node types on each node; (b) the 3D model with 3D shapes (mesh) on shape nodes and similar parameters on similar nodes.
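The nested-tuple label convention can be illustrated with a tiny hypothetical structure; the node tags, file names and parameter dictionary below are placeholders, not the paper's actual label format:

```python
# Hypothetical label for a tiny structure, following the nested-tuple
# convention: one split node, one shape node, and one similar node whose
# child shape is replicated by its similarity parameters.
label = ("split",
         ("shape", "tower_mesh.obj"),
         ("similar",
          ("shape", "cable_mesh.obj"),
          {"type": "2d_translation", "copies": 12}))

def count_shape_nodes(node):
    """Walk the nested tuples and count the leaf shape nodes."""
    if node[0] == "shape":
        return 1
    return sum(count_shape_nodes(child) for child in node[1:]
               if isinstance(child, tuple))

assert count_shape_nodes(label) == 2  # the tower and the cable
```

Representing the binary tree as nested tuples keeps the labels directly traversable by the same recursion used in the BiTreeNet.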

The learning rate declines exponentially from its initial to its final value; training runs for 500 epochs with batch size 1. We use adaptive moment estimation (Adam) to optimize the network; no pre-training is used. The three loss weights are chosen empirically.

3.2 Qualitative comparison

In addition to these methods, we manually create a 3D model based on the point cloud and images as a reference, which is regarded as the best with respect to detail though not to accuracy; however, it takes a dozen hours of manual work. It is worth mentioning that when manually creating these models, we refer to three orthographic views of the point cloud for approximate size information and to the raw images to infer detailed part relations and 3D shapes, which mirrors how the proposed learning framework operates.

A qualitative comparison among the different methods is shown in Figure 16 and Figure 17: (a) Dense point cloud (Section 2.1). The dense point cloud is not suitable for texture mapping and suffers from severe noise and uneven distribution. (b) and (c) Delaunay triangulation and Poisson surface reconstruction [6, 7, 8]. Surface reconstruction works directly on the point cloud with no structural prior, so the meshing quality is poor. (d) Point cloud modeling. This approach works well in practice for high-quality point clouds; we tested a RANSAC-based method [16] and a learning-based method [20], and both failed to fit 3D primitives in this case because of the severe noise in the MVS point cloud caused by long-distance photography and the limited number of images. (e) The proposed method. (f) Manual work. (g, h) Front and side views of selected methods.

We have not tested image-based learning methods, since mesh-based learning [36, 37, 38, 39, 40] has not solved the topological genus problem; current geometric-primitive-based methods [41, 42, 43] only take cuboids into consideration; and volumetric methods [47, 48, 49, 50] are limited by voxel resolution. More importantly, to the best of our knowledge, no 3D priors are employed in these networks, since no 3D information or stereo vision mechanisms are adopted. As the question "What Do Single-view 3D Reconstruction Networks Learn?" asked in [72], the current state of the art in single-view object reconstruction does not actually perform reconstruction but image classification.

Figure 16: Bridge 1. A qualitative comparison of different methods. In addition to these methods, we manually create a 3D model based on point cloud and images as a reference.
Figure 17: Bridge 2. A qualitative comparison of different methods. In addition to these methods, we manually create a 3D model based on point cloud and images as a reference.

3.3 Quantitative comparison

We use the manually created 3D model as the ground truth (GT), although it is not the most accurate spatially. The Chamfer distance is calculated between the vertices of the manually created 3D model and those of the other models. Since the Chamfer distance requires a huge KD-tree to handle large numbers of vertices, we down-sample each model to under 10,000 vertices; for point clouds, the vertices are the down-sampled 3D points. We also compare the number of reconstructed bridge components; note that the "component" concept does not exist in the first three methods. Finally, we compare the output data size: one point has a size of 3, one triangular face 3, one cuboid 8 and one cylinder 7. See Table 8 for the comparison.

Method Chamfer distance No. of components Data size
Bridge 1 Dense point cloud [3] 0.04418 - 1117761
Delaunay triangulation [6] 0.04418 - 14031345
Poisson surface reconstruction [8] 0.55827 - 1234419
Point cloud modeling [16, 20] - - -
The proposed method 0.06237 1488 66812
Manual work (GT) 0 1430 15872
Bridge 2 Dense point cloud [3] 0.11345 - 5398473
Delaunay triangulation [6] 0.11345 - 29836674+
Poisson surface reconstruction [8] 0.19662 - 4307535
Point cloud modeling [16, 20] - - -
The proposed method 0.02001 914 73016
Manual work (GT) 0 914 10952
Table 8: A quantitative comparison of different methods.

4 Conclusions

A learning framework is designed which can learn a mathematical model from prior knowledge. Compared with previous methods, the proposed method successfully reconstructs a 3D model while preserving its topological properties and spatial relations.

The 3D digital model has several potential applications: (a) Visual structural health monitoring. By assembling 2D images onto a 3D digital bridge model via texture mapping, inspectors can engage with these images in a more intuitive manner, e.g., through VR and AR; it also provides a better monitoring scheme by recording and visualizing the life cycle of an entire bridge. (b) Finite element modeling. The proposed method provides a topology-aware 3D digital model, which significantly reduces the time spent on manually establishing a finite element model.

5 Acknowledgements

The study is financially supported by the National Natural Science Foundation of China (Grant U1711265 and 51638007) and supported by grants from the National Key R&D Program of China (Grant 2017YFC1500603).


  • [1] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [2] C. Wu, “Visualsfm: A visual structure from motion system,” http://www.cs.washington.edu/homes/ccwu/vsfm, 2011.
  • [3] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113, 2016.
  • [4] Y. Furukawa, C. Hernández, et al., “Multi-view stereo: A tutorial,” Foundations and Trends® in Computer Graphics and Vision, vol. 9, no. 1-2, pp. 1–148, 2015.
  • [5] Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiview stereopsis,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
  • [6] B. Delaunay et al., “Sur la sphere vide,” Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, vol. 7, no. 793-800, pp. 1–2, 1934.
  • [7] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” in Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7, 2006.
  • [8] M. Kazhdan and H. Hoppe, “Screened poisson surface reconstruction,” ACM Transactions on Graphics (ToG), vol. 32, no. 3, p. 29, 2013.
  • [9] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, “Manhattan-world stereo,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1422–1429, IEEE, 2009.
  • [10] S. Sinha, D. Steedly, and R. Szeliski, “Piecewise planar stereo for image-based rendering,” 2009.
  • [11] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra, “Rapter: rebuilding man-made scenes with regular arrangements of planes.,” ACM Trans. Graph., vol. 34, no. 4, pp. 103–1, 2015.
  • [12] B. Xiong, M. Jancosek, S. O. Elberink, and G. Vosselman, “Flexible building primitives for 3d building modeling,” ISPRS journal of photogrammetry and remote sensing, vol. 101, pp. 275–290, 2015.
  • [13] T. Holzmann, M. Maurer, F. Fraundorfer, and H. Bischof, “Semantically aware urban 3d reconstruction with plane-based regularization,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 468–483, 2018.
  • [14] C. Raposo, M. Antunes, and J. P. Barreto, “Piecewise-planar stereoscan: Sequential structure and motion using plane primitives,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 8, pp. 1918–1931, 2018.
  • [15] A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur, “A survey of simple geometric primitives detection methods for captured 3d data,” in Computer Graphics Forum, vol. 38, pp. 167–196, Wiley Online Library, 2019.
  • [16] R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for point-cloud shape detection,” in Computer graphics forum, vol. 26, pp. 214–226, Wiley Online Library, 2007.
  • [17] Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. Cohen-Or, and N. J. Mitra, “Globfit: Consistently fitting primitives by discovering global relations,” ACM transactions on graphics (TOG), vol. 30, no. 4, p. 52, 2011.
  • [18] T. Rabbani, S. Dijkman, F. van den Heuvel, and G. Vosselman, “An integrated approach for modelling and global registration of point clouds,” ISPRS journal of Photogrammetry and Remote Sensing, vol. 61, no. 6, pp. 355–370, 2007.
  • [19] M. Attene and G. Patanè, “Hierarchical structure recovery of point-sampled surfaces,” in Computer Graphics Forum, vol. 29, pp. 1905–1920, Wiley Online Library, 2010.
  • [20] L. Li, M. Sung, A. Dubrovina, L. Yi, and L. Guibas, “Supervised fitting of geometric primitives to 3d point clouds,” arXiv preprint arXiv:1811.08988, 2018.
  • [21] X. Li, Y.-L. Lin, J. Miller, A. Cheon, and W. Dixon, “Primitive-based 3d building modeling, sensor simulation, and estimation,” arXiv preprint arXiv:1901.05554, 2019.
  • [22] M. Li, P. Wonka, and L. Nan, “Manhattan-world urban reconstruction from point clouds,” in European Conference on Computer Vision, pp. 54–69, Springer, 2016.
  • [23] M. Li, L. Nan, and S. Liu, “Fitting boxes to manhattan scenes using linear integer programming,” International journal of digital earth, vol. 9, no. 8, pp. 806–817, 2016.
  • [24] J. Lee, H. Son, C. Kim, and C. Kim, “Skeleton-based 3d reconstruction of as-built pipelines from laser-scan data,” Automation in construction, vol. 35, pp. 199–207, 2013.
  • [25] A. K. Patil, P. Holi, S. K. Lee, and Y. H. Chai, “An adaptive approach for the reconstruction and modeling of as-built 3d pipelines from point clouds,” Automation in Construction, vol. 75, pp. 65–78, 2017.
  • [26] B. Guo, Q. Li, X. Huang, and C. Wang, “An improved method for power-line reconstruction from point cloud data,” Remote sensing, vol. 8, no. 1, p. 36, 2016.
  • [27] S. Ortega, A. Trujillo, J. M. Santana, J. P. Suárez, and J. Santana, “Characterization and modeling of power line corridor elements from lidar point clouds,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 152, pp. 24–33, 2019.
  • [28] L. Barazzetti, “Parametric as-built model generation of complex shapes from point clouds,” Advanced Engineering Informatics, vol. 30, no. 3, pp. 298–311, 2016.
  • [29] A. Dimitrov, R. Gu, and M. Golparvar-Fard, “Non-uniform b-spline surface fitting from unordered 3d point clouds for as-built modeling,” Computer-Aided Civil and Infrastructure Engineering, vol. 31, no. 7, pp. 483–498, 2016.
  • [30] P. Labatut, J.-P. Pons, and R. Keriven, “Hierarchical shape-based surface reconstruction for dense multi-view stereo,” in 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1598–1605, IEEE, 2009.
  • [31] F. Lafarge and C. Mallet, “Creating large-scale city models from 3d-point clouds: a robust approach with hybrid representation,” International journal of computer vision, vol. 99, no. 1, pp. 69–85, 2012.
  • [32] F. Lafarge, R. Keriven, M. Brédif, and H.-H. Vu, “A hybrid multiview stereo algorithm for modeling urban scenes,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 5–17, 2013.
  • [33] F. Lafarge and P. Alliez, “Surface reconstruction through point set structuring,” in Computer Graphics Forum, vol. 32, pp. 225–234, Wiley Online Library, 2013.
  • [34] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, “Google earth engine: Planetary-scale geospatial analysis for everyone,” Remote Sensing of Environment, vol. 202, pp. 18–27, 2017.
  • [35] E. Grilli, F. Menna, and F. Remondino, “A review of point clouds segmentation and classification algorithms,” The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 42, p. 339, 2017.
  • [36] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson, “Learning free-form deformations for 3d object reconstruction,” arXiv preprint arXiv:1803.10932, 2018.
  • [37] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, “Pixel2mesh: Generating 3d mesh models from single rgb images,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67, 2018.
  • [38] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 216–224, 2018.
  • [39] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning category-specific mesh reconstruction from image collections,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–386, 2018.
  • [40] E. Smith, S. Fujimoto, and D. Meger, “Multi-view silhouette and depth decomposition for high resolution 3d object representation,” in Advances in Neural Information Processing Systems, pp. 6478–6488, 2018.
  • [41] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik, “Learning shape abstractions by assembling volumetric primitives,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2635–2643, 2017.
  • [42] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem, “3d-prnn: Generating shape primitives with recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 900–909, 2017.
  • [43] C. Niu, J. Li, and K. Xu, “Im2struct: Recovering 3d shape structure from a single rgb image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4521–4529, 2018.
  • [44] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613, 2017.
  • [45] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang, “3d shape reconstruction from sketches via multi-view convolutional networks,” in 2017 International Conference on 3D Vision (3DV), pp. 67–77, IEEE, 2017.
  • [46] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3d object reconstruction,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [47] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in European conference on computer vision, pp. 628–644, Springer, 2016.
  • [48] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Advances in Neural Information Processing Systems, pp. 1696–1704, 2016.
  • [49] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in Advances in neural information processing systems, pp. 365–376, 2017.
  • [50] X.