1 Introduction
The aim of this paper is to learn a 3D model of long-span bridges (hereinafter referred to as "bridges") from images captured by unmanned aerial vehicles (UAVs) fully automatically.
Recent developments in UAVs have expanded their use across a wide range of industrial scenarios. UAVs can explore inaccessible areas carrying various types of sensors, such as digital cameras, infrared cameras, and laser scanners (LiDAR), and are thus expected to play an increasingly important role in visual inspection systems for civil structures. Cameras mounted on UAVs can capture images of a structure from various viewpoints and record its exterior state. However, current visual inspection strategies often produce massive sets of unstructured and unaligned digital images, which require considerable human effort to filter and organize for further use; fully automating this task has therefore become an urgent problem. One feasible solution is to integrate these images into a single 3D model of the target structure, which enables inspectors to engage with the images in a more intuitive manner and provides a better monitoring scheme by recording and visualizing the life cycle of the entire structure. However, existing 3D models are as-designed models, which differ from the current as-is state and fail to express the present exterior condition of the structure, so recording or scanning the structure on site is believed to be the only feasible solution.
Up to now, research on 3D reconstruction of civil structures has tended to focus on local points (i.e., pixels in image grids and points in 3D Euclidean space) rather than global features. These methods first obtain the 3D coordinates of keypoints and the camera parameters from multiple images using structure-from-motion (SfM) [1, 2, 3], then generate dense 3D point clouds with multi-view stereo (MVS) [4, 5].
Point clouds are rarely used directly in practice, since they are unstructured point sets and do not support texture mapping. Two families of post-processing methods build on point clouds: surface reconstruction [6, 7, 8] and fitting-based point cloud modeling [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34].
Surface reconstruction generates a polygonal mesh (in most cases a triangular mesh) directly as the surface of a 3D point cloud. A triangular mesh is represented as a set of ordered 3D vertices and a set of triangular faces, where each face is expressed as a triple of vertex indices. However, a polygonal mesh is often overly complex for civil structures in urban scenes, where most surfaces are flat. Worse still, the meshing quality may degrade significantly because: (a) an MVS point cloud often suffers from severe noise and uneven distribution, unlike LiDAR point clouds; (b) the target 3D object often has complex geometry and topology, as with the cable-stayed bridge in this paper; and (c) surface reconstruction methods introduce no structural priors.
Fitting-based modeling methods generate a compact geometric model to fit a 3D point cloud, employing basic geometric shapes (e.g., planes, cones, cuboids, cylinders, pyramids, spheres), see Figure 1. Point cloud segmentation algorithms (see [35] for a literature review) are sometimes used for preprocessing. Fitting-based methods can be summarized as: (a) piecewise plane fitting; (b) 3D geometric primitive fitting; (c) non-uniform rational B-spline (NURBS) curve and surface fitting; and (d) hybrid fitting. They are briefly discussed in the following.
Piecewise planar fitting methods use planes to fit a point cloud. The Manhattan-world prior is often introduced because it holds for many real-world buildings [9, 10]. Planar fitting has achieved remarkable results in building exterior reconstruction [10, 11, 12]; Holzmann et al. [13] use lines in addition to point clouds, and Raposo et al. [14] use planes rather than being limited to point clouds, both achieving better results.
3D geometric primitive fitting methods use 3D geometric primitives to fit a point cloud; constructive solid geometry (CSG) models are sometimes employed. Representing a 3D object with geometric primitives is among the most common approaches in computer vision and computer graphics [15]. Many methods exist to fit 3D geometric primitives, including RANSAC [16, 17, 22], the Hough transform [18], primitive-driven region growing [19], and learning-based methods [20, 21]. They have been widely used in building exterior reconstruction [22, 23] and in building information modeling (BIM), e.g., for pipes [24, 25], and have achieved remarkable results.
NURBS fitting uses NURBS curves and surfaces to fit a point cloud, which is useful for curved lines and surfaces, including power lines [26, 27], complex building exteriors, and curved pipes [28, 29].
Hybrid fitting [30, 31, 32, 33] combines LiDAR point clouds, MVS point clouds, surface reconstruction, geometric primitive fitting, and other techniques into one engineering solution; Google Earth [34] is a well-known example.
However, fitting-based methods fail to reconstruct the MVS point cloud in this paper because: (a) the MVS point cloud exhibits noise, uneven distribution, missing points, and occlusion; and (b) no high-level structural priors are introduced, since fitting-based methods only perform bottom-up fitting.
In summary, both surface reconstruction and point cloud modeling are susceptible to noise. They operate in a bottom-up fashion in which high-level structural priors are rarely introduced, and this poses an enormous challenge to existing algorithms.
Machine learning has long been of great interest in a wide range of fields, including computer vision and computer graphics. In contrast to traditional point-based algorithms, which can be considered pointwise stereo vision measurements and rely on neither global features nor prior knowledge, learning-based methods [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] learn global features from images. In general, these algorithms first encode single or multiple images into a latent vector with convolutional neural networks (CNNs) and then decode it into a 3D model. In terms of the output representation, the result may be: (a) a polygonal mesh model [36, 37, 38, 39, 40]; (b) a geometric primitive model [41, 42, 43]; (c) a point cloud model [44, 45, 46]; or (d) a volumetric model [47, 48, 49, 50]. These methods work well on synthesized benchmark datasets [51, 52] but are still far from meeting industrial engineering requirements. Table 1 and Figure 2 compare different representation forms of a bridge model; all models are converted directly from the same original 3D bridge model. They are: (a) a triangular mesh, defined by a set of vertices and faces; (b) geometric primitives, defined by a set of cuboids and cylinders; (c) a point cloud, defined by a set of points; and (d) a volumetric model, defined by a 3D voxel grid.
Representation  Expression  Pros.  Cons.

Triangular mesh  Set of vertices and faces  High representation ability for details  Vertex connections are very hard to learn
Geometric primitives  Set of cuboids and cylinders  Compact and structure-aware  Lose details
Point cloud  Set of 3D points  Direct output of MVS  Unstructured; no surface for texture mapping
Volumetric model  3D voxel grid  Easy to learn  Resolution limited by memory cost
The objective of this paper is to learn a 3D model of bridges from UAV images. We use hybrid models, including geometric primitives and volumetric models, to represent a bridge object, and hybrid input data, including images and point clouds.
2 Methodology
Only images of long-span bridges captured by UAVs are used for 3D reconstruction. Consider how a human understands the bridge shown in Figure 3 from a single picture: one can readily distinguish the bridge object from the background, and then observe that the bridge has two towers arranged symmetrically, that the truss blocks repeat regularly, and that the cables are similar to one another. Inspired by this human recognition process, a hierarchical binary parsing tree is proposed in this paper to parse a bridge structure, as shown in Figure 4.
The entire procedure for 3D reconstruction of a bridge is summarized as follows: (a) first, a 3D point cloud and camera parameters are obtained from the images as global 3D initial information through a general rigid SfM and MVS pipeline; (b) then, regions of interest (RoIs) in the 3D point cloud and 2D images are obtained to reduce the interference of background and noise; (c) after feeding the 3D point cloud, images, and RoIs into the proposed learning framework, a binary tree describing the structural graph layout is determined via a learning algorithm; (d) once the binary tree is obtained, 3D shapes are further learned from its shape nodes; and (e) the final 3D model is assembled from the structural graph layout and the 3D shapes.
2.1 3D Point Cloud and Camera Parameters
This step obtains the 3D point cloud and camera parameters (see Figure 5) from images as global 3D initial information through a general rigid SfM and MVS pipeline. SfM recovers the 3D coordinates of keypoints and the camera parameters from multiple images. It can be summarized as: (a) 2D keypoint extraction and matching; (b) camera parameter verification; and (c) sparse 3D keypoint reconstruction. MVS then takes the output of SfM (i.e., camera parameters and the coordinates of 3D keypoints) as input and generates depth maps, dense 3D point clouds, and so on. These steps are described in detail below.
2.1.1 2D keypoint extraction and matching
This step extracts keypoints and finds keypoint pairs in multi-view images. A 2D keypoint is a coordinate in an image where the pixel values vary greatly within its neighborhood.
Given an image $I(x, y)$, the scale-invariant feature transform (SIFT) [53] keypoint extraction is formulated as

$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma), \quad L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$ (1)

where $(x, y)$ is a pixel coordinate within the image, and $L(x, y, \sigma)$ is a Gaussian-blurred image parameterized by a factor $\sigma$ and generated by convolving a Gaussian kernel $G(x, y, \sigma)$ with the input image $I(x, y)$. A pyramid-like scale space is constructed in [53] to store Gaussian-blurred images with different factors and different sizes, enabling multi-scale detection. The factors are $(\sigma, k\sigma, k^2\sigma, k^3\sigma, \ldots)$ in each octave, where $k = 2^{1/s}$ and $s$ is the number of intervals per octave. The image size is halved from a lower octave to a higher one. $D(x, y, \sigma)$ is a difference-of-Gaussian image produced by subtracting two adjacent Gaussian-blurred images in each octave. Keypoint candidates are extracted by comparing each pixel in $D$ to its neighbors at the current factor and at the two adjacent factors in each octave; a pixel is chosen if it is larger or smaller than all of its neighbors.
A keypoint descriptor, represented by a 128-D vector, is needed to characterize a keypoint. It assigns gradient magnitudes and orientations over a neighboring region of pixel grids; the magnitude and orientation are expressed as

$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}, \quad \theta(x, y) = \arctan\dfrac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}$ (2)

These gradient magnitudes and orientations are then accumulated into orientation histograms summarizing the contents over $4 \times 4$ subregions, which generates a descriptor in the form of a $4 \times 4$ array of histograms, each with 8 orientation bins. This results in a $4 \times 4 \times 8 = 128$-dimensional vector, which is then normalized to form the keypoint descriptor.
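Equation (2) amounts to central differences on the blurred image; a minimal NumPy version (our own helper, not the paper's code) is:

```python
import numpy as np

def grad_mag_ori(L):
    # Central-difference gradients of a Gaussian-blurred image L,
    # as used for SIFT descriptor construction (Eq. (2)).
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx**2 + dy**2)        # gradient magnitude
    theta = np.arctan2(dy, dx)        # gradient orientation in (-pi, pi]
    return m, theta
```

Note that `arctan2` is used instead of a plain arctan so that the full orientation range is recovered.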
Matching finds all descriptor pairs in an image pair. Approximate nearest neighbors (ANN) [54] is used to reduce the algorithm complexity compared with exhaustive descriptor matching. Given an image pair and a query descriptor in one image, ANN finds the approximate closest descriptor in the other image by constructing a priority search k-means tree w.r.t. a metric distance (here the Euclidean distance). In summary, given the input images, the SIFT detector extracts 2D keypoints and their descriptors (128-D vectors in [53]) for all images, resulting in a keypoint set in which each entry records a keypoint of one image together with its descriptor. These keypoints are then matched via descriptor distances for all image pairs, resulting in a set of matched keypoint pairs.
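The matching step can be sketched as follows. For clarity this uses exhaustive nearest-neighbour search in place of the paper's approximate k-means tree [54], and adds Lowe's ratio test, which is a common (but here assumed) way to reject ambiguous matches:

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    # Brute-force nearest-neighbour matching with a ratio test:
    # accept a match only if the best distance is clearly smaller
    # than the second-best distance.
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)          # Euclidean distances
        j = int(np.argmin(dists))
        second = np.partition(dists, 1)[1] if len(dists) > 1 else np.inf
        if dists[j] < ratio * second:
            matches.append((i, j))
    return matches
```

The ANN tree of [54] replaces the linear scan over `desc2` with an approximate search, which matters once each image has thousands of descriptors.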
2.1.2 Camera parameter verification
This step verifies the camera parameters for multiple views, including intrinsic parameters (e.g., focal length) and extrinsic parameters (camera motions). In this section, homogeneous coordinates are introduced to simplify calculations; e.g., a 2D keypoint $(x, y)$ is expressed as $p = [x, y, 1]^T$ in homogeneous coordinates, defined up to a nonzero scale. In single-view geometry, the camera parameters define a projective mapping $p \sim M P$, where $P$ and $p$ are points in 3D world coordinates and 2D pixel coordinates, respectively, both in homogeneous form. The camera parameters include the intrinsic matrix $K$, which defines a projective mapping from 3D camera coordinates (Cartesian) to 2D pixel coordinates (homogeneous), and the extrinsic parameters $[R \,|\, t]$, where $R$ is a rotation matrix and $t$ is a translation vector defining the rotation and translation from 3D world coordinates (homogeneous) to 3D camera coordinates (Cartesian). A camera lens radial distortion model $d(\cdot)$ is introduced to eliminate the image distortion caused by the lens; $d$ is a nonlinear polynomial function of the distance from the current pixel to the image center, with the radial distortion parameters as coefficients. As a result, the camera parameters are encapsulated in the matrix $M = K[R \,|\, t]$ together with $d$.
In multi-view geometry, $K$ and $d$ are shared among different views if they are obtained with the same camera. Note that $K$ and $d$ are known, calibrated, or estimated approximately beforehand and optimized at the last stage. Images are then undistorted with $d$ to avert nonlinear mapping functions, i.e., only $M = K[R \,|\, t]$ is considered; hence only the camera motion $[R \,|\, t]$ of each view needs to be solved. Generally, the camera coordinate frame of the first view is used as the world coordinate frame. In this study, $K$ and $d$ are estimated from exchangeable image file format (EXIF) tags and are shared among different views.

To solve the camera motion, consider an image pair (the $i$th and $j$th images) in epipolar geometry (i.e., two-view geometry). The camera parameters of the two views are encapsulated in the fundamental matrix $F = K^{-T} [t]_\times R K^{-1}$, where $t$ and $R$ describe the relative translation and rotation of the two views and $[t]_\times$ is the skew-symmetric matrix encoding the cross product with $t$. The fundamental matrix constrains how the scene changes under the two views, or how the coordinates of a matched point pair change within an image pair: a matched keypoint pair in homogeneous coordinates fulfills $p_j^T F p_i = 0$, where $p_i$ and $p_j$ are the two keypoints in the $i$th and $j$th images. The first step is to solve $F$, up to scale, from keypoint pairs; the 9 elements of $F$ can be solved by a least-squares algorithm since an image pair has a large number of matched keypoint pairs. The next step is to recover $R$ and $t$ from the essential matrix $E = K^T F K$. Assume the camera matrices of the first and second views are $M_1 = K[I \,|\, 0]$ and $M_2 = K[R \,|\, t]$ (i.e., the world coordinate frame is the first camera frame). Since $E$ has two equal singular values (refer to [1] for a proof), suppose its singular value decomposition is $E = U \operatorname{diag}(1, 1, 0) V^T$; the solutions for $R$ and $t$ are then determined up to a scale and a fourfold ambiguity: $R = U W V^T$ or $R = U W^T V^T$, and $[t]_\times = U Z U^T$ (equivalently $t = \pm u_3$, the last column of $U$), where $W$ is orthogonal and $Z$ is skew-symmetric. A solution of $R$ and $t$ is valid only when a reconstructed point is in front of both cameras.

2.1.3 3D Keypoints reconstruction
This step reconstructs 3D keypoints from 2D keypoint pairs. A 3D point $P$ in homogeneous representation is obtained by the triangulation algorithm for each matched keypoint pair $(p_i, p_j)$ via the camera matrices $M_i$ and $M_j$ of the $i$th and $j$th images, i.e., $p_i \sim M_i P$ and $p_j \sim M_j P$, where the sign $\sim$ implies a nonzero scale factor since homogeneous representations are involved.
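The triangulation step can be sketched with the standard linear (DLT) construction: stacking the cross-product constraints from $p \sim M P$ for both views and taking the null space via SVD. This is a generic sketch, not the paper's implementation; `M1`, `M2` are the $3 \times 4$ camera matrices:

```python
import numpy as np

def triangulate(p1, p2, M1, M2):
    # Linear (DLT) triangulation: find homogeneous X with
    # p1 ~ M1 X and p2 ~ M2 X, for pixel coordinates p = (u, v).
    A = np.vstack([
        p1[0] * M1[2] - M1[0],
        p1[1] * M1[2] - M1[1],
        p2[0] * M2[2] - M2[0],
        p2[1] * M2[2] - M2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # right singular vector of smallest singular value
    return X / X[3]            # normalize the homogeneous coordinate
```

With noisy correspondences the rays do not intersect exactly, and the SVD solution minimizes the algebraic error of the stacked constraints.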
For multiple views, incremental structure-from-motion [2, 3] first initializes with a two-view reconstruction, then registers the remaining views to the current reconstruction one by one. A new view is registered if it observes existing 3D points, i.e., its keypoints overlap those of the current views w.r.t. keypoint descriptors. The camera motion of the new view is then estimated with a Perspective-n-Point (PnP) algorithm from corresponding 3D and 2D points. New 3D points are then obtained with the triangulation algorithm in a pair-by-pair mode, yielding a new reconstruction. The reconstruction is complete when no new image can be registered. To record the correspondence of 2D points among multiple views, a track is defined for each reconstructed 3D point, i.e., a list of corresponding (w.r.t. keypoint descriptors) 2D points across the registered views.
Bundle adjustment refines all camera parameters and the positions of all reconstructed 3D keypoints after every incremental reconstruction by minimizing the overall reprojection error $\min \sum_{i} \sum_{j} \| p_{ij} - \pi(M_i, P_j) \|^2$, where $\pi(M_i, P_j)$ denotes the projection of the 3D point $P_j$ by the camera of the $i$th view.
2.1.4 Multiview stereo (MVS)
This step generates a dense point cloud. The output of SfM is a set of sparse 3D keypoints and camera parameters; to obtain dense 3D points, the depth values of all pixels need to be calculated via epipolar geometry.
Given an image pair and a pixel $p_1$ in the first image, the pre-image of $p_1$ is a ray that starts from the camera center and passes through $p_1$. To find the corresponding pixel in the second image, consider epipolar geometry: all candidate pixels $p_2$ satisfy $p_2^T F p_1 = 0$ and therefore make up the epipolar line $l_2 = F p_1$. In fact, $p_2$ must lie on the projection of the pre-image ray onto the second image, i.e., on the epipolar line $l_2$. The search for the corresponding pixel thus finds the best-matched pixel to $p_1$ along $l_2$, and it is performed with a sliding window rather than a single pixel. Consider a fixed window around $p_1$ in the first image and a sliding window along the epipolar line in the second image. Normalized cross-correlation (NCC) compares the pixel values within the two windows, expressed as $\mathrm{NCC} = \dfrac{\langle v_1 - \bar{v}_1, \, v_2 - \bar{v}_2 \rangle}{\| v_1 - \bar{v}_1 \| \, \| v_2 - \bar{v}_2 \|}$, where $v_1$ and $v_2$ are the vectorized pixel values within the first and second windows, $\bar{v}_1$ and $\bar{v}_2$ are their mean values, and $\langle \cdot, \cdot \rangle$ denotes the inner product. The corresponding pixel in the second image is selected where the NCC score is maximal. The depth values of two corresponding pixels are then calculated by the triangulation algorithm of Section 2.1.3; that is, the depth value of $p_1$ is the distance from the first camera center to the reconstructed 3D point, and likewise for $p_2$.
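The window comparison above can be sketched as follows; `line_pts` stands for candidate pixels sampled along the epipolar line (the sampling itself is omitted, and the function names are our own):

```python
import numpy as np

def ncc(w1, w2):
    # Normalized cross-correlation of two equal-sized pixel windows.
    v1 = w1.ravel() - w1.mean()
    v2 = w2.ravel() - w2.mean()
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0

def best_match_along_line(ref_win, img2, line_pts, half=2):
    # Slide a (2*half+1)^2 window over candidate pixels on the epipolar
    # line in the second image and keep the highest NCC score.
    best, best_pt = -np.inf, None
    for (x, y) in line_pts:
        win = img2[y - half:y + half + 1, x - half:x + half + 1]
        if win.shape != ref_win.shape:
            continue                      # window falls outside the image
        s = ncc(ref_win, win)
        if s > best:
            best, best_pt = s, (x, y)
    return best_pt, best
```

Subtracting the window means makes the score invariant to brightness offsets between the two views, which is why NCC is preferred over a raw sum of squared differences here.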
By calculating the depth values of all pixels within an image, a depth map is obtained, which shares identical camera parameters with its original RGB image. By treating a depth map as a 2D array of 3D points, multiple depth maps can be merged into a single 3D point cloud. Patch-based multi-view stereo (PMVS) [5] is an alternative algorithm for generating dense point clouds. Figure 5 shows the 3D keypoints, the dense point cloud, and the camera poses.
2.2 Region of interest (RoI) in 3D and 2D
This step obtains the region of interest (RoI) in the 3D point cloud and the 2D images. The RoI is necessary since background and noise interfere significantly with subsequent algorithms [56].
2.2.1 RoI in 3D: a 3D oriented bounding box (3D OBB)

A 3D oriented bounding box (3D OBB) is an approximate solution for finding the RoI in a 3D point cloud. The raw point cloud covers a large range and includes many environmental points, so we first filter the point cloud with the criterion

$P_{\text{out}} = \{ p \in P_{\text{in}} : \| p - \mu \| \le \lambda \sigma \}$ (3)

where $P_{\text{in}}$ and $P_{\text{out}}$ denote the input and output 3D point sets, $\mu$ and $\sigma$ stand for the mean and standard deviation of $P_{\text{in}}$, and $\lambda$ is a fixed scale factor. This results in a sphere boundary parameterized by the center $\mu$ and radius $\lambda\sigma$; the chosen $\lambda$ works well in our experiments.

A 3D OBB has 9 degrees of freedom (DoFs): a 3D translation $(t_x, t_y, t_z)$ indicating the OBB center coordinates, 3 rotation angles, and 3 dimensional parameters, namely length, width, and height. The camera baseline is rectified to horizontal in advance; if it is not, many image processing tools can rotate the images so that the horizon line is horizontal. Hence only one rotation angle $\theta$ instead of three is needed to determine the 3D OBB, and a 7-DoF 3D OBB is denoted as $b = (t_x, t_y, t_z, \theta, l, w, h)$.

We design a simple yet efficient two-step convolutional neural network to obtain the 7-DoF 3D OBB. The configuration of the CNN (see Figure 6) follows the VGG [57] style. In Figure 6, the input of the network is an image and the output is a parameter vector. In the first step, the input is a gray-scale bird-view image and the output is a 2D OBB. Applying the 2D OBB to filter and align the point cloud yields the front-view image. In the second step, the input is a gray-scale front-view image and the output holds the remaining parameters. After that, the 3D point cloud is filtered with the 7-DoF OBB, see Figure 7. Table 2 lists the detailed properties of the layers and operators. The function of these layers is described in the following.
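One plausible reading of the filtering criterion in Eq. (3) — keep points inside a sphere of radius $\lambda\sigma$ around the centroid — can be sketched as follows. The value of `lam` is illustrative only; the paper's scale factor is fixed empirically:

```python
import numpy as np

def filter_point_cloud(P, lam=3.0):
    # Keep points whose distance to the centroid is at most lam * sigma,
    # where sigma is the standard deviation of those distances (Eq. (3)).
    # lam is a hypothetical value, not the paper's.
    mu = P.mean(axis=0)                       # sphere center
    d = np.linalg.norm(P - mu, axis=1)        # distance of each point to mu
    sigma = d.std()
    return P[d <= lam * sigma]
```

Far-away environmental points inflate the distance spread and then fall outside the $\lambda\sigma$ sphere, while the dense bridge cluster near the centroid is retained.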
Layers  Feature size  Operators  Kernel size  No.  Stride  Padding 

0  Input          
1  Conv  8  1  1  
2  BN          
3  PReLU          
4  Conv  8  1  1  
5  BN          
6  PReLU          
7  MP    2    
8  Conv  16  1  1  
9  BN          
10  PReLU          
11  Conv  16  1  1  
12  BN          
13  PReLU          
14  MP    2    
15  Conv  32  1  1  
16  BN          
17  PReLU          
18  Conv  32  1  1  
19  BN          
20  PReLU          
21  MP    2    
22  Conv  64  1  1  
23  BN          
24  PReLU          
25  Conv  64  1  1  
26  BN          
27  PReLU          
28  MP    2    
29  Conv  128  1  1  
30  BN          
31  PReLU          
32  Conv  128  1  1  
33  BN          
34  PReLU          
35  MP    2    
36  Conv  256  1  1  
37  BN          
38  PReLU          
39  Conv  256  1  1  
40  BN          
41  PReLU          
42  MP    2    
43  Conv  512  1  1  
44  BN          
45  PReLU          
46  Conv  512  1  1  
47  BN          
48  PReLU          
49  MP    2    
50  Reshape          
51  FC        
52  Dropout (0.5)          
53  BN          
54  PReLU          
55  FC        
56  Output         
The convolutional operation extracts features from an image through a convolutional kernel. One channel of an image is viewed as a matrix, so convolution here means 2-dimensional discrete convolution, expressed as

$O(x, y) = (I * K)(x, y) = \sum_{m} \sum_{n} I(x - m, \, y - n) \, K(m, n)$ (4)

where $*$ means convolution, and $I$, $O$, and $K$ represent the input image, output image, and learnable convolutional kernel, respectively; $(x, y)$ are pixel coordinates. Note that there are slight differences between convolution in mathematics and in recent deep learning libraries [58, 59], where channels, stride, and padding are employed and no kernel flip is required, since the kernels are randomly initialized.

Batch normalization (BN) fixes the means and variances of each layer's inputs; it facilitates network convergence and alleviates overfitting. Assume an input mini-batch with batch size $m$; for each dimension of the BN layer's input, the output is expressed as

$y = \gamma \, \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (5)

where $x$ and $y$ are one dimension of the BN layer's input and output (both $m$-dimensional vectors), $\mu$ and $\sigma^2$ are the mean and variance of the input data, $\gamma$ and $\beta$ are learnable parameters, and a small constant $\epsilon$ is added for numerical stability.
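Eq. (5) for a single dimension can be written out directly (a minimal sketch of training-time BN, without the running statistics used at inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Batch normalization of one dimension of a mini-batch (Eq. (5)):
    # standardize with the batch mean/variance, then apply the
    # learnable affine transform gamma * x_hat + beta.
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

After this transform the batch has (approximately) zero mean and unit variance before the affine parameters re-scale it.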
The activation layer enables the nonlinearity of the network and is attached after each layer. We use the Parametric ReLU (PReLU) [60], expressed as

$f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$ (6)

where $a$ is a learnable parameter and $x$ is the input.
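Eq. (6) is a one-liner in NumPy (with `a` as a plain constant rather than a learned parameter):

```python
import numpy as np

def prelu(x, a=0.25):
    # Parametric ReLU (Eq. (6)): identity for x > 0, slope a for x <= 0.
    return np.where(x > 0, x, a * x)
```

Unlike a plain ReLU, the negative side keeps a small gradient, so units cannot "die" during training.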
The max pooling operation down-samples an image while preserving its dominant features, expressed as

$O(x, y) = \max_{(m, n) \in \mathcal{N}(x, y)} I(m, n)$ (7)

where $I$ and $O$ represent the input and output images, $(x, y)$ are pixel coordinates, and $\mathcal{N}(x, y)$ is the pooling window associated with the output pixel.
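A vectorized NumPy sketch of Eq. (7) for non-overlapping k × k windows (stride k, as used by the MP layers in Table 2):

```python
import numpy as np

def max_pool2d(img, k=2):
    # k x k max pooling with stride k (Eq. (7)); any remainder rows/cols
    # that do not fill a full window are cropped.
    H, W = img.shape[0] // k * k, img.shape[1] // k * k
    x = img[:H, :W].reshape(H // k, k, W // k, k)
    return x.max(axis=(1, 3))
```

The reshape groups each k × k block along two axes, so a single `max` over those axes implements the window maximum without explicit loops.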
Fully connected (FC) layers reduce or increase the dimension of features using a simple matrix product, expressed as

$y = W x + b$ (8)

where $W x$ is a matrix product, $x$ and $y$ are the input and output vectors, and $W$ and $b$ are learnable parameters. The first FC layer requires a huge number of learnable parameters (about 134M), causing a severe overfitting problem, which is alleviated by dropout [61]. Dropout disconnects the connections of neurons randomly with a fixed probability $p$. In this paper, dropout layers with $p = 0.5$ are attached after each FC layer that has more than 1M parameters.

The CNN is trained by minimizing the loss function below; it is applied to the predicted parameters of the first-step CNN and the second-step CNN, respectively.

$\mathcal{L} = \left\| (b - \hat{b}) \circ \dfrac{1}{b} \right\|_2^2$ (9)

where $b$ and $\hat{b}$ stand for the ground-truth and predicted bounding boxes, respectively, and the sign $\circ$ is the Hadamard (element-wise) product for vectors. Dividing by $b$ forces the network to also optimize the smaller values in $b$. The training data and details are described in Section 3.
2.2.2 RoI in 2D: a foreground-background segmentation
The RoI in 2D is a foreground-background segmentation mask (a gray-scale image), obtained from the 3D point cloud, the 3D OBB, and the camera parameters, see Figure 8. The 3D point cloud is first filtered by the 3D OBB and then projected into 2D using the camera parameters. The projection is expressed as

$p \sim K \, d\!\left( [R \,|\, t] \, P \right), \quad P \in \mathcal{P}_f$ (10)

where $\mathcal{P}_f$ denotes the filtered 3D point cloud, $d$ is the camera lens radial distortion function, $K$ and $[R \,|\, t]$ are the intrinsic and extrinsic parameters, and $p$ is the projection of $P$ in image coordinates with homogeneous representation. Note that the set of projected points is not ready to use: to obtain a foreground-background segmentation image, all 2D points should be accumulated onto the 2D image grid. A general method is distance-based accumulation: when accumulating a 2D point with value 1, each of its four neighboring pixels is updated as $v \leftarrow v + (1 - d)$, where $v$ is the intensity value of the neighboring pixel and $d$ is the distance from the 2D point to that pixel.
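The accumulation step can be sketched as bilinear splatting, one common concrete form of distance-based weighting (the exact weighting used in the paper is not reproduced here; the product form $(1 - d_x)(1 - d_y)$ is our assumption):

```python
import numpy as np

def splat_points(pts, shape):
    # Accumulate projected 2D points onto an image grid: each point with
    # value 1 distributes a distance-based weight to its four neighboring
    # pixels (bilinear splatting; an assumed variant of the paper's
    # distance-based accumulation).
    img = np.zeros(shape)
    for x, y in pts:
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        dx, dy = x - x0, y - y0
        for (xi, yi, w) in [(x0, y0, (1 - dx) * (1 - dy)),
                            (x0 + 1, y0, dx * (1 - dy)),
                            (x0, y0 + 1, (1 - dx) * dy),
                            (x0 + 1, y0 + 1, dx * dy)]:
            if 0 <= xi < shape[1] and 0 <= yi < shape[0]:
                img[yi, xi] += w          # closer pixels receive more weight
    return img
```

The four weights sum to 1 for an interior point, so each projected point contributes exactly one unit of intensity to the mask.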
2.3 Learning structural graph layouts and 3D shapes
This step learns the structural graph layout and 3D shapes from the 3D point cloud and images, using the learning framework shown in Figure 9. A multi-view convolutional neural network (multi-view CNN, Figure 11) combined with a point cloud network (Figure 12) encodes the multi-view images and the point cloud into a latent feature, which is then decoded into the structural graph layout (i.e., a hierarchical structural parsing binary tree) by a recursive binary tree network (recursive BiTreeNet, Figure 13); 3D shapes are decoded from the shape nodes of the binary parsing tree (Figure 14). We formulate the learning framework below.
2.3.1 Multi-view convolutional neural network (multi-view CNN)
CNNs learn features from images. The multi-view CNN in this section (Figure 11) learns a feature from multiple view images, and can be seen as a mapping function from the set of views to a latent vector. A single view has two channels: a gray-scale image and its corresponding foreground-background segmentation. A 16× down-sampled image resolution was chosen to balance algorithm performance and memory overhead, see Figure 10. Such high-resolution images with multiple views would require huge numbers of learnable parameters and much memory; to alleviate this, we use a multi-view CNN in which the convolutional kernels are shared among views. Compared with general frameworks such as VGG16 [57] (553M parameters), ResNet101 [62] (178M), and DenseNet121 [63] (31M), the proposed multi-view CNN (6M) reduces the learnable parameters and memory cost significantly, and thus runs on a consumer-grade GPU with 11 GB of memory at batch size 1 with up to 6 views.
Figure 11 illustrates the multi-view CNN; each view branch contains convolutional layers, activation layers, and max pooling layers. Batch normalization layers are not used in this section since all inputs have batch size 1. Learnable parameters are shared among the view branches. The view pooling layer compares features from different views and selects the maximum, expressed as $F(x, y, c) = \max_{v} F_v(x, y, c)$, where $F_v$ and $F$ represent the input and output features, $(x, y)$ is the pixel coordinate, $c$ is the channel index, and $v$ is the view index over the input views. The output feature is then reshaped into a vector and fed into three fully connected layers. The final feature is a 4096-dimensional vector. Table 3 lists the detailed properties of the layers and operators.
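View pooling is an element-wise maximum over the view axis; with per-view features stacked as a `(V, C, H, W)` array it reduces to one call (a sketch, not the paper's code):

```python
import numpy as np

def view_pool(features):
    # Element-wise maximum across the view axis: `features` has shape
    # (V, C, H, W); the output (C, H, W) keeps, per channel and pixel,
    # the strongest response among all V views.
    return features.max(axis=0)
```

Because the maximum is symmetric in its arguments, the result is independent of the order in which the views are fed to the network.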
Layers  Feature size  Operators  Kernel size  No.  Stride  Padding 

0  Input          
1  Conv  8  1  1  
2  PReLU          
3  MP    2    
4  Conv  16  1  1  
5  PReLU          
6  MP    2    
7  Conv  32  1  1  
8  PReLU          
9  MP    2    
10  Conv  64  1  1  
11  PReLU          
12  MP    2    
13  Conv  128  1  1  
14  PReLU          
15  MP    2    
16  Conv  256  1  1  
17  PReLU          
18  MP    2    
19  Conv  512  1  1  
20  PReLU          
21  MP    2    
22  Conv  1024  1  1  
23  PReLU          
24  MP    2    
25  View pooling          
26  Reshape          
27  FC        
28  Dropout (0.5)          
29  PReLU          
30  FC        
31  Dropout (0.5)          
32  PReLU          
33  FC        
34  Dropout (0.5)          
35  Output         
2.3.2 Point cloud network
The point cloud network learns features from a point cloud (a point set). The point cloud network in this section learns a feature from a 3D point set and can be seen as a mapping function from the set of points to a latent vector. A point cloud is a set of points with the following properties: (a) disorder; (b) an unfixed number of points; (c) arbitrary rotation. Because of (a) and (b), symmetric functions (e.g., the max function) are introduced [64]. Unlike recent popular methods [64, 65, 66], the arbitrary rotation (c) was already resolved in Section 2.2. To handle (a) and (b), given an input point cloud composed of 3D points, a series of fully connected layers followed by a max pooling layer is employed in this section. Table 4 lists the detailed properties of the layers and operators, which are described in the following.
Layers  Feature size  Operators  Kernel size 

0  Input    
1  FC  
2  PReLU    
3  FC  
4  PReLU    
5  FC  
6  PReLU    
7  FC  
8  Dropout (0.5)    
9  PReLU    
10  MP  
11  Output   
Fully connected layer. The unordered input requires weights shared among different points while allowing an arbitrary number of points. A fully connected layer meets these requirements via the matrix product

$Y = X W + b$ (11)

where $X W$ is a matrix product, $X$ and $Y$ denote the input points (one row per point, in 3D Euclidean space) and the output point features (one row per point, in feature space), forming two matrices, and $W$ and $b$ are learnable parameters.
Channel-wise (column-wise) max pooling layer. The channel-wise max pooling in this section is the symmetric function of [64]: a max value is selected in each channel (column) over all points, expressed as

$f_c = \max_{i} X_{ic}$ (12)

where $X$ denotes the input point features, $f$ denotes the learned feature, and $c$ is the channel index.
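The combination of shared per-point FC layers and the channel-wise max of Eq. (12) can be sketched as follows (a PointNet-style toy with ReLU in place of PReLU; the layer sizes are arbitrary, not those of Table 4):

```python
import numpy as np

def point_feature(points, W1, b1, W2, b2):
    # Shared per-point MLP followed by channel-wise max pooling:
    # each 3D point (a row of `points`) is mapped independently with the
    # same weights, then a max over the point axis (Eq. (12)) yields an
    # order-invariant global feature.
    h = np.maximum(points @ W1 + b1, 0)   # shared FC + ReLU, per point
    h = np.maximum(h @ W2 + b2, 0)
    return h.max(axis=0)                  # one max per channel
```

Because the max is taken over the point axis, permuting (or reversing) the input points leaves the output feature unchanged, which is exactly the disorder property (a) the symmetric function is meant to absorb.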
2.3.3 Recursive binary tree network (Recursive BiTreeNet)
A binary parsing tree is a rooted tree in which every node has at most two children. The notion of a parsing tree comes from linguistics and is frequently used in learning-based natural language processing [67, 68] and computer graphics research [69, 70]. For example, the sentence "The cat sleeps on the carpet." is parsed into "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sleeps) (PP (IN on) (NP (DT the) (NN carpet)))) (. .)))", where ROOT, S, NP, etc. are node types in the parsing tree. The proposed recursive binary tree network (see Figure 13) learns a binary tree from the image feature and the point cloud feature. The binary tree describes the structural graph layout: the shape nodes can be seen as 3D shape codes and are decoded into 3D shapes by the shape decoder, and the similar nodes describe the distribution of 3D shapes.
In Figure 13, the recursive BiTreeNet first fuses the image feature and the point cloud feature with two FC layers; the output is a 128-D root node feature. The root node is regarded as a parent node and is decoded into a binary tree by the BiTreeNet recursively. One recursive element is shown in Figure 13: a parent node is decoded into one or two child nodes, and the recursive element is reused in the next hierarchy. Given a root node, the recursion is written in Algorithm 1 and explained in detail below.
Algorithm 1 shows the recursion function. Within one recursive element, the node class is known in the training stage but not in the test stage. The node classifier first classifies the parent node into one of three node types; in the training stage we use the known node class and compute a node classification loss, while in the test stage we use the node class predicted by the classifier. The node classifier is a two-layer fully connected network whose output is a 3D one-hot vector indicating the probabilities of the 3 node types; e.g., $[0, 1, 0]^T$ means the second type. The three node types are the split node, the similar node, and the shape node; Figure 4 shows the functions of these nodes.

A split node means the current node splits into 2 nodes, that is, the current shape splits into 2 adjacent shapes, realized by a two-layer FC network. The two output child nodes can be seen as two parent nodes for the next hierarchy and are then fed into the next recursion.
A similar node also splits into 2 outputs, meaning the current shape splits into one shape and its "copies", obtained by a two-layer FC network. The outputs are one parent node for the next hierarchy and one similar node carrying 16 similarity parameters. We use up to 16 parameters to represent similarity. The first 3 parameters are a one-hot vector indicating the similarity type: 1D translation (rigid translation), 2D translation (non-rigid translation) or reflection. For 1D translation (e.g., a truss element), 1 parameter gives the copy number, 1 the distance, and 3 the translation direction as a 3D vector. For 2D translation (e.g., cables, which are similar but vary in length), 1 parameter gives the copy number, 2 the translation distances of the 2 endpoints, and 6 the 2 endpoints' translation directions as two 3D vectors. For reflection, 1 parameter gives the reflection distance and 3 the reflection direction as a 3D vector. The remaining parameters are set to 0 in the labels. We also use a small trick to predict a reasonable 3D model: the translation distances of cables and trusses are correlated, but there is no edge between their similar nodes, so this correlation is not modeled in the binary tree. We simply merge close translation distances into their average; e.g., distances of 0.121 and 0.117 both become 0.119.
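The distance-merging trick can be sketched as follows; the function name and the tolerance value are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def merge_close_distances(distances, tol=0.01):
    """Replace translation distances that lie within `tol` of each other
    by their common average, as for correlated cable/truss spacing."""
    d = np.asarray(distances, dtype=float)
    merged = d.copy()
    used = np.zeros(len(d), dtype=bool)
    for i in range(len(d)):
        if used[i]:
            continue
        group = np.abs(d - d[i]) <= tol  # all distances close to d[i]
        merged[group] = d[group].mean()
        used |= group
    return merged
```

With the paper's example, `merge_close_distances([0.121, 0.117])` maps both distances to their average 0.119.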
A shape node indicates that the current shape is inseparable; it is decoded into a 3D shape by the shape decoder of Section 2.3.4.
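The recursion over the three node types can be sketched in Python. The "classifier" and FC networks below are untrained stand-ins (a fixed rule on the feature norm and simple scalings) so that the sketch runs without learned weights; in the paper these are trained two-layer FC networks:

```python
import numpy as np

SPLIT, SIMILAR, SHAPE = 0, 1, 2

def classify(node):
    # stand-in for the node classifier: a fixed rule on the feature norm,
    # chosen only so the recursion terminates and is runnable
    n = np.linalg.norm(node)
    return SPLIT if n > 2.0 else (SIMILAR if n > 1.0 else SHAPE)

def split_fc(node):
    # stand-in for the split FC network (128 -> 128 + 128)
    return node * 0.5, node * 0.4

def similar_fc(node):
    # stand-in for the similar FC network (128 -> 128 + 16)
    return node * 0.5, np.zeros(16)

def decode(node):
    """Recursively decode a parent-node feature into a nested-tuple tree."""
    t = classify(node)
    if t == SPLIT:                   # shape splits into 2 adjacent shapes
        left, right = split_fc(node)
        return ("split", decode(left), decode(right))
    if t == SIMILAR:                 # shape splits into a shape + its copies
        child, params = similar_fc(node)
        return ("similar", decode(child), params)
    return ("shape", node)           # inseparable: handed to the shape decoder
```

Calling `decode` on a 128D root feature yields a nested tuple whose leaves are shape nodes, mirroring the structure Algorithm 1 produces.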
One may notice that these operations require a dynamic computation graph, i.e., the computation graph varies between iterations. This is natively supported in [59] but not in [58]. Table 5 lists the detailed properties of the layers and operators.
Layers | Feature size | Operators | Kernel size
Feature fusion | 4096+4096 | Input |
 | 8192 | Concatenation |
 | 1024 | FC |
 | 1024 | Dropout |
 | 1024 | PReLU |
 | 128 | FC |
 | 128 | PReLU |
NodeClassifier | 128 | Input |
 | 20 | FC |
 | 20 | PReLU |
 | 3 | FC |
 | 3 | Softmax |
SplitNode | 128 | Input |
 | 181 | FC |
 | 181 | PReLU |
 | 256 | FC |
 | 128+128 | Output |
SimilarNode | 128 | Input |
 | 136 | FC |
 | 136 | PReLU |
 | 144 | FC |
 | 128+16 | Output |
ShapeNode | 128 | Input |
 | A 3D shape | Output |
2.3.4 Shape decoder network
The shape decoder network (Figure 14) decodes a 128D shape node vector into a 3D shape. It first classifies the input shape node into 3 types: cuboid, cylinder and irregular shapes such as the bridge tower. As with the node classifier of Section 2.3.3, in the training stage we use the known shape type and compute a classification loss; in the test stage we use the shape type predicted by the classifier.
A cuboid requires 8 parameters: 6 for its 2 endpoints and 2 for the section width and height, obtained from a two-layer FC network. A cylinder requires 7 parameters: 6 for its 2 endpoints and 1 for the section radius. Table 6 lists the detailed properties of the layers and operators except for the 3D CNNs.
Layers | Feature size | Operators | Kernel size
NodeClassifier | 128 | Input |
 | 20 | FC |
 | 20 | PReLU |
 | 3 | FC |
 | 3 | Softmax |
Cuboid | 128 | Input |
 | 32 | FC |
 | 32 | PReLU |
 | 8 | FC |
 | 8 | Output |
Cylinder | 128 | Input |
 | 30 | FC |
 | 30 | PReLU |
 | 7 | FC |
 | 7 | Output |
Irregular | 128 | Input |
 | | Output |
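Under the cuboid parameterization above, the 8 parameters (two 3D endpoints plus section width and height) can be expanded into 8 corner vertices. The construction of the section frame below (the choice of reference vector) is an assumption for illustration, not specified in the paper:

```python
import numpy as np

def cuboid_vertices(p0, p1, w, h):
    """Build the 8 corners of a cuboid from its 8 parameters:
    two 3D axis endpoints (6 values) plus section width and height."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    axis = p1 - p0
    axis = axis / np.linalg.norm(axis)
    # pick any reference not parallel to the axis to build a section frame
    ref = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(ref, axis)) > 0.9:
        ref = np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, ref)
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    corners = []
    for end in (p0, p1):                 # one rectangle per endpoint
        for su in (-0.5, 0.5):
            for sv in (-0.5, 0.5):
                corners.append(end + su * w * u + sv * h * v)
    return np.array(corners)             # shape (8, 3)
```

A cylinder would use the same 6 endpoint parameters with a single radius instead of width and height.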
For an irregular shape, 3D CNNs decode the shape node into a 3D occupancy grid and a vertex displacement grid; differentiable marching cubes (MC) then converts these two grids into a triangular mesh. A 128D shape node is first mapped to a coarse voxel grid with 512 channels by an FC layer; the voxel grid is then upsampled by a series of fractional-stride convolutional layers. Fractional-stride convolution first inserts spacing, filled with zeros, between the existing pixels (or voxels) of a feature map to increase spatial resolution, and then performs a regular convolution. Fractional-stride convolution is also known as transposed convolution; it should not be confused with dilated convolution, which inserts the spacing within the kernel instead. Table 7 lists the detailed properties of the layers and operators in the 3D fractional-stride CNN.

Layer | Feature size | Operators | Kernel size | No. | Dilation | Stride | Padding
0 | | Input | | | | |
1 | | FC | | 512 | | |
2 | | PReLU | | | | |
3 | | Reshape | | | | |
4 | | Transposed Conv | | 256 | 1 | 1 | 2
5 | | PReLU | | | | |
6 | | Transposed Conv | | 128 | 1 | 1 | 2
7 | | PReLU | | | | |
8 | | Transposed Conv | | 64 | 1 | 1 | 2
9 | | PReLU | | | | |
10 | | Transposed Conv | | 4 | 1 | 1 | 2
11 | | Softmax | | | | |
12 | Mesh | Differentiable MC | | | | |
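The zero-insertion step of fractional-stride convolution can be sketched in NumPy for a single-channel voxel grid; a regular convolution would follow this step (omitted here), and the function name is an illustrative assumption:

```python
import numpy as np

def fractional_stride_upsample(x, stride=2):
    """Insert (stride - 1) zeros between existing voxels along each axis:
    the spacing-insertion step of a fractional-stride (transposed) conv."""
    d, h, w = x.shape
    out = np.zeros((d * stride - (stride - 1),
                    h * stride - (stride - 1),
                    w * stride - (stride - 1)), dtype=x.dtype)
    out[::stride, ::stride, ::stride] = x  # original voxels, now spaced apart
    return out
```

For example, a 2x2x2 grid becomes 3x3x3 with the original values at even indices and zeros in between, which a subsequent regular convolution then smooths into an upsampled feature map.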
To enable end-to-end training, the predicted grids are converted into a triangular mesh through differentiable marching cubes [71]. The final 3D model is obtained by "copying" the decoded 3D shapes according to their similarity parameters.
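Materializing a similar node of the 1D-translation type amounts to replicating the decoded shape's vertices along a unit direction with constant spacing; a minimal sketch (the function name is assumed):

```python
import numpy as np

def apply_1d_translation(vertices, direction, distance, copies):
    """Expand a 1D-translation similar node: replicate a decoded shape's
    vertices `copies` times along unit `direction` with constant spacing."""
    v = np.asarray(vertices, float)
    d = np.asarray(direction, float)
    d = d / np.linalg.norm(d)
    return np.concatenate([v + k * distance * d for k in range(copies)])
```

The 2D-translation and reflection types would follow the same pattern with their own parameter layouts.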
2.3.5 Loss functions
The structural graph layouts are modeled by minimizing the sum of the cross-entropy classification losses over all nodes,

$$\mathcal{L}_{node} = -\sum_{i=1}^{N} \sum_{c \in \mathcal{C}} p_{i,c} \log q_{i,c}, \tag{13}$$

where $p_{i,c}$ and $q_{i,c}$ denote the ground-truth and predicted probability of node type $c$ in the type set $\mathcal{C}$ for the $i$th of $N$ nodes. For example, if the current node is of the first type, then the first entry of the one-hot 3D vector $p_{i,\cdot}$ is 1 and the others are 0. It should be noted that the binary tree, including all node types, is known during training; the node classifier is trained alongside the rest of the network but is only needed at test time.
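The node classification loss can be implemented directly; here `p` holds one-hot node types and `q` the softmax outputs, with a small epsilon guarding the logarithm (an implementation detail assumed for numerical safety):

```python
import numpy as np

def node_classification_loss(p, q, eps=1e-12):
    """Cross entropy summed over all nodes: p is N x 3 one-hot ground
    truth, q is N x 3 predicted softmax probabilities."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q + eps))
```

A perfect prediction gives a loss near zero, while a uniform prediction over the 3 types costs log 3 per node.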
3D shapes are modeled by minimizing the distance loss over all shapes (taken as meshes) produced by the shape decoder,

$$\mathcal{L}_{shape} = \sum_{j=1}^{M} d_{CD}(S_j, \hat{S}_j), \tag{14}$$

where $S_j$ and $\hat{S}_j$ are the ground-truth and predicted shapes at the $j$th of $M$ shape nodes, and $d_{CD}$ is the Chamfer distance, which handles shapes with different numbers of vertices:

$$d_{CD}(S, \hat{S}) = \frac{1}{|S|} \sum_{x \in S} \min_{y \in \hat{S}} \|x - y\|_2^2 + \frac{1}{|\hat{S}|} \sum_{y \in \hat{S}} \min_{x \in S} \|x - y\|_2^2, \tag{15}$$

where $x$ and $y$ are vertices of $S$ and $\hat{S}$ respectively and $|S|$, $|\hat{S}|$ are their vertex counts. For cuboids, the 8 vertices are parameterized by the 8 parameters mentioned above; cylinders are treated as square-section cuboids in the loss and rendered as cylinders for visualization. No smoothness or edge-length regularization terms are needed, since the mesh is converted from a regular voxel grid.
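A brute-force version of the Chamfer distance of Eq. (15) can be written in a few lines; this O(nm) sketch is for illustration only, since large vertex sets call for a KD-tree:

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between vertex sets X (n x 3) and
    Y (m x 3): mean nearest-neighbour squared distance, both directions."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # n x m
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Identical vertex sets give a distance of 0, and the measure is well defined even when the two meshes have different vertex counts.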
The final 3D model also requires minimizing the similarity loss

$$\mathcal{L}_{sim} = \sum_{k=1}^{K} \| s_k - \hat{s}_k \|_2^2, \tag{16}$$

where $s_k$ and $\hat{s}_k$ are the $k$th ground-truth and predicted similarity parameters over the $K$ similar nodes.
The total loss is the weighted sum of the above losses,

$$\mathcal{L} = \lambda_1 \mathcal{L}_{node} + \lambda_2 \mathcal{L}_{shape} + \lambda_3 \mathcal{L}_{sim}, \tag{17}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the 3 weights composing the total loss, which is used to update the network in an end-to-end fashion.
3 Experiments
3.1 Training Details
We collected and generated about 2200 models of long-span bridges; see Figure 15. Each training sample includes inputs and labels.
3.1.1 Details for 3D OBB learning
The point clouds are synthesized from a densely sampled 3D bridge point cloud (about 1M points) with Gaussian noise and non-uniform sampling, plus a random terrain point cloud (about 6M points). The synthesized point clouds are first filtered with the sphere boundary described in Section 2.2.1. Bird views are generated by projecting the point cloud within the sphere boundary onto an image along the up direction; front views are generated by projecting the point cloud within the 2D OBB onto an image along its local front direction. The labels are the 3D OBBs. Images are resized to a fixed size and normalized, and the point cloud within the sphere boundary is translated and scaled into a unit sphere centered at the origin.
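The bird-view projection can be sketched as a simple top-down binning of x, y coordinates into an occupancy image; the resolution and the binary occupancy encoding here are illustrative assumptions:

```python
import numpy as np

def bird_view(points, resolution=64):
    """Project 3D points to a top-down occupancy image by binning their
    x, y coordinates (z is taken as the up direction)."""
    pts = np.asarray(points, float)
    xy = pts[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.maximum(hi - lo, 1e-9)           # avoid division by zero
    idx = ((xy - lo) / span * (resolution - 1)).astype(int)
    img = np.zeros((resolution, resolution))
    img[idx[:, 1], idx[:, 0]] = 1.0            # mark occupied cells
    return img
```

The front view follows the same idea, projecting along the OBB's local front direction instead of the up direction.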
3.1.2 Details for structural graph layouts and 3D shapes learning
The inputs are composed of: (a) a sparsely sampled 3D point cloud (4K to 8K points) with Gaussian noise and non-uniform sampling; (b) 4 to 8 views, each consisting of a grayscale image rendered from the 3D model with a random background, and a foreground-background segmentation reprojected from a densely sampled 3D point cloud (about 1M points) with Gaussian noise and non-uniform sampling. Images are resized to a fixed size and normalized; the geometric centers of the point clouds are translated to the origin, and the point clouds are scaled to unit extent along the up direction.
The labels are composed of: (a) a binary tree represented as nested tuples (in Python) assembling one of the three node types at each node; (b) the 3D model, with 3D shapes (meshes) at the shape nodes and similarity parameters at the similar nodes.
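A label of this form might look as follows; the tag strings and payloads are hypothetical, chosen only to illustrate the nested-tuple encoding and how it can be traversed:

```python
# Hypothetical binary-tree label as nested Python tuples: a split node
# with a similar subtree (a cable and its 16 similarity parameters) and
# a plain shape node (a tower).
tree = ("split",
        ("similar", ("shape", "cable"), [1.0] + [0.0] * 15),
        ("shape", "tower"))

def count_shape_nodes(node):
    """Count shape nodes by walking the nested-tuple label."""
    tag = node[0]
    if tag == "shape":
        return 1
    if tag == "split":
        return count_shape_nodes(node[1]) + count_shape_nodes(node[2])
    if tag == "similar":
        return count_shape_nodes(node[1])
    raise ValueError(tag)
```

Because the label is an ordinary Python structure, loss terms can be accumulated by the same kind of recursive traversal.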
The learning rate decays exponentially from its initial to its final value; training runs for 500 epochs with batch size 1. We use adaptive moment estimation (Adam) to optimize the network, with no pre-training. Fixed values are chosen for the three loss weights $\lambda_1$, $\lambda_2$ and $\lambda_3$.
3.2 Qualitative comparison
In addition to these methods, we manually created a 3D model from the point cloud and images as a reference; it is regarded as the best in terms of detail, though not in accuracy, and it takes a dozen hours of manual work. It is worth mentioning that when creating these models manually, we referred to three orthographic views of the point cloud for approximate size information and to the raw images to infer detailed part relations and 3D shapes; the learning framework is designed to mirror this workflow.
A qualitative comparison among the methods is shown in Figures 16 and 17: (a) Dense point cloud (Section 2.1), which does not support texture mapping and suffers from severe noise and uneven distribution. (b) and (c) Delaunay triangulation and Poisson surface reconstruction [6, 7, 8]; surface reconstruction operates directly on the point cloud with no structural prior, so the meshing quality is poor. (d) Point cloud modeling. This approach works well on high-quality point clouds, but both the RANSAC-based method [16] and the learning-based method [20] failed to fit 3D primitives in this case, owing to the severe noise in the MVS point cloud caused by long-distance photography and the limited number of images. (e) The proposed method. (f) Manual work. (g, h) Front and side views of selected methods.
We did not test image-based learning methods: mesh-based learning [36, 37, 38, 39, 40] has not solved the topological genus problem; current geometric-primitive methods [41, 42, 43] only consider cuboids; and volumetric methods [47, 48, 49, 50] are limited by voxel resolution. More importantly, to the best of our knowledge, no 3D priors are employed in these networks, since no 3D information or stereo vision mechanism is adopted. As the question "What do single-view 3D reconstruction networks learn?" in [72] suggests, the current state of the art in single-view object reconstruction does not actually perform reconstruction but image classification.

3.3 Quantitative comparison
We use the manually created 3D model as the ground truth (GT), although it is not the most accurate spatially. Chamfer distance is computed between the vertices of the manually created model and each other model. Since Chamfer distance over a large number of vertices requires a huge KD-tree, we downsampled each model to under 10000 vertices; for point clouds, the vertices are the downsampled 3D points. We also compared the numbers of reconstructed bridge components (note that the concept of "components" does not exist for the first three methods) and the output data sizes, counting one point as size 3, one triangular face as 3, one cuboid as 8 and one cylinder as 7. See Table 8 for the comparison.
Bridge | Method | Chamfer distance | No. of components | Data size
Bridge 1 | Dense point cloud [3] | 0.04418 | – | 1117761
 | Delaunay triangulation [6] | 0.04418 | – | 14031345
 | Poisson surface reconstruction [8] | 0.55827 | – | 1234419
 | Point cloud modeling [16, 20] | – | – | –
 | The proposed method | 0.06237 | 1488 | 66812
 | Manual work (GT) | 0 | 1430 | 15872
Bridge 2 | Dense point cloud [3] | 0.11345 | – | 5398473
 | Delaunay triangulation [6] | 0.11345 | – | 29836674+
 | Poisson surface reconstruction [8] | 0.19662 | – | 4307535
 | Point cloud modeling [16, 20] | – | – | –
 | The proposed method | 0.02001 | 914 | 73016
 | Manual work (GT) | 0 | 914 | 10952
4 Conclusions
A learning framework was designed that learns a mathematical model from prior knowledge. Compared with previous methods, the proposed method reconstructs a 3D model while successfully preserving its topological properties and spatial relations.
The 3D digital model has several potential applications, including: (a) Visual structural health monitoring. Assembling 2D images into a 3D digital bridge model via texture mapping enables inspectors to engage with the images more intuitively, e.g., through VR and AR, and provides a better monitoring scheme by recording and visualizing the life cycle of the entire bridge. (b) Finite element modeling. The proposed method provides a topology-aware 3D digital model, which significantly reduces the time spent manually establishing a finite element model.
5 Acknowledgements
The study is financially supported by the National Natural Science Foundation of China (Grant U1711265 and 51638007) and supported by grants from the National Key R&D Program of China (Grant 2017YFC1500603).
References
 [1] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge University Press, 2003.
 [2] C. Wu, "VisualSFM: A visual structure from motion system," http://www.cs.washington.edu/homes/ccwu/vsfm, 2011.
 [3] J. L. Schonberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113, 2016.
 [4] Y. Furukawa, C. Hernández, et al., "Multi-view stereo: A tutorial," Foundations and Trends® in Computer Graphics and Vision, vol. 9, no. 1-2, pp. 1–148, 2015.
 [5] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
 [6] B. Delaunay et al., “Sur la sphere vide,” Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, vol. 7, no. 793800, pp. 1–2, 1934.
 [7] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” in Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7, 2006.
 [8] M. Kazhdan and H. Hoppe, “Screened poisson surface reconstruction,” ACM Transactions on Graphics (ToG), vol. 32, no. 3, p. 29, 2013.
 [9] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, “Manhattanworld stereo,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1422–1429, IEEE, 2009.
 [10] S. Sinha, D. Steedly, and R. Szeliski, “Piecewise planar stereo for imagebased rendering,” 2009.
 [11] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra, “Rapter: rebuilding manmade scenes with regular arrangements of planes.,” ACM Trans. Graph., vol. 34, no. 4, pp. 103–1, 2015.
 [12] B. Xiong, M. Jancosek, S. O. Elberink, and G. Vosselman, “Flexible building primitives for 3d building modeling,” ISPRS journal of photogrammetry and remote sensing, vol. 101, pp. 275–290, 2015.
 [13] T. Holzmann, M. Maurer, F. Fraundorfer, and H. Bischof, “Semantically aware urban 3d reconstruction with planebased regularization,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 468–483, 2018.
 [14] C. Raposo, M. Antunes, and J. P. Barreto, “Piecewiseplanar stereoscan: Sequential structure and motion using plane primitives,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 8, pp. 1918–1931, 2018.
 [15] A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur, “A survey of simple geometric primitives detection methods for captured 3d data,” in Computer Graphics Forum, vol. 38, pp. 167–196, Wiley Online Library, 2019.
 [16] R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for pointcloud shape detection,” in Computer graphics forum, vol. 26, pp. 214–226, Wiley Online Library, 2007.
 [17] Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. CohenOr, and N. J. Mitra, “Globfit: Consistently fitting primitives by discovering global relations,” ACM transactions on graphics (TOG), vol. 30, no. 4, p. 52, 2011.
 [18] T. Rabbani, S. Dijkman, F. van den Heuvel, and G. Vosselman, “An integrated approach for modelling and global registration of point clouds,” ISPRS journal of Photogrammetry and Remote Sensing, vol. 61, no. 6, pp. 355–370, 2007.
 [19] M. Attene and G. Patanè, “Hierarchical structure recovery of pointsampled surfaces,” in Computer Graphics Forum, vol. 29, pp. 1905–1920, Wiley Online Library, 2010.
 [20] L. Li, M. Sung, A. Dubrovina, L. Yi, and L. Guibas, “Supervised fitting of geometric primitives to 3d point clouds,” arXiv preprint arXiv:1811.08988, 2018.
 [21] X. Li, Y.L. Lin, J. Miller, A. Cheon, and W. Dixon, “Primitivebased 3d building modeling, sensor simulation, and estimation,” arXiv preprint arXiv:1901.05554, 2019.
 [22] M. Li, P. Wonka, and L. Nan, “Manhattanworld urban reconstruction from point clouds,” in European Conference on Computer Vision, pp. 54–69, Springer, 2016.
 [23] M. Li, L. Nan, and S. Liu, “Fitting boxes to manhattan scenes using linear integer programming,” International journal of digital earth, vol. 9, no. 8, pp. 806–817, 2016.
 [24] J. Lee, H. Son, C. Kim, and C. Kim, “Skeletonbased 3d reconstruction of asbuilt pipelines from laserscan data,” Automation in construction, vol. 35, pp. 199–207, 2013.
 [25] A. K. Patil, P. Holi, S. K. Lee, and Y. H. Chai, “An adaptive approach for the reconstruction and modeling of asbuilt 3d pipelines from point clouds,” Automation in Construction, vol. 75, pp. 65–78, 2017.
 [26] B. Guo, Q. Li, X. Huang, and C. Wang, “An improved method for powerline reconstruction from point cloud data,” Remote sensing, vol. 8, no. 1, p. 36, 2016.
 [27] S. Ortega, A. Trujillo, J. M. Santana, J. P. Suárez, and J. Santana, “Characterization and modeling of power line corridor elements from lidar point clouds,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 152, pp. 24–33, 2019.
 [28] L. Barazzetti, “Parametric asbuilt model generation of complex shapes from point clouds,” Advanced Engineering Informatics, vol. 30, no. 3, pp. 298–311, 2016.
 [29] A. Dimitrov, R. Gu, and M. GolparvarFard, “Nonuniform bspline surface fitting from unordered 3d point clouds for asbuilt modeling,” ComputerAided Civil and Infrastructure Engineering, vol. 31, no. 7, pp. 483–498, 2016.
 [30] P. Labatut, J.P. Pons, and R. Keriven, “Hierarchical shapebased surface reconstruction for dense multiview stereo,” in 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1598–1605, IEEE, 2009.
 [31] F. Lafarge and C. Mallet, “Creating largescale city models from 3dpoint clouds: a robust approach with hybrid representation,” International journal of computer vision, vol. 99, no. 1, pp. 69–85, 2012.
 [32] F. Lafarge, R. Keriven, M. Brédif, and H.H. Vu, “A hybrid multiview stereo algorithm for modeling urban scenes,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 5–17, 2013.
 [33] F. Lafarge and P. Alliez, “Surface reconstruction through point set structuring,” in Computer Graphics Forum, vol. 32, pp. 225–234, Wiley Online Library, 2013.
 [34] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, “Google earth engine: Planetaryscale geospatial analysis for everyone,” Remote Sensing of Environment, vol. 202, pp. 18–27, 2017.
 [35] E. Grilli, F. Menna, and F. Remondino, “A review of point clouds segmentation and classification algorithms,” The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 42, p. 339, 2017.
 [36] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson, “Learning freeform deformations for 3d object reconstruction,” arXiv preprint arXiv:1803.10932, 2018.
 [37] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.G. Jiang, “Pixel2mesh: Generating 3d mesh models from single rgb images,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67, 2018.
 [38] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papiermâché approach to learning 3d surface generation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 216–224, 2018.
 [39] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning categoryspecific mesh reconstruction from image collections,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–386, 2018.
 [40] E. Smith, S. Fujimoto, and D. Meger, “Multiview silhouette and depth decomposition for high resolution 3d object representation,” in Advances in Neural Information Processing Systems, pp. 6478–6488, 2018.
 [41] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik, “Learning shape abstractions by assembling volumetric primitives,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2635–2643, 2017.

 [42] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem, "3D-PRNN: Generating shape primitives with recurrent neural networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 900–909, 2017.
 [43] C. Niu, J. Li, and K. Xu, "Im2Struct: Recovering 3D shape structure from a single RGB image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4521–4529, 2018.
 [44] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613, 2017.
 [45] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang, “3d shape reconstruction from sketches via multiview convolutional networks,” in 2017 International Conference on 3D Vision (3DV), pp. 67–77, IEEE, 2017.
 [46] C.H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3d object reconstruction,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [47] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3dr2n2: A unified approach for single and multiview 3d object reconstruction,” in European conference on computer vision, pp. 628–644, Springer, 2016.
 [48] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning singleview 3d object reconstruction without 3d supervision,” in Advances in Neural Information Processing Systems, pp. 1696–1704, 2016.
 [49] A. Kar, C. Häne, and J. Malik, “Learning a multiview stereo machine,” in Advances in neural information processing systems, pp. 365–376, 2017.
 [50] X.