The aim of this paper is to learn a 3D model of long-span bridges (hereinafter referred to as "bridges") from images captured by unmanned aerial vehicles (UAVs) fully automatically.
Recent developments in UAVs have heightened applications in wide range of industrial scenarios. UAVs can explore inaccessible areas carrying various types of sensors, such as digital cameras, infrared cameras, laser scanners (LiDAR) and so on, thus expected to play an increasingly important role in civil structure visual inspection systems. Cameras mounted on UAVs can capture images of the structure from various view points and record the exterior state of the structure. However, current visual inspection strategies often provide massive sets of unstructured and unaligned digital images, which take a lot of human efforts to filter and organize for further usage, the problem of how to fully automate this task becomes urgent that should be settled. One feasible solution is to integrate these images into a single 3D model of the target structure, which enables inspectors to engage with these images in a more intuitive manner and can provide a better monitoring scheme by recording and visualizing the life cycle of the entire structure. However, existing 3D models are as-designed models which somehow differ from current as-is models and failed to express the current exterior state of the structure, so recording or scanning the structure on the site is believed to be the only feasible solution.
Up to now, the research on civil structure 3D reconstruction has tended to focus on local points (i.e., pixels in image grids and points in 3D Euclidean space) rather than global features. These methods first obtain 3D coordinates of key-points and camera parameters from multiple images using structure-from-motion (SfM) [1, 2, 3], then generate dense 3D point clouds with multi-view stereo (MVS) [4, 5].
Point clouds are rarely directly used in practice, since they are actually unstructured point sets and are not capable for texture mapping. Based on point clouds, there are two post processing, they are surface reconstruction [6, 7, 8] and fitting-based point cloud modeling methods [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34].
Surface reconstruction generates a polygonal mesh (a triangular mesh in most cases) as a surface of a 3D point cloud directly. A triangular mesh is represented as a set of ordered 3D vertices and triangular faces expressed as , each face use 3 vertices indexes. However, a polygonal mesh is often overly complex for civil structures in urban scenes where most surfaces are flat. Worse still, the meshing quality may degrade significantly because: (a) a MVS point cloud often suffers from severe noise and uneven distribution, which is different from LiDAR point clouds; (b) the target 3D object is often with complex geometry and topology, such as for a cable-stayed bridge in this paper; (c) no structure priors are introduced in surface reconstruction methods.
Fitting-based modeling methods generate a compact geometric model to fit a 3D point cloud, where basic geometric shapes (e.g., planes, cones, cuboids, cylinders, pyramids, spheres, etc.) are employed, see Figure 1. Sometimes point cloud segmentation algorithms (see  to get a literature review) are used for pre-processing. Fitting-based methods are summarized into: (a) piece-wise planes fitting; (b) 3D geometric primitives fitting; (c) non-uniform rational b-spline (NURBS) curves and surfaces fitting; (d) Hybrid fitting. They are briefly discussed in the following.
Piece-wise planar fitting methods use planes to fit a point cloud. Manhattan-world prior is introduced due to it is common in the real-world buildings [9, 10]. Planar fitting has achieved remarkable results in building exterior reconstruction [10, 11, 12], Holzmann et al.  use lines in addition to point clouds, Raposo et al.  use planes rather than being limited by point clouds, both achieving better results.
3D geometric primitives fitting methods use 3D geometric primitives to fit a point cloud, sometimes constructive solid geometry (CSG) models are employed. Representing a 3D object with 3D geometric primitives is the most commonly used in computer vision and computer graphics fields . There are many different methods to fit 3D geometric primitives including RANSAC [16, 17, 22], Hough transform , primitive-driven region growth , learning-based methods [20, 21], etc., which has been widely used in building exterior reconstruction [22, 23], building information modeling (BIM) such as pipes [24, 25] etc., and have achieved remarkable results.
Hybrid fitting [30, 31, 32, 33] is employed as a technique solution using LiDAR point clouds, MVS point clouds, surface reconstruction, geometric primitives fitting, etc. For example, the Google Earth .
However, fitting-based methods failed to reconstruct the MVS point cloud in this paper due to: (a) noise, uneven distribution, missing points and occlusion in MVS point cloud; (b) high level structure priors are not introduced, since fitting-based methods only consider bottom-up fitting.
In summary, surface reconstruction and point cloud modeling are both susceptible to noise, they are in a bottom-up fashion, where high level structure priors are rarely introduced, this brings enormous challenge to existing algorithms.
Machine learning has long been a question of great interest in a wide range of fields including computer vision and computer graphics. In contrast to traditional point-based algorithms, which can be considered as a point-wise stereo vision measurement and does not rely on global feature and prior knowledge, learning-based methods [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
learn global features from images. In general, these algorithms first encode single or multiple images into a latent vector by convolutional neural networks (CNNs) and then decode it into 3D models represented by: (a) polygonal mesh models[36, 37, 38, 39, 40]; (b) geometric primitives models [41, 42, 43]; (c) point cloud models [44, 45, 46]; (d) volumetric models [47, 48, 49, 50], in terms of output 3D representation forms. These methods work well in synthesized benchmark datasets [51, 52] but are still far from meeting the industrial engineering requirements.
Table 1 and Figure 2 shows a comparison of different representation forms of a bridge model, all these models are directly converted from the same original 3D bridge model. They are: (a) Triangular mesh, defined by a set of vertices and faces ; (b) Geometric primitives, defined by a set of cuboids and cylinders; (c) Point cloud, defined by a set of points; and (d) Volumetric model, defined by a 3D voxel grid .
|Triangular mesh||- High representation ability for details||- Vertices connections are very hard to learn|
|- Lose details|
|Volumetric model||- Easy to learn||
The objective of this paper is to learn a 3D model for bridges from UAV images. We use hybrid models including geometric primitives and volumetric models to represent a bridge object, and hybrid input data including images and point clouds.
Only images of long-span bridges captured by UAVs are used for 3D reconstruction of bridges. Taking an example of seeing a picture of a bridge shown in Figure 3 by human to understand the bridge, one can delicately distinguish the bridge object in the picture from the background, then have further information that this bridge has two towers arranged symmetrically, the truss blocks are aligned repeatedly and the cables are similar from one to another. Inspired by the human recognition process on an object above, a hierarchical binary parsing tree is proposed to parse a bridge structure in this paper, as shown in Figure 4.
The entire procedure for 3D reconstruction of a bridge is summarized as: (a) first, a 3D point cloud and camera parameters from images are obtained as 3D global initial information through a general rigid SfM and MVS pipeline; (b) then, regions of interest (RoIs) in 3D point cloud and 2D image are obtained to reduce the interference of background and noise; (c) following feeding 3D point cloud, images, and RoIs into the proposed learning framework, a binary tree to describe the structural graph layouts is determined via learning algorithm; (d) when the binary tree is obtained, 3D shapes are further learned from shape nodes; and (e) the final 3D model is assembled with structural graph layouts and 3D shapes.
2.1 3D Point Cloud and Camera Parameters
This step is to obtain 3D point cloud and camera parameters (see Figure 5) from images as 3D global initial information through a general rigid SfM and MVS pipeline. SfM obtains 3D coordinates of key-points and camera parameters from multiple images. It can be summarized as (a) 2D key-points extraction and matching; (b) camera parameters verification and (c) sparse 3D key-points reconstruction. MVS then takes the output (i.e., camera parameters and coordinators of 3D key-points) of SfM as input, and generates depth-maps, dense 3D point clouds and so on. These steps will be described in detail as follows.
2.1.1 2D key-points extraction and matching
This step is to extract key-points and find key-point pairs in multi-view images. A 2D key-point is a coordinate in an image indicating that the pixel values in its neighboring pixels vary greatly.
Given an image , the scale invariant feature transform (SIFT)  key-point extraction processing is formulated as
where is pixel coordinate within an image, is a Gaussian blurred image parameterized by a factor and generated by convoluting a Gaussian kernel with an input image . A pyramid-like scale space is constructed in  to store Gaussian blurred images with different factors and different sizes to enable multi-scale detection. The factors are (, , , , …) in each octave, where , and is the number of intervals in each octave. The image size is halved from a lower octave to a higher one. is a difference-of-Gaussian image produced by subtracting two adjacent Gaussian blurred images in each octave. The key-point candidates are extracted by comparing each pixel in to its neighbors in the current factor and two adjacent factors in each octave, and is chosen if it is larger or smaller than all other neighboring pixels.
A key-point descriptor represented by a 128-D vector is needed to characterize a key-point. It assigns gradient magnitudes and orientations in a neighboring pixel grids region, each magnitude and orientation are expressed as
These gradient magnitudes and orientations are then accumulated into orientation histograms by summarizing the contents over subregions, which generates a descriptor in form of a array of histograms, each with 8 orientation bins, this results in a dimensional vector, which is then normalized to form the key-point descriptor.
Matching is to find all descriptor pairs in an image pair. Approximate nearest neighbors (ANN)  is used to reduce algorithm complexity compared to exhaustive descriptor pairs matching. Given an image pair, for a query descriptor in one image, ANN finds the approximate closest descriptor (i.e.,
) in the other image by constructing a priority search k-means tree w.r.t metric distance (here Euclidean distance).
In summary, given images, the SIFT detector extracts 2D key-points and their descriptors for all images, this results in a key-point set , where denotes the -th key-point in the -th image and (a 128-D vector in ) is the corresponding descriptor. These key-points are then matched via distances of descriptors for all image pairs, and this results in a set of matched key-point pairs .
2.1.2 Camera parameters verification
This step is to verify camera parameters for multiple views including intrinsic parameters (e.g., focal length) and extrinsic parameters (camera motions). In this section, homogeneous coordinate system is introduced to simplify calculations, e.g., a 2D key-point is expressed as in homogeneous coordinate, and subject to . In single view geometry, camera parameters defines a projective mapping , where and are two points in 3D world coordinate and 2D pixel coordinate in homogeneous coordinate system. Camera parameters includes intrinsic parameters which defines a projective mapping from 3D camera coordinate (cartesian coordinate system) to 2D pixel coordinate (homogeneous coordinate system), and extrinsic parameters where is a rotation matrix and is a translation vector, defining rotation and translation from 3D world coordinate (homogeneous coordinate system) to 3D camera coordinate (cartesian coordinate system). Camera lens radial distortion matrix is introduced to eliminate the image distortion caused by camera lens, where is a nonlinear polynomial function of the distance from the current pixel to image center, are radial distortion parameters. As a result, the camera parameters are defined by a matrix .
In multi-view geometry, and are shared among different views if they are obtained with the same camera. Note that and
are known, calibrated or estimated approximately beforehand and optimized at the last stage. Images are then undistorted withto avert nonlinear mapping functions, i.e., only is considered, hence only camera motion for each view is need to be solved. Generally, the camera coordinate of the first view is used as the world coordinate. In this study, and are estimated from exchangeable image file format (EXIF) tags and are shared among different views.
To solve camera motion , consider an image pair (i.e., the -th and -th image) in epipolar geometry (i.e., two views geometry), the camera parameters of two views are encapsulated in the fundamental matrix , where and describes the relative translation and rotation of two views,
is a skew-symmetric matrix offor matrixes cross product. The fundamental matrix gives constraints on how the scene changes under two views, or how the coordinates of a matched point pair changes within an image pair. A matched key-point pair in homogeneous coordinate system fulfills , where and are two key-points in the -th and -th image. The first step is to solve subject to from key-point pairs, the 9 elements in can be solved by least square algorithm since an image pair has a large number of matched key-point pairs. The next step is to solve and from , assume that the camera matrixes of the first and second view are and (i.e., the world coordinate is on the first camera coordinate), and can then be retrieved from
as following, suppose that the singular value decomposition (SVD) ofis since has two equal singular values (refer to  for proof), the solution of and are up to a scale and a four-fold ambiguity, they are or and or , where is orthogonal and skew-symmetric. A solution of and is valid only when a reconstructed point is in front of both cameras.
2.1.3 3D Key-points reconstruction
This step is to reconstruct 3D key-points from 2D key-point pairs. A 3D point in homogeneous representation is obtained by triangulation algorithm for each matched key-point pair and via camera matrixes and in the -th and -th image, the sign implies that there is a non-zero scale factor since homogeneous representation is involved.
For multiple views, Incremental Structure-from-Motion [2, 3] first initializes with a two-view reconstruction, then registers other views to the current reconstruction one-by-one. A new view image is registered to the current reconstruction if it observes existing 3D points, i.e., the key-points in the new view has an overlap to the current views w.r.t key-point descriptors. Camera motion of the new view is then estimated with Perspective-n-Point algorithm from corresponding 3D and 2D points, expressed as , where and are corresponding 3D and 2D points w.r.t key-point descriptors. New 3D points are then obtained with triangulation algorithm in a pair-by-pair mode and results in a new reconstruction. The reconstruction is completed if no new image can be registered. To record multi-view correspondence of 2D points among multiple views, a track is defined for a reconstructed 3D point , i.e., a list of corresponding (w.r.t key-point descriptors) 2D points for different new views.
Bundle adjustment refines all camera parameters for views and positions of all reconstructed 3D key-points to minimize the overall re-projection error, expressed as after every incremental reconstruction.
2.1.4 Multi-view stereo (MVS)
This step is to generate dense point cloud. The output of SfM are sparse 3D key-points and camera parameters, to get dense 3D points, the depth value for all pixels need to be calculated via epipolar geometry.
Given an image pair and one pixel within the first image, the preimage of is a ray that goes from the camera center and across . To find the corresponding pixel in the second image, consider epipolar geometry, all that fulfill the epipolar geometry make up the epipolar line and have . In fact, should lie on the projected line of the preimage on the second image, i.e., lies on the epipolar line . The search for the corresponding pixel is to find the best-matched pixel to along the epipolar line , this processing is done via a sliding window rather than a single pixel. Consider a fixed window around the point in the first image and a slide window along the epipolar line in the second image, Normalized Cross Correlation (NCC) compares the score of the pixel values within the two windows, expressed as , where and are vectorized pixel values within the first and second window, and are mean values, and denotes inner product. The corresponding pixel is selected in the second image when NCC score achieves maximum. The depth values of two corresponding pixels are then calculated by triangulation algorithm mentioned in Section 2.1.3, that is, the depth value for is the distance from the first camera center to its reconstructed 3D point , the same with .
By calculating all pixel depth values within an image, a depth-map is obtained, which shares the identical camera parameters with its original RGB image. By treating a depth-map as a 2D array of 3D points, multiple depth-maps can be considered as a merged 3D point cloud. Patch-based multi-view stereo (PMVS)  is an alternative algorithm to generate dense point cloud. Figure 5 shows 3D key-points, dense point cloud and camera poses.
2.2 Region of interest (RoI) in 3D and 2D
This step is to obtain region of interest (RoI) in 3D point cloud and 2D images. RoI is necessary since background and noise interfere algorithms significantly .
2.2.1 RoI in 3D: a 3D orientated bounding box (3D OBB)
3D orientated bounding box (3D OBB) is an approximate solution to find RoI in 3D point cloud. The raw point cloud is with a large range including many environmental points, we first filter a point cloud with criterion below
where and denote the input and output 3D point sets, and
stand for the mean and standard deviation of, is a fixed scale factor. This results in a sphere boundary parameterized by center and radius . works well in our experiments.
A 3D OBB has 9 degrees of freedom (DoFs) including a 3D translationindicating OBB center coordinates, 3 rotation angles and 3 dimensional parameters including length, width and height expressed as . The baseline of camera is rectified to horizontal in advance, if not, many image processing tools can rectify horizon line to horizontal by rotating images. So only one rotation angle instead of is needed to determine the 3D OBB, hence a 7-DoFs 3D OBB is denoted as .
We design a simple yet efficient two-step convolutional neural network to obtain the 7-DoFs 3D OBB. The configuration of the CNN (see Figure 6) follows VGG  style. In Figure 6, the input of this network is a dimensional image, the output is a dimensional vector. In the first step, the input is a gray scale bird view image, the output is a 2D OBB with dimension . On applying the 2D OBB to filter and align the point cloud, we can get the front view image. In the second step, the input is a gray scale front view image, the output is with dimension . After that, the 3D point cloud is filtered with the 7-DoFs OBB. see Figure 7. Table 2 lists the detailed properties of layers and operators. We will describe the function of these layers in following.
|Layers||Feature size||Operators||Kernel size||No.||Stride||Padding|
Convolutional operation extract features from an image through a convolutional kernel. One single channel of an image is viewed as a matrix, so convolution indicates 2-dimensional discrete convolution, expressed as
where means convolution, , and represent the input image, output image, and learn-able convolutional kernel, respectively,
are pixel coordinates. Note that there are some slight differences from convolution in math to recent deep learning libraries[58, 59], where channels, stride and padding are employed and no flip is required for convolutional kernels since they are randomly initialized.
Batch normalization fixes the means and variances of each layer’s inputs, it facilitates network convergence and alleviate over-fitting problem. Assume an input mini-batch with batch size, for each dimension of BN layer’s input, the output is expressed as
where and are one dimension of BN layer’s input and output (both are dimensional vectors), and are mean and variance of input data, and are learn-able parameters, a small constant is added for numerical stability.
The activation layer enables non-linearity of networks, which is attached after each layer in the network. We use Parametric ReLU (PReLU) expressed as
where is a learn-able parameter, is the input.
Max pooling operation down sample an image while preserving its dominating features, expressed as
where and represent the input image and output image, are pixel coordinates.
Fully connected (FC) layers reduce or increase the dimension of features, using simple matrix product expressed as
where means matrix product, and are input and output vectors, and are learn-able parameters. The first layer of FC requires huge number of learn-able parameters (about 134M) causing severe over-fitting problem, which is alleviated by dropout . In this paper, dropout layers are attached after each FC layers with if the FC layer has more than 1M parameters.
The CNN is trained by minimizing the loss function below. We considerand in the first-step CNN and second-step CNN respectively.
where and stand for ground truth bounding box and predicted bounding box respectively, the sign is Hadamard product for vectors (i.e., element-wise product). The use of forces the network to optimize smaller values in . Training data and details are described in Section 3.
2.2.2 RoI in 2D: A foreground-background segmentation
RoI in 2D indicates a foreground-background segmentation mask (a gray scale image), it is obtained from 3D point cloud, 3D OBB and camera parameters, see Figure 8. 3D point cloud is first filtered by 3D OBB and then projected into 2D using camera parameters. The projection is expressed as
where denotes filtered 3D point cloud, is camera lens radial distortion parameters, and are intrinsic and extrinsic parameters, is the projection of on image coordinate with homogeneous representation. Note that is not ready to use, to get a foreground-background segmentation image, all 2D points in should be accumulated to 2D image grids. A general method is distance-based accumulation, when accumulating a 2D point with value 1, its four neighboring pixels are updated as , where is the intensity value of a neighboring pixel, is the distance from 2D point to the neighboring pixel.
2.3 Learning structural graph layouts and 3D shapes
This step is to learn structural graph layouts and 3D shapes from 3D point cloud and images. We design a learning framework, see Figure 9. A multi-view convolutional neural network (Multi-view CNN, Figure 11) combined with a point cloud network (Figure 12) is designed to encode multi-view images and point cloud into a latent feature, which is then decoded into structural graph layouts (i.e., a hierarchical structural parsing binary tree) by designing a recursive binary tree network (Recursive BiTreeNet, Figure 13), 3D shapes are decoded from shape nodes in the binary parsing tree (Figure 14). We formulate the learning framework below.
2.3.1 Multi-view convolutional neural network (Multi-view CNN)
CNNs learn features from images. A multi-view CNN in this section (Figure 11) learns a feature from -view images , the Multi-view CNN can be seen as a mapping function . A single view has two channels, a gray scale image and its corresponding foreground-background segmentation. A 16 times down sampled image resolution (about ) was chosen to balance algorithm performance and memory overhead, see Figure 10. Such high-resolution images with multiple views require huge learning parameters and memory, to alleviate this, we use multi-view CNN, where convolutional kernels among different views are shared, compared with general framework like VGG16  (553M), ResNet101  (178M) and DenseNet121  (31M), the proposed multi-view CNN (6M) reduced learning parameters and memory cost significantly, thus works on a consumer grade GPU with 11GB memory and with batch size 1 up to 6 views.
Figure 11 illustrates the Multi-view CNN, each view branch contains convolutional layers, activation layers and max pooling layers. Batch normalization layers are not considered in this section since all inputs are with batch size 1. Learn-able parameters are shared among different view branches. The view pooling layer compares features from different views and selects maximum features, expressed as , where and represent the input and output feature, denotes pixel coordinate, is channel index, is view index for input views. The output feature is then re-shaped to a vector and fed into three fully connected layers. The final feature is a 4096-dimensional vector. Table 3 lists the detailed properties of layers and operators.
|Layers||Feature size||Operators||Kernel size||No.||Stride||Padding|
2.3.2 Point cloud network
Point cloud network learns features from a point cloud (a point set). A point cloud network in this section learns a feature from a 3D point set , the point cloud network can be seen as a mapping function . A point cloud is composed by a set of points with following properties: (a) disorder; (b) unfixed number; (c) arbitrary rotation. Based on (a) and (b), symmetry functions (e.g., max function where ) are introduced . Unlike recent popular methods [64, 65, 66], the arbitrary rotation was already solved in Section 2.2. To solve (a) and (b), given an input point cloud composed of 3D points, a series of fully connected layers followed by a max pooling layer are employed in this section. Table 4 lists the detailed properties of layers and operators. We will describe these layers in the following.
|Layers||Feature size||Operators||Kernel size|
Fully connected layer. The unordered input requires shared weights among different points, meanwhile allowing arbitrary number of points. Hence a fully connected layer meets these requirements, which indicates matrix product expressed as
where means matrix product, and denote input and output points in Euclidean -space and -space, respectively, forming two matrixes, and are learn-able parameters.
Channel-wise (column-wise) max pooling layer. The channel-wise max pooling in this section indicates the symmetry function in , where a max value is selected in each channel (column) for points, expressed as
where denotes input points in Euclidean -space, denotes the learned feature, is the channel index..
2.3.3 Recursive binary tree network (Recursive BiTreeNet)
A binary parsing tree is a rooted tree, where every node has at most two children, it is defined as a graph and
is empty. The notion of parsing tree comes from the field of Linguistics and becomes frequently used in learning-based natural language processing[67, 68] and computer graphics research [69, 70]. For example, a sentence "The cat sleeps on the carpet." is parsed into "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sleeps) (PP (IN on) (NP (DT the) (NN carpet)))) (. .)))", where are node types in the binary parsing tree.
The proposed recursive binary tree network (see Figure 13) learns a binary tree from image features and point cloud features. The binary tree describes the structural graph layouts, where the shape nodes can be seen as 3D shape codes and are decode into 3D shapes by shape decoder, and similar nodes describe the distribution of 3D shapes.
In Figure 13, the Recursive BiTreeNet first fuse image feature and point cloud feature by two FC layers, the output is a 128-D root node feature. The root node is regarded as a parent node and is decoded to a binary tree by the BiTreeNet recursively. One recursive element is shown in Figure 13, a parent node is decoded to one or two child nodes, and the recursive element is reused in the next hierarchy. Given a root node , the recursion processing is written in Algorithm 1. We will explain Algorithm 1 below in detail.
Algorithm 1 shows the recursion function. In one recursive element, in training stage, the node class is known (), but not in test stage (
). The node classifier first classifies the parent node into three node types, for training stage, we use the known node class and calculate node classification loss; for test stage, we use the node class given by node classifier. The node classifier is a two-layer fully connected network, the output is a 3-D one-hot vector indicating probabilities of 3 node types, e.g., means the second type and is equivalent to , we use the latter in written form to simplify. The three node types are split node, similar node and shape node, Figure 4 shows functions of these nodes.
means the current node is a split node, and it can split into 2 nodes, that is, the current shape can split into 2 adjacent shapes, realized by a two-layer FC network . The two output child nodes can be seen as two parent nodes for the next hierarchy, they are then fed into the next recursion function.
means the current node is a similar node, and it can split into 2 nodes, which means the current shape can split into one shape and its "copies", obtained by a two-layer FC network . The two output child nodes can be seen as one parent node for the next hierarchy, and one similar node with 16 similar parameters. We use up to 16 parameters to represent similarity, the first 3 parameters are a one-hot vector indicating the similar type, including 1-D translation (rigid translation), 2-D translation (non-rigid translation) and reflection. For 1-D translation (e.g., truss element), 1 parameter is for copy number, 1 parameter is for distance, 3 parameters are for translation directions specified by a 3-D vector, e.g., means the direction. For 2-D translation, e.g., cables are similar but vary in lengths, 1 parameter is for copy number, 2 parameters are for 2 endpoints’ translation distances, 6 parameters are for 2 endpoints’ translation directions specified by 2 3-D vectors. For reflection, 1 parameter is for reflection distance, 3 parameters are for reflection directions specified by a 3-D vector. The rest parameters are set to 0 in labels. Here we use a small trick to predict a reasonable 3D model, notice that translation distance of cable and truss are correlative, but there is no edge between similar nodes of cables and trusses. Actually, this correlation hasn’t been modeled in binary tree, we simply merged close translation distance using their average, e.g., two translation distances are 0.121 and 0.117, we use 0.119 for both.
indicates that the current node is a shape node, which means the current shape is inseparable. The shape node is decoded to a 3D shape by Shape decoder in Section 2.3.4.
One may notice that these operations require dynamic computation graph, which means the computation graph is variable during each iteration. This is friendly supported in  but not in . Table 5 lists the detailed properties of layers and operators.
|Layers||Feature size||Operators||Kernel size|
|A 3D shape||Output||-|
2.3.4 Shape decoder network
Shape decoder network (Figure 14 is to decode a 128-D shape node vector into a 3D shape. The shape decoder first classifies the input shape node into 3 types, including cuboid, cylinder and irregular shapes like bridge tower. Similar to the node classifier mentioned above in Section 2.3.3, for training stage, we use the known shape type and calculate node classification loss; for test stage, we use the shape type given by node classifier.
A cuboid requires 8 parameters, for 2 endpoints and 2 for section width and height, obtained from a two-layer FC network . A cylinder requires 7 parameters, for 2 endpoints and 1 for section radius. Table 6 lists detailed properties of layers and operators except for 3D CNNs.
|Layers||Feature size||Operators||Kernel size|
For an irregular shape, 3D CNNs are employed to decode the shape node to a 3D occupancy grid and a vertex displacement grid , then differentiable (MC) converts these two grids to a triangular mesh. A 128-D shape node is first mapped to a
grid with 512 channels by a FC layer, the voxel grid is then up-sampled by a series of fractional-stride convolutional layers. Fractional-stride convolution first adds spacing between existing pixels (or voxels) within a feature map to increase spatial resolution, the spacing are then filled with zeros, the rest is to do regular convolution operation. Fractional-stride convolution is also known as transposed convolution or dilated convolution, the spacing between pixels is called dilation. Table7 lists the detailed properties of layers and operators in 3D fractional-stride CNN.
|Layers||Feature size||Operators||Kernel size||No.||Dilation||Stride||Padding|
To enable end-to-end training, the unsigned distance field is converted to a triangular mesh through differentiable marching cubes . The final 3D model is obtained by "copying" the decoded 3D shapes with similar parameters.
2.3.5 Loss functions
The structural graph layouts are modeled by minimizing the summation of cross entropy classification loss for each node expressed as
where and denote ground truth probability and predicted probability for event in events set for the -th node in total nodes. The event means, for example, if the current node is a first type node, then is either 0 or 1, is the first value in the one-hot 3-D vector. It should be note that the binary tree including node types is known in training procedure, hence the node classifier is trained in training and is useful only when testing.
3D shapes are modeled by minimizing the distance loss function for all shapes (taken as meshes) decoded by shape decoder expressed as
where and are ground truth shape and predicted shape in the -th shape node in total shape nodes, is Chamfer distance for the case that two shapes have different number of vertices, expressed as
where and are vertices in shapes and respectively, and are vertices number in and . For cuboids, 8 vertices are parameterized with 8 parameters mentioned above, for cylinders, we use cuboids with square section for calculation and back to cylinders for visualization. No smooth or long edge length regularization terms are considered since the mesh is converted from a regular voxel grid.
The final 3D model requires to minimize the similar nodes loss function
where and are the -th ground truth similar parameters and the -th predicted parameters in total similar nodes.
The total loss function is the summation of all mentioned losses expressed as
where , and are 3 weights to compose the total loss. The overall loss is to update the network in an end-to-end fashion.
3.1 Training Details
We collected and generated about 2200 models of long span bridges, see Figure 15. Each training sample includes inputs and labels.
3.1.1 Details for 3D OBB learning
The point clouds are synthesized by densely sampled 3D bridge point cloud (about 1M points) with Gaussian noise and non-uniform sampling, and random terrain point cloud (about 6M points). The synthesized point clouds are first filtered with a sphere boundary mentioned in Section 2.2.1. Bird views are generated by projecting point cloud within the sphere boundary to an image along (up) direction, front views are generated by projecting point cloud within the 2D OBB to an image along its local (front) direction. The labels are 3D OBBs . Images are with size and are normalized to , point cloud within the sphere boundary is translated and scaled to a unit sphere with center .
3.1.2 Details for structural graph layouts and 3D shapes learning
The inputs are composed of: (a) a sparsely sampled 3D point cloud (from 4K to 8K points) with Gaussian noise and non-uniform sampling; (b) 4 to 8 views, each view includes a rendered gray scale image from 3D model with random background, and a re-projected foreground-background segmentation from a densely sampled 3D point cloud (about 1M points) with Gaussian noise and non-uniform sampling. Images are with size and are normalized to , the geometric centers of point clouds are translated to , point clouds are scaled to 1 according to (up) direction.
The labels are composed of: (a) a binary tree represented with nested tuples (in Python language) by assembling one of the three node types on each node; (b) the 3D model with 3D shapes (mesh) on shape nodes and similar parameters on similar nodes.
Learning rate is set to initially and finally with exponential decline, training epoch is 500 with batch size 1, we use adaptive moment estimation (Adam) with and to optimize the network, no pre-training is implemented. In the loss function, we choose , and .
3.2 Qualitative comparison
In addition to these methods, we manually create a 3D model based on point cloud and images as a reference, which is regarded as the best w.r.t details but not in accuracy, however, it takes a dozen of hours for manual work. It’s worth mentioning that when manually creating these models, we refer to three orthographic views of point cloud to obtain approximate size information, and refer to raw images to infer detailed part relations and 3D shapes, hence it is reasonable that how the learning framework works.
A qualitative comparison among different methods is shown in Figure 16 and Figure 17: (a) Dense point cloud (Section 2.1). Dense point cloud is not capable for texture mapping, and suffers from severe noise and uneven distribution. (b) and (c) Delaunary triangulation and Poisson surface reconstruction [6, 7, 8]. Surface reconstruction is directly based on point clouds and no structure prior is introduced so the meshing quality is unpleasant. (d) Point cloud modeling. This method works well in practice for high quality point clouds, we have tested on RANSAC based methods  and Learning-based method , both failed to fit 3D primitives in this case, this is due to the severe noise on MVS point cloud caused by long distance photography and limited image quantities. (e) The proposed method. (f) Manual work. (gh) Front and side views of selected methods.
We haven’t tested image-based learning methods, since that mesh-based learning[36, 37, 38, 39, 40] haven’t solve the topological genus problem; current geometric primitives-based methods [41, 42, 43] only take cuboids into consideration; volumetric methods [47, 48, 49, 50] are limited by voxel resolution. More importantly, to our best knowledge, no 3D priors are employed since no 3D information or stereo vision mechanisms are adopted in their networks. As the question "What Do Single-view 3D Reconstruction Networks Learn?" asked in , current state of the art in single-view object reconstruction does not actually perform reconstruction but image classification.
3.3 Quantitative comparison
We use manually created 3D model as the ground truth (GT), although it seems not perform best in spatial accuracy. Chamfer distance is calculated with vertices of manually created 3D model and other models. Since Chamfer distance requires a huge KD Tree to handle a large number of vertices, hence we down-sampled to under 10000 vertices. For point cloud, the vertices are down-sampled 3D points. We also compared the reconstructed bridge components number, note that the “components” concept does not exist in the first three methods. Finally, we compared the output data size, one point has a size of 3, one triangular face is 3, one cuboid is 8, and one cylinder is 7. See Table 8 for a comparison.
|Method||Chamfer distance||No. of components||Data size|
|Bridge 1||Dense point cloud ||0.04418||-||1117761|
|Delaunay triangulation ||0.04418||-||14031345|
|Poisson surface reconstruction ||0.55827||-||1234419|
|Point cloud modeling [16, 20]||-||-||-|
|The proposed method||0.06237||1488||66812|
|Manual work (GT)||0||1430||15872|
|Bridge 2||Dense point cloud ||0.11345||-||5398473|
|Delaunay triangulation ||0.11345||-||29836674+|
|Poisson surface reconstruction ||0.19662||-||4307535|
|Point cloud modeling [16, 20]||-||-||-|
|The proposed method||0.02001||914||73016|
|Manual work (GT)||0||914||10952|
A learning framework is designed which can learn a mathematical model from prior knowledge. Compared with previous methods, the proposed method reconstructs a 3D model while preserving its topological properties and spatial relations successfully.
The 3D digital model shows several potential applications, including: (a) Visual structural health monitoring. By assembling 2D images into a 3D digital bridge model using texture mapping, it enables inspectors to engage with these images in a more intuitive manner, like VR and AR. It also provides a better monitoring scheme by recording and visualizing the life cycle of an entire bridge. (b) Finite element modeling. The proposed method provides a topology-aware 3D digital model, which reduce the time spending on manual establishment of finite element model significantly.
The study is financially supported by the National Natural Science Foundation of China (Grant U1711265 and 51638007) and supported by grants from the National Key R&D Program of China (Grant 2017YFC1500603).
-  R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
-  C. Wu, “Visualsfm: A visual structure from motion system,” http://www. cs. washington. edu/homes/ccwu/vsfm, 2011.
J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113, 2016.
-  Y. Furukawa, C. Hernández, et al., “Multi-view stereo: A tutorial,” Foundations and Trends® in Computer Graphics and Vision, vol. 9, no. 1-2, pp. 1–148, 2015.
-  Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiview stereopsis,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
-  B. Delaunay et al., “Sur la sphere vide,” Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, vol. 7, no. 793-800, pp. 1–2, 1934.
-  M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” in Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7, 2006.
-  M. Kazhdan and H. Hoppe, “Screened poisson surface reconstruction,” ACM Transactions on Graphics (ToG), vol. 32, no. 3, p. 29, 2013.
-  Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, “Manhattan-world stereo,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1422–1429, IEEE, 2009.
-  S. Sinha, D. Steedly, and R. Szeliski, “Piecewise planar stereo for image-based rendering,” 2009.
-  A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra, “Rapter: rebuilding man-made scenes with regular arrangements of planes.,” ACM Trans. Graph., vol. 34, no. 4, pp. 103–1, 2015.
-  B. Xiong, M. Jancosek, S. O. Elberink, and G. Vosselman, “Flexible building primitives for 3d building modeling,” ISPRS journal of photogrammetry and remote sensing, vol. 101, pp. 275–290, 2015.
-  T. Holzmann, M. Maurer, F. Fraundorfer, and H. Bischof, “Semantically aware urban 3d reconstruction with plane-based regularization,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 468–483, 2018.
-  C. Raposo, M. Antunes, and J. P. Barreto, “Piecewise-planar stereoscan: Sequential structure and motion using plane primitives,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 8, pp. 1918–1931, 2018.
-  A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur, “A survey of simple geometric primitives detection methods for captured 3d data,” in Computer Graphics Forum, vol. 38, pp. 167–196, Wiley Online Library, 2019.
-  R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for point-cloud shape detection,” in Computer graphics forum, vol. 26, pp. 214–226, Wiley Online Library, 2007.
-  Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. Cohen-Or, and N. J. Mitra, “Globfit: Consistently fitting primitives by discovering global relations,” ACM transactions on graphics (TOG), vol. 30, no. 4, p. 52, 2011.
-  T. Rabbani, S. Dijkman, F. van den Heuvel, and G. Vosselman, “An integrated approach for modelling and global registration of point clouds,” ISPRS journal of Photogrammetry and Remote Sensing, vol. 61, no. 6, pp. 355–370, 2007.
-  M. Attene and G. Patanè, “Hierarchical structure recovery of point-sampled surfaces,” in Computer Graphics Forum, vol. 29, pp. 1905–1920, Wiley Online Library, 2010.
-  L. Li, M. Sung, A. Dubrovina, L. Yi, and L. Guibas, “Supervised fitting of geometric primitives to 3d point clouds,” arXiv preprint arXiv:1811.08988, 2018.
-  X. Li, Y.-L. Lin, J. Miller, A. Cheon, and W. Dixon, “Primitive-based 3d building modeling, sensor simulation, and estimation,” arXiv preprint arXiv:1901.05554, 2019.
-  M. Li, P. Wonka, and L. Nan, “Manhattan-world urban reconstruction from point clouds,” in European Conference on Computer Vision, pp. 54–69, Springer, 2016.
-  M. Li, L. Nan, and S. Liu, “Fitting boxes to manhattan scenes using linear integer programming,” International journal of digital earth, vol. 9, no. 8, pp. 806–817, 2016.
-  J. Lee, H. Son, C. Kim, and C. Kim, “Skeleton-based 3d reconstruction of as-built pipelines from laser-scan data,” Automation in construction, vol. 35, pp. 199–207, 2013.
-  A. K. Patil, P. Holi, S. K. Lee, and Y. H. Chai, “An adaptive approach for the reconstruction and modeling of as-built 3d pipelines from point clouds,” Automation in Construction, vol. 75, pp. 65–78, 2017.
-  B. Guo, Q. Li, X. Huang, and C. Wang, “An improved method for power-line reconstruction from point cloud data,” Remote sensing, vol. 8, no. 1, p. 36, 2016.
-  S. Ortega, A. Trujillo, J. M. Santana, J. P. Suárez, and J. Santana, “Characterization and modeling of power line corridor elements from lidar point clouds,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 152, pp. 24–33, 2019.
-  L. Barazzetti, “Parametric as-built model generation of complex shapes from point clouds,” Advanced Engineering Informatics, vol. 30, no. 3, pp. 298–311, 2016.
-  A. Dimitrov, R. Gu, and M. Golparvar-Fard, “Non-uniform b-spline surface fitting from unordered 3d point clouds for as-built modeling,” Computer-Aided Civil and Infrastructure Engineering, vol. 31, no. 7, pp. 483–498, 2016.
-  P. Labatut, J.-P. Pons, and R. Keriven, “Hierarchical shape-based surface reconstruction for dense multi-view stereo,” in 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1598–1605, IEEE, 2009.
-  F. Lafarge and C. Mallet, “Creating large-scale city models from 3d-point clouds: a robust approach with hybrid representation,” International journal of computer vision, vol. 99, no. 1, pp. 69–85, 2012.
-  F. Lafarge, R. Keriven, M. Brédif, and H.-H. Vu, “A hybrid multiview stereo algorithm for modeling urban scenes,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 5–17, 2013.
-  F. Lafarge and P. Alliez, “Surface reconstruction through point set structuring,” in Computer Graphics Forum, vol. 32, pp. 225–234, Wiley Online Library, 2013.
-  N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, “Google earth engine: Planetary-scale geospatial analysis for everyone,” Remote Sensing of Environment, vol. 202, pp. 18–27, 2017.