1 Introduction
Point clouds are an important and widely used representation of 3D data. Surface reconstruction from point clouds has been well studied in computer graphics, and many geometric reconstruction methods have been proposed. A commonly used pipeline first computes an implicit function on a 3D grid [5, 12]. The Marching Cubes (MC) method [15] is then applied to extract an isosurface from the grid using the implicit function values. The grid is usually an adaptive octree, and the intersection of two grid lines in an octree is called an octree vertex. Figure 1 shows a 2D example. In particular, the implicit function can be an indicator function that indicates whether each vertex is inside or outside the implicit surface. That is, a surface reconstruction problem can be treated as a binary classification problem for octree vertices.
In real-world reconstruction tasks, point clouds are mainly acquired by scanners or Multi-View Stereo (MVS) methods [24, 29]. These point clouds are usually dense, with complex topologies and millions of points, which poses challenges for surface reconstruction methods.
With the development of deep learning, a variety of learning-based surface reconstruction methods have been proposed [6, 22, 14, 1, 8, 17, 18]. However, learning-based methods still struggle to reconstruct real-world point clouds, for the following four reasons.
1) The network architectures and output representations of these learning-based methods usually force them to consider all the points at once. Therefore, they do not allow dividing the input data and processing different parts separately. When point clouds scale to millions of points, devices may not be able to handle the memory usage, which makes the model less scalable. For example, Deep Marching Cubes [14] transforms the entire point cloud into voxel grids at once and directly outputs an irregular mesh, which is less scalable and only applicable to point clouds with thousands of points.
2) Although some methods can be applied to large-scale point clouds, in practice they heavily downsample or aggregate the points to a relatively small scale. For example, Occupancy Networks (ONet) [17] and DeepSDF [18] encode the entire point cloud into a fixed-size latent vector of 512 elements. The representation power of such a vector is limited, so the quality of surfaces reconstructed from large-scale point clouds suffers.
3) Even on small-scale point clouds, learning-based methods are good at reconstructing shapes with quite simple topology rather than complex shapes. In fact, the resulting surfaces of simple shapes are usually over-smoothed. Figure 2 shows reconstruction examples of a state-of-the-art learning-based method (ONet) [17] and ours. For simple shapes, ONet produces a smooth surface. However, for shapes with more complex topology, it cannot reconstruct the relevant details well.
4) Some learning-based methods need a large portion of the dataset for training. For example, ONet [17] uses 4/5 of its dataset (about 3D shapes) as the training set. The trained network is therefore more prone to overfitting the dataset. Moreover, it cannot generalize well across different datasets.
Recently, the tangent convolution network [26] has been proposed for semantic segmentation of point clouds. It constructs a set of 2D tangent images for points by projecting local neighbor points onto tangent planes. 2D convolutions are then applied to these images to learn point features. The tangent images store signed projection distances and other input features. The operations of the tangent convolution network are all performed in a fixed-size local spatial region, so they focus on local neighbors rather than on the global shape. This allows dividing the input point cloud with overlapping bounding boxes and processing the parts in parallel, which is not possible with the well-known classification network PointNet [19].
In this paper, we build on the commonly used implicit function approach to surface reconstruction discussed above. Specifically, the implicit function values are 0-1 labels of octree vertices. They indicate whether the octree vertices are in front of or behind the implicit surface. This strategy reduces surface reconstruction to a binary classification problem, which is simpler and more effective. ONet [17] also trains its network in a binary classification manner. However, this strategy alone does not guarantee high-quality results. Three main limitations must be overcome when dealing with large-scale data.
1) The scalability of the network. Strong scalability of the classification network is important for tackling large-scale data. However, PointNet [19], which is commonly used in reconstruction methods, considers all the points at once and is thus less scalable. Our network does not need the entire input data at once. This is mainly because the tangent convolution operations in our network are all performed in a fixed-size local spatial region, and the convolution operations are independent of each other. Moreover, these network operations are independent of the octree structure. Therefore, our network allows partitioning the points and octree vertices with bounding boxes and processing different parts separately, which makes it highly scalable.
2) The ability to reconstruct geometry details. Achieving a high classification accuracy for octree vertices requires the ability to capture local geometry details. Some methods, such as ONet [17] and DeepSDF [18], aggregate the features of the entire point cloud into a fixed-size global latent vector (512 elements), which can lose many geometry details and thus reduces reconstruction quality.
In our network, we focus on learning local geometry information of the implicit surface. Since the implicit surface is determined by the octree vertex labels, the most important task for accurate classification of octree vertices is to construct vertex features properly. The octree vertex features in our network are constructed from local neighbor points and take advantage of the local projection information in the tangent image. The most important information is the signed projection distances between octree vertices and their neighbor points. Signed distances are commonly used in geometric methods such as the TSDF method [5] and MLS methods [13, 9]. They directly provide local geometry information for accurate front-back classification. This enables our method to capture local geometry details of large-scale point clouds. Our network therefore benefits from geometry-aware features to reconstruct accurate surfaces.
Other methods, such as ONet [17] and DeepSDF [18], construct the features of vertices by directly concatenating the 3D coordinates of vertices with the global latent vector of the point cloud. This feature construction cannot provide explicit information about the relative position of the vertices with respect to the implicit surface. Therefore, these methods cannot capture local geometry details well enough to reconstruct high-quality surfaces.
3) The generalization ability of the network. Paying too much attention to global features rather than local information limits generalization across different types of shapes. The octree vertex features in our network are local and geometry-aware, and they do not rely excessively on global shape information. Therefore, the network generalizes well and does not need much training data, which helps avoid overfitting to a dataset.
Overall, our contributions can be summarized in the following two points. 1) We design a scalable pipeline for reconstructing surfaces from real-world point clouds. 2) We construct local geometry-aware octree vertex features, which lead to accurate classification of octree vertices and good generalization across different datasets. Experiments show that our method achieves a significant improvement in scalability and quality compared with state-of-the-art learning-based methods. It also achieves results comparable with state-of-the-art geometric methods. The proposed TSRNet is a practical learning-based method for real-world surface reconstruction tasks.
2 Related Work
In this section, we first review some important geometric reconstruction methods to introduce basic concepts. We then focus on existing learning-based methods for surface reconstruction from point clouds. We mainly analyze, in terms of network architectures and output representations, whether they can scale to large datasets and capture geometry details.
2.1 Geometric reconstruction methods
Geometric reconstruction methods can be broadly categorized into global methods and local methods. Global methods consider all the data at once, such as the radial basis function (RBF) methods [2, 28] and the (Screened) Poisson Surface Reconstruction (PSR) method [11, 12]. Local fitting methods usually define a truncated signed distance function (TSDF) [5] on a volumetric grid. The various moving least squares (MLS) methods fit a local distance field or vector field with spatially varying low-degree polynomials and blend several nearby points together [13, 9]. The local projection and local least squares fitting used by MLS are similar to the tangent convolution used in our network. Local methods scale well.
2.2 Learning-based reconstruction methods
Network architecture. One straightforward network architecture for point clouds converts the point clouds into regular 3D voxel grids or adaptive octree grids and applies 3D convolutions [16, 21]. Voxel-based networks are used in the 3D-EPN method [6] and the Deep Marching Cubes (DMC) method [14]. OctNetFusion [22] and 3D-CFCN [1] reconstruct implicit surfaces from multiple depth images based on OctNet [21]. Voxel-based networks face cubic growth of computational and memory requirements. The operations of octree-based networks are complicated and closely tied to the octree structure, and they also face computational issues. Therefore, networks operating on grids still have difficulty reconstructing large-scale datasets.
Another type of network architecture learns point features directly. The most commonly used point cloud network is PointNet [19], which encodes global shape features into a latent vector of fixed size. Some reconstruction methods also extract a latent vector from the point cloud, such as Occupancy Networks [17] and DeepSDF [18]. These networks are able to encode the features of a large-scale point cloud into a latent vector. However, the small size of the latent vector limits its representation power for complex point clouds. Our TSRNet also learns point and vertex features directly, but the features are learned from local neighbors. Compared with networks using a latent vector, our TSRNet is able to capture geometry details of large-scale point clouds with complex topology.
Output representation. The scalability of learning-based reconstruction methods is also greatly influenced by the output representation. Reconstruction methods based on voxel or octree grids usually use the occupancy or TSDF on grids as the output representation. This output representation, together with the network operations, is closely tied to the grid structure, so their scalability is limited. The Deep Marching Cubes method [14] and AtlasNet [8] directly produce a triangular surface. The predictions of different parts of the surface are interdependent, so these methods have to consider the whole input point cloud at once.
Occupancy Networks [17] and DeepSDF [18] learn shape-conditioned classifiers whose decision boundary is the surface. Their representations are similar to the front-back representation of our TSRNet. Although Occupancy Networks and DeepSDF support making predictions over 3D locations in parallel, they need the whole point cloud to obtain the global latent vector. Moreover, they can only reconstruct watertight surfaces. As a result, they do not allow partitioning the input points.
2.3 Tangent Convolution
The tangent convolution [26] is a convolutional network for semantic segmentation of 3D point clouds. It defines three network operations: tangent convolution, pooling and unpooling. The tangent convolution is defined as a 2D convolution on tangent images. The tangent images are constructed by local projection of neighbor points. The indices of the projected points used for the tangent images are precomputed, which makes the tangent convolution much more efficient. The signals used in tangent images, such as signed projection distances and normals, represent local surface geometry.
The neighbor points used for the tangent convolution are collected by a ball query that finds all points within a given radius of the query point. The radius is defined as the receptive field size of the tangent convolution. Pooling and unpooling are implemented via hashing onto a regular 3D grid. The radius search and grid sampling guarantee that the tangent convolution, pooling and unpooling are performed in a fixed-size spatial region. Another representative network that learns local features, PointNet++ [20], uses iterative farthest point sampling (FPS) to choose a subset of points. Its grouping operation is not guaranteed to stay within a fixed-size spatial region. Therefore, it does not allow dividing the input points with bounding boxes.
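The fixed-radius property can be illustrated with a minimal brute-force sketch. The function name `ball_query` and the toy data are assumptions made for illustration and are not taken from [26]; a practical implementation would use a spatial index rather than an exhaustive distance computation.

```python
import numpy as np

def ball_query(points, query, radius):
    """Return the indices of all points within `radius` of `query`.

    Brute-force illustration only; the result depends solely on points
    that lie inside the ball, which is what makes box partitioning possible.
    """
    squared_dist = np.sum((points - query) ** 2, axis=1)
    return np.nonzero(squared_dist <= radius ** 2)[0]

# Toy point cloud: three points near the origin and one far away.
points = np.array([[0.0, 0.0, 0.0],
                   [0.1, 0.0, 0.0],
                   [0.0, 0.5, 0.0],
                   [2.0, 2.0, 2.0]])
neighbors = ball_query(points, np.array([0.0, 0.0, 0.0]), radius=0.6)
```

Because the far point never influences the result, any crop of the point cloud that still contains the whole ball yields the same neighbor set.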
3 Method
In this section, we introduce the pipeline of our method in Section 3.1. Section 3.2 describes the construction of the geometry-aware vertex features. Section 3.3 describes our network architecture and some implementation details. The surface extraction and smoothing methods are briefly introduced in Section 3.4, and Section 3.5 explains how the training data is prepared.
3.1 Pipeline
We design a scalable learning-based surface reconstruction method, named TSRNet, which allows dividing the input point cloud and the octree vertices. Figure 3 shows the pipeline of our method. Let $P = \{p_i \mid i = 1, \dots, N\}$ be the input point cloud of $N$ points, where $p_i$ is a point coordinate with normal $n_i$. We construct an octree from this point cloud and extract the vertices $V = \{v_i \mid i = 1, \dots, M\}$ of the finest level of this octree, where each vertex $v_i$ is a coordinate. To prevent the input data from exceeding the GPU memory, we use bounding boxes to partition the point cloud and the vertices into $\{P_j\}$ and $\{V_j\}$, $j = 1, \dots, K$, where $K$ is the number of boxes. The bounding boxes for vertices do not overlap. The corresponding bounding boxes for points are larger than those for octree vertices, so that all the points in the largest receptive field of the vertices are included. This enables our network to make accurate predictions for vertices near the boundary of a bounding box. For part $j$, the vertices $V_j$ and the corresponding points $P_j$ are fed into TSRNet, which classifies each vertex in $V_j$ as in front of or behind the implicit surface represented by $P_j$. Let the function represented by the network be $f_\theta$, where $\theta$ represents the parameters of the network. The key of our reconstruction method is then to classify the vertices in $V_j$, which can be formulated as:
$$f_\theta(P_j, V_j, v) \in \{0, 1\}, \quad v \in V_j. \tag{1}$$
The front or back of an octree vertex is defined by the normal direction of its nearest surface. To classify a vertex as front or back, the vertex needs to be near the implicit surface. The finest-level vertices of an adaptive octree are near the surface, so front-back classification on them is feasible. It is worth noting that the network operations in TSRNet are not related to the structure of the octree.
After all the vertices are labeled, we extract the surface using the Marching Cubes (MC) method and post-process it with a simple Laplacian-based mesh smoothing method.
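The front-back definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the labeling code used in the paper: the surface is represented by oriented sample points, the nearest sample is found by brute force, and the convention that label 1 means front and label 0 means back is chosen here only for the example. The function name `label_front_back` is hypothetical.

```python
import numpy as np

def label_front_back(vertices, surf_points, surf_normals):
    """Label each vertex 1 (front) or 0 (back) w.r.t. its nearest surface sample.

    Assumed convention: a vertex is in front when the offset from its nearest
    surface sample has a positive component along that sample's normal.
    """
    # Pairwise squared distances between vertices and surface samples.
    diff = vertices[:, None, :] - surf_points[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    offset = vertices - surf_points[nearest]
    signed = np.einsum("ij,ij->i", offset, surf_normals[nearest])
    return (signed > 0).astype(np.int64)

# Toy surface: samples on the plane z = 0 with normals pointing along +z.
surf_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
surf_normals = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
vertices = np.array([[0.2, 0.3, 0.1],    # slightly above the plane
                     [0.8, 0.7, -0.1]])  # slightly below the plane
labels = label_front_back(vertices, surf_points, surf_normals)
```

The sketch also shows why vertices must be near the surface: far from it, the nearest sample and its normal say little about which side the vertex is on.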
3.2 Geometry-aware Vertex Feature
In order to get accurate classification for octree vertices and good generalization capability, we construct geometry-aware features for octree vertices from point clouds. As discussed in the Introduction, the most important features for octree vertex classification are the signed projection distances between octree vertices and their neighbor points. Since the tangent image in the tangent convolution [26] is constructed by local projection of neighbor points, the tangent convolution has the potential to construct geometry-aware features for octree vertices. However, applying the tangent convolution to octree vertices directly is problematic. For a 3D location, the normal of its tangent image is estimated through local covariance analysis [23]: it is defined by the eigenvector related to the smallest eigenvalue of the covariance matrix of the local neighborhood. Because the direction of this eigenvector is ambiguous, it may not be consistent with the real orientation of the local implicit surface. The signs of the projection distances hence do not represent front-back information accurately.
In order to solve this problem, we modify the definition of the tangent image's normal. Since the front-back classification is related to neighbor points, we use the input normals of the neighbor points as additional constraints. Consider the average input normal of the neighbor points of the location. In our definition, if the angle between the eigenvector and this average normal is more than 90°, we invert the direction of the eigenvector, and then use it as the tangent image's normal. That is, our definition ensures that the tangent image's normal never points against the average input normal of the neighbor points.
The features of octree vertices constructed by the modified tangent convolution directly encode front-back information. They are local and not related to the types of shapes, so our network is scalable and generalizes well across different datasets. It is worth noting that we use neighbor points in the point cloud, rather than neighbor vertices, to compute the tangent images. This is because our network classifies octree vertices with respect to the surface represented by the points of the point cloud, rather than a surface represented by neighbor vertices.
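The orientation rule of this section can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and centering the covariance at the mean of the neighbor points is an assumption, since the exact covariance definition follows [23] and [26] and is not reproduced here.

```python
import numpy as np

def oriented_tangent_normal(neighbors, neighbor_normals):
    """Smallest-eigenvalue eigenvector, flipped to agree with the mean input normal."""
    centered = neighbors - neighbors.mean(axis=0)   # assumption: center at the mean
    cov = centered.T @ centered
    _, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    e = eigvecs[:, 0]
    mean_normal = neighbor_normals.mean(axis=0)
    # A negative dot product means the angle to the mean normal exceeds 90 degrees.
    if np.dot(e, mean_normal) < 0:
        e = -e
    return e

def signed_projection_distances(location, neighbors, normal):
    """Signed distance of each neighbor to the tangent plane through `location`."""
    return (neighbors - location) @ normal

# Toy neighborhood on the plane z = 0 with input normals along +z.
neighbors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                      [1.0, 1.0, 0.0], [0.5, 0.5, 0.0]])
neighbor_normals = np.tile(np.array([0.0, 0.0, 1.0]), (5, 1))
vertex = np.array([0.5, 0.5, 0.2])                  # octree vertex in front of the plane

normal = oriented_tangent_normal(neighbors, neighbor_normals)
distances = signed_projection_distances(vertex, neighbors, normal)
```

With the oriented normal, all neighbors of a vertex in front of the surface get distances of the same (here negative) sign, and a vertex behind the surface would get the opposite sign, which is the front-back cue the features rely on.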
3.3 Network
Network architecture. The network architecture of our TSRNet is illustrated in Figure 4. It contains two parts. The left part performs point feature extraction. The right part performs vertex feature construction and label prediction. The two parts are connected by tangent convolution layers for vertices.
The left part of our network encodes the features of the input points from a point cloud. It is a fully convolutional U-shaped network with skip connections, built from our modified tangent convolution layers. Two pooling operations are used to increase the size of the receptive field.
The right part is the core of our method. Here, the geometry-aware features of octree vertices are first constructed by the modified tangent convolution from the point features at the corresponding scale levels. Convolutions and unpooling operations are then applied to encode more abstract features and to predict a label for each vertex. The index matrices used by the unpooling layers in this part are precomputed when downsampling vertices via grid hashing.
The input signals used by our method are all local signals. We use the signed distances from neighbor points to the tangent plane (D) and the normals relative to the normal of the tangent plane (N) as input. Since the network operations in TSRNet, i.e. the modified tangent convolution, the convolution, and the pooling and unpooling of points and vertices, are all performed in a fixed-size local region, our network allows dividing the points and vertices with bounding boxes and processing each part separately. It is worth noting that the bounding boxes used for points are larger than those used for octree vertices, in order to make accurate predictions for vertices near the boundary.
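The partitioning scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes axis-aligned cubic boxes of a hypothetical size `box_size`, and a hypothetical `margin` parameter standing in for the largest receptive field.

```python
import numpy as np

def partition(points, vertices, box_size, margin):
    """Split vertices into non-overlapping boxes and points into padded boxes.

    Returns a list of (vertex_indices, point_indices) pairs, one per
    non-empty vertex box. `margin` pads each box on every side.
    """
    cell = np.floor(vertices / box_size).astype(np.int64)
    parts = []
    for key in np.unique(cell, axis=0):
        lo = key * box_size
        hi = lo + box_size
        # Each vertex belongs to exactly one box.
        v_mask = np.all(cell == key, axis=1)
        # Points are taken from the box enlarged by the margin, so boxes overlap.
        p_mask = np.all((points >= lo - margin) & (points < hi + margin), axis=1)
        parts.append((np.nonzero(v_mask)[0], np.nonzero(p_mask)[0]))
    return parts

vertices = np.array([[0.1, 0.1, 0.1], [0.9, 0.1, 0.1], [1.1, 0.1, 0.1]])
points = np.array([[0.95, 0.0, 0.0], [1.15, 0.0, 0.0], [1.5, 0.0, 0.0]])
parts = partition(points, vertices, box_size=1.0, margin=0.2)
```

In this toy example the point at x = 1.15 lies outside the first vertex box but inside its padded point box, so the vertex at x = 0.9 still sees all of its nearby points.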
Implementation details. In our implementation, similar to [26], we downsample the input points and vertices using grid hashing for pooling and unpooling. Given an octree of depth , the length of the finest-level edges is . We set the initial receptive field size in our network as . The grid hashing sizes of the input points at the three scales are set as , , . We do not apply grid hashing to the vertices at the first scale because the vertices already have an initial grid size of . The grid hashing sizes of the vertices at the last two scales are set as and . In order to retain more surface details for the feature construction of vertices, we set smaller grid hashing sizes for points.
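Grid hashing for pooling and unpooling can be sketched as below. This is an assumption-laden illustration rather than the implementation of [26] or of this paper: points are averaged per grid cell, the cell size is a free parameter `cell_size`, and the function name is hypothetical.

```python
import numpy as np

def grid_hash_downsample(points, cell_size):
    """Average all points that fall into the same cell of a regular 3D grid.

    Returns the pooled points and, for every input point, the index of the
    pooled point it was merged into (usable later for unpooling).
    """
    keys = np.floor(points / cell_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)          # robust to NumPy version differences
    pooled = np.zeros((counts.shape[0], points.shape[1]))
    np.add.at(pooled, inverse, points)     # sum the points of each cell
    pooled /= counts[:, None]              # turn sums into means
    return pooled, inverse

points = np.array([[0.1, 0.1, 0.1], [0.3, 0.1, 0.1], [1.2, 0.1, 0.1]])
pooled, inverse = grid_hash_downsample(points, cell_size=1.0)
unpooled = pooled[inverse]                 # copy coarse values back to fine points
```

The index array returned here plays the role of the precomputed index matrices mentioned in the text: unpooling reduces to a lookup.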
3.4 Surface Extraction and Smoothing
We use the Marching Cubes (MC) method [15] to extract the surface from the labeled octree vertices. MC needs to find the intersections between octree edges and the implicit surface using the labels of vertices. Since we directly use the midpoints of edges as the intersections, the resulting mesh inevitably has discretization artifacts. In order to refine the output surfaces, we use a simple Laplacian-based mesh smoothing method [27] as a post-processing step.
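A simple form of Laplacian smoothing is sketched below for illustration. It uses the uniform (umbrella) Laplacian with a step size `lam`; this is an assumed variant and is not claimed to be the exact method of [27]. The function name and the toy mesh are hypothetical.

```python
import numpy as np

def laplacian_smooth(vertices, faces, lam=0.5, iterations=1):
    """Move every vertex a fraction `lam` toward the centroid of its neighbors."""
    n = len(vertices)
    neighbors = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    current = np.asarray(vertices, dtype=float).copy()
    for _ in range(iterations):
        updated = current.copy()
        for i, nbrs in enumerate(neighbors):
            if nbrs:
                centroid = current[sorted(nbrs)].mean(axis=0)
                updated[i] = current[i] + lam * (centroid - current[i])
        current = updated
    return current

# Toy mesh: a spike at vertex 0 surrounded by four vertices in the plane z = 0.
mesh_vertices = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                          [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
mesh_faces = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
smoothed = laplacian_smooth(mesh_vertices, mesh_faces, lam=0.5, iterations=1)
```

One iteration halves the height of the spike, which mirrors how smoothing reduces the staircase artifacts caused by placing intersections at edge midpoints.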
3.5 Data Preparation
To prepare training data, we first normalize the coordinates of the input point cloud and surfaces. Then an octree is built from the point cloud. The octree building method is adapted from the open-source code of Poisson Surface Reconstruction [11, 12]. It ensures that the finest level of the resulting octree contains not only the cubes with points in them, but also their neighbor cubes. Therefore, the octree is dense enough to completely reconstruct the surface. We then use the ground truth surface with normals to label these vertices. For datasets without ground truth surfaces, we generate ground truth surfaces through PSR [12]. A more detailed description of our data preparation method can be found in our supplementary file.
4 Experiments
In this section, we perform a series of experiments on datasets of different scales to evaluate our method. The datasets are all widely used in 3D-related tasks. We mainly examine the ability to capture geometry details, the scalability and the generalization capability of our method on both small-scale and large-scale datasets. Due to the lack of triangular meshes, the ground-truth meshes used for training are all produced by PSR [12].
4.1 Datasets and Evaluation Metrics
Datasets. In order to compare with several state-of-the-art learning-based methods, we run experiments on the same subset of ShapeNet [3] as ONet [17], in which the point clouds are relatively small, with tens of thousands of points each. Furthermore, in order to better evaluate the scalability and generalization capability on large-scale point clouds, we choose the scans of the DTU dataset [10] and the Stanford 3D Scanning Repository [25]. Most point clouds in these two datasets contain millions of points.
Evaluation Metrics. The direct metric of our network is the classification accuracy of octree vertices, which matters a lot to the quality of the final mesh. For mesh comparison on ShapeNet, we use the same evaluation metrics as ONet [17], including the volumetric IoU (higher is better), the Chamfer distance (lower is better) and the normal consistency score (higher is better). On the DTU dataset, we use the DTU evaluation method [10], which mainly evaluates DTU Accuracy and DTU Completeness (lower is better for both). For the Stanford dataset, the Chamfer distance (CD) (lower is better) is adopted to evaluate the points of the output meshes. The CD is also taken into consideration in the evaluation on the DTU dataset.
4.2 Classification Accuracy on Different Datasets
We train our network on the ShapeNet dataset and the DTU dataset respectively, and test on the three datasets mentioned above. Detailed training settings and train/test splits are described in Sections 4.3 and 4.4. The classification accuracy of octree vertices is shown in Table 1. Note that there are two testing results on the Stanford dataset in the table. They are obtained by the models trained on ShapeNet and DTU respectively.
The classification accuracy of octree vertices on the ShapeNet dataset shows the robustness of our network. Since the original models of ShapeNet are synthetic, the point clouds are clean and ideal. We apply Gaussian noise with zero mean and standard deviation 0.05 to the point clouds in ShapeNet, as ONet does [17]. Our network remains robust and achieves a high classification accuracy of 97.6% on the noisy point clouds.
For large-scale 3D scans, noise in the point clouds is inevitable, so we do not add any noise to them. On the DTU dataset, which only includes open scenes, our network achieves an average classification accuracy of 95.7%. In order to test the generalization ability of our method, we test on the Stanford dataset with the models trained on ShapeNet and DTU respectively. We still achieve high average classification accuracies of 98.1% and 98.2% with the two models. This indicates the strong generalization ability of our method, which makes it unnecessary to retrain the model on different datasets. The generalization capability is further illustrated in Section 4.5.
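Table 1: Classification accuracy of octree vertices. The two Stanford results are obtained with the models trained on ShapeNet and DTU respectively.
Dataset | ShapeNet [3] | DTU [10] | Stanford [25] (ShapeNet model) | Stanford [25] (DTU model)
Accuracy (%) | 97.6 | 95.7 | 98.1 | 98.2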
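Table 2: Quantitative results on ShapeNet (NC: normal consistency).
Method | IoU | Chamfer | NC
3D-R2N2 [4] | 0.565 | 0.169 | 0.719
PSGN [7] | – | 0.202 | –
DMC [14] | 0.647 | 0.117 | 0.848
ONet [17] | 0.778 | 0.079 | 0.895
Ours | 0.957 | 0.024 | 0.967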
4.3 Results on ShapeNet
In our first experiment, we evaluate the capacity of our network to capture shape details on ShapeNet. We use the same test split as ONet [17] for fair comparison. The testing set consists of about shapes. To train our network, we randomly select only a subset of the training set of ONet. The training set of our method ( shapes) is 1/10 of the training set of ONet ( shapes). We do not use as much training data as ONet for two reasons. First, it reduces the possibility of overfitting to a dataset because of too much training data. Second, our network learns local geometry-aware features, so it can capture more geometry details with a relatively small amount of training data. Therefore, we do not need many training samples.
In our experiments, each point cloud in the training and testing sets contains points. We add Gaussian noise with zero mean and standard deviation 0.05 to the point clouds, as ONet did. Our network handles point clouds with a large number of points efficiently, so our method does not downsample the point clouds. The original synthetic models of ShapeNet do not have consistent normals, so we reconstruct surfaces from the ground truth point clouds using PSR on octrees of depth 9 to generate the training data. We use octrees of depth 8 in our method for training and testing. We apply the post-processing step of our method to the resulting meshes of the testing set.
For mesh comparison, we use the evaluation metrics on ShapeNet mentioned in Section 4.1, including the volumetric IoU, the Chamfer distance and the normal consistency score. We evaluate all shapes in the testing set on these metrics. The results of existing well-known learning-based 3D reconstruction approaches, i.e. 3D-R2N2 [4], PSGN [7], Deep Marching Cubes (DMC) [14] and ONet [17], are taken from the ONet paper. As mentioned in ONet [17], it is not possible to evaluate the IoU for PSGN because it does not yield watertight meshes. Quantitative results are shown in Table 2.
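For readers unfamiliar with these metrics, the sketch below shows one common way to compute a symmetric Chamfer distance between two sampled point sets and a volumetric IoU between two occupancy grids. The exact definitions used in the evaluation follow [17] and may differ in details such as norms, sampling and scaling; the function names and toy data are assumptions for illustration.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = dist.min(axis=1).mean()   # how close a is to b on average
    b_to_a = dist.min(axis=0).mean()   # how close b is to a on average
    return 0.5 * (a_to_b + b_to_a)

def volumetric_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy arrays of equal shape."""
    intersection = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return intersection / union

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
truth = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.2]])
cd = chamfer_distance(pred, truth)

occ_pred = np.array([True, True, False, False])
occ_true = np.array([True, False, True, False])
iou = volumetric_iou(occ_pred, occ_true)
```

The brute-force distance matrix is only suitable for small point sets; evaluations on millions of points would need a spatial index.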
Table 2 shows that our method achieves large improvements on these metrics. We achieve the highest IoU, the lowest Chamfer distance and the highest normal consistency among the learning-based baselines. More specifically, the IoU of our results is about 0.18 (18 percentage points) higher than that of ONet. It is worth noting that our network has only 0.49 million parameters, while ONet has about 13.4 million parameters.
As shown in Figure 5, the surface quality of our results is generally better than that of ONet. Our network is good at capturing local geometry details of shapes of different classes, even with quite different and complex topology. ONet can reconstruct smooth surfaces for shapes with simple topology. However, since the global latent vector encoder in ONet loses shape details, it tends to generate over-smoothed shapes for complex topology. For more visual results, please see our supplementary materials.
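Table 3: Quantitative results on the testing scenes of the DTU dataset (DA: DTU Accuracy, DC: DTU Completeness, CD: Chamfer distance; lower is better).
Method | DA Mean | DA Var. | DC Mean | DC Var. | CD Mean | CD RMS
PSR (trim 8) [12] | 0.473 | 1.33 | 0.327 | 0.220 | 3.16 | 12.5
PSR (trim 9.5) [12] | 0.330 | 0.441 | 0.345 | 0.438 | 1.17 | 4.49
Ours | 0.321 | 0.285 | 0.304 | 0.0888 | 1.46 | 4.42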
4.4 Results on 3D Scans of Larger Scales
Evaluation on DTU dataset. We train and test our network on DTU [10] at octree depth 9. We use scenes {1, 2, 3, 4, 5, 6} as the training set, scenes {7, 8, 9} as the validation set and scenes {10, 11, 12, 13, 14, 15, 16, 18, 19, 21, 24, 29, 30, 34, 35, 38, 41, 42, 45, 46, 48, 51, 55, 59, 60, 61, 62, 63, 65, 69, 84, 93, 94, 95, 97, 106, 110, 114, 122, 126, 127, 128} as the testing set. Since the ground truth surfaces of DTU are not available, we reconstruct surfaces using PSR on octrees of depth 10 to generate the training data. We trim the PSR surfaces using the SurfaceTrimmer software provided with PSR, with trimming value 8. We randomly extract batches of points from each training scene to train our network. Even though we use only 6 scenes in the training set, we achieve high accuracy and good generalization capability. Table 3 gives the quantitative results on the testing scenes of the DTU dataset. Qualitative results are shown in Figure 6 (a).
The point clouds in DTU are all open scenes. Although ONet can reconstruct watertight surfaces, it cannot reconstruct surfaces from point clouds of open scenes. We therefore use the results of PSR as an evaluation reference. The PSR surfaces used for evaluation are also reconstructed at octree depth 9. The results of PSR are always closed surfaces, even when reconstructed from open scenes. Therefore, we trim them with trimming values 8 and 9.5.
As shown in Table 1, our network achieves a high average classification accuracy of 95.7%. Compared with PSR with trimming value 8, our surfaces are better on DTU Accuracy, DTU Completeness and CD. Compared with PSR with trimming value 9.5, we get similar performance on DTU Accuracy and perform better on DTU Completeness, because a larger trimming value on PSR surfaces decreases the completeness. PSR with trimming value 9.5 has a lower mean CD, but at the cost of completeness. As shown in Figure 6 (a), our method can reconstruct more complete boundaries for open scenes. In conclusion, the quality of the surfaces of our method is comparable with PSR on the DTU dataset, and we perform better with respect to completeness on open scenes. Detailed figures and videos are also provided in our supplementary materials.
4.5 Generalization Capability
In this section, we evaluate the generalization capability of our method. It is worth noting that the three datasets (ShapeNet, DTU, Stanford 3D) are quite different in terms of data scale and data category. The data in ShapeNet are all synthetic. DTU only contains real-world scan data of open scenes, while Stanford 3D only has closed scenes. Therefore, tests across these three datasets can evaluate the generalization performance of the model well.
We train our network on the same dataset (ShapeNet) as ONet for fair comparison. We do not use as much training data ( shapes) as ONet did ( shapes), since our network does not rely excessively on the training data. Since ONet cannot reconstruct surfaces from point clouds of open scenes, we test the two trained models on Stanford 3D. Table 4 shows that our method achieves high classification accuracy. It generalizes better than ONet on Stanford 3D in terms of Chamfer distance. Figure 6 (b) shows that our results are visually convincing and reconstruct the geometry details of the shapes well.
We also train our network on DTU and test on Stanford 3D. We train our network on only 6 scenes of the DTU dataset and still obtain good generalization capability. Table 4 shows that our network trained on open scenes can also achieve high classification accuracy on closed scenes. Besides, the classification accuracies of our models trained on ShapeNet and DTU are almost the same, and the two models obtain roughly similar results on CD. In general, the results indicate good generalization capability of our network. One strong benefit is that we do not need to retrain our network on different datasets to complete the reconstruction work.
4.6 Efficiency
We test our network on 4 GeForce GTX 1080 Ti GPUs in an Intel Xeon(R) CPU system with 32 × 2.10 GHz cores. The data preprocessing is implemented in CUDA. In our experiments, we set the maximum number of points and vertices in one batch to 300,000. As illustrated in Table 5, our method can process millions of points and vertices in reasonable time^2. For a learning-based method, the time consumption is generally acceptable. Since our network can perform predictions in parallel on multiple GPUs, the prediction procedure can be accelerated by adding GPUs. As shown in the table, prediction with 4 GPUs is on average about 2.6 times faster than with 1 GPU. This shows the potential of our method to scale efficiently to large datasets. With more GPUs available, it can be accelerated further.
^2 DTU [10] contains a series of point clouds; here we list the time consumption of three randomly chosen ones: stl_030, stl_034 and stl_062. The time for reading and writing files is not included.
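As a rough illustration of the batching described above, the following Python sketch groups pieces of the input so that no batch exceeds the 300,000 cap and then spreads the batches over GPUs. The notion of a "block", the greedy grouping rule, the round-robin assignment and all function names are assumptions made for illustration; the text does not specify how the data is partitioned or scheduled.

```python
def make_batches(block_sizes, cap=300_000):
    """Greedily group consecutive blocks so that each batch holds at most `cap` elements.

    block_sizes: number of points plus vertices in each (hypothetical) block.
    Returns a list of batches, each a list of block indices.
    A single block larger than `cap` ends up alone in its own batch.
    """
    batches, current, total = [], [], 0
    for idx, size in enumerate(block_sizes):
        if current and total + size > cap:
            batches.append(current)
            current, total = [], 0
        current.append(idx)
        total += size
    if current:
        batches.append(current)
    return batches


def assign_round_robin(batches, num_gpus):
    """Distribute batches over GPUs in round-robin order (an assumed scheduling policy)."""
    return [batches[g::num_gpus] for g in range(num_gpus)]


# Hypothetical block sizes, for illustration only.
batches = make_batches([200_000, 150_000, 100_000, 300_000, 10])
per_gpu = assign_round_robin(batches, num_gpus=2)
```

Because the batches are independent of one another in this sketch, each GPU can process its share without communicating with the others, which is what makes the near-linear scaling discussed above possible in principle.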
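Table 4: Classification accuracy and Chamfer Distance (CD) on Stanford 3D. S and D denote our models trained on ShapeNet and DTU, respectively.

| Data | Accuracy (%), Ours (S) | Accuracy (%), Ours (D) | CD Mean, ONet [17] | CD Mean, Ours (S) | CD Mean, Ours (D) | CD RMS, ONet [17] | CD RMS, Ours (S) | CD RMS, Ours (D) |
| Armadillo | 98.2 | 97.8 | 93.46 | 0.028 | 0.023 | 168.59 | 0.131 | 0.056 |
| Bunny | 98.2 | 98.7 | 94.88 | 0.064 | 0.057 | 165.44 | 0.134 | 0.087 |
| Dragon | 98.0 | 98.0 | 40.69 | 0.053 | 0.047 | 74.99 | 0.208 | 0.150 |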
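Table 5: Data sizes (in millions, M) and time consumption (in seconds, s).

| Number | Armadillo | Bunny | Dragon | stl_030 | stl_034 | stl_062 |
| Point /M | 2.16 | 0.361 | 1.71 | 2.43 | 2.01 | 2.19 |
| Vertex /M | 3.29 | 3.62 | 3.07 | 1.07 | 0.766 | 0.922 |
| Triangle /M | 1.18 | 1.52 | 1.16 | 0.42 | 0.31 | 0.36 |
| Batch | 109 | 73 | 77 | 59 | 88 | 86 |

| Time /s | Armadillo | Bunny | Dragon | stl_030 | stl_034 | stl_062 |
| Prep | 19.4 | 8.86 | 16.6 | 10.9 | 9.34 | 9.99 |
| Pred (1 GPU) | 133 | 82.2 | 110.5 | 63.5 | 84.5 | 87.9 |
| Pred (4 GPUs) | 50.5 | 27.3 | 38.2 | 30.4 | 31.1 | 34.3 |
| Total (1 GPU) | 153 | 92.0 | 128 | 75.4 | 94.5 | 98.6 |
| Total (4 GPUs) | 70.6 | 37.1 | 55.7 | 42.3 | 41.1 | 45.0 |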
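The reported speedup of about 2.6 times can be checked against the prediction times listed in the table. The short Python sketch below copies the "Pred (1 GPU)" and "Pred (4 GPUs)" rows and computes the per-dataset speedup, its average, and the corresponding parallel efficiency. The variable names are chosen for this example, and the efficiency figure is a derived quantity that is not reported in the text.

```python
# Prediction times in seconds, copied from the "Pred (1 GPU)" and "Pred (4 GPUs)" rows.
pred_1gpu = {"Armadillo": 133, "Bunny": 82.2, "Dragon": 110.5,
             "stl_030": 63.5, "stl_034": 84.5, "stl_062": 87.9}
pred_4gpu = {"Armadillo": 50.5, "Bunny": 27.3, "Dragon": 38.2,
             "stl_030": 30.4, "stl_034": 31.1, "stl_062": 34.3}

# Speedup of 4 GPUs over 1 GPU for each dataset.
speedup = {name: pred_1gpu[name] / pred_4gpu[name] for name in pred_1gpu}

# Average speedup across the six datasets.
mean_speedup = sum(speedup.values()) / len(speedup)

# Parallel efficiency: fraction of the ideal 4x speedup that is achieved.
efficiency = mean_speedup / 4

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
print(f"mean speedup: {mean_speedup:.2f}x, efficiency: {efficiency:.0%}")
```

The per-dataset values range from roughly 2.1 to 3.0, so the 2.6 figure is best read as an average over the listed datasets.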
5 Discussion and Conclusion
Our TSRNet has two main advantages. The first is its strong scalability, which allows the input data to be divided and its parts to be processed in parallel. It has successfully reconstructed quality surfaces from point clouds with millions of points in reasonable time.
The other important advantage of our method is the construction of local geometry-aware features. These features depend only on local neighbor information and do not rely on any global shape information. Giving too much attention to global shape features, rather than considering only local geometry information, would limit the generalization capability across different shape types. Thus our method generalizes well and does not need much training data, which avoids overfitting.
In conclusion, we have designed a scalable network for quality surface reconstruction from point clouds and achieved a significant improvement over existing state-of-the-art learning-based methods. We believe that it can inspire more research in related directions, such as more efficient network architectures and more accurate large-scale 3D surface datasets.
References

[1] (2018) Learning to reconstruct high-quality 3d shapes with cascaded fully convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 616–633. Cited by: §1, §2.2.
[2] (2001) Reconstruction and representation of 3d objects with radial basis functions. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 67–76. Cited by: §2.1.
[3] (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: Figure 2, Figure 5, Figure 6, §4.1, Table 1, Table 2, Table 4.
[4] (2016) 3D-R2N2: a unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pp. 628–644. Cited by: §4.3, Table 2.
[5] (1996) A volumetric method for building complex models from range images. Cited by: §1, §1, §2.1.
[6] (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5868–5877. Cited by: §1, §2.2.
[7] (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §4.3, Table 2.
[8] (2018) A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 216–224. Cited by: §1, §2.2.
[9] (2007) Algebraic point set surfaces. In ACM Transactions on Graphics (TOG), Vol. 26, pp. 23. Cited by: §1, §2.1.
[10] (2014) Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 406–413. Cited by: Figure 6, §4.1, §4.1, §4.4, Table 1, Table 3, Table 4, footnote 2.
[11] (2006) Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, Vol. 7. Cited by: §2.1, §3.5.
[12] (2013) Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32 (3), pp. 29. Cited by: §1, §2.1, §3.5, Figure 6, Table 3, §4.
[13] (2004) Mesh-independent surface interpolation. In Geometric modeling for scientific visualization, pp. 37–49. Cited by: §1, §2.1.
[14] (2018) Deep marching cubes: learning explicit surface representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2916–2925. Cited by: §1, §1, §2.2, §2.2, §4.3, Table 2.
[15] (1987) Marching cubes: a high resolution 3d surface construction algorithm. In ACM siggraph computer graphics, Vol. 21, pp. 163–169. Cited by: §1, §3.4.
[16] (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.2.
[17] (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: Figure 2, §1, §1, §1, §1, §1, §1, §1, §2.2, §2.2, Figure 5, Figure 6, §4.1, §4.1, §4.2, §4.3, §4.3, Table 2, Table 4.
[18] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §1, §1, §1, §1, §2.2, §2.2.
[19] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660. Cited by: §1, §1, §2.2.
[20] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2.3.
[21] (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3577–3586. Cited by: §2.2.
[22] (2017) Octnetfusion: learning depth fusion from data. In 2017 International Conference on 3D Vision (3DV), pp. 57–66. Cited by: §1, §2.2.
[23] (2014) SHOT: unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding 125, pp. 251–264. Cited by: §3.2.
[24] (2016) Pixelwise view selection for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–518. Cited by: §1.
[25] (2013) The stanford 3d scanning repository. Note: http://graphics.stanford.edu/data/3Dscanrep Cited by: Figure 2, Figure 6, §4.1, Table 1, Table 4.
[26] (2018) Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3887–3896. Cited by: §2.3, §3.2, §3.3.
[27] (1995) A signal processing approach to fair surface design. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pp. 351–358. Cited by: §3.4.
[28] (2002) Modelling with implicit surfaces that interpolate. ACM Transactions on Graphics (TOG) 21 (4), pp. 855–873. Cited by: §2.1.
[29] (2019) Multi-scale geometric consistency guided multi-view stereo. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.