In recent years, 3D shape representation learning has attracted much attention [qi2017pointnet, qi2017pointnet++, maturana2015voxnet, groueix2018papier, park2019deepsdf]. Unlike images, which are indexed by regular 2D grids, 3D shapes have no single standard representation in the literature. Existing 3D shape representations can be grouped into several categories: point-based [qi2017pointnet, qi2017pointnet++, wang2019dynamic, duan2019structural, achlioptas2018learning], voxel-based [wu20153d, maturana2015voxnet, qi2016volumetric, choy20163d], mesh-based [groueix2018papier, guo20153d, wang2018pixel2mesh, sinha2016deep], and multi-view [su2015multi, qi2016volumetric, tulsiani2017multi].
More recently, implicit function representations have gained increasing interest due to their high fidelity and efficiency. An implicit function depicts a shape by assigning a gauge value to each point in the object space [park2019deepsdf, mescheder2019occupancy, chen2019learning, michalkiewicz2019deep]. Typically, a negative, positive or zero gauge value indicates that the corresponding point lies inside, outside or on the surface of the 3D shape, respectively. Hence, the shape is implicitly encoded by the iso-surface (e.g., zero-level-set) of the function, which can then be rendered by Marching Cubes [lorensen1987marching] or similar methods. An implicit function can also be viewed as a shape-conditioned binary classifier whose decision boundary is the surface of the 3D shape. As each shape is represented by a continuous field, it can be evaluated at arbitrary resolution, irrespective of the resolution of the training data and limitations in the memory footprint.
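As a concrete illustration of the paragraph above, the following sketch uses an analytic sphere as a stand-in for a learned implicit function and evaluates the same continuous field on two grids of different resolution. The names `sphere_sdf` and `occupancy_grid`, the radius and the grid extent are assumptions made for this example only; they are not part of any method discussed here.

```python
import numpy as np

def sphere_sdf(points, radius=0.5):
    """Signed distance to a sphere centred at the origin (negative inside)."""
    return np.linalg.norm(points, axis=-1) - radius

def occupancy_grid(sdf, resolution):
    """Evaluate the field on a resolution^3 grid over [-1, 1]^3.

    Returns a boolean array that is True where the point lies inside the shape,
    i.e. the binary-classifier view of the implicit function.
    """
    axis = np.linspace(-1.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return sdf(grid) < 0.0

# The same field can be queried at any resolution without retraining or resampling.
coarse = occupancy_grid(sphere_sdf, 16)
fine = occupancy_grid(sphere_sdf, 64)
```

A surface extraction method such as Marching Cubes would then be run on the sampled field values rather than on the boolean mask.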
One of the main challenges in implicit function learning lies in accurate reconstruction of shape surfaces, especially around complex or fine structure. Fig. 1 shows some 3D shape reconstruction results where we can observe that DeepSDF [park2019deepsdf] fails to precisely reconstruct complex local details. Note that the implicit function is less smooth in these areas and hence difficult for the network to parameterize precisely. Furthermore, as the magnitudes of SDF values inside small parts are usually close to zero, a tiny mistake may lead to a wrong sign, resulting in inaccurate surface reconstruction.
Inspired by the works on curriculum learning [elman1993learning, bengio2009curriculum], we aim to address this problem in learning SDF by starting small: starting from easier geometry and gradually increasing the difficulty of learning. In this paper, we propose a Curriculum DeepSDF method for shape representation learning. We design a shape curriculum where we first teach the network using coarse shapes, and gradually move on to more complex geometry and fine structure once the network becomes more experienced. In particular, our shape curriculum is designed according to two criteria: surface accuracy and sample difficulty. We consider these two criteria both important and complementary to each other for shape representation learning: surface accuracy cares about the stringency in supervising with training loss, while sample difficulty focuses on the weights of hard training samples containing complex geometry.
Surface accuracy. We design a tolerance parameter that allows small errors in estimating the surfaces. Starting with a relatively large tolerance, the network aims for a smooth approximation, focusing on the global structure of the target shape and ignoring hard local details. Then, we gradually decrease the tolerance to expose more shape details until it reaches zero. We also use a shallow network to reconstruct coarse shapes at the beginning and then progressively add more layers to learn more accurate details.
Sample difficulty. Signs greatly matter in implicit function learning. The points with incorrect sign estimations lead to significant errors in shape reconstruction, suggesting that we treat these as hard samples during training. We gradually increase the weights of hard and semi-hard training samples to make the network more and more focused on difficult local details. (Here, semi-hard samples are those with correct sign estimations that lie close to the boundary. In practice, we also decrease the weights of easy samples to avoid overshooting.)
One advantage of curriculum shape representation learning is that it provides a training path for the network to start from coarse shapes and finally reach fine-grained geometries. At the beginning, it is substantially more stable for the network to reconstruct coarse surfaces with the complex details omitted. Then, we progressively ask for more accurate shapes, which are relatively simple tasks that benefit from the previous reconstruction results. Lastly, we focus on hard samples to obtain complete reconstruction with precise shape details. This training process can help avoid poor local minima as compared with learning to reconstruct the precise complex shapes directly. Fig. 1 shows that Curriculum DeepSDF obtains better reconstruction accuracy than DeepSDF. Experimental results illustrate the effectiveness of the designed shape curriculum. Code will be available at https://github.com/haidongz-usc/Curriculum-DeepSDF.git.
In summary, the key contributions of this work are:
We design a shape curriculum for shape representation learning, starting from coarse shapes and moving to complex details. The curriculum covers two aspects: surface accuracy and sample difficulty.
For surface accuracy, we introduce a tolerance parameter in the training objective to control the smoothness of the learned surfaces. We also progressively grow the network according to different training stages.
For sample difficulty, we define hard, semi-hard and easy training samples for SDF learning based on sign estimations. We re-weight the samples to make the network gradually focus more on hard local details.
2 Related Work
2.0.1 Implicit function.
Different from point-based, voxel-based, mesh-based and multi-view methods, which explicitly represent shape surfaces, implicit functions aim to learn a continuous field and represent the shape with the iso-surface. Conventional implicit function based methods include [carr2001reconstruction, shen2004interpolating, turk1999shape, turk2002modelling, ohtake2005multi]. For example, Carr et al. [carr2001reconstruction] used polyharmonic Radial Basis Functions (RBFs) to implicitly model the surfaces from point clouds. Shen et al. [shen2004interpolating] created implicit surfaces by moving least squares. In recent years, several deep learning based methods have been proposed to capture more complex topologies [park2019deepsdf, mescheder2019occupancy, chen2019learning, genova2019learning, saito2019pifu, liu2019learning, michalkiewicz2019deep, liao2018deep, xu2019disn, genova2019deep, gropp2020implicit]. For example, Park et al. [park2019deepsdf] proposed DeepSDF by learning an implicit field where the magnitude represents the distance to the surface and the sign shows whether the point lies inside or outside of the shape. Mescheder et al. [mescheder2019occupancy] presented Occupancy Networks by approximating the 3D continuous occupancy function of the shape, which indicates the occupancy probability of each point. Chen and Zhang [chen2019learning] proposed IM-NET by only encoding the signs of SDF, which can be used for representation learning (IM-AE) and shape generation (IM-GAN). Saito et al. [saito2019pifu] and Liu et al. [liu2019learning] learned implicit surfaces of 3D shapes from 2D images. These methods show promising results in 3D shape representation. However, accurately reconstructing local details remains a challenge. Instead of proposing new implicit functions, our approach studies how to design a curriculum of shapes for more effective model training.
2.0.2 Curriculum learning.
The idea of curriculum learning can be traced back at least to [elman1993learning]. Inspired by the learning system of humans, Elman [elman1993learning] demonstrated the importance of starting small in neural network training. Sanger [sanger1994neural] extended the idea to robotics by gradually increasing the difficulty of the task. Bengio et al. [bengio2009curriculum] further formalized this training strategy and explored curriculum learning in various cases including vision and language tasks. They introduced one formulation of curriculum learning by using a family of functions $C_\lambda(\theta)$, where $C_0$ is the highly smoothed version and $C_1$ is the real objective. One could start with $\lambda = 0$ and gradually increase $\lambda$ to 1, keeping $\theta$ at a local minimum of $C_\lambda(\theta)$. They also explained the advantage of curriculum learning as a continuation method [allgower2003introduction], which could benefit the optimization of a non-convex training criterion to find better local minima. Graves et al. [graves2017automated] designed an automatic curriculum learning method by automatically selecting the training path to address the sensitivity of progression mode. Recently, curriculum learning has been successfully applied to various tasks [schroff2015facenet, duan2019deep, ilg2017flownet, karras2017progressive, sharma2018improved, jiang2018mentornet, weinshall2018curriculum, hacohen2019power]. For example, FaceNet [schroff2015facenet] proposed an online negative sample mining strategy for face recognition, which was improved by DE-DSP [duan2019deep] to learn a discriminative sampling policy. Progressive growing of GANs [karras2017progressive, sharma2018improved] learned to sequentially generate images from low resolution to high resolution, and also grew both generator and discriminator symmetrically. MentorNet [jiang2018mentornet] provided a training curriculum for the StudentNet by reweighting the samples to reduce the influence of corrupted labels. Although curriculum learning has improved the performance of many tasks, the problem of how to design a curriculum for 3D shape representation learning still remains. Unlike 2D images, where the pixels are regularly arranged, 3D shapes usually have irregular structures, which makes effective curriculum design more challenging.
3 Proposed Approach
Our shape curriculum is designed based on DeepSDF [park2019deepsdf], which is a popular implicit function based 3D shape representation learning method. In this section, we first review DeepSDF and then describe the proposed Curriculum DeepSDF approach. Finally, we introduce the implementation details.
3.1 Review of DeepSDF [park2019deepsdf]
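DeepSDF is trained on a set of $N$ shapes $\{\mathcal{X}_i\}$, where $K$ points $\{x_j\}$ are sampled around each shape with the corresponding SDF values $\{s_j\}$ precomputed. This results in $K$ (point, SDF value) pairs:
$$\mathcal{X}_i := \{(x_j, s_j) : s_j = \mathrm{SDF}^i(x_j)\}. \tag{1}$$
A deep neural network $f_\theta(z_i, x)$ is trained to approximate SDF values of points $x$, with an input latent code $z_i$ representing the target shape.
The loss function given $z_i$, $x_j$ and $s_j$ is defined by the $L_1$-norm between the estimated and ground truth SDF values:
$$\mathcal{L}(f_\theta(z_i, x_j), s_j) = \left| \mathrm{clamp}_\delta(f_\theta(z_i, x_j)) - \mathrm{clamp}_\delta(s_j) \right|, \tag{2}$$
where $\mathrm{clamp}_\delta(s) := \min(\delta, \max(-\delta, s))$ uses a parameter $\delta$ to clamp an input value $s$. For simplicity, we use $\bar{s}$ to represent a clamping function with $\delta$ in the rest of the paper.
DeepSDF also designs an auto-decoder structure to directly pair a latent code with a target shape without an encoder. Please refer to [park2019deepsdf] for more details. At training time, $z_i$ is randomly initialized from $\mathcal{N}(0, 0.01^2)$ and optimized along with the parameters $\theta$ of the network through back-propagation:
$$\arg\min_{\theta, \{z_i\}} \sum_{i=1}^{N} \Big( \sum_{j=1}^{K} \mathcal{L}(f_\theta(z_i, x_j), s_j) + \frac{1}{\sigma^2} \|z_i\|_2^2 \Big), \tag{3}$$
where $\sigma$ is the regularization parameter.
At inference time, an optimal $z$ can be estimated with the network fixed:
$$\hat{z} = \arg\min_{z} \sum_{j=1}^{K} \mathcal{L}(f_\theta(z, x_j), s_j) + \frac{1}{\sigma^2} \|z\|_2^2. \tag{4}$$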
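The clamped loss and the per-shape objective reviewed in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the DeepSDF implementation: `decoder` is a hypothetical callable standing in for the trained network, and the default values of `delta` and `sigma` are chosen only for the example.

```python
import numpy as np

def clamp(values, delta=0.1):
    """Clamp SDF values to [-delta, delta]."""
    return np.clip(values, -delta, delta)

def clamped_l1(pred, target, delta=0.1):
    """Per-point L1 distance between clamped predictions and clamped targets."""
    return np.abs(clamp(pred, delta) - clamp(target, delta))

def shape_objective(decoder, latent, points, sdf_gt, sigma=100.0, delta=0.1):
    """Data term summed over the sampled points plus the latent-code regularizer.

    `decoder(latent, points)` is an assumed interface returning one SDF value per point.
    """
    pred = decoder(latent, points)
    data_term = clamped_l1(pred, sdf_gt, delta).sum()
    return data_term + np.sum(latent ** 2) / sigma ** 2
```

At inference time the same objective would be minimized with respect to `latent` only, keeping the decoder fixed.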
3.2 Curriculum DeepSDF
Different from DeepSDF, which trains the network with a fixed objective throughout, Curriculum DeepSDF starts from learning smooth shape approximations and then gradually strives for more local details. We carefully design the curriculum from the following two aspects: surface accuracy and sample difficulty.
3.2.1 Surface accuracy.
A smoothed approximation for a target shape could capture the global shape structure without focusing too much on local details, and thus is a good starting point for the network to learn. With a changing smoothness level at different training stages, more and more local details can be exposed to improve the network. Such smoothed approximations could be generated by traditional geometry processing algorithms. However, the generation process is time-consuming, and it is also not clear whether such fixed algorithmic routines could meet the needs of network training. In this paper, we address the problem from another view by introducing surface error tolerance which represents the upper bound of the allowed errors in the predicted SDF values. We observe that starting with relatively high surface error tolerance, the network tends to omit complex details and aims for a smooth shape approximation. Then, we gradually reduce the tolerance to expose more details.
More specifically, we allow small mistakes in the SDF estimation, up to the tolerance, for Curriculum DeepSDF. In other words, all estimated SDF values whose errors fall within the tolerance are considered correct without any punishment, and we can control the difficulty of the task by changing the tolerance. Fig. 2 illustrates the physical meaning of the tolerance parameter. Compared with DeepSDF, which aims to reconstruct the exact surface of the shape, Curriculum DeepSDF provides a tolerance zone around the surface whose thickness is set by the tolerance, and the objective becomes to reconstruct any surface in the zone. At the beginning of network training, we set a relatively large tolerance, which allows the network to learn general and smooth surfaces in a wide tolerance zone. Then, we gradually decrease the tolerance to expose more details and finally set it to zero to predict the exact surface.
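One way to realize the tolerance zone described above is to subtract the tolerance from the clamped absolute error and truncate at zero, so that errors inside the zone incur no loss. The sketch below follows that reading; the function name `tolerance_loss` and the parameter names `eps` and `delta` are assumptions introduced for illustration.

```python
import numpy as np

def tolerance_loss(pred, target, eps, delta=0.1):
    """Clamped L1 error with a dead zone of half-width eps around the target."""
    err = np.abs(np.clip(pred, -delta, delta) - np.clip(target, -delta, delta))
    return np.maximum(err - eps, 0.0)

target = np.full(3, 0.02)
pred = np.array([0.00, 0.03, 0.06])

# Large tolerance: the first two predictions fall inside the zone and are not penalized.
early = tolerance_loss(pred, target, eps=0.025)
# Zero tolerance: the loss reduces to the plain clamped L1 error.
late = tolerance_loss(pred, target, eps=0.0)
```

Decreasing `eps` over training shrinks the zone, so predictions that were previously accepted start to contribute to the loss.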
Unlike most recent curriculum learning methods that rank training samples by difficulty [weinshall2018curriculum, hacohen2019power], our designed curriculum on surface accuracy directly modifies the training loss. It follows the formulation in [bengio2009curriculum] and also has a clear physical meaning for the task of SDF estimation. We summarize the two advantages of the tolerance parameter based shape curriculum as follows:
We only need to change a single hyperparameter, the tolerance, to control the surface accuracy, instead of manually creating a series of smooth shapes. The network automatically finds the surface that is easy to learn in the tolerance zone.
For any non-negative tolerance, the ground truth surface of the original shape is always an optimal solution of the objective, which gives good optimization consistency.
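The second advantage can be checked in one line. Assuming the tolerance-based loss takes the truncated form $\max\{|f - s| - \epsilon, 0\}$ (the symbols $f$ for the predicted value, $s$ for the ground truth value and $\epsilon$ for the tolerance are notational assumptions, and clamping is omitted for brevity), the loss is non-negative and vanishes at the ground truth for every admissible tolerance:

```latex
\mathcal{L}_{\epsilon}(f, s) = \max\{\, |f - s| - \epsilon,\; 0 \,\} \;\ge\; 0,
\qquad
\mathcal{L}_{\epsilon}(s, s) = \max\{\, -\epsilon,\; 0 \,\} = 0
\quad \text{for all } \epsilon \ge 0 .
```

Hence predicting the exact SDF attains the minimum at every stage of the curriculum, and only the set of additional minimizers shrinks as the tolerance decreases.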
In addition to controlling the surface accuracy by the tolerance parameter, we also use a shallow network to learn coarse shapes with a large tolerance, and gradually add more layers to improve the surface accuracy when the tolerance decreases. This idea is mainly inspired by [karras2017progressive]. Fig. 3 shows the network architecture of the proposed Curriculum DeepSDF, where we employ the same network as DeepSDF for fair comparisons. After adding a new layer with random initialization to the network, the well-trained lower layers may suffer from sudden shocks if we directly train the new network in an end-to-end manner. Inspired by [karras2017progressive], we treat the new layer as a residual block whose output is scaled by a blending weight, while the original link is scaled by the complementary weight. We linearly increase the blending weight from 0 to 1, so that the new layer is faded into the original network smoothly.
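The fade-in of a newly added layer can be sketched as a convex combination of the skip path and the new branch. This is an illustrative sketch only: the names `faded_layer` and `alpha_schedule`, the ReLU activation and the assumption that the new layer preserves the hidden width are choices made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def faded_layer(h, weight, bias, alpha):
    """Blend the original link (weight 1 - alpha) with the new layer (weight alpha).

    Assumes the new layer keeps the hidden width, so both paths have the same shape.
    """
    new_branch = relu(h @ weight + bias)
    return (1.0 - alpha) * h + alpha * new_branch

def alpha_schedule(step, fade_steps):
    """Linearly increase the blending weight from 0 to 1 over fade_steps steps."""
    return min(1.0, step / fade_steps)
```

With `alpha = 0` the network behaves exactly as before the layer was added, and with `alpha = 1` the skip path is removed and the new layer is fully active.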
3.2.2 Sample difficulty.
In DeepSDF, the sampled points of a shape all share the same weight in training, which presumes that every point is equally important. However, this assumption may result in the following two problems for reconstructing complex local details:
Points depicting local details are usually undersampled, and they could be ignored by the network during training due to their small population. We take the second lamp in Fig. 1 as an example. The number of sampled points around the lamp rope is nearly 1/100 of all the sampled points, which is too small to affect the network training.
In these areas, the magnitudes of SDF values are small as the points are close to surfaces (e.g. points inside the lamp rope). Without careful emphasis, the network could easily predict the wrong signs. Followed by a surface reconstruction method like Marching Cubes, the wrong sign estimations will further lead to inaccurate surface reconstructions.
To address these issues, we weight the sampled points differently during training. An intuitive idea is to locate all the complex local parts at first, and then weight or sort the training samples according to some difficulty measurement [bengio2009curriculum, weinshall2018curriculum, hacohen2019power]. However, it is difficult to detect complex regions and rank the difficulty of points exactly. In this paper, we propose an adaptive difficulty measurement based upon the SDF estimation of each sample and re-weight the samples to gradually emphasize more on hard and semi-hard samples on the fly.
Most deep embedding learning methods judge the difficulty of samples according to the loss function [schroff2015facenet, duan2019deep]. However, the $L_1$-norm loss can be very small for points with wrong sign estimations. As signs play an important role in implicit function based shape representations, we directly define the hard and semi-hard samples based on their sign estimations. More specifically, we consider points with wrong sign estimations as hard samples, points whose estimated SDF values lie between zero and the ground truth values as semi-hard samples, and the others as easy samples. Fig. 4 shows examples. Although semi-hard samples currently obtain correct sign estimations, they are still at high risk of becoming wrong as they are close to the boundary.
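The three categories defined above depend only on the predicted and ground truth SDF values, so they can be computed on the fly for each batch. A minimal sketch follows; the function name `classify_samples` and the string labels are assumptions for illustration.

```python
import numpy as np

def classify_samples(pred, target):
    """Label each point as 'hard', 'semi-hard' or 'easy' from its sign estimation.

    hard:      predicted sign differs from the ground truth sign
    semi-hard: correct sign, prediction lies between zero and the ground truth
    easy:      correct sign, prediction overshoots the ground truth magnitude
    """
    wrong_sign = pred * target < 0
    between = (~wrong_sign) & (np.abs(pred) < np.abs(target))
    return np.where(wrong_sign, "hard", np.where(between, "semi-hard", "easy"))

pred = np.array([-0.01, 0.01, 0.05, -0.05, -0.01, 0.03])
target = np.array([0.02, 0.02, 0.02, -0.02, -0.02, -0.02])
labels = classify_samples(pred, target)
```

Because the labels are recomputed from the current predictions, a point can move between categories as training progresses.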
To increase the weights of both hard and semi-hard samples, and also decrease the weights of easy samples, we formulate the objective function (6) as the tolerance-based loss (5) multiplied by a per-sample weight. The weight is controlled by a hyperparameter that sets the importance of the hard and semi-hard samples, and it is built from a sign function that equals 1 for a non-negative input and -1 otherwise.
The physical meaning of (6) is that we increase the weights of hard and semi-hard samples and decrease the weights of easy samples, both by an amount set by the hyperparameter. Although we treat hard and semi-hard samples similarly, their properties are different due to the varying physical meanings, as we will demonstrate in the experiments. As hard and semi-hard samples are decided by the estimated SDF values, our hard sample mining strategy always targets the weakness of the current network rather than using predefined weights. Still, (6) degenerates to (5) if we set the hyperparameter to zero.
There is another understanding of (6). In (6), one sign term gives the ground truth sign while the other indicates the direction of optimization. We increase the weights if this direction matches the ground truth sign, in which case overshooting would still lead to correct sign estimations, and decrease the weights otherwise.
We also design a curriculum for sample difficulty by controlling this hyperparameter at different training stages. At the beginning of training, we aim to teach the network global structures and allow small errors in shape geometry. To this end, we set a relatively small value to make the network equally focused on all training samples. Then, we gradually increase the value to emphasize hard and semi-hard samples more, which helps the network to address its weaknesses and reconstruct better local details. Strictly speaking, the curriculum of sample difficulty is slightly different from the formulation in [bengio2009curriculum], as it starts from the original task and gradually increases the difficulty to a harder objective. However, they share similar thoughts, and the ablation study also shows the effectiveness of the designed curriculum.
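The re-weighted objective (6) discussed above can be sketched as follows. The sketch assumes the weight takes the form 1 + lam * sgn(s) * sgn(s - f), where f is the clamped prediction and s the clamped ground truth; this form, and the names `lam`, `eps` and `curriculum_loss`, are assumptions chosen to match the description (ground truth sign times direction of optimization), not a verbatim copy of the original equation.

```python
import numpy as np

def sgn(v):
    """Sign function that returns 1 for non-negative input and -1 otherwise."""
    return np.where(v >= 0, 1.0, -1.0)

def curriculum_loss(pred, target, eps, lam, delta=0.1):
    """Tolerance-based loss re-weighted by sample difficulty.

    Hard and semi-hard samples receive weight 1 + lam, easy samples 1 - lam.
    """
    f = np.clip(pred, -delta, delta)
    s = np.clip(target, -delta, delta)
    weight = 1.0 + lam * sgn(s) * sgn(s - f)
    return weight * np.maximum(np.abs(f - s) - eps, 0.0)

target = np.full(3, 0.02)
pred = np.array([-0.01, 0.01, 0.05])  # hard, semi-hard, easy
weighted = curriculum_loss(pred, target, eps=0.0, lam=0.5)
unweighted = curriculum_loss(pred, target, eps=0.0, lam=0.0)
```

In this example the hard and easy points have the same absolute error, yet the hard point contributes three times as much to the weighted loss; setting `lam` to zero recovers the unweighted tolerance-based loss.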
3.3 Implementation Details.
In order to make fair comparisons, we applied the same training data, training epochs and network architecture as DeepSDF [park2019deepsdf]. More specifically, we prepared the input samples from each shape mesh, which was normalized to a unit sphere. We sampled 500,000 points from each shape. The points were sampled more aggressively near the surface to capture more shape details. The learning rate for the network was set as a function of the batch size, and a separate learning rate was used for the latent vectors. We trained the models for 2,000 epochs. Table 1 presents the training details, which degenerate to DeepSDF if we train all 8 fully connected layers with both the tolerance and the sample-weighting hyperparameter set to zero from beginning to end.
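A staged schedule of this kind can be represented as a small lookup table keyed by the starting epoch. The sketch below is illustrative only: the stage boundaries, layer counts, tolerances and weighting values in `STAGES` are placeholders invented for the example and are not the values of Table 1; only the final depth of 8 layers and the total of 2,000 epochs come from the text.

```python
from bisect import bisect_right

# (start_epoch, num_layers, tolerance, sample_weighting): placeholder values
STAGES = [
    (0, 5, 0.025, 0.0),
    (200, 6, 0.01, 0.1),
    (600, 7, 0.0025, 0.2),
    (1000, 8, 0.0, 0.5),
]

def stage_for_epoch(epoch):
    """Return the curriculum settings that are active at the given epoch."""
    starts = [stage[0] for stage in STAGES]
    index = bisect_right(starts, epoch) - 1
    _, layers, eps, lam = STAGES[index]
    return {"layers": layers, "eps": eps, "lam": lam}
```

A training loop would call `stage_for_epoch(epoch)` at the start of each epoch and rebuild or fade in layers whenever the returned depth changes.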
4 Experiments
In this section, we perform a thorough comparison of our proposed Curriculum DeepSDF to DeepSDF along with comprehensive ablation studies for the shape reconstruction task on the ShapeNet dataset [chang2015shapenet]. We use the missing part recovery task as an application to demonstrate the usage of our method.
Following [park2019deepsdf], we report the standard distance metrics of mesh reconstruction including the mean and the median of Chamfer distance (CD), mean Earth Mover's distance (EMD) [rubner2000earth], and mean mesh accuracy [seitz2006comparison]. For evaluating CD, we sample 30,000 points from mesh surfaces. For evaluating EMD, we follow [park2019deepsdf] by sampling 500 points from mesh surfaces due to the high computation cost. For evaluating mesh accuracy, following [seitz2006comparison, park2019deepsdf], we sample 1,000 points from mesh surfaces and compute the minimum distance $d$ such that 90% of the points lie within $d$ of the ground truth surface.
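Two of these metrics are simple enough to sketch directly on sampled point sets. The code below is a brute-force illustration under stated assumptions: the Chamfer distance uses one common convention (sum of the two mean squared nearest-neighbour distances), the ground truth surface is approximated by points sampled from it, and all function names are hypothetical. It is suitable for small point sets only, since it builds the full pairwise distance matrix.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: sum of mean squared nearest-neighbour distances."""
    return (nearest_distances(a, b) ** 2).mean() + (nearest_distances(b, a) ** 2).mean()

def mesh_accuracy(recon_pts, gt_pts, fraction=0.9):
    """Smallest d such that the given fraction of reconstructed points lie within d."""
    dist = np.sort(nearest_distances(recon_pts, gt_pts))
    count = int(np.ceil(fraction * len(dist) - 1e-9))  # points that must be covered
    return dist[count - 1]
```

For the point counts used in the evaluation, a spatial index such as a k-d tree would replace the dense distance matrix.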
4.1 Shape Reconstruction
We conducted experiments on the ShapeNet dataset [chang2015shapenet] for the shape reconstruction task. In the following, we will introduce quantitative results, ablation studies and visualization results.
Quantitative results. We compare our method to the state-of-the-art methods, including AtlasNet [groueix2018papier] and DeepSDF [park2019deepsdf], in Table 2. We also include several variants of our own method for ablation studies. Ours, representing the proposed Curriculum DeepSDF method, performs complete curriculum learning considering both surface accuracy and sample difficulty. As variants of our method, ours-sur and ours-sur w/o only employ the surface accuracy based curriculum learning with/without progressive growth of the network layers; ours-sam only employs sample difficulty based curriculum learning. For a fair comparison, we evaluated all SDF-based methods following the same training and testing protocols as DeepSDF, including the training/test split, the number of training epochs, and the network architecture. For AtlasNet-based methods, we directly report the numbers from [park2019deepsdf].
Here are the three key observations from Table 2:
Compared to vanilla DeepSDF, curriculum learning on either surface accuracy or sample difficulty can lead to a significant performance gain. The best performance is achieved by simultaneously performing both curricula.
In general, the curriculum of sample difficulty helps more on lamp and plane as these categories suffer more from reconstructing slender or thin structures. The curriculum of surface accuracy is more effective for the categories of chair, sofa and table where shapes are more regular.
As we only sample 500 points for computing EMD, even the ground truth mesh has non-zero EMD to itself, arising from the randomness in point sampling. We observe that our performance is approaching the upper bound on plane and sofa. Please refer to Table 5 for evaluating EMD with more points.
Hard sample mining strategies. Besides direct comparisons on ShapeNet, we also conducted ablation studies for a more detailed analysis of different hard sample mining strategies. We conducted the experiments on the lamp category due to its large variations and complex shape details. In the curriculum of sample difficulty, we gradually increase the weighting hyperparameter to make the network more and more focused on the hard samples. We compared it with the simple strategy of fixing a single value, and Table 3 shows the results. We observe that the performance improves as the fixed value increases until reaching a sweet spot, after which further increases could hurt the performance. The best result is achieved by our method, which gradually increases the hyperparameter as it encourages the network to focus more and more on hard details.
For hard sample mining, we increase the weights of hard and semi-hard samples and also decrease the weights of easy samples through (6). As various similar strategies can be used, we conducted ablation studies to demonstrate the effectiveness of our design. Table 4 shows the results. We observe that both increasing the weights of semi-hard samples and decreasing the weights of easy samples can boost the performance. However, it is risky to only increase weights for hard samples excluding semi-hard ones, in which case the performance drops. One possible reason is that focusing too much on hard samples may lead to more wrong sign estimations for the semi-hard ones as they are close to the boundary. Hence, it is necessary to increase the weights of semi-hard samples as well to maintain their correct sign estimations. The best performance is achieved by simultaneously increasing the weights of hard and semi-hard samples and decreasing the weights of easy ones.
| Number of points | 500 | 2000 | 5000 | 10000 |
Number of points for EMD. In Table 2, we followed [park2019deepsdf] by sampling 500 points to compute EMD, which leads to a relatively large distance even for point clouds resampled from ground truth meshes. We therefore increased the number of sampled points during EMD computation and tested the performance on lamp. Results are shown in Table 5. We observe that the number of sampled points can affect EMD due to the randomness in sampling, and the EMD of resampled ground truth decreases when using more points. Our method consistently obtains better results than DeepSDF.
Visualization results. We visualize the shape reconstruction results in Fig. 1 to qualitatively compare DeepSDF and Curriculum DeepSDF. We observe that Curriculum DeepSDF reconstructs more accurate shape surfaces. The curriculum of surface accuracy helps to better capture the general structure, and the curriculum of sample difficulty encourages the recovery of complex local details. We also provide the reconstructed shapes at key epochs in Fig. 5. At early stages, Curriculum DeepSDF learns coarse shapes that omit complex details. Then, it gradually refines local parts based on the learned coarse shapes. This training procedure improves the performance of the learned shape representation.
4.2 Missing Part Recovery
One of the main advantages of the DeepSDF framework is that we can optimize a shape code based upon a partial shape observation, and then render the complete shape through the learned network. In this subsection, we compare DeepSDF with Curriculum DeepSDF on the task of missing part recovery.
To create partial shapes with missing parts, we remove a subset of points from each shape. As random point removal may still preserve the holistic structures, we remove all the points in a local area to create missing parts. More specifically, we randomly select a point from the shape and then remove a certain quantity of its nearest neighbor points including itself, so that all the points within a local range are removed. We conducted the experiments on three ShapeNet categories: plane, sofa and lamp. Among these categories, plane and sofa have more regular and symmetric structures, while lamp is more complex and contains large variations. Table 6 shows the mean CD and mesh accuracy for missing part recovery. We observe that part removal largely affects the performance on the lamp category compared with plane and sofa, and Curriculum DeepSDF consistently obtains better results than DeepSDF under different ratios of removed points. A visual comparison is provided in Fig. 6.
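The local removal procedure described above can be sketched as follows. The function name `remove_local_part`, the use of a removal ratio as the argument and the choice to return the removed points as well are assumptions made for this illustration.

```python
import numpy as np

def remove_local_part(points, ratio, rng):
    """Remove a randomly chosen seed point together with its nearest neighbours.

    Returns (partial, removed), where `removed` holds round(ratio * N) points
    ordered by distance to the seed and `partial` holds the remaining points.
    """
    n_remove = int(round(ratio * len(points)))
    seed = points[rng.integers(len(points))]
    order = np.argsort(np.linalg.norm(points - seed, axis=1))
    removed = points[order[:n_remove]]
    partial = points[np.sort(order[n_remove:])]  # keep the original point order
    return partial, removed

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1000, 3))
partial, removed = remove_local_part(cloud, ratio=0.2, rng=rng)
```

Because the removed points form a ball around the seed, the resulting hole is spatially contiguous, unlike uniform random removal, which would thin the whole shape.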
5 Conclusion
In this paper, we have proposed a Curriculum DeepSDF method by designing a shape curriculum for shape representation learning. Inspired by the learning principle of humans, we organize the learning task into a series of difficulty levels from surface accuracy and sample difficulty. For surface accuracy, we design a tolerance parameter to control the global smoothness, where we learn coarse shapes with a shallow network at first and then gradually increase the accuracy with more layers. For sample difficulty, we define hard, semi-hard and easy training samples in SDF learning. We gradually re-weight the samples to make the network more and more focused on difficult local details. Experimental results show that our shape curriculum largely improves the performance of DeepSDF with the same training data, training epochs and network architecture.