Learning a Hierarchical Latent-Variable Model of 3D Shapes

05/17/2017 ∙ by Shikun Liu, et al. ∙ Penn State University, Imperial College London

We propose the Variational Shape Learner (VSL), a hierarchical latent-variable model for 3D shape learning. VSL employs an unsupervised approach to learning and inferring the underlying structure of voxelized 3D shapes. Through the use of skip-connections, our model can successfully learn a latent, hierarchical representation of objects. Furthermore, realistic 3D objects can be easily generated by sampling the VSL's latent probabilistic manifold. We show that our generative model can be trained end-to-end from 2D images to perform single image 3D model retrieval. Experiments show, both quantitatively and qualitatively, the improved performance of our proposed model over a range of tasks.




1 Introduction

Over the past several years, impressive strides have been made in the generative modeling of 3D objects. Much of this progress can be attributed to recent advances in artificial neural network research. Instead of the usual approach of representing 3D shapes with voxel occupancy vectors, promising previous work has taken to learning simple latent representations of such objects. Neural architectures that have been developed with this goal in mind include deep belief networks, deep autoencoders [48, 12, 31], and 3D convolutional networks [26, 47, 34, 5, 15]. The positive progress made so far with neural networks has also led to the creation of several large-scale 3D CAD model benchmarks, notably ModelNet [44] and ShapeNet [3].

However, despite the progress made so far, one key weakness shared among all previous state-of-the-art approaches is that all of them have focused on learning a single (unified) vector representation of 3D shapes. These include recent and powerful models such as the autoencoder-like T-L Network [12] and the probabilistic 3D Generative Adversarial Network (3D-GAN) [43], which shared its vector representation over multiple tasks. Other models [18, 17] further required additional supervision using information regarding camera viewpoints, shape keypoints, and segmentations.

Trying to describe the input with only a single layer of latent variables might be too restrictive an assumption, hindering the expressiveness of the underlying generative model learned. Having a multilevel latent structure, on the other hand, would allow lower-level latent variables to focus on modeling features such as edges, while the upper levels learn to command those lower-level variables as to where to place those edges in order to form curves and shapes. This composition of latent (local) sub-structures would allow us to exploit the fact that most 3D shapes have similar structure. This is the essence of abstract representations (which can be considered a coarse-to-fine feature extraction process), which can be easily constructed in terms of less abstract ones [2]: higher-level variables, or disentangled features, would model complex interactions of low-level patterns. Thus, to encourage and expedite the learning of hierarchical features, we incorporate this as a prior in our model through explicit architectural constraints.

In this paper, motivated by the argument developed above and the promise shown in work such as that of [8], we show how to encourage a latent-variable generative model to learn a hierarchy of latent variables through the use of synaptic skip-connections. These skip-connections encourage each layer of latent variables to model exactly one level of abstraction of the data. To efficiently learn such a latent structure, we further exploit recent advances in approximate inference [21] to develop a variational learning procedure. Empirically, we show that the learned model, which we call the Variational Shape Learner, acquires rich representations of 3D shapes which leads to significantly improved performance across a multitude of 3D shape tasks.

In summary, the main contributions of this paper are as follows:

  • We propose a novel latent-variable model, which we call the Variational Shape Learner, which is capable of learning expressive features of 3D shapes.

  • For both general 3D model building and single image reconstruction, we show that our model is fully unsupervised, requiring no extra human-generated information about segmentation, keypoints, or pose information.

  • We show that our model outperforms current state-of-the-art in unsupervised (object) model classification while requiring significantly fewer learned feature extractors.

  • In real-world image reconstruction, our extensive set of experiments shows that the proposed Variational Shape Learner surpasses state-of-the-art in 8 of 10 classes, and in half of these it does so by a large margin.

2 Related Work

3D object recognition is a well-studied problem in the computer vision literature. Early efforts [27, 22, 33] often combined simple image classification methods with hand-crafted shape descriptors, requiring intensive effort on the side of the human data annotator. However, ever since the ImageNet contest of 2012 [23], deep convolutional networks (ConvNets) [10, 24] have swept the vision industry, becoming nearly ubiquitous in countless applications.

Research in learning probabilistic generative models has also benefited from the advances made by artificial neural networks. Generative Adversarial Networks (GANs), proposed in [13], and Variational auto-encoders (VAEs), proposed in [21, 32], are some of the most popular and important frameworks that have emerged from improvements in generative modeling. Successful adaptations of these frameworks range from natural language and speech processing [6, 35] to realistic image synthesis [14, 30, 28], yielding promising results. Nevertheless, very little work, outside of [43, 12, 31], has focused on modeling 3D objects, where generative architectures can be used to learn probabilistic embeddings. The model proposed in this paper offers another step towards constructing powerful probabilistic generative models of 3D structures.

One study, amidst the rise of neural network-based approaches to 3D object recognition, most relevant to this paper is that of [44], which presented promising results and a useful benchmark for 3D model recognition: ModelNet. Following this key study, researchers have tried applying 3D ConvNets [26, 5, 41, 47], autoencoders [46, 48, 12, 31], and a variety of probabilistic neural generative models [43, 31] to the problem of 3D model recognition, with each study progressively advancing state-of-the-art.

With respect to 3D object generation from 2D images, commonly used methods can be roughly grouped into two categories: 3D voxel prediction [44, 43, 12, 31, 5, 15] and mesh-based methods [11, 7]. The 3D-R2N2 model [5] represents a more recent approach to the task, which involves training a recurrent neural network to predict 3D voxels from one or more 2D images. The model of [31] also takes a recurrent network-based approach, but receives a depth image as input rather than normal 2D images. The learnable stereo system [17] processes one or more camera views and camera pose information to produce compelling 3D object samples.

Many of the above methods require multiple images and/or additional human-provided information. Some approaches have attempted to minimize human involvement by developing weakly-supervised schemes, making use of image silhouettes to conduct 3D object reconstruction [47, 42]. Of the few unsupervised neural-based approaches that exist, the T-L network [12] is particularly important; it combines a convolutional autoencoder with an image regressor to encode a unified vector representation of a given 2D image. However, one fundamental issue with the T-L Network is its three-phase training procedure, since jointly training the system components proves to be too difficult. The 3D-GAN [43] offers a way to train 3D object models probabilistically, employing an adversarial learning scheme. However, GANs are notoriously difficult to train [1], often due to ill-designed loss functions and the higher chance of zero gradients.

In contrast to this prior work, our approach, which is derived from an approximate inference approach to learning, naturally allows for joint training of all model parameters. Furthermore, our approach makes use of a well-formulated loss function from a variational Bayesian perspective, circumventing the instability involved with adversarial learning while still being able to produce higher-quality samples.

3 The Variational Shape Learner

In this section, we introduce our proposed model, the Variational Shape Learner (VSL), which builds on the ideas of the Neural Statistician [8] and the volumetric convolutional network [26], the parameters of which are learned under a variational inference scheme [21].

Figure 2: The network structure of the Variational Shape Learner. Solid lines represent synaptic connections for either fully-connected or convolutional layers while dashed lines represent concatenation. Dotted-dashed lines represent possible applications. means latent features, means concatenated features, and means equivalence relation.

3.1 The Design Philosophy

It is well known that generative models, learned through variational inference, are excellent at reconstructing complex data but tend to produce blurry samples. This happens because there is uncertainty in the model’s predictions when we reconstruct the data from a latent space. As described above, previous approaches to 3D object modeling have focused on learning a single latent representation of the data. However, this simple latent structure might be hindering the underlying model’s ability to extract richer structure from the input distribution and thus lead to blurrier reconstructions.

To improve the quality of the samples of generated objects, we introduce a more complex internal variable structure, with the specific goal of encouraging the learning of a hierarchical arrangement of latent feature detectors. The motivation for a latent hierarchy comes from the observation that objects under the same category usually have similar geometric structure. As can be seen in Figure 2, we start from a global latent variable layer (horizontally depicted) that is hardwired to a set of local latent variable layers (vertically depicted), each tasked with representing one level of feature abstraction. The skip-connections tie together the latent codes in a top-down directed fashion: local codes closer to the input will tend to represent lower-level features, while local codes farther away from the input will tend towards representing higher-level features.

The global latent vector can be thought of as a large pool of command units that ensures that each local code extracts information relative to its position in the hierarchy, forming an overall coherent structure. This explicit global-local form, and the way it constrains how information flows across it, lends itself to a straightforward parametrization of the generative model and furthermore ensures robustness, dramatically cutting down on over-fitting. To make things easier for training via stochastic back-propagation, the local codes will be concatenated to a flattened structure when fed into the task-specific models, e.g., a shape classifier or a voxel reconstruction module. Ultimately, more realistic samples should be generated by an architecture supporting this kind of latent-variable design, since the local variable layers will robustly encode hierarchical semantic cues in an unsupervised fashion.

3.2 Model Objective: Variational + Latent Loss

To learn the parameters of the VSL latent-variable model, we will take a variational inference approach, where the goal is to learn a directed generative model , with generative parameters , using a recognition model , with variational parameters . The VSL’s learning objective contains a standard reconstruction loss term as well as a regularization penalty over the latent variables. Furthermore, the loss contains an additional term for the latent variables , which is particularly relevant and useful for the 3D model retrieval task of Section 4.5. This extra term is a simple penalty imposed on the difference between the learned features of the image regressor and the true latent features where denotes concatenation.

We assume a fixed, spherical unit Gaussian prior, . The conditional distribution over each local latent code () is defined as follows:


where the first local code is simply:


Note that and are also spherical Gaussians and contains the generative parameters.

The (occupancy) probability for one voxel can then be calculated by,


Let the reconstructed voxel be directly parametrized by occupancy probability. The loss for the input voxel of the VSL is then calculated by the following equation:


where each term in the equation above is defined as follows:


Note that and , which weigh the contributions of each term towards the overall cost, are tunable hyper-parameters.
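As a rough illustration of how an objective of this kind combines its terms, the sketch below computes a Bernoulli reconstruction log-likelihood over voxels, the closed-form KL divergence between a diagonal Gaussian and a unit Gaussian prior, and a squared-error penalty between regressed and true latent features. The function names, the weights `kl_weight` and `latent_weight`, and the way the terms are combined are assumptions made for illustration; only a single (global) KL term is shown, and this is not a reproduction of the equations of this section.

```python
import math


def bernoulli_log_likelihood(voxels, probs, eps=1e-7):
    """Sum of log p(x | z) over voxels; x is 0/1 and p is an occupancy probability."""
    total = 0.0
    for x, p in zip(voxels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total


def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def latent_penalty(regressed, latent):
    """Squared difference between regressed image features and latent features."""
    return sum((a - b) ** 2 for a, b in zip(regressed, latent))


def objective(voxels, probs, mu, log_var, regressed, latent,
              kl_weight=1.0, latent_weight=0.0):
    """Quantity to maximize (assumed form): reconstruction minus weighted penalties."""
    return (bernoulli_log_likelihood(voxels, probs)
            - kl_weight * kl_to_unit_gaussian(mu, log_var)
            - latent_weight * latent_penalty(regressed, latent))
```

With `latent_weight=0.0` the sketch reduces to a plain variational objective, which corresponds to tasks that do not involve the image regressor.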

3.3 Encoder: 3D-ConvNet + Skip-Connections

The global latent code is directly learned from the input voxel through three convolutional layers with kernel sizes , strides and channels .

Each local latent code is conditioned on the global latent code, the input voxel , and the previous latent code (except for the first local code, which does not have a previous latent code) using two fully-connected layers with 100 neurons each. These skip-connections between local codes help to ease the process of learning hierarchical features and force each local latent code to learn one level of abstraction.
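The conditioning pattern described above can be sketched as a simple loop. In this minimal illustration, `condition_inputs`, `encode_local_codes`, and the `local_net` callable are hypothetical names; `local_net` stands in for the two fully-connected layers and the sampling step, neither of which is implemented here.

```python
def condition_inputs(voxel_features, global_code, previous_local=None):
    """Concatenate everything a local code is conditioned on."""
    inputs = list(voxel_features) + list(global_code)
    if previous_local is not None:  # the first local code has no predecessor
        inputs += list(previous_local)
    return inputs


def encode_local_codes(voxel_features, global_code, num_local, local_net):
    """Produce local codes one after another, each seeing the previous one."""
    codes = []
    previous = None
    for _ in range(num_local):
        code = local_net(condition_inputs(voxel_features, global_code, previous))
        codes.append(code)
        previous = code
    return codes
```

For example, passing `local_net=lambda inputs: [sum(inputs)]` makes the dependency chain easy to trace by hand: every code depends on the input features, the global code, and its immediate predecessor.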

The approximate posterior for one single voxel is then given by:


where , the variational parameters, is parametrized by neural networks. represents the number of local latent codes.

3.4 Decoder: 3D-DeConvNet

After we learn the global and local latent codes , we then concatenate them into a single vector as shown in Figure 2 in blue dashed lines.

A 3D deconvolutional neural network with dimensions symmetrical to the encoder of Section 3.3 is used to decode the learned latent features into a voxel. An element-wise logistic sigmoid is applied to the output layer in order to convert the learned features to occupancy probabilities for each voxel cell.
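The final step of the decoder, the element-wise logistic sigmoid, can be illustrated in isolation. The sketch below assumes the output layer is given as a nested list of real-valued activations indexed by three spatial axes; the function names are illustrative and the deconvolutional layers themselves are not shown.

```python
import math


def sigmoid(value):
    """Numerically stable logistic sigmoid."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def occupancy_probabilities(activations):
    """Apply the sigmoid to every cell of a 3D grid of activations."""
    return [[[sigmoid(v) for v in row] for row in plane] for plane in activations]
```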

3.5 Image Regressor: 2D-ConvNet

We use a standard 2D convolutional network to encode input RGB images into a feature space with the same dimension as the concatenation of global and local latent codes . The network contains four fully-convolutional layers with kernel sizes , strides , and channels . The last convolutional layer is flattened and fed into two fully-connected layers with 200 and 100 neurons each. Unlike the encoder described in Section 3.3, we apply dropout [40] before the last fully-connected layer.

4 Experiments

To evaluate the quality of our proposed neural generative model for 3D shapes, we conduct several extensive experiments.

In Section 4.3, we investigate our model’s ability to generalize and synthesize through a shape interpolation experiment and a nearest-neighbor analysis of randomly generated samples from the VSL. Following this, in Section 4.4, we evaluate our model on the task of unsupervised shape classification by directly using the learned latent features on both the ModelNet10 and ModelNet40 datasets. We compare these results to previous supervised and unsupervised state-of-the-art methods. Next, we test our model’s ability to reconstruct real-world images in Section 4.5, comparing our results to 3D-R2N2 [5] and NRSfM [18]. Finally, we demonstrate the richness of the VSL’s learned semantic embeddings through vector arithmetic, using the latent features trained on ModelNet40 for Section 4.6.

4.1 Datasets

ModelNet There are two variants of the ModelNet dataset, ModelNet10 and ModelNet40, introduced in [44], with 10 and 40 target classes respectively. ModelNet10 has 3D shapes which are pre-aligned with the same pose across all categories. In contrast, ModelNet40 (which includes the shapes found in ModelNet10) features a variety of poses. We voxelize both ModelNet10 and ModelNet40 with resolution . To test our model’s ability to handle 3D shapes of great variety and complexity, we use ModelNet40 for most of the experiments, especially for those in Sections 4.3 and 4.6. Both ModelNet10 and ModelNet40 are used to conduct the shape classification experiments.

PASCAL 3D The PASCAL 3D dataset is composed of the images from the PASCAL VOC 2012 dataset [9], augmented with 3D annotations using PASCAL 3D+ [45]. We voxelize the 3D CAD models using resolution and use the same training and testing splits as [18], which were also used in [5] to conduct real-world image reconstruction (on which the experiment in Section 4.5 is based). We use the bounding box information as provided in the dataset. Note that the only pre-processing we applied was image cropping and padding with 0-intensity pixels to create final samples of resolution (which was required for our model).

4.2 Training Protocol

Training was the same across all experiments, with only minor details that were task-dependent. The architecture of the VSL experimented with in this paper consisted of 5 local latent codes, each made up of 10 variables for ModelNet40 and 5 for ModelNet10. For ModelNet40, the global latent code was set to a dimensionality of 20 variables, while for ModelNet10, it was set to 10 variables.
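Since the task-specific models consume the concatenation of the global code and all local codes, the size of that concatenated feature vector follows directly from the settings above. The helper below is a small worked calculation; the function name is illustrative.

```python
def concatenated_dim(global_dim, num_local, local_dim):
    """Size of the vector formed by the global code plus all local codes."""
    return global_dim + num_local * local_dim


# Settings stated in this section.
modelnet40_dim = concatenated_dim(global_dim=20, num_local=5, local_dim=10)
modelnet10_dim = concatenated_dim(global_dim=10, num_local=5, local_dim=5)
print(modelnet40_dim, modelnet10_dim)
```

Under these settings the concatenated feature has 70 dimensions for ModelNet40 and 35 for ModelNet10.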

The hyper-parameter was set to across training on both ModelNet10 and ModelNet40. We optimize parameters by maximizing the loss function defined in Equation 4 using the Adam adaptive learning rate [20], with step size set to . For the experiments of Sections 4.3, 4.4, and 4.6, over 2500 epochs, parameter updates were calculated using mini-batches of 200 samples on ModelNet40 and 100 samples on ModelNet10.

For the experiment in Section 4.5, we use 5 local latent codes (each with dimensionality of 5) and a global latent code of 20 variables for the jointly trained model. For the separately trained model, we use 3 local latent codes, each with dimensionality of 2, and a global latent code of dimensionality 5. Mini-batches of 40 samples were used to compute gradients for the joint model while 5 samples were used for the separately trained model. For both model variants, dropout [40] was used to control for over-fitting, with , and early stopping was employed (resulting in only 150 epochs).

For Section 4.5, which involved image reconstruction and thus required the loss term , instead of searching for an optimal value of the hyper-parameter through cross-validation, we employed a “warming-up” schedule, similar to that of [39]. “Warming-up” involves gradually increasing (on a log-scale as depicted in Figure 3), which controls the relative weighting of in Equation 4. The schedule is defined as follows,

Figure 3: Training the VSL for image reconstruction using a warming-up schedule compared to using constant weights and .

Figure 3 depicts, empirically, the benefits of employing a warming-up schedule over using a fixed, externally set coefficient for the term in our image reconstruction experiment. We remark that using a warming-up schedule plays an essential role in acquiring good performance on the image reconstruction task.
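To make the idea of a log-scale warming-up schedule concrete, the following sketch ramps a loss weight geometrically from a small starting value to its final value over a fixed number of epochs. This is a generic illustration, not the schedule used in our experiments: the function name, the starting value, the final value, and the number of warm-up epochs are all hypothetical.

```python
def warmup_weight(epoch, warmup_epochs, start=1e-3, end=1.0):
    """Geometric (log-scale) ramp from `start` to `end` over `warmup_epochs`."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return end
    fraction = max(epoch, 0) / warmup_epochs
    # Interpolating linearly in log-space equals multiplying by a fixed ratio.
    return start * (end / start) ** fraction


schedule = [warmup_weight(epoch, warmup_epochs=10) for epoch in range(12)]
```

A ramp of this kind keeps the weighted term from dominating early training and lets it reach its full weight once the other terms have begun to settle.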

4.3 Shape Generation and Learning

Figure 4: Randomly generated results from the proposed Variational Shape Learner trained on ModelNet40. The nearest neighbors are the ground-truth shapes, fetched from the test data, and placed for reference in the last column of the table. (Column headings: Shape Generation, Nearest Neighbor.)

Figure 5: Interpolation results of the Variational Shape Learner on ModelNet40. (Panels: Intra-Class Interpolation (airplane); Inter-Class Interpolation (chair, bed).)

Figure 6: Shape generation from previous state-of-the-art approaches. Top: generated shapes in resolution from [44]; Bottom: generated shapes in resolution from [43]. (Panel labels: airplane, desk, sofa, chair.)

To examine our model’s ability to generate high-resolution 3D shapes with realistic details, we design a task that involves shape generation and shape interpolation. We add Gaussian noise to the learned latent codes on test data taken from ModelNet40 and then use our model to generate “unseen” samples that are similar to the input voxel. In effect, we generate objects from our VSL model directly from vectors, without a reference object/image.

The results of our shape interpolation experiment, from both within-class and across-class perspectives, are presented in Figure 5. It can be observed that the proposed VSL shows the ability to smoothly transition between two objects. Our results on shape generation are shown in Figure 4. Notably, in our visualizations, darker colors correspond to lower occupancy probability while lighter colors correspond to higher occupancy probability. We further compare to previous state-of-the-art results in shape generation, which are depicted in Figure 6.
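The two latent-space operations used in this experiment, perturbing a code with Gaussian noise and moving between two codes, can be sketched as follows. The use of linear interpolation, the function names, and the noise scale are assumptions for illustration; decoding the resulting codes into voxels is not shown.

```python
import random


def perturb(code, noise_std, rng):
    """Add independent Gaussian noise to each latent dimension."""
    return [v + rng.gauss(0.0, noise_std) for v in code]


def interpolate(code_a, code_b, steps):
    """Return `steps` codes evenly spaced on the line from code_a to code_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return path


rng = random.Random(0)  # seeded so the sketch is repeatable
noisy = perturb([0.2, -0.4, 1.0], noise_std=0.1, rng=rng)
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
```

Each code along `path` would then be passed through the decoder to obtain one frame of the interpolation.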

4.4 Shape Classification

One way to test the expressiveness of our model would be to conduct shape classification directly using the learned embeddings. We evaluate our learned features on the ModelNet dataset [44] by concatenating the global latent variable with the local latent layers, creating a single feature vector. We train a Support Vector Machine with an RBF kernel for classification using these “pre-trained” embeddings.
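For reference, the RBF kernel that such a classifier uses to compare two embeddings is sketched below. The function name and the bandwidth parameter `gamma` are illustrative; the value of `gamma` used in our experiments is not specified here, and fitting the SVM itself is left to a standard library.

```python
import math


def rbf_kernel(features_a, features_b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(features_a, features_b))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
apart = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1)
```

Identical embeddings give a kernel value of 1, and the value decays towards 0 as embeddings move apart, so classes that form tight clusters in the latent space are easy for the classifier to separate.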

Table 1: ModelNet classification results (classification rate) for both unsupervised and supervised methods.

Supervision    Method                 ModelNet10    ModelNet40
Supervised     3D ShapeNets [44]      83.5%         77.3%
Supervised     DeepPano [37]          85.5%         77.6%
Supervised     Geometry Image [38]    88.4%         83.9%
Supervised     VoxNet [26]            92.0%         83.0%
Supervised     PointNet [29]          -             89.2%
Supervised     MVCNN [41]             -             90.1%
Supervised     ORION [34]             93.8%         -
Unsupervised   SPH [19]               79.8%         68.2%
Unsupervised   LFD [4]                79.9%         75.5%
Unsupervised   T-L Network [12]       74.4%         -
Unsupervised   VConv-DAE [36]         80.5%         75.5%
Unsupervised   3D-GAN [43]            91.0%         83.3%
Unsupervised   VSL (ours)             91.0%         84.5%

Table 1 shows the performance of previous state-of-the-art supervised and unsupervised methods in shape classification on both variants of the ModelNet dataset. Notably, the best unsupervised state-of-the-art results reported so far were from the 3D-GAN of [43], which used features from 3 layers of convolutional networks with total dimensions . This is a far larger feature space than that required by our model, which is simply (for 10-way classification) and (for 40-way classification) and reaches the same level of performance. The VSL performs comparably to supervised state-of-the-art, outperforming models such as 3D ShapeNets [44], DeepPano [37], and Geometry Image [38], and comes close to models such as VoxNet [26].

In order to visualize the learned feature embeddings, we employ t-SNE [25] to map our high dimensional feature to a 2D plane. The visualization is shown in Figure 7.

Figure 7: t-SNE plots of the latent embeddings for ModelNet10 and ModelNet40. Each color represents one class.

4.5 Single Image 3D Model Retrieval

Real-world, single image 3D model retrieval is another application of the proposed VSL model. This is a challenging problem, forcing a model to deal with real-world 2D images under a variety of lighting conditions and resolutions. Furthermore, there are many instances of model occlusion as well as different color gradings.

To test our model on this application, we use the PASCAL 3D [45] dataset and utilize the same exact training and testing splits from [18]. We compare our results with those reported for recent approaches, including the NRSfM [18] and 3D-R2N2 [5] models. Note that these also used the exact same experimental configurations we did.

Table 2: Per-category voxel predictive performance on PASCAL VOC, as measured by Intersection-over-Union (IoU).

Method                      aero    bike    boat    bus     car     chair   mbike   sofa    train   tv      mean
NRSfM                       0.298   0.144   0.188   0.501   0.472   0.234   0.361   0.149   0.249   0.492   0.318
3D-R2N2 [LSTM-1]            0.472   0.330   0.466   0.677   0.579   0.203   0.474   0.251   0.518   0.438   0.456
3D-R2N2 [Res3D-GRU-3]       0.544   0.499   0.560   0.816   0.699   0.280   0.649   0.332   0.672   0.574   0.571
VSL (jointly trained)       0.514   0.269   0.327   0.558   0.633   0.199   0.301   0.173   0.402   0.337   0.432
VSL (separately trained)    0.631   0.657   0.554   0.856   0.786   0.311   0.656   0.601   0.804   0.454   0.619
Figure 8: Reconstruction samples (for PASCAL VOC) from the separately trained VSL. Columns: Input, GT, VSL, 3D-R2N2 [5], NRSfM [18]. Note: our model uses resolution while 3D-R2N2 and NRSfM use resolution , thus, visualizations will differ slightly.

For this task, we train our model in two different ways: 1) jointly on all categories, and 2) separately on each category. In Figure 8, we observe better reconstructions from the (separately-trained) VSL when compared to previous work. Unlike the NRSfM [18], the VSL does not require any segmentation, pose information, or keypoints. In addition, the VSL is trained from scratch while the 3D-R2N2 is pre-trained using the ShapeNet dataset [3]. However, the jointly-trained VSL did not outperform the 3D-R2N2, which is also jointly-trained. The performance gap is due to the fact that the 3D-R2N2 is specifically designed for image reconstruction and employs a residual network [16] to help the model learn richer semantic features.

Quantitatively, we compare our VSL to the NRSfM [18] and two versions of 3D-R2N2 from [5], one with an LSTM structure and another with a deep residual network. Results (Intersection-over-Union) are shown in Table 2. Observe that our jointly trained model performs comparably to the 3D-R2N2 LSTM variant, while the separately trained version surpasses the 3D-R2N2 ResNet structure in 8 out of 10 categories, half of them by a wide margin. Note that our convolutional network components can be replaced with residual network components, an extension we leave as future work.
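The Intersection-over-Union metric reported above can be computed as in the following sketch, which assumes flattened voxel grids and a binarization threshold of 0.5 on the predicted occupancy probabilities. The threshold, the function name, and the convention of returning 1.0 when both grids are empty are assumptions for illustration.

```python
def voxel_iou(predicted_probs, ground_truth, threshold=0.5):
    """IoU between a thresholded occupancy prediction and a binary target grid."""
    intersection = 0
    union = 0
    for prob, target in zip(predicted_probs, ground_truth):
        predicted = prob > threshold
        occupied = target > 0
        if predicted and occupied:
            intersection += 1
        if predicted or occupied:
            union += 1
    if union == 0:
        return 1.0  # both grids empty: treated as a perfect match (assumed convention)
    return intersection / union


score = voxel_iou([0.9, 0.8, 0.1, 0.2], [1, 0, 1, 0])
```

In this toy example one voxel is occupied in both grids and three voxels are occupied in at least one of them, giving an IoU of 1/3.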

4.6 Shape Arithmetic

Another way to explore the learned embeddings is to perform various vector operations on the latent space, much like what was done in [43, 12]. We present some results of our shape arithmetic experiment in Figure 9. Different from previous results, all of our objects are sampled from the model embeddings which were trained using the whole dataset with 40 classes. Furthermore, unlike the blurrier generations of [12], the VSL seems to generate very interesting combinations of the input embeddings without the need for any matching to actual 3D shapes from the original dataset. The resultant objects appear to clearly embody the intuitive meaning of the vector operators.
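The vector operations in question are plain element-wise additions and subtractions of latent codes. The sketch below shows the arithmetic only; the helper names and the example codes are hypothetical, and the result would still need to be decoded into a voxel grid.

```python
def add_codes(code_a, code_b):
    """Element-wise sum of two latent codes."""
    return [a + b for a, b in zip(code_a, code_b)]


def subtract_codes(code_a, code_b):
    """Element-wise difference of two latent codes."""
    return [a - b for a, b in zip(code_a, code_b)]


# Hypothetical three-dimensional codes, used only to show the arithmetic.
code_first = [1.0, 0.5, -2.0]
code_second = [0.5, 0.5, 0.0]
code_third = [0.0, 1.0, 1.0]
combined = add_codes(subtract_codes(code_first, code_second), code_third)
```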

Figure 9: Samples of our shape arithmetic experiment.

5 Conclusion

In this paper, we proposed the Variational Shape Learner, a hierarchical latent-variable model for 3D shape modeling, learnable through variational inference. In particular, we have demonstrated 3D shape generation results on a popular benchmark, the ModelNet dataset. We also used the learned embeddings of our model to obtain state-of-the-art in unsupervised shape classification and furthermore showed that we could generate unseen shapes using shape arithmetic. Future work will entail a more thorough investigation of the embeddings learned by our hierarchical latent-variable model as well as integration of better prior distributions into the framework.