Spectral-GANs for High-Resolution 3D Point-cloud Generation

12/04/2019 · by Sameera Ramasinghe, et al. · Australian National University

Point-clouds are a popular choice for vision and graphics tasks due to their accurate shape description and direct acquisition from range-scanners. This demands the ability to synthesize and reconstruct high-quality point-clouds. Current deep generative models for 3D data generally work on simplified representations (e.g., voxelized objects) and cannot deal with the inherent redundancy and irregularity in point-clouds. The few recent efforts on 3D point-cloud generation offer limited resolution, and their complexity grows with the output resolution. In this paper, we develop a principled approach to synthesize 3D point-clouds using a spectral-domain Generative Adversarial Network (GAN). Our spectral representation is highly structured and allows us to disentangle various frequency bands such that the learning task is simplified for a GAN model. Compared to spatial-domain generative approaches, our formulation allows us to generate high-resolution point-clouds with an arbitrary number of points and minimal computational overhead. Furthermore, we propose a fully differentiable block to transform from the spectral to the spatial domain and back, thereby allowing us to integrate knowledge from well-established spatial models. We demonstrate that Spectral-GAN performs well for the point-cloud generation task. Additionally, it can learn a highly discriminative representation in an unsupervised fashion and can be used to accurately reconstruct 3D objects.


1 Introduction

Point-clouds are a popular 3D representation for real-world objects and scenes. In comparison to other representations such as voxels, mesh and truncated signed distance function (TSDF), point-clouds are often an attractive choice for 3D data because they capture shape details accurately, are computationally efficient to process and can be acquired as a default output from several 3D sensors (e.g., LiDAR). However, point-clouds pose a major challenge for deep networks, particularly the generative pipelines, due to their inherent redundancy and irregular nature (e.g., permutation-invariance).

Due to the complexity of point-clouds, most 3D synthesis approaches are inapplicable. For example, generative approaches designed for voxelized inputs [wu2016learning, kingma2013auto, wu20153d, xie2018learning, Huang_2017_CVPR, khan2019unsupervised], cannot work with the irregular point sets. To overcome this challenge, some recent generative approaches solely focus on point-cloud synthesis. For example, Achlioptas et al. [achlioptas2017learning] use a GAN framework for 3D point-cloud distribution modelling in the data and auto-encoder latent space, Yang et al. [Yang_2019_ICCV] sample 3D points from a prior spatial distribution and then transform them using an invertible parameterization while [shu20193d, valsesia2018learning] employ graph-structured networks for point-cloud generation.

All such efforts so far operate in the 'spatial domain' (3D Euclidean space), which makes the modelling task relatively difficult due to arbitrary point configurations in 3D space. This leads to a number of roadblocks towards a versatile generative model, e.g., the restriction to a fixed set of points [achlioptas2017learning] and limited scalability to arbitrary point resolutions [shu20193d, valsesia2018learning]. As opposed to previous works, we perform generative modelling in the spectral space using spherical harmonic moment vectors (SMVs), which inherently offers a solution to the above-mentioned problems. Specifically, generating 3D shapes via spectral representations allows us to compactly represent redundant information in point-clouds, easily scale to high-dimensional point-cloud sets, remain invariant to permutations of unordered point sets and generate high-fidelity shapes with minimal outliers. Besides, our spectral representation allows us to develop an understanding of the frequency-domain functional space of generic 3D objects. Our main contributions are:

  • To handle the redundancy and irregularity of point-clouds, we propose the first spectral-domain GAN that synthesizes novel 3D shapes by using a spherical harmonics based representation.

  • A fully differentiable transformation from the spectral to the spatial domain and back, thus allowing us to integrate knowledge from well-established spatial models.

  • Through both quantitative and qualitative evaluations, we illustrate that Spectral-GAN can generate high-quality 3D shapes with minimal artifacts and can be easily scaled to high-dimensional outputs.

  • Our proposed framework learns highly discriminative unsupervised features and can seamlessly perform 3D reconstruction from 2D inputs. Moreover, we show that Spectral-GAN is scalable to high-resolution outputs (a 40× increase in resolution with only a 4× increase in parameters).

2 Related Work

Generative models in spectral-domain: Yang et al. [yang2017dagan] and Souza et al. [souza2018hybrid] develop methods for MRI reconstruction using GANs that use Fourier-domain information to refine the output. In the former approach, the generator operates in the spatial domain, and spectral information is used to refine the output. The latter approach, in contrast, uses two separate networks in the frequency and spatial domains and adopts the Fourier transform to exchange information between the two. A significant drawback of these approaches is that the output resolution is tightly coupled to the network design, so they lack scalability to high dimensions.

In a different application, Portilla et al. [portilla2000parametric] present a method to synthesize textures as 2D images based on a complex wavelet transform. They parameterize this operation using a set of statistics computed on pairs of coefficients corresponding to basis functions at adjacent spatial locations, orientations, and scales. However, their approach is not a learning-based model and hence offers less flexibility. Furthermore, Zhu et al. [zhu2018image] recently proposed a model that initially processes undersampled input data in the frequency domain and then refines the result in the spatial domain using the inverse Fourier transform. They approximate the inverse Fourier transform using a sequence of connected layers, but one disadvantage is that their transformation has quadratic complexity with respect to the size of the input image. Moreover, the above works are limited to 2D and do not study 3D point-cloud generation in the spectral domain.

3D GANs in spatial-domain: 3D GANs can be primarily categorized into two types: voxel outputs and point-cloud outputs. The latter typically entails more challenges as point-clouds are unordered and highly irregular in nature.

For voxelized 3D object modeling, several influential methods have been proposed in the literature. Wu et al. [wu2016learning] extend the 2D GAN framework to the 3D domain for the first time. Following their work, Smith et al. [smith2017improved] use a novel GAN architecture for 3D shape generation, employing the Wasserstein distance as the loss function. A recent work by Khan et al. [khan2019unsupervised] presents a factorized 3D generative model that sequentially generates shapes in a coarse-to-fine manner. Our approach also uses a two-step procedure (a forward pass and a backward pass) to refine a coarse 3D shape, but a key difference is that they use spatial information to refine the shape, while our method depends on frequency information.

Naive extensions of traditional spatial GANs to 3D point-cloud generation do not produce satisfactory results, due to inherent properties of point-clouds such as being unordered, irregularly distributed collections (see Sec. 3). Achlioptas et al. [achlioptas2017learning] were the first to use GANs to generate point-clouds. They first convert a point-cloud to a compact latent representation and then train a discriminator on it. Although we also use a compact representation, i.e., the SMV, to train the GAN, SMVs provide a richer representation than latent-space approximations and theoretically guarantee accurate reconstruction of the 3D point-cloud. Moreover, Valsesia et al. [valsesia2018learning] propose a graph-convolution-based network to extract localized features from 3D point-clouds, in order to reduce the effect of irregularity. A drawback of their method, however, is the rather high computational complexity of graph convolution and its limited scalability with the resolution of the point-cloud. A recent work by Shu et al. [shu20193d] proposes a tree-structured graph convolution network, which is more computationally efficient. The model proposed by Li et al. [li2018point] handles the irregularity of point-clouds using a separate inference model that captures a latent distribution. In contrast, we effectively reduce the problem to the standard GAN setting by using a fixed-dimensional representation for point-clouds.

3 Problem Formulation

An exchangeable sequence is a sequence of random variables $X = (x_1, x_2, \dots)$ whose joint probability distribution does not vary under position permutations. More formally,

Definition: For a given finite set of random objects $\{x_1, \dots, x_M\}$, let $P(x_1, \dots, x_M)$ be their joint distribution. This finite set is exchangeable if $P(x_1, \dots, x_M) = P(x_{\pi(1)}, \dots, x_{\pi(M)})$ for every permutation $\pi$.

The spatial representation of a point-cloud is a set of $d$-dimensional vectors and, in cases of Euclidean geometry, typically $d = 3$. A set is a collection of elements without any particular order or a fixed number of elements, and thus the probability distribution of a point-cloud is an exchangeable sequence. According to the Hewitt and Savage theorem [hewitt1955symmetric], there exists a latent distribution $\theta$ such that

$$P(x_1, \dots, x_M) = \int_{\theta} \prod_{i=1}^{M} p(x_i \mid \theta)\, dP(\theta). \qquad (1)$$

Eq. 1 shows that in order to properly model $X$ as an exchangeable sequence and obtain a distribution $P$, it is necessary to capture the latent representation $\theta$. In other words, it is difficult for a GAN to model $X$ as an exchangeable sequence only by observing a set of sequences and estimating the marginal distributions $p(x_i)$. In this case, the generative model needs to learn the joint probability distribution $P(x_1, \dots, x_M)$ instead of the marginals. This makes it challenging to extend traditional GANs to the point-cloud generation problem. A straightforward approach to resolve this is to model point-cloud data as ordered, fixed-dimensional vectors. However, this approach does not preserve the integral probability metric (IPM) guarantees of a GAN [li2018point].

In contrast, we propose to model point-cloud data as SMVs, which effectively reduces the problem to the traditional case in two ways: 1) SMVs encode the corresponding shape information in a structured, fixed-dimensional vector, and 2) the vector elements are highly correlated with each other. The task of learning the distribution of the elements of SMVs is theoretically similar to learning the pixel distribution of images, as in the latter case we also only need to capture the joint probability distribution of the pixels of each instance. In image synthesis, however, GANs exploit the correlation of pixels using convolution kernels, which is not possible in the case of SMVs, as their correlation does not depend on proximity. Furthermore, different frequency portions of the SMV show different characteristics. To handle these specific attributes, we propose a series of cascaded GANs, each consisting of only fully connected layers. Since each GAN only needs to generate a specific portion of the SMV, they can be designed as shallow models with few floating-point operations (FLOPs).
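To make the permutation-invariance argument concrete, the toy sketch below (ours, for illustration only; the simple moment summary is a stand-in for the SMV, not the paper's transform) shows why a fixed-dimensional, order-free summary sidesteps the exchangeability problem that a flattened point list suffers from.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((1024, 3))      # an unordered point set
perm = rng.permutation(len(pts))

# Flattening imposes an arbitrary order: the same shape gives different vectors.
print(np.allclose(pts.ravel(), pts[perm].ravel()))    # False

# A fixed-dimensional moment summary (a stand-in for the SMV) is order-free.
moments = lambda p: np.concatenate([p.mean(0), p.std(0), (p ** 3).mean(0)])
print(np.allclose(moments(pts), moments(pts[perm])))  # True
```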

4 Spectral GAN

We propose a 3D generative model that operates entirely in the spectral domain. Such a design offers unique advantages over spatial domain 3D generative models: (a) a compact representation of 3D shapes with an intuitive frequency-domain interpretation, (b) the flexibility to generate high-dimensional shapes with minimal changes to the model complexity, and (c) a permutation invariant representation which handles the irregularity of point-clouds. Below, we first introduce the spherical harmonics representations that serve as the basis for our proposed Spectral GAN model.

Figure 2: The overview of the Spectral Generative Adversarial Network. An unrolled version (with an explicit forward and backward pass) of the training scheme is shown for clarity.

4.1 Spherical Harmonics for 3D Objects

Spherical harmonics are a set of complete and orthogonal basis functions that can efficiently represent functions on the unit sphere $S^2$ in $\mathbb{R}^3$. They are a higher-dimensional analogue of the Fourier series, which forms a basis for functions on the unit circle. The spherical harmonics are defined on $S^2$ as

$$Y_\ell^m(\theta, \phi) = K_\ell^m\, P_\ell^m(\cos\theta)\, e^{im\phi}, \qquad (2)$$

where $\theta \in [0, \pi]$ is the polar angle, $\phi \in [0, 2\pi)$ is the azimuth angle, $\ell$ is a non-negative integer, $m$ is an integer with $|m| \le \ell$, $i$ is the imaginary unit, $K_\ell^m$ is the normalization coefficient and $P_\ell^m$ is the associated Legendre function,

$$P_\ell^m(x) = \frac{(-1)^m}{2^\ell\, \ell!} \left(1 - x^2\right)^{m/2} \frac{d^{\ell+m}}{dx^{\ell+m}} \left(x^2 - 1\right)^\ell. \qquad (3)$$

Since spherical harmonics are orthogonal and complete over the continuous functions on $S^2$ with finite energy, such a function $f$ can be expanded as

$$f(\theta, \phi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} \hat{f}(\ell, m)\, Y_\ell^m(\theta, \phi), \qquad (4)$$

where $\hat{f}(\ell, m)$ are the spherical harmonic moments obtained by

$$\hat{f}(\ell, m) = \int_{S^2} f(\theta, \phi)\, \overline{Y_\ell^m(\theta, \phi)}\, ds. \qquad (5)$$

The sufficient conditions for the expansion in Eq. 4 are given in [hobson1931theory]. In practical cases, a bounded set of spherical harmonic basis functions $\{Y_\ell^m : 0 \le \ell \le L,\ |m| \le \ell\}$ is defined, where $L$ is the maximum degree of the harmonic series.

The process of 3D shape modeling via spherical harmonics can be decomposed into two major steps: first, sample points from the 3D shape surface; then, compute the spherical harmonic moments. Any polar 3D surface can be represented as $r = f(\theta, \phi)$, where $f$ is a single-valued function on the unit sphere $S^2$, $r$ is the radial coordinate with respect to a predefined origin inside the object, and $(\theta, \phi)$ gives the direction vector. Thus, we can compute the moments of the corresponding 3D point-cloud using Eq. 5.
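As a concrete illustration of Eqs. 4 and 5, the sketch below estimates the moments of a toy polar surface and reconstructs it. The uniform Monte-Carlo quadrature and the toy surface are our simplifications (the paper's actual equiangular sampling is described in Appendix A); note that SciPy's sph_harm takes the azimuth angle before the polar angle.

```python
import numpy as np
from scipy.special import sph_harm

rng = np.random.default_rng(0)

# Uniform random directions on the sphere, standing in for the equiangular
# sampling scheme of Appendix A.
v = rng.standard_normal((4000, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
theta = np.arccos(v[:, 2])              # polar angle
phi = np.arctan2(v[:, 1], v[:, 0])      # azimuth angle
r = 1.0 + 0.3 * np.cos(theta) ** 2      # a toy band-limited polar surface

L = 6
# Analysis (Eq. 5): f_hat(l, m) = integral of f * conj(Y_l^m) over S^2,
# estimated by Monte Carlo with uniform weight 4*pi/N.
smv = np.array([(4 * np.pi / len(r)) * np.sum(r * np.conj(sph_harm(m, l, phi, theta)))
                for l in range(L + 1) for m in range(-l, l + 1)])

# Synthesis (Eq. 4): reconstruct the radii from the fixed-length moment vector.
Y = np.stack([sph_harm(m, l, phi, theta)
              for l in range(L + 1) for m in range(-l, l + 1)], axis=1)
r_hat = (Y @ smv).real
print(np.abs(r - r_hat).max())          # modest, dominated by Monte-Carlo error
```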

4.2 Cascaded GAN Structure

SMVs provide a highly structured representation of 3D objects, as explained in Sec. 4.1. Due to this structured nature, the margin for error is significantly lower in our setup, compared to GANs that try to produce spatial domain representations. Also, different frequency bands of the SMV typically entail different characteristics, which makes it highly challenging for a single GAN to generalize over the complete SMV. Therefore, to overcome this obstacle, we use multiple cascaded GANs, where each GAN specializes in generating a pre-defined frequency band of the SMV.

Our approach uses a combination of GAN models to generate the SMV of 3D shapes. Among them, the first model is a regular GAN while the remaining models are conditional GANs (cGANs). The objective of the initial GAN model is given by a two-player min-max game,

$$\min_{G} \max_{D}\ \mathbb{E}_{\mathbf{c} \sim p_{data}}\left[\log D(\mathbf{c})\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right], \qquad (6)$$

where $\mathbf{c}$ is the SMV band sampled from the spectral coefficient distribution $p_{data}$ and $z$ is the noise vector sampled from a Gaussian distribution $p_z$. In a cGAN, synthetic data modes are controlled by forwarding conditioning variables (e.g., a class label) as additional information to the generator. In our case, we use a specific band $\mathbf{y}$ of the SMV predicted by the previous generator to condition the subsequent generator. Then, the cGAN objective becomes

$$\min_{G} \max_{D}\ \mathbb{E}_{\mathbf{c} \sim p_{data}}\left[\log D(\mathbf{c} \mid \mathbf{y})\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z \mid \mathbf{y}))\right)\right]. \qquad (7)$$

Each GAN generates a portion of the complete spherical moment vector for the next GAN to be conditioned upon. The setup includes two major steps: (i) a forward pass and (ii) a backward pass. Accordingly, the overall architecture can be decomposed into two sets of generators, $\mathcal{G}^f = \{G^f_1, \dots, G^f_N\}$ and $\mathcal{G}^b = \{G^b_1, \dots, G^b_N\}$, that implement the forward and backward functions, respectively. In the forward pass, the model generates a coarse shape representation, and the backward pass refines that coarse representation. The basis of our design is the Markovian assumption, i.e., given the outputs of its neighbouring generators, a given generator is independent of the outputs of the rest. We describe the two generation steps in Sec. 4.2.1 and 4.2.2.

4.2.1 Forward pass

In the forward pass, we have a set of generative models $\mathcal{G}^f = \{G^f_1, \dots, G^f_N\}$ that work in unison to generate a coarse representation of a 3D shape. Each $G^f_k$ is conditioned upon the output of $G^f_{k-1}$ and generates a predefined frequency band $\mathbf{c}_k$ of the complete spherical harmonic representation $\mathbf{c}$ of the corresponding 3D shape. It is worth noting that the forward pass alone is sufficient to generate the complete SMV without the aid of a backward pass. However, a critical limitation of this setup is that each GAN is conditioned only upon lower-frequency bands of the SMV; in practice, this results in noisy outputs. Therefore, we also perform a backward pass, which allows the GANs to refine the generation by observing the higher frequencies. This procedure is explained in Sec. 4.2.2.

4.2.2 Backward pass

As explained in Sec. 4.2.1, the aim of the backward pass is to generate a more refined SMV, which in turn produces a more refined 3D shape. Similar to the forward pass, the backward pass is implemented using another set of generators $\mathcal{G}^b = \{G^b_1, \dots, G^b_N\}$. Each $G^b_k$ is conditioned upon the output of $G^b_{k+1}$ and generates a specific portion of the complete SMV. In the training phase, we first transfer the trained weights from $G^f_k$ to $G^b_k$ before training $G^b_k$; this can therefore be intuitively considered as fine-tuning based on higher frequencies (a sketch of the full forward/backward band assembly follows below). The training procedure is explained in Sec. 6.
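The forward/backward interplay can be summarized in a minimal, self-contained sketch (ours; the band sizes and stand-in generators are illustrative, not the paper's trained models):

```python
import numpy as np

N_BANDS = 4
rng = np.random.default_rng(0)

def run_cascade(fwd, bwd, sample_noise):
    bands, cond = [], None
    # Forward pass: each generator is conditioned on the previous (lower)
    # band and emits the next band, yielding a coarse SMV.
    for k in range(N_BANDS):
        bands.append(fwd[k](sample_noise(), cond))
        cond = bands[k]
    # Backward pass: regenerate each band conditioned on the band above it,
    # so every generator now sees higher-frequency context.
    for k in reversed(range(N_BANDS - 1)):
        bands[k] = bwd[k](sample_noise(), bands[k + 1])
    return np.concatenate(bands)        # the complete spherical moment vector

dims = [16, 32, 48, 64]                                     # illustrative band sizes
make_gen = lambda d: (lambda z, c: rng.standard_normal(d))  # stand-in generators
smv = run_cascade([make_gen(d) for d in dims],
                  [make_gen(d) for d in dims],
                  lambda: rng.standard_normal(8))
print(smv.shape)                        # (160,)
```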

5 Spatial domain regularizer

Since SMVs are highly structured, each element of a particular SMV is crucial for accurate reconstruction of its corresponding 3D point-cloud. In other words, even slight variations of a particular SMV cause significant variations in the spatial domain. Therefore, it is cumbersome for a GAN to synthesize SMVs, corresponding to visually pleasing point-clouds, by solely observing a distribution of ground truth SMVs.

To surmount this barrier, we use a spatial-domain regularizer that refines the weights of our cascaded GAN architecture so that it synthesizes more plausible SMVs. The spatial-domain regularizer provides feedback from the spatial domain to the GANs, depending on the quality of the spatial reconstruction. First, we apply a pre-trained PointNet [qi2017pointnet] model to the reconstructed synthetic point-cloud and extract a global feature. Second, using the same procedure, we obtain another global feature from a ground-truth point-cloud of the same class and compute the distance between the two features. Finally, using back-propagation, we update the weights of all the generators to minimize this distance. The final architecture of the proposed model is shown in Fig. 2.

In order to back-propagate error signals from the spatial domain to the spectral domain, we require $\partial \mathcal{L} / \partial \hat{f}(\ell, m)$, where $\hat{f}(\ell, m)$ are the elements of the SMV and $\mathcal{L}$ is the loss. To this end, we derive the following formula: let $\hat{f}$ be the SMV of a particular instance and $\{r_j = f(\theta_j, \phi_j)\}$ be the corresponding reconstructed points on $S^2$ for the same instance. Then, using the chain rule, it can be shown that

$$\frac{\partial \mathcal{L}}{\partial \hat{f}(\ell, m)} = \sum_{j} \frac{\partial \mathcal{L}}{\partial r_j}\, \frac{\partial r_j}{\partial \hat{f}(\ell, m)}, \qquad (8)$$

where, from Eq. 4,

$$\frac{\partial r_j}{\partial \hat{f}(\ell, m)} = Y_\ell^m(\theta_j, \phi_j). \qquad (9)$$

Combining Eq. 8 and 9, we obtain

$$\frac{\partial \mathcal{L}}{\partial \hat{f}(\ell, m)} = \sum_{j} \frac{\partial \mathcal{L}}{\partial r_j}\, Y_\ell^m(\theta_j, \phi_j). \qquad (10)$$

The above expression can be written as a matrix-vector product to obtain the derivatives $\partial \mathcal{L} / \partial \hat{f}$. This makes the transformer a fully differentiable, network-agnostic module that can be used to communicate between the spectral and spatial domains.
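This matrix-vector view can be implemented directly. The sketch below (a minimal PyTorch illustration with a toy grid and a placeholder loss, not the paper's exact pipeline) builds the basis matrix once with SciPy and lets autograd deliver Eq. 10 through the product of Eq. 4:

```python
import numpy as np
import torch
from scipy.special import sph_harm

def sh_basis(l_max, theta, phi):
    # Complex basis matrix Y[j, (l, m)] evaluated at the sample directions.
    cols = [sph_harm(m, l, phi, theta)   # SciPy order: (m, l, azimuth, polar)
            for l in range(l_max + 1) for m in range(-l, l + 1)]
    return np.stack(cols, axis=1)

# Toy grid of sample directions standing in for the paper's sampling scheme.
theta = np.linspace(0.1, np.pi - 0.1, 8)
phi = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
T, P = (a.ravel() for a in np.meshgrid(theta, phi, indexing="ij"))
Y = torch.tensor(sh_basis(3, T, P))      # (64, 16), complex128

c = torch.randn(16, dtype=torch.complex128, requires_grad=True)  # the SMV
radii = (Y @ c).real                     # Eq. 4 as a matrix-vector product
loss = (radii ** 2).mean()               # placeholder spatial loss
loss.backward()                          # autograd realizes Eq. 10
print(c.grad.shape)                      # (16,) gradient w.r.t. the SMV
```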

6 Network architecture and training

Our aim is to generate a compact spectral representation, i.e., a vector, instead of an irregular point set. In the spatial domain, points are correlated across space, and convolutions can be adopted to capture these dependencies: convolution kernels extract local features under the assumption that spatially close data points form useful local features. In contrast, nearby elements in spectral-domain representations do not necessarily exhibit strong correlations. Therefore, convolutional layers fail to excel in this scenario, and thus we opt for fully connected (FC) layers in designing our GANs. Interestingly, our GANs learn to generate quality outputs with a shallow architecture.

Generator architecture: For our main experiments, we use four generators per pass ($N = 4$), with the maximum degree of the SMVs matched to the sampling grid described in Appendix A. Each generator in $\mathcal{G}^f$ generates one of four contiguous frequency bands that together form the complete SMV. Since the generators in $\mathcal{G}^b$ are used to fine-tune those in $\mathcal{G}^f$, they generate the same frequency portions as the latter set. All the generators share the same architecture except for the last FC layer. Each generator consists of three FC layers: the first two layers have 512 neurons each, and the number of neurons in the last layer depends on the output size. The first two layers use ReLU activations, and the final layer uses its own output non-linearity.
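A minimal PyTorch sketch of one such generator follows (ours, for illustration; the tanh output is an assumption, since the exact final activation is not reproduced in this version):

```python
import torch
import torch.nn as nn

class BandGenerator(nn.Module):
    # One cascaded generator: noise (optionally concatenated with a band
    # from the previous generator) -> one frequency band of the SMV.
    def __init__(self, in_dim, band_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),    # two 512-unit FC layers
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, band_dim), nn.Tanh(),  # assumed output activation
        )

    def forward(self, noise, cond=None):
        x = noise if cond is None else torch.cat([noise, cond], dim=-1)
        return self.net(x)

g = BandGenerator(in_dim=32, band_dim=64)          # illustrative sizes
band = g(torch.randn(4, 16), torch.randn(4, 16))   # batch of 4
print(band.shape)                                  # torch.Size([4, 64])
```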

Training: The input to each of our generators, except $G^f_1$, is a fixed-length vector formed by concatenating a noise vector with a vector sampled at equal intervals from the previous generator's output. For $G^f_1$, we use a noise input alone. We use RMSprop as the optimization algorithm, with separate learning rates for the generators in $\mathcal{G}^f$, the generators in $\mathcal{G}^b$ and the discriminators. While training, we use three discriminator updates per generator update. Our sampling procedure is explained in the supplementary material, and the training scheme is illustrated in Algorithm 1.

$X$ = a set of samples from ground-truth point-clouds;
for i iterations do
        for each $G^f_k \in \mathcal{G}^f$ do
               for j iterations do
                      Train $G^f_k$;
        Transfer the trained weights from each $G^f_k$ to $G^b_k$;
        for each $G^b_k \in \mathcal{G}^b$ do
               for j iterations do
                      Train $G^b_k$;
        for p iterations do
               Generate a complete SMV with the cascaded generators;
               Reconstruct the corresponding point-cloud (Eq. 4);
               Extract PointNet global features from the generated and ground-truth point-clouds;
               Compute the feature distance (spatial-regularizer loss, Sec. 5);
               Back-propagate the loss to the spectral domain (Eq. 10);
               Update the weights of all generators;
Algorithm 1 Training procedure for the Spectral-GAN.

7 3D reconstruction from a single image

As a different application, we propose a generative model that can reconstruct 3D objects by observing a single RGB image. The model follows the network architecture proposed in Sec. 6, with a few alterations. Instead of randomly choosing the latent vector, we use a set of image encoders to obtain an object-representative vector $E(I)$ from a 2D input image $I$. We use the same image encoder proposed in [wu20153d], which consists of five spatial convolution layers, each followed by batch normalization and a ReLU non-linearity.

We use one such image encoder for each generator in $\mathcal{G}^f$, and reuse the same encoded vectors when training the corresponding generators in $\mathcal{G}^b$. Each image encoder is trained end-to-end with its generator. The training procedure is similar to Algorithm 1, although we use different loss functions in this case. To optimize the GANs in the spectral domain, we use two loss components: an adversarial loss $\mathcal{L}_{adv}$ and a spectral reconstruction loss $\mathcal{L}_{rec}$. The final spectral-domain loss is

$$\mathcal{L}_{spec} = \mathcal{L}_{adv} + \lambda\, \mathcal{L}_{rec}, \qquad (11)$$

where $\mathcal{L}_{rec}$ is the distance between the ground-truth SMV and the SMV generated from the encoded image,

$$\mathcal{L}_{rec} = \big\| \hat{f} - G(E(I)) \big\|_2. \qquad (12)$$

Here, $E$ is the encoder function, and $D$, $G$ and $I$ are the discriminator function, generator function and image input, respectively; $\lambda$ is a scalar weight and $\hat{f}$ is the ground-truth SMV. For the spatial-domain optimization, we replace the spatial regularization loss with the Chamfer distance:

$$\mathcal{L}_{CH} = \sum_{\mathbf{x} \in S_g} \min_{\mathbf{y} \in S_s} \|\mathbf{x} - \mathbf{y}\|_2^2 + \sum_{\mathbf{y} \in S_s} \min_{\mathbf{x} \in S_g} \|\mathbf{x} - \mathbf{y}\|_2^2, \qquad (13)$$

where $S_g$ and $S_s$ are the ground-truth and generated point sets, respectively. First, we obtain $S_s$ by converting the SMV to a point-cloud using Eq. 4, and then compute the loss (Eq. 13).
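A compact PyTorch sketch of Eq. 13 (ours; torch.cdist supplies the pairwise Euclidean distances):

```python
import torch

def chamfer_distance(s1, s2):
    # Symmetric Chamfer distance (Eq. 13) between (N, 3) and (M, 3) point sets.
    d = torch.cdist(s1, s2)              # (N, M) pairwise distances
    return (d.min(dim=1).values ** 2).sum() + (d.min(dim=0).values ** 2).sum()

s_gt, s_gen = torch.rand(2048, 3), torch.rand(2048, 3)
print(chamfer_distance(s_gt, s_gen))
```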

Figure 3: Qualitative analysis of the results. From the left: first column, ground truth; second column, ground-truth point-clouds reconstructed from SMVs; remaining columns, samples generated by Spectral-GAN.

8 Experiments

In this section, we evaluate our model both qualitatively and quantitatively, and develop useful insights.

8.1 3D shape generation

Qualitative results: We train our model for each category in ModelNet10 and show samples of generated 3D point-clouds in Fig. 3. As expected, the reconstruction from SMVs adds some noise to the ground-truth point-clouds. An interesting observation, however, is that the quality of the generated point-clouds is not far from that of the point-clouds reconstructed from the ground truth. Since the network only consumes the reconstructed ground truth, this observation highlights the ability of our network to accurately model the input data distribution.

Method Type Accuracy
3D-ShapeNet (CVPR’15) [wu20153d] Supervised 93.5%
EC-CNNs (CVPR’17) [simonovsky2017dynamic] Supervised 90.0%
Kd-Network (ICCV’17) [klokov2017escape] Supervised 93.5%
LightNet (3DOR’17) [zhi2017lightnet] Supervised 93.4%
SO-Net (CVPR’18) [li2018so] Supervised 95.5%
Light Field Descriptor [chen2003visual] Unsupervised 79.9%
Vconv-DAE (ECCV’16) [sharma2016vconv] Unsupervised 80.5%
3D-GAN (NIPS’16) [wu2016learning] Unsupervised 91.0%
3D-DesNet (CVPR’18) [xie2018learning] Unsupervised 92.4%
3D-WINN (AAAI’19) [huang20193d] Unsupervised 91.9%
PrimitiveGAN (CVPR’19) [khan2019unsupervised] Unsupervised 92.2%
Spectral-GAN (ours) Unsupervised 93.1%
Table 1: 3D shape classification results on ModelNet10.
Method 3D Data IS
3D-ShapeNet [wu20153d] (CVPR’15) voxel 4.13 ± 0.19
3D-VAE [kingma2013auto] (ICLR’15) voxel 11.02 ± 0.42
3D-GAN [wu2016learning] (NIPS’16) voxel 8.66 ± 0.45
3D-DesNet [xie2018learning] (CVPR’18) voxel 11.77 ± 0.42
3D-WINN [huang20193d] (AAAI’19) voxel 8.81 ± 0.18
PrimitiveGAN [khan2019unsupervised] (CVPR’19) voxel 11.52 ± 0.33
Spectral-GAN (ours) p-cloud 11.58 ± 0.08
Table 2: Inception scores (IS, higher is better) for 3D shape generation. We only compare with voxel-based methods since no point-cloud (p-cloud) based method reports IS.
Method Dresser Toilet Stand Chair Table Sofa Monitor Bed Bathtub Desk
3D-GAN [wu2016learning] (NIPS’16) - - - 469 - 517 - - - 651
3D-DesNet [xie2018learning] (CVPR’18) 414 662 517 490 538 494 511 574 - -
3D-WINN [huang20193d] (AAAI’19) 305 474 456 225 220 151 181 222 305 322
Spectral-GAN (ours) 462 195 452 472 522 180 192 230 208 354
Table 3: FID scores for 3D shape generation (lower is better). All methods except ours are voxel-based.

Quantitative analysis: To assess the proposed approach quantitatively, we compare the Inception Score (IS) of our network with other voxel-based generative models in Tab. 2. In this experiment, we use [qi2016volumetric] as the reference network. IS evaluates a model in terms of both the quality and diversity of the generated shapes. Overall, our model demonstrates the second-highest performance with a score of 11.58 ± 0.08. To the best of our knowledge, our work is the first 3D point-cloud GAN to report IS.

We further evaluate our model using the Fréchet Inception Distance (FID) proposed by Heusel et al. [heusel2017gans] and compare with the state-of-the-art. IS does not always coincide with human judgement on the quality of generated shapes, as it does not directly capture the similarity between real and generated shapes; FID is therefore used as a complementary measure of GAN performance. Huang et al. [huang20193d] were the first to apply FID to 3D GANs, and following them, we also use [qi2016volumetric] as the reference network. As evident from Table 3, our results are on par with the state-of-the-art, achieving the best (lowest) FID in three categories: toilet, night stand and bathtub. Interestingly, Spectral-GAN generally performs better on objects with curved boundaries, a favorable characteristic since curved boundaries are generally difficult to generate in Euclidean space. Note that we convert the point-clouds to meshes before evaluating both IS and FID.

Comparison with point-cloud generation approaches: We use two metrics proposed by Achlioptas et al. [achlioptas2017learning] (i.e., MMD-CD and MMD-EMD) to compare the performance of the proposed architecture with other point-cloud generation methods, and report the results in Table 4. In this experiment, we use the chair, airplane and sofa classes of ShapeNet [yi2016scalable], along with an all-classes setting. As shown, our network gives the best results, which suggests that the shapes generated by our network have high fidelity relative to the test set.
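For reference, a sketch of how MMD-CD can be computed under our reading of Achlioptas et al.: each reference shape is matched to its closest generated shape under the Chamfer distance, and the matched distances are averaged (the random clouds below are placeholders):

```python
import torch

def chamfer(a, b):
    d = torch.cdist(a, b)                # pairwise Euclidean distances
    return (d.min(1).values ** 2).sum() + (d.min(0).values ** 2).sum()

def mmd_cd(generated, reference):
    # MMD-CD: mean, over reference shapes, of the Chamfer distance
    # to the closest generated shape.
    return torch.stack([
        torch.stack([chamfer(g, r) for g in generated]).min()
        for r in reference]).mean()

gen = [torch.rand(256, 3) for _ in range(8)]   # placeholder generated shapes
ref = [torch.rand(256, 3) for _ in range(4)]   # placeholder reference shapes
print(mmd_cd(gen, ref))
```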

Method Class MMD-CD MMD-EMD
r-GAN (dense) [achlioptas2017learning] Chair 0.0029 0.136
r-GAN (conv) [achlioptas2017learning] Chair 0.0030 0.223
Valsesia et al. (no up.) [valsesia2018learning] Chair 0.0033 0.104
Valsesia et al. (up.) [valsesia2018learning] Chair 0.0029 0.097
TreeGAN [shu20193d] Chair 0.0016 0.101
Spectral-GAN (ours) Chair 0.0012 0.080
r-GAN (dense) [achlioptas2017learning] Airplane 0.0009 0.094
r-GAN (conv) [achlioptas2017learning] Airplane 0.0008 0.101
Valsesia et al. (no up.) [valsesia2018learning] Airplane 0.0010 0.102
Valsesia et al. (up.) [valsesia2018learning] Airplane 0.0008 0.071
TreeGAN [shu20193d] Airplane 0.0004 0.068
Spectral-GAN (ours) Airplane 0.0002 0.057
r-GAN (dense) [achlioptas2017learning] Sofa 0.0020 0.146
r-GAN (conv) [achlioptas2017learning] Sofa 0.0025 0.110
Valsesia et al. (no up.) [valsesia2018learning] Sofa 0.0024 0.094
Valsesia et al. (up.) [valsesia2018learning] Sofa 0.0020 0.083
Spectral-GAN (ours) Sofa 0.0020 0.080
r-GAN (dense) [achlioptas2017learning] All classes 0.0021 0.155
TreeGAN [shu20193d] All classes 0.0018 0.107
Spectral-GAN (w/o backward pass) All classes 0.0020 0.127
Spectral-GAN (ours) All classes 0.0015 0.097
Table 4: Comparison with point-cloud generative models.
Figure 4: Scalability of the proposed network with resolution. We obtain increasingly dense outputs by only changing the output-layer size in each training phase; the number of points increases from left to right.

Scalability to high resolutions: A favorable attribute of our network design is the ability to scale to higher resolutions with minimal changes to the architecture. To verify this, we vary the maximum degree of the SMV and train our model separately for each case. Since the number of points is tied to the maximum degree of the SMV (see Appendix A), we obtain samples with a different resolution in each case (see Fig. 4). A key point here is that we only change the output-layer size of the generator (according to the length of the SMV) to generate point-clouds of different resolutions. Fig. 5 illustrates the variation of resolution with the number of FLOPs. Remarkably, our network increases the output resolution by a factor of 40 while the number of FLOPs grows by only a small factor.

Usefulness of backward pass: Fig. 6 illustrates the effect of performing a backward pass. As shown, the forward pass only generates a coarse representation of the shapes, without fine details. This is anticipated, since the cascaded GANs can only observe the lower-frequency portions of the SMV in the forward pass. In contrast, the backward pass observes the higher-frequency portions and fine-tunes the coarse representation by adding complementary details.

Figure 5: Spectral-GAN can generate high-resolution outputs with minimal computational overhead: resolution increases by roughly 40× with only a marginal increase in FLOPs.
Figure 6: Effect of the backward pass. Top row: samples generated using only the forward pass. Bottom row: the same samples after both the forward and backward passes. The backward pass refines the shape by adding finer details.

8.2 Unsupervised 3D Representation Learning

In this section, we evaluate the representation-learning capacity of our discriminators. To this end, we pass the relevant SMV frequency bands of 3D point-clouds through the trained discriminators, extract the features from the third FC layer, and concatenate them to create a feature vector. This feature vector is then fed to an SVM classifier, with classification results obtained in a one-against-the-rest fashion; a sketch of this protocol follows below. The classification results on ModelNet10 are reported in Table 1. As evident, we achieve the highest accuracy among unsupervised methods, 93.1%, which highlights the strong representation-learning capacity of our discriminators.
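A sketch of this classification protocol (ours; the random arrays stand in for the concatenated third-FC-layer discriminator features):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Stand-ins: one feature block per trained discriminator, one row per shape.
feats = [rng.standard_normal((200, 64)) for _ in range(4)]
labels = rng.integers(0, 10, size=200)          # ModelNet10 class ids

X = np.concatenate(feats, axis=1)               # concatenated feature vector
clf = LinearSVC(dual=False).fit(X, labels)      # linear SVM, one-vs-rest
print(clf.score(X, labels))
```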

8.3 3D reconstruction results

In this section, we evaluate the performance of the 3D reconstruction network proposed in Sec. 7. First, we apply 15 random rotations to each 3D model from the IKEA dataset and render the rotated models in front of background images obtained from [xiao2010sun]. Afterwards, we save the rendered images and the corresponding 3D models to create ground-truth image-3D-model pairs. The ground-truth 3D models are aligned using the iterative closest point (ICP) algorithm. While applying the rotations, we constrain the rotation angles and crop the rendered images so that the object is centered. For the test set, we use the original images provided in the IKEA dataset and test our network on four object classes: chair, sofa, table and bed. We train our model separately for each category and use mean average precision (mAP) to evaluate performance. For evaluation, we voxelize the generated and ground-truth point-clouds on a fixed-resolution voxel grid and compute the average precision of voxel prediction. The results and illustrative examples are shown in Table 5 and Fig. 7, respectively. As depicted, we surpass the state-of-the-art in the sofa and bed categories, while remaining competitive in the table category.

Figure 7: Qualitative results for 3D point-cloud reconstruction from a single image.
Method Chair Sofa Bed Table
AlexNet-fc8 [girdhar2016learning] 20.4 38.8 29.5 16.0
AlexNet-conv4 [girdhar2016learning] 31.4 69.3 38.2 19.1
T-L network [girdhar2016learning] 32.9 71.7 56.3 23.3
3D-VAE-GAN [wu2016learning] 47.2 78.8 63.2 42.3
VAE-IWGAN [smith2017improved] 49.3 68.0 65.7 52.2
PrimitiveGAN [khan2019unsupervised] 47.5 77.1 68.4 60.0
Spectral-GAN (ours) 42.3 81.2 71.4 48.3
Table 5: Average precision for 3D point-cloud reconstruction from a single image. The point-clouds are voxelized before computing the score.

9 Conclusion

We propose a generative model for 3D point-clouds that operates in the spectral domain. In contrast to previous methods that operate in the spatial domain, our approach provides a structured way to deal with the inherent redundancy and irregularity of point-clouds. We demonstrate that our model generates plausible 3D outputs, scales to high-dimensional outputs and learns discriminative features in an unsupervised manner. Further, it can be used for the 3D reconstruction task.

References

Appendix A Sampling and reconstruction

A key attribute of any sampling theorem is the minimum number of sample points required to accurately represent a band-limited function in a particular space. Several such sampling theorems have been proposed for signals with finite energy on $S^2$; the most popular choice is Driscoll and Healy's (DH) theorem [driscoll1994computing], which we also use in our work.

According to the DH theorem, to accurately represent a signal on $S^2$ using spherical harmonic moments band-limited at bandwidth $B$, a $2B \times 2B$ grid of equiangular sample points is needed. For all the main experiments in this work, we use a $60 \times 60$ grid (i.e., the 3600 points reported in Appendix C), equally sampled in the $\theta$ and $\phi$ directions, where $\theta \in [0, \pi)$ and $\phi \in [0, 2\pi)$. However, as mentioned in Sec. 4, spherical harmonics can represent only polar 3D shapes, which can result in less visually pleasing spatial representations of non-polar shapes. To overcome this obstacle, we use the following sampling procedure.

First, we scale the 3D mesh to fit inside the unit ball, cast rays outward from the centroid of the shape, and take the first hit location of each ray on a face as a sample point. In this first stage, we sample equiangular points on a grid in the $\theta$ and $\phi$ directions, as above. In the second stage, we rotate the cast rays by a small angular offset and take the last hit location of each ray on a face of the 3D shape as a sample point. The union of these two sample sets provides a more visually pleasing point-cloud for non-polar 3D shapes. This procedure is illustrated in Fig. 1, and a grid-construction sketch follows the figure.

Figure 1: Illustration of the sampling procedure. Red arrows and green arrows demonstrate first stage and second stage sampling, respectively.
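A minimal sketch of the stage-one equiangular ray directions (ours; the grid parameterization is one common DH-style choice, and the stage-two offset rotation is omitted):

```python
import numpy as np

def equiangular_directions(B):
    # 2B x 2B unit ray directions on an equiangular (theta, phi) grid.
    j = np.arange(2 * B)
    theta = np.pi * j / (2 * B)          # polar samples in [0, pi)
    phi = np.pi * j / B                  # azimuth samples in [0, 2*pi)
    t, p = np.meshgrid(theta, phi, indexing="ij")
    return np.stack([np.sin(t) * np.cos(p),
                     np.sin(t) * np.sin(p),
                     np.cos(t)], axis=-1)

dirs = equiangular_directions(30)        # 60 x 60 = 3600 ray directions
print(dirs.shape)                        # (60, 60, 3)
```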

Appendix B Literature on cascaded generative designs

Denton et al. [denton2015deep] proposed a cascaded GAN architecture for 2D image generation. Similar to our work, they use a series of conditional GANs that are conditioned upon one another. These GANs generate image representations in a Laplacian pyramid framework to create increasingly refined images. Instead of generating images directly in the spatial domain, each generative model specializes in generating a specific residual image, according to the corresponding stage of the Laplacian pyramid, and the residuals are finally combined to produce a high-quality image. This is analogous to our work, where each generator produces a specific frequency portion of the SMV, and the portions are finally combined to obtain the full representation. Other recent works also employ cascaded generative architectures to improve image quality: [wang2018high] use a combination of generators operating on low- and high-resolution domains, [Wang_ssganECCV2016] separately train generative models to learn style and structure components, and [Zhang_2017_ICCV] progressively add photorealistic details to low-resolution generated images. The conditional stacked GAN architecture of Huang et al. [Huang_2017_CVPR] is particularly close to ours: it feeds each stage with the previous generator's output and a new latent vector to create novel images. Finally, the seminal SinGAN [Shaham_2019_ICCV] approach designs a pyramid of coarse-to-fine generators that can be trained on a single image. However, as opposed to our work, all the above efforts operate in the spatial domain and have no concrete definition of spectral bands.

Appendix C Computational complexity analysis

A key feature of our network is its high computational efficiency despite the cascaded design. Since the target is a 1-D structured vector, the generators can have a shallow architecture, which decreases the total number of FLOPs during operation. Table 1 compares our model's complexity against state-of-the-art models: we achieve the best performance in terms of MMD-CD and MMD-EMD while having the lowest model complexity. Experiments are conducted for inference with a batch size of 20.

Figure 2: Qualitative results: generated point clouds for each class.
Figure 3: When trained without the spatial-domain regularizer, our network tends to generate spurious artifacts among plausible samples, since small variations in the spectral domain cause significant variations in the spatial domain. A few such examples are illustrated here. These artifacts are effectively suppressed by our spatial-domain regularizer.
Method Class MMD-CD (↓) MMD-EMD (↓) #FLOPs (↓) #Points (↑)
r-GAN (dense) [achlioptas2017learning] Chair 0.0029 0.136 0.1B 2048
Valsesia et al. (up.) [valsesia2018learning] Chair 0.0029 0.097 304B 2048
Spectral-GAN (ours) Chair 0.0012 0.080 0.09B 3600
r-GAN (dense) [achlioptas2017learning] Airplane 0.0009 0.094 0.1B 2048
Valsesia et al. (up.) [valsesia2018learning] Airplane 0.0008 0.071 304B 2048
Spectral-GAN (ours) Airplane 0.0002 0.057 0.09B 3600
r-GAN (dense) [achlioptas2017learning] Sofa 0.0020 0.146 0.1B 2048
Valsesia et al. (up.) [valsesia2018learning] Sofa 0.0020 0.083 304B 2048
Spectral-GAN (ours) Sofa 0.0020 0.080 0.09B 3600
r-GAN (dense) [achlioptas2017learning] All classes 0.0021 0.155 0.1B 2048
Spectral-GAN (ours) All classes 0.0015 0.097 0.09B 3600
Table 1: Model complexity comparison with point-cloud generative models (inference). We achieve the best performance while having the lowest complexity. (↓ denotes lower is better, ↑ denotes higher is better.)