1 Introduction
Pointclouds are a popular 3D representation for real-world objects and scenes. In comparison to other representations such as voxels, meshes and truncated signed distance functions (TSDFs), pointclouds are often an attractive choice for 3D data because they capture shape details accurately, are computationally efficient to process and can be acquired as a default output from several 3D sensors (e.g., LiDAR). However, pointclouds pose a major challenge for deep networks, particularly generative pipelines, due to their inherent redundancy and irregular nature (e.g., permutation invariance).
Due to the complexity of pointclouds, most 3D synthesis approaches are inapplicable to them. For example, generative approaches designed for voxelized inputs [wu2016learning, kingma2013auto, wu20153d, xie2018learning, Huang_2017_CVPR, khan2019unsupervised] cannot work with irregular point sets. To overcome this challenge, some recent generative approaches focus solely on pointcloud synthesis. For example, Achlioptas et al. [achlioptas2017learning] use a GAN framework for 3D pointcloud distribution modelling in the data and autoencoder latent space, Yang et al. [Yang_2019_ICCV] sample 3D points from a prior spatial distribution and then transform them using an invertible parameterization, while [shu20193d, valsesia2018learning] employ graph-structured networks for pointcloud generation.
All such efforts so far operate in the 'spatial domain' (3D Euclidean space), which makes the modelling task relatively difficult due to arbitrary point configurations in 3D space. This leads to a number of roadblocks towards a versatile generative model, e.g., considering a fixed set of points [achlioptas2017learning] and limited scalability to arbitrary point resolutions [shu20193d, valsesia2018learning]. As opposed to previous works, we perform generative modelling in the spectral space using spherical harmonic moment vectors (SMVs), which inherently offers a solution to the above-mentioned problems. Specifically, generating 3D shapes via spectral representations allows us to compactly represent redundant information in pointclouds, easily scale to high-dimensional pointcloud sets, remain invariant to the permutations in unordered point sets and generate high-fidelity shapes with relatively few outliers. Besides, our spectral representation allows us to develop an understanding of the frequency-domain functional space of generic 3D objects. Our main contributions are:

To handle the redundancy and irregularity of pointclouds, we propose the first spectral-domain GAN that synthesizes novel 3D shapes by using a spherical harmonics based representation.

We develop a fully differentiable transformation from the spectral to the spatial domain and back, thus allowing us to integrate knowledge from well-established spatial models.

Through both quantitative and qualitative evaluations, we illustrate that SpectralGAN can generate high-quality 3D shapes with minimal artifacts and can be easily scaled to high-dimensional outputs.

Our proposed framework learns highly discriminative unsupervised features and can seamlessly perform 3D reconstruction from 2D inputs. Moreover, we show that SpectralGAN is scalable to high-resolution outputs (a 40× resolution increase with just 4× the parameters).
2 Related Work
Generative models in spectral-domain: Yang et al. [yang2017dagan] and Souza et al. [souza2018hybrid] develop methods for MRI reconstruction using GANs, and use Fourier domain information to refine the output. In the former approach, the generator operates in the spatial domain, and spectral information is used to refine the output. The latter approach, in contrast, uses two separate networks in the frequency and spatial domains and adopts the Fourier transform to exchange information between the two. A significant drawback of these approaches is that the output resolution is tightly coupled to the network design and thus they lack scalability to high dimensions.
In a different application, Portilla et al. [portilla2000parametric] present a method to synthesize textures as 2D images based on a complex wavelet transform. They parameterize this operation using a set of statistics computed on pairs of coefficients corresponding to basis functions at adjacent spatial locations, orientations, and scales. However, their approach is not a learning model, which offers less flexibility. Furthermore, Zhu et al. [zhu2018image] recently proposed a model that initially processes undersampled input data in the frequency domain and then refines the result in the spatial domain using the inverse Fourier transform. They approximate the inverse Fourier transform using a sequence of connected layers, but one disadvantage is that their transformation has quadratic complexity with respect to the size of the input image. Furthermore, the above works are limited to 2D and do not study the 3D pointcloud generation problem in spectral domain.
3D GANs in spatial-domain: 3D GANs can be primarily categorized into two types: those with voxel outputs and those with pointcloud outputs. The latter typically entails more challenges as pointclouds are unordered and highly irregular in nature.
For voxelized 3D object modeling, several influential methods have been proposed in the literature. Wu et al. [wu2016learning] extend the 2D GAN framework to the 3D domain for the first time. Following their work, Smith et al. [smith2017improved] use a novel GAN architecture for 3D shape generation by employing the Wasserstein distance as the loss function. A recent work by Khan et al. [khan2019unsupervised] presents a factorized 3D generative model that sequentially generates shapes in a coarse-to-fine manner. Our approach also uses a two-step procedure (a forward pass and a backward pass) to refine a coarse 3D shape, but a key difference is that they use spatial information to refine the shape, while our method depends on frequency information.
Naive extensions of traditional spatial GANs to 3D pointcloud generation do not produce satisfactory results, because pointclouds are unordered, irregularly distributed collections (see Sec. 3). Achlioptas et al. [achlioptas2017learning] were the first to use GANs to generate pointclouds. They first convert a pointcloud to a compact latent representation and then train a discriminator on it. Although we also use a compact representation, i.e., the SMV, to train the GAN, SMVs provide a richer representation compared to latent space approximations and theoretically guarantee accurate reconstruction of the 3D pointcloud. Moreover, Valsesia et al. [valsesia2018learning] propose a graph convolution based network to extract localized features from 3D pointclouds, in order to reduce the effect of irregularity. A drawback of their method, however, is the rather high computational complexity of graph convolution and its limited scalability with the resolution of the pointcloud. A recent work by Shu et al. [shu20193d] proposes a tree-structured graph convolution network, which is more computationally efficient. The model proposed by Li et al. [li2018point] attempts to handle the irregularity of pointclouds using a separate inference model which captures a latent distribution. In contrast, we effectively reduce the problem to the standard GAN setting by using a fixed-dimensional representation for pointclouds.
3 Problem Formulation
An exchangeable sequence can be considered as a sequence of random variables $x_1, x_2, \ldots$, where the joint probability distribution of $\{x_i\}$ does not vary under position permutations. More formally,
Definition: For a given finite set $\{x_i\}_{i=1}^{n}$ of random objects, let $p(x_1, \ldots, x_n)$ be their joint distribution. This finite set is exchangeable if $p(x_1, \ldots, x_n) = p(x_{\pi(1)}, \ldots, x_{\pi(n)})$ for every permutation $\pi$.
The spatial representation of a pointcloud is a set $X = \{x_i\}_{i=1}^{n}$ of $d$-dimensional vectors, and in cases of Euclidean geometry, typically, $d = 3$. A set is a collection of elements without any particular order or a fixed number of elements and thus, the probability distribution $p(X)$ is that of an exchangeable sequence. According to the Hewitt and Savage theorem [hewitt1955symmetric], there exists a latent distribution $p(\theta)$ such that,
$$p(X) = \int p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)\, d\theta. \tag{1}$$
Eq. 1 shows that in order to properly model $X$ as an exchangeable sequence and obtain a distribution $p(X)$, it is necessary to capture the latent representation $\theta$. In other words, it is difficult for a GAN to model $X$ as an exchangeable sequence only by observing a set of sequences $\{X_j\}$ and estimating the marginal distributions $p(x_i)$. In this case, the generative model needs to learn the joint probability distribution $p(X)$ instead of $p(x_i)$. This makes it challenging to extend traditional GANs to the pointcloud generation problem. A straightforward approach to resolve this is to model pointcloud data as ordered, fixed-dimensional vectors. However, this approach does not preserve the integral probability metric (IPM) guarantees of a GAN [li2018point].
On the contrary, we propose to model pointcloud data as SMVs, which effectively reduces the problem to the traditional case in two ways: 1) SMVs encode the corresponding shape information in a structured, fixed-dimensional vector and 2) the vector elements are highly correlated with each other. The task of learning the distribution of elements of SMVs is theoretically similar to learning the pixel distribution of images, as in the latter case also, we only need to capture the joint probability distribution of pixels of each instance. In the case of image synthesis, however, GANs exploit the correlation of pixels using convolution kernels, which is not possible in the case of SMVs as correlation does not depend on proximity. Furthermore, different frequency portions of the SMVs show different characteristics. To handle these specific attributes, we propose a series of cascaded GANs, each consisting of only fully connected layers. Since each GAN only needs to generate a specific portion of the SMV, they can be designed as shallow models with fewer floating point operations (FLOPs).
4 Spectral GAN
We propose a 3D generative model that operates entirely in the spectral domain. Such a design offers unique advantages over spatial domain 3D generative models: (a) a compact representation of 3D shapes with an intuitive frequencydomain interpretation, (b) the flexibility to generate highdimensional shapes with minimal changes to the model complexity, and (c) a permutation invariant representation which handles the irregularity of pointclouds. Below, we first introduce the spherical harmonics representations that serve as the basis for our proposed Spectral GAN model.
4.1 Spherical Harmonics for 3D Objects
Spherical harmonics are a set of complete and orthogonal basis functions, which can efficiently represent functions on the unit sphere $\mathbb{S}^2$ in $\mathbb{R}^3$. They are a higher-dimensional analogue of the Fourier series, which forms a basis for functions on the unit circle. The spherical harmonics are defined on $\mathbb{S}^2$ as,
$$Y_l^m(\theta, \phi) = N_l^m\, P_l^m(\cos\theta)\, e^{i m \phi}, \tag{2}$$
where $\theta \in [0, \pi]$ is the polar angle, $\phi \in [0, 2\pi)$ is the azimuth angle, $l$ is a non-negative integer, $m$ is an integer, $|m| \le l$, $i$ is the imaginary unit, $N_l^m$ is the normalization coefficient and $P_l^m$ is the associated Legendre function,
$$P_l^m(x) = \frac{(-1)^m}{2^l\, l!}\, (1 - x^2)^{m/2}\, \frac{d^{\,l+m}}{dx^{\,l+m}} (x^2 - 1)^l. \tag{3}$$
Since spherical harmonics are orthogonal and complete over the continuous functions on $\mathbb{S}^2$ with finite energy, such a function $f$ can be expanded as,
$$f(\theta, \phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \hat{f}_l^m\, Y_l^m(\theta, \phi), \tag{4}$$
where $\hat{f}_l^m$ are the spherical harmonic moments obtained by,
$$\hat{f}_l^m = \int_0^{\pi} \int_0^{2\pi} f(\theta, \phi)\, \overline{Y_l^m(\theta, \phi)}\, \sin\theta\, d\phi\, d\theta. \tag{5}$$
The sufficient conditions for the expansion in Eq. 4 are given in [hobson1931theory]. In practical cases, a bounded set of $(L+1)^2$ spherical harmonic basis functions is defined, where $L$ is the maximum degree of the harmonic series.
The process of 3D shape modeling via spherical harmonics can be decomposed into two major steps: first, sampling points from the 3D shape surface, and then computing the spherical harmonic moments. Any polar 3D surface function can be represented as $r = f(\theta, \phi)$, where $f$ is a single-valued function on the unit sphere $\mathbb{S}^2$, $r$ is the radial coordinate with respect to a predefined origin inside an object, and $(\theta, \phi)$ is the direction vector. Thus, we can compute moments of the corresponding 3D pointcloud using Eq. 5.
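The round trip between a sampled spherical function and its moment vector can be sketched numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes orthonormal complex spherical harmonics with the Condon-Shortley phase, an SMV ordered by degree and then order, and it swaps in Gauss-Legendre quadrature in the polar direction (instead of the equiangular sampling used in the paper) because that makes the discrete moment integral exact for bandlimited inputs. All function names are hypothetical.

```python
import numpy as np
from math import factorial, pi

def assoc_legendre(l, m, x):
    """Associated Legendre P_l^m(x) for 0 <= m <= l (Condon-Shortley phase)."""
    pmm = np.ones_like(x, dtype=float)
    if m > 0:
        somx2 = np.sqrt((1.0 - x) * (1.0 + x))
        fact = 1.0
        for _ in range(m):
            pmm = -pmm * fact * somx2   # builds (-1)^m (2m-1)!! (1-x^2)^(m/2)
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm       # P_{m+1}^m
    for ll in range(m + 2, l + 1):      # upward recurrence in the degree
        pmm, pmmp1 = pmmp1, (x * (2 * ll - 1) * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
    return pmmp1

def sph_harm_lm(l, m, theta, phi):
    """Orthonormal Y_l^m at polar angle theta and azimuth phi."""
    am = abs(m)
    norm = np.sqrt((2 * l + 1) / (4 * pi) * factorial(l - am) / factorial(l + am))
    y = norm * assoc_legendre(l, am, np.cos(theta)) * np.exp(1j * am * phi)
    return y if m >= 0 else (-1) ** am * np.conj(y)

def make_grid(L):
    """Quadrature grid (assumption: Gauss-Legendre in cos(theta), uniform in phi)."""
    x, w = np.polynomial.legendre.leggauss(L + 1)
    n_phi = 2 * L + 1
    phi = 2 * pi * np.arange(n_phi) / n_phi
    T, P = np.meshgrid(np.arccos(x), phi, indexing="ij")
    W = w[:, None] * (2 * pi / n_phi) * np.ones_like(T)
    return T, P, W

def degrees_orders(L):
    """Assumed SMV ordering: (0,0), (1,-1), (1,0), (1,1), (2,-2), ..."""
    return [(l, m) for l in range(L + 1) for m in range(-l, l + 1)]

def synthesize(smv, L, T, P):
    """Truncated expansion: f = sum_lm smv_lm * Y_lm."""
    return sum(c * sph_harm_lm(l, m, T, P) for c, (l, m) in zip(smv, degrees_orders(L)))

def analyze(f, L, T, P, W):
    """Moments: discrete version of the integral of f * conj(Y_lm) over the sphere."""
    return np.array([np.sum(W * f * np.conj(sph_harm_lm(l, m, T, P)))
                     for l, m in degrees_orders(L)])
```

With a maximum degree of 4, `analyze(synthesize(smv, 4, T, P), 4, T, P, W)` should return the 25 input coefficients up to floating-point error, which illustrates why a bandlimited shape can be reconstructed from its SMV.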
4.2 Cascaded GAN Structure
SMVs provide a highly structured representation of 3D objects, as explained in Sec. 4.1. Due to this structured nature, the margin for error is significantly lower in our setup, compared to GANs that try to produce spatial domain representations. Also, different frequency bands of the SMV typically entail different characteristics, which makes it highly challenging for a single GAN to generalize over the complete SMV. Therefore, to overcome this obstacle, we use multiple cascaded GANs, where each GAN specializes in generating a predefined frequency band of the SMV.
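To make the notion of a frequency band concrete, the sketch below splits a moment vector into contiguous degree ranges. It assumes the SMV is stored in order of increasing degree, so that the coefficients of degree l occupy indices l² to (l+1)² − 1; the helper names and the band boundaries in the usage note are hypothetical and are not the ones used in the paper.

```python
def smv_length(max_degree):
    """Number of moments for degrees 0..max_degree."""
    return (max_degree + 1) ** 2

def band_slice(l_lo, l_hi):
    """Index range of all coefficients with degree l_lo <= l <= l_hi."""
    return slice(l_lo ** 2, (l_hi + 1) ** 2)

def split_into_bands(smv, boundaries):
    """Split an SMV into bands given inclusive (l_lo, l_hi) degree pairs."""
    return [smv[band_slice(lo, hi)] for lo, hi in boundaries]
```

For example, with a maximum degree of 7 and the illustrative boundaries `[(0, 1), (2, 3), (4, 5), (6, 7)]`, the four bands have 4, 12, 20 and 28 coefficients, so higher bands are larger even when they span the same number of degrees.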
Our approach uses a combination of GAN models to generate the SMV of 3D shapes. Among them, the first model is a regular GAN while the remaining models are conditional GANs (cGANs). The objective of the initial GAN model is given by a two-player min-max game,
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{6}$$
where $x$ is the SMV band sampled from the spectral coefficient distribution $p_{data}$ and $z$ is the noise vector sampled from a Gaussian distribution $p_z$. In a cGAN, synthetic data modes are controlled by forwarding conditioning variables (e.g., a class label) as additional information to the generator. In our case, we use a specific band of SMVs $y$ predicted by the previous generator to condition the subsequent generator. Then, the cGAN objective becomes,
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]. \tag{7}$$
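As a small numerical illustration, and assuming the objectives above take the standard GAN form E[log D(x)] + E[log(1 − D(G(z)))], the value of the game can be estimated from a batch of discriminator outputs. The function name and the clipping constant are hypothetical; for the conditional case the same estimate applies, with the discriminator outputs computed on inputs paired with the conditioning band.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from discriminator probabilities.

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

A discriminator that outputs 0.5 everywhere yields a value of −log 4, the well-known equilibrium value of this game, while a perfect discriminator pushes the value towards 0.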
Each GAN generates a portion of the complete spherical moment vector for the next GAN to be conditioned upon. The setup includes two major steps: (i) forward pass and (ii) backward pass. Accordingly, the overall architecture can be decomposed into two sets of generators and , that implement the forward and backward functions, respectively. In the forward pass, the model tries to generate a coarse shape representation, and the backward pass refines the coarse representation to generate a refined representation. The basis of our design is the Markovian assumption, i.e., given the outputs from the neighbouring generators, a current generator is independent from the outputs of the rest. We describe the two generation steps in Sec. 4.2.1 and 4.2.2.
4.2.1 Forward pass
In the forward pass, we have a set of generative models , which work in unison to generate a coarse representation of a 3D shape. Each is conditioned upon the outputs of , and generates a predefined frequency band () of the complete spherical harmonic representation () of the corresponding 3D shape. It is worthwhile to note that the forward pass is sufficient to generate the complete SMV without the aid of a backward pass. However, a critical limitation of this setup is that each GAN is only conditioned upon lower frequency bands of the SMV. In practice, this results in noisy outputs. Therefore, we also perform a backward pass, which allows the GANs to refine the generation by observing the higher frequencies. This procedure is explained on Sec. 4.2.2.
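The forward cascade can be sketched with untrained stand-in networks. This is only an illustration of the data flow under several assumptions: the band sizes, noise size and conditioning size are hypothetical, the weights are random, and the output activation (tanh) is a guess because it is not recoverable from the text. Only the three-layer fully connected design with 512 hidden units and ReLU, and the conditioning on an equally spaced subsample of the previous output, follow Sec. 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(in_dim, out_dim, hidden=512):
    """Three FC layers with random (untrained) weights."""
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(0.0, 0.02, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def run_generator(params, x):
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        # ReLU on hidden layers; tanh on the output is an assumption.
        x = np.maximum(x, 0.0) if i < len(params) - 1 else np.tanh(x)
    return x

def subsample(v, k):
    """Pick k entries of v at (approximately) equal intervals."""
    idx = np.linspace(0, len(v) - 1, k).round().astype(int)
    return v[idx]

def build_cascade(band_sizes, noise_dim, cond_dim):
    """First generator sees noise only; later ones see noise + conditioning."""
    return [make_generator(noise_dim if k == 0 else noise_dim + cond_dim, size)
            for k, size in enumerate(band_sizes)]

def forward_pass(generators, noise_dim, cond_dim):
    bands = []
    for k, g in enumerate(generators):
        z = rng.normal(size=noise_dim)
        x = z if k == 0 else np.concatenate([z, subsample(bands[-1], cond_dim)])
        bands.append(run_generator(g, x))
    return np.concatenate(bands)  # coarse SMV: all bands joined in order
```

Calling `forward_pass(build_cascade([16, 48, 80, 112], 100, 8), 100, 8)` returns a 256-dimensional vector, i.e. a complete (untrained) SMV assembled band by band from low to high frequencies.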
4.2.2 Backward pass
As explained in Sec. 4.2.1, the aim of the backward pass is to generate a more refined SMV, which produces a more refined 3D shape. Similar to forward pass, the backward pass is implemented using another set of generators , where . Each is conditioned upon the outputs of and generates a specific portion of the complete SMV. In the training phase, we first transfer the trained weights from to , before training . Therefore, this can be intuitively considered as finetuning based on higher frequencies. The training procedure is explained in Sec. 6.
5 Spatial domain regularizer
Since SMVs are highly structured, each element of a particular SMV is crucial for accurate reconstruction of its corresponding 3D pointcloud. In other words, even slight variations of a particular SMV cause significant variations in the spatial domain. Therefore, it is cumbersome for a GAN to synthesize SMVs, corresponding to visually pleasing pointclouds, by solely observing a distribution of ground truth SMVs.
To surmount this barrier, we use a spatial domain regularizer that can refine the weights of our cascaded GAN architecture, in order to synthesize more plausible SMVs. The spatial domain regularizer provides feedback from the spatial domain to the GANs, depending on the quality of the spatial reconstruction. Firstly, we employ a pre-trained PointNet [qi2017pointnet] model on the reconstructed synthetic pointcloud, and extract a global feature. Secondly, using the same procedure, we obtain another global feature from a ground-truth pointcloud of the same class, and compute the distance between these two features. Finally, using backpropagation, we update the weights of all the generators to minimize the distance. The final architecture of the proposed model is shown in Fig. 2.
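In order to backpropagate error signals from the spatial domain to the spectral domain, we require $\partial \mathcal{L} / \partial \hat{f}$, where $\hat{f}$ is the SMV and $\mathcal{L}$ is the loss. To this end, we derive the following formula: let $\hat{f} = \{\hat{f}_l^m\}$ be the SMV of a particular instance and $f(\theta, \phi)$ be the corresponding reconstructed points on $\mathbb{S}^2$ for the same instance. Then, using the chain rule it can be shown that,
$$\frac{\partial \mathcal{L}}{\partial \hat{f}_l^m} = \sum_{\theta} \sum_{\phi} \frac{\partial \mathcal{L}}{\partial f(\theta, \phi)}\, \frac{\partial f(\theta, \phi)}{\partial \hat{f}_l^m}, \tag{8}$$
where, from Eq. 4,
$$\frac{\partial f(\theta, \phi)}{\partial \hat{f}_l^m} = Y_l^m(\theta, \phi). \tag{9}$$
Combining Eq. 8 and 9, we obtain,
$$\frac{\partial \mathcal{L}}{\partial \hat{f}_l^m} = \sum_{\theta} \sum_{\phi} \frac{\partial \mathcal{L}}{\partial f(\theta, \phi)}\, Y_l^m(\theta, \phi). \tag{10}$$
The above expression can be written as a matrix-vector product to obtain the derivatives $\partial \mathcal{L} / \partial \hat{f}$. This makes the transformation a fully differentiable and network-agnostic module which can be used to communicate between the spectral and spatial domains.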
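The matrix-vector form of this gradient can be checked on a toy problem. The sketch below assumes a real-valued basis matrix whose columns hold basis functions sampled at the reconstruction points (a random matrix stands in for sampled real spherical harmonics) and a simple squared-error loss; both are illustrative choices, not the regularizer used in the paper.

```python
import numpy as np

def reconstruct(basis, smv):
    """Spatial samples from spectral coefficients: f = B @ smv."""
    return basis @ smv

def loss_and_grad_wrt_smv(basis, smv, target):
    """Loss 0.5 * ||f - target||^2 and its gradient w.r.t. the SMV."""
    residual = reconstruct(basis, smv) - target   # dL/df at every sample point
    loss = 0.5 * float(residual @ residual)
    grad_smv = basis.T @ residual                 # chain rule as a matrix-vector product
    return loss, grad_smv
```

Because the reconstruction is linear in the coefficients, the Jacobian is just the basis matrix, so any upstream gradient with respect to the reconstructed points can be mapped to the spectral domain by one multiplication with its transpose.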
6 Network architecture and training
Our aim is to generate a compact spectral representation, i.e., a vector, instead of an irregular point set. In the spatial domain, points are correlated across space, and convolutions can be adopted to capture these dependencies. In fact, convolution kernels extract local features, under the assumption that spatially closer data points form useful local features. In contrast, neighbouring elements in spectral domain representations do not necessarily exhibit strong correlations. Therefore, convolutional layers fail to excel in this scenario and thus, we opt for fully connected (FC) layers in designing our GANs. Interestingly, however, our GANs learn to generate quality outputs with a low-depth architecture.
Generator architecture: For our main experiments, we choose the maximum degree of SMVs and the number of GANs as and , respectively, where and . Each generator in respectively generates frequency bands (), (), () and (). Since are used to fine tune
, they generate the same frequency portions as the latter set. For all the generators, we use the same architecture, except for the last FC layer. Each generator consists of three FC layers: the first two layers have 512 neurons each, and the number of neurons in the last layer depends on the output size. For the first two layers, we use ReLU activation and the final layer has a activation.
Training: The input to each of our generators, except to , is a d vector: a d noise vector concatenated with a d vector sampled in equal intervals from the previous generator output. For , we use a d noise input. We use RMSprop as the optimization algorithm with , where symbols refer to usual notation. For and , we use learning rates and respectively, and for discriminators, we use a learning rate . While training, we use three discriminator updates per generator update. Our sampling procedure is explained in the supplementary materials and the training scheme is illustrated in Algorithm 1.
7 3D reconstruction from single image
As a different application, we propose a generative model which can reconstruct 3D objects by observing a single RGB image. The model follows the network architecture proposed in Sec. 6, with a few alterations. Instead of randomly choosing the latent vector , we use a set of image encoders to obtain an object representative vector , by taking a 2D image as the input. We use the same image encoder proposed in [wu20153d], which consists of five spatial convolution layers with kernel size
with strides
. We use batch normalization after each layer, and ReLu activation as the nonlinearity.
We use such image encoders for each , and use the same vectors generated for when training . Each image encoder is trained endtoend with . The training procedure is similar to Algorithm 1, although we use different loss functions in this case. To optimize the GANs in spectral domain, we use two loss components: an adversarial loss and a spectral reconstruction loss . The final spectral domain loss is,
(11) 
where is the distance between the groundtruth SMV and the generated SMV from and is given as,
(12) 
Here, is the encoder function, , and are discriminator function, generator function and image input, respectively. is a scalar weight. For the spatial domain optimization, we replace spatial regularization loss with the Chamfer distance as follows:
(13) 
where and are groundtruth and generated point sets, respectively. First, we obtain by converting the SMV to a pointcloud using Eq. 4 and then compute the loss (Eq. 13).
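For reference, one common form of the Chamfer distance between two point sets is sketched below. The exact normalization used in the paper is not recoverable from the text, so this variant (mean squared nearest-neighbour distance, summed over both directions) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Nearest neighbour in b for each point of a, and vice versa.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

The distance depends only on nearest-neighbour matches, so it is invariant to the ordering of points in either set and does not require the two sets to have the same size.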
8 Experiments
In this section, we evaluate our model both qualitatively and quantitatively, and develop useful insights.
8.1 3D shape generation
Qualitative results: We train our model for each category in ModelNet10 and show samples of generated 3D pointclouds in Fig. 3. As expected, the reconstruction from SMV adds some noise to the ground-truth pointclouds. An interesting observation, however, is that the quality of the generated pointclouds is not far from that of the pointclouds reconstructed from the ground truth. Since the network only consumes the reconstructed ground truth, this observation highlights the ability of our network to accurately model the input data distribution.
Method  Type  Accuracy 
3DShapeNet (CVPR’15) [wu20153d]  Supervised  93.5% 
ECCNNs (CVPR’17) [simonovsky2017dynamic]  Supervised  90.0% 
KdNetwork (ICCV’17) [klokov2017escape]  Supervised  93.5% 
LightNet (3DOR’17) [zhi2017lightnet]  Supervised  93.4% 
SONet (CVPR’18) [li2018so]  Supervised  95.5% 
Light Field Descriptor [chen2003visual]  Unsupervised  79.9% 
VconvDAE (ECCV’16) [sharma2016vconv]  Unsupervised  80.5% 
3DGAN (NIPS’16) [wu2016learning]  Unsupervised  91.0% 
3DDesNet (CVPR’18) [xie2018learning]  Unsupervised  92.4% 
3DWINN (AAAI’19) [huang20193d]  Unsupervised  91.9% 
PrimitiveGAN (CVPR’19) [khan2019unsupervised]  Unsupervised  92.2% 
SpectralGAN (ours)  Unsupervised  93.1% 
Method  3D Data  Inception score
3DShapeNet [wu20153d] (CVPR’15)  voxel  4.13 ± 0.19
3DVAE [kingma2013auto] (ICLR’15)  voxel  11.02 ± 0.42
3DGAN [wu2016learning] (NIPS’16)  voxel  8.66 ± 0.45
3DDesNet [xie2018learning] (CVPR’18)  voxel  11.77 ± 0.42
3DWINN [huang20193d] (AAAI’19)  voxel  8.81 ± 0.18
PrimitiveGAN [khan2019unsupervised] (CVPR’19)  voxel  11.52 ± 0.33
SpectralGAN (ours)  pointcloud  11.58 ± 0.08
Method  Dresser  Toilet  Night stand  Chair  Table  Sofa  Monitor  Bed  Bathtub  Desk
3DGAN [wu2016learning] (NIPS’16)  -  -  -  469  -  517  -  -  -  651
3DDesNet [xie2018learning] (CVPR’18)  414  662  517  490  538  494  511  574  -  -
3DWINN [huang20193d] (AAAI’19)  305  474  456  225  220  151  181  222  305  322
SpectralGAN (ours)  462  195  452  472  522  180  192  230  208  354
Quantitative analysis: To assess the proposed approach quantitatively, we compare the Inception Score (IS) of our network with other voxel-based generative models in Tab. 2. In this experiment, we use [qi2016volumetric] as the reference network. IS evaluates a model in terms of both quality and diversity of the generated shapes. Overall, our model demonstrates the second highest performance with a score of 11.58 ± 0.08. To the best of our knowledge, our work is the first 3D pointcloud GAN to report IS.
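For readers unfamiliar with the metric, the Inception Score can be computed from the class probabilities that a reference classifier assigns to generated samples, as exp of the mean KL divergence between each conditional distribution and the marginal. The sketch below uses that standard definition; the function name and the smoothing constant are hypothetical, and the reference network itself is not modelled.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (n_samples, n_classes) array of predicted class probabilities."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)            # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))                         # exp(E_x KL(p(y|x) || p(y)))
```

The score is 1 when every sample receives the same prediction (no confidence or no diversity) and reaches the number of classes when predictions are confident and evenly spread over all classes.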
We further evaluate our model using the Fréchet Inception Distance (FID) proposed by Heusel et al. [heusel2017gans], and compare with the state of the art. IS does not always coincide with human judgement regarding the quality of the generated shapes, as it does not directly capture the similarity between the real and generated shapes. Therefore, FID is used as a complementary measure to evaluate GAN performance. Huang et al. [huang20193d] were the first to apply FID to 3D GANs, and following them, we also use [qi2016volumetric] as the reference network. As evident from Table 3, our results are on par with the state of the art, achieving the best (lowest) FID in three categories: toilet, night stand and bathtub. Interestingly, our SpectralGAN generally performs better with objects that have curved boundaries, which is a favorable characteristic, as curved boundaries are generally difficult to generate in Euclidean spaces. Note that we convert the pointclouds to meshes before evaluating with both IS and FID.
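The Fréchet distance underlying FID compares the means and covariances of two sets of feature vectors. The following NumPy sketch uses the standard closed form for two Gaussians and obtains the trace of the matrix square root from the eigenvalues of the covariance product; the feature extractor (the reference network) is outside the sketch, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative up to round-off.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Lower values indicate that the generated feature distribution is closer to the real one; identical feature sets give a distance of zero, and a pure shift of the mean contributes its squared length.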
Comparison with pointcloud generation approaches: We use two metrics proposed in Achlioptas et al. [achlioptas2017learning] (i.e., MMD-CD and MMD-EMD) to compare the performance of the proposed architecture with other pointcloud generation methods, and display the results in Table 4. In this experiment, we use classes of ShapeNet [yi2016scalable]. As shown, our network gives the best results in every group (tied on MMD-CD for the sofa class). Intuitively, this suggests that shapes generated by our network have high fidelity with respect to the test set.
Method  Class  MMD-CD  MMD-EMD
rGAN (dense) [achlioptas2017learning]  Chair  0.0029  0.136
rGAN (conv) [achlioptas2017learning]  Chair  0.0030  0.223
Valsesia et al. (no up.) [valsesia2018learning]  Chair  0.0033  0.104
Valsesia et al. (up.) [valsesia2018learning]  Chair  0.0029  0.097
TreeGAN [shu20193d]  Chair  0.0016  0.101
SpectralGAN (ours)  Chair  0.0012  0.080
rGAN (dense) [achlioptas2017learning]  Airplane  0.0009  0.094
rGAN (conv) [achlioptas2017learning]  Airplane  0.0008  0.101
Valsesia et al. (no up.) [valsesia2018learning]  Airplane  0.0010  0.102
Valsesia et al. (up.) [valsesia2018learning]  Airplane  0.0008  0.071
TreeGAN [shu20193d]  Airplane  0.0004  0.068
SpectralGAN (ours)  Airplane  0.0002  0.057
rGAN (dense) [achlioptas2017learning]  Sofa  0.0020  0.146
rGAN (conv) [achlioptas2017learning]  Sofa  0.0025  0.110
Valsesia et al. (no up.) [valsesia2018learning]  Sofa  0.0024  0.094
Valsesia et al. (up.) [valsesia2018learning]  Sofa  0.0020  0.083
SpectralGAN (ours)  Sofa  0.0020  0.080
rGAN (dense) [achlioptas2017learning]  All classes  0.0021  0.155
TreeGAN [shu20193d]  All classes  0.0018  0.107
SpectralGAN (w/o backward pass)  All classes  0.0020  0.127
SpectralGAN (ours)  All classes  0.0015  0.097
Scalability to high resolutions: A favorable attribute of our network design is the ability to scale to higher resolutions with minimal changes to the architecture. To verify this, we vary the degree of SMV, and train our model separately for each case. Since the number of points is tied to the maximum degree of SMVs as , we obtain samples with different resolutions for each case (see Fig. 4). A key point here is that we only change the output layer size of the generator (according to the length of SMV) to generate pointclouds with different resolutions. Fig. 5 illustrates the variation of resolution with the number of FLOPs. Remarkably, we are able to generate highresolution outputs up to points with only FLOPs. Another intriguing observation is that our network is able to increase the output resolution by a factor of 40, while the number of FLOPs is only increased by a factor around .
Usefulness of backward pass: Fig. 6 illustrates the effect of performing a backward pass. As shown, the forward pass only generates a coarse representation of the shapes without fine details. This is anticipated, since cascaded GANs can only observe the lower frequency portions of SMV in the forward pass. In contrast, the backward pass observes the higher frequency portions, and fine tunes the coarse representation by adding complementary details.
8.2 Unsupervised 3D Representation Learning
In this section, we evaluate the representation learning capacity of our discriminator. To this end, we pass the relevant SMV frequency bands of 3D pointclouds through the trained discriminators, extract the features from the third FC layer, and finally concatenate them to create a feature vector. This feature vector is then fed to a binary SVM classifier and the classification results are obtained in a one-against-the-rest manner. The classification results on ModelNet10 are reported in Table 1. As evident, we achieve the highest accuracy among the unsupervised methods (93.1%), which highlights the excellent representation learning capacity of our discriminators.
8.3 3D reconstruction results
In this section, we evaluate the performance of the 3D reconstruction network proposed in Sec. 7. First, we randomly apply a rotation to each 3D model from the IKEA dataset 15 times, and render the rotated model in front of background images obtained from [xiao2010sun]. Afterwards, we save the rendered images and the corresponding 3D models to create groundtruth image3D model pairs. The ground truth 3Dmodels are manually aligned using the Iterative closest point (ICP) algorithm. While applying rotations, we set the constraints and and crop the rendered images for the object to be in the center of the images. For the test set, we use the original images provided in the IKEA dataset and test our network on four object classes: chair, sofa, table and bed. We train our model separately for each category and use mean average precision (mAP) to evaluate the performance. In evaluation, we voxelize the generated and ground truth pointclouds using a voxel grid, and obtain average precision for voxel prediction. The results and illustrative examples are shown in Table 5 and Fig. 7, respectively. As depicted, we surpass stateoftheart results in sofa and bed categories, while achieving second best results in the table category.
Method  Chair  Sofa  Bed  Table 
AlexNetfc8 [girdhar2016learning]  20.4  38.8  29.5  16.0 
AlexNetconv4 [girdhar2016learning]  31.4  69.3  38.2  19.1 
TL network [girdhar2016learning]  32.9  71.7  56.3  23.3 
3DVAEGAN [wu2016learning]  47.2  78.8  63.2  42.3 
VAEIWGAN [smith2017improved]  49.3  68.0  65.7  52.2 
PrimitiveGAN [khan2019unsupervised]  47.5  77.1  68.4  60.0 
SpectralGAN (ours)  42.3  81.2  71.4  48.3 
9 Conclusion
We propose a generative model for 3D pointclouds that operates in the spectral domain. In contrast to previous methods that operate in the spatial domain, our approach provides a structured way to deal with the inherent redundancy and irregularity of pointclouds. We demonstrate that our model generates sound 3D outputs, can scale to high-dimensional outputs and learns discriminative features in an unsupervised manner. Further, it can be used for the task of 3D reconstruction from a single image.
References
Appendix A Sampling and reconstruction
A key attribute of any sampling theorem is the minimum number of sample points required to accurately represent a bandlimited function in a particular space. Several such sampling theorems have been proposed to represent a signal with finite energy on the unit sphere; a popular choice is the Driscoll and Healy (DH) theorem [driscoll1994computing], which we also use in our work.
According to DH theorem, to accurately represent a signal on using spherical harmonic moments bandlimited at degree , equiangular sampled points are needed. For all the main experiments in this work, we choose and obtain an equally sampled grid in each and directions, where and . However, as mentioned in Sec. 4, spherical harmonics can represent only polar 3D shapes, which can result in less visually pleasing spatial representations of nonpolar shapes. To overcome this obstacle, we follow the following sampling procedure.
First, we scale the 3D mesh to fit inside the unit ball, cast rays outward from the centroid of the shape, and take the first hit location of each ray with a face as a sample point. In the first stage, we sample such equiangular points in a grid, sampled in and directions respectively, where and . In the second stage, we rotate the cast rays in direction, by an amount of , and take the last hit location of each ray with a face of the 3D shape as a sample point. The union of these two sampling sets provides a more visually pleasing pointcloud for non-polar 3D shapes. This procedure is illustrated in Fig. 1.
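The relation between an equiangular grid of ray directions and the resulting pointcloud can be sketched as follows. The grid formula is one common convention for Driscoll-Healy sampling at a given bandwidth (2B × 2B directions) and is an assumption here, since the grid size used in the paper is not recoverable from the text. Ray casting against a mesh is not implemented: the `radius` array stands in for the hit distance along each ray, and all names are hypothetical.

```python
import numpy as np

def equiangular_grid(bandwidth):
    """2B x 2B grid of polar angles theta in (0, pi) and azimuths phi in [0, 2*pi)."""
    n = 2 * bandwidth
    theta = np.pi * (2 * np.arange(n) + 1) / (4 * bandwidth)
    phi = 2 * np.pi * np.arange(n) / n
    return np.meshgrid(theta, phi, indexing="ij")

def polar_to_points(radius, theta, phi):
    """Convert hit distances r = f(theta, phi) into an (N, 3) pointcloud."""
    x = radius * np.sin(theta) * np.cos(phi)
    y = radius * np.sin(theta) * np.sin(phi)
    z = radius * np.cos(theta)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

With a bandwidth of 4 this produces 64 directions; a constant radius of 1 places every point on the unit sphere, whereas hit distances from a real mesh would trace out its surface.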
Appendix B Literature on cascaded generative designs:
Denton et al. [denton2015deep] proposed a cascaded GAN architecture for 2D image generation. Similar to our work, they also use a series of conditional GANs which are conditioned upon one another. These GANs generate image representations in a Laplacian pyramid framework to create increasingly refined images. Instead of generating images directly in the spatial domain, these generative models specialize in generating a specific residual image, according to the corresponding stage of the Laplacian pyramid, which are finally combined together to produce a high quality image. This is analogous to our work, where our generators generate a specific frequency portion of SMVs, which are finally combined together to obtain the full representation. Other recent works also employ cascaded generative architectures to improve image quality e.g., [wang2018high] use a combination of generators operating on low and high resolution domains, [Wang_ssganECCV2016] separately train generative models to learn style and structure components, [Zhang_2017_ICCV] progressively adds photorealistic details in lowresolution generated images. The conditional stacked GAN architecture of Huang et al. [Huang_2017_CVPR] is particularly close to ours, that feeds onto previous generators output and new latent vectors to create novel images. Finally, the seminal SinGAN [Shaham_2019_ICCV] approach designs a pyramid of coarsetofine generators that can be trained on a single image. However, as opposed to current work, all above efforts operate in the spatial domain and have no concrete definition of spectral bands.
Appendix C Computational complexity analysis
A key feature of our network is its high computational efficiency despite being a cascaded design. Since the target is a 1D structured vector, the generators can have a shallow architecture, which decreases the total number of FLOPs during operation. Table 1 compares our model complexity against state-of-the-art models. We achieve the best performance in terms of MMD-CD and MMD-EMD while having the lowest model complexity. Experiments are conducted for inference with a batch size of 20.
Method  Class  MMD-CD  MMD-EMD  #FLOPs  #Points
rGAN (dense) [achlioptas2017learning]  Chair  0.0029  0.136  0.1B  2048
Valsesia et al. (up.) [valsesia2018learning]  Chair  0.0029  0.097  304B  2048
SpectralGAN (ours)  Chair  0.0012  0.080  0.09B  3600
rGAN (dense) [achlioptas2017learning]  Airplane  0.0009  0.094  0.1B  2048
Valsesia et al. (up.) [valsesia2018learning]  Airplane  0.0008  0.071  304B  2048
SpectralGAN (ours)  Airplane  0.0002  0.057  0.09B  3600
rGAN (dense) [achlioptas2017learning]  Sofa  0.0020  0.146  0.1B  2048
Valsesia et al. (up.) [valsesia2018learning]  Sofa  0.0020  0.083  304B  2048
SpectralGAN (ours)  Sofa  0.0020  0.080  0.09B  3600
rGAN (dense) [achlioptas2017learning]  All classes  0.0021  0.155  0.1B  2048
SpectralGAN (ours)  All classes  0.0015  0.097  0.09B  3600