
3D-aware Image Synthesis via Learning Structural and Textural Representations

Making generative models 3D-aware bridges the 2D image space and the 3D physical world, yet remains challenging. Recent attempts equip a Generative Adversarial Network (GAN) with a Neural Radiance Field (NeRF), which maps 3D coordinates to pixel values, as a 3D prior. However, the implicit function in NeRF has a very local receptive field, making it hard for the generator to become aware of the global structure. Meanwhile, NeRF is built on volume rendering, which can be too costly to produce high-resolution results, increasing the optimization difficulty. To alleviate these two problems, we propose a novel framework, termed VolumeGAN, for high-fidelity 3D-aware image synthesis, through explicitly learning a structural representation and a textural representation. We first learn a feature volume to represent the underlying structure, which is then converted to a feature field using a NeRF-like model. The feature field is further accumulated into a 2D feature map as the textural representation, followed by a neural renderer for appearance synthesis. Such a design enables independent control of the shape and the appearance. Extensive experiments on a wide range of datasets show that our approach achieves substantially higher image quality and better 3D control than the previous methods.



1 Introduction

Learning 3D-aware image synthesis has drawn wide attention recently [graf, giraffe, pigan]. An emerging solution is to integrate a Neural Radiance Field (NeRF) [nerf] into a Generative Adversarial Network (GAN) [gan]. Specifically, the 2D Convolutional Neural Network (CNN) based generator is replaced with a generative implicit function, which maps raw 3D coordinates to point-wise densities and colors conditioned on the given latent code. Such an implicit function encodes the structure and the texture of the output image in the 3D space.

However, there are two problems with directly employing NeRF [nerf] in the generator. On the one hand, the implicit function in NeRF produces the color and density of each 3D point using a Multi-Layer Perceptron (MLP) network. With a very local receptive field, the MLP struggles to represent the underlying structure globally when synthesizing images. Thus, using only the 3D coordinates as inputs [graf, pigan, giraffe] is not expressive enough to guide the generator with the global structure. On the other hand, volume rendering generates the pixel values of the output image separately, which requires sampling numerous points along the camera ray for each pixel. The computational cost hence increases significantly as the image size grows. This can cause insufficient optimization during training and, in turn, unsatisfying performance for high-resolution image generation.

Figure 1: Images of faces and cars synthesized by VolumeGAN, which enables the control of viewpoint, structure, and texture.

Prior work has found that 2D GANs benefit from valid representations learned by the generator [interfacegan, higan, xu2021generative]. Such generative representations describe a synthesis with high-level features. For example, Xu et al. [xu2021generative] confirm that a face synthesis model is aware of the landmark positions of the output face, and Yang et al. [higan] identify the multi-level variation factors emerging from generating bedroom images. These representative features encode rich texture and structure information, thereby enhancing the synthesis quality [stylegan] and the controllability [interfacegan] of image GANs. In contrast, as mentioned above, existing 3D-aware generative models directly render pixel values from coordinates [graf, pigan], without learning explicit representations.

In this work, we propose a new generative model, termed VolumeGAN, which achieves 3D-aware image synthesis through explicitly learning a structural and a textural representation. Instead of using the 3D coordinates as the inputs, we generate a feature volume using a 3D convolutional network, which encodes the relationships between various spatial regions and hence compensates for the insufficient receptive field caused by the MLP in NeRF. With the feature volume modeling the underlying structure, we query a coordinate descriptor from the feature volume to describe the structural information of each 3D point. We then employ a NeRF-like model to create a feature field, taking the coordinate descriptor concatenated with the raw coordinates as the input. The feature field is further accumulated into a 2D feature map as the textural representation, followed by a CNN with 1×1 kernels to finally render the output image. In this way, we separately model the structure and the texture with the 3D feature volume and the 2D feature map, enabling disentangled control of the shape and the appearance.

We evaluate our approach on various datasets and demonstrate its superior performance over existing alternatives. In terms of image quality, VolumeGAN achieves a substantially better Fréchet Inception Distance (FID) score [fid]. Taking the FFHQ dataset [stylegan] at 256×256 resolution as an instance, we improve the FID from 36.7 to 9.1. We also enable 3D-aware image synthesis on the challenging indoor scene dataset, i.e., LSUN bedroom [lsun]. Our model further offers stable control of the object pose and shows better consistency across different viewpoints, benefiting from the learned structural representation (i.e., the feature volume). Furthermore, we conduct a detailed empirical study on the learned structural and textural representations, and analyze the trade-off between image quality and 3D properties.

Figure 2: Framework of the proposed VolumeGAN. We first learn a feature volume, starting from a learnable spatial template, as the structural representation. Given the camera pose, we sample points along a camera ray and query the coordinate descriptor of each point from the feature volume via trilinear interpolation. The resulting coordinate descriptors, concatenated with the raw 3D coordinates, are then converted to a generative feature field and further accumulated into a 2D feature map. Such a feature map is regarded as the textural representation, which guides the rendering of the appearance of the output synthesis with the help of a neural renderer.
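The trilinear querying step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `query_descriptor` and the `(channel, depth, height, width)` volume layout are our own assumptions.

```python
import numpy as np

def query_descriptor(volume, x):
    """Trilinearly interpolate a C-channel feature volume of shape (C, D, H, W)
    at a continuous coordinate x = (z, y, x) given in voxel units (assumed layout)."""
    C, D, H, W = volume.shape
    z, y, xx = x
    z0, y0, x0 = int(np.floor(z)), int(np.floor(y)), int(np.floor(xx))
    # Clamp the upper corner at the volume boundary.
    z1, y1, x1 = min(z0 + 1, D - 1), min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dz, dy, dx = z - z0, y - y0, xx - x0
    out = np.zeros(C)
    # Weighted sum over the 8 surrounding voxels.
    for (iz, wz) in ((z0, 1 - dz), (z1, dz)):
        for (iy, wy) in ((y0, 1 - dy), (y1, dy)):
            for (ix, wx) in ((x0, 1 - dx), (x1, dx)):
                out += wz * wy * wx * volume[:, iz, iy, ix]
    return out
```

In practice this lookup is done batched on the GPU, but the per-point arithmetic is exactly this weighted average of the eight neighboring voxel features.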

2 Related work

Neural Implicit Representations. Recent methods [sitzmann2019scene, occupancy, deepsdf, chibane2020implicit, nerf, liu2020neural] propose to represent 3D scenes with neural implicit functions, such as the occupancy field [occupancy], signed distance field [deepsdf], and radiance field [nerf]. To recover these representations from images, they develop differentiable renderers [liu2020dist, niemeyer2020differentiable, yariv2020multiview, wang2021neus] that render implicit functions into images, and optimize the network parameters by minimizing the difference between rendered and observed images. These methods can reconstruct high-quality 3D shapes and perform photo-realistic view synthesis, but they rely on several strong assumptions about the input data, including dense camera views, precise camera parameters, and constant lighting. More recently, some methods [martinbrualla2020nerfw, jain2021putting, meng2021gnerf, graf, pigan, giraffe] have attempted to relax these constraints. By appending an appearance embedding to each input image, [martinbrualla2020nerfw] can recover 3D scenes from multi-view images with varying lighting. [jain2021putting, meng2021gnerf] reconstruct neural radiance fields from very sparse views by applying a discriminator to supervise the images synthesized at novel views. Different from these methods, which require multi-view images, our approach can synthesize high-resolution images by training networks only on unstructured single-view image collections.

Image Synthesis with 2D GANs. Generative Adversarial Networks (GANs) [pggan, gan] have made significant progress in synthesizing photo-realistic images but lack the ability to control the generation. To obtain better controllability in the synthesis process, [interfacegan, higan, sefa, lowrankgan] investigate the latent space of pre-trained GANs to determine semantic directions. Many works [chen2016infogan, peebles2020hessian] add regularizers or modify the network structure [he2021eigengan, stylegan, stylegan2, aliasfreegan] to improve the disentanglement of variation factors without explicit supervision. Besides, recent methods [idinvert, xu2021generative, mganprior, image2stylegan] adopt optimization or train encoders for controlling the attributes of real images with pre-trained GANs. However, these efforts control the generation only in 2D space and ignore the 3D nature of the physical world, resulting in a lack of consistency for view synthesis.

3D-Aware Image Synthesis. 2D GANs lack knowledge of the 3D structure. Some prior works directly introduce a 3D representation to perform 3D-aware image synthesis. VON [VON] generates a 3D shape represented by voxels, which is then projected into the 2D image space by a differentiable renderer. HoloGAN [nguyen2019hologan] proposes a voxelized implicit 3D representation, which is then rendered to 2D space with a reshape operation. While these methods achieve good results, the synthesized images lack fine details and suffer from identity shift because of the restricted voxel resolution. Instead of using the voxel representation, GRAF [graf] and π-GAN [pigan] propose to model 3D shapes by a neural implicit representation, which maps coordinates to RGB colors. However, due to the computationally intensive rendering process, they cannot synthesize high-resolution images with good visual quality. To overcome this problem, [giraffe] first renders low-resolution feature maps with neural feature fields and then generates high-resolution images with 2D CNNs, also taking the coordinates as input. However, severe artifacts across different camera views are introduced because the CNN-based decoder harms the 3D consistency. Unlike previous attempts, we leverage a feature volume to provide the feature descriptor for each coordinate and a neural renderer consisting of 1×1 convolution blocks to synthesize high-quality images with better multi-view consistency and 3D control.

The concurrent work StyleNeRF [gu2021stylenerf] also adopts convolution blocks to synthesize high-quality images. However, we adopt the feature volume to provide a structural description of the synthesized object instead of using regularizers to improve the 3D properties.

3 Method

This work targets learning 3D-aware image generative models from a collection of 2D images. Previous attempts replace the generator of a GAN model with an implicit function [nerf], which maps 3D coordinates to pixel values. To improve the controllability and synthesis quality, we propose to explicitly learn the structural and the textural representations that are responsible for the underlying structure and texture of the object, respectively. Concretely, instead of directly bridging coordinates with densities and RGB colors, we ask the implicit function to transform a 3D feature volume (i.e., the structural representation) into a generative feature field, which is then accumulated into a 2D feature map (i.e., the textural representation). The overall framework is illustrated in Fig. 2. Before going into details, we first briefly introduce the Neural Radiance Field (NeRF), which is a core module of the proposed model.

3.1 Preliminary

The neural radiance field [nerf] is formulated as a continuous function $F: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$, which maps a 3D coordinate $\mathbf{x}$ and a viewing direction $\mathbf{d}$ to an RGB color $\mathbf{c}$ and a volume density $\sigma$. Then, given a sampled ray, we can predict the colors and densities of all the points that the ray goes through, which are accumulated into the pixel value with volume rendering techniques. Typically, the function is parameterized with a multi-layer perceptron (MLP), $\phi(\cdot)$, as the backbone, and two independent heads, $h_c(\cdot)$ and $h_\sigma(\cdot)$, to regress the color and density:

$$\mathbf{c}(\mathbf{x}, \mathbf{d}) = h_c(\phi(\mathbf{x}), \mathbf{d}), \qquad \sigma(\mathbf{x}) = h_\sigma(\phi(\mathbf{x})), \tag{1}$$

where the color $\mathbf{c}$ depends on the viewing direction $\mathbf{d}$ due to variation factors like lighting, while the density $\sigma$ is independent of $\mathbf{d}$.

NeRF is primarily proposed for 3D reconstruction and novel view synthesis, and is trained with supervision from multi-view images. To enable random sampling by learning from a collection of single-view images, recent attempts [graf, pigan] introduce a latent code $\mathbf{z}$ to the function. In this way, the geometry and appearance of the rendered image vary according to the input $\mathbf{z}$, resulting in diverse generation. Such a stochastic implicit function is asked to compete with the discriminator of a GAN [gan] to mimic the distribution of real 2D images. In the learning process, the revised function is supposed to encode the structure and the texture information simultaneously.

3.2 3D-aware Generator in VolumeGAN

To improve the controllability and image quality of the NeRF-based 3D-aware generative model, we propose to explicitly learn a structural representation and a textural representation, which control the underlying structure and texture, respectively. In this part, we will introduce the design of the structural and the textural representations, as well as their integration through a generative neural feature field.

3D Feature Volume as Structural Representation. As pointed out by NeRF [nerf], the low-dimensional coordinate $\mathbf{x}$ should be projected into a higher-dimensional feature to describe complex 3D scenes. For this purpose, a typical solution is to encode $\mathbf{x}$ into Fourier features [vaswani2017attention]. However, such a Fourier transformation cannot introduce additional information beyond the spatial position. It may suffice for reconstructing a fixed scene, but it is far from encoding a distributed feature for synthesizing different object instances. Hence, we propose to learn a grid of features as the input to the implicit function, which gives a more detailed description of each spatial point. We term such a 3D feature volume, $\mathbf{V}$, the structural representation, as it characterizes the underlying 3D structure. To obtain the feature volume, we employ a sequence of 3D convolutional layers with Leaky ReLU (LReLU) activations [lrelu]. Inspired by Karras et al. [stylegan], we apply Adaptive Instance Normalization (AdaIN) [adain] to the output of each layer to introduce diversity into the feature volume. Starting from a learnable 3D tensor, $\mathbf{T}$, the structural representation is generated with

$$\mathbf{V} = \Phi_{N_s}(\cdots \Phi_1(\mathbf{T})), \qquad \Phi_i(\cdot) = \mathrm{AdaIN}\big(\mathrm{LReLU}(\mathrm{Conv3D}(\mathrm{Up}_{s_i}(\cdot)))\big),$$

where $N_s$ denotes the number of layers for structure learning and $s_i$ is the upsampling scale of the $i$-th layer.
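The AdaIN step applied after each 3D convolution can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the helper name `adain_3d` is hypothetical, and the per-channel scale/shift are assumed to come from an affine mapping of the latent code, as in StyleGAN.

```python
import numpy as np

def adain_3d(feat, style_scale, style_shift, eps=1e-8):
    """Adaptive Instance Normalization on a 3D feature volume.
    feat: (C, D, H, W); style_scale / style_shift: (C,), assumed to be
    predicted from the latent code by a learned affine layer."""
    # Normalize each channel over its spatial extent.
    mean = feat.mean(axis=(1, 2, 3), keepdims=True)
    std = feat.std(axis=(1, 2, 3), keepdims=True)
    normalized = (feat - mean) / (std + eps)
    # Re-scale and shift with the style parameters, injecting diversity.
    return (style_scale[:, None, None, None] * normalized
            + style_shift[:, None, None, None])
```

Because the statistics are recomputed per sample, different latent codes yield feature volumes with different channel-wise statistics, which is how diversity enters the structural representation.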

2D Feature Map as Textural Representation. As discussed before, volume rendering can be extremely slow and computationally expensive, making it costly to directly render the raw pixels of a high-resolution image. To mitigate the issue, we propose to learn a feature map at a low resolution, followed by a CNN that renders a high-fidelity result. Here, the 2D feature map is responsible for describing the visual appearance of the final output. The tailing CNN consists of several Modulated Convolutional Layers (ModConv) [stylegan2], also activated by LReLU. To prevent the CNN from weakening the 3D consistency [giraffe], we use a 1×1 kernel size for all layers such that each per-pixel feature is processed independently. In particular, given a 2D feature map, $\mathbf{F}$, as the textural representation, the image $\mathbf{I}$ is generated by

$$\mathbf{I} = \Psi_{N_t}(\cdots \Psi_1(\mathbf{F})), \qquad \Psi_i(\cdot) = \mathrm{LReLU}\big(\mathrm{ModConv}_{1\times 1}(\mathrm{Up}_{u_i}(\cdot))\big),$$

where $N_t$ denotes the number of layers for texture learning and $u_i$ is the upsampling scale of the $i$-th layer.
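A 1×1 modulated convolution is simply a per-pixel linear map whose weights are scaled by the latent style and then demodulated, following StyleGAN2. The sketch below is a hedged NumPy illustration; the function name `modconv_1x1` is our own, not from the paper's codebase.

```python
import numpy as np

def modconv_1x1(feat, weight, style, eps=1e-8):
    """Modulated 1x1 convolution (StyleGAN2-style) on a feature map.
    feat: (Cin, H, W); weight: (Cout, Cin); style: (Cin,) from the latent code.
    With a 1x1 kernel, every pixel is transformed independently, which is
    what preserves the per-pixel 3D consistency discussed above."""
    w = weight * style[None, :]                       # modulate input channels
    demod = 1.0 / np.sqrt((w ** 2).sum(axis=1) + eps) # demodulation factor
    w = w * demod[:, None]
    Cin, H, W = feat.shape
    # A 1x1 conv is a matrix multiply over the channel dimension.
    return (w @ feat.reshape(Cin, H * W)).reshape(-1, H, W)
```

Since no spatial kernel mixes neighboring pixels, each output pixel depends only on the feature accumulated along its own camera ray.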

Bridging Representations with Neural Feature Field. To connect the structural and the textural representations in the framework, we introduce a neural radiance field [nerf] as the bridge. Different from the implicit function in the original NeRF, which maps coordinates to pixel values, we first query the coordinate descriptor, $\mathbf{v}(\mathbf{x})$, from the feature volume, $\mathbf{V}$, given a 3D coordinate $\mathbf{x}$, and then concatenate it with $\mathbf{x}$ to form the input. The implicit function then transforms this input into the density and the feature vector of the field. The above process can be formulated as

$$\mathbf{f}^{(0)} = [\mathbf{v}(\mathbf{x}), \mathbf{x}], \tag{8}$$
$$\mathbf{f}^{(i)} = \sin\big(\boldsymbol{\gamma}_i \odot (\mathbf{W}_i \mathbf{f}^{(i-1)} + \mathbf{b}_i) + \boldsymbol{\beta}_i\big), \quad i = 1, \ldots, N_f, \tag{10}$$
$$\mathbf{F}(\mathbf{x}, \mathbf{d}) = h_F(\mathbf{f}^{(N_f)}, \mathbf{d}), \qquad \sigma(\mathbf{x}) = h_\sigma(\mathbf{f}^{(N_f)}), \tag{11}$$

where $N_f$ denotes the number of layers parameterizing the neural field, while $\mathbf{W}_i$ and $\mathbf{b}_i$ are the learnable layer-wise weights and biases. Eq. 8 concatenates the coordinates onto the feature to explicitly introduce the structural information. Eq. 10 follows Chan et al. [pigan], which conditions the layer-wise output of the backbone on the frequencies, $\boldsymbol{\gamma}_i$, and phase shifts, $\boldsymbol{\beta}_i$, learned from the random noise $\mathbf{z}$. Eq. 11 replaces the color modeling in Eq. 1 with feature modeling.
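One FiLM-conditioned SIREN layer of the backbone, in the style of π-GAN, can be written as a one-liner. This is a sketch under our own naming (`film_siren_layer`); the frequencies and phase shifts are assumed to be produced by a separate mapping network from the noise.

```python
import numpy as np

def film_siren_layer(h, W_mat, b, freq, phase):
    """One FiLM-conditioned SIREN layer: sin(freq * (W h + b) + phase).
    h: (Din,); W_mat: (Dout, Din); b, freq, phase: (Dout,).
    freq and phase are assumed to come from a mapping of the latent code."""
    return np.sin(freq * (W_mat @ h + b) + phase)
```

Stacking a few such layers over the concatenated input of Eq. 8, then applying the two heads of Eq. 11, yields the per-point feature and density.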

A per-pixel final feature $\mathbf{F}(\mathbf{r})$ can be obtained via volume rendering along a ray $\mathbf{r}$ (with viewing direction $\mathbf{d}$). The features regarding different rays are grouped into a 2D feature map as the textural representation, $\mathbf{F}$, which will be further used to render the image:

$$\mathbf{F}(\mathbf{r}) = \sum_{i=1}^{M} T_i \big(1 - \exp(-\sigma_i \delta_i)\big) \mathbf{F}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big), \tag{13}$$

Eq. 13 approximates the integral over the $M$ points sampled on the ray $\mathbf{r}$, where $\delta_i$ stands for the distance between adjacent sampled points.
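The discrete accumulation of Eq. 13 can be sketched directly in NumPy. This is a minimal single-ray illustration (the function name `render_feature` is ours), not the batched implementation used in training.

```python
import numpy as np

def render_feature(sigmas, feats, deltas):
    """Accumulate per-point features along one ray (discrete volume rendering).
    sigmas: (N,) densities; feats: (N, C) per-point features;
    deltas: (N,) distances between adjacent samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)      # per-sample opacity
    # Transmittance T_i: product of (1 - alpha) over all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * feats).sum(axis=0)
```

A fully opaque sample early on the ray dominates the output, while zero density everywhere yields a zero feature, matching the behavior of the continuous integral.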

3.3 Training Pipeline

Generative Sampling. The whole generation process is formulated as $\mathbf{I} = G(\mathbf{z}, \xi)$, where $\mathbf{z}$ is a latent code sampled from a Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\xi$ denotes the camera pose sampled from a prior distribution $p_\xi$, which is tuned for each dataset to be either Gaussian or Uniform.

Discriminator. Like existing approaches for 3D-aware image synthesis [graf, pigan, giraffe], we employ a discriminator to compete with the generator. The discriminator is a CNN consisting of several residual blocks like [stylegan2].

Figure 3: Qualitative comparison between our VolumeGAN and existing alternatives on the FFHQ [stylegan], CompCars [compcars], and LSUN bedroom [lsun] datasets. All images are in 256×256 resolution.

Training Objectives. During training, we randomly sample $\mathbf{z}$ and $\xi$ from the prior distributions, while the real images $\mathbf{I}_r$ are sampled from the real data distribution $p_{\mathrm{data}}$. The generator $G$ and the discriminator $D$ are jointly trained with

$$\mathcal{L} = \mathbb{E}_{\mathbf{z}, \xi}\big[f(D(G(\mathbf{z}, \xi)))\big] + \mathbb{E}_{\mathbf{I}_r \sim p_{\mathrm{data}}}\Big[f(-D(\mathbf{I}_r)) + \lambda \|\nabla_{\mathbf{I}_r} D(\mathbf{I}_r)\|_2^2\Big], \tag{16}$$

where $f(u) = \log(1 + \exp(u))$ is the softplus function. The last term in Eq. (16) is the gradient penalty regularizer and $\lambda$ is the loss weight.
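The loss terms above can be sketched on raw discriminator logits. This is an illustrative NumPy version under our own assumptions (function names `d_loss`/`g_loss` are ours; `grad_real` stands in for the gradients of $D$ with respect to real images, which an autodiff framework would provide).

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), computed stably via logaddexp.
    return np.logaddexp(0.0, x)

def d_loss(real_logits, fake_logits, grad_real, lam=1.0):
    """Discriminator side of Eq. (16): softplus terms plus the R1-style
    gradient penalty on real images. grad_real: (B, ...) gradients of D."""
    penalty = (grad_real.reshape(len(grad_real), -1) ** 2).sum(axis=1).mean()
    return softplus(-real_logits).mean() + softplus(fake_logits).mean() + lam * penalty

def g_loss(fake_logits):
    """Generator's non-saturating loss: push fake logits up."""
    return softplus(-fake_logits).mean()
```

In a real training loop the penalty gradients come from backpropagation through the discriminator; here they are passed in as an array purely for illustration.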

4 Experiment

4.1 Settings

Datasets. We evaluate the proposed VolumeGAN on five real-world unstructured datasets, including CelebA [celeba], Cats [cat], FFHQ [stylegan], CompCars [compcars], and LSUN bedroom [lsun], and a synthetic dataset, Carla [carla]. CelebA contains around 200K face images of 10K identities. For preprocessing on CelebA, we crop from the top of the hair to the bottom of the chin. The Cats dataset contains 6.5K images of cat heads. FFHQ contains 70K real human face images at 1024×1024 resolution; we follow the protocol of [stylegan] to preprocess the faces. CompCars includes 136K images of real cars whose poses vary greatly; since the original images come in different aspect ratios, we center-crop the cars and resize them. The Carla dataset contains 10K images rendered from the Carla driving simulator [carla] using 16 car models with different textures. LSUN bedroom includes 300K samples in various camera views and aspect ratios, which we also preprocess with center cropping. We train VolumeGAN at 128×128 resolution for CelebA, Cats, and Carla, and at 256×256 for FFHQ, CompCars, and LSUN bedroom.

Baselines. We choose four 3D-aware image synthesis approaches as baselines, including HoloGAN [nguyen2019hologan], GRAF [graf], π-GAN [pigan], and GIRAFFE [giraffe]. Baseline models are either officially released by the original papers or trained with the official implementations. More details can be found in Appendix. (We fail to reproduce HoloGAN on LSUN bedroom with the official implementation, hence we do not report its quantitative result; the qualitative bedroom results are borrowed from the original paper [nguyen2019hologan].)

Method                       CelebA 128  Cats 128    Carla 128   FFHQ 256    CompCars 256  Bedroom 256
HoloGAN [nguyen2019hologan]  39.7        40.4        126.4       72.6        65.6          –
GRAF [graf]                  41.1        28.9        41.6        81.3        222.1         63.9
π-GAN [pigan]                15.9        17.7        30.1        53.2        194.5         33.9
GIRAFFE [giraffe]            17.5        20.1        30.8        36.7        27.2          44.2
VolumeGAN (Ours)             8.9 (7.0)   5.1 (12.6)  7.9 (22.2)  9.1 (27.6)  12.9 (14.3)   17.3 (16.6)
Table 1: Quantitative comparisons on different datasets. FID [fid] (lower is better) is used as the evaluation metric. Numbers in brackets indicate the improvement of our VolumeGAN over the second-best method.

Implementation Details. The learnable 3D template is randomly initialized, and a stack of 3D convolutions embeds the template, resulting in a feature volume at 32×32×32 resolution. We sample rays at a resolution of 64×64, and four conditioned MLPs (SIREN [pigan, siren]) with 256 dimensions are adopted to model the feature field and the volume density. We use an Upsample block [stylegan2] and two 1×1 ModConv layers [stylegan2, cips] at each resolution of the neural renderer until reaching the output image resolution. We also apply the progressive training strategy used in StyleGAN [stylegan] and PG-GAN [pggan] to achieve better image quality. For network training, we use the Adam [adam] optimizer over 8 GPUs. The entire training requires the discriminator to see 25,000 images. The batch size is 64, and the weight decay is 0. Unless specified, $\lambda$ in Eq. (16) is set to 1 to balance the loss terms, and the learning rates of the generator and the discriminator are set separately. More details about the network architecture and training can be found in Appendix.

4.2 Main Results

Qualitative Results. Fig. 3 compares the images synthesized by our method and the baselines on FFHQ, CompCars, and LSUN bedroom. The images are sampled from three views and synthesized at 256×256 resolution for visualization. Although all baseline methods can synthesize images under different camera poses on FFHQ, they suffer from low image quality and identity shift across different angles. When transferred to the challenging CompCars with larger viewpoint variations, some baselines such as GRAF [graf] and π-GAN [pigan] struggle to generate realistic cars. HoloGAN achieves good image quality but suffers from multi-view inconsistency. GIRAFFE generates realistic cars, yet the color of the cars changes significantly across views. When tested on bedrooms, HoloGAN, GRAF, π-GAN, and GIRAFFE cannot handle such indoor scene data with larger structure and texture variations.

VolumeGAN can synthesize high-fidelity view-consistent images. Compared with the existing approaches, it generates more fine details, such as teeth (face), headlights (car) and windows (bedroom). Even on the more challenging CompCars and LSUN bedroom datasets, VolumeGAN still achieves satisfying synthesis performance thanks to the feature volume and the neural renderer.

Quantitative Results. We quantitatively evaluate the visual quality of the synthesized images using the Fréchet Inception Distance (FID) [fid]. We follow the evaluation protocol of StyleGAN [stylegan], which adopts 50K real and 50K fake samples to calculate the FID score. All baseline models are evaluated with the same setting for a fair comparison. As shown in Tab. 1, our approach leads to a significant improvement over the baselines, particularly on the challenging datasets with larger pose variations or finer details. Note that although GIRAFFE also uses a neural renderer, our method still outperforms it by a clear margin. This demonstrates that the structural information encoded in the feature volume provides representative visual concepts, resulting in better image quality.

4.3 Ablation Studies

We conduct ablation studies on CelebA to examine the importance of each component in VolumeGAN.

Metrics. In addition to the FID score that measures image quality, we provide two quantitative metrics to measure multi-view consistency and the precision of 3D control. 1) Reprojection Error. We first extract the underlying geometry of an object from the generated density using marching cubes [lorensen1987marching]. Then, we render each object in sequence and sample five viewpoints uniformly to synthesize the images. The depth of each image is rendered from the extracted mesh and used to calculate the reprojection error between two consecutive views by warping them onto each other. Specifically, we fix the yaw to 0 and sample the pitch uniformly. The marching cubes threshold is set to 10, which gives the best mesh visualizations. The reprojection error is calculated in the normalized image space, as in [image2stylegan, idinvert, xu2021generative], to evaluate multi-view consistency. 2) Pose Error. We synthesize 20,000 images and regard the results predicted by a head pose estimator [zhou2020whenet] as the ground truth. The L1 distance between the given camera pose and the predicted pose is reported to evaluate 3D control quantitatively.
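The pose-error metric reduces to a mean L1 distance between the sampled and predicted poses. The sketch below is our own illustrative version: the paper reports a plain L1 distance, and the angle wrap-around handling here is an added assumption for robustness near the 0°/360° boundary.

```python
import numpy as np

def pose_error(given_deg, pred_deg):
    """Mean L1 angular distance (degrees) between the camera poses used
    for synthesis and those predicted by an off-the-shelf pose estimator.
    Wrap-around handling is our assumption, not stated in the paper."""
    diff = np.abs(np.asarray(given_deg, float) - np.asarray(pred_deg, float)) % 360.0
    return np.minimum(diff, 360.0 - diff).mean()
```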

Figure 4: Synthesized results with the front camera view by π-GAN [pigan] and our VolumeGAN, where the faces produced by VolumeGAN are more consistent with the given view, suggesting better 3D controllability.

Ablations on VolumeGAN Components. Our approach uses the feature volume as the structural representation and adopts a neural renderer consisting of 1×1 convolutions to render the textural representation into high-fidelity images. We ablate them to better understand their individual contributions. Our baseline is built upon π-GAN [pigan], which uses conditioned MLPs to achieve 3D-aware image synthesis by mapping coordinates to RGB colors. The layer number of the baseline is set to 4, the same as our setting in Sec. 4.1, for a fair comparison. As shown in Tab. 2, introducing the feature volume that provides the structural representation improves the FID score of the baseline from 18.7 to 13.6.

More importantly, lower reprojection and pose errors are also achieved, demonstrating that the structural representation from the feature volume not only facilitates better visual results but also maintains the 3D properties regarding multi-view consistency and explicit 3D control. On top of this, the neural renderer further improves the FID to 8.9 with only a slight increase in reprojection and pose errors, leading to a new state-of-the-art result on 3D-aware image synthesis. Notably, adding the neural renderer to the baseline alone also boosts the FID score but apparently sacrifices the 3D properties to some extent according to the 3D metrics. This also indicates that FID alone is not a comprehensive metric for evaluating 3D-aware image synthesis. In addition, Fig. 4 shows several synthesized samples of the π-GAN baseline and our approach under the front view; more samples can be found in Appendix. Qualitatively, the poses of our synthesized samples are closer to the given camera view, which is quantitatively reflected by the pose error.

Resolution of the Feature Volume. The feature volume resolution depicts the spatial refinement of the structural representation, and thus it plays an essential role in synthesizing images. Tab. 3 presents the metrics of the synthesis results for various resolutions of feature volume. As the resolution increases, the multi-view consistency and 3D control become better consistently while the visual quality measured by FID fluctuates little. This demonstrates that a more detailed feature volume provides better geometry consistency across various camera poses. However, increasing the feature volume resolution inevitably results in a greater computational burden. As a result, we choose a feature volume resolution of 32 in all of our experiments to maintain the balance between efficiency and image quality.

Neural Renderer Depth. The neural renderer is adopted to convert textural representations into 2D images; thus, its capacity is critical to the quality of the generated images. We adjust its capacity by varying the depth of the neural renderer to investigate its effect. Tab. 4 shows a trade-off between image quality and 3D properties. As the depth of the network increases, better image visual quality can be achieved while the quality of multi-view consistency and 3D control downgrades. This implies that increasing the capacity of the neural renderer would damage the 3D structure to some extent, revealing FID is not a comprehensive metric for 3D-aware image synthesis again. We thus choose the shallower network as the neural renderer for better 3D consistency and control.

Config             FV  NR  FID   Rep-Er  Pose-Er
π-GAN baseline     –   –   18.7  0.071   12.7
+ FV               ✓   –   13.6  0.031   8.3
+ NR               –   ✓   11.3  0.103   12.1
+ FV + NR (Ours)   ✓   ✓   8.9   0.037   8.6
Table 2: Ablation studies on the components of VolumeGAN, including the feature volume (FV) and the neural renderer (NR). "Rep-Er" and "Pose-Er" denote the reprojection error and the pose error, respectively.
Str Res FID Rep-Er Pose-Er Speed (fps)
16 9.0 0.040 9.1 5.58
32 8.9 0.037 8.6 5.15
64 9.2 0.032 8.4 3.86
Table 3: Effect of the size of feature volume. “Str Res” denotes the resolution of the feature volume (i.e., the structural representation).
Depth Tex Res FID Rep-Er Pose-Er
6 64 8.0 0.051 9.7
4 64 8.8 0.046 9.3
2 64 8.9 0.037 8.6
Table 4: Effect of the depth of neural renderer. “Tex Res” denotes the resolution of the 2D feature map (i.e., the textural representation).
Figure 5: Synthesized results by exchanging the structural and the textural latent codes.

4.4 Properties of Learned Representations

A key advantage of our approach over previous attempts is that, by separately modeling the structure and texture with the 3D feature volume and the 2D feature map, our model learns disentangled representations of the object. These representations allow independent control of the shape and the appearance. We visualize the coordinate descriptor and the 3D mesh extracted from the density to interpret the learned representations.

Independent Control of Structure and Texture. At test time, we can easily swap and combine the latent codes regarding the structural and the textural representations individually. In this way, we investigate whether the two representations are well disentangled. For example, we combine the structural representation (i.e., the feature volume code) of one instance with the textural representation (i.e., the generative feature field and neural renderer code) of another. The corresponding results are shown in Fig. 5. The face results show that the feature volume code controls the shape of the face and the hairstyle, whereas the feature field and neural renderer code determine the skin and hair color. Notably, glasses are controlled by the volume code, in line with our perception. We can also successfully swap the structure and texture of cars. This demonstrates that our method disentangles shape and appearance in synthesizing images. Different from GRAF [graf] and GIRAFFE [giraffe], we do not explicitly introduce a shape code and an appearance code to control image synthesis. Thanks to the structural and textural representations in our framework, the disentanglement between shape and appearance emerges naturally.

Figure 6: Visualization of coordinate descriptor. PCA is used to reduce the feature dimension.

Coordinate Descriptor Visualization. To further explore how the feature volume describes the underlying structure, we visualize the coordinate descriptors queried from the feature volume. Specifically, we accumulate the coordinate descriptors along each ray, resulting in a high-dimensional feature map. PCA [pca] is then used to reduce the dimension to 3 for visualization. Fig. 6 shows that the feature volume serves as a coarse structure template: the face outline, hair, and background can be easily recognized. Impressively, the eyes exhibit strong symmetry even under glasses. Compared to raw coordinates, the feature descriptors provide a structured constraint to guide image synthesis, so our method inherently synthesizes images with better visual quality and 3D properties.
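The visualization pipeline can be sketched as below; all tensor shapes and the volume-rendering weights are placeholder assumptions, and PCA is implemented directly via SVD rather than an external library.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, N, C = 16, 16, 12, 32           # image grid, samples per ray, feature dim
descriptors = rng.normal(size=(H, W, N, C))   # per-sample coordinate descriptors
weights = rng.random(size=(H, W, N))          # stand-in for rendering weights
weights /= weights.sum(axis=-1, keepdims=True)

# Accumulate descriptors along each ray: weighted sum over the N sample points.
feat_map = (weights[..., None] * descriptors).sum(axis=2)   # (H, W, C)

# PCA via SVD: center the features and project onto the top-3 directions.
flat = feat_map.reshape(-1, C)
flat = flat - flat.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(flat, full_matrices=False)
rgb = flat @ vt[:3].T                                       # (H*W, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())           # normalize to [0, 1]
vis = rgb.reshape(H, W, 3)                                  # pseudo-color image
```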

Underlying Geometry. Thanks to its view-independent property, the volume density of the implicit representation captures the underlying geometry of the object. We extract this geometry by running marching cubes [lorensen1987marching] on the density, yielding a surface mesh. Fig. 7 shows the meshes across various views and identities. The geometry stays consistent across different views, confirming the good 3D properties of our method.
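A minimal sketch of the geometry-extraction idea, using occupancy thresholding plus surface-voxel detection as a simplified stand-in for the full marching-cubes algorithm (in practice one would call, e.g., `skimage.measure.marching_cubes` on the density grid to obtain an actual triangle mesh); the toy density below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
res = 32
grid = np.linspace(-1, 1, res)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
density = np.exp(-4.0 * (x**2 + y**2 + z**2))   # toy blob-shaped density

# Threshold the density at a level set to get an occupancy grid.
level = 0.5
occ = density > level

# A surface voxel is occupied but has at least one unoccupied 6-neighbor.
interior = np.ones_like(occ)
for axis in range(3):
    for shift in (1, -1):
        interior &= np.roll(occ, shift, axis=axis)
surface = occ & ~interior
```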

Figure 7: 3D Mesh extracted from the density.

5 Conclusion and Discussion

In this paper, we propose a new 3D-aware generative model, VolumeGAN, for synthesizing high-fidelity images. By learning structural and textural representations, our model achieves sufficiently higher image quality and better 3D control on various challenging datasets.

Limitations. Despite the structural representation learned by VolumeGAN, the synthesized 3D mesh surface is still not smooth and lacks fine details. Meanwhile, although we can increase the synthesis resolution by introducing a deeper CNN (i.e., the neural renderer), doing so may weaken the multi-view consistency and 3D control. Future research will focus on generating fine-grained 3D shapes as well as equipping the trailing CNN in VolumeGAN with better 3D properties through introducing regularizers.

Ethical Consideration. Due to its high-quality 3D-aware synthesis, our approach could potentially be misused for deepfake generation. We strongly oppose any abuse of our method that violates privacy or security. On the contrary, we hope it can be used to improve existing fake-detection systems.



This appendix is organized as follows. Appendix A and Appendix B introduce the network structure and the training configurations used in VolumeGAN. Appendix C describes the details of implementing baseline approaches. Appendix D shows more qualitative results.

Appendix A Network Structure

Recall that our VolumeGAN first learns a feature volume with a 3D CNN. The feature volume is then transformed into a feature field using a NeRF-like model. A 2D feature map is finally accumulated from the feature field and rendered into an image with a 2D CNN. Taking the 256 resolution as an example, we illustrate the architectures of these three models in Tab. A1, Tab. A2, and Tab. A3, respectively.
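At a shape level, the three-stage pipeline can be sketched as follows; every learned layer is replaced by a random linear map, and all widths and sample counts are illustrative placeholders rather than the values in Tab. A1, Tab. A2, or Tab. A3.

```python
import numpy as np

rng = np.random.default_rng(0)
C_vol, D = 32, 16          # feature-volume channels and spatial size
H, W, N = 8, 8, 12         # output image grid and samples per camera ray
C_feat = 64                # feature-field output channels

# 1) Structural representation: a 3D feature volume (stand-in for the 3D CNN).
volume = rng.normal(size=(C_vol, D, D, D))

# 2) NeRF-like model: query the volume at each ray sample (nearest-voxel
#    lookup for brevity), then map descriptors to features and densities.
coords = rng.integers(0, D, size=(H, W, N, 3))
desc = volume[:, coords[..., 0], coords[..., 1], coords[..., 2]]  # (C_vol,H,W,N)
desc = np.moveaxis(desc, 0, -1)                                   # (H,W,N,C_vol)
feat = np.tanh(desc @ rng.normal(size=(C_vol, C_feat)))           # (H,W,N,C_feat)
sigma = np.abs(desc @ rng.normal(size=(C_vol, 1)))[..., 0]        # (H,W,N)

# 3) Accumulate along each ray into the 2D textural feature map, then
#    render it to RGB (stand-in for the 2D neural renderer).
weights = sigma / sigma.sum(axis=-1, keepdims=True)
feat_map = (weights[..., None] * feat).sum(axis=2)                # (H,W,C_feat)
image = np.tanh(feat_map @ rng.normal(size=(C_feat, 3)))          # (H,W,3)
```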

Appendix B Training Configurations

Because the data distributions differ widely across datasets, the training parameters also vary greatly. Tab. A4 lists the detailed training configuration for each dataset. Fov, Range, and Steps are the field of view, the depth range, and the number of sampling steps along a camera ray, respectively. Range_h and Range_v denote the horizontal and vertical angle ranges of the camera pose. ’Sample_Dist’ denotes the sampling scheme of the camera pose; we use only Gaussian or Uniform sampling in our experiments. λ is the loss weight of the gradient penalty.
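The two camera-pose sampling schemes listed under ’Sample_Dist’ can be sketched as below; the means and spreads here are placeholders for illustration, not the values in Tab. A4.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose(dist, mean_h=0.0, mean_v=0.0, spread=0.3, n=1):
    """Sample horizontal/vertical camera angles (radians) from the
    given scheme: a Gaussian around the mean, or a bounded Uniform."""
    if dist == "Gaussian":
        h = rng.normal(mean_h, spread, size=n)
        v = rng.normal(mean_v, spread, size=n)
    elif dist == "Uniform":
        h = rng.uniform(mean_h - spread, mean_h + spread, size=n)
        v = rng.uniform(mean_v - spread, mean_v + spread, size=n)
    else:
        raise ValueError(f"unknown sampling scheme: {dist}")
    return h, v

h_g, v_g = sample_pose("Gaussian", n=1000)   # e.g., face datasets
h_u, v_u = sample_pose("Uniform", n=1000)    # e.g., cars / bedrooms
```

Gaussian sampling concentrates poses near the canonical view (suitable for roughly frontal datasets such as faces), while Uniform sampling covers the whole angle range evenly.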

Appendix C Implementation Details of Baselines

HoloGAN [nguyen2019hologan]. We use the official implementation of HoloGAN and train it for 50 epochs. Since the generator of HoloGAN can only synthesize images at lower resolutions, we extend it with extra upsampling blocks to synthesize images at the target resolution for comparison.

GRAF [graf]. We use the official implementation of GRAF and directly use the pre-trained checkpoints of CelebA and Carla provided by the authors. For the other datasets, we train GRAF with the same data and camera parameters as ours at the target resolution.

π-GAN [pigan]. We use the official implementation of π-GAN. We directly use the pre-trained checkpoints of CelebA, Carla, and Cat for comparison, and retrain π-GAN on the other three datasets, i.e., FFHQ, CompCars, and LSUN bedroom. The retrained models are progressively trained from low to high resolution following the official implementation.

GIRAFFE [giraffe]. We use the official GIRAFFE implementation, which provides the pre-trained weights of FFHQ and CompCars. The remaining datasets are trained with the same camera distribution for a fair comparison.

Table A1: Network structure for learning a feature volume as the structural representation. The network starts from a learnable template; output sizes include a depth dimension in addition to height and width.
Table A2: Network structure of the generative feature field. Output sizes include the number of sampling points along each camera ray. FiLM denotes the FiLM layer [perez2018film] and Sine stands for the sine activation [siren].
Table A3: Network structure of the neural renderer, which renders the 2D feature map into a synthesized image. The final block outputs a 3-channel RGB image.
Datasets   Fov   Steps   Sample_Dist   λ
CelebA     12    12      Gaussian      0.2
Cat        12    12      Gaussian      0.2
Carla      30    36      Uniform       1
FFHQ       12    14      Gaussian      1
CompCars   20    30      Uniform       1
Bedroom    26    40      Uniform       1
Table A4: Training configurations regarding different datasets.

Appendix D Additional Results

Synthesis with front camera view. To better illustrate the 3D controllability, we show additional results of generating images with the front view. As shown in Fig. A1, the faces synthesized by VolumeGAN are more consistent with the given view, demonstrating better 3D controllability.

Synthesis with varying camera views. Besides the front camera view, we provide a demo video showing more results with varying camera views. From the video, we can see the continuous 3D control achieved by our VolumeGAN. The demo video also includes comparisons with the state-of-the-art methods, i.e., π-GAN [pigan] and GIRAFFE [giraffe].

Figure A1: More synthesized results with the front camera view by π-GAN [pigan] and our VolumeGAN, where the faces synthesized by VolumeGAN are more consistent with the given view, suggesting better 3D controllability.