EpiGRAF: Rethinking training of 3D GANs

06/21/2022
by   Ivan Skorokhodov, et al.

A very recent trend in generative modeling is building 3D-aware generators from 2D image collections. To induce the 3D bias, such models typically rely on volumetric rendering, which is expensive to employ at high resolutions. Over the past months, more than ten works have appeared that address this scaling issue by training a separate 2D decoder to upsample a low-resolution image (or a feature tensor) produced from a pure 3D generator. But this solution comes at a cost: not only does it break multi-view consistency (i.e., shape and texture change when the camera moves), but it also learns the geometry at low fidelity. In this work, we show that it is possible to obtain a high-resolution 3D generator with SotA image quality by following a completely different route of simply training the model patch-wise. We revisit and improve this optimization scheme in two ways. First, we design a location- and scale-aware discriminator to work on patches of different proportions and spatial positions. Second, we modify the patch sampling strategy based on an annealed beta distribution to stabilize training and accelerate convergence. The resulting model, named EpiGRAF, is an efficient, high-resolution, pure 3D generator, and we test it on four datasets (two introduced in this work) at 256^2 and 512^2 resolutions. It obtains state-of-the-art image quality, high-fidelity geometry and trains ≈ 2.5 × faster than the upsampler-based counterparts. Project website: https://universome.github.io/epigraf.


1 Introduction

Figure 1: We build a pure NeRF-based generator trained in a patch-wise fashion. Left two grids: samples on FFHQ Karras et al. (2019) and Cats Zhang et al. (2008). Middle grids: interpolations between samples on M-Plants and M-Food (upper) and the corresponding geometry interpolations (lower). Right grid: background separation examples. In contrast to the upsampler-based methods, one can naturally incorporate techniques from the traditional NeRF literature into our generator: for background separation, we simply copy-pasted the corresponding code from NeRF++ Zhang et al. (2020).

Generative models for image synthesis achieved remarkable success in recent years and enjoy a lot of practical applications Ramesh et al. (2022); Karras et al. (2020a). While initially they mainly focused on 2D images Ho et al. (2020); Van den Oord et al. (2016); Karras et al. (2019); Brock et al. (2018); Kingma and Dhariwal (2018), recent research explored generative frameworks with partial 3D control over the underlying object in terms of texture/structure decomposition, novel view synthesis or lighting manipulation (e.g.,  Shi et al. (2021); Schwarz et al. (2020); Chan et al. (2021a); Wang et al. (2021a); Chan et al. (2021b); Deng et al. (2022); Pan et al. (2021)). These techniques are typically built on top of the recently emerged neural radiance fields (NeRF) Mildenhall et al. (2020) to explicitly represent the object (or its latent features) in 3D space.

NeRF is a powerful framework, which made it possible to build expressive 3D-aware generators from challenging RGB datasets Chan et al. (2021a); Deng et al. (2022); Chan et al. (2021b). Under the hood, it trains a multi-layer perceptron (MLP) to represent a scene by encoding a density value for each spatial coordinate and a color value for each coordinate and view direction Mildenhall et al. (2020). To synthesize an image, one renders each pixel independently by casting a ray from the camera origin into the corresponding direction and aggregating many color values along it, weighted by their corresponding densities. Such a representation is very expressive, but comes at a cost: rendering a single pixel is computationally expensive, which makes it intractable to produce many pixels in one forward pass. This is not fatal for reconstruction tasks, where the loss can be robustly computed on a subset of pixels, but it creates significant scaling problems for generative NeRFs: they are typically formulated in a GAN-based framework Goodfellow et al. (2014) with 2D convolutional discriminators requiring a full image as input.
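For reference, below is a minimal sketch of the standard volume-rendering quadrature of Mildenhall et al. (2020) for a single ray, assuming pre-computed densities and colors at the ray samples; the function and variable names are illustrative and not EpiGRAF's actual implementation.

```python
# Minimal sketch of the standard NeRF volume-rendering quadrature
# (Mildenhall et al., 2020); names and shapes are illustrative, not
# EpiGRAF's actual code.
import numpy as np

def render_ray(densities, colors, t_vals):
    """Composite per-sample colors along one ray.

    densities: (S,) non-negative sigma values at the S ray samples.
    colors:    (S, 3) RGB values at the samples.
    t_vals:    (S,) distances of the samples from the ray origin.
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # distances between samples
    alphas = 1.0 - np.exp(-densities * deltas)           # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance
    weights = trans * alphas                             # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)       # (3,) pixel color

# Toy usage: 96 samples along a ray (48 coarse + 48 fine steps in the paper).
t = np.linspace(0.5, 2.0, 96)
rgb = render_ray(np.random.rand(96), np.random.rand(96, 3), t)
```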

People address these scaling issues of NeRF-based GANs in different ways, but the dominating approach is to train a separate 2D decoder to produce a high-resolution image from a low-resolution image or feature grid rendered from a NeRF backbone Niemeyer and Geiger (2021b). During the past six months, more than a dozen methods following this paradigm have appeared (e.g., Chan et al. (2021b); Gu et al. (2022); Xu et al. (2021); Or-El et al. (2021); Zhou et al. (2021); Mejjati et al. (2021); Zhang et al. (2021); Jo et al. (2021); Xue et al. (2022); Zhang et al. (2022); Tan et al. (2022)). While using an upsampler allows scaling the model to high resolutions, it comes with two severe limitations: 1) it breaks the multi-view consistency of a generated object, i.e., its texture and shape change when the camera moves; and 2) the geometry is represented only at a low resolution. In our work, we show that by dropping the upsampler and using a simple patch-wise optimization scheme, one can build a 3D generator with better image quality, faster training, and without the above limitations.

Patch-wise training of NeRF-based GANs was originally proposed by GRAF Schwarz et al. (2020) and has been largely neglected by the community since then. The idea is simple: instead of training the generative model on full-size images, one trains it on small random crops. Since the model is coordinate-based Sitzmann et al. (2020); Tancik et al. (2020), it can synthesize an arbitrary subset of pixels without any difficulty. This is a good way to save computation for both the generator and the discriminator, since it makes both of them operate on patches of small spatial resolution. To make the generator learn both texture and structure, crops are sampled at variable scales (but with the same number of pixels): in some sense, this can be seen as optimizing the model on low-resolution images plus high-resolution patches, as sketched below.
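A minimal sketch of such variable-scale patch extraction from a real image is given below. Nearest-neighbour resampling and the 64-pixel patch resolution are illustrative assumptions; for the generator, the same coordinate grid would be rendered directly, since the model is coordinate-based.

```python
# Hedged sketch: cropping a fixed-resolution patch at a random scale s and
# offset (dx, dy) from a full image, so that every training step processes
# the same number of pixels. Nearest-neighbour sampling for simplicity.
import numpy as np

def sample_patch(image, s, dx, dy, r=64):
    """image: (R, R, 3); s in (0, 1]; dx, dy in [0, 1 - s]; returns (r, r, 3)."""
    R = image.shape[0]
    # Normalized pixel-centre coordinates of the patch grid.
    u = dx + s * (np.arange(r) + 0.5) / r          # (r,) horizontal coordinates
    v = dy + s * (np.arange(r) + 0.5) / r          # (r,) vertical coordinates
    iu = np.clip((u * R).astype(int), 0, R - 1)
    iv = np.clip((v * R).astype(int), 0, R - 1)
    return image[np.ix_(iv, iu)]                   # rows = y, cols = x

img = np.random.rand(256, 256, 3)
patch_global = sample_patch(img, s=1.0, dx=0.0, dy=0.0)   # low-res view of the whole image
patch_local = sample_patch(img, s=0.25, dx=0.3, dy=0.5)   # high-res local crop
```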

In our work, we improve patch-wise training in two crucial ways. First, we redesign the discriminator to make it better suited to operating on image patches of variable scales and locations. Convolutional filters of a neural network learn to capture different patterns in their inputs depending on their semantic receptive fields Krizhevsky et al. (2012); Olah et al. (2017). This is why it is detrimental to reuse the same discriminator to judge both high-resolution local and low-resolution global patches: it places the additional burden of mixing filter responses of different scales on it. To mitigate this, we propose to modulate the discriminator's filters with a hypernetwork Ha et al. (2016), which predicts which filters to suppress or reinforce given the patch scale and location.

Second, we change the random scale sampling strategy from an annealed uniform to an annealed beta distribution. Typically, patch scales are sampled from a uniform distribution Schwarz et al. (2020); Meng et al. (2021); Chai et al. (2022), whose lower bound is gradually decreased (i.e., annealed) from its initial value to a small minimum value during the first iterations of training. This sampling strategy prevents learning high-frequency details early on in training and puts too little attention on the structure after the lower bound reaches its final value. This makes the overall convergence of the generator slower and less stable, which is why we propose to sample patch scales from a beta distribution instead, whose parameter is gradually annealed during training. In this way, the model starts learning high-frequency details immediately after training starts and keeps more focus on the structure after the annealing is done. This simple change stabilizes training and lets the model converge faster than the typically used uniform distribution Schwarz et al. (2020); Chai et al. (2022); Meng et al. (2021).

We use those two ideas to develop a novel state-of-the-art 3D GAN: Efficient patch-informed Generative Radiance Fields (EpiGRAF). We employ it for high-resolution 3D-aware image synthesis on four datasets: FFHQ Karras et al. (2019), Cats Zhang et al. (2008), Megascans Plants and Megascans Food. The last two benchmarks are introduced in our work and contain renderings of photo-realistic scans of different plants and food objects (described in §4). They are much more difficult in terms of geometry and are well suited for assessing the structural limitations of modern 3D-aware generators.

Our model uses a pure NeRF-based backbone; as a result, it represents geometry at high resolution and does not suffer from multi-view inconsistency artifacts, in contrast to upsampler-based generators. Moreover, it has higher or comparable image quality (as measured by FID Heusel et al. (2017)) and a lower training cost. Also, in contrast to upsampler-based 3D GANs, our generator can naturally incorporate techniques from the traditional NeRF literature. To demonstrate this, we incorporate background separation into our framework by simply copy-pasting the corresponding code from NeRF++ Zhang et al. (2020).

Figure 2: Our generator (left) is purely NeRF-based and uses the tri-plane backbone Chan et al. (2021b) with the StyleGAN2 Karras et al. (2020b) decoder (but without the 2D upsampler). Our discriminator (right) is also based on StyleGAN2, but is modulated by the patch location and scale parameters. We use patch-wise optimization for training Schwarz et al. (2020) with our proposed beta scale sampling, which allows our model to converge 2-3× faster than the upsampler-based architectures despite the generator modeling geometry at full resolution (see Tab 1).

2 Related work

Neural Radiance Fields. Neural Radiance Fields (NeRF) is an emerging area Mildenhall et al. (2020), which combines neural networks with volumetric rendering techniques to perform novel-view synthesis Mildenhall et al. (2020); Zhang et al. (2020); Barron et al. (2021), image-to-scene generation Yu et al. (2021), surface reconstruction Oechsle et al. (2021); Wang et al. (2021b); Niemeyer et al. (2020) and other tasks Chen et al. (2021); Hao et al. (2021); Park et al. (2020a). In our work, we employ them in the context of 3D-aware generation from a dataset of RGB images Schwarz et al. (2020); Chan et al. (2021a).

3D generative models. A popular way to learn a 3D generative model is to train it on 3D data or in an autoencoder's latent space (e.g., Chen and Zhang (2019); Wu et al. (2015); Achlioptas et al. (2018); Luo and Hu (2021); Li et al. (2021); Mittal et al. (2022); Kosiorek et al. (2021)). This requires explicit 3D supervision, and methods have appeared that train from RGB datasets with segmentation masks, keypoints or multiple object views Gadelha et al. (2017); Li et al. (2019); Pavllo et al. (2020). More recently, works have appeared that train from single-view RGB only, including mesh-generation methods Henderson et al. (2020); Ye et al. (2021); Pavllo et al. (2021) and methods that extract 3D structure from pretrained 2D GANs Shi et al. (2021); Pan et al. (2020). Recent neural rendering advancements have also made it possible to train NeRF-based generators Schwarz et al. (2020); Chan et al. (2021a); Niemeyer and Geiger (2021a) from purely RGB data from scratch; this has become the dominating direction since then, and such models are typically formulated in a GAN-based framework Goodfellow et al. (2014).

NeRF-based GANs. HoloGAN Nguyen-Phuoc et al. (2019) generates a 3D feature voxel grid which is projected onto a plane and then upsampled. GRAF Schwarz et al. (2020) trains a noise-conditioned NeRF in an adversarial manner. π-GAN Chan et al. (2021a) builds upon it and uses progressive growing and hypernetwork-based Ha et al. (2016) conditioning in the generator. GRAM Deng et al. (2022) builds on top of π-GAN and samples ray points on a set of learnable iso-surfaces. GNeRF Meng et al. (2021) adapts GRAF for learning a scene representation from RGB images without known camera parameters. GIRAFFE Niemeyer and Geiger (2021b) uses a composite scene representation for better controllability. CAMPARI Niemeyer and Geiger (2021a) learns a camera distribution and a background separation network with inverse sphere parametrization Zhang et al. (2020). To mitigate the scaling issue of volumetric rendering, many recent works train a 2D decoder under different multi-view consistency regularizations to upsample a low-resolution volumetrically rendered feature grid Chan et al. (2021b); Gu et al. (2022); Xu et al. (2021); Or-El et al. (2021); Zhou et al. (2021); Xue et al. (2022); Zhang et al. (2022). However, none of such regularizations can currently provide the multi-view consistency of pure NeRF-based generators.

Patch-wise generative models. Patch-wise training has been routinely utilized to learn the textural component of an image distribution when the global structure is provided by segmentation masks, sketches, latents or other sources (e.g., Isola et al. (2017); Shaham et al. (2019); Choi et al. (2018); Vinker et al. (2021); Park et al. (2020b, 2019); Lin et al. (2021); Skorokhodov et al. (2021b)). Recently, works have appeared that sample patches at variable scales, so that a patch can carry global information about the whole image. They use this to train a generative NeRF Schwarz et al. (2020), fit a neural representation in an adversarial manner Meng et al. (2021), or train a 2D GAN on a dataset of variable resolution Chai et al. (2022).

3 Model

Figure 3: Comparing the annealed uniform (left) and beta (middle) patch scale sampling in terms of their probability density functions (PDFs); for visualization purposes, we clamp the maximum density value to 5. The right panel shows an additional PDF, provided for completeness. The uniform distribution, whose lower bound is annealed from 0.9 down to its minimum value, does not put any attention on high-frequency details in the beginning and treats small-scale and large-scale patches equally at the end of the annealing. The beta distribution, whose parameter is annealed to its final value, in contrast, lets the model learn high-resolution texture immediately after training starts and puts more focus on the structure at the end.

We build upon StyleGAN2 Karras et al. (2020b), replacing its generator with the tri-plane-based NeRF model Chan et al. (2021b) and using its discriminator as the backbone. We train the model on small patches of a fixed resolution but random scales, instead of on the full-resolution images. Scales are randomly sampled from a time-varying distribution.

3.1 3D generator

Compared to upsampler-based 3D GANs Gu et al. (2022); Niemeyer and Geiger (2021b); Xue et al. (2022); Zhou et al. (2021); Chan et al. (2021b); Zhang et al. (2022), we use a pure NeRF Mildenhall et al. (2020) as our generator and utilize the tri-plane representation Chan et al. (2021b); Chen et al. (2022) as the backbone. It consists of three components: 1) a mapping network, which transforms a noise vector into a latent vector; 2) a synthesis network, which takes the latent vector and synthesizes three 32-dimensional feature planes; 3) a tri-plane decoder network, which takes a spatial coordinate and the tri-planes as input and produces the RGB color and density value at that point by interpolating the tri-plane features at the given coordinate and processing them with a tiny MLP. In contrast to classical NeRF Mildenhall et al. (2020), we do not utilize view-direction conditioning, since it worsens multi-view consistency Chan et al. (2021a) in GANs trained on RGB datasets with a single view per instance. To render a single pixel, we follow the classical volumetric rendering pipeline with hierarchical sampling Mildenhall et al. (2020); Chan et al. (2021a), using 48 ray steps in the coarse and 48 in the fine sampling stage. See the accompanying source code for more details.
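Below is a hedged sketch of a tri-plane lookup in PyTorch: a 3D point is projected onto the three axis-aligned planes, their features are bilinearly sampled and aggregated, and a tiny MLP decodes them into color and density. The plane resolution, the summation-based aggregation, the sigmoid on the color and the MLP sizes are illustrative assumptions, not the exact architecture.

```python
# Hedged sketch of a tri-plane lookup; sizes and aggregation are assumptions.
import torch
import torch.nn.functional as F

def query_triplane(planes, points, mlp):
    """planes: (3, C, H, W) feature planes for the (xy, xz, yz) projections.
    points: (N, 3) coordinates in [-1, 1]^3. Returns rgb (N, 3), sigma (N,)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    projections = [torch.stack(p, dim=-1) for p in ((x, y), (x, z), (y, z))]
    feats = 0.0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                                   # (1, N, 1, 2)
        sampled = F.grid_sample(plane[None], grid, mode='bilinear',
                                align_corners=False)                  # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].t()                       # (N, C), summed over planes
    out = mlp(feats)                                                  # (N, 4)
    return torch.sigmoid(out[:, :3]), F.softplus(out[:, 3])           # RGB, density

planes = torch.randn(3, 32, 256, 256)
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
rgb, sigma = query_triplane(planes, torch.rand(1024, 3) * 2 - 1, mlp)
```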

3.2 2D scale/location-aware discriminator

Our discriminator is built on top of StyleGAN2 Karras et al. (2020b). Since we train the model in a patch-wise fashion, the original backbone is not well suited for this: its convolutional filters are forced to adapt to signals of very different scales, extracted from different locations. A natural way to resolve this problem is to use separate discriminators depending on the scale, but that strategy has three limitations: 1) each particular discriminator receives less overall training signal (since the batch size is limited); 2) from an engineering perspective, it is more expensive to evaluate a convolutional kernel with different parameters on different inputs; 3) one can use only a small, fixed number of possible patch scales. This is why we develop a novel hypernetwork-modulated Ha et al. (2016); Skorokhodov et al. (2022) discriminator architecture that operates on patches with continuously varying scale.

To modulate the convolutional kernels of the discriminator D, we define a hypernetwork H as a 2-layer MLP with a tanh non-linearity at the end, which takes the patch scale s and its cropping offsets (δ_x, δ_y) as input and produces per-layer modulations σ_ℓ (we shift the tanh output by 1 to map it into the 1-centered interval), where the dimensionality of σ_ℓ equals the number of output channels of the ℓ-th convolutional layer. Given a convolutional kernel W_ℓ and input x, a straightforward strategy to apply the modulation is to multiply σ_ℓ with the weights (depicting the convolution operation by ⊛ and omitting its other parameters for simplicity):

y_ℓ = (σ_ℓ ⊙ W_ℓ) ⊛ x,    (1)

where σ_ℓ is broadcast over the remaining axes and y_ℓ is the layer output (before the non-linearity). However, using different kernel weights for different inputs is not very efficient in modern deep learning frameworks (even with the group-wise convolution trick Karras et al. (2020b)), which is why we use the equivalent strategy of multiplying σ_ℓ with the convolution output instead:

y_ℓ = σ_ℓ ⊙ (W_ℓ ⊛ x).    (2)

This suppresses and reinforces different convolutional filters of the layer depending on the patch scale and location. To incorporate an even stronger conditioning, we also use the projection strategy Miyato and Koyama (2018) in the final discriminator block. We depict our discriminator architecture in Fig 2. As we show in Tab 2, it allows us to obtain a noticeably lower FID compared to the standard discriminator.
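The sketch below illustrates why Eq. (1) and Eq. (2) coincide: since σ_ℓ has one value per output channel, modulating the kernel is the same as scaling the convolution output, which is cheap to apply per-sample. The layer sizes and the small hypernetwork here are illustrative assumptions, not the actual architecture.

```python
# Hedged sketch of the scale/location-aware modulation from Eq. (1)/(2):
# sigma has one value per output channel, so modulating the kernel equals
# scaling the convolution output (no bias term here). Sizes are illustrative.
import torch
import torch.nn.functional as F

batch, c_in, c_out, hw = 4, 64, 128, 32
x = torch.randn(batch, c_in, hw, hw)
weight = torch.randn(c_out, c_in, 3, 3)

# Per-sample modulations predicted by a toy hypernetwork from (s, dx, dy),
# with the tanh output shifted by 1 into the 1-centered interval (0, 2).
hyper = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.LeakyReLU(0.2),
                            torch.nn.Linear(64, c_out))
patch_params = torch.rand(batch, 3)               # (scale, offset_x, offset_y)
sigma = torch.tanh(hyper(patch_params)) + 1.0     # (batch, c_out)

# Eq. (2): plain convolution followed by per-channel scaling of the output.
y = F.conv2d(x, weight, padding=1) * sigma[:, :, None, None]
# Eq. (1) (per-sample kernels via a grouped convolution) would give the same
# result, but is slower in practice.
```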

Figure 4: Comparing samples of EpiGRAF and modern 3D-aware generators. Our method attains state-of-the-art image quality, recovers high-fidelity geometry and preserves multi-view consistency for both simple-shape (FFHQ and Cats) and variable-shape (M-Plants and M-Food) datasets. We refer the reader to the supplementary for the video comparisons to evaluate multi-view consistency.

3.3 Patch-wise optimization with Beta-distributed scales

Training NeRF-based GANs is computationally expensive, because rendering each pixel via volumetric rendering requires many evaluations (in our case, 96) of the underlying MLP. For scene reconstruction tasks this does not create issues, since the typically used reconstruction loss Mildenhall et al. (2020); Zhang et al. (2020); Wang et al. (2021b) can be robustly computed on a sparse subset of the pixels. But for NeRF-based GANs it becomes prohibitively expensive at high resolutions, since convolutional discriminators operate on dense full-size images. The currently dominating approach to mitigate this is to train a separate 2D decoder to upsample a low-resolution image representation rendered from a NeRF-based MLP. But this breaks multi-view consistency (i.e., the object's shape and texture change when the camera moves) and learns the 3D geometry only at a low resolution (e.g., Xue et al. (2022); Chan et al. (2021b)). This is why we build upon the multi-scale patch-wise training scheme of Schwarz et al. (2020) and demonstrate that it can give state-of-the-art image quality and training speed without the above limitations.

Patch-wise optimization works as follows. On each iteration, instead of passing the full-size image to D, we input only a small patch of fixed resolution, random scale s, and random offsets (δ_x, δ_y). We illustrate this procedure in Fig 2. The patch parameters are sampled from the distribution:

(s, δ_x, δ_y) ∼ p_patch(s, δ_x, δ_y | t):    s ∼ S(t),    δ_x, δ_y ∼ U[0, 1 − s],    (3)

where t is the current training iteration and S(t) is the scale distribution. In this way, patch scales depend on the current training iteration, and the offsets are sampled independently once s is known. As we show next, the choice of the scale distribution has a crucial influence on learning speed and stability.

Typically, patch scales are sampled from an annealed uniform distribution Schwarz et al. (2020); Meng et al. (2021); Chai et al. (2022):

s ∼ U[ lerp(s₀, s_min, min(t/T, 1)), 1 ],    (4)

where lerp is the linear interpolation function (lerp(a, b, α) = a + (b − a)·α), and the left interval bound is gradually annealed from its initial value s₀ during the first T iterations until it reaches the minimum possible value s_min (in practice, those methods use a slightly different distribution; see Appx B). But this strategy does not let the model learn high-frequency details early on in training and puts little focus on the structure once the bound is fully annealed to s_min (which is usually very small for typical patch-wise training at high resolutions). As we show, the first issue makes the generator converge more slowly, and the second one makes the overall optimization less stable.

To mitigate this, we propose a small change in the pipeline by simply replacing the uniform scale sampling distribution with:

s = s_min + (1 − s_min)·x,    x ∼ Beta(1, β(t)),    (5)

where β is gradually annealed during training to some final value β_end. Using the beta distribution instead of the uniform one gives a convenient knob to shift the training focus between large patch scales (carrying global information about the whole image) and small patch scales (representing high-resolution local crops).

A natural way to do the annealing is to anneal β towards 1: at the start, the model focuses entirely on the structure, while at the end the distribution becomes uniform (see Fig 3). We follow this strategy, but instead set the final value β_end slightly below 1 to keep more focus on the structure at the end of the annealing as well. In our initial experiments, we observed that nearby final values perform similarly. The scale distributions of the beta and uniform sampling schemes are compared in Fig 3, and their convergence in Fig 6.
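The sketch below contrasts the two schedules. We assume, purely for illustration, a Beta(1, β) variable rescaled to [s_min, 1], with β annealed towards a final value slightly below 1; see the released source code for the exact parametrization used in our experiments.

```python
# Hedged sketch of the two scale-sampling schedules; the beta parametrization,
# the schedule lengths and s_min value below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_scale_uniform(t, T, s_min, s_start=1.0):
    """Annealed uniform: the lower bound is lerped from s_start down to s_min."""
    lo = s_start + (s_min - s_start) * min(t / T, 1.0)
    return rng.uniform(lo, 1.0)

def sample_scale_beta(t, T, s_min, beta_end=0.8):
    """Annealed beta: beta grows from ~0 towards beta_end (< 1 keeps a mild
    bias towards large, structure-carrying scales at the end)."""
    beta = max(1e-2, beta_end * min(t / T, 1.0))
    x = rng.beta(1.0, beta)            # concentrated near 1 for small beta
    return s_min + (1.0 - s_min) * x   # rescale to [s_min, 1]

scales = [sample_scale_beta(t, T=10_000, s_min=0.125) for t in range(0, 10_000, 100)]
```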

3.4 Training details

We inherit the training procedure from StyleGAN2-ADA Karras et al. (2020a) with minimal changes. The optimization is performed by Adam Kingma and Ba (2014) with a learning rate of 0.002 and betas of 0 and 0.99 for both G and D. D is trained with R1 regularization Mescheder et al. (2018). We train with an overall batch size of 64 until D has seen a fixed budget of real images for each resolution (see Appx B). Similar to previous works Chan et al. (2021b); Deng et al. (2022), we use pose supervision for D on the FFHQ and Cats datasets to avoid geometry ambiguity. We train the generator in full precision and use mixed precision for the discriminator. Since FFHQ has noticeable 3D biases, we use generator pose conditioning for it Chan et al. (2021b). Further details can be found in the source code.
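For reference, a minimal sketch of the R1 penalty of Mescheder et al. (2018), applied here to real patches, is shown below; the γ value and the discriminator's call signature are illustrative assumptions, not the exact settings.

```python
# Minimal sketch of the R1 penalty (Mescheder et al., 2018) on real patches,
# as used to regularize D; gamma and the discriminator signature are assumptions.
import torch

def r1_penalty(discriminator, real_patches, patch_params, gamma=1.0):
    real_patches = real_patches.detach().requires_grad_(True)
    logits = discriminator(real_patches, patch_params)          # (B, 1)
    grads, = torch.autograd.grad(outputs=logits.sum(),
                                 inputs=real_patches,
                                 create_graph=True)
    # 0.5 * gamma * E[ ||grad_x D(x)||^2 ] over the batch.
    return 0.5 * gamma * grads.square().sum(dim=(1, 2, 3)).mean()
```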

4 Experiments

Method | FFHQ 256² | FFHQ 512² | Cats | M-Plants | M-Food | Training cost (256² / 512²) | Geometry constraints
StyleNeRF Gu et al. (2022) | 8.00 | 7.8 | 5.91 | 19.32 | 16.75 | 40 / 56 | low-res + 2D upsampler
StyleSDF Or-El et al. (2021) | 11.5 | 11.19 | N/A | N/A | N/A | 42 / 56 | low-res + 2D upsampler
EG3D Chan et al. (2021b) | 4.8 | 4.7 | N/A | N/A | N/A | 76 | low-res + 2D upsampler
VolumeGAN Xu et al. (2021) | 9.1 | N/A | N/A | N/A | N/A | — | low-res + 2D upsampler
MVCGAN Zhang et al. (2022) | 13.7 | 13.4 | 39.16 | 31.70 | 29.29 | 42 / 64 | low-res + 2D upsampler
GIRAFFE-HD Xue et al. (2022) | 11.93 | 12.36 | N/A | N/A | N/A | — | low-res + 2D upsampler
pi-GAN Chan et al. (2021a) | 53.2 | OOM | 68.28 | 75.64 | 51.99 | 56 / — | none
GRAM Deng et al. (2022) | 13.78 | OOM | 13.40 | 188.6 | 178.9 | 56 / — | iso-surfaces
EpiGRAF (ours) | 9.71 | 9.92 | 6.93 | 19.42 | 18.15 | 16 / 24 | none
Table 1: FID scores of modern 3D GANs. EG3D is trained and evaluated on a re-aligned version of FFHQ (different from the original FFHQ Karras et al. (2019)). Training cost is measured in NVidia V100 GPU days. “OOM” denotes an out-of-memory error; “N/A” denotes that a method was not trained on the corresponding dataset.
Figure 5: Visualizing the learned geometry for different methods. π-GAN Chan et al. (2021a) recovers high-fidelity shapes, but has worse image quality (see Table 1) and is much more expensive to train than our model. MVC-GAN Zhang et al. (2022) fails to capture good geometry because of the 2D upsampler. Our method learns proper geometry and achieves state-of-the-art image quality. We extracted the surfaces using marching cubes from the density fields sampled on a dense grid and visualized them in PyVista Sullivan and Kaszynski (2019). We manually optimized the marching cubes contouring threshold for each checkpoint of each method. We noticed that π-GAN Chan et al. (2021a) produces a lot of “spurious” density.

4.1 Experimental setup

Benchmarks. In our study, we consider four benchmarks: 1) FFHQ Karras et al. (2019) at 256^2 and 512^2 resolutions, consisting of 70,000 (mostly front-view) human face images; 2) Cats Zhang et al. (2008), consisting of 9,998 (mostly front-view) cat face images; 3) Megascans Food (M-Food), consisting of 199 models of different food items with 128 views per model (25,472 images in total); and 4) Megascans Plants (M-Plants), consisting of 1,108 different plant models with 128 views per model (141,824 images in total). The last two datasets are introduced in our work to fix two issues with modern 3D generation benchmarks. First, existing benchmarks have low variability of global object geometry, focusing entirely on a single class of objects, like human/cat faces or cars, that do not vary much from instance to instance. Second, they all have a limited camera pose distribution: for example, FFHQ Karras et al. (2019) and Cats Zhang et al. (2008) are completely dominated by frontal and near-frontal views (see Appx D). That is why we obtain and render 1,307 Megascans models from Quixel, which are photo-realistic (barely distinguishable from real) scans of real-life objects with complex geometry. These benchmarks, together with the rendering code, will be made publicly available.

Metrics. We use FID Heusel et al. (2017) to measure image quality and also estimate the training cost for each method in terms of the NVidia V100 GPU days needed for it to complete the training process.

Baselines. For upsampler-based baselines, we compare to the following generators: StyleNeRF Gu et al. (2022), StyleSDF Or-El et al. (2021), EG3D Chan et al. (2021b), VolumeGAN Xu et al. (2021), MVCGAN Zhang et al. (2022) and GIRAFFE-HD Xue et al. (2022). Apart from that, we also compare to π-GAN Chan et al. (2021a) and GRAM Deng et al. (2022), which are non-upsampler-based GANs. To compare on Megascans, we train StyleNeRF, MVCGAN, π-GAN, and GRAM from scratch using their official code repositories (obtained online or requested from the authors) with their FFHQ or CARLA hyperparameters, except for the camera distribution and rendering settings. We also train StyleNeRF, MVCGAN and π-GAN on Cats. GRAM Deng et al. (2022) restricts the sampling space to a set of learnable iso-surfaces, which makes it not well suited for datasets with varying geometry.

4.2 Results

EpiGRAF achieves state-of-the-art image quality. For Cats, M-Plants and M-Food, EpiGRAF outperforms all the baselines in terms of FID except for StyleNeRF, performing very similarly to it on all the datasets despite not having a 2D upsampler. For FFHQ, our model attains very similar FID scores to the other methods, ranking 4/9 (including the older π-GAN Chan et al. (2021a)), noticeably losing only to EG3D Chan et al. (2021b), which trains and evaluates on a different version of FFHQ and uses pose conditioning in the generator (which potentially improves FID at the cost of multi-view consistency). We provide a visual comparison of the different methods in Fig 4. GRAM's high FID scores on M-Plants/M-Food are due to its severe mode collapse (see Fig 13 in Appx E).

EpiGRAF is much faster to train. As reported in Tab 1, existing methods typically train for about a week on 8 V100s, while EpiGRAF finishes training in just 2 days for 256^2 and 3 days for 512^2 resolution, which is ≈2.5× faster. Note that this high training efficiency is achieved without the use of an upsampler, which is what originally enabled high-resolution synthesis in 3D-aware GANs. As for the non-upsampler methods, we could not train GRAM or π-GAN at 512^2 resolution due to the memory limitations of our setup with 8 NVidia V100 32GB GPUs (i.e., 256GB of GPU memory in total).

EpiGRAF learns high-fidelity geometry. Using a pure NeRF-based backbone brings two crucial benefits: it provides multi-view consistency and allows learning the geometry at the full dataset resolution. In Fig 5, we visualize the learned shapes on M-Food and M-Plants for 1) π-GAN: a pure NeRF-based generator without geometry constraints; 2) MVC-GAN Zhang et al. (2022): an upsampler-based generator with strong multi-view consistency regularization; and 3) our model. We provide the details and analysis in the caption of Fig 5.

EpiGRAF easily capitalizes on techniques from the NeRF literature. Since our generator is purely NeRF-based and renders images without a 2D upsampler, it couples well with existing techniques from the NeRF scene reconstruction field. To demonstrate this, we adopted background separation from NeRF++ Zhang et al. (2020) using its inverse sphere parametrization by simply copy-pasting the corresponding code from their repository. We depict the results in Fig 1 and provide the details in Appx B.

4.3 Ablations

We report ablations of different discriminator architectures and patch sizes on FFHQ and M-Plants in Tab 2. Using the traditional discriminator architecture results in worse performance. Using several of them (via the group-wise convolution trick Karras et al. (2020b)) results in noticeably slower training and degrades image quality a lot. We hypothesize that the reason is the reduced overall training signal that each discriminator receives; we tried to alleviate this by increasing their learning rate, but it did not improve the results. A too small patch size hampers the learning process and results in a considerably worse FID. A too large one provides decent image quality, but greatly reduces the training speed.

To assess the convergence of our proposed patch sampling scheme, we compared it against uniform sampling on Cats under different annealing speeds. We show the results in Fig 6: our proposed beta scale sampling strategy robustly converges to lower values than the uniform one and does not fluctuate as much as the uniform variant with a short schedule (where the model reaches its final annealing stage after just 1,000 kilo-images seen by D).

To analyze how hyper-modulation manipulates the convolutional filters of the discriminator, we visualize the modulation weights σ_ℓ, predicted by H, in Fig 7 (see the caption for details). These visualizations show that some of the filters are always switched on, regardless of the patch scale, while others are always switched off, providing potential room for pruning He et al. (2017). A substantial fraction of the filters is switched on and off depending on the patch scale, which shows that H indeed learns to perform meaningful modulation.

Figure 6: Convergence comparison on Cats between the uniform and beta scale sampling strategies for EpiGRAF, in terms of log-FID measured on 2,048 fake images.

Experiment | FFHQ | M-Plants | Training cost
Discriminator | | |
 – standard | 11.57 | 21.77 | 24
 – 2 scale-specific discriminators | 10.87 | 21.02 | 28
 – 4 scale-specific discriminators | 21.56 | 43.11 | 28
 – scale/position-aware | 9.92 | 19.42 | 24
Patch size | | |
 – smaller | 17.44 | 34.32 | 19
 – default | 9.92 | 19.42 | 24
 – larger | 11.36 | 18.90 | 34

Table 2: Ablating the patch size and the discriminator architecture for our model in terms of FID scores and training cost (V100 GPU days).
Figure 7: Visualizing the modulation weights σ_ℓ, predicted by H, for the 2nd, 6th, 10th and 14th convolutional layers. Each subplot denotes a separate layer, and we visualize 32 random filters for it.

5 Conclusion

In this work, we showed that it is possible to build a state-of-the-art 3D GAN framework without a 2D upsampler, using a pure NeRF-based generator trained in a multi-scale patch-wise fashion. For this, we improved the traditional patch-wise training scheme in two important ways. First, we proposed a scale/location-aware discriminator whose convolutional filters are modulated by a hypernetwork depending on the patch parameters. Second, we developed a schedule for patch scale sampling based on the beta distribution, which leads to faster and more robust convergence. We believe that the future of 3D GANs lies in a combination of efficient volumetric representations, regularized 2D upsamplers, and patch-wise training, and we propose this avenue of research for future work.

Our method also has several limitations. Before switching to training 3D-aware generators, we spent a considerable amount of time exploring our ideas on top of StyleGAN2 for traditional 2D generation; this always resulted in increased FID scores (see Appx A). Further, the discriminator loses information about the global context; we tried multiple ideas to incorporate it, but they did not lead to an improvement. Finally, 3D GANs generating faces and humans may have a negative societal impact, as discussed in Appx G.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, pp. 40–49. Cited by: §2.
  • [2] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan (2021) Mip-nerf: a multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5855–5864. Cited by: §2.
  • [3] Blender Online Community (2022) Blender - a 3d modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. External Links: Link Cited by: §D.1.
  • [4] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
  • [5] L. Chai, M. Gharbi, E. Shechtman, P. Isola, and R. Zhang (2022) Any-resolution training for high-resolution image synthesis. arXiv preprint arXiv:2204.07156. Cited by: Table 3, §1, §2, §3.3.
  • [6] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein (2021) Pi-gan: periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5799–5809. Cited by: Appendix C, Table 4, §1, §1, §2, §2, §2, §3.1, Figure 5, §4.1, §4.2, Table 1.
  • [7] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. D. Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, T. Karras, and G. Wetzstein (2021) Efficient geometry-aware 3D generative adversarial networks. In arXiv, Cited by: §D.1, Figure 2, §1, §1, §1, §2, §3.1, §3.3, §3.4, §3, §4.1, §4.2, Table 1.
  • [8] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su (2022) TensoRF: tensorial radiance fields. arXiv preprint arXiv:2203.09517. Cited by: §3.1.
  • [9] H. Chen, B. He, H. Wang, Y. Ren, S. Lim, and A. Shrivastava (2021) NeRV: neural representations for videos. arXiv preprint arXiv:2110.13903. Cited by: §2.
  • [10] Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5939–5948. Cited by: §2.
  • [11] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §2.
  • [12] Y. Deng, J. Yang, J. Xiang, and X. Tong (2022) GRAM: generative radiance manifolds for 3d-aware image generation. In IEEE Computer Vision and Pattern Recognition, Cited by: §B.1, Figure 13, Appendix E, §1, §1, §2, §3.4, §4.1, Table 1.
  • [13] M. Gadelha, S. Maji, and R. Wang (2017) 3d shape induction from 2d views of multiple objects. In 2017 International Conference on 3D Vision (3DV), pp. 402–411. Cited by: §2.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §A.2, §1, §2.
  • [15] J. Gu, L. Liu, P. Wang, and C. Theobalt (2022) StyleNeRF: a style-based 3d aware generator for high-resolution image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §B.2, §1, §2, §3.1, §4.1, Table 1.
  • [16] D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §1, §2, §3.2.
  • [17] Z. Hao, A. Mallya, S. Belongie, and M. Liu (2021) GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds. In ICCV, Cited by: §2.
  • [18] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE international conference on computer vision, pp. 1389–1397. Cited by: §4.3.
  • [19] P. Henderson, V. Tsiminaki, and C. H. Lampert (2020) Leveraging 2d data to learn textured 3d mesh generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7498–7507. Cited by: §2.
  • [20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30. Cited by: §1, §4.1.
  • [21] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851. Cited by: §1.
  • [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.
  • [23] K. Jo, G. Shim, S. Jung, S. Yang, and J. Choo (2021) CG-nerf: conditional generative neural radiance fields. arXiv preprint arXiv:2112.03517. Cited by: §1.
  • [24] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676. Cited by: §A.1, Table 3, §B.1, §1, §3.4.
  • [25] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: Appendix C, 10(a), Table 4, 11(a), Figure 1, §1, §1, §4.1, Table 1.
  • [26] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: Table 3, Appendix C, Figure 2, §3.2, §3.2, §3, §4.3.
  • [27] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • [28] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. Advances in neural information processing systems 31. Cited by: §1.
  • [29] A. R. Kosiorek, H. Strathmann, D. Zoran, P. Moreno, R. Schneider, S. Mokrá, and D. J. Rezende (2021) Nerf-vae: a geometry aware 3d scene generative model. arXiv preprint arXiv:2104.00587. Cited by: §2.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §1.
  • [31] R. Li, X. Li, K. Hui, and C. Fu (2021) SP-gan: sphere-guided 3d shape generation and manipulation. ACM Transactions on Graphics (TOG) 40 (4), pp. 1–12. Cited by: §2.
  • [32] X. Li, Y. Dong, P. Peers, and X. Tong (2019) Synthesizing 3d shapes from silhouette image collections using multi-projection generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5535–5544. Cited by: §2.
  • [33] C. H. Lin, H. Lee, Y. Cheng, S. Tulyakov, and M. Yang (2021) InfinityGAN: towards infinite-resolution image synthesis. arXiv preprint arXiv:2104.03963. Cited by: §2.
  • [34] S. Luo and W. Hu (2021-06) Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [35] Y. A. Mejjati, I. Milefchik, A. Gokaslan, O. Wang, K. I. Kim, and J. Tompkin (2021) GaussiGAN: controllable image synthesis with 3d gaussians from unposed silhouettes. arXiv preprint arXiv:2106.13215. Cited by: §1.
  • [36] Q. Meng, A. Chen, H. Luo, M. Wu, H. Su, L. Xu, X. He, and J. Yu (2021) Gnerf: gan-based neural radiance field without posed camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6351–6361. Cited by: Figure 8, §B.3, Appendix C, §1, §2, §2, §3.3.
  • [37] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. In International conference on machine learning, pp. 3481–3490. Cited by: §3.4.
  • [38] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: §B.1, Appendix C, Appendix C, §1, §1, §2, §3.1, §3.3.
  • [39] P. Mittal, Y. Cheng, M. Singh, and S. Tulsiani (2022) AutoSDF: shape priors for 3d completion, reconstruction and generation. In CVPR, Cited by: §2.
  • [40] T. Miyato and M. Koyama (2018) CGANs with projection discriminator. arXiv preprint arXiv:1802.05637. Cited by: §B.1, §3.2.
  • [41] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang (2019) HoloGAN: unsupervised learning of 3d representations from natural images. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [42] M. Niemeyer and A. Geiger (2021) Campari: camera-aware decomposed generative neural radiance fields. In 2021 International Conference on 3D Vision (3DV), pp. 951–961. Cited by: §2, §2.
  • [43] M. Niemeyer and A. Geiger (2021) Giraffe: representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453–11464. Cited by: §1, §2, §3.1.
  • [44] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515. Cited by: §2.
  • [45] M. Oechsle, S. Peng, and A. Geiger (2021) Unisurf: unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5589–5599. Cited by: §2.
  • [46] C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill. Note: https://distill.pub/2017/feature-visualization External Links: Document Cited by: §1.
  • [47] R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, and I. Kemelmacher-Shlizerman (2021) StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation. arXiv preprint arXiv:2112.11427. Cited by: §1, §2, §4.1, Table 1.
  • [48] X. Pan, B. Dai, Z. Liu, C. C. Loy, and P. Luo (2020) Do 2d gans know 3d shape? unsupervised 3d shape reconstruction from 2d image gans. arXiv preprint arXiv:2011.00844. Cited by: §2.
  • [49] X. Pan, X. Xu, C. C. Loy, C. Theobalt, and B. Dai (2021) A shading-guided generative implicit model for shape-accurate 3d-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
  • [50] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla (2020) Deformable neural radiance fields. arXiv preprint arXiv:2011.12948. Cited by: §2.
  • [51] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2337–2346. Cited by: §2.
  • [52] T. Park, J. Zhu, O. Wang, J. Lu, E. Shechtman, A. Efros, and R. Zhang (2020) Swapping autoencoder for deep image manipulation. Advances in Neural Information Processing Systems 33, pp. 7198–7211. Cited by: §2.
  • [53] D. Pavllo, J. Kohler, T. Hofmann, and A. Lucchi (2021-10) Learning generative models of textured 3d meshes from real-world images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13879–13889. Cited by: §2.
  • [54] D. Pavllo, G. Spinks, T. Hofmann, M. Moens, and A. Lucchi (2020) Convolutional generation of textured 3d meshes. In Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [55] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125. Cited by: §1.
  • [56] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger (2020) GRAF: generative radiance fields for 3d-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Figure 8, §B.3, Appendix C, Appendix C, §D.2, Figure 2, §1, §1, §1, §2, §2, §2, §2, §3.3, §3.3.
  • [57] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4570–4580. Cited by: §2.
  • [58] Y. Shi, D. Aggarwal, and A. K. Jain (2021) Lifting 2d stylegan for 3d-aware face generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6258–6266. Cited by: §1, §2.
  • [59] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33. Cited by: §B.1, §B.1, §1.
  • [60] I. Skorokhodov, S. Ignatyev, and M. Elhoseiny (2021) Adversarial generation of continuous images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10753–10764. Cited by: §B.1.
  • [61] I. Skorokhodov, G. Sotnikov, and M. Elhoseiny (2021) Aligning latent and image spaces to connect the unconnectable. arXiv preprint arXiv:2104.06954. Cited by: §2.
  • [62] I. Skorokhodov, S. Tulyakov, and M. Elhoseiny (2022) Stylegan-v: a continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3626–3636. Cited by: §3.2.
  • [63] C. B. Sullivan and A. Kaszynski (2019) PyVista: 3d plotting and mesh analysis through a streamlined interface for the visualization toolkit (VTK). Journal of Open Source Software 4 (37), pp. 1450. External Links: Document, Link Cited by: Figure 5.
  • [64] F. Tan, S. Fanello, A. Meka, S. Orts-Escolano, D. Tang, R. Pandey, J. Taylor, P. Tan, and Y. Zhang (2022) VoLux-gan: a generative model for 3d face synthesis with hdri relighting. arXiv preprint arXiv:2201.04873. Cited by: §1.
  • [65] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. Cited by: §B.1, §B.1, §1.
  • [66] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29. Cited by: §1.
  • [67] Y. Vinker, E. Horwitz, N. Zabari, and Y. Hoshen (2021-10) Image shape manipulation from a single augmented training sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13769–13778. Cited by: §2.
  • [68] C. Wang, M. Chai, M. He, D. Chen, and J. Liao (2021) CLIP-nerf: text-and-image driven manipulation of neural radiance fields. arXiv preprint arXiv:2112.05139. Cited by: §1.
  • [69] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang (2021) Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689. Cited by: §2, §3.3.
  • [70] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §2.
  • [71] Y. Xu, S. Peng, C. Yang, Y. Shen, and B. Zhou (2021) 3D-aware image synthesis via learning structural and textural representations. arXiv preprint arXiv:2112.10759. Cited by: §1, §2, §4.1, Table 1.
  • [72] Y. Xue, Y. Li, K. K. Singh, and Y. J. Lee (2022) GIRAFFE hd: a high-resolution 3d-aware generative model. arXiv preprint arXiv:2203.14954. Cited by: §1, §2, §3.1, §3.3, §4.1, Table 1.
  • [73] Y. Ye, S. Tulsiani, and A. Gupta (2021-06) Shelf-supervised mesh prediction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8843–8852. Cited by: §2.
  • [74] A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2021-06) PixelNeRF: neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4578–4587. Cited by: §2.
  • [75] J. Zhang, E. Sangineto, H. Tang, A. Siarohin, Z. Zhong, N. Sebe, and W. Wang (2021) 3D-aware semantic-guided generative model for human synthesis. arXiv preprint arXiv:2112.01422. Cited by: §1.
  • [76] K. Zhang, G. Riegler, N. Snavely, and V. Koltun (2020) Nerf++: analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492. Cited by: §B.1, Figure 1, §1, §2, §2, §3.3, §4.2.
  • [77] W. Zhang, J. Sun, and X. Tang (2008) Cat head detection-how to effectively exploit shape and texture features. In European conference on computer vision, pp. 802–816. Cited by: 10(b), Table 4, 11(b), Figure 1, §1, §4.1.
  • [78] X. Zhang, Z. Zheng, D. Gao, B. Zhang, P. Pan, and Y. Yang (2022) Multi-view consistent generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §3.1, Figure 5, §4.1, §4.2, Table 1.
  • [79] P. Zhou, L. Xie, B. Ni, and Q. Tian (2021) Cips-3d: a 3d-aware generator of gans based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788. Cited by: §1, §2, §3.1.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See §5 and Appx A.

    3. Did you discuss any potential negative societal impacts of your work? We do this in Appendix F.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them? We discuss the potential ethical concerns of using our model in Appendix F.

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We provide the code/data and additional visualizations on https://universome.github.io/epigraf (as specified in the introduction).

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? We provide the most important training details in §3.4. The rest of the details are provided in Appx B and the provided source code.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? No: that is too computationally expensive, and single-run results are typically reliable in the GAN field.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? We report these numbers in Appx B.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? We cite all the sources of the datasets which were used or mentioned in our submission.

    2. Did you mention the license of the assets? In this work, we release two new datasets: Megascans Plants and Megascans Food. We discuss their licensing in Appx D.

    3. Did you include any new assets either in the supplemental material or as a URL? We provide our datasets on the project website: https://universome.github.io/epigraf.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? We specify the information on dataset collection in Appx D.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? As discussed in Appx D, the released data does not contain personally identifiable information or offensive content.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Limitations

Multi-scale patch-wise training, studied in this work, has both practical and theoretical limitations. The practical ones include “engineering” difficulties in squeezing out the best performance, while the theoretical ones relate to issues that could arise in asymptotic cases.

A.1 Practical limitations

Performance drop for 2D generation. Before switching to training 3D-aware generators, we spent a considerable amount of time exploring our ideas on top of StyleGAN2 Karras et al. (2020a) for traditional 2D generation, since it is faster, less error-prone and more robust to the choice of hyperparameters. What we observed is that, despite our best efforts (see Appx C) and even with longer training, we could not reach the same image quality as the full-resolution StyleGAN2 generator.

Method | FFHQ FID | FFHQ training cost | LSUN Bedroom FID | LSUN Bedroom training cost
StyleGAN2-ADA Karras et al. (2020a) | 3.83 | 8 | 4.12 | 5
 + multi-scale patch-wise training | 7.11 | 6 | 6.73 | 4
 + longer training | 5.71 | 12 | 5.42 | 8
 + longer training | 4.76 | 24 | 4.31 | 16
Table 3: Training a traditional StyleGAN2 Karras et al. (2020b) generator in the patch-wise fashion. We tried training longer to compensate for the smaller overall learning signal (a patch carries only a fraction of the information of a full image), but this did not allow the model to catch up. Note, however, that AnyResGAN Chai et al. (2022) reaches SotA quality when training on patches rather than full images.

A range of possible patch sizes is restricted. Tab 2 shows the performance drop when using a smaller patch size instead of the default one, without any dramatic improvement in speed. Decreasing it further would produce even worse performance (imagine training in the extreme case of tiny patches). Increasing the patch size is also undesirable, since it slows training down a lot: moving to the larger patch size resulted in a ≈30% cost increase without clear performance benefits. In this way, we are quite constrained in the patch sizes we can use.

Discriminator does not see the global context. When the discriminator classifies patches of small scale, it is forced to do so without relying on the global image information, which could be useful for this task. Our attempts to incorporate it (see Appx C) did not improve the performance (though we believe we under-explored this direction).

A.2 Theoretical limitations

Multi-scale patch-wise training is not theoretically equivalent to full-resolution training. Imagine that we have a distribution $p(\mathbf{x})$ for $\mathbf{x} \in \mathbb{R}^{S \times S}$, where we consider images to be single-channel (for simplicity) and $S$ is the image size. If we train with a patch size of $s < S$, then each optimization step uses $s^2$ random pixels out of $S^2$, i.e. we optimize over the distribution of all possible marginals $p(\mathbf{x}_p)$, where $\mathbf{x}_p = \mathsf{P}(\mathbf{x}, \sigma)$ is an image patch and $\mathsf{P}(\cdot, \sigma)$ is a patch extraction function with random seed $\sigma$ (for brevity, we “hide” all the randomness of the patch extraction process into $\sigma$). This means that our minimax objective becomes:

$$\min_G \max_D \; \mathbb{E}_{\sigma}\Big[\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\big[\log D(\mathsf{P}(\mathbf{x}, \sigma))\big] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\big[\log\big(1 - D(\mathsf{P}(G(\mathbf{z}), \sigma))\big)\big]\Big] \qquad (6)$$

If we rely on the GAN convergence theorem Goodfellow et al. (2014), stating that we recover the training distribution as the solution of the minimax problem, then $G$ will learn to approximate all the possible marginal distributions $p(\mathbf{x}_p)$ instead of the full joint distribution $p(\mathbf{x})$ that we seek.
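To make the role of the patch extraction function $\mathsf{P}(\cdot, \sigma)$ concrete, below is a minimal PyTorch sketch of a scale- and offset-randomized patch extractor of the kind the objective above assumes. The function name, the scale range and the bilinear resize are illustrative assumptions, not the exact implementation from our code.

```python
import torch
import torch.nn.functional as F

def extract_patch(img, patch_size=64, min_scale=0.25, generator=None):
    """Sample a random-scale, random-offset patch P(x, sigma) from a (C, H, W) image.

    The randomness (scale + offset) plays the role of sigma in Eq. (6):
    the discriminator only ever sees these marginals, never the full image.
    """
    _, H, W = img.shape
    g = generator or torch.Generator()
    # Random scale in [min_scale, 1] and a random offset that keeps the patch inside the image.
    scale = min_scale + (1.0 - min_scale) * torch.rand(1, generator=g).item()
    crop_h, crop_w = int(scale * H), int(scale * W)
    top = torch.randint(0, H - crop_h + 1, (1,), generator=g).item()
    left = torch.randint(0, W - crop_w + 1, (1,), generator=g).item()
    crop = img[:, top:top + crop_h, left:left + crop_w]
    # Resize the crop to a fixed patch resolution so D always receives patch_size x patch_size inputs.
    patch = F.interpolate(crop.unsqueeze(0), size=(patch_size, patch_size),
                          mode='bilinear', align_corners=False)
    return patch.squeeze(0)
```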

Appendix B Training details

B.1 Hyper-parameters and optimization details

We inherit most of the hyperparameters from the StyleGAN2-ADA repo Karras et al. (2020a), on top of which we build444https://github.com/NVlabs/stylegan2-ada-pytorch. In this way, we use the default dimensionalities of 512 for both the latent code z and the intermediate latent code w. The mapping network has 2 layers of dimensionality 512 with LeakyReLU non-linearities. The synthesis network produces the three feature planes of our tri-plane representation. We use the SoftPlus non-linearity instead of the typically used ReLU Mildenhall et al. (2020) as a way to clamp the volumetric density. Similar to π-GAN Chan et al. (2021a), we also randomize the ray sampling.
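As a minimal illustration of the density clamping mentioned above, the sketch below applies SoftPlus to the raw MLP output before computing standard NeRF compositing weights; the function and variable names are illustrative, not taken from our code.

```python
import torch
import torch.nn.functional as F

def density_and_weights(raw_sigma, deltas):
    """Convert raw per-sample MLP outputs into volumetric rendering weights.

    raw_sigma: (num_rays, num_steps) unconstrained density predictions.
    deltas:    (num_rays, num_steps) distances between consecutive samples.
    """
    sigma = F.softplus(raw_sigma)               # clamp density to be non-negative (instead of ReLU)
    alpha = 1.0 - torch.exp(-sigma * deltas)    # opacity of each ray segment
    # Standard NeRF compositing weights: alpha_i * prod_{j<i} (1 - alpha_j)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * transmittance
    return sigma, weights
```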

For FFHQ and Cats, we also use camera conditioning in D. For this, we encode the yaw and pitch angles (roll is always set to 0) with Fourier positional encoding Sitzmann et al. (2020); Tancik et al. (2020), apply dropout with 0.5 probability (otherwise, D can start judging generations from 3D biases in the dataset, hurting the image quality), and pass the result through a 2-layer MLP with LeakyReLU activations to obtain a 512-dimensional vector, which is finally used as a projection conditioning Miyato and Koyama (2018). Camera positions were extracted in the same way as in GRAM Deng et al. (2022).
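A minimal sketch of this conditioning branch is given below, assuming a simple sine/cosine Fourier encoding; the module name, the number of frequencies and the layer sizes are assumptions for illustration and may differ from our actual code.

```python
import torch
import torch.nn as nn

class CameraConditioning(nn.Module):
    """Embed (yaw, pitch) for projection conditioning in the discriminator."""
    def __init__(self, num_freqs=8, hidden_dim=512, out_dim=512, p_drop=0.5):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs)            # Fourier frequencies (assumption)
        in_dim = 2 * 2 * num_freqs                             # 2 angles x (sin, cos) x num_freqs
        self.dropout = nn.Dropout(p_drop)                      # prevents D from exploiting 3D biases
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, out_dim), nn.LeakyReLU(0.2))

    def forward(self, yaw, pitch):
        angles = torch.stack([yaw, pitch], dim=1)              # (batch, 2); roll is fixed to 0
        proj = angles.unsqueeze(-1) * self.freqs.to(angles)    # (batch, 2, num_freqs)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)
        return self.mlp(self.dropout(enc))                     # used as the projection-conditioning vector
```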

We optimize both G and D with a batch size of 64 until D sees 25,000,000 real images, which is the default setting from StyleGAN2-ADA. We use the default setup of adaptive augmentations, except for random horizontal flipping, since it would require the corresponding change in the yaw angle at augmentation time, which was not convenient to incorporate from the engineering perspective. Instead, random horizontal flipping is applied non-adaptively as dataset mirroring, where flipping the yaw angles is more accessible. We train G in full precision, while D uses mixed precision.
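The dataset mirroring mentioned above amounts to flipping the image horizontally and negating the yaw angle. Below is a minimal sketch with hypothetical names; the exact sign convention for the yaw angle is an assumption.

```python
import random
import torch

def mirror_sample(image, yaw, pitch):
    """Non-adaptive horizontal mirroring applied at the dataset level.

    image: (C, H, W) tensor; yaw, pitch: camera angles in radians (roll is 0).
    Flipping the image horizontally corresponds to negating the yaw angle, which is
    easy to do here but inconvenient inside adaptive augmentations.
    """
    if random.random() < 0.5:
        image = torch.flip(image, dims=[2])  # flip along the width axis
        yaw = -yaw                           # mirror the camera about the vertical plane
    return image, yaw, pitch
```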

The hypernetwork is structured very similarly to the generator’s mapping network. It consists of 2 layers with LeakyReLU non-linearities. Its input is the positional embedding of the patch scales and offsets, encoded with Fourier features Sitzmann et al. (2020); Tancik et al. (2020) and concatenated into a single vector of dimensionality 828. It produces a patch representation vector $\mathbf{p}$, which is then adapted for each convolutional layer via:

$$\boldsymbol{\rho}_\ell = A_\ell(\mathbf{p}) \in \mathbb{R}^{c_\ell}, \qquad (7)$$

where $\boldsymbol{\rho}_\ell$ is the modulation vector, $A_\ell$ is the layer-specific affine transformation, and $c_\ell$ is the amount of output filters in the $\ell$-th layer. In this way, the discriminator has layer-specific adapters.
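Below is a minimal sketch of how such layer-specific adapters could be wired up; the class and parameter names, as well as the channel counts in the usage example, are assumptions for illustration rather than the exact modules from our repository.

```python
import torch
import torch.nn as nn

class PatchHypernetwork(nn.Module):
    """Map a patch representation p to per-layer modulation vectors rho_l = A_l(p)."""
    def __init__(self, p_dim, channels_per_layer):
        super().__init__()
        # One affine adapter A_l per convolutional layer of the discriminator.
        self.adapters = nn.ModuleList(
            [nn.Linear(p_dim, c_l) for c_l in channels_per_layer])

    def forward(self, p):
        # Returns a list of modulation vectors, one per layer, each of size c_l.
        return [adapter(p) for adapter in self.adapters]

# Usage sketch: each discriminator layer is modulated by its own rho_l.
hyper = PatchHypernetwork(p_dim=512, channels_per_layer=[64, 128, 256, 512])
p = torch.randn(4, 512)   # patch representations for a batch of 4 patches
rhos = hyper(p)           # rhos[l] has shape (4, c_l)
```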

For the background separation experiment, we adapt the neural representation MLP from INR-GAN Skorokhodov et al. (2021a), but pass 4 coordinates (for the inverted sphere parametrization Zhang et al. (2020)) instead of 2 as the input. It consists of 2 blocks with 2 linear layers each. We use 16 steps per ray for the background without hierarchical sampling.
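For context, the 4 coordinates come from the NeRF++-style inverted sphere parametrization: a background point is represented by its unit direction plus the inverse of its radius. A minimal sketch (with a name of our choosing) is shown below.

```python
import torch

def inverted_sphere_coords(points):
    """NeRF++-style parametrization for background points outside the unit sphere.

    points: (..., 3) world-space positions with norm > 1.
    Returns (..., 4): the unit direction (x/r, y/r, z/r) plus the inverse radius 1/r,
    which maps the unbounded outer space into a bounded 4D domain.
    """
    r = points.norm(dim=-1, keepdim=True).clamp(min=1.0)
    return torch.cat([points / r, 1.0 / r], dim=-1)
```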

Further details can be found in the accompanying source code.

B.2 Utilized computational resources

While developing our model, we launched experiments on NVIDIA A100 80GB or NVIDIA V100 32GB GPUs with an AMD EPYC 7713P 64-core processor. We found that, in practice, running the model on A100s gives a speed-up compared to V100s due to the possibility of increasing the batch size from 32 to 64. In this way, a given training speed for EpiGRAF can be reached with fewer A100s than V100s.

For the baselines, we ran them on 4-8 V100 GPUs, as specified by the original papers, unless the model could fit into 4 V100s without decreasing the batch size (which was only possible for StyleNeRF Gu et al. (2022)).

For rendering Megascans, we used NVIDIA TITAN RTX GPUs with 24GB of memory each. However, the resource utilization for rendering is negligible compared to training the generators.

In total, the project consumed GPU-years of A100 and V100 compute and GPU-days of TITAN RTX compute; note that a part of the V100 GPU-years was spent on training the baselines.

B.3 Annealing schedule details

As discussed in §3.3, the existing multi-scale patch-wise generators Schwarz et al. (2020); Meng et al. (2021) sample patch scales from a uniform distribution whose lower bound is gradually annealed during training from 0.9 (or 0.8 for Meng et al. (2021)) to the minimum scale, with different speeds. We visualize the annealing schedules for both GRAF and GNeRF in Fig 8, which demonstrates that they are very close to the lerp-based one described in §3.3.

Figure 8: Comparing annealing schedules for GRAF Schwarz et al. (2020), GNeRF Meng et al. (2021) and the lerp-based schedule from §3.3. We simplified the exposition by stating that GRAF and GNeRF use the lerp-based schedule, which is very close to reality.
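To make these schedules concrete, the sketch below contrasts a GRAF/GNeRF-style uniform schedule with a linearly annealed lower bound against a beta-based schedule of the kind advocated in §3.3. The specific parametrization and hyperparameter values (minimum scale, temperature range) are illustrative assumptions, not the exact configuration from our code.

```python
import numpy as np

def lerp_lower_bound_scale(step, total_steps, start=0.9, end=0.125, rng=np.random.default_rng()):
    """GRAF/GNeRF-style schedule: patch scale ~ U(l(t), 1), with l(t) annealed linearly."""
    t = min(step / total_steps, 1.0)
    low = start + (end - start) * t
    return rng.uniform(low, 1.0)

def annealed_beta_scale(step, total_steps, min_scale=0.125, start_temp=10.0, rng=np.random.default_rng()):
    """Beta-based schedule (illustrative): early in training the sampled scales concentrate
    near 1 (low-resolution views of the whole image); as the temperature anneals towards 1,
    the scales spread over the full [min_scale, 1] range."""
    t = min(step / total_steps, 1.0)
    temp = start_temp + (1.0 - start_temp) * t   # anneal temperature from start_temp down to 1
    u = rng.beta(1.0, temp)                      # concentrated near 0 for large temp
    return 1.0 - (1.0 - min_scale) * u           # map to [min_scale, 1]; mass near 1 early on
```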

Appendix C Failed experiments

Modern GANs involve a lot of engineering, and it often takes many futile experiments to reach a point where the obtained performance is acceptable. We want to enumerate some experiments which did not work out despite looking like they should have, either because the idea was fundamentally flawed on its own or because we under-explored it (or both).

Conditioning on global context worsened the performance. In Appx A, we argued that when D processes a small-scale patch, it does not have access to the global image information, which might be a source of decreased image quality. We tried several strategies to compensate for this. Our first attempt was to generate a low-resolution image, bilinearly upsample it to the target size, and then “grid paste” the high-resolution patch into it. The second attempt was to simply always concatenate a low-resolution version of the image as 3 additional channels. However, in both cases, the generator learned to produce the low-resolution versions of images well, but the texture was poor. We hypothesize that this was due to D starting to produce its prediction almost entirely based on the low-resolution image, ignoring the high-resolution patches since they are harder to discriminate.
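A minimal sketch of the second strategy (concatenating a low-resolution view of the whole image as extra input channels for D) is shown below; this is a schematic reconstruction, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def build_discriminator_input(patch, full_image, patch_size=64):
    """Concatenate a low-resolution view of the whole image to a high-resolution patch.

    patch:      (B, 3, patch_size, patch_size) high-resolution patch.
    full_image: (B, 3, H, W) corresponding full image.
    Returns a (B, 6, patch_size, patch_size) tensor: 3 patch channels + 3 global-context channels.
    """
    context = F.interpolate(full_image, size=(patch_size, patch_size),
                            mode='bilinear', align_corners=False)
    return torch.cat([patch, context], dim=1)
```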

Patch importance sampling did not work. Almost all the datasets used for 3D-aware image synthesis have regions with difficult content and regions with simpler content; this is especially noticeable for CARLA Schwarz et al. (2020) and our Megascans datasets, which contain a lot of white background. That is why patch-wise training could potentially be improved by sampling patches from the more difficult regions more frequently. We tried this strategy in the GNeRF Meng et al. (2021) problem setup of fitting a scene without known camera parameters on the NeRF-Synthetic dataset Mildenhall et al. (2020). We sampled patches from regions with a high average gradient norm more frequently. For some scenes it helped; for others it worsened the performance.
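For reference, a simple version of such importance sampling could weight candidate patch locations by the average image-gradient magnitude inside each candidate patch; the sketch below is illustrative and is not the exact criterion we used.

```python
import numpy as np

def sample_patch_location(image, patch_size, rng=np.random.default_rng()):
    """Sample a top-left patch corner with probability proportional to local image gradients.

    image: (H, W) grayscale array. Regions with larger gradients (more detail) are
    sampled more often than flat regions such as white background.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    grad = np.abs(gx) + np.abs(gy)
    H, W = image.shape
    # Sum of gradient magnitude over each candidate patch via 2D cumulative sums (box filter).
    csum = np.cumsum(np.cumsum(grad, axis=0), axis=1)
    csum = np.pad(csum, ((1, 0), (1, 0)))
    ph = pw = patch_size
    scores = csum[ph:, pw:] - csum[:-ph, pw:] - csum[ph:, :-pw] + csum[:-ph, :-pw]
    probs = (scores + 1e-8).ravel()
    probs /= probs.sum()
    idx = rng.choice(probs.size, p=probs)
    top, left = np.unravel_index(idx, scores.shape)
    return int(top), int(left)
```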

View direction conditioning breaks multi-view consistency. Similar to the prior works Schwarz et al. (2020); Chan et al. (2021a), our attempt to condition the radiance (but not the density) MLP on the ray direction (as in NeRF Mildenhall et al. (2020)) led to poor multi-view consistency, with the radiance changing as the camera moves. We tested this on FFHQ Karras et al. (2019), which has only a single view per object instance, and suspect that it would not happen on Megascans, where the view coverage is very rich.

Tri-planes produced from convolutional layers are much harder to optimize for reconstruction. While debugging our tri-plane representation, we found that tri-planes produced with convolutional layers are extremely difficult to optimize for reconstruction. I.e., if one fits a 3D scene while optimizing the tri-planes directly, then everything goes smoothly, but when those tri-planes are produced by the synthesis network of StyleGAN2 Karras et al. (2020b), then the PSNR scores (and the loss values) plateau very early.
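For context, querying a tri-plane at a 3D point amounts to projecting the point onto the three axis-aligned planes, bilinearly sampling each, and aggregating the features; the same query is used whether the planes are optimized directly or produced by a synthesis network. A minimal sketch is below (summation as the aggregation and this particular plane ordering are assumptions).

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query a tri-plane representation at 3D points.

    planes: (3, C, R, R) feature planes for the XY, XZ and YZ projections.
    points: (N, 3) coordinates in [-1, 1]^3.
    Returns (N, C) features obtained by bilinearly sampling each plane and summing.
    """
    coords = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]  # per-plane projections
    feats = 0.0
    for plane, xy in zip(planes, coords):
        grid = xy.view(1, -1, 1, 2)                                     # (1, N, 1, 2) sampling grid
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=False)   # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].t()                          # (N, C)
    return feats
```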

Appendix D Datasets details

D.1 Megascans dataset

Modern 3D-aware image synthesis benchmarks have two issues: 1) they contain objects of very similar global geometry (e.g., human or cat faces, cars and chairs), and 2) they have poor camera coverage. Moreover, some of them (e.g., FFHQ) contain 3D biases, where object features (e.g., smiling probability, gaze direction, posture or haircut) correlate with the camera position Chan et al. (2021b). As a result, this does not allow evaluating a model’s ability to represent the underlying geometry, and makes it harder to understand whether performance improvements come from methodological changes or from better data preprocessing.

To mitigate these issues, we introduce two new datasets: Megascans Plants (M-Plants) and Megascans Food (M-Food). To build them, we obtain models from Quixel Megascans555https://quixel.com/megascans from the Plants, Mushrooms and Food categories. Megascans are very high-quality scans of real objects which are almost indistinguishable from the real ones. We merge Mushrooms and Food into the same Food category since each has too few models on its own.

We render all the models in Blender Blender Online Community (2022) with cameras distributed uniformly at random over a sphere of radius 3.5 and a fixed field-of-view. While rendering, we scale each model to fit into a cube and discard those models whose bounding-box dimension product is less than 2. We render 128 views per object from a fixed distance to the object center, from uniformly sampled points on the entire sphere (even from below). For M-Plants, we additionally remove those models which have less than 0.03 pixel intensity on average (computed as the mean alpha value over the pixels and views). This is needed to remove small grass or leaves which would occupy too few pixels. As a result, this procedure produces 1,108 models for the Plants category and 199 models for the Food category.
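The camera placement and the alpha-based filtering boil down to uniform sampling on a sphere and a mean-opacity threshold. The short sketch below illustrates both; the function names are ours, and the threshold matches the 0.03 value mentioned above.

```python
import numpy as np

def sample_camera_on_sphere(radius=3.5, rng=np.random.default_rng()):
    """Sample a camera position uniformly on a sphere of the given radius."""
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)    # uniform direction on the unit sphere
    return radius * v         # the camera always looks at the object center

def keep_model(alpha_renders, threshold=0.03):
    """Discard models whose renders are almost empty (e.g., thin grass or single leaves).

    alpha_renders: (num_views, H, W) alpha channels of the rendered views in [0, 1].
    """
    return alpha_renders.mean() >= threshold
```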

We include the rendering script as a part of the released source code. We cannot release the source models or textures due to the copyright restrictions. We release all the images under the CC BY-NC-SA 4.0 license666https://creativecommons.org/licenses/by-nc-sa/4.0. Apart from the images, we also release the class categories for both M-Plants and M-Food.

The released datasets do not contain any personally identifiable information or offensive content, since they do not include any human subjects, animals, or other creatures with scientifically proven cognitive abilities. One concern that might arise is the inclusion of Amanita muscaria777https://en.wikipedia.org/wiki/Amanita_muscaria in the Megascans Food dataset, which is poisonous (when consumed by ingestion without specific preparation). This is why we urge the reader not to treat the included objects as edible items, even though they are a part of the “food” category. We provide random samples from both datasets in Fig 9 and Fig 10. Note that they are almost indistinguishable from real objects.

Figure 9: Real images from the Megascans Plants dataset. This dataset contains very complex geometry and texture, while having good camera coverage.
Figure 10: Real images from the Megascans Food dataset. Caution: some objects in this dataset could be poisonous.

D.2 Dataset statistics

We provide the dataset statistics in Tab 4. For CARLA Schwarz et al. (2020), we provide them only for comparison and do not use this dataset as a benchmark, since it is small and has simple geometry and texture.

Dataset Number of images Yaw distribution Pitch distribution Resolution
FFHQ Karras et al. (2019) 70,000 Normal(0, 0.3) Normal(, 0.2)
Cats 10,000 Normal(0, 0.2) Normal(, 0.2)
CARLA 10,000 USphere(0, ) USphere(, )
M-Plants 141,824 USphere(0, ) USphere(, )
M-Food 25,472 USphere(0, ) USphere(, )
Table 4: Comparing 3D datasets. Megascans Plants and Megascans Food are much more complex in terms of geometry and have much better camera coverage than FFHQ Karras et al. (2019) or Cats Zhang et al. (2008). The abbreviation “USphere” denotes the uniform distribution on a sphere (see π-GAN Chan et al. (2021a)) with the given mean and pitch interval. For Cats, the final resolution depends on the cropping, and we report the original dataset resolution.
(a) FFHQ Karras et al. (2019).
(b) Cats Zhang et al. (2008)
(c) Megascans Plants.
(d) Megascans Food.
Figure 11: Comparing yaw/pitch angles distribution for different datasets.

Appendix E Additional samples

We provide random non-cherry-picked samples from our model in Fig 12, but we recommend visiting the website for video illustrations: https://universome.github.io/epigraf.

To demonstrate GRAM’s Deng et al. (2022) mode collapse, we provide more its samples in Fig 13.

(a) FFHQ  Karras et al. (2019).
(b) Cats  Zhang et al. (2008)
(c) Megascans Plants
(d) Megascans Food
Figure 12: Random samples (without any cherry-picking) for our model. Zoom-in is recommended.
Figure 13: Random samples from GRAM Deng et al. (2022) on M-Plants (left) and M-Food (right). Since it uses the same set of iso-surfaces for every sample to represent the geometry, it struggles to fit datasets with variable structure and suffers from a very severe mode collapse.

Appendix F Potential negative societal impacts

Our method belongs to the general family of media synthesis algorithms that could be used for the automated creation and manipulation of different types of media content, such as images, videos, or 3D scenes. Of particular concern is the creation of deepfakes888https://en.wikipedia.org/wiki/Deepfake: photo-realistic replacement of one person’s identity with another in images and videos. While our model does not yet reach a quality level at which its generations are perceptually indistinguishable from real media, such concerns should be kept in mind when developing this technology further.

Appendix G Ethical concerns

We have reviewed the ethics guidelines999https://nips.cc/public/EthicsGuidelines and confirm that our work complies with them. As discussed in Appx D.1, our released datasets are not human-derived and hence do not contain any personally identifiable information and are not biased against any groups of people.