Equivariant Neural Rendering

06/13/2020 ∙ by Emilien Dupont, et al. ∙ 6

We propose a framework for learning neural scene representations directly from images, without 3D supervision. Our key insight is that 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene. Specifically, we introduce a loss which enforces equivariance of the scene representation with respect to 3D transformations. Our formulation allows us to infer and render scenes in real time while achieving comparable results to models requiring minutes for inference. In addition, we introduce two challenging new datasets for scene representation and neural rendering, including scenes with complex lighting and backgrounds. Through experiments, we show that our model achieves compelling results on these datasets as well as on standard ShapeNet benchmarks.




1 Introduction

Designing useful 3D scene representations for neural networks is a challenging task. While several works have used traditional 3D representations such as voxel grids (maturana2015voxnet; nguyen2018rendernet; zhu2018visual), meshes (jack2018learning), point clouds (qi2017pointnet; insafutdinov2018unsupervised) and signed distance functions (DeepSDF), they each have limitations. For example, it is often difficult to scalably incorporate texture, lighting and background into these representations. Recently, neural scene representations have been proposed to overcome these problems (Eslami1204; sitzmann2019deepvoxels; sitzmann2019scene), usually by incorporating ideas from graphics rendering into the model architecture.

In this paper, we argue that equivariance with respect to 3D transformations provides a strong inductive bias for neural rendering and scene representations. Indeed, we argue that, for many tasks, scene representations need not be explicit (such as point clouds and meshes) as long as they transform like explicit representations.

Our model is trained with no 3D supervision and only requires images and their relative poses to learn equivariant scene representations. Our formulation does not impose any restrictions on the rendering process and, as a result, we are able to model complex visual effects such as reflections and cluttered backgrounds. Unlike most other scene representation models (Eslami1204; sitzmann2019deepvoxels; sitzmann2019scene), our model does not require any pose information at inference time. From a single image, we can infer a scene representation, transform it and render it (see Fig. 1). Further, we can infer and render scene representations in real time while many scene representation algorithms require minutes to perform inference from an image or a set of images (nguyen2018rendernet; sitzmann2019scene; DeepSDF).



Figure 1: From a single image (left), our model infers a scene representation and generates new views of the scene (right) with a learned neural renderer.

While several works achieve impressive results by training models on images of a single scene and then generating novel views of that same scene (mildenhall2020nerf), we focus on generalizing across different scenes. This provides an additional challenge as we are required to learn a prior over shapes and textures to generalize to novel scenes. Our approach also allows us to bypass the need for different scenes in the training set to be aligned (or share the same coordinate system). Indeed, since we learn scene representations that transform like real scenes, we only require relative transformations to train the model. This is particularly advantageous when considering real scenes with complicated backgrounds where alignment can be difficult to achieve.

Neural rendering and scene representation models are usually tested and benchmarked on the ShapeNet dataset (chang2015shapenet). However, the images produced from this dataset are often very different from real scenes: they are rendered on empty backgrounds and only involve a single fixed object. As our model does not rely on 3D supervision, we are able to train it on rich data where it is very expensive or difficult to obtain 3D ground truths. We therefore introduce two new datasets of posed images which can be used to test models with complex visual effects. The first dataset, MugsHQ, is composed of photorealistic renders of colored mugs on a table with an ambient background. The second dataset, 3D mountains, contains renders of more than 500 mountains in the Alps using satellite and topography data. In summary, our contributions are:

  • We introduce a framework for learning scene representations and novel view synthesis without explicit 3D supervision, by enforcing equivariance between the change in viewpoint and change in the latent representation of a scene.

  • We show that we can generate convincing novel views in real time without requiring alignment between scenes or pose information at inference time.

  • We release two new challenging datasets to test representations and neural rendering for complex, natural scenes, and show compelling rendering results on each, highlighting the versatility of our method.

2 Related Work

Scene representations. Traditional scene representations (e.g. point clouds, voxel grids and meshes) do not scale well due to memory and compute requirements. Truncated signed distance functions (SDF) have been used to aggregate depth measurements from 3D sensors (curless_tsdf) to map and track surfaces in real-time (newcombe2011kinectfusion), without requiring assumptions about surface structure. niessner2013real extend these implicit surface methods by incrementally fusing depth measurements into a hashed memory structure. More recently DeepSDF extend SDF representations to whole classes of shapes, with learned neural mappings. Similar implicit neural representations have been used for 3D reconstruction from a single view (xu2019disn; mescheder2018occupancy).

Neural rendering. Neural rendering approaches produce photorealistic renderings given noisy or incomplete 3D or 2D observations. In DeferredNeuralRendering, incomplete 3D inputs are converted to rich scene representations using neural textures, which fill in and regularize noisy measurements. sitzmann2019scene encode geometry and appearance into a latent code that is decoded using a differentiable ray marching algorithm. Similar to our work, DeepVoxels (sitzmann2019deepvoxels) encodes scenes into a 3D latent representation. In contrast with our work, these methods either require 3D information during training, complicated rendering priors or expensive inference schemes.

Novel view synthesis. In Eslami1204, one or more input views with camera poses are aggregated into a context feature vector, and are rendered into a target 2D image given a query camera pose. tobin-egqn extend this base method using epipolar geometrical constraints to improve the decoding. Our model does not require the expensive sequential decoding steps of these models and enforces 3D structure through equivariance. tatarchenko2016multi can perform novel view synthesis for single objects consistent with a training set, but require depth to train the model. Hedman2018; Hedman2016; thies2018headon; Xu2019 use coarse geometric proxies. Our method only requires images and their poses to train, and can therefore extend more readily to real scenes with minimal assumptions about geometry. Works based on flow estimation for view synthesis (sun2018multi; zhou2016view) predict a flow field over the input image(s) conditioned on a camera viewpoint transformation. These approaches model a free-form deformation in image space; as a result, they cannot explicitly enforce equivariance with respect to 3D rotation. In addition, these models are commonly restricted to single objects, not entire scenes.

Equivariance. While translational equivariance is a natural property of convolution on the spatial grid, traditional neural networks are not equivariant with respect to general transformation groups. Equivariance for discrete rotations can be achieved by replicating and rotating filters (pmlr-v48-cohenc16). Equivariance to rotation has been extended to 3D using spherical CNNs (esteves17). Steerable filters (cohen2016steerable) and equivariant capsule networks (NIPS2018_8100) achieve approximate smooth equivariance by estimating pose and transforming filters, or by disentangling pose and filter representations. worrall2017interpretable use equivariance to learn autoencoders with interpretable transformations, although they do not explicitly encode 3D structure in the latent space. olszewski2019transformable’s method is closely related to ours but only focuses on a limited range of transformations, instead of complete 3D rotations. In our method, we achieve equivariance by treating our latent representation as a geometric 3D data structure and applying rotations directly to this representation.

3 Equivariant Scene Representations

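We denote an image by $x \in \mathcal{X} = \mathbb{R}^{c \times h \times w}$ where $c$, $h$, $w$ are the number of channels, height and width of the image respectively. We denote a scene representation by $z \in \mathcal{Z}$. We further define a rendering function $g: \mathcal{Z} \to \mathcal{X}$ mapping scene representations to images and an inverse renderer $f: \mathcal{X} \to \mathcal{Z}$ mapping images to scenes.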

We distinguish between two classes of scene representations: explicit and implicit representations (see Fig. 2). Explicit representations are designed to be interpreted by humans and are rendered by a fixed interpretable process. As an example, $z$ can be a 3D mesh and $g$ a standard rendering function such as a raytracer. Implicit representations, in contrast, are abstract and need not be human interpretable. For example, $\mathcal{Z}$ could be the latent space of an autoencoder and $g$ a neural network. We argue that, for many tasks, scene representations need not be explicit as long as they transform like explicit representations.

Figure 2: Left: A camera on the sphere observing an explicit scene representation (a mesh). Right: A camera on the sphere observing an implicit scene representation (a 3D tensor).

Indeed, we can consider applying some transformation $T^{\mathcal{Z}}$ to a scene representation. For example, we can rotate and translate a 3D mesh. The resulting image rendered by $g$ should then reflect these transformations, that is, we would expect an equivalent transformation $T^{\mathcal{X}}$ to occur in image space (see Fig. 3). We can write down this relation as

$$T^{\mathcal{X}} g(z) = g(T^{\mathcal{Z}} z). \tag{1}$$

This equation encodes the fact that transforming a scene representation with $T^{\mathcal{Z}}$ and rendering it with $g$ is equivalent to rendering the original scene and performing a transformation $T^{\mathcal{X}}$ on the rendered image. More specifically, the renderer is equivariant with respect to the transformations in image and scene space (formally, $T^{\mathcal{X}}$ and $T^{\mathcal{Z}}$ represent the action of a group, such as the group of 3D rotations SO(3) or the group of 3D rotations and translations SE(3)). We then define an equivariant scene representation as one that satisfies the equivariance relation in equation (1). We can therefore think of equivariant scene representations as a generalization of several other scene representations. Indeed, meshes, voxels, point clouds (and so on) paired with their appropriate rendering function all satisfy this equation.

Figure 3: Rotating a mesh with $T^{\mathcal{Z}}$ and rendering it with $g$ is equivalent to rendering the original mesh and applying a transformation $T^{\mathcal{X}}$ in image space. This is true regardless of the choice of scene representation and rendering function.
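As a toy illustration of the equivariance relation (not the learned renderer used in this paper), consider an explicit voxel grid that is rendered by summing along the depth axis. The function names `render`, `rotate_scene` and `rotate_image` are hypothetical, and the transformation is restricted to a 90 degree rotation about the viewing axis so that no interpolation is needed. Under these assumptions, rotating the scene and then rendering gives the same image as rendering and then rotating the image:

```python
import numpy as np

def render(scene):
    """Toy orthographic renderer: sum a (D, H, W) voxel grid along depth."""
    return scene.sum(axis=0)

def rotate_scene(scene):
    """Scene-space transformation: 90 degree rotation about the depth (viewing) axis."""
    return np.rot90(scene, k=1, axes=(1, 2))

def rotate_image(image):
    """Matching image-space transformation: 90 degree rotation of the image."""
    return np.rot90(image, k=1, axes=(0, 1))

rng = np.random.default_rng(0)
scene = rng.random((4, 5, 5))

# Equivariance: transform-then-render equals render-then-transform.
lhs = render(rotate_scene(scene))
rhs = rotate_image(render(scene))
equivariant = np.allclose(lhs, rhs)
print(equivariant)
```

Rotations about other axes change which parts of the scene are visible, so the corresponding image-space transformation is in general not a simple rearrangement of pixels.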

4 Model

In this section, we design a model and loss that can be used to learn equivariant scene representations from data. While our formulation applies to general transformations and scene representations, we focus on the case where the scene representations are deep voxels and the family of transformations is 3D rotations. Specifically, we set $\mathcal{Z} = \mathbb{R}^{c_s \times d_s \times h_s \times w_s}$ where $c_s$, $d_s$, $h_s$, $w_s$ are the channels, depth, height and width of the scene representation. We denote the rotation operation in scene space by $R^{\mathcal{Z}}$ and the equivalent rotation operation acting on rendered images by $R^{\mathcal{X}}$.

As our model learns implicit scene representations, we do not require 3D ground truths. Instead, our dataset is composed of pairs of views of scenes and relative camera transformations linking the two views. Specifically, we assume the camera observing the scenes is on a sphere looking at the origin. For a given scene, we consider two image captures of the scene $x_1$ and $x_2$ and the relative camera transformation between the two, $\theta = (\alpha, n)$, where $\alpha$ is the angle and $n$ the axis parameterizing the 3D rotation (we use the axis-angle parameterization for notational convenience, but any rotation formalism such as Euler angles, rotation matrices and quaternions could be used; in our implementation, we parameterize this rotation by a rotation matrix). A training data point is then given by $(x_1, x_2, \theta)$. In practice, we capture a large number of views for each scene and randomly sample new pairs at every iteration in training. This allows us to build models that generalize well across a large variety of camera transformations.
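To make the link between the axis-angle parameterization and a rotation matrix concrete, the following sketch converts one into the other with Rodrigues' formula. The function name `axis_angle_to_matrix` is hypothetical and this is not taken from the authors' implementation; it only illustrates one standard way of obtaining the matrix form.

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about the direction `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)  # make sure the axis is a unit vector
    kx, ky, kz = axis
    # Cross-product (skew-symmetric) matrix of the axis.
    K = np.array([[0.0, -kz, ky],
                  [kz, 0.0, -kx],
                  [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# A quarter turn about the z axis maps the x axis onto the y axis.
R = axis_angle_to_matrix([0.0, 0.0, 1.0], np.pi / 2)
# The inverse of a rotation matrix is its transpose, which gives the
# transformation from the second view back to the first.
R_inv = R.T
```

Because the inverse is just the transpose, a single sampled pair of views provides both directions of the relative transformation.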

To design a loss that enforces equivariance with respect to the rotation transformation, we consider two images of the same scene and their relative transformation $(x_1, x_2, \theta)$. We first map the images through the inverse renderer to obtain their scene representations $z_1 = f(x_1)$ and $z_2 = f(x_2)$. We then rotate each encoded representation by its relative transformation $R^{\mathcal{Z}}_\theta$, such that $\tilde{z}_1 = R^{\mathcal{Z}}_\theta z_1$ and $\tilde{z}_2 = (R^{\mathcal{Z}}_\theta)^{-1} z_2$. As $z_1$ and $z_2$ represent the same scene in different poses, we expect the rotated $\tilde{z}_1$ to be rendered as the image $x_2$ and the rotated $\tilde{z}_2$ as $x_1$. This is illustrated in Fig. 4. We can then ensure our model obeys these transformations by minimizing

$$\mathcal{L}_{\text{render}} = \| x_2 - g(\tilde{z}_1) \| + \| x_1 - g(\tilde{z}_2) \|. \tag{2}$$

As $x_2 = R^{\mathcal{X}}_\theta x_1$, minimizing this loss then corresponds to satisfying the equivariance property for the renderer $g$. Note that the form of this loss function is similar to the ones proposed by worrall2017interpretable and olszewski2019transformable.

Figure 4: Model training. We encode two images $x_1$, $x_2$ of the same scene into their respective scene representations $z_1$, $z_2$. Since they are representations of the same scene viewed from different points, we can rotate each one into the other. The rotated scene representations $\tilde{z}_1$, $\tilde{z}_2$ should then be decoded to match the swapped image pairs $x_2$, $x_1$.
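The training procedure in Fig. 4 can be sketched as a single function. This is a minimal sketch and not the authors' implementation: `inverse_render`, `render` and `rotate` are hypothetical caller-supplied callables standing in for the learned networks and the scene-space rotation, and the mean absolute error is an assumed choice of rendering norm. The toy check uses a point cloud with identity rendering, so the loss should vanish for a consistent pair of views.

```python
import numpy as np

def equivariance_loss(x1, x2, rotation, inverse_render, render, rotate):
    """Sum of the two rendering errors after swapping the views in scene space."""
    z1 = inverse_render(x1)
    z2 = inverse_render(x2)
    # Rotate view 1 into view 2, and view 2 back into view 1.
    z1_rotated = rotate(z1, rotation)
    z2_rotated = rotate(z2, rotation.T)  # transpose = inverse of a rotation matrix
    # Mean absolute error as the rendering loss (an assumption for this sketch).
    return (np.abs(render(z1_rotated) - x2).mean()
            + np.abs(render(z2_rotated) - x1).mean())

# Toy setting: the "scene" is a point cloud and rendering is the identity.
rng = np.random.default_rng(0)
points = rng.random((10, 3))
turn = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
identity = lambda v: v
rotate_points = lambda z, r: z @ r.T

x1, x2 = points, rotate_points(points, turn)
consistent = equivariance_loss(x1, x2, turn, identity, identity, rotate_points)
mismatched = equivariance_loss(x1, x1, turn, identity, identity, rotate_points)
```

Here `consistent` is zero because the second view really is the rotated first view, while `mismatched` is positive because the claimed rotation does not relate the two inputs.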

Model architecture. In contrast to most other works learning implicit scene representations (worrall2017interpretable; Eslami1204; Chen_2019_ICCV), our representation is spatial in three dimensions, allowing us to use fully convolutional architectures for both the inverse and forward neural renderer. To build the forward renderer, we take inspiration from RenderNet (nguyen2018rendernet) and HoloGAN (nguyen2019hologan) as these have been shown to achieve good performance on rendering tasks. Specifically, the scene representation is mapped through a set of 3D convolutions, followed by a projection layer of convolutions and finally a set of 2D convolutions mapping the projection to an image. The inverse renderer is simply defined as the transpose of this architecture (see Fig. 5). For complete details of the architecture, please refer to the appendix.

Figure 5: Model architecture. An input image (top left) is mapped through 2D convolutions (blue), followed by an inverse projection (purple) and a set of 3D convolutions (green). The inferred scene is then rendered through the transpose of this architecture.

Voxel rotation. Defining the rotation operation in scene space is crucial. As our scene representation is a deep voxel grid, we simply apply a 3D rotation matrix to the coordinates of the features in the voxel grid. As the rotated points may not align with the grid, we use inverse warping with trilinear interpolation to reconstruct the values at the voxel locations (see for more detail). We note that warping and interpolation operations are available in frameworks such as PyTorch and TensorFlow, making it simple to implement voxel rotations in practice.
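The voxel rotation described above can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `rotate_voxels` is hypothetical, the rotation matrix is assumed to act on (depth, height, width) coordinates about the grid centre, and samples that fall outside the grid are treated as zero.

```python
import numpy as np

def rotate_voxels(grid, rotation):
    """Rotate a (C, D, H, W) feature grid about its centre by inverse warping."""
    c, d, h, w = grid.shape
    size = np.array([d, h, w])
    centre = (size - 1) / 2.0
    # Centred coordinates of every output voxel, one row per voxel.
    zz, yy, xx = np.meshgrid(np.arange(d), np.arange(h), np.arange(w), indexing="ij")
    target = np.stack([zz, yy, xx], axis=-1).reshape(-1, 3) - centre
    # Inverse warping: find where each output voxel came from. For row
    # vectors, `target @ rotation` applies the inverse (transposed) rotation.
    source = target @ rotation + centre
    lower = np.floor(source).astype(int)
    frac = source - lower
    out = np.zeros((c, d * h * w))
    # Trilinear interpolation: blend the 8 grid points around each source location.
    for corner in np.ndindex(2, 2, 2):
        corner = np.array(corner)
        idx = lower + corner
        weight = np.prod(np.where(corner == 1, frac, 1.0 - frac), axis=1)
        inside = np.all((idx >= 0) & (idx < size), axis=1)  # zero outside the grid
        idx = np.clip(idx, 0, size - 1)
        out += grid[:, idx[:, 0], idx[:, 1], idx[:, 2]] * (weight * inside)
    return out.reshape(c, d, h, w)

grid = np.arange(2 * 3 * 3 * 3, dtype=float).reshape(2, 3, 3, 3)
# Quarter turn in the height-width plane, leaving the depth axis fixed.
quarter_turn = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
rotated = rotate_voxels(grid, quarter_turn)
```

For this quarter turn the source locations land exactly on grid points, so the result is a pure rearrangement of voxels; for general angles the interpolation blends neighbouring features. In PyTorch, `torch.nn.functional.affine_grid` and `torch.nn.functional.grid_sample` offer a batched, differentiable form of the same idea for 5D inputs, although the text here does not say which operations the authors used.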

Rendering loss. There are several possible choices for the rendering loss, the most common being the $\ell_1$ norm, $\ell_2$ norm and SSIM (wang2004image) or combinations thereof. As noted in other works (worrall2017interpretable; snell2017learning), a weighted sum of a pixel-wise norm and SSIM works well in practice. However, we found that our model is not particularly sensitive to the choice of regression loss, and analyse the various trade offs through ablation studies in the experimental section.

5 Experiments

We perform experiments on ShapeNet benchmarks (chang2015shapenet) as well as on two new datasets designed to challenge the model on more complex scenes. For all experiments, the images are of size and the scene representations are of size . For both the 2D and 3D parts of the network we use residual layers for convolutions that preserve the dimension of the input and strided convolutions for downsampling layers. We use the LeakyReLU nonlinearity (maas2013rectifier) and GroupNorm (wu2018group) for normalization. Complete architecture and training details can be found in the appendix.

Most novel view synthesis works are tested on the ShapeNet dataset or variants of it. However, renders from ShapeNet objects are typically very far from real life scenes, which tends to limit the use cases for models trained on them. As our scene representation and rendering framework make no restricting assumptions about the rendering process (such as requiring single objects, no reflections, no background etc.), we create new datasets to test the performance of our model on more advanced tasks.

The new datasets are challenging by design and are composed of photorealistic 3D scenes and 3D landscapes with textures from satellite images. We achieve compelling results on these datasets and hope they will spur further research into scene representations that are not limited to simple scenes without backgrounds. We plan to open source all datasets and code.

Requirement                       TCO   dGQN  SRN   Ours
Requires absolute pose            Yes   Yes   Yes   No
Requires pose at inference time   No    Yes   Yes   No
Optimization at inference time    No    No    Yes   No
Table 1: Requirements for each baseline. Our model performs comparably to other models that make much stronger assumptions about the data and inference process.

5.1 Baselines

We compare our model with three strong baselines. The first is the model proposed by tatarchenko2016multi which we refer to as TCO, the second is a deterministic variant of Generative Query Networks (Eslami1204) which we refer to as dGQN and the third is the Scene Representation Network (SRN) as described in sitzmann2019scene (for detailed descriptions of these baselines, please refer to the appendix of sitzmann2019scene). All baselines make strong assumptions that substantially simplify the view-synthesis and scene representation problem. We discuss each of these assumptions in detail below and provide a comparison in Table 1. Our model requires none of these assumptions, making the task it has to solve considerably more challenging while also being more generally applicable.

Absolute and relative pose. All baselines require an absolute coordinate system (often referred to as a world coordinate system) for the pose (or viewpoints). For example, when trained on chairs, the viewpoint corresponding to the camera being at the origin would be the one observing the chair face on. The poses are then absolute in the sense that the camera at the origin corresponds to observing the chair face on for all chairs, i.e. we need all scenes to be perfectly aligned. While this is possible for simple datasets like ShapeNet, it is difficult to define a consistent alignment for a set of scenes, particularly for complex scenes with backgrounds and real life images. In contrast, our model does not require any notion of alignment or absolute pose. Equivariance is exactly why we are able to build a representation that is “origin-free”, because it only depends on relative transformations between poses.
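The difference between absolute and relative pose can be checked numerically. The sketch below is only an illustration under assumed conventions: camera orientations are represented as camera-to-world rotation matrices, the relative rotation is expressed in the first camera's frame, and all function names are hypothetical. Changing the world coordinate system changes every absolute pose but leaves the relative rotation between two views untouched.

```python
import numpy as np

def rotation_about_up(angle):
    """Rotation by `angle` radians about the up axis (taken to be z here)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_about_x(angle):
    """Rotation by `angle` radians about the x axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def relative_rotation(cam1, cam2):
    """Rotation taking camera 1's orientation to camera 2's, in camera 1's frame."""
    return cam1.T @ cam2

# Two camera orientations (camera-to-world rotations) in some world frame.
cam1 = rotation_about_up(0.3) @ rotation_about_x(0.5)
cam2 = rotation_about_up(1.2) @ rotation_about_x(0.1)

# Re-express both cameras in a different, arbitrarily rotated world frame.
world_change = rotation_about_up(2.0) @ rotation_about_x(-0.7)
cam1_new, cam2_new = world_change @ cam1, world_change @ cam2

absolute_changed = not np.allclose(cam1, cam1_new)
relative_unchanged = np.allclose(relative_rotation(cam1, cam2),
                                 relative_rotation(cam1_new, cam2_new))
```

Both flags evaluate to true: the world rotation cancels in the product of the transposed first pose with the second, which is why a model trained only on relative transformations does not need scenes to share a coordinate system.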

Pose at inference time. In order to infer a scene representation, our model takes as input a single image of the scene. In contrast, both dGQN and SRNs require an image as well as the viewpoint from which the image was taken. Providing the viewpoint considerably simplifies the task for these baselines, as the model does not need to infer the pose.

Optimization at inference time. At inference time, SRNs require solving an optimization problem in order to fit a scene to the model. As such, inferring a scene representation from a single input image (on a Tesla V100 GPU) takes 2 minutes with SRNs but only 22 ms for our model (three orders of magnitude faster). The idea of training at inference time is a crucial element of SRNs and other works in 3D computer vision (DeepSDF), but is not required for our model.

5.2 Chairs

We evaluate our model on the ShapeNet chairs class by following the experimental setup given in sitzmann2019scene, using the same train/validation/test splits. The dataset is composed of 6591 chairs each with 50 views sampled uniformly on the sphere for a total of 329,550 images. Images are sampled on the full sphere around the object, making the task much more difficult than typical setups which limit the elevation or azimuth or both (tatarchenko2016multi; Chen_2019_ICCV; olszewski2019transformable).
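For intuition about what sampling views uniformly on the full sphere involves, the sketch below draws camera positions by normalising Gaussian vectors, which is one standard way of obtaining a uniform distribution on the sphere. How the views in this dataset were actually generated is not stated here; the function name `sample_views_on_sphere`, the radius and the seed are hypothetical.

```python
import numpy as np

def sample_views_on_sphere(num_views, radius=1.0, seed=0):
    """Sample camera positions uniformly on a sphere centred at the origin."""
    rng = np.random.default_rng(seed)
    # Normalised isotropic Gaussian vectors are uniformly distributed on the sphere.
    points = rng.normal(size=(num_views, 3))
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    return radius * points

views = sample_views_on_sphere(50)
# 6591 chairs with 50 views each gives the dataset size quoted above.
total_images = 6591 * 50
```

Restricting elevation or azimuth, as in the more limited setups mentioned above, would amount to rejecting or re-parameterising part of these samples.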

Novel view synthesis. Results for novel view synthesis are shown in Fig. 6. The novel views were produced by taking a single image of an unseen chair, inferring its scene representation with the inverse renderer, rotating the scene and generating a novel view with the learned neural renderer. As can be seen, our model is able to generate plausible views of new chairs even when viewed from difficult angles and in the presence of occlusion. The model works well even for oddly shaped chairs with thin structures.




Figure 6: Novel view synthesis for chairs. Given a single image of an unseen object (left), we infer a scene representation, rotate and render it with our learned renderer to generate novel views. Due to space constraints we include chairs with interesting properties here and show randomly sampled chairs in the appendix.

Quantitative comparisons. To perform quantitative comparisons, we follow the setup in sitzmann2019scene by considering a single informative view of an unseen test object and measuring the reconstruction performance on the upper hemisphere around the object (results are shown in Table 2). Surprisingly, even though our model makes much weaker assumptions than all the baselines, it significantly improves upon both the TCO and dGQN baselines and is comparable with the state-of-the-art SRNs.

Dataset TCO dGQN SRN Ours
Chairs 21.27 21.59 22.89 22.83
Table 2: Reconstruction accuracy (higher is better) in PSNR (units of dB) for baselines and our model on ShapeNet chairs.
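For reference, PSNR can be computed from the mean squared error between a rendered image and its target. The sketch below is a minimal version that assumes pixel values in the range 0 to `max_value`; the function name `psnr` and the toy inputs are hypothetical, and the exact evaluation protocol behind the table (for example how values are averaged over views) is not specified here.

```python
import numpy as np

def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((prediction - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
target = np.zeros((3, 4, 4))
prediction = target + 0.1
value = psnr(prediction, target)
```

Because the scale is logarithmic, a difference of 1 dB corresponds to a factor of about 1.26 in mean squared error.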

Qualitative comparisons. We show qualitative comparisons with the baselines for single shot novel view synthesis in Fig. 7. As can be seen, our model produces high quality novel views that are comparable to or better than dGQN and TCO while being slightly worse than SRNs.

Figure 7: Qualitative comparisons for single shot novel view synthesis (columns, left to right: Input, dGQN, TCO, SRN, Ours, Target). The baseline images were borrowed with permission from sitzmann2019scene.

5.3 Cars

We also evaluate our model on the ShapeNet cars class, allowing us to test our model on images with richer texture than chairs. The dataset is composed of 3514 cars each with 50 views sampled uniformly on the sphere for a total of 175,700 images.

Novel view synthesis. As can be seen in Fig. 8, our model is able to generate plausible views for cars with various colors and thin structures like spoilers. While our model successfully infers 3D shape and appearance, it still struggles to capture some fine texture and geometry details (see Section 6 for a thorough discussion of the limitations and failures of our model).




Figure 8: Novel view synthesis for cars.

Absolute and relative poses. As mentioned in Section 5.1, our model only relies on relative transformations and therefore alleviates the need for alignment between scenes. As all baselines require absolute poses and alignment between scenes, we run tests to see how important this assumption is. Specifically, we break the alignment between scenes in the cars dataset by randomly rotating each scene around the up axis (we found that rotating around one axis was enough to see a significant effect; rotating around all 3 axes would likely have an even larger effect). We then train an SRN model on the perturbed and unperturbed datasets to understand to what extent the model relies on the absolute coordinates. As can be seen in Fig. 9, breaking the alignment between scenes significantly deteriorates the performance of SRNs while it leaves the performance of our model unchanged. This is similarly reflected when measuring reconstruction accuracy on the test set (see Table 3).

Figure 9: Qualitative comparisons on cars between SRNs, SRNs with relative poses around the up axis and our model (columns, left to right: Input, SRN, SRN (relative), Ours, Target).

Dataset   SRN     SRN (relative)   Ours
Cars      22.36   21.05            22.26
Table 3: Reconstruction accuracy (higher is better) in PSNR (units of dB) on ShapeNet cars.

5.4 MugsHQ

As the model does not make any restricting assumptions about the rendering process, we test it on more difficult scenes by building the MugsHQ dataset based on the mugs class from ShapeNet. Instead of rendering images on a blank background, every scene is rendered with an environment map (lighting conditions) and a checkerboard disk platform. For each of the 214 mugs, we sample 150 viewpoints uniformly over the upper hemisphere and render views using the Mitsuba renderer (Mitsuba). Note that the environment map and disk platform are the same for every mug. The resulting scenes include more complex visual effects like reflections and look more realistic than typical ShapeNet renders, making the task of novel view synthesis considerably more challenging. A complete description of the dataset as well as samples can be found in the appendix.

Novel view synthesis. Results for single shot novel view synthesis on unseen mugs are shown in Fig. 10. As can be seen, the model successfully infers the shape of unseen mugs from a single image and is able to perform large viewpoint transformations. Even from difficult viewpoints, the model is able to produce consistent and realistic views of the scenes, even generating reflections on the mug edges. As is the case for the ShapeNet dataset, our model can still miss fine details such as thin mug handles and struggles with some oddly shaped mugs (see Section 6 for examples).




Figure 10: Novel view synthesis on MugsHQ.

5.5 Mountains

We also introduce 3D mountains, a dataset of mountain landscapes. We created the dataset by scraping the height, latitude and longitude of the 559 highest mountains in the Alps (we chose this mountain range because it was easiest to find data). We then used satellite images combined with topography data to sample random views of each mountain at a fixed height (see appendix for samples and detailed description). This dataset is extremely challenging, with varied and complex geometry and texture. While obtaining high quality results on this dataset is beyond the scope of our algorithm, we hope it can be useful for pushing the boundaries of research in neural rendering.

Novel view synthesis. Results for single shot novel view synthesis are shown in Fig. 11. While the model struggles to capture high frequency detail, it faithfully reproduces the 3D structure and texture of the mountain as the camera rotates around the scene representation. For a variety of mountain landscapes (snowy, rocky etc.), our model is able to generate plausible, albeit blurry, views. An interesting feature is that, for views near the input image, the generated images are considerably sharper than for views far away from the input. This is likely due to the considerable uncertainty in generating views far from the source view: given the front of a mountain, there are many plausible ways the back of the mountain could appear. As our model is deterministic, it generates sharper views near the input where there is less uncertainty and blurs views far from the input where there is more uncertainty.




Figure 11: Novel view synthesis on 3D mountains.

Figure 12: Comparisons on chairs showing the trade off between different rendering losses.

5.6 Ablation studies

We perform ablation studies to test the trade offs between various rendering losses. Fig. 12 shows the difference in generated images when using a pixel-wise norm loss on its own and a norm loss combined with SSIM. While both losses perform well, the pixel-wise norm loss produces somewhat blurrier images than the loss with SSIM. However, there are also cases where the loss with SSIM produces artifacts that the pixel-wise norm loss does not. Ultimately, there is a trade off between using the two losses and the choice is largely dependent on the application.

6 Scope, limitations and future work

In this section, we discuss some of the advantages and weaknesses of our method as well as potential directions for future work.

Advantages. The main advantage of our model is that it makes very few assumptions about the scene representation and rendering process. Indeed, we learn representations simply by enforcing equivariance with respect to 3D rotations. As such, we can easily encode material, texture and lighting, which is difficult with traditional 3D representations. The simplicity of our model also means that it can be trained purely from posed 2D images with no 3D supervision. As we have shown, this allows us to apply our method to interesting data where obtaining 3D geometry is difficult. Crucially, and unlike most other methods, our model does not require alignment between scenes or any pose information at inference time. Further, our model is fast: inferring a scene representation simply corresponds to performing a forward pass of a neural network. This is in contrast to most other methods that require solving an expensive optimization problem at inference time for every new observed image (nguyen2018rendernet; DeepSDF; sitzmann2019scene). Rendering is also performed in a single forward pass, making it faster than other methods that often require recurrence to produce an image (Eslami1204; sitzmann2019scene).

Limitations. As our scene representation is spatial and 3-dimensional, our model is quite memory hungry. This implies we need to use a fairly small batch size which can make training slow (see appendix for detailed analysis of training times). Using a voxel-like representation could also make it difficult to generalize the model to other symmetries such as translations. In addition, our model typically produces samples of lower quality than models which make stronger assumptions. As an example, SRNs generally produce sharper and more detailed images than our model and are able to infer more fine-grained 3D information. Further, SRNs can, unlike our model, generalise to viewpoints that were not observed during training (such as rolling the camera or zooming). While this is partly because we are solving a task that is inherently more difficult, it would still be desirable to narrow this gap in performance. We also show some failure cases of our model in Fig. 13. As can be seen, the model struggles with very thin structures as well as objects with unusual shapes. Further, the model can create unrealistic renderings in certain cases, such as mugs with disconnected handles.

Future work. The main idea of the paper is that equivariance with respect to symmetries of a real scene provides a strong inductive bias for representation learning of 3D environments. While we implement this using voxels as the representation and rotations as the symmetry, we could just as well have chosen point clouds as the representation and translation as the symmetry. The formulation of the model and loss are independent of the specific choices of representation and symmetry and we plan to explore the use of different representations and symmetries in future work.

In addition, our model is deterministic, while inferring a scene from an image is an inherently uncertain process. Indeed, for a given image, there are several plausible scenes that could have generated it and, similarly, several different scenes could be rendered as the same image. It would therefore be interesting to learn a distribution over scenes conditioned on the observed image. Training a probabilistic or adversarial model may also help sharpen rendered images.

Another promising route would be to use the learned scene representation for 3D reconstruction. Indeed, most 3D reconstruction methods are object-centric (i.e. every object is reconstructed in the same orientation). This has been shown to cause models to effectively perform shape classification instead of reconstruction (tatarchenko2019single). As our scene representation is view-centric, it is likely that it could be useful for the downstream task of 3D reconstruction in the view-centric case.

Figure 13: Failure examples of our model (columns show the input view, the model output and the target view). As can be seen, the model fails on oddly shaped chairs, cars and mugs. On cars, the model sometimes infers the correct shape but misses high frequency texture detail. On mugs, the model can miss mug handles and other thin structures.

7 Conclusion

In this paper, we proposed learning scene representations by ensuring that they transform like real 3D scenes. The proposed model requires no 3D supervision and can be trained using only posed 2D images. At test time, our model can, from a single image and in real time, infer a scene representation and manipulate this representation to render novel views. Finally, we introduced two challenging new datasets which we hope will help spur further research into neural rendering and scene representations for complex scenes.


Acknowledgements

We thank Shuangfei Zhai, Walter Talbott and Leon Gatys for useful discussions. We also thank Lilian Liang and Leon Gatys for help with running compute jobs. We thank Per Fahlberg for his help in generating the 3D mountains dataset. Finally, we thank Russ Webb for feedback on an early version of the manuscript.


Appendix A Architecture details and hyperparameters

The inverse renderer is composed of 3 submodels: a 2D convolutional network mapping images to 2D features, an inverse projection layer mapping 2D features to 3D features and a 3D convolutional network mapping 3D features to the scene representation. Each subnetwork is described in detail in the tables below. The renderer is simply the transpose of the inverse renderer with a sigmoid activation at the output layer to ensure pixel values are in [0, 1].

Note that every layer is followed by a GroupNorm layer and a LeakyReLU activation (except the final scene and image layers). Each ResBlock is composed of a sequence of three convolutions whose output is added to the identity.

To ensure rotations of the scene representation do not exit the bounds of the voxel grid we apply a spherical mask to the scene representation before performing rotations.
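
The masking step can be illustrated with a short sketch. It assumes the mask is the sphere inscribed in the voxel grid and that the scene representation is stored as nested lists indexed as [channel][depth][height][width]; the exact radius and the function names `spherical_mask` and `apply_mask` are illustrative choices, not details taken from the model.

```python
def spherical_mask(size):
    """Binary mask of shape (size, size, size): 1.0 inside the inscribed sphere, else 0.0.

    Assumption: the sphere is centered on the grid and has radius size / 2.
    """
    center = (size - 1) / 2.0
    radius_sq = (size / 2.0) ** 2
    return [
        [
            [
                1.0
                if (z - center) ** 2 + (y - center) ** 2 + (x - center) ** 2 <= radius_sq
                else 0.0
                for x in range(size)
            ]
            for y in range(size)
        ]
        for z in range(size)
    ]


def apply_mask(scene, mask):
    """Multiply every channel of a [C][D][H][W] scene by the same spatial mask."""
    return [
        [
            [
                [value * m for value, m in zip(row, mask_row)]
                for row, mask_row in zip(plane, mask_plane)
            ]
            for plane, mask_plane in zip(channel, mask)
        ]
        for channel in scene
    ]
```

Voxels outside the sphere are zeroed, so any rotation about the grid center maps the non-zero region onto itself and no content leaves the grid.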

The full model has 11.5 million trainable parameters.

Input shape         Output shape        Operation
(3, 128, 128)       (64, 128, 128)      1x1 Conv
(64, 128, 128)      (64, 128, 128)      2x ResBlock
(64, 128, 128)      (128, 64, 64)       4x4 Conv, Stride 2
(128, 64, 64)       (128, 64, 64)       1x ResBlock
(128, 64, 64)       (128, 32, 32)       4x4 Conv, Stride 2
(128, 32, 32)       (128, 32, 32)       1x ResBlock
(128, 32, 32)       (256, 16, 16)       4x4 Conv, Stride 2
(256, 16, 16)       (256, 16, 16)       1x ResBlock
(256, 16, 16)       (128, 32, 32)       4x4 Conv.T, Stride 2
(128, 32, 32)       (128, 32, 32)       2x ResBlock
Table 4: Architecture of 2D subnetwork.
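
The spatial sizes in Table 4 can be checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below assumes a padding of 1 for the 4x4 stride-2 layers, which is not stated in the table but is the value that reproduces the listed shapes; the function names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_2d_subnetwork(size=128):
    """Follow the spatial size through the resolution-changing layers of Table 4."""
    sizes = [size]
    for _ in range(3):  # three 4x4 convolutions with stride 2
        size = conv_out(size, kernel=4, stride=2, padding=1)  # padding=1 is an assumption
        sizes.append(size)
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)  # 4x4 Conv.T, stride 2
    sizes.append(size)
    return sizes


print(trace_2d_subnetwork())  # [128, 64, 32, 16, 32]
```

The 1x1 convolutions and ResBlocks leave the spatial size unchanged, so only the strided layers appear in the trace.
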

Input shape         Output shape        Operation
(128, 32, 32)       (256, 32, 32)       1x1 Conv
(256, 32, 32)       (512, 32, 32)       1x1 Conv
(512, 32, 32)       (1024, 32, 32)      1x1 Conv
(1024, 32, 32)      (32, 32, 32, 32)    Reshape
Table 5: Architecture of inverse projection network from 2D to 3D.
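
The final reshape in Table 5 splits the 1024 feature channels into 32 channels and a depth axis of size 32. A minimal sketch of this bookkeeping follows; the channel-major ordering (channel index varies slowest) is an assumption matching a plain row-major reshape, and the function names are hypothetical.

```python
def inverse_projection_shape(shape, depth):
    """Map a 2D feature shape (C * depth, H, W) to a 3D feature shape (C, depth, H, W)."""
    channels, height, width = shape
    if channels % depth != 0:
        raise ValueError("number of channels must be divisible by depth")
    return (channels // depth, depth, height, width)


def split_channel_index(flat_channel, depth):
    """Return (channel, depth_index) for a flat 2D channel index.

    Assumption: row-major ordering, as in a standard tensor reshape.
    """
    return divmod(flat_channel, depth)


print(inverse_projection_shape((1024, 32, 32), depth=32))  # (32, 32, 32, 32)
```
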

Input shape           Output shape          Operation
(32, 32, 32, 32)      (32, 32, 32, 32)      1x1 Conv
(32, 32, 32, 32)      (32, 32, 32, 32)      2x ResBlock
(32, 32, 32, 32)      (128, 16, 16, 16)     4x4 Conv, Stride 2
(128, 16, 16, 16)     (128, 16, 16, 16)     2x ResBlock
(128, 16, 16, 16)     (64, 32, 32, 32)      4x4 Conv.T, Stride 2
(64, 32, 32, 32)      (64, 32, 32, 32)      2x ResBlock
Table 6: Architecture of 3D subnetwork.

Hyperparameters. When training with a loss that includes an SSIM term, we set the weight of the SSIM term to 0.05.


We train each model for 100 epochs on all datasets, although most models converge much earlier than this (around 60 epochs). When training on a single GPU we use a batch size of 16 and when training on 8 GPUs we use a batch size of 112.

Optimizer. We use Adam with a learning rate of 2e-4.

Losses. We use the ℓ2 loss for quantitative comparisons, as PSNR is a monotonically decreasing function of the ℓ2 (mean squared) error. Indeed, the baselines we compare against (except TCO) all directly optimize the ℓ2 loss, making comparisons fairer. We generally found that the loss including an SSIM term produces more visually pleasing samples and therefore use this loss for qualitative comparisons and novel view synthesis.
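
The link between PSNR and the mean squared error can be made concrete with the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below assumes pixel values in [0, 1] (so MAX = 1) and flat lists of pixel values; the function names are illustrative.

```python
import math


def mean_squared_error(prediction, target):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(prediction) != len(target):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB; decreases monotonically as the MSE grows."""
    return 10.0 * math.log10(max_value ** 2 / mse)


mse = mean_squared_error([0.5, 0.5, 0.5, 0.5], [0.6, 0.4, 0.6, 0.4])
print(psnr(mse))  # close to 20 dB, since the MSE is close to 0.01
```

Because the mapping is monotone, a model that minimizes the mean squared error on an image also maximizes the PSNR of that image.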

Appendix B Dataset descriptions

Detailed descriptions of the ShapeNet chairs and cars datasets can be found in the appendix of sitzmann2019scene.

B.1 MugsHQ

The MugsHQ dataset was rendered with a branch of the Mitsuba Renderer (Mitsuba) adapted to import ShapeNet geometry (Mitsuba_ShapeNet). Every scene was rendered with the same environment map (lighting conditions) and checkerboard disk platform. ShapeNet objects were scaled by their largest bounding box dimension, centered, and placed on the platform. The object’s material is a two-sided plastic designed to highlight glossy reflections, and the diffuse reflectance color was randomly sampled. For each object, 150 viewpoints were uniformly sampled over the upper hemisphere. Each viewpoint was rendered to a high dynamic range image, and then resized and tone-mapped to a linear RGB image.

B.2 3D Mountains

We created the 3D mountains dataset by first scraping the height, latitude and longitude of the 559 highest mountains in the Alps. We then used Apple Maps to render 50 images of each mountain. Specifically, the camera was placed on a sphere of radius 600m centered on the latitude and longitude of the mountain, at an altitude 100m below its height. We then fixed the elevation angle to 55 degrees (or a pitch of 35 degrees) and randomly sampled the azimuth angle between 0 and 360 degrees to capture multiple views of each mountain.
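
The camera placement can be sketched as a conversion from spherical to Cartesian coordinates. The sketch works in a hypothetical local frame in metres with the z axis pointing up and the mountain at the horizontal origin, and it assumes the elevation angle is measured from the horizontal plane; neither convention is confirmed by the description above, and the rendering itself (done with Apple Maps) is not modelled.

```python
import math
import random


def camera_position(center, radius, elevation_deg, azimuth_deg):
    """Point on a sphere around `center` at the given elevation and azimuth.

    Assumption: elevation is measured upwards from the horizontal plane.
    """
    elevation = math.radians(elevation_deg)
    azimuth = math.radians(azimuth_deg)
    cx, cy, cz = center
    return (
        cx + radius * math.cos(elevation) * math.cos(azimuth),
        cy + radius * math.cos(elevation) * math.sin(azimuth),
        cz + radius * math.sin(elevation),
    )


def sample_views(summit_height, num_views=50, radius=600.0, elevation_deg=55.0, seed=0):
    """Sample camera positions with fixed elevation and uniformly random azimuth."""
    rng = random.Random(seed)
    center = (0.0, 0.0, summit_height - 100.0)  # sphere centered 100m below the summit
    return [
        camera_position(center, radius, elevation_deg, rng.uniform(0.0, 360.0))
        for _ in range(num_views)
    ]
```

For example, `sample_views(4000.0)` returns 50 positions that all lie 600m from the point 100m below a 4000m summit and share the same altitude.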

Appendix C Train/validation/test splits

For each dataset we train a model and choose hyperparameters based on the lowest validation loss. All images and quantitative measurements are then made on a held out test set which is only seen after everything else has been fixed.

C.1 Chairs

The chairs dataset consists of 6591 scenes, with the training and validation set each having 50 views per scene and the test set having 251 views per scene, for a total of 594,267 images. The train/validation/test splits are:

  • Train: 4612 scenes (230,600 images)

  • Validation: 662 scenes (33,100 images)

  • Test: 1317 scenes (330,567 images)

C.2 Cars

The cars dataset consists of 3514 scenes, with the training and validation set each having 50 views per scene and the test set having 251 views per scene, for a total of 317,204 images. The train/validation/test splits are:

  • Train: 2458 scenes (122,900 images)

  • Validation: 352 scenes (17,600 images)

  • Test: 704 scenes (176,704 images)

C.3 MugsHQ

The MugsHQ dataset consists of 214 scenes, each with 150 views for a total of 32,100 images. The train/validation/test splits are:

  • Train: 186 scenes (27,900 images)

  • Validation: 14 scenes (2,100 images)

  • Test: 14 scenes (2,100 images)

C.4 3D mountains

The 3D mountains dataset consists of 559 scenes, each with 50 views for a total of 27,950 images. The train/validation/test splits are:

  • Train: 478 scenes (23,900 images)

  • Validation: 26 scenes (1,300 images)

  • Test: 55 scenes (2,750 images)
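
The scene and image counts listed in this appendix can be cross-checked with a few lines of arithmetic. The numbers below are copied from the splits above; the dictionary layout and the function name `summarize` are illustrative.

```python
# (number of scenes, views per scene) for each split, as listed in Appendix C.
SPLITS = {
    "chairs": {"train": (4612, 50), "validation": (662, 50), "test": (1317, 251)},
    "cars": {"train": (2458, 50), "validation": (352, 50), "test": (704, 251)},
    "mugs_hq": {"train": (186, 150), "validation": (14, 150), "test": (14, 150)},
    "mountains": {"train": (478, 50), "validation": (26, 50), "test": (55, 50)},
}


def summarize(splits):
    """Return (total scenes, total images) for one dataset."""
    scenes = sum(num_scenes for num_scenes, _ in splits.values())
    images = sum(num_scenes * views for num_scenes, views in splits.values())
    return scenes, images


for name, splits in SPLITS.items():
    print(name, summarize(splits))
```

The totals agree with the figures stated for each dataset, for example 6591 scenes and 594,267 images for chairs.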

Appendix D Runtimes

D.1 Training time

Training times for all datasets are shown in Table 7. When training on a single V100 GPU we use a batch size of 16, whereas we use a batch size of 112 when training on 8 V100s.

Dataset       1 V100      8 V100s
Chairs        9.7 days    2.2 days
Cars          5.5 days    1.3 days
MugsHQ        1.2 days    6 hrs
Mountains     1 day       5 hrs
Table 7: Training times.

D.2 Inference time

We measured inference time with a trained model on the cars dataset running on a single Tesla V100 GPU. We took the mean and standard deviation over 1000 iterations (using 100 warmup steps).

Single image: ms

Batch of 128 images: ms ( ms per image)

Note that for a single image this corresponds to a framerate of 45 fps, allowing for real time inference.
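
A measurement protocol of this kind (warmup steps followed by timed iterations, reporting mean and standard deviation) can be sketched as follows. This is a generic timing harness rather than the script used for the numbers above; `fn` stands for any callable that runs one inference and, on a GPU, is assumed to block until the result is ready.

```python
import statistics
import time


def benchmark(fn, warmup=100, iters=1000):
    """Return (mean, standard deviation) of the runtime of fn in milliseconds."""
    for _ in range(warmup):  # warmup runs are executed but not timed
        fn()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()  # assumption: fn returns only once the computation has finished
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)


def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds to a framerate."""
    return 1000.0 / latency_ms
```

A framerate of 45 fps corresponds to a latency of roughly 22 ms per image, since 1000 / 45 is about 22.2.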

Appendix E Things that didn’t work

We experimented with several things which we found did not improve performance.

  • We experimented with partitioning the latent space (across channels) into a viewpoint invariant and equivariant part. We hypothesized this might help in learning complex textures and create something akin to a global texture map, but found that this did not decrease (nor increase) the loss in practice.

  • When rotating the voxels we use trilinear interpolation to calculate the value of points that do not align with the grid. While rotations on the grid will always suffer from aliasing we hypothesized that using nearest neighbor interpolation (instead of trilinear) could help model performance. We also tried using shear rotations as these have been shown to reduce aliasing in certain cases (paeth1986fast). In practice we found that this did not make a big difference.

  • The latent space we use in our model has shape (64, 32, 32, 32) (see Table 6). We hypothesized that increasing the spatial resolution might help improve performance. We therefore tried a latent space with a higher spatial resolution but found that this performed the same as the original latent space, while being much slower to train.
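
The two interpolation schemes compared in the second item above differ only in how a value is read from the grid at a non-integer location. The following is a minimal pure-Python sketch on a nested list indexed as [z][y][x]; clamping coordinates to the grid bounds is an assumption made for simplicity, and the function names are illustrative.

```python
import math


def trilinear_sample(grid, z, y, x):
    """Trilinearly interpolate grid[z][y][x] at fractional coordinates."""
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    axes = []
    for coord, size in zip((z, y, x), dims):
        coord = min(max(coord, 0.0), size - 1.0)  # clamp to the grid (assumption)
        lower = int(math.floor(coord))
        upper = min(lower + 1, size - 1)
        frac = coord - lower
        axes.append(((lower, 1.0 - frac), (upper, frac)))
    value = 0.0
    for zi, wz in axes[0]:  # weighted sum over the 8 surrounding voxels
        for yi, wy in axes[1]:
            for xi, wx in axes[2]:
                value += wz * wy * wx * grid[zi][yi][xi]
    return value


def nearest_sample(grid, z, y, x):
    """Read the value of the voxel closest to the fractional coordinates."""
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    zi, yi, xi = (
        min(max(int(math.floor(coord + 0.5)), 0), size - 1)
        for coord, size in zip((z, y, x), dims)
    )
    return grid[zi][yi][xi]
```

On a grid whose values vary linearly with position, trilinear sampling reproduces the linear function exactly, whereas nearest neighbor sampling snaps to the value of a single voxel.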

Appendix F Samples from datasets

We include random ground truth samples from the MugsHQ and 3D mountains datasets.

Figure 14: Random samples from the MugsHQ dataset.

Figure 15: Random samples from the 3D mountains dataset.

Appendix G Random samples from model

We include random novel view synthesis samples on all datasets.

Figure 16: Random single shot novel view synthesis samples on chairs (columns show the input view, the model output and the target view).

Figure 17: Random single shot novel view synthesis samples on cars (columns show the input view, the model output and the target view).

Figure 18: Random single shot novel view synthesis samples on MugsHQ (columns show the input view, the model output and the target view).

Figure 19: Random single shot novel view synthesis samples on 3D mountains (columns show the input view, the model output and the target view).