neuralvolumes
Training and Evaluation Code for Neural Volumes
Modeling and rendering of dynamic scenes is challenging, as natural scenes often contain complex phenomena such as thin structures, evolving topology, translucency, scattering, occlusion, and biological motion. Mesh-based reconstruction and tracking often fail in these cases, and other approaches (e.g., light field video) typically rely on constrained viewing conditions, which limit interactivity. We circumvent these difficulties by presenting a learning-based approach to representing dynamic objects inspired by the integral projection model used in tomographic imaging. The approach is supervised directly from 2D images in a multi-view capture setting and does not require explicit reconstruction or tracking of the object. Our method has two primary components: an encoder-decoder network that transforms input images into a 3D volume representation, and a differentiable ray-marching operation that enables end-to-end training. By virtue of its 3D representation, our construction extrapolates better to novel viewpoints compared to screen-space rendering techniques. The encoder-decoder architecture learns a latent representation of a dynamic scene that enables us to produce novel content sequences not seen during training. To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.
Polygon meshes are an extremely popular representation for 3D geometry in photo-realistic scenes. Mesh-based representations efficiently model solid surfaces and can be paired with sophisticated reflectance functions to generate compelling renderings of natural scenes. In addition, there has been significant progress recently in optimization techniques to support real-time ray-tracing, allowing for interactivity and immersion in demanding applications such as Virtual Reality (VR). However, little of the interactive photo-real content available today is data-driven because many real-world phenomena are challenging to reconstruct and track with high fidelity. State-of-the-art motion capture systems struggle to handle complex occlusions (e.g., running hands through one’s hair), to account for reflectance variability (e.g., specularities in the sheen of a moving object), or to track topological evolution in dynamic participating media (e.g., smoke as it billows upward). Where solutions exist, they are typically specialized to an individual phenomenon (Xu et al., 2014; Atcheson et al., 2008; Roth and Black, 2006; Hawkins et al., 2005), and are often aimed at either image generation (Buehler et al., 2001; Kalantari et al., 2016) or 3D reconstruction (Goesele et al., 2007; Nießner et al., 2013), but not both. Since mesh-based representations rely heavily on the quality of reconstruction to produce compelling renderings, they are ill-suited to handle such cases. Nonetheless, these kinds of phenomena are necessary to create compelling renderings of much of our natural world.
To address limitations posed by inaccurate geometric reconstructions, great progress has been made in recent years by relaxing the physics-inspired representation of light transport, and instead leveraging machine learning to bridge the gap between the representation and observed images of the scene. In Lombardi et al. (2018), this technique was used to great effect in modeling the human face, where it was demonstrated that a neural network can be trained to compensate for geometric reconstruction and tracking inaccuracies through a view-dependent texture. Similar approaches have also been shown to be effective for modeling general far-field scenes (Overbeck et al., 2018). An extreme variant of this technique is screen-space rendering, where no geometry of the scene is used at all (Karras et al., 2018; Wang et al., 2018). Although these approaches have been shown to produce high quality renderings of complex scenes, they are limited to viewpoints available to the system at training time. Since their neural architectures are not 3D aware, the methods do not extrapolate to novel viewpoints in a way that is consistent with the real world. The problem is exacerbated when modeling near-field scenes, where variation in viewpoint is more common as a user interacts with objects in the scene, compared with far-field captures where there is less interactivity and the viewer is mainly stationary.
An important insight in this work is that if both geometry and appearance variations can be learned simultaneously, phenomena explainable by geometric variations may be modeled as such, leading to better generalization across viewpoints. The challenge, then, is to formulate this joint learning such that good solutions can be found. Directly optimizing over a mesh representation using gradient-based optimization is prone to terminating in poor local minima. This is the case even when models of both appearance and geometry are known a priori, and is exacerbated when these models are also unknown. One of the main reasons for this difficulty is the local support of the gradients of mesh-based representations. To address this, we propose using a volumetric representation consisting of opacity and color at each position in 3D space, where rendering is realized through integral projection. During optimization, this semi-transparent representation of geometry disperses gradient information along the ray of integration, effectively widening the basin of convergence and enabling the discovery of good solutions.
Although the volumetric representation has the ability to represent 3D geometric phenomena in a geometrically faithful way, it can easily over-fit from image-supervision for a typical density of viewpoints. As such, additional regularization is necessary to achieve good results. In this work, we show that a neural-network decoder is sufficient to encourage discovery of solutions that generalize across viewpoints. At first glance, this may appear surprising given the decoder network typically has enough capacity to reproduce solutions found through a direct solve of the volume’s entries. We conjecture that the decoder network introduces spatial regularity into the gradients of the volume’s entries (i.e., opacity and color), leading to more generalizable solutions without diminishing the volumetric representation’s capacity. Additionally, the decoder network is paired with an encoder network that produces a low-dimensional latent space that encodes the state of the scene at each frame, enabling joint reconstruction of sequences rather than just individual frames. Analogous to non-rigid structure from motion (Torresani et al., 2008), this architecture can leverage a scene’s regularity across time to improve viewpoint generalization further. This latent code can be used to generate novel renderings of the scene’s content by traversing the latent space, enabling realistic modifications of the recording, or even completely new sequence animation, without requiring object/scene/content specific solutions.
Despite these advantages of the volumetric representation, its main drawback is limited resolution. Using voxel-based data structures to represent a scene typically requires an order of magnitude more memory than its mesh-based counterpart to achieve similar levels of resolution. Furthermore, much of this memory is dedicated to modeling empty space or the inside of objects; neither of which have an impact on the rendered result. This limitation stems from the regular grid structure this representation exhibits. To overcome this limitation, we employ a warping technique that indirectly escapes the restrictions imposed by a regular grid structure, allowing the learning algorithm to make the best use of available memory. Using this technique, we demonstrate significantly higher fidelity than using only a conventional voxel data structure. Furthermore, as our representation is 3D based, we can naturally combine it with surface-based reconstruction and tracking methods when appropriate. This allows us to reach the highest levels of fidelity on objects in the scene for which state-of-the-art reconstruction and tracking work well, while also maintaining a complete model of the scene.
In summary, we propose a novel volumetric representation that is object/scene agnostic, can generalize well to novel viewpoints, reconstructs dynamic scenes jointly, facilitates novel content generation, requires only image-level supervision and is end-to-end trainable. The resulting models afford real-time rendering and support on-the-fly adjustments, suitable for interactive applications in VR. In §2 we cover related work, followed by an overview of our approach in §3. Details of the encoder and decoder architectures are covered in §4 and §5 respectively. Rendering through integral projection is discussed in §6 and details of the learning problem are covered in §7. We evaluate this architecture on a number of challenging scenes and present an ablation study on various design choices in our construction in §8. We conclude in §9 with a discussion and directions for future work.
Our approach is driven by learning and rendering techniques spanning multiple domains, from volumetric reconstruction and deformable volumes to neural rendering and novel view synthesis. The following paragraphs discuss similarities and differences to previous works in these areas.
Point- and surface-based reconstruction techniques have a long history in computer vision (see Furukawa and Hernández (2015) for a review). Following successful efforts in stereo matching (Scharstein and Szeliski, 2002), most of the subsequent literature has focused on the multi-view case, including extensions of photometric consistency (Furukawa and Ponce, 2010) and depth map fusion (Merrell et al., 2007; Zach et al., 2007). Despite some attempts at handling more complex materials or semi-transparent surfaces (Szeliski and Golland, 1999; Fitzgibbon and Zisserman, 2005), many popular multi-view stereo (MVS) methods, such as COLMAP (Schönberger and Frahm, 2016; Schönberger et al., 2016), still struggle with thin structures and dense semi-transparent materials (e.g., hair and smoke).
Volumetric reconstruction methods side-step this problem of explicit correspondence matching, beginning with Voxel Coloring (Seitz and Dyer, 1997; Prock and Dyer, 1998; Seitz and Dyer, 1999) and Space Carving (Kutulakos and Seitz, 2000; Broadhurst et al., 2001; Bonet and Viola, 1999). These methods recover occupancy and color in a voxel grid from multi-view images by evaluating the photo-consistency of each voxel in a particular order. Our method is similar to these classical voxel-based techniques in spirit, but rather than using a strict photometric consistency criterion, we learn a generative model that tries to best match the input images. Since we do not assume that objects in a scene are composed of flat surfaces, this approach also allows us to overcome the typical limitations of MVS methods and capture rich materials and fine geometry.
A number of more recent works on volumetric reconstruction have explored the concept of ray potentials, i.e., cost functions between the first surface struck by a ray and the color (or other property) of the corresponding pixel. Ulusoy et al. (2015), Savinov et al. (2016), and Paschalidou et al. (2018) formulate graph-based energy or inference objectives using ray potentials as constraints. A differentiable ray consistency criterion that inspired our work is developed by Tulsiani et al. (2017, 2018), who use an encoder-decoder architecture to predict voxel occupancy probabilities from RGB images. A loss on ray potentials evaluated in this volume is then backpropagated to the underlying convolutional architecture.
A fundamental difference between our work and previous works using ray potentials is that we model voxel transparency rather than occupancy probability, as we are focused on rendering rather than reconstruction. This explicit image formation process allows us to reconstruct and render dynamic scenes of semi-transparent materials, such as smoke.
Non-rigidly deforming objects pose challenges to both optimization- and learning-based approaches. Over the years, significant efforts have been spent on their reconstruction and tracking from RGB-D sensors: the DynamicFusion method of Newcombe et al. (2015) produces a base 3D template surface using a Truncated Signed Distance Function (TSDF) representation, and a time-dependent warp to fuse a sequence of depth frames. Zollhöfer et al. (2014) and Innmann et al. (2016) also deform a 3D template surface with an as-rigid-as-possible regularizer.
Our rendering model is based on classical volume ray marching (Levoy, 1988; Ikits et al., 2004), but we introduce the concept of a warp field to it: instead of directly sampling color and opacity along the ray, we first sample a 3D warp field encoding the location (within a template voxel grid) from which color and opacity are sampled. In 2D, similar techniques have been used to learn warps that align images (Jaderberg et al., 2015; Dai et al., 2017; Shu et al., 2018). In our work, the warp field is not only used to model motion but also increases the effective resolution of the voxel grid by simulating a dynamic non-uniform sampling grid.
Deep learning-based rendering has become an active area of exploration, with some methods relying on volumetric representations. Nguyen-Phuoc et al. (2018) use a convolutional neural network applied on a voxel grid to produce an image, but their method requires correspondence between the images and the voxel reconstruction. DeepVoxels (Sitzmann et al., 2018) automatically learns a 3D feature representation for novel view synthesis but is limited to static objects and scenes, and the employed architecture does not lend itself to real-time inference. Martin-Brualla et al. (2018) use neural networks to fill holes and generally improve the quality of a textured geometric representation. Kim et al. (2018) use a U-net architecture to convert an image of rasterized attributes to realistic images, similar to pix-to-pix (Isola et al., 2017). Our method shares many similarities with these approaches but has one important difference: machine learning is only used to generate an RGBA volume, which is then rendered with a ray marching algorithm with no learned parameters. This is important as it gives us an interpretable volume, which may lead to better viewpoint generalization.
In our evaluation of hybrid rendering approaches, we employ the Deep Appearance Model of Lombardi et al. (2018) that provides a textured mesh representation. The method learns a Variational Autoencoder (VAE) model representing the dynamic mesh and view-dependent texture of a specific person’s face. While we also evaluate our model on human faces, our method does not require precise mesh tracking or other forms of pre-processing. Instead, it is trained end-to-end, using only raw images as supervision. Volumetric representations such as ours are also able to better represent complex surfaces like hair that are difficult to model using meshes.
Novel view synthesis aims to produce RGB images of novel views given a set of input RGB images. Typically, these methods use a geometric proxy to assist in reprojecting 3D points back into the input images and some blending is performed to produce a final pixel color. Buehler et al. (2001) and Davis et al. (2012) use heuristics to blend contributions of different images based on the rays from the geometric proxy to each camera. Hedman et al. (2018) use neural networks to determine blending weights, which can overcome inaccurate geometric proxies. Zhou et al. (2018) skip the geometric proxy altogether and use a neural network to compute blending weights of each image projected along a set of planes. Penner and Zhang (2017) compute a soft volumetric representation using MVS depth maps to perform novel view synthesis. Unlike most novel view synthesis techniques, our method operates on sequences and creates an animatable model.
To compute geometric proxies, a multi-view stereo method is often employed. While free-viewpoint video methods (Collet et al., 2015; Prada et al., 2016) rely on a sophisticated combination of multi-view stereo techniques (including silhouettes and MVS) to reconstruct many kinds of objects, our method creates novel views of dynamic scenes with a single generative framework. In addition, our model’s latent embedding of the scene allows us to generate novel animations more flexibly by producing new embedding sequences.
We present an end-to-end pipeline for rendering images from novel views with only image supervision that leverages an internal 3D volumetric representation. There are two main parts to the method: an encoder-decoder network that converts input images into a 3D volume $V(\mathbf{x})$, and a differentiable raymarching step that renders an image from the volume given a set of camera parameters. The method can be thought of as an autoencoder whose final layer is a fixed-function (i.e., no free parameters) volume rendering operation.
Formally, we model a volume that maps 3D positions, $\mathbf{x} \in \mathbb{R}^3$, to a local RGB color and differential opacity at that point,
$$V : \mathbb{R}^3 \to \mathbb{R}^4, \qquad V(\mathbf{x}) = \big(V_{\text{rgb}}(\mathbf{x}),\, V_{\alpha}(\mathbf{x})\big), \tag{1}$$
where $V_{\text{rgb}}(\mathbf{x}) \in \mathbb{R}^3$ is the color at $\mathbf{x}$ and $V_{\alpha}(\mathbf{x}) \in \mathbb{R}$ is its differential opacity in the range $[0, \infty)$, with $0$ representing full transparency. The purpose of a semi-transparent volume is two-fold: first, it is a softening of a discrete volume representation which enables gradients to flow for learning; second, it allows us to model semi-transparent objects or bundles of thin structures that appear translucent at limited resolutions, like hair.
Fig. 2 shows a visual representation of the pipeline. We first capture a set of synchronized and calibrated video streams of a performance from different viewpoints. Next, an encoder network takes images from a subset of the cameras for each time instant and produces a latent code $\mathbf{z}$ that represents the state of the scene at that time. A volume decoder network produces a 3D volume given this latent code, $V(\mathbf{x}; \mathbf{z})$, which yields an RGBα value at each point $\mathbf{x}$. Finally, an accumulative ray marching algorithm renders the volume from a particular point-of-view. We train this system end-to-end by reconstructing each of the input images and minimizing the squared pixel reconstruction loss over the entire training set. At training time, we run through the entire pipeline to train the weights of the encoder-decoder network. At inference time, we produce a stream of latent codes (either the sequence of latent codes produced by the training images or a novel generated sequence) and decode and render in real time.
The main component of our system that enables novel sequence generation is the encoder-decoder architecture, where the scene’s state is encoded using a consistent latent representation $\mathbf{z}$. A traversal in this latent space can be decoded into a novel sequence of volumes that can then be rendered from any viewpoint (see §5 for details). This is in contrast to methods that rely on specialized mesh constructions per frame which only allow for playback (Collet et al., 2015) or limited control over the generative process (Prada et al., 2016). Moreover, this representation allows for conditional decoding, where only part of the scene’s state is modified on playback (e.g., expression during speech or view-dependent appearance effects). The encoder-decoder architecture naturally supports this capability without requiring specialized treatment on the decoder side so long as paired samples of the conditioning variable are available during training.
To build the latent space, the information state of the scene at any given time is codified by encoding a subset of views from the multi-camera capture system using a convolutional neural network (CNN). The architecture of the encoder is shown in Fig. 3. Each camera view is passed through a dedicated branch before being concatenated with those from other views and further encoded down to the final shape. Although using all camera views as input is optimal from an information-theoretic perspective, we found that using views worked well in practice, while being much more memory and computationally efficient. To maximize coverage, we select a subset of cameras that are roughly orthogonal, although our system is not especially sensitive to the specific choice of views. In practice, we used the frontal, left and right-most camera views and downsampled the images by a factor of to size pixels.
To generate plausible samples during a traversal through the latent space, the generative model needs to generalize well between training samples. This is typically achieved by learning a smooth latent space. To encourage smoothness, we use a variational architecture (Kingma and Welling, 2013). The encoder outputs parameters of a diagonal 256-dimensional Gaussian (i.e., its mean $\boldsymbol{\mu}$ and standard deviation $\boldsymbol{\sigma}$), whose KL-divergence from a standard Normal distribution is used as regularization. Generating an instance involves sampling from this distribution using the reparameterization trick, and decoding into the volumetric components described in §5.
In addition to encouraging latent space smoothness, the variational architecture also ensures that the decoder makes use of the conditioning variable when it is trained jointly with the encoder, as described in §7. Specifically, since the variational bottleneck maximizes the non-informative latent dimensions (Higgins et al., 2017), information pertaining to the conditioning variable is projected out of the latent space, leaving the decoder no choice but to use the conditioning variable in its reconstruction.
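The variational bottleneck described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and the choice to parameterize the standard deviation through its logarithm (`log_sigma`) are assumptions made here for numerical convenience.

```python
import numpy as np

def sample_latent(mu, log_sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the latent dimensions."""
    sigma_sq = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma_sq + mu ** 2 - 1.0 - 2.0 * log_sigma, axis=-1)

# Hypothetical encoder output for one frame (256-dimensional, as in the text).
rng = np.random.default_rng(0)
mu = np.zeros(256)
log_sigma = np.zeros(256)
z = sample_latent(mu, log_sigma, rng)       # latent code passed to the decoder
kl = kl_to_standard_normal(mu, log_sigma)   # regularization term
```

Because the noise `eps` is drawn independently of the encoder outputs, gradients of a reconstruction loss can flow through `z` back to `mu` and `log_sigma`.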
This method of conditioning can be applied using any auxiliary information available to the user for controlling the rendered output. In §8 we show experiments demonstrating this for a few types of conditioning information. Of particular importance is view-conditioning, which allows view-dependent effects, such as specularity, to be rendered correctly. When viewed in VR, the auxiliary information in the form of a view-vector can be obtained from the relative orientation of the headset in the virtual scene.
In this section we discuss parameterizing the volume function $V$ using different neural network architectures, which we call volume decoders. We discuss possible representations for the volumes (voxel grids and multi-layer perceptrons) and a method for increasing the effective resolution by using warping fields. Finally, we discuss view-conditioning for modeling view-dependent appearance.
One possible model for the volume function $V(\mathbf{x}; \mathbf{z})$ at point $\mathbf{x}$ with state $\mathbf{z}$ is an implicit one: a multi-layer perceptron (MLP), i.e., a series of fully connected layers with non-linearities. A benefit of this approach is that we’re not restricted by voxel grid resolution or storage space. Unfortunately, in practice an MLP requires prohibitive size to produce high-quality reconstructions. We must also evaluate the MLP at every step along each ray in the ray-marching process (see §6), imposing an equally restrictive upper bound on the MLP complexity for real-time applications.
Rather than trying to model the entire volume implicitly with an MLP, we may instead assume that the volume function can be modeled as a discrete 3D grid of voxels. We produce this explicit 3D voxel grid as the output tensor of a neural network. Let the tensor $Y \in \mathbb{R}^{C \times D \times D \times D}$ represent a $D \times D \times D$ grid of values in $\mathbb{R}^C$, with $C$ the channel dimension. Define $S(\mathbf{x}; Y) : \mathbb{R}^3 \to \mathbb{R}^C$ to be an interpolation function that samples from the grid $Y$ by scaling continuous values in the range $[-1, 1]$ to the extent of the grid along each dimension, followed by trilinear interpolation. We can define a volume decoder for a 3D cube with center at $\mathbf{x}_0$ and sides of size $s$,
$$V(\mathbf{x}; \mathbf{z}, \mathbf{x}_0, s) = S\!\left(\frac{\mathbf{x} - \mathbf{x}_0}{s/2};\; g(\mathbf{z})\right), \tag{2}$$
where $g(\mathbf{z})$ is a neural network that produces a tensor of size $C \times D \times D \times D$. Note that we only evaluate the decoder function inside the volume it covers by computing intersections with its bounding volume.
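The voxel-grid decoder of Eq. (2) amounts to a coordinate normalization followed by trilinear interpolation. The NumPy sketch below illustrates this under stated assumptions: the grid is a plain array standing in for the network output, the axis order (first spatial axis corresponds to the first coordinate) is chosen here for illustration, and points outside the cube are clamped to its boundary rather than skipped via a bounding-volume test.

```python
import numpy as np

def sample_trilinear(grid, x):
    """Sample a (C, D, D, D) grid at points x of shape (N, 3) given in [-1, 1]^3."""
    C, D = grid.shape[0], grid.shape[1]
    # Map [-1, 1] to continuous voxel indices in [0, D - 1].
    idx = (np.clip(x, -1.0, 1.0) + 1.0) * 0.5 * (D - 1)
    lo = np.clip(np.floor(idx).astype(int), 0, D - 2)  # lower corner of the cell
    frac = idx - lo                                    # position inside the cell
    out = np.zeros((x.shape[0], C))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of this corner: product of per-axis linear weights.
                w = (np.where(dx, frac[:, 0], 1.0 - frac[:, 0])
                     * np.where(dy, frac[:, 1], 1.0 - frac[:, 1])
                     * np.where(dz, frac[:, 2], 1.0 - frac[:, 2]))
                corner = grid[:, lo[:, 0] + dx, lo[:, 1] + dy, lo[:, 2] + dz]  # (C, N)
                out += w[:, None] * corner.T
    return out

def volume_decoder(grid, x, x0, s):
    """Evaluate a cube with center x0 and side length s at world-space points x."""
    return sample_trilinear(grid, (x - x0) / (s / 2.0))

# Usage: a random 4-channel 8^3 grid standing in for the decoder network output.
grid = np.random.default_rng(0).random((4, 8, 8, 8))
values = volume_decoder(grid, np.array([[0.1, -0.2, 0.3]]), x0=np.zeros(3), s=2.0)
```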
In practice, we use either a convolutional architecture or a series of fully-connected layers to implement . In the former case, we first apply a fully-connected layer and non-linearity to transform into a -dimensional representation and reinterpret the resulting vector as a cube with channels. We experiment with two convolutional architectures, one with a final size of (achieved with transposed convolutions and running at 90Hz) and one with a final size of (achieved with transposed convolutions and running at 22Hz). As an alternative, we consider a bottleneck architecture capable of running at 90Hz consisting of fully-connected layers with output sizes of , , and , respectively, with the last layer representing the decoded volume. After decoding the volume, we apply a softplus function to the values to ensure they are non-negative. Both decoder variants are illustrated in Fig. 4.
On their own, voxel grids are limited because they can only represent details as small as a single voxel and are computationally expensive to evaluate and store at high resolution. Additionally, they are wasteful in typical scenes where much of the scene consists of empty space. Common solutions to these problems are spatial acceleration structures like octrees (e.g., Riegler et al. (2017)), but it’s difficult to modify these to work in a learning setting. They also typically require that the distribution of objects within the structure is known a priori, which is not true in our case as we do not use any 3D training data.
To solve these problems, we propose to use warping fields to both alter the effective resolution of the voxel volumes as well as model motion more naturally. In our warping formulation, we produce a template RGBα volume $T(\mathbf{x})$ and a warp volume $W^{-1}(\mathbf{x})$. Each point in the warp volume gives a corresponding location in the template volume from which to sample, making this an inverse warp as it maps from output positions to template sample locations. As before, both template and warp are decoded from a dynamic (per-frame) latent code $\mathbf{z}$, which we drop from the notation for conciseness.
The choice of inverse warps rather than forward warps allows representing resolution-increasing transformations by mapping a small area of voxels in the output space to a larger area in the template space without requiring additional memory. Thus, the inverse warp can represent details in the output space with higher resolution than uniform grid sampling, but remains well-defined everywhere in the output space, which is necessary for providing usable gradients during learning.
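Formally, we define the inverse warp field,
$$W^{-1}(\mathbf{x}) \to \mathbf{y}, \qquad \mathbf{x}, \mathbf{y} \in \mathbb{R}^3, \tag{3}$$
where $\mathbf{x}$ is a 3D point in the output (rendering) space and $\mathbf{y}$ a 3D point in the RGBα template space. To generate the final volume value, we first evaluate the value of the inverse warp and then sample the template volume at the warped point,
$$V(\mathbf{x}) = T\big(W^{-1}(\mathbf{x})\big). \tag{4}$$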
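To see why an inverse warp can raise the effective resolution, consider a one-dimensional toy version of Eq. (4). Everything in this sketch is hypothetical: the template is a 9-sample alternating pattern, and the warp is a fixed scaling by 2, so that a small output interval reads from the whole template.

```python
import numpy as np

# 1D stand-in for the template volume: 9 samples on [-1, 1] with spacing 0.25.
template_x = np.linspace(-1.0, 1.0, 9)
template = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

def sample_template(y):
    """Linearly interpolate the template at template-space coordinates y."""
    return np.interp(y, template_x, template)

def inverse_warp(x, scale=2.0):
    """Hypothetical inverse warp: output point x reads from template point scale * x."""
    return scale * x

def warped_volume(x):
    """Toy version of Eq. (4): evaluate the warp, then sample the template there."""
    return sample_template(inverse_warp(x))

# The output interval [-0.5, 0.5] now contains the entire template pattern,
# i.e. detail at spacing 0.125 instead of the template's native spacing 0.25.
x_out = np.linspace(-0.5, 0.5, 9)
values = warped_volume(x_out)
```

The warp is defined for every output coordinate, including those that map outside the template (here `np.interp` clamps them to the end values), which mirrors the text's point that the inverse warp stays well-defined everywhere in the output space.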
An important piece of including warp fields is determining how they’re produced. As we show later, the architecture of the warp field decoder makes a large difference on the quality of the model.
A straightforward approach to decoding warp fields would be to use deconvolutions to produce a warp field with freely-varying template sample points at each output point. This parameterization, however, is too flexible, resulting in overfitting and poor generalization to novel views as we show in our experimental results. We instead take the approach that the basic building block of a warp field should be an affine warp. Since a single affine warp can’t model non-linear bending, we use a spatial mixture of affine warps to produce an inverse warp field.
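We write the affine mixture as,
$$W^{-1}(\mathbf{x}) = \sum_i A_i(\mathbf{x})\, a_i(\mathbf{x}), \tag{5}$$
$$a_i(\mathbf{x}) = \frac{w_i\big(A_i(\mathbf{x})\big)}{\sum_j w_j\big(A_j(\mathbf{x})\big)}, \tag{6}$$
where $A_i(\mathbf{x}) = R_i\big(\mathbf{s}_i \circ (\mathbf{x} - \mathbf{t}_i)\big)$ is the $i$-th affine transformation, $\{R_i, \mathbf{s}_i, \mathbf{t}_i\}$ define the rotation, scaling, and translation of the affine transformation, $\circ$ is element-wise multiplication, and $w_i$ is the weight volume of the $i$-th warp. Note that we sample the spatial mixture weight after warping (“warped weights”). The intuition behind this is that the warped space represents different parts of the scene, and the weighting function should be in that space as well. This can be viewed as an extension of linear blend skinning (Lewis et al., 2000) applied to a volumetric space.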
To compute the transformation parameters and the weighting volume we use 2 fully-connected layers after the encoding . For rotation, we produce a rotation quaternion vector which is normalized and transformed into a rotation matrix. Before outputting the values of the weighting volume, we apply to ensure the weights are non-negative. Unlike the voxel volume , we clamp samples outside the weighting volume to the surface otherwise the warps can get “stuck” early in training if they land outside the volume. In practice, we found that a mixture of 16 warps provides sufficient expressiveness. In all experiments where warping is used, we learn an additional global warp (with parameters produced by MLP) that is applied to before the warping field, and we use a voxel grid to represent the warp field as we found that the low resolution provides smoothness from the trilinear interpolation that helps learning.
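The mixture of affine warps can be illustrated with the following NumPy sketch. Several things here are assumptions rather than details from the text: the order in which translation, scaling, and rotation are composed, the `(w, x, y, z)` quaternion layout, and the use of plain Python callables (`weight_fns`) in place of interpolated weight volumes.

```python
import numpy as np

def quat_to_rotmat(q):
    """Normalize a quaternion (w, x, y, z) and convert it to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def affine_mixture_warp(x, quats, scales, trans, weight_fns):
    """Blend K affine warps of points x (N, 3), with weights sampled after warping."""
    warped, weights = [], []
    for q, s, t, w_fn in zip(quats, scales, trans, weight_fns):
        # Assumed composition: translate, scale element-wise, then rotate.
        a = (quat_to_rotmat(q) @ (s * (x - t)).T).T   # (N, 3)
        warped.append(a)
        weights.append(w_fn(a))                       # "warped weights", shape (N,)
    warped = np.stack(warped)                         # (K, N, 3)
    weights = np.stack(weights)                       # (K, N)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * warped).sum(axis=0)  # (N, 3)

# Usage with two hypothetical warps and Gaussian-bump weight functions.
bump = lambda a: np.exp(-np.sum(a ** 2, axis=-1))     # strictly positive weights
x = np.array([[0.2, -0.3, 0.5]])
y = affine_mixture_warp(
    x,
    quats=[np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0, 0.0])],
    scales=[np.ones(3), np.full(3, 1.5)],
    trans=[np.zeros(3), np.array([0.1, 0.0, 0.0])],
    weight_fns=[bump, bump],
)
```

Normalizing the weights makes the result a convex combination of the individually warped points, which is the sense in which this resembles linear blend skinning.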
In order to model view-dependent appearance, we opt to condition the decoder network on the viewpoint. This allows us to model specularities in a data-driven way, without specifying a particular functional form. To do this, we input the normalized direction of the camera to the decoder alongside the encoding. Note that for view-conditioned models, we use separate convolutional branches to produce the RGB values and α values as we only want to condition the RGB values on the viewpoint.
We formulate a rendering process for semi-transparent volumes that mimics front-to-back additive blending: As a camera ray traverses a volume of inhomogeneous material, it accumulates color in proportion to the local color and density of the material at each point along its path.
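We generate images from a volume function $V(\mathbf{x})$ by marching rays through the volume it models. To model occlusions, the ray accumulates not only color but also opacity. If the accumulated opacity reaches $1$ (for example, when the ray traverses an opaque region), then no further color can be accumulated on the ray. In particular, the color $I_{\text{rgb}}(\mathbf{p})$ at a pixel $\mathbf{p}$ in the focal plane of a camera with center $\mathbf{r}_o$ and image-to-world transformation $P^{-1}$ is given by raymarching in the unit direction $\mathbf{r}_d = (P^{-1}\mathbf{p} - \mathbf{r}_o) / \lVert P^{-1}\mathbf{p} - \mathbf{r}_o \rVert$. This leads to the rendering process
$$I_{\text{rgb}}(\mathbf{p}) = \int_{t_{\min}}^{t_{\max}} V_{\text{rgb}}(\mathbf{r}_o + t\,\mathbf{r}_d)\, \frac{dT(t)}{dt}\, dt, \tag{7}$$
$$T(t) = \min\!\left(\int_{t_{\min}}^{t} V_{\alpha}(\mathbf{r}_o + u\,\mathbf{r}_d)\, du,\; 1\right), \tag{8}$$
where $t_{\min}$ and $t_{\max}$ denote the segment of the ray that intersects the volume modeled by $V$, and $\min(\cdot, 1)$ in Eq. (8) ensures that the accumulated ray opacity is clamped at $1$. Furthermore, we set the final image opacities to $I_{\alpha}(\mathbf{p}) = T(t_{\max})$.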
Algorithm 1 shows the computation of the output color for a ray intersecting the volume $V$, and represents the numerical integration of Eqs. (7–8) using the rectangle rule (we use the rectangle rule twice, with samples on the right of each step interval and backwards differences to discretize the derivative of $T$). Importantly, this image formation model is differentiable, which allows us to optimize the parameters of the volume to match target images. In practice, we set the step size to be the size of the volume, which provides adequate voxel coverage and allows the rendering process to run at 90Hz in an OpenGL shader at resolution, enabling real-time stereo in Virtual Reality.
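The accumulation described above can be written as a short loop. The sketch below is a plain NumPy illustration for a single ray, not the authors' shader: `volume_fn` is a hypothetical callable returning a color and a differential opacity at a 3D point, and the loop uses an integer step count to avoid floating-point drift in the ray parameter.

```python
import numpy as np

def raymarch(volume_fn, origin, direction, t_min, t_max, step):
    """Accumulate color and opacity along one ray through a semi-transparent volume."""
    direction = direction / np.linalg.norm(direction)
    color = np.zeros(3)
    opacity = 0.0                                  # accumulated opacity, clamped at 1
    num_steps = int(np.ceil((t_max - t_min) / step))
    for i in range(num_steps):
        if opacity >= 1.0:                         # ray is saturated, stop early
            break
        t = t_min + (i + 1) * step                 # sample at the right end of the interval
        rgb, alpha = volume_fn(origin + t * direction)
        new_opacity = min(opacity + step * alpha, 1.0)
        color += (new_opacity - opacity) * rgb     # backward difference of the opacity
        opacity = new_opacity
    return color, opacity

# Usage: a hypothetical homogeneous red fog with differential opacity 0.5.
fog = lambda x: (np.array([1.0, 0.0, 0.0]), 0.5)
color, opacity = raymarch(fog, np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.0, 1.0, 0.25)
```

Each operation in the loop is differentiable almost everywhere with respect to the sampled color and opacity values, which is what allows a reconstruction loss on the rendered pixel to reach the volume parameters.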
While the semi-transparent volume representation is very versatile, certain kinds of scene content can be more efficiently represented at high resolution using surface-based representations combined with unwrapped texture maps. One example is fine detail in human faces, for which specialized capture systems are available and commonly use texture resolutions larger than (Beeler et al., 2011; Lombardi et al., 2018; Fyffe et al., 2017).
The volume representation described above offers a natural integration with such existing mesh-based representations. Rendering proceeds as described in Algorithm 1 with one modification: we set the far end of the marched ray segment, $t_{\max}$, to the mesh depth for all rays that intersect the mesh. Whenever one of these rays reaches $t_{\max}$, any remaining color throughput is filled with the color of the mesh at the intersection.
This technique can also be used during learning to avoid expending representational power in these parts of the scene. As we show in §8.6, the resulting semi-transparent volume learned using the hybrid rendering process described above naturally avoids occluding the mesh in areas where the mesh provides a higher-fidelity representation.
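The hybrid rendering rule can be sketched as follows. This is a hedged illustration only: `raymarch_with_mesh` and the `volume(t)` sampler are hypothetical names, and taking the minimum of the volume exit distance and the mesh depth is an assumption made here so that a mesh lying behind the volume is still handled.

```python
def raymarch_with_mesh(volume, t_min, t_max, step, mesh_depth=None, mesh_rgb=None):
    """March one ray through a volume, optionally stopping at a mesh.

    volume:     hypothetical callable t -> ((r, g, b), density) along the ray.
    mesh_depth: distance to the mesh along the ray, or None if the ray misses.
    mesh_rgb:   mesh color at the intersection (used only on a hit).
    """
    if mesh_depth is not None:
        t_max = min(t_max, mesh_depth)  # stop marching at the mesh surface
    color = [0.0, 0.0, 0.0]
    opacity = 0.0
    t = t_min
    while t < t_max and opacity < 1.0:
        t_next = min(t + step, t_max)
        rgb, density = volume(t_next)
        new_opacity = min(opacity + density * (t_next - t), 1.0)
        weight = new_opacity - opacity
        color = [c + weight * v for c, v in zip(color, rgb)]
        opacity = new_opacity
        t = t_next
    if mesh_depth is not None:
        # Fill whatever throughput is left with the mesh color.
        remaining = 1.0 - opacity
        color = [c + remaining * m for c, m in zip(color, mesh_rgb)]
        opacity = 1.0
    return tuple(color), opacity
```

If the volume in front of the mesh stays transparent, the mesh color passes through unchanged, which is the behavior that lets the learned volume avoid occluding the mesh.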
In this section we discuss the details of training our method. Training the system consists of training the weights of the encoder-decoder network. We discuss the estimation of a per-camera color calibration matrix and static background image, the construction of our loss function, and reconstruction priors that improve accuracy.
Although we have geometrically calibrated the cameras in our multi-view system, we have not color calibrated them relative to one another. We need to ensure that a given radiance value is converted to the same pixel value by each of the cameras. To do this, we introduce a per-camera, per-channel gain and bias that are applied to our reconstructed image before comparing it to ground truth. This allows our system to account for slight differences in overall intensity between the cameras' images.
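As a small illustration, a per-channel gain and bias can be applied to a reconstructed pixel as below. The function name and the tuple layout are assumptions made for this sketch; in training, one such gain and bias pair would be learned for each camera.

```python
def apply_color_calibration(rgb, gain, bias):
    """Apply a per-channel affine color correction to one pixel.

    rgb, gain, bias: 3-tuples (one entry per color channel); the gain and
    bias belong to the camera whose image is being reconstructed.
    """
    return tuple(g * c + b for c, g, b in zip(rgb, gain, bias))
```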
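In our training data we often have static backgrounds that the algorithm will try to reconstruct. To ensure that our algorithm only reconstructs the object of interest, we estimate a per-camera background image $\mathbf{I}^{bg}_{rgb}$. The background image is static across the entire sequence, capturing only stationary objects that are generally outside of the reconstruction volume.
We obtain a final image $\hat{\mathbf{I}}_{rgb}$ from a specific view by raymarching all pixels according to Eq. (7) and merging each with its corresponding background pixel according to the remaining opacity when exiting the volume,
$$\hat{\mathbf{I}}_{rgb}(\pi) = \mathbf{I}_{rgb}(\pi) + \left(1 - \mathbf{I}_{\alpha}(\pi)\right)\mathbf{I}^{bg}_{rgb}(\pi), \tag{9}$$
where $\mathbf{I}_{rgb}(\pi)$ is the raymarched color at pixel $\pi$ and $\mathbf{I}_{\alpha}(\pi)$ is the opacity accumulated by the ray when it exits the volume.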
This background estimation process greatly reduces the number of artifacts in the reconstruction.
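The background merge amounts to a single compositing step per pixel, sketched below. The function name and argument layout are hypothetical; the only logic shown is that the background contributes in proportion to the opacity not yet used up by the volume.

```python
def composite_over_background(rgb, alpha, background_rgb):
    """Merge a raymarched pixel with its static background pixel.

    rgb:            color accumulated along the ray.
    alpha:          opacity accumulated along the ray, in [0, 1].
    background_rgb: per-camera background color at the same pixel.
    """
    remaining = 1.0 - alpha  # opacity left over when the ray exits the volume
    return tuple(c + remaining * b for c, b in zip(rgb, background_rgb))
```

A ray that saturates inside the volume leaves no remaining opacity, so the background has no influence on that pixel.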
Without using priors, the reconstructed volumes tend to include smoke-like artifacts. These artifacts are caused by slight differences in appearance from different viewpoints due to calibration errors or view-dependent effects. The system can learn to compensate for these differences by adding a small amount of opacity that becomes visible from one particular camera. To reduce these artifacts, we introduce two priors.
The first prior regularizes the total variation of the log voxel opacities,
$$E_{tv}(\mathbf{V}_{\alpha}) = \frac{1}{N}\sum_{x,y,z} \left\lVert \nabla \log \mathbf{V}_{\alpha}(x, y, z) \right\rVert, \tag{10}$$
where $\mathbf{V}_{\alpha}$ denotes the voxel opacities, the sum is performed over all the voxel centers $(x, y, z)$, and $N$ is the number of voxels. This term helps recover sharp boundaries between opaque and transparent regions by enforcing sparse spatial gradients. We apply this prior in log space to increase the sensitivity of the prior to small values because the artifacts tend to be mostly transparent.
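A total variation penalty on log opacities can be sketched in plain Python as follows. Several details are assumptions for this illustration rather than the paper's stated choices: forward differences for the spatial gradient, averaging only over voxels that have a forward neighbor on every axis, and the small `eps` that keeps the logarithm finite at zero opacity.

```python
import math

def log_opacity_total_variation(alpha, eps=1e-5):
    """Mean gradient magnitude of log opacity over a voxel grid.

    alpha: nested list alpha[x][y][z] of voxel opacities in [0, 1].
    eps:   assumed stabilizer so that log(0) is never evaluated.
    """
    log_a = [[[math.log(a + eps) for a in row] for row in plane] for plane in alpha]
    nx, ny, nz = len(log_a), len(log_a[0]), len(log_a[0][0])
    total, count = 0.0, 0
    for x in range(nx - 1):
        for y in range(ny - 1):
            for z in range(nz - 1):
                # Forward differences along each axis.
                dx = log_a[x + 1][y][z] - log_a[x][y][z]
                dy = log_a[x][y + 1][z] - log_a[x][y][z]
                dz = log_a[x][y][z + 1] - log_a[x][y][z]
                total += math.sqrt(dx * dx + dy * dy + dz * dz)
                count += 1
    return total / count if count else 0.0
```

Working in log space means a jump from 0.001 to 0.01 opacity is penalized as strongly as a jump from 0.1 to 1, which is what makes the prior sensitive to faint, mostly transparent artifacts.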
The second prior is a beta distribution on the final image opacities $\mathbf{I}_{\alpha}(\pi)$. We write the regularization term using the negative log-likelihood of the beta distribution,
$$E_{\beta}(\mathbf{I}_{\alpha}) = \frac{1}{P}\sum_{\pi} \left[\log \mathbf{I}_{\alpha}(\pi) + \log\left(1 - \mathbf{I}_{\alpha}(\pi)\right)\right], \tag{11}$$
where $\pi$ is an image pixel and $P$ is the number of pixels. This prior reduces the entropy of the exit opacities and is based on the intuition that most of our rays should strike the object or the background; fewer rays will graze the surface of the object, picking up some opacity but not saturating.
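The effect of this prior can be checked numerically with a short sketch. The function name and the stabilizing `eps` are assumptions for illustration; the term itself is the mean of the log opacity plus the log of one minus the opacity, which is lowest when exit opacities sit near 0 or 1.

```python
import math

def beta_opacity_prior(alphas, eps=1e-3):
    """Mean of log(a) + log(1 - a) over exit opacities.

    alphas: list of per-pixel exit opacities in [0, 1].
    eps:    assumed stabilizer so the logarithms stay finite at 0 and 1.
    """
    total = 0.0
    for a in alphas:
        total += math.log(a + eps) + math.log(1.0 - a + eps)
    return total / len(alphas)
```

Minimizing this term therefore pushes rays toward being either fully transparent or fully saturated, and penalizes rays that stop at intermediate opacity.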
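Our full training objective is
$$\ell(\theta) = \sum_{\pi} \left\lVert \hat{\mathbf{I}}_{rgb}(\pi) - \mathbf{I}^{*}_{rgb}(\pi) \right\rVert^{2} + \lambda_{KL}\, D_{KL}(\mathbf{z}) + \lambda_{tv}\, E_{tv}(\mathbf{V}_{\alpha}) + \lambda_{\beta}\, E_{\beta}(\mathbf{I}_{\alpha}), \tag{12}$$
where $\theta$ denotes the network weights, $\hat{\mathbf{I}}_{rgb}$ is the rendered image merged with the background, $\mathbf{I}^{*}_{rgb}$ is the ground truth image, and $D_{KL}(\mathbf{z})$ is the KL divergence between the latent encoding $\mathbf{z}$ and a standard normal distribution used in a variational autoencoder (Kingma and Welling, 2013).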
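For reference, the KL term has a closed form when the encoder outputs a diagonal Gaussian, and the objective can then be assembled as a weighted sum. The sketch below assumes that parameterization (mean and log-variance per latent dimension) and uses hypothetical weight names `w_kl`, `w_tv`, and `w_beta`; it does not reproduce the weight values used in the experiments.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def training_objective(squared_error, kl, tv, beta, w_kl, w_tv, w_beta):
    """Weighted sum of the reconstruction error, KL term, and the two priors.

    All weights are hypothetical parameters of this sketch.
    """
    return squared_error + w_kl * kl + w_tv * tv + w_beta * beta
```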
We use Adam (Kingma and Ba, 2015) to optimize the loss function. In our experiments, we set , , and and we use a batch size of with a fixed learning rate of on the encoder-decoder network weights (the estimated background images and per-camera gain and bias are given separate learning rates of and , respectively). We randomly sample pixels from the image to reduce memory usage. While raymarching, we compute step sizes in normalized voxel space (i.e., ) as it makes the end of the ray the expected saturation point at network initialization. We train for about 500,000 iterations, depending on dataset size, which takes 10 days on a single NVIDIA Tesla V100.
Face | Train | Val. |
---|---|---|
No view conditioning | 51.1 | 117.8 |
View conditioning | 38.7 | 85.7 |
We perform a number of quantitative and qualitative experiments to validate our model. For all experiments, we use the convolutional volume decoder with size . We show that the design choices of our model provide a good compromise between quality and speed, and that the model generalizes to new viewpoints. We show results of our method on objects that are typically hard to reconstruct (e.g., fuzz, smoke, and hair) and demonstrate the method combined with traditional triangle rasterization. Finally, we demonstrate animation of our models by interpolating in the latent space and driving the reconstruction from user input.
To capture data, we used a multi-camera capture system consisting of 34 -resolution 30Hz color cameras placed on a hemisphere with a radius of approximately one meter. We calibrate the camera system using an icosahedral checkerboard pattern (Ha et al., 2017). The raw images and camera calibration are then used as the only input to our method.
In our quantitative experiments, we compare mean-squared error of pixel reconstructions on the training cameras and also on a set of 7 held-out validation cameras. This allows us to test how well our model extrapolates to novel viewpoints.
Background Model (Fuzzy Toy) | Train | Val. |
---|---|---|
Known BG / priors | 87.4 | 197.6 |
Known BG / no priors | 76.1 | 330.5 |
Learned BG / priors | 115.4 | 183.7 |
Learned BG / no priors | 94.1 | 281.7 |
No BG / priors | 208.8 | 386.7 |
No BG / no priors | 86.6 | 727.6 |
To validate our warp representation, we compare against several variants: a model with no warping, a model with a warp produced by a convolutional neural network, a model that does not apply the warp before computing the affine mixture weight, and the proposed affine mixture warp model (i.e., Eq. (5)).
For the model that does not apply the warp to compute the mixture, we modify Eq. (6) as follows:
(13) |
The main difference is that the mixture is done in “world” space rather than in warped space. Since the mixture weights are not warped, the decoder must match any motion of the template by moving the mixture weights.
Fig. 5 shows the results of our warping evaluation on three datasets: a moving hand, swinging hair, and dry ice smoke, each approximately 20 seconds long. In each dataset, our affine mixture model outperforms models with no warp and outperforms models with a convolutional warp field on the validation cameras. In particular, the convolutional warp fields completely fail on the “hair swing” dataset. The results also show that modeling the weighting volume in warp space is better than in “world” space.
An important part of rendering is being able to model view-dependent effects such as specularities. To do this, we can condition the decoder on the viewpoint of the rendered view. This allows the network to change the color of certain parts of the scene depending on the angle from which they are viewed.
Table 1 shows a quantitative experiment comparing a view-conditioned model to a non-view-conditioned model for a human face. The results show that the view-conditioned model is better able to model the scene from novel viewpoints, lowering the validation error from 117.8 to 85.7. This happens not only because the view-conditioned model can represent some of the view-dependent appearance of the scene, but also because the non-view-conditioned model incorrectly reconstructs extra semi-transparent voxels near the surface of the object to explain view-dependent phenomena.
We impose several priors on the reconstructed volume to help reduce the occurrence of artifacts in the reconstruction. In this experiment, we evaluate the effectiveness of background estimation and the priors on the reconstructed volume quality in terms of mean-squared-error on the validation viewpoints. We evaluate 3 different scenarios: known background image, learned background image, and no background model, each scenario evaluated with and without priors.
Table 2 shows the results of this experiment. For each background setting, using the priors improves performance on the validation viewpoints. Surprisingly, learning a background image outperforms using a known background image on the validation viewpoints (183.7 vs. 197.6 with priors), but the improvement is small.
In this section, we show qualitative results on a number of different types of objects. We compare our renderings to held-out ground truth views and also demonstrate hybrid rendering with a mesh representation.
Fig. 6 shows renderings produced by our method compared to ground truth images for several validation (held-out) viewpoints. The renderings demonstrate that our method is able to model difficult phenomena like fuzz, smoke, and human skin and hair. The figure also shows the artifacts typical of our reconstructions: a very light smoky pattern is added, which may be modeling view-dependent appearance for certain training cameras. Using more camera views tends to reduce all artifacts.
Fig. 7 shows how the template volume changes through time across several frames, compared to the final warped volume that is rendered. Ideally, object motion should be represented entirely by the warp field. However, this does not always happen when representing such changes requires more resolution than is available (e.g., as would be necessary to cleanly separate the rim of the glasses from the side of the head) or because representing the warp field becomes too complex.
Fig. 11 visualizes the learned RGB volumes. While there is no explicit surface reconstruction objective in our method, we can visualize isosurfaces of constant opacity, shown in (b). Ideally, fully opaque surfaces such as the hand in the first row would be represented by delta functions in the opacity, but we find that the method trades off some opacity to better match reconstruction error in the training views. Note how translucent materials such as the glasses in the second row, or materials which appear translucent at coarse resolution, such as hair in the third and fourth rows, are modeled with lower opacity values but retain a distinct structure particular to the object.
We evaluate the quality of our algorithm using only a single frame as input. In some ways this should make the problem easier as there is less information to represent within the model. On the other hand, our model is less able to exploit regularities and redundancies of motion. This experiment helps us disentangle those factors and determine the contribution of each component of the model.
To perform this experiment, we run our model on only a single frame. Although the encoder will produce a constant value, we keep the entire encoder-decoder network intact while training. We also compare to “direct” volume estimation, i.e., we directly estimate the template voxel volume and warp values without the encoder-decoder network.
Fig. 12 shows the results of the experiment on four objects we captured as well as one scene from a publicly available MVS dataset (Aanæs et al., 2016). As shown, the “direct” voxel/warp estimation contains artifacts and incorrectly reconstructs the object. This experiment shows that the convolutional architecture provides a regularization that, even in the case of a single frame, allows us to accurately recover and re-render objects. Similar observations have been made in a recent work on deep image priors (Ulyanov et al., 2018).
For comparison, we show the same frame reconstructed as a mesh using a commercial multi-view stereo system (Agisoft, 2019) and the open-source multi-view stereo system COLMAP (Schönberger and Frahm, 2016; Schönberger et al., 2016), as well as a comparison to space carving (Kutulakos and Seitz, 2000). Characteristically, the recovered resolution in the MVS texture map is greater than what we can achieve with volume reconstructions on current hardware. However, the mesh also shows artifacts in thin regions, like the frame of the glasses, and translucent regions, like the glass material and the hair at the top of the head, or the smoke in the 4th row. Space carving heavily relies on consistent appearance across views and struggles with translucent materials.
With the proposed reconstruction method, we can play back and re-render the captured data from many different angles. We want to extrapolate not only in viewpoint, but also in the content of the performance. Our latent variable model allows us to create new sequences of content by modifying the latent variables.
Fig. 8 shows two examples of content modification by interpolating latent codes and by changing the conditioning variable based on user input. Our latent space interpolation shows that the encoder network learns a compact representation of the scene. In the second example, we condition the decoder on the head pose of the subject, allowing us to create novel sequences in real time.
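Latent interpolation of this kind can be sketched as below. Linear blending between two codes is an assumption made for illustration, and `interpolate_latents` is a hypothetical helper; each returned code would be passed to the decoder to produce one frame of the new sequence.

```python
def interpolate_latents(z_start, z_end, num_frames):
    """Linearly blend between two latent codes.

    z_start, z_end: latent codes as equal-length lists of floats.
    num_frames:     number of codes to return, including both endpoints.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    codes = []
    for i in range(num_frames):
        w = i / (num_frames - 1)  # blend weight from 0 to 1
        codes.append([(1.0 - w) * a + w * b for a, b in zip(z_start, z_end)])
    return codes
```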
Fig. 9 shows the results of combining a textured mesh representation with our voxel representation. Textured meshes can efficiently and accurately represent fine detail in regions of the face like the skin and eyes while the voxel representation excels at modeling hair. For a mesh model we use the Deep Appearance Model of Lombardi et al. (2018) trained to reconstruct the same face we used to train our volume encoder/decoder network.
In this paper, we presented a method for modeling objects and scenes with a semi-transparent volume representation that we learn end-to-end from multi-view RGB images. We showed that our method can convincingly reconstruct challenging objects such as moving hair, fuzzy toys, and smoke. Our method requires no explicit tracking, and can be run in real time alongside traditional triangle rasterization.
One limitation of our method is that given a surface with limited texture, our estimated volume may represent that surface as transparent and place its color in the background, so long as doing so does not cause otherwise occluded surfaces to appear. This is a challenge that affects traditional 3D reconstruction methods as well. With our method, however, the reconstruction degrades gracefully and still produces perceptually pleasing results thanks to our image-space loss function. Fig. 10 shows an example of this via a depth map computed as the distance each ray travels before saturating or hitting the bounding box. In a more practical setting, this could be addressed simply by capturing the sequence with a bright background such as a green screen.
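The depth value described above can be computed per ray with a short sketch. The function name and the input format (densities pre-sampled once per step between the ray's entry into and exit from the bounding box) are assumptions made for this illustration.

```python
def ray_depth(densities, step):
    """Distance a ray travels before saturating or leaving the volume.

    densities: opacity densities sampled once per step along the ray, from
               entry into the bounding box to exit (hypothetical format).
    step:      length of one step along the ray.
    """
    opacity = 0.0
    travelled = 0.0
    for density in densities:
        travelled += step
        opacity = min(opacity + density * step, 1.0)
        if opacity >= 1.0:
            break  # the ray has saturated at this depth
    return travelled
```

A ray that never saturates returns the full length of its path through the bounding box, which is how surfaces reconstructed as transparent show up as far depth values.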
Although our method can handle transparent objects, like plastic bottles, it does not currently consider refractive surfaces. We believe the approach can be extended to model refraction and even reflection, and we leave that to future work. Our model can represent dull specular highlights through view conditioning, but high-frequency specular highlights are not correctly represented.
The latent space is the feature enabling us to generate dynamic content, but we do not explicitly model any temporal dynamics. This is not a problem for playback, since the playback sequence implicitly encodes the same temporal dynamics as the recording. It is also not a problem when driving the representation from user input, so long as that user input has reasonable temporal dynamics of its own. However, if we traverse the latent space in some manner not guided by temporal information, we may generate sequences which, while visually accurate, do not represent real behaviors of the object we modeled.
Volumetric representations typically suffer from limited resolution due to the cubic relationship between resolution and memory requirement. In this work, we showed some ways to increase the effective resolution without simply increasing the voxel grid resolution by using warping fields. We believe that we can further improve this approach to achieve a level of fidelity and resolution previously only achievable with traditional textured mesh surfaces.