Photo-realistic rendering of dynamic 3D objects and scenes from 2D image data is a central focus of research in computer vision and graphics. Volumetric representations have seen a resurgence of interest in the graphics community in recent years, driven by impressive empirical results attained using learning-based methods (Lombardi et al., 2019; Mildenhall et al., 2020). Through the use of generic function approximators, such as deep neural networks, these methods achieve compelling results by supervising directly on raw image pixels. Thus, they avoid the often difficult task of assigning geometric and radiometric properties, which is typically required by classical physics-inspired representations. Leveraging the inherent simplicity of volumetric models, much work has been dedicated to extending the approach to modeling small motions (Park et al., 2020), illumination variation (Srinivasan et al., 2020), reducing data requirements (Trevithick and Yang, 2020; Yu et al., 2020), and improving learning efficiency (Tancik et al., 2020). All these methods employ a soft volumetric representation of 3D space that helps them model thin structures and semi-transparent objects realistically.
Despite the aforementioned advances, volumetric models still have to make a trade-off: either they have a large memory footprint or they are computationally expensive to render. The large memory footprint drastically limits the resolution at which these approaches can operate and results in a lack of high-frequency detail. In addition, their high computational cost limits applicability to real-time applications, such as VR telepresence (Wei et al., 2019; Orts-Escolano et al., 2016). The ideal representation would be memory efficient, fast to render, drivable, and of high rendering quality.
Neural Volumes (Lombardi et al., 2019) is a method for learning, rendering, and driving dynamic objects captured using an outside-in camera rig. The method is suited to objects rather than scenes as a uniform voxel grid is used to model the scene. This grid’s memory requirement prevents the use of high resolutions, even on high-end graphics cards. Since much of the scene is often comprised of empty space, Neural Volumes employs a warp field to maximize the utility of its available resolution. The efficacy of this, however, is limited by the resolution of the warp and the ability of the network to learn complex inverse-warps in an unsupervised fashion.
Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) addresses the issue of resolution using a compact representation, but it only handles static scenes. Another challenge is runtime, since a multi-layer perceptron (MLP) has to be evaluated at every sample point along the camera rays. This leads to billions of MLP evaluations to synthesize a single high-resolution image, resulting in extremely slow render times of around thirty seconds per frame. Efforts to mitigate this rely on coarse-to-fine greedy selection that can miss small structures (Liu et al., 2020). This approach cannot easily be extended to dynamic scenes, since it relies on a static acceleration structure.
In this work, we present Mixture of Volumetric Primitives (MVP), an approach designed to directly address the memory and compute limitations of existing volumetric methods, while maintaining their desirable properties of completeness and direct image-based supervision. It is comprised of a mixture of jointly-generated, overlapping volumetric primitives that are selectively ray-marched, see Fig. 1. MVP leverages the conditional computation of ray-tracing to eliminate computation in empty regions of space. The generation of the volumetric primitives that occupy non-empty space leverages the shared computation of deconvolutional deep neural networks, which avoids the wasteful re-computation of common intermediate features for nearby areas, a common limitation of recent methods (Mildenhall et al., 2020; Liu et al., 2020). Our approach can naturally leverage previously computed correspondences or tracking results by linking the estimated placement of the primitives to them, which results in good motion interpolation. Moreover, through a user-defined granularity parameter, MVP generalizes volumetric methods (Lombardi et al., 2019) on one end and primitive-based methods (Lombardi et al., 2018; Aliev et al., 2019) on the other, enabling the practitioner to trade off resolution for completeness in a straightforward manner. We demonstrate that our approach produces higher quality, more drivable models, and can be evaluated more quickly than the state of the art. Our key technical contributions are:
A novel volumetric representation based on a mixture of volumetric primitives that combines the advantages of volumetric and primitive-based approaches, thus leading to high performance decoding and efficient rendering.
A novel motion model for voxel grids that better captures scene motion, minimization of primitive overlap to increase the representational power, and minimization of primitive size to better model and exploit free space.
A highly efficient, data-parallel implementation that enables faster training and real-time rendering of the learned models.
2. Related Work
In the following, we discuss different scene representations for neural rendering. For an extensive discussion of neural rendering applications, we refer to Tewari et al. (2020).
The simplest geometric primitives are points. Point-based representations can handle topological changes well, since no connectivity has to be enforced between the points. Differentiable point-based rendering has been extensively employed in the deep learning community to model the geometry of objects (Insafutdinov and Dosovitskiy, 2018; Roveri et al., 2018; Lin et al., 2018; Yifan et al., 2019). Differentiable Surface Splatting (Yifan et al., 2019) represents the points as discs with a position and orientation. Lin et al. (2018) learn efficient point cloud generation for dense 3D object reconstruction. Besides geometry, point-based representations have also been employed extensively to model scene appearance (Meshry et al., 2019; Aliev et al., 2019; Wiles et al., 2020; Lassner and Zollhöfer, 2020; Kolos et al., 2020). One of the drawbacks of point-based representations is that there might be holes between the points after projection to screen space. Thus, these techniques often employ a network in screen space, e.g., a U-Net (Ronneberger et al., 2015), to in-paint the gaps. SynSin (Wiles et al., 2020) lifts per-pixel features from a source image onto a 3D point cloud that can be explicitly projected to the target view. The resulting feature map is converted to a realistic image using a screen-space network. While the screen-space network is able to plausibly fill in the holes, point-based methods often suffer from temporal instabilities due to this screen-space processing. One approach to remove holes by design is to switch to geometry proxies with explicit topology, i.e., to use a mesh-based model.
Mesh-based representations explicitly model the geometry of an object based on a set of connected geometric primitives, and its appearance based on texture maps. They have been employed, for example, to learn personalized avatars from multi-view imagery based on dynamic texture maps (Lombardi et al., 2018). Differentiable rasterization approaches (Chen et al., 2019; Loper and Black, 2014; Kato et al., 2018; Liu et al., 2019; Valentin et al., 2019; Ravi et al., 2020; Petersen et al., 2019; Laine et al., 2020) enable the end-to-end integration of deep neural networks with this classical computer graphics representation. Recently, off-the-shelf tools for differentiable rendering have been developed, e.g., TensorFlow3D (Valentin et al., 2019), PyTorch3D (Ravi et al., 2020), and Nvidia's nvdiffrast (Laine et al., 2020). Differentiable rendering strategies have, for example, been employed to learn 3D face models (Tewari et al., 2018; Genova et al., 2018; Luan Tran, 2019) from 2D photo and video collections. There are also techniques that store a feature map in the texture map and employ a screen-space network to compute the final image (Thies et al., 2019). If accurate surface geometry can be obtained a priori, mesh-based approaches are able to produce impressive results, but they often struggle if the object cannot be well reconstructed. Unfortunately, accurate 3D reconstruction is notoriously hard to obtain for humans, especially for hair, eyes, and the mouth interior. Since such approaches require a template with fixed topology, they also struggle to model topological changes, and it is challenging to model occlusions in a differentiable manner.
One example of a mixture-based representation are Multi-Plane Images (MPIs) (Zhou et al., 2018; Tucker and Snavely, 2020; Srinivasan et al., 2019). MPIs employ a set of depth-aligned textured planes to store color and alpha information. Novel views are synthesized by rendering the planes in back-to-front order using hardware-supported alpha blending. These approaches are normally limited to a small restricted view-box, since specular surfaces have to be modeled via the alpha blending operator. Local Light Field Fusion (LLFF) (Mildenhall et al., 2019) enlarges the view-box by maintaining and smartly blending multiple local MPI-based reconstructions. Multi-sphere images (MSIs) (Attal et al., 2020; Broxton et al., 2020) replace the planar and textured geometry proxies with a set of textured concentric sphere proxies. This enables inside-out view synthesis for VR video applications. MatryODShka (Attal et al., 2020) enables real-time 6DoF video view synthesis for VR by converting omnidirectional stereo images to MSIs.
Grid-based representations are similar to the multi-layer representation, but are based on a dense uniform grid of voxels. They have been extensively used to model the 3D shape of objects (Mescheder et al., 2019; Peng et al., 2020; Choy et al., 2016; Tulsiani et al., 2017; Wu et al., 2016; Kar et al., 2017). Grid-based representations have also been used as the basis for neural rendering techniques to model object appearance (Sitzmann et al., 2019a; Lombardi et al., 2019). DeepVoxels (Sitzmann et al., 2019a) learns a persistent 3D feature volume for view synthesis and employs learned ray marching. Neural Volumes (Lombardi et al., 2019) is an approach for learning dynamic volumetric reconstructions from multi-view data. One big advantage of such representations is that they do not have to be initialized based on a fixed template mesh and are easy to optimize with gradient-based optimization techniques. The main limiting factor for all grid-based techniques is the required cubic memory footprint. The sparser the scene, the more voxels are actually empty, which wastes model capacity and limits resolution. Neural Volumes employs a warping field to maximize occupancy of the template volume, but empty space is still evaluated while raymarching. We propose to model deformable objects with a set of rigidly-moving volumetric primitives.
Multi-Layer Perceptrons (MLPs) have first been employed for modeling 3D shapes based on signed distance (Park et al., 2019; Jiang et al., 2020; Chabra et al., 2020; Saito et al., 2019a, b) and occupancy fields (Mescheder et al., 2019; Genova et al., 2020; Peng et al., 2020). DeepSDF (Park et al., 2019) is one of the first works that learns the 3D shape variation of an entire object category based on MLPs. ConvOccNet (Peng et al., 2020) enables fitting of larger scenes by combining an MLP-based scene representation with a convolutional decoder network that regresses a grid of spatially-varying conditioning codes. Afterwards, researchers started to also model object appearance using similar scene representations. Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) proposes a volumetric scene representation based on MLPs. One of its challenges is that the MLP has to be evaluated at a large number of sample points along each camera ray. This makes rendering a full image with NeRF extremely slow. Furthermore, NeRF cannot model dynamic scenes or objects. Scene Representation Networks (SRNs) (Sitzmann et al., 2019b) can be evaluated more quickly, as they model space with a signed-distance field. However, using a surface representation means that they cannot represent thin structures or transparency well. Neural Sparse Voxel Fields (NSVF) (Liu et al., 2020) culls empty space based on an octree acceleration structure, but is extremely difficult to extend to dynamic scenes. There also exists a large number of not-yet peer-reviewed, but impressive, extensions of NeRF (Gafni et al., 2020; Gao et al., 2020; Li et al., 2020; Martin-Brualla et al., 2020; Park et al., 2020; Rebain et al., 2020; Tretschk et al., 2020; Xian et al., 2020; Du et al., 2020; Schwarz et al., 2020). While the results of MLP-based models are often visually pleasing, their main drawbacks are their limited or missing ability to be driven as well as their high computational cost at evaluation time.
Our approach is a hybrid that finds the best trade-off between volumetric- and primitive-based neural scene representations. Thus, it produces high-quality results with fine-scale detail, is fast to render, drivable, and reduces memory constraints.
Our approach is based on a novel volumetric representation for dynamic scenes that combines the advantages of volumetric and primitive-based approaches to achieve high performance decoding and efficient rendering. In the following, we describe our scene representation and how it can be trained end-to-end based on 2D multi-view image observations.
3.1. Neural Scene Representation
Our neural scene representation is inspired by primitive-based methods, such as triangular meshes, that can efficiently render high-resolution models of 3D space by focusing representation capacity on occupied regions of space and ignoring those that are empty. At the core of our method is a set of minimally-overlapping and dynamically moving volumetric primitives that together parameterize the color and opacity distribution in space over time. Each primitive models a local region of space based on a uniform voxel grid. This provides two main advantages that together lead to a scene representation that is highly efficient in terms of memory consumption and is fast to render: 1) fast sampling within each primitive owing to its uniform grid structure, and 2) conditional sampling during ray marching to avoid empty space and fully occluded regions. The primitives are linked to an underlying coarse guide mesh (see next section) through soft constraints, but can deviate from the mesh if this leads to improved reconstruction. Both the primitives' motion and their color and opacity distribution are parameterized by a deconvolutional network that enables the sharing of computation amongst them, leading to highly efficient decoding.
3.1.1. Guide Mesh
We employ a coarse estimate of the scene's geometry for every frame as the basis of our scene representation. For static scenes, it can be obtained via off-the-shelf reconstruction packages such as COLMAP (Schönberger and Frahm, 2016; Schönberger et al., 2016). For dynamic scenes, we employ multi-view non-rigid tracking to obtain a temporally corresponded estimate of the scene's geometry over time (Wu et al., 2018). These meshes guide the initialization of our volumetric primitives, regularize the results, and keep the optimization from terminating in a poor local minimum. Our model generates both the guide mesh as well as its weakly attached volumetric primitives, enabling the direct supervision of large motion using results from explicit tracking. This is in contrast to existing volumetric methods that parameterize explicit motion via an inverse warp (Lombardi et al., 2019; Park et al., 2020), where supervision is more challenging to employ.
3.1.2. Mixture of Volumetric Primitives
The purpose of each of the volumetric primitives is to model a small local region of 3D space.
Each volumetric primitive is defined by a position t_k in 3D space, an orientation given by a rotation matrix R_k (we parameterize rotations as an axis-angle Rodrigues vector), and per-axis scale factors s_k. Together, these parameters uniquely describe the model-to-world transformation of each individual primitive. In addition, each primitive contains a payload that describes the appearance and geometry of the associated region in space. The payload is defined by a dense voxel grid that stores the color (3 channels) and opacity (1 channel) of the voxels, with M the number of voxels along each spatial dimension. Below, we assume our volumes are cubes with equal side lengths unless stated otherwise.
As mentioned earlier, the volumetric primitives are weakly constrained to the surface of the guide mesh and are allowed to deviate from it if that improves reconstruction quality. Specifically, their position t_k, rotation R_k, and scale s_k are modeled relative to the mesh-based initialization using regressed deltas (δt_k, δR_k, δs_k). To compute the mesh-based initialization, we generate a 2D grid in the mesh's texture space and place the primitives at the 3D locations on the mesh that correspond to the uv-coordinates of the grid points. The orientation of each primitive is initialized based on the local tangent frame of the 3D surface point it is attached to. The scale of each primitive is initialized based on the local gradient of the uv-coordinates at the corresponding grid point position. Thus, the primitives are initialized with a scale in proportion to the distances to their neighbours.
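To make the primitive parameterization concrete, the sketch below illustrates a single primitive's world-to-model mapping and containment test. This is a minimal illustration under our own assumptions; the class layout, the normalized [-1, 1]^3 model space, and all names are ours, not the paper's.

```python
# Hypothetical sketch of a single volumetric primitive (names are ours):
# position t, rotation R (3x3 list of rows), per-axis scale s, and a
# dense MxMxM payload storing RGB + opacity per voxel.
class Primitive:
    def __init__(self, t, R, s, M=8):
        self.t, self.R, self.s, self.M = t, R, s, M
        # payload: M^3 voxels x 4 channels (RGB + alpha), zero-initialized
        self.payload = [[0.0, 0.0, 0.0, 0.0] for _ in range(M ** 3)]

    def world_to_model(self, p):
        """Map a world-space point into the primitive's normalized
        [-1, 1]^3 model space: x = diag(1/s) * R^T * (p - t)."""
        d = [p[i] - self.t[i] for i in range(3)]
        # apply R^T: component i is the dot product of column i of R with d
        r = [sum(self.R[j][i] * d[j] for j in range(3)) for i in range(3)]
        return [r[i] / self.s[i] for i in range(3)]

    def contains(self, p):
        """True if the world-space point lies inside the primitive."""
        return all(-1.0 <= c <= 1.0 for c in self.world_to_model(p))
```

During ray marching (§3.2.1), a sample point is evaluated by mapping it into model space and trilinearly interpolating the payload at the resulting coordinates.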
3.1.3. Opacity Fade Factor
Allowing the volumetric primitives to deviate from the guide mesh is important to account for deficiencies in the initialization strategy, low guide mesh quality, and insufficient coverage of objects in the scene. However, allowing for motion is not enough; during training, the model can only receive gradients from regions of space that the primitives cover, resulting in a limited ability to self-assemble and expand to attain more coverage of the scene's content. Furthermore, it is easier for the model to reduce opacity in empty regions than to move the primitives away. This wastes primitives that would be better utilized in regions with higher occupancy. To mitigate this behavior, we apply a windowing function to the opacity of the payload that takes the form:
where x denotes normalized coordinates within the primitive's payload volume, and α and β are hyperparameters that control the rate of opacity decay towards the edges of the volume. This windowing function adds an inductive bias to explain the scene's contents via motion instead of payload, since the magnitude of gradients propagated through opacity values at the edges of the payload is downscaled. We note that this does not prevent the edges of the opacity payload from taking on large values; rather, our construction forces them to learn more slowly (Karras et al., 2020), thus favoring motion of the primitives, whose gradients are not similarly impeded. We found a fixed setting of α and β that strikes a good balance between scene coverage and reconstruction accuracy, and keep it unchanged for all experiments.
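As an illustration only, a separable window of the form (1 − |x_i|^α)^β per axis has the stated properties: the opacity scaling vanishes at the payload boundary and is one at the center. The functional form and the default hyperparameter values below are our assumptions, not reproduced from the paper.

```python
def opacity_window(x, alpha=8.0, beta=8.0):
    """Hypothetical separable opacity fade. Each normalized coordinate
    x_i in [-1, 1] contributes (1 - |x_i|**alpha)**beta, so opacity (and
    its gradient) is damped toward the payload boundary but unaffected
    at the center. alpha/beta defaults are illustrative assumptions."""
    w = 1.0
    for xi in x:
        w *= max(0.0, 1.0 - abs(xi) ** alpha) ** beta
    return w
```

Large α and β keep the window close to 1 over most of the volume and make it fall off sharply near the faces, which matches the intent of damping only boundary opacities.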
3.1.4. Network Architecture
We employ an encoder-decoder network to parameterize the coarse tracked proxy mesh as well as the weakly linked mixture of volumetric primitives. Our approach is based on Variational Autoencoders (VAEs) (Kingma and Welling, 2013) to encode the dynamics of the scene using a low-dimensional latent code. Note that the goal of our construction is only to produce the decoder. The role of the encoder during training is to encourage a well-structured latent space. It can be discarded upon training completion and replaced with an application-specific encoder (Wei et al., 2019) or simply with latent space traversal (Abdal et al., 2019). In the following, we provide details of the encoder for different training settings as well as our four decoder modules.
The architecture of our encoder is specialized to the data available for modeling. When coarse mesh tracking is available, we follow the architecture in (Lombardi et al., 2018), which takes as input the tracked geometry and a view-averaged unwarped texture for each frame. Geometry is passed through a fully connected layer, and texture through a convolutional branch, before being fused and further processed to predict the parameters μ and σ of a normal distribution, where μ is the mean and σ is the standard deviation. When tracking is not available, we follow the encoder architecture in (Lombardi et al., 2019), where images from fixed views are taken as input (in practice, we employ a downsampled frontal view of the dynamic scene). Each camera view is first processed independently; the intermediate features are then fused and further processed to predict μ and σ. To learn a smooth latent space with good interpolation properties, we regularize using a KL-divergence loss that forces the predicted distribution to stay close to a standard normal distribution. The latent vector is obtained by sampling from the predicted distribution using the reparameterization trick (Kingma and Welling, 2013).
Our decoder is comprised of four modules: two geometry decoders and two payload decoders. The geometry decoders determine the primitives' model-to-world transformations. The first predicts the guide mesh used to initialize the transformations and is comprised of a sequence of fully connected layers. The second is responsible for predicting the deviations in position, rotation (as a Rodrigues vector), and scale (δt_k, δR_k, δs_k) from the guide mesh initialization. It uses a 2D deconvolutional architecture to produce the motion parameters as channels of a 2D grid following the primitive ordering in the texture's uv-space described in §3.1.2. The payload decoders determine the color and opacity stored in the primitives' voxel grids. The first computes opacity based on a 2D deconvolutional architecture. The second computes view-dependent RGB color; it is also based on 2D deconvolutions and uses an object-centric view-vector, i.e., a vector pointing to the center of the object/scene. Unlike the geometry decoders, which employ small networks and are efficient to compute, the payload decoders present a significant computational challenge due to the total size of the elements they have to generate. Our architecture, shown in Fig. 2, addresses this by avoiding redundant computation through the use of a deconvolutional architecture: nearby locations in the output slab leverage shared features from earlier layers of the network. This is in contrast to MLP-based methods, such as (Mildenhall et al., 2020), where each position requires independent computation of all features in all layers, without any sharing. Since our texture space is the result of a mesh-atlasing algorithm (we use Blender, http://www.blender.org) that tends to preserve the locality structure of the underlying 3D mesh, the regular grid ordering of our payload within the decoded slab (see §3.1.2) leverages the spatially coherent structures afforded by deconvolution well. The result is an efficient architecture with good reconstruction capacity.
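The uv-ordered slab layout can be illustrated with a toy indexing helper. The row-major ordering and the tile size below are our assumptions for illustration, not the paper's exact scheme.

```python
def slab_tile(k, grid_w, tile=8):
    """Return the (row, col) pixel origin of primitive k's patch inside
    a decoded 2D slab laid out as a grid_w x grid_w arrangement of
    tile x tile patches, in row-major uv order (an assumed indexing).
    Nearby k values map to nearby slab locations, which is what lets a
    deconvolutional decoder share features between adjacent primitives."""
    row, col = divmod(k, grid_w)
    return row * tile, col * tile
```

Because spatially adjacent primitives occupy adjacent tiles, earlier deconvolution layers compute features once per neighborhood rather than once per output element.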
MVP is designed to model objects in a scene from an outside-in camera configuration (cameras are placed on a structure surrounding the object and face inward), but the extent of object coverage is not known a priori. Thus, we need a mechanism for separating objects from the background of the scene. However, existing segmentation algorithms can fail to capture fine details around object borders and can be inconsistent in 3D. Instead, we jointly model the objects as well as the scene's background. Whereas the objects are modeled using MVP, we use a separate neural network to model the background as a modulation of images captured of the empty scene with the objects absent. Specifically, our background model for the i-th camera takes the form:

where B_i is the image of the empty capture space, o_i is the camera center, and d_p is the ray direction for pixel p. The modulation function is an MLP that takes position-encoded camera coordinates and ray directions and produces an RGB color using an architecture similar to NeRF (Mildenhall et al., 2020). The background images of the empty scene are not sufficient by themselves, since objects in the scene can have effects on the background which, if not accounted for, are absorbed into the MVP, resulting in hazy reconstructions as observed in NV (Lombardi et al., 2019). Examples of these effects include shadowing and content outside of the modeling volume, like supporting stands and chairs. As we will see in §3.2.2, MVP rendering produces an image with color and alpha channels. These are combined with the background image to produce the final output that is compared to the captured images during training through alpha-compositing: the rendered color is added to the background attenuated by one minus the rendered alpha.
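A per-pixel sketch of the described compositing, assuming the rendered color is premultiplied by alpha so the final pixel is the rendered color plus the background attenuated by one minus the rendered alpha:

```python
def composite(rgb, alpha, background):
    """Alpha-composite a rendered pixel over the background pixel.
    `rgb` is the (premultiplied) rendered color, `alpha` the accumulated
    opacity in [0, 1], `background` the (modulated) background color."""
    return [c + (1.0 - alpha) * b for c, b in zip(rgb, background)]
```

With alpha = 1 the background is fully occluded; with alpha = 0 the background pixel passes through unchanged.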
3.2. Efficient and Differentiable Image Formation
The proposed scene representation is able to focus the representational power of the encoder-decoder network on the occupied regions of 3D space, thus leading to a high resolution model and efficient decoding. However, we still need to be able to efficiently render images using this representation. For this, we propose an approach that combines an efficient raymarching strategy with a differentiable volumetric aggregation scheme.
3.2.1. Efficient Raymarching
To enable efficient rendering, our algorithm should: 1) skip samples in empty space, and 2) employ efficient payload sampling. Similar to (Lombardi et al., 2019), the regular grid structure of our payload enables efficient sampling via trilinear interpolation. However, in each step of the ray marching algorithm, we additionally need to find which primitives the current evaluation point lies within. These primitives tend to be highly irregular, with positions, rotations, and scales that vary on a per-frame basis. For this, we employ a highly optimized data-parallel BVH implementation (Karras and Aila, 2013) whose construction time for 4096 primitives is negligible. This enables us to rebuild the BVH on a per-frame basis, thus handling dynamic scenes, and provides us with efficient intersection tests. Given this data structure of the scene, we propose a strategy for limiting evaluations as much as possible. First, we compute and store the primitives that each ray intersects. We use this to compute (t_min, t_max), the domain of integration. While marching along a ray between t_min and t_max, we check each sample only against the ray-specific list of intersected primitives. Compared to MLP-based methods, e.g., NeRF (Mildenhall et al., 2020), our approach exhibits very fast sampling. If the number of overlapping primitives is kept low, the total sampling cost is much smaller than a deep MLP evaluation at each step, which is far from real-time even with a good importance sampling strategy.
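A simplified sketch of the per-ray primitive selection. For brevity it uses axis-aligned bounding boxes and a brute-force loop in place of the oriented primitives and the data-parallel BVH described above; all names are ours.

```python
def ray_aabb(o, d, lo, hi):
    """Slab test: return (tmin, tmax) for ray o + t*d against the box
    [lo, hi], or None on a miss. Assumes d has nonzero components."""
    tmin, tmax = -float("inf"), float("inf")
    for i in range(3):
        t0 = (lo[i] - o[i]) / d[i]
        t1 = (hi[i] - o[i]) / d[i]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin, tmax = max(tmin, t0), min(tmax, t1)
    return (tmin, tmax) if tmin <= tmax and tmax >= 0.0 else None

def hit_list(o, d, boxes):
    """Indices of the primitives the ray touches, plus the overall
    integration domain (t_min, t_max) covering all hits."""
    hits, t0, t1 = [], float("inf"), -float("inf")
    for k, (lo, hi) in enumerate(boxes):
        r = ray_aabb(o, d, lo, hi)
        if r is not None:
            hits.append(k)
            t0, t1 = min(t0, r[0]), max(t1, r[1])
    return hits, t0, t1
```

During marching, each sample between t_min and t_max is then tested only against `hits`, never against the full primitive set.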
3.2.2. Differentiable Volumetric Aggregation
We require a differentiable image formation model to enable end-to-end training based on multi-view imagery. Given the sample points in occupied space extracted by the efficient ray marching strategy, we employ an accumulative volume rendering scheme as in (Lombardi et al., 2019) that is motivated by front-to-back additive alpha blending. During this process, the ray accumulates color as well as opacity. Given a ray with starting position o_p and ray direction d_p, we solve the following integral using numerical quadrature:

Here, the global color and opacity fields are computed based on the current instantiation of the volumetric primitives. We set the alpha value associated with the pixel to the accumulated opacity. For high-performance rendering, we employ an early stopping strategy based on the accumulated opacity: once it saturates, we terminate ray marching, since the remaining sample points do not have a significant impact on the final pixel color. If a sample point is contained within multiple volumetric primitives, we combine their values in BVH order based on the accumulation scheme. Our use of the additive formulation for integration, as opposed to the multiplicative form (Mildenhall et al., 2020), is motivated by its independence to ordering up to the saturation point. This allows for a backward pass implementation that is more memory efficient, since we do not need to keep the full graph of operations. Thus, our implementation requires less memory and allows for larger batch sizes during training. For more details, we refer to the supplemental document.
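One plausible reading of the additive front-to-back accumulation with saturation-based early stopping. This is a sketch under our own assumptions (uniform step size, opacity clipped so the total alpha never exceeds one); the paper's exact quadrature weights are not reproduced here.

```python
def march(samples, step, stop=0.999):
    """Additive front-to-back accumulation: each sample along the ray
    contributes its color weighted by its clipped opacity contribution;
    marching stops once the accumulated opacity saturates."""
    rgb, acc = [0.0, 0.0, 0.0], 0.0
    for color, sigma in samples:  # samples ordered front to back
        a = min(sigma * step, 1.0 - acc)  # clip so total alpha <= 1
        rgb = [c + a * col for c, col in zip(rgb, color)]
        acc += a
        if acc >= stop:
            break  # early termination: remaining samples are occluded
    return rgb, acc
```

Note that before the clip engages, the sum is independent of sample ordering, which is the property the additive formulation exploits for a memory-efficient backward pass.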
3.3. End-to-end Training
Next, we discuss how we train our approach end-to-end based on a set of 2D multi-view images. The trainable parameters of our model are the weights of the encoder and decoder networks. Given a multi-view video sequence of training images, our goal is to find the optimal parameters that best explain the training data. To this end, we solve the following optimization problem:
We employ ADAM (Kingma and Ba, 2015) to solve this optimization problem based on stochastic mini-batch optimization. In each iteration, our training strategy uniformly samples rays from each image in the current batch to define the loss function. We employ a fixed learning rate, and all other optimizer parameters are set to their default values. Our training objective is of the following form:
It consists of a photometric reconstruction loss , a coarse geometry reconstruction loss , a volume minimization prior , a delta magnitude prior , and a Kullback–Leibler (KL) divergence prior to regularize the latent space of our Variational Autoencoder (VAE) (Kingma and Welling, 2013). In the following, we provide more details on the individual energy terms.
Photometric Reconstruction Loss
We want to enforce that the synthesized images look photo-realistic and match the ground truth. To this end, we compare the synthesized pixels to the ground truth using the following loss function:
Here, the loss is summed over the set of sampled pixels with a per-pixel weight, and the term is given a fixed relative weight in the overall objective.
Mesh Reconstruction Loss
We also want to enforce that the coarse mesh proxy follows the motion in the scene. To this end, we compare the regressed vertex positions to the available ground truth traced mesh using the following loss function:
Here, we employ an ℓ2 loss between the ground truth vertex positions of the tracked mesh and the corresponding regressed vertex positions. We employ the coarse mesh-based tracking used in the approach of Lombardi et al. (2018). The mesh reconstruction loss pulls the volumetric primitives, which are weakly linked to the mesh, to approximately correct positions. Note that the primitives are only weakly linked to the mesh proxy and can deviate from their initial positions if that improves the photometric reconstruction loss. This term is given a fixed relative weight in the overall objective.
Volume Minimization Prior
We constrain the volumetric primitives to be as small as possible. The reasons for this are twofold: 1) We want to prevent them from overlapping too much, since this wastes model capacity in already well explained regions, and 2) We want to prevent loss of resolution by large primitives overlapping empty space. To this end, we employ the following volume minimization prior:
Here, s_k is the vector of side lengths of the k-th primitive, and the product of its entries is the primitive's volume. We minimize the total volume of all primitives with a fixed relative weight in the overall objective.
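The volume minimization prior reduces to summing the products of each primitive's side lengths, which can be sketched as:

```python
def volume_prior(scales):
    """Sum of primitive volumes. Each entry of `scales` is a per-axis
    side-length vector s_k; the product of its components is the
    volume of that primitive."""
    total = 0.0
    for s in scales:
        v = 1.0
        for side in s:
            v *= side
        total += v
    return total
```

Penalizing this sum discourages both heavily overlapping primitives and large primitives that cover empty space.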
4. Results

In this section we describe our datasets for training and evaluation, present results on several challenging sequences, perform ablation studies over our model's components, and compare to the state of the art. We perform both qualitative and quantitative evaluations.
4.1. Training Data
We evaluate our approach on a large number of sequences captured using a spherically arranged multi-camera capture system with synchronized color cameras. The cameras are equally distributed on the surface of the spherical capture structure and are geometrically calibrated with respect to each other using the intrinsic and extrinsic parameters of a pinhole camera model. For training and evaluation, we downsample the images to reduce the time it takes to load them from disk, keeping images from all but 8 cameras for training, with the remainder used for testing. To handle the different radiometric properties of the cameras, e.g., color response and white balance, we employ per-camera color calibration with six parameters (gain and bias per color channel), similar to (Lombardi et al., 2018), but pre-trained for all cameras once per dataset.
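The per-camera color calibration can be sketched as an affine per-channel model with gain and bias; the exact parameterization and pre-training procedure are not reproduced here, so treat this as an illustrative assumption.

```python
def color_calibrate(rgb, gain, bias):
    """Per-camera radiometric correction with 6 parameters: a gain and
    a bias for each of the three color channels (affine model)."""
    return [g * c + b for c, g, b in zip(rgb, gain, bias)]
```

Fitting one such parameter set per camera absorbs differences in color response and white balance so that the shared scene model does not have to explain them.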
4.2. Qualitative Results
Our approach achieves a high level of fidelity while matching the completeness of volumetric representations, e.g., for hair coverage and the mouth interior (see Fig. 3), but with an efficiency closer to that of mesh-based approaches. Our fastest model renders high-resolution binocular stereo views at real-time rates, which enables live visualization of our results in a virtual reality setting; please see the supplemental video for these results. Furthermore, our model can represent dynamic scenes, supporting free-viewpoint video applications (see Fig. 3). Our model also enables animation, which we demonstrate via latent-space interpolation (see Fig. 4). Due to the combination of the variational architecture with a forward warp, our approach produces highly realistic animation results even when the facial expressions in the keyframes differ substantially.
4.3. Ablation Studies
We perform a number of ablation studies to support each of our design choices.
Number of Primitives
We investigated the influence of the number of primitives, $N_{\mathrm{prim}}$, on rendering quality and runtime performance. A quantitative evaluation can be found in Tab. 1. Here, we compared models with a varying number of primitives on held-out views, while keeping the total number of voxels constant. In total, we compared models with 1, 8, 64, 512, 4096, 32768, and 262144 primitives. Note that the special case of exactly one volumetric primitive corresponds to the setting used in Lombardi et al. (2019). Our best model uses 256 primitives. Fig. 6 shows this data as a scatter plot with PSNR on the x-axis and total execution time on the y-axis, and Fig. 5 shows the corresponding qualitative results. As the results show, if the number of primitives is too low, results appear blurry; if it is too large, the model struggles to represent elaborate motion, e.g., in the mouth region, leading to artifacts.
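The constant-voxel-budget constraint above implies a direct trade-off between primitive count and per-primitive resolution. A small sketch (the 256³ budget is illustrative, not the paper's exact figure) makes the scaling concrete:

```python
def voxels_per_side(total_voxels, n_prim):
    """Side length of each primitive's cubic voxel grid when a fixed
    total voxel budget is split evenly over n_prim primitives."""
    per_prim = total_voxels / n_prim
    return round(per_prim ** (1.0 / 3.0))

# With a fixed budget, more primitives means fewer voxels each:
budget = 256 ** 3                      # ~16.8M voxels (illustrative)
side_single = voxels_per_side(budget, 1)       # one big 256^3 grid
side_many = voxels_per_side(budget, 262144)    # many tiny 4^3 grids
```

This is why intermediate primitive counts win: a single grid wastes voxels on empty space, while very many tiny grids leave too little resolution per primitive and make the motion model's job harder.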
Primitive Volume Prior
The influence of our primitive volume prior from Section 3.3 is evaluated by training models with different prior weights, $\lambda_{\mathrm{vol}}$. We used models with 512 and 32,768 primitives in this evaluation (see Fig. 7 and Tab. 2). Larger weights lead to smaller scales and reduced overlap between adjacent primitives. This, in turn, leads to faster runtime performance, since less overlap means fewer primitives to evaluate at each marching step and less overall volume to cover. However, a prior weight that is too large can lead to over-shrinking, where holes appear in the reconstruction and the image evidence is not sufficient to force the primitives to expand again.
Importance of the Opacity Fade Factor
We trained a model with 512 primitives with and without the opacity fade factor described in Section 3.1.3. As shown in Fig. 8, the fade factor is critical for the primitives to converge to good configurations, since it enables them to move more easily, especially during early stages of training. The model trained without the opacity fade produces suboptimal primitive configurations with large overlaps and coverage of empty space. Tab. 3 gives a quantitative comparison.
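One simple way to realize such a fade (the window function below is illustrative, not the paper's exact definition) is a smooth falloff that drives opacity toward zero at the primitive boundary, so that a primitive's visible extent, and hence its gradient signal, changes smoothly as it moves:

```python
import numpy as np

def opacity_fade(coords, sharpness=8.0):
    """Illustrative soft window over normalized primitive coordinates
    in [-1, 1]^3: close to 1 in the interior, decaying smoothly toward
    exp(-1) at the faces, multiplied onto the decoded opacity."""
    return np.exp(-np.max(np.abs(coords), axis=-1) ** sharpness)

center = opacity_fade(np.array([0.0, 0.0, 0.0]))  # interior: ~1.0
face = opacity_fade(np.array([1.0, 0.0, 0.0]))    # boundary: ~exp(-1)
```

Without such a window, a primitive's silhouette edge produces no useful gradient with respect to its pose, which is consistent with the poor configurations observed in the ablation.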
Impact of Voxel Count and Raymarching Step Size
Fig. 9 and Fig. 10 illustrate the effects of different voxel counts and raymarching step sizes on perceived resolution; Tab. 4 shows the quantitative impact of step size. Here, the same step size is used during both training and evaluation. Smaller marching step sizes recover more detail, such as hair and wrinkles, and result in a lower reconstruction error on held-out views. Likewise, more voxels yield sharper results. These gains in accuracy, however, come at the cost of performance: decoding cost scales linearly with the number of voxels decoded, and raymarching cost scales linearly with the number of steps, i.e., inversely with the step size.
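The step-size cost relationship follows directly from standard front-to-back volume accumulation, sketched here for a single ray (a generic compositing scheme, not the paper's exact kernel):

```python
import numpy as np

def raymarch(density, color, step):
    """Front-to-back accumulation along one ray at a fixed step size.

    density: (n,) sigma values at each sample
    color:   (n, 3) RGB values at each sample
    Returns accumulated RGB and remaining transmittance."""
    out = np.zeros(3)
    transmittance = 1.0
    for sigma, rgb in zip(density, color):
        alpha = 1.0 - np.exp(-sigma * step)   # per-sample opacity
        out += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return out, transmittance

# Halving the step size doubles the number of samples (and the cost)
# but resolves finer structure along the ray.
n = 64
rgb, t = raymarch(np.full(n, 1.0),
                  np.tile([1.0, 0.0, 0.0], (n, 1)), 1.0 / n)
```

Skipping empty space, as MVP does, removes samples from this loop entirely, which is where its speed advantage over a dense volume comes from.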
Motion Model Architecture
The motion model regresses position, rotation, and scale deviations of the volumetric primitives from the underlying guide mesh. We compared the deconvolutional architecture proposed in Section 3 against a simpler linear layer that transforms the latent code into a $9 N_{\mathrm{prim}}$-dimensional vector comprising the stacked motion vectors, i.e., three dimensions each for translation, rotation, and scale. Tab. 5 shows that our deconvolutional architecture outperforms the simple linear model for almost all primitive counts. A visualization of their differences can be seen in Fig. 11: our deconvolutional model produces primitive configurations that follow surfaces in the scene more closely than the linear model, which produces many more "fly-away" zero-opacity primitives, wasting resolution.
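For reference, the linear baseline is a single matrix multiply from the latent code to the stacked per-primitive motion vectors; the sketch below (latent size, primitive count, and initialization are illustrative) shows the shapes involved:

```python
import numpy as np

# Illustrative sizes: a 256-d latent code decoded to 9 motion values
# (3 translation, 3 rotation, 3 scale) per primitive by one linear
# layer -- the baseline the deconvolutional model is compared against.
n_prim, latent_dim = 512, 256
rng = np.random.default_rng(0)
W = rng.standard_normal((9 * n_prim, latent_dim)) * 0.01
b = np.zeros(9 * n_prim)

z = rng.standard_normal(latent_dim)               # latent code
motion = (W @ z + b).reshape(n_prim, 9)           # per-primitive motion
dt, dr, ds = motion[:, :3], motion[:, 3:6], motion[:, 6:9]
```

Because each output dimension is independent, this baseline has no built-in notion of spatial coherence between neighboring primitives, whereas a deconvolutional decoder shares computation across a 2D grid of primitives, which plausibly explains the smoother configurations it produces.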
We compared MVP against the current state of the art in neural rendering for both static and dynamic scenes. As our qualitative and quantitative results show, we outperform the current state of the art in reconstruction quality as well as runtime performance.
Neural Volumes
We compare to Neural Volumes (NV) (Lombardi et al., 2019) on several challenging dynamic sequences; see Fig. 12. Our approach obtains sharper and more detailed reconstructions, while being much faster to render. We attribute this to our novel mixture of volumetric primitives, which can concentrate representation resolution and compute in occupied regions of the scene and skip empty space during raymarching. Quantitative comparisons are presented in Tab. 1. As can be seen, we outperform Neural Volumes in terms of SSIM and LPIPS.
Neural Radiance Fields (NeRF)
We also compare our approach to Neural Radiance Fields (NeRF) (Mildenhall et al., 2020); see Fig. 13. Since NeRF is an approach for novel view synthesis of static scenes, we trained it using only a single frame of our capture sequence. We compared it against MVP trained on both the same single frame and the entire sequence of approximately 20,000 frames. A visualization of the differences between NeRF and MVP on the static frame is shown in Fig. 13. NeRF excels at representing geometric detail, as can be seen in the teeth, but struggles with planar texture detail, like the texture on the lips or eyebrows. MVP captures both geometric and texture detail well. Quantitative results comparing the methods are presented in Tab. 6, where our method outperforms NeRF on all metrics, even when trained on multiple frames. Finally, our approach improves over NeRF in runtime performance by three orders of magnitude.
We have demonstrated high-quality neural rendering results for dynamic scenes. Nevertheless, our approach is subject to a few limitations that can be addressed in follow-up work: 1) We currently require a coarse tracked mesh to initialize the positions, rotations, and scales of the volumetric primitives and give the optimization a good starting point. In the future, it would be interesting to devise an approach that can be trained from scratch with a fully self-organizing set of primitives; this would remove the currently required 3D supervision and thus simplify our approach. 2) We currently require a high-end computer and graphics card to achieve real-time performance. One reason for this is the often high overlap between adjacent volumetric primitives, which forces multiple trilinear interpolations per sample point and negatively impacts runtime. Regularization strategies that minimize overlap could therefore lead to a significant performance boost. 3) The number of primitives is currently predefined and has to be determined empirically for each scene type. Incorporating this selection into the optimization, so that the best setting is determined automatically, is an interesting direction for future work. Despite these limitations, we believe our approach is already a significant step forward for real-time neural rendering of dynamic scenes at high resolutions.
We have presented a novel 3D neural scene representation that handles dynamic scenes, is fast to render, is drivable, and represents 3D space at high resolution. At its core is a novel mixture of volumetric primitives regressed by an encoder-decoder network. We train the representation with a combination of 2D and 3D supervision. Our approach generalizes the volumetric and primitive-based paradigms under a unified representation and combines their advantages, leading to fast decoding and efficient rendering of dynamic scenes. As our comparisons demonstrate, we obtain higher-quality results than the current state of the art. We hope that our approach will be a stepping stone towards highly efficient neural rendering of dynamic scenes and that it will inspire follow-up work.
- Image2StyleGAN: how to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Neural point-based graphics. arXiv preprint arXiv:1906.08240.
- MatryODShka: real-time 6DoF video view synthesis using multi-sphere images. arXiv preprint arXiv:2008.06534.
- Immersive light field video with a layered mesh representation. ACM Trans. Graph. 39(4).
- Deep local shapes: learning local SDF priors for detailed 3D reconstruction. arXiv preprint arXiv:2003.10983.
- Learning to predict 3D objects with an interpolation-based differentiable renderer. In Advances in Neural Information Processing Systems, pp. 9609–9619.
- 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, pp. 628–644.
- Neural radiance flow for 4D view synthesis and video processing. arXiv preprint arXiv:2012.09790.
- Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. arXiv preprint arXiv:2012.03065.
- Portrait neural radiance fields from a single image. arXiv preprint arXiv:2012.05903.
- Unsupervised training for 3D morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8377–8386.
- Local deep implicit functions for 3D shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4857–4866.
- Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems 31 (NIPS 2018), Montréal, Canada, pp. 2804–2814.
- Local implicit grid representations for 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001–6010.
- Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pp. 365–376.
- Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference (HPG '13), New York, NY, USA, pp. 89–99.
- Analyzing and improving the image quality of StyleGAN. In Proc. CVPR.
- Neural 3D mesh renderer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
- TRANSPR: transparency ray-accumulating neural 3D scene point renderer.
- Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics 39(6).
- Pulsar: efficient sphere-based neural rendering.
- Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084.
- Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Neural sparse voxel fields.
- Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7708–7717.
- Deep appearance models for face rendering. ACM Trans. Graph. 37(4).
- Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph. 38(4).
- OpenDR: an approximate differentiable renderer. In European Conference on Computer Vision, pp. 154–169.
- Towards high-fidelity nonlinear 3D face morphable model. In IEEE Computer Vision and Pattern Recognition (CVPR), Long Beach, CA.
- NeRF in the wild: neural radiance fields for unconstrained photo collections. arXiv preprint arXiv:2008.02268.
- Occupancy networks: learning 3D reconstruction in function space. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Neural rerendering in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6878–6887.
- NeRF: representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934.
- Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG).
- Holoportation: virtual 3D teleportation in real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16), New York, NY, USA, pp. 741–754.
- DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174.
- Deformable neural radiance fields. arXiv preprint arXiv:2011.12948.
- Convolutional occupancy networks. In European Conference on Computer Vision (ECCV).
- Pix2Vex: image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149.
- PyTorch3D. https://github.com/facebookresearch/pytorch3d
- DeRF: decomposed radiance fields. arXiv preprint arXiv:2011.12490.
- U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. (Also available as arXiv:1505.04597 [cs.CV].)
- A network architecture for point cloud classification via automatic depth images generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4176–4184.
- PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172.
- PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2304–2314.
- Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
- GRAF: generative radiance fields for 3D-aware image synthesis.
- DeepVoxels: learning persistent 3D feature embeddings. In Proceedings of Computer Vision and Pattern Recognition (CVPR 2019).
- Scene representation networks: continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pp. 1121–1132.
- NeRV: neural reflectance and visibility fields for relighting and view synthesis. arXiv preprint arXiv:2012.03927.
- Pushing the boundaries of view extrapolation with multiplane images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, pp. 175–184.
- Learned initializations for optimizing coordinate-based neural representations. arXiv preprint arXiv:2012.02189.
- State of the art on neural rendering. Computer Graphics Forum (EG STAR 2020).
- Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG), 2019.
- Non-rigid neural radiance fields: reconstruction and novel view synthesis of a deforming scene from monocular video.
- GRF: learning a general radiance field for 3D scene representation and rendering. arXiv preprint arXiv:2010.04595.
- Single-view view synthesis with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2634.
- TensorFlow Graphics: computer graphics meets deep learning.
- VR facial animation via multiview image translation. ACM Trans. Graph. 38(4).
- SynSin: end-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477.
- Deep incremental learning for efficient high-fidelity face tracking. ACM Trans. Graph. 37(6).
- Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pp. 82–90.
- Space-time neural irradiance fields for free-viewpoint video. arXiv preprint arXiv:2011.12950.
- Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38(6), pp. 1–14.
- pixelNeRF: neural radiance fields from one or few images. arXiv preprint arXiv:2012.02190.
- Stereo magnification: learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817.