High-quality reconstruction and photo-realistic rendering of a dynamic scene from a set of input images are necessary for many applications such as AR/VR, 3D content production and entertainment. Traditional methods use classical mesh-based representations to reconstruct dynamic scenes, which, unfortunately, are prone to reconstruction errors and rendering artifacts when the scene contains thin structures, specular surfaces or topological changes [Li2012Temporal, Virtualizedreality1997, TheRelightables2019, Fvv2015, Starck2007Surface].
Recent advances in neural rendering, which learn scene representations in the form of neural radiance fields (NeRF), have shown impressive novel view synthesis of general static scenes given only multi-view images [nerf2020]. They were quickly extended to dynamic scenes: some methods (e.g., NeRF-T) treat time as an additional input dimension of the NeRF representation [VideoNeRF2021-nfds77, DCT-NeRF2021-nfds71], while other methods (e.g., D-NeRF) disentangle a dynamic scene into a canonical radiance field and a dynamic motion field [D-Nerf2021-nfds54, NR-Nerf2021-nfds67, NeRFlow2021-nfds12, Nerfies2021-nfds49, DeVRF2022-nfds32]. Either way, learning a 4D function is a cornerstone of these methods. Unfortunately, directly fitting such a function with an MLP often incurs high time and computation costs, i.e., dozens of hours on high-end GPUs.
In fact, the aforementioned limitation also exists in conventional NeRF-based methods for static scenes, and researchers have proposed to use discrete data structures like voxel grids [Plenoctrees2021-nfds81] or triplanes [eg3d2022] to accelerate NeRF training and rendering. However, these techniques are difficult to extend to the dynamic domain, as introducing an additional time dimension drastically increases the memory footprint, which hinders them from modeling high-quality appearance details.
In this work, we pursue a dynamic scene representation that also utilizes explicit feature grids to accelerate network training while avoiding huge memory consumption when introducing an additional time dimension. To this end, we bypass the construction of a high-resolution 4D tensor; instead, we propose to model a 4D field using a hierarchical tri-projection decomposition. Our decomposition method extends the tri-projection in EG3D [eg3d2022]. It first projects a full 4D field into three time-aware volumes, each of which is then further decomposed into three feature planes. In this way, we model the 4D field using only nine 2D feature planes, and we empirically find that, although highly compact, such a representation is powerful enough to represent dynamic scenes containing complex motions. Moreover, the use of an explicit data structure also allows us to design a coarse-to-fine strategy that further improves the performance of our method.
By utilizing and factorizing an explicit 4D tensor, our method enables both efficient reconstruction and compact representation of dynamic scenes. Besides, the decomposition scheme also introduces implicit constraints on the representation, since only low-rank tensors can be approximated by a small number of lower-dimensional components. Such a constraint can serve as an inherent regularization when the input observation is limited, e.g., under a sparse, fixed-camera setting or even with monocular input. In this paper, we first apply our method to sparse-view dynamic reconstruction by applying our Tensor4D decomposition to the time-conditioned radiance field in “NeRF-T”. In addition, our decomposition method can also be used for single-view dynamic reconstruction. This is achieved by decomposing both the 4D dynamic motion field and the canonical radiance field in “D-NeRF”. With proper regularization, our system enables efficient and high-quality reconstruction of dynamic objects under both camera settings.
In summary, our contributions include:
A novel hierarchical tri-projection decomposition method that factorizes a 4D tensor into nine compact 2D feature planes, which enables efficient and high-quality representation of dynamic scenes.
A successful application of the proposed Tensor4D decomposition to dynamic reconstruction and novel view synthesis, with two methods for monocular video input and multi-view video input, respectively.
A coarse-to-fine optimization strategy based on the Tensor4D decomposition that further improves the efficiency of both dynamic reconstruction and view synthesis tasks.
2 Related Work
Multiview Reconstruction and Rendering. There are various ways to capture and reconstruct 3D dynamic scenes, including methods based on silhouettes [Outdoor2012-19, freely2011-53], stereo [LRDS2018-31], segmentation [DMDE2016-44, Videopop-up2014-46], and photometric cues [Robustfusion2008-1, DSC2009-56]. With RGBD cameras, real-time solutions like DynamicFusion [DynamicFusion2015-td3r35] estimate the non-rigid deformations of a dynamic scene and integrate depth frames to reconstruct the geometry model in the canonical space. This approach was later extended to telepresence and holographic communication [Motion2fusion2017-td3r7, ProjectStarline2021, yu2021function4d]. However, these systems rely heavily on depth sensors to obtain accurate geometry measurements. In contrast, our method can reconstruct and render dynamic scenes using only sparse-view RGB cameras.
In the past few years, neural implicit representations have developed rapidly and have been applied to multi-view reconstruction and rendering of static scenes. Some methods represent the geometry as the zero level-set of a neural network, and use differentiable surface rendering to optimize the network weights [DVR2020CVPR, srns2019, idr2020]. Given dense observations of an object, these methods are able to accurately recover its surface. NeRF [nerf2020], on the other hand, uses volume rendering to optimize the scene representation. Its simplicity and impressive results have inspired many follow-up works, including in-the-wild reconstruction [martinbrualla2020nerfw, sun2022neuconw], lighting and material estimation [boss2021nerd, nerv2021, NeRFactor2021], generation [gu2022stylenerf, jang2021codenerf, chanmonteiro2020pi-GAN, GRAF2020], human rendering [shao2022doublefield, Zhao_2022_HumanNeRF, sun2021HOI-FVV, IButter2021] and so on. More recently, several methods unify surface and volumetric rendering, enabling accurate geometry reconstruction and high-quality novel view synthesis [volsdf2021, Unisurf2021ICCV, neus2021]. In contrast to these static scene representations, we aim to enable free-view synthesis of dynamic scenes from extremely sparse cameras.
NeRF for Dynamic Scenes. Modeling scenes in the 4D domain, with the time dimension included, is a direct way to extend NeRF to dynamic scenes. Typical approaches include VideoNeRF [VideoNeRF2021-nfds77], NeRFlow [NeRFlow2021-nfds12], DyNeRF [DyNeRF2022-nfds30], and DCT-NeRF [DCT-NeRF2021-nfds71]. Specifically, VideoNeRF [VideoNeRF2021-nfds77] directly learns a spatiotemporal irradiance field from a single video and uses depth estimation to address the shape-motion ambiguities in monocular inputs, while NeRFlow [NeRFlow2021-nfds12] and DCT-NeRF [DCT-NeRF2021-nfds71] use point trajectories to regularize the network optimization. To overcome the difficulty of modeling topology changes with a deformation field, Park et al. [HyperNeRF2021-nfds50] present HyperNeRF, which lifts NeRF to higher dimensions.
Dynamic scenes can also be rendered by deforming the radiance field in the canonical space. For example, Nerfies [Nerfies2021-nfds49] optimizes an additional continuous deformation field by warping each observed point into a canonical 5D NeRF. D-NeRF [D-Nerf2021-nfds54] and NR-NeRF [NR-Nerf2021-nfds67] follow a similar framework, but take only monocular videos as training data. In addition, DeVRF [DeVRF2022-nfds32] uses a voxel-based representation instead of MLPs to model the 3D canonical space and the 4D deformation field. Using parametric body templates as a semantic prior, methods like Neural Body [peng2021neuralbody] and HumanNeRF [weng_humannerf_2022_cvpr] enable photo-realistic novel view synthesis of complex human performances. The discrepancy between practical capture processes and existing experimental protocols for monocular videos has been shown in [MDVS2022-nfds15].
NeRF Acceleration. Numerous works have emerged to speed up static NeRF using explicit data structures, including feature maps, voxels and tensors. For instance, DVGO [DVGO-na63] achieves fast convergence through an explicit representation consisting of a density voxel grid and a feature voxel grid. With a sparse voxel octree structure, NSVF [NSVF2020-na33] accelerates novel view synthesis by discarding empty voxels in a coarse-to-fine manner. Similarly, Plenoxels [Plenoxels2022-na57] and PlenOctree [Plenoctrees2021-nfds81] model a scene through a hierarchical 3D grid with spherical harmonics, which can be optimized two orders of magnitude faster than NeRF. In DIVeR [DIVeR2022-na75], ray marching only finds a fixed number of hits on the voxel grid to accelerate volumetric rendering. Moreover, hash encoding [InstantNGP2022-na44] and tensor decomposition [TensoRF2022-na9] are also used as compact representations for NeRF acceleration.
For dynamic scene modeling, DeVRF [DeVRF2022-nfds32] enables fast non-rigid neural rendering with both a 3D volumetric field and a 4D voxel field. In addition, V4D [V4D2022-na14] introduces an effective conditional positional encoding for 4D data to realize fast novel view synthesis. TiNeuVox [TiNeuVox2022-na13] represents scenes with optimizable explicit data structures and accelerates radiance field modeling, while Wang et al. extend PlenOctrees [Plenoctrees2021-nfds81] to free-view video rendering [Fourierplenoctrees2022-nfds72]. However, the low-resolution volumetric design of these works limits their capacity to render high-quality images.
The goal of this paper is to reconstruct and render a dynamic scene given the videos captured with a sparse-view camera rig or a monocular camera, and we aim to obtain the 4D scene representation in an efficient manner. Building on prior work for spatio-temporal representations and tri-projection decomposition (Sec. 3.1), we propose a hierarchical tri-projection decomposition method (Sec. 3.2) and a coarse-to-fine strategy (Sec. 3.3), which allow us to learn a 4D field at a modest cost of training time and GPU memory.
Spatio-temporal 4D NeRF representation. To represent dynamic objects using neural radiance fields, a naive way is to condition the original neural radiance field on the timestamp [VideoNeRF2021-nfds77, DCT-NeRF2021-nfds71], which we term NeRF-T. Mathematically, NeRF-T can be formulated as:
$$(\mathbf{f}, \sigma) = F(x, y, z, t), \qquad \mathbf{c} = C(\mathbf{f}, \mathbf{d}), \tag{1}$$
where $F$ is a dynamic implicit field that produces a high-dimensional feature $\mathbf{f}$ and a density value $\sigma$ for a point at position $(x, y, z)$ and time instance $t$, and $C$ is a function that takes into account the viewing direction $\mathbf{d}$ to predict the final RGB color $\mathbf{c}$.
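To achieve better disentanglement of shape and motion, some methods like D-NeRF [D-Nerf2021-nfds54] propose a deformable neural radiance field, which combines a canonical 3D representation with 4D flow fields:
$$(\Delta x, \Delta y, \Delta z) = F(x, y, z, t), \qquad (\mathbf{c}, \sigma) = \mathrm{NeRF}(x + \Delta x,\, y + \Delta y,\, z + \Delta z,\, \mathbf{d}), \tag{2}$$
where $\mathrm{NeRF}$ is the radiance field in canonical configuration and $F$ is a scene flow field representing the mapping between the scene at time instant $t$ and the canonical space.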
From the above formulation one can easily observe that both NeRF-T and D-NeRF rely on modeling a 4D field $F(x, y, z, t)$, i.e., the dynamic implicit field in NeRF-T and the flow field in D-NeRF. Existing methods mainly adopt MLPs to fit these 4D fields. Such an implicit neural representation does not have an explicit structure and requires extensive computation time for both training and rendering.
Tri-projection Decomposition. The tri-projection decomposition is widely used in recent work, including EG3D [eg3d2022] and TensoRF [TensoRF2022-na9], to accelerate training and rendering in MLP-based NeRF frameworks. Such a decomposition factorizes an $n$-dimensional tensor into three lower-dimensional ($(n-1)$-D) tensors by projecting along the $x$-, $y$- and $z$-axes, respectively. For example, the triplane-based decomposition proposed by EG3D projects a 3D tensor into three 2D feature planes. Compared to voxel-based representations, the triplane representation effectively reduces the memory footprint and improves the performance of 3D generation and reconstruction.
3.2 Hierarchical Tri-projection Decomposition
Our goal is to design a dynamic scene representation with explicit feature grids to accelerate network training and volumetric rendering. However, directly constructing a 4D tensor costs a huge amount of memory, which is unacceptable for high-resolution rendering. Therefore, we propose a hierarchical tri-projection decomposition that factorizes the 4D tensor into several compact features, which reduces memory consumption by a large margin while preserving the capability of fitting 4D fields.
Specifically, for a neural 4D field $F(x, y, z, t)$, we first decompose the spatial part of the 4D spatio-temporal tensor into three time-aware volumetric tensors via tri-projection decomposition:
where denotes the projection operator. To further lower space complexity and enable high-resolution representation, we decompose each feature volume into three feature planes as:
where denotes the volume-to-plane tri-projection. In this way, we compactly represent a 4D field using nine planes. Given any spatio-temporal coordinate $(x, y, z, t)$, we can efficiently query its value in the 4D field by projecting it onto the planes and retrieving the corresponding values via bilinear interpolation. Fig. 2 illustrates our decomposition method.
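The query described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the plane layout (three time-aware volumes over (x, y, t), (x, z, t) and (y, z, t), each split into its three axis pairs), the names `bilerp`, `PLANE_AXES`, `make_planes` and `query_4d`, the resolution, the channel count, and the normalization of coordinates to [0, 1] are all assumptions made for the example.

```python
import numpy as np

def bilerp(plane, u, v):
    """Bilinearly interpolate plane[R_u, R_v, C] at continuous (u, v) in [0, 1]."""
    ru, rv, _ = plane.shape
    fu, fv = u * (ru - 1), v * (rv - 1)            # continuous grid coordinates
    u0, v0 = int(np.floor(fu)), int(np.floor(fv))
    u1, v1 = min(u0 + 1, ru - 1), min(v0 + 1, rv - 1)
    a, b = fu - u0, fv - v0                        # fractional offsets
    return ((1 - a) * (1 - b) * plane[u0, v0] + a * (1 - b) * plane[u1, v0]
            + (1 - a) * b * plane[u0, v1] + a * b * plane[u1, v1])

# Assumed layout: each time-aware volume contributes its three axis pairs.
PLANE_AXES = [
    ("x", "y"), ("x", "t"), ("y", "t"),   # from the (x, y, t) volume
    ("x", "z"), ("x", "t"), ("z", "t"),   # from the (x, z, t) volume
    ("y", "z"), ("y", "t"), ("z", "t"),   # from the (y, z, t) volume
]

def make_planes(res, channels, seed=0):
    """Create nine random feature planes (stand-ins for learned parameters)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((res, res, channels)) for _ in PLANE_AXES]

def query_4d(planes, x, y, z, t):
    """Project (x, y, z, t) onto every plane and concatenate the features."""
    coord = {"x": x, "y": y, "z": z, "t": t}
    feats = [bilerp(p, coord[a], coord[b]) for p, (a, b) in zip(planes, PLANE_AXES)]
    return np.concatenate(feats)

planes = make_planes(res=16, channels=4)
feature = query_4d(planes, 0.25, 0.5, 0.75, 0.1)
print(feature.shape)  # (36,) = 9 planes x 4 channels
```

Each query touches only four texels per plane, so its cost does not depend on the plane resolution.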
Our hierarchical tri-projection decomposition reduces the space complexity from $O(N^4)$ to $O(N^2)$, with $N$ being the spatial resolution of the grid, significantly lowering the memory footprint without sacrificing representation power. Note that our decomposition supports various types of 4D fields, including the time-conditioned radiance field in NeRF-T and the 4D flow field in D-NeRF. In this paper, we present a dynamic radiance field decomposition method (Sec. 4.1) for multi-view dynamic reconstruction as well as a 4D flow decomposition method (Sec. 4.2) for the single-view setting.
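To make the memory saving concrete, the following sketch counts stored values for a dense 4D grid and for nine 2D planes. The resolution of 256 on every axis (time included), the 16 feature channels and the float32 storage are hypothetical values chosen for illustration, not settings reported in this paper.

```python
def dense_4d_entries(n, channels):
    """Number of stored values in a dense n x n x n x n feature grid."""
    return channels * n ** 4

def nine_plane_entries(n, channels):
    """Number of stored values in nine n x n feature planes."""
    return 9 * channels * n ** 2

n, c = 256, 16                      # hypothetical resolution and channel count
dense = dense_4d_entries(n, c)
compact = nine_plane_entries(n, c)

bytes_per_value = 4                 # float32
print(dense * bytes_per_value / 2 ** 30, "GiB for the dense grid")
print(compact * bytes_per_value / 2 ** 20, "MiB for the nine planes")
print(dense / compact, "x fewer values")
```

Under these assumptions the dense grid needs 256 GiB while the nine planes need 36 MiB, which is why the planes can afford a much higher resolution.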
3.3 Coarse-to-fine Strategy
To further improve the efficiency of our 4D decomposition, we propose an optional coarse-to-fine strategy that factorizes the 4D fields at different scales in different training phases. At the coarse level, we adopt low-resolution feature planes to decompose the 4D fields, which improves the robustness of the training process and achieves fast convergence. After coarse-level training, we additionally use high-resolution feature planes for the 4D decomposition to represent dynamic details and achieve high-quality rendering. Specifically, we factorize the 4D fields into different scales:
where the two terms are the coarse-level and fine-level components of the 4D field, respectively. At the coarse level, the decomposed feature planes are low-resolution and represent coarse 3D structures and 4D dynamic changes. At the fine level, we adopt high-resolution feature planes to decompose the 4D fields, which focuses more on recovering dynamic details.
4 Tensor4D for Dynamic Reconstruction
As shown in Fig. 3, we apply our 4D tensor decomposition to the task of dynamic reconstruction under two input settings:
1) Dynamic reconstruction under sparse and fixed cameras. For this setting, we factorize NeRF-T instead of D-NeRF using our proposed 4D decomposition, which we found more efficient and flexible for representing topologically varying objects (Sec. 4.1).
2) Dynamic reconstruction using a monocular camera. For this setting, we separately decompose the 4D flow fields and the 3D canonical representation, since this ensures appearance consistency across different frames and achieves more robust performance (Sec. 4.2).
4.1 Multi-view Reconstruction
In this section, we present our dynamic reconstruction system under the sparse multi-view setting. The goals of our system are 1) efficient and high-quality dynamic reconstruction with low memory and time cost, and 2) robust reconstruction under a sparse and fixed camera setting.
To this end, we adopt our 4D decomposition (Eq. 5) to factorize the NeRF-T representation with the coarse-to-fine strategy. Specifically, our 4D decomposition yields nine low-resolution (LR) feature planes and nine high-resolution (HR) feature planes that represent the 4D NeRF-T field. Then, for volume rendering, when sampling a point at $(x, y, z, t)$ on a ray with direction $\mathbf{d}$, we first query the feature of the point in both the LR and the HR feature planes. Take the LR features for example:
where we concatenate all features queried in the nine LR planes as the final LR feature of the sampling point. Then we concatenate the LR and HR features with the positional encoding of the sampling point and feed them into the geometry MLP to obtain the density $\sigma$ and the high-dimensional feature $\mathbf{f}$:
where $\gamma(\cdot)$ is the positional encoding function. Next, we concatenate the high-dimensional feature $\mathbf{f}$ with the positional encoding of the viewing direction $\mathbf{d}$ and feed it into the color MLP:
In this way, we can render images through volume rendering and adopt a color loss to train our decomposed feature planes together with the geometry and color MLPs:
With our coarse-to-fine design, our method can recover high-fidelity dynamic details effectively and efficiently.
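The rendering and color-loss step can be illustrated with the classic NeRF quadrature. The sketch below is an assumption-laden stand-in: it composites densities directly (whereas the regularized system follows NeuS and renders an SDF), it uses a squared-error color loss whose exact form is not given in the text above, and the names `composite` and `color_loss` are hypothetical.

```python
import numpy as np

def composite(sigma, rgb, deltas):
    """Alpha-composite per-sample densities and colors along one ray.

    sigma:  (S,)   densities at the S samples
    rgb:    (S, 3) colors at the samples
    deltas: (S,)   distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * deltas)                  # opacity of each segment
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ rgb, weights

def color_loss(pred, target):
    """Mean squared error between rendered and ground-truth colors (assumed form)."""
    return float(np.mean((pred - target) ** 2))

# Toy ray: empty space, then an opaque green sample that hides the blue one.
sigma = np.array([0.0, 1e9, 5.0])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.full(3, 0.1)
color, weights = composite(sigma, rgb, deltas)
print(color, color_loss(color, np.array([0.0, 1.0, 0.0])))
```

Because every operation is differentiable with respect to the densities and colors, the same computation in an autodiff framework lets the loss gradient flow back into the feature planes and the MLPs.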
To achieve robust dynamic reconstruction under the sparse multi-view setting, we further regularize all decomposed feature planes:
where the regularizer is the TV loss of each feature plane, which keeps the planes sparse. To regularize the geometry, we also introduce a surface constraint into volume rendering. Specifically, we adopt the SDF as the base geometry representation and follow NeuS [neus2021] to render the SDF field. Then we add a surface constraint loss to enforce a smooth surface:
The final training loss combines the regularization losses and the color loss:
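As an illustration of the feature-plane regularizer, the sketch below computes a total variation (TV) penalty on a single plane. The squared-difference form, the averaging, and the name `tv_loss` are assumptions; the text only states that a TV loss is applied to each plane.

```python
import numpy as np

def tv_loss(plane):
    """Total variation of a feature plane of shape (R_u, R_v, C).

    Penalizes squared differences between neighboring texels along both
    plane axes (assumed form; an absolute-difference variant is also common).
    """
    du = plane[1:, :, :] - plane[:-1, :, :]   # differences along the first axis
    dv = plane[:, 1:, :] - plane[:, :-1, :]   # differences along the second axis
    return float(np.mean(du ** 2) + np.mean(dv ** 2))

flat = np.ones((8, 8, 4))                                    # constant plane
ramp = np.tile(np.arange(8.0)[:, None, None], (1, 8, 4))     # unit slope along u
print(tv_loss(flat), tv_loss(ramp))
```

A constant plane incurs no penalty, while a plane that changes by one unit per texel along one axis incurs a penalty of one, so the term discourages high-frequency noise that sparse views cannot constrain.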
4.2 Monocular Reconstruction
Different from the sparse-view setting, we apply our 4D decomposition to D-NeRF in monocular capture cases. This is because the monocular setting is much more ill-posed than the sparse-view setting, and the explicit disentanglement of appearance and motion enforces consistency across different frames. Since D-NeRF represents dynamic objects with 4D flow fields and a 3D canonical representation, we factorize these two fields separately. First, for the 4D flow fields, we only adopt the coarse-level decomposition and factorize them into low-resolution feature planes:
Our coarse decomposition focuses more on coarse and rigid motion, which improves the robustness of flow estimation and achieves better disentanglement of shape and motion. Then, for the 3D canonical representation, we adopt both the coarse-level and the fine-level decomposition:
Therefore, we obtain nine flow feature planes for the 4D flow fields and six canonical feature planes for the 3D canonical representation. Volume rendering in the monocular case is similar to the multi-view case. For a sampling point, we first obtain its flow feature using Eq. 6 with the nine flow planes. Then we adopt the flow MLP to predict the movement of the point:
Then we obtain the canonical feature of the point by querying the canonical feature planes. Take the LR feature for example:
Next, we feed the canonical feature and the positional encoding of the point into the geometry MLP to obtain the high-dimensional feature $\mathbf{f}$ and the density $\sigma$:
Finally, we adopt Eq. 8 to predict the color for volume rendering and Eq. 9 as the color loss for training. We also add the feature regularization loss in Eq. 10 and the surface constraint loss in Eq. 11. The total training loss is:
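The monocular query path (flow planes, flow MLP, warp, canonical planes) can be sketched as follows. Everything specific here is an assumption for illustration: a single random linear layer `W_flow` stands in for the flow MLP, the warp is taken to be additive, coordinates are normalized to [0, 1] and clipped after warping, and the resolutions (16 and 64) and channel count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilerp(plane, u, v):
    """Bilinearly interpolate plane[R_u, R_v, C] at continuous (u, v) in [0, 1]."""
    ru, rv, _ = plane.shape
    fu, fv = u * (ru - 1), v * (rv - 1)
    u0, v0 = int(np.floor(fu)), int(np.floor(fv))
    u1, v1 = min(u0 + 1, ru - 1), min(v0 + 1, rv - 1)
    a, b = fu - u0, fv - v0
    return ((1 - a) * (1 - b) * plane[u0, v0] + a * (1 - b) * plane[u1, v0]
            + (1 - a) * b * plane[u0, v1] + a * b * plane[u1, v1])

def query(planes, axes, coord):
    """Concatenate the features interpolated from each plane."""
    return np.concatenate([bilerp(p, coord[a], coord[b]) for p, (a, b) in zip(planes, axes)])

# Nine LR flow planes (assumed axis pairs) and 3 LR + 3 HR canonical planes.
FLOW_AXES = [("x", "y"), ("x", "t"), ("y", "t"), ("x", "z"), ("x", "t"),
             ("z", "t"), ("y", "z"), ("y", "t"), ("z", "t")]
CANON_AXES = [("x", "y"), ("x", "z"), ("y", "z")]
C = 4
flow_planes = [rng.standard_normal((16, 16, C)) for _ in FLOW_AXES]
canon_lr = [rng.standard_normal((16, 16, C)) for _ in CANON_AXES]
canon_hr = [rng.standard_normal((64, 64, C)) for _ in CANON_AXES]
W_flow = 0.01 * rng.standard_normal((9 * C, 3))   # stand-in for the flow MLP

def warp_to_canonical(x, y, z, t):
    """Predict a displacement from the flow feature and move the point."""
    feat = query(flow_planes, FLOW_AXES, {"x": x, "y": y, "z": z, "t": t})
    delta = feat @ W_flow
    return np.clip(np.array([x, y, z]) + delta, 0.0, 1.0)

def canonical_feature(xc, yc, zc):
    """Query LR and HR canonical planes at the warped, time-independent point."""
    coord = {"x": xc, "y": yc, "z": zc}
    return np.concatenate([query(canon_lr, CANON_AXES, coord),
                           query(canon_hr, CANON_AXES, coord)])

p_canon = warp_to_canonical(0.3, 0.6, 0.2, 0.5)
feat = canonical_feature(*p_canon)
print(p_canon.shape, feat.shape)  # (3,) (24,)
```

Note that time enters only through the flow planes: the canonical planes are indexed by spatial coordinates alone, which is what ties appearance together across frames.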
Dataset. To evaluate the performance of our methods on multi-view inputs, we build a sparse-view capture system with 6 forward-facing RGB cameras mounted on the borders of a 32” Looking Glass 3D holographic display [LookingGlass]. All cameras are synchronized and calibrated. Using this system, we capture multiple sequences of various challenging human motions, including dancing, thumbing up, waving hands, wearing hats and manipulating bags. We use 4 of the cameras for reconstruction and rendering in all our experiments, while leaving the other two for quantitative evaluation. We also use three 360° multi-view full-body sequences captured with evenly spaced cameras on a camera ring for qualitative evaluation. For monocular evaluation, we use the synthetic dataset provided by D-NeRF [D-Nerf2021-nfds54] and select 3 scenes (“lego”, “standup” and “jumpingjacks”) from this dataset, with the number of training frames ranging from 50 to 200. More details about data collection and preprocessing can be found in the Supp. Mat.
Baselines. We mainly compare our method against the following state-of-the-art baselines that are most related to our work: D-NeRF [D-Nerf2021-nfds54], NeRF-T, TiNeuVox [TiNeuVox2022-na13] and NeuS-T. Here, NeRF-T is our extension of vanilla NeRF [nerf2020] with an additional time input, and NeuS-T is extended from NeuS [neus2021] in the same way. Among these baselines, D-NeRF and TiNeuVox represent dynamic scenes by deforming a canonical one, while NeRF-T and NeuS-T directly learn a time-conditioned 4D radiance field. TiNeuVox uses explicit voxel grids to accelerate network training, while the others use only MLPs to model the scene.
5.1 Results and Comparison
Qualitative Results. We train our model on each individual sequence, and present example results for novel view synthesis in Fig. 4 and the Supp. Mat. The results cover various body motions, clothing styles and accessories. As shown in Fig. 4, our method can render high-quality images of dynamic scenes and faithfully recover appearance details like thin finger motions, semi-transparent silk, hand-object interaction, facial expressions and cloth wrinkles. See our Supp. Video for better visualization.
Comparisons on monocular dynamic dataset. We first compare our method with the baselines on the monocular synthetic dataset. Qualitative results are presented in Fig. 5. Compared to other methods, ours recovers more appearance details and generates fewer artifacts. The numeric results in Tab. 1 also show that our method outperforms state-of-the-art methods in terms of rendering quality and accuracy.
Comparisons on sparse view dataset. We then evaluate the performance of different methods for novel view synthesis given four camera views from our collected six-camera real-world dataset. The remaining two views are used for quantitative evaluation. The results reported in Fig. 6 and Tab. 2 again show that our method synthesizes more accurate and higher-quality appearance details.
|Method||Sequence1-thz (PSNR)||Lego (PSNR)|
Comparisons of training efficiency. We compare the training efficiency of our method with existing methods including NeRF-T [D-Nerf2021-nfds54], D-NeRF [D-Nerf2021-nfds54], TiNeuVox [TiNeuVox2022-na13] and NeuS-T. In this experiment, we compare how PSNR changes with the number of training iterations on both a monocular synthetic dataset (Lego) and a multi-view dataset (Sequence1-thz). For a fair comparison, we fix the batch size of the sampled rays and keep the same learning rate. As shown in Tab. 3, our 4D decomposition is the most efficient method and achieves the best trade-off between complexity and quality. The rendering quality can be much higher given sufficient training iterations.
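For reference, the PSNR metric used in these comparisons can be computed as in the sketch below. It uses the standard definition and assumes images stored as floating-point arrays in [0, 1]; the function name `psnr` and the `max_val` parameter are illustrative choices, not part of any evaluation code described here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)        # uniform error of 0.1 gives MSE = 0.01
print(psnr(pred, target))             # about 20 dB
```

Since PSNR is logarithmic in the mean squared error, a gain of 10 dB corresponds to a tenfold reduction in MSE.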
5.2 Ablation study
Regularization. We quantitatively ablate the regularization terms in our method. We implement two strong baselines “Ours-NeRF-T” and “Ours-D-NeRF”, in which we directly apply our Tensor4D decomposition for NeRF-T and D-NeRF without the regularization terms. The results are reported in Tab. 1 and Tab. 2. Benefiting from our 4D decomposition, they achieve better performance than the original NeRF-T and D-NeRF. However, their rendering quality is still worse than our full method with regularization.
Time of video. We qualitatively ablate the ability of our method by training with different numbers of temporal frames. We train our method on the “thumbsup” multi-view sequence with , , , and frames. We keep the same training iterations (k), and the novel view rendering results are shown in Fig. 7.
Limitations. Since our method needs to decompose the 4D fields into several 2D feature planes, a pre-set bounding box of the scene is necessary. Therefore, it is difficult for our method to reconstruct backgrounds or objects that lie outside the bounding box. Another limitation is our strong regularization terms. Though these terms benefit the robustness of our reconstruction under sparse views, they also limit our ability to handle challenging cases such as fluid and fog.
Conclusion. We presented Tensor4D, a new method for learning high-quality neural representations of dynamic scenes from sparse-view videos or even a monocular video. To capture spatio-temporal information in a compact and memory-efficient manner, we propose a novel hierarchical tri-projection decomposition method that models a 4D tensor with nine 2D feature planes. With proper design of the training losses and regularization, our method provides an efficient yet effective solution for modeling the radiance fields of dynamic scenes. We believe our work can inspire future research towards low-cost, portable and immersive telepresence systems.