Representing scenes as Neural Radiance Fields (NeRF) has brought a series of breakthroughs in 3D reconstruction and analysis [xie2021neural, mildenhall2020nerf]. High-fidelity real-time rendering of real-world scenes can now be obtained after only a few seconds of training [yu_and_fridovichkeil2021plenoxels, instantngp]. The rendering system requires only a few real-world RGB images [mildenhall2021rawnerf], yet it can model scenes as small as a cell [liu2022recovery], as large as a city [tancik2022blocknerf], or even a black hole [levis2022gravitationally].
Despite NeRF’s success in static scenes, extending it to handle dynamic scenes remains challenging. Introducing an extra time dimension into NeRF’s 5D representation (3D location and 2D viewing direction) is non-trivial for two reasons. First, the supervisory signal for a spatiotemporal point is sparser than that for a static point. Multi-view images of static scenes are easy to acquire because we can move the camera around, but each extra view of a dynamic scene requires an extra recording camera, leading to sparse input views. Second, the appearance and geometry of the scene vary at different frequencies along the spatial and temporal axes. The content usually changes drastically when moving from one location to another, but the background is unlikely to change completely from one timestamp to the next. Inappropriate frequency modeling of the time dimension results in poor temporal interpolation performance.
A lot of progress has been made in addressing these two challenges. Existing solutions include adopting motion models for matching points (e.g., [pumarola2020dnerf, nerfies, Tretschk_2021_ICCV, park2021hypernerf, liu2022devrf, fang2022fast]) and leveraging data-driven priors like depth and optical flow (e.g., [wang2021neural, nsff, xian2020space, du2021nerflow]). Different from existing works, we are motivated by the observation that different spatial areas of a dynamic scene have different temporal characteristics. We assume that a dynamic scene contains three kinds of temporal patterns (Figs. 2 and 3): static, deforming, and new areas. We thus propose to decompose the dynamic scene into these categories, which is achieved by a decomposition field that predicts the point-wise probabilities of being static, deforming, and new. The decomposition field is self-supervised and regularized by a manually assigned global parsimony regularization (e.g., suppressing the global probability of being new).
The proposed decomposition addresses both of the aforementioned challenges. First, a different temporal regularization is introduced for each decomposed area, alleviating the ambiguity of reconstruction from sparse observations. For instance, the static area decomposition reduces part of the dynamic modeling to a static scene modeling problem, and the deforming area decomposition enforces foreground objects to remain consistent throughout the dynamic scene. Second, the scene is split into areas according to their temporal characteristics, resulting in a consistent temporal frequency within each area.
In response to the discrepancy between spatial and temporal frequencies, we further decouple the spatial and temporal dimensions based on recently developed hybrid representations [dvgo, yu_and_fridovichkeil2021plenoxels, instantngp, tensorf]. Hybrid representations maintain a grid of feature volumes for fast rendering. Instead of redesigning the grid structure, we treat the channels of the feature volumes as temporally dependent. To support a streamable dynamic scene representation, we propose a sliding window scheme over the feature channels (Fig. 4). The sliding window not only supports streaming of the feature volumes, but also implicitly encourages the representation to be compact by sharing the overlapping channels between adjacent frames.
For validation, we conduct experiments on datasets captured under both single-camera and multi-camera settings. Our extensive ablation studies validate our proposed method in three aspects: 1) the necessity of modeling all of the three areas on single-camera datasets, 2) the necessity of decomposing static areas on multi-camera datasets, and 3) the necessity of deforming decomposition on inputs with large frame-wise motion even for multi-camera datasets. To sum up, our contributions are as follows:
We propose to decompose a dynamic scene according to the temporal characteristics of its areas. The decomposition is achieved by a decomposition field that takes each point as input and outputs the probabilities of it belonging to three categories: static, deforming, and new.
We design a self-supervised scheme for optimizing the decomposition field, regularizing it with a global parsimony loss.
We design a sliding window scheme on recently developed hybrid representations for efficiently modeling spatiotemporal fields.
We present extensive experiments and real-time rendering demos on both single-camera and multi-camera datasets. Our ablation studies validate the implied regularizations behind the proposed three temporal patterns.
2 Related Work
2.1 Neural Fields
Neural fields are neural networks that take in coordinates and output the properties of that point [xie2021neural]. 3D representations based on neural fields have made tremendous advancements in recent years. The pioneering work Occupancy Networks [mescheder2019occupancy] represents the geometry of 3D objects with a continuous decision boundary modeled by a neural network, and was later extended to model dynamic objects [niemeyer2019occupancy]. Concurrently, DeepSDF [park2019deepsdf] represents geometry with a signed distance function modeled by a network. Chibane et al. [chibane2020neural] predict the unsigned distance field of 3D shapes from point clouds. NeRF [mildenhall2020nerf], a milestone work, proposes to represent the scene with a 5D function modeled by an MLP and significantly improves the performance of novel view synthesis (i.e., image-based rendering). The scene representation in NeRF inspired a number of works on 3D modeling, such as human face and body capture [hong2021headnerf, Noguchi_2021_ICCV, peng2021neural, Peng_2021_ICCV, su2021anerf, liu2021neural], relighting [boss2021nerd, srinivasan2021nerv, boss2021neuralpil], and 3D content generation [Trevithick_2021_ICCV, schwarz2020graf, chan2021pi, gu2021stylenerf, kosiorek2021nerf, chan2022efficient, dreamfields].
Scenes are implicitly represented by MLPs in vanilla NeRF, and forwarding through the MLPs is time-consuming. Insightful methods [nsvf, wizadwongsa2021nex, PlenOctrees, Reiser_2021_ICCV, Hedman_2021_ICCV, Garbin_2021_ICCV, wu2021diver] adopt explicit data structures to efficiently query the fields. Further, hybrid representations combine explicit and implicit representations to improve the differentiability of the framework. DVGO [dvgo] uses two feature voxel grids to represent occupancy and appearance; the feature vectors queried from the voxels are decoded by small MLPs. Plenoxels [yu_and_fridovichkeil2021plenoxels] prunes empty space and stores spherical harmonic coefficients. InstantNGP [instantngp] proposes a hash encoding of the stored feature grids and resolves hash collisions via multi-scale encoding and small MLP decoding. TensoRF [tensorf] leverages tensor decomposition to reduce the model size of the voxels. Hybrid representations have further been leveraged for efficient dynamic scene modeling. Recent concurrent works [fang2022fast, liu2022devrf, gan2022v4d] propose to model canonical spaces with voxels and motion with deformation fields. Li et al. [streamrf] propose to stream the difference of voxels in dynamic scenes. Different from the above methods, our method decomposes the scene into different areas and models them separately.
Neural fields have been adopted for decomposing scenes. Yang et al. [yang2021learning] and Zhang et al. [ZhangLYZZWZXY21] decompose the scene by objects for editing. DeRF [rebain2021derf] spatially decomposes the scene and uses small networks for each area for efficiency. Kobayashi et al. [kobayashi2022distilledfeaturefields] and Tschernezki et al. [tschernezki22neural] semantically decompose the scene with pre-trained models. Ost et al. [ost2021neural] decompose scenes into semantic scene graphs. Objects are decomposed by motion in NeuralDiff [tschernezki2021neuraldiff] and STaR [yuan2021star]. More recently, a decomposition between static and dynamic areas is studied in D$^2$NeRF [wu2022d] and Sharma et al. [sharma2022seeing]. Our decomposition is different from existing works since we decompose areas according to their temporal changing patterns.
2.2 4D Modeling of Dynamic Scenes
Free-viewpoint rendering from video captures has been widely studied over the decades. The idea of viewing an event from multiple perspectives dates back to Multiple Perspective Interactive Video [jain1995multiple], in which 3D environments are generated with dynamic motion models. Virtualized Reality [kanade1997virtualized] designed a 3D dome and recovered 3D structure with multi-camera stereo methods. Inspired by image-based rendering [LevoyH96, GortlerGSC96], several video-based rendering methods were developed [schirmacher2001fly, yang2002real, carranza2003free], which require dense capturing of the scene. Zitnick et al. [zitnick2004high] propose a layered depth image representation for high-quality video-based rendering of dynamic scenes. More recently, a milestone system by Collet et al. [collet2015high] utilizes tracked textured meshes for free-viewpoint video streaming. With RGB, infrared (IR), and silhouette information as input, their system outputs accurate geometry, detailed texture, and efficient streams. Another impressive system by Broxton et al. [broxton2020immersive] proposes Layered Meshes based on multi-sphere images. Captured with a low-cost hemispherical array of 46 synchronized cameras, Layered Meshes are shown to be efficient and to handle non-Lambertian surfaces with view-dependent or semi-transparent effects well. Bansal et al. [bansal20204d] use convolutional neural networks to compose the static and dynamic parts of an event and then adopt a U-Net to render images from intermediate results composited from depth-based re-projected images. Neural Volumes (NV) [lombardi2019neural] leverages differentiable volume rendering to optimize a 3D volume representation, which is produced from 2D input RGB images by an encoder-decoder network. NV is further strengthened in [LombardiSSZSS21] with volumetric primitives.
X-Fields [bemana2020x] considers input images captured under different views, times, or illumination conditions in structured captures. DyNeRF [Li_2022_CVPR] assigns a set of compact latent codes to the observation frames and then uses time-conditioned neural radiance fields to represent dynamic scenes. Fourier PlenOctrees [wang2022fourier] extends the real-time rendering framework PlenOctrees [PlenOctrees] to dynamic scenes. DeVRF [liu2022devrf] proposes a voxel-based representation that first reconstructs a canonical object from multi-view dense supervision and then reconstructs the deformation from few-view observations.
Another thread of research aims at modeling dynamic scenes without multiple synchronized cameras: multi-view information is collected by moving a single camera around the dynamic scene. This single-camera setting is much more challenging than the multi-camera setting above. Data-driven solutions such as video depth estimation [luo2020consistent, kopf2021robust] have been developed. Building on such priors and motivated by the success of NeRF, motion is modeled by neural fields. Some methods first define a canonical space modeled by a NeRF and then align subsequent frames to the canonical space; representative methods include D-NeRF [pumarola2020dnerf], Nerfies [nerfies], and NR-NeRF [Tretschk_2021_ICCV]. The trajectory of points is modeled by a neural field in DCT-NeRF [wang2021neural]. Directly modeling the 4D field by introducing an extra time dimension into the original radiance field is adopted in NSFF [nsff], VideoNeRF [xian2020space], and NeRFlow [du2021nerflow]. HyperNeRF [park2021hypernerf] points out the issue of motion inconsistency in topologically varying scenes and proposes a hyperspace representation, inspired by level-set methods, to optimize motion in a smoother solution space. Gao et al. [gao2022dynamic] demonstrate the discrepancy between casual monocular videos and the existing monocular testing videos above.
The above methods are able to generate impressive results under various settings. However, rendering with both single- and multi-camera inputs can be further studied, such as the effectiveness of motion modeling with multi-camera inputs. Moreover, a tradeoff still exists among model size, training and rendering speed, and rendering quality. Our method studies both single- and multi-camera inputs and focuses on efficient and high-quality free-viewpoint video rendering.
3 Preliminaries
Our method leverages the rendering scheme proposed by NeRF [mildenhall2020nerf] and hybrid representations for static scenes [nsvf, takikawa2021neural, dvgo, wu2021diver, yu_and_fridovichkeil2021plenoxels, zhang2022nerfusion, instantngp, takikawa2022variable, tensorf]. We first briefly review the rendering framework in NeRF and then introduce the recently developed hybrid representations for efficient neural fields.
For each point $\mathbf{x}$ in NeRF, we denote its volume density as $\sigma(\mathbf{x})$ and its color as $\mathbf{c}(\mathbf{x}, \mathbf{d})$, where $\mathbf{d}$ is the viewing direction. The pixel color of a camera ray is computed by accumulating a set of samples on the ray with volume rendering. Let the optical origin and direction of the camera ray be $\mathbf{o}$ and $\mathbf{d}$; then points are sampled along $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ and the expected color is computed by
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,\mathrm{d}t, \qquad T(t) = \exp\!\Big(-\!\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,\mathrm{d}s\Big), \tag{1}$$
where $t_n$ and $t_f$ are the near and far bounds. The integral in Eq. 1 is computed by numerical approximation, summing over a set of sample points on the ray. In vanilla NeRF, the radiance field is implicitly represented by an MLP that takes the point as input and outputs its density and color. The MLP is then trained with a reconstruction loss between the reconstructed color $\hat{C}(\mathbf{r})$ and the ground-truth color $C(\mathbf{r})$, i.e.,
$$\mathcal{L}_{\mathrm{recon}} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2, \tag{2}$$
where $\mathcal{R}$ is a batch of ray samples.
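The quadrature that approximates the rendering integral can be sketched as follows; this is a minimal NumPy illustration of the standard NeRF accumulation, with function and variable names of our own choosing:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Numerically approximate the volume rendering integral (Eq. 1)
    for one ray, given per-sample densities sigmas (N,), RGB colors
    (N, 3), and distances deltas (N,) between adjacent samples."""
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity of each segment
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i: transmittance, the probability the ray reaches sample i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    # expected color: transmittance-weighted sum of sample colors
    return (weights[:, None] * colors).sum(axis=0)
```

An opaque first sample (very large density) fully occludes everything behind it, while an empty ray accumulates no color.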
The implicit representation in NeRF is highly compact but computationally expensive, resulting in slow training and rendering. Hybrid representations, which combine explicit and implicit components, were developed for efficient reconstruction and rendering with a radiance field. Although these methods each have unique strengths, they all follow a common framework. First, there are explicitly stored features, which can take the form of a voxel grid [nsvf, takikawa2021neural, dvgo, wu2021diver, yu_and_fridovichkeil2021plenoxels, zhang2022nerfusion], a hash table [instantngp], or a set of basis vectors/matrices [tensorf]. For any point in 3D space, a feature vector $\mathbf{f}$ can be efficiently obtained with cheap operations (e.g., tri-linear interpolation in a voxel grid). Next, a decoder maps $\mathbf{f}$ to properties such as the density and color of the point. The decoder can be an MLP [dvgo, instantngp, tensorf] or spherical harmonics [yu_and_fridovichkeil2021plenoxels].
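The cheap "query" step of the voxel-grid variant can be sketched with plain tri-linear interpolation; this NumPy sketch is illustrative only (real backbones use fused GPU kernels), and all names are ours:

```python
import numpy as np

def query_grid(grid, p):
    """Tri-linearly interpolate a feature vector from an explicit
    voxel grid of shape (X, Y, Z, C) at a continuous point p given
    in voxel coordinates. The decoder (MLP or SH) comes after."""
    i0 = np.floor(p).astype(int)
    # clamp so the 2x2x2 neighborhood stays inside the grid
    i0 = np.clip(i0, 0, np.array(grid.shape[:3]) - 2)
    w = p - i0  # fractional position inside the cell
    f = np.zeros(grid.shape[3])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wt = ((dx * w[0] + (1 - dx) * (1 - w[0]))
                      * (dy * w[1] + (1 - dy) * (1 - w[1]))
                      * (dz * w[2] + (1 - dz) * (1 - w[2])))
                f += wt * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return f
```

The cell-center query averages all eight corner features, and an on-lattice query returns the stored feature exactly.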
4 Our Method
Our method is built on the assumption that different areas in a dynamic scene can have different temporal changing patterns. Modeling different areas with different temporal regularizations not only helps keep temporal consistency but also saves computation. For example, some objects in the background may have a static geometry in the dynamic sequence, which allows us to reduce the capacity and complexity of their representation. We begin our method with a decomposed spatiotemporal representation which aims to first categorize and then model different dynamic areas using different representations based on their categories.
4.1 Decomposed Spatiotemporal Representation
As illustrated in Fig. 2, we assume three kinds of areas in a dynamic scene and model these areas with separate fields:
Static areas have a constant geometry and location in the dynamic scene, such as the table. Besides, we assume the appearance of static areas does not change frequently over time, i.e., it is temporally of low frequency. This is based on the observation that appearance changes are mainly caused by lighting conditions, while the albedo is time-invariant. Hence, a stationary field is used for representing static points.
Deforming areas model objects with deforming surfaces, such as the hand and the cup in Fig. 2. Deforming areas may undergo rigid or non-rigid motion, but they are always present throughout the sequence of interest. Deforming points are represented by a deformation field; the deformed point is then sent as the query point into a predefined canonical space (e.g., the static field).
New areas model new content emerging at some point in the sequence, such as the new fluid after pouring espresso into water. A newness feature field with spatiotemporal inputs $(x, y, z, t)$ is adopted for representing new areas.
To decompose the scene, we design a decomposition field that maps each spatiotemporal point to a probability triplet $(p_s, p_d, p_n)$, where $p_s$, $p_d$, and $p_n$ denote the probabilities of being static, deforming, and new, respectively. Next, we consider the outputs of the fields mentioned above to be feature vectors rather than point properties such as density, and denote the output feature vectors as $\mathbf{f}_s$, $\mathbf{f}_d$, and $\mathbf{f}_n$, respectively. Finally, given a query point, we first collect the outputs from the above fields and then compute the expected feature vector of this point by $\mathbf{f} = p_s \mathbf{f}_s + p_d \mathbf{f}_d + p_n \mathbf{f}_n$, where $p_s + p_d + p_n = 1$. Then $\mathbf{f}$ is sent to a lightweight view-conditioned network for density and color prediction.
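The expected-feature computation can be sketched as below. The softmax normalization of raw decomposition outputs is our assumption (the text only requires the three probabilities to sum to one), and all names are ours:

```python
import numpy as np

def blend_features(logits, f_static, f_deform, f_new):
    """Blend the feature vectors of the three fields with point-wise
    probabilities (p_s, p_d, p_n). Here the probabilities come from a
    softmax over raw decomposition-field outputs, so they sum to one."""
    e = np.exp(logits - logits.max())
    p = e / e.sum()  # (p_s, p_d, p_n)
    return p[0] * f_static + p[1] * f_deform + p[2] * f_new
```

With equal logits, each field contributes a third of the blended feature vector.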
In Fig. 3, we demonstrate our approach with a simple 2D toy example: temporal interpolation from given 2D images, where the fields take 2D locations as input. The string ‘2022’ undergoes rigid motion while the string ‘VR’ gradually appears. The differing interpolation results demonstrate the necessity of modeling dynamic scenes with both deforming and new fields. Note that we do not manually annotate the probabilities when performing the decomposition. Instead, the decomposition field is supervised only by the reconstruction loss and generic parsimony priors, which penalize content being modeled as new. More details about training are introduced in Sec. 4.4.
Hybrid representations, which enable fast training and real-time rendering, are adopted for implementing the above neural fields. However, most existing hybrid representations target static scenes and implement a mapping from a 3D point to a feature vector. Adapting them to inputs with an extra time dimension (i.e., dynamic scenes) is not straightforward, since modeling 4D inputs with the explicit representation may significantly increase the model size. A streamable hybrid representation for efficient spatiotemporal mapping is introduced in the next section.
4.2 Streamable Hybrid Representation
We observe that explicit representations commonly consist of array entries with a predefined feature dimension. For example, each entry of the hash table in InstantNGP [instantngp] and each basis vector/matrix in TensoRF [tensorf] has a fixed feature dimension. Thus, we propose to stream the feature channels so that the explicit representation becomes a mapping from a spatiotemporal point to a fixed-length feature vector $\mathbf{f}$.
We propose to select feature channels with a sliding window along the time dimension, as demonstrated in Fig. 4. Assume that for each frame the feature vector is of dimension $C$ and that $S$ channels are newly introduced per frame; then for an $N$-frame sequence each array entry is of dimension $C + (N-1)S$. For a single frame $i$, the channels $[iS, iS + C)$ will be used for computing $\mathbf{f}$, such as by trilinear interpolation in InstantNGP or tensor multiplication in TensoRF.
To ensure that $\mathbf{f}$ translates smoothly along the time dimension, a rearrangement of feature channels is conducted to match the shared channels. For example, with window size $C$ and stride $S$, channels $[0, C)$ are used for frame $0$; next, we use channels $[S, S + C)$ for frame $1$ and channels $[2S, 2S + C)$ for frame $2$. The principle behind the rearrangement is that shared feature channels are always aligned to the same index within the vector; otherwise, a smooth transition between frames is not guaranteed.
The streaming channels readily enable us to temporally interpolate a frame between two observed frames $i$ and $i+1$ by linearly interpolating the feature vectors, i.e., $\mathbf{f}_t = (1-\alpha)\,\mathbf{f}_i + \alpha\,\mathbf{f}_{i+1}$ with $\alpha = t - i$. Note that our proposed method can be applied to any hybrid representation that contains entries of feature vectors; the employed implementation will be referred to as the backbone in the following text. The sliding window scheme brings two benefits. First, overlapping feature channels are shared between adjacent frames, reducing the model size. Second, after rendering one frame, only the new feature channels need to be loaded when moving to subsequent frames, which is streaming friendly.
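The channel windowing and temporal interpolation can be sketched on a single flat array entry; the real backbones store entries in hash tables or tensor factors, and the names below (window size `C`, stride `S`) are our own:

```python
import numpy as np

def frame_features(entry, frame, C, S):
    """Select the channel window of an integer frame index from one
    explicit array entry of length C + (N - 1) * S. Frame i uses
    channels [i*S, i*S + C); adjacent frames share C - S channels,
    which is what makes the entry streamable."""
    start = frame * S
    return entry[start:start + C]

def interp_features(entry, t, C, S):
    """Temporally interpolate between the two nearest observed frames
    by linearly blending their channel windows."""
    i = int(np.floor(t))
    a = t - i
    return (1 - a) * frame_features(entry, i, C, S) \
        + a * frame_features(entry, i + 1, C, S)
```

With `C = 4` and `S = 2`, frame 0 reads channels 0-3 and frame 1 reads channels 2-5, so moving to the next frame only requires streaming two new channels.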
4.3 Overall Framework
Now we introduce the details of implementing the decomposed spatiotemporal representation (Sec. 4.1) with the streamable hybrid representation (Sec. 4.2); an illustration is presented in Fig. 5. The decomposition field consists of explicitly cached features and a small MLP decoder. The deformation field is a small MLP, which suffices because the deformation is sparse and of low frequency. The stationary field consists of explicitly cached features and a tiny MLP decoder, which takes the time and the queried feature as input; this tiny MLP models time-dependent appearance changes caused by time-varying illumination, which is assumed to be of low frequency. The newness field consists of explicitly saved features. Note that among the above explicit representations, both the decomposition and newness features take a 4D input $(x, y, z, t)$, hence streaming channels are used for them. The final expected feature vector is then decoded by a radiance field, to which the viewing direction is also fed, as in NeRF.
Our training process follows the practice of NeRF. A batch of camera rays is first randomly sampled from the observed data, and then points on those rays are sampled for training.
In practical reconstruction and rendering tasks, a precise supervisory signal for the decomposition field is inaccessible. Instead, we supervise the output probabilities with a global parsimony regularization. Therefore, besides the reconstruction loss defined in Eq. 2, a regularization loss is introduced in our method. We use the average probabilities over all points in the batch for this loss, denoted as $\bar{p}_s$, $\bar{p}_d$, and $\bar{p}_n$. In our implementation, assuming the existence of a static background, we propose to minimize the probability of not being a static point; thus the regularization loss is chosen as
$$\mathcal{L}_{\mathrm{reg}} = \bar{p}_n + \beta\,\bar{p}_d, \tag{3}$$
where $\beta$ is a tunable parameter weighting the ratio between being deforming and being new. Minimizing the probability of being a new point in Eq. 3 relies on the assumption that most points in the dynamic scene are either static or deforming. Overall, our training loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + w\,\mathcal{L}_{\mathrm{reg}}, \tag{4}$$
where $w$ is a balancing hyper-parameter.
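The total objective can be sketched as follows. The exact form of the "not static" regularizer is our reading of the text (a penalty on the mean new probability plus a weighted penalty on the mean deforming probability), and all variable names are ours:

```python
import numpy as np

def training_loss(pred_rgb, gt_rgb, probs, beta, w):
    """Total loss: per-ray L2 reconstruction plus the global parsimony
    regularization mean(p_new) + beta * mean(p_deform), weighted by w.
    probs has shape (num_points, 3) in order (static, deforming, new)."""
    recon = np.sum((pred_rgb - gt_rgb) ** 2, axis=-1).mean()
    p_mean = probs.mean(axis=0)          # batch-average probabilities
    reg = p_mean[2] + beta * p_mean[1]   # suppress new, then deforming
    return recon + w * reg
```

A batch decomposed as entirely static incurs no regularization penalty, while points classified as new are penalized even when the reconstruction is perfect.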
When rendering an image from a given camera pose, we first forward the sampled points through the decomposition field. Knowing the probabilities, we can skip the forwarding of some fields for efficiency: with a predefined threshold $\tau$, if a field’s probability falls below $\tau$, we treat its feature contribution as zero and skip forwarding that field. We set $\tau$ to 0.001 in our implementation.
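The field-skipping test can be sketched per point; whether the paper skips per point or per batch is not stated, so this per-point variant is an assumption, with names of our own:

```python
import numpy as np

def skip_mask(probs, tau=1e-3):
    """Per-point mask over the three fields (static, deforming, new):
    entry [i, k] is True when field k must be evaluated for point i.
    Probabilities below tau are treated as exactly zero, so the
    corresponding field forward pass can be skipped."""
    return probs >= tau
```

For a point that is almost certainly static, only the stationary field needs to be forwarded, which is where the rendering-time savings come from.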
5 Experiments
We first quantitatively and qualitatively compare our method with prior works; then, extensive ablation studies validate our proposed components. We urge the reader to watch our video to better appreciate the efficiency and rendering quality of our system.
Our method requires only RGB observations of the dynamic scene for reconstruction. Unlike most existing methods, our framework does not require special capturing settings or prior knowledge, and detailed comparisons of the framework’s requirements against competitive methods are attached in the supplementary. Two multi-camera datasets and one single-camera dataset are used:
Immersive Video [broxton2020immersive] includes synchronized videos from 46 4K fisheye cameras. In the raw video data provided by the authors, each camera has different imaging parameters such as exposure and white balance. We select 7 dynamic scenes with relatively similar imaging parameters and downsample the images in our experiments. The camera with ID 0 (the central camera) is used for validation and the other cameras are used for training.
Plenoptic Video [Li_2022_CVPR] is captured with 21 cameras. Different from Immersive Video, which mostly focuses on outdoor scenes, Plenoptic Video consists of indoor activities under various lighting conditions. We downsample the images in our experiments and follow the training and validation camera split provided by [Li_2022_CVPR]. Six scenes are publicly available.
HyperNeRF [nerfies, park2021hypernerf] provides only one view for each timestamp in a dynamic scene, which makes the dataset challenging due to the single-camera setting. We adopt the same training and validation settings as in [park2021hypernerf], including their image resolutions for quantitative evaluation and qualitative comparisons. There are two capturing settings in HyperNeRF: “vrig” captures the scene with stereo cameras, training with one camera and validating with the other; “interp” is a monocular video from a moving camera capturing dynamic scenes.
Our framework, as demonstrated in Fig. 5, is implemented with PyTorch [paszke2019pytorch]. As highlighted in Sec. 4.3, our framework is general and any hybrid representation with explicit features can be used. We implement our framework with two backbones: InstantNGP [instantngp] and TensoRF [tensorf]. In both implementations, the deformation network is a 4-layer MLP with a width of 256, the stationary field uses a 2-layer MLP with a width of 64, and the radiance field is a 4-layer MLP with a width of 64, with the same structure as the decoder in the backbone. For the InstantNGP-based model, the number of levels is 8 and the number of feature dimensions per entry is 4. The TensoRF-based model follows the same settings as their experiments on real forward-facing datasets (i.e., LLFF [llff]). For both backbones, we set the number of streaming channels to 1 and use fixed loss hyper-parameters. We follow the default optimization schedules and settings of the static-scene backbone methods. For the two multi-camera datasets, we observe that the frame rates are high and simply modeling every dynamic area as a new area already leads to good temporal interpolation performance, so the deformation field is not used by default for efficiency; an ablation studies the impact of video FPS when modeling with and without the deformation decomposition. Due to model size limitations, we split a long video into 90-frame clips and train on these clips separately. PSNR and SSIM [ssim] are reported for evaluating the rendering performance.
Table 1: Comparison of PSNR, training time, and rendering time on the multi-camera dataset; Neural Volumes [lombardi2019neural] reaches a PSNR of 22.797.
Table 2: Rendering time and per-scene results on the HyperNeRF scenes (Broom, 3D Printer, Chicken, Peel Banana), comparing GT, Ours-InstantNGP, HyperNeRF [park2021hypernerf], Nerfies [nerfies], NSFF [nsff], NV [lombardi2019neural], and NeRF [mildenhall2020nerf].
5.1 Comparison with State-of-The-Art Methods
DyNeRF [Li_2022_CVPR] and HyperNeRF [park2021hypernerf] are considered for the multi- and single-camera settings, respectively. Besides these two methods, we also quote the results of other baseline methods reported in their papers.
5.1.1 On multi-camera dataset
In Tab. 1, we report our results with both the InstantNGP and TensoRF backbones. The training and rendering times of DyNeRF are quoted from its paper. Our method reaches a higher PSNR while significantly reducing both training and rendering time. We further compare the rendered images in Fig. 6; images of the compared methods are again quoted from the result images in DyNeRF’s paper. Our method with InstantNGP renders images in 12% of the time required by DyNeRF while achieving comparable quality. Besides, our method with TensoRF-VM achieves better performance on fast-moving objects. As demonstrated in Fig. 7, we compare our rendered results with frames extracted from DyNeRF’s result video. Since the code and rendering parameters of DyNeRF are not publicly available, we manually select similar camera poses and timestamps for comparison. We can observe that the knife in DyNeRF’s results is blurry, while our method yields clearer results.
5.1.2 On single-camera dataset
A challenging and practically appealing setting is reconstructing and rendering without per-frame multi-view observations, i.e., capturing with a single camera. Our method with the TensoRF backbone is not reported on this dataset because the GPU memory required for training is too large under the default model settings. We compare our method with state-of-the-art single-camera reconstruction methods in Tab. 2. Note that NSFF requires data-driven depth and optical flow priors. Our method outperforms HyperNeRF in terms of PSNR but is slightly worse in SSIM; we presume this is because our method generates more accurate but less sharp images than HyperNeRF. Visual comparisons can be found in Fig. 8. HyperNeRF sometimes shows misalignment between the rendered and real images on moving objects, such as the wire in the second row. We attribute this misalignment to incorrectly modeled deformation, which is partially caused by treating all points as deforming in their representation. The lack of decomposition between static and dynamic areas also leads to a flickering background (demonstrated in our video). In our method, by decomposing static and dynamic areas, the deformation field is regularized to model only the dynamic areas.
5.2 Ablation Studies
5.2.1 Impact of Decomposition
We first study the necessity of the proposed three categories for decomposition. Visual comparisons of different decomposition variants are demonstrated in Fig. 10 and quantitative results are reported in Tab. 3. First, we study the impact of decomposing deforming and new areas on the single-camera dataset. We can observe from Fig. 10(a) that removing the new area decomposition leads to a failure to model the newly poured espresso, while removing the deforming area decomposition leads to a blurred hand and cup. Second, we study the impact of our decomposition on the multi-camera dataset. Fig. 10(b) demonstrates that static areas become sharper with the static decomposition. Besides, without the static area decomposition, we observe that the background flickers when rendering images at novel times and views.
Finally, we study the impact of the deforming area decomposition on the multi-camera dataset. This ablation is motivated by the observation that rendering with and without deformation modeling makes little difference (a PSNR gap of less than 0.1). We presume this is because the motion of objects between frames is small for cameras with a high recording FPS, so modeling all dynamic areas with a newness field can still produce smooth interpolation. We therefore manually downsample the frame rate used for training in Fig. 10(c) to enlarge the motion between frames. Without deformation modeling, the moving helmet first disappears and then reappears when interpolating between two training timestamps; in comparison, the content of the helmet is well preserved when the deformation is modeled.
(a) Results after removing deforming and new area decomposition on the single-camera dataset (HyperNeRF).
(b) Ablation of static area decomposition on the multi-camera dataset (Immersive Video). Second row: novel time and view rendering.
(c) Ablation of deforming area decomposition on the multi-camera dataset (Immersive Video). Every 8 frames are used for training.
5.2.2 Scene Decomposition Regularizing
In our method, a weighting hyper-parameter in Eq. 3 balances the ratio of deforming and new areas. A larger weight encourages the scene to contain fewer deforming areas. As introduced in the previous section, single-camera datasets are more sensitive to the deformation field, so we study the impact of this weight in Fig. 11 on a scene from HyperNeRF. Over-suppressing deforming areas (a large weight) leads to blurry moving objects, while under-suppressing them (a small weight) leads to a noisy scene. The blur from a large weight has the same cause as the second row of Fig. 3 and Fig. 10(c): falsely modeling a moving object as first disappearing and then reappearing. A good practice is to start with a relatively large weight to penalize deforming areas and then gradually allow areas to deform by decreasing it.
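The start-large-then-decay practice above can be sketched as follows. The function and parameter names (`parsimony_loss`, `lam_schedule`, the start/end values) are illustrative assumptions, not the paper's exact formulation; the point is the shape of the regularizer and its decay.

```python
import numpy as np

def parsimony_loss(p_deform, p_new, lam_deform, lam_new):
    # Penalize the global probability mass assigned to deforming/new areas;
    # a larger lam_deform pushes more of the scene toward the other categories.
    return lam_deform * p_deform.mean() + lam_new * p_new.mean()

def lam_schedule(step, lam_start=1e-1, lam_end=1e-3, total=10000):
    # Start with a strong penalty on deforming areas, then relax it
    # geometrically so genuinely moving regions can deform.
    t = min(step / total, 1.0)
    return lam_start * (lam_end / lam_start) ** t
```

A geometric (rather than linear) decay keeps the penalty dominant early, when the static/deforming assignment is still being resolved, and negligible late in training.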
5.2.3 Streaming Bitrates
An important metric for a streaming service is the bitrate: to render a new frame, the user is mainly concerned with the amount of new data that must be downloaded. We can easily tune the bitrate requirement in our method by setting the number of new feature channels per frame. In Tab. 4, we report the bitrate for streaming a new frame under different settings. The testing data is a sequence from Immersive Video with 90 frames. Note that this setting counts only the new channels needed for rendering a new frame; rendering the first frame still uses the channels required for static scenes (96 for TensoRF-CP and 4 for TensoRF-VM). For fair comparison, the bitrate is computed as the total model size divided by the number of frames.
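The amortized-bitrate computation can be written out directly. The byte counts and frame rate below are hypothetical stand-ins, not values from Tab. 4; the formula simply amortizes the total model size over the playback duration.

```python
def streaming_bitrate_kbps(bytes_first_frame, bytes_per_new_frame,
                           num_frames, fps=30):
    # Amortized bitrate: total model size divided by playback duration.
    total_bytes = bytes_first_frame + bytes_per_new_frame * (num_frames - 1)
    duration_s = num_frames / fps
    return total_bytes * 8 / duration_s / 1000.0

# Toy example: a 90-frame sequence where every frame costs 1 kB to stream.
rate = streaming_bitrate_kbps(1000, 1000, 90, fps=30)
```

Because the first frame carries the full static representation, its cost is amortized away over long sequences, and the steady-state bitrate is governed by the per-frame channel count.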
The TensoRF-CP based model achieves a low bitrate with reasonable performance, while the TensoRF-VM based model costs more but yields a clear performance gain. We further present rendering results in Fig. 12, where details of the background (e.g., the car) and the moving objects (e.g., the person) become clearer with increased bitrate budgets. These results validate the extensibility of our framework.
5.2.4 Rendering Speed and Quality
The performance of our framework is highly correlated with the chosen backbone. Thus, there exists a tradeoff between rendering speed and quality, mainly controlled by the predefined model size and rendering hyper-parameters. In Fig. 9, we present the rendering FPS and PSNR under different hyper-parameter settings. We consider two parameters, the hash table size and the stepping value during ray marching, on scenes from the Immersive Video dataset. Our method inherits the flexibility of the backbone, and the parameters can easily be tuned to obtain the desired speed and quality.
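The two knobs translate into compute and memory in a simple way, sketched below. The helper names are illustrative; the 2^T table-size convention follows Instant-NGP-style hash grids, which we assume for the hash-table parameter.

```python
def samples_per_ray(near, far, step):
    # Number of marching steps along a ray segment [near, far];
    # halving the step roughly doubles per-ray compute (and raises quality).
    return round((far - near) / step)

def hash_table_entries(log2_size):
    # Instant-NGP-style hash grids index into 2^T entries per level;
    # a larger T improves quality at the cost of memory, while a smaller T
    # increases hash collisions but shrinks the model.
    return 2 ** log2_size
```

Sweeping these two values traces out the FPS/PSNR frontier shown in Fig. 9.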
6 Limitation and Failure Cases
Our method models each frame in the scene with local feature channels, which enables streaming but limits the representation of long-range repeated activities. For example, the activity of pouring espresso in Fig. 2 may repeat several times in a scene. Modeling such repeating activities could reduce redundancy and improve reconstruction quality by leveraging all views of the same object. Moreover, our method assumes the input multi-view images share the same camera imaging configuration (e.g., exposure). A failure example from Immersive Video is shown in Fig. 13. Although view dependency can still be modeled in the framework, the model tends to generate floating points that overfit the training views. Recent progress in modeling the photography process [martin2021nerf, mildenhall2021rawnerf] may help solve this issue.
7 Conclusion
We present a framework for representing dynamic scenes from both multi- and single-camera captured images. The key components of our framework are the decomposition module and the feature streaming module. The decomposition module decomposes the scene into static, deforming, and new areas, and a sliding-window based hybrid representation is then designed to efficiently model the decomposed neural fields. Experiments on multi- and single-camera datasets validate our method's efficiency and effectiveness. Extensive ablation studies further provide insight into the model design, such as the necessity of modeling deformation in large-motion scenes captured by camera arrays.