Recent years have seen an explosion of novel scene representations which leverage multi-layer perceptrons (MLPs) as an implicit representation to encode spatially-varying properties of a scene. While these implicit representations are optimized via stochastic gradient descent, they are not really “learning” in the traditional sense. Instead, they exploit MLPs as a compressed representation of scene content. They can act as a substitute for traditional representations (e.g. explicit volumetric grids), but can adaptively distribute their limited capacity over the scene to enable high-fidelity scene representation. Representative examples include DeepSDF [park:etal:cvpr19], Scene Representation Networks [sitzmann:etal:NeurIPs19] and Neural Radiance Fields (NeRF) [mildenhall:etal:arXiv20].
Among all of these representations, NeRF [mildenhall:etal:arXiv20] and its variants [liu:etal:neuralips20, zhang:etal:arXiv2020, martin:etal:arXiv20] show enormous potential in their ability to photorealistically reconstruct a scene from only a sparse set of images. However, these works assume the scene is static, or at least that the dynamic content of the scene is uninteresting and can be discarded as in Martin-Brualla et al. [martin:etal:arXiv20]. When objects in a scene move, these methods can no longer correctly render novel views. It is possible to represent time-varying scenes by dedicating one NeRF volume per frame or by extending the input to four dimensions by including time [sitzmann:etal:arXiv20, tancik:etal:neuralips20]. The former is unnecessarily expensive, and neither can be used to render the object in novel poses (or remove it entirely) because they have no object-level understanding of the scene.
In this work, we aim to learn an interpretable and editable representation of a dynamic scene by simply observing an object in motion from multiple viewpoints. As an initial effort to tackle this challenge, we start from a simplified setting that assumes the scene contains only one moving object and the motion is fully rigid. We accomplish this by rendering a compositional neural radiance field with density and radiance defined via a composition of a static and a dynamic neural radiance field. Under this model, the only way all observations in a video can accurately be predicted is by segmenting the scene into the two volumes and correctly estimating the pose of the object in each frame.
To achieve this goal, our paper provides two main technical contributions. First, we present the first self-supervised, neural-rendering-based representation that can simultaneously reconstruct a rigidly moving scene as well as its background from videos only. Our approach enables photorealistic spatial-temporal novel view rendering as well as novel scene animation. Second, we present an optimization scheme that can effectively tackle the ambiguity and local optima during training.
Our experiments show that it is possible to recover the segmentation of static and dynamic contents as well as motion trajectory without any supervision other than multi-view RGB observations. Compared to NeRF and its extensions, our approach achieves more photorealistic reconstruction in complex synthetic as well as real world scenes. Further, our factorized representation can be edited to position the object in novel locations never observed during training, which no existing method can achieve without 3D ground truth or supervision.
2 Related Work
Our work is inspired by and related to recent progress in learning-based representations, in particular novel differentiable rendering from images, e.g. NeRF [mildenhall:etal:arXiv20] and its variants [martin:etal:arXiv20, zhang:etal:arXiv2020]. However, no prior work has successfully demonstrated photorealistic reconstruction and understanding of dynamic scenes using only real-world natural images. We believe we are the first to achieve self-supervised tracking and reconstruction of dynamic (rigidly) moving scenes using neural rendering.
Neural implicit representation
Recently, deep implicit representations using MLPs have demonstrated a promising ability to learn high quality geometry and appearance representations. Existing works use MLPs to represent the scene as an implicit function which maps from scene coordinates to scene properties, e.g. signed distance functions (SDFs) [park:etal:cvpr19], occupancy [mescheder:etal:cvpr19], occupancy flow [niemeyer:etal:cvpr19], volumetric density and radiance [mildenhall:etal:arXiv20], or an implicit feature [sitzmann:etal:NeurIPs19] which can be further decoded into pixel appearance. Several approaches show high quality reconstruction results by pairing such implicit representations with differentiable ray marching and some manual input such as object masks [niemeyer:etal:cvpr20, yariv:etal:neuralips20]. Sitzmann et al. [sitzmann:etal:arXiv20] and Tancik et al. [tancik:etal:neuralips20] show that implicit representations can compress videos, but have yet to demonstrate an ability to reconstruct dynamic 3D scenes.
NeRF [mildenhall:etal:arXiv20] demonstrated that a continuous implicit volume rendering function can be learned from only sparse sets of images, achieving state-of-the-art novel view synthesis results without relying on any manually specified inputs. Recent work has extended this representation to be trained with web images under different lighting conditions [martin:etal:arXiv20], with arbitrary viewpoints in a real-world large-scale scene [zhang:etal:arXiv2020], with hierarchical spatial latent priors [liu:etal:neuralips20], or with a latent embedding that can support generative modeling [schwarz:etal:neuralips20]. However, existing works all assume the input scene coordinate is quasi time-invariant, while we assume the scene is rigidly moving. Among all of these works, only NeRF-W [martin:etal:arXiv20] considers real-world scenes that contain dynamic content. It incorporates a latent embedding in addition to the scene coordinate to model the photometric variation across web images and treats moving content as transient objects. This method discards these objects and only reconstructs the background scene, while we can simultaneously reconstruct dynamic objects and recover their trajectory.
Dynamic scene novel view synthesis
Recently, a few systems have demonstrated novel view synthesis in dynamic scenes using multi-view videos, with a neural image-based rendering representation [Bemana:etal:siggraphasia20], neural volume rendering [lombardi:etal:tog19], or a multi-spherical image representation [broxton:etal:siggraph20]. However, these approaches can only support dynamic video playback; they cannot understand scene dynamics, and therefore cannot be used for interactive animation. Yoon et al. [yoon:etal:cvpr2020] propose a method that blends multi-view depth estimation in background regions, learned monocular depth estimation in the foreground, and a convolutional network. They demonstrated novel view synthesis as well as animation. However, this system requires precise manual specification of the foreground mask as part of training and rendering. This is a significant drawback, especially when it comes to video input. In contrast, we demonstrate that our approach can reconstruct background and foreground simultaneously in a completely self-supervised manner, and that it achieves high-quality reconstruction using real-world videos.
Self-supervised learning in dynamic scenes
There is also a surge in self-supervised learning approaches using videos, particularly in the autonomous driving domain, including 3D object detection [beker:etal:eccv20], joint camera pose and depth estimation [godard:etal:cvpr19, chen:etal:iccv19], with motion segmentation [ranjan:etal:cvpr19, luo:etal:pami19] and scene flow [hur:stefan:cvpr20]. Due to sensor limitations in the autonomous driving domain, these approaches focus on monocular videos or narrow-baseline stereo videos where the change in viewpoint is small. As a result, these methods use 2D or 2.5D representations, which limits their ability to render or animate the scene in a photorealistic fashion. In contrast, our method builds an implicit 3D representation that can be photorealistically rendered and animated. We demonstrate applications of our model on natural wide-baseline multi-view videos, which shows promise for applications in virtual and augmented reality.
We introduce STaR, a differentiable rendering-based framework for self-supervised tracking and reconstruction of rigidly moving objects and scenes. Given only multi-view passive video observations of an unknown object which rigidly moves in an unseen environment, STaR can simultaneously reconstruct a 3D model of the object (including both geometry and appearance) and track its 6DoF motion relative to a canonical frame without any human annotation. STaR can enable high-quality novel viewpoint rendering of both static and dynamic components of the original scene independently at any time, and can also be animated and re-rendered with a novel object trajectory.
We will first describe the 3D representation of STaR, which consists of a static and a rigidly-moving NeRF model. Next, we elaborate on how to optimize STaR over a multi-view RGB video, including rigid motion optimization over the Lie group of 3D transforms in the context of neural radiance fields, a regularized loss function, and an online training scheme that can work with videos of arbitrary length.
3.1 STaR as Dynamic Neural Radiance Fields
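NeRF [mildenhall:etal:arXiv20] represents the scene as a continuous function using an MLP with parameters $\Theta$,
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\sigma, \mathbf{c}), \tag{1}$$
which maps a 3D scene coordinate $\mathbf{x}$ and view direction $\mathbf{d}$ to volumetric density $\sigma$ and radiance $\mathbf{c}$, where $\mathbf{x}(s) = \mathbf{o} + s\,\mathbf{d}$ is the point on the ray from camera origin $\mathbf{o}$ with a depth of $s$. The pixel color can be obtained via an integral of accumulated radiance along the ray. We can numerically estimate the integral using quadrature:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \qquad \delta_i = s_{i+1} - s_i. \tag{2}$$
Here, $\{s_i\}_{i=1}^{N}$ is a set of samples from near bound $s_n$ to far bound $s_f$ and $\sigma_i$, $\mathbf{c}_i$ are evaluations of volume density and radiance at sample points along the ray.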
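As a minimal sketch of the quadrature in (2), the following pure-Python function accumulates color along a single ray. The function name `render_ray`, the list-based inputs and the `far_bound` argument used to close the last interval are assumptions made for illustration; they are not taken from the NeRF or STaR implementations.

```python
import math


def render_ray(depths, sigmas, colors, far_bound):
    """Estimate a pixel color from samples along one ray.

    depths: sorted sample depths along the ray.
    sigmas: volume density at each sample.
    colors: RGB triple at each sample.
    far_bound: depth that closes the last interval (assumption of this sketch).
    Returns the accumulated color and the transmittance left after the last sample.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that reaches the current sample
    for i, (depth, sigma, color) in enumerate(zip(depths, sigmas, colors)):
        next_depth = depths[i + 1] if i + 1 < len(depths) else far_bound
        delta = next_depth - depth
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

A ray that crosses empty space and then hits a very dense red sample returns a pixel close to pure red, because the transmittance drops to zero at that sample and everything behind it is hidden.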
Equation (2) can only represent time-invariant scenes. Similar to methods using an MLP to model time-dependent functions [sitzmann:etal:arXiv20], a straightforward extension to represent dynamic scenes using NeRF is to concatenate time $t$ to the input of the implicit function as
$$F_\Theta : (\mathbf{x}, \mathbf{d}, t) \rightarrow (\sigma, \mathbf{c}).$$
However, this time-dependent extension neither accurately reconstructs complex time-varying scenes from real-world images nor provides the dynamic scene factorization needed to support animation, as we show in sec:experiments.
Rigidly composed dynamic radiance fields
Instead, we represent the scene using two time-invariant implicit volumetric models: a static NeRF for static components and a dynamic NeRF for dynamic components.
The two volumes are related by a set of time-dependent rigid poses, one per time $t$, each of which defines the transformation in world coordinates from time $t$ to time 0 and aligns the dynamic volume with the static volume under a single time-invariant canonical frame.
To compute the color of a pixel at a particular time $t$, we compose the transformed dynamic NeRF with the static NeRF using the pose at time $t$ and alpha blending. Specifically, given a set of samples along the ray, we evaluate the static density and radiance at the sample points, as well as the dynamic density and radiance at the sample points transformed by the pose at time $t$. We derive the compositional radiance (8) based on (2).
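The composition described above can be sketched numerically. The snippet below is one plausible reading, not the exact formula of the paper: it assumes that light is attenuated by the sum of the static and dynamic densities and that each volume contributes its own alpha-weighted radiance. The function name `render_composed_ray` and the input layout are hypothetical, and the dynamic samples are assumed to have already been evaluated at the pose-transformed points.

```python
import math


def render_composed_ray(deltas, static_samples, dynamic_samples):
    """Blend a static and a dynamic volume along one ray.

    deltas: length of each sample interval.
    static_samples / dynamic_samples: one (sigma, rgb) pair per interval,
    evaluated at the same sample locations (assumption: the dynamic volume
    was already queried at the rigidly transformed points).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for delta, (sigma_s, rgb_s), (sigma_d, rgb_d) in zip(
        deltas, static_samples, dynamic_samples
    ):
        alpha_s = 1.0 - math.exp(-sigma_s * delta)
        alpha_d = 1.0 - math.exp(-sigma_d * delta)
        for channel in range(3):
            pixel[channel] += transmittance * (
                alpha_s * rgb_s[channel] + alpha_d * rgb_d[channel]
            )
        # Assumed form: both volumes attenuate the ray through their summed density.
        transmittance *= math.exp(-(sigma_s + sigma_d) * delta)
    return pixel
```

With the dynamic density set to zero everywhere, this reduces to rendering the static volume alone, which is the behavior needed to remove the object from the scene. Under this assumed form, a sample that is dense in both volumes contributes two full-weight colors, which is one reason a regularizer that discourages such overlap is useful.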
Our rigidly composed NeRF representation decouples rigid motion from geometry and appearance, allowing it to have full control over the dynamics of the environment. As shown in Sec. 4.2, this enables applications such as removing the dynamic object and animating novel trajectories which cannot be achieved via a simple time-varying model.
We use the same MLP model as [mildenhall:etal:arXiv20] for both static and dynamic NeRF. We also use positional encoding with the same bandwidth for the inputs and hierarchical volume sampling strategy. We use two independent coarse and fine MLP models to represent static and dynamic NeRFs. The coarse models use stratified sampling along the ray while the fine model uses importance sampling where the importance weights are the composed density obtained by both coarse models. This ensures the same sample intervals for the static and dynamic NeRFs, which enables the field composition shown in (8).
While we choose NeRF as the underlying 3D representation due to its simplicity and excellent performance, it is worth noting that STaR, as a differentiable renderer, can generally subsume any static 3D scene representation that shares the same input and output mapping.
3.2 Optimizing STaR
During training, we optimize the following objectives
The first term in (11) is an MSE loss, where $\hat{C}_c(\mathbf{r})$ and $\hat{C}_f(\mathbf{r})$ are the radiance (RGB) rendered by the coarse and fine models respectively, $C(\mathbf{r})$ is the ground truth color and $\mathcal{R}$ is the set of rays in a batch. Note that $C(\mathbf{r})$ is the only source of supervision. Given a limited number of camera views, minimizing the objective in (11) can suffer from local optima. We therefore introduce the regularization term in (12), which we observe leads to better convergence.
(12) regularizes the entropy of the rendered transparency values and is summed over all samples along all rays in a batch. It consists of two parts. The first part encourages the static and the dynamic transparencies to each be close to 0 or 1, which helps reduce fuzziness in the volume. The second part prevents the static and dynamic volumes from both having large density at the same point, which helps the model obtain a less entangled decomposition. This second part is computed on the normalized transparency and is weighted by the total transparency, so that a point is allowed to be empty but not occupied in both the static and the dynamic volume.
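To make the two parts of the regularizer concrete, here is a per-sample sketch. It is an assumed instantiation, not the formula of the paper: the first part is taken to be the binary entropy of each transparency, and the second part the entropy of the static/dynamic pair after normalizing by their sum, weighted by that sum. The names `binary_entropy` and `entropy_regularizer`, and the clamping constant `EPS`, are hypothetical.

```python
import math

EPS = 1e-6  # numerical guard against log(0); an assumption of this sketch


def binary_entropy(p):
    """Entropy of a Bernoulli variable with probability p, clamped away from 0 and 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_regularizer(alpha_static, alpha_dynamic):
    """Per-sample penalty for static and dynamic transparencies in [0, 1]."""
    # Part 1: each transparency should be close to 0 or 1.
    sharpness = binary_entropy(alpha_static) + binary_entropy(alpha_dynamic)
    # Part 2: a point should not be occupied in both volumes. The normalized
    # pair sums to one, so its entropy is a binary entropy; weighting by the
    # total transparency lets empty points pass without penalty.
    total = alpha_static + alpha_dynamic
    normalized_static = alpha_static / (total + EPS)
    overlap = total * binary_entropy(normalized_static)
    return sharpness + overlap
```

Under these assumptions, an empty point and a point occupied by exactly one volume both receive a near-zero penalty, while a point that is opaque in both volumes, or half-transparent in either, is penalized.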
Rigid pose optimization
We define the canonical frame by assigning the pose at time 0 to the identity. We optimize the pose parameters over time on the Lie algebra [gabay1982minimizing]. During optimization, we represent the transformation from dynamic to static volume at time $t$ as an iterative update of the current pose estimate by $\epsilon$, a local perturbation on the manifold that is initialized to zero at the beginning of each iteration. We can compute the gradient of $\epsilon$ using the gradient of the transformed point and an analytical Jacobian that can be computed in the forward pass. Then, we update the pose estimate with the resulting perturbation. Note that we use different learning rates for the pose parameters and for the NeRF parameters. We include the full analytic form of the Jacobian in the supplementary material.
We observe there are many local optima in the optimization of STaR due to entanglement of geometry, appearance and pose, which requires careful initialization. To initialize, we first train a static NeRF following (2) using only the images from the first frame. The initialization is terminated when the average MSE of the fine model over all images from the first frame falls below a threshold. Note that at this stage, the static volume can (and likely does) contain information from the dynamic object, but we observe that this initialization provides the model with a good initial estimate of the geometry and appearance from which to begin disentangling the scene.
We train STaR in an online fashion and can handle video sequences of arbitrary length. After appearance initialization, the static NeRF, the dynamic NeRF and the pose parameters are optimized jointly on the first $k$ frames of the video. $k$ starts from an initial value and is incremented whenever the average MSE of the fine model over all rays in the first $k$ frames falls below a threshold. The poses of the initial frames are initialized to identity, and the pose of each later frame is initialized to the pose of the previous frame when that frame is added. In sec:experiments, we show that both appearance initialization and online training are crucial for the optimization of STaR.
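The online schedule can be summarized by the control-flow sketch below. Everything named here is hypothetical: `train_online`, its arguments, the 6-vector placeholder used for poses, and in particular `train_step`, which stands in for one joint optimization step over the first `k` frames and is assumed to return the average fine-model MSE on those frames.

```python
def train_online(num_frames, initial_window, mse_threshold, train_step, max_steps=100000):
    """Grow the set of optimized frames as the reconstruction improves.

    train_step(k, poses) is a placeholder callable: it runs one joint update of
    both volumes and of poses[0:k] (it may modify `poses` in place) and returns
    the average fine-model MSE over the first k frames.
    """
    identity = (0.0,) * 6  # placeholder pose parametrization (assumption)
    poses = [identity for _ in range(initial_window)]
    k = initial_window
    for _ in range(max_steps):
        mse = train_step(k, poses)
        if mse < mse_threshold:
            if k == num_frames:
                break  # every frame is included and well fitted
            # A newly added frame starts from the pose of the previous frame.
            poses.append(poses[-1])
            k += 1
    return poses
```

Initializing each new pose from its predecessor relies on the object moving only a small amount between consecutive frames, so the optimizer starts close to the correct pose.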
Our experiments demonstrate that STaR is able to decouple static and dynamic components of a dynamic scene in challenging synthetic as well as real-world scenarios (see fig:novel_view_comparison). As a result, STaR is not only capable of synthesizing a dynamic scene from novel views and time with superior quality (see sec:nvs) but also animating an imaginary trajectory in a photorealistic fashion (see sec:anim).
|Lamp and desk||3200||2e-3||5e-4||0.5||5e-5||4e-4||2e-4||5||2||12hr|
Quantitative comparison of our method to baselines in photorealistic interpolation of motion. The models are evaluated on a 4x slow motion of the original video from a fixed novel viewpoint. Synthetic sequences are marked (synthetic) and real-world sequences are marked (real). We use a ground truth bounding box as a reference to evaluate static and dynamic regions separately.
We created two synthetic multi-view RGB videos and one real-world multi-view RGB video to evaluate STaR. The videos are highly challenging, containing complex geometry, large object motion, and significant view-dependent and time-dependent visual effects such as specular highlights and shadows, all of which are not present in any existing public dataset. Please refer to the supplementary materials for more details.
Synthetic data: We rendered two synthetic videos using Blender: lamp and desk, a 15-frame video showing a chair pulled away from the desk in a study room, and kitchen table, a 20-frame video showing a vase sliding across a reflective breakfast bar in a kitchen. The motion is created by modifying the pose of an object in a photorealistic scene. We leave the material and lighting produced by human designers untouched. Both videos have 8 fixed camera views for training and 1 held-out view for evaluation with an image resolution of .
Real world data: We captured a multi-view video, moving banana, in a real, natural indoor environment. In the scene, a banana toy is placed on a robot vacuum which sweeps across a living room. The data consists of 17 time-synchronized videos with a total of 792 temporal frames. To demonstrate our algorithm on relatively large motion, we uniformly subsample the video to 38 key frames and use 16 views for training and 1 held-out view for evaluation. The image resolution is .
4.1 4D Novel View Synthesis Evaluation
We first evaluate our approach on its ability to photorealistically render the dynamic scene from a novel 4D view (3D camera pose + time). More specifically, we render a 4x slow motion of the original video from a novel camera view by linearly interpolating the estimated object poses. We compare our method to the following baselines (see supplementary for the detailed architecture):
NeRF [mildenhall:etal:arXiv20]: The original NeRF assumes the scene is static. This baseline cannot properly reconstruct the dynamic object, but can still provide a reference to the quality in terms of the static part of the scenes.
NeRF-time: This model takes positional-encoded time as an additional input, thus creating a 4D representation using a strategy similar to [sitzmann:etal:arXiv20, tancik:etal:neuralips20].
NeRF-W [martin:etal:arXiv20]: This model takes an additional latent code input as proposed by Martin-Brualla et al. [martin:etal:arXiv20]. Unlike NeRF-W in [martin:etal:arXiv20], we did not include the appearance embedding because our video data does not contain strong appearance variations. This is the most closely related existing NeRF-based work that can handle dynamic variation.
We evaluate the novel view synthesis quality in terms of photorealism following the standard metrics in [mildenhall:etal:arXiv20]: the average PSNR, SSIM and LPIPS [zhang:etal:cvpr2018] across all test frames. Since our baselines do not provide separate static and dynamic rendering, we use object bounding boxes to divide the image into static background and dynamic foreground (see fig:evaluation). We compute metrics on the original image, the static background and the dynamic foreground.
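The region-wise evaluation can be illustrated with PSNR, the simplest of the three metrics. The sketch below assumes images are nested lists of RGB triples with values in [0, 1] and that the object bounding box is given as half-open pixel ranges; the helper names `box_mask`, `invert_mask` and `psnr` are hypothetical and not part of any evaluation code released with the paper.

```python
import math


def box_mask(height, width, box):
    """Boolean mask that is True inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [
        [top <= row < bottom and left <= col < right for col in range(width)]
        for row in range(height)
    ]


def invert_mask(mask):
    """Complement of a mask, e.g. to select the static background."""
    return [[not value for value in row] for row in mask]


def psnr(pred, target, mask=None):
    """PSNR in dB over the masked pixels (all pixels if mask is None)."""
    squared_error, count = 0.0, 0
    for row, (pred_row, target_row) in enumerate(zip(pred, target)):
        for col, (pred_px, target_px) in enumerate(zip(pred_row, target_row)):
            if mask is not None and not mask[row][col]:
                continue
            for a, b in zip(pred_px, target_px):
                squared_error += (a - b) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    mse = squared_error / count
    return float("inf") if mse == 0.0 else -10.0 * math.log10(mse)
```

Calling `psnr` three times, with no mask, with the box mask and with its complement, gives the full-image, dynamic-foreground and static-background scores respectively.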
[Figure panels: Key frame | Interpolated frame]
|No appr init*||20.78||0.820||0.188|
tab:quantitative_all shows the comparison to all baselines on unseen novel views. Overall, our method provides significantly better reconstruction quality than all baselines in all factorized regions, on both synthetic and real-world data. On real-world data, NeRF-time fails to reconstruct dynamic regions well and cannot match NeRF in static regions. In contrast, our method handles both regions well. NeRF-W provides competitive reconstruction quality in pixel accuracy, but we still outperform it significantly in the perceptual metric LPIPS. This demonstrates the benefits of using a factorized representation. fig:novel_view_comparison shows visual comparisons on synthetic and real-world data. Compared to all baselines, our method is the only one that can reconstruct both the static background and the dynamic objects in detail from novel views. We encourage readers to watch the supplementary video, which best demonstrates the perceptual quality of the rendering on novel spatial-temporal views.
[Ablation figure panels: No appr init | No online | No entropy | Full model]
We perform an ablation study on the optimization strategies discussed in sec:optimization using the lamp and desk sequence. See quantitative results in tab:ablation and qualitative results in fig:ablation. The results indicate that our full model with regularization, trained online, performs the best. Our model performs poorly without the initialization stage, and also shows dramatically worse performance without the online training strategy. With additional entropy regularization over the volume density, we can further improve the results. fig:ablation highlights the significant perceptual differences, which cannot be properly seen in the quantitative numbers.
4.2 Rendering on Animated Trajectories
Our factorized representation of motion and appearance allows STaR to synthesize novel views of animated trajectories of the dynamic object which have not been seen during training. fig:animation shows synthesized novel views of a trajectory dramatically different from the training trajectory in both synthetic and real world scenes. It is worth noting that no existing method we know of is able to synthesize motion so different from observed data and re-render it in a photorealistic fashion without any 3D ground truth or supervision, including NeRF-time and NeRF-W, which can only interpolate object poses from the observed trajectory.
Our method demonstrates a novel direction towards reconstructing dynamic scenes using only video observations. It should be noted that this system is a proof of concept and has not completely solved the problem of fully decomposing dynamic scenes into their constituent parts. First, we assume only one object in motion. Extending the model to multiple objects is trivial, but estimating the number of moving objects when it is not known a priori is an interesting direction for further research. Second, we cannot represent non-rigid motion in the presented model. This could probably be accomplished by combining our insights with the orthogonal work on deforming neural representations by Niemeyer et al. [niemeyer:etal:cvpr19]. Finally, we effectively remove geometric dynamism from both NeRF volumes by factorizing all motion into an explicit rigid transform, but we cannot do the same for appearance due to the mutual influence of each volume on the lighting conditions of the other. This could be solved by further factoring the appearance into material and lighting conditions, but these explorations are very much outside the scope of this paper and we leave them for future work.
In this document, we provide technical details in support of our main paper. Below is a summary of the contents.
Description of supplementary video;
Mesh reconstruction and 6D pose tracking results;
MLP architecture and volume rendering details;
Synthetic and real-world data preparation;
Derivation of pose Jacobian.
We encourage the reader to watch our supplementary video at https://wentaoyuan.github.io/star, where we visualize the following results.
We first show a comparison of STaR against NeRF [mildenhall:etal:arXiv20], NeRF-time and NeRF-W [martin:etal:arXiv20] on the rendering of novel spatial-temporal views on the lamp and desk and kitchen table sequences. The rendered videos are 20x slow motion of the training videos from a continuously varying camera view unseen during training, where STaR achieves superior perceptual quality compared to the baselines.
Then, we visualize the decomposition of static and dynamic components learned by STaR on the moving banana sequence by showing how static background and dynamic foreground can be seamlessly removed during spatial-temporal novel view rendering. Similarly, the rendered video is a 20x slow motion of the training videos from a continuously varying camera view unseen during training. We also visualize the disparity map rendered by STaR.
Finally, we show results of photorealistic animation of unseen object motion in lamp and desk and moving banana. Specifically, we compose the static and dynamic NeRFs using a set of poses significantly different from the poses seen during training (see fig:animation for a visualization of the trajectories) and render the composed NeRF from a continuously varying camera view unseen during training. Remarkably, without any prior knowledge about the scene geometry or the object motion, STaR is able to learn a factored representation that allows it to photorealistically synthesize novel spatial-temporal views of novel object motion, which no existing method can do.
C Additional Applications
The separation of static and dynamic components learned by STaR allows us to reconstruct a 3D mesh of the unknown dynamic object. Specifically, we can query the dynamic NeRF using a dense 3D grid over a training camera’s view frustum, then threshold the density outputs (i.e. setting the density to zero if it falls below a threshold) and run marching cubes to obtain a 3D mesh. fig:recon visualizes the reconstructed meshes of the dynamic objects compared to the ground truth in the two synthetic videos used in our paper, lamp and desk and kitchen table.
We also use MeshLab to compute the mean Hausdorff distance between the reconstructed mesh and the ground truth. We report the distances in tab:recon as percentage of the ground truth mesh’s bounding box diagonal. We use voxel size 0.002 for lamp and desk, voxel size 0.0001 for kitchen table, and density threshold 50 for both scenes. The reconstructed meshes are post-processed by excluding everything outside of a manually specified 3D bounding box and aligned with the ground truth meshes using ICP.
|Lamp and desk||Kitchen table|
6DoF Pose Tracking
In addition to reconstructing geometry and appearance, STaR also outputs the relative 6D pose between the static and dynamic volume. We can use the output poses to accurately track the relative motion of the dynamic object. In tab:track, we report the error in the relative pose difference of the dynamic object between neighboring key frames estimated by STaR compared against the ground truth. The rotation error is computed in degrees and the translation error is computed as a percentage of the diagonal of the object’s 3D bounding box.
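The tracking metric can be sketched as follows. The conventions here are assumptions for illustration: poses are (rotation matrix, translation vector) pairs, the relative motion between two key frames is taken as the second pose composed with the inverse of the first, and the rotation error is the geodesic angle between the estimated and ground-truth relative rotations. The function names are hypothetical.

```python
import math


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def mat_vec(a, v):
    return [sum(a[i][j] * v[j] for j in range(3)) for i in range(3)]


def relative_pose(pose_a, pose_b):
    """Rigid motion that maps pose_a onto pose_b, with poses given as (R, t)."""
    (rot_a, trans_a), (rot_b, trans_b) = pose_a, pose_b
    rot_rel = mat_mul(rot_b, transpose(rot_a))
    trans_rel = [tb - x for tb, x in zip(trans_b, mat_vec(rot_rel, trans_a))]
    return rot_rel, trans_rel


def pose_errors(est_a, est_b, gt_a, gt_b, bbox_diagonal):
    """Rotation error in degrees and translation error in percent of the diagonal."""
    rot_est, trans_est = relative_pose(est_a, est_b)
    rot_gt, trans_gt = relative_pose(gt_a, gt_b)
    rot_diff = mat_mul(rot_est, transpose(rot_gt))
    # The rotation angle follows from the trace; clamp to guard against round-off.
    cos_angle = (rot_diff[0][0] + rot_diff[1][1] + rot_diff[2][2] - 1.0) / 2.0
    cos_angle = max(-1.0, min(1.0, cos_angle))
    rot_err_deg = math.degrees(math.acos(cos_angle))
    trans_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(trans_est, trans_gt)))
    return rot_err_deg, 100.0 * trans_err / bbox_diagonal
```

Comparing relative poses between neighboring key frames, rather than absolute poses, makes the metric independent of how the canonical frame of the dynamic volume happens to be placed.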
|Lamp and Desk||Kitchen table|
are marked in green, blue and yellow respectively, with their dimensions labeled beneath. Blue, black, yellow and green arrows denote linear transformations with no activation, ReLU activation, sigmoid activation and softplus activation respectively. Concatenation is marked with a separate symbol. $\gamma$ denotes positional encoding and $\mathbf{x}$, $\mathbf{d}$, $\sigma$, $\mathbf{c}$ denote 3D location, viewing direction, volume density and radiance. NeRF-W additionally takes a latent code to generate transient density, color and uncertainty.
D Implementation Details
fig:mlp shows the architecture of the MLPs used by STaR (NeRF), NeRF-time and NeRF-W respectively. STaR uses the same MLP architecture as NeRF for both the static volume and the dynamic volume. NeRF-time shares the same MLP architecture except that it takes positional-encoded time as an additional input. In practice, the time parameter before positional encoding is the frame index linearly projected onto a fixed interval. NeRF-W (more precisely, its variant NeRF-U, since we do not use the appearance embedding) uses the same MLP architecture as NeRF for the coarse network, but its fine network takes an additional 16-dimensional latent code and outputs transient density, color and uncertainty in addition to static density and color. Please refer to [martin:etal:arXiv20] for more details about the architecture of NeRF-W.
For STaR and all baselines, we use 64 stratified samples per ray for the coarse network and an additional 64 importance samples (128 samples in total) for the fine network. Following [mildenhall:etal:arXiv20], we add small Gaussian noise to the density outputs during appearance initialization but turn it off during online training. We adjust the absolute scale of the camera’s view frustum so that it roughly lies within a fixed cube. For synthetic data, this can be done by scaling the rendered content. For real data, we translate the camera poses so that the world coordinate center aligns with the center of the average camera’s view frustum. Then we scale the camera poses’ translation component by half of the distance from the near bound to the far bound.
E Data Preparation
Synthetic Data Generation
The synthetic video sequences are rendered with the Blender Cycles rendering engine using photorealistic assets created by professional designers on Blend Swap. The 8 training cameras are arranged in an array, focusing on the same target point in the scene. The evaluation camera looks in the same direction from a similar distance, but its location differs from those of the training cameras. The camera poses remain fixed throughout the video.
Real World Data Capture
The real-world video sequence is captured using a 20-camera rig. The cameras are arranged in an array. We discard 3 cameras that are not synced with the others, and use 16 cameras for training and the 1 remaining camera for evaluation. The camera poses are also fixed and can be obtained by running COLMAP’s structure-from-motion pipeline on images from the first frame. The original images are downsampled for training and evaluation.
F Jacobian for Pose Gradient
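Let $\mathbf{p} \in \mathbb{R}^3$ be a 3D point and $\xi$ be a pose with associated transformation matrix
$$T = \begin{pmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{pmatrix}, \qquad \xi \oplus \mathbf{p} = R\,\mathbf{p} + \mathbf{t},$$
where $\xi \oplus \mathbf{p}$ denotes the transformation of $\mathbf{p}$ with respect to $\xi$. Let $\epsilon \in \mathbb{R}^6$ be a local perturbation on the manifold, with its first three components translational and its last three rotational as in [blanco2010tutorial]. We are interested in the derivative of $(e^{\epsilon} \oplus \xi) \oplus \mathbf{p}$ with respect to $\epsilon$ at $\epsilon = \mathbf{0}$:
$$\left.\frac{\partial\, (e^{\epsilon} \oplus \xi) \oplus \mathbf{p}}{\partial \epsilon}\right|_{\epsilon = \mathbf{0}} = \left( \begin{pmatrix} \mathbf{p}^\top & 1 \end{pmatrix} \otimes I_3 \right) \begin{pmatrix} 0_{3 \times 3} & -[\mathbf{r}_1]_\times \\ 0_{3 \times 3} & -[\mathbf{r}_2]_\times \\ 0_{3 \times 3} & -[\mathbf{r}_3]_\times \\ I_3 & -[\mathbf{t}]_\times \end{pmatrix} = \begin{pmatrix} I_3 & -[R\,\mathbf{p} + \mathbf{t}]_\times \end{pmatrix},$$
where $\otimes$ denotes Kronecker product and $\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3$ are the columns of $R$. The result is a $3 \times 6$ Jacobian matrix, where $[\mathbf{v}]_\times$ is the cross product matrix
$$[\mathbf{v}]_\times = \begin{pmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{pmatrix}.$$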
We encourage the reader to read [blanco2010tutorial] for more details about the on-manifold optimization of transformations.
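As a sanity check on the kind of Jacobian discussed above, the sketch below compares a closed form for a left perturbation, $(I_3 \;\; -[R\mathbf{p} + \mathbf{t}]_\times)$, against central finite differences. The perturbation ordering (translation first, rotation second), the use of the exponential map and all function names are assumptions of this sketch rather than details confirmed by the paper.

```python
import math


def cross_matrix(v):
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def exp_se3(eps):
    """Exponential map of eps = (vx, vy, vz, wx, wy, wz); returns (R, t)."""
    v, w = eps[:3], eps[3:]
    theta = math.sqrt(sum(c * c for c in w))
    k = cross_matrix(w)
    k2 = mat_mul(k, k)
    if theta < 1e-12:  # series limits for a vanishing rotation
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[float(i == j) for j in range(3)] for i in range(3)]
    rot = [[eye[i][j] + a * k[i][j] + b * k2[i][j] for j in range(3)] for i in range(3)]
    v_mat = [[eye[i][j] + b * k[i][j] + c * k2[i][j] for j in range(3)] for i in range(3)]
    return rot, mat_vec(v_mat, v)


def perturbed_point(eps, rot, trans, p):
    """Apply (rot, trans) to p, then the perturbation exp(eps)."""
    q = [a + b for a, b in zip(mat_vec(rot, p), trans)]
    rot_e, trans_e = exp_se3(eps)
    return [a + b for a, b in zip(mat_vec(rot_e, q), trans_e)]


def analytic_jacobian(rot, trans, p):
    """3x6 Jacobian (I_3, -[R p + t]_x) at zero perturbation."""
    q = [a + b for a, b in zip(mat_vec(rot, p), trans)]
    neg_cross = cross_matrix([-c for c in q])  # equals -[q]_x by linearity
    return [[float(i == j) for j in range(3)] + neg_cross[i] for i in range(3)]


def numeric_jacobian(rot, trans, p, h=1e-5):
    """Central finite differences over the six perturbation components."""
    columns = []
    for j in range(6):
        plus = [h if i == j else 0.0 for i in range(6)]
        minus = [-h if i == j else 0.0 for i in range(6)]
        f_plus = perturbed_point(plus, rot, trans, p)
        f_minus = perturbed_point(minus, rot, trans, p)
        columns.append([(a - b) / (2.0 * h) for a, b in zip(f_plus, f_minus)])
    return [[columns[j][i] for j in range(6)] for i in range(3)]
```

Because the closed form depends only on the transformed point, it can be evaluated during the forward pass at no extra cost, which matches the way the pose gradient is described in the main text.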