Geometry-Guided Progressive NeRF for Generalizable and Efficient Neural Human Rendering

by   Mingfei Chen, et al.

In this work we develop a generalizable and efficient Neural Radiance Field (NeRF) pipeline for high-fidelity free-viewpoint human body synthesis under settings with sparse camera views. Though existing NeRF-based methods can synthesize rather realistic details of the human body, they tend to produce poor results when the input has self-occlusion, especially for unseen humans under sparse views. Moreover, these methods often require a large number of sampling points for rendering, which leads to low efficiency and limits their real-world applicability. To address these challenges, we propose a Geometry-guided Progressive NeRF (GP-NeRF). In particular, to better tackle self-occlusion, we devise a geometry-guided multi-view feature integration approach that uses the estimated geometry prior to integrate the incomplete information from the input views and construct a complete geometry volume for the target human body. Meanwhile, to achieve higher rendering efficiency, we introduce a geometry-guided progressive rendering pipeline, which leverages the geometric feature volume and the predicted density values to progressively reduce the number of sampling points and speed up the rendering process. Experiments on the ZJU-MoCap and THUman datasets show that our method significantly outperforms the state of the art across multiple generalization settings, while the time cost is reduced by more than 70% through our progressive rendering pipeline.




1 Introduction

High-fidelity free-viewpoint synthesis of the human body is important for many applications such as virtual reality, telepresence and games. Some recent works [27, 13, 29, 42] deploy a Neural Radiance Fields (NeRF) [22] pipeline and achieve fairly realistic synthesis of the human body. However, these works usually require dense-view capture of the human body and have to train a separate model for each person to render new views. The limited generalization ability and the high computational cost severely hinder their application in real-world scenarios.

Figure 1: Our method handles the self-occlusion (a) and high computational cost (b) issues better than previous methods [12, 28]. In (a), our multi-view integration can extract high-quality geometry information from the input views for the red SMPL vertex. In (b), our progressive rendering pipeline leverages the geometric volume and the predicted density values to progressively reduce the number of sampling points and speed up the rendering, while previous methods [12, 28] waste a large amount of computation on redundant empty regions. The efficiency comparison shown in (c) further verifies our high efficiency.

In this work, we aim at boosting high-fidelity free-viewpoint human body synthesis with a generalizable and efficient NeRF framework based on only single-frame images from sparse camera views. To build such a framework, two main challenges need to be tackled. First, the human body is highly non-rigid and commonly has self-occlusions over body parts, which may lead to ambiguous results with only sparse-view captures. Without proper regularization, this ambiguity can drastically degrade the rendering quality, and it cannot be easily resolved by simply sampling features from multi-view images as in [41, 38, 30]. The problem becomes worse when one model is used to synthesize unseen scenarios without per-scene training. Second, the high computation and memory cost of NeRF-based methods severely hinders human synthesis with accurate details at high resolution. For example, when rendering one image, existing methods need to process millions of sampling points through the neural network, even when they use the bounds of the geometry prior to remove empty regions.

To address these challenges, we propose a geometry-guided progressive NeRF, called GP-NeRF, for generalizable and efficient free-view human synthesis. More specifically, to regularize the learned 3D human representation, we propose a geometry-guided multi-view feature integration approach to more effectively exploit the information in the sparse input views. For the geometry prior, we adopt a coarse 3D body model, i.e., SMPL [20], which serves as the base estimate of our algorithm. We attach multi-view image features to the base geometry model using an adaptive multi-view aggregation layer. We then obtain an enhanced geometry volume by refining the base model with the attached image features, which substantially reduces the ambiguities in learning a high-fidelity 3D neural field. It is worth noting that our multi-view enhanced geometry prior differs significantly from related methods that also use human body priors [28, 12]. NB [28] learns a per-scene geometry embedding, which is hard to generalize to unseen human bodies; NHP [12] relies on temporal information to complement the base geometry model, which is less effective for regions occluded throughout the input video. In contrast, our approach adaptively combines the geometry prior and multi-view features to enhance the 3D estimation, and thus can better handle the self-occlusion problem and achieve improved generalization capacity even without using videos (see Figure 1 (a)). By integrating the multi-view information and forming a complete geometry volume that adapts to the target human body, we can also compensate for some limitations of the geometry prior (e.g., inaccurate body shape or missing clothing information), and provide a good basis for our efficient progressive pipeline.

Furthermore, to tackle the high computation and memory cost, we introduce a geometry-guided progressive rendering pipeline. As shown in Figure 1 (b), different from previous methods [28, 12], our pipeline decouples the density and color prediction processes, leveraging the geometry volume as well as the predicted density values to progressively reduce the number of sampling points for rendering. By simply deploying our progressive rendering pipeline with the same data and model parameters, we can remove 76.4% of the points for density prediction (with the Density MLP in Figure 1 (b)) and 94.0% of the points for color prediction (with the Appearance MLP in Figure 1 (b)), which substantially reduces the total forwarding time of this part (see Table 2). Later experiments verify that our progressive pipeline causes no performance decline while requiring shorter training time, which we attribute to learning density and appearance separately.

Our main contributions are threefold:

  • We propose a novel geometry-guided progressive NeRF (GP-NeRF) for generalizable and efficient human body rendering, which significantly reduces the computational cost of rendering and also gains higher generalization capacity from single-frame sparse views alone.

  • We propose an effective geometry-guided multi-view feature integration approach, in which each view compensates for the low-quality occluded information of the other views under the guidance of the geometry prior.

  • Our GP-NeRF achieves state-of-the-art performance on the ZJU-MoCap dataset, taking only 175 ms per image on an RTX 3090 and reducing the rendering time per image by over 70% (Table 2), which verifies the effectiveness and efficiency of our framework.

2 Related Work

Human Performance Capture.

Previous works [24, 3, 6, 10] apply traditional modeling and rendering pipelines for novel view synthesis of human performance, relying on either dense camera setup [5, 10] or depth sensors [3, 6, 36] to ensure photo-realistic reconstruction. Follow-up improvements are made by introducing neural networks to the rendering pipeline to alleviate geometric artifacts. To enable human performance capture in a sparse multi-view setup, template-based methods [2, 4, 8, 35] adopt pre-scanned human models to track human motion. However, these approaches require per-scene optimization and the pre-scanned human models are hard to collect in practice, which hinders them from real-world applications. Instead of performing per-scene optimization, recent methods [23, 31, 32, 43] adopt neural networks to learn human priors from ground-truth 3D data, and hence can reconstruct detailed 3D human geometry and texture from a single image. However, due to the limited diversity of training data, it is difficult for them to generate photo-realistic view synthesis or generalize to human poses and appearances that are very different from the training ones.

Neural 3D Representations.

Recently, researchers have adopted neural networks to represent the shape and appearance of scenes. These representations, such as voxels [33, 19, 26, 21], point clouds [1, 39], textured meshes [16, 14, 18, 40] and multi-plane images [7, 44], are learned from 2D images via differentiable renderers. Despite their impressive results, they are hard to scale to higher resolutions due to their innate cubic memory complexity.

Researchers then proposed implicit function-based approaches [34, 15, 17, 25], which learn a fully-connected network that translates a 3D positional feature into a local feature representation. A very recent work, NeRF [22], achieves high-fidelity novel view synthesis by learning implicit fields of color and density together with a volume rendering technique. Later, several works extended NeRF to dynamic scene modeling [27, 29, 42, 13] by jointly optimizing NeRF and dynamic deformation fields. Despite the impressive performance, learning both NeRF and dynamic deformation fields together is an extremely under-constrained problem. NB [28] combines NeRF with the parametric human body model SMPL [20] to regularize the training process. It requires a lengthy optimization for each scene and hardly generalizes to unseen scenarios. To avoid such expensive per-scene optimization, generalizable NeRFs [30, 38, 41, 12] condition the network on pixel-aligned image features. However, directly extending such methods to complex and dynamic 3D human modeling is highly non-trivial due to self-occlusion, especially when modeling unseen humans under sparse views. Besides, these approaches suffer from low efficiency since they need to process a large number of sampling points for volumetric rendering, which harms their real-world applicability. Different from existing methods, we carefully design a multi-view information aggregation approach and a progressive rendering technique to improve model robustness and generalization to unseen scenarios under sparse views, and also to speed up the rendering.

3 Methodology

Given a set of sparse source views of an arbitrary human model, each captured by a pre-calibrated camera, we aim to synthesize a novel view of the human model from an arbitrary target camera.

To this end, we propose a geometry-guided progressive NeRF (GP-NeRF) framework for efficient and generalizable free-view human synthesis under very sparse views (e.g., three views). Figure 2 gives an overview of our framework. First, a CNN backbone extracts image features from each of the source views. Then our GP-NeRF framework integrates these multi-view features to synthesize the novel-view image through three modules progressively, leveraging the geometry prior from SMPL [20] as guidance. The three modules are 1) the geometry-guided multi-view feature integration (GMI) module (Section 3.1); 2) the density network (Section 3.2); and 3) the appearance network (Section 3.3). Details of the whole progressive human rendering pipeline are elaborated in Section 3.4, and the training method is described in Section 3.5.

Figure 2: Overview of our proposed framework. Our progressive pipeline mainly contains three parts. (a) Geometry-guided multi-view feature integration. We first learn a query embedding for each SMPL vertex to adaptively integrate the multi-view pixel-aligned image features through the geometry-guided attention module. Based on this, we use the SparseConvNet to construct a denser geometry feature volume. (b) Density network. For each point within the geometry feature volume, we concatenate its geometry feature with the mean and variance of its pixel-aligned image features, and predict its density value through the density MLP. The points with a positive density value form the valid density volume. (c) Appearance network. For each point within the valid density volume, we use its pixel-aligned image features to predict its color value through the appearance MLP. Finally, we conduct volume rendering to render the target image.

3.1 Geometry-guided Multi-view Integration

The geometry-guided multi-view feature integration module, shown in Figure 2 (a), enhances the coarse geometry prior with multi-view image features by adaptively aggregating these features via a geometry-guided attention module. Then it constructs a complete geometry feature volume that adapts to the target human body.

First, we use the SMPL model [20] as the geometry prior and obtain the pixel-aligned image features of each SMPL vertex from each source image. Specifically, we multiply the coordinate of each vertex by each source camera pose to transform the vertex into the source camera coordinate system, and then apply the intrinsic matrix to obtain its projected coordinate in the corresponding image plane. The pixel-aligned feature of a vertex in a given view is the image feature at that projected pixel location. We use bilinear interpolation to obtain the feature if the projected location is fractional.
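The projection and bilinear lookup described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the split of the camera pose into a rotation and a translation, and the nested-list feature map layout `feat[row][col][channel]` are all assumptions made for the example.

```python
import math

def project_point(p_world, rotation, translation, intrinsic):
    """Project one 3D world point to fractional pixel coordinates (x, y)."""
    # World -> camera coordinates: p_cam = R @ p_world + t (assumed pose layout).
    p_cam = [sum(rotation[r][c] * p_world[c] for c in range(3)) + translation[r]
             for r in range(3)]
    # Camera -> image plane with the 3x3 intrinsic matrix, then divide by depth.
    u, v, depth = (sum(intrinsic[r][c] * p_cam[c] for c in range(3))
                   for r in range(3))
    return u / depth, v / depth

def bilinear_sample(feat, x, y):
    """Sample feat[row][col] (a list of C channels) at a fractional (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the image so that border lookups stay valid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighbouring feature vectors channel by channel.
    return [
        (1 - wy) * ((1 - wx) * feat[y0][x0][c] + wx * feat[y0][x1][c])
        + wy * ((1 - wx) * feat[y1][x0][c] + wx * feat[y1][x1][c])
        for c in range(len(feat[0][0]))
    ]
```

Calling `project_point` once per source camera and passing the result to `bilinear_sample` on that camera's feature map would give one pixel-aligned feature per view for the vertex.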

After obtaining the pixel-aligned features from all source views, we integrate them to represent the geometry information at each vertex through a geometry-guided attention module. Concretely, we learn an embedding for each vertex and take it as a query embedding to calculate a correspondence score with the pixel-aligned feature from each source view. The score is computed with linear projection layers and involves the channel dimension of the features. We then take a weighted sum of the pixel-aligned feature embeddings based on the scores to obtain the aggregated geometry-related feature for the vertex.
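The attention step can be illustrated with a small sketch. It assumes standard scaled dot-product attention with a softmax over the views, which matches the ingredients named in the text (a query embedding, per-view features, a channel dimension, a weighted sum) but may differ in detail from the authors' formulation. The learned linear projection layers are omitted, so the inputs are treated as already projected, and the function name is hypothetical.

```python
import math

def attention_aggregate(query, view_feats):
    """Fuse per-view features of one vertex with softmax attention weights.

    query:      list of d floats, the (already projected) vertex query embedding.
    view_feats: list of M lists of d floats, the (already projected) view features.
    Returns (weights, fused): M attention weights and the fused d-dim feature.
    """
    d = len(query)
    # Assumed score: dot product of query and view feature, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
              for feat in view_feats]
    # Numerically stable softmax over the views.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the per-view features.
    fused = [sum(w * feat[c] for w, feat in zip(weights, view_feats))
             for c in range(d)]
    return weights, fused
```

With this form, a view whose feature agrees with the vertex query receives a larger weight, so an occluded view contributes less to the fused feature.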


Considering that the SMPL vertices with their corresponding features are not dense enough to represent the whole human body volume, we further learn to extend and fill the holes of the sparse geometry feature volume through the SparseConvNet [9], and thus obtain a denser geometry feature volume. In our method, we take this geometry volume as a more reliable basis to indicate the occupancy of the human body in the whole space volume. Compared with the coarse SMPL model, the geometry volume leverages the multi-view image-conditioned features to enhance the coarse geometry prior, so it adapts to the shape of the target human body. It only preserves the effective volume regions with body content, including clothing regions, because the SparseConvNet learns during training to extend the features towards regions with content, based on the image-conditioned features that carry instructive context information at each feature point. Besides, the geometry volume also benefits our progressive rendering pipeline, which is detailed in Section 3.4.

3.2 Density Network

The density network predicts the opacity of each sampling point, which is highly related to the geometry of the human body, such as its posture and shape. Through the geometry-guided multi-view integration module in Section 3.1, we construct a geometry feature volume that provides reliable geometry information about the target human body. As shown in Figure 2 (b), for each sampling point, we obtain its corresponding geometry-related feature from the geometry feature volume based on its coordinate. Although the feature volume provides the geometry information of the human body, such geometry-related features are coarse and may lose some fine image-conditioned details that benefit high-fidelity rendering. To compensate for this information loss, we combine both kinds of features at the same sampling point to predict its density value more accurately: we concatenate the geometry-related feature with the mean and variance feature embeddings of the point's pixel-aligned image features, which contain more detailed information, and process the concatenated feature through a density MLP to predict the density value at this point.
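As a rough illustration of how the density MLP input could be assembled, the sketch below concatenates a geometry feature with the per-channel mean and variance of the view features. The function names are hypothetical, and the use of the population variance (dividing by the number of views) is an assumption; the text does not say which variance estimator is used.

```python
def mean_and_variance(view_feats):
    """Per-channel mean and population variance over M view features."""
    m = len(view_feats)
    channels = range(len(view_feats[0]))
    mean = [sum(f[c] for f in view_feats) / m for c in channels]
    # Assumption: population variance (divide by M, not M - 1).
    var = [sum((f[c] - mean[c]) ** 2 for f in view_feats) / m for c in channels]
    return mean, var

def density_mlp_input(geometry_feat, view_feats):
    """Concatenate the geometry feature with the view-feature statistics."""
    mean, var = mean_and_variance(view_feats)
    return list(geometry_feat) + mean + var
```

The resulting vector has the geometry feature length plus twice the image feature length, independent of the number of source views.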

3.3 Appearance Network

The appearance network predicts the RGB color value of each sampling point. Since the RGB value is more related to the appearance details of the human body, we use the image-conditioned features as the input to the appearance network for more detailed information. As shown in Figure 2 (c), we first aggregate the pixel-aligned image features from the input views for each color sampling point. Specifically, similar to obtaining the pixel-aligned image features for each SMPL vertex, we project the coordinate of the sampling point to the image plane of each source view and obtain its pixel-aligned feature embedding in that view. We then concatenate the feature embeddings from all source views together with their mean and variance feature embeddings. Finally, an appearance MLP takes the concatenated feature embeddings and predicts the RGB value of the point.

3.4 Geometry-guided Progressive Rendering

We render the human body in the target view through the volumetric rendering following previous NeRF-based methods [22, 28, 12]. Instead of sampling many redundant points for rendering, we introduce an efficient geometry-guided progressive rendering pipeline for the inference process. Our pipeline leverages the geometry volume in Section 3.1 as well as the predicted density values in Section 3.2 to reduce the number of points progressively.

Specifically, we first keep the sampling points that lie within the geometry volume as the valid density sampling points. Compared to the smallest pillar that contains the human body, which is used by previous methods [28, 12], the geometry volume is closer to the human body shape and contains far fewer redundant void sampling points. We then predict the density values of the valid density sampling points through the density network, and the sampling points that have positive density values form a valid density volume. As shown in Figure 2, the valid density volume is very close to the 3D mesh of the target human body, and it removes many further empty regions compared to the geometry volume. We take the sampling points in the valid density volume as the new valid sampling points, and predict their color values through the appearance network in Section 3.3.
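The two pruning stages can be sketched as below. The callables `in_geometry_volume` and `predict_density` are hypothetical stand-ins for the occupancy lookup in the geometry volume and for the density network; they are not part of the paper.

```python
def progressive_prune(points, in_geometry_volume, predict_density):
    """Return (density_points, color_points) after the two pruning stages.

    points:             list of 3D sample coordinates.
    in_geometry_volume: callable, True if a point lies in the geometry volume.
    predict_density:    callable standing in for the density network.
    """
    # Stage 1: keep only samples that fall inside the geometry volume.
    density_points = [p for p in points if in_geometry_volume(p)]
    # Stage 2: evaluate density on the survivors only, and keep the samples
    # with positive density for the appearance network.
    densities = [predict_density(p) for p in density_points]
    color_points = [(p, s) for p, s in zip(density_points, densities) if s > 0.0]
    return density_points, color_points
```

The point of the ordering is that the more expensive predictor in each stage only ever sees the points that survived the cheaper test before it.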

We conduct volume rendering based on the density and color predictions to synthesize the target view. Traditional volume rendering methods march rays from the target camera through the pixels of the target view image and then sample points on each ray. The color of each ray is accumulated from the densities and colors of its sampling points and the distances between adjacent sampling points.
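For reference, the standard discrete volume rendering quadrature of NeRF [22], which the text says this pipeline follows and which Section 3.5 refers to as Eq. (3), can be written as below. The symbols are notation assumed for this sketch and are not necessarily the authors' own: $\sigma_k$ and $c_k$ are the predicted density and color of the $k$-th sampling point on ray $r$, $\delta_k$ is the distance between adjacent sampling points, and $N$ is the number of sampling points on the ray.

```latex
\hat{C}(r) = \sum_{k=1}^{N} T_k \left(1 - \exp(-\sigma_k \delta_k)\right) c_k,
\qquad
T_k = \exp\!\left(-\sum_{j=1}^{k-1} \sigma_j \delta_j\right)
```

Here $T_k$ is the accumulated transmittance up to the $k$-th sample. A sampling point with zero density contributes nothing to the sum and leaves the transmittance unchanged, which is why removing empty points does not change the rendered color.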


For our progressive rendering pipeline, we use projection to bind the sampling points to rays. Concretely, we project the points within the geometry volume to the target view, take the nearest four pixels of each projected point as valid pixels through which to march a ray, and then uniformly sample points between the near and far bounds of each ray as in [28, 12]. We only process the sampling points within the valid volume regions and then conduct volume rendering along these rays.
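One way to read the "nearest four pixels" rule is sketched below: each projected point marks the four integer pixels surrounding its fractional location. This interpretation, the function name, and the bounds check are assumptions for illustration.

```python
import math

def valid_pixels(projected_xy, width, height):
    """Collect the four integer pixels surrounding each projected point."""
    pixels = set()
    for x, y in projected_xy:
        x0, y0 = math.floor(x), math.floor(y)
        for px, py in ((x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)):
            # Discard neighbours that fall outside the target image.
            if 0 <= px < width and 0 <= py < height:
                pixels.add((px, py))
    return pixels
```

Because the result is a set, many geometry-volume points that project to the same neighbourhood produce only one ray per pixel.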

Experiments in Section 4.4 verify that our geometry-guided progressive rendering pipeline significantly reduces the memory and time consumption during rendering, and that removing noisy, unnecessary sampling points can even improve the rendering quality.

3.5 Training

During training, we do not deploy the progressive rendering pipeline in Section 3.4, because it is useful only when our density network is reliable. Instead, we march rays from the target camera through pixels randomly sampled on the image, while ensuring that no fewer than half of the pixels are on the human body. We uniformly sample points on the rays to predict the corresponding density and color values. By performing the volume rendering in Eq. (3), we obtain the predicted color for each ray. To supervise the network, we calculate the mean squared error between the predicted color and the corresponding ground-truth color as our training loss.
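A minimal sketch of the training-time pixel sampling and loss is given below. The function names and the pixel-list representation are hypothetical, and the strategy of drawing exactly half (rounded up) of the pixels from the body and the rest from all remaining pixels is one assumed way to satisfy the "no fewer than half" constraint.

```python
import random

def sample_training_pixels(all_pixels, body_pixels, n_rays, rng):
    """Pick n_rays distinct pixels with at least half of them on the body."""
    n_body = (n_rays + 1) // 2                     # ceil(n_rays / 2)
    on_body = rng.sample(body_pixels, n_body)
    taken = set(on_body)
    # The rest may land anywhere else in the image, body or background.
    remaining = [p for p in all_pixels if p not in taken]
    return on_body + rng.sample(remaining, n_rays - n_body)

def mse_loss(predicted, target):
    """Mean squared error over all rays and RGB channels."""
    diffs = [(p - t) ** 2
             for pred_rgb, true_rgb in zip(predicted, target)
             for p, t in zip(pred_rgb, true_rgb)]
    return sum(diffs) / len(diffs)
```

For example, `sample_training_pixels(all_pixels, body_pixels, 1024, random.Random(0))` would return 1024 pixel coordinates, at least 512 of which lie on the body mask.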

4 Experiments

We study four questions in experiments. 1) Is GP-NeRF able to improve the fitting and generalization performance of human synthesis on seen and unseen scenarios (Section 4.3)? 2) Is GP-NeRF effective at reducing the time and memory cost of rendering (Section 4.4)? 3) How does each individual design choice affect model performance (Section 4.5)? 4) Can GP-NeRF provide promising results for both human rendering and 3D reconstruction (Section 4.6)? We describe the datasets and evaluation metrics in Section 4.1, and our default implementation setting in Section 4.2.

Method            Train data       Test data    PSNR (↑)   SSIM (↑)

Performance on training frames
NT [37]           ZJU train        ZJU train    23.86      0.896
NHR [39]          ZJU train        ZJU train    23.95      0.897
NB [28]           ZJU train        ZJU train    28.51      0.947
NHP [12]          ZJU train        ZJU train    28.73      0.936
GP-NeRF (Ours)    ZJU train        ZJU train    28.91      0.944

Performance on unseen frames from training data
NV [19]           ZJU train        ZJU train    22.00      0.818
NT [37]           ZJU train        ZJU train    22.28      0.872
NHR [39]          ZJU train        ZJU train    22.31      0.871
NB [28]           ZJU train        ZJU train    23.79      0.887
NHP [12]          ZJU train        ZJU train    26.94      0.929
GP-NeRF (Ours)    ZJU train        ZJU train    27.92      0.934

Performance on test frames from test data
NV [19]           ZJU train        ZJU test     20.84      0.827
NT [37]           ZJU train        ZJU test     21.92      0.873
NHR [39]          ZJU train        ZJU test     22.03      0.875
NB [28]           ZJU train        ZJU test     22.88      0.880
PVA [30]          ZJU train        ZJU test     23.15      0.866
Pixel-NeRF [41]   ZJU train        ZJU test     23.17      0.869
NHP [12]          ZJU train        ZJU test     24.75      0.906
GP-NeRF (Ours)    ZJU train        ZJU test     25.96      0.921

Generalization performance across datasets
NHP [12]          AIST             ZJU test     17.05      0.771
GP-NeRF (Ours)    THUman subset    ZJU test     24.74      0.907
GP-NeRF (Ours)    THUman-all       ZJU test     25.60      0.917

Table 1: Synthesis performance comparison. "ZJU train" and "ZJU test" denote the training and test sequences of ZJU-MoCap. Our proposed method outperforms existing methods on all generalization settings and is competitive on the training frames, where NB [28] attains a higher SSIM.

4.1 Datasets and Metrics

We train and evaluate our method on the ZJU-MoCap dataset [28] and the THUman dataset [43]. ZJU-MoCap contains multi-view sequences captured with synchronized cameras. We split the sequences into a training set and a test set with the remaining sequences, following [12] for a fair comparison. THUman contains 3D scans of human bodies. A portion of the scans is taken as the training set, and the remaining scans form the test set. We render images for each scan from virtual cameras, which are placed uniformly on the horizontal plane.

To evaluate the rendering performance, we choose two metrics, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), following [22, 28]. For the 3D reconstruction, we only provide qualitative results since the corresponding ground truth is not available.
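As a reminder of how the first metric is defined, PSNR can be computed from the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition on flattened pixel values; the function name and the assumption that intensities lie in [0, 1] (so `max_value=1.0`) are illustrative choices, not details taken from the paper.

```python
import math

def psnr(predicted, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
    if mse == 0:
        return math.inf                  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.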

4.2 Implementation Details

In our implementation, we perform training and inference under uniformly spaced camera views (the Uniform setting). We use a U-Net-like architecture [38] as our backbone to extract the image features described in Section 3. We sample points uniformly between the near and far bounds on each ray. For training, we use the Adam optimizer [11] with an exponentially decaying learning rate. We use one RTX 3090 GPU for both training and inference.

4.3 Synthesis Performance Analysis

In Table 1, we compare our human rendering results to previous state-of-the-art methods. To evaluate the capacity for fitting and generalization at different levels, we train our framework on the first frames of the training video sequences of ZJU-MoCap, and test on 1) the training frames, 2) unseen frames of the training sequences, and 3) test frames from the test sequences. The results in Table 1 verify our stronger generalization capacity on unseen scenarios. We also achieve competitive fitting performance on the training frames, comparable even to the per-scene optimization methods [37, 39, 28].

Notably, our method outperforms the state-of-the-art NHP [12], which uses the geometry prior together with features from multi-view videos. Specifically, for unseen poses and unseen bodies, we outperform NHP by 0.98 and 1.21 dB on PSNR, and by 0.005 and 0.015 on SSIM, respectively, using only single-frame input. We also conduct generalization experiments across two datasets with different domains. We train our model on a random subset of human bodies from the THUman dataset and on all of its human bodies (THUman-all) separately, and test the synthesis performance on the test frames of the ZJU-MoCap test sequences. From Table 1, we observe that our method outperforms NHP by a large margin under the cross-dataset evaluation setup, i.e., by around 7.7 dB on PSNR and 0.136 (13.6 points) on SSIM when trained on the subset. All these results demonstrate the effectiveness of our geometry-guided multi-view information integration approach.

4.4 Efficiency Analysis

In Table 2, we analyze the efficiency improvements gained from our progressive pipeline on the first 300 frames of the 315 (Taichi) sequence in the ZJU-MoCap dataset. We report the average per-sample inference time in milliseconds. For all methods, the time is measured on an NVIDIA GeForce RTX 3090 GPU and an Intel i7-11700 CPU @ 2.50GHz, with PyTorch and CUDA.

Method               Chunks   Rays (M) (↓)      Density points (M) (↓)   Color points (M) (↓)   Time (ms) (↓)   Mem (GB) (↓)
NB [28]              2        0.063             4.03                     4.03                   611             21.80
GP-NeRF (vanilla)    3        0.063 (-0.0%)     4.03 (-0.0%)             4.03 (-0.0%)           589 (-3.6%)     14.53 (-33.3%)
GP-NeRF (vanilla)    2        0.063 (-0.0%)     4.03 (-0.0%)             4.03 (-0.0%)           567 (-7.2%)     20.74 (-4.9%)
GP-NeRF              2        0.039 (-38.1%)    0.95 (-76.4%)            0.24 (-94.0%)          243 (-60.2%)    9.88 (-54.7%)
GP-NeRF              1        0.039 (-38.1%)    0.95 (-76.4%)            0.24 (-94.0%)          175 (-71.4%)    14.25 (-34.6%)

Method               Chunks   Density T-MLP (ms) (↓)   Density T-total (ms) (↓)   Appearance T-MLP (ms) (↓)   Appearance T-total (ms) (↓)   PSNR (↑)
GP-NeRF (vanilla)    2        108.58                   226.56                     145.38                      146.39                        26.56
GP-NeRF              2        28.08 (-74.1%)           83.65 (-63.1%)             10.02 (-93.1%)              11.4 (-92.2%)                 26.67 (+0.4%)
GP-NeRF              1        23.55 (-78.3%)           74.07 (-67.3%)             9.50 (-93.5%)               10.27 (-93.0%)                26.67 (+0.4%)

Table 2: Computation and memory cost comparison. GP-NeRF (vanilla) has the same structure as our GP-NeRF but adopts the vanilla rendering technique. Chunks indicates the number of chunks into which the sampling points are split for processing. Rays is the number of sampling rays; Density points and Color points are the numbers of sampling points passed through the density network and the appearance network, respectively. Density T-total is the total time cost from the backbone output to the density volume, including Density T-MLP, the forwarding time of the density MLP. Appearance T-total is the time from the density volume to the color prediction, and Appearance T-MLP is the forwarding time of the appearance MLP.

Given the limited GPU memory, our final GP-NeRF can process all the sampling points in one run, whereas the vanilla-rendering variant of GP-NeRF and NB [28] require at least two chunks. As shown in the upper panel of Table 2, compared to NB, which also uses the SMPL bounds to remove redundant marched rays, our GP-NeRF further removes 38.1% of the rays and 76.4% of the density sampling points by referring to the constructed geometry volume, and removes 94.0% of the color sampling points based on the valid density volume. Correspondingly, our GP-NeRF takes 175 ms per image for the whole rendering procedure, 71.4% less than NB and 70.3% less than the vanilla-rendering variant with three chunks (589 ms), which uses nearly the same GPU memory. For a fair comparison with the two-chunk vanilla-rendering variant, we also test GP-NeRF with two chunks; our progressive pipeline still reduces the time cost by 57.1% (567 ms to 243 ms) and the memory cost by 52.4% (20.74 GB to 9.88 GB), which verifies the significant efficiency improvement from the proposed rendering pipeline.
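The relative savings discussed above follow directly from the time and memory columns of Table 2. The short sketch below recomputes them; the helper name `relative_reduction` is introduced only for this illustration, and the input values are copied from Table 2.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

# (label, baseline, value) triples taken from Table 2.
comparisons = [
    ("time vs. NB, 1 chunk (ms)", 611, 175),
    ("time vs. vanilla rendering, 3 chunks (ms)", 589, 175),
    ("time vs. vanilla rendering, 2 chunks (ms)", 567, 243),
    ("memory vs. vanilla rendering, 2 chunks (GB)", 20.74, 9.88),
]
for label, baseline, value in comparisons:
    print(f"{label}: {relative_reduction(baseline, value):.1f}% lower")
```

Note that the percentages printed inside Table 2 are all relative to NB, so the 2-chunk GP-NeRF row shows -60.2% there, while the comparison against the 2-chunk vanilla-rendering variant uses 567 ms as the baseline.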

In the bottom panel of Table 2, we compare the time cost of each component in GP-NeRF with that of the vanilla-rendering variant without progressive point reduction. The results show that, simply by using our progressive rendering pipeline on the same network structures, we reduce the density MLP forwarding time by over 74% and the total density-related time (Density T-total) by over 63%. Our pipeline also reduces the appearance MLP forwarding time by over 93%. Moreover, our progressive pipeline improves efficiency significantly while even improving PSNR by 0.11 dB (0.4%), as it ignores some noisy sampling points during rendering that might degrade the performance.

Variant   G    Q    P    PSNR (↑)   SSIM (↑)
G         ✓    -    -    23.47      0.880
QG        ✓    ✓    -    23.68      0.885
P         -    -    ✓    26.09      0.915
QG+P      ✓    ✓    ✓    26.69      0.924

Table 3: Ablations: feature integration. G, Q, P are different approaches to obtain input features for the shared density and appearance network. G: geometry feature volume; Q: integrate multi-view information at each geometry vertex with geometry-guided attention; P: pixel-aligned image features.
Disen.   Den.    App.    Steps    PSNR (↑)   SSIM (↑)
-        QG+P    QG+P    5000     26.05      0.912
✓        QG+P    QG+P    5000     26.13      0.917
✓        QG+P    P       5000     26.16      0.920
✓        QG      P       5000     25.71      0.904
-        QG+P    QG+P    35000    26.69      0.924
✓        QG+P    QG+P    35000    26.65      0.925
✓        QG+P    P       35000    26.67      0.923
✓        QG      P       35000    26.40      0.918

Table 4: Ablations: progressive structure. G, Q, P have the same meanings as in Table 3. Disen. indicates whether the density (Den.) and appearance (App.) networks are decoupled into a progressive pipeline. Steps is the number of training steps. The Den. and App. columns list the components of the input features of each network.

4.5 Ablation Studies

We conduct ablation studies under the uniform camera setting in Section 4.2 to verify the effect of our main components on generalization capacity. We train our model on the training sequences of the ZJU-MoCap dataset for up to 35k steps (see Table 4) and validate it on the remaining 3 sequences.

Feature Integration. In Table 3, we explore the effectiveness of the proposed geometry-guided feature integration mechanism on the baseline, i.e., GP-NeRF without the progressive rendering pipeline. As shown in Table 3, adaptively aggregating multi-view image features with the guidance of the geometry prior to construct the geometry feature volume (QG) achieves better performance (0.21 dB and 0.005 improvements on PSNR and SSIM, respectively) than the baseline that simply uses the mean of the multi-view image features (G), as the proposed geometry-guided attention module helps focus more on the views corresponding to the geometry prior. We also observe that the baseline using only pixel-aligned image features (P) gains 2.62 dB PSNR and 0.035 SSIM over the baseline using only the geometry feature (G), as it captures more detailed appearance features from the images for high-fidelity rendering. Moreover, by combining the geometry feature and its corresponding detailed image features (QG+P), we improve upon P by 0.60 dB PSNR and 0.009 SSIM. This indicates that the geometry features and the pixel-aligned image features complement each other and together yield better generalization performance on unseen scenarios.

Progressive Structure. Our efficient progressive rendering pipeline in Section 3.4 requires a progressive structure of the density and appearance networks. Based on the same experimental settings, we further decouple the density and appearance networks to form a progressive pipeline as in Figure 2 and evaluate the performance. As shown in Table 4, the progressive structure does not harm the performance and even reaches relatively high performance faster. This is because it allows the two networks to learn their different focuses, thus improving the performance more quickly during training. For the density network, involving the more detailed image features P enhances the relatively coarse geometry feature QG and brings an SSIM improvement of around 0.005 at 35000 steps (0.016 at 5000 steps). The results also show that the geometry feature QG has much more impact on the geometry-related density prediction than on the appearance-related color prediction.

Figure 3: Visual comparison of human rendering results. Compared with other methods, ours synthesizes more high-fidelity details such as clothing wrinkles and reconstructs the body shape more accurately. Our synthesis adheres to normal human body geometry better than methods without geometry priors such as NT and NHR. We also recover more accurate lighting conditions on unseen bodies than the previous video-based generalizable method NHP (see (b) and (c)).

4.6 Visualization

We visualize our human rendering results under three uniform camera views in different experimental settings (Figure 3). As Figure 3 (a), (b) and (c) show, compared with other approaches, our method achieves better quality on unseen poses or bodies by synthesizing more high-fidelity details such as clothing wrinkles and reconstructing the body shape more accurately. In Figure 3 (d), we show rendering results on unseen bodies of the THUman dataset after training on that dataset. Our method generalizes well within the THUman dataset and synthesizes accurate details.

Figure 4: Visualization of our 3D reconstruction results. The mesh color is only for clearer visualization. By integrating multi-view information to form a complete geometry volume adapted to the target human body, our method can compensate for some limitations of SMPL (e.g., inaccuracy or missing clothing information), and can generally reconstruct a very close human body shape and even clothing details such as hoods and folds on unseen human bodies (see (b)). We generalize better to unseen human bodies than previous image-based 3D reconstruction methods such as PIFuHD, which predicts incomplete or redundant body parts in its reconstruction results (see (b)).

In Figure 4, we visualize the density volume from the density MLP in Section 3.2 as the mesh results of our 3D reconstruction. Unlike previous methods, which densely sample points within the bounds of the geometry prior and use the density network to determine the inside points for mesh construction, our progressive pipeline determines the sampling points directly from the geometry volume in Section 3.1, which contains far fewer redundant points and is thus more efficient for 3D reconstruction. We then construct the mesh from the points with higher density values. As Figure 4 (b) shows, previous image-based 3D reconstruction methods such as PIFuHD [32] cannot generalize well to unseen human bodies. Besides being less efficient because they make predictions for many redundant sampling points, they are more likely to predict body parts that do not conform to a normal human body structure, because they cannot integrate and adapt the given geometry information as well as our method does. By integrating multi-view information to form a complete geometry volume adapted to the target human body, our method can generally reconstruct a very close human body shape and even clothing details such as folds, even on unseen human bodies (Figure 4 (b)).
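The efficiency argument above can be illustrated by counting candidate points. The sketch below compares the number of voxels in a padded bounding box around a geometry prior (dense sampling) with the number of voxels lying within a small neighborhood of the prior's vertices (sampling guided by the prior). It is a toy illustration under stated assumptions: the voxelization scheme, the `dilation` parameter, and the two-vertex "prior" are hypothetical and do not reproduce the geometry volume of Section 3.1.

```python
import math
from itertools import product


def voxel_index(point, voxel_size):
    """Map a 3D point to its integer voxel coordinates."""
    return tuple(math.floor(c / voxel_size) for c in point)


def candidate_voxels(vertices, voxel_size, dilation=1):
    """Collect voxels within `dilation` cells of any prior vertex."""
    offsets = list(product(range(-dilation, dilation + 1), repeat=3))
    occupied = set()
    for v in vertices:
        bx, by, bz = voxel_index(v, voxel_size)
        for dx, dy, dz in offsets:
            occupied.add((bx + dx, by + dy, bz + dz))
    return occupied


def dense_voxel_count(vertices, voxel_size, dilation=1):
    """Count all voxels in the bounding box of the prior, padded by `dilation`."""
    indices = [voxel_index(v, voxel_size) for v in vertices]
    count = 1
    for axis in range(3):
        coords = [idx[axis] for idx in indices]
        count *= max(coords) - min(coords) + 1 + 2 * dilation
    return count


# Toy stand-in for the geometry prior: two far-apart vertices.
vertices = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
sparse = candidate_voxels(vertices, voxel_size=0.25)
dense = dense_voxel_count(vertices, voxel_size=0.25)
```

Only the sparse candidates would then be passed to the density network, and the points with higher predicted density kept for mesh construction.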

5 Conclusion

We propose a geometry-guided progressive NeRF model for generalizable and efficient free-viewpoint human rendering under sparse camera settings. With our geometry-guided multi-view feature aggregation approach, the geometry prior is effectively enhanced by the integrated multi-view information and forms a complete geometry volume adapted to the target human body. Combining the geometry feature volume with the detailed image-conditioned features benefits generalization on unseen scenarios. We also introduce a progressive rendering pipeline for higher efficiency, which reduces rendering time cost by over 70% without performance degradation. Experimental results on two datasets verify that our model significantly outperforms previous methods in both generalization capacity and efficiency.


References

  • [1] Aliev, K.A., Sevastopolsky, A., Kolos, M., Ulyanov, D., Lempitsky, V.: Neural point-based graphics. In: ECCV (2020)
  • [2] Carranza, J., Theobalt, C., Magnor, M.A., Seidel, H.P.: Free-viewpoint video of human actors. In: ACM Trans. on Graphics (2003)
  • [3] Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe, H., Kirk, A., Sullivan, S.: High-quality streamable free-viewpoint video. In: ACM Trans. on Graphics (2015)
  • [4] De Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. In: ACM Trans. on Graphics (2008)
  • [5] Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques (2000)
  • [6] Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A., Escolano, S.O., Rhemann, C., Kim, D., Taylor, J., et al.: Fusion4d: Real-time performance capture of challenging scenes. In: ACM Trans. on Graphics (2016)
  • [7] Flynn, J., Broxton, M., Debevec, P., DuVall, M., Fyffe, G., Overbeck, R., Snavely, N., Tucker, R.: Deepview: View synthesis with learned gradient descent. In: CVPR (2019)
  • [8] Gall, J., Stoll, C., De Aguiar, E., Theobalt, C., Rosenhahn, B., Seidel, H.P.: Motion capture using joint skeleton tracking and surface estimation. In: CVPR (2009)
  • [9] Graham, B., Engelcke, M., Van Der Maaten, L.: 3d semantic segmentation with submanifold sparse convolutional networks. In: CVPR (2018)
  • [10] Guo, K., Lincoln, P., Davidson, P., Busch, J., Yu, X., Whalen, M., Harvey, G., Orts-Escolano, S., Pandey, R., Dourgarian, J., et al.: The relightables: Volumetric performance capture of humans with realistic relighting. In: ACM Trans. on Graphics (2019)
  • [11] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [12] Kwon, Y., Kim, D., Ceylan, D., Fuchs, H.: Neural human performer: Learning generalizable radiance fields for human performance rendering. In: NeurIPS (2021)
  • [13] Li, T., Slavcheva, M., Zollhoefer, M., Green, S., Lassner, C., Kim, C., Schmidt, T., Lovegrove, S., Goesele, M., Lv, Z.: Neural 3d video synthesis. arXiv (2021)
  • [14] Liao, Y., Schwarz, K., Mescheder, L., Geiger, A.: Towards unsupervised learning of generative models for 3d controllable image synthesis. In: CVPR (2020)
  • [15] Liu, L., Gu, J., Lin, K.Z., Chua, T.S., Theobalt, C.: Neural sparse voxel fields. arXiv (2020)
  • [16] Liu, L., Xu, W., Zollhoefer, M., Kim, H., Bernard, F., Habermann, M., Wang, W., Theobalt, C.: Neural rendering and reenactment of human actor videos. In: ACM Trans. on Graphics (2019)
  • [17] Liu, S., Zhang, Y., Peng, S., Shi, B., Pollefeys, M., Cui, Z.: Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In: CVPR (2020)
  • [18] Liu, T., Zhang, J., Nie, X., Wei, Y., Wei, S., Zhao, Y., Feng, J.: Spatial-aware texture transformer for high-fidelity garment transfer. In: IEEE Trans. on Image Processing (2021)
  • [19] Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., Sheikh, Y.: Neural volumes: Learning dynamic renderable volumes from images. In: ACM Trans. on Graphics (2019)
  • [20] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. In: ACM Trans. on Graphics (2015)
  • [21] Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: CVPR (2019)
  • [22] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
  • [23] Natsume, R., Saito, S., Huang, Z., Chen, W., Ma, C., Li, H., Morishima, S.: Siclope: Silhouette-based clothed people. In: CVPR (2019)
  • [24] Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In: CVPR (2015)
  • [25] Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In: CVPR (2020)
  • [26] Olszewski, K., Tulyakov, S., Woodford, O., Li, H., Luo, L.: Transformable bottleneck networks. In: ICCV (2019)
  • [27] Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Deformable neural radiance fields. arXiv (2020)
  • [28] Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: CVPR (2021)
  • [29] Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: CVPR (2021)
  • [30] Raj, A., Zollhoefer, M., Simon, T., Saragih, J., Saito, S., Hays, J., Lombardi, S.: Pva: Pixel-aligned volumetric avatars. arXiv (2021)
  • [31] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: ICCV (2019)
  • [32] Saito, S., Simon, T., Saragih, J., Joo, H.: Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In: CVPR (2020)
  • [33] Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., Zollhofer, M.: Deepvoxels: Learning persistent 3d feature embeddings. In: CVPR (2019)
  • [34] Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Continuous 3d-structure-aware neural scene representations. arXiv (2019)
  • [35] Stoll, C., Gall, J., De Aguiar, E., Thrun, S., Theobalt, C.: Video-based reconstruction of animatable human characters. In: ACM Trans. on Graphics (2010)
  • [36] Su, Z., Xu, L., Zheng, Z., Yu, T., Liu, Y., Fang, L.: Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera. In: ECCV (2020)
  • [37] Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: Image synthesis using neural textures. In: ACM Trans. on Graphics (2019)
  • [38] Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin-Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based rendering. In: CVPR (2021)
  • [39] Wu, M., Wang, Y., Hu, Q., Yu, J.: Multi-view neural human rendering. In: CVPR (2020)
  • [40] Xu, X., Loy, C.C.: 3D human texture estimation from a single image with transformers. In: ICCV (2021)
  • [41] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. arXiv (2020)
  • [42] Yuan, W., Lv, Z., Schmidt, T., Lovegrove, S.: Star: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering. In: CVPR (2021)
  • [43] Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: Deephuman: 3d human reconstruction from a single image. In: ICCV (2019)
  • [44] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. arXiv (2018)