DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Rendering

06/07/2021 · by Ruizhi Shao et al.

We introduce DoubleField, a novel representation combining the merits of both surface field and radiance field for high-fidelity human rendering. Within DoubleField, the surface field and radiance field are associated together by a shared feature embedding and a surface-guided sampling strategy. In this way, DoubleField has a continuous but disentangled learning space for geometry and appearance modeling, which supports fast training, inference, and finetuning. To achieve high-fidelity free-viewpoint rendering, DoubleField is further augmented to leverage ultra-high-resolution inputs, where a view-to-view transformer and a transfer learning scheme are introduced for more efficient learning and finetuning from sparse-view inputs at original resolutions. The efficacy of DoubleField is validated by the quantitative evaluations on several datasets and the qualitative results in a real-world sparse multi-view system, showing its superior capability for photo-realistic free-viewpoint human rendering. For code and demo video, please refer to our project page:




1 Introduction

Surface fields Park et al. (2019); Mescheder et al. (2019); Chen and Zhang (2019) and radiance fields Mildenhall et al. (2020); Zhang et al. (2020) have recently emerged as promising solutions to model 3D geometry and appearance in a resolution-independent and continuous manner. Though significant progress has been made towards detailed geometry recovery Saito et al. (2019, 2020); Zheng et al. (2021b); Hong et al. (2021) and photo-realistic texture recovery and rendering Yu et al. (2021a); Peng et al. (2021), the limitations of these representations become apparent when considering the simultaneous reconstruction of both geometry and appearance.

The limitations of existing surface fields and radiance fields originate from a trade-off between continuity and disentanglement. Specifically, the reconstruction approaches Saito et al. (2019); Niemeyer et al. (2020); Liu et al. (2020b) built upon surface fields typically learn texture on the surface, so the distribution of the predicted texture is highly concentrated on the surface. Such a narrow texture field is typically discontinuous and hinders the optimization processes of differentiable rendering. By contrast, radiance fields Mildenhall et al. (2020); Zhang et al. (2020) enable the learning of a continuous texture field, but their geometry is entangled with the texture and lacks sufficient constraints. Such geometry-appearance entanglement not only leads to inconsistency and artifacts in the geometry recovery, especially under sparse multi-view settings, but also makes the training and inference of radiance fields very time-consuming Mildenhall et al. (2020).

To overcome the limitations of existing neural field representations, we propose a novel DoubleField framework to bridge surface and radiance fields and enable a continuous but disentangled space for geometry and appearance learning. Specifically, we build associations between surface and radiance fields from the aspects of the network architecture and sampling strategy. 1) In our network architecture, an intermediate MLP learns a shared double embedding for both the two fields. The shared learning space facilitates the update of both fields with the back-propagated gradients so that both geometry and appearance can be learned in a continuous manner. 2) A surface-guided sampling strategy is proposed to first sample sparse points to determine the intersected surface and then sample dense points around the surface for volume rendering in the radiance field. Such a strategy imposes the geometry constraint for the radiance field and disentangles the geometry component from the appearance modeling, which not only accelerates the learning process but also improves the quality and consistency of the free viewpoint rendering results. With the proposed architecture and sampling strategy, DoubleField combines the merits of the two fields and naturally supports efficient finetuning on new data with self-supervision signals based on differentiable rendering.

Method                        Training   Geometry  Rendering    Finetuning     Supervision   Inference
PIFu Saito et al. (2019)      fast       high      traditional  not supported  3D scan data  fast
PixelNeRF Yu et al. (2021a)   very slow  low       neural       slow           images only   slow
DoubleField (ours)            fast       high      neural       fast           images/3D     fast
Table 1: A brief comparison between our work, PIFu Saito et al. (2019), and PixelNeRF Yu et al. (2021a).

To fully exploit the potential of DoubleField, we take one more step forward to leverage ultra-high-resolution inputs. Instead of learning on coarse image features only, DoubleField is further augmented with a view-to-view transformer to directly take the raw RGB values of the images at their original resolution as inputs. This is motivated by the observation that free-viewpoint rendering can be regarded as a view-to-view problem, i.e., generating novel-view images given sparse-view images, which is reminiscent of the text-to-text formulation of typical NLP tasks Raffel et al. (2019). For more efficient modeling of high-fidelity appearance, we adopt a transductive transfer learning scheme for our network. Specifically, our network is first trained on a low-resolution pre-training task to learn a general multi-view prior and then transferred and adapted to the high-fidelity domain by fast finetuning when dealing with ultra-high-resolution inputs. However, this is non-trivial since finetuning on sparse-view images is prone to overfitting. To overcome this issue, we conduct comprehensive experiments to measure the influence of different modules and empirically introduce a bottom-up finetuning strategy that avoids overfitting with fast convergence. The experimental results on human reconstruction from sparse-view inputs demonstrate the state-of-the-art performance and high-fidelity rendering quality of our approach. The comparison of the proposed DoubleField representation with existing ones is summarized in Table 1. Our contributions are as follows:

1. We propose a novel representation DoubleField to combine the merits of both surface and radiance fields. We bridge these two fields via a shared double embedding and a surface-guided sampling strategy so that DoubleField has a continuous but disentangled learning space for geometry and appearance modeling.

2. We further augment DoubleField to support high-fidelity rendering by introducing a view-to-view transformer to take the raw RGB values of the ultra-high-resolution images as inputs. The view-to-view transformer learns the texture mapping from the known viewpoints to the query viewpoints on the ultra-high-resolution domain.

3. We exploit a transfer learning scheme and a bottom-up finetuning strategy for more efficient training of our network so that it can have a fast convergence speed while avoiding the overfitting issue. In this way, DoubleField can produce high-fidelity free-view rendering results given only sparse-view inputs, which demonstrates significant performance improvements upon prior work.

2 Related Work

Neural implicit field

Recently, neural implicit fields have emerged as powerful representations for geometry reconstruction and graphics rendering. Compared with traditional explicit representations such as meshes, volumes, and point clouds, neural implicit fields encode 3D models via neural networks that directly map 3D locations or viewpoints to the corresponding properties of occupancy Mescheder et al. (2019); Chen and Zhang (2019), SDF Park et al. (2019), volumes Lombardi et al. (2019), radiance Mildenhall et al. (2020), etc. Conditioned on spatial coordinates rather than discrete voxels or vertices, neural implicit fields are continuous, resolution-independent, and more flexible, which enables higher-quality surface recovery and photo-realistic rendering. For geometry reconstruction, methods based on surface fields Saito et al. (2019, 2020); Xu et al. (2019) can generate detailed models from one or a few images, and high-fidelity geometry is achieved using local implicit fields Jiang et al. (2020a); Chabra et al. (2020). For graphics rendering, methods based on implicit fields are well suited to differentiable rendering Liu et al. (2020b); Yariv et al. (2020); Jiang et al. (2020b); Sitzmann et al. (2019); Mildenhall et al. (2020). Among them, the recently proposed NeRF Mildenhall et al. (2020) has made significant progress in novel view synthesis and photo-realistic rendering, which inspires many derivative methods Yu et al. (2021a); Martin-Brualla et al. (2021); Schwarz et al. (2020); Wang et al. (2021); Liu et al. (2020a); Pumarola et al. (2021) and applications.

Multi-view human reconstruction

There are numerous efforts devoted to capturing template-based human bodies from multi-view cameras at different levels, including shape and pose Huang et al. (2017); Liang and Lin (2019) and cloth surfaces De Aguiar et al. (2008); Vlasic et al. (2008); Gall et al. (2009); Dou et al. (2013); Xu et al. (2018). Limited by the representation ability, these methods typically produce low-quality results for both geometry and appearance recovery. Moreover, it is also difficult for template-based algorithms to handle topology changes. Other approaches to high-quality human reconstruction require extremely expensive setups such as dense viewpoints Joo et al. (2018); Wu et al. (2020) or even controlled lighting Collet et al. (2015); Guo et al. (2019). Recently, implicit fields Huang et al. (2018); Zheng et al. (2021a); Saito et al. (2019) have enabled detailed geometry reconstruction from sparse views. Based on sparse RGB-D cameras, high-fidelity geometry reconstruction can also be achieved in real time Yu et al. (2021b). However, the simultaneous reconstruction of high-fidelity geometry and appearance from sparse-view inputs remains very challenging for existing solutions.

Transformer

The efficacy of the Transformer has recently been shown in a wide range of NLP and CV problems Devlin et al. (2018); Dosovitskiy et al. (2020); Yuan et al. (2021). The attention mechanism, the core of the Transformer, has been shown by numerous works to capture long-range dependencies Vaswani et al. (2017); Wang et al. (2018). Its ability to model correlations has been applied to many tasks such as visual question answering Kim et al. (2018), texture transfer Yang et al. (2020), multi-view stereo Luo et al. (2020), and hand pose estimation Huang et al. (2020). Besides, transfer learning based on the Transformer Devlin et al. (2018) has made significant progress in NLP and shown great potential for generalization. In our work, we regard the free-viewpoint rendering problem as a view-to-view problem and apply a transformer to capture the correspondences across the multi-view inputs. Motivated by previous work, we adopt a transfer learning scheme to tackle the learning issue on ultra-high-resolution images.

Figure 2: Comparison of different neural field representations. (a) Neural surface field in PIFu Saito et al. (2019). (b) Neural radiance field in PixelNeRF Yu et al. (2021a). (c) The proposed DoubleField. (d) DoubleField with the raw ultra-high-resolution inputs.

3 Preliminary

Our DoubleField representation is built upon neural surface fields Saito et al. (2019) and radiance fields Mildenhall et al. (2020); Yu et al. (2021a). In this section, we briefly introduce the background of these two fields. Please also refer to Fig. 2 for a comparison of different neural field representations.

Neural Surface Field

The neural surface field, represented as the occupancy field Mescheder et al. (2019); Saito et al. (2019), is a resolution-independent representation for modeling 3D surfaces. As shown in Fig. 2, a surface field can be formulated as an implicit function mapping a 3D point x to the surface field value s ∈ [0, 1], e.g., F_s(x) = s. To improve generalization and obtain detailed geometry, PIFu Saito et al. (2019) conditions it on pixel-aligned image features using the following formulation:

F_s(Φ(x, I), z(x)) = s,

where Φ(x, I) is the image feature located at the projection π(x) of x on the image I, and z(x) is the depth of x in camera space. PIFu further extends this formulation to reconstruct texture on the surface by predicting an RGB color c on the points satisfying F_s(x) = 0.5: F_c(Φ(x, I), z(x)) = c. Though PIFu provides a straightforward solution for jointly modeling geometry and appearance, it isolates geometry from texture and makes the learning space of texture discontinuous and highly concentrated around the surface. Such a discontinuous texture space hinders the optimization process under texture supervision using differentiable rendering techniques Niemeyer et al. (2020).

Neural Radiance Field

As shown in Fig. 2, NeRF Mildenhall et al. (2020) represents a scene as a continuous volumetric radiance field of density σ and color c, which describes geometry and appearance in an entangled form, e.g., F_r(x, d) = (σ, c), where d is the viewing direction. Under this formulation, a 2D image of a novel view can be rendered by integration along camera rays:

C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds),

where r(t) = o + t·d denotes a camera ray with origin o and direction d, T(t) tackles occlusion, and [t_n, t_f] is the pre-defined depth bounds. To achieve novel view synthesis from only sparse multi-view inputs, PixelNeRF Yu et al. (2021a) extends NeRF to leverage pixel-aligned image features in a similar manner to PIFu:

F_r(Φ(x, I), z(x), d) = (σ, c).
Because the entangled modeling of density and color leaves geometry under-constrained during the training of NeRF, the surface learned in PixelNeRF is inconsistent given only sparse-view inputs, which leads to artifacts such as ghosting or blurry results in novel view rendering. In addition, the highly flexible nature of the vanilla NeRF makes the training, inference, and finetuning of its derivative solutions Yu et al. (2021a); Peng et al. (2021) time-consuming.
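In practice, the volume rendering integral above is approximated by the standard alpha-compositing quadrature over discrete samples along each ray. The following numpy sketch illustrates that discretization; function and variable names are ours, not from the paper:

```python
import numpy as np

def volume_render(sigmas, colors, t_vals):
    """Approximate C(r) = ∫ T(t) σ(r(t)) c(r(t), d) dt with NeRF-style
    alpha compositing. sigmas: (N,) densities, colors: (N, 3) per-sample
    RGB, t_vals: (N,) sample depths along the ray."""
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)              # per-segment opacity
    trans = np.cumprod(1.0 - alphas + 1e-10)             # accumulated transmittance
    trans = np.concatenate([[1.0], trans[:-1]])          # shift so T_1 = 1
    weights = alphas * trans                             # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)       # composited RGB

# A ray through empty space (zero density) composites to black:
rgb = volume_render(np.zeros(8), np.ones((8, 3)), np.linspace(0.0, 1.0, 8))
```

The `1e10` padding on the last interval and the `1e-10` floor inside the cumulative product follow the common NeRF implementation convention for numerical stability.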

4 DoubleField Representation

In our approach, a novel neural field representation, DoubleField, is proposed to bridge the surface field and the radiance field. As shown in Fig. 2, DoubleField can be formulated as a mutual implicit function represented by multi-layer perceptrons (MLPs) that fits both the surface field and the radiance field:

F_{df}(x, d) = (s, σ, c).

The MLP shared between the two fields imposes a geometry constraint on the radiance field in an implicit manner and encourages a more consistent density distribution for neural rendering.

Network Architecture

DoubleField is composed of a shared MLP for the double embedding and two individual MLPs for geometry and texture modeling. Without loss of generality, DoubleField is also conditioned on pixel-aligned image features Φ(x, I). In our implementation, given the query point x and viewing direction d, a double MLP E learns a shared double embedding e, which is further decoded by two MLPs M_geo and M_tex for the prediction of the geometry and texture fields. Overall, DoubleField conditioned on pixel-aligned features can be written as:

e = E(Φ(x, I), γ(x)),
(s, σ) = M_geo(e),
c = M_tex(e, d),

where γ(x) is the positional encoding of x, M_geo is a geometry MLP predicting the occupancy s in the surface field and the density σ in the radiance field, and M_tex is a texture MLP predicting the color c in the radiance field. Both M_geo and M_tex are much lighter than the double MLP E, which implicitly builds a strong association between the two fields and imposes the surface constraint on the learning of the radiance field.
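The architecture above, i.e., one heavy shared trunk feeding two light heads, can be sketched as follows. All names (E, M_geo, M_tex), layer widths, and the 63-dimensional positional encoding are illustrative assumptions; the weights are random stand-ins for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def mlp(dims):
    """Random-weight MLP as a stand-in for a trained sub-network."""
    ws = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(dims[:-1], dims[1:])]
    def f(x):
        for w in ws[:-1]:
            x = relu(x @ w)
        return x @ ws[-1]
    return f

feat_dim, embed_dim = 256, 128
E     = mlp([feat_dim + 63, 256, embed_dim])  # shared double MLP: (features, γ(x)) -> e
M_geo = mlp([embed_dim, 64, 2])               # light head: e -> (occupancy s, density σ)
M_tex = mlp([embed_dim + 3, 64, 3])           # light head: (e, view dir d) -> color c

def double_field(phi, gamma_x, d):
    e = E(np.concatenate([phi, gamma_x]))     # shared double embedding
    s, sigma = M_geo(e)                       # geometry for both fields
    c = M_tex(np.concatenate([e, d]))         # radiance color
    return s, sigma, c

s, sigma, c = double_field(rng.standard_normal(feat_dim),
                           rng.standard_normal(63),
                           np.array([0.0, 0.0, 1.0]))
```

Because both heads read the same embedding e, gradients from either field's loss update the shared trunk, which is the mechanism the text credits for coupling the two fields.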

Figure 3: Illustration of the surface-guided sampling strategy.

Sampling Strategy

To facilitate the learning process, we make full use of the surface field and propose a surface-guided sampling strategy for DoubleField. As illustrated in Fig. 3, the surface-guided sampling strategy first determines the intersection points in the surface field and then performs fine-grained sampling around the intersected surface. Specifically, given the camera parameters of the rendering view and the camera rays r(t) = o + t·d, a uniform sampling with N_c points is applied along each ray within the depth bounds [t_n, t_f]. We query the surface field value of each point to determine the first intersection position t* on the surface. These intersection positions are then used to guide sampling at a more fine-grained level by considering the radiance field surrounding the intersected surface within an interval [t* − δ, t* + δ] with N_f sampling points. More details about the sampling process can be found in the Supp. Mat.
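The two-stage procedure can be sketched as below. The occupancy threshold of 0.5, the sample counts, and the interval half-width `delta` are illustrative assumptions (the paper reports 16 fine samples but leaves the other constants to the Supp. Mat.):

```python
import numpy as np

def surface_guided_sample(occ_fn, origin, direction, t_near, t_far,
                          n_coarse=64, n_fine=16, delta=0.05):
    """Stage 1: uniform coarse samples locate the first surface crossing of
    the occupancy field occ_fn along the ray. Stage 2: dense samples are
    placed in a small interval around that crossing."""
    t_coarse = np.linspace(t_near, t_far, n_coarse)
    pts = origin + t_coarse[:, None] * direction
    occ = occ_fn(pts)                        # occupancy in [0, 1] per point
    inside = occ > 0.5
    if not inside.any():                     # ray misses the surface entirely
        return None
    t_hit = t_coarse[np.argmax(inside)]     # first crossing along the ray
    # fine samples restricted to [t_hit - delta, t_hit + delta]
    return np.linspace(t_hit - delta, t_hit + delta, n_fine)

# Toy occupancy field: a unit sphere at the origin; the ray below hits it at t ≈ 2.
sphere = lambda p: (np.linalg.norm(p, axis=-1) < 1.0).astype(float)
t_fine = surface_guided_sample(sphere, np.array([0.0, 0.0, -3.0]),
                               np.array([0.0, 0.0, 1.0]), 0.5, 5.0)
```

Rays whose coarse pass finds no crossing can be skipped entirely, which is one source of the speed and memory savings discussed in Sec. 6.1.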

4.1 DoubleField with Multi-view Inputs

DoubleField can be easily extended to multi-view settings, where pixel-aligned features are extracted from the multi-view images and then fused together for the prediction of the two fields. Specifically, given image inputs {I_i} from K viewpoints and the corresponding camera parameters, the image features are first extracted by the image encoder. For a query point x, the pixel-aligned features Φ(x, I_i) on image I_i are obtained based on the projection of x. These pixel-aligned features extracted from the multi-view images are then fused together as

Φ̃ = Ψ(⊕(Φ(x, I_1), d_1), …, ⊕(Φ(x, I_K), d_K)),

where ⊕ is a concatenation operator, Φ(x, I_i) is the pixel-aligned feature on the i-th viewpoint image, d_i is the viewing direction in the coordinate system of the i-th viewpoint, and Ψ is a feature fusion operation such as average pooling Saito et al. (2019) or self-attention Zheng et al. (2021a). The fused features Φ̃ can be taken as the conditioning features for DoubleField in Eq. 4 to predict the corresponding geometry and appearance.
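A minimal sketch of this fusion, using average pooling for Ψ (the paper's stronger alternative is self-attention); shapes and names are illustrative:

```python
import numpy as np

def fuse_multiview(phis, dirs):
    """Fuse pixel-aligned features from K views: each view contributes the
    concatenation (Φ_i ⊕ d_i), and Ψ here is average pooling across views.
    phis: (K, F) per-view features, dirs: (K, 3) per-view directions."""
    per_view = np.concatenate([phis, dirs], axis=-1)  # (K, F + 3)
    return per_view.mean(axis=0)                      # fused (F + 3,) vector

K, F = 4, 256
fused = fuse_multiview(np.ones((K, F)), np.zeros((K, 3)))
```

Average pooling is order-invariant in the number of views, which is why the same network can accept a varying number of sparse inputs.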

Figure 4: High-fidelity neural human rendering based on DoubleField.

4.2 DoubleField on Ultra-high-resolution Domain

The merits of DoubleField also pave the way toward higher-quality neural rendering. As discussed in previous sections, learning on coarse image features limits the quality of the final rendering results. To overcome this issue, we further augment DoubleField to take the images at their original resolution as additional conditioning inputs (see Fig. 2):

F_{df}(Φ̃, R(x), γ(x), d) = (s, σ, c),

where R(x) denotes the pixel RGB values at the projection of x.

Motivated by the text-to-text problem in NLP Raffel et al. (2019), we design a view-to-view transformer to learn geometry and appearance on the ultra-high-resolution domain with sparse-view inputs. The view-to-view transformer fuses the raw RGB values and multi-view features in its encoder and produces features in the novel-view space via its decoder. Moreover, the raw RGB values are mapped to a higher-dimensional space as a colored encoding for the learning of high-frequency appearance variations. The key components of the view-to-view transformer are presented as follows.

Colored Encoding

Similar to the positional encoding Vaswani et al. (2017); Mildenhall et al. (2020), the raw RGB values on each pixel of an ultra-high-resolution image are embedded as a colored encoding using the sine and cosine functions of different frequencies Mildenhall et al. (2020). In this way, each single RGB value is mapped to a higher dimensional space, which significantly improves the performance and accelerates the convergence speed. This is consistent with the previous work on neural field representations Mildenhall et al. (2020) and NLP Vaswani et al. (2017).
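The colored encoding can be sketched exactly like NeRF's positional encoding, applied to RGB values instead of coordinates. The number of frequency bands below is an illustrative choice, not the paper's:

```python
import numpy as np

def colored_encoding(rgb, n_freqs=6):
    """Map raw RGB values to a higher-dimensional space with sin/cos of
    geometrically increasing frequencies (2^k * π), mirroring the
    positional encoding of NeRF. rgb: (..., 3) in [0, 1]."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    scaled = rgb[..., None] * freqs                          # (..., 3, n_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*rgb.shape[:-1], -1)                  # (..., 3 * 2 * n_freqs)

enc = colored_encoding(np.array([0.2, 0.5, 0.8]))
```

As with positional encoding, the high-frequency terms let the downstream MLP represent sharp appearance variation that a raw 3-vector cannot express.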


Encoder

In our view-to-view transformer, a "fully-visible" attention mask is used in the encoder, which encourages the model to attend to each view related to the novel view via the self-attention mechanism. The encoder acts as the feature fusion operation Ψ in Eq. 5 to obtain the fused features, which are fed into the double MLP for the generation of the double embedding e.
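"Fully-visible" here means no token is masked out: every view token may attend to every other. A single-head scaled dot-product self-attention over one token per view illustrates this; learned projection matrices are omitted for brevity, so this is a sketch rather than the paper's encoder:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention with a fully-visible
    mask (all pairs allowed). x: (K, D), one token per input view."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (K, K) all-pairs similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # fused per-view features

out = self_attention(np.eye(4))
```

A causal or banded mask would zero out some entries of `scores` before the softmax; the fully-visible variant used here leaves all of them intact.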


Decoder

The decoder of our view-to-view transformer maps the features learned from the sparse-view inputs to the features of novel viewpoints. Specifically, the decoder D takes the double embedding e, the query viewing direction d, the positional encoding γ(x) of the query point x, and the colored encoding γ_c(R(x)) of the RGB values as inputs to obtain the texture embedding e_tex:

e_tex = D(e, d, γ(x), γ_c(R(x))).

Finally, the high-resolution color at the point x is predicted by the texture MLP: c = M_tex(e_tex).

5 Learning High-fidelity Human Rendering

The efficacy of our DoubleField is validated on geometry and appearance reconstruction from sparse-view human images. As illustrated in Fig. 4, given sparse multi-view images and the corresponding ray directions, the encoder of our view-to-view transformer serves as the fusion operation Ψ to fuse low-resolution image features from different viewpoints and outputs the fused features using Eq. 5. The double MLP takes the fused features as inputs and produces the double embedding e, which is used to predict the surface field value s and the density σ via the geometry MLP. For the prediction of high-fidelity texture, the decoder takes the double embedding e, the query viewing direction d, and the colored encoding of the ultra-high-resolution images as inputs and produces the texture embedding e_tex for the prediction of the color values c.

Though our network can be directly trained on ultra-high-resolution images, the expensive training cost on such a high-fidelity domain remains a problem. For a more feasible solution, we adopt a transductive transfer learning scheme that divides the problem into two phases: low-resolution pre-training and high-fidelity finetuning. In the pre-training phase, the network learns two coarse priors on down-sampled images: 1) a general geometry and appearance prior of humans; 2) a fusion prior over multi-view features and raw RGB values. Specifically, to train our model, we collect human models from a 3D scan dataset (Twindom) and render low-resolution images. In the finetuning phase, the network takes the ultra-high-resolution images of a specific human from sparse viewpoints as inputs and is finetuned using multi-view self-supervision. In this way, the model pre-trained on low-resolution images is adapted to the ultra-high-resolution domain. More details about the transfer learning scheme can be found in the Supp. Mat.

6 Experiment

We evaluate our DoubleField representation and view-to-view transformer on several datasets: 1) the Twindom dataset, where we split 1,700 human models into 1,500 models for training and 200 models for evaluation; 2) the THuman2.0 dataset Yu et al. (2021b), a publicly available dataset consisting of 500 high-quality human models. We first validate the proposed DoubleField from the aspects of training, inference, and finetuning. Then we present experimental results of various strategies for finetuning the view-to-view transformer. Finally, we compare our solution with prior state-of-the-art approaches.

6.1 Efficiency of DoubleField

To validate the efficiency of the DoubleField representation, we first evaluate the training process via an overfitting experiment and compare with PixelNeRF Yu et al. (2021a). We randomly select one model from the Twindom dataset and render 60 views for supervision. During training, we use a fixed single view as input and a network with the basic DoubleField in Eq. 4 (i.e., no multi-view fusion modules or transformer). For the loss function, PixelNeRF is trained with the sampling strategy in NeRF Mildenhall et al. (2020) to formulate the rendering loss on the other 59 views, while DoubleField is trained using the sampling strategy in PIFu Saito et al. (2019) for the learning of the surface field and the proposed surface-guided sampling strategy for the learning of the radiance field. We evaluate performance on another fixed viewpoint image every 100 iterations during training and report the results in Fig. 5. Our method achieves fast convergence while PixelNeRF struggles to reconstruct the entangled appearance and geometry, which shows that the geometry constraint imposed implicitly by the surface field is helpful to the training process.

Based on the proposed DoubleField representation and the surface-guided sampling strategy, our network not only achieves a much faster rendering speed but also requires much less memory than NeRF. The reasons are: 1) The number of sampling points around the surface (set to 16 in our experiments) can be much smaller than in NeRF's importance sampling. 2) During training, the determination of the intersection points on the surface carries no gradient for back-propagation, so surface-guided sampling saves both time and memory. 3) A coarse mesh can be directly extracted from the surface field using marching cubes at inference, which greatly reduces the total number of query rays by removing the background and allows fast intersection detection using the depth map. The comparison of different sampling strategies can be found in the Supp. Mat.

(a) Overfitting
(b) Finetuning geometry
(c) Finetuning color
Figure 5: (a) Overfitting experiments. (b)(c) The curves of Chamfer distance and PSNR under different finetuning strategies. Curves come from the experiments of one randomly-selected model.
Figure 6: Comparisons of geometry reconstruction results.

6.2 Finetuning with Different Strategies

Our network consists of several modules for image feature extraction, fusion, and final DoubleField prediction, which means there are various possible finetuning strategies. Directly finetuning the whole network with sparse multi-view supervision is prone to overfitting. To overcome this issue, we conduct experiments on 10 models selected randomly from the Twindom test dataset to find a strategy that avoids overfitting while converging fast. For the finetuning of each model, there are sparse-view images from 6 fixed viewpoints in total, and we randomly pick 4 images as the network input and 1 image for supervision in each iteration.

In the finetuning process, we evaluate the geometry reconstruction using the Chamfer distance every 20 iterations and the rendering results using the PSNR metric over 24 fixed novel viewpoints every 100 iterations. We test different finetuning strategies, including: 1) finetuning the image encoder; 2) finetuning the view-to-view transformer; 3) finetuning the MLPs (the double MLP, the geometry MLP, and the texture MLP). In each finetuning process, we fix the other modules and finetune only the specific module. We also conduct finetuning on the MLPs using the NeRF sampling strategy. The results are reported in Fig. 5, which shows that finetuning the transformer helps achieve higher-fidelity rendering while finetuning the downstream MLPs refines both geometry and color reconstruction. Besides, using the NeRF sampling strategy instead of the proposed surface-guided sampling strategy leads to deteriorated performance due to overfitting, which further demonstrates that our sampling strategy alleviates the overfitting issue and contributes to better reconstruction.

In summary, we adopt a bottom-up strategy to finetune our network for high resolution inputs. Specifically, we first finetune MLPs to refine geometry and color, and then finetune the transformer and the texture MLP to achieve high-fidelity rendering.
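The bottom-up schedule can be written down as a simple staged configuration. Module names are hypothetical labels for the components named in the text, and the even 2,000/2,000 split of the 4,000 total iterations is an assumption based on the equal time budget reported in Sec. 6.3:

```python
# Stage 1 refines geometry and color via the downstream MLPs;
# stage 2 targets high-fidelity appearance via the transformer + texture head.
FINETUNE_SCHEDULE = [
    {"trainable": ["double_mlp", "geometry_mlp", "texture_mlp"], "iters": 2000},
    {"trainable": ["view_to_view_transformer", "texture_mlp"], "iters": 2000},
]

def trainable_at(iteration):
    """Return which modules receive gradients at a given finetuning step;
    everything not listed stays frozen."""
    done = 0
    for stage in FINETUNE_SCHEDULE:
        if iteration < done + stage["iters"]:
            return stage["trainable"]
        done += stage["iters"]
    return []  # schedule exhausted

modules = trainable_at(2500)
```

Freezing all other modules at each stage is what keeps the sparse-view finetuning from overfitting, per the ablation in Sec. 6.2.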

Figure 7: Comparisons of appearance reconstruction results on Twindom dataset. PixelNeRF and our method are finetuned with additional 4,000 iterations.
Method            Twindom Col. (PSNR / SSIM)   THuman2.0 Col. (PSNR / SSIM)   Twindom Geo. (P2S / Chamfer)   THuman2.0 Geo. (P2S / Chamfer)
PIFu              20.80 / 0.805                22.35 / 0.846                  0.754 / 0.716                  0.710 / 0.613
PIFu+DVR          20.65 / 0.804                22.17 / 0.843                  0.746 / 0.701                  0.709 / 0.611
PixelNeRF         21.57 / 0.808                22.95 / 0.854                  0.945 / 0.931                  0.815 / 0.725
Our Method        22.95 / 0.842                24.23 / 0.880                  0.750 / 0.707                  0.708 / 0.612
NeuralBody        20.69 / 0.808                22.65 / 0.862                  1.597 / 2.146                  1.528 / 2.126
PIFu+DVR (Ft)     21.62 / 0.812                23.08 / 0.855                  0.779 / 0.736                  0.724 / 0.623
PixelNeRF (Ft)    21.85 / 0.813                23.57 / 0.863                  1.072 / 1.052                  0.790 / 0.701
Our Method (Ft)   23.56 / 0.857                25.10 / 0.905                  0.721 / 0.694                  0.673 / 0.601
Table 2: Quantitative results on the Twindom and THuman2.0 datasets for human appearance (6-view, PSNR/SSIM, higher is better) and geometry (6-view, point-to-surface and Chamfer distance, lower is better) reconstruction. Ft denotes approaches finetuned with 4,000 iterations.

6.3 Comparisons with State-of-the-art Approaches

In this section, we compare DoubleField with the state-of-the-art approaches built upon the surface field and radiance field, including PIFu Saito et al. (2019), PixelNeRF Yu et al. (2021a), and NeuralBody Peng et al. (2021). We also implement DVR Niemeyer et al. (2020) based on PIFu (denoted as PIFu+DVR) to validate the efficiency of the DoubleField representation and its finetuning ability on unseen data. For fair comparisons, we replace the average pooling operation in PIFu Saito et al. (2019) and PixelNeRF Yu et al. (2021a) with self-attention modules for multi-view feature fusion and retrain their networks with the same training settings and datasets.

Geometry Reconstruction For the comparison with NeuralBody Peng et al. (2021), we regard NeuralBody as a frame-based method and train it on 6 viewpoint inputs for 15 hours. We quantitatively evaluate the geometry recovery performance using the point-to-surface distance and the Chamfer distance and report the results in Table 2. Our method without finetuning achieves competitive results compared with PIFu and PIFu+DVR. With finetuning, our method further improves the quality of the geometry based on the DoubleField representation. Qualitative results are illustrated in Fig. 6. Unlike PixelNeRF and NeuralBody, the surface reconstructed by our method is more consistent and contains more details. Finetuning can further fix some missing parts of the geometry, such as holes, which shows that the double MLP has learned to build an implicit association between the two fields.

Appearance Reconstruction To evaluate the different methods on appearance reconstruction, we prepare images of 4K resolution rendered from 30 viewpoints. We use images from 6 fixed viewpoints as input and images from the other 24 views for evaluation. Quantitative results are shown in Table 2. Benefiting from the view-to-view transformer and the DoubleField representation, our method achieves high-fidelity rendering. Under the transfer learning scheme, our method supports higher-quality appearance reconstruction with quick finetuning in 20 minutes (10 minutes for geometry finetuning and 10 minutes for texture and transformer finetuning, 4,000 iterations in total). Moreover, our method generalizes well to scenarios such as object interactions and loose clothes (e.g., long skirts), as shown in Fig. 7.

Figure 8: Comparisons with NeuralBody. 4 images on the left are from ZJU-mocap, and 4 images on the right are from real world multi-view (5 views) system. Each video has 300 frames and we train NeuralBody for 20 hours.
Figure 9: Comparisons on real world data.

Comparisons on Real-world Data To demonstrate the robustness of our method, we also evaluate it on the ZJU-mocap dataset and real-world captured multi-view videos. The results are shown in Fig. 8 and Fig. 9. Our method produces comparable rendering results on the ZJU-mocap dataset while requiring much less time for network finetuning (<20 minutes vs. >10 hours). Moreover, our method does not rely on the SMPL human shape prior Loper et al. (2015) like NeuralBody and achieves higher-quality results even under challenging scenarios such as a swinging skirt, topological changes, and loose clothing, which demonstrates the strong generalization capacity of our method on real-world data. For more results, please refer to our supplementary video.

7 Discussion and Future Works

We propose the DoubleField representation to combine the merits of geometry and appearance fields for high-fidelity human rendering. Though our approach achieves superior performance on reconstruction from sparse-view ultra-high-resolution inputs, high-quality 3D human models are still essential for learning the geometry prior. Moreover, the effect of finetuning on geometry refinement is limited, which hinders our approach from handling extremely challenging poses.

In our work, the associations between the two fields are built in an implicit manner; a more unified and explicit formulation remains to be explored. Besides, the proposed view-to-view transformer and the transfer learning scheme provide novel solutions for high-fidelity rendering. We hope our approach can inspire follow-up work in the field of free-viewpoint human rendering.


  • R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In ECCV, pp. 608–625. Cited by: §2.
  • Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In CVPR, pp. 5939–5948. Cited by: §1, §2.
  • A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan (2015) High-quality streamable free-viewpoint video. TOG 34 (4), pp. 1–13. Cited by: §2.
  • E. De Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H. Seidel, and S. Thrun (2008) Performance capture from sparse multi-view video. TOG, pp. 1–10. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.
  • M. Dou, H. Fuchs, and J. Frahm (2013) Scanning and tracking dynamic objects with commodity depth cameras. In ISMAR, pp. 99–106. Cited by: §2.
  • J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel (2009) Motion capture using joint skeleton tracking and surface estimation. In CVPR, pp. 1746–1753. Cited by: §2.
  • K. Guo, P. Lincoln, P. Davidson, J. Busch, X. Yu, M. Whalen, G. Harvey, S. Orts-Escolano, R. Pandey, J. Dourgarian, et al. (2019) The relightables: volumetric performance capture of humans with realistic relighting. TOG 38 (6), pp. 1–19. Cited by: §2.
  • Y. Hong, J. Zhang, B. Jiang, Y. Guo, L. Liu, and H. Bao (2021) StereoPIFu: depth aware clothed human digitization via stereo vision. In CVPR, Cited by: §1.
  • L. Huang, J. Tan, J. Meng, J. Liu, and J. Yuan (2020) HOT-net: non-autoregressive transformer for 3d hand-object pose estimation. In ACM MM, pp. 3136–3145. Cited by: §2.
  • Y. Huang, F. Bogo, C. Lassner, A. Kanazawa, P. V. Gehler, J. Romero, I. Akhter, and M. J. Black (2017) Towards accurate marker-less human shape and pose estimation over time. In 3DV, pp. 421–430. Cited by: §2.
  • Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, and H. Li (2018) Deep volumetric video from very sparse multi-view performance capture. In ECCV, pp. 336–354. Cited by: §2.
  • C. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, T. Funkhouser, et al. (2020a) Local implicit grid representations for 3d scenes. In CVPR, pp. 6001–6010. Cited by: §2.
  • Y. Jiang, D. Ji, Z. Han, and M. Zwicker (2020b) Sdfdiff: differentiable rendering of signed distance fields for 3d shape optimization. In CVPR, pp. 1251–1261. Cited by: §2.
  • H. Joo, T. Simon, and Y. Sheikh (2018) Total capture: a 3d deformation model for tracking faces, hands, and bodies. In CVPR, pp. 8320–8329. Cited by: §2.
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In NeurIPS, Vol. 31, pp. 1564–1574. Cited by: §2.
  • J. Liang and M. C. Lin (2019) Shape-aware human pose and shape reconstruction using multi-view images. In CVPR, pp. 4352–4362. Cited by: §2.
  • L. Liu, J. Gu, K. Z. Lin, T. Chua, and C. Theobalt (2020a) Neural sparse voxel fields. In NeurIPS, Cited by: §2.
  • S. Liu, Y. Zhang, S. Peng, B. Shi, M. Pollefeys, and Z. Cui (2020b) Dist: rendering deep implicit signed distance function with differentiable sphere tracing. In CVPR, pp. 2019–2028. Cited by: §1, §2.
  • S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. In TOG, Cited by: §2.
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. TOG 34 (6), pp. 1–16. Cited by: §6.3.
  • K. Luo, T. Guan, L. Ju, Y. Wang, Z. Chen, and Y. Luo (2020) Attention-aware multi-view stereo. In CVPR, pp. 1590–1599. Cited by: §2.
  • R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) Nerf in the wild: neural radiance fields for unconstrained photo collections. In CVPR, Cited by: §2.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, pp. 4460–4470. Cited by: §1, §2, §3.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In ECCV, pp. 405–421. Cited by: §1, §1, §2, §3, §3, §4.2, §6.1.
  • M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In CVPR, pp. 3504–3515. Cited by: §1, §3, §6.3.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §1, §2.
  • S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou (2021) Neural body: implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, Cited by: §1, §3, §6.3, §6.3, Table 2.
  • A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer (2021) D-nerf: neural radiance fields for dynamic scenes. In CVPR, Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR. Cited by: §1, §4.2.
  • S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, Cited by: Table 1, §1, §1, Figure 2, §2, §2, §3, §3, §4.1, §6.1, §6.3, Table 2.
  • S. Saito, T. Simon, J. Saragih, and H. Joo (2020) PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In CVPR, Cited by: §1, §2.
  • K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger (2020) Graf: generative radiance fields for 3D-aware image synthesis. In NeurIPS, Vol. 33. Cited by: §2.
  • V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In NeurIPS, Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §2, §4.2.
  • D. Vlasic, I. Baran, W. Matusik, and J. Popović (2008) Articulated mesh animation from multi-view silhouettes. TOG, pp. 1–9. Cited by: §2.
  • Q. Wang, Z. Wang, K. Genova, P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser (2021) Ibrnet: learning multi-view image-based rendering. In CVPR, Cited by: §2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §2.
  • M. Wu, Y. Wang, Q. Hu, and J. Yu (2020) Multi-view neural human rendering. In CVPR, pp. 1682–1691. Cited by: §2.
  • Q. Xu, W. Wang, D. Ceylan, R. Mech, and U. Neumann (2019) Disn: deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, Cited by: §2.
  • W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H. Seidel, and C. Theobalt (2018) Monoperfcap: human performance capture from monocular video. TOG 37 (2), pp. 1–15. Cited by: §2.
  • F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo (2020) Learning texture transformer network for image super-resolution. In CVPR, pp. 5791–5800. Cited by: §2.
  • L. Yariv, Y. Kasten, D. Moran, M. Galun, M. Atzmon, R. Basri, and Y. Lipman (2020) Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, Cited by: §2.
  • A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2021a) PixelNeRF: neural radiance fields from one or few images. In CVPR, Cited by: Table 1, §1, Figure 2, §2, §3, §3, §6.1, §6.3, Table 2.
  • T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, and Y. Liu (2021b) Function4D: real-time human volumetric capture from very sparse consumer rgbd sensors. In CVPR, Cited by: §2, §6.
  • L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986. Cited by: §2.
  • K. Zhang, G. Riegler, N. Snavely, and V. Koltun (2020) Nerf++: analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492. Cited by: §1, §1.
  • Y. Zheng, R. Shao, Y. Zhang, T. Yu, Z. Zheng, Q. Dai, and Y. Liu (2021a) DeepMultiCap: performance capture of multiple characters using sparse multiview cameras. arXiv e-prints, pp. arXiv–2105. Cited by: §2, §4.1.
  • Z. Zheng, T. Yu, Y. Liu, and Q. Dai (2021b) PaMIR: parametric model-conditioned implicit representation for image-based human reconstruction. TPAMI. Cited by: §1.