HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars

by Tao Hu, et al.

We propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR), which synthesizes virtual human avatars from arbitrary poses efficiently and at high quality. First, we learn to encode articulated human motions on a dense UV manifold of the human body surface. To handle complicated motions (e.g., self-occlusions), we then leverage the encoded information on the UV manifold to construct a 3D volumetric representation based on a dynamic pose-conditioned neural radiance field. While this allows us to represent 3D geometry with changing topology, volumetric rendering is computationally heavy. Hence we employ only a rough volumetric representation using a pose-conditioned downsampled neural radiance field (PD-NeRF), which we can render efficiently at low resolutions. In addition, we learn 2D textural features that are fused with rendered volumetric features in image space. The key advantage of our approach is that we can then convert the fused features into a high resolution, high-quality avatar by a fast GAN-based textural renderer. We demonstrate that hybrid rendering enables HVTR to handle complicated motions, render high-quality avatars under user-controlled poses/shapes and even loose clothing, and most importantly, be fast at inference time. Our experimental results also demonstrate state-of-the-art quantitative results.








1 Introduction

Capturing and rendering realistic human appearance under varying poses and viewpoints is an important goal in computer vision and graphics. Recent neural rendering methods [Tewari2020StateOT, neuralbody, neuralactor, anr] have made great progress in generating realistic images of humans, and they are simple yet effective compared with traditional graphics pipelines [Borshukov2003UniversalCI, Carranza2003FreeviewpointVO, Xu2011VideobasedCC].

Given a training dataset of multiple synchronized RGB videos of a human, the goal is to build an animatable virtual avatar with pose-dependent geometry and appearance of the individual that can be driven by arbitrary poses from arbitrary viewpoints at inference time. We propose Hybrid Volumetric-Textural Rendering (HVTR). To represent the input to our system, including the pose and the rough body shape of an individual, we employ a skinned parameterized mesh (SMPL [smpl]) fitted to the training videos. Our system is expected to handle the articulated structure of human bodies, various clothing styles, non-rigid motions, and self-occlusions, and be fast at inference time. In the following, we will introduce how HVTR solves these challenges by proposing (1) effective pose encoding for better generalization, (2) rough yet effective volumetric representation to handle changing topology, and (3) hybrid rendering for fast and high quality rendering.

Pose Encoding on a 2D Manifold. The first challenge lies in encoding the input pose information so that it can be leveraged effectively by the image synthesis pipeline. Prior methods employ global pose parameter conditioning [Yang2018AnalyzingCL, Patel2020TailorNetPC, Ma2020LearningTD, Lhner2018DeepWrinklesAA], while Peng et al. [neuralbody] learn poses in a 3D sparse voxelized space with SparseConvNet [Graham20183DSS]. For better pose generalization, [neuralactor, Peng2021AnimatableNR, Chen2021AnimatableNR] learn motions using skinning weights via a backward (or inverse) skinning step. However, skinning weights are not powerful enough to represent complicated deformations due to arbitrary motions and various clothing styles, which may cause averaged and blurry results. In addition, changing topology is challenging for the backward skinning used in [Peng2021AnimatableNR], as it is not able to model one-to-many backward correspondences [Chen2021SNARFDF]. In contrast, we encode motions on a 2D UV manifold of the body mesh surface, and the dense representation enables us to utilize 2D convolutional networks to effectively encode pose-dependent features. We define a set of geometry and texture latents on the 2D manifold, which have higher resolution than the compressed latent vectors used in [Lombardi2019NeuralV], to enable capturing local motion/appearance details for rendering. Since our method does not employ backward skinning, we also avoid the multi-correspondence problem.

Rough Yet Effective Volumetric Representation. Our input is a coarse SMPL mesh as used in [smplpix, anr, egorend], which cannot capture detailed pose- and clothing-dependent deformations. Inspired by recent neural scene representations [nerf, neuralactor, neuralbody], we model articulated humans with an implicit volumetric representation by constructing a dynamic pose-conditioned neural radiance field. This volumetric representation has the built-in flexibility to handle changing geometry and topology. Different from NeRF [nerf] for static scenes, we condition our proposed dynamic radiance field on our pose encoding defined on the UV manifold. This enables capturing pose- and view-dependent volumetric features. However, constructing the radiance field is computationally heavy [nerf, neuralactor, neuralbody]; hence we propose to learn only a rough volumetric representation by constructing a pose-conditioned downsampled neural radiance field (PD-NeRF). This allows us to keep the computational cost low while still being able to effectively resolve self-occlusions. Yet learning PD-NeRF from low resolution images is challenging, and to address this, we propose an appropriate sampling scheme. We show that we can effectively train PD-NeRF from images downsampled by a factor of 16 and with as few as 7 sampled points along each query ray (Ours_16(7); see Fig. 3, 4 and Tab. 4).

Hybrid Rendering. The final challenge is to render full resolution images by combining the downsampled PD-NeRF and our learned latents on the 2D UV manifold. To solve this, we rasterize the radiance field into multi-channel (not just RGB) volumetric features in image space by volume rendering. The rasterized volumetric features preserve both geometric and appearance details [nerf]. In addition, we extract 2D textural features from our latents on the UV manifold for realistic image synthesis following the spirit of Deferred Neural Rendering (DNR) [dnr, anr, egorend]. We fuse the 3D and 2D features by utilizing Attentional Feature Fusion (AFF[dai21aff]), and finally use a 2D GAN-based [gans] textural rendering network (TexRenderer) to decode and supersample them into realistic images. Though TexRenderer works in image space, it is able to incorporate the rasterized volumetric features for geometry-aware rendering.

The hybrid rendering brings several advantages. 1) We are able to handle self-occlusions by volume rendering. 2) We can generate high quality details using GAN and adversarial training. This enables us to handle uncertainties involved in modeling dynamic details, and is well-suited for enforcing realistic rendered images [anr, Huang2020AdversarialTO, smplpix]. In contrast, a direct deterministic regression of dynamic scenes often leads to blurry results as stated in [neuralactor]. 3) Both textural rendering, and volumetric rendering only from rough volumetric representations, are fast at training and inference time. In contrast, regressing a detailed geometry for view synthesis by volume rendering is time consuming. 4) Benefiting from (1) and (2) and leveraging the rough geometry and GAN-based rendering, we are able to handle loose clothing like skirts (Fig. 5).

In summary, our contributions are: (1) We propose HVTR, a novel neural rendering pipeline, to generate human avatars from arbitrary skeleton motions using a hybrid strategy. HVTR achieves state-of-the-art performance, and is able to handle complicated motions, render high quality avatars even with loose clothing, generalize to novel poses, and support body shape control. Most importantly, it is fast at inference time. (2) HVTR uses an effective scheme to encode pose information on the UV manifold of body surfaces, and leverages this to learn a pose-conditioned downsampled NeRF (PD-NeRF) from low resolution images. Our experiments show how the rendering quality is influenced by the PD-NeRF resolution, and that even low resolution volumetric representations can produce high quality outputs at a small computational cost. (3) HVTR shows how to construct PD-NeRF and extract 2D textural features all based on pose encoding on the UV manifold, and most importantly, how the two can be fused and incorporated for fast, high quality, and geometry-aware neural rendering.

2 Related Work

Features | Animatable pipelines | Rendering
2D | EDN [edn], vid2vid [vid2vid] | NIT
2D Plus | SMPLpix [smplpix], DNR [dnr], ANR [anr] | NIT
3D | NB [neuralbody], AniNeRF [Peng2021AnimatableNR] | VolR
3D | Ours | Hybrid
Table 1: A set of recent human synthesis approaches classified by feature representations (2D or 3D) and rendering methods (NIT: neural image translation; VolR: volume rendering [Kajiya1984RayTV]). NB: Neural Body [neuralbody]. AniNeRF: Animatable NeRF [Peng2021AnimatableNR].
Figure 1: We illustrate the differences between (left) NIT methods (DNR), (middle) our hybrid approach, and (right) NeRF methods (Neural Body). DNR [dnr] and SMPLpix [smplpix] are based on a fixed mesh (SMPL [smpl] or SMPL-X [SMPLX]), and use a GAN for one-stage rendering without explicit geometry reconstruction. As a disadvantage, DNR needs to resolve geometric misalignments implicitly, which often leads to artifacts (see closeup ① in the figure). In contrast, our method (middle) works in two stages by first learning a downsampled volumetric representation (by PD-NeRF), and then utilizing a GAN for appearance synthesis. Though only learned from low resolution images, the rough volumetric representation still encodes more 3D pose-dependent features than SMPL, which enables us to handle self-occlusions (region ① vs ②), and preserve more details (③④) than DNR. In addition, our GAN-based renderer can generate high resolution wrinkles, whereas Neural Body cannot, although its geometry is more detailed. Our approach is also 52× faster than Neural Body at inference time (see Tab. 4).

Neural Scene Representations. Instead of explicitly modeling geometry, many neural rendering methods [sitzmann2019deepvoxels, Sitzmann2019, mildenhall2020nerf, dnr] propose to learn implicit representations of scenes, such as DeepVoxels [sitzmann2019deepvoxels], Neural Volumes [Lombardi2019NeuralV], SRNs [Sitzmann2019], or NeRF [mildenhall2020nerf]. In contrast to these static representations, we learn a dynamic radiance field on the UV manifold of human surfaces to model articulated human bodies.

Shape Representations of Humans. To capture detailed deformations of human bodies, most recent papers utilize implicit functions [Mescheder2019OccupancyNL, Michalkiewicz2019DeepLS, Chen2019LearningIF, Park2019DeepSDFLC, Saito2019PIFuPI, Saito2020PIFuHDMP, Huang2020ARCHAR, Saito2021SCANimateWS, Mihajlovi2021LEAPLA, Wang2021MetaAvatarLA, Palafox2021NPMsNP, Tiwari2021NeuralGIFNG, Zheng2021PaMIRPM, Jeruzalski2020NASANA, Zheng2021DeepMultiCapPC] or point clouds [scale, pop] due to their topological flexibility. These methods aim at learning geometry from 3D datasets, whereas we synthesize human images of novel poses only from 2D RGB training images.

Rendering Humans by Neural Image Translation (NIT). Some existing approaches render human avatars by neural image translation, i.e., they map the body pose, given in the form of renderings of a skeleton [edn, SiaroSLS2017, Pumarola_2018_CVPR, KratzHPV2017, zhu2019progressive, vid2vid], a dense mesh [Liu2019, wang2018vid2vid, liu2020NeuralHumanRendering, feanet, Neverova2018, Grigorev2019CoordinateBasedTI], or joint position heatmaps [MaSJSTV2017, Aberman2019DeepVP, Ma18], to real images. As summarized in Tab. 1, EDN [edn] and vid2vid [vid2vid] utilize GANs [gans] to learn a mapping from 2D poses to human images. To improve temporal stability and learn a better mapping, “2D Plus” methods [dnr, smplpix, egorend] are conditioned on a coarse mesh (SMPL [smpl]), and take as input additional geometry features, such as DNR (UV mapped features) [dnr], SMPLpix (+ depth map) [smplpix], and ANR (UV + normal map) [anr]. A 2D convolutional network is often utilized for both shape completion and appearance synthesis in one stage [dnr, smplpix]. However, [dnr, smplpix, anr, stylepeople] do not reconstruct geometry explicitly and cannot handle self-occlusions effectively. In contrast, our rendering is conditioned on a learned 3D volumetric representation using a two-stage approach (see Fig. 1). We show that our learned representation handles occlusions more effectively than other techniques [dnr, smplpix, anr, stylepeople] that just take geometry priors (e.g., UV, depth or normal maps) from a coarse mesh as input (see Fig. 3, 7).

Rendering Humans by Volume Rendering (VolR). For stable view synthesis, recent papers [neuralbody, neuralactor, Peng2021AnimatableNR, narf, anerf, Chen2021AnimatableNR] propose to unify geometry reconstruction with view synthesis by volume rendering, which, however, is computationally heavy. In addition, the appearance synthesis (e.g., Neural Body [neuralbody]) largely relies on the quality of geometry reconstruction, which is very challenging for dynamic humans, and imperfect geometry reconstruction will lead to blurry images (Fig. 7). Furthermore, due to the difficulties of reconstructing geometry for loose clothing, most animatable NeRF methods (e.g., the state-of-the-art Neural Actor [neuralactor]) cannot handle skirts. In contrast, our method utilizes a GAN to render high frequency details based on a downsampled radiance field, which makes our method more robust to geometry inaccuracies and fast at inference time. We can also render skirts (Fig. 5). A comparison of ours, NIT, and VolR is shown in Fig. 1.

Ours is distinguished from [Niemeyer2021GIRAFFERS] by rendering dynamic humans at high resolutions, and conditioning our rendering framework on the UV manifold of human body surfaces.

Figure 2: Pipeline overview. Given a coarse SMPL mesh with pose (p) and a target viewpoint (o, d), our system renders a detailed avatar using four main components: ① pose encoding, ② 2D textural feature encoding, ③ 3D volumetric representation, and ④ hybrid rendering. ① Pose Encoding in UV space: We record the 3D positions of the mesh on a UV positional map. We stack it with a geometry latent and encode them into a pose feature tensor. We then construct the pose-dependent features by stacking the pose feature tensor, an optimizable texture style latent, and the estimated normals in UV space. ② 2D Tex-Encoding: A Feature Rendering module renders the coarse mesh with the pose-dependent features into image features by utilizing a rasterized UV coordinate map. The image features are then encoded as 2D textural features by the Textural Encoder. ③ 3D Vol-Rep: To capture the rough geometry and address self-occlusion problems, we further learn a volumetric representation by constructing a pose-conditioned downsampled neural radiance field (PD-NeRF) to encode 3D pose-dependent features. ④ Hybrid Rendering: PD-NeRF is rasterized into image space by volume rendering, where 3D volumetric features are also preserved. Both the 2D and 3D features are pixel-aligned in image space, fused by Attentional Feature Fusion (AFF), and then converted into a realistic image and a mask by TexRenderer.

3 Method

Problem setup. Our goal is to render pose- and view-dependent avatars of an individual from an arbitrary pose p (represented by a coarse human mesh) and an arbitrary viewpoint (position o, view direction d). The renderer takes p, o, d, and the camera intrinsic parameters K as input and produces the output image.

We first introduce our pose encoding method (Sec. 3.1), and based on this we present how to extract 2D textural features (Sec. 3.2) and 3D pose-dependent volumetric features (Sec. 3.3). Finally, we describe how we fuse these features and synthesize the final RGB avatars (Sec. 3.4). Fig. 2 shows an outline of the proposed framework.

3.1 Pose Encoding

Given a skeleton of pose p and the posed SMPL mesh as input, we first project each surface point from 3D space to its UV manifold, represented by a UV positional map that records the 3D position of each surface point at its UV coordinates, where the UV coordinates describe the relative location of the point on the body surface manifold. With this, we define a set of geometry latents to represent the intrinsic local geometry features, and texture latents to represent high-dimensional neural textures as used in [dnr, anr, egorend]. Both latents are defined in UV space, and shared across different poses. Our geometry and texture latents have higher resolution than the compressed representation used in other works (e.g., latent vectors used in [Lombardi2019NeuralV]), which enables us to capture local details, and the rendering pipeline can leverage them to infer local geometry and appearance changes.

To extract geometric features, the projected poses and the geometry latents are convolved by a ConvNet to obtain pixel-aligned pose features. In addition, to enforce the learning of geometric features, we predict the normals of the posed mesh in UV space using a shallow ConvNet. We then concatenate the geometric features (the pose features and the predicted normals) and the texture features (the texture latents) to obtain our pose-dependent features.

Note that though the UV positional map used is similar to [scale, pop], ours is distinguished by learning pose-dependent features from 2D images instead of 3D point clouds, and we have a normal estimation network to enforce geometric learning, whereas [scale, pop] did not.

3.2 2D Textural Feature Encoding

Given a viewpoint (o, d), we also render a UV coordinate map [densepose] to encode the shape and pose features of the posed mesh in image space. This allows us to transform the pose-dependent features from UV space to image space using a Feature Rendering module [dnr, anr, egorend]. We further encode these image features as high-dimensional textural features using a Texture Encoder implemented by a 2D ConvNet.
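The UV-to-image lookup performed by a feature rendering step can be illustrated with a minimal sketch. It assumes the UV-space features are stored as a small dense grid and that the rasterized UV coordinate map holds one (u, v) pair per foreground pixel; the names `sample_bilinear` and `render_features`, the bilinear filter, and the zero-filled background are illustrative assumptions, not details given in the paper.

```python
from typing import List, Optional, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]
UVMap = List[List[Optional[Tuple[float, float]]]]  # None marks background pixels


def sample_bilinear(features: FeatureMap, u: float, v: float) -> List[float]:
    """Bilinearly sample a UV-space feature map at (u, v) in [0, 1] x [0, 1]."""
    height, width = len(features), len(features[0])
    # Map normalized UV coordinates to continuous texel coordinates.
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    channels = len(features[0][0])
    return [
        (1 - fx) * (1 - fy) * features[y0][x0][c]
        + fx * (1 - fy) * features[y0][x1][c]
        + (1 - fx) * fy * features[y1][x0][c]
        + fx * fy * features[y1][x1][c]
        for c in range(channels)
    ]


def render_features(features: FeatureMap, uv_map: UVMap) -> FeatureMap:
    """Look up one feature vector per pixel; background pixels get zeros."""
    channels = len(features[0][0])
    return [
        [
            sample_bilinear(features, uv[0], uv[1]) if uv is not None else [0.0] * channels
            for uv in row
        ]
        for row in uv_map
    ]
```

Because the lookup is driven entirely by the rasterized UV map, the same UV-space features can be reused for any pose and viewpoint.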

3.3 3D Volumetric Representation

Though existing methods achieve compelling view synthesis by just rendering with 2D textural features [dnr, egorend, anr], they cannot handle self-occlusions effectively since they do not reconstruct the geometry. We address this by learning a 3D volumetric representation using a pose-conditioned neural radiance field (PD-NeRF in Fig. 2).

We include pose information to learn the volumetric representation by looking up the encoded pose feature corresponding to each 3D query point. To achieve this, we project each query point in the posed space of the SMPL mesh to a local point in a UV-plus-height space. The query point is assigned to its nearest triangle (face) of the mesh, and its UV coordinates are obtained by barycentric interpolation of the UV coordinates of the three vertices of that face, using the barycentric coordinates of the query point with respect to the face. The height is given by the signed distance of the query point to the nearest face.
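The mapping from a posed-space query point to UV-plus-height coordinates can be sketched for a single triangle as follows. This is a simplified illustration under stated assumptions: the nearest-face search is omitted (the face is passed in), the height is measured along the face normal to the triangle plane, and the barycentric coordinates are not clamped when the foot point falls outside the triangle. The function name `project_to_uvh` and its helpers are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def project_to_uvh(
    point: Vec3,
    verts: Sequence[Vec3],
    uvs: Sequence[Tuple[float, float]],
) -> Tuple[float, float, float]:
    """Return (u, v, h) of a 3D point relative to one triangle.

    verts: the three posed vertices of the (assumed nearest) face.
    uvs:   the UV coordinates of those three vertices.
    """
    a, b, c = verts
    normal = cross(sub(b, a), sub(c, a))
    norm = math.sqrt(dot(normal, normal))  # equals twice the triangle area
    unit = (normal[0] / norm, normal[1] / norm, normal[2] / norm)

    # Signed distance to the triangle plane, positive on the normal side.
    h = dot(sub(point, a), unit)
    foot = (point[0] - h * unit[0], point[1] - h * unit[1], point[2] - h * unit[2])

    # Barycentric coordinates from signed sub-triangle areas.
    w_a = dot(cross(sub(b, foot), sub(c, foot)), unit) / norm
    w_b = dot(cross(sub(c, foot), sub(a, foot)), unit) / norm
    w_c = 1.0 - w_a - w_b

    # Barycentric interpolation of the per-vertex UV coordinates.
    u = w_a * uvs[0][0] + w_b * uvs[1][0] + w_c * uvs[2][0]
    v = w_a * uvs[0][1] + w_b * uvs[1][1] + w_c * uvs[2][1]
    return u, v, h
```

The resulting (u, v) can be used to index the UV-space pose features, while h tells the radiance field how far the query point lies from the body surface.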

With this, we sample the local feature of each query point from the encoded pose features at its UV location. Given a camera position o and view direction d, we predict the density and appearance features of the query point from its sampled local feature with a network that uses a positional encoder (Eq. 2). Note that the appearance feature is a high-dimensional feature vector, where the first three channels are RGB colors. A key property of our approach is that the radiance field is conditioned on high resolution encoded pose features instead of pose parameters p.

3.4 Hybrid Volumetric-Textural Rendering

Though the radiance field PD-NeRF can be directly rendered into target images by volume rendering [Kajiya1984RayTV], this is computationally heavy. In addition, a direct deterministic regression using RGB images often leads to blurry results in dynamic scenes as stated in [neuralactor].

Volumetric Rendering. To address this, we use PD-NeRF to render images downsampled by a factor S for fast inference. We rasterize PD-NeRF into multi-channel volumetric features, and each pixel is predicted from N consecutive samples along the corresponding ray r through volume rendering [Kajiya1984RayTV], where the density and appearance features of each sample are predicted by Eq. 2. Note that the first three channels of the volumetric features are RGB, which are supervised by downsampled ground truth images (see Fig. 2).
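The per-ray accumulation can be sketched as below, assuming the standard discretized volume rendering quadrature used by NeRF-style methods, applied to feature vectors with more than three channels. The function name `composite_ray` and the explicit `deltas` argument (distances between consecutive samples) are illustrative assumptions; the exact formulation in the paper may differ.

```python
import math
from typing import List, Sequence


def composite_ray(
    densities: Sequence[float],
    features: Sequence[Sequence[float]],
    deltas: Sequence[float],
) -> List[float]:
    """Alpha-composite N per-sample feature vectors along one ray.

    densities: non-negative density of each sample, ordered near to far.
    features:  one multi-channel feature vector per sample
               (the first three channels may be read as RGB).
    deltas:    distance between consecutive samples.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0] * len(features[0])
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [acc + weight * f for acc, f in zip(pixel, feat)]
        transmittance *= 1.0 - alpha
    return pixel
```

With only N samples per ray and a downsampled image, the number of network queries stays small, and the extra channels carry geometry-aware features into image space alongside RGB.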

Attentional Volumetric Textural Feature Fusion. With both the 2D textural features and the rasterized 3D volumetric features, the next step is to fuse them and leverage them for 2D image synthesis. This poses several challenges. First, the textural features are trained in 2D and converge faster than the volumetric features, since the latter need to regress a geometry by optimizing downsampled images, and NeRF training generally converges more slowly for dynamic scenes [neuralbody]. Second, the textural features have higher dimensions (both resolution and channels) than the volumetric features, because the volumetric features are learned from downsampled images with relatively weak supervision. Due to this, the system may tend to ignore the volumetric features at this stage. To solve this problem, we first use a ConvNet to downsample the textural features to the same size as the volumetric features. We also extend the channels of the volumetric features to the same dimensionality as the textural features using a ConvNet. Another approach would be to upsample the resolution of the volumetric features instead of extending their channels, but we found that this destabilizes the training of PD-NeRF.

Finally, we fuse the resized features by Attentional Feature Fusion (AFF [dai21aff]). The fused features have the same size as the resized input features. AFF is also learned, and Fig. 2 includes it within the hybrid rendering stage. See [dai21aff] for more details about AFF.
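The general form of attentional fusion, in which an attention map computed from the sum of the two inputs gates between them, can be sketched as follows. This is only an illustration: AFF [dai21aff] uses a learned multi-scale channel attention module, whereas the `attention_weights` function here is a hypothetical, parameter-free stand-in that combines each value with its per-channel global mean.

```python
import math
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def attention_weights(x: Tensor3) -> Tensor3:
    """Stand-in attention: local value plus per-channel global context.

    A real AFF module would use learned point-wise convolutions here.
    """
    weights = []
    for channel in x:
        count = sum(len(row) for row in channel)
        global_mean = sum(sum(row) for row in channel) / count
        weights.append([[sigmoid(val + global_mean) for val in row] for row in channel])
    return weights


def fuse(x: Tensor3, y: Tensor3) -> Tensor3:
    """Blend two same-sized feature tensors as m * x + (1 - m) * y."""
    summed = [
        [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(cx, cy)]
        for cx, cy in zip(x, y)
    ]
    m = attention_weights(summed)
    return [
        [
            [w * a + (1.0 - w) * b for w, a, b in zip(rm, rx, ry)]
            for rm, rx, ry in zip(cm, cx, cy)
        ]
        for cm, cx, cy in zip(m, x, y)
    ]
```

Since the attention values lie in (0, 1), each fused value is a convex combination of the two inputs, so neither the textural nor the volumetric branch can be discarded outright by the blend itself.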

Textural Rendering. The TexRenderer network converts the fused features into the target avatar and a mask. TexRenderer has an architecture similar to Pix2PixHD [pix2pixhd]. See the Appendices for more details.

Figure 3: Qualitative results of our variants obtained by changing the downsampling factor S of PD-NeRF. Though existing SMPL-based 2D-Plus NIT methods take extra geometry priors as input, such as DNR (+ UV) [dnr], SMPLpix (+ depth) [smplpix], and ANR (UV + normal) [anr], they fail to fully utilize the priors for geometry-aware rendering. Instead, ours can handle self-occlusions better (①) and also improve the rendering quality (②③) by learning a downsampled PD-NeRF (Ours_16). Note that for ours and Neural Body, the learned geometries are shown.

3.5 Optimization

HVTR is trained end-to-end by optimizing the networks and the latent codes. Given a ground truth image and mask, a downsampled ground truth image, and the predicted image and mask, we use the following loss functions:

Volume Rendering Loss. We supervise the training of volume rendering with a loss between the downsampled ground truth image and the first three channels of the rasterized volumetric features.

Normal Loss. To enforce the learning of geometric features, we employ a normal loss between the predicted normals and the ground truth normals of the mesh projected into UV space.

Feature Loss. We use a feature loss [Perceptual_Losses] to measure the differences between the activations on different layers of the pretrained VGG network [vgg] for the generated image and the ground truth image. The loss accumulates, over the selected layers, the difference between the two activations of each layer, normalized by the number of elements of that layer.
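The layer-wise structure of such a feature loss can be sketched as below. The sketch assumes an L1 distance and takes the per-layer activations as already extracted, flattened lists; in practice they would come from a pretrained VGG network. The function name `feature_loss` and the choice of norm are assumptions for illustration.

```python
from typing import Sequence


def feature_loss(
    acts_generated: Sequence[Sequence[float]],
    acts_ground_truth: Sequence[Sequence[float]],
) -> float:
    """Sum over layers of the mean absolute activation difference.

    Each argument holds one flattened activation list per network layer.
    Dividing by the element count keeps large layers from dominating.
    """
    total = 0.0
    for gen, gt in zip(acts_generated, acts_ground_truth):
        num_elements = len(gen)
        total += sum(abs(a - b) for a, b in zip(gen, gt)) / num_elements
    return total
```

Comparing activations rather than raw pixels rewards images that match the ground truth in structure and texture even when individual pixels differ slightly.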

Mask Loss. The mask loss penalizes the difference between the predicted mask and the ground truth mask.

Pixel Loss. We also enforce a per-pixel loss between the generated image and the ground truth image.

Adversarial Loss. We leverage a multi-scale discriminator [pix2pixhd] for the adversarial loss. The discriminator is conditioned on both the generated image and the feature image.

Face Identity Loss. We use a pretrained SphereFaceNet [Liu2017SphereFaceDH] to ensure that TexRenderer preserves the face identity, comparing the cropped faces of the generated image and the ground truth image.

Total Loss. The total loss combines all of the loss terms above.

The networks were trained using the Adam optimizer [adam]. See the Appendices for more details.

Models | R1: LPIPS FID SSIM PSNR | R2: LPIPS FID SSIM PSNR | R3: LPIPS FID SSIM PSNR
DNR | .102 75.02 .831 25.73 | .125 98.86 .820 27.92 | .108 80.33 .809 24.05
SMPLpix | .100 69.81 .835 25.93 | .124 94.81 .826 27.97 | .104 74.57 .810 24.16
ANR | .117 78.50 .830 26.02 | .129 101.72 .825 28.30 | .098 69.14 .813 24.29
Neural Body | .212 155.84 .833 26.17 | .218 161.99 .833 28.61 | .240 165.03 .811 24.16
Ours | .090 62.33 .842 26.22 | .108 84.43 .833 28.62 | .093 66.01 .823 24.55

Models | R4: LPIPS FID SSIM PSNR | R5: LPIPS FID SSIM PSNR | R6: LPIPS FID SSIM PSNR
DNR | .108 93.16 .833 23.34 | .136 121.50 .817 24.06 | .088 74.77 .864 25.81
SMPLpix | .107 88.14 .837 23.37 | .131 118.64 .818 24.10 | .077 64.33 .875 26.14
ANR | .138 91.92 .812 23.26 | .140 123.55 .823 24.67 | .083 63.16 .875 26.61
Neural Body | .198 126.26 .856 24.26 | .220 161.93 .816 24.25 | .142 94.96 .880 27.19
Ours | .096 78.79 .849 23.98 | .117 93.56 .827 24.84 | .070 57.00 .891 27.42

Models | Z1: LPIPS FID SSIM PSNR | Z2: LPIPS FID SSIM PSNR | Z3: LPIPS FID SSIM PSNR
DNR | .145 92.78 .797 22.06 | .145 87.27 .774 25.04 | .109 82.79 .826 23.16
SMPLpix | .150 90.90 .797 22.14 | .144 81.78 .774 25.18 | .113 83.96 .827 22.92
ANR | .205 171.69 .775 22.35 | .159 110.85 .778 25.41 | .173 123.84 .790 22.14
Neural Body | .215 163.83 .789 22.16 | .238 155.27 .792 25.88 | .204 167.66 .825 23.89
Ours | .143 90.43 .805 22.31 | .132 79.14 .785 25.69 | .105 78.03 .829 23.23
Table 2: Quantitative comparisons on nine datasets (averaged over all test views and poses). To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes. LPIPS [lpip] and FID [Heusel2017GANsTB] capture human judgement better than per-pixel metrics such as SSIM [ssim] or PSNR. All poses are novel, and R4 and Z1-Z3 are tested on new views. The pose variations are relatively small in the Z1-Z3 datasets, for which we mainly evaluate the capability of capturing/rendering high frequency details instead of pose generalization.

4 Experiments

Dataset. We evaluate our method on 10 datasets, denoted R1-6, Z1-3, and M1. We captured R1-6 ourselves; each dataset has 5 cameras and 800-2800 frames, and the human bounding box occupies a large part of each image. Z1-3 [neuralbody] have 24 cameras (1024×1024, 620-1400 frames each), and we use splits of 10/7, 12/8, and 5/5 training/test cameras, respectively. M1 [Habermann2021RealtimeDD] has 101 cameras (1285×940, 20K frames each), and we utilize 19/8 training/test cameras. For these datasets, we select key sequences to include various motions and use a split of 80%/20% for training and testing. All the tested poses are novel, and the rendered viewpoints for R4, Z1-Z3, and M1 are new. Since these methods render humans in local space and the captured human characters move, we found that novel poses mattered more than novel viewpoints for quantitative results. See R1 in Fig. 3, R2-R4, Z1, and Z3 in Fig. 7, M1 in Fig. 5, and the Appendices for more details.

Baselines. We compare our method with NIT-based methods (DNR [dnr], ANR [anr], SMPLpix [smplpix]), the NeRF-based Neural Body [neuralbody] (as used in [neuralactor, Xu2021HNeRFNR] for animation synthesis), and Animatable NeRF [Peng2021AnimatableNR] (see the Appendices). For fair comparisons, DNR, ANR, and SMPLpix all have the same network architectures as ours and the same SMPL model as input, and were trained with the losses mentioned in their papers. ANR: Since the code of ANR was not released when this work was developed, we cannot guarantee that our reproduced ANR achieves the expected performance, though it converges and generates reasonable results. SMPLpix: We follow the author’s recent update (https://github.com/sergeyprokudin/smplpix) to strengthen SMPLpix by rasterizing the SMPL mesh instead of the sparse SMPL vertices [smplpix]. Neural Body [neuralbody] and Animatable NeRF [Peng2021AnimatableNR] are each trained with their provided code and setup.

Annotation: Ours_S(N) indicates the variant of our method, where S is the downsampling factor of PD-NeRF, and N is the number of sample points along each ray. By default, we use the setting of Ours_8(12) as our method for comparisons. Ours_ng is the variant without PD-NeRF.

4.1 Evaluations

Differences to the Baseline Methods. As shown in Fig. 3, we handle self-occlusions better than 2D-Plus methods (DNR, SMPLpix, ANR) and generate more details than Neural Body. We also compare the architectures of ours, DNR, and Neural Body in Fig. 1.

Comparisons. We evaluate our method on the 10 datasets, as shown in Fig. 3, 5, and 7 (see R5, R6, Z2 in the Appendices). We summarize the quantitative results in Tab. 2 and 3, where we achieve the best performance on 34/40 evaluation metrics, and on all 20 LPIPS/FID scores.

Rendering Skirts. Our method is capable of rendering loose clothing like skirts as shown in Fig. 5 with rough geometry reconstruction, whereas solo volume rendering methods (e.g., Neural Actor stated in [neuralactor]) generally cannot because they rely more on the quality of geometry reconstruction, which is also challenging for dynamic skirts.

Accuracy, Inference Time, GPU Memory. See the accuracy and inference time in Tab. 4. Ours improves the performance over DNR and SMPLpix by about 10% (even 14% with Ours_4) at a small computational cost, and is about 52× faster than Neural Body. For fair comparisons, we evaluate Tab. 4 on the R1 dataset (Fig. 1; about 8k frames for training, 2k for testing), where each frame was cropped close to the human bounding box (bbox) to reduce the influence of the white background. One limitation is that we require more GPU memory in training: Ours_4(20), the most GPU-consuming version, needs 21GB, whereas Neural Body needs 5GB and ANR 11GB. In inference, however, Ours_4(20) needs 4GB and Neural Body 15GB. Note that this was evaluated on the cropped bbox with downsampling factor S=4, and we can process high resolutions such as 1024×1024 (Z1-3) or 1285×940 (M1). See the Appendices for more details.
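The headline numbers can be checked against Tab. 4 with a few lines of arithmetic. The sketch below copies the LPIPS and time columns for selected rows and assumes that the quoted improvement percentages refer to the relative LPIPS reduction; the paper does not state which metric they are based on, so that reading is an assumption.

```python
# (LPIPS, inference time in seconds) copied from Tab. 4.
TABLE4 = {
    "DNR": (0.102, 0.184),
    "SMPLpix": (0.100, 0.198),
    "NeuralBody": (0.212, 18.20),
    "Ours_8(12)": (0.090, 0.349),
    "Ours_4(20)": (0.086, 0.464),
}


def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` is than `slow` at inference."""
    return TABLE4[slow][1] / TABLE4[fast][1]


def lpips_gain(baseline: str, ours: str) -> float:
    """Relative LPIPS reduction of `ours` over `baseline` (lower LPIPS is better)."""
    base, new = TABLE4[baseline][0], TABLE4[ours][0]
    return (base - new) / base


if __name__ == "__main__":
    print(f"Neural Body / Ours_8(12) time ratio: {speedup('NeuralBody', 'Ours_8(12)'):.1f}x")
    print(f"LPIPS gain over SMPLpix, Ours_8(12): {lpips_gain('SMPLpix', 'Ours_8(12)'):.1%}")
    print(f"LPIPS gain over SMPLpix, Ours_4(20): {lpips_gain('SMPLpix', 'Ours_4(20)'):.1%}")
    print(f"LPIPS gain over DNR, Ours_8(12): {lpips_gain('DNR', 'Ours_8(12)'):.1%}")
```

Under this reading, the time ratio for the default setting Ours_8(12) is 18.20 / 0.349, a little over 52, and the LPIPS reductions relative to SMPLpix are 10% and 14% for Ours_8(12) and Ours_4(20).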

Applications. We can render avatars under user-controlled novel views, poses, and shapes for Novel View Synthesis, Animation (Fig. 3 and Fig. 7), and Shape Editing (Fig. 6).

Figure 4: The effectiveness of PD-NeRF vs. downsampling factor S and number of sample points N. The 2nd-7th results are variants of our method (see Annotation in Sec. 4). The first three channels of the volumetric features are shown at the bottom right. PD-NeRF improves the capability of handling self-occlusions (e.g., cheeks {③④⑤⑥⑦} vs {①②}), and the quality (⑥ vs ⑤) can be significantly improved by reducing the downsampling factor S, i.e., increasing the PD-NeRF resolution, which is also shown in Tab. 4.
Figure 5: We can render skirts on novel poses and viewpoints.
Figure 6: Rendering results of HVTR for different body shapes of the same individual. Top-left: SMPL shapes (visualized as UV coordinate maps); Bottom-left: renderings of PD-NeRF; Middle: normal shape. Both PD-NeRF and HVTR generate reasonable results. Not just a straightforward texture to shape mapping, HVTR can generate some shape-dependent wrinkles (marked in red for big models), though these shapes were not seen in training.
Models LPIPS FID SSIM PSNR
DNR .195 144.78 .687 19.96
Ours .179 132.83 .696 20.18
Table 3: Comparisons on M1 dataset under novel poses and views.
Models LPIPS FID Time (s) VR_T(%)
DNR 0.102 75.015 .184 -
SMPLpix 0.100 69.812 .198 -
ANR 0.117 78.501 .224 -
NeuralBody 0.212 155.838 18.20 -
Ours_ng 0.099 70.528 .257 -
Ours_16(7) 0.097 64.871 .292 11.99
Ours_16(12) 0.096 63.792 .295 12.88
Ours_16(20) 0.096 63.834 .305 15.41
Ours_8(12) 0.090 62.333 .349 26.36
Ours_4(20) 0.086 60.788 .464 44.61
Table 4: Accuracy and inference time on novel poses. VR_T(%) indicates the percentage of the inference time spent on volume rendering. We measure the end-to-end inference time on a GeForce RTX 3090, and the time for rendering the required input maps is also counted: UV coordinate maps for DNR, depth maps for SMPLpix, UV coordinate and normal maps for ANR, and UV coordinate and depth maps for ours (used in PD-NeRF to sample query points). PyTorch3D [ravi2020pytorch3d] is used for rendering.
Figure 7: Comparisons with NIT methods (DNR [dnr], SMPLpix [smplpix], ANR [anr]) and a NeRF method (Neural Body [neuralbody]) on R2-4, Z1, and Z3. Our method can generate different levels of pose-dependent details: ⑥ offsets, ⑤ big wrinkles, ④ tiny wrinkles. Compared to NIT methods, our method handles self-occlusions better (①②③⑦), generates high-quality details (④⑤⑧), and better preserves thin parts (③⑦) and facial details. All the poses are novel, and R4, Z1, Z3 are also rendered from novel views. Note that we cannot guarantee that our reproduced ANR achieves the expected performance, as stated in Sec. 4.

4.2 Ablation Study

We analyze how PD-NeRF affects the final rendering quality and inference time by evaluating two parameters: the resolution, represented by a downsampling factor 1/S, and the number of sampled points N along each ray, as shown in Fig. 3 and Fig. 4. Fig. 3 shows that the handling of self-occlusions improves by just incorporating a 1/16 downsampled PD-NeRF (Ours_16 vs. Ours_ng). Tab. 4 shows that the quantitative results can be improved either by reducing S, i.e., raising the PD-NeRF resolution (e.g., Ours_8(12) vs. Ours_16(12), or ⑥ vs. ④ in Fig. 4), or by sampling more points (e.g., Ours_16(12) vs. Ours_16(7)), which illustrates the effectiveness of PD-NeRF. However, S contributes more than N, as shown in Tab. 4 and Fig. 4. N quickly saturates: going from Ours_16(12) to Ours_16(20) leaves LPIPS unchanged at 0.096 and does not improve FID (63.792 vs. 63.834). In contrast, Ours_8(12) clearly outperforms Ours_16(12) by doubling the resolution (LPIPS 0.090 vs. 0.096), which illustrates the benefit of a higher-resolution PD-NeRF.
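To make the trade-off concrete, the absolute time spent on volume rendering can be derived from Tab. 4 as the end-to-end time multiplied by VR_T. The snippet below is a small worked example over the reported numbers only; the function and variable names are illustrative.

```python
# (S, N) -> (LPIPS, end-to-end time in s, VR_T in %), copied from Tab. 4.
variants = {
    (16, 7):  (0.097, 0.292, 11.99),
    (16, 12): (0.096, 0.295, 12.88),
    (16, 20): (0.096, 0.305, 15.41),
    (8, 12):  (0.090, 0.349, 26.36),
    (4, 20):  (0.086, 0.464, 44.61),
}


def volume_rendering_seconds(total_time, vr_percent):
    """Absolute volume rendering time implied by the VR_T percentage."""
    return total_time * vr_percent / 100.0


for (s, n), (lpips, total, vr) in variants.items():
    vr_time = volume_rendering_seconds(total, vr)
    print(f"S={s:2d} N={n:2d}  LPIPS={lpips:.3f}  volume rendering={vr_time:.3f}s")

# LPIPS reduction from sampling more points at fixed S=16 (N: 7 -> 20) ...
gain_from_n = variants[(16, 7)][0] - variants[(16, 20)][0]
# ... versus doubling the resolution at fixed N=12 (S: 16 -> 8).
gain_from_s = variants[(16, 12)][0] - variants[(8, 12)][0]
print(f"LPIPS gain from N: {gain_from_n:.3f}, from S: {gain_from_s:.3f}")
```

Under these numbers, the resolution change yields the larger LPIPS reduction, at the price of a larger share of the inference time going to volume rendering.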

The ablation study of face identity loss and feature fusion can be found in the Appendices.

5 Discussion and Conclusion

Potential Societal Impact. Our method enables a digital portrait copy that can be reenacted by another portrait video. Therefore, given a portrait video of a specific person, it can be used to generate new portrait videos of that person, and this potential for misuse needs to be addressed carefully before deploying the technique.

Conclusion. We introduce Hybrid Volumetric-Textural Rendering (HVTR), a novel neural rendering pipeline, to generate human avatars under user-controlled poses, shapes and viewpoints. HVTR can handle complicated motions, render loose clothing, and provide fast inference. The key is to learn a pose-conditioned downsampled neural radiance field to handle changing geometry, and to incorporate both neural image translation and volume rendering techniques for fast geometry-aware rendering. We see our framework as a promising component for real-time telepresence.



Appendix A Implementation Details

Optimization. The networks were trained using the Adam optimizer [adam] with an initial learning rate of , . The loss weights {, , , , , , } are set empirically to . We train DNR [dnr], SMPLpix [smplpix], ANR [anr], and our method for 50,000 iterations, Neural Body [neuralbody] for 180,000 iterations, and Animatable NeRF (AniNeRF [Peng2021AnimatableNR]) for 250,000 iterations. We train the networks on an NVIDIA P6000 GPU; training generally takes 28 hours for DNR and SMPLpix, and 40 hours for our method. Note that Neural Body [neuralbody] does not converge to a detailed result, as seen in Fig. 11.

Network Architectures and Optimizable Latents. and both have a size of . is based on the Pix2PixHD [pix2pixhd] architecture with Encoder blocks of [Conv2d, BatchNorm, ReLU], ResNet [He2016DeepRL] blocks, and Decoder blocks of [ReLU, ConvTranspose2d, BatchNorm]. has 3 Encoder and Decoder blocks, and 2 ResNet blocks. has 2 Decoder blocks. has ( = 2 or 3 or 4) Encoder blocks, and the exact number depends on the downsampling factor of PD-NeRF, such that the textural features and volumetric features have the same size, as discussed in Sec. 3.4. has () Encoder blocks, 4 Decoder blocks, and 5 ResNet blocks. For , we use a 7-layer MLP with a skip connection from the input to the 4th layer as in DeepSDF [park2019deepsdf]. From the 5th layer, the network branches into two heads, one predicting density with one fully-connected layer and the other predicting color features with two fully-connected layers.
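The MLP layout described above can be sketched as a plain NumPy forward pass. This is one possible reading, not the authors' implementation: it assumes a 4-layer trunk (with the input concatenated at the 4th layer), a 1-layer density head, and a 2-layer color-feature head, which together give 7 layers. The input width, hidden width, feature width, random initialisation, and the ReLU on the density output are all assumptions made for illustration.

```python
import numpy as np


def linear(rng, n_in, n_out):
    """Randomly initialised (weight, bias) pair for one fully-connected layer."""
    weight = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    return weight, np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


class RadianceMLP:
    """Trunk with an input skip at layer 4, followed by density and color heads."""

    def __init__(self, in_dim=64, hidden=256, feat_dim=16, seed=0):
        # in_dim, hidden and feat_dim are hypothetical sizes, not from the paper.
        rng = np.random.default_rng(seed)
        self.l1 = linear(rng, in_dim, hidden)
        self.l2 = linear(rng, hidden, hidden)
        self.l3 = linear(rng, hidden, hidden)
        self.l4 = linear(rng, hidden + in_dim, hidden)  # skip: input is concatenated
        self.density = linear(rng, hidden, 1)           # head 1: one layer
        self.color1 = linear(rng, hidden, hidden)       # head 2: two layers
        self.color2 = linear(rng, hidden, feat_dim)

    def __call__(self, x):
        h = x
        for weight, bias in (self.l1, self.l2, self.l3):
            h = relu(h @ weight + bias)
        weight, bias = self.l4
        h = relu(np.concatenate([h, x], axis=-1) @ weight + bias)
        # ReLU keeps the density non-negative (an assumed activation).
        sigma = relu(h @ self.density[0] + self.density[1])
        c = relu(h @ self.color1[0] + self.color1[1])
        features = c @ self.color2[0] + self.color2[1]
        return sigma, features


mlp = RadianceMLP()
sigma, features = mlp(np.zeros((5, 64)))
print(sigma.shape, features.shape)
```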

Geometry-guided Ray Marching. The success of our method depends on the efficient and effective training of the pose-conditioned downsampled NeRF (PD-NeRF). First, instead of sampling rays in the whole space, we utilize a geometry-guided ray marching algorithm, as illustrated in Fig. 8. Specifically, we only sample query points along the corresponding rays near the SMPL [smpl] mesh surface, within a region determined by a dilated SMPL mesh. The SMPL mesh is dilated along the normal of each face with a radius of , where is about 12cm for general clothes and 20cm for loose clothing such as the skirts in the M1 dataset (see Fig. 5 of the paper). We find the near and far points by querying the Z-buffer of the corresponding pixels after projecting the dilated SMPL mesh using PyTorch3D [ravi2020pytorch3d]. In addition, we sample more points in the near region, which is expected to contain the visible content. The geometry-guided ray marching algorithm and the UV-conditioned architecture enable us to train a PD-NeRF with resolution images and only 7 sampled points along each ray, as shown in Fig. 9. Though learned from low-resolution images, the reconstructed geometry still preserves some pose-dependent features.
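The sampling idea in this paragraph can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the mesh is dilated along per-vertex normals (the text dilates along face normals), the near and far depths are taken as given inputs rather than read from a rasterizer Z-buffer, and the bias toward the near region uses a power-law spacing controlled by a hypothetical `gamma` parameter, since the text does not specify the distribution.

```python
import numpy as np


def dilate_vertices(vertices, normals, radius):
    """Push each vertex outward along its unit normal by `radius` (mesh units)."""
    unit = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    return vertices + radius * unit


def sample_depths(near, far, n_samples=7, gamma=2.0):
    """Depths in [near, far] for every ray, denser toward `near` when gamma > 1."""
    u = np.linspace(0.0, 1.0, n_samples) ** gamma          # (n_samples,)
    return near[..., None] + (far - near)[..., None] * u   # (..., n_samples)


def query_points(origins, directions, depths):
    """3D query points origin + depth * direction, shape (..., n_samples, 3)."""
    return origins[..., None, :] + depths[..., None] * directions[..., None, :]


# Two rays looking down +z, with near/far bounds assumed to come from the dilated mesh.
near = np.array([1.0, 2.0])
far = np.array([1.5, 3.0])
depths = sample_depths(near, far)
points = query_points(np.zeros((2, 3)), np.tile([0.0, 0.0, 1.0], (2, 1)), depths)
print(depths.shape, points.shape)
```

With `gamma=1.0` the spacing becomes uniform, which gives a simple baseline for comparison.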

Figure 8: Geometry-guided ray marching. Left: sampling points by SMPL mesh dilation. Right: Red - SMPL model; Gray - rays and sampled points.
Figure 9: PD-NeRF constructed with resolution images and 7 sampled points along each ray: left (geometry), right (reference image).
Figure 10: Comparisons of methods trained with or without face identity loss. Ours(-F) indicates a variant of our method that is not trained with face identity loss.
Figure 11: Convergence of Neural Body in training. Left: training results of one frame after 100,000 iterations; Middle: after 150,000 iterations; Right: ground truth.

Appendix B More Experimental Results

Models LPIPS FID SSIM PSNR
AniNeRF [Peng2021AnimatableNR] .271 196.44 .773 23.36
Ours .090 62.33 .842 26.22
Table 5: Comparisons with AniNeRF on R1 dataset under novel poses. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes.

B.1 Comparisons

Comparisons with Animatable NeRF [Peng2021AnimatableNR]. A quantitative comparison with AniNeRF on the R1 dataset is shown in Tab. 5, and the results of the other methods are shown in Tab. 2 and Tab. 4 of the paper. Our method significantly outperforms AniNeRF on all four metrics. The qualitative comparison is shown in Fig. 12 and the supplementary video. AniNeRF learns motions through skinning weights: an MLP predicts skinning weights that are used to transform query points from the posed space to a canonical space. However, AniNeRF does not solve the one-to-many correspondence problem in backward skinning, and the predicted skinning weights may be wrong in challenging cases, as shown in Fig. 12. In contrast, our method does not require backward skinning, as discussed on Line 96 of the paper. Besides, our method can generate high-quality details, whereas AniNeRF cannot produce the same level of realism.

Figure 12: Qualitative comparisons with AniNeRF. Top: AniNeRF; Middle: ours; Bottom: ground truth.
Models LPIPS FID SSIM PSNR Time (s) VR_T(%)
DNR 0.1023 75.0152 0.8310 25.7303 0.184 -
SMPLpix 0.1002 69.8119 0.8350 25.9295 0.198 -
ANR 0.1172 78.5012 0.8301 26.0168 0.224 -
NeuralBody 0.2124 155.8382 0.8328 26.1718 18.200 -
Ours_ng 0.0991 70.5282 0.8370 25.9842 0.257 -
Ours_16(7) 0.0966 64.8711 0.8489 26.4356 0.292 11.99
Ours_16(12) 0.0959 63.7922 0.8494 26.4822 0.295 12.88
Ours_16(20) 0.0959 63.8337 0.8495 26.4841 0.305 15.41
Ours_8(12) 0.0901 62.3330 0.8415 26.2165 0.349 26.36
Ours_4(20) 0.0861 60.7884 0.8461 26.2465 0.464 44.61
Table 6: Accuracy and inference time of each method. VR_T(%) indicates the percentage of the inference time spent on volume rendering. Compared with Tab. 4 of the paper, the two additional metrics SSIM and PSNR are included.

Qualitative Comparisons on R5, R6, and Z2 Dataset. In addition to Fig. 7, the qualitative comparisons with the other methods on R5, R6, and Z2 are shown in Fig. 13.

Accuracy and Inference Time. The accuracy and inference time of each method are shown in Tab. 6.

Figure 13: Comparisons with the other methods on R5, R6, and Z2 Dataset.

B.2 Ablation Study

Face Identity Loss. We use the face identity loss (also used in [egorend, feanet]) to improve the qualitative results, as shown in Fig. 10. However, the improved faces do not translate into better overall quantitative results for each method, as listed in Tab. 8.

Feature Fusion. We compare two ways of fusing the volumetric and textural features discussed in Sec. 3.4, concatenation (Concat) and Attentional Feature Fusion (AFF [dai21aff]), on two datasets, R1 and R2 (about 12,000 frames for training and 3,000 frames for testing). We test the performance on novel poses. The quantitative results in Tab. 7 show that AFF improves LPIPS [lpip] and FID [Heusel2017GANsTB].
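The difference between the two fusion strategies can be illustrated with a small NumPy sketch. Concatenation stacks the two feature maps along the channel axis, while attentional fusion blends them as M * X + (1 - M) * Y with weights M computed from X + Y. The gate used here (global average pooling, a per-channel scale, and a sigmoid) is a deliberately simplified stand-in, not the multi-scale channel attention module of AFF, and `channel_scale` is a hypothetical parameter introduced only for this example.

```python
import numpy as np


def fuse_concat(vol_feat, tex_feat):
    """Stack two (C, H, W) feature maps along the channel axis -> (2C, H, W)."""
    return np.concatenate([vol_feat, tex_feat], axis=0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_attentional(vol_feat, tex_feat, channel_scale):
    """Blend M * vol + (1 - M) * tex, with M computed from vol + tex.

    M is a toy channel gate: global average pooling, a per-channel scale,
    then a sigmoid. The output keeps the input shape (C, H, W).
    """
    summed = vol_feat + tex_feat
    pooled = summed.mean(axis=(1, 2), keepdims=True)        # (C, 1, 1)
    gate = sigmoid(channel_scale[:, None, None] * pooled)   # values in (0, 1)
    return gate * vol_feat + (1.0 - gate) * tex_feat


vol = np.ones((4, 8, 8))
tex = np.zeros((4, 8, 8))
concat_out = fuse_concat(vol, tex)
aff_out = fuse_attentional(vol, tex, channel_scale=np.zeros(4))
print(concat_out.shape, aff_out.shape)
```

One practical difference visible in the shapes: concatenation doubles the channel count that the following renderer must consume, whereas the gated blend keeps it unchanged.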

Dataset Models LPIPS FID SSIM PSNR
R2 Concat .117 90.43 .838 28.55
R2 AFF .108 84.43 .833 28.62
R1 Concat .099 64.54 .856 26.78
R1 AFF .090 62.33 .842 26.22
Table 7: Comparisons of fusing volumetric and textural features by concatenation (Concat) and Attentional Feature Fusion (AFF [dai21aff]) on R1 and R2 dataset.
Models LPIPS FID SSIM PSNR
DNR .108 80.33 .809 24.05
DNR + F .103 75.42 .812 24.13
SMPLpix .104 74.57 .810 24.16
SMPLpix + F .109 78.59 .811 23.93
DNR .128 105.63 .820 27.82
DNR + F .125 98.86 .820 27.92
SMPLpix .124 99.81 .822 27.92
SMPLpix + F .124 94.81 .826 27.97
DNR .102 75.02 .831 25.73
DNR + F .103 75.35 .832 25.85
SMPLpix .100 69.81 .835 25.93
SMPLpix + F .104 75.34 .833 25.81
Ours(-F) .088 61.03 .841 26.42
Ours .090 62.33 .842 26.22
Table 8: Quantitative results of each method trained with or without face identity loss. Ours(-F) indicates a variant of our method that is not trained with face identity loss. (DNR + F) and (SMPLpix + F) are trained with face identity loss. Rows are grouped by dataset; the last group (from DNR .102 through Ours) matches the R1 results in Tab. 6.