Relighting Humans: Occlusion-Aware Inverse Rendering for Full-Body Human Images

08/07/2019 ∙ by Yoshihiro Kanamori, et al. ∙ University of Tsukuba TUT 2

Relighting of human images has various applications in image synthesis. For relighting, we must infer albedo, shape, and illumination from a human portrait. Previous techniques rely on human faces for this inference, based on spherical harmonics (SH) lighting. However, because they often ignore light occlusion, inferred shapes are biased and relit images are unnaturally bright, particularly at hollowed regions such as armpits, crotches, or garment wrinkles. This paper introduces the first attempt to infer light occlusion in the SH formulation directly. Based on supervised learning using convolutional neural networks (CNNs), we infer not only an albedo map and illumination but also a light transport map that encodes occlusion as nine SH coefficients per pixel. The main difficulty in this inference is the lack of training datasets compared to the unlimited variations of human portraits. Surprisingly, geometric information including occlusion can be inferred plausibly even with a small dataset of synthesized human figures, by carefully preparing the dataset so that the CNNs can exploit the data coherency. Our method accomplishes more realistic relighting than the occlusion-ignored formulation.




1. Introduction

Relighting of human images has various applications in image synthesis such as stylized shading of portraits [Chai et al., 2015; Shu et al., 2017] or cut & paste of human image clips [Xue et al., 2012]. For physically-based relighting of a human portrait, we must infer reflectance, shape, and illumination from the single image. Previous techniques obtain the cues of albedo and shape from human faces via fitting of morphable 3D face models [Blanz and Vetter, 1999] or inference based on convolutional neural networks (CNNs) [Sengupta et al., 2018], and infer illumination on the basis of spherical harmonics (SH) lighting [Ramamoorthi and Hanrahan, 2001; Basri and Jacobs, 2003].

The SH-based lighting yields an elegant analytical formulation of shading from surface normals and illumination if we ignore the light occlusion; we can calculate per-pixel SH bases from normals, and then illumination is obtained in the form of SH coefficients using least squares [Kemelmacher-Shlizerman and Basri, 2011]. However, as is well known in the real-time rendering literature, images rendered without light occlusion lack photorealism because hollowed regions, which ought to be occluded, become unnaturally bright compared to other regions. An approximate solution is to darken the hollowed regions by multiplying them by scalar values that depend on occlusion, i.e., ambient occlusion [Zhukov et al., 1998]. A more elegant solution is to encode light occlusion and cosine decay as SH coefficients, which we refer to as a light transport vector, and formulate lighting calculation as a dot product of the light transport vector and SH coefficients of illumination [Sloan et al., 2002]. Unfortunately, calculating occlusion requires the geometry to be inferred, and is quite computationally expensive due to visibility sampling at each surface point.

In this paper, we introduce the first attempt to infer not only diffuse albedo but also a light transport vector for each pixel from a masked full-body human image, which is accomplished by supervised learning using CNNs, with a ground-truth training dataset synthesized from scanned 3D human figures. The main problem in making this inference possible is the lack of a training dataset, considering the unlimited variations of human portraits regarding poses, genders, builds, and garments. To the best of our knowledge, there is only a single publicly-available dataset of scanned 3D human figures [Zhang et al., 2017], but it lacks variations (i.e., only five individuals with one or two outfits each). We additionally purchased commercial datasets of clothed 3D human figures, amounting to only a few hundred models. Surprisingly, by carefully selecting standing figures and aligning them in the training images, CNNs can learn plausible light transport vectors, which can capture occlusions at armpits, crotches, or garment wrinkles, even from such a small dataset. This result implies that CNNs can learn geometric information including occlusion from the silhouettes of human figures, i.e., binary masks, to some extent, which is similar to the conclusion drawn from recent work inferring normal maps only from silhouette lines [Lun et al., 2017].

Thanks to the inferred light transport maps, we can relight human portraits quite efficiently just by calculating dot products of light transport vectors and SH coefficients of light, followed by channel-wise multiplication of inferred albedo maps. The inference of albedo and light transport maps is fast (0.43 sec. for each image), and our inferred albedo and light transport vectors have sufficient quality for plausible relighting of human images, as shown in Figure 1.

2. Related Work

For single-image physically-based relighting, we must solve inverse rendering, i.e., estimation of shape, reflectance, and illumination from a single image, which is a highly ill-posed problem. Classical methods relax it by assuming that some of the three components are known, or use prior knowledge of the target in order to estimate the remaining components. Recent methods adopt data-driven approaches that exploit statistics of the three components in the target domain.

Classical inverse rendering.

The earliest technique is shape-from-shading [Horn, 1989], which estimates shape from the shading in an input image with known illumination. While methods in the early years assume simple illumination models such as point, directional, or area light sources, recent ones adopt environmental illumination represented with second-order SH [Johnson and Adelson, 2011]. Also, with known shape (e.g., convex shape [Chandraker and Ramamoorthi, 2011], occluding contour [Lopez-Moreno et al., 2013], or approximate geometry [Kholgade et al., 2014]), one can estimate reflectance and illumination. Another major line of work in this literature is intrinsic images [Barrow and Tenenbaum, 1978; Bonneel et al., 2017], which decompose an input image into shading (i.e., the product of shape and illumination) and reflectance based on the Retinex theory [Land and McCann, 1971]. With this decomposition, we can change the color or texture while retaining the shading. However, for relighting, we must further decompose the shading into shape and illumination.

Data-driven approaches.

Data-driven approaches are commonly adopted in recent techniques for, e.g., outdoor/indoor illumination estimation [Hold-Geoffroy et al., 2017; Gardner et al., 2017], estimation of specular reflectance and illumination [Oxholm and Nishino, 2012; Georgoulis et al., 2018] as well as intrinsic images [Bell et al., 2014; Narihira et al., 2015; Baslamisli et al., 2018; Shi et al., 2017]. As a generalization of both shape-from-shading and intrinsic images, Barron and Malik [2015] factored single input images of general objects into shape, diffuse reflectance, and SH illumination, via optimization with statistical priors.

Face inverse rendering.

Simultaneous inference similar to Barron and Malik’s work has been actively studied in the inverse rendering of human faces since the seminal work of the 3D morphable model (3DMM) [Blanz and Vetter, 1999]. The 3DMM is a statistical model of albedo and shape of human faces and serves as a strong prior for face inverse rendering via geometric fitting to the target face image. While the illumination model used in the original 3DMM paper [Blanz and Vetter, 1999] was directional light, currently the standard choice is again second-order SH. Due to the increase of large-scale publicly-available face datasets, many learning-based methods with [Tewari et al., 2017] and without [Shu et al., 2017; Sengupta et al., 2018] 3DMMs have been proposed for face inverse rendering.

Our work also adopts second-order SH illumination, but tackles inverse rendering of not only faces but also full bodies including garments. Full body images contain face regions, and thus existing techniques for faces can be applied to infer illumination. However, one concern is that most of the existing techniques assume that light occlusion is ignorable; this assumption might be valid for faces because most faces are approximately convex except for the vicinity of noses, but it does not hold true for concave regions in the human body, e.g., armpits, a crotch, or a neck under a chin, that should receive less light due to self-shadowing. Consequently, such concave regions become unnaturally bright if we ignore the light occlusion. For better relighting, we learn light occlusion for SH-based shading.

Schneider et al. [2017] also proposed to account for light occlusion in SH-based face inverse rendering to better handle face wrinkles. They extended a 3DMM [Paysan et al., 2009] so that not only albedo and shape but also per-vertex light transport vectors can be reconstructed via multilinear regression. However, light transport vectors are available only in the face region.

Apart from the SH-based formulation, Yamaguchi et al. [2018] inferred a base mesh and high-quality textures of a face from a single image. Without considering lighting formulation, they infer textures for photorealistic rendering of faces using regression with an adversarial loss. Their method relies on plenty of high-quality measured data, which are unfortunately not available in general for human bodies.

Other human-oriented techniques.

Traditionally, human whole-body relighting has been performed based on measurement under controlled setups with multiple lights and cameras [Debevec et al., 2000; Li et al., 2013]. In monocular settings, RGB video cameras are also used for capturing faces with multiple temporal frames, e.g., [Garrido et al., 2013]. Here we focus on single-image techniques. If the human target figure is almost naked, we can obtain a reasonable shape cue for inverse rendering by fitting statistical 3D body models [Anguelov et al., 2005; Balan et al., 2007] after segmenting out the figure mask [Guan et al., 2009]. However, this is generally not applicable to human figures wearing garments. There are also techniques that can estimate garment shapes from single images [Zhou et al., 2013; Danerek et al., 2017]. Our method is versatile and can capture garment wrinkles plausibly from various human portraits.

CNN-based techniques for material inference.

Recent methods can infer materials [Aittala et al., 2016; Li et al., 2017] of objects using CNNs from a single image of flat-surface objects. Innamorati et al. [2017] proposed an interesting approach that decomposes an input image into multiple components for manual photo retouching. They account for light occlusion in the form of ambient occlusion and decompose the shading component into six directions based on non-negative first-order SH bases. With this formulation, photo-retouch artists can emulate relighting by manually increasing/decreasing directional shading components. Inspired by their work, we will compare our method with the conventional SH formulation plus ambient occlusion in Section 7.

3. Spherical Harmonics (SH) Lighting

In this section, we briefly review spherical harmonics (SH) lighting with and without consideration of light occlusion.

3.1. SH Lighting without Occlusions

SH are orthonormal basis functions defined on the spherical domain and are known in the rendering community to be well suited for capturing low-frequency signals. It has been shown that just nine SH bases (i.e., basis functions up to second order) can capture up to 99.22% of the irradiance on a convex surface [Basri and Jacobs, 2003].

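Let us review the mathematical formulation [Ramamoorthi and Hanrahan, 2001]. If we ignore light occlusion and interreflection, the irradiance $E$ can be calculated with an integral of arbitrary incoming radiance $L$ over the hemispherical domain $\Omega(\mathbf{n})$ defined by a unit normal vector $\mathbf{n}$:

$$E(\mathbf{n}) = \int_{\Omega(\mathbf{n})} L(\boldsymbol{\omega}_i) \, \max(\mathbf{n} \cdot \boldsymbol{\omega}_i, 0) \, d\boldsymbol{\omega}_i. \tag{1}$$

We omit the dependency on surface position for simplicity. Ramamoorthi and Hanrahan projected the spherical signals of the incoming illumination distribution and the cosine decay term to SH. Using elevation and azimuth angles $\theta$, $\phi$ to parameterize a unit direction vector $\boldsymbol{\omega}$, these signals are expanded as

$$L(\theta, \phi) = \sum_{l,m} L_{l,m} \, Y_{l,m}(\theta, \phi), \tag{2}$$

$$A(\theta) = \max(\cos\theta, 0) = \sum_{l} A_l \, Y_{l,0}(\theta), \tag{3}$$

where $Y_{l,m}$ are SH with $l \geq 0$ and $-l \leq m \leq l$. $L_{l,m}$ and $A_l$ are coefficients for the illumination and cosine decay term, respectively. $A$ does not depend on the azimuth angle $\phi$. The integral in Equation (1) is now rewritten as

$$E(\theta, \phi) = \sum_{l,m} \hat{A}_l \, L_{l,m} \, Y_{l,m}(\theta, \phi), \tag{4}$$

where $\hat{A}_l = \sqrt{4\pi / (2l+1)} \, A_l$ and $(\theta, \phi)$ now denote the direction of the normal. Here $Y_{l,m}$ can be represented as polynomials of coordinates of a unit normal $\mathbf{n}$. If we rewrite the coefficients $L_{l,m}$ (up to second order) as a nine-dimensional vector $\mathbf{L}$ and the scaled basis functions $\hat{A}_l Y_{l,m}$ as a vector $\mathbf{Y}(\mathbf{n})$, $E$ is calculated as a dot product

$$E(\mathbf{n}) = \mathbf{Y}(\mathbf{n})^{\top} \mathbf{L}. \tag{5}$$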
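As an illustration of the occlusion-free dot product in Equation (5), the following minimal NumPy sketch evaluates the nine real SH basis polynomials for a unit normal and multiplies them with the illumination coefficients. The numeric constants are the standard real-SH normalization factors and cosine-lobe factors (pi, 2*pi/3, pi/4); the ordering of the nine components, the axis convention, and the function names `sh_basis` and `irradiance_unoccluded` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

# Cosine-lobe factors for bands l = 0, 1, 2, repeated once per basis function.
A_HAT = np.array([np.pi] + [2.0 * np.pi / 3.0] * 3 + [np.pi / 4.0] * 5)


def sh_basis(n):
    """Nine real SH basis values for unit vectors n of shape (..., 3)."""
    n = np.asarray(n, dtype=np.float64)
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),      # l = 0
        0.488603 * y,                    # l = 1, m = -1
        0.488603 * z,                    # l = 1, m = 0
        0.488603 * x,                    # l = 1, m = 1
        1.092548 * x * y,                # l = 2, m = -2
        1.092548 * y * z,                # l = 2, m = -1
        0.315392 * (3.0 * z * z - 1.0),  # l = 2, m = 0
        1.092548 * x * z,                # l = 2, m = 1
        0.546274 * (x * x - y * y),      # l = 2, m = 2
    ], axis=-1)


def irradiance_unoccluded(normals, light):
    """Dot product of scaled SH bases and SH light coefficients.

    normals: (..., 3) unit normals; light: (9,) or (9, 3) coefficients.
    """
    transport = A_HAT * sh_basis(normals)  # (..., 9), no occlusion
    return transport @ light


# Uniform white environment of unit radiance: only the l = 0 coefficient is non-zero.
uniform_light = np.zeros((9, 3))
uniform_light[0] = 2.0 * np.sqrt(np.pi)
E = irradiance_unoccluded(np.array([0.0, 0.0, 1.0]), uniform_light)
```

For a uniform environment of unit radiance the irradiance equals pi for every normal, which is a convenient sanity check for the constants.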


3.2. SH Lighting with Occlusions

Although the above formulation is elegant, the critical problem is that light occlusion is ignored. Concave regions should receive less light due to self-shadowing, and thus should be darker than other convex regions. To account for light occlusion in Equation (1), the visibility term $V$ should be added in the integrand

$$E(\mathbf{n}) = \int_{\Omega(\mathbf{n})} L(\boldsymbol{\omega}_i) \, V(\boldsymbol{\omega}_i) \, \max(\mathbf{n} \cdot \boldsymbol{\omega}_i, 0) \, d\boldsymbol{\omega}_i. \tag{6}$$

$V(\boldsymbol{\omega}_i)$ returns zero if the light in the incoming direction $\boldsymbol{\omega}_i$ is occluded and one otherwise. Unfortunately, $V$ does not have any analytical form in general, and one must sample visibility by casting many shadow rays at each surface point, which is quite computationally expensive.

Sloan et al. [2002] proposed to precompute the visibility term together with the cosine decay term, and project the compound spherical signal onto SH in order to enable efficient dot-product calculation (similar to Equation (5)) during real-time rendering

$$E = \mathbf{T}^{\top} \mathbf{L}, \tag{7}$$

where $\mathbf{T}$ is a vector that encodes SH coefficients of the compound spherical signal of the visibility term and the cosine decay term, and $\mathbf{L}$ is the vector of SH illumination coefficients. They also proposed to handle glossy reflection and approximate interreflection. This technique is well-known as precomputed radiance transfer (PRT), which has been studied and extended extensively in the real-time rendering literature.

Hereafter we refer to $\mathbf{T}$ as a light transport vector and a nine-channel image containing per-pixel light transport vectors as a light transport map.
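To make the precomputation concrete, here is a hedged Monte Carlo sketch that projects the product of visibility and clamped cosine onto nine SH bases at a single surface point. It is only an illustration of the idea behind PRT, not the renderer used for the dataset: the `visibility` callback stands in for shadow-ray casting, and the sample count, the uniform sphere sampling, and all function names are assumptions.

```python
import numpy as np


def sh_basis(d):
    """Nine real SH basis values for unit vectors d of shape (..., 3)."""
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], axis=-1)


def transport_vector(normal, visibility, count=200_000, seed=0):
    """Estimate the light transport vector at one surface point.

    visibility: callable mapping (count, 3) directions to 0/1 values
    (a hypothetical stand-in for casting shadow rays).
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(count, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # uniform on the sphere
    cosine = np.maximum(dirs @ np.asarray(normal, dtype=np.float64), 0.0)
    weight = visibility(dirs) * cosine                    # (count,)
    # Uniform sphere sampling has pdf 1 / (4*pi), hence the 4*pi / count factor.
    return (4.0 * np.pi / count) * (weight[:, None] * sh_basis(dirs)).sum(axis=0)


normal = np.array([0.0, 0.0, 1.0])
T_open = transport_vector(normal, lambda d: np.ones(len(d)))
# Toy occluder: everything on the +x side is blocked.
T_half = transport_vector(normal, lambda d: (d[:, 0] < 0.0).astype(np.float64))
```

With no occluder, the estimate converges to the analytic occlusion-free vector (cosine-lobe factor times SH basis of the normal); with the toy occluder the constant term is roughly halved, which is the darkening that the occlusion-free formulation cannot express.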

4. Our Loss Functions

Figure 2. Our network architecture. The input image (multiplied with the binary mask) is fed to the encoder, and the output feature vector is then fed to the decoders of albedo, light transport, and light (Section 5). We calculate 15 types of loss functions by considering the total variations (TVs) for albedo and light transport maps as well as the combinations of inferred outputs and their ground-truth (GT) (Section 4).


In this section, we define the loss functions to infer albedo and light transport maps using our CNNs based on the SH formulation.

For training and testing, we prepare a synthetic human image dataset and an illumination dataset (see Section 6 for the details). The synthetic human image dataset contains a set of a binary mask $\mathbf{M} \in \{0, 1\}^{N \times c}$ (where $N$ is the number of pixels and $c$ is the number of channels), albedo map $\mathbf{A} \in \mathbb{R}^{N \times 3}$, and light transport map $\mathbf{T} \in \mathbb{R}^{N \times 9}$ for each 3D human model. The illumination dataset contains SH illumination coefficients for RGB channels $\mathbf{L} \in \mathbb{R}^{9 \times 3}$. Note that we multiply the binary mask to the ground-truth data and network outputs (e.g., $\mathbf{M} \odot \mathbf{A}$ or $\mathbf{M} \odot \tilde{\mathbf{A}}$, where $\odot$ denotes element-wise multiplication) so that we can ignore out-of-mask pixels. In the following explanation, we omit the element-wise multiplication of the binary mask for simplicity.

We use a CNN architecture for inferring light, albedo, and light transport maps (see Section 5 for the network models). The input of the CNN is a masked, RGB full-body human image $\mathbf{I} \in \mathbb{R}^{N \times 3}$. Let $\tilde{\mathbf{A}}$ be the CNN output for an albedo map, $\tilde{\mathbf{T}}$ the CNN output for a light transport map, and $\tilde{\mathbf{L}}$ the CNN output for illumination. Regarding notation, we use tildes to indicate inferred outputs, and write $\tilde{\mathbf{x}} = F(\mathbf{y}; \theta)$ to indicate that $\mathbf{y}$ is the input and $\theta$ is the parameter of network $F$. We optimize the network parameters for albedo, light transport, and light via regression.

Our CNN architecture has a similar design to SfSNet [Sengupta et al., 2018], which infers light, albedo, and normal maps for faces simultaneously. The loss functions used in SfSNet are L1 losses for the inferred albedo map, normal map (from which a light transport map without light occlusion can be calculated analytically), light, and the image reconstructed from the three components. We use four similar loss functions, except that we infer light transport maps directly instead of normal maps. Namely, we use L1 losses for $\tilde{\mathbf{A}}$, $\tilde{\mathbf{T}}$, $\tilde{\mathbf{L}}$, and the reconstructed image $\tilde{\mathbf{A}} \odot (\tilde{\mathbf{T}} \tilde{\mathbf{L}})$. Furthermore, we also use the following L1 losses:

TV losses:

L1 total variation (TV) losses both for albedo maps $\tilde{\mathbf{A}}$ and light transport maps $\tilde{\mathbf{T}}$,

Shading losses:

Three patterns of combination of inferred/GT data to compute a shading map, i.e., $\mathbf{T} \tilde{\mathbf{L}}$, $\tilde{\mathbf{T}} \mathbf{L}$, and $\tilde{\mathbf{T}} \tilde{\mathbf{L}}$,

Reconstruction losses:

Six patterns of combination to reconstruct an input image, i.e., $\mathbf{A} \odot (\mathbf{T} \tilde{\mathbf{L}})$, $\mathbf{A} \odot (\tilde{\mathbf{T}} \mathbf{L})$, $\mathbf{A} \odot (\tilde{\mathbf{T}} \tilde{\mathbf{L}})$, $\tilde{\mathbf{A}} \odot (\mathbf{T} \mathbf{L})$, $\tilde{\mathbf{A}} \odot (\mathbf{T} \tilde{\mathbf{L}})$, and $\tilde{\mathbf{A}} \odot (\tilde{\mathbf{T}} \mathbf{L})$.

In total, we use 15 L1 losses. All weights are set to one.

To consider the benefit of the 15 losses, let us take the shading losses, i.e., the three losses for a shading map, as an example. For the multiplication of a light transport map and a light, there are three combinations, namely, GT * inferred (i.e., $\mathbf{T} \tilde{\mathbf{L}}$), inferred * GT (i.e., $\tilde{\mathbf{T}} \mathbf{L}$), and inferred * inferred (i.e., $\tilde{\mathbf{T}} \tilde{\mathbf{L}}$), where GT means ground truth. If GT is involved, GT works as a weighting matrix for the inferred output, which forces the output to lie on a solution manifold in the high-dimensional space. If both are inferred outputs, the loss becomes a soft constraint for the intermediate output, i.e., the inferred shading map. For the choice of loss functions, we chose formulae that do not introduce bias, except for the TV losses. Using as many such bias-free formulae as possible as losses is a quite general technique and would be beneficial to other problems. We show an ablation study with and without these losses in Section 7.1.
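The bookkeeping of the 15 losses can be sketched as follows. This is a minimal NumPy illustration of how the count arises, assuming that a shading map is the matrix product of a light transport map (N x 9) and a light (9 x 3), that the reconstruction is the albedo multiplied element-wise by the shading, that shading losses are measured against the ground-truth shading, and that reconstruction losses are measured against the input image. The TV definition (mean absolute finite differences) and all names are assumptions for this example.

```python
import itertools
import numpy as np


def l1(a, b):
    return float(np.abs(a - b).mean())


def tv_l1(x, height, width):
    """L1 total variation of a per-pixel map stored as (N, channels)."""
    img = x.reshape(height, width, -1)
    return float(np.abs(np.diff(img, axis=0)).mean()
                 + np.abs(np.diff(img, axis=1)).mean())


def fifteen_losses(image, gt, pred, height, width):
    """gt / pred: dicts with 'albedo' (N, 3), 'transport' (N, 9), 'light' (9, 3)."""
    src = {"gt": gt, "pred": pred}
    losses = {
        "albedo": l1(pred["albedo"], gt["albedo"]),
        "transport": l1(pred["transport"], gt["transport"]),
        "light": l1(pred["light"], gt["light"]),
        "tv_albedo": tv_l1(pred["albedo"], height, width),
        "tv_transport": tv_l1(pred["transport"], height, width),
    }
    gt_shading = gt["transport"] @ gt["light"]
    # Shading: every (transport, light) pairing except GT with GT -> 3 losses.
    for t, l in itertools.product(src, repeat=2):
        if (t, l) != ("gt", "gt"):
            shading = src[t]["transport"] @ src[l]["light"]
            losses[f"shading_{t}_{l}"] = l1(shading, gt_shading)
    # Reconstruction: every triple except all-GT -> 7 losses
    # (the all-inferred triple is the basic reconstruction loss).
    for a, t, l in itertools.product(src, repeat=3):
        if (a, t, l) != ("gt", "gt", "gt"):
            recon = src[a]["albedo"] * (src[t]["transport"] @ src[l]["light"])
            losses[f"recon_{a}_{t}_{l}"] = l1(recon, image)
    return losses


rng = np.random.default_rng(0)
H, W = 4, 5
gt = {"albedo": rng.random((H * W, 3)),
      "transport": rng.random((H * W, 9)),
      "light": rng.random((9, 3))}
image = gt["albedo"] * (gt["transport"] @ gt["light"])
pred = {k: v + 0.1 for k, v in gt.items()}
losses = fifteen_losses(image, gt, pred, H, W)
total = sum(losses.values())  # all weights set to one
```

Three direct losses, two TV losses, three shading losses, and seven reconstruction losses (the basic one plus the six mixed patterns) add up to 15.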

5. Network Models

Figure 2 illustrates our encoder-decoder network. As mentioned, our network is similar to that of SfSNet, except that ours has many more parameters. Our encoder has six convolutional layers whose output channels are { 64, 128, 256, 512, 512, 512 } and whose stride is two. The encoded features are then fed to the decoders for albedo, light transport, and light. The decoders for albedo and light transport maps have almost the same architecture, except that the numbers of output channels are different (i.e., nine for light transport and three for albedo). Each decoder has a residual block (consisting of two convolutional layers with 512 channels) and six deconvolutional layers (output channels are { 512, 512, 256, 128, 64, 9 or 3 } and the stride is also two). The encoder and decoders are connected using skip-connections. For the light decoder, the outputs of the encoder and the decoders for albedo and light transport are concatenated and fed to four convolutional layers, which yield a 27-dimensional vector. While SfSNet uses average pooling layers, ours consists of (de-)convolutional layers only. Each (de-)convolutional layer (except for the first and final layers) is followed by batch normalization and (leaky) ReLU. The first three deconvolutional layers of each decoder are followed by dropout with probability 0.5.
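The layer description above implies a simple shape schedule, which the following plain-Python sketch traces. It assumes that padding is chosen so that each stride-2 layer exactly halves (or doubles) the spatial size, and it uses a hypothetical 1024-pixel square input purely for illustration; kernel sizes and the exact skip-connection wiring are not specified here and are left out.

```python
ENCODER_CHANNELS = [64, 128, 256, 512, 512, 512]
DECODER_CHANNELS = [512, 512, 256, 128, 64]  # the final layer outputs 9 or 3


def encoder_shapes(size, channels=ENCODER_CHANNELS, stride=2):
    """(channels, height, width) after each stride-2 convolution."""
    shapes = []
    for c in channels:
        size //= stride
        shapes.append((c, size, size))
    return shapes


def decoder_shapes(size, out_channels, channels=DECODER_CHANNELS, stride=2):
    """(channels, height, width) after each stride-2 deconvolution."""
    shapes = []
    for c in channels + [out_channels]:
        size *= stride
        shapes.append((c, size, size))
    return shapes


INPUT_SIZE = 1024  # hypothetical resolution, for illustration only
enc = encoder_shapes(INPUT_SIZE)
bottleneck = enc[-1][1]
transport_dec = decoder_shapes(bottleneck, out_channels=9)
albedo_dec = decoder_shapes(bottleneck, out_channels=3)
LIGHT_DIM = 9 * 3  # nine SH coefficients for each of the RGB channels
```

Six stride-2 layers reduce the spatial size by a factor of 64, and the six deconvolutions restore it, so the albedo and light transport maps come out at the input resolution with three and nine channels, respectively.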

6. Dataset Generation

As explained in Section 4, we prepared a synthetic human image dataset and an illumination dataset. Here we explain the details.

Synthetic human image dataset.

Our synthetic human image dataset consists of a binary mask, albedo map, normal map, and light transport map, created by rendering each scanned 3D human figure using a hardware-accelerated renderer. The scanned 3D human figures were obtained from two resources; one is the publicly-available BUFF dataset [Zhang et al., 2017], and the other is commercial websites. The BUFF dataset contains 9,613 standing 3D figures, but lacks variations for our purpose. Namely, it only includes five individuals with one or two outfits each and time-varying poses, and thus subsequent 3D models of the same individual in the same outfit are almost identical. To avoid biasing the training dataset, we manually picked 74 representative models from the BUFF dataset. The commercial data were purchased from different websites and amount to 271 models. We randomly split the models, 345 in total, into 276 training data and 69 test data. Figure 3 shows some examples of our training data. Note that some albedo maps contain self-shadows because shading was not completely removed during the scanning process.

When creating the dataset, we carefully aligned 3D models so that our CNNs can exploit the geometric regularity of our small dataset. Namely, we rendered front-facing figures in the middle of square images while aligning them so that they have almost the same vertical size in the images with vertical paddings at the top and bottom of a fixed size (5% of image heights). Regarding poses, we only used standing figures and removed sitting ones from our training/test datasets. The image resolution is

pixels. No data augmentation is employed for the human image dataset.
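The alignment rule (figure centered in a square image, with 5% vertical padding at the top and bottom) can be written as a scale and offset computed from the mask's bounding box. The sketch below is only one way to express it; the function name, the pixel-coordinate convention, and the use of the bounding-box center for horizontal centering are assumptions.

```python
import numpy as np


def alignment_transform(mask, out_size, pad_ratio=0.05):
    """Return (scale, offset_x, offset_y) so that out = scale * src + offset.

    mask: (H, W) binary array of the figure in the source rendering.
    out_size: side length of the square output image in pixels.
    """
    ys, xs = np.nonzero(mask)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    # The figure should span the image height minus top and bottom padding.
    target_height = out_size * (1.0 - 2.0 * pad_ratio)
    scale = target_height / (bottom - top)
    offset_y = out_size * pad_ratio - scale * top
    # Center the bounding box horizontally.
    offset_x = 0.5 * out_size - scale * 0.5 * (left + right)
    return scale, offset_x, offset_y


mask = np.zeros((100, 60), dtype=bool)
mask[10:90, 20:40] = True  # an 80-pixel-tall toy silhouette
scale, offset_x, offset_y = alignment_transform(mask, out_size=200)
```

In this toy case the silhouette is scaled by 2.25 so that it spans rows 10 to 190 of the 200-pixel output, leaving 5% padding on each side.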

Illumination dataset.

For our illumination dataset, we used the Laval Indoor HDR dataset [Gardner et al., 2017] containing 2,144 environment maps in panoramic HDR format. We first converted them into diffuse SH coefficients and calculated a reference brightness of each environment map using Equation (5) with a front-facing normal . We omitted dark environment maps if the reference brightness is lower than 0.2, and scaled the brightness of other environment maps so that reference brightness lies within

. To obtain further variations, we rotated each environment map 35 times in steps of 10 degrees around the vertical axis. We then reduced the redundancy using k-means clustering and manually removed unusual illuminations (e.g., overly bright lights, back-lights, and lights causing overly strong contrasts in shadings). Finally, from the remaining 50 illuminations, we randomly picked 40 illuminations for training and 10 for testing. Figure 4 shows some examples of our training data.
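The first steps of this pipeline (projection to SH, reference brightness with a front-facing normal, and rotation around the vertical axis) can be sketched as below. This is a rough illustration under several assumptions: the environment map is an equirectangular array with the y axis pointing up and the azimuth running along the image width, rotation is done by horizontally rolling the map before projection, the reference brightness is averaged over RGB, and the front-facing normal is (0, 0, 1). None of these conventions is confirmed by the text, and the k-means and manual filtering steps are omitted.

```python
import numpy as np

A_HAT = np.array([np.pi] + [2.0 * np.pi / 3.0] * 3 + [np.pi / 4.0] * 5)


def sh_basis(d):
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], axis=-1)


def project_to_sh(env):
    """env: (H, W, 3) equirectangular radiance -> (9, 3) SH coefficients."""
    h, w, _ = env.shape
    theta = (np.arange(h) + 0.5) * np.pi / h       # polar angle from +y (up)
    phi = (np.arange(w) + 0.5) * 2.0 * np.pi / w   # azimuth around y
    t, p = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(t) * np.sin(p), np.cos(t), np.sin(t) * np.cos(p)], axis=-1)
    solid_angle = np.sin(t) * (np.pi / h) * (2.0 * np.pi / w)
    basis = sh_basis(dirs) * solid_angle[..., None]  # (H, W, 9)
    return np.einsum("hwk,hwc->kc", basis, env)


def reference_brightness(light, normal=(0.0, 0.0, 1.0)):
    """Unoccluded irradiance for a front-facing normal, averaged over RGB."""
    return float(((A_HAT * sh_basis(np.array(normal))) @ light).mean())


def rotated_lights(env, steps=36):
    """Original map plus 35 rotations in 10-degree steps (W divisible by 36)."""
    shift = env.shape[1] // steps
    return [project_to_sh(np.roll(env, k * shift, axis=1)) for k in range(steps)]


env = np.zeros((36, 72, 3))
env[:, :4] = 1.0   # a bright strip around azimuth 0 (in front of the normal)
env[:, -4:] = 1.0
lights = rotated_lights(env)
kept = [l for l in lights if reference_brightness(l) >= 0.2]  # drop dark maps
```

Rolling the strip behind the figure lowers the reference brightness, so the threshold removes those orientations in this toy example.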

Figure 3. Examples from our synthetic human image dataset. For each human figure, an albedo map, binary mask, normal map, and light transport map (visualized as a shading map) are displayed. Note that some albedo maps contain self-shadows due to scanning inadequacy. Each image is trimmed.
Figure 4. Examples from our illumination dataset. SH illumination coefficients are visualized as shading maps of a sphere and the top-left model in Figure 3.

7. Experiments

RMSE within binary masks
Method        Shading  Transport  Normal  AO     Light  Albedo
SfSNet        0.299    0.526      0.346   N/A    0.207  0.135
SfSNet-AO     0.293    0.529      0.347   0.083  0.207  0.131
Ours (min)    0.237    0.406      N/A     N/A    0.205  0.131
Ours (full)   0.219    0.393      N/A     N/A    0.199  0.129

SSIM within bounding boxes of masks
Method        Shading  Transport  Normal  AO     Light  Albedo
SfSNet        0.884    0.755      0.776   N/A    0.446  0.954
SfSNet-AO     0.890    0.749      0.772   0.946  0.475  0.955
Ours (min)    0.909    0.777      N/A     N/A    0.473  0.953
Ours (full)   0.927    0.781      N/A     N/A    0.500  0.943

Table 1. RMSE and SSIM for the inferred results using each method. The light transport maps of SfSNet and SfSNet-AO were computed from corresponding normal maps analytically. Best values are highlighted in boldface.

                    RMSE                                SSIM
Method              Shading  Transport  Light  Albedo   Shading  Transport  Light  Albedo
W/o TV              0.226    0.391      0.202  0.126    0.923    0.784      0.471  0.956
W/o shading         0.227    0.398      0.201  0.132    0.922    0.781      0.496  0.940
W/o reconstruction  0.224    0.394      0.198  0.144    0.925    0.782      0.503  0.907

Table 2. RMSE and SSIM for an ablation study for our 15-losses. Best values are highlighted in boldface.
Figure 5. Comparisons with synthetic data. From left to right, input images, shading maps, normal maps, ambient occlusion maps, light maps (i.e., light information visualized by shading a sphere), and albedo maps. The light maps are in the same order as shading/albedo maps, from top to bottom. The red ovals in inferred shading maps highlight differences between SfSNet-AO and “Ours (full).”
Figure 6. Comparisons with real photographs. The red ovals in inferred shading maps highlight differences between SfSNet-AO and “Ours (full).” Input images courtesy of Guillaume Bolduc, George Gvasalia, Jacob Postuma, and Kat Garcia.
Figure 7. Relighting comparison using synthetic data. The red ovals highlight differences between SfSNet-AO and “Ours (full).”
Figure 8. Relighting comparison using real photographs. The red ovals highlight differences between SfSNet-AO and “Ours (full).” Input images courtesy of George Gvasalia and Jacob Postuma.

We implemented our CNN models using Python and the Chainer library, and ran our code on a PC with NVIDIA GeForce GTX 1080 Ti GPUs. We used Adam as an optimizer with a fixed learning rate of 0.0002 and a batch size of 1. One epoch of training on a single GPU took about three hours with our CNN models. We used the synthetic images of

pixels for training in our results. Our CNN models, as well as other models for comparisons, were trained for up to 60 epochs. For relighting, we used Debevec’s environment maps [2004], namely, kitchen_probe for Figures 1, 7, and 8 and grace_probe for Figure 1. The input photographs in our results were downloaded from Unsplash. Specifically, we selected high-quality free-license images of single human figures, generated their binary masks automatically using Adobe Photoshop with manual correction, applied trimming and uniform scaling, and then added paddings to make them pixels.

7.1. Comparisons of Inference

To clarify the advantage of our method, we compared it with three alternative methods. The first one is SfSNet [Sengupta et al., 2018], although the network architecture is not the original one designed for small images but the much richer one defined in Section 5. In this case, a decoder of SfSNet outputs three-channel normal maps instead of nine-channel light transport maps. The second method is SfSNet plus ambient occlusion (hereafter we call it SfSNet-AO). A single-channel ambient occlusion map is inferred by an additional decoder branch. The third method is our network with four losses only, similar to SfSNet. We refer to the 4-loss version as “Ours (min)” and the 15-loss version as “Ours (full).” Comparisons between “SfSNet” and “Ours (min)” reveal the impact of considering light occlusion, whereas those between “Ours (min)” and “Ours (full)” demonstrate the effectiveness of the full loss.

Figures 5 and 6 show the results of qualitative comparisons using synthetic test data and photographs, respectively. The red ovals in inferred shading maps highlight differences between SfSNet-AO and “Ours (full).” The first row of Figure 5 indicates that all methods struggle to separate textures from shading maps. The shading maps of SfSNet often look like flat bas-reliefs because light occlusion is ignored. In the first and fourth rows, SfSNet-AO estimates the depth gaps between jackets and shirts to be smaller than the actual gaps. Such biased estimates in shading maps often yield unnaturally darkened albedo maps. Comparing our two variants, i.e., “Ours (min)” and “Ours (full)”, the latter yields sharper shading maps than the former. In Figure 6, we can see a similar tendency with real photographs.

For quantitative comparison, Table 1 summarizes the RMSE and SSIM of each component. To reduce the effects of out-of-mask pixels, we calculate RMSEs within binary masks, whereas we calculate SSIMs within the bounding boxes of binary masks. The table shows that “Ours (full)” is consistently better than the alternatives except for “Albedo SSIM”. The reason why “Albedo SSIM” of “Ours (full)” is lower than the others is that “Ours (full)” better cancels the baked-in shadings (see Section 6) in GT albedos, and thus its output albedos become more dissimilar to “GT.” Table 2 further reveals the impacts of the TV losses, shading losses, and reconstruction losses. Overall, the accuracies tend to be lower than those of “Ours (full)” in Table 1. Note that light transport and albedo maps of “W/o TV” are slightly better than those of “Ours (full).” This result is reasonable because TV losses enforce smoothing, i.e., add biases, to the inferred outputs in exchange for generalization capability.
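The masking used for the two metrics can be sketched as follows: RMSE is computed only over in-mask pixels, while the image is cropped to the mask's bounding box before a windowed metric such as SSIM is applied. This is a minimal NumPy illustration with assumed function names; the SSIM computation itself is left to an external implementation and is not shown.

```python
import numpy as np


def masked_rmse(pred, gt, mask):
    """RMSE over in-mask pixels only. pred, gt: (H, W, C); mask: (H, W)."""
    m = mask.astype(bool)
    diff = pred[m] - gt[m]  # (num_in_mask_pixels, C)
    return float(np.sqrt(np.mean(diff ** 2)))


def crop_to_mask(image, mask):
    """Crop an image to the bounding box of the mask (for SSIM evaluation)."""
    ys, xs = np.nonzero(mask)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]


mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 3:5] = True
gt = np.ones((8, 8, 3))
pred = np.zeros((8, 8, 3))
pred[~mask] = 100.0  # errors outside the mask must not affect the score
rmse = masked_rmse(pred, gt, mask)
cropped = crop_to_mask(pred, mask)
```

Cropping to the bounding box keeps the rectangular window structure that SSIM needs while discarding most of the background.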

7.2. Relighting and Light Transfer

Figures 7 and 8 show the results of relighting with inferred albedo and light transport maps, given synthetic test images and real photographs, respectively. Comparisons with path-traced reference images as well as movies are available in the supplemental material.

Figure 9. Light transfer. By performing inference for two images and swapping inferred lights, we can transfer the inferred light to each other.

By inferring illuminations in two human portraits, we can transfer the inferred illumination to each other. Figure 9 shows the results of light transfer with synthetic human images. The inferred illuminations have colors slightly different from the ground-truth, but the patterns of the illuminations are similar. The relit human images are therefore similar to the ground-truth.
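Relighting and light transfer reduce to a per-pixel dot product followed by a channel-wise multiplication. The sketch below assumes an albedo map of shape (H, W, 3), a light transport map of shape (H, W, 9), and SH light coefficients of shape (9, 3); the dictionary layout and function names are hypothetical, and no clamping or tone mapping is applied.

```python
import numpy as np


def relight(albedo, transport, light):
    """Shading is transport . light per pixel; the image is albedo * shading."""
    shading = transport @ light  # (H, W, 9) @ (9, 3) -> (H, W, 3)
    return albedo * shading


def transfer_light(a, b):
    """Swap the inferred lights of two decomposed portraits."""
    return (relight(a["albedo"], a["transport"], b["light"]),
            relight(b["albedo"], b["transport"], a["light"]))


def toy_portrait(light_rgb):
    transport = np.zeros((2, 2, 9))
    transport[..., 0] = 1.0  # only the constant SH term is non-zero
    light = np.zeros((9, 3))
    light[0] = light_rgb
    return {"albedo": np.full((2, 2, 3), 0.5), "transport": transport, "light": light}


a = toy_portrait([1.0, 2.0, 3.0])
b = toy_portrait([2.0, 2.0, 2.0])
a_under_b, b_under_a = transfer_light(a, b)
```

Because the albedo and light transport maps are inferred once per image, swapping or replacing the light only costs this dot product per pixel.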

Figure 10. Shape from mask. For each human figure, a ground-truth shading map, mask, and inferred shading map are displayed. These results imply that our network can learn a strong shape prior from silhouettes.

8. Discussions

Figure 11. Comparison to self-supervised learning. By fine-tuning network parameters using only unlabeled photographs with (i.e., encoder-tuned; lower row) and without (i.e., fully-tuned; upper row) fixing decoders, the outputs collapsed; the shading maps bleached and the albedo maps got close to the input images. Input images courtesy of Philip Martin and Ali Morshedlou.
Figure 12. Inferred results with sitting poses. Our network can handle sitting poses unexpectedly well even though it is not trained with them.
Figure 13. Failure examples with unusual lights. Our network suffers from inferring lights quite different from the training data.

Silhouettes as priors.

Some existing methods [Barron and Malik, 2015; Lun et al., 2017] on shape-from-shading suggested that object silhouettes serve as shape priors. In the recent work on CNN-based shape inference from 2D silhouettes [Lun et al., 2017], the size of the training dataset is around ten thousand. Compared to this size, it was a surprise that we can infer plausible albedo and shading from only a few hundred training samples. To confirm how much the silhouettes help our inference, we inferred light transport maps only from the binary masks. For this inference, we used the CNN model for light transport maps and used only those loss functions related to light transport maps. Figure 10 shows the resultant shading maps and corresponding ground-truth. Surprisingly, we can observe the rough concave shapes under the chin and the flat shapes of the instep. This result implies that our CNN models also learned a strong shape prior from silhouettes thanks to the regularity of our small training dataset.

Self-supervised learning.

Recent methods for intrinsic decomposition or image disentanglement, e.g., [Sengupta et al., 2018], often employ self-supervised learning to fine-tune networks that were trained with synthetic data: the network is trained on unlabeled real photographs using only a single loss that penalizes the difference between the input image and the product of the inferred outputs. We fine-tuned the model of “Ours (full)” both with and without fixing the network parameters of the decoders. However, in both cases, the inferred outputs collapsed (see Figure 11); the light transport maps lost details, the corresponding shading maps were bleached, and the albedo maps became close to the input images. This is probably because our light transport maps have many more degrees of freedom (i.e., nine dimensions per pixel) than the normal maps inferred in [Sengupta et al., 2018], and are thus more difficult to fine-tune under the unconstrained setting of self-supervised learning. We therefore did not adopt self-supervised learning for the other results. The details of the experimental settings are available in the supplemental material.
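The single reconstruction loss described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `reconstruction_loss`, the array shapes (an RGB illumination of 9×3 SH coefficients and a monochrome light transport map of nine coefficients per pixel), the foreground mask, and the choice of a mean absolute error are all assumptions made for this example.

```python
import numpy as np

def reconstruction_loss(image, albedo, transport, light, mask):
    """Mean absolute error between the input image and albedo * shading.

    Assumed shapes (illustrative only):
      image, albedo: (H, W, 3)   transport: (H, W, 9)
      light: (9, 3)              mask: (H, W) boolean foreground mask
    """
    # Shading is the per-pixel dot product of the nine light transport
    # coefficients with the nine SH illumination coefficients per channel.
    shading = transport @ light              # (H, W, 3)
    rendered = albedo * shading              # (H, W, 3)
    # Only foreground (human) pixels contribute to the loss.
    diff = np.abs(rendered - image)[mask]    # (N, 3)
    return float(diff.mean()) if diff.size else 0.0
```

Because this loss constrains only the product of the three inferred components, many different combinations of albedo, light transport, and illumination yield the same value, which is consistent with the collapse reported above.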

Sitting poses.

To evaluate the ability of our network to handle various poses, we fed it synthetic human images in sitting poses, which were not included in the training or test data. Figure 12 shows the results. The inferred outputs are unexpectedly good compared to the ground truth, which is probably because our training dataset is sufficiently rich for the network to learn the shapes of body parts such as arms and legs.

Unusual lights.

We also evaluated the ability to handle various lights, as shown in Figure 13. Unfortunately, our network could not plausibly infer lights that were quite different from those in our training data. Our network seems to reconstruct the nearest-neighbor lights found in the training dataset, and the light transport maps are inferred accordingly. The large differences in appearance are then encoded in the inferred albedo maps so that the products of the three components become similar to the input images. A straightforward solution is to enrich the training light dataset using, e.g., the environment maps used in [Endo et al., 2017], so that good nearest neighbors can be found for various inputs.

8.1. Limitations

Here we explain the limitations of our method. Although our method is based on a better formulation of SH-based lighting, i.e., one that considers light occlusion, it is still a crude approximation of the lighting calculation. First of all, we only handle diffuse albedo. This limitation mainly stems from our dataset; most of the commercial data do not have specular components, even though the SH representation can naturally handle specular components, as demonstrated in the original PRT paper. Adding artificial specular components to our training dataset, as done in [Innamorati et al., 2017], seems inappropriate in our case because human skin and clothes should have different reflectance. Material assignment with semantic segmentation for hundreds of meshes would be ideal but can be a challenging project by itself. As our work is the first attempt at both full-body relighting and SH-based learning/inference of light occlusion, we believe this limitation is acceptable and that our work can encourage follow-up studies.

Also, while we used second-order SH for representing light occlusion, Sloan et al. [2002] suggested using higher-order bases because occlusion causes high-frequency signals. As is often the case with learning-based methods, our method might fail under conditions quite dissimilar to the training dataset, e.g., harsh illumination, as demonstrated in Figure 13.
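To make the cost of higher-order bases concrete, the number of SH coefficients can be counted per order. The short calculation below takes "second-order" to mean the bands l = 0, 1, 2, which is consistent with the nine coefficients per pixel used here; the symbol N(n) is introduced only for this illustration.

```latex
% N(n): number of SH coefficients when using bands l = 0, ..., n
% (band l contributes 2l + 1 basis functions).
N(n) = \sum_{l=0}^{n} (2l + 1) = (n + 1)^2,
\qquad N(2) = 9, \quad N(3) = 16, \quad N(4) = 25.
```

Under this counting, each additional band increases the number of channels that a light transport map must store per pixel, from 9 to 16 to 25.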

9. Conclusions and Future Work

In this paper, we have paved the way to occlusion-aware relighting of single-view human images and the accompanying inference using CNNs. Inspired by the seminal work on precomputed radiance transfer [Sloan et al., 2002], we employed SH-based lighting, i.e., the dot product of second-order spherical harmonics (SH) coefficient vectors of illumination and occlusion (i.e., light transfer vectors), and trained our models using our synthetic ground-truth dataset. Plausible inference of albedo and light transport maps was possible, probably thanks to our small yet geometrically aligned human image dataset. By considering light occlusion, the inferred albedo and shading maps (i.e., the products of a light transport map and illumination) as well as the relighting results are more plausible than those obtained with previous techniques that ignore light occlusion.
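The dot-product formulation summarized above can be written compactly as follows. The notation is an assumption made for this summary and may differ from the symbols used in the earlier sections: ρ(p) denotes the albedo at pixel p, t(p) the nine-dimensional light transport vector, l the nine SH illumination coefficients of one color channel, and l' a new illumination used for relighting.

```latex
% Assumed notation (illustrative): \rho(p) albedo, \mathbf{t}(p) \in \mathbb{R}^9
% light transport vector, \mathbf{l}, \mathbf{l}' \in \mathbb{R}^9 illumination
% coefficients for one color channel.
\begin{align}
  S(p)  &= \mathbf{t}(p)^{\top} \mathbf{l} = \sum_{i=1}^{9} t_i(p)\, l_i, \\
  I(p)  &\approx \rho(p)\, S(p), \\
  I'(p) &= \rho(p)\, \mathbf{t}(p)^{\top} \mathbf{l}'.
\end{align}
```

Here S is the shading map, I the input image, and I' the relit image; relighting amounts to swapping the illumination vector while keeping the inferred albedo and light transport maps fixed.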

One obvious direction of future work is to extend our first attempt to more physically accurate inverse rendering, based on the formulations extensively studied in the literature on precomputed radiance transfer. For example, other basis functions such as wavelets or spherical Gaussians might be beneficial for handling high-frequency shadows or illumination. Another important direction would be to build a publicly available dataset of high-quality 3D human models, which is crucial for advancing this human-oriented research.

The authors would like to thank ZOZO Technologies, Inc. for generous financial support throughout this project, without which this work would not have been possible. The authors would also like to thank the anonymous referees for their constructive comments, and Ms. Sina Kitz for proofreading the final version of this paper. For our accompanying video, input images are courtesy of Kat Garcia, Kinga Cichewicz, George Gvasalia, and Jacob Postuma.


  • M. Aittala, T. Aila, and J. Lehtinen (2016) Reflectance modeling by neural texture synthesis. ACM Trans. Graph. 35 (4), pp. 65:1–65:13. Cited by: §2.
  • D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. ACM Trans. Graph. 24 (3), pp. 408–416. Cited by: §2.
  • A. O. Balan, L. Sigal, M. J. Black, J. E. Davis, and H. W. Haussecker (2007) Detailed human shape and pose from images. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), Cited by: §2.
  • J. T. Barron and J. Malik (2015) Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37 (8), pp. 1670–1687. Cited by: §2, §8.
  • H. G. Barrow and J. M. Tenenbaum (1978) Recovering intrinsic scene characteristics from images. Comp. Vis. Sys.. Cited by: §2.
  • A. S. Baslamisli, H. Le, and T. Gevers (2018) CNN based learning using reflection and retinex models for intrinsic image decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Cited by: §2.
  • R. Basri and D. W. Jacobs (2003) Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell. 25 (2), pp. 218–233. Cited by: §1, §3.1.
  • S. Bell, K. Bala, and N. Snavely (2014) Intrinsic images in the wild. ACM Trans. Graph. 33 (4), pp. 159:1–159:12. Cited by: §2.
  • V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1999), pp. 187–194. Cited by: §1, §2.
  • N. Bonneel, B. Kovacs, S. Paris, and K. Bala (2017) Intrinsic decompositions for image editing. Comput. Graph. Forum 36 (2), pp. 593–609. Cited by: §2.
  • M. Chai, L. Luo, K. Sunkavalli, N. Carr, S. Hadap, and K. Zhou (2015) High-quality hair modeling from a single portrait photo. ACM Trans. Graph. 34 (6), pp. 204:1–204:10. Cited by: §1.
  • M. K. Chandraker and R. Ramamoorthi (2011) What an image reveals about material reflectance. In IEEE International Conference on Computer Vision (ICCV 2011), pp. 1076–1083. Cited by: §2.
  • R. Danerek, E. Dibra, A. C. Öztireli, R. Ziegler, and M. H. Gross (2017) DeepGarment: 3d garment shape estimation from a single image. Comput. Graph. Forum 36 (2), pp. 269–280. Cited by: §2.
  • P. E. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar (2000) Acquiring the reflectance field of a human face. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2000), pp. 145–156. Cited by: §2.
  • P. Debevec (2004) Light probe image gallery. Note: Cited by: §7.
  • Y. Endo, Y. Kanamori, and J. Mitani (2017) Deep reverse tone mapping. ACM Trans. Graph. 36 (6), pp. 177:1–177:10. Cited by: §8.
  • M. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J. Lalonde (2017) Learning to predict indoor illumination from a single image. ACM Trans. Graph. 36 (6), pp. 176:1–176:14. Cited by: §2, §6.
  • P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt (2013) Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph. 32 (6), pp. 158:1–158:10. Cited by: §2.
  • S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. V. Gool, and T. Tuytelaars (2018) Reflectance and natural illumination from single-material specular objects using deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 40 (8), pp. 1932–1947. Cited by: §2.
  • P. Guan, A. Weiss, A. O. Balan, and M. J. Black (2009) Estimating human shape and pose from a single image. In IEEE 12th International Conference on Computer Vision (ICCV 2009), pp. 1381–1388. Cited by: §2.
  • Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J. Lalonde (2017) Deep outdoor illumination estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 2373–2382. Cited by: §2.
  • B. K. P. Horn (1989) Shape from shading. B. K. P. Horn and M. J. Brooks (Eds.), pp. 123–171. External Links: ISBN 0-262-08183-0 Cited by: §2.
  • C. Innamorati, T. Ritschel, T. Weyrich, and N. J. Mitra (2017) Decomposing single images for layered photo retouching. Comput. Graph. Forum 36 (4), pp. 15–25. Cited by: §2, §8.1.
  • M. K. Johnson and E. H. Adelson (2011) Shape estimation in natural illumination. In The 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 2553–2560. Cited by: §2.
  • I. Kemelmacher-Shlizerman and R. Basri (2011) 3D face reconstruction from a single image using a single reference face shape. IEEE Trans. Pattern Anal. Mach. Intell. 33 (2), pp. 394–405. Cited by: §1.
  • N. Kholgade, T. Simon, A. A. Efros, and Y. Sheikh (2014) 3D object manipulation in a single photograph using stock 3D models. ACM Trans. Graph. 33 (4), pp. 127:1–127:12. Cited by: §2.
  • E. H. Land and J. J. McCann (1971) Lightness and retinex theory. J. Opt. Soc. Am. 61 (1), pp. 1–11. Cited by: §2.
  • G. Li, C. Wu, C. Stoll, Y. Liu, K. Varanasi, Q. Dai, and C. Theobalt (2013) Capturing relightable human performances under general uncontrolled illumination. Comput. Graph. Forum 32 (2), pp. 275–284. Cited by: §2.
  • X. Li, Y. Dong, P. Peers, and X. Tong (2017) Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Trans. Graph. 36 (4), pp. 45:1–45:11. Cited by: §2.
  • J. Lopez-Moreno, E. Garces, S. Hadap, E. Reinhard, and D. Gutierrez (2013) Multiple light source estimation in a single image. Comput. Graph. Forum 32 (8), pp. 170–182. Cited by: §2.
  • Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang (2017) 3D shape reconstruction from sketches via multi-view convolutional networks. In 2017 International Conference on 3D Vision (3DV 2017), Cited by: §1, §8.
  • T. Narihira, M. Maire, and S. X. Yu (2015) Direct intrinsics: learning albedo-shading decomposition by convolutional regression. In 2015 IEEE International Conference on Computer Vision (ICCV 2015), pp. 2992. Cited by: §2.
  • G. Oxholm and K. Nishino (2012) Shape and reflectance from natural illumination. In 12th European Conference on Computer Vision (ECCV 2012), Proceedings, Part I, pp. 528–541. Cited by: §2.
  • P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3D face model for pose and illumination invariant face recognition. In Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2009), pp. 296–301. Cited by: §2.
  • R. Ramamoorthi and P. Hanrahan (2001) An efficient representation for irradiance environment maps. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2001, pp. 497–500. Cited by: §1, §3.1.
  • A. Schneider, S. Schönborn, B. Egger, L. Frobeen, and T. Vetter (2017) Efficient global illumination for morphable models. In IEEE International Conference on Computer Vision (ICCV 2017), pp. 3885–3893. Cited by: §2.
  • S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs (2018) SfSNet: learning shape, reflectance and illuminance of faces ‘in the wild’. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Cited by: §1, §2, §4, §7.1, §8.
  • J. Shi, Y. Dong, H. Su, and S. X. Yu (2017) Learning non-lambertian object intrinsics across shapenet categories. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5844–5853. Cited by: §2.
  • Z. Shu, S. Hadap, E. Shechtman, K. Sunkavalli, S. Paris, and D. Samaras (2017) Portrait lighting transfer using a mass transport approach. ACM Trans. Graph. 36 (4). Cited by: §1.
  • Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras (2017) Neural face editing with intrinsic image disentangling. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5444–5453. Cited by: §2.
  • P. J. Sloan, J. Kautz, and J. Snyder (2002) Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. ACM Trans. Graph. 21 (3), pp. 527–536. Cited by: §1, §3.2, §8.1, §9.
  • A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In IEEE International Conference on Computer Vision (ICCV 2017), pp. 3735–3744. Cited by: §2.
  • S. Xue, A. Agarwala, J. Dorsey, and H. E. Rushmeier (2012) Understanding and improving the realism of image composites. ACM Trans. Graph. 31 (4), pp. 84:1–84:10. Cited by: §1.
  • S. Yamaguchi, S. Saito, K. Nagano, Y. Zhao, W. Chen, K. Olszewski, S. Morishima, and H. Li (2018) High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Trans. Graph. 37 (4), pp. 162:1–162:14. External Links: ISSN 0730-0301 Cited by: §2.
  • C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll (2017) Detailed, accurate, human shape estimation from clothed 3d scan sequences. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5484–5493. Cited by: §1, §6.
  • B. Zhou, X. Chen, Q. Fu, K. Guo, and P. Tan (2013) Garment modeling from a single image. Comput. Graph. Forum 32 (7), pp. 85–91. Cited by: §2.
  • S. Zhukov, A. Iones, and G. Kronin (1998) An ambient light illumination model. In Rendering Techniques ’98, Proceedings of the Eurographics Workshop, pp. 45–56. Cited by: §1.