Hair extraction and modeling play an important role in human identification in computer vision as well as in character creation for digital entertainment production. Robust localization of the hair region and accurate recognition of hairstyles in face images can improve face tracking and recognition; the segmented hair component can be separately modeled to create realistic 3D human avatars. To date, nearly all existing approaches operate on RGB images by treating hair as a special semantic label. In reality, hair does not conform to a uniform shape, style, or even color; black hair in particular lacks features, making hair segmentation in the wild an extremely difficult task, as shown in Figure 1.
Computer graphics approaches resort to human intervention: a user manually provides a coarse segmentation, followed by automatic refinement. For example, several recent methods allow the user to draw a small number of strokes in the hair regions [2, 3, 4]; features on these strokes are then used to guide the segmentation of the whole image. Hand-crafted features such as histograms of oriented gradients and local ternary patterns [5, 6] have shown better performance than classical ones such as SIFT. Fully automated methods employ machine learning techniques, training on a large number of portrait images with manually annotated hair regions [7, 8, 9]. These approaches follow the general semantic segmentation pipeline, e.g., the pyramid scene parsing network (PSPNet). However, hair is far from being a "normal" semantic label: it is semitransparent and has highly variable shapes. Even with a large set of training data, automatic approaches are vulnerable to lighting, viewing angles, background similarities, etc. Figure 1 shows common failure cases using fully convolutional networks (FCN), DeepLabv3+, and fully connected conditional random fields (CRFs for short).
In this paper, we present a computational photography approach that tackles the problem from the input side. We explore hair segmentation on RGBD images captured by time-of-flight (ToF) cameras on mobile devices. Several recent mobile phones are equipped with ToF cameras: the OPPO R17 Pro uses a ToF sensor, and the VIVO X20 an even higher-resolution one. We first conduct a comprehensive analysis to show that hair scattering and inter-reflection cause different noise patterns on hair vs. non-hair regions on ToF sensors. Specifically, scattering causes multi-path artifacts, whereas inter-reflection changes the length of light paths. Since hair fibers exhibit randomness in spatial arrangement, their corresponding scattering and inter-reflection patterns follow similar distributions.
For verification and subsequent training, we collect the first RGBD hair dataset, with 20k+ head images captured on 30 subjects with a variety of hairstyles at different view angles. We annotate hair in terms of top, back, and two side regions, and show that the noise in ToF depth images is coherent across these regions but absent on skin. At the same time, the gradient fields in RGB images exhibit heterogeneity across different regions. We therefore develop a deep network approach that employs both the ToF noise characteristics and the gradient fields of the RGB images. We adopt the DeepLabv3+ framework and modify the encoder/decoder architecture to produce an initial segmentation with region labels. We further refine the results using a conditional random field (CRF) with the ToF noise model as a prior. Comprehensive experiments show that our approach significantly outperforms state-of-the-art RGB-based techniques in accuracy and robustness. In particular, our technique can handle challenging segmentation cases such as dark hair, similar hair/background, similar hair/foreground, and hair viewed at different angles. In addition, we show that our labeled hair segmentation can be further used to identify hairstyles.
2 Related Work
Hair is among the most challenging objects for recognition, segmentation, and reconstruction. Hair modeling and reconstruction aims to produce lifelike hair for virtual humans in game and film production, as well as to beautify portraits for hairstyle try-on. Image-based approaches achieve higher quality with less effort than physically based simulation methods (see the surveys for a comprehensive overview). The core of the problem lies in how to segment the hair component in images.
Early approaches extract hair features from texture, shape, and color, and then apply machine learning techniques such as random forests and support vector networks to separate hair from non-hair regions. In general, light-colored hair is easier to segment than dark hair, as it exhibits richer features. The accuracy of these approaches depends heavily on the difference between foreground and background. Semi-automatic approaches allow users to draw strokes or splines on the hair regions [2, 3, 18, 4], or to generate seed pixels, to produce more accurate hair boundaries, remove outlier regions, and reduce computational expense.
More recent hair segmentation techniques employ deep convolutional neural networks (CNNs), learning features even more discriminative than hand-crafted ones. Approaches in this category generally train on large, manually annotated portrait image datasets and employ semantic segmentation pipelines such as PSPNet to automatically obtain pixel-wise hair masks. One method adopts a multi-objective CNN for both pixel-wise labels and a pairwise edge map. Another applies Region-CNN (R-CNN) to estimate hair distribution classes and then generates a hair mask with a directional map. Others show that fully convolutional networks (FCN) achieve higher accuracy in hair segmentation. To date, nearly all approaches use RGB color features, whereas we exploit the depth channel, more precisely, the noise pattern on the depth channel.
Our work also seeks to automatically determine hairstyles, an active research area in computer graphics. Most methods require user intervention to indicate the hairstyle (e.g., via directional strokes) and then search for the best matching examples in a large hairstyle database manually constructed from public online repositories. They optimize the discrepancy between the reference photo and the selected hairstyle in the database, and synthesize hair strands to fit the hairstyle [23, 4]. We, in contrast, show that ToF + RGB images can be used to automatically infer hairstyles by simultaneously employing color, gradient, and noise patterns.
3 Time-of-Flight Image Analysis on Hair
A Time-of-Flight (ToF) camera works in conjunction with active illumination modulated in intensity. Light emitted from the camera is reflected at the surface of an object and then sensed by a pixel array in the camera. The received ray attenuates according to the surface reflectance and its travel distance. Often, a narrow-band filter is used on the ToF sensor to prevent interference from ambient light.
To measure depth, one computes the phase shift that light accumulates as it travels from the light source to the object surface and then back to the sensor. We follow the same notation as [24, 25], where a correlation pixel measures the modulated exposure as

B(φ) = ∫ E(t) f(ωt + φ) dt,    (1)

where E(t) is the irradiance and f(ωt + φ) is a periodic reference function with modulation angular frequency ω (period T = 2π/ω) and programmable phase offset φ, both evaluated at time t. The reference function is generally a zero-mean periodic function such as a sinusoidal or rectangular wave.
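As a numeric sanity check, the correlation measurement above can be simulated directly. In the sketch below the modulation frequency, amplitudes, and travel time are made-up values (not the paper's); we correlate a single-path irradiance with a sinusoidal reference at two phase offsets and recover the phase ωτ:

```python
import numpy as np

# Assumed (illustrative) modulation parameters.
omega = 2 * np.pi * 60e6      # angular frequency, rad/s
tau = 5e-9                    # travel time of the single light path, s
t = np.linspace(0, 2 * np.pi / omega, 10001)   # one modulation period

def measure(phi0):
    """Modulated exposure B(phi0): correlate irradiance with the reference."""
    E = 1.0 + 0.5 * np.cos(omega * (t - tau))   # single-path irradiance
    f = np.cos(omega * t + phi0)                # zero-mean sinusoidal reference
    return np.trapz(E * f, t)

# Two measurements at phase offsets 0 and pi/2 suffice to recover the phase.
b0, b90 = measure(0.0), measure(np.pi / 2)
phase = np.arctan2(-b90, b0)                    # recovers omega * tau
```

The DC component of the irradiance integrates to zero against the zero-mean reference, so only the modulated term survives, which is exactly why the correlation measurement encodes the travel time.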
3.1 Skin vs. Hair
Compared with the smooth skin regions of the face and neck, hair regions are volumetric, filamentous, and textured. Light arriving at the surface of hair is scattered repeatedly among the filaments and reflected to the sensor, as shown in Figure 2. When a single light path from a smooth surface contributes to a sensor pixel, the irradiance at the pixel is

E(t) = E_0 + α E_1 cos(ω(t − τ)),    (2)

where E_0 is the DC component and E_1 the modulated amplitude of the light source. α presents an attenuation term, and τ the total travel time from the light source to the object surface and then to the sensor pixel. E(t) is intensity-modulated at the same frequency as f.
When using ToF to image object surfaces of reflective and/or translucent materials, inter-reflection and scattering can cause strong multi-path artifacts in the ToF image. However, such artifacts have so far mostly been used for modeling smooth surfaces, not uneven geometry such as the hair in this paper. For example, on the basis of this observation, one approach classifies structurally distinct materials using a high-precision ToF camera. In the case of hair, the multi-path artifacts are stronger: a light ray can be scattered by the hair surface, causing the light path to fork, so that the measurement at a pixel is a combination of multiple light paths; a light ray can also be reflected by the hair surface, where dense hair fibers can significantly and randomly change the length of the path.
We define the temporal (since ToF measures time/phase) point spread function p(τ) to represent the integral of contributions from all light paths that correspond to the same travel time τ,

p(τ) = ∫_P α(x) δ(τ − τ(x)) dx,    (3)

which yields Equation 4 for a rough surface:

E(t) = E_0 + E_1 ∫ p(τ) cos(ω(t − τ)) dτ,    (4)

where α(x) is the light attenuation along a given light path x, which connects the light source to the sensor pixel, P presents the space of all light paths, and τ(x) the travel time along path x.
Each pixel of the sensor hence receives the signal reflected along single or multiple light paths from different hair strands and produces a proportional electrical signal with the same frequency as the incident light. We can calculate the phase shift φ and amplitude A of the sensed light using four equally spaced samples A1, A2, A3, and A4 per modulation period for a sinusoidal reference function as

φ = arctan( (A4 − A2) / (A1 − A3) ),   A = sqrt( (A1 − A3)² + (A4 − A2)² ) / 2.    (5)

Since a phase shift is equivalent to a time shift in a periodic signal, we can compute the travel distance d of the light from the known light speed c and modulation period T, as in Equation 6:

d = (c / 2) · (φ / 2π) · T.    (6)
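The four-bucket recovery described above can be sketched as follows; the sample-ordering convention and the modulation frequency F_MOD are illustrative assumptions, not values from the paper:

```python
import numpy as np

C = 3.0e8          # speed of light, m/s
F_MOD = 60.0e6     # assumed modulation frequency, Hz (T = 1 / F_MOD)

def tof_depth(a1, a2, a3, a4):
    """Recover phase shift, amplitude, and travel distance from four samples."""
    phase = np.mod(np.arctan2(a4 - a2, a1 - a3), 2.0 * np.pi)  # Equation 5
    amplitude = 0.5 * np.hypot(a1 - a3, a4 - a2)               # Equation 5
    distance = 0.5 * C * phase / (2.0 * np.pi * F_MOD)         # Equation 6
    return phase, amplitude, distance

# Synthetic check: sampling dc + a*cos(phi + k*pi/2) recovers phi and a.
phi_true, amp_true, dc = 1.2, 0.8, 2.0
samples = [dc + amp_true * np.cos(phi_true + k * np.pi / 2) for k in range(4)]
phase, amp, dist = tof_depth(*samples)
```

Note that the DC offset cancels in both differences A1 − A3 and A4 − A2, which is what makes the four-bucket scheme robust to ambient intensity.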
Several recent studies have shown that, in addition to the reflection properties of the object surface, the distance between the object and the sensor and the background surrounding the object may also contribute to depth noise. In our experiments we observe that inter-reflection and scattering artifacts impose much stronger noise than that caused by depth and background. Similar to [27, 28], we can then fit the depth noise with a Gaussian or a Poisson function, as shown in Figure 3. In our analysis, we select a patch under a Gaussian window on the depth image and fit it with a Gaussian function. The patch slides over the image so that we obtain a variance map. We calculate the variance map at the facial and hair regions for different subjects with long or short hairstyles, and for the same subject at different viewing angles (top and side). For clarity, we show the variance histogram curves at the facial and hair regions of a subject, the facial region in red and the hair region in blue.
We observe that the variance histogram curve of the hair region is highly distinguishable from that of the facial region, across different hairstyles. In experiments, we also observe that the variance histogram curves of hair are distinct across the side, back, and top viewing angles, as shown in Figure 4. This is mainly because hair density and direction vary strongly with the view angle.
3.2 ToF Hair Dataset
We collect the first RGBD dataset for hair segmentation, as shown in Figure 5 (b). Our capture system consists of a mobile phone equipped with a high-precision RGB camera and a ToF depth camera, a gripper to hold the phone, and a stepper motor with a controller. The phone, held in the gripper, orbits the subject, who stands still on the ground, at a radius of 1.5 meters, and takes images of her/his head over the full range of 0 to 360 degrees. For each subject, we select 720 images, one every 0.5 degrees.
Our dataset contains 20k+ head images of 30 subjects (18 males and 12 females) with different hairstyles and of different ages. RGB images are 1024×2048, and depth images 640×480. We manually annotate ground-truth hair regions of top, two sides, and back, together with a directional map, on the RGB images for training, as shown in Figure 5 (a). In particular, for direction annotation we divide hair into four direction classes: horizontal, longitudinal, leftward, and rightward. We also label the left and right ears apart from the facial and neck regions, and train on them in our network; the ear regions and their structure can be treated as landmarks to differentiate heads at different orientations. We align each depth image with its corresponding RGB image using our alignment technique, so that the hair regions and the directional map are annotated on both images.
4 Hair Segmentation and Labeling
We set out to use both the ToF depth map and the RGB image for hair segmentation. A unique aspect of our approach is that we not only segment the hair region but also partition it into semantically meaningful labels. We extend the DeepLabv3+ framework; Figure 6 shows our pipeline. We observe that the direction of hair fibers provides important cues to both the hairstyle and the location of specific hair components. We therefore use the RGB gradient maps as inputs along with the ToF image. The output of the network is a segmentation that partitions the image into ear regions, the facial region (i.e., face and neck), and hair regions (top, back, and two sides). We also conduct refinement to further improve the segmentation result.
4.1 Initial Segmentation
The input to our deep convolutional neural network includes the RGB and ToF images and their gradient maps. Recall that our RGB and ToF images have different sizes, 1024×2048 and 640×480, respectively. We therefore align each pair of RGB and ToF images using the rotation-translation matrix provided by the mobile phone. The alignment yields holes due to upsampling of the ToF image. We fill these holes with a weighted mean value computed from a Gaussian-windowed patch, which we select from the neighboring region annotated with the same label in the paired training data. In this way, the upsampled ToF image carries its annotation and enables the computation of its variance map at specific regions.
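The hole-filling step might be sketched as below, under simplifying assumptions of ours: holes are marked as zeros, and the weighted mean is taken over valid pixels in a Gaussian window rather than over a label-matched patch as in the full method. The function name and window parameters are hypothetical:

```python
import numpy as np

def fill_holes(depth, win=5, sigma=1.5):
    """Fill zero-valued holes with a Gaussian-weighted mean of valid neighbors."""
    half = win // 2
    ax = np.arange(-half, half + 1)
    gx, gy = np.meshgrid(ax, ax)
    kernel = np.exp(-(gx**2 + gy**2) / (2.0 * sigma**2))  # Gaussian window
    out = depth.astype(float).copy()
    padded = np.pad(out, half, mode="edge")
    for r, c in np.argwhere(depth == 0):                  # each hole pixel
        patch = padded[r:r + win, c:c + win]
        valid = patch > 0                                 # exclude other holes
        if valid.any():
            w = kernel[valid]
            out[r, c] = np.sum(w * patch[valid]) / np.sum(w)
    return out

# A single hole surrounded by depth 2.0 is filled back to 2.0.
depth = np.array([[2.0, 2.0, 2.0],
                  [2.0, 0.0, 2.0],
                  [2.0, 2.0, 2.0]])
filled = fill_holes(depth)
```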
We adopt DeepLabv3+, a state-of-the-art deep learning architecture for semantic segmentation, and construct our ToF-HairNet for multiple tasks on the basis of its encoder and decoder, as shown in Figure 7. We add subnetworks for ToF images and RGB gradient maps on the encoder side. Specifically, we use a 65-block Xception network as our DCNN backbone to extract a high-level feature map of 40 channels from the RGB image. We then feed the upsampled ToF image, aligned with its corresponding RGB image and with its background removed, to a 5-layer convolutional network followed by a final convolution, obtaining features of 10 channels. In addition, we subtract the result of the second layer from that of the third layer and pass the residual through a 3-layer convolutional network followed by a final convolution, obtaining another 10 channels. The subnet in Figure 8 (a) shows this process for a ToF image.
We convert the RGB image to grayscale and calculate the gradients in the x and y directions using a Sobel operator, yielding Gx and Gy. We feed Gx and Gy separately into a 5-layer convolutional network for refinement, concatenate the refined Gx and Gy into a gradient map, and apply a convolution to produce features of 20 channels. This process is shown in Figure 8 (b). The final feature map at the encoder side thus has 80 channels: 40 from the RGB image, 20 from the ToF image, and 20 from the gradient maps.
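A minimal sketch of the gradient-map computation on a grayscale image follows; the names sobel_gradients, SOBEL_X, and SOBEL_Y are ours, and the loop-based convolution is for clarity rather than speed:

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_gradients(gray):
    """Return the Sobel gradient maps Gx and Gy of a 2D grayscale image."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            win = padded[r:r + 3, c:c + 3]
            gx[r, c] = np.sum(win * SOBEL_X)
            gy[r, c] = np.sum(win * SOBEL_Y)
    return gx, gy

# A vertical step edge produces a strong horizontal gradient Gx and zero Gy.
img = np.repeat([[0.0, 0.0, 1.0, 1.0]], 4, axis=0)
gx, gy = sobel_gradients(img)
```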
On the decoder side, we apply a convolution to a low-level feature map extracted from the second block of the Xception network, and upsample the fine-tuned feature map by a factor of 4. We then concatenate the aligned ToF image, the gradient map, the low-level feature map, and the fine-tuned feature map, and apply a convolution. Finally, after upsampling and a further convolution, our network outputs the hair directional map and the segmentation result.
4.2 Segmentation Refinement
The initial segmentation output from our ToF-HairNet includes ear regions, facial region, and hair regions of top, back and two sides. Since hair exhibits strong appearance variations under different lighting and view angles, we further refine the segmentation results by employing the ToF variance maps and dense conditional random fields (dense CRF).
Variance map. We extract the human region from the upsampled ToF image and remove the background. We select a patch P under a Gaussian window, calculate its mean value μ and variance σ², and slide the patch over the image, as in Equation 7:

μ = (1/N) Σ_{i∈P} d_i,   σ² = (1/N) Σ_{i∈P} (d_i − μ)²,    (7)

where d_i is the value at pixel i of the upsampled ToF image and N is the number of pixels in patch P.
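The sliding-window variance map can be sketched as follows; for simplicity this version uses a plain (unweighted) window rather than the Gaussian-weighted patch of the full method, and the window size is illustrative:

```python
import numpy as np

def variance_map(depth, win=3):
    """Per-pixel variance of depth values inside a sliding local patch."""
    half = win // 2
    padded = np.pad(depth.astype(float), half, mode="edge")
    h, w = depth.shape
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            patch = padded[r:r + win, c:c + win]
            out[r, c] = np.var(patch)   # mean over N pixels, then squared deviations
    return out

# Flat (skin-like) regions give zero variance; noisy (hair-like) regions
# give clearly positive variance, which is the cue exploited for refinement.
flat = np.full((5, 5), 2.0)
rng = np.random.default_rng(0)
noisy = 2.0 + 0.2 * rng.standard_normal((5, 5))
```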
We then obtain a variance map of the human region. As our analysis of ToF depth noise (Section 3) shows, the variance map exhibits different properties at the facial and hair regions; we exploit this to improve the hair regions in conjunction with the dense CRF module.
Dense CRF. Fully connected conditional random fields (CRFs) are a popular model for semantic segmentation and labeling. The model connects all pairs of individual pixels and defines pairwise edge potentials as linear combinations of Gaussian kernels. Following this model, we consider a conditional random field as a Gibbs distribution. The Gibbs energy includes the unary potentials ψ_u and the pairwise potentials ψ_p, as in Equation 8:

E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j),    (8)

where i and j range from 1 to N. We compute the unary potential independently for each pixel by a classifier that produces a distribution over the label assignment given image features. The pairwise potential is defined as

ψ_p(x_i, x_j) = μ(x_i, x_j) Σ_m w^(m) k^(m)(f_i, f_j),    (9)

where f_i and f_j are feature vectors for pixels i and j in a feature space, w^(m) are linear combination weights, μ is a label compatibility function, and k^(m) are Gaussian kernels. For multi-class image segmentation, two contrast-sensitive kernels are defined in terms of color vectors I_i and I_j and positions p_i and p_j:

k(f_i, f_j) = w^(1) exp( −|p_i − p_j|² / (2θ_α²) − |I_i − I_j|² / (2θ_β²) ) + w^(2) exp( −|p_i − p_j|² / (2θ_γ²) ).    (10)
The first term is the appearance kernel, based on the observation that nearby pixels with similar colors are likely to belong to the same class; θ_α and θ_β control the spatial distance and the color similarity, respectively. The second term is the smoothness kernel, which removes small isolated regions.
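The two kernels can be evaluated numerically for a pair of pixels as in the toy sketch below; the weights are folded out and the bandwidths θ are illustrative values, not tuned parameters from the paper:

```python
import numpy as np

def pairwise_kernels(p_i, p_j, c_i, c_j,
                     theta_alpha=10.0, theta_beta=0.1, theta_gamma=3.0):
    """Appearance and smoothness Gaussian kernels for one pixel pair."""
    dp2 = np.sum((p_i - p_j) ** 2)   # squared spatial distance
    dc2 = np.sum((c_i - c_j) ** 2)   # squared color distance
    appearance = np.exp(-dp2 / (2 * theta_alpha**2) - dc2 / (2 * theta_beta**2))
    smoothness = np.exp(-dp2 / (2 * theta_gamma**2))
    return appearance, smoothness

# Nearby pixels with identical colors: appearance kernel close to 1.
p_i, p_j = np.array([10.0, 10.0]), np.array([11.0, 10.0])
c_i = c_j = np.array([0.5, 0.4, 0.3])
app_near, smooth_near = pairwise_kernels(p_i, p_j, c_i, c_j)

# Same positions but very different colors: appearance kernel collapses.
app_far, _ = pairwise_kernels(p_i, p_j, c_i, np.array([0.0, 0.9, 0.1]))
```

This illustrates the contrast sensitivity: the appearance kernel couples labels only where both position and color agree, while the smoothness kernel depends on position alone.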
We additionally employ a variance kernel, defined analogously to the appearance kernel but over the ToF variance map, to enhance semantic segmentation for filamentous regions such as hair, yielding Equation 11:

k(f_i, f_j) = w^(1) exp( −|p_i − p_j|² / (2θ_α²) − |I_i − I_j|² / (2θ_β²) ) + w^(2) exp( −|p_i − p_j|² / (2θ_γ²) ) + w^(3) exp( −|p_i − p_j|² / (2θ_δ²) − |v_i − v_j|² / (2θ_v²) ),    (11)

where v_i and v_j are the values of the variance map at pixels i and j.
5 Experimental Results
We conduct hair segmentation experiments for various hairstyles in different scenes. Our dataset comprises 20k+ head images of 30 subjects. We manually annotate 8k images, with 5 labels for segmentation and 5 labels for the hair directional map. The segmentation labels include the facial region, background, ear regions, and hair regions of top, back, and two sides. The hair direction labels are leftward, rightward, horizontal, and longitudinal. We split the annotated images into three groups: 6k for training, 1.2k for testing, and 0.8k for validation. All experiments are carried out on the public TensorFlow platform on a PC with two 1080 Ti graphics cards.
We build our network on the 65-block Xception network and fine-tune its parameters, initialized on the ImageNet dataset, with a training protocol similar to DeepLabv3+. In particular, we set the learning rate to 0.0001, the crop size to 513×513×3, the number of training steps to 50k, and the output stride to 16 for the encoder and 4 for the decoder. We apply random cropping position, flipping, scaling, and exposure to augment the annotated RGB images. With the fine-tuned parameters, we train hair segmentation on our ToF-HairNet following the details in Section 4, and output the hair directional map, the segmentation, and its labeling.
We first demonstrate the visual quality of the hair segmentation output by our ToF-HairNet alone and by our network with refinement. Figure 9 shows sample results on a single subject captured at view angles spaced 45 degrees apart. We combine the hair regions of top, back, and two sides into a single hair mask. The initial ToF-HairNet provides reasonable results, and our refinement further improves the boundary.
Our experimental results on challenging scenes, e.g., hair color similar to the background, hair on the temples, and highlighted hair, are shown in Figure 1 in comparison with FCN, CRFs, and DeepLabv3+. The RGB-based methods, relying solely on intensity variations and spatial positions, miss hair boundaries and details. With the variance variation on ToF images as an additional cue, our method achieves more reliable segmentation and labeling for specific surfaces such as hair.
Further, we run our experiments on various hairstyles in comparison with CRFs and DeepLabv3+. Figure 10 shows sample results on 3 hairstyles. Our hair masks are significantly more accurate than those of the two state-of-the-art techniques, especially for fluffy hairstyles. Finally, we apply our hair masks of 3 different hairstyles to render hair for female and male virtual characters, as shown in Figure 11.
Table 1: mIoU of DeepLabv3+ combined with each component of our network.

| Method | mIoU |
| --- | --- |
| DeepLabv3+ + ToF | 0.72 |
| DeepLabv3+ + gradient | 0.70 |
| DeepLabv3+ + direction | 0.80 |
| DeepLabv3+ + gradient + direction + ToF | 0.81 |
We further evaluate the components of our ToF-HairNet using the mean intersection over union (mIoU) metric. The IoU of a label is the number of pixels in the intersection of the ground-truth and predicted masks divided by the number of pixels in their union. We calculate the IoU score for each label and average over all annotated labels to obtain the mIoU of our network when each component is combined with DeepLabv3+ separately and all together. Table 1 reports the mIoU at 50k training steps. ToF images and directional maps contribute significantly to the mIoU score of our network. We refer the readers to the supplementary materials for many additional results.
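The metric itself is straightforward to compute; a minimal sketch follows (the two-label map is illustrative, not the paper's five-class setup):

```python
import numpy as np

def miou(gt, pred, labels):
    """Mean IoU over the given integer labels: |intersection| / |union| per label."""
    ious = []
    for lab in labels:
        inter = np.logical_and(gt == lab, pred == lab).sum()
        union = np.logical_or(gt == lab, pred == lab).sum()
        if union > 0:                       # skip labels absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 0],
                 [0, 0, 1, 1]])
score = miou(gt, pred, labels=[0, 1])       # IoU(0) = 4/5, IoU(1) = 3/4
```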
-  L. Hu, C. Ma, L. Luo, L.-Y. Wei, and H. Li, “Capturing braided hairstyles,” ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2014), vol. 33, no. 6, December 2014.
-  M. Chai, L. Wang, Y. Weng, X. Jin, and K. Zhou, “Dynamic hair manipulation in images and videos,” ACM Transactions on Graphics, vol. 32, no. 4, p. 75, 2013.
-  X. Yu, Z. Yu, X. Chen, and J. Yu, “A hybrid image-cad based system for modeling realistic hairstyles,” in Meeting of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2014, pp. 63–70.
-  M. Zhang, M. Chai, H. Wu, H. Yang, and K. Zhou, “A data-driven approach to four-view image-based hair modeling,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–11, 2017.
-  J. Roth and X. Liu, “On hair recognition in the wild by machine,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2824–2830.
-  M. Svanera, U. R. Muhammad, R. Leonardi, and S. Benini, “Figaro, hair detection and segmentation in the wild,” in IEEE International Conference on Image Processing, 2016, pp. 933–937.
-  M. Chai, T. Shao, H. Wu, Y. Weng, and K. Zhou, “Autohair: fully automatic hair modeling from a single image,” ACM Transactions on Graphics, vol. 35, no. 4, p. 116, 2016.
-  Y. Zhou, L. Hu, J. Xing, W. Chen, H. W. Kung, X. Tong, and H. Li, “Hairnet: Single-view hair reconstruction using convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2018, pp. 249–265.
-  S. Liang, X. Huang, X. Meng, K. Chen, L. G. Shapiro, and I. Kemelmacher-Shlizerman, “Video to fully automatic 3D hair model,” arXiv e-prints, Sep. 2018.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6230–6239.
-  M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, “Multinet: Real-time joint semantic reasoning for autonomous driving,” 2016.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” arXiv preprint arXiv:1802.02611, 2018.
-  P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in neural information processing systems, 2011, pp. 109–117.
-  K. Ward, F. Bertails, T. Y. Kim, S. R. Marschner, M. P. Cani, and M. C. Lin, “A survey on hair modeling: styling, simulation, and rendering,” IEEE Transactions on Visualization & Computer Graphics, vol. 13, no. 2, pp. 213–234, 2007.
-  Y. Bao and Y. Qi, “A survey of image-based techniques for hair modeling,” IEEE Access, vol. 6, pp. 18 670–18 684, 2018.
-  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
-  C. Cao, H. Wu, Y. Weng, T. Shao, and K. Zhou, “Real-time facial animation with image-based dynamic avatars,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1–12, 2016.
-  B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang, “Exemplar-based face parsing,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3484–3491.
-  S. Liu, J. Yang, C. Huang, and M. H. Yang, “Multi-objective convolutional learning for face labeling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3451–3459.
-  S. Qin, S. Kim, and R. Manduchi, “Automatic skin and hair masking using fully convolutional networks,” in IEEE International Conference on Multimedia and Expo, 2017, pp. 103–108.
-  “The sims resource,” http://www.thesimsresource.com/, 2018.
-  L. Hu, C. Ma, L. Luo, and H. Li, “Single-view hair modeling using a hairstyle database,” ACM Transactions on Graphics, vol. 34, no. 4, pp. 1–9, 2015.
-  F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich, “Low-budget transient imaging using photonic mixer devices,” ACM Transactions on Graphics (ToG), vol. 32, no. 4, p. 45, 2013.
-  S. Su, F. Heide, R. Swanson, J. Klein, C. Callenberg, M. Hullin, and W. Heidrich, “Material classification using raw time-of-flight measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3503–3511.
-  R. Lange and P. Seitz, “Solid-state time-of-flight range camera,” IEEE Journal of quantum electronics, vol. 37, no. 3, pp. 390–397, 2001.
-  D. Falie and V. Buzuloiu, “Noise characteristics of 3d time-of-flight cameras,” arXiv preprint arXiv:0705.2673, 2007.
-  H. Sarbolandi, D. Lefloch, and A. Kolb, “Kinect range sensing: Structured-light versus time-of-flight kinect,” Computer vision and image understanding, vol. 139, pp. 1–20, 2015.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  S. Nowozin, “Optimal decisions from probabilistic models: The intersection-over-union case,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 548–555.