Video to Fully Automatic 3D Hair Model

09/13/2018 ∙ by Shu Liang, et al. ∙ University of Washington

Imagine taking a selfie video with your mobile phone and getting as output a 3D model of your head (face and 3D hair strands) that can later be used in VR, AR, and any other domain. State-of-the-art hair reconstruction methods use either a single photo (thus compromising 3D quality) or multiple views, but the multi-view methods require manual user interaction (manual hair segmentation and capture of fixed camera views spanning the full 360 degrees). In this paper, we describe a system that creates a reconstruction completely automatically from any video (even a selfie video), and we do not require specific views, since capturing your -90 degree, 90 degree, and full back views is not feasible in a selfie capture. At the core of our system, in addition to the automation components, hair strands are estimated and deformed in 3D (rather than 2D as in the state of the art), enabling superior results. We provide qualitative, quantitative, and Mechanical Turk human studies that support the proposed system, and show results on a diverse variety of videos (8 different celebrity videos and 9 selfie mobile videos, spanning age, gender, hair length, type, and styling).


1. Introduction

Figure 2. Overview of our method. The input to the algorithm is a video: (A) structure from motion is applied to the video to obtain camera poses, depth maps, and a visual hull shape with view-confidence values, (B) hair segmentation and hair direction networks are trained and applied to each frame to recover 2D strands, (C) the segmentations are used to recover the texture of the face area, and a 3D face morphable model is used to estimate the face and bald head shapes. The core of the algorithm is (D), where the depth maps and 2D strands are used to create 3D strands, which are used to query a hair database; the strands of the best match are refined globally and locally to fit the input photos.

Her glossy hair is so well resolved that Hiro can see individual strands refracting the light into tiny rainbows.

Snow Crash, Neal Stephenson

Any future virtual and augmented reality application that includes people must have a robust and fast system to capture a person’s head (3D hair strands and face). The simplest capture scenario is taking a single selfie photo with a cell phone [Hu et al., 2017]. 3D reconstruction from a single selfie [Chai et al., 2016], however, by definition will not produce high-fidelity results due to the ill-posedness of the problem: a single view does not show the sides of the person. Using multiple frames, in contrast, enables an accurate reconstruction. Indeed, a state-of-the-art method for hair modeling [Zhang et al., 2017] uses four views, but it requires them to span the full head (front, back, and sides), as well as user interaction. This paper proposes to use a video as input and introduces solutions to three obstacles that prevent state-of-the-art work [Zhang et al., 2017; Chai et al., 2016; Hu et al., 2017] from being applicable in simple automatic self capture:

  1. Fixed views: [Zhang et al., 2017] requires four views (front, back, and two side views), which are hard to acquire accurately with self capture. In this paper, we do not constrain the views; instead we use whatever video frames are available in which the subjects talk or capture themselves. For input videos that do not cover the full view range, we correct the incomplete views with a hair shape from a database. Camera poses are estimated automatically with structure from motion. Results with various view ranges are demonstrated (as low as a 90-degree range). [Chai et al., 2016; Hu et al., 2017] assume a single frontal image.

  2. Hair segmentation: [Zhang et al., 2017] relies on the user to label hair and face pixels. In this paper, we use automatic hair segmentation and don’t require any user input. Using general video frames, rather than four fixed views, introduces motion blur, varying lighting, and resolution issues, all of which our system overcomes.

  3. Accuracy: our method compares and deforms hair strands in 3D rather than 2D, and does not require the availability of a back view as [Zhang et al., 2017] does. It achieves higher accuracy, as demonstrated by qualitative, quantitative, and human studies. The intersection-over-union rate of the hair region compared to ground-truth photos is on average 80% for our method (compared to 60% for [Zhang et al., 2017]). Amazon Mechanical Turk raters prefer our results 72.4% of the time over the four-view method [Zhang et al., 2017] and 90.8% of the time over the single-view method of [Hu et al., 2017].

In addition to [Zhang et al., 2017] (and an earlier multi-view method by [Vanakittistien et al., 2016] that also required user interaction), there is a large body of work on modeling hair from photos. Earlier works assumed laboratory-calibrated photos, e.g., [Hu et al., 2014a]. More recently, [Chai et al., 2013; Hu et al., 2015, 2017; Chai et al., 2016] showed how to reconstruct hair from a single photo. Interesting hair-related applications inspire further research, e.g., depth-based portraits [Chai et al., 2016], effective avatars for games [Hu et al., 2017], photo-based hair morphing [Weng et al., 2013], and hair try-outs [Kemelmacher-Shlizerman, 2016]. Enabling reconstruction “in the wild” is an open problem for which Internet photos have been explored [Liang et al., 2016], as well as structure from motion on mobile video input [Ichim et al., 2015]. Both methods output a rough structure of the hair and head, without hair strands. This paper proposes a system that takes an in-the-wild video and automatically outputs a full head model with a 3D hair-strand model.

2. Related Work

In this section, we describe work that focuses on face, head, and hair modeling.

Face modeling (without hair or the full head) has progressed tremendously in the last decade. Early work captured highly detailed head geometry with stereo capture systems [Beeler et al., 2010; Debevec, 2012; Alexander et al., 2013]; later, RGB-D-based methods such as dynamic fusion [Newcombe et al., 2015] and non-rigid reconstruction methods [Thies et al., 2015; Zollhöfer et al., 2014] allowed capture to be real-time and much easier with off-the-shelf devices. [Blanz and Vetter, 1999] proposed a 3D morphable face model that represents any person’s face shape as a linear combination of face bases, [Tran et al., 2017; Richardson et al., 2016; Richardson et al., 2017] proposed CNN-based systems, and [Kemelmacher-Shlizerman and Seitz, 2011; Suwajanakorn et al., 2014; Kemelmacher-Shlizerman and Basri, 2011; Suwajanakorn et al., 2015] showed how to estimate highly detailed shapes from Internet photos and videos.

Head modeling: Towards creating a full head model from a single photo, [Chai et al., 2015] modeled the face with hair as a 2.5D portrait, using head shape priors and shading information. [Maninchedda et al., 2016] allowed multiple views and used volumetric shape priors to reconstruct the geometry of a human head starting from structure-from-motion dense stereo matching. [Cao et al., 2016] showed that a full 3D head with a rough but morphable hair model can be reconstructed from a set of captured images with hair depth estimated from each image and then fused together. Finally, [Liang et al., 2016] explored how to reconstruct a full head model of a person from Internet photos. These methods focused on rough head and hair modeling without reconstruction of hair strands.

Hair strand modeling is key to digital human capture. Capturing the nuances of a person’s hair shape is critical, since the hairstyle is a unique characteristic of a person and inaccuracies can change their appearance dramatically.

Multi-view camera rigs in controlled capture environments can acquire hair shape with high fidelity [Luo et al., 2013; Hu et al., 2014a; Paris et al., 2008; Paris et al., 2004; Ward et al., 2007], and more recently so can RGB-D cameras [Hu et al., 2014b]. With a single RGB image, [Chai et al., 2012, 2013] showed that strands can be recovered from per-pixel gradients, and with the use of a database of synthetic hairstyles, [Hu et al., 2015] created natural-looking hair models. A key requirement was having the user draw directional strokes on the image to initialize strand construction. Fully automatic approaches were proposed recently by [Chai et al., 2016] and [Hu et al., 2017], with CNN-based methods for hair segmentation and direction classification and a larger database for retrieving the best match for a single-view input.

Since single-view modeling is ill-posed by definition, [Vanakittistien et al., 2016] used a hand-held cell phone camera to take photos of the head from multiple views to recover hair strands, and [Zhang et al., 2017] proposed a method that reconstructs hair from four views, starting from a rough shape retrieved from a database and synthesizing hair textures that provide hair-growing directions to create detailed strands. However, both methods need human interaction for hair segmentation and pose alignment. This paper removes those constraints and also demonstrates higher accuracy than those methods.

3. Overview

Figure 2 provides an overview of our method and its key components. The input to the algorithm is a video sequence of a person talking and moving naturally, as in a TV interview. There are four algorithmic components (corresponding to the labeled boxes in Figure 2):

(A) video frames are used to create a structure-from-motion model, estimate camera poses as well as per-frame depth, and compute a rough visual hull of the person with view-confidences, (B) two models are trained: one for hair segmentation, and another for hair direction; given those models, 2D hair strands are estimated and hair segmentation results are transferred to the visual hull to define the hair region, (C) the masks from the previous stage are used to separate the hair from the face and run the morphable model (3DMM) to estimate the face shape and later create the texture of the full head.

(D) is a key contribution in which first depth maps and 2D hair strands are combined to create 3D strands, and then 3D strands are used to query a database of hair styles. The match is deformed according to the visual hull, then corrected based on the region of confidence of the visual hull. Finally, it is deformed on the local strand level to fit the input strands. Texture is estimated from input frames to create the final hair model. The full head shape is a combination of the face model and the hair strands. In the next sections, we describe each of these components.

Figure 3. Example hairstyles from the dataset. For each hairstyle, we create a corresponding rough mesh as described in the text.
Figure 4. In (a), we show a comparison before and after global deformation: the retrieved hairstyle is deformed, under the control of its rough mesh, to fit the visual hull shape. In (b), we show a comparison before and after local deformation, with a video frame as reference; after local deformation we recover more personalized hair details.

4. Input Frames to Rough Head Shape

This section describes part (A) in Figure 2. We begin by preprocessing each frame with semantic segmentation to roughly separate the person from the background [Zheng et al., 2015], resulting in a per-frame mask. Our goal is to estimate a camera pose for each frame and to create a rough initial structure from all the frames. Since the background is masked out, the head moving while the camera is fixed is roughly equivalent to the head being fixed while the camera moves; we therefore use structure from motion [Wu, 2011] to estimate a camera pose per frame, and per-frame depth maps using [Goesele et al., 2007].

Given the mask and camera pose of each frame, we estimate an initial visual hull of the head using shape-from-silhouette [Laurentini, 1994]. The method takes the list of mask and camera-pose pairs as input and carves a 3D voxel space to obtain a rough shape of the head. Meanwhile, each segmented video frame is processed with the IntraFace software [Xiong and De la Torre, 2013], which provides the inner facial landmarks per frame. The 2D facial landmarks are transferred to 3D using the per-frame depth maps and averaged.
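To make the carving step concrete, here is a minimal sketch (not the paper's implementation) of shape-from-silhouette voxel carving in Python with NumPy. It assumes binary person masks, 3x4 camera projection matrices, and a user-chosen bounding box around the head; the grid resolution and the box are placeholders.

```python
import numpy as np

def visual_hull(masks, projections, bbox_min, bbox_max, res=64):
    """Carve a res^3 voxel grid: keep voxels whose projection lands inside
    every available silhouette mask."""
    axes = [np.linspace(lo, hi, res) for lo, hi in zip(bbox_min, bbox_max)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([X, Y, Z, np.ones_like(X)], axis=-1).reshape(-1, 4)  # homogeneous points
    keep = np.ones(len(pts), dtype=bool)
    for mask, P in zip(masks, projections):
        uvw = pts @ P.T                                # project into the image
        uv = uvw[:, :2] / uvw[:, 2:3]                  # perspective divide
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        h, w = mask.shape
        inside = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(pts), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0   # voxel lies inside this silhouette
        keep &= hit                                    # carve away everything else
    return keep.reshape(res, res, res)
```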

The hair segmentation classifier trained in step (B) (Section 5) is run on all of our video frames, assigning each pixel a probability of being in the hair region. We drop video frames with large motion blur by calculating the area of the detected hair region in each frame: assuming the head size is relatively fixed across frames, a valid frame must have a hair region above a minimum size. The per-pixel probabilities of the valid frames are transferred to the visual-hull shape, and a vertex whose mean probability exceeds a threshold is considered hair. Thus, we extract the hair part of the visual hull as shown in Figure 6(a); the remainder is the skin part.
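A minimal sketch of the frame filtering and vertex labeling described above, assuming per-frame hair probability maps and 3x4 camera projection matrices are available; the area and probability thresholds are placeholders, since the paper's exact values are not reproduced here.

```python
import numpy as np

def filter_frames_by_hair_area(hair_probs, min_area, prob_thresh=0.5):
    """Drop blurred frames whose detected hair region is too small.
    hair_probs: list of HxW per-pixel hair probability maps."""
    return [p for p in hair_probs if (p > prob_thresh).sum() >= min_area]

def label_hull_vertices(vertices, cameras, hair_probs, vertex_thresh=0.5):
    """Average the per-frame hair probability seen at each visual-hull vertex;
    vertices whose mean probability exceeds vertex_thresh are labeled hair."""
    sums = np.zeros(len(vertices))
    counts = np.zeros(len(vertices))
    homog = np.c_[vertices, np.ones(len(vertices))]
    for P, prob in zip(cameras, hair_probs):
        uvw = homog @ P.T
        uv = uvw[:, :2] / uvw[:, 2:3]
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        h, w = prob.shape
        ok = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        sums[ok] += prob[v[ok], u[ok]]
        counts[ok] += 1
    mean = np.divide(sums, np.maximum(counts, 1))
    return mean > vertex_thresh
```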

The resulting visual-hull shape is relatively rough due to the non-rigid motion of the subject’s head and might be stretched due to incomplete views. Ideally, assuming the camera distance is always larger than the size of the head, we would get a complete visual hull if the video covered the full azimuth range; for in-the-wild videos, however, we usually cannot guarantee full coverage. We rigidly align the rough visual hull to a generic head model using the 3D facial landmarks, and each camera pose is transformed correspondingly by this alignment. We connect each camera center to the center of the generic head (the origin in our case) and calculate the azimuth angle of each camera. The vertices of the visual hull whose azimuth angle falls inside the range covered by the cameras, as illustrated in Figure 6(a), are denoted high-confidence vertices.
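The high-confidence test can be sketched as follows, assuming the hull and cameras are already aligned to the generic head with the head center at the origin and azimuth measured in the horizontal plane (an assumption; the paper does not spell out the exact angle convention):

```python
import numpy as np

def high_confidence_vertices(vertices, camera_centers):
    """Mark hull vertices whose azimuth falls inside the range spanned by
    the (aligned) camera azimuths; the head center is at the origin."""
    def azimuth(p):
        # angle in the horizontal plane, in degrees
        return np.degrees(np.arctan2(p[..., 0], p[..., 2]))

    cam_az = azimuth(np.asarray(camera_centers, dtype=float))
    lo, hi = cam_az.min(), cam_az.max()        # covered view range
    vert_az = azimuth(np.asarray(vertices, dtype=float))
    return (vert_az >= lo) & (vert_az <= hi)   # True = high-confidence vertex
```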

Figure 5. Examples of Figure 2(B): hair segmentation, directional labels, and 2D hair strands for example video frames. The colors of the directional subregions (red, pink, blue, and green) correspond to the four direction labels.

5. Images to 2D Strands

This section describes part (B) in Figure 2. Inspired by the strand direction estimation method of [Chai et al., 2016], we trained our own hair segmentation and hair direction classifiers to label the hair pixels of each video frame and to predict the hair direction at those pixels. We manually annotate hair direction labels on our data and train a classifier using these manual labels directly as ground truth, as in [Chai et al., 2016]. More details on our hair segmentation method can be found in the appendix. Example classifier results are shown in Figure 5 (1st and 3rd rows).

To estimate 2D strands, we select one video frame every few degrees of camera azimuth angle, spanning the covered camera view range. Similar to previous hair orientation estimation methods [Jakob et al., 2009; Chai et al., 2012], we filter each image with a bank of oriented filters uniformly sampled in orientation and choose the orientation with the maximum response at each pixel to obtain a 2D non-directional orientation map for the image. We then trace hair strands on each non-directional orientation map following the method in [Chai et al., 2012], as shown in Figure 5 (2nd and 4th rows). The hair direction labels are used to resolve the ambiguity of each traced 2D hair strand: if half of the points in a strand have directions opposite to their directional labels, we flip the direction of the strand.
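A minimal sketch of the per-pixel orientation map built from an oriented (Gabor) filter bank with OpenCV; the filter parameters and the number of orientations are illustrative, not the paper's values.

```python
import cv2
import numpy as np

def orientation_map(gray, hair_mask, n_orient=32):
    """For each hair pixel, return the filter orientation (in [0, pi)) with the
    maximum response. gray: HxW grayscale image, hair_mask: HxW boolean array."""
    thetas = np.linspace(0, np.pi, n_orient, endpoint=False)
    responses = []
    for theta in thetas:
        kern = cv2.getGaborKernel(ksize=(17, 17), sigma=2.0, theta=theta,
                                  lambd=4.0, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kern))
    responses = np.stack(responses, axis=0)        # n_orient x H x W
    best = thetas[np.argmax(responses, axis=0)]    # per-pixel argmax orientation
    best[~hair_mask] = np.nan                      # keep only hair pixels
    return best
```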

6. Face Model

This section corresponds to part (C) in Figure 2. Each segmented video frame from the previous stage is processed with the IntraFace software [Xiong and De la Torre, 2013], which provides the head pose and inner facial landmarks. From all the frames, the frame closest to a frontal face (where yaw and pitch are approximately zero) is picked and fed to a morphable-model-based face estimator [Tran et al., 2017]. This method generates a linear combination of the Basel 3D face dataset [Blanz and Vetter, 1999] with identity shape weights, expression weights, and texture. Here, we use only the identity weights to generate a neutral face shape; it would be straightforward to add facial expressions to our method in the future. The result of the estimation is a masked face model.

We complete the head shape using a generic 3D head model from the Facewarehouse dataset [Cao et al., 2013]. We fit the Basel dataset rather than fitting the Facewarehouse dataset directly because the Facewarehouse meshes have far fewer vertices for the whole head than the Basel meshes provide for the face region alone, so the Basel model captures more facial shape detail. We pre-define 66 3D facial landmarks (49 landmarks on the inner face and 17 landmarks on the face contour and the ears) on both the 3D face dataset used by [Tran et al., 2017] and the generic head shape. Since all the face shapes in the 3D face dataset are in dense correspondence, we transfer these 66 landmarks to all the output 3D faces. We then deform the generic shape towards the 3D face shape using the landmarks, following [Liang et al., 2014], and fuse the deformed generic head shape with the face shape using Poisson surface reconstruction [Kazhdan and Hoppe, 2013] to obtain a complete head shape.

For the texture of the head, we project the full head to the selected frontal image and extract per-vertex colors from the non-hair region of the frontal image. We complete the side-view textures by projecting the head to all the frames. For the remaining invisible region, we assign an average skin color.
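A minimal sketch of the per-vertex texture sampling by projection; it ignores visibility and occlusion handling (which the full pipeline needs) and uses a hypothetical fallback skin color for vertices that project outside the frame.

```python
import numpy as np

def vertex_colors_from_frame(vertices, P, image, skin_fallback=(180, 150, 130)):
    """Sample a per-vertex color by projecting each head vertex into one frame
    with projection matrix P; vertices outside the frame get a fallback color."""
    uvw = np.c_[vertices, np.ones(len(vertices))] @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w, _ = image.shape
    colors = np.tile(np.array(skin_fallback, dtype=np.uint8), (len(vertices), 1))
    ok = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    colors[ok] = image[v[ok], u[ok]]   # copy the pixel color at the projection
    return colors
```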

7. 3D Hair Strand Estimation

This section corresponds to part (D) in Figure 2. By using a video, we can deform hairstyles in 3D instead of 2D, taking advantage of the shape information in all the frames and the content continuity between frames to estimate the per-frame pose.

2D to initial 3D strands: Each video frame has an estimate of 2D strands; these are back-projected using the per-frame depth maps to obtain 3D strands. Large peaks in the 3D hair strands (vertices whose distance to a neighboring vertex is too large relative to the reference head width) are removed. A merging procedure is then performed to decrease retrieval time (and reduce duplicate strands): for each pair of strands with the same direction, the pairwise point-to-point distances between the vertices of the two 3D strands are checked, and overlapping line segments are combined; strands with different directions are not merged. This process iterates until a fixed number of 3D strands is obtained.
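A minimal sketch of the back-projection and peak-removal step, assuming a per-frame depth map, the inverse intrinsics K_inv, and a 4x4 camera-to-world transform; max_jump stands in for the paper's peak threshold, which is not reproduced here.

```python
import numpy as np

def strand_2d_to_3d(strand_uv, depth, K_inv, cam_to_world, max_jump):
    """Back-project a traced 2D strand (list of (u, v) pixels) into 3D using the
    frame's depth map and camera, then drop vertices that jump too far from
    their predecessor (depth-noise peaks)."""
    pts = []
    for u, v in strand_uv:
        z = depth[int(v), int(u)]
        if z <= 0:                                   # no depth at this pixel
            continue
        ray = K_inv @ np.array([u, v, 1.0])          # pixel -> camera-space ray
        p_cam = ray * (z / ray[2])                   # point at that depth
        p_world = cam_to_world @ np.append(p_cam, 1.0)
        pts.append(p_world[:3])
    out = []
    for p in pts:                                    # remove large peaks
        if not out or np.linalg.norm(p - out[-1]) <= max_jump:
            out.append(p)
    return np.array(out)
```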

3D Strands to Query a Hair Database: The 3D strands recovered in the previous stage are sparse and incomplete; we therefore use them to query a database of hair models and adjust the retrieved matches with global and local deformations to create a full hair model. The sparseness results from the resolution of the video frames, motion blur, quality, and limited view coverage (in all of our input videos, the back of the head is not visible). While sparse, the strands do capture the person’s specific hairstyle. We describe the algorithm below.

We use the hair dataset created by [Chai et al., 2016], which contains a wide range of hairstyles, each model consisting of a large number of hair strands. For each database hair model, we create a voxel grid around each hair strand vertex and combine all voxel grids into a voxelized mesh, which is smoothed with Laplacian mesh smoothing [Sorkine et al., 2004]. To remove the inner layer of the shape, a ray is shot from each vertex in the direction from the center of the head to that vertex; if the ray intersects any other part of the rough shape, the vertex lies on the inner surface and is removed, otherwise it is kept. The final cleaned shape (shown in Figure 3) is used for retrieval and deformation.
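The inner-layer removal test can be sketched with an off-the-shelf ray/mesh intersector such as trimesh (the paper does not name a library; head_center and the small offset eps are assumptions):

```python
import numpy as np
import trimesh

def remove_inner_vertices(mesh: trimesh.Trimesh, head_center, eps=1e-3):
    """Return a boolean mask of vertices to keep: a vertex is kept only if the
    outward ray from it (starting slightly off the surface to avoid self-hits)
    does not intersect the mesh again."""
    dirs = np.asarray(mesh.vertices) - np.asarray(head_center)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    origins = np.asarray(mesh.vertices) + eps * dirs
    hit = mesh.ray.intersects_any(ray_origins=origins, ray_directions=dirs)
    return ~hit   # True = outer-surface vertex
```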

For each 3D hair strand in our query hairstyle, the closest 3D hair strand in a candidate database hairstyle is determined using the following distance:

(1)

which accumulates, over the vertices of the query strand, the point-to-line distance from each vertex to the candidate strand and the mismatch between the tangent vector direction at that vertex and the tangent direction of the candidate strand at the corresponding closest point.
This point-to-line distance comparison over the whole database is very time-consuming, so we accelerate retrieval by pruning candidate hairstyles using the rough mesh of the head obtained from step (A) (Section 4) and step (B) (Section 5), as follows (a sketch of the pruning appears after the list):

  1. Hairstyle boundary: Only the hairstyles whose bounding extents along each axis fall within a tolerance of the extents of the rough mesh are considered.

  2. Area of the hairstyle: The surface areas of the rough mesh and of each hairstyle mesh are computed. Only the hairstyles whose surface area falls within a tolerance of the rough mesh’s area are considered.
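A sketch of this pruning with trimesh meshes; the relative tolerances are placeholders, not the paper's values.

```python
import numpy as np
import trimesh

def prune_candidates(rough_mesh, db_meshes, extent_tol=0.2, area_tol=0.3):
    """Keep only database hairstyle meshes whose bounding-box extents and surface
    area are within a relative tolerance of the rough hair mesh from steps (A)-(B)."""
    ref_extent = rough_mesh.bounds[1] - rough_mesh.bounds[0]
    ref_area = rough_mesh.area
    keep = []
    for i, m in enumerate(db_meshes):
        extent = m.bounds[1] - m.bounds[0]
        if np.any(np.abs(extent - ref_extent) > extent_tol * ref_extent):
            continue                                  # bounding extents too different
        if abs(m.area - ref_area) > area_tol * ref_area:
            continue                                  # surface area too different
        keep.append(i)
    return keep
```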

Next, the retrieved matches are deformed in global and local fashion in 3D instead of 2D, taking advantage of the multi-view information in the video. Figure 4 illustrates the deformation process.

View Correction and Global Deformation After the top best-matching hairstyles are found, each retrieved hairstyle is deformed towards the rough shape (created by steps (A) and (B) and shown in Figure 6(a)) using deformable registration [Allen et al., 2003], producing a deformed hairstyle mesh. Step (A) also defines the low-confidence regions of the visual hull (see Section 4), so further correction is needed on the corresponding regions of the deformed mesh. The azimuth angle of each vertex of the deformed mesh is calculated, and vertices outside the confident region are considered invalid, marked in red in Figure 6(c). We use the original shape of the retrieved hairstyle to correct the invalid region. The correction is based on the idea of Laplacian Mesh Editing [Sorkine et al., 2004]. For simplicity, we refer to the covered camera range as the valid view range. We assign a confidence value to each vertex of the deformed mesh and minimize the following energy function.

(2)

Equation 2 combines a Laplacian smoothness term, which preserves each vertex’s Laplacian coordinates from before the deformation, with a per-vertex data term that pulls each vertex toward its closest point on the target after direct deformable registration, weighted by the vertex’s confidence value. The confidence value is maximal inside the valid view range and decays smoothly with angular distance to the valid range for vertices in the invalid region. As shown in Figure 6(c), the stretched red region of the deformed mesh is corrected to a natural look in (d). After the correction, a transformation matrix is obtained for each vertex of the deformed mesh.

We further deform the hair strands of the retrieved hairstyle as shown in Figure 4(a). Each vertex of the corrected mesh acts as an anchor point for the hairstyle deformation. For each point of a hair strand, its deformation is decided by a set of neighboring anchor points as

(3)

where the set of neighboring anchor points consists of the closest vertices of the corrected mesh to the strand point, an identity matrix serves as the default transformation in the blend, and the per-anchor weights are defined as a Gaussian function of distance:

(4)

In our experiments, the Gaussian width and the number of neighboring anchors are fixed relative to the width of our reference head. We deform the top best-matching hairstyles and use the same distance function as in Equation 1 to find the final best match. Figure 4(a) compares a hairstyle before and after global deformation.
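A minimal sketch of the anchor-driven strand deformation in the spirit of Equations 3 and 4: each strand point is moved by a Gaussian-weighted blend of the results of applying its nearest anchors' transformations. The number of neighbors k and the Gaussian width sigma are placeholders, and the exact form of the paper's equations is not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

def deform_strand_points(points, anchors, anchor_transforms, k=8, sigma=1.0):
    """points: (n, 3) strand vertices; anchors: (m, 3) corrected-mesh vertices;
    anchor_transforms: list/array of m 4x4 per-anchor transformation matrices."""
    tree = cKDTree(anchors)
    dists, idx = tree.query(points, k=k)               # k nearest anchors per point
    w = np.exp(-(dists ** 2) / (2.0 * sigma ** 2))     # Gaussian weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    out = np.empty_like(points, dtype=float)
    for i, p in enumerate(points):
        p_h = np.append(p, 1.0)
        blended = sum(w[i, j] * (anchor_transforms[idx[i, j]] @ p_h)
                      for j in range(k))               # blend the transformed points
        out[i] = blended[:3]
    return out
```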

Figure 6. In (a), we show the hair part shape extracted from the visual hull (Figure 2(A)) plus the hair labels in Figure 2(B) on top of the head mesh from Figure 2(C). (b) shows an illustration of the top-down view of the visual hull with camera range and invalid regions. In (c)(d), we show a candidate hairstyle mesh before and after correction.

Local Deformation To add locally personalized hair details, we follow a method similar to [Fu et al., 2007] and convert the deformed hair strands into a 3D orientation field. The orientations of the hair strands extracted from the video frames are also added to the 3D orientation field to bend the hair strands in the surface layer of the hairstyle. For each 3D query strand, we set an influence volume of fixed radius around it and diffuse its orientation to fill the surrounding voxels. The best-matching hairstyle and the query 3D hair strands both contribute to the 3D orientation field through

(5)

where the discrete Laplacian operator enforces smoothness of the field, one set of boundary constraints fixes the known directions at voxels that contain 3D strands from the best-matching hairstyle, and another set of boundary constraints comes from the query hair strands.

In our experiments, the 3D orientation grid size and the weights of the two boundary-constraint terms are fixed. We show a comparison of results with and without the local deformation in Figure 4(b); notice the personalized hair strands in the red circles. We avoid artifacts of hair going inside the head by growing new hair strands out of the scalp region of the complete head shape of Section 6 with pre-defined hair root points.
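A minimal sketch of the orientation-field diffusion, approximating the Laplace solve of Equation 5 with Jacobi-style iterations on a regular grid; the periodic boundary handling of np.roll and the fixed iteration count are simplifications, not the paper's solver.

```python
import numpy as np

def diffuse_orientation_field(field, known_mask, n_iters=200):
    """field: (X, Y, Z, 3) array with unit directions at constrained voxels and
    zeros elsewhere; known_mask: (X, Y, Z) boolean array of constrained voxels
    (strands from the best-matching hairstyle and from the query)."""
    f = field.copy()
    for _ in range(n_iters):
        avg = np.zeros_like(f)
        for axis in range(3):                          # 6-neighbor average
            avg += np.roll(f, 1, axis=axis) + np.roll(f, -1, axis=axis)
        avg /= 6.0
        f = np.where(known_mask[..., None], field, avg)   # re-impose constraints
        norm = np.linalg.norm(f, axis=-1, keepdims=True)
        f = np.divide(f, norm, out=np.zeros_like(f), where=norm > 1e-8)
    return f
```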

Hair Texture The color of each hair strand vertex is averaged over all the frames, and unseen regions are assigned the average color of the visible regions.

8. Experiments

In this section we describe the parameters used and the data collection process; we show results as well as comparisons to state-of-the-art methods.

8.1. Data Collection and Processing of Video Sequences

We collected video clips of celebrities by searching for keywords like "Hillary speech" and "Adele award" on YouTube with an HD filter. The videos are HD, with durations of tens of seconds; the number of frames varies per subject (Adele, Cate Blanchett, Hillary Clinton, Justin Trudeau, Theresa May, and Angela Merkel). The camera viewpoint is relatively fixed across all the frames, while the subject gives a speech with his or her head turning. We processed our frames at 10 fps. We ran the face detection method of [Xiong and De la Torre, 2013] on all the frames to determine a bounding box around the head. Our online video sequences typically cover the frontal, left, and right views of the person; the smallest view range among them is in Angela Merkel’s video. There are no back views of any person’s head in the videos.

For mobile selfie videos, 9 subjects were asked to take a selfie video of themselves from left to right, switching hands in front, using their own smartphones (video resolution varies across phones). The subjects were not required to stay rigid and could take the video at their ease. The videos were taken in arbitrary environments, and the lighting was not controlled. Note that the quality of mobile selfie videos is usually worse than that of the online videos due to motion blur from hand movement, auto focus, and auto exposure of the phone cameras, although a higher frame resolution is available. Each selfie video is sampled at 20 fps, and the frame counts vary per subject (listed from top to bottom of Figure 7).

Later, the semantic segmentation method of [Zheng et al., 2015] was run on video frames to remove the background and foreground occlusions such as microphones. We ran VisualSFM [Wu, 2011] on the pre-processed frames. In [Wu, 2011], the non-rigid face expression change might cause large distortions in the reconstructed views; thus we set radial distortion to zero.

Runtime We ran our algorithm on a single PC with a 12-core i7 CPU, 16GB of memory, and four NVIDIA GTX 1080 Ti graphics cards. For a typical online video, each stage takes on the order of minutes: preprocessing, structure from motion, and the visual hull in Figure 2(A); extracting 2D query strands in Figure 2(B); head shape reconstruction and texture extraction; hair database retrieval and deformation in Figure 2(D); and the 3D-orientation local deformation and final hair strand generation from the reconstructed head.

8.2. Results and Comparisons

Figure 7. Reconstruction results from mobile selfie videos of different people in different environments.
Figure 8. Example results of our method. From top to bottom: Angela Merkel, Cate Blanchett, and Hillary Clinton, whose videos span different view ranges; Angela Merkel’s video has the smallest coverage. Note that we can create a natural-looking result even with this small view coverage.

Figure 7 and Figure 8 show the results together with the reference frames from the videos. We can see that the reconstructions are good for a variety of lighting conditions, diverse people and hairstyles. See also the supplementary videos.

Next, we compared our results to the state-of-the-art in-the-wild hair reconstruction methods [Chai et al., 2016; Zhang et al., 2017; Hu et al., 2017]. More view comparisons are shown in the supplementary video. (We thank the authors of those papers for helping create the comparison results.) We performed the qualitative, quantitative, and user-study comparisons described below.

Figure 9 shows comparisons for single-view methods. We picked a frontal frame from each of the video clips of celebrities as input. We compared our untextured results with [Chai et al., 2016] and textured results with [Hu et al., 2017]. Note that our 3D models captured more personalized hairstyles; for example in Adele’s case (the 1st row), [Chai et al., 2016] produced a short hairstyle, while Adele has a long hairstyle. Compared to [Hu et al., 2017], where each hairstyle has a flat back, our results show more variety.

Figure 9. Our results compared to state-of-the-art methods. For each subject, we show results in frontal and side views. For each view, the first column shows a reference frame from the video; we then show, in order, the untextured results from [Chai et al., 2016], [Zhang et al., 2017], and our method, followed by the textured results from [Hu et al., 2017] and our method. Note how our result captures more personalized hair details, as also indicated by the human studies and quantitative comparisons. More view comparisons are provided in the supplementary video.

For [Zhang et al., 2017], frontal, left, and right views were manually chosen from the same video clip. Since our video frames contain no back view, and a back view is necessary for the four-view reconstruction method, the authors were allowed to use any back-view image they could find (the authors did not reconstruct Adele; the back-view photos can be found in the supplementary video). Our algorithm did not use any back-view photo of the person. Our results are similar to those of [Zhang et al., 2017], but ours are closer to the input; for example, the result for Justin (the 4th row) produced by [Zhang et al., 2017] has a larger volume than the hairstyle in the video.

We performed a quantitative comparison by projecting the reconstructed hair as lines onto the images and computing the intersection-over-union (IOU) rate against the ground-truth hair mask (manually labeled, and not used in training or testing our hair classifiers) for each frame. We report the average IOU over all frames of each subject in Table 1; a larger IOU means the reconstructed hair approximates the input better. We used the same per-frame camera poses estimated from structure from motion, assuming a perspective camera model, to project both the results from [Zhang et al., 2017] and ours. In total, our reconstructions achieve an average IOU of around 80%, while the four-view reconstruction method achieves around 60%. We show the projections and ground-truth hair masks of some example frames in Figure 10; our results resemble the hairstyles better in all the frames. Note that since the back-view image used by [Zhang et al., 2017] does not necessarily come from the same person, the inconsistency between the four views might affect their final results. Also, [Zhang et al., 2017] assume an orthographic camera model, which might account for some of the difference.
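For reference, the per-frame IOU used in Table 1 can be computed as follows (a straightforward sketch, not the paper's code):

```python
import numpy as np

def hair_iou(pred_mask, gt_mask):
    """Intersection-over-union between a projected hair silhouette and the
    manually labeled ground-truth hair mask of one frame."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

# average over all frames of one subject:
# mean_iou = np.mean([hair_iou(p, g) for p, g in zip(pred_masks, gt_masks)])
```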

Subject Frames [Zhang et al., 2017] Ours
Hillary 266
Theresa 252
Cate 255
Justin 307
Table 1. IOU accuracy between the projected reconstructed hair and the hair segmentation (manually labeled ground truth).
Figure 10. This figure shows four example frames comparing the silhouettes of the reconstructed hairstyles to the hair segmentation results. The red mask is the annotated groundtruth hair mask over the image frame. The green mask shows the projected silhouettes from our method over the image and the blue mask shows the projected silhouettes from [Zhang et al., 2017].

User Study We performed Amazon Mechanical Turk studies to compare our results to other methods. We showed two results side by side with the ground-truth image in different views and asked which shape was more similar to the input, ignoring the face, shoulders, and rendering quality. For each subject, we ran groups of studies comparing the frontal, left, and right views; to remove bias, we switched the order of the two results and ran additional groups. Each view was rated by 40 Turkers, giving a total of 120 ratings for each subject. We report the rate of preference for our results over the total in Table 2; our results achieved an average preference rate of 72.4% across all the study groups. Similarly, we compared our textured results to the avatar digitization work [Hu et al., 2017], showing three views per subject; the preference ratios are reported in Table 3. In total, our results were considered better by 90.8% of the Turkers.

Subject frontal left right total
Hillary 39/40 27/40 27/40 93/120
Theresa 13/40 26/40 27/40 66/120
Cate 30/40 26/40 32/40 88/120
Justin 27/40 37/40 38/40 102/120
Table 2. The ratio of preference to our results over total compared to [Zhang et al., 2017] based on Amazon Mechanical Turk tests.
Subject view 1 view 2 view 3 total
Adele 29/40 32/40 38/40 99/120
Hillary 35/40 38/40 36/40 109/120
Theresa 35/40 37/40 39/40 111/120
Cate 40/40 40/40 40/40 120/120
Justin 39/40 35/40 32/40 106/120
Table 3. The ratio of preference to our results over total compared to [Hu et al., 2017] based on Amazon Mechanical Turk test.

Robustness: To evaluate the robustness of our hair segmentation classifier, we counted the failure frames for each celebrity input video (2 videos each for Cate Blanchett and Hillary Clinton, plus Adele, Justin Trudeau, Theresa May, and Angela Merkel) and for each selfie video (listed from top to bottom of Figure 7), and computed the average rate of successfully segmented frames. Generally we observed more failure frames on selfie videos, due to motion blur when the subject switches hands and less controlled lighting than in the celebrity videos. We ran simulation experiments to test how robust our system is to failed segmentation frames: we gradually decreased the number of frames (selected randomly) used to generate the rough hair shape and calculated the average IOU between the projected and ground-truth hair labels on four celebrity videos. The result is plotted in Fig. 11. Our system remains robust even when half of the frames are dropped; however, as the number of failure frames increases we can no longer guarantee a good global shape, which leads to a poor reconstruction result.

Figure 11. How similar the rough hair shape is to the ground truth as the number of good hair segmentation frames decreases.

3D vs. 2D: We further compared retrieving the best-matching hairstyle in 3D with retrieving in 2D. For each subject, we used the same frames as in the 3D retrieval as query input. Instead of projecting each 2D strand back to 3D using per-frame depth, we projected the candidate hairstyles from the database (after the pruning of Section 7) onto each frame and computed a query distance similar to Equation 1, but in 2D. We show the best matches before any deformation in Fig. 12. We chose to retrieve in 3D instead of 2D for three main reasons. 1) Frames from the same video have many overlapping regions, which causes redundancy in the query input when retrieving with each 2D frame. 2) Projecting the database hairstyles increases the computational cost compared to back-projecting the sparse strands to 3D. We could pre-generate projected 2D views for a set of pre-defined poses as in the single-view method [Chai et al., 2016]; however, since our input video does not have fixed views, we would need a much larger pose space than for single-view input. 3) The retrieved results are usually biased towards the frontal view, because hair strands in near-frontal views are seen in more frames than those in side views, as shown in Fig. 12. For example, in Adele’s comparison, the two hair strips in the front contributed most to the retrieval, making the best match less similar to the subject from the side view. By projecting the strands to 3D, we allow the information from different views to contribute equally to the retrieval, while also reducing the computational cost.

Figure 12. Comparison of retrieval in 3D and in 2D. The left two columns show two reference frames from the input video. The middle two columns show the frontal and side views of the best-matching hairstyle from 3D retrieval; the right two columns show the best match from 2D retrieval. The 2D-retrieved results are generally similar in frontal views but do not match the subject as well as the 3D-retrieved results in side views.

9. Limitations and Applications

9.1. Limitations

Our method cannot work on highly dynamic hairstyles due to the high non-rigidity of the hair volumes across the in-the-wild videos. See the example video frames of Olivia Culpo in Figure 13(a). The human hand interaction and body occlusion make the segmentation difficult. Explicitly modeling long hair dynamics is beyond the scope of the paper, and we assume small dynamics from the input videos. We also could not reconstruct very curly hairstyles and complicated hairstyles such as braids.

Our method also fails on videos where the background is too complicated as shown in Figure 13(b). The other people or crowds in the background make it hard to estimate the head silhouette of the person and will lead to incorrect correspondences when running structure-from-motion.

For the low-confidence view correction, we require an input video covering a view range of at least about 90 degrees (the smallest range we demonstrate). Fewer views cause the visual hull to be extremely distorted, as shown in Figure 13(c), where our deformable registration fails with a large fitting error. Note that as the view coverage decreases further, the problem reduces to single-view reconstruction.

Figure 13. Limitations of our algorithm. In (a) we show example video frames of a highly non-rigid hairstyle. In (b) we show an example video frame with a complicated background. In (c) we show the back of a hair mesh deformed towards a visual hull built from an input with small view coverage.

9.2. Applications

Our reconstructed models can now be used for a variety of applications, as shown in Figure 14(a)(b); we can also change the overall color of the hairstyle, making it darker or lighter, or morph it into another hairstyle.

Figure 14. Hairstyle change examples. We show a darker and lighter version of Cate Blanchett’s hairstyle in (a)(b). (c) shows the hair morphing intermediate results from two different hairstyles of the same person. (d) shows the hair morphing from the person’s reconstructed hairstyle to a given hairstyle from the dataset.

Since the hair roots of all our hair models are transferred from the generic shape used to compute the head model, we can assume that hair root points and hair strands are in correspondence for the same person. We resample all hair strands to the same number of vertices (50 in our implementation), which enables applications such as personalized strand-level hairstyle morphing.

In Figure 14(c), we show a hair morphing result for Cate Blanchett between two hairstyles from two different videos. The intermediate results are created by one-to-one strand interpolation between the source and target hair strands. We can also morph the reconstructed hairstyle to a given hairstyle from the dataset, as shown in Figure 14(d). The given hairstyle was re-grown from its 3D orientation field using the same set of scalp points to create correspondences with the original hairstyle. We trim hair strands that intersect the face during interpolation.

10. Discussion and Future Work

We have described a method that takes as input a video of a person’s head in the wild and outputs a detailed 3D hair strand model combined with a reconstructed 3D head to produce a full head model. This method is fully automatic and shows that a head model with higher fidelity can be recovered by combining information from video frames. Our method is not restricted to specific views and head poses, making the full head reconstruction from in-the-wild videos possible. We showed our results on several celebrities as well as mobile selfie videos and compared our work to the most recent state of the art.

However, there are still a number of limitations and possible extensions to explore. One direction is to refine the rough hair mesh estimation for non-rigid hairstyles. Currently the person’s head moves gently in our input, and our rough hair mesh is generated from the visual hull, which is only an approximation of the real hair volume, since the hair is non-rigid. We might explore using ARKit on a smartphone to provide extra depth information to densely align the hair volume across frames. We also currently rely on rigid camera poses estimated with structure from motion using SIFT features to connect all the views; in the future, we want to use facial features and hair-specific features to increase the robustness of the frame alignment.

In this paper we target the more challenging problem of reconstructing 3D hair models from in-the-wild data, so some results are still not highly detailed, and we cannot handle complicated hairstyles such as braids and tightly curled hair.

We created face textures from video frames, which can cause artifacts due to hair occlusions, accessory occlusions, and low facial resolution. In the future, we could generate photo-realistic textures as in [Saito et al., 2016]. Finally, a future extension that incorporates a facial blend-shape model or estimates per-frame facial dynamics as in [Suwajanakorn et al., 2014] to create a fully animatable model would be desirable.

Acknowledgements.
This work was partially supported by the UW Reality Lab, Facebook, Google, Huawei, and NSF/Intel Visual and Experimental Computing Award #1538618.

References

  • Alexander et al. [2013] Oleg Alexander, Graham Fyffe, Jay Busch, Xueming Yu, Ryosuke Ichikari, Andrew Jones, Paul Debevec, Jorge Jimenez, Etienne Danvoye, Bernardo Antionazzi, et al. 2013. Digital Ira: creating a real-time photoreal digital actor. In ACM SIGGRAPH 2013 Posters. ACM, 1.
  • Allen et al. [2003] B. Allen, B. Curless, and Z. Popović. 2003. The space of human body shapes: reconstruction and parameterization from range scans. In ACM Transactions on Graphics (TOG), Vol. 22. ACM, 587–594.
  • Beeler et al. [2010] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-quality single-shot capture of facial geometry. ACM Transactions on Graphics (TOG) 29, 4 (2010), 40.
  • Blanz and Vetter [1999] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.
  • Cao et al. [2013] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2013. Facewarehouse: a 3d facial expression database for visual computing. (2013).
  • Cao et al. [2016] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics (TOG) 35, 4 (2016), 126.
  • Chai et al. [2015] Menglei Chai, Linjie Luo, Kalyan Sunkavalli, Nathan Carr, Sunil Hadap, and Kun Zhou. 2015. High-quality hair modeling from a single portrait photo. ACM Transactions on Graphics (TOG) 34, 6 (2015), 204.
  • Chai et al. [2016] Menglei Chai, Tianjia Shao, Hongzhi Wu, Yanlin Weng, and Kun Zhou. 2016. Autohair: Fully automatic hair modeling from a single image. ACM Transactions on Graphics (TOG) 35, 4 (2016), 116.
  • Chai et al. [2013] Menglei Chai, Lvdi Wang, Yanlin Weng, Xiaogang Jin, and Kun Zhou. 2013. Dynamic hair manipulation in images and videos. ACM Transactions on Graphics (TOG) 32, 4 (2013), 75.
  • Chai et al. [2012] Menglei Chai, Lvdi Wang, Yanlin Weng, Yizhou Yu, Baining Guo, and Kun Zhou. 2012. Single-view hair modeling for portrait manipulation. ACM Transactions on Graphics (TOG) 31, 4 (2012), 116.
  • Chen et al. [2016] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016).
  • Debevec [2012] Paul Debevec. 2012. The light stages and their applications to photoreal digital actors. SIGGRAPH Asia (2012).
  • Fu et al. [2007] Hongbo Fu, Yichen Wei, Chiew-Lan Tai, and Long Quan. 2007. Sketching hairstyles. In Proceedings of the 4th Eurographics workshop on Sketch-based interfaces and modeling. ACM, 31–36.
  • Goesele et al. [2007] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M Seitz. 2007. Multi-view stereo for community photo collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 1–8.
  • Hairbobo [2017] Hairbobo. 2017. Hairbobo. http://www.hairbobo.com/faxingtupian. (Sept. 2017).
  • Hu et al. [2014a] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. 2014a. Robust hair capture using simulated examples. ACM Transactions on Graphics (TOG) 33, 4 (2014), 126.
  • Hu et al. [2015] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. 2015. Single-view hair modeling using a hairstyle database. ACM Transactions on Graphics (TOG) 34, 4 (2015), 125.
  • Hu et al. [2014b] Liwen Hu, Chongyang Ma, Linjie Luo, Li-Yi Wei, and Hao Li. 2014b. Capturing braided hairstyles. ACM Transactions on Graphics (TOG) 33, 6 (2014), 225.
  • Hu et al. [2017] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Trans. Graph. 36, 6, Article 195 (Nov. 2017), 14 pages. https://doi.org/10.1145/3130800.31310887
  • Ichim et al. [2015] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (TOG) 34, 4 (2015), 45.
  • Jakob et al. [2009] Wenzel Jakob, Jonathan T Moon, and Steve Marschner. 2009. Capturing hair assemblies fiber by fiber. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 164.
  • Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 675–678.
  • Kazhdan and Hoppe [2013] Michael Kazhdan and Hugues Hoppe. 2013. Screened poisson surface reconstruction. ACM Transactions on Graphics (TOG) 32, 3 (2013), 29.
  • Kemelmacher-Shlizerman [2016] Ira Kemelmacher-Shlizerman. 2016. Transfiguring portraits. ACM Transactions on Graphics (TOG) 35, 4 (2016), 94.
  • Kemelmacher-Shlizerman and Basri [2011] Ira Kemelmacher-Shlizerman and Ronen Basri. 2011. 3d face reconstruction from a single image using a single reference face shape. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33, 2 (2011), 394–405.
  • Kemelmacher-Shlizerman and Seitz [2011] Ira Kemelmacher-Shlizerman and Steven M Seitz. 2011. Face reconstruction in the wild. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 1746–1753.
  • Laurentini [1994] Aldo Laurentini. 1994. The visual hull concept for silhouette-based image understanding. IEEE Transactions on pattern analysis and machine intelligence 16, 2 (1994), 150–162.
  • Liang et al. [2014] Shu Liang, Ira Kemelmacher-Shlizerman, and Linda G Shapiro. 2014. 3d face hallucination from a single depth frame. In 3D Vision (3DV), 2014 2nd international conference on, Vol. 1. IEEE, 31–38.
  • Liang et al. [2016] Shu Liang, Linda G Shapiro, and Ira Kemelmacher-Shlizerman. 2016. Head reconstruction from internet photos. In European Conference on Computer Vision. Springer, 360–374.
  • Liu et al. [2017] Sifei Liu, Jianping Shi, Ji Liang, and Ming-Hsuan Yang. 2017. Face Parsing via Recurrent Propagation. arXiv preprint arXiv:1708.01936 (2017).
  • Liu et al. [2015] Sifei Liu, Jimei Yang, Chang Huang, and Ming-Hsuan Yang. 2015. Multi-objective convolutional learning for face labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3451–3459.
  • Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
  • Luo et al. [2013] Linjie Luo, Hao Li, and Szymon Rusinkiewicz. 2013. Structure-aware hair capture. ACM Transactions on Graphics (TOG) 32, 4 (2013), 76.
  • Luo et al. [2012] Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2012. Hierarchical face parsing via deep learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2480–2487.
  • Maninchedda et al. [2016] Fabio Maninchedda, Christian Häne, Bastien Jacquet, Amaël Delaunoy, and Marc Pollefeys. 2016. Semantic 3D Reconstruction of Heads. In European Conference on Computer Vision. Springer, 667–683.
  • Newcombe et al. [2015] Richard A Newcombe, Dieter Fox, and Steven M Seitz. 2015. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 343–352.
  • Paris et al. [2004] Sylvain Paris, Hector M Briceño, and François X Sillion. 2004. Capture of hair geometry from multiple images. In ACM Transactions on Graphics (TOG), Vol. 23. ACM, 712–719.
  • Paris et al. [2008] Sylvain Paris, Will Chang, Oleg I Kozhushnyan, Wojciech Jarosz, Wojciech Matusik, Matthias Zwicker, and Frédo Durand. 2008. Hair photobooth: geometric and photometric acquisition of real hairstyles. In ACM Transactions on Graphics (TOG), Vol. 27. ACM, 30.
  • Richardson et al. [2016] Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D face reconstruction by learning from synthetic data. In 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 460–469.
  • Richardson et al. [2017] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning detailed face reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5553–5562.
  • Saito et al. [2016] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2016. Photorealistic Facial Texture Inference Using Deep Neural Networks. arXiv preprint arXiv:1612.00523 (2016).
  • Sorkine et al. [2004] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and H-P Seidel. 2004. Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing. ACM, 175–184.
  • Suwajanakorn et al. [2014] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. 2014. Total moving face reconstruction. In Computer Vision–ECCV 2014.
  • Suwajanakorn et al. [2015] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2015. What Makes Tom Hanks Look Like Tom Hanks. In Proceedings of the IEEE International Conference on Computer Vision. 3952–3960.
  • Thies et al. [2015] Justus Thies, Michael Zollhoefer, Matthias Niessner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-time Expression Transfer for Facial Reenactment. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2015).
  • Tran et al. [2017] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. 2017. Regressing robust and discriminative 3D morphable models with a very deep neural network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1493–1502.
  • Vanakittistien et al. [2016] Nuttapon Vanakittistien, Attawith Sudsang, and Nuttapong Chentanez. 2016. 3D hair model from small set of images. In Proceedings of the 9th International Conference on Motion in Games. ACM, 85–90.
  • Wang et al. [2011] Dan Wang, Xiujuan Chai, Hongming Zhang, Hong Chang, Wei Zeng, and Shiguang Shan. 2011. A novel coarse-to-fine hair segmentation method. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on. IEEE, 233–238.
  • Ward et al. [2007] Kelly Ward, Florence Bertails, Tae-Yong Kim, Stephen R Marschner, Marie-Paule Cani, and Ming C Lin. 2007. A survey on hair modeling: Styling, simulation, and rendering. IEEE Transactions on Visualization and Computer Graphics 13, 2 (2007).
  • Weng et al. [2013] Yanlin Weng, Lvdi Wang, Xiao Li, Menglei Chai, and Kun Zhou. 2013. Hair interpolation for portrait morphing. In Computer Graphics Forum, Vol. 32. Wiley Online Library, 79–84.
  • Wu [2011] Changchang Wu. 2011. VisualSFM: A visual structure from motion system. (2011).
  • Xiong and De la Torre [2013] Xuehan Xiong and Fernando De la Torre. 2013. Supervised Descent Method and its Applications to Face Alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yacoob and Davis [2006] Yaser Yacoob and Larry S Davis. 2006. Detection and analysis of hair. IEEE transactions on pattern analysis and machine intelligence 28, 7 (2006), 1164–1169.
  • Zhang et al. [2017] Meng Zhang, Menglei Chai, Hongzhi Wu, Hao Yang, and Kun Zhou. 2017. A data-driven approach to four-view image-based hair modeling. ACM Transactions on Graphics (TOG) 36, 4 (2017), 156.
  • Zheng et al. [2015] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529–1537.
  • Zollhöfer et al. [2014] Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. 2014. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics, TOG (2014).

Appendix A Hair Segmentation Classifier

We trained the hair segmentation classifier using a fully convolutional network, using a subset of the collected photos for training and holding out the rest for testing. Specifically, we used the FCN-32s model [Long et al., 2015], fine-tuned from the ILSVRC-trained VGG-16 model with PASCAL VOC data. To detect only the hair category, we changed the number of outputs of the last convolution layer to one and added a sigmoid layer to obtain scores between 0 and 1; the output score represents the probability that a pixel belongs to hair. In our implementation, we resized all input images to a fixed resolution. Since FCN-32s downsamples the output by a factor of 32, we used bilinear interpolation to upsample the output heatmap and obtain the final segmentation result. We fine-tuned the pretrained FCN-32s with standard hyperparameters (minibatch size, learning rate, momentum, and weight decay), freezing all layers except the last score layer in a first training stage and then fine-tuning all layers in a second stage.
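For illustration, here is a hedged PyTorch sketch of an equivalent binary hair segmenter. Note the substitution: the paper fine-tunes a Caffe FCN-32s/VGG-16, whereas this sketch uses torchvision's FCN-ResNet50 purely because it is readily available; the one-channel score head with a sigmoid mirrors the setup described above, and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

def build_hair_segmenter():
    model = fcn_resnet50(weights="DEFAULT")                   # pretrained backbone + head
    model.classifier[4] = nn.Conv2d(512, 1, kernel_size=1)    # single hair-score channel
    if model.aux_classifier is not None:
        model.aux_classifier[4] = nn.Conv2d(256, 1, kernel_size=1)
    return model

def train_step(model, images, hair_masks, optimizer):
    """images: Nx3xHxW float tensor, hair_masks: Nx1xHxW binary tensor."""
    model.train()
    logits = model(images)["out"]                             # upsampled to input size
    loss = nn.functional.binary_cross_entropy_with_logits(logits, hair_masks.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# at test time, per-pixel hair probabilities: torch.sigmoid(model(images)["out"])
```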

We tested our segmenter on the held-out test images, measuring per-image inference time, pixel accuracy, and IOU. The hair segmentation network was implemented with Caffe [Jia et al., 2014] and C++ and ran on an NVIDIA GTX 1080 GPU.

Using automatic hair segmentation (that works well across views, even back views) is the key to enabling a fully automatic hair modeling system. Our algorithm is capable of segmenting the hair region successfully in different views and head poses. However, it still fails to segment some hairstyles correctly when the hair color is too close to the background or when the image has a large motion blur.

We compared our classifier to two methods: [Liu et al., 2017] and DeepLab [Chen et al., 2016] (the baseline compared against in [Chai et al., 2016]; since [Chai et al., 2016] does not provide code, we compared with DeepLab). We fine-tuned and ran DeepLab on all of our test images and measured its pixel accuracy and IOU rate. For [Liu et al., 2017], we used the pre-trained model on only a subset of our test images, since it requires face pre-detection and fails on back and side views, and measured its pixel accuracy and IOU rate on that subset.

Appendix B Hair Directional Classifier

For training, the hair areas were manually divided into subregions based on their general growing trends. One of four directional labels was assigned to each subregion in the labeling stage. Hair accessories and hair occlusions were labeled as an undetermined region, and background pixels were labeled as background. We trained a modified VGG-16 network, as proposed in [Chai et al., 2016], on the hair regions to automatically predict the directional label of each pixel, using a multi-class formulation with 6 classes (4 directions, 1 background, 1 undetermined).

To train our hair directional classifier, we cropped and extracted the hair region, resized each image to a fixed resolution, and downsampled it with a bilinear filter. The output was then upsampled to the original image size with bilinear interpolation and refined with a CRF to obtain per-pixel labels. For training, we used the same set of images as for the segmenter and augmented the dataset by rotation, translation, and mirroring.

We implemented the classifier in the same environment as the segmenter and trained it with exponential learning-rate decay until convergence. We then ran the directional classifier on the same test set of images and measured its per-pixel direction accuracy.

Our classifier typically fails to generate a correct direction label for some small regions on the side of the face. However, in our pipeline, since we have video sequences, we can still get a correct direction from a different view. The availability of many views compensates for individual failure cases.