Towards Highly Accurate and Stable Face Alignment for High-Resolution Videos

11/01/2018 · by Ying Tai, et al. · Tencent, Michigan State University

In recent years, heatmap regression based models have shown their effectiveness in face alignment and pose estimation. However, Conventional Heatmap Regression (CHR) is neither accurate nor stable when dealing with high-resolution facial videos, since it finds the maximum activated location in heatmaps that are generated from rounded coordinates, and thus suffers quantization errors when scaling back to the original high-resolution space. In this paper, we propose a Fractional Heatmap Regression (FHR) for high-resolution video-based face alignment. The proposed FHR can accurately estimate the fractional part according to the 2D Gaussian function by sampling three points in heatmaps. To further stabilize the landmarks across continuous video frames while maintaining precision, we propose a novel stabilization loss that contains two terms to address the time delay and non-smooth issues, respectively. Experiments on the 300W, 300-VW and Talking Face datasets clearly demonstrate that the proposed method is more accurate and stable than the state-of-the-art models.


Introduction

Face alignment aims to estimate a set of facial landmarks given a face image or video sequence. It is a classic computer vision problem that has been tackled by many advanced machine learning algorithms Fan et al. (2018); Bulat and Tzimiropoulos (2017); Trigeorgis et al. (2016); Peng et al. (2015, 2016); Kowalski, Naruniec, and Trzcinski (2017); Chen et al. (2017); Liu et al. (2017); Hu et al. (2018). Nowadays, with the rapid development of consumer hardware (e.g., mobile phones, digital cameras), High-Resolution (HR) video sequences can be easily collected. Estimating facial landmarks on such high-resolution facial data has tremendous applications, e.g., face makeup Chen, Shen, and Jia (2017) and editing with special effects Korshunova et al. (2017) in live broadcast videos. However, most existing face alignment methods work on faces at medium image resolutions Chen et al. (2017); Bulat and Tzimiropoulos (2017); Peng et al. (2016); Liu et al. (2017). Therefore, developing face alignment algorithms for high-resolution videos is at the core of this paper.

To this end, we propose an accurate and stable algorithm for high-resolution video-based face alignment, named Fractional Heatmap Regression (FHR). It is well known that heatmap regression has shown its effectiveness in landmark estimation tasks Chen et al. (2017); Newell, Yang, and Deng (2016); Chen* et al. (2018). However, Conventional Heatmap Regression (CHR) is neither accurate nor stable when dealing with high-resolution facial images, since it finds the maximum activated location in heatmaps that are generated from rounded coordinates, and thus suffers quantization errors because the heatmap resolution is much lower than the input image resolution (Fig. 1) due to the scaling operation. To address this problem, we propose a novel transformation between heatmaps and coordinates, which not only preserves the fractional part when generating heatmaps from the coordinates, but also accurately estimates the fractional part according to the 2D Gaussian function by sampling three points in heatmaps.
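To make the quantization issue concrete, a few lines of arithmetic suffice; the image and heatmap resolutions below are illustrative assumptions, not values taken from the paper:

```python
# Conventional heatmap regression (CHR): a fractional landmark coordinate is
# rounded to an integer cell in the low-resolution heatmap, so mapping it back
# to the high-resolution image loses up to half a cell of precision.
img_size, hm_size = 1024, 64          # illustrative resolutions (assumptions)
scale = img_size / hm_size            # 16 image pixels per heatmap cell
x_img = 517.3                         # a fractional landmark in image space
x_hm = round(x_img / scale)           # CHR rounds to an integer heatmap location
x_back = x_hm * scale                 # scaled back to image coordinates
err = abs(x_back - x_img)             # quantization error, up to scale / 2 = 8 px
print(err)
```

Under these assumed resolutions the rounded coordinate lands 5.3 pixels away from the true landmark, and in the worst case the error reaches half a heatmap cell in image space.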


Figure 1: Comparisons between fractional heatmap regression and conventional heatmap regression. Our method differs from the conventional one in two aspects: 1) the ground truth heatmap for FHR maintains the precision of the fractional coordinate, while the conventional one discards it; and 2) three sampled points on the heatmap analytically determine the fractional peak location of the heatmap (Eq. 4), while the conventional one only finds the maximum activated location (Eq. 3), which loses the fractional part and thus leads to quantization error.

Using our proposed FHR, we can estimate more accurate landmarks compared to the conventional heatmap regression model, and achieve state-of-the-art performance on popular video benchmarks, the Talking Face FGNET (2014) and 300-VW Shen et al. (2017) datasets, compared to recent video-based face alignment models Liu et al. (2017); Peng et al. (2016). However, real-world applications such as face makeup in videos often demand extremely high stability, since the makeup jumps if the estimated landmarks oscillate between consecutive frames, which negatively impacts the user's experience. To make the sequential estimations as stable as possible, we further develop a novel stabilization algorithm on the landmarks estimated by FHR, which contains two terms, a regularization term and a temporal coherence term, to address two common difficulties: the time delay and non-smooth problems, respectively. Specifically, the regularization term combines the traditional Euclidean loss with a novel loss accounting for time delay; the temporal coherence term generalizes the temporal coherence loss in Cao, Hou, and Zhou (2014) to better handle nonlinear movements of facial landmarks.

In summary, the main contributions of this paper are:

  • A novel Fractional Heatmap Regression method for high-resolution video-based face alignment that leverages a 2D Gaussian prior to preserve the fractional part of landmark coordinates.

  • A novel stabilization algorithm that addresses time delay and non-smooth problems among continuous video frames is proposed.

  • State-of-the-art performance, both in accuracy and stability, on the 300W Sagonas et al. (2013), 300-VW Shen et al. (2017) and Talking Face FGNET (2014) benchmarks.

Related Work

Heatmap Regression

Heatmap regression is one of the most widely used approaches for landmark localization tasks, which estimates a set of heatmaps rather than coordinates. Stacked Hourglass Networks (SHN) are popular architectures for heatmap regression, with a symmetric topology that captures and consolidates information across all scales of the image. Newell et al. Newell, Yang, and Deng (2016) proposed SHN for 2D human pose estimation, achieving remarkable results even on very challenging datasets Andriluka et al. (2014). Building on the hourglass structure, Chu et al. Chu et al. (2017) introduced a multi-context attention mechanism into Convolutional Neural Networks (CNNs). Apart from applications in human pose estimation, there are also several heatmap regression based models for face alignment. Chen et al. Chen et al. (2017) proposed a structure-aware fully convolutional network to implicitly model the priors during training. Bulat et al. Bulat and Tzimiropoulos (2017) built a powerful CNN for face alignment based on the hourglass network and a hierarchical, parallel and multi-scale block.

However, all existing models drop the fractional part of the coordinates during the transformation between heatmaps and points, which introduces quantization errors on high-resolution facial images. In contrast, our proposed FHR accurately estimates the fractional part according to the 2D Gaussian function by sampling three points in heatmaps, and thus achieves more accurate alignment.

Video-based Face Alignment

Video-based face alignment estimates facial landmarks in video sequences Liu (2010). Early methods Black and Yacoob (1995); Shen et al. (2017) used incremental learning to predict landmarks on still frames in a tracking-by-detection manner. To address the issue that generic methods are sensitive to initialization, Peng et al. Peng et al. (2015) exploited incremental learning for personalized ensemble alignment, which samples multiple initial shapes to achieve image congealing within one frame. To explicitly model the temporal dependency Oh et al. (2015) of landmarks across frames, the authors Peng et al. (2016) further incorporated a sequence of spatial and temporal recurrences for sequential face alignment in videos. Recently, Liu et al. Liu et al. (2017) proposed a Two-Stream Transformer Networks (TSTN) approach, which captures the complementary information of both the spatial appearance on still frames and the temporal consistency across frames. Different from Peng et al. (2016); Liu et al. (2017), which require temporal landmark labels across frames, our proposed method achieves state-of-the-art accuracy only by making full use of the spatial appearance on still frames, which alleviates the problem that labeled sequential data are very limited.

Apart from the accuracy of landmarks, stability is also a key metric for video-based alignment. Typically, two terms Cao, Hou, and Zhou (2014) are adopted for stabilization, where a regularization term drives the optimized results to be more expressive, and a temporal coherence term drives the results to be more stable and smooth. However, the existing stabilization algorithm is sensitive to time delay and nonlinear movements. Our proposed algorithm takes these into account and is thus overall more robust.

Figure 2: (a) Histograms of the Euclidean loss (left) and our (right). and are ground truths of frame and , respectively. has the same Euclidean loss as . However, since lies on the line , it indicates a loss caused by time delay, which is more likely to happen in the stabilization process. Thus our model prefers to assign it a larger loss (equal to ). (b) Histograms of the loss in Cao, Hou, and Zhou (2014) (left) and our (right). and are stabilization outputs of frame and , respectively. has the same loss as in Cao, Hou, and Zhou (2014). However, since lies on the line , we argue that movement trajectory is more smooth than trajectory . Thus our model assigns it a smaller loss (equal to ). Note that the short axis of the ellipse in (a) is while the long axis of the ellipse in (b) is , thus the two terms are not in contradiction.

The Proposed Approach

In this section, we introduce the details of the proposed approach based on heatmap regression. A key component of heatmap regression is the transformation between heatmaps and coordinates. Specifically, before model training, a pre-processing step converts the coordinates to heatmaps, which are used as the ground truth labels. After estimating the heatmaps, a post-processing step obtains the coordinates from the estimated heatmaps. In this work, we propose a novel transformation between heatmaps and coordinates that differs from conventional heatmap regression and is demonstrated to be simple yet effective.

Fractional Heatmap Regression

As shown in Fig. 1, conventional heatmap regression generates the heatmaps from integral coordinates, even though the ground truth coordinates usually have fractional parts. As a result, it causes quantization errors when scaling back to the original image resolution, since the heatmaps are usually of much lower resolution than the input image. To address this problem, our proposed fractional heatmap regression generates ground truth heatmaps based on the intact ground truth coordinates (see Fig. 1) as follows:


H_i(x, y) = exp( -((x - x_i)^2 + (y - y_i)^2) / (2σ^2) ), (x, y) ∈ Ω, (1)

where (x, y) represents the coordinate, Ω is the domain of the heatmap H_i, σ denotes the standard deviation and (x_i, y_i) is the center of the 2D Gaussian in the i-th heatmap.

Denoting the input image as I and the deep alignment model as Φ, we can estimate the recovered heatmaps by

Ĥ = Φ(I), (2)

where Ĥ = {Ĥ_1, ..., Ĥ_N} and N is the number of landmarks. Given Eq. (1) and Ĥ_i, estimating the fractional coordinate (x_i, y_i) amounts to solving a binary quadratic equation, which has a closed-form solution as long as we can sample any three non-zero points from the heatmap. Specifically, we first obtain (x_0, y_0) as the location with the maximum likelihood, as in conventional heatmap regression:

(x_0, y_0) = argmax_{(x, y)} Ĥ_i(x, y). (3)

Conventional heatmap regression directly takes (x_0, y_0) as the output, which loses the fractional part. In our method, we further sample two more points near (x_0, y_0), e.g., (x_0 + 1, y_0) and (x_0, y_0 + 1) (in case (x_0, y_0) is located at the edge of the heatmap, we sample the points in the opposite directions). Letting p_0 = Ĥ_i(x_0, y_0), p_x = Ĥ_i(x_0 + 1, y_0) and p_y = Ĥ_i(x_0, y_0 + 1), we estimate the fractional coordinate as follows:

x_i = x_0 + 1/2 + σ^2 ln(p_x / p_0),  y_i = y_0 + 1/2 + σ^2 ln(p_y / p_0). (4)

It should be noted that our fractional heatmap regression is applicable to any heatmap based method. In this paper, we focus on face alignment, and adopt the stacked hourglass network Newell, Yang, and Deng (2016) as the alignment model, which minimizes the distance between the estimated heatmaps and the ground truth heatmaps across the entire training set.
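As a concrete illustration, here is a minimal NumPy sketch of the three-point estimation, assuming the heatmaps follow the 2D Gaussian form of Eq. (1); the σ value and the border handling are illustrative choices, not the paper's settings:

```python
import numpy as np

def gaussian_heatmap(cx, cy, h, w, sigma=1.5):
    """Ground-truth heatmap with a 2D Gaussian centered at fractional (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def fractional_peak(H, sigma=1.5):
    """Estimate the fractional peak of H by sampling three points (cf. Eq. 4)."""
    y0, x0 = np.unravel_index(np.argmax(H), H.shape)  # integer argmax (Eq. 3)
    # Sample neighbors toward the interior if the argmax sits on the border.
    dx = 1 if x0 + 1 < H.shape[1] else -1
    dy = 1 if y0 + 1 < H.shape[0] else -1
    p0, px, py = H[y0, x0], H[y0, x0 + dx], H[y0 + dy, x0]
    x = x0 + dx * (0.5 + sigma ** 2 * np.log(px / p0))
    y = y0 + dy * (0.5 + sigma ** 2 * np.log(py / p0))
    return float(x), float(y)
```

For a heatmap that is exactly a 2D Gaussian, the fractional peak is recovered up to floating-point precision, while the integer argmax alone is off by up to half a pixel in heatmap coordinates.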

Stabilization Algorithm

We now introduce our stabilization algorithm for video-based alignment, which takes the alignment results of all past frames as input, and outputs a more accurate and stable result for the current frame.

Stabilization Model

We denote the output of FHR at frame t as p_t, and the stabilization model as S, which has parameters to be optimized. S takes the past outputs as input, and produces the stabilized landmarks of frame t, denoted as s_t. Therefore we have

(5)

Assume there are M videos in the training set, and that the j-th video has T_j frames. For frame t in the j-th video, we denote its ground truth landmarks as g_t, the output of FHR as p_t, and the stabilized output as s_t.

Next, we present the specific form of S as well as its parameters. Our model follows a Bayesian framework. Specifically, we model the prior distribution of the stabilized landmarks s_t, given the past outputs, as a K-component Gaussian mixture:

(6)

where each component is a normal distribution with its own mean and covariance. We then model the likelihood of the observed output p_t given s_t as a Gaussian:

(7)

and use the Bayesian rule to obtain the most probable value of s_t:

(8)

Combining (6), (7) and (8), we can obtain the closed-form solution of (5). In practice, we fix K to a small value, since it already achieves satisfactory results and a larger K may cause both efficiency and overfitting issues. Moreover, to reflect the fact that an earlier frame has a decreasing correlation with the current frame as their temporal distance increases, we assume that

(9)

where the weights, along with two positive semi-definite matrices, are unknown model parameters. In practice, Eq. (9) can be calculated recursively, so its computational complexity remains constant as the number of frames increases.

To further reduce the number of parameters, we calculate the covariance matrix of all ground truth landmarks in the training set, denote its matrix of eigenvectors as U, and finally assume

(10)

where the remaining covariance factors are diagonal matrices. In summary, the model parameters are optimized with the loss function described in the next subsection.

Methods SDM (2013) CFAN (2014) CFSS (2015) MDM (2016) DAN (2017) GAN (2017) CHR (2016) FHR
NRMSE
AUC
Failure
Table 1: Comparisons of NRMSE, AUC and failure rate on the 300W test set.
Figure 3: (a) The effect of the stabilization loss for time delay; (b) the magnitude of the stability metric for FHR (green) and FHR+STA (red); (c) the orientation of the stability metric for FHR (green) and FHR+STA (red).

Loss Function Design

We now introduce and optimize a novel loss function so as to estimate the stabilization model parameters. Throughout this section, we regard the stabilized landmarks as functions of the model parameters. Our loss function is defined as follows:

(11)

This loss has two terms. The first term is the regularization loss, which drives the stabilized output to be close to the ground truth, and is defined as follows:

(12)

where † indicates the Moore-Penrose generalized inverse of a vector/matrix. The first term of (12) is the average Euclidean distance between the ground truth and the model output. The second term fits each stabilized output as a combination of past ground truths and estimates the expected delay coefficient. If this expectation is large, it means that 1) the stabilized output is more similar to the previous ground truth than to the current one, and 2) the model output has a significant time delay, which is undesirable. Since our stabilization model uses the alignment results of the past frames, avoiding time delay is a critical task. Therefore we emphasize the time delay loss as an individual term in the regularization loss (see Fig. 2).
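Since the exact form of the time-delay term is not reproduced here, the following is only a hedged sketch of the underlying idea: fit the stabilized output as a combination of the previous and current ground truths, and read the fitted coefficient as a delay indicator. The function name and the least-squares fit are our assumptions.

```python
import numpy as np

def delay_coefficient(s_t, g_prev, g_t):
    """Least-squares fit of s_t ≈ a * g_prev + (1 - a) * g_t.

    A clearly positive `a` means the stabilized output s_t is pulled toward
    the previous ground truth g_prev, i.e., the output lags behind.
    """
    d = g_prev - g_t
    return float(np.dot(s_t - g_t, d) / np.dot(d, d))
```

For example, an output sitting halfway between the previous and current ground truths yields a coefficient of 0.5, while an output equal to the current ground truth yields 0.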

The second term in (11) is the smooth loss, which encourages the stabilized output to be smooth, and is defined as follows:

(13)

The smooth loss can be seen as a trade-off between two stability losses, controlled by a weighting coefficient. At one extreme, it is equivalent to

(14)

which is the average distance from the current output to the line through its two neighboring outputs. At the other extreme, it is equivalent to

(15)

which is the average distance from the current output to the midpoint of the segment connecting its two neighboring outputs. The trade-off causes the loss contour of the smooth loss to be an ellipse (see Fig. 2), which we argue is a more reasonable indicator of the smoothness of the movement trajectory.
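The two extremes and their trade-off can be sketched as follows; the weight λ and the linear combination are illustrative assumptions, since the paper's exact weighting is not reproduced here:

```python
import numpy as np

def point_to_line_dist(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    ab = b - a
    t = np.dot(p - a, ab) / np.dot(ab, ab)
    return float(np.linalg.norm(p - (a + t * ab)))

def smooth_loss(s_prev, s_cur, s_next, lam=0.5):
    """Trade-off between a distance-to-line term and a distance-to-midpoint term."""
    d_line = point_to_line_dist(s_cur, s_prev, s_next)              # Eq. (14) style
    d_mid = float(np.linalg.norm(s_cur - (s_prev + s_next) / 2.0))  # Eq. (15) style
    return (1.0 - lam) * d_line + lam * d_mid
```

A point on the line but away from the midpoint is penalized only by the midpoint term, matching the intuition of Fig. 2 that motion along the trajectory line is smoother than motion off it.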

With the stabilization model and the loss function, we can estimate the model parameters using the standard Nelder-Mead optimization method Lagarias et al. (1998). The proposed stabilization algorithm is trained on various videos, and thus learns the statistics of different kinds of movements, making it more robust than the traditional stabilization model Cao, Hou, and Zhou (2014). The optimization process converges within hours on a conventional laptop.

Experiments

Experimental Setup

Datasets

We conduct extensive experiments on both image and video-based alignment datasets, including 300W Sagonas et al. (2013), 300-VW Shen et al. (2017) and Talking Face (TF) FGNET (2014). To test on the 300W private set, we follow Chen et al. (2017) and use the training images from the LFPW, HELEN and AFW datasets. To test on 300-VW, we follow Shen et al. (2017) and use part of the videos for training and the rest for testing. Specifically, the test videos are divided into three categories according to difficulty: well-lit (Scenario 1), mild unconstrained (Scenario 2) and challenging (Scenario 3). To test on the TF dataset, we follow Liu et al. (2017) and use the model trained on the training set of 300-VW.

Face scale Ave. inter-ocular dis. CHR/FHR (RMSE )
Very small / ( )
Small / ( )
Medium / ( )
Large / ( )
Very large / ( )
Table 2: RMSE comparisons across face scales on 300W.
Training Setting

Training faces are cropped using the detection bounding boxes and scaled to a fixed resolution. Following Chen et al. (2017), we augment the data (e.g., scaling, rotation) for more robustness to different face boxes. We use the stacked hourglass network Newell, Yang, and Deng (2016); Chen et al. (2017) as the alignment model. The network starts with a strided convolutional layer that reduces the resolution, followed by stacked hourglass modules. For evaluation, we adopt the standard Normalized Root Mean Squared Error (NRMSE), Area-under-the-Curve (AUC) and the failure rate to measure accuracy, and use the consistency between the movement of the landmarks and that of the ground truth as the metric to measure stability. We train the network with the Torch7 toolbox Collobert, Kavukcuoglu, and Farabet (2011), using the RMSprop algorithm. Training a fractional heatmap based hourglass model on 300W takes hours on a single GPU.

During the stabilization training, we set the loss weights to keep all terms in the stabilization loss (11) on the same order of magnitude. We estimate the average variance of the ground truth landmarks across all training videos and all landmarks, and use it to empirically set the initial values of the covariance parameters; the remaining parameters are initialized as zero matrices or constants.

Figure 4: Averaged heatmap distributions.
Table 3: NRMSE/stability comparisons on the 300-VW test set.
Methods FHR FHR+STA FHR+STA() FHR+STA() FHR+STA()
Scenario / / / / /
Scenario / / / / /
Scenario / / / / /

Ablation Study

Fractional vs. Conventional Heatmap Regression

We first compare our fractional heatmap regression with the conventional version Newell, Yang, and Deng (2016) and other state-of-the-art models Xiong and De la Torre (2013); Zhang et al. (2014); Zhu et al. (2015); Trigeorgis et al. (2016); Kowalski, Naruniec, and Trzcinski (2017); Chen et al. (2017) on the 300W test set, using the standard 300W training and test splits. Note that the stacked hourglass networks for our fractional method and the conventional one are identical. The only difference is the transformation between the heatmaps and the coordinates, where our method preserves the fractional part. As shown in Tab. 1, our method significantly outperforms the conventional version on NRMSE, and is also better than the state-of-the-art model Chen et al. (2017).

Moreover, the standard NRMSE cannot fully reflect the advantage of our FHR over CHR, since it eliminates scaling effects by dividing by the inter-ocular distance. To further demonstrate the effect of FHR, we calculate the RMSE without normalization. Specifically, we collect the inter-ocular distances of all images and evenly divide them into five groups w.r.t. face scale. Tab. 2 shows that the larger the scale, the bigger the gap between FHR and CHR. In particular, at the largest scale, FHR achieves a clear per-landmark improvement in pixels on average.

Stabilization Loss for Time Delay

Here we demonstrate the effectiveness of our proposed time delay term (i.e., the right part of Eq. 12). As shown in Fig. 3, compared to the fractional heatmap regression's output, the stabilized output is not only smoother, but also closer to the ground truth landmarks. Besides, when the time delay term is removed, the stabilized outputs lag behind the ground truth, while this phenomenon is largely suppressed by our proposed loss (Eq. 12).

Figure 5: CED curves on 300W.
Methods REDnet (2016) FHR FHR+STA
Scenario / / /
Scenario / / /
Scenario / / /
Table 4: NRMSE/stability comparisons with REDnet on the 300-VW test set.
Methods CFAN (2014) CFSS (2015) IFA (2014) REDnet (2016) TSTN (2017) CHR (2016) FHR FHR+STA
NRMSE
Table 5: NRMSE comparison with state-of-the-art methods on the Talking Face dataset.
Stabilization Loss for Smooth

Next, we evaluate the impact of each term in our loss function (11) on NRMSE and stability. We use the consistency between the cross-time movement of the landmarks and that of the ground truth as an indicator of stability. Specifically, for every video in the test set we compute the frame-to-frame movements of the predictions and of the ground truth, and calculate the average NRMSE between the two. Assuming that the ground truth is stable, a lower value indicates higher stability.
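The metric described above can be sketched as follows; the exact normalization used in the paper is not given, so dividing by the per-frame inter-ocular distance is our assumption:

```python
import numpy as np

def stability_metric(pred, gt, iod):
    """Average normalized deviation between predicted and GT frame-to-frame motion.

    pred, gt: (T, N, 2) landmark sequences; iod: (T,) inter-ocular distances.
    Lower is more stable, assuming the ground truth itself is stable.
    """
    dp = np.diff(pred, axis=0)             # predicted movement per frame pair
    dg = np.diff(gt, axis=0)               # ground-truth movement per frame pair
    err = np.linalg.norm(dp - dg, axis=2)  # (T-1, N) per-landmark deviation
    return float((err / iod[1:, None]).mean())
```

Note that a constant offset between prediction and ground truth does not hurt this metric, since it measures movement consistency rather than absolute accuracy.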

The comparison results are shown in Tab. 4. It can be seen that dropping the time delay term causes a higher NRMSE, and replacing the smooth loss with the one in Cao, Hou, and Zhou (2014) causes a higher stability loss. Our proposed method achieves a good balance between accuracy and stability.

Comparisons with State of the Arts

We now compare our methods FHR and FHR+STA (i.e., the stabilized version) with state-of-the-art models Peng et al. (2016); Liu et al. (2017); Kowalski, Naruniec, and Trzcinski (2017); Zhang et al. (2017) on two video datasets: 300-VW and Talking Face. The comparison adopts the two popular landmark settings used in prior works.

Methods TSCN (2016) CFSS (2015) TCDCN (2016) TSTN (2017) CHR (2016) FHR FHR+STA
Scenario
Scenario
Scenario
Table 6: NRMSE comparison with state-of-the-art methods on the 300-VW test set.
Comparison with Landmarks

First, we evaluate our method on the 300-VW Shen et al. (2017) dataset and compare with REDnet Peng et al. (2016), using the code released by the authors. Tab. 4 shows the results on the three test sets. Our proposed method achieves much better performance than REDnet in all cases. Especially in the hardest scenario, our method achieves a large improvement, which shows the robustness of our method to large facial variations.

Then, we evaluate our method on the Talking Face FGNET (2014) dataset against state-of-the-art models, such as CFAN Zhang et al. (2014), CFSS Zhu et al. (2015), IFA Asthana et al. (2014), REDnet Peng et al. (2016) and TSTN Liu et al. (2017). Although the annotations of the TF dataset have the same number of landmarks as the 300-VW dataset, the definitions of the landmarks are different. Therefore, following the setting in Liu et al. (2017); Peng et al. (2016), we use a common subset of landmarks for fair comparison. The results are shown in Tab. 5, in which the performance of Zhang et al. (2014); Zhu et al. (2015); Asthana et al. (2014) is directly cited from Peng et al. (2016); Liu et al. (2017). Since the images in the TF set are collected in a controlled environment with small facial variations, all of the methods achieve relatively small errors, and our proposed method is still the best.

Comparison with Landmarks

Next, we evaluate our method on the 300-VW Shen et al. (2017) dataset under the second landmark setting. The comparison methods include TSCN Simonyan and Zisserman (2014), CFSS Zhu et al. (2015), TCDCN Zhang et al. (2017) and TSTN Liu et al. (2017). We cite the results of Simonyan and Zisserman (2014); Zhu et al. (2015); Zhang et al. (2017) from Liu et al. (2017), and list the performance in Tab. 6. As we can see, our proposed FHR achieves the best NRMSE in all scenarios, and our stabilized version FHR+STA further improves the performance, especially in the most challenging scenario.

We then illustrate the Cumulative Error Distribution (CED) curves of FHR, CHR and some state-of-the-art methods on 300W in Fig. 5, where the gap between FHR and CHR is competitive with the gaps among prior top-tier works (e.g., between CFSS and MDM). Fig. 5 also shows that our FHR contributes more on relatively easy samples, which makes sense since the insight of FHR is to find a more precise location near a coarse but correct coordinate, whose heatmap output may accurately model the distribution of the Ground Truth (GT). To demonstrate this, we collect predicted heatmaps from hard and easy samples, and show their averaged heatmap distributions in Fig. 4 by fixing the centers at the same position. The easy samples' heatmaps better resemble the Gaussian distribution in the GT, where FHR improves the most, while hard samples resemble it less and thus FHR contributes little.

Figure 6: NRMSE/stability comparisons with four baselines on 300-VW. Dashed lines indicate the difference between our STA method and the closest competitor in NRMSE and stability improvements, respectively. The black arrow and pentagrams further illustrate the parameter sensitivity of the trade-off weight on Scenario 3, indicating that it moderately adjusts our stabilization model between accuracy and stability.
Comparison on Stabilization

We further compare the stabilization of our method and REDnet Peng et al. (2016) on 300-VW Shen et al. (2017). As shown in Tab. 4, our FHR is much more stable than REDnet according to the metric described in the Ablation Study. To visualize the stabilization improvement, we compute the stability metric for each landmark estimated by FHR and by our proposed FHR+STA, and plot the magnitudes in Fig. 3. Fig. 3 also plots the difference in the orientation of the stability metric between the two methods. We provide videos on our project website, which effectively demonstrate the stability and superiority of our method. In some continuous frames, our stabilized landmarks are even more stable than the ground truth annotations.

In addition, Fig. 6 shows that our stabilization model (i.e., STA) significantly outperforms the other four baselines in both NRMSE and stability, where all methods take the same output from FHR as input. Especially in the most challenging Scenario 3, where many complex movements exist, our method is much better than the closest competitor (i.e., moving average), with clear improvements in both NRMSE and stability. The reasons are: 1) the moving average filter and first-order smoother may cause a serious time delay problem; 2) although the second-order and constant-speed methods can handle time delay, they cannot handle multiple movement types (e.g., blinking and turning the head). In contrast, our algorithm can effectively address the time delay issue and multiple movement types through the Gaussian mixture setting, and hence is more precise and stable. Fig. 7 shows the stability comparison with the second-order stabilization method, which is chosen as a strong baseline considering its stability performance. As we can see, our method significantly outperforms the second-order method when handling complex movements, and also handles time delay better, staying very close to the GT.

Figure 7: Stability comparisons with the second-order stabilization method on the time delay (left) and complex movement (right) issues.
Time Complexity

Note that our fractional heatmap regression does not impose any additional complexity during training compared to conventional heatmap regression. For inference, our method provides a closed-form solution to estimate the coordinates from the heatmaps as in Eq. (4), whose runtime is negligible. Besides, after the parameters of our stabilization model are learned, our stabilization algorithm takes only seconds to process the entire test videos of 300-VW, i.e., a negligible per-image cost.

Conclusions

In this paper, a novel Fractional Heatmap Regression (FHR) is proposed for high-resolution video-based face alignment. The main contribution of FHR is that we leverage a 2D Gaussian prior to accurately estimate the fractional part of the coordinates, which is ignored in conventional heatmap regression based methods. To further stabilize the landmarks across video frames, we propose a novel stabilization model that addresses the time-delay and non-smooth issues. Extensive experiments on popular benchmarks demonstrate that our proposed method is more accurate and stable than the state of the arts. Beyond facial landmark estimation, the proposed FHR has the potential to be plugged into any existing heatmap based system (e.g., for human pose estimation) to boost accuracy.

References

  • Andriluka et al. (2014) Andriluka, M.; Pishchulin, L.; Gehler, P.; and Schiele, B. 2014. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Asthana et al. (2014) Asthana, A.; Zafeiriou, S.; Cheng, S.; and Pantic, M. 2014. Incremental face alignment in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Black and Yacoob (1995) Black, M. J., and Yacoob, Y. 1995. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Bulat and Tzimiropoulos (2017) Bulat, A., and Tzimiropoulos, G. 2017. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Cao, Hou, and Zhou (2014) Cao, C.; Hou, Q.; and Zhou, K. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. In ACM Transactions on Graphics (SIGGRAPH).
  • Chen et al. (2017) Chen, Y.; Shen, C.; Wei, X.-S.; Liu, L.; and Yang, J. 2017. Adversarial learning of structure-aware fully convolutional networks for landmark localization. arXiv:1711.00253.
  • Chen* et al. (2018) Chen*, Y.; Tai*, Y.; Liu, X.; Shen, C.; and Yang, J. 2018. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chen, Shen, and Jia (2017) Chen, Y.; Shen, X.; and Jia, J. 2017. Makeup-go: Blind reversion of portrait edit. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Chu et al. (2017) Chu, X.; Ouyang, W.; Li, H.; and Wang, X. 2017. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Collobert, Kavukcuoglu, and Farabet (2011) Collobert, R.; Kavukcuoglu, K.; and Farabet, C. 2011. Torch7: A matlab-like environment for machine learning. In NIPS Workshop.
  • Fan et al. (2018) Fan, X.; Liu, R.; Kang, H.; Feng, Y.; and Luo, Z. 2018. Self-reinforced cascaded regression for face alignment. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • FGNET (2014) FGNET. 2014. Talking face video. Technical Report, online.
  • Hu et al. (2018) Hu, T.; Qi, H.; Xu, J.; and Huang, Q. 2018. Facial landmarks detection by self-iterative regression based landmarks-attention network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • Korshunova et al. (2017) Korshunova, I.; Shi, W.; Dambre, J.; and Theis, L. 2017. Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Kowalski, Naruniec, and Trzcinski (2017) Kowalski, M.; Naruniec, J.; and Trzcinski, T. 2017. Deep alignment network: A convolutional neural network for robust face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW).
  • Lagarias et al. (1998) Lagarias, J. C.; Reeds, J. A.; Wright, M. H.; and Wright, P. E. 1998. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization.
  • Liu et al. (2017) Liu, H.; Lu, J.; Feng, J.; and Zhou, J. 2017. Two-stream transformer networks for video-based face alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Liu (2010) Liu, X. 2010. Video-based face model fitting using adaptive active appearance model. Image and Vision Computing 28(7):1162–1172.
  • Newell, Yang, and Deng (2016) Newell, A.; Yang, K.; and Deng, J. 2016. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Oh et al. (2015) Oh, J.; Guo, X.; Lee, H.; Lewis, R.; and Singh, S. 2015. Action-conditional video prediction using deep networks in atari games. In Proceedings of the Advances in Neural Information Processing Systems (NIPS).
  • Peng et al. (2015) Peng, X.; Zhang, S.; Yang, Y.; and Metaxas, D. N. 2015. Piefa: personalized incremental and ensemble face alignment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Peng et al. (2016) Peng, X.; Feris, R. S.; Wang, X.; and Metaxas, D. N. 2016. A recurrent encoder-decoder network for sequential face alignment. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Sagonas et al. (2013) Sagonas, C.; Tzimiropoulos, G.; Zafeiriou, S.; and Pantic, M. 2013. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW).
  • Shen et al. (2017) Shen, J.; Zafeiriou, S.; Chrysos, G. G.; Kossaifi, J.; Tzimiropoulos, G.; and Pantic, M. 2017. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCVW).
  • Simonyan and Zisserman (2014) Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems (NIPS).
  • Trigeorgis et al. (2016) Trigeorgis, G.; Snape, P.; Nicolaou, M. A.; Antonakos, E.; and Zafeiriou, S. 2016. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xiong and De la Torre (2013) Xiong, X., and De la Torre, F. 2013. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang et al. (2014) Zhang, J.; Shan, S.; Kan, M.; and Chen, X. 2014. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Zhang et al. (2017) Zhang, Z.; Luo, P.; Loy, C. C.; and Tang, X. 2017. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(5):918–930.
  • Zhu et al. (2015) Zhu, S.; Li, C.; Loy, C. C.; and Tang, X. 2015. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).