Laplace Landmark Localization

by   Joseph P. Robinson, et al.
Snap Inc.

Landmark localization in images and videos is a classic problem solved in various ways. Nowadays, with deep networks prevailing throughout machine learning, there is renewed interest in pushing facial landmark detection technology to handle more challenging data. Most efforts use network objectives based on L1 or L2 norms, which have several disadvantages. First, landmark locations are determined from generated heatmaps (i.e., confidence maps), and the predicted locations (i.e., the means) are penalized without accounting for the spread: high scatter corresponds to low confidence and vice-versa. To address this, we introduce a LaplaceKL objective that penalizes low confidence. A second issue is the dependency on labeled data, which is expensive to obtain and susceptible to error. To address both issues, we propose an adversarial training framework that leverages unlabeled data to improve model performance. Our method claims state-of-the-art on all of the 300W benchmarks and ranks second-to-best on the Annotated Facial Landmarks in the Wild (AFLW) dataset. Furthermore, our model remains robust at a reduced size: with 1/8 the number of channels (i.e., 0.0398MB), it performs comparably to state-of-the-art in real-time on CPU. Thus, our method is of high practical value for real-life applications.




1 Introduction

Landmark localization is a computer vision problem of finding pixel locations in visual media that correspond to points of interest. In the case of face alignment, these points correspond to face parts. For bodies and hands, landmarks correspond to the projections of joints onto the camera plane. Historically, problems of landmark detection and shape analysis date back decades: from Active Shape Models [4] to Active Appearance Models [3], with the latter proposed to analyze and detect facial landmarks.

Figure 1: Heatmaps generated by models based on softargmax (middle) and the proposed LaplaceKL (right). These confidence mappings are probabilities that a pixel is a landmark. The softargmax-based methods generate highly scattered mappings (low certainty), while the same network trained with our loss is concentrated (high certainty). Each block has two columns: heatmaps superimposed on the input images (left) and a zoomed-in view of the eye region (right). We further validate the importance of minimizing scatter experimentally (Table 2).

A renewed interest in facial landmark localization was triggered by the need for more advanced models to handle difficult views, poses, and variations in data as seen in-the-wild. From this came a wave of deep neural architectures that pushed state-of-the-art on more challenging datasets. These modern-day networks are trained end-to-end on paired training data $(I, p)$, where $I$ is the image and $p$ are the true landmark coordinates. Many of these works used encoder-decoder style networks to generate feature maps (heatmaps) that are transformed to pixel coordinates. The network must be entirely differentiable to train end-to-end. Hence, the layer (or operation) responsible for transforming the heatmaps to pixel coordinates must be differentiable. Note that each heatmap corresponds to the coordinates of one landmark. The operation typically used, softargmax, determines the location of a landmark as the expectation over the generated 2D heatmap. Thus, metrics like $L_1$ or $L_2$ determine the distance $\|p - \hat{p}\|$ from the true location, where $\hat{p}$ are the predicted coordinates.

There are two critical shortcomings of the above methodology. (1) These losses only penalize differences in mean values in coordinate space; the variance or scale of the heatmaps is not explicitly penalized. Thus, the generated heatmaps are highly scattered: a large variance means low confidence. (2) This family of objectives depends entirely on paired training samples $(I, p)$. However, obtaining quality data is expensive and challenging. Not only does each sample require several marks, but unintentional, and often unavoidable, label noise results from pixel-level marks being subject to human error (inaccurate and imprecise ground-truth). All the while, plenty of unlabeled face data are available for free.

Our first contribution alleviates the first issue. We introduce a new loss function that penalizes the difference in distributions defined by location and scatter (see Figure 1). Specifically, we assume landmarks are random variables with Laplace distributions, from which the KL-divergence between the predicted and ground-truth distributions defines the loss. Thus, the aim is to match distributions parameterized by both the mean and the variance, yielding heatmaps of less scatter (higher confidence). We call this objective the LaplaceKL loss.

Our second contribution is an adversarial training framework for landmark localization. This addresses the labeled-data requirement by leveraging unlabeled data available for free. We treat our landmark detection network as a generator of normalized heatmaps (probability maps), which are fed to a discriminator whose objective is to distinguish between true and generated heatmaps. This allows large amounts of unlabeled data to boost the performance of our LaplaceKL-based models. In the end, the discriminator further boosts the predictive power of the LaplaceKL-based generators by injecting unlabeled data into the training set. As shown in the experiments, the proposed adversarial training framework complements our LaplaceKL loss (more unlabeled data results in less prediction error). In other words, we first demonstrate the effectiveness of the proposed loss by obtaining state-of-the-art without adversarial training. Then, we further boost performance by adding unlabeled data during training.

Furthermore, we reduced the model size to as little as 1/16 and 1/8 of the original number of convolution filters, with the smallest costing only 79Kb on disk. We show that the drop in accuracy for models trained with the proposed LaplaceKL-divergence (LaplaceKL) is far less than for other models (softargmax-based loss). All the while, adding unlabeled training data further reduces this drop in performance. Importantly, models no smaller than 1/8 of the original size still compare with the existing state-of-the-art performance. We argue that the proposed contributions are instrumental for training landmark detection models for real-time production and for mobile devices.

Our contributions are three-fold: (1) a novel Laplace KL-divergence objective for landmark localization that makes models more certain about predictions; (2) an adversarial training framework that leverages a large amount of unlabeled data during training; (3) experimental evaluation showing that our model outperforms recent works in face landmark detection and remains comparable to state-of-the-art at 1/8 the model size (160Kb).

2 Related work

In this section, we review relevant works on landmark localization and generative adversarial networks.

Landmark localization has been of interest to researchers for decades. At first, most methods were based on Active Shape Models [4] and Active Appearance Models [3]. Then, Cascaded Regression Methods (CRMs) were introduced, which operate in a sequential fashion: starting with the average shape, then incrementally shifting it closer to the target shape. CRMs offer high speed and accuracy (1,000 fps on CPU [24, 18]).

More recently, deep-learning-based approaches have prevailed in the community due to end-to-end learning and improved accuracy. Initial works mimicked the iterative nature of cascaded methods using recurrent convolutional neural networks [22, 29, 33, 34]. In addition, several methods for dense landmark localization [11, 17] and 3D face alignment [30, 42] have been proposed, all of which are fully supervised and, thus, require labels for each image.

Figure 2: The proposed semi-supervised framework for landmark localization. We denote the labeled branch with blue arrows and the unlabeled branch with red arrows. Given an input image, the generator produces heatmaps in which each landmark is designated a separate channel of the output. Labels are used to generate the real heatmaps. Fake samples are produced by the generator from unlabeled data. Source images are concatenated with the generated heatmaps and passed to the discriminator.

Nowadays, there is increasing interest in semi-supervised methods for landmark localization. Recent work used a sequential multitasking method capable of injecting labels of two types into the training pipeline, with one type constituting the annotated landmarks and the other consisting of facial expressions (or hand gestures) [14]. The authors argued that the latter label type is more easily obtainable, and showed the benefits of using both types of annotations by claiming state-of-the-art on several tasks. Additionally, they explored other semi-supervised techniques (e.g., an equivariance loss). In [7], a supervision-by-registration method was proposed, which utilized unlabeled videos to train a landmark detector. The key assumption was that detected landmarks in neighboring frames should be consistent with the optical flow computed between the frames. This approach demonstrated a more stable detector for videos, as well as improved accuracy on public benchmarks.

Landmark localization datasets and benchmarks have significantly evolved as well. The 68-point mark-up scheme of the MultiPIE dataset [10] has been widely adopted. Despite the initial excitement for MultiPIE throughout the landmark localization community [43], it is now considered an easy dataset, captured entirely in a controlled lab setting. A more challenging dataset, AFLW [19], was then released, with up to 21 facial landmarks per face (occluded or “invisible” landmarks were not marked). Finally came the 300W dataset, made up of face images from the internet, labeled with the same 68-point mark-up scheme as MultiPIE and promoted as a data challenge [25]. Currently, 300W is among the most widely used benchmarks for facial landmark localization. In addition to 2D datasets, the community has created several datasets annotated with 3D keypoints [1].

GANs were recently introduced [9] and quickly became popular in both research and practice. GANs have been used to generate images [23] and videos [26, 31], and to do image manipulation [8], text-to-image [37], image-to-image [40], and video-to-video [32] translation, and re-targeting [28].

An interesting feature of GANs is the ability to transfer images and videos across different domains. Thus, GANs have been adopted in various semi-supervised and domain-adaptation tasks. Many have leveraged synthetic data to improve model performance on real data. For example, a GAN transferred images of human eyes across domains to bootstrap training data [27]. Other researchers used a neural network to make synthetically generated images of outdoor scenes more photo-realistic, which were also used to improve performance on image segmentation [12]. Sometimes, labeling images captured in a controlled setting is more manageable than in an uncontrolled setting. For instance, 2D body-pose annotations are available for images in-the-wild, while 3D annotations mostly exist for images captured in a lab setting. Therefore, images with 3D annotations were used in adversarial training to predict 3D human body poses in images in-the-wild [36].

Our work differs from these in several ways. Firstly, a majority, if not all, used a training objective that only accounts for the location of the landmarks [14, 29, 34]; no variance or spread of the prediction was considered. In other words, it has been assumed that the distribution of the landmarks is describable with a single parameter (the mean value). Networks trained this way yield uncertainty about the prediction, while still providing a reasonable location estimate. To mitigate this, we explicitly parameterize the distribution of landmarks using location and scale. For this, we propose a KL-divergence-based loss to train the network end-to-end. Secondly, previous works used GANs for domain adaptation in some fashion. In this work, we neither perform any adaptation between domains as in [12, 27], nor use any additional training labels as in [14]. Specifically, we have the discriminator assess the quality of the predicted heatmaps for a given image. The resulting gradients are used to improve the generator's ability to detect landmarks. We show that both contributions improve accuracy when used separately. Then, we combine them to claim state-of-the-art on the renowned 300W dataset.

3 Method

Our training framework utilizes both labeled and unlabeled data during training. High-level graphical representations of the cases where labels are available (blue arrows) and unavailable (red arrows) are shown in Figure 2. Thus, our framework has two branches, supervised (Eq. 3) and unsupervised (Eq. 7), where only the supervised branch (blue arrows) uses labels to train. We describe both cases in detail in the following sections.

3.1 Fully Supervised Branch

We denote the joint distribution of the image $I$ and the landmarks $p = (p_1, \ldots, p_K)$ as $P(I, p)$, where $K$ is the total number of landmarks. The form of the distribution is unknown; however, joint samples $(I, p)$ are available when labels are present. During training, we aim to learn a conditional distribution $Q(p \mid I; \theta)$ modeled by a neural network with parameters $\theta$. Landmark detection is then done by sampling $\hat{p} \sim Q(p \mid I)$. Parameters $\theta$ are left out of the notation for brevity. Parameter values can be found by maximizing the likelihood that the process described by the model produced the data actually observed, i.e., by minimizing the following loss function w.r.t. its parameters:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(I, p) \sim P}\left[\log Q(p \mid I; \theta)\right]$$

Alternatively, it is possible to train a neural network to predict normalized probability maps (heatmaps) $H_1, \ldots, H_K$, where each $H_k$ represents a normalized probability map for landmark $k$, with $\sum_{\mathbf{x}} H_k[\mathbf{x}] = 1$. To get the pixel locations, one could perform the argmax operation over the heatmaps by setting $\hat{p}_k = \arg\max_{\mathbf{x}} H_k[\mathbf{x}]$. However, this operation is not differentiable. Therefore, it cannot be used for end-to-end training.

Recently, [14] used a differentiable variant of argmax known as softargmax [2]. For the 1D case, the softargmax operation writes as:

$$\text{softargmax}(h) = \sum_i \frac{\exp(\beta h_i)}{Z}\, i, \qquad Z = \sum_j \exp(\beta h_j)$$

where $h_i$ is the predicted probability mass at location $i$, $Z$ is the normalization factor, and $\beta$ is the temperature factor controlling the predicted distribution [2]. We denote a coordinate in boldface ($\mathbf{x}$), and write the 2D softargmax operation as $\hat{\mathbf{x}} = \sum_{\mathbf{x}} \frac{\exp(\beta H[\mathbf{x}])}{Z}\, \mathbf{x}$, with $Z = \sum_{\mathbf{x}} \exp(\beta H[\mathbf{x}])$.

Essentially, the softargmax operation is the expectation of the pixel coordinate over the selected dimension. Hence, a softargmax-based loss assumes the underlying distribution can be described by just its mean (location). Regardless of how certain a prediction is, the objective then only lines up mean values. To avoid cases in which the trained model is uncertain about the predicted mean while still yielding a low error, we parameterize the distribution using $(\mu, \sigma)$, where $\mu$ is the mean (location) and $\sigma$ is the variance or scale parameter of the selected distribution.
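The 2D softargmax described above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's PyTorch implementation; function and variable names are our own):

```python
import numpy as np

def softargmax_2d(heatmap, beta=1.0):
    """Differentiable 2D soft-argmax: the expected (x, y) coordinate
    under softmax(beta * heatmap). A sketch of the operation in the text."""
    h, w = heatmap.shape
    # Temperature-scaled softmax over all pixels (the normalization factor Z).
    z = np.exp(beta * (heatmap - heatmap.max()))  # shift max for stability
    probs = z / z.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Expectation of the pixel coordinate over the heatmap.
    return float((probs * xs).sum()), float((probs * ys).sum())
```

With a high temperature and a single sharp peak, the expectation collapses onto the peak; with a flat heatmap it returns the image center, which illustrates why the mean alone says nothing about confidence.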

We would like the model to be certain about its predictions (a small variance or scale). We consider two parametric distributions: $\text{Laplace}(\mu, b)$ and $\mathcal{N}(\mu, \sigma^2)$. We define a function to compute the scale (or variance) of the predicted heatmaps using the location, where the location is found as the expectation over the heatmap space: $\mu_k = \sum_{\mathbf{x}} H_k[\mathbf{x}]\, \mathbf{x}$, with scale $b_k = \sum_{\mathbf{x}} H_k[\mathbf{x}]\, \|\mathbf{x} - \mu_k\|_1$ for the Laplacian and $\sigma_k^2 = \sum_{\mathbf{x}} H_k[\mathbf{x}]\, \|\mathbf{x} - \mu_k\|_2^2$ for the Gaussian. Thus, $\mu_k$ and $b_k$ (or $\sigma_k$) are used to parameterize a Laplace (or Gaussian) distribution for the predicted landmarks.
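The location and Laplace scale of a single heatmap can be computed as follows (a sketch under the definitions above; the axis convention and helper name are our own):

```python
import numpy as np

def laplace_params_from_heatmap(hm):
    """Location (expectation over the heatmap) and Laplace scale b,
    computed as the expected L1 deviation from the location."""
    h, w = hm.shape
    p = hm / hm.sum()                      # normalize to a probability map
    ys, xs = np.mgrid[0:h, 0:w]
    mx, my = (p * xs).sum(), (p * ys).sum()  # expected (x, y)
    b = (p * (np.abs(xs - mx) + np.abs(ys - my))).sum()
    return (mx, my), b
```

A heatmap concentrated on one pixel gives scale near zero (high confidence); a flat heatmap gives a large scale (low confidence), which is exactly the quantity the LaplaceKL loss penalizes.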


Data: labeled set, unlabeled set
initialize network parameters
while not converged do
    sample mini-batch from labeled data
    sample mini-batch from unlabeled data
    compute loss using Eq. 2 or Eq. 3
    update model parameters
end while
Algorithm 1: Training the proposed model.

Denoting the true conditional distribution of the landmarks as $P(p \mid I)$, we define the objective as follows:

$$\mathcal{L}_{KL}(\theta) = \mathbb{E}\left[ D_{KL}\!\left( P(p \mid I) \,\|\, Q(p \mid I; \theta) \right) \right]$$

where $D_{KL}$ is the KL-divergence. We assume a true distribution $\mathcal{N}(p^{gt}, 1)$ for the Gaussian case, where $p^{gt}$ are the ground-truth locations of the landmarks. For the Laplace case, we use $\text{Laplace}(p^{gt}, 1)$. The KL-divergence conveniently has a closed-form solution for this family of exponential distributions [13]. Alternatively, it can be approximated by sampling. The labeled branch of the framework is represented with blue arrows in Figure 2.
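As one concrete instance of the closed form cited from [13], the KL-divergence between two univariate Laplace distributions can be written (and applied per coordinate) as follows; this sketch is ours, not the paper's code:

```python
import math

def kl_laplace(mu1, b1, mu2, b2):
    """Closed-form KL(Laplace(mu1, b1) || Laplace(mu2, b2)):
    log(b2/b1) + (b1 * exp(-|mu1 - mu2| / b1) + |mu1 - mu2|) / b2 - 1."""
    d = abs(mu1 - mu2)
    return math.log(b2 / b1) + (b1 * math.exp(-d / b1) + d) / b2 - 1.0
```

With the ground-truth scale fixed at 1 as in the text, the divergence shrinks only when the predicted location matches the ground-truth and the predicted scale approaches 1; a widely scattered heatmap (large predicted scale) is penalized even if its mean is correct.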

Statistically speaking, given two estimators with different variances, we would prefer the one with the smaller variance (see [6] for an analysis of the bias-variance trade-off). A lower variance implies higher confidence in the prediction. To this end, we found that an objective measuring distances between distributions is accurate and robust. The neural network must satisfy an extra constraint on variance and, thus, yields predictions of higher certainty. See the more confident heatmaps in Figures 1 and 3. Experimental evaluation further validates this (Tables 2 and 3), and sample results are shown in Figure 4.

3.2 Unsupervised Branch

The previous section discussed several objectives to train the neural network with the available paired, fully labeled data $(I^l, p^l)$. We denote these samples with the superscript $l$ to distinguish them from unpaired, unlabeled data $I^u$. In general, it is difficult for a human to label many images with landmarks. Hence, unlabeled data is more common and easier to obtain, which calls for capitalizing on this abundant data to improve training. To do so, we adapt the adversarial learning framework for landmark localization. We treat our landmark-predicting network as a generator $G$. The discriminator takes the form $D(I \oplus H)$, where $\oplus$ is a tensor concatenation operation. We define the real samples for the discriminator as $I^l \oplus T(p^l)$, where $T$ generates the true heatmaps given the ground-truth landmark locations. Fake samples are given by $I^u \oplus G(I^u)$. With this notation, we define the min-max objective for landmark localization as:

$$\min_G \max_D \; \mathbb{E}\left[\log D\!\left(I^l \oplus T(p^l)\right)\right] + \mathbb{E}\left[\log\left(1 - D\!\left(I^u \oplus G(I^u)\right)\right)\right]$$

where the generator's adversarial loss $\mathcal{L}_{adv}$ writes as:

$$\mathcal{L}_{adv}(\theta) = -\mathbb{E}\left[\log D\!\left(I^u \oplus G(I^u)\right)\right]$$
In this setting, provided an input image, the goal of the discriminator is to learn to tell the difference between real and fake heatmaps from appearance. In turn, the goal of the generator is to produce fake heatmaps that are indistinguishable from the real ones. Within this framework, the discriminator provides additional guidance to the generator by learning from both labeled and unlabeled data. The objective in Eq. 4 is solved using alternating updates.
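For concreteness, the two alternating updates can be sketched with the standard cross-entropy losses (a minimal NumPy illustration of the min-max objective; the non-saturating generator form shown here is a common choice and an assumption on our part):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator step: maximize log D(real) + log(1 - D(fake)),
    i.e., minimize the negative of the min-max objective w.r.t. D."""
    return float(-(np.log(d_real) + np.log(1.0 - d_fake)).mean())

def g_adv_loss(d_fake):
    """Generator step (non-saturating variant): maximize log D(fake),
    so fooling the discriminator lowers the generator's loss."""
    return float(-np.log(d_fake).mean())
```

Here `d_real` and `d_fake` stand for the discriminator's outputs on real pairs $I^l \oplus T(p^l)$ and fake pairs $I^u \oplus G(I^u)$; only the fake branch touches unlabeled data, which is how the unsupervised branch contributes gradients.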

3.3 Training

We fuse the softargmax-based and adversarial losses as:

$$\mathcal{L}(\theta) = \mathcal{L}_{softargmax}(\theta) + \lambda \mathcal{L}_{adv}(\theta)$$

with the KL-divergence version of the objective defined as:

$$\mathcal{L}(\theta) = \mathcal{L}_{KL}(\theta) + \lambda \mathcal{L}_{adv}(\theta)$$

where $\lambda$ weights the adversarial loss. This training objective includes both labeled and unlabeled data in its formulation. In the experiments, we show that this combination significantly improves the accuracy of our approach. We also argue that the softargmax-based version cannot fully utilize the unlabeled data, since its predicted heatmaps differ too much from the real heatmaps. The training procedure of the proposed model is given in Algorithm 1. The unlabeled branch of the framework is shown graphically with red arrows in Figure 2.

Layers                                      Tensor Size
Input: RGB image, no data augmentation      80 × 80 × 3
Conv: 3 × 3 × 64, LReLU, DROP, MAX          40 × 40 × 64
Conv: 3 × 3 × 64, LReLU, DROP, MAX          20 × 20 × 64
Conv: 3 × 3 × 64, LReLU, DROP, MAX          10 × 10 × 64
Conv: 3 × 3 × 64, LReLU, DROP, MAX          5 × 5 × 64
Conv: 1 × 1 × 64, LReLU, DROP, UP           10 × 10 × 128
Conv: 5 × 5 × 128, LReLU                    20 × 20 × 128
Conv: 1 × 1 × 64, LReLU, DROP, UP           20 × 20 × 128
Conv: 5 × 5 × 128, LReLU, DROP              40 × 40 × 128
Conv: 1 × 1 × 64, LReLU, DROP, UP           40 × 40 × 128
Conv: 5 × 5 × 128, LReLU, DROP              80 × 80 × 128
Conv: 1 × 1 × 64, LReLU, DROP, UP           80 × 80 × 128
Conv: 5 × 5 × 128, LReLU, DROP              80 × 80 × 128
Conv: 1 × 1 × 68, LReLU, DROP               80 × 80 × 68
Output: 1 × 1 × 68                          80 × 80 × 68

Table 1: Generator architecture. Layers are listed with the size and number of filters; DROP, MAX, and UP stand for dropout (probability 0.2), max-pooling (stride 2), and bilinear upsampling (2×). Note the skip connections about the bottleneck: coarse-to-fine, connecting encoder to decoder by concatenating feature channels before fusion via fully-connected layers. Thus, feature dimensions and the number of feature maps are preserved at all but the two topmost layers (the layers that transform feature maps to heatmaps). A stride of 1, with padding to produce the output size listed, was used throughout.

3.4 Implementation

We follow the ReCombinator Network (RCN) initially proposed in [15]. Specifically, we use a 4-branch RCN as our base model, with input images and output heatmaps of size 80 × 80. Convolutional layers of the encoder consist of 64 channels, while the convolutional layers of the decoder output 64 channels from the 128 channels at their input (64 channels from the previous layer concatenated with the 64 channels skipped over the bottleneck via branching). We apply Leaky-ReLU, with a negative slope of 0.2, on all but the last convolution layer. Details of the generator architecture are given in Table 1. Drop-out follows after all but the first and last activations. We use the Adam optimizer with a learning rate of 0.001 and weight decay. In all cases, networks were trained from scratch, using no data augmentation nor any other 'training tricks.'

For the discriminator $D$, we use a 4-layer PatchGAN [16]. Similarly to [31], we apply Gaussian noise before the convolution at each layer, followed by batch-normalization in all but the top and bottom layers, and then Leaky-ReLU with a negative slope of 0.2 in all layers but the top. The original RGB image concatenated with the stack of heatmaps (one per landmark) is fed in as the input (Figure 2); thus, $D$ takes in 3 + 68 = 71 channels. A fixed temperature $\beta$ was used when computing softargmax. The entire framework was implemented in PyTorch. An important note is that models optimized using the Laplace distribution consistently outperformed the Gaussian-based ones. For instance, LaplaceKL had the lowest error compared to all existing methods on 300W with 4.01, while the Gaussian-based version resulted in 4.71. The sharper, “peakier” Laplace distribution proved more numerically stable with the current network configuration, as we had to use a learning rate an order of magnitude smaller to avoid vanishing gradients with the Gaussian. Thus, we used Laplace for all experiments.

Method                AFLW    300W
                              Common   Challenge   Full
SDM [35]              5.43    5.57     15.40       7.52
LBF [24]              4.25    4.95     11.98       6.32
MDM [29]              -       4.83     10.14       5.88
TCDCN [39]            -       4.80     8.60        5.54
CFSS [41]             3.92    4.73     9.98        5.76
CFSS [20]             2.17    4.36     7.56        4.99
RCSR [34]             -       4.01     8.58        4.90
RCN+ (LELT) [14]      1.59    4.20     7.78        4.90
CPM SBR [7]           2.14    3.28     7.58        4.10
Softargmax            2.26    3.48     7.39        4.25
Softargmax+D(10K)     -       3.34     7.90        4.23
Softargmax+D(30K)     -       3.41     7.99        4.31
Softargmax+D(50K)     -       3.41     8.06        4.32
Softargmax+D(70K)     -       3.34     8.17        4.29
LaplaceKL             1.97    3.28     7.01        4.01
LaplaceKL+D(10K)      -       3.26     6.96        3.99
LaplaceKL+D(30K)      -       3.29     6.74        3.96
LaplaceKL+D(50K)      -       3.26     6.71        3.94
LaplaceKL+D(70K)      -       3.19     6.87        3.91

Table 2: NMSE on AFLW and 300W, normalized by the square root of the bounding-box area and the interocular distance, respectively.

4 Experiments

We evaluated the proposed framework on two widely used benchmark datasets for face alignment. No data augmentation techniques were applied when training our models in any of the experiments, nor was the learning rate dropped. This way, there is no ambiguity about whether the improved performance stems from training tricks or from the learning component itself. All results reported for the proposed method are from models trained for 200 epochs.

Next, we discuss the metric used to evaluate performance, NMSE, where the difference between the two datasets is the reference used to find the normalization factor. Then, we cover experimental settings, results, and analysis for each dataset separately. We then show that reducing the number of parameters reduces both the storage requirements and processing time of the model, and that the proposed LaplaceKL+D(70K) performs comparably using just 1/8 the number of feature channels.

4.1 Metric

Per convention [1, 5, 25], the metric used for evaluation is the NMSE, a normalized average of Euclidean distances over the landmarks of an image. Mathematically speaking:

$$\text{NMSE} = \frac{1}{|V|} \sum_{k \in V} \frac{\| p_k - \hat{p}_k \|_2}{d}$$

where $V$ is the set of visible landmarks, the normalization factor $d$ depends on the face size, and $p_k$ and $\hat{p}_k$ are the ground-truth and predicted coordinates, respectively. The face-size factor ensures that NMSE scores across faces of different sizes are fairly weighted. Following predecessors, NMSE was used to evaluate both datasets, except with different reference points used to calculate $d$. Details for finding $d$ are provided in the following subsections.
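The metric is straightforward to implement; the sketch below is our helper (names are ours), with the normalization factor `d` passed in per dataset convention:

```python
import numpy as np

def nmse(gt, pred, d, visible=None):
    """Normalized average of Euclidean distances over the visible
    landmarks, divided by the face-size factor d (interocular distance
    for 300W; sqrt of the bounding-box area for AFLW)."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    if visible is None:
        visible = np.ones(len(gt), dtype=bool)  # all landmarks visible
    errs = np.linalg.norm(gt[visible] - pred[visible], axis=1)
    return errs.mean() / d
```

Dividing by `d` is what makes scores comparable across faces of different sizes; the `visible` mask handles datasets such as AFLW where occluded landmarks are unmarked.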

Channels reduced to:  1/16    1/8     1/4     1/2     1 (original)
Softargmax            9.79    6.86    4.83    4.35    4.25
Softargmax+D(70K)     9.02    6.84    4.85    4.38    4.29
LaplaceKL             7.38    5.09    4.39    4.04    4.01
LaplaceKL+D(70K)      7.01    4.85    4.30    3.98    3.91
Storage (MB)          0.076   0.162   0.507   1.919   7.496
Speed (fps)           26.51   21.38   16.77   11.92   4.92

Table 3: NMSE on 300W (full set) for networks trained with the number of channels in each convolutional layer reduced to 1/16, 1/8, 1/4, and 1/2, and unmodified in size (the original), listed from left to right. Processing speed was measured on a 2.8GHz Intel Core i7 CPU.

4.2 300W + MegaFace

The 300W dataset is currently the most popular dataset for face alignment. It has 68 landmarks for 3,837 images (3,148 training and 689 testing). We followed the protocol of the 300W challenge [25] and evaluated using NMSE (Eq. 8), where the face size is set as the interocular distance (the distance between the outer corners of the eyes), and all landmarks are visible for all faces. Per convention, 300W was evaluated on different subsets (common and challenge, which together form full).

We compared the performance of the proposed objective trained in a semi-supervised fashion. The training data of the 300W dataset was used as the labeled (real) data, and the unlabeled (fake) data was randomly selected from the MegaFace dataset [21]. MTCNN was used to detect 5 landmarks (eye pupils, corners of the mouth, and middle of the nose and chin) [38]. This allowed faces from both datasets to be cropped in the same way. Specifically, we extended the square hull that enclosed the 5 landmarks in each direction. In other words, we fit the smallest bounding box that spans the 5 points (the outermost points lie on the perimeter), transformed rectangles to squares, and then increased each side by 2× the radius. Note that the midpoint of the original rectangle was held constant to avoid shift translations (rounded up a pixel if the radius was even, and extended in all directions).
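The cropping procedure above can be sketched as follows (our reading of the text; the choice of the longer side as the square "radius" and the omission of pixel rounding are assumptions):

```python
import numpy as np

def square_crop_box(pts5):
    """Square crop around 5 detected landmarks: fit the tightest box
    spanning the points, square it about the same center, then extend
    each side by 2x the radius. Returns (x0, y0, x1, y1)."""
    pts = np.asarray(pts5, float)
    (x0, y0), (x1, y1) = pts.min(0), pts.max(0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # center held constant
    r = max(x1 - x0, y1 - y0) / 2.0             # square "radius" (assumed: longer side)
    r *= 2.0                                    # extend by 2x the radius
    return cx - r, cy - r, cx + r, cy + r
```

Because the center of the original box is held fixed and both sides grow symmetrically, the crop avoids shift translations between the labeled and unlabeled faces.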

Figure 3: Random samples from 300W. Heatmaps predicted by our LaplaceKL+D(70K) model (middle) and those produced by softargmax+D(70K) (right) are shown alongside face images with the ground-truth sketched on the face (left). The heatmaps generated for each landmark were colored by value (range of [0, 1] from bottom to top of the colorbar) and superimposed on the original face. Note that the KL-divergence loss yields predictions of much greater confidence, producing landmarks that are clearly separated in the visualized heatmap space. In other words, the proposed method has minimal spread about the mean, as opposed to the softargmax-based model, whose heatmaps for individual landmarks are smudged across one another. Best viewed electronically and in color; zoom in for greater detail.

The LaplaceKL+D(70K) model obtained state-of-the-art on 300W, yielding the lowest error (Table 2, 300W columns). Models trained with unlabeled data are denoted LaplaceKL+D(N) and softargmax+D(N), where N is the number of unlabeled images added from MegaFace.

First, notice that LaplaceKL trained without unlabeled data still achieves state-of-the-art. The LaplaceKL-based models then show relative improvements as more unlabeled data is added. The softargmax-based models cannot fully take advantage of the unlabeled data, as the variance is not minimized (they generate heatmaps of less confidence and, thus, more spread). Our LaplaceKL, on the other hand, penalizes spread (scale), making the discriminator's job more challenging. As such, LaplaceKL-based models benefit from increasing amounts of unlabeled data.

Also, notice the gap between the baseline model [7] and our best LaplaceKL+D(70K) model on the different sets of 300W. Adding more unlabeled data helps (LaplaceKL vs. LaplaceKL+D(70K) is an improvement of about 2.5%). However, it is important to use samples that are not covered by the labeled set. To demonstrate this, we set both the real and fake sets to 300W (i.e., 300W in the second term of Eq. 7). NMSE results for this experiment are as follows: LaplaceKL+D(300W) 4.06 (baseline 4.01) and softargmax+D(300W) 4.26 (baseline 4.25). As hypothesized, all the information in the labeled set had already been extracted by the supervised branch, leaving no benefit to using the same set in the unsupervised branch. Therefore, more unlabeled data yields more hard negatives to train with, which improves accuracy on rarely seen samples (Table 2, 300W challenge set). Our best model was 2.7% better than [7] on the easier samples (common), 4.7% better on average (full), and, moreover, 9.8% better on the more difficult samples (challenge). This further highlights the advantages of the proposed LaplaceKL loss and adversarial training framework.

Additionally, our 300W baseline was further boosted by the adversarial framework (more unlabeled data yields a lower NMSE). Specifically, this pushed the state-of-the-art of the proposed method on 300W from an NMSE of 4.01 to 3.91 (no unlabeled data to 70K unlabeled images, respectively). In fact, there were boosts at each step on the full set (more unlabeled data, smaller NMSE).

We randomly selected unlabeled samples for LaplaceKL+D(70K) and softargmax+D(70K) to visualize the predicted heatmaps (Figure 3). In each case, the heatmaps produced by the softargmax-based models spread wider, explaining the worse quantitative scores (Table 2). The models trained with the proposed contributions tend to yield more probable pixel locations (more concentrated predicted heatmaps). For most images, the heatmaps generated by models trained with the LaplaceKL loss have landmark distributions that are clearly more confident and properly separated: our LaplaceKL+D(70K) yielded heatmaps that vary 1.02 pixels from the mean, while softargmax+D(70K) had a variation of 2.59 pixels. Learning the landmark distributions with our LaplaceKL loss is conceptually and theoretically intuitive (Figure 1), and it is supported experimentally (Table 2).

4.3 The AFLW dataset

We evaluated the proposed KL-divergence loss on the AFLW dataset [19]. AFLW consists of 24,386 faces annotated with up to 21 landmarks and 3D real-valued head-pose labels. Following previous work [14], we split AFLW into 20,000 faces for training and 4,386 for testing. We also ignored the two landmarks on the left and right earlobes; thus, up to 19 landmarks were used per face [7].

Since the faces in the AFLW dataset were captured across various head poses, most faces have landmarks that are out of view (i.e., missing). Thus, most samples are not annotated with the complete set of 19 landmarks, which precludes the constant-sized tensor of real heatmaps required for adversarial training. Therefore, we only compared the softargmax- and KL-based objectives with existing state-of-the-art methods. The face size used in the NMSE was the square root of the bounding-box area [1].
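The NMSE used here divides the mean point-to-point error by a face-size term, taken on AFLW as the square root of the bounding-box area [1]. A minimal sketch (the function name and argument layout are ours):

```python
import math

def nmse(pred, gt, box_w, box_h):
    """Normalized mean squared error for one face.

    Average Euclidean point-to-point distance between predicted and
    ground-truth landmarks, divided by a face-size normalizer, here the
    square root of the bounding-box area. `pred` and `gt` are lists of
    (x, y) coordinates in the same order.
    """
    face_size = math.sqrt(box_w * box_h)
    err = sum(math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)
    return err / face_size
```

Normalizing by face size makes errors comparable across faces captured at different scales, which is why NMSE rather than raw pixel error is reported throughout.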

Our LaplaceKL-based model scored comparably to the existing state-of-the-art (RCN+ (L+ELT) [14]) on the larger, more challenging AFLW dataset, while outperforming all other methods. It is important to highlight that [14] places great emphasis on data augmentation, while we apply none. Also, since landmarks are missing in some samples (no common reference points exist across all samples), we were unable to prepare AFLW faces for our semi-supervised component; this is left to future work.

4.4 Model size analysis

We next measure error as a function of model size. For this, the number of channels at each convolutional layer was reduced to as little as 1/16 of the original. Table 3 shows these results. The softargmax-based model worsened by about 47% and 79% in NMSE with the channel count reduced to 1/8 and 1/16, respectively (4.25 to 6.86 and 9.79). LaplaceKL, on the other hand, only worsened by about 24% with 1/8 the channels and 59% with 1/16 (4.01 to 5.09 and 7.38). Our model trained with unlabeled data (LaplaceKL+D(70K)) dropped just about 21% and 57% when reduced by factors of 8 and 16, respectively (3.91 to 4.85 and 7.01). In the end, LaplaceKL+D(70K) showed the best performance across the reduced sizes: with 0.040M parameters it is still comparable to previous state-of-the-art [14, 20, 34]. This is a clear advantage of the proposed approach. For instance, SDM [35] requires 1.693M parameters (25.17MB) for an NMSE of 7.52 (300W full), while our smallest and next-to-smallest models reach 7.01 and 4.85 with only 0.174M (0.076MB) and 0.340M (0.166MB) parameters, respectively.
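The steep parameter savings above follow from the fact that a standard convolution has roughly in_channels × out_channels × k × k weights, so scaling every layer's width by a factor s shrinks the interior layers' parameter counts roughly quadratically (by s²). This can be sanity-checked with a rough sketch (the layer widths below are hypothetical, not the paper's architecture):

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of one k x k convolution (bias terms ignored)."""
    return c_in * c_out * k * k

def model_params(channels, k=3):
    """Total weight count of a plain convolutional stack whose
    successive layer widths are given by `channels`."""
    return sum(conv_params(a, b, k) for a, b in zip(channels, channels[1:]))

full = model_params([3, 64, 128, 256])          # hypothetical full-width stack
eighth = model_params([3, 64 // 8, 128 // 8, 256 // 8])  # 1/8 the channels
# Interior layers shrink ~64x; only the input layer (fixed 3 channels) shrinks ~8x.
```

This quadratic shrinkage explains how a 1/8-width model can land near 0.040M parameters while the error grows far more slowly than the size drops.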

Processing speed also benefited from fewer channels (both in training and at inference). For instance, the model reduced by a factor of 16 processes 26.51 frames per second (fps) on the CPU of a MacBook Pro (2.8GHz Intel Core i7), versus 4.92 fps for the original. Our best LaplaceKL-based model proved robust to size reduction, obtaining a 4.85 NMSE at 21.38 fps when reduced by a factor of 8.
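Throughput numbers like these can be reproduced with a simple wall-clock loop over forward passes; a minimal sketch (the warm-up and iteration counts are arbitrary choices, and `run_once` stands in for one model inference):

```python
import time

def measure_fps(run_once, n_warmup=5, n_iters=50):
    """Average frames per second of `run_once`, a zero-argument callable
    performing one forward pass, measured after a short warm-up so that
    one-time setup costs do not skew the average."""
    for _ in range(n_warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_once()
    return n_iters / (time.perf_counter() - start)
```

Using a monotonic high-resolution clock (`time.perf_counter`) and averaging over many iterations keeps the measurement stable against scheduler jitter on a laptop CPU.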

Figure 4: Random samples of landmarks predicted by LaplaceKL (white), with ground truth drawn as line segments (red). Notice the predicted points tend to overlap with the ground truth. Best viewed electronically and in color; zoom in for greater detail.

5 Conclusions

We demonstrated the benefits of the proposed LaplaceKL-divergence loss and of leveraging unlabeled data in an adversarial training framework, both separately and combined. The importance of penalizing a landmark predictor's uncertainty was shown conceptually and empirically: training with the proposed objective yields predictions of higher confidence, outperforming previous state-of-the-art methods. We also showed that adding unlabeled training data further boosts performance via adversarial training. In the end, our model achieves state-of-the-art on all three splits of the renowned 300W benchmark (common, challenge, and full) and second-to-best on the AFLW benchmark. Furthermore, we demonstrated the robustness of the proposed model by significantly reducing its number of parameters: with 1/8 the number of channels (about 170KB on disk), it still yields accuracy comparable to previous state-of-the-art, in real time (21.38 fps). Thus, the contributions of the proposed framework are instrumental for models intended for real-world production.


  • [1] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [2] O. Chapelle and M. Wu. Gradient descent optimization of smoothed information retrieval metrics. Information retrieval, 13(3):216–235, 2010.
  • [3] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (6):681–685, 2001.
  • [4] T. F. Cootes and C. J. Taylor. Active shape models—‘smart snakes’. In British Machine Vision Conference (BMVC), 1992.
  • [5] D. Cristinacce and T. F. Cootes. Feature detection and tracking with constrained local models. In British Machine Vision Conference (BMVC), volume 1, page 3. Citeseer, 2006.
  • [6] P. Domingos. A unified bias-variance decomposition. In International Conference on Machine Learning (ICML), pages 231–238, 2000.
  • [7] X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, and Y. Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 360–368, 2018.
  • [8] Z. Geng, C. Cao, and S. Tulyakov. 3d guided fine-grained face manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
  • [10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5), 2010.
  • [11] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [12] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. 2017.
  • [13] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [14] S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz. Improving landmark localization with semi-supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [15] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5743–5752, 2016.
  • [16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
  • [17] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3d face alignment from 2d videos in real-time. In Automatic Face and Gesture Recognition (FG), 2015.
  • [18] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [19] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In IEEE International Conference on Computer Vision (ICCV) Workshop, pages 2144–2151, 2011.
  • [20] J.-J. Lv, X. Shao, J. Xing, C. Cheng, X. Zhou, et al. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [22] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas. Red-net: A recurrent encoder–decoder network for video-based face alignment. International Journal of Computer Vision (IJCV), 2018.
  • [23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [24] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1685–1692, 2014.
  • [25] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In IEEE International Conference on Computer Vision (ICCV) Workshop, 2013.
  • [26] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [28] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. Animating arbitrary objects via deep motion transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [29] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4177–4187, 2016.
  • [30] S. Tulyakov, L. A. Jeni, J. F. Cohn, and N. Sebe. Viewpoint-consistent 3d face alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [31] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [32] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
  • [33] W. Wang, S. Tulyakov, and N. Sebe. Recurrent convolutional face alignment. In Asian Conference on Computer Vision (ACCV), pages 104–120. Springer, 2016.
  • [34] W. Wang, S. Tulyakov, and N. Sebe. Recurrent convolutional shape regression. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [35] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 532–539, 2013.
  • [36] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3d human pose estimation in the wild by adversarial learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [38] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • [39] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision (ECCV), pages 94–108. Springer, 2014.
  • [40] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [41] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4998–5006, 2015.
  • [42] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 146–155, 2016.
  • [43] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2879–2886. IEEE, 2012.