Landmark Detection in Low Resolution Faces with Semi-Supervised Learning

07/30/2019, by Amit Kumar et al., University of Maryland

Landmark detection algorithms trained on high resolution images perform poorly on datasets containing low resolution images. This degrades the performance of algorithms that rely on accurate landmarks, for example, face recognition. To the best of our knowledge, there does not exist any dataset consisting of low resolution face images along with their annotated landmarks, making supervised training infeasible. In this paper, we present a semi-supervised approach to predict landmarks on low resolution images by learning them from labeled high resolution images. The objective of this work is to show that predicting landmarks directly on low resolution images is more effective than the current practice of aligning images after rescaling or super-resolution. In a two-step process, the proposed approach first learns to generate low resolution images by modeling the distribution of target low resolution images. In the second stage, the roles of generated images and real low resolution images are switched and the model learns to predict landmarks for real low resolution images from generated low resolution images. With extensive experimentation, we study the impact of each design choice and show that predicting landmarks directly on low resolution images improves the performance of important tasks such as face recognition in low resolution images.


1 Introduction

Convolutional Neural Networks (CNNs) have revolutionized computer vision research, to the point that current systems can recognize faces with more than 99.7% accuracy [9] or achieve detection, segmentation and pose estimation results up to sub-pixel accuracy. These are only a few of the many tasks which have seen significant performance improvements in the last five years. However, CNN-based methods assume access to good quality images. The ImageNet [29], COCO [19], CASIA [40], 300W [30] and MPII [2] datasets all consist of high resolution images. As a result of domain shift, much lower performance is observed when networks trained on these datasets are applied to images which have suffered degradation due to intrinsic or extrinsic factors. In this work, we address landmark localization in low resolution images. Although we use face images in our case, the proposed method is also applicable to other tasks, such as human pose estimation. Throughout this paper we use HR and LR to denote high and low resolution respectively.

Figure 1: Inaccurate landmark detections on low resolution images. We show landmarks predicted by different systems. (a) MTCNN [42] and (b) [4] are not able to detect any face in the LR image. (c) Current practice of directly upsampling the low-resolution image to a fixed size by bilinear interpolation. (d) Output from a network trained on downsampled versions of HR images. (e) Landmark detection using super-resolved images. Note: for visualization purposes, images have been reshaped after the respective processing; the actual images are of low resolution.

Facial landmark localization, also known as keypoint or fiducial detection, refers to the task of detecting specific points such as eye corners and the nose tip on a face image. The detected keypoints are used to align images to canonical coordinates, which are then used as inputs to different convolution networks. It has been experimentally shown in [bansal2017dosanddonts] that accurate face alignment leads to improved performance in face verification. Though great strides have been made in this direction, mainly addressing large-pose face alignment, landmark localization for low resolution images still remains an understudied problem, mostly because of the absence of large scale labeled datasets. To the best of our knowledge, landmark localization directly on low resolution images is addressed for the first time in this work.

Main motivation: In Figure 1, we examine possible scenarios which are currently practiced when low resolution images are encountered. Figure 1 shows the predicted landmarks when the input is a small LR image. Typically, landmark detection networks are trained with crops of HR images taken from the AFLW [13] and 300W [30] datasets. During inference, an incoming image is rescaled to a fixed resolution irrespective of its original size. We deploy two methods, MTCNN [42] and Bulat et al. [4], which have detection and localization built into a single system. In Figures 1(a) and (b) we see that these networks fail to detect a face in the given image. Figure 1(c) shows the outputs when a network trained on high resolution images is applied to a rescaled low resolution one. It is important to note that this trained network, which we call HR-LD (high resolution landmark detector, detailed in Section 4.4), achieves state-of-the-art performance on the AFLW and 300W test sets. A possible solution is to train a network on sub-sampled images as a substitute for low resolution images. Figure 1(d) shows the output of one such network. It is evident from these experiments that networks trained with HR images or subsampled images are not effective for real-life LR images. It can also be concluded that subsampled images are unable to capture the distribution of real LR images.

Super-resolution is widely used to resolve LR images to reveal more details. Significant developments have been made in this field, and methods based on encoder-decoder architectures and GANs [11] have been proposed. We employ two recent deep learning based methods, SRGAN [17] and ESRGAN [34], to resolve given LR images. It is worth noting that the training data for these networks also include face images. Figure 1(e) shows the result when the super-resolved image is passed through HR-LD. It can be hypothesized that the super-resolved images possibly do not lie in the same space as the images with which HR-LD was trained. Super-resolution networks are trained using synthetic low resolution images obtained by downsampling after applying Gaussian smoothing. In some cases, the training data for super-resolution networks consist of paired low and high resolution images. Neither of these scenarios is applicable in real-life situations.

Main Idea: Different from these approaches, the proposed method is based on the concept of 'generate to adapt'. This work aims to show that landmark localization in LR images can not only be achieved, but that it also improves performance over the current practice. To this end, we first train a deep network which generates LR images from HR images and tries to model the distribution of real LR images in pixel space. Since there is no publicly available dataset containing low resolution images along with landmark annotations, we take a semi-supervised approach for landmark detection. We train an adversarial landmark localization network on the generated LR images, hence switching the roles of generated and real LR images. Heatmaps predicted for unlabelled real LR images are also included in the inputs of the discriminators. The adversarial training procedure is designed so that, in order to fool the discriminators, the heatmap generator has to learn the structure of the face even in low resolution. We perform an extensive set of experiments explaining all the design choices. In addition, we also propose a new state-of-the-art landmark detector for HR images.

2 Related Work

Being one of the most important pre-processing steps in face analysis tasks, facial landmark detection has been a topic of immense interest among computer vision researchers. We briefly discuss some of the methods which use Convolutional Neural Networks (CNNs). Different algorithms have been proposed in the recent past, such as the direct regression approaches of MTCNN by Zhang et al. [44] and KEPLER by Kumar et al. [14]. The convolutional neural networks in MTCNN and KEPLER act as non-linear regressors and learn to directly predict the landmarks. Both works are designed to predict other attributes along with keypoints, such as 2D pose, visibility of keypoints and gender. HyperFace by Ranjan et al. [26] has shown that learning several tasks in one single network does, in fact, improve the performance of the individual tasks. Recently, encoder-decoder architectures have become popular and have been used extensively for tasks which require per-pixel labeling, such as semantic segmentation [25, 28] and keypoint detection [16, 1, 41, 15]. Despite the significant progress in this field, predicting landmarks on low resolution faces still remains a relatively unexplored topic. All of the works mentioned above are trained on high quality images and their performance degrades on LR images.

Figure 2: Overview of the proposed approach. A high resolution input is passed through the High-to-Low generator (shown in the cyan colored block). The discriminator learns to distinguish generated LR images from real LR images in an unpaired fashion. The generated image is fed to the heatmap generator. The heatmap discriminator distinguishes generated heatmaps from groundtruth heatmaps; this generator-discriminator pair is inspired by BEGAN [3]. In addition to generated and groundtruth heatmaps, the confidence discriminator also receives predicted heatmaps for real LR images. This enables the generator to produce realistic heatmaps for un-annotated LR images.

One of the most closely related works is Super-FAN [5] by Bulat et al., which attempts to predict landmarks on LR images via super-resolution. However, as shown by the experiments in Section 4.3, face recognition performance degrades even on super-resolved images. This necessitates that super-resolution, face alignment and face recognition be learned in a single model, trained end to end, making it not only slow at inference but also limited by GPU memory constraints. The proposed work differs from [5] in many respects, as it needs labeled data only in HR and learns to predict landmarks in LR images without LR annotations. Due to adversarial training, the network not only acts as a facial parts detector but also learns the inherent structure of the facial parts. The proposed method makes the pre-processing task faster and independent of face verification training. During inference, only the heatmap generator network is used; it is based on the fully convolutional U-Net architecture [28] and works at low spatial resolution, making the alignment process real time.

3 Proposed Method

The proposed work predicts landmarks directly on a low resolution image of small spatial size. We show that predicting landmarks directly in low resolution is more effective than the current practices of rescaling or super-resolution. The entire pipeline can be divided into two stages: (a) generation of LR images in an unpaired manner, and (b) generation of heatmaps for real LR images in a semi-supervised fashion. A diagrammatic overview of the proposed approach is shown in Figure 2. Being a semi-supervised method, it is important to first describe the datasets chosen for the ablative study.

High Resolution Dataset: We construct the HR dataset by combining the training images from AFLW [13] and the entire 300W [30] dataset. We divide the Widerface dataset [39], which consists of images of different resolutions captured under diverse conditions, into two groups based on their spatial size: the first group consists of the smaller faces, whereas the second group consists of the larger ones. We add the second group to the HR training set. The remaining images from AFLW are used as validation images for the ablative study and as the test set for the landmark localization task. Although generation of LR images is an unpaired task, we use AFLW and 300W images for training, as the generated LR images from these datasets are used for semi-supervised learning in the second step.

Low Resolution Dataset: The first group from the Widerface dataset is used as the set of real, or target, low resolution images.

3.1 High to Low Generator and Discriminator

The High-to-Low generator, shown in Figure 3, is designed following an encoder-decoder architecture, where both the encoder and the decoder consist of multiple residual blocks. The input to the first convolution layer is the HR image concatenated with a noise vector which has been projected using a fully connected layer and reshaped to match the input size. Similar architectures have also been used in [6, 17]. The encoder in the generator consists of eight residual blocks, each followed by a convolution layer to increase dimensionality. Max-pooling is used to decrease the spatial resolution of the high resolution input. The decoder is composed of six residual units followed by convolution layers to reduce the dimensionality. Finally, one convolution layer is added in order to output a three channel image. BatchNorm is used after every convolution layer.
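As a concrete reference, the following is a minimal PyTorch sketch of an encoder-decoder of this kind. The class names, channel widths, block counts, noise dimension and image sizes are illustrative assumptions, not the exact values used in the paper.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block with BatchNorm after every convolution."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class HighToLowGenerator(nn.Module):
    """Sketch of the High-to-Low generator: HR image + projected noise -> LR image."""
    def __init__(self, noise_dim=64, hr_size=128):
        super().__init__()
        self.hr_size = hr_size
        # project the noise vector with a fully connected layer, reshape to one extra channel
        self.noise_fc = nn.Linear(noise_dim, hr_size * hr_size)
        self.inp = nn.Conv2d(3 + 1, 64, 3, padding=1)
        # encoder: residual blocks with max-pooling to shrink the spatial resolution
        self.encoder = nn.Sequential(
            ResBlock(64), nn.MaxPool2d(2),
            ResBlock(64), nn.MaxPool2d(2),
            ResBlock(64))
        # decoder: residual blocks, then a final conv producing a 3-channel LR image
        self.decoder = nn.Sequential(ResBlock(64), ResBlock(64),
                                     nn.Conv2d(64, 3, 3, padding=1))
    def forward(self, hr, z):
        zmap = self.noise_fc(z).view(-1, 1, self.hr_size, self.hr_size)
        x = self.inp(torch.cat([hr, zmap], dim=1))
        return self.decoder(self.encoder(x))

# usage sketch: hr = torch.randn(2, 3, 128, 128); z = torch.randn(2, 64)
# lr = HighToLowGenerator()(hr, z)   # -> tensor of shape (2, 3, 32, 32)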

The discriminator, shown in Figure 3, is constructed in a similar way, except that max-pooling is used only in the last three layers, considering that the inputs to the discriminator are low resolution images. Referring to Figure 2, the inputs to the generator are high resolution images, while the generated LR images and the real LR images fed to the discriminator share the same, smaller size.

Figure 3: (a) Generator used in the high-to-low generation. Each block represents two residual blocks followed by a convolution layer. (b) Discriminator architecture shared by the image and confidence discriminators. Max-pooling is applied only in the last two layers. Each block represents one residual block followed by a convolution layer.

We train the High-to-Low generator using a weighted combination of a GAN loss and a pixel loss. The pixel loss is used to encourage convergence in the initial training iterations. The final loss is summarized in Equation 1.

(1)

where the weights are hyperparameters which are set empirically. Following recent developments in GANs, we experimented with different loss functions; however, we use the hinge loss in combination with Spectral Normalization [22] due to faster training. The hinge loss for the generative networks can be defined as in Equation 2:

(2)

where the first is the distribution of real LR images from the Widerface dataset, and the second is the distribution of generated images.
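Since the expression of Equation 2 is not reproduced above, we note for reference the standard hinge formulation of [22] that this description appears to follow; the notation below is assumed rather than taken from the paper:

\mathcal{L}_{D} = \mathbb{E}_{x \sim p_{\mathrm{LR}}}\!\left[\max(0,\, 1 - D(x))\right] + \mathbb{E}_{\hat{x} \sim p_{G}}\!\left[\max(0,\, 1 + D(\hat{x}))\right], \qquad \mathcal{L}_{G} = -\,\mathbb{E}_{\hat{x} \sim p_{G}}\!\left[D(\hat{x})\right],

with p_{LR} and p_{G} denoting the two distributions just mentioned.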

The weights of the discriminator are normalized in order to satisfy the Lipschitz constraint, shown in Equation 3:

(3)
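Equation 3 is likewise not legible above; spectral normalization as defined in [22] divides each weight matrix by its largest singular value so that every layer is 1-Lipschitz (notation assumed):

\bar{W}_{\mathrm{SN}}(W) = \frac{W}{\sigma(W)}, \qquad \sigma(W) = \max_{\mathbf{h} \neq 0} \frac{\lVert W\mathbf{h} \rVert_{2}}{\lVert \mathbf{h} \rVert_{2}}.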

Finally, the pixel loss described in Equation 4 is used, which minimizes the distance between the generated and the subsampled images. This loss ensures that the content is not lost during the generation process.

(4)

where the subsampling operation is implemented by passing the HR image through four average pooling layers.
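A minimal PyTorch sketch of this pixel loss is given below. The use of an L2 (MSE) distance and of 2x2 pooling windows are assumptions; the downscaling factor of the four pooling layers has to match the generator's output size.

import torch.nn.functional as F

def pixel_loss(generated_lr, hr):
    """L2 distance between the generated LR image and a sub-sampled copy of the HR input.
    The sub-sampling uses four average-pooling layers, as described in the text; with
    2x2 windows this assumes the generated LR image is 1/16 of the HR spatial size."""
    target = hr
    for _ in range(4):
        target = F.avg_pool2d(target, 2)
    return F.mse_loss(generated_lr, target)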

Figure 4: Sample outputs of the High-to-Low generator. The first row shows the HR images. The second row shows downsampled images obtained after applying Gaussian smoothing. The third row shows LR images generated by the network. Note: best viewed when zoomed in. For visualization purposes, images have been enlarged after the respective processing; the actual images are of low resolution.

Figure 4 shows some sample LR images generated by the network.

3.2 Semi-Supervised Landmark Localization

3.2.1 Heatmap Generator

Figure 5: Architecture of the heatmap generator, which is based on U-Net. Each block represents two residual blocks; skip connections link the encoder and the decoder.

The keypoint heatmap generator, shown in Figure 5, produces heatmaps corresponding to the N keypoints in a given image. As mentioned earlier, the objective of this paper is to show that landmark prediction directly on LR images is feasible even in the absence of labeled LR data, and to evaluate the performance of auxiliary tasks compared to the commonly used practices of rescaling or super-resolution. Keeping this in mind, we choose a simple network based on the U-Net architecture [28] as the heatmap generator, instead of computationally intensive stacks of hourglass networks [24] or CPMs [35]. The network consists of 16 residual blocks, with eight residual blocks each in the encoder and the decoder. The eight residual blocks in the encoder are divided into four groups of two blocks each, and the spatial resolution is halved after each group using max pooling. The heatmap generator outputs (N+1) feature maps corresponding to the N keypoints and one background channel. After experimentation, this design for landmark detection has proven to be very effective and has resulted in state-of-the-art results for landmark prediction when trained with HR images (see Section 4.3).
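The following PyTorch sketch illustrates such a U-Net-style heatmap generator with four encoder groups of two residual blocks each, a mirrored decoder with skip connections, and N+1 output channels. The channel widths, default keypoint count and class names are assumptions for illustration.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout))
        self.skip = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()
    def forward(self, x):
        return torch.relu(self.skip(x) + self.conv(x))

class HeatmapGenerator(nn.Module):
    """U-Net style heatmap generator: 4 encoder groups (2 residual blocks each),
    a mirrored decoder with skip connections, and N+1 output maps
    (N keypoints + 1 background). Channel widths are assumed values."""
    def __init__(self, n_keypoints=19, widths=(64, 128, 256, 512)):
        super().__init__()
        self.inp = nn.Conv2d(3, widths[0], 3, padding=1)
        self.enc = nn.ModuleList(
            [nn.Sequential(ResBlock(widths[max(i - 1, 0)], widths[i]),
                           ResBlock(widths[i], widths[i])) for i in range(4)])
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.ModuleList(
            [nn.Sequential(ResBlock(widths[i] * 2, widths[i]),
                           ResBlock(widths[i], widths[max(i - 1, 0)]))
             for i in reversed(range(4))])
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.out = nn.Conv2d(widths[0], n_keypoints + 1, 1)
    def forward(self, x):
        x = self.inp(x)
        skips = []
        for block in self.enc:          # encoder: residual groups + max-pooling
            x = block(x)
            skips.append(x)
            x = self.pool(x)
        for block, s in zip(self.dec, reversed(skips)):   # decoder with skip connections
            x = self.up(x)
            x = block(torch.cat([x, s], dim=1))
        return self.out(x)              # (B, N+1, H, W) heatmaps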

3.2.2 Heatmap Discriminator

The heatmap discriminator follows the same architecture as the heatmap generator. However, the input to the discriminator is a set of heatmaps concatenated with their respective color images. This discriminator predicts another set of heatmaps and learns whether the keypoints described by the input heatmaps are correct and correspond to the face in the input image. The quality of the output heatmaps is determined by their similarity to the input heatmaps, following the notion of an autoencoder. The loss is computed as the error between the input and the reconstructed heatmaps.

3.2.3 Heatmap Confidence Discriminator

The architecture of the heatmap confidence discriminator is identical to the one used in the high-to-low discriminator, except that the input is an LR image concatenated with the heatmaps. This discriminator receives three kinds of inputs: a generated LR image with groundtruth heatmaps, a generated LR image with predicted heatmaps, and a real LR image with predicted heatmaps. It learns to distinguish between groundtruth and predicted heatmaps. In order to fool this discriminator, the generator has to produce heatmaps which are as realistic or plausible (for unlabeled real LR images) as possible. The loss propagated from this discriminator pushes the generator to predict accurate heatmaps not only for images whose groundtruth is available but also for images without annotations. This, in turn, enables the generator to understand the structure of the face in the given image and make accurate predictions.

Switching roles of generated and real images: During training of this part of the system, the roles of generated and real low resolution images are switched. While training the High-to-Low discriminator, the generated LR images are considered fake, so that the generator tries to produce LR images that are as realistic as possible. Recall that HR images have annotations associated with them. We assume that keypoint locations in a generated LR image stay approximately the same as in its downsampled version. Therefore, while training the heatmap networks, the downsampled annotations are taken as the groundtruth for the generated LR images, and the networks are trained to predict heatmaps as close to this groundtruth as possible in order to fool the two heatmap discriminators. The heatmap generator learns to predict accurate keypoints for real LR images by learning from generated LR images, hence the switching of roles.

3.3 Semi-supervised Learning

The learning process of this setup is inspired by the seminal work of Berthelot et al. [3] (BEGAN) and the Energy-based GANs of LeCun and colleagues [45]. The heatmap discriminator receives two sets of inputs: generated LR images with downsampled groundtruth heatmaps, and generated LR images with predicted heatmaps. When the input contains groundtruth heatmaps, the discriminator is trained to recognize it and reconstruct a similar set, i.e., to minimize the error between the groundtruth heatmaps and the reconstructed ones. On the other hand, if the input contains generated heatmaps, the discriminator is trained to reconstruct different heatmaps, i.e., to drive the error between the generated heatmaps and the reconstructed heatmaps as large as possible. The losses are expressed as

(5)

(6)

(7)

where the groundtruth heatmap of a given image is constructed by placing a Gaussian centered at each keypoint location. Inspired by Berthelot et al. [3], we use a balancing variable to control the trade-off between the heatmap generator and the discriminator; this variable is updated periodically during training. The adaptive term is defined by:

(8)

where the balancing variable is bounded, and its update rate is a hyperparameter. As in Equation 7, this variable controls the emphasis on the adversarial term. When the generator is able to fool the discriminator, the loss on generated heatmaps becomes smaller than the weighted loss on groundtruth heatmaps; as a result, the balancing variable increases, making the adversarial term dominant. The amount of emphasis is thus adjusted in proportion to how far the discriminator falls behind the generator. Similarly, when the discriminator gets better than the generator, the balancing variable decreases to slow down the training on the adversarial term, so that the generator and the discriminator train together.
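Equations 5-8 are not reproduced above; for reference, the BEGAN-style formulation of [3] that this paragraph describes reads as follows, with L(·) denoting the reconstruction error of the heatmap discriminator, H the groundtruth heatmaps and \hat{H} the predicted ones (all notation assumed):

\mathcal{L}_{D_1} = L(H) - k_t\, L(\hat{H}), \qquad k_{t+1} = k_t + \lambda_k \left( \gamma\, L(H) - L(\hat{H}) \right),

where, in the standard formulation, k_t is clipped to [0, 1], \gamma is the balance hyperparameter and \lambda_k is the update rate of k.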

The confidence discriminator is trained using the loss function from Least Squares GAN [21], as shown in Equation 9. This loss function was chosen in order to be consistent with the reconstruction losses computed by the heatmap discriminator, which are also squared errors.

(9)

It is noteworthy that, in this case, the real distribution corresponds to groundtruth heatmaps paired with generated LR images, while the fake distribution corresponds to predicted heatmaps paired with generated LR images and with real LR images.
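For reference, the least-squares objective of [21] instantiated for this discriminator would read (notation assumed):

\mathcal{L}_{D_2} = \tfrac{1}{2}\,\mathbb{E}_{(x, H) \sim p_{\mathrm{gt}}}\!\left[ \left(D_2(x, H) - 1\right)^2 \right] + \tfrac{1}{2}\,\mathbb{E}_{(x, \hat{H}) \sim p_{\mathrm{pred}}}\!\left[ D_2(x, \hat{H})^2 \right],

with p_{gt} and p_{pred} the groundtruth and predicted heatmap distributions described above.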

The generator is trained using a weighted combination of the losses from the two discriminators and the heatmap loss. The loss functions for the generator are described in the following equations:

(10)

(11)

(12)

(13)

where the weights are hyperparameters set empirically, with more emphasis placed on the heatmap loss to encourage convergence of the model in the initial iterations. Some real LR images with keypoints predicted by the heatmap generator are shown in Figure 6.
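A compact PyTorch sketch of such a weighted generator objective is given below. The function name, the specific weights and the use of MSE for every term are assumptions; it only illustrates how the supervised heatmap term is combined with the two adversarial terms.

import torch
import torch.nn.functional as F

def heatmap_generator_loss(pred_hm, gt_hm, d1_recon, d2_out, a=1.0, b=0.1, c=0.1):
    """Weighted combination of the supervised heatmap loss with the adversarial
    terms from the two discriminators. Weights a >> b, c are assumed values."""
    hm_loss = F.mse_loss(pred_hm, gt_hm)                    # supervised term on generated LR images
    d1_loss = F.mse_loss(d1_recon, pred_hm)                 # BEGAN-style: D1 should reconstruct predicted maps well
    d2_loss = F.mse_loss(d2_out, torch.ones_like(d2_out))   # LSGAN-style: push D2's output towards the real label
    return a * hm_loss + b * d1_loss + c * d2_loss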

Figure 6: Sample keypoint detections on TinyFace images. Note: for visualization purposes, images have been enlarged after processing; the actual images are of low resolution.

4 Experiments and Results

4.1 Ablation Experiments

We experimentally demonstrated in Section 1 (Figure 1) that networks trained on HR images perform poorly on LR images. Therefore, we propose the semi-supervised approach described in Section 3. With the networks and loss functions mentioned above, it is important to understand the implication of each component. This section examines each of the design choices quantitatively. To this end, we first train the high-to-low networks and generate LR versions of the AFLW test images. In the absence of real LR images with annotated landmarks, this creates a substitute low resolution dataset with annotations on which localization performance can be evaluated. We also generate subsampled versions of the AFLW training and test sets using average pooling after applying Gaussian smoothing. Data augmentation techniques such as random scaling, random rotation and random translation are used.

Evaluation Metric: Following most previous works, we obtain the error for each test sample by averaging the normalized errors over all annotated landmarks. For AFLW, the error is normalized by the ground truth bounding box size and computed over all visible points, whereas for 300W, the error is normalized by the inter-pupil distance. Wherever applicable, NRMSE stands for Normalized Root Mean Square Error.
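For clarity, a minimal implementation of this metric for a single face is sketched below; the function name and argument layout are illustrative.

import numpy as np

def nrmse(pred, gt, visible, norm):
    """Mean point-to-point error over visible landmarks, normalized by `norm`
    (ground-truth box size for AFLW, inter-pupil distance for 300W)."""
    pred = np.asarray(pred, dtype=float)     # (N, 2) predicted landmarks
    gt = np.asarray(gt, dtype=float)         # (N, 2) ground-truth landmarks
    vis = np.asarray(visible, dtype=bool)    # (N,) visibility flags
    err = np.linalg.norm(pred - gt, axis=1)  # per-landmark Euclidean distance
    return err[vis].mean() / norm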

Training Details: All the networks are trained in PyTorch using the Adam optimizer with empirically chosen initial learning rate and momentum (beta) values. We train the networks with a fixed batch size for a fixed number of epochs, dropping the learning rate twice during training.
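A typical PyTorch setup of this kind is sketched below. All numerical values are placeholders, not the paper's actual hyperparameters, which are not reproduced in the text above.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for any of the networks above
# hypothetical values for learning rate, betas, milestones and epoch count
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
for epoch in range(90):
    # ... one training epoch over the batched data would go here ...
    scheduler.step()                    # drop the learning rate after the milestone epochs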

Setting S1: Train networks on subsampled images? We train only the heatmap generator, with the subsampled AFLW training images, using the loss function in Equation 10, and evaluate the performance on the generated LR AFLW test images.

Setting S2: Train networks on generated LR images? In this experiment, we train the heatmap generator on generated LR images in a supervised way, using the loss function from Equation 10. We again evaluate the performance on the generated LR AFLW test images.

Observation: From the results summarized in Table 1(b), it is evident that there is a significant reduction in localization error when the heatmap generator is trained on generated LR images, validating our hypothesis that the subsampled images on which many super-resolution networks are trained may not be correct representatives of real LR images. Hence, we need to train the networks on real LR images.

Setting S3: Does adversarial training help? This question is asked in order to understand the importance of training the heatmap generator in an adversarial manner. In this experiment, we train the heatmap generator and discriminator using the losses in Equations 5, 6, 10 and 11. Metrics are calculated on the generated LR AFLW test images and compared against the experimental setting S2 above.

Setting S4: Does a generator trained in an adversarial manner scale to real LR images? In this experiment, we examine whether jointly training the heatmap generator and both discriminators improves the performance on real LR images from the Widerface dataset (see Section 3 for the datasets).

Observation: From Table 1(b), we observe that the network trained under setting S3 performs marginally better than under setting S4. However, since there are no keypoint annotations available for the Widerface dataset, conclusions cannot be drawn from this drop in performance alone. Hence, in Section 4.3, we examine this phenomenon indirectly by aligning the faces using the models from settings S3 and S4 and evaluating face recognition performance.

Method NRMSE (all) NRMSE (479 images) Time
MTCNN[42] - 0.9736 0.388 s
HRNet[31] 0.4055 0.3107 0.076 s
SAN[10] 0.3901 0.3141 0.0178 s
Proposed 0.257 0.1803 0.0105 s
(a)
Setting NRMSEstd auc@0.07 auc@0.08
S1 11.897 21.894
S2 50.843 55.751
S3 51.889 56.791
S4 51.775 56.697
(b)
Table 1: (a) Landmark Detection Error on Real Low Resolution dataset. (b) Table for ablation experiments under different settings on synthesized LR images.

4.2 Experiments on Low Resolution images

We choose to perform a direct comparison on a real LR dataset against two recent state-of-the-art methods: Style Aggregated Networks [10] and HRNet [31]. To create a real LR landmark detection dataset, which we call Annotated LR Faces (ALRF), we randomly selected identities from the TinyFace dataset, and for each identity one LR image within a fixed size range was randomly selected. Next, three individuals were asked to manually annotate all the images with 5 landmarks (two eye centers, nose tip and mouth corners) in MTCNN [42] style, where invisible points were marked as such. The mean of the points obtained from the three annotators was taken as the groundtruth. As per convention, we use the Normalized Root Mean Square Error (NRMSE), averaged over all visible points and normalized by the face size, as the comparison metric. Table 1(a) shows the results of this experiment. We also report the time for a forward pass of one image on a single GTX 1080. Without loss of generality, the results can be extrapolated to other existing works, as [10] and [31] are currently state of the art. MTCNN, which has detection and alignment in a single system, was able to detect only 479 faces out of the test images.

Figure 7: Snippet of the annotation tool used.

4.3 Face Recognition experiments

In the previous section, we performed ablative studies on the generated LR AFLW images. Although convenient for quantifying performance, this does not uncover the importance of training the three networks jointly in a semi-supervised way. Therefore, in this section, we evaluate the models from settings S3 and S4 (Section 4.1) by comparing the statistics obtained when the two models are used to align face images for the face recognition task.

We use the recently published and publicly available TinyFace [8] dataset for our experimental evaluation. It is one of the very few datasets aimed at understanding LR face recognition and consists of labeled facial identities with an average of three face images per identity. All the LR faces in TinyFace are collected from the web (PIPA [43] and MegaFace2 [23]) across diverse imaging scenarios, captured under uncontrolled viewing conditions in pose, illumination, occlusion and background. The known identities are divided into two splits, one for training and the remaining for testing.

Evaluation Protocol: In order to compare model performance, we adopt the closed-set face identification (1:N matching) protocol. Specifically, the task is to match a given probe face against a gallery set of enrolled face images, with the true match expected at top-1 of the ranking list. For each test class, half of the face images are randomly assigned to the probe set, and the remaining to the gallery set. For the purpose of this paper, we drop the distractor set, as it does not divulge new information while significantly slowing down the evaluation process. For face recognition evaluation, we report Top-k (k = 1, 5, 10, 20) accuracy and mean average precision (mAP).
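A short sketch of the top-k part of this protocol with cosine similarity is given below (mAP is computed analogously from the same ranked lists); the function name and argument layout are illustrative.

import numpy as np

def closed_set_identification(gallery_feats, gallery_ids, probe_feats, probe_ids,
                              ks=(1, 5, 10, 20)):
    """Rank-k accuracy for 1:N matching with cosine similarity."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    sims = p @ g.T                                    # (num_probe, num_gallery) cosine similarities
    order = np.argsort(-sims, axis=1)                 # gallery indices sorted by similarity
    ranked_ids = np.asarray(gallery_ids)[order]       # identities in ranked order per probe
    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    return {f"top-{k}": hits[:, :k].any(axis=1).mean() for k in ks}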

Experiments with a network trained from scratch: Since the number of images in the TinyFace dataset is much smaller than in large datasets such as CASIA [40] or MsCeleb-1M [12], we observed that training a very deep model like Inception-ResNet [32] quickly leads to over-fitting. Therefore, we adopt a CNN with fewer parameters, specifically LightCNN [37]. Since the inputs to the network are small images, we disable the first two max-pooling layers. After detecting the landmarks, training and testing images are aligned to canonical coordinates using an affine transformation. We train LightCNN models using the training split of the TinyFace dataset under the following settings:

Setting L1 L2 L3 L4 L5
top-1 31.17 35.11 39.03 39.87 43.82
(a)
Setting top-1 top-5 top-10 top-20 mAP
Baseline (ArcFace[9]) 34.71 44.82 49.01 53.70 0.32
I1 34.01 41.98 45.36 49.22 0.29
I2 45.04 56.30 60.11 63.71 0.43
I3 51.10 61.05 64.38 67.89 0.47
(b)
Table 2: Face identification performance on the TinyFace dataset under different settings: (a) LightCNN trained from scratch, (b) using Inception-ResNet pre-trained on MsCeleb-1M.

Setting L1: Train networks on generated LR images? We use the model trained under setting S2 from Section 4.1, in which the heatmap generator is trained on generated LR images in a supervised way using the loss function from Equation 10.

Setting L2: Does adversarial training help? We use the model trained under setting S3 (Section 4.1) to align the faces in the training and testing sets. In this setting, the heatmap networks are trained using a weighted combination of pixel and GAN losses from Equations 5, 6, 10 and 11.

Setting L3: Does a generator trained in an adversarial manner scale to real LR images? In this setting, the three networks are trained jointly in a semi-supervised way, using TinyFace training images as the real low resolution images. The TinyFace training and testing images are then aligned using the trained model in order to train the LightCNN model.

Setting L4: End-to-end training? Under this setting, we also train the High-to-Low networks, using the training images from the TinyFace dataset as real LR images. We reduce the amount of data augmentation in this case so that the generated images resemble TinyFace images. With the trained model thus obtained, landmarks are extracted and the images are aligned for LightCNN training.

Setting L5: End-to-end training with pre-trained weights? This setting is similar to setting L4 above, except that instead of training a LightCNN model from scratch, we initialize the weights from a model pre-trained on the CASIA-WebFace dataset.

Observation: Table 2(a) summarizes the results of the experiments under the settings discussed above. Although we observed a drop in landmark localization performance when training the three networks jointly (Table 1(b)), there is a significant gap in rank-1 performance between settings L2 and L3. This indicates that, with semi-supervised learning, the heatmap generator generalizes well to real LR data, and it validates our hypothesis of training the three networks together. Unsurprisingly, only an insignificant difference is seen between settings L3 and L4.

Experiments with a pre-trained network: Next, to further understand the implications of joint semi-supervised learning, we design another set of experiments. In these experiments, we use a pre-trained Inception-ResNet model, trained on MsCeleb-1M using ArcFace [9] and the Focal Loss [18]. This model expects a larger fixed input size, hence the images are resized after being aligned in low resolution. Using this pre-trained network, we perform the following experiments:

Setting top-1 top-5 top-10 top-20 mAP
A1 11.75 14.58 24.57 30.47 0.10
A2 26.21 34.76 39.03 43.99 0.24
Table 3: Face recognition performance using super-resolution before face-alignment

Baseline: For the baseline experiment, we follow the usual practice of re-scaling the images to a fixed size irrespective of resolution. We trained our own HR landmark detector (HR-LD) on AFLW images for this purpose. TinyFace gallery and probe images are resized and passed to the landmark detector. Using the predicted landmarks, images are aligned to canonical coordinates, similar to ArcFace [9]. Baseline performance was obtained by computing the cosine similarity between gallery and probe features extracted from the network after feeding the aligned images forward.
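A common way to implement this ArcFace-style alignment step is sketched below; the canonical 5-point template, output size and function name are assumptions rather than the exact values used here.

import cv2
import numpy as np
from skimage import transform as trans

# Hypothetical canonical 5-point template (eye centers, nose tip, mouth corners)
# for a 112x112 output crop; the actual template used in the paper is not given.
TEMPLATE = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                     [41.5, 92.4], [70.7, 92.2]], dtype=np.float32)

def align_face(img, landmarks, out_size=(112, 112)):
    """Warp a face to canonical coordinates with a similarity transform
    estimated from the 5 detected landmarks."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), TEMPLATE)
    M = tform.params[:2, :]                  # 2x3 affine matrix
    return cv2.warpAffine(img, M, out_size, borderValue=0)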

Setting I1: Does adversarial training help? The model trained for S3 (Section 4.1) is used to align the images directly in low resolution. Features for gallery and probe images are extracted after rescaling the aligned images, and the cosine distance is used to measure similarity and retrieve images from the gallery.

Setting I2: Does a generator trained in an adversarial manner scale to real LR images? For this experiment, the model trained for L3 in Section 4.3 is used for landmark detection in LR. To recall, in this setting the three models are trained jointly in a semi-supervised way (with the High-to-Low networks frozen), and TinyFace training images are used as real LR data.

Setting I3: End-to-end training? In this case, we align the images using the model from setting L4 in Section 4.3, where the High-to-Low networks are also trained using training images from the TinyFace dataset as real LR images. After training the High-to-Low model for 200 epochs, its weights are frozen in order to train the heatmap networks in a semi-supervised way.

Observation: Unsurprisingly, we observe from Table 2(b) that training the heatmap prediction networks in a semi-supervised manner, and aligning the images directly in low resolution, improves the performance of a face recognition system trained with HR images.

4.4 Additional Experiments

Setting A1: Does super-resolution help? The aim of this experiment is to understand whether super-resolution can be used to enhance image quality before landmark detection. We use SRGAN [17] to super-resolve the images before using the face alignment method of Bulat et al. [4] to align them.

Setting A2: Does super-resolution help? In this case, we use ESRGAN [34] to super-resolve the images before using HR-LD (described below) to align them.

Observation: It can be observed from Table 3 that the face recognition performance obtained after aligning super-resolved images is not on par even with the baseline. It can be hypothesized that the super-resolved images possibly do not represent the HR images with which [4] or HR-LD were trained.

High Resolution Landmark Detector (HR-LD): For this experiment, we train the heatmap generator on high resolution images from AFLW and 300W using the loss from Equation 10. We evaluate the performance of this network on the common benchmarks of the AFLW-Full and 300W test sets, shown in Table 4. We note that LAB [36] and SAN [10] use either extra data, extra annotations or a larger spatial resolution to train their deep networks. A few sample outputs are shown in Figure 8.

Method 300W-Common 300W-Challenge 300W-Full AFLW-Full
RCPR[7] 6.18 17.26 8.35 -
SDM[38] 5.57 15.40 7.52 5.43
CFAN[41] 5.50 16.78 7.69 -
LBF[27] 4.95 11.98 6.32 4.25
CFSS[46] 4.73 9.98 5.76 3.92
TCDCN[44] 4.80 8.60 5.54 -
MDM[33] 4.83 10.14 5.88 -
PCD-CNN[16] 3.67 7.62 4.44 2.36
SAN[10]* 3.34 6.60 3.98 1.91
LAB[36]* 2.57 4.72 2.99 1.85
HR-LD 3.60 7.301 4.325 1.753
Table 4: Comparison of the proposed method with other state-of-the-art methods on the AFLW (Full) and 300W test sets. The NMEs for the comparison on 300W are taken from Table 3 of [20]. In this case, the heatmap generator is trained in a supervised manner using high resolution images. * uses extra annotations or data.
Figure 8: Sample outputs obtained by training with HR images. The first row shows samples from the AFLW test set. The second row shows sample images from the 300W test set. The last two columns of the second row show outputs from the challenging subset of 300W.

5 Conclusion

In this paper, we first present an analysis of landmark detection methods applied to LR images, and of the implications for face recognition. We then describe the proposed method for predicting landmarks directly on LR images. We show that the proposed method improves face recognition performance over the commonly used practices of rescaling and super-resolution. As a by-product, we also developed a simple but state-of-the-art landmark detection network for HR images. Although low resolution is chosen here as the source of degradation, the method can trivially be extended to capture other degradations in the imaging process, such as motion blur or atmospheric turbulence. In addition, the proposed method can be applied to detect human keypoints in LR images in order to improve skeletal action recognition. In the era of deep learning, LR landmark detection and face recognition are fairly untouched topics; we believe this work will open new avenues in this direction.

6 Acknowledgment

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References