accuracy or achieve detection, segmentation and pose estimation results up to subpixel accuracy. These are only a few of the many tasks which have seen significant performance improvements in the last five years. However, CNN-based methods assume access to good quality images. The ImageNet, COCO, CASIA, 300W and MPII datasets all consist of high resolution images. As a result of domain shift, much lower performance is observed when networks trained on these datasets are applied to images which have suffered degradation due to intrinsic or extrinsic factors. In this work, we address landmark localization in low resolution images. Although we use face images, the proposed method is also applicable to other tasks, such as human pose estimation. Throughout this paper we use HR and LR to denote high and low resolution respectively.
Facial landmark localization, also known as keypoint or fiducial detection, refers to the task of detecting specific points, such as eye corners and the nose tip, on a face image. The detected keypoints are used to align images to canonical coordinates, which are then used as inputs to different convolutional networks. It has been experimentally shown in [bansal2017dosanddonts] that accurate face alignment leads to improved performance in face verification. Though great strides have been made in this direction, mainly addressing large-pose face alignment, landmark localization for low resolution images still remains an understudied problem, mostly because of the absence of large scale labeled datasets. To the best of our knowledge, this work is the first to address landmark localization directly on low resolution images.
Main motivation: In Figure 1, we examine the scenarios currently practiced when low resolution images are encountered. Figure 1 shows the predicted landmarks when the input is a LR image of size less than pixels. Typically, landmark detection networks are trained with crops of HR images taken from the AFLW and 300W datasets. During inference, an incoming image is rescaled to , irrespective of its resolution. We deploy two methods, MTCNN and Bulat et al., which have detection and localization built into a single system. In Figures 1(a) and (b) we see that these networks fail to detect a face in the given image. Figure 1(c) shows the output when a network trained on high resolution images is applied to a rescaled low resolution one. It is important to note that this trained network, our high resolution landmark detector HR-LD (detailed in Section 4.4), achieves state-of-the-art performance on the AFLW and 300W test sets. A possible workaround is to train a network on sub-sampled images as a substitute for low resolution images. Figure 1(d) shows the output of one such network. It is evident from these experiments that networks trained with HR or subsampled images are not effective for real life LR images. It can also be concluded that subsampled images fail to capture the distribution of real LR images.
Super-resolution is widely used to resolve LR images to reveal more details. Significant developments have been made in this field and methods based on encoder-decoder architectures and GANs
have been proposed. We employ two recent deep learning based methods, SRGAN and ESRGAN, to resolve given LR images. It is worth noting that the training data for these networks also include face images. Figure 1(e) shows the result when the super-resolved image is passed through HR-LD. It can be hypothesized that the super-resolved images do not lie in the same space as the images with which HR-LD was trained. Super-resolution networks are trained using synthetic low resolution images obtained by downsampling after Gaussian smoothing; in some cases, their training data consist of paired low and high resolution images. Neither scenario holds in real life situations.
Main Idea: Different from these approaches, the proposed method is based on the concept of ‘generate to adapt’. This work aims to show that landmark localization in LR images is not only feasible, but also improves performance over current practice. To this end, we first train a deep network which generates LR images from HR images and tries to model the distribution of real LR images in pixel space. Since there is no publicly available dataset containing low resolution images along with landmark annotations, we take a semi-supervised approach to landmark detection. We train an adversarial landmark localization network on the generated LR images, thereby switching the roles of generated and real LR images. Heatmaps predicted for unlabeled LR images are also included in the inputs of the discriminators. The adversarial training procedure is designed so that, in order to fool the discriminators, the heatmap generator has to learn the structure of the face even in low resolution. We perform an extensive set of experiments explaining all the design choices. In addition, we also propose a new state-of-the-art landmark detector for HR images.
2 Related Work
Being one of the most important pre-processing steps in face analysis tasks, facial landmark detection has been a topic of immense interest among computer vision researchers. We briefly discuss some of the methods based on Convolutional Neural Networks (CNNs). Different algorithms have been proposed in the recent past, such as the direct regression approaches of MTCNN by Zhang et al. and KEPLER by Kumar et al. The convolutional neural networks in MTCNN and KEPLER act as non-linear regressors and learn to directly predict the landmarks. Both works are designed to predict other attributes along with keypoints, such as 2D pose, visibility of keypoints and gender. HyperFace by Ranjan et al. has shown that learning multiple tasks in a single network does in fact improve the performance of the individual tasks. Recently, encoder-decoder architectures have become popular and have been used intensively in tasks which require per-pixel labeling, such as semantic segmentation [25, 28] and keypoint detection [16, 1, 41, 15]. Despite significant progress in this field, predicting landmarks on low resolution faces remains a relatively unexplored topic. All of the works mentioned above are trained on high quality images and their performance degrades on LR images.
One of the most closely related works is Super-FAN by Bulat et al., which attempts to predict landmarks on LR images via super-resolution. However, as shown in the experiments in Section 4.3, face recognition performance degrades even on super-resolved images. This necessitates that super-resolution, face alignment and face recognition be learned in a single model trained end to end, making it not only slow in inference but also limited by GPU memory constraints. The proposed work differs from  in many respects, as it needs labeled data only in HR and learns to predict landmarks in LR images in an unsupervised way. Due to adversarial training, the network not only acts as a facial parts detector but also learns the inherent structure of the facial parts. The proposed method makes the pre-processing task faster and independent of face verification training. During inference, only the heatmap generator network is used; it is based on the fully convolutional U-Net architecture and works at a spatial resolution of , making the alignment process real time.
3 Proposed Method
The proposed work predicts landmarks directly on a low resolution image of spatial size less than pixels. We show that predicting landmarks directly in low resolution is more effective than the current practices of rescaling or super-resolution. The entire pipeline can be divided into two stages: (a) generation of LR images in an unpaired manner; (b) generation of heatmaps for real LR images in a semi-supervised fashion. A diagrammatic overview of the proposed approach is shown in Figure 2. Being a semi-supervised method, it is important to first describe the datasets chosen for the ablative study.
High Resolution Dataset: We construct the HR dataset by combining the training images from AFLW and the entire 300W dataset. We divide the Widerface dataset, which consists of images at different resolutions captured under diverse conditions, into two groups based on spatial size. The first group consists of images with spatial size between and , whereas the second group consists of images with more than pixels. We add the second group to the HR training set, resulting in a total of HR faces. The remaining images from AFLW are used as validation images for the ablative study and as the test set for the landmark localization task. Although generation of LR images is an unpaired task, we use AFLW and 300W images for training, as the generated LR images from these datasets are used for semi-supervised learning in the second step.
Low Resolution Dataset: The first group from Widerface dataset consists of faces and is used as real or target low resolution images.
3.1 High to Low Generator and Discriminator
High to low generator , shown in Figure 3
is designed following the encoder-decoder architecture, where both the encoder and the decoder consist of multiple residual blocks. The input to the first convolution layer is the HR image concatenated with the noise vector, which has been projected using a fully connected layer and reshaped to match the input size. Similar architectures have also been used in [6, 17]. The encoder in the generator consists of eight residual blocks, each followed by a convolution layer to increase dimensionality. Max-pooling is used to decrease the spatial resolution to , for a high resolution image of pixels. The decoder is composed of six residual units followed by convolution layers to reduce the dimensionality. Finally, one convolution layer is added to output a three channel image. BatchNorm is used after every convolution layer.
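A heavily simplified sketch of such an encoder-decoder generator in PyTorch, with the noise vector projected by a fully connected layer and concatenated with the HR input. Block counts, channel widths and the 64-to-16 resolution change are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block with BatchNorm after every convolution layer."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class HighToLowGenerator(nn.Module):
    """Sketch: HR image + projected noise -> LR image (widths are illustrative)."""
    def __init__(self, hr_size=64, noise_dim=64, ch=32):
        super().__init__()
        self.hr_size = hr_size
        # Project the noise with a fully connected layer, reshape to an image plane.
        self.fc = nn.Linear(noise_dim, hr_size * hr_size)
        self.head = nn.Conv2d(3 + 1, ch, 3, padding=1)
        # Encoder: residual blocks with max-pooling to shrink the resolution 4x.
        self.encoder = nn.Sequential(
            ResBlock(ch), nn.MaxPool2d(2),
            ResBlock(ch), nn.MaxPool2d(2))
        # Decoder: residual blocks, then a final conv producing a 3-channel LR image.
        self.decoder = nn.Sequential(ResBlock(ch), ResBlock(ch),
                                     nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, hr, z):
        b = hr.size(0)
        noise_plane = self.fc(z).view(b, 1, self.hr_size, self.hr_size)
        x = self.head(torch.cat([hr, noise_plane], dim=1))
        return self.decoder(self.encoder(x))

g = HighToLowGenerator()
lr = g(torch.randn(2, 3, 64, 64), torch.randn(2, 64))
print(lr.shape)  # torch.Size([2, 3, 16, 16])
```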
The discriminator , shown in Figure 3, is constructed similarly, except that max-pooling is used only in the last three layers, since the inputs to the discriminator are low resolution images. Referring to Figure 2, we use for input high resolution images of size , for generated LR images of size , and for real LR images of the same size.
We train the High to Low generator using a weighted combination of GAN loss and pixel loss. The loss is used to encourage convergence in the initial training iterations. The final loss is summarized in Equation 1.
where and are hyperparameters which are empirically set following
. Following recent developments in GANs, we experimented with different loss functions; we use the combination of hinge loss and Spectral Normalization due to faster training. The hinge loss for the generative networks is defined in Equation 2:
where is the distribution of real LR images from Widerface dataset, and is the distribution of generated images .
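The hinge objectives of Equation 2 take the following standard form (a sketch; function and symbol names are ours, not the paper's):

```python
import torch

def d_hinge_loss(d_real, d_fake):
    # Discriminator hinge loss: push scores on real images above +1
    # and scores on generated images below -1.
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    # Generator hinge loss: raise the discriminator's score on generated images.
    return -d_fake.mean()

d_real = torch.tensor([2.0, 0.5])
d_fake = torch.tensor([-2.0, 0.5])
print(d_hinge_loss(d_real, d_fake).item())  # (0 + 0.5)/2 + (0 + 1.5)/2 = 1.0
print(g_hinge_loss(d_fake).item())          # -mean([-2, 0.5]) = 0.75
```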
The weights of the discriminator are normalized in order to satisfy the Lipschitz constraint , shown in Equation 3:
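In PyTorch, this normalization is available as the built-in `torch.nn.utils.spectral_norm` wrapper, which rescales a layer's weight by its largest singular value estimated via power iteration. A minimal sketch (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Spectral normalization divides the weight by its top singular value,
# enforcing the Lipschitz constraint on the discriminator layer.
conv = nn.utils.spectral_norm(nn.Conv2d(3, 64, 3, padding=1))

x = torch.randn(1, 3, 16, 16)
for _ in range(20):   # each forward pass refines the power-iteration estimate
    y = conv(x)

# After several forward passes, the normalized weight's top singular value
# should be close to 1 (power iteration is approximate).
w = conv.weight.detach().reshape(64, -1)
sigma = torch.linalg.svdvals(w)[0].item()
print(sigma)  # ~1.0
```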
Finally, the pixel loss described in Equation 4 minimizes the distance between the generated and subsampled images. This loss ensures that image content is not lost during the generation process.
where the operation is implemented as a sub-sampling operation obtained by passing through four average pooling layers.
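The subsampling operator and the pixel loss can be sketched as follows (an L2 pixel loss via `mse_loss` is assumed; the loss weighting from Equation 1 is omitted):

```python
import torch
import torch.nn.functional as F

def subsample(hr):
    """Downsample a HR image 16x via four successive 2x average-pooling layers."""
    x = hr
    for _ in range(4):
        x = F.avg_pool2d(x, kernel_size=2)
    return x

def pixel_loss(generated_lr, hr):
    # Distance between the generated LR image and the subsampled HR image;
    # keeps the generator from discarding image content.
    return F.mse_loss(generated_lr, subsample(hr))

hr = torch.randn(1, 3, 64, 64)
print(subsample(hr).shape)  # torch.Size([1, 3, 4, 4])
```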
Figure 4 shows some sample LR images generated from the network .
3.2 Semi-Supervised Landmark Localization
3.2.1 Heatmap Generator
The keypoint heatmap generator, shown in Figure 5, produces heatmaps corresponding to N (in our case or ) keypoints in a given image. As mentioned earlier, the objective of this paper is to show that landmark prediction directly on a LR image is feasible even in the absence of labeled LR data, and to evaluate the performance of auxiliary tasks compared to the common practices of rescaling or super-resolution. Keeping this in mind, we choose a simple network based on the U-Net architecture as the heatmap generator, instead of computationally intensive stacked hourglass networks or CPMs. The network consists of 16 residual blocks, with eight in the encoder and eight in the decoder. The eight encoder blocks are divided into four groups of two blocks each, and the spatial resolution is halved after each group using max pooling. The heatmap generator outputs (N+1) feature maps, corresponding to N keypoints and one background channel. After experimentation, this design has proven to be very effective and yields state-of-the-art landmark predictions when trained with HR images (see Section 4.3).
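Ground-truth supervision for such a generator is typically rendered as one Gaussian heatmap per keypoint plus a background channel. A minimal sketch (the sigma value and the background-channel definition are assumptions):

```python
import math

def render_heatmaps(keypoints, h, w, sigma=1.5):
    """Render (N+1) heatmaps: one Gaussian per keypoint plus a background
    channel. sigma and the background formula are illustrative choices."""
    maps = []
    for (cx, cy) in keypoints:
        hm = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
               for x in range(w)] for y in range(h)]
        maps.append(hm)
    # Background channel: 1 minus the strongest keypoint response at each pixel.
    bg = [[1.0 - max(m[y][x] for m in maps) for x in range(w)] for y in range(h)]
    maps.append(bg)
    return maps

hms = render_heatmaps([(3, 4), (10, 2)], h=16, w=16)
print(len(hms))      # 3 channels for N=2 keypoints
print(hms[0][4][3])  # 1.0: peak at the first keypoint location
```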
3.2.2 Heatmap Discriminator
The heatmap discriminator
follows the same architecture as the heatmap generator. However, the input to the discriminator is a set of heatmaps concatenated with their respective color images. This discriminator predicts another set of heatmaps and learns whether the keypoints described by the input heatmaps are correct and correspond to the face in the input image. The quality of the output heatmaps is determined by their similarity to the input heatmaps, following the notion of an autoencoder. The loss is computed as the error between the input and the reconstructed heatmaps.
3.2.3 Heatmap Confidence Discriminator
The architecture of the heatmap confidence discriminator is identical to the one used in the high to low discriminator, except that the input is an LR image concatenated with the heatmaps. This discriminator receives three inputs: a generated LR image with its groundtruth heatmaps, a generated LR image with predicted heatmaps, and a real LR image with predicted heatmaps. It learns to distinguish between groundtruth and predicted heatmaps. To fool this discriminator, the generator must produce heatmaps which are as realistic or feasible (for unlabeled real LR images) as possible. The loss propagated from this discriminator pushes the generator to predict accurate heatmaps not only for images with groundtruth annotations but also for images without them. This in turn enables the generator to understand the structure of the face in a given image and make accurate predictions.
Switching roles of generated and real images: During training of this part of the system, the roles of generated and real low resolution images are switched. While training the High to Low discriminator , the generated LR images are considered fake, so that the generator tries to generate LR images that are as realistic as possible. It is worth recalling that the HR images have annotations associated with them. We assume that keypoint locations in a generated LR image stay roughly the same as in its downsampled version. Therefore, while training , the downsampled annotations are taken as groundtruth for the generated LR images, and the networks are trained to predict heatmaps as close to this groundtruth as possible in order to fool the discriminators and . thus learns to predict accurate keypoints for real LR images by learning from generated LR images; hence the switching of roles.
3.3 Semi-supervised Learning
The learning process of this setup is inspired by the seminal works of Berthelot et al. in  and LeCun et al. in  on Energy-based GANs. The discriminator receives two sets of inputs: generated LR images with downsampled groundtruth heatmaps, and generated LR images with predicted heatmaps. When the input consists of groundtruth heatmaps, the discriminator is trained to recognize them and reconstruct similar ones, i.e., to minimize the error between the groundtruth heatmaps and the reconstructed ones. On the other hand, if the input consists of generated heatmaps, the discriminator is trained to reconstruct different heatmaps, i.e., to drive the error between the generated heatmaps and the reconstructed heatmaps as large as possible. The losses are expressed as
where represents the heatmap of a given image, constructed by placing a Gaussian with centered at the keypoint location . Inspired by Berthelot et al. in , we use a variable to control the balance between the heatmap generator and discriminator. The variable is updated every iterations. The adaptive term is defined by:
where is bounded between and , and
is a hyperparameter. As in Equation 7, controls the emphasis on . When the generator is able to fool the discriminator, becomes smaller than . As a result, increases, making the term dominant. The acceleration of training on is adjusted proportionally to , i.e., the distance by which the discriminator falls behind the generator. Similarly, when the discriminator gets better than the generator, decreases to slow down the training on , letting the generator and the discriminator train together.
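A BEGAN-style balance update of this kind can be sketched as follows. The variable names `k`, `lambda_k` and `gamma` are assumptions standing in for the paper's elided symbols:

```python
def update_k(k, lambda_k, gamma, loss_real, loss_fake):
    """BEGAN-style balance update (names are placeholders for the paper's
    symbols). loss_real: reconstruction error on ground-truth heatmaps;
    loss_fake: reconstruction error on generated heatmaps."""
    k = k + lambda_k * (gamma * loss_real - loss_fake)
    return min(max(k, 0.0), 1.0)  # the balance term is clamped to [0, 1]

# When the generator fools the discriminator (small fake error), k grows,
# putting more weight on the adversarial term for the discriminator.
k = 0.5
k = update_k(k, lambda_k=0.001, gamma=0.5, loss_real=1.0, loss_fake=0.2)
print(k)  # ≈ 0.5003
```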
The discriminator is trained using the loss function from Least Squares GAN, as shown in Equation 9. This loss was chosen to be consistent with the losses computed by , which are also losses.
It is worth noting that in this case represents the groundtruth-heatmap distribution on generated LR images, while represents the distribution of heatmaps generated for both generated LR images and real LR images.
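The least-squares GAN objective referenced in Equation 9 can be sketched as follows (raw discriminator scores and 0/1 target labels; function names are ours):

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Least-squares GAN: real scores are pulled toward 1, fake scores toward 0.
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # The generator pulls the discriminator's score on fakes toward 1.
    return 0.5 * ((d_fake - 1.0) ** 2).mean()

d_real = torch.tensor([1.0, 0.0])
d_fake = torch.tensor([0.0, 1.0])
print(lsgan_d_loss(d_real, d_fake).item())  # 0.5*0.5 + 0.5*0.5 = 0.5
```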
The generator is trained using a weighted combination of losses from the discriminators and , and the heatmap loss. The loss functions for the generator are described in the following equations:
where and are hyperparameters set empirically, obeying . We put more emphasis on to encourage convergence of the model in the initial iterations. Some real LR images with keypoints predicted by the heatmap generator are shown in Figure 6.
4 Experiments and Results
4.1 Ablation Experiments
We experimentally demonstrated in Section 1 (Figure 1) that networks trained on HR images perform poorly on LR images. Therefore, we propose the semi-supervised approach described in Section 3. Given the networks and loss functions above, it is important to understand the implication of each component; this section examines each design choice quantitatively. To this end, we first train the high to low resolution networks and generate LR versions of the AFLW test images. In the absence of real LR images with annotated landmarks, this creates a substitute low resolution dataset with annotations on which localization performance can be evaluated. We also generate subsampled versions of the AFLW train and test sets using average pooling after Gaussian smoothing. Data augmentation techniques such as random scaling , random rotation () and random translation up to pixels are used.
Evaluation Metric: Following most previous works, we obtain the error for each test sample by averaging the normalized errors over all annotated landmarks. For AFLW, the error is normalized by the ground truth bounding box size over all visible points, whereas for 300W the error is normalized by the inter-pupil distance. Wherever applicable, NRMSE stands for Normalized Root Mean Square Error.
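The per-sample error can be sketched as follows, with the normalizer being the bounding box size (AFLW) or the inter-pupil distance (300W); the data layout is an assumption:

```python
import math

def nrmse(pred, gt, visible, norm):
    """Mean normalized error over visible landmarks.
    pred/gt: [(x, y), ...]; visible: [bool, ...]; norm: bounding box size
    (AFLW) or inter-pupil distance (300W)."""
    errs = [math.dist(p, g) / norm
            for p, g, v in zip(pred, gt, visible) if v]
    return sum(errs) / len(errs)

pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 1.0), (0.0, 0.0)]
print(nrmse(pred, gt, [True, True], norm=10.0))  # (0.1 + 0.5) / 2 = 0.3
```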
All the networks are trained in PyTorch using the Adam optimizer, with an initial learning rate of and values of . We train the networks with a batch size of for epochs, dropping the learning rate by after and epochs.
Setting S1: Train networks on subsampled images? We train only the network with the subsampled AFLW training images, using the loss function in Equation 10, and evaluate the performance on the generated LR AFLW test images.
Setting S2: Train networks on generated LR images? In this experiment, we train the network using generated LR images, in a supervised way using the loss function from Equation 10. We again evaluate the performance on generated LR AFLW test images.
Observation: From the results summarized in Table 0(b), it is evident that there is a significant reduction in localization error when is trained on generated LR images, validating our hypothesis that the subsampled images on which many super-resolution networks are trained may not be representative of real LR images. Hence, we need to train the networks on realistic LR images.
Setting S3: Does adversarial training help? This question is asked in order to understand the importance of training the heatmap generator in an adversarial way. In this experiment, we train and using the losses in Eqs 5, 6, 10, 11. Metrics are calculated on the generated LR AFLW test images and compared against the experimental setting mentioned in S2 above.
Setting S4: Does trained in an adversarial manner scale to real LR images? In this experiment, we wish to examine whether training the networks and jointly improves the performance on real LR images from the Widerface dataset (see Section 3 for dataset details).
Observation: From Table 0(b), we observe that the network trained with setting S3 performs marginally better than setting S4. However, since no keypoint annotations are available for the Widerface dataset, conclusions cannot be drawn from this drop in performance. Hence, in Section 4.3, we investigate this phenomenon indirectly, by aligning faces using the models from settings S3 and S4 and evaluating face recognition performance.
4.2 Experiments on Low Resolution images
We choose to perform a direct comparison on a real LR dataset against two recent state-of-the-art methods, Style Aggregated Networks and HRNet. To create a real LR landmark detection dataset, which we call Annotated LR Faces (ALRF), we randomly selected identities from the TinyFace dataset, from which one LR image (less than pixels and more than pixels) per identity was randomly selected, resulting in a total of LR images. Next, three individuals were asked to manually annotate all the images with 5 landmarks (two eye centers, nose tip and mouth corners) in MTCNN style, where invisible points were annotated with . The mean of the points from the three annotators was taken as the groundtruth. As per convention, we used NRMSE, averaged over all visible points and normalized by the face size, as the comparison metric. Table 0(a) shows the results of this experiment. We also report the forward-pass time for one image on a single GTX 1080. Without loss of generality, the results can be extrapolated to other existing works, as  and  are currently state of the art. MTCNN, which has detection and alignment in a single system, was able to detect only faces out of the test images.
4.3 Face Recognition experiments
In the previous section, we performed ablative studies on the generated LR AFLW images. Although convenient to quantify the performance, it does not uncover the importance of training three networks jointly in a semi-supervised way. Therefore, in this section, we choose to evaluate the models from setting S3 and setting S4 (Section 4.1), by comparing the statistics obtained by applying the two models to align face images for face recognition task.
We use the recently published and publicly available TinyFace dataset for our experimental evaluation. It is one of the very few datasets aimed at understanding LR face recognition and consists of labeled facial identities with an average of three face images per identity, giving a total of LR face images (average pixels). All the LR faces in TinyFace are collected from the web (PIPA and MegaFace2) across diverse imaging scenarios, captured under uncontrolled viewing conditions in pose, illumination, occlusion and background. The known identities are divided into two splits: for training and the remaining for testing.
Evaluation Protocol: In order to compare model performances, we adopt the closed-set face identification (1:N matching) protocol. Specifically, the task is to match a given probe face against a gallery of enrolled face images, with the true match from the gallery at top-1 of the ranking list. For each test class, half of the face images are randomly assigned to the probe set and the remaining to the gallery set. For the purposes of this paper, we drop the distractor set, as it does not divulge new information while significantly slowing down the evaluation process. For face recognition evaluation, we report Top-k (k=1, 5, 10, 20) accuracy and mean average precision (mAP).
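Top-k identification accuracy over ranked gallery lists can be computed as in this sketch (the data layout, one ranked identity list per probe, is an assumption):

```python
def rank_k_accuracy(ranked_gallery_ids, probe_ids, k):
    """Closed-set identification: a probe counts as correct at rank k if its
    true identity appears in the top-k of its ranked gallery list."""
    hits = sum(pid in ranking[:k]
               for ranking, pid in zip(ranked_gallery_ids, probe_ids))
    return hits / len(probe_ids)

# Two probes: the first matches at rank 1, the second only at rank 3.
rankings = [["a", "b", "c"], ["b", "c", "a"]]
probes = ["a", "a"]
print(rank_k_accuracy(rankings, probes, k=1))  # 0.5
print(rank_k_accuracy(rankings, probes, k=5))  # 1.0
```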
Experiments with a network trained from scratch: Since the number of images in the TinyFace dataset is much smaller than in larger datasets such as CASIA or MsCeleb-1M, we observed that training a very deep model like Inception-ResNet quickly leads to over-fitting. Therefore, we adopt a CNN with fewer parameters, specifically LightCNN. Since inputs to the network are images of size , we disable the first two max-pooling layers. After detecting the landmarks, training and testing images are aligned to canonical coordinates using an affine transformation. We train layer LightCNN models using the training split of the TinyFace dataset under the following settings:
Setting L1: Train networks on generated LR images? We use the model trained under setting S2 from Section 4.1, in which the network is trained on generated LR images in a supervised way, using the loss function from Equation 10.
Setting L2: Does adversarial training help? We use the model trained from setting S3 (section 4.1) to align the faces in training and testing sets. In this setting networks and are trained using a weighted combination of pixel loss and GAN losses from Equations 5, 6, 10, 11.
Setting L3: Does trained in adversarial manner scale to real LR images? In this setting, networks , and are trained jointly in a semi-supervised way. We use Tinyface training images as real low resolution images. Later, Tinyface training and testing images are aligned using the trained model for training LightCNN model.
Setting L4: End-to-end training? Under this setting, we also train the High to Low networks and , using the training images from the Tinyface dataset as real LR images. We reduce the amount of data augmentation in this case to resemble Tinyface images. With the resulting trained model, landmarks are extracted and images are aligned for LightCNN training.
Setting L5: End-to-end training with pre-trained weights? This setting is similar to the setting L4 above, except instead of training a LightCNN model from scratch we initialize the weights from a pre-trained model, trained with CASIA-Webface dataset.
Observation: Table 1(a) summarizes the results of the experiments under the settings discussed above. Although we observed a drop in landmark localization performance when training the three networks jointly (Table 0(b)), there is a significant gap in rank-1 performance between settings L2 and L3. This indicates that with semi-supervised learning generalizes well to real LR data, and hence validates our hypothesis of training , and together. Unsurprisingly, an insignificant difference is seen between settings L3 and L4.
Experiments with a pre-trained network: Next, to further understand the implications of joint semi-supervised learning, we design another set of experiments. In these experiments, we use a pre-trained Inception-ResNet model, trained on MsCeleb-1M using ArcFace and Focal Loss. This model expects an input of size pixels; hence the images are resized after alignment in low resolution. Using this pre-trained network, we perform the following experiments:
Baseline: For the baseline experiment, we follow the usual practice of rescaling the images to a fixed size irrespective of resolution. We trained our own HR landmark detector (HR-LD) on AFLW images for this purpose. Tinyface gallery and probe images are resized to and used as inputs to the landmark detector. Using the predicted landmarks, images are aligned to canonical coordinates similar to ArcFace.
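Alignment to canonical coordinates maps detected landmarks onto fixed template points. One common way to do this, shown here as a sketch rather than the exact transform used in the paper, is a closed-form least-squares similarity (scale + rotation + translation) fit:

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping detected landmarks `src`
    onto canonical template points `dst`.
    Returns (a, b, tx, ty) with x' = a*x - b*y + tx, y' = b*x + a*y + ty."""
    n = len(src)
    mx = sum(p[0] for p in src) / n; my = sum(p[1] for p in src) / n
    mu = sum(p[0] for p in dst) / n; mv = sum(p[1] for p in dst) / n
    num_a = num_b = den = 0.0
    for (x, y), (u, v) in zip(src, dst):
        xc, yc, uc, vc = x - mx, y - my, u - mu, v - mv
        num_a += xc * uc + yc * vc   # cosine-like (scale * cos) term
        num_b += xc * vc - yc * uc   # sine-like (scale * sin) term
        den += xc * xc + yc * yc
    a, b = num_a / den, num_b / den
    return a, b, mu - a * mx + b * my, mv - b * mx - a * my

# Sanity check: a 90-degree rotation about the origin, (x, y) -> (-y, x),
# should be recovered as a = cos(90) = 0, b = sin(90) = 1, zero translation.
a, b, tx, ty = fit_similarity([(0, 0), (1, 0), (0, 1)],
                              [(0, 0), (0, 1), (-1, 0)])
print(round(a, 6), round(b, 6))  # 0.0 1.0
```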
Setting I1: Does adversarial training help? The model trained for S3 (Section 4.1) is used to align the images directly in low resolution. Features for gallery and probe images are extracted after rescaling the images; cosine distance is used to measure similarity and retrieve images from the gallery.
Setting I2: Does trained in adversarial manner scale to real LR images? For this experiment, the model trained for L3 in Section 4.3 is used for landmark detection in LR. To recall, in this setting, the three models , and (with and frozen) are trained jointly in a semi-supervised way and Tinyface training images are used as real LR data for .
Setting I3: End-to-end training? In this case, we align the images using the model from setting L4 in Section 4.3, in which the High to Low networks ( and ) are also trained using training images from the Tinyface dataset as real LR images. After training this model for 200 epochs, its weights are frozen in order to train and in a semi-supervised way.
Observation: Unsurprisingly, we observe from Table 1(b) that training the heatmap prediction networks in a semi-supervised manner, and aligning the images directly in low resolution, improves the performance of a face recognition system trained with HR images.
4.4 Additional Experiments:
Setting A1: Does super-resolution help? The aim of this experiment is to understand whether super-resolution can enhance image quality before landmark detection. We use SRGAN to super-resolve the images before aligning them with the face alignment method of Bulat et al. .
Setting A2: Does super-resolution help? In this case, we use ESRGAN to super-resolve the images before aligning them with HR-LD (below).
Observation: It can be observed from Table 3 that the face recognition performance obtained after aligning super-resolved images is not on par even with the baseline. It can be hypothesized that super-resolved images do not represent the HR images with which  or HR-LD were trained.
High Resolution Landmark Detector (HR-LD): For this experiment, we train on high resolution images of size (for AFLW and 300W) using the loss from Equation 10. We evaluate the performance of this network on the common benchmarks of the AFLW-Full and 300W test sets, shown in Table 4. We note that LAB and SAN use extra data, extra annotations or a larger spatial resolution to train their deep networks. A few sample outputs are shown in Figure 8.
In this paper, we first present an analysis of landmark detection methods when applied to LR images, and the implications for face recognition. We then present the proposed method for predicting landmarks directly on LR images. We show that the proposed method improves face recognition performance over the common practices of rescaling and super-resolution. As a by-product, we also developed a simple but state-of-the-art landmark detection network. Although low resolution is chosen as the source of degradation, the method can trivially be extended to capture other degradations in the imaging process, such as motion blur or atmospheric turbulence. In addition, the proposed method can be applied to detect human keypoints in LR images in order to improve skeletal action recognition. In the era of deep learning, LR landmark detection and face recognition is a fairly untouched topic; we believe this work will open new avenues in this direction.
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
-  A recurrent autoencoder-decoder for sequential face alignment. http://arxiv.org/abs/1608.05477. Accessed: 2016-08-16.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
-  D. Berthelot, T. Schumm, and L. Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.
-  A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, volume 1, page 8, 2017.
-  A. Bulat and G. Tzimiropoulos. Super-fan: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. CoRR, abs/1712.02765, 2017.
-  A. Bulat, J. Yang, and G. Tzimiropoulos. To learn image super-resolution, use a gan to learn how to do image degradation first. European Conference on Computer Vision, 2018.
-  X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In ICCV, pages 1513–1520, 2013.
-  Z. Cheng, X. Zhu, and S. Gong. Low-resolution face recognition. CoRR, abs/1811.08965, 2018.
-  J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. CoRR, abs/1801.07698, 2018.
-  X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Style aggregated network for facial landmark detection. In CVPR, pages 379–388, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
-  Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. CoRR, abs/1607.08221, 2016.
-  M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
-  A. Kumar, A. Alavi, and R. Chellappa. Kepler: Keypoint and pose estimation of unconstrained faces by learning efficient h-cnn regressors. In 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pages 258–265, May 2017.
-  A. Kumar and R. Chellappa. A convolution tree with deconvolution branches: Exploiting geometric relationships for single shot keypoint detection. CoRR, abs/1704.01880, 2017.
-  A. Kumar and R. Chellappa. Disentangling 3d pose in A dendritic CNN for unconstrained 2d face alignment. CoRR, abs/1802.06713, 2018.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.
-  T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
-  J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. CoRR, abs/1802.05957, 2018.
-  A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. CoRR, abs/1705.00393, 2017.
-  A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation, pages 483–499. Springer International Publishing, Cham, 2016.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. arXiv preprint arXiv:1505.04366, 2015.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249, 2016.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 FPS via regressing local binary features. In CVPR, pages 1685–1692, 2014.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In 2013 IEEE International Conference on Computer Vision Workshops, pages 397–403, Dec 2013.
-  K. Sun, B. Xiao, D. Liu, and J. Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
-  G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, Las Vegas, USA, June 2016.
-  X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang. ESRGAN: enhanced super-resolution generative adversarial networks. CoRR, abs/1809.00219, 2018.
-  S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. CoRR, abs/1602.00134, 2016.
-  W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.
-  X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. CoRR, abs/1511.02683, 2015.
-  X. Xiong and F. De la Torre. Supervised descent method and its application to face alignment. In CVPR, 2013.
-  S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. CoRR, abs/1411.7923, 2014.
-  J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, volume 8690 of Lecture Notes in Computer Science, pages 1–16. Springer International Publishing, 2014.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
-  N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. D. Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. CoRR, abs/1501.05703, 2015.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, pages 94–108, 2014.
-  J. J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. CoRR, abs/1609.03126, 2016.
-  S. Zhu, C. Li, C. C. Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In CVPR, June 2015.