Joint Super-Resolution and Alignment of Tiny Faces

by   Yu Yin, et al.
Northeastern University

Super-resolution (SR) and landmark localization of tiny faces are highly correlated tasks. On the one hand, landmark localization could obtain higher accuracy with faces of high-resolution (HR). On the other hand, face SR would benefit from prior knowledge of facial attributes such as landmarks. Thus, we propose a joint alignment and SR network to simultaneously detect facial landmarks and super-resolve tiny faces. More specifically, a shared deep encoder is applied to extract features for both tasks by leveraging complementary information. To exploit the representative power of the hierarchical encoder, intermediate layers of a shared feature extraction module are fused to form efficient feature representations. The fused features are then fed to task-specific modules to detect landmarks and super-resolve face images in parallel. Extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art in both landmark localization and SR of faces. We show a large improvement for landmark localization of tiny faces (i.e., 16*16). Furthermore, the proposed framework yields comparable results for landmark localization on low-resolution (LR) faces (i.e., 64*64) to existing methods on HR (i.e., 256*256). As for SR, the proposed method recovers sharper edges and more details from LR face images than other state-of-the-art methods, which we demonstrate qualitatively and quantitatively.





Automatic face understanding is critical for problems in human perception (e.g., super-resolution (SR) [34], visual understanding [6], and style transfer [17]) and applied machine vision (e.g., landmark localization [22], identity recognition [31], and face detection [36]). Modern-day models for face-based tasks tend to break down when applied to low-resolution (LR) images. In practice, face-based systems are frequently confronted with such scenarios (e.g., LR cameras used for surveillance [35]). Recent studies revealed that a decrease in resolution (i.e., ) yields an increase in error for models used for facial landmark localization (Bulat et al. 2018). To address this problem, face SR, also known as face hallucination, aims to generate high-resolution (HR) faces from LR imagery [16]. The recovered faces then provide more detailed information (e.g., sharper edges, clearer shapes, and finer skin details), and are often used for improved analysis and perception. However, most existing methods (e.g., SuperFAN [3]) rely heavily on the quality of the recovered images. Since SR methods usually suffer from blurriness, using SR images for face-related tasks can hinder the final prediction or conclusion.

On the other hand, facial prior knowledge can be used to recover SR faces of higher quality [1, 16]. In problems of single image super-resolution (SISR), face SR utilizes prior knowledge to improve the accuracy of the inferred images and, thus, to yield results of higher quality. For example, one can leverage low-level information (i.e., smoothness in color), facial heatmaps, and face parsing maps to provide additional mid-level information (i.e., face structure) to recover sharper edges and shapes [4]. Also, high-level information can be extracted with identity labels and other face attributes (e.g., gender, age, and pose), and then leveraged to reduce the ambiguity of the hallucinated faces [33, 14]. Hence, additional face information is beneficial for SR, especially for tiny faces (e.g., 16×16).

Previous work in face SR either super-resolved LR images using prior information (e.g., FSRNet [4]) or directly localized the landmarks on the super-resolved images (e.g., SuperFAN [3]). Figure 2 compares these frameworks with the proposed method. Specifically, SuperFAN only uses SR to help localize the landmarks of tiny faces, but not vice-versa. In contrast, our model does not process the recovered SR output that suffers from blurriness, as we dedicate an encoding module to maximize the amount of information captured from LR faces. As for FSRNet, landmarks are only used as facial prior knowledge to super-resolve faces, which suffers from the same problem of detecting landmarks on a coarse, recovered SR image. Furthermore, SuperFAN and FSRNet address the two tasks separately, leading to redundant feature maps. Since face SR and landmark localization could benefit from one another, we aim to extract the maximum amount of information from LR faces by addressing the two tasks simultaneously. Thus, we propose a multi-task framework that allows these tasks to benefit from one another, which improves the performance in both tasks (see Figure 1).

The main contributions of this paper are as follows:

  1. We propose JASRNet, a network that performs SR and landmark detection on tiny faces jointly (the code is available). To the best of our knowledge, we are the first to train a multi-task model that jointly learns landmark localization and SR. Specifically, and unlike existing two-step approaches, we leverage the complementary information of the two tasks. This allows for more accurate landmark predictions to be made in LR space and improved reconstruction from LR to HR.

  2. Novel deep feature extraction and fusion modules are used to maximize the amount of information captured from the LR faces, which is done at intermediate layers of the encoder to exploit the deep hierarchical machinery.

  3. We show large improvements for both SR and landmark localization for tiny faces (i.e., 16×16). In addition, JASRNet yields results for landmark localization on LR faces (i.e., 64×64) that are comparable to existing methods evaluated on the corresponding HR faces (i.e., 256×256). Furthermore, the proposed method recovers HR faces with sharper edges and shapes compared with state-of-the-art methods for SR.

Related Work

Face super-resolution

Typical SISR methods do not benefit from facial prior information and can be utilized to super-resolve images of arbitrary type. By introducing face-specific information, Yu et al. [34, 35] proposed a GAN-based model to recover HR images from tiny faces of size . Chen et al. [4] used a separate branch to estimate facial landmark heatmaps and parsing maps, which were then used as face-specific information to super-resolve tiny face images. FaceAttr [33] validated that knowledge of facial attributes can also significantly reduce the ambiguity in face SR. It is worth noting that our method not only utilizes facial prior information to super-resolve tiny faces with better quality, but also achieves state-of-the-art performance on landmark alignment by benefiting from SR.

Figure 2: Graphical view. (a) SuperFAN [3] detects landmarks on super-resolved faces. (b) FSRNet [4] uses prior information for SR. (c) Our multi-task framework jointly learns landmark localization and SR, with tasks aiding one another.

Face alignment

Modern-day approaches for face alignment have been successful on HR faces [5, 18, 19, 20] . However, most suffer from performance degradation with decreasing image resolution, especially with faces smaller than  [2]. The first to address landmark detection on LR faces was SuperFAN [3], which super-resolved tiny faces, from which the output images were fed to a landmark localization model. Although the error of the landmark localization provides gradients to back-propagate through the SR module, it is, in essence, a 2-step process. We argue that the facial prior information is not fully utilized for SR. To address this problem, we present a novel synergistic multi-task framework that learns facial landmark localization and SR jointly.

Figure 3: Architecture of the proposed JASRNet. The shared encoder module is used for extracting shallow and shared features for both tasks. The deep feature extraction and fusion module is used for obtaining better feature representations. The other two modules are task-specific modules for super-resolution and face alignment, respectively.

Multi-task learning

Multi-task learning is commonly used to jointly address correlated tasks. HyperFace [20] proposed a multi-task learning framework for face detection, face alignment, gender recognition, and pose estimation. The joint learning tasks were based on regression or classification (i.e., a special case of regression). Hence, similar architectures were adopted for all tasks. In our case, however, face SR and alignment are based on generation and regression, respectively. Thus, one of the main architectural differences between the proposed model and HyperFace is that we include specific modules for each task, while HyperFace used only fully connected layers after feature fusion.


Super-resolution (SR) and landmark localization of tiny faces are highly correlated tasks, and each can benefit from the other. Previous work, however, either uses SR to help align tiny faces or vice-versa, but not both. We argue that the amount of information extracted from an LR image is not maximized when only one task is used to help the other. Hence, we propose a deep joint alignment and super-resolution network (JASRNet) to super-resolve tiny faces and localize their landmarks simultaneously, with information from each task boosting the performance of the other. As shown in Figure 3, the proposed JASRNet consists of four parts: (1) a shared shallow encoder module, which extracts shallow and shared features for both tasks; (2) a deep feature extraction and fusion module, which obtains better feature representations; (3) and (4) task-specific modules for super-resolution and face alignment, respectively.

Given a set of training samples, the original LR faces are passed into the shared encoder, which then feeds into the feature extraction module to extract features for both tasks. To exploit the representative power of different granularities, the intermediate features of the shared encoder branch out to fuse with the output of the deep feature extraction module. This feature fusion forms a more efficient feature representation, as later demonstrated in the ablation study. The fused features are then fed to both task-specific modules. Thus, the super-resolved images and the probability maps of the landmark estimations are produced simultaneously.

Usually, there are sharper edges or sudden changes around the contours of facial components. For face alignment, the SR module recovers the image with better resolution, which, hence, helps the model detect more accurate landmarks. In parallel, the alignment module locates the edges and structure of the face, forcing more attention to the high-frequency content (i.e., edges). Since both tasks, face SR and landmark localization, are suited to benefit from one another, the aim of this work is to exploit the maximum amount of information that can be extracted from the LR faces. This is done by combining the loss functions of the two tasks. For the SR task, the L1 loss is minimized, as it can provide better convergence than the L2 loss [15, 37]. For the alignment task, a heatmap loss is used, as in [5]. Together, the loss function of JASRNet can be expressed as


where denotes the total loss, and denote the loss for super-resolution and the heatmap loss for alignment, respectively. The weight of is , and the estimated heatmap of the image is . As mentioned above, is the super-resolved image recovered from .
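The way the two task losses combine can be illustrated with a minimal sketch. It assumes an absolute-error (L1) term for SR, a mean-squared-error term for the heatmaps, and a single scalar weight applied to the alignment term; the function names, the choice of MSE for the heatmap loss, and the placement and value of the weight (`align_weight`) are assumptions for illustration, not details confirmed by the text.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized flat sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error between two equally sized flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(sr_pred, hr_target, heatmap_pred, heatmap_target, align_weight=1.0):
    """Total loss = SR loss + align_weight * heatmap loss (assumed form)."""
    sr_term = l1_loss(sr_pred, hr_target)
    align_term = mse_loss(heatmap_pred, heatmap_target)
    return sr_term + align_weight * align_term


# Toy flattened "images" and "heatmaps" with two values each.
total = joint_loss(
    sr_pred=[0.5, 0.25], hr_target=[0.0, 0.25],
    heatmap_pred=[1.0, 0.0], heatmap_target=[0.5, 0.0],
    align_weight=0.5,
)
print(total)
```

Because both terms are computed from outputs of the same shared features, a gradient step on this sum updates the shared encoder with signals from both tasks at once.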

Shared feature extraction and fusion

Shallow encoder. Previous work in face SR and alignment usually addressed these two tasks separately, leading to redundant feature maps. To efficiently extract features from LR images, a shared encoder is designed to extract shallow features that capture complementary information of the two tasks. It consists of a convolutional layer, a residual block [7], and then three transformations made-up of the maxpooling operation and residual blocks (Figure 3). Intermediate layers of the encoder are later fused for richer features in geometry and semantics.

All the convolution layers of JASRNet use kernels of size , and each is followed by a ReLU layer. The number of channels is set to 128 for all layers, except for the last convolutional layers in the reconstruction and alignment modules, which are set to 3 and the number of landmarks (namely, 68 for 300W), respectively. There are three maxpooling layers in the network, each of which downsamples the feature maps by a factor of 2, which, in total, reduces the size of the feature maps by a factor of 8. The structure of the residual blocks is the same as in the original residual nets (ResNets) [7], except that we omit the batch normalization (BN) layers, as BN reduces the variation of feature ranges: ResNets used for SISR (EDSR) performed best with all BN layers removed [15]. Also, we found that BN layers slow down the convergence of the network while reducing its overall performance, which was especially true in the SR task. Since we aim to preserve as much information as possible when passing through the shared encoder module (i.e., during feature extraction), we follow EDSR [15] and remove all BN layers from the residual blocks.

Deep feature extraction and fusion. Deeper networks have been shown to perform better in many computer vision tasks, including SR [3, 4, 7, 15, 26]. Increased depth was also a tactic used in this work. Shallow features extracted from the shared encoder are passed to the deep feature extraction module, which consists of a stack of residual blocks (32 in the reported experiments; see Res_32 in Table 5). A deeper network not only recovers sharper edges and shapes for super-resolved face images, but also achieves a higher accuracy for landmark localization.

Inspired by HyperFace [20], we fused intermediate layers to exploit the representative power of features at different levels of the hierarchical model. Considering the similarity of features from adjacent layers, not all features of the shared encoder are fused to compose the new feature representation. Since each of the maxpooling layers downsamples the feature map by a factor of 2, the output of the layer that precedes each maxpooling layer branches out using a skip connection, and these outputs are later fused to form richer features with geometry information. To match the sizes of the feature maps, a convolutional layer with stride 2 is applied to downsample the fusing features by a factor of 2 for each maxpooling layer that is applied in parallel to the skip connection.
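The size matching described above can be checked with a small bookkeeping sketch. It assumes a 128-pixel-wide input (the size reported in the implementation details) and three maxpooling layers; the function name `fusion_plan` is hypothetical, and the sketch tracks spatial sizes only, not the feature values or the fusion operation itself.

```python
def fusion_plan(input_size=128, num_pools=3):
    """For each skip branch, return (size before the pool, stride-2 convs needed).

    A feature map taken before the k-th maxpooling layer still has to pass
    through every remaining pool on the main path, so its skip branch needs
    the same number of stride-2 convolutions to end at the same size.
    """
    plan = []
    size = input_size
    for pool_index in range(num_pools):
        remaining_pools = num_pools - pool_index
        plan.append((size, remaining_pools))
        size //= 2  # each maxpooling layer halves the spatial size
    return plan, size


plan, fused_size = fusion_plan()
for size_before_pool, n_convs in plan:
    # Every branch must land on the size of the deepest feature map.
    assert size_before_pool // (2 ** n_convs) == fused_size
print(plan, fused_size)
```

Under these assumptions the three branches start at 128, 64, and 32 pixels and need three, two, and one stride-2 convolutions, respectively, to reach the 16-pixel fused feature map.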

The outputs before the maxpooling are denoted as ; the output of the last residual block in feature extraction module is (see Figure 3). Provided LR images as input, we have

where transform the signal during feature extraction. Hence, is the mapping of the first convolution layers and residual blocks, and are the mappings of the first and second steps combining maxpooling and residual blocks, respectively, and is the mapping for the remaining residual blocks making up the feature extraction module.

Mathematically speaking, the fused features that are output can be found as


where the convolution operation fuses intermediate features.

Task-specific modules

Super-resolution reconstruction. The super-resolution reconstruction module reconstructs the HR image from shared features of size 16×16. First, the shared feature maps are fed to two residual blocks to extract task-specific features. Next, three convolutional layers, each of which is followed by a pixel shuffle layer [25], upscale the feature maps by a factor of 8 in total (i.e., to 128×128). Finally, a convolutional layer made up of 3 filters maps the features to the HR RGB image space.
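To make the upscaling step concrete, the following sketch implements the pixel shuffle rearrangement on nested lists. It follows the channel ordering used by common deep learning libraries, and it assumes each of the three stages upscales by a factor of 2; both points, along with the function name, are assumptions for illustration rather than details stated in the text.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channel count must be divisible by r*r"
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become a single 2x2 channel.
y = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)
print(y)  # [[[1, 2], [3, 4]]]
```

Applying the operation three times with r = 2 turns a 16×16 map into a 128×128 map, which is why each stage needs a preceding convolution to supply four times as many channels as it emits.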

Inspired by EDSR [15] and RDN [37], the first and last residual blocks of the shared encoder and SR reconstruction module are linked by a long skip connection. This recovers HR images with finer details (i.e., sharper edges and shapes). The skip connection directly provides low-frequency information to the super-resolved images. Hence, it forces the network to focus on learning the high-frequency information, as opposed to the low-frequency information already provided. Since the output size of the first convolution layer is 128×128, and the feature map size of the last residual block in the reconstruction module is 16×16, we downsample the feature map with 3 convolution and 3 maxpooling layers (see Figure 3).

Unlike SuperFAN, where the long skip connection is reported to have minimal impact on overall performance, our model largely benefits from the skip connection. This is because the extracted features include high-frequency information and, thus, are more efficient for recovering sharp and accurate edges. Furthermore, since super-resolution and face alignment share the deep features, a byproduct of this long skip connection is boosted performance on the landmark localization task.

Figure 4: Visual results. Comparison of different super-resolution methods.

| Dataset | Bicubic | VDSR | URDGN | SRRes | EDSR | TDAE | FSRNet | SuperFAN | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 300W | 21.36/0.594 | 21.80/0.558 | 21.97/0.617 | 23.30/0.669 | 23.47/0.658 | 21.12/0.547 | 23.05/0.678 | 23.13/0.691 | 23.69/0.711 |
| HELEN | 21.36/0.593 | 21.66/0.552 | 21.77/0.605 | 23.05/0.674 | 23.40/0.709 | 21.70/0.542 | -/- | 23.17/0.695 | 23.55/0.717 |

Table 1: Quantitative comparisons. PSNR/SSIM on 300W and HELEN.

Face alignment. Like the SR reconstruction module, the shared features are fed through consecutive residual blocks to extract features specific to face alignment. Inspired by the successes of convolutional pose machines (CPM)  [30] on face alignment, we also utilize the sequential framework made-up of residual blocks for estimating locations of landmarks. In the first stage, two residual blocks predict coarse heatmaps . Then, in the second stage, the heatmaps predicted in the first stage are first concatenated with the feature maps , which are then fed to the second prediction module composed of three sequential residual blocks that predict heatmaps . The third stage then concatenates feature maps and to produce final estimation expressed as


where maps the prediction modules, with . Note that the size of the feature maps is constant throughout the face alignment module (i.e., ). During training, heatmap regression loss was used to localize landmarks, opposed to directly predicting pixel coordinates . Thus, argmax is used to determine from the predicted heatmaps in final stage (i.e., ). Specifically, the maximum value of each of the heatmaps is found as the predicted landmarks (i.e., ).
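The final decoding step described above, in which the location of each heatmap's maximum is taken as the predicted landmark, can be sketched as follows. The helper names are hypothetical, heatmaps are represented as plain lists of rows, and coordinates are returned as (x, y) at heatmap resolution; any rescaling to image coordinates is outside this sketch.

```python
def heatmap_argmax(heatmap):
    """Return (x, y) of the largest value in a 2D heatmap given as a list of rows."""
    best_xy = (0, 0)
    best_value = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value = value
                best_xy = (x, y)
    return best_xy


def decode_landmarks(heatmaps):
    """Decode one (x, y) landmark per heatmap channel."""
    return [heatmap_argmax(h) for h in heatmaps]


# Two toy 3x3 heatmaps standing in for two landmark channels.
toy_heatmaps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1],
     [0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.8],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 0.0]],
]
landmarks = decode_landmarks(toy_heatmaps)
print(landmarks)  # [(1, 1), (2, 0)]
```

Since argmax is not differentiable, it is applied only at inference; training operates on the heatmaps themselves, as the text notes.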


We now review the experimental settings and results. Specifically, the datasets, implementation details, and metrics are first described. Then, we show results comparing with the state-of-the-art methods for the face SR and alignment task separately. Besides, we highlight the benefits of the proposed feature fusion and joint training. Finally, we conduct an ablation study as a deep-dive revealing the contributions of the components introduced in this work.

Experimental settings

Datasets. We evaluated the proposed approach on several datasets, which are listed as follows:

  • 300W [24, 23] consists of 3,837 face images with 68 landmarks. We used the same training set as [18, 38]. Three subsets of 300W were evaluated: common, challenge, and full.

  • AFLW [10] consists of 24,386 faces, each with 21 landmarks. The dataset was split into 20,000 faces for training and the rest (i.e., 4,386) for testing  [5]. Also, the left and right ears were ignored, leaving up to 19 landmarks per face sample.

  • HELEN [11] contains 2,330 images. The annotations of all 194 landmarks were used as facial prior information. We followed [4] and used the last 50 images for testing and the rest for training.

  • LFW [8, 12] contains 13,233 face images collected from 5,750 people. Each image is labeled with the name of the person pictured. Hence, it will also be used to evaluate the recognition capabilities of super-resolved images. Note that this dataset was only used for testing.

Implementation details. We first cropped facial images about the head region, which were then resized to 128×128. These were designated as the HR images. Then, LR images were generated by applying bicubic downsampling (8×) to the HR images, yielding a resolution of 16×16. The input LR images were then resized back to match the size of the HR faces: each was up-scaled 8× using bicubic interpolation, resulting in images of size 128×128. The training images were augmented using random scaling, rotation, and horizontal flipping. Specifically, these augmentation transformations were used to make fifteen copies. Optimization was done with Adam with a learning rate of that dropped 0.5 at and epochs. The model was trained with a batch size of 8 for a total of 40 epochs. Implementation was done using PyTorch. Training took about 7 hours on HELEN with an NVIDIA TITAN Xp GPU.

Evaluation metrics. The metric used to evaluate landmark localization was NMSE (i.e., the normalized Euclidean distance between ground-truth and predicted landmarks). Following [2, 5, 24], the normalization factor is set as the inter-ocular distance for 300W and the area of the ground-truth bounding box for the AFLW dataset.
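A minimal sketch of this metric follows, with the normalization factor passed in explicitly so that the same function covers both datasets. The function name and the toy coordinates are illustrative assumptions; how the normalization factor is derived for each dataset is left to the caller.

```python
import math


def normalized_mean_error(pred, gt, norm):
    """Mean Euclidean distance between predicted and ground-truth landmarks,
    divided by a dataset-specific normalization factor."""
    assert len(pred) == len(gt) and len(pred) > 0
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred, gt)
    ]
    return sum(distances) / len(distances) / norm


# Two landmarks: one is off by a 3-4-5 triangle, the other is exact.
pred = [(3.0, 4.0), (10.0, 10.0)]
gt = [(0.0, 0.0), (10.0, 10.0)]
error = normalized_mean_error(pred, gt, norm=10.0)
print(error)  # 0.25
```

Errors of this kind are often multiplied by 100 and quoted as percentages, which appears consistent with the magnitudes in Table 2, although the text does not state the scaling.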

For SR, we evaluated using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [29]: PSNR is computed from the mean squared error (MSE) between the SR and HR images, while SSIM accounts for the noise and edges (i.e., the high-frequency content) of an image. In our experiments, we converted the RGB images to the YCbCr color space and only calculated the PSNR for the Y-channel. To focus on the face region, while ignoring the background, only the face region within the bounding box was measured when evaluating the SR images.
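The Y-channel PSNR can be sketched as below. The conversion uses the ITU-R BT.601 luma coefficients for 8-bit RGB and a peak value of 255, which is a common convention in SR evaluation; the text does not state which conversion or peak value was used, so both are assumptions, as are the function names. Cropping to the face bounding box is omitted.

```python
import math


def rgb_to_y(pixel):
    """Luma (Y) of an 8-bit RGB pixel using BT.601 coefficients (assumed)."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr_pixels, hr_pixels, peak=255.0):
    """PSNR in dB over the Y channel of two equally sized lists of RGB pixels."""
    assert len(sr_pixels) == len(hr_pixels) and len(sr_pixels) > 0
    squared_errors = [
        (rgb_to_y(s) - rgb_to_y(h)) ** 2
        for s, h in zip(sr_pixels, hr_pixels)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


hr = [(255, 255, 255), (0, 0, 0)]
sr = [(250, 250, 250), (5, 5, 5)]
print(psnr_y(sr, hr))
```

Measuring on luma only means chroma errors do not affect the score, which is one reason PSNR is usually reported together with SSIM and visual comparisons.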

Comparison with state-of-the-art methods

Comparisons were made with state-of-the-art methods in both SR and face alignment. It is important to note that most existing methods only do a single task, while the proposed model does both. Furthermore, our model performs the best in both tasks. The methods that do both tasks, SuperFAN [3] and FSRNet [4], were used to compare both tasks simultaneously.

| Method | 300W Common | 300W Challenge | 300W Full | AFLW |
|---|---|---|---|---|
| SDM [32] | 5.57 | 15.40 | 7.52 | 5.43 |
| LBF [21] | 4.95 | 11.98 | 6.32 | 4.25 |
| CFSS [38] | 4.73 | 9.98 | 5.76 | 3.92 |
| MDM [27] | 4.83 | 10.14 | 5.88 | - |
| Two-stage [18] | 4.36 | 7.56 | 4.99 | 2.17 |
| RCSR [28] | 4.01 | 8.58 | 4.90 | - |
| CPM+SBR [5] | 3.28 | 7.58 | 4.10 | 2.14 |
| JASRNet (Ours) | 3.20 | 7.44 | 4.03 | 2.03 |
| SuperFAN [3] | 5.60 | 10.47 | 6.55 | 3.774 |
| FSRNet [4] | 5.42 | 10.76 | 6.46 | - |
| CPM+SBR [5] | 5.42 | 10.65 | 6.45 | 3.87 |
| JASRNet (Ours) | 4.60 | 8.10 | 5.29 | 3.35 |

Table 2: NMSE on 300W and AFLW. We perform the best on LR faces (bottom four rows). Even though the proposed model processes LR faces while all others process HR faces, it is still the best (top eight rows).

| Method | ACC (%) | PSNR | SSIM | Param. |
|---|---|---|---|---|
| HR | 99.33 | - | - | - |
| Bicubic | 79.50 | 25.28 | 0.736 | - |
| FSRNet [4] | 83.75 | 26.63 | 0.800 | 27.14M |
| SuperFAN (Bulat et al. 2018) | 84.08 | 26.83 | 0.808 | 26.41M |
| Ours | 86.86 | 27.30 | 0.818 | 18.96M |

Table 3: Quantitative comparisons on LFW. Performance was measured using verification accuracy (ACC), PSNR, and SSIM. The number of parameters is also listed here.

Face super-resolution results. We compared with methods used for SISR (i.e., VDSR [9], SRRes [13], and EDSR [15]), as well as methods for face SR (i.e., URDGN [34], TDAE [35], SuperFAN [3], and FSRNet [4]). For a fair comparison, we retrained the aforementioned models with the same training and testing data used in the respective experiment. Qualitative comparisons clearly show that the proposed JASRNet recovers HR images with relatively more details (i.e., sharper edges, more accurate facial component shapes and textures), while other methods tend to produce face images with more blur and inaccuracies (see Figure 4). Quantitative results for face SR are shown in Table 1. The proposed model achieved the highest PSNR and SSIM on the 300W and HELEN datasets. Since some methods only support an upscaling factor of 4, we added an additional upscaling module (2×) to get the equivalent factor of 8. For this, we incorporated the commonly used pixel shuffle followed by a convolutional layer [25].

Face alignment results. We present face alignment results for the 300W and AFLW datasets with LR image sizes of 16×16 and 64×64 separately. The results are summarized in Table 2. First, we compare the results on 16×16 LR images (see the bottom part of Table 2). Since only a few works address the tiny face (i.e., 16×16) alignment problem, we only compare the performance of the proposed model with SuperFAN, FSRNet, and another state-of-the-art method, CPM+SBR [5]. Note that CPM+SBR is applied to images super-resolved using bicubic interpolation. Compared with the other state-of-the-art methods, we show a large improvement for landmark localization on tiny faces.

Furthermore, we present results of JASRNet on faces with a resolution of 64×64 (see Table 2 (top)). Note that the existing methods detect landmarks on HR (i.e., 256×256) images. Still, the proposed framework on LR images is comparable to the other methods on HR images for landmark localization.

| Task | baseline (BL) | +feature fusion (BL_F) | joint training (JT) | +feature fusion (JT_F) | JASRNet (ours) |
|---|---|---|---|---|---|
| Super Resolution | 23.41 | 23.50 | 23.55 | 23.58 | 23.69 |
| Face Alignment | 5.71 | 5.70 | 5.34 | 5.34 | 5.26 |

Table 4: Ablation study highlighting the effectiveness of feature fusion and joint training.

| Variant | Super Resolution (PSNR) | Face Alignment (NMSE) | # of Param. |
|---|---|---|---|
| Concat | 23.57 | 5.42 | - |
| Adding | 23.69 | 5.29 | - |
| One_stages | 23.61 | 5.44 | 16.69M |
| two_stages | 23.62 | 5.36 | 17.83M |
| Res_16 | 23.62 | 5.41 | 14.46M |
| Res_32 | 23.69 | 5.29 | 18.96M |

Table 5: Baseline variations of the proposed JASRNet. Trained and tested on 300W.

Comparison on both tasks. To the best of our knowledge, FSRNet [4] and SuperFAN [3] were the only attempts that reported results on both tasks (i.e., SR and face alignment). Thus, we compared results on both tasks with these two methods. Since one of the primary purposes of "enhancing" faces is to improve facial recognition capabilities, we also measured face verification performance on the super-resolved images. Additionally, the number of parameters used in each model is listed in Table 3. In this section, models were trained on the 300W training set, and tested on the 300W test set and the entire LFW dataset. The SR and alignment results for the 300W test set are shown in Tables 1 and 2, respectively. As for the LFW dataset, the results for SR and facial recognition are listed in Table 3. Performance was measured using verification accuracy (ACC), PSNR, and SSIM. We did not include LFW in the test for landmark localization since it does not provide the 68 landmarks used as prior knowledge in all three methods. The results show that JASRNet significantly outperforms SuperFAN and FSRNet in face SR and landmark localization (see Tables 1, 2, and 3). Qualitatively, the proposed method also produces more accurate landmark estimations for the alignment task and much more detailed appearances and textures for the SR task than the other two methods (see Figures 1 and 4). Note that our model also has fewer parameters than SuperFAN and FSRNet (see Table 3).

Ablation study

We next measured the contributions of feature fusion, joint training, and the long skip connection. Table 4 lists the four additional variants used. The baseline (BL) only consisted of an encoder, a feature extraction module, and either an SR or an alignment module. In other words, the BL omitted the feature fusion at the intermediate layers, removed the long skip connection, and was only able to handle a single task per pass (i.e., either SR or face alignment, but not both). BL_F is BL with feature fusion. The joint training (JT) net was constructed by attaching both task-specific modules to the baseline, and JT with feature fusion is JT_F. Finally, JT_F with the long skip connection forms the proposed JASRNet. The training set used in this section is 300W. Note that our baseline model performs even better than SuperFAN [3] while using fewer parameters. The reasons are three-fold: 1) batch normalization is omitted from the residual blocks to speed up training and boost performance; 2) pixel shuffle layers [25] are used in the reconstruction module instead of the deconvolutional layers used in SuperFAN; 3) two independent modules are used in SuperFAN, i.e., SR and face alignment are handled separately, which yields redundant feature maps and, hence, degrades performance.

Effects of the feature fusion. Fusing the features at the intermediate layers yields richer, and more efficient feature representations for SR, with BL_F and JT_F outperforming BL and JT, respectively, in SR (see Table 4). However, feature fusion has less impact on face alignment. This is because SR uses both low and high frequency information to recover HR from LR images, while landmark localization is mostly dependent on the high frequency content.

Effects of joint-task mechanism. To highlight the importance of training the two tasks jointly, we compared JT to BL and JT_F to BL_F (see Table 4). Results for both tasks (i.e., SR and face alignment) show that the joint-task variants (i.e., JT and JT_F) outperform BL and BL_F, respectively. This validates that joint training, in itself, contributes to the state-of-the-art performance of JASRNet.

Effects of long skip connection. The impact of the long skip connection is evident from the results: JASRNet, which is JT_F with the added skip connection, outperforms all others in both SR and landmark localization. The impact for SR stems from the skip connection forcing the network to encode sharper and more precise edges in the feature representation, as expected. The boosted accuracy for face alignment was less expected, yet it supports the narrative: we believe the shared features for SR and face alignment yield additional information that complements both tasks.

Baseline variations. We also show variations of the vanilla baseline for insights on the effects of different fusion methods (i.e., concatenation vs. element-wise addition), the number of residual blocks in the feature extraction module (i.e., 16 vs. 32), and the number of stages in the face alignment module (i.e., 1 vs. 2). Table 5 lists results for the different settings. Element-wise addition is clearly better for the feature fusion module in our model. Also, more residual blocks and more stages improve the performance. Thus, the deeper structure, with its higher capacity, captures more information for the SR and face alignment tasks: as the network grows, so does its potential to learn.


We proposed JASRNet to exploit the maximum amount of information from tiny face images by simultaneously addressing the alignment and super-resolution tasks. Extensive experiments demonstrated that the proposed model significantly outperforms the previous state-of-the-art in SR by recovering sharper edges (i.e., finer details) from LR faces. We also showed large improvements for landmark localization of tiny faces (i.e., 16×16). Furthermore, the proposed framework yields results for landmark localization on lower-resolution faces (i.e., 64×64) that are comparable to existing methods on higher-resolution faces (i.e., 256×256).