Face hallucination, a.k.a. facial image super-resolution, aims to generate a high-resolution (HR) face image from a given low-resolution (LR) input. Face hallucination is a fundamental problem in the field of face analysis and has drawn considerable research attention due to the need for solving this problem in various face-related tasks, including face attribute recognition, face alignment [2, 3] and face recognition, under complex low-quality real-world scenarios.
Face hallucination is a special form of general image super-resolution in which the input carries obvious structural prior information, and this prior is widely used in existing face hallucination algorithms [5, 6, 7]. Face prior information is usually embedded into the existing face hallucination models in the form of face component analysis, facial correspondence field and facial landmark localization. However, computing this prior information incurs additional cost, and accurate parsing of the landmarks is difficult in low-resolution situations. Therefore, a series of works attempts to replace the fine-grained face prior computation with rough patch-wise super-resolving mapping, which can improve the efficiency of the algorithm while achieving comparable performance. Due to differences in appearance of facial organs and the natural symmetry of the facial region, existing patch-wise face hallucination methods either extract patches from detected facial landmarks or simply divide the face image into even patches and then independently perform LR to HR mapping on each detected patch [9, 10, 11, 12]. Specifically, end-to-end deep convolutional networks (CNNs) have recently achieved great success in learning discriminative patch-to-patch mapping from LR images to HR images [13, 7]. However, face structure priors and spatial configurations [14, 15] are often treated as external information, and the contextual dependencies among the super-resolution reconstruction of each patch are usually ignored during the hallucination processing.
The difficulty of super-resolution reconstruction also varies across facial parts because their details deteriorate to different degrees. On the other hand, the symmetry of the human face and the similarity in the appearance of adjacent regions make previously hallucinated HR patches useful references for the patches enhanced later. Therefore, during the process of face hallucination, the reconstruction sequence of patches and the selection of patch locations at each step are crucial for global face hallucination, which is consistent with the human visual perception mechanism. When people observe a scene object, they usually start with perceiving the whole image and successively explore a sequence of regions via the attention shifting mechanism rather than separately processing the local regions. This finding motivates us to explore a new pipeline for face hallucination by sequentially searching for the local attention regions and considering their contextual dependency from a global perspective.
Inspired by the effectiveness of recurrent visual attention modeling for visual analysis and understanding [16, 17, 18, 19], we propose an attention-aware face hallucination (Attention-FH) framework that fully exploits the global contextual information of the face to recurrently discover and enhance a series of local face regions. Specifically, accounting for the diverse characteristics of face images in terms of blur, pose, illumination and facial appearance, we model the face hallucination problem as a strategy optimization problem for patch sequence selection and implement a search for an optimal enhancement route. We resort to the deep reinforcement learning (RL) model to sequentially determine local patches for enhancement, as the RL technique has shown promising results on decision-making problems without the need for supervision information at each step.
Specifically, our Attention-FH framework jointly optimizes a recurrent policy network that learns to identify the facial region to be hallucinated at each time step and a local enhancement network for facial part super-resolution by considering the whole face image with previously enhanced parts. In our framework, rich correlation cues among different facial parts are explicitly exploited to guide the current region assignment, while past hallucination results are incorporated as a global reference during local enhancement in each step. In this way, the agent can make full use of the symmetry of the human face and the adjacent regions to assist in obtaining more accurate facial part hallucination reconstruction. For example, the agent can improve the enhancement of the right eye region by taking a clear version of the left eye region as reference.
Instead of performing supervision in each step, we employ a single global reward for RL, which measures the overall performance of the entire hallucinated HR face. The recurrent policy network is updated following the RL algorithm, which can be treated as a Markov decision process (MDP) maximized with a long-term global reward. At each time step, the policy network learns to determine the location and the size of an optimal rectangular facial region by conditioning on the whole face image with all previously enhanced results and the encoded action history. A gated recurrent unit (GRU) layer is employed to encode the information from the previously attended facial regions. All the previously determined regions are also recorded to avoid duplicated selection of a region in a recurrent mode.
Given the attended facial region in each step, the local enhancement network is trained for hallucination reconstruction, and its loss is defined as the distance between the part hallucination result and the corresponding ground truth. Compared with the whole face, the structure of facial components (attended parts) is stable and easy to restore. Notably, the supervision information from the enhancement of facial parts also effectively reduces unnecessary trial and error during the reinforcement optimization.
We compare the proposed Attention-FH approach with state-of-the-art face hallucination methods under both constrained and unconstrained settings. Extensive experiments show that our method substantially outperforms all the alternative methods. Moreover, our framework can explicitly generate a sequence of attentional regions during the hallucination, which finely accords with the human perception process and to some extent provides an interpretable mechanism in the hallucination recovery process.
A preliminary version of this work is published in . In this work, we inherit the idea of exploring the interdependency of facial components and redevelop the policy network from the perspective of the attention mechanism and reinforcement optimization. The improvement upon the initial version includes size-free attention and a new reward function designed to maximize the stability of the RL. We have also added a comprehensive discussion of the design of the local enhancement network and greatly improved its efficiency while maintaining its performance. Moreover, we present more comparisons with state-of-the-art models and a more comprehensive ablation study on our proposed framework.
The rest of this paper is organized as follows. Section 2 reviews related work on face hallucination and deep RL. In Section 3, we introduce our proposed Attention-FH model. Section 4 provides extensive performance evaluation and comparisons with state-of-the-art models. Finally, we conclude this paper in Section 5.
2 Related Work
Face Hallucination and Image Super-Resolution
Face hallucination is a domain-specific image super-resolution problem proposed to map an LR facial image to its HR version. With obvious prior knowledge, face hallucination methods are required to handle extremely degraded faces and restore complex structural information. Early approaches hypothesized that corrupted faces are in a relatively controlled environment with small variations. Yang et al. enforced LR and HR images to have similar sparse representations and implemented image super-resolution by taking into account the sparsity prior. Wang et al. decomposed faces into different frequency bands and hallucinated faces by eigentransformation. By contrast, structured face hallucination (SFH) pre-aligns faces and establishes the mapping between facial components. SFH not only achieves impressive results but also reveals that facial components are crucial in face hallucination. However, SFH relies heavily on pre-alignment, making it difficult to cope with situations where illumination and pose changes are considerable. More recently, deep neural networks have shown impressive performance in face hallucination and image restoration [24, 25, 15, 26, 13]. Dong et al. employed a fully convolutional network (FCN) for image SR. Kim et al. made a very deep CNN for image SR trainable by adopting a highway connection. Reed et al. proposed an efficient sampling strategy to demonstrate high-quality image reconstruction. Ulyanov et al. further demonstrated that deep neural networks can make full use of prior knowledge of the image itself to rebuild a corrupted image. Shocher et al. employed a novel internal learning approach to fully explore LR inputs. UR-DGN is claimed to be the first face SR method that uses a generative adversarial network. Tuzel et al. claimed that global information is crucial to face hallucination and established a local and global network to restore faces.
However, all these existing deep learning models attempt to improve the performance of face hallucination by designing deeper and more complex neural network structures. In this work, we start from the perspective of human cognition and model the face hallucination process as a patch-wise local reconstruction problem. We introduce a deep RL-based optimization method to learn a series of ordered patch hallucination sequences. For patch-wise hallucination, which is not our main focus, we draw on an existing FCN-based framework and incorporate it into our Attention-FH model for end-to-end training. We firmly believe that the Attention-FH framework proposed in this paper is compatible with any existing deep super-resolution models and will benefit from the future improvement of super-resolution algorithms.
Attention and Reinforcement Learning
Visual attention modeling is inspired by the human visual perception system. Visual attention modeling is widely embedded in existing deep neural networks in the form of adaptive feature weighting or salient region localization and has been proved to be effective in improving the performance of a series of computer vision tasks, including object proposal, object classification, relationship detection, image captioning and visual question answering. Some works have exploited RL to optimize the attention networks to address the problem that the coordinates of attended regions are not differentiable. For example,  and  learned an agent that actively locates the target regions (face or objects) instead of exhaustively sliding subwindows on images. Goodrich et al.  defined 32 actions to shift the focal point and reward the agent when spotting the target. Caicedo et al.  defined an action set that contains several transformations of the bounding box and rewarded the agent if the bounding box became closer to the ground truth in each step. Both of these methods learned an optimal policy to locate the target through Q-learning.
3.1 Inference Overview
We develop an Attention-FH to perform face hallucination in a coarse-to-fine manner. Specifically, Attention-FH is composed of two parts: a recurrent policy network that learns to adaptively locate a particular facial region for local hallucination, and a local enhancement network that directly learns to map a located facial patch with LR to its HR version, considering both the local and global perspective.
Given a face image with LR, the target of our proposed Attention-FH framework is to generate the corresponding HR version through a series of iterative local patch enhancements, which can be formulated as:
where denotes the parameters of the Attention-FH model, and is the whole hallucinating procedure.
Given a state , the recurrent policy network is learned to predict the actions , including the center position of the next attended rectangular region and the size of the bounding box. The attended patch is further cropped and fed as input to a local enhancement network for super-resolution. This process can be formulated as:
where is the parameters of the recurrent policy network, indicates the recurrent policy network and denotes the restored face image at step . The state is a vector, encoded with the current state, and refers to the cropping operation, which is applied to crop the corresponding patch given action .
Given attention patch , the local enhancement network is adopted for hallucination.
where is the parameter of the local enhancement network and indicates the enhanced patch. After local enhancement, we replace with . Specifically, the restored face image for the next step is produced by replacing the existing patch with the enhanced patch . The overall coarse-to-fine face hallucination can be defined as:
where is the original corrupted face image, indicates the enhanced full-sized facial image at each step , is the maximal recurrent step, and . is set to 18 according to our empirical analyses, which are presented in Section 4.
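The coarse-to-fine procedure described above can be sketched as a simple attend, enhance and paste loop. This is only a minimal illustration under stated assumptions: the function names (`hallucinate`, `crop`, `paste`), the box layout `(top, left, height, width)` and the toy policy and enhancer are hypothetical stand-ins, not the networks used in this paper, and the input is assumed to have already been resized to the working resolution.

```python
import numpy as np

def crop(image, box):
    """Return the patch inside box = (top, left, height, width)."""
    top, left, h, w = box
    return image[top:top + h, left:left + w]

def paste(image, patch, box):
    """Return a copy of `image` with `patch` written into `box`."""
    top, left, h, w = box
    out = image.copy()
    out[top:top + h, left:left + w] = patch
    return out

def hallucinate(lr_face, policy, enhance, num_steps=18):
    """Run the attend-enhance-paste loop; `policy` and `enhance` are placeholders."""
    current = lr_face.copy()
    boxes = []
    for _ in range(num_steps):
        box = policy(current, lr_face, boxes)        # pick the next region
        patch = crop(current, box)                   # cut out the attended patch
        enhanced = enhance(patch, current, lr_face)  # enhance it in context
        current = paste(current, enhanced, box)      # write the result back
        boxes.append(box)
    return current, boxes

# Toy stand-ins (hypothetical): a raster scan over a 3x3 grid of 8x8 boxes and
# an "enhancer" that only adds 1 to every pixel of the patch.
def toy_policy(current, original, history):
    step = len(history)
    return (8 * ((step // 3) % 3), 8 * (step % 3), 8, 8)

def toy_enhance(patch, current, original):
    return patch + 1.0

face = np.zeros((24, 24))
result, visited = hallucinate(face, toy_policy, toy_enhance)
```

In this toy run, each of the nine grid cells is visited twice over the 18 steps, so every pixel of `result` has been "enhanced" twice while the input `face` is left unchanged.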
3.2 Recurrent Policy Network
The recurrent policy network is designed to cooperate with a recurrent neural network to optimize a time-sequence of attended regions for local enhancement. As illustrated in Fig. 2, the Attention-FH framework is composed of a recurrent policy network and a local enhancement network. The recurrent policy network can be formulated as a decision-making process for optimal patch selection on time intervals. At each step, the policy network takes as input the concatenation feature vector (i.e., the state), which contains the current enhanced image, the original corrupted image and the encoded historical actions, and learns to determine the optimal image patch to be enhanced at each time step. At the final time step , a global delayed reward, which is measured in terms of the accumulated attended rate and the mean squared error (MSE) between the hallucinated image and the corresponding ground truth, is used to guide the training of the policy network. The agent learns to predict the most appropriate restoring route for different identities by maximizing the global delayed reward.
State. To provide rich contextual information, the state of the agent is designed to contain the three following types of information. 1) The enhanced face from previous steps, which enables the agent to determine the patch that needs to be repaired at the next time step by fully sensing the rich contextual information (e.g., the region that is still LR and needs to be enhanced). is represented as a global feature vector extracted from the output of a fully connected layer. 2) The original corrupted face image , which is also encoded with a global feature vector , as with the enhanced facial image. 3) The encoded history action vector obtained by forwarding all previous action vectors into the GRU network. The output hidden variable of the GRU thus encodes all previous action information and is denoted as . We formulate the state with size of 512×1 to encode multiple contextual information . In this way, the target of the agent is to determine the region cropping action (including the location and the size) of the next attended local patch by considering the state .
Action. Given state , the agent attempts to generate an action indication, which represents the next attended local region (cropping from the last hallucinated image ) for enhancement. Due to the large differences in the orientation and size of faces in in-the-wild cases, the use of a fixed-size attention bounding box to capture all facial components is not optimal. To this end, we propose a content-adaptive and size-free attention mechanism for better local region extraction. Specifically, to fully capture one facial component, the agent needs to predict the of a rectangular region at each time step, where ,,, and , respectively, refers to the center coordinates and the width and height of the bounding box. Let and be the width and height of the target hallucinated facial image, and let . To reduce the search space, we give up the free search of the width and height and instead attempt to predict the ratio factor and the scale factor . is used to control the length-width ratio. Empirical candidates for include . As shown in Fig. 3, the bounding box is adapted w.r.t the ratio factor and is thus sufficiently flexible to handle various facial components. The scale factor is adopted to represent the overall size of the attention box. With the sample factors, the size of the attention box can be obtained by:
where and are the height and the width of the attention box, respectively. is a constant value that specifies the initial size of the attended region. Here, we set to 60. The action consists of . The GRU takes the state as input and generates a 128×1 hidden vector, which is fed to a fully connected layer to infer the action of the next step. We employ a tensor with the same size as the full image to ensure that the attended region can be accommodated.
Reward. The reward function is designed to guide the training of the recurrent policy network for the optimization of a time sequence of attended patches for local enhancement. We consider two factors when designing our reward function. 1) The MSE between the hallucinated image after steps of local enhancement and the corresponding HR image. Specifically, let be the enhanced image at the last step , and let be the image restored by the super-resolution generative adversarial network (SRGAN) . We first compute their MSEs with respect to the corresponding HR ground truth, denoted as and . The first term of our reward function is defined as . measures the absolute peak signal-to-noise ratio (PSNR) of the restored image while the introduction of the second replaces the reward function with the relative change in PSNR, which better reflects the evolution of each iteration and greatly enhances the stability of the model training. For instance, the PSNR of a restored image with simple details is usually much higher than that of an image with a relatively complex texture. By subtracting , the reward function can better reflect the incremental situation of model training and thus apply more accurate rewards and punishments. 2) The attention rate, which is introduced to indicate whether the attended region has covered the whole image. In detail, a tensor of the same size as the full image is employed. Initially, is set with all zeros, and once a region is visited, its corresponding value is set to . We sum the tensor to at the last step to reflect the attention ratio. The total reward function can be written as:
During training, we set the reward in a global manner, i.e., the reward is assigned after step is completed. The recurrent policy network is trained to maximize this reward via the REINFORCE algorithm .
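The two ingredients of the reward, a PSNR change relative to a baseline restoration and the coverage of the attended regions, can be sketched as follows. This is a hedged illustration rather than the exact reward used here: the additive combination, the `coverage_weight` parameter, the normalization of the visit map by the image size, the pixel range `[0, 1]` and all function names are assumptions made for the example.

```python
import numpy as np

def psnr_from_mse(mse, peak=1.0, eps=1e-12):
    """Standard PSNR in dB for a given MSE (eps avoids division by zero)."""
    return 10.0 * np.log10(peak ** 2 / (mse + eps))

def attention_ratio(image_shape, boxes):
    """Fraction of pixels covered by at least one (top, left, h, w) box."""
    visited = np.zeros(image_shape, dtype=np.float64)  # starts as all zeros
    for top, left, h, w in boxes:
        visited[top:top + h, left:left + w] = 1.0      # mark visited pixels
    return visited.sum() / visited.size

def global_reward(final_img, baseline_img, hr_img, boxes, coverage_weight=1.0):
    """Assumed form: relative PSNR gain plus weighted coverage."""
    mse_final = np.mean((final_img - hr_img) ** 2)
    mse_base = np.mean((baseline_img - hr_img) ** 2)
    relative_psnr = psnr_from_mse(mse_final) - psnr_from_mse(mse_base)
    return relative_psnr + coverage_weight * attention_ratio(hr_img.shape, boxes)
```

Subtracting the baseline PSNR means that an image with simple details does not automatically earn a larger reward than a heavily textured one. The reward reflects how much better the final result is than the reference restoration of the same image.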
3.3 Local Enhancement Network
Given an attended patch from the recurrent policy network, the local enhancement network is employed for hallucination. To provide comprehensive contextual information, the network takes the following four components as input: 1) The attended patch, which is represented by masking the outside area of the input image (setting the value of the region outside the selected patch to zero while keeping the pixels inside intact). 2) The current enhanced facial image (with all previously hallucinated results pasted on), which provides global contextual information for the enhancement of the current patch. 3) The original corrupted image. 4) The global context, which is expanded to the size of the input image by a fully connected layer followed by a reshaping operation (i.e., it has the same shape as the input image). These components are concatenated and further fed into the local enhancement network. To achieve a trade-off between performance and efficiency, we adopt a reduced version of LapSRN as our local enhancement network. The simplified settings of LapSRN are listed in Table II. We resize the input components to the same size as the LR corrupted image without interpolation to improve the efficiency of our model. The local enhancement network is fully convolutional and is composed of 5 convolutional and 2 deconvolutional layers. The convolutional layers all have a stride of, the kernels of the head and tail layers are of size and the kernels of the other layers are of size . By incorporating two deconvolutional layers with stride , the network is able to learn to reconstruct the resolution of the corrupted patch in a cascaded manner. At the end of the local enhancement network, the LR input image is upscaled to the target resolution. We crop out the attended region in the hallucinated result w.r.t. its size and location. The cropped HR patch is added to the accumulated enhanced result . Finally, the residual between the attention patch and the ground truth face image can be estimated by the well-trained local enhancement network.
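The input assembly for the local enhancement network can be illustrated with a short sketch. It assumes single-channel images of identical size, a box given as `(top, left, height, width)`, masking applied to the current enhanced image, and a precomputed context map. The names `mask_outside` and `build_enhancer_input` are hypothetical and chosen only for this example.

```python
import numpy as np

def mask_outside(image, box):
    """Keep the pixels inside `box` and set everything outside it to zero."""
    top, left, h, w = box
    masked = np.zeros_like(image)
    masked[top:top + h, left:left + w] = image[top:top + h, left:left + w]
    return masked

def build_enhancer_input(current, original, context_map, box):
    """Stack the masked patch, current image, original image and context map."""
    patch_plane = mask_outside(current, box)
    # Channel-first layout: the result has shape (4, height, width).
    return np.stack([patch_plane, current, original, context_map], axis=0)

current = np.arange(16, dtype=np.float64).reshape(4, 4)
original = np.zeros((4, 4))
context = np.ones((4, 4))
x = build_enhancer_input(current, original, context, (1, 1, 2, 2))
```

Representing the patch as a masked full-size plane keeps all inputs spatially aligned, so a fully convolutional network can relate the attended region to the rest of the face.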
3.4 Model Training
We illustrate the training strategy of our Attention-FH in Fig. 4. The recurrent policy network, which learns to obtain the attentional patch by maximizing the reinforced reward, is shown in Fig. 2(1). Fig. 2(2) shows the local enhancement network, which learns to enhance the attended patch in an end-to-end mode by minimizing the MSE.
In the training phase, the recurrent policy network is optimized by the REINFORCE algorithm, guided by the reward calculated at the end of sequential enhancement when the maximum time step is reached. The local enhancement network is optimized to minimize the distance between the restored patch and its corresponding HR ground truth. The supervised loss is calculated at each time step and can be minimized based on backpropagation. After calculating the last step of attending patches for local enhancement, we obtain the global reward, which is leveraged to optimize the policy network.
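To make the role of the single delayed reward concrete, the following sketch computes the REINFORCE gradient for a simplified policy. It assumes a softmax policy over a small set of discrete candidate regions at each step, which is a simplification of the action space described in this paper, and the `baseline` argument and function names are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_logit_grads(logits, actions, reward, baseline=0.0):
    """Gradient of (reward - baseline) * sum_t log pi(a_t) w.r.t. the logits.

    logits:  (T, K) array with one row of action scores per time step.
    actions: length-T sequence of the sampled action indices.
    reward:  the single global reward received after the last step.
    """
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[actions]
    # For a softmax policy, d log pi(a) / d logits = one_hot(a) - probs.
    return (reward - baseline) * (one_hot - probs)

logits = np.zeros((2, 3))
grads = reinforce_logit_grads(logits, [0, 2], reward=2.0)
logits_updated = logits + 0.1 * grads  # one gradient-ascent step
```

Every step of the sequence is scaled by the same final reward, so actions that belong to a well-restored face are all reinforced together, without any per-step supervision of the region choice.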
To demonstrate the advantages of our Attention-FH, we have conducted extensive experiments on multiple widely used benchmarks, i.e., CelebFaces Attributes Dataset, Public Figures Face Dataset, Labeled Faces in the Wild Dataset, Surveillance Cameras Face Dataset and BioID Face Dataset. We first briefly introduce the evaluation datasets, the corresponding evaluation protocols and the implementation details. Then, we perform comprehensive comparisons to verify the superiority of our Attention-FH over all the compared state-of-the-art approaches. Note that because the degradation types of face hallucination are complex, we have employed several down-sampling factors to evaluate our Attention-FH under various challenging conditions. Finally, we have performed detailed ablation studies to demonstrate the contribution of each component within our Attention-FH.
|Dataset||Scale||Bicubic||General SR||Face Hallucination||GAN Based||Our|
4.1 Datasets and Evaluation Protocols
We employ the following seven public datasets under various domains for a comprehensive evaluation to validate the robustness of our Attention-FH to in-the-wild faces.
SCface consists of 4,160 images with 130 identities collected under an uncontrolled environment using five video surveillance cameras in various situations. As video surveillance is one of the main application scenarios for face hallucination, this dataset can be used to evaluate the compared methods from a practical perspective. We utilize 2,405 images for training and the rest for testing.
BioID is a public dataset with 1,521 gray face images, all taken in a laboratory from a frontal view. We use 1,028 images for training and the remaining 493 images for testing.
PubFig is a large dataset with 58,797 real-world face images collected from the Web. In our experiment, 11,041 images are utilized for training, and 6,425 images are used for testing.
Multi-PIE contains images of 337 subjects captured from complicated perspectives and under complex illumination conditions in four different sessions. We use 126,093 images for training and 31,524 images for testing.
Extended Yale-B is a large face dataset that contains 16,128 images of 28 identities under 9 poses and 64 illumination conditions. We randomly choose 12,908 images for training and 3,220 images for evaluation.
4.2 Implementation Details
All the datasets are first aligned with two points of the eye regions by CFSS. Then, we simply crop the images to a size of 160×120 by taking the central region, except for the LFW dataset, the images of which are cropped to 128×128. To ensure a fair comparison, all the methods are trained on only the corresponding training set, without using the other datasets for pre-training. We evaluate our method with scaling factors of 4, 8 and 16 to model different types of situations. In addition, we also normalize the input images into . The recurrent time step of the policy network is set to 18 to achieve a trade-off between efficiency and accuracy. The setting of the recurrent time step is also investigated in Section 4.5. Our Attention-FH is trained using the Adam optimizer with a base learning rate of , a weight decay of , and a momentum term of . The training batch size is 16. Because a completely unconstrained attention region can lead to unstable performance, we impose some empirical constraints on the size-free attention mechanism, i.e., the length and width of the attention region are customized with respect to the ScaleID and RatioID, which are evaluated in Section 4.5.
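The cropping and downscaling steps can be sketched as below. Several details are assumptions made only for illustration: the crop is taken as 160 rows by 120 columns, the LR input is simulated by block averaging (the downsampling kernel is not specified in the text), rows and columns that do not fill a complete block are discarded, and the function names are hypothetical.

```python
import numpy as np

def center_crop(image, out_h, out_w):
    """Return the central out_h x out_w region of a 2-D (or H x W x C) image."""
    h, w = image.shape[:2]
    if h < out_h or w < out_w:
        raise ValueError("image is smaller than the requested crop")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]

def block_downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D image."""
    h, w = image.shape
    h2, w2 = h // factor, w // factor
    trimmed = image[:h2 * factor, :w2 * factor]  # drop incomplete blocks
    return trimmed.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

aligned = np.ones((200, 150))          # stand-in for an aligned face image
hr = center_crop(aligned, 160, 120)    # HR target
lr = block_downsample(hr, 8)           # simulated LR input at scaling factor 8
```

With a scaling factor of 8, the 160 by 120 crop yields a 20 by 15 LR input, which shows how little information remains at the larger factors.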
4.3 Competing Methods
We compare our method with several state-of-the-art methods, including SRCNN, VDSR, SFH, BiCNN, GLN, and SRGAN. These methods can be categorized into three groups: (i) general image super-resolution: SRCNN and VDSR; (ii) face hallucination: SFH, BiCNN and GLN; and (iii) generative adversarial learning: SRGAN. The first and second types are commonly applied to address regular image and face image restoration, respectively, while the third one is widely used in image generation and has achieved impressive results.
4.4 Quantitative and Qualitative Comparisons
As illustrated in Tab. III, IV and V, our Attention-FH consistently outperforms all the compared state-of-the-art methods, with clear margins in terms of all evaluation metrics. Attention-FH outperforms the best of the competing methods by 2.42 dB, 1.32 dB, and 1.56 dB in PSNR on the SCface dataset, respectively. Moreover, our Attention-FH surpasses all the competing methods by large margins on all datasets when the scaling factor is small (e.g., 4). These results confirm the clear advantage of our Attention-FH. Note that we do not present the quantitative results of SFH, which relies heavily on face alignment and thus may fail to handle some testing images.
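To give a feel for what a PSNR margin means, the helper below converts a gain in dB into the corresponding ratio of mean squared errors. It relies only on the standard definition PSNR = 10 log10(peak² / MSE) with a fixed peak value, and the function name is hypothetical. Because reported PSNR values are averaged over a test set, the ratio should be read as indicative rather than exact.

```python
def mse_ratio_from_psnr_gain(gain_db):
    """MSE of the weaker method divided by MSE of the stronger one."""
    # PSNR_a - PSNR_b = 10 * log10(MSE_b / MSE_a) when both use the same peak.
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_from_psnr_gain(2.42)  # roughly a 1.75x reduction in MSE
```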
The visual comparison on the SCface dataset is presented in Fig. 5. Since the SCface dataset is similar to a real-world scenario, the dataset can be employed to explicitly validate the performance of each hallucination approach in terms of practicability. As shown in Fig. 5, regular super-resolution methods produce hallucinated facial images with blurry predictions. By contrast, our Attention-FH can generate faces with well-maintained facial structure. This result demonstrates that our Attention-FH is capable of deblurring and anti-aliasing facial images to preserve the structural information.
The qualitative results shown in Fig. 6, 7 and 8 demonstrate that our Attention-FH achieves significant improvements in restoration quality compared with all the competing methods. In addition, the attention mechanism also benefits our Attention-FH when addressing variation in pose, illumination and facial appearance. As depicted in Fig. 6, the facial expression of the woman in the third row is a ‘smile’ with her mouth opened, and the latter woman has a ‘smile’ with her mouth closed. Our Attention-FH outperforms all the competing methods in successfully addressing these two cases with corrupted inputs. Furthermore, our Attention-FH can even hallucinate the man in the eighth row with ‘glasses’, which is extremely challenging for all the compared state-of-the-art approaches. As demonstrated in Fig. 8, our Attention-FH can recover naturally acceptable facial images even after substantial information has been lost by downsampling.
To further verify the effectiveness of our model, we also compare our proposed Attention-FH with some methods [47, 48, 49] proposed after the conference version of this paper. Pixel-SR and Image Transformer exhibit good performance in general object hallucination, and the enhanced deep super-resolution network (EDSR) is known for generating clear structures in general image SR. Since Pixel-SR and Image Transformer aim to hallucinate small-size objects (e.g., 32×32 pixels), the target resolution of face hallucination (e.g., 128×128 pixels) may exhaust GPU memory. Hence, following the official implementation, we conduct patch-based learning and combine the restored patches into a full image for evaluation. As shown in Tab. VII, our model produces better results than the other methods, except for EDSR. Nevertheless, Attention-FH is sufficiently flexible to incorporate EDSR as the local enhancement network to improve performance. We reduce the number of recurrent steps to 4 and implement EDSR for local enhancement. “Our-EDSR” achieves superior performance to EDSR as we expected, which illustrates the flexibility and effectiveness of the proposed model. Furthermore, we average the results of three model snapshots with different iterations (after convergence) of “Our-EDSR” and “EDSR” for comparison. As illustrated in Tab. VII, “Our-EDSR ensemble” achieves superior performance.
Additionally, we compare Attention-FH on general image super-resolution with state-of-the-art image SR approaches (i.e., VDSR, LapSRN, MemNet and IDN). We follow the training scheme of IDN and conduct experiments on Set14 to demonstrate the performance of our method on arbitrary domain images. With the overall parameter number fixed, we adaptively decrease the recurrent step and build up the local enhancement network for better image SR. As shown in Tab. VIII, our method consistently outperforms all the compared methods on the ×2, ×3 and ×4 settings. Though Attention-FH is indeed capable of restoring general images well, our model is still a face hallucination framework. By incorporating a recurrent attention mechanism, Attention-FH is specialized in low-resolution faces and achieves much greater advantages in the task of face hallucination.
|Algorithm||Multi-PIE ×8||Yale-B ×8|
|Algorithm||LFW ×8||LFW ×16|
4.5 Ablation Study on the Policy Network
To demonstrate the effectiveness of the policy network, we compare several variants of our Attention-FH as baseline methods, i.e., “CNN-16”, “Our w/o attention”, “Our w/ random”, “Our w/ sequences”, “Our w/o size-free” and “Our w/ agent”. “CNN-16” indicates a plain convolution network with 16 layers. “Our w/o attention” means that the whole face image is recursively enhanced via a recurrent model from a holistic perspective instead of attending patches. “Our w/ random” denotes that we randomly select the attention region to perform random patch-based enhancement. “Our w/ sequences” scans through the whole image in sequence. “Our w/o size-free” uses a fixed attention box inside the policy network. “Our w/ agent” replaces with as the input for the policy network. Moreover, we also consider a complicated transformation mechanism that excludes RL, e.g., Attention-FH with a spatial transformer network (STN) (denoted as “STN”), which is capable of learning to select informative patches for face hallucination by estimating the subgradients w.r.t. the location of the captured patch.
To demonstrate the superiority of the policy network, we compare it with the STN. Both our Attention-FH and the STN mine the most important patches. The STN chooses the patch by learning auxiliary terms through MSE minimization, whereas our Attention-FH employs RL to iteratively identify the appropriate attention region. We therefore adopt the STN as a baseline to illustrate the strengths of RL. For the STN variant, the outputs of the policy network are replaced with a vector that gives the coordinates at which the patch is extracted. We fix the transform scale so that the attended patch has a fixed size. As shown in Tab. IX, our full model outperforms STN by a clear margin. This result supports the benefit of RL in mining crucial regions for face hallucination.
| Algorithm | LFW ×4 (PSNR, dB) | LFW ×8 (PSNR, dB) |
|---|---|---|
| Our w/o attention | 32.29 | 25.69 |
| Our w/ random | 31.63 | 25.74 |
| Our w/ sequences | 31.78 | 25.98 |
| Our w/o size-free | 32.89 | 26.12 |
| Our w/ agent | 32.12 | 25.97 |
4.5.1 Effectiveness of Patch-wise Enhancement
As shown in Tab. IX, our full model surpasses “Our w/o attention” by 0.73 dB and 0.28 dB on the LFW dataset with scaling factors of 4 and 8, respectively. This supports the view that in-the-wild faces vary too much to be restored holistically, whereas individual facial parts are relatively stable and can be exploited for partial enhancement. In addition, “Our w/ random” obtains a significant improvement over “CNN-16”, which indicates that the multiple recurrent enhancement steps themselves help image recovery. Furthermore, our full model achieves consistently higher PSNR values than “Our w/ random”, owing to the optimization of the crucial patch sequence via RL.
4.5.2 Effectiveness of Sequentially Attending Patches
We demonstrate the contribution of sequentially attending patches from the perspective of the attention agent. As reported in Tab. IX, when we feed the face image without enhancements into the policy network (i.e., “Our w/ agent”), the PSNR degrades on the LFW dataset with scaling factors of 4 and 8, whereas our full model produces better scores. This result indicates that the restored patch not only improves the final face image but also leads the agent to select a more accurate region sequence for restoration.
4.5.3 Effectiveness of Increasing Recursion Depth
To examine the sensitivity of our Attention-FH to the number of recursive steps, we explore the effect of different step counts when sequentially enhancing facial parts. Specifically, we conduct the experiment under five different step settings on the LFW dataset with scaling factors of 4 and 8. Fig. 9 shows that the face hallucination performance increases gradually with the number of attention steps. The PSNR improves sharply while the number of recursion steps is small, because the extracted patches do not yet cover the whole image. Once the number of recursion steps is large enough for the extracted patches to generally cover the whole image, the step-wise improvement in PSNR becomes negligible, and this saturation becomes more pronounced as the number of steps grows further. Owing to the size-free attention strategy and the stabilized reward function, Attention-FH obtains considerable restoration quality in terms of PSNR at the step count we adopt, which we set empirically in view of the acceptable computational cost in practical scenarios.
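The coverage argument above can be illustrated with a small sketch that tracks which fraction of the image has been attended at least once after each step. This is only a toy under stated assumptions: the 128 × 128 image size and the five 60 × 60 boxes are hypothetical, and the helper names are ours rather than part of Attention-FH.

```python
def coverage_fraction(boxes, img_h, img_w):
    """Fraction of pixels covered by at least one (top, left, h, w) box."""
    covered = [[False] * img_w for _ in range(img_h)]
    for top, left, h, w in boxes:
        for r in range(max(0, top), min(img_h, top + h)):
            for c in range(max(0, left), min(img_w, left + w)):
                covered[r][c] = True
    return sum(row.count(True) for row in covered) / (img_h * img_w)

def coverage_curve(boxes, img_h, img_w):
    """Coverage after 1, 2, ..., len(boxes) attention steps."""
    return [
        coverage_fraction(boxes[:t], img_h, img_w)
        for t in range(1, len(boxes) + 1)
    ]

if __name__ == "__main__":
    # Hypothetical attention sequence: four corner boxes, then the center.
    boxes = [
        (0, 0, 60, 60),
        (0, 68, 60, 60),
        (68, 0, 60, 60),
        (68, 68, 60, 60),
        (34, 34, 60, 60),
    ]
    for step, frac in enumerate(coverage_curve(boxes, 128, 128), start=1):
        print(f"step {step}: coverage {frac:.3f}")
```

In this toy sequence the early boxes each add a large uncovered area, while the last box mostly overlaps regions that were already attended, which mirrors the diminishing step-wise gain described in the text.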
4.5.4 Effectiveness of the Stabilized Reward Function
We conduct an experiment to validate the contribution of the proposed reward function. In contrast to the former reward function, our stabilized reward function uses the PSNR gain rather than the absolute PSNR value as the reward. With the updated reward, Attention-FH learns an accurate attention agent for face hallucination. As shown in Fig. 10, our reward function achieves more stable performance with lower variance. Furthermore, compared with the reward function from our preliminary conference version, the new reward function leads to higher and more stable PSNR results for our Attention-FH.
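The difference between a gain-based and an absolute reward can be sketched as follows. This is a minimal illustration, assuming 8-bit images stored as nested lists and a per-step reward equal to the PSNR difference between consecutive intermediate results. The exact normalization and timing of the reward in Attention-FH are not specified in this section, so the function names and the toy data are hypothetical.

```python
import math

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2-D images."""
    squared_error = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

def absolute_rewards(step_images, target):
    """Reward each step with the absolute PSNR of its result."""
    return [psnr(img, target) for img in step_images[1:]]

def gain_rewards(step_images, target):
    """Reward each step with its PSNR gain over the previous result.

    step_images[0] is the image before any enhancement.
    """
    scores = [psnr(img, target) for img in step_images]
    return [after - before for before, after in zip(scores, scores[1:])]

if __name__ == "__main__":
    target = [[100, 100], [100, 100]]
    steps = [
        [[90, 90], [90, 90]],   # initial input
        [[95, 95], [95, 95]],   # after step 1
        [[99, 99], [99, 99]],   # after step 2
    ]
    print("absolute:", absolute_rewards(steps, target))
    print("gain:    ", gain_rewards(steps, target))
```

Because the gains telescope, their sum equals the total PSNR improvement over the input, while each individual reward reflects only what that step contributed. This removes the image-dependent offset that an absolute PSNR reward carries.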
4.5.5 Effectiveness of the Size-free Attention Mechanism
As depicted in Fig. 3, the size of each facial component differs across identities. We therefore improve our policy network with a flexible attention mechanism that attends to the facial part accurately. To validate its effectiveness, we conduct an ablation study with a fixed attention box in the policy network, named “Our w/o size-free”. This model has the same settings as our full model, except that it uses a 60 × 60 attention box. Tab. IX shows that although “Our w/o size-free” produces favorable results, our Attention-FH achieves 0.13 dB and 0.12 dB improvements with scaling factors of 4 and 8, respectively. These results confirm the contribution of the proposed size-free attention mechanism.
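As a rough illustration of the difference between a fixed and a size-free attention box, the sketch below crops a patch given a box center and a box size. It is not the authors' policy network: how the box parameters are predicted is outside this sketch, and the function names, the 128 × 128 image and the example centers and sizes are hypothetical.

```python
def clamp_box(cy, cx, h, w, img_h, img_w):
    """Turn a (center, size) box into (top, left, h, w) inside the image."""
    h = max(1, min(h, img_h))
    w = max(1, min(w, img_w))
    top = min(max(cy - h // 2, 0), img_h - h)
    left = min(max(cx - w // 2, 0), img_w - w)
    return top, left, h, w

def crop(image, box):
    """Extract the patch covered by a (top, left, h, w) box."""
    top, left, h, w = box
    return [row[left:left + w] for row in image[top:top + h]]

if __name__ == "__main__":
    image = [[0] * 128 for _ in range(128)]  # assumed 128 x 128 input

    # Fixed box: the size is always 60 x 60, only the center varies.
    fixed = clamp_box(64, 64, 60, 60, 128, 128)

    # Size-free box: height and width are chosen per step as well,
    # e.g. a wide, short box for a mouth-like region.
    size_free = clamp_box(64, 64, 40, 80, 128, 128)

    for name, box in (("fixed", fixed), ("size-free", size_free)):
        patch = crop(image, box)
        print(name, box, "patch size:", len(patch), "x", len(patch[0]))
```

With a fixed box, a component that is smaller or larger than 60 × 60 is either padded with surrounding context or truncated, whereas the size-free variant can match the extent of the component.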
4.6 Ablation Study on the Local Enhancement Network
Since the pipeline of our Attention-FH is flexible and extensible, we investigate the use of different network architectures as the local enhancement network. To make the investigation comprehensive, we consider several methods, namely, VDSR, Sub-pixel, GLN, LapSRN, U-Net and FSRCNN. For U-Net, we reduce the number of parameters so that the model does not become too large to train. As shown in Tab. X, LapSRN achieves the best results, while U-Net is the most efficient method for generating hallucinated faces. However, the candidate architectures do not differ markedly in terms of PSNR under the recurrent attention mechanism. We choose FSRCNN as the implementation of our local enhancement network based on the trade-off between efficiency and accuracy.
4.7 Efficiency Analysis
We have also conducted an experimental comparison to verify the efficiency of Attention-FH. The results in Tab. XI demonstrate that our Attention-FH achieves its superior performance at a very low time cost. Compared with EDSR, our model uses far fewer parameters. Although Attention-FH is less efficient than VDSR, it achieves a significant improvement in restoration quality. On a single GPU, Attention-FH runs in real time. However, Attention-FH still faces an efficiency bottleneck when deployed on mobile platforms.
In this section, we discuss the limitations of Attention-FH. With the novel attention mechanism, our model is capable of hallucinating profile and occluded faces well. However, Attention-FH may hallucinate an incorrect facial part if the occluding content lies close to a facial component. As shown in the last row of Fig. 11, Attention-FH hallucinates a mouth in the area occluded by a hand. This failure case illustrates that Attention-FH reaches an upper bound on samples with complex occlusion. We will further improve Attention-FH by enhancing its generalization ability towards complex occlusion cases.
In this paper, we have proposed a deep RL-based attention mechanism to address the problem of face hallucination. In contrast to traditional patch-wise face hallucination models that usually neglect the interdependency between facial components, our framework implements a deep RL model and jointly optimizes a recurrent policy network, which learns to determine an ordered patch hallucination sequence, and a local enhancement network for facial part super-resolution. Our Attention-FH fully reflects the human visual perception mechanism and is capable of adaptively inferring an optimal search path for each facial image according to its unique appearance features. Extensive experiments show that Attention-FH outperforms state-of-the-art face hallucination methods and achieves leading performance on both widely used evaluation protocols and visual quality comparisons.
-  Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, 2015, pp. 3730–3738.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning deep representation for face alignment with auxiliary attributes,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 5, pp. 918–930, 2016.
-  L. Liu, G. Li, Y. Xie, Y. Yu, Q. Wang, and L. Lin, “Facial landmark machines: A backbone-branches architecture with progressive representation learning,” IEEE Transactions on Multimedia, 2019.
-  E. Zhou, Z. Cao, and Q. Yin, “Naive-deep face recognition: Touching the limit of lfw benchmark or not?” arXiv preprint arXiv:1501.04690, 2015.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” 2017.
-  Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “Fsrnet: End-to-end learning face super-resolution with facial priors,” arXiv preprint arXiv:1711.10703, 2017.
-  S. Zhu, S. Liu, C. C. Loy, and X. Tang, “Deep cascaded bi-network for face hallucination,” in ECCV. Springer, 2016, pp. 614–630.
-  Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang, “Learning to hallucinate face images via component generation and enhancement,” arXiv preprint arXiv:1708.00223, 2017.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on image processing, vol. 19, no. 11, pp. 2861–2873, 2010.
-  C.-Y. Yang, S. Liu, and M.-H. Yang, “Structured face hallucination,” in CVPR, 2013, pp. 1099–1106.
-  X. Ma, J. Zhang, and C. Qi, “Hallucinating face by position-patch,” Pattern Recognition, vol. 43, no. 6, pp. 2224–2236, 2010.
-  M. F. Tappen and C. Liu, “A bayesian approach to alignment-based image hallucination,” in ECCV, 2012, pp. 236–249.
-  E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin, “Learning face hallucination in the wild.” in AAAI, 2015, pp. 3871–3877.
-  C. Liu, H.-Y. Shum, and W. T. Freeman, “Face hallucination: Theory and practice,” International Journal of Computer Vision, vol. 75, no. 1, pp. 115–134, 2007.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.
-  Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition with very deep neural networks,” arXiv preprint arXiv:1502.00873, 2015.
-  Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin, “Multi-label image recognition by recurrently discovering attentional regions,” in ICCV, 2017, pp. 464–472.
-  T. Chen, Z. Wang, G. Li, and L. Lin, “Recurrent attentional reinforcement learning for multi-label image recognition,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  G. Li, Y. Gan, H. Wu, N. Xiao, and L. Lin, “Cross-modal attentional context learning for rgb-d object detection,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1591–1601, 2019.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
-  Q. Cao, L. Lin, Y. Shi, X. Liang, and G. Li, “Attention-aware face hallucination via deep reinforcement learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 690–698.
-  X. Wang and X. Tang, “Hallucinating face by eigentransformation,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 35, no. 3, pp. 425–434, 2005.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” 2016.
-  O. Tuzel, Y. Taguchi, and J. R. Hershey, “Global-local face upsampling network,” arXiv preprint arXiv:1603.07235, 2016.
-  A. Shocher, N. Cohen, and M. Irani, “‘Zero-shot’ super-resolution using deep internal learning,” 2017.
-  S. Reed, A. v. d. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, D. Belov, and N. de Freitas, “Parallel multiscale autoregressive density estimation,” arXiv preprint arXiv:1703.03664, 2017.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” 2017.
-  X. Yu and F. Porikli, “Ultra-resolving face images by discriminative generative networks,” in European Conference on Computer Vision. Springer, 2016, pp. 318–333.
-  Z. Jie, X. Liang, J. Feng, X. Jin, W. Lu, and S. Yan, “Tree-structured reinforcement learning for sequential object localization,” in NIPS, 2016, pp. 127–135.
-  V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014, pp. 2204–2212.
-  X. Liang, L. Lee, and E. P. Xing, “Deep variation-structured reinforcement learning for visual relationship and attribute detection,” in CVPR, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057.
-  C. Xiong, S. Merity, and R. Socher, “Dynamic memory networks for visual and textual question answering,” in ICML, 2016.
-  B. Goodrich and I. Arel, “Reinforcement learning based visual attention with application to face detection,” in CVPR, 2012, pp. 19–24.
-  J. C. Caicedo and S. Lazebnik, “Active object localization with deep reinforcement learning,” in ICCV, 2015, pp. 2488–2496.
-  C. Ledig, Z. Wang, W. Shi, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, and A. Tejani, “Photo-realistic single image super-resolution using a generative adversarial network,” pp. 105–114, 2016.
-  N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in ICCV, 2009, pp. 365–372.
-  G. B. Huang, V. Jain, and E. Learned-Miller, “Unsupervised joint alignment of complex images,” in ICCV, 2007.
-  M. Grgic, K. Delac, and S. Grgic, “Scface–surveillance cameras face database,” Multimedia tools and applications, vol. 51, no. 3, pp. 863–879, 2011.
-  O. Jesorsky, K. J. Kirchberg, and R. Frischholz, “Robust face detection using the hausdorff distance,” in AVBPA, 2001, pp. 90–95.
-  R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “The cmu multi-pose, illumination, and expression (multi-pie) face database,” Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. TR-07-08, 2007.
-  A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE transactions on pattern analysis and machine intelligence, vol. 23, no. 6, pp. 643–660, 2001.
-  L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: a feature similarity index for image quality assessment,” IEEE transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
-  S. Zhu, C. Li, C. C. Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in Computer Vision and Pattern Recognition, 2015, pp. 4998–5006.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Computer Vision and Pattern Recognition Workshops, 2017, pp. 1132–1140.
-  R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” arXiv preprint arXiv:1702.00783, 2017.
-  N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku, “Image transformer,” arXiv preprint arXiv:1802.05751, 2018.
-  Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” pp. 4549–4557, 2017.
-  Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces. Springer, 2010, pp. 711–730.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015, pp. 2017–2025.
-  W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” pp. 1874–1883, 2016.
-  W. S. Lai, J. B. Huang, N. Ahuja, and M. H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” 2017.
-  O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer International Publishing, 2015.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” pp. 391–407, 2016.