Face Hallucination Using Split-Attention in Split-Attention Network

10/22/2020 ∙ by Yuanzhi Wang, et al.

Face hallucination is a domain-specific super-resolution (SR) task that generates high-resolution (HR) facial images from one or multiple observed low-resolution (LR) inputs. Recently, convolutional neural networks (CNNs) have been successfully applied to face hallucination to model the complex nonlinear mapping between HR and LR images. Although the global attention mechanisms equipped in CNNs naturally focus on facial structure information, they often ignore local and cross-feature structure information, resulting in limited reconstruction performance. To solve this problem, we propose a global-local split-attention mechanism and design a Split-Attention in Split-Attention (SIS) network that enables local attention across feature-map groups to attain global attention and to improve the feature representation ability. SIS generates local attention and focuses it on the interaction of key facial structure information at the channel level, thereby improving the performance of face image reconstruction. Experimental results show that the proposed approach consistently and significantly improves the reconstruction performance for face hallucination.


1 Introduction

Face hallucination, also known as face super-resolution (SR), is a specific domain of super-resolution designed for facial image enhancement. In real-world surveillance scenarios, the large distance between imaging sensors and faces of interest often results in low-resolution (LR) face images. Using face hallucination to restore high-resolution (HR) face images from LR ones helps with targeted person recognition and plays an important role in many applications, such as face detection, face recognition and face analysis [10].

Generally speaking, face hallucination can be divided into three categories depending on the source of prior information, just like general image SR methods: interpolation-based [23], reconstruction-based [2] and learning-based [17] approaches. Interpolation-based methods enlarge the pixel grid of an image and compute the values of the missing pixels with a mathematical formula based on the surrounding pixels. Reconstruction-based face hallucination relies on multiple LR input images and fuses their sub-pixel registration information. However, the efficiency and performance of interpolation- and reconstruction-based methods degrade when the scale factor is too large. In the last decade, learning-based approaches have become popular in face hallucination because they can fully utilize the prior information in training samples to map LR images to HR ones, with pleasing visual results.
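For concreteness, the bicubic interpolation used later as the Bicubic baseline can be reproduced in a single PyTorch call; this is a minimal sketch, with the tensor sizes chosen only to match the LR/HR sizes used in our experiments.

```python
import torch
import torch.nn.functional as F

# Illustrative LR luminance tensor: (batch, channels, height, width);
# 90x65 matches the LR size used later in the paper, but any size works.
lr = torch.rand(1, 1, 90, 65)

# Bicubic upscaling by a factor of 4: each missing pixel is computed from a
# weighted cubic combination of its 4x4 neighborhood of known pixels.
sr_bicubic = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
print(sr_bicubic.shape)  # torch.Size([1, 1, 360, 260])
```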

There are two common categories of learning-based face hallucination approaches: global-face hallucination and local-face hallucination. Global-face hallucination super-resolves the whole image at once to preserve global structure information. Wang et al. [19] used eigenfaces to project the whole image into a feature space and reconstructed HR images by transferring the LR linear combination into the HR space. Zhou et al. [24] used orthogonal canonical correlation analysis to achieve global face reconstruction. Global-face hallucination is robust to input noise but produces blurry face edges due to inaccurate local patch reconstruction. Local-face hallucination learns more accurate prior information from overlapping patches, working from local patches up to the global image. Splitting a large image into small patches is a natural idea, often summarized as "think globally and act locally". Chang et al. [4] used locally linear representation to explore the complex relationship between LR and HR images. Yang et al. [5] addressed the problem of generating an SR image from a single LR input using sparse representation. Jiang et al. [6] proposed an improved neighbor embedding method for face hallucination. However, the limited representation ability of these methods results in unsatisfactory reconstruction performance.

Recently, deep convolutional neural network (CNN) based methods have achieved significant improvements over conventional SR methods. Among them, Dong et al. [1] proposed SRCNN, which first introduced a three-layer CNN for image SR. Since then, SR reconstruction performance has kept advancing with the development of deep learning [22, 8], and the performance of face hallucination has improved accordingly [20, 13, 9, 16]. Attention mechanisms have been introduced into face hallucination to focus on face structure information. In these CNN-based face hallucination methods, attention mechanisms can be divided into two types: global-attention and local-attention based methods. Global-attention based methods directly focus attention on the global input features. Wang et al. [16] proposed a texture-attention module to obtain the global correspondence between frontal face images and multi-view face images. Wang et al. [15] proposed a multi-scale attention based face hallucination method to extract multi-scale global information and exploit the channel and spatial correlation of features. Global attention makes the network focus on global structure information but ignores local detailed information. Local-attention based methods divide the input into several local blocks along a certain dimension and then focus attention on these blocks to exploit local structure information. Song et al. [13] proposed a two-stage method that performs SR on the five facial components separately and then restores the reconstructed components to the face image, which focuses the attention of the CNN on local facial information. Lu et al. [9] proposed region-based attention to help deep residual networks reconstruct face images better.

Although the above-mentioned attention-based face hallucination methods yield satisfactory performance, most of them consider either a global or a local attention mechanism, which limits the receptive-field size and lacks cross-channel interaction in face structure information. To resolve these problems, and inspired by the split-attention mechanism [21], we propose a global-local split-attention mechanism and design a Split-Attention in Split-Attention (SIS) network that enables local attention across feature-map groups and improves the feature representation ability. The proposed SIS contains several Global-local Split-Attention Groups (GSAGs); each GSAG contains a short Split-Attention connection (SSAC) and several Local Split-Attention Mechanism Blocks (LSABs), which include a Split-Attention module to achieve local attention and to enhance the interaction of face structure information at the channel level. A global attention module fuses the features passing through the SSAC and the LSABs respectively, enabling local attention across feature-map groups and thereby achieving global attention. Experimental results show that the proposed method consistently and significantly improves the reconstruction performance of face images.

The contributions of our work can be summarized as follows:

(i) We introduce a novel global-local split-attention mechanism into CNN-based face hallucination to fuse different local attentions, which improves the interaction of cross-local attention to achieve global attention.

(ii) We propose the SIS network, which consists of several GSAGs; each GSAG contains several LSABs and an SSAC to generate different local attentions and to enhance the cross-channel interaction of facial structures. The proposed SIS improves the reconstruction performance of face hallucination.

2 Split-Attention in Split-Attention Network for Face Hallucination

In this section, we present the architecture of the proposed method, including the main backbone network and the details of the individual blocks.

Figure 1: Network architecture of FSSN, which consists of four parts: coarse feature extraction layer, Split-Attention in Split-Attention (SIS) deep feature extraction, upscale module, and reconstruction layer. SIS contains $G$ GSAGs, and each GSAG contains $B$ LSABs.

2.1 Network Architecture

In this subsection, we describe the network architecture of the proposed method, dubbed "FSSN", in detail. Fig. 1 shows the architecture of FSSN, which mainly consists of four parts: coarse feature extraction, Split-Attention in Split-Attention (SIS) deep feature extraction, an upscale module, and a reconstruction part. Let us denote $I_{LR}$ and $I_{SR}$ as the input and output of FSSN. We use only one convolutional layer (Conv) with a kernel size of 3×3 to extract the coarse feature $F_0$ from the input image:

$F_0 = H_{CF}(I_{LR})$,   (1)

where $H_{CF}(\cdot)$ denotes the coarse feature extraction operation with one convolutional layer. $F_0$ is then used for deep feature extraction with SIS, so we further have

$F_{DF} = H_{SIS}(F_0)$,   (2)

where $H_{SIS}(\cdot)$ denotes the proposed Split-Attention in Split-Attention structure, which contains $G$ GSAGs. After $F_0$ passes through SIS and the network obtains the deep feature, denoted $F_{DF}$, the upscaling operation is performed by the upscale module:

$F_{UP} = H_{UP}(F_{DF})$,   (3)

where $F_{UP}$ and $H_{UP}(\cdot)$ denote the upscaled feature and the upscale module respectively. Finally, the upscaled feature is reconstructed via one convolutional layer. The reconstructed $I_{SR}$ is formulated as:

$I_{SR} = H_{REC}(F_{UP}) = H_{FSSN}(I_{LR})$,   (4)

where $H_{REC}(\cdot)$ and $H_{FSSN}(\cdot)$ denote the reconstruction layer and the function of our FSSN respectively.
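The four-stage pipeline of Eqs. (1)-(4) can be sketched in PyTorch as follows. The single-channel input (luminance, per Sec. 3.1), the 64-filter width and the PixelShuffle upscaler are our illustrative assumptions; the SIS body is specified in Sec. 2.3.

```python
import torch
import torch.nn as nn

class FSSN(nn.Module):
    """Sketch of the FSSN pipeline in Eqs. (1)-(4); the SIS body is assumed given."""
    def __init__(self, sis_body: nn.Module, channels: int = 64, scale: int = 4):
        super().__init__()
        # Eq. (1): H_CF, one 3x3 conv on the luminance channel (an assumption).
        self.coarse = nn.Conv2d(1, channels, 3, padding=1)
        self.sis = sis_body                                      # Eq. (2): H_SIS
        self.upscale = nn.Sequential(                            # Eq. (3): H_UP
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                              # sub-pixel upscaling
        )
        self.reconstruct = nn.Conv2d(channels, 1, 3, padding=1)  # Eq. (4): H_REC

    def forward(self, i_lr: torch.Tensor) -> torch.Tensor:
        f0 = self.coarse(i_lr)
        f_df = self.sis(f0)
        f_up = self.upscale(f_df)
        return self.reconstruct(f_up)
```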

FSSN is then optimized with a loss function. Several previous methods [1, 7] for mapping LR images to HR images adopt the Mean Squared Error (MSE) as the loss to minimize. MSE usually favors the Peak Signal-to-Noise Ratio (PSNR) but tends to produce over-smoothed reconstructed images. To trade off PSNR against reconstruction quality, the proposed method utilizes the Mean Absolute Error (MAE), namely the $L_1$ loss, as our loss function. Given a training set $\{I_{LR}^{i}, I_{HR}^{i}\}_{i=1}^{N}$ containing $N$ LR inputs and their HR counterparts, the loss function of FSSN can be represented as:

$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \| H_{FSSN}(I_{LR}^{i}) - I_{HR}^{i} \|_{1}$,   (5)

where $\Theta$ denotes the parameter set of our network. The loss function is optimized using stochastic gradient descent. More implementation details of training are given in Section 3.1.
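Since Eq. (5) is the plain MAE, it maps directly onto PyTorch's built-in `nn.L1Loss`; the tensors below are stand-ins for a training batch (the 48×48 LR patches from Sec. 3.1, upscaled by 4, give 192×192 outputs).

```python
import torch
import torch.nn as nn

# Eq. (5): Mean Absolute Error between super-resolved outputs and HR ground
# truths. nn.L1Loss averages |SR - HR| over all pixels and over the batch.
l1 = nn.L1Loss()

sr = torch.rand(10, 1, 192, 192, requires_grad=True)  # stand-in network output
hr = torch.rand(10, 1, 192, 192)                      # stand-in ground truth

loss = l1(sr, hr)
loss.backward()  # gradients flow back toward the network parameters
```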

2.2 Local Split-Attention Mechanism Block with local attention

Previous CNN-based SR methods [16, 9, 20, 11] use large-scale residual blocks, which help the SR algorithms achieve better performance. However, the limited receptive-field size and the lack of cross-channel interaction in these versatile residual blocks reduce the interaction of face structure information. Inspired by the success of Split-Attention in [21], we propose the Local Split-Attention Mechanism Block (LSAB) to generate local attention and to enhance facial structure interaction across different channels. The detailed structures of LSAB and Split-Attention are shown in Fig. 1, where $C$, $H$ and $W$ represent the input channels, height and width respectively.

In each Split-Attention module, the input feature is divided into $r$ splits at the channel level, so the number of channels per split is $C/r$. These divided splits are then fused via an element-wise summation across the multiple splits, thereby realizing the interaction of facial structure information across different channels:

$F_{sum} = \sum_{i=1}^{r} F_{i}$,   (6)

where $\sum$ denotes the first element-wise summation operation, $F_{i}$ denotes the $i$-th split produced by the first division, and $F_{sum}$ denotes its output. $F_{sum}$ is then passed through an adaptive average pooling layer, two 1×1 convolutional layers and a softmax function, and is divided into $r$ splits again. Each of the current splits is multiplied by the corresponding previous split using an element-wise product, and an element-wise summation finally fuses these features into the output of Split-Attention, denoted $F_{SA}$. $F_{SA}$ is formulated as:

$F_{SA} = H_{SA}(F_{in}) = \sum_{i=1}^{r} a_{i} \odot F_{i}$,   (7)

where $\sum$ denotes the second element-wise summation operation, $a_{i}$ denotes the $i$-th attention split, $\odot$ denotes the element-wise product, $H_{SA}(\cdot)$ represents the function of Split-Attention, and $F_{in}$ denotes the input of Split-Attention.
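The following PyTorch sketch implements Eqs. (6)-(7) in the spirit of ResNeSt [21]; the reduction width and the handling of the $C/r$-channel output (which we assume the enclosing LSAB restores with a convolution) are our assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Sketch of the Split-Attention module of Eqs. (6)-(7)."""
    def __init__(self, channels: int = 64, radix: int = 2, reduction: int = 4):
        super().__init__()
        self.radix = radix
        split_ch = channels // radix                 # C/r channels per split
        inner = max(channels // reduction, 8)        # assumed bottleneck width
        self.fc1 = nn.Conv2d(split_ch, inner, kernel_size=1)  # first 1x1 conv
        self.fc2 = nn.Conv2d(inner, channels, kernel_size=1)  # second 1x1 conv

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        b = f_in.size(0)
        splits = f_in.chunk(self.radix, dim=1)       # divide into r splits
        f_sum = sum(splits)                          # Eq. (6): element-wise sum
        gap = F.adaptive_avg_pool2d(f_sum, 1)        # adaptive average pooling
        att = self.fc2(F.relu(self.fc1(gap)))        # two 1x1 convolutions
        att = att.view(b, self.radix, -1)            # divide into r splits again
        att = F.softmax(att, dim=1)                  # softmax across the splits
        att = att.view(b, self.radix, -1, 1, 1)
        # Eq. (7): element-wise product with each previous split, then sum.
        # Note: the output has C/r channels, matching Eqs. (6)-(7).
        return sum(att[:, i] * splits[i] for i in range(self.radix))
```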

2.3 Split-Attention in Split-Attention with global-local attention

We now describe the proposed SIS structure, which contains $G$ GSAGs and a long skip connection (LSC). Furthermore, each GSAG contains $B$ LSABs with a short Split-Attention connection (SSAC). With this structure, we train a deep convolutional network (over 200 layers) to gain better face hallucination performance. The detailed structure of SIS is shown in Fig. 1.

It has been demonstrated that residual blocks can be cascaded to build networks of more than 1000 layers [3]. However, such a structure can be difficult to train and is unlikely to gain further performance simply by cascading a very deep network. To solve this problem, Zhang et al. [22] proposed the residual-in-residual (RIR) structure to attain much deeper networks of over 400 convolutional layers. Inspired by RIR, we build our deep network SIS by stacking several GSAGs and one LSC. In each GSAG, on top of the stacked LSABs, a Split-Attention module is added to the traditional short skip connection (SSC), producing the SSAC. SIS enables local attention across feature-map groups and improves the feature representation ability. The $g$-th GSAG is formulated as:

$F_{g} = H_{GSAG,g}(F_{g-1}) = H_{conv}\big(H_{GA}\big(H_{SSAC}(F_{g-1}),\ H_{LSAB,B}(H_{LSAB,B-1}(\cdots H_{LSAB,1}(F_{g-1}) \cdots))\big)\big)$,   (8)

where $H_{GSAG,g}(\cdot)$ and $H_{LSAB,b}(\cdot)$ denote the functions of the $g$-th GSAG and its $b$-th LSAB respectively, $F_{g-1}$ and $F_{g}$ are the input and output of the $g$-th GSAG, and $H_{conv}(\cdot)$ denotes a 3×3 convolutional layer. $H_{GA}(\cdot)$ denotes the global attention module, implemented with the element-wise summation operation. The global attention module fuses the features from the two paths, each of which includes local attention, and enables local attention across feature-map groups, thereby improving the interaction of cross-local attention in face structure information to achieve global attention.

We observe that simply stacking many GSAGs fails to achieve better performance. To solve this problem, the LSC is further introduced into SIS to stabilize the training of the very deep network. The function of SIS is formulated as:

$F_{DF} = H_{SIS}(F_{0}) = F_{0} + H_{GSAG,G}(H_{GSAG,G-1}(\cdots H_{GSAG,1}(F_{0}) \cdots))$,   (9)

where the identity term $F_{0}$ constitutes the LSC. The SSAC splits the input features, including the facial structure information, and fuses them again at the channel level to produce the different local attentions.
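To make Eqs. (8) and (9) concrete, the following sketch stacks LSABs into a GSAG and GSAGs into SIS, reusing the `SplitAttention` sketch from Sec. 2.2. The internal layout of an LSAB (the convolutions around the Split-Attention module) and the channel-restoring 1×1 convolutions are our assumptions; the paper fixes only the overall topology.

```python
import torch
import torch.nn as nn

class LSAB(nn.Module):
    """Assumed LSAB layout: a residual block whose body ends in Split-Attention;
    the 1x1 convolution restores the C/r-channel output back to C channels."""
    def __init__(self, channels: int = 64, radix: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            SplitAttention(channels, radix),               # local attention
            nn.Conv2d(channels // radix, channels, 1),     # restore C channels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class GSAG(nn.Module):
    """Eq. (8): B stacked LSABs, an SSAC branch, element-wise fusion (H_GA),
    and a trailing 3x3 convolution (H_conv)."""
    def __init__(self, channels: int = 64, n_lsab: int = 5, radix: int = 2):
        super().__init__()
        self.lsabs = nn.Sequential(*[LSAB(channels, radix) for _ in range(n_lsab)])
        self.ssac = nn.Sequential(                         # short Split-Attention connection
            SplitAttention(channels, radix),
            nn.Conv2d(channels // radix, channels, 1),
        )
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.lsabs(x) + self.ssac(x)               # H_GA: element-wise sum
        return self.conv(fused)

class SIS(nn.Module):
    """Eq. (9): G stacked GSAGs with a long skip connection (LSC)."""
    def __init__(self, channels: int = 64, n_gsag: int = 10, n_lsab: int = 5):
        super().__init__()
        self.groups = nn.Sequential(*[GSAG(channels, n_lsab) for _ in range(n_gsag)])

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        return f0 + self.groups(f0)                        # LSC stabilizes training
```

The defaults of 10 GSAGs and 5 LSABs mirror the setting chosen in Sec. 3.1.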

Figure 2: Performance (PSNR) comparison for ×4 SR with different numbers of LSABs in FSSN; the number of GSAGs is fixed at 10.
Figure 3: Performance (PSNR) comparison for ×4 SR with different numbers of GSAGs in FSSN; the number of LSABs is fixed at 5.

3 Experiments

3.1 Dataset and Implementation Details

In this paper, we use the FEI [14] face dataset to evaluate the proposed algorithm. We use 350 images as the training set, 10 images as the validation set and 40 images as the testing set. The HR image size is 360×260 pixels, and the downsampling factor is 4, so the LR image size (using the bicubic degradation model) is 90×65 pixels. Note that all training, validation and testing are based on the luminance channel in the YCbCr color space, and an upscaling factor of 4 is used for training and testing. The SR results are evaluated on the luminance channel with three indexes: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [18] and Visual Information Fidelity (VIF) [12].
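As a small sketch of the evaluation step, PSNR on the luminance channel reduces to $10\log_{10}(\mathrm{peak}^2/\mathrm{MSE})$; SSIM [18] and VIF [12] would come from standard implementations and are omitted here. The image sizes below are just the FEI HR dimensions.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, peak: float = 255.0) -> float:
    """PSNR = 10 * log10(peak^2 / MSE), computed on 8-bit luminance images."""
    mse = torch.mean((sr.float() - hr.float()) ** 2)
    return (10.0 * torch.log10(peak ** 2 / mse)).item()

# Illustrative 8-bit luminance images of the FEI HR size (360x260).
hr = torch.randint(0, 256, (360, 260))
sr = (hr.float() + torch.randn(360, 260)).clamp(0, 255).round()
print(f"PSNR: {psnr(sr, hr):.2f} dB")
```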

Data augmentation is performed on the 350 training images, which are randomly rotated by 90°, 180° and 270° and flipped horizontally. In each training batch, 10 LR color patches with a size of 48×48 are extracted as inputs. Our model is trained with the Adam optimizer, and the initial learning rate is halved every 20 epochs. We implement our models in PyTorch and train them on a GTX 1080 GPU.
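A minimal sketch of the augmentation described above, assuming patches are stored as (C, H, W) tensors; the rotation angles and horizontal flip match the text, while the 50% flip probability is our assumption.

```python
import random
import torch

def augment(img: torch.Tensor) -> torch.Tensor:
    """Randomly rotate by 0/90/180/270 degrees and horizontally flip.

    img: (C, H, W) tensor; rotations act on the spatial dims via torch.rot90.
    """
    k = random.randint(0, 3)                 # number of 90-degree rotations
    img = torch.rot90(img, k, dims=(1, 2))
    if random.random() < 0.5:                # horizontal flip (assumed p=0.5)
        img = torch.flip(img, dims=(2,))
    return img
```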

We set the number of GSAGs to 10 in the SIS structure, and in each GSAG we set the number of LSABs to 5. Convolutional layers in the coarse feature extraction and the SIS structure share a common filter count, except for those used for channel-downscaling.

3.2 Ablation Experiments

In this subsection, we conduct ablation studies on the FEI face testing set to determine the most effective combination of GSAGs and LSABs.

To verify the effectiveness of the proposed SIS, which contains 10 GSAGs and 5 LSABs, on face hallucination performance, we design two sets of experiments under different constraints for ×4 SR. In the first set, we fix the number of GSAGs to 10 and test the reconstruction performance of FSSN when the number of LSABs is 5, 10, 15 and 20 respectively. In the second set, we fix the number of LSABs to 5 and test the reconstruction performance of FSSN when the number of GSAGs is 5, 10, 15 and 20 respectively.

Fig. 2 and Fig. 3 show the results of the above two sets of experiments. We find that the PSNR of the proposed method does not increase monotonically with the number of GSAGs or LSABs but fluctuates up and down. The two figures show that performance is best when the number of GSAGs is 10 and the number of LSABs is 5, which is why we chose this combination for SIS.

Figure 4: Visual comparison for ×4 SR with different SR methods. Four images from the FEI testing set are selected as samples to show the reconstruction results of face images.

3.3 Comparison with State-of-the-Art

In this subsection, we compare with several strong SR methods: Bicubic, LCGE [13], EDGAN [20], PRDRN [9], SRFBN [8], MTC [16] and RCAN [22]. Bicubic is a classic image interpolation algorithm; LCGE is a classic two-step method for face SR; EDGAN is a state-of-the-art deep-learning face hallucination algorithm using a generative adversarial network (GAN); PRDRN is a parallel region-based face SR method; SRFBN is a recent state-of-the-art deep-learning face SR algorithm using a feedback network; MTC is a novel face SR method using multi-view texture compensation; RCAN is a classic deep residual channel attention network for SR.

Table 1 lists the experimental results of the different state-of-the-art methods and the proposed method on the FEI testing set. FSSN is clearly superior to these algorithms on all three evaluation indicators, which proves the effectiveness of SIS. In terms of visual quality, Fig. 4 shows the face SR results of the different methods on four representative samples. Columns (a) to (g) are the results of the seven selected algorithms, and column (i) is the HR ground truth used as the benchmark. From the enlarged areas, we can observe that some details produced by EDGAN, which uses a GAN, have sharper edges, but these edges do not match the ground truth, so EDGAN scores lower than our method in PSNR/SSIM/VIF. Columns (b), (e) and (g) are the results of LCGE, PRDRN and MTC respectively; a lot of texture information is lost in (b) and (e), and (g) suffers from serious texture distortion. The visual results of our model are shown in (h); intuitively, FSSN achieves high visual quality.

Method     Bicubic   LCGE [13]   EDGAN [20]   PRDRN [9]   RCAN [22]   SRFBN [8]   MTC [16]   Ours
PSNR (dB)  36.29     38.55       38.67        39.36       40.25       40.13       37.92      40.41
SSIM       0.9416    0.9519      0.9475       0.9576      0.9619      0.9625      0.9491     0.9639
VIF        0.6498    0.6832      0.6664       0.7157      0.7328      0.7371      0.6793     0.7414
Table 1: Comparison for ×4 SR with the state of the art on the FEI dataset. Red indicates the best performance.

4 Conclusion

In this paper, a novel face hallucination method is proposed that uses a global-local split-attention mechanism to enable local attention across feature-map groups and to improve the interaction of cross-local attention. The proposed SIS generates local attention and focuses it on face structure interaction at the channel level, thereby improving the performance of face image reconstruction. In the future, we believe the proposed method can be readily applied to other image restoration problems, such as denoising, deblurring and general image SR.

References

  • [1] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 184–199. External Links: ISBN 978-3-319-10593-2 Cited by: §1, §2.1.
  • [2] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar (2004) Fast and robust multiframe super resolution. IEEE Transactions on Image Processing 13 (10), pp. 1327–1344. Cited by: §1.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §2.3.
  • [4] H. Chang, D.-Y. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 1. Cited by: §1.
  • [5] J. Yang, J. Wright, T. Huang, and Y. Ma (2008) Image super-resolution as sparse representation of raw image patches. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
  • [6] J. Jiang, R. Hu, Z. Han, Z. Wang, T. Lu, and J. Chen (2013) Locality-constraint iterative neighbor embedding for face hallucination. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1.
  • [7] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654. Cited by: §2.1.
  • [8] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu (2019) Feedback network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3867–3876. Cited by: §1, §3.3, Table 1.
  • [9] T. Lu, X. Hao, Y. Zhang, K. Liu, and Z. Xiong (2019) Parallel region-based deep residual networks for face hallucination. IEEE Access 7, pp. 81266–81278. Cited by: §1, §2.2, §3.3, Table 1.
  • [10] T. Lu, Y. Guan, Y. Zhang, S. Qu, and Z. Xiong (2018) Robust and efficient face recognition via low-rank supported extreme learning machine. Multimedia Tools and Applications 77 (9), pp. 11219–11240. Cited by: §1.
  • [11] T. Lu, J. Wang, J. Jiang, and Y. Zhang (2020) Global-local fusion network for face super-resolution. Neurocomputing 387, pp. 309–320. Cited by: §2.2.
  • [12] H. R. Sheikh and A. C. Bovik (2006) Image information and visual quality. IEEE Transactions on Image Processing 15 (2), pp. 430–444. Cited by: §3.1.
  • [13] Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang (2017) Learning to hallucinate face images via component generation and enhancement. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4537–4543. External Links: ISBN 9780999241103 Cited by: §1, §3.3, Table 1.
  • [14] C. E. Thomaz and G. A. Giraldi (2010) A new ranking method for principal components analysis and its application to face image analysis. Image and Vision Computing 28 (6), pp. 902–913. Cited by: §3.1.
  • [15] C. Wang, Z. Zhong, J. Jiang, D. Zhai, and X. Liu (2020) Parsing map guided multi-scale attention network for face hallucination. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2518–2522. Cited by: §1.
  • [16] Y. Wang, T. Lu, R. Xu, and Y. Zhang (2020) Face super-resolution by learning multi-view texture compensation. In MultiMedia Modeling, Y. M. Ro, W. Cheng, J. Kim, W. Chu, P. Cui, J. Choi, M. Hu, and W. De Neve (Eds.), Cham, pp. 350–360. External Links: ISBN 978-3-030-37734-2 Cited by: §1, §2.2, §3.3, Table 1.
  • [17] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, and J. Ma (2019) Multi-memory convolutional neural network for video super-resolution. IEEE Transactions on Image Processing 28 (5), pp. 2530–2544. Cited by: §1.
  • [18] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §3.1.
  • [19] X. Wang and X. Tang (2005) Hallucinating face by eigentransformation. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 35 (3), pp. 425–434. Cited by: §1.
  • [20] X. Yang, T. Lu, J. Wang, Y. Zhang, Y. Wu, Z. Wang, and Z. Xiong (2018) Enhanced discriminative generative adversarial network for face super-resolution. In Advances in Multimedia Information Processing – PCM 2018, Cham, pp. 441–452. External Links: ISBN 978-3-030-00767-6 Cited by: §1, §2.2, §3.3, Table 1.
  • [21] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, and A. J. Smola (2020) ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: §1, §2.2.
  • [22] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: §1, §2.3, §3.3, Table 1.
  • [23] F. Zhou, W. Yang, and Q. Liao (2012) Interpolation-based image super-resolution using multisurface fitting. IEEE Transactions on Image Processing 21 (7), pp. 3312–3318. Cited by: §1.
  • [24] H. Zhou, J. Hu, and K. Lam (2015) Global face reconstruction for face hallucination using orthogonal canonical correlation analysis. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 537–542. Cited by: §1.