Learning with Privileged Information for Efficient Image Super-Resolution

07/15/2020, by Wonkyung Lee et al., Yonsei University

Convolutional neural networks (CNNs) have allowed remarkable advances in single image super-resolution (SISR) over the last decade. Most CNN-based SR methods have focused on achieving gains over classical approaches in terms of quality metrics, such as PSNR and SSIM, and they typically require a large amount of memory and many computational units. FSRCNN, which consists of a small number of convolutional layers, has shown promising results while using an extremely small number of network parameters. We introduce in this paper a novel distillation framework, consisting of teacher and student networks, that drastically boosts the performance of FSRCNN. To this end, we propose to use ground-truth high-resolution (HR) images as privileged information. The encoder in the teacher learns the degradation process, i.e., the subsampling of HR images, using an imitation loss. The student and the decoder in the teacher, which have the same network architecture as FSRCNN, try to reconstruct HR images. Intermediate features in the decoder, which the student is able to learn, are transferred to the student through feature distillation. Experimental results on standard benchmarks demonstrate the effectiveness and the generalization ability of our framework, which significantly boosts the performance of FSRCNN as well as other SR methods. Our code and model are available online: https://cvlab.yonsei.ac.kr/projects/PISR.


1 Introduction

Single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) one, which has proven useful in various tasks including object detection [3], face recognition [63, 17], medical imaging [16], and information forensics [35]. With the great success of deep learning, SRCNN [10] first introduces convolutional neural networks (CNNs) for SISR, outperforming classical approaches by large margins. After that, CNN-based SR methods focus on designing wider [34, 49, 62] or deeper [20, 29, 32, 39, 60, 61] network architectures for performance gains. They require a high computational cost and a large amount of memory, and thus implementing them directly on a single chip for, e.g., televisions and mobile phones is extremely hard without neural processing units and off-chip memory.

Figure 1: Compressing networks using knowledge distillation (left) transfers the knowledge from a large teacher model (T) to a small student model (S), with the same input, e.g., LR images in the case of SISR. In contrast, the teacher in our framework (right) takes the ground truth (i.e., the HR image) as an input, exploiting it as privileged information, and transfers the knowledge via feature distillation. (Best viewed in color.)

Many works introduce cost-effective network architectures [46, 30, 27, 1, 18, 19, 11] to reduce the computational burden and/or required memory, using recursive layers [46, 30] or additional modules specific to SISR [27, 1]. Although they offer a good compromise in terms of PSNR and speed/memory, specially-designed or recursive architectures may be difficult to implement on hardware devices. Network pruning [19] and parameter quantization [18], typically used for network compression, are alternative ways to obtain efficient SR networks, where pruning removes redundant connections between nodes and quantization reduces the bit-precision of weights or activations. The speedup achieved by pruning is limited due to irregular memory accesses and poor data localization [53], and the performance of network quantization is inherently bounded by that of a full-precision model. Knowledge distillation is another way of compressing models, where a large model (i.e., a teacher network) transfers a softened version of its output distribution (i.e., logits) [23] or intermediate feature representations [14, 43, 2, 22] to a small one (i.e., a student network), which has proven effective in particular for the task of image classification. Generalized distillation [37] goes one step further, allowing a teacher to make use of extra (privileged) information at training time, and assisting the training process of a student network with the complementary knowledge [24, 15].

We present in this paper a simple yet effective framework for efficient SISR. The basic idea is that ground-truth HR images can be thought of as privileged information (Fig. 1), which has not been explored in either SISR or privileged learning. It is true that the HR image contains information complementary to the LR image (e.g., high-frequency components), but current SISR methods use it only to penalize an incorrect reconstruction at the end of CNNs. On the contrary, our approach of using HR images as privileged information allows us to extract the complementary features and leverage them explicitly for the SISR task. To implement this idea, we introduce a novel distillation framework where teacher and student networks try to reconstruct HR images but use different inputs (i.e., ground-truth HR images for the teacher and the corresponding LR images for the student), which is clearly different from the conventional knowledge distillation framework (Fig. 1). Specifically, the teacher network has an hourglass architecture consisting of an encoder and a decoder. The encoder extracts compact features from HR images while encouraging them to imitate their LR counterparts using an imitation loss. The decoder, which has the same network architecture as the student, reconstructs the HR images again from the compact features. Intermediate features in the decoder are then transferred to the student via feature distillation, such that the student learns the knowledge (e.g., high frequencies or fine details of HR inputs) of the teacher trained with the privileged data (i.e., HR images). Note that our framework is also useful in that the student can be initialized with the network parameters of the decoder, which transfers the reconstruction capability of the teacher to the student. We mainly exploit FSRCNN [11] as the student network, since it has a hardware-friendly architecture (i.e., a stack of convolutional layers) and the number of parameters is extremely small compared to other CNN-based SR methods. Experimental results on standard SR benchmarks demonstrate the effectiveness of our approach, which boosts the performance of FSRCNN without any additional modules. To the best of our knowledge, our framework is the first attempt to leverage privileged information for SISR. The main contributions of our work can be summarized as follows:


  • We present a novel distillation framework for SISR that leverages the ground truth (i.e., HR images) as privileged information to transfer the important knowledge of the HR images to a student network.

  • We propose to use an imitation loss to train a teacher network, making it possible to distill the knowledge a student is able to learn.

  • We demonstrate that our approach significantly boosts the performance of current SISR methods, including FSRCNN [11], VDSR [29], IDN [27], and CARN [1], and we provide an extensive experimental analysis with ablation studies.

2 Related work

2.0.1 SISR.

Early works on SISR design image priors to constrain the solution space [9, 28, 55], or leverage external datasets to learn the relationship between HR and LR images [13, 56, 47, 44, 6], since many HR images can be reconstructed from a single LR image. CNNs have allowed remarkable advances in SISR. Dong et al. pioneer the idea of exploiting CNNs for SISR, and propose SRCNN [10], which learns a mapping function directly from input LR to output HR images. Recent methods using CNNs exploit a much larger number of convolutional layers. Sparse [34, 32, 39] or dense [49, 62, 20] skip connections between them prevent the vanishing gradient problem, achieving significant performance gains over classical approaches. More recently, networks that are efficient in terms of memory and/or runtime have been introduced for SISR. Memory-efficient SR methods [30, 46, 45, 33] reduce the number of network parameters by reusing them recursively. They further improve the reconstruction performance using residual units [46], memory [45], or feedback [33] modules, but at the cost of runtime. Runtime-efficient methods [11, 27, 1, 26], on the other hand, are computationally cheap. They use cascaded [1] or multi-branch [27, 26] architectures, or exploit group convolutions [54, 8]. The main drawback of such SR methods is that their hardware implementation is difficult due to network architectures specially designed for the SR task. FSRCNN [11] reduces both runtime and memory. It uses typical convolutional operators with a small number of filters and feature channels, except for the deconvolution layer at the last part of the network. Although FSRCNN has a hardware-friendly network architecture, it is largely outperformed by current SR methods.

2.0.2 Feature distillation.

The purpose of knowledge distillation is to transfer the representation ability of a large model (teacher) to a small one (student) in order to enhance the performance of the student. It has been widely used to compress networks, typically for classification tasks. In this framework, the softmax outputs of a teacher are regarded as soft labels, providing informative clues beyond discrete labels [23]. Recent methods extend this idea to feature distillation, which transfers intermediate feature maps [43, 2], their transformations [22, 58], the differences of features before and after a stack of layers [57], or pairwise relations within feature maps [36]. In particular, the variational information distillation (VID) method [2] transfers knowledge by maximizing the mutual information between feature maps of teacher and student networks. We exploit VID for feature distillation, but within a different framework: instead of sharing the same inputs (i.e., LR images) with the student, our teacher network takes HR images, which contain information complementary to the LR images, as inputs, to take advantage of privileged information.

Closely related to ours, SRKD [14] applies feature distillation to SISR in order to compress the SR network, where a student is trained to have feature distributions similar to those of a teacher. Following conventional knowledge distillation, the student and teacher networks in SRKD take the same LR images as inputs. This is clearly different from our method, in which the teacher takes ground-truth HR images as inputs, allowing it to extract more powerful feature representations for image reconstruction.

2.0.3 Generalized distillation.

Learning using privileged information [51, 50] is a machine learning paradigm that exploits extra information, available at an additional cost at training time but not accessible at test time. In a broader context, generalized distillation [37] covers both feature distillation and learning using privileged information, enabling the transfer of the privileged knowledge of a teacher to a student. For example, the works of [24, 15] adopt the generalized distillation approach for object detection and action recognition, where depth images are used as privileged information. In this framework, a teacher is trained to extract useful features from depth images. These features are then transferred to a student that takes RGB images as inputs, allowing the student to learn complementary representations from the privileged information. Our method belongs to generalized distillation, since we train a teacher network with ground-truth HR images, which can be viewed as privileged information, and transfer its knowledge to a student network. Different from previous methods, ours does not require an additional cost for the privileged information, since the ground truth is readily available at training time.

Figure 2: Overview of our framework. A teacher network takes an HR image $Y$ as input and extracts a compact feature representation $\tilde{X}$ using an encoder. The decoder in the network then reconstructs an HR output $\hat{Y}_T$. To train the teacher network, we use imitation ($\mathcal{L}_{im}$) and reconstruction ($\mathcal{L}_{rec}^T$) losses. After training the teacher, a student network is initialized with the weights of the decoder in the teacher network (red line), and restores an HR output $\hat{Y}_S$ from an LR image $X$. Note that the student network and the decoder share the same network architecture. The estimator module takes intermediate feature maps of the student network and outputs location and scale maps, $\mu$ and $\sigma$, respectively. To train the student network, we exploit a reconstruction loss $\mathcal{L}_{rec}^S$ together with a distillation loss $\mathcal{L}_{dist}$ using the intermediate representation $F_T$ of the teacher network and the parameter maps $\mu$ and $\sigma$. See text for details. (Best viewed in color.)

3 Method

We denote by $X$ and $Y$ the LR and ground-truth HR images, respectively. Given the LR image $X$, we reconstruct a high-quality HR output $\hat{Y}_S$ efficiently in terms of both speed and memory. To this end, we present an effective framework consisting of teacher and student networks. The teacher network learns to distill the knowledge from privileged information (i.e., a ground-truth HR image $Y$). After training the teacher network, we transfer the knowledge distilled from the teacher to the student to boost the reconstruction performance. We show in Fig. 2 an overview of our framework.

3.1 Teacher

In order to transfer knowledge from a teacher to a student, the teacher should be superior to the student, while extracting informative features. To this end, we treat ground-truth HR images as privileged information, and exploit an intelligent teacher [50]. As will be seen in our experiments, the network architecture of the teacher influences the SR performance significantly. Since the teacher network takes ground-truth HR images as inputs, it may not be able to extract useful features, and may simply learn to copy the inputs for the reconstruction of HR images, regardless of its capacity. Moreover, a large difference in the number of network parameters or the performance gap between the teacher and the student discourages the distillation process [7, 41]. To reduce this gap while promoting the teacher to capture useful features, we exploit an hourglass architecture for the teacher network. It projects the HR images into a low-dimensional feature space to generate compact features, and reconstructs the original HR images from them, such that the teacher learns to extract better feature representations for an image reconstruction task. Specifically, the teacher network consists of an encoder $E$ and a decoder $D$. Given a pair of LR and HR images, the encoder $E$ transforms the input HR image $Y$ into the feature representation $\tilde{X}$ in a low-dimensional space:

$\tilde{X} = E(Y),$   (1)

where the feature representation $\tilde{X}$ has the same size as the LR image. The decoder $D$ reconstructs the HR image $\hat{Y}_T$ using the compact feature $\tilde{X}$:

$\hat{Y}_T = D(\tilde{X}).$   (2)

For the decoder, we use the same architecture as the student network. This allows the teacher to have a representational capacity similar to that of the student, which has proven useful in [41].
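
As a rough illustration of this hourglass design, a minimal PyTorch sketch is given below; the class and module names are ours, and the `encoder` and `decoder` arguments are assumed to be nn.Modules (the decoder sharing the student's FSRCNN-shaped architecture):

```python
import torch.nn as nn

class HourglassTeacher(nn.Module):
    """Hourglass teacher: the encoder maps the HR image to an LR-sized
    compact feature (Eq. 1), and the decoder maps it back to an HR
    reconstruction (Eq. 2)."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # HR image -> compact feature
        self.decoder = decoder   # compact feature -> HR reconstruction

    def forward(self, hr):
        compact = self.encoder(hr)        # \tilde{X} = E(Y)
        hr_recon = self.decoder(compact)  # \hat{Y}_T = D(\tilde{X})
        return compact, hr_recon
```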

3.1.1 Loss.

To train the teacher network, we use reconstruction and imitation losses, denoted by $\mathcal{L}_{rec}^T$ and $\mathcal{L}_{im}$, respectively. The reconstruction term computes the mean absolute error (MAE) between the HR image $Y$ and its reconstruction $\hat{Y}_T$, defined as:

$\mathcal{L}_{rec}^T = \frac{1}{HW} \sum_{p} \left| Y(p) - \hat{Y}_T(p) \right|,$   (3)

where $H$ and $W$ are the height and width of the HR image, respectively, and we denote by $Y(p)$ an intensity value of $Y$ at position $p$. It encourages the encoder output (i.e., the compact feature $\tilde{X}$) to contain useful information for image reconstruction, and forces the decoder to reconstruct the HR image again from the compact feature. The imitation term restricts the representational power of the encoder, making the output of the encoder close to the LR image. Concretely, we define this term as the MAE between the LR image $X$ and the encoder output $\tilde{X}$:

$\mathcal{L}_{im} = \frac{1}{hw} \sum_{q} \left| X(q) - \tilde{X}(q) \right|,$   (4)

where $h$ and $w$ are the height and width of the LR image, respectively. This facilitates the initialization of the student network, which takes the LR image $X$ as an input. Note that our framework avoids the trivial solution where the compact feature simply becomes the LR image, since the network parameters of the encoder are updated by both the imitation and reconstruction terms. The overall objective is a sum of the reconstruction and imitation terms, balanced by the parameter $\lambda_{im}$:

$\mathcal{L}_T = \mathcal{L}_{rec}^T + \lambda_{im} \mathcal{L}_{im}.$   (5)
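
The two terms can be written compactly in PyTorch; the sketch below is a minimal illustration that assumes tensors of matching shapes and leaves the balancing weight $\lambda_{im}$ as an argument, since its value is chosen by grid search and not reproduced here:

```python
import torch.nn.functional as F

def teacher_loss(hr, lr, hr_recon, compact, lambda_im):
    """Teacher objective of Eq. (5): reconstruction MAE (Eq. 3) between the
    HR image and its reconstruction, plus the imitation MAE (Eq. 4) that
    pulls the compact feature toward the LR image."""
    loss_rec = F.l1_loss(hr_recon, hr)  # Eq. (3), mean absolute error
    loss_im = F.l1_loss(compact, lr)    # Eq. (4)
    return loss_rec + lambda_im * loss_im
```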

3.2 Student

A student network $S$ has the same architecture as the decoder $D$ in the teacher, but uses a different input. It takes an LR image $X$ as an input and generates an HR image $\hat{Y}_S$:

$\hat{Y}_S = S(X).$   (6)

We initialize the weights of the student network with those of the decoder in the teacher. This transfers the reconstruction capability of the teacher to the student and provides a good starting point for optimization. Note that several works [24, 15] point out that how network weights are initialized is crucial for the performance of a student. We adopt FSRCNN [11], a hardware-friendly SR architecture, as the student network $S$.
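
In PyTorch, this weight transfer amounts to a single state-dict copy; the sketch below assumes hypothetical `teacher.decoder` and `student` modules with identical (FSRCNN-shaped) parameter sets:

```python
# Initialize the student from the trained decoder of the teacher.
student.load_state_dict(teacher.decoder.state_dict())
```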

3.2.1 Loss.

Although the network parameters of the student $S$ and the decoder $D$ in the teacher are initially the same, the features extracted from them differ due to the different inputs. Besides, these parameters have not been optimized with LR inputs. We therefore further train the student network $S$ with a reconstruction loss $\mathcal{L}_{rec}^S$ and a distillation loss $\mathcal{L}_{dist}$. The reconstruction term is defined similarly to Eq. (3), using the ground-truth HR image and its reconstruction from the student network, and is dedicated to the SISR task:

$\mathcal{L}_{rec}^S = \frac{1}{HW} \sum_{p} \left| Y(p) - \hat{Y}_S(p) \right|.$   (7)

The distillation term focuses on transferring the knowledge of the teacher to the student. Overall, we use the following loss to train the student network:

$\mathcal{L}_S = \mathcal{L}_{rec}^S + \lambda_{dist} \mathcal{L}_{dist},$   (8)

where $\lambda_{dist}$ is a distillation parameter. In the following, we describe the distillation loss in detail.

We adopt the distillation loss proposed in the VID method [2], which maximizes the mutual information between the teacher and the student. We denote by $F_T$ and $F_S$ the intermediate feature maps of the teacher and student networks, respectively, which have the same size $C \times H' \times W'$, where $C$ is the number of channels and $H' \times W'$ the spatial size. We define the mutual information $I(F_T; F_S)$ as follows:

$I(F_T; F_S) = \mathcal{H}(F_T) - \mathcal{H}(F_T \mid F_S),$   (9)

where we denote by $\mathcal{H}(F_T)$ and $\mathcal{H}(F_T \mid F_S)$ the marginal and conditional entropies, respectively. To maximize the mutual information, we should minimize the conditional entropy $\mathcal{H}(F_T \mid F_S)$. However, an exact optimization w.r.t. the weights of the student is intractable, as it involves an integration over the conditional probability $p(F_T \mid F_S)$. The variational information maximization technique [4] instead approximates the conditional distribution $p(F_T \mid F_S)$ using a parametric model $q(F_T \mid F_S)$, such as a Gaussian or Laplace distribution, making it possible to find a lower bound of the mutual information $I(F_T; F_S)$. Using this technique, we maximize the lower bound of the mutual information for feature distillation. As the parametric model $q$, we use a multivariate Laplace distribution with location and scale parameters $\mu$ and $\sigma$, respectively. We define the distillation loss $\mathcal{L}_{dist}$ as follows:

$\mathcal{L}_{dist} = \sum_{c,i,j} \left( \log \sigma_{c,i,j} + \frac{\left| F_T^{(c,i,j)} - \mu_{c,i,j} \right|}{\sigma_{c,i,j}} \right),$   (10)

where we denote by $F_T^{(c,i,j)}$ the element of $F_T$ at position $(c,i,j)$. This loss minimizes the distance between the features $F_T$ of the teacher and the location map $\mu$. The scale map $\sigma$ controls the extent of distillation. For example, when the student does not benefit from the distillation, the scale parameter $\sigma$ increases in order to reduce the extent of distillation. This is useful for our framework, where the teacher and student networks take different inputs, since it adaptively determines which features the student can afford to learn from the teacher. The $\log \sigma$ term prevents a trivial solution where the scale parameter goes to infinity. We estimate the maps $\mu$ and $\sigma$ from the features $F_S$ of the student. Note that other losses designed for feature distillation can also be used in our framework (see the supplementary material).
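
A minimal sketch of Eq. (10) in PyTorch is given below, assuming the teacher features and the estimated location/scale maps share the same shape; the small `eps` constant is our addition for numerical stability and not part of the formulation above:

```python
import torch

def vid_laplace_loss(f_teacher, mu, scale, eps=1e-6):
    """Negative log-likelihood of the teacher features under a Laplace
    model whose location (mu) and scale maps are predicted from the
    student features (Eq. 10), averaged over all elements."""
    scale = scale + eps
    nll = torch.log(scale) + torch.abs(f_teacher - mu) / scale
    return nll.mean()
```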

3.2.2 Estimator module.

We use a small network to estimate the location $\mu$ and scale $\sigma$ parameters in Eq. (10). It consists of location and scale branches, each of which takes the features $F_S$ of the student and estimates the location and scale maps, separately. Both branches share the same network architecture of two convolutional layers with a PReLU [21] between them. For the scale branch, we add the softplus function [12] at the last layer, forcing the scale parameter to be positive. Note that the estimator module is used only at training time.
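
A sketch of such an estimator is shown below; the kernel size (3×3) and the choice of keeping the channel count unchanged are assumptions, as they are not specified in this excerpt:

```python
import torch.nn as nn

class Estimator(nn.Module):
    """Location/scale estimator: two branches of Conv-PReLU-Conv operating
    on the student features, with a softplus on the scale branch so the
    scale map stays positive. Used only at training time."""
    def __init__(self, channels):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.PReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
        self.loc = branch()
        self.scale = nn.Sequential(branch(), nn.Softplus())

    def forward(self, f_student):
        return self.loc(f_student), self.scale(f_student)
```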

4 Experiments

4.1 Experimental details

4.1.1 Implementation details.

The encoder in the teacher network consists of 4 blocks of convolutional layers, each followed by a PReLU [21]. All blocks, except the second one, perform convolutions with stride 1. In the second block, we use a convolution with stride $s$ (i.e., the scale factor) to downsample the HR image to the size of the LR image. The kernel sizes of the first two and the last two blocks are 5×5 and 3×3, respectively. The decoder in the teacher and the student network have the same architecture as FSRCNN [11], consisting of five components: feature extraction, shrinking, mapping, expanding, and deconvolution modules. We add the estimator module for location and scale maps on top of the expanding module in the student network. We use these maps together with the output features of the expanding module in the decoder to compute the distillation loss. We set the hyperparameters for the losses using a grid search on the DIV2K dataset [48], and choose the values that give the best performance. We implement our framework using PyTorch [42].
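
The encoder described above can be sketched as follows; the intermediate channel width and the input/output channel counts are assumptions (the output must match the LR image so that the imitation loss of Eq. (4) applies):

```python
import torch.nn as nn

def make_encoder(scale, img_channels=1, width=64):
    """Four conv blocks, each followed by a PReLU. The second block uses
    stride = scale to bring the HR input down to LR resolution; kernel
    sizes are 5x5 for the first two blocks and 3x3 for the last two."""
    return nn.Sequential(
        nn.Conv2d(img_channels, width, 5, stride=1, padding=2), nn.PReLU(),
        nn.Conv2d(width, width, 5, stride=scale, padding=2), nn.PReLU(),
        nn.Conv2d(width, width, 3, stride=1, padding=1), nn.PReLU(),
        nn.Conv2d(width, img_channels, 3, stride=1, padding=1), nn.PReLU(),
    )
```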

4.1.2 Training.

To train our network, we use the training split of DIV2K [48], which contains 800 pairs of LR and HR images, where the LR images are synthesized by bicubic downsampling. We randomly crop HR patches of size 192×192 from the HR images. LR patches are cropped from the corresponding LR images according to the scale factor; for example, LR patches of size 96×96 are used for the scale factor of 2. We use data augmentation techniques, including random rotation and horizontal flipping. The teacher network is trained from random initialization. We train our model with a batch size of 16 for about 1000k iterations over the training data, using the Adam optimizer [31]. The learning rate is decayed with a cosine annealing schedule [38].
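
A sketch of the paired cropping and augmentation is given below; treating the random rotation as a multiple of 90 degrees is an assumption (the text only states "random rotation"), and tensors are assumed to be (C, H, W):

```python
import random
import torch

def random_paired_crop(lr, hr, scale, hr_patch=192):
    """Crop an aligned HR patch and the corresponding LR patch, then apply
    the same rotation/flip to both."""
    lr_patch = hr_patch // scale
    x = random.randrange(lr.shape[-1] - lr_patch + 1)
    y = random.randrange(lr.shape[-2] - lr_patch + 1)
    lr_crop = lr[..., y:y + lr_patch, x:x + lr_patch]
    hr_crop = hr[..., y * scale:(y + lr_patch) * scale,
                      x * scale:(x + lr_patch) * scale]
    k = random.randrange(4)                      # random 90-degree rotation
    lr_crop = torch.rot90(lr_crop, k, (-2, -1))
    hr_crop = torch.rot90(hr_crop, k, (-2, -1))
    if random.random() < 0.5:                    # horizontal flip
        lr_crop = torch.flip(lr_crop, (-1,))
        hr_crop = torch.flip(hr_crop, (-1,))
    return lr_crop, hr_crop
```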

4.1.3 Evaluation.

We evaluate our framework on standard benchmarks including Set5 [5], Set14 [59], B100 [40], and Urban100 [25]. Following the experimental protocol in [34], we use the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [52] on the luminance channel as evaluation metrics.
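
For reference, a common way of computing PSNR on the luminance channel is sketched below; the ITU-R BT.601 conversion and the optional border crop follow common SR practice and are assumptions on our part, since the exact protocol is the one defined in [34]:

```python
import torch

def rgb_to_y(img):
    """ITU-R BT.601 luminance for RGB images with values in [0, 1]."""
    r, g, b = img[..., 0, :, :], img[..., 1, :, :], img[..., 2, :, :]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr_y(sr, hr, border=0):
    """PSNR on the Y channel; `border` optionally trims image borders."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if border > 0:
        y_sr = y_sr[..., border:-border, border:-border]
        y_hr = y_hr[..., border:-border, border:-border]
    mse = torch.mean((y_sr - y_hr) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```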

| Hourglass architecture | Weight transfer | Imitation loss | Distillation loss | Student PSNR | Teacher PSNR |
| --- | --- | --- | --- | --- | --- |
| - | - | - | - | 37.15 (baseline) | - |
| - | - | - | MAE | 37.19 (+0.04) | 57.60 |
| ✓ | - | - | MAE | 37.22 (+0.07) | 37.70 |
| ✓ | ✓ | - | MAE | 37.23 (+0.08) | 37.70 |
| ✓ | ✓ | ✓ | MAE | 37.27 (+0.12) | 37.65 |
| ✓ | ✓ | ✓ | VID_G [2] | 37.31 (+0.16) | 37.65 |
| ✓ | ✓ | ✓ | VID_L [2] | 37.33 (+0.18) | 37.65 |

Table 1: Average PSNR of student and teacher networks, trained with variants of our framework, on the Set5 [5] dataset. We use FSRCNN [11], reproduced by ourselves using the DIV2K [48] dataset without distillation, as the baseline in the first row. We denote by VID_G and VID_L the VID losses [2] with the Gaussian and Laplace distributions, respectively. The performance gains of each variant over the baseline are shown in parentheses.

4.2 Ablation studies

We present an ablation analysis of each component of our framework. We report quantitative results in terms of the average PSNR on Set5 [5] with the scale factor of 2. Results on a larger dataset (i.e., B100 [40]) can be found in the supplementary material. We show in Table 1 the average PSNR for student networks trained with variants of our framework. The results of the baseline in the first row are obtained using FSRCNN [11]. From the second row, we can clearly see that feature distillation boosts the PSNR. For the teacher network in the second row, we use the same network architecture as FSRCNN except for the deconvolution layer. In contrast to FSRCNN, the teacher takes HR images as inputs, and thus we replace the deconvolution layer with a convolutional layer, preserving the size of the inputs. We can see from the third row that a teacher network with an hourglass architecture improves the student's performance. The hourglass architecture limits the capacity of the teacher and degrades its performance (e.g., a 19.9dB decrease compared to the teacher in the second row), reducing the performance gap between the teacher and the student. This makes the feature distillation more effective, so the student in the third row performs better (37.22dB) than that in the second row (37.19dB), a finding also reported in recent works [7, 41]. The fourth row shows that the student network benefits from initializing its weights with those of the decoder in the teacher, since this provides a good starting point for learning and transfers the reconstruction capability of the teacher. From the fifth row, we observe that the imitation loss further improves the PSNR, making it easier for the student to learn features from the teacher. The last two rows show that the VID loss [2], especially with the Laplace distribution (VID_L), provides better results than the MAE, and that combining all components gives the best performance. A distillation loss based on the MAE forces the feature maps of the student and teacher networks to be identical. This strong constraint is problematic in our case, since the student and teacher networks take different inputs. The VID method instead allows the student to learn important features adaptively. We also compare our framework with a typical distillation approach under different losses in the supplementary material.

Figure 3: Analysis on compact features in spatial (top) and frequency (bottom left) domains, and the distribution of pixel values (bottom right), for the HR image, the LR image, and the compact features with (w/) and without (w/o) the imitation loss. To visualize the compact features in the frequency domain, we apply the 2D Fast Fourier Transform (FFT) to the image, obtaining its magnitude spectrum, which is then sliced along one axis. (Best viewed in color.)

4.3 Analysis on compact features

In Fig. 3, we analyze the compact features in the spatial and frequency domains. Compared to the LR image, the compact features $\tilde{X}$ show high-frequency details regardless of whether the imitation loss $\mathcal{L}_{im}$ is used or not. This can also be observed in the frequency domain: the compact features contain more high-frequency components than the LR image, and their magnitude spectra are more similar to that of the HR image, especially for high-frequency components. By taking these features as inputs, the decoder in the teacher shows better performance than the student (Table 1), despite the fact that they have the same architecture. This demonstrates that the compact features extracted from the ground truth contain useful information for reconstructing the HR image, encouraging the student to reconstruct more accurate results via feature distillation. In the bottom right of Fig. 3, we can see that the pixel distributions of the LR image and the compact feature are largely different without the imitation loss, discouraging the weight transfer to the student. The imitation loss $\mathcal{L}_{im}$ alleviates this problem by encouraging the distributions of the LR image and the compact feature to be similar.
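
A sketch of this frequency-domain visualization for a single-channel image tensor is given below; the log compression is a display choice on our part:

```python
import torch

def log_magnitude_spectrum(img):
    """2D FFT magnitude spectrum with the zero frequency shifted to the
    center; a slice through the center row gives a 1D profile like the
    one plotted in Fig. 3."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(img))
    return torch.log1p(spectrum.abs())

# Example: center_profile = log_magnitude_spectrum(y)[y.shape[-2] // 2, :]
```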

| Scale | Methods | Param. | MultiAdds | Runtime | Set5 [5] PSNR/SSIM | Set14 [59] PSNR/SSIM | B100 [40] PSNR/SSIM | Urban100 [25] PSNR/SSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ×2 | FSRCNN [11] | 13K | 6.0G | 0.83ms | 37.05/0.9560 | 32.66/0.9090 | 31.53/0.8920 | 29.88/0.9020 |
| ×2 | FSRCNN* | 13K | 6.0G | 0.83ms | 37.15/0.9568 | 32.71/0.9095 | 31.58/0.8913 | 30.05/0.9041 |
| ×2 | FSRCNN (Ours) | 13K | 6.0G | 0.83ms | 37.33/0.9576 | 32.79/0.9105 | 31.65/0.8926 | 30.24/0.9071 |
| ×2 | Bicubic Int. | - | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 |
| ×2 | DRCN [30] | 1,774K | 17,974.3G | 239.93ms | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 |
| ×2 | DRRN [46] | 297K | 6,796.9G | 105.76ms | 37.74/0.9591 | 33.23/0.9136 | 32.05/0.8973 | 31.23/0.9188 |
| ×2 | MemNet [45] | 677K | 2,662.4G | 21.06ms | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195 |
| ×2 | CARN [1] | 1,592K | 222.8G | 8.43ms | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 |
| ×2 | IDN [27] | 591K | 136.5G | 7.01ms | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 |
| ×2 | SRFBN [33] | 3,631K | 1,126.7G | 108.52ms | 38.11/0.9609 | 33.82/0.9196 | 32.29/0.9010 | 32.62/0.9328 |
| ×2 | IMDN [26] | 694K | 159.6G | 6.97ms | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 |
| ×3 | FSRCNN [11] | 13K | 5.0G | 0.72ms | 33.18/0.9140 | 29.37/0.8240 | 28.53/0.7910 | 26.43/0.8080 |
| ×3 | FSRCNN* | 13K | 5.0G | 0.72ms | 33.15/0.9157 | 29.45/0.8250 | 28.52/0.7895 | 26.49/0.8089 |
| ×3 | FSRCNN (Ours) | 13K | 5.0G | 0.72ms | 33.31/0.9179 | 29.57/0.8276 | 28.61/0.7919 | 26.67/0.8153 |
| ×3 | Bicubic Int. | - | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 |
| ×3 | DRCN [30] | 1,774K | 17,974.3G | 239.19ms | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.15/0.8276 |
| ×3 | DRRN [46] | 297K | 6,796.9G | 98.58ms | 34.03/0.9244 | 29.96/0.8349 | 28.95/0.8004 | 27.53/0.8378 |
| ×3 | MemNet [45] | 677K | 2,662.4G | 11.33ms | 34.09/0.9248 | 30.00/0.8350 | 28.96/0.8001 | 27.56/0.8376 |
| ×3 | CARN [1] | 1,592K | 118.8G | 3.86ms | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 |
| ×3 | IDN [27] | 591K | 60.6G | 3.62ms | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 |
| ×3 | SRFBN [33] | 3,631K | 500.8G | 76.74ms | 34.70/0.9292 | 30.51/0.8461 | 29.24/0.8084 | 28.73/0.8641 |
| ×3 | IMDN [26] | 703K | 71.7G | 5.36ms | 34.36/0.9270 | 30.32/0.8417 | 29.09/0.8046 | 28.17/0.8519 |
| ×4 | FSRCNN [11] | 13K | 4.6G | 0.67ms | 30.72/0.8660 | 27.61/0.7550 | 26.98/0.7150 | 24.62/0.7280 |
| ×4 | FSRCNN* | 13K | 4.6G | 0.67ms | 30.89/0.8748 | 27.72/0.7599 | 27.05/0.7176 | 24.76/0.7358 |
| ×4 | FSRCNN (Ours) | 13K | 4.6G | 0.67ms | 30.95/0.8759 | 27.77/0.7615 | 27.08/0.7188 | 24.82/0.7393 |
| ×4 | Bicubic Int. | - | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 |
| ×4 | DRCN [30] | 1,774K | 17,974.3G | 243.62ms | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.14/0.7510 |
| ×4 | DRRN [46] | 297K | 6,796.9G | 57.09ms | 31.68/0.8888 | 28.21/0.7721 | 27.38/0.7284 | 25.44/0.7638 |
| ×4 | MemNet [45] | 677K | 2,662.4G | 8.55ms | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630 |
| ×4 | CARN [1] | 1,592K | 90.9G | 3.16ms | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 |
| ×4 | IDN [27] | 591K | 34.1G | 3.08ms | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 |
| ×4 | SRFBN [33] | 3,631K | 281.7G | 48.39ms | 32.47/0.8983 | 28.81/0.7868 | 27.72/0.7409 | 26.60/0.8015 |
| ×4 | IMDN [26] | 715K | 41.1G | 4.38ms | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 |

Table 2: Quantitative comparison with the state of the art on SISR. We report the average PSNR/SSIM for different scale factors (2, 3, and 4) on Set5 [5], Set14 [59], B100 [40], and Urban100 [25]. *: models reproduced by ourselves using the DIV2K [48] dataset without distillation; Ours: student networks of our framework.

4.4 Results

4.4.1 Quantitative comparison.

We compare in Table 2 the performance of our student model with the state of the art, particularly efficient SISR methods [10, 11, 30, 46, 45, 27, 1, 33, 26]. For a quantitative comparison, we report the average PSNR and SSIM [52] for upsampling factors of 2, 3, and 4 on standard benchmarks [5, 59, 40, 25]. We also report the number of model parameters and operations (MultiAdds) required to reconstruct an HR image of a fixed target size, and present the average runtime of each method measured on Set5 [5] using the same machine with an NVIDIA Titan RTX GPU. From this table, we can observe two things: (1) Our student model trained with the proposed framework consistently outperforms FSRCNN [11] by a large margin for all scale factors, even though both have the same network architecture. This demonstrates the effectiveness of our approach of exploiting ground-truth HR images as privileged information; (2) The model trained with our framework offers a good compromise in terms of PSNR/SSIM and the number of parameters, operations, and runtime. For example, DRCN [30] requires 1,774K parameters, 17,974.3G operations, and an average runtime of 239.93ms to achieve an average PSNR of 30.75dB on Urban100 [25] for a factor of 2. In contrast, our framework boosts FSRCNN without modifying the network architecture, achieving an average PSNR of 30.24dB with only 13K parameters and 6.0G operations, while taking 0.83ms for inference.

Table 3: Quantitative results of student networks using other SR methods. We report the average PSNR for different scale factors (2, 3, and 4) on Set5 [5] and B100 [40]. *: models reproduced by ourselves using the DIV2K [48] dataset; Ours: student networks of our framework.

| Methods | ×2 Set5/B100 | ×3 Set5/B100 | ×4 Set5/B100 |
| --- | --- | --- | --- |
| FSRCNN-L* | 37.59/31.90 | 33.76/28.81 | 31.47/27.29 |
| FSRCNN-L (Ours) | 37.65/31.92 | 33.85/28.83 | 31.52/27.30 |
| VDSR [29] | 37.53/31.90 | 33.67/28.82 | 31.35/27.29 |
| VDSR* | 37.64/31.96 | 33.80/28.83 | 31.37/27.25 |
| VDSR (Ours) | 37.77/32.00 | 33.85/28.86 | 31.51/27.29 |
| IDN [27] | 37.83/32.08 | 34.11/28.95 | 31.82/27.41 |
| IDN* | 37.88/32.12 | 34.22/29.02 | 32.03/27.49 |
| IDN (Ours) | 37.93/32.14 | 34.31/29.03 | 32.01/27.51 |
| CARN [1] | 37.76/32.09 | 34.29/29.06 | 32.13/27.58 |
| CARN* | 37.75/32.02 | 34.08/28.94 | 31.77/27.44 |
| CARN (Ours) | 37.82/32.08 | 34.10/28.95 | 31.83/27.45 |
Figure 4: Trade-off between the number of operations and the average PSNR on Set5 [5] (×2). The size of the circle and the background color indicate the number of parameters and the efficiency of the model (white: high, black: low), respectively. (Best viewed in color.)

Figure 5: Visual comparison of reconstructed HR images (×2 and ×3) on Urban100 [25] and Set14 [59]. PSNR/SSIM is reported in parentheses. Rows (top to bottom): Urban100 img-71 (×3) with Bicubic Int. (18.06/0.6835) and FSRCNN (20.42/0.8278); Urban100 img-11 (×2) with Bicubic Int. (25.32/0.8034), VDSR (27.64/0.8993), and VDSR (Ours) (27.87/0.9025); Set14 img-1 (×3) with Bicubic Int. (26.25/0.7538), IDN (26.28/0.7887), and IDN (Ours) (26.97/0.8017); Urban100 img-91 (×3) with Bicubic Int. (17.32/0.5164), CARN (20.20/0.7236), and CARN (Ours) (20.32/0.7293). (Best viewed in color.)

In Table 3, we show the performance of student networks adopting the architectures of other SR methods, trained with our framework on the DIV2K dataset [48]. We reproduce these models (denoted by *) using the same training setting but without distillation. FSRCNN-L has the same components as FSRCNN [11] but with many more parameters (126K vs. 13K): the numbers of filters in the feature extraction and shrinking components are both 56, and the mapping module consists of 4 blocks of convolutional layers. Note that the multi-scale learning strategy of CARN [1] is not used for training the network, and thus its performance is slightly lower than the original one. We can see that all the SISR methods benefit from our framework, except for IDN [27] for the scale factor of 4 on Set5. In particular, the performance of the FSRCNN-L variant of FSRCNN [11] and of VDSR [29] is significantly boosted by our framework. Additionally, our framework further improves the performance of the cost-effective SR methods [27, 1], which are specially designed to reduce the number of parameters and operations while improving the reconstruction performance. Considering the performance gains of recent SR methods, these results are significant, demonstrating the effectiveness and generalization ability of our framework; for example, IDN [27] and SRFBN [33] outperform the second-best methods by only 0.05dB and 0.02dB, respectively, in terms of PSNR on Set5 [5] for a factor of 2. We visualize in Fig. 4 the comparison between student networks using various SR methods and the state of the art in terms of the number of operations and parameters. It confirms once more the efficiency of our framework.

4.4.2 Qualitative results.

We show in Fig. 5 reconstruction examples on the Urban100 [25] and Set14 [59] datasets using the student networks. We can clearly see that the student models provide better qualitative results than their baselines. In particular, our models remove artifacts (e.g., the borders around the sculpture in the first row) and reconstruct small-scale structures (e.g., windows in the second row and the iron railings in the last row) and textures (e.g., the patterns of the tablecloth in the third row). More qualitative results can be seen in the supplementary material.

5 Conclusion

We have presented a novel distillation framework for SISR leveraging ground-truth HR images as privileged information. The detailed analysis on each component of our framework clearly demonstrates the effectiveness of our approach. We have shown that the proposed framework substantially improves the performance of FSRCNN as well as other methods. In future work, we will explore distillation losses specific to our model to further boost the performance.

Acknowledgement.

This research was supported by the Samsung Research Funding & Incubation Center for Future Technology (SRFC-IT1802-06).

References

  • [1] N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, Cited by: item , §1, §2.0.1, Figure 4, §4.4.1, §4.4.1, Table 2.
  • [2] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. In CVPR, Cited by: §1, §2.0.2, §3.2.1, §4.2, Table 1.
  • [3] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem (2018) SOD-MTGAN: small object detection via multi-task generative adversarial network. In ECCV, Cited by: §1.
  • [4] D. Barber and F. V. Agakov (2003) The IM algorithm: a variational approach to information maximization. In NIPS, Cited by: §3.2.1.
  • [5] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, Cited by: Figure 4, §4.1.3, §4.2, §4.4.1, §4.4.1, Table 1, Table 2, Table 3.
  • [6] H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In CVPR, Cited by: §2.0.1.
  • [7] J. H. Cho and B. Hariharan (2019) On the efficacy of knowledge distillation. In ICCV, Cited by: §3.1, §4.2.
  • [8] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, Cited by: §2.0.1.
  • [9] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos (2009) SoftCuts: a soft edge smoothness prior for color image super-resolution. IEEE TIP 18 (5). Cited by: §2.0.1.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE TPAMI 38 (2). Cited by: §1, §2.0.1, §4.4.1, §4.4.1.
  • [11] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In ECCV, Cited by: item , §1, §1, §2.0.1, §3.2, §4.1.1, §4.2, §4.4.1, §4.4.1, Table 1, Table 2.
  • [12] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia (2001) Incorporating second-order functional knowledge for better option pricing. In NIPS, Cited by: §3.2.2.
  • [13] W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE CG&A 22 (2). Cited by: §2.0.1.
  • [14] Q. Gao, Y. Zhao, G. Li, and T. Tong (2018) Image super-resolution using knowledge distillation. In ACCV, Cited by: §1, §2.0.2.
  • [15] N. C. Garcia, P. Morerio, and V. Murino (2018) Modality distillation with multiple stream networks for action recognition. In ECCV, Cited by: §1, §2.0.3, §3.2.
  • [16] H. Greenspan (2008) Super-resolution in medical imaging. The Computer Journal 52 (1). Cited by: §1.
  • [17] B. K. Gunturk, A. U. Batur, Y. Altunbasak, M. H. Hayes, and R. M. Mersereau (2003) Eigenface-domain super-resolution for face recognition. IEEE TIP 12 (5). Cited by: §1.
  • [18] S. Han, H. Mao, and W. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, Cited by: §1.
  • [19] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NIPS, Cited by: §1.
  • [20] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §1, §2.0.1.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV, Cited by: §3.2.2, §4.1.1.
  • [22] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi (2019) A comprehensive overhaul of feature distillation. In ICCV, Cited by: §1, §2.0.2.
  • [23] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. In NIPS Workshop, Cited by: §1, §2.0.2.
  • [24] J. Hoffman, S. Gupta, and T. Darrell (2016) Learning with side information through modality hallucination. In CVPR, Cited by: §1, §2.0.3, §3.2.
  • [25] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: Figure 5, §4.1.3, §4.4.1, §4.4.2, Table 2.
  • [26] Z. Hui, X. Gao, Y. Yang, and X. Wang (2019) Lightweight image super-resolution with information multi-distillation network. In ACMMM, Cited by: §2.0.1, §4.4.1, Table 2.
  • [27] Z. Hui, X. Wang, and X. Gao (2018) Fast and accurate single image super-resolution via information distillation network. In CVPR, Cited by: item , §1, §2.0.1, Figure 4, §4.4.1, §4.4.1, Table 2.
  • [28] J. Sun, Z. Xu, and H. Shum (2008) Image super-resolution using gradient profile prior. In CVPR, Cited by: §2.0.1.
  • [29] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: item , §1, Figure 4, §4.4.1.
  • [30] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In CVPR, Cited by: §1, §2.0.1, §4.4.1, Table 2.
  • [31] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.2.
  • [32] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §1, §2.0.1.
  • [33] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu (2019) Feedback network for image super-resolution. In CVPR, Cited by: §2.0.1, §4.4.1, §4.4.1, Table 2.
  • [34] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshop, Cited by: §1, §2.0.1, §4.1.3.
  • [35] W. S. Lin, S. K. Tjoa, H. V. Zhao, and K. R. Liu (2009) Digital image source coder forensics via intrinsic fingerprints. IEEE TIFS 4 (3). Cited by: §1.
  • [36] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang (2019) Structured knowledge distillation for semantic segmentation. In CVPR, Cited by: §2.0.2.
  • [37] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik (2016) Unifying distillation and privileged information. In ICLR, Cited by: §1, §2.0.3.
  • [38] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §4.1.2.
  • [39] X. Mao, C. Shen, and Y. Yang (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, Cited by: §1, §2.0.1.
  • [40] D. Martin, C. Fowlkes, D. Tal, J. Malik, et al. (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Cited by: §4.1.3, §4.2, §4.4.1, Table 2, Table 3.
  • [41] S. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh (2020) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. In AAAI, Cited by: §3.1, §4.2.
  • [42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §4.1.1.
  • [43] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: hints for thin deep nets. In ICLR, Cited by: §1, §2.0.2.
  • [44] S. Schulter, C. Leistner, and H. Bischof (2015) Fast and accurate image upscaling with super-resolution forests. In CVPR, Cited by: §2.0.1.
  • [45] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) MemNet: a persistent memory network for image restoration. In ICCV, Cited by: §2.0.1, §4.4.1, Table 2.
  • [46] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In CVPR, Cited by: §1, §2.0.1, §4.4.1, Table 2.
  • [47] R. Timofte, V. De, and L. V. Gool (2013) Anchored neighborhood regression for fast example-based super-resolution. In ICCV, Cited by: §2.0.1.
  • [48] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR Workshop, Cited by: §4.1.1, §4.1.2, §4.4.1, Table 1, Table 2, Table 3.
  • [49] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §1, §2.0.1.
  • [50] V. Vapnik and R. Izmailov (2015) Learning using privileged information: similarity control and knowledge transfer. JMLR 16. Cited by: §2.0.3, §3.1.
  • [51] V. Vapnik and A. Vashist (2009) A new learning paradigm: learning using privileged information. Neural Networks 22 (5-6). Cited by: §2.0.3.
  • [52] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP 13 (4). Cited by: §4.1.3, §4.4.1.
  • [53] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In NIPS, Cited by: §1.
  • [54] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §2.0.1.
  • [55] Q. Yan, Y. Xu, X. Yang, and T. Q. Nguyen (2015) Single image super-resolution based on gradient profile sharpness. IEEE TIP 24 (10). Cited by: §2.0.1.
  • [56] J. Yang, J. Wright, T. Huang, and Y. Ma (2008) Image super-resolution as sparse representation of raw image patches. In CVPR, Cited by: §2.0.1.
  • [57] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In CVPR, Cited by: §2.0.2.
  • [58] S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, Cited by: §2.0.2.
  • [59] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In Curves and Surfaces, Cited by: Figure 5, §4.1.3, §4.4.1, §4.4.2, Table 2.
  • [60] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep cnn for image denoising. IEEE TIP 26. Cited by: §1.
  • [61] K. Zhang, W. Zuo, S. Gu, and L. Zhang (2017) Learning deep CNN denoiser prior for image restoration. In CVPR, Cited by: §1.
  • [62] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In CVPR, Cited by: §1, §2.0.1.
  • [63] W. W. Zou and P. C. Yuen (2011) Very low resolution face recognition problem. IEEE TIP 21 (1). Cited by: §1.