An official implementation of ECCV2020 Paper: "Learning with Privileged Information for Efficient Image Super-Resolution" in PyTorch.
Convolutional neural networks (CNNs) have allowed remarkable advances in single image super-resolution (SISR) over the last decade. Most SR methods based on CNNs have focused on achieving performance gains in terms of quality metrics, such as PSNR and SSIM, over classical approaches. They typically require a large amount of memory and computational units. FSRCNN, which consists of a small number of convolutional layers, has shown promising results while using an extremely small number of network parameters. We introduce in this paper a novel distillation framework, consisting of teacher and student networks, that allows us to boost the performance of FSRCNN drastically. To this end, we propose to use ground-truth high-resolution (HR) images as privileged information. The encoder in the teacher learns the degradation process (subsampling of HR images) using an imitation loss. The student and the decoder in the teacher, having the same network architecture as FSRCNN, try to reconstruct HR images. Intermediate features in the decoder, affordable for the student to learn, are transferred to the student through feature distillation. Experimental results on standard benchmarks demonstrate the effectiveness and the generalization ability of our framework, which significantly boosts the performance of FSRCNN as well as other SR methods. Our code and model are available online: https://cvlab.yonsei.ac.kr/projects/PISR.
Single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) one, and has proven useful in various tasks including object detection [63, 17], medical imaging, and information forensics. With the great success of deep learning, SRCNN first introduced convolutional neural networks (CNNs) for SISR, outperforming classical approaches by large margins. Since then, CNN-based SR methods have focused on designing wider [34, 49, 62] or deeper [20, 29, 32, 39, 60, 61] network architectures for performance gains. They require a high computational cost and a large amount of memory, and thus implementing them directly on a single chip for, e.g., televisions and mobile phones is extremely hard without neural processing units and off-chip memory.
Many works introduce cost-effective network architectures [46, 30, 27, 1, 18, 19, 11] to reduce the computational burden and/or the required memory, using recursive layers [46, 30] or additional modules specific to SISR [27, 1]. Although they offer a good compromise in terms of PSNR and speed/memory, specially-designed or recursive architectures may be difficult to implement on hardware devices. Network pruning and parameter quantization, typically used for network compression, are alternative ways to obtain efficient SR networks: pruning removes redundant connections between nodes, and quantization reduces the bit-precision of weights or activations. The speedup achieved by pruning is limited due to irregular memory accesses and poor data localization, and the performance of network quantization is inherently bound by that of the full-precision model. Knowledge distillation is another model compression technique, where a large model (i.e., a teacher network) transfers a softened version of its output distribution (i.e., logits) or intermediate feature representations [14, 43, 2, 22] to a small one (i.e., a student network), which has shown its effectiveness in particular for image classification. Generalized distillation goes one step further, allowing a teacher to make use of extra (privileged) information at training time, and assisting the training process of a student network with this complementary knowledge [24, 15].
We present in this paper a simple yet effective framework for efficient SISR. The basic idea is that ground-truth HR images can be thought of as privileged information (Fig. 1), which has not been explored in either SISR or privileged learning. The HR image contains information complementary to the LR image (e.g., high-frequency components), but current SISR methods use it only to penalize an incorrect reconstruction at the end of the CNN. In contrast, using HR images as privileged information allows us to extract the complementary features and leverage them explicitly for the SISR task. To implement this idea, we introduce a novel distillation framework where teacher and student networks try to reconstruct the same HR image but from different inputs (i.e., the ground-truth HR image for the teacher and the corresponding LR image for the student), which is clearly different from the conventional knowledge distillation framework (Fig. 1). Specifically, the teacher network has an hourglass architecture consisting of an encoder and a decoder. The encoder extracts compact features from HR images while encouraging them to imitate their LR counterparts using an imitation loss. The decoder, which has the same network architecture as the student, reconstructs the HR images again from the compact features. Intermediate features in the decoder, affordable for the student to learn, are then transferred to the student via feature distillation, such that the student learns the knowledge (e.g., high frequencies or fine details of HR inputs) of the teacher trained with the privileged data (i.e., HR images). Note that our framework is also useful in that the student can be initialized with the network parameters of the decoder, which transfers the reconstruction capability of the teacher to the student.
We mainly exploit FSRCNN  as the student network, since it has a hardware-friendly architecture (i.e., a stack of convolutional layers) and the number of parameters is extremely small compared to other CNN-based SR methods. Experimental results on standard SR benchmarks demonstrate the effectiveness of our approach, which boosts the performance of FSRCNN without any additional modules. To the best of our knowledge, our framework is the first attempt to leverage the privileged information for SISR. The main contributions of our work can be summarized as follows:
We present a novel distillation framework for SISR that leverages the ground truth (i.e., HR images) as privileged information to transfer the important knowledge of the HR images to a student network.
We propose to use an imitation loss to train a teacher network, making it possible to distill the knowledge a student is able to learn.
Early works on SISR design image priors to constrain the solution space [9, 28, 55], and leverage external datasets to learn the relationship between HR and LR images [13, 56, 47, 44, 6], since many HR images can correspond to a single LR image. CNNs have allowed remarkable advances in SISR. Dong et al. pioneered the idea of exploiting CNNs for SISR, proposing SRCNN, which learns a mapping function directly from input LR to output HR images. Recent methods using CNNs exploit a much larger number of convolutional layers. Sparse [34, 32, 39] or dense [49, 62, 20] skip connections between them prevent the gradient vanishing problem, achieving significant performance gains over classical approaches. More recently, efficient networks for SISR in terms of memory and/or runtime have been introduced. Memory-efficient SR methods [30, 46, 45, 33] reduce the number of network parameters by reusing them recursively. They further improve the reconstruction performance using residual units, memory or feedback modules, but at the cost of runtime. Runtime-efficient methods [11, 27, 1, 26], on the other hand, are computationally cheap. They use cascaded or multi-branch [27, 26] architectures, or exploit group convolutions [54, 8]. The main drawback of such SR methods is that their hardware implementation is difficult due to network architectures specially designed for the SR task. FSRCNN reduces both runtime and memory. It uses typical convolutional operators with a small number of filters and feature channels, except for the deconvolution layer at the last part of the network. Although FSRCNN has a hardware-friendly network architecture, it is largely outperformed by current SR methods.
The purpose of knowledge distillation is to transfer the representation ability of a large model (teacher) to a small one (student) to enhance the performance of the student model. It has been widely used to compress networks, typically for classification tasks. In this framework, the softmax outputs of a teacher are regarded as soft labels, providing informative clues beyond discrete labels. Recent methods extend this idea to feature distillation, which transfers intermediate feature maps [43, 2], their transformations [22, 58], the differences of features before and after a stack of layers, or pairwise relations within feature maps. In particular, the variational information distillation (VID) method transfers knowledge by maximizing the mutual information between feature maps of the teacher and student networks. We exploit VID for feature distillation, but within a different framework. Instead of sharing the same inputs (i.e., LR images) with the student, our teacher network takes HR images, which contain information complementary to the LR images, as inputs to take advantage of privileged information.
Closely related to ours, SRKD applies the feature distillation technique to SISR in order to compress the size of the SR network, where a student is trained to have feature distributions similar to those of a teacher. Following conventional knowledge distillation, the student and teacher networks in SRKD use the same LR inputs. This is clearly different from our method, in which the teacher takes ground-truth HR images as inputs, allowing it to extract more powerful feature representations for image reconstruction.
Learning using privileged information is a machine learning paradigm that exploits extra information, available only at training time (typically at an additional cost) and inaccessible at test time. In a broader context, generalized distillation covers both feature distillation and learning using privileged information. Generalized distillation enables transferring the privileged knowledge of a teacher to a student. For example, the works of [24, 15] adopt the generalized distillation approach for object detection and action recognition, where depth images are used as privileged information. In this framework, a teacher is trained to extract useful features from depth images. They are then transferred to a student that takes RGB images as inputs, allowing the student to learn complementary representations from the privileged information. Our method belongs to generalized distillation, since we train a teacher network with ground-truth HR images, which can be viewed as privileged information, and transfer the knowledge to a student network. Different from previous methods, our method does not require an additional cost for privileged information, since the ground truth is readily available at training time.
We denote by $x$ and $y$ LR and ground-truth HR images, respectively. Given the LR image $x$, we reconstruct a high-quality HR output efficiently in terms of both speed and memory. To this end, we present an effective framework consisting of teacher and student networks. The teacher network learns to distill the knowledge from privileged information (i.e., the ground-truth HR image $y$). After training the teacher network, we transfer the knowledge distilled from the teacher to the student to boost the reconstruction performance. We show in Fig. 2 an overview of our framework.
In order to transfer knowledge from a teacher to a student, the teacher should be superior to the student while extracting informative features. To this end, we treat ground-truth HR images as privileged information and exploit an intelligent teacher. As will be seen in our experiments, the network architecture of the teacher influences the SR performance significantly. As the teacher network takes ground-truth HR images as inputs, it may not extract useful features, and may instead simply learn to copy the inputs for the reconstruction of HR images, regardless of its capacity. Moreover, a large difference in the number of network parameters, or a large performance gap, between the teacher and the student discourages the distillation process [7, 41]. To reduce the gap while encouraging the teacher to capture useful features, we exploit an hourglass architecture for the teacher network. It projects the HR images into a low-dimensional feature space to generate compact features, and reconstructs the original HR images from them, such that the teacher learns to extract better feature representations for the image reconstruction task. Specifically, the teacher network consists of an encoder $E$ and a decoder $D$. Given a pair of LR and HR images, the encoder transforms the input HR image $y$ into a feature representation in a low-dimensional space:

$$\tilde{x} = E(y),$$

where the feature representation $\tilde{x}$ has the same size as the LR image. The decoder reconstructs the HR image from the compact feature $\tilde{x}$:

$$\hat{y}^{T} = D(\tilde{x}).$$
For the decoder, we use the same architecture as the student network. This allows the teacher to have a representational capacity similar to the student's, which has proven useful in prior distillation work.
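To make the hourglass design concrete, a minimal PyTorch sketch of such a teacher is given below. The module and variable names (`Encoder`, `Teacher`, `x_tilde`) are our own, the channel widths are placeholders, and the decoder is passed in as any FSRCNN-style module; the official implementation is available at the project page.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps an HR image to a compact feature of LR size (illustrative sketch)."""

    def __init__(self, scale=2, channels=56):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 5, stride=1, padding=2), nn.PReLU(),
            # the stride-s convolution downsamples HR to the LR resolution
            nn.Conv2d(channels, channels, 5, stride=scale, padding=2), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1), nn.PReLU(),
            nn.Conv2d(channels, 1, 3, stride=1, padding=1),
        )

    def forward(self, y_hr):
        return self.body(y_hr)  # compact feature, same spatial size as the LR image


class Teacher(nn.Module):
    """Hourglass teacher: encoder + a decoder sharing the student architecture."""

    def __init__(self, decoder, scale=2):
        super().__init__()
        self.encoder = Encoder(scale)
        self.decoder = decoder  # same architecture as the student (e.g. FSRCNN)

    def forward(self, y_hr):
        x_tilde = self.encoder(y_hr)   # compact feature, constrained toward the LR image
        y_hat = self.decoder(x_tilde)  # HR reconstruction from the compact feature
        return x_tilde, y_hat
```

After training, `self.decoder`'s weights can initialize the student directly, since both share one architecture.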
To train the teacher network, we use reconstruction and imitation losses, denoted by $\mathcal{L}_{\text{recon}}$ and $\mathcal{L}_{\text{imit}}$, respectively. The reconstruction term computes the mean absolute error (MAE) between the HR image $y$ and its reconstruction $\hat{y}^{T}$, defined as:

$$\mathcal{L}_{\text{recon}} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| y(i,j) - \hat{y}^{T}(i,j) \right|,$$

where $H$ and $W$ are the height and width of the HR image, respectively, and we denote by $y(i,j)$ the intensity value of $y$ at position $(i,j)$. It encourages the encoder output (i.e., the compact feature $\tilde{x}$) to contain useful information for the image reconstruction and forces the decoder to reconstruct the HR image again using the compact feature. The imitation term restricts the representational power of the encoder, making the output of the encoder close to the LR image. Concretely, we define this term as the MAE between the LR image $x$ and the encoder output $\tilde{x}$:

$$\mathcal{L}_{\text{imit}} = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} \left| x(i,j) - \tilde{x}(i,j) \right|,$$

where $h$ and $w$ are the height and width of the LR image, respectively. This facilitates the initialization of the student network, which takes the LR image as an input. Note that our framework avoids the trivial solution in which the compact feature becomes the LR image, since the network parameters in the encoder are updated by both the imitation and reconstruction terms. The overall objective is a sum of reconstruction and imitation terms, balanced by the parameter $\lambda_{\text{imit}}$:

$$\mathcal{L}_{\text{teacher}} = \mathcal{L}_{\text{recon}} + \lambda_{\text{imit}} \, \mathcal{L}_{\text{imit}}.$$
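Both teacher objectives are plain MAE terms; a minimal NumPy sketch follows. The function names and the default `lambda_imit` value are our own placeholders (the paper selects the weight by grid search), not the official implementation.

```python
import numpy as np


def mae(a, b):
    """Mean absolute error between two images or feature maps."""
    return np.mean(np.abs(a - b))


def teacher_loss(y_hr, y_hat, x_lr, x_tilde, lambda_imit=1.0):
    """Reconstruction + imitation objective for the teacher (sketch).

    y_hr:    ground-truth HR image,  y_hat:  decoder reconstruction,
    x_lr:    LR image,               x_tilde: encoder output (compact feature).
    lambda_imit is a placeholder balancing weight.
    """
    l_recon = mae(y_hr, y_hat)    # decoder must reconstruct the HR image
    l_imit = mae(x_lr, x_tilde)   # encoder output must stay close to the LR image
    return l_recon + lambda_imit * l_imit
```

Because both terms back-propagate into the encoder, the compact feature is pulled toward the LR image without collapsing onto it.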
A student network $S$ has the same architecture as the decoder in the teacher, but uses a different input. It takes an LR image $x$ as input and generates an HR image $\hat{y}^{S}$:

$$\hat{y}^{S} = S(x).$$
We initialize the weights of the student network with those of the decoder in the teacher. This transfers the reconstruction capability of the teacher to the student and provides a good starting point for optimization. Note that several works [24, 15] point out that the initialization of network weights is crucial for the performance of a student. We adopt FSRCNN, a hardware-friendly SR architecture, as the student network $S$.
Although the network parameters of the student and the decoder in the teacher are initially the same, the features extracted from them differ due to the different inputs. Besides, these parameters have not been optimized with LR inputs. We therefore further train the student network with a reconstruction loss $\mathcal{L}_{\text{recon}}^{S}$ and a distillation loss $\mathcal{L}_{\text{distill}}$. The reconstruction term is defined similarly to Eq. (3), using the ground-truth HR image $y$ and its reconstruction $\hat{y}^{S}$ from the student network, dedicated to the SISR task:

$$\mathcal{L}_{\text{recon}}^{S} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| y(i,j) - \hat{y}^{S}(i,j) \right|.$$
The distillation term focuses on transferring the knowledge of the teacher to the student. Overall, we use the following loss to train the student network:

$$\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{recon}}^{S} + \lambda_{\text{distill}} \, \mathcal{L}_{\text{distill}},$$

where $\lambda_{\text{distill}}$ is a distillation parameter. In the following, we describe the distillation loss in detail.
We adopt the distillation loss proposed in the VID method, which maximizes the mutual information between the teacher and the student. We denote by $F^{T}$ and $F^{S}$ the intermediate feature maps of the teacher and student networks, respectively, having the same size of $C \times H \times W$, where $C$ is the number of channels. We define the mutual information as follows:

$$I(F^{T}; F^{S}) = H(F^{T}) - H(F^{T} \mid F^{S}),$$

where we denote by $H(F^{T})$ and $H(F^{T} \mid F^{S})$ the marginal and conditional entropies, respectively. To maximize the mutual information, we should minimize the conditional entropy $H(F^{T} \mid F^{S})$. However, an exact optimization w.r.t. the weights of the student is intractable, as it involves an integration over a conditional probability. The variational information maximization technique instead approximates the conditional distribution $p(F^{T} \mid F^{S})$ with a parametric model $q(F^{T} \mid F^{S})$, such as a Gaussian or Laplace distribution, making it possible to derive a lower bound of the mutual information $I(F^{T}; F^{S})$. Using this technique, we maximize the lower bound of the mutual information for feature distillation. As the parametric model $q$, we use a multivariate Laplace distribution with location and scale parameters $\mu$ and $\sigma$, respectively. We define the distillation loss as follows:

$$\mathcal{L}_{\text{distill}} = \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \frac{\left| F^{T}_{c,i,j} - \mu_{c,i,j} \right|}{\sigma_{c,i,j}} + \log \sigma_{c,i,j} \right),$$

where we denote by $F^{T}_{c,i,j}$ the element of $F^{T}$ at the position $(c,i,j)$. This minimizes the distance between the features of the teacher and the location map $\mu$. The scale map $\sigma$ controls the extent of distillation. For example, when the student does not benefit from the distillation, the scale parameter increases in order to reduce the extent of distillation. This is useful for our framework, where the teacher and student networks take different inputs, since it adaptively determines which features the student can afford to learn from the teacher. The $\log \sigma$ term prevents a trivial solution where the scale parameter goes to infinity. We estimate the maps $\mu$ and $\sigma$ from the features of the student $F^{S}$. Note that other losses designed for feature distillation can also be used in our framework (see the supplementary material).
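The distillation term above is, up to additive constants, the negative Laplace log-likelihood of the teacher features under the predicted location/scale maps. A small NumPy sketch (function name ours; we average rather than sum over elements, which only rescales the loss):

```python
import numpy as np


def vid_laplace_loss(f_teacher, mu, sigma):
    """Variational distillation loss with a Laplace model (sketch).

    f_teacher: teacher feature map, shape (C, H, W).
    mu, sigma: location and scale maps predicted from the student features.
    Minimizing |F - mu| / sigma + log(sigma) maximizes the Laplace
    log-likelihood of the teacher features up to constants.
    """
    sigma = np.maximum(sigma, 1e-6)  # numerical safety; softplus keeps sigma > 0
    return np.mean(np.abs(f_teacher - mu) / sigma + np.log(sigma))
```

When the student cannot match a teacher feature, the optimizer can grow `sigma` there, paying the `log(sigma)` penalty in exchange for down-weighting the mismatch.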
We use a small network to estimate the location and scale parameters in Eq. (10). It consists of location and scale branches, each of which takes the features of the student and estimates the location and scale maps separately. Both branches share the same network architecture of two convolutional layers with a PReLU between them. For the scale branch, we add a softplus function at the last layer, forcing the scale parameter to be positive. Note that the estimation module is used only at training time.
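A minimal PyTorch sketch of this two-branch estimator is shown below. The class name, kernel size, and channel width are illustrative assumptions, not the official settings; only the structure (conv, PReLU, conv per branch, softplus on the scale branch) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParamEstimator(nn.Module):
    """Predicts location and scale maps from student features (sketch).

    Two branches with the same architecture: conv -> PReLU -> conv.
    A softplus on the scale branch keeps sigma strictly positive.
    The module is used only at training time.
    """

    def __init__(self, channels=12):
        super().__init__()

        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.PReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        self.loc_branch = branch()
        self.scale_branch = branch()

    def forward(self, f_student):
        mu = self.loc_branch(f_student)
        sigma = F.softplus(self.scale_branch(f_student))  # sigma > 0
        return mu, sigma
```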
The encoder in the teacher network consists of 4 blocks of convolutional layers, each followed by a PReLU. All layers, except those in the second block, perform convolutions with stride 1. In the second block, we use a convolution with stride $s$ (i.e., the scale factor) to downsample the HR image to the size of the LR image. The kernel sizes of the first two and the last two blocks are 5×5 and 3×3, respectively. The decoder in the teacher and the student network have the same architecture as FSRCNN, consisting of five components: feature extraction, shrinking, mapping, expanding, and deconvolution modules. We add the estimation module for location and scale maps on top of the expanding module in the student network. We use these maps together with the output features of the expanding module in the decoder to compute the distillation loss. We set the hyperparameters for the losses using a grid search on the DIV2K dataset, and choose the values of $\lambda_{\text{imit}}$ and $\lambda_{\text{distill}}$ that give the best performance. We implement our framework using PyTorch.
To train our network, we use the training split of DIV2K, containing 800 pairs of LR and HR images, where the LR images are synthesized by bicubic downsampling. We randomly crop HR patches of size 192×192 from the HR images. LR patches are cropped from the corresponding LR images according to the scale factor; for example, LR patches of size 96×96 are used for the scale factor of 2. We use data augmentation techniques, including random rotation and horizontal flipping. The teacher network is trained from random initialization. We train our model with a batch size of 16 for about 1,000k iterations over the training data, using the Adam optimizer. The learning rate is decayed from its initial value using a cosine annealing technique.
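A single-cycle cosine annealing schedule (SGDR-style) can be sketched as follows; the endpoint values `eta_max` and `eta_min` here are placeholders for illustration, not the paper's settings.

```python
import math


def cosine_annealing(step, total_steps, eta_max=1e-3, eta_min=1e-7):
    """Cosine annealing of the learning rate over one cycle (sketch).

    Decays smoothly from eta_max at step 0 to eta_min at total_steps.
    eta_max / eta_min are placeholder endpoints, not the paper's values.
    """
    cos = math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cos)
```

In practice the equivalent built-in scheduler (e.g., `torch.optim.lr_scheduler.CosineAnnealingLR`) would be used with the training loop.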
We present an ablation analysis of each component of our framework. We report quantitative results in terms of the average PSNR on Set5 with the scale factor of 2. Results on a larger dataset (i.e., B100) can be found in the supplementary material. We show in Table 1 the average PSNR for student networks trained with variants of our framework. The baseline results in the first row are obtained with FSRCNN. From the second row, we can clearly see that feature distillation boosts the PSNR. For the teacher network in the second row, we use the same network architecture as FSRCNN except for the deconvolution layer: in contrast to FSRCNN, the teacher takes HR images as inputs, so we replace the deconvolution layer with a convolutional layer, preserving the size of the inputs. We can see from the third row that a teacher network with an hourglass architecture improves the student's performance. The hourglass architecture limits the capacity of the teacher and degrades its performance (e.g., a 19.9dB decrease compared to the teacher in the second row), reducing the performance gap between the teacher and the student. This makes the feature distillation more effective, so the student in the third row performs better (37.22dB) than that in the second row (37.19dB), consistent with recent findings [7, 41]. The fourth row shows that the student network benefits from initializing its weights with those of the decoder in the teacher, since this provides a good starting point for learning and transfers the reconstruction capability of the teacher. From the fifth row, we observe that the imitation loss further improves the PSNR, making it easier for the student to learn features from the teacher. The next two rows show that the VID loss, especially with the Laplace distribution (VID), provides better results than the MAE, and that combining all components gives the best performance.
The distillation loss based on the MAE forces the feature maps of the student and teacher networks to be the same. This strong constraint on the feature maps is, however, problematic in our case, since we use different inputs for the student and teacher networks. The VID method allows the student to learn important features adaptively. We also compare the performance of our framework and a typical distillation approach with different losses in the supplementary material.
Analysis of compact features in spatial (top) and frequency (bottom left) domains, and the distribution of pixel values (bottom right). To visualize the compact features in the frequency domain, we apply the 2D Fast Fourier Transform (FFT) to the image, obtaining its magnitude spectrum, which is then sliced along one axis. (Best viewed in color.)
In Fig. 3, we show an analysis of the compact features in the spatial and frequency domains. Compared to the LR image, the compact features show high-frequency details regardless of whether the imitation loss is used. This can also be observed in the frequency domain: the compact features contain more high-frequency components than the LR image, and their magnitude spectra are more similar to that of the HR image, especially for high-frequency components. Taking these features as inputs, the decoder in the teacher shows better performance than the student (Table 1), despite the fact that they share the same architecture. This demonstrates that the compact features extracted from the ground truth contain useful information for reconstructing the HR image, encouraging the student to reconstruct more accurate results via feature distillation. In the bottom right of Fig. 3, we can see that the pixel distributions of the LR image and the compact feature are largely different without the imitation loss, discouraging the weight transfer to the student. The imitation loss alleviates this problem by encouraging the distributions of the LR image and the compact feature to be similar.
We compare in Table 2 the performance of our student model with the state of the art, particularly efficient SISR methods [10, 11, 30, 46, 45, 27, 1, 33, 26]. For a quantitative comparison, we report the average PSNR and SSIM for upsampling factors of 2, 3, and 4 on standard benchmarks [5, 59, 40, 25]. We also report the number of model parameters and operations (MultiAdds) required to reconstruct an HR image, and present the average runtime of each method measured on Set5 using the same machine with an NVIDIA Titan RTX GPU. From this table, we can observe two things: (1) our student model trained with the proposed framework outperforms FSRCNN by a large margin, consistently for all scale factors, even though both have the same network architecture, demonstrating the effectiveness of exploiting ground-truth HR images as privileged information; (2) the model trained with our framework offers a good compromise in terms of PSNR/SSIM and the number of parameters, operations, and runtime. For example, DRCN requires 1,774K parameters, 17,974.3G operations, and an average runtime of 233.93ms to achieve an average PSNR of 30.75dB on Urban100 for a factor of 2. In contrast, our framework boosts FSRCNN without modifying its network architecture, achieving an average PSNR of 30.24dB with only 13K parameters and 6.0G operations, while taking 0.83ms for inference.
In Table 3, we show the performance of student networks adopting the architectures of other SR methods, trained with our framework on the DIV2K dataset. We reproduce these models (denoted by *) using the same training setting but without distillation. FSRCNN-L has the same components as FSRCNN but many more parameters (126K vs. 13K): the numbers of filters in the feature extraction and shrinking components are both 56, and the mapping module consists of 4 blocks of convolutional layers. Note that the multi-scale learning strategy of CARN is not used for training the network, and thus its performance is slightly lower than the original. We can see that all the SISR methods benefit from our framework, except IDN for the scale factor of 4 on Set5. In particular, the performance of the variant of FSRCNN and that of VDSR are significantly boosted by our framework. Additionally, our framework further improves the performance of the cost-effective SR methods [27, 1], which are specially designed to reduce the number of parameters and operations while improving the reconstruction performance. Considering the performance gains of recent SR methods, these results are significant, demonstrating the effectiveness and generalization ability of our framework. For example, IDN and SRFBN outperform the second-best methods by 0.05dB and 0.02dB, respectively, in terms of PSNR on Set5 for a factor of 2. We visualize in Fig. 4 a comparison of student networks using various SR methods against the state of the art in terms of the number of operations and parameters. It confirms once more the efficiency of our framework.
We show in Fig. 5 reconstruction examples on the Urban100  and Set14  datasets using the student networks. We can clearly see that the student models provide better qualitative results than their baselines. In particular, our models remove artifacts (e.g., the borders around the sculpture in the first row) and reconstruct small-scale structures (e.g., windows in the second row and the iron railings in the last row) and textures (e.g., the patterns of the tablecloth in the third row). More qualitative results can be seen in the supplementary material.
We have presented a novel distillation framework for SISR leveraging ground-truth HR images as privileged information. The detailed analysis on each component of our framework clearly demonstrates the effectiveness of our approach. We have shown that the proposed framework substantially improves the performance of FSRCNN as well as other methods. In future work, we will explore distillation losses specific to our model to further boost the performance.
This research was supported by the Samsung Research Funding & Incubation Center for Future Technology (SRFC-IT1802-06).