Impact of a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition

Knowledge Distillation (KD) is a strategy for the definition of a set of transferability gangways to improve the efficiency of Convolutional Neural Networks. Feature-based Knowledge Distillation is a subfield of KD that relies on intermediate network representations, either unaltered or depth-reduced via maximum activation maps, as the source knowledge. In this paper, we propose and analyse the use of a 2D frequency transform of the activation maps before transferring them. We pose that, by using global image cues rather than pixel estimates, this strategy enhances knowledge transferability in tasks such as scene recognition, defined by strong spatial and contextual relationships between multiple and varied concepts. To validate the proposed method, an extensive evaluation of the state-of-the-art in scene recognition is presented. Experimental results provide strong evidence that the proposed strategy enables the student network to better focus on the relevant image areas learnt by the teacher network, hence leading to better descriptive features and higher transferred performance than every other state-of-the-art alternative. We publicly release the training and evaluation framework used throughout this paper at http://www-vpu.eps.uam.es/publications/DCTBasedKDForSceneRecognition.

1 Introduction

Deep Neural Networks, and specifically models based on Convolutional Neural Networks (CNNs), have achieved remarkable success in several computer vision tasks during the last decade Deng et al. (2009); Lin et al. (2014); Cordts et al. (2016). New advances in image databases, CNN architectures and training schemes have pushed forward the state-of-the-art in computer vision. However, the success of deep models usually comes hand in hand with the need for huge computational and memory resources to process vast databases for training them Dosovitskiy et al. (2020). In this vein, there exists a line of research focused on using smaller models that need fewer computational resources for training while obtaining results similar to those of larger models. Techniques such as quantization Jacob et al. (2018), network pruning Luo et al. (2017); Zhou et al. (2018); Xiao et al. (2019); Liu et al. (2018), Knowledge Distillation Hinton et al. (2015); Gou et al. (2021) or the design of efficient new architectures Tan and Le (2019); Howard et al. (2019); Cui et al. (2019) have been of great importance to achieve fast, compact, and easily deployable CNN models.

Knowledge Distillation

Among these, Knowledge Distillation (KD) is of key relevance given its proven effectiveness in different computer vision tasks such as image classification, object detection and semantic segmentation Gou et al. (2021). KD was originally proposed by Hinton et al. Hinton et al. (2015) as a strategy to improve the efficiency of CNNs by passing on knowledge from a teacher to a student model. Generally, the student model, usually defined as a smaller network, leverages the knowledge learnt by the teacher model, usually a bigger one, via training supervision. Specifically, in Hinton's KD Hinton et al. (2015), the student model is trained using supervision not only from the ground-truth labels, but also from the logits predicted by the teacher. Compared to relying solely on hard-label annotations, the additional use of the teacher's predictions as extra supervision provides an automatic label smoothing regularization Müller et al. (2019); Yuan et al. (2020).
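For reference, a minimal PyTorch sketch of this response-based objective (soft targets from the teacher combined with hard-label cross-entropy) is shown below. It is our own illustration, not the code released with this paper, and the temperature and weighting values are placeholders.

```python
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Response-based KD: temperature-scaled soft targets plus hard-label CE.

    T and alpha are illustrative placeholders, not values from this paper.
    """
    # KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients as in Hinton et al. (2015)
    # Standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```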

Feature-based Knowledge Distillation expanded the seminal KD scheme by building on the concept of representation learning: CNNs are effective at encoding knowledge at multiple levels of feature representation Bengio et al. (2013). The idea was first introduced by FitNets Romero et al. (2015), which proposed to use the matching of intermediate CNN representations as the source knowledge that is transferred from the teacher to the student.

Figure 1: Example of the obtained activation maps, at different levels of depth, for the scene recognition task (the scene class is hotel room). Top rows represent activation maps for vanilla ResNet-18 and ResNet-50 CNNs respectively. Bottom row represents the activation maps obtained by the proposed DCT Attention-based KD method when ResNet-50 acts as the teacher network and ResNet-18 acts as the student. AT Komodakis and Zagoruyko (2017) activation maps are also included for comparison.

A specific subgroup of Feature-based KD methods is that of Attention-based KD. This category was pioneered by Komodakis et al. Komodakis and Zagoruyko (2017), who proposed to further optimize FitNets by simplifying complete CNN features into attention/activation maps. The matching between the student activation maps and the teacher ones serves as supervision for the KD scheme. The use of activation maps provides several advantages with respect to the direct use of features: first, as matching maps does not depend on channel dimensions, more architectures can be used in the KD process; second, it avoids the problem of semantic mismatching between features when KD is used between two architectures that differ significantly in depth Chen et al. (2021a). As depicted in Figure 1, activation areas, although not placed in exactly the same image locations, are correlated in terms of the semantic concepts detected even when comparing considerably different models such as ResNet-18 and ResNet-50.

Due to its computational simplicity and convenient mathematical properties (it is differentiable, symmetric and holds the triangle inequality), as already stated by Gou et al. Gou et al. (2021), the convention to compare either two feature tensors or a pair of activation maps is to compute the $\ell_2$ norm of their difference. However, the $\ell_2$ norm has already been shown to perform poorly when used to simulate human perception of visual similarities Zhao et al. (2016): due to its point-wise accumulation of differences, it might yield similar results for completely different-looking images Wang and Bovik (2009). Furthermore, in the scope of Attention-based KD, another key problem of the $\ell_2$ norm is its tendency towards desaturation when it is used to guide an optimization process; visual evidence of this problem is the sepia effect in colorization Zhang et al. (2016). We pose that the usage of the pixel-wise $\ell_2$ norm for the comparison of activation maps can be replaced by global image-wise estimates for a better matching and knowledge transfer in Feature-based KD.

Contributions

In this vein, we propose a novel matching approach based on a 2D discrete linear transform of the activation maps. This technique, for which we here leverage the simple yet effective Discrete Cosine Transform (DCT) Oppenheim et al. (2001), is based on the 2D relationships captured by the transformed coefficients, so that the matching is moved from a pixel-to-pixel fashion to a correlation in the frequency domain, where each of the coefficients integrates spatial information from the whole image. Figure 1 depicts an example of the activation maps obtained when using the proposed DCT approach to match the ResNet-50 ones. Note how the similarity is higher with respect to the maps obtained by AT Komodakis and Zagoruyko (2017), a method based on an $\ell_2$-driven metric.

To verify the effectiveness of the proposed method, this paper proposes an evaluation of KD in scene recognition, a task defined by strong spatial and contextual relationships among stuff and objects. Scene recognition models are associated with highly variable and sparse attention maps that have been proved to be of crucial relevance for better knowledge modelling and to explain overall performance López-Cifuentes et al. (2020). Moreover, we claim that the state-of-the-art in KD is over-fitted to the canonical image classification task (Table 3, Chen et al. (2021b)), where image concepts are represented by a single, usually centered, object (CIFAR and ImageNet datasets). We believe that moving KD research to a more complex task that uses more realistic datasets may be beneficial not only to assess the potential benefits of each KD method in an alternative scenario, but also to widen the scope of KD research and, in particular, to boost the efficiency of scene recognition models by using models with the same performance but with a significantly lower number of parameters.

In summary, this paper contributes to the KD task by:

  • Proposing a novel DCT-based metric to compare 2D structures by evaluating their similarity in the DCT domain. We propose to use this technique in an Attention-based KD approach to more adequately compare activation maps from intermediate CNN layers.

  • Presenting a thorough benchmark of Knowledge Distillation methods on three publicly available scene recognition datasets and reporting strong evidence that the proposed DCT-based metric enables a student network to better focus on the relevant image areas learnt by a teacher model, hence increasing the overall performance for scene recognition.

  • Publicly releasing the KD framework used to train and evaluate the scene recognition models from the paper. This framework, given its simplicity and modularity, will enable the research community to develop novel KD approaches that can be effortlessly evaluated under the same conditions for scene recognition.

2 Related Work

2.1 Knowledge Distillation

As already introduced, KD is a strategy defining a set of transferability gangways to improve the efficiency of Deep Learning models. A teacher model is used to provide training supervision for a student model, usually a shallower one. Gou et al. Gou et al. (2021) propose to arrange KD into three different groups depending on the distilled knowledge: response-based, relation-based and feature-based KD.

The original KD idea, enclosed in the response-based group, was pioneered by Hinton et al. Hinton et al. (2015). They proposed to use teacher outputs in the form of logits to supervise, cooperatively with ground-truth labels, the training of the student network. Training with soft labels predicted by the teacher provides a strong regularization that benefits the student's performance in the image classification task Müller et al. (2019); Yuan et al. (2020). The seminal KD was improved by changing the way logits were compared. Passalis et al. Passalis et al. (2020) proposed to use a divergence metric (Kullback–Leibler divergence) to match the probability distributions obtained by the teacher and the student. In the same line, Tian et al. proposed the use of contrastive learning Tian et al. (2019), which pushed response-based KD performance even further.

Relation-based KD accounts for transferring the relationships between different activations, neurons or pairs of samples that are encoded by the teacher model and transferred to the student one. Yim et al. Yim et al. (2017) proposed a Flow of Solution Process (FSP) matrix, defined by the Gram matrix between two layers, which summarizes the relations between pairs of feature maps. Passalis et al. Passalis et al. (2020) proposed to model abstract feature representations of the data samples by estimating their distribution using a kernel function; these estimated distributions, rather than the features themselves, were then transferred to the student.
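To make the FSP idea concrete, a minimal sketch of the Gram-style matrix between two layers with the same spatial resolution is given below; this is our own simplified reading of Yim et al. (2017), with tensor names chosen for illustration.

```python
import torch

def fsp_matrix(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Flow of Solution Process (FSP) matrix between two layers.

    feat_a: (B, C1, H, W) and feat_b: (B, C2, H, W), sharing the same H and W.
    Returns a (B, C1, C2) Gram matrix averaged over spatial positions.
    """
    b, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(b, c1, h * w)
    bmat = feat_b.reshape(b, c2, h * w)
    return torch.bmm(a, bmat.transpose(1, 2)) / (h * w)
```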

Feature-based KD, as originally proposed by the FitNets transferring scheme Romero et al. (2015), deals with using the matching of intermediate CNN representations as source knowledge that is transferred from the teacher to the student. Building on top of this idea, a variety of methods have been proposed. Ahn et al. Ahn et al. (2019) formulated feature KD as the maximization of the mutual information between teacher and student features. Guan et al. Guan et al. (2020) proposed a student-to-teacher path and a teacher-to-student path to properly obtain feature aggregations. Chen et al. Chen et al. (2021a) detected a decrease in performance when distilling knowledge caused by semantic mismatch between certain teacher-student layer pairs, and proposed to use attention mechanisms to automatically weight layers’ combinations. Chen et al. Chen et al. (2021b) revealed the importance of connecting features across different levels between teacher and student networks.

Within Feature-based KD methods one can find the Attention-based KD ones. Komodakis et al. Komodakis and Zagoruyko (2017) proposed to simplify the intermediate features into activation maps that were compared using an $\ell_2$ difference. As already stated in Section 1 and indicated by Gou et al. Gou et al. (2021), it is a convention, not only in attention-based but also in feature-based KD methods, to build the matching metric on the $\ell_2$ norm. We argue that this pixel-wise comparison might not be adequate when comparing multi-modal spatial structures such as attention maps.

2.2 Scene Recognition

Scene recognition is a hot research topic whose complexity is, according to the reported performances López-Cifuentes et al. (2020), one of the highest in image understanding. The complexity of the scene recognition task lies partially in the ambiguity between different scene categories showing similar appearance and objects' distributions: inter-class boundaries can be blurry, as the set of objects that defines a scene might be highly similar to that of another.

Nowadays, top performing strategies are fully based on CNN architectures. Based on context information, Xie et al. Xie et al. (2017) proposed to enhance fine-grained recognition by identifying relevant part candidates based on saliency detection and by constructing a CNN architecture driven by both these local parts and global discrimination. Similarly, Zhao et al. Zhao et al. (2019) proposed a discriminative discovery network (DisNet) that generates a discriminative map (Dis-Map) for the input image. This map is then used to select scale-aware discriminative locations which are finally forwarded to a multi-scale pipeline for CNN feature extraction.

A specific group of approaches in scene recognition is that trying to model relations between object information and scenes. Herranz-Perdiguero et al. Herranz-Perdiguero et al. (2018) extended the DeepLab network by introducing SVM classifiers to enhance scene recognition by estimating scene objects and stuff distributions based on semantic segmentation cues. In the same vein, Wang et al. Wang et al. (2017) defined semantic representations of a given scene by extracting patch-based features from object-based CNNs. The proposed scene recognition method built on these representations, Vectors of Semantically Aggregated Descriptors (VSAD), outperformed the state-of-the-art on standard scene recognition benchmarks. VSAD's performance was later enhanced by measuring correlations between objects among different scene classes Cheng et al. (2018). These correlations were then used to reduce the effect of common objects in scene misclassification and to enhance the effect of discriminative objects through a Semantic Descriptor with Objectness (SDO). Finally, López-Cifuentes et al. López-Cifuentes et al. (2020) argued that these methods relied on object information obtained by using patch-based object classification techniques, which entails a severe parametrization (scale, patch size, stride, overlapping…). To solve this issue they proposed to exploit visual context by using semantic segmentation instead of object information to guide the network's attention. By gating RGB features with information encoded in the semantic representation, their approach reinforced the learning of relevant scene contents and enhanced scene disambiguation by refocusing the receptive fields of the CNN towards the relevant scene contents.

According to the literature, we pose that the differential characteristics of the scene recognition task with respect to the classical image classification one might be beneficial to boost and widen the scope of KD techniques. These characteristics include performance results that are not yet saturated, a high ambiguity between different scene categories, and relevant image features that are spread out throughout the image instead of being localized in a specific area (usually the center region of the image).

3 Attention-based Knowledge Distillation Driven by DCT Coefficients

Figure 2: Example of the proposed gangways between two ResNet architectures representing the teacher and the student models. In this case, the intermediate feature representations for the Knowledge Distillation are extracted from the basic Residual Blocks. Besides this example, the proposed method can be applied to the whole set of ResNet, MobileNets, VGGs, ShuffleNets, GoogleNet and DenseNets families.

Following the organization of KD methods proposed by Gou et al. Gou et al. (2021), the following Section is divided into Knowledge (Section 3.1) and Distillation (Section 3.2). Figure 2 depicts the proposed DCT gangways in an architecture exemplified with two ResNet branches.

3.1 Knowledge

Attention Maps: We rely on mean feature activation areas Komodakis and Zagoruyko (2017), or attention maps, as the source of knowledge to be transferred from a teacher network to a student network. Given an image $\mathbf{I}$, a forward pass until a depth $l$ in a teacher CNN and in a student CNN yields feature tensors $\mathcal{F}_T^l \in \mathbb{R}^{C_T \times H_l \times W_l}$ and $\mathcal{F}_S^l \in \mathbb{R}^{C_S \times H_l \times W_l}$ respectively, with $H_l$, $W_l$ being the spatial dimensions and $C_T$ and $C_S$ the channel dimensions of the teacher and student features. An activation map for the teacher network can be obtained from these feature tensors by defining a mapping function $\mathcal{M}$ that aggregates information from the channel dimension:

$\mathcal{A}_T^l = \mathcal{M}(\mathcal{F}_T^l), \qquad \mathcal{M}: \mathbb{R}^{C_T \times H_l \times W_l} \rightarrow \mathbb{R}^{H_l \times W_l}$   (1)

The mean squared activations of neurons can be used as an aggregated indicator of the attention of the given CNN with respect to the input image. Accordingly, we define the mapping function as:

$\mathcal{M}(\mathcal{F}_T^l) = \frac{1}{C_T} \sum_{c=1}^{C_T} \left(\mathcal{F}_{T,c}^l\right)^2$   (2)

obtaining the activation map $\mathcal{A}_T^l \in \mathbb{R}^{H_l \times W_l}$. This activation map is then rescaled to the range $[0, 1]$ by a min-max normalization, yielding $\hat{\mathcal{A}}_T^l$. This process is similarly applied to the student network to obtain $\hat{\mathcal{A}}_S^l$. Figure 1 depicts an example of the normalized activation maps for ResNet-18 and ResNet-50 at different depths.
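As a minimal sketch (with our own naming, assuming (B, C, H, W) tensors as produced by PyTorch models), this mapping plus min-max normalization could be written as:

```python
import torch

def activation_map(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Channel-aggregated attention map (mean squared activations) scaled to [0, 1].

    features: (B, C, H, W) intermediate tensor from the teacher or the student.
    Returns a (B, H, W) normalized activation map, one per sample.
    """
    amap = features.pow(2).mean(dim=1)              # Eq. (2): mean squared activations
    flat = amap.flatten(start_dim=1)
    mins = flat.min(dim=1).values.view(-1, 1, 1)
    maxs = flat.max(dim=1).values.view(-1, 1, 1)
    return (amap - mins) / (maxs - mins + eps)      # min-max rescaling to [0, 1]
```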

Comparing Attention Maps via the DCT: We first propose to apply the DCT Oppenheim et al. (2001) to the two activation maps $\hat{\mathcal{A}}_T^l$ and $\hat{\mathcal{A}}_S^l$ before comparing them.

For the teacher map $\hat{\mathcal{A}}_T^l$, the DCT yields a set of coefficients $\mathcal{D}_T^l$, each representing the resemblance or similarity between the whole distribution of values and a specific 2D pattern represented by the corresponding basis function of the transform. Specifically, in the case of the DCT, these basis functions show increasing variability in the horizontal and vertical dimensions. The DCT is used here over other transforms given its simplicity, its computational efficiency and its differentiability.

Given the lossless nature of the DCT, applying the $\ell_2$ metric to the coefficients of the transformed maps would be equivalent to applying it over the activation maps, as in Komodakis et al. Komodakis and Zagoruyko (2017). However, we propose to modify the DCT coefficients in two ways. First, in order to compare the spatial structure of activation maps disregarding the global mean activation, we set to zero the first coefficient, the DC coefficient associated with a constant basis function Oppenheim et al. (2001). Then, we rescale the remaining coefficients, again using a min-max normalization, to obtain $\hat{\mathcal{D}}_T^l$, which permits scaling the DCT term to levels similar to those of the Cross-Entropy loss, hence enabling their combination without the need for additional weighting terms. The combination of these three operations (DCT transform, DC coefficient removal and coefficient normalization) on the maps is a simple yet effective change that makes the comparison focus on the distribution of the attention maps rather than on their monomodal maximum.

After extracting the DCT of the student map and processing it in the same way to obtain $\hat{\mathcal{D}}_S^l$, the two activation maps are compared using the $\ell_2$ norm between the normalized remaining coefficients:

$\mathcal{L}_{DCT}^l = \left\| \hat{\mathcal{D}}_T^l - \hat{\mathcal{D}}_S^l \right\|_2$   (3)

By using the $\ell_2$ norm over the DCT coefficients rather than directly on the activation map pixels, we move the matching from a pixel-wise computation of differences towards a metric that describes full-image differences. In addition, the proposed DCT-based metric focuses on the complete spatial structure while maintaining the mathematical properties of the $\ell_2$ metric: it is a differentiable convex function, it has a distance-preserving property under orthogonal transformations and its gradient and Hessian matrix can be easily computed. All of these are desirable and advantageous properties when using this distance in numerical optimization frameworks.
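The following sketch illustrates the transform-and-compare step under the notation above. It is our own illustration: the orthonormal DCT-II is built here as a plain matrix product for clarity, whereas the released framework reportedly relies on an FFT-based GPU implementation (Section 4.3.3), and all function names are ours.

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi / n * k.view(-1, 1) * (k.view(1, -1) + 0.5))
    basis[0] *= 1.0 / math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)

def dct_transform(a: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """2D DCT of (B, H, W) maps, with the DC coefficient removed and min-max normalization."""
    h, w = a.shape[-2:]
    d = dct_matrix(h).to(a) @ a @ dct_matrix(w).to(a).T
    d[:, 0, 0] = 0.0                                # discard the global mean activation
    flat = d.flatten(start_dim=1)
    mins = flat.min(dim=1).values.view(-1, 1, 1)
    maxs = flat.max(dim=1).values.view(-1, 1, 1)
    return (d - mins) / (maxs - mins + eps)

def dct_attention_loss(a_t: torch.Tensor, a_s: torch.Tensor) -> torch.Tensor:
    """Eq. (3): l2 distance between normalized DCT coefficients of two activation maps."""
    diff = dct_transform(a_t) - dct_transform(a_s)
    return diff.flatten(start_dim=1).norm(dim=1).mean()
```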

3.2 Distillation

As stated before, the objective of the proposed distillation scheme is to properly transfer the localization of the activation areas obtained by the teacher model for a given input $\mathbf{I}$ to the student model. To this aim, we define the KD loss by accumulating the DCT differences along the $L$ explored gangways:

$\mathcal{L}_{DCT} = \sum_{l=1}^{L} \left\| \hat{\mathcal{D}}_T^l - \hat{\mathcal{D}}_S^l \right\|_2$   (4)

During training, we refine this loss by only using the teacher maps for correct class predictions. This removes the effect of using distracting maps resulting from teacher mispredictions in the knowledge transfer process. In other words, we propose to transfer the knowledge only when the final logit prediction is correct. We refine Eq. 4 as:

$\mathcal{L}_{DCT} = \sum_{l=1}^{L} \mathbb{1}\left[\hat{y}_T = y\right] \left\| \hat{\mathcal{D}}_T^l - \hat{\mathcal{D}}_S^l \right\|_2$   (5)

where $\hat{y}_T$ is the class predicted by the teacher and $y$ the ground-truth label.

The overall loss used to train the student CNN is obtained via:

$\mathcal{L} = \alpha\, \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{DCT}$   (6)

where $\mathcal{L}_{CE}$ is the regular Cross-Entropy loss and $\alpha$ and $\lambda$ are weighting parameters that control the contribution of each term to the final loss.

As usually done with other KD methods Komodakis and Zagoruyko (2017); Tian et al. (2019); Chen et al. (2021a), the proposed approach can also be combined with the original Response-based KD loss proposed by Hinton et al. Hinton et al. (2015) by including it in Eq. 6:

$\mathcal{L} = \alpha\, \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{DCT} + \gamma\, \mathcal{L}_{KD}$   (7)

where $\mathcal{L}_{KD}$ is defined as in Hinton et al. Hinton et al. (2015) and $\gamma$ weights its contribution to the final loss $\mathcal{L}$.
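Putting Eqs. (5)-(7) together, a possible training objective is sketched below under the reconstructed notation above, reusing the dct_transform helper from the previous sketch; alpha, lam, gamma and T are placeholders, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, labels,
               student_maps, teacher_maps,
               alpha=1.0, lam=1.0, gamma=0.0, T=4.0):
    """Overall objective: cross-entropy + DCT attention matching (+ optional Hinton KD).

    student_maps / teacher_maps: lists of (B, H, W) activation maps, one per gangway.
    Teacher maps only contribute for samples the teacher classifies correctly (Eq. 5).
    """
    ce = F.cross_entropy(student_logits, labels)

    # Mask out samples mispredicted by the teacher.
    correct = (teacher_logits.argmax(dim=1) == labels).float()
    dct = student_logits.new_zeros(())
    for a_s, a_t in zip(student_maps, teacher_maps):
        per_sample = (dct_transform(a_t) - dct_transform(a_s)).flatten(1).norm(dim=1)
        dct = dct + (per_sample * correct).sum() / correct.sum().clamp(min=1.0)

    loss = alpha * ce + lam * dct
    if gamma > 0.0:  # optional response-based KD term (Eq. 7)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean") * (T * T)
        loss = loss + gamma * kd
    return loss
```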

4 Experimental Evaluation

This Section describes the experiments carried out to validate the proposed approach. First, Section 4.1 delves into the reasons why a new KD benchmark is needed and motivates our choice of the scene recognition task for it. Second, to ease the reproducibility of the method, Section 4.2 provides a complete review of the implementation details. Section 4.3 motivates a series of ablation studies for the proposed method. Section 4.4 reports state-of-the-art results on the standard CIFAR 100 benchmark and a thorough state-of-the-art comparison on the scene recognition task. Quantitative and qualitative results for the obtained distilled activation maps are presented in Section 4.5.

4.1 Validation on Scene Recognition Benchmarks

All feature- and attention-based KD methods reviewed in Sections 1 and 2 have so far been evaluated mainly using image classification benchmarks on the ImageNet Deng et al. (2009), CIFAR 10/100 Krizhevsky et al. (2009) and MNIST Deng (2012) datasets. We claim that scene recognition is a better-suited task to evaluate KD methods for a variety of reasons:

First, reported performances on scene recognition benchmarks López-Cifuentes et al. (2020); Chen et al. (2020); Li et al. (2021) are not saturated. This means that results highly differ between shallow and deep architectures, providing a wider and more representative performance gap to be filled by KD methods than that existing for image classification in standard CIFAR10/100 evaluations. Note how the performance difference between a Teacher and a Vanilla baseline is below 5 accuracy points in CIFAR100 (Table 3), while that difference grows to more than 17 points in the ADE20K scene recognition dataset (Table 5).

Second, attention is a secondary factor for succeeding in ImageNet-like datasets. Due to the nature of the images, a model's attention is usually concentrated around the center of the image Mohsenzadeh et al. (2020). This image-center bias causes different models to focus on very similar image areas at different depth levels, suggesting that the performance is mainly driven by the representativity and discriminability of the extracted features rather than by the areas of predominant attention. Figure 5 in Section 4.4.1 provides examples of this observation.

Differently, in scene recognition the gist of a scene is defined by several image features including stuff, objects, textures and the spatial relationships between stuff and objects, which are, in turn, spread out throughout the image representing the scene. The areas of attention on which different models primarily focus have been proved to be critical and to have a strong correlation with performance López-Cifuentes et al. (2020). Actually, shallower networks can end up performing better than deeper networks if their attention is properly guided. In this case, Attention-based KD might be a paramount strategy to build better and simpler models.

Given these reasons, we believe that setting up a KD benchmark that uses scene recognition rather than classical ImageNet-like image classification is helpful to spread the use of KD to other research scenarios, build a novel state-of-the-art and widen its application to more challenging tasks.

In this section, our approach is evaluated on three well-known and publicly available scene recognition datasets: ADE20K Zhou et al. (2017), MIT Indoor 67 Quattoni and Torralba (2009) and SUN 397 Xiao et al. (2010). However, as we understand that our approach should also be compared against the KD literature on a standard benchmark, results for the CIFAR 100 dataset Krizhevsky et al. (2009) are also presented in Section 4.4.1.

4.2 Implementation Details

We provide and publicly release a novel training and evaluation KD framework for scene recognition including all the code and methods reported in this paper 111http://www-vpu.eps.uam.es/publications/DCTBasedKDForSceneRecognition. This framework enables the reproducibility of all the results in the paper and, given its modular design, enables future methods to be easily trained and evaluated under the same conditions as the presented approaches. The following implementation details describe the architectures, hyper-parameters and evaluation metrics used:

Architectures: The proposed method and the state-of-the-art approaches are evaluated using different combinations of Residual Networks He et al. (2016) and Mobile Networks Sandler et al. (2018).

Data Normalization and Augmentation: Each input image is spatially adapted to the network by re-sizing its smaller dimension to a fixed size, while the other is resized to maintain the aspect ratio. In terms of data augmentation, we adopt the common transformations: random cropping to the network input dimension and random horizontal flipping. We also apply image normalization using ImageNet mean and standard deviation values.
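As a reference, this pipeline maps naturally onto torchvision transforms. The sketch below is illustrative: the 256-pixel resize and 224-pixel crop are assumed ImageNet-style defaults, not values taken from this paper.

```python
import torchvision.transforms as T

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Assumed sizes (resize 256 / crop 224) are common defaults, used here for illustration only.
train_transform = T.Compose([
    T.Resize(256),                 # resize the smaller side, keeping the aspect ratio
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```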

Knowledge Distillation Layers: For the proposed method, we select the intermediate features from ResNet He et al. (2016) and MobileNet-V2 Sandler et al. (2018) networks at four different spatial sizes, analyzing four levels of depth. We assume that both teacher and student architectures share the same spatial sizes (in width and height, not in the channel dimension) at some points of their architectures. This assumption may preclude the application of the method (to some extent) for pairs of disparate architectures. However, the assumption holds for the most popular architectures (at least those concerning KD and the image classification task): the whole set of ResNet, MobileNet, VGG, ShuffleNet, GoogleNet and DenseNet families all share the same spatial sizes [H, W] at some points of their architectures.
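One simple way to expose such intermediate gangways in a torchvision ResNet is through forward hooks on the four residual stages; the wrapper below is our own illustrative sketch, not the released framework.

```python
import torch
import torchvision.models as models

class WithGangways(torch.nn.Module):
    """Wraps a torchvision ResNet and returns its logits plus the four stage outputs."""

    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone
        self._features = []
        for stage in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
            stage.register_forward_hook(lambda m, inp, out: self._features.append(out))

    def forward(self, x):
        self._features = []
        logits = self.backbone(x)
        return logits, list(self._features)

teacher = WithGangways(models.resnet50())
student = WithGangways(models.resnet18())
logits_s, feats_s = student(torch.randn(2, 3, 224, 224))  # feats_s: one tensor per depth level
```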

Hyper-parameters: All the reported models have been trained following the same procedure. Stochastic Gradient Descent (SGD) with default momentum and weight decay has been used to minimize the loss function and optimize the student network's trainable parameters, starting from a fixed initial learning rate that is decayed by a constant factor at regular epoch intervals; the number of training epochs and the batch size are identical for all models. Unless otherwise specified in the Results Section, we set λ = 1 in the final loss equation when using the proposed approach. When combining it with Hinton's KD Hinton et al. (2015), we follow the original publication for its hyper-parameters while maintaining λ = 1.

All the state-of-the-art reported methods have been trained by us for the scene recognition task using authors’ original implementations and implementations from Tian et al. Tian et al. (2019)222https://github.com/HobbitLong/RepDistiller. To provide a fair comparison, and in order to adapt them to the scene recognition task, an extensive grid-search starting from the optimal values reported in the original papers has been performed and presented in Section 4.4. Additionally, for the CIFAR100 experiment in Section 4.4.1, optimal hyper-parameter configurations reported in the original papers have been conserved. We refer to each of the individual publications for details.

Evaluation Metrics: Following the common scene recognition procedure López-Cifuentes et al. (2020), the Top@K accuracy metric, with K smaller than the total number of scene classes, has been chosen to evaluate the methods; specifically, Top@1 and Top@5 accuracies are reported. Furthermore, as the Top@K accuracy metrics are biased towards classes over-represented in the validation set, we also use an additional performance metric, the Mean Class Accuracy (MCA) López-Cifuentes et al. (2020). For the CIFAR100 dataset experiment, following Tian et al. (2019) and Chen et al. (2021b), regular accuracy is computed.
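For clarity, the two metrics can be computed as in the short helper below (our own sketch, not the evaluation code of the released framework):

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Top@K: fraction of samples whose ground-truth class is among the k highest logits."""
    topk = logits.topk(k, dim=1).indices                    # (N, k)
    hits = (topk == labels.view(-1, 1)).any(dim=1)
    return hits.float().mean().item()

def mean_class_accuracy(logits: torch.Tensor, labels: torch.Tensor, num_classes: int) -> float:
    """MCA: per-class Top@1 accuracy averaged over classes, robust to class imbalance."""
    preds = logits.argmax(dim=1)
    per_class = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            per_class.append((preds[mask] == c).float().mean())
    return torch.stack(per_class).mean().item()
```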

Hardware and Software: The model design, training and evaluation have been carried out using the PyTorch 1.7.1 Deep Learning framework Paszke et al. (2017), running on a PC with an 8-core CPU, 50 GB of RAM and an NVIDIA RTX GPU with 24 GB of memory.

4.3 Ablation Studies

The aim of this Section is to gauge the influence of design choices, parameters and computational needs of the method. The performance impact of the different stages of the method is analyzed in Section 4.3.1, the influence of the λ value, which weights the contribution of the proposed DCT-based loss to the global loss function (Eq. 6), is measured in Section 4.3.2, and the computational overhead introduced by the proposed DCT-based metric is discussed in Section 4.3.3.

4.3.1 Knowledge Distillation Design

DCT | DC Removal | DCT Normalization | Teacher Predictions | Hinton's KD Hinton et al. (2015) | Top@1 | Top@5 | MCA | Δ Top@1 (%)
- | - | - | - | - | 40.97 | 63.94 | 10.24 | -
✓ | - | - | - | - | 42.54 | 63.12 | 11.10 | +3.83
✓ | ✓ | - | - | - | 46.51 | 68.92 | 12.45 | +9.33
✓ | ✓ | ✓ | - | - | 46.84 | 67.41 | 12.88 | +0.70
✓ | ✓ | ✓ | ✓ | - | 47.35 | 70.40 | 13.11 | +1.08
✓ | ✓ | ✓ | ✓ | ✓ | 54.27 | 76.15 | 18.05 | +14.61
Table 1: Ablation study regarding the different stages of the proposed method. DCT stands for the use of the DCT to transform the activation maps. DC Removal stands for the suppression of the DC coefficient from Section 3. DCT Normalization stands for the min-max normalization of the DCT coefficients from Section 3. Teacher Predictions stands for the use of teacher predictions to refine the Knowledge Distillation in Eq. 5. Hinton's KD stands for the addition of Hinton's KD Hinton et al. (2015) as stated in Eq. 7. Each row incrementally adds one component; the last column reports the relative Top@1 gain with respect to the previous row. Bold values indicate best results.

Table 1 quantifies the incremental influence of every step in the proposed approach. For this experiment we use the ADE20K dataset, and ResNet-50 and ResNet-18 for the teacher and student models respectively. Results suggest that even the simplest approach (second row), i.e. when activation maps are distilled from the teacher to the student using the complete non-normalized DCT, outperforms the vanilla baseline (first row). Note that when the DC coefficient is suppressed results are further increased. This suggests that using a metric that captures 2D differences while disregarding the mean intensity value of an activation map helps to increase the performance of the student network.

Figure 3: Training and validation losses for ADE20K dataset. Classification curves represent Cross-Entropy loss values. Distill curves represent the proposed DCT-based loss values, either without normalization (a) or using min-max normalization (b).

Normalization of the DCT coefficients slightly enhances results but, more importantly, scales the DCT loss to a range similar to that of the Cross-Entropy loss. To further stress the impact of the normalization, Figure 3 (a) includes loss-evolution graphs for the proposed DCT-based method when the DCT coefficients are not normalized, whereas Figure 3 (b), on the contrary, represents the losses when the min-max normalization described in Section 3 is applied prior to the $\ell_2$ comparison. As can be observed, the normalization plays a crucial role in scaling the proposed DCT loss: if normalization is not used, the distillation loss term is two orders of magnitude larger than the classification loss term, hence dominating the global loss after their combination. In order to balance the impact of the losses in their combination without normalization, larger λ values would be required, thereby increasing the complexity of setting adequate hyper-parameters.

Back to Table 1, when teacher predictions are taken into account and mispredictions are suppressed from the KD pipeline, results are further increased. Finally, the combination of the proposed approach and KD Hinton et al. (2015) suggests a high complementarity that can boost results even further.

4.3.2 Influence of λ

Figure 4: Influence of λ on the performance of the model, measured over the ADE20K dataset. ResNet-50 acts as the teacher and ResNet-18 as the student.

The influence of the hyper-parameter λ (Eq. 6) has also been analyzed. Figure 4 shows performance curves (teacher: ResNet-50, student: ResNet-18) obtained with a wide range of λ values in the ADE20K dataset. For a clearer comparison, the performance of the vanilla ResNet-18 is also plotted. It can be observed that our method outperforms vanilla ResNet-18 training for all values, suggesting a stable performance for a wide range of λ values. We use λ = 1 in all the experiments ahead as a trade-off between accuracy and the balance of the distillation and cross-entropy terms in the final loss. However, it is important to remark that, differently from the reported KD methods, which need λ values ranging from 0.8 up to 30000 (Tables 5, 6 and 7), the proposed approach is stable for different values thanks to the normalization described in Section 3, which facilitates a smooth combination of the DCT and Cross-Entropy losses.

4.3.3 Computational Overhead

Bearing in mind that computational resources are a key aspect that should always be taken into account, Table 2 presents the overhead derived from including the proposed DCT-based metric with respect to other KD approaches. Results indicate that our approach has a computational time per training epoch similar to that of AT Komodakis and Zagoruyko (2017) and KD Hinton et al. (2015). Our implementation leverages the GPU implementation of the Fast Fourier Transform (FFT), which has already been demonstrated to be highly efficient in computational terms; this is also one of the advantages of using the DCT with respect to other alternative transforms. In addition, the proposed method, differently from many others in the state-of-the-art, does not add extra trainable parameters on top of the student ones, hence not needing extra memory resources.

Method (ResNet-18) Extra Trainable Parameters Time per Epoch (Min)
Baseline - 0.79
AT Komodakis and Zagoruyko (2017) 0 M 1.11
KD Hinton et al. (2015) 0 M 1.09
VID Ahn et al. (2019) 12.3 M 1.53
Review Chen et al. (2021b) 28 M 1.79
CKD Chen et al. (2021a) 634 M 5.03
DCT (Ours) 0 M 1.14
Table 2: Computational cost comparison, measured in extra trainable parameters and minutes per training epoch, between the Baseline, AT Komodakis and Zagoruyko (2017), KD Hinton et al. (2015), VID Ahn et al. (2019), Review Chen et al. (2021b), CKD Chen et al. (2021a) and the proposed DCT-based method.

4.4 Comparison with the State-of-the-Art

4.4.1 CIFAR 100 Results

Although one of the aims of our work is to extend and enhance the performance of KD in the scene recognition task, we are aware that an evaluation on the classical KD benchmark for image classification is also needed to help assess our contributions. To this aim, this section presents the performance of the proposed DCT-based approach on the CIFAR-100 dataset. For the sake of consistency, and to provide a fair comparison, we have followed the training and evaluation protocols described in the CRD paper Tian et al. (2019). In our case, the λ parameter from Eq. 6 has not been modified and remains set to 1. All the performances reported in Table 3, except those of our method, are obtained from already published works Tian et al. (2019); Chen et al. (2021b).

Model Year T: ResNet-56 T: ResNet-110 T: ResNet-110 T: ResNet-32x4 Average
S: ResNet-20 S: ResNet-20 S: ResNet-32 S: ResNet-8x4
Teacher - 72.34 74.31 74.31 79.42 75.09
Vanilla - 69.04 69.06 71.14 72.50 70.43
RKD Park et al. (2019) 2019 69.61 69.25 71.82 71.90 70.64
FitNet Romero et al. (2015) 2014 69.21 68.99 71.06 73.50 70.69
CC Peng et al. (2019) 2019 69.63 69.48 71.48 72.97 70.89
NST Huang and Wang (2017) 2017 69.60 69.53 71.96 73.30 71.09
FSP Yim et al. (2017) 2017 69.95 70.11 71.89 72.62 71.14
FT Kim et al. (2018) 2018 69.84 70.22 72.37 72.86 71.32
SP Tung and Mori (2019) 2019 69.67 70.04 72.69 72.94 71.33
VID Ahn et al. (2019) 2019 70.38 70.16 72.61 73.09 71.50
AT Komodakis and Zagoruyko (2017) 2017 70.55 70.22 72.31 73.44 71.63
PKT Passalis et al. (2020) 2020 70.34 70.255 72.61 73.64 71.71
AB Heo et al. (2019) 2019 69.47 69.53 70.98 73.17 71.78
KD Hinton et al. (2015) 2015 70.66 70.67 73.08 73.33 71.93
CRD Tian et al. (2019) 2019 71.16 71.46 73.48 75.51 72.90
Review Chen et al. (2021b) 2021 71.89 71.60 73.89 75.63 73.25
DCT (Ours) 2022 70.45 70.10 72.42 73.52 71.55
Table 3: CIFAR100 accuracy results with 4 different Teacher-Student combinations. All the state-of-the-art results are extracted from CRD Tian et al. (2019) and Review Chen et al. (2021b) papers. Methods are sorted based on their average results.

Table 3 presents accuracy results for the state-of-the-art in KD and the proposed approach for several network combinations. To ease the comparison, an average column is also included. These results suggest that: (1) all the reported methods perform similarly, most of them lying within a narrow accuracy range; and (2) our method achieves results comparable to other state-of-the-art methods even on a single object/concept dataset like CIFAR100.

Our approach is specifically targeted to tasks that benefit from the aggregation of information spatially spread throughout the image, e.g., scene recognition. However, when used for tasks that can be solved by just extracting features from a single (usually image-centered) region, such as the CIFAR 10/100 image classification benchmark Krizhevsky et al. (2009), our proposal is neutral. Contributions from attention-based approaches are hindered due to the similar, centered and compact attention patterns that result from this dataset at all levels of the different CNN vanilla models: as depicted in Figure 5, highly dissimilar architectures yield similar mono-modal attention maps around the object defining the image class. Note how unlike these attention maps are from the ones depicted in Figure 1.

Figure 5: Example of the obtained activation maps at three different levels for two different architectures on the CIFAR 100 dataset. Note the similarity between activation maps from different architectures and the centered and compact patterns at the different levels.
Method Training Validation
Level 1 Level 2 Level 3 Average Level 1 Level 2 Level 3 Average
Vanilla ResNet-20 0.71 0.70 0.84 0.75 0.71 0.70 0.84 0.75
AT Komodakis and Zagoruyko (2017) 0.92 0.92 0.94 0.93 0.93 0.92 0.94 0.93
DCT (Ours) 0.97 0.95 0.93 0.95 0.97 0.95 0.92 0.95
Table 4: ResNet-20 Activation Map’s similarity using SSIM with respect to a ResNet-56 model trained using the CIFAR100 dataset. SSIM values close to 1 indicate identical maps and values close to 0 indicate no similarity.

This attention map bias can also be noticed quantitatively in the experiment reported in Table 4. Here we quantify the similarity between the activation maps of ResNet-56 (Teacher) and those of some selected models for the whole set of training and validation samples in the CIFAR100 dataset. We use the Structural Similarity Index Measure (SSIM) Wang et al. (2004) to evaluate such similarity, hence avoiding potential biases inherited from the metrics used in the training stage. It can be observed that attention maps for the vanilla ResNet-20 model reach, on average, a 0.75 SSIM with respect to those of ResNet-56, a model with more than twice its capacity. It is noteworthy to advance that, when this experiment is carried out for scene recognition (Table 9), this average similarity decreases notably (from 0.75 to 0.48), indicating that the correlation between attention maps is substantially higher for CIFAR100 than for scene recognition datasets. In other words, activation maps in CIFAR-100 are already matched by most of the methods.
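The similarity measurement can be reproduced with an off-the-shelf SSIM implementation; the helper below is our own sketch using scikit-image, and it assumes the activation maps have already been extracted, min-max normalized to [0, 1] and resized to a common resolution.

```python
import numpy as np
from skimage.metrics import structural_similarity

def average_map_similarity(teacher_maps: np.ndarray, student_maps: np.ndarray) -> float:
    """Average SSIM between paired activation maps of shape (N, H, W) with values in [0, 1]."""
    scores = [
        structural_similarity(t, s, data_range=1.0)
        for t, s in zip(teacher_maps, student_maps)
    ]
    return float(np.mean(scores))
```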

Nevertheless, considering results from Tables 3 and 4, one can conclude that the proposed DCT-based loss yields a better matching between Teacher and Student activation maps than a method driven by the $\ell_2$ norm (the AT Komodakis and Zagoruyko (2017) method selected for comparison in Table 4). This supports the motivation of the paper: using a 2D frequency transform of the activation maps before transferring them benefits the comparison of the 2D global information by leveraging the spatial relationships captured by the transformed coefficients.

4.4.2 Scene Recognition Results

This Section presents a state-of-the-art benchmark for KD methods. Following common evaluations Tian et al. (2019); Chen et al. (2021a, b), we have selected the following top-performing KD methods: KD Hinton et al. (2015), AT Komodakis and Zagoruyko (2017), PKT Passalis et al. (2020), VID Ahn et al. (2019), CRD Tian et al. (2019), CKD Chen et al. (2021a) and Review Chen et al. (2021b). Obtained results for the ADE20K, SUN397 and MIT67 datasets are presented in Tables 5, 6 and 7 respectively. Performance metrics are included for three different pairs of teacher/student models: two sharing the same architecture, ResNet-50/ResNet-18 and ResNet-152/ResNet-34, and one with different backbones, ResNet-50/MobileNetV2. In addition, the combination of all these models with Hinton's KD Hinton et al. (2015) is also reported.

Figure 6: Box plot representing state-of-the-art results using different λ values in a range around the original value proposed by the corresponding works. The study has been performed using ResNet-50 as teacher and ResNet-18 as student on the ADE20K dataset. The red line represents the performance of our approach. Blue crosses represent the performance of each method using the λ value reported in the original publications.

First, to provide a fair comparison, Figure 6 compiles the performance ranges of an extensive search of the optimal λ value for each of the compared methods for the scene recognition task. The search has been carried out by modifying the values reported in the original publications (which we understand to be optimal for the image classification task) within a range around them. The search has been performed using ResNet-50 as teacher and ResNet-18 as student on the ADE20K dataset. To ease the comparison, the performance obtained by the original λ value and by the proposed method is also included. The models trained using the λ values resulting in the best performance for each method have been used to obtain the results in Tables 5, 6 and 7.

Average results from Tables 5, 6 and 7 indicate that the proposed approach outperforms both the vanilla training of the student and all the reported KD methods. The validation accuracy curves depicted in Figures 7 (a), 7 (b) and 7 (c) support this claim, providing a graphical comparison between all the reported methods for the ADE20K, SUN397 and MIT67 datasets respectively.

Method Year Extra Trainable Params λ | T: ResNet-50 (25.6 M) / S: ResNet-18 (11.7 M) | T: ResNet-152 (60.3 M) / S: ResNet-34 (21.8 M) | T: ResNet-50 (25.6 M) / S: MobileNet-V2 (3.5 M)
Top1 Top5 MCA | Top1 Top5 MCA | Top1 Top5 MCA
Teacher - - - 58.34 79.15 21.80 60.07 79.65 24.19 58.34 79.15 21.80
Vanilla - - - 40.97 63.94 10.24 41.63 65.15 10.03 44.29 67.69 10.44
AT Komodakis and Zagoruyko (2017) 2017 0 M 1100 45.43 66.70 12.29 44.80 65.21 11.39 46.65 65.69 11.85
VID Ahn et al. (2019) 2019 12.3 M 1.5 43.11 65.78 10.70 41.03 62.41 9.24 43.73 66.70 10.35
CRD Tian et al. (2019) 2019 0.3 M 1.4 45.92 67.87 11.91 43.09 66.53 10.30 45.14 69.11 10.27
PKT Passalis et al. (2020) 2020 0 M 30000 44.59 65.46 11.89 42.38 62.98 10.74 46.42 67.32 11.81
CKD Chen et al. (2021a) 2021 634 M 400 46.89 69.55 12.70 45.01 65.70 11.89 47.30 68.60 12.30
Review Chen et al. (2021b) 2021 28 M 1.8 45.88 68.20 12.71 43.03 65.34 10.84 45.30 69.74 11.48
DCT (Ours) 2022 0 M 1 47.35 70.40 13.11 45.63 66.05 12.02 47.39 68.52 12.35
KD Hinton et al. (2015) 2015 0 M 0.8 50.54 73.49 15.39 48.91 73.37 14.51 48.37 71.47 12.55
AT Komodakis and Zagoruyko (2017) + KD 2017 0 M 1100 48.87 73.01 13.29 49.35 72.09 14.16 47.67 72.97 12.93
VID Ahn et al. (2019) + KD 2019 12.3 M 1.5 49.69 72.36 19.89 49.34 71.57 14.19 48.14 71.88 12.90
CRD Tian et al. (2019) + KD 2019 0.3 M 1.4 48.78 73.76 12.31 48.16 72.15 15.36 47.88 71.97 11.36
PKT Passalis et al. (2020) + KD 2020 0 M 30000 49.31 73.41 14.48 49.70 73.33 14.64 49.43 72.76 13.59
CKD Chen et al. (2021a) + KD 2021 634 M 400 52.10 76.90 15.54 53.54 75.20 17.98 49.15 70.25 13.32
Review Chen et al. (2021b) + KD 2021 28 M 1.8 50.63 73.73 14.86 49.59 72.56 14.99 48.32 71.84 12.12
DCT (Ours) + KD 2022 0 M 1 54.25 76.15 18.05 52.68 74.60 17.07 50.75 72.53 14.05
Table 5: Comparison with respect to state-of-the-art methods on the ADE20K dataset with different Teacher (T) - Student (S) combinations. For computational cost comparison, the number of additional parameters is indicated. Results are obtained with one run of training. The λ value extracted from Figure 6 and used to train the models is also indicated. Best results in bold.
Figure 7: Validation set Accuracy (%) per epoch for the teacher model (ResNet-50), the vanilla network (ResNet-18), state-of-the-art methods, the proposed DCT approach and their combinations with KD Hinton et al. (2015) for ADE20K (a), SUN397 (b), and MIT67 (c) datasets.
Method Year Extra Trainable Params λ | T: ResNet-50 (25.6 M) / S: ResNet-18 (11.7 M) | T: ResNet-152 (60.3 M) / S: ResNet-34 (21.8 M) | T: ResNet-50 (25.6 M) / S: MobileNet-V2 (3.5 M)
Top1 Top5 MCA | Top1 Top5 MCA | Top1 Top5 MCA
Teacher - - - 61.69 87.50 61.74 62.56 87.53 62.63 61.69 87.50 61.74
Vanilla - - - 38.77 67.05 38.83 39.66 69.36 40.10 41.18 70.58 41.23
AT Komodakis and Zagoruyko (2017) 2017 0 M 1100 41.52 69.87 41.58 40.75 69.53 40.81 38.84 68.08 38.91
VID Ahn et al. (2019) 2019 12.3 M 1.5 41.16 69.15 41.21 39.02 67.77 39.05 40.59 69.79 40.64
CRD Tian et al. (2019) 2019 0.3 M 1.4 43.89 73.55 43.95 42.13 71.51 42.14 42.69 72.98 42.73
PKT Passalis et al. (2020) 2020 0 M 30000 38.70 67.34 38.72 37.70 66.06 37.72 40.17 68.89 40.2
Review Chen et al. (2021b) 2021 28 M 1.8 43.26 72.77 43.29 42.69 70.92 42.73 42.68 71.72 42.74
DCT (Ours) 2022 0 M 1 45.75 74.59 45.80 43.50 72.33 43.54 43.16 70.59 43.19
KD Hinton et al. (2015) 2015 0 M 0.8 48.83 77.66 48.90 48.26 76.79 48.30 47.31 77.80 47.38
AT Komodakis and Zagoruyko (2017) + KD 2017 0 M 1100 49.44 78.06 49.52 47.05 75.39 49.10 46.60 76.42 46.08
VID Ahn et al. (2019) + KD 2019 12.3 M 1.5 49.26 78.16 49.32 47.08 75.95 47.12 46.64 76.87 46.71
CRD Tian et al. (2019) + KD 2019 0.3 M 1.4 49.79 78.69 49.82 48.39 77.00 48.44 46.77 77.30 46.82
PKT Passalis et al. (2020) + KD 2020 0 M 30000 49.13 78.16 49.16 48.08 76.75 48.15 47.54 77.51 47.56
Review Chen et al. (2021b) + KD 2021 28 M 1.8 49.90 78.71 49.96 47.05 76.30 47.07 47.05 77.44 47.10
DCT (Ours) + KD 2022 0 M 1 55.15 83.20 55.19 50.51 79.25 50.55 49.25 79.35 49.30
Table 6: Comparison with respect to state-of-the-art methods on the SUN397 dataset with different Teacher (T) - Student (S) combinations. For computational cost comparison, the number of additional parameters is indicated. Results are obtained with one run of training. The λ value extracted from Figure 6 and used to train the models is also indicated. Best results in bold.
Method Year Extra Trainable Params λ | T: ResNet-50 (25.6 M) / S: ResNet-18 (11.7 M) | T: ResNet-152 (60.3 M) / S: ResNet-34 (21.8 M) | T: ResNet-50 (25.6 M) / S: MobileNet-V2 (3.5 M)
Top1 Top5 MCA | Top1 Top5 MCA | Top1 Top5 MCA
Teacher - - - 77.32 95.20 79.00 78.11 95.02 78.91 77.32 95.20 79.00
Vanilla - - - 49.26 77.02 46.87 38.84 67.52 38.88 49.06 79.08 48.66
AT Komodakis and Zagoruyko (2017) 2017 0 M 1100 50.41 79.30 50.42 49.66 76.84 49.03 45.13 75.51 44.32
VID Ahn et al. (2019) 2019 12.3 M 1.5 48.21 76.71 47.60 44.22 72.77 43.23 47.76 75.96 47.14
CRD Tian et al. (2019) 2019 0.3 M 1.4 51.45 78.56 51.14 41.95 72.95 41.87 50.10 77.22 47.20
PKT Passalis et al. (2020) 2020 0 M 30000 51.03 79.15 49.56 46.32 74.34 45.55 50.23 78.80 47.92
Review Chen et al. (2021b) 2021 28 M 1.8 51.73 80.78 51.18 44.43 75.36 44.09 50.25.48 78.60 49.43
DCT (Ours) 2022 0 M 1 56.32 84.90 55.39 52.14 80.98 50.98 50.42 78.68 48.38
KD Hinton et al. (2015) 2015 0 M 0.8 54.87 83.42 54.91 51.55 79.61 51.24 56.14 82.51 56.04
AT Komodakis and Zagoruyko (2017) + KD 2017 0 M 1100 58.41 83.78 57.81 52.30 80.10 52.48 52.17 80.53 51.34
VID Ahn et al. (2019) + KD 2019 12.3 M 1.5 54.20 81.51 54.54 51.79 80.23 51.88 55.75 81.94 55.60
CRD Tian et al. (2019) + KD 2019 0.3 M 1.4 55.23 83.83 54.83 50.54 79.92 50.53 55.16 81.78 54.79
PKT Passalis et al. (2020) + KD 2020 0 M 30000 53.83 80.83 53.77 50.52 79.37 50.71 53.05 81.87 52.90
Review Chen et al. (2021b) + KD 2021 28 M 1.8 56.48 81.89 57.17 51.42 78.96 51.05 56.99 81.59 56.98
DCT (Ours) + KD 2022 0 M 1 60.11 86.88 60.53 55.18 81.64 55.62 57.35 84.79 56.89
Table 7: Comparison with respect to state-of-the-art methods on the MIT67 dataset with different Teacher (T) - Student (S) combinations. For computational cost comparison, the number of additional parameters is indicated. Results are obtained with one run of training. The λ value extracted from Figure 6 and used to train the models is also indicated. Best results in bold.

The results of the proposed method, compared with the rest of the approaches, reinforce the hypothesis that properly learnt CNN attention is crucial for scene recognition. Results of smaller networks can be boosted if their attention is properly guided towards representative image areas, which are better obtained by deeper and more complex architectures. The increase in performance of the method with respect to AT Komodakis and Zagoruyko (2017) suggests that, even though it adopts similar knowledge sources, the proposed loss consistently achieves better results by better quantifying the differences between attention maps.

CKD Chen et al. (2021a) outperforms our method in a specific combination of Table 5 (T: ResNet-152 and S: ResNet-34 + KD) for the ADE20K dataset, while lagging behind in the other two evaluated combinations. Nevertheless, the number of extra trainable parameters required by CKD grows with the resolution of the images: whereas CKD is reasonable for datasets composed of low-resolution images (CIFAR 10/100 datasets), here the number of extra parameters is more than ten times larger than that of the teacher from which the knowledge is transferred. Given this amount of extra trainable parameters, it may be more worthwhile to train a vanilla model with that capacity. Therefore, we do not include the evaluation of CKD on the SUN397 and MIT67 datasets.

Results from Tables 5, 6 and 7 also indicate that, when dealing with scene recognition datasets, a proper selection of the architectures used in KD is important. Note how using a deeper architecture like ResNet-152 might not be as beneficial as using ResNet-50, maybe due to overfitting, or how extremely efficient models like MobileNet-V2 can achieve results similar to those of ResNet-18 or ResNet-34.

When the proposed method is combined with KD Hinton et al. (2015), results show an increase in performance with respect to the rest of the methods, which evidences that the proposed DCT-based method can be properly combined with KD, benefiting from the extra regularization that seminal KD provides at the response level.

Method Backbone Error Rate
Teacher ResNet-34 26.0
Student ResNet-18 28.2
AT Komodakis and Zagoruyko (2017) ResNet-18 27.1
KD Hinton et al. (2015) ResNet-18 28.1
DCT (Ours) ResNet-18 26.35
Table 8: Error rates when transferring learning from ImageNet to the MIT67 scene recognition dataset. All results except DCT (Ours) are extracted from Zagoruyko et al. Komodakis and Zagoruyko (2017).

4.4.3 Transfer Learning Results

Table 8 presents a Transfer Learning experiment for scene recognition. We have followed the same training and evaluation protocol for the AT method as that proposed by Zagoruyko et al. Komodakis and Zagoruyko (2017). The aim of the experiment is to illustrate that our method also works when transferring attention in a Transfer Learning scenario, i.e., fine-tuning a model on the MIT67 dataset from ImageNet pre-trained weights. Results indicate that the proposed approach helps the transfer learning process by decreasing the error rate with respect to both the student (28.2 vs 26.35) and the AT-transferred model (27.1 vs 26.35).

4.5 Analysis of Activation Maps

Figure 8: Obtained activation maps for the proposed method using ResNet-50 as teacher and ResNet-18 as student. Note how the proposed approach enables a ResNet-18 architecture to have similar activation maps to the ones obtained by a ResNet-50.
Figure 9: Obtained activation maps for the proposed method using ResNet-50 as teacher and ResNet-18 as student. AT Komodakis and Zagoruyko (2017) activation maps are also included for comparison. Note how the proposed approach enables a ResNet-18 architecture to have similar activation maps to the ones obtained by a ResNet-50. Note also how the matching is better than the one achieved by AT Komodakis and Zagoruyko (2017).
Method Training Set Validation Set
Level 1 Level 2 Level 3 Level 4 Average Level 1 Level 2 Level 3 Level 4 Average
ResNet-18 0.46 0.32 0.39 0.72 0.48 0.47 0.32 0.40 0.71 0.47
AT Komodakis and Zagoruyko (2017) 0.66 0.73 0.76 0.90 0.76 0.67 0.74 0.77 0.83 0.75
DCT (Ours) 0.89 0.87 0.81 0.82 0.85 0.89 0.87 0.81 0.79 0.84
KD Hinton et al. (2015) 0.48 0.55 0.42 0.78 0.56 0.48 0.56 0.43 0.73 0.56
DCT (Ours) + KD 0.90 0.88 0.82 0.87 0.87 0.90 0.88 0.83 0.83 0.86
Table 9: Similarity between ResNet-50 activation maps, trained in ADE20K dataset, and the corresponding level’s activation maps of several models. SSIM values close to 1 indicate identical maps and values close to 0 indicate no similarity.

Figures 1, 8 and 9 present qualitative results of the activation maps obtained by the proposed method. In addition, Figures 1 and 9 include those obtained by AT Komodakis and Zagoruyko (2017) for comparison. Specifically, Figure 1 shows how AT maps resemble the teacher ones only in the wider and more intense areas of activation, i.e., the bed and the wardrobe in Level 3, while the proposed approach yields more similar maps in all the image areas the teacher focuses on, i.e., the bed and the wardrobe, but also the lamps, the paintings and even the book on the table. This suggests that the proposed DCT-based metric achieves a better matching when activation patterns are diverse and spread throughout the image.

Table 9 quantifies the qualitative observations from Figures 1, 8 and 9 by repeating the experiment presented in Section 4.4.1, i.e., computing the SSIM between the activation maps of ResNet-50 (Teacher) and those of each model for the whole set of training and validation samples in the ADE20K dataset.
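A minimal sketch of this similarity measurement is given below, assuming the per-level activation maps have already been extracted and resized to a common resolution; it simply averages scikit-image's SSIM over paired teacher and student maps.

```python
# Hedged sketch of the SSIM-based similarity reported in Table 9.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(teacher_maps, student_maps):
    """teacher_maps, student_maps: lists of (H, W) float arrays for the same samples."""
    scores = []
    for t_map, s_map in zip(teacher_maps, student_maps):
        # data_range is required for float inputs; use the joint dynamic range.
        drange = max(t_map.max(), s_map.max()) - min(t_map.min(), s_map.min())
        scores.append(ssim(t_map, s_map, data_range=drange))
    return float(np.mean(scores))
```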

Results in Table 9 confirm the qualitative analysis presented in Figures 1, 8 and 9: the similarity for Levels 1 to 3, in both the Training and Validation sets, increases when the proposed DCT-based loss is used. Level 4 similarity is slightly better for AT, mainly because activation maps at this level tend to be image-centred, continuous, and mono-modal, which benefits that measure. Overall, the average similarity achieved by the proposed DCT method is higher than that of AT for both the training set (0.85 vs. 0.76) and the validation set (0.84 vs. 0.75). Finally, it is remarkable how the similarity is even higher when the DCT+KD combination is used, which again indicates a high complementarity between both losses.

5 Conclusions

This paper proposes a novel approach to globally compare 2D structures or distributions by evaluating their similarity in the Discrete Cosine Transform domain. The proposed technique is the core of an Attention-based Knowledge Distillation method that aims to transfer knowledge from a teacher to a student model. Specifically, intermediate feature representations from the teacher and the student are used to obtain activation maps that are spatially matched using a DCT-based loss. The proposal is applied to the scene recognition task, where the attention of trained models is highly correlated with performance. The reported results show that the proposed approach outperforms state-of-the-art Knowledge Distillation approaches by comparing attention maps more accurately.

The presented results provide promising evidence that the use of 2D discrete linear transforms that efficiently capture 2D patterns might be helpful, not only for the Knowledge Distillation task, but also for other Computer Vision tasks where element-wise vectorial metrics are nowadays used by default.

References

  • S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9163–9171.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  • D. Chen, J. Mei, Y. Zhang, C. Wang, Z. Wang, Y. Feng, and C. Chen (2021a) Cross-layer distillation with semantic calibration. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21).
  • G. Chen, X. Song, H. Zeng, and S. Jiang (2020) Scene recognition with prototype-agnostic scene layout. IEEE Transactions on Image Processing 29, pp. 5877–5888.
  • P. Chen, S. Liu, H. Zhao, and J. Jia (2021b) Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5008–5017.
  • X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou (2018) Scene recognition with objectness. Pattern Recognition 74, pp. 474–487.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  • J. Cui, P. Chen, R. Li, S. Liu, X. Shen, and J. Jia (2019) Fast and practical neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6509–6518.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • L. Deng (2012) The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • J. Gou, B. Yu, S. J. Maybank, and D. Tao (2021) Knowledge distillation: a survey. International Journal of Computer Vision, pp. 1–31.
  • Y. Guan, P. Zhao, B. Wang, Y. Zhang, C. Yao, K. Bian, and J. Tang (2020) Differentiable feature aggregation search for knowledge distillation. In European Conference on Computer Vision, pp. 469–484.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3779–3787.
  • C. Herranz-Perdiguero, C. Redondo-Cabrera, and R. J. López-Sastre (2018) In pixels we trust: from pixel labeling to object localization and scene categorization. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 355–361.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
  • A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324.
  • Z. Huang and N. Wang (2017) Like what you like: knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
  • J. Kim, S. Park, and N. Kwak (2018) Paraphrasing complex network: network compression via factor transfer. arXiv preprint arXiv:1802.04977.
  • N. Komodakis and S. Zagoruyko (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations (ICLR).
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical Report.
  • P. Li, X. Li, X. Li, H. Pan, M. Khyam, M. Noor-A-Rahim, and S. S. Ge (2021) Place perception from the fusion of different image representation. Pattern Recognition 110, pp. 107680.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
  • A. López-Cifuentes, M. Escudero-Viñolo, J. Bescós, and Á. García-Martín (2020) Semantic-aware scene recognition. Pattern Recognition 102, pp. 107256.
  • J. Luo, J. Wu, and W. Lin (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
  • Y. Mohsenzadeh, C. Mullin, B. Lahner, and A. Oliva (2020) Emergence of visual center-periphery spatial organization in deep convolutional neural networks. Scientific Reports 10 (1), pp. 1–8.
  • R. Müller, S. Kornblith, and G. Hinton (2019) When does label smoothing help?. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS).
  • A. V. Oppenheim, J. R. Buck, and R. W. Schafer (2001) Discrete-time signal processing. Vol. 2. Upper Saddle River, NJ: Prentice Hall.
  • W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976.
  • N. Passalis, M. Tzelepi, and A. Tefas (2020) Probabilistic knowledge transfer for lightweight deep representation learning. IEEE Transactions on Neural Networks and Learning Systems.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Proceedings of the Conference on Neural Information Processing Systems (NIPS).
  • B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang (2019) Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016.
  • A. Quattoni and A. Torralba (2009) Recognizing indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 413–420.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: hints for thin deep nets. In Proceedings of the International Conference on Learning Representations.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive representation distillation. In International Conference on Learning Representations.
  • F. Tung and G. Mori (2019) Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365–1374.
  • Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao (2017) Weakly supervised PatchNets: describing and aggregating local patches for scene recognition. IEEE Transactions on Image Processing 26 (4), pp. 2028–2041.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
  • Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117.
  • J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492.
  • X. Xiao, Z. Wang, and S. Rajasekaran (2019) AutoPrune: automatic network pruning by regularizing auxiliary parameters. Advances in Neural Information Processing Systems 32.
  • G. Xie, X. Zhang, W. Yang, M. Xu, S. Yan, and C. Liu (2017) LG-CNN: from local parts to global discrimination for fine-grained recognition. Pattern Recognition 71, pp. 118–131.
  • J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
  • L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng (2020) Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903–3911.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.
  • H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2016) Loss functions for neural networks for image processing. IEEE Transactions on Computational Imaging.
  • Z. Zhao, Z. Liu, M. Larson, A. Iscen, and N. Nitta (2019) Reproducible experiments on adaptive discriminative region discovery for scene recognition. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1076–1079.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Z. Zhou, W. Zhou, H. Li, and R. Hong (2018) Online filter clustering and pruning for efficient convnets. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 11–15.