Differentiable Feature Aggregation Search for Knowledge Distillation

Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes at the cost of expensive computation resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation in the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the search problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both the expressivity and the learnability of the feature aggregation. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.




1 Introduction

In recent years, visual recognition tasks have been significantly improved by deeper and larger convolutional networks. However, it is difficult to directly deploy such complicated networks on computationally limited platforms such as robots, self-driving vehicles and most mobile devices. Therefore, the community has paid increasing attention to model compression approaches such as model pruning [7, 21, 34], model quantization [14, 19, 23] and knowledge distillation [13, 30, 33, 37, 38].

Knowledge distillation refers to methods that supervise the training of a small network (student) by using the knowledge extracted from one or more well-trained large networks (teacher). The key idea of knowledge distillation is to transfer the knowledge from the teacher networks to the student network. The first attempt at knowledge distillation for deep neural networks leverages both the correct class labels and the soft targets of the teacher network, i.e., the soft probability distribution over classes, to supervise the training of the student network. Recent advances in knowledge distillation can be mainly divided into two categories: output distillation [13] and feature distillation [30, 33, 38], as shown in Fig. 1 (a-b). More recent works concentrate on multi-teacher distillation with feature aggregation [37]. Although an ensemble of teacher networks provides richer information from the aggregation of output distributions and feature maps, it requires much more computation resources than single-teacher distillation.

Figure 1: Illustrations of different knowledge distillation methods. (a) Output distillation. (b) Feature distillation. (c) Multi-teacher distillation. (d) DFA leverages a novel bridge loss for feature distillation, which takes advantage of NAS and feature aggregation.

To achieve the same effect as multi-teacher distillation with less computation overhead, we propose DFA, a two-stage Differentiable Feature Aggregation search method for single-teacher knowledge distillation. DFA couples features from different layers of a single network as multiple “teachers”, and thus avoids the computation expense of running several large teacher networks. Specifically, DFA first searches for the appropriate feature aggregation, i.e., the weighted sum of the feature maps, for each layer group in the teacher network by finding the best aggregation weights. Then, it conducts the normal feature distillation with the derived aggregations. Inspired by DARTS [24], DFA leverages a differentiable group-wise search in the first stage, which formulates the search as a bi-level optimization problem with the feature aggregation weights as the upper-level variable and the model parameters as the lower-level variable. Moreover, as the common distillation loss and cross-entropy loss fail to find the appropriate feature aggregations, a novel bridge loss is introduced as the objective function in DFA, where (1) a student-to-teacher path is built for searching the layers that match the learning ability of the student network, and (2) a teacher-to-student path is established for finding the feature aggregation with rich features and a wealth of knowledge. Experiments on the CIFAR-100 [18] and CINIC-10 [6] datasets show that DFA outperforms the state-of-the-art distillation methods, demonstrating the effectiveness of the feature aggregation search.

The main contributions of this paper are as follows:

  • We introduce DFA, a Differentiable Feature Aggregation search method to mimic multi-teacher distillation in the single-teacher distillation framework, which first searches for appropriate feature aggregation weights and then conducts the distillation with the derived feature aggregations.

  • We propose a novel bridge loss for DFA. The bridge loss consists of a student-to-teacher path and a teacher-to-student path, which simultaneously considers the expressivity and learnability for the feature aggregation.

  • Experimental results show that the performance of DFA surpasses the feature aggregations derived by both hand-crafted settings and random search, verifying the strength of the proposed method.

2 Related Work

Knowledge Distillation: Knowledge distillation [2] was first introduced in model compression. Besides the classification loss, the student network is optimized by an extra cross-entropy loss with the soft target from the teacher network, i.e., the probability distribution softened by temperature scaling. Hinton et al. [13] employ knowledge distillation in the training of deep neural networks. However, given the huge gap in model capacity between the networks, it is hard for the student to learn directly from the output distribution of a cumbersome teacher. Thus, several approaches [16, 30, 38] exploit feature distillation in the student training, where the student network mimics the feature maps from different layers of the teacher network. Multi-teacher knowledge distillation [37] goes a step further and takes full advantage of the feature maps and the class distributions amalgamated from an ensemble of teacher networks. The feature aggregation from multiple teachers helps the student learn from different perspectives. However, compared with single-teacher distillation, more computation resources are required to extract useful information from all the teachers.

Neural Architecture Search: With the vigorous development of deep learning, neural architecture search (NAS), an automatic method for designing the structure of neural networks, has been attracting increasing attention. The early works mainly sample and evaluate a large number of networks from the search space, and then train the sampled models with reinforcement learning [3, 32, 41, 42] or update the population with an evolutionary algorithm [28, 29]. Though achieving state-of-the-art performance, the above works are all computationally expensive. Recent works propose one-shot approaches [1, 4, 8, 24, 27, 35] to reduce the computation cost of NAS. They model NAS as a single training process for a supernet, i.e., an over-parameterized network covering all candidate sub-networks, and then select the network architecture from the trained supernet. Among the one-shot methods, differentiable architecture search (DARTS) [5, 22, 24, 26, 36, 40] further relaxes the discrete search space to be continuous and couples the architecture parameters with the model parameters in the supernet. Therefore, the architecture parameters can be jointly optimized with the model parameters by gradient descent in the one-shot training.

Several methods have been designed for combining NAS with knowledge distillation. DNA [20] searches for the light-weight architecture of the student network from a supernet. KDAS [15] builds up the student network progressively based on an ensemble of independently learned student networks. As opposed to these methods, we try to imitate multi-teacher distillation in the single-teacher distillation framework by finding the appropriate feature aggregations in the teacher network with a differentiable search strategy.

Figure 2: Comparisons between traditional feature distillation and DFA. (a) Traditional methods implement distillation with the last feature map in each layer group of the teacher network. (b) DFA leverages the feature aggregation of the teacher for distillation, which contains rich features and a wealth of knowledge. “FD” and “CE” represent the feature distillation loss and cross entropy loss respectively.

3 Method

We propose the two-stage Differentiable Feature Aggregation (DFA) search method for single-teacher knowledge distillation, as outlined in Algorithm 1. In the first stage (i.e., “AGGREGATION SEARCH”), DFA searches for appropriate feature aggregation weights. In the second stage (i.e., “FEATURE DISTILLATION”), the derived feature aggregations are applied to perform the feature distillation between teacher and student. The details are described in this section.

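function AGGREGATION SEARCH
    Randomly initialize the model parameters $w$
    for each layer group $i$ do
        for each iteration do
            Optimize the architecture parameters $\beta_i$ by gradient descent on the validation loss
            Optimize the model parameters $w$ by gradient descent on the training loss
        end for
    end for
    Reserve the derived aggregation weights $\alpha$.
end function
function FEATURE DISTILLATION
    Get the derived aggregation weights $\alpha$.
    for each iteration do
        for each layer group $i$ do
            Calculate the feature aggregation $\tilde{F}^T_i$ with $\alpha_i$.
        end for
        Update $w$ by minimizing the loss defined in Eqn. (4).
    end for
end function
Algorithm 1: The two-stage DFA.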

3.1 Feature Distillation

DFA is based on feature distillation on layer groups, where a layer group denotes the set of layers with the same spatial size in the teacher and student networks. The general design schemes for feature distillation are categorized into teacher transform, student transform, distillation position and distance function [10]. The teacher transform and the student transform extract knowledge from the hidden features of the teacher and student networks at the distillation positions, respectively. Then, the extracted features are fed to the distance function of distillation. Most approaches [11, 33, 38] adopt the $L_2$ loss as the distance measurement. Let $G$ denote the number of layer groups, and let $N_i^T$ and $N_i^S$ denote the number of layers in the $i$-th layer group of the teacher and student networks. The distillation loss is defined as:

$$\mathcal{L}_{FD} = \sum_{i=1}^{G} \left\| \phi^T\!\left(F^T_{i,N_i^T}\right) - \phi^S\!\left(F^S_{i,N_i^S}\right) \right\|_2^2, \tag{1}$$

where $F^T_{i,N_i^T}$ and $F^S_{i,N_i^S}$ denote the feature maps of the teacher and student networks drawn from the distillation position of the $i$-th group. Conforming to the previous work [38], the distillation positions of the teacher and student networks lie at the end of each layer group. Besides, $\phi^T$ and $\phi^S$ in Eqn. (1) represent the teacher transform and the student transform respectively, which map the channel numbers of both $F^T_{i,N_i^T}$ and $F^S_{i,N_i^S}$ to the channel number of the teacher feature map. Traditional feature distillation methods are illustrated in Fig. 2 (a).

Different from traditional single-teacher feature distillation methods, DFA utilizes the feature aggregation of the teacher network as the supervision for the student network in each layer group, as shown in Fig. 2 (b). Given the feature aggregation weights $\alpha_i = (\alpha_{i,1}, \dots, \alpha_{i,N_i^T})$ of the $i$-th group in the teacher network, where $\sum_{j=1}^{N_i^T} \alpha_{i,j} = 1$, the feature aggregation of the $i$-th group can be computed by:

$$\tilde{F}^T_i = \sum_{j=1}^{N_i^T} \alpha_{i,j} F^T_{i,j}. \tag{2}$$

Existing feature distillation methods can be seen as a special case of feature aggregation, where the weight of the last layer in each layer group of the teacher network, i.e., $\alpha_{i,N_i^T}$, is set to one and the weights of the other layers in the group are set to zero. Given the feature aggregations of the different layer groups, the feature distillation loss in Eqn. (1) becomes:

$$\mathcal{L}_{FD} = \sum_{i=1}^{G} \left\| \phi^T\!\left(\tilde{F}^T_i\right) - \phi^S\!\left(F^S_{i,N_i^S}\right) \right\|_2^2. \tag{3}$$

Finally, the student network is optimized by a weighted sum of the distillation loss $\mathcal{L}_{FD}$ and the classification loss $\mathcal{L}_{CE}$:

$$\mathcal{L}_{student} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{FD}, \tag{4}$$

where $\lambda$ is the balancing hyperparameter and $\mathcal{L}_{CE}$ is the standard cross-entropy loss between the ground-truth class label $y$ and the output distribution $p$ of the student:

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} \mathbb{I}(y = c) \log p_c, \tag{5}$$

where $C$ represents the number of classes and $\mathbb{I}(\cdot)$ is the indicator function.
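The weighted-sum aggregation and the resulting student objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are flattened to plain lists of floats, the teacher and student transforms are omitted (both sides are assumed to share one channel dimension already), the distance is assumed to be a mean squared error, and the names `aggregate`, `fd_loss`, `student_loss` and the default `lam=1.0` are hypothetical.

```python
from typing import List, Sequence

Feature = List[float]  # a feature map flattened to a vector (assumption)


def aggregate(features: Sequence[Feature], weights: Sequence[float]) -> Feature:
    """Weighted sum of the teacher feature maps of one layer group."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature map is required")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("aggregation weights must sum to one")
    size = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(size)]


def fd_loss(teacher_aggs: Sequence[Feature], student_feats: Sequence[Feature]) -> float:
    """Sum over layer groups of the mean squared error (assumed distance)."""
    total = 0.0
    for agg, feat in zip(teacher_aggs, student_feats):
        total += sum((a - s) ** 2 for a, s in zip(agg, feat)) / len(agg)
    return total


def student_loss(ce: float, fd: float, lam: float = 1.0) -> float:
    """Classification loss plus the weighted distillation loss."""
    return ce + lam * fd


# One layer group with three teacher layers, each flattened to two values.
teacher_group = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
last_only = aggregate(teacher_group, [0.0, 0.0, 1.0])  # traditional distillation
mixed = aggregate(teacher_group, [0.25, 0.25, 0.5])    # a feature aggregation
```

With the weights `[0.0, 0.0, 1.0]` the aggregation reduces to the last feature map of the group, which is the special case that corresponds to traditional feature distillation.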

3.2 Differentiable Group-wise Search

As the search space of the feature aggregation weights is continuous and grows exponentially with the number of layer groups, DFA leverages a differentiable architecture search method to efficiently search for the task-dependent feature aggregation weights for better distillation performance. Inspired by previous attempts that divide the NAS search space into blocks [24, 42], DFA implements the feature aggregation search in a group-wise manner, i.e., the weights of the other groups are kept fixed when searching for the aggregation weights of layer group $i$. The group-wise search enables a strong learning capability of the model, so that only a few epochs are needed to reach convergence during training. The overall framework of the differentiable group-wise search is shown in Fig. 3.


3.2.1 Search Space:

Given the teacher and student networks, DFA aims to find the appropriate feature aggregation weights $\alpha_i$ for each group $i$. Different from the DARTS-based methods [24], the search space for the feature aggregation is continuous, since the combination of different layers could provide richer information than an individual feature map obtained from a discrete search space. Besides, as only one teacher network is utilized in the aggregation search, the training speed and computation overhead are similar to those of standard feature distillation. For a stable training process, we represent the feature aggregation weights as a softmax over a set of architecture parameters $\beta_i = (\beta_{i,1}, \dots, \beta_{i,N_i^T})$:

$$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{j'=1}^{N_i^T} \exp(\beta_{i,j'})}, \tag{6}$$

where $N_i^T$ is the number of layers in the $i$-th layer group of the teacher network.
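The softmax parameterization of the aggregation weights can be illustrated as follows. This is a small sketch rather than code from the paper; the function name `softmax` and the example parameter values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(params: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(params)
    exps = [math.exp(p - m) for p in params]
    total = sum(exps)
    return [e / total for e in exps]


# Equal architecture parameters give uniform aggregation weights.
uniform = softmax([0.0, 0.0, 0.0, 0.0])

# A much larger last parameter approximates the "last feature map only" setting.
near_last = softmax([0.0, 0.0, 0.0, 10.0])
```

Because the softmax output is always positive and sums to one, the unconstrained architecture parameters can be updated freely by gradient descent while the aggregation weights remain a valid convex combination.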


3.2.2 Optimization of Differentiable Group-wise Search:

The goal of the differentiable group-wise search is to jointly optimize the architecture parameters $\beta$ and the model parameters $w$ of the student network. Specifically, the differentiable search tries to find the $\beta^*$ that minimizes the validation loss $\mathcal{L}_{val}(w^*, \beta)$, where the weights $w^*$ associated with the architecture parameters are obtained by minimizing the training loss $\mathcal{L}_{train}(w, \beta)$ for a given architecture parameter $\beta$. Thus, the joint optimization can be viewed as a bi-level optimization problem with $\beta$ as the upper-level variable and $w$ as the lower-level variable:

$$\min_{\beta} \; \mathcal{L}_{val}(w^*(\beta), \beta) + \mathcal{R}(\beta), \tag{7}$$

$$\text{s.t.} \quad w^*(\beta) = \arg\min_{w} \mathcal{L}_{train}(w, \beta), \tag{8}$$

where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ are the training and validation loss respectively, and $\mathcal{R}(\beta)$ denotes the regularization on the architecture parameters, which slightly boosts the performance of DFA. To solve the bi-level optimization problem, $\beta$ and $w$ are alternately trained in a multi-step way by gradient descent to reach a fixed point of the architecture parameters and model parameters.
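The alternating scheme can be demonstrated on a toy problem. The sketch below is only an illustration: the two quadratic losses are hypothetical stand-ins for the training and validation losses, the regularization term is omitted, and the architecture gradient uses a first-order approximation (the model parameter is held fixed while the architecture parameter is updated), which is an assumption of this example rather than a detail stated above.

```python
from typing import Tuple


def train_loss_grad_w(w: float, beta: float) -> float:
    """d/dw of the toy training loss (w - beta)^2."""
    return 2.0 * (w - beta)


def val_loss_grad_beta(w: float, beta: float) -> float:
    """d/dbeta of the toy validation loss (w + beta - 3)^2, with w held fixed."""
    return 2.0 * (w + beta - 3.0)


def alternating_search(steps: int = 200, lr_w: float = 0.1,
                       lr_beta: float = 0.1) -> Tuple[float, float]:
    """Alternate one architecture step and one model step per iteration."""
    w, beta = 0.0, 0.0
    for _ in range(steps):
        # Upper level: one step on the architecture parameter (validation loss).
        beta -= lr_beta * val_loss_grad_beta(w, beta)
        # Lower level: one step on the model parameter (training loss).
        w -= lr_w * train_loss_grad_w(w, beta)
    return w, beta


w_star, beta_star = alternating_search()
```

For this toy problem the lower-level solution is `w = beta`, so the upper-level objective becomes `(2 * beta - 3) ** 2` and both variables settle near 1.5, the fixed point of the alternating updates.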

An intuitive choice for the training and validation loss is $\mathcal{L}_{student}$ in Eqn. (4). Although directly learning from $\mathcal{L}_{student}$ might seem to yield appropriate architecture parameters for knowledge distillation, training the architecture parameters with $\mathcal{L}_{student}$ is in fact equivalent to minimizing the distillation loss between the feature aggregation and the student feature map:

$$\arg\min_{\beta} \mathcal{L}_{student} = \arg\min_{\beta} \mathcal{L}_{FD}, \tag{9}$$

as the cross-entropy loss does not depend on the architecture parameters. Since the distillation loss only characterizes the distance between the student feature maps and the combinations of the teacher feature maps, the architecture parameters are inclined to choose teachers that are close to the student. In detail, suppose that after several epochs of training, the student feature map has learnt some knowledge from the data distribution through the cross-entropy loss and from the teacher network through the distillation loss. As the student network is usually shallower than the teacher network, the knowledge in the deep layers is hard to learn, so the architecture parameters prefer teachers in the shallow layers that match the depth and expressivity of the student, rather than deep layers with rich semantics and strong expressivity. Therefore, the feature aggregation learnt from $\mathcal{L}_{student}$ deviates from the original goal of learning a good teacher for knowledge distillation. Besides, once the student network finds a matching layer in the teacher group, i.e., one architecture parameter becomes relatively larger than the others, the student transform learns more about the mapping from the student feature map to the matched teacher feature map. That architecture parameter then grows much faster than its competitors due to the biased training of the transform function, and the corresponding feature map gradually dominates the feature aggregation under the exclusive competition among the architecture parameters. Notice again that the network is more likely to pick shallow layers in the early training stage. Therefore, the student network would unavoidably suffer from performance collapse when using the search results derived from $\mathcal{L}_{student}$.

Figure 3: Differentiable group-wise search of DFA. (a) The differentiable search for group $i$. (b) The teacher-to-student (TS) path. (c) The student-to-teacher (ST) path. The ST and TS connectors are implemented with convolutional layers for matching the channel dimensions between teacher and student.

3.2.3 Bridge Loss for Feature Aggregation Search:

To search for an appropriate feature aggregation for the knowledge distillation, we introduce the bridge loss to connect the teacher and student networks, where the original information flow of the student network is split into two paths.

In the first teacher-to-student (TS) path, illustrated in Fig. 3 (b), DFA takes the feature aggregation of group $i$ in the teacher network, i.e., $\tilde{F}^T_i$, as the input of the $(i+1)$-th group of the student network, and then computes the teacher-to-student (TS) loss with the standard cross entropy. The TS loss in group $i$ can be expressed as:

$$\mathcal{L}^i_{TS} = \mathcal{L}_{CE}\left(y, \; S_{>i}\left(\phi^T\!\left(\tilde{F}^T_i\right)\right)\right), \tag{10}$$

where $y$ denotes the ground-truth class label and $S_{>i}$ denotes the layers of the student network that follow group $i$. Different from Eqn. (3), the teacher transform $\phi^T$ now serves as a TS connector that converts the channel dimension of the teacher feature map to that of the student feature map.

The second student-to-teacher (ST) path has a similar effect to the original student loss in Eqn. (4), i.e., exploring the feature aggregation weights that match the learning ability of the student network. As shown in Fig. 3 (c), the information flow starts from the student input and ends at the $i$-th group of the student network. The student network then produces the last feature map of group $i$ and compares it with the feature aggregation of the teacher by a student-to-teacher (ST) loss computed on the vectorized feature maps, where the vectorization converts a tensor into a column vector. As in the distillation loss in Eqn. (3), the student transform serves as an ST connector that maps the channel number of the student feature map to that of the teacher feature map, as opposed to the TS connector.

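For each group $i$, the bridge loss integrates the ST loss $\mathcal{L}^i_{ST}$ and the TS loss $\mathcal{L}^i_{TS}$ in a single training process, and serves as both the training loss and the validation loss:

$$\mathcal{L}^i_{bridge} = \lambda_{ST} \mathcal{L}^i_{ST} + \lambda_{TS} \mathcal{L}^i_{TS}, \tag{12}$$

where $\lambda_{ST}$ and $\lambda_{TS}$ are balancing hyperparameters.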
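How the two paths combine into one objective can be sketched as below. This is a hedged illustration, not the authors' code: feature maps are flattened lists of floats, the ST and TS connectors are omitted, the ST distance is assumed to be a mean squared error, the `student_tail` callable standing for the remaining student layers is hypothetical, and the default weights of 1.0 are placeholders rather than the values used in the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross entropy between softmax(logits) and a one-hot label."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def st_loss(student_feat: Sequence[float], teacher_agg: Sequence[float]) -> float:
    """Student-to-teacher path: distance between the student feature and the
    teacher aggregation (mean squared error is an assumption of this sketch)."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_agg)) / len(teacher_agg)


def ts_loss(teacher_agg: Vector, student_tail: Callable[[Vector], Vector],
            label: int) -> float:
    """Teacher-to-student path: feed the aggregation into the remaining student
    layers and score the prediction with cross entropy."""
    return cross_entropy(student_tail(teacher_agg), label)


def bridge_loss(student_feat: Vector, teacher_agg: Vector,
                student_tail: Callable[[Vector], Vector], label: int,
                lam_st: float = 1.0, lam_ts: float = 1.0) -> float:
    """Weighted sum of the two paths (the weights here are placeholders)."""
    return (lam_st * st_loss(student_feat, teacher_agg)
            + lam_ts * ts_loss(teacher_agg, student_tail, label))


def identity_tail(feat: Vector) -> Vector:
    """Hypothetical student tail: treats the two feature values as class logits."""
    return list(feat)


loss = bridge_loss([1.0, 0.0], [1.0, 0.0], identity_tail, label=0)
```

In this toy call the student feature already equals the aggregation, so the ST term vanishes and the whole loss comes from the cross entropy of the TS path.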

Different from the student loss in Eqn. (4), both the model parameters and the architecture parameters can be trained towards the ground truth through the bridge loss. For the model parameters, the part of the student network up to the $i$-th group tries to imitate the teacher feature aggregation of that group, while the part of the student network after the $i$-th group learns to optimize the cross entropy given the teacher feature aggregation. Hence, the joint optimization of the ST loss and the TS loss for the model weights can be regarded as an approximation of the cross-entropy loss of the student network. On the other hand, the architecture parameters are trained by directly optimizing the cross-entropy loss given the aggregation of teacher feature maps as input, analogous to the common differentiable architecture search in [24, 35]. The deep layers with rich features are assigned higher weights in the architecture search, as they contribute to the reduction of the validation loss. In this case, DFA achieves a feature aggregation that helps the student learn a wealth of knowledge from the teacher. Besides, we still keep the ST loss in the architecture training as a regularization. Specifically, though the rich features in the deep layers contribute to better performance in the teacher network, they are not always suitable for the student to learn due to the mismatch of expressive power, e.g., it is impractical for a student of three layers to learn from a teacher of ten layers. Introducing the ST loss in the validation loss helps the network select the shallow teachers that match the student expressivity, such that the derived feature provides more knowledge to the student in the early training stage and accelerates the model convergence. In this way, the ST loss and the TS loss act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both the expressivity and the learnability of the feature aggregation. Hence, the derived feature is more likely to achieve better performance in the final knowledge distillation.

After searching for the architecture parameters with the differentiable group-wise search, DFA trains the student network thoroughly with the derived feature aggregation weights by Eqn. (4).

3.3 Time Complexity Analysis

The two-stage design of DFA would not increase the time complexity compared with other feature distillation methods. Let denote the computing time of layer group in the teacher and student network, the differentiable search in the first stage could be calculated by:


As the second stage is a usual feature distillation, the overall time complexity of DFA could be derived as:


which is competitive with other feature distillation methods.

3.4 Implementation Details

The ST connector (student transform) and TS connector (teacher transform) are both implemented as one-layer convolutional networks in order to reconcile the channel dimensions between the student and teacher feature maps. The weights of the ST loss and the TS loss are set to and in Eqn. (12). In both stages of DFA, the pre-ReLU features are extracted from the student and teacher networks for the knowledge distillation, where values no smaller than -1 are preserved in the feature maps to retain knowledge from both positive and negative values while avoiding the exploding gradient problem. The model parameters are initialized by He initialization [9]. The feature aggregation weights in each layer group are initialized such that only the last feature map has weight one and all other feature maps are allocated zero weight.

We follow the same training scheme as DARTS: only the training set is used to update the model parameters, and the validation set is used to update the feature aggregation parameters. For the update of the architecture parameters, DFA adopts Adam [17] as the optimizer with the momentum of , where the learning rate and weight decay rate are both set to 1e-3.

4 Experiments

4.1 CIFAR-100

CIFAR-100 is a commonly used visual recognition dataset for comparing distillation methods. There are 100 classes in CIFAR-100, and each class contains 500 training images and 100 testing images. To carry out architecture search, the original training images are divided into the training set and validation set with the 7:3 ratio.

We compare our method with eight single-teacher distillation methods: KD [13], FitNets [30], AT [38], Jacobian [31], FT [16], AB [12], SP [33] and Margin [11]. All the experiments are performed on Wide Residual Network [39]. The feature aggregation is searched for 40 epochs at each layer group. We adopt the same training schemes as [39] to train the model parameters in all methods. Specifically, the model is trained by SGD optimizer of 5e-4 weight decay and 0.9 momentum for 200 epochs on both training and validation set. The learning rate is set to 0.1 initially and decayed by 0.2 at 60, 120 and 160 epochs. The batch size is 128. We utilize random crop and random horizontal flip as the data augmentation.
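The step-wise learning rate schedule described above can be written as a small helper. This is a sketch under two assumptions that are not stated in the text: "decayed by 0.2" is read as multiplying the learning rate by 0.2 at each milestone, and epochs are counted from zero. The function name `step_lr` is hypothetical.

```python
from typing import Sequence


def step_lr(epoch: int, base_lr: float = 0.1, gamma: float = 0.2,
            milestones: Sequence[int] = (60, 120, 160)) -> float:
    """Learning rate at a given epoch (0-indexed, an assumption)."""
    # Count how many milestones have been reached and decay once per milestone.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rates at a few representative epochs of the 200-epoch schedule.
schedule = {epoch: step_lr(epoch) for epoch in (0, 59, 60, 120, 160, 199)}
```

The same helper covers other step schedules by changing its arguments, for example `step_lr(epoch, base_lr=0.01, gamma=0.1, milestones=(100, 120))`.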

Pair  Teacher   Size   Student   Size
(1)   WRN_28_4  5.87M  WRN_16_4  2.77M
(2)   WRN_28_4  5.87M  WRN_28_2  1.47M
(3)   WRN_28_4  5.87M  WRN_16_2  0.7M
Table 1: The configuration of teacher and student networks in experiments on CIFAR-100.

We explore the performance of our method on several teacher-student pairs, which vary in depth (number of layers), width (number of channels), or both, as shown in Table 1. We conduct the experiments on CIFAR-100 over the above teacher-student pairs, and report the results in Table 2. For the teacher-student pair (1), where WRN_16_4 differs from the teacher WRN_28_4 in depth, DFA has a 1.34% improvement over the output distillation method KD, and also outperforms the other state-of-the-art feature distillation methods. DFA even exhibits a better performance than the teacher network. For the teacher-student pair (2), where WRN_28_2 differs from the teacher in width, DFA surpasses other feature distillation methods by 0.34%-2.64%. DFA also achieves state-of-the-art results on (3), where the student network compresses both the width and the depth of the teacher network. The above experiments verify the effectiveness and robustness of DFA in various scenarios.

Pair  Teacher  Student  KD     FitNets  AT     Jacobian  FT     AB     SP     Margin  DFA
(1)   79.17    77.24    78.4   78.17    77.49  77.84     78.26  78.65  78.7   79.11   79.74
(2)   79.17    75.78    76.61  76.14    75.54  76.24     76.51  76.81  77.41  77.84   78.18
(3)   79.17    73.42    73.35  73.65    73.3   73.28     74.17  73.9   74.09  75.51   75.85
Table 2: The experiment results on CIFAR-100. We compare the proposed DFA with eight distillation methods. DFA outperforms other state-of-the-art methods.

4.2 CINIC-10

CINIC-10 is a large classification dataset containing 270000 images over 10 object categories, which are equally split into training, validation and test sets. The images are collected from CIFAR-10 and ImageNet. Compared with the CIFAR datasets, CINIC-10 could present a more principled perspective of generalisation performance. We explore the performance of DFA and three other state-of-the-art feature distillation methods on CINIC-10. All the experiments are performed on ShuffleNetV2 [25], an efficient network architecture for mobile platforms. We use several variants of ShuffleNetV2 in the experiments, and the basic configuration is shown in Table 3, conforming to [33]. In the model training process, the SGD optimizer is used with a weight decay of 5e-4 and a momentum of 0.9. All models are trained for 140 epochs. The learning rate is set to 0.01 initially and decayed by 0.1 at epochs 100 and 120. The batch size is 96. As on CIFAR-100, we utilize random crop and random horizontal flip as the data augmentation. We search for 5 epochs for each layer group in DFA. The experiment results are shown in Table 4. DFA outperforms the vanilla cross-entropy training as well as the state-of-the-art feature distillation methods.

Group Block Output Size
1 Conv-BN-ReLU 3 24 1
2 ShuffleNetV2 block 3 4
3 ShuffleNetV2 block 3 8
4 ShuffleNetV2 block 3 4
5 ShuffleNetV2 block 3 1
6 AvgPool-FC 1 10 1
Table 3: The configuration of ShuffleNetV2 in the CINIC-10 experiments. We leverage standard ShuffleNetV2 blocks in each layer group, and the table specifies the kernel size, the number of channels and the number of blocks in each layer group. In the end, we add an average pooling layer and a fully connected layer to make the final predictions.

Pair  Teacher size  Teacher acc  Student size  Student acc  AT     SP     Margin  DFA    DFA-T
(1)   5.37M         86.14        1.27M         83.28        84.71  85.32  85.29   85.38  85.41
(2)   5.37M         86.14        0.36M         77.34        79.06  79.15  78.79   79.51  79.45
(3)   1.27M         83.28        0.36M         77.34        78.42  79.02  79.69   79.97  79.38
Table 4: Experimental results on CINIC-10. “DFA-T” represents the version that searches the feature aggregation weights on CIFAR-100 and implements the feature distillation on CINIC-10.

4.2.1 Ablation Study on Differentiable Search:

We study the robustness of DFA by searching the feature aggregation weights on a small dataset and then distilling on a larger dataset. Specifically, we build a variant of DFA, named DFA-T, by searching the feature aggregation weights on CIFAR-100 and distilling on CINIC-10. As shown in Table 4, DFA-T is close to DFA: it is 0.03% higher on pair (1), and 0.06% and 0.59% lower on pairs (2) and (3), respectively. DFA-T still outperforms the other feature distillation methods on pairs (1) and (2).

4.3 The Effectiveness of Differentiable Search

4.3.1 Comparisons with Other Search Methods:

We perform additional experiments on CIFAR-100 to verify the effectiveness of the differentiable feature aggregation search. We compare DFA with the following methods: “Random” selects all feature aggregation weights randomly; “Average” assigns the same feature aggregation weight to all feature maps in a layer group; “Last” uses only the last feature map in each layer group for the feature distillation, which is the setting widely used in knowledge distillation. The results are shown in Table 5.

Method   WRN_16_2  WRN_16_4  WRN_28_2
Student  73.42     77.24     75.78
Random   74.25     78.92     77.07
Last     75.51     79.11     77.86
Average  74.11     78.81     76.99
DFA      75.85     79.74     78.18
Table 5: Experimental results of DFA and other search methods on CIFAR-100.
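The three hand-crafted baselines compared in Table 5 differ only in how the aggregation weights of a layer group are chosen. The sketch below illustrates one way to generate them; the function names are hypothetical, and normalizing the random weights so that they sum to one is an assumption, since the text only states that the weights are randomly selected.

```python
import random
from typing import List


def last_weights(n: int) -> List[float]:
    """“Last”: only the last feature map of the group is used."""
    return [0.0] * (n - 1) + [1.0]


def average_weights(n: int) -> List[float]:
    """“Average”: every feature map in the group gets the same weight."""
    return [1.0 / n] * n


def random_weights(n: int, rng: random.Random) -> List[float]:
    """“Random”: random non-negative weights, normalized to sum to one (assumption)."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


# Example for a layer group with four teacher layers.
rng = random.Random(0)
baselines = {
    "Last": last_weights(4),
    "Average": average_weights(4),
    "Random": random_weights(4, rng),
}
```

Each scheme returns one weight per teacher layer in the group, so any of them can replace the searched weights when forming the weighted sum of the teacher feature maps.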

“Random” and “Average” weights both perform worse than “Last” on all three students, indicating the necessity of a well-designed feature selection strategy. Different from “Last”, we observe that DFA allocates positive weights to the shallow layers in the shallow groups to retrieve knowledge rapidly from the teacher network. The weight assignment in DFA suggests that feature aggregation contributes to transferring knowledge from the teacher network to the student network, while the differentiable group-wise search helps find better feature aggregation weights. Hence, DFA improves over all three hand-crafted schemes on the knowledge distillation task, by 0.32% to 0.63% over “Last”.

4.3.2 Result Analysis:

In Fig. 4 Left, we display the feature aggregation weights in the configuration (1) of the CIFAR-100 experiment. The feature aggregation weights are initialized with the “Last” scheme and then updated by the search. The domination of the last feature map in each layer group is weakened as the training continues.

4.3.3 Sensitivity Analysis:

We study the impact of regularization on the feature aggregation search mentioned in Sec. 3.2. The right figure in Fig. 4 displays the accuracy of the student models with ranging from (no regularization) to . Experimental results show that DFA is robust to the regularization in range , except a slight decrease without regularization.

Figure 4: Left: heatmap of feature aggregation weights in different layers of WRN_16_4. Right: accuracy of the student network on CINIC-10 under varying regularization strength, for the three teacher-student pairs used in the CINIC-10 experiments.

5 Conclusion

In this paper, we propose DFA, a two-stage feature distillation method based on differentiable aggregation search. In the first stage, DFA leverages differentiable architecture search to find appropriate feature aggregation weights. It introduces a bridge loss to connect the teacher and the student: a teacher-to-student loss searches for a teacher with rich features and a wealth of knowledge, while a student-to-teacher loss finds aggregation weights that match the learning ability of the student network. In the second stage, DFA performs standard feature distillation with the derived feature aggregation weights. Experiments show that DFA outperforms several state-of-the-art methods on the CIFAR-100 and large-scale CINIC-10 datasets, verifying both the effectiveness and the robustness of the design. In-depth analysis also shows that DFA allocates feature aggregation weights sensibly for the knowledge distillation task.


This work is partially supported by National Key Research and Development Program No. 2017YFB0803302, Beijing Academy of Artificial Intelligence (BAAI), and NSFC 61632017.


  • [1] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 550–559. Cited by: §2.
  • [2] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In KDD, pp. 535–541. Cited by: §2.
  • [3] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu (2018) Path-level network transformation for efficient architecture search. In

    International Conference on Machine Learning

    pp. 678–687. Cited by: §2.
  • [4] H. Cai, L. Zhu, and S. Han (2019) Proxylessnas: direct neural architecture search on target task and hardware. In ICLR, Cited by: §2.
  • [5] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In ICCV, Cited by: §2.
  • [6] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey (2018) CINIC-10 is not imagenet or cifar-10. arXiv preprint arXiv:1810.03505. Cited by: §1.
  • [7] X. Dong and Y. Yang (2019) Network pruning via transformable architecture search. In Advances in Neural Information Processing Systems, pp. 759–770. Cited by: §1.
  • [8] X. Dong and Y. Yang (2019) One-shot neural architecture search via self-evaluated template network. In ICCV, pp. 3681–3690. Cited by: §2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In

    Proceedings of the IEEE international conference on computer vision

    pp. 1026–1034. Cited by: §3.4.
  • [10] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi (2019-10) A comprehensive overhaul of feature distillation. In ICCV, Cited by: §3.1.
  • [11] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi (2019) A comprehensive overhaul of feature distillation. arXiv preprint arXiv:1904.01866. Cited by: §3.1, §4.1.
  • [12] B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019)

    Knowledge transfer via distillation of activation boundaries formed by hidden neurons

    In AAAI, Vol. 33, pp. 3779–3787. Cited by: §4.1.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §1, §2, §4.1.
  • [14] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115. Cited by: §1.
  • [15] M. Kang, J. Mun, and B. Han (2019) Towards oracle knowledge distillation with neural architecture search. arXiv preprint arXiv:1911.13019. Cited by: §2.
  • [16] J. Kim, S. Park, and N. Kwak (2018) Paraphrasing complex network: network compression via factor transfer. In Advances in Neural Information Processing Systems, pp. 2760–2769. Cited by: §2, §4.1.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • [18] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • [19] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin (2018) Extremely low bit neural network: squeeze the last bit out with admm. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [20] C. Li, J. Peng, L. Yuan, G. Wang, X. Liang, L. Lin, and X. Chang (2019) Blockwisely supervised neural architecture search with knowledge distillation. arXiv preprint arXiv:1911.13053. Cited by: §2.
  • [21] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1.
  • [22] W. Li, S. Gong, and X. Zhu (2020) Neural graph embedding for neural architecture search. In AAAI, Cited by: §2.
  • [23] X. Lin, C. Zhao, and W. Pan (2017)

    Towards accurate binary convolutional neural network

    In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §1.
  • [24] H. Liu, K. Simonyan, and Y. Yang (2019) Darts: differentiable architecture search. In ICLR, Cited by: §1, §2, §3.2.1, §3.2.3, §3.2.
  • [25] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §4.2.
  • [26] N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik (2019) Xnas: neural architecture search with expert advice. In Advances in Neural Information Processing Systems, pp. 1975–1985. Cited by: §2.
  • [27] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101. Cited by: §2.
  • [28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019)

    Regularized evolution for image classifier architecture search

    In AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.
  • [29] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In ICML, pp. 2902–2911. Cited by: §2.
  • [30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §1, §1, §2, §4.1.
  • [31] S. Srinivas and F. Fleuret (2018) Knowledge transfer with jacobian matching. arXiv preprint arXiv:1803.00443. Cited by: §4.1.
  • [32] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In CVPR, pp. 2820–2828. Cited by: §2.
  • [33] F. Tung and G. Mori (2019) Similarity-preserving knowledge distillation. In ICCV, pp. 1365–1374. Cited by: §1, §1, §3.1, §4.1, §4.2.
  • [34] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §1.
  • [35] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In ICLR, Cited by: §2, §3.2.3.
  • [36] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2020) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. In ICLR, Cited by: §2.
  • [37] S. You, C. Xu, C. Xu, and D. Tao (2017) Learning from multiple teacher networks. In KDD, pp. 1285–1294. Cited by: §1, §1, §2.
  • [38] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §1, §1, §2, §3.1, §4.1.
  • [39] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.1.
  • [40] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter (2020) Understanding and robustifying differentiable architecture search. In ICLR, Cited by: §2.
  • [41] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §2.
  • [42] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710. Cited by: §2, §3.2.