Ensemble Knowledge Distillation for Learning Improved and Efficient Networks

09/17/2019 · by Umar Asif, et al.

Ensemble models comprising deep Convolutional Neural Networks (CNN) have shown significant improvements in model generalization, but at the cost of large computation and memory requirements. In this paper, we present a framework for learning compact CNN models with improved classification performance and model generalization. For this, we propose a CNN architecture of a compact student model with parallel branches which are trained using ground truth labels and information from high capacity teacher networks in an ensemble learning fashion. Our framework provides two main benefits: i) Distilling knowledge from different teachers into the student network promotes heterogeneity in feature learning at different branches of the student network and enables the network to learn diverse solutions to the target problem. ii) Coupling the branches of the student network through ensembling encourages collaboration and improves the quality of the final predictions by reducing variance in the network outputs. Experiments on the CIFAR-10 and CIFAR-100 datasets show that our Ensemble Knowledge Distillation (EKD) improves classification accuracy and model generalization, especially in situations with limited training data. Experiments also show that our EKD-based compact networks outperform state-of-the-art knowledge distillation based methods in terms of mean accuracy on the test datasets.


1 Introduction

Ensemble methods have shown considerable improvements in model generalization and have produced state-of-the-art results in several machine learning competitions (e.g., Kaggle) [Chen et al.2017]. These ensemble methods typically contain multiple deep Convolutional Neural Networks (CNN) as sub-networks which are pre-trained on large-scale datasets to extract discriminative features from the input data. The size of an ensemble is not constrained by training because the sub-networks can be trained independently and their outputs can be computed in parallel. However, in many applications the available training data is too limited to effectively train deep CNN models, in contrast to small or compact networks. For instance, in healthcare applications the amount of available data is constrained by the number of patients. Therefore, improving the generalization capability of compact networks without requiring large-scale annotated datasets is of utmost importance. Furthermore, today's high performing deep CNN based ensemble models have giga-FLOP compute and gigabyte storage requirements [Huang et al.2018], making them prohibitive in resource constrained systems (e.g., mobile- or edge-devices) which have stringent requirements on memory, latency, and computational power.
To overcome these challenges, model compression techniques such as parameter pruning [Yu et al.2018] are a common way to reduce model size with trade-offs between accuracy and efficiency. Other techniques include hand-crafting efficient CNN architectures such as SqueezeNets [Iandola et al.2016], MobileNets [Howard et al.2017], and ShuffleNets [Zhang et al.2018a]. Recently, neural architecture search has proven an effective way to generate efficient CNN architectures [Tan et al.2019, Cai et al.2018] by extensively tuning parameters such as network width, depth, and filter types and sizes. These models show better efficiency than hand-crafted networks, but at an extremely large tuning cost. Another stream of work on building efficient networks for resource constrained scenarios is knowledge distillation [Hinton et al.2015]. It enables small, low-memory-footprint networks to mimic the behavior of large complex networks by training the small networks using the predictions of the large networks as soft labels in addition to the ground truth hard labels.
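As a concrete illustration of this idea (a minimal sketch, not any particular paper's implementation), soft-label distillation can be written in PyTorch as a weighted sum of hard-label cross-entropy and a temperature-softened KL term; the temperature and weighting values below are illustrative defaults, not tuned settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-label distillation in the spirit of Hinton et al. (2015):
    hard-label cross-entropy plus KL divergence to the temperature-softened teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    return alpha * soft + (1.0 - alpha) * hard
```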
In this paper we also explore Knowledge Distillation (KD) based strategies to improve model generalization and classification performance for applications with memory and compute restrictions. For this, we present a CNN architecture with parallel branches which distills high-level features from a teacher ensemble during training and maintains a low computational overhead during inference. Our architecture provides two main benefits: i) It combines the student network with different teacher networks and distills diverse feature representations into the student network during training. This promotes heterogeneity in feature learning and enables the student network to mimic diverse high-level feature spaces produced by the teacher networks. ii) It combines the distilled information through parallel branches in an ensembling manner. This reduces variance in the branch-level outputs and improves the quality of the final predictions of the student network. In summary, the main contributions of this paper are as follows:

  1. We present an Ensemble Knowledge Distillation (EKD) framework which improves classification performance and model generalization of small and compact networks by distilling knowledge from multiple teacher networks into a compact student network in an ensemble learning manner.

  2. We present a specialized training objective function to distill ensemble knowledge into a single student network. Our objective function optimizes the parameters of the student network with a goal of learning mappings between input data and ground truth labels, and a goal of minimizing the difference between high level features of the teacher networks and the student network.

  3. We perform an ablation study of our framework on the CIFAR-10 and CIFAR-100 datasets in terms of different CNN architectures, varying ensemble sizes, and limited training data scenarios. Experiments show that, by encouraging heterogeneity in feature learning through the proposed ensemble distillation, our EKD-based compact networks produce superior accuracy compared to conventional ensembles without knowledge distillation and to networks trained through other KD-based methods.

2 Related Work

In this section, we discuss related work on model compression and knowledge distillation.

2.1 Model Compression

Network pruning is a popular approach to reduce a heavy network into a light-weight form by removing redundancy. In this approach, a complex over-parameterized network is first trained, then pruned based on certain criteria, and finally fine-tuned to achieve comparable performance with fewer parameters. In this context, methods such as [Yu et al.2018] compress large networks by removing connections based on weight magnitudes or importance scores. Other methods use quantization of the weights to 8 bits [Han et al.2015], filter pruning [Li et al.2016], and channel pruning [Luo et al.2017] to reduce network sizes. However, the trimmed models are generally sub-graphs of the original networks, and there is little flexibility to change the original architecture design.

2.2 Knowledge Distillation

Knowledge Distillation (KD) aims at learning a light-weight student network that can mimic the behavior of a complicated teacher network. In this context, the work of [Ba and Caruana2014] was the first to introduce knowledge distillation, minimizing the L2 distance between the features from the last layers of two networks. Later, the work of [Hinton et al.2015] showed that the predicted class probabilities from the teacher are informative for the student and can be used as a supervision signal in addition to the regular labeled training data. Romero et al. [Romero et al.2015] bridged the intermediate layers of the student and teacher networks in addition to using the class probabilities, and applied an L2 loss to supervise the student network. The method of [Czarnecki et al.2017] minimized the difference between the teacher and student derivatives of the loss, combined with the divergence from the teacher predictions. Other methods have explored knowledge distillation using activation maps [Heo et al.2019], attention maps [Zagoruyko and Komodakis2016], Jacobians [Srinivas and Fleuret2018], and unsupervised feature factors [Kim et al.2018].
Ensembling is a promising technique to improve model generalization over the performance of individual models. Since different CNN architectures can produce diverse error distributions due to the presence of several local minima, combining the outputs of individually trained networks leads to improved performance and better generalization to unseen test data. In light of these studies, methods such as [Urban et al.2016, Furlanello et al.2018] combined ensemble learning and knowledge distillation to improve model generalization. For instance, the method of [Urban et al.2016] trained an ensemble of 16 CNN models and compressed the learned function into shallow multi-layer perceptrons containing 5 layers. The work of [Furlanello et al.2018] presented an iterative technique which transforms the student model into the teacher model at each iteration; at the end of the iterations, all the student outputs are combined to form an ensemble. Our work also couples ensemble learning with knowledge distillation; however, compared to [Furlanello et al.2018], we train a compact student network through knowledge distillation in a single iteration. Furthermore, our ensemble architecture distills knowledge from different teacher networks into the student network. This increases heterogeneity in student feature learning and enables the student network to mimic diverse feature representations produced by different teacher networks. Consequently, our EKD-based compact networks demonstrate better generalization capability compared to iterative KD methods [Furlanello et al.2018] or conventional KD methods [Hinton et al.2015].

3 The Proposed Framework

Figure 1: Overview of our framework which consists of a Compact Network (CompNet)-A and a teacher ensemble network (TeachNet)-B. CompNet is composed of parallel branches with similar architecture topology. During training, the branches are coupled with the sub-networks of the teacher ensemble and the parameters of the CompNet model are optimized with respect to the ground truth labels as well as the high-level features produced by the teacher ensemble. During testing, the branches of CompNet are executed in parallel (to increase inference speed) and their outputs are summed before Softmax to produce final predictions.

Fig. 1 shows the overall architecture of our Ensemble Knowledge Distillation (EKD) framework. It consists of two main modules: i) a compact student network (CompNet), which is composed of multiple branches connected in parallel (Fig. 1-A). The branches follow a common architecture consisting of convolutional and pooling layers. ii) A Teacher Ensemble Network (TeachNet), which is composed of CNN models with different architectures or layer configurations (Fig. 1-B). In the following, we describe the individual modules of the proposed framework in detail.

3.1 The Proposed Compact Network (CompNet)

Our compact network is composed of multiple branches connected in parallel. The branches follow a common architecture where each branch is composed of multiple convolutional layers interconnected through fuse connections. The branch outputs are fed into linear layers to produce probabilistic distributions of the input data with respect to the target classes. Our branch architecture is composed of Dense Blocks which contain multiple bottleneck convolutions interconnected through dense connections [Huang et al.2017]. Specifically, the branch architecture starts with a convolutional layer followed by Batch Normalization (BN), a Rectified Linear Unit (ReLU), and an average pooling operation. Next, there are three dense blocks, where each dense block consists of a number of layers, termed Dense Layers, which share information from all the preceding layers connected to the current layer through fuse connections. Fig. 1-C shows our branch structure and its dense-layer configuration. Each dense layer consists of bottleneck convolutions followed by Batch Normalization (BN), a Rectified Linear Unit (ReLU), and a dropout block as shown in Fig. 1. The output of the ℓ-th dense layer in a dense block can be written as:

x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ−1}]),    (1)

where [x_0, x_1, …, x_{ℓ−1}] represents the concatenation of the features produced by the layers 0, …, ℓ−1, and H_ℓ denotes the operations of the ℓ-th dense layer. Each branch ends with global average pooling and produces a feature vector f which is then fed to a linear layer to produce a probabilistic distribution P with respect to the target classes. Mathematically, the output of a linear layer can be written as:

P = W f + b,    (2)

where W and b represent the weight and bias matrices, respectively. Finally, the outputs of the linear layers are summed to produce a combined feature representation as shown in Fig. 1-A. It is given by:

F_S = Σ_i P_S^i,    (3)

where P_S^i denotes the output of the linear layer of the i-th branch and the sum runs over all branches of the student network.
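A minimal PyTorch-style sketch of this branch-and-sum structure is given below. The names make_branch, feat_dim, and the default branch count are placeholders standing in for the dense-block branch described above, not the exact implementation.

```python
import torch
import torch.nn as nn

class CompNet(nn.Module):
    """Compact student: parallel branches whose linear-layer outputs are summed (Eq. 3)."""

    def __init__(self, make_branch, feat_dim, num_classes, num_branches=6):
        super().__init__()
        # Each branch is an identical feature extractor ending in global average pooling.
        self.branches = nn.ModuleList([make_branch() for _ in range(num_branches)])
        self.heads = nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(num_branches)])

    def forward(self, x):
        # Per-branch logits (Eq. 2): a linear layer applied to each branch's pooled features.
        branch_logits = [head(branch(x)) for branch, head in zip(self.branches, self.heads)]
        # Eq. 3: sum the branch outputs to form the fused prediction.
        fused = torch.stack(branch_logits, dim=0).sum(dim=0)
        return fused, branch_logits
```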

3.2 Proposed Teacher Ensemble Network (TeachNet)

Our teacher ensemble is composed of multiple CNN models which act as independent classifiers. The teacher sub-networks should use different architectures or layer configurations in order to produce diverse feature representations at their final convolutional layers. Similar to our CompNet architecture, the teacher outputs are first fed into linear layers to produce probabilistic distributions of the input data with respect to the target classes, and are finally summed to produce a combined feature representation as shown in Fig. 1-B. It is given by:

F_T = Σ_m P_T^m,    (4)

where P_T^m denotes the output of the linear layer attached to the m-th teacher sub-network and the sum runs over all sub-networks of the teacher ensemble.
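Analogously, the teacher ensemble can be sketched as a thin wrapper around a list of heterogeneous backbones. The assumption in this sketch is that each backbone returns a pooled feature vector and receives its own linear head; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class TeachNet(nn.Module):
    """Teacher ensemble: heterogeneous backbones, each with its own linear head; outputs are summed (Eq. 4)."""

    def __init__(self, backbones, feat_dims, num_classes):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)  # e.g. CIFAR ResNets of different depths
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in feat_dims])

    def forward(self, x):
        # Per-teacher logits from each sub-network's pooled features.
        teacher_logits = [head(net(x)) for net, head in zip(self.backbones, self.heads)]
        # Eq. 4: sum the sub-network outputs to form the combined teacher prediction.
        fused = torch.stack(teacher_logits, dim=0).sum(dim=0)
        return fused, teacher_logits
```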

In the following we describe our specialized training objective function which optimizes the parameters of our CompNet using ground truth labels as well as high-level feature representations from the teacher ensemble.

3.3 Proposed Ensemble Knowledge Distillation (EKD)

Consider a training dataset of images x and labels y, where each sample belongs to one of C target classes. To learn the mapping x → y, we train our CompNet model parameterized by θ, where the learned parameters θ* are obtained by minimizing a training objective function L:

θ* = argmin_θ L(x, y; θ).    (5)

Our training function L is a weighted combination of three loss terms: CrossEntropy loss terms applied to the outputs of the teacher ensemble and of the CompNet model with respect to the ground truth labels y, and a distillation loss term which matches the outputs of the sub-networks of the teacher ensemble to the outputs of the branches of the CompNet model. Mathematically, L can be written as:

L = α · L_CE(z_T, y) + β · L_CE(z_S, y) + γ · L_KD,    (6)

where z_T and z_S represent the logits (the inputs to the SoftMax) of the teacher ensemble and the CompNet model, respectively. The terms α, β, and γ are hyper-parameters which balance the individual loss terms. Mathematically, the CrossEntropy loss can be written as:

L_CE(z, y) = − Σ_{c=1..C} 1[y = c] · log σ(z)_c,    (7)

where 1[·] is the indicator function and σ is the SoftMax operation. It is given by:

σ(z)_c = exp(z_c) / Σ_{j=1..C} exp(z_j).    (8)

Our KD-based loss L_KD in Eq. 6 is composed of two terms: a Kullback-Leibler (KL) divergence loss term L_KL and a Mean-Squared-Error loss term L_MSE. Mathematically, L_KD can be written as:

L_KD = Σ_m [ L_KL(σ(z_T^m / τ), σ(z_S^m / τ)) + L_MSE(z_T^m, z_S^m) ],    (9)

where m indexes the paired sub-networks of the teacher ensemble and branches of the CompNet model. The term τ in Eq. 9 is a temperature hyper-parameter which controls the softening of the outputs of the teacher sub-networks. A higher value of τ produces a softer probability distribution over the target classes. The KL divergence loss is given by:

L_KL(p ‖ q) = Σ_{c=1..C} p_c · log(p_c / q_c).    (10)
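To make the objective concrete, here is a PyTorch-style sketch of Eqs. 6-10 as reconstructed above. The one-to-one pairing of teacher sub-networks with student branches, the detachment of teacher logits inside the distillation terms, and the default weights are assumptions consistent with the description in this section, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def ekd_loss(student_fused, student_branch_logits,
             teacher_fused, teacher_logits,
             labels, alpha=1.0, beta=1.0, gamma=1.0, T=4.0):
    """Ensemble distillation objective (Eqs. 6-10): CE on the fused teacher and student
    outputs plus per-pair KL and MSE terms between teacher sub-networks and student branches."""
    ce_teacher = F.cross_entropy(teacher_fused, labels)   # Eq. 7 applied to the TeachNet output
    ce_student = F.cross_entropy(student_fused, labels)   # Eq. 7 applied to the CompNet output

    kd = student_fused.new_zeros(())
    for z_t, z_s in zip(teacher_logits, student_branch_logits):  # m indexes teacher/branch pairs
        kl = F.kl_div(F.log_softmax(z_s / T, dim=1),
                      F.softmax(z_t.detach() / T, dim=1),        # Eq. 10 with temperature softening
                      reduction="batchmean") * (T * T)
        mse = F.mse_loss(z_s, z_t.detach())                      # match branch outputs to teacher outputs
        kd = kd + kl + mse                                       # Eq. 9

    return alpha * ce_teacher + beta * ce_student + gamma * kd   # Eq. 6 (weights are placeholders)
```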

4 Experimental Setup

In this section we describe the details of our experiments.

4.1 Network Architectures

We evaluated our ensemble knowledge distillation using two architectures for the student network. The first one is a CNN with dense connections [Huang et al.2017], where we considered the number of dense layers of the model as a proxy for size or capacity of the model. Specifically, we considered a medium capacity CNN with 6 dense layers (DenseNet6), and a large capacity CNN with 12 dense layers (DenseNet12). The other student architecture we considered for our evaluations is based on the ResNet8 structure with skip connections as used in other knowledge distillation based studies [Heo et al.2018, Mirzadeh et al.2019]. For the teacher ensemble, we considered up to 7 sub-networks based on ResNet14, ResNet20, ResNet26, ResNet32, ResNet44, ResNet56, and ResNet110 architectures as used in [Heo et al.2018, Mirzadeh et al.2019].

4.2 Training and Implementation

For training, we initialized the weights of the convolutional and fully connected layers from zero-mean Gaussian distributions. The standard deviations were set to 0.01, the biases were set to 0, and a parameter decay of 0.0005 was applied to the weights and biases. The teacher ensemble was first trained independently from scratch, and then fine-tuned simultaneously and collaboratively with the student network. The distillation from the teacher sub-networks to the student network was performed throughout the whole training process by optimizing the training objective function in Eq. 6. Specifically, we trained the networks for 1000 epochs starting with a learning rate of 0.01, which was divided by 10 at 50% and 75% of the total number of epochs. Our implementation is based on the auto-gradient computation framework of the Torch library [Paszke et al.2017]. Training was performed with the ADAM optimizer and a batch size of 128 on two NVIDIA V100 GPUs. For hyper-parameter optimization, we used the toolkit of [Bergstra et al.2011] to tune the loss weighting parameters α, β, and γ, and the temperature parameter τ.
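A simplified training loop consistent with these settings might look like the following; it reuses the ekd_loss sketch from Section 3.3 and the hypothetical CompNet/TeachNet modules. The epoch count, learning rate, weight decay, and milestone schedule follow the values stated above, while everything else (e.g., jointly optimizing teacher and student parameters) is an assumption.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

def train_ekd(student, teacher, train_loader, device, epochs=1000, lr=0.01):
    # Teacher is assumed pre-trained and fine-tuned collaboratively with the student.
    params = list(student.parameters()) + list(teacher.parameters())
    optimizer = Adam(params, lr=lr, weight_decay=5e-4)
    # Divide the learning rate by 10 at 50% and 75% of the total epochs.
    scheduler = MultiStepLR(optimizer, milestones=[epochs // 2, int(0.75 * epochs)], gamma=0.1)

    for epoch in range(epochs):
        student.train()
        teacher.train()
        for images, labels in train_loader:  # batch size (128) is set in the DataLoader
            images, labels = images.to(device), labels.to(device)
            s_fused, s_branches = student(images)
            t_fused, t_subnets = teacher(images)
            loss = ekd_loss(s_fused, s_branches, t_fused, t_subnets, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```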

4.3 Datasets

We evaluated our framework on two well-established image classification datasets, CIFAR-10 and CIFAR-100. Both datasets consist of 60,000 RGB images of size 32×32 pixels, distributed into 10 and 100 classes, respectively. In each dataset, the training set contains 50,000 images and the test set contains 10,000 images.
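For reference, both benchmarks can be loaded with torchvision as sketched below; the normalization statistics and augmentation shown are common CIFAR defaults, not necessarily the exact preprocessing used in this paper.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

def cifar_loaders(root="./data", cifar100=False, batch_size=128):
    # Commonly used CIFAR channel statistics (assumed, not taken from the paper).
    mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
    train_tf = T.Compose([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip(),
                          T.ToTensor(), T.Normalize(mean, std)])
    test_tf = T.Compose([T.ToTensor(), T.Normalize(mean, std)])
    ds = torchvision.datasets.CIFAR100 if cifar100 else torchvision.datasets.CIFAR10
    train = ds(root, train=True, download=True, transform=train_tf)
    test = ds(root, train=False, download=True, transform=test_tf)
    return (DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=4),
            DataLoader(test, batch_size=batch_size, shuffle=False, num_workers=4))
```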

5 Results and Analysis

Architecture                            EKD   CIFAR-10   CIFAR-100
ResNet8 (R8)                            No    86.68      58.03
ResNet8 (R8)                            Yes   90.48      62.29
DenseNet6 (D6)                          No    90.64      62.64
DenseNet6 (D6)                          Yes   92.50      67.39
DenseNet12 (D12)                        No    92.31      70.30
DenseNet12 (D12)                        Yes   94.27      74.04
Ensemble (R8+D6+D12) [Dutt et al.2019]  No    92.79      70.73
Table 1: Comparison of our student networks with and without the proposed Ensemble Distillation (EKD) for different CNN architectures on the CIFAR-10 and CIFAR-100 datasets.

5.1 Ensemble Distillation Improves Model Performance

Here, we evaluate our compact student networks with and without the proposed Ensemble Knowledge Distillation (EKD) on the CIFAR-10 and CIFAR-100 datasets. Table 1 shows the results of these experiments for different CNN architectures. From Table 1 we see that EKD-based networks improve accuracy for all the tested CNN architectures on both the CIFAR-10 and CIFAR-100 datasets. For instance, EKD improves accuracy by around 4%, 5%, and 4% for the ResNet8 (R8), DenseNet6 (D6), and DenseNet12 (D12) architectures on the CIFAR-100 dataset, respectively. Table 1 also shows that our EKD improves accuracy by around 2% and 4% on the CIFAR-10 and CIFAR-100 datasets, respectively, compared to the ensembling method of [Dutt et al.2019], which combines heterogeneous classifiers without knowledge distillation.

5.2 Ensemble Learning Improves Knowledge Distillation

Model  ES  Teachers  CIFAR-10  CIFAR-100  Params (M)  FLOPS (M)
R8     1   T1        88.42     58.56      0.078        12.89
R8     2   T1-T2     89.88     60.49      0.156        25.78
R8     3   T1-T3     90.14     61.69      0.234        38.67
R8     4   T1-T4     90.25     61.98      0.312        51.56
R8     5   T1-T5     90.25     62.28      0.390        64.45
R8     6   T1-T6     90.33     62.52      0.468        77.35
R8     7   T1-T7     90.48     62.29      0.546        90.24
D6     1   T1        90.90     63.09      0.073        33.10
D6     2   T1-T2     91.67     65.33      0.147        66.20
D6     3   T1-T3     92.26     66.12      0.221        99.30
D6     4   T1-T4     92.43     66.28      0.294       132.40
D6     5   T1-T5     92.65     67.15      0.368       165.50
D6     6   T1-T6     92.50     67.09      0.442       198.60
D6     7   T1-T7     92.47     67.39      0.516       231.71
D12    1   T1        92.09     69.31      0.184        76.96
D12    2   T1-T2     93.13     71.27      0.368       153.92
D12    3   T1-T3     93.68     72.60      0.552       230.88
D12    4   T1-T4     93.77     73.18      0.737       307.84
D12    5   T1-T5     94.12     73.80      0.921       384.80
D12    6   T1-T6     94.27     74.04      1.105       461.76
D12    7   T1-T7     94.14     73.83      1.290       538.72
Table 2: Ablation study of our ensemble distillation based student network in terms of the Ensemble Size (ES), number of training parameters, number of FLOPS, and mean test accuracy on CIFAR-10 and CIFAR-100 datasets for different CNN architectures.
Model           CIFAR-10  CIFAR-100  Params (M)  FLOPS (M)  Inf. time (ms)
T1: ResNet14    90.52     65.88      0.180        27.23      2
T2: ResNet20    91.62     66.89      0.272        41.61      5
T3: ResNet26    92.01     67.92      0.376        56.00      5
T4: ResNet32    92.25     68.42      0.470        70.38      6
T5: ResNet44    93.01     70.32      0.661        99.15      6
T6: ResNet56    93.28     69.43      0.860       127.92      9
T7: ResNet110   88.41     61.42      1.730       257.39     16
TeachNet        94.43     75.47      4.532       679.68     52
CompNet-D12     94.27     74.04      1.100       461.76     47
Table 3: Ablation study of the proposed teacher networks and the EKD based student network in terms of mean test accuracy, number of training parameters, number of FLOPS, and inference speed on the CIFAR-10 and CIFAR-100 datasets.

Here, we evaluate the performance of our EKD-based student networks by varying the size of the ensemble to explore the benefits of ensembling for knowledge distillation. Table 2 shows the results of these experiments on the CIFAR-10 and CIFAR-100 datasets. The results show that accuracy generally increases with the ensemble size (ES) for all the tested CNN architectures. For instance, a student network with 6 branches improves accuracy by around 2% and 4% compared to a 1-branch network on the CIFAR-10 and CIFAR-100 datasets, respectively. Table 3 shows a comparison between the student network and the teacher networks on the CIFAR-10 and CIFAR-100 datasets. The results show that our EKD-based student network (CompNet-D12) produced competitive accuracy on the test datasets with fewer training parameters, fewer FLOPS, and faster inference compared to the teacher ensemble (TeachNet).

Method                           Dataset    Accuracy
[Hinton et al.2015]              CIFAR-10   86.66
FITNET [Mirzadeh et al.2019]     CIFAR-10   86.73
[Zagoruyko and Komodakis2016]    CIFAR-10   86.86
FSP [Yim et al.2017]             CIFAR-10   87.07
BSS [Heo et al.2018]             CIFAR-10   87.32
MUTUAL [Zhang et al.2018b]       CIFAR-10   87.71
TAKD [Mirzadeh et al.2019]       CIFAR-10   88.01
CompNet - ResNet8 (this work)    CIFAR-10   90.48
[Hinton et al.2015]              CIFAR-100  61.41
TAKD [Mirzadeh et al.2019]       CIFAR-100  61.82
CompNet - ResNet8 (this work)    CIFAR-100  62.52
Table 4: Comparison of our ensemble distillation using ResNet8 base student network and other Knowledge Distillation (KD) methods on the CIFAR-10 and CIFAR-100 datasets.
Train size   CIFAR-10                   CIFAR-100
             NO-KD    KD      EKD       NO-KD    KD      EKD
100%         86.68    88.56   90.48     58.03    58.08   62.52
75%          85.52    87.43   89.48     56.15    56.71   60.99
50%          78.30    83.91   85.81     45.21    47.27   52.06
25%          72.20    77.66   79.22     32.92    34.30   38.73
10%          61.29    66.36   67.74     18.01    20.05   22.87
Table 5: Test accuracy of ResNet8 without distillation (NO-KD), with conventional distillation (KD), and with the proposed ensemble distillation (EKD) on the CIFAR-10 and CIFAR-100 datasets for different sizes of the training data.
Figure 2: Comparison of TSNE visualizations of 2-dimensional embeddings generated using features produced by the proposed teacher ensemble (A), our EKD-based ResNet8 (B), and a ResNet8 model without distillation (C), on the test data of CIFAR-10 dataset. The comparison shows that the embeddings produced by our EKD based models show better separation of the target classes especially in cases with limited training data compared to the embeddings produced by models trained without distillation.

5.3 Comparison with Other Knowledge Distillation Methods

Here, we compare our Ensemble Knowledge Distillation (EKD) with some of the recent state-of-the-art knowledge distillation methods, including: activation-based attention transfer (AAT) [Zagoruyko and Komodakis2016], the method of [Yim et al.2017] (FSP), the method of [Hinton et al.2015], hint-based transfer (FitNet) [Romero et al.2015], the method of [Heo et al.2018] (BSS), deep mutual learning [Zhang et al.2018b] (MUTUAL), and the method of [Mirzadeh et al.2019] (TAKD). For a fair comparison, we used exactly the same settings for the CIFAR-10 and CIFAR-100 experiments, and a ResNet8-based student network as used in the baseline studies. Table 4 shows that our EKD-based networks improved accuracy on both tested datasets compared to the other KD methods. This improved performance is attributed to the proposed ensemble distillation architecture, where the proposed training objective function enables the student network to mimic diverse feature embeddings produced by different teachers and improves generalization to unseen test data. Furthermore, combining the distilled information through ensembling reduces variance in the heterogeneous outputs and improves the quality of the final predictions of the student network.

5.4 Model Generalization Performance

Figure 3: Comparison of loss landscape visualizations generated using the proposed teacher ensemble (A), our EKD-based ResNet8 (B), and a ResNet8 model without distillation (C), on the test data of CIFAR-100 dataset.

Here, we evaluate the generalization performance of the proposed ensemble knowledge distillation. For this, we conducted experiments with different sizes of the data used for training the teacher and the student networks. Table 5 shows the results of these experiments using ResNet8 as the student network. The results show that the performance gap increases as the size of the dataset is reduced. For instance, accuracy drops by around 25% and 40% when 10% of the data is used to train the networks without the proposed EKD on the CIFAR-10 and CIFAR-100 datasets, respectively. Table 5 also shows that, using the proposed EKD, the networks improve test accuracy by considerable margins for all the tested sizes of training data compared to networks without distillation (NO-KD). For instance, EKD-based networks produced improvements of up to 7% and 6% when 50% of the data was used for training on the CIFAR-10 and CIFAR-100 datasets, respectively, compared to networks without distillation (NO-KD).
Table 5 (columns labelled KD) shows the test accuracy of networks trained through conventional distillation [Hinton et al.2015], where ResNet8-based student networks were trained with ResNet56 as the teacher network. The results show that while non-ensemble distillation improved accuracy compared to the networks without distillation (NO-KD), our ensemble distillation based networks (EKD) produced the best accuracy on the test datasets for all sizes of training data. These improvements are attributed to our ensemble distillation, which promotes diversity in feature learning by transferring knowledge from different teachers into the student network and improves model generalization to test data. These experiments show that in situations where non-KD methods fail to generalize due to insufficient data, the proposed ensemble distillation achieves considerable performance improvements, demonstrating its potential for applications with limited-data constraints.
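The limited-data settings of Table 5 can be emulated by sub-sampling the training set; the helper below is a hypothetical sketch (the name stratified_fraction is ours) that keeps the class balance when only a fraction of the data is retained.

```python
import random
from collections import defaultdict
from torch.utils.data import Subset

def stratified_fraction(dataset, fraction, seed=0):
    """Return a Subset containing `fraction` of the samples from each class."""
    labels = getattr(dataset, "targets", None)   # CIFAR datasets expose labels as .targets
    if labels is None:
        labels = [y for _, y in dataset]         # generic (slower) fallback
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    keep = []
    for indices in by_class.values():
        rng.shuffle(indices)
        keep.extend(indices[: max(1, int(len(indices) * fraction))])
    return Subset(dataset, keep)
```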

5.4.1 Visualization of Teacher and Student Features Space

Here, we conducted experiments to visualize the feature space learned by our EKD-based networks. Fig. 2 shows 2-dimensional TSNE embeddings generated from the features produced by the teacher ensemble and by ResNet8 models trained with and without the proposed ensemble distillation, under different sizes of training data. These experiments show that training a student network using high-level features from multiple teacher networks enables the student to imitate its teachers and produce embeddings which exhibit better separation of the target classes than the embeddings produced by models trained without distillation.
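A feature-space visualization of this kind can be produced with scikit-learn's t-SNE; the sketch below assumes access to a feature extractor returning pooled feature vectors (the extract_features callable is a hypothetical placeholder).

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_tsne(extract_features, loader, device, out_path="tsne.png"):
    feats, labels = [], []
    for images, targets in loader:
        f = extract_features(images.to(device))   # pooled feature vectors, shape [B, D]
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())
    feats = np.concatenate(feats)
    labels = np.concatenate(labels)
    # Project the high-dimensional features to 2-D for visualization.
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")
    plt.savefig(out_path, dpi=200)
```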

5.4.2 Visualization of Teacher and Student Loss Landscapes

Fig. 3 shows a comparison of loss landscape visualizations [Li et al.2018] around local minima for the teacher ensemble, ResNet8 with the proposed EKD, and ResNet8 without distillation on the test data of the CIFAR-100 dataset. The comparison shows that our EKD-based networks exhibit a flatter surface topology, similar to that of the teacher networks, compared to the models without distillation. This translates to better generalization performance of our EKD-based networks.
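As a rough way to probe such loss surfaces, one can evaluate the test loss along a random, norm-rescaled direction around the trained weights. The sketch below is a simplified 1-D variant of the filter-normalized procedure of [Li et al.2018], not the visualization code used for Fig. 3; perturbing all floating-point entries of the state dict is a further simplification.

```python
import copy
import torch

@torch.no_grad()
def loss_slice(model, loss_fn, loader, device, steps=21, radius=1.0):
    """Evaluate the loss along a single random direction around the current weights."""
    base = copy.deepcopy(model.state_dict())
    direction = {}
    for name, w in base.items():
        if torch.is_floating_point(w):
            d = torch.randn_like(w)
            direction[name] = d * (w.norm() / (d.norm() + 1e-10))  # rescale to match weight norms
    alphas = torch.linspace(-radius, radius, steps)
    losses = []
    model.eval()
    for a in alphas:
        perturbed = {k: (v + a * direction[k]) if k in direction else v for k, v in base.items()}
        model.load_state_dict(perturbed)
        total, count = 0.0, 0
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            logits = outputs[0] if isinstance(outputs, tuple) else outputs  # fused logits if a tuple is returned
            total += loss_fn(logits, labels).item() * images.size(0)
            count += images.size(0)
        losses.append(total / count)
    model.load_state_dict(base)  # restore the original weights
    return alphas.tolist(), losses
```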

5.5 Conclusion and Future Work

Recently, deep CNN based ensemble methods have shown state-of-the-art performance in image classification, but at the cost of high computation and large memory requirements. In this paper, we show that knowledge distillation using an ensemble architecture can improve classification accuracy and model generalization of small and compact networks, especially when training data is limited. Unlike traditional ensembling techniques, which reduce variance in outputs by combining independently trained networks, our Ensemble Knowledge Distillation (EKD) encourages heterogeneity in student feature learning through collaboration between heterogeneous teachers and the student network. This enables student networks to learn more discriminative and diverse feature representations while maintaining small memory and compute requirements. Experiments on the well-established CIFAR-10 and CIFAR-100 datasets show that compact networks trained through the proposed ensemble distillation improve classification accuracy and model generalization, especially in situations with limited training data. In the future, we plan to explore fully data-driven automated ensemble selection. We also plan to evaluate our framework on video classification tasks to gain more insights into the benefits of ensemble distillation.

References