Adversarial-Based Knowledge Distillation for Multi-Model Ensemble and Noisy Data Refinement

Generic image recognition is a fundamental problem in computer vision. One major challenge is that a single image usually contains multiple objects while its label is still one-hot; another is that human-annotated labels are noisy and sometimes missing. In this paper, we tackle these challenges within a unified framework covering two image recognition problems: multi-model ensemble and noisy data recognition. It is well known that the best performing deep neural models are usually ensembles of multiple base-level networks, since ensembling mitigates the variation or noise contained in the dataset. Unfortunately, the space required to store these networks and the time required to execute them at runtime prohibit their use in applications where test sets are large (e.g., ImageNet). We present a method for compressing large, complex trained ensembles into a single network, where the knowledge from a variety of trained deep neural networks (DNNs) is distilled and transferred to a single DNN. To distill diverse knowledge from different trained (teacher) models, we propose an adversarial-based learning strategy: a block-wise training loss guides and optimizes the predefined student network to recover the knowledge in the teacher models, while simultaneously promoting the discriminator network to distinguish teacher features from student features. Extensive experiments on CIFAR-10/100, SVHN, ImageNet and the iMaterialist Challenge Dataset demonstrate the effectiveness of our MEAL method. On ImageNet, our ResNet-50 based MEAL achieves a top-1/5 validation error of 21.79%/5.99%, improving top-1 error by 2.06, and on the iMaterialist Challenge Dataset it obtains a remarkable top-3 improvement of 1.15 over a baseline ResNet-101 model.



1 Introduction

The model ensemble approach uses a collection of neural networks whose predictions are combined at test time by weighted averaging or voting. It has long been observed that ensembles of multiple networks are generally much more robust and accurate than a single network when the training data is noisy and hard to handle. This benefit has also been exploited indirectly when training a single network through Dropout [59], DropConnect [66], Stochastic Depth [24], Swapout [58], etc. We extend this idea by forming ensemble predictions during training, using the outputs of different network architectures on identically or differently augmented inputs. Testing still operates on a single network, but the supervision labels produced by the different pre-trained networks correspond to an ensemble prediction of a group of individual reference networks.

The traditional ensemble, also called the true ensemble, has some disadvantages that are often overlooked. 1) Redundancy: The knowledge contained in the trained neural networks is largely redundant and overlaps across networks. Directly combining the predictions requires extra computation, but the gain is limited. 2) Size and speed: An ensemble requires more computation than an individual network, which makes it unusable for applications with limited memory, storage space, or computational power, such as desktop, mobile and embedded devices, and for applications in which real-time predictions are needed.

To address the aforementioned shortcomings, we propose a learning-based ensemble method. Our goal is to learn an ensemble of multiple neural networks without incurring any additional testing cost, as shown in Fig. 2. We achieve this by using the combination of diverse outputs from different neural networks as supervision to guide the training of the target network. The reference networks are called Teachers and the target networks are called Students. Instead of the traditional one-hot vector labels, we use soft labels, which provide more coverage for co-occurring and visually related objects and scenes. We argue that labels should be informative for the specific image; in other words, the labels should not be identical for all images of the same class. More specifically, as shown in Fig. 3, an image of “tobacco shop” that looks similar to “library” should have a different label distribution from an image of “tobacco shop” that looks more similar to “grocery store”. Soft labels also provide additional information about the intra- and inter-category relations of datasets.

(a) Standard
(b) Ours
Fig. 1: Visualizations of validation images from the ImageNet dataset [8] by t-SNE [45]. We randomly sample 10 classes within 1000 classes. Left is the single model result using the standard training strategy. Right is our MEAL ensemble model result.

To further improve the robustness of student networks, we introduce an adversarial learning strategy that forces the student to generate outputs similar to those of the teachers. We propose two strategies for the generative adversarial training: (i) joint training within a unified framework; and (ii) alternately updating gradients in separate training processes, i.e., updating the discriminator and the student network iteratively. To the best of our knowledge, very few existing works adopt generative adversarial learning to force student networks to match the output distributions of the teachers, so our method is among the first in this direction for multi-model ensemble. Our experiments show that MEAL consistently improves accuracy across a variety of popular network architectures on different datasets. For instance, our shake-shake [12] based MEAL achieves 2.54% test error on CIFAR-10, a relative improvement of about 11% over the shake-shake baseline [12] of 2.86%. On ImageNet, our ResNet-50 based MEAL achieves 21.79%/5.99% val error, which outperforms the baseline by a large margin.

Furthermore, we extend our method to the problem of noisy data processing. We propose an iterative refinement paradigm based on our MEAL method, which can refine the labels from the teacher networks progressively and provide more accurate supervisions for the student network training. We conduct experiments on iMaterialist Challenge Dataset and the results show that our method can vastly improve the performance of base models.

To explore what our model actually learned, we visualize the embedded features from the single model and from our ensemble model. The visualization is produced with t-SNE [45] using the last conv-layer features (2048 dimensions) of ResNet-50. We randomly sample 10 classes from ImageNet; the results in Fig. 1 show that our model produces a better feature embedding.

In summary, our contribution in this paper is threefold.

  • An end-to-end framework with adversarial learning is designed based on the teacher-student learning paradigm for deep neural network ensembling and noisy data learning.

  • The proposed method can achieve the goal of ensembling multiple neural networks with no additional testing cost.

  • The proposed method improves the state-of-the-art accuracy on CIFAR-10/100, SVHN, ImageNet and iMaterialist Challenge Dataset for a variety of existing network architectures.

A preliminary version of this manuscript [55] was published at a previous conference. In this version, we introduce and compare two different gradient update strategies for adversarial learning in our MEAL framework. We also provide a novel learning paradigm for applying our method to noisy data. Furthermore, we include more experiments, details and analysis, and an iterative refinement strategy with better performance. Currently, few works adopt generative adversarial learning in feature space to learn identical distributions between teacher and student networks. This work therefore offers practical guidelines for multi-model learning/ensemble and noisy data refinement.

Fig. 2: Comparison of FLOPs at inference time. Huang et al. [22] employ models at different local minima for ensembling, which incurs no additional training cost, but the FLOPs at test time increase linearly with the number of ensemble members. In contrast, our method uses only one model at inference time, so the testing cost is independent of the number of ensemble members.

2 Related Work

There is a large body of previous work [17, 52, 34, 9, 22, 35, 76, 73] on ensembles with neural networks. However, most of these prior studies focus on improving the generalization of an individual network. Recently, Snapshot Ensembles [22] were proposed to address the cost of training ensembles. In contrast to Snapshot Ensembles, we focus on the cost of testing ensembles. Our method builds on the recently proposed knowledge distillation [20, 48, 40, 70] and adversarial learning [15], so we review the works most directly connected to ours.

Fig. 3: Left is a training example of class “tobacco shop” from ImageNet. Right are soft distributions from different trained architectures. The soft labels are more informative and can provide more coverage for visually-related scenes.

“Implicit” Ensembling. Essentially, our method is an “implicit” ensemble, a family of methods that is usually highly efficient during both training and testing. Typical “implicit” ensemble methods include Dropout [59], DropConnect [66], Stochastic Depth [24], Swapout [58], etc. These methods generally create an exponential number of networks with shared weights during training and then implicitly ensemble them at test time. In contrast, our method focuses on the subtle differences between labels for identical input. Perhaps the most similar to our work is the recently proposed Label Refinery [3], which focuses on single-model refinement, using the softened labels from previously trained neural networks to iteratively learn a new and more accurate network. Our method differs in that we introduce adversarial modules to force the model to learn the difference between teachers and students, which can improve model generalization and can be used in conjunction with any other implicit ensembling technique. There are other ensemble methods such as DivE [73], which trains an ensemble of models by assigning data to models at each training epoch based on each model’s current expertise and an intra- and inter-model diversity reward. It starts by choosing easy samples for each model, and then gradually adjusts towards models having specialized and complementary expertise on subsets of the training data.

Adversarial Learning. Generative Adversarial Learning [15] was first proposed to generate realistic-looking images from random noise using neural networks. It consists of two components: a generator and a discriminator. The generator synthesizes images to fool the discriminator, while the discriminator tries to distinguish real images from fake ones. Recently, numerous GAN variants have been proposed, such as Wasserstein GAN [2], improved Wasserstein GANs [16], DRAGAN [30], NS GAN [10], LS GAN [46], etc. Generally, the generator and discriminator are trained simultaneously by competing with each other. In this work, we employ generators to synthesize student features and use discriminators to discriminate between teacher and student outputs for the same input image. An advantage of adversarial learning is that the generator tries to produce features so similar to the teacher's that the discriminator cannot tell them apart. This procedure improves the robustness of student network training and has been applied to many fields such as image-to-image translation [27, 74, 75, 41, 25, 56], image generation [29], detection [4], etc.

Fig. 4: Overview of our proposed architecture. We input the same image into the teacher and student networks to generate intermediate and final outputs for Similarity Loss and Discriminators. The model is trained adversarially against several discriminator networks. During training the model observes supervisions from trained teacher networks instead of the one-hot ground-truth labels, and the teacher’s parameters are fixed all the time.

Knowledge Transfer. Distilling knowledge from trained neural networks and transferring it to a new network has been well explored in [20, 5, 40, 68, 70, 62, 3, 1, 6, 49, 19, 11, 63, 7, 69]. The typical way of transferring knowledge is the teacher-student learning paradigm, which uses a softened distribution of the final output of a teacher network to teach information to a student network. With this teaching procedure, the student can learn how a teacher studied given tasks in a more efficient form. Yim et al. [70] defined the distilled knowledge to be transferred as flows between different intermediate layers and computed the inner product between parameters from two networks. Bagherinezhad et al. [3] studied the effects of various properties of labels and introduced the Label Refinery method, which iteratively updates the ground truth labels after examining the entire dataset with the teacher-student learning paradigm. Park et al. [49] introduced a novel approach dubbed relational knowledge distillation (RKD) that transfers mutual relations of data examples instead. As concrete realizations of RKD, they proposed distance-wise and angle-wise distillation losses that penalize structural differences in relations.

Learning with Noisy Labels. Learning with noisy labels has been a widely explored research topic in recent years, since it has broad applications [39]. A large variety of methods have been applied to model the distribution of noisy and true annotations, such as knowledge graphs and distillation [40], conditional random fields [64], directed graphical models [67], etc. In the deep learning era, one line of work handles this problem with neural networks [39, 28, 13, 65, 60, 51, 53, 38], formulated through explicit or implicit noise models.

In particular, Li et al. [39] proposed a unified distillation framework that uses “side” information, including a small clean dataset and label relations in a knowledge graph, to “hedge the risk” of learning from noisy labels. Liu et al. [42] presented an importance reweighting framework for classification in the presence of label noise, in which sample labels are randomly corrupted. Li et al. [39] proposed a noise-tolerant learning algorithm in which a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model after one gradient update using each set of synthetic noisy labels. Sukhbaatar et al. [60] explored the performance of discriminatively trained CNNs when training on noisy data. They introduced an extra noise layer into the network, which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve only simple modifications to current training infrastructures for deep neural networks.

3 Overview

Siamese-like Network Structure. Our framework is a siamese-like architecture that contains two network streams, a teacher branch and a student branch. The structures of the two streams can be identical or different, but they should have the same number of blocks so that the intermediate outputs can be used. The whole framework of our method is shown in Fig. 4. It consists of a teacher network, a student network, alignment layers, similarity loss layers and discriminators.

The teacher and student networks generate intermediate outputs for alignment. The alignment layer is an adaptive pooling process that takes feature vectors of the same or different lengths as input and outputs new features of fixed length. We force the student to output features similar to those of the teacher by training the student network adversarially against several discriminators. We elaborate on each of these components in the following sections.

4 Adversarial Learning (AL) for Knowledge Distillation

4.1 Similarity Measurement

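Given a dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{n}$, we pre-train the teacher network $T_\theta$ over the dataset in advance, using the cross-entropy loss against the one-hot image-level (ground-truth) labels. The student network $S_\theta$ is trained over the same set of images, but uses labels generated by $T_\theta$. More formally, we can view this procedure as training $S_\theta$ on a new labeled dataset $\tilde{\mathcal{D}} = \{(X_i, T_\theta(X_i))\}_{i=1}^{n}$. Once the teacher network is trained, we freeze its parameters when training the student network.

We train the student network $S_\theta$ by minimizing the similarity distance between its output and the soft label generated by the teacher network. Let $p_c^{T_\theta}(X_i) = T_\theta(X_i)[c]$ and $p_c^{S_\theta}(X_i) = S_\theta(X_i)[c]$ be the probabilities assigned to class $c$ by the teacher model $T_\theta$ and the student model $S_\theta$. The similarity metric can be formulated as:

$$\mathcal{L}_{Sim} = d\big(T_\theta(X_i), S_\theta(X_i)\big) = \sum_{c} d\big(p_c^{T_\theta}(X_i), p_c^{S_\theta}(X_i)\big) \tag{1}$$

We investigated three distance metrics in this work: $\ell_1$, $\ell_2$ and KL-divergence. The detailed experimental comparisons are shown in Tab. I. We formulate them as follows.

The $\ell_1$ distance minimizes the absolute differences between the estimated student probability values and the reference teacher probability values:

$$\mathcal{L}_{Sim}^{\ell_1} = \frac{1}{n} \sum_{c} \sum_{i=1}^{n} \big| p_c^{T_\theta}(X_i) - p_c^{S_\theta}(X_i) \big| \tag{2}$$

The $\ell_2$ distance, or Euclidean distance, is the straight-line distance in Euclidean space, and has been used in Mean Teacher [62] (mean squared error, MSE) as the consistency loss. We use the $\ell_2$ loss function to minimize the sum of all squared differences between the student output probabilities and the teacher probabilities:

$$\mathcal{L}_{Sim}^{\ell_2} = \frac{1}{n} \sum_{c} \sum_{i=1}^{n} \big( p_c^{T_\theta}(X_i) - p_c^{S_\theta}(X_i) \big)^2 \tag{3}$$

KL-divergence is a measure of how one probability distribution differs from another, reference probability distribution. Here we train the student network $S_\theta$ by minimizing the KL-divergence between its output and the soft labels generated by the teacher network. Our loss function is:

$$\mathcal{L}_{Sim}^{KL} = -\frac{1}{n} \sum_{c} \sum_{i=1}^{n} p_c^{T_\theta}(X_i) \log \frac{p_c^{S_\theta}(X_i)}{p_c^{T_\theta}(X_i)} = -\frac{1}{n} \sum_{c} \sum_{i=1}^{n} p_c^{T_\theta}(X_i) \log p_c^{S_\theta}(X_i) + \frac{1}{n} \sum_{c} \sum_{i=1}^{n} p_c^{T_\theta}(X_i) \log p_c^{T_\theta}(X_i) \tag{4}$$

where the second term is the (negative) entropy of the soft labels from the teacher network and is constant with respect to $S_\theta$. We can remove it and simply minimize the cross-entropy loss:

$$\mathcal{L}_{Sim}^{CE} = -\frac{1}{n} \sum_{c} \sum_{i=1}^{n} p_c^{T_\theta}(X_i) \log p_c^{S_\theta}(X_i) \tag{5}$$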
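The relationship between the distance metrics above can be checked numerically. The following is a minimal sketch in plain Python for a single image (the averaging over the dataset is omitted); the function names and the toy probability vectors are illustrative assumptions, not part of the original work.

```python
import math

def l1_distance(p_teacher, p_student):
    # Sum of absolute differences between teacher and student probabilities.
    return sum(abs(t - s) for t, s in zip(p_teacher, p_student))

def l2_distance(p_teacher, p_student):
    # Sum of squared differences between teacher and student probabilities.
    return sum((t - s) ** 2 for t, s in zip(p_teacher, p_student))

def cross_entropy(p_teacher, p_student):
    # Cross-entropy of the student distribution against the teacher soft label.
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student) if t > 0)

def entropy(p):
    # Entropy of a distribution; constant with respect to the student.
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p_teacher, p_student):
    # KL(teacher || student).
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

# Toy soft labels over three classes (hypothetical values).
teacher = [0.70, 0.20, 0.10]
student = [0.50, 0.30, 0.20]

# KL equals cross-entropy minus the teacher entropy, so minimizing either
# one with respect to the student gives the same optimum.
gap = kl_divergence(teacher, student) - (cross_entropy(teacher, student) - entropy(teacher))
```

Because the teacher is frozen, its entropy does not depend on the student, which is why dropping that term leaves the minimizer unchanged.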


4.2 Intermediate Alignment

Adaptive Pooling.

The purpose of the adaptive pooling layer is to align the intermediate outputs of the teacher network and the student network. This kind of layer is similar to an ordinary pooling layer such as average or max pooling, but it generates an output of predefined length for inputs of different sizes. Because of this property, we can use different teacher networks and pool their outputs to the same length as the student output. A pooling layer can also achieve spatial invariance when reducing the resolution of feature maps. Thus, for the intermediate output, our loss function is:

$$\mathcal{L}_{Sim}^{j} = d\big(f(T_{\theta_j}), f(S_{\theta_j})\big) \tag{6}$$

where $T_{\theta_j}$ and $S_{\theta_j}$ are the outputs at the $j$-th layer of the teacher and student, respectively, and $f$ is the adaptive pooling function, which can be average or max. Fig. 5 illustrates the process of adaptive pooling. Because we adopt multiple intermediate layers, our final similarity loss is the sum of the individual ones:

$$\mathcal{L}_{Sim} = \sum_{j \in \mathcal{A}} \mathcal{L}_{Sim}^{j} \tag{7}$$

where $\mathcal{A}$ is the set of layers that we choose to produce output. In our experiments, we use the last layer in each block of a network (block-wise).

Fig. 5: The process of adaptive pooling in forward and backward stages. We use max operation for illustration.
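To make the alignment step concrete, here is a minimal sketch of average-based adaptive pooling on 1-D feature vectors in plain Python. The windowing rule (floor for the start index, ceiling for the end index) is an assumption borrowed from common adaptive pooling implementations; the text does not specify the exact rule, and the function name and toy features are hypothetical.

```python
def adaptive_avg_pool_1d(features, out_len):
    """Pool a feature vector of any length to exactly out_len values."""
    n = len(features)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len                 # floor(i * n / out_len)
        end = -((-(i + 1) * n) // out_len)         # ceil((i + 1) * n / out_len)
        window = features[start:end]
        pooled.append(sum(window) / len(window))
    return pooled

# Teacher and student features of different lengths (toy values).
teacher_feat = [float(v) for v in range(8)]
student_feat = [float(v) for v in range(6)]

# After pooling, both have the same length and can be compared by a distance.
t = adaptive_avg_pool_1d(teacher_feat, 4)
s = adaptive_avg_pool_1d(student_feat, 4)
```

The same idea extends to 2-D feature maps by applying the windowing rule along each spatial axis.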

4.3 Stacked Discriminators

We generate the student output by training the student network, with the teacher parts frozen, adversarially against a series of stacked discriminators $D_j$. A discriminator $D$ attempts to classify its input $x$ as teacher or student by maximizing the following objective, as in [15]:

$$\mathcal{L}_{j}^{GAN} = \mathbb{E}_{x \sim p_{teacher}} \log D_j(x) + \mathbb{E}_{x \sim p_{student}} \log\big(1 - D_j(x)\big) \tag{8}$$

where $x \sim p_{student}$ are outputs from the generation network $S_{\theta_j}$. At the same time, $S_{\theta_j}$ attempts to generate similar outputs that will fool the discriminator by minimizing $\mathbb{E}_{x \sim p_{student}} \log\big(1 - D_j(x)\big)$.

In Eq. 9, $x$ is the concatenation of teacher and student outputs. We feed $x$ into the discriminator, which is a three-layer fully-connected network. The whole structure of a discriminator is shown in Fig. 6.

Multi-Stage Discriminators. Using multi-stage discriminators can refine the student outputs gradually. As shown in Fig. 4, the final adversarial loss is a sum of the individual ones (by minimizing $-\mathcal{L}_{j}^{GAN}$):

$$\mathcal{L}_{GAN} = -\sum_{j \in \mathcal{A}} \mathcal{L}_{j}^{GAN} \tag{9}$$

Let $|\mathcal{A}|$ be the number of discriminators. In our experiments, we use 3 for CIFAR [33] and SVHN [47], and 5 for ImageNet [8].
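As a rough numerical illustration of the discriminator objective and of how the per-block terms are summed, the sketch below evaluates both on toy discriminator scores. It assumes each discriminator outputs the probability that its input comes from the teacher; the function names and score values are hypothetical and only meant to show the arithmetic.

```python
import math

def discriminator_objective(d_teacher, d_student):
    """Minibatch estimate of one discriminator's objective:
    mean log D(x_teacher) + mean log(1 - D(x_student))."""
    real = sum(math.log(p) for p in d_teacher) / len(d_teacher)
    fake = sum(math.log(1.0 - p) for p in d_student) / len(d_student)
    return real + fake

def adversarial_loss(per_block_scores):
    """Negated sum of the per-discriminator objectives over all blocks."""
    return -sum(discriminator_objective(t, s) for t, s in per_block_scores)

# Three discriminators (e.g. one per block), each scoring two teacher
# features and two student features.
confident = [([0.9, 0.8], [0.1, 0.2])] * 3   # discriminators separate the two well
fooled = [([0.5, 0.5], [0.5, 0.5])] * 3      # discriminators cannot tell them apart

loss_confident = adversarial_loss(confident)
loss_fooled = adversarial_loss(fooled)
```

When every discriminator outputs 0.5 for all inputs, each objective equals 2 log 0.5, so the summed loss over three discriminators is 6 log 2.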

Fig. 6: Illustration of our proposed discriminator. We concatenate the outputs of teacher and student as the inputs of a discriminator. The discriminator is a three-layer fully-connected network.
Fig. 7: Illustration of our two gradient update strategies. The red line indicates the joint training strategy and the black dashed line indicates the alternate updating strategy. More details can be found in Section 5.

Strategy 1: Alternately Update Gradients.

Require: Following [15], we define the number of steps to apply to the discriminator, $k$, as a hyperparameter. $\alpha$ and $\beta$ are the trade-off coefficients.

for number of training iterations do
    for $k$ steps do
        Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training data, passed through the teacher as distribution $p_{teacher}$ and through the student as distribution $p_{student}$.
        Update the $j$-th discriminator by ascending the stochastic gradient of its objective (Eq. 8).
    end for
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training data, passed through the student as distribution $p_{student}$.
    Update the student by descending the stochastic gradient of its loss (the similarity loss and the GAN loss, weighted by $\alpha$ and $\beta$).
end for

Strategy 2: Joint Training.

for number of training iterations do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training data distribution.
    Update the $j$-th discriminator and the student by descending the stochastic gradient of the joint objective (Eq. 10).
end for

The gradient-based updates can use any standard gradient-based learning rule.
Algorithm 1 Learning Strategy of Similarity and Discriminators.
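The control flow of the two strategies can be sketched as follows. This is only a scheduling skeleton in plain Python: the update callbacks are hypothetical placeholders that would wrap the actual optimizer steps, and here they merely record the order in which they are called.

```python
def alternate_updates(num_iterations, k, update_discriminators, update_student):
    # Strategy 1: k discriminator steps, then one student step, per iteration.
    for it in range(num_iterations):
        for _ in range(k):
            update_discriminators(it)
        update_student(it)

def joint_updates(num_iterations, update_all):
    # Strategy 2: discriminators and student share one update per iteration.
    for it in range(num_iterations):
        update_all(it)

# Record the order of updates instead of training real networks.
calls = []
alternate_updates(
    num_iterations=2,
    k=3,
    update_discriminators=lambda it: calls.append(("D", it)),
    update_student=lambda it: calls.append(("S", it)),
)

joint_calls = []
joint_updates(2, lambda it: joint_calls.append(("D+S", it)))
```

In a real implementation the alternate strategy would typically keep separate optimizers for the discriminators and the student, while the joint strategy backpropagates a single loss through both.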

5 Learning Strategy of Similarity and Discriminators

5.1 Joint Training

For the joint training strategy, we incorporate the similarity loss in Eq. 7 and the adversarial loss in Eq. 9 into our final loss function, based on the above definitions and analysis. Our whole framework is trained end-to-end with the following objective function:

$$\mathcal{L} = \alpha \mathcal{L}_{Sim} + \beta \mathcal{L}_{GAN} \tag{10}$$

where $\alpha$ and $\beta$ are trade-off weights. We set both to 1 in our experiments by cross validation. We also use weighted coefficients to balance the contributions of different blocks: [0.01, 0.05, 1] for 3-block networks and [0.001, 0.01, 0.05, 0.1, 1] for 5-block ones.
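The joint objective with block weights can be illustrated with a short calculation. The sketch below assumes that each block weight scales both the similarity term and the adversarial term of its block; the text does not state exactly where the block coefficients are applied, so this placement, the function name and the toy loss values are all assumptions.

```python
def total_loss(sim_losses, gan_losses, block_weights, alpha=1.0, beta=1.0):
    """Weighted sum over blocks of (alpha * similarity + beta * adversarial)."""
    if not (len(sim_losses) == len(gan_losses) == len(block_weights)):
        raise ValueError("one similarity loss, GAN loss and weight per block")
    return sum(
        w * (alpha * sim + beta * gan)
        for w, sim, gan in zip(block_weights, sim_losses, gan_losses)
    )

# Block weights quoted in the text for 3-block networks.
weights_3_block = [0.01, 0.05, 1]

# Toy per-block losses (hypothetical values).
loss = total_loss(
    sim_losses=[1.0, 1.0, 1.0],
    gan_losses=[2.0, 2.0, 2.0],
    block_weights=weights_3_block,
)
```

With these weights the last block dominates the objective, while the earlier blocks contribute only small corrective terms.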

5.2 Alternately Update Gradients

For alternately updating gradients, we follow the training process of standard generative adversarial networks [15], which trains $D$ to maximize the probability of assigning the correct label to both teacher features and student features. We simultaneously train $S$ (the student) to minimize $\log(1 - D(x))$ for $x \sim p_{student}$. As in [15], $D$ and $S$ play the following two-player minimax game with value function $V(D, S)$:

$$\min_{S} \max_{D} V(D, S) = \mathbb{E}_{x \sim p_{teacher}} \big[\log D(x)\big] + \mathbb{E}_{x \sim p_{student}} \big[\log\big(1 - D(x)\big)\big] \tag{11}$$

We update the student network after updating the discriminator for $k$ iterations (a fixed $k$ is used in all our experiments). When updating the student network $S$, we aim to fool the discriminator by fixing the discriminator $D$ and minimizing the similarity loss and the GAN loss. More details can be found in Algorithm 1 and Fig. 7.

The main difference between joint training and alternate updating is that in the former strategy, the gradients from discriminators will propagate back to the student backbone, while in the latter strategy, the parameters in discriminators and student will update separately without any interactions.

Fig. 8: Image samples from two benchmarks grouped by clean and noisy/multi-label categories. In each group, black box images are clean labeled images and red box images are ones with multi-label objects (ImageNet) or noisy label images (iMaterialist products).

6 Multi-Model Ensemble via Adversarial Learning (MEAL)

We achieve ensembling with a training method that is simple and straightforward to implement. Since different network structures produce different output distributions, which can be viewed as soft labels (knowledge), we use these soft labels to train our student and thereby compress the knowledge of different architectures into a single network. We thus achieve the seemingly contradictory goal of ensembling multiple neural networks at no additional testing cost.

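Stage 1:

Build and pre-train the Teacher Model Zoo $\mathcal{T} = \{T_\theta^1, T_\theta^2, \dots\}$, including: VGGNet [57], ResNet [18], DenseNet [23], MobileNet [21], Shake-Shake [12], etc.

Stage 2:

function SelectTeacher($\mathcal{T}$)
    $T_\theta \leftarrow$ random selection from $\mathcal{T}$
    return $T_\theta$
end function
for each iteration do
    $T_\theta \leftarrow$ SelectTeacher($\mathcal{T}$)    (randomly select a teacher model)
    $S_\theta \leftarrow \arg\min_{S_\theta} \mathcal{L}(T_\theta, S_\theta)$    (adversarial learning for a student)
end for
Algorithm 2 Multi-Model Ensemble via Adversarial Learning (MEAL).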
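Stage 2 of the procedure above amounts to drawing one teacher per iteration and running one student update against it. A minimal sketch in plain Python, assuming uniform random selection (the sampling distribution is not stated in the text); `train_step` is a hypothetical callback standing in for the adversarial student update, and the zoo entries are just names.

```python
import random

def select_teacher(model_zoo, rng):
    # Assumption: teachers are drawn uniformly at random from the zoo.
    return rng.choice(model_zoo)

def meal_training(model_zoo, num_iterations, train_step, seed=0):
    """Run the per-iteration teacher selection and return the teachers used."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_iterations):
        teacher = select_teacher(model_zoo, rng)
        train_step(teacher)          # one adversarial update of the student
        history.append(teacher)
    return history

zoo = ["VGGNet", "ResNet", "DenseNet", "MobileNet"]
used = meal_training(zoo, 1000, train_step=lambda teacher: None)
```

Over many iterations the student sees soft labels from every teacher in the zoo, which is how a single network absorbs the ensemble's knowledge.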

6.1 Learning Procedure

To clearly understand what the student learned in our work, we define two conditions. First, the student has the same structure as the teacher network. Second, we choose one structure for student and randomly select a structure for teacher in each iteration as our ensemble learning procedure.

The learning procedure contains two stages. First, we pre-train the teachers to produce a model zoo. Because we use the classification task to train these models, we can use the softmax cross entropy loss as the main training loss in this stage. Second, we minimize the loss function in Eq. 10 to make the student output similar to that of the teacher output. The learning procedure is explained below in Algorithm 2.

7 Learning on Noisy Data

Deep neural networks have recently achieved notable success in image classification thanks to massive large-scale labeled datasets such as ImageNet [8], OpenImage [31], etc. However, collecting such datasets is time-consuming and expensive, and further requires double-checking by multiple annotators to reduce label errors. A better solution is to automatically build and learn from an Internet-scale dataset with noisy labels. Our method extends easily to noisy data in an automatic way, since the “soft labels” predicted by teacher models are usually more accurate than the noisy labels provided by the dataset. We further propose an iterative refinement strategy to boost the performance of our method on noisily labeled datasets.

7.1 Iterative Refinement

We propose an iterative training strategy to refine the noisy labels, which has three main advantages: (1) it rectifies sample supervision with potentially wrong class labels through the teacher model predictions; (2) it improves the quality of the teacher model predictions through the iterative refinement, so that the similarity loss becomes more effective; and (3) it mitigates overfitting to noisy samples when training networks on noisily labeled data, which makes student models more robust against label noise.

First, we perform an initial training iteration following the method described in Algorithm 2, and obtain the model with the best validation accuracy. This model becomes the teacher in the next training iteration. In the second training iteration, we repeat the steps in Algorithm 2 with only one change: we replace the teacher model zoo with the models learned in the first step. This operation improves the quality of the teacher models and leads them to produce more reliable predictions for student model training, which in turn improves the quality of the student models.
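The refinement loop described above can be sketched as follows. This is a minimal outline in plain Python under the assumption that each round returns a single best student, which then replaces the teacher zoo; `train_student` is a hypothetical callback standing in for a full run of Algorithm 2, and the model names are placeholders.

```python
def iterative_refinement(initial_zoo, train_student, num_rounds=2):
    """Each round trains a student from the current teachers; the best
    student of a round becomes the teacher zoo of the next round."""
    teachers = list(initial_zoo)
    best_student = None
    for round_idx in range(num_rounds):
        best_student = train_student(teachers, round_idx)
        teachers = [best_student]
    return best_student

# Stand-in for real training: log which teachers each round used.
log = []

def fake_train(teachers, round_idx):
    log.append(list(teachers))
    return f"student_round{round_idx}"

final = iterative_refinement(["teacher_a", "teacher_b"], fake_train)
```

The text describes two training iterations; `num_rounds` is exposed here only to make the loop structure explicit.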

Fig. 9: Error rates (%) on CIFAR-10 and CIFAR-100, SVHN and ImageNet datasets. In each figure, the results from left to right are 1) base model; 2) base model with adversarial learning; 3) true ensemble/traditional ensemble; and 4) our ensemble results. For the first three datasets, we employ DenseNet as student, and ResNet for the last one (ImageNet).

8 Experiments and Analysis

We empirically demonstrate the effectiveness of MEAL on several benchmark datasets. We implement our method on the PyTorch [50] platform.

8.1 Datasets

CIFAR. The two CIFAR datasets [33] consist of colored natural images of size 32×32. CIFAR-10 is drawn from 10 classes and CIFAR-100 from 100 classes. In each dataset, the train and test sets contain 50,000 and 10,000 images, respectively. A standard data augmentation scheme [37, 54, 36, 22, 43] is used: images are zero-padded with 4 pixels on each side, randomly cropped to produce 32×32 images, and horizontally mirrored with probability 0.5. We report the test errors in this section with training on the whole training set.
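The augmentation scheme for CIFAR can be written out in a few lines. The sketch below works on a single-channel image stored as a list of rows, in plain Python, to stay self-contained; a real pipeline would apply the same crop and flip to all three color channels. The function name and the all-ones test image are illustrative assumptions.

```python
import random

def augment(image, pad=4, crop=32, flip_prob=0.5, rng=random):
    """Zero-pad by `pad` pixels per side, take a random crop x crop window,
    and mirror it horizontally with probability `flip_prob`."""
    h, w = len(image), len(image[0])
    padded_w = w + 2 * pad
    padded = [[0] * padded_w for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * padded_w for _ in range(pad)]
    top = rng.randint(0, h + 2 * pad - crop)     # inclusive bounds
    left = rng.randint(0, padded_w - crop)
    out = [row[left:left + crop] for row in padded[top:top + crop]]
    if rng.random() < flip_prob:
        out = [row[::-1] for row in out]
    return out

rng = random.Random(0)
img = [[1] * 32 for _ in range(32)]
aug = augment(img, rng=rng)
```

Since the crop can shift by at most 4 pixels in each direction, at least a 28×28 region of the original image always survives in the output.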

SVHN. The Street View House Number (SVHN) dataset [47] consists of 32×32 colored digit images, with one class for each digit. The train and test sets contain 604,388 and 26,032 images, respectively. Following previous works [14, 24, 22, 43], we split off a subset of 6,000 images for validation, and train on the remaining images without data augmentation.

ImageNet. The ILSVRC 2012 classification dataset [8] consists of 1000 classes, with 1.2 million training images and 50,000 validation images. We adopt the data augmentation scheme of [32] and apply the same operation as [22] at test time.

iMaterialist Challenge Dataset. The iMaterialist Dataset is a large-scale, noisy, fine-grained product classification dataset from FGVC6 at CVPR 2019. It contains about one million product images with 2019 classes for training, about 10K images for validation and 90K images for testing. This dataset is fairly challenging, since about 30% of the training images have incorrect labels. We follow the evaluation in the competition, which uses top-3 classification error as the metric. We also provide top-1 results in our experiments.

8.2 Networks

We adopt several popular network architectures as our teacher model zoo, including VGGNet [57], ResNet [18], DenseNet [23], MobileNet [21], shake-shake [12], etc. For VGGNet, we use the 19-layer version with Batch Normalization [26]. For ResNet, we use the 18-layer network for CIFAR and SVHN and the 50-layer network for ImageNet. For DenseNet, we use the structure with depth L=100 and growth rate k=24. For shake-shake, we use the 26-layer 2×96d version. Note that due to the high computing costs, we use shake-shake as a teacher only when the student is a shake-shake network.

8.3 Ablation Studies

We first investigate each design principle of our MEAL framework with the joint training strategy. We design several controlled experiments on CIFAR-10 with VGGNet-19 w/BN (as both teacher and student) for this ablation study. A consistent setting is imposed on all the experiments, except for the component or structure under examination.

Model | $\ell_1$ dis. | $\ell_2$ dis. | Cross-Entropy | Intermediate | Adversarial | Test Errors (%)
Base Model (VGG-19 w/ BN) [57] | - | - | - | - | - | 6.34
TABLE I: Ablation study on CIFAR-10 using VGGNet-19 w/BN. Please refer to Section 8.3 for more details.

The results are summarized in Table I. The first three rows use only the $\ell_1$, $\ell_2$ or cross-entropy loss from the last layer of the network, which is similar to the Knowledge Distillation method. We observe that cross-entropy achieves the best accuracy. We then employ more intermediate outputs to calculate the loss, as shown in rows 4 and 5; including more layers improves the performance. Finally, we add the discriminators to examine the effectiveness of adversarial learning. Using cross-entropy, intermediate layers and adversarial learning together achieves the best result. Additionally, we use average-based adaptive pooling for alignment. We also tried the max operation, but the accuracy is much worse (6.32% test error).

8.4 Results of Multi-Model Ensemble

Comparison with Different Learning Strategies. We compare MEAL with joint update and alternate update strategies, employing several network architectures in this comparison. The results are shown in Table II. All models are trained for the same number of epochs. It can be observed that in most cases joint training obtains better performance than alternate update, so in the following experiments we use joint update as our basic learning method unless otherwise noted.

| Network | Alternate Updating (A) (%) | Joint Training (J) (%) |
|---|---|---|
| VGG-19 w/ BN [57] | 5.87 | 5.55 |
| GoogLeNet [61] | 4.39 | 4.83 |
| ResNet-18 [18] | 4.43 | 4.35 |
| DenseNet-BC (k=24) [23] | 3.99 | 3.54 |

TABLE II: Comparison of error rate (%) with different learning strategies on CIFAR-10.

Comparison with Traditional Ensemble. The results are summarized in Fig. 9 and Table III. In Fig. 9, we compare the error rate using the same architecture on a variety of datasets (except ImageNet). It can be observed that our results consistently outperform the single model and the traditional ensemble on these datasets. The traditional ensembles are obtained by averaging the final predictions across all teacher models. In Table III, we compare the error rate using different architectures on the same dataset. In most cases, our ensemble method achieves lower error than any of the baselines, including the single model and the traditional ensemble.
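The traditional ensemble baseline described above can be sketched as follows: each teacher produces a softmax distribution, and the ensemble prediction is the class with the highest averaged probability. The logits are made-up numbers and the helper names are our own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_ensemble(per_model_logits):
    """Average the softmax outputs of all models for a single input."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]


# Hypothetical logits from three teachers for one image with three classes.
teacher_logits = [
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
    [1.8, 1.7, 0.2],
]
avg_probs = average_ensemble(teacher_logits)
prediction = max(range(len(avg_probs)), key=lambda c: avg_probs[c])
print(avg_probs, prediction)
```

Note that this baseline must run every teacher at test time, which is the cost that distilling the ensemble into a single student is meant to avoid.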

| Network | Single (%) | Traditional Ens. (%) | Our Ens. (%) |
|---|---|---|---|
| MobileNet [21] | 10.70 | - | 8.09 |
| VGG-19 w/ BN [57] | 6.34 | - | 5.55 |
| DenseNet-BC (k=24) [23] | 3.76 | 3.73 | 3.54 |
| Shake-Shake-26 2x96d [12] | 2.86 | 2.79 | 2.54 |

TABLE III: Error rate (%) using different network architectures on CIFAR-10 dataset.

Comparison with Dropout. We compare MEAL with the “implicit” ensemble method Dropout [59], employing several network architectures in this comparison. The results are shown in Table IV. All models are trained for the same number of epochs. We use a probability of 0.2 for dropping nodes during training. It can be observed that our method achieves better performance than Dropout on all these networks.

| Network | Dropout (%) | Our Ens. (%) |
|---|---|---|
| VGG-19 w/ BN [57] | 6.89 | 5.55 |
| GoogLeNet [61] | 5.37 | 4.83 |
| ResNet-18 [18] | 4.69 | 4.35 |
| DenseNet-BC (k=24) [23] | 3.75 | 3.54 |

TABLE IV: Comparison of error rate (%) with Dropout [59] baseline on CIFAR-10.

| Method | Top-1 (%) | Top-5 (%) | #FLOPs | Inference Time (per image) |
|---|---|---|---|---|
| Teacher Networks: | | | | |
| VGG-19 w/BN | 25.76 | 8.15 | 19.52B | s |
| ResNet-50 | 23.85 | 7.13 | 4.09B | s |
| Ours (ResNet-50) | 23.58 | 6.86 | 4.09B | s |
| Traditional Ens. | 22.76 | 6.49 | 23.61B | s |
| Ours Plus J (ResNet-50) | 21.79 | 5.99 | 4.09B | s |
| Ours Plus A (ResNet-50) | 22.08 | 5.93 | 4.09B | s |

TABLE V: Val. error (%) on ImageNet dataset.

Fig. 10: Top-1 error rates (%) of training and validation with our three base models (VGG-19, ResNet-50 and ResNet-101) on iMaterialist products dataset.

Our Learning-Based Ensemble Results on ImageNet. As shown in Table V, we compare our ensemble method with the original model and the traditional ensemble. We use VGG-19 w/BN and ResNet-50 as our teachers, and use ResNet-50 as the student. The #FLOPs and inference time of the traditional ensemble are the sums of those of the individual models, whereas our student costs no more than a single ResNet-50. Therefore, our method has both better performance and higher efficiency. Most notably, our MEAL Plus (which denotes using more powerful teachers such as ResNet-101/152) yields an error rate of Top-1 21.79%, Top-5 5.99% on ImageNet, far outperforming the original ResNet-50 (23.85%/7.13%) and the traditional ensemble (22.76%/6.49%). This shows great potential on large-scale, real-world datasets.
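The cost and accuracy comparison can be checked with a few lines of arithmetic on the numbers reported in Table V. The variable names are our own; the figures are copied from the table.

```python
# #FLOPs (in billions) reported in Table V.
vgg19_flops = 19.52
resnet50_flops = 4.09

# The traditional ensemble runs both teachers, so its cost is the sum.
ensemble_flops = round(vgg19_flops + resnet50_flops, 2)
flops_ratio = ensemble_flops / resnet50_flops

# Error reductions (percentage points) of MEAL Plus J (21.79 / 5.99).
top1_gain_vs_resnet50 = round(23.85 - 21.79, 2)
top5_gain_vs_resnet50 = round(7.13 - 5.99, 2)
top1_gain_vs_ensemble = round(22.76 - 21.79, 2)
top5_gain_vs_ensemble = round(6.49 - 5.99, 2)

print(ensemble_flops, round(flops_ratio, 2))
print(top1_gain_vs_resnet50, top5_gain_vs_resnet50)
print(top1_gain_vs_ensemble, top5_gain_vs_ensemble)
```

The sum reproduces the 23.61B reported for the traditional ensemble, which is between five and six times the cost of the single ResNet-50 student.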

Fig. 11: Accuracy curves of our Iterative Refinement method during training under different re-training budgets on iMaterialist products dataset.

| Network | Val Set (%) | Test Set (%) |
|---|---|---|
| VGG-19 [57] | 11.48 | 10.76 |
| ResNet-50 [18] | 10.03 | 9.24 |
| ResNet-101 (baseline) [18] | 9.19 | 8.96 |
| MEAL (ResNet-101) | 8.16 | 7.81 |
| MEAL w/ MixUp [72] | - | 7.57 |
| MEAL w/ CutMix [71] | - | 7.06 |
| MEAL w/ (CutMix [71] + Cosine LR [44]) | - | 6.89 |

TABLE VI: Top-3 error rate (%) on iMaterialist products dataset.

Fig. 12: Error rate (%) on CIFAR-10 with MobileNet, VGG-19 w/BN and DenseNet.

8.5 Results of Noisy Data Refinement

Base Model Training (Teachers). We first train our base models following the parameter settings from ImageNet with three network structures: VGG-19, ResNet-50 and ResNet-101. In particular, we use the ImageNet pre-trained networks as initial parameters. The initial learning rate is set to 0.01 and divided by 10 after 45 epochs. The total training budget is 60 epochs. The full training error curves are illustrated in Fig. 10. We can observe that, because of the large percentage of noisy labels in the training set, the training errors are higher than those on the validation set (the validation and testing sets are cleaned manually by the organizers). The results on the testing set are shown in Table VI. The baseline result is 8.96%, and our MEAL outperforms the baseline by 1.15% (8.96% vs. 7.81%). We further adopt recently proposed data augmentation methods [72, 71] to verify whether our model overfits to the training data. From Table VI we can see that after using CutMix [71] and the Cosine Learning Rate schedule [44], our result further improves to 6.89%, which demonstrates that our model does not overfit to the noisy training data and still has room to improve.
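The two learning rate schedules mentioned above can be sketched as follows. The step schedule follows the stated settings (0.01, divided by 10 after 45 epochs, 60 epochs in total). The cosine variant is a single-cycle annealing without warm restarts and with a minimum rate of zero, which is our assumption since the exact configuration is not given.

```python
import math

BASE_LR = 0.01      # initial learning rate from the text
DECAY_EPOCH = 45    # the rate is divided by 10 from this epoch on (0-indexed)
TOTAL_EPOCHS = 60   # total training budget from the text


def step_lr(epoch):
    """Step schedule: constant rate, then a single 10x drop."""
    return BASE_LR / 10 if epoch >= DECAY_EPOCH else BASE_LR


def cosine_lr(epoch, min_lr=0.0):
    """Single-cycle cosine annealing (assumed: no restarts, min_lr = 0)."""
    cos_term = 1 + math.cos(math.pi * epoch / TOTAL_EPOCHS)
    return min_lr + 0.5 * (BASE_LR - min_lr) * cos_term


for epoch in (0, 30, 44, 45, 59):
    print(epoch, step_lr(epoch), round(cosine_lr(epoch), 6))
```

The cosine schedule lowers the rate smoothly from the first epoch, while the step schedule keeps it at 0.01 for the first 45 epochs and uses 0.001 for the remaining 15.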

Iterative Refinement. We then iteratively refine our MEAL model with the strategy described above. In each round, we replace the teacher with the model trained in the previous round, which can generate better probabilities as supervision for training the student model. We show the training accuracy curves of the first and second re-training rounds in Fig. 11. The second re-training round clearly performs better than the first, which verifies the effectiveness of our iterative refinement strategy.
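The control flow of this refinement can be sketched as a short loop. The callable `train_student` is a hypothetical placeholder for one full MEAL training run supervised by the given teacher, and the toy "models" below are just strings that record their lineage.

```python
def iterative_refinement(base_teacher, train_student, num_rounds):
    """Run num_rounds of re-training, promoting each student to teacher."""
    teacher = base_teacher
    students = []
    for _ in range(num_rounds):
        # Hypothetical callable: trains a fresh student on the teacher's
        # soft predictions and returns the trained model.
        student = train_student(teacher)
        students.append(student)
        # The newly trained student supervises the next round.
        teacher = student
    return students


# Toy stand-in for training: a "model" is a label describing its teacher.
rounds = iterative_refinement(
    "base", lambda teacher: f"student_of({teacher})", num_rounds=2
)
print(rounds)
```

Only the supervision source changes between rounds; the training data and the student architecture are assumed to stay fixed in this sketch.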

8.6 Analysis

Effectiveness of Ensemble Size. Fig. 12 displays the performance of three architectures on CIFAR-10 as the ensemble size is varied. Although ensembling more models generally gives better accuracy, we have two important observations. First, we observe that our single-model “ensemble” already outperforms the baseline model by a remarkable margin, which demonstrates the effectiveness of adversarial learning. Second, we observe some drops in accuracy for the VGGNet and DenseNet networks when too many models are included in the ensemble for training. In most cases, an ensemble of four models obtains the best performance.

Fig. 13: Error rates (%) of different re-training time with our three base models (VGG-19, ResNet-50 and ResNet-101) on iMaterialist products testing set. “0” indicates the base model performance (teachers).
Fig. 14: Accuracy of our ensemble method under different training budgets on CIFAR-10.

Budget for Training. On CIFAR datasets, the standard training budget is 300 epochs. Intuitively, our ensemble method can benefit from a larger training budget, since we use the diverse soft distributions as labels. Fig. 14 displays the relation between performance and training budget. It appears that a budget of more than 400 epochs is the best choice, and our model fully converges at about 500 epochs.

Effectiveness of Re-training Number. Fig. 13 displays the performance of three networks on the iMaterialist products dataset as the number of re-training rounds is varied. We can observe that the first re-training round generally gives the largest improvement on all three networks. After that, further re-training provides only a limited boost, but can still increase the accuracy.

(a) SqueezeNet vs. VGGNet
(b) ResNet vs. DenseNet
(c) AlexNet vs. VGGNet
(d) VGGNet vs. ResNet
Fig. 15: Probability Distributions between five networks.

Diversity of Supervision. We hypothesize that different architectures create soft labels which are not only informative but also diverse with respect to object categories. We qualitatively measure this diversity by visualizing the pairwise correlation of softmax outputs from two different networks. To do so, we compute the softmax predictions for each training image in the ImageNet dataset and visualize each pair of corresponding predictions. Fig. 15 displays the bubble maps for four pairs of architectures. In the top-left figure, the coordinate of each bubble is a pair of $k$-th predictions $(p^k_{\mathrm{SqueezeNet}}, p^k_{\mathrm{VGGNet}})$, $k = 1, 2, \ldots, 1000$, and in the top-right figure it is $(p^k_{\mathrm{ResNet}}, p^k_{\mathrm{DenseNet}})$. If the label distributions from two networks are identical, the bubbles lie on the main diagonal. Interestingly, the top-left pair (weaker networks) shows greater diversity than the top-right pair (stronger networks). This makes sense because stronger models generally tend to generate predictions close to the ground truth. In brief, these differences in predictions can be exploited to create effective ensembles, and our method is capable of improving competitive baselines using this kind of diverse supervision.
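A rough numeric counterpart to the bubble maps can be sketched as follows: we collect the per-class prediction pairs from two networks and measure how far they fall from the diagonal. The mean absolute gap used here is our own illustrative summary, not a measure reported in the experiments, and the probabilities are made up.

```python
def prediction_pairs(probs_a, probs_b):
    """Pair up the k-th class probabilities of two networks for every image."""
    return [
        (a, b)
        for image_a, image_b in zip(probs_a, probs_b)
        for a, b in zip(image_a, image_b)
    ]


def mean_abs_gap(pairs):
    """Average |a - b|; it is zero when all pairs lie on the diagonal."""
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical softmax outputs of two networks on two images, three classes.
net_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
net_b = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

pairs = prediction_pairs(net_a, net_b)
gap = mean_abs_gap(pairs)
identical_gap = mean_abs_gap(prediction_pairs(net_a, net_a))
print(round(gap, 4), identical_gap)
```

A larger gap corresponds to bubbles scattered further from the diagonal, i.e. to more diverse supervision from the two networks.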

9 Conclusion

We have presented MEAL, a learning-based ensemble method that can compress multi-model knowledge into a single network with adversarial learning. Our experimental evaluation on CIFAR-10/100, SVHN, ImageNet and the iMaterialist Products Dataset verified the effectiveness of our proposed method, which achieved state-of-the-art accuracy for a variety of network architectures. Our future work will focus on adopting MEAL for cross-domain ensemble and adaptation.


  • [1] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton (2018) Large scale distributed neural network training through online distillation. In ICLR, Cited by: §2.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. Cited by: §2.
  • [3] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi (2018) Label refinery: improving imagenet classification through label progression. In ECCV, Cited by: §2, §2.
  • [4] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem (2018) Finding tiny faces in the wild with generative adversarial network. pp. 21–30. Cited by: §2.
  • [5] T. Chen, I. Goodfellow, and J. Shlens (2016) Net2net: accelerating learning via knowledge transfer. In ICLR, Cited by: §2.
  • [6] Y. Chen, N. Wang, and Z. Zhang (2018) Darkrank: accelerating deep metric learning via cross sample similarities transfer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [7] E. J. Crowley, G. Gray, and A. J. Storkey (2018) Moonshine: distilling with cheap convolutions. In Advances in Neural Information Processing Systems, pp. 2888–2898. Cited by: §2.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Adversarial-Based Knowledge Distillation for Multi-Model Ensemble and Noisy Data Refinement, Fig. 1, §4.3, §7, §8.1.
  • [9] T. G. Dietterich (2000) Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: §2.
  • [10] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow (2017) Many paths to equilibrium: gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446. Cited by: §2.
  • [11] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. arXiv preprint arXiv:1805.04770. Cited by: §2.
  • [12] X. Gastaldi (2017) Shake-shake regularization. arXiv preprint arXiv:1705.07485. Cited by: §1, §8.2, TABLE III, Algorithm 2, footnote 1.
  • [13] J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. In ICLR, Cited by: §2.
  • [14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. In ICML, Cited by: §8.1.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2, §2, §4.3, §5.2, Algorithm 1.
  • [16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §2.
  • [17] L. K. Hansen and P. Salamon (1990) Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence 12 (10), pp. 993–1001. Cited by: §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §8.2, TABLE II, TABLE IV, TABLE VI, Algorithm 2.
  • [19] B. Heo, M. Lee, S. Yun, and J. Y. Choi (2018) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. arXiv preprint arXiv:1811.03233. Cited by: §2.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2, §2.
  • [21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §8.2, TABLE III, Algorithm 2.
  • [22] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get m for free. In ICLR, Cited by: Fig. 2, §2, §8.1, §8.1, §8.1.
  • [23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §8.2, TABLE II, TABLE III, TABLE IV, Algorithm 2.
  • [24] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661. Cited by: §1, §2, §8.1.
  • [25] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §2.
  • [26] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §8.2.
  • [27] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.
  • [28] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2017) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055. Cited by: §2.
  • [29] J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1219–1228. Cited by: §2.
  • [30] N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: §2.
  • [31] I. Krasin, T. Duerig, N. Alldrin, A. Veit, et al. (2016) OpenImages: a public dataset for large-scale multi-label and multi-class image classification.. Cited by: Adversarial-Based Knowledge Distillation for Multi-Model Ensemble and Noisy Data Refinement, §7.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §8.1.
  • [33] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §4.3, §8.1.
  • [34] A. Krogh and J. Vedelsby (1995) Neural network ensembles, cross validation, and active learning. In Advances in neural information processing systems, pp. 231–238. Cited by: §2.
  • [35] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §2.
  • [36] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Fractalnet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648. Cited by: §8.1.
  • [37] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial Intelligence and Statistics, Cited by: §8.1.
  • [38] K. Lee, X. He, L. Zhang, and L. Yang (2018) Cleannet: transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. Cited by: §2.
  • [39] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: §2, §2.
  • [40] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1910–1918. Cited by: §2, §2, §2.
  • [41] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §2.
  • [42] T. Liu and D. Tao (2015) Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38 (3), pp. 447–461. Cited by: §2.
  • [43] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §8.1, §8.1.
  • [44] I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §8.5, TABLE VI.
  • [45] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Fig. 1, §1.
  • [46] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §2.
  • [47] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, Vol. 2011, pp. 5. Cited by: §4.3, §8.1.
  • [48] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar (2017) Semi-supervised knowledge transfer for deep learning from private training data. In ICLR, Cited by: §2.
  • [49] W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §2.
  • [50] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §8.
  • [51] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §2.
  • [52] M. P. Perrone and L. N. Cooper (1995) When networks disagree: ensemble methods for hybrid neural networks. In How We Learn; How We Remember: Toward an Understanding of Brain and Neural Systems: Selected Papers of Leon N Cooper, pp. 342–358. Cited by: §2.
  • [53] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050. Cited by: §2.
  • [54] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) Fitnets: hints for thin deep nets. In International Conference on Learning Representations, Cited by: §8.1.
  • [55] Z. Shen, Z. He, and X. Xue (2019) MEAL: multi-model ensemble via adversarial learning. In AAAI, Cited by: §1.
  • [56] Z. Shen, M. Huang, J. Shi, X. Xue, and T. Huang (2019) Towards instance-level image-to-image translation. In CVPR, Cited by: §2.
  • [57] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §8.2, TABLE I, TABLE II, TABLE III, TABLE IV, TABLE VI, Algorithm 2.
  • [58] S. Singh, D. Hoiem, and D. Forsyth (2016) Swapout: learning an ensemble of deep architectures. In Advances in neural information processing systems, pp. 28–36. Cited by: §1, §2.
  • [59] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. Cited by: §1, §2, §8.4, TABLE IV.
  • [60] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus (2014) Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080. Cited by: §2, §2.
  • [61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, et al. (2015) Going deeper with convolutions. In CVPR, Cited by: TABLE II, TABLE IV.
  • [62] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §2, §4.1.
  • [63] J. Uijlings, S. Popov, and V. Ferrari (2018) Revisiting knowledge transfer for training object class detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1101–1110. Cited by: §2.
  • [64] A. Vahdat (2017) Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605. Cited by: §2.
  • [65] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie (2017) Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847. Cited by: §2.
  • [66] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §1, §2.
  • [67] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §2.
  • [68] Z. Xu, Y. Hsu, and J. Huang (2017) Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513. Cited by: §2.
  • [69] C. Yang, L. Xie, C. Su, and A. L. Yuille (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2859–2868. Cited by: §2.
  • [70] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §2, §2.
  • [71] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899. Cited by: §8.5, TABLE VI.
  • [72] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §8.5, TABLE VI.
  • [73] T. Zhou, S. Wang, and J. A. Bilmes (2018) Diverse ensemble evolution: curriculum data-model marriage. In Advances in Neural Information Processing Systems, pp. 5905–5916. Cited by: §2, §2.
  • [74] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.
  • [75] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476. Cited by: §2.
  • [76] X. Zhu, S. Gong, et al. (2018) Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems, pp. 7517–7527. Cited by: §2.