One intrinsic property of deep neural networks is that they constitute a black box, with no interpretable way of describing how they perform the tasks they are trained for. While there has been significant progress in understanding how feature maps represent human-interpretable features [Mordvintsev2015DeepDream, kim2018interpretability], we still lack a complete picture of how neural networks operate. This limits our ability to extend and modify existing neural networks. Recently, knowledge distillation [romero2014fitnets, Hinton2014KD] has emerged as a robust way of transferring knowledge across different neural networks [crowley2018moonshine, hahn2019self, Wang2020-lh]. In the teacher-student framework, the student is taught to mimic the prediction layer of the teacher. The knowledge distillation approach has been an instrumental technique for model compression [crowley2018moonshine, hahn2019self], as well as for pushing the state of the art by harnessing large amounts of unlabeled data to improve the performance of student models, as in Noisy Student [xie2020selftraining] and many others; see [Wang2020-lh] for a survey.
Despite its successes, knowledge distillation in its classical form has a critical limitation. It assumes that the real training data is still available in the distillation phase. However, in practice, the original training data is often unavailable due to privacy concerns. A typical example is medical imaging records [Frid2018GANMedImageAug, Han2019GANMedImageAug], whose availability is time-limited due to security and confidentiality. Similarly, many large models are trained on millions [Deng2009ImageNet] or even billions [Mahajan2018InsHashTag] of samples. While the pre-trained models might be made available for the community at large, making the training data available poses many technical and policy challenges. Even storing the training data is often infeasible for all but the largest users. Furthermore, in some training frameworks such as federated learning [Konecny2016FederatedLearning], the model is tuned by collecting gradient updates calculated on distributed devices, and the training data disappears as soon as it has been used, thus requiring even more data each time the model needs to be updated.
To address this issue, a few approaches to data-free knowledge distillation, i.e., distilling models under a data-free regime, have been previously proposed. However, they are either highly time-consuming in producing synthetic images or unable to scale to large datasets. We discuss prior research in this field in Section 2.
In this work, we adapt the idea of generative image modeling to attain efficient data generation, and investigate ways to scale it to large datasets. We propose a generative data-free distillation method, as illustrated in Figure 1, which trains a generator without using the original training data and uses it to produce substitute data for knowledge distillation. Our generator minimizes two optimization objectives: (1) an inceptionism loss, in which the generator maximizes the activation of the teacher-network logit corresponding to the target class; and (2) a moment matching loss, which aligns the statistics of the teacher's internal activations on synthetic images with those recorded on the real data. Variants of the moment matching loss have been explored before in non-generative data-free image synthesis methods such as [Yin2020DeepInversion, Haroush2020StatMatchQuant]. We also note that the required statistics are often available as part of the batch-normalization [Ioffe2015BN] layers present in nearly all modern architectures such as ResNets [He2016ResNet], DenseNets [Huang2017DenseNet], MobileNets [Howard2017MobileNetV1] and their variants. Under an isotropic Gaussian assumption on the internal activations, we can explicitly minimize either their Kullback–Leibler divergence or the l2-norm of their difference.
We then follow the DeepDream-style [Mordvintsev2015DeepDream] image synthesis method to employ the inceptionism loss. The general idea is to find an input image that maximizes the probability of a certain category being predicted by the pre-trained teacher, which can be naturally formulated as a cross-entropy minimization problem. Combining this with the aforementioned moment matching loss, given only a pre-trained teacher model, we are able to train a generator without using real images, which can effectively produce synthetic images for distillation.
To demonstrate the effectiveness of the proposed method, we design an empirical study on three image classification datasets of increasing size and complexity. We first conduct an experiment of data-free distillation on CIFAR-10 and CIFAR-100. The generator trained without using real images is able to produce higher-quality and more realistic images than previous methods. These images can also effectively support the subsequent knowledge distillation. The learned student outperforms the previous methods by a clear margin, achieving a new state-of-the-art result which is even better than its supervised-trained counterpart. We then explore using an ensemble of multiple generators on CIFAR-100 and ImageNet and demonstrate its ability to further improve the distillation result.
Our main contributions can be summarized as follows:
We propose a new method for training an image generator from a pre-trained teacher model, which efficiently produces synthetic inputs for knowledge distillation.
We push forward the state of the art of data-free distillation on the CIFAR-10 and CIFAR-100 datasets, achieving results that are even better than the supervised-trained counterparts.
We scale the generative data-free distillation method to ImageNet by using multiple generators. This is the first success of data-free distillation on ImageNet using generative models to the best of our knowledge.
2 Related Work
Our approach can be viewed as a combination of two components, generative modeling and knowledge distillation, each of which has attracted considerable attention from the deep learning community over the past years. Here we provide a brief overview of these fields in the context of our work.
Generative Image Modeling. Generative Adversarial Networks (GANs) [Goodfellow2014GAN], perhaps the most celebrated approach to image generation, together with numerous variants of this general method [Brock2019BigGAN, Arjovsky2017-as, Karras2017-vv], have shown tremendous potential for generating high-fidelity synthetic images based on a limited corpus of training samples. One appealing application of synthetic image generation is data augmentation. Recent studies have employed GAN-based augmentation to improve model performance in data-restricted scenarios such as medical imaging [Frid2018GANMedImageAug, Han2019GANMedImageAug]. However, as stated in [Goodfellow2019AAAITalk], this approach has not shown much success in practice on large-scale data. It was also observed that while the generated images are of extremely high quality, using them exclusively for training leads to a significant performance degradation [Yin2020DeepInversion].
Another promising approach to image synthesis is based on recent work on reversible networks [Behrmann2019iResNet, Gomez2017RevNet, Jacobsen2018iRevNet]. These studies explore reversible models, in which the transformations from one layer to the next are invertible, allowing layer activations to be reconstructed from the outputs of the following layer. The initial motivation was to save memory by computing the activations on the fly during backpropagation, while later researchers have also discovered their potential for image generation [Asim2020iVNetDenoise]. We stress here that while GAN and GAN-like methods generate very realistic and high-quality images, all aforementioned methods require access to the original data to build their generators.
Knowledge Distillation. Knowledge distillation [Hinton2014KD] is a general technique that can be used for model compression. It transfers knowledge from a pre-trained network, the teacher, to another student network by teaching the student to mimic the teacher’s behaviour. In a typical image classification task, this is usually done by aligning the probabilities predicted by the teacher and student network. In recent literature, there have been numerous variations of knowledge distillation in terms of application domains and distillation strategies. For an overview of general distillation techniques we refer the reader to knowledge distillation surveys such as [Wang2020-lh].
Data-Free Knowledge Distillation. The problem of knowledge distillation becomes much more challenging when the original training data is not available at the time the student model is trained. This is often encountered in privacy-sensitive and data-restricted scenarios. Most approaches to this scenario center around synthetic image generation. [lopes2017data] made a first attempt to pre-compute and store activation statistics for each layer of the teacher network, with the goal of constructing synthetic inputs that produce similar activations. Follow-up works [nayak2019zero, Yin2020DeepInversion] have developed the approach further by using less meta-data or proposing different optimization objectives. These methods typically obtain the synthetic inputs by directly optimizing trainable random noise with respect to a pre-determined objective, where each input image requires multiple iterations to converge. Therefore, it can be costly and time-consuming to produce sufficient data for compression. There are also a few methods that synthesize input data via generative image modeling [Chen2019DAFL, Fang2019DFAD, micaelli2019zero], which create substitute data much more efficiently than optimizing input noise. However, scaling them to tasks on large datasets, e.g., the ImageNet classification task, remains challenging.
3 Generative Distillation in Data-Free Setting
In this section, we first briefly recall the classical knowledge distillation method and then introduce our approach for building a generative model from a pre-trained teacher.
Generally, we denote random variables with a bold serif font. By contrast, sampled values of such variables and deterministic tensors are denoted with a regular serif font, which is typically the case when we discuss loss objectives with respect to a single input. Loss functions, denoted by L with a subscript abbreviating the particular loss component, typically involve averaging over the probability distributions entering them. Without ambiguity, we may slightly abuse this notation to also denote the deterministic loss value with respect to a single input.
3.2 Knowledge Distillation
Knowledge distillation aims to transfer knowledge from a (typically larger) teacher network T into a smaller student S, where T and S are usually differentiable functions represented by neural networks with parameters θT and θS respectively, and x is the model input. In the setting of a classification task, T(x) and S(x) typically output a probability distribution over the different possible categories. The student is trained to mimic the behavior of the teacher network by matching the probability distribution produced by the teacher on the training data. Formally, knowledge distillation can be modeled as a minimization of the following objective:
where KL(·‖·) refers to the Kullback–Leibler divergence that evaluates the discrepancy between the distributions produced by the teacher and student networks, and p(x) denotes the training data distribution.
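To make the objective concrete, below is a minimal plain-Python sketch of the per-input distillation loss; the probability vectors are hypothetical stand-ins for the softmax outputs of the teacher and student networks.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_probs, student_probs):
    """Knowledge-distillation objective for one input: the student is
    penalized for deviating from the teacher's predictive distribution."""
    return kl_divergence(teacher_probs, student_probs)

# A student that exactly matches the teacher incurs (near-)zero loss;
# a mismatched student incurs a strictly larger one.
teacher = [0.7, 0.2, 0.1]
loss_same = distillation_loss(teacher, [0.7, 0.2, 0.1])
loss_diff = distillation_loss(teacher, [0.1, 0.2, 0.7])
```

In the full objective this per-input quantity is averaged over inputs drawn from the data distribution, which is exactly what the data-free setting lacks.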
3.3 Generative Image Modeling
Computing the loss objective in Equation (1) requires knowledge of the data distribution p(x), which is not available in a data-free setting. Instead, we approximate p(x) with the distribution of a generator trained to mimic the original data. We learn the generator distribution conditioned on the class label y, given the trained teacher T, by introducing a latent variable z with a prior p(z) and representing the conditional distribution as a marginal over z, essentially learning a deterministic deep generator G(z, y). The generator is then trained without access to p(x), using only the trained teacher model T. Now the key is to find appropriate objectives for training the generator. These objectives are introduced in the remainder of this section.
Inceptionism loss. Inceptionism-style [Mordvintsev2015DeepDream] image synthesis, also known as DeepDream, is a way to visualize input images that provoke a particular response of a trained neural network. For instance, say we want to know what kind of image would result in the "dog" class being predicted by the model. The inceptionism method would start with a trainable image initialized with random noise, and then gradually tweak it towards the most "dog-like" image by maximizing the probability of the dog category being produced by the model. Formally, given the expected label y and the trained teacher T, we find the input x that minimizes the cross-entropy of the categorical distribution T(x) relative to y:
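A minimal sketch of this cross-entropy objective, with raw logits standing in for a real teacher network (the logit values below are purely illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def inceptionism_loss(logits, target_class):
    """Cross-entropy of the model's categorical output relative to a
    one-hot target: minimizing it maximizes p(target_class | image)."""
    probs = softmax(logits)
    return -math.log(probs[target_class] + 1e-12)

# Logits that favor the target class yield a lower loss.
low = inceptionism_loss([5.0, 0.0, 0.0], target_class=0)
high = inceptionism_loss([0.0, 5.0, 0.0], target_class=0)
```

In the actual method, the gradient of this loss with respect to the input image (or generator parameters) drives the synthesis.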
In practice we usually do not optimize this objective alone, but also impose a prior constraint that the synthetic images mimic the statistics of natural images, such as particular correlations between neighboring pixels. This is done by adding a regularization term to Equation (2):
where in this paper we follow [Yin2020DeepInversion, Haroush2020StatMatchQuant] and use a total variation loss and an l2-norm loss as the regularizers:
where the two terms penalize the total variation and the l2-norm of the image, each with its own scaling weight.
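The two regularizers can be sketched as follows. The pixel-grid representation and the anisotropic form of the total variation are simplifying assumptions made for this illustration:

```python
def total_variation(img):
    """Anisotropic total variation of a 2-D grayscale image (list of rows):
    sums absolute differences between horizontally and vertically
    neighboring pixels, penalizing high-frequency noise."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv

def l2_norm_sq(img):
    """Squared l2-norm of the image, discouraging extreme pixel values."""
    return sum(p * p for row in img for p in row)

flat = [[0.5, 0.5], [0.5, 0.5]]   # smooth image: zero total variation
noisy = [[0.0, 1.0], [1.0, 0.0]]  # checkerboard: maximal total variation
```

Minimizing these terms biases the synthetic images toward the smoothness and moderate intensity range typical of natural photographs.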
Moment matching loss. The inceptionism loss by itself only constrains the input (images) and output (probabilities) of the trained network, while leaving the activations of internal layers unconstrained. Previous studies have observed that different layers of a deep convolutional network are likely to perform different tasks [Lee2009Conv, Luo2019AdaBound, Gu2018ConvAdavances], i.e., lower layers tend to detect low-level features such as edges and curves, while higher layers learn to encode more abstract features. In addition, [Haroush2020StatMatchQuant] showed that images learned with the conventional inceptionism method may result in anomalous internal activations deviating from those observed for real data. These facts suggest adding a regularization term that constrains the statistics of the teacher's intermediate layers as well.
Batch normalization [Ioffe2015BN] layers, a common component of most neural networks, are helpful in providing such statistics [Yin2020DeepInversion, Haroush2020StatMatchQuant]. The normalization operation is designed to normalize layer activations by re-centering and re-scaling them with the moving-averaged mean and variance calculated during training. In other words, a batch-norm layer implicitly stores the estimated layer statistics of the teacher over the original data. Therefore, we can force the layer statistics (specifically the mean and variance) produced by our synthetic images to align with those emerging from the real data [Salimans2016, WardeFarley2017, Nogueira2019].
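As a rough illustration of why these running estimates track the real-data statistics, here is a minimal sketch of batch-norm-style moving-average bookkeeping; the scalar activations and the momentum value of 0.1 are illustrative choices, not the paper's settings:

```python
def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """Batch-norm-style exponential moving average of activation statistics.
    During training, these running estimates implicitly record the layer
    statistics of the real data."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * var
    return new_mean, new_var

# Repeated updates on batches with mean 5 and variance 1 converge there.
m, v = 0.0, 1.0
for _ in range(200):
    m, v = update_running_stats(m, v, [4.0, 6.0])
```

After training, these converged estimates are exactly the "free" meta-data the data-free setting can exploit.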
Given the running estimates for the mean and the variance of a teacher batch-norm layer, we measure the mean and variance activated by the synthetic image and minimize the discrepancy between the real-time statistics and the estimated ones. Under an isotropic Gaussian assumption, this can be done by minimizing their Kullback–Leibler divergence:
or the l2-norm of their differences:
where N(·, ·) stands for a Gaussian distribution and ‖·‖2 denotes the l2-norm. In this paper, we choose the latter and formulate the moment matching loss by summing these penalties across all the batch-norm layers:
where the sum is weighted by a scaling coefficient, and the recorded per-layer mean and variance statistics of the activations on the real data serve as the targets.
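A simplified sketch of the resulting penalty, using one scalar (mean, variance) pair per layer instead of the per-channel vectors a real batch-norm layer stores:

```python
def moment_matching_loss(layer_stats, bn_stats, weight=1.0):
    """Squared-l2 discrepancy between the (mean, variance) measured on
    synthetic images and the (mean, variance) recorded in each batch-norm
    layer, summed across layers and scaled by a weight."""
    loss = 0.0
    for (mu, var), (mu_ref, var_ref) in zip(layer_stats, bn_stats):
        loss += (mu - mu_ref) ** 2 + (var - var_ref) ** 2
    return weight * loss

stored = [(0.0, 1.0), (0.5, 2.0)]  # per-layer running stats of the teacher
matching = moment_matching_loss(stored, stored)          # perfect match
off = moment_matching_loss([(1.0, 1.0), (0.5, 2.0)], stored)  # first-layer mean off by 1
```

Only synthetic images whose internal activation statistics resemble those of real data drive this term to zero.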
Generator training objective. Combining the inceptionism loss and moment matching loss together we get
Recall that our final goal is to utilize these objectives to train a generative model. By substituting the trainable image x with the generator output G(z, y) in Equation (8), we define the final generator training objective as:
Using multiple generators. Mode collapse is a common problem plaguing various generative models like GANs [Mao2019MSGANModeCollapose, Goodfellow2014GAN, Salimans2016ImprovedGAN], where instead of producing a variety of different images, the generator produces a distribution with a single image or just a few variations, and the generated samples are almost independent of the latent variable. We hypothesize that in our case, if the generator occasionally produces an image corresponding to a high-confidence teacher prediction, the cross-entropy loss vanishes and the generator may learn to produce essentially only that output, even if other loss components are not fully optimized. Figure 2 illustrates a typical example of mode collapse that happened to our generator: it is able to generate realistic images for the "automobile" class, but all of the generated objects are red.
As suggested in previous literature [Ghosh2018MADGAN, Liu2016CoGAN], training multiple generators can be a very simple but powerful way of alleviating this issue.
For our approach, we use a setup with multiple generators and partition all classes across them, so that each class is assigned to exactly one generator. Each generator only optimizes the inceptionism loss for the classes assigned to it. For the moment matching, instead of using the moments pre-stored in batch-norm layers, we use more precise per-category moments for each generator. The way of estimating these moments is discussed in Section 4.2.
In this section, we turn to an empirical study to evaluate the effectiveness of our method on datasets of increasing size and complexity. We first conduct a series of ablation studies on CIFAR-10 (32×32 images; 10 categories) to verify the effect of each optimization objective for generator training, and then demonstrate the performance of using a single generator or per-class generators on CIFAR-100 (32×32 images; 100 categories). Finally, we extend to ImageNet (224×224 images; 1,000 categories) to show how our method scales to larger and more complex datasets.
Experimental setup. The CIFAR-10 dataset [Krizhevsky2009LearningML] consists of 50K training images and 10K testing images from 10 classes. To make a fair comparison, we follow the setting used in previous literature [Chen2019DAFL, Fang2019DFAD, Yin2020DeepInversion], with a pre-trained ResNet-34 as the teacher network and ResNet-18 as the student. The generator architecture is identical to the one used in [Chen2019DAFL, Fang2019DFAD]. The generator is trained using the Adam optimizer [Kingma2014-vr] with a learning rate of . We use the batch size of , , , and . For CIFAR-10 we only use the single-generator mode. More details on the experimental setup can be found in the supplementary materials.
Table 1 (excerpt): with a ResNet-18 student, Knowledge Distillation [Hinton2014KD] reaches 94.34% and Adaptive DeepInversion [Yin2020DeepInversion] reaches 93.26%.
Experimental results. Test accuracies obtained using different methods on CIFAR-10 are summarized in Table 1. Trained in a fully supervised setting, the teacher (ResNet-34) and student (ResNet-18) models achieve test accuracies of and respectively. The student gains a further improvement by performing knowledge distillation from the teacher using the original training data. In the data-free setting, as was previously observed, using simple Gaussian noise as the input distribution leads to very poor performance, only slightly better than a random guess (). This is not unexpected, since the real data distribution looks very different from Gaussian noise and is generally expected to concentrate on a lower-dimensional manifold embedded in a high-dimensional space.
As part of our ablation study, we first trained the generator using the inceptionism loss alone. The resulting student model trained on this generator reached an accuracy of , which is better than random noise but significantly lower than the supervised accuracy of . In another experiment, where the generator was trained with the moment matching loss alone, the resulting student reached a much higher accuracy of . Combining both objectives brings the accuracy of our final method to , now almost indistinguishable from the accuracy of the original larger teacher model and higher than the accuracy obtained with distillation on the original training dataset.
Visualization. Finally, we provide several example images produced by DAFL [Chen2019DAFL], Adaptive DeepInversion (ADI) [Yin2020DeepInversion] and our generator in Figure 3. As we can see, although ADI can produce images of much higher quality than earlier methods, it tends to synthesize images with different textures but similar backgrounds (e.g., for the categories horse, ship and truck). In contrast, our method generates more realistic images, which are likely to have a distribution closer to that of the original data.
Experiment setup. Like CIFAR-10, the CIFAR-100 dataset [Krizhevsky2009LearningML] also consists of 50K training images and 10K testing images, but the images are categorized into 100 classes, which makes it more diverse than CIFAR-10. In most of our experiments, we use the same model architectures and training hyperparameters as in our CIFAR-10 experiments, with the only exception of one hyperparameter, which is now chosen differently. More details about the experimental setup can be found in the supplementary materials.
Single generator. Test accuracies obtained using knowledge distillation with a single generator are listed in Table 2. Our method achieves an accuracy of on the test set, which outperforms all previous approaches by a large margin. However, this result is still slightly worse than the test accuracy of a ResNet-18 network trained in a supervised setting, or distilled from the teacher on the training data.
Table 2 (excerpt): with a ResNet-18 student, Knowledge Distillation [Hinton2014KD] reaches 76.87%.
Multiple generators. We consider two ways of gathering per-class statistics. One direct way of accumulating them is to: (a) collect a small subset of the training images, (b) feed them to the pre-trained teacher to compute the required moments in each layer, and (c) serve these moments as meta-data during generator training. Although we only need a small number of images to gather such statistics, this can no longer be thought of as a purely data-free approach. In the strictest setting, where the training has to be data-free, there is another option: we can learn several batches of trainable images using Equation (8) as the optimization objective [Yin2020DeepInversion, Haroush2020StatMatchQuant]. Following this approach, we can also obtain a small amount of (synthetic) images to measure per-class statistics. Specifically, we sample images per class from the training data, or learn the same amount of images in a data-free manner, to compute the per-class statistics for each class, which are then used to train the generators. When performing distillation, we simply sample images uniformly at random from all generators.
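Either way, once a few activation samples per class are available, the per-class moments reduce to simple mean/variance estimates. A minimal sketch, with scalar activations standing in for the real per-layer activation tensors:

```python
def per_class_statistics(samples):
    """Estimate per-class activation statistics from a few samples per
    class. `samples` maps a class label to a list of scalar activations
    (a stand-in for real per-layer activation tensors)."""
    stats = {}
    for label, acts in samples.items():
        n = len(acts)
        mean = sum(acts) / n
        var = sum((a - mean) ** 2 for a in acts) / n
        stats[label] = (mean, var)
    return stats

# Two classes, two activation samples each (illustrative values).
stats = per_class_statistics({0: [1.0, 3.0], 1: [2.0, 2.0]})
```

Each per-class generator then matches its synthetic-image statistics against the entry of `stats` for its own classes, rather than against the global batch-norm moments.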
The results of knowledge distillation with the collection of generators are shown in the last two rows in Table 2. Both methods outperform distillation with a single generator and, perhaps more remarkably, a ResNet-18 model trained in a supervised fashion, or the same model distilled on the original dataset. Finally, we see that both methods exhibit very similar performance, which suggests that we have a freedom to choose a particular approach based on the actual use case. In a scenario where we can pre-record activation statistics during the teacher training phase, it might be convenient to use the approach relying on meta-data collection. Otherwise, we may choose the alternative data-free approach without a significant loss of accuracy.
We finally turn to the study on ImageNet [Deng2009ImageNet].
Experimental setup. The ImageNet dataset consists of over 1M images in 1,000 classes. We explore two target resolutions: 32×32, matching that of CIFAR-10, and the full resolution of 224×224. For 32×32 images, we study the performance trade-offs between the number of generators and the accuracy. For full resolution, we train an ensemble of generators for distillation, one per class, so that each generator is trained to produce just a single class. The generator structure is similar to that used in the CIFAR-10/100 experiments; we add additional convolutional and upscaling layers to bring the image resolution to the target 224×224. Due to memory constraints, we also reduce the number of dimensions of the latent variable from to . To estimate per-class statistics we sample images per class. Note that this sampling can be done during the original teacher training.
All generators are trained with the Adam optimizer [Kingma2014-vr] with a learning rate of . We use the batch size of , , , and . For the distillation, we use a pre-trained ResNet-50 as the teacher and distill the knowledge to several different students. For simplicity, we do not apply any data augmentation techniques such as Inception preprocessing [Szegedy2016-vu], MixUp [zhang2018mixup], RandAugment [cubuk2019randaugment] or AutoAugment [cubuk2019autoaugment], and leave further optimization as a subject of future work. More details about the training setup are available in the supplementary materials.
4.4 Experimental results
Full scale ImageNet. We illustrate the test results on ImageNet in Table 3. To the best of our knowledge, this is the first distillation on ImageNet using data-free generative models, and thus there is no previous work in similar settings with which we can directly compare. Instead, we display the distillation results using images synthesized by (1) BigGAN [Brock2019BigGAN], a generative model trained in a GAN fashion on the ImageNet training data, i.e., one that uses real data to train its generator; and (2) DeepInversion [Yin2020DeepInversion], which synthesizes images by directly optimizing mini-batches of trainable images one by one. This kind of method is time-consuming to run, but in theory has the potential to produce much more diverse images than generative models. We note that DeepInversion uses ResNet-50v1.5 as the model and trains it with more advanced training techniques, while we use ResNet-50v1. As a result, the performance of our teacher trails theirs in top-1 accuracy, which makes the distillation results not directly comparable.
Specifically, our student trained with the ensemble of generators achieves an accuracy of , which outperforms the ones trained with images synthesized by BigGAN and DeepInversion by and respectively. Considering that their teacher is stronger than ours by , we actually have an even smaller gap () to each one's own teacher (BigGAN: ; DeepInversion: ).
Table 4 compares ensembles with different numbers of generators.
Number of generators. We further investigate the effect of the number of generators on the accuracy. Due to computational and time constraints, we conduct this ablation study on ImageNet resized to 32×32. We use ResNet-34 as the teacher and ResNet-18 as the student network. The single generator is trained with the moving-averaged moments stored in each batch-norm layer, and the ensemble of generators is trained using per-class statistics. For the ensemble, we first divide the image categories into groups of 10, i.e., categories 1–10 in the first group, categories 11–20 in the second group, etc. Then we sample images per group to estimate the per-group statistics for generator training. The setting of distillation is the same as in the experiment on full-resolution ImageNet.
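The contiguous grouping described above can be sketched as follows (zero-based labels are used here for convenience):

```python
def group_categories(num_classes, group_size):
    """Contiguous grouping of class labels: the first group_size classes
    in the first group, the next group_size in the second, and so on.
    Each class ends up in exactly one group."""
    return [list(range(i, min(i + group_size, num_classes)))
            for i in range(0, num_classes, group_size)]

# 1,000 ImageNet classes split into groups of 10 -> 100 groups,
# one generator per group.
groups = group_categories(1000, 10)
```

Each generator in the ensemble is then responsible for the inceptionism and moment matching objectives of its own group only.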
The results are shown in Table 4. As we can see, using a single generator we obtain only a poor distillation accuracy of . However, by using ensembles and increasing the number of generators, we gradually get better results, finally achieving an accuracy of , which has a gap of only to the supervised-trained student. These results demonstrate the importance of using ensembles to scale generative data-free distillation to large datasets.
Table 5 columns: student architecture, supervised accuracy, distillation accuracy, and accuracy gap.
Different students. In our experiments, we also compared the distillation results on students with different architectures (see Table 5). Here the teacher is a ResNet-50 model with a top-1 accuracy of . We use the same set of generators for all students we consider. The distillation performance on ResNet-50 is the best, with an accuracy drop of compared with the model trained in a supervised fashion. However, the results on ResNet-18 and MobileNetV2 [Sandler2018MobileNetV2] are much worse, with larger gaps to their supervised counterparts. This indicates that there may be some entanglement between student and teacher structures that makes generators learned from a ResNet-50 teacher less effective on MobileNetV2 and ResNet-18 than on a ResNet-50 student. Investigating how to improve their generalization ability remains a subject of future work.
In this paper, we propose a new method to train a generative image model by leveraging the intrinsic statistics of the trained teacher's normalization layers. Our contributions are three-fold. First, we have shown that a generator trained on our proposed objectives (i.e., the moment matching loss and the inceptionism loss) is able to produce higher-quality and more realistic images than previous methods. Second, we have successfully pushed forward the state-of-the-art data-free distillation performance on CIFAR-10 and CIFAR-100. Finally, we were able to scale the method to the ImageNet dataset, which, to the best of our knowledge, had not been done with generative models before.
While we have shown that data-free distillation can successfully scale to large datasets, there are still many open questions. Specifically, our experiments with multi-category generators show that performance drops dramatically, so generalizing our approach to work with a single generator is a natural next step. Another direction is to utilize global moments for training images for different classes. Finally, our early experiments show that the generators we produce are teacher-specific, and attempts to use them with a different teacher generally fail. It would be an important extension of our work to create universal generators that can learn from any teacher.
We thank Laëtitia Shao for insightful discussions, and Mingxing Tan, Sergey Ioffe, Rui Huang and Shiyin Wang for feedback on the draft.
Appendix A Experiment Settings
We provide more experimental details in this section. Generally, each generator is trained on an NVIDIA V100 GPU for K steps using the Adam optimizer with , , and a constant learning rate of . We run all the knowledge distillation experiments on a Cloud TPUv3 -core Pod slice. We use the Momentum optimizer and set the momentum parameter to . We employ a linear warm-up scheme where the learning rate increases from to the base learning rate for the first K training steps, and then it is decayed by every K steps. The temperature of distillation is set to .
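The warm-up and decay schedule can be sketched as below; the constants in the example call are hypothetical, since the exact step counts and learning-rate values are not reproduced here:

```python
def learning_rate(step, base_lr, warmup_steps, decay_every, decay_factor):
    """Linear warm-up from 0 to base_lr over warmup_steps, followed by a
    step decay of decay_factor every decay_every steps (hypothetical
    constants; a sketch of the schedule described in the text)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decays = (step - warmup_steps) // decay_every
    return base_lr * (decay_factor ** decays)

# Halfway through warm-up the rate is half the base; after one decay
# interval it has been multiplied by the decay factor once.
lr_mid_warmup = learning_rate(500, 0.1, 1000, 10000, 0.1)
lr_after = learning_rate(11000, 0.1, 1000, 10000, 0.1)
```

Such schedules are standard for large-batch distillation training; only the shape, not the constants, is taken from the text above.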
We run knowledge distillation for K steps with a batch size of and base learning rate of . The generator architecture is illustrated in Table 6, where .
We run knowledge distillation for K steps with a batch size of and base learning rate of . The generator architecture is illustrated in Table 6. For the single-generator experiment, . On the other hand, in the multiple-generator experiment, each generator is responsible for producing images of a certain category, and therefore .
We run knowledge distillation for K steps with a batch size of and base learning rate of . For the target resolution of , we use the same generator architecture as the one used in CIFAR-10/100 experiments. For the full resolution (), each generator is responsible for producing images in one category, whose architecture is illustrated in Table 7.