Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion

12/18/2019 · Hongxu Yin, et al. · Princeton University, University of Illinois at Urbana-Champaign, NVIDIA

We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network. We 'invert' a trained network (teacher) to synthesize class-conditional input images starting from random noise, without using any additional information about the training dataset. Keeping the teacher fixed, our method optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher. Further, we improve the diversity of synthesized images using Adaptive DeepInversion, which maximizes the Jensen-Shannon divergence between the teacher and student network logits. The resulting images, synthesized from networks trained on the CIFAR-10 and ImageNet datasets, demonstrate high fidelity and a high degree of realism, and help enable a new breed of data-free applications, ones that do not require any real images or labeled data. We demonstrate the applicability of our proposed method to three tasks of immense practical importance: (i) data-free network pruning, (ii) data-free knowledge transfer, and (iii) data-free continual learning.


1 Introduction

The ability to transfer learned knowledge from a trained neural network to a new one with properties desirable for the task at hand has many appealing applications. For example, one might want to use a smaller or more resource-efficient network architecture while deploying on edge inference devices [molchanov2019importance, fbnet, zhu2016trained], to adapt the network to the inference hardware [chamnet, wang2019haq, yin2019hardware], or to continually learn to classify new image classes [li2017learning, lopez2015unifying], etc. Most current methods that solve such knowledge transfer tasks are based on the concept of knowledge distillation [hinton2015distilling], wherein the new network (student) is trained to match its outputs to those of a previously trained network (teacher). However, all such methods share a significant constraint: they assume that either the previously used training data is available [chen2015net2net, li2017learning, molchanov2016pruning, romero2014fitnets], or that some real images representative of the prior training distribution are available [Kimura2018FewshotLO, kimura2018few, lopez2015unifying, rebuffi2017icarl]. Even methods not based on distillation, such as [Kirkpatrick2016OvercomingCF, nguyen2018variational, Zenke2017ContinualLT], assume that some additional statistics about prior training are made available by the pretrained model provider.

The requirement for prior training information can be very restrictive in practice. For example, suppose a very deep network such as ResNet-152 [he2016deep] was trained on datasets with millions [deng2009imagenet] or even billions of images [mahajan2018exploring], and we wish to distill its knowledge to a lower latency model such as ResNet-18. In this case, we would need access to these datasets, which are not only large but difficult to store, transfer, and manage. Further, another emerging concern is that of data privacy. While entities might want to share their trained models, sharing the training data might not be desirable due to user privacy, security, proprietary concerns, or competitive advantage.

In the absence of prior data or metadata, an interesting question arises: can we somehow recover training data from the already trained model and use it for knowledge transfer? A few methods have attempted to visualize what a trained deep network expects to see in an image [bhardwaj2019dream, mahendran2015understanding, mordvintsev2015deepdream, nguyen2015deep]. The most popular and simple-to-use method is DeepDream [mordvintsev2015deepdream]. It synthesizes or transforms an input image to yield high output responses for chosen classes in the output layer of a given classification model. This method optimizes the input (random noise or a natural image), possibly with some regularizers, while keeping the selected output activations fixed, but leaves intermediate representations constraint-free. The resulting "dreamed" images lack natural image statistics and can be quite easily identified as unnatural. These images are also not very useful for the purposes of transferring knowledge, as our extensive experiments in Section 4 show.

In this work, we make an important observation about deep networks that are widely used in practice: they all implicitly encode very rich information about prior training data. Almost all high-performing convolutional neural networks (CNNs), such as ResNets [he2016deep], DenseNets [huang2017densely], or their variants, use the batch normalization layer [ioffe2015batch]. These layers store running means and running variances of the activations at multiple layers. In essence, they store the history of previously seen data at multiple levels of representation. By assuming that these intermediate activations follow a Gaussian distribution with mean and variance equal to the running statistics, we show that we can obtain "dreamed" images that are realistic and much closer to the distribution of the training dataset than those produced by prior work in this area.

Our approach, visualized in Fig. 1, called DeepInversion, introduces a regularization term for intermediate layer activations of dreamed images based on just the two layer-wise statistics, mean and variance, which are directly available with trained models. As a result, we do not require any training data or metadata to perform training image synthesis. Our method is able to generate images with high fidelity and realism at a high resolution, as can be seen in the middle section of Fig. 1, and more samples in Fig. 5.

We also introduce an application-specific extension of DeepInversion, called Adaptive DeepInversion, which can enhance the diversity of the generated images. More specifically, it exploits disagreements between the pretrained teacher and the in-training student network to expand the coverage of the training set by the synthesized images. It does so by maximizing the Jensen-Shannon divergence between the responses of the two networks.

In order to show that our dataset synthesis method is useful in practice, we demonstrate its effectiveness on three different use cases. First, we show that the generated images support knowledge transfer between two networks using distillation, even with different architectures, with minimal accuracy loss on the simple CIFAR-10 dataset as well as on the large and complex ImageNet dataset. Second, we show that we can prune the teacher network using the synthesized images to obtain a smaller student on the ImageNet dataset. Finally, we apply DeepInversion to continual (incremental) learning, enabling the addition of new classes to a pretrained CNN without the need for any original data. Using our DeepInversion technique, we empower a new class of "data-free" applications of immense practical importance, which need neither any natural image nor labeled data.

Our main contributions are as follows:


  • We introduce DeepInversion, a new method for synthesizing class-conditional images from a CNN trained for image classification (Sec. 3.2). Further, we improve the synthesized image diversity by exploiting student-teacher disagreements via Adaptive DeepInversion (Sec. 3.3).

  • We enable data-free and hardware-aware pruning that achieves performance comparable to the state-of-the-art methods that rely on the training dataset (Sec. 4.3).

  • We introduce and address the task of data-free knowledge transfer between a teacher and a randomly initialized student network (Sec. 4.4).

  • We improve upon prior work on data-free incremental learning and achieve results comparable with oracle methods given the original data (Sec. 4.5).

2 Related Work

Knowledge distillation.

A long line of work has explored transferring knowledge from one model to another. Breiman and Shang first introduced this concept by learning a single decision tree to approximate the outputs of multiple decision trees [breiman1996born]. Similar ideas were explored in neural networks by Bucilua et al. [bucilua2006model], Ba and Caruana [ba2014deep], and Hinton et al. [hinton2015distilling]. Of these, Hinton et al. formulated the problem as "knowledge distillation", where a compact student mimics the output distributions of expert teacher models [hinton2015distilling]. Knowledge distillation and its improved variants [ahn2019variational, park2019relational, romero2014fitnets, xu2017training, zagoruyko2016paying] enable teaching students with goals such as quantization [mishra2017apprentice, polino2018model], compact neural network architecture design [romero2014fitnets], semantic segmentation [liu2019structured], self-distillation [furlanello2018born], and un-/semi-supervised learning [lopez2015unifying, pilzer2019refine, yim2017gift]. All these methods still rely on images from original or proxy datasets. More recent research has explored data-free knowledge distillation. Lopes et al. [lopes2017data] synthesized inputs based on pre-stored auxiliary layer-wise statistics of the teacher network. Chen et al. [chen2019data] trained a new generator network for image generation while treating the teacher network as a fixed discriminator. Despite remarkable insights, scaling to tasks such as ImageNet classification remains difficult for these methods.

Image synthesis. GANs [gulrajani2017improved, miyato2018spectral, nguyen2017plug, zhang2018self] have been at the forefront of generative image modeling and have yielded high-fidelity images, as recently shown by the BigGAN model of Brock et al. [brock2018biggan]. Though adept at capturing the image distribution, GAN training requires access to the original data to train the generator for subsequent image generation.

An alternative line of work focuses on image synthesis from a single CNN. In work on security, Fredrikson et al. [fredrikson2015modelinversionattack] propose the model inversion attack to recover class-wise training images of a network through gradient descent on the input. Follow-up works have refined and expanded the approach to new threat scenarios such as collaborative learning and membership detection [he2019model, wang2015regression, yang2019adversarial]. These methods have only been demonstrated on shallow networks.

In vision, researchers have focused on visualizing neural networks and understanding the factors behind their intrinsic properties. Mahendran et al. explored inversion, activation maximization, and caricaturization techniques to synthesize "natural pre-images" from a trained network [mahendran2015understanding, mahendran2016visualizing]. Nguyen et al. used a deep generator network as a GAN prior to invert images from a trained CNN [nguyen2016synthesizing], and the follow-up Plug & Play approach further improved image diversity and quality via a latent code prior [nguyen2017plug]. Along the same line, Bhardwaj et al. used training data cluster centroids to improve inversion efficacy [bhardwaj2019dream]. These methods still rely on auxiliary dataset information or additional pretrained networks. Of particular relevance to this work is DeepDream [mordvintsev2015deepdream] by Mordvintsev et al., which has enabled the "dreaming" of new object features onto natural images given a single pretrained CNN. Despite notable progress, synthesizing high-fidelity, high-resolution natural images from a deep network remains challenging.

3 Method

We propose a new framework for data-free knowledge distillation. Our framework broadly consists of two steps: (i) model inversion, and (ii) application-specific knowledge distillation. In this section, we briefly discuss the background and notation, and then introduce our DeepInversion and Adaptive DeepInversion methods.

3.1 Background

Knowledge Distillation. Distillation [hinton2015distilling] is a popular technique for knowledge transfer between two models. The simplest form of distillation consists of two steps. First, the teacher, a large model or an ensemble of models, is trained. Second, a smaller model, the student, is trained to mimic the behavior of the teacher by matching the temperature-scaled soft target distribution produced by the teacher on training data. Given a trained teacher model $p_T$ and a dataset $\mathcal{X}$, the parameters of the student model, $W_S$, can be learned by

$\min_{W_S} \sum_{x \in \mathcal{X}} KL\big(p_T(x),\, p_S(x)\big)$ (1)

where $KL(\cdot)$ refers to the Kullback-Leibler divergence, and $p_T(x)$ and $p_S(x)$ are the output distributions produced by the teacher and student models, respectively, typically obtained using a high temperature on the softmax inputs [hinton2015distilling].

Knowledge distillation also works with other images from the same domain as the original training data, and labels are not required. Despite its efficacy, the process still relies on real images from that domain. Below, we focus on methods to synthesize a large set of images from noise that could replace $\mathcal{X}$.
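For concreteness, the sketch below shows a minimal temperature-scaled distillation objective in PyTorch; the temperature value and reduction choice are illustrative assumptions rather than the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """KL divergence between temperature-softened teacher and student outputs.

    The temperature here is an illustrative choice; the paper tunes it
    per experiment.
    """
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Usage sketch: given a frozen teacher and a trainable student on a batch x_hat
# loss = distillation_loss(student(x_hat), teacher(x_hat).detach())
```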

DeepDream [mordvintsev2015deepdream]. Mordvintsev et al. originally formulated DeepDream to derive artistic effects on natural images, but the method is also suitable for optimizing noise into images. Given a randomly initialized input $\hat{x} \in \mathbb{R}^{H \times W \times C}$ ($H$, $W$, $C$ being the height, width, and number of color channels) and an arbitrary target label $y$, the image synthesis process can be expressed as an optimization of the form

$\min_{\hat{x}} \; \mathcal{L}(\hat{x}, y) + \mathcal{R}(\hat{x})$ (2)

where $\mathcal{L}(\cdot)$ is a classification loss such as the standard softmax cross-entropy, and $\mathcal{R}(\cdot)$ is an image regularization term that improves the visual quality of $\hat{x}$. DeepDream [mordvintsev2015deepdream] uses an image prior term [dosovitskiy2016inverting, mahendran2015understanding, nguyen2015deep, simonyan2013deep] to steer the generated images away from unrealistic images that are classified correctly but possess no discernible visual information:

$\mathcal{R}_{\text{prior}}(\hat{x}) = \alpha_{\text{tv}} \mathcal{R}_{\text{TV}}(\hat{x}) + \alpha_{\ell_2} \mathcal{R}_{\ell_2}(\hat{x})$ (3)

where $\mathcal{R}_{\text{TV}}$ and $\mathcal{R}_{\ell_2}$ penalize the total variance and the $\ell_2$ norm of $\hat{x}$, respectively, and $\alpha_{\text{tv}}$, $\alpha_{\ell_2}$ are scaling factors. As empirically observed in our experiments and prior work [mahendran2015understanding, mordvintsev2015deepdream, nguyen2015deep], image prior regularization provides more stable convergence to valid images. However, these images still lack an overlap in their distribution with natural (or original training) images and thus lead to unsatisfactory results for knowledge distillation.
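As an illustration, a minimal sketch of such an image prior appears below, combining total variation and $\ell_2$ penalties on the input being optimized; the scaling factors are placeholders, not the tuned values from the experiments.

```python
import torch

def image_prior(x, alpha_tv=1e-4, alpha_l2=1e-5):
    """DeepDream-style image prior: total variation plus l2-norm penalty.

    x: (N, C, H, W) tensor being optimized. The scaling factors are
    placeholders; the paper tunes them per experiment.
    """
    # Total variation: penalize differences between neighboring pixels,
    # which discourages high-frequency noise.
    tv = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean() + \
         (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    # l2 penalty keeps overall pixel magnitudes bounded.
    l2 = x.norm(2)
    return alpha_tv * tv + alpha_l2 * l2
```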

3.2 DeepInversion (DI)

We improve DeepDream's image quality by extending the image regularization $\mathcal{R}(\hat{x})$ with a new feature distribution regularization term. Given access to only the trained model, the image prior regularization defined in the previous section provides little guidance for obtaining an $\hat{x}$ that contains low- and high-level features similar to those of the original data $x \in \mathcal{X}$. To effectively enforce feature similarities at all levels, we propose to minimize the distance between the feature map statistics of $\hat{x}$ and $x$. We assume that feature statistics follow a Gaussian distribution across batches and, therefore, can be defined by mean $\mu$ and variance $\sigma^2$. The feature distribution regularization term can then be formulated as:

$\mathcal{R}_{\text{feature}}(\hat{x}) = \sum_{l} \big\| \mu_l(\hat{x}) - \mathbb{E}\big[\mu_l(x) \mid \mathcal{X}\big] \big\|_2 + \sum_{l} \big\| \sigma^2_l(\hat{x}) - \mathbb{E}\big[\sigma^2_l(x) \mid \mathcal{X}\big] \big\|_2$ (4)

where $\mu_l(\hat{x})$ and $\sigma^2_l(\hat{x})$ are the batch-wise mean and variance estimates of the feature maps corresponding to the $l$-th convolutional layer. The $\mathbb{E}[\cdot]$ and $\|\cdot\|_2$ operators denote the expected value and $\ell_2$ norm calculations, respectively.

It might seem as though a set of training images would be required to obtain $\mathbb{E}[\mu_l(x) \mid \mathcal{X}]$ and $\mathbb{E}[\sigma^2_l(x) \mid \mathcal{X}]$, but the running average statistics stored in the widely used batch normalization (BN) layers are more than sufficient. A BN layer normalizes the feature maps during training to alleviate covariate shift [ioffe2015batch]. It implicitly captures the channel-wise means and variances during training, which allows the expectations in Eq. 4 to be estimated as:

$\mathbb{E}\big[\mu_l(x) \mid \mathcal{X}\big] \simeq \mathrm{BN}_l(\text{running\_mean})$ (5)

$\mathbb{E}\big[\sigma^2_l(x) \mid \mathcal{X}\big] \simeq \mathrm{BN}_l(\text{running\_variance})$ (6)

As we will show, this feature distribution regularization substantially improves the quality of the generated images. We refer to this model inversion method as DeepInversion, a generic approach that can be applied to any trained deep CNN classifier to invert high-fidelity images. The full regularization (corresponding to $\mathcal{R}$ in Eq. 2) can thus be expressed as

$\mathcal{R}_{\text{DI}}(\hat{x}) = \mathcal{R}_{\text{prior}}(\hat{x}) + \alpha_{\text{f}}\, \mathcal{R}_{\text{feature}}(\hat{x})$ (7)
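The sketch below illustrates how the feature statistics term of Eqs. 4-6 can be computed in PyTorch with forward hooks on BatchNorm layers; the optimizer settings, loss weights, and the `image_prior` helper referenced in the usage comments are assumptions carried over from the earlier sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BNFeatureHook:
    """Forward hook on a BatchNorm2d layer: compares the per-channel batch
    statistics of the layer's input (computed on the synthesized batch) with
    the running statistics stored by the layer, as in Eqs. 4-6."""

    def __init__(self, module: nn.BatchNorm2d):
        self.hook = module.register_forward_hook(self.hook_fn)
        self.r_feature = torch.tensor(0.0)

    def hook_fn(self, module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=[0, 2, 3])                   # batch mean per channel
        var = x.var(dim=[0, 2, 3], unbiased=False)     # batch variance per channel
        self.r_feature = (mean - module.running_mean).norm(2) + \
                         (var - module.running_var).norm(2)

    def close(self):
        self.hook.remove()

# Usage sketch (targets, iteration counts, and weights are illustrative):
# teacher.eval()
# hooks = [BNFeatureHook(m) for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
# x_hat = torch.randn(64, 3, 224, 224, requires_grad=True)
# optimizer = torch.optim.Adam([x_hat], lr=0.05)
# for _ in range(2000):
#     optimizer.zero_grad()
#     loss = torch.nn.functional.cross_entropy(teacher(x_hat), targets)  # L(x_hat, y)
#     loss = loss + alpha_f * sum(h.r_feature for h in hooks)            # R_feature term
#     loss = loss + image_prior(x_hat)                                   # R_prior (Sec. 3.1 sketch)
#     loss.backward()
#     optimizer.step()
```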

3.3 Adaptive DeepInversion (ADI)

In addition to quality, the diversity of the generated images also plays a crucial role in avoiding repeated and redundant image synthesis. Prior work on GANs has studied this problem and proposed various techniques, such as the min-max training competition [goodfellow2014generative] and the truncation trick [brock2018biggan]. These methods rely on the joint training of two networks over original data and therefore are not applicable to our problem. We propose Adaptive DeepInversion, an enhanced image generation scheme based on a novel iterative competition between the image generation process and the student network. The main idea is to encourage the synthesized images to cause student-teacher disagreement. For this purpose, we introduce an additional loss for image generation based on the Jensen-Shannon divergence that penalizes output distribution similarities,

$\mathcal{R}_{\text{compete}}(\hat{x}) = 1 - JS\big(p_T(\hat{x}),\, p_S(\hat{x})\big), \quad JS\big(p_T(\hat{x}),\, p_S(\hat{x})\big) = \tfrac{1}{2}\Big[ KL\big(p_T(\hat{x}),\, M\big) + KL\big(p_S(\hat{x}),\, M\big) \Big]$ (8)

where $M = \tfrac{1}{2}\big(p_T(\hat{x}) + p_S(\hat{x})\big)$ is the average of the teacher and student output distributions.

During optimization, this new term leads to images that the student cannot easily classify whereas the teacher can. As illustrated in Fig. 2, our proposal iteratively expands the coverage of the image distribution during the learning process. With competition, the regularization from Eq. 7 is augmented with the additional loss scaled by $\alpha_{\text{c}}$ as

$\mathcal{R}_{\text{ADI}}(\hat{x}) = \mathcal{R}_{\text{DI}}(\hat{x}) + \alpha_{\text{c}}\, \mathcal{R}_{\text{compete}}(\hat{x})$ (9)
Figure 2: Illustration of the Adaptive DeepInversion competition scheme to improve image diversity. Given a set of generated images (shown as green stars), an intermediate student can learn to capture part of the original image distribution. Upon generating new images (shown as red stars), competition encourages new samples that lie outside the student's learned knowledge, improving distributional coverage and facilitating additional knowledge transfer. Best viewed in color.
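A minimal sketch of the competition term in Eq. 8 is shown below; softmax temperatures and any scaling are left at illustrative defaults and are not claimed to match the paper's implementation.

```python
import torch.nn.functional as F

def competition_loss(teacher_logits, student_logits):
    """Adaptive DeepInversion competition term: 1 - JS(p_teacher, p_student).

    Minimizing this value maximizes the Jensen-Shannon divergence between the
    two output distributions, steering synthesis toward images on which the
    student disagrees with the teacher."""
    p_t = F.softmax(teacher_logits, dim=1)
    p_s = F.softmax(student_logits, dim=1)
    m = 0.5 * (p_t + p_s)                                   # average distribution M
    kl_tm = F.kl_div(m.log(), p_t, reduction="batchmean")   # KL(p_T || M)
    kl_sm = F.kl_div(m.log(), p_s, reduction="batchmean")   # KL(p_S || M)
    js = 0.5 * (kl_tm + kl_sm)
    return 1.0 - js
```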

3.4 DeepInversion vs. Adaptive DeepInversion

DeepInversion is a generic method that can be applied to any trained CNN classifier. For knowledge distillation, it enables a one-time synthesis of a large number of images given the teacher, which initiates knowledge transfer. Adaptive DeepInversion, on the other hand, needs a student in the loop to enhance image diversity. Its competitive nature favors constantly evolving students, which gradually force new image features to emerge. Interaction and feedback from another network thus augment DeepInversion, as we show in our experiments in the following section.

4 Experiments

In this section, we demonstrate the effectiveness of our proposed inversion methods on datasets of increasing size and complexity. We perform a number of ablations to better understand the contribution of each component of our method on the simple CIFAR-10 dataset of 32×32 pixel images and 10 classes. Then, we focus on the complex ImageNet dataset, which has 1000 classes and images of size 224×224. On ImageNet, we show the successful application of our inversion method to three different tasks under the data-free setting: (a) pruning, (b) knowledge transfer, and (c) continual/incremental learning. In all experiments, image pixels are initialized from Gaussian noise before optimization with DeepInversion (DI) and Adaptive DeepInversion (ADI).

4.1 Results on CIFAR-10

We first perform experiments on the CIFAR-10 dataset to validate our design choices for image generation. We consider the task of data-free knowledge distillation, where we teach a randomly initialized student network from scratch.

Implementation Details. We use VGG-11-BN and ResNet-34 networks pretrained on the CIFAR-10 dataset as the teachers. For all image synthesis in this section, we use the Adam optimizer with a learning rate of 0.05. We generate 32×32 images in batches of size 256, and each image batch requires 2k gradient updates. After a simple grid search optimizing for student accuracy, we select the scaling factors $\alpha_{\text{tv}}$, $\alpha_{\ell_2}$, and $\alpha_{\text{f}}$ for DI, plus $\alpha_{\text{c}}$ for ADI. Additional details are in the supplementary material.

Teacher Network VGG-11 VGG-11 ResNet-34
Student Network VGG-11 ResNet-18 ResNet-18
Teacher accuracy 92.34% 92.34% 95.42%
Noise ($\mathcal{L}$ only) 13.55% 13.45% 13.61%
+$\mathcal{R}_{\text{prior}}$ (DeepDream [mordvintsev2015deepdream]) 36.59% 39.67% 29.98%
+$\mathcal{R}_{\text{feature}}$ (DeepInversion) 84.16% 83.82% 91.43%
+$\mathcal{R}_{\text{compete}}$ (ADI) 90.78% 90.36% 93.26%
DAFL [chen2019data] – – 92.22%
Table 1: Data-free knowledge transfer to various students on CIFAR-10. For ADI, we generate one new batch of images every 50 KD iterations and merge the newly generated images into the existing set of generated images.
Figure 3: Generated images by inverting a ResNet-34 trained on CIFAR-10 with different methods: (a) optimized noise, (b) DeepDream [mordvintsev2015deepdream], (c) DAFL [chen2019data], (d) DeepInversion (DI), (e) Adaptive DeepInversion (ADI). All images are correctly classified by the network; clockwise: cat, dog, horse, car.

Baselines – Noise & DeepDream [mordvintsev2015deepdream]. From Table 1, we observe that optimized noise, Noise ($\mathcal{L}$ only), does not provide any support for knowledge distillation: the drastic change in input distribution disrupts the teacher and invalidates the transferred knowledge. Adding $\mathcal{R}_{\text{prior}}$, as in DeepDream, only slightly improves the student's accuracy.

Effectiveness of DeepInversion ($\mathcal{R}_{\text{feature}}$). Upon adding $\mathcal{R}_{\text{feature}}$, we immediately observe large improvements in accuracy across all teaching scenarios. As can be seen in Fig. 3, DeepInversion images (d) are vastly superior in realism to baselines (a) and (b).

Effectiveness of Adaptive DeepInversion ($\mathcal{R}_{\text{compete}}$). Using competition-based inversion further improves accuracy, bringing the student very close to a teacher trained on real CIFAR-10 images. The training curves from one of the runs are shown in Fig. 4. Encouraging teacher-student disagreement brings in additional "harder" images during training (shown in Fig. 3 (e)) that confuse the student but remain correctly classified by the teacher, as seen in the left part of Fig. 4, which shows low student confidence during image generation.

Figure 4: Progress of knowledge transfer from trained VGG-11-BN (92.34% acc.) to freshly initialized VGG-11-BN network (student) using inverted images. Plotted here are accuracies on generated (left) and real (right) images. Final student accuracies are in Table 1.

Comparison with DAFL [chen2019data]. We further compare our method with [chen2019data], which trains a new generator network to convert noise into images while treating the teacher as a fixed discriminator. As seen in Fig. 3 (c), these images are reminiscent of "unrecognizable images" or "fooling images" [nguyen2015deep]. Our method produces images of higher visual fidelity while eliminating the training cost of an additional generator network, and also yields higher student accuracy under the same experimental setup.

Figure 5: Class-conditional samples obtained by DeepInversion, given only a ResNet-50 classifier trained on ImageNet and no additional information. Note that the images depict classes in contextually correct backgrounds, in realistic scenarios. Best viewed in color.

4.2 Results on ImageNet

After successfully demonstrating our method’s abilities on the small CIFAR-10 dataset, we move on to examine its effectiveness on the large-scale ImageNet dataset [deng2009imagenet]. We first run DeepInversion on networks trained on ImageNet, and perform quantitative and qualitative analysis. Then, we show the effectiveness of our synthesized images on 3 different tasks of immense importance: data-free pruning, data-free knowledge transfer, and data-free continual learning.

Implementation Details.

For all experiments in this section, we use the publicly available pretrained ResNet-{18, 50} models from PyTorch as the fixed teacher networks. For image synthesis, we use the Adam optimizer with a constant learning rate, and set the scaling factors $\alpha_{\text{tv}}$, $\alpha_{\ell_2}$, $\alpha_{\text{f}}$ for DI, plus $\alpha_{\text{c}}$ for ADI. We synthesize images of resolution 224×224 in batches using 8 NVIDIA V100 GPUs and automatic mixed precision (AMP) [micikevicius2017mixed] acceleration. Each image batch consumes 20k updates over 2h.

4.2.1 Analysis of DeepInversion Images

Fig. 5 shows a set of images generated by applying DeepInversion to an ImageNet-pretrained ResNet-50. Remarkably, given just the model, we observe that DeepInversion is able to generate images with high fidelity and resolution. It also produces detailed image features and textures around the target object, e.g., clouds surrounding the target balloon, water around a catamaran, forest below the volcano, etc.

Generalizability. In order to verify that the generated images do not overfit to the model used for inversion, we obtain predictions using four other networks trained on ImageNet. As seen in Table 2, images generated using a ResNet-50 generalize well and are correctly classified by a range of models. Further, DeepInversion outperforms DeepDream by a large margin. This indicates that our generated images remain robust when transferred across networks.

Inception Score (IS). We also compare the IS [salimans2016improved] of our generated images with other methods in Table 3. Again, DeepInversion substantially outperforms DeepDream. Without sophisticated training, DeepInversion even beats multiple GAN baselines that have limited scalability to high image resolutions.

Model DeepDream DeepInversion
top-1 acc. (%) top-1 acc. (%)
ResNet-50 100
ResNet-18
Inception-V3
MobileNet-V2
VGG-11
Table 2: Classification accuracy of ResNet-50 synthesized images by other ImageNet-trained CNNs.
Method Resolution GAN Inception Score
BigGAN [brock2018biggan] /
DeepInversion (Ours)
SAGAN [zhang2018self]
SNGAN [miyato2018spectral]
WGAN-GP [gulrajani2017improved]
DeepDream [mordvintsev2015deepdream]*
Table 3: Inception Score (IS) obtained by images synthesized by various methods on ImageNet. The SNGAN ImageNet score is from [shmelkov2018good]. *: our implementation. The marked BigGAN entry uses BigGAN-deep.

4.3 Application I: Data-free Pruning

In this section, we focus on the application of data-free pruning. The goal of pruning is to remove individual weights or entire filters (neurons) from a network such that the metric of interest (e.g., accuracy, precision) does not drop significantly. Many techniques have been proposed to successfully compress neural networks, including [han2015deep, li2016pruning, liu2017learning, thinet, molchanov2019importance, molchanov2016pruning, ye2018rethinking, nisp]. All these methods require images from the original dataset to perform pruning. We build upon the pruning method of Molchanov et al. [molchanov2019importance], which uses the Taylor approximation of the pruning loss for a global ranking of filter importance over all the layers. We extend this method by applying a Kullback-Leibler divergence loss computed between the teacher and student output distributions. Filter importance is estimated from images inverted with DeepInversion and/or Adaptive DeepInversion, making the pruning process completely data-free.

Hardware-aware Loss. We further consider the actual latency on the target hardware for more efficient pruning. To achieve this goal, the importance ranking of filters needs to reflect not only accuracy but also latency. This impact can be quantified by:

$\mathcal{I}(\mathbf{W}_g) = \mathcal{I}_{\text{err}}(\mathbf{W}_g) + \eta\, \mathcal{I}_{\text{lat}}(\mathbf{W}_g)$ (10)

where $\mathcal{I}_{\text{err}}$ and $\mathcal{I}_{\text{lat}}$ focus on the absolute changes in network error and inference latency, respectively, when the filter group $\mathbf{W}_g$ is zeroed out from the set of neural network parameters $\mathbf{W}$, and $\eta$ balances their importance. Pruning filters in layer-wise groups conforms better to common digital architecture design routines. Akin to [molchanov2019importance], $\mathcal{I}_{\text{err}}$ can be approximated by the sum of the first-order Taylor expansions of the individual filters. We approximate the latency reduction term, $\mathcal{I}_{\text{lat}}$, via precomputed hardware-aware look-up tables of operation costs; details of the latency estimation are in the supplementary material. Neurons are then iteratively ranked according to $\mathcal{I}$, and the least important ones are removed.
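A minimal sketch of this ranking rule follows; the balancing coefficient, the sign convention for the latency term (negative when removal speeds up inference), and the container names are illustrative assumptions.

```python
def hardware_aware_importance(err_change, latency_change, eta=0.01):
    """Combined ranking score for one filter group, in the spirit of Eq. 10.

    err_change     : Taylor estimate of the loss/error change if the group is
                     zeroed out (larger = more damage to accuracy).
    latency_change : estimated latency change after removal; negative when
                     removing the group speeds up inference (e.g., read from a
                     hardware look-up table).
    eta            : balancing coefficient; the default here is a placeholder.
    """
    return err_change + eta * latency_change

# Groups with the lowest scores hurt accuracy least and save the most latency,
# so they are removed first (the containers below are hypothetical):
# scores = {g: hardware_aware_importance(taylor_est[g], lat_change[g]) for g in groups}
# to_prune = sorted(scores, key=scores.get)[:groups_per_step]
```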

We next implement the method for ResNet-50 pruning. First, we study data-free pruning without the original dataset. Then, we demonstrate the effectiveness of latency-aware loss. Finally, we explore the joint benefits of both techniques.

Data-free Pruning Evaluation. For better insight, we study four image sources: (i) partial ImageNet with 0.1M original images; (ii) unlabeled images from the proxy datasets MS COCO [lin2014microsoft] (127k images) and PASCAL VOC [everingham2010pascal] (9.9k images); (iii) 100k generated images from the BigGAN-deep [brock2018biggan] model; and (iv)-(v) a data-free setup with the proposed methods, in which we first generate 165k images via DeepInversion and then add 24k/26k additional images through two competition rounds of ADI, given intermediate pruned students.

Results of pruning and fine-tuning are summarized in Table 4. Our approach substantially boosts top-1 accuracy when given inverted images. ADI performs roughly on par with BigGAN. Despite beating VOC, we still observe a gap between synthesized images (ADI and BigGAN) and natural images (MS COCO and ImageNet). This gap narrows as fewer filters are pruned.

Figure 6: Synthesized images ('Inverted') and their nearest real neighbors ('Closest real') in the ResNet-50-avgpool feature space for the ImageNet class 'handheld computer', for (a) DeepInversion, i.e., without the proposed competition scheme, and (b) ADI, i.e., with it.

Table 4: ImageNet ResNet-50 pruning results for the knowledge distillation setup, given different types of input images. Columns report top-1 accuracy (%) after pruning 50% of filters (-71% FLOPs) and 20% of filters (-37% FLOPs), with and without fine-tuning. Image sources compared: partial ImageNet (0.1M images, 0 labels); the proxy datasets MS COCO and PASCAL VOC; a BigGAN generator; and our noise-to-image methods, DeepInversion (DI) and Adaptive DeepInversion (ADI).

Table 5: ImageNet ResNet-50 pruning comparison with prior work, reporting ImageNet data usage, GFLOPs, latency (ms), and top-1 accuracy (%). Methods compared: the base model; Taylor pruning [molchanov2019importance]; SSS [huang2017data]; ThiNet-70 [thinet]; NISP-50-A [nisp]; and our variants: Hardware-Aware (HA), ADI (data-free), and ADI + HA.

Impact of Competition. To visualize the diversity increase due to the competition loss (Eq. 8), we compare images of the 'handheld computer' class generated with and without the competition scheme in Fig. 6. As learning continues, competition leads to the discovery of features for hands from the teacher's knowledge scope to challenge the student network. Moreover, the generated images differ from their nearest neighbors, showing the efficacy of the approach in capturing the distribution as opposed to memorizing inputs. The distribution coverage of the generated images with and without the competition loss is shown in the supplementary material.

Comparisons with SOTA. We compare data-free pruning against state-of-the-art methods in Table 5 for the setting of pruning away 20% of filters globally. We evaluate three setups of our approach: (i) the hardware-aware technique (HA) applied individually, (ii) the data-free pruning technique with DeepInversion and Adaptive DeepInversion (ADI), and (iii) both applied jointly (ADI+HA). First, we evaluate the hardware-aware loss on the original dataset and achieve faster inference with zero accuracy loss compared to the base model. We also observe simultaneous improvements in inference speed and accuracy over the state-of-the-art pruning baseline [molchanov2019importance]. In the data-free setup, we achieve post-pruning model performance similar to prior work (which uses the original dataset) while completely removing the need for any images or labels. The data-free setup incurs a small loss in accuracy with respect to the base model; the final combination of data-free and hardware-aware techniques (ADI+HA) further closes this gap while simultaneously improving the inference speed.

4.4 Application II: Data-free Knowledge Transfer

Figure 7: Class-conditional images by DeepInversion given a ResNet50v1.5 classifier pretrained on ImageNet. Classes top to bottom: brown bear, quill, trolleybus, cheeseburger, cup, volcano, daisy, cardoon.

In this section, we show that we can distill information from a teacher network to a student network without using any real images at all. We apply DeepInversion to a ResNet-50v1.5 [resnet50v15url] trained on ImageNet to synthesize images. Using these images, we then train another randomly initialized ResNet-50v1.5 from scratch. Below, we describe two practical considerations - a) image clipping, and b) multi-resolution synthesis, which we find can greatly help boost accuracy while reducing run-time.

a) Image clipping. We find that enforcing the synthesized images to conform to the mean and variance of the data preprocessing step helps improve accuracy. This intuitively makes sense: maintaining the correct pixel range is important so that the network receives a 'realistic' image as input. Note that commonly released networks, such as the publicly available PyTorch model [resnet50v15url] we experiment with, use means and variances computed on ImageNet. We emphasize that our method does not require statistics of prior datasets, but merely uses those provided with the model as part of the data preprocessing pipeline. We clip the synthesized image using

$\hat{x} \leftarrow \min\big(\max(\hat{x},\, -m/s),\; (1 - m)/s\big)$ (11)

where $m$ and $s$ are the per-channel mean and standard deviation used in the model's data preprocessing step.
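The sketch below illustrates this clipping step, assuming the standard ImageNet preprocessing statistics shipped with publicly available PyTorch models; if a model uses different preprocessing, its own statistics should be substituted.

```python
import torch

# ImageNet preprocessing statistics commonly used by public PyTorch models;
# substitute the statistics released with the model if they differ.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def clip_image(x):
    """Clip a synthesized (normalized) image so that, after undoing the
    normalization, every pixel lies in the valid [0, 1] range."""
    lower = (0.0 - MEAN) / STD
    upper = (1.0 - MEAN) / STD
    return torch.max(torch.min(x, upper), lower)
```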

b) Multi-resolution synthesis. We find that we can speed up DeepInversion by employing a multi-resolution optimization scheme. We first optimize the input at a lower resolution for an initial set of iterations, then up-sample the image via nearest-neighbour interpolation to the full 224×224 resolution, and optimize for additional iterations with adjusted scaling factors and Adam learning rate. This implementation speeds up DeepInversion and enables the generation of an image batch within minutes on an NVIDIA V100 GPU. As a final setup, we generate images with equal probability between (i) this multi-resolution scheme and (ii) the scheme described in Section 4.2 run for fewer iterations, to further improve image diversity.
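A minimal sketch of the two-stage scheme follows; the resolutions, iteration counts, and the `optimize_fn` helper (standing in for the DeepInversion update loop) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_resolution_synthesis(optimize_fn, batch_size=64,
                               low_res=112, high_res=224,
                               low_iters=2000, high_iters=1000):
    """Two-stage synthesis: optimize at low resolution, up-sample with
    nearest-neighbour interpolation, then refine at full resolution.

    `optimize_fn(x, iters)` is a hypothetical helper that runs `iters`
    DeepInversion updates on `x` and returns the optimized tensor."""
    x = torch.randn(batch_size, 3, low_res, low_res, requires_grad=True)
    x = optimize_fn(x, low_iters)                                   # coarse stage
    x = F.interpolate(x.detach(), size=(high_res, high_res), mode="nearest")
    x.requires_grad_(True)
    x = optimize_fn(x, high_iters)                                  # refinement stage
    return x.detach()
```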

A set of images generated by DeepInversion on the pretrained ResNet-50v1.5 using the techniques described above are shown in Fig. 7. The images again demonstrate high fidelity and diversity. Clipping helps improve the contrast, while multi-resolution synthesis helps speed-up the process.

Knowledge Transfer. We synthesize 140k images via DeepInversion on ResNet-50v1.5 [resnet50v15url] to train a student network of the same architecture from scratch. Our teacher is the pretrained ResNet-50v1.5. We apply knowledge distillation for 90/250 epochs with a temperature-scaled softmax, an initial learning rate and batch size split across eight V100 GPUs, and other settings the same as in the original setup [resnet50v15url]. Results are summarized in Table 6. With the proposed method, given only the trained ResNet-50v1.5 model, we can teach a new model from scratch whose accuracy comes close to that of the teacher.

Table 6: Knowledge transfer from the trained ResNet-50v1.5 to the same network initialized from scratch. Rows compare image sources by real-image usage, data amount, and top-1 accuracy: the base model (1.3M real images); 90-epoch knowledge transfer using ImageNet (215k), MS COCO (127k), a BigGAN generator (215k), or DeepInversion (DI, 140k); and 250-epoch knowledge transfer with mixup using DeepInversion (DI, 140k).

4.5 Application III: Data-free Continual Learning

In this section, we focus on the task of data-free continual (incremental) learning, another scenario that benefits from the image generation capability of DeepInversion. The main idea of incremental learning is to add new classification classes to a pretrained model without comprehensive access to its original training data. To the best of our knowledge, only LwF [li2017learning] and LwF.MC [rebuffi2017icarl] achieve data-free incremental learning. Other methods require information that can only be obtained from the original dataset, e.g., a subset of the data (iCaRL [rebuffi2017icarl]), parameter importance estimates (in the form of the Fisher matrix in EWC [Kirkpatrick2016OvercomingCF], contribution to loss change in SI [Zenke2017ContinualLT], posterior of network weights in VCL [nguyen2018variational]), or a training data representation (encoder [RannenTrikiAljundi2017encoder], GAN [Hu2019OvercomingCF, Shin2017ContinualLW]). Some methods rely on network modifications, e.g., PackNet [Mallya2017PackNetAM] and Piggyback [Mallya2018PiggybackAA]. In comparison, DeepInversion needs neither network modifications nor the original (meta-)data, as BatchNorm statistics are already part of the trained model.

In this task setting, a network is initially trained on a dataset with classes $\mathcal{C}_o$, e.g., ImageNet [deng2009imagenet]. Given data of new classes $\mathcal{C}_n$, e.g., from the CUB [WelinderEtal2010] dataset, the pretrained model is now required to make predictions over the combined output space $\mathcal{C}_o \cup \mathcal{C}_n$. Similar to prior work, we start from the trained network (denoted by $p_o$, effectively a teacher), make a copy (denoted by $p_n$, effectively a student), and then add new randomly initialized neurons to $p_n$'s final layer to output logits for the new classes. We train $p_n$ to classify simultaneously over all classes, old and new, while network $p_o$ remains fixed.

Table 7: Continual learning results extending the network output space by adding new classes to ResNet-18. For each setting (ImageNet + CUB, ImageNet + Flowers, ImageNet + CUB + Flowers), top-1 accuracy (%) over the combined classes is reported on the individual datasets, together with the average over datasets (datasets treated equally regardless of size, so ImageNet samples carry less weight than CUB or Flowers samples). Methods compared: LwF.MC [rebuffi2017icarl], DeepDream [mordvintsev2015deepdream], DeepInversion (ours), and the distillation and classification oracles.

Incremental Learning Loss. We formulate a new loss with DeepInversion images as follows. We use same-sized batches of DeepInversion data $\hat{x} \in \hat{\mathcal{X}}_o$ and new-class real data $x \in \mathcal{X}_n$ for each training iteration. For $\hat{x}$, we use the original model $p_o$ to compute its soft labels $p_o(\hat{x})$, i.e., the class probabilities among old classes, and then concatenate them with zeros as new-class probabilities. We apply a KL-divergence loss between the predictions $p_n(\hat{x})$ and these targets on DeepInversion images for prior memory, and a cross-entropy loss between the one-hot labels $y(x)$ and predictions $p_n(x)$ on new-class images for emerging knowledge. Similar to prior work [li2017learning, rebuffi2017icarl], we also apply a third KL-divergence term between the new-class images' old-class predictions $p_{n,o}(x)$ and their original-model predictions $p_o(x)$. This forms the loss

$\mathcal{L} = \sum_{\hat{x} \in \hat{\mathcal{X}}_o} KL\big(p_n(\hat{x}),\, [p_o(\hat{x}); \mathbf{0}]\big) + \sum_{x \in \mathcal{X}_n} \Big[ CE\big(p_n(x),\, y(x)\big) + KL\big(p_{n,o}(x),\, p_o(x)\big) \Big]$ (12)

where $[p_o(\hat{x}); \mathbf{0}]$ denotes the old-class soft labels padded with zeros for the new classes, and $p_{n,o}(x)$ denotes the old-class portion of $p_n(x)$.
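A simplified sketch of this loss in PyTorch is given below; for readability it restricts the student's outputs to the old-class slice instead of zero-padding the teacher's soft labels, and term weighting is omitted, so it is an approximation of the formulation above rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def continual_learning_loss(student, teacher, x_synth, x_new, y_new, num_old):
    """Combined loss: distillation on DeepInversion images of old classes,
    cross-entropy on new real data, and distillation on the old-class slice
    of the new-data predictions."""
    # (1) KL on synthesized old-class images (prior memory).
    with torch.no_grad():
        t_synth = F.softmax(teacher(x_synth), dim=1)           # old-class soft labels
    s_synth = F.log_softmax(student(x_synth)[:, :num_old], dim=1)
    loss_memory = F.kl_div(s_synth, t_synth, reduction="batchmean")

    # (2) Cross-entropy on real new-class images (emerging knowledge).
    s_new_logits = student(x_new)
    loss_new = F.cross_entropy(s_new_logits, y_new)

    # (3) KL between old-class predictions of student and teacher on new images.
    with torch.no_grad():
        t_new = F.softmax(teacher(x_new), dim=1)
    s_new_old = F.log_softmax(s_new_logits[:, :num_old], dim=1)
    loss_old_on_new = F.kl_div(s_new_old, t_new, reduction="batchmean")

    return loss_memory + loss_new + loss_old_on_new
```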

Evaluation Results. We add new classes from the CUB [WelinderEtal2010] dataset, the Flowers [nilsback2008automated] dataset, and both CUB and Flowers to a ResNet-18 [he2016deep] classifier trained on ImageNet [deng2009imagenet]. Prior to each step of adding new classes, we generate 250 DeepInversion images per old category. We compare with the prior work LwF.MC [rebuffi2017icarl], as opposed to LwF [li2017learning], which cannot distinguish between old and new classes. We further compare with oracle methods in which we break the data-free constraint: we use the same amount of old-class real images from the old datasets in place of $\hat{\mathcal{X}}_o$, jointly with either their ground-truth labels for a classification loss or their soft labels from $p_o$ for a KL-divergence distillation loss. The third KL-divergence term in Eq. 12 is omitted in this case. Implementation details are in the appendix.

Results are shown in Table 7. Our method outperforms LwF.MC drastically in all cases and leads to consistent performance improvements over DeepDream in most scenarios. We come very close to the oracles (and occasionally outperform them), showing DeepInversion's efficacy in replacing ImageNet images for incremental learning. We verify that our observations also hold for VGG-16 in the appendix.

Conclusions

We have proposed DeepInversion for synthesizing training images with high resolution and fidelity given just a trained CNN. Further, by placing a student in the loop, our Adaptive DeepInversion method improves image diversity. Through extensive experiments, we have shown that our methods are generalizable and effective, and that they empower a new class of "data-free" tasks of immense practical significance.

Acknowledgements

We would like to thank Arash Vahdat and Ming-Yu Liu for in-depth discussions and helpful comments.

References

Supplementary Materials

We provide more experimental details in the following sections. First, we elaborate on CIFAR-10 experiments, followed by additional details on ImageNet results. We then give details of experiments on data-free pruning (Section 4.3 of the main paper) and data-free incremental learning (Section 4.5 of the main paper). Finally, we provide additional discussions.

Appendix A CIFAR-10 Appendix

a.1 Implementation Details

We run each knowledge distillation experiment between the teacher and student networks for a fixed number of epochs, with an initial learning rate that is decayed periodically by a constant multiplier. Image generation runs four times per epoch when VGG-11-BN is used as the teacher, and two times per epoch for ResNet-34. The synthesized image batches are immediately used for a network update (the instantaneous accuracy on these batches is shown in Fig. 4) and are then added to the pool of previously generated batches. In between image generation steps, we randomly sample previously generated batches for training.

a.2 BatchNorm Statistics vs. a Set of Images for $\mathcal{R}_{\text{feature}}$

DeepInversion approximates the feature statistics $\mathbb{E}[\mu_l(x) \mid \mathcal{X}]$ and $\mathbb{E}[\sigma^2_l(x) \mid \mathcal{X}]$ in Eq. 4 using BatchNorm parameters. As an alternative, one may acquire this information by running a subset of the original images through the network. We compare both approaches in Table 8. From the upper section of the table, we observe that feature statistics estimated from an image subset also support DeepInversion and Adaptive DeepInversion, providing another viable approach in the absence of BatchNorm layers. Only 100 images are needed to compute feature statistics with which Adaptive DeepInversion achieves almost the same accuracy as with BatchNorm statistics.
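A minimal sketch of this alternative is shown below: the per-layer statistics are collected with forward hooks over a small batch of real images (assumed to fit in one forward pass), and could then replace the BatchNorm running statistics in Eq. 4.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def collect_feature_statistics(model, images):
    """Estimate per-channel mean and variance of the inputs to every
    BatchNorm2d layer from a batch of real images, as an alternative to
    reading the layers' running statistics. Returns {module: (mean, var)}."""
    stats = {}

    def make_hook(module):
        def hook(mod, inputs, output):
            x = inputs[0]
            stats[mod] = (x.mean(dim=[0, 2, 3]),
                          x.var(dim=[0, 2, 3], unbiased=False))
        return module.register_forward_hook(hook)

    handles = [make_hook(m) for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model.eval()
    model(images)          # single forward pass over, e.g., 100 sampled images
    for h in handles:
        h.remove()
    return stats
```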

# Samples | DI Top-1 acc. (%) | ADI Top-1 acc. (%)
BatchNorm statistics
Table 8: CIFAR-10 ablation given mean and variance estimates based on (i) top: calculations from randomly sampled original images, and (ii) bottom: BatchNorm running_mean and running_var parameters. The teacher is a pretrained VGG-11-BN model; the student is a freshly initialized VGG-11-BN. DI: DeepInversion; ADI: Adaptive DeepInversion.

Appendix B ImageNet Appendix

b.1 DeepInversion Implementation

We provide additional implementation details of DeepInversion next. The total variance regularization in Eq. 3 is based on the sum of $\ell_2$ norms between the base image and its shifted variants: (i) two diagonal shifts, (ii) one vertical shift, and (iii) one horizontal shift, all by one pixel. We apply random flipping and jitter of a few pixels to the input before each forward pass. We use the Adam optimizer with a constant learning rate. We speed up the training process using half-precision floating point (FP16) via the APEX library.
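The sketch below illustrates this shift-based total variance term and the per-step input augmentation; the use of wrap-around `torch.roll` for the shifts and the jitter range are simplifying assumptions.

```python
import torch

def tv_l2_regularization(x):
    """Total variance term built from l2 differences between the image and its
    one-pixel shifted variants (vertical, horizontal, and two diagonals)."""
    diff_v = x - torch.roll(x, shifts=1, dims=2)                  # vertical shift
    diff_h = x - torch.roll(x, shifts=1, dims=3)                  # horizontal shift
    diff_d1 = x - torch.roll(x, shifts=(1, 1), dims=(2, 3))       # diagonal shift
    diff_d2 = x - torch.roll(x, shifts=(1, -1), dims=(2, 3))      # other diagonal
    return diff_v.norm() + diff_h.norm() + diff_d1.norm() + diff_d2.norm()

def augment(x, max_jitter=2):
    """Random flip and small translation applied before each forward pass;
    the jitter range here is a placeholder."""
    dx, dy = torch.randint(-max_jitter, max_jitter + 1, (2,)).tolist()
    x = torch.roll(x, shifts=(dx, dy), dims=(2, 3))
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[3])                               # horizontal flip
    return x
```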

Appendix C Data-free Pruning Appendix

c.1 Hardware-aware Loss

Our proposed pruning criterion considers the actual latency on the target hardware for more efficient pruning. Characterized by Eq. 10, the new scheme leverages $\mathcal{I}_{\text{err}}$ and $\mathcal{I}_{\text{lat}}$ to focus on the absolute changes in network error and inference latency, respectively, when the filter group $\mathbf{W}_g$ is zeroed out from the set of neural network parameters $\mathbf{W}$.

Akin to [molchanov2019importance], we approximate $\mathcal{I}_{\text{err}}$ as the sum of the first-order Taylor expansions of individual filters,

$\mathcal{I}_{\text{err}}(\mathbf{W}_g) = \sum_{s \in g} \big( g_s\, w_s \big)^2$ (13)

where $g_s$ and $w_s$ denote the gradient and the parameters of filter $s$, respectively. We implement this approximation via gate layers, as described in the original paper.

The importance of a group of filters in reducing latency can be assessed by removing it and measuring the latency change

$\mathcal{I}_{\text{lat}}(\mathbf{W}_g) = \mathrm{LAT}\big(\mathbf{W} \mid \mathbf{W}_g = 0\big) - \mathrm{LAT}\big(\mathbf{W}\big)$ (14)

where $\mathrm{LAT}(\cdot)$ measures the latency of an intermediate pruned model on the target hardware.

Since the vast majority of computation stems from convolutional operators, we approximate the overall network latency as the sum of their run-times. This is generally appropriate for inference platforms like mobile GPUs, DSPs, and server GPUs [chamnet, fbnet]. We model the overall latency of a network as:

$\mathrm{LAT}(\mathbf{W}) \approx \sum_{l} \mathrm{LAT}\big(op_l(\mathbf{W})\big)$ (15)

where $op_l(\mathbf{W})$ refers to the convolutional operator at layer $l$. By benchmarking the run-time of each operator in hardware into a single look-up table (LUT), we can easily estimate the latency of any intermediate model based on its remaining filters. Using a LUT for latency estimation has also been studied in the context of neural architecture search (NAS) [chamnet, fbnet]. For pruning, the LUT consumes substantially less memory and profiling effort than for NAS: instead of an entire architectural search space, it only needs to cover the operators obtained by reducing the number of filters of the pretrained model.
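A minimal sketch of this LUT-based estimate is given below; the key layout and type names are assumptions based on the description above, not the paper's exact data structures.

```python
from typing import Dict, List, Tuple

# Hypothetical look-up table keyed by Conv2D configuration:
# (in_channels, out_channels, kernel_size, stride, input_resolution) -> latency in ms.
ConvKey = Tuple[int, int, int, int, int]
LatencyLUT = Dict[ConvKey, float]

def estimate_latency(conv_configs: List[ConvKey], lut: LatencyLUT) -> float:
    """Approximate network latency as the sum of per-convolution run-times
    taken from a pre-profiled look-up table (cf. Eq. 15)."""
    return sum(lut[cfg] for cfg in conv_configs)

def latency_change(configs_before: List[ConvKey],
                   configs_after: List[ConvKey],
                   lut: LatencyLUT) -> float:
    """Latency change from removing a filter group (cf. Eq. 14); negative
    values indicate a speed-up of the pruned candidate model."""
    return estimate_latency(configs_after, lut) - estimate_latency(configs_before, lut)
```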

c.2 Implementation Details

Our pruning setup on the ImageNet dataset follows the work in [molchanov2019importance]. For knowledge distillation given varying image sources, we experiment with several softmax temperatures for each pruning case and report the highest validation accuracy observed. We profile and store latency values in the LUT under the following key format:

$\mathrm{key} = \big( c_{\text{in}},\; c_{\text{out}},\; k^{*},\; s^{*},\; d^{*} \big)$ (16)

where $c_{\text{in}}$, $c_{\text{out}}$, $k^{*}$, $s^{*}$, and $d^{*}$ denote the input channel number, output channel number, kernel size, stride, and input feature map dimension of a Conv2D operator, respectively. Parameters marked with a superscript $*$ remain at their default values in the teacher model. We scan input and output channels to cover all combinations of integer values less than their initial values. Each key is individually profiled on a V100 GPU with a batch size of one, using CUDA 10.1 and cuDNN 7.6 over eight computation kernels, and the mean run-time over repeated computations for the fastest kernel is stored. Latency estimation through Eq. 15 demonstrates a high linear correlation with the real latency. For hardware-aware pruning, we choose $\eta$ in Eq. 10 to balance the importance of $\mathcal{I}_{\text{err}}$ and $\mathcal{I}_{\text{lat}}$, and prune away a fixed number of filters per step in layer-wise groups. We prune at regular mini-batch intervals until the predefined filter/latency threshold is met, and continue to fine-tune the network after that. We use a fixed batch size for all our pruning experiments. To standardize input image dimensions, the default ResNet preprocessing from PyTorch is applied to MS COCO and PASCAL VOC images.

c.3 Hardware-aware Loss Evaluation

As an ablation, we show the effectiveness of the hardware-aware loss (Eq. 10 in Section 4.3) by comparing it with the pruning baseline in Table 9. We base this analysis on the original labeled data to enable comparison with prior art. Given the same latency constraints, the proposed loss improves top-1 accuracy.

V100 Lat. (ms) | Taylor [molchanov2019importance] Top-1 acc. (%) | Hardware-aware Taylor (Ours) Top-1 acc. (%)
- baseline
Table 9: ImageNet ResNet-50 pruning ablation with and without the latency-aware loss, given original data. Latency is measured on a V100 GPU at batch size one. Top-1 accuracy corresponds to the highest validation accuracy over 15 training epochs. The learning rate is initialized to 0.01 and decayed during training.

c.4 Pruning without Labels

Taylor expansion for pruning estimates the change in loss induced by feature map removal. Originally, it was proposed for the cross-entropy (CE) loss given ground-truth labels of input images. We argue that the same expansion can be applied to a knowledge distillation loss, in particular the Kullback-Leibler (KL) divergence loss computed between the teacher and student output distributions. We also use original data in this ablation for a fair comparison with prior work and show the results in Table 10. We can see that utilizing the KL loss leads to only small absolute top-1 accuracy changes compared to the original CE-based approach, while completely removing the need for labels in Taylor-based pruning estimates.

Filters pruned (%) | CE loss, w/ labels [molchanov2019importance] Top-1 acc. (%) | KL Div., w/o labels (Ours) Top-1 acc. (%)
- baseline
Table 10: ImageNet ResNet-50 pruning ablation with and without labels, given original images. CE: cross-entropy loss between predictions and ground-truth labels; KL Div.: Kullback-Leibler divergence loss between teacher and student output distributions. The learning rate is 0, hence no fine-tuning.

c.5 Distribution Coverage Expansion

The proposed Adaptive DeepInversion aims at expanding the distribution coverage of the generated images in the feature space through competition between the teacher and the student networks. Results of its impact are illustrated in Fig. 8. As expected, the distribution coverage gradually expands, given the two sequential rounds of competition following the initial round of DeepInversion. From the two side bars in Fig. 8, we observe varying ranges and peaks after projection onto each principal component from the three image generation rounds.

Figure 8: Projection onto the first two principal components of the ResNet-50-avgpool feature vectors of ImageNet class 'handheld computer' training images. ADI-1/2 refers to additional images from the round-1/2 competition.

Appendix D Data-free Incremental Learning Appendix

d.1 Implementation Details

Our DeepInversion setup for this application follows the descriptions in Section 4.2 and Appendix B.1, with minor modifications as follows. We use DeepInversion to generate {250, 64} images of resolution 224×224 per existing class in the pretrained {ResNet-18, VGG-16-BN}. These images are generated afresh after adding each new dataset. For {ResNet-18, VGG-16-BN}, we optimize each batch for a fixed number of gradient updates with a step-decayed learning rate. We use both $\ell_1$ and $\ell_2$ norms for the total variance regularization, jointly with the $\ell_2$ and feature-statistics terms of DeepInversion. These parameters are chosen such that all loss terms are of the same magnitude, and adjusted to optimize qualitative results.

Each method and dataset combination has an individually tuned learning rate and number of epochs, obtained on a validation split using grid search by optimizing the new dataset's performance while using the smallest learning rate and number of epochs that achieve it. For each iteration, we use a batch of DeepInversion data $\hat{x} \in \hat{\mathcal{X}}_o$ and a batch of new-class real data $x \in \mathcal{X}_n$. The batch size is identical for both kinds of data, with separate values for ResNet-18 and VGG-16-BN. Similar to prior work [rebuffi2017icarl], we reduce the learning rate at fixed fractions of the total number of epochs. We use SGD with momentum as the optimizer. We clip the gradient magnitude and disable all updates in the BatchNorm layers. Gradient clipping and freezing BatchNorm do not affect the baseline LwF.MC [rebuffi2017icarl] much (only a small change in combined accuracy after hyperparameter tuning), but significantly improve the accuracy of our methods and the oracles. We start with the pretrained ImageNet models provided by PyTorch. LwF.MC [rebuffi2017icarl] needs to use a binary cross-entropy loss; hence, we fine-tune the model with a small learning rate. The resulting ImageNet model is within a small top-1 error difference of the original model.

d.2 VGG-16-BN Results

We show our data-free incremental learning results on the VGG-16-BN network in Table 11. The proposed method outperforms prior art [rebuffi2017icarl] by a large margin in top-1 accuracy. We observe only a small gap in combined error between our proposal and the best-performing oracle for this experimental setting, again showing DeepInversion's efficacy in replacing ImageNet images for the incremental learning task.

Table 11: Results on incrementally extending the network softmax output space by adding classes from a new dataset (ImageNet + CUB and ImageNet + Flowers). All results are obtained using VGG-16-BN. Top-1 accuracy (%) over the combined classes is reported on the individual datasets and on average. Methods compared: LwF.MC [rebuffi2017icarl], DeepInversion (ours), and the distillation and classification oracles.

d.3 Use Case and Novelty

The most significant departure from prior work such as EWC [Kirkpatrick2016OvercomingCF] is that our DeepInversion-based incremental learning can operate on any regularly-trained model, given the widespread usage of BatchNorm layers. Our method eliminates the need for any collaboration from the model provider, even when the model provider (1) is unwilling to share any data, (2) is reluctant to train specialized models for incremental learning, or (3) does not have the know-how to support a downstream incremental learning application. This gives machine learning practitioners more freedom and expands their options when adapting existing models to new usage scenarios, especially when data access is restricted.