A promising approach to scale up transfer learning to diverse downstream vision tasks is to maintain a library of pre-trained experts. When presented with a novel downstream task, one can select an appropriate expert and quickly fine-tune the representation with small amounts of task-specific data. This strategy has many practical benefits: fine-tuning a task-relevant pre-trained representation is fast, there is no need to store or revisit expert pre-training data, and in principle the system can be “upgraded” at any time by simply adding additional experts to the library. How is such a library of expert representations populated? Previous work (e.g., Puigcerver et al. (2020); Achille et al. (2019); Deshpande et al. (2021)) has assumed a static collection of experts created by training on domain-specific datasets or semantically related subsets of classes selected from large-scale general-purpose datasets. Instead, we would like to automatically maintain and enrich the library of experts based on the accumulated experience of tuning models for downstream tasks, so that the overall system performance continually increases over time (life-long meta-learning).
One approach to growing the library would be to add every fine-tuned downstream task-specific model back into the library as a candidate expert for transfer on future tasks. Such a naïve approach clearly does not scale and, at a minimum, requires techniques for selecting experts that are sub-linear in the size of the library Puigcerver et al. (2020); Achille et al. (2019); Deshpande et al. (2021). More importantly, as our experiments show, task-specific models fine-tuned on small amounts of data do not provide transferable representations. Task-specific models tend to overspecialize, suffering “catastrophic forgetting” of under-utilized feature representations, and thus under-perform on new tasks compared to generic pre-trained models. To address this, we introduce representation consolidation, in which the goal is to consolidate knowledge from multiple task-specific teacher models into a single expert student representation that transfers to downstream tasks better than any of the individual teacher representations.
To carry out representation consolidation, we utilize multi-teacher multi-task model distillation (see Fig. 1). Here, a single student is trained to emulate multiple teachers, each of which operates on a different set of class labels. Previous work on multi-teacher knowledge distillation has focused on evaluating how well the student model performs the teachers’ tasks. Instead, we evaluate how well the student representation generalizes to new downstream tasks (whether related or unrelated to the teachers’ tasks). In this setting we demonstrate several surprising results:
While task-specific model representations transfer poorly, consolidating a task-specific teacher with a generalist teacher (ImageNet) is sufficient to rescue the student. The resulting representation transfers well, with improved downstream performance on teacher-relevant tasks while matching the performance of a strong generalist representation on unrelated tasks.
Consolidating multiple related task-specific teacher models can yield a student representation that exceeds the performance of any one teacher on downstream tasks.
Unlike knowledge distillation, which requires access to the teacher training data (or carefully crafted synthetic data from data-free distillation Lopes et al. (2017); Luo et al. (2020)) to achieve good performance, we avoid these data entirely and show that effective representation consolidation can be carried out using a sufficiently diverse generic proxy dataset, and that it is robust to the choice of proxy.
Table 1: Notation used in this paper.

| Symbol | Meaning |
| --- | --- |
| $N$ | The number of tasks / task-specific teachers / domains |
| $D_i$ | The $i$-th task-specific teacher's dataset |
| $f_i$ | The $i$-th task-specific teacher's backbone ($f_0$: the generalist's backbone) |
| $h_i$ | The $i$-th task-specific teacher's head (classifier layer) ($h_0$: the generalist's head) |
| $D_p$ | The large unlabeled proxy dataset used for distillation / consolidation |
| $f_s$ | The distilled/consolidated student's backbone |
| $h_s^i$ | The distilled/consolidated student's heads (classifier layers), one per teacher |
| $D'_j$ | The $j$-th downstream task's dataset |
| $f'_j$ | The consolidated student backbone fine-tuned on dataset $D'_j$ |
| $h'_j$ | The downstream model's head (classifier layer or linear SVM) for dataset $D'_j$ |
| $w_i$ | Loss weight for the $i$-th teacher |
| t-split | First 50% random split of classes of a dataset, used as $D_i$ |
| d-split | Second 50% random split of classes of a dataset, used as $D'_j$ |
2 Representation Consolidation
We start with a collection of one or more task-specific image classification models, trained on corresponding datasets $D_1, \dots, D_N$ belonging to some domain (e.g., satellite images, images of flowers, etc.). We assume each model consists of a feature extractor or backbone $f_i$, composed with a classifier head $h_i$, so that the model is $h_i \circ f_i$. We first consolidate the knowledge of these task-specific teachers into a single student representation $f_s$ using a proxy dataset $D_p$ (e.g., ImageNet) and then fine-tune the student representation on a given downstream task $D'_j$ chosen from some set of downstream tasks. Our goal is that the resulting downstream model $h'_j \circ f'_j$ achieves good performance, where $f'_j$ denotes the student representation after tuning on $D'_j$. Figure 1 highlights how this differs from standard distillation, in which the student model is simply evaluated on the same task its teachers were once trained to perform.
Forgetting and representation collapse during distillation.
We observe (Sec. 4) that standard knowledge distillation is insufficient to produce good student representations. Simple distillation yields models that under-perform generic pre-trained representations (e.g., ImageNet pre-training) when evaluated on downstream tasks from the teachers' domain, and drastically under-perform on tasks outside the teachers' domain. This holds true even though the student network is itself initialized with a general pre-trained representation. We argue that distillation from task-specific teachers thus suffers from catastrophic forgetting of general knowledge that is crucial for transfer learning.
Intuitively, transfer performance depends on how well different classes can be distinguished in the penultimate-layer feature space. Under traditional distillation, the student learns only from task-specific teachers trained on smaller datasets and is only required to discriminate the classes in those datasets. This is well suited to traditional distillation, which evaluates on those same tasks. But for representation consolidation, distinguishing unknown downstream classes requires preserving general features that may not be relevant to the teachers' specific tasks. Our strategy is thus to ensure that the student maintains general features as it learns task-specific ones.
We use multi-head multi-task distillation and avoid forgetting with the help of a generalist teacher. During distillation we use $N+1$ heads, $h_s^0, \dots, h_s^N$, on top of the student backbone $f_s$. In addition to the set of task-specific teachers $h_i \circ f_i$, $i = 1, \dots, N$, we also include a generalist teacher that was trained on ImageNet (denoted as $h_0 \circ f_0$) and optimize the student parameters using the loss:

$$\mathcal{L} = \mathbb{E}_{x \sim D_p} \sum_{i=0}^{N} w_i \, \mathcal{L}_{\mathrm{KD}}\big(h_i(f_i(x)),\ h_s^i(f_s(x))\big),$$

where $\mathcal{L}_{\mathrm{KD}}$ is the distillation loss Hinton et al. (2015), i.e. cross-entropy with temperature $\tau$:

$$\mathcal{L}_{\mathrm{KD}}(z^t, z^s) = -\sum_{c} \mathrm{softmax}(z^t/\tau)_c \, \log \mathrm{softmax}(z^s/\tau)_c,$$

where $c$ indexes the classes, and $z^t$, $z^s$ denote the teacher and student logits, respectively.
We initialize the student backbone $f_s$ and its 0-th head $h_s^0$ using the generalist model's weights $f_0$, $h_0$, whereas the other heads are randomly initialized. Since it is important to maintain the pre-trained model's representational power, we simply weight the generalist's distillation loss and the task-specific teachers' losses equally (an old:new loss ratio of 1:1; see Sec. 4 for ablations). In this way, the learned representation must include features useful both for the pre-trained model's classes and for the other task-specific teachers' classes, improving its suitability for future downstream transfer. After training the student, we evaluate the resulting representation on multiple downstream tasks $D'_j$. For each task we can either fine-tune the whole model or keep the student representation fixed and only learn the classifier head $h'_j$, which is often referred to as a linear probe.
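To make the training step concrete, below is a minimal PyTorch sketch of one consolidation update under the setup above: the frozen teachers (generalist first) are queried on a proxy batch, and each student head is matched to one teacher through the weighted distillation loss. The function and variable names, the temperature value, and the absence of any extra gradient scaling are illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Distillation loss of Hinton et al. (2015): cross-entropy between
    temperature-softened teacher and student class distributions."""
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    return -(p_teacher * log_p_student).sum(dim=1).mean()

def consolidation_step(x, student_backbone, student_heads, teachers, weights,
                       optimizer, tau=4.0):
    """One optimization step of representation consolidation on a proxy batch x.

    teachers[0] is the frozen ImageNet generalist, teachers[1..N] are the frozen
    task-specific models; student_heads[i] is the student head paired with teacher i.
    """
    feats = student_backbone(x)              # shared student representation f_s(x)
    with torch.no_grad():                    # teachers only provide soft labels
        teacher_logits = [t(x) for t in teachers]
    loss = sum(w * kd_loss(head(feats), t_logits, tau)
               for w, head, t_logits in zip(weights, student_heads, teacher_logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each teacher is read only through its logits on proxy images, teachers of different architectures (or feature dimensionalities) can be mixed freely in this loop.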
3 Experimental setup
Datasets and downstream tasks.
We utilize datasets from a variety of domains to generate teachers and downstream tasks: Cars196 Krause et al. (2013), Resisc45 Cheng et al. (2017) (remote sensing images), iFood Kaur et al. (2019) and Food101 Bossard et al. (2014), iFashion Guo et al. (2019), DTD Cimpoi et al. (2014) (describable textures), iNaturalist Horn et al. (2018) (species classification, 2019 challenge version), CUB Birds Wah et al. (2011), Flowers Nilsback and Zisserman (2008), Caltech256 Griffin et al. (2007), and Aircrafts Maji et al. (2013). Among these, iFood and Food101 are the same domain, and Birds and Flowers are subdomains of iNaturalist. We checked for near-duplicates between these datasets using perceptual hashing (ImageHash), and found negligible duplication: 1 out of 134k iNaturalist images (the 50% of classes we used) is a duplicate of a CUB image, and 8 out of the 130k/100k images are duplicates between iFood and Food101.
To evaluate whether a consolidated student representation has learned features relevant to a specific domain, we require downstream tasks that are similar to each task-specific teacher's task. Except for Food101 and iFashion, we split each dataset at random into two disjoint sets which each contain only 50% of the classes. We take the first half of the dataset, named the “t-split”, and use it as $D_i$ to train a task-specific teacher. We use the second half of each dataset, named the “d-split”, as one of the downstream tasks $D'_j$. When evaluating few-shot downstream transfer, we randomly sample 5 training images from each available class in $D'_j$, but always use the entire test set for evaluation. For classes with fewer images, we use all available samples. For iFashion, which is a multi-task multi-label dataset, we use all images but with a subset of labels (e.g., those related to clothing category) as $D_i$ for teacher training. We then use all images with a disjoint set of labels (e.g., those related to sleeves) as a downstream task for evaluating transfer. For iFashion's few-shot scenario, we randomly subsample 1000 images for downstream training, and evaluate on all test data. We do not train any teacher on Food101, so its complete set of classes is used as a downstream task.
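As a rough sketch of the data preparation just described, the snippet below builds the disjoint t-split/d-split class partition and filters a dataset's samples to one split; the helper names and the (image path, label) sample format are hypothetical, not our released code.

```python
import random

def make_t_d_splits(class_names, seed=0):
    """Randomly partition a dataset's classes into two disjoint halves:
    the "t-split" (teacher training) and the "d-split" (downstream task)."""
    rng = random.Random(seed)
    classes = list(class_names)
    rng.shuffle(classes)
    half = len(classes) // 2
    return set(classes[:half]), set(classes[half:])

def filter_by_classes(samples, keep_classes):
    """Keep only (image_path, label) pairs whose label belongs to the chosen split."""
    return [(path, label) for path, label in samples if label in keep_classes]

# Example: one dataset's classes split into teacher and downstream halves.
t_split, d_split = make_t_d_splits(["rose", "tulip", "daisy", "iris"])
assert t_split.isdisjoint(d_split)
```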
We note that our goal is not to produce state-of-the-art representations for transfer learning, but rather to demonstrate an approach that reliably improves the transfer performance of representations learned using distillation. We evaluate each student representation $f_s$ by its performance when transferred to various downstream datasets, and compare to multiple baselines for initializing the downstream representation:
(1) the ImageNet-pretrained backbone, which is a strong baseline for transfer learning; (2) the ImageNet-pretrained backbone fine-tuned on the soft labels produced by the ImageNet-pretrained model itself with batch norms in test mode (this baseline isolates the effect of soft labels / self-distillation on model performance); (3) the task-specific teacher (or one of the teachers when $N > 1$) without further distillation; (4) traditional distillation with the $N$ task-specific teachers, i.e. without ImageNet as a teacher; and (5) our consolidated representation, which includes the ImageNet generalist as a teacher.
We use transfer accuracy on downstream tasks to measure each representation's power. We primarily use the linear probe (train a linear SVM as $h'_j$ over the fixed backbone $f_s$) on $D'_j$'s training set (a single training run for the full dataset; few-shot performance averaged over 50 random subsampling trials). We also verify that our results hold when fine-tuning the student representation $f_s$.
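Below is a sketch of this few-shot linear-probe protocol, assuming features have already been extracted from the frozen student backbone (as numpy arrays): accuracy is averaged over 50 random 5-image-per-class subsamplings, with the SVM's regularization strength chosen by cross-validation. The specific grid of C values is an assumption.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit the linear-SVM head on frozen-backbone features (cross-validated C)
    and return top-1 accuracy on the full test set."""
    grid = {"C": np.logspace(-4, 2, 7)}          # assumed log-scale search grid
    search = GridSearchCV(LinearSVC(), grid, cv=5, n_jobs=-1)
    search.fit(train_feats, train_labels)
    return search.score(test_feats, test_labels)

def few_shot_probe(train_feats, train_labels, test_feats, test_labels,
                   shots=5, trials=50):
    """Average probe accuracy over random few-shot subsamples of the training set."""
    labels = np.asarray(train_labels)
    accs = []
    for seed in range(trials):
        rng = np.random.default_rng(seed)
        idx = []
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            take = min(shots, len(members))      # use all images if a class has < 5
            idx.extend(rng.choice(members, size=take, replace=False))
        idx = np.asarray(idx)
        accs.append(probe_accuracy(train_feats[idx], labels[idx],
                                   test_feats, test_labels))
    return float(np.mean(accs))
```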
Implementation details of downstream training & evaluation.
We will release our code to reproduce this paper upon publication. We use PyTorch Paszke et al. (2019). For all network training, we use SGD with momentum of 0.9, weight decay, and a batch size of 32. Unless noted, we decay the learning rate at 50% and 80% of total training. We initialize task-specific teacher training with a ResNet50 He et al. (2016) pre-trained on ImageNet. We fine-tune each teacher on its task-specific dataset $D_i$ for 120 epochs (learning rate decay at the 70th and 100th epoch) while doing a log-scale grid search on the learning rate. For distillation, our method is less sensitive to the choice of learning rate: we use a fixed learning rate of 0.01 for each newly initialized head and 0.001 for the backbone $f_s$, with a 40-epoch schedule. Note that ImageNet pre-training in PyTorch uses 0.001 as the final-epoch learning rate. Each of the 38 unique traditional-distillation or representation-consolidation experiments takes roughly 4 days on an AWS instance with an NVIDIA V100. When the downstream transfer uses a fixed $f_s$, we extract features on the center image crop, and search $h'_j$'s SVM hyperparameters using 5-fold cross-validation in scikit-learn Pedregosa et al. (2011). When the downstream transfer uses fine-tuning, we run a log-scale grid search over the learning rate with a 50-epoch schedule.
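For reference, here is a sketch of how the consolidation-stage optimizer could be configured from these settings; assigning the smaller fixed learning rate to the pre-initialized backbone and the larger one to the newly initialized heads, as well as the weight-decay value, are our assumptions where the text leaves them implicit.

```python
import torch

def make_consolidation_optimizer(student_backbone, student_heads,
                                 backbone_lr=0.001, head_lr=0.01,
                                 weight_decay=1e-4):
    """SGD with momentum 0.9 over separate parameter groups: the backbone keeps
    the small fixed learning rate and each head uses the larger one (no decay
    schedule during the 40-epoch consolidation stage). The weight-decay value
    and the lr-to-parameter-group assignment are assumed."""
    param_groups = [{"params": student_backbone.parameters(), "lr": backbone_lr}]
    for head in student_heads:
        param_groups.append({"params": head.parameters(), "lr": head_lr})
    return torch.optim.SGD(param_groups, momentum=0.9, weight_decay=weight_decay)
```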
Motivational analysis: traditional distillation vs. representation consolidation
To highlight the difference between representation consolidation and traditional distillation, we use either iFood or Resisc45 (t-split) as $D_1$ to train a teacher, and run distillation/consolidation with $N = 1$. We test on iFood (d-split), Food101 (full), and Resisc45 (d-split) as downstream tasks (the t-split and d-split are disjoint 50% class splits of each dataset; see Section 3).
We compare the ImageNet-pretrained representation $f_0$, the task-specific teacher's representation $f_1$, the traditionally distilled student (i.e., using only the task-specific model as a teacher), and our consolidated student (i.e., using both the generalist and the task-specific model as teachers). On downstream tasks we follow the representation-learning evaluation protocol (linear probe): train a linear SVM on $D'_j$'s training set and evaluate on its test set. The results are shown in Table 2. We also evaluate each representation on the (upstream) teacher's task following the traditional distillation evaluation protocol, i.e., directly evaluating each model's performance on the upstream task $D_1$'s test set.
If we only focus on the upstream task $D_1$, then traditional (proxy) distillation almost matches the teacher's performance, and representation consolidation performs worse than both. However, if we instead focus on downstream transfer performance, we see the opposite trend. On the Food101 downstream task, the representational power of both the generalist teacher (ImageNet) and the task-specific teacher is lower than that of our consolidated representation. For teacher-related tasks (i.e., downstream tasks whose classes are the disjoint split of the same dataset as $D_1$), we see a benefit of consolidation over generic pre-trained features. On the unrelated downstream task (Resisc45), where the generalist teacher excels, our method almost matches the generalist's performance even though it includes an improved representation for the food task. We see a similar trend for the Resisc45 teacher, where the consolidated student outperforms the teacher and the generalist on teacher-related tasks but still retains the generalist's performance on unrelated tasks. This clearly demonstrates the significant difference between upstream performance and downstream transfer.
Improving student representation when $N = 1$.
We show that this advantage of consolidation over the baselines holds for a wide range of upstream and downstream datasets. Figure 2 summarizes few-shot SVM accuracy using different representations relative to the ImageNet-pretrained baseline (see the supplemental material for the full figure and raw numbers).
The conclusions are similar – consolidation outperforms (Cars196, Resisc45, iFood, CUB, Aircrafts, iNaturalist) or matches (iFashion, DTD, Flowers, Caltech256) ImageNet pre-trained model performance on related downstream tasks, and matches its performance on unrelated ones. In contrast, traditional distillation (1) underperforms our method on related downstream tasks for all teachers except Cars196, iFashion, and Aircraft, and (2) drastically underperforms both consolidated and ImageNet features on unrelated downstream tasks. Notably, traditional distillation underperforms ImageNet even on related downstream tasks for Resisc45, DTD, Flowers, and Caltech256 teachers.
Consolidating representations with $N > 1$.
We can merge multiple related task-specific teachers to obtain a better representation. In addition, we show that our method is not constrained to merging models with the same architecture or feature-space dimensionality, unlike prior work Geyer et al. (2019). To illustrate this, we split the t-split of Cars196 and of iFood into five equal splits, each containing 10% of the original classes. We train a ResNet18 teacher on each of the five splits. Then, we use either traditional distillation or representation consolidation to merge these models with a ResNet50 generalist teacher. We compare the resulting representations with the teacher models' and with ImageNet. Figure 2(a) shows the result. For downstream tasks, we obtain better performance than using only one of the five teachers on both related and unrelated tasks, especially for iFood and Food101, where the teachers themselves underperform on the related downstream task. This shows that our method can benefit even from teachers whose representations are weak, as long as they contain domain knowledge.
Full comparisons with traditional distillation and with representation consolidation from only one of the five teachers are in the supplemental material: we gain performance on related downstream tasks by using five task-specific teachers instead of one, and we outperform traditional distillation on related downstream tasks for iFood. We also show results when all five split teachers are ResNet50 in the supplemental material; the results are similar and the conclusions identical.
Finally, we explore merging models from different domains to form a multi-domain consolidated student (Figure 2(b)). We observe that we outperform ImageNet on all related downstream tasks, but the performance gain is smaller than when consolidating within a single domain. We show in the supplemental material that we outperform traditional distillation on most downstream tasks, except Cars196.
We also verify that our conclusions generalize to fine-tuning, especially without few-shot sampling. Figure 4 shows an excerpt of our results that includes full-dataset transfer. Fine-tuning allows the representation to adapt to the downstream task, so the benefit of the teachers' domain knowledge shrinks compared to the fixed-backbone linear-SVM setting. Despite this, the conclusions match the fixed-backbone scenario. See the supplemental material for few-shot fine-tuning and for transfer with full d-splits, whose conclusions are the same as this section's.
Influence of the loss weights.
We show in Figure 4(a) that our results are sensitive to the loss balance: when we up-weight the task-specific teacher relative to the generalist (an old:new loss ratio of 1:3), we gain performance on the related downstream task but lose performance on unrelated ones, whereas up-weighting the generalist (old:new ratio of 3:1) gives the opposite. This allows us to trade off the influence of each teacher, potentially suppressing unreliable ones.
Proxy data choice.
Figure 4(b) shows results when replacing ImageNet with Places365 as the proxy dataset $D_p$. This yields performance similar to using ImageNet, only slightly weaker on unrelated downstream tasks, showing that representation consolidation is somewhat robust to the choice of proxy data. In the last row, we add a head to the student to learn the supervised ground truth of Places365 with a cross-entropy loss with a loss weight of 1. This surprisingly hurts performance on both related and unrelated tasks, suggesting that the way representation consolidation preserves general knowledge is largely due to the distillation loss against the generalist teacher, not to any new information from the proxy input images.
5 Related work
Our goal is to maximize the transferability of a representation consolidated from multiple teachers. This downstream transfer aspect has received little attention in the literature (see e.g., Liu et al. (2019); Geyer et al. (2019); Tian et al. (2020)), but this problem formulation is closely related to prior work on multi-model merging and distillation with proxy data.
It is useful to combine separate models that perform different tasks into a single model for efficiency and performance benefits. Knowledge Concentration Gao et al. (2017) combines teachers trained on subsets of 100k classes in the EFT dataset Gao et al. (2017) into a single model that outperforms direct training on the full data, using handcrafted sparse connections for the final student layers. Chou et al. Chou et al. (2018) merge CNNs by combining the kernel weights of different models via clustering and lookup, then fine-tuning the resulting weights. Zhao et al. Zhao et al. (2020) merge object detection models whose classes may be background to other models. Vongkulbhisal et al. Vongkulbhisal et al. (2019) combine models with different but potentially overlapping classes into one using the combined dataset, computing unified soft labels from known intra-model class correspondences. Ye et al. Ye et al. (2020) progressively train a GAN to regenerate proxy data for all teachers, and transfer all teachers into one student layer by layer. Chakraborty et al. Chakraborty et al. (2018) aggregate an ensemble of weak classifiers by a weighted average of their outputs. Park and Kwak Park and Kwak (2020) merge an ensemble for the same task using both the distillation loss and an adaptation layer that predicts the teachers' penultimate-layer activations. In one-shot federated learning Guha et al. (2019), multiple clients each train a model on their own private data, which is never shared or otherwise leaked, and the models are merged at the end of training using an ensemble or distillation.
Unlike our approach, these methods merge models in order to perform exactly the same task as the teacher models, are not concerned with the performance of the student when transferred to a downstream task, and often Gao et al. (2017); Chou et al. (2018); Zhao et al. (2020); Vongkulbhisal et al. (2019); Chakraborty et al. (2018); Park and Kwak (2020) require revisiting the original training images. We have shown that such teachers themselves have poor transferability compared to a simple pretrained baseline, and that merging them yields suboptimal transfer learning performance.
Multi-model merging for transfer.
Knowledge Flow Liu et al. (2019) connects the student to multiple teachers' intermediate layers to kick-start its training, and gradually penalizes its reliance on the teacher models over the course of training on the final target task. Geyer et al. Geyer et al. (2019) use IMM Lee et al. (2017) to merge multiple models via their diagonal-approximated Fisher information matrices (FIM), balancing their importance with learned weights, to form a new representation to fine-tune from. Computing the FIM requires reprocessing the original teacher training data (unlike our approach, which only needs generic proxy data). Furthermore, this method requires all students and teachers to share exactly the same backbone architecture, whereas ours works with any combination of student and teacher architectures. These methods directly optimize performance on the downstream task and thus require a separate model-merging run for each target dataset. Our approach is more efficient, as it consolidates the teachers only once to improve the pre-trained representation, independently of the downstream task. Finally, we note that the representations learned in these works were not compared to a strong baseline (i.e., pretraining on ImageNet), which we argue is a prerequisite for being useful in real applications. One of our main contributions is observing the need to include a generalist teacher, an insight which is largely orthogonal to, and could be combined with, these previous approaches.
Distillation with representation losses (“representation distillation”)
tries to capture additional structure of feature representations by aligning student and teacher feature activations during distillation. Koratana et al. Koratana et al. (2019) compress models by adding losses between intermediate representations of the teacher and student. Aguilar et al. Aguilar et al. (2020) use KL divergence and cosine similarity to make the attention maps and intermediate feature representations of the student and teacher similar. Tian et al. Tian et al. (2020) add a contrastive representation-learning loss on the penultimate layer to preserve feature structure, maximizing the mutual information between each image's student and teacher features. They demonstrate this can yield better representations for transfer than traditional knowledge distillation, but do not consider multiple downstream tasks or compare to strong generalist baselines.
Proxy data (“data-free”) distillation
transfers the input-output function of a teacher network to a student network without using the teacher's training data. Yalniz et al. (2019); Orekondy et al. (2019) use a large general proxy dataset to query the teacher and train the student on the teacher's outputs on this data. Other methods Nayak et al. (2019); Chen et al. (2019); Haroush et al. (2020); Chawla et al. (2021) generate proxy data directly from the trained models and use this data to train the students. Further, Micaelli and Storkey (2019); Yin et al. (2020) also encourage generating samples on which the student and teacher disagree. Other methods require the original dataset to compute meta-data such as feature cluster means Lopes et al. (2017); Bhardwaj et al. (2019). Some train a GAN from the teachers to maximize chosen class predictions Fang et al. (2019); Yoo et al. (2019); Ye et al. (2020), sometimes also matching batchnorm statistics Luo et al. (2020); Xu et al. (2020), and sometimes training on proxy data instead Addepalli et al. (2020); Besnier et al. (2020). These works concern neither the transferability of the learned student network nor the merging of multiple teachers.
Continual and incremental learning is also related to our overall goal of growing a library of expert representations. These methods continually learn tasks or classes, though often with limited access to past training data Li and Hoiem (2018); Rebuffi et al. (2017); Kirkpatrick et al. (2017); Zenke et al. (2017); Hu et al. (2019); Yin et al. (2020); Prabhu et al. (2020). Our approach addresses many of the same challenges by consolidating knowledge in the form of feature representations without revisiting the old data used to train teachers, but our “increments” are whole tasks whose labels do not necessarily overlap.
In this paper, we show that traditional distillation can result in a representation that is suboptimal for downstream transfer learning, because it focuses only on preserving the end-to-end input-output mapping of the old task. We show that our representation consolidation, with the generalist model as an additional teacher, preserves the transferability of the strong ImageNet baseline and improves performance on both related and unrelated downstream tasks over traditionally distilled networks. We also show that we can merge multiple models from the same domain to obtain a better representation than any single model.
Limitations and societal impact.
We assume perfect knowledge of which tasks belong to the same domain and which do not. Our performance drops when teachers from different domains are consolidated; in future work, we plan to automatically determine how to cluster a large number of teachers into domains. In addition, one of our contributions assumes the existence of a strong baseline representation such as the ImageNet pre-trained model, which holds for, e.g., images and language, but not for other fields such as 3D reconstruction. Our method also takes the teachers as-is and learns from them, so any mistakes made by the teachers can propagate during consolidation. Possible mitigations include using better teachers that make fewer such mistakes, or regularizing both teacher and student training, for example by encouraging similar inputs to map to similar outputs.
References

- Achille et al. (2019). Task2vec: task embedding for meta-learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6430–6439.
- Addepalli et al. (2020). DeGAN: data-enriching GAN for retrieving representative samples from a trained classifier. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3130–3137.
- Aguilar et al. (2020). Knowledge distillation from internal representations. In AAAI.
- Besnier et al. (2020). This dataset does not exist: training models from generated images. In ICASSP 2020, pp. 1–5.
- Bhardwaj et al. (2019). Dream distillation: a data-independent model compression framework.
- Bossard et al. (2014). Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision.
- Chakraborty et al. (2018). A mixture model for aggregation of multiple pre-trained weak classifiers. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 454–4547.
- Chawla et al. (2021). Data-free knowledge distillation for object detection. In WACV.
- Chen et al. (2019). DAFL: data-free learning of student networks. In ICCV.
- Cheng et al. (2017). Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105 (10), pp. 1865–1883.
- Chou et al. (2018). Unifying and merging well-trained deep neural networks for inference stage. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2049–2056.
- Cimpoi et al. (2014). Describing textures in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613.
- Deshpande et al. (2021). A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084.
- Fang et al. (2019). Data-free adversarial distillation. ArXiv abs/1912.11006.
- Gao et al. (2017). Knowledge concentration: learning 100k object classifiers in a single CNN. ArXiv abs/1711.07607.
- Geyer et al. (2019). Transfer learning by adaptive merging of multiple models. In MIDL.
- Griffin et al. (2007). Caltech-256 object category dataset. Technical report, California Institute of Technology.
- Guha et al. (2019). One-shot federated learning. Machine Learning on Devices Workshop at NeurIPS.
- Guo et al. (2019). The iMaterialist fashion attribute dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3113–3116.
- Haroush et al. (2020). The knowledge within: methods for data-free model compression. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8491–8499.
- He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Hinton et al. (2015). Distilling the knowledge in a neural network. ArXiv abs/1503.02531.
- Horn et al. (2018). The iNaturalist species classification and detection dataset. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8769–8778.
- Hu et al. (2019). Overcoming catastrophic forgetting for continual learning via model adaptation. In ICLR.
- ImageHash library. https://github.com/JohannesBuchner/imagehash. Accessed: 2021-05-27.
- Kaur et al. (2019). FoodX-251: a dataset for fine-grained food classification. arXiv preprint arXiv:1907.06167.
- Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, pp. 3521–3526.
- Koratana et al. (2019). LIT: learned intermediate representation training for model compression. In ICML.
- Krause et al. (2013). 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.
- Lee et al. (2017). Overcoming catastrophic forgetting by incremental moment matching (IMM). In Advances in Neural Information Processing Systems 30.
- Li and Hoiem (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, pp. 2935–2947.
- Liu et al. (2019). Knowledge flow: improve upon your teachers. In 7th International Conference on Learning Representations, ICLR 2019.
- Lopes et al. (2017). Data-free knowledge distillation for deep neural networks. ArXiv abs/1710.07535.
- Luo et al. (2020). Large-scale generative data-free distillation. ArXiv abs/2012.05578.
- Maji et al. (2013). Fine-grained visual classification of aircraft. ArXiv abs/1306.5151.
- Micaelli and Storkey (2019). Zero-shot knowledge transfer via adversarial belief matching. In NeurIPS.
- Nayak et al. (2019). Zero-shot knowledge distillation in deep networks. In International Conference on Machine Learning, pp. 4743–4751.
- Nilsback and Zisserman (2008). Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729.
- Orekondy et al. (2019). Knockoff nets: stealing functionality of black-box models. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4949–4958.
- Park and Kwak (2020). Feature-level ensemble knowledge distillation for aggregating knowledge from multiple networks. In ECAI.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Prabhu et al. (2020). GDumb: a simple approach that questions our progress in continual learning. In ECCV.
- Puigcerver et al. (2020). Scalable transfer learning with expert models. arXiv preprint arXiv:2009.13239.
- Rebuffi et al. (2017). iCaRL: incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5533–5542.
- Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- Tian et al. (2020). Contrastive representation distillation. In International Conference on Learning Representations.
- Vongkulbhisal et al. (2019). Unifying heterogeneous classifiers with distillation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3170–3179.
- Wah et al. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- Xu et al. (2020). Generative low-bitwidth data free quantization. In ECCV.
- Yalniz et al. (2019). Billion-scale semi-supervised learning for image classification. ArXiv abs/1905.00546.
- Ye et al. (2020). Data-free knowledge amalgamation via group-stack dual-GAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12513–12522.
- Yin et al. (2020). Dreaming to distill: data-free knowledge transfer via DeepInversion. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8712–8721.
- Yoo et al. (2019). Knowledge extraction with no observable data. In NeurIPS.
- Zenke et al. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research 70, pp. 3987–3995.
- Zhao et al. (2020). Object detection with a unified label space from multiple datasets. In ECCV.
- Zhou et al. (2018). Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, pp. 1452–1464.
Appendix A Complete figures for Section 4 experiments
We now provide the full graphs and tables for our experiments. Note that these results and all conclusions are summarized in the main paper.
(Table summarizing, for each teacher dataset, the outcomes of three comparisons: ours vs. ImageNet, ours vs. traditional distillation, and traditional distillation vs. ImageNet; raw numbers are given in Appendix B.)
Improving student representation when $N = 1$.
Consolidating representations with $N > 1$.
For the same-domain model merging of Figures 7 and 8, in addition to the main-paper results that merge ResNet18 teachers with a ResNet50 generalist, we show similar results when all teachers are ResNet50 in Figure 8. We also compare to traditional distillation with 5 teachers, representation consolidation with only 1 teacher, and the teacher itself (randomly chosen, or the best of the 5 according to related downstream performance). The conclusions are the same, and merging five teachers using representation consolidation outperforms all baselines on both related and unrelated downstream tasks (except Cars196 on related tasks against traditional distillation).
For the multi-domain model merging in Figure 9, we compare to traditional distillation with multi-task learning; we outperform it on all related and unrelated downstream tasks except for Cars196. We also explore using a concatenation of multiple large unlabeled datasets as the proxy $D_p$. With a more diverse proxy, performance drops a little on related downstream tasks and stays similar on unrelated ones, suggesting we are somewhat insensitive to the choice of proxy datasets, but that a more diverse proxy does not necessarily provide better model-merging performance.
Appendix B Raw number accuracy tables for all experiments
|Cars196 (t-split) repr. consolid.||59.21||70.00||28.42||35.28||89.00||58.99||83.01||80.26||54.02||29.85|
|Cars196 (t-split) trad. distill||59.39||51.56||9.75||11.54||84.27||36.07||59.91||35.51||16.90||16.82|
|Resisc45 (t-split) repr. consolid.||30.80||72.64||28.77||36.37||89.03||60.08||82.75||80.77||55.47||29.17|
|Resisc45 (t-split) trad. distill||8.97||61.62||9.85||12.11||84.97||37.80||56.06||34.14||11.23||11.57|
|iFood (t-split) repr. consolid.||29.44||69.49||38.85||47.19||89.04||57.16||82.66||79.17||51.07||27.31|
|iFood (t-split) trad. distill||12.85||53.57||35.26||44.05||85.39||36.62||70.99||39.94||17.95||14.90|
|iFashion category task repr. consolid.||31.89||70.02||29.23||37.45||89.15||60.76||83.28||81.29||57.10||30.24|
|iFashion category task trad. distill||6.41||37.97||6.26||8.09||90.70||23.45||44.01||30.32||9.28||9.27|
|DTD (t-split) repr. consolid.||31.33||69.78||29.17||37.38||89.08||61.43||82.80||81.28||56.09||29.59|
|DTD (t-split) trad. distill||11.29||53.12||14.11||19.00||86.39||46.80||62.48||46.38||17.14||14.75|
|Flowers (t-split) repr. consolid.||31.81||69.94||29.46||37.01||89.13||60.06||84.79||81.18||57.07||30.41|
|Flowers (t-split) trad. distill||16.72||58.08||16.68||20.67||87.03||44.70||77.35||54.49||31.32||20.26|
|Caltech256 (t-split) repr. consolid.||31.71||69.36||29.18||36.97||89.08||61.07||83.14||81.28||56.63||29.98|
|Caltech256 (t-split) trad. distill||21.6||59.67||19.44||25.10||87.35||51.59||75.52||68.98||36.09||23.32|
|CUB (t-split) repr. consolid.||31.50||70.35||29.16||36.83||89.20||60.32||83.30||80.89||66.37||29.65|
|CUB (t-split) trad. distill||17.15||57.95||17.06||20.72||87.03||46.37||72.16||53.26||62.41||19.00|
|Aircrafts (t-split) repr. consolid.||30.71||69.58||28.25||35.54||89.06||58.67||81.36||80.54||53.95||55.26|
|Aircrafts (t-split) trad. distill||10.56||44.30||7.81||9.09||84.17||29.16||48.32||29.18||10.87||58.51|
|iNaturalist (t-split) repr. consolid.||30.08||69.67||27.35||34.52||88.70||57.28||84.77||78.01||67.58||27.94|
|iNaturalist (t-split) trad. distill||13.51||57.25||14.66||18.68||84.46||39.49||79.74||41.59||57.49||16.98|
|Cars196 10% x 5 (Res18) repr. consolid.||42.03||69.95||29.44||36.97||89.05||60.15||83.23||81.48||58.18||31.10|
|Cars196 10% x 5 (Res18) trad. distill||45.37||60.74||17.53||20.80||87.11||48.16||73.66||58.56||39.02||23.85|
|Cars196 10% (Res18) repr. consolid.||36.97||69.64||28.62||35.95||88.92||59.88||82.79||80.37||56.80||30.87|
|Cars196 10% classes (Res18) teacher||34.28||56.03||16.83||20.78||87.14||45.59||70.31||57.13||39.40||22.33|
|Cars196 10% classes (Res18) teacher (best of 5)||35.39||56.47||16.98||20.50||87.59||45.77||70.58||60.41||41.87||23.40|
|iFood 10% x 5 (Res18) repr. consolid.||31.14||70.69||35.82||42.97||88.98||58.98||83.51||80.70||53.91||29.80|
|iFood 10% x 5 (Res18) trad. distill||17.60||60.26||34.10||39.71||86.58||45.93||76.18||51.55||27.91||20.27|
|iFood 10% (Res18) repr. consolid.||30.45||69.54||31.25||38.47||88.99||58.59||81.97||79.37||54.33||30.25|
|iFood 10% classes (Res18) teacher||13.77||52.88||22.29||25.46||85.84||36.14||65.34||45.19||21.58||16.43|
|iFood 10% classes (Res18) teacher (best of 5)||14.27||52.35||23.11||26.40||86.45||38.34||65.88||44.94||22.58||16.11|
|Cars196 10% x 5 repr. consolid.||45.99||70.10||29.23||37.02||89.16||60.40||83.82||81.21||57.40||31.42|
|Cars196 10% x 5 trad. distill||50.33||62.63||18.52||22.32||86.77||49.31||75.66||59.74||37.27||25.48|
|Cars196 10% repr. consolid.||39.23||69.62||28.42||36.01||89.03||59.84||82.76||80.34||56.43||31.20|
|Cars196 10% classes teacher||41.76||60.72||19.63||24.53||87.85||50.03||78.77||61.27||39.22||23.27|
|Cars196 10% classes teacher (best of 5)||42.92||64.48||20.87||25.95||87.58||51.79||78.69||64.35||43.20||25.01|
|iFood 10% x 5 repr. consolid.||30.74||70.64||37.34||44.77||88.91||59.17||84.58||80.20||52.47||29.83|
|iFood 10% x 5 trad. distill||17.70||61.33||35.88||42.71||86.18||44.63||77.50||50.98||25.36||20.49|
|iFood 10% repr. consolid.||30.05||68.62||32.06||39.37||88.83||58.67||82.09||79.43||53.83||29.68|
|iFood 10% classes teacher||14.07||52.84||25.04||28.84||86.70||41.14||69.59||45.79||19.57||16.57|
|iFood 10% classes teacher (best of 5)||13.85||54.31||25.16||29.22||86.48||39.39||69.26||44.21||20.54||16.08|
|Cars196 + Resisc45 + iFood (t-split) repr. consolid.||47.85||72.39||34.57||41.92||89.09||58.63||83.64||79.90||51.79||29.18|
|Cars196 + Resisc45 + iFood (t-split) trad. distill||53.27||66.27||34.44||41.84||85.32||43.74||75.84||48.16||23.16||19.99|
|Cars196 (t-split) repr. consolid. old:new = 1:3||61.21||67.83||26.70||33.35||88.92|
|Cars196 (t-split) repr. consolid. old:new = 1:1||59.21||70.00||28.42||35.28||89.00|
|Cars196 (t-split) repr. consolid. old:new = 3:1||52.80||70.82||29.18||36.53||89.13|
|Cars196 (t-split) repr. consolid. (ImageNet proxy)||59.21||70.00||28.42||35.28||89.00|
|Cars196 (t-split) repr. consolid. (Places365 proxy)||61.18||69.59||26.69||33.54||88.68|
|Cars196 (t-split) trad. distill (Places365 proxy)||59.98||51.11||9.54||11.10||84.38|
|Cars196 (t-split) repr. consolid.||55.80||69.91||28.41||34.00||89.91|
|Cars196 (t-split) trad. distill||60.19||61.30||18.98||19.79||87.90|
|Resisc45 (t-split) repr. consolid.||38.30||71.71||28.16||34.33||89.83|
|Resisc45 (t-split) trad. distill||21.12||60.93||17.57||19.46||88.05|
|iFood (t-split) repr. consolid.||35.94||68.39||35.27||42.14||89.54|
|iFood (t-split) trad. distill||27.89||61.58||31.23||39.33||87.91|
|iFashion category task repr. consolid.||38.57||68.45||28.80||35.88||89.29|
|iFashion category task trad. distill||16.71||54.53||11.49||14.93||89.80|
|Cars196 + Resisc45 + iFood (t-split) repr. consolid.||45.31||70.84||30.90||37.62||89.87|
|Cars196 + Resisc45 + iFood (t-split) trad. distill||45.29||67.05||30.14||35.44||87.97|
|Cars196 (t-split) repr. consolid.||91.84||97.08||78.12||87.91||87.81||92.45||81.46|
|Cars196 (t-split) trad. distill||91.35||96.43||75.76||86.10||86.91||85.35||75.39|
|Resisc45 (t-split) repr. consolid.||91.03||97.14||78.57||88.02||87.75||92.08||81.55|
|Resisc45 (t-split) trad. distill||87.41||96.71||75.90||86.09||85.42||85.25||71.57|
|iFood (t-split) repr. consolid.||90.75||96.86||78.34||88.00||87.75||92.26||80.20|
|iFood (t-split) trad. distill||89.44||96.34||77.24||87.33||85.65||88.42||76.30|
|iFood (t-split) teacher||89.09||96.21||77.76||87.92||84.46||87.97||75.92|
|Aircrafts (t-split) repr. consolid.||90.93||97.08||78.20||88.01||88.94||91.97||81.15|
|Aircrafts (t-split) trad. distill||87.31||95.87||75.06||85.29||89.54||81.79||72.30|
|iNaturalist (t-split) repr. consolid.||91.08||96.74||78.16||87.92||87.81||93.62||82.36|
|iNaturalist (t-split) trad. distill||88.62||96.02||76.64||86.60||86.61||90.89||79.89|
|iNaturalist (t-split) teacher||89.37||96.37||76.89||87.09||85.59||91.22||79.62|
|Cars196 + Resisc45 + iFood (t-split) repr. consolid.||91.05||96.74||78.42||88.15||88.16||92.76||81.43|
|Cars196 + Resisc45 + iFood (t-split) trad. distill||91.05||96.71||77.47||87.38||87.27||89.71||77.29|