Performance of deep learning algorithms in real-world applications is often limited by the size of the training dataset. Training a deep neural network (DNN) model with a small number of training samples usually leads to the over-fitting issue with poor generalization performance. A common yet effective solution is to train DNN models under transfer learning settings [pan2010survey] using large source datasets. The knowledge transferred from the source domain helps DNNs learn better features and achieve higher generalization performance for pattern recognition in the target domain [donahue2014decaf, yim2017gift].
Background. For example, the paradigm of [donahue2014decaf] proposes to first train a DNN model using a large (and possibly irrelevant) source dataset (e.g., ImageNet), then use the weights of the pre-trained model as the starting point of optimization and fine-tune the model using the target dataset. In this way, blessed by the power of large source datasets, the fine-tuned model is usually capable of handling the target task with better generalization performance. Furthermore, the authors of [yim2017gift, li2018explicit, li2019delta] propose transfer learning algorithms that regularize the training procedure using the pre-trained models, so as to constrain the divergence of the weights and feature maps between the pre-trained and fine-tuned DNN models. Later, the works [chen2019catastrophic, wan2019towards] introduce new algorithms that prevent the regularization from hurting transfer learning, where [chen2019catastrophic] proposes to truncate the tail spectrum of the batch of gradients while [wan2019towards] proposes to truncate the ill-posed directions of the aggregated gradients.
In addition to the aforementioned strategies, a great number of methods have been proposed to transfer knowledge from the multi-task learning perspective, such as [ge2017cvpr, cui2018large]. More specifically, Seq-Train [cui2018large] proposes a two-phase approach, where the algorithm first picks auxiliary samples from the source datasets with respect to the target task, then pre-trains a model with the auxiliary samples and fine-tunes the model using the target dataset. Moreover, Co-Train [ge2017cvpr] adopts a multi-task co-training approach that simultaneously trains a shared backbone network using both source and target datasets with their corresponding separate fully-connected (FC) layers. While all of the above algorithms enable knowledge transfer from source datasets to target tasks, they sometimes perform poorly due to the following critical technical issues.
Catastrophic Forgetting and Negative Transfer. Most transfer learning algorithms [donahue2014decaf, yim2017gift, li2018explicit, li2019delta] consist of two steps – pre-training and fine-tuning. Given the features that have been learned in the pre-trained models, either forgetting some good features during the fine-tuning process (catastrophic forgetting) [chen2019catastrophic] or preserving inappropriate features/filters that reject the knowledge from the target domain (negative transfer) [li2019delta, wan2019towards] would hurt the performance of transfer learning. There thus needs to be a way to make compromises between the features learned from the source and target domains during the fine-tuning process. Multi-task learning with Seq-Train [cui2018large] and Co-Train [ge2017cvpr] might suggest feasible solutions to well balance the knowledge learned from the two domains, either by fine-tuning the model with a selected set of auxiliary samples (rather than the whole source dataset) [cui2018large] or by alternately learning the features from both domains during fine-tuning [ge2017cvpr].
Gradient Complexity for Seq-Train and Co-Train. The deep transfer learning algorithms based on multi-task learning are computationally inefficient. Though pre-trained models based on key datasets, such as ImageNet, are ubiquitously available for free, multi-tasking algorithms usually need additional steps for knowledge transfer. Prior to the fine-tuning procedure based on the target dataset, Seq-Train requires an additional step to select auxiliary samples and “mid-tunes” the pre-trained model using the selected auxiliary samples [cui2018large]. Furthermore, Co-Train [ge2017cvpr] requires additional in-situ backpropagation as the two datasets are combined. There thus needs to be a deep transfer learning algorithm that requires neither an explicit “mid-tuning” procedure nor additional backpropagation to learn from the source dataset.
Our Work. With both technical issues in mind, we aim to study efficient and effective deep transfer learning algorithms with low computational complexity from the multi-task learning perspective. We propose XMixup, namely Cross-domain Mixup, a novel deep transfer learning algorithm enabling knowledge transfer from source to target domains through the low-cost Mixup [zhang2018mixup]. More specifically, given the source and target datasets for image classification tasks, XMixup runs deep transfer learning in two steps – (1) Auxiliary sample selection: XMixup pairs every class of the target dataset with a dedicated class of the source dataset, where the samples in the source class are considered as the auxiliary samples for the target class; then (2) Mixup with auxiliary samples and fine-tuning: XMixup randomly combines the samples from the paired classes of the two domains using the mixup strategy [zhang2018mixup], and performs the fine-tuning process over the mixed data. To the best of our knowledge, this work makes three sets of contributions, as follows.
We study the problem of cross-domain deep transfer learning for DNN classifiers from the multi-task learning perspective, where the knowledge transfer from the source to the target task is considered as a co-training procedure of the shared DNN layers using the target dataset and auxiliary samples [ge2017cvpr, cui2018large]. We review the existing solutions [donahue2014decaf, yim2017gift, li2018explicit, li2019delta], summarize the technical limitations of these algorithms, and pay particular attention to the issues of catastrophic forgetting [chen2019catastrophic], negative transfer [wan2019towards], and computational complexity.
In terms of methodology, we extend the use of Mixup [zhang2018mixup] to cross-domain knowledge transfer, where the source and target datasets own different sets of classes and the aim of transfer learning is to adapt the model to the classes of the target domain. While vanilla mixup augments the training data with rich features and regularizes the stochastic training beyond empirical risk minimization (ERM), the proposed algorithm XMixup uses mixup to fuse samples from the source and target domains. In this way, the catastrophic forgetting issue can be addressed in part, as the model keeps learning from both domains, but at lower cost compared to [chen2019catastrophic]. To control the effects of knowledge transfer, XMixup also offers a tuning parameter to trade off between the two domains in the mixup of samples [zhang2018mixup].
We carry out extensive experiments using a wide range of source and target datasets, and compare the results of XMixup with a number of baseline algorithms, including fine-tuning with weight decay ($L^2$) [donahue2014decaf], fine-tuning with $L^2$ regularization on the starting point ($L^2$-SP) [li2018explicit], Batch Spectral Shrinkage (BSS) [chen2019catastrophic], Seq-Train [cui2018large], and Co-Train [ge2017cvpr]. The experimental results show that XMixup outperforms all these algorithms with significant improvements in both efficiency and effectiveness.
Organization of the Paper. The rest of this paper is organized as follows. In Section 2, we review the relation of our work to existing algorithms, where the most relevant studies are discussed. We then present the algorithm design in Section 3 and the experiments with overall comparison results in Section 4, respectively. We discuss the details of the algorithm with case studies and ablation studies in Section 5, and conclude the paper in Section 6.
2 Related Work
The most relevant studies to our algorithm are [donahue2014decaf, chen2019catastrophic, cui2018large, ge2017cvpr, zhang2018mixup, xu2020adversarial]. All these algorithms, as well as the proposed XMixup algorithm, start transfer learning from a pre-trained model that has been well-trained using the source dataset. However, XMixup makes unique technical contributions in comparison to these works.
Compared to [donahue2014decaf], which fine-tunes the pre-trained model using the target set only and might cause the so-called catastrophic forgetting effect, XMixup proposes to fine-tune the pre-trained model using mixed data from both domains. Compared to [chen2019catastrophic], which uses the computationally expensive singular value decomposition (SVD) on the batch gradients to avoid catastrophic forgetting and negative transfer effects, XMixup employs the low-cost mixup strategy to achieve similar goals. Compared to [cui2018large], the proposed algorithm XMixup adopts a similar procedure (pairing the classes in the source/target domains) to pick auxiliary samples from the source domain for knowledge transfer. XMixup, however, further mixes up the target training set with auxiliary samples and fine-tunes the pre-trained model with the data in an end-to-end manner, rather than using a two-step approach for fine-tuning [cui2018large]. Compared to [ge2017cvpr], which combines the source/target tasks to fine-tune the shared DNN backbone, the proposed algorithm mixes up data from the two domains and boosts performance through a simple fine-tuning process over the mixed data at low computational cost.
Finally, we extend the vanilla mixup strategy [zhang2018mixup] to transfer learning applications, where in terms of methodology we propose to pair the classes of the two domains and perform mixup over the selected auxiliary samples for improved performance. Mixup strategies have also been used in [xu2020adversarial] for unsupervised domain adaptation. Since the target task is assumed to share the same set of classes as the source domain in [xu2020adversarial], selecting auxiliary samples or pairing the source classes to the classes of the target domain is not required there.
3 XMixup: Cross-Domain Mixup for Deep Transfer Learning
Given the source and target datasets and a pre-trained model (that has been well-trained using the source dataset), XMixup performs deep transfer learning using two steps as follows.
Auxiliary Sample Selection
Given a source dataset $\mathcal{D}_s$ with $C_s$ classes and a target training dataset $\mathcal{D}_t$ with $C_t$ classes, XMixup assumes that the source domain usually has more classes than the target one (i.e., $C_s \geq C_t$), and it intends to pair every class in the target training dataset with a unique and dedicated class in the source dataset (a one-to-one pairing from target to source classes). More specifically, given a pre-trained model, XMixup first passes every sample from the two datasets through the pre-trained model and obtains the features extracted from the last layer of the feature extractor, denoted as $f(x)$ for a sample $x$. Then, XMixup groups the features of the samples according to the ground-truth classes in their datasets, and estimates the centroid of the features for every class in both datasets. That is, for every class $c$ in the source or target dataset, XMixup represents the class as the centroid of the features of its samples under the pre-trained model, i.e., $\mu_c = \frac{1}{|c|} \sum_{x \in c} f(x)$.
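The centroid estimation above can be sketched in a few lines of pure Python (the helper name `class_centroids` and the plain-list feature representation are ours, for illustration only; in practice the features would come from the pre-trained backbone):

```python
def class_centroids(features_by_class):
    """Average the feature vectors of each class into a single centroid.

    `features_by_class` maps a class label to a list of feature vectors
    (each a list of floats) extracted from the pre-trained model.
    """
    centroids = {}
    for label, feats in features_by_class.items():
        dim = len(feats[0])
        # Coordinate-wise mean over all samples of this class.
        centroids[label] = [sum(f[d] for f in feats) / len(feats)
                            for d in range(dim)]
    return centroids
```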
Given two classes $s$ and $t$ in the source and target domains respectively, we consider the similarity between the two classes as the potential for knowledge transfer, where XMixup measures the similarity between the two classes using the cosine measure between their centroids, such that $\mathrm{sim}(s, t) = \frac{\langle \mu_s, \mu_t \rangle}{\|\mu_s\| \, \|\mu_t\|}$. In this way, auxiliary sample selection reduces to searching for the optimal transport between the class sets of $\mathcal{D}_t$ and $\mathcal{D}_s$ via the pre-defined similarity measure. Hereby XMixup intends to find a one-to-one mapping $m^*$, such that
$$m^* = \operatorname*{arg\,max}_{m \in \mathcal{M}} \sum_{t=1}^{C_t} \mathrm{sim}\big(m(t), t\big),$$ where the candidate pairs are drawn from the Cartesian product of the target and source class sets, $\mathcal{M}$ refers to the constraint of one-to-one mappings, and $m(t)$ maps the target class $t$ to a unique class in the source domain. Note that $m^*$ refers to the optimal mapping that potentially exists to maximize the overall similarities, while XMixup solves the optimization problem using a simple greedy search [cui2018large] to pursue a robust approximate solution, denoted as $\hat{m}$, at low complexity. Compared to XMixup, the Seq-Train algorithm [cui2018large] uses a greedy algorithm to pair the source/target classes via the Earth Mover’s Distance (EMD), which might be inappropriate in our transfer learning settings.
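The greedy search over class pairs can be sketched as follows; this is a minimal illustration assuming centroids are plain float vectors, and the function names (`cosine`, `greedy_pairing`) are ours rather than the paper's:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_pairing(target_centroids, source_centroids):
    """Pair each target class with a distinct source class, most similar first."""
    # Rank all (target, source) candidate pairs by similarity, descending.
    pairs = sorted(
        ((cosine(tc, sc), t, s)
         for t, tc in target_centroids.items()
         for s, sc in source_centroids.items()),
        reverse=True)
    mapping, used = {}, set()
    for _, t, s in pairs:
        # Greedily accept a pair if neither class is taken yet.
        if t not in mapping and s not in used:
            mapping[t] = s
            used.add(s)
    return mapping
```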
Cross-domain Mixup with Auxiliary Samples and Fine-tuning
Given the one-to-one pairing from target to source classes, XMixup carries out the fine-tuning process over the two datasets. In every iteration of fine-tuning, XMixup first picks a mini-batch $\mathcal{B}$ of training samples drawn from the target dataset $\mathcal{D}_t$; then, for every sample $(x_t, y_t)$ in the batch $\mathcal{B}$, the algorithm retrieves the class $c$ of $x_t$ and randomly draws one sample $(x_s, y_s)$ from the source class paired with $c$. We consider $(x_s, y_s)$ as an auxiliary sample of $(x_t, y_t)$ in the current iteration of fine-tuning. XMixup then mixes up the two samples as well as their labels through linear combination with a trade-off parameter $\lambda$ drawn from the Beta distribution, such that $\tilde{x} = \lambda x_t + (1 - \lambda)\, x_s$ and $\tilde{y} = \lambda y_t + (1 - \lambda)\, y_s$.
In this way, XMixup augments the original training sample from the target domain with the auxiliary sample from the paired source class, for knowledge transfer purposes. XMixup fine-tunes the pre-trained model using the mixed samples accordingly.
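The per-batch mixing step can be sketched as follows. This is a simplified illustration assuming flat feature lists and one-hot label lists; the function and argument names are ours, not the paper's:

```python
import random

def xmixup_batch(target_batch, auxiliary_of, alpha=1.0, beta=1.0):
    """Mix each target sample with an auxiliary sample from its paired class.

    `target_batch` is a list of (x, y) pairs, with x a feature/pixel list and
    y a one-hot label list; `auxiliary_of` maps each batch index to its
    auxiliary (x', y') drawn from the paired source class.
    """
    mixed = []
    for i, (x, y) in enumerate(target_batch):
        lam = random.betavariate(alpha, beta)  # trade-off between the domains
        xa, ya = auxiliary_of[i]
        x_mix = [lam * a + (1 - lam) * b for a, b in zip(x, xa)]
        y_mix = [lam * a + (1 - lam) * b for a, b in zip(y, ya)]
        mixed.append((x_mix, y_mix))
    return mixed
```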
4 Experiments and Overall Comparisons
4.1 Datasets
Stanford Dogs. The Stanford Dogs [KhoslaYaoJayadevaprakashFeiFei_FGVC2011] dataset contains images of 120 dog breeds from around the world, each of which contains 100 examples for training and 72 for testing.
CUB-200-2011. Caltech-UCSD Birds-200-2011 [wah2011caltech] consists of 11,788 images of 200 bird species. Each species is associated with a Wikipedia article.
Food-101. Food-101 [bossard14] is a large-scale dataset consisting of more than 100k food images divided into 101 different categories.
Flower-102. Flower-102 [Nilsback2008Automated] consists of 102 flower categories. 1,020 images are used for training and 6,149 images for testing; only 10 samples are provided per category during training.
Stanford Cars. The Stanford Cars [KrauseStarkDengFei-Fei_3DRR2013] dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training and 8,041 testing images.
FGVC-Aircraft. FGVC-Aircraft [Maji2013FineGrainedVC] is a fine-grained visual classification dataset comprising more than 10,000 images of aircraft across 102 different aircraft models.
Although bounding box annotations are provided in some datasets, they are not used in our experiments.
4.2 Training Details
We evaluate recent state-of-the-art transfer learning methods in addition to XMixup. They are divided into two categories: regularized learning and multitask learning. For the former, we evaluate fine-tuning with $L^2$ regularization (weight decay), $L^2$-SP regularization [li2018explicit], and BSS regularization [chen2019catastrophic]. Note that multitask learning is defined in a broad sense here, including traditional co-training [ge2017borrowing], sequential training [cui2018large], and co-training with XMixup. Compared with regularized learning, the essential difference is to re-train labeled auxiliary examples from the source dataset instead of regularizing against the source model. We use the strategy described in Section 3 to select auxiliary examples for all multitask learning experiments. According to the empirical study in [ge2017borrowing], a threshold value on the auxiliary dataset size is required to guarantee the effectiveness. In our implementation, we repeat the one-to-one pairing procedure until the size of the selected subset reaches a specific threshold; the threshold is 100,000 for Stanford Dogs and 200,000 for the other datasets.
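The threshold-driven growth of the auxiliary subset can be sketched as below. This is a simplified illustration under our own naming; it assumes the candidate source classes have already been ranked by similarity to the target task:

```python
def grow_auxiliary_set(ranked_source_classes, class_sizes, threshold):
    """Select source classes, most similar first, until the total number of
    auxiliary samples reaches `threshold` (e.g. 100,000 for Stanford Dogs).

    `ranked_source_classes` is a list of class labels sorted by decreasing
    similarity to the target task; `class_sizes` maps label -> sample count.
    """
    selected, total = [], 0
    for cls in ranked_source_classes:
        if total >= threshold:
            break
        selected.append(cls)
        total += class_sizes[cls]
    return selected
```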
All experiments are performed using the modern deep neural architecture ResNet-50 [he2016deep], pre-trained on ImageNet. Input images are resized with the shorter edge being 256 and center cropped to square. We perform standard data augmentation composed of random flipping and random cropping to $224 \times 224$. For optimization, we use SGD with a momentum of 0.9 and a batch size of 48 for fine-tuning. We train for 9,000 iterations on all datasets. The initial learning rate is set to 0.001 for Stanford Dogs and 0.01 for the remaining datasets, and it is divided by 10 after 6,000 iterations. Each experiment is repeated five times, and the average top-1 classification accuracy and standard deviation are reported.
[Table 1: top-1 classification accuracies, with methods grouped into Regularized Learning and Multitask Learning.]
Accuracy. The top-1 classification accuracies are reported in Table 1. We observe that the regularized learning methods obtain similar results, while $L^2$-SP performs obviously better on Stanford Dogs. BSS achieves similar improvements on average but is more stable, consistently outperforming $L^2$ by a small margin on most datasets. As for multitask learning, pre-training with auxiliary examples before fine-tuning surprisingly often hurts the performance. This may be caused by the phenomenon of catastrophic forgetting [li2017learning]: pre-training with a subset of the source dataset loses general knowledge stored in the source model, and this general knowledge is probably useful for the target task. Co-training with auxiliary examples performs much better because the auxiliary examples from the source task are used to preserve that knowledge. However, co-training still hurts the performance on datasets whose data distributions are more distant from the source dataset, such as Stanford Cars and FGVC-Aircraft. XMixup obtains obviously higher average accuracy than all baseline methods. It is also very robust and stably outperforms naive fine-tuning with $L^2$ regularization.
Complexity. As stated in [li2018explicit, chen2019catastrophic], regularized learning methods are usually efficient because the computational complexity introduced by the regularization is approximately proportional to the number of parameters or features. Previous multitask learning methods [ge2017borrowing, cui2018large], in contrast, are time-consuming in dealing with auxiliary examples, costing additional time at least on the scale of training on the target examples alone. XMixup achieves almost the same efficiency as naive fine-tuning, because the extra computation for data mixing is negligible.
5 Discussions and Ablation Studies
In this section, we provide more empirical studies to analyze the effectiveness and applicability of our algorithm. In subsection 5.1, we show that XMixup effectively alleviates catastrophic forgetting and negative transfer. In subsections 5.2 and 5.3, XMixup is shown to be insensitive to the auxiliary dataset selection and the hyperparameter settings, indicating that XMixup is easy to apply to various real-world tasks. Finally, in subsection 5.4, we analyze two essential characteristics, cross-domain mixing and supervised mixing, showing that both are necessary.
[Table 2: results on CUB-200-2011, Stanford Cars, Flower-102, Food-101, Stanford Dogs, and FGVC-Aircraft.]
5.1 Analysis of the Effectiveness
The authors of [chen2019catastrophic] point out that deep transfer learning suffers from two kinds of problems: catastrophic forgetting and negative transfer. We conduct empirical analyses showing that XMixup explicitly or implicitly deals with both issues. Three different datasets are used for the detailed analysis.
A widely adopted measure of catastrophic forgetting is the accuracy of a fine-tuned model on the previous task [li2017learning, kirkpatrick2017overcoming]. It is worth noting that although the parameters change substantially during fine-tuning, general knowledge may still be preserved in the feature representation. We thus use models fine-tuned on target tasks as feature extractors for images in the source domains, and train a randomly initialized fully-connected layer to adapt to each specified source task. Results in Table 2 show that XMixup helps preserve the knowledge about the auxiliary samples, although this may hurt the capacity of representing samples outside the auxiliary dataset.
[chen2019catastrophic] finds that the distribution of tail singular values indicates the degree of negative transfer; specifically, negative transfer can be reduced by suppressing tail singular values. In Figure 2, we observe that XMixup shows trends similar to increasing the number of training examples, which is the most meaningful way to avoid negative transfer. Through sufficient utilization of the auxiliary examples, XMixup further decreases the smallest singular values.
[Table 4: for each of three datasets, columns compare XMixup, Mixup, and w/o Label.]
5.2 Sensitivity of the Auxiliary Dataset
Since the only external dependency of XMixup is the auxiliary dataset, we analyze in detail how characteristics of the auxiliary dataset influence the effect of XMixup.
Size. We first discuss how XMixup performs as the size of the auxiliary dataset varies. We create auxiliary datasets of different scales based on the default dataset described in the experiment settings. To increase the size, we keep adding the most similar categories from the source dataset until the whole set is selected; to decrease the size, we randomly sample from the default dataset. Note that the smaller auxiliary datasets differ only in size and not in domain similarity, while the larger auxiliary datasets are less similar to the target domain. As illustrated in Figure 3, CUB-200-2011 and FGVC-Aircraft show very good robustness to the size of the auxiliary dataset. Specifically, XMixup performs well enough when the number of auxiliary examples is more than 100,000. Surprisingly, we find that XMixup still significantly outperforms naive fine-tuning even when simply using the whole source dataset without selection. The Stanford Dogs task is a bit different: using too many dissimilar auxiliary examples hurts the performance, since this task is closely related to a subset of ImageNet. Nevertheless, all tasks clearly benefit from XMixup over a wide range of auxiliary dataset sizes, starting from about 30,000 examples.
Similarity. To investigate how XMixup depends on the similarity of the auxiliary examples, we keep the size of the auxiliary dataset the same but replace the auxiliary examples by random samples from the entire source dataset. The results are presented in Table 3. We observe that most datasets are not obviously affected by removing the similarity-based selection, indicating that XMixup is a robust approach to integrating general knowledge during fine-tuning. In subsection 5.4, we further show through an ablation study that knowledge from the source domain plays an important role in XMixup; in other words, cross-domain mixing is more than a perturbation of the target examples.
5.3 Sensitivity of Hyperparameters
The Beta distribution has two parameters, $\alpha$ and $\beta$. In this work we fix $\beta = 1$ and only change $\alpha$ to control the mixing weight between the two domains; a larger $\alpha$ means a larger sampling weight for examples from the target domain. We explore a broad range of values for $\alpha$ and investigate how well XMixup performs under different settings. Fig. 4 shows the top-1 accuracy against $\alpha$ for three different datasets.
We observe that, in all experiments conducted, XMixup demonstrates a relatively continuous and smooth change as $\alpha$ varies. Top-1 accuracy tends to be lower when $\alpha$ is small, because the training then assigns too small a weight to samples from the target dataset. It then rises to its maximum as $\alpha$ increases, followed by a gradual drop when $\alpha$ continues to increase (a too large $\alpha$ degenerates to naive fine-tuning). We also see that for each dataset there exists a range of $\alpha$ that yields enhanced performance in comparison with the other state-of-the-art algorithms. This trend indicates that XMixup is not strongly dependent on the choice of $\alpha$ within the appropriate range.
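To make the role of $\alpha$ concrete, recall that for $\lambda \sim \mathrm{Beta}(\alpha, \beta)$ the mean is $\alpha / (\alpha + \beta)$, so with $\beta = 1$ a larger $\alpha$ pushes the mixing weight toward the target sample. A small sketch (the function name `mean_lambda` is ours, not from the paper) checks this empirically:

```python
import random

def mean_lambda(alpha, beta=1.0, n=20000, seed=0):
    """Empirical mean of the mixing weight lambda ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    return sum(rng.betavariate(alpha, beta) for _ in range(n)) / n
```

For instance, `mean_lambda(1.0)` is close to 0.5 (a symmetric mix), while `mean_lambda(4.0)` is close to 0.8, i.e. the target sample dominates.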
Suggested range for $\alpha$ depending on datasets.
Our experimental results show that, for target datasets that are very similar to the source dataset, such as Stanford Dogs, which is a subset of ImageNet, a lower $\alpha$ usually performs better. Otherwise, an $\alpha$ within a fairly wide intermediate range is generally safe to improve over naive fine-tuning and even the state-of-the-art baseline methods.
5.4 Ablation Study
To further determine whether both components, namely cross-domain mixing and supervised mixing, are essential for XMixup, we evaluate two variants. First, we remove the cross-domain characteristic, implementing traditional Mixup [zhang2018mixup] using only samples from the target dataset (in-domain). Second, we keep cross-domain sample mixing in the data generation step but remove the labels of the auxiliary samples. The results are reported in Table 4.
Comparing XMixup with traditional Mixup, we find that Mixup has lower test accuracy than XMixup by an average of 2.7% on the three datasets we experimented with. This consistent difference implies that cross-domain mixing is an essential factor in the effectiveness of XMixup. Furthermore, Mixup has similar, or even slightly worse, performance compared to the baseline models whose results are shown in Table 1. This phenomenon suggests that general data augmentation may not work well in the transfer learning scenario, where the sample size is limited. Table 4 also shows that XMixup is more than data augmentation: model performance drops significantly (by an average of 5.2%) if the labels of the auxiliary examples are removed. This result indicates that the label knowledge of the source dataset is essential.
In this paper, we study the problem of knowledge transfer for DNN models from the perspective of multi-task learning. We propose a novel deep transfer learning algorithm, XMixup, namely Cross-domain Mixup, with superiority in both effectiveness and efficiency. Through a two-step approach with (1) auxiliary sample selection and (2) cross-domain mixup and fine-tuning, XMixup achieves significant performance improvements, with 1.9% higher classification accuracy on average, compared to the state-of-the-art algorithms [donahue2014decaf, li2018explicit, chen2019catastrophic]. XMixup is also robust to hyperparameter choices and to the way auxiliary samples are selected. Finally, we conclude that knowledge transfer through multitask learning with a set of selected auxiliary samples is undoubtedly a promising direction with huge potential, and this work provides a solid yet easy-to-use baseline method.