The remarkable successes of deep learning stem from years of development in network architectures and learning algorithms, but also from the increasing amount of data available for training. In many deep learning applications, data abundance allows for an extensive, empirically driven exploration of models and hyper-parameter tuning to fit billions of model parameters [30, 25, 4]. Despite efforts to build data-efficient pipelines [24, 49], deep networks remain intrinsically data-hungry. In some settings, however, this procedure is impracticable because of the limited size of the dataset. For instance, the bureaucracy behind the acquisition of medical data and the lack of automated systems for labelling them complicate the deployment of deep learning models in medical imaging [18, 46, 55].
A solution that can drastically reduce the need for new labelled data is transfer learning [50, 45, 42]. This technique makes it possible to improve the generalization performance of a network trained on a data-scarce target task by leveraging the information that a second network has acquired on a related and data-abundant source task. The idea is that the network does not need to learn to recognize the relevant features of the data from scratch, but can start from an advanced, yet imperfect, stage of learning. In the simplest formulation of transfer learning, the first layers of a neural network (those that perform feature extraction) are transferred and kept fixed, and only the last segment of the network is retrained on the new dataset. In practice, an additional end-to-end fine-tuning stage is often added, to adapt the learned features to the target task.
Despite its great success and its extensive use in deep learning applications, theoretical understanding of transfer learning is still limited and many practical questions remain open. For instance: How related do the source and target tasks need to be? Is it better to transfer from complex to simpler datasets or vice versa? How much data is necessary in the source dataset to make the transfer effective? How does the size of the transferred feature map impact performance? And when is the fine-tuning stage beneficial?
A key ingredient for gaining theoretical insight on these questions is a generative model for structured data that can produce non-trivial correlations between different tasks. The influence of data structure on learning has been an active research topic in recent years [21, 20, 39, 9]. An important step forward was taken in , where it was shown that a simple generative model called the hidden manifold model (HMM) induces learning behaviors that more closely resemble those observed on real data. The HMM is built on the idea that real-world datasets are intrinsically low dimensional, as recent studies have shown for some of the most widely used benchmarks. For instance, the intrinsic dimensions and , compared to the actual number of pixels and . This concept is also demonstrated in modern generative models [28, 23].
In the present work, we propose a framework for better understanding transfer learning, where the many variables at play can be isolated in controlled experiments.
In Sec. 2, we introduce the correlated hidden manifold model (CHMM), a tractable setting where the correlation between datasets becomes explicit and tunable.
In Sec. 4 we compare the performance of transfer learning with the generalization error obtained by three alternative learning models: a random feature model; a network trained from scratch on the target task; and transfer learning with an additional stage of fine-tuning.
In Figs. 3 and 5 we leverage parametric control over the correlation between source and target datasets to extensively explore different transfer learning scenarios and delineate the boundaries of effectiveness of the feature transfer. We also trace possible negative transfer effects observed on real data, such as over-specialization and overfitting [52, 29] in Fig. 3.
Further related work:
Starting from the seminal work of , the statistical learning community has produced a number of theoretical results bounding the performance of transfer learning, often approaching the problem from closely related settings like multi-task learning, few-shot learning, domain adaptation and hypothesis transfer learning, see [53, 51, 54] for surveys. Most of these results rely on standard proof techniques based on the Vapnik-Chervonenkis dimension, covering number, stability, and Rademacher complexity. A recent notable example of this type of approach can be found in . However, these works focus on worst-case analyses, whereas our work has a complementary focus on characterizing average-case behavior, i.e., typical transfer learning performance that we believe may be closer to observations in practice.
An alternative – yet less explored – approach is to accept stronger modeling assumptions in order to obtain descriptions of typical learning scenarios. Three recent works rely on the teacher-student paradigm , exploring transfer learning in deep linear networks  and in single-layer networks [11, 14]. Despite many differences, these works confirm the intuitive finding that fewer examples are needed to achieve good generalization when the teachers of source and target tasks are more aligned. While these studies provide a careful characterization of the effect of initialization inherited from the source task, they do not address other crucial aspects of transfer learning, especially the role played by feature extraction and feature sharing among different tasks. In modern applications feature reuse is key in the transfer learning boost  and it is lacking in the current modeling effort. This is precisely what we investigate in the present work.
2 Definition of the correlated hidden manifold model
We propose the correlated hidden manifold model (CHMM) as a high-dimensional tractable model of transfer learning, sketched in Fig. 1. To capture the phenomenology of transfer learning, two key ingredients are necessary. First, we need to express the relationship between source and target tasks, by defining a generative model for correlated and structured datasets. Second, we need to specify the learning model and the associated protocol through which the feature map is transferred over the two tasks. In the following, we describe both parts in detail.
The relationship between classification tasks can take different forms. Two tasks could rely on different features of the input, but these features could be read-out identically to produce the label; or two tasks could use the same features, but require these to be read-out according to different labeling rules. In the hidden manifold model (HMM) , explicit access to a set of generative features and to the classification rule producing the data allows us to model and combine correlations of both kinds.
The correlated-HMM combines two HMMs – representing source and target datasets – in order to introduce correlations at the level of the generative-features and the labels. The key element in the modeling is thus the relation between the two datasets. On a formal level, we construct source and target datasets, and , as follows: let denote the input dimension of the problem (e.g., the number of pixels), and the latent (intrinsic) dimension of the source dataset. First, we generate a pair , where is a matrix of -dimensional Gaussian generative features, , and denotes the parameters of a single-layer teacher network, . The input data points for the source task, , are then obtained as a combination of the generative features with Gaussian coefficients , while the binary labels are obtained as the output of the teacher network, acting directly on the coefficients:
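The construction of the source dataset can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' code: the symbol names (`N` for the input dimension, `D` for the latent dimension, `P` for the number of samples), the `1/sqrt(D)` scaling, and the `sign` nonlinearity applied to the inputs are assumptions borrowed from the usual hidden manifold model setup, and `sample_hmm` is a hypothetical helper.

```python
import numpy as np

def sample_hmm(F, theta, P, rng):
    """Draw P samples from a hidden manifold model.

    F     : (D, N) matrix of Gaussian generative features (assumed layout)
    theta : (D,) teacher weights acting on the latent coefficients
    """
    D, N = F.shape
    C = rng.standard_normal((P, D))       # Gaussian latent coefficients
    X = np.sign(C @ F / np.sqrt(D))       # inputs: nonlinear mix of the features (sign is an assumption)
    y = np.sign(C @ theta / np.sqrt(D))   # labels: the teacher acts on the coefficients, not on X
    return C, X, y

rng = np.random.default_rng(0)
N, D, P = 200, 20, 500
F_s = rng.standard_normal((D, N))         # source generative features
theta_s = rng.standard_normal(D)          # source teacher
C, X, y = sample_hmm(F_s, theta_s, P, rng)
```

The point of the sketch is that the labels depend on the inputs only through the hidden coefficients `C`, so the data live on a `D`-dimensional manifold embedded in `N` dimensions.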
In order to construct the pair for the target task, we directly manipulate both the features and the teacher network of the source task. To this end, we consider three families of transformations:
Feature perturbation and substitution. This type of transformations can be used to model correlated datasets that may differ in style, geometric structure, etc. In the CHMM, we regulate these correlations with two independent parameters: the parameter , measuring the amount of noise injected in each feature, and the parameter , representing the fraction of substituted features:
Teacher network perturbation. This type of transformations can be used to model datasets with the same set of inputs but grouped into different categories. In the CHMM, we represent this kind of task misalignment through a perturbation of the teacher network with parameter :
Feature addition or deletion. This type of transformation can be used to model datasets with different degrees of complexity and thus different intrinsic dimension. In the CHMM, we alter the latent space dimension , adding or subtracting some features from the generative model:
Moreover, the target teacher vector will also have a different number of components:
We can easily identify counterparts of these types of transformations in real-world data. For example, a simple instance of datasets that slightly differ in style can be obtained by applying a data-augmentation transformation (e.g., elastic distortions) to a given dataset. An instance of teacher network perturbation can instead be constructed by considering the same dataset for both source and target tasks, but with a grouping of the various data clusters into different labelings (e.g., even-odd digits vs digits greater or smaller than 5). Finally, an instance of datasets that only differ in latent dimension can be produced by selecting a subset of the categories contained in a given dataset and then transferring from the reduced dataset to the richer full dataset or vice versa. In section 4, we provide evidence of the qualitative agreement between the phenomenology observed in the CHMM of task correlation and in real-world datasets.
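The three families of transformations described above can be sketched in NumPy as follows. The parametrization is an assumption made for illustration: the argument names (`noise`, `frac_substituted`, `alignment`, `D_t`), the variance-preserving interpolation with fresh Gaussian noise, and the choice of which features are substituted or dropped are all hypothetical and may differ from the exact definitions used in the paper.

```python
import numpy as np

def perturb_features(F_s, noise, frac_substituted, rng):
    """Feature perturbation and substitution (assumed parametrization)."""
    D, N = F_s.shape
    # Mix each source feature with fresh Gaussian noise, keeping unit variance.
    F_t = np.sqrt(1.0 - noise**2) * F_s + noise * rng.standard_normal((D, N))
    # Replace a fraction of the features with entirely new ones.
    n_sub = int(round(frac_substituted * D))
    idx = rng.choice(D, size=n_sub, replace=False)
    F_t[idx] = rng.standard_normal((n_sub, N))
    return F_t

def perturb_teacher(theta_s, alignment, rng):
    """Teacher perturbation: alignment=1 keeps the teacher, alignment=0 draws an independent one."""
    fresh = rng.standard_normal(theta_s.shape)
    return alignment * theta_s + np.sqrt(1.0 - alignment**2) * fresh

def resize_latent(F_s, theta_s, D_t, rng):
    """Feature addition or deletion: change the latent dimension to D_t."""
    D_s, N = F_s.shape
    if D_t <= D_s:
        # Deletion: keep the first D_t features and teacher components.
        return F_s[:D_t].copy(), theta_s[:D_t].copy()
    # Addition: append new Gaussian features and teacher components.
    extra_F = rng.standard_normal((D_t - D_s, N))
    extra_theta = rng.standard_normal(D_t - D_s)
    return np.vstack([F_s, extra_F]), np.concatenate([theta_s, extra_theta])

rng = np.random.default_rng(0)
F_s = rng.standard_normal((20, 100))
theta_s = rng.standard_normal(20)
F_t = perturb_features(F_s, noise=0.3, frac_substituted=0.25, rng=rng)
theta_t = perturb_teacher(theta_s, alignment=0.8, rng=rng)
F_big, theta_big = resize_latent(F_s, theta_s, D_t=30, rng=rng)
```

Each function touches only one source of correlation, so the effect of each knob on transfer can be studied in isolation or in combination.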
We specialize our analysis to the case of two-layer neural networks with hidden units and activation functions. This is the simplest architecture able to perform feature extraction and develop non-trivial representations of the inputs through learning. To train the network, we employ a standard logistic loss with -regularization on the second layer.
We define the transfer protocol as follows. First, we train a randomly initialized (with i.i.d. weights) two-layer network on the source classification task. In order to guarantee higher transferability, we employ early stopping . Then, we transfer the learned feature map, i.e. the first-layer weights, , to a second two-layer network. Finally, we train the last layer of the second network, , to solve the target task while keeping the first layer frozen:
where is the logistic loss. We refer to this learning model as the transferred feature model (TF). This notation puts the TF in contrast to the well studied random feature model (RF) [43, 34], where the first-layer weights are again fixed but random, e.g. sampled i.i.d.. Of course, when the correlation between source and target tasks is sufficient, TF is expected to outperform RF.
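The transfer protocol and its random-feature baseline can be illustrated with a small NumPy sketch. It is not the authors' implementation: the `tanh` activation, the plain full-batch gradient descent with a fixed number of steps (used here in place of early stopping), the hyper-parameter values, and the helper names `features` and `train` are assumptions, and source and target share the same synthetic task purely for brevity.

```python
import numpy as np

def features(X, W):
    """Hidden-layer activations for first-layer weights W (tanh is an assumed activation)."""
    return np.tanh(X @ W.T / np.sqrt(X.shape[1]))

def train(X, y, W, w, lam=1e-2, lr=0.5, steps=300, train_first_layer=True):
    """Gradient descent on the logistic loss with an L2 penalty on the second layer."""
    W, w = W.copy(), w.copy()
    P, N = X.shape
    for _ in range(steps):
        H = features(X, W)                           # (P, K) activations
        g = -y / (1.0 + np.exp(y * (H @ w))) / P     # d loss / d pre-readout output
        grad_w = H.T @ g + lam * w
        if train_first_layer:
            back = np.outer(g, w) * (1.0 - H**2)     # backpropagate through tanh
            W -= lr * (back.T @ X) / np.sqrt(N)
        w -= lr * grad_w
    return W, w

rng = np.random.default_rng(0)
N, D, K = 50, 5, 30
F = rng.standard_normal((D, N))
theta = rng.standard_normal(D)

def sample(P):
    C = rng.standard_normal((P, D))
    return np.sign(C @ F / np.sqrt(D)), np.sign(C @ theta)

X_s, y_s = sample(2000)    # data-abundant source task
X_t, y_t = sample(100)     # data-scarce target task

W0 = rng.standard_normal((K, N))                     # random i.i.d. initialization
W_src, _ = train(X_s, y_s, W0, np.zeros(K))          # 1) train both layers on the source
# 2) TF: keep the transferred first layer frozen, retrain only the readout
W_tf, w_tf = train(X_t, y_t, W_src, np.zeros(K), train_first_layer=False)
# 3) RF baseline: same readout training on top of the random first layer
W_rf, w_rf = train(X_t, y_t, W0, np.zeros(K), train_first_layer=False)
```

The only difference between TF and RF in this sketch is which first-layer matrix is frozen, which is what makes the two models directly comparable.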
Throughout this work, we will generally assume the hidden-layer dimension to be of the same order as the input size . Note that, when the width of the hidden layer diverges, different limiting behaviors could be observed .
3 Theoretical analysis
One of the main interests of the presented model is that the generalization performance of transfer learning can be characterized analytically in high-dimensions. The crucial fact allowing such theoretical analysis is that the feature map is kept fixed through training and only the second layer weights are learned on the target task. In fact, it was recently conjectured and empirically demonstrated in  that this problem setting falls into a broad universality class, featuring diverse learning models that asymptotically behave as simpler Gaussian covariate models. According to the generalized Gaussian equivalence theorem (GET) , the model dependence on the training samples and the nonlinear feature map is completely summarized by the correlation matrices and between the generative coefficients and the activations :
Building on the GET, the theoretical framework presented in  yields analytical learning curves for general choices of the generative model and the feature map.
In this paper, we accept the validity of the Gaussian equivalence as a working assumption, setting out to verify its viability a posteriori through comparison with numerical simulations. The pipeline for obtaining our theoretical results is thus the following: we repeatedly generate finite-size correlated HMM models, and . We obtain the weights to be transferred, , via numerical optimization on the source task. We then estimate through Monte Carlo sampling the population covariances that serve as inputs to the asymptotic formulas. Finally, we iterate a set of saddle-point equations from , reproduced in detail in the supplementary material (SM), in order to get an analytic prediction for the associated test error.
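The Monte Carlo step of this pipeline can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `Phi` and `Omega`, the uncentered second-moment definitions, the `sign` and `tanh` nonlinearities, and the helper `estimate_covariances` are hypothetical, and the exact centering and normalization required by the asymptotic formulas may differ.

```python
import numpy as np

def estimate_covariances(F, W, n_samples, rng, batch=1000):
    """Monte Carlo estimate of the second moments between latent coefficients c
    and hidden activations a, for fixed features F and first-layer weights W."""
    D, N = F.shape
    K = W.shape[0]
    Phi = np.zeros((D, K))      # accumulates sum of c a^T
    Omega = np.zeros((K, K))    # accumulates sum of a a^T
    done = 0
    while done < n_samples:
        b = min(batch, n_samples - done)
        C = rng.standard_normal((b, D))          # fresh latent coefficients
        X = np.sign(C @ F / np.sqrt(D))          # HMM inputs (sign is an assumption)
        A = np.tanh(X @ W.T / np.sqrt(N))        # frozen feature map (tanh is an assumption)
        Phi += C.T @ A
        Omega += A.T @ A
        done += b
    return Phi / n_samples, Omega / n_samples

rng = np.random.default_rng(0)
D, N, K = 10, 100, 40
F = rng.standard_normal((D, N))     # generative features of the target task
W = rng.standard_normal((K, N))     # stand-in for the transferred first-layer weights
Phi, Omega = estimate_covariances(F, W, n_samples=5000, rng=rng)
```

For standard Gaussian coefficients the covariance of `c` with itself is the identity, so only the two matrices involving the activations need to be sampled in this sketch.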
We also employ the same analytic framework to characterize the performance of RF, for which the GET assumption was recently proven rigorously . In the following, we use the performance of RF as a baseline for evaluating the effect of transfer learning in the CHMM. We also compare numerically to the performance of a two-layer network (2L) trained from scratch on the target task. Finally, we evaluate through numerical experiments the effect of adding a fine-tuning stage (ft-TF) on top of the previously transferred features. The hyper-parameters employed in the learning protocols are fully detailed in the SM.
In this section, we employ the CHMM for correlated synthetic datasets and the analytic description of the TF performance to explore the effectiveness of transfer learning in a controlled setting. For simplicity, we focus our presentation on two key variables that can impact the generalization of TF. We first consider the effect of different correlation levels between source and target tasks, and afterwards we examine the role played by the latent dimensions of the two datasets. In both cases the starting point is an experiment on real data, followed by the identification of similar phenomenology in the synthetic model. Legitimized by the observed qualitative agreement, we then use the described tool-box of analytical and numerical methods to explore the associated transfer learning scenarios, drawing the corresponding phase diagrams.
Transfer between datasets with varying levels of correlation.
An unquestionable aspect of transfer learning is that the degree of task relatedness crucially affects the results of the transfer protocol. As mentioned in section 2, many sources of correlation can relate different learning tasks, creating a common ground for feature sharing among neural network models.
Let us start by discussing a simple transfer learning experiment on real data. We consider the EMNIST-letters dataset , containing centered images of hand-written letters. We construct the source dataset, , by selecting two groups of letters ( in the first group and in the second) and assigning them binary labels according to group membership. We then generate the target task, , by substituting one letter in each group (letter with for the first group and letter with for the second). In this way, a portion of the input traits characterizing the source task is replaced by some new traits in the target. Fig. 2(a) displays the results. On the x-axis, we vary the sample complexity in the target task (notice the log-scale in the plot), while the size of the source dataset is large and fixed. A first finding is that TF (light blue curve) consistently outperforms RF (orange curve), demonstrating the benefit of the feature map transfer. More noticeably, at low sample complexities, TF also largely outperforms training with 2L (green curve), with test performance gains up to . This gap closes at intermediate sample complexities, and we observe a transition to a regime where 2L performs better. We also look at the effect of fine-tuning the feature map in the TF, finding that at small sample complexity it is not beneficial with respect to TF, while at large sample complexity it slightly improves generalization. Both the learning curves of TF and RF display a double-descent cusp, due to the low regularization employed. This type of phenomenon has sparked a lot of theoretical interest in recent years [3, 38]. The cusp is delayed in the TF, signaling a shift in the linear separability threshold due to the transferred feature map. It is also interesting that the fine-tuned TF retains the cusp while the 2L model does not.
This is due to the different initialization: while 2L starts from small random weights, ft-TF picks up the training from the TF, where the second-layer weights can be very large. Further details are provided in the SM.
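The way the source and target tasks are built from letter groups can be sketched as follows. The specific letters, the toy random images standing in for EMNIST, and the helper name `make_binary_task` are hypothetical and chosen only to illustrate the substitution of one letter per group; they are not the groups used in the experiment.

```python
import numpy as np

def make_binary_task(X, letters, group_pos, group_neg):
    """Keep the samples whose letter belongs to one of the two groups
    and label them +1 or -1 according to group membership."""
    letters = np.asarray(letters)
    keep = np.isin(letters, list(group_pos) + list(group_neg))
    y = np.where(np.isin(letters[keep], list(group_pos)), 1, -1)
    return X[keep], y

rng = np.random.default_rng(0)
letters = np.array(list("abcdefgh" * 5))           # toy letter annotations
X = rng.standard_normal((letters.size, 784))       # toy stand-in for 28x28 images

# Source task: two groups of three letters each (hypothetical choice).
X_src, y_src = make_binary_task(X, letters, "abc", "def")
# Target task: one letter substituted in each group (c -> g, f -> h).
X_tgt, y_tgt = make_binary_task(X, letters, "abg", "deh")
```

Two of the three letters in each group are shared between the tasks, which is the real-data analogue of substituting a fraction of the generative features.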
We can straightforwardly reproduce this type of dataset correlation in the CHMM, via the feature substitution described in section 2. Fig. 2(b) shows closely analogous phenomenology in the synthetic setting of the CHMM. In the plot, the full lines display the results of the theoretical analysis of the CHMM, corroborated by finite-size numerical simulations (points in the same color). This agreement validates the GET assumption behind the analytic approach, as described in section 3. The 2L and ft-TF dashed lines are instead purely numerical, averaged over different realizations of the CHMM. For low sample complexity we observe rather clearly that the fine-tuning of TF actually deteriorates the test error due to overfitting, analogously to . This effect is also seen in another real data experiment reported in the SM.
The correspondence between the transfer learning behavior on real data and on the CHMM motivates a more systematic exploration of the effect of dataset correlation in the CHMM. The efficiency of our analytical pipeline, combined with fine-grained control over the correlation, allows us to draw phase diagrams for transfer learning. Fig. 3 shows three phase diagrams, comparing (a) TF with 2L, (b) TF with RF, and (c) ft-TF with TF, and displaying the performance gain as the teacher network alignment is varied. High values of indicate strongly correlated tasks, while low values indicate unrelated ones. Each point in the diagrams represents a different source-target pair, with variable sample complexities in the target task. In panel (a), we can see that TF can outperform 2L when the sample complexity is low enough, or when the level of correlation is sufficiently strong. In these regimes 2L is not able to extract features of a comparable quality. Note that, at , the transferred features are received from a source task identical to the target, so it is to be expected that at high sample complexity the performance of TF is equivalent to 2L. The darker red region around sample complexities of order 1 is due to the double-descent phenomenon, mentioned above and investigated in the SM.
In panel (b), we see a corner of the phase diagram where the performance of TF is suboptimal with respect to learning with random features. This is a case of negative transfer: the received features are over-specialized to an unrelated task and hinder the learning capability of TF, as in . Finally, in panel (c), we see the effect of an additional tuning of the transferred features on the new data. At small sample complexities this procedure leads to over-fitting .
We reported on the effect of a teacher network perturbation, but we find surprisingly similar results for the other families of transformations, perturbing the generative features with and . The corresponding phase diagrams are shown in the SM. Thus, at the level of the CHMM, all these manipulations have a similar impact on dataset correlation and feature map transferability. We trace similar trends in different transfer learning settings on real data, reported in the SM.
Transfer between datasets of different complexity.
So far, we discussed situations where the transfer occurs between tasks with similar degrees of complexity. However, in deep learning applications, this is not the typical scenario. Depending on whether the source task is simpler or harder than the target one, a different gain can be observed in the generalization performance of TF. The two transfer directions, from hard to simple and vice-versa, are not symmetric. This observation has been repeatedly reported for transfer learning applications on real data [32, 35, 37].
To isolate the asymmetric transfer effect, we again propose a simple design. Consider as a first classification task, , the full EMNIST-letters dataset (including all classes) with binarized labels (even/odd categories). As a second task, , consider instead a filtered EMNIST dataset (containing only some of the letters) with the same binarized labels. As denoted by the subscript, the learning problem associated with the first task is harder, given the richness of the dataset, while the second classification task is expected to be easier. Fig. 4(a) shows the outcome of this experiment. In the top sub-plot, we display the transfer from to , while in the lower sub-plot the transfer from to . A first remark is on the different difficulty of the two learning problems: as expected, the test error is smaller in the top figure than in the bottom one, especially at small sample complexities (difference of about test accuracy for all learning models). More importantly, when the target sample complexity is low, the performance gain of TF over the two baselines is about doubled when the transfer goes from to (about , compared to a in the opposite transfer direction).
As mentioned in section 2, dataset complexity is captured in our modeling framework through the latent dimension of the two HMMs. In particular, we can respectively assign a higher and a smaller to the two tasks. Correspondingly, the harder HMM will comprise a larger number of generative features. Fig. 4(b) shows the asymmetric transfer effect in the setting of the CHMM. Again, the top sub-plot shows the transfer from to , while the bottom sub-plot the converse. The different task complexity is reflected in the lower test scores recorded when the target task is the easier one. Moreover, as above, we observe a different gain between the two transfer directions for TF (above in transfer from hard to easy, about in the other direction). Thus, the difference in the intrinsic dimension of the two datasets seems to be the key ingredient for tracing this phenomenon.
We can now exploit our modeling framework and further explore the asymmetric transfer effect as a function of the latent dimension discrepancy. We consider source and target HMMs with variable latent dimensions, and , while keeping fixed. This allows us to probe cases where the target task is simpler and cases where it is more complex than the source. By comparing the performance of TF to RF, we identify the regimes where transfer learning produces the largest gains.
Fig. 5(a) shows the resulting phase diagram, highlighting a stark asymmetry when transferring between datasets with different latent dimensions. In the plot, the vertical axis at , represents the symmetry line of the phase diagram and corresponds to the case where , namely when the two HMMs share the same generative features. By moving to the right or to the left of the axis by the same amount, the number of generative features common to both datasets is identical. However, on the right, the target dataset is more complex than the source dataset. As a result, on this side of the phase diagram, the performance gain of TF is smaller. When instead the target task is simpler, the test performance of TF is highly improved, especially at low sample complexities.
Fig. 5(b) displays a horizontal cut of the phase diagram, at . In the low sample complexity regime transfer learning is highly beneficial, and indeed we observe that TF greatly outperforms RF and 2L. Again, note that at the latent dimensions of source and target tasks are equal. However, as we move away from the symmetry line at , we see the asymmetric transfer effect coming into play. When the target latent dimension increases, the gap in performance closes faster. This is due to the lack of learned features, affecting the test error of TF. If we look at the effect of fine-tuning the transferred features, we see a different behavior at small/large target latent dimension. When is small, ft-TF is detrimental. When is instead large, ft-TF can slightly improve performance. The fully trained 2L network is over-fitting due to the small dataset.
The modeling framework proposed in this work can be used to investigate other facets of transfer learning. In the SM we provide some additional results, exploring the impact of the network width , and of the sample complexity in the source task. In the first case, we find that the biggest performance gaps between TF, RF and 2L are observed in the regime where is smaller than one, while in the opposite regime the gain is observed only at small sample-complexity. In the second case, we find that, at fixed dataset correlation, the sample complexity in the source task needs to be sufficiently large in order for the transfer to be effective, as expected.
Despite the remarkable adherence of the CHMM phenomenology to simple transfer learning experiments on real data, there are several limitations to the presented approach. The main one is technical: in high dimensions, we are not aware of any theoretical framework that captures feature learning in full two-layer networks in a form that could then be used to study transfer learning. Therefore, even our analytic curves are obtained on top of results from numerical learning of the features from the source task. The description of learning processes in architectural variants, commonly used in deep learning practice, is even further from the current reach of existing theoretical approaches. Another shortcoming is associated with the fact that we are considering binary classification problems (as customary in many theoretical approaches), and by doing so we are missing the relational information appearing in multi-class settings, which could be learned and transferred between different models. An interesting direction for future work would be to find a method for quantifying the amount of correlation between real source and target datasets and to locate the pair in the phase diagrams we obtained. At this time, the presented theory remains descriptive, and does not yield quantitative predictions.
We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Program Grant Agreement 714608-SMiLe, and a Sir Henry Dale Fellowship from the Wellcome Trust and Royal Society (grant number 216386/Z/19/Z). A.S. is a CIFAR Azrieli Global Scholar in the Learning in Machines & Brains programme.
-  (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Cited by: Appendix C.
A model of inductive bias learning.
Journal of artificial intelligence research12, pp. 149–198. Cited by: §1.
Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854. Cited by: Appendix D, §4.
-  (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
-  (2018) notMNIST. Note: http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html[Google (Books/OCR), Tech. Rep.[Online]] Cited by: Appendix B, Appendix C.
-  (2014) Big data deep learning: challenges and perspectives. IEEE access 2, pp. 514–525. Cited by: §1.
-  (2020) Transfer learning & fine tuning. GitHub. Note: https://keras.io/guides/transfer learning/ Cited by: §1.
-  (2017) EMNIST: extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: Appendix C, §4.
-  (2021) More data or more parameters? investigating the effect of data structure on generalization. arXiv preprint arXiv:2103.05524. Cited by: §1.
-  (2020) Double trouble in double descent: bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pp. 2280–2290. Cited by: Appendix D.
Double double descent: on generalization errors in transfer learning between linear regression tasks. In CoRR, Vol. abs/2006.07002. External Links: Cited by: §1.
-  (2021) Transfer learning can outperform the true prior in double descent regularization. arXiv preprint arXiv:2103.05621. Cited by: Appendix D.
-  (2012) The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: Appendix C.
-  (2021) Phase transitions in transfer learning for high-dimensional perceptrons. Entropy 23 (4), pp. 400. External Links: Cited by: Appendix A, §1.
-  (2020) Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434. Cited by: §1.
-  (2001) Statistical mechanics of learning. Cambridge University Press. Cited by: §1.
-  (2017) Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports 7 (1), pp. 1–8. Cited by: §1.
-  (2018) GAN-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing 321, pp. 321–331. Cited by: §1.
-  (2020) Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment 2020 (11), pp. 113301. Cited by: §2.
-  (2020) Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pp. 3452–3462. Cited by: Appendix D, §1.
-  (2019) Modelling the influence of data structure on learning in neural networks: the hidden manifold model. Physical Review X 10, pp. 041044. Cited by: §1, §2.
-  (2020) The gaussian equivalence of generative models for learning with two-layer neural networks. CoRR abs/2006.14709. External Links: Cited by: §3.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. . External Links: Cited by: §1, §2.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In , pp. 770–778. Cited by: §1.
-  (2020) Universality laws for high-dimensional learning with random features. arXiv preprint arXiv:2009.07669. Cited by: §3.
-  (2018) imgaug. Note: https://github.com/aleju/imgaug[Online; accessed 30-Oct-2018] Cited by: Appendix C.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.
Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2661–2671. Cited by: 5th item, §2.
-  (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §1.
-  (2019) An analytic theory of generalization dynamics and transfer learning in deep linear networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Cited by: §1.
-  (2015) Simple to complex transfer learning for action recognition. IEEE Transactions on Image Processing 25 (2), pp. 949–960. Cited by: §4.
-  (2021) Capturing the learning curves of generic features maps for realistic data sets with a teacher-student model. arXiv preprint arXiv:2102.08127. Cited by: Appendix E, Appendix E, Appendix E, Appendix E, Appendix E, 2nd item, §3, §3.
-  (2019) The generalization error of random features regression: precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355. Cited by: §2.
-  (2016) Towards automated melanoma screening: exploring transfer learning schemes. arXiv preprint arXiv:1609.01228. Cited by: §4.
-  (1987) Spin glass theory and beyond: an introduction to the replica method and its applications. Vol. 9, World Scientific Publishing Company. Cited by: Appendix E.
-  (2020) What is being transferred in transfer learning?. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 512–523. Cited by: §4.
-  (1996) Statistical mechanics of generalization. In Models of neural networks III, pp. 151–209. Cited by: Appendix D, §4.
-  (2020) Statistical learning theory of structured data. Physical Review E 102 (3), pp. 032119. Cited by: §1.
-  (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: Appendix C.
-  (2019) Rapid learning or feature reuse? towards understanding the effectiveness of MAML. CoRR abs/1909.09157. Cited by: §1.
-  (2019) Transfusion: understanding transfer learning for medical imaging. arXiv preprint arXiv:1902.07208. Cited by: §1.
-  (2007) Random features for large-scale kernel machines.. In NIPS, Vol. 3, pp. 5. Cited by: §2.
-  (2005) To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, Vol. 898, pp. 1–4. Cited by: §4.
-  (2016) Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §1.
-  Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In International workshop on simulation and synthesis in medical imaging, pp. 1–11. Cited by: §1.
-  (2019) A jamming transition from under-to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical 52 (47), pp. 474001. Cited by: Appendix D.
-  (2019) Asymptotic learning curves of kernel methods: empirical data vs teacher-student paradigm. arXiv preprint arXiv:1905.10843. Cited by: Appendix A, §1.
-  (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §1.
-  (2012) Learning to learn. Springer Science & Business Media. Cited by: §1.
-  (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Computing Surveys (CSUR) 53 (3), pp. 1–34. Cited by: §1.
-  (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: 5th item, §4, §4.
-  (2019) Transfer adaptation learning: a decade survey. arXiv preprint arXiv:1903.04687. Cited by: §1.
-  (2021) A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering. Cited by: §1.
-  (2019) Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8543–8553. Cited by: §1.
Appendix A Exploring different aspects of transfer learning in the CHMM
In the following paragraphs, we present some additional phase diagrams on the phenomenology of transfer learning in the correlated hidden manifold model (CHMM). These results complement the materials presented in the main text and give additional insights on the role played by the various parameters in transfer learning problems.
Impact of transformations on the generative features.
We first consider a similar experiment to the one presented in the first paragraph of section 4. In section 4 we showed the phase diagrams for transfer learning when source and target HMMs are linked by a teacher perturbation. Here we explore the effect of the other types of transformations that preserve the dimensionality of the latent space, namely the feature perturbation and the feature substitution transformations.
Fig. A.1 shows two phase diagrams, comparing TF with 2L as a function of: (a) the feature perturbation parameter ; (b) the feature substitution parameter . High values of (low values of ) indicate strongly correlated tasks. Apart from the fact that the diagram associated to is a mirror image of the others, the obtained phase diagrams are not just qualitatively equivalent, but also quantitatively similar to the phase diagram for the parameter . At low sample complexities and high levels of feature correlation, TF largely outperforms 2L. We again find a region where TF overfits because of its proximity to the separability threshold. More generally, 2L becomes the better algorithm when the size of the target dataset is large enough.
The striking similarity of Fig. 3a and these phase diagrams might suggest some type of universality in the effect of dataset correlation (of any type) on the quality of the transferred features.
Impact of the number of learned features.
In this paragraph, we look at the effect of varying the width of the two-layer neural network, . Note that also corresponds to the number of learned features in a 2-layer network. In the plots, we rescale by the number of generative features (kept fixed to ) to obtain a quantity that remains even in the high-dimensional setting of the replica computation.
Fig. A.2 shows two phase diagrams, comparing (a) TF with 2L and (b) TF with a RF, as a function of and of the number of samples in the target task. Source and target tasks are linked through a fixed transformation with parameters . In both diagrams we can clearly see the diagonal line corresponding to the separability threshold, which shifts to higher values as (i.e., the number of parameters in the second layer) is varied. TF is found to perform better in the low sample complexity regime, as in all other experimental settings. An interesting phenomenon appears in panel (b), where the behavior of RF seems to change once the number of learned features becomes larger than the number of generative features . From this point on, the improvement obtained by TF over RF becomes less pronounced if the sample complexity is sufficient. Both diagrams also suggest that at very large , and starting from intermediate values of the sample complexity, the differences between TF, 2L and RF narrow. This type of behavior is expected, since by growing the width of the hidden layer one eventually approaches the kernel regime.
Impact of sample size of source dataset.
Finally, we vary a parameter that was kept fixed throughout the simulations presented above, the number of samples in the source task. Of course, in practice it makes little sense to use a small dataset as a source task in a transfer learning procedure. The interest in this experiment is more theoretical, as it can be used to understand when the learning model is able to start extracting features that could be helpful in a different task.
Fig. A.3 offers a comparison of (a) TF with 2L and (b) TF with a RF, as a function of source and target dataset sizes, and . The resulting phase diagram shows that the two-layer network trained on the source dataset is unable to extract good features from the data below a critical value of the sample complexity . In this regime the performance of TF in the target task is indistinguishable from RF, confirming the key role played by the learned features (and not just initialization and scaling) in the success of TF. As expected, when the sample complexity in the second task becomes larger, we recover the typical scenario already recorded in the previous experiments.
Appendix B More experiments on real data
In the following paragraphs, we provide some additional transfer learning experiments on real data. The goal of this section is to show that, despite different types of relationship between source and target tasks and different degrees of relatedness, the emerging qualitative behavior is similar to that presented in the main text and reproduced by the CHMM. In paragraph 1 of section 4, we have considered an experiment corresponding to a feature substitution transformation in the context of the CHMM. In the following, we will describe two experimental designs that instead correspond to feature perturbation and teacher perturbation transformations (see section 2 for more details). Moreover, we will demonstrate a case of “orthogonal” tasks, where the feature transfer is completely ineffective.
We consider an experiment where the transfer is between the MNIST dataset and a perturbed version of MNIST, obtained by applying a data-augmentation transformation to each image. In particular, we construct the source dataset, , by altering the MNIST dataset through an edge enhancer in the imgaug library for data augmentation (for more details see sec. C). Instead, we use the original MNIST as the target dataset, . In both cases, we label the images according to even-odd digits. In this way, the target task can be seen as a perturbed version of the source task.
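The edge-enhancing perturbation can be sketched in plain NumPy. This is a stand-in written for illustration, not the imgaug implementation: the Laplacian-style kernel, the blending rule, the pixel range [0, 1], the value `alpha=0.5` and the function name `edge_enhance` are all assumptions.

```python
import numpy as np

# Hypothetical 3x3 edge kernel (Laplacian-style); an assumption, not taken from the text.
EDGE_KERNEL = np.array([[0.0, 1.0, 0.0],
                        [1.0, -4.0, 1.0],
                        [0.0, 1.0, 0.0]])


def edge_enhance(image, alpha=0.5):
    """Blend a 2D image with its edge map: (1 - alpha) * image + alpha * edges."""
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, 1, mode="edge")
    edges = np.zeros_like(image)
    rows, cols = image.shape
    # Correlate with the 3x3 kernel using shifted views of the padded image.
    for i in range(3):
        for j in range(3):
            edges += EDGE_KERNEL[i, j] * padded[i:i + rows, j:j + cols]
    blended = (1.0 - alpha) * image + alpha * edges
    return np.clip(blended, 0.0, 1.0)  # keep pixels in the assumed [0, 1] range
```

With `alpha=0` the image is returned unchanged, so `alpha` plays the role of a perturbation strength between the target images and the source images.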
Fig. B.1 shows the outcome of this experiment. A first thing to notice is that TF outperforms RF at all sample complexities in the range we considered. Moreover, both models show a double descent behavior, with the peak occurring at the linear separability threshold (see sec. D for more details). As seen in the main text, TF proves more effective than 2L at small sample complexities, with a gain of about in test error scores. This gap closes above , where 2L starts performing better than both TF and ft-TF. Concerning ft-TF, we can instead recognize two different regimes. At small sample complexities, ft-TF performs better than 2L but worse than TF, showing a larger over-fitting effect compared to Fig. 2. At higher sample complexities, ft-TF joins 2L, thus outperforming TF, except in the small region close to the separability threshold.
Teacher perturbation and orthogonal tasks.
Additionally, we consider the following experiment on real data. As a source dataset, , we use the notMNIST dataset  grouped into even-odd classes. In the target task, , we instead consider notMNIST but labeled differently, based on whether the class labels are smaller or larger than . In this way, the only difference in the two tasks is in the employed labeling rule.
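The two labelling rules can be written down in a few lines. This is a minimal sketch: the threshold value 5, the ±1 label convention and the function names are hypothetical choices, and the treatment of a class equal to the threshold is an assumption.

```python
import numpy as np


def even_odd_labels(classes):
    """Source rule: +1 for even class indices, -1 for odd ones."""
    return np.where(np.asarray(classes) % 2 == 0, 1, -1)


def threshold_labels(classes, threshold=5):
    """Target rule: +1 if the class index is below `threshold`, else -1.

    The value 5 is a hypothetical choice for illustration.
    """
    return np.where(np.asarray(classes) < threshold, 1, -1)


classes = np.arange(10)                      # ten class indices, 0..9
source_y = even_odd_labels(classes)
target_y = threshold_labels(classes)
agreement = np.mean(source_y == target_y)    # fraction of classes labelled alike
```

Under these assumed rules the two labellings agree on only part of the classes, so the source labels alone do not determine the target labels.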
Fig. B.2(a) displays the outcome of this experiment. The observed phenomenology is identical to that already described for the previous experiments. Note that, even though the two tasks share the same input images, observing a benefit when transferring the feature map is not trivial. TF is here found to be effective because the learned features carry information about the digit represented in each image, which induces a useful representation regardless of the grouping of the digits.
It is in fact possible to construct effectively “orthogonal” tasks even if the set of inputs is the same. In the final experiment, we use even-odd MNIST as the target task, , but we consider a source task where the label assigned to each MNIST image is only dependent on its luminosity. In particular, we assign all the images with average brightness less than or in the interval to one group and the remaining ones to the other group. Thus, the resulting source task has nothing to do with digit recognition.
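A brightness-only labelling rule of this kind can be sketched as follows. The numerical thresholds, the closed interval and the function name `brightness_labels` are hypothetical; pixels are assumed to be rescaled to [0, 1].

```python
import numpy as np


def brightness_labels(images, low=0.1, band=(0.15, 0.2)):
    """Label images by mean pixel intensity only.

    +1 if the mean is below `low` or inside the closed interval `band`,
    -1 otherwise. All threshold values here are hypothetical.
    """
    images = np.asarray(images, dtype=float)
    mean = images.reshape(len(images), -1).mean(axis=1)
    in_group = (mean < low) | ((mean >= band[0]) & (mean <= band[1]))
    return np.where(in_group, 1, -1)
```

Because the label depends only on the mean intensity, it carries no information about which character is drawn.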
Fig. B.2(b) shows the outcome of this last experiment. Contrary to the other cases, in this experiment we see no advantage in transferring the feature map from the source to the target task: TF not only fails to improve over 2L, but also overlaps with RF at all sample complexities. Interestingly, we can here identify three different regimes for ft-TF: a first regime at small sample complexities where ft-TF follows the trend of TF and RF, thus performing poorly with respect to 2L; a second regime, at intermediate sample complexities, where ft-TF starts improving its test scores with respect to TF and RF (around ) but still does not have enough data to perform as well as 2L; finally, a third regime at high sample complexities, where ft-TF matches the generalization performance of 2L.
Appendix C Technical details on the numerical simulations
In this section, we provide additional details on the numerical simulations concerning the experiments on both real and synthetic data.
In the experiments with real data, we have used three different standard datasets: MNIST, EMNIST-letters and notMNIST. MNIST is a database of images ( pixels in range in the experiments) of handwritten digits with classes, containing examples in the training set and examples in the test set . EMNIST-letters is a dataset of images ( pixels in range in the experiments) of handwritten letters with classes, containing examples in the training set and examples in the test set . Finally, notMNIST is a dataset of images ( pixels in range in the experiments) of letter fonts from to with classes, containing examples in the training set and examples in the test set . Concerning the experiment on feature perturbation of Fig. B.1, we have used the EdgeDetect function of the imgaug library for data augmentation to enhance the contours of each digit with respect to the background . In the experiment, we set the parameter alpha to . An example from each of the three datasets is provided in Fig. C.1.
In the source task, we consider a two-layer neural network with ReLU activation function and a single sigmoidal output unit. To train the two-layer network on the source, we implement mini-batch gradient descent, using the end-to-end open-source machine learning platform TensorFlow 2.4. In particular, we use the Adam optimizer with default TensorFlow hyper-parameters and the binary cross-entropy loss. As regularizers, we apply regularization on the last layer and early stopping with default TensorFlow hyper-parameters. The training is stopped as soon as an increase in the test loss is recorded (the patience parameter in early stopping is set to zero). We set the maximum number of epochs to , the learning rate to and the batch size to .
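The source training loop can be illustrated with a small NumPy sketch. It is not the TensorFlow setup described above: plain mini-batch SGD replaces Adam, and the width, learning rate, regularization strength, batch size, epoch budget and the function name `train_source` are hypothetical values chosen only to make the example run.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(p, y):
    """Mean binary cross-entropy for labels y in {0, 1}."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def train_source(X, y, X_val, y_val, width=16, lr=0.1, reg=1e-3,
                 batch_size=32, max_epochs=50, seed=0):
    """Two-layer ReLU network with a sigmoid output, trained by mini-batch SGD.

    Training stops at the first epoch whose held-out loss increases
    (early stopping with patience zero). Returns the weights from the
    last epoch before the increase and the number of epochs run.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(d, width)) / np.sqrt(d)   # first layer (feature map)
    a = rng.normal(size=width) / np.sqrt(width)    # last layer
    best_loss, best_W, best_a = np.inf, W.copy(), a.copy()
    for epoch in range(max_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            h = np.maximum(X[idx] @ W, 0.0)               # hidden activations
            err = (sigmoid(h @ a) - y[idx]) / len(idx)    # gradient w.r.t. the logits
            grad_a = h.T @ err + reg * a                  # L2 penalty on the last layer only
            grad_W = X[idx].T @ (np.outer(err, a) * (h > 0))
            a -= lr * grad_a
            W -= lr * grad_W
        val_loss = bce(sigmoid(np.maximum(X_val @ W, 0.0) @ a), y_val)
        if val_loss > best_loss:      # patience zero: stop at the first increase
            break
        best_loss, best_W, best_a = val_loss, W.copy(), a.copy()
    return best_W, best_a, epoch + 1
```

The returned first-layer matrix defines the feature map `np.maximum(X @ W, 0.0)` that a transferred-feature model would keep frozen on the target task.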
In the target task, we train TF and RF via the scikit-learn open-source Python library for machine learning. In particular, we use the module , which implements logistic regression with regularization when the penalty parameter is set to “l2”. Note that non-zero L2 regularization ensures that the optimization process is bounded even below the separability threshold for both TF and RF. The training stops either because a maximum number of iterations () has been reached, or because the maximum component of the gradient is found to be below a certain threshold (). To train 2L, we have instead employed TensorFlow 2.4 once again, with the Adam optimizer and the cross-entropy loss with -regularization on the last layer. In this case, we set the maximum number of epochs to , the learning rate to and the batch size to . The training stops when the maximum number of epochs is reached. The learning hyper-parameters are chosen so that the two-layer network always reaches zero training error on the target task. Finally, to train ft-TF (from the TF initialization), we use the Adam optimizer and the binary cross-entropy loss. Since we expect the weights pre-trained on the source task to be already good enough to ensure good generalization performance, we set the learning rate to so as not to alter them too much or too quickly. We then keep the total number of epochs equal to and the batch size equal to .
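The TF and RF target models differ only in the frozen first layer, which the sketch below makes explicit. It uses full-batch gradient descent in NumPy rather than the scikit-learn solver, and every numerical value, the toy data, the placeholder weights `W_tf` and the function names are assumptions for illustration. The stopping rule mirrors the one described above: an iteration cap or a bound on the largest gradient component.

```python
import numpy as np


def fit_logistic_l2(features, y, reg=1e-2, lr=0.5, max_iter=5000, tol=1e-6):
    """L2-regularized logistic regression on fixed features, labels y in {-1, +1}.

    Full-batch gradient descent; stops after `max_iter` steps or once the
    largest gradient component falls below `tol`.
    """
    n, p = features.shape
    w = np.zeros(p)
    for _ in range(max_iter):
        margins = y * (features @ w)
        # Gradient of mean log(1 + exp(-margin)) plus (reg / 2) * ||w||^2.
        coeff = -y / (1.0 + np.exp(margins))
        grad = features.T @ coeff / n + reg * w
        if np.max(np.abs(grad)) < tol:
            break
        w -= lr * grad
    return w


def error_rate(features, y, w):
    """Fraction of misclassified samples."""
    return np.mean(np.sign(features @ w) != y)


rng = np.random.default_rng(0)
d, p, n = 20, 40, 300
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])            # toy target labels

W_tf = rng.normal(size=(d, p)) / np.sqrt(d)     # placeholder for weights learned on the source
W_rf = rng.normal(size=(d, p)) / np.sqrt(d)     # random features: fresh Gaussian weights

errors = {}
for name, W in {"TF": W_tf, "RF": W_rf}.items():
    feats = np.maximum(X @ W, 0.0)               # frozen first layer with ReLU
    w = fit_logistic_l2(feats[:200], y[:200])
    errors[name] = error_rate(feats[200:], y[200:], w)
```

In an actual transfer experiment `W_tf` would be the first-layer matrix obtained from source training; here it is random only so that the example is self-contained.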
Appendix D Separability threshold and double descent
The double-descent phenomenon, first observed in , has recently attracted considerable theoretical interest [3, 47], even in the context of transfer learning [10, 12]. Fig. D.1 shows the double-descent phenomenology in the CHMM for the transferred feature model analyzed in the present work. As in the case of RF, a peak in the generalization error can be observed at the linear separability threshold. This threshold (dashed black vertical lines in Fig. D.1) signals the transition to a regime where the inputs can no longer be perfectly separated, i.e. where the training loss becomes strictly greater than zero. As can be observed in the plot, the interpolation threshold associated to TF is shifted towards higher sample complexities. This is a direct consequence of the non-trivial correlations learned from the source dataset and encoded in the feature map. The preprocessing induced by the transferred features helps in the classification process, making data more easily separable than with random Gaussian projections. Note that the sharpness of the transition is controlled by the regularization strength: the smaller the regularization, the sharper the transition between the two regimes. However, due to numerical instabilities in the convergence of the saddle-point equations, we could not approach regularization strengths smaller than .
Appendix E Analytic approach
We report the results of the replica analysis presented in  that is directly applicable to the CHMM. The replica approach is a non-rigorous, yet exact, analytical method from statistical physics , used to characterize a high-dimensional optimization problem through a small set of scalar fixed-point equations. In this reduced asymptotic description, the original dependence on the microscopic details of the model is captured through a set of overlap parameters, which are assumed to concentrate around their typical values in high dimensions.
In particular, the equations presented in  allow for the characterization of the asymptotic behavior of learning models with fixed feature maps. We employ the very same equations in our pipe-line for predicting the generalization performance of transferred feature models and random feature models, described in section 3.
As discussed in the main text, the generalized Gaussian equivalence can be used to map non-linear learning problems with generic feature maps onto a simple Gaussian covariate model, that focuses on the joint distribution of the inputs of the last layer of the teacher model (determining the true labels) and of the student model (determining the predicted labels). In the CHMM, these quantities are represented by the generative coefficients and the first-layer activations . In high dimensions, their correlation matrix can be decomposed as:
The replica computation does not take as input the full correlation matrices, but instead:
where represent the spectrum of , and is the matrix composed by stacking its eigenvectors.
The replica results are valid in the high-dimensional limit , with and of . Note, however, that the pipe-line proposed in  and employed in our work starts from finite-size correlation matrices estimated through Monte Carlo sampling. The justification is the following: the final replica expressions only depend on the traces of the correlation matrices. In high dimensions, the traces yield expectations over the corresponding spectral distributions. However, these expectations can be well approximated by empirical averages over the finite-dimensional spectra, provided the dimension is large enough (we employ in the main text).
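The Monte Carlo step can be sketched as follows. All sizes, the tanh non-linearity, the random matrices `F` and `W`, the use of uncentered second moments and the variable names are assumptions made for illustration, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, input_dim, width, n_samples = 30, 60, 40, 20000

# Hypothetical generative features and first-layer weights, for illustration only.
F = rng.normal(size=(latent_dim, input_dim))
W = rng.normal(size=(input_dim, width)) / np.sqrt(input_dim)

c = rng.normal(size=(n_samples, latent_dim))     # generative coefficients (teacher side)
x = np.tanh(c @ F / np.sqrt(latent_dim))         # inputs built from the latent coefficients
v = np.maximum(x @ W, 0.0)                       # first-layer activations (student side)

# Empirical second-moment matrices estimated from the Monte Carlo samples.
teacher_cov = c.T @ c / n_samples                # latent_dim x latent_dim
cross_cov = c.T @ v / n_samples                  # latent_dim x width
student_cov = v.T @ v / n_samples                # width x width

# Spectrum and eigenvectors of the student-side matrix.
eigvals, eigvecs = np.linalg.eigh(student_cov)

# A normalized trace of a matrix function equals an average over the spectrum.
lam = 0.1
trace_form = np.trace(student_cov @ np.linalg.inv(lam * np.eye(width) + student_cov)) / width
spectral_form = np.mean(eigvals / (lam + eigvals))
```

The last two lines illustrate why a finite-size estimate is enough: a trace of this kind only sees the eigenvalues, so it can be replaced by an empirical average over the estimated spectrum.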
In the next paragraph, we report the results obtained in the so-called replica symmetric ansatz. While this is the simplest possible ansatz for these types of calculations, it is also known to be the correct one for convex problems like the logistic regression setting under study. Moreover, the expressions we report are already in the zero-temperature limit, which is the relevant one for studying optimization problems. Note, however, that this limit is non-trivial and requires the introduction of appropriate scaling laws. We refer the reader to  for further details on the derivation.
Summary of replica results.
The central object of the replica analysis is the so-called quenched free-entropy, which represents the average log-partition function associated to the studied learning problem. In the assumption of replica symmetry and the zero-temperature limit, the free-entropy associated to our setting reads:
where the scalar energetic channel is defined as:
with and obtained by extremizing:
where denotes the logistic loss.
The extremum operation in equation (E.3) is taken over the order parameters of the model and their conjugates . The stationarity conditions for yield a system of coupled scalar equations, which can be easily solved by iteration. At the fixed point, one obtains the saddle-point values for the order parameters, which can then be compared to the typical value of measurable overlaps:
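Solving coupled scalar equations by iteration can be illustrated with a generic damped fixed-point solver. The sketch below does not reproduce the saddle-point equations of the model: the two-variable toy system, the damping value, the tolerance and the function names are hypothetical stand-ins.

```python
import numpy as np


def solve_fixed_point(update, x0, damping=0.5, tol=1e-10, max_iter=10000):
    """Iterate x <- (1 - damping) * update(x) + damping * x until convergence."""
    x = np.asarray(x0, dtype=float)
    for it in range(max_iter):
        x_new = (1.0 - damping) * np.asarray(update(x), dtype=float) + damping * x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, it + 1
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")


# Toy two-variable system standing in for the saddle-point equations:
#   a = 1 / (1 + b),  b = 0.5 * a
def toy_update(params):
    a, b = params
    return np.array([1.0 / (1.0 + b), 0.5 * a])


solution, n_iter = solve_fixed_point(toy_update, x0=[1.0, 1.0])
```

Damping trades speed for stability, which is one common way to handle the kind of convergence issues that appear at small regularization.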
At convergence, the order parameters can be inserted in a closed-form expression yielding the generalization error achieved at the end of training. Again, while the reader can find the detailed derivation in , we only report the final expression: