# Super-model ecosystem: A domain-adaptation perspective

This paper attempts to establish a theoretical foundation for the emerging super-model paradigm via domain adaptation, where one first trains a very large-scale model, i.e., a super model (or foundation model in some other papers), on a large amount of data and then adapts it to various specific domains. The super-model paradigm helps reduce computational and data costs and carbon emissions, which is critical to the AI industry, especially to the large number of small and medium-sized enterprises. We model the super-model paradigm as a two-stage diffusion process: (1) in the pre-training stage, the model parameter diffuses from a random initialization and converges to a steady distribution; and (2) in the fine-tuning stage, the model parameter is transported to another steady distribution. Both training stages can be mathematically modeled by the Ornstein-Uhlenbeck process, which converges to two Maxwell-Boltzmann distributions, respectively, each of which characterizes the corresponding convergent model. An $\mathcal{O}(1/\sqrt{N})$ generalization bound is then established via the PAC-Bayesian framework. The theory finds that the generalization error of the fine-tuning stage is dominant in domain adaptation. In addition, our theory suggests that the generalization is determined by a new measure that characterizes the domain discrepancy between the source domain and the target domain, based on the covariance matrices and the shift of the converged local minimum.


## 1 Introduction

Large-scale pretrained models (or foundation models) (Han et al., 2021; Chen et al., 2021), including GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018), enable a new paradigm in machine learning: pre-training a large-scale model on very large-scale datasets and then transferring the learned model to an unseen domain. This paradigm was first introduced in natural language processing and more recently in computer vision. It points toward a higher level of automation and is establishing itself as a new paradigm with clear advantages. In this paradigm, a super model learns meta knowledge from large amounts of data, which reduces the learning cost in specific domains. This helps considerably reduce the computational and data cost of applying machine learning in many specific applications, and is thus of significant value to the large number of small and medium-sized enterprises. Additionally, the super-model paradigm enables better management of the geographic location of machine learning workloads and of the datacenter infrastructure, which has been shown to significantly reduce carbon emissions (Patterson et al., 2021).

Technically, domain adaptation plays a vital role in knowledge transfer in the super-model paradigm. Usually, the data available in a target domain is much smaller than that in the source domain. In light of this, an appropriate understanding of the generalizability of the transferred model on the target domain is of high importance.

In this paper, we prove an upper bound on the generalization error (a generalization bound) for domain adaptation algorithms. The generalization error is defined as the difference between the expected risk $R$ and the empirical risk $\hat{R}$. Intuitively, a larger generalization bound indicates that the generalization error is possibly larger and thus suggests worse generalizability.

We model the super-model paradigm as a two-stage diffusion process. In the first stage, a stochastic gradient-based optimizer, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), or Adam (Kingma and Ba, 2014), learns a pre-trained model on the source-domain data via empirical risk minimization,

$$\min_{\theta} \hat{R}_S(\theta) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(h_\theta(x_i), y_i),$$

where $\hat{R}_S(\theta)$ is the empirical risk of the model $h_\theta$ parameterized by $\theta$ on the training sample $S$, which is defined to be

$$S = \{(x_1, y_1), \ldots, (x_N, y_N) \mid x_i \in \mathbb{R}^{d_X}, y_i \in \mathbb{R}^{d_Y}\}, \tag{1}$$

where $N$ is the training sample size, and $\ell$ is the loss function. The convergent parameter initializes the fine-tuning process on the target domain in the second stage. We model the parameter trajectory of SGD by a stochastic process, the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930), as follows,

$$\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta \hat{g}_S(\theta(t)) = -\eta g(\theta) + \frac{\eta}{\sqrt{|S|}} B \Delta W, \quad \Delta W \sim \mathcal{N}(0, I), \tag{2}$$

where $B$ is a positive definite matrix which characterizes the covariance of the gradient noise. This can also be smoothed to a diffusion equation, the Fokker-Planck equation. Correspondingly, the trajectories can be modeled by the dynamics of the Fokker-Planck equation. Further, the steady distributions of the Ornstein-Uhlenbeck equations characterize the distributions of the learned models.

Deep learning can be formulated as solving a non-convex optimization problem: the loss surface of a neural network is usually highly non-convex due to the complexity of the network architecture. In general, solving a non-convex optimization problem is NP-hard. However, numerous experiments show that deep learning has excellent optimization performance. This mystery is partially addressed by empirical findings on the local convexity and smoothness of the loss surfaces of deep neural networks. Empirical results show that the loss surface around the convergent local minima is second-order smooth, as shown by Li et al. (2018).

This empirical finding inspires us to model the loss surface around the convergent local minimum as a quadratic function. This assumption determines the derivatives and boundary conditions of the Fokker-Planck equation. Moreover, the model parameter is usually initialized from a Gaussian distribution. Based on these, the Fokker-Planck equation has a steady distribution in the form of a Maxwell-Boltzmann distribution, which governs the distribution of the model learned by SGD. During the pre-training stage, SGD converges to a Maxwell-Boltzmann distribution around the local minimum, given below,

$$q_{PT}(\theta) = M_{PT} \exp\left\{-\frac{1}{2} \theta^\top \Sigma_{PT}^{-1} \theta\right\}, \tag{3}$$

where $M_{PT}$ is the normalizer and $\Sigma_{PT}$ is the covariance.

This distribution is then used as the initial distribution in the fine-tuning stage. Subsequently, SGD in the fine-tuning stage learns the mapping from this initialization to a new Maxwell-Boltzmann distribution centered at the new local minimum on the loss surface in the target domain, as follows,

$$q_{FT}(\theta) = M_{FT} \exp\left\{-\frac{1}{2} (\theta - \theta_{FT})^\top \Sigma_{FT}^{-1} (\theta - \theta_{FT})\right\}, \tag{4}$$

where $M_{FT}$ is the normalizer, $\theta_{FT}$ is the distribution shift, and $\Sigma_{FT}$ is the covariance.

Based on the diffusion processes, we then establish PAC-Bayesian generalization bounds for the learned super model on the source domain and the transferred model on the target domain. The PAC-Bayesian framework (McAllester, 1999a, b) upper bounds the generalization error of a stochastic algorithm via the distance between the initial distribution and the distribution of the learned hypothesis, usually measured by an information-theoretical distance such as the KL-divergence. Intuitively, the PAC-Bayesian theory suggests that training a very large model from a no-knowledge prior, such as a Gaussian or uniform distribution, needs a very large amount of data to secure generalizability; if the initialization is near the distribution of the learned hypothesis, the required sample complexity can be much smaller. However, a high-quality prior is usually not accessible in practice. This significantly limits the model size, particularly in low-resource scenarios. This is the key motivation of the super-model paradigm: (1) training a super model on a very large-scale dataset, in order to learn a high-quality model from the no-knowledge prior; and (2) using the learned super model as a high-quality prior in the downstream application, in order to reduce the required training data and support a larger model size.

In this paper, the generalization bound in pre-training is established based on the KL-divergence between the Maxwell-Boltzmann distribution in pre-training and the prior, as below,

$$R(Q_{PT}) \le \hat{R}(Q_{PT}) + \sqrt{\frac{D(Q_{PT}, P) + 2\log\left(\frac{1}{\delta}\right) + 2\log N_{PT} + 4}{4N_{PT} - 2}}, \tag{5}$$

where

$$D(Q_{PT}, P) = \log(\det(\Sigma_{PT})) + \operatorname{tr}(\Sigma_{PT} - I),$$

and $R(Q_{PT})$ is the expected risk, $\hat{R}(Q_{PT})$ is the empirical risk, $\Sigma_{PT}$ is the covariance of the distribution of the learned hypothesis, and $N_{PT}$ is the training sample size in pre-training.

Meanwhile, the generalization bound in fine-tuning is established based on the KL-divergence between the two Maxwell-Boltzmann distributions in pre-training and fine-tuning, as follows,

$$R(Q_{FT}) \le \hat{R}(Q_{FT}) + \sqrt{\frac{D(Q_{FT}, Q_{PT}) + 2\log\left(\frac{1}{\delta}\right) + 2\log N_{FT} + 4}{4N_{FT} - 2}}, \tag{6}$$

where

$$D(Q_{FT}, Q_{PT}) = \log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT},$$

and $R(Q_{FT})$ is the expected risk, $\hat{R}(Q_{FT})$ is the empirical risk, $\Sigma_{FT}$ is the covariance of the distribution of the learned hypothesis, $\theta_{FT}$ is the shift of the distribution center, and $N_{FT}$ is the training sample size in fine-tuning.

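We further define two new notions to measure the domain discrepancy as follows,

$$D(Q_{FT}, Q_{PT}) = \log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT},$$

and

$$\tilde{D}(Q_{FT}, Q_{PT}) = \log(\operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT}) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT} + d\log d - d,$$

where $d$ is the parameter size. These two notions measure the magnitude of the domain shift based on the hypotheses learned on the source and target domains.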

Our theory has the following implications:

• The very large-scale datasets employed in the pre-training stage help secure a high-quality model from a no-knowledge prior. The learned model carries the knowledge learned from the training data in the pre-training stage. The distribution of the pre-trained model serves as a high-quality initialization in the downstream fine-tuning stage.

• Comparing the generalization bounds in the pre-training stage and the fine-tuning stage, we show that the generalization error of the fine-tuning stage is dominant in the super-model paradigm. This is because the training data in the pre-training stage is much larger. This finding supports the feasibility and efficiency of the super-model paradigm.

• Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shift. Generally, a larger domain shift leads to worse generalization on the target domain. This supports the practical heuristic that the performance on the target domain is limited by the domain shift.

It is worth noting that the super-model paradigm also supports model compression in model deployment, including pruning, quantization, and model distillation. The influence of model compression methods can be directly plugged into our theory.

## 2 Background

This section reviews the related work, including super model, domain adaptation, generalization, and deep learning theory.

Domain adaptation. Domain adaptation algorithms transfer knowledge from one domain to another, which enables the super-model paradigm. Domain adaptation has three main streams:

(1) Discrepancy-based domain adaptation modifies the loss to narrow the discrepancy between the features from the source domain $X_S$ and those from the target domain $X_T$. Tzeng et al. (2014) introduce a fully-connected adaptation layer into a CNN for learning the representation of the kernel in order to minimize the maximum mean discrepancy (MMD) between the features from different domains:

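$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) - \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\|,$$

where $\phi(\cdot)$ denotes the learned representation.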

Long et al. (2015) employ multiple adaptation layers. Long et al. (2015, 2016) introduce residual blocks into the classifiers of the source domain. Long et al. (2017) further consider the discrepancy of the joint distribution rather than the marginal distribution;

(2) Adversarial-based domain adaptation maps the source domain and the target domain to a common space, inspired by generative adversarial networks (GANs): if a classifier can hardly separate examples from the source domain and those from the target domain, the feature extractor has narrowed the gap between the two domains. Ganin and Lempitsky (2015) propose gradient reversal layers that reverse the gradients generated by the domain classifier during backpropagation. Zhang et al. (2018) argue that the features of the bottom layers contain more domain information, while those of the top layers contain less domain information. They further employ collaborative learning to learn domain-informative features in the bottom layers, and adapt adversarial learning to learn domain-uninformative features in the top layers; and

(3) Reconstruction-based domain adaptation reconstructs the features extracted from the source domain to the target domain. Ghifary et al. (2016) reconstruct examples from the target domain via features learned from the source-domain classification task. Bousmalis et al. (2016) reconstruct the inputs via both private and shared representations of the source and target domains.
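
To make the discrepancy measure in stream (1) concrete, the following is a minimal sketch of the maximum mean discrepancy between two feature sets. It assumes the feature map is the identity (a linear kernel), so the MMD reduces to the Euclidean distance between the two feature means; the function name `linear_mmd` and the synthetic data are illustrative assumptions and are not taken from the cited methods.

```python
import numpy as np


def linear_mmd(source_feats: np.ndarray, target_feats: np.ndarray) -> float:
    """Empirical MMD with an identity feature map: norm of the mean difference."""
    mean_s = source_feats.mean(axis=0)
    mean_t = target_feats.mean(axis=0)
    return float(np.linalg.norm(mean_s - mean_t))


rng = np.random.default_rng(0)
x_source = rng.normal(loc=0.0, scale=1.0, size=(500, 8))       # synthetic source features
x_target_near = rng.normal(loc=0.1, scale=1.0, size=(500, 8))  # small domain shift
x_target_far = rng.normal(loc=1.0, scale=1.0, size=(500, 8))   # large domain shift

print("MMD to nearby target :", linear_mmd(x_source, x_target_near))
print("MMD to distant target:", linear_mmd(x_source, x_target_far))
```

In a discrepancy-based method, a quantity of this kind would be added to the training loss as a penalty; richer kernels replace the identity map with a nonlinear feature map.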

Generalization. Good generalization guarantees that an algorithm learns the underlying patterns in the training data rather than just memorizing the data. In this way, good generalization provides confidence that models trained on existing data can be applied to similar but unseen scenarios. Three major approaches to analyzing generalizability are seen in the literature: (1) generalization bounds based on the hypothesis complexity, including VC dimension (Blumer et al., 1989; Vapnik, 2006), Rademacher complexity (Koltchinskii and Panchenko, 2000; Koltchinskii, 2001; Bartlett and Mendelson, 2002), and covering number (Dudley, 1967; Haussler, 1995). These results are usually obtained via concentration inequalities. They also suggest controlling the model size to secure generalizability, a prescription that no longer holds in deep learning; (2) generalization bounds based on algorithmic stability (Rogers and Wagner, 1978; Bousquet and Elisseeff, 2002; Xu et al., 2011). The results in this stream follow the motivation that learning algorithms that are robust to small disturbances in the input data usually generalize well; and (3) generalization bounds in the PAC-Bayes framework (McAllester, 1999a, b). These results are obtained based on information-theoretical versions of concentration inequalities.

Deep learning theory. Deep learning has been deployed successfully in many real-world scenarios. However, the theoretical foundations of deep learning are still elusive. For example, there is no explanation for how deep learning algorithms work, why they can succeed, when they would fail, and whether they would hurt society. Such deficiency in explainability calls into question the transparency and accountability of deep learning, and further undermines our confidence in deploying deep learning in security-critical application domains, such as medical diagnosis (Kulikowski, 1980; Silver et al., 2016) and drug discovery (Chen et al., 2018a). Many works have emerged to establish the theoretical foundations of deep learning via VC dimension (Harvey et al., 2017), Rademacher complexity (Golowich et al., 2018; Bartlett et al., 2017), covering number (Bartlett et al., 2017), Fisher-Rao norm (Liang et al., 2019; Tu et al., 2020), PAC-Bayesian framework (Neyshabur et al., 2017), algorithmic stability (Hardt et al., 2016; Kuzborskij and Lampert, 2018; Verma and Zhang, 2019), and the dynamics of stochastic gradient descent or its variants (Mandt et al., 2017; Mou et al., 2018b; He et al., 2019). More related work can be found in the surveys (E et al., 2020; He and Tao, 2020; Poggio et al., 2020). This work is committed to establishing theoretical foundations of privacy, generalization, and adversarial attack in deep learning, all of which have profound importance in enhancing the explainability, transparency, and accountability of deep models.

Generalization of SGD. Several generalization bounds for algorithms trained by SGD have been proposed. Mou et al. (2018a) analyze the generalization of stochastic gradient Langevin dynamics (SGLD), and prove two upper bounds for the generalization error, via algorithmic stability and PAC-Bayesian theory, respectively. Pensia et al. (2018) analyze the generalizability of noisy and iterative machine learning algorithms. A generalization bound is then proved given the mutual information between the output hypothesis and the input data. They also prove generalization bounds for SGLD as examples. Chen et al. (2018b) prove that the convergence and stability of iterative machine learning algorithms have a trade-off under both the convex smooth assumption and the strongly convex smooth assumption. Under the same assumptions, Chen et al. (2018b) prove a generalization bound for SGD. Liu et al. (2017) prove a generalization bound for SGD when the loss function is Lipschitz continuous and smooth. London (2017) proves a generalization bound for SGD based on the KL divergence between the prior and the posterior under the PAC-Bayes framework. He et al. (2019) present a PAC-Bayes generalization bound for SGD based on stochastic differential equations. In the work of He et al., the gradient noise is modeled by a Gaussian distribution. Meng et al. (2020) extend the gradient noise to be state-dependent. Cheng et al. (2020) extend the gradient noise to a Levy process.

## 3 Notations and preliminaries

Suppose the training dataset is $S = \{(x_1, y_1), \ldots, (x_N, y_N) \mid x_i \in \mathbb{R}^{d_X}, y_i \in \mathbb{R}^{d_Y}\}$, where $d_X$ is the dimension of the feature $X$ and $d_Y$ is the dimension of the label $Y$. Suppose $x_i$ and $y_i$ are independent and identically distributed (i.i.d.) observations of the variables $X$ and $Y$, respectively. We also write $z_i = (x_i, y_i)$, which is an i.i.d. observation of the random variable $Z = (X, Y)$. Denote the generating distribution of $Z$ by $\mathcal{D}$.

Formally, machine learning algorithms are designed to select the hypothesis function $F_\theta$ with the lowest expected risk $R$ under the loss function $l$ from a hypothesis class $\{F_\theta \mid \theta \in \Theta \subset \mathbb{R}^d\}$, where $\theta$ is the parameter of the hypothesis and $d$ is the dimension of the parameter $\theta$. For many stochastic algorithms, such as SGD, we usually use a distribution to express the output parameter. Suppose the parameter follows a distribution $Q$; the expected risks in terms of $\theta$ and $Q$, respectively, are defined as:

$$R(\theta) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\, l(F_\theta(X), Y), \tag{7}$$

$$R(Q) = \mathbb{E}_{\theta\sim Q}\, \mathbb{E}_{(X,Y)\sim\mathcal{D}}\, l(F_\theta(X), Y). \tag{8}$$

However, the expected risk is not available from the data, since we do not know the form of the latent data distribution. In practice, we use the empirical risk $\hat{R}$ to estimate the expected risk $R$, which is defined as:

$$\hat{R}(\theta) = \frac{1}{|T|} \sum_{i=1}^{|T|} l(F_\theta(X_i), Y_i), \tag{9}$$

$$\hat{R}(Q) = \mathbb{E}_{\theta\sim Q}\left[\frac{1}{|T|} \sum_{i=1}^{|T|} l(F_\theta(X_i), Y_i)\right], \tag{10}$$

where all $(X_i, Y_i)$ constitute the training sample $T$.
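
As an illustration of the two empirical risks, the sketch below evaluates the empirical risk of a single parameter and a Monte Carlo estimate of the empirical risk of a distribution over parameters. It assumes a linear hypothesis $F_\theta(x) = \theta^\top x$, the squared loss, and a Gaussian $Q$; these choices and all function names are hypothetical and serve only to make the definitions executable.

```python
import numpy as np


def empirical_risk(theta, X, Y):
    """Average squared loss of the linear hypothesis F_theta(x) = x @ theta."""
    residuals = X @ theta - Y
    return float(np.mean(residuals ** 2))


def empirical_risk_of_distribution(mean, cov, X, Y, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_{theta ~ Q}[empirical risk] with Q = N(mean, cov)."""
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(mean, cov, size=n_samples)  # (n_samples, d)
    residuals = thetas @ X.T - Y                                 # (n_samples, |T|)
    return float(np.mean(residuals ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=50)
cov = 0.01 * np.eye(3)

print("risk of a single parameter:", empirical_risk(theta_true, X, Y))
print("risk of the distribution  :", empirical_risk_of_distribution(theta_true, cov, X, Y))
```

For the squared loss, the risk of the distribution exceeds the risk at its mean by $\operatorname{tr}(\Sigma X^\top X)/|T|$, where $\Sigma$ is the covariance of $Q$, which gives a simple check on the Monte Carlo estimate.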

Learning algorithms usually solve the following empirical risk minimization (ERM) problem to approach the optimal hypothesis,

$$\min_{\theta} \hat{R}_S(\theta) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(h_\theta(x_i), y_i).$$

We usually employ stochastic gradient-based optimizers for ERM in deep learning. Popular stochastic gradient-based optimizers include stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), and Adam (Kingma and Ba, 2014). For brevity, we analyze SGD in this paper. The analysis for other stochastic gradient-based optimizers is similar.

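Suppose $\mathcal{B}$ is a mini-batch randomly drawn from the training sample set $S$. Then, the stochastic gradient on $\mathcal{B}$ is as follows,

$$\hat{g}_{ERM}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{(x_i, y_i)\in\mathcal{B}} \nabla_\theta \ell(h_\theta(x_i), y_i).$$

In the $t$-th iteration, the weight is updated as follows,

$$\theta_{t+1} = \theta_t - \eta_t\, \hat{g}_{ERM}(\theta_t),$$

where $\theta_t$ is the weight vector in the $t$-th iteration and $\eta_t$ is the corresponding learning rate.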
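
The mini-batch gradient and the update rule above can be sketched in a few lines. The example assumes a linear hypothesis with the squared loss and a constant learning rate; the function names, hyperparameters, and synthetic data are illustrative assumptions.

```python
import numpy as np


def minibatch_gradient(theta, X, Y, batch_idx):
    """Stochastic gradient of the mean squared loss on the mini-batch `batch_idx`."""
    Xb, Yb = X[batch_idx], Y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ theta - Yb) / len(batch_idx)


def sgd(X, Y, lr=0.05, batch_size=16, n_iters=2000, seed=0):
    """Run theta_{t+1} = theta_t - lr * g_hat(theta_t) from a zero initialization."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Draw a fresh mini-batch without replacement at every iteration.
        batch_idx = rng.choice(len(X), size=batch_size, replace=False)
        theta = theta - lr * minibatch_gradient(theta, X, Y, batch_idx)
    return theta


rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=400)

theta_hat = sgd(X, Y)
print("estimated parameter:", theta_hat)
```

Because each step uses a random mini-batch, the iterate does not settle at a single point but keeps fluctuating around the minimizer, which is the behavior the diffusion model in the later sections describes.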

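Meanwhile, adversarial training employs SGD to solve the following minimax problem,

$$\min_{\theta} \hat{R}^A_S(\theta) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \max_{\|x'_i - x_i\| \le \rho} \ell(h_\theta(x'_i), y_i), \tag{11}$$

where $\rho$ is the radius of the ball centered at the example $(x_i, y_i)$. Here, we call $\hat{R}^A_S(\theta)$ the adversarial empirical risk. Correspondingly, the stochastic gradient on a mini-batch $\mathcal{B}$ and the weight update are calculated as below,

$$\hat{g}^A(\theta) = \frac{1}{|\mathcal{B}|} \sum_{(x_i, y_i)\in\mathcal{B}} \nabla_\theta \max_{\|x'_i - x_i\| \le \rho} \ell(h_\theta(x'_i), y_i), \qquad \theta^A_{t+1} = \theta^A_t - \eta_t\, \hat{g}^A(\theta^A_t). \tag{12}$$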
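
The inner maximization in the adversarial empirical risk can be illustrated in a special case where it has a closed form. The sketch assumes a linear hypothesis $h_\theta(x) = \theta^\top x$, the squared loss, and a Euclidean ball; under these assumptions the worst-case loss is $(|\theta^\top x - y| + \rho\|\theta\|)^2$. For neural networks the inner maximum generally has no closed form and must be approximated. All function names are hypothetical.

```python
import numpy as np


def standard_risk(theta, X, Y):
    """Empirical risk with squared loss for the linear hypothesis x @ theta."""
    return float(np.mean((X @ theta - Y) ** 2))


def adversarial_risk(theta, X, Y, rho):
    """Adversarial empirical risk over Euclidean balls of radius rho.

    For a linear hypothesis, x' @ theta ranges over an interval of half-width
    rho * ||theta|| around x @ theta, so the worst-case squared loss is
    (|x @ theta - y| + rho * ||theta||) ** 2.
    """
    margin = rho * np.linalg.norm(theta)
    return float(np.mean((np.abs(X @ theta - Y) + margin) ** 2))


def worst_case_inputs(theta, X, Y, rho):
    """Perturbed inputs that attain the inner maximum (theta must be nonzero)."""
    direction = theta / np.linalg.norm(theta)
    residual_sign = np.where(X @ theta - Y >= 0.0, 1.0, -1.0)
    return X + rho * residual_sign[:, None] * direction[None, :]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = rng.normal(size=100)
theta = np.array([0.5, -1.0, 0.3, 0.8])

print("standard risk   :", standard_risk(theta, X, Y))
print("adversarial risk:", adversarial_risk(theta, X, Y, rho=0.3))
```

Evaluating the standard risk on the perturbed inputs returned by `worst_case_inputs` reproduces the closed-form adversarial risk, which confirms that the perturbation attains the inner maximum.
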
###### Definition 1 (KL Divergence; cf. Kullback and Leibler (1951)).

Suppose two distributions $P$ and $Q$ are defined on the same support. Then the KL divergence between $P$ and $Q$ is defined as

$$D_{KL}(P \| Q) = \mathbb{E}_P\left(\log \frac{dP}{dQ}\right).$$
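
For Gaussian distributions, which are the distributions that appear in the later analysis, the KL divergence has a well-known closed form. The sketch below implements the textbook formula for two multivariate Gaussians and checks it against a Monte Carlo estimate of the defining expectation; the function names and test values are illustrative assumptions.

```python
import numpy as np


def gaussian_kl(mean_q, cov_q, mean_p, cov_p):
    """KL(Q || P) for Q = N(mean_q, cov_q) and P = N(mean_p, cov_p)."""
    d = len(mean_q)
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mean_p - mean_q
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    return float(0.5 * (np.trace(cov_p_inv @ cov_q)
                        + diff @ cov_p_inv @ diff
                        - d
                        + logdet_p
                        - logdet_q))


def monte_carlo_kl(mean_q, cov_q, mean_p, cov_p, n_samples=200_000, seed=0):
    """Estimate E_Q[log q(theta) - log p(theta)] by sampling theta ~ Q."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean_q, cov_q, size=n_samples)

    def log_density(x, mean, cov):
        d = len(mean)
        centered = x - mean
        _, logdet = np.linalg.slogdet(cov)
        quad = np.einsum("ni,ij,nj->n", centered, np.linalg.inv(cov), centered)
        return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))

    return float(np.mean(log_density(samples, mean_q, cov_q)
                         - log_density(samples, mean_p, cov_p)))


mean_q = np.array([0.5, -0.3])
cov_q = np.array([[1.0, 0.2], [0.2, 0.5]])
mean_p = np.zeros(2)
cov_p = np.eye(2)

print("closed form:", gaussian_kl(mean_q, cov_q, mean_p, cov_p))
print("Monte Carlo:", monte_carlo_kl(mean_q, cov_q, mean_p, cov_p))
```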

To avoid technicalities, measurability and integrability issues are ignored throughout this paper. Moreover, Fubini's theorem is assumed to be applicable for any integration with respect to multiple variables, so that the order of integration is exchangeable. Also, we assume that the stable (stationary) solutions of all stochastic differential equations involved exist and are unique.

## 4 Super-model paradigm

An industrial paradigm has been emerging that (1) pre-trains a large-scale model on large amounts of data, possibly multi-modal, as with GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018); and (2) fine-tunes the obtained model on a specific, smaller domain where data is relatively difficult to access. In this paper, we call it the super-model paradigm. The super-model paradigm enables efficient and effective knowledge discovery in low-resource application scenarios, including few-shot learning (Snell et al., 2017; Sung et al., 2018) and zero-shot learning (Romera-Paredes and Torr, 2015). A key cornerstone technology therein is domain adaptation. This section describes this paradigm.

Large-scale pre-trained models. Recent advances are seen mainly in natural language processing (NLP), particularly after the appearance of the transformer (Vaswani et al., 2017). ELMo (Peters et al., 2018) finds that the word embedding in NLP is not invariant across application domains, but changes considerably with context. Based on this observation, ELMo pre-trains a large-scale bidirectional LSTM on a large text corpus to generate word vectors by fine-tuning. BERT (Devlin et al., 2018) employs the transformer encoder to capture bidirectional information in the context. Meanwhile, Liu et al. (2018) employ the transformer decoder for word embedding with fine-tuning, in order to realize wider attention. GPT (Radford et al., 2018) also employs the transformer decoder but is fine-tuned on each specific task for better performance. Extending GPT, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) construct huge models in order to realize zero-shot learning. The comparison between these "super models" is presented in the following table.

Pre-training stage. The first step of the super-model paradigm is pre-training a super model on large-scale data, sometimes of multiple modalities. The model learned in this stage is of high quality: the approximation and generalization of the output hypothesis are usually excellent, which suggests that the learned model has stored rich general knowledge. This makes it possible to apply the learned model to smaller, specific application domains.

Fine-tuning stage. The learned model is then fine-tuned on the target domain, usually a smaller specific domain. The stored general knowledge is thereby transferred to the target domain. In this way, the super-model paradigm considerably reduces the resources needed for knowledge discovery in the target domain.

Theoretical advantages. According to the PAC-Bayesian theory, the generalizability of the learned model is determined by the distance between the posterior and the prior. As we will show in the next two sections, the very large-scale training data in the pre-training stage secures learning a high-quality model from a no-knowledge prior. The learned knowledge is of high value but consumes enormous resources, which are not accessible to many potential machine learning users, particularly small and medium-sized enterprises. In the super-model paradigm, the high-quality model learned in the pre-training stage is employed as the initialization in the fine-tuning stage. In this way, we significantly reduce the sample complexity needed in the fine-tuning stage.

Industrial values. Machine learning has been thriving in a wide range of areas. However, its industrial applications are still limited. This is partially caused by the high cost of computing facilities and data annotation. The paradigm based on super models significantly reduces the cost of machine learning applications. This is particularly important for small and medium-sized enterprises.

Climate value. The super-model paradigm enables the reuse of discovered general knowledge across a large number of application domains. This would also help significantly reduce carbon emissions. Meanwhile, the super-model paradigm centralizes the model training process, which can help manage the geographic location and the datacenter infrastructure in order to reduce carbon emissions, as a recent work suggests (Patterson et al., 2021):

• The geographic location chosen when scheduling a machine learning workload can change the carbon emissions by a factor of around five to ten, even when the country and the organization remain the same.

• Cloud data centers can be around 1.4-2X more energy-efficient. Meanwhile, machine learning-oriented accelerators can be 2-5X more effective.

The super-model paradigm can thus reduce the carbon footprint of machine learning applications and further contribute to slowing down the climate crisis.

## 5 Diffusion processes in super-model paradigm

We consider a diffusion process-based model that serves as an envelope for domain adaptation methods. Two diffusion processes are designed for modeling the pre-training and fine-tuning stages, respectively. The knowledge transition can then be modeled via the transition between the diffusion processes.

### 5.1 Diffusion process in pre-training

In pre-training, SGD explores the loss surface for a decent local minimum. Compared with gradient descent, SGD introduces noise into the gradient and hence into the weight. The noise acts as an implicit regularizer that controls the hypothesis complexity of the learned model. In this section, we employ a stochastic differential equation to characterize the trajectory of SGD.

We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as shown in the following assumption.

###### Assumption 1.

Suppose that the empirical risk around the optimum takes the following form,

$$R(\theta) = \frac{1}{2} \theta^\top A_{PT}\, \theta, \tag{13}$$

where $A_{PT}$ is the Hessian matrix around the minimum and is a positive (semi-)definite matrix.

###### Remark 1.

This assumption implicitly assumes that the converged local minimum is at the zero point. This does not lose generality, because the problem is invariant under translation. Specifically, suppose the converged local minimum is at some nonzero point. We may then translate the parameter space of the neural network to move the converged local minimum to zero.

###### Remark 2.

The Hessian matrix of the loss surface characterizes the local geometry around the converged local minimum. Its determinant characterizes the flatness/sharpness of the loss function around the local minimum (Keskar et al., 2017; Goyal et al., 2017).

###### Remark 3.

The covariance matrix of the gradient noise characterizes the fluctuation introduced by the mini-batches into the gradient estimation. A recent intuition for the advantage of SGD is that it introduces noise into the gradient, so that it can jump out of bad local minima.

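The loss and gradient calculated on a mini-batch are unbiased estimators of the empirical risk $R(\theta)$ and the full gradient $g(\theta)$, as follows,

$$\mathbb{E}[l_n(\theta)] = \mathbb{E}[\hat{R}(\theta)] = R(\theta), \tag{14}$$

$$\mathbb{E}[\nabla_\theta l_n(\theta)] = \mathbb{E}[\hat{g}_S(\theta)] = g(\theta) = \nabla_\theta R(\theta), \tag{15}$$

where the expectations are in terms of the corresponding examples $(X, Y)$.

The fluctuations introduced by the mini-batches are modeled by Gaussian distributions centered at $g(\theta)$. Specifically, we assume that

$$\nabla_\theta l_n(\theta) \sim \mathcal{N}(g(\theta), C), \tag{16}$$

where $C$ is the covariance matrix and is a constant matrix for all $\theta$. This Gaussian assumption is also employed by E (2017) and Mandt et al. (2017). Therefore, we further have the following estimation,

$$\hat{g}_S(\theta) = \frac{1}{|S|} \sum_{n\in S} \nabla_\theta l_n(\theta) \sim \mathcal{N}\left(g(\theta), \frac{1}{|S|} C\right). \tag{17}$$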

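SGD uses the stochastic gradient $\hat{g}_S(\theta)$ to iteratively update the parameter $\theta$ in order to minimize the function $R(\theta)$:

$$\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta \hat{g}_S(\theta(t)) = -\eta g(\theta) + \frac{\eta}{\sqrt{|S|}} B \Delta W, \tag{18}$$

and

$$\Delta W \sim \mathcal{N}(0, I),$$

where $B$ is a positive definite matrix which characterizes the covariance of the gradient noise. We define

$$C = B^\top B.$$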

In this paper, we consider the case that the batch size and learning rate are constant.
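
The update rule of eq. (18) can be simulated directly on a quadratic loss. The sketch below is a minimal illustration that assumes the gradient is $g(\theta) = A\theta$, as implied by Assumption 1, and runs many independent chains with constant learning rate and batch size. The function name and all numeric values are hypothetical. In the scalar case, with Hessian $a$ and noise variance $c = b^2$, a short calculation on the discrete recursion gives the stationary variance $\eta c / (|S|\, a\, (2 - \eta a))$, which approaches $\eta c / (2 a |S|)$ for small $\eta a$.

```python
import numpy as np


def simulate_sgd_chains(A, B, eta, batch_size, n_steps, n_chains, seed=0):
    """Iterate theta <- theta - eta * A @ theta + eta / sqrt(|S|) * B @ dW.

    A is the Hessian of the quadratic loss, so g(theta) = A @ theta, and
    dW ~ N(0, I) is drawn independently at every step and for every chain.
    Returns an array of shape (n_chains, d) holding the final parameters.
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    theta = rng.normal(size=(n_chains, d))  # Gaussian initialization
    noise_scale = eta / np.sqrt(batch_size)
    for _ in range(n_steps):
        dW = rng.normal(size=(n_chains, d))
        theta = theta - eta * theta @ A.T + noise_scale * dW @ B.T
    return theta


a, b, eta, batch_size = 2.0, 1.5, 0.05, 8   # hypothetical scalar example
final = simulate_sgd_chains(np.array([[a]]), np.array([[b]]), eta, batch_size,
                            n_steps=500, n_chains=20_000)
empirical_var = float(final.var())
exact_var = eta * b ** 2 / (batch_size * a * (2.0 - eta * a))
print("empirical variance:", empirical_var)
print("scalar prediction :", exact_var)
```

The chains forget their initialization and settle into a zero-mean distribution whose spread is set by the ratio of learning rate to batch size, in line with the stationary distribution derived next.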

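Combining eqs. (13) and (18), we have the following analytic form of the stationary distribution (Gardiner et al., 1985):

$$q_{PT}(\theta) = M_{PT} \exp\left\{-\frac{1}{2} \theta^\top \Sigma_{PT}^{-1} \theta\right\}, \tag{19}$$

where $M_{PT}$ is the normalizer and $\Sigma_{PT}$ satisfies

$$\Sigma_{PT} A_{PT} + A_{PT} \Sigma_{PT} = \frac{\eta_{PT}}{|S_{PT}|} C_{PT},$$

where $\eta_{PT}$, $|S_{PT}|$, and $C_{PT}$ are the learning rate, batch size, and the covariance matrix in the pre-training stage, respectively.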
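
The covariance of the stationary distribution is defined only implicitly by the matrix equation above. As a sketch, assuming the Hessian is symmetric and strictly positive definite, the equation decouples in the eigenbasis of the Hessian and can be solved entry by entry. The function name and the numeric matrices are hypothetical.

```python
import numpy as np


def stationary_covariance(A, C, eta, batch_size):
    """Solve Sigma @ A + A @ Sigma = (eta / |S|) * C for symmetric positive definite A.

    In the eigenbasis A = U diag(lam) U^T the equation decouples entrywise:
    Sigma_tilde[i, j] * (lam[i] + lam[j]) = rhs_tilde[i, j].
    """
    lam, U = np.linalg.eigh(A)
    rhs_tilde = U.T @ (eta / batch_size * C) @ U
    sigma_tilde = rhs_tilde / (lam[:, None] + lam[None, :])
    return U @ sigma_tilde @ U.T


A = np.array([[2.0, 0.3], [0.3, 1.0]])   # Hessian (hypothetical values)
B = np.array([[1.0, 0.2], [0.2, 0.8]])   # noise factor (hypothetical values)
C = B.T @ B
sigma = stationary_covariance(A, C, eta=0.05, batch_size=8)
residual = sigma @ A + A @ sigma - 0.05 / 8 * C
print("Sigma_PT =\n", sigma)
print("max residual:", np.abs(residual).max())
```

In the scalar case the solution reduces to $\eta c / (2 a |S|)$, so a larger learning rate or a smaller batch size widens the stationary distribution, while a sharper minimum narrows it.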

###### Remark 4.

In this section, we show that the learned hypothesis is drawn from the steady distribution of a Fokker-Planck equation, which is a Gibbs-Boltzmann distribution centered at the zero point.

### 5.2 Knowledge transition in fine-tuning

SGD in the fine-tuning stage can also be characterized by the Ornstein-Uhlenbeck equation (eq. 18), while the initial condition is different. The fine-tuning stage is initialized by the steady distribution of the pre-training stage, $q_{PT}$. Similarly, SGD converges to another steady distribution, $q_{FT}$. In this way, we model domain adaptation as a two-stage diffusion process. The second-stage diffusion process characterizes the knowledge transition between the two domains.

We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as shown in the following assumption.

###### Assumption 2.

Suppose that the empirical risk around the optimum takes the following form,

$$R(\theta) = \frac{1}{2} (\theta - \theta_{FT})^\top A_{FT} (\theta - \theta_{FT}), \tag{20}$$

where $A_{FT}$ is the Hessian matrix around the minimum and is a positive (semi-)definite matrix.

Recall that we assumed that the converged local minimum in the pre-training stage is at the zero point. In the fine-tuning stage, the converged local minimum cannot be assumed to be at the same point in general. Thus, a shift term $\theta_{FT}$ is introduced to characterize the shift of the converged local minimum.

Similarly, combining eqs. (20) and (18), we have the following analytic form of the stationary distribution:

$$q_{FT}(\theta) = M_{FT} \exp\left\{-\frac{1}{2} (\theta - \theta_{FT})^\top \Sigma_{FT}^{-1} (\theta - \theta_{FT})\right\}, \tag{21}$$

where $M_{FT}$ is the normalizer and $\Sigma_{FT}$ satisfies

$$\Sigma_{FT} A_{FT} + A_{FT} \Sigma_{FT} = \frac{\eta_{FT}}{|S_{FT}|} C_{FT},$$

where $\eta_{FT}$, $|S_{FT}|$, and $C_{FT}$ are the learning rate, batch size, and the covariance matrix in the fine-tuning stage, respectively.

Recall that the converged local minimizer in the pre-training stage is drawn from a Gibbs-Boltzmann distribution centered at the zero point. This is inherited from the assumption that the local minimum is at the zero point. However, in the fine-tuning stage, the converged local minimum has a shift from the zero point. This leads to a shift of the distribution of the learned hypothesis.
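
The transport from the pre-trained distribution to the shifted fine-tuned distribution can be illustrated by simulation. The sketch below assumes a two-dimensional parameter, draws starting points from a zero-mean Gaussian standing in for the pre-trained distribution, and runs noisy gradient steps on the shifted quadratic loss of Assumption 2. The function name and every numeric value (covariance, Hessian, noise factor, shift) are hypothetical.

```python
import numpy as np


def fine_tune_chains(theta_init, A_ft, B_ft, theta_ft, eta, batch_size, n_steps, seed=0):
    """Run eq. (18)-style updates on the shifted quadratic loss of Assumption 2.

    The gradient of 0.5 * (theta - theta_ft)^T A_ft (theta - theta_ft)
    is A_ft @ (theta - theta_ft); theta_init holds one starting point per row.
    """
    rng = np.random.default_rng(seed)
    theta = theta_init.copy()
    noise_scale = eta / np.sqrt(batch_size)
    for _ in range(n_steps):
        dW = rng.normal(size=theta.shape)
        grad = (theta - theta_ft) @ A_ft.T
        theta = theta - eta * grad + noise_scale * dW @ B_ft.T
    return theta


rng = np.random.default_rng(1)
sigma_pt = np.array([[0.02, 0.0], [0.0, 0.01]])   # hypothetical pre-trained covariance
theta_start = rng.multivariate_normal(np.zeros(2), sigma_pt, size=10_000)

A_ft = np.array([[1.5, 0.2], [0.2, 1.0]])         # hypothetical target-domain Hessian
B_ft = 0.5 * np.eye(2)                            # hypothetical noise factor
theta_ft = np.array([0.8, -0.4])                  # hypothetical shift of the minimum

theta_end = fine_tune_chains(theta_start, A_ft, B_ft, theta_ft,
                             eta=0.05, batch_size=8, n_steps=600)
print("mean before fine-tuning:", theta_start.mean(axis=0))
print("mean after fine-tuning :", theta_end.mean(axis=0))
```

The sample mean moves from the origin to the shifted minimum, while the spread of the chains is reshaped by the target-domain Hessian and noise, which mirrors the shift and covariance terms of the fine-tuned distribution.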

## 6 Generalization analysis of super-model paradigm

The knowledge transition is characterized by the diffusion process in the fine-tuning stage. In this paper, we employ the PAC-Bayesian theory to analyze the generalizability of domain adaptation.

### 6.1 PAC-Bayesian framework

PAC-Bayesian theory incorporates the PAC theory and Bayesian statistics (McAllester, 1999a, b). It presents a generalization bound for a stochastic algorithm based on the distance between the learned hypothesis and the prior, measured by the KL divergence. The PAC-Bayesian bound characterizes the trade-off between minimizing the empirical risk and exploring areas of the hypothesis space further away from the initialization.

###### Lemma 1 (see McAllester (1999a), Theorem 1).

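For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a sample of size $N$, we have the following inequality for all distributions $Q$:

$$R(Q) \le \hat{R}(Q) + \sqrt{\frac{D(Q\|P) + \log\frac{1}{\delta} + \log N + 2}{2N - 1}}, \tag{22}$$

where $D(Q\|P)$ is the KL divergence between the distributions $Q$ and $P$, defined as

$$D(Q\|P) = \mathbb{E}_{\theta \sim Q}\left(\log\frac{Q(\theta)}{P(\theta)}\right). \tag{23}$$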

This lemma characterizes the influence on the generalization of the distance between the distribution of the learned hypothesis and the prior, measured by the KL divergence $D(Q\|P)$. The KL divergence serves as a measure of hypothesis complexity. Specifically, a larger KL divergence corresponds to a larger hypothesis complexity and hence worse generalizability.
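To make the role of the KL divergence in eq. (22) concrete, the following minimal sketch evaluates only the square-root term of the bound. The function name `pac_bayes_gap` and the numerical values are illustrative assumptions, not part of the original analysis.

```python
import math


def pac_bayes_gap(kl, n, delta):
    """Square-root term of eq. (22).

    kl    : KL divergence D(Q||P), assumed non-negative
    n     : sample size N
    delta : confidence parameter in (0, 1)
    """
    if n < 1 or kl < 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 1, kl >= 0 and 0 < delta < 1")
    numerator = kl + math.log(1.0 / delta) + math.log(n) + 2.0
    return math.sqrt(numerator / (2.0 * n - 1.0))


if __name__ == "__main__":
    # Hypothetical values: a larger KL divergence widens the gap,
    # a larger sample size narrows it.
    for kl in (1.0, 10.0, 100.0):
        print(f"KL={kl:6.1f}  gap={pac_bayes_gap(kl, n=10_000, delta=0.05):.4f}")
```

For a fixed sample size the term grows monotonically with the KL divergence, which is the sense in which the divergence acts as a complexity measure.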

### 6.2 Generalization bound

We then obtain a generalization bound for the pre-training stage as follows.

###### Theorem 1.

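For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set of size $N_{PT}$, we have the following inequality for the distribution $Q_{PT}$ of the output hypothesis function of SGD:

$$R(Q_{PT}) \le \hat{R}(Q_{PT}) + \sqrt{\frac{D(Q_{PT}, P) + 2\log\frac{1}{\delta} + 2\log N_{PT} + 4}{4N_{PT} - 2}}, \tag{24}$$

where

$$D(Q_{PT}, P) = \log(\det(\Sigma_{PT})) + \operatorname{tr}(\Sigma_{PT} - I).$$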

The proof of this generalization bound has two parts: (1) use results on stochastic differential equations (SDEs) to find the stationary solution of the latent Ornstein-Uhlenbeck process (eq. 18), which expresses the iterative update of SGD; and (2) adapt the PAC-Bayes framework to obtain the generalization bound based on the stationary distribution. A detailed proof is omitted here and is given in Section 8.1.

Similarly, we can obtain a generalization bound for the fine-tuning stage as follows.

###### Theorem 2.

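For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set of size $N_{FT}$, we have the following inequality for the distribution $Q_{FT}$ of the output hypothesis function of SGD:

$$R(Q_{FT}) \le \hat{R}(Q_{FT}) + \sqrt{\frac{D(Q_{FT}, Q_{PT}) + 2\log\frac{1}{\delta} + 2\log N_{FT} + 4}{4N_{FT} - 2}},$$

where

$$D(Q_{FT}, Q_{PT}) = \log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT}.$$

###### Remark 5.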

The generalization bounds in both the pre-training and fine-tuning stages are of order $\mathcal{O}(1/\sqrt{N})$, which suggests that the generalization error converges to zero as the training sample size goes to infinity.

### 6.3 Dominance of fine-tuning in generalization of domain adaptation

In the super-model paradigm, the model is usually pre-trained on large amounts of data in a wide source domain and then fine-tuned on specific domains with much less training data. The training sample size in the source domain is therefore significantly larger than that in the target domain. For example, GPT-3 is trained on 45 TB of data, whereas the training sample in a target domain is typically far smaller.

###### Remark 6.

Combining Theorems 1 and 2, the comparison between the training sample sizes on the source domain and the target domain suggests that the generalization error of the fine-tuning stage is dominant in the super-model paradigm.
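The comparison in Remark 6 can be illustrated numerically. The sketch below evaluates the square-root term shared by eq. (24) and Theorem 2 for two sample sizes. The sample sizes, the discrepancy value, and the name `theorem_gap` are hypothetical choices made only for illustration; in particular, the same discrepancy value is plugged into both stages to isolate the effect of the sample size.

```python
import math


def theorem_gap(discrepancy, n, delta):
    """Square-root term of eq. (24) and of Theorem 2.

    discrepancy : D(Q_PT, P) or D(Q_FT, Q_PT), taken as a given number
    n           : training sample size (N_PT or N_FT)
    delta       : confidence parameter in (0, 1)
    """
    numerator = discrepancy + 2.0 * math.log(1.0 / delta) + 2.0 * math.log(n) + 4.0
    return math.sqrt(numerator / (4.0 * n - 2.0))


# Hypothetical sizes: a very large source sample and a small target sample.
n_pt, n_ft = 10**9, 10**4
gap_pt = theorem_gap(50.0, n_pt, delta=0.05)
gap_ft = theorem_gap(50.0, n_ft, delta=0.05)

if __name__ == "__main__":
    print(f"pre-training term: {gap_pt:.6f}")
    print(f"fine-tuning term:  {gap_ft:.6f}")
    print(f"ratio FT / PT:     {gap_ft / gap_pt:.1f}")
```

Under these assumed numbers the fine-tuning term is larger by more than two orders of magnitude, in line with the remark.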

### 6.4 Impact of the domain shifts

Theorem 2 helps characterize how the domain shifts between the source domain and the target domain influence the generalization on the target domain. The domain shifts are measured by the following discrepancy.

###### Definition 2 (Domain discrepancy).

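Suppose the distributions of the learned models in the pre-training and fine-tuning stages are $Q_{PT}$ and $Q_{FT}$, with densities

$$q_{PT}(\theta) = M_{PT} \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}, \qquad q_{FT}(\theta) = M_{FT} \exp\left\{-\frac{1}{2}(\theta - \theta_{FT})^\top \Sigma_{FT}^{-1} (\theta - \theta_{FT})\right\}, \tag{25}$$

where $M_{PT}$ and $M_{FT}$ are two normalizers, $\Sigma_{PT}$ and $\Sigma_{FT}$ are two covariance matrices, and $\theta_{FT}$ is the center shift between the two learned hypotheses.

Then, the domain discrepancy between the two domains is defined as

$$D(Q_{FT}, Q_{PT}) = \log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT}.$$

###### Remark 7.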

In Definition 2, we assume the distribution of the pre-trained model is centered at the zero point. This assumption does not lose generality: if the distribution center is not at the zero point, one may move it there via reparameterization.

###### Remark 8.

The domain discrepancy consists of two parts: (1) $\log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I)$ characterizes how well the covariances match; and (2) $\theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT}$ characterizes the center shift, viewed through the covariance in the source domain.
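As an illustration of this two-part structure, the following sketch evaluates the expression of Definition 2 under the simplifying assumption that both covariance matrices are diagonal, so that they can be passed as plain lists of variances. The function name and the example inputs are hypothetical.

```python
import math


def domain_discrepancy_diag(var_pt, var_ft, theta_ft):
    """Evaluate the two parts of the discrepancy in Definition 2.

    Assumes diagonal covariances: var_pt and var_ft hold the diagonal
    entries of Sigma_PT and Sigma_FT, theta_ft holds the center shift.
    Returns (covariance_part, shift_part); their sum is D(Q_FT, Q_PT).
    """
    if not (len(var_pt) == len(var_ft) == len(theta_ft)):
        raise ValueError("all inputs must have the same length")
    if any(v <= 0.0 for v in list(var_pt) + list(var_ft)):
        raise ValueError("variances must be positive")

    # For diagonal matrices, Sigma_PT^{-1} Sigma_FT is diagonal with these entries.
    ratios = [f / p for p, f in zip(var_pt, var_ft)]
    # log det(M) + tr(M - I), summed coordinate by coordinate.
    cov_part = sum(math.log(r) + r - 1.0 for r in ratios)
    # theta^T Sigma_PT^{-1} theta.
    shift_part = sum(t * t / p for t, p in zip(theta_ft, var_pt))
    return cov_part, shift_part


if __name__ == "__main__":
    cov_part, shift_part = domain_discrepancy_diag(
        var_pt=[1.0, 4.0], var_ft=[2.0, 4.0], theta_ft=[1.0, 2.0]
    )
    print("covariance part:", cov_part)
    print("shift part:     ", shift_part)
```

Both parts vanish when the two covariances coincide and the center shift is zero, and the shift part is scaled by the inverse source variances, so a shift along a direction where the pre-trained distribution is narrow is penalized more.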

###### Remark 9.

Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shifts. Generally, larger domain shifts lead to worse generalization on the target domain. This supports the heuristic in practice that the performance on the target domain is limited by the domain shifts.

Based on Definition 2, one can get the following lemma.

###### Lemma 2.

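The domain discrepancy can be upper bounded as follows,

$$D(Q_{FT}, Q_{PT}) \le \log(\operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT}) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT} + d\log d - d.$$

###### Proof of Lemma 2.

We have that

$$\begin{aligned} D(Q_{FT}, Q_{PT}) &= \log(\det(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT} - I) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT} \\ &\le \log(d^d \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT})) - d + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT}) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT}. \end{aligned}$$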

Based on Lemma 2, we define a new notion for measuring the domain shifts as follows.

###### Definition 3 (Dimension-dependent domain discrepancy).

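Suppose the distributions of the learned models in the pre-training and fine-tuning stages are $Q_{PT}$ and $Q_{FT}$, with densities

$$q_{PT}(\theta) = M_{PT} \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}, \qquad q_{FT}(\theta) = M_{FT} \exp\left\{-\frac{1}{2}(\theta - \theta_{FT})^\top \Sigma_{FT}^{-1} (\theta - \theta_{FT})\right\}, \tag{26}$$

where $M_{PT}$ and $M_{FT}$ are two normalizers, $\Sigma_{PT}$ and $\Sigma_{FT}$ are two covariance matrices, and $\theta_{FT}$ is the center shift between the two learned hypotheses.

Then, the dimension-dependent domain discrepancy between the two domains is defined as

$$\tilde{D}(Q_{FT}, Q_{PT}) = \log(\operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT}) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT} + d\log d - d.$$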

From Theorem 2, one may obtain the following corollary.

###### Corollary 1.

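For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set of size $N_{FT}$, we have the following inequality for the distribution $Q_{FT}$ of the output hypothesis function of SGD:

$$R(Q_{FT}) \le \hat{R}(Q_{FT}) + \sqrt{\frac{\tilde{D}(Q_{FT}, Q_{PT}) + 2\log\frac{1}{\delta} + 2\log N_{FT} + 4}{4N_{FT} - 2}}, \tag{27}$$

where

$$\tilde{D}(Q_{FT}, Q_{PT}) = \log(\operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT})) + \operatorname{tr}(\Sigma_{PT}^{-1}\Sigma_{FT}) + \theta_{FT}^\top \Sigma_{PT}^{-1} \theta_{FT} + d\log d - d,$$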

and is the Hessian matrix of the loss function around the local minimum.

## 7 Discussion and future work

Large-scale pre-trained models, such as GPT-3 and BERT, enable a new industrial paradigm: pre-training a super model on large amounts of multi-modality data (sometimes of low quality) and then fine-tuning the learned model to smaller, specific application domains. This super-model paradigm could significantly reduce the application cost of machine learning, which is critical for the enormous number of small and medium-sized enterprises.

A major technique in this paradigm is domain adaptation, which enables the knowledge transfer between the two domains. We model a super-model paradigm as a two-stage diffusion process: (1) in the pre-training stage, the trajectory of stochastic gradient descent (SGD) or its variants searches the loss surface, driven by the Uhlenbeck-Ornstein process in the discrete view or by the Fokker-Planck equation in the continuous view. The model weight starts from a no-knowledge prior and converges to a Maxwell-Boltzmann distribution; and (2) in the fine-tuning stage, the trajectory of SGD is driven by a similar SDE, which starts from the model distribution learned in the pre-training stage and converges to another Maxwell-Boltzmann distribution. Based on the diffusion processes, an $\mathcal{O}(1/\sqrt{N})$ generalization bound is obtained via the PAC-Bayesian framework.

The generalization bounds suggest that the fine-tuning stage dominates the generalization of the whole paradigm. The generalization is determined by the domain discrepancy between the pre-training and fine-tuning domains, which is characterized by a new measure based on the covariance and domain shifts.

In this work, we make several assumptions and abstractions. This section discusses the limitations they introduce and gives several potential extensions.

• Model compression in the fine-tuning stage. In this paper, we ignore the model compression approaches in the fine-tuning stage, which are sometimes employed in practice. Popular model compression methods include model distillation, pruning, and quantization. The effects of model compression can be seen as operators on the loss surface and the learned model. A future direction is to mathematically characterize the influence of model compression. The results may be plug-and-play components to the presented theory in this paper.

• Gradient noise in SGD. In this paper, we assume that the gradient noise is drawn from a Gaussian distribution. Recent works have also modeled the gradient noise as a Lévy process, Laplacian noise, etc. The exact distribution of the gradient noise is still an open problem. In addition, the gradient noise is assumed to be state-independent, which can be easily extended to the state-dependent case. A future direction is to study the distribution of the gradient noise in SGD. It is worth noting that relatively little effort is needed to change the gradient noise assumptions in this paper.

• Advanced techniques in modeling SGD. In this paper, we model the trajectory of SGD via the Fokker-Planck equation and the Uhlenbeck-Ornstein process. This modeling ignores the influence of several techniques, such as momentum and adaptive learning rates. Recent works find that these techniques may implicitly regularize the learned model but would not have a decisive impact. A future direction is to model SGD as a more sophisticated stochastic differential equation.

• Distribution/data-dependent priors. Some works design priors that rely on the data-generating distribution but not directly on the training data. This is reasonable since we can assume the data distribution was fixed before the data was collected (Lever et al., 2013). Such distribution-dependent priors have been shown to tighten the generalization bounds considerably. Negrea et al. (2019) push the frontier further by constructing priors that are not independent of the data: one may use a subset of the training sample to design a prior that delivers a data-dependent forecast of the posterior. A future direction is to model SGD via distribution/data-dependent priors.

## 8 Proofs

This section presents the proofs for the given theory.

We model the iterative updates in SGD using a stochastic differential equation. This approach is also seen in the literature; see, e.g., E (2017); Mandt et al. (2017); Mou et al. (2018a); He et al. (2019); Meng et al. (2020); Cheng et al. (2020); Xie et al. (2020); Wang et al. (2021).

We first translate the updates in SGD into an Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930) under some mild assumptions. The Ornstein-Uhlenbeck process has a steady distribution, which is then employed to characterize the distribution of the learned hypothesis. We further obtain a generalization bound via the PAC-Bayesian framework by exploiting the stationary distribution, which characterizes the influence on the generalization of the distance between the output hypothesis distribution and its prior (McAllester, 1999a, b).
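The existence of a steady distribution can be checked empirically in one dimension. The sketch below is an illustration under stated assumptions, not the construction used in the proofs: it assumes the scalar process $d\theta = -a\theta\,dt + \sqrt{\eta/|S|}\,b\,dW$ implied by the analytic solution in eq. (30), integrates it with a plain Euler-Maruyama scheme, and compares the empirical variance with the scalar solution of eq. (29). All parameter values and function names are hypothetical.

```python
import math
import random


def stationary_variance(a, b, eta, batch_size):
    """Scalar form of eq. (29): 2 * a * sigma^2 = (eta / |S|) * b^2."""
    return eta * b * b / (2.0 * a * batch_size)


def simulate_ou_variance(a, b, eta, batch_size,
                         dt=0.05, steps=400_000, burn_in=2_000, seed=0):
    """Estimate the stationary variance of the scalar process by simulation.

    Uses Euler-Maruyama steps, so the estimate carries a small
    discretization bias that shrinks with dt.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(eta / batch_size) * b * math.sqrt(dt)
    theta = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(steps):
        theta += -a * theta * dt + noise_scale * rng.gauss(0.0, 1.0)
        if step >= burn_in:  # discard the transient before averaging
            total += theta
            total_sq += theta * theta
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


if __name__ == "__main__":
    params = dict(a=1.0, b=1.0, eta=0.1, batch_size=1)
    print("closed form:", stationary_variance(**params))
    print("simulated:  ", simulate_ou_variance(**params))
```

The simulated value should agree with the closed form up to sampling noise and the discretization bias of the Euler-Maruyama scheme.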

### 8.1 Proof of Theorem 1

The proof of Theorem 1 relies on the following lemma.

###### Lemma 3 (cf. Mandt et al. (2017), pp. 27-18, Appendix B).

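Under the second-order differentiable assumption (eq. 13), the stationary distribution of the Ornstein-Uhlenbeck process (eq. 18),

$$q(\theta) = M \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}, \tag{28}$$

has the following property,

$$A\Sigma_{PT} + \Sigma_{PT}A = \frac{\eta}{|S|} C. \tag{29}$$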

This lemma gives the analytic form of the steady distribution of the Ornstein-Uhlenbeck process. This lemma is from Mandt et al. (2017). Here, we recall the proof to make this paper complete.

###### Proof.

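From a result on the Ornstein-Uhlenbeck process (Gardiner and others, 1985), we know that the parameter $\theta$ has the following analytic solution,

$$\theta(t) = \theta(0)e^{-At} + \sqrt{\frac{\eta}{|S|}} \int_0^t e^{-A(t - t')} B \, dW(t'), \tag{30}$$

where $W(t')$ is a white noise and follows $\mathcal{N}(0, I)$. From eq. (28), we know that

$$\Sigma_{PT} = \mathbb{E}_{\theta \sim Q}[\theta\theta^\top]. \tag{31}$$

Therefore, we have the following equation,

$$\begin{aligned} A\Sigma_{PT} + \Sigma_{PT}A &= \frac{\eta}{|S|} \int_{-\infty}^{t} A e^{-A(t - t')} C e^{-A(t - t')} \, dt' + \frac{\eta}{|S|} \int_{-\infty}^{t} e^{-A(t - t')} C e^{-A(t - t')} \, dt' \, A \\ &= \frac{\eta}{|S|} \int_{-\infty}^{t} \frac{d}{dt'}\left(e^{-A(t - t')} C e^{-A(t - t')}\right) dt' \\ &= \frac{\eta}{|S|} C. \end{aligned} \tag{32}$$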

The proof is completed. ∎
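Eq. (29) can be verified numerically in a special case. The sketch below assumes, purely for illustration, that the Hessian $A$ is diagonal with positive entries, in which case the equation decouples entry by entry and $\Sigma_{ij} = \frac{\eta}{|S|} C_{ij} / (a_i + a_j)$. The matrices, hyperparameters, and function names are hypothetical.

```python
def stationary_covariance_diag(a_diag, c, eta, batch_size):
    """Solve A Sigma + Sigma A = (eta / |S|) C for A = diag(a_diag).

    a_diag : positive diagonal entries of A
    c      : symmetric matrix C as a list of rows
    """
    scale = eta / batch_size
    n = len(a_diag)
    return [[scale * c[i][j] / (a_diag[i] + a_diag[j]) for j in range(n)]
            for i in range(n)]


def lyapunov_residual(a_diag, sigma, c, eta, batch_size):
    """Largest absolute entry of A Sigma + Sigma A - (eta / |S|) C."""
    scale = eta / batch_size
    n = len(a_diag)
    return max(abs((a_diag[i] + a_diag[j]) * sigma[i][j] - scale * c[i][j])
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    a_diag = [1.0, 3.0]
    c = [[2.0, 0.5], [0.5, 1.0]]
    eta, batch_size = 0.1, 10

    sigma = stationary_covariance_diag(a_diag, c, eta, batch_size)
    print("Sigma:", sigma)
    print("residual:", lyapunov_residual(a_diag, sigma, c, eta, batch_size))

    # The trace of Sigma equals (eta / (2 |S|)) * tr(C A^{-1}).
    trace_sigma = sum(sigma[i][i] for i in range(len(a_diag)))
    trace_rhs = 0.5 * eta / batch_size * sum(c[i][i] / a_diag[i]
                                             for i in range(len(a_diag)))
    print("tr(Sigma):", trace_sigma, "vs", trace_rhs)
```

For a general, non-diagonal Hessian the equation no longer decouples and a dedicated Lyapunov solver would be needed.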

Then, we can prove Theorem 1. This proof is inspired by He et al. (2019). Here, we recall the proof to make this paper complete.

###### Proof of Theorem 1.

In the PAC-Bayesian framework (Lemma 1), an essential part is the KL divergence between the distribution of the learned hypothesis and the prior on the hypothesis space. The prior distribution can be interpreted as the distribution of the initial parameters, which are usually set according to Gaussian or uniform distributions.¹

¹ Usually, when there is no confident prior knowledge of the latent model parameters, the prior should be set as a distribution with no information, such as a Gaussian or uniform distribution. This setting comes from two considerations: (1) once algorithms based on Bayesian statistics can converge, after a long enough time and with enough data, they always converge to the stationary distribution. This is guaranteed by the assumption that the stationary solution of the latent stochastic differential equation exists and is unique; (2) the prior should be set very carefully, as we cannot assume any knowledge of the target hypothesis function before training the model.

Here, we use a standard Gaussian distribution as the prior. Suppose the densities of the stationary distribution and the prior distribution are respectively $q_{PT}(\theta)$ and $p(\theta)$ in terms of the parameter $\theta$, as in the following equations,

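$$p(\theta) = \frac{1}{\sqrt{2\pi\det(I)}} \exp\left\{-\frac{1}{2}\theta^\top I \theta\right\}, \tag{33}$$

$$q_{PT}(\theta) = \frac{1}{\sqrt{2\pi\det(\Sigma_{PT})}} \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}, \tag{34}$$

where eq. (34) comes from eq. (28) by calculating the normalizer $M$.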

Therefore,

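$$\begin{aligned} \log\left(\frac{q_{PT}(\theta)}{p(\theta)}\right) &= \log\left(\frac{\sqrt{2\pi\det(I)}}{\sqrt{2\pi\det(\Sigma_{PT})}} \exp\left\{\frac{1}{2}\theta^\top I \theta - \frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}\right) \\ &= \frac{1}{2}\log\left(\frac{1}{\det(\Sigma_{PT})}\right) + \frac{1}{2}\left(\theta^\top I \theta - \theta^\top \Sigma_{PT}^{-1} \theta\right). \end{aligned} \tag{35}$$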

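Applying eq. (35) to eq. (23), we can calculate the KL divergence between the distributions $Q_{PT}$ and $P$ (we assume $\Theta = \mathbb{R}^d$):

$$\begin{aligned} D(Q_{PT}\|P) &= \mathbb{E}_{\theta \sim Q_{PT}}\left(\log\frac{Q_{PT}(\theta)}{P(\theta)}\right) \\ &= \int_{\theta \in \Theta} \log\left(\frac{q_{PT}(\theta)}{p(\theta)}\right) q_{PT}(\theta) \, d\theta \\ &= \int_{\theta \in \Theta} \left[\frac{1}{2}\log\left(\frac{1}{\det(\Sigma_{PT})}\right) + \frac{1}{2}\left(\theta^\top I \theta - \theta^\top \Sigma_{PT}^{-1} \theta\right)\right] q_{PT}(\theta) \, d\theta \\ &= \frac{1}{2}\log\left(\frac{1}{\det(\Sigma_{PT})}\right) + \frac{1}{2}\int_{\theta \in \Theta} \theta^\top I \theta \, q_{PT}(\theta) \, d\theta - \frac{1}{2}\int_{\theta \in \Theta} \theta^\top \Sigma_{PT}^{-1} \theta \, q_{PT}(\theta) \, d\theta \\ &= \frac{1}{2}\log\left(\frac{1}{\det(\Sigma_{PT})}\right) + \frac{1}{2}\mathbb{E}_{\theta \sim \mathcal{N}(0, \Sigma_{PT})} \theta^\top I \theta - \frac{1}{2}\mathbb{E}_{\theta \sim \mathcal{N}(0, \Sigma_{PT})} \theta^\top \Sigma_{PT}^{-1} \theta \\ &= \frac{1}{2}\log\left(\frac{1}{\det(\Sigma_{PT})}\right) + \frac{1}{2}\operatorname{tr}(\Sigma_{PT} - I). \end{aligned} \tag{36}$$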

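From eq. (29), we have that

$$A_{PT}\Sigma_{PT} + \Sigma_{PT}A_{PT} = \frac{\eta_{PT}}{|S_{PT}|} C. \tag{37}$$

Therefore,

$$A_{PT}\Sigma_{PT}A_{PT}^{-1} + \Sigma_{PT} = \frac{\eta_{PT}}{|S_{PT}|} C A_{PT}^{-1}. \tag{38}$$

Taking the trace of both sides, we have the following equation,

$$\operatorname{tr}\left(A_{PT}\Sigma_{PT}A_{PT}^{-1} + \Sigma_{PT}\right) = \operatorname{tr}\left(\frac{\eta_{PT}}{|S_{PT}|} C A_{PT}^{-1}\right). \tag{39}$$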

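By the linearity and the cyclic property of the trace, the left-hand side (LHS) is as follows,

$$\begin{aligned} \text{LHS} &= \operatorname{tr}\left(A_{PT}\Sigma_{PT}A_{PT}^{-1} + \Sigma_{PT}\right) \\ &= \operatorname{tr}\left(A_{PT}\Sigma_{PT}A_{PT}^{-1}\right) + \operatorname{tr}(\Sigma_{PT}) \\ &= \operatorname{tr}\left(\Sigma_{PT}A_{PT}^{-1}A_{PT}\right) + \operatorname{tr}(\Sigma_{PT}) \\ &= \operatorname{tr}(\Sigma_{PT}) + \operatorname{tr}(\Sigma_{PT}) \\ &= 2\operatorname{tr}(\Sigma_{PT}). \end{aligned} \tag{40}$$

Therefore,

$$\operatorname{tr}(\Sigma_{PT}) = \frac{1}{2}\operatorname{tr}\left(\frac{\eta_{PT}}{|S_{PT}|} C A_{PT}^{-1}\right) = \frac{1}{2}\frac{\eta_{PT}}{|S_{PT}|}\operatorname{tr}\left(C A_{PT}^{-1}\right). \tag{41}$$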

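At the same time, we can easily calculate that

$$\operatorname{tr}(I) = d, \tag{42}$$

as $I \in \mathbb{R}^{d \times d}$, where $d$ is the dimension of the parameter $\theta$.

Inserting eqs. (41) and (42) into eq. (36), we get the following inequality,

$$D(Q_{PT}\|P) \le \frac{1}{4}\frac{\eta_{PT}}{|S_{PT}|}\operatorname{tr}\left(C A_{PT}^{-1}\right) - \frac{1}{2}\log(\det(\Sigma_{PT})) - \frac{1}{2}d. \tag{43}$$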

Eq. (43) gives an upper bound for the distance (measured by the KL divergence) between the stationary distribution of the weights output by SGD and the prior on the hypothesis space. Since the generalization bound is monotone in the KL divergence, we can further obtain a PAC-Bayesian generalization bound for SGD by inserting the KL divergence bound (eq. 43) into the PAC-Bayesian framework (eq. (22) of Lemma 1).

The proof is completed. ∎

### 8.2 Proof of Theorem 2

This section proves Theorem 2. The proof is similar to that of Theorem 1.

###### Proof of Theorem 2.

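Similarly, the densities of the prior distribution and of the distribution of the learned hypothesis are respectively $q_{PT}(\theta)$ and $q_{FT}(\theta)$ in terms of the parameter $\theta$, as in the following equations,

$$q_{PT}(\theta) = \frac{1}{\sqrt{2\pi\det(\Sigma_{PT})}} \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta\right\}, \tag{44}$$

$$q_{FT}(\theta) = \frac{1}{\sqrt{2\pi\det(\Sigma_{FT})}} \exp\left\{-\frac{1}{2}\theta^\top \Sigma_{FT}^{-1} \theta\right\}, \tag{45}$$

where eq. (45) comes from calculating the normalizer $M_{FT}$.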

Therefore,

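$$\begin{aligned} \log\left(\frac{q_{FT}(\theta)}{q_{PT}(\theta)}\right) &= \log\left(\frac{\sqrt{2\pi\det(\Sigma_{PT})}}{\sqrt{2\pi\det(\Sigma_{FT})}} \exp\left\{\frac{1}{2}\theta^\top \Sigma_{PT}^{-1} \theta - \frac{1}{2}\theta^\top \Sigma_{FT}^{-1} \theta\right\}\right) \\ &= \frac{1}{2}\log\left(\frac{\det(\Sigma_{PT})}{\det(\Sigma_{FT})}\right) + \frac{1}{2}\left(\theta^\top \Sigma_{PT}^{-1} \theta - \theta^\top \Sigma_{FT}^{-1} \theta\right). \end{aligned} \tag{46}$$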

Then, the KL divergence between the distributions $Q_{FT}$ and $Q_{PT}$ is as follows (we assume $\Theta = \mathbb{R}^d$):

 D(QFT||QPT) = Eθ∼QFT(logQFT(θ)QPT(θ)) = ∫θ∈Θlog(qFT(θ)qPT(θ))qFT(θ)dθ = ∫θ∈Θ[12log(det(ΣPT)det(ΣFT))+12(θ⊤Σ−1PTθ−θ⊤Σ−1FTθ)]q(θ)%dθ = 12log<