1 Introduction
Large-scale pretrained models (or foundation models) (Han et al., 2021; Chen et al., 2021), including GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018), enable a new paradigm in machine learning: pretraining a large-scale model on a very large-scale dataset and then transferring the learned model to an unseen domain. This paradigm was first introduced in natural language processing and recently extended to computer vision. It sheds light on a higher level of automation and is establishing a new paradigm with clear advantages. In this paradigm, a super model learns meta-knowledge from large amounts of data and reduces the learning cost in specific domains. This considerably reduces the computational and data cost of applying machine learning in many specific applications, and is thus of significant value to enormous numbers of small and medium-sized enterprises. Additionally, the super-model paradigm enables better management of the geographic location of machine learning workloads and the datacenter infrastructure, which has been shown to significantly reduce carbon emissions (Patterson et al., 2021).
Technically, domain adaptation plays a vital role in the knowledge transfer of the super-model paradigm. Usually, the data in a target domain is much smaller than that in the source domain. In light of this, an appropriate understanding of the generalizability of the transferred model on the target domain is of high importance.
In this paper, we prove an upper bound for the generalization error (a generalization bound) of domain adaptation algorithms. The generalization error is defined as the difference between the expected risk and the empirical risk. Intuitively, a larger generalization bound indicates that the generalization error is possibly larger and thus suggests worse generalizability.
We model the super-model paradigm as a two-stage diffusion process. In the first stage, a stochastic gradient-based optimizer, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), or Adam (Kingma and Ba, 2014), learns a pretrained model on the source-domain data via empirical risk minimization, where the empirical risk $\hat{\mathcal R}_S(\theta)$ of the model parameterized by $\theta$ on the training sample $S$ is defined to be

(1) $\hat{\mathcal R}_S(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i),$

where $N$ is the training sample size and $\ell$ is the loss function. The convergent parameter initializes the finetuning process on the target domain in the second stage. We model the parameter trajectory of SGD by a stochastic process, the Uhlenbeck-Ornstein process (Uhlenbeck and Ornstein, 1930), as follows,

(2) $\mathrm{d}\theta_t = -\nabla \hat{\mathcal R}_S(\theta_t)\,\mathrm{d}t + \sqrt{\frac{\eta}{b}}\, C^{\frac{1}{2}}\,\mathrm{d}W_t,$

where $\eta$ is the learning rate, $b$ is the batch size, $W_t$ is a standard Wiener process, and $C$ is a positive-definite matrix which characterizes the covariance of the gradient noise. This can also be smoothed into a diffusion equation, the Fokker-Planck equation. Correspondingly, the trajectories can be modeled by the dynamics of the Fokker-Planck equation. Further, the steady distributions of the Uhlenbeck-Ornstein equations characterize the distributions of the learned models.
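As a sanity check on this modeling choice, the following sketch (all constants are ours, not from the text) simulates SGD on a one-dimensional quadratic loss and compares its long-run variance with the stationary variance $\eta c / (2ab)$ predicted by the continuous-time Uhlenbeck-Ornstein limit:

```python
import numpy as np

# Sketch: SGD near a quadratic minimum R(theta) = a * theta^2 / 2 behaves like a
# discretized Uhlenbeck-Ornstein process.  Its stationary variance should be close
# to eta * c / (2 * a * b): learning rate eta, gradient-noise covariance c, batch b.
rng = np.random.default_rng(0)

a, c = 1.0, 1.0          # curvature (Hessian) and gradient-noise covariance
eta, b = 0.01, 1         # learning rate and batch size
n_chains, n_steps = 20_000, 2_000

theta = np.zeros(n_chains)                  # an ensemble of independent chains
for _ in range(n_steps):
    noise = rng.normal(0.0, np.sqrt(c / b), size=n_chains)
    theta = theta - eta * (a * theta + noise)   # SGD step with noisy gradient

empirical_var = theta.var()
predicted_var = eta * c / (2 * a * b)           # stationary variance of the OU limit
print(empirical_var, predicted_var)             # the two should agree closely
```

The small residual gap between the two numbers is the $O(\eta)$ discretization error of the continuous-time approximation.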
Deep learning can be formulated as solving a non-convex optimization problem: the loss surfaces of neural networks are usually highly non-convex due to the complexity of neural network architectures. In general, solving a non-convex optimization problem is NP-hard. However, numerous experiments show that deep learning has excellent optimization performance. This mystery is partially addressed by empirical findings on the local convexity and smoothness of the loss surfaces of deep neural networks: the loss surface around a convergent local minimum is second-order smooth, as shown by Li et al. (2018).
This empirical finding inspires us to model the loss surface around the convergent local minimum as a quadratic function. This assumption determines the derivatives and boundary conditions of the Fokker-Planck equation. Moreover, the model parameter is usually initialized by following a Gaussian distribution. Based on these, the Fokker-Planck equation has a steady distribution in the form of a Maxwell-Boltzmann distribution, which governs the distribution of the model learned by SGD. During the pretraining stage, SGD converges to a Maxwell-Boltzmann distribution around the local minimum, given below,

(3) $q_1(\theta) = \frac{1}{Z_1} \exp\left(-\frac{1}{2}\,\theta^\top \Sigma_1^{-1}\,\theta\right),$

where $Z_1$ is the normalizer and $\Sigma_1$ is the covariance.
This distribution is then used as the initial distribution in the finetuning stage. Subsequently, SGD in the finetuning stage learns the mapping from this initialization to a new Maxwell-Boltzmann distribution centered at the new local minimum on the loss surface of the target domain, as follows,

(4) $q_2(\theta) = \frac{1}{Z_2} \exp\left(-\frac{1}{2}\,(\theta - \mu)^\top \Sigma_2^{-1}\,(\theta - \mu)\right),$

where $Z_2$ is the normalizer, $\mu$ is the distribution shift, and $\Sigma_2$ is the covariance.
Based on the diffusion processes, we then establish PAC-Bayesian generalization bounds for the learned super model on the source domain and the transferred model on the target domain. The PAC-Bayesian framework (McAllester, 1999a, b) upper-bounds the generalization error of a stochastic algorithm via the distance between the initial distribution and the distribution of the learned hypothesis, usually measured by information-theoretic distances such as the KL divergence. Intuitively, the PAC-Bayesian theory suggests that training a very large model from a no-knowledge prior, such as a Gaussian distribution or a uniform distribution, needs a very large amount of data to secure the generalizability; if the initialization is instead near the distribution of the learned hypothesis, the needed sample complexity can be much smaller. However, a high-quality prior is not accessible in practice. This significantly limits the model size, particularly in low-resource scenarios. This is the key motivation of the super-model paradigm: (1) training a super model on a very large-scale dataset, in order to learn a high-quality model from the no-knowledge prior; and (2) using the learned super model as a high-quality prior in downstream applications, in order to reduce the needed training data and support a larger model size.
In this paper, the generalization bound in pretraining is established based on the KL divergence between the Maxwell-Boltzmann distribution learned in pretraining and the no-knowledge prior $P$, as below,

(5) $\mathcal R(Q_1) \le \hat{\mathcal R}_{S_1}(Q_1) + \sqrt{\frac{\mathrm{KL}(Q_1 \,\|\, P) + \log N_1 + \log\frac{1}{\delta} + 2}{2 N_1 - 1}},$

where $\mathcal R$ is the expected risk, $\hat{\mathcal R}_{S_1}$ is the empirical risk, $Q_1 = \mathcal N(0, \Sigma_1)$ is the distribution of the learned hypothesis with covariance $\Sigma_1$, and $N_1$ is the training sample size in the pretraining.
Meanwhile, the generalization bound in finetuning is established based on the KL divergence between the two Maxwell-Boltzmann distributions in pretraining and finetuning, as follows,

(6) $\mathcal R_T(Q_2) \le \hat{\mathcal R}_{S_2}(Q_2) + \sqrt{\frac{\mathrm{KL}(Q_2 \,\|\, Q_1) + \log N_2 + \log\frac{1}{\delta} + 2}{2 N_2 - 1}},$

where $\mathcal R_T$ is the expected risk on the target domain, $\hat{\mathcal R}_{S_2}$ is the empirical risk, $Q_2 = \mathcal N(\mu, \Sigma_2)$ is the distribution of the learned hypothesis with covariance $\Sigma_2$, $\mu$ is the shift of the distribution center, and $N_2$ is the training sample size in the finetuning.
We further define two new notions to measure the domain discrepancy as follows,

$\Delta(Q_1, Q_2) = \mathrm{KL}(Q_2 \,\|\, Q_1)$

and

$\Delta_d(Q_1, Q_2) = \mathrm{tr}\left(\Sigma_1^{-1}\Sigma_2\right) + \mu^\top \Sigma_1^{-1}\mu - d + \log\frac{\det \Sigma_1}{\det \Sigma_2},$

where $d$ is the parameter size. These two notions measure the magnitude of the domain shift based on the learned hypotheses on the source and target domains.
Our theory has the following implications:
(1) The very large-scale datasets employed in the pretraining stage help secure obtaining a high-quality model from a no-knowledge prior. The learned model carries the knowledge learned from the training data in the pretraining stage. The distribution of the pretrained model serves as a high-quality initialization for the downstream finetuning stage.
(2) Comparing the generalization bounds of the pretraining stage and the finetuning stage, we show that the generalization error of the finetuning stage is dominant in the super-model paradigm, because of the dominantly large size of the training data in the pretraining stage. This finding supports the feasibility and efficiency of the super-model paradigm.
(3) Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shift. Generally, larger domain shifts lead to worse generalization on the target domain. This supports the heuristic in practice that the performance on the target domain is limited by the domain shift.
It is worth noting that the super-model paradigm also supports model compression in model deployment, including pruning, quantization, and model distillation. The influence of model compression methods can be directly plugged into our theory.
2 Background
This section reviews the related work, including super model, domain adaptation, generalization, and deep learning theory.
Domain adaptation. Domain adaptation algorithms transfer knowledge from one domain to another, and enable the super-model paradigm. Domain adaptation has three main streams:
(1) Discrepancy-based domain adaptation modifies the loss to narrow the discrepancy between the features from the source domain and those from the target domain. Tzeng et al. (2014) introduce a fully-connected adaptation layer into a CNN in order to minimize the maximum mean discrepancy (MMD) between the features from the two domains, $\mathrm{MMD}(X_s, X_t) = \big\| \frac{1}{|X_s|} \sum_{x_s \in X_s} \phi(x_s) - \frac{1}{|X_t|} \sum_{x_t \in X_t} \phi(x_t) \big\|,$ where $\phi$ is the feature mapping and $X_s$ and $X_t$ are the source and target examples.
Long et al. (2015) employ multiple adaptation layers. Long et al. (2015, 2016) introduce residual blocks into the classifiers of the source domain. Long et al. (2017) further consider the discrepancy of the joint distribution
rather than the marginal distribution;
(2) Adversarial-based domain adaptation maps the source domain and the target domain to a common space, inspired by generative adversarial networks (GANs): if a classifier can hardly separate examples of the source domain from those of the target domain, the feature extractor has narrowed the gap between the two domains. Ganin and Lempitsky (2015) propose gradient reversal layers that reverse the gradients generated by the domain classifier during backpropagation. Zhang et al. (2018) argue that the features of the bottom layers contain more domain information, while those of the top layers contain less. They further employ collaborative learning to learn domain-informative features in the bottom layers, and adopt adversarial learning to learn domain-uninformative features in the top layers; and
(3) Reconstruction-based domain adaptation reconstructs the features extracted from the source domain in the target domain. Ghifary et al. (2016) reconstruct examples from the target domain via features learned from the source-domain classification task. Bousmalis et al. (2016) reconstruct the inputs via both private and shared representations of the source and target domains.
Generalization. Good generalization guarantees that an algorithm learns the underlying patterns in the training data rather than just memorizing the data. In this way, good generalization provides confidence that models trained on existing data can be applied to similar but unseen scenarios. Three major approaches to analyzing generalizability are seen in the literature: (1) generalization bounds based on hypothesis complexity, including VC dimension (Blumer et al., 1989; Vapnik, 2006), Rademacher complexity (Koltchinskii and Panchenko, 2000; Koltchinskii, 2001; Bartlett and Mendelson, 2002), and covering number (Dudley, 1967; Haussler, 1995). The results are usually obtained via concentration inequalities. They also suggest controlling the model size to secure generalizability, which is no longer valid in deep learning; (2) generalization bounds based on algorithmic stability (Rogers and Wagner, 1978; Bousquet and Elisseeff, 2002; Xu et al., 2011). The results in this stream follow the motivation that learning algorithms robust to small disturbances in the input data usually have good generalizability; and (3) generalization bounds in the PAC-Bayes framework (McAllester, 1999a, b). The results are obtained based on information-theoretic versions of concentration inequalities.
Deep learning theory. Deep learning has been deployed successfully in many real-world scenarios. However, the theoretical foundations of deep learning are still elusive. For example, there is no full explanation of how deep learning algorithms work, why they can succeed, when they would fail, and whether they would hurt society. Such a deficiency in explainability questions the transparency and accountability of deep learning, and further undermines our confidence in deploying deep learning in security-critical application domains, such as medical diagnosis (Kulikowski, 1980; Silver et al., 2016) and drug discovery (Chen et al., 2018a). Many works have emerged to establish the theoretical foundations of deep learning via VC dimension (Harvey et al., 2017), Rademacher complexity (Golowich et al., 2018; Bartlett et al., 2017), covering number (Bartlett et al., 2017), Fisher-Rao norm (Liang et al., 2019; Tu et al., 2020), the PAC-Bayesian framework (Neyshabur et al., 2017), algorithmic stability (Hardt et al., 2016; Kuzborskij and Lampert, 2018; Verma and Zhang, 2019), and the dynamics of stochastic gradient descent or its variants (Mandt et al., 2017; Mou et al., 2018b; He et al., 2019). Please see the surveys (E et al., 2020; He and Tao, 2020; Poggio et al., 2020) for more related works. This work is committed to establishing theoretical foundations of privacy, generalization, and adversarial robustness in deep learning, all of which have profound importance in enhancing the explainability, transparency, and accountability of deep models.
Generalization of SGD. Several generalization bounds for algorithms trained by SGD have been proposed. Mou et al. (2018a) analyze the generalization of stochastic gradient Langevin dynamics (SGLD) and prove upper bounds for the generalization error via algorithmic stability and PAC-Bayesian theory, respectively. Pensia et al. (2018) analyze the generalizability of noisy and iterative machine learning algorithms; a generalization bound is proved in terms of the mutual information between the output hypothesis and the input data, and generalization bounds for SGLD are derived as examples. Chen et al. (2018b) prove that the convergence and stability of iterative machine learning algorithms have a trade-off under both the convex-smooth assumption and the strongly-convex-smooth assumption; under the same assumptions, they also prove a generalization bound for SGD. Liu et al. (2017) prove a generalization bound for SGD when the loss function is Lipschitz continuous and smooth. London (2017) proves a generalization bound for SGD based on the KL divergence between the prior and the posterior under the PAC-Bayes framework. He et al. (2019) present a PAC-Bayes generalization bound for SGD based on stochastic differential equations, in which the gradient noise is modeled by a Gaussian distribution. Meng et al. (2020) extend the gradient noise to be state-dependent, and Cheng et al. (2020) extend the gradient noise to a Lévy process.
3 Notations and preliminaries
Suppose the training dataset is $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal X \subset \mathbb R^{d_X}$, $y_i \in \mathcal Y \subset \mathbb R^{d_Y}$, $d_X$ is the dimension of the feature $x$, and $d_Y$ is the dimension of the label $y$. Suppose the $x_i$ and $y_i$ are independent and identically distributed (i.i.d.) observations of the variables $X$ and $Y$, respectively. We also rewrite $z_i = (x_i, y_i)$, which is an i.i.d. observation of the random variable $Z = (X, Y)$. Denote the generating distribution of $Z$ by $\mathcal D$.

Formally, machine learning algorithms are designed to select the hypothesis function $f_\theta$ with the lowest expected risk under the loss function $\ell$ from a hypothesis class $\mathcal H = \{f_\theta : \theta \in \Theta \subset \mathbb R^d\}$, where $\theta$ is the parameter of the hypothesis and $d$ is the dimension of the parameter $\theta$. For many stochastic algorithms, such as SGD, we usually use a distribution to express the output parameter. Suppose the parameter $\theta$ follows a distribution $Q$; the expected risks respectively in terms of $\theta$ and $Q$ are defined as:

(7) $\mathcal R(\theta) = \mathbb E_{(x, y) \sim \mathcal D}\, \ell(f_\theta(x), y),$
(8) $\mathcal R(Q) = \mathbb E_{\theta \sim Q}\, \mathcal R(\theta).$

However, the expected risk is not available from the data, since we do not know the formulation of the latent distribution $\mathcal D$ of the data. In practice, we use the empirical risk to estimate the expected risk, which is defined as:

(9) $\hat{\mathcal R}_S(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i),$
(10) $\hat{\mathcal R}_S(Q) = \mathbb E_{\theta \sim Q}\, \hat{\mathcal R}_S(\theta),$

where the $z_i = (x_i, y_i)$ constitute the training sample $S$.
Learning algorithms usually solve the following empirical risk minimization (ERM) problem to approach the optimal hypothesis, $\min_{\theta \in \Theta} \hat{\mathcal R}_S(\theta).$
We usually employ stochastic gradient-based optimizers for ERM in deep learning. Popular stochastic gradient-based optimizers include stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), and Adam (Kingma and Ba, 2014). For brevity, we analyze SGD in this paper; the analysis for other stochastic gradient-based optimizers is similar.
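As a concrete illustration of ERM with a stochastic gradient-based optimizer, here is a minimal mini-batch SGD loop on a least-squares toy problem (all names and constants are ours, not from the text):

```python
import numpy as np

# Minimal mini-batch SGD for ERM on least squares: a toy stand-in for the
# stochastic gradient-based optimizers discussed above.
rng = np.random.default_rng(0)

N, d = 512, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)   # noisy linear targets

def empirical_risk(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta = np.zeros(d)
eta, batch = 0.05, 32
risk_start = empirical_risk(theta)
for _ in range(500):
    idx = rng.choice(N, size=batch, replace=False)       # draw a mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch  # stochastic gradient
    theta -= eta * grad                                  # SGD update
risk_end = empirical_risk(theta)
print(risk_start, risk_end)   # the empirical risk should drop toward the noise floor
```

The loop never touches the full gradient; the mini-batch gradient is an unbiased estimate of it, which is exactly the structure the diffusion analysis later exploits.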
Suppose $B$ is a mini-batch randomly drawn from the training sample set $S$. Then, the stochastic gradient on $B$ is as follows, $\hat g_B(\theta) = \frac{1}{|B|} \sum_{z_i \in B} \nabla_\theta\, \ell(f_\theta(x_i), y_i).$ In the $t$-th iteration, the weight is updated as follows, $\theta_{t+1} = \theta_t - \eta_t\, \hat g_B(\theta_t),$ where $\theta_t$ is the weight vector in the $t$-th iteration and $\eta_t$ is the corresponding learning rate.
Meanwhile, adversarial training employs SGD to solve the following minimax problem,

(11) $\min_{\theta} \frac{1}{N} \sum_{i=1}^N \max_{\|x_i' - x_i\| \le \rho} \ell(f_\theta(x_i'), y_i),$

where $\rho$ is the radius of the ball centered at the example $x_i$. Here, we call the objective the adversarial empirical risk. Correspondingly, the stochastic gradient on a mini-batch $B$ and the weight update are calculated as below,

(12) $\hat g_B^{\mathrm{adv}}(\theta) = \frac{1}{|B|} \sum_{z_i \in B} \nabla_\theta \max_{\|x_i' - x_i\| \le \rho} \ell(f_\theta(x_i'), y_i), \qquad \theta_{t+1} = \theta_t - \eta_t\, \hat g_B^{\mathrm{adv}}(\theta_t).$
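The inner maximization of the adversarial minimax problem can be sketched, for a toy linear model of ours, by projected gradient ascent within the $\rho$-ball (a common heuristic for the inner max; the model, step size, and radius are all assumptions of this sketch):

```python
import numpy as np

# Inner maximization of eq. (11) by projected gradient ascent on the per-example
# loss within an L2 ball of radius rho, for a toy squared loss on a linear model.
rng = np.random.default_rng(0)

theta = rng.normal(size=3)         # fixed model parameters
x, y = rng.normal(size=3), 1.0     # one clean example
rho, step = 0.5, 0.1               # ball radius and ascent step size

def loss(theta, x, y):
    return 0.5 * (x @ theta - y) ** 2

x_adv = x.copy()
for _ in range(20):
    grad_x = (x_adv @ theta - y) * theta      # gradient of the loss w.r.t. x
    x_adv = x_adv + step * grad_x             # ascent step on the example
    delta = x_adv - x
    norm = np.linalg.norm(delta)
    if norm > rho:                            # project back onto the rho-ball
        x_adv = x + delta * (rho / norm)

print(loss(theta, x, y), loss(theta, x_adv, y))  # adversarial loss is no smaller
```

The adversarial gradient in eq. (12) is then simply the model gradient evaluated at `x_adv` instead of `x`.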
Definition 1 (KL divergence; cf. Kullback and Leibler (1951)).
Suppose two distributions $Q$ and $P$ with densities $q$ and $p$ are defined on the same support. Then the KL divergence between $Q$ and $P$ is defined as $\mathrm{KL}(Q \,\|\, P) = \mathbb E_{\theta \sim Q} \log \frac{q(\theta)}{p(\theta)}.$
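A quick numeric illustration of the definition (the toy distributions are ours): the KL divergence is non-negative and vanishes when the two distributions coincide.

```python
import numpy as np

# Definition 1 evaluated on two small discrete distributions:
# KL(Q || P) = sum_i q_i * log(q_i / p_i).
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # positive: Q differs from P
kl_qq = np.sum(q * np.log(q / q))   # zero: a distribution against itself
print(kl_qp, kl_qq)
```

Note also that the divergence is asymmetric: `KL(Q || P)` and `KL(P || Q)` generally differ, which is why the PAC-Bayesian bounds below fix the order (posterior against prior).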
To avoid technicalities, measurability/integrability issues are ignored throughout this paper. Moreover, Fubini's theorem is assumed to be applicable to any integration with respect to multiple variables, so that the order of integrations is exchangeable. Also, we assume the stable (stationary) solutions of all stochastic differential equations involved exist and are unique.
4 Supermodel paradigm
A supreme industrial paradigm has been emerging: (1) pretraining a large-scale model on large amounts of multi-modality data, such as GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018); and (2) finetuning the obtained model on a smaller specific domain where data is relatively difficult to access. In this paper, we name it the super-model paradigm. The super-model paradigm enables efficient and effective knowledge discovery in low-resource application scenarios, including few-shot learning (Snell et al., 2017; Sung et al., 2018) and zero-shot learning (Romera-Paredes and Torr, 2015). A key cornerstone technology therein is domain adaptation. This section describes this paradigm.
Large-scale pretrained models. Recent advances are seen mainly in natural language processing (NLP), particularly after the appearance of the transformer (Vaswani et al., 2017). ELMo (Peters et al., 2018) finds that the word embedding in NLP is not invariant across application domains, but changes considerably with context. Based on this observation, ELMo pretrains a large-scale bidirectional LSTM on a large text corpus to generate word vectors via finetuning. BERT (Devlin et al., 2018) employs the transformer encoder for detecting bidirectional information in the context. Meanwhile, Liu et al. (2018) employ the transformer decoder for word embedding with finetuning, in order to realize wider attention. GPT (Radford et al., 2018) also employs the transformer decoder but is finetuned on each specific task for better performance. Extended from GPT, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) construct huge models in order to realize zero-shot learning. A comparison between these "super models" is presented in the following table.
SM      Architecture          Params
ELMo    BiLSTM                --
BERT    Transformer encoder   110M / 340M
GPT     Transformer decoder   117M
GPT-2   Transformer decoder   117M--1,542M
GPT-3   Transformer decoder   175B
Pretraining stage. The first step of the super-model paradigm is pretraining a super model on large-scale data, sometimes of multiple modalities. The model learned in this stage is of such high quality that both the approximation and the generalization of the output hypothesis are usually excellent, which suggests that the learned model has stored rich general knowledge. This makes it possible to apply the learned model to smaller specific application domains.
Finetuning stage. The learned model is then finetuned on the target domain, usually a smaller specific domain. The stored general knowledge is thereby transferred to the target domain. In this way, the super-model paradigm considerably reduces the cost of knowledge discovery in the target domain.
Theoretical advantages. According to the PAC-Bayesian theory, the generalizability of the learned model is determined by the distance between the posterior and the prior. As we will show in the next two sections, the very large-scale training data in the pretraining stage secures learning a high-quality model from a no-knowledge prior. The learned knowledge is of high value but consumes enormous resources, which are not accessible to many potential machine learning users, particularly small and medium-sized enterprises. In the super-model paradigm, the high-quality model learned in the pretraining stage is employed as the initialization of the finetuning stage. In this way, the needed sample complexity in the finetuning stage is significantly reduced.
Industrial values. Machine learning has been thriving in a wide range of areas. However, its industrial applications are still limited. This is partially caused by the high cost of computing facilities and data annotation. The super-model paradigm significantly reduces the cost of machine learning applications, which is particularly important for small and medium-sized enterprises.
Climate value. The super-model paradigm enables recycling discovered general knowledge across enormous application domains, which helps significantly reduce carbon emissions. Meanwhile, the super-model paradigm centralizes the model training process, which can help manage the geographic location and the datacenter infrastructure in order to reduce carbon emissions, as a recent work suggests (Patterson et al., 2021):
(1) The geographic location of a machine learning workload can cause its carbon emission to vary by around 5X to 10X, even when the country and the organization remain the same.
(2) Cloud data centers can be around 1.4X-2X more energy-efficient than typical data centers; meanwhile, machine-learning-oriented accelerators can be 2X-5X more effective than off-the-shelf hardware.
The super-model paradigm can thus reduce the carbon footprint of machine learning applications and further contribute to slowing down the climate crisis.
5 Diffusion processes in supermodel paradigm
We consider a diffusion process-based model that serves as an envelope for domain adaptation methods. Two diffusion processes are designed to model the pretraining and finetuning stages, respectively. The knowledge transition can then be modeled via the transition between the diffusion processes.
5.1 Diffusion process in pretraining
In pretraining, SGD explores the loss surface for a decent local minimum. Compared with gradient descent, SGD introduces gradient noise into the gradient and thereby into the weight. The noise acts as an implicit regularizer that controls the hypothesis complexity of the learned model. In this section, we employ a stochastic differential equation to characterize the trajectory of SGD.
We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as stated in the following assumption.
Assumption 1.
Suppose that the empirical risk around the optimum can be written as the following equation,

(13) $\hat{\mathcal R}_S(\theta) = \frac{1}{2}\,\theta^\top A\,\theta,$

where $A$ is the Hessian matrix around the minimum and is a (semi) positive-definite matrix.
Remark 1.
This assumption implicitly assumes that the converged local minimum is at the zero point. This does not affect the generality, thanks to translation invariance: suppose the converged local minimum is at $\theta^\star$; we may translate the parameter space so that the converged local minimum moves to zero.
Remark 3.
The covariance matrix $C$ characterizes the fluctuation introduced by the mini-batches into the gradient estimation. A recent intuition for the advantage of SGD is that the noise it introduces into the gradient helps it jump out of bad local minima.
The loss and gradient calculated on a mini-batch $B$ are unbiased estimators of the empirical risk $\hat{\mathcal R}_S(\theta)$ and the full gradient $\nabla \hat{\mathcal R}_S(\theta)$, as follows,

(14) $\mathbb E_B\, \hat{\mathcal R}_B(\theta) = \hat{\mathcal R}_S(\theta),$
(15) $\mathbb E_B\, \hat g_B(\theta) = \nabla \hat{\mathcal R}_S(\theta),$

where the expectations are taken with respect to the randomly drawn mini-batch $B$.
The fluctuations introduced by the mini-batches are modeled by Gaussian distributions centered at zero. Specifically, we assume that

(16) $\hat g_B(\theta) - \nabla \hat{\mathcal R}_S(\theta) \sim \mathcal N\left(0, \frac{1}{b}\, C(\theta)\right),$

where $C(\theta)$ is the covariance matrix, assumed to be a constant matrix $C$ for all $\theta$, and $b = |B|$ is the batch size. This Gaussian assumption is also employed by E (2017) and Mandt et al. (2017). Therefore, we further have the following estimation,

(17) $\hat g_B(\theta) \sim \mathcal N\left(\nabla \hat{\mathcal R}_S(\theta), \frac{1}{b}\, C\right).$
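The $1/b$ scaling of the mini-batch gradient covariance in eq. (16) can be checked empirically on a toy least-squares problem (the problem and all constants are ours): when the mini-batch is drawn with replacement, the covariance of the mini-batch gradient equals the per-example gradient covariance divided by the batch size.

```python
import numpy as np

# Empirical check of the C / b scaling of mini-batch gradient noise.
rng = np.random.default_rng(0)

N, d, b = 2_000, 4, 8
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = rng.normal(size=d)

per_example = X * (X @ theta - y)[:, None]     # one least-squares gradient per example
C = np.cov(per_example.T, bias=True)           # per-example gradient covariance

# Covariance of the mini-batch gradient, estimated over many batches drawn
# with replacement; in expectation this equals C / b.
batch_grads = np.stack([
    per_example[rng.integers(0, N, size=b)].mean(axis=0)
    for _ in range(20_000)
])
emp = np.cov(batch_grads.T, bias=True)
print(np.trace(emp), np.trace(C) / b)          # traces should roughly agree
```

Sampling without replacement would add a finite-population correction, but for $b \ll N$ the $C/b$ approximation above is already accurate.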
SGD uses the stochastic gradient to iteratively update the parameter $\theta$ in order to minimize the empirical risk:

(18) $\theta_{t+1} = \theta_t - \eta\, \hat g_B(\theta_t),$

whose continuous-time limit is the Uhlenbeck-Ornstein equation, $\mathrm d\theta_t = -\nabla \hat{\mathcal R}_S(\theta_t)\,\mathrm dt + \sqrt{\frac{\eta}{b}}\, C^{\frac{1}{2}}\,\mathrm dW_t,$ where $C$ is a positive-definite matrix which characterizes the covariance of the gradient noise. In this paper, we consider the case that the batch size $b$ and the learning rate $\eta$ are constant.
Combining eqs. (13) and (18), we have the following analytic form of the stationary distribution (Gardiner and others, 1985):

(19) $q_1(\theta) = \frac{1}{Z_1} \exp\left(-\frac{1}{2}\,\theta^\top \Sigma_1^{-1}\,\theta\right), \qquad A\,\Sigma_1 + \Sigma_1\,A = \frac{\eta_1}{b_1}\, C_1,$

where $Z_1$ is the normalizer and $\eta_1$, $b_1$, and $C_1$ are the learning rate, batch size, and gradient-noise covariance matrix in the pretraining stage, respectively.
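The stationary covariance can be computed numerically by solving the Lyapunov-type equation $A\Sigma + \Sigma A = (\eta/b)\,C$ via vectorization. The following is a small sketch with made-up matrices, verified against the closed form that holds when the Hessian and the noise covariance commute:

```python
import numpy as np

# Solve A @ S + S @ A = (eta / b) * C for S by vectorization:
# (I kron A + A kron I) vec(S) = vec((eta / b) * C), for symmetric A.
def stationary_covariance(A, C, eta, b):
    d = A.shape[0]
    I = np.eye(d)
    lhs = np.kron(I, A) + np.kron(A, I)
    rhs = (eta / b) * C
    return np.linalg.solve(lhs, rhs.reshape(-1)).reshape(d, d)

eta, b = 0.1, 16
A = np.diag([1.0, 2.0])      # Hessian at the minimum (made up)
C = np.diag([0.5, 0.5])      # gradient-noise covariance (made up)
S = stationary_covariance(A, C, eta, b)

# When A and C commute, the solution reduces to S = (eta / (2 b)) * inv(A) @ C.
S_closed = (eta / (2 * b)) * np.linalg.inv(A) @ C
print(S, S_closed)
```

For non-commuting $A$ and $C$ the closed form no longer applies, but the vectorized solve still returns the unique stationary covariance as long as $A$ is positive definite.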
Remark 4.
In this section, we have shown that the learned hypothesis is drawn from the steady distribution of a Fokker-Planck equation, which is a Gibbs-Boltzmann distribution centered around the zero point.
5.2 Knowledge transition in finetuning
SGD in the finetuning stage can also be characterized by the Uhlenbeck-Ornstein equation (eq. 18), but with a different initial condition: the finetuning stage is initialized with the steady distribution $q_1$ of the pretraining stage. Similarly, SGD converges to another steady distribution $q_2$. In this way, we model domain adaptation as a two-stage diffusion process, whose second stage characterizes the knowledge transition between the two domains.
We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as stated in the following assumption.
Assumption 2.
Suppose that the empirical risk around the optimum can be written as the following equation,

(20) $\hat{\mathcal R}_{S_2}(\theta) = \frac{1}{2}\,(\theta - \mu)^\top A'\,(\theta - \mu),$

where $A'$ is the Hessian matrix around the minimum $\mu$ and is a (semi) positive-definite matrix.
Recall that we assumed that the converged local minimum in the pretraining stage is at the zero point. In the finetuning stage, the converged local minimum cannot in general be assumed to be at the same point. Thus, a shift term $\mu$ is introduced to characterize the shift of the converged local minimum.
Similarly, combining eqs. (20) and (18), we have the following analytic form of the stationary distribution:

(21) $q_2(\theta) = \frac{1}{Z_2} \exp\left(-\frac{1}{2}\,(\theta - \mu)^\top \Sigma_2^{-1}\,(\theta - \mu)\right), \qquad A'\,\Sigma_2 + \Sigma_2\,A' = \frac{\eta_2}{b_2}\, C_2,$

where $Z_2$ is the normalizer and $\eta_2$, $b_2$, and $C_2$ are the learning rate, batch size, and gradient-noise covariance matrix in the finetuning stage, respectively.
Recall that the converged local minimizer in the pretraining stage is drawn from a Gibbs-Boltzmann distribution centered at the zero point. This is inherited from the assumption that the local minimum is at the zero point. However, in the finetuning stage, the converged local minimum has a shift $\mu$ from the zero point. This leads to a shift of the distribution of the learned hypothesis.
6 Generalization analysis of supermodel paradigm
The knowledge transition is characterized by the diffusion process in the finetuning stage. In this paper, we employ the PAC-Bayesian theory to analyze the generalizability of domain adaptation.
6.1 PACBayesian framework
PAC-Bayesian theory incorporates PAC theory and Bayesian statistics (McAllester, 1999a, b). It presents a generalization bound for a stochastic algorithm based on the distance between the learned hypothesis and the prior, measured by the KL divergence. The PAC-Bayesian bound characterizes the trade-off between minimizing the empirical risk and exploring areas of the hypothesis space far from the initialization.
Lemma 1 (see McAllester (1999a), Theorem 1).
For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a sample $S$ of size $N$, we have the following inequality for all distributions $Q$:

(22) $\mathcal R(Q) \le \hat{\mathcal R}_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log N + \log\frac{1}{\delta} + 2}{2N - 1}},$

where $\mathrm{KL}(Q \,\|\, P)$ is the KL divergence between the distributions $Q$ and $P$ and is defined as,

(23) $\mathrm{KL}(Q \,\|\, P) = \mathbb E_{\theta \sim Q} \log \frac{q(\theta)}{p(\theta)}.$
This lemma characterizes the influence on the generalization of the distance between the distribution of the learned hypothesis and the prior, measured by the KL divergence $\mathrm{KL}(Q \,\|\, P)$. The KL divergence serves as a hypothesis complexity measure: a larger KL divergence corresponds to a larger hypothesis complexity and, further, worse generalizability.
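To make the behavior of the bound concrete, the complexity term of Lemma 1 can be evaluated numerically; for a fixed KL budget it decays as the sample size grows. The constants inside the square root follow the form quoted above (PAC-Bayes statements vary slightly across the literature), and the numbers are ours:

```python
import math

# Complexity term of the McAllester-style bound:
# sqrt((KL + log N + log(1/delta) + 2) / (2N - 1)).
def pac_bayes_gap(kl, n, delta=0.05):
    return math.sqrt((kl + math.log(n) + math.log(1.0 / delta) + 2.0)
                     / (2.0 * n - 1.0))

# Fixed KL budget, growing sample size: the gap shrinks roughly as 1/sqrt(N).
gaps = [pac_bayes_gap(kl=50.0, n=n) for n in (10**3, 10**5, 10**7)]
print(gaps)
```

This is the quantitative sense in which a very large pretraining set makes even a large KL to the no-knowledge prior affordable, while a small finetuning set does not.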
6.2 Generalization bound
We then obtain a generalization bound for the pretraining stage as follows.
Theorem 1.
For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set $S_1$ of size $N_1$, we have the following inequality for the distribution $Q_1$ of the output hypothesis function of SGD in the pretraining stage:

(24) $\mathcal R(Q_1) \le \hat{\mathcal R}_{S_1}(Q_1) + \sqrt{\frac{\mathrm{KL}(Q_1 \,\|\, P) + \log N_1 + \log\frac{1}{\delta} + 2}{2 N_1 - 1}},$

where $P$ is the no-knowledge prior and $Q_1$ is the stationary distribution given in eq. (19).
The proof for this generalization bound has two parts: (1) utilize results from stochastic differential equation (SDE) to find the stationary solution of the latent OrnsteinUhlenbeck process (eq. 18) which expresses the iterative update of SGD; and (2) adapt the PACBayes framework to obtain the generalization bound based on the stationary distribution. A detailed proof is omitted here and is given in Appendix 8.2.
Similarly, we can obtain a generalization bound for the finetuning stage as follows.
Theorem 2.
For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set $S_2$ of size $N_2$ drawn from the target domain, we have the following inequality for the distribution $Q_2$ of the output hypothesis function of SGD in the finetuning stage:

$\mathcal R_T(Q_2) \le \hat{\mathcal R}_{S_2}(Q_2) + \sqrt{\frac{\mathrm{KL}(Q_2 \,\|\, Q_1) + \log N_2 + \log\frac{1}{\delta} + 2}{2 N_2 - 1}},$

where $Q_1$ is the stationary distribution of the pretraining stage, which serves as the prior for the finetuning stage, $\mathcal R_T$ is the expected risk on the target domain, and $\hat{\mathcal R}_{S_2}$ is the empirical risk on $S_2$.
Remark 5.
The generalization bounds in both the pretraining and finetuning stages are in the order of $\mathcal O\left(\sqrt{\log N / N}\right)$, which suggests that the generalization error converges to zero when the training sample size $N$ goes to infinity.
6.3 Dominance of finetuning in generalization of domain adaptation
In the super-model paradigm, the model is usually pretrained on large amounts of data in a wide source domain and then finetuned on specific domains with relatively smaller training data. The training sample size $N_1$ in the source domain is significantly larger than the training sample size $N_2$ in the target domain; for example, GPT-3 is trained on 45TB of text data. Since the bounds in Theorems 1 and 2 decay with the training sample size, the generalization bound of the pretraining stage is negligible compared with that of the finetuning stage; hence, the finetuning stage dominates the generalization of the whole paradigm.
6.4 Impact of the domain shifts
Theorem 2 helps characterize how the domain shift between the source domain and the target domain influences the generalization on the target domain. The domain shift is measured by the following discrepancy.
Definition 2 (Domain discrepancy).
Suppose the distributions $Q_1$ and $Q_2$ of the learned models in the pretraining and finetuning stages have densities as follows,

(25) $q_1(\theta) = \frac{1}{Z_1}\exp\left(-\frac{1}{2}\,\theta^\top \Sigma_1^{-1}\,\theta\right), \qquad q_2(\theta) = \frac{1}{Z_2}\exp\left(-\frac{1}{2}\,(\theta - \mu)^\top \Sigma_2^{-1}\,(\theta - \mu)\right),$

where $Z_1$ and $Z_2$ are two normalizers, $\Sigma_1$ and $\Sigma_2$ are two covariance matrices, and $\mu$ is the center shift between the two learned hypotheses.
Then, the domain discrepancy between the two domains is defined as below,

$\Delta(Q_1, Q_2) = \mathrm{KL}(Q_2 \,\|\, Q_1).$
Remark 7.
In Definition 2, we assume the distribution of the pretrained model is centered at the zero point. This assumption does not hurt the generality: if the distribution center is not at the zero point, one may move it to the zero point via reparameterization.
Remark 8.
The domain discrepancy consists of two parts: (1) the terms $\mathrm{tr}(\Sigma_1^{-1}\Sigma_2)$ and $\log\frac{\det\Sigma_1}{\det\Sigma_2}$ characterize the match between the covariances; and (2) the term $\mu^\top \Sigma_1^{-1}\mu$ characterizes the center shift, viewed through the lens of the covariance in the source domain.
Remark 9.
Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shifts. Generally, larger domain shifts lead to worse generalization on the target domain. This supports the heuristic in practice that the performance on the target domain is limited by the domain shifts.
Based on Definition 2, one can get the following lemma.
Lemma 2.
The domain discrepancy can be rearranged as follows,

$\Delta(Q_1, Q_2) = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_1^{-1}\Sigma_2\right) + \mu^\top \Sigma_1^{-1}\mu - d + \log\frac{\det\Sigma_1}{\det\Sigma_2}\right].$
Proof of Lemma 2.
We have that

$\Delta(Q_1, Q_2) = \mathbb E_{\theta \sim Q_2}\left[\log q_2(\theta) - \log q_1(\theta)\right] = \frac{1}{2}\,\mathbb E_{\theta \sim Q_2}\left[\theta^\top \Sigma_1^{-1}\theta - (\theta - \mu)^\top \Sigma_2^{-1}(\theta - \mu)\right] + \log\frac{Z_1}{Z_2}.$

Since $\mathbb E_{\theta \sim Q_2}\left[\theta^\top \Sigma_1^{-1}\theta\right] = \mathrm{tr}(\Sigma_1^{-1}\Sigma_2) + \mu^\top \Sigma_1^{-1}\mu$, $\mathbb E_{\theta \sim Q_2}\left[(\theta - \mu)^\top \Sigma_2^{-1}(\theta - \mu)\right] = d$, and $\log\frac{Z_1}{Z_2} = \frac{1}{2}\log\frac{\det\Sigma_1}{\det\Sigma_2}$, the claimed equality follows.
∎
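The closed-form Gaussian KL divergence underlying Lemma 2 can be verified numerically: for toy covariances of ours, the formula matches a Monte Carlo estimate of the defining expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2
mu = np.array([0.3, -0.2])         # center shift (made up)
S1 = np.diag([1.0, 0.5])           # source-stage covariance (made up)
S2 = np.diag([0.8, 0.4])           # target-stage covariance (made up)
inv1, inv2 = np.linalg.inv(S1), np.linalg.inv(S2)

# Closed form: KL(N(mu, S2) || N(0, S1)).
closed = 0.5 * (np.trace(inv1 @ S2) + mu @ inv1 @ mu - d
                + np.log(np.linalg.det(S1) / np.linalg.det(S2)))

# Monte Carlo estimate of E_{theta ~ Q2}[log q2(theta) - log q1(theta)].
samples = rng.multivariate_normal(mu, S2, size=200_000)
diff = samples - mu
log_q2 = -0.5 * (np.einsum('ij,jk,ik->i', diff, inv2, diff)
                 + np.log(np.linalg.det(S2)) + d * np.log(2 * np.pi))
log_q1 = -0.5 * (np.einsum('ij,jk,ik->i', samples, inv1, samples)
                 + np.log(np.linalg.det(S1)) + d * np.log(2 * np.pi))
mc = np.mean(log_q2 - log_q1)
print(closed, mc)   # the two estimates should agree up to Monte Carlo error
```

The same code, with the covariances swapped, illustrates the asymmetry of the discrepancy: shrinking toward the source distribution is cheaper than expanding away from it.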
Based on Lemma 2, we define a new notion for measuring the domain shifts as follows.
Definition 3 (Dimensiondependent domain discrepancy).
Suppose the distributions of the learned models in the pretraining and finetuning are and as follows,
(26) 
where and are two normalizers, and are two covariance matrices, and is the center shift between the two learned hypotheses.
Then, the dimension-dependent domain discrepancy between the two domains is defined as below,
From Theorem 2, one may obtain the following corollary.
Corollary 1.
For any positive real $\delta$, with probability at least $1-\delta$ over a training sample set of size $n$, we have the following inequality for the distribution of the output hypothesis function of SGD:
(27) 
where
and is the Hessian matrix of the loss function around the local minimum.
7 Discussion and future work
Large-scale pre-trained models, such as GPT-3 and BERT, enable a new industrial paradigm: pre-training a super model on large amounts of multi-modality data (sometimes of low quality) and then fine-tuning the learned model on smaller, specific application domains. This super-model paradigm may significantly reduce the application cost of machine learning, which is of critical value to numerous small and medium-sized enterprises.
A major technique in this paradigm is domain adaptation, which enables knowledge transfer between the two domains. We model a super-model paradigm as a two-stage diffusion process: (1) in the pre-training stage, the trajectory of stochastic gradient descent (SGD) or its variants searches the loss surface, modeled discretely as an Ornstein–Uhlenbeck process or continuously by the Fokker–Planck equation; the model weight starts from a no-knowledge prior and converges to a Maxwell–Boltzmann distribution; and (2) in the fine-tuning stage, the trajectory of SGD is driven by a similar SDE, which starts from the model distribution learned in the pre-training stage and converges to another Maxwell–Boltzmann distribution. Based on the diffusion processes, a generalization bound is obtained via the PAC-Bayesian framework.
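The two-stage diffusion view can be illustrated with a minimal one-dimensional simulation; the constants below are made up for illustration and do not correspond to any actual training setup. The first stage starts from a broad, no-knowledge prior and relaxes to a stationary Gaussian, and the second stage restarts from the first stage's endpoint distribution and relaxes to a different stationary Gaussian around a shifted center.

```python
import numpy as np

def ou_stage(theta, center, curvature, sigma, eta, steps, rng):
    """Euler-Maruyama discretization of the OU dynamics
    d(theta) = -curvature * (theta - center) dt + sigma dW."""
    for _ in range(steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * curvature * (theta - center) + np.sqrt(eta) * sigma * noise
    return theta

rng = np.random.default_rng(0)
theta0 = rng.normal(0.0, 10.0, size=5000)               # no-knowledge prior
pre = ou_stage(theta0, 0.0, 1.0, 0.5, 0.01, 2000, rng)  # pre-training: stationary around 0
fine = ou_stage(pre, 2.0, 1.0, 0.5, 0.01, 2000, rng)    # fine-tuning: stationary around the shifted center
print(pre.mean(), pre.var())  # roughly 0 and sigma^2 / (2 * curvature) = 0.125
print(fine.mean())            # roughly 2
```

Each stage forgets its initial condition at an exponential rate set by the curvature, so both endpoint distributions are close to their respective stationary Gaussians.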
The generalization bounds suggest that the fine-tuning stage dominates the generalization of the whole paradigm. The generalization is determined by the domain discrepancy between the pre-training and fine-tuning domains, which is characterized by a new measure based on the covariances and the domain shift.
In this work, we make several assumptions and abstractions. This section discusses the limitations they introduce and gives several potential extensions.

Model compression in the fine-tuning stage. In this paper, we ignore the model compression approaches sometimes employed in the fine-tuning stage in practice. Popular model compression methods include model distillation, pruning, and quantization. The effects of model compression can be seen as operators on the loss surface and the learned model. A future direction is to mathematically characterize the influence of model compression; the results may serve as plug-and-play components of the theory presented in this paper.

Gradient noise in SGD. In this paper, we assume that the gradient noise is drawn from a Gaussian distribution. Recent works have also modeled the gradient noise as a Lévy process, Laplacian noise, etc.; the exact distribution of the gradient noise is still an open problem. In addition, the gradient noise is assumed state-independent, which can be easily extended to state-dependent. A future direction is to study the distribution of the gradient noise in SGD. It is worth noting that relatively little effort is needed to change the gradient-noise distribution assumptions in this paper.

Advanced techniques in modeling SGD. In this paper, we model the trajectory of SGD via the Fokker–Planck equation and the Ornstein–Uhlenbeck process. This modeling ignores the influence of several techniques, such as momentum and adaptive learning rates. Recent works find that these techniques may impose implicit regularization on the learned model but are unlikely to have a decisive impact. A future direction is to model SGD as a more sophisticated stochastic differential equation.

Distribution/data-dependent priors. Some works design priors that rely on the data-generating distribution but still not directly on the training data. This is reasonable since we can assume the data distribution was fixed before the data was collected (Lever et al., 2013). Such distribution-dependent priors have been shown to considerably tighten generalization bounds. Negrea et al. (2019) push the frontier further by constructing priors that are not independent of the data: one may hold out a subset of the training sample and design a prior exploiting it to deliver a data-dependent forecast of the posterior. A future direction is to model SGD via distribution-/data-dependent priors.
8 Proofs
This section presents the proofs of the theory given above.
We model the iterative updates in SGD using a stochastic differential equation. This approach also appears in the literature; see, e.g., E (2017); Mandt et al. (2017); Mou et al. (2018a); He et al. (2019); Meng et al. (2020); Cheng et al. (2020); Xie et al. (2020); Wang et al. (2021).
We first interpret the updates in SGD as an Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein, 1930) under some mild assumptions. The Ornstein–Uhlenbeck process has a stationary distribution, which is then employed to characterize the distribution of the learned hypothesis. Exploiting this stationary distribution, we further obtain a generalization bound via the PAC-Bayesian framework, which characterizes the influence on the generalization of the distance between the output hypothesis distribution and its prior (McAllester, 1999a, b).
8.1 Proof of Theorem 1
The proof of Theorem 1 relies on the following lemma.
Lemma 3 (cf. Mandt et al. (2017), pp. 2718, Appendix B).
This lemma gives the analytic form of the stationary distribution of the Ornstein–Uhlenbeck process. It is from Mandt et al. (2017); we recall the proof here to make this paper self-contained.
Proof.
From a result on the Ornstein–Uhlenbeck process (Gardiner et al., 1985), we know that the parameter has the following analytic solution,
(30) 
where the driving term is a white noise process. From eq. (28), we know that
(31) 
Therefore, we have the following equation,
(32) 
The proof is completed. ∎
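The analytic stationary covariance in Lemma 3 can be checked numerically. For an OU process of the form dθ = −Aθ dt + B dW with a symmetric positive-definite drift matrix A, the stationary covariance S solves the Lyapunov equation AS + SAᵀ = BBᵀ. The sketch below (with illustrative matrices, not this paper's notation) solves that equation by eigendecomposition.

```python
import numpy as np

def stationary_cov(A, Q):
    """Solve A S + S A^T = Q for symmetric positive-definite A via
    eigendecomposition; S is the OU stationary covariance when Q = B B^T."""
    lam, V = np.linalg.eigh(A)
    Qt = V.T @ Q @ V
    St = Qt / (lam[:, None] + lam[None, :])  # elementwise division by eigenvalue sums
    return V @ St @ V.T

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # drift (Hessian-like) matrix
B = 0.3 * np.eye(2)                     # diffusion matrix
S = stationary_cov(A, B @ B.T)
# Residual of the Lyapunov equation should vanish (up to floating point):
print(np.abs(A @ S + S @ A.T - B @ B.T).max())
```

When the diffusion is isotropic (B = σI) and A is symmetric, the solution reduces to S = (σ²/2)A⁻¹, matching the familiar σ²/(2H) stationary variance in one dimension.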
Then, we can prove Theorem 1. The proof is inspired by He et al. (2019); we recall it here to make this paper self-contained.
Proof of Theorem 1.
In the PAC-Bayesian framework (Lemma 1), an essential part is the KL divergence between the distribution of the learned hypothesis and the prior on the hypothesis space. The prior distribution can be interpreted as the distribution of the initial parameters, which are usually set according to Gaussian or uniform distributions.¹ (¹Usually, when there is no confident prior knowledge of the latent model parameters, the prior should be set to an uninformative distribution, such as a Gaussian or uniform distribution.) This setting comes from two considerations: (1) once an algorithm based on Bayesian statistics converges, given long enough time and large enough data it always converges to the stationary distribution; this is guaranteed by the assumption that the stationary solution of the underlying stochastic differential equation exists and is unique; and (2) the prior should be set with care, since we cannot assume any knowledge of the target hypothesis before training starts. Here, we use a standard Gaussian distribution as the prior. Suppose the densities of the stationary distribution and the prior distribution, in terms of the parameter, are respectively given by the following equations,
(33)  
(34) 
where eq. (34) comes from eq. (28) by calculating the normalizer.
Therefore,
(35) 
Applying eq. (8.2) to eq. (23), we can calculate the KL divergence between the two distributions (we assume ):
(36) 
From eq. (29), we have that
(37) 
Therefore,
(38) 
After taking the trace of both sides, we have the following equation,
(39) 
The left-hand side (LHS) is as follows,
(40) 
Therefore,
(41) 
At the same time, we can easily calculate that
(42) 
as , where is the dimension of the parameter .
Eq. (43) gives an upper bound for the distance (measured by KL divergence) between the stationary distribution of the weights output by SGD and the prior on the hypothesis space. Given the monotonicity of the generalization bound with respect to the KL divergence, we can further obtain a PAC-Bayesian generalization bound for SGD by inserting the KL-divergence bound (eq. (43)) into the PAC-Bayesian framework (eq. (22) of Lemma 1).
The proof is completed. ∎
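The key quantity bounded in the proof above, the KL divergence between a Gaussian stationary distribution and a standard Gaussian prior, has a closed form that is easy to sanity-check by Monte Carlo. The covariance below is made up for illustration.

```python
import numpy as np

def kl_to_standard_normal(Sigma):
    """KL( N(0, Sigma) || N(0, I) ) = (tr(Sigma) - d - log det Sigma) / 2."""
    d = Sigma.shape[0]
    return 0.5 * (np.trace(Sigma) - d - np.log(np.linalg.det(Sigma)))

rng = np.random.default_rng(1)
Sigma = np.diag([0.5, 1.0, 2.0])
closed = kl_to_standard_normal(Sigma)

# Monte Carlo estimate: E_{x ~ N(0, Sigma)}[log p(x) - log q(x)];
# the (2*pi)^(d/2) constants cancel in the log-ratio.
x = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
Sigma_inv = np.linalg.inv(Sigma)
log_ratio = 0.5 * (np.einsum('ij,ij->i', x, x)
                   - np.einsum('ij,jk,ik->i', x, Sigma_inv, x)
                   - np.log(np.linalg.det(Sigma)))
print(closed, log_ratio.mean())  # the two values should agree closely
```

The closed form is zero exactly when the stationary covariance equals the identity, i.e., when the learned distribution coincides with the prior.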
8.2 Proof of Theorem 2
This section proves Theorem 2. The proof is similar to that of the previous theorem.
Proof of Theorem 2.
Similarly, the densities of the learned-hypothesis distribution and the prior distribution, in terms of the parameter, are respectively given by the following equations,
(44)  
(45) 
where eq. (45) comes from calculating the normalizer.
Therefore,
(46) 
Then, the KL divergence between the two distributions is as follows (we assume ):