1 Introduction
Recent years have seen rising interest in large-scale empirical studies of the various metrics used to quantify generalization jiang2019fantastic; dziugaite2020search; martin2020predicting_NatComm; MM21a_simpsons_TR. On the one hand, theory-driven metrics have the potential to reveal more information than test error and thus bring us one step closer to unpacking the black box of deep NNs nakkiran2019deep; zhang2021understanding; frankle2018lottery. On the other hand, a wide variety of generalization metrics have been applied to predict the quality of pretrained models martin2020predicting_NatComm; martin2019traditional, design effective training procedures foret2020sharpness; izmailov2018averaging, improve network efficiency chen2021neural; dong2019hawq, quantify network robustness yang2020boundary; tanay2016boundary, improve ensemble learning techniques garipov2018loss; fort2019deep, analyze and improve large-scale contests MM21a_simpsons_TR, and so on.
Despite advances in the study of generalization, however, several recent papers point out the deficiencies of many of these “fantastic” generalization metrics. These include a lack of “robustness” to changes in environmental hyperparameters jiang2019fantastic; dziugaite2020search (such as data, network architecture, and training schemes), or the Simpson's paradox that generalization metrics perform differently (i.e., predict opposite trends) when applied to each subpart of a collection of learning models than when applied to the holistic study MM21a_simpsons_TR. Another drawback is the over-reliance on experiments with CV models, which are relatively well-explored, and which are not representative of many other application areas. Despite a few counterexamples nakkiran2019deep; martin2020predicting_NatComm; yang2021taxonomizing, systematic studies of generalization in other fields, such as NLP, are largely missing.

Generalization metrics for NLP. The objective of this paper is to provide a systematic study of generalization metrics to address issues that have not received proper attention in prior studies jiang2019fantastic; dziugaite2020search; martin2020predicting_NatComm
. We will primarily focus on NLP. Compared to CV, predicting generalization in NLP has several important differences that require careful consideration. The first is that training NLP models to completely interpolate the training data is often impossible, due to the web-scale size of the training data. This creates an issue when applying most existing generalization metrics, because most of them focus on predicting the
generalization gap (i.e., the difference between training and test performance) rather than the test error itself.

To see why focusing on the generalization gap is an issue, consider the most commonly used application to motivate the study of generalization metrics: comparing two models MM21a_simpsons_TR; jiang2020neurips. (As the report of the NeurIPS 2020 Competition on Predicting Generalization in Deep Learning jiang2020neurips points out, a generalization metric should be able to order models' performance in a way similar to the generalization gap, so that it can be used for model selection or neural architecture search; however, see MM21a_simpsons_TR for a detailed exposition of issues with this.) Suppose two classification models A and B have 5% and 10% training error, respectively. In this case, even if a generalization metric correctly predicts that model B has a smaller generalization gap than model A, and even if we know the training errors of both models, it is still unclear whether model B indeed has a smaller test error than A. In this paper, we aim to study generalization metrics that correlate with model quality, and we use test error as a close approximation of model quality. As we will demonstrate, (rank) correlation with the generalization gap does not imply (rank) correlation with model quality, as many of the generalization metrics we study are on a substantially different scale from the values of the test error. Metrics that focus on predicting the generalization gap include most of the well-known metrics in CV, such as those based on the PAC-Bayesian framework neyshabur2017pac; mcallester1999pac and margins bartlett2017spectrally; pitas2017pac; jiang2018predicting. Therefore, from a practical point of view, for NLP tasks, we prefer generalization metrics that can directly predict the trends in test error (or similar evaluation metrics in NLP, such as the test BLEU score
papineni2002bleu) rather than trends in the generalization gap, so that we learn information closer to the model quality. Although this paper focuses on data-independent metrics, we also provide results for data-dependent metrics motivated by margins and PAC-Bayesian bounds jiang2019fantastic; dziugaite2020search. However, although these metrics perform well in predicting the generalization gap, we show that none of them satisfactorily predicts test error directly.

The second difference between NLP and CV is that the data for NLP pretraining are usually web-scale and are hard to access and use, while the training data from standard CV benchmarks can often be easily obtained. Therefore, it would be ideal if the generalization metric under study could measure the quality of learning models without access to data. In this paper, we focus on generalization metrics that do not need access to data. Although perhaps surprising, recent work has shown that access to training or testing data is not necessary for assessing the quality of learning models martin2020predicting_NatComm.
With these objectives in mind, among the generalization metrics in the literature, we take particular interest in those derived from the heavy-tail self-regularization (HTSR) theory martin2019traditional; martin2018implicit_JMLRversion, which (i) predicts test error directly instead of the generalization gap and (ii) does not require access to training (or testing) data. In addition to these two advantages, there are several other benefits worth mentioning. First, it is known that NLP training is harder than benchmark image classification tasks, and its optimization loss landscape can be problematic yang2021taxonomizing. Therefore, the evaluation of NLP models should place more emphasis on the quality of the entire training process instead of the “final stage of data interpolation.” Fortunately, being able to do this is a known advantage of HTSR theory hodgkinson2021multiplicative; hodgkinson2021generalization. Second, real data often follow heavy-tail distributions feldman2020does, which can be even more evident in NLP than in the more well-behaved datasets in CV li2017deep that are often used to study generalization.
HTSR theory. The core principle of HTSR theory martin2018implicit_JMLRversion; martin2019traditional; martin2020predicting_NatComm; MM21a_simpsons_TR
is that HT structures can arise naturally in the ESDs of the weight matrices as the result of extracting various correlations in data during optimization. The main technique used in these papers is to estimate the PL coefficient from the ESDs (which only requires access to the weights), with smaller coefficients reported to correspond to higher model quality. However, these estimators can be sensitive to empirical noise, and so one must be careful not to rely on them alone. For example, the quality of the PL fit and the so-called localization of eigenvectors martin2018implicit_JMLRversion should all point to similar conclusions, which serves as a sanity check. Fortunately, the large Transformer models vaswani2017attention; devlin2018bert typically used in modern NLP tasks contain many large linear layers, which allows for greater accuracy in the PL estimators.

The principles of HTSR theory extend beyond fitting the PL coefficient, however, as ESDs can take many forms. To this end, we study three different types of distributions to fit to the ESDs of weight matrices: power law (PL) in Eqn. (1), exponentially truncated power law (ETPL) in Eqn. (2), and exponential (EXP) in Eqn. (3). These are all commonly considered families of distributions in classical studies of PLs clauset2009power, and it is often hard in practice to predict which family fits data best (as we show in this paper, this is especially true for deep NNs). Figure 1 shows examples of comparing different HT fits on the same ESD.
When used appropriately, we will find that the various metrics derived from HTSR, which (following MM21a_simpsons_TR) we call shape metrics, uniformly perform better than scale metrics (or norm-based metrics) in our empirical results. Furthermore, while the calculation of more subtle metrics (e.g., those derived from PAC-Bayes bounds) is slow when the data is large, metrics in HTSR theory are derived from the weights and are often much faster to compute.
The following summarizes our main contributions.


Unlike prior papers on generalization metrics that focus on CV jiang2019fantastic; dziugaite2020search, we provide the first systematic empirical study of various generalization metrics in NLP. Compared to existing studies that focus primarily on limited-size models and data, we consider more practical settings, with Transformers and medium-to-large scale data (e.g., million-scale datasets such as WMT14).

HT structure is identified as one of the most effective generalization theories in NLP. From our empirical results, shape metrics obtained from the HT ESDs of weight matrices perform uniformly better than norm-based/scale-based metrics for predicting model quality.

We extend prior studies on HTSR theory and investigate alternative models to fit heavy-tail/light-tail distributions. Our results show that the rand_distance metric, a novel metric quantifying the distance (in distribution) of the ESD from that of the randomized layer, and the exponent of ETPL fits can be used as relatively robust generalization metrics.
2 Background
2.1 Notation and preliminaries
General notation. Consider a NN with weight matrices $\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_L$. We use $\{\mathbf{W}_i\}_{i=1}^{L}$ to denote the collection of all the weights, and we denote the vector that consists of all the model weights as $\mathbf{w}$. We denote the neural network (function) as $f_{\mathbf{w}}$, which takes a single input sample $\mathbf{x}$ and outputs a vector $f_{\mathbf{w}}(\mathbf{x})$. The superscript $0$ on a weight matrix, e.g., $\mathbf{W}_i^{0}$, denotes the initial weights from which the model is trained. We use $\mathbf{X}_i$ to denote the correlation matrix of $\mathbf{W}_i$, defined in Section 2.2. The notation $\mathbf{1}$ means an all-one vector, and $\mathbf{I}$ means the identity matrix.
Norms and distances. We use different types of norms defined on vectors and matrices. $\|\cdot\|_1$ and $\|\cdot\|_2$ used on vectors denote the $\ell_1$ norm and the $\ell_2$ norm, respectively. $\|\cdot\|_F$ and $\|\cdot\|_2$ used on matrices denote the Frobenius norm and the spectral norm, respectively.
2.2 Preliminary of ESDs of weight matrices
For a weight matrix $\mathbf{W}$, without loss of generality, we denote its shape by $N \times M$ with $N \geq M$. We denote the set of singular values as $\{\sigma_i\}_{i=1}^{M}$. Following martin2017rethinking, we define the correlation matrix as $\mathbf{X} = \frac{1}{N}\mathbf{W}^{\top}\mathbf{W}$. We denote the eigenvalues of the correlation matrix as $\{\lambda_i\}_{i=1}^{M}$, and we have that $\lambda_i = \sigma_i^2 / N$. Furthermore, we use $\lambda_{\max}$ to denote the maximum eigenvalue of the correlation matrix $\mathbf{X}$. By the ESD of the weight matrix $\mathbf{W}$, we mean the empirical density of the eigenvalues of $\mathbf{X}$, which is usually plotted as a histogram. Following MM21a_simpsons_TR, we let $p(\lambda)$ denote the density function fit to the ESD, taking values in the interval $[\lambda_{\min}, \lambda_{\max}]$. For a power law, $p(\lambda)$ satisfies
(1)  $p(\lambda) \propto \lambda^{-\alpha}, \quad \lambda_{\min} \leq \lambda \leq \lambda_{\max}.$
We note that, following MM21a_simpsons_TR, $\lambda_{\max}$ is chosen to be the maximum eigenvalue of the correlation matrix. However, $\lambda_{\min}$ is a variable to be optimized to improve the quality of the PL fit, and it is not equal to the minimum eigenvalue in general. We use $\alpha_i$ to denote the fitted PL coefficient of the ESD of the $i$th weight matrix $\mathbf{W}_i$.
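As an illustrative sketch (not the paper's code), the ESD of a single weight matrix can be computed in a few lines of NumPy, using the $1/N$ normalization above:

```python
import numpy as np

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W / N, where N >= M.

    These M eigenvalues, plotted as a histogram, form the ESD of W."""
    N, M = max(W.shape), min(W.shape)
    sigma = np.linalg.svd(W, compute_uv=False)  # the M singular values of W
    return np.sort(sigma ** 2 / N)[::-1]        # lambda_i = sigma_i^2 / N
```

Computing singular values of $\mathbf{W}$ and squaring them is numerically preferable to forming $\mathbf{W}^{\top}\mathbf{W}$ explicitly and diagonalizing it.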
Eigenvector localization.
The eigenvectors of a random matrix can be localized under certain assumptions on the distributions from which the matrix entries are drawn. Our analysis also considers the localization of an eigenvector $\mathbf{v}$, measured by the vector entropy martin2018implicit_JMLRversion. The vector entropy is calculated using a histogram estimated from the vector entries. Specifically, for an eigenvector $\mathbf{v}$, a histogram is estimated using a given number of bins. Then, the histogram is normalized to form a probability vector $\mathbf{p}$. The vector entropy is then calculated as $H(\mathbf{v}) = -\sum_{j} p_j \log p_j$.

3 Heavy-tail self-regularization theory
In this section, we provide a brief overview of HTSR theory, and we discuss several metrics that we derive from it. In HTSR theory, the ESDs of the weight matrices become more heavy-tailed during training as they become increasingly correlated. One can evaluate the amount of correlation by fitting a PL to the ESD of a weight matrix, using the open-source WeightWatcher tool (https://github.com/CalculatedContent/WeightWatcher) martin2020predicting_NatComm. After computing the ESD of a weight matrix, we use the maximum likelihood estimate from alstott2014powerlaw to fit the PL distribution, the specific form of which has been defined in (1). We use alpha to denote the PL coefficient averaged over layers, which is effectively the slope of the tail of the ESD on a log-log scale.

Correctly identifying and fitting PL distributions is well known to be a challenge in practice. For example, a density that appears as a straight line on a log-log scale plot need not follow a power law, as there are many other distributions that can show similar behavior, including log-normal and exponential-type distributions clauset2009power. Nested distributions such as ETPL, which combine the pure PL with other distributional assumptions, can often improve the quality of the fit clauset2009power; alstott2014powerlaw. Therefore, in addition to the PL distribution defined in (1), we consider several other classes of distributions studied in the literature. Specifically, we consider the following two additional distributional assumptions.

(ETPL exponent) The ESDs are assumed to take the following “nested” form:
(2)  $p(\lambda) \propto \lambda^{-\alpha} e^{-\beta\lambda}.$
After fitting the ETPL, we call the exponential truncation coefficient $\beta$ the exponent metric.

(exp_dist_exponent) The ESDs are assumed to take the following form:
(3)  $p(\lambda) \propto e^{-\beta\lambda}.$
After fitting the EXP, we call the exponential coefficient $\beta$ the exp_dist_exponent metric.
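As a sketch of the simplest of these fits: under the EXP assumption in (3), the tail exponent has a closed-form maximum-likelihood estimate (the ETPL fit in (2) has no closed form and is typically done numerically, e.g., by WeightWatcher). The helper below is illustrative, not the paper's implementation:

```python
import numpy as np

def fit_exp_tail(eigs, lam_min):
    """MLE for p(lambda) ∝ exp(-beta * lambda) on the tail lambda >= lam_min.

    For an exponential distribution shifted to start at lam_min, the MLE is
    beta_hat = 1 / mean(lambda - lam_min)."""
    tail = np.asarray(eigs, dtype=float)
    tail = tail[tail >= lam_min]
    return 1.0 / np.mean(tail - lam_min)
```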
Another metric that we introduce is rand_distance, which is also a shape metric motivated by the ESDs. We use $p_i$ to denote the distribution obtained by normalizing the squared singular values of the $i$th weight matrix $\mathbf{W}_i$, and we use $p_i^{\mathrm{rand}}$ to denote the distribution obtained in the same way but using the randomized weight matrix. Then, we define the rand_distance metric as
(4)  $\mathrm{rand\_distance} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{JS}\left(p_i, p_i^{\mathrm{rand}}\right),$
where $\mathrm{JS}$ is the Jensen–Shannon divergence. This is a distance based on the eigenvalues (not the elements) of a weight matrix and a random initialization matrix. The implementation of the rand_distance metric can be found in WeightWatcher.
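A minimal single-layer sketch of this idea (WeightWatcher's actual implementation may differ in details such as binning and the choice of randomization): treat the normalized squared singular values as probability vectors and compare them with the Jensen–Shannon distance after shuffling the matrix entries, which destroys correlations while preserving the marginal distribution of the elements:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def rand_distance_sketch(W, seed=0):
    """JS distance between the normalized spectrum of W and that of a
    shuffled copy of W (shuffling keeps the entries, kills correlations)."""
    def spectrum(M):
        s2 = np.linalg.svd(M, compute_uv=False) ** 2
        return np.sort(s2 / s2.sum())[::-1]  # normalized squared singular values
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    return float(jensenshannon(spectrum(W), spectrum(W_rand)))
```

A strongly correlated (e.g., rank-one) matrix is far from its shuffled counterpart, while an i.i.d. random matrix is essentially unchanged by shuffling and hence close to it.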
For more details of the various metrics considered in this paper, see Table 1. We note that none of the metrics derived from HTSR requires access to data or GPUs. We mainly compare shape metrics, which are derived from HTSR, with scale metrics, which are mostly norm-based metrics. For the precise definitions of these metrics, see Appendix A.
Name  Eqn  Ref  Need initial weights?  Scale or shape  Need data?  Need GPU?  Predicting model quality or generalization gap? 
l2  (5)    No  Scale  No  No  Generalization gap 
l2_dist  (6)    Yes  Scale  No  No  Generalization gap 
param_norm  (7)  jiang2019fantastic  No  Scale  No  No  Generalization gap 
fro_dist  (8)  jiang2019fantastic  Yes  Scale  No  No  Generalization gap 
log_norm  (9)  martin2018implicit_JMLRversion  No  Scale  No  No  Generalization gap 
log_sum_of_fro  (10)  jiang2019fantastic  No  Scale  No  No  Generalization gap 
log_spectral_norm  (11)  MM21a_simpsons_TR  No  Scale  No  No  Generalization gap 
dist_spec_int  (12)  jiang2019fantastic  Yes  Scale  No  No  Generalization gap 
log_prod_of_fro  (13)  jiang2019fantastic  No  Scale  No  No  Generalization gap 
log_sum_of_spec  (14)  jiang2019fantastic  No  Scale  No  No  Generalization gap 
log_prod_of_spec  (15)  jiang2019fantastic  No  Scale  No  No  Generalization gap 
path_norm  (16)  neyshabur2015norm  No  Scale  No  No  Generalization gap 
mp_softrank  (17)  martin2018implicit_JMLRversion  No  Scale/Shape  No  No  Model quality 
stable_rank  (18)  martin2018implicit_JMLRversion  No  Scale/Shape  No  No  Model quality 
alpha  (1)  martin2018implicit_JMLRversion  No  Shape  No  No  Model quality 
exponent  (2)  This paper WeightWatcher  No  Shape  No  No  Model quality 
exp_dist_exponent  (3)  This paper WeightWatcher  No  Shape  No  No  Model quality 
ks_distance  (19)  martin2018implicit_JMLRversion  No  Shape  No  No  Model quality 
tail_mean_vec_entropy  (20)  This paper WeightWatcher  No  Shape  No  No  Model quality 
bulk_mean_vec_entropy  (21)  This paper WeightWatcher  No  Shape  No  No  Model quality 
entropy  (22)  martin2018implicit_JMLRversion  No  Shape  No  No  Model quality 
rand_distance  (4)  This paper WeightWatcher  No  Shape  No  No  Model quality 
alpha_weighted  (23)  martin2018implicit_JMLRversion  No  Hybrid  No  No  Model quality 
log_alpha_norm  (24)  MM21a_simpsons_TR  No  Hybrid  No  No  Model quality 
inverse_margin  (27)  jiang2019fantastic  No  Scale  Yes  Maybe  Generalization gap 
log_prod_of_spec_over_margin  (28)  bartlett2017spectrally; pitas2017pac  No  Scale  Yes  Maybe  Generalization gap 
log_sum_of_spec_over_margin  (29)  bartlett2017spectrally; pitas2017pac  No  Scale  Yes  Maybe  Generalization gap 
log_prod_of_fro_over_margin  (30)  bartlett2017spectrally; pitas2017pac  No  Scale  Yes  Maybe  Generalization gap 
log_sum_of_fro_over_margin  (31)  bartlett2017spectrally; pitas2017pac  No  Scale  Yes  Maybe  Generalization gap 
path_norm_over_margin  (32)  neyshabur2015norm  No  Scale  Yes  Maybe  Generalization gap 
pacbayes_init  (35)  neyshabur2017exploring  Yes  Scale  Yes  Yes  Generalization gap 
pacbayes_orig  (36)  neyshabur2017exploring  No  Scale  Yes  Yes  Generalization gap 
pacbayes_flatness  (37)  neyshabur2017exploring  No  Scale  Yes  Yes  Generalization gap 
pacbayes_mag_init  (38)  jiang2019fantastic  Yes  Scale  Yes  Yes  Generalization gap 
pacbayes_mag_orig  (39)  jiang2019fantastic  No  Scale  Yes  Yes  Generalization gap 
pacbayes_mag_flatness  (40)  jiang2019fantastic  No  Scale  Yes  Yes  Generalization gap 
Issues of PL fitting. It is well documented that subtle issues can arise when fitting the ESDs clauset2009power; alstott2014powerlaw; martin2017rethinking; MM21a_simpsons_TR. To mitigate these issues in PL fits and focus on the core mathematical concepts of predicting generalization, we rely on the same strategies used in WeightWatcher to fit the ESDs. For example, one issue of PL fitting is correctly choosing the lower threshold $\lambda_{\min}$. The most common way of doing this martin2017rethinking; clauset2009power is to choose the $\lambda_{\min}$ that yields the best-quality fit under the Kolmogorov–Smirnov statistic (referred to as ks_distance in the sequel; see Eqn. (19)). However, this method is time-consuming, especially for ETPL, as there are two parameters to fit. Therefore, we adopt the “fix-finger” method documented in WeightWatcher, which selects $\lambda_{\min}$ as the peak of the ESD. Besides the speed improvement, we find that this method also provides more stable results.
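To make the $\lambda_{\min}$ issue concrete, here is an illustrative sketch of the standard KS-minimizing selection of clauset2009power (the approach we replace with fix-finger), using the continuous-PL maximum-likelihood estimate $\hat{\alpha} = 1 + n / \sum_i \log(\lambda_i / \lambda_{\min})$:

```python
import numpy as np

def fit_pl_alpha(eigs, lam_min):
    """Continuous maximum-likelihood estimate of alpha for the tail >= lam_min."""
    tail = eigs[eigs >= lam_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / lam_min))

def ks_stat(eigs, lam_min, alpha):
    """KS distance between the empirical tail CDF and the fitted PL CDF."""
    tail = np.sort(eigs[eigs >= lam_min])
    empirical = np.arange(1, len(tail) + 1) / len(tail)
    fitted = 1.0 - (tail / lam_min) ** (1.0 - alpha)  # CDF of the PL in (1)
    return np.max(np.abs(empirical - fitted))

def clauset_xmin(eigs, stride=20):
    """Scan candidate lam_min values; keep the (lam_min, alpha) minimizing KS."""
    best = (np.inf, None, None)
    for lam_min in np.unique(eigs)[::stride][:-1]:  # leave a few tail points
        alpha = fit_pl_alpha(eigs, lam_min)
        d = ks_stat(eigs, lam_min, alpha)
        if d < best[0]:
            best = (d, lam_min, alpha)
    return best[1], best[2]
```

The scan over candidates is what makes this method slow, which is one reason the fix-finger shortcut is attractive.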
Comparing PL and ETPL fitting. Referring to Figure 1, we now discuss how ETPL can partially address these fitting issues. In the first row of Figure 1, we show three typical cases of PL fitting. In Figure 1(a), the log-log scale reveals a “linear region” of the histogram, which the PL fit correctly locates. The quality of the fit, measured by the ks_distance, is within a typical range, as reported in Table 5 of martin2018implicit_JMLRversion. In Figure 1(b) and Figure 1(c), the ESDs do not exhibit a clear linear region on the log-log scale. Following martin2018implicit_JMLRversion, it is ill-advised to consider metrics derived from a PL fit in these scenarios. In practice, this typically occurs when the fitted alpha is large (e.g., see Figure 1(c)). On the other hand, in these two cases, the corresponding ETPL fits (shown in the second row of Figure 1) still closely match the empirical density function (see Figure 1(e) and Figure 1(f)), and the ks_distance in the second row using an ETPL fit is smaller than that for the PL fit in the first row, even though the fit in the second row clearly covers a larger part of the ESD. (We note that the value of ks_distance can be made smaller simply by restricting the fit to a smaller part of the distribution, as is often done in practice by optimizing the $\lambda_{\min}$ in the (truncated) PL distribution (1). This potential bias is alleviated by using the fix-finger method.) In these two cases, the ETPL exponent plays a similar role as alpha in PL fitting, and it provides an effective alternative when the ESD does not exhibit a proper PL.
Across these PL and ETPL fits, we would like to point out that the important thing in HTSR is not the PL fit per se but that the spectral distributions exhibit HT or other nonstandard shapes; the particular forms of the distributions fit here simply constitute different ways to quantify this property in practice. Such details, including selecting the most appropriate distributional assumptions, clearly matter if we would like to engineer the tools of HT analysis to effectively measure the ground truth. However, the primary concern in predicting generalization is to measure the shape information, and the shape information is independent of the fitting procedure, although better fitting procedures may capture it better.
4 Empirical results
In this section, we first give full details of the experimental setup. Then, we provide the analyses of the empirical results.
4.1 Experimental setup
We consider two machine translation datasets, WMT14 bojar2014findings and IWSLT cettolo2014report, which are commonly used as benchmarks for neural machine translation ott2018scaling; vaswani2017attention; shen2020simple; edunov2018understanding. For both datasets, we use German to English (DE-EN). IWSLT consists of 160K parallel bilingual sentence pairs for training, and it is a relatively small-scale dataset. WMT14, on the other hand, is of medium-to-large scale, and it consists of 4.5 million sentence pairs for training. To describe a more holistic picture of the relationship between the generalization metrics and dataset size, and to address the practical concern of training with different amounts of data from custom datasets, we subsample the two datasets with different numbers of samples. Specifically, for IWSLT, we study five cases with {10K, 20K, 40K, 80K, 160K} samples. For WMT14, we study four cases with {160K, 320K, 640K, 1.28M} samples. We intentionally overlap the right endpoint (160K) of IWSLT with the left endpoint (160K) of WMT14 to study the difference between the two datasets. Similarly, we also study five different values of network depth, namely {2, 3, 4, 5, 6}-layer Transformers, and five different levels of initial learning rate. For each combination of (dataset, samples, depth, learning rate), we train both with and without dropout, and we use dropout probability 0.1 when training with dropout. We include the case of training without dropout because we want to study the performance of generalization metrics when there is some extent of overfitting. In total, there are 50 training settings; see Appendix B for their details. We train with three random seeds for each setting.

We follow exactly the training setup in vaswani2017attention, and we develop our implementation based on an online repository (https://github.com/gordicaleksa/pytorchoriginaltransformer) that reproduces the results from vaswani2017attention with more easily configurable Transformer architectures. We use Transformer-base with 8 attention heads and an embedding dimension of 512 for both datasets. As mentioned earlier, the number of layers ranges from 2 to 6. We train with the inverse square-root learning rate schedule and 10% label smoothing. For each experiment, we train the model for 20 epochs. When calculating the ESDs of the weight matrices, we treat the query, key, and value matrices as separate weight matrices.
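For reference, the inverse square-root schedule from vaswani2017attention can be written as:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Learning rate from Vaswani et al. (2017): linear warmup for `warmup`
    steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` intersect exactly at `step == warmup`, where the schedule peaks.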
4.2 Performance of generalization metrics
In this subsection, we study 36 generalization metrics (with details provided in Table 1), and we study how well they correlate with the BLEU score papineni2002bleu. We use the BLEU score because it is the most commonly used metric to evaluate machine translation. Also, we are mostly interested in measuring the correlation between the generalization metrics and the test-time BLEU score directly, and we are less interested in the correlation between the generalization metrics and the generalization gap, which is defined as the training BLEU score minus the test BLEU score. Nonetheless, we provide the correlation measurements for both the test-time BLEU score and the generalization gap.
4.2.1 ETPL exponent tracks the BLEU score
As a warm-up, we use our metric exponent, defined in (2), to track the BLEU score, recalling that exponent is derived under the assumption that the ESDs follow ETPLs. We use dropout to study the effect of training schemes, and we consider different quantities of data to test robustness to the dataset. Referring to Figure 3, the first row considers models trained with dropout, while the second row considers models trained without dropout. The multiple columns track exponent and the BLEU score throughout training for different amounts of data. From the results in Figure 3, we can see that exponent not only tracks the BLEU scores but also differentiates underfitting (first row, with dropout) from overfitting (second row, without dropout) in this experiment.
4.2.2 Rank correlation
To systematically evaluate the various metrics considered in this paper, we study the rank correlation between these metrics and the BLEU score. For each of the 50 hyperparameter settings and each random seed, we calculate the Spearman's rank correlation between the BLEU scores and the values of each generalization metric over all epochs. The summarized results are presented in Figure 2(a). A positive Spearman's rank correlation (with BLEU) in the plot means that the generalization metric is useful in tracking BLEU during training. A negative Spearman's rank correlation, on the other hand, means that the metric often gives the wrong prediction. In Figure 2(a), we use the average rank correlation over all settings to study the effectiveness of each metric, and we present the 25% quantile of the rank correlations to indicate robustness across runs.
From Figure 2(a), we see that shape metrics, such as exp_dist_exponent, entropy, mp_softrank, stable_rank, rand_distance, exponent, alpha, and ks_distance, are all among the effective generalization metrics that have a high rank correlation with the BLEU score, which measures the model quality of machine translation. In particular, we see that the rand_distance metric achieves the highest 25% quantile of rank correlations. The exponent metric, which assumes an ETPL distribution on the ESDs, achieves the second highest 25% quantile (we discuss the problem with the inverse_margin metric in Appendix C). The exp_dist_exponent metric, which corresponds to the EXP fit, gives the best average rank correlation. On the other hand, norm-based metrics, such as log_spectral_norm, prove ineffective for measuring model quality.
Details of the rank correlation calculations. When calculating the rank correlation, for each generalization metric, we need to associate a positive/negative sign to indicate whether the metric should be positively or negatively correlated with generalization. For example, for the power-law coefficient alpha, it has been shown in martin2018implicit_JMLRversion; martin2019traditional; martin2020heavy; martin2020predicting_NatComm that a smaller alpha indicates better model quality. Thus, we associate a negative sign with this metric. By contrast, a larger rand_distance implies better model quality martin2018implicit_JMLRversion, and so we associate a positive sign with this metric.
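A small illustration of this sign convention, with hypothetical per-epoch values (the numbers below are made up for the example):

```python
from scipy.stats import spearmanr

# Hypothetical values over six epochs of one training run.
alpha_per_epoch = [5.1, 4.6, 4.0, 3.5, 3.2, 3.0]    # smaller is better
bleu_per_epoch = [10.2, 14.8, 19.5, 22.1, 24.0, 24.9]

# alpha carries a negative sign: negate it so that a *positive* rank
# correlation with BLEU means the metric tracks model quality correctly.
rho, _ = spearmanr([-a for a in alpha_per_epoch], bleu_per_epoch)
```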
4.3 Analysis on the datadependent metrics
Through the empirical analysis of various metrics, we have obtained some observations that could partially explain why existing data-dependent generalization metrics do not perform well on NLP tasks.

Scale metrics correctly predict the generalization gap rather than the model quality. We also provide results on the rank correlation between the various generalization metrics and the generalization gap, defined as the training BLEU score minus the test BLEU score. See the results in Figure 2(b). It is encouraging to see that most existing generalization metrics give the correct predictions on this task. However, as we have discussed, we are less interested in this task, because correctly predicting the trends of the generalization gap does not automatically yield predictions of the best-performing models.

The PL metrics do not predict the generalization gap. Since the PL metrics measure self-regularization martin2019traditional, it is natural to ask whether, like the data-dependent generalization metrics, the PL metrics also correlate with the generalization gap. Figure 2(b) shows that this is not the case.
We expect these observations to be relevant and useful for improving existing generalization metrics, especially for NLP tasks that have not been thoroughly investigated before.
Acknowledgements. We would like to acknowledge the IARPA (contract W911NF20C0035), NSF, and ONR for providing partial support of this work. Kannan Ramchandran would like to acknowledge support from NSF CIF-2007669, CIF-1703678, and CIF-2002821. Joseph E. Gonzalez would like to acknowledge support from NSF CISE Expeditions Award CCF-1730628 and gifts from Alibaba Group, Amazon Web Services, Ant Group, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk, and VMware. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
References
Appendix A Generalization metrics
In this section, we provide the details of the generalization metrics considered in our analysis. We first define the scale metrics. Then, we define the shape metrics obtained from the ESDs of the weight matrices. Although our focus is on data-independent generalization metrics, we also define generalization metrics based on margins bartlett2017spectrally; pitas2017pac and PAC-Bayesian bounds mcallester1999pac; neyshabur2017pac.
A.1 Scale metrics
We first define the metrics motivated by matrix norms and the distance to initialization.
Norm-based and distance-based metrics.
In the following, we discuss multiple metrics obtained from the norms of the weights or the distance between the weights and their initialization.
A careful reader might notice that the metrics in this subsection are sometimes averaged over the layers and sometimes summed over the layers. This inconsistency is simply because we follow the exact definitions from several prior papers. It is worth noting that the rank correlation for a single training run, which is the major tool used to compare different metrics, is independent of whether the norm is averaged or summed, as long as the network size does not change during one training run. However, to compare networks with different sizes, proper normalization is necessary.
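This invariance is easy to check: dividing a metric by the (fixed) depth $L$ is a strictly monotone transformation, so Spearman's rank correlation is unchanged. A toy example with made-up numbers:

```python
from scipy.stats import spearmanr

summed = [12.0, 15.5, 9.8, 20.1, 17.3]   # a metric summed over layers
target = [0.31, 0.28, 0.35, 0.22, 0.25]  # e.g., test error per epoch
depth = 6
averaged = [s / depth for s in summed]   # the same metric averaged over layers

rho_sum, _ = spearmanr(summed, target)
rho_avg, _ = spearmanr(averaged, target)
```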


(l2). The $\ell_2$ norm of the vectorized model weights.
(5)  $\|\mathbf{w}\|_2$
(l2_dist). The $\ell_2$ distance between the vectorized model weights and the vectorized initial weights.
(6)  $\|\mathbf{w} - \mathbf{w}^{0}\|_2$
(param_norm). The squared Frobenius norm summed over all weight matrices.
(7)  $\sum_{i=1}^{L} \|\mathbf{W}_i\|_F^2$
(fro_dist). The distance between each weight matrix and its initialized value, calculated using the Frobenius norm and summed over all layers.
(8)  $\sum_{i=1}^{L} \|\mathbf{W}_i - \mathbf{W}_i^{0}\|_F$
(log_norm).
(9) 
(log_sum_of_fro).
(10) 
(log_spectral_norm).
(11) 
(dist_spec_int).
(12) 
(log_prod_of_fro).
(13) 
(log_sum_of_spec).
(14) 
(log_prod_of_spec).
(15) 
(path_norm). The metric is introduced in neyshabur2015norm. To calculate the metric, we square the parameters of the network, do a forward pass on an all-ones input, and then take the sum of the network outputs.
(16)
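Several of the scale metrics above can be sketched in a few lines of numpy. The two-layer toy network, the random weights, and the exact per-layer sum/log conventions below are our illustrative assumptions; the paper's precise definitions are given in the corresponding equations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrices of a small bias-free two-layer network.
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1_init, W2_init = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
weights, inits = [W1, W2], [W1_init, W2_init]

# (l2): norm of all weights vectorized together.
l2 = float(np.sqrt(sum(np.sum(W ** 2) for W in weights)))

# (fro_dist): squared Frobenius distance to initialization, summed over layers.
fro_dist = float(sum(np.linalg.norm(W - W0, "fro") ** 2
                     for W, W0 in zip(weights, inits)))

# (log_spectral_norm), under one common convention: log of the squared
# largest singular value, summed over layers.
log_spectral_norm = float(sum(np.log(np.linalg.norm(W, 2) ** 2)
                              for W in weights))

# (path_norm): square the weights, forward an all-ones input, sum the outputs.
# With squared (nonnegative) weights and a nonnegative input, ReLUs act as
# the identity, so plain matrix products suffice for this bias-free sketch.
x = np.ones(3)
for W in weights:
    x = (W ** 2) @ x
path_norm = float(x.sum())
```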
Scale metrics that require more shape information from the ESDs. The following metrics require more than just scale information. They either come from specific forms of combined norms that roughly describe the shape of the ESDs, or from the Marchenko–Pastur (MP) fitting of the ESDs, which makes them different from pure norm-based metrics.


(mp_softrank). The metric is introduced in martin2018implicit_JMLRversion. To calculate this metric, we fit the MP distribution on the ESD, obtain the bulk maximum of the fitted MP distribution, and divide it by the maximum eigenvalue of the ESD.
(17) 
(stable_rank). The metric is a norm-adjusted measure of the scale of the ESD.
(18)
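A minimal sketch of the stable rank, using the standard definition $\|W\|_F^2 / \|W\|_2^2$ (the ratio of the sum of squared singular values to the largest squared singular value); the function name is ours.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2: a norm-adjusted scale of the ESD.
    It never exceeds the true rank and is robust to small singular values."""
    fro2 = np.linalg.norm(W, "fro") ** 2    # sum of squared singular values
    spec2 = np.linalg.norm(W, 2) ** 2       # largest squared singular value
    return float(fro2 / spec2)

# A flat spectrum recovers the true rank; a dominated spectrum shrinks it.
print(stable_rank(np.eye(5)))               # 5.0
print(stable_rank(np.diag([10.0, 1.0, 1.0])))  # (100 + 1 + 1) / 100 = 1.02
```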
A.2 Shape metrics
Tail-exponent fitting. The following metrics are derived from fitting the ESDs using a heavy-tailed or light-tailed distribution.


(alpha). The slope of the tail of the ESD on a log-log scale. We use the MLE from alstott2014powerlaw to estimate alpha. The distribution of eigenvalues is assumed to have the form of (1).

(exponent). The tail exponent of the ETPL fitting of the ESD. This is a new generalization metric introduced in this paper.

(exp_dist_exponent). The tail exponent of the EXP fitting of the ESD, under the assumption that the ESD follows an exponential distribution (3). This is a new generalization metric introduced in this paper. 
(ks_distance). The Kolmogorov–Smirnov (KS) goodness-of-fit test of the power-law fitting.
(19) where the distance is measured between the fitted power-law distribution on the ESD and the empirical ESD itself.
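The power-law tail fit can be sketched with the standard continuous MLE (Hill-type) estimator, $\alpha = 1 + n / \sum_i \log(\lambda_i / \lambda_{\min})$; this is a simplified stand-in for the alstott2014powerlaw fit used in the paper, and the synthetic-data check below is our own.

```python
import numpy as np

def fit_alpha(eigs, xmin):
    """Continuous power-law MLE for the tail exponent:
    alpha = 1 + n / sum(log(lambda_i / xmin)) over eigenvalues >= xmin."""
    tail = np.asarray([e for e in eigs if e >= xmin], dtype=float)
    return float(1.0 + len(tail) / np.sum(np.log(tail / xmin)))

# Sanity check on synthetic Pareto data with known tail exponent alpha = 3:
# inverting the survival function x^-(alpha-1) gives x = u^(-1/(alpha-1)).
rng = np.random.default_rng(0)
samples = (1.0 - rng.random(200_000)) ** (-1.0 / 2.0)
print(fit_alpha(samples, xmin=1.0))   # close to 3.0
```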
Vector localization. The following metrics are derived from the localization of eigenvectors. In this part, we use $\lambda_{i,j}$ and $\mathbf{v}_{i,j}$ to denote the $j$-th eigenvalue and eigenvector of the $i$-th correlation matrix $\mathbf{X}_i$.


(tail_mean_vec_entropy). The mean of the vector entropies of the eigenvectors corresponding to eigenvalues on the tail of the ESD.
(20) 
(bulk_mean_vec_entropy). The mean of the vector entropies of the eigenvectors corresponding to eigenvalues on the bulk of the ESD.
(21)
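A sketch of one natural reading of "vector entropy": the Shannon entropy of an eigenvector's squared entries, so that a localized eigenvector scores low and a delocalized one scores high. The exact normalization in the paper's definitions may differ; the function name is ours.

```python
import numpy as np

def vector_entropy(v):
    """Shannon entropy of an eigenvector's squared entries, treated as a
    probability distribution: low for localized, high for delocalized."""
    p = np.abs(v) ** 2
    p = p / p.sum()
    p = p[p > 0]                       # drop zero entries (0 * log 0 := 0)
    return float(-(p * np.log(p)).sum())

localized = np.array([1.0, 0.0, 0.0, 0.0])   # all mass on one entry
spread = np.ones(4) / 2.0                    # mass spread evenly
print(vector_entropy(localized))             # 0.0
print(vector_entropy(spread))                # log(4) ~ 1.386
```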
Metrics from the eigenvalues. The following two metrics are derived using the eigenvalues of a weight matrix normalized as probabilities. Recall that normalizing the squared singular values of a weight matrix yields a discrete probability distribution, and that binning this distribution yields an empirical distribution.


(entropy). The entropy of the squared eigenvalues of a weight matrix, normalized as probabilities. This metric is also known as the generalized von Neumann matrix entropy.
(22) where $R$ is the rank of the matrix.

(rand_distance). The distance in distribution from the randomized layer. See the definition in Eqn. (4).
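The matrix entropy above can be sketched as follows. We assume the common normalization by the log of the rank, so that a flat spectrum attains the maximal value of 1; if the paper's definition normalizes differently, only the constant factor changes.

```python
import numpy as np

def matrix_entropy(W):
    """Generalized von Neumann entropy sketch: Shannon entropy of the squared
    singular values normalized as probabilities, divided by log(rank).
    Assumes rank(W) > 1 so the normalizer is nonzero."""
    s2 = np.linalg.svd(W, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 1e-12]                       # drop numerically zero eigenvalues
    rank = np.linalg.matrix_rank(W)
    return float(-(p * np.log(p)).sum() / np.log(rank))

print(matrix_entropy(np.eye(4)))           # 1.0: flat spectrum, maximal entropy
```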
A.3 Hybrid metrics
The following metrics are scaled versions of alpha. They require both the shape information from alpha and the scale information from other weight norms.


(alpha_weighted). A scale-adjusted form of alpha. This metric is called $\hat{\alpha}$ in martin2018implicit_JMLRversion, martin2020predicting_NatComm, MM21a_simpsons_TR.
(23) 
(log_alpha_norm). This metric is another scale-adjusted alpha metric in the form of a Schatten norm. Recall that we use $\{\lambda_i\}$ to denote the set of eigenvalues of the correlation matrix $\mathbf{X} = \frac{1}{N}\mathbf{W}^{\top}\mathbf{W}$, where $\mathbf{W}$ is the $N$ by $M$ weight matrix. Then, we can define the Schatten norm as the following.
(24) Using this definition, we can define the metric log_alpha_norm as the following.
(25)
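A hedged sketch of the Schatten-norm computation: we take the log of the sum of correlation-matrix eigenvalues raised to a fitted tail exponent. Whether the correlation matrix includes the $1/N$ normalization, and the exact power convention, should be checked against Eqn. (24); the random matrix and fixed alpha below are illustrative only.

```python
import numpy as np

def log_alpha_norm(W, alpha):
    """Log of the alpha-Schatten-type norm: log(sum_i lambda_i^alpha), where
    lambda_i are eigenvalues of the correlation matrix X = W^T W / N.
    The 1/N normalization is our assumption for this sketch."""
    N = W.shape[0]
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    eigs = eigs[eigs > 1e-12]              # keep numerically nonzero eigenvalues
    return float(np.log(np.sum(eigs ** alpha)))

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 20))
val = log_alpha_norm(W, alpha=2.5)         # alpha would come from the ESD fit
```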
A.4 Margin-based metrics
Then, we discuss generalization metrics motivated from margins. Recall that we use $f_{\mathbf{w}}$ to denote the neural network with weights $\mathbf{w}$. For a multiclass classification problem with sample–label pair $(\mathbf{x}, y)$, we define the margin as the following.
(26) 
For machine translation, we consider the margin of each output token. We note that the number of classes, i.e., the number of possible tokens, is particularly large for machine translation: at least on the order of thousands.
Following jiang2019fantastic, we consider output margins only (we note that margins can also be defined at any layer elsayed2018large, yang2020boundary, wei2019improved), and we use the 10th percentile of the margin distribution calculated over the entire training set as a robust surrogate for the minimum margin. Using this 10th-percentile margin, we define several generalization metrics.
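The 10th-percentile output margin described above can be computed directly from the logits: for each sample, subtract the largest competing logit from the correct-class logit, then take the 10th percentile. The toy logits and the helper name below are ours.

```python
import numpy as np

def tenth_percentile_margin(logits, labels):
    """Per-sample margin: correct-class logit minus the largest other logit.
    The 10th percentile is a robust surrogate for the minimum margin."""
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf   # exclude the correct class
    margins = correct - masked.max(axis=1)
    return float(np.percentile(margins, 10))

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.0, 0.9],
                   [3.0, 2.5, 2.9]])
labels = np.array([0, 1, 2])
print(tenth_percentile_margin(logits, labels))   # ~ -0.06: negative, since the
                                                 # third sample is misclassified
```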


(inverse_margin).
(27) 
(log_prod_of_spec_over_margin).
(28) 
(log_sum_of_spec_over_margin).
(29) 
(log_prod_of_fro_over_margin).
(30) 
(log_sum_of_fro_over_margin).
(31) 
(path_norm_over_margin).
(32)
A.5 Metrics derived from PAC-Bayesian bounds
Several well-known generalization bounds are derived using the PAC-Bayesian framework, which bounds the generalization gap using the KL-divergence between a predefined prior distribution (usually chosen as Gaussian) and the posterior distribution of the trained models. The key component of the PAC-Bayesian bounds used in most existing implementations is the procedure of searching for the largest magnitude of Gaussian perturbation, denoted $\sigma$, such that the perturbed weights of the neural network achieve a bounded increase in the training loss. More specifically, $\sigma$ is defined such that
(33) 
where $\delta$ is a predetermined threshold, chosen as 0.5 in the experiments on machine translation. Similarly, one can define a “magnitude-aware” perturbation with magnitude $\sigma'$, such that
(34) 
where each entry of the perturbation is Gaussian with standard deviation scaled by the magnitude of the corresponding weight, and $\epsilon$ is chosen as $10^{-3}$ dziugaite2020search. Given the perturbation magnitude $\sigma$, the magnitude-aware perturbation magnitude $\sigma'$, and the number of samples $m$, one can define the following generalization metrics.

(pacbayes_init).
(35) 
(pacbayes_orig).
(36) 
(pacbayes_flatness).
(37) 
(pacbayes_mag_init).
(38) 
(pacbayes_mag_orig).
(39) 
(pacbayes_mag_flatness).
(40)
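The $\sigma$ search in Eqn. (33) is typically implemented as a binary search over the perturbation magnitude, estimating the perturbed loss by Monte Carlo sampling. The sketch below uses a toy quadratic loss and hypothetical parameter names; it illustrates the search procedure, not the paper's machine translation setup.

```python
import numpy as np

def find_sigma(loss_fn, weights, delta, n_trials=200,
               lo=0.0, hi=10.0, iters=30, seed=0):
    """Binary-search the largest sigma such that the average loss of
    Gaussian-perturbed weights exceeds the unperturbed loss by at most delta."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        perturbed = np.mean([
            loss_fn(weights + mid * rng.standard_normal(weights.shape))
            for _ in range(n_trials)
        ])
        if perturbed - base <= delta:
            lo = mid          # loss increase within budget: sigma can grow
        else:
            hi = mid          # too sharp: shrink sigma
    return lo

# Toy example: quadratic loss minimized at w = 0; the expected perturbed loss
# is 3 * sigma^2, so the threshold delta = 0.5 gives sigma ~ sqrt(0.5/3) ~ 0.41.
w = np.zeros(3)
loss = lambda v: float(np.sum(v ** 2))
sigma = find_sigma(loss, w, delta=0.5)
```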
Appendix B Additional details on the experiment setup
There are 50 different settings that we consider for our experiments; see Table 2. The column titled “initial learning rate” shows the constant factor multiplied with the standard learning rate schedule. Given the embedding dimension $d$, the step number $t$, and the number of warmup steps $t_{\text{warmup}}$, the formula for the inverse square-root learning rate schedule is the following.
(41) $\mathrm{lr}(t) \;=\; d^{-0.5}\cdot\min\!\left(t^{-0.5},\; t\cdot t_{\text{warmup}}^{-1.5}\right)$ 
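The inverse square-root schedule of Eqn. (41) is the standard Transformer schedule: linear warmup to a peak at $t_{\text{warmup}}$, then decay proportional to $1/\sqrt{t}$. A minimal sketch, with `scale` standing in for the constant factor from the "initial learning rate" column of Table 2 (default values are illustrative):

```python
def inv_sqrt_lr(step, d_model=512, warmup=4000, scale=1.0):
    """Inverse square-root schedule: scale * d^-0.5 * min(t^-0.5, t * w^-1.5).
    Linear warmup for step <= warmup, then 1/sqrt(step) decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = inv_sqrt_lr(4000)
assert inv_sqrt_lr(2000) < peak       # still warming up
assert inv_sqrt_lr(16000) < peak      # decaying after warmup
```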
Appendix C Additional analysis on scale metrics
In this section, we discuss an issue that arises when computing margin-based generalization metrics. Generically, these bounds are of the form
$$ L_{0} \;\le\; \hat{L}_{\gamma} + C, $$
where $L_{0}$ is the population error, $\hat{L}_{\gamma}$ is the training margin loss at margin $\gamma$, typically the fraction of training samples with margin at most $\gamma$,
$$ \hat{L}_{\gamma} \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}\{\mathrm{margin}(\mathbf{x}_i, y_i) \le \gamma\}, $$
and $C$ is a complexity term. First, note that this construction requires the margin $\gamma$ to be positive. Moreover, the training margin loss is an increasing function of $\gamma$, while the complexity term is decreasing in $\gamma$; thus, the conventional way of using the margin bound is to optimize over $\gamma$ to balance the two terms bartlett2017spectrally, rather than prespecifying a value of the margin dependent on the data. However, we choose to follow the related papers dziugaite2020search, jiang2019fantastic, and we use the 10th-percentile margin as a robust estimate of the minimum margin on the dataset. We use this margin in all of the margin-normalized generalization metrics. However, in all of the experiments on machine translation, the 10th-percentile margin remains negative throughout the whole training, violating the requirement that the bound be evaluated at a positive margin. See Figure 4. This problem results from the large alphabet (token vocabulary) in machine translation, which makes it difficult to fully interpolate the data, and hence makes the margin-normalized generalization metrics of dziugaite2020search, jiang2019fantastic difficult to apply in the present setting.
Index  Purpose of the experiment  Dataset  Number of samples  Initial learning rate  Network depth  Dropout (Yes/No)  Number of training epochs 
0  Different amount of data in IWSLT  IWSLT  10K  1  6  Yes  20 
1  IWSLT  10K  1  6  No  20  
2  IWSLT  20K  1  6  Yes  20  
3  IWSLT  20K  1  6  No  20  
4  IWSLT  40K  1  6  Yes  20  
5  IWSLT  40K  1  6  No  20  
6  IWSLT  80K  1  6  Yes  20  
7  IWSLT  80K  1  6  No  20  
8  IWSLT  160K  1  6  Yes  20  
9  IWSLT  160K  1  6  No  20  
10  IWSLT  160K  0.75  6  Yes  20  
11  IWSLT  160K  0.75  6  No  20  
12  IWSLT  160K  0.5  6  Yes  20  
13  IWSLT  160K  0.5  6  No  20  
14  IWSLT  160K  0.375  6  Yes  20  
15  IWSLT  160K  0.375  6  No  20  
16  IWSLT  160K  0.25  6  Yes  20  
17  Different learning rate in IWSLT  IWSLT  160K  0.25  6  No  20 
18  Different network depth in IWSLT  IWSLT  160K  1  5  Yes  20 
19  IWSLT  160K  1  5  No  20  
20  IWSLT  160K  1  4  Yes  20  
21  IWSLT  160K  1  4  No  20  
22  IWSLT  160K  1  3  Yes  20  
23  IWSLT  160K  1  3  No  20  
24  IWSLT  160K  1  2  Yes  20  
25  IWSLT  160K  1  2  No  20  
26  WMT  160K  1  6  Yes  20  
27  WMT  160K  1  6  No  20  
28  WMT  320K  1  6  Yes  20  
29  WMT  320K  1  6  No  20  
30  WMT  640K  1  6  Yes  20  
31  WMT  640K  1  6  No  20  
32  WMT  1.28M  1  6  Yes  20  
33  Different amount of data in WMT  WMT  1.28M  1  6  No  20 
34  Different learning rate in WMT  WMT  1.28M  0.75  6  Yes  20 
35  WMT  1.28M  0.75  6  No  20  
36  WMT  1.28M  0.5  6  Yes  20  
37  WMT  1.28M  0.5  6  No  20  
38  WMT  1.28M  0.375  6  Yes  20  
39  WMT  1.28M  0.375  6  No  20  
40  WMT  1.28M  0.25  6  Yes  20  
41  WMT  1.28M  0.25  6  No  20  
42  WMT  1.28M  1  5  Yes  20  
43  WMT  1.28M  1  5  No  20  
44  WMT  1.28M  1  4  Yes  20  
45  WMT  1.28M  1  4  No  20  
46  WMT  1.28M  1  3  Yes  20  
47  WMT  1.28M  1  3  No  20  
48  WMT  1.28M  1  2  Yes  20  
49  Different network depth in WMT  WMT  1.28M  1  2  No  20 