Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data

02/06/2022
by   Yaoqing Yang, et al.

The search for effective and robust generalization metrics has been the focus of recent theoretical and empirical work. In this paper, we discuss the performance of natural language processing (NLP) models, and we evaluate various existing and novel generalization metrics. Compared to prior studies, we (i) focus on NLP instead of computer vision (CV), (ii) focus on generalization metrics that predict test error instead of the generalization gap, (iii) focus on generalization metrics that do not need access to data, and (iv) focus on the heavy-tail (HT) phenomenon that has received comparatively less attention in the study of deep neural networks (NNs). We extend recent HT-based work that focuses on power law (PL) distributions, and we study exponential (EXP) and exponentially truncated power law (E-TPL) fits to the empirical spectral densities (ESDs) of weight matrices. Our detailed empirical studies show that (i) shape metrics, i.e., metrics obtained from fitting the shape of the ESDs, perform uniformly better at predicting generalization performance than the scale metrics commonly studied in the literature, as measured by the average rank correlations with generalization performance across all of our experiments; (ii) among the forty generalization metrics studied in this paper, the rand_distance metric, a new shape metric introduced here that measures the distance between the empirical eigenvalues of weight matrices and those of randomly initialized weight matrices, achieves the highest worst-case rank correlation with generalization performance under a variety of training settings; and (iii) among the three HT distributions considered in our paper, the E-TPL fitting of ESDs performs the most robustly.


1 Introduction

Recent years have seen rising interest in large-scale empirical studies of the various metrics used to quantify generalization jiang2019fantastic; dziugaite2020search; martin2020predicting_NatComm; MM21a_simpsons_TR. On the one hand, theory-driven metrics have the potential to reveal more information than test error and thus bring us one step closer to unpacking the black box of deep NNs nakkiran2019deep; zhang2021understanding; frankle2018lottery. On the other hand, a wide variety of generalization metrics have been applied to predict the quality of pretrained models martin2020predicting_NatComm; martin2019traditional, design effective training procedures foret2020sharpness; izmailov2018averaging, improve network efficiency chen2021neural; dong2019hawq, quantify network robustness yang2020boundary; tanay2016boundary, improve ensemble learning techniques garipov2018loss; fort2019deep, analyze and improve large-scale contests MM21a_simpsons_TR, and so on.

Despite advances in the study of generalization, however, several recent papers point out the deficiencies of many of these “fantastic” generalization metrics. These include a lack of “robustness” to changes in environmental hyperparameters jiang2019fantastic; dziugaite2020search (such as data, network architecture, and training schemes), or the Simpson’s paradox that generalization metrics perform differently (i.e., predict opposite trends) when applied to each sub-part of a collection of learning models than when applied to the collection as a whole MM21a_simpsons_TR. Another drawback is the over-reliance on experiments with CV models, which are relatively well-explored, and which are not representative of many other application areas. Aside from a few exceptions nakkiran2019deep; martin2020predicting_NatComm; yang2021taxonomizing, systematic studies of generalization in other fields, such as NLP, are largely missing.

Generalization metrics for NLP. The objective of this paper is to provide a systematic study of generalization metrics that addresses issues which have not received proper attention in prior studies jiang2019fantastic; dziugaite2020search; martin2020predicting_NatComm. We primarily focus on NLP. Compared to CV, predicting generalization in NLP has several important differences that require careful consideration. The first is that training NLP models to completely interpolate the training data is often impossible, due to the web-scale size of the training data. This creates an issue when applying most existing generalization metrics, because most of them focus on predicting the generalization gap (i.e., the difference between training and test performance) rather than the test error itself.

To see why focusing on the generalization gap is an issue, consider the most common application used to motivate the study of generalization metrics: comparing two models MM21a_simpsons_TR; jiang2020neurips. (As the report of the NeurIPS 2020 Competition on Predicting Generalization in Deep Learning jiang2020neurips points out, a generalization metric should be able to order models' performance in a way similar to the generalization gap, so that it can be used for model selection or neural architecture search; however, see MM21a_simpsons_TR for a detailed exposition of issues with this.) Suppose two classification models A and B have 5% and 10% training error, respectively. In this case, even if a generalization metric correctly predicts that model B has a smaller generalization gap than model A, and even if we know the training errors of both models, it is still unclear whether model B indeed has a smaller test error than A. In this paper, we aim to study generalization metrics that correlate with model quality, and we use test error as a close approximation of model quality. On the other hand, as we will demonstrate, (rank) correlation with the generalization gap does not imply (rank) correlation with model quality, as many of the generalization metrics we study are on a substantially different scale from the values of the test error. Metrics that focus on predicting the generalization gap include most of the well-known metrics in CV, such as those based on the PAC-Bayesian framework neyshabur2017pac; mcallester1999pac and margins bartlett2017spectrally; pitas2017pac; jiang2018predicting.

Therefore, from a practical point of view, for NLP tasks, we prefer generalization metrics that can directly predict the trends in test error (or similar evaluation metrics in NLP, such as the test BLEU score papineni2002bleu) rather than trends in the generalization gap, so that we learn information closer to the model quality. Although this paper focuses on data-independent metrics, we also provide results for data-dependent metrics motivated by margins and PAC-Bayesian bounds jiang2019fantastic; dziugaite2020search. Although these metrics perform well in predicting the generalization gap, we show that none of them satisfactorily predicts test error directly.

The second difference between NLP and CV models is that the data for NLP pretraining are usually web-scale and hard to access and use, whereas the training data from standard CV benchmarks can often be easily obtained. Therefore, it would be ideal if the generalization metric under study could measure the quality of learning models without access to data. In this paper, we focus on generalization metrics that do not need access to data. Although perhaps surprising, recent work has shown that access to training or testing data is not necessary for assessing model quality martin2020predicting_NatComm.

With these objectives in mind, among the generalization metrics in the literature, we take particular interest in those derived from heavy-tail self-regularization (HT-SR) theory martin2019traditional; martin2018implicit_JMLRversion, which (i) predict test error directly instead of the generalization gap and (ii) do not require access to training (or testing) data. In addition to these two advantages, several other benefits are worth mentioning. First, it is known that NLP training is harder than benchmark image classification tasks, and its optimization loss landscape can be problematic yang2021taxonomizing. Therefore, the evaluation of NLP models should place more emphasis on the quality of the entire training process instead of the “final stage of data interpolation.” Fortunately, being able to do this is a known advantage of HT-SR theory hodgkinson2021multiplicative; hodgkinson2021generalization. Second, real data often follow heavy-tail distributions feldman2020does, which can be even more evident in NLP than in the more well-behaved datasets in CV li2017deep that are often used to study generalization.

HT-SR theory. The core principle of HT-SR theory martin2018implicit_JMLRversion; martin2019traditional; martin2020predicting_NatComm; MM21a_simpsons_TR is that HT structures can arise naturally in the ESDs of the weight matrices as the result of extracting various correlations in the data during optimization. The main technique used in these papers is to estimate the PL coefficient from the ESDs (which only requires access to the weights), with smaller coefficients reported to correspond to higher model quality. However, these estimators can be sensitive to empirical noise, and so one must be careful not to rely on them alone. For example, the quality of the PL fit and the so-called localization of eigenvectors martin2018implicit_JMLRversion should all point to similar conclusions, which serves as a sanity check. Fortunately, for the large Transformer models vaswani2017attention; devlin2018bert typically used in modern NLP tasks, there are many large linear layers, which allows for greater accuracy in the PL estimators.

The principles of HT-SR theory extend beyond fitting the PL coefficient, however, as ESDs can take many forms. To this end, we study three different types of distributions to fit to the ESDs of weight matrices: power law (PL) in Eqn. (1), exponentially truncated power law (E-TPL) in Eqn. (2), and exponential (EXP) in Eqn. (3). These are all commonly considered families of distributions in classical studies of PLs clauset2009power, and it is often hard in practice to predict which family fits the data best (as we show in this paper, this is especially true for deep NNs). Figure 1 shows examples of different HT fittings on the same ESD.

(a) Small ks_distance.
(b) Mediocre ks_distance.
(c) Large ks_distance.
(d) E-TPL fitting of the same ESD.
(e) E-TPL fitting of the same ESD.
(f) E-TPL fitting of the same ESD.
Figure 1: Comparing PL and E-TPL fitting. (First row). Good, mediocre, and bad PL fittings measured by the ks_distance. (Second row). E-TPL fitting of the ESD in the same column. Blue histograms represent the ESDs. Solid vertical lines represent the lower threshold $\lambda_{\min}$ found by the fitting procedure. Solid curves represent ESDs truncated using $\lambda_{\min}$, and dashed curves represent the fitted HT distributions.

When used appropriately, we will find that the various metrics derived from HT-SR, which (following MM21a_simpsons_TR) we call shape metrics, uniformly perform better than scale metrics (or norm-based metrics) in our empirical results. Furthermore, while the calculation of more subtle metrics (e.g. those derived from PAC-Bayes bounds) is slow when the data is large, metrics in HT-SR theory are derived from weights and are often much faster to compute.

The following summarizes our main contributions.


  • Unlike prior papers on generalization metrics that focus on CV jiang2019fantastic; dziugaite2020search, we provide the first systematic empirical study of various generalization metrics in NLP. Compared to existing studies that focus primarily on limited-sized models and data, we consider more practical settings, with Transformers and medium-to-large scale data (e.g., million-scale datasets such as WMT14).

  • HT structure in the ESDs is identified as one of the most effective signals for predicting generalization in NLP. In our empirical results, shape metrics obtained from the HT ESDs of weight matrices perform uniformly better than norm-based/scale-based metrics for predicting model quality.

  • We extend prior studies on HT-SR theory and investigate alternative models to fit heavy-tail/light-tail distributions. Our results show that the rand_distance metric, a novel metric quantifying the distance of the ESD (in distribution) from the randomized layer, and the exponent of E-TPL fittings can be used as relatively robust generalization metrics.

2 Background

2.1 Notation and preliminaries

General notation. Consider a NN with weight matrices $\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_L$. We use $\{\mathbf{W}_l\}_{l=1}^{L}$ to denote the collection of all the weight matrices, and we denote the vector that consists of all the model weights as $\mathbf{w}$. We denote the neural network (function) as $f_{\mathbf{w}}$, which takes a single input sample $\mathbf{x}$ and outputs a vector $f_{\mathbf{w}}(\mathbf{x})$. The superscript $0$ on a weight matrix, e.g., $\mathbf{W}_l^{0}$, denotes the initial weights from which the model is trained. We use $\mathbf{X}_l$ to denote the correlation matrix of $\mathbf{W}_l$, i.e., $\mathbf{X}_l = \mathbf{W}_l^{\top}\mathbf{W}_l$. The notation $\mathbf{1}$ means the all-ones vector, and $\mathbf{I}$ means the identity matrix.

Norms and distances. We use different types of norms defined on vectors and matrices. $\|\cdot\|_1$ and $\|\cdot\|_2$ used on vectors respectively mean the $\ell_1$ norm and the $\ell_2$ norm. $\|\cdot\|_F$ and $\|\cdot\|_2$ used on matrices respectively denote the Frobenius norm and the spectral norm.

2.2 Preliminaries on ESDs of weight matrices

For a weight matrix $\mathbf{W}$, without loss of generality, we denote its shape by $N \times M$ with $N \ge M$. We denote the set of singular values as $\{\sigma_i\}_{i=1}^{M}$. Following martin2017rethinking, we define the correlation matrix as $\mathbf{X} = \mathbf{W}^{\top}\mathbf{W}$. We denote the eigenvalues of the correlation matrix as $\{\lambda_i\}_{i=1}^{M}$, and we have that $\lambda_i = \sigma_i^2$. Furthermore, we use $\lambda_{\max}$ to denote the maximum eigenvalue of the correlation matrix $\mathbf{X}$. By the ESD of the weight matrix $\mathbf{W}$, we mean the empirical density of the eigenvalues of $\mathbf{X}$, which is usually plotted as a histogram. Following MM21a_simpsons_TR, we let $\rho(\lambda)$ denote the density function fit to the ESD, taking values in the interval $[\lambda_{\min}, \lambda_{\max}]$. For a power law, $\rho(\lambda)$ satisfies

$$\rho(\lambda) \propto \lambda^{-\alpha}, \qquad \lambda_{\min} \le \lambda \le \lambda_{\max}. \tag{1}$$

We note that, following MM21a_simpsons_TR, $\lambda_{\max}$ is chosen to be the maximum eigenvalue of the correlation matrix. However, $\lambda_{\min}$ is a variable to be optimized to improve the quality of the PL fit, and it is not equal to the minimum eigenvalue in general. We use $\alpha_l$ to denote the fitted PL coefficient of the ESD of the $l$-th weight matrix $\mathbf{W}_l$.
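As a minimal illustration of this computation (the function name is ours, and we use the unnormalized convention $\mathbf{X} = \mathbf{W}^{\top}\mathbf{W}$ from above; some HT-SR implementations rescale by the matrix dimension), the following sketch computes the eigenvalues whose histogram forms the ESD:

```python
import numpy as np

def esd_eigenvalues(W: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix X = W^T W, i.e., the support of the ESD.

    Computed via the singular values of W, so lambda_i = sigma_i^2.
    """
    singular_values = np.linalg.svd(W, compute_uv=False)
    return singular_values ** 2

# Example: the ESD of a random 512x512 layer (plotted as a histogram in practice)
eigs = esd_eigenvalues(np.random.randn(512, 512) / np.sqrt(512))
```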

Eigenvector localization. The eigenvectors of a random matrix can be localized under certain assumptions on the distributions from which the matrix entries are drawn. Our analysis also considers the localization of an eigenvector $\mathbf{v}$, measured by the vector entropy martin2018implicit_JMLRversion. The vector entropy is calculated using a histogram estimated from the vector entries. Specifically, for an eigenvector $\mathbf{v}$, a histogram is estimated using a given number of bins $b$. Then, the histogram is normalized to form a probability vector $(p_1, \ldots, p_b)$. The vector entropy is then calculated as $H(\mathbf{v}) = -\sum_{i=1}^{b} p_i \log p_i$.
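A minimal sketch of this vector entropy is given below; the number of bins is an assumption, as the reference implementation may choose it differently:

```python
import numpy as np

def vector_entropy(v: np.ndarray, num_bins: int = 100) -> float:
    """Entropy of the histogram of the entries of an eigenvector v."""
    counts, _ = np.histogram(v, bins=num_bins)
    p = counts / counts.sum()          # normalize the histogram to a probability vector
    p = p[p > 0]                       # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log(p)))
```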

3 Heavy-tail self-regularization theory

In this section, we provide a brief overview of HT-SR theory, and we discuss several metrics that we derive from it. In HT-SR theory, the ESDs of the weight matrices become more heavy-tailed during training as the weights become increasingly correlated. One can evaluate the amount of correlation by fitting a PL to the ESD of a weight matrix, using the open-source WeightWatcher tool (https://github.com/CalculatedContent/WeightWatcher) martin2020predicting_NatComm. After computing the ESD of a weight matrix, we use the maximum likelihood estimator from alstott2014powerlaw to fit the PL distribution, the specific form of which is defined in (1). We use alpha to denote the PL coefficient averaged over layers, which is effectively the slope of the tail of the ESD on a log-log scale.
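As a minimal sketch of this per-layer fitting and averaging, the snippet below uses the powerlaw package of alstott2014powerlaw, with the lower threshold chosen by that package rather than the fix-finger rule discussed later; the function names are ours, so this is an approximation of the WeightWatcher pipeline rather than a drop-in replacement:

```python
import numpy as np
import powerlaw  # the package of alstott2014powerlaw

def layer_alpha(W: np.ndarray):
    """Fit a PL to the ESD of one weight matrix; return (alpha, KS distance of the fit)."""
    eigs = np.linalg.svd(W, compute_uv=False) ** 2      # eigenvalues of X = W^T W
    fit = powerlaw.Fit(eigs, verbose=False)              # xmin chosen by the package here
    return fit.power_law.alpha, fit.power_law.D

def average_alpha(weight_matrices) -> float:
    """The alpha metric: the PL coefficient averaged over layers."""
    return float(np.mean([layer_alpha(W)[0] for W in weight_matrices]))
```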

Correctly identifying and fitting PL distributions is well known to be challenging in practice. For example, a density that appears as a straight line on a log-log plot need not follow a power law, as there are many other distributions that can show similar behavior, including lognormal and exponential-type distributions clauset2009power. Nested distributions such as the E-TPL, which combine the pure PL with other distributional assumptions, can often improve the quality of the fit clauset2009power; alstott2014powerlaw. Therefore, in addition to the PL distribution defined in (1), we consider several other classes of distributions studied in the literature. Specifically, we consider the following two additional distributional assumptions.

  • (E-TPL exponent). The ESDs are assumed to take the following “nested” form:

    $$\rho(\lambda) \propto \lambda^{-\beta} e^{-\Lambda\lambda}, \qquad \lambda_{\min} \le \lambda \le \lambda_{\max}. \tag{2}$$

    After fitting the E-TPL, we call the exponential truncation coefficient $\Lambda$ the exponent metric.

  • (exp_dist_exponent). The ESDs are assumed to take the following form:

    $$\rho(\lambda) \propto e^{-\Lambda\lambda}, \qquad \lambda \ge \lambda_{\min}. \tag{3}$$

    After fitting the EXP, we call the exponential coefficient $\Lambda$ the exp_dist_exponent metric.
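Both families can be fit with the same powerlaw package; in the sketch below, the attributes .alpha and .Lambda follow that package's parameterization, and mapping .Lambda to the exponent and exp_dist_exponent metrics in Eqns. (2)-(3) is our reading rather than a description of the exact estimator used here:

```python
import powerlaw

def fit_tail_exponents(eigs):
    """Fit E-TPL and EXP to one layer's ESD and return candidate shape metrics."""
    fit = powerlaw.Fit(eigs, verbose=False)
    etpl_beta = fit.truncated_power_law.alpha      # PL part of the E-TPL, cf. Eqn. (2)
    etpl_lambda = fit.truncated_power_law.Lambda   # exponential truncation coefficient
    exp_lambda = fit.exponential.Lambda            # EXP coefficient, cf. Eqn. (3)
    # Log-likelihood ratio test: which family describes the tail better?
    R, p = fit.distribution_compare('power_law', 'truncated_power_law')
    return {'etpl_beta': etpl_beta, 'etpl_lambda': etpl_lambda,
            'exp_lambda': exp_lambda, 'PL_vs_ETPL_R': R, 'p_value': p}
```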

Another metric that we introduce is rand_distance, which is also a shape metric motivated by the ESDs. We use $p_l$ to denote the distribution obtained by normalizing the squared singular values of the $l$-th weight matrix $\mathbf{W}_l$, and we use $q_l$ to denote the distribution obtained in the same way but using the randomized weight matrix. Then, we define the rand_distance metric as

$$\mathrm{rand\_distance} = \frac{1}{L}\sum_{l=1}^{L} D_{\mathrm{JS}}\left(p_l \,\|\, q_l\right), \tag{4}$$

where $D_{\mathrm{JS}}$ is the Jensen–Shannon divergence. This is a distance based on the eigenvalues (not the elements) of a weight matrix and a random initialization matrix. The implementation of the rand_distance metric can be found in WeightWatcher.
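A per-layer version of this computation can be sketched as follows; treating the "randomized" layer as an element-wise shuffle of $\mathbf{W}$ and binning both spectra on a common grid are our assumptions, so the WeightWatcher implementation may differ in detail:

```python
import numpy as np
from scipy.stats import entropy

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

def layer_rand_distance(W: np.ndarray, num_bins: int = 128, seed: int = 0) -> float:
    """JS divergence between the ESD of W and the ESD of a randomized copy of W."""
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)   # shuffle the entries of W
    eigs = np.linalg.svd(W, compute_uv=False) ** 2
    eigs_rand = np.linalg.svd(W_rand, compute_uv=False) ** 2
    # Bin both spectra on a common grid and normalize to probability vectors
    hi = max(eigs.max(), eigs_rand.max())
    p, _ = np.histogram(eigs, bins=num_bins, range=(0.0, hi))
    q, _ = np.histogram(eigs_rand, bins=num_bins, range=(0.0, hi))
    return js_divergence(p / p.sum(), q / q.sum())
```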

For more details of the various metrics considered in this paper, see Table 1. We note that none of the metrics derived from HT-SR requires access to data or GPUs. We mainly compare shape metrics, which are derived from HT-SR, with scale metrics, which are mostly norm-based metrics. For the precise definitions of these metrics, see Appendix A.

Name Eqn Ref Need initial weights? Scale or shape Need data? Need gpu? Predicting model quality or generalization gap?
l2 (5) - No Scale No No Generalization gap
l2_dist (6) - Yes Scale No No Generalization gap
param_norm (7) jiang2019fantastic No Scale No No Generalization gap
fro_dist (8) jiang2019fantastic Yes Scale No No Generalization gap
log_norm (9) martin2018implicit_JMLRversion No Scale No No Generalization gap
log_sum_of_fro (10) jiang2019fantastic No Scale No No Generalization gap
log_spectral_norm (11) MM21a_simpsons_TR No Scale No No Generalization gap
dist_spec_int (12) jiang2019fantastic Yes Scale No No Generalization gap
log_prod_of_fro (13) jiang2019fantastic No Scale No No Generalization gap
log_sum_of_spec (14) jiang2019fantastic No Scale No No Generalization gap
log_prod_of_spec (15) jiang2019fantastic No Scale No No Generalization gap
path_norm (16) neyshabur2015norm No Scale No No Generalization gap
mp_softrank (17) martin2018implicit_JMLRversion No Scale/Shape No No Model quality
stable_rank (18) martin2018implicit_JMLRversion No Scale/Shape No No Model quality
alpha (1) martin2018implicit_JMLRversion No Shape No No Model quality
exponent (2) This paper WeightWatcher No Shape No No Model quality
exp_dist_exponent (3) This paper WeightWatcher No Shape No No Model quality
ks_distance (19) martin2018implicit_JMLRversion No Shape No No Model quality
tail_mean_vec_entropy (20) This paper WeightWatcher No Shape No No Model quality
bulk_mean_vec_entropy (21) This paper WeightWatcher No Shape No No Model quality
entropy (22) martin2018implicit_JMLRversion No Shape No No Model quality
rand_distance (4) This paper WeightWatcher No Shape No No Model quality
alpha_weighted (23) martin2018implicit_JMLRversion No Hybrid No No Model quality
log_alpha_norm (24) MM21a_simpsons_TR No Hybrid No No Model quality
inverse_margin (27) jiang2019fantastic No Scale Yes Maybe Generalization gap
log_prod_of_spec_over_margin (28) bartlett2017spectrally; pitas2017pac No Scale Yes Maybe Generalization gap
log_sum_of_spec_over_margin (29) bartlett2017spectrally; pitas2017pac No Scale Yes Maybe Generalization gap
log_prod_of_fro_over_margin (30) bartlett2017spectrally; pitas2017pac No Scale Yes Maybe Generalization gap
log_sum_of_fro_over_margin (31) bartlett2017spectrally; pitas2017pac No Scale Yes Maybe Generalization gap
path_norm_over_margin (32) neyshabur2015norm No Scale Yes Maybe Generalization gap
pacbeyes_init (35) neyshabur2017exploring Yes Scale Yes Yes Generalization gap
pacbeyes_orig (36) neyshabur2017exploring No Scale Yes Yes Generalization gap
pacbeyes_flatness (37) neyshabur2017exploring No Scale Yes Yes Generalization gap
pacbeyes_mag_init (38) jiang2019fantastic Yes Scale Yes Yes Generalization gap
pacbeyes_mag_orig (39) jiang2019fantastic No Scale Yes Yes Generalization gap
pacbeyes_mag_flatness (40) jiang2019fantastic No Scale Yes Yes Generalization gap
Table 1: Overview of the generalization metrics considered. Our paper focuses on the shape metrics derived from the ESDs of weight matrices. See Appendix A for the details of these metrics.

(a) Correlations with model quality. Spearman’s rank correlation between various generalization metrics and BLEU.

(b) Correlations with generalization gap. Spearman’s rank correlation between various generalization metrics and the generalization gap.
Figure 2: Comparing multiple generalization metrics in terms of predicting the BLEU score (on the left) or the generalization gap (on the right, defined as the training BLEU score minus the test BLEU score). The left, middle, and right vertical edges of each box represent the 25th, 50th, and 75th percentiles of the rank correlations over 50 different settings (including different datasets, different amounts of data, different network depths, different initial learning rates, and different dropout settings).

Issues of PL fitting. It is well documented that subtle issues can arise when fitting ESDs clauset2009power; alstott2014powerlaw; martin2017rethinking; MM21a_simpsons_TR. To best mitigate these issues in PL fits and focus on the core mathematical concepts of predicting generalization, we rely on the same strategies used in WeightWatcher to fit the ESDs. For example, one issue in PL fitting is correctly choosing the lower threshold $\lambda_{\min}$. The most common way of doing this martin2017rethinking; clauset2009power is to choose the $\lambda_{\min}$ that yields the best-quality fit under the Kolmogorov–Smirnov statistic (referred to as ks_distance in the sequel; see Eqn. (19)). However, this method is time-consuming, especially for the E-TPL, as there are two parameters to fit. Therefore, we adopt the “fix-finger” method documented in WeightWatcher, which selects $\lambda_{\min}$ as the peak of the ESD. Besides the speed improvement, we also find that this method provides more stable results.
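A rough sketch of this selection rule is the following; locating the peak on a log-spaced histogram and the bin count are assumptions about the fix-finger heuristic rather than a description of the exact WeightWatcher code:

```python
import numpy as np
import powerlaw

def fit_pl_fix_finger(eigs: np.ndarray, num_bins: int = 100):
    """Fit a PL with lambda_min pinned at the peak of the (log-scale) ESD histogram."""
    eigs = eigs[eigs > 0]
    counts, edges = np.histogram(np.log10(eigs), bins=num_bins)
    xmin = 10.0 ** edges[np.argmax(counts)]       # lower threshold at the histogram peak
    fit = powerlaw.Fit(eigs, xmin=xmin, verbose=False)
    return fit.power_law.alpha, fit.power_law.D   # alpha and ks_distance of the fit
```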

Comparing PL and E-TPL fitting. Referring to Figure 1, we now discuss how the E-TPL can partially address these fitting issues. In the first row of Figure 1, we show three typical cases of PL fitting. In Figure 1(a), the log-log scale reveals a “linear region” of the histogram, which the PL fitting correctly locates. The quality of the fit, measured by the ks_distance, is within a typical range, as reported in Table 5 of martin2018implicit_JMLRversion. In Figure 1(b) and Figure 1(c), the ESDs do not exhibit a clear linear region on the log-log scale. Following martin2018implicit_JMLRversion, it is ill-advised to consider metrics derived from a PL fit in these scenarios. In practice, this typically occurs when alpha is large (e.g., see Figure 1(c)). On the other hand, in these two cases, the corresponding E-TPL fits (shown in the second row of Figure 1) still closely match the empirical density function (see Figure 1(e) and Figure 1(f)), and the ks_distance in the second row using an E-TPL fit is smaller than that of the PL fit in the first row, even though the fit in the second row clearly covers a larger part of the ESD. (We note that the value of ks_distance can effectively be made smaller if one restricts the fit to a smaller part of the distribution, as is often done in practice by optimizing the $\lambda_{\min}$ in the (truncated) PL distribution (1). This potential bias is alleviated by using the fix-finger method.) In these two cases, the E-TPL exponent plays a similar role as alpha in PL fitting, and it provides an effective alternative when the ESD does not exhibit a proper PL.

Across these PL and E-TPL fittings, we would like to point out that the important thing in HT-SR is not the PL fit per se but that the spectral distributions exhibit HT or other non-standard shapes; the particular distributional forms fit here simply constitute different ways to quantify this property in practice. These details, such as selecting the most appropriate distributional assumption, clearly matter if we would like to engineer the tools of HT analysis to effectively measure the ground truth. However, the primary concern in predicting generalization is to measure the shape information, and this shape information exists independently of the fitting procedure, although better fitting procedures may capture it better.

4 Empirical results

Figure 3: The E-TPL exponent closely tracks the BLEU score, i.e., the BLEU score increases when the E-TPL exponent drops. Results are shown for Transformers trained on WMT14 with different numbers of samples. (First row). Training with dropout 0.1. (Second row). Training without dropout.

In this section, we first give full details of the experimental setup. Then, we provide the analyses of the empirical results.

4.1 Experimental setup

We consider two machine translation datasets, WMT14 bojar2014findings and IWSLT cettolo2014report

, which are commonly used as benchmarks for neural machine translation

ott2018scaling; vaswani2017attention; shen2020simple; edunov2018understanding. For both datasets, we use German to English (DE-EN). IWSLT consists of 160K parallel bilingual sentence pairs for training, and it is a relatively small-scale dataset. WMT14, on the other hand, has a medium-to-large scale, and it consists of 4.5 million sentence pairs for training. To describe a more holistic picture on the relationship between the generalization metrics and dataset size, and to address the practical concern of training with different amount of data from custom datasets, we subsample the two datasets with different number of samples. Specifically, for IWSLT, we study five cases with {10K, 20K, 40K, 80K, 160K} samples. For WMT14, we study four cases with {160K, 320K, 640K, 1.28M} samples. We intentionally overlap the right end-point (160K) of IWSLT with the left end-point (160K) of WMT14 to study the difference between the two datasets. Similarly, we also study five different values of network depth, including {2, 3, 4, 5, 6}-layer Transformers, and five different levels of initial learning rate. For each combination of (dataset, samples, depth, learning rate), we train both with and without dropout, and we use dropout probability 0.1 when training with dropout. We include the case of training without dropout because we want to study the performance of generalization metrics when there is some extent of overfitting. In total, there are 50 of these training settings. See Appendix B for the details in these 50 settings. We train with three random seeds for each setting.

We follow exactly the training setup in vaswani2017attention, and we develop our implementation based on an online repository (https://github.com/gordicaleksa/pytorch-original-transformer) which reproduces the results of vaswani2017attention with more easily configurable Transformer architectures. We use Transformer-base with 8 attention heads and an embedding dimension of 512 for both datasets. As mentioned earlier, the number of layers ranges from 2 to 6. We train with the inverse square-root learning rate schedule and 10% label smoothing. For each experiment, we train the model for 20 epochs. When calculating the ESDs of the weight matrices, we treat the query, key, and value matrices as separate weight matrices.

4.2 Performance of generalization metrics

In this subsection, we study 36 generalization metrics (with details provided in Table 1), and we study how well they correlate with the BLEU score papineni2002bleu. We use the BLEU score because it is the most commonly used metric for evaluating machine translation. Also, we are mostly interested in measuring the correlation between the generalization metrics and the test-time BLEU score directly, and we are less interested in the correlation between the generalization metrics and the generalization gap, which is defined as the training BLEU score minus the test BLEU score. Nonetheless, we provide the correlation measurements for both the test-time BLEU score and the generalization gap.

4.2.1 E-TPL exponent tracks the BLEU score

As a warm-up, we use our metric exponent defined in (2) to track the BLEU score, recalling that exponent is derived under the assumption that the ESDs follow E-TPLs. We use dropout to study the effect of the training scheme, and we consider different amounts of data to test robustness to dataset size. Referring to Figure 3, the first row considers models trained with dropout, while the second row considers models trained without dropout. The multiple columns track exponent and the BLEU score throughout training for different amounts of data. From the results in Figure 3, we can see that exponent not only tracks the BLEU scores but also differentiates underfitting (first row, with dropout) from overfitting (second row, without dropout) in this experiment.

4.2.2 Rank correlation

To systematically evaluate the various metrics considered in this paper, we study the rank correlation between these metrics and the BLEU score. For each of the 50 hyperparameter settings and each random seed, we calculate the Spearman's rank correlation between the BLEU scores and the values of each generalization metric over all epochs. The summarized results are presented in Figure 2(a). A positive Spearman's rank correlation (with BLEU) in the plot means that the generalization metric is useful for tracking BLEU during training. A negative Spearman's rank correlation, on the other hand, means that the metric often gives the wrong prediction. In Figure 2(a), we use the average rank correlation over all settings to study the effectiveness of each metric, and we present the 25th-percentile rank correlation to indicate robustness across runs.
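Concretely, the per-run rank correlation can be computed as in the following sketch, where the per-epoch numbers are hypothetical placeholders and the sign convention is explained below:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-epoch values from one training run
bleu_per_epoch  = np.array([10.2, 14.8, 17.5, 19.1, 20.3])
alpha_per_epoch = np.array([4.9, 4.1, 3.6, 3.3, 3.1])   # e.g., the alpha metric

sign = -1.0   # smaller alpha is expected to indicate better model quality
rho, _ = spearmanr(sign * alpha_per_epoch, bleu_per_epoch)
print(f"Spearman rank correlation with BLEU: {rho:.2f}")
```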

From Figure 2(a), we see that shape metrics, such as exp_dist_exponent, entropy, mp_softrank, stable_rank, rand_distance, exponent, alpha, and ks_distance, are all among the effective generalization metrics that have a high rank correlation with the BLEU score, which measures the model quality of machine translation. In particular, we see that the rand_distance metric achieves the highest 25th-percentile (worst-case) rank correlation. The exponent metric, which assumes an E-TPL distribution on the ESDs, achieves the second-highest 25th-percentile rank correlation (we discuss the problem with the inverse_margin metric in Appendix C). The exp_dist_exponent metric, which corresponds to the EXP fitting, gives the best average rank correlation. On the other hand, norm-based metrics, such as log_spectral_norm, prove ineffective for measuring model quality.

Details of the rank correlation calculations. When calculating the rank correlation, for each generalization metric, we need to associate a positive or negative sign to indicate whether the metric should be positively or negatively correlated with generalization. For example, for the power law coefficient alpha, it has been shown in martin2018implicit_JMLRversion; martin2019traditional; martin2020heavy; martin2020predicting_NatComm that a smaller alpha indicates better model quality. Thus, we associate a negative sign with this metric. However, the metric rand_distance implies better model quality when it is larger martin2018implicit_JMLRversion, and so we associate a positive sign with this metric.

4.3 Analysis on the data-dependent metrics

Through the empirical analysis of various metrics, we have obtained some observations that could partially explain why existing data-dependent generalization metrics do not perform well on NLP tasks.

  • Scale metrics correctly predict the generalization gap rather than the model quality. We also provide results on the rank correlation between the various generalization metrics and the generalization gap, defined as the training BLEU score minus the test BLEU score. See the results in Figure 2(b). It is encouraging to see that most existing generalization metrics give the correct predictions. However, as discussed above, we are less interested in this task because correctly predicting the trends of the generalization gap does not automatically yield predictions of the best-performing models.

  • The PL metrics do not predict the generalization gap. Since the PL metrics measure self-regularization martin2019traditional, it is natural to ask whether, like the data-dependent generalization metrics, PL metrics also correlate with the generalization gap. This can easily be seen to be false from Figure 2(b).

We expect these observations to be relevant and useful for improving existing generalization metrics, especially for NLP tasks, which have not been thoroughly investigated before.

Acknowledgements. We would like to acknowledge the IARPA (contract W911NF20C0035), NSF, and ONR for providing partial support of this work. Kannan Ramchandran would like to acknowledge support from NSF CIF-2007669, CIF-1703678, and CIF-2002821. Joseph E. Gonzalez would like to acknowledge support from NSF CISE Expeditions Award CCF-1730628 and gifts from Alibaba Group, Amazon Web Services, Ant Group, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk, and VMware. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.


Appendix A Generalization metrics

In this section, we provide the details of the generalization metrics considered in our analysis. We first define the scale metrics. Then, we define the shape metrics obtained from the ESDs of the weight matrices. Although our focus is on data-independent generalization metrics, we also define generalization metrics based on margins bartlett2017spectrally; pitas2017pac and PAC-Bayesian bounds mcallester1999pac; neyshabur2017pac.

A.1 Scale metrics

We first define the metrics motivated from matrix norms and the distance to initialization.

Norm-based and distance-based metrics.

In the following, we discuss multiple metrics obtained from the norms of the weights or the distance between the weights and initialization.

A careful reader might notice that the metrics in this subsection are sometimes averaged over the layers and sometimes summed over the layers. This inconsistency arises simply because we follow the exact definitions from several prior papers. It is worth noting that the rank correlation for a single training run, which is the main tool we use to compare different metrics, is independent of whether the norm is averaged or summed, as long as the network size does not change during a single training run. However, to compare networks of different sizes, proper normalization is necessary.


  • (l2). The $\ell_2$-norm of the vectorized model weights.

    (5)
  • (l2_dist). The $\ell_2$-distance between the vectorized model weights and the vectorized initial weights.

    (6)
  • (param_norm). The squared Frobenius norm summed over all weight matrices.

    (7)
  • (fro_dist). The distance between each weight matrix and its initialized value, calculated using the Frobenius norm and summed over all layers.

    (8)
  • (log_norm).

    (9)
  • (log_sum_of_fro).

    (10)
  • (log_spectral_norm).

    (11)
  • (dist_spec_int).

    (12)
  • (log_prod_of_fro).

    (13)
  • (log_sum_of_spec).

    (14)
  • (log_prod_of_spec).

    (15)
  • (path_norm). This metric is introduced in neyshabur2015norm. To calculate it, we square the parameters of the network, do a forward pass on an all-ones input, and then take the sum of the network outputs (see the sketch after this list).

    (16)
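As referenced in the path_norm item above, the following is a minimal sketch for a generic feed-forward PyTorch model; extending it to the Transformers used here (token inputs, an encoder-decoder pair) requires adapting the all-ones input, so it is illustrative only:

```python
import copy
import torch

def path_norm(model: torch.nn.Module, input_shape: tuple) -> float:
    """Square all parameters, forward an all-ones input, and sum the outputs."""
    squared = copy.deepcopy(model)
    with torch.no_grad():
        for p in squared.parameters():
            p.pow_(2)                            # in-place square of every parameter
        ones = torch.ones(1, *input_shape)       # a single all-ones input sample
        return squared(ones).sum().item()
```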

Scale metrics that require more shape information from the ESDs. The following metrics require more than just the scale information. They either come from specific forms of combined norms that roughly describe the shape of the ESDs, or come from the Marchenko–Pastur (MP) fitting of the ESDs, which makes them different from pure norm-based metrics.


  • (mp_softrank). The metric is introduced in martin2018implicit_JMLRversion. To calculate this metric, we fit the MP distribution on the ESD, obtain the bulk max of the MP distribution and then divide by the maximum eigenvalue.

    (17)
  • (stable_rank). The metric is a norm-adjusted measure of the scale of the ESD.

    (18)

A.2 Shape metrics

Tail-exponent fitting. The following metrics are derived from fitting the ESDs using a heavy or light-tail distribution.


  • (alpha). The slope of the tail of the ESD, on a log-log scale. We use MLE from alstott2014powerlaw to estimate alpha. The distribution of eigenvalues is assumed to have the form of (1).

  • (exponent). The tail exponent of the E-TPL fitting of the ESD. This is a new generalization metric introduced in this paper.

  • (exp_dist_exponent). The tail exponent of the EXP fitting of the ESD, under the assumption that the ESD follows an exponential distribution (3). This is a new generalization metric introduced in this paper.

  • (ks_distance). The Kolmogorov–Smirnov (KS) goodness-of-fit statistic of the power law fitting:

    $$\mathrm{ks\_distance} = \sup_{\lambda} \left| F_{\mathrm{fit}}(\lambda) - F_{\mathrm{ESD}}(\lambda) \right|, \tag{19}$$

    where $F_{\mathrm{fit}}$ is the CDF of the fitted power law distribution on the ESD, and $F_{\mathrm{ESD}}$ is the empirical CDF of the ESD itself.

Vector localization. The following metrics are derived from the localization of eigenvectors. In this part, we use $\lambda_{l,i}$ and $\mathbf{v}_{l,i}$ to denote the $i$-th eigenvalue and eigenvector of the $l$-th correlation matrix $\mathbf{X}_l$.


  • (tail_mean_vec_entropy). The mean of the vector entropies of the eigenvectors corresponding to eigenvalues on the tail of the ESD.

    (20)
  • (bulk_mean_vec_entropy). The mean of the vector entropies of the eigenvectors corresponding to eigenvalues on the bulk of the ESD.

    (21)

Metrics from the eigenvalues. The following two metrics are derived using the eigenvalues of a weight matrix normalized as probabilities. Recall that we use $p_l$ to denote the distribution obtained by normalizing the squared singular values of $\mathbf{W}_l$, and we use $q_l$ to denote the corresponding distribution of the randomized weight matrix, both estimated as histograms with a given number of bins.


  • (entropy). The entropy of the squared singular values of a weight matrix, normalized as probabilities. This metric is also known as the generalized von Neumann matrix entropy:

    $$\mathrm{entropy} = -\frac{1}{\log R}\sum_{i} p_i \log p_i, \tag{22}$$

    where $R$ is the rank of the matrix and $p_i$ is the $i$-th normalized squared singular value.

  • (rand_distance). The distance in distribution from the randomized layer. See the definition in Eqn. (4).

A.3 Hybrid metrics

The following metrics are scaled versions of alpha. They require both the shape information from alpha and the scale information from other weight norms.


  • (alpha_weighted). A scale-adjusted form of alpha, called $\hat{\alpha}$ in martin2018implicit_JMLRversion; martin2020predicting_NatComm; MM21a_simpsons_TR:

    $$\mathrm{alpha\_weighted} = \frac{1}{L}\sum_{l=1}^{L} \alpha_l \log_{10} \lambda_{\max,l}. \tag{23}$$
  • (log_alpha_norm). This metric is another scale-adjusted alpha metric, in the form of a Schatten norm. Recall that we use $\{\lambda_{l,i}\}_{i=1}^{M}$ to denote the set of eigenvalues of the correlation matrix $\mathbf{X}_l = \mathbf{W}_l^{\top}\mathbf{W}_l$, where $\mathbf{W}_l$ is the $N$-by-$M$ weight matrix that satisfies $N \ge M$. Then, we can define the Schatten $\alpha$-norm as follows:

    $$\|\mathbf{X}_l\|_{\alpha_l}^{\alpha_l} = \sum_{i=1}^{M} \lambda_{l,i}^{\alpha_l}. \tag{24}$$

    Using this definition, we can define the metric log_alpha_norm as follows:

    $$\mathrm{log\_alpha\_norm} = \frac{1}{L}\sum_{l=1}^{L} \log \|\mathbf{X}_l\|_{\alpha_l}^{\alpha_l}. \tag{25}$$

A.4 Margin-based metrics

Then, we discuss generalization metrics motivated by margins. Recall that we use $f_{\mathbf{w}}$ to denote the neural network with weights $\mathbf{w}$. For a multi-class classification problem with sample-label pair $(\mathbf{x}, y)$, we define the margin as

$$f_{\mathbf{w}}(\mathbf{x})_{y} - \max_{j \neq y} f_{\mathbf{w}}(\mathbf{x})_{j}. \tag{26}$$

For machine translation, we consider the margin of each output token. We note that the number of classes, i.e., the number of possible tokens, is often particularly large for machine translation, at least on the order of thousands.

Following jiang2019fantastic, we consider output margins only (we note that margins can be defined at any layer elsayed2018large; yang2020boundary; wei2019improved), and we use the 10th percentile of the margin distribution calculated from the entire training set as a robust surrogate for the minimum margin. Using the margin defined as this 10th percentile, we define several generalization metrics; a short computation sketch appears after the list below.


  • (inverse_margin).

    (27)
  • (log_prod_of_spec_over_margin).

    (28)
  • (log_sum_of_spec_over_margin).

    (29)
  • (log_prod_of_fro_over_margin).

    (30)
  • (log_sum_of_fro_over_margin).

    (31)
  • (path_norm_over_margin).

    (32)
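As referenced above, a minimal sketch of the per-token margin of Eqn. (26) and its 10th percentile follows; the function name and tensor shapes are assumptions:

```python
import torch

def tenth_percentile_margin(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-token output margin f(x)_y - max_{j != y} f(x)_j, summarized by its 10th percentile.

    logits:  (num_tokens, vocab_size) decoder outputs collected over the training set
    targets: (num_tokens,) gold token ids
    """
    correct = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, targets.unsqueeze(1), float("-inf"))   # mask out the true class
    runner_up = others.max(dim=1).values
    margins = correct - runner_up
    return torch.quantile(margins, 0.10)
```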

A.5 Metrics derived from PAC-Bayesian bounds

Several well-known generalization bounds are derived using the PAC-Bayesian framework, which bounds the generalization gap using the KL-divergence between a predefined prior distribution (usually chosen as a Gaussian) and the posterior distribution of the trained models. The key component of the PAC-Bayesian bounds used in most existing implementations is the procedure of searching for the largest magnitude of Gaussian perturbation, denoted as $\sigma$, such that the perturbed weights of the neural network achieve a bounded increase in the training loss; a search sketch appears after the metric list below. More specifically, $\sigma$ is defined such that

$$\mathbb{E}_{\mathbf{u} \sim \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{I})}\big[\hat{L}(\mathbf{w} + \mathbf{u})\big] \le \hat{L}(\mathbf{w}) + \epsilon, \tag{33}$$

where $\hat{L}$ denotes the training loss and $\epsilon$ is a predetermined threshold, chosen as 0.5 in the experiments on machine translation. Similarly, one can define a “magnitude-aware” perturbation with magnitude $\sigma'$, such that

$$\mathbb{E}_{\mathbf{u}}\big[\hat{L}(\mathbf{w} + \mathbf{u})\big] \le \hat{L}(\mathbf{w}) + \epsilon, \tag{34}$$

where each weight entry $u_i$ in $\mathbf{u}$ is distributed as $\mathcal{N}\big(0,\, \sigma'^2 w_i^2 + \epsilon'^2\big)$, and $\epsilon'$ is chosen as 1e-3 dziugaite2020search. Given the perturbation magnitude $\sigma$, the magnitude-aware perturbation magnitude $\sigma'$, and the number of samples $m$, one can define the following generalization metrics.

  • (pacbeyes_init).

    (35)
  • (pacbeyes_orig).

    (36)
  • (pacbeyes_flatness).

    (37)
  • (pacbeyes_mag_init).

    (38)
  • (pacbeyes_mag_orig).

    (39)
  • (pacbeyes_mag_flatness).

    (40)
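As referenced above, the $\sigma$ search in Eqn. (33) can be sketched as a simple binary search; evaluate_loss is a hypothetical helper that averages the training loss over the training set, and practical implementations typically average over several noise draws per candidate $\sigma$:

```python
import copy
import torch

def search_sigma(model, evaluate_loss, target_increase=0.5,
                 sigma_lo=0.0, sigma_hi=1.0, iters=15):
    """Binary search for the largest sigma whose Gaussian perturbation raises the
    training loss by at most `target_increase` (cf. Eqn. (33))."""
    base_loss = evaluate_loss(model)
    for _ in range(iters):
        sigma = 0.5 * (sigma_lo + sigma_hi)
        perturbed = copy.deepcopy(model)
        with torch.no_grad():
            for p in perturbed.parameters():
                p.add_(torch.randn_like(p) * sigma)      # isotropic Gaussian perturbation
        if evaluate_loss(perturbed) - base_loss <= target_increase:
            sigma_lo = sigma        # still within budget: try a larger perturbation
        else:
            sigma_hi = sigma
    return sigma_lo
```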

Appendix B Additional details on the experiment setup

There are 50 different settings that we consider for our experiments. See Table 2. The column titled “Initial learning rate” shows the constant factor multiplied with the standard learning rate schedule. Given the embedding dimension $d_{\mathrm{model}}$, step number $t$, and number of warm-up steps $T_{\mathrm{warmup}}$, the formula for the inverse square-root learning rate schedule is

$$\mathrm{lr}(t) = c \cdot d_{\mathrm{model}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_{\mathrm{warmup}}^{-1.5}\right), \tag{41}$$

where $c$ is the constant factor from the “Initial learning rate” column.
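A sketch of this schedule in code follows; the warm-up step count is an assumption (4000 is the default in vaswani2017attention):

```python
def inverse_sqrt_lr(step: int, d_model: int = 512, warmup_steps: int = 4000,
                    factor: float = 1.0) -> float:
    """Inverse square-root schedule of Eqn. (41); `factor` is the constant c from the
    'Initial learning rate' column of Table 2."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: learning rate at step 10,000 for the setting with factor 0.5
lr = inverse_sqrt_lr(10_000, factor=0.5)
```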
Figure 4: The margins remain negative in the experiments on machine translation due to the large alphabet size.

Appendix C Additional analysis on scale metrics

In this section, we discuss an issue with computing margin-based generalization metrics. Generically, these bounds are of the form

$$L \le \hat{L}_{\gamma} + C(\gamma),$$

where $L$ is the population error, $\hat{L}_{\gamma}$ is the training margin loss at margin $\gamma$, typically

$$\hat{L}_{\gamma} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left\{ f_{\mathbf{w}}(\mathbf{x}_i)_{y_i} \le \gamma + \max_{j \neq y_i} f_{\mathbf{w}}(\mathbf{x}_i)_{j} \right\},$$

and $C(\gamma)$ is some complexity term. First, note that this construction requires the margin $\gamma$ to be positive. Moreover, the training margin loss is an increasing function of $\gamma$, while the complexity term is decreasing in $\gamma$; thus, the conventional way of using the margin bound is to optimize over the margin to balance the two terms bartlett2017spectrally, rather than pre-specifying a value of the margin dependent on the data. However, we choose to follow the related papers dziugaite2020search; jiang2019fantastic, and we use the 10th percentile margin as a robust estimate of the minimum margin in the dataset. We use this margin in all of the margin-normalized generalization metrics. However, in all of the experiments on machine translation, the 10th percentile margin remains negative throughout training, violating the requirement that the bound be evaluated at a positive value of the margin. See Figure 4. This problem results from the large alphabet of machine translation, which makes it difficult to fully interpolate the data, and hence makes the margin-normalized generalization metrics in dziugaite2020search; jiang2019fantastic hard to apply to the present setting.

Index Purpose of the experiment Dataset Number of samples Initial learning rate Network depth Dropout (Yes/No) Number of training epochs
0 Different amount of data in IWSLT IWSLT 10K 1 6 Yes 20
1 IWSLT 10K 1 6 No 20
2 IWSLT 20K 1 6 Yes 20
3 IWSLT 20K 1 6 No 20
4 IWSLT 40K 1 6 Yes 20
5 IWSLT 40K 1 6 No 20
6 IWSLT 80K 1 6 Yes 20
7 IWSLT 80K 1 6 No 20
8 IWSLT 160K 1 6 Yes 20
9 IWSLT 160K 1 6 No 20
10 Different learning rate in IWSLT IWSLT 160K 0.75 6 Yes 20
11 IWSLT 160K 0.75 6 No 20
12 IWSLT 160K 0.5 6 Yes 20
13 IWSLT 160K 0.5 6 No 20
14 IWSLT 160K 0.375 6 Yes 20
15 IWSLT 160K 0.375 6 No 20
16 IWSLT 160K 0.25 6 Yes 20
17 IWSLT 160K 0.25 6 No 20
18 Different network depth in IWSLT IWSLT 160K 1 5 Yes 20
19 IWSLT 160K 1 5 No 20
20 IWSLT 160K 1 4 Yes 20
21 IWSLT 160K 1 4 No 20
22 IWSLT 160K 1 3 Yes 20
23 IWSLT 160K 1 3 No 20
24 IWSLT 160K 1 2 Yes 20
25 IWSLT 160K 1 2 No 20
26 Different amount of data in WMT WMT 160K 1 6 Yes 20
27 WMT 160K 1 6 No 20
28 WMT 320K 1 6 Yes 20
29 WMT 320K 1 6 No 20
30 WMT 640K 1 6 Yes 20
31 WMT 640K 1 6 No 20
32 WMT 1.28M 1 6 Yes 20
33 WMT 1.28M 1 6 No 20
34 Different learning rate in WMT WMT 1.28M 0.75 6 Yes 20
35 WMT 1.28M 0.75 6 No 20
36 WMT 1.28M 0.5 6 Yes 20
37 WMT 1.28M 0.5 6 No 20
38 WMT 1.28M 0.375 6 Yes 20
39 WMT 1.28M 0.375 6 No 20
40 WMT 1.28M 0.25 6 Yes 20
41 WMT 1.28M 0.25 6 No 20
42 Different network depth in WMT WMT 1.28M 1 5 Yes 20
43 WMT 1.28M 1 5 No 20
44 WMT 1.28M 1 4 Yes 20
45 WMT 1.28M 1 4 No 20
46 WMT 1.28M 1 3 Yes 20
47 WMT 1.28M 1 3 No 20
48 WMT 1.28M 1 2 Yes 20
49 WMT 1.28M 1 2 No 20
Table 2: Parameter settings of empirical studies in Section 4.