GenProb
Understanding the generalization behaviour of deep neural networks is a topic of recent interest that has driven the production of many studies, notably the development and evaluation of generalization "explainability" measures that quantify model generalization ability. Generalization measures have also proven useful in the development of powerful layerwise model tuning and optimization algorithms, though these algorithms require specific kinds of generalization measures which can probe individual layers. The purpose of this paper is to explore the neglected subtopic of probeable generalization measures, to establish firm ground for further investigations, and to inspire and guide the development of novel model tuning and optimization algorithms. We evaluate and compare measures, demonstrating effectiveness and robustness across model variations, dataset complexities, training hyperparameters, and training stages. We also introduce a new dataset of trained models and performance metrics, GenProb, for testing generalization measures, model tuning algorithms and optimization algorithms.
Deep learning has proven very successful over the last decade, demonstrating time and time again its ability to generalize feature recognition from train data to test data. Despite all the attention, the underlying mechanisms in deep models that promote generalization remain open questions [huh2021lowrank]. A number of papers attempt to consolidate the intuition behind deep learning generalization by developing "explainable" measures that attempt to quantify the generalization ability of a given model and dataset [jiang2019fantastic, dziugaite2021search, neyshabur2017exploring].
Generalization measures provide an understanding of the learning mechanisms involved in training a given network with certain optimizers, datasets, and hyperparameters. Understanding and identifying trends in the relationship between these metrics and a network's generalization gap or test accuracy provides methods for optimally selecting hyperparameter configurations, topologies, and other training parameters. Furthermore, generalization measures can be implemented for neural architecture search (NAS), hyperparameter optimization (HPO) and training optimization [NIPS2015_eaa32c96, DBLP:journals/corr/abs201001412, cherian2020efficient, DBLP:journals/corr/abs180207191, abdelfattah2021zerocost]. Other methods optimize deep models solely via holistic evaluations of performance, neglecting the discrepancies in training quality between layers and forgoing the additional tuning required for a better solution.
Many generalization measures have proven effective and robust, but are not practical for implementation in model tuning and optimization algorithms. Ideally, a generalization measure can probe individual layers (a probeable measure) in order to tune models at the layer level (e.g. channel sizes, weight update rules, individual learning rates). The most successful measures are derived from PAC-Bayes bounds and margin distributions, and are not probeable [jiang2019fantastic, dziugaite2021search, natekar, k2021robustness]. Our objective in this paper is to investigate probeable generalization measures for model tuning and optimization applications, and to better understand generalization mechanisms in deep models. Given a dataset of trained models and their performance metrics, we evaluate probeable generalization metrics via the pipeline depicted in Fig. 1.
The following lists the contributions in this paper:
We describe and investigate four probeable generalization measures that can be measured directly from individual layers of a deep network, without the need for an additional training pipeline.
We test these measures on the NATS-Bench dataset [Dong_2021], and demonstrate their effectiveness for explaining the generalization of models of varying channel sizes.
We study the evolution of these measures during training to understand their effectiveness at different stages of training.
We introduce a new dataset, dubbed GenProb, to test probeable measures on models of varying channel sizes and training hyperparameters.
We evaluate and compare the measures with GenProb, demonstrating effectiveness and robustness.
With a recent surge in interest in understanding the generalization performance of deep models, studies have produced many new complexity measures. Measures based on response to input perturbation [DBLP:journals/corr/abs210604765, natekar, k2021robustness], and PAC-Bayes sharpness and flatness measures [natekar, DBLP:journals/corr/DziugaiteR17, jiang2019fantastic], have demonstrated much success. Notably, the first and second place holders [natekar, k2021robustness] of the recent NeurIPS 2020 Predicting Generalization in Deep Learning competition [jiang2020neurips] developed perturbation and sharpness measures. These measures have also been implemented in NAS, HPO and training optimization algorithms [DBLP:journals/corr/abs201001412, cherian2020efficient, DBLP:journals/corr/abs180207191].
Studies also define measures from deep network margin distributions [DBLP:journals/corr/abs210603314, arora2018stronger, bartlett2017spectrallynormalized]. Other measures are derived from the model weights themselves, such as gradient signal-to-noise ratio (GSNR) [liu2020understanding] and norm-based measures [NIPS2015_eaa32c96, DBLP:journals/corr/abs190101672, DBLP:journals/corr/abs171101530, novak2018sensitivity]. Shallow networks are often trained on the aforementioned measures for accurate generalization performance predictions, at the cost of the ability to generalize predictions across different datasets and model structures [jiang2018predicting, yak2019task, corneanu2020computing, unterthiner2021predicting]. Some large-scale studies of generalization measures provide rigorous evaluations and comparisons of common and best-performing measures across different datasets and model structures [jiang2019fantastic, dziugaite2021search, neyshabur2017exploring]. These last three are the only studies that include probeable measures; however, they are tested on sets of fully trained models with drastic variations, which is irrelevant for most model optimization algorithms. Probeable measures are implemented in [abdelfattah2021zerocost] to reduce the computational cost and increase the performance of NAS algorithms, and in [DBLP:journals/corr/abs200606587] to adaptively optimize deep networks for greater generalization at no additional computational overhead.
We define several measures that quantify the quality of a trained model (i.e. quality metrics or complexity measures) and describe its generalization ability. These quality metrics are probeable on individual layers of a deep network and quantify the contribution of each layer to the network's overall representation, unlike the other popular and successful measures elaborated above. The measures are thus explainable, as they indicate how well training is optimized across the layers of a deep network. An overview of all chosen measures is presented in Table 1.


Metric  Formulation  Aggregation
Stable Quality  SQ(W) = srank(W) / κ(W), with srank(W) = (Σ_i σ_i²) / σ_max² and κ(W) = σ_max / σ_min  depth-normalized L2 norm or product
Effective Rank  E(W) = exp(−Σ_i σ̄_i log σ̄_i), with σ̄_i = σ_i / Σ_j σ_j  depth-normalized L2 norm or product
Spectral Norm  S(W) = σ_max(W)  depth-normalized L2 norm or product
Frobenius Norm  F(W) = Σ_{i,j} W_{i,j}² = Σ_i σ_i²  depth-normalized L2 norm or product

Here d is the depth of the model, r_max is the maximum rank of the weight matrix W, r is the rank of the weight matrix, σ is the vector of singular values of W with individual values σ_i and normalized singular values σ̄_i, and W_{i,j} is the value at row i and column j of W. Four-dimensional weight tensors (such as convolutions) are first unfolded along their input and output channels for computation.
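The unfolding step can be sketched in NumPy. The exact unfolding convention is not specified above, so the reshape below (output channels by flattened input dimensions) is an assumption, and the tensor shape is illustrative:

```python
import numpy as np

# A 4-D convolution weight tensor has shape (out_channels, in_channels, kH, kW).
# Before computing singular-value-based metrics it is unfolded into a 2-D matrix;
# one common convention is out_channels x (in_channels * kH * kW).
conv_w = np.random.randn(64, 32, 3, 3)
W = conv_w.reshape(conv_w.shape[0], -1)  # shape (64, 288)
```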
Stable quality (SQ) refers to the stability of encoding in a deep layer, calculated as the relative ratio of the stable rank and condition number of a layer as defined in [DBLP:journals/corr/abs200606587]. The stable rank encodes the space expansion under the matrix mapping of the layer, and the condition number indicates the numerical sensitivity of that mapping. Altogether, the measure quantifies the quality of the layer as an autoencoder.
Effective rank (E) refers to the dimension of the output space of the transformation operated by a deep layer, calculated via the Shannon entropy of the normalized singular values of a layer, as defined in [royoliv].
Frobenius norm (F) refers to the magnitude of a deep layer, calculated as the sum of the squared values of its weight tensor. Equivalently, the Frobenius norm can be calculated as the sum of the squared singular values of a layer.
Spectral norm (S) refers to the maximum magnitude of the mapping operated by a layer, calculated as the maximum singular value of its weight tensor.
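A minimal NumPy sketch of the four layer-level metrics, following the descriptions above. The stable-quality formula here is our reading of the text as a ratio of stable rank to condition number; the paper's exact definition is in [DBLP:journals/corr/abs200606587], so treat this as illustrative:

```python
import numpy as np

def layer_quality_metrics(W):
    """Probe a single layer: compute four quality metrics from its 2-D
    weight matrix (4-D convolution tensors must be unfolded first)."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    s = s[s > 1e-12]                        # keep numerically nonzero values

    frob = np.sum(s ** 2)   # Frobenius norm (squared): sum of squared singular values
    spectral = s[0]         # spectral norm: largest singular value

    stable_rank = frob / spectral ** 2        # stable rank
    condition = s[0] / s[-1]                  # condition number (nonzero singular values)
    stable_quality = stable_rank / condition  # assumed ratio form; paper's formula may differ

    p = s / s.sum()                                         # normalized singular values
    effective_rank = float(np.exp(-np.sum(p * np.log(p))))  # exp of Shannon entropy

    return {"SQ": float(stable_quality), "E": effective_rank,
            "F": float(frob), "S": float(spectral)}
```

For a well-conditioned layer such as the identity, stable rank and effective rank both equal the full rank, while the condition number is 1.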
The formulations of these quality metrics are presented in Table 1. With these layer-level quality measures, we aim to aggregate across model layers into a single meaningful model-level quality metric. For the norm-based measures, we borrow the best performing aggregation method for each layer-level measure from [dziugaite2021search]. For stable quality and effective rank, we draw inspiration from aggregation methods in the literature, test all variations, and choose the single best performing method [dziugaite2021search, jiang2019fantastic].
The notation convention used in Table 1 and hereinafter to represent different quality metrics is metric_aggregation, where aggregation ∈ {depth-normalized L2 norm, depth-normalized product} and metric ∈ {stable quality (SQ), effective rank (E), Frobenius norm (F), spectral norm (S)}.
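The two aggregation modes named above could be sketched as follows; the exact normalization (divide by depth for the L2 norm, d-th root for the product) is our assumption:

```python
import numpy as np

def aggregate(layer_values, method="L2"):
    """Aggregate per-layer quality-metric values into one model-level value.

    'L2': depth-normalized L2 norm; 'P': depth-normalized product.
    The precise normalization here is an assumption, not the paper's formula.
    """
    v = np.asarray(layer_values, dtype=float)
    d = len(v)  # model depth = number of probed layers
    if method == "L2":
        return float(np.linalg.norm(v) / d)
    if method == "P":
        return float(np.prod(v) ** (1.0 / d))  # geometric-mean-style product
    raise ValueError(f"unknown aggregation: {method}")
```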
Low-rank factorization (LRF) is a preprocessing technique we also employ, aiming to increase the consistency of quality metrics across stages of training. With LRF, the low-rank component of a weight matrix, which represents the useful information, is extracted from the raw weights, stripping the weights of residual noise from random initialization. We compute the factorization of our weight matrices by means of EVBMF, defined in [NIPS2011_b73ce398]. A wide hat over a metric symbol indicates preprocessing of the weights by low-rank factorization.
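EVBMF infers the retained rank analytically from a variational Bayesian criterion, and reproducing it is beyond this sketch; the stand-in below instead takes the rank as a parameter and performs plain truncated-SVD extraction of the low-rank component:

```python
import numpy as np

def low_rank_component(W, rank):
    """Extract a rank-`rank` approximation of W by truncated SVD.

    Illustrative stand-in for EVBMF [NIPS2011_b73ce398]: EVBMF selects the
    rank automatically, whereas here it must be supplied by hand.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top-`rank` singular directions; the remainder is
    # treated as residual noise and discarded.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
```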
NATS-Bench is a dataset of trained deep neural networks for benchmarking neural architecture search algorithms [Dong_2021]. A set of related model architectures are trained and have their weights and performances saved to generate NATS-Bench, a mapping from model architectures and weights to generalization performance. NATS-Bench models are designed with a standard architecture skeleton as described in [Dong_2021], and vary layer operations or channel sizes at different locations in the skeleton. We can test the quality metrics with NATS-Bench by computing them on the provided trained weights and correlating them with model generalization gap and test accuracy.


Hyperparameter  Size Search Space  Topology Search Space
Learning Rate  (cosine annealing)  (cosine annealing)
Weight Decay  5e-4  5e-4
Batch Size  256  256
Epochs  12, 90  12, 200
Channel Size Variations  8, 16, 24, 32, 40, 48, 56, 64  -
Layer Operation Variations  -  zeroize, skip, 1x1 conv, 3x3 conv, average-pool

The NATS-Bench dataset consists of two subsets: the topology search space contains data from a set of models of varying layer operations, and the size search space contains data from a set of models of varying layer channel sizes. The topology search space features a set of five operations (Table 2) for six layers (15,625 total permutations), and the size search space features a set of eight channel sizes (Table 2) for five layers (32,768 total permutations). The models are trained on the CIFAR10, CIFAR100 and ImageNet16-120 datasets [chrabaszcz2017downsampled]. Model weights were saved at 12 epochs and at completion. Training was optimized with Nesterov momentum SGD for cross-entropy loss, with L2 weight decay and cosine annealing (see Table 2 for hyperparameter selection). Data augmentation included random flips with probability 0.5, 32x32 (16x16 for ImageNet16-120) random crops with 4 pixel padding, and RGB channel normalization.
Quality metrics are computed with and without low-rank factorization (LRF) on each set of trained model weights; then the Spearman correlations of these measures with test accuracy and generalization gap are computed to generate Fig. 2. The Spearman rank-order correlation measures, nonparametrically, how well the relationship between two variables can be described by an arbitrary monotonic function. Spearman correlation therefore fits our purpose of evaluating generalization measures, as a generalization measure only needs to rank the relative performance of competing models.
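For concreteness, a self-contained Spearman correlation over hypothetical per-model values (a real evaluation would use e.g. scipy.stats.spearmanr; the numbers below are made up for illustration):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank-order correlation, assuming no ties in x or y."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical quality-metric values and test accuracies for four models:
metric = [3.1, 4.7, 2.2, 5.0]
test_acc = [0.71, 0.78, 0.65, 0.80]
rho = spearman(metric, test_acc)  # 1.0 here: the metric ranks the models perfectly
```

Because only ranks matter, a measure need not predict accuracy values, only order competing models correctly.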


Several measures show very high correlations with test accuracy and generalization gap in the NATS size search space, reaching 0.9 and above, as shown in Fig. 2(a). Most raw quality metrics show considerable potential, though LRF preprocessing of weights does not provide a consistent improvement. No trend is observed across different epochs; the patterns at 12 epochs cannot be found in the 90 and 200 epoch alternatives, indicating measure inconsistency across stages of training. Trends only hold across datasets in the case of the NATS size search space for test accuracy, also indicating measure inconsistency across datasets. The patterns from the size search space do not appear in the topology search space; as illustrated in Fig. 2(b), correlations do not pass 0.5, and many of the previous peaks are now troughs. The measures are not valuable across models of changing topologies, as they are sensitive to and vary with the types of layers included in the model.
The NATS-Bench search spaces feature drastic changes between models, which yield stochastic results and inhibit meaningful analysis; however, the high correlations under the size search space motivate further investigation. Notably for HPO and training optimization algorithms, it would be valuable to use a search space with variations in training hyperparameters instead. A final remark is that LRF boosts correlations for certain configurations (e.g. stable quality in the topology search space), which warrants further investigation.
We trained ResNet34 on CIFAR10 and CIFAR100 with AdaS [DBLP:journals/corr/abs200606587] until completion and saved model weights at every epoch. The saved set of weights enables the observation of the evolution of quality metrics over training.


In Fig. (a) and Fig. (b) we observe high noise perturbation at initialization that quickly fades in the first 10 epochs, then rises subtly and plateaus at a low value. The residual noise extracted by LRF fades, as we would expect of the random noise from initialization; it is replaced with learned structure, though it remains significant at later epochs with CIFAR100. All raw measures fail to mirror test accuracy and generalization gap at early epochs. The stable quality and effective rank measures begin mirroring test accuracy and generalization gap at around 10 epochs, while the Frobenius and spectral norm measures only begin later, at around epoch 60. The quality metrics do a better job of mirroring test accuracy and generalization gap at earlier epochs with LRF. The Frobenius and spectral norm measures, however, still follow a distinct drop from epoch 10 to 60 with LRF. It intuitively follows that we can expect the stable quality and effective rank measures to correlate better with test accuracy and generalization gap, notably with LRF preprocessing, through all stages of training.
For a comprehensive display of experimental results please refer to the Supplementary Material.
To test the effectiveness of the measures for tracking generalization performance at earlier stages of training, we train families of models with varied hyperparameter and channel size configurations, then save model weights and performances at each epoch. Models are trained for 70 epochs on CIFAR10 and CIFAR100 with various optimizers. We dub this dataset the Generalization Dataset for Probeable Measures (GenProb).


Block Index  Block Type  Output Shape 
0  input  32 x 32 x 3 
1  3 x 3 convolution  32 x 32 x 8 
2  convolutional block  32 x 32 x {40, 48} 
3  residual block  18 x 18 x {40, 48} 
4  convolutional block  18 x 18 x {40, 48} 
5  residual block  9 x 9 x {40, 48} 
6  convolutional block  9 x 9 x {40, 48} 
7  global average pooling  1 x 1 x {40, 48} 
8  linear  1 x 1 x {10, 100} 

The model architecture used is the same as that in NATS-Bench's size search space [Dong_2021], described in Table 3. The convolutional blocks can be described as directed acyclic graphs with five nodes of activations, as depicted in Fig. (a). All nodes are ordered, and each node is connected to every node in front of it by a 3x3 convolution. The longest connection in the convolutional block (first node to last node) is replaced by a skip connection. The residual block is composed of a main path and a shortcut path, as illustrated in Fig. (b). The main path contains a 3x3 convolution (stride 2) followed by another 3x3 convolution (stride 1), while the shortcut path contains an average pooling layer (stride 2) followed by a 1x1 convolution. All 3x3 convolutions are followed by batch normalization, and all changes in channel sizes take place at the earliest possible layers in the blocks. All activation functions are ReLU, including the prediction layer. The output channel sizes of blocks 2-6 are varied, the output channel size of block 7 depends on block 6, and the output channel size of block 8 depends on the dataset.
The family of trained models is generated by varying the initial learning rate, weight decay and channel sizes with the options listed in Table 4. Furthermore, input data is normalized channel-wise and augmented by random 32 x 32 crops with 4 pixels of padding and random horizontal flips with probability 0.5. Five repeated blocks in the model have their output channel sizes chosen independently, totaling 2^5 = 32 permutations. From the set of all choices of hyperparameters we generate 480 combinations, each yielding a unique trained model. We repeat with 7 different optimizers for both CIFAR10 and CIFAR100, training 6,720 unique configurations. Saving the model weights and performance at every epoch, we generate a total of 470,400 unique trained model files for our dataset. (GenProb will be released upon publication, alongside the GitHub repository.)
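The dataset size quoted in the conclusion (470,400 trained models) is consistent with these variations; the grouping of 5 learning rates and 3 weight decays per optimizer follows our reading of Table 4:

```python
# Channel sizes: each of 5 repeated blocks chooses from 2 sizes (40 or 48).
channel_perms = 2 ** 5                          # 32 channel-size permutations
hyper_combos = 5 * 3                            # 5 learning rates x 3 weight decays
configs_per_run = channel_perms * hyper_combos  # 480 models per optimizer/dataset pair
runs = 7 * 2                                    # 7 optimizers x 2 datasets (CIFAR10/100)
total_models = configs_per_run * runs           # 6,720 unique trained configurations
total_files = total_models * 70                 # a checkpoint at each of 70 epochs
```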




To visualize the relationship between the quality metrics and both generalization gap and test accuracy, we produce scatter plots of test accuracy and generalization gap over the quality metrics. Furthermore, by organizing these separate scatter plots relative to the quantity of training each set of models has undergone, we can study the evolution of the relationship. The quality metrics are biased at earlier epochs by residual noise from random initialization; we observe a lack of form in the effective rank scatter plots on models trained with Adam on CIFAR10 in Fig. (a) and Fig. (b) at earlier epochs. The quality metric evolves into clear, strong trends with test accuracy and generalization gap as training progresses and learned structure develops in the model weights. We observe similar behaviour with the three other metrics, provided in the Supplementary Material (Fig. 1-16), and LRF only seems to weaken any trends; see Supplementary Material (Fig. 17-32).



Hyperparameter  RMSGD, SGD, SAM, SGDP  Adam, AdaBound, AdamP
Learning Rate  0.02, 0.04, 0.06, 0.08, 0.10  2e-3, 4e-3, 6e-3, 8e-3, 0.01
Weight Decay  1e-4, 5e-4, 1e-3  1e-4, 5e-4, 1e-3
Channel Sizes  40, 48  40, 48
Momentum  0.9  0.9
Batch Size  256  256
Epochs  70  70
Scheduler Step Size  -  10

We observe a clear linear relationship between the effective rank measure and the generalization gap at later epochs, and a second-order relationship between the effective rank measure and test accuracy. The plateauing trend with test accuracy delineates a bound on test accuracy; maximizing effective rank above this bound would still increase the generalization gap (linear trend), however, suggesting an increase in train accuracy without changes in test accuracy. It remains evident that, for a model trained on CIFAR10 with Adam, a greater effective rank indicates greater test accuracy and a greater (negative) generalization gap. In fact, the relationships take form before model completion: trends with test accuracy show as early as epoch 10, and trends with generalization gap show at epoch 40; this may be of interest for the implementation of effective rank in NAS, HPO and optimization algorithms. We find similar timelines with all other quality metrics; see Supplementary Material (Fig. 1-32).
By plotting the correlations of the quality metrics with test accuracy and generalization gap in Fig. (c) and Fig. (d), we can understand the relative progression of the effectiveness of these measures through different stages of training. As the aforementioned trends in Fig. (a) and Fig. (b) become more distinct, the corresponding correlations increase in magnitude, some nearly up to 1. LRF preprocessing of weights only decreases the magnitude of correlation through all stages of training; see Supplementary Material (Fig. 41-42).
The large correlations indicate robustness to changes in training hyperparameters and model channel sizes. The effective rank and stable quality measures prove to be the most effective and robust generalization measures through all training phases and across dataset complexities. The Frobenius and spectral norm measures consistently underperform and are less stable across dataset complexities, yielding lower correlations with CIFAR10. Also, as previously highlighted in the scatter plots, correlations with test accuracy plateau after 10 epochs and correlations with generalization gap plateau after 40 epochs.
In this work we investigated "explainable" generalization measures that can probe individual layers of a deep neural network. We tested four quality metrics, calculated from layer weight tensors, over spaces of similar model architectures for NAS (NATS-Bench). We analyzed the evolution of the quality metrics during training, then produced a dataset for more meaningful analysis and testing related to HPO and training optimization. We introduced GenProb, a dataset of 470,400 trained models and their performance metrics, distinguished by hyperparameter and channel size variations. We demonstrated the effectiveness of the effective rank and stable quality measures and their robustness to variations in training hyperparameters, channel sizes, dataset complexities and stages of training. Furthermore, we investigated the amount of training required to produce meaningful generalization measures from model weights, and the shape of the relationships of these measures with test accuracy and generalization gap.
We hope this work inspires and guides the production of novel NAS, HPO and training optimization algorithms that leverage probeable generalization measures to maximize model generalization performance at the layer level. We also aim to motivate further development of probeable generalization measures, for which GenProb will prove a useful tool. Our investigated measures do not prove robust to drastic variations in model architecture, and as such may not be suitable for all NAS algorithms. We also only consider simple convolutional neural networks and two datasets (CIFAR10 and CIFAR100); we hope that future investigations will experiment with other types of models, deeper models, and other datasets.