Finite Versus Infinite Neural Networks: an Empirical Study

07/31/2020 ∙ by Jaehoon Lee, et al. ∙ 49

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.




1 Introduction

A broad class of both Bayesian (Neal, 1994; Williams, 1997; Hazan and Jaakkola, 2015; Lee et al., 2018; Matthews et al., 2018a, b; Borovykh, 2018; Garriga-Alonso et al., 2019; Novak et al., 2019; Yang and Schoenholz, 2017; Yang et al., 2019; Pretorius et al., 2019; Yang, 2019a, b; Novak et al., 2020; Hron et al., 2020; Hu et al., 2020a) and gradient descent trained (Jacot et al., 2018; Li and Liang, 2018a; Allen-Zhu et al., 2018; Du et al., 2019a, 2018; Zou et al., 2019; Lee et al., 2019; Chizat et al., 2019; Arora et al., 2019a; Sohl-Dickstein et al., 2020; Huang et al., 2020; Du et al., 2019b; Yang, 2019a, b; Novak et al., 2020; Hron et al., 2020) neural networks converge to Gaussian Processes (GPs) or closely-related kernel methods as their intermediate layers are made infinitely wide. The predictions of these infinite width networks are described by the Neural Network Gaussian Process (NNGP) (Lee et al., 2018; Matthews et al., 2018a) kernel for Bayesian networks, and by the Neural Tangent Kernel (NTK) (Jacot et al., 2018) and weight space linearization (Lee et al., 2019; Chizat et al., 2019) for gradient descent trained networks.

This correspondence has been key to recent breakthroughs in our understanding of neural networks (Xiao et al., 2018; Valle-Perez et al., 2019; Wei et al., 2019; Xiao et al., 2020; Bietti and Mairal, 2019; Ben-David and Ringel, 2019; Yang and Salman, 2019; Ober and Aitchison, 2020; Hu et al., 2020b; Lewkowycz et al., 2020). It has also enabled practical advances in kernel methods (Garriga-Alonso et al., 2019; Novak et al., 2019; Arora et al., 2019a; Li et al., 2019a; Arora et al., 2020; Shankar et al., 2020; Novak et al., 2020; Hron et al., 2020), Bayesian deep learning (Wang et al., 2019; Cheng et al., 2019; Carvalho et al., 2020), active learning (Tsymbalov et al., 2019), and semi-supervised learning (Hu et al., 2020a). The NNGP, NTK, and related large width limits (Cho and Saul, 2009; Daniely et al., 2016; Poole et al., 2016; Chen et al., 2018; Li and Nguyen, 2019; Daniely, 2017; Pretorius et al., 2018; Hayou et al., 2018; Karakida et al., 2018; Blumenfeld et al., 2019; Hayou et al., 2019; Schoenholz et al., 2017; Pennington et al., 2017; Xiao et al., 2018; Yang and Schoenholz, 2017) are unique in giving an exact theoretical description of large scale neural networks. Because of this, we believe they will continue to play a transformative role in deep learning theory.

Infinite networks are a newly active field, and foundational empirical questions remain unanswered. In this work, we perform an extensive and in-depth empirical study of finite and infinite width neural networks. In so doing, we provide quantitative answers to questions about the factors of variation that drive performance in finite networks and kernel methods, uncover surprising new behaviors, and develop best practices that improve the performance of both finite and infinite width networks. We believe our results will both ground and motivate future work in wide networks.

2 Experiment design

Figure 1: CIFAR-10 test accuracy for finite and infinite networks and their variations. Panels: finite gradient descent (GD), infinite GD, and infinite Bayesian. Starting from the finite width base network of given architecture class described in §2, performance changes from centering (+C), large learning rate (+LR), allowing underfitting by early stopping (+U), input preprocessing with ZCA regularization (+ZCA), multiple initialization ensembling (+Ens), and some combinations are shown, for Standard and NTK parameterizations. The performance of the linearized (lin) base network is also shown. See Table S1 for precise values for each of these experiments, as well as for additional experimental conditions not shown here.

To systematically develop a phenomenology of infinite and finite neural networks, we first establish base cases for each architecture where infinite-width kernel methods, linearized weight-space networks, and nonlinear gradient descent based training can be directly compared. In the finite-width settings, the base case uses mini-batch gradient descent at a constant small learning rate Lee et al. (2019) with MSE loss (implementation details in §H). In the kernel-learning setting we compute the NNGP and NTK for the entire dataset and do exact inference as described in (Rasmussen and Williams, 2006, page 16). Once this one-to-one comparison has been established, we augment the base setting with a wide range of interventions. We discuss each of these interventions in detail below. Some interventions will approximately preserve the correspondence (for example, data augmentation), while others explicitly break the correspondence in a way that has been hypothesized in the literature to affect performance (for example, large learning rates Lewkowycz et al. (2020)). We additionally explore linearizing the base model around its initialization, in which case its training dynamics become exactly described by a constant kernel. This differs from the kernel setting described above due to finite width effects.

We use MSE loss to allow for easier comparison to kernel methods, whose predictions can be evaluated in closed form for MSE. See Table S2 and Figure S3 for a comparison of MSE to softmax-cross-entropy loss. Softmax-cross-entropy provides a consistent small benefit over MSE, and will be interesting to consider in future work.
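To make the closed-form evaluation concrete, here is a minimal NumPy sketch of the kernel mean predictor under MSE. It assumes a generic squared-exponential kernel as a stand-in for the NNGP or NTK; the helper names `rbf_kernel` and `kernel_mean_predictor` are illustrative and are not part of the Neural Tangents library.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Stand-in kernel (assumption): squared-exponential, not an NNGP/NTK."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def kernel_mean_predictor(k_train_train, k_test_train, y_train, jitter=1e-12):
    """Mean prediction K_test,train (K_train,train + jitter * I)^{-1} y."""
    m = k_train_train.shape[0]
    alpha = np.linalg.solve(k_train_train + jitter * np.eye(m), y_train)
    return k_test_train @ alpha

rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 3))
y_train = np.eye(2)[rng.integers(0, 2, size=20)] - 0.5  # centered one-hot targets
x_test = rng.normal(size=(5, 3))

k_tt = rbf_kernel(x_train, x_train)
preds = kernel_mean_predictor(k_tt, rbf_kernel(x_test, x_train), y_train)
# With (almost) no regularization the predictor interpolates the training set.
train_fit = kernel_mean_predictor(k_tt, k_tt, y_train)
```

No iterative training is involved: a single linear solve against the training kernel gives the predictions, which is what makes MSE convenient for comparing against finite networks.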

Architectures we work with are built from either Fully-Connected (FCN) or Convolutional (CNN) layers. In all cases we use ReLU nonlinearities. Unless otherwise stated, we consider FCNs with 3 layers and CNNs with 8 layers. For convolutional networks we must collapse the spatial dimensions of image-shaped data before the final readout layer. To do this we either flatten the image into a one-dimensional vector (VEC) or apply global average pooling to the spatial dimensions (GAP). Finally, we compare two ways of parameterizing the weights and biases of the network: the standard parameterization (STD), which is used in work on finite-width networks, and the NTK parameterization (NTK), which has been used in most infinite-width studies to date (see Sohl-Dickstein et al. (2020) for the standard parameterization at infinite width).
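For the fully-connected ReLU case, both kernels can be written down with the well-known arc-cosine recursion (Cho and Saul, 2009). The sketch below is illustrative only: the function name `relu_fcn_kernels`, the weight and bias variances, and the input scaling are assumptions, and this is not the Neural Tangents implementation used for the experiments.

```python
import numpy as np

def relu_fcn_kernels(x1, x2, n_hidden=3, w_var=2.0, b_var=0.0):
    """NNGP and NTK of an infinitely wide ReLU FCN (arc-cosine recursion).

    Returns (nngp, ntk), each of shape (len(x1), len(x2)). Hyperparameters
    are illustrative assumptions, not the settings used in the experiments.
    """
    d = x1.shape[1]
    # Kernel of the first affine layer (pre-activations).
    k12 = w_var * x1 @ x2.T / d + b_var
    k11 = w_var * np.sum(x1**2, axis=1) / d + b_var  # diagonal entries for x1
    k22 = w_var * np.sum(x2**2, axis=1) / d + b_var  # diagonal entries for x2
    ntk = k12.copy()
    for _ in range(n_hidden):
        norms = np.sqrt(np.outer(k11, k22))
        cos_theta = np.clip(k12 / norms, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for jointly Gaussian (u, v).
        relu_cov = norms * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
        relu_dot = (np.pi - theta) / (2 * np.pi)
        k12 = w_var * relu_cov + b_var
        ntk = k12 + ntk * w_var * relu_dot
        # On the diagonal theta = 0, so E[relu(u)^2] = k / 2.
        k11 = w_var * k11 / 2 + b_var
        k22 = w_var * k22 / 2 + b_var
    return k12, ntk
```

With `w_var=2.0` and `b_var=0.0` the diagonal of the NNGP is preserved from layer to layer, and the NTK diagonal is the NNGP diagonal times the number of affine layers, which is a quick sanity check on the recursion.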

Except where noted, for all kernel experiments we optimize over diagonal kernel regularization independently for each experiment. For finite width networks, except where noted we use a small learning rate corresponding to the base case. See §C.1 for details.

The experiments described in this paper are often very compute intensive. For example, to compute the NTK or NNGP for the entirety of CIFAR-10 for CNN-GAP architectures one must explicitly evaluate the entries in a -by- kernel matrix. Typically this takes around 1200 GPU hours with double precision, and so we implement our experiments via massively distributed compute infrastructure based on Apache Beam (Foundation). All experiments use the Neural Tangents library (Novak et al., 2020), built on top of JAX (Bradbury et al., 2018a).

To be as systematic as possible while remaining tractable given this large computational requirement, we evaluated every intervention for every architecture and focused on a single dataset, CIFAR-10 (Krizhevsky et al., 2009). However, to ensure robustness of our results across datasets, we evaluate several key claims on CIFAR-100 and Fashion-MNIST (Xiao et al., 2017).

3 Observed empirical phenomena

3.1 NNGP/NTK can outperform finite networks

A common assumption in the study of infinite networks is that they underperform the corresponding finite network in the large data regime. We carefully examine this assumption, by comparing kernel methods against the base case of a finite width architecture trained with small learning rate and no regularization (§2), and then individually examining the effects of common training practices which break (large LR, L2 regularization) or improve (ensembling) the infinite width correspondence to kernel methods. The results of these experiments are summarized in Figure 1 and Table S1.

First focusing on base finite networks, we observe that infinite FCN and CNN-VEC outperform their respective finite networks. On the other hand, infinite CNN-GAP networks perform worse than their finite-width counterparts in the base case, consistent with observations in Arora et al. (2019a). We emphasize that architecture plays a key role in relative performance. For example, infinite-FCNs outperform finite-width networks even when combined with various tricks such as high learning rate, L2, and underfitting. Here the performance becomes similar only after ensembling (§3.3).

One interesting observation is that ZCA regularization preprocessing (§3.10) can provide significant improvements to the CNN-GAP kernel, closing the gap to within 1-2%.

3.2 NNGP typically outperforms NTK

Recent evaluations of infinite width networks have put significant emphasis on the NTK, without explicit comparison against the respective NNGP models (Arora et al., 2019a; Li et al., 2019a; Du et al., 2019b; Arora et al., 2020). Combined with the view of NNGPs as “weakly-trained” (Lee et al., 2019; Arora et al., 2019a) (i.e. having only the last layer learned), one might expect NTK to be a more effective model class than NNGP. On the contrary, we usually observe that NNGP inference achieves better performance. This can be seen in Table S1, where SOTA performance among fixed kernels is attained with the NNGP across all architectures. In Figure 2 we show that this trend persists across CIFAR-10, CIFAR-100, and Fashion-MNIST (see Figure S5 for similar trends on UCI regression tasks). In addition to producing stronger models, NNGP kernels require about half the memory and compute of the corresponding NTK, and some of the most performant kernels do not have an associated NTK at all (Shankar et al., 2020). Together these results suggest that when approaching a new problem where the goal is to maximize performance, practitioners should start with the NNGP.

We emphasize that both tuning of the diagonal regularizer (Figure 7) and sufficient numerical precision (§3.8, Figure S1) were crucial to achieving an accurate comparison of these kernels.

Figure 2: NNGP often outperforms NTK in image classification tasks when diagonal regularization is carefully tuned. The performance of the NNGP and NT kernels is plotted against each other for a variety of data pre-processing configurations (§3.10), while regularization (Figure 7) is independently tuned for each.

Figure 3: Centering can accelerate training and improve performance. Validation accuracy throughout training for several finite width architectures. See Figure S6 for training accuracy.

3.3 Centering and ensembling finite networks both lead to kernel-like performance

For overparameterized neural networks, some randomness from the initial parameters persists throughout training and the resulting learned functions are themselves random. This excess variance in the network’s predictions generically increases the total test error through the variance term of the bias-variance decomposition. For infinite-width kernel systems this variance is eliminated by using the mean predictor. For finite-width models, the variance can be large, and test performance can be significantly improved by ensembling a collection of models. In Figure 4, we examine the effect of ensembling. For FCN networks, ensembling closes the gap with kernel methods, suggesting that FCN NNs underperform FCN kernels primarily due to variance. For CNN models, ensembling also improves test performance, and ensembled CNN-GAP models significantly outperform the best kernel methods.
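The role of the variance term can be made explicit with a short worked identity. As a simplifying assumption (not stated in the text), take $N$ networks $f_1, \dots, f_N$ whose predictions at a fixed input $x$ are independent and identically distributed over random initializations, and let $y$ be the fixed target:

```latex
\bar f_N(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x),
\qquad
\mathbb{E}\big[(\bar f_N(x) - y)^2\big]
  = \big(\mathbb{E}[f(x)] - y\big)^2 + \frac{1}{N}\,\mathrm{Var}[f(x)].
```

Under this assumption the squared bias is unchanged by ensembling while the variance contribution shrinks as $1/N$, so a large ensemble approaches the error of the mean predictor.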

Prediction variance can also be reduced by centering the model, i.e. subtracting the model’s initial predictions: $f_{\text{centered}}(\theta) = f(\theta) - f(\theta_{\text{initial}})$. A similar variance reduction technique has been studied in Chizat et al. (2019); Zhang et al. (2019); Hu et al. (2020c); Bai and Lee (2020). In Figure 3, we observe that centering significantly speeds up training and improves generalization for FCN and CNN-VEC models, but has little-to-no effect on CNN-GAP architectures. We observe that the scale of the posterior variance of CNN-GAP, in the infinite-width kernel, is small relative to the prior variance given more data, consistent with centering and ensembles having small effect.
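Centering is simple to implement: keep a frozen copy of the initial parameters and subtract the frozen network's output. The sketch below uses a small bias-free ReLU MLP purely for illustration; the initialization scale and the helper names `init_params`, `mlp`, and `make_centered` are assumptions rather than the paper's code.

```python
import numpy as np

def init_params(rng, sizes):
    """Standard-parameterization init (assumption): W ~ N(0, 2 / fan_in)."""
    return [
        rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp(params, x):
    """Bias-free ReLU MLP, used only to illustrate centering."""
    h = x
    for w in params[:-1]:
        h = np.maximum(h @ w, 0.0)
    return h @ params[-1]

def make_centered(params_init):
    """Return f_centered(params, x) = f(params, x) - f(params_init, x)."""
    frozen = [w.copy() for w in params_init]
    return lambda params, x: mlp(params, x) - mlp(frozen, x)

rng = np.random.default_rng(0)
params0 = init_params(rng, [8, 64, 64, 10])
x = rng.normal(size=(4, 8))
centered = make_centered(params0)

logits_at_init = centered(params0, x)  # identically zero by construction
params1 = [w + 0.01 * rng.normal(size=w.shape) for w in params0]
logits_after_update = centered(params1, x)
```

The centered model outputs exactly zero at initialization, so the random initial function no longer contributes to the prediction variance.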

Figure 4: Ensembling base networks enables them to match the performance of kernel methods, and exceed kernel performance for nonlinear CNNs. See Figure S7 for test MSE.

3.4 Large LRs and L2 regularization drive differences between finite networks and kernels

In practice, L2 regularization (a.k.a. weight decay) or larger learning rates can break the correspondence between kernel methods and finite width neural network training even at large widths.

Lee et al. (2019) derive a critical learning rate $\eta_{\text{critical}}$ such that wide network training dynamics are equivalent to linearized training for $\eta < \eta_{\text{critical}}$. Lewkowycz et al. (2020) argue that even at large width a learning rate $\eta \in (\eta_{\text{critical}}, c \cdot \eta_{\text{critical}})$ for a constant $c > 1$ forces the network to move away from its initial high curvature minimum and converge to a lower curvature minimum, while Li et al. (2019b) argue that large initial learning rates enable networks to learn ‘hard-to-generalize’ patterns.
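A learning-rate threshold is easy to see in the linearized setting. Under MSE, the training residual of a linearized model evolves as r ← (I − ηΘ) r, so gradient descent is stable only when η < 2/λ_max(Θ). The sketch below illustrates this textbook bound on a random positive-definite matrix standing in for the NTK Gram matrix. It is not necessarily the exact definition of the critical learning rate used by Lee et al. (2019), and all names are illustrative.

```python
import numpy as np

def residual_norm_after(kernel, y, lr, steps):
    """Iterate r <- (I - lr * K) r starting from r = -y (zero initial output)."""
    r = -y.copy()
    for _ in range(steps):
        r = r - lr * kernel @ r
    return np.linalg.norm(r)

rng = np.random.default_rng(0)
a = rng.normal(size=(30, 30))
kernel = a @ a.T / 30 + 0.1 * np.eye(30)  # stand-in for an NTK Gram matrix
y = rng.normal(size=30)

lr_max = 2.0 / np.linalg.eigvalsh(kernel)[-1]  # stability bound 2 / lambda_max
stable = residual_norm_after(kernel, y, 0.9 * lr_max, steps=500)
unstable = residual_norm_after(kernel, y, 1.1 * lr_max, steps=500)
```

Just below the bound the residual shrinks, and just above it the component along the top eigenvector grows geometrically. A nonlinear network can leave this regime rather than diverge, which is the behavior the large learning rate results probe.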

In Figure 1 (and Table S1), we observe that the effectiveness of a large learning rate (LR) is highly sensitive to both architecture and parameterization: LR improves performance of FCN and CNN-GAP by about for STD parameterization and about for NTK parameterization. In stark contrast, it has little effect on CNN-VEC with NTK parameterization and, surprisingly, provides a huge performance boost on CNN-VEC with STD parameterization ().

L2 regularization (Equation S1) regularizes the squared distance between the parameters and the origin and encourages the network to converge to minima with smaller Euclidean norms. Such minima are different from those obtained by NT kernel-ridge regression (i.e. adding a diagonal regularization term to the NT kernel) (Wei et al., 2019), which essentially penalizes the deviation of the network’s parameters from initialization (Hu et al., 2019). See Figure S8 for a comparison.

L2 regularization consistently improves (+-) performance for all architectures and parameterizations. Even with a well-tuned L2 regularization, finite width CNN-VEC and FCN still underperform NNGP/NTK. Combining L2 with early stopping produces a dramatic additional performance boost for finite width CNN-VEC, outperforming NNGP/NTK. Finally, we note that L2+LR together provide a superlinear performance gain for all cases except FCN and CNN-GAP with NTK-parameterization. Understanding the nonlinear interactions between L2, LR, and early stopping on finite width networks is an important research question (e.g. see Lewkowycz et al. (2020); Lewkowycz and Gur-Ari (2020) for LR/L2 effect on the training dynamics).

3.5 Improving L2 regularization for networks using the standard parameterization

We find that L2 regularization provides dramatically more benefit (by up to ) to finite width networks with the NTK parameterization than to those that use the standard parameterization (see Table S1). There is a bijective mapping between weights in networks with the two parameterizations, which preserves the function computed by both networks: $W^l_{\text{STD}} = W^l_{\text{NTK}} / \sqrt{n^l}$, where $W^l$ is the $l$th layer weight matrix, and $n^l$ is the width of the preceding activation vector. Motivated by the improved performance of the L2 regularizer in the NTK parameterization, we use this mapping to construct a regularizer for standard parameterization networks that produces the same penalty as vanilla L2 regularization would produce on the equivalent NTK-parameterized network. This modified regularizer is $R^{\text{STD}}_{\text{Layerwise}} = \frac{\lambda}{2} \sum_l n^l \| W^l_{\text{STD}} \|^2$. This can be thought of as a layer-wise regularization constant $\lambda^l = \lambda n^l$. The improved performance of this regularizer is illustrated in Figure 5.
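The equivalence between the two penalties can be checked numerically. The sketch below assumes bias-free layers, omits any weight-variance prefactor, and uses a 1/2 convention for the penalty; the function names `l2_ntk` and `l2_std_layerwise` are illustrative.

```python
import numpy as np

def l2_ntk(ntk_weights, lam):
    """Vanilla L2 penalty on NTK-parameterized weights."""
    return 0.5 * lam * sum(np.sum(w**2) for w in ntk_weights)

def l2_std_layerwise(std_weights, lam):
    """Layer-wise penalty on standard-parameterization weights.

    Each layer is weighted by its fan-in n_l (number of rows of W), i.e. an
    effective per-layer constant lam * n_l.
    """
    return 0.5 * lam * sum(w.shape[0] * np.sum(w**2) for w in std_weights)

rng = np.random.default_rng(0)
sizes = [32, 128, 128, 10]
ntk_weights = [
    rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
# Function-preserving map between parameterizations: W_std = W_ntk / sqrt(n_in).
std_weights = [w / np.sqrt(w.shape[0]) for w in ntk_weights]

x = rng.normal(size=(4, 32))
h_ntk = x @ ntk_weights[0] / np.sqrt(sizes[0])  # NTK-parameterized layer
h_std = x @ std_weights[0]                      # standard-parameterized layer

penalty_ntk = l2_ntk(ntk_weights, lam=1e-3)
penalty_std = l2_std_layerwise(std_weights, lam=1e-3)
```

The two layers compute the same pre-activations and the two penalties agree, so the layer-wise constant reproduces on a standard-parameterization network the regularization pressure that plain weight decay exerts in the NTK parameterization.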


Figure 5: Layerwise scaling motivated by NTK makes L2 regularization more helpful in standard parameterization networks. See §3.5 for introduction of the improved regularizer, Figure S9 for further analysis on L2 regularization to initial weights, and Figure S8 for effects on varying widths.

3.6 Performance can be non-monotonic in width beyond double descent

Figure 6: Finite width networks generally perform better with increasing width, but CNN-VEC shows surprising non-monotonic behavior. L2: non-zero weight decay allowed during training. LR: large learning rate allowed. Dashed lines allow underfitting (U). See Figure S10 for plots for the standard parameterization, and §3.11 for discussion of CNN-VEC results.

Deep learning practitioners have repeatedly found that increasing the number of parameters in their models leads to improved performance (Lawrence et al., 1998; Bartlett, 1998; Neyshabur et al., 2015; Canziani et al., 2016; Novak et al., 2018; Park et al., 2019; Novak et al., 2019). While this behavior is consistent with a Bayesian perspective on generalization (MacKay, 1995; Smith and Le, 2017; Wilson and Izmailov, 2020), it seems at odds with classic generalization theory, which primarily considers worst-case overfitting (Haussler, 1992; Baum and Haussler, 1989; Vapnik, 1998; Bartlett and Mendelson, 2002; Bousquet and Elisseeff, 2002; Mukherjee et al., 2004; Poggio et al., 2004). This has led to a great deal of work on the interplay of overparameterization and generalization (Zhang et al., 2017; Advani and Saxe, 2017; Neyshabur et al., 2018, 2019; Li and Liang, 2018b; Allen-Zhu et al., 2019; Ghorbani et al., 2019a, b; Arora et al., 2019b; Brutzkus and Globerson, 2019). Of particular interest has been the phenomenon of double descent, in which performance increases overall with parameter count, but drops dramatically when the neural network is roughly critically parameterized (Opper et al., 1990; Belkin et al., 2019; Nakkiran et al., 2019).

Empirically, we find that in most cases (FCN and CNN-GAP in both parameterizations, CNN-VEC with standard parameterization) increasing width leads to monotonic improvements in performance. However, we also find a more complex dependence on width in specific relatively simple settings. For example, in Figure 6 for CNN-VEC with NTK parameterization the performance depends non-monotonically on the width, and the optimal width has an intermediate value (similar behavior was observed in Andreassen and Dyer (2020)). This nonmonotonicity is distinct from double-descent-like behavior, as all widths correspond to overparameterized models.

3.7 Diagonal regularization of kernels behaves like early stopping

Figure 7: Diagonal kernel regularization acts similarly to early stopping. Solid lines correspond to NTK inference with varying diagonal regularization . Dashed lines correspond to predictions after gradient descent evolution to time (with ). Line color indicates varying training set size . Performing early stopping at time corresponds closely to regularizing with coefficient , where denotes number of output classes.

When performing kernel inference, it is common to add a diagonal regularizer to the training kernel matrix, $\mathcal{K}_{\text{reg}} = \mathcal{K} + \epsilon \frac{\mathrm{tr}(\mathcal{K})}{m} I$, where $m$ is the training set size. For linear regression, Ali et al. (2019) proved that the inverse of a kernel regularizer is related to early stopping time under gradient flow. With kernels, gradient flow dynamics correspond directly to training of a wide neural network (Jacot et al., 2018; Lee et al., 2019).

We experimentally explore the relationship between early stopping, kernel regularization, and generalization in Figure 7. We observe a close relationship between regularization and early stopping, and find that in most cases the best validation performance occurs with early stopping and non-zero $\epsilon$. While Ali et al. (2019) do not consider a $\mathrm{tr}(\mathcal{K})/m$ scaling on the kernel regularizer, we found it useful since experiments become invariant under the scale of $\mathcal{K}$.
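Both predictors compared here have closed forms for MSE, which the following sketch puts side by side. It assumes a diagonal regularizer scaled by the mean diagonal entry of the training kernel, tr(K)/m, and a random positive-definite matrix as a stand-in kernel; the function names are illustrative.

```python
import numpy as np

def ridge_predict(k_tt, k_st, y, eps):
    """Kernel regression with K_reg = K + eps * tr(K)/m * I."""
    m = k_tt.shape[0]
    k_reg = k_tt + eps * np.trace(k_tt) / m * np.eye(m)
    return k_st @ np.linalg.solve(k_reg, y)

def gradient_flow_predict(k_tt, k_st, y, t, lr=1.0):
    """Prediction after gradient flow on MSE for time t, from zero output:
    K_st K^{-1} (I - exp(-lr * t * K)) y, via an eigendecomposition."""
    evals, evecs = np.linalg.eigh(k_tt)
    # (1 - exp(-lr * t * lambda)) / lambda, computed stably for small lambda.
    filt = -np.expm1(-lr * t * evals) / evals
    return k_st @ (evecs @ (filt[:, None] * (evecs.T @ y)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(25, 40))
k = feats @ feats.T / 40  # positive-definite stand-in kernel
k_tt, k_st = k[:20, :20], k[20:, :20]
y = rng.normal(size=(20, 2))

unregularized = ridge_predict(k_tt, k_st, y, eps=0.0)
long_training = gradient_flow_predict(k_tt, k_st, y, t=1e6)
unscaled = ridge_predict(k_tt, k_st, y, eps=0.1)
scaled = ridge_predict(10.0 * k_tt, 10.0 * k_st, y, eps=0.1)
```

Training to convergence matches the unregularized predictor, while stopping early damps the small-eigenvalue directions much as a positive regularizer does. Because the regularizer is scaled by the trace, multiplying the kernel by a constant leaves the regularized prediction unchanged.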

3.8 Floating point precision determines critical dataset size for failure of kernel methods

Figure 8: Tail eigenvalues of infinite network kernels show power-law decay. The red dashed line shows the predicted scale of noise in the eigenvalues due to floating point precision, for kernel matrices of increasing width. Eigenvalues for CNN-GAP architectures decay fast, and may be overwhelmed by float32 quantization noise for dataset sizes of . For float64, quantization noise is not predicted to become significant until a dataset size of (Figure S1).

We observe empirically that kernels become sensitive to float32 vs. float64 numerical precision at a critical dataset size. For instance, GAP models suffer float32 numerical precision errors at a dataset size of . This phenomenon can be understood with a simple random noise model (see §D for details). The key insight is that kernels with fast eigenvalue decay suffer from floating point noise. Empirically, the tail eigenvalues of the NNGP/NTK follow a power law (see Figure 8), and measuring their decay trend provides a good indication of the critical dataset size


where is the typical noise scale, e.g. float32 epsilon, and the kernel eigenvalue decay is modeled as as increases. Beyond this critical dataset size, the smallest eigenvalues in the kernel become dominated by floating point noise.
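The effect is easy to reproduce on a synthetic kernel. The sketch below builds a matrix with a power-law spectrum and random eigenvectors (an assumption made for illustration, not one of the paper's kernels) and measures how well the smallest eigenvalue survives a round trip through each precision. The function name `tail_eigenvalue_error` and the chosen size and exponent are hypothetical.

```python
import numpy as np

def tail_eigenvalue_error(m, alpha, dtype, seed=0):
    """Relative error of the smallest eigenvalue after casting to `dtype`.

    The kernel is synthetic (assumption): random orthogonal eigenvectors and
    a power-law spectrum lambda_i = i^(-alpha), i = 1..m.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    true_evals = np.arange(1, m + 1, dtype=np.float64) ** (-alpha)
    kernel = (q * true_evals) @ q.T
    kernel = 0.5 * (kernel + kernel.T)  # remove rounding asymmetry
    computed = np.linalg.eigvalsh(kernel.astype(dtype)).astype(np.float64)
    smallest_true = true_evals[-1]
    return abs(computed[0] - smallest_true) / smallest_true

err32 = tail_eigenvalue_error(m=300, alpha=3.0, dtype=np.float32)
err64 = tail_eigenvalue_error(m=300, alpha=3.0, dtype=np.float64)
```

With this spectrum the smallest eigenvalue is below float32 machine epsilon relative to the largest one, so it is recovered poorly in float32 and accurately in float64. Faster decay or a larger matrix pushes more of the tail under the noise floor.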

3.9 Linearized CNN-GAP models perform poorly due to poor conditioning

We observe that the linearized CNN-GAP converges extremely slowly on the training set (Figure S6), leading to poor validation performance (Figure 3). Even after training for more than 10M steps with varying L2 regularization strengths and LRs, the best training accuracy was below 90%, and test accuracy 70% – worse than both the corresponding infinite and nonlinear finite width networks.

This is caused by poor conditioning of pooling networks. Xiao et al. (2020) (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32.

3.10 Regularized ZCA whitening improves accuracy

ZCA whitening (Bell and Sejnowski, 1997; see Figure S2 for an illustration) is a data preprocessing technique that was once common (Goodfellow et al., 2018; Zagoruyko and Komodakis, 2016), but has fallen out of favor. However, it was recently shown to dramatically improve accuracy in some kernel methods by Shankar et al. (2020), in combination with a small regularization parameter in the denominator (see §F). We investigate the utility of ZCA whitening as a preprocessing step for both finite and infinite width neural networks. We observe that while pure ZCA whitening is detrimental for both kernels and finite networks (consistent with predictions in Wadia et al. (2020)), with tuning of the regularization parameter it provides performance benefits for both kernel methods and finite network training (Figure 9).
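A minimal sketch of regularized ZCA whitening follows. The exact form used in §F is not reproduced in this section, so the choice of scaling the regularizer by the mean eigenvalue of the data covariance is an assumption, as is the function name `zca_whiten`.

```python
import numpy as np

def zca_whiten(x_train, x_test, eps=1e-1):
    """Regularized ZCA: rotate into the covariance eigenbasis, rescale each
    direction by (lambda_i + eps * mean(lambda))^(-1/2), and rotate back.
    The mean-eigenvalue scaling of eps is an assumption."""
    mean = x_train.mean(axis=0)
    xc = x_train - mean
    cov = xc.T @ xc / xc.shape[0]
    evals, evecs = np.linalg.eigh(cov)
    scale = 1.0 / np.sqrt(evals + eps * evals.mean())
    w = (evecs * scale) @ evecs.T  # symmetric whitening matrix
    return xc @ w, (x_test - mean) @ w

rng = np.random.default_rng(0)
mix = rng.normal(size=(6, 6))
x_train = rng.normal(size=(500, 6)) @ mix  # correlated features
x_test = rng.normal(size=(50, 6)) @ mix

pure_train, _ = zca_whiten(x_train, x_test, eps=0.0)
reg_train, reg_test = zca_whiten(x_train, x_test, eps=0.1)
cov_pure = pure_train.T @ pure_train / pure_train.shape[0]
cov_reg = reg_train.T @ reg_train / reg_train.shape[0]
```

With `eps=0.0` the training covariance becomes the identity, which amplifies low-variance directions without limit. A positive `eps` caps that amplification, and it is the knob tuned in the experiments above.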

Figure 9: Regularized ZCA whitening improves image classification performance for both finite and infinite width networks. All plots show performance as a function of ZCA regularization strength. (a) ZCA whitening of inputs to kernel methods on CIFAR-10, Fashion-MNIST, and CIFAR-100. (b) ZCA whitening of inputs to finite width networks (training curves in Figure S11).

3.11 Equivariance is only beneficial for narrow networks far from the kernel regime

Due to weight sharing between spatial locations, outputs of a convolutional layer are translation-equivariant (up to edge effects), i.e. if an input image is translated, the activations are translated in the same spatial direction. However, the vast majority of contemporary CNNs utilize weight sharing in conjunction with pooling layers, making the network outputs approximately translation-invariant (CNN-GAP). The impact of equivariance alone (CNN-VEC) on generalization is not well understood – it is a property of internal representations only, and does not translate into meaningful statements about the classifier outputs. Moreover, in the infinite-width limit it is guaranteed to have no impact on the outputs (Novak et al., 2019; Yang, 2019a). In the finite regime it has been reported both to provide substantial benefits by Lecun (1989); Novak et al. (2019) and no significant benefits by Bartunov et al. (2018).
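The distinction between equivariance and invariance can be seen in a few lines. The sketch below uses a one-dimensional convolution with circular padding, so that equivariance holds exactly with no edge effects, followed by either a dense readout over positions (as in CNN-VEC) or global average pooling (as in CNN-GAP). The sizes and names are illustrative assumptions.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """1-D cross-correlation with circular padding, so that translation
    equivariance holds exactly with no edge effects."""
    n = x.shape[0]
    idx = (np.arange(n)[:, None] + np.arange(kernel.shape[0])[None, :]) % n
    return x[idx] @ kernel

rng = np.random.default_rng(0)
x = rng.normal(size=16)
kernel = rng.normal(size=3)
readout = rng.normal(size=16)  # dense readout over positions (VEC)
shift = 5

feat = np.maximum(circular_conv1d(x, kernel), 0.0)
feat_shifted = np.maximum(circular_conv1d(np.roll(x, shift), kernel), 0.0)

vec_out, vec_out_shifted = feat @ readout, feat_shifted @ readout
gap_out, gap_out_shifted = feat.mean(), feat_shifted.mean()
```

The feature map of the shifted input is the shifted feature map (equivariance). The pooled output is unchanged (invariance), whereas the dense readout generally gives a different output, which is why equivariance alone says little about the classifier outputs.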

We conjecture that equivariance can only be leveraged far from the kernel regime. Indeed, as observed in Figure 1 and discussed in §3.4, multiple kernel correspondence-breaking tricks are required for a meaningful boost in performance over NNGP or NTK (which are mathematically guaranteed to not benefit from equivariance), and the boost is largest at a moderate width (Figure 6). Otherwise, even large ensembles of equivariant models (see CNN-VEC LIN in Figure 4) perform comparably to their infinite width, equivariance-agnostic counterparts. Accordingly, prior work that managed to extract benefits from equivariant models (Lecun, 1989; Novak et al., 2019) tuned networks far outside the kernel regime (extremely small size and +LR+L2+U respectively). We further confirm this phenomenon in a controlled setting in Figure 10.

Figure 10: Equivariance is only leveraged in a CNN model outside of the kernel regime. If a CNN model is able to utilize equivariance effectively, we expect it to be more robust to crops and translations than an FCN. Surprisingly, performance of a wide CNN-VEC degrades with the magnitude of the input perturbation as fast as that of an FCN, indicating that equivariance is not exploited. In contrast, performance of a narrow model with weight decay (CNN-VEC+L2+narrow) falls off much slower. Translation-invariant CNN-GAP remains, as expected, the most robust. Details in §3.11, §C.1.

3.12 Ensembling kernel predictors enables practical data augmentation with NNGP/NTK

Finite width neural networks are often trained with data augmentation (DA) to improve performance. We observe that the FCN and CNN-VEC architectures (both finite and infinite networks) benefit from DA, and that DA can cause CNN-VEC to become competitive with CNN-GAP (Table S1). While CNN-VEC possesses translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data.

For kernels, expanding a dataset with augmentation is computationally challenging, since kernel computation is quadratic in dataset size, and inference is cubic. Li et al. (2019a); Shankar et al. (2020) incorporated flip augmentation by doubling the training set size. Extending this strategy to more augmentations such as crop or mixup (Zhang et al., 2018), or to broader augmentation strategies like AutoAugment (Cubuk et al., 2019a) and RandAugment (Cubuk et al., 2019b), rapidly becomes infeasible.

Here we introduce a straightforward method for ensembling kernel predictors to enable more extensive data augmentation. More sophisticated approximation approaches such as the Nyström method (Williams and Seeger, 2001) might yield even better performance. The strategy involves constructing a set of augmented batches, performing kernel inference for each of them, and then performing ensembling of the resulting predictions. This is equivalent to replacing the kernel with a block diagonal approximation, where each block corresponds to one of the batches, and the union of all augmented batches is the full augmented dataset. See §E for more details. This method achieves SOTA for a kernel method corresponding to the infinite width limit of each architecture class we studied (Figure 11 and Table 1).
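The ensembling strategy can be sketched as follows. This is an illustration under stated assumptions: a squared-exponential kernel stands in for the NNGP/NTK, the augmentation is a toy circular shift, each batch is one augmented copy of the training set, and all function names are hypothetical. See §E for how the batches are actually composed.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0):
    """Stand-in kernel (assumption); the paper uses NNGP/NTK kernels."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def kernel_predict(x_train, y_train, x_test, eps=1e-3):
    """Kernel regression with a trace-scaled diagonal regularizer."""
    k_tt = rbf_kernel(x_train, x_train)
    m = k_tt.shape[0]
    k_reg = k_tt + eps * np.trace(k_tt) / m * np.eye(m)
    return rbf_kernel(x_test, x_train) @ np.linalg.solve(k_reg, y_train)

def augment(x, rng):
    """Toy augmentation (assumption): random circular shift of each example."""
    shifts = rng.integers(-2, 3, size=x.shape[0])
    return np.stack([np.roll(row, s) for row, s in zip(x, shifts)])

def da_ensemble_predict(x_train, y_train, x_test, n_batches, rng):
    """Average kernel predictors, each fit on one augmented batch.
    Cost is n_batches solves of size m instead of one of size n_batches * m."""
    preds = [
        kernel_predict(augment(x_train, rng), y_train, x_test)
        for _ in range(n_batches)
    ]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(30, 8))
y_train = np.eye(3)[rng.integers(0, 3, size=30)]
x_test = rng.normal(size=(5, 8))
ensemble_pred = da_ensemble_predict(x_train, y_train, x_test, n_batches=4, rng=rng)
```

Since inference is cubic in the number of training points, solving several batch-sized systems and averaging is far cheaper than one solve over the full augmented dataset, at the price of ignoring kernel entries between batches.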

Figure 11: Ensembling kernel predictors makes predictions from large augmented datasets computationally tractable. We used standard crop by 4 and flip data augmentation (DA) common for training neural networks for CIFAR-10. We observed that DA ensembling improves accuracy and is much more effective for NNGP compared to NTK. In the last panel, we applied data augmentation by ensemble to the Myrtle architecture studied in Shankar et al. (2020). We observe improvements over our base setting, but do not reach the reported best performance. We believe techniques such as leave-one-out tilt and ZCA augmentation also used in Shankar et al. (2020) contribute to this difference.

Architecture | Method                                                   | NTK           | NNGP
FC           | Novak et al. (2019)                                      | -             | 59.9
             | ZCA Reg (this work)                                      | 59.7          | 59.7
             | DA Ensemble (this work)                                  | 61.5          | 62.4
CNN-VEC      | Novak et al. (2019)                                      | -             | 67.1
             | Li et al. (2019a)                                        | 66.6          | 66.8
             | ZCA Reg (this work)                                      | 69.8          | 69.4
             | Flip Augmentation, Li et al. (2019a)                     | 69.9          | 70.5
             | DA Ensemble (this work)                                  | 70.5          | 73.2
CNN-GAP      | Arora et al. (2019a); Li et al. (2019a)                  | 77.6          | 78.5
             | ZCA Reg (this work)                                      | 83.2          | 83.5
             | Flip Augmentation, Li et al. (2019a)                     | 79.7          | 80.0
             | DA Ensemble (this work)                                  | 83.7 (32 ens) | 84.8 (32 ens)
Myrtle*      | Myrtle ZCA and Flip Augmentation, Shankar et al. (2020)  | -             | 89.8

* The normalized Gaussian Myrtle kernel used in Shankar et al. (2020) does not have a corresponding finite-width neural network, and was additionally tuned on the test set for the case of CIFAR-10.

Table 1: CIFAR-10 test accuracy for kernels of the corresponding architecture type.

4 Discussion

We performed an in-depth investigation of the phenomenology of finite and infinite width neural networks through a series of controlled interventions. We quantified phenomena having to do with generalization, architecture dependence, deviations between infinite and finite networks, numerical stability, data augmentation, data preprocessing, ensembling, network topology, and failure modes of linearization. We further developed best practices that improve performance for both finite and infinite networks. We believe our experiments provide firm empirical ground for future studies.

Broader Impact

Developing theoretical understanding of neural networks is crucial both for understanding their biases, and predicting when and how they will fail. Understanding biases in models is of critical importance if we hope to prevent them from perpetuating and exaggerating existing racial, gender, and other social biases (Hardt et al., 2016; Barocas and Selbst, 2016; Doshi-Velez and Kim, 2017; Barocas et al., 2019). Understanding model failure has a direct impact on human safety, as neural networks increasingly do things like drive cars and control the electrical grid (Bojarski et al., 2016; Rudin et al., 2011; Ozay et al., 2015).

We believe that wide neural networks are currently the most promising direction for the development of neural network theory. We further believe that the experiments we present in this paper will provide empirical underpinnings that allow better theory to be developed. We thus believe that this paper will in a small way aid the engineering of safer and more just machine learning models.

We thank Yasaman Bahri and Ethan Dyer for discussions and feedback on the project. We are also grateful to Atish Agarwala and Gamaleldin Elsayed for providing valuable feedback on the draft.

We acknowledge the Python community Van Rossum and Drake Jr (1995) for developing the core set of tools that enabled this work, including NumPy van der Walt et al. (2011), SciPy Virtanen et al. (2020), Matplotlib Hunter (2007), Pandas Wes McKinney (2010), Jupyter Kluyver et al. (2016), JAX Bradbury et al. (2018b), Neural Tangents Novak et al. (2020), Apache Beam Foundation, TensorFlow Datasets TFD, and Google Colaboratory Research.


Appendix A Glossary

We use the following abbreviations in this work:

  • L2: L2 regularization a.k.a. weight decay;

  • LR: using large learning rate;

  • U: allowing underfitting;

  • DA: using data augmentation;

  • C: centering the network so that the logits are always zero at initialization;

  • Ens: neural network ensembling of logits over multiple initializations;

  • ZCA: zero-phase component analysis regularization preprocessing;

  • FCN: fully-connected neural network;

  • CNN-VEC: convolutional neural network with a vectorized readout layer;

  • CNN-GAP: convolutional neural network with a global average pooling readout layer;

  • NNGP: neural network Gaussian process;

  • NTK: neural tangent kernel.

Appendix B Main table

Param Base +C +LR +L2
w/o DA

Param Lin Base +C +L2
58.61 59.70 62.40
66.69 69.44 73.23
>70.00* (Train accuracy 86.22 after 14M steps)
>68.59* (Train accuracy 79.90 after 14M steps)
78.0 83.45 84.82
Table S1: CIFAR-10 classification accuracy for nonlinear and linearized finite neural networks, as well as for NTK and NNGP kernel methods. Starting from Base network of given architecture class described in §2, performance change of centering (+C), large learning rate (+LR), allowing underfitting by early stopping (+U), input preprocessing with ZCA regularization (+ZCA), multiple initialization ensembling (+Ens), and some combinations are shown, for Standard and NTK parameterization. See also Figure 1.

Appendix C Experimental details

For all experiments, we use the Neural Tangents (NT) library Novak et al. [2020], built on top of JAX Bradbury et al. [2018b]. We first describe the experimental settings that are common to most experiments, and then give specific details and hyperparameters for each experiment.

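Finite width neural networks We train finite width networks with Mean Squared Error (MSE) loss

$$\mathcal{L} = \frac{1}{2|\mathcal{D}|K} \sum_{(x_i, y_i) \in \mathcal{D}} \| f(x_i) - y_i \|^2,$$

where $K$ is the number of classes and $\|\cdot\|$ is the $L_2$ norm in $\mathbb{R}^K$. For the experiments with +L2, we add L2 regularization to the loss

$$R_{L2} = \frac{\lambda}{2} \sum_{l} \| W^{l} \|^2,$$

and tune $\lambda$ using grid-search optimizing for the validation accuracy.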
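As a minimal sketch of this objective, the snippet below evaluates an MSE loss averaged over examples and classes (with a factor of 1/2) plus an L2 penalty summed over weight matrices. The normalization and the function names `mse_loss` and `l2_penalty` are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def mse_loss(logits, targets):
    """0.5 * squared error, averaged over examples and classes (assumed normalization)."""
    num_examples, num_classes = logits.shape
    return 0.5 * np.sum((logits - targets) ** 2) / (num_examples * num_classes)


def l2_penalty(weights, lam):
    """(lam / 2) * sum of squared entries over all weight arrays."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)


# Example: one CIFAR-10-style target (class 3, mean-subtracted one-hot) and a zero prediction.
example_target = np.eye(10)[[3]] - 0.1
example_loss = mse_loss(np.zeros((1, 10)), example_target)

# Hypothetical weights, only to show how the penalty is added to the loss.
example_weights = [np.ones((2, 2))]
example_total = example_loss + l2_penalty(example_weights, lam=0.5)
```

Under this normalization, a network that outputs all zeros has a loss of 0.045 on ten-class data, which gives a reference scale for reported MSE values.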

We optimize the loss using mini-batch SGD with a constant learning rate. We use different batch sizes for FCN and for the CNNs (CNN-VEC and CNN-GAP); see §H for further details on this choice. The learning rate is parameterized by a learning rate factor $c$ with respect to the critical learning rate

$$\eta = c\, \eta_{\text{critical}}.$$

In practice, we compute the empirical NTK $\hat{\Theta}(x, x') = \sum_j \partial_j f(x)\, \partial_j f(x')$ on 16 random points in the training set to estimate $\eta_{\text{critical}}$ Lee et al. [2019] from the maximum eigenvalue of $\hat{\Theta}$. This is readily available in the NT library Novak et al. [2020] using nt.monte_carlo_kernel_fn and nt.predict.max_learning_rate. The base case, without large learning rate, uses $c \le 1$, and large learning rate (+LR) runs allow $c > 1$. Note that for linearized networks $\eta_{\text{critical}}$ is a strict upper bound for the learning rate and no $c > 1$ is allowed Lee et al. [2019], Yang and Salman [2019], Lewkowycz et al. [2020].
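To illustrate why the maximum NTK eigenvalue sets a hard limit for linearized models, here is a small NumPy sketch on a toy linear model, whose NTK is exactly the Gram matrix of its inputs. It assumes the textbook stability bound for gradient descent on a quadratic loss, a critical learning rate of 2 / lambda_max; the normalization used by nt.predict.max_learning_rate may differ, and all names below are hypothetical.

```python
import numpy as np


def critical_learning_rate(kernel):
    """2 / lambda_max for a symmetric PSD kernel (stability bound for a quadratic loss)."""
    lam_max = np.linalg.eigvalsh(kernel)[-1]  # eigvalsh returns ascending eigenvalues
    return 2.0 / lam_max


def train_linear_model(features, targets, lr, steps):
    """Full-batch GD on 0.5 * ||features @ w - targets||^2 from w = 0; returns final loss."""
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        residual = features @ w - targets
        w -= lr * features.T @ residual
    return 0.5 * np.sum((features @ w - targets) ** 2)


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 32))   # 16 training points, 32 parameters
targets = rng.normal(size=16)
ntk = features @ features.T            # NTK of the linear model f(x) = x @ w
eta_critical = critical_learning_rate(ntk)

initial_loss = 0.5 * np.sum(targets ** 2)
loss_below = train_linear_model(features, targets, 0.9 * eta_critical, steps=500)  # c = 0.9
loss_above = train_linear_model(features, targets, 1.1 * eta_critical, steps=500)  # c = 1.1
```

With c = 0.9 the loss decays toward zero, while with c = 1.1 the residual along the top eigen-direction grows at every step, which is the divergence the text describes for linearized networks.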

Training steps are chosen to be large enough, such that learning rate factor can reach above accuracy on random subset of training data for 5 logarithmic spaced measurements. For different learning rates, physical time roughly determines learning dynamics and small learning rate trials need larger number of steps. Achieving termination criteria was possible for all of the trials except for linearized CNN-GAP and data augmented training of FCN, CNN-VEC. In these cases, we report best achieved performance without fitting the training set.

NNGP / NTK For inference, except for data augmentation ensembles for which the default zero regularization was chosen, we grid search over diagonal regularization in the range numpy.logspace(-7, 2, 14) and $0$. Diagonal regularization is parameterized as

$$K_{\text{reg}} = K + \varepsilon\, \frac{\operatorname{tr}(K)}{m}\, I,$$

where $K$ is either the NNGP or NTK kernel for the training set and $m$ is the number of training points. We work with this parameterization since $\varepsilon$ is invariant to the scale of $K$.
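The scale invariance can be checked numerically. The sketch below assumes the regularizer is the mean diagonal of the training kernel times a strength `eps`, added to the diagonal, and uses a toy linear kernel in place of an NNGP or NTK kernel; the helper names are hypothetical.

```python
import numpy as np


def regularized_kernel(k_train, eps):
    """K + eps * tr(K) / m * I (assumed form of the trace-scaled regularizer)."""
    m = k_train.shape[0]
    return k_train + eps * np.trace(k_train) / m * np.eye(m)


def kernel_predict(k_train, k_test_train, y_train, eps):
    """Mean prediction of kernel regression with the regularized training kernel."""
    return k_test_train @ np.linalg.solve(regularized_kernel(k_train, eps), y_train)


rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 5))
x_test = rng.normal(size=(4, 5))
y_train = rng.normal(size=(20, 3))

# Toy linear kernel standing in for an NNGP / NTK kernel.
k_train = x_train @ x_train.T
k_test_train = x_test @ x_train.T

eps_grid = np.logspace(-7, 2, 14)  # the search range quoted in the text

pred = kernel_predict(k_train, k_test_train, y_train, eps=1e-2)
# Rescaling the kernel by a constant leaves the prediction unchanged for the same eps.
pred_scaled = kernel_predict(100.0 * k_train, 100.0 * k_test_train, y_train, eps=1e-2)
```

Because the regularizer scales with the kernel, a single grid of strengths can be reused across kernels whose entries differ by orders of magnitude.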


For all our experiments (unless specified) we use a train/valid/test split of 45k/5k/10k for CIFAR-10/100 and 50k/10k/10k for Fashion-MNIST. For all our experiments, inputs are standardized with per channel mean and standard deviation. ZCA regularized whitening is applied as described in §F. Output is encoded as a mean-subtracted one-hot encoding for the MSE loss, e.g. for a label in class $c$, $y = e_c - \frac{1}{K}\mathbf{1}$, where $e_c$ is the one-hot vector for class $c$ and $K$ is the number of classes. For the softmax-cross-entropy loss in §G, we use standard one-hot-encoded output.
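A mean-subtracted one-hot target can be built in a couple of lines. This is a small sketch with a hypothetical helper name, shown for ten classes.

```python
import numpy as np


def centered_one_hot(labels, num_classes):
    """One-hot encode integer labels, then subtract the per-class mean 1 / num_classes."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot - 1.0 / num_classes


# Two example labels (classes 0 and 3) for a ten-class problem.
example_targets = centered_one_hot(np.array([0, 3]), num_classes=10)
```

Each target row sums to zero, so a network whose outputs are zero at initialization starts from the mean of the targets.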

For data augmentation, we use the widely-used augmentation for CIFAR-10: horizontal flips with 50% probability and random crops by 4 pixels with zero-padding.
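The augmentation described here can be sketched in NumPy as follows. The function name and the (height, width, channel) layout are assumptions for illustration; the crop is implemented by zero-padding 4 pixels on each side and taking a random window of the original size.

```python
import numpy as np


def augment(image, rng, pad=4, flip_prob=0.5):
    """Random horizontal flip, then a random crop from the zero-padded image.

    image: array of shape (height, width, channels).
    """
    height, width, _ = image.shape
    if rng.random() < flip_prob:
        image = image[:, ::-1, :]                             # horizontal flip
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zero padding by default
    top = rng.integers(0, 2 * pad + 1)                        # offsets in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + height, left:left + width, :]


rng = np.random.default_rng(0)
example_image = np.ones((32, 32, 3))
example_augmented = augment(example_image, rng)
```

The output keeps the input shape, and shifted-out pixels are replaced by zeros.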

Details of architecture choice: We only consider ReLU activation (with the exception of the Myrtle kernel, which uses a scaled Gaussian activation Shankar et al. [2020]) and choose the critical initialization weight variance of $\sigma_w^2 = 2$ with a small bias variance. For convolution layers, we exclusively consider filters with stride 1 and SAME (zero) padding so that image size does not change under the convolution operation.

C.1 Hyperparameter configurations for all experiments

We used grid-search for tuning hyperparameters and used accuracy on the validation set for deciding on the hyperparameter configuration or measurement steps (for underfitting / early stopping). All reported numbers, unless specified, are test set performance.

Figure 1, Table S1: We grid-search over L2 regularization strength and learning rate factor . For linearized networks same search space is used except that configuration is infeasible and training diverges. For non-linear, centered runs is used. Network ensembles uses base configuration with , with 64 different initialization seed. Kernel ensemble is over 50 predictors for FCN and CNN-VEC and 32 predictors for CNN-GAP. Finite networks trained with data-augmentation has different learning rate factor range of .

Figure 2: Each datapoint corresponds to either standard preprocessed or ZCA regularization preprocessed (as described in §3.10) with regularization strength was varied in for FCN and CNN-VEC, for CNN-GAP.

Figure 3, Figure 4, Figure S6, Figure S7: Learning rate factors are for non-linear networks and for linearized networks. While we show NTK parameterized runs, we also observe similar trends for STD parameterized networks. Shaded regions show range of minimum and maximum performance across 64 different seeds. Solid line indicates the mean performance.

Figure 5: While FCN is the base configuration, CNN-VEC is a narrow network with 64 channels per layer, since moderate width benefits more from L2 for the NTK parameterization (Figure S10). For CNN-GAP, 128 channel networks are used. All networks with different L2 strategies are trained with +LR ($c > 1$).

Figure 6, Figure S8, Figure S10: and .

Figure 7: We use 640 subset of validation set for evaluation. CNN-GAP is a variation of the base model with 3 convolution layers with while FCN and CNN-VEC is the base model. Training evolution is computed using analytic time-evolution described in Lee et al. [2019] and implemented in NT library via nt.predict.gradient_descent_mse with 0 diagonal regularization.

Figure 9: Kernel experiments details are same as in Figure 2. Finite networks are base configuration with and .

Figure 10: Evaluated networks uses NTK parameterization with . CNN-VEC+L2+narrow uses 128 channels instead of 512 of the base CNN-VEC and CNN-GAP networks, and trained with L2 regularization strength . Crop transformation uses zero-padding while Translate transformation uses circular boundary condition after shifting images. Each transformation is applied to the test set inputs where shift direction is chosen randomly. Each points correspond to average accuracy over 20 random seeds. FCN had 2048 hidden units.

Figure 11, Table 1: For all data augmentation ensembles, the first instance is taken from the non-augmented training set. Further details on kernel ensembles are described in §E. For all kernels, inputs are preprocessed with the optimal ZCA regularization observed in Figure 9 (10 for FCN, 1 for CNN-VEC, CNN-GAP and Myrtle). We ensemble over 50 different augmented draws for FCN and CNN-VEC, whereas for CNN-GAP we ensemble over 32 draws of the augmented training set.

Figure S3, Table S2: Details for MSE trials are same as  Figure 1 and Table S1. Trials with softmax-cross-entropy loss was tuned with same hyperparameter range as MSE except that learning rate factor range was .

Figure S4: We present result with NTK parameterized networks with . FCN network is width 1024 with for MSE loss and for softmax-cross-entropy loss. CNN-GAP uses 256 channels with for MSE loss and for softmax-cross-entropy loss. Random seed was fixed to be the same across all runs for comparison.

Figure S9: NTK parameterization with was used for both L2 to zero or initialization. Random seed was fixed to be the same across all runs for comparison.

Appendix D Noise model

In this section, we provide details on the noise model discussed in §3.8. Consider a random $m \times m$ Hermitian matrix $Z$ with entries of order $\sigma_n$, which is considered as a noise perturbation to the kernel matrix

$$\tilde{K} = K + Z.$$

Eigenvalues of this random matrix $Z$ follow Wigner’s semi-circle law, and the smallest eigenvalue is given by $\lambda_{\min}(Z) \approx -\sqrt{2m}\,\sigma_n$. When the smallest eigenvalue of $K$ is smaller (in order) than $|\lambda_{\min}(Z)|$, one needs to add a diagonal regularizer larger than the order of $|\lambda_{\min}(Z)|$ to ensure positive definiteness. For estimates, let us use machine precision (footnote 3: np.finfo(np.float32).eps, np.finfo(np.float64).eps) $\epsilon_{32} \approx 1.2 \times 10^{-7}$ and $\epsilon_{64} \approx 2.2 \times 10^{-16}$, which we use as proxy values for $\sigma_n$. Note that the noise scale is relative to the elements in $K$, which are assumed to be $O(1)$. Naively scaling $K$ by a multiplicative constant will also scale $\sigma_n$.

Empirically one can model the tail $i$-th eigenvalues of an infinite width kernel matrix of size $m \times m$ as

$$\lambda_i \approx C\, \frac{m}{i^{\alpha}}.$$

Note that we are considering $O(1)$ entries for $K$, and typical eigenvalues scale linearly with dataset size $m$. For a given dataset size, the observed power law exponent is $\alpha$, and $C$ is a dataset-size independent constant. Thus the smallest eigenvalue is of order $\lambda_{\min}(K) \sim C\, m^{1-\alpha}$.

In the noise model, we can apply Weyl’s inequality, which says

$$\lambda_{\min}(K) - \sqrt{2m}\,\sigma_n \;\le\; \lambda_{\min}(\tilde{K}) \;\le\; \lambda_{\min}(K) + \sqrt{2m}\,\sigma_n.$$

Consider the worst case, where negative eigenvalue noise affects the kernel’s smallest eigenvalue. In that case the perturbed matrix’s minimum eigenvalue could become negative, breaking positive semi-definiteness (PSD) of the kernel.
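The effect can be reproduced on a toy example. The sketch below perturbs a rank-deficient PSD matrix with small symmetric Gaussian noise and checks the general form of Weyl's inequality, in which the shift of the smallest eigenvalue is bounded by the extreme eigenvalues of the noise. The matrix sizes and noise scale are arbitrary assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma_n = 200, 1e-3                       # hypothetical size and noise scale

# PSD kernel with a large null space, so its smallest eigenvalue is (numerically) zero.
factors = rng.normal(size=(m, m // 4))
kernel = factors @ factors.T / factors.shape[1]

# Symmetric noise matrix with entries of order sigma_n.
noise = rng.normal(scale=sigma_n, size=(m, m))
noise = (noise + noise.T) / 2.0

lam_kernel = np.linalg.eigvalsh(kernel)      # ascending eigenvalues
lam_noise = np.linalg.eigvalsh(noise)
lam_perturbed = np.linalg.eigvalsh(kernel + noise)

# Weyl: lam_min(K) + lam_min(Z) <= lam_min(K + Z) <= lam_min(K) + lam_max(Z)
weyl_lower = lam_kernel[0] + lam_noise[0]
weyl_upper = lam_kernel[0] + lam_noise[-1]
psd_broken = lam_perturbed[0] < 0.0
```

Here the unperturbed kernel is PSD, but the perturbed one has a negative smallest eigenvalue, so a Cholesky-based solver would fail without diagonal regularization.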

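This model allows us to predict the critical dataset size ($m^*$) above which PSD can be broken under a specified noise scale and kernel eigenvalue decay. With the condition that the perturbed smallest eigenvalue becomes negative,

$$C\, m^{1-\alpha} \lesssim \sqrt{2m}\,\sigma_n,$$

we obtain

$$m^* \gtrsim \begin{cases} \left( \dfrac{C}{\sqrt{2}\,\sigma_n} \right)^{\frac{2}{2\alpha - 1}} & \text{if } \alpha > \tfrac{1}{2}, \\ \infty & \text{otherwise.} \end{cases}$$

When PSD is broken, one way to preserve PSD is to add a diagonal regularizer (§3.7). For CIFAR-10 with $m = 50$k, the typical negative eigenvalue from float32 noise is around $4 \times 10^{-5}$, and around $7 \times 10^{-14}$ with float64 noise scale, considering $\sqrt{2m}\,\sigma_n$. Note that Arora et al. [2019a] regularized the kernel with a regularization strength which is on par with the typical negative eigenvalue introduced by float32 noise. Of course, this only applies if the kernel eigenvalue decay is sufficiently fast that the full dataset size is above $m^*$.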
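The estimates in this section can be reproduced with a few lines of NumPy. The sketch assumes a noise floor of sqrt(2m) times the noise scale and a smallest kernel eigenvalue of C * m**(1 - alpha), and solves for the dataset size at which the two meet. The values of `c_const` and `alpha` are hypothetical placeholders, and m = 50,000 is assumed as the CIFAR-10 training set size.

```python
import numpy as np


def noise_floor(m, sigma_n):
    """Magnitude of the most negative noise eigenvalue, sqrt(2 m) * sigma_n (assumed model)."""
    return np.sqrt(2.0 * m) * sigma_n


def critical_dataset_size(c_const, alpha, sigma_n):
    """Dataset size where c_const * m**(1 - alpha) equals sqrt(2 m) * sigma_n."""
    if alpha <= 0.5:
        return np.inf  # eigenvalues decay too slowly for the noise to ever dominate
    return (c_const / (np.sqrt(2.0) * sigma_n)) ** (2.0 / (2.0 * alpha - 1.0))


eps32 = float(np.finfo(np.float32).eps)
eps64 = float(np.finfo(np.float64).eps)

floor32 = noise_floor(50_000, eps32)   # assumed m = 50,000
floor64 = noise_floor(50_000, eps64)

# Hypothetical spectrum parameters, chosen only for illustration.
c_const, alpha = 1.0, 2.0
m_crit32 = critical_dataset_size(c_const, alpha, eps32)
m_crit64 = critical_dataset_size(c_const, alpha, eps64)
```

For a fixed spectrum, moving from float32 to float64 raises the critical dataset size by many orders of magnitude, and an exponent at or below 1/2 never reaches the noise floor.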

We observe that FCN and CNN-VEC kernels with small $\alpha$ would not suffer from increasing dataset size under float32 precision. On the other hand, the worse conditioning of CNN-GAP affects not only the training time (§3.9) but also the required precision. One could add a sufficiently large diagonal regularization to mitigate the effect of the noise, at the expense of losing information and generalization strength contained in eigen-directions with small eigenvalues.

Figure S1: The CNN-GAP architecture has poor kernel conditioning. (a) Eigenvalue spectrum of infinite network kernels on 10k datapoints. Dashed lines are the noise eigenvalue scale from float32 precision. The eigenvalues of CNN-GAP’s NNGP decay fast, and negative eigenvalues may occur in float32 once the dataset is large enough, but the kernel is well-behaved with higher precision. (b-c) Critical dataset size as a function of eigenvalue decay exponent or noise strength, given by Equation 1.

Appendix E Data augmentation via kernel ensembling

We start by considering general ensemble averaging of predictors. Consider a sequence of training sets $\{\mathcal{D}_i\}$, each consisting of $m$ input-output pairs $\{(x_1, y_1), \dots, (x_m, y_m)\}$ from a data-generating distribution. A learning algorithm, for which we use NNGP/NTK inference in this study, gives a prediction $\mu(x^*, \mathcal{D}_i)$ for an unseen test point $x^*$. It is possible to obtain a better predictor by averaging the outputs of different predictors

$$\hat{\mu}(x^*) = \frac{1}{E} \sum_{i=1}^{E} \mu(x^*, \mathcal{D}_i),$$

where $E$ denotes the cardinality of $\{\mathcal{D}_i\}$. This ensemble averaging is a simple type of committee machine, which has a long history Clemen [1989], Dietterich [2000]. While more sophisticated ensembling methods exist (e.g. Freund and Schapire [1995], Breiman [1996, 2001], Opitz and Shavlik [1996], Opitz and Maclin [1999], Rokach [2010]), we strive for simplicity and consider naive averaging. One alternative we considered is generalizing the average by

$$\hat{\mu}_w(x^*) = \sum_{i=1}^{E} w_i\, \mu(x^*, \mathcal{D}_i),$$

where $w_i$ in general is a set of weights satisfying $\sum_i w_i = 1$. We can utilize the posterior variance $\sigma_i^2$ from NNGP or NTK with MSE loss via inverse-variance weighting (IVW), where the weights are given as

$$w_i = \frac{\sigma_i^{-2}}{\sum_j \sigma_j^{-2}}.$$

In a simple bagging setting Breiman [1996], we observe small improvements with IVW over naive averaging. This indicates that the posterior variance for different draws of $\{\mathcal{D}_i\}$ was quite similar.
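Both averaging schemes are short to write down. The sketch below assumes each ensemble member supplies a predictive mean and a posterior variance per test point, stacked along the first axis; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def naive_average(means):
    """Uniform average over ensemble members; means has shape (E, n_test)."""
    return means.mean(axis=0)


def inverse_variance_average(means, variances):
    """Average with weights proportional to 1 / variance, normalized to sum to one."""
    weights = 1.0 / variances
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * means).sum(axis=0)


# Two predictors, one test point: the more certain predictor gets weight 0.75.
example_means = np.array([[0.0], [2.0]])
example_variances = np.array([[1.0], [3.0]])
ivw_prediction = inverse_variance_average(example_means, example_variances)
uniform_prediction = naive_average(example_means)
```

When all members report the same variance the two estimators coincide, which is consistent with the small gap between IVW and naive averaging reported above.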

Application to data augmentation (DA) is simple, as we consider the process of generating $\{\mathcal{D}_i\}$ from a (stochastic) data augmentation transformation $\mathcal{T}$. We consider the action of $\mathcal{T}(x, y)$ to be stochastic (e.g. $\mathcal{T}$ is a random crop operator): with probability $p$ it applies the augmentation transformation (which itself could be stochastic), and with probability $(1 - p)$ it is the identity, $\mathcal{T} = \mathrm{Id}$. Considering $\mathcal{D}_0$ as the clean un-augmented training set, we can imagine a dataset generating process $\mathcal{D}_i \sim \mathcal{T}(\mathcal{D}_0)$, where we overloaded the definition of $\mathcal{T}$ on a training set to be the data generating distribution.

For experiments in §3.12, we took $\mathcal{T}$ to be the standard augmentation strategy of horizontal flip and random crop by 4 pixels with augmentation fraction $p$ (see Figure S12 for the effect of augmentation fraction on kernel ensemble). In this framework, it is trivial to generalize the DA transformation to be quite general (e.g. learned augmentation strategy studied by Cubuk et al. [2019a, b]).

Appendix F ZCA whitening

Consider $m$ (flattened) $d$-dimensional training set inputs $X$ (a $d \times m$ matrix) with data covariance

$$\Sigma_X = \frac{1}{d} X X^\top.$$

The goal of whitening is to find a whitening transformation $W$, a $d \times d$ matrix, such that the features of the transformed input

$$Y = W X$$

are uncorrelated, e.g. $\Sigma_Y \equiv \frac{1}{d} Y Y^\top = I$. Note that $\Sigma_X$ is constructed only from the training set, while $W$ is applied to both training set and test set inputs. The whitening transformation can be efficiently computed by eigen-decomposition (footnote 4: for PSD matrices, it is numerically more reliable to obtain this via SVD)

$$\Sigma_X = U D U^\top,$$

where $D$ is a diagonal matrix with the eigenvalues, and $U$ contains the eigenvectors of $\Sigma_X$ as its columns.

With this, the ZCA whitening transformation is obtained by the following whitening matrix

$$W_{\text{ZCA}} = U \sqrt{\left( D + \epsilon\, \frac{\operatorname{tr}(D)}{d}\, I_d \right)^{-1}}\; U^\top.$$

Here, we introduced a trivial reparameterization of the conventional regularizer such that the regularization strength $\epsilon$ is input scale invariant. It is easy to check that $\epsilon \to 0$ corresponds to whitening with $\Sigma_Y = I$. In §3.10, we study the benefit of taking non-zero regularization strength for both kernels and finite networks. We refer to the transformation with a non-zero regularizer as ZCA regularization preprocessing. The ZCA transformation preserves the spatial and chromatic structure of the original image, as illustrated in Figure S2. Therefore image inputs are reshaped to have the same shape as the original image.

In practice, we standardize both training and test set per (RGB channel) features of the training set before and after the ZCA whitening. This ensures transformed inputs are mean zero and variance of order 1.
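The following is a minimal NumPy sketch of regularized ZCA whitening. It assumes inputs are stored as columns of a (d, m) matrix, that the covariance is normalized by d, and that the regularizer is the mean eigenvalue times a strength `eps`; a different covariance normalization would only rescale the whitening matrix by a constant. The function name and toy data are hypothetical.

```python
import numpy as np


def zca_whitening_matrix(x_train, eps):
    """Return the (d, d) ZCA matrix U (D + eps * tr(D) / d * I)^(-1/2) U^T.

    x_train: (d, m) matrix whose columns are flattened training inputs.
    """
    d = x_train.shape[0]
    cov = x_train @ x_train.T / d                # assumed normalization
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov = U diag(eigvals) U^T
    reg = eps * eigvals.sum() / d                # eps times the mean eigenvalue
    scale = 1.0 / np.sqrt(eigvals + reg)
    return (eigvecs * scale) @ eigvecs.T         # scales the columns of U, then rotates back


rng = np.random.default_rng(0)
x_train = rng.normal(size=(5, 200)) * np.arange(1, 6)[:, None]  # toy data, unequal variances
x_test = rng.normal(size=(5, 10))

w_plain = zca_whitening_matrix(x_train, eps=0.0)   # plain whitening
w_reg = zca_whitening_matrix(x_train, eps=0.1)     # regularized whitening

# The matrix is fit on the training set only and applied to both splits.
train_white = w_reg @ x_train
test_white = w_reg @ x_test
```

With `eps=0.0` the transformed training covariance is the identity (this requires a full-rank covariance), and because the regularizer scales with the spectrum, rescaling the inputs by a constant leaves the whitened data unchanged for a fixed `eps`.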


Figure S2: Illustration of ZCA whitening.

Whitening is a linear transformation of a dataset that removes correlations between feature dimensions, setting all non-zero eigenvalues of the covariance matrix to 1. ZCA whitening is a specific choice of the linear transformation that rescales the data in the directions given by the eigenvectors of the covariance matrix, but without additional rotations or flips.

(a) A toy 2d dataset before and after ZCA whitening. Red arrows indicate the eigenvectors of the covariance matrix of the unwhitened data. (b) ZCA whitening of CIFAR-10 images preserves spatial and chromatic structure, while equalizing the variance across all feature directions. Figure reproduced with permission from Wadia et al. [2020]. See also §3.10.

Appendix G MSE vs Softmax-cross-entropy loss training of neural networks

Our focus was mainly on finite networks trained with MSE loss, for simple comparison with kernel methods that give a closed form solution. Here we present a comparison of MSE vs softmax-cross-entropy trained networks. See Table S2 and Figure S3.

Figure S3: MSE trained networks are competitive while there is a clear benefit to using Cross-entropy loss
Architecture  Type     Param  Base   +LR+U  +L2+U  +L2+LR+U  Best
FCN           MSE      STD    47.82  49.07  49.82  55.32     55.90
FCN           MSE      NTK    46.16  49.17  54.27  55.44     55.44
FCN           XENT     STD    55.01  57.28  53.98  57.64     57.64
FCN           XENT     NTK    53.39  56.59  56.31  58.99     58.99
FCN           MSE+DA   STD    65.29  66.11  65.28  67.43     67.43
FCN           MSE+DA   NTK    61.87  62.12  67.58  69.35     69.35
FCN           XENT+DA  STD    64.15  64.15  67.93  67.93     67.93
FCN           XENT+DA  NTK    62.88  62.88  67.90  67.90     67.90
CNN-VEC       MSE      STD    56.68  63.51  67.07  68.99     68.99
CNN-VEC       MSE      NTK    60.73  61.58  75.85  77.47     77.47
CNN-VEC       XENT     STD    64.31  65.30  64.57  66.95     66.95
CNN-VEC       XENT     NTK    67.13  73.23  72.93  74.05     74.05
CNN-VEC       MSE+DA   STD    76.73  81.84  76.66  83.01     83.01
CNN-VEC       MSE+DA   NTK    83.92  84.76  84.87  85.63     85.63
CNN-VEC       XENT+DA  STD    81.84  83.86  81.78  84.37     84.37
CNN-VEC       XENT+DA  NTK    86.83  88.59  87.49  88.83     88.83
CNN-GAP       MSE      STD    80.26  80.93  81.10  83.01     84.22
CNN-GAP       MSE      NTK    80.61  82.44  81.17  82.43     83.92
CNN-GAP       XENT     STD    83.66  83.80  84.59  83.87     83.87
CNN-GAP       XENT     NTK    83.87  84.40  84.51  84.51     84.51
CNN-GAP       MSE+DA   STD    84.36  83.88  84.89  86.45     86.45
CNN-GAP       MSE+DA   NTK    84.07  85.54  85.39  86.68     86.68
CNN-GAP       XENT+DA  STD    86.04  86.01  86.42  87.26     87.26
CNN-GAP       XENT+DA  NTK    86.87  87.31  86.39  88.26     88.26
Table S2: Effects of MSE vs softmax-cross-entropy loss on base networks with various interventions

Appendix H Comment on batch size

The correspondence between NTK and gradient descent training is direct in the full batch gradient descent (GD) setup (see Dyer and Gur-Ari [2020] for extensions to the mini-batch SGD setting). Therefore the base comparison between finite networks and kernels is the full batch setting. While it is possible to train our base models with GD, for full CIFAR-10 a large empirical study becomes impractical. In practice, we use mini-batch SGD with different batch sizes for FCN and for CNNs.

We studied the effect of batch size on training dynamics in Figure S4 and found that these batch-size choices do not affect training dynamics compared to much larger batch sizes. Shallue et al. [2019], McCandlish et al. [2018] observed that, universally for a wide variety of deep learning models, there is a batch size beyond which one gains no further training speed benefit in number of steps. We observe that the maximal useful batch size in the workloads we study is quite small.


Figure S4: Batch size does not affect training dynamics for moderately large batch size.

Appendix I Additional tables and plots

Param Base +C +LR +L2
w/o DA
FCN STD 0.0443 0.0363 0.0406 0.0411 0.0355 0.0337 0.0329 0.0483 0.0319 0.0301 0.0304 0.0267 0.0242
NTK 0.0465 0.0371 0.0423 0.0338 0.0336 0.0308 0.0308 0.0484 0.0308 0.0300 0.0302 0.0281 0.0225
CNN-VEC STD 0.0381 0.0330 0.0340 0.0377 0.0279 0.0340 0.0265 0.0383 0.0265 0.0278 0.0287 0.0228 0.0183
NTK 0.0355 0.0353 0.0355 0.0355 0.0231 0.0246 0.0227 0.0361 0.0227 0.0254 0.0278 0.0164 0.0143
CNN-GAP STD 0.0209 0.0201 0.0207 0.0201 0.0201 0.0179 0.0177 0.0190 0.0159 0.0172 0.0165 0.0185 0.0149
NTK 0.0209 0.0201 0.0195 0.0205 0.0181 0.0175 0.0170 0.0194 0.0161 0.0163 0.0157 0.0186 0.0145

Param Lin Base +C +L2
FCN STD 0.0524 0.0371 0.0508 0.0350 0.0309 0.0305 0.0306 0.0302 - 0.0309 0.0308 0.0297
NTK 0.0399 0.0366 0.0370 0.0368 0.0305 0.0304 0.0305 0.0302 0.0298
CNN-VEC STD 0.0436 0.0322 0.0351 0.0351 0.0293 0.0291 0.0287 0.0277 - 0.0286 0.0281 0.0256
NTK 0.0362 0.0337 0.0342 0.0339 0.0286 0.0286 0.0283 0.0274 0.0273
CNN-GAP STD < 0.0272* (Train accuracy 86.22 after 14M steps) 0.0233 0.0200 - 0.0231 0.0204 0.0191
NTK < 0.0276* (Train accuracy 79.90 after 14M steps) 0.0232 0.0200 0.0195
Table S3: CIFAR-10 classification mean squared error(MSE) for nonlinear and linearized finite neural networks, as well as for NTK and NNGP kernel methods. Starting from Base network of given architecture class described in §2, performance change of centering (+C), large learning rate (+LR), allowing underfitting by early stopping (+U), input preprocessing with ZCA regularization (+ZCA), multiple initialization ensembling (+Ens), and some combinations are shown, for Standard and NTK parameterization. See also Table S1 and Figure 1 for accuracy comparison.
Figure S5: On UCI datasets NNGP often outperforms NTK on RMSE. We evaluate the predictive performance of FC NNGP and NTK on UCI regression datasets in the standard 20-fold splits first utilized in Hernández-Lobato and Adams [2015], Gal and Ghahramani [2016]. We plot the average RMSE across the splits. Different scatter points correspond to varying hyperparameter settings of (depth, weight variance, bias variance). In the tabular data setting, the dominance of NNGP is not as prominent across varying datasets as in the image classification domain.
Figure S6: Centering can accelerate training. Validation (top) and training (bottom) accuracy throughout training for several finite width architectures. See also §3.3 and Figure 3.
Figure S7: Ensembling base networks causes them to match kernel performance, or exceed it for nonlinear CNNs. See also §3.3 and Figure 4.


Figure S8: Performance of nonlinear and linearized networks as a function of L2 regularization for a variety of widths. Dashed lines are NTK parameterized networks while solid lines are networks with standard parameterization. We omit linearized CNN-GAP plots as they did not converge even with extensive compute budget. L2 regularization is more helpful in networks with an NTK parameterization than a standard parameterization


Figure S9: L2 regularization to initial weights does not provide performance benefit. (a) Comparing training curves of L2 regularization to either 0 or initial weights. (b) Peak performance after L2 regularization to either 0 or initial weights. Increasing L2 regularization to initial weights does not provide performance benefits; instead, performance remains flat until the model’s capacity deteriorates.
Figure S10: Finite width networks generally perform better with increasing width, but CNN-VEC shows surprising non-monotonic behavior. See also §3.6 and Figure 6. L2: non-zero weight decay allowed during training; LR: large learning rate allowed. Dashed lines allow underfitting (U).
Figure S11: ZCA regularization helps finite network training. (upper) Standard parameterization, (lower) NTK parameterization. See also §3.10 and Figure 9.
Figure S12: Data augmentation ensemble for infinite network kernels with varying augmentation fraction. See also §3.12.