1 Introduction
A broad class of both Bayesian (Neal, 1994; Williams, 1997; Hazan and Jaakkola, 2015; Lee et al., 2018; Matthews et al., 2018a, b; Borovykh, 2018; Garriga-Alonso et al., 2019; Novak et al., 2019; Yang and Schoenholz, 2017; Yang et al., 2019; Pretorius et al., 2019; Yang, 2019a, b; Novak et al., 2020; Hron et al., 2020; Hu et al., 2020a) and gradient descent trained (Jacot et al., 2018; Li and Liang, 2018a; Allen-Zhu et al., 2018; Du et al., 2019a, 2018; Zou et al., 2019; Lee et al., 2019; Chizat et al., 2019; Arora et al., 2019a; Sohl-Dickstein et al., 2020; Huang et al., 2020; Du et al., 2019b; Yang, 2019a, b; Novak et al., 2020; Hron et al., 2020) neural networks converge to Gaussian Processes (GPs) or closely related kernel methods as their intermediate layers are made infinitely wide. The predictions of these infinite width networks are described by the Neural Network Gaussian Process (NNGP) kernel (Lee et al., 2018; Matthews et al., 2018a) for Bayesian networks, and by the Neural Tangent Kernel (NTK) (Jacot et al., 2018) and weight space linearization (Lee et al., 2019; Chizat et al., 2019) for gradient descent trained networks.
This correspondence has been key to recent breakthroughs in our understanding of neural networks (Xiao et al., 2018; Valle-Perez et al., 2019; Wei et al., 2019; Xiao et al., 2020; Bietti and Mairal, 2019; Ben-David and Ringel, 2019; Yang and Salman, 2019; Ober and Aitchison, 2020; Hu et al., 2020b; Lewkowycz et al., 2020). It has also enabled practical advances in kernel methods (Garriga-Alonso et al., 2019; Novak et al., 2019; Arora et al., 2019a; Li et al., 2019a; Arora et al., 2020; Shankar et al., 2020; Novak et al., 2020; Hron et al., 2020), Bayesian deep learning (Wang et al., 2019; Cheng et al., 2019; Carvalho et al., 2020; Tsymbalov et al., 2019), and semi-supervised learning (Hu et al., 2020a). The NNGP, NTK, and related large width limits (Cho and Saul, 2009; Daniely et al., 2016; Poole et al., 2016; Chen et al., 2018; Li and Nguyen, 2019; Daniely, 2017; Pretorius et al., 2018; Hayou et al., 2018; Karakida et al., 2018; Blumenfeld et al., 2019; Hayou et al., 2019; Schoenholz et al., 2017; Pennington et al., 2017; Xiao et al., 2018; Yang and Schoenholz, 2017) are unique in giving an exact theoretical description of large scale neural networks. Because of this, we believe they will continue to play a transformative role in deep learning theory.
Infinite networks are a newly active field, and foundational empirical questions remain unanswered. In this work, we perform an extensive and in-depth empirical study of finite and infinite width neural networks. In so doing, we provide quantitative answers to questions about the factors of variation that drive performance in finite networks and kernel methods, uncover surprising new behaviors, and develop best practices that improve the performance of both finite and infinite width networks. We believe our results will both ground and motivate future work in wide networks.
2 Experiment design
To systematically develop a phenomenology of infinite and finite neural networks, we first establish base cases for each architecture where infinite-width kernel methods, linearized weight-space networks, and nonlinear gradient descent based training can be directly compared. In the finite-width settings, the base case uses mini-batch gradient descent at a constant small learning rate (Lee et al., 2019) with MSE loss (implementation details in §H). In the kernel-learning setting we compute the NNGP and NTK for the entire dataset and do exact inference as described in Rasmussen and Williams (2006, page 16). Once this one-to-one comparison has been established, we augment the base setting with a wide range of interventions. We discuss each of these interventions in detail below. Some interventions will approximately preserve the correspondence (for example, data augmentation), while others explicitly break the correspondence in a way that has been hypothesized in the literature to affect performance (for example, large learning rates (Lewkowycz et al., 2020)). We additionally explore linearizing the base model around its initialization, in which case its training dynamics become exactly described by a constant kernel. This differs from the kernel setting described above due to finite width effects.
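As a minimal sketch of what linearizing a model around its initialization means, the first-order expansion f_lin(θ) = f(θ0) + J(θ0)(θ − θ0) can be approximated with a finite-difference directional derivative. The toy model `relu_mlp`, the helper `linearize`, and the finite-difference step are illustrative assumptions; this is not the Neural Tangents implementation used in the experiments.

```python
import numpy as np

def relu_mlp(params, x):
    """Toy one-hidden-layer ReLU network (illustrative stand-in model)."""
    w1, w2 = params
    return np.maximum(x @ w1, 0.0) @ w2

def linearize(f, params0, step=1e-4):
    """Return f_lin(params, x) = f(params0, x) + J(params0) @ (params - params0)."""
    def f_lin(params, x):
        delta = [p - p0 for p, p0 in zip(params, params0)]
        plus = [p0 + step * d for p0, d in zip(params0, delta)]
        minus = [p0 - step * d for p0, d in zip(params0, delta)]
        # Central difference approximates the Jacobian-vector product J @ delta.
        jvp = (f(plus, x) - f(minus, x)) / (2.0 * step)
        return f(params0, x) + jvp
    return f_lin

rng = np.random.default_rng(0)
params0 = [rng.normal(size=(4, 8)) / 2.0, rng.normal(size=(8, 2)) / np.sqrt(8.0)]
x = rng.normal(size=(5, 4))
f_lin = linearize(relu_mlp, params0)
```

At `params0` the linearized and nonlinear models agree exactly. Away from initialization they differ by higher-order terms, which is the finite-width effect the linearized experiments isolate.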
We use MSE loss to allow for easier comparison to kernel methods, whose predictions can be evaluated in closed form for MSE. See Table S2 and Figure S3 for a comparison of MSE to softmax-cross-entropy loss. Softmax-cross-entropy provides a consistent small benefit over MSE, and will be interesting to consider in future work.
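The closed-form prediction for MSE can be sketched as the standard GP posterior mean, K(test, train) K(train, train)^{-1} Y. The degree-1 arc-cosine kernel of Cho and Saul (2009) is used here only as a stand-in for an NNGP kernel; the function names and toy data are assumptions made for illustration.

```python
import numpy as np

def arccos_kernel(x1, x2):
    """Degree-1 arc-cosine kernel (Cho and Saul, 2009), a stand-in ReLU kernel."""
    n1 = np.linalg.norm(x1, axis=1)
    n2 = np.linalg.norm(x2, axis=1)
    cos = np.clip((x1 @ x2.T) / np.outer(n1, n2), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(n1, n2) * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def kernel_mean_prediction(k_train_train, k_test_train, y_train):
    """Mean predictor of noiseless kernel regression with MSE loss."""
    return k_test_train @ np.linalg.solve(k_train_train, y_train)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(10, 5))
y_train = rng.normal(size=(10, 3))
x_test = rng.normal(size=(4, 5))

k_tt = arccos_kernel(x_train, x_train)
k_st = arccos_kernel(x_test, x_train)
test_pred = kernel_mean_prediction(k_tt, k_st, y_train)
train_fit = kernel_mean_prediction(k_tt, k_tt, y_train)
```

Without any diagonal regularization the predictor interpolates the training labels, so `train_fit` reproduces `y_train` up to numerical error.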
Architectures we work with are built from either Fully-Connected (FCN) or Convolutional (CNN) layers. In all cases we use ReLU nonlinearities. Unless otherwise stated, we consider FCNs with 3 layers and CNNs with 8 layers. For convolutional networks we must collapse the spatial dimensions of image-shaped data before the final readout layer. To do this we either flatten the image into a one-dimensional vector (VEC) or apply global average pooling to the spatial dimensions (GAP). Finally, we compare two ways of parameterizing the weights and biases of the network: the standard parameterization (STD), which is used in work on finite-width networks, and the NTK parameterization (NTK), which has been used in most infinite-width studies to date (see Sohl-Dickstein et al. (2020) for the standard parameterization at infinite width).
Except where noted, for all kernel experiments we optimize over diagonal kernel regularization independently for each experiment. For finite width networks, except where noted we use a small learning rate corresponding to the base case. See §C.1 for details.
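The two readout strategies can be illustrated on a toy feature map. This is a sketch with assumed shapes (batch, height, width, channels) and hypothetical helper names; it only shows how each readout reacts to a spatial shift of the features.

```python
import numpy as np

def readout_vec(features):
    """VEC: flatten (batch, height, width, channels) into one vector per example."""
    return features.reshape(features.shape[0], -1)

def readout_gap(features):
    """GAP: average over the two spatial axes, leaving (batch, channels)."""
    return features.mean(axis=(1, 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 8, 8, 3))
shifted = np.roll(features, shift=1, axis=2)  # circular shift along the width axis

vec_changes = not np.allclose(readout_vec(features), readout_vec(shifted))
gap_changes = not np.allclose(readout_gap(features), readout_gap(shifted))
```

Flattening keeps the spatial position of every activation, so a shift changes the readout input. Pooling discards position, so the readout input is unchanged under a circular shift.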
The experiments described in this paper are often very compute intensive. For example, to compute the NTK or NNGP for the entirety of CIFAR-10 for CNN-GAP architectures one must explicitly evaluate the entries in a by kernel matrix. Typically this takes around 1200 GPU hours with double precision, and so we implement our experiments via massively distributed compute infrastructure based on Apache Beam (Foundation). All experiments use the Neural Tangents library (Novak et al., 2020), built on top of JAX (Bradbury et al., 2018a).
To be as systematic as possible while also tractable given this large computational requirement, we evaluated every intervention for every architecture and focused on a single dataset, CIFAR-10 (Krizhevsky et al., 2009). However, to ensure robustness of our results across datasets, we evaluate several key claims on CIFAR-100 and Fashion-MNIST (Xiao et al., 2017).
3 Observed empirical phenomena
3.1 NNGP/NTK can outperform finite networks
A common assumption in the study of infinite networks is that they underperform the corresponding finite network in the large data regime. We carefully examine this assumption by comparing kernel methods against the base case of a finite width architecture trained with small learning rate and no regularization (§2), and then individually examining the effects of common training practices which break (large LR, L2 regularization) or improve (ensembling) the infinite width correspondence to kernel methods. The results of these experiments are summarized in Figure 1 and Table S1.
First focusing on base finite networks, we observe that infinite FCN and CNN-VEC outperform their respective finite networks. On the other hand, infinite CNN-GAP networks perform worse than their finite-width counterparts in the base case, consistent with observations in Arora et al. (2019a). We emphasize that architecture plays a key role in relative performance. For example, infinite FCNs outperform finite-width networks even when combined with various tricks such as high learning rate, L2, and underfitting. Here the performance becomes similar only after ensembling (§3.3).
One interesting observation is that ZCA regularization preprocessing (§3.10) can provide significant improvements to the CNN-GAP kernel, closing the gap to within 1-2%.
3.2 NNGP typically outperforms NTK
Recent evaluations of infinite width networks have put significant emphasis on the NTK, without explicit comparison against the respective NNGP models (Arora et al., 2019a; Li et al., 2019a; Du et al., 2019b; Arora et al., 2020). Combined with the view of NNGPs as “weakly-trained” (Lee et al., 2019; Arora et al., 2019a) (i.e. having only the last layer learned), one might expect NTK to be a more effective model class than NNGP. On the contrary, we usually observe that NNGP inference achieves better performance. This can be seen in Table S1, where SOTA performance among fixed kernels is attained with the NNGP across all architectures. In Figure 2 we show that this trend persists across CIFAR-10, CIFAR-100, and Fashion-MNIST (see Figure S5 for similar trends on UCI regression tasks). In addition to producing stronger models, NNGP kernels require about half the memory and compute of the corresponding NTK, and some of the most performant kernels do not have an associated NTK at all (Shankar et al., 2020). Together these results suggest that when approaching a new problem where the goal is to maximize performance, practitioners should start with the NNGP.
3.3 Centering and ensembling finite networks both lead to kernel-like performance
For overparameterized neural networks, some randomness from the initial parameters persists throughout training and the resulting learned functions are themselves random. This excess variance in the network’s predictions generically increases the total test error through the variance term of the bias-variance decomposition. For infinite-width kernel systems this variance is eliminated by using the mean predictor. For finite-width models, the variance can be large, and test performance can be significantly improved by ensembling a collection of models. In Figure 4, we examine the effect of ensembling. For FCN networks, ensembling closes the gap with kernel methods, suggesting that FCN NNs underperform FCN kernels primarily due to variance. For CNN models, ensembling also improves test performance, and ensembled CNN-GAP models significantly outperform the best kernel methods.
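The variance argument can be illustrated with a toy simulation in which each ensemble member predicts the target plus a shared bias and independent noise standing in for initialization randomness. All numbers below are assumptions chosen for illustration and are unrelated to the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0.0, 3.0, 200))

bias = 0.1        # error shared by every ensemble member
noise_std = 0.5   # member-specific error, e.g. from random initialization
n_models = 16

# Each row holds the predictions of one independently initialized model.
preds = target + bias + noise_std * rng.normal(size=(n_models, target.size))

single_mse = np.mean((preds - target) ** 2)                 # about bias^2 + variance
ensemble_mse = np.mean((preds.mean(axis=0) - target) ** 2)  # about bias^2 + variance / n_models
```

Averaging leaves the bias term untouched and divides the variance term by the ensemble size, which is why ensembling helps most where variance dominates the error.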
Prediction variance can also be reduced by centering the model, i.e. subtracting the model’s initial predictions: $f_{\text{centered}}(\theta) = f(\theta) - f(\theta_{\text{initial}})$. A similar variance reduction technique has been studied in Chizat et al. (2019); Zhang et al. (2019); Hu et al. (2020c); Bai and Lee (2020). In Figure 3, we observe that centering significantly speeds up training and improves generalization for FCN and CNN-VEC models, but has little-to-no effect on CNN-GAP architectures. We observe that the scale of the posterior variance of CNN-GAP, in the infinite-width kernel, is small relative to the prior variance given more data, consistent with centering and ensembles having small effect.
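Centering can be sketched as a wrapper that subtracts the frozen initial predictions from the model output. The toy model and the helper name `center` are illustrative assumptions.

```python
import numpy as np

def relu_mlp(params, x):
    """Toy one-hidden-layer ReLU network (illustrative stand-in model)."""
    w1, w2 = params
    return np.maximum(x @ w1, 0.0) @ w2

def center(f, params_init):
    """Return f_centered(params, x) = f(params, x) - f(params_init, x)."""
    frozen = [p.copy() for p in params_init]  # initial parameters are never updated
    def f_centered(params, x):
        return f(params, x) - f(frozen, x)
    return f_centered

rng = np.random.default_rng(0)
params_init = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]
x = rng.normal(size=(5, 4))
f_centered = center(relu_mlp, params_init)
```

The centered model outputs exactly zero at initialization, so the random initial function no longer contributes to the predictions.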
3.4 Large LRs and L2 regularization drive differences between finite networks and kernels
In practice, L2 regularization (a.k.a. weight decay) or larger learning rates can break the correspondence between kernel methods and finite width neural network training even at large widths.
Lee et al. (2019) derive a critical learning rate $\eta_{\text{critical}}$ such that wide network training dynamics are equivalent to linearized training for $\eta < \eta_{\text{critical}}$. Lewkowycz et al. (2020) argue that even at large width a learning rate $\eta \in (\eta_{\text{critical}}, c \cdot \eta_{\text{critical}})$ for a constant $c > 1$ forces the network to move away from its initial high curvature minimum and converge to a lower curvature minimum, while Li et al. (2019b) argue that large initial learning rates enable networks to learn ‘hard-to-generalize’ patterns.
In Figure 1 (and Table S1), we observe that the effectiveness of a large learning rate (LR) is highly sensitive to both architecture and parameterization: LR improves performance of FCN and CNN-GAP by about for STD parameterization and about for NTK parameterization. In stark contrast, it has little effect on CNN-VEC with NTK parameterization and, surprisingly, provides a huge performance boost on CNN-VEC with STD parameterization ().
L2 regularization (Equation S1) regularizes the squared distance between the parameters and the origin and encourages the network to converge to minima with smaller Euclidean norms. Such minima are different from those obtained by NT kernel-ridge regression (i.e. adding a diagonal regularization term to the NT kernel) (Wei et al., 2019), which essentially penalizes the deviation of the network’s parameters from initialization (Hu et al., 2019). See Figure S8 for a comparison.
L2 regularization consistently improves (+) performance for all architectures and parameterizations. Even with a well-tuned L2 regularization, finite width CNN-VEC and FCN still underperform NNGP/NTK. Combining L2 with early stopping produces a dramatic additional performance boost for finite width CNN-VEC, outperforming NNGP/NTK. Finally, we note that L2+LR together provide a superlinear performance gain for all cases except FCN and CNN-GAP with NTK-parameterization. Understanding the nonlinear interactions between L2, LR, and early stopping on finite width networks is an important research question (e.g. see Lewkowycz et al. (2020); Lewkowycz and Gur-Ari (2020) for LR/L2 effect on the training dynamics).
3.5 Improving L2 regularization for networks using the standard parameterization
We find that L2 regularization provides dramatically more benefit (by up to ) to finite width networks with the NTK parameterization than to those that use the standard parameterization (see Table S1). There is a bijective mapping between weights in networks with the two parameterizations, which preserves the function computed by both networks: $W^l_{\text{STD}} = W^l_{\text{NTK}} / \sqrt{n^l}$, where $W^l$ is the $l$th layer weight matrix, and $n^l$ is the width of the preceding activation vector. Motivated by the improved performance of the L2 regularizer in the NTK parameterization, we use this mapping to construct a regularizer for standard parameterization networks that produces the same penalty as vanilla L2 regularization would produce on the equivalent NTK-parameterized network. This modified regularizer is $R^{\text{STD}}_{\text{Layerwise}} = \frac{\lambda}{2} \sum_l n^l \| W^l_{\text{STD}} \|^2$. This can be thought of as a layer-wise regularization constant $\lambda^l = \lambda n^l$. The improved performance of this regularizer is illustrated in Figure 5.
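A small numerical check makes the construction concrete. It assumes the mapping W_STD = W_NTK / sqrt(n) with n the fan-in of the layer, a penalty of the form (λ/2) Σ ||W||², and no biases or weight-variance hyperparameters; the helper names are hypothetical.

```python
import numpy as np

def ntk_to_std(weights_ntk):
    """Map NTK-parameterized weights to standard ones: W_std = W_ntk / sqrt(fan_in)."""
    return [w / np.sqrt(w.shape[0]) for w in weights_ntk]

def l2_penalty(weights, lam):
    """Vanilla L2 penalty: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def layerwise_l2_penalty(weights_std, lam):
    """Layer-wise penalty on standard weights: each layer is scaled by its fan-in."""
    return 0.5 * lam * sum(w.shape[0] * np.sum(w ** 2) for w in weights_std)

rng = np.random.default_rng(0)
weights_ntk = [rng.normal(size=(16, 32)), rng.normal(size=(32, 10))]  # (fan_in, fan_out)
weights_std = ntk_to_std(weights_ntk)
x = rng.normal(size=(3, 16))

penalty_ntk = l2_penalty(weights_ntk, lam=0.1)
penalty_std_layerwise = layerwise_l2_penalty(weights_std, lam=0.1)
penalty_std_vanilla = l2_penalty(weights_std, lam=0.1)
```

The layer-wise penalty on the standard weights matches the vanilla penalty on the NTK weights, while the vanilla penalty on the standard weights is smaller by a fan-in factor in every layer.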
3.6 Performance can be nonmonotonic in width beyond double descent
Deep learning practitioners have repeatedly found that increasing the number of parameters in their models leads to improved performance (Lawrence et al., 1998; Bartlett, 1998; Neyshabur et al., 2015; Canziani et al., 2016; Novak et al., 2018; Park et al., 2019; Novak et al., 2019). While this behavior is consistent with a Bayesian perspective on generalization (MacKay, 1995; Smith and Le, 2017; Wilson and Izmailov, 2020), it seems at odds with classic generalization theory which primarily considers worst-case overfitting (Haussler, 1992; Baum and Haussler, 1989; Vapnik, 1998; Bartlett and Mendelson, 2002; Bousquet and Elisseeff, 2002; Mukherjee et al., 2004; Poggio et al., 2004). This has led to a great deal of work on the interplay of overparameterization and generalization (Zhang et al., 2017; Advani and Saxe, 2017; Neyshabur et al., 2018, 2019; Li and Liang, 2018b; Allen-Zhu et al., 2019; Ghorbani et al., 2019a, b; Arora et al., 2019b; Brutzkus and Globerson, 2019). Of particular interest has been the phenomenon of double descent, in which performance increases overall with parameter count, but drops dramatically when the neural network is roughly critically parameterized (Opper et al., 1990; Belkin et al., 2019; Nakkiran et al., 2019).
Empirically, we find that in most cases (FCN and CNN-GAP in both parameterizations, CNN-VEC with standard parameterization) increasing width leads to monotonic improvements in performance. However, we also find a more complex dependence on width in specific relatively simple settings. For example, in Figure 6 for CNN-VEC with NTK parameterization the performance depends non-monotonically on the width, and the optimal width has an intermediate value (similar behavior was observed in Andreassen and Dyer (2020)). This non-monotonicity is distinct from double-descent-like behavior, as all widths correspond to overparameterized models.
3.7 Diagonal regularization of kernels behaves like early stopping
When performing kernel inference, it is common to add a diagonal regularizer to the training kernel matrix, $\mathcal{K}_{\text{reg}} = \mathcal{K} + \varepsilon \frac{\mathrm{tr}(\mathcal{K})}{m} I$. For linear regression, Ali et al. (2019) proved that the inverse of a kernel regularizer is related to early stopping time under gradient flow. With kernels, gradient flow dynamics correspond directly to training of a wide neural network (Jacot et al., 2018; Lee et al., 2019).
We experimentally explore the relationship between early stopping, kernel regularization, and generalization in Figure 7. We observe a close relationship between regularization and early stopping, and find that in most cases the best validation performance occurs with early stopping and non-zero $\varepsilon$. While Ali et al. (2019) do not consider a $\frac{\mathrm{tr}(\mathcal{K})}{m}$ scaling on the kernel regularizer, we found it useful since experiments become invariant under scale of $\mathcal{K}$.
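The scale invariance can be checked numerically. The sketch below assumes the regularizer adds ε times the mean of the kernel diagonal, tr(K)/m, to the diagonal of an m-by-m training kernel; this exact form, the linear stand-in kernel, and the helper names are assumptions made for illustration.

```python
import numpy as np

def regularize(kernel, eps):
    """Add eps * tr(K) / m to the diagonal of an (m, m) kernel matrix."""
    m = kernel.shape[0]
    return kernel + eps * np.trace(kernel) / m * np.eye(m)

def predict(k_train, k_test, y_train, eps):
    """Regularized kernel regression mean prediction."""
    return k_test @ np.linalg.solve(regularize(k_train, eps), y_train)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(30, 5))
x_test = rng.normal(size=(4, 5))
y_train = rng.normal(size=(30, 1))
k_train = x_train @ x_train.T  # linear kernel as a stand-in
k_test = x_test @ x_train.T

base = predict(k_train, k_test, y_train, eps=0.1)
rescaled = predict(100.0 * k_train, 100.0 * k_test, y_train, eps=0.1)
```

Multiplying the kernel by a constant rescales the regularizer by the same constant, so the two predictions coincide and a single grid of ε values can be reused across kernels of very different magnitude.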
3.8 Floating point precision determines critical dataset size for failure of kernel methods
We observe empirically that kernels become sensitive to float32 vs. float64 numerical precision at a critical dataset size. For instance, GAP models suffer float32 numerical precision errors at a dataset size of . This phenomenon can be understood with a simple random noise model (see §D for details). The key insight is that kernels with fast eigenvalue decay suffer from floating point noise. Empirically, the tail eigenvalues of the NNGP/NTK follow a power law (see Figure 8), and measuring their decay trend provides a good indication of the critical dataset size
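$$m^* \gtrsim \begin{cases} \left( \frac{C_m}{\sqrt{2}\,\sigma_n} \right)^{\frac{2}{2\alpha - 1}} & \text{if } \alpha > \frac{1}{2} \\ \infty & \text{otherwise} \end{cases} \qquad (1)$$
where $\sigma_n$ is the typical noise scale, e.g. float32 epsilon, and the kernel eigenvalue decay is modeled as $\lambda_i \sim C_m\, i^{-\alpha}$ as $i$ increases. Beyond this critical dataset size, the smallest eigenvalues in the kernel become dominated by floating point noise.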
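To build intuition for why faster eigenvalue decay lowers the critical dataset size, consider a deliberately crude stand-in for the noise model of §D: assume eigenvalues that decay as a pure power law, scale * i^(-alpha), and a fixed noise floor equal to float32 machine epsilon. This simplification ignores how the noise itself grows with dataset size, so it is not the paper's estimate, and the function name is hypothetical.

```python
import numpy as np

FLOAT32_EPS = float(np.finfo(np.float32).eps)

def crude_critical_size(alpha, noise=FLOAT32_EPS, scale=1.0):
    """Smallest index m at which scale * m**(-alpha) drops below the noise floor."""
    # scale * m**(-alpha) < noise  is equivalent to  m > (scale / noise)**(1 / alpha)
    return int(np.floor((scale / noise) ** (1.0 / alpha))) + 1

slow_decay = crude_critical_size(alpha=2.0)
fast_decay = crude_critical_size(alpha=3.0)
```

Under these assumptions a larger decay exponent pushes the tail of the spectrum under the noise floor at a much smaller dataset size, which matches the qualitative claim that kernels with fast eigenvalue decay are the first to suffer in float32.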
3.9 Linearized CNN-GAP models perform poorly due to poor conditioning
We observe that the linearized CNN-GAP converges extremely slowly on the training set (Figure S6), leading to poor validation performance (Figure 3). Even after training for more than 10M steps with varying L2 regularization strengths and LRs, the best training accuracy was below 90%, and test accuracy 70%, worse than both the corresponding infinite and nonlinear finite width networks.
This is caused by poor conditioning of pooling networks. Xiao et al. (2020) (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32.
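The link between conditioning and training time can be sketched for gradient descent on a quadratic loss, where the slowest mode contracts by a factor (1 - 1/κ) per step when the learning rate is 1/λ_max and κ is the condition number. The baseline condition number of 10 and the tolerance are arbitrary assumptions for illustration.

```python
import math

def steps_to_tolerance(condition_number, tol=1e-3):
    """Gradient descent steps until the slowest mode shrinks below tol."""
    # With learning rate 1 / lambda_max, the slowest mode of a quadratic loss
    # is multiplied by (1 - 1 / condition_number) at every step.
    if condition_number <= 1.0:
        return 1
    rate = 1.0 - 1.0 / condition_number
    return math.ceil(math.log(tol) / math.log(rate))

baseline_steps = steps_to_tolerance(10.0)
pooled_steps = steps_to_tolerance(10.0 * 1024)  # conditioning worse by the pixel count
slowdown = pooled_steps / baseline_steps
```

For large condition numbers the step count grows roughly in proportion to κ, so worsening the conditioning by a factor of 1024 slows convergence by a factor of the same order.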
3.10 Regularized ZCA whitening improves accuracy
ZCA whitening (Bell and Sejnowski, 1997) (see Figure S2 for an illustration) is a data preprocessing technique that was once common (Goodfellow et al., 2018; Zagoruyko and Komodakis, 2016), but has fallen out of favor. However, it was recently shown to dramatically improve accuracy in some kernel methods by Shankar et al. (2020), in combination with a small regularization parameter in the denominator (see §F). We investigate the utility of ZCA whitening as a preprocessing step for both finite and infinite width neural networks. We observe that while pure ZCA whitening is detrimental for both kernels and finite networks (consistent with predictions in Wadia et al. (2020)), with tuning of the regularization parameter it provides performance benefits for both kernel methods and finite network training (Figure 9).
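A minimal sketch of regularized ZCA whitening follows. It assumes the regularizer is added to the covariance eigenvalues inside the inverse square root and is scaled by their mean; the exact form used in §F may differ, and the function name and toy data are hypothetical.

```python
import numpy as np

def zca_whiten(x, reg=0.1):
    """Whiten rows of x with W = U diag(1 / sqrt(eig + reg * mean(eig))) U^T."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # The regularizer sits in the denominator, so small eigenvalues are not blown up.
    inv_sqrt = 1.0 / np.sqrt(eigvals + reg * eigvals.mean())
    whitening = (eigvecs * inv_sqrt) @ eigvecs.T
    return x @ whitening

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 6)) * np.arange(1.0, 7.0)  # features with unequal scales

pure = zca_whiten(data, reg=0.0)
regularized = zca_whiten(data, reg=0.1)
```

With `reg=0.0` the output covariance is the identity (pure ZCA). A positive `reg` shrinks every direction by eig / (eig + reg * mean(eig)), which damps low-variance directions the most.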
3.11 Equivariance is only beneficial for narrow networks far from the kernel regime
Due to weight sharing between spatial locations, outputs of a convolutional layer are translation-equivariant (up to edge effects), i.e. if an input image is translated, the activations are translated in the same spatial direction. However, the vast majority of contemporary CNNs utilize weight sharing in conjunction with pooling layers, making the network outputs approximately translation-invariant (CNN-GAP). The impact of equivariance alone (CNN-VEC) on generalization is not well understood: it is a property of internal representations only, and does not translate into meaningful statements about the classifier outputs. Moreover, in the infinite-width limit it is guaranteed to have no impact on the outputs (Novak et al., 2019; Yang, 2019a). In the finite regime it has been reported both to provide substantial benefits by Lecun (1989); Novak et al. (2019) and no significant benefits by Bartunov et al. (2018).
We conjecture that equivariance can only be leveraged far from the kernel regime. Indeed, as observed in Figure 1 and discussed in §3.4, multiple kernel correspondence-breaking tricks are required for a meaningful boost in performance over NNGP or NTK (which are mathematically guaranteed to not benefit from equivariance), and the boost is largest at a moderate width (Figure 6). Otherwise, even large ensembles of equivariant models (see CNN-VEC LIN in Figure 4) perform comparably to their infinite width, equivariance-agnostic counterparts. Accordingly, prior work that managed to extract benefits from equivariant models (Lecun, 1989; Novak et al., 2019) tuned networks far outside the kernel regime (extremely small size and +LR+L2+U respectively). We further confirm this phenomenon in a controlled setting in Figure 10.
3.12 Ensembling kernel predictors enables practical data augmentation with NNGP/NTK
Finite width neural networks are often trained with data augmentation (DA) to improve performance. We observe that the FCN and CNN-VEC architectures (both finite and infinite networks) benefit from DA, and that DA can cause CNN-VEC to become competitive with CNN-GAP (Table S1). While CNN-VEC possesses translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data.
For kernels, expanding a dataset with augmentation is computationally challenging, since kernel computation is quadratic in dataset size, and inference is cubic. Li et al. (2019a); Shankar et al. (2020) incorporated flip augmentation by doubling the training set size. Extending this strategy to more augmentations such as crop or mixup (Zhang et al., 2018), or to broader augmentation strategies like AutoAugment (Cubuk et al., 2019a) and RandAugment (Cubuk et al., 2019b), becomes rapidly infeasible.
Here we introduce a straightforward method for ensembling kernel predictors to enable more extensive data augmentation. More sophisticated approximation approaches such as the Nyström method (Williams and Seeger, 2001) might yield even better performance. The strategy involves constructing a set of augmented batches, performing kernel inference for each of them, and then performing ensembling of the resulting predictions. This is equivalent to replacing the kernel with a block diagonal approximation, where each block corresponds to one of the batches, and the union of all augmented batches is the full augmented dataset. See §E for more details. This method achieves SOTA for a kernel method corresponding to the infinite width limit of each architecture class we studied (Figure 11 and Table 1).
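The ensembling strategy can be sketched as follows, under simplifying assumptions: each augmented batch is one transformed copy of a small training set, the kernel is an RBF stand-in rather than an NNGP or NTK, and a trace-scaled diagonal regularizer is added for numerical stability. The helper names and the batch construction are illustrative and may differ from the procedure in §E.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.1):
    """RBF kernel used as a stand-in for an NNGP/NTK kernel."""
    sq = np.sum(x1 ** 2, axis=1)[:, None] + np.sum(x2 ** 2, axis=1)[None, :] - 2.0 * x1 @ x2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_predict(x_train, y_train, x_test, reg=1e-3):
    """Regularized kernel regression on a single (augmented) batch."""
    k_tt = rbf_kernel(x_train, x_train)
    m = k_tt.shape[0]
    k_tt = k_tt + reg * np.trace(k_tt) / m * np.eye(m)
    return rbf_kernel(x_test, x_train) @ np.linalg.solve(k_tt, y_train)

def da_ensemble_predict(x_train, y_train, x_test, augmentations):
    """Average the predictors fitted separately on each augmented batch."""
    preds = [kernel_predict(aug(x_train), y_train, x_test) for aug in augmentations]
    return np.mean(preds, axis=0)

def identity(x):
    return x

def flip(x):
    return x[:, ::-1]  # reverse the feature axis, a stand-in for image flipping

rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 6))
y_train = rng.normal(size=(20, 2))
x_test = rng.normal(size=(5, 6))
ensemble_pred = da_ensemble_predict(x_train, y_train, x_test, [identity, flip])
```

With B batches of m points each, the cubic inference cost is paid B times on m-by-m blocks instead of once on a (B·m)-by-(B·m) matrix, which is what makes larger augmentation sets tractable.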
Architecture | Method | NTK | NNGP
FC | Novak et al. (2019) | - | 59.9
FC | ZCA Reg (this work) | 59.7 | 59.7
FC | DA Ensemble (this work) | 61.5 | 62.4
CNN-VEC | Novak et al. (2019) | - | 67.1
CNN-VEC | Li et al. (2019a) | 66.6 | 66.8
CNN-VEC | ZCA Reg (this work) | 69.8 | 69.4
CNN-VEC | Flip Augmentation, Li et al. (2019a) | 69.9 | 70.5
CNN-VEC | DA Ensemble (this work) | 70.5 | 73.2
CNN-GAP | Arora et al. (2019a); Li et al. (2019a) | 77.6 | 78.5
CNN-GAP | ZCA Reg (this work) | 83.2 | 83.5
CNN-GAP | Flip Augmentation, Li et al. (2019a) | 79.7 | 80.0
CNN-GAP | DA Ensemble (this work) | 83.7 (32 ens) | 84.8 (32 ens)
Myrtle (see note) | Myrtle ZCA and Flip Augmentation, Shankar et al. (2020) | - | 89.8
Note: The normalized Gaussian Myrtle kernel used in Shankar et al. (2020) does not have a corresponding finite-width neural network, and was additionally tuned on the test set for the case of CIFAR-10.
4 Discussion
We performed an in-depth investigation of the phenomenology of finite and infinite width neural networks through a series of controlled interventions. We quantified phenomena having to do with generalization, architecture dependence, deviations between infinite and finite networks, numerical stability, data augmentation, data preprocessing, ensembling, network topology, and failure modes of linearization. We further developed best practices that improve performance for both finite and infinite networks. We believe our experiments provide firm empirical ground for future studies.
Broader Impact
Developing theoretical understanding of neural networks is crucial both for understanding their biases, and predicting when and how they will fail. Understanding biases in models is of critical importance if we hope to prevent them from perpetuating and exaggerating existing racial, gender, and other social biases (Hardt et al., 2016; Barocas and Selbst, 2016; Doshi-Velez and Kim, 2017; Barocas et al., 2019). Understanding model failure has a direct impact on human safety, as neural networks increasingly do things like drive cars and control the electrical grid (Bojarski et al., 2016; Rudin et al., 2011; Ozay et al., 2015).
We believe that wide neural networks are currently the most promising direction for the development of neural network theory. We further believe that the experiments we present in this paper will provide empirical underpinnings that allow better theory to be developed. We thus believe that this paper will in a small way aid the engineering of safer and more just machine learning models.
We thank Yasaman Bahri and Ethan Dyer for discussions and feedback on the project. We are also grateful to Atish Agarwala and Gamaleldin Elsayed for providing valuable feedback on the draft.
We acknowledge the Python community (Van Rossum and Drake Jr, 1995) for developing the core set of tools that enabled this work, including NumPy (van der Walt et al., 2011), SciPy (Virtanen et al., 2020), Matplotlib (Hunter, 2007), Pandas (Wes McKinney, 2010), Jupyter (Kluyver et al., 2016), JAX (Bradbury et al., 2018b), Neural Tangents (Novak et al., 2020), Apache Beam (Foundation), Tensorflow datasets (TFD) and Google Colaboratory (Research).
References
 Neal (1994) Radford M. Neal. Priors for infinite networks (tech. rep. no. CRG-TR-94-1). University of Toronto, 1994.
 Williams (1997) Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pages 295–301, 1997.
 Hazan and Jaakkola (2015) Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
 Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
 Matthews et al. (2018a) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018a.
 Matthews et al. (2018b) Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 9 2018b.
 Borovykh (2018) Anastasia Borovykh. A gaussian process perspective on convolutional neural networks. arXiv preprint arXiv:1810.10798, 2018.
 GarrigaAlonso et al. (2019) Adrià GarrigaAlonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow gaussian processes. In International Conference on Learning Representations, 2019.
 Novak et al. (2019) Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha SohlDickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
 Yang and Schoenholz (2017) Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems. 2017.

Yang et al. (2019)
Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha SohlDickstein, and Samuel S.
Schoenholz.
A mean field theory of batch normalization.
In International Conference on Learning Representations, 2019.  Pretorius et al. (2019) Arnu Pretorius, Herman Kamper, and Steve Kroon. On the expected behaviour of noise regularised deep neural networks as gaussian processes. arXiv preprint arXiv:1910.05563, 2019.
 Yang (2019a) Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019a.

Yang (2019b)
Greg Yang.
Wide feedforward or recurrent neural networks of any architecture are gaussian processes.
In Advances in Neural Information Processing Systems, 2019b.  Novak et al. (2020) Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha SohlDickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neuraltangents.
 Hron et al. (2020) Jiri Hron, Yasaman Bahri, Jascha SohlDickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning, 2020.
 Hu et al. (2020a) Jilin Hu, Jianbing Shen, Bin Yang, and Ling Shao. Infinitely wide graph convolutional networks: semisupervised learning via gaussian processes. arXiv preprint arXiv:2002.12168, 2020a.
 Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

Li and Liang (2018a)
Yuanzhi Li and Yingyu Liang.
Learning overparameterized neural networks via stochastic gradient descent on structured data.
In Advances in Neural Information Processing Systems, pages 8157–8166, 2018a.  AllenZhu et al. (2018) Zeyuan AllenZhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning, 2018.
 Du et al. (2019a) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, 2019a.
 Du et al. (2018) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
 Zou et al. (2019) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes overparameterized deep relu networks. Machine Learning, 2019.
 Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha SohlDickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 2019.
 Chizat et al. (2019) Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2937–2947, 2019.
 Arora et al. (2019a) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150. Curran Associates, Inc., 2019a.
 SohlDickstein et al. (2020) Jascha SohlDickstein, Roman Novak, Samuel S Schoenholz, and Jaehoon Lee. On the infinite width limit of neural networks with a standard parameterization. arXiv preprint arXiv:2001.07301, 2020.
 Huang et al. (2020) Wei Huang, Weitao Du, and Richard Yi Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
 Du et al. (2019b) Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems. 2019b.

 Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha SohlDickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.
 VallePerez et al. (2019) Guillermo VallePerez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameterfunction map is biased towards simple functions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rye4g3AqFm.
 Wei et al. (2019) Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709–9721, 2019.
 Xiao et al. (2020) Lechao Xiao, Jeffrey Pennington, and Samuel S Schoenholz. Disentangling trainability and generalization in deep learning. In International Conference on Machine Learning, 2020.
 Bietti and Mairal (2019) Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems 32. 2019.
 BenDavid and Ringel (2019) Oded BenDavid and Zohar Ringel. The role of a layer in deep neural networks: a gaussian process perspective. arXiv preprint arXiv:1902.02354, 2019.
 Yang and Salman (2019) Greg Yang and Hadi Salman. A finegrained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019.
 Ober and Aitchison (2020) Sebastian W Ober and Laurence Aitchison. Global inducing point variational posteriors for bayesian neural networks and deep gaussian processes. arXiv preprint arXiv:2005.08140, 2020.
 Hu et al. (2020b) Wei Hu, Lechao Xiao, and Jeffrey Pennington. Provable benefit of orthogonal initialization in optimizing deep linear networks. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=rkgqN1SYvr.
 Lewkowycz et al. (2020) Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha SohlDickstein, and Guy GurAri. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
 Li et al. (2019a) Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019a.
 Arora et al. (2020) Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli Yu. Harnessing the power of infinitely wide deep nets on smalldata tasks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkl8sJBYvH.
 Shankar et al. (2020) Vaishaal Shankar, Alex Chengyu Fang, Wenshuo Guo, Sara FridovichKeil, Ludwig Schmidt, Jonathan RaganKelley, and Benjamin Recht. Neural kernels without tangents. In International Conference on Machine Learning, 2020.
 Wang et al. (2019) Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for bayesian neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkgtDsCcKQ.

 Cheng et al. (2019) Zezhou Cheng, Matheus Gadelha, Subhransu Maji, and Daniel Sheldon. A bayesian perspective on the deep image prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 Carvalho et al. (2020) Eduardo D C Carvalho, Ronald Clark, Andrea Nicastro, and Paul H. J. Kelly. Scalable uncertainty for computer vision with functional variational inference. In CVPR, 2020.

 Tsymbalov et al. (2019) Evgenii Tsymbalov, Sergei Makarychev, Alexander Shapeev, and Maxim Panov. Deeper connections between neural networks and gaussian processes speedup active learning. In International Joint Conference on Artificial Intelligence, IJCAI19, 2019.
 Cho and Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances In Neural Information Processing Systems, 2009.
 Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, 2016.
 Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha SohlDickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, 2016.
 Chen et al. (2018) Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, 2018.

 Li and Nguyen (2019) Ping Li and PhanMinh Nguyen. On random deep weighttied autoencoders: Exact asymptotic analysis, phase transitions, and implications to training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJx54i05tX.
 Daniely (2017) Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, 2017.
 Pretorius et al. (2018) Arnu Pretorius, Elan van Biljon, Steve Kroon, and Herman Kamper. Critical initialisation for deep signal propagation in noisy rectifier neural networks. In Advances in Neural Information Processing Systems. 2018.
 Hayou et al. (2018) Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.
 Karakida et al. (2018) Ryo Karakida, Shotaro Akaho, and Shunichi Amari. Universal statistics of fisher information in deep neural networks: Mean field approach. arXiv preprint arXiv:1806.01316, 2018.
 Blumenfeld et al. (2019) Yaniv Blumenfeld, Dar Gilboa, and Daniel Soudry. A mean field theory of quantized deep networks: The quantizationdepth tradeoff. arXiv preprint arXiv:1906.00771, 2019.
 Hayou et al. (2019) Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. Meanfield behaviour of neural tangent kernel for deep neural networks, 2019.
 Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha SohlDickstein. Deep information propagation. International Conference on Learning Representations, 2017.
 Pennington et al. (2017) Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, 2017.
 Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
 (61) The Apache Software Foundation. Apache beam. URL https://beam.apache.org/.
 Bradbury et al. (2018a) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye WandermanMilne. JAX: composable transformations of Python+NumPy programs, 2018a. URL http://github.com/google/jax.
 Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
 Zhang et al. (2019) Yaoyu Zhang, ZhiQin John Xu, Tao Luo, and Zheng Ma. A type of generalization error induced by initialization in deep neural networks. arXiv preprint arXiv:1905.07777, 2019.
 Hu et al. (2020c) Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020c.
 Bai and Lee (2020) Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higherorder approximation of wide neural networks. In International Conference on Learning Representations, 2020.
 Li et al. (2019b) Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, 2019b.
 Hu et al. (2019) Wei Hu, Zhiyuan Li, and Dingli Yu. Understanding generalization of deep neural networks trained with noisy labels. arXiv preprint arXiv:1905.11368, 2019.
 Lewkowycz and GurAri (2020) Aitor Lewkowycz and Guy GurAri. On the training dynamics of deep networks with regularization. arXiv preprint arXiv:2006.08643, 2020.

 Lawrence et al. (1998) Steve Lawrence, C Lee Giles, and Ah Chung Tsoi. What size neural network gives optimal generalization? convergence properties of backpropagation. Technical report, 1998.
 Bartlett (1998) Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE transactions on Information Theory, 1998.
 Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Proceeding of the international Conference on Learning Representations workshop track, abs/1412.6614, 2015.
 Canziani et al. (2016) Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
 Novak et al. (2018) Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha SohlDickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.
 Park et al. (2019) Daniel S. Park, Jascha SohlDickstein, Quoc V. Le, and Samuel L. Smith. The effect of network width on stochastic gradient descent and generalization: an empirical study. In International Conference on Machine Learning, 2019.
 MacKay (1995) David JC MacKay. Probable networks and plausible predictions—a review of practical bayesian methods for supervised neural networks. Network: computation in neural systems, 6(3):469–505, 1995.
 Smith and Le (2017) Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
 Wilson and Izmailov (2020) Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
 Haussler (1992) David Haussler. Decision theoretic generalizations of the pac model for neural net and other learning applications. Information and computation, 1992.
 Baum and Haussler (1989) Eric B. Baum and David Haussler. What size net gives valid generalization? In Advances in Neural Information Processing Systems. 1989.
 Vapnik (1998) Vladimir Vapnik. Statistical learning theory. 1998.
 Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
 Mukherjee et al. (2004) Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Technical report, 2004.
 Poggio et al. (2004) Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
 Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
 Advani and Saxe (2017) Madhu S Advani and Andrew M Saxe. Highdimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
 Neyshabur et al. (2018) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of overparametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 Neyshabur et al. (2019) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of overparametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
 Li and Liang (2018b) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems. 2018b.
 AllenZhu et al. (2019) Zeyuan AllenZhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.
 Ghorbani et al. (2019a) Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of twolayers neural network. In Advances in Neural Information Processing Systems, pages 9108–9118, 2019a.
 Ghorbani et al. (2019b) Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized twolayers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019b.
 Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Finegrained analysis of optimization and generalization for overparameterized twolayer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
 Brutzkus and Globerson (2019) Alon Brutzkus and Amir Globerson. Why do larger models generalize better? A theoretical perspective via the XOR problem. In Proceedings of the 36th International Conference on Machine Learning, 2019.

 Opper et al. (1990) M Opper, W Kinzel, J Kleinz, and R Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581, 1990.
 Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machinelearning practice and the classical bias–variance tradeoff. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
 Nakkiran et al. (2019) Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
 Andreassen and Dyer (2020) Anders Andreassen and Ethan Dyer. Asymptotics of wide convolutional neural networks. preprint, 2020.
 Ali et al. (2019) Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuoustime view of early stopping for least squares regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1370–1378, 2019.
 Bell and Sejnowski (1997) Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327–3338, 1997.
 Goodfellow et al. (2018) Ian J Goodfellow, David WardeFarley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In International Conference on Machine Learning, 2018.
 Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
 Wadia et al. (2020) Neha Wadia, Daniel Duckworth, Samuel Schoenholz, Ethan Dyer, and Jascha SohlDickstein. Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible. preprint, 2020.
 Lecun (1989) Yann Lecun. Generalization and network design strategies. In Connectionism in perspective. Elsevier, 1989.
 Bartunov et al. (2018) Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologicallymotivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9368–9378, 2018.
 Zhang et al. (2018) Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1Rb.
 Cubuk et al. (2019a) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019a.
 Cubuk et al. (2019b) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019b.
 Williams and Seeger (2001) Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
 Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.
 Barocas and Selbst (2016) Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
 DoshiVelez and Kim (2017) Finale DoshiVelez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
 Barocas et al. (2019) Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019. http://www.fairmlbook.org.
 Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for selfdriving cars. arXiv preprint arXiv:1604.07316, 2016.
 Rudin et al. (2011) Cynthia Rudin, David Waltz, Roger N Anderson, Albert Boulanger, Ansaf SallebAouissi, Maggie Chow, Haimonti Dutta, Philip N Gross, Bert Huang, Steve Ierome, et al. Machine learning for the new york city power grid. IEEE transactions on pattern analysis and machine intelligence, 2011.
 Ozay et al. (2015) Mete Ozay, Inaki Esnaola, Fatos Tunay Yarman Vural, Sanjeev R Kulkarni, and H Vincent Poor. Machine learning methods for attack detection in the smart grid. IEEE transactions on neural networks and learning systems, 2015.
 Van Rossum and Drake Jr (1995) Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.
 van der Walt et al. (2011) S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science Engineering, 13(2):22–30, 2011.
 Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261–272, 2020.
 Hunter (2007) J. D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3):90–95, 2007.
 Wes McKinney (2010) Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56 – 61, 2010. doi: 10.25080/Majora92bf192200a.
 Kluyver et al. (2016) Thomas Kluyver, Benjamin RaganKelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87 – 90. IOS Press, 2016.
 Bradbury et al. (2018b) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye WandermanMilne. JAX: composable transformations of Python+NumPy programs, 2018b. URL http://github.com/google/jax.
 (126) TensorFlow Datasets, a collection of readytouse datasets. https://www.tensorflow.org/datasets.
 (127) Google Research. Google colab. URL https://colab.research.google.com/.
 Clemen (1989) Robert T Clemen. Combining forecasts: A review and annotated bibliography. International journal of forecasting, 5(4):559–583, 1989.
 Dietterich (2000) Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.

 Freund and Schapire (1995) Yoav Freund and Robert E Schapire. A desiciontheoretic generalization of online learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.
 Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
 Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Opitz and Shavlik (1996) David W Opitz and Jude W Shavlik. Generating accurate and diverse members of a neuralnetwork ensemble. In Advances in neural information processing systems, pages 535–541, 1996.
 Opitz and Maclin (1999) David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11:169–198, 1999.
 Rokach (2010) Lior Rokach. Ensemblebased classifiers. Artificial intelligence review, 33(12):1–39, 2010.
 Dyer and GurAri (2020) Ethan Dyer and Guy GurAri. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gFvANKDS.
 Shallue et al. (2019) Christopher J Shallue, Jaehoon Lee, Joseph Antognini, Jascha SohlDickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 2019.
 McCandlish et al. (2018) Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of largebatch training. arXiv preprint arXiv:1812.06162, 2018.
 HernándezLobato and Adams (2015) José Miguel HernándezLobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
Appendix A Glossary
We use the following abbreviations in this work:

L2: L2 regularization, a.k.a. weight decay;

LR: using large learning rate;

U: allowing underfitting;

DA: using data augmentation;

C: centering the network so that the logits are always zero at initialization;

Ens: neural network ensembling of logits over multiple initializations;

ZCA: zerophase component analysis regularization preprocessing;

FCN: fully-connected neural network;

CNNVEC: convolutional neural network with a vectorized readout layer;

CNNGAP: convolutional neural network with a global average pooling readout layer;

NNGP: neural network Gaussian process;

NTK: neural tangent kernel.
Appendix B Main table
Param  Base  +C  +LR  +L2  +ZCA  +Ens
FCN  …
CNNVEC  …
CNNGAP  …

Param  Lin Base  +C  +L2  +Ens  NTK  +ZCA  NNGP  +ZCA
FCN  …  58.61  59.70  62.40
CNNVEC  …  66.69  69.44  73.23
CNNGAP  …  78.0  83.45  84.82
Appendix C Experimental details
For all experiments, we use the Neural Tangents (NT) library Novak et al. [2020], built on top of JAX Bradbury et al. [2018b]. We first describe the experimental settings that are common to most experiments, and then give the specific details and hyperparameters for each experiment.
Finite width neural networks We train finite width networks with Mean Squared Error (MSE) loss
where is the number of classes and is the norm in . For the experiments with +L2, we add L2 regularization to the loss
(S1) 
and tune its strength using grid search, optimizing for the validation accuracy.
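As a minimal sketch of the loss described above, the following NumPy code combines an MSE term with an L2 penalty on the weights. The normalization of the MSE term (averaged over examples and classes, with a factor of 1/2) and the names `mse_loss`, `l2_penalty`, `regularized_loss`, and `strength` are assumptions made for illustration; they are not taken from the original equations.

```python
import numpy as np

def mse_loss(logits, targets):
    """MSE averaged over examples and classes (assumed normalization)."""
    num_examples, num_classes = targets.shape
    return 0.5 * np.sum((logits - targets) ** 2) / (num_examples * num_classes)

def l2_penalty(weights, strength):
    """0.5 * strength * (sum of squared entries over all weight arrays)."""
    return 0.5 * strength * sum(np.sum(w ** 2) for w in weights)

def regularized_loss(logits, targets, weights, strength):
    """MSE loss plus L2 regularization; `strength` is the value being grid searched."""
    return mse_loss(logits, targets) + l2_penalty(weights, strength)
```

In a grid search, `strength` would be swept over a set of candidate values and the one with the best validation accuracy retained.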
We optimize the loss using mini-batch SGD with a constant learning rate. We use batch size of for FCN and for both CNNVEC and CNNGAP (see §H for further details on this choice). The learning rate is parameterized by a learning rate factor relative to the critical learning rate
(S2) 
In practice, we compute the empirical NTK on 16 random points in the training set and estimate the critical learning rate Lee et al. [2019] from its maximum eigenvalue. This is readily available in the NT library Novak et al. [2020] using nt.monte_carlo_kernel_fn and nt.predict.max_learning_rate. The base case, without large learning rate, uses a learning rate factor no larger than one, and large learning rate (+LR) runs allow a factor larger than one. Note that for linearized networks the critical learning rate is a strict upper bound on the learning rate, and no learning rate factor above one is allowed Lee et al. [2019], Yang and Salman [2019], Lewkowycz et al. [2020].
Training steps are chosen to be large enough, such that learning rate factor can reach above accuracy on random subset of training data for 5 logarithmically spaced measurements. For different learning rates, physical time roughly determines the learning dynamics, so trials with small learning rates need a larger number of steps. The termination criterion was met for all trials except linearized CNNGAP and data-augmented training of FCN and CNNVEC. In these cases, we report the best achieved performance without fitting the training set.
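The learning rate parameterization above can be illustrated with a small NumPy sketch. It assumes a scalar-output model, a loss of the form 0.5 * ||f - y||^2, and the estimate 2 / lambda_max for the critical learning rate; the helper names are hypothetical, and the NT functions mentioned in the text may use a different normalization.

```python
import numpy as np

def empirical_ntk(jacobian):
    """Empirical NTK for a scalar-output model.

    `jacobian` has shape (num_points, num_params): row i is df(x_i)/dtheta.
    """
    return jacobian @ jacobian.T

def critical_learning_rate(ntk):
    """Assumed estimate: 2 / (largest eigenvalue of the empirical NTK)."""
    lam_max = np.linalg.eigvalsh(ntk)[-1]  # eigvalsh returns ascending eigenvalues
    return 2.0 / lam_max

def learning_rate(ntk, factor):
    """Learning rate as a multiple `factor` of the critical learning rate."""
    return factor * critical_learning_rate(ntk)

# Toy usage: for a linear model f(x) = x @ theta the Jacobian is the input matrix.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))        # 16 random training points, as in the text
ntk = empirical_ntk(x)
base_lr = learning_rate(ntk, 0.5)   # below the critical value
large_lr = learning_rate(ntk, 1.1)  # above it; gradient descent diverges for this linear model
```

For a linear (or linearized) model, gradient descent on the quadratic loss is stable only below the critical value, which is why a factor above one is not usable in that setting.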
NNGP / NTK For inference, except for data augmentation ensembles, for which the default zero regularization was chosen, we grid search over diagonal regularization in the range numpy.logspace(-7, 2, 14) and . Diagonal regularization is parameterized as
where is either NNGP or NTK for the training set. We work with this parameterization since is invariant to scale of .
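One parameterization with the scale invariance described above is to multiply the regularization strength by the mean of the kernel diagonal. The sketch below assumes this form, K + eps * tr(K) / m * I, which is not confirmed by the surviving text; the names `regularize`, `kernel_predict`, and `eps_grid` are hypothetical, and the lower exponent of the grid is taken to be -7.

```python
import numpy as np

def regularize(kernel, eps):
    """Add eps times the mean diagonal entry of `kernel` to its diagonal (assumed form)."""
    m = kernel.shape[0]
    return kernel + eps * np.trace(kernel) / m * np.eye(m)

def kernel_predict(k_train, k_test_train, y_train, eps):
    """Mean prediction of kernel regression with the scaled diagonal regularizer."""
    return k_test_train @ np.linalg.solve(regularize(k_train, eps), y_train)

# Candidate regularization strengths, following the range quoted in the text.
eps_grid = np.logspace(-7, 2, 14)
```

With this choice, multiplying the kernel by a constant leaves the predictions for a given `eps` unchanged, so the same grid can be reused across kernels of different overall scale.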
Dataset
For all our experiments (unless specified otherwise) we use a train/valid/test split of 45k/5k/10k for CIFAR-10/100 and 50k/10k/10k for Fashion-MNIST. For all our experiments, inputs are standardized with per-channel mean and standard deviation. ZCA regularized whitening is applied as described in §F. Output is encoded as mean-subtracted one-hot encoding for the MSE loss, e.g. for a label in class , . For the softmax cross-entropy loss in §G, we use standard one-hot encoded output.
For data augmentation, we use the widely used augmentation for CIFAR-10: horizontal flips with 50% probability and random crops by 4 pixels with zero padding.
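The label encoding and augmentation described above can be sketched as follows. This assumes that "mean-subtracted" means subtracting 1 / (number of classes) from every entry of the one-hot vector, and that images are stored as height x width x channels arrays; both function names are hypothetical.

```python
import numpy as np

def mean_subtracted_one_hot(labels, num_classes):
    """One-hot targets shifted so that each row sums to zero (assumed convention)."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot - 1.0 / num_classes

def augment(image, rng, pad=4):
    """Random horizontal flip (probability 0.5), then a random crop after zero padding."""
    height, width, _ = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                               # flip left-right
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))    # zero padding
    top, left = rng.integers(0, 2 * pad + 1, size=2)            # shift of up to `pad` pixels
    return padded[top:top + height, left:left + width, :]
```

The crop keeps the original image size, so augmented and non-augmented inputs can be fed to the same network or kernel function.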
Details of architecture choice: We only consider ReLU activation (with the exception of the Myrtle kernel, which uses scaled Gaussian activation Shankar et al. [2020]) and choose critical initialization weight variance of with small bias variance . For convolution layers, we exclusively consider filters with stride and SAME (zero) padding so that the image size does not change under the convolution operation.
C.1 Hyperparameter configurations for all experiments
We used grid search for tuning hyperparameters and used accuracy on the validation set for deciding on the hyperparameter configuration or measurement steps (for underfitting / early stopping). All reported numbers, unless specified otherwise, are test set performance.
Figure 1, Table S1: We grid search over L2 regularization strength and learning rate factor . For linearized networks the same search space is used, except that the configuration is infeasible and training diverges. For nonlinear, centered runs is used. Network ensembles use the base configuration with , with 64 different initialization seeds. The kernel ensemble is over 50 predictors for FCN and CNNVEC and 32 predictors for CNNGAP. Finite networks trained with data augmentation have a different learning rate factor range of .
Figure 2: Each datapoint corresponds to either standard preprocessing or ZCA regularization preprocessing (as described in §3.10), with regularization strength varied in for FCN and CNNVEC, and for CNNGAP.
Figure 3, Figure 4, Figure S6, Figure S7: Learning rate factors are for nonlinear networks and for linearized networks. While we show NTK parameterized runs, we also observe similar trends for STD parameterized networks. Shaded regions show the range of minimum and maximum performance across 64 different seeds. Solid lines indicate the mean performance.
Figure 5: While FCN is the base configuration, CNNVEC is a narrow network with 64 channels per layer, since moderate width benefits more from L2 for the NTK parameterization (Figure S10). For CNNGAP, 128-channel networks are used. All networks with different L2 strategies are trained with +LR ().
Figure 7: We use a 640-example subset of the validation set for evaluation. CNNGAP is a variation of the base model with 3 convolution layers with , while FCN and CNNVEC are the base models. Training evolution is computed using the analytic time evolution described in Lee et al. [2019] and implemented in the NT library via nt.predict.gradient_descent_mse with 0 diagonal regularization.
Figure 9: Kernel experiment details are the same as in Figure 2. Finite networks use the base configuration with and .
Figure 10: Evaluated networks use the NTK parameterization with . CNNVEC+L2+narrow uses 128 channels instead of the 512 of the base CNNVEC and CNNGAP networks, and is trained with L2 regularization strength . The Crop transformation uses zero padding, while the Translate transformation uses a circular boundary condition after shifting images. Each transformation is applied to the test set inputs, where the shift direction is chosen randomly. Each point corresponds to the average accuracy over 20 random seeds. FCN had 2048 hidden units.
Figure 11, Table 1: For all data augmentation ensembles, the first instance is taken from the non-augmented training set. Further details on kernel ensembling are described in §E. For all kernels, inputs are preprocessed with the optimal ZCA regularization observed in Figure 9 (10 for FCN, 1 for CNNVEC, CNNGAP and Myrtle). We ensemble over 50 different augmented draws for FCN and CNNVEC, whereas for CNNGAP we ensemble over 32 draws of the augmented training set.
Figure S3, Table S2: Details for MSE trials are the same as for Figure 1 and Table S1. Trials with softmax cross-entropy loss were tuned with the same hyperparameter range as MSE, except that the learning rate factor range was .
Figure S4: We present results with NTK parameterized networks with . The FCN network is width 1024 with for MSE loss and for softmax cross-entropy loss. CNNGAP uses 256 channels with for MSE loss and for softmax cross-entropy loss. The random seed was fixed to be the same across all runs for comparison.
Figure S9: NTK parameterization with was used for L2 regularization toward both zero and initialization. The random seed was fixed to be the same across all runs for comparison.
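The analytic time evolution mentioned for Figure 7 can be sketched in closed form for the training set. The code below assumes gradient flow on an MSE loss, zero network output at initialization, and no diagonal regularization, in which case the training predictions at time t are (I - exp(-lr * t * Theta)) y; the function name is hypothetical and the sketch is not the NT implementation.

```python
import numpy as np

def train_predictions_at_time(ntk_train, y_train, lr, t):
    """Closed-form training-set predictions under gradient flow with MSE loss.

    Assumes the network output is zero at initialization and that `ntk_train`
    is a symmetric positive semi-definite matrix of shape (m, m).
    """
    evals, evecs = np.linalg.eigh(ntk_train)
    # Fraction of the target learned along each eigendirection of the kernel.
    learned = 1.0 - np.exp(-lr * t * evals)
    return evecs @ (learned[:, None] * (evecs.T @ y_train))
```

Directions with large eigenvalues are fit first, while directions with small eigenvalues are learned slowly, so evaluating this expression on a logarithmic grid of times traces out the whole training trajectory without running gradient descent.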
Appendix D Noise model
In this section, we provide details on the noise model discussed in §3.8. Consider a random Hermitian matrix with entries of order , which is considered as a noise perturbation to the kernel matrix
(S3) 
Eigenvalues of this random matrix follow Wigner’s semicircle law, and the smallest eigenvalue is given by . When the smallest eigenvalue of is smaller (in order) than , one needs to add a diagonal regularizer larger than the order of to ensure positive definiteness. For estimates, let us use machine precision, np.finfo(np.float32).eps (about 1.2e-7) and np.finfo(np.float64).eps (about 2.2e-16), which we use as proxy values for . Note that the noise scale is relative to the elements in , which are assumed to be . Naively scaling by a multiplicative constant will also scale .
Empirically one can model the tail eigenvalues of an infinite width kernel matrix of size as
(S4) 
Note that we are considering entries for and typical eigenvalues scale linearly with dataset size . For a given dataset size, the power law observed is and is a dataset-size-independent constant. Thus the smallest eigenvalue is of order .
In the noise model, we can apply Weyl's inequality, which states
(S5) 
Consider the worst case, where the negative-eigenvalue noise affects the kernel's smallest eigenvalue. In that case, the perturbed matrix's minimum eigenvalue could become negative, breaking the positive semidefiniteness (PSD) of the kernel.
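The worst-case argument can be checked numerically. The following is a minimal sketch, not the code used in this study: the dataset size, the power-law spectrum of the stand-in kernel, and the Gaussian form of the noise are all illustrative assumptions. It compares the smallest eigenvalue of the noise matrix against the lower edge of Wigner's semicircle, -2σ√n for off-diagonal standard deviation σ, and verifies the lower bound from Weyl's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400                                    # illustrative dataset size (assumption)
sigma = float(np.finfo(np.float32).eps)    # float32 machine precision as noise scale

# Hypothetical PSD "kernel" with a power-law spectrum (illustrative only).
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = n * np.arange(1, n + 1, dtype=np.float64) ** -2.0
kernel = (q * spectrum) @ q.T
kernel = (kernel + kernel.T) / 2           # symmetrize against round-off

# Symmetric Gaussian noise; off-diagonal entries have standard deviation sigma.
g = rng.standard_normal((n, n))
noise = sigma * (g + g.T) / np.sqrt(2)

# eigvalsh returns eigenvalues in ascending order, so index 0 is the smallest.
lam_kernel = np.linalg.eigvalsh(kernel)[0]
lam_noise = np.linalg.eigvalsh(noise)[0]
lam_perturbed = np.linalg.eigvalsh(kernel + noise)[0]

semicircle_edge = -2 * sigma * np.sqrt(n)  # lower edge of Wigner's semicircle
weyl_lower_bound = lam_kernel + lam_noise  # Weyl: lam_perturbed >= this value

print("float32 eps:", np.finfo(np.float32).eps)
print("float64 eps:", np.finfo(np.float64).eps)
print("smallest noise eigenvalue:", lam_noise, "semicircle edge:", semicircle_edge)
print("smallest perturbed eigenvalue:", lam_perturbed, "Weyl bound:", weyl_lower_bound)
```

In this toy setting the kernel's smallest eigenvalue is far above the noise scale, so the perturbed matrix stays PSD; a faster spectral decay or a larger n would move it toward the regime where the bound turns negative.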
This model allows us to predict the critical dataset size () above which PSD can be broken under a specified noise scale and kernel eigenvalue decay. Imposing the condition that the perturbed smallest eigenvalue becomes negative
(S6) 
we obtain
(S7) 
When PSD is broken, one way to restore it is to add a diagonal regularizer (§3.7). For CIFAR10 with , the typical negative eigenvalue from float32 noise is around , and with the float64 noise scale, considering . Note that Arora et al. [2019a] regularized the kernel with regularization strength , which is on par with the typical negative eigenvalue introduced by float32 noise. Of course, this only applies if the kernel eigenvalue decay is sufficiently fast that the full dataset size is above .
We observe that FCN and CNNVEC kernels with small would not suffer from increasing dataset size under float32 precision. On the other hand, the worse conditioning of CNNGAP affects not only the training time (§3.9) but also the required precision. One could add a sufficiently large diagonal regularization to mitigate the effect of the noise, at the expense of losing the information and generalization strength contained in eigendirections with small eigenvalues.
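To make the notion of a critical dataset size concrete, the sketch below works under explicitly assumed functional forms, which may differ in constants from the expressions of this appendix: the kernel's smallest eigenvalue is modeled as `c * m**(1 - alpha)` (power-law decay with exponent `alpha`, overall linear scaling with dataset size `m`), and the magnitude of the most negative noise eigenvalue as `sigma * sqrt(m)`. The names `c`, `alpha`, `sigma`, and `critical_dataset_size` are hypothetical and chosen for illustration.

```python
import numpy as np


def critical_dataset_size(c, alpha, sigma):
    """Dataset size m at which c * m**(1 - alpha) equals sigma * sqrt(m).

    Assumed model (illustrative, not taken verbatim from the text):
      smallest kernel eigenvalue ~ c * m**(1 - alpha)
      most negative noise eigenvalue ~ -sigma * sqrt(m)
    """
    if alpha <= 0.5:
        raise ValueError("alpha must exceed 0.5 for the two curves to cross as m grows")
    return (c / sigma) ** (2.0 / (2.0 * alpha - 1.0))


# Illustrative values only: c = 1 and a few hypothetical decay exponents.
for dtype in (np.float32, np.float64):
    sigma = float(np.finfo(dtype).eps)
    for alpha in (1.0, 1.5, 2.0):
        m_star = critical_dataset_size(1.0, alpha, sigma)
        print(dtype.__name__, "alpha =", alpha, "critical size ~", m_star)
```

Under these assumptions, a faster eigenvalue decay (larger `alpha`) lowers the critical size, while moving from float32 to float64 raises it by many orders of magnitude, which is consistent with the qualitative discussion above.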
Appendix E Data augmentation via kernel ensembling
We start by considering general ensemble averaging of predictors. Consider a sequence of training sets , each consisting of input-output pairs from a data-generating distribution. A learning algorithm, for which we use NNGP/NTK inference in this study, gives a prediction of an unseen test point . It is possible to obtain a better predictor by averaging the outputs of different predictors
(S8) 
where denotes the cardinality of . This ensemble averaging is a simple type of committee machine, which has a long history (Clemen [1989], Dietterich [2000]). While more sophisticated ensembling methods exist (e.g. Freund and Schapire [1995], Breiman [1996, 2001], Opitz and Shavlik [1996], Opitz and Maclin [1999], Rokach [2010]), we strive for simplicity and consider naive averaging. One alternative we considered is generalizing the average by
(S9) 
where in general is a set of weights satisfying . We can utilize the posterior variance from NNGP or NTK with MSE loss via inverse-variance weighting (IVW), where the weights are given as
(S10) 
In a simple bagging setting (Breiman [1996]), we observe small improvements with IVW over naive averaging. This indicates that the posterior variance for different draws of was quite similar.
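The two averaging schemes can be sketched as follows. This is a minimal illustration rather than the implementation used for the experiments: the array shapes and the function names `naive_average` and `inverse_variance_average` are assumptions, and the weights follow the standard definition of inverse-variance weighting (each predictor weighted by the reciprocal of its posterior variance, normalized to sum to one).

```python
import numpy as np


def naive_average(predictions):
    """Uniform ensemble average.

    predictions: array of shape (E, n, k) holding the outputs of E predictors
    on n test points with k output dimensions.
    """
    return predictions.mean(axis=0)


def inverse_variance_average(predictions, variances):
    """Inverse-variance weighted ensemble average.

    variances: array of shape (E, n) with each predictor's posterior variance
    per test point (assumed strictly positive).
    """
    inv_var = 1.0 / variances
    weights = inv_var / inv_var.sum(axis=0, keepdims=True)  # sums to 1 over E
    return (weights[..., None] * predictions).sum(axis=0)


# Toy usage with random stand-ins for kernel predictions and variances.
rng = np.random.default_rng(0)
preds = rng.standard_normal((4, 10, 3))
var = rng.uniform(0.5, 1.5, size=(4, 10))
print(naive_average(preds).shape, inverse_variance_average(preds, var).shape)
```

When all predictors report the same variance at a test point, the IVW weights reduce to uniform weights, so the two schemes coincide; this matches the observation that similar posterior variances across draws leave little room for improvement.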
Application to data augmentation (DA) is simple, as we consider the process of generating from a (stochastic) data augmentation transformation . We take the action of to be stochastic (e.g. is a random crop operator), applying the augmentation transformation (which itself could be stochastic) with probability and no augmentation with probability . Considering as the clean unaugmented training set, we can imagine the dataset-generating process , where we overload the definition of on a training set to be the data-generating distribution.
For the experiments in §3.12, we took to be the standard augmentation strategy of horizontal flip and random crop by 4 pixels with augmentation fraction (see Figure S12 for the effect of the augmentation fraction on the kernel ensemble). In this framework, it is straightforward to generalize the DA transformation (e.g. to the learned augmentation strategies studied by Cubuk et al. [2019a, b]).
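A minimal sketch of generating augmented training-set draws for a kernel ensemble is given below. Several details are assumptions made for illustration and are not specified in the text: zero padding before cropping, a flip probability of 0.5 for augmented images, the `(N, H, W, C)` image layout, and the function names `augment` and `augmented_draws`. Labels are unaffected by these transformations and are therefore omitted.

```python
import numpy as np


def augment(images, rng, fraction, pad=4):
    """Randomly crop (after zero padding) and flip a fraction of the images.

    images: array of shape (N, H, W, C). Each image is augmented with
    probability `fraction` and left untouched otherwise.
    """
    n, h, w, _ = images.shape
    out = images.copy()
    padded = np.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    for i in np.flatnonzero(rng.random(n) < fraction):
        top, left = rng.integers(0, 2 * pad + 1, size=2)  # shift by up to `pad` pixels
        crop = padded[i, top:top + h, left:left + w]
        if rng.random() < 0.5:                            # assumed flip probability
            crop = crop[:, ::-1]                          # horizontal flip
        out[i] = crop
    return out


def augmented_draws(images, rng, num_draws, fraction):
    """Return `num_draws` training sets; the first is the unaugmented set."""
    draws = [images]
    draws += [augment(images, rng, fraction) for _ in range(num_draws - 1)]
    return draws


# Toy usage with random stand-ins for CIFAR10-shaped inputs.
rng = np.random.default_rng(0)
clean = rng.random((8, 32, 32, 3))
ensemble_inputs = augmented_draws(clean, rng, num_draws=4, fraction=0.5)
print(len(ensemble_inputs), ensemble_inputs[1].shape)
```

Each draw would then be passed to NNGP/NTK inference separately, and the resulting predictions averaged as described above.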
Appendix F ZCA whitening
Consider (flattened) dimensional training set inputs (a matrix) with data covariance
(S11) 
The goal of whitening is to find a whitening transformation , a matrix, such that the features of transformed input
(S12) 
are uncorrelated, i.e. . Note that is constructed only from the training set, while is applied to both training set and test set inputs. The whitening transformation can be efficiently computed by eigendecomposition (for PSD matrices, it is numerically more reliable to obtain it via SVD)
(S13) 
where is a diagonal matrix of eigenvalues, and contains the eigenvectors of as its columns.
With this, the ZCA whitening transformation is obtained from the following whitening matrix
(S14) 
Here, we introduced a trivial reparameterization of the conventional regularizer such that the regularization strength is input-scale invariant. It is easy to check that corresponds to whitening with . In §3.10, we study the benefit of taking a nonzero regularization strength for both kernels and finite networks. We refer to the transformation with a nonzero regularizer as ZCA regularization preprocessing. The ZCA transformation preserves the spatial and chromatic structure of the original image, as illustrated in Figure F. Therefore, image inputs are reshaped to have the same shape as the original image.
In practice, we standardize both training and test set per (RGB channel) features of the training set before and after the ZCA whitening. This ensures transformed inputs are mean zero and variance of order 1.
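The whitening procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the preprocessing code of this study: the covariance is taken as `X.T @ X / N` for an `(N, d)` input matrix, and the regularizer is scaled by the mean eigenvalue, which is one way to make the regularization strength invariant to the input scale. The names `zca_whitening_matrix` and `eps` are hypothetical.

```python
import numpy as np


def zca_whitening_matrix(x_train, eps):
    """Regularized ZCA whitening matrix built from training inputs only.

    x_train: array of shape (N, d) of flattened inputs.
    eps: dimensionless regularization strength; the regularizer added to each
         eigenvalue is eps times the mean eigenvalue (an assumed convention).
    """
    n = x_train.shape[0]
    cov = x_train.T @ x_train / n                 # assumed covariance convention
    eigvals, eigvecs = np.linalg.eigh(cov)        # cov = U diag(eigvals) U^T
    eigvals = np.clip(eigvals, 0.0, None)         # guard against round-off negatives
    reg = eps * eigvals.mean()
    # U diag(1 / sqrt(eigvals + reg)) U^T; with eps = 0 this requires full rank.
    return (eigvecs / np.sqrt(eigvals + reg)) @ eigvecs.T


# Toy usage: fit on training inputs, apply to both training and test inputs.
rng = np.random.default_rng(0)
x_tr = rng.standard_normal((200, 8)) * rng.uniform(0.5, 3.0, size=8)
x_te = rng.standard_normal((50, 8))
w = zca_whitening_matrix(x_tr, eps=0.1)
z_tr, z_te = x_tr @ w, x_te @ w                   # w is symmetric
print(z_tr.shape, z_te.shape)
```

Because the whitening matrix is symmetric and rotates back into the original basis, the transformed inputs keep the layout of the original features and can be reshaped to the original image shape.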
Appendix G MSE vs softmax-cross-entropy loss training of neural networks
Our focus was mainly on finite networks trained with MSE loss, for simple comparison with kernel methods that give a closed-form solution. Here we present a comparison of networks trained with MSE and with softmax-cross-entropy loss. See Table S2 and Figure S3.
Architecture  Type  Param  Base  +LR+U  +L2+U  +L2+LR+U  Best
FCN  MSE  STD  47.82  49.07  49.82  55.32  55.90
FCN  MSE  NTK  46.16  49.17  54.27  55.44  55.44
FCN  XENT  STD  55.01  57.28  53.98  57.64  57.64
FCN  XENT  NTK  53.39  56.59  56.31  58.99  58.99
FCN  MSE+DA  STD  65.29  66.11  65.28  67.43  67.43
FCN  MSE+DA  NTK  61.87  62.12  67.58  69.35  69.35
FCN  XENT+DA  STD  64.15  64.15  67.93  67.93  67.93
FCN  XENT+DA  NTK  62.88  62.88  67.90  67.90  67.90
CNNVEC  MSE  STD  56.68  63.51  67.07  68.99  68.99
CNNVEC  MSE  NTK  60.73  61.58  75.85  77.47  77.47
CNNVEC  XENT  STD  64.31  65.30  64.57  66.95  66.95
CNNVEC  XENT  NTK  67.13  73.23  72.93  74.05  74.05
CNNVEC  MSE+DA  STD  76.73  81.84  76.66  83.01  83.01
CNNVEC  MSE+DA  NTK  83.92  84.76  84.87  85.63  85.63
CNNVEC  XENT+DA  STD  81.84  83.86  81.78  84.37  84.37
CNNVEC  XENT+DA  NTK  86.83  88.59  87.49  88.83  88.83
CNNGAP  MSE  STD  80.26  80.93  81.10  83.01  84.22
CNNGAP  MSE  NTK  80.61  82.44  81.17  82.43  83.92
CNNGAP  XENT  STD  83.66  83.80  84.59  83.87  83.87
CNNGAP  XENT  NTK  83.87  84.40  84.51  84.51  84.51
CNNGAP  MSE+DA  STD  84.36  83.88  84.89  86.45  86.45
CNNGAP  MSE+DA  NTK  84.07  85.54  85.39  86.68  86.68
CNNGAP  XENT+DA  STD  86.04  86.01  86.42  87.26  87.26
CNNGAP  XENT+DA  NTK  86.87  87.31  86.39  88.26  88.26
Appendix H Comment on batch size
The correspondence between NTK and gradient descent training is direct in the full-batch gradient descent (GD) setup (see Dyer and GurAri [2020] for extensions to the minibatch SGD setting). Therefore, the base comparison between finite networks and kernels is the full-batch setting. While it is possible to train our base models with GD, for full CIFAR10 a large empirical study becomes impractical. In practice, we use minibatch SGD with batch size for FCN and for CNNs.
We studied the effect of batch size on training dynamics in Figure S4 and found that these batch-size choices do not affect training dynamics compared to a much larger batch size. Shallue et al. [2019] and McCandlish et al. [2018] observed that, across a wide variety of deep learning models, there is a batch size beyond which one gains no further training speed benefit in number of steps. We observe that the maximal useful batch size in the workloads we study is quite small.
Appendix I Additional tables and plots
Param  Base  +C  +LR  +L2 



+ZCA 

+Ens 





FCN  STD  0.0443  0.0363  0.0406  0.0411  0.0355  0.0337  0.0329  0.0483  0.0319  0.0301  0.0304  0.0267  0.0242  
NTK  0.0465  0.0371  0.0423  0.0338  0.0336  0.0308  0.0308  0.0484  0.0308  0.0300  0.0302  0.0281  0.0225  
CNNVEC  STD  0.0381  0.0330  0.0340  0.0377  0.0279  0.0340  0.0265  0.0383  0.0265  0.0278  0.0287  0.0228  0.0183  
NTK  0.0355  0.0353  0.0355  0.0355  0.0231  0.0246  0.0227  0.0361  0.0227  0.0254  0.0278  0.0164  0.0143  
CNNGAP  STD  0.0209  0.0201  0.0207  0.0201  0.0201  0.0179  0.0177  0.0190  0.0159  0.0172  0.0165  0.0185  0.0149  
NTK  0.0209  0.0201  0.0195  0.0205  0.0181  0.0175  0.0170  0.0194  0.0161  0.0163  0.0157  0.0186  0.0145 
Param  Lin Base  +C  +L2 

+Ens 

NTK  +ZCA 

NNGP  +ZCA 



FCN  STD  0.0524  0.0371  0.0508  0.0350  0.0309  0.0305  0.0306  0.0302    0.0309  0.0308  0.0297  
NTK  0.0399  0.0366  0.0370  0.0368  0.0305  0.0304  0.0305  0.0302  0.0298  
CNNVEC  STD  0.0436  0.0322  0.0351  0.0351  0.0293  0.0291  0.0287  0.0277    0.0286  0.0281  0.0256  
NTK  0.0362  0.0337  0.0342  0.0339  0.0286  0.0286  0.0283  0.0274  0.0273  
CNNGAP  STD  < 0.0272* (Train accuracy 86.22 after 14M steps)  0.0233  0.0200    0.0231  0.0204  0.0191  
NTK  < 0.0276* (Train accuracy 79.90 after 14M steps)  0.0232  0.0200  0.0195 