1 Introduction
Deep learning has made immense progress on artificial intelligence challenges such as computer vision, natural language processing, and reinforcement learning (LeCun et al., 2015). Despite this great success, fundamental theoretical questions, such as why deep networks train and generalize well, are only partially understood. A recent surge of research establishes the connection between wide neural networks and their linearized models. It is shown that wide neural networks can be trained in a setting in which each individual weight only moves very slightly (relative to itself), so that the evolution of the network can be closely approximated by the evolution of the linearized model, which, as the width goes to infinity, has a statistical limit governed by its Neural Tangent Kernel (NTK). Such a connection has led to provable optimization and generalization results for wide neural nets (Li & Liang, 2018; Jacot et al., 2018; Du et al., 2018, 2019; Zou et al., 2019; Lee et al., 2019; Arora et al., 2019a; Allen-Zhu et al., 2019a), and has inspired the design of new algorithms such as neural-network-based kernel machines that achieve competitive results on benchmark learning tasks (Arora et al., 2019b; Li et al., 2019b).
While linearized training is powerful in theory, it is questionable whether it really explains neural network training in practical settings. Indeed, (1) the linearization theory requires small learning rates or specific network parameterizations (such as the NTK parameterization), yet in practice a large (initial) learning rate is typically required to reach good performance; (2) the linearization theory requires high width in order for the linearized model to fit the training dataset and generalize, yet it is unclear whether the finite-width linearizations of practically sized networks have such capacities. Such a gap between linearized and full neural network training has been identified in recent work (Chizat et al., 2019; Ghorbani et al., 2019b, a; Li et al., 2019a), and suggests the need for a better model towards understanding neural network training in practical regimes.
Towards closing this gap, in this paper we propose and study Taylorized training, a principled generalization of linearized training. For any neural network $f_\theta$ and a given initialization $\theta_0$, assuming sufficient smoothness, we can expand $f_\theta$ around $\theta_0$ to the $k$-th order for any $k \geq 1$:

$f^{(k)}_\theta(x) := \sum_{s=0}^{k} \frac{1}{s!}\, \nabla^s_\theta f_{\theta_0}(x)\big[(\theta - \theta_0)^{\otimes s}\big].$
The model $f^{(k)}_\theta$ is exactly the linearized model when $k = 1$, and becomes a $k$-th order polynomial of $\theta$ that is an increasingly better local approximation of $f_\theta$ as we increase $k$. Taylorized training refers to training these Taylorized models explicitly (and not necessarily locally), and using them as a tool towards understanding the training of the full neural network $f_\theta$. The hope with Taylorized training is to “trade expansion order for width”, that is, to understand finite-width dynamics better by using a higher expansion order rather than by increasing the width.
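As a concrete illustration of the expansion above, the following self-contained Python sketch builds $f^{(k)}$ for a toy one-neuron “network” and checks that the approximation error shrinks as $k$ grows. The finite-difference routine is a stand-in for exact derivative computation (an autodiff framework would be used in practice), and all names and constants are illustrative rather than taken from our implementation.

```python
import math
from math import comb

def directional_derivative(g, s, h=1e-2):
    """s-th derivative of a scalar function g at 0, by central differences.

    A finite-difference stand-in for the exact derivative computation; an
    autodiff framework would compute this exactly via nested JVPs."""
    return sum((-1) ** j * comb(s, j) * g((s / 2 - j) * h)
               for j in range(s + 1)) / h ** s

def taylorized(f, theta0, theta, k):
    """Evaluate the k-th order Taylorized model f^(k) of f, expanded around
    theta0, at the parameter theta."""
    v = [t - t0 for t, t0 in zip(theta, theta0)]
    g = lambda t: f([t0 + t * vi for t0, vi in zip(theta0, v)])
    return sum(directional_derivative(g, s) / math.factorial(s)
               for s in range(k + 1))

# Toy one-neuron "network" f_theta(x) = tanh(theta[0] * x + theta[1]) at x = 1
f = lambda theta: math.tanh(theta[0] * 1.0 + theta[1])
theta0, theta = [0.3, -0.2], [0.8, 0.1]
errors = [abs(taylorized(f, theta0, theta, k) - f(theta)) for k in (1, 2, 3, 4)]
```

For this toy problem the four errors decrease monotonically, matching the intuition that each additional expansion order tightens the local approximation.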
In this paper, we take an empirical approach towards studying Taylorized training, demonstrating its usefulness in understanding finite-width full training (by full training, we mean the usual, non-Taylorized training of the neural network). Our main contributions can be summarized as follows:

We experiment with Taylorized training on vanilla convolutional and residual networks in their practical training regimes (standard parameterization + large initial learning rate) on CIFAR-10. We show that Taylorized training gives increasingly better approximations of the training trajectory of the full neural net as we increase the expansion order $k$, in both the parameter space and the function space (Section 5). This is not necessarily expected, as higher-order Taylorized models are no longer guaranteed to give better approximations when parameters travel significantly, yet empirically they do approximate full training better.

We find that Taylorized models can significantly close the performance gap between fully trained neural nets and their linearized models at finite width. Finite-width linearized networks typically have over 40% worse test accuracy than their fully trained counterparts, whereas quartic (4th-order) training is only 10%–15% worse than full training under the same setup.

We demonstrate the potential of Taylorized training as a tool for understanding layer importance. Specifically, higher-order Taylorized training agrees well with full training in layer movements, i.e. how far each layer travels from its initialization, whereas linearized training does not.

We provide a theoretical analysis of the approximation power of Taylorized training (Section 6). We prove that $k$-th order Taylorized training approximates the full training trajectory with error bound $O(m^{-k/2})$ on a wide two-layer network with width $m$. This extends existing results on linearized training and provides a preliminary justification of our experimental findings.
Additional paper organization
A visualization of Taylorized training
A high-level illustration of our results is provided in Figure 1, which visualizes the training trajectories of a 4-layer convolutional neural network and its Taylorized models. Observe that the linearized model struggles to progress past the initial phase of training and is a rather poor approximation of full training in this setting, whereas higher-order Taylorized models approximate full training significantly better.
2 Preliminaries
We consider the supervised learning problem

$\min_\theta\ \hat{L}(\theta) := \mathbb{E}_{(x, y)}\big[\ell(f_\theta(x), y)\big],$

where $x$ is the input, $y$ is the label, $\ell$ is a convex loss function, $\theta \in \mathbb{R}^p$ is the learnable parameter, and $f_\theta$ is the neural network that maps the input to the output (e.g. the prediction in a regression problem, or the vector of logits in a classification problem).
This paper focuses on the case where $f_\theta$ is a (deep) neural network. A standard feedforward neural network with $L$ layers is defined through $f_\theta(x) = W_L h_{L-1} + b_L$, where $h_0 = x$, and

(1)  $h_\ell = \sigma(W_\ell h_{\ell-1} + b_\ell)$

for all $\ell \in \{1, \dots, L-1\}$, where $W_\ell$ are weight matrices, $b_\ell$ are biases, and $\sigma$ is an activation function (e.g. the ReLU) applied entrywise. We will not describe other architectures in detail; for the purpose of describing our approach and empirical results, it suffices to think of $f_\theta(x)$ as a general nonlinear function of the parameter $\theta$ (for a given input $x$). Once the architecture is chosen, it remains to define an initialization strategy and a learning rule.
Initialization and training
We will mostly consider the standard initialization (or variants of it such as Xavier (Glorot & Bengio, 2010) or Kaiming (He et al., 2015)) in this paper, which for a feedforward network is defined as

$(W_\ell)_{ij} \overset{\text{iid}}{\sim} \mathsf{N}(0, c^2 / m_{\ell-1}), \quad b_\ell = 0,$

where $m_{\ell-1}$ is the fan-in of layer $\ell$ and $c > 0$ is a constant, and can be similarly defined for convolutional and residual networks. This is in contrast with the NTK parameterization (Jacot et al., 2018), which encourages the weights to move significantly less.
We consider training the neural network via (stochastic) gradient descent:

(2)  $\theta_{t+1} = \theta_t - \eta_t\, \nabla_\theta \hat{L}(\theta_t).$

We will refer to the above as full training of neural networks, so as to differentiate it from the various approximate training regimes introduced below.
3 Linearized Training and Its Limitations
We briefly review the theory of linearized training (Lee et al., 2019; Chizat et al., 2019) for explaining the training and generalization success of neural networks, and provide insights on its limitations.
3.1 Linearized training and Neural Tangent Kernels
The theory of linearized training begins with the observation that a neural network near init can be accurately approximated by a linearized network. Given an initialization $\theta_0$ and an arbitrary $\theta$ near $\theta_0$, we have

$f_\theta(x) \approx f^{(1)}_\theta(x) := f_{\theta_0}(x) + \big\langle \nabla_\theta f_{\theta_0}(x),\ \theta - \theta_0 \big\rangle,$

that is, the neural network $f_\theta$ is approximately equal to the linearized network $f^{(1)}_\theta$. Consequently, near $\theta_0$, the trajectory of minimizing $\hat{L}(\theta)$ can be well approximated by the trajectory of linearized training, i.e. minimizing

$\hat{L}^{(1)}(\theta) := \mathbb{E}\big[\ell(f^{(1)}_\theta(x), y)\big],$

which is a convex problem and enjoys convergence guarantees.
Furthermore, linearized training can approximate the entire trajectory of full training provided that we are in a certain linearized regime in which we use

Small learning rate, so that $\theta_t$ stays in a small neighborhood of $\theta_0$ for any fixed amount of time;

Overparameterization, so that such a neighborhood gives a function space that is rich enough to contain a point $\theta$ at which $f^{(1)}_\theta$ can fit the entire training dataset.
As soon as we are in the above linearized regime, gradient descent is guaranteed to reach a global minimum (Du et al., 2019; Allen-Zhu et al., 2019b; Zou et al., 2019). Further, as the width goes to infinity, due to randomness in the initialization $\theta_0$, the function space containing such linearized models converges to a statistical limit governed by the Neural Tangent Kernel (NTK) (Jacot et al., 2018), so that wide networks trained in this linearized regime generalize as well as a kernel method (Arora et al., 2019a; Allen-Zhu et al., 2019a).
3.2 Unrealisticness of linearized training in practice
Our key concern about the theory of linearized training is that there are significant differences between training regimes in which the linearized approximation is accurate, and regimes in which neural nets typically attain their best performance in practice. More concretely,

Linearized training is a good approximation of full training under small learning rates (or large learning rates under the NTK parameterization (Lee et al., 2019)), in which each individual weight barely moves. However, neural networks typically attain their best test performance when using a large (initial) learning rate, under which the weights move significantly in a way not explained by linearized training (Li et al., 2019a);

Linearized networks are powerful models in their own right when the base architecture is overparameterized, but can be rather poor when the network is of a practical size. Indeed, infinite-width linearized models such as the CNTK achieve competitive performance on benchmark tasks (Arora et al., 2019b, c), yet their finite-width counterparts often perform significantly worse (Lee et al., 2019; Chizat et al., 2019).
4 Taylorized Training
Towards closing this gap between linearized and full training, we propose to study Taylorized training, a principled extension of linearized training. Taylorized training involves training higher-order expansions of the neural network around the initialization. For any $k \geq 1$, assuming sufficient smoothness, we can Taylor expand $f_\theta$ to the $k$-th order as

$f_\theta(x) \approx f^{(k)}_\theta(x) := \sum_{s=0}^{k} \frac{1}{s!}\, \nabla^s_\theta f_{\theta_0}(x)\big[(\theta - \theta_0)^{\otimes s}\big],$

where we have defined the $k$-th order Taylorized model $f^{(k)}_\theta$. The Taylorized model reduces to the linearized model when $k = 1$, and is a $k$-th order polynomial model for general $k$, where the “features” are the derivative tensors $\nabla^s_\theta f_{\theta_0}(x)$ (which depend on the architecture and initialization $\theta_0$), and the “coefficients” are $(\theta - \theta_0)^{\otimes s}$ for $s \in \{0, \dots, k\}$.
Similarly to linearized training, we define Taylorized training as the process (or trajectory) of training $f^{(k)}$ via gradient descent, starting from the initialization $\theta_0$. Concretely, the trajectory of $k$-th order Taylorized training will be denoted as $\{\theta^{(k)}_t\}_{t \geq 0}$, where

(3)  $\theta^{(k)}_{t+1} = \theta^{(k)}_t - \eta_t\, \nabla_\theta \hat{L}^{(k)}(\theta^{(k)}_t), \quad \hat{L}^{(k)}(\theta) := \mathbb{E}\big[\ell(f^{(k)}_\theta(x), y)\big], \quad \theta^{(k)}_0 = \theta_0.$
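To make these dynamics concrete, here is a minimal, purely illustrative sketch: a single scalar parameter, $f(\theta) = \tanh(\theta)$, squared loss, and hand-coded first and second derivatives. The target, stepsize, and iteration count are arbitrary choices for the toy, not our experimental settings.

```python
import math

# One-parameter "network" f(theta) = tanh(theta) fit to a single target y
# with squared loss; we run gradient descent on f itself (full training)
# and on its Taylorized models around theta0, and compare endpoints.
y, theta0, lr, steps = 0.8, 0.1, 1.0, 100
t0 = math.tanh(theta0)
d1, d2 = 1 - t0 ** 2, -2 * t0 * (1 - t0 ** 2)   # f'(theta0), f''(theta0)

def model(theta, k):
    """Value and derivative of f (k == 0) or its k-th Taylorized model."""
    d = theta - theta0
    if k == 0:
        t = math.tanh(theta)
        return t, 1 - t ** 2
    val = t0 + d1 * d + (0.5 * d2 * d * d if k == 2 else 0.0)
    grad = d1 + (d2 * d if k == 2 else 0.0)
    return val, grad

paths = {}
for k in (0, 1, 2):
    theta = theta0
    for _ in range(steps):
        val, grad = model(theta, k)
        theta -= lr * (val - y) * grad          # GD on 0.5 * (model - y)^2
    paths[k] = theta

gap1, gap2 = abs(paths[1] - paths[0]), abs(paths[2] - paths[0])
```

In this toy run the quadratic model ends closer to the fully trained parameter than the linearized model, consistent with the trend we report empirically in Section 5.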
Taylorized models arise from the same principle as linearized models (Taylor expansion of the neural net), and give increasingly better approximations of the neural network (at least locally) as we increase $k$. Further, higher-order Taylorized training ($k \geq 2$) is no longer a convex problem, yet it models the non-convexity of full training in a mild way that is potentially amenable to theoretical analysis. Indeed, quadratic training ($k = 2$) has been shown to enjoy a nice optimization landscape and achieve better sample complexity than linearized training on learning certain simple functions (Bai & Lee, 2020). Higher-order training also has the potential to be understood through its polynomial structure and its connection to tensor decomposition problems (Mondelli & Montanari, 2019).

Implementation
Naively implementing Taylorization by directly computing higher-order derivative tensors of neural networks is prohibitive in both memory and time. Fortunately, Taylorized models can be efficiently implemented through a series of nested Jacobian-vector product (JVP) operations. Each JVP can be computed with the $\mathcal{R}$-operator algorithm of Pearlmutter (1994), which gives directional derivatives through arbitrary differentiable functions, and is the transpose of backpropagation.
For any function $g$ with parameters $\theta$, we denote its JVP with respect to the direction $v$ using the notation of Pearlmutter (1994) by

(4)  $\mathcal{R}_v\{g(\theta)\} := \frac{\partial}{\partial \epsilon}\, g(\theta + \epsilon v)\Big|_{\epsilon = 0}.$
The $k$-th order Taylorized model can be computed as

(5)  $f^{(k)}_\theta(x) = \sum_{s=0}^{k} \frac{1}{s!}\, \mathcal{R}^s_{\theta - \theta_0}\big\{f_{\theta_0}(x)\big\},$

where $\mathcal{R}^s$ is the $s$-times nested evaluation of the $\mathcal{R}$ operator.
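The nested-JVP computation can be demonstrated without any autodiff library by nesting dual numbers, which is one standard way to realize the $\mathcal{R}$ operator; the sketch below (all names illustrative) computes the quadratic Taylorized model of a one-neuron network. A practical implementation would instead use the JVP primitive of an autodiff framework such as JAX (Bradbury et al., 2018).

```python
import math

class Dual:
    """Forward-mode AD scalar a + b*eps (with eps^2 = 0); nesting Duals
    inside Duals yields higher-order directional derivatives, mirroring
    nested evaluations of Pearlmutter's R-operator."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def _wrap(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._wrap(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __neg__(self):
        return Dual(-self.a, -self.b)
    def __sub__(self, o):
        return self + (-self._wrap(o))
    def __rsub__(self, o):
        return (-self) + o
    def __mul__(self, o):
        o = self._wrap(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def tanh(z):
    if not isinstance(z, Dual):
        return math.tanh(z)
    t = tanh(z.a)
    return Dual(t, (1.0 - t * t) * z.b)   # chain rule; recurses through nesting

def jvp(f, theta, v):
    """One JVP: the directional derivative of f at theta along v."""
    return f([Dual(t, vi) for t, vi in zip(theta, v)]).b

# Quadratic Taylorized model of a one-neuron net f(theta) = tanh(theta[0] + theta[1])
f = lambda th: tanh(th[0] * 1.0 + th[1])
theta0, theta = [0.3, -0.2], [0.8, 0.1]
v = [t - t0 for t, t0 in zip(theta, theta0)]
f0 = f(theta0)
f1 = jvp(f, theta0, v)                              # first directional derivative
f2 = jvp(lambda th: jvp(f, th, v), theta0, v)       # nested JVP: second derivative
quadratic = f0 + f1 + 0.5 * f2                      # f^(2) evaluated at theta
```

The nesting generalizes directly: $s$ nested `jvp` calls give the $s$-th term of the expansion, at a cost of one extra forward pass per level rather than any explicit derivative tensor.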
5 Experiments
Table 1: Architectures and training settings.
Name  Architecture  Params  Train For  Batch  Accuracy  Opt  Rate  Grad Clip  LR Decay Schedule 
CNNTHIN  CNN4-128  447K  200 epochs  256  81.6%  SGD  0.1  5.0  10x drop at 100, 150 epochs 
CNNTHICK  CNN4-512  7.10M  160 epochs  64  85.9%  SGD  0.1  5.0  10x drop at 80, 120 epochs 
WRNTHIN  WideResNet-16-4-128  3.22M  200 epochs  256  88.1%  SGD  10^{-1.5}  10.0  10x drop at 100, 150 epochs 
WRNTHICK  WideResNet-16-8-256  12.84M  160 epochs  64  91.7%  SGD  10^{-1.5}  10.0  10x drop at 80, 120 epochs 
We experiment with Taylorized training on convolutional and residual networks for image classification on CIFAR-10.
5.1 Basic setup
We choose four representative architectures for the image classification task: two CNNs with 4 layers + Global Average Pooling (GAP) at different widths, and two WideResNets (Zagoruyko & Komodakis, 2016) with depth 16, also at different widths. All networks use the standard parameterization and are trained with the cross-entropy loss (different from prior work on linearized training, which primarily focused on the squared loss (Arora et al., 2019b; Lee et al., 2019)). We optimize the training loss using SGD with a large initial learning rate + learning rate decay, and use gradient clipping with a large clipping norm to prevent occasional gradient blow-ups. For each architecture, the initial learning rate was tuned over a grid of candidate values and chosen to be the largest learning rate under which the full neural network trains stably (i.e. has a smoothly decreasing training loss). We use standard data augmentation (random crop, flip, and standardization) as an optimization-independent way of improving generalization. Detailed training settings for each architecture are summarized in Table 1.
Methodology
For each architecture, we train Taylorized models of order $k \in \{1, 2, 3, 4\}$ (referred to as the {linearized, quadratic, cubic, quartic} models) from the same initialization as full training, using the exact same optimization settings (including learning rate decay, gradient clipping, minibatching, and data augmentation noise). This allows us to eliminate the effects of optimization setup and randomness, and examine the agreement between Taylorized and full training in identical settings.
5.2 Approximation power of Taylorized training
We examine the approximation power of Taylorized training by comparing Taylorized training of different orders against full training, in terms of both the training trajectory and the test performance.
Metrics
We monitor the training loss and test accuracy for both full and Taylorized training. We also evaluate the approximation error between Taylorized and full training quantitatively through the following similarity metrics between models:

Cosine similarity in the parameter space, defined as

$\cos\big(\theta^{(k)}_t - \theta_0,\ \theta_t - \theta_0\big) = \frac{\big\langle \theta^{(k)}_t - \theta_0,\ \theta_t - \theta_0 \big\rangle}{\big\|\theta^{(k)}_t - \theta_0\big\|_2\, \big\|\theta_t - \theta_0\big\|_2},$

where (recall (3) and (2)) $\theta^{(k)}_t$ and $\theta_t$ denote the parameters in $k$-th order Taylorized training and full training, and $\theta_0$ is their common initialization.

Cosine similarity in the function space, defined as

$\cos\big(f^{(k)}_{\theta^{(k)}_t}(X_{\rm test}),\ f_{\theta_t}(X_{\rm test})\big),$

where we have overloaded the notation $f^{(k)}_{\theta^{(k)}_t}(X_{\rm test})$ (and similarly $f_{\theta_t}(X_{\rm test})$) to denote the output (logits) of a model on the test dataset. (We centralize (de-mean) the logits for each example along the classification axis so as to remove the effect of the shift invariance in the softmax mapping.)
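Both similarity metrics are straightforward to compute; a possible NumPy sketch (function and variable names are ours, not from our released code):

```python
import numpy as np

def param_cosine(theta_k, theta_full, theta0):
    """Cosine similarity between parameter displacements from the shared init."""
    u, v = theta_k - theta0, theta_full - theta0
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def logit_cosine(logits_k, logits_full):
    """Function-space cosine similarity on test logits, de-meaned per example
    along the class axis to remove the softmax shift invariance."""
    a = logits_k - logits_k.mean(axis=1, keepdims=True)
    b = logits_full - logits_full.mean(axis=1, keepdims=True)
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that `logit_cosine` is unchanged if a per-example constant is added to every class logit, which is exactly the invariance of the softmax mapping.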
Table 2: Final test accuracy of Taylorized models and full training.
Architecture  CNNTHIN  CNNTHICK  WRNTHIN  WRNTHICK 
Linearized ($k=1$)  41.3%  49.0%  50.2%  55.3% 
Quadratic ($k=2$)  61.6%  70.1%  65.8%  71.7% 
Cubic ($k=3$)  69.3%  75.3%  72.6%  76.9% 
Quartic ($k=4$)  71.8%  76.2%  75.6%  78.7% 
Full network  81.6%  85.9%  88.1%  91.7% 
Results
Figure 2 plots training and approximation metrics for full and Taylorized training on the CNNTHIN model. Observe that Taylorized models are much better approximators than linearized models in both the parameter space and the function space: both cosine similarity curves shift up as we increase $k$ from 1 to 4. Further, for the cubic and quartic models, the cosine similarity in the logit space stays above 0.8 over the entire training trajectory (which includes both weakly and strongly trained models), suggesting a fine agreement between higher-order Taylorized training and full training. Results for {CNNTHICK, WRNTHIN, WRNTHICK} are qualitatively similar and are provided in Appendix B.1.
We further report the final test performance of the Taylorized models on all architectures in Table 2. We observe that

Taylorized models can indeed close the performance gap between linearized and full training: linearized models are typically 30%–40% worse than fully trained networks, whereas quartic (4th-order Taylorized) models are within {10%, 13%} of a fully trained network on {CNNs, WideResNets}.

All Taylorized models can benefit from increasing the width (from CNNTHIN to CNNTHICK, and WRNTHIN to WRNTHICK), but the performance of higher-order models (e.g. $k \in \{3, 4\}$) is generally less sensitive to width than that of lower-order models (e.g. $k \in \{1, 2\}$), suggesting their realisticness for explaining the training behavior of practically sized finite-width networks.
On finite- vs. infinite-width linearized models
We emphasize that the performance of our baseline linearized models in Table 2 (40%–55%) is at finite width, and is thus not directly comparable to existing results on infinite-width linearized models such as the CNTK (Arora et al., 2019b). It is possible to achieve stronger results with finite-width linearized networks by using the NTK parameterization, which more closely resembles the infinite-width limit. However, full neural net training with this reparameterization results in significantly weakened performance, suggesting its unrealisticness. The best documented test accuracy of a finite-width linearized network on CIFAR-10 is 65% (Lee et al., 2019), and due to the NTK parameterization, the neural network trained under those same settings only reached 70%. In contrast, our best higher-order models approach 80%, and are trained under realistic settings in which a neural network can reach over 90%.
5.3 Agreement on layer movements
Layer importance, i.e. the contribution and importance of each layer in a well-trained (deep) network, has been identified as a useful concept towards building an architecture-dependent understanding of neural network training (Zhang et al., 2019). Here we demonstrate that higher-order Taylorized training has the potential to lead to a better understanding of layer importance in full training.
Method and result
We examine layer movements, i.e. the distances each layer has travelled over training, and illustrate them for both full and Taylorized training. (Taylorized models are polynomials of $\theta$, where $\theta$ has the same shape as the parameters of the base network; by a “layer” in a Taylorized model, we mean the same partition of $\theta$ into layers as in the base network.) In Figure 3, we plot the layer movements on the CNNTHIN and WRNTHIN models. Compared with linearized training, quartic training agrees with full training much better in the shape of the layer-movement curve, both at an early stage and at convergence. Furthermore, comparing the layer-movement curves between the 10th epoch and convergence, quartic training is able to adjust the shape of the movement curve much better than linearized training.
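For concreteness, the layer-movement statistic can be computed as follows; the relative-norm normalization here is one plausible choice, and the exact normalization used in our plots may differ:

```python
import numpy as np

def layer_movements(params, params0):
    """Per-layer distance traveled from initialization, as a relative
    Frobenius norm (an illustrative normalization choice)."""
    return {name: float(np.linalg.norm(params[name] - params0[name]) /
                        np.linalg.norm(params0[name]))
            for name in params0}
```

Applied to the checkpoints of full training and of each Taylorized model, this yields one movement curve per model, which is what Figure 3 compares.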
Intriguing results about layer importance have also been (implicitly) shown in the study of infinite-width linearized models (i.e. NTK-type kernel methods). For example, it has been observed that the CNN-GP kernel (which corresponds to training the top layer only) has consistently better generalization performance than the CNTK kernel (which corresponds to training all the layers) (Li et al., 2019b). In other words, when training an extremely wide convolutional net on a finite dataset, training the last layer only gives better generalization performance (i.e. a better implicit bias); existing theoretical work on linearized training falls short of explaining layer importance in these settings. We believe Taylorized training can serve as an (at least empirically) useful tool towards understanding layer importance.
6 Theoretical Results
We provide a theoretical analysis of the distance between the trajectories of Taylorized training and full training on wide neural networks.
Problem Setup
We consider training a wide two-layer neural network with width $m$ under the NTK parameterization (for wide two-layer networks, non-trivial linearized/lazy training can only happen under the NTK parameterization; standard parameterization + small learning rate would collapse to training a linear function of the input):

(6)  $f_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma(w_r^\top x),$

where $x \in \mathbb{R}^d$ is the input satisfying $\|x\|_2 = 1$, $\{w_r\}_{r=1}^m \subset \mathbb{R}^d$ are the neurons, $\{a_r\}_{r=1}^m$ are the top-layer coefficients, and $\sigma$ is a smooth activation function. We keep $\{a_r\}$ fixed and only train $\{w_r\}$, so that the learnable parameter of the problem is the weight matrix $W = [w_1, \dots, w_m]^\top \in \mathbb{R}^{m \times d}$. (Fixing the top layer is standard in the analysis of two-layer networks in the linearized regime; see e.g. Du et al. (2018).) We initialize $W_0$ randomly according to the standard initialization, that is, $w_{0,r} \overset{\text{iid}}{\sim} \mathsf{N}(0, I_d)$.
We consider the regression task over a finite dataset $\{(x_i, y_i)\}_{i=1}^n$ with the squared loss

$\hat{L}(W) = \frac{1}{2} \sum_{i=1}^{n} \big(f_W(x_i) - y_i\big)^2,$

and train via gradient flow (i.e. continuous-time gradient descent) with a “stepsize” $\eta$ (gradient-flow trajectories are invariant to the stepsize choice; we fix a “stepsize” only to simplify the presentation):

(7)  $\frac{\mathrm{d}}{\mathrm{d}t} W_t = -\eta\, \nabla_W \hat{L}(W_t), \quad W_{t=0} = W_0.$
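The setup (6)–(7) can be simulated directly; the NumPy sketch below uses a forward-Euler discretization as a stand-in for the continuous-time flow, with $\tanh$ activation, small illustrative dimensions, and random sign top-layer coefficients, all of which are assumptions of this sketch rather than the exact constants of our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 512, 20

# training data on the unit sphere, matching the setup above
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

a = rng.choice([-1.0, 1.0], size=m)     # fixed top-layer coefficients
W = rng.normal(size=(m, d))             # standard initialization; trained

def f(W):
    """Two-layer network (6) under the NTK parameterization."""
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def grad(W):
    """Gradient of the squared loss L(W) = 0.5 * sum_i (f(x_i) - y_i)^2."""
    Z = X @ W.T                               # (n, m) pre-activations
    S = (f(W) - y)[:, None] * (1 - np.tanh(Z) ** 2)
    return (S.T @ X) * a[:, None] / np.sqrt(m)

# forward-Euler discretization of the gradient flow (7)
eta, T = 0.02, 500
losses = []
for _ in range(T):
    losses.append(0.5 * np.sum((f(W) - y) ** 2))
    W -= eta * grad(W)
```

At this width the loss decreases steadily, and each individual neuron moves only slightly from its initialization, which is the regime in which the Taylor expansions above are accurate.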
Taylorized training
We compare the full training dynamics (7) with the corresponding Taylorized training dynamics. The $k$-th order Taylorized model for the neural network (6), denoted $f^{(k)}_W$, has the form

$f^{(k)}_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sum_{s=0}^{k} \frac{\sigma^{(s)}(w_{0,r}^\top x)}{s!} \big((w_r - w_{0,r})^\top x\big)^s.$

The Taylorized training dynamics can be described as

(8)  $\frac{\mathrm{d}}{\mathrm{d}t} W^{(k)}_t = -\eta\, \nabla_W \hat{L}^{(k)}(W^{(k)}_t), \quad \hat{L}^{(k)}(W) := \frac{1}{2} \sum_{i=1}^{n} \big(f^{(k)}_W(x_i) - y_i\big)^2,$

starting at the same initialization $W^{(k)}_{t=0} = W_0$.
We now present our main theoretical result, which bounds the agreement between $k$-th order Taylorized training and full training on wide neural networks.

Theorem (Agreement between Taylorized and full training; informal version). There exists a suitable stepsize $\eta$ such that for any fixed $k \geq 1$ and time horizon, and all sufficiently large $m$, with high probability over the random initialization, full training (7) and Taylorized training (8) are coupled in both the parameter space and the function space, with the distance between the two trajectories bounded by $O(m^{-k/2})$.

Theorem 6 extends existing results which state that linearized training approximates full training with an error bound $O(m^{-1/2})$ in the function space (Lee et al., 2019; Chizat et al., 2019), showing that higher-order Taylorized training enjoys a stronger approximation bound $O(m^{-k/2})$. Such a bound corroborates our experimental finding that Taylorized training gives increasingly better approximations of full training as we increase $k$. We defer the formal statement of Theorem 6 and its proof to Appendix A.
We emphasize that Theorem 6 is still mostly relevant for explaining the initial stage, rather than the entire trajectory, of full training in practice, since the result holds for gradient flow, which only simulates gradient descent with an infinitesimally small learning rate. We believe it is an interesting open direction to prove the coupling between neural networks and $k$-th order Taylorized training under large learning rates.
7 Related Work
Here we review some additional related work.
Neural networks, linearized training, and kernels
The connection between wide neural networks and kernel methods was first identified in (Neal, 1996). A fast-growing body of recent work has studied the interplay between wide neural networks, linearized models, and their infinite-width limits governed by either the Gaussian Process (GP) kernel (corresponding to training the top linear layer only) (Daniely, 2017; Lee et al., 2018; Matthews et al., 2018) or the Neural Tangent Kernel (corresponding to training all the layers) (Jacot et al., 2018). By exploiting such an interplay, it has been shown that gradient descent on overparameterized neural nets can reach global minima (Jacot et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019b; Zou et al., 2019; Lee et al., 2019), and generalize as well as a kernel method (Li & Liang, 2018; Arora et al., 2019a; Cao & Gu, 2019).
NTK-based and NTK-inspired learning algorithms
Inspired by the connection between neural nets and kernel methods, algorithms for efficiently computing the exact (limiting) GP / NTK kernels have been proposed (Arora et al., 2019b; Lee et al., 2019; Novak et al., 2020; Yang, 2019) and shown to yield state-of-the-art kernel-based algorithms on benchmark learning tasks (Arora et al., 2019b; Li et al., 2019b; Arora et al., 2019c). The connection between neural nets and kernels has further been used in designing algorithms for general machine learning use cases such as multi-task learning (Mu et al., 2020) and protecting against noisy labels (Hu et al., 2020).

Limitations of linearized training
The performance gap between linearized and fully trained networks has been empirically observed in (Arora et al., 2019b; Lee et al., 2019; Chizat et al., 2019). On the theoretical end, the sample complexity gap between linearized training and full training has been shown in (Wei et al., 2019; Ghorbani et al., 2019a; Allen-Zhu & Li, 2019; Yehudai & Shamir, 2019) under specific data distributions and architectures.
Provable training beyond linearization
Allen-Zhu et al. (2019a) and Bai & Lee (2020) show that wide neural nets can couple with quadratic models with provably nice optimization landscapes and better generalization than the NTK, and Bai & Lee (2020) further show the sample complexity benefit of $k$-th order models for all $k$. Li et al. (2019a) show that a large initial learning rate + learning rate decay generalizes better than a small learning rate for learning a two-layer network on a specific toy data distribution.
A parallel line of work studies overparameterized neural net training in the mean-field limit, in which the training dynamics can be characterized as a PDE over the distribution of weights (Mei et al., 2018; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; Sirignano & Spiliopoulos, 2018). Unlike the NTK regime, the mean-field regime moves weights significantly, though the inductive bias (what function it converges to) and generalization power there are less clear.
8 Conclusion
In this paper, we introduced and studied Taylorized training. We demonstrated experimentally the potential of Taylorized training in understanding full neural network training, by showing its advantage in terms of approximation in both weight and function space, training and test performance, and other empirical properties such as layer movements. We also provided a preliminary theoretical analysis on the approximation power of Taylorized training.
We believe Taylorized training can serve as a useful tool towards studying the theory of deep learning, and it opens many interesting future directions. For example, can we prove the coupling between full and Taylorized training with large learning rates? How well does Taylorized training approximate full training as $k$ approaches infinity? Following up on our layer-movement experiments, it would also be interesting to use Taylorized training to study the properties of neural network architectures or initializations.
References
 Allen-Zhu & Li (2019) Allen-Zhu, Z. and Li, Y. What can resnet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pp. 9015–9025, 2019.
 Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pp. 6155–6166, 2019a.
 Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252, 2019b.
 Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Finegrained analysis of optimization and generalization for overparameterized twolayer neural networks. In International Conference on Machine Learning, pp. 322–332, 2019a.
 Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8139–8148, 2019b.
 Arora et al. (2019c) Arora, S., Du, S. S., Li, Z., Salakhutdinov, R., Wang, R., and Yu, D. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019c.
 Bai & Lee (2020) Bai, Y. and Lee, J. D. Beyond linearization: On quadratic and higherorder approximation of wide neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkllGyBFPH.
 Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and WandermanMilne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
 Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pp. 10835–10845, 2019.
 Chizat & Bach (2018) Chizat, L. and Bach, F. On the global convergence of gradient descent for overparameterized models using optimal transport. In Advances in neural information processing systems, pp. 3036–3046, 2018.
 Chizat et al. (2019) Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, 2019.
 Daniely (2017) Daniely, A. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.
 Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685, 2019.
 Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
 Ghorbani et al. (2019a) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pp. 9108–9118, 2019a.
 Ghorbani et al. (2019b) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019b.
 Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.

 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
 Hu et al. (2020) Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.
 Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
 Lee et al. (2018) Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
 Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., SohlDickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pp. 8570–8581, 2019.
 Li & Liang (2018) Li, Y. and Liang, Y. Learning over-parameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.
 Li et al. (2019a) Li, Y., Wei, C., and Ma, T. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pp. 11669–11680, 2019a.
 Li et al. (2019b) Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., and Arora, S. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019b.
 Matthews et al. (2018) Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
 Mei et al. (2018) Mei, S., Montanari, A., and Nguyen, P. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115:E7665–E7671, 2018.
 Mondelli & Montanari (2019) Mondelli, M. and Montanari, A. On the connection between learning two-layer neural networks and tensor decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1051–1060, 2019.
 Mu et al. (2020) Mu, F., Liang, Y., and Li, Y. Gradients as features for deep representation learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkeoaeHKDS.
 Neal (1996) Neal, R. M. Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Springer, 1996.
 Novak et al. (2020) Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., SohlDickstein, J., and Schoenholz, S. S. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SklD9yrFPS.
 Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160, 1994.
 Rotskoff & Vanden-Eijnden (2018) Rotskoff, G. M. and Vanden-Eijnden, E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
 Sirignano & Spiliopoulos (2018) Sirignano, J. and Spiliopoulos, K. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
 Wei et al. (2019) Wei, C., Lee, J. D., Liu, Q., and Ma, T. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709–9721, 2019.
 Yang (2019) Yang, G. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes. arXiv preprint arXiv:1910.12478, 2019.
 Yehudai & Shamir (2019) Yehudai, G. and Shamir, O. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pp. 6594–6604, 2019.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhang et al. (2019) Zhang, C., Bengio, S., and Singer, Y. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.
 Zou et al. (2019) Zou, D., Cao, Y., Zhou, D., and Gu, Q. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, pp. 1–26, 2019.
Appendix A Proof of Theorem 6
A.1 Formal statement of Theorem 6
We first collect notation and state our assumptions. Recall that our two-layer neural network is defined as
and its k-th order Taylorized model is
(9) 
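As a purely illustrative sketch (not part of the proof), the Taylorized model can be computed numerically by Taylor-expanding each neuron's activation around its pre-activation at initialization. The snippet below uses σ = exp, chosen only because every derivative of exp is again exp; the width, input dimension, and 1/√m scaling are illustrative assumptions in the style of the NTK parameterization, not the paper's exact setup.

```python
import math
import numpy as np

def two_layer_net(a, W, x):
    """f_W(x) = (1/sqrt(m)) * sum_r a_r * sigma(<w_r, x>), with sigma = exp
    (illustrative smooth activation; its Taylor coefficients are trivial)."""
    m = len(a)
    return float(a @ np.exp(W @ x)) / math.sqrt(m)

def taylorized_net(a, W, W0, x, k):
    """k-th order Taylorized model: Taylor-expand each neuron's activation
    in its pre-activation around the pre-activation at initialization W0."""
    m = len(a)
    z, z0 = W @ x, W0 @ x
    # sigma_k(z; z0) = sum_{j <= k} sigma^{(j)}(z0) * (z - z0)^j / j!,
    # and sigma^{(j)} = exp for sigma = exp.
    sigma_k = np.exp(z0) * sum((z - z0) ** j / math.factorial(j)
                               for j in range(k + 1))
    return float(a @ sigma_k) / math.sqrt(m)
```

For weights near initialization, the gap to the full network shrinks as the order k grows, which is the mechanism the approximation theorem quantifies.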
Let and denote inputs and labels of the training dataset. For any weight matrix we let
With this notation, the loss functions can be written as and , and the training dynamics (full and Taylorized) can be written as
We now state our assumptions. [Full-rankness of the analytic NTK] The analytic NTK on the training dataset, defined as
is full rank and satisfies for some .
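At finite width, the empirical counterpart of this assumption can be probed directly: form the Gram matrix K = J Jᵀ from the parameter Jacobians of the network outputs on the training inputs and check that its smallest eigenvalue is bounded away from zero. A sketch for a hypothetical σ = exp two-layer net (activation and all sizes are illustrative assumptions, not the paper's setup):

```python
import math
import numpy as np

def empirical_ntk(a, W, X):
    """Empirical NTK Gram matrix K_ij = <dF/dW(x_i), dF/dW(x_j)> for an
    illustrative two-layer net f(x) = (1/sqrt(m)) sum_r a_r exp(<w_r, x>).
    Here d f(x) / d w_r = a_r * exp(<w_r, x>) * x / sqrt(m)."""
    m = len(a)
    # One row per training input: the Jacobian wrt all of W, flattened.
    J = np.stack([
        ((a * np.exp(W @ x))[:, None] * x[None, :]).ravel() / math.sqrt(m)
        for x in X
    ])
    return J @ J.T
```

Checking that the smallest eigenvalue of K (e.g. via `np.linalg.eigvalsh`) is strictly positive is the finite-width analogue of the full-rankness condition above.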
[Smooth activation] The activation function is and has a bounded Lipschitz derivative: there exists a constant such that
Further, has a Lipschitz k-th derivative: is Lipschitz for some constant .
Throughout the rest of this section, we assume that the above assumptions hold. We are now in a position to formally state our main theorem. [Approximation error of Taylorized training; formal version of Theorem 6] There exists a suitable stepsize choice such that the following is true: for any fixed and all sufficiently large , with high probability over the random initialization, full training (7) and Taylorized training (8) are coupled in both the parameter space and the function space:
Remark on extending to the entire trajectory
Compared with the existing result on linearized training (Lee et al., 2019, Theorem H.1), our Theorem A.1 only establishes the approximation over a fixed time horizon rather than over the entire trajectory . Technically, this is because the linearized analysis uses a more careful Gronwall-type argument that relies on the kernel not changing during training, which ceases to hold here. Establishing the approximation result for higher-order Taylorized training over the entire trajectory is an interesting open question.
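As a sanity check (not part of the proof), the coupling can be observed in a small simulation: run full training and Taylorized training from the same initialization and track the parameter-space gap over time. The sketch below uses plain gradient descent on the squared loss for a tanh two-layer net; the activation, sizes, step size, and step count are all illustrative assumptions, not the paper's setup.

```python
import math
import numpy as np

def coupling_gap(k, m=64, d=4, n=8, lr=0.1, steps=200, seed=0):
    """Run full training and k-th order Taylorized training (k in {1, 2})
    of a tanh two-layer net from the same initialization with gradient
    descent on the squared loss; return the sup-over-time parameter gap
    max_t ||W_t - W~_t||_F."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    a = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.standard_normal((m, d)) / math.sqrt(d)
    Z0 = X @ W0.T                       # pre-activations at init, (n, m)
    t0 = np.tanh(Z0)
    D = [t0, 1 - t0 ** 2, -2 * t0 * (1 - t0 ** 2)]  # tanh derivs 0..2 at Z0

    def grad(W, order):
        """Gradient of (1/2n)*||predictions - y||^2 in W; order=None gives
        the full model, otherwise the order-th Taylorized model."""
        Z = X @ W.T
        if order is None:
            act, dact = np.tanh(Z), 1 - np.tanh(Z) ** 2
        else:
            dz = Z - Z0
            act = sum(D[j] * dz ** j / math.factorial(j)
                      for j in range(order + 1))
            dact = sum(D[j + 1] * dz ** j / math.factorial(j)
                       for j in range(order))
        resid = (act * a).sum(axis=1) / math.sqrt(m) - y
        return (resid[:, None] * dact * a).T @ X / (n * math.sqrt(m))

    W_full, W_tay, gap = W0.copy(), W0.copy(), 0.0
    for _ in range(steps):
        W_full = W_full - lr * grad(W_full, None)
        W_tay = W_tay - lr * grad(W_tay, k)
        gap = max(gap, float(np.linalg.norm(W_full - W_tay)))
    return gap
```

In this toy setting the k = 2 Taylorized trajectory stays closer to full training than the linearized (k = 1) one, mirroring the ordering of approximation errors in the theorem.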
A.2 Proof of Theorem A.1
Throughout the proof, we let be a constant that does not depend on , but can depend on other problem parameters and can vary from line to line. We will also denote
(10) 
so that the Taylorized model can essentially be thought of as a two-layer neural network with the (data- and neuron-dependent) activation functions .
We first present some known results about the full training trajectory , adapted from (Lee et al., 2019, Appendix G). [Basic properties of full training] Under the assumptions above, the following hold:

is locally bounded and Lipschitz: For any absolute constant there exists a constant such that for sufficiently large , with high probability (over the random initialization ) we have
for any , where denotes a Frobenius norm ball.

Boundedness of gradient flow: there exists an absolute such that with high probability for sufficiently large and a suitable stepsize choice (independent of ), we have for all that
[Properties of Taylorized training] Lemma A.2 also holds if we replace full training with k-th order Taylorized training. More concretely, we have

is locally bounded and Lipschitz: For any absolute constant there exists a constant such that for sufficiently large , with high probability (over the random initialization ) we have
for any , where denotes a Frobenius norm ball.

Boundedness of gradient flow: there exists an absolute such that with high probability for sufficiently large and a suitable stepsize choice (independent of ), we have for all that
Proof.

Rewrite the k-th order Taylorized model (9) as
(11) where we have used the definition of the “Taylorized” activation function in (10).
Our goal here is to show that is bounded and Lipschitz for some absolute constant . By Lemma A.2, it suffices to show the same for , as we already have the result for the original Jacobian . Letting , we have
(12) Above, step (i) uses the k-th order smoothness of . This shows the boundedness of .
A similar argument yields the Lipschitzness of , where the second-to-last expression is replaced by , from which the same argument goes through whenever , and for the sum is bounded by .

This is a direct corollary of part (a): we can view the Taylorized network as an architecture in its own right, which has the same NTK as at initialization (so the non-degeneracy of the NTK also holds) and has a locally bounded, Lipschitz Jacobian. Repeating the argument of (Lee et al., 2019, Theorem G.2) gives the results.
∎
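The fact used in part (b) above, that the Taylorized network has the same Jacobian (and hence the same NTK) as the full network at initialization, follows from σ_k(z₀; z₀) = σ(z₀) and ∂_z σ_k(z; z₀)|_{z=z₀} = σ'(z₀), and is easy to verify directly. The sketch below does so for a hypothetical tanh two-layer net (activation and sizes are illustrative assumptions).

```python
import math
import numpy as np

def jacobian(a, W, W0, x, k=None):
    """Jacobian d f / d vec(W) of a tanh two-layer net (k is None) or of
    its k-th order Taylorized model (k in {1, 2}), evaluated at W.
    Uses d f / d w_r = a_r * act'(<w_r, x>) * x / sqrt(m)."""
    m = len(a)
    z, z0 = W @ x, W0 @ x
    if k is None:
        dact = 1 - np.tanh(z) ** 2
    else:
        t0 = np.tanh(z0)
        D = [1 - t0 ** 2, -2 * t0 * (1 - t0 ** 2)]  # sigma', sigma'' at z0
        # derivative of the order-k Taylorized activation in z
        dact = sum(D[j] * (z - z0) ** j / math.factorial(j)
                   for j in range(k))
    return ((a * dact)[:, None] * x[None, :]).ravel() / math.sqrt(m)
```

At W = W0 the correction terms vanish, so the Taylorized and full Jacobians coincide exactly, and so do the induced NTK Gram matrices.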
[Bounding individual weight movements in and ] Under the same settings as the two preceding lemmas, we have
(13) 
Consequently, we have for that
Proof.
We first show the bound for , and the bound for follows similarly. We have
Note that
due to the boundedness of , and by Lemma A.2(b), so we have
Integrating this inequality (and noticing the initial condition ) yields that
We now show the bound on , again focusing on the case (and the case follows similarly). By (12), we have
Taking the square root gives the desired result. ∎
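The flavor of the bound (13), that each individual neuron moves very little and less so at larger width, can also be observed empirically: training the same problem at increasing widths, the largest single-neuron displacement shrinks. A sketch with gradient descent standing in for gradient flow (activation, data, and all hyperparameters are illustrative assumptions):

```python
import math
import numpy as np

def max_neuron_move(m, d=4, n=8, lr=0.1, steps=100, seed=0):
    """Train a width-m tanh two-layer net with gradient descent on the
    squared loss and return max_r ||w_r(T) - w_r(0)||_2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))      # same data for every width
    y = rng.standard_normal(n)
    a = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.standard_normal((m, d)) / math.sqrt(d)
    W = W0.copy()
    for _ in range(steps):
        Z = X @ W.T
        resid = (np.tanh(Z) * a).sum(axis=1) / math.sqrt(m) - y
        G = (resid[:, None] * (1 - np.tanh(Z) ** 2) * a).T @ X \
            / (n * math.sqrt(m))
        W = W - lr * G
    return float(np.linalg.norm(W - W0, axis=1).max())
```

In this toy setting, quadrupling the width roughly halves the largest per-neuron displacement, consistent with a 1/√m-type scaling of individual weight movement.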
We are now in a position to prove the main theorem.