Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width

02/10/2020 · by Yu Bai, et al. (Salesforce)

We propose Taylorized training as an initiative towards better understanding neural network training at finite width. Taylorized training involves training the k-th order Taylor expansion of the neural network at initialization, and is a principled extension of linearized training, a recently proposed theory for understanding the success of deep learning. We experiment with Taylorized training on modern neural network architectures, and show that Taylorized training (1) agrees with full neural network training increasingly better as we increase k, and (2) can significantly close the performance gap between linearized and full training. Compared with linearized training, higher-order training works in more realistic settings such as standard parameterization and large (initial) learning rates. We complement our experiments with theoretical results showing that the approximation error of k-th order Taylorized models decays exponentially in k for wide neural networks.


1 Introduction

Deep learning has made immense progress in solving artificial intelligence challenges such as computer vision, natural language processing, reinforcement learning, and so on (LeCun et al., 2015). Despite this great success, fundamental theoretical questions such as why deep networks train and generalize well are only partially understood.

A recent surge of research establishes the connection between wide neural networks and their linearized models. It is shown that wide neural networks can be trained in a setting in which each individual weight only moves very slightly (relative to itself), so that the evolution of the network can be closely approximated by the evolution of the linearized model, which when the width goes to infinity has a certain statistical limit governed by its Neural Tangent Kernel (NTK). Such a connection has led to provable optimization and generalization results for wide neural nets  (Li & Liang, 2018; Jacot et al., 2018; Du et al., 2018, 2019; Zou et al., 2019; Lee et al., 2019; Arora et al., 2019a; Allen-Zhu et al., 2019a), and has inspired the design of new algorithms such as neural-based kernel machines that achieve competitive results on benchmark learning tasks (Arora et al., 2019b; Li et al., 2019b).

While linearized training is powerful in theory, it is questionable whether it really explains neural network training in practical settings. Indeed, (1) the linearization theory requires small learning rates or specific network parameterizations (such as the NTK parameterization), yet in practice a large (initial) learning rate is typically required in order to reach good performance; (2) the linearization theory requires a high width in order for the linearized model to fit the training dataset and generalize, yet it is unclear whether the finite-width linearizations of practically sized networks have such capacity. Such a gap between linearized and full neural network training has been identified in recent work (Chizat et al., 2019; Ghorbani et al., 2019b, a; Li et al., 2019a), and suggests the need for a better model for understanding neural network training in practical regimes.

Towards closing this gap, in this paper we propose and study Taylorized training, a principled generalization of linearized training. For any neural network $f_\theta$ and a given initialization $\theta_0$, assuming sufficient smoothness, we can expand $f_\theta$ around $\theta_0$ to the $k$-th order for any $k \ge 1$:

$$f^{(k)}_\theta(x) := f_{\theta_0}(x) + \sum_{j=1}^{k} \frac{1}{j!}\, \nabla^j_\theta f_{\theta_0}(x)\big[(\theta - \theta_0)^{\otimes j}\big].$$

The model $f^{(k)}_\theta$ is exactly the linearized model when $k = 1$, and becomes a $k$-th order polynomial of $\theta$ that is an increasingly better local approximation of $f_\theta$ as we increase $k$. Taylorized training refers to training these Taylorized models explicitly (and not necessarily locally), and using them as a tool towards understanding the training of the full neural network $f_\theta$. The hope with Taylorized training is to "trade expansion order for width", that is, to understand finite-width dynamics better by using a higher expansion order rather than a larger width.
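For concreteness, writing $\Delta := \theta - \theta_0$, the first two orders read

$$f^{(1)}_\theta(x) = f_{\theta_0}(x) + \big\langle \nabla_\theta f_{\theta_0}(x),\, \Delta \big\rangle, \qquad f^{(2)}_\theta(x) = f^{(1)}_\theta(x) + \tfrac{1}{2}\, \nabla^2_\theta f_{\theta_0}(x)\big[\Delta, \Delta\big],$$

i.e. the linearized model, and a quadratic model that adds a Hessian-bilinear correction on top of it.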

In this paper, we take an empirical approach towards studying Taylorized training, demonstrating its usefulness in understanding finite-width full training (by "full training" we mean the usual, non-Taylorized training of the neural network). Our main contributions can be summarized as follows:

  • We experiment with Taylorized training on vanilla convolutional and residual networks in their practical training regimes (standard parameterization + large initial learning rate) on CIFAR-10. We show that Taylorized training gives increasingly better approximations of the training trajectory of the full neural net as we increase the expansion order $k$, in both the parameter space and the function space (Section 5). This is not necessarily expected, as higher-order Taylorized models are no longer guaranteed to give better approximations when parameters travel significantly, yet empirically they do approximate full training better.

  • We find that Taylorized models can significantly close the performance gap between fully trained neural nets and their linearized models at finite width. Finite-width linearized networks typically have over 40% worse test accuracy than their fully trained counterparts, whereas quartic (4th order) training is only 10%-15% worse than full training under the same setup.

  • We demonstrate the potential of Taylorized training as a tool for understanding layer importance. Specifically, higher-order Taylorized training agrees well with full training in layer movements, i.e. how far each layer travels from its initialization, whereas linearized training does not agree well.

  • We provide a theoretical analysis of the approximation power of Taylorized training (Section 6). We prove that $k$-th order Taylorized training approximates the full training trajectory of a wide two-layer network of width $m$ with an error that decays in $m$ at a rate improving exponentially with $k$. This extends existing results on linearized training and provides a preliminary justification of our experimental findings.

Additional paper organization

We provide preliminaries in Section 2, review linearized training in Section 3, describe Taylorized training in more details in Section 4, and review additional related work in Section 7. Additional experimental results are reported in Appendix B.

A visualization of Taylorized training

A high-level illustration of our results is provided in Figure 1, which visualizes the training trajectories of a 4-layer convolutional neural network and its Taylorized models. Observe that the linearized model struggles to progress past the initial phase of training and is a rather poor approximation of full training in this setting, whereas higher-order Taylorized models approximate full training significantly better.

Figure 1: Function space dynamics of a 4-layer CNN under full training (NN) and Taylorized training of order 1-4 (linearized, quadratic, cubic, quartic). All models are trained on CIFAR-10 with the same initialization and optimization setup, and we plot the test logits of the trained models over the first 20 epochs. Each point is a 2D PCA embedding of the test logits of the corresponding model. Observe that Taylorized training becomes an increasingly better approximation of full NN training as we increase the expansion order $k$.

2 Preliminaries

We consider the supervised learning problem

$$\min_\theta\; L(\theta) := \mathbb{E}_{(x,y)}\big[\ell\big(f_\theta(x),\, y\big)\big],$$

where $x$ is the input, $y$ is the label, $\ell$ is a convex loss function, $\theta$ is the learnable parameter, and $f_\theta$ is the neural network that maps the input to the output (e.g. the prediction in a regression problem, or the vector of logits in a classification problem).

This paper focuses on the case where $f_\theta$ is a (deep) neural network. A standard feedforward neural network with $L$ layers is defined through $f_\theta(x) = W_L h_{L-1} + b_L$, where $h_0 = x$, and

$$h_l = \sigma\big(W_l h_{l-1} + b_l\big) \qquad (1)$$

for all $l = 1, \dots, L-1$, where $W_l$ are weight matrices, $b_l$ are biases, and $\sigma$ is an activation function (e.g. the ReLU) applied entry-wise. We will not describe other architectures in detail; for the purpose of describing our approach and empirical results, it suffices to think of $f_\theta(x)$ as a general nonlinear function of the parameter $\theta$ (for a given input $x$).
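For concreteness, a minimal JAX sketch of such a feedforward network (a toy stand-in for the architectures used later; the sizes below are hypothetical):

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    """He-style standard initialization for a feedforward net (toy sketch)."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        W = jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)
        b = jnp.zeros(d_out)
        params.append((W, b))
    return params

def mlp(params, x):
    """f_theta(x): ReLU hidden layers followed by a linear output layer, as in Eq. (1)."""
    h = x
    for W, b in params[:-1]:
        h = jax.nn.relu(h @ W + b)
    W, b = params[-1]
    return h @ W + b

# e.g. a width-128, depth-4 network for 10-class classification on flattened inputs:
# params0 = init_mlp(jax.random.PRNGKey(0), [3072, 128, 128, 128, 10])
```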

Once the architecture is chosen, it remains to define an initialization strategy and a learning rule.

Initialization and training

We will mostly consider the standard initialization (or variants of it such as Xavier (Glorot & Bengio, 2010) or Kaiming (He et al., 2015)) in this paper, which for a feedforward network initializes each weight entry as an independent zero-mean Gaussian with variance inversely proportional to the fan-in of its layer, and can be similarly defined for convolutional and residual networks. This is in contrast with the NTK parameterization (Jacot et al., 2018), which encourages the weights to move significantly less.

We consider training the neural network via (stochastic) gradient descent, starting from the initialization $\theta_0$:

$$\theta_{t+1} = \theta_t - \eta_t\, \nabla_\theta L(\theta_t), \qquad (2)$$

where $\eta_t$ is the learning rate (with the gradient replaced by a minibatch estimate in the stochastic case). We will refer to the above as full training of neural networks, so as to differentiate it from the various approximate training regimes introduced below.

3 Linearized Training and Its Limitations

We briefly review the theory of linearized training (Lee et al., 2019; Chizat et al., 2019) for explaining the training and generalization success of neural networks, and provide insights on its limitations.

3.1 Linearized training and Neural Tangent Kernels

The theory of linearized training begins with the observation that a neural network near its initialization can be accurately approximated by a linearized network. Given an initialization $\theta_0$ and an arbitrary $\theta$ near $\theta_0$, we have that

$$f_\theta(x) \approx f^{\rm lin}_\theta(x) := f_{\theta_0}(x) + \big\langle \nabla_\theta f_{\theta_0}(x),\, \theta - \theta_0 \big\rangle,$$

that is, the neural network $f_\theta$ is approximately equal to the linearized network $f^{\rm lin}_\theta$. Consequently, near $\theta_0$, the trajectory of minimizing $L(\theta)$ can be well approximated by the trajectory of linearized training, i.e. minimizing

$$L^{\rm lin}(\theta) := \mathbb{E}_{(x,y)}\big[\ell\big(f^{\rm lin}_\theta(x),\, y\big)\big],$$

which is a convex problem and enjoys convergence guarantees.
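In code, the linearized network is a single forward-mode JVP around the initialization; a minimal sketch (our illustration, not the paper's released code):

```python
import jax
from jax import tree_util

def linearize_at(f, params0):
    """First-order Taylor model of f(params, x) around params0."""
    def f_lin(params, x):
        # Displacement from the initialization, theta - theta_0.
        direction = tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        # f(theta_0, x) and the directional derivative <grad f(theta_0, x), theta - theta_0>.
        out0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (direction,))
        return out0 + jvp_out
    return f_lin

# e.g. f_lin = linearize_at(mlp, params0), using the mlp sketch from Section 2.
```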

Furthermore, linearized training can approximate the entire trajectory of full training provided that we are in a certain linearized regime, in which we use

  • Small learning rate, so that $\theta_t$ stays in a small neighborhood of $\theta_0$ for any fixed amount of time;

  • Over-parameterization, so that such a neighborhood gives a function space rich enough to contain a point at which the linearized model can fit the entire training dataset.

As soon as we are in the above linearized regime, gradient descent is guaranteed to reach a global minimum (Du et al., 2019; Allen-Zhu et al., 2019b; Zou et al., 2019). Further, as the width goes to infinity, due to randomness in the initialization $\theta_0$, the function space containing such linearized models goes to a statistical limit governed by the Neural Tangent Kernel (NTK) (Jacot et al., 2018), so that wide networks trained in this linearized regime generalize as well as a kernel method (Arora et al., 2019a; Allen-Zhu et al., 2019a).

3.2 Unrealisticness of linearized training in practice

Our key concern about the theory of linearized training is that there are significant differences between training regimes in which the linearized approximation is accurate, and regimes in which neural nets typically attain their best performance in practice. More concretely,

  1. Linearized training is a good approximation of full training under small learning rates (or large learning rates under the NTK parameterization (Lee et al., 2019)), in which each individual weight barely moves. However, neural networks typically attain their best test performance when using a large (initial) learning rate, under which the weights move significantly in a way not explained by linearized training (Li et al., 2019a);

  2. Linearized networks are powerful models on their own when the base architecture is over-parameterized, but can be rather poor when the network is of a practical size. Indeed, infinite-width linearized models such as the CNTK achieve competitive performance on benchmark tasks (Arora et al., 2019b, c), yet their finite-width counterparts often perform significantly worse (Lee et al., 2019; Chizat et al., 2019).

4 Taylorized Training

Towards closing this gap between linearized and full training, we propose to study Taylorized training, a principled extension of linearized training. Taylorized training involves training higher-order expansions of the neural network around the initialization. For any $k \ge 1$, assuming sufficient smoothness, we can Taylor expand $f_\theta$ around the initialization $\theta_0$ to the $k$-th order as

$$f_\theta(x) \approx f^{(k)}_\theta(x) := f_{\theta_0}(x) + \sum_{j=1}^{k} \frac{1}{j!}\, \nabla^j_\theta f_{\theta_0}(x)\big[(\theta - \theta_0)^{\otimes j}\big],$$

where we have defined the $k$-th order Taylorized model $f^{(k)}_\theta$. The Taylorized model reduces to the linearized model when $k = 1$, and is a $k$-th order polynomial model for general $k$, where the "features" are the derivative tensors $\nabla^j_\theta f_{\theta_0}(x)$ (which depend on the architecture and the initialization $\theta_0$), and the "coefficients" are $(\theta - \theta_0)^{\otimes j} / j!$ for $j = 1, \dots, k$.

Similar to linearized training, we define Taylorized training as the process (or trajectory) of training $f^{(k)}_\theta$ via gradient descent, starting from the same initialization $\theta_0$. Concretely, the trajectory of $k$-th order Taylorized training will be denoted as $\{\bar\theta^{(k)}_t\}_{t \ge 0}$, where $\bar\theta^{(k)}_0 = \theta_0$ and

$$\bar\theta^{(k)}_{t+1} = \bar\theta^{(k)}_t - \eta_t\, \nabla_\theta L^{(k)}\big(\bar\theta^{(k)}_t\big), \qquad (3)$$

with $L^{(k)}$ denoting the loss $L$ in which $f_\theta$ is replaced by $f^{(k)}_\theta$.

Taylorized models arise from a similar principle as linearized models (Taylor expansion of the neural net), and give increasingly better approximations of the neural network (at least locally) as we increase $k$. Further, higher-order Taylorized training ($k \ge 2$) is no longer a convex problem, yet it models the non-convexity of full training in a mild way that is potentially amenable to theoretical analysis. Indeed, quadratic training ($k = 2$) has been shown to enjoy a nice optimization landscape and achieve better sample complexity than linearized training on learning certain simple functions (Bai & Lee, 2020). Higher-order training also has the potential to be understood through its polynomial structure and its connection to tensor decomposition problems (Mondelli & Montanari, 2019).

Implementation

Naively implementing Taylorization by directly computing higher-order derivative tensors of neural networks is prohibitive in both memory and time. Fortunately, Taylorized models can be efficiently implemented through a series of nested Jacobian-Vector Product (JVP) operations. Each JVP can be computed with the $\mathcal{R}$-operator algorithm of Pearlmutter (1994), which gives directional derivatives through arbitrary differentiable functions and is the transpose of backpropagation.

For any function $g(\theta)$ with parameters $\theta$, we denote its JVP with respect to a direction $v$ using the notation of Pearlmutter (1994) by

$$\mathcal{R}_v\{g(\theta)\} := \frac{\partial}{\partial s}\, g(\theta + s v)\Big|_{s=0} = \nabla_\theta g(\theta)\, v. \qquad (4)$$

The $k$-th order Taylorized model can be computed as

$$f^{(k)}_\theta(x) = \sum_{j=0}^{k} \frac{1}{j!}\, \mathcal{R}^j_{\theta - \theta_0}\big\{f_{\theta_0}(x)\big\}, \qquad (5)$$

where $\mathcal{R}^j$ is the $j$-times nested evaluation of the $\mathcal{R}$-operator (with $\mathcal{R}^0$ being the identity).

Our implementation uses JAX (Bradbury et al., 2018) and neural_tangents (Novak et al., 2020), which has built-in support for Taylorizing any function to an arbitrary order based on nested JVP operations.
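To make the nested-JVP recipe concrete, here is a minimal from-scratch JAX sketch (the experiments themselves rely on the built-in Taylorization support of neural_tangents mentioned above; the names below are ours):

```python
import jax
from jax import tree_util

def taylorize(f, params0, k):
    """k-th order Taylor expansion of f(params, x) around params0 (a from-scratch sketch)."""
    def f_k(params, x):
        # Displacement from the initialization, theta - theta_0.
        direction = tree_util.tree_map(lambda p, p0: p - p0, params, params0)

        def along_line(t):
            # Network evaluated at theta_0 + t * (theta - theta_0); f_k(theta, x) is the
            # k-th order Taylor polynomial of this scalar-parameterized curve, evaluated at t = 1.
            interp = tree_util.tree_map(lambda p0, d: p0 + t * d, params0, direction)
            return f(interp, x)

        # Accumulate sum_{j=0}^{k} g^(j)(0) / j! by nesting forward-mode JVPs in t.
        total = along_line(0.0)
        deriv, factorial = along_line, 1.0
        for j in range(1, k + 1):
            prev = deriv
            deriv = lambda t, prev=prev: jax.jvp(prev, (t,), (1.0,))[1]
            factorial *= j
            total = total + deriv(0.0) / factorial
        return total

    return f_k

# e.g. quartic model of the mlp sketched in Section 2: f4 = taylorize(mlp, params0, k=4)
```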

5 Experiments

Name     | Architecture        | Params | Training   | Batch size | Test acc. | Optimizer | LR     | Grad clip | LR decay schedule
CNNTHIN  | CNN-4-128           | 447K   | 200 epochs | 256        | 81.6%     | SGD       | 0.1    | 5.0       | 10x drop at 100, 150 epochs
CNNTHICK | CNN-4-512           | 7.10M  | 160 epochs | 64         | 85.9%     | SGD       | 0.1    | 5.0       | 10x drop at 80, 120 epochs
WRNTHIN  | WideResNet-16-4-128 | 3.22M  | 200 epochs | 256        | 88.1%     | SGD       | 1e-1.5 | 10.0      | 10x drop at 100, 150 epochs
WRNTHICK | WideResNet-16-8-256 | 12.84M | 160 epochs | 64         | 91.7%     | SGD       | 1e-1.5 | 10.0      | 10x drop at 80, 120 epochs
Table 1: Our architectures and training setups. CNN-$L$-$C$ stands for a CNN with depth $L$ and $C$ channels per layer. WideResNet-$d$-$w$-$C$ stands for a WideResNet with depth $d$, widening factor $w$, and $C$ channels in the first convolutional layer.

We experiment with Taylorized training on convolutional and residual networks for the image classification task on CIFAR-10.

5.1 Basic setup

We choose four representative architectures for the image classification task: two CNNs with 4 convolutional layers + Global Average Pooling (GAP) with widths {128, 512}, and two WideResNets (Zagoruyko & Komodakis, 2016) with depth 16 and two different widths as well. All networks use the standard parameterization and are trained with the cross-entropy loss (different from prior work on linearized training, which primarily focused on the squared loss (Arora et al., 2019b; Lee et al., 2019)). We optimize the training loss using SGD with a large initial learning rate + learning rate decay, and additionally use gradient clipping with a large clipping norm in order to prevent occasional gradient blow-ups.

For each architecture, the initial learning rate was tuned over a grid and chosen to be the largest learning rate under which the full neural network can train stably (i.e. has a smoothly decreasing training loss). We use standard data augmentation (random crop, flip, and standardization) as an optimization-independent way of improving generalization. Detailed training settings for each architecture are summarized in Table 1.
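A sketch of this shared optimization setup, written here with optax for illustration (the authors' actual training code may differ):

```python
import optax

def make_optimizer(init_lr, clip_norm, decay_steps):
    """SGD with global-norm gradient clipping and 10x step decay at the given steps."""
    schedule = optax.piecewise_constant_schedule(
        init_value=init_lr,
        boundaries_and_scales={step: 0.1 for step in decay_steps},
    )
    return optax.chain(
        optax.clip_by_global_norm(clip_norm),
        optax.sgd(learning_rate=schedule),
    )

# e.g. CNNTHIN: initial LR 0.1, clip norm 5.0, 10x decay at steps 19600 and 29400.
optimizer = make_optimizer(init_lr=0.1, clip_norm=5.0, decay_steps=(19600, 29400))
```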

Methodology

For each architecture, we train Taylorized models of order $k \in \{1, 2, 3, 4\}$ (referred to as {linearized, quadratic, cubic, quartic} models) from the same initialization as full training, using the exact same optimization setting (including learning rate decay, gradient clipping, minibatching, and data augmentation noise). This allows us to eliminate the effects of the optimization setup and randomness, and examine the agreement between Taylorized and full training in identical settings.

5.2 Approximation power of Taylorized training

We examine the approximation power of Taylorized training by comparing Taylorized training of different orders with full training, in terms of both the training trajectory and the test performance.

Metrics

We monitor the training loss and test accuracy for both full and Taylorized training. We also evaluate the approximation error between Taylorized and full training quantitatively through the following similarity metrics between models:

  • Cosine similarity in the parameter space, defined as

    $$\cos\big(\bar\theta^{(k)}_t - \theta_0,\; \theta_t - \theta_0\big) := \frac{\big\langle \bar\theta^{(k)}_t - \theta_0,\, \theta_t - \theta_0 \big\rangle}{\big\|\bar\theta^{(k)}_t - \theta_0\big\|_2\, \big\|\theta_t - \theta_0\big\|_2},$$

    where (recall (3) and (2)) $\bar\theta^{(k)}_t$ and $\theta_t$ denote the parameters in $k$-th order Taylorized training and full training, and $\theta_0$ is their common initialization.

  • Cosine similarity in the function space, defined as

    $$\cos\big(f^{(k)}_{\bar\theta^{(k)}_t}(X_{\rm test}),\; f_{\theta_t}(X_{\rm test})\big),$$

    where we have overloaded the notation $f^{(k)}_{\bar\theta^{(k)}_t}(X_{\rm test})$ (and similarly $f_{\theta_t}(X_{\rm test})$) to denote the output (logits) of a model on the test dataset, centralized (de-meaned) for each example along the classification axis so as to remove the effect of the shift invariance in the softmax mapping. A minimal code sketch of both metrics is given below.
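A minimal sketch of these two metrics (our illustration; variable names are hypothetical):

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def parameter_cosine(theta_k, theta_full, theta_0):
    """Cosine similarity between parameter displacements from the shared initialization."""
    d_k = ravel_pytree(jax.tree_util.tree_map(lambda a, b: a - b, theta_k, theta_0))[0]
    d_full = ravel_pytree(jax.tree_util.tree_map(lambda a, b: a - b, theta_full, theta_0))[0]
    return jnp.dot(d_k, d_full) / (jnp.linalg.norm(d_k) * jnp.linalg.norm(d_full))

def function_cosine(logits_k, logits_full):
    """Cosine similarity between de-meaned test logits (arrays of shape [n_test, n_classes])."""
    a = logits_k - logits_k.mean(axis=-1, keepdims=True)       # remove softmax shift invariance
    b = logits_full - logits_full.mean(axis=-1, keepdims=True)
    return jnp.dot(a.ravel(), b.ravel()) / (jnp.linalg.norm(a) * jnp.linalg.norm(b))
```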

Figure 2: $k$-th order Taylorized training approximates full training increasingly better with $k$ on the CNNTHIN model. Training statistics are plotted for {full, linearized, quadratic, cubic, quartic} models. Left to right: (1) training loss; (2) test accuracy; (3) cosine similarity between Taylorized and full training in the parameter space; (4) cosine similarity between Taylorized and full training in the function (logit) space. All models are trained on CIFAR-10 for 39200 steps, and a 10x learning rate decay happens at steps {19600, 29400}.
Architecture       | CNNTHIN | CNNTHICK | WRNTHIN | WRNTHICK
Linearized ($k=1$) | 41.3%   | 49.0%    | 50.2%   | 55.3%
Quadratic ($k=2$)  | 61.6%   | 70.1%    | 65.8%   | 71.7%
Cubic ($k=3$)      | 69.3%   | 75.3%    | 72.6%   | 76.9%
Quartic ($k=4$)    | 71.8%   | 76.2%    | 75.6%   | 78.7%
Full network       | 81.6%   | 85.9%    | 88.1%   | 91.7%
Table 2: Final test accuracy on CIFAR-10 for Taylorized models trained under the same optimization setup as full neural nets. Details about the architectures and their training setups can be found in Table 1.

Results

Figure 2 plots training and approximation metrics for full and Taylorized training on the CNNTHIN model. Observe that higher-order Taylorized models are much better approximators than linearized models in both the parameter space and the function space: both cosine similarity curves shift up as we increase $k$ from 1 to 4. Further, for the cubic and quartic models, the cosine similarity in the logit space stays above 0.8 over the entire training trajectory (which includes both weakly and strongly trained models), suggesting a fine agreement between higher-order Taylorized training and full training. Results for {CNNTHICK, WRNTHIN, WRNTHICK} are qualitatively similar and are provided in Appendix B.1.

We further report the final test performance of the Taylorized models on all architectures in Table 2. We observe that

  1. Taylorized models can indeed close the performance gap between linearized and full training: linearized models are typically 30%-40% worse than fully trained networks, whereas quartic (4th order Taylorized) models are within {10%, 13%} of a fully trained network on {CNNs, WideResNets}.

  2. All Taylorized models benefit from increasing the width (from CNNTHIN to CNNTHICK, and from WRNTHIN to WRNTHICK), but the performance of higher-order models (larger $k$) is generally less sensitive to width than that of lower-order models (smaller $k$), suggesting that they are more realistic models for explaining the training behavior of practically sized finite-width networks.

On finite- vs. infinite-width linearized models

We emphasize that the performance of our baseline linearized models in Table 2 (40%-55%) is at finite width, and is thus not directly comparable to existing results on infinite-width linearized models such as CNTK (Arora et al., 2019b). It is possible to achieve stronger results with finite-width linearized networks by using the NTK parameterization, which more closely resembles the infinite width limit. However, full neural net training with this re-parameterization results in significantly weakened performance, suggesting its unrealisticness. The best documented test accuracy of a finite-width linearized network on CIFAR-10 is 65% (Lee et al., 2019), and due to the NTK parameterization, the neural network trained under these same settings only reached 70%. In contrast, our best higher order models can approach 80%, and are trained under realistic settings where a neural network can reach over 90%.

5.3 Agreement on layer movements

Layer importance, i.e. the contribution and importance of each layer in a well-trained (deep) network, has been identified as a useful concept towards building an architecture-dependent understanding of neural network training (Zhang et al., 2019). Here we demonstrate that higher-order Taylorized training has the potential to lead to a better understanding of layer importance in full training.

Method and result

We examine layer movements, i.e. the distance each layer has travelled over the course of training, and illustrate them for both full and Taylorized training. (Taylorized models are polynomials of $\theta - \theta_0$, where $\theta$ has the same shape as the parameters of the base network; by a "layer" of a Taylorized model, we mean the same partition of $\theta$ into layers as in the base network.) In Figure 3, we plot the layer movements for the CNNTHIN and WRNTHIN models. Compared with linearized training, quartic training agrees with full training much better in the shape of the layer movement curve, both at an early stage and at convergence. Furthermore, comparing the layer movement curves between the 10th epoch and convergence, quartic training appears able to adjust the shape of the movement curve much better than linearized training.
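A minimal sketch of the layer movement metric (our illustration; it assumes the parameters are stored as a list of per-layer pytrees):

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def layer_movements(params_t, params_0):
    """Per-layer distance from initialization: ||theta_l(t) - theta_l(0)||_F for each layer l."""
    movements = []
    for layer_t, layer_0 in zip(params_t, params_0):
        diff = jax.tree_util.tree_map(lambda a, b: a - b, layer_t, layer_0)
        movements.append(float(jnp.linalg.norm(ravel_pytree(diff)[0])))
    return movements
```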

Intriguing results about layer importance have also been (implicitly) shown in the study of infinite-width linearized models (i.e. NTK-type kernel methods). For example, it has been observed that the CNN-GP kernel (which corresponds to training the top layer only) has consistently better generalization performance than the CNTK kernel (which corresponds to training all the layers) (Li et al., 2019b). In other words, when training an extremely wide convolutional net on a finite dataset, training the last layer only gives better generalization performance (i.e. a better implicit bias); existing theoretical work on linearized training falls short of explaining layer importance in these settings. We believe Taylorized training can serve as an (at least empirically) useful tool towards understanding layer importance.

(a) CNNTHIN at 10 epochs
(b) WRNTHIN at 10 epochs
(c) CNNTHIN at convergence
(d) WRNTHIN at convergence
Figure 3: Layer movements of the full NN, linearized, and quartic models. Compared with linearized training, quartic (4th order) Taylorized training agrees with the full neural network much better in terms of layer movement, both at the initial stage and at convergence.

6 Theoretical Results

We provide a theoretical analysis on the distance between the trajectories of Taylorized training and full training on wide neural networks.

Problem Setup

We consider training a wide two-layer neural network with width $m$ under the NTK parameterization (for wide two-layer networks, non-trivial linearized/lazy training can only happen under the NTK parameterization; the standard parameterization with a small learning rate would collapse to training a linear function of the input):

$$f_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma\big(w_r^\top x\big), \qquad (6)$$

where $x \in \mathbb{R}^d$ is the input (assumed to satisfy a norm bound), $w_1, \dots, w_m \in \mathbb{R}^d$ are the neurons (the rows of the weight matrix $W \in \mathbb{R}^{m \times d}$), $a_1, \dots, a_m$ are the top-layer coefficients, and $\sigma$ is a smooth activation function. We set the top layer $\{a_r\}$ fixed and only train the neurons, so that the learnable parameter of the problem is the weight matrix $W$. (Setting the top layer as fixed is standard in the analysis of two-layer networks in the linearized regime; see e.g. Du et al., 2018.)

We initialize the weights randomly according to the standard initialization, that is, the entries of $W_0$ are drawn i.i.d. from $\mathsf{N}(0, 1)$.

We consider the regression task over a finite dataset $\{(x_i, y_i)\}_{i=1}^n$ with the squared loss

$$L(W) = \frac{1}{2} \sum_{i=1}^{n} \big(f_W(x_i) - y_i\big)^2,$$

and train via gradient flow (i.e. continuous-time gradient descent) with a "step-size" $\eta$ (gradient flow trajectories are invariant to the step-size choice; we nevertheless choose a "step-size" so as to simplify the presentation):

$$\frac{dW_t}{dt} = -\eta\, \nabla_W L(W_t). \qquad (7)$$

Taylorized training

We compare the full training dynamics (7) with the corresponding Taylorized training dynamics. The $k$-th order Taylorized model for the neural network (6), denoted as $f^{(k)}_W$, has the form

$$f^{(k)}_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sum_{j=0}^{k} \frac{\sigma^{(j)}\big(w_{0,r}^\top x\big)}{j!}\, \big((w_r - w_{0,r})^\top x\big)^j,$$

where $w_{0,r}$ denotes the $r$-th row of the initialization $W_0$. The Taylorized training dynamics can be described as

$$\frac{d\bar W^{(k)}_t}{dt} = -\eta\, \nabla_W L^{(k)}\big(\bar W^{(k)}_t\big), \qquad (8)$$

where $L^{(k)}$ is the squared loss with $f_W$ replaced by $f^{(k)}_W$, starting at the same initialization $\bar W^{(k)}_0 = W_0$.

We now present our main theoretical result, which bounds the agreement between $k$-th order Taylorized training and full training on wide neural networks.

Theorem 6 (Agreement between Taylorized and full training; informal version). There exists a suitable step-size choice such that for any fixed time horizon $T$ and all sufficiently large width $m$, with high probability over the random initialization, full training (7) and Taylorized training (8) are coupled in both the parameter space and the function space: the coupling errors $\sup_{t \le T} \|W_t - \bar W^{(k)}_t\|_F$ and $\sup_{t \le T} \|f_{W_t}(X) - f^{(k)}_{\bar W^{(k)}_t}(X)\|_2$ decay in the width $m$ at a rate that improves exponentially with the expansion order $k$.

Theorem 6 extends existing results which state that linearized training approximates full training with an $O(1/\sqrt{m})$ error bound in the function space (Lee et al., 2019; Chizat et al., 2019), showing that higher-order Taylorized training enjoys a stronger approximation bound. Such a bound corroborates our experimental finding that Taylorized training becomes an increasingly better approximation of full training as we increase $k$. We defer the formal statement of Theorem 6 and its proof to Appendix A.

We emphasize that Theorem 6 is still mostly relevant for explaining the initial stage rather than the entire trajectory of full training in practice, due to the fact that the result holds for gradient flow, which only simulates gradient descent with an infinitesimally small learning rate. We believe that proving a coupling between neural networks and $k$-th order Taylorized training under large learning rates is an interesting open direction.

7 Related Work

Here we review some additional related work.

Neural networks, linearized training, and kernels

The connection between wide neural networks and kernel methods was first identified in (Neal, 1996). A fast-growing body of recent work has studied the interplay between wide neural networks, linearized models, and their infinite-width limits governed by either the Gaussian Process (GP) kernel (corresponding to training the top linear layer only) (Daniely, 2017; Lee et al., 2018; Matthews et al., 2018) or the Neural Tangent Kernel (corresponding to training all the layers) (Jacot et al., 2018). By exploiting such an interplay, it has been shown that gradient descent on overparameterized neural nets can reach global minima (Jacot et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019b; Zou et al., 2019; Lee et al., 2019), and generalize as well as a kernel method (Li & Liang, 2018; Arora et al., 2019a; Cao & Gu, 2019).

NTK-based and NTK-inspired learning algorithms

Inspired by the connection between neural nets and kernel methods, algorithms for efficiently computing the exact (limiting) GP / NTK kernels have been proposed (Arora et al., 2019b; Lee et al., 2019; Novak et al., 2020; Yang, 2019) and shown to yield state-of-the-art kernel-based algorithms on benchmark learning tasks (Arora et al., 2019b; Li et al., 2019b; Arora et al., 2019c). The connection between neural nets and kernels has further been used in designing algorithms for general machine learning use cases such as multi-task learning (Mu et al., 2020) and protecting against noisy labels (Hu et al., 2020).

Limitations of linearized training

The performance gap between linearized and fully trained networks has been empirically observed in (Arora et al., 2019b; Lee et al., 2019; Chizat et al., 2019). On the theoretical end, the sample complexity gap between linearized training and full training has been shown in (Wei et al., 2019; Ghorbani et al., 2019a; Allen-Zhu & Li, 2019; Yehudai & Shamir, 2019) under specific data distributions and architectures.

Provable training beyond linearization

Allen-Zhu et al. (2019a) and Bai & Lee (2020) show that wide neural nets can couple with quadratic models with provably nice optimization landscapes and better generalization than the NTK, and Bai & Lee (2020) further show the sample complexity benefit of $k$-th order models for all $k$. Li et al. (2019a) show that a large initial learning rate + learning rate decay generalizes better than a small learning rate for learning a two-layer network on a specific toy data distribution.

A parallel line of work studies over-parameterized neural net training in the mean-field limit, in which the training dynamics can be characterized as a PDE over the distribution of the weights (Mei et al., 2018; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; Sirignano & Spiliopoulos, 2018). Unlike the NTK regime, the mean-field regime moves the weights significantly, though the inductive bias (what function training converges to) and the generalization power in this regime are less clear.

8 Conclusion

In this paper, we introduced and studied Taylorized training. We demonstrated experimentally the potential of Taylorized training in understanding full neural network training, by showing its advantage over linearized training in terms of approximation in both the weight and function space, training and test performance, and other empirical properties such as layer movements. We also provided a preliminary theoretical analysis of the approximation power of Taylorized training.

We believe Taylorized training can serve as a useful tool for studying the theory of deep learning and opens many interesting future directions. For example, can we prove the coupling between full and Taylorized training with large learning rates? How well does Taylorized training approximate full training as $k$ approaches infinity? Following up on our layer movement experiments, it would also be interesting to use Taylorized training to study the properties of neural network architectures or initializations.

References

Appendix A Proof of Theorem 6

A.1 Formal statement of Theorem 6

We first collect notation and state our assumptions. Recall that our two-layer neural network is defined as

$$f_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma\big(w_r^\top x\big),$$

and its $k$-th order Taylorized model is

$$f^{(k)}_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sum_{j=0}^{k} \frac{\sigma^{(j)}\big(w_{0,r}^\top x\big)}{j!}\, \big((w_r - w_{0,r})^\top x\big)^j. \qquad (9)$$

Let $X = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ denote the inputs and labels of the training dataset. For any weight matrix $W$ we let $f_W(X) = (f_W(x_1), \dots, f_W(x_n)) \in \mathbb{R}^n$, and similarly $f^{(k)}_W(X) \in \mathbb{R}^n$.

With this notation, the loss functions can be written as $L(W) = \frac{1}{2}\|f_W(X) - y\|_2^2$ and $L^{(k)}(W) = \frac{1}{2}\|f^{(k)}_W(X) - y\|_2^2$, and the training dynamics (full and Taylorized) can be written as

$$\frac{dW_t}{dt} = -\eta\, \nabla_W L(W_t), \qquad \frac{d\bar W^{(k)}_t}{dt} = -\eta\, \nabla_W L^{(k)}\big(\bar W^{(k)}_t\big).$$

We now state our assumptions.

Assumption A.1 (Full-rankness of the analytic NTK). The analytic NTK on the training dataset, defined as

$$K^\infty_{ij} := \mathbb{E}_{w \sim \mathsf{N}(0, I_d)}\big[\sigma'(w^\top x_i)\, \sigma'(w^\top x_j)\, x_i^\top x_j\big], \qquad i, j \in [n],$$

is full rank and satisfies $\lambda_{\min}(K^\infty) \ge \lambda_0$ for some $\lambda_0 > 0$.

Assumption A.2 (Smooth activation). The activation function $\sigma$ is smooth and has a bounded, Lipschitz first derivative: there exists a constant $C_\sigma > 0$ such that $|\sigma'(z)| \le C_\sigma$ and $|\sigma'(z) - \sigma'(z')| \le C_\sigma |z - z'|$ for all $z, z' \in \mathbb{R}$. Further, $\sigma$ has a Lipschitz $k$-th derivative: $\sigma^{(k)}$ is $C_k$-Lipschitz for some constant $C_k > 0$.

Throughout the rest of this section, we assume the above assumptions hold. We are now in a position to formally state our main theorem.

Theorem A.1 (Approximation error of Taylorized training; formal version of Theorem 6). There exists a suitable step-size choice such that the following is true: for any fixed time horizon $T > 0$ and all sufficiently large $m$, with high probability over the random initialization, full training (7) and Taylorized training (8) are coupled in both the parameter space and the function space, in the sense that $\sup_{t \le T} \big\|W_t - \bar W^{(k)}_t\big\|_F$ and $\sup_{t \le T} \big\|f_{W_t}(X) - f^{(k)}_{\bar W^{(k)}_t}(X)\big\|_2$ are bounded by quantities that decay in the width $m$ at a rate improving exponentially with the order $k$.

Remark on extending to entire trajectory

Compared with the existing result on linearized training (Lee et al., 2019, Theorem H.1), our Theorem A.1 only shows the approximation for a fixed time horizon $[0, T]$ rather than the entire trajectory. Technically, this is because the linearized result uses a more careful Gronwall-type argument that relies on the fact that the (linearized) kernel does not change during training, which ceases to hold here. Whether the approximation result for higher-order Taylorized training can be shown for the entire trajectory is a compelling open question.

A.2 Proof of Theorem A.1

Throughout the proof, we let $C > 0$ denote a constant that does not depend on the width $m$, but can depend on other problem parameters and can vary from line to line. We will also denote

$$\sigma^{(\le k)}_{r,x}(z) := \sum_{j=0}^{k} \frac{\sigma^{(j)}\big(w_{0,r}^\top x\big)}{j!}\, \big(z - w_{0,r}^\top x\big)^j, \qquad (10)$$

so that the Taylorized model can essentially be thought of as a two-layer neural network with the (data- and neuron-dependent) activation functions $\sigma^{(\le k)}_{r,x}$.

We first present some known results about the full training trajectory $\{W_t\}_{t \ge 0}$, adapted from (Lee et al., 2019, Appendix G).

Lemma A.2 (Basic properties of full training). Under Assumptions A.1 and A.2, the following hold:

  (a) Local boundedness and Lipschitzness of the Jacobian: for any absolute constant $R > 0$, for sufficiently large $m$, with high probability (over the random initialization $W_0$), the Jacobian $\nabla_W f_W(X)$ is bounded in norm and Lipschitz in $W$, uniformly over all $W \in \mathsf{B}(W_0, R)$, where $\mathsf{B}(W_0, R)$ denotes a Frobenius norm ball of radius $R$ around $W_0$.

  (b) Boundedness of gradient flow: there exists an absolute constant $R_0$ such that, with high probability, for sufficiently large $m$ and a suitable step-size choice (independent of $m$), we have $W_t \in \mathsf{B}(W_0, R_0)$ for all $t \ge 0$.

Lemma A.3 (Properties of Taylorized training). Lemma A.2 also holds if we replace full training with $k$-th order Taylorized training. More concretely:

  (a) Local boundedness and Lipschitzness of the Jacobian: for any absolute constant $R > 0$, for sufficiently large $m$, with high probability (over the random initialization $W_0$), the Jacobian $\nabla_W f^{(k)}_W(X)$ is bounded in norm and Lipschitz in $W$, uniformly over all $W \in \mathsf{B}(W_0, R)$, the Frobenius norm ball of radius $R$ around $W_0$.

  (b) Boundedness of gradient flow: there exists an absolute constant $R_0$ such that, with high probability, for sufficiently large $m$ and a suitable step-size choice (independent of $m$), we have $\bar W^{(k)}_t \in \mathsf{B}(W_0, R_0)$ for all $t \ge 0$.

Proof.
  (a) Rewrite the $k$-th order Taylorized model (9) as

    $$f^{(k)}_W(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma^{(\le k)}_{r,x}\big(w_r^\top x\big), \qquad (11)$$

    where we have used the definition of the "Taylorized" activation function in (10).

    Our goal here is to show that the Jacobian $\nabla_W f^{(k)}_W(X)$ is bounded and Lipschitz, with absolute constants, over the relevant Frobenius norm ball. By Lemma A.2, it suffices to show the same for the Taylorized activations $\sigma^{(\le k)}_{r,x}$, as we already have the result for the original Jacobian $\nabla_W f_W(X)$. We have

    (12)

    Above, (i) uses the $k$-th order smoothness of $\sigma$ (Assumption A.2). This shows the boundedness.

    A similar argument can be done for the Lipschitzness of , where the second-to-last expression is replaced by , from which the same argument goes through as whenever , and for the sum is bounded by .

  (b) This is a direct corollary of part (a): we can view the Taylorized network as an architecture of its own, which has the same NTK as the full network at initialization (so the non-degeneracy of the NTK also holds), and which has a locally bounded, Lipschitz Jacobian. Repeating the argument of (Lee et al., 2019, Theorem G.2) gives the result.

Lemma A.4 (Bounding individual weight movements in full and Taylorized training). Under the same settings as Lemmas A.2 and A.3, we have

(13)

Consequently, we have for all $t \ge 0$ that

Proof.

We first show the bound (13) for full training; the bound for Taylorized training follows similarly. We have

Note that

due to the boundedness of , and by Lemma A.2(b), so we have

integrating which (and noticing the initial condition ) yields that

We now show the second bound, again focusing on the full-training case (the Taylorized case follows similarly). By (12), we have

Taking the square root gives the desired result. ∎

We are now in a position to prove the main theorem.

Proof of Theorem A.1. Step 1. We first bound the rate of change of the gap between the two trajectories. We have

For term I, applying Lemma A.2 yields

For term II, applying the local Lipschitzness of the Jacobian and the fact that the trajectories remain in a Frobenius norm ball around $W_0$ (Lemma A.2) yields that