ResNet owes its great success to surprisingly efficient training compared to the widely used feedforward Convolutional Neural Networks (CNN, krizhevsky2012imagenet). Feedforward CNNs are seldom used with more than 30 layers in the existing literature. Experimental results suggest that very deep feedforward CNNs are significantly slower to train, and yield worse performance than their shallow counterparts (he2016deep). However, simple first order algorithms such as stochastic gradient descent and its variants are able to train ResNet with hundreds of layers, and achieve better performance than the state-of-the-art. For example, ResNet-152 (he2016deep), consisting of 152 layers, achieves a top-1 error on ImageNet. he2016 also demonstrated a more aggressive ResNet-1001 on the CIFAR-10 data set with 1000 layers. It achieves a error, better than shallower ResNets such as ResNet-.
Despite the great success and popularity of ResNet, the reason why it can be efficiently trained is still largely unknown. One line of research empirically studies ResNet and provides intriguing observations. veit2016residual, for example, suggest that ResNet can be viewed as a collection of weakly dependent smaller networks of varying sizes. More interestingly, they reveal that these smaller networks alleviate the vanishing gradient problem. balduzzi2017shattered further elaborate on the vanishing gradient problem. They show that the gradient in ResNet only decays sublinearly in contrast to the exponential decay in feedforward neural networks. Recently, li2018visualizing visualize the landscape of neural networks, and show that the shortcut connection yields a smoother optimization landscape. In spite of this empirical evidence, rigorous theoretical justifications are seriously lacking.
Another line of research theoretically investigates ResNet with simple network architectures. hardt2016identity show that linear ResNet has no spurious local optima (local optima that yield larger objective values than the global optima). Later, li2017convergence study the use of Stochastic Gradient Descent (SGD) to train a two-layer ResNet with only one unknown layer. They show that the optimization landscape has no spurious local optima or saddle points. They also characterize the local convergence of SGD around the global optimum. These results, however, are often considered to be overoptimistic, due to the oversimplified assumptions.
To better understand ResNet, we study a two-layer non-overlapping convolutional neural network, whose optimization landscape contains a spurious local optimum. Such a network was first studied in du2017gradient. Specifically, we consider
where is an input, are the output weight and the convolutional weight, respectively, and
is the element-wise ReLU activation. Since the ReLU activation is positively homogeneous, the weights and can arbitrarily scale with each other. Thus, we impose the assumption to make the neural network identifiable. We further decompose with
being a vector of’s in , and rewrite (1) as
Here represents the average pooling shortcut connection, which allows a direct interaction between the input and the output weight .
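As a minimal illustrative sketch (not part of the original text), the student network with the shortcut connection can be written in a few lines of NumPy. Because the displayed equations are not reproduced here, the sketch assumes the form a^T ReLU(Z^T (1/sqrt(p) + w)), where Z has shape (p, k) and holds k non-overlapping patches of dimension p. The function names and the 1/sqrt(p) scaling of the all-ones vector are assumptions made for illustration.

```python
import numpy as np


def relu(x):
    """Element-wise ReLU activation."""
    return np.maximum(x, 0.0)


def resnet_forward(Z, w, a):
    """Two-layer non-overlapping convolutional ResNet (assumed form).

    Z: (p, k) input, one column per patch.
    w: (p,) convolutional weight (the residual part).
    a: (k,) output weight.
    """
    p = Z.shape[0]
    # Assumption: the average pooling shortcut is the all-ones vector
    # scaled by 1/sqrt(p), so that it has unit Euclidean norm.
    shortcut = np.ones(p) / np.sqrt(p)
    return float(a @ relu(Z.T @ (shortcut + w)))
```

With w set to zero, only the shortcut acts on the input, so each hidden unit is a scaled average of one patch passed through the ReLU.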
We investigate the convergence of training ResNet by considering a realizable case. Specifically, the training data is generated from a teacher network with true parameters , with . We aim to recover the teacher neural network using a student network defined in (2) by solving an optimization problem:
where is independent Gaussian input. Although largely simplified, (3) is nonconvex and possesses a nuisance: there exists a spurious local optimum (see an explicit characterization in Section 2). Early work (du2017gradient) shows that when the student network has the same architecture as the teacher network, GD with random initialization can be trapped in a spurious local optimum with a constant probability. (Footnote 1: The probability is bounded between and . Numerical experiments show that this probability can be as bad as with the worst configuration of .) A natural question here is
Does the shortcut connection ease the training?
This paper suggests a positive answer: When initialized with and arbitrarily in a ball, GD with proper normalization converges to a global optimum of (3) in polynomial time, under the assumption that is close to . Such an assumption requires that there exists a of relatively small magnitude, such that . This assumption is supported by both empirical and theoretical evidence. Specifically, the experiments in li2016demystifying and yu2018learning show that the weight in well-trained deep ResNet has a small magnitude, and the weight for each layer has vanishing norm as the depth tends to infinity. hardt2016identity suggest that, when using linear ResNet to approximate linear transformations, the norm of the weight in each layer scales as with being the depth. bartlett2018representing further show that deep nonlinear ResNet, with the norm of the weight of order , is sufficient to express differentiable functions under certain regularity conditions. These results motivate us to assume is relatively small.
Our analysis shows that the convergence of GD exhibits two stages. Specifically, our initialization guarantees is sufficiently far away from the spurious local optimum. In the first stage, with proper step sizes, we show that the shortcut connection helps the algorithm avoid being attracted by the spurious local optimum. Meanwhile, the shortcut connection guides the algorithm to evolve towards a global optimum. In the second stage, the algorithm enters the basin of attraction of the global optimum. With properly chosen step sizes, and jointly converge to the global optimum.
Our analysis thus explains why ResNet benefits training, when the weights are simply initialized at zero (li2016demystifying), or using the Fixup initialization in zhang2019fixup. We remark that our choice of step sizes is also related to learning rate warmup (goyal2017accurate), and other learning rate schemes for more efficient training of neural networks (smith2017cyclical; smith2018super). We refer readers to Section 5 for a more detailed discussion.
Notations: Given a vector , we denote the Euclidean norm . Given two vectors , we denote the angle between them as , and the inner product as . We denote as the vector of all the entries being . We also denote as the Euclidean ball centered at with radius .
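For concreteness, the angle between two vectors used throughout the analysis can be computed as below. This is a small helper sketch under the standard definition (arccosine of the normalized inner product); the function name is an illustrative choice, and both inputs are assumed to be nonzero.

```python
import numpy as np


def angle(u, v):
    """Angle in radians between two nonzero vectors u and v."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to [-1, 1] to guard against floating-point round-off
    # before taking the arccosine.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

An angle below pi/2 corresponds to a positive inner product, which is the sense in which an angle is called acute later in the analysis.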
2 Model and Algorithm
We consider the realizable setting where the label is generated from a noiseless teacher network in the following form
Here ’s are the true convolutional weight, true output weight, and input. denotes the element-wise ReLU activation.
where , , and for all . We assume the input data ’s are independently and identically sampled from . Note that the above network is not identifiable, because of the positive homogeneity of the ReLU function, that is
and can scale with each other by any positive constant without changing the output value. Thus, to achieve identifiability, instead of (4), we propose to train the following student network,
Recall that we assume . One can easily verify that (6) has global optima and spurious local optima. The characterization is analogous to du2017gradient, although the objective is different. For any constant is a global optimum of (6), if and is a spurious local optimum of (6), if and The proof is adapted from du2017gradient, and the details are provided in Appendix B.1.
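The population objective can be approximated numerically by Monte Carlo sampling. The sketch below is illustrative only and rests on several assumptions that the text here does not display explicitly: a squared loss with a factor of 1/2, standard Gaussian inputs, a teacher of the form a*^T ReLU(Z^T v*), and a student of the form a^T ReLU(Z^T (1/sqrt(p) + w)). All function and argument names are hypothetical.

```python
import numpy as np


def mc_loss(w, a, v_star, a_star, n_samples=2000, seed=0):
    """Monte Carlo estimate of the (assumed) squared-loss objective.

    w: (p,) student convolutional weight, a: (k,) student output weight.
    v_star: (p,) teacher convolutional weight, a_star: (k,) teacher output weight.
    """
    rng = np.random.default_rng(seed)
    p, k = w.shape[0], a.shape[0]
    # Assumption: each input Z is a (p, k) matrix with i.i.d. N(0, 1) entries.
    Z = rng.standard_normal((n_samples, p, k))
    shortcut = np.ones(p) / np.sqrt(p)
    # Pre-activations Z^T x for every sample, shape (n_samples, k).
    student_pre = np.einsum("npk,p->nk", Z, shortcut + w)
    teacher_pre = np.einsum("npk,p->nk", Z, v_star)
    student = np.maximum(student_pre, 0.0) @ a
    teacher = np.maximum(teacher_pre, 0.0) @ a_star
    return float(0.5 * np.mean((student - teacher) ** 2))
```

When the student matches the teacher exactly (shortcut plus w equal to v*, and a equal to a*), the estimate is zero up to round-off, which gives a quick sanity check of the realizable setting.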
Now we formalize the assumption on in Section 1, which is supported by the theoretical and empirical evidence in li2016demystifying; yu2018learning; hardt2016identity; bartlett2018representing. [Shortcut Prior] There exists a with such that Assumption 1 implies . We remark that our analysis actually applies to any satisfying for any positive constant . Here we consider to ease the presentation. Throughout the rest of the paper, we assume this assumption holds true.
GD with Normalization.
We solve the optimization problem (6) by gradient descent. Specifically, at the -th iteration, we compute
This normalization can be viewed as a population version of the widely used batch normalization trick to accelerate the training of neural networks (ioffe2015batch). Moreover, (6) has a unique optimal solution under such a normalization. Specifically, is the unique global optimum, and is the only spurious local optimum along the solution path, where and .
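One iteration of GD with normalization can be sketched as follows. The exact update is not displayed in the text here, so this sketch assumes a plain gradient step on the output weight and a gradient step on the convolutional weight followed by a rescaling that keeps the shortcut plus the convolutional weight at unit Euclidean norm. The gradients are supplied by the caller, and all names are hypothetical.

```python
import numpy as np


def normalized_gd_step(w, a, grad_w, grad_a, eta_w, eta_a):
    """One (assumed) GD step with normalization on the convolutional weight.

    w, grad_w: (p,) arrays; a, grad_a: (k,) arrays.
    eta_w, eta_a: step sizes for w and a, which may differ.
    """
    p = w.shape[0]
    shortcut = np.ones(p) / np.sqrt(p)
    # Plain gradient step on the output weight.
    a_next = a - eta_a * grad_a
    # Gradient step on w, then rescale so that shortcut + w has unit norm.
    # (Assumes the intermediate vector is nonzero.)
    full = shortcut + (w - eta_w * grad_w)
    w_next = full / np.linalg.norm(full) - shortcut
    return w_next, a_next
```

Using separate step sizes for the two weights is what allows a conservative update of the convolutional weight in the first stage and a larger one afterwards.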
We initialize our algorithm at satisfying: and We set with a magnitude of to match common initialization techniques (glorot2010understanding; lecun2012efficient; he2015delving). We highlight that our algorithm starts with an arbitrary initialization on , which is different from random initialization. The step sizes and will be specified later in our analysis.
3 Convergence Analysis
We characterize the algorithmic behavior of the gradient descent algorithm. Our analysis shows that under Assumption 1, the convergence of GD exhibits two stages. In the first stage, the algorithm avoids being trapped by the spurious local optimum. In the second stage, once the algorithm is sufficiently far away from the spurious local optimum, it enters the basin of attraction of the global optimum and finally converges to it.
To present our main result, we begin with some notations. Denote
as the angle between and the ground truth at the -th iteration. Throughout the rest of the paper, we assume is a constant. The notation hides , and factors. Then we state the convergence of GD in the following theorem. [Main Results] Let the GD algorithm defined in Section 2 be initialized with and arbitrary Then the algorithm converges in two stages:
Stage I: Avoid the spurious local optimum (Theorem 3.1): We choose
Then there exists , such that
hold for some constants .
Stage II: Converge to the global optimum (Theorem 3.2): After iterations, we restart the counter, and choose
Then for any , any , we have
Note that the set belongs to the basin of attraction around the global optimum (Lemma 3.2), where a certain regularity condition (partial dissipativity) guides the algorithm toward the global optimum. Hence, after the algorithm enters the second stage, we increase the step size of for a faster convergence. Figure 2 demonstrates the initialization of , and the convergence of GD both on CNN in du2017gradient and our ResNet model.
We start our convergence analysis with the definition of partial dissipativity for . [Partial Dissipativity] Given any and a constant , is -partially dissipative with respect to in a set , if for every we have
is -partially dissipative with respect to in a set , if for every we have
Moreover, if is -jointly dissipative with respect to in , i.e., for every we have
The concept of dissipativity is originally used in dynamical systems (barrera2015thermalisation), and is defined for general operators. It suffices to instantiate the concept to gradients here for our convergence analysis. The variational coherence studied in zhou2017stochastic and one point convexity studied in li2017convergence can be viewed as special examples of partial dissipativity.
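To make the definition concrete, the sketch below checks a dissipativity-type inequality numerically at a single point. The displayed inequality is not reproduced in the text here, so the sketch assumes the common form: the inner product of the negative partial gradient with (x* - x) is at least c times the squared distance to x*, minus a slack delta. The function name, argument names, and the toy quadratic objective are all hypothetical.

```python
import numpy as np


def dissipativity_gap(grad, x, x_star, c, delta=0.0):
    """Return lhs - rhs of the assumed dissipativity inequality.

    lhs = <-grad, x_star - x>
    rhs = c * ||x - x_star||^2 - delta
    A nonnegative return value means the inequality holds at x.
    """
    diff = x - x_star
    lhs = float(np.dot(-grad, x_star - x))
    rhs = c * float(np.dot(diff, diff)) - delta
    return lhs - rhs


# Toy example: L(x) = 0.5 * ||x - x_star||^2 has gradient x - x_star,
# so the inequality holds with c = 1 and delta = 0 (with equality).
x = np.array([1.0, 2.0])
x_star = np.zeros(2)
gap = dissipativity_gap(x - x_star, x, x_star, c=1.0)
```

Under such an inequality, a gradient step with a small enough step size cannot increase the distance to x* by more than the slack allows, which is why it suffices for the convergence argument.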
3.1 Stage I: Avoid the Spurious Local Optimum
We first show that, with properly chosen step sizes, the GD algorithm can avoid being trapped by the spurious local optimum. We propose to update using different step sizes. We formalize our result in the following theorem. Initialize with arbitrary . We choose step sizes
for some constant Then, we have
for all , where
Due to the space limit, we only provide a proof sketch here. The detailed proof is deferred to Appendix B.2. We prove the two arguments in (8) in order. Before that, we first show our initialization scheme guarantees an important bound on as stated in the following lemma. Given we choose
Then for any ,
Under the shortcut prior (Assumption 1) that is close to , the update of should be more conservative to provide enough accuracy for to make progress. Based on Lemma 3.1, the next lemma shows that when is small enough, stays acute (), i.e., is sufficiently far away from . Given we choose
for some absolute constant . Then for all ,
Given we choose
Then there exists such that
One can easily verify that holds for any Together with Lemma 3.1, we claim that even with arbitrary initialization, the iterates can always enter the region with positive and bounded in polynomial time. The next lemma shows that with properly chosen step sizes, stays positive and bounded. Suppose , and holds for all . Choose
then we have for all
Take and we complete the proof. ∎
In Theorem 3.1, we choose a conservative . This brings two benefits to the training process: (1) stays away from . The update on is quite limited, since is small. Hence, is kept sufficiently far away from , even if moves towards in every iteration; (2) continuously updates toward
Theorem 3.1 ensures that under the shortcut prior, GD with adaptive step sizes can successfully overcome the optimization challenge early in training, i.e., the iterate is sufficiently far away from the spurious local optimum at the end of Stage I. Meanwhile, (8) actually demonstrates that the algorithm enters the basin of attraction of the global optimum, and we next show the convergence of GD.
3.2 Stage II: Converge to the Global Optimum
Recall that in the previous stage, we use a conservative step size to avoid being trapped by the spurious local optimum. However, the small step size slows down the convergence of in the basin of attraction of the global optimum. Now we choose larger step sizes to accelerate the convergence. The following theorem shows that, after Stage I, we can use a larger while the results in Theorem 3.1 still hold, i.e., the iterate stays in the basin of attraction of . We restart the counter of time. Suppose and We choose
Then for all , we have
To prove the first argument, we need the partial dissipativity of . For any , satisfies
for any , where
This condition ensures that when is positive, always makes positive progress towards or equivalently decreasing. We need not worry about getting obtuse, and thus a larger step size can be adopted. The second argument can be proved following similar lines to Lemma 3.1. Please see Appendix B.3.2 for more details. ∎
Now we are ready to show the convergence of our GD algorithm. Note that Theorem 3.2 and Lemma 3.2 together show that the iterate stays in the partially dissipative region which leads to the convergence of Moreover, as shown in the following lemma, when is accurate enough, the partial gradient with respect to enjoys partial dissipativity. For any satisfies
for any , where
As a direct result, converges to . The next theorem formalizes the above discussion.
[Convergence] Suppose hold for all For any choose
then we have
Note that the partial dissipative region depends on the precision of Thus, we first show the convergence of [Convergence of ] Suppose hold for all For any choose
then we have
for any Lemma 3.2 implies that after iterations, the algorithm enters Then we show the convergence property of in the next lemma. [Convergence of ] Suppose and hold for all We choose
Then for all we have
Combining the above two lemmas and taking , we complete the proof. ∎
Theorem 3.2 shows that with larger than in Stage I, GD converges to the global optimum in polynomial time. Compared to the convergence with constant probability for CNN (du2017gradient), Assumption 1 assures convergence even under arbitrary initialization of . This partially justifies the importance of the shortcut connection in ResNet.
4 Numerical Experiment
We present numerical experiments to illustrate the convergence of the GD algorithm. We first demonstrate that with the shortcut prior, our choice of step sizes and the initialization guarantee the convergence of GD. We consider the training of a two-layer non-overlapping convolutional ResNet by solving (6). Specifically, we set and . The teacher network is set with parameters satisfying , and satisfying , , and for . (Footnote 2: essentially satisfies .) A more detailed experimental setting is provided in Appendix C. We initialize with and uniformly distributed over . We adopt the following learning rate scheme with Step Size Warmup (SSW) suggested in Section 3: We first choose step sizes and , and run for iterations. Then, we choose . We also consider learning the same teacher network using step sizes throughout, i.e., without step size warmup.
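The two-stage step size scheme can be expressed as a simple schedule. The sketch below is illustrative only: the numerical step sizes and the warmup length are hypothetical placeholders and are not the values used in the experiments, and the function name is an assumption.

```python
def ssw_step_sizes(t, warmup_iters, warmup, main):
    """Step Size Warmup (SSW) schedule, as a minimal sketch.

    t: current iteration (0-indexed).
    warmup_iters: number of Stage I iterations.
    warmup: (eta_w, eta_a) used during Stage I.
    main: (eta_w, eta_a) used after the warmup.
    """
    return warmup if t < warmup_iters else main


# Hypothetical placeholder values: a small step size for the
# convolutional weight during warmup, and a larger one afterwards.
WARMUP = (1e-3, 1e-1)
MAIN = (1e-1, 1e-1)
schedule = [ssw_step_sizes(t, 100, WARMUP, MAIN) for t in range(200)]
```

Running without SSW corresponds to passing the same pair of step sizes for both stages.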
We further demonstrate learning the aforementioned teacher network using a student network of the same architecture. Specifically, we keep unchanged. We use the GD in du2017gradient with step size , and initialize uniformly distributed over the unit sphere and uniformly distributed over .
For each combination of and , we repeat simulations for the three aforementioned settings, and report the success rate of converging to the global optimum in Table 1. As can be seen, our GD on ResNet is capable of avoiding the spurious local optimum, and converges to the global optimum in all simulations. However, GD without SSW can be trapped in the spurious local optimum. The failure probability diminishes as the dimension increases. Learning the teacher network using a two-layer CNN student network (du2017gradient) can also be trapped in the spurious local optimum.
Table 1: Success rates of converging to the global optimum.
| ResNet w/ SSW | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| ResNet w/o SSW | 0.7042 | 0.7354 | 0.7776 | 0.7848 | 0.8220 | 0.8388 | 0.8426 |
We then demonstrate the algorithmic behavior of our GD. We set for the teacher network, and other parameters the same as in the previous experiment. We initialize and . We start with and . After iterations, we set the step sizes . The algorithm is terminated when . We also demonstrate the GD algorithm without SSW at the same initialization. The step sizes are throughout the training.
One solution path of GD with SSW is shown in the first column of Figure 3. As can be seen, the algorithm has a phase transition. In the first stage, we observe that makes very slow progress due to the small step size , while gradually increases. This implies the algorithm avoids being attracted by the spurious local optimum. In the second stage, and both continuously evolve towards the global optimum.
The second row of Figure 3 illustrates the trajectory of GD without SSW being trapped by the spurious local optimum. Specifically, converges to as we observe that converges to , and converges to .
Deep ResNet. Our two-layer network model is largely simplified compared with deep and wide ResNets in practice, where the role of the shortcut connection is more complicated. It is worth mentioning that the empirical results in veit2016residual show that ResNet can be viewed as an ensemble of smaller networks, and most of the smaller networks are shallow due to the shortcut connection. They also suggest that the training is dominated by the shallow smaller networks. We are interested in investigating whether these shallow smaller networks possess benign properties similar to those of our two-layer model that ease the training.
Moreover, our student network and the teacher network have the same degrees of freedom. We have not considered deeper and wider student networks. The role of shortcut connections in deeper and wider networks is also worth investigating.
From GD to SGD. A straightforward extension is to investigate the convergence of SGD with mini-batch. We remark that when the batch size is large, the effect of the noise on gradient is limited and SGD mimics the behavior of GD. When the batch size is small, the noise on gradient plays a significant role in training, which is technically more challenging.
Related Work. li2017convergence study ResNet-type two-layer neural networks with the output weight known (), which is equivalent to assuming for all in our analysis. Thus, their analysis does not have Stage I (). Moreover, since they do not need to optimize , they only need to handle the partial dissipativity of with (one-point convexity). In our analysis, however, we also need to handle the partial dissipativity of with , which makes our proof more involved.
Initialization. Our analysis shows that GD converges to the global optimum, when is initialized at zero. Empirical results in li2016demystifying and zhang2019fixup also suggest that deep ResNet works well, when the weights are simply initialized at zero or using the Fixup initialization. We are interested in building a connection between training a two-layer ResNet and its deep counterpart.
Step Size Warmup. Our choice of step size is related to the learning rate warmup and layerwise learning rate in the existing literature. Specifically, goyal2017accurate present an effective learning rate scheme for training ResNet on ImageNet in less than 1 hour. They start with a small step size, gradually increase it (linear scale), and finally shrink it for convergence. Our analysis suggests that in the first stage, we need smaller to avoid being attracted by the spurious local optimum. This is essentially consistent with goyal2017accurate. Note that we are considering GD (no noise); hence, we do not need to shrink the step size in the final stage, whereas goyal2017accurate need to shrink the step size to control the noise in SGD. Similar learning rate schemes are proposed by smith2017cyclical.
On the other hand, we incorporate the shortcut prior, and adopt a smaller step size for the inner layer, and a larger step size for the outer layer. Such a choice of step size is shown to be helpful in both deep learning and transfer learning (singh2015layer; howard2018universal), where it is referred to as differential learning rates or discriminative fine-tuning. It would be interesting to build a connection between our theoretical discoveries and these empirical observations.
Appendix A Preliminaries
We first provide the explicit forms of the loss function and its gradients with respect to and . Let . When , the loss function and the gradient w.r.t. , i.e., and , have the following analytic forms.
where This proposition is a simple extension of Theorem 3.1 in du2017gradient. Here, we omit the proof.
For notational simplicity, we denote in the proofs that follow.
Appendix B Proof of Theoretical Results
B.1 Proof of Proposition b.1
Recall that du2017gradient prove that is the spurious local optimum of the CNN counterpart to our ResNet. Substituting by , we obtain the result. ∎
B.2 Proof of Theorem 3.1
B.2.1 Proof of Lemma 3.1
By simple manipulation, we know that the initialization of satisfies We first prove the right side of the inequality. Expand as and we have
Subtract from both sides, then we get
for any The right side inequality is proved.
The proof of the left side follows similar lines. Since we have
which is equivalent to the following inequality.