Towards Understanding the Importance of Shortcut Connections in Residual Networks

09/10/2019 · Tianyi Liu et al. · Georgia Institute of Technology

Residual Network (ResNet) is undoubtedly a milestone in deep learning. ResNet is equipped with shortcut connections between layers, and exhibits efficient training using simple first-order algorithms. Despite this great empirical success, the reason behind it is far from being well understood. In this paper, we study a two-layer non-overlapping convolutional ResNet. Training such a network requires solving a non-convex optimization problem with a spurious local optimum. We show, however, that gradient descent combined with proper normalization avoids being trapped by the spurious local optimum, and converges to a global optimum in polynomial time, when the weight of the first layer is initialized at 0, and that of the second layer is initialized arbitrarily in a ball. Numerical experiments are provided to support our theory.


1 Introduction

Neural networks have revolutionized a variety of real-world applications in the past few years, such as computer vision (krizhevsky2012imagenet; goodfellow2014generative; Long_2015_CVPR) and natural language processing (graves2013speech; bahdanau2014neural; young2018recent). Among different types of networks, the Residual Network (ResNet, he2016deep) is undoubtedly a milestone. ResNet is equipped with shortcut connections, which skip layers in the forward step of an input. A similar idea also appears in Highway Networks (srivastava2015training), and further inspires densely connected convolutional networks (huang2017densely).

ResNet owes its great success to surprisingly efficient training compared with the widely used feedforward Convolutional Neural Networks (CNNs, krizhevsky2012imagenet). Feedforward CNNs are seldom used with more than 30 layers in the existing literature, and experimental results suggest that very deep feedforward CNNs are significantly slower to train and yield worse performance than their shallow counterparts (he2016deep). In contrast, simple first-order algorithms such as stochastic gradient descent and its variants are able to train ResNets with hundreds of layers and achieve better performance than the previous state of the art. For example, ResNet-152 (he2016deep), consisting of 152 layers, attains state-of-the-art top-1 error on ImageNet. he2016 also demonstrate an even more aggressive ResNet-1001 with over 1000 layers on the CIFAR-10 data set, which achieves a lower test error than shallower ResNets.

Despite the great success and popularity of ResNet, the reason why it can be efficiently trained is still largely unknown. One line of research empirically studies ResNet and provides intriguing observations. veit2016residual, for example, suggest that ResNet can be viewed as a collection of weakly dependent smaller networks of varying sizes. More interestingly, they reveal that these smaller networks alleviate the vanishing gradient problem. balduzzi2017shattered further elaborate on the vanishing gradient problem: they show that the gradient in ResNet only decays sublinearly, in contrast to the exponential decay in feedforward neural networks. Recently, li2018visualizing visualize the landscape of neural networks, and show that the shortcut connection yields a smoother optimization landscape. In spite of this empirical evidence, rigorous theoretical justifications are still seriously lacking.

Another line of research theoretically investigates ResNet with simple network architectures. hardt2016identity show that linear ResNets have no spurious local optima (local optima that yield larger objective values than the global optima). Later, li2017convergence study using Stochastic Gradient Descent (SGD) to train a two-layer ResNet with only one unknown layer. They show that the optimization landscape has no spurious local optima or saddle points, and they characterize the local convergence of SGD around the global optimum. These results, however, are often considered overoptimistic due to their oversimplified assumptions.

To better understand ResNet, we study a two-layer non-overlapping convolutional neural network whose optimization landscape contains a spurious local optimum. Such a network was first studied in du2017gradient. Specifically, we consider

(1)

where the input is split into non-overlapping patches, a and w are the output weight and the convolutional weight, respectively, and σ(·) is the element-wise ReLU activation. Since the ReLU activation is positive homogeneous, the weights a and w can scale arbitrarily with each other. Thus, we impose a norm constraint to make the neural network identifiable. We further decompose the weight into the all-ones vector 1 plus a residual v, and rewrite (1) as

(2)

The additional term in (2) represents the average pooling shortcut connection, which allows the input to interact directly with the output weight.
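Since the displayed equations (1) and (2) did not survive extraction, the snippet below is only a rough NumPy sketch of the ingredient both equations build on: a single convolutional filter w applied to k non-overlapping patches, a ReLU, and an output weight a. The 1/k average-pooling scaling and the exact placement of the shortcut term are assumptions, not the paper's definitions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_conv(x, w, a):
    """Generic two-layer non-overlapping convolutional network (illustrative).

    x : (k * p,) input, viewed as k non-overlapping patches of dimension p
    w : (p,) convolutional weight shared across patches
    a : (k,) output weight
    """
    k = a.shape[0]
    patches = x.reshape(k, -1)      # (k, p) non-overlapping patches
    hidden = relu(patches @ w)      # one hidden unit per patch
    return a @ hidden / k           # assumed average-pooling style 1/k scaling
```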

We investigate the convergence of training ResNet in a realizable setting. Specifically, the training data are generated from a teacher network with true parameters w* and a*, where w* is the true convolutional weight and a* the true output weight. We aim to recover the teacher network using a student network of the form (2) by solving an optimization problem:

(3)

where the input is independent Gaussian. Although largely simplified, (3) is nonconvex and possesses a nuisance: there exists a spurious local optimum (see an explicit characterization in Section 2). Earlier work (du2017gradient) shows that when the student network has the same architecture as the teacher network, GD with random initialization can be trapped in the spurious local optimum with a constant probability (the probability of being trapped is bounded between two absolute constants, and numerical experiments show it can be considerably larger under the worst configuration of the teacher network). A natural question here is:

Does the shortcut connection ease the training?

This paper suggests a positive answer: when the weight of the first layer is initialized at zero and that of the second layer arbitrarily in a ball, GD with proper normalization converges to a global optimum of (3) in polynomial time, under the assumption that the teacher network is close to the shortcut. Such an assumption requires that there exists a residual v* of relatively small magnitude such that the teacher weight equals the shortcut component plus v*. This assumption is supported by both empirical and theoretical evidence. Specifically, the experiments in li2016demystifying and yu2018learning show that the weights in well-trained deep ResNets have small magnitudes, and that the per-layer weight norm vanishes as the depth tends to infinity. hardt2016identity suggest that, when using a linear ResNet to approximate linear transformations, the norm of the weight in each layer scales inversely with the depth. bartlett2018representing further show that deep nonlinear ResNets with per-layer weight norms of the same order are sufficient to express differentiable functions under certain regularity conditions. These results motivate us to assume v* is relatively small.

Our analysis shows that the convergence of GD exhibits two stages. Specifically, our initialization guarantees that the iterate starts sufficiently far away from the spurious local optimum. In the first stage, with proper step sizes, the shortcut connection helps the algorithm avoid being attracted by the spurious local optimum, and meanwhile guides the algorithm to evolve toward a global optimum. In the second stage, the algorithm enters the basin of attraction of the global optimum, and with properly chosen step sizes the two layers jointly converge to the global optimum.

Our analysis thus explains why ResNet benefits training when the weights are simply initialized at zero (li2016demystifying) or with the Fixup initialization of zhang2019fixup. We remark that our choice of step sizes is also related to learning rate warmup (goyal2017accurate) and to other learning rate schemes for more efficient training of neural networks (smith2017cyclical; smith2018super). We refer readers to Section 5 for a more detailed discussion.

Notations: Given a vector v, we denote its Euclidean norm by ‖v‖. Given two vectors v and w, we denote the angle between them by ∠(v, w) and their inner product by ⟨v, w⟩. We denote by 1 the vector with all entries equal to 1, and by B(c, r) the Euclidean ball centered at c with radius r.
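For concreteness, these notations translate directly into code. The helper below is purely illustrative and not part of the paper.

```python
import numpy as np

def angle(u, w):
    """Angle between two nonzero vectors, in [0, pi]."""
    cos = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def in_ball(x, center, radius):
    """Membership in the Euclidean ball B(center, radius)."""
    return np.linalg.norm(x - center) <= radius
```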

2 Model and Algorithm

We consider the realizable setting where the label is generated from a noiseless teacher network of the following form:

Here w*, a*, and x denote the true convolutional weight, the true output weight, and the input, respectively, and σ(·) denotes the element-wise ReLU activation.
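As a hedged sketch of the realizable setting (the teacher's exact output scaling and the unit-norm convention for w* are assumptions), labels can be generated as follows; teacher_labels mirrors the generic two-layer forward pass sketched in Section 1.

```python
import numpy as np

def teacher_labels(X, w_star, a_star):
    """Noiseless labels from the teacher network (the 1/k scaling is assumed).

    X      : (n, k * p) Gaussian inputs, each row split into k patches
    w_star : (p,) true convolutional weight
    a_star : (k,) true output weight
    """
    n, k = X.shape[0], a_star.shape[0]
    patches = X.reshape(n, k, -1)                  # (n, k, p)
    hidden = np.maximum(patches @ w_star, 0.0)     # (n, k) ReLU features
    return hidden @ a_star / k                     # (n,) labels

rng = np.random.default_rng(0)
p, k, n = 9, 4, 1000
w_star = rng.standard_normal(p)
w_star /= np.linalg.norm(w_star)                  # assumed unit norm for identifiability
a_star = rng.standard_normal(k)                   # true output weight (drawn arbitrarily here)
X = rng.standard_normal((n, k * p))               # inputs sampled from a standard Gaussian
y = teacher_labels(X, w_star, a_star)
```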

Our student network is defined in (2). For notational convenience, we expand the second layer and rewrite (2) as

(4)

where the input is split into non-overlapping patches as before. We assume the input data are independently and identically sampled from a standard Gaussian distribution. Note that the above network is not identifiable: by the positive homogeneity of the ReLU activation, i.e., σ(cz) = c σ(z) for any c > 0, the two layers can scale with each other by any positive constant without changing the output value. Thus, to achieve identifiability, instead of (4), we propose to train the following normalized student network:

(5)

Figure 1: The non-overlapping two-layer residual network with a normalization layer.

An illustration of (5) is provided in Figure 1. We then recover the teacher network by solving the nonconvex optimization problem

(6)

Recall the norm constraint imposed on the teacher network. One can verify that (6) has both global optima and spurious local optima. The characterization is analogous to that in du2017gradient, although the objective is different; the proof is adapted from du2017gradient, and the details are provided in Appendix B.1.
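In simulations, the population loss in (6) is approximated by a sample average over Gaussian inputs. The sketch below assumes a squared-error objective with a 1/2 factor, which the extraction does not preserve explicitly, and treats the student forward pass as a black box.

```python
import numpy as np

def empirical_loss(forward, params, X, y):
    """Average squared error of a student `forward(x, params)` against labels y.

    The 1/2 factor and the exact student architecture are assumptions; the
    population loss is approximated by a Monte Carlo average over the rows of X.
    """
    preds = np.array([forward(x, params) for x in X])
    return 0.5 * np.mean((preds - y) ** 2)
```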

Now we formalize the assumption from Section 1, which is supported by the theoretical and empirical evidence in li2016demystifying; yu2018learning; hardt2016identity; bartlett2018representing. [Shortcut Prior] There exists a residual v* of small norm such that the teacher weight equals the shortcut component plus v*. We remark that our analysis actually applies to any v* whose norm is bounded by an arbitrary positive constant; we fix a small constant bound only to ease the presentation. Throughout the rest of the paper, we assume this assumption holds.

GD with Normalization.

We solve the optimization problem (6) by gradient descent. Specifically, at the t-th iteration, we compute

(7)

Note that we normalize the convolutional weight in (7), which essentially guarantees that it stays on the unit sphere. As the input is sampled from a standard Gaussian distribution, the pre-activations are then standardized, so the normalization step in (7) can be viewed as a population version of the widely used batch normalization trick for accelerating the training of neural networks (ioffe2015batch). Moreover, (6) has a unique optimal solution under such a normalization: the teacher's parameters form the unique global optimum, and there is only one spurious local optimum along the solution path.
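One natural reading of the normalized update (7) is a plain gradient step on each block of parameters followed by renormalizing the convolutional weight to the unit sphere. Which weight is normalized, and the use of finite-difference gradients of an empirical loss instead of the paper's analytic gradients, are assumptions made for this sketch.

```python
import numpy as np

def numerical_grad(loss, theta, eps=1e-5):
    """Central-difference gradient of a scalar function `loss` at `theta`."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return grad

def gd_step(loss_w, loss_v, w, v, eta_w, eta_v):
    """One GD iteration with renormalization of w (an assumed reading of (7)).

    loss_w : callable w -> empirical loss with the other block held fixed
    loss_v : callable v -> empirical loss with the other block held fixed
    """
    w_new = w - eta_w * numerical_grad(loss_w, w)
    w_new /= np.linalg.norm(w_new)     # keep the convolutional weight on the unit sphere
    v_new = v - eta_v * numerical_grad(loss_v, v)
    return w_new, v_new
```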

Figure 2: The left panel shows that random initialization of a feedforward CNN can be trapped in the spurious local optimum with constant probability (du2017gradient). The right panel demonstrates: 1) under the shortcut prior, our initialization avoids starting near the spurious local optimum; 2) the convergence of GD exhibits two stages (I: one layer improves while the algorithm avoids being attracted by the spurious local optimum; II: joint convergence).

We initialize our algorithm with the residual weight at zero and the other layer arbitrary, with its magnitude set to match common initialization techniques (glorot2010understanding; lecun2012efficient; he2015delving). We highlight that our algorithm starts with an arbitrary initialization of that layer, which is different from random initialization. The step sizes of the two layers will be specified later in our analysis.

3 Convergence Analysis

We characterize the algorithmic behavior of the gradient descent algorithm. Our analysis shows that under Assumption 1, the convergence of GD exhibits two stages. In the first stage, the algorithm avoids being trapped by the spurious local optimum. Once it is sufficiently far away from the spurious local optimum, the algorithm enters the basin of attraction of the global optimum and finally converges to it.

To present our main result, we begin with some notation. Denote by θ_t the angle between the iterate and the ground truth at the t-th iteration. Throughout the rest of the paper, we treat the network dimensions as constants, and the big-O notation hides constant and logarithmic factors. We then state the convergence of GD in the following theorem. [Main Results] Let the GD algorithm defined in Section 2 be initialized with the residual weight at zero and the other layer arbitrary. Then the algorithm converges in two stages:

Stage I: Avoid the spurious local optimum (Theorem 3.1): With conservative step sizes, there exists an iteration by the end of which the iterate is bounded away from the spurious local optimum, with the bounds holding for some absolute constants.

Stage II: Converge to the global optimum (Theorem 3.2): After Stage I, we restart the iteration counter and choose larger step sizes. Then, for any target accuracy, the iterate reaches that accuracy of the global optimum after polynomially many further iterations.

Note that the set described above belongs to the basin of attraction of the global optimum (Lemma 3.2), where a regularity condition (partial dissipativity) guides the algorithm toward the global optimum. Hence, after the algorithm enters the second stage, we increase the step size for faster convergence. Figure 2 illustrates the initialization and the convergence of GD, both on the CNN of du2017gradient and on our ResNet model.
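The two-stage schedule can be summarized in a few lines of code. The step-size constants below are placeholders rather than the values in Theorems 3.1 and 3.2, and the assumption that the conservatively updated block in Stage I is the residual weight v follows the discussion above; step denotes any single-iteration update, e.g. a closure wrapping the gd_step sketch in Section 2.

```python
def train_two_stage(step, w0, v0, stage1_iters, total_iters,
                    eta_w=1e-2, eta_v_small=1e-4, eta_v_large=1e-2):
    """Two-stage GD schedule (all step sizes are illustrative placeholders).

    step : callable (w, v, eta_w, eta_v) -> (w, v) performing one GD iteration.
    Stage I uses the conservative step size eta_v_small; Stage II enlarges it.
    """
    w, v = w0, v0
    for t in range(total_iters):
        eta_v = eta_v_small if t < stage1_iters else eta_v_large
        w, v = step(w, v, eta_w, eta_v)
    return w, v
```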

We start our convergence analysis with the definition of partial dissipativity. [Partial Dissipativity] Given a target point x* and a constant c > 0, the partial gradient ∇_x f is c-partially dissipative with respect to x* in a set A if for every (x, y) in A we have ⟨∇_x f(x, y), x − x*⟩ ≥ c‖x − x*‖². Partial dissipativity with respect to the other block of variables, and joint dissipativity with respect to both blocks, are defined analogously.

The concept of dissipativity originates in the study of dynamical systems (barrera2015thermalisation) and is defined for general operators; it suffices to instantiate the concept for gradients in our convergence analysis. The variational coherence studied in zhou2017stochastic and the one-point convexity studied in li2017convergence can be viewed as special cases of partial dissipativity.
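Partial dissipativity can be checked numerically at individual points: the partial gradient should correlate with the displacement from the target. The check below uses the standard inequality form stated above on a toy quadratic; the constant and the region are placeholders.

```python
import numpy as np

def is_partially_dissipative(grad_fn, x, x_star, c):
    """Check <grad_fn(x), x - x_star> >= c * ||x - x_star||^2 at a single point.

    grad_fn : callable returning the partial gradient with respect to x
    x_star  : the target (e.g., a global optimum) for this block of variables
    """
    diff = x - x_star
    return float(np.dot(grad_fn(x), diff)) >= c * np.dot(diff, diff)

# Example: the convex quadratic f(x) = 0.5 * ||x - x_star||^2 is 1-dissipative.
x_star = np.array([1.0, -2.0])
grad = lambda x: x - x_star
print(is_partially_dissipative(grad, np.array([3.0, 0.0]), x_star, c=1.0))  # True
```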

3.1 Stage I: Avoid the Spurious Local Optimum

We first show that, with properly chosen step sizes, GD avoids being trapped by the spurious local optimum. We propose to update the two layers with different step sizes, and we formalize our result in the following theorem. Initialize the algorithm as in Section 2 and choose sufficiently small step sizes. Then we have

(8)

for all iterations of Stage I.

Proof Sketch.

Due to the space limit, we only provide a proof sketch here; the detailed proof is deferred to Appendix B.2. We prove the two claims in (8) in order. Before that, we first show that our initialization scheme guarantees an important bound, as stated in the following lemma. Given the initialization in Section 2 and a sufficiently small step size, for every iteration of Stage I we have

(9)

Under the shortcut prior (Assumption 1), the update of the residual weight v should be more conservative, so that it provides enough accuracy for the other layer to make progress. Based on Lemma 3.1, the next lemma shows that when the step size is small enough the angle stays acute, i.e., the iterate remains sufficiently far away from the spurious local optimum. With the step size chosen small enough, for some absolute constant and for all iterations of Stage I,

(10)

We remark that (9) and (10) are two of the key conditions that define the partially dissipative region, as shown in the following lemma. For any point of this region,

(11)

Please refer to Appendix B.2.3 for a detailed proof. Note that with an arbitrary initialization, unfavorable conditions may hold at t = 0; in this case, the iterate falls into the partially dissipative region above, and (11) ensures improvement at every iteration.

Given the initialization above and properly chosen step sizes, there exists an iteration at which the desired condition holds. One can easily verify that this condition is preserved under Assumption 1. Together with Lemma 3.1, we conclude that even with an arbitrary initialization, the iterates enter the region where the relevant quantity is positive and bounded within polynomially many iterations. The next lemma shows that, with properly chosen step sizes, it stays positive and bounded afterwards. Taking the iteration count and step sizes accordingly completes the proof. ∎

In Theorem 3.1, we choose a conservative step size for the residual weight. This brings two benefits to the training process: 1) the iterate stays away from the spurious local optimum, since the update of the residual weight is quite limited when its step size is small, so it is kept sufficiently far away even if it moves toward the spurious point in every iteration; 2) the other layer continuously makes progress toward its ground truth.

Theorem 3.1 ensures that, under the shortcut prior, GD with adaptive step sizes successfully overcomes the optimization challenge early in training, i.e., the iterate is sufficiently far away from the spurious local optimum at the end of Stage I. Meanwhile, (8) actually demonstrates that the algorithm enters the basin of attraction of the global optimum, and we next show the convergence of GD.

3.2 Stage II: Converge to the Global Optimum

Recall that in the previous stage we use a conservative step size for the residual weight to avoid being trapped by the spurious local optimum. However, the small step size slows down its convergence inside the basin of attraction of the global optimum. We now choose larger step sizes to accelerate the convergence. The following theorem shows that, after Stage I, we can use a larger step size while the results of Theorem 3.1 still hold, i.e., the iterate stays in the basin of attraction of the global optimum. We restart the iteration counter. Suppose the conclusions of Theorem 3.1 hold at the end of Stage I and the step sizes are chosen accordingly; then the guarantees continue to hold for all subsequent iterations.

Proof Sketch.

To prove the first claim, we need the partial dissipativity of the corresponding partial gradient inside the basin of attraction. This condition ensures that, as long as the correlation with the ground truth is positive, the iterate always makes positive progress toward it, or equivalently, the angle keeps decreasing. We therefore need not worry about the angle becoming obtuse, and a larger step size can be adopted. The second claim can be proved following lines similar to Lemma 3.1. Please see Appendix B.3.2 for more details. ∎

Now we are ready to show the convergence of our GD algorithm. Note that Theorem 3.2 and Lemma 3.2 together show that the iterate stays in the partially dissipative region, which leads to the convergence of the first block of variables. Moreover, as shown in the following lemma, once that block is accurate enough, the partial gradient with respect to the other block also enjoys partial dissipativity; as a direct result, the other block converges as well. The next theorem formalizes the above discussion.

[Convergence] Suppose the guarantees of Stage II hold for all iterations. Then, for any target accuracy and properly chosen step sizes, the iterate reaches that accuracy of the global optimum after polynomially many iterations.

Proof Sketch.

The detailed proof is provided in Appendix B.4. Our proof relies on the partial dissipativity established above for the two partial gradients. Note that the partially dissipative region of one block depends on the precision of the other. Thus, we first show the convergence of the first block: under the stated conditions and with properly chosen step sizes, it converges to its ground truth. This implies that after sufficiently many iterations the algorithm enters the region where the other block also enjoys partial dissipativity, and the next lemma then establishes the convergence of that block. Combining the two lemmas and taking the iteration count accordingly completes the proof. ∎

Theorem 3.2 shows that, with a larger step size than in Stage I, GD converges to the global optimum in polynomial time. Compared with the constant-probability convergence for the CNN (du2017gradient), Assumption 1 ensures convergence even under arbitrary initialization. This partially justifies the importance of the shortcut in ResNet.

4 Numerical Experiment

We present numerical experiments to illustrate the convergence of the GD algorithm. We first demonstrate that, with the shortcut prior, our choice of step sizes and initialization guarantees the convergence of GD. We consider training a two-layer non-overlapping convolutional ResNet by solving (6) for a range of problem sizes. The teacher network is set so that its parameters essentially satisfy Assumption 1; a more detailed experimental setting is provided in Appendix C. We initialize the residual weight at zero and the other layer uniformly at random. We adopt the learning rate scheme with Step Size Warmup (SSW) suggested in Section 3: we first run with the conservative Stage I step sizes for a fixed number of iterations, and then switch to the larger Stage II step sizes. We also consider learning the same teacher network with the step sizes held fixed throughout, i.e., without step size warmup.

We further demonstrate learning the aforementioned teacher network using a student network with the same (plain CNN) architecture. Specifically, we keep the teacher unchanged, use the GD algorithm of du2017gradient with a fixed step size, and initialize the convolutional weight uniformly on the unit sphere and the output weight uniformly in a ball.

For each problem size, we repeat the simulations for the three settings above and report the success rate of converging to the global optimum in Table 1. As can be seen, our GD on ResNet avoids the spurious local optimum and converges to the global optimum in all simulations. GD without SSW, however, can be trapped in the spurious local optimum, although the failure probability diminishes as the dimension increases. Learning the teacher network with a two-layer CNN student network (du2017gradient) can also be trapped in the spurious local optimum.

                 16      25      36      49      64      81      100
ResNet w/ SSW    1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
ResNet w/o SSW   0.7042  0.7354  0.7776  0.7848  0.8220  0.8388  0.8426
CNN              0.5348  0.5528  0.5312  0.5426  0.5192  0.5368  0.5374
Table 1: Success rates of converging to the global optimum for GD training of ResNet with and without SSW, and of the CNN baseline, for varying problem sizes.
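A skeleton of the success-rate experiment reads as follows. The number of trials, the success tolerance, and the construction of each random instance are placeholders (the actual values did not survive extraction), and run_gd stands for any of the three training configurations compared in Table 1.

```python
import numpy as np

def success_rate(run_gd, make_problem, n_trials=100, tol=1e-3, seed=0):
    """Fraction of trials in which GD recovers the teacher network.

    make_problem : callable (rng) -> (teacher, init) for one random instance
    run_gd       : callable (teacher, init) -> final error (e.g., distance to optimum)
    """
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_trials):
        teacher, init = make_problem(rng)
        final_error = run_gd(teacher, init)
        successes += int(final_error <= tol)
    return successes / n_trials
```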

We then demonstrate the algorithmic behavior of our GD in a single run. We fix one problem size for the teacher network and keep the other parameters the same as in the previous experiment. We initialize as before, start with the conservative Stage I step sizes, and after a fixed number of iterations switch to the larger Stage II step sizes. The algorithm is terminated once the error is sufficiently small. We also run the GD algorithm without SSW from the same initialization, with the step sizes held fixed throughout training.

Figure 3: Algorithmic behavior of GD on ResNet. The horizontal axis corresponds to the number of iterations.

One solution path of GD with SSW is shown in the first column of Figure 3. As can be seen, the algorithm exhibits a phase transition. In the first stage, the residual weight makes very slow progress due to its small step size, while the alignment with the ground truth gradually improves; this indicates that the algorithm avoids being attracted by the spurious local optimum. In the second stage, both layers continuously evolve toward the global optimum.

The second row of Figure 3 illustrates the trajectory of GD without SSW being trapped by the spurious local optimum: the iterates converge to the spurious local optimum rather than to the teacher's parameters.

5 Discussions

Deep ResNet. Our two-layer network model is largely simplified compared with the deep and wide ResNets used in practice, where the role of the shortcut connection is more complicated. It is worth mentioning that the empirical results in veit2016residual show that ResNet can be viewed as an ensemble of smaller networks, most of which are shallow due to the shortcut connection. They also suggest that training is dominated by these shallow smaller networks. We are interested in investigating whether these shallow smaller networks possess benign properties, similar to our two-layer model, that ease training.

Moreover, our student network and the teacher network have the same degrees of freedom; we have not considered deeper and wider student networks. It is also worth investigating the role of shortcut connections in deeper and wider networks.

From GD to SGD. A straightforward extension is to investigate the convergence of mini-batch SGD. We remark that when the batch size is large, the effect of the gradient noise is limited and SGD mimics the behavior of GD. When the batch size is small, the gradient noise plays a significant role in training, which is technically more challenging.

Related Work. li2017convergence study ResNet-type two-layer neural networks with the output weight known, which is equivalent to assuming in our analysis that one layer is already at its ground truth throughout training. Thus, their analysis does not have Stage I. Moreover, since they only optimize a single layer, they only need to handle the partial dissipativity of one partial gradient (one-point convexity). In our analysis, however, we also need to handle the partial dissipativity with respect to the other layer, which makes our proof more involved.

Initialization. Our analysis shows that GD converges to the global optimum, when is initialized at zero. Empirical results in li2016demystifying and zhang2019fixup also suggest that deep ResNet works well, when the weights are simply initialized at zero or using the Fixup initialization. We are interested in building a connection between training a two-layer ResNet and its deep counterpart.

Step Size Warmup. Our choice of step sizes is related to the learning rate warmup and layerwise learning rates in the existing literature. Specifically, goyal2017accurate present an effective learning rate scheme for training ResNet on ImageNet in about an hour: they start with a small step size, gradually increase it (linear scaling), and finally shrink it for convergence. Our analysis suggests that in the first stage we need a smaller step size to avoid being attracted by the spurious local optimum, which is essentially consistent with goyal2017accurate. Note that we consider GD (no noise), and hence we do not need to shrink the step size in the final stage, whereas goyal2017accurate need to shrink it to control the noise of SGD. Similar learning rate schemes are proposed by smith2017cyclical.
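For reference, a linear-warmup schedule of the kind popularized by goyal2017accurate can be written in a few lines; the base rate, warmup length, and decay milestones below are illustrative placeholders, not values from that paper or from our theory.

```python
def warmup_lr(t, base_lr=0.1, warmup_iters=500, decay_at=(5000, 8000), decay=0.1):
    """Learning rate at iteration t: linear warmup, then piecewise-constant decay."""
    if t < warmup_iters:
        return base_lr * (t + 1) / warmup_iters   # linear scaling during warmup
    lr = base_lr
    for milestone in decay_at:
        if t >= milestone:
            lr *= decay
    return lr
```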

On the other hand, we incorporate the shortcut prior and adopt a smaller step size for the inner layer and a larger step size for the outer layer. Such a choice of step sizes has been shown to be helpful in both deep learning and transfer learning (singh2015layer; howard2018universal), where it is referred to as differential learning rates or discriminative fine-tuning. It would be interesting to build a connection between our theoretical findings and these empirical observations.

References

Appendix A Preliminaries

We first provide the explicit forms of the loss function and its gradients with respect to the two blocks of variables. When the input is standard Gaussian, the loss function and its partial gradients admit the analytic forms stated below. This proposition is a simple extension of Theorem 3.1 in du2017gradient, and we omit the proof.

For notational simplicity, we adopt the above shorthand in the remaining proofs.

Appendix B Proof of Theoretical Results

B.1 Proof of Proposition B.1

Proof.

Recall that du2017gradient prove the characterization of the spurious local optimum for the CNN counterpart of our ResNet. Substituting in our shortcut decomposition of the weight proves the result. ∎

B.2 Proof of Theorem 3.1

B.2.1 Proof of Lemma 3.1

Proof.

By simple manipulation, we know that the initialization satisfies the required bound. We first prove the right-hand side of the inequality. Expanding the relevant quantity, we have

Subtracting the common term from both sides, we obtain the right-hand inequality for every iteration of Stage I, which proves the right-hand side.

The proof of the left-hand side follows similar lines. Since the same expansion applies, we have

which is equivalent to the following inequality.