On the Convergence Rate of Training Recurrent Neural Networks

10/29/2018 ∙ Zeyuan Allen-Zhu, et al. ∙ MIT, The University of Texas at Austin, Stanford University

Despite the huge success of deep learning, our understanding of how non-convex neural networks are trained remains rather limited. Most existing theoretical works only tackle neural networks with one hidden layer, and little is known for multi-layer neural networks. Recurrent neural networks (RNNs) are special multi-layer networks extensively used in natural language processing applications. They are particularly hard to analyze, compared to feedforward networks, because the weight parameters are reused across the entire time horizon. We provide arguably the first theoretical understanding of the convergence speed of training RNNs. Specifically, when the number of neurons is sufficiently large, meaning polynomial in the training data size and in the time horizon, and when the weights are randomly initialized, we show that gradient descent and stochastic gradient descent both minimize the training loss at a linear convergence rate, that is, ε ∝ e^{-Ω(T)}.


1 Introduction

Neural networks have been one of the most powerful tools in machine learning over the past few decades [51, 35, 56, 3, 43, 84, 85]. The multi-layer structure of a neural network gives it great power in expressibility and learning performance, but it also raises complexity concerns: the training objective involving multi-layer (or even just two-layer) neural networks, equipped with non-linear activation functions, is generally neither convex nor concave. Nevertheless, it is observed in practice that local-search algorithms such as stochastic gradient descent (SGD) are capable of finding globally optimal solutions, at least on the training data set [103].

In recent years, there have been a number of theoretical results aiming at a better understanding of this phenomenon. Many of them focus on two-layer (thus one-hidden-layer) neural networks and assume that the inputs are random Gaussian vectors [12, 87, 95, 55, 26, 31, 69, 107, 106]. Some study deep neural networks but assume the activation function is linear [39, 7, 10]. On the technical side, some of these results try to understand the gradient dynamics [87, 95, 12, 55, 21, 13, 26, 69, 107, 106, 97], while others focus on the geometric properties of the training objective [31, 76, 29, 108, 39].

More recently, Safran and Shamir [76] provided evidence that, even when inputs are standard Gaussians, two-layer neural networks can indeed have spurious local minima, and suggested that over-parameterization (i.e., increasing the number of neurons) may be the key to avoiding spurious local minima. Later, Li and Liang [54] showed that, for two-layer networks with the cross-entropy loss, in the over-parameterization regime, gradient descent is capable of finding nearly-global optimal solutions on the training data. This result was later extended to the ℓ2 loss by Du et al. [27].

Recurrent Neural Networks.   Among the different variations of neural networks, one of the least theoretically understood structures is the recurrent one [28]. A recurrent neural network (RNN) recurrently applies the same network unit to a sequence of input points, such as a sequence of words in a sentence. RNNs are particularly useful when there are long-term, non-linear interactions between input points in the same sequence. These networks are widely used in practice for natural language processing, language generation, machine translation, speech recognition, video and music processing, and many other tasks [62, 63, 91, 47, 77, 93, 15, 17]. On the theory side, while there are some attempts to show that an RNN is more expressive than a feedforward neural network [49], when and how an RNN can be efficiently learned has nearly zero theoretical explanation.

In practice, an RNN is usually trained by simple local-search algorithms such as stochastic gradient descent, via back-propagation through time (BPTT) [99]. However, unlike shallow networks, the training process of an RNN often runs into the trouble of vanishing or exploding gradients [92]. That is, the value of the gradient becomes exponentially small or large in the time horizon, even when the training objective is still of constant size. (Footnote 1: Intuitively, an RNN recurrently applies the same network unit as many times as the length of the input sequence. When this unit has "operator norm" larger than one or smaller than one, the final output can exponentially explode or vanish in the sequence length. More importantly, when BPTT back-propagates through time, which intuitively corresponds to applying the reverse unit multiple times, the gradient can also vanish or explode. Controlling the operator norm of a non-linear operator can be quite challenging.)
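To make the footnote concrete, here is a toy numpy illustration (our own snippet, not from the paper; the matrices and scales are arbitrary): applying one fixed linear unit repeatedly makes norms shrink or grow exponentially in the sequence length, depending on whether its operator norm is below or above one.

```python
# Toy illustration of vanishing/exploding signals: the same linear unit applied
# L times scales norms by (operator norm)^L. Sizes and scales are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
m, L = 64, 30
h = rng.standard_normal(m)
h /= np.linalg.norm(h)

for scale in [0.8, 1.0, 1.25]:
    Q = np.linalg.qr(rng.standard_normal((m, m)))[0]   # random orthogonal matrix
    W = scale * Q                                      # operator norm exactly `scale`
    v = h.copy()
    for _ in range(L):                                 # same unit reused L times
        v = W @ v
    print(f"operator norm {scale:4.2f}:  ||W^L h|| = {np.linalg.norm(v):.3e}")
# 0.80 -> ~1e-3 (vanishing), 1.00 -> 1 (stable), 1.25 -> ~8e+2 (exploding)
```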

In practice, one of the most popular ways to resolve this issue is the long short-term memory (LSTM) structure [44]. However, one can also use rectified linear units (ReLUs) as activation functions to avoid vanishing or exploding gradients [78]. In fact, one of the earliest adoptions of ReLUs was in applications of RNNs for this purpose twenty years ago [38, 79].

Since the RNN structure was proposed, a large number of variations have been designed over the past decades, including the LSTM [44], the bidirectional RNN [81], the bidirectional LSTM [34], the gated recurrent unit [16], the statistical recurrent unit [67], and the Fourier recurrent unit [104]. For a more detailed survey, we refer the readers to Salehinejad et al. [78].

1.1 Our Question

In this paper, we study the following general questions:

  • Can ReLU provably stabilize the training process and avoid vanishing/exploding gradients?

  • Can an RNN be trained to close-to-zero training error efficiently under mild assumptions?

Remark 1.1.

When there is no activation function, an RNN is known as a linear dynamical system. Hardt, Ma and Recht [40] first proved the convergence of finding global minima for such linear dynamical systems. Follow-ups in this line of research include [41, 42].

Motivation.   Indeed, the ultimate question in this line would be whether an RNN can be trained to close-to-zero generalization error. However, unlike for feedforward neural networks, for RNNs the ability to fit the training data, that is, to memorize examples, may actually be desirable. After all, many tasks involving RNNs are related to memory, and certain RNN units are even referred to as memory cells. Since an RNN applies the same network unit to all the input points in a sequence, the following question can possibly be of its own interest:

  • How does an RNN learn a new mapping (say, from one input point to its output 3 time steps later) without destroying other mappings?

Another motivation is the following. An RNN can be viewed as a space-constrained, differentiable Turing machine, except that the input is only allowed to be read in a fixed order. It was shown in Siegelmann and Sontag [83] that all Turing machines can be simulated by fully-connected recurrent networks built of neurons with non-linear activation functions. In practice, RNNs are also used as a tool to build neural Turing machines [36], equipped with the grand goal of automatically learning an algorithm based on observations of inputs and outputs. To this extent, we believe that understanding trainability, as a first step towards understanding RNNs, can be meaningful on its own.

Our Result.   We answer these questions positively in this paper. To present the simplest result, we focus on the classical Elman network with ReLU activation:

h_0 = 0,   h_ℓ = φ(W·h_{ℓ-1} + A·x_ℓ) ∈ ℝ^m,   y_ℓ = B·h_ℓ ∈ ℝ^d   for ℓ = 1, 2, …, L,

where x_1, …, x_L are the input points of a sequence of length L and W, A, B are the weight matrices. We let m denote the number of hidden neurons and denote by φ the ReLU activation function, applied coordinate-wise: φ(x) = max{x, 0}.

We consider a regression task where each sequence of inputs consists of L vectors x_1, …, x_L and we perform regression with respect to target labels y*_2, …, y*_L. The least-squares regression loss with respect to this input is (1/2)·Σ_{ℓ=2}^{L} ‖y_ℓ − y*_ℓ‖². We assume there are n training sequences, each of length L. We assume the training sequences are δ-separable (say, for instance, the first input points differ by a relative distance of at least δ for every pair of training sequences). Our main theorem can be stated as follows.
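Before stating the theorem, here is a minimal numpy sketch of the setup just described: an Elman forward pass with ReLU and the least-squares loss over layers ℓ = 2, …, L. The Gaussian initialization scales (2/m for W and A, 1/d for B) are an assumption consistent with the intuition given later in Section 5, not a verbatim specification from this introduction.

```python
# Sketch of the Elman RNN forward pass and the least-squares regression loss.
# The N(0, 2/m) / N(0, 1/d) initialization scales are an assumption.
import numpy as np

def elman_forward(W, A, B, xs):
    """xs: list of L input vectors; returns hidden states h_1..h_L and outputs y_1..y_L."""
    m = W.shape[0]
    h = np.zeros(m)                          # h_0 = 0
    hs, ys = [], []
    for x in xs:
        h = np.maximum(W @ h + A @ x, 0.0)   # h_ell = phi(W h_{ell-1} + A x_ell)
        hs.append(h)
        ys.append(B @ h)                     # y_ell = B h_ell
    return hs, ys

def l2_loss(ys, y_stars):
    """0.5 * sum_{ell>=2} ||y_ell - y*_ell||^2 (labels start at ell = 2)."""
    return 0.5 * sum(np.sum((y - ystar) ** 2) for y, ystar in zip(ys[1:], y_stars))

rng = np.random.default_rng(1)
m, d_in, d_out, L = 128, 10, 5, 8
W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d_in))
B = rng.normal(0.0, np.sqrt(1.0 / d_out), (d_out, m))

xs = [x / np.linalg.norm(x) for x in rng.standard_normal((L, d_in))]  # unit-norm inputs
y_stars = list(rng.standard_normal((L - 1, d_out)))                    # labels for ell = 2..L
_, ys = elman_forward(W, A, B, xs)
print("training loss on one sequence:", l2_loss(ys, y_stars))
```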

Theorem.

If the number of neurons m is polynomially large (in the number of training sequences n and the time horizon L), we can find weight matrices where the RNN gives training error at most ε:

  • if gradient descent (GD) is applied for a number of iterations that is polynomial in n and L and logarithmic in 1/ε, starting from a random Gaussian initialization; or

  • if (mini-batch or regular) stochastic gradient descent (SGD) is applied for a number of iterations that is polynomial in n and L and logarithmic in 1/ε, starting from a random Gaussian initialization. (Footnote 2: At first glance, one may question how it is possible for SGD to enjoy a logarithmic time dependency on ε; after all, even when minimizing strongly-convex and Lipschitz-smooth functions, the typical convergence rate of SGD is T ∝ 1/ε as opposed to T ∝ log(1/ε). We quickly point out that there is no contradiction here if the stochastic pieces of the objective enjoy a common global minimizer. In math terms, suppose we want to minimize f(x) = (1/n)·Σ_i f_i(x), and suppose x* is the common global minimizer of the convex functions f_1, …, f_n. Then, if f is σ-strongly convex and each f_i is β-Lipschitz smooth, SGD, moving in the negative direction of ∇f_i(x) for a random i per step, can find an ε-approximate minimizer of f in O((β/σ)·log(1/ε)) iterations. A toy numerical sketch of this phenomenon appears right after the theorem.)

(To present the simplest possible result, we have not tried to tighten the polynomial dependency with respect to n and L. We only tightened the dependency with respect to the error ε.)
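The following toy experiment (our own illustration, not from the paper) demonstrates the point of Footnote 2: when every stochastic piece f_i shares the same global minimizer, plain SGD converges geometrically rather than at the usual 1/T rate. Here f_i(x) = ½‖A_i(x − x*)‖² with hypothetical matrices A_i, all minimized at the same x*.

```python
# SGD converges linearly when the stochastic pieces share a global minimizer.
# Each f_i(x) = 0.5 * ||A_i (x - x_star)||^2 vanishes at the same point x_star.
import numpy as np

rng = np.random.default_rng(2)
n, dim = 20, 10
x_star = rng.standard_normal(dim)
As = [np.eye(dim) + 0.1 * rng.standard_normal((dim, dim)) for _ in range(n)]

def grad_i(x, i):                       # gradient of f_i at x
    return As[i].T @ (As[i] @ (x - x_star))

x, eta = np.zeros(dim), 0.2
for t in range(1, 2001):
    i = rng.integers(n)                 # one random piece per step
    x -= eta * grad_i(x, i)
    if t % 500 == 0:
        print(f"iteration {t:4d}:  ||x - x*|| = {np.linalg.norm(x - x_star):.2e}")
# The distance decays geometrically (roughly like exp(-c*t)) because every
# stochastic gradient vanishes at x_star: there is no "noise at the optimum".
```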

We believe this is the first proof of convergence of GD or SGD for recurrent neural networks with non-linear activation functions, and possibly the first proof of finding approximate global optima of the RNN training objective when non-linear activations are present.

Remark 1.2.

Our theorem does not exclude the possible existence of (spurious) local minima.

Extension: DNN.   A feedforward neural network of depth L is similar to an Elman RNN, with the main difference being that the weights across layers are trained separately. As one shall see, this only makes our proofs simpler because we have more independent randomness. Our theorems also apply to feedforward neural networks, and we have written a separate follow-up paper [2] to address feedforward (fully-connected, residual, and convolutional) neural networks.

Extension: Deep RNN.   One may also study the convergence of RNNs with multiple layers of hidden neurons. This is referred to as a deep RNN, or an RNN with a multi-layer perceptron [78]. Our theorem also generalizes to deep RNNs (by combining this paper with [2]). We do not include the details here because the convergence of the Elman RNN is already quite involved to prove.

Extension: Other loss functions.   For simplicity, in this paper we have adopted the ℓ2 regression loss. Our results generalize to other Lipschitz-smooth (but possibly nonconvex) loss functions. We refer interested readers to [2] for details on how to work with other loss functions.

1.2 Other Related Works

Another relevant work is Brutzkus et al. [13], where the authors studied over-parameterization in the case of two-layer neural networks under a linearly-separable assumption.

Some other works learn neural networks using more delicate algorithms, such as tensor decomposition [37, 18, 6, 96, 64] or algorithms tailored to the problem structure [5, 105, 33], or design more special activation functions [4, 65, 69, 88]. In this paper, instead, we focus on why basic algorithms such as GD or SGD can already almost surely find global optima of the objective.

Instead of using randomly initialized weights as in this paper, there is a line of work proposing algorithms that use weights generated from some "tensor initialization" process [5, 82, 46, 95, 107].

There is a huge literature on using mean-field theory to study neural networks [61, 102, 23, 100, 53, 101, 14, 71, 70, 72, 75, 80]. At a high level, these works study the network dynamics at random initialization when the number of hidden neurons grows to infinity, and use such initialization theory to predict performance after training. However, they do not provide a theoretical convergence rate for the training process (at least when the number of neurons is finite).

Some other works focus on the expressibility of neural networks, that is, they show that there exist certain weights so that the network computes interesting functions. These results usually do not cover the algorithmic issue of how to find such weight parameters [19, 30, 45, 73, 9, 94, 66, 23, 75].

There is a long line of research focusing on hardness results for neural networks [11, 50, 57, 20, 22, 32, 89, 48, 98, 59].

Linear dynamical systems are an important topic on their own, and are also related to linear variants of reinforcement learning. Some recent works on this topic include [24, 68, 1, 86, 60, 25, 8].

2 Notations and Preliminaries

Notations.   We denote by ‖·‖ (or sometimes ‖·‖₂) the Euclidean norm of vectors or the spectral norm of matrices. We denote by ‖·‖∞ the infinity norm of vectors, by ‖·‖₀ the sparsity (number of non-zero entries) of vectors or diagonal matrices, and by ‖·‖_F the Frobenius norm of matrices. We use N(μ, σ²) to denote the Gaussian distribution with mean μ and variance σ², and N(μ, Σ) to denote the Gaussian vector with mean μ and covariance Σ.

We use 𝟙[event] to denote the indicator function of whether the event is true. We denote by e_k the k-th standard basis vector. We use φ(x) = max{x, 0} to denote the ReLU function. Given a univariate function f, slightly abusing notation, we also use f to denote the same function applied coordinate-wise to vectors: f(x) = (f(x_1), …, f(x_m)) if x ∈ ℝ^m.

Given a matrix W, we denote by w_k the k-th column vector of W, so that W = (w_1, …, w_m).

Given vectors v_1, …, v_n ∈ ℝ^m, we define GS(v_1, …, v_n) as their Gram-Schmidt orthonormalization. Namely, GS(v_1, …, v_n) = (v̂_1, …, v̂_n), where

 v̂_1 = v_1/‖v_1‖  and, for i ≥ 2:  v̂_i = (v_i − Σ_{j<i} ⟨v̂_j, v_i⟩ v̂_j) / ‖v_i − Σ_{j<i} ⟨v̂_j, v_i⟩ v̂_j‖.

Note that, in the occasion that the residual v_i − Σ_{j<i} ⟨v̂_j, v_i⟩ v̂_j is the zero vector, we let v̂_i be an arbitrary unit vector that is orthogonal to v̂_1, …, v̂_{i-1}.
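A direct implementation of this orthonormalization, including the convention for a zero residual, is sketched below (our own code; the zero residual is replaced by a random unit vector orthogonal to the vectors produced so far).

```python
# Gram-Schmidt orthonormalization as defined above, including the convention
# that a zero residual is replaced by a unit vector orthogonal to the basis.
import numpy as np

def gram_schmidt(vs, tol=1e-12, rng=np.random.default_rng(0)):
    """Orthonormalize a list of vectors in R^m."""
    m = len(vs[0])
    basis = []
    for v in vs:
        r = np.asarray(v, dtype=float) - sum((u @ v) * u for u in basis)
        if np.linalg.norm(r) <= tol:
            # zero residual: pick an arbitrary unit vector orthogonal to the
            # vectors produced so far, via a random vector minus its projection
            g = rng.standard_normal(m)
            r = g - sum((u @ g) * u for u in basis)
        basis.append(r / np.linalg.norm(r))
    return basis

# sanity check: the output is orthonormal even with a dependent input vector
vs = [np.array([1.0, 0, 0, 0]), np.array([1.0, 1, 0, 0]), np.array([2.0, 2, 0, 0])]
Q = np.stack(gram_schmidt(vs), axis=1)
print(np.round(Q.T @ Q, 6))   # identity matrix
```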

2.1 Elman Recurrent Neural Network

We assume n training inputs are given, in the form of x_{i,1}, …, x_{i,L} for each input sequence i ∈ [n]. We assume training labels are given, in the form of y*_{i,2}, …, y*_{i,L} for each input sequence i ∈ [n]. Without loss of generality, we assume ‖x_{i,ℓ}‖ = 1 for every i and ℓ. Also without loss of generality, we assume that the last coordinate of each x_{i,1} equals a fixed positive constant. (Footnote 3: If an input point only satisfies ‖x_{i,ℓ}‖ ≤ 1, one can pad it with an additional coordinate to make ‖x_{i,ℓ}‖ = 1 hold. As for the assumption on the last coordinate of x_{i,1}, it is equivalent to adding a bias term to the first layer.)

We make the following assumption on the input data (see Footnote 7 for how to relax it):

Assumption 2.1.

‖x_{i,1} − x_{j,1}‖ ≥ δ for some parameter δ > 0 and every pair of i ≠ j ∈ [n].

Given weight matrices W, A, B, we introduce the following notations to describe the evaluation of the RNN on the input sequences. For each i ∈ [n] and ℓ ∈ [L]:

h_{i,0} = 0,   h_{i,ℓ} = φ(W·h_{i,ℓ-1} + A·x_{i,ℓ}) ∈ ℝ^m,   y_{i,ℓ} = B·h_{i,ℓ} ∈ ℝ^d.
A very important notion that this entire paper relies on is the following:

Definition 2.2.

For each i ∈ [n] and ℓ ∈ [L], let D_{i,ℓ} be the diagonal matrix where (D_{i,ℓ})_{k,k} = 𝟙[(W·h_{i,ℓ-1} + A·x_{i,ℓ})_k ≥ 0] for each k ∈ [m].

As a result, we can write h_{i,ℓ} = D_{i,ℓ}(W·h_{i,ℓ-1} + A·x_{i,ℓ}).
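As a quick sanity check of this identity (our own snippet), multiplying the pre-activation by the diagonal 0/1 matrix of its signs reproduces the ReLU output exactly.

```python
# Definition 2.2 in code: applying the ReLU equals multiplying the
# pre-activation by the diagonal 0/1 matrix of its activation signs.
import numpy as np

rng = np.random.default_rng(3)
m = 8
pre = rng.standard_normal(m)                 # stands for W h_{i,ell-1} + A x_{i,ell}
D = np.diag((pre >= 0).astype(float))        # (D)_{kk} = 1[pre_k >= 0]
assert np.allclose(np.maximum(pre, 0.0), D @ pre)
```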

We consider the following random initialization distributions for W, A, and B.

Definition 2.3.

We say that W, A, B are at random initialization if the entries of W and A are i.i.d. generated from N(0, 2/m), and the entries of B are i.i.d. generated from N(0, 1/d).

Throughout this paper, for notational simplicity, we refer to the index ℓ as the ℓ-th layer of the RNN, and to h_{i,ℓ}, x_{i,ℓ}, y_{i,ℓ} respectively as the hidden neurons, input, and output on the ℓ-th layer. We acknowledge that in certain literature, one may instead regard the Elman network as a three-layer neural network.

Assumption 2.4.

We assume m ≥ poly(n, L, δ⁻¹) for some sufficiently large polynomial.

Without loss of generality, we assume δ ≤ 1/C for some sufficiently large constant C (if this is not satisfied, one can decrease δ). Throughout the paper, except for the detailed proofs in the appendix, we use the O, Ω and Θ notions to hide polylogarithmic dependency on m. To simplify notations, we also use a single shorthand for the polynomial factor in n, L and δ⁻¹ that appears repeatedly in our bounds.

2.2 Objective and Gradient

For simplicity, we only optimize over the weight matrix W and let A and B stay at random initialization. As a result, our ℓ2-regression objective is a function of W alone: (Footnote 4: The index ℓ starts from 2, because y_{i,1} remains constant if we are not optimizing over A and B.)

(2.1)   f(W) := Σ_{i=1}^{n} f_i(W),   where   f_i(W) := (1/2)·Σ_{ℓ=2}^{L} ‖B·h_{i,ℓ} − y*_{i,ℓ}‖².

Using the chain rule, one can write down a closed form of the (sub-)gradient:

Fact 2.5.

For each k ∈ [m], the gradient with respect to the column w_k (denoted by ∇_{w_k} f(W)) and the full gradient ∇f(W) admit closed forms obtained by the chain rule (i.e., back-propagation through time). These expressions are sums, over every sample i ∈ [n] and every pair of layers, of terms that combine the loss vectors B·h_{i,ℓ} − y*_{i,ℓ}, the forward vectors h_{i,ℓ}, the diagonal sign matrices D_{i,ℓ}, and products of the form D·W applied backwards through the layers.
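Since the precise closed form is notation-heavy, here is a small numerical sanity check (our own sketch, not the paper's formula): a hand-written back-propagation-through-time gradient with respect to W, with A and B frozen as in (2.1), compared against a finite-difference estimate on a single entry. The initialization scales are the same assumed ones as before.

```python
# BPTT gradient of the objective w.r.t. W (A, B frozen), checked against a
# finite-difference estimate. Illustrative sketch, not the paper's formula.
import numpy as np

rng = np.random.default_rng(4)
m, d_in, d_out, L = 30, 6, 4, 5
W = rng.normal(0, np.sqrt(2 / m), (m, m))
A = rng.normal(0, np.sqrt(2 / m), (m, d_in))
B = rng.normal(0, np.sqrt(1 / d_out), (d_out, m))
xs = rng.standard_normal((L, d_in))
ys = rng.standard_normal((L - 1, d_out))           # labels y*_2 ... y*_L

def loss_and_grad(W):
    hs, pres = [np.zeros(m)], []
    for x in xs:                                   # forward: h_l = relu(W h_{l-1} + A x_l)
        pre = W @ hs[-1] + A @ x
        pres.append(pre)
        hs.append(np.maximum(pre, 0.0))
    res = [B @ hs[l] - ys[l - 2] for l in range(2, L + 1)]
    f = 0.5 * sum(np.sum(r ** 2) for r in res)
    grad, dh = np.zeros_like(W), np.zeros(m)
    for l in range(L, 0, -1):                      # backward pass through time
        if l >= 2:
            dh = dh + B.T @ res[l - 2]             # direct loss signal at layer l
        dpre = (pres[l - 1] >= 0) * dh             # multiply by the diagonal D_{i,l}
        grad += np.outer(dpre, hs[l - 1])          # contribution of layer l to dW
        dh = W.T @ dpre                            # propagate to layer l-1
    return f, grad

f0, g = loss_and_grad(W)
k, j, eps = 3, 7, 1e-5
Wp = W.copy(); Wp[k, j] += eps
f1, _ = loss_and_grad(Wp)
print("BPTT gradient entry:", g[k, j], "   finite difference:", (f1 - f0) / eps)
```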

3 Our Results

Our main results can be formally stated as follows.

Theorem 1 (GD).

Suppose the number of neurons m is at least a sufficiently large polynomial in n, L and δ⁻¹, and the step size η is chosen sufficiently small. Let W, A, B be at random initialization. With high probability over the randomness of W, A, B, if we apply gradient descent on W for T steps, then it satisfies

f(W_T) ≤ ε   for   T = poly(n, L, δ⁻¹)·log(1/ε).
Theorem 2 (SGD).

Suppose m and η are chosen as in Theorem 1. Let W, A, B be at random initialization. If we apply stochastic gradient descent for T steps, using the gradient of f_i for a random index i ∈ [n] per step, then with high probability (over W, A, B and the randomness of SGD), it satisfies

f(W_T) ≤ ε   for   T = poly(n, L, δ⁻¹)·log(1/ε).

In both cases, we essentially have linear convergence rates. (Footnote 5: We remark here that the notation may hide an additional lower-order dependency on log(1/ε). This may not be necessary, and we have not tried to tighten such dependency on ε.) Notably, our results show that the dependency on the number of layers L is polynomial. Thus, even when the RNN is applied to sequences of long input data, it does not suffer from exponential gradient explosion or vanishing (e.g., of order 2^L or 2^{−L}) through the entire training process.
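To make the statements concrete, here is a schematic (and entirely toy) GD/SGD loop on the objective (2.1): only W is trained, A and B stay at their random initialization, and the gradient is computed by back-propagation through time. The width, step size and iteration counts are small ad-hoc choices for illustration, far from the values the theorems prescribe; if the objective does not decrease on a given run, the step size should simply be reduced.

```python
# Schematic GD / SGD loops for the objective (2.1): only W is trained, A and B
# stay at random initialization. All sizes, the step size and the iteration
# counts are toy choices; reduce `eta` if the objective ever increases.
import numpy as np

rng = np.random.default_rng(5)
n, m, d_in, d_out, L = 3, 100, 4, 2, 4
W0 = rng.normal(0, np.sqrt(2 / m), (m, m))
A = rng.normal(0, np.sqrt(2 / m), (m, d_in))
B = rng.normal(0, np.sqrt(1 / d_out), (d_out, m))
X = rng.standard_normal((n, L, d_in))
X /= np.linalg.norm(X, axis=2, keepdims=True)            # unit-norm input points
Y = 0.1 * rng.standard_normal((n, L - 1, d_out))          # labels for layers 2..L

def seq_loss_grad(W, i):
    """Loss and BPTT gradient w.r.t. W on the i-th training sequence."""
    hs, pres = [np.zeros(m)], []
    for x in X[i]:
        pres.append(W @ hs[-1] + A @ x)
        hs.append(np.maximum(pres[-1], 0.0))
    res = [B @ hs[l] - Y[i, l - 2] for l in range(2, L + 1)]
    f = 0.5 * sum(np.sum(r ** 2) for r in res)
    grad, dh = np.zeros_like(W), np.zeros(m)
    for l in range(L, 0, -1):
        if l >= 2:
            dh = dh + B.T @ res[l - 2]
        dpre = (pres[l - 1] >= 0) * dh
        grad += np.outer(dpre, hs[l - 1])
        dh = W.T @ dpre
    return f, grad

def objective(W):
    return sum(seq_loss_grad(W, i)[0] for i in range(n))

eta = 1e-5
W = W0.copy()                                             # ---- gradient descent ----
for t in range(1, 3001):
    W -= eta * sum(seq_loss_grad(W, i)[1] for i in range(n))
    if t % 1000 == 0:
        print(f"GD  step {t:5d}:  objective = {objective(W):.4f}")

W = W0.copy()                                             # ---- stochastic gradient descent ----
for t in range(1, 9001):
    W -= eta * seq_loss_grad(W, rng.integers(n))[1]       # one random sequence per step
    if t % 3000 == 0:
        print(f"SGD step {t:5d}:  objective = {objective(W):.4f}")
```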

4 Main Technical Theorems

Our main Theorem 1 and Theorem 2 are in fact natural consequences of the following two technical theorems. They both describe the first-order behavior of RNNs when the weight matrix W is sufficiently close to some random initialization.

The first theorem is similar to the classical Polyak-Łojasiewicz condition [74, 58], and says that the squared gradient norm ‖∇f(W)‖²_F is at least as large as (a factor times) the objective value.

Theorem 3.

With high probability over the random initialization W, A, B, every W that is sufficiently close to the initialization satisfies a gradient lower bound of the form ‖∇f(W)‖²_F ≥ (factor)·f(W), as well as a gradient upper bound of the form ‖∇f(W)‖²_F ≤ (factor)·f(W).

(Only the first statement is the Polyak-Łojasiewicz condition; the second is a simple-to-prove gradient upper bound.) The second theorem shows a special smoothness property of the objective.

Theorem 4.

With high probability over the random initialization W, A, B, the following holds: for every W′ that is sufficiently close to the initialization (in spectral norm) and every sufficiently small perturbation W″, the objective value f(W′ + W″) is upper bounded by its first-order approximation f(W′) + ⟨∇f(W′), W″⟩ plus an error term controlled by ‖W″‖ (a semi-smoothness property).

At a high level, the convergence proofs for GD and SGD are careful applications of Theorem 3 and Theorem 4 stated above (see Appendix H and Appendix I, respectively).
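To see how bounds of this flavor yield linear convergence, here is the standard idealized calculation, with exact smoothness in place of the semi-smoothness of Theorem 4 and with μ, β standing in for the factors hidden in the two theorems (the actual proofs must also keep every iterate inside the neighborhood of the initialization where the theorems apply). Suppose

    ‖∇f(W)‖²_F ≥ μ·f(W)   (Theorem 3 style),   and
    f(W − η∇f(W)) ≤ f(W) − η·(1 − βη/2)·‖∇f(W)‖²_F   (Theorem 4 style).

Then, for any step size η ≤ 1/β, one gradient step gives

    f(W_{t+1}) ≤ f(W_t) − (η/2)·‖∇f(W_t)‖²_F ≤ (1 − ημ/2)·f(W_t),

so f(W_T) ≤ (1 − ημ/2)^T·f(W_0) ≤ ε once T = O((1/(ημ))·log(f(W_0)/ε)), which is precisely a linear convergence rate.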

The main difficulty of this paper is to prove Theorem 3 and Theorem 4, and we sketch the proof ideas in Sections 5 through 8. In these high-level discussions, we put our emphasis on

  • how to avoid exponential blow-up in the number of layers L, and

  • how to deal with the issue of randomness dependence across layers.

5 Basic Properties at Random Initialization

In this section we derive basic properties of the RNN when the weight matrices are all at random initialization. The corresponding precise statements and proofs are in Appendix B.

The first one says that the forward propagation neither explodes nor vanishes; that is, for every i ∈ [n] and ℓ ∈ [L], the norm ‖h_{i,ℓ}‖ stays between a constant and O(ℓ).

(5.1)

Intuitively, (5.1) is very reasonable. Since the weight matrix W is randomly initialized with entries i.i.d. from N(0, 2/m), the norm ‖Wz‖ is around √2·‖z‖ for any fixed vector z. Equipped with the ReLU activation, which "shuts down" roughly half of the coordinates, the norm is reduced back to around ‖z‖. Since in each layer ℓ there is an additional unit-norm signal A·x_{i,ℓ} coming in, we should expect the final norm of the hidden neurons to be at most O(ℓ).
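A quick numerical check of this intuition (our own snippet; the N(0, 2/m) scale is the assumed initialization discussed above): even with the same W reused at every layer, the hidden-state norms neither collapse to zero nor blow up, growing only mildly with the layer index when m is large.

```python
# Hidden-state norms at random initialization: they stay bounded away from zero
# and grow only mildly in the layer index, even though the same W is reused.
import numpy as np

rng = np.random.default_rng(7)
m, d, L = 1000, 16, 40
W = rng.normal(0, np.sqrt(2.0 / m), (m, m))
A = rng.normal(0, np.sqrt(2.0 / m), (m, d))
xs = rng.standard_normal((L, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)     # unit-norm inputs

h = np.zeros(m)
for ell, x in enumerate(xs, start=1):
    h = np.maximum(W @ h + A @ x, 0.0)
    if ell % 10 == 0:
        print(f"layer {ell:2d}:  ||h|| = {np.linalg.norm(h):6.2f}")
# Typical output: the norms increase slowly with the layer index (well within
# a small multiple of ell) instead of growing or shrinking exponentially.
```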

Unfortunately, the above argument cannot be directly applied, since the weight matrix W is reused across all L layers, so there is no fresh new randomness across layers. Let us explain how we deal with this issue carefully, because it is at the heart of all of our proofs in this paper. Recall that each time W is applied to some vector h, it only uses "one column of randomness" of W. Mathematically, letting U denote the column-orthonormal matrix obtained by Gram-Schmidt from the hidden states of the previous layers, we have the decomposition

W·h = W·U·U^⊤·h + W·(I − U·U^⊤)·h.

  • The second term has new randomness independent of the previous layers. (Footnote 6: More precisely, letting v = (I − U·U^⊤)·h / ‖(I − U·U^⊤)·h‖, we have W·(I − U·U^⊤)·h = ‖(I − U·U^⊤)·h‖·(W·v). Here, W·v is a fresh random Gaussian vector independent of all the columns W·U that have already been revealed.)

  • The first term relies on the randomness of W in the directions of the hidden states of the previous layers. We cannot rely on the randomness of this term, because when applying the inductive argument up to layer ℓ, the randomness of W·U has already been used.

    Fortunately, U is a rectangular matrix with many more rows than columns (thanks to overparameterization!), so one can bound the spectral norm of W·U by a quantity noticeably smaller than the spectral norm of the full square matrix W. This ensures that, no matter how the first term behaves (even arbitrarily correlated with W·U), its norm cannot be too large. It is crucial here that U is a rectangular matrix: for a square random matrix such as W itself, the spectral norm is a constant strictly larger than one, and using that bound the forward propagation estimate would exponentially blow up.

This summarizes the main idea for proving the upper bound in (5.1); the lower bound can be similarly argued.
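The decomposition itself is a pure linear-algebra identity, which the following snippet verifies numerically (our own illustration with the assumed N(0, 2/m) initialization); it also prints the spectral norm of W restricted to the previously-used directions next to that of the full square W, illustrating that the restricted, rectangular piece is better behaved.

```python
# The randomness decomposition is an exact identity:
#     W h = W U U^T h + W (I - U U^T) h,
# and the second term equals ||(I - U U^T) h|| * (W v) for the unit "fresh"
# direction v, so it touches W only through one new column of randomness.
import numpy as np

rng = np.random.default_rng(8)
m = 500
W = rng.normal(0, np.sqrt(2.0 / m), (m, m))
prev = rng.standard_normal((m, 6))              # stand-ins for earlier hidden states
U = np.linalg.qr(prev)[0]                       # column-orthonormal (Gram-Schmidt)
h = rng.standard_normal(m)

first = W @ (U @ (U.T @ h))                     # uses already-revealed randomness W U
resid = h - U @ (U.T @ h)
v = resid / np.linalg.norm(resid)               # fresh direction, orthogonal to U
second = np.linalg.norm(resid) * (W @ v)        # W v is fresh Gaussian randomness

assert np.allclose(W @ h, first + second)
print("spectral norm of W U :", np.linalg.norm(W @ U, 2))
print("spectral norm of W   :", np.linalg.norm(W, 2))   # noticeably larger
```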

Our next property says that, in each layer, the amount of "fresh new randomness" is non-negligible; that is, the component of h_{i,ℓ} that is orthogonal to the hidden states of the previous layers has non-trivial norm.

(5.2)

This relies on a more involved inductive argument than (5.1). At a high level, one needs to show that, in each layer, the amount of "fresh new randomness" shrinks by at most a small multiplicative factor.

Using (5.1) and (5.2), we obtain the following interesting property about data separability: for every pair i ≠ j and every layer ℓ,

h_{i,ℓ} and h_{j,ℓ} are δ_ℓ-separable, for some parameter δ_ℓ depending on δ.    (5.3)

Here, we say two vectors are δ-separable if each of them has a component of norm at least δ that is orthogonal to the other.

We prove (5.3) by induction. In the first layer, x_{i,1} and x_{j,1} are δ-separable as a consequence of Assumption 2.1. If we had fresh new randomness, then given two separable vectors, one could show that their images in the next layer are also separable (with a slightly degraded parameter). Again, in an RNN we do not have fresh new randomness, so we rely on (5.2) to provide a reasonably large amount of fresh randomness. A careful induction then shows that (5.3) holds for all layers. (Footnote 7: This is the only place where we rely on Assumption 2.1. The assumption is somewhat necessary in the following sense: if two training sequences i ≠ j have identical inputs in all of the first ten layers, yet different labels in even just one of those layers, then there is no hope of having the training objective decrease to zero. Of course, one can make a more relaxed assumption on the input data, involving both the inputs and the labels. While this is possible, it complicates the statements, so we do not present such results in this paper.)

Intermediate Layers and Backward Propagation.   Training an RNN (or any neural network) is not only about forward propagation. We also have to control its behavior in the intermediate layers and in the backward propagation.

The first two results we derive are the following. For every pair of layers and all diagonal 0/1 matrices of bounded sparsity, the products of the form D·W·D·W·⋯·D·W that arise between intermediate layers have norm at most polynomial in L when applied to the relevant vectors:

(5.4)
(5.5)

Intuitively, one cannot use a naive spectral-norm argument to derive (5.4) or (5.5): the spectral norm of W is a constant larger than one, and even if the ReLU activations (i.e., the diagonal matrices) cancel half of its mass, the spectral norm remains a constant larger than one. When such bounds are multiplied together, they grow exponentially in L.

Instead, we use an argument analogous to (5.1) to show that, for each fixed vector u, the norm of the product applied to u is at most polynomial in L with extremely high probability. By a standard ε-net argument, the same bound holds for all sparse vectors u. Finally, for a possibly dense vector u, we divide it into chunks, each of bounded sparsity, and apply the sparse-vector upper bound to each chunk. This proves (5.4). One can use a similar argument to prove (5.5).

Remark 5.1.

We did not try to tighten the polynomial factor in L here. We conjecture that a tighter bound may be possible, but that question may be a sufficiently interesting random matrix theory problem on its own.

The next result is for the backward propagation. For every pair of layers and all diagonal 0/1 matrices of bounded sparsity, the analogous backward products (which now also involve the output matrix B) satisfy a similar polynomial bound:

(5.6)

Its proof is in the same spirit as that of (5.5), with the only difference being the bound on the spectral norm of B as opposed to that of W.

6 Stability After Adversarial Perturbation

In this section we study the behavior of the RNN after an adversarial perturbation of the weight matrix. The corresponding precise statements and proofs are in Appendix C.

Letting W, A, B be at random initialization, we consider a perturbation matrix W′ with small spectral norm. Here, W′ may depend on the randomness of W, A and B, so we say it can be adversarially chosen. The results of this section will later be applied essentially twice:

  • Once for the updates generated by GD or SGD, where W′ is how much the algorithm has moved away from the random initialization.

  • The other time (see Section 7.3) for a technique that we call "randomness decomposition", where we decompose the true random initialization into the sum of a "fake" random initialization and a small perturbation; the fake initialization is identically distributed to the true one. Such a technique traces back at least to smoothed analysis [90].

To illustrate our high-level idea, from this section on (so in Sections 6, 7 and 8), we denote by h_{i,ℓ}, D_{i,ℓ} and y_{i,ℓ} the values of the hidden neurons, diagonal sign matrices and outputs determined by W, A, B at random initialization, and by their perturbed counterparts the corresponding values determined by W + W′ after the adversarial perturbation.

Forward Stability.   Our first, and most technical, result is the following: after the adversarial perturbation, each hidden state stays close to its unperturbed counterpart, and the diagonal sign matrix changes on only a small (sparse) set of coordinates.

(6.1)

Intuitively, one may hope to prove (6.1) by simple induction, because the change of a hidden state at layer ℓ decomposes (ignoring subscripts) into a term caused by the perturbation W′ itself (①), a term caused by the change of the activation pattern (②), and a term that propagates the change of the previous hidden state through D·W (③).

The main issue here is that the spectral norm of the matrix in ③ is greater than 1, so we cannot apply naive induction due to exponential blow-up in L. Neither can we apply the techniques from Section 5, because the changes, such as the flips of the activation pattern, can be adversarial.

In our actual proof of (6.1), instead of applying induction on ③, we recursively expand ③ by the above formula. This results in a total of L terms of type ① and L terms of type ②. The main difficulty is to bound such a term, namely a change that is propagated through many layers of D·W products.

Our argument consists of two conceptual steps.

  1. Suppose a vector can be written as the sum of a "sparse but possibly large" component and a "dense but small-norm" component, with bounds on the sparsity of the former and the norm of the latter; then we argue that the relevant products applied to it inherit bounds of the same type.

  2. Suppose we have such a decomposition with the stated sparsity and norm bounds; then we show that the vector at the next layer can again be written in the same form, with only mildly degraded sparsity and norm bounds.

The two steps above enable us to perform induction without exponential blow-up. Indeed, together they let us carry the "sparse plus small-norm" decomposition through the layers. Since there is a gap between the sparsity parameter and the dimension m, we can make sure that all blow-up factors are absorbed into this gap, using the property that m is polynomially large. This enables us to perform induction to prove (6.1) without exponential blow-up.

Intermediate Layers and Backward Stability.   Using (6.1), and especially using the sparsity of the sign changes guaranteed by (6.1), one can apply the results in Section 5 to derive the following stability bounds, (6.2) and (6.3), for the intermediate layers and the backward propagation after the adversarial perturbation.

Special Rank-1 Perturbation.   For technical reasons, we also need two bounds, (6.4) and (6.5), in the special case where the perturbation of W is rank-one, formed as the outer product of a unit vector and a sparse vector of bounded norm. We prove that, for this type of rank-one adversarial perturbation, the corresponding stability bounds hold for every coordinate k ∈ [m].

7 Proof Sketch of Theorem 3: Polyak-Łojasiewicz Condition

The upper bound in Theorem 3 is easy to prove (based on Sections 5 and 6), but the lower bound (a.k.a. the Polyak-Łojasiewicz condition) is the most technically involved result to prove in this paper. We introduce the notion of a "fake gradient". Given fixed vectors, one for each sample i and layer ℓ, we define the fake gradient by formula (7.1): it is obtained from the closed-form gradient of Fact 2.5 by replacing each true loss vector with the corresponding fixed vector. Note that if the fixed vectors equal the true loss vectors B·h_{i,ℓ} − y*_{i,ℓ}, then the fake gradient is identical to ∇f(W) by Fact 2.5. Our main technical theorem is the following.

Theorem 5.

For every choice of the fixed vectors, if W, A, B are at random initialization, then with high probability the fake gradient is large: its squared norm is lower bounded, up to polynomial factors, in terms of the largest of the fixed vectors.

There are only two conceptually simple steps from Theorem 5 to Theorem 3 (see Appendix F).

  • First, one can use the stability lemmas in Section 6 to show that the fake gradient after an adversarial perturbation of W (of small spectral norm) is also large.

  • Second, one can apply an ε-net and a union bound to turn "for fixed vectors" into "for all vectors". This allows us to turn the lower bound on the fake gradient into a lower bound on the true gradient ∇f(W).

Therefore, in the rest of this section, we only sketch the ideas behind proving Theorem 5.

Let (i*, ℓ*) be the sample and layer corresponding to the largest loss. Recall that W, A, B are at random initialization.

7.1 Indicator and Backward Coordinate Bounds

There are three factors in the notion of fake gradient (7.1): the backward coordinate, the forward vector, and the indicator coordinate. We already know very well how the forward vector behaves from the previous sections. Let us provide bounds on the other two factors at the random initialization. (Details in Appendix D.)

Our "backward coordinate bound" controls, at random initialization, the value of the backward coordinate for each hidden neuron, showing that it is not too small for most coordinates.

(7.2)

The main idea behind proving (7.2) is to use the randomness of B. For a fixed coordinate k, it is in fact not hard to show that the backward coordinate is large with high probability. Unfortunately, the randomness of B is shared across different coordinates k. We need to also bound the correlation between pairs of coordinates, and resort to McDiarmid's inequality to provide a concentration bound with respect to all the coordinates.

Our indicator coordinate bound controls the value inside the indicator functions of the fake gradient. It says that, letting (i*, ℓ*) be as above, at random initialization, for at least a non-trivial fraction of the coordinates k ∈ [m], the value inside the indicator is neither too large nor too small.

(7.3)

This should be quite intuitive to prove, in the following two steps.

  • First, there are sufficiently many coordinates k for which the pre-activation value at (i*, ℓ*) falls into the desired range.

    To show this, we split the pre-activation into two parts and prove that, for every pair of vectors with bounded norms (by an ε-net argument), sufficiently many coordinates have the desired property. This is possible using the independence between the two sources of randomness involved.

  • Then, conditioning on the first event happening, we look at the pre-activation value for (1) each sample i ≠ i* and each layer, or (2) the sample i* and each layer ℓ ≠ ℓ*. In both cases, even though the value of