Feedforward and Recurrent Neural Networks Backward Propagation and Hessian in Matrix Form

09/16/2017 ∙ by Maxim Naumov, et al.

In this paper we focus on the linear algebra theory behind feedforward (FNN) and recurrent (RNN) neural networks. We review backward propagation, including backward propagation through time (BPTT). We also obtain a new exact expression for the Hessian, which represents second order effects. We show that for t time steps the weight gradient can be expressed as a rank-t matrix, while the weight Hessian can be expressed as a sum of t^2 Kronecker products of rank-1 and W^T A W matrices, for some matrix A and weight matrix W. Also, we show that for a mini-batch of size r, the weight update can be expressed as a rank-rt matrix. Finally, we briefly comment on the eigenvalues of the Hessian matrix.


1 Introduction

The concept of neural networks originated in the study of human behavior and perception in the 1940s and 1950s [18, 29, 36]. Different types of neural networks, such as Hopfield, Jordan and Elman networks, were developed and successfully adapted for approximating complex functions and recognizing patterns in the 1970s and 1980s [14, 20, 24, 44].

More recently, a wide variety of neural networks has been developed, including convolutional neural networks (CNNs) and recurrent long short-term memory networks (LSTMs). These networks have achieved remarkable results in image and video classification [25, 42, 43] and in natural language and speech processing [12, 17, 30, 39], as well as in many other fields. These new results were made possible by the vast amount of available data, more flexible and scalable software frameworks [1, 8, 9, 11, 23, 40] and the computational power provided by GPUs and other parallel computing platforms [16, 39, 46, 47].

A neural network is a function $\tilde{f}: \mathbb{R}^n \to \mathbb{R}^m$, where $n$ and $m$ are the number of inputs and outputs of the network. It is usually expressed through a repeated composition of affine (linear + constant) functions of the form

$y = Wx + b$    (1)

and non-linear functions of the form

$z = f(y)$    (2)

where $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $x \in \mathbb{R}^n$ is the input vector, while $b$, $y$ and $z \in \mathbb{R}^m$ are the bias, intermediate and output vectors, respectively. The function $f$ is typically a component-wise application of a monotonic non-decreasing function $\sigma$, such as the logistic (sigmoid) function $\sigma(y) = 1/(1+e^{-y})$, the rectified linear unit (ReLU) $\sigma(y) = \max(0, y)$, or the softplus $\sigma(y) = \ln(1+e^{y})$, which is a smooth approximation of ReLU [31, 15].
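To make these definitions concrete, here is a minimal NumPy sketch of the three activation functions and their first derivatives (the derivatives will be needed for backward propagation later; the function names are our own):

```python
import numpy as np

def sigmoid(y):
    # logistic function: 1 / (1 + e^{-y})
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_prime(y):
    s = sigmoid(y)
    return s * (1.0 - s)

def relu(y):
    # rectified linear unit: max(0, y)
    return np.maximum(0.0, y)

def relu_prime(y):
    # undefined at y = 0; we pick 0 there by convention
    return np.where(y > 0, 1.0, 0.0)

def softplus(y):
    # smooth approximation of ReLU: ln(1 + e^y)
    return np.log1p(np.exp(y))

def softplus_prime(y):
    # the derivative of softplus is the sigmoid
    return sigmoid(y)
```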

A neural network can also be thought of as a composition of neurons, which add the weighted input signals $w_j x_j$ with a bias $b$ and pass the intermediate result $y = \sum_j w_j x_j + b$ through a threshold activation function $f$ to obtain an output $z$, as shown in Fig. 1. These single neurons can be further organized into layers, such as the fully connected layer shown in Fig. 2. If the layers are stacked together, with at least one hidden layer that does not directly produce a final output, we refer to this network as a deep neural network [5, 37].


Fig. 1: A Single Neuron


Fig. 2: A Single Fully Connected Layer

The connections between neurons and layers determine the type of a neural network. In particular, in this paper we will work with feedforward (FNNs) and recurrent (RNNs) neural networks with fully connected layers [19, 27, 34]. We point out that in general CNNs can be expressed as FNNs [10, 26, 28].

The weights associated with a neural network can be viewed as the coefficients of the function $\tilde{f}$ defined by it. In pattern recognition we often would like to find these coefficients such that the function $\tilde{f}$ approximates a training data set $\mathcal{D}$ according to a loss function $\mathcal{L}$ in the best possible way. The expectation is that when a new, previously unknown input is presented to the neural network, it will then be able to produce a reasonably good approximation of the corresponding output. The process of finding the coefficients of $\tilde{f}$ is called training, while the process of approximating the output for a previously unknown input is called inference [4, 16].

The data set $\mathcal{D}$ is composed of data samples $(x_i, z_i^*)$, which are pairs of known inputs $x_i \in \mathbb{R}^n$ and outputs $z_i^* \in \mathbb{R}^m$. These pairs are often ordered and further partitioned into disjoint mini-batches $\mathcal{B}_k$, so that $\mathcal{D} = \cup_k \mathcal{B}_k$ and $\mathcal{B}_j \cap \mathcal{B}_k = \emptyset$ for $j \neq k$. We assume that the total number of pairs $q$ is divisible by the mini-batch size $r$; otherwise the last mini-batch is padded. Also, we do not consider the problem of splitting data into training, validation and test partitions, which are designed to prevent over-fitting and validate the results. We assume this has already been done, and that we are already working with the training data set.

The choice of the loss function often depends on a particular application. In this paper we will assume that it has the following form

$\mathcal{L} = \sum_{i=1}^{q} E_i$, with $E_i = \|z_i^* - z_i\|_F^2$    (3)

where $z_i^*$ is the correct and $z_i$ is the obtained output for a given input $x_i$, while $\|\cdot\|_F$ denotes the Frobenius norm.

In order to find the coefficients of $\tilde{f}$ we must find

$\min_{W^{(l)}, b^{(l)}} \mathcal{L}$    (4)

Notice that we are not trying to find weights and bias that result in a minimum for a particular data point, but rather ones that work "on average" across the entire training data set. In the next sections we will choose to work with the scaled loss $\hat{E} = \frac{1}{2} E$ to simplify the formulas.

The process of adjusting the weights of the neural network to find the minimum of (4) is called learning. In particular, when the updates to the weights are based on a single data sample the learning is called online, when they are based on several data samples it is called mini-batch, and when they are based on all available data samples it is called batch learning.

In practice, due to the large amounts of data, the weight updates are often made based on the partial information obtained from the individual components $E_i$ of the loss function $\mathcal{L}$. Notice that in this case we are essentially minimizing a function that varies across multiple inputs, and we can interpret this process as a form of stochastic optimization. We note that the optimization process makes a pass over the entire data set, making updates to the weights, before proceeding to the next iteration. In this context, a pass over the training data set is called an epoch.
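To make the terminology concrete, one epoch of mini-batch learning can be organized as in the following schematic NumPy sketch. The helper `grad_E`, which returns the gradients of a single loss component, is a hypothetical placeholder for the backward propagation derived in the next sections:

```python
import numpy as np

def run_epoch(pairs, weights, alpha, r, grad_E):
    """One pass (epoch) over the training set in mini-batches of size r.

    pairs   -- list of (x, z_star) training samples
    weights -- list of parameter arrays, updated in place
    alpha   -- learning rate
    grad_E  -- function returning the gradients of one loss component
    """
    for start in range(0, len(pairs), r):
        batch = pairs[start:start + r]
        # accumulate the gradient of the loss components over the mini-batch
        grads = [np.zeros_like(w) for w in weights]
        for x, z_star in batch:
            for g, gi in zip(grads, grad_E(weights, x, z_star)):
                g += gi
        # gradient descent step based on the mini-batch
        for w, g in zip(weights, grads):
            w -= alpha * g
```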

There are many optimization algorithms, with different tradeoffs, that can find the minimum of the problem (4). Some rely only on function evaluations, many take advantage of the gradient $\nabla \mathcal{L}$, while others require knowledge of second order effects through the Hessian or its approximation [7, 32, 41]. The most popular approaches for this problem are currently based on variations of the stochastic gradient descent (SGD) method, which relies exclusively on function evaluations and gradients [6, 13, 33, 35].

These methods require an evaluation of the partial derivatives of the loss function or its components with respect to the weights (for simplicity we have dropped the subscript $i$ in $E_i$). However, notice that the function $\tilde{f}$ specifying the neural network is not given explicitly; it is typically only defined through a composition of affine and non-linear functions $f$. In the next sections we will discuss the process for evaluating $\tilde{f}$, called forward propagation, and for evaluating its derivatives, called backward propagation [2, 38, 44, 45].

2 Contributions

In this paper we focus on the linear algebra theory behind neural networks, which often generalizes across many of their types. First, we briefly review backward propagation, obtaining an expression for the weight gradient at level $l$ as the rank-1 matrix

$\frac{\partial \hat{E}}{\partial W^{(l)}} = v^{(l)} z^{(l-1)T}$    (5)

for FNNs and the rank-$t$ matrix

$\frac{\partial \hat{E}}{\partial W^{(l)}} = \sum_{\tau=1}^{t} v^{(l,\tau)} z^{(l-1,\tau)T}$    (6)

for RNNs at time step $t$. Here, the yet to be specified vector $v$ is related to $\partial \hat{E} / \partial y$, while the vector $z$ is the input to the current layer of the neural network. Therefore, we conclude that for a mini-batch of size $r$, the weight update can be expressed as a rank-$rt$ matrix.

Then, we obtain a new exact expression for the weight Hessian, as a Kronecker product¹

$\frac{\partial^2 \hat{E}}{\partial (W^{(l)})^2} = (z^{(l-1)} z^{(l-1)T}) \otimes H^{(l)}$    (7)

for FNNs, and as a sum of Kronecker products

$\frac{\partial^2 \hat{E}}{\partial (W^{(l)})^2} = \sum_{\tau=1}^{t} \sum_{\theta=1}^{t} (z^{(l-1,\tau)} z^{(l-1,\theta)T}) \otimes H^{(l,\tau\theta)}$    (8)

for RNNs at time step $t$. Here, the yet to be specified matrix $H$ is related to $\partial^2 \hat{E} / \partial y^2$ and will be shown to have the form $W^T A W$, for some matrix $A$ and weight matrix $W$.

¹ The Kronecker product of an $m \times n$ matrix $A = [a_{ij}]$ and a $p \times q$ matrix $B$ is defined as the $mp \times nq$ matrix $A \otimes B = [a_{ij} B]$.

Also, we show that the expression for the Hessian can be further simplified for the ReLU activation function by taking advantage of the fact that its second derivative $\sigma''(y) = 0$ for $y \neq 0$.

Finally, using the fact that the Kronecker products (7) and (8) involve a rank-1 matrix, we will show that the eigenvalues of the Hessian matrix can be expressed as

$\lambda = \|z^{(l-1)}\|_2^2 \, \lambda(H^{(l)})$    (9)

for FNNs, with a related expression for RNNs. Therefore, we can determine whether the Hessian is positive definite, negative definite or indefinite by looking only at the eigenvalues of the matrix $H^{(l)}$.

3 Feedforward Neural Network (FNN)

The feedforward neural network consists of a set of stacked fully connected layers. The layers are defined by their matrix of weights $W^{(l)}$ and vector of bias $b^{(l)}$. They accept an input vector $z^{(l-1)}$ and produce an output vector $z^{(l)}$ for $l = 1, \ldots, L$, with the last vector $z^{(L)}$ being the output of the entire neural network, as shown in Fig. 3.

In some cases the neural network can be simplified to have sparse connections, resulting in a sparse matrix of weights $W^{(l)}$. However, the connections between layers are always such that the data flow graph is a directed acyclic graph (DAG).


Fig. 3: A Sample Feedforward Neural Network (FNN) with $L$ Levels

3.1 Forward Propagation

Let us assume that we are given an input $x$; then we can compute the output $z^{(L)}$ of the neural network by repeated application of the formulas

$y^{(l)} = W^{(l)} z^{(l-1)} + b^{(l)}$    (10)
$z^{(l)} = f(y^{(l)})$    (11)

for $l = 1, \ldots, L$, where $z^{(0)} = x$. This process is called forward propagation.
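As an illustration, (10)-(11) translate directly into the following NumPy sketch (our own naming; `f` is any component-wise activation, such as the sigmoid defined earlier):

```python
import numpy as np

def forward(W, b, x, f):
    """Forward propagation for an L-layer FNN, per (10)-(11).

    W, b -- lists of weight matrices W[l] and bias vectors b[l], l = 0..L-1
    x    -- input vector z^(0)
    f    -- component-wise activation function
    Returns the pre-activations y[l] and outputs z[l] of every layer.
    """
    z = [x]          # z[0] = x
    y = []
    for Wl, bl in zip(W, b):
        y.append(Wl @ z[-1] + bl)   # (10): y^(l) = W^(l) z^(l-1) + b^(l)
        z.append(f(y[-1]))          # (11): z^(l) = f(y^(l))
    return y, z
```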

3.2 Backward Propagation

Let us assume that we have a data sample $(x, z^*)$ from the training data set $\mathcal{D}$, as we have discussed in the introduction. Notice that using forward propagation we may also compute the actual output $z^{(L)}$ and the error (associated with the scaled loss $\hat{E} = \frac{1}{2} \|z^* - z^{(L)}\|_2^2$)

$\delta = z^{(L)} - z^*$    (12)

with the generalization from online learning with a single pair to mini-batch and batch learning with $r$ and $q$ pairs being trivial.

We would like to find the solution of the optimization problem (4) by adjusting the weights $W^{(l)}$ and bias $b^{(l)}$ on each layer of the network based on the individual components $E$ of the loss function $\mathcal{L}$. The process of updating the weights and bias can be written as

$W^{(l)} \leftarrow W^{(l)} - \alpha \frac{\partial \hat{E}}{\partial W^{(l)}}$    (13)
$b^{(l)} \leftarrow b^{(l)} - \alpha \frac{\partial \hat{E}}{\partial b^{(l)}}$    (14)

where $\alpha > 0$ is some constant, often referred to as the learning rate.

It is natural to express these weight and bias updates based on the gradients

$\frac{\partial \hat{E}}{\partial W^{(l)}}$ and $\frac{\partial \hat{E}}{\partial b^{(l)}}$    (15)

which indicate the directions in which the total effect of the weights and bias on the loss function component is the largest, respectively.

Lemma 1.

Let the feedforward neural network be defined by (10) and (11), and the loss function component by (12). Then the gradients of the weights and bias can be written as

$\frac{\partial \hat{E}}{\partial W^{(l)}} = v^{(l)} z^{(l-1)T}$    (16)
$\frac{\partial \hat{E}}{\partial b^{(l)}} = v^{(l)}$    (17)

where $\partial \hat{E} / \partial W^{(l)}$ is an $m_l \times m_{l-1}$ matrix and $v^{(l)}$ is an $m_l \times 1$ vector, with

$v^{(L)} = f'(y^{(L)}) \circ \delta$    (18)
$v^{(l)} = f'(y^{(l)}) \circ (W^{(l+1)T} v^{(l+1)})$    (19)

for $l = L-1, \ldots, 1$, where $\circ$ is the Hadamard (component-wise) product and $f'(y) = [\sigma'(y_1), \ldots, \sigma'(y_m)]^T$.

Proof.

Notice that taking the partial derivative of the loss function component-wise with respect to the weight $w_{ij}^{(l)}$ we can write

$\frac{\partial \hat{E}}{\partial w_{ij}^{(l)}} = \sum_k \frac{\partial \hat{E}}{\partial y_k^{(l)}} \frac{\partial y_k^{(l)}}{\partial w_{ij}^{(l)}}$    (20)
$= \frac{\partial \hat{E}}{\partial y_i^{(l)}} z_j^{(l-1)}$    (21)
$= v_i^{(l)} z_j^{(l-1)}$    (22)

where $'$ denotes a simple ordinary derivative and $v_i^{(l)} = \partial \hat{E} / \partial y_i^{(l)}$.

Also, notice that for the output layer $L$, using (12), we have

$v_i^{(L)} = \frac{\partial \hat{E}}{\partial y_i^{(L)}} = \sigma'(y_i^{(L)}) \, \delta_i$    (23)

while for the hidden layers, using the chain rule, we have

$v_i^{(l)} = \frac{\partial \hat{E}}{\partial y_i^{(l)}} = \sum_k \frac{\partial \hat{E}}{\partial y_k^{(l+1)}} \frac{\partial y_k^{(l+1)}}{\partial y_i^{(l)}}$    (24)
$= \sigma'(y_i^{(l)}) \sum_k w_{ki}^{(l+1)} v_k^{(l+1)}$    (25)

Finally, assembling the indices $i$ and $j$ into vector and matrix form we obtain the expression (16) for $\partial \hat{E} / \partial W^{(l)}$. The derivation for $\partial \hat{E} / \partial b^{(l)}$ is analogous, with $\partial y_k^{(l)} / \partial b_i^{(l)} = \delta_{ki}$ replacing $\partial y_k^{(l)} / \partial w_{ij}^{(l)}$ in (20). ∎

Notice that the computation of the auxiliary vectors $v^{(l)}$ in (18) - (19) represents the propagation of the error (12) from the output layer $L$ through the hidden network layers $l = L-1, \ldots, 1$. Therefore, this process is often called backward propagation.
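As an illustration of Lemma 1, the following NumPy sketch (our own naming, reusing the `forward` routine above) computes the gradients (16)-(17) for every layer:

```python
import numpy as np

def backward(W, y, z, z_star, f_prime):
    """Backward propagation for an L-layer FNN, per Lemma 1.

    W       -- list of weight matrices
    y, z    -- pre-activations and outputs from forward propagation
    z_star  -- correct output for this data sample
    f_prime -- component-wise derivative of the activation
    Returns lists dW, db with dW[l] = v^(l) z^(l-1)^T and db[l] = v^(l).
    """
    L = len(W)
    v = f_prime(y[-1]) * (z[-1] - z_star)        # (18): v^(L) = f'(y^(L)) o delta
    dW = [None] * L
    db = [None] * L
    for l in range(L - 1, -1, -1):
        dW[l] = np.outer(v, z[l])                # (16): rank-1 outer product
        db[l] = v.copy()                         # (17)
        if l > 0:
            v = f_prime(y[l - 1]) * (W[l].T @ v) # (19): propagate error backward
    return dW, db
```

For a data sample `(x, z_star)` one would call `y, z = forward(W, b, x, sigmoid)` followed by `dW, db = backward(W, y, z, z_star, sigmoid_prime)`.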

Corollary 1.

Let the feedforward neural network be defined by (10) and (11), and the loss function by (3). Then, for a mini-batch of size $r$ the weight update based on the gradient in (16) can be expressed as the rank-$r$ matrix

$\Delta W^{(l)} = \sum_{i=1}^{r} v_i^{(l)} z_i^{(l-1)T}$    (26)

where $v_i^{(l)}$ and $z_i^{(l-1)}$ correspond to the data pairs $(x_i, z_i^*)$.
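The rank claim is easy to check numerically: a sum of $r$ outer products as in (26) has rank at most $r$. A small sketch, with arbitrary dimensions of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 5                      # layer sizes and mini-batch size

# sum of r rank-1 terms v_i z_i^T, as in (26)
dW = sum(np.outer(rng.standard_normal(m), rng.standard_normal(n))
         for _ in range(r))

print(np.linalg.matrix_rank(dW))         # prints 5 (= r) almost surely
```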

3.3 Hessian and Second Order Effects

We can also incorporate second order effects based on the Hessians

$\frac{\partial^2 \hat{E}}{\partial (W^{(l)})^2}$ and $\frac{\partial^2 \hat{E}}{\partial (b^{(l)})^2}$    (27)

into the optimization process for updating the weights by looking at the expression for the Hessian of the neural network.

Theorem 1.

Let the feedforward neural network be defined by (10) and (11), and the loss function component by (12). Then the Hessians of the weights and bias can be written as

$\frac{\partial^2 \hat{E}}{\partial (W^{(l)})^2} = (z^{(l-1)} z^{(l-1)T}) \otimes H^{(l)}$    (28)
$\frac{\partial^2 \hat{E}}{\partial (b^{(l)})^2} = H^{(l)}$    (29)

where $\partial^2 \hat{E} / \partial (W^{(l)})^2$ is an $m_l m_{l-1} \times m_l m_{l-1}$ matrix and $H^{(l)}$ is an $m_l \times m_l$ matrix, with

$H^{(L)} = D(f''(y^{(L)}) \circ u^{(L)}) + D(f'(y^{(L)})) \, A^{(L)} \, D(f'(y^{(L)}))$    (30)
$H^{(l)} = D(f''(y^{(l)}) \circ u^{(l)}) + D(f'(y^{(l)})) \, A^{(l)} \, D(f'(y^{(l)}))$    (31)
$A^{(L)} = I$    (32)
$A^{(l)} = W^{(l+1)T} H^{(l+1)} W^{(l+1)}$    (33)

where $I$ is the $m_L \times m_L$ identity matrix, $D(x)$ denotes the diagonal matrix with the vector $x$ on its diagonal, and the vectors

$u^{(L)} = \delta$    (34)
$u^{(l)} = W^{(l+1)T} v^{(l+1)}$    (35)

for $l = L-1, \ldots, 1$, where $\circ$ is the Hadamard (component-wise) and $\otimes$ is the Kronecker matrix product, while the vectors $f'(y) = [\sigma'(y_i)]$ and $f''(y) = [\sigma''(y_i)]$.

Proof.

Notice that using (21) the second derivative with respect to the weights is

$\frac{\partial^2 \hat{E}}{\partial w_{kp}^{(l)} \partial w_{ij}^{(l)}} = \frac{\partial}{\partial w_{kp}^{(l)}} \left( v_i^{(l)} z_j^{(l-1)} \right)$    (36)
$= \frac{\partial v_i^{(l)}}{\partial y_k^{(l)}} z_p^{(l-1)} z_j^{(l-1)}$    (37)

where we have taken advantage of the fact that

$\frac{\partial y_k^{(l)}}{\partial w_{ij}^{(l)}} = \delta_{ki} z_j^{(l-1)}$    (38)

and the previous level output $z^{(l-1)}$ does not depend on the current level weights $W^{(l)}$ and therefore is treated as a constant, while $\delta_{ki}$ is the Kronecker delta ($\delta_{ki} = 1$ if $k = i$ and $0$ otherwise).

Let us now find an expression for the first term in (37). Notice that using (23) at the output layer we have

$\frac{\partial v_i^{(L)}}{\partial y_k^{(L)}} = \left( \sigma''(y_i^{(L)}) \, \delta_i + \sigma'(y_i^{(L)})^2 \right) \delta_{ik}$    (39)

while using (24) at the hidden layers we may write

$\frac{\partial v_i^{(l)}}{\partial y_k^{(l)}} = \sigma''(y_i^{(l)}) \, u_i^{(l)} \delta_{ik} + \sigma'(y_i^{(l)}) \sum_p w_{pi}^{(l+1)} \frac{\partial v_p^{(l+1)}}{\partial y_k^{(l)}}$    (40)
$= \sigma''(y_i^{(l)}) \, u_i^{(l)} \delta_{ik} + \sigma'(y_i^{(l)}) \, \sigma'(y_k^{(l)}) \sum_p \sum_q w_{pi}^{(l+1)} H_{pq}^{(l+1)} w_{qk}^{(l+1)}$    (41)

where we have used (38) and the fact that the current level weights $W^{(l+1)}$ do not depend on the previous level output and therefore are treated as constants.

We may conclude the proof by noticing the following two results. First, a matrix with block elements $z_j z_p H$ for $j, p = 1, \ldots, m_{l-1}$ can be expressed as the Kronecker product $H \otimes (z z^T)$, which under a permutation is equivalent to $(z z^T) \otimes H$. Second, the matrix $W^T H W$ has elements $\sum_p \sum_q w_{pi} H_{pq} w_{qk}$. The former and the latter results can be used to write (37) and (41) in matrix form, respectively.

Finally, the derivation for $\partial^2 \hat{E} / \partial (b^{(l)})^2$ is analogous, with $z_j^{(l-1)}$ and $z_p^{(l-1)}$ replaced by $1$ in (37). ∎

Notice that using (39) we may drop the double sum at the output level $L$ and write

$\frac{\partial^2 \hat{E}}{\partial w_{kp}^{(L)} \partial w_{ij}^{(L)}} = \left( \sigma''(y_i^{(L)}) \, \delta_i + \sigma'(y_i^{(L)})^2 \right) \delta_{ik} \, z_j^{(L-1)} z_p^{(L-1)}$    (42)

which matches the expression obtained for a single hidden layer in [3]. However, we may not drop the double sum at an arbitrary layer $l$, because in general the term $\partial v_i^{(l)} / \partial y_k^{(l)}$ may be nonzero even when $i \neq k$.

Finally, notice that to compute the Hessian we once again need to perform backward propagation, both for the vectors $u^{(l)}$ in (34) - (35) and for the matrices $A^{(l)}$ in (32) - (33).
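As a numerical sanity check of Theorem 1, the following self-contained NumPy sketch (our own construction: a two-layer network with sigmoid activations and small arbitrary sizes) assembles the Hessian of the first-layer weights from (28) and (30)-(35), and compares it with a finite-difference approximation. Note that NumPy flattens $W^{(1)}$ in row-major order, so the Kronecker factors appear in the order $H^{(1)} \otimes (x x^T)$; this is exactly the permutation equivalence mentioned in the proof.

```python
import numpy as np

def sigmoid(y): return 1.0 / (1.0 + np.exp(-y))
def s1(y): s = sigmoid(y); return s * (1 - s)                # sigma'
def s2(y): s = sigmoid(y); return s * (1 - s) * (1 - 2 * s)  # sigma''

rng = np.random.default_rng(1)
n0, n1, n2 = 2, 3, 2                           # layer sizes m_0, m_1, m_2
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal(n1)
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal(n2)
x, z_star = rng.standard_normal(n0), rng.standard_normal(n2)

def loss(w1):                                  # scaled loss E-hat as a function of vec(W1)
    z1 = sigmoid(w1.reshape(n1, n0) @ x + b1)
    z2 = sigmoid(W2 @ z1 + b2)
    return 0.5 * np.sum((z2 - z_star) ** 2)

# backward propagation for v and u, per (18)-(19) and (34)-(35)
y1 = W1 @ x + b1; y2 = W2 @ sigmoid(y1) + b2
delta = sigmoid(y2) - z_star
v2 = s1(y2) * delta
u1 = W2.T @ v2
# Hessian recursions (30)-(33): H2 uses A2 = I, H1 uses A1 = W2^T H2 W2
H2 = np.diag(s2(y2) * delta) + np.diag(s1(y2)) @ np.eye(n2) @ np.diag(s1(y2))
H1 = np.diag(s2(y1) * u1) + np.diag(s1(y1)) @ (W2.T @ H2 @ W2) @ np.diag(s1(y1))
hess = np.kron(H1, np.outer(x, x))             # (28), in row-major vec ordering

# central finite differences as a reference
w, k, eps = W1.flatten(), n1 * n0, 1e-5
ref = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        ei, ej = np.eye(k)[i] * eps, np.eye(k)[j] * eps
        ref[i, j] = (loss(w + ei + ej) - loss(w + ei - ej)
                     - loss(w - ei + ej) + loss(w - ei - ej)) / (4 * eps ** 2)

print(np.max(np.abs(hess - ref)))              # ~1e-6: analytic and numeric agree
```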

Corollary 2.

Suppose that we are using the piecewise continuous ReLU activation function $\sigma(y) = \max(0, y)$ in Theorem 1. Notice that its first derivative $\sigma'(y) = 1$ if $y > 0$, $\sigma'(y) = 0$ if $y < 0$, and is undefined if $y = 0$. Also, its second derivative $\sigma''(y) = 0$ for $y \neq 0$. Then, for $y_i^{(l)} \neq 0$ the Hessian of the weights can be written as

$\frac{\partial^2 \hat{E}}{\partial (W^{(L)})^2} = (z^{(L-1)} z^{(L-1)T}) \otimes D^{(L)}$    (43)
$\frac{\partial^2 \hat{E}}{\partial (W^{(l)})^2} = (z^{(l-1)} z^{(l-1)T}) \otimes \left( D^{(l)} W^{(l+1)T} H^{(l+1)} W^{(l+1)} D^{(l)} \right)$    (44)

with $H^{(L)} = D^{(L)}$ and the binary diagonal matrix $D^{(l)} = D(f'(y^{(l)}))$ for $l = L-1, \ldots, 1$.

Notice that the eigenvalues of the Kronecker product of two square $m \times m$ and $n \times n$ matrices $A$ and $B$ are

$\lambda_{ij}(A \otimes B) = \lambda_i(A) \, \mu_j(B)$    (45)

for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, where $\lambda_i$ and $\mu_j$ are the eigenvalues of $A$ and $B$, respectively; see Theorem 4.2.12 in [22]. Therefore, since the rank-1 matrix $z^{(l-1)} z^{(l-1)T}$ has the single nonzero eigenvalue $\|z^{(l-1)}\|_2^2$, the eigenvalues of the Hessian matrix (28) can be expressed as

$\lambda = \|z^{(l-1)}\|_2^2 \, \mu_j(H^{(l)})$ and $\lambda = 0$    (46)

with the zero eigenvalue multiplicity being $m_l (m_{l-1} - 1)$.
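A quick numerical confirmation of (45)-(46), with arbitrary sizes of our choosing (here `H` stands in for $H^{(l)}$ and `z` for $z^{(l-1)}$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3                                          # sizes of H^(l) and z^(l-1)
H = rng.standard_normal((m, m)); H = (H + H.T) / 2   # symmetric, like a Hessian
z = rng.standard_normal(n)

ev = np.linalg.eigvalsh(np.kron(np.outer(z, z), H))  # spectrum of (z z^T) (x) H
nonzero = np.sort(ev[np.abs(ev) > 1e-8])

# nonzero eigenvalues are ||z||^2 times those of H, per (46)
print(np.allclose(nonzero, np.sort(z @ z * np.linalg.eigvalsh(H))))
print((np.abs(ev) <= 1e-8).sum() == m * (n - 1))     # zero-eigenvalue multiplicity
```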

4 Recurrent Neural Network (RNN)

The recurrent neural network consists of a set of stacked fully connected layers, where neurons can receive feedback from other neurons at the previous, same and next layer at earlier time steps. However, in this paper for simplicity we will assume that the feedback is received only from the same level at earlier time steps, as shown in Fig. 4.

Therefore, the layers are defined by their matrix of weights $W^{(l)}$, matrix of feedback $U^{(l)}$ and vector of bias $b^{(l)}$. They accept an input vector $z^{(l-1,\tau)}$ from the previous level and the hidden state vector $z^{(l,\tau-1)}$ from the previous time step. They produce an output vector $z^{(l,\tau)}$ for layers $l = 1, \ldots, L$ and time steps $\tau = 1, \ldots, t$. The output of the entire neural network is often a sub-sequence of the vectors $z^{(L,\tau)}$ at the last layer $L$ and time steps $\tau = t_0, \ldots, t$, with starting time step $t_0 \geq 1$.

In some cases the neural network can be simplified to have sparse connections, resulting in sparse matrices of weights $W^{(l)}$ and feedback $U^{(l)}$. Also, notice that in our example the connections within a layer have cycles due to feedback, but the connections between layers are always such that the data flow graph between them is a DAG.


Fig. 4: A Sample Recurrent Neural Network (RNN) with $L$ Levels and $t$ Time Steps

4.1 Forward Propagation

Let us assume that we are given an input sequence $x_1, \ldots, x_t$; then we can compute the output sequence generated by the neural network by repeated application of the formulas

$y^{(l,\tau)} = W^{(l)} z^{(l-1,\tau)} + U^{(l)} z^{(l,\tau-1)} + b^{(l)}$    (47)
$z^{(l,\tau)} = f(y^{(l,\tau)})$    (48)

for $l = 1, \ldots, L$ and $\tau = 1, \ldots, t$, where the initial hidden state $z^{(l,0)} = 0$ and the input $z^{(0,\tau)} = x_\tau$. Notice that the final output is often a sub-sequence $z^{(L,\tau)}$ for $\tau = t_0, \ldots, t$, with starting time step $t_0 \geq 1$. This process is called forward propagation.
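As an illustration, (47)-(48) can be transcribed into the following NumPy sketch (our own naming, with `U` playing the role of the feedback matrices $U^{(l)}$ under the single-layer-feedback assumption above):

```python
import numpy as np

def rnn_forward(W, U, b, xs, f):
    """Forward propagation for an L-layer RNN over t time steps, per (47)-(48).

    W, U, b -- per-layer weight, feedback and bias parameters (lists of length L)
    xs      -- input sequence x_1, ..., x_t (list of vectors)
    f       -- component-wise activation function
    Returns z with z[l][tau] = z^(l,tau); z[0] holds the inputs.
    """
    L, t = len(W), len(xs)
    z = [[None] + list(xs)]                       # level 0: z^(0,tau) = x_tau
    for l in range(L):
        # z^(l,0) = 0 is the initial hidden state of level l
        z.append([np.zeros(b[l].size)] + [None] * t)
    for tau in range(1, t + 1):
        for l in range(1, L + 1):
            y = W[l-1] @ z[l-1][tau] + U[l-1] @ z[l][tau-1] + b[l-1]  # (47)
            z[l][tau] = f(y)                                          # (48)
    return z
```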

4.2 Backward Propagation (Through Time)

Let us assume that we have a data sample $(x, z^*)$ from the training data set $\mathcal{D}$, as we have discussed in the introduction. Notice that here the input and the output are actually sequences $x_\tau$ for time steps $\tau = 1, \ldots, t$ and $z_\tau^*$ for $\tau = t_0, \ldots, t$ with starting time step $t_0$, respectively. Also, notice that using forward propagation we may compute the actual outputs $z^{(L,\tau)}$ and the errors (associated with the scaled loss $\hat{E} = \frac{1}{2} \sum_{\tau=t_0}^{t} \|z_\tau^* - z^{(L,\tau)}\|_2^2$)

$\delta_\tau = z^{(L,\tau)} - z_\tau^*$    (49)

with the generalization from online learning with a single sequence to mini-batch and batch learning being trivial.

We would like to find the solution of the optimization problem (4) by adjusting the weights $W^{(l)}$, feedback $U^{(l)}$ and bias $b^{(l)}$ on each layer of the network based on the individual components $E$ of the loss function $\mathcal{L}$, so that the updating process can be written as

$W^{(l)} \leftarrow W^{(l)} - \alpha \frac{\partial \hat{E}}{\partial W^{(l)}}$    (50)
$U^{(l)} \leftarrow U^{(l)} - \alpha \frac{\partial \hat{E}}{\partial U^{(l)}}$    (51)
$b^{(l)} \leftarrow b^{(l)} - \alpha \frac{\partial \hat{E}}{\partial b^{(l)}}$    (52)

where $\alpha > 0$ is some constant, often referred to as the learning rate.

It is natural to express these weight, feedback and bias updates based on

$\frac{\partial \hat{E}}{\partial W^{(l)}}$, $\frac{\partial \hat{E}}{\partial U^{(l)}}$ and $\frac{\partial \hat{E}}{\partial b^{(l)}}$    (53)

which indicate the directions in which the total effect of the weights, feedback and bias on the loss function component is the largest, respectively. Notice in turn that these quantities can be expressed through the sum of their sub-components

$\frac{\partial \hat{E}}{\partial W^{(l)}} = \sum_{\tau=1}^{t} \frac{\partial \hat{E}_\tau}{\partial W^{(l)}}$    (54)

for $\tau = 1, \ldots, t$, and analogously for $U^{(l)}$ and $b^{(l)}$, which will be our focus next.

Lemma 2.

Let the recurrent neural network be defined by (47) and (48), and the loss function components by (49). Then the gradients of the weights, feedback and bias can be written as

$\frac{\partial \hat{E}}{\partial W^{(l)}} = \sum_{\tau=1}^{t} v^{(l,\tau)} z^{(l-1,\tau)T}$    (55)
$\frac{\partial \hat{E}}{\partial U^{(l)}} = \sum_{\tau=1}^{t} v^{(l,\tau)} z^{(l,\tau-1)T}$    (56)
$\frac{\partial \hat{E}}{\partial b^{(l)}} = \sum_{\tau=1}^{t} v^{(l,\tau)}$    (57)

where $\partial \hat{E} / \partial W^{(l)}$ is an $m_l \times m_{l-1}$ matrix, $\partial \hat{E} / \partial U^{(l)}$ is an $m_l \times m_l$ matrix and $v^{(l,\tau)}$ is an $m_l \times 1$ vector, with

$v^{(L,\tau)} = f'(y^{(L,\tau)}) \circ (\delta_\tau + U^{(L)T} v^{(L,\tau+1)})$    (58)

and

$v^{(l,\tau)} = f'(y^{(l,\tau)}) \circ u^{(l,\tau)}$    (59)
$u^{(l,\tau)} = W^{(l+1)T} v^{(l+1,\tau)} + U^{(l)T} v^{(l,\tau+1)}$    (60)

for $l = L-1, \ldots, 1$ and $\tau = t, \ldots, 1$, where we consider the terms for time $\tau = t+1$ to be zero. Also, $\circ$ is the Hadamard (component-wise) product, and $\delta_\tau = 0$ for $\tau < t_0$.