Train Feedforward Neural Network with Layer-wise Adaptive Rate via Approximating Back-matching Propagation

02/27/2018 ∙ by Huishuai Zhang et al. ∙ Microsoft

Stochastic gradient descent (SGD) has achieved great success in training deep neural networks, where the gradient is computed through back-propagation. However, the back-propagated values of different layers vary dramatically. This inconsistency of gradient magnitudes across layers makes optimizing a deep neural network with a single learning rate problematic. We introduce the back-matching propagation, which computes the backward values on a layer's parameter and input by matching the backward value on the layer's output. This leads to solving a set of least-squares problems, which is computationally expensive. We then reduce the back-matching propagation with approximations and obtain an algorithm that turns out to be the regular SGD with a layer-wise adaptive learning rate strategy. This allows an easy implementation of our algorithm in current machine learning frameworks equipped with auto-differentiation. We apply our algorithm to training modern deep neural networks and achieve favorable results over SGD.


1 Introduction

Deep neural networks have been advancing the state-of-the-art performance on a number of tasks in artificial intelligence, from speech recognition (Hinton et al., 2012) and computer vision (He et al., 2016) to natural language understanding (Hochreiter & Schmidhuber, 1997). These problems are typically formulated as minimizing non-convex objectives parameterized by neural network models. The models are typically trained with stochastic gradient descent (SGD) or its variants, and the gradient information is computed through back-propagation (BP) (Rumelhart et al., 1986).

However, the magnitudes of gradient components often vary dramatically in a neural network. Recall that one coordinate of the gradient is the directional derivative along that coordinate: it represents how a change in the weight affects the loss, rather than how we should modify the weight to minimize the loss. Thus vanilla SGD with a single learning rate can be problematic for optimizing deep neural networks because of the inconsistent magnitudes of gradient components. In practice, an extremely small learning rate alleviates this problem but leads to slow convergence. Momentum (Rumelhart et al., 1986; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013) mitigates the problem by accumulating velocity along coordinates with small magnitude but consistent direction and reducing the velocity along coordinates with large magnitudes but opposite directions. Adaptive learning rate algorithms (Duchi et al., 2011; Kingma & Ba, 2014) scale the coordinates of the gradient by the reciprocals of averages of their past magnitudes, which confirms from another angle that weakening the effect of the gradient components' magnitudes can benefit optimization.

We want to address this problem from another perspective. Ye et al. (2017) suggests that the magnitude inconsistency of gradient components occurs mainly across layers. We can get a hint by scrutinizing the back-propagation through a fully connected layer (we omit bias terms for simplicity) with output $z$, input $x$ and weight parameter $W$. The layer mapping is given by $z = Wx$. If $\delta z$ is the back-propagated value on the output $z$, i.e., the vector of partial derivatives of the loss with respect to $z$, the back-propagation equation is given by

$\delta x_j = W_{:,j}^\top \delta z$,

where $W_{:,j}$ is the $j$-th column of $W$ (representing all the connections emanating from unit $x_j$) and $\delta x$ is the back-propagated value on the layer's input computed through back-propagation. If the rows of $W$ are initialized with (roughly) unit length to preserve the forward signal, then back-propagation through this layer shrinks the magnitude of the backward signal heavily when the number of inputs is much larger than the number of outputs.
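
To make the shrinkage concrete, here is a toy NumPy illustration (ours, not from the paper; the layer sizes are arbitrary) of how the per-component magnitude of the backward signal drops when a layer with unit-norm rows has many more inputs than outputs:

import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 4096                      # few outputs, many inputs (hypothetical sizes)
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # rows normalized to unit length

delta_z = rng.standard_normal(m)     # stand-in for the backward signal on the output
delta_x = W.T @ delta_z              # regular back-propagation through z = W x

rms = lambda v: np.sqrt(np.mean(v ** 2))
print(rms(delta_z), rms(delta_x))    # per-component magnitude drops by roughly sqrt(m/n)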

We suggest a principled way to overcome the problem of magnitude inconsistency of gradient components across layers in back-propagation. Specifically, we compute a change $\Delta W$ on the weight parameter and a backward value $\tilde{\delta} x$ on the input that match the error guiding signal $\delta z$ as closely as possible. This motivates us to formulate the backward pass through the fully connected layer as solving a group of least-squares problems. Hiding technical details and some assumptions, we propose the back-matching propagation as follows,

$\tilde{\delta} x_j = \dfrac{W_{:,j}^\top \delta z}{\|W_{:,j}\|^2}$,   (1)
$\Delta W = \mathbb{E}\big[\delta z\, x^\top\big]\,\big(\mathbb{E}\big[x x^\top\big]\big)^{-1}$,   (2)

where the expectation is over the data points in a mini-batch. A direct explanation of (2) is that we want to change the weight matrix by $\Delta W$ so as to produce the desired change $\delta z$ on the output (or something sufficiently close to it) given the current input $x$; equation (1) can be explained in the same way. We then use $\Delta W$ to update the parameter and use $\tilde{\delta} x$ as the error guiding signal to back-propagate to lower layers.

For the back-matching propagation (1), we need to compute $\|W_{:,j}\|^2$, which is easy since it is a scalar. For the parameter update solution (2), we need to compute the inverse $\big(\mathbb{E}[x x^\top]\big)^{-1}$. Across the network this requires a large number of matrix inversions, roughly as many as the number of neurons, and each inversion costs a number of flops cubic in the number of neurons in one layer. This prevents the method from being applied directly to large neural networks, which typically contain tens of thousands of neurons in a single layer.
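
For concreteness, the following NumPy sketch (ours; the sizes are arbitrary, and the batch is taken larger than the input dimension so that $\mathbb{E}[xx^\top]$ is invertible) computes the two back-matching quantities of a single fully connected layer for one mini-batch; the inverse in (2) is the expensive step:

import numpy as np

rng = np.random.default_rng(1)
batch, n_in, n_out = 1024, 256, 128          # hypothetical sizes
X = rng.standard_normal((batch, n_in))       # inputs x, one row per data point
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
dZ = rng.standard_normal((batch, n_out))     # guiding signal delta z on the output

# Equation (1): rescale each BP input value by 1 / ||W_{:,j}||^2 (cheap per-column scalars)
col_sq_norms = (W ** 2).sum(axis=0)          # shape (n_in,)
dX_tilde = (dZ @ W) / col_sq_norms           # dZ @ W is the regular BP value per sample

# Equation (2): Delta W = E[dz x^T] (E[x x^T])^{-1}, which needs an n_in x n_in inverse
Exx = X.T @ X / batch
Ezx = dZ.T @ X / batch
dW_tilde = Ezx @ np.linalg.inv(Exx)          # O(n_in^3) flops -- the costly step

print(dX_tilde.shape, dW_tilde.shape)        # (1024, 256) and (128, 256)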

Fortunately, we can work with the batch normalization (BN) technique Ioffe & Szegedy (2015) to circumvent this difficulty. With batch normalization, we regard $\mathbb{E}[x x^\top]$ as approximately an identity matrix and remove the inverse in (2). Then, with some further approximation, we can reduce the back-matching propagation to a layer-wise gradient adaptation strategy, which can be viewed as layer-wise adaptive learning rates when applying pure SGD. As this layer-wise gradient adaptation strategy is built on top of the regular BP process, it is easy to implement in current deep learning frameworks Bastien et al. (2012); Abadi et al. (2016); Paszke et al. (2017); Seide & Agarwal (2016). Moreover, the strategy works naturally with other popular optimization techniques (momentum, adaptive-gradient algorithms, weight decay) to achieve potentially higher performance in machine learning tasks. We expect this layer-wise gradient adaptation strategy to accelerate the training procedure. Surprisingly, it also often improves the test accuracy by a considerable margin in practice.

1.1 Related Works

Training neural networks with layer-wise adaptive learning rates has been proposed in several previous works. Specifically, Singh et al. (2015) suggested setting the learning rate of layer $l$ based on the norm of that layer's gradient. You et al. (2017) suggested a learning rate for layer $l$ proportional to the ratio of the layer's weight norm to its gradient norm, and demonstrated that this benefits large-batch training. However, the suggestions in both works come mainly from empirical experience and do not explain why the rate should be set in that way.

Our paper is related to the block-diagonal second-order algorithms Lafond et al. (2017); Zhang et al. (2017); Grosse & Martens (2016). Specifically, Lafond et al. (2017) proposes a weight reparametrization scheme with a diagonal rescaling step-size and shows its potential advantages over batch normalization. Zhang et al. (2017) proposes a block-diagonal Hessian-free method to train neural networks and shows a faster convergence rate than first-order methods. Martens & Grosse (2015); Grosse & Martens (2016) propose the Kronecker Factored Approximation (KFA) method to approximate the natural gradient using a block-diagonal or block-tridiagonal approximation of the Fisher matrix. These second-order algorithms all share a layer-wise or block-diagonal design, which agrees with our algorithm. However, our layer-wise adaptive learning rate comes from the perspective of back-matching propagation and is different from these second-order approximations.

Our paper is also related to the Riemannian algorithms Amari (1998); Ollivier (2015); Marceau-Caron & Ollivier (2016). Specifically, Ollivier (2015) proposes updating the parameter with the gradient preconditioned by a backpropagated metric. Similarly, Ye et al. (2017) advocates a related rescaled update of the parameter. In comparison, the back-matching propagation comes from a different perspective, namely that the back-propagated values should match the error guiding signal, and our layer-wise gradient adaptation strategy, which is derived from back-matching propagation, is simpler than the Riemannian algorithms in terms of implementation and computational complexity.

2 Back-matching Propagation

In this section, we present how the back-matching propagation works for several popular types of layers. Specifically, based on the back-matching propagation we derive the formulas for the backpropagated values on the layer's input and on the layer's parameters, given the backpropagated value on the layer's output. Moreover, we compare the back-matching propagation to the regular BP.

We introduce several notations here (some have been used in the Introduction). Let $\ell$ denote the objective (loss function). We use $z$ and $x$ to denote the layer's output and input respectively, and use $W$ to denote the layer's parameter. We use $\delta z$ to denote the back-propagated value on the layer's output. We then use $\delta x$ and $\delta W$ to denote the back-propagated values computed through BP, and use $\tilde{\delta} x$ and $\Delta W$ to denote the corresponding values computed through back-matching propagation.

Let us briefly review the regular BP here. BP propagates derivatives from the top layer back to the bottom one. Suppose we are dealing with a general layer with forward mapping $z = f(x, W)$, and let $\delta z_i = \partial \ell / \partial z_i$ be the derivative of the loss with respect to the $i$-th output component. The BP equations are given by

$\delta x_j = \sum_i \delta z_i\,\dfrac{\partial z_i}{\partial x_j}$,   (3)
$\delta W = \mathbb{E}_\xi\Big[\sum_i \delta z_i\,\dfrac{\partial z_i}{\partial W}\Big]$,   (4)

where $\xi$ represents a data point and the expectation is over the data points in a mini-batch.

Next we present how the back-matching propagation back-propagates through specific layers. For ease of comparison, for each type of layer we first provide the BP formulas, then derive the formulas via back-matching propagation, and finally discuss the relation between the two.

2.1 Fully Connected Layer

We first consider a fully connected layer, whose mapping function is given by (for simplicity we omit the bias term)

$z = Wx$.   (5)

Suppose the backpropagated values on the output are $\delta z$. Following the backpropagation equations (3) and (4), we compute the backpropagated values on the input and on the weight parameter as follows,

$\delta x_j = W_{:,j}^\top \delta z$,   (6)
$\delta W = \mathbb{E}_\xi\big[\delta z\, x^\top\big]$,   (7)

where $W_{:,j}$ is the $j$-th column of $W$.

We next derive the formulas for back-matching propagation, where we compute $\Delta W$ and $\tilde{\delta} x$ that try to match the guiding signal $\delta z$ as accurately as possible, in the sense of minimizing the squared error,

$\min_{\Delta W}\ \mathbb{E}_\xi\,\big\|\Delta W\,x - \delta z\big\|^2$,   (8)
$\min_{\tilde{\delta} x}\ \big\|W\,\tilde{\delta} x - \delta z\big\|^2$.   (9)

Note that by writing the matching problem as two independent problems (8) and (9), we presume that updating $W$ and propagating backward values are independent; such layer independence has also been used in block-diagonal second-order algorithms Zhang et al. (2017); Lafond et al. (2017). Moreover, (8) is separable along the rows of $\Delta W$. Hence we obtain a set of least-squares problems, one per output neuron,

$\min_{\Delta w_i}\ \mathbb{E}_\xi\,\big(\Delta w_i^\top x - \delta z_i\big)^2$,   (10)

where $\Delta w_i$ is the $i$-th row of $\Delta W$ and $\delta z_i$ represents the back-propagated values at neuron $i$ over one mini-batch of data. We further assume that all entries of $\tilde{\delta} x$ are computed independently, based on the intuition that a neuron does not know the other neurons' on/off states, so a fair strategy is for each neuron to match the guiding signal as well as it can by itself. Then (9) becomes a set of least-squares problems, one per input neuron,

$\min_{\tilde{\delta} x_j}\ \big\|W_{:,j}\,\tilde{\delta} x_j - \delta z\big\|^2$,   (11)

where $W_{:,j}$ contains the weights emanating from input neuron $j$ (the column of $W$ corresponding to neuron $j$). We call equations (8) and (11) the back-matching propagation rule. Solving the least-squares problems (11) and (8) gives us:

$\tilde{\delta} x_j = \dfrac{W_{:,j}^\top \delta z}{\|W_{:,j}\|^2}$,   (12)
$\Delta W = \mathbb{E}_\xi\big[\delta z\,x^\top\big]\,\big(\mathbb{E}_\xi\big[x x^\top\big]\big)^{-1}$.   (13)

From (6), (7), (12) and (13), we can see how the back-matching propagation is related to the regular BP:

$\tilde{\delta} x_j = \dfrac{\delta x_j}{\|W_{:,j}\|^2}$,   (14)
$\Delta W = \delta W\,\big(\mathbb{E}_\xi\big[x x^\top\big]\big)^{-1}$.   (15)

We can see that the back-matching formulas (14) and (15) are the corresponding BP formulas (6) and (7) rescaled by a scalar and by a matrix, respectively.
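
A quick numerical sanity check of (12)-(15) (our sketch, with arbitrary sizes): solving the least-squares problems directly with np.linalg.lstsq reproduces the rescaled BP quantities.

import numpy as np

rng = np.random.default_rng(2)
batch, n_in, n_out = 512, 64, 32
X = rng.standard_normal((batch, n_in))
W = rng.standard_normal((n_out, n_in))
dZ = rng.standard_normal((batch, n_out))

# (13): rows of Delta W solve  min_{w_i} E (w_i^T x - dz_i)^2  -> lstsq(X, dZ)
dW_ls = np.linalg.lstsq(X, dZ, rcond=None)[0].T            # shape (n_out, n_in)
dW_bp = dZ.T @ X / batch                                   # BP gradient E[dz x^T], cf. (7)
dW_rescaled = dW_bp @ np.linalg.inv(X.T @ X / batch)       # cf. (15)
print(np.allclose(dW_ls, dW_rescaled, atol=1e-6))

# (12): for one data point, each delta x_j solves  min_a ||W[:, j] a - dz||^2
dz = dZ[0]
dx_bp = W.T @ dz                                           # cf. (6)
dx_tilde = dx_bp / (W ** 2).sum(axis=0)                    # cf. (14)
dx_ls = np.array([np.linalg.lstsq(W[:, [j]], dz, rcond=None)[0][0]
                  for j in range(n_in)])
print(np.allclose(dx_ls, dx_tilde))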

2.2 Convolutional Layer

In this section we study the back-matching propagation through a convolutional layer. The weight parameter $W$ is an array of dimension $C_{out}\times C_{in}\times k_w\times k_h$, where $C_{out}$ and $C_{in}$ are the numbers of output and input features respectively, and $k_w$ and $k_h$ are the width and height of the convolutional kernels. We use $z_{f,(i,j)}$ to denote the output at location $(i,j)$ of feature $f$ and $x_{f',(i,j)}$ to denote the input at location $(i,j)$ of feature $f'$; the forward process (written for a stride of one) is then

$z_{f,(i,j)} = \sum_{f'}\sum_{k_1,k_2} W_{f,f',k_1,k_2}\; x_{f',(i+k_1-1,\,j+k_2-1)}$,   (16)

and the BP is given by

$\delta x_{f',(i,j)} = \sum_{f}\sum_{k_1,k_2} W_{f,f',k_1,k_2}\;\delta z_{f,(i-k_1+1,\,j-k_2+1)}$,   (17)
$\delta W_{f,f',k_1,k_2} = \mathbb{E}_\xi\Big[\sum_{i,j}\delta z_{f,(i,j)}\;x_{f',(i+k_1-1,\,j+k_2-1)}\Big]$,   (18)

where out-of-range indices contribute zero.

However, this element-wise form of the forward and backward pass makes the derivation of the back-matching propagation cumbersome. Note that the convolution operation essentially performs dot products between the convolution kernels and local regions of the input, so the forward pass of a convolutional layer can be formulated as one big matrix multiplication via the im2col operation. In order to describe the back-matching process clearly, we rewrite the forward and backward pass of the convolutional layer with im2col. We use $\tilde{W}_1$ and $\tilde{W}_2$ to denote two weight matrices, both stretched out from $W$: $\tilde{W}_1$ reshapes $W$ into a $C_{out}\times(C_{in}k_wk_h)$ matrix, and $\tilde{W}_2$ is the corresponding block matrix that acts on all output locations at once. To mimic the convolutional operation, we rearrange the input features into a big matrix $\tilde{X}$ through the im2col operation: each column of $\tilde{X}$ is composed of the elements of $x$ that are used to compute one location of $z$. Thus if $z$ has dimension $C_{out}\times H_o\times W_o$, then $\tilde{X}$ has dimension $(C_{in}k_wk_h)\times(H_oW_o)$; stacking its columns into a tall vector gives the input on which $\tilde{W}_2$ acts. The forward process (16) of the convolutional layer can be rewritten as

$\tilde{Z} = \tilde{W}_1\,\tilde{X}$,   (19)

where the $f$-th row of $\tilde{Z}$ collects the outputs of feature $f$ over all $H_oW_o$ locations.

Similarly, we can rewrite the regular BP (17) and (18) as

(20)
(21)

where the matrix appearing in (20)-(21) is composed of the weight components that interact with input location $(i,j)$, the accompanying factor is related to pooling and stride, and the corresponding vector is composed of the output locations that interact with input location $(i,j)$. With these notations, we can derive the formulas for back-matching propagation by solving the least-squares problems (11) and (8), which gives

(22)
(23)

We can see that the formulas (22) and (23) of back-matching propagation are the corresponding BP formulas (20) and (21) rescaled by a scalar or a matrix. As the convolutional layer is essentially a linear mapping, the formulas here are similar to those of the fully connected layer, although they are more involved.
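
The im2col construction used above can be made concrete with a short NumPy sketch (ours; stride one and no padding are assumed, and the sizes are arbitrary):

import numpy as np

def im2col(x, kw, kh):
    """Rearrange a (C_in, H, W) input into the (C_in*kh*kw, H_o*W_o) matrix X_tilde."""
    c_in, h, w = x.shape
    h_o, w_o = h - kh + 1, w - kw + 1
    cols = np.empty((c_in * kh * kw, h_o * w_o))
    idx = 0
    for i in range(h_o):
        for j in range(w_o):
            patch = x[:, i:i + kh, j:j + kw]       # receptive field of output location (i, j)
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols

rng = np.random.default_rng(3)
c_in, c_out, kh, kw = 3, 8, 5, 5
x = rng.standard_normal((c_in, 32, 32))
W = rng.standard_normal((c_out, c_in, kh, kw))

X_tilde = im2col(x, kw, kh)
W_tilde = W.reshape(c_out, -1)                     # the C_out x (C_in*kh*kw) matrix
Z_tilde = W_tilde @ X_tilde                        # forward pass as one big matmul, cf. (19)
print(Z_tilde.shape)                               # (8, 28 * 28)

Each column of X_tilde is exactly the receptive field of one output location, so the convolution becomes the single matrix product in (19).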

2.3 Batch Normalization Layer

Batch normalization (BN) is widely used to accelerate the training of feedforward neural networks. In practice, BN is usually inserted right before the activation function. We fix the affine transformation of batch normalization to be the identity. The BN layer mapping is then given by

$z_k = \dfrac{x_k - \mu_B}{\sigma_B}$,   (24)

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the unit's input over the mini-batch, and $k$ indexes the data points.

The BP formula through the BN layer is given by (Ioffe & Szegedy, 2015),

$\delta x_k = \dfrac{1}{\sigma_B}\Big(\delta z_k - \dfrac{1}{m}\sum_{k'=1}^{m}\delta z_{k'} - \dfrac{z_k}{m}\sum_{k'=1}^{m}\delta z_{k'}\,z_{k'}\Big)$,   (25)

where $m$ is the mini-batch size, and $\delta x_k$ and $\delta z_k$ are the backpropagated values on $x_k$ and $z_k$ respectively.

We next derive the formula of back-matching propagation through BN. By solving (11), we have

$\tilde{\delta} x_k = \sigma_B\,\delta z_k$.   (26)

To see how the back-matching propagation is related to BP, we ignore the latter two terms in (25), which is justified when the mini-batch size is large, and obtain the approximation

$\tilde{\delta} x_k \approx \sigma_B^2\,\delta x_k$.   (27)
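
The relations (25)-(27) for a single unit can be checked with a few lines of NumPy (our sketch; the batch size and input statistics are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
m = 256                                   # mini-batch size
x = 3.0 * rng.standard_normal(m) + 1.0    # pre-BN activations of one unit
dz = rng.standard_normal(m)               # guiding signal on the BN output

mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma

# Exact BP through BN with identity affine, cf. (25)
dx_bp = (dz - dz.mean() - z * (dz * z).mean()) / sigma
# Approximation that ignores the two mean terms
dx_approx = dz / sigma
# Back-matching value, cf. (26)
dx_bm = sigma * dz

print(np.abs(dx_bp - dx_approx).max())           # small for a large mini-batch
print(np.allclose(dx_bm, sigma**2 * dx_approx))  # exact for the approximation, cf. (27)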

2.4 Rectified Linear Unit (ReLU)

We use $\mathrm{relu}(t) = \max(t, 0)$ to denote the ReLU nonlinear function. The ReLU mapping is then given by

$z_j = \max(x_j, 0)$.   (28)

For the formula of BP, we have

$\delta x_j = \delta z_j\,\mathbb{1}\{x_j > 0\}$.   (29)

Following (11), we have the formula of back-matching propagation for the ReLU layer

$\tilde{\delta} x_j = \delta z_j\,\mathbb{1}\{x_j > 0\}$.   (30)

Therefore the formula of back-matching propagation for ReLU is the same as that of BP,

$\tilde{\delta} x = \delta x$.   (31)

3 Layer-wise Adaptive Rate via Approximate Back-matching Propagation

The back-matching propagation involves a large number of matrix inversions, which is computationally prohibitive for training large neural networks. In this section we show how to approximate the back-matching propagation under certain assumptions, ending up with a layer-wise adaptive rate strategy that allows easy implementation in frameworks equipped with auto-differentiation.

3.1 Approximate Back-matching Propagation via BP

We first look at the formulas for the back-matching propagated values on the weight parameter, (15) and (23). They are the BP gradient rescaled by the inverse of a matrix, which is prohibitively expensive to compute for large networks. We use batch normalization to circumvent this difficulty.

With batch normalization, we regard $\mathbb{E}[x x^\top]$ as approximately an identity matrix. From now on, we require that each intermediate layer is paired with a batch normalization layer, except the output layer. Since BN is already widely used to accelerate training and improve test accuracy, this requirement does not confine us much. Under this requirement, the back-matching propagation for the fully connected layer (15) is approximated by

$\Delta W \approx \delta W$,   (32)

and the back-matching propagation for the convolutional layer (23) is approximated by,

$\Delta W \approx \dfrac{1}{s}\,\delta W$,   (33)

where $s$ is the sharing factor of the convolutional layer.

Next we consider the formulas for the back-matching propagated values on the input, (14) and (22). To further reduce the complexity and develop a layer-wise adaptive learning rate strategy, we assume that $W$ is row homogeneous Ba et al. (2016), i.e., the rows represent the same level of information and are roughly of similar magnitude. We define $\nu(W) := \frac{1}{n}\sum_i \|w_i\|^2 = \frac{\|W\|_F^2}{n}$, where $w_i$ is the $i$-th row of $W$ and $n$ is the input dimension. Then the $\|W_{:,j}\|^2$ in equation (1) can be approximated by $\nu(W)$. Under this assumption, the back-matching propagation for the fully connected layer (14) is approximated by

$\tilde{\delta} x \approx \dfrac{\delta x}{\nu(W)} = \dfrac{n}{\|W\|_F^2}\,\delta x$,   (34)

and the back-matching propagation for the convolutional layer (22) is approximated by,

(35)

where the additional factor is related to pooling, stride and padding operations. We will see a detailed example in Section 3.3.

For the BN layer, we assume the weight parameter is row homogeneous; the back-matching propagation for the BN layer (26) is then approximated by

(36)

3.2 Layer-wise Adaptive Rate Strategy

Based on the approximations in Section 3.1, we are ready to derive a layer-wise adaptive learning rate strategy. We note that the approximate back-matching propagation amounts to a scaling factor for each layer's gradient if we back-propagate starting from the top layer. We set the initial factor of the output layer to $1$, which means that we regard the derivative of the loss with respect to the output of the network as the desired change on the output to minimize the loss.

Then, starting from the top layer, we compute a backward factor $v$ for each layer through

$v_{\text{in}} = r\cdot v_{\text{out}}$,   (37)

where $v_{\text{out}}$ and $v_{\text{in}}$ are the backward factors on the layer's output and input, and the ratio $r$ between $\tilde{\delta} x$ and $\delta x$ is given by (34), (35), (36) and (31) for the fully connected layer, convolutional layer, BN layer and ReLU, respectively. If a layer has parameter $W$ and BP gradient $\delta W$, then we use $\frac{v}{s}\,\delta W$ as the new adaptive gradient to update $W$, where $v$ is the backward factor on the output of the layer and $s$ is the sharing factor of the layer. Then $\frac{v}{s}$ can be viewed as a layer-wise adaptive learning rate when using vanilla SGD. This strategy is described in Algorithm 1.

  Initial: backward factor $v = 1$; sharing factor $s = 1$ for fully connected, ReLU and BN layers, and $s$ equal to the number of output locations sharing each weight for a convolutional layer
  repeat
     BP from the layer's output
     if the layer has weight $W$ then
        use $\frac{v}{s}\,\delta W$ as the adaptive gradient for $W$
     end if
     Calculate the ratio $r$ according to the layer type
     Update the backward factor $v \leftarrow r\,v$
  until bottom layer
Algorithm 1 SGD with Layer-wise Adaptive Rate via Approximate Back-matching Propagation
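
Below is a possible PyTorch-flavored sketch of Algorithm 1 as we read it. It makes several assumptions beyond the text: the per-layer ratio $r$ is taken as fan-in divided by the squared Frobenius norm of the weight (cf. (34)) for both fully connected and convolutional layers, the sharing factor of a convolutional layer is taken to be the number of spatial output locations (stored in a hypothetical attribute _out_locations), and the BN and pooling contributions to the ratio are omitted.

import torch
import torch.nn as nn

@torch.no_grad()
def scale_gradients_layerwise(layers):
    """Rescale the BP gradients of a feedforward stack of layers by v / s.

    `layers` is a plain list of modules ordered from input to output; we walk it
    from the top layer down, multiply each weight gradient by v / s, and update
    the backward factor v with the layer-dependent ratio r.
    """
    v = 1.0                                      # backward factor of the network output
    for layer in reversed(layers):               # top layer first
        if isinstance(layer, (nn.Linear, nn.Conv2d)):
            W = layer.weight
            s = 1.0
            if isinstance(layer, nn.Conv2d):
                # assumed sharing factor: number of spatial output locations,
                # recorded elsewhere in a hypothetical attribute
                s = float(getattr(layer, "_out_locations", 1))
            if W.grad is not None:
                W.grad.mul_(v / s)               # adaptive gradient (v / s) * dW
            fan_in = W.shape[1:].numel()         # n for Linear, C_in*k_w*k_h for Conv2d
            r = fan_in / float((W ** 2).sum())   # assumed ratio n / ||W||_F^2, cf. (34)
            v *= r
        # ReLU leaves the factor unchanged (31); the BN and pooling ratios of
        # (35)-(36) are omitted in this sketch
    return v

One would call scale_gradients_layerwise(list(model.children())) after loss.backward() and before optimizer.step(), assuming a purely sequential model.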

Our algorithm works naturally with momentum. In practice, we use the procedure in Algorithm 1 to modify the gradient computed via BP and then apply the momentum update to the modified gradient. With the modified gradient given by Algorithm 1, we can also apply other adaptive strategies, e.g., Adam and Adagrad, without difficulty.

Weight decay is a widely used technique to improve the generalization of a model. Note that both weight decay and our algorithm modify the gradients of the network parameters computed through BP. In practice, we first apply the weight-decay modification and then apply our algorithm to the modified gradient, which produces better results than the other way around.
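
In code, this ordering amounts to adding the weight-decay term to the BP gradient first, then applying the layer-wise factor, and only then the momentum update (a sketch; the function and variable names are ours, and layer_factor stands for the $v/s$ factor produced by Algorithm 1):

def sgd_step(W, g, buf, layer_factor, lr=0.1, wd=5e-4, mu=0.9):
    """One update step for a single layer's weight W.

    W, g, buf: parameter, BP gradient and momentum buffer (array-valued);
    layer_factor: the v/s factor of this layer from Algorithm 1.
    """
    g = g + wd * W             # 1) weight decay modifies the BP gradient first
    g = layer_factor * g       # 2) then the layer-wise rescaling
    buf = mu * buf + g         # 3) then the usual momentum accumulation
    W = W - lr * buf
    return W, buf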

3.3 An Example: LeNet

We use LeNet (Figure 1) as an example. We modify the original LeNet LeCun et al. (1998a) by inserting a batch normalization transformation before each activation layer (ReLU) and omitting all the bias terms.


Figure 1: LeNet with batch normalization

We next walk through the approximate back-matching propagation of LeNet and show how each layer's weight should be changed. Following the procedure of Algorithm 1, the initial value of the backward factor is $v = 1$. Given the loss $\ell$, we compute the regular gradient of each weight parameter through BP, denoted $\delta W$ with the layer name as a subscript. We start from the top layer fc3: we compute its adaptive gradient and update its weight accordingly. Since the ReLU activation contains no parameter and does not change the backward factor $v$, we move on to the BN layer; since our BN layer has no parameter either, we only have to update the backward factor. We then move to layer fc2, compute its adaptive gradient, and update the backward factor again. We continue in this way until the bottom layer. Further details are provided in the supplemental material.
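
The walkthrough can be mimicked numerically. The toy sketch below propagates the backward factor through a LeNet-like layer stack using the same assumed ratio as in the Algorithm 1 sketch above (the layer shapes follow the classic LeNet-5 layout on CIFAR-10 and are only illustrative; the BN, ReLU and pooling contributions to the factor, as well as the convolutional sharing factor, are omitted):

import numpy as np

rng = np.random.default_rng(5)

# (layer name, weight shape); shapes follow the classic LeNet-5 layout on CIFAR-10
lenet = [
    ("conv1", (6, 3, 5, 5)),
    ("conv2", (16, 6, 5, 5)),
    ("fc1",   (120, 16 * 5 * 5)),
    ("fc2",   (84, 120)),
    ("fc3",   (10, 84)),
]
weights = {name: 0.1 * rng.standard_normal(shape) for name, shape in lenet}

v = 1.0                                    # backward factor at the network output
for name, shape in reversed(lenet):        # walk from fc3 down to conv1
    W = weights[name]
    # for conv layers the gradient would additionally be divided by the sharing factor s
    print(f"{name}: BP gradient rescaled by v = {v:.3g}")
    fan_in = int(np.prod(shape[1:]))       # n for fc layers, C_in*k_w*k_h for conv layers
    v *= fan_in / float((W ** 2).sum())    # assumed ratio, cf. (34)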

We next train LeNet with BN to classify the CIFAR-10 dataset Krizhevsky & Hinton (2009). CIFAR-10 is composed of 60,000 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. We want to compare the training procedure and test accuracy of the classical SGD based on BP and of our algorithm. There are many hyper-parameters and hyper-routines that can affect the learning curve significantly, and we try our best to make a fair comparison. In the first experiment, we use the CIFAR-10 dataset without augmentation and fix the momentum to the same value for both algorithms; the two models start from the same initial point and see the same mini-batches of data, with a mini-batch size of 128. The global learning rate of each algorithm is chosen to perform best in terms of test accuracy from a pool of five candidates (the candidate pools differ between regular SGD and our algorithm). We fix the global learning rate throughout training and train for 200 epochs. A weight-decay term (0.0005) is optional for both methods. We note that the loss plotted in the training curves of Figure 2 does not include the weight-decay term, whether or not weight decay is used.


Figure 2: Performance comparison between regular momentum-SGD and our algorithm on CIFAR-10 classification using LeNet with fixed learning rate.

From Figure 2, we can see that without weight decay both algorithms can drive the training loss to zero, but the test accuracy of regular SGD starts to degrade after very few epochs (around 15) and the model overfits from then on. In comparison, our algorithm achieves a much lower test error than regular SGD, and a considerable margin remains even in the overfitting phase. On the other hand, with weight decay neither training loss converges to zero any more, but our algorithm achieves a much lower training loss than regular SGD. Weight decay improves the final test accuracy but does not improve the lowest test error reached during training with our algorithm. The weights obtained by our algorithm have rather large magnitudes (the network does not explode, as batch normalization stabilizes the forward propagation). We believe our algorithm combats overfitting differently from weight decay and could provide another way to improve generalization in certain settings where weight decay alone cannot.

4 Experiments

In this section, we evaluate the proposed algorithm on image classification tasks with two datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). CIFAR-10 was introduced in Section 3.3. The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes with 600 images per class, of which 500 are training images and 100 are testing images. We train VGG networks (Simonyan & Zisserman, 2015) to classify these two datasets because they are widely used baselines for image classification and have a feedforward architecture. We modify the VGG nets by keeping the last fully connected layer, removing the two intermediate fully connected layers, and omitting all biases (we find this does not hurt accuracy on CIFAR and shortens training time due to fewer parameters). We equip each intermediate layer of the VGG nets with a batch normalization transformation right before the activation function, and the batch normalization has no trainable parameters.

Differently from the setting in Section 3.3, here we train the VGG nets on randomly augmented CIFAR-10 and CIFAR-100 (random flips and rotations), since such big models otherwise overfit the datasets rapidly. We note that training does not benefit much from the augmented data directly; the learning rate needs to be decayed to learn effectively with data augmentation. For a fair comparison, we apply the same learning rate schedule to all algorithms: the learning rate is multiplied by a fixed factor every 60 epochs.
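
For reference, such a schedule can be written in a couple of lines (the base rate and decay factor below are placeholders, not the values used in the experiments):

def learning_rate(epoch, base_lr=0.1, decay=0.1, step=60):
    """Multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)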

4.1 Baseline Algorithms

We introduce several baseline algorithms and their settings.

The base algorithm is the regular SGD with Nesterov momentum and a fixed global learning rate.

The second baseline algorithm is LSALR Singh et al. (2015), which sets the learning rate of layer $l$ based on the norm of that layer's gradient. Its global learning rate is chosen as the best-performing one from a pool of candidates and differs from the value suggested in the original paper.

The third baseline algorithm is LARS You et al. (2017), which uses a learning rate for layer $l$ proportional to the ratio of the layer's weight norm to its gradient norm. In our experiments, the global learning rate for LARS is again chosen as the best-performing one from a pool of candidates.

Noting that all these layer-wise adaptive algorithms modify the regular layer gradients computed through BP, we equip them with Nesterov momentum in our experiments. For the baseline algorithms, we apply weight decay with coefficient 5e-4 unless otherwise specified.
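
For reference, the layer-wise rescaling used by the LARS baseline is commonly written as a trust ratio of weight norm to gradient norm; a minimal sketch (ours; clipping, weight decay inside the ratio, and the exact LSALR formula are omitted):

import numpy as np

def lars_layer_lr(W, grad, base_lr, eps=1e-12):
    """LARS-style layer-wise learning rate: scale the global rate by
    the ratio of the layer's weight norm to its gradient norm."""
    trust_ratio = np.linalg.norm(W) / (np.linalg.norm(grad) + eps)
    return base_lr * trust_ratio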

4.2 Result

We first compare the learning curves of our algorithm and vanilla SGD when training VGG11 on CIFAR-100. We apply Nesterov momentum 0.9 and a weight-decay coefficient of 5e-3 for both algorithms. As in Section 3.3, the two algorithms start from the same initialization and see the same batches of data. We set the same learning rate for both algorithms and run each for 300 epochs. The learning curves are plotted in Figure 3.


Figure 3: Performance comparison between regular momentum-SGD and our algorithm on CIFAR-100 classification using VGG11, with data augmentation and learning rate scheduling.

From Figure 3, we can see that the learning curves of our algorithm and SGD follow a similar trend: both curves jump at each learning rate decay. This is expected, as our algorithm only modifies the magnitude of each layer's gradient as a whole and does not use any further (second-order) information, and we use the same hyper-parameters and the same learning rate schedule for both algorithms. Looking more closely, our training loss curve is almost always below SGD's, and our test error fluctuates more heavily at the beginning but ends at a considerably lower value.

Next we present the test results of different VGG nets on CIFAR-100 in Table 1. For this group of experiments, we use a fixed global learning rate and a weight-decay coefficient of 5e-3 for our algorithm. Our algorithm achieves higher test accuracy than its competitors on all four VGG models by clear margins.


Model VGG11 VGG13 VGG16 VGG19
SGD 71.47 74.01 72.86 71.35
LARS 67.26 70.21 69.90 69.52
LSALR 70.75 73.74 72.56 70.76
Ours 73.39 75.32 74.46 72.90
Table 1: Classification accuracies (%) on CIFAR-100.

We then present the test accuracies of different VGG nets on CIFAR-10 in Table 2. The numbers in the table are the best of five independent trials of each algorithm. We use learning rate 2e-3 and weight-decay coefficient 1e-4 for this group of experiments. The learning rate differs from the CIFAR-100 case because for CIFAR-10 the last layer of the VGG nets has only 10 outputs, so the backpropagated values get multiplied by a large factor when following the back-matching propagation rule. Such an imbalanced mapping layer makes our algorithm behave aggressively, so we reduce the learning rate to 2e-3 for consistent results. We can see that our algorithm achieves higher test accuracy than its competitors on almost all VGG models, with various margins.


Model VGG11 VGG13 VGG16 VGG19
SGD 92.63 93.90 93.72 93.66
LARS 91.81 93.20 94.00 93.48
LSALR 92.58 93.81 94.00 93.46
Ours 92.69 94.08 94.22 93.98
Table 2: Classification accuracies (%) on CIFAR-10.

5 Conclusion and Discussion

In this paper we present the back-matching propagation, which provides a principled way of computing the backpropagated values on the weight parameters and on the input so that they match the error guiding signal on the output as accurately as possible. To utilize the idea of back-matching propagation for training large neural networks efficiently, we make several approximations based on intuitive understanding and reduce the back-matching propagation to the regular BP with a layer-wise adaptive learning rate strategy. It is easy to implement within current machine learning frameworks that are equipped with auto-differentiation. We test our algorithm on training feedforward neural networks and achieve favorable results over SGD.

There are several future directions for this work. In deriving Algorithm 1, we assume that each neuron updates its value independently of the others in the same layer. This is a strong assumption and may introduce inaccuracy in the backpropagated values across layers. One future direction is therefore to modify the back-matching propagation by considering the co-update of neurons in the same layer, which is closely related to the Riemannian algorithms Ollivier (2015) that have been introduced but are not widely used because of their complexity. Moreover, applying the idea of back-matching propagation to other architectures, such as residual networks and recurrent neural networks, is also under consideration.

References