Improving the Backpropagation Algorithm with Consequentialism Weight Updates over Mini-Batches

03/11/2020 ∙ by Naeem Paeedeh, et al. ∙ Ferdowsi University of Mashhad

Least mean squares (LMS) is a particular case of the backpropagation (BP) algorithm applied to single-layer neural networks with the mean squared error (MSE) loss. One drawback of LMS is that the error correction produced by the instantaneous weight update is proportional to the square of the norm of the input vector. The normalized least mean squares (NLMS) algorithm amends this drawback by dividing the weight changes by the square of the norm of the input vector. The affine projection algorithm (APA) extended NLMS to weight updates over a batch of recently seen samples. However, the application of NLMS and APA has been limited to single-layer networks and adaptive filters. In this paper, we consider a virtual target for each neuron of a multi-layer neural network and show that the BP algorithm is equivalent to training the weights of each layer using these virtual targets and the LMS algorithm. We also introduce a consequentialism interpretation of the NLMS and APA algorithms that justifies their use in multi-layer neural networks. Given any optimization algorithm based on BP over mini-batches, we propose a novel consequentialism method for updating the weights. Consequently, our proposed weight update can be applied both to plain stochastic gradient descent (SGD) and to momentum methods like RMSProp, Adam, and NAG. These ideas help us to update the weights more carefully, in such a way that minimizing the loss for one sample of the mini-batch does not interfere with the other samples in that mini-batch. Our experiments show the usefulness of the proposed method in optimizing deep neural network architectures.




1 Introduction

The LMS algorithm is the common origin of neural networks and adaptive filters. While research on neural networks has targeted multi-layer architectures, research on adaptive filters has mainly focused on single-layer architectures. In the field of neural networks, BP Rumelhart et al. (1986) (Bishop, 2006, pp. 241-245) over mini-batches is the core method for training multi-layer networks, and any improvement to this algorithm is of great practical importance. In this paper, we try to transfer the improvements made to the LMS algorithm in the field of adaptive filters to the training of deep neural networks (DNNs). In the following, we review some techniques that make the behavior of the network more predictable. This can be done, for example, by reducing the oscillations of the parameters around the optimal values, by preventing their deviation from the optimal path to the minimum, or by eliminating the redundant variables from the optimization calculations.

The learning curves of the SGD algorithm usually contain many fluctuations. This phenomenon may happen because each iteration optimizes over a small subset of input vectors, or mini-batch, which gives only a rough estimate of the gradient LeCun et al. (2015). Besides, SGD shows weakness in parts of the loss surface that bend more sharply in one dimension than in another (Goodfellow et al., 2016, p. 296) Ruder (2016). We want to provide more clues to the source of such instabilities.

The momentum method Rumelhart et al. (1986) and its extensions such as Adam, NAG Ruder (2016), Adagrad Duchi et al. (2011), and Adadelta Zeiler (2012) help to stabilize and speed up the training by bringing some eigen components of the system closer to critical damping Qian (1999). These methods reduce the oscillations around the optimum values of the parameters. It is also possible to reduce such oscillations by guiding the parameters to move straight towards their optimum values.

Batch normalization (BN) Ioffe and Szegedy (2015) is an algorithm that helps to optimize the error independently of changes in the distribution of the input features, which its authors called internal covariate shift (ICS). They regarded it as a source of unpredictability during training, which slows down the training by forcing one to pick lower learning rates and to initialize the parameters carefully. BN normalizes each feature of the outputs using the first two moments estimated from the statistics of each mini-batch. It also introduces two new variables to restore the representation power of the network by learning the preferred scale and shift during training. These measures enabled them to use higher learning rates and to match the accuracy of state-of-the-art image classification with 14 times fewer training steps. The success of BN may come from making the outputs of each layer, and hence the inputs of the next layers, more predictable by eliminating one source of unpredictability. The possibility of using a higher learning rate for the modified architecture may be an effect rather than the cause.

The LMS algorithm trains adaptive filters Widrow and Lehr (1990) Widrow and Hoff (1960) in an online fashion by reducing the errors through an iterative process. An adaptive linear element (Adaline) network Widrow and Lehr (1990) Widrow and Hoff (1960) is a single-layer neural network with a linear activation function. However, subsequent investigations revealed that the choice of the learning rate is sensitive to the norm of the input vector Widrow and Lehr (1990), which makes it hard or impossible to pick a learning rate that speeds up convergence while maintaining stability. Besides, once a fixed learning rate is selected, we should expect a kind of bias towards stronger inputs: while the filter learns the stronger inputs faster, the weaker inputs need more iterations. Additionally, the filter may easily be thrown off when it suddenly encounters an outlier, jumping to a completely different area of the loss surface and forgetting the previously learned parameters.

NLMS (Haykin, 2014, pp. 333-337) in regression tasks, which is also known as α-LMS Widrow and Lehr (1990); Hassoun (1995) in its classification variants, can solve this problem. It adjusts the learning rate to cancel the problematic effect of the norm of the input vector.

Two questions ought to be answered to extend the success of NLMS to DNNs. First, how can the underlying idea be generalized to a network with multiple layers? The second question arose after we extended it to a multi-layer perceptron (MLP) and observed a kind of instability during mini-batch training: what was the cause? To address the first question, we show that each layer follows some virtual targets, and we prove that optimization with BP is equivalent to applying the LMS algorithm to each layer independently. The second question can be answered by considering how each sample of the input deflects another sample from following its virtual target. This makes the outputs of each layer, which become the inputs of the next layer after an activation function, unpredictable. Moreover, groups of neurons, not just single neurons, determine the outcome of a neural network. For example, networks with better generalization do not rely on individual neurons Morcos et al. (2018). If vectors are viewed as groups of scalar inputs in on-line training, mini-batch training should be viewed as its matrix extension LeCun et al. (2015). These findings led us to use this matrix view instead of a vector view to handle the second question.

Similar efforts have been made before for single-layer networks. Because our solution contains a pseudoinverse, we later discovered that Ozeki and Umeda had generalized NLMS for adaptive filters by considering the batch of the most recently seen inputs at each time step. They solved the problem from a geometric viewpoint with the affine projection algorithm (APA) Ozeki and Umeda (1984) (Haykin, 2014, pp. 345-350), which arrives at the same pseudoinverse formula.

Finally, one way to solve the optimization problem in neural networks is to impose some restrictions. For example, Crammer et al. (2006) optimized the weights as the solution of a constrained optimization problem. They suggested a margin-based cost function, which made their online algorithm more aggressive. They also put restrictions on the weight changes and came up with an NLMS-like formula with a slack term. As another example, Atiya and Parlos put constraints on the gradients w.r.t. the outputs to train recurrent neural networks Atiya and Parlos (2000). In their method, they treated the states as the control variables, and the weight modifications were elicited from the changes in the states. As we show in Section 3, NLMS can also be explained by the gradients w.r.t. the outputs of the neurons instead of the weights.

The paper proceeds as follows. First, the notations are introduced in Section 2. Second, some past works are analyzed in Section 3 to grasp the consequentialism idea and the basis for the next section. After that, we propose our method in Section 4. Next, we investigate the effectiveness of the algorithm with some experiments in Section 5. Finally, the paper concludes in Section 6.

2 Notations

In this section, we provide a summary of the notations used in this paper. Bold letters indicate vectors, and bold capital letters represent matrices. Lowercase letters denote the matrix and vector elements. For a matrix $\mathbf{A}$, $a_{ij}$ refers to the element in the $i$th row and $j$th column.

The new method applies to any type of layer. However, we study just the feed-forward networks throughout this paper to simplify notations. Other layers are just some special cases of the fully-connected (FC) layer. For example, the convolution operator almost always is implemented as a matrix multiplication by using the im2col method for forward and col2im method for backward Jia et al. (2014).

Consider a feed-forward network with $L$ layers (or $L-1$ hidden layers) to be trained with a batch size of $B$. $N_l$ denotes the dimension of the output of layer $l$. As two special cases, $N_0$ is the dimension of the inputs of the network, and $N_L$ is the number of classes in a typical classification task. For layer $l$, we refer to the input matrix by $\mathbf{X}^{l}$, which is $N_{l-1} \times B$, and to the weight matrix by $\mathbf{W}^{l}$, which is $N_l \times N_{l-1}$. Fig. 1 illustrates a 3-layer MLP network.

Figure 1: An MLP network with 3 layers (2 hidden layers).

In the forward pass, starting from the first layer ($l = 1$), the outputs of a layer are calculated by

$$\mathbf{Y}^{l} = \mathbf{W}^{l} \mathbf{X}^{l}. \tag{1}$$

These outputs then pass through a non-linearity or activation function $f$ to create the inputs of the next layer as $\mathbf{X}^{l+1} = f(\mathbf{Y}^{l})$. When we discuss a single layer, we suppress the layer number. In the final step, the loss function finds the discrepancy between the outputs of the network and the desired targets, which are given by the target matrix. As a special case, refers to the MSE loss function.

Finally, $\nabla$ denotes the gradients in the backward pass, and $\Delta$ denotes our expected changes in the forward pass. For example, $\nabla \mathbf{W}$ is the matrix of gradients of the loss w.r.t. the weights and $\Delta \mathbf{W}$ is the matrix of actual changes to the weights.

3 Lessons from the past

In this section, we briefly review the BP, LMS, and NLMS algorithms. This background is needed to introduce our consequentialism idea, which we want to extend to mini-batch training with BP. BP assigns the needed correction to each weight in a neural network by following the long path of the chain rule to that weight (Haykin, 2009, p. 126) Schmidhuber (2015). Besides, it uses dynamic programming to store the intermediate calculations so that they are not recomputed when applying the chain rule.

3.1 Backpropagation

BP can be explained by mathematical induction (Bishop, 2006, pp. 241-245) (Haykin, 2009, pp. 123-139), as we mentioned in Section 2. Algorithm 1 shows the forward phase for a mini-batch.

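1:input $\mathbf{X}^{1}$, an array of weight matrices $\mathbf{W}$, target matrix $\mathbf{T}$, number of layers $L$, and activation function $f$
2:for $l = 1$ to $L$ do
3:     $\mathbf{Y}^{l} \leftarrow \mathbf{W}^{l} \mathbf{X}^{l}$
4:     $\mathbf{X}^{l+1} \leftarrow f(\mathbf{Y}^{l})$
5:end for
6:Compute Loss
Algorithm 1 Forward phase for a feed-forward network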

The process begins by finding the derivatives of the loss w.r.t. the outputs of the network as follows:


where , is the gradient of the loss w.r.t. the , and means definition. It can be shown in the compact form of


It is followed by applying the chain rule again to the activation function of the last layer. Therefore, for the previous layer we have:


Next, the weight parameters should be changed in the opposite direction of the gradients. Consequently, for , and learning rate we have:


which can be shown in the compact form of


For a mini-batch, the final weight changes are calculated by summing these individual weight corrections (Haykin, 2009, pp. 127-128) of (11) as


where is the batch-size, is the th sample in the matrix of the gradients of the outputs, and is the th sample in the input matrix. After that, to calculate the gradients for the weight parameters of the earlier layers, one needs the gradients w.r.t. the inputs of this layer as follows


where and . These are the gradients w.r.t. the elements of that is needed for the bottom layer. The same pattern can be repeated to obtain the BP formula. The backward phase can be simplified to


by considering mini-batches. Algorithm 2 shows the pseudocode of the backward phase for a mini-batch.

1:arrays of input matrices , weight matrices , output matrices , number of layers , Loss and learning rate
3:for  do
6:     if  then
8:     end if
9:end for
10:return Weight Changes
Algorithm 2 Backward phase for an MLP
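The forward and backward phases described in this subsection can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation. It assumes that samples are stored as columns, that each layer computes `Y = W X`, that hidden layers use ReLU while the last layer is linear, and that the loss is half the summed squared error. All helper names (`matmul`, `forward`, `backward`, `half_sse`) are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def forward(x, weights):
    """Forward phase: return the inputs X^l and outputs Y^l of every layer."""
    inputs, outputs = [], []
    for l, w in enumerate(weights):
        inputs.append(x)
        y = matmul(w, x)                      # Y^l = W^l X^l
        outputs.append(y)
        if l < len(weights) - 1:              # assumed: ReLU on hidden layers only
            x = [[max(0.0, v) for v in row] for row in y]
    return inputs, outputs


def half_sse(y, t):
    """Assumed loss: half the sum of squared errors over the mini-batch."""
    return 0.5 * sum((yv - tv) ** 2 for yr, tr in zip(y, t) for yv, tv in zip(yr, tr))


def backward(inputs, outputs, weights, targets, lr):
    """Backward phase: return the weight changes -lr * G^l (X^l)^T per layer."""
    # Gradient of the assumed loss w.r.t. the outputs of the last layer.
    g = [[yv - tv for yv, tv in zip(yr, tr)] for yr, tr in zip(outputs[-1], targets)]
    changes = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grad_w = matmul(g, transpose(inputs[l]))       # sums over the mini-batch samples
        changes[l] = [[-lr * v for v in row] for row in grad_w]
        if l > 0:
            grad_x = matmul(transpose(weights[l]), g)  # gradient w.r.t. this layer's inputs
            # Chain rule through the ReLU of the previous layer.
            g = [[gx if y > 0.0 else 0.0 for gx, y in zip(gr, yr)]
                 for gr, yr in zip(grad_x, outputs[l - 1])]
    return changes


# Toy example: 2 features, 2 samples (columns), one hidden layer of 3 units.
X = [[1.0, 0.5], [2.0, -1.0]]
T = [[1.0, 0.0]]
W = [[[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [[0.2, -0.5, 0.3]]]
ins, outs = forward(X, W)
deltas = backward(ins, outs, W, T, lr=0.1)
print(half_sse(outs[-1], T), deltas[-1])
```

Storing the inputs and outputs of every layer during the forward phase is what lets the backward phase reuse them instead of recomputing them, which is the dynamic-programming aspect mentioned at the start of this section.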

3.2 LMS

LMS is an algorithm for on-line training of a single-layer network with a linear activation function, hence $f(\mathbf{y}) = \mathbf{y}$. Consequently, the network computes $\mathbf{y} = \mathbf{W}\mathbf{x}$. The error of each output element of this vector is calculated by


where $t_i$, $y_i$, and $e_i$ are the $i$th elements of the ideal target, current output, and error vectors, respectively. LMS computes the MSE loss as


which yields


Therefore, the derivatives of the loss w.r.t. the outputs are just the elements of the error vector. For , and learning rate the chain rule can be followed to find the weight update of


in the opposite direction of the gradients. This is the μ-LMS algorithm.

3.3 NLMS

Since the inputs and targets of a single sample are all that we have at the current moment, let us foresee the consequences of applying this weight modification on the error of the current sample:


This is not ideal, because the actual correction of the error also depends on the squared 2-norm of the input. Changing the learning rate to


can eliminate it from the equation. After that, by applying this new learning rate (24) becomes


Finally, this new weight update yields


As an outcome, NLMS minimizes the error regardless of the power of the input. This is the key idea, which we call consequentialism and want to generalize to BP. Another interpretation is that it minimizes the error in the opposite direction of the gradients of the loss w.r.t. the outputs.

To sum up, we want to foresee what happens after applying the weight update formula, then to improve it by considering the consequences of our actions.
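The difference between the two update rules can be checked numerically. The sketch below is illustrative only: it assumes a single-layer linear network `y = W x`, the error `e = t - y`, and the LMS update `W <- W + lr * e x^T`. The function names and the toy values are hypothetical.

```python
def predict(w, x):
    """Single-layer linear network: y = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def errors(w, x, t):
    return [ti - yi for ti, yi in zip(t, predict(w, x))]


def lms_step(w, x, t, lr):
    """LMS update: W <- W + lr * e x^T."""
    e = errors(w, x, t)
    return [[wij + lr * ei * xj for wij, xj in zip(row, x)] for row, ei in zip(w, e)]


def nlms_step(w, x, t, lr, eps=0.0):
    """NLMS update: the learning rate is divided by the squared norm of x."""
    sq_norm = sum(xj * xj for xj in x) + eps
    return lms_step(w, x, t, lr / sq_norm)


x, t, lr = [3.0, 4.0], [1.0], 0.5        # squared norm of x is 25
w0 = [[0.0, 0.0]]
e0 = errors(w0, x, t)[0]
e_lms = errors(lms_step(w0, x, t, lr), x, t)[0]
e_nlms = errors(nlms_step(w0, x, t, lr), x, t)[0]
print(e0, e_lms, e_nlms)
```

With these values the initial error is 1. After one LMS step it becomes (1 - 0.5 * 25) * 1 = -11.5, so the step overshoots badly because of the input norm. After one NLMS step it becomes (1 - 0.5) * 1 = 0.5, up to floating-point rounding, which is exactly the fraction prescribed by the learning rate.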

4 Proposed method

In this section, we show that the negative direction of the gradients of the loss w.r.t. the outputs points to some virtual targets that each layer tries to achieve. We aim to follow these moving targets for each sample of the mini-batch more accurately, without making drastic changes to BP.

Our objectives are as follows. First, we introduce the idea of virtual targets. Second, we extend the consequentialism view to BP in on-line mode with the help of the virtual targets. Third, we explain the difficulty of extending it to mini-batch training. Finally, we introduce our comprehensive formula.

4.1 Virtual targets

Here, we want to show that on-line training with BP and any loss function is exactly like training each layer of the network in isolation by μ-LMS with virtual targets. From (17), for layer $l$ we have


is just a vector of numbers at the end of the calculations, and (30) resembles (24). To prove the equivalence, we first define


as virtual gradients, and


as virtual targets vector for layer , where means definition. By repeating the same steps of (19) to (24) again, we obtain


By substituting the real values of the virtual variables from (33) and (32) we get


and finally, (36) becomes


which is the same weight change as in (30), and this completes the proof. These targets are our best predictions of where the output vectors should stand in the next iteration for this very mini-batch. If we successfully control the outputs of each layer, which will be the inputs of the next layer after the non-linearities, the inputs of the next layers should be more predictable, and the error of each sample should decrease according to our expectations.
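The equivalence argued in this subsection can be written compactly for a single sample and a single layer. This is a sketch under assumed notation, which may differ from the paper's own symbols: $\mathbf{x}$ is the layer input, $\mathbf{y}$ its output, $\mathbf{g}$ the gradient of the loss $\mathcal{L}$ w.r.t. $\mathbf{y}$, $\mu$ the learning rate, $\tilde{\mathbf{t}}$ the virtual target, and $\tilde{\mathbf{e}}$ the resulting virtual error.

```latex
\begin{align*}
\Delta\mathbf{W}_{\text{BP}} &= -\mu\, \mathbf{g}\, \mathbf{x}^{T},
  & \mathbf{g} &= \frac{\partial \mathcal{L}}{\partial \mathbf{y}}, \\
\tilde{\mathbf{t}} &\triangleq \mathbf{y} - \mathbf{g},
  & \tilde{\mathbf{e}} &= \tilde{\mathbf{t}} - \mathbf{y} = -\mathbf{g}, \\
\Delta\mathbf{W}_{\text{LMS}} &= \mu\, \tilde{\mathbf{e}}\, \mathbf{x}^{T}
  = -\mu\, \mathbf{g}\, \mathbf{x}^{T} = \Delta\mathbf{W}_{\text{BP}}.
\end{align*}
```

The virtual target moves at every iteration because both $\mathbf{y}$ and $\mathbf{g}$ change after each weight update.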

4.2 Simple extension

The main strength of NLMS, as we saw in Section 3, was the predictable minimization of error. In our first attempt, we take a straightforward approach: we extend NLMS to feed-forward neural networks by normalizing each single input vector. This can be done by creating a new normalized matrix, defined as follows for each column of a mini-batch:


where is the -th input vector, and can be a small positive number for stability, otherwise we can ignore it. In BP, (36) is generalized to matrices of mini-batches as


This simple extension can be refuted by a simple counterexample. Let us test it on a single layer network. By defining the matrix with two features and batch-size of 2 as


the will be


Let us look ahead by forwarding the weight changes:


This yields


where we assumed . This equation explains how one sample can interfere with the optimization of another sample. With this matrix removed from the calculations, we expect to optimize without the bias towards stronger inputs that we mentioned in Section 1. Therefore, the only way to move precisely towards the virtual targets is for the right-side matrix to be an identity matrix. This requires the off-diagonal elements to be zero, which is not true most of the time.
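The interference can be reproduced with a small numerical check. The matrix below is a hypothetical example chosen for illustration, not necessarily the one used in the counterexample above. The sketch assumes that samples are stored as columns and that each column is divided by its squared norm, and the helper names are hypothetical.

```python
def normalize_columns(x, eps=0.0):
    """Divide every column (sample) of x by its squared norm."""
    cols = [list(c) for c in zip(*x)]
    scaled = [[v / (sum(u * u for u in c) + eps) for v in c] for c in cols]
    return [list(r) for r in zip(*scaled)]


def gram(a, b):
    """Return a^T b for matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(ca, cb)) for cb in zip(*b)] for ca in zip(*a)]


# Hypothetical mini-batch: two features (rows), two samples (columns).
X = [[1.0, 1.0],
     [0.0, 1.0]]
X_hat = normalize_columns(X)
M = gram(X_hat, X)      # the matrix that multiplies the output gradients
print(M)                # [[1.0, 1.0], [0.5, 1.0]]
```

The diagonal entries are 1, so each sample corrects its own error as intended, but the off-diagonal entries are not zero. Each sample's output therefore also moves along the gradient of the other sample, because the two input vectors are not orthogonal.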

4.3 The main proposal

Let us take a consequentialism view of BP for a specific layer to find out where the calculations deviate from our expectations. In mini-batch mode, from (36) and (38), we have


Now, we want to find out what exactly happens at the next iteration if we see the same mini-batch again. From (1) we have


By substituting in the above equation by (46), we get the real changes in the error matrix:


This is where a correlation matrix emerges. Thus, $\|\mathbf{x}\|^{2}$ in (26) of μ-LMS becomes $\mathbf{X}^{T}\mathbf{X}$ for a mini-batch in BP. We expect $\mathbf{X}^{T}\mathbf{X}$ from (48) to even deflect some of the outputs from the virtual targets. This redundant calculation can confuse DNNs.

The next step is to devise an analytical solution to remedy this problem. The idea is to modify the weight correction rule in such a way that produces an identity matrix instead of the mentioned correlation matrix. Therefore, we extend (29) to mini-batch training and solve it. We wish to have matrix as our desired error reduction to be just


The answer is the solution to


where is our ideal matrix of weight changes. By solving it, we find our desired formula of


where the is pseudoinverse operator. By repeating the substitution of in (47) by this new weight update rule of we get


where is the identity matrix. Therefore, by this new update rule, the outputs of each layer should move to the exact direction of the virtual targets, and we achieved a part of our goal.
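The argument of this subsection can be summarized in a few lines. This is a sketch under assumed notation, which may differ from the paper's own symbols: samples are the columns of $\mathbf{X}$, $\mathbf{G}$ holds the gradients of the loss w.r.t. the outputs of the layer, $\mu$ is the learning rate, and the columns of $\mathbf{X}$ are assumed to be linearly independent so that $\mathbf{X}^{T}\mathbf{X}$ is invertible.

```latex
\begin{align*}
\Delta\mathbf{W}\,\mathbf{X} &= -\mu\,\mathbf{G}
  && \text{(desired change of the outputs)} \\
\Delta\mathbf{W} &= -\mu\,\mathbf{G}\,\mathbf{X}^{+},
  \qquad \mathbf{X}^{+} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \\
\Delta\mathbf{Y} &= \Delta\mathbf{W}\,\mathbf{X}
  = -\mu\,\mathbf{G}\,(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}
  = -\mu\,\mathbf{G}\,\mathbf{I}
\end{align*}
```

When the columns are not independent, for instance when the batch size exceeds the input dimension, the first equation generally has no exact solution and the pseudoinverse gives a least-squares compromise instead.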

Let us rewrite the pseudoinverse of the input vector in NLMS as


where is the normalized input vector (unit vector) of . This shows that the part of NLMS should be treated as a mere pseudoinverse of a vector, rather than being called an adaptive learning rate. Besides, it indicates how the norm of the input affects the optimization twice: first on the backward pass, then on the next forward pass. For a single scalar input, it means that instead of multiplying, we divide the gradient by the input to neutralize the multiplication in the next iteration. It is worth noting that the new algorithm does not normalize the inputs; it only tries to cancel their effect on the optimization of the activations. As another interpretation, it considers the outputs as variables. In other words, it modifies the weights to change the outputs in such a way that minimizes the loss.

The perfect learning rate for a single layer with a linear activation function in full-batch mode is . However, for DNNs, the changes should be infinitesimal, which is the main assumption of differentiation. We use the Moore-Penrose method with ridge regression (Goodfellow et al., 2016, pp. 45-46) Hoerl and Kennard (1970) for two purposes. First, it easily prevents ill-conditioning issues. Second, it prevents the weights from changing dramatically, which could violate the assumptions behind the chain rule as a consequence of our modification.

Therefore, (51) can be rewritten as


We refer to the complete formula of (57) as the consequentialism weight update rule. The pseudocode of the proposed algorithm is shown in Algorithm 3.

1:arrays of input matrices , weight matrices , output matrices , number of layers , Loss and learning rate
3:for  do
5:     Compute the with ridge regression
7:     if  then
9:     end if
10:end for
11:return Weight changes
Algorithm 3 Backward phase for a feed-forward network with consequentialism weight updates
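The per-layer weight update of the proposed rule can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: samples are stored as columns, the ridge-regularized pseudoinverse is taken as `(X^T X + lam I)^{-1} X^T`, and the linear system is solved by a simple Gauss-Jordan elimination. The names `solve`, `ridge_pinv`, `consequentialism_delta`, and `lam` are hypothetical.

```python
def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a z = b for z by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ridge_pinv(x, lam):
    """Assumed form of the regularized pseudoinverse: (X^T X + lam I)^{-1} X^T."""
    xt = transpose(x)
    gram = matmul(xt, x)                 # batch x batch matrix
    for i in range(len(gram)):
        gram[i][i] += lam
    return solve(gram, xt)


def consequentialism_delta(g, x, lr, lam):
    """Weight change -lr * G X^+ for one layer."""
    return [[-lr * v for v in row] for row in matmul(g, ridge_pinv(x, lam))]


# Toy layer: 3 input features, 2 samples (columns), 1 output neuron.
X = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
G = [[1.0, -2.0]]                        # gradients of the loss w.r.t. the outputs
lr = 0.5
dW = consequentialism_delta(G, X, lr, lam=1e-9)
dY = matmul(dW, X)                       # change of the outputs on the same mini-batch
print(dY)
```

With a very small `lam` the change of the outputs is close to `-lr * G`, that is, about -0.5 and 1.0 here, so each sample moves along its own gradient only. A larger `lam` shrinks the update, which is the stabilizing effect that the ridge term is meant to provide.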

4.4 Time Complexity

The most time-intensive parts of neural networks are the FC and convolutional layers, and we explained in Section 2 that a convolutional layer can be treated as an FC layer. In addition, the computations of the other layers, activation functions, and loss functions are negligible in big O notation compared to these layers. Thus, we concentrate on the forward and backward propagation of the FC layers.

In this subsection, denotes the total operations. From (17), total operations of the backward in vanilla mini-batch SGD for each FC layer is:


We dropped the layer numbers to simplify the formulas. These operations take


steps. In addition, according to (1), for the forward pass we have:


with the time complexity of


Overall, BP has the total time complexity of:


for layers. In contrast, for the proposed method with Moore-Penrose method we have:


For the first case with Naïve Gauss elimination (Chapra, 2014, pp. 252-258), we have:


and for the second case the total operations are


Therefore, for the weight update formula, we have


with the time complexity of


Totalling up both forward and backward operations of the consequentialism method of the BP algorithm yields:


Since the dimensions of the features are usually greater than the batch size, we conclude:


Overall, both algorithms have the same time complexity in big O terms, but their constant factors differ.
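A rough count of multiplications illustrates why the extra work stays within the same order. The sketch below is not the paper's exact operation count. It assumes the pseudoinverse is computed as `(X^T X + lam I)^{-1} X^T`, takes the cost of the inversion as the cube of the batch size, and ignores additions and constant factors. The function names are hypothetical, while the layer width of 800 and the batch size of 32 are taken from the MLP experiments in Section 5.3.

```python
def plain_backward_mults(n_in, n_out, batch):
    """Multiplications for G X^T (weight gradients) and W^T G (input gradients)."""
    return n_out * batch * n_in + n_in * n_out * batch


def pinv_extra_mults(n_in, batch):
    """Rough extra cost of the regularized pseudoinverse when n_in >= batch."""
    gram = batch * batch * n_in          # X^T X
    inverse = batch ** 3                 # order of Gaussian elimination
    product = batch * batch * n_in       # inverse times X^T
    return gram + inverse + product


plain = plain_backward_mults(n_in=800, n_out=800, batch=32)
extra = pinv_extra_mults(n_in=800, batch=32)
print(plain, extra, extra / plain)
```

Under these assumptions the plain backward pass needs 40,960,000 multiplications and the pseudoinverse adds 1,671,168, roughly 4% more. Measured running time can differ considerably from this count, since matrix inversion parallelizes less well than matrix multiplication.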

5 Experiments

In this section, we evaluate the effectiveness of the proposed algorithm by several measures: the loss surface of the parameters, the convergence path of the outputs to the targets, the learning curves of an MLP on multiple datasets, and the training of ResNet He et al. (2015) with different optimizers. We used a 3 GHz Pentium G2030 CPU and a GTX 750 2GB GPU for our experiments. The MLP experiments were carried out in TensorFlow Abadi et al. (2015) and the ResNet experiments were performed in Caffe Jia et al. (2014).

5.1 Loss surface

To begin, we wanted to know how the weights converge to their optimum values in the loss surface graph. In this experiment, we created a very simple single-layer network with just two separate inputs and outputs with MSE loss function, which is depicted in Fig. 2. The loss surface of two weight parameters and the path that they travel to the optimum values with different learning rates are also shown in Fig. 3.

Figure 2: A simple network of two distinct inputs, weights and outputs.
Figure 3: Loss surface for C-SGD, SGD and SGD with momentum. The left figure shows the convergence with a high learning rate. The right figure shows the convergence with a small learning rate.

As the graphs show, the proposed algorithm moves the parameters directly towards the optimum values, regardless of the learning rate. By contrast, SGD moves the parameters in the direction of the gradients, which is perpendicular to the ellipses in the loss surface. While a weight parameter in SGD converges faster in one direction, it travels slowly in another direction. This is why increasing the learning rate to achieve faster convergence in one direction may cause perturbations, or amplify them, in another direction. However, this quivering may inject some noise into the calculations, which can be good for generalization.

For the higher learning rate in the left graph, the momentum term reduced the oscillations of SGD, but it also caused the parameters to pass the optimum values because of the accumulated momentum. Since the momentum term is usually used alongside higher learning rates, it can also be an unexpected source of randomness.

In the right graph, applying a small learning rate eliminated the oscillations of vanilla SGD and reduced them for SGD with momentum, at the cost of more iterations for convergence.

To conclude, because consequentialism makes the outputs of each layer more predictable, we expect fewer perturbations in more complex models.
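The straight path of the proposed method in this two-weight setting can be reproduced with a few lines of Python. The sketch is only an illustration: it assumes a single training sample, the loss `0.5 * sum((t_i - w_i * x_i)^2)`, and two independent scalar connections as in Fig. 2. The input and target values are hypothetical and are not the ones used for Fig. 3.

```python
def sgd_step(w, x, t, lr):
    """Plain gradient step on 0.5 * sum((t_i - w_i * x_i)^2)."""
    return [wi + lr * (ti - wi * xi) * xi for wi, xi, ti in zip(w, x, t)]


def csgd_step(w, x, t, lr):
    """Consequentialism step: the pseudoinverse of a scalar input x_i is 1 / x_i."""
    return [wi + lr * (ti - wi * xi) / xi for wi, xi, ti in zip(w, x, t)]


x, t = [1.0, 3.0], [2.0, 3.0]                 # optimum at w* = (2, 1)
w_star = [ti / xi for ti, xi in zip(t, x)]
w = [0.0, 0.0]
w_sgd = sgd_step(w, x, t, lr=0.1)
w_c = csgd_step(w, x, t, lr=0.1)

# Fraction of the distance to the optimum that remains along each weight.
left_sgd = [(ws - wi) / ws for ws, wi in zip(w_star, w_sgd)]
left_c = [(ws - wi) / ws for ws, wi in zip(w_star, w_c)]
print(left_sgd, left_c)
```

After one step, the consequentialism update leaves about 90% of the distance along both weights, so the parameters stay on the straight line to the optimum. Plain SGD leaves about 90% along the first weight but only about 10% along the second, because the second input is three times larger, and this is what bends its path.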

5.2 The path from the initial outputs to the ideal targets

In our second experiment, we tested how the outputs converge to the targets during the training, in comparison to SGD in mini-batch training. The network of this experiment has a single layer with 20-dimension inputs and 2 output neurons. The mini-batch contains 10 samples of random inputs and outputs. The training was done in full-batch mode. Fig. 4 depicts how the outputs of 2 out of 10 samples converged to their targets for both SGD and C-SGD.

As the graphs illustrate, regardless of the learning rate, the proposed method moves the outputs directly towards the targets. Conversely, SGD not only took an indirect path but also demonstrated more oscillations. While this usually happens in practice with higher learning rates, the stochastic nature of mini-batches also makes it less predictable.

The right graph, with the lower learning rate, shows that even decreasing the learning rate does not lessen the fluctuations, because one or more of the input samples can prevent the other samples from behaving predictably. Thus, the learning rate is not the only cause of the observed noise of mini-batch SGD.

These plots illustrate how the combination of increasing the learning rate and deploying the mini-batch training can cause the outputs of a layer, which will be the inputs of the next layer after a non-linearity in DNNs, to oscillate around the optimum values (virtual targets). Because of this, SGD has a higher chance of escaping from a local minimum. Moreover, the injected randomness may improve the generalization.

Figure 4: The convergence of the outputs of a single-layer FC network for 2 out of 10 samples of 20-dimensional inputs during the training. Outputs have 2 dimensions. The outputs start from the point and should reach the targets marked with . The learning rate for C-SGD was 0.7 for both figures. The learning rate of SGD was 0.03 for the left graph and 0.01 for the right graph.

5.3 Fully-connected neural networks

Here, we investigate the effectiveness of the consequentialism version of SGD against plain SGD on one MLP architecture for 3 datasets. All FC networks in the following experiments had 8 hidden layers, and the batch size was 32. Weights were initialized by He's method He et al. (2015). All layers except the last one had rectified linear unit (ReLU) activation functions. The last layer had the softmax activation. Finally, we applied the cross-entropy loss. We used the full training set to report the loss. While the networks for CIFAR-10 Krizhevsky and Hinton (2009) and Fashion-MNIST Xiao et al. (2017) had 800 hidden units in each layer, the network for CIFAR-100 Krizhevsky and Hinton (2009) had 1600 hidden units. The inputs of the first layer were normalized to be between 0 and 1 for half of the experiments. was set to for the consequentialism formula. For the other half, we normalized each feature to zero mean and unit variance (marked by NF), and to . Fig. 5 shows the loss graph of SGD against C-SGD over 100 epochs for all datasets based on iterations and elapsed time.

Figure 5: Log scaled cross-entropy loss of an 8-layer FC neural network with ReLU activations, trained with SGD and C-SGD on different datasets. NF denotes normalized features. The left graphs display the loss for all training samples at each iteration, and the right graphs display the same values based on the elapsed time.

According to these experiments, C-SGD optimizes faster than SGD in the beginning, when the distribution of the features has not yet changed. Moreover, the graphs display more overall stability throughout the training for C-SGD. While the normalization of the features improved both algorithms, it brings more stability to SGD and makes its convergence faster and even competitive with C-SGD on Fashion-MNIST. In contrast, the feature normalization decreased the stability of C-SGD. Overall, C-SGD suffers less from the changes in the distribution of the inputs.

The graphs also show that both algorithms reach a point at which the loss drops sharply, after which progress becomes slow. This point occurs earlier for C-SGD. More oscillations around the virtual targets may be a source of randomness that gives SGD a higher chance of finding a better local minimum at the end.

Since C-SGD needs more time because of the pseudo-inverse operations, we also compared them by elapsed time. The right-hand side graphs of Fig. 5 depict the comparison between both algorithms in seconds.

According to these experiments, C-SGD remains competitive on CIFAR-10 and CIFAR-100 even if we take time into account, but it does not perform better on Fashion-MNIST. This may be because this dataset is already more heavily preprocessed, so the pseudoinverse does not help much. For example, the images have little skew or rotation, and significant portions of the images are just black. The improvement is substantial, however, for a more complicated dataset: C-SGD shows sharper drops from the first epochs on CIFAR-100 than on CIFAR-10. Finally, another interesting observation is that for the more complicated inputs of CIFAR-100, the feature normalization does not help either algorithm much.

5.4 ResNet-20

In this section, we investigate the effectiveness of the proposed method on a deeper architecture of the ResNet-20 He et al. (2015). We implemented the consequentialism weight updates for FC and convolutional layers in Caffe Jia et al. (2014). All tests were done on the CIFAR-10 dataset. We chose SGD Rumelhart et al. (1986), Nesterov Ruder (2016), and Adam Kingma and Ba (2014) from popular optimizers. Fig. 6 compares the loss and accuracy of the consequentialism method alongside their plain counterpart optimizers for two different architectures, in the presence and absence of the BN layers. was set to 0.03 and batch-size was 128. Momentum was 0.95 for SGD and Nesterov, and 0.9 for Adam. The second momentum for Adam was set to 0.999. The values were reported for the last mini-batch of 100 iterations. The inputs were normalized between -1 and 1 for plain optimizers, and between -0.5 and 0.5 for the consequentialism method. The learning rate multiplier of the FC layer was set to 0.1 for the consequentialism experiments.

Figure 6: Comparison of log scaled cross-entropy loss and accuracy of ResNet-20 on CIFAR-10 trained by various optimizers with and without BN layers. BN denotes the models with BN layers.

For the three top right-hand graphs, we used unmodified models with BN layers; for the left-hand graphs, we removed the BN layers. As the non-BN graphs demonstrate, C-SGD and C-Nesterov perform better than SGD and Nesterov, whereas C-Adam does not improve on the results of the Adam optimizer. However, C-SGD outperformed both Adam and C-Adam. The BN graphs show a conflict between the new method and the BN algorithm: the combination of consequentialism and BN layers was unstable. The C-Adam optimizer displays strong oscillations, but with the help of this randomness, it eventually found a better minimum.

There were also two good reasons to compare C-SGD without BN layers against SGD with BN layers. Firstly, the effectiveness of individual optimizers is not our main concern, because our method is not an optimizer but an extension that can be combined with optimizers to improve them. Secondly, SGD gives the best results in both cases. Thus, we compared them at the bottom of Fig. 6. According to the graph, while the oscillations of the BN model increase after 10000 iterations, the consequentialism counterpart is more stable. As these graphs were drawn for individual mini-batches, this is a sign that our method forgets less of the previously trained mini-batches during training, with the side effect of less randomness and hence a lower chance of escaping worse local minima.

While C-SGD achieved 100% training accuracy without forgetting any sample after about 8000 iterations, SGD with BN sometimes forgot some samples. Overall, C-SGD converged faster in the beginning and remained stable to the end of training, but SGD with BN finally crossed the C-SGD line after 25000 iterations. Another interesting observation is the striking resemblance between their accuracy graphs. The above tests show that our method, like BN, does not suffer from vanishing or exploding gradients.

Now, let us compare them by elapsed time instead of steps. Since the previous experiments show a conflict between consequentialism and BN, the next figure omits the combination of consequentialism optimizers with BN layers. Fig. 7 compares, in seconds, the loss and accuracy of the consequentialism variants with their plain counterpart optimizers for two architectures, with and without BN layers.

According to the left-hand graphs, while C-SGD and C-Nesterov kept their advantages even when time is taken into account, C-Adam did not perform better than Adam. However, both C-SGD and C-Nesterov achieved better results.

In the right-hand graphs, SGD BN did well until about 6000 seconds, after which C-SGD overtook it for some time, but SGD BN took the lead again at roughly 9000 seconds. As for Adam, Adam BN outperformed C-Adam. However, the Adam optimizer did not do well overall compared to SGD and Nesterov. Finally, Nesterov BN performed well until about 3000 seconds, when C-Nesterov easily crossed it and kept its advantage until about 11000 seconds.

Figure 7: Time-based comparison of log scaled cross-entropy loss and accuracy of ResNet-20 on CIFAR-10 trained by various optimizers with and without BN layers. BN denotes the models with BN layers.

6 Conclusion and Future Work

In this paper, we introduced two ideas. Firstly, we proposed a new way of improving weight correction in BP by anticipating the consequences of our actions in the hypothetical next iteration. It is equivalent to minimizing the loss with gradients of the outputs, via the weight updates. Secondly, we explained that one could treat the gradients of the outputs of each layer as the changes to be made towards some virtual targets. Optimizing by BP is then equivalent to optimizing each layer by the LMS algorithm with virtual targets. As a result, it is possible to apply the successful algorithms from single layers to deep neural networks (DNNs).
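Both ideas can be checked numerically on a single linear layer. The sketch below is an illustration under stated assumptions, not the paper's implementation: the virtual targets are taken to be `T = Y - G` (the outputs moved one unit step against their gradients), the random matrix `G` stands in for the backpropagated output gradients, and the pseudo-inverse step is an APA-style update chosen for illustration. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, B = 6, 3, 4                  # hypothetical sizes, with B <= d
W = rng.standard_normal((m, d))    # weights of one linear layer
X = rng.standard_normal((d, B))    # mini-batch inputs, one column per sample
Y = W @ X                          # layer outputs
G = rng.standard_normal((m, B))    # stand-in for the backpropagated dL/dY

# (1) BP gradient versus LMS gradient with virtual targets.
grad_bp = G @ X.T                  # dL/dW of a linear layer under BP
T = Y - G                          # assumed virtual targets
grad_lms = -(T - Y) @ X.T          # gradient of 0.5 * ||T - W X||_F^2 w.r.t. W
same_gradient = np.allclose(grad_bp, grad_lms)

# (2) Plain step versus pseudo-inverse step towards the outputs Y + D.
eta = 0.1                          # hypothetical step size
D = -eta * G                       # requested change of every output
W_plain = W + D @ X.T              # gradient-descent step
W_pinv = W + D @ np.linalg.pinv(X) # APA-style step
err_plain = np.linalg.norm(W_plain @ X - (Y + D))
err_pinv = np.linalg.norm(W_pinv @ X - (Y + D))
print(same_gradient, err_plain, err_pinv)
```

In this sketch the pseudo-inverse step moves every sample's output by exactly its requested change, because the columns of `X` are linearly independent, whereas the plain step mixes the samples of the mini-batch through the product `X.T @ X`.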

Despite its faster initial convergence and greater stability during training, the proposed method could get stuck at a worse local minimum in the long run. This weakness should be addressed in future work.

We also got mixed results from the time-based experiments. While the method was competitive in the MLP tests in TensorFlow, in the ResNet tests in Caffe the processing demand almost eliminated its advantage. At least one of two reasons can be the cause: either convolution with big mini-batch sizes benefits less from the proposed method than from BN layers, or our Caffe implementation was not as well optimized as the TensorFlow one. However, the second case is not easy to prove or dispute.


References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from Cited by: §5.
  • A. F. Atiya and A. G. Parlos (2000) New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE transactions on neural networks 11 (3), pp. 697–709. Note: 363 Cited by: §1.
  • C.M. Bishop (2006) Pattern recognition and machine learning. Information Science and Statistics, Springer. External Links: ISBN 9780387310732, LCCN 2006922522 Cited by: §1, §3.1.
  • S. Chapra (2014) Numerical methods for engineers. McGraw-Hill Higher Education. External Links: ISBN 9780077492168 Cited by: §4.4.
  • K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer (2006) Online passive-aggressive algorithms. Journal of Machine Learning Research 7 (Mar), pp. 551–585. Note: 1577 Cited by: §1.
  • J. C. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159. Cited by: §1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §1, §4.3.
  • M. H. Hassoun (1995) Fundamentals of artificial neural networks. MIT press. Note: 66-69 Cited by: §1.
  • S. Haykin (2014) Adaptive filter theory : international edition. Pearson Education Limited. External Links: ISBN 9780273764083 Cited by: §1, §1.
  • S. Haykin (2009) Neural networks and learning machines, 3/e. Cited by: §3.1, §3.1, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: §5.3, §5.4, §5.
  • A. E. Hoerl and R. W. Kennard (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §4.3.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §1.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093. Cited by: §2, §5.4, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.4.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.3.
  • Y. LeCun, Y. Bengio, and G. E. Hinton (2015) Deep learning. Nature 521, pp. 436–444. Cited by: §1, §1.
  • A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959. Cited by: §1.
  • K. Ozeki and T. Umeda (1984) An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties. Electronics and Communications in Japan (Part I: Communications) 67 (5), pp. 19–27. Cited by: §1.
  • N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks : the official journal of the International Neural Network Society 12 1, pp. 145–151. Cited by: §1.
  • S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §1, §1, §5.4.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. nature 323 (6088), pp. 533. Note: 15905 Cited by: §1, §1, §5.4.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §3.
  • B. Widrow and M. E. Hoff (1960) Adaptive switching circuits. Technical report Stanford Univ Ca Stanford Electronics Labs. Note: 5283 Cited by: §1.
  • B. Widrow and M. A. Lehr (1990) 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE 78 (9), pp. 1415–1442. Note: 2572 Cited by: §1, §1.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) External Links: cs.LG/1708.07747 Cited by: §5.3.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. Cited by: §1.