1 Introduction
The least mean squares (LMS) algorithm is the common origin of neural networks and adaptive filters. While research on neural networks has targeted multilayer architectures, research on adaptive filters has focused mainly on single-layer architectures. In the field of neural networks, backpropagation (BP) Rumelhart et al. (1986) (Bishop, 2006, pp. 241–245) over minibatches is the core method for training multilayer networks, and any improvement to this algorithm is of great practical importance. In this paper, we try to transfer the improvements made to the LMS algorithm in the field of adaptive filters to the training of deep neural networks (DNNs). In the following, we review some techniques that make the behavior of the network more predictable. This can be done, for example, by reducing the oscillations of the parameters around their optimal values, by preventing their deviation from the optimal path to the minimum, or by eliminating redundant variables from the optimization calculations.
The learning curves of the stochastic gradient descent (SGD) algorithm usually contain many fluctuations. This phenomenon may occur because each iteration optimizes over a small subset of the input vectors (a minibatch), which gives only a rough estimate of the gradient LeCun et al. (2015). Besides, SGD struggles in parts of the loss surface that bend much more sharply in one dimension than in another (Goodfellow et al., 2016, p. 296) Ruder (2016). We want to provide more clues to the source of such instabilities.
The momentum method Rumelhart et al. (1986) and its extensions such as Adam, NAG Ruder (2016), Adagrad Duchi et al. (2011), and Adadelta Zeiler (2012) help to stabilize and speed up the training by bringing some eigencomponents of the system closer to critical damping Qian (1999). These methods reduce the oscillations around the optimum values of the parameters. It is also possible to reduce such oscillations by guiding the parameters to move straight towards their optimum values.
Batch normalization (BN) Ioffe and Szegedy (2015) is an algorithm that helps to optimize the error independently of changes in the distribution of the input features, a phenomenon its authors called internal covariate shift (ICS). They regarded ICS as a source of unpredictability during training, which slows training down by forcing one to pick lower learning rates and to initialize the parameters carefully. BN normalizes each feature of the outputs using the first two moments, estimated from the statistics of each minibatch. It also introduces two new variables to restore the representation power of the network by finding the preferred scale and shift during training. These measures enabled them to use higher learning rates and to match the accuracy of the state-of-the-art image classification model with 14 times fewer training steps. The success of BN may be due to making the outputs of each layer, and hence the inputs of the next layers, more predictable by eliminating one source of unpredictability. The possibility of using a higher learning rate for the modified architecture may be an effect rather than the cause.
The LMS algorithm trains adaptive filters Widrow and Lehr (1990) Widrow and Hoff (1960) in an online fashion by reducing the errors through an iterative process. An adaptive linear element (Adaline) Widrow and Lehr (1990) network Widrow and Hoff (1960) is a single-layer neural network with a linear activation function. However, subsequent investigations revealed that the choice of learning rate is sensitive to the norm of the input vector Widrow and Lehr (1990), which makes it hard or impossible to pick a learning rate that speeds up convergence while maintaining stability. Besides, once a fixed learning rate is selected, we should expect a kind of bias towards stronger inputs: the filter learns the stronger inputs faster, while the weaker inputs need more iterations. Additionally, the filter is easily confused when it suddenly encounters an outlier, jumping to a completely different area of the loss surface and forgetting the previously learned parameters.
Normalized LMS (NLMS) (Haykin, 2014, pp. 333–337) in regression tasks, which is also known as α-LMS Widrow and Lehr (1990); Hassoun (1995) in its classification variants, can solve this problem. It adjusts the learning rate to cancel the problematic effect of the norm of the input vector.
Two questions ought to be answered to extend the success of NLMS to DNNs. First, how can one generalize the underlying idea to a network with multiple layers? The second question arose after we extended NLMS to a multilayer perceptron (MLP) and observed a kind of instability during minibatch training: what was the cause? To address the first question, we show that each layer follows some virtual targets. We prove that optimization with BP is equivalent to using the LMS algorithm for independent layers. The second question can be answered by considering how each sample of the input deflects another sample from following its virtual target. This makes the outputs of each layer, which become the inputs of the next layer after an activation function, unpredictable. Moreover, groups of neurons, not just single neurons, determine the outcome of a neural network. For example, networks with better generalization do not rely on individual neurons Morcos et al. (2018). If online training treats a vector as a group of scalar inputs, minibatch training should be considered as its matrix extension LeCun et al. (2015). These findings led us to use this matrix view instead of a vector view to handle the second question.
Similar efforts have been made before for single-layer networks. Because our solution contains a pseudoinverse, we later discovered that Ozeki and Umeda had generalized NLMS in adaptive filters by considering the batch of the most recently seen inputs at each time step and solving the problem from a geometric viewpoint with the affine projection algorithm (APA) Ozeki and Umeda (1984) (Haykin, 2014, pp. 345–350), which arrives at the same pseudoinverse formula.
Finally, one way to solve the optimization problem in neural networks is to impose some restrictions. For example, Crammer et al. (2006) optimized the weights as the solution of a constrained optimization problem. They suggested a margin-based cost function, which made their online algorithm more aggressive. They also put restrictions on the weight changes and arrived at an NLMS-like formula with a slack term. As another example, Atiya and Parlos put constraints on the gradients w.r.t. the outputs to train recurrent neural networks Atiya and Parlos (2000). In their method, they considered the states as the control variables, and the weight modifications were derived from the changes in the states. As we show in Section 3, NLMS can also be explained by the gradients w.r.t. the outputs of the neurons instead of the weights.
The paper proceeds as follows. First, the notation is introduced in Section 2. Second, some past works are analyzed in Section 3 to introduce the consequentialism idea and lay the basis for the next section. After that, we propose our method in Section 4. Next, we investigate the effectiveness of the algorithm with some experiments in Section 5. Finally, the paper concludes in Section 6.
2 Notations
In this section, we provide a summary of the notations which are used in this paper. Bold letters indicate vectors, and bold capital letters represent matrices. Lower case letters denote the matrix and vector elements. For matrix , refers to the th row and th column.
The new method applies to any type of layer. However, we study only feedforward networks throughout this paper to simplify the notation. Other layers are just special cases of the fully connected (FC) layer. For example, the convolution operator is almost always implemented as a matrix multiplication by using the im2col method for the forward pass and the col2im method for the backward pass Jia et al. (2014).
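The im2col idea can be illustrated with a minimal NumPy sketch. This is not Caffe's implementation: the helper name `im2col`, the single-channel image, the stride of 1, and the absence of padding are simplifying assumptions made here for illustration only.

```python
import numpy as np

def im2col(image, k):
    """Stack every k x k patch of a 2-D image as one column (stride 1, no padding)."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))

# The convolution (cross-correlation, as in most deep learning frameworks)
# written as a single matrix product, i.e. as an FC layer applied to patches.
as_matmul = (kernel.ravel() @ im2col(image, 3)).reshape(3, 3)

# Direct sliding-window reference for comparison.
direct = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel) for j in range(3)]
                   for i in range(3)])
```

Each column produced by `im2col` plays the role of one input sample of an FC layer, which is why the rest of the analysis can stay with FC layers.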
Consider a feedforward network with layers (or hidden layers) to be trained with batch size of . denotes the dimension of the output for layer . As two special cases, is the dimension of the inputs of the network, and is the number of classes in a typical classification task. For layer , we refer to the input matrix by , which is , and for layer , we refer to the weight matrix by , which is . Fig. 1 illustrates a 3-layer MLP network.
In the forward pass, by starting from the first layer (), the outputs of a layer are calculated by
(1) 
These outputs then pass through a nonlinearity or activation function to create the inputs of the next layer as
. When we discuss a single layer, we suppress the layer number. In the final step, the loss function finds the discrepancy between the outputs of the network () and the desired targets, which are referred to by the matrix . As a special case, refers to the MSE loss function.
Finally, denotes the gradients in the backward pass, and denotes our expected changes in the forward pass. For example, is the matrix of gradients of the loss w.r.t. the weights and is the matrix of actual changes to the weights.
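The forward pass described in this section can be sketched as follows. The layout is an assumption made for illustration: samples are stored as the columns of the input matrix (shape features × batch), each weight matrix has shape outputs × inputs, and ReLU stands in for the generic activation function. The names `forward`, `inputs`, and `outputs` are hypothetical.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def forward(weights, x):
    """Return the input matrix and the linear output matrix of every layer."""
    inputs, outputs = [], []
    for w in weights:
        inputs.append(x)   # input matrix of this layer
        y = w @ x          # linear outputs of this layer
        outputs.append(y)
        x = relu(y)        # becomes the input matrix of the next layer
    return inputs, outputs

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]       # input dimension followed by three layer widths
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
x0 = rng.standard_normal((4, 6))   # a minibatch of 6 samples, one per column

inputs, outputs = forward(weights, x0)
```

The last entry of `outputs` is what the loss function compares against the target matrix.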
3 Lessons from the past
In this section, we briefly review the BP, LMS, and NLMS algorithms. This background knowledge is needed to introduce our consequentialism idea, which we want to extend to minibatch training with BP. BP assigns the needed correction to each weight in a neural network by following the long path of the chain rule to that weight (Haykin, 2009, p. 126) Schmidhuber (2015). Besides, it uses dynamic programming to store the intermediate calculations so that they are not recomputed when the chain rule is applied.
3.1 Backpropagation
BP can be explained by mathematical induction (Bishop, 2006, pp. 241–245) (Haykin, 2009, pp. 123–139), as we mentioned in Section 2. Algorithm 1 shows the forward phase for a minibatch.
The process begins by finding the derivatives of the loss w.r.t. the outputs of the network as follows:
(2) 
where , is the gradient of the loss w.r.t. the , and means definition. It can be shown in the compact form of
(3) 
It is followed by applying the chain rule again to the activation function of the last layer. Therefore, for the previous layer we have:
(4)  
(5)  
(6) 
Next, the weight parameters should be changed in the opposite direction of the gradients. Consequently, for , and learning rate we have:
(7)  
(8)  
(9) 
which can be shown in the compact form of
(10)  
(11) 
For a minibatch, the final weight changes are calculated by summing the individual weight corrections (Haykin, 2009, pp. 127–128) of (11) as
(12)  
(13) 
where is the batch size, is the th sample in the matrix of the gradients of the outputs, and is the th sample in the input matrix. After that, to calculate the gradients for the weight parameters of the preceding layers, one needs the gradients w.r.t. the inputs of this layer as follows
(14)  
(15)  
(16) 
where and . These are the gradients w.r.t. the elements of , which are needed for the layer below. The same pattern can be repeated to obtain the BP formula. The backward phase can be simplified to
(17) 
by considering minibatches. Algorithm 2 shows the pseudocode of the backward phase for a minibatch.
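A minimal NumPy sketch of the minibatch backward phase is given below, with a finite-difference check. It assumes samples are stored as columns, ReLU activations between layers, and an MSE loss on the last linear outputs; these choices and all function names are illustrative assumptions rather than the paper's Algorithm 2.

```python
import numpy as np

def forward(weights, x):
    """Return the inputs and linear outputs of every layer (ReLU in between)."""
    inputs, outputs = [], []
    for w in weights:
        inputs.append(x)
        y = w @ x
        outputs.append(y)
        x = np.maximum(y, 0.0)
    return inputs, outputs

def mse(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def backward(weights, inputs, outputs, t):
    """Gradients of the MSE loss w.r.t. every weight matrix."""
    grads = [None] * len(weights)
    g_y = outputs[-1] - t                      # gradient w.r.t. the last outputs
    for l in reversed(range(len(weights))):
        grads[l] = g_y @ inputs[l].T           # sum over samples of outer products
        if l > 0:
            g_x = weights[l].T @ g_y           # gradient w.r.t. this layer's inputs
            g_y = g_x * (outputs[l - 1] > 0)   # through the ReLU of the layer below
    return grads

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
x0 = rng.standard_normal((4, 6))
t = rng.standard_normal((3, 6))

inputs, outputs = forward(weights, x0)
grads = backward(weights, inputs, outputs, t)

# Central-difference check of one entry of the first layer's gradient.
eps = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][1, 2] += eps
w_minus[0][1, 2] -= eps
numeric = (mse(forward(w_plus, x0)[1][-1], t)
           - mse(forward(w_minus, x0)[1][-1], t)) / (2 * eps)
```

The matrix product `g_y @ inputs[l].T` is the compact form of summing the per-sample outer products over the minibatch.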
3.2 LMS
LMS is an algorithm for online training of a single-layer network with a linear activation function, hence . Consequently, the network computes . The error of each output element of this vector is calculated by
(18)  
(19) 
where , and are the th elements of the ideal target, current output, and error vectors, respectively. LMS computes the MSE loss as
(20) 
which yields
(21)  
(22) 
Therefore, the derivatives of the loss w.r.t. the outputs are just the elements of the error vector. For , and learning rate the chain rule can be followed to find the weight update of
(23)  
(24) 
in the opposite direction of gradients. This is the LMS algorithm.
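As a small illustration of the LMS update, the sketch below trains a single linear layer online on noise-free targets generated by a hypothetical teacher matrix `w_true`. The dimensions, the learning rate of 0.05, and the number of steps are arbitrary choices made for this example.

```python
import numpy as np

def lms_step(w, x, t, lr):
    """One online LMS update for a single linear layer."""
    e = t - w @ x                      # error vector for this sample
    return w + lr * np.outer(e, x)     # step against the gradient of the MSE loss

rng = np.random.default_rng(0)
w_true = rng.standard_normal((2, 3))   # hypothetical teacher that generates targets
w = np.zeros((2, 3))
for _ in range(2000):
    x = rng.standard_normal(3)
    w = lms_step(w, x, w_true @ x, lr=0.05)
```

With a fixed learning rate, the size of each correction scales with the norm of the current input, which is the sensitivity that NLMS addresses.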
3.3 NLMS
Since the inputs and targets of a single sample are all that we have at the current moment, let us foresee the consequences of applying this weight modification on the error of the current sample:
(25)  
(26) 
This is not ideal, because the actual correction of the error also depends on the squared Euclidean norm of the input. Changing the learning rate to
(27) 
can eliminate it from the equation. With this new learning rate, (24) becomes
(28) 
Finally, this new weight update yields
(29) 
As an outcome, NLMS minimizes the error regardless of the power of the input. This is the key idea, which we call consequentialism, and we want to generalize it to BP. Another interpretation is that it minimizes the error in the opposite direction of the gradients of the loss w.r.t. the outputs.
To sum up, we want to foresee what happens after applying the weight update formula, then to improve it by considering the consequences of our actions.
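The effect of the normalization can be checked numerically. The sketch below, with hypothetical names and arbitrary values, applies one LMS step and one NLMS step to the same sample at three input scales and records the error on that sample before and after the step.

```python
import numpy as np

def errors_before_after(w, x, t, lr, normalized):
    """Error on the current sample before and after one (N)LMS step."""
    e = t - w @ x
    step = lr / (x @ x) if normalized else lr   # NLMS divides by the squared norm
    w_new = w + step * np.outer(e, x)
    return e, t - w_new @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))
direction = rng.standard_normal(3)
t = rng.standard_normal(2)
lr = 0.1

records = []
for scale in (0.1, 1.0, 10.0):
    x = scale * direction
    e, e_lms = errors_before_after(w, x, t, lr, normalized=False)
    _, e_nlms = errors_before_after(w, x, t, lr, normalized=True)
    records.append((x, e, e_lms, e_nlms))
```

For plain LMS the remaining error is the old error scaled by one minus the learning rate times the squared input norm, so it depends on the input scale. For NLMS it is the old error scaled by one minus the learning rate, whatever the scale.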
4 Proposed method
In this section, we show that the negative direction of the gradients of the loss w.r.t. the outputs points to some virtual targets that each layer tries to achieve. We aim to follow these moving targets for each sample of the minibatch more accurately, without making drastic changes to BP.
Our objectives are as follows. First, we introduce the idea of the virtual targets. Second, we extend the consequentialism view to BP in online mode with the help of the virtual targets. Third, we explain the difficulty of extending it to minibatch training. Finally, we introduce our comprehensive formula.
4.1 Virtual targets
Here, we want to show that the online training of BP with any loss function is exactly like the training of each layer of the network in isolation by LMS with virtual targets. From (17), for layer we have
(30) 
is just a vector of numbers at the end of the calculations, and (30) resembles (24). To prove the equivalence, we first define
(31) 
as virtual gradients, and
(32) 
as virtual targets vector for layer , where means definition. By repeating the same steps of (19) to (24) again, we obtain
(33)  
(34)  
(35)  
(36) 
(37)  
(38) 
and finally, (36) becomes
(39) 
which is the same weight change as in (30), and the proof is complete. These targets are our best predictions of where the output vectors should be in the next iteration for this very minibatch. If we successfully control the outputs of each layer, which become the inputs of the next layer after the nonlinearities, the inputs of the next layers should be more predictable, and the error of each sample should decrease according to our expectations.
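The equivalence can be checked numerically on a small example. The sketch below assumes a two-layer network with a ReLU hidden layer and a softmax cross-entropy loss, and it takes the virtual target of the hidden layer to be its current output minus the BP gradient w.r.t. that output. This form of the virtual target, and every name in the code, is an assumption made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 3))
w2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)
label = np.array([0.0, 1.0])
lr = 0.1

# Forward pass for one sample.
y1 = w1 @ x
x2 = np.maximum(y1, 0.0)
y2 = w2 @ x2

# BP: gradients of the cross-entropy loss w.r.t. the linear outputs of each layer.
g2 = softmax(y2) - label
g1 = (w2.T @ g2) * (y1 > 0)
delta_w1_bp = -lr * np.outer(g1, x)

# LMS on the hidden layer in isolation, with the assumed virtual target.
t1 = y1 - g1
e1 = t1 - y1
delta_w1_lms = lr * np.outer(e1, x)
```

Both routes give the same weight change for the hidden layer, even though the loss is not MSE.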
4.2 Simple extension
The main strength of NLMS, as we saw in Section 3, was the predictable minimization of error. In our first attempt, we take a straightforward approach. We want to extend NLMS to feedforward neural networks by normalizing each single input vector. This can be done by creating a new normalized matrix by the following definition for column of a minibatch:
(40) 
where is the th input vector, and can be a small positive number for stability, otherwise we can ignore it. In BP, (36) is generalized to matrices of minibatches as
(41) 
This simple extension can be refuted by a simple counterexample. Let us test it on a single-layer network. By defining the matrix with two features and a batch size of 2 as
(42) 
the will be
(43) 
Let us look ahead by forwarding the weight changes:
(44) 
This yields
(45) 
where we assumed . This equation explains how one sample can interfere with the optimization of another sample. With this matrix removed from the calculations, we expect to optimize without the bias towards stronger inputs that we mentioned in Section 1. Therefore, the only way to move precisely towards the virtual targets is for the right-hand matrix to be an identity matrix. This requires the off-diagonal elements to be zero, which is not true most of the time.
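The interference can be reproduced with a small numerical example. The input values below are hypothetical and are not the ones used in the paper; samples are stored as columns, and each column is divided by its squared norm as in the simple extension.

```python
import numpy as np

eps = 0.0                                    # stability constant, ignored here
x = np.array([[1.0, 1.0],
              [0.0, 1.0]])                   # two features, batch size 2
x_hat = x / (np.sum(x ** 2, axis=0) + eps)   # per-sample (column) normalization

# Matrix that multiplies the output gradients when the update is looked ahead.
mixing = x_hat.T @ x

# One output neuron whose gradient is non-zero only for the first sample.
lr = 1.0
g = np.array([[1.0, 0.0]])
delta_y = -lr * g @ mixing
```

The diagonal of `mixing` is all ones, but the off-diagonal entries are not zero, so the second sample's output moves even though its own gradient is zero.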
4.3 The main proposal
Let us take a consequentialism view of BP for a specific layer to find out where the calculations deviate from our expectations. In minibatch mode, from (36) and (38), we have
(46) 
Now, we want to find out what exactly happens at the next iteration if we see the same minibatch again. From (1) we have
(47) 
By substituting in the above equation by (46), we get the real changes in the error matrix:
(48) 
This is where a correlation matrix emerges. Thus, in (26) of LMS becomes for a minibatch in BP. We expect (48) to even deflect some of the outputs away from their virtual targets. This redundant calculation can confuse DNNs.
The next step is to devise an analytical solution to remedy this problem. The idea is to modify the weight correction rule in such a way that produces an identity matrix instead of the mentioned correlation matrix. Therefore, we extend (29) to minibatch training and solve it. We wish to have matrix as our desired error reduction to be just
(49) 
The answer is the solution to
(50) 
where is our ideal matrix of weight changes. By solving it, we find our desired formula of
(51) 
where is the pseudoinverse operator. By repeating the substitution of in (47) with this new weight update rule of we get
(52)  
(53)  
(54) 
where is the identity matrix. Therefore, with this new update rule, the outputs of each layer should move exactly in the direction of the virtual targets, which achieves a part of our goal.
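A single update step makes the difference visible. The sketch below compares the change in the outputs produced by a plain gradient step with the change produced by a pseudoinverse-based step, for one linear layer. It assumes samples stored as columns and more features than samples, so that the pseudoinverse times the input matrix is the identity. The shapes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, b, lr = 20, 2, 10, 0.1
x = rng.standard_normal((n_in, b))     # inputs, one sample per column
w = rng.standard_normal((n_out, n_in))
g = rng.standard_normal((n_out, b))    # gradients w.r.t. the outputs

delta_w_sgd = -lr * g @ x.T                   # plain minibatch update
delta_w_csgd = -lr * g @ np.linalg.pinv(x)    # pseudoinverse-based update

# Look ahead: change in the outputs if the same minibatch is seen again.
delta_y_sgd = (w + delta_w_sgd) @ x - w @ x
delta_y_csgd = (w + delta_w_csgd) @ x - w @ x
```

With the pseudoinverse, every output moves by exactly the learning rate times its own negative gradient. With the plain update, the gradients are mixed through the correlation matrix of the inputs.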
Let us rewrite the pseudoinverse of the input vector in NLMS as
(55)  
(56) 
where is the normalized input vector (unit vector) of . This shows that the part of NLMS should be treated as a mere pseudoinverse of a vector rather than as an adaptive learning rate. Besides, it indicates how the norm of the input affects the optimization twice: first on the backward pass, and then on the next forward pass. For a single scalar input, it means that instead of multiplying, we divide the gradient by the input to neutralize the multiplication in the next iteration. It is worth noting that the new algorithm does not normalize the inputs; it only tries to cancel their effect on the optimization of the activations. As another interpretation, it treats the outputs as variables. In other words, it modifies the weights to change the outputs in such a way that the loss is minimized.
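The pseudoinverse of a single input vector can be verified directly. In this short check the vector values are arbitrary, and the column-vector convention is an assumption.

```python
import numpy as np

x = np.array([3.0, 4.0])

pinv_x = np.linalg.pinv(x[:, None])           # pseudoinverse of a column vector
closed_form = x[None, :] / (x @ x)            # transpose divided by squared norm

unit = x / np.linalg.norm(x)                  # normalized input (unit vector)
via_unit = unit[None, :] / np.linalg.norm(x)  # unit vector divided by the norm
```

All three expressions give the same row vector, which is the factor NLMS applies to the error.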
The perfect learning rate for a single layer with a linear activation function in full-batch mode is . However, for DNNs, the changes should be infinitesimal, as this is the main assumption of differentiation. We use the Moore-Penrose method with ridge regression (Goodfellow et al., 2016, pp. 45–46) Hoerl and Kennard (1970) for two purposes. First, it easily prevents ill-conditioning issues. Second, it prevents the weights from changing dramatically, which could violate the assumptions of the chain rule as a consequence of our modification.
Therefore, (51) can be rewritten as
(57) 
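One common way to combine the pseudoinverse with a ridge term is sketched below. It assumes the usual Tikhonov-regularized form, in which a small multiple of the identity is added to the Gram matrix of the samples before solving. Whether this matches (57) term by term is an assumption, and `ridge_pinv` and `lam` are hypothetical names.

```python
import numpy as np

def ridge_pinv(x, lam):
    """Ridge-regularized pseudoinverse of x (samples stored as columns)."""
    b = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(b), x.T)

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 10))

# With a tiny ridge term the result approaches the exact pseudoinverse.
near_exact = ridge_pinv(x, 1e-10)

# A repeated sample makes the Gram matrix singular; the ridge term keeps the
# linear system solvable.
x_dup = np.concatenate([x, x[:, :1]], axis=1)
damped = ridge_pinv(x_dup, 1e-3)

# A larger ridge term shrinks the result, which limits the size of the update.
shrunk = ridge_pinv(x, 1.0)
```

Using `np.linalg.solve` avoids forming an explicit inverse of the batch-size-by-batch-size matrix.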
4.4 Time Complexity
The most time-intensive parts of neural networks are the FC and convolutional layers, and we explained in Section 2 that the convolution layer can be treated as an FC layer. In addition, the computations of the other layers, activation functions, and loss functions are negligible in big O notation compared to these layers. Thus, we concentrate only on the forward and backward propagation of the FC layers.
In this subsection, denotes the total operations. From (17), the total number of operations of the backward pass in vanilla minibatch SGD for each FC layer is:
(58)  
(59) 
We dropped the layer numbers to simplify the formulas. These operations take
(60)  
(61) 
steps. In addition, according to (1), for the forward pass we have:
(62) 
with the time complexity of
(63) 
Overall, BP has the total time complexity of:
(64)  
(65)  
(66) 
for layers. In contrast, for the proposed method with the Moore-Penrose method we have:
(67) 
For the first case with naïve Gauss elimination (Chapra, 2014, pp. 252–258), we have:
(68)  
(69) 
and for the second case the total operations are
(70)  
(71) 
Therefore, for weight updates formula, we have
(72) 
with the time complexity of
(73) 
Totalling up both forward and backward operations of the consequentialism method of the BP algorithm yields:
(74)  
(75)  
(76)  
(77)  
(78) 
Since the feature dimensions are usually greater than the batch size, we conclude:
(79)  
(80) 
Overall, both algorithms have the same time complexity in big O terms, but the constant factors differ.
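To get a feel for the constant factors, the sketch below counts leading-order multiply-adds for one FC layer. The counts are rough assumptions made here (dense products, a ridge-regularized solve on the batch-size-by-batch-size Gram matrix, constants dropped) and are not the paper's exact expressions. The layer width of 800 and batch size of 32 are taken from the MLP setup in Section 5.3.

```python
def sgd_multiply_adds(n_in, n_out, b):
    """Leading-order multiply-adds of one FC layer in plain minibatch SGD."""
    forward = n_out * n_in * b       # outputs = weights @ inputs
    weight_grad = n_out * b * n_in   # output gradients @ inputs transposed
    input_grad = n_in * n_out * b    # weights transposed @ output gradients
    return forward + weight_grad + input_grad

def extra_multiply_adds(n_in, b):
    """Assumed extra work for a ridge-regularized pseudoinverse of the inputs."""
    gram = b * b * n_in              # Gram matrix of the samples
    elimination = b ** 3             # eliminating the b x b system
    back_substitution = b * b * n_in # one right-hand side per feature
    return gram + elimination + back_substitution

n_in = n_out = 800
b = 32
overhead_ratio = extra_multiply_adds(n_in, b) / sgd_multiply_adds(n_in, n_out, b)
```

Under these assumptions the extra work grows with the square of the batch size times the feature dimension, so it stays a small fraction of the layer's cost as long as the batch size is well below the layer width.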
5 Experiments
In this section, we evaluate the effectiveness of the proposed algorithm by several measures: the loss surface of the parameters, the convergence path of the outputs to the targets, the learning curves of an MLP on multiple datasets, and the training of ResNet He et al. (2015) with different optimizers. We used a 3 GHz Pentium G2030 CPU and a GTX 750 2GB GPU for our experiments. The MLP experiments were carried out in TensorFlow Abadi et al. (2015), and the ResNet experiments were performed in Caffe Jia et al. (2014).
5.1 Loss surface
To begin, we wanted to know how the weights converge to their optimum values in the loss surface graph. In this experiment, we created a very simple single-layer network with just two separate inputs and outputs and an MSE loss function, which is depicted in Fig. 2. The loss surface of two weight parameters and the path that they travel to the optimum values with different learning rates are also shown in Fig. 3.
As the graphs show, the proposed algorithm moves the parameters directly towards the optimum values, regardless of the learning rate. By contrast, SGD moves the parameters in the direction of the gradients, which is perpendicular to the ellipses in the loss surface. While a weight parameter in SGD converges faster in one direction, it travels slowly in another direction. This is why increasing the learning rate to achieve faster convergence in one direction may cause or amplify perturbations in another direction. However, this quivering may inject some noise into the calculations, which can be good for generalization.
For the higher learning rate in the left graph, the momentum term reduced the oscillations of SGD, but it also caused the parameters to overshoot the optimum values because of the accumulated momentum. Since the momentum term is usually used alongside higher learning rates, it can also be an unexpected source of randomness.
In the right graph, applying a small learning rate eliminated the oscillations of vanilla SGD and reduced them for SGD with momentum, at the cost of more iterations for convergence.
To conclude, because the consequentialism makes the outputs of each layer more predictable, we expect fewer perturbations in more complex models.
5.2 The path from the initial outputs to the ideal targets
In our second experiment, we tested how the outputs converge to the targets during training, in comparison to SGD in minibatch training. The network of this experiment has a single layer with 20-dimensional inputs and 2 output neurons. The minibatch contains 10 samples of random inputs and outputs. The training was done in full-batch mode. Fig. 4 depicts how the outputs of 2 out of 10 samples converged to their targets for both SGD and its consequentialism counterpart (CSGD).
As the graphs illustrate, regardless of the learning rate, the proposed method moves the outputs directly towards the targets. Conversely, SGD not only took an indirect path but also demonstrated more oscillations. While this usually happens in practice with higher learning rates, the stochastic nature of minibatches also makes it less predictable.
The right graph, for the lower learning rate, shows that even decreasing the learning rate does not lessen the fluctuations, because one or more of the input samples can prevent the other samples from behaving predictably. Thus, the learning rate is not the only cause of the observed noise of minibatch SGD.
These plots illustrate how the combination of increasing the learning rate and deploying the minibatch training can cause the outputs of a layer, which will be the inputs of the next layer after a nonlinearity in DNNs, to oscillate around the optimum values (virtual targets). Because of this, SGD has a higher chance of escaping from a local minimum. Moreover, the injected randomness may improve the generalization.
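A rough re-creation of this setup is sketched below, under stated assumptions: one linear layer with 20 inputs and 2 outputs, 10 random samples stored as columns, an MSE loss, full-batch training, and a pseudoinverse-based update standing in for CSGD. The learning rate, step count, and random seed are arbitrary and do not reproduce Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 10))    # 10 samples with 20 features each
t = rng.standard_normal((2, 10))     # random targets for 2 output neurons
w0 = rng.standard_normal((2, 20))
lr, steps = 0.02, 50

def train(update):
    """Run full-batch training and record the error matrix after every step."""
    w = w0.copy()
    errors = [t - w @ x]
    for _ in range(steps):
        g = w @ x - t                # gradient of the MSE loss w.r.t. the outputs
        w = w - lr * update(g)
        errors.append(t - w @ x)
    return errors

sgd_errors = train(lambda g: g @ x.T)
csgd_errors = train(lambda g: g @ np.linalg.pinv(x))
```

With the pseudoinverse-based update, the error matrix after each step is the previous one scaled by one minus the learning rate, so every output travels on a straight line towards its target. The plain update mixes the samples, so its path in output space is generally not straight.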
5.3 Fully connected neural networks
Here, we investigate the effectiveness of the consequentialism version of SGD against plain SGD on one MLP architecture for 3 datasets. All networks in the following experiments had 8 hidden layers, and the batch size was 32. Weights were initialized by He’s method He et al. (2015). All layers except the last one had rectified linear unit (ReLU) activation functions. The last layer had the softmax activation. Finally, we applied the cross-entropy loss. We used the full training set to report the loss. While the networks for CIFAR-10 Krizhevsky and Hinton (2009) and Fashion-MNIST Xiao et al. (2017) had 800 hidden units in each layer, the network for CIFAR-100 had 1600 hidden units Krizhevsky and Hinton (2009). The inputs of the first layer were normalized to lie between 0 and 1 for half of the experiments. was set to for the consequentialism formula. For the other half, we normalized each feature to zero mean and unit variance (marked by NF), and to . Fig. 5 shows the loss graphs of SGD against CSGD over 100 epochs for all datasets, based on iterations and elapsed time.
According to these experiments, CSGD optimizes faster than SGD in the beginning, when the distribution of the features has not yet changed. Moreover, the graphs display more overall stability throughout the training for CSGD. While the normalization of the features improved both algorithms, it brings more stability to SGD and makes its convergence faster and even competitive with CSGD on Fashion-MNIST. In contrast, the feature normalization decreased the stability of CSGD. Overall, CSGD suffers less from changes in the distribution of the inputs.
The graphs also show that both algorithms reach a point at which the loss drops sharply, after which progress is slow; this point occurs earlier for CSGD. More oscillations around the virtual targets may be the source of a randomness that gives SGD a higher chance of finding a better local minimum at the end.
Since CSGD needs more time because of the pseudoinverse operations, we also compared the two algorithms by elapsed time. The right-hand side graphs of Fig. 5 depict the comparison between both algorithms in seconds.
According to these experiments, CSGD remains competitive on CIFAR-10 and CIFAR-100 even when time is taken into account, but does not perform better on Fashion-MNIST. This may be because this dataset is more heavily preprocessed, so the pseudoinverse does not help much. For example, the images have little skew or rotation, and significant portions of the images are just black. The improvement is substantial, however, for a more complicated dataset: CSGD enjoys sharper drops from the first epochs on CIFAR-100 than on CIFAR-10. Finally, another interesting observation is that for the more complicated inputs of CIFAR-100, feature normalization does not help either algorithm much.
5.4 ResNet-20
In this section, we investigate the effectiveness of the proposed method on a deeper architecture, ResNet-20 He et al. (2015). We implemented the consequentialism weight updates for FC and convolutional layers in Caffe Jia et al. (2014). All tests were done on the CIFAR-10 dataset. We chose SGD Rumelhart et al. (1986), Nesterov Ruder (2016), and Adam Kingma and Ba (2014) from among the popular optimizers. Fig. 6 compares the loss and accuracy of the consequentialism method alongside their plain counterpart optimizers for two different architectures, in the presence and absence of the BN layers. was set to 0.03 and the batch size was 128. Momentum was 0.95 for SGD and Nesterov, and 0.9 for Adam. The second momentum for Adam was set to 0.999. The values were reported for the last minibatch of 100 iterations. The inputs were normalized between -1 and 1 for the plain optimizers, and between -0.5 and 0.5 for the consequentialism method. The learning rate multiplier of the FC layer was set to 0.1 for the consequentialism experiments.
For the 3 top right-hand side graphs, we deployed unmodified models with BN layers, and for the left-hand side graphs, we removed the BN layers. As the non-BN graphs demonstrate, CSGD and CNesterov perform better than SGD and Nesterov, whereas CAdam does not improve on the results of the Adam optimizer. However, CSGD outperformed both Adam and CAdam. The BN graphs show a conflict between the new method and the BN algorithm: the combination of consequentialism and BN layers was unstable. The CAdam optimizer displays strong oscillations, but with the help of this randomness, it eventually found a better minimum.
There were also two good reasons to compare CSGD without BN layers against SGD with BN layers. First, the effectiveness of the optimizers is not our main concern, because our method is not an optimizer but an extension that can be combined with optimizers to improve them. Second, SGD gives the best results in both cases. Thus, we compared them at the bottom of Fig. 6. According to the graph, while the oscillations of the BN model increase after 10000 iterations, the consequentialism counterpart shows more stability. As these graphs were drawn for individual minibatches, this is a sign that our method forgets previously trained minibatches less during training, with the side effect of less randomness and hence a lower chance of escaping worse local minima.
While CSGD achieved 100% training accuracy without forgetting any sample after about 8000 iterations, SGD with BN sometimes forgot some samples. Overall, CSGD converged faster in the beginning and maintained its stability to the end of the training, but SGD with BN finally crossed the CSGD line after 25000 iterations. Another interesting observation is the striking resemblance between their accuracy graphs. The above tests show that our method, like BN, does not suffer from vanishing and exploding gradients.
Now, let us compare them by considering the time instead of steps. Since the previous experiments show a conflict between consequentialism and BN, in the next figure, we did not use the combination of consequentialism optimizers with BN layers at all. Fig. 7 compares the loss and accuracy of the consequentialism method alongside their plain counterpart optimizers for two different architectures, in the presence and absence of the BN layers in seconds.
According to the left-hand side graphs, while CSGD and CNesterov kept their advantages even when time is taken into account, CAdam did not perform better than Adam. However, both CSGD and CNesterov achieved better results.
From the right-hand side graphs, while SGD BN did well before about 6000 seconds, CSGD overtook it for some time, but SGD BN again took the lead at roughly 9000 seconds. As for Adam, Adam BN outperformed CAdam. However, the Adam optimizer did not do well overall compared to SGD and Nesterov. Finally, Nesterov BN performed well in the beginning until about 3000 seconds, when CNesterov easily overtook it and kept its advantage until about 11000 seconds.
6 Conclusion and Future works
In this paper, we introduced two ideas. First, we proposed a new way of improving weight correction in BP by anticipating the consequences of our actions in the hypothetical next iteration. It is equivalent to minimizing the loss with the gradients of the outputs, via the weight updates. Second, we explained that one could treat the gradients of the outputs of each layer as the changes to be made towards some virtual targets. Consequently, optimizing by BP is equivalent to optimizing each layer by the LMS algorithm with virtual targets. As a result, it is possible to apply the successful algorithms from single layers to deep neural networks (DNNs).
In spite of the faster convergence of the proposed method in the beginning and more stability during the training, it could get stuck at a worse local minimum in the long run. This weakness should be addressed in future works.
We also got mixed results from the time-based experiments. While the method was competitive in the MLP tests in TensorFlow, the processing demand almost eliminated its advantage in the ResNet tests in Caffe. At least one of two reasons may be the cause: either convolution with big minibatch sizes benefits less from the proposed method than from BN layers, or our Caffe implementation was not optimized as well as TensorFlow. However, the second case is not easy to prove or dispute.
References

Abadi et al. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Atiya and Parlos (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks 11 (3), pp. 697–709.
Bishop (2006). Pattern recognition and machine learning. Information Science and Statistics, Springer. ISBN 9780387310732.
Chapra (2014). Numerical methods for engineers. McGraw-Hill Higher Education. ISBN 9780077492168.
Crammer et al. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research 7 (Mar), pp. 551–585.
Duchi et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
Goodfellow et al. (2016). Deep learning. MIT Press.
Hassoun (1995). Fundamentals of artificial neural networks. MIT Press.
Haykin (2014). Adaptive filter theory: international edition. Pearson Education Limited. ISBN 9780273764083.
Haykin (2009). Neural networks and learning machines, 3/e.
He et al. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
Hoerl and Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
Jia et al. (2014). Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Kingma and Ba (2014). Adam: a method for stochastic optimization. CoRR abs/1412.6980.
Krizhevsky and Hinton (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
LeCun et al. (2015). Deep learning. Nature 521, pp. 436–444.
Morcos et al. (2018). On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
Ozeki and Umeda (1984). An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties. Electronics and Communications in Japan (Part I: Communications) 67 (5), pp. 19–27.
Qian (1999). On the momentum term in gradient descent learning algorithms. Neural Networks 12 (1), pp. 145–151.
Ruder (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Rumelhart et al. (1986). Learning representations by back-propagating errors. Nature 323 (6088), pp. 533.
Schmidhuber (2015). Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117.
Widrow and Hoff (1960). Adaptive switching circuits. Technical report, Stanford Univ Ca Stanford Electronics Labs.
Widrow and Lehr (1990). 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE 78 (9), pp. 1415–1442.
Xiao et al. (2017). External Links: cs.LG/1708.07747.
Zeiler (2012). ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701.