Blind Descent: A Prequel to Gradient Descent

06/20/2020 ∙ by Akshat Gupta, et al. ∙ Carnegie Mellon University

We describe an alternative to gradient descent for backpropagation through a neural network, which we call Blind Descent. We believe that Blind Descent can be used to augment backpropagation, both as an initialisation method and at saturation. Blind Descent, inherently by design, does not face problems like exploding or vanishing gradients.


1 Introduction

Blind Descent can be viewed as an alternative or a complement to backpropagation in artificial neural networks. We see Blind Descent as a prequel to gradient descent. It does not require the calculation of gradients, and consequently it does not face certain problems that gradient descent does, namely vanishing and exploding gradients. There is also no need to store values from the forward pass at each node. However, not calculating gradients during backpropagation is what makes this method blind, hence the name Blind Descent.

In later sections, we describe the algorithm and its motivations. We also perform experiments on the MNIST dataset to check the viability of descending blindly through a neural network.

2 Prior work

We did not find much prior work to build on (even though we searched after performing the experiments mentioned in the "Initial RANSAC Concept ideation" section). A remotely similar work is the Extreme Learning Machine [2]. However, there the input-to-hidden layer appears to be randomly initialised and frozen, and only the hidden-to-output layer appears to be optimised, using a pseudo-inverse solution. Even the incremental learning variant appears to do the same, but with convex optimisation for only one layer [3]. Other methods use injected noise [4].

But controlling the noise can be a difficult endeavour, and moreover the data itself can be said to contain noise along with the required information components. So we decided to eliminate the noise present in the data instead of introducing additional noise. There are several variants of the Extreme Learning Machine (ELM), but as we can see in the survey [5], they are either too simplistic and reduce to linear regression, or tend towards convex optimisation problems like SVMs.

However, we did find random-optimisation papers from the statistics literature of the 1960s, and one of the papers that most directly influenced our research is the backpropagation paper [6].

3 Initial RANSAC Concept ideation

We came up with the initial random weight update idea while thinking about removing singularity from linear regression, about using batches in linear regression, and eventually about RANSAC. We thought about using the errors themselves instead of chains of gradients f'(x) of the error f(x), which avoids convex optimisation altogether if we randomly reinitialise the weights repeatedly until the error decreases or the accuracy rises. We used the MNIST dataset from Keras [1], shuffled and split into batches, and randomly initialised the network with random weights (from a uniform distribution) for a given network architecture. For the next batch, we again initialised the weights randomly (from a uniform distribution). Initially, we checked only the accuracy for that batch: if it turned out to be greater than the previous accuracy, we saved the new weights. If batch-wise accuracy reached 100%, we considered the accuracy of the whole epoch and ran the aforementioned RANSAC-like random weight update. However, this method suffers from a saturation problem, where the training accuracy no longer rises as quickly as in the initial iterations. Because of this saturation problem, we thought of using the method as an initialisation for backpropagation instead of a replacement for it. We also thought about approximating the entire network by a fourth-degree polynomial of the inputs (since, by the Abel-Ruffini theorem, closed-form solutions are only possible for polynomials up to degree four). But that approach suffered from an overdetermined system of equations, and solving fourth-degree polynomials without matrix inversion turned out to be a difficult endeavour. We then thought of computing the error at each node (the Bayes minimum mean squared error), which indirectly provides the exact solution. With the errors of all nodes known, the RANSAC-like optimisation has to consider only two candidates per node (because the MSE is a square, and $e^2 = (-e)^2$, for example). But our random weight method might not work successfully if the layers are deep and almost all of the weights in the initial layers are small, as that can cause numerical instabilities. We identified that this RANSAC-like search has the possibility of converging towards the global minimum if we sample from a uniform distribution: because we are not using convex optimisation, it cannot get stuck at a local minimum, as we randomly consider all possibilities when the weights lie in a range such as 0 to 1 or -1 to 1.
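To make the procedure above concrete, here is a minimal sketch of this RANSAC-like search in NumPy; the helper names, the ReLU activation, and the stand-in batch are our illustrative assumptions, not code from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_params(shapes):
    # Fresh uniformly random weights for every layer: each attempt
    # re-draws the whole network rather than perturbing it.
    return [rng.uniform(-1.0, 1.0, s) for s in shapes]

def batch_accuracy(params, X, y):
    W1, W2 = params
    h = np.maximum(0.0, X @ W1)        # hidden layer (ReLU assumed)
    preds = (h @ W2).argmax(axis=1)
    return float((preds == y).mean())

# RANSAC-like loop: keep re-drawing weights, saving them whenever
# the batch accuracy beats the best seen so far.
shapes = [(784, 256), (256, 10)]
best_params, best_acc = random_params(shapes), 0.0
X = rng.random((32, 784))              # stand-in batch; real runs used MNIST
y = rng.integers(0, 10, size=32)
for _ in range(200):
    candidate = random_params(shapes)
    acc = batch_accuracy(candidate, X, y)
    if acc > best_acc:
        best_params, best_acc = candidate, acc
```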

4 Blind Descent

The motivation behind coming up with an algorithm like Blind Descent is to get rid of certain problems we face when we use gradients. To do that, we decided to do away with calculating gradients during backpropagation. Calculating gradients does have its advantages, as they give us the direction of steepest descent. Using gradients, we not only know whether we are going in a correct direction, but also that we are going in the best possible direction, locally speaking. Thus, at first, the idea of Blind Descent seemed unintelligent. But on experimentation, we found that it is not as ludicrous as it seems. We describe the experiments and results in the next section.

In Blind Descent, we do not calculate gradients for backpropagation. That is the one-sentence elevator pitch for the algorithm. But the question is: how does the network learn? The weights at each node of the neural network are randomly initialised. Then, on every iteration, we update the weights randomly. This randomness, however, is guided: only those weight updates that lead to a reduction in the overall loss are kept; the other updates are discarded. This is Blind Descent in its most basic form.


To define the random updates, we need to define the distribution from which the random numbers are drawn. The distribution has a mean and a standard deviation, which together define the random updates. We center this distribution on the current value of the weights. The update rule in Blind Descent is: draw a proposal $w' \sim \mathcal{D}(w_t, \eta)$ and set

$$w_{t+1} = \begin{cases} w' & \text{if } L(w') < L(w_t) \\ w_t & \text{otherwise.} \end{cases}$$

Here, $w_t$ is a weight of a node of the neural network at a certain time step during backpropagation. $\eta$ is the learning rate, which here defines the standard deviation of the distribution. $\mathcal{D}(w_t, \eta)$ is the distribution governing the random updates, centered around the value of the weight at the current time step. $L$ is the loss function for the neural network under consideration.
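As a concrete illustration of the rule, here is a minimal NumPy sketch on a toy loss; the function name `blind_descent_step`, the choice of a normal distribution for $\mathcal{D}$, and the quadratic toy loss are our assumptions for the sketch.

```python
import numpy as np

def blind_descent_step(w, loss_fn, lr, rng):
    # Propose w' ~ N(w, lr): a normal distribution centered on the
    # current weights, with the learning rate as its standard deviation.
    w_new = rng.normal(loc=w, scale=lr)
    # Keep the proposal only if it reduces the loss; otherwise discard it.
    return w_new if loss_fn(w_new) < loss_fn(w) else w

# Toy usage: minimise a simple quadratic loss over a weight vector.
rng = np.random.default_rng(0)
loss = lambda w: float(np.sum(w ** 2))
w = rng.uniform(-1.0, 1.0, size=5)
for _ in range(2000):
    w = blind_descent_step(w, loss, lr=0.1, rng=rng)
print(loss(w))  # approaches 0 as improving proposals accumulate
```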

The above definition of Blind Descent admits various variants. Different distributions can be used to update the weights. The standard deviation can be made smaller and smaller as the loss decreases and we get closer to a good solution, thus making the standard deviation a variable quantity; this would be an analogue of the learning rate schedules used in neural networks these days. We can also use the concept of momentum in Blind Descent, thus favouring the direction of the previous descent. The above description can be considered a broad generalisation of gradient descent: once we take gradients at each node, the update distribution collapses to a deterministic step, and that deterministic version of Blind Descent is gradient descent.

5 Experiments with Blind Descent

We perform various experiments using Blind Descent and compare it to gradient descent.

5.1 A Quadratic Function

We begin by solving the easiest problem: finding the global minimum of the most basic quadratic function, $f(x) = x^2$, as shown in figure 1. We use the normal distribution for the updates in Blind Descent. The learning rate parameter is used as usual for gradient descent, and as the standard deviation of the update rule for Blind Descent. We see that as the learning rate in gradient descent increases, it becomes unable to reach an optimum solution and oscillates, whereas no such problem occurs in Blind Descent. We can keep the learning rate as high as we want in Blind Descent and still reach an optimum solution, although the learning rate does affect the speed of convergence. Similar curves are obtained in higher dimensions for smooth functions.
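A small sketch of this comparison, assuming the quadratic is $f(x) = x^2$ (with gradient $2x$); the exact experimental setup behind figure 1 may differ.

```python
import numpy as np

def gradient_descent(x, lr, steps=200):
    # Standard gradient descent on f(x) = x^2, whose gradient is 2x.
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x

def blind_descent(x, lr, steps=200, seed=0):
    # Blind Descent on f(x) = x^2: Gaussian proposals around x,
    # accepted only when they lower f.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        candidate = rng.normal(x, lr)
        if candidate ** 2 < x ** 2:
            x = candidate
    return x

for lr in (0.01, 0.1, 1.0):
    print(lr, gradient_descent(5.0, lr), blind_descent(5.0, lr))
# With lr = 1.0, gradient descent bounces between +5 and -5 forever,
# while Blind Descent still shrinks |x| towards zero.
```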

(a) Learning rate = 0.01
(b) Learning rate = 0.1
(c) Learning rate = 1
Figure 1: Blind Descent and gradient descent on $f(x) = x^2$ with different learning rates. Note that the learning rate for Blind Descent is the standard deviation of the underlying distribution.

5.2 The MNIST Dataset

Here we use Blind Descent to actually train a neural network. MNIST is not a trivial dataset: it has 10 classes, so the task is harder than binary classification. We were able to achieve accuracies close to 70% on the MNIST dataset. The experiments performed are not tuned for optimal performance, and various techniques like momentum or a standard deviation scheduler could be used to improve the results. Our focus is to present a proof of concept that Blind Descent works.


For training, we use a two-layer neural network. The network architecture is [784, 256, 10], where 784 is the input dimension of MNIST, the hidden layer has 256 neurons, and the output layer has 10 neurons. We use the cross-entropy loss. While training, we do batch updates. These are the steps followed during a batch update in Blind Descent (a sketch in code follows the list):

  1. Calculate loss with current weights for entire batch

  2. Do weight updates using Blind Descent

  3. Calculate loss with new weights for entire batch

  4. If new loss is less than previous loss, adopt new weights
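Below is a minimal NumPy sketch of one such batch update for the [784, 256, 10] network. The ReLU activation, the helper names, and the random stand-in batch are our assumptions for illustration, not code taken from the training runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network [784, 256, 10], as described in the text.
params = {
    "W1": rng.uniform(-1, 1, (784, 256)), "b1": np.zeros(256),
    "W2": rng.uniform(-1, 1, (256, 10)),  "b2": np.zeros(10),
}

def cross_entropy(p, X, y):
    h = np.maximum(0.0, X @ p["W1"] + p["b1"])   # hidden layer (ReLU assumed)
    logits = h @ p["W2"] + p["b2"]
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()

def blind_descent_batch(p, X, y, lr):
    old_loss = cross_entropy(p, X, y)            # 1. loss with current weights
    proposal = {k: rng.normal(v, lr) for k, v in p.items()}  # 2. random update
    new_loss = cross_entropy(proposal, X, y)     # 3. loss with new weights
    if new_loss < old_loss:                      # 4. adopt only if the loss drops
        return proposal, new_loss
    return p, old_loss

# Usage with a random stand-in batch (a real run would feed MNIST batches):
X = rng.random((32, 784))
y = rng.integers(0, 10, size=32)
for _ in range(100):
    params, loss = blind_descent_batch(params, X, y, lr=0.01)
```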

Nowhere in the algorithm do we mention accuracy. Yet following the above procedure takes us to 70% test set accuracy on MNIST. This is especially astonishing when we find that the training and test loss are increasing along with the test accuracy. This can be seen in figure 2.

This is one of the reasons why we believe that backpropagation may not be the only solution to the training process and there may be other effective methods which we have not discovered yet.

(a) Training and Test Loss
(b) Test Accuracy
Figure 2: Blind Descent at work for MNIST dataset. Different batch sizes are compared for learning rate = 0.01

A unique feature we see in Blind Descent is that the training and test loss keep increasing along with the increasing accuracy. This can be seen in both figure 2 and figure 3. It might seem unintuitive, as we accept an update only if the loss after the update is lower than the loss before it. While this is true for one batch, the loss of the updated parameters on the next batch can be higher for those same parameters; the condition only compares the updated loss to the loss on the current batch. This is what causes the loss to increase. Even though that happens, the accuracy still increases. Notice that we never put a condition on the accuracy in our algorithm, and it still increases!

(a) Training and Test Loss
(b) Test Accuracy
Figure 3: Blind Descent at work for MNIST dataset. Different batch sizes are compared for learning rate = 0.001

The training loss and test loss no longer increase when the learning rate is reduced to 0.0001, as can be seen in figure 4, a behaviour that is still unclear to us. All the losses are normalised in exactly the same way for each of the three cases presented. The training accuracies achieved with the same number of epochs are also quite similar for the three learning rates. We also notice that the learning rate does not seem to affect the training time. All of these are open questions.

(a) Training and Test Loss
(b) Test Accuracy
Figure 4: Blind Descent at work for MNIST dataset. Different batch sizes are compared for learning rate = 0.0001

6 Variants

As mentioned previously, many variants of gradient descent can also be used with Blind Descent. These include the use of momentum during backpropagation: one possible implementation of momentum is to shift the mean of the probability distribution in the direction of the previous descent. A learning rate scheduler can also be used, which would make the standard deviation of the probability distribution a variable quantity. We can also use various underlying probability distributions for the updates.
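As a sketch of how these variants might look in code, assuming Gaussian proposals; the momentum rule, the decay factor `beta`, and the scheduler below are our own hypothetical choices rather than designs fixed by this paper.

```python
import numpy as np

def blind_descent_momentum_step(w, velocity, loss_fn, lr, beta, rng):
    # Shift the proposal mean in the direction of the previous accepted
    # update, analogous to momentum in gradient descent.
    w_new = rng.normal(loc=w + beta * velocity, scale=lr)
    if loss_fn(w_new) < loss_fn(w):
        return w_new, w_new - w        # accept; record the descent direction
    return w, beta * velocity          # reject; let the momentum decay

def std_schedule(step, lr0=0.1, decay=0.999):
    # A decaying standard deviation plays the role of a learning rate scheduler.
    return lr0 * decay ** step
```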

7 Discussion

We proposed Blind Descent as an alternative to gradient descent for backpropagation through a neural network. Blind Descent has various advantages over gradient descent, including not having to save forward-pass values or calculate gradients at each node. This happens by design, since we never calculate gradients. It could also mean that the overhead of the implementation does not scale with the size of the network. Accordingly, Blind Descent does not suffer from limitations of gradient descent like vanishing and exploding gradients.

Notice that while backpropagating with Blind Descent, we can choose to accept or reject an update; no such luxury is available in gradient descent, where the gradient for a given state will always be the same. This also remedies another disadvantage of gradient descent: there, we are bound by the actions of the past, including initialisations and gradient updates made based on the then-current values of the learning rate and other parameters. Blind Descent, on the other hand, gives us the opportunity to explore different paths at any moment during the backpropagation process. This control is not demonstrated in our work but can be part of future work.


Even though Blind Descent has a lot of merit, without gradients we do not know the direction in which we should proceed, and thus we are blind. We therefore do not expect Blind Descent to compete with gradient descent in terms of performance. But the merits of Blind Descent make it worth exploring: it can augment gradient descent, either during initialisation or at saturation.

Future work with Blind Descent could include trying out the various variants of Blind Descent discussed in the previous section. We also plan to train larger neural networks using Blind Descent in various domains. There are many other open questions about Blind Descent.

The primary question is: will Blind Descent work with larger neural networks, across different domains like speech, video, and text, and with different neural network structures like CNNs and RNNs? Another question that we do not completely understand is why this method actually works when we train against the loss (instead of accuracy) as the metric. Other questions that have come up in this paper include the peculiar increase in training and test loss accompanying increasing accuracies, and, even more peculiar, the fact that this does not happen for one particular learning rate. The training accuracies seem to follow similar trends for all three learning rates, although the dynamics get smoother as the learning rate decreases. The learning rate seems to have little or no effect on how quickly the network learns. These are some of the questions that we would like to answer going forward. We would also like to train deeper networks and ultra-wide networks with our proposed method.

8 Conflicts of interest

The authors do not have any conflicts of interest. Akshat Gupta is the main contributor and has worked on most parts of this paper, including the code. However, Prasad is the sole contributor of the "Initial RANSAC Concept ideation" section (in terms of both the idea and the software program code); he had taken the CMU courses 16720 (which introduced the RANSAC concept) and 18661 (which introduced linear regression). All of Prasad's experiments were run on his personal laptop and personal AWS instance.

References

  • [1] Keras. https://keras.io/
  • [2] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, vol. 70, pp. 489-501, Elsevier, 2005.
  • [3] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, vol. 17, 2006.
  • [4] J. L. Maryak, D. C. Chin. Global random optimization by simultaneous perturbation stochastic approximation. Proceedings of the Winter Simulation Conference, IEEE, 2001.
  • [5] Guang-Bin Huang, Dian Hui Wang, Yuan Lan. Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, vol. 2, pp. 107-122, 2011.
  • [6] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. Learning representations by back-propagating errors. Nature, vol. 323, 1986.