Stochastic Gradient Descent

What is Stochastic Gradient Descent?

Stochastic gradient descent is a method to find the optimal parameter configuration for a machine learning algorithm. It iteratively makes small adjustments to a machine learning network configuration to decrease the error of the network. 

 

Error functions are rarely as simple as a typical parabola. Most often they have lots of hills and valleys, like the function pictured here. In this graph, if true gradient descent started on the left side of the graph, it would stop at the left valley because no matter which direction you travel from this point, you must travel upwards. This point is known as a local minimum. However, there exists another point in the graph that is lower. The lowest point in the entire graph is the global minimum, which is what stochastic gradient descent attempts to find. 



 
        Image result for local minimum

Stochastic gradient descent attempts to find the global minimum by adjusting the configuration of the network after each training point. Instead of decreasing the error, or finding the gradient,  for the entire data set, this method merely decreases the error by approximating the gradient for a randomly selected batch (which may be as small as single training sample).  In practice, the random selection is achieved by randomly shuffling the dataset and working through batches in a stepwise fashion.

Heuristically,  if the network gets a training example wrong, it will update the configuration in favor of getting it right in the future. However, the configuration update might come at the cost of getting other questions wrong, thus increasing the overall error of the network. Thus not every training iteration may improve the network through the stochastic gradient descent algorithm.
 
On the other hand, stochastic gradient descent can adjust the network parameters in such a way as to move the model out of a local minimum and toward a global minimum. Looking back to the concave function pictured above, after processing a training example, the algorithm may choose to move to the right on the graph in order to get out of the local minimum we were in. Even though doing so increases the error of the network, it allows it to move over the hill. This will allow further training to cause the gradient descent to move toward the global minimum. 
 
A benefit of stochastic gradient descent is that it requires much less computation than true gradient descent (and is therefore faster to calculate), while still generally converging to a minimum (although not necessarily a global one).