What are Adaptive Subgradient Methods (AdaGrad)?
AdaGrad is a family of stochastic gradient optimization algorithms that adapts the learning rate for each parameter individually. Instead of a single universal learning rate, AdaGrad maintains a per-parameter learning rate by scaling the base rate with each parameter's accumulated squared gradients. This considerably improves performance on problems with sparse gradients, such as natural language processing or computer vision tasks.
How does AdaGrad Work?
For parameters associated with frequently occurring features, the accumulated gradient history is large, so the updates are small, i.e. low effective learning rates. For parameters tied to infrequent features, the history stays small and the updates are large, i.e. high effective learning rates. This makes AdaGrad well suited to sparse datasets, since rarely seen features still receive meaningful updates. In addition, AdaGrad largely eliminates the need to manually tune the learning rate.
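A minimal NumPy sketch of the idea follows. The function name `adagrad_update` and the default `lr`/`eps` values are illustrative choices, not prescribed by the text above:

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step with a per-parameter effective learning rate."""
    accum += grads ** 2                              # running sum of squared gradients
    params -= lr * grads / (np.sqrt(accum) + eps)    # large accumulator -> small step
    return params, accum

# Parameters tied to the rare feature (index 2) accumulate little history,
# so they keep a comparatively large effective learning rate.
w = np.zeros(3)
G = np.zeros(3)                                      # one accumulator per parameter
for g in [np.array([1.0, 0.5, 0.0]),
          np.array([1.0, 0.5, 0.0]),
          np.array([0.0, 0.0, 0.2])]:
    w, G = adagrad_update(w, g, G)
```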
However, AdaGrad’s main disadvantage is that the squared gradients in the denominator keep accumulating. Every additional term is positive, so the sum grows throughout training. This shrinks the effective learning rate until it is vanishingly small, at which point the algorithm effectively stops learning.
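A tiny numeric illustration of the problem, assuming a constant gradient of 1.0 and a base learning rate of 0.1 (both arbitrary): the effective step size decays like 1/sqrt(t) and never recovers.

```python
import math

lr, eps, accum = 0.1, 1e-8, 0.0
for t in range(1, 10_001):
    g = 1.0                                   # pretend the gradient stays constant
    accum += g ** 2                           # the sum only ever grows
    step = lr * g / (math.sqrt(accum) + eps)  # effective step shrinks like 1/sqrt(t)
    if t in (1, 100, 10_000):
        print(f"t={t:>6}  effective step={step:.5f}")
# t=     1  effective step=0.10000
# t=   100  effective step=0.01000
# t= 10000  effective step=0.00100
```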
Alternatives to AdaGrad
Several alternative algorithms have been developed to address this issue with AdaGrad; a short sketch after the list shows the idea most of them share, replacing the accumulated sum with a decaying average. Some of the most common are:
- Adadelta
- Adam
- AdaMax
- AMSGrad
- RMSprop
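These methods keep AdaGrad's per-parameter scaling but stop the denominator from growing without bound. As an illustration of the common idea rather than the exact update rule of any one library, here is an RMSprop-style step in which a decaying average of squared gradients replaces the running sum; the `rho`, `lr`, and `eps` defaults are illustrative:

```python
import numpy as np

def rmsprop_style_update(params, grads, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop-style step: a decaying average of squared gradients replaces
    AdaGrad's unbounded sum, so the effective learning rate stops shrinking to zero."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grads ** 2     # old gradients are forgotten
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq
```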