Shampoo: Preconditioned Stochastic Tensor Optimization

by Vineet Gupta, et al.

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
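The abstract's description of the update — one preconditioning matrix per tensor dimension, each built by contracting the gradient over the remaining dimensions — can be sketched for the matrix (2-D) case. The snippet below is a minimal NumPy illustration, not the authors' implementation: the names `shampoo_step` and `matrix_power`, the epsilon floor, and the learning rate are illustrative choices. For a gradient G of shape (m, n), the left statistic accumulates G Gᵀ, the right statistic accumulates Gᵀ G, and the gradient is preconditioned from both sides with inverse fourth-root powers.

```python
import numpy as np

def matrix_power(M, p, eps=1e-6):
    """Fractional power of a symmetric PSD matrix via eigendecomposition.

    Eigenvalues are floored at eps so inverse powers stay finite.
    """
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, eps) ** p) @ V.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style update for a matrix parameter W (sketch).

    L and R are the per-dimension statistics: L contracts the gradient
    over columns (G @ G.T), R contracts over rows (G.T @ G).
    """
    L = L + G @ G.T
    R = R + G.T @ G
    # Precondition from both sides with the -1/4 power of each statistic.
    W = W - lr * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W, L, R
```

For an order-k tensor the same pattern yields k small preconditioners, one per mode, which is what makes the method's memory footprint comparable to first-order optimizers.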

