Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport

05/27/2022
by   Lingkai Kong, et al.

The problem of optimization on the Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied, partly due to its rich machine learning applications. Nevertheless, a new approach is proposed here, based for the first time on an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. The method exactly preserves the manifold structure but does not require the commonly used projection or retraction, and thus has low computational cost compared to existing algorithms. Its generalization to adaptive learning rates is also demonstrated. Good performance is observed in various practical tasks. For instance, we find that placing orthogonality constraints on the attention heads of a trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] can remarkably improve its performance when our optimizer is used, and that it is better to make each head orthogonal within itself but not necessarily to other heads. The optimizer also makes the useful notion of Projection Robust Wasserstein Distance [Paty & Cuturi 2019][Lin et al. 2020] for high-dimensional optimal transport even more effective.
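For readers unfamiliar with the setting, the sketch below illustrates the kind of retraction-based step that the abstract contrasts against. It is not the paper's momentum optimizer. It assumes a toy objective f(X) = -trace(X^T A X) over 4 x 2 matrices with orthonormal columns, and the matrix A, the step size, and all helper names are hypothetical choices made only for illustration.

```python
import math
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def gram_schmidt(Y):
    """Return a matrix with orthonormal columns spanning the columns of Y."""
    basis = []
    for v in transpose(Y):
        for q in basis:
            coeff = sum(a * b for a, b in zip(q, v))
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return transpose(basis)

def orthogonality_error(X):
    """Largest absolute entry of X^T X - I (zero on the Stiefel manifold)."""
    G = matmul(transpose(X), X)
    p = len(G)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0))
               for i in range(p) for j in range(p))

def objective(A, X):
    """Toy objective f(X) = -trace(X^T A X) (an assumption, not from the paper)."""
    AX = matmul(A, X)
    return -sum(x * ax for xr, axr in zip(X, AX) for x, ax in zip(xr, axr))

def retraction_step(A, X, lr):
    """One Riemannian gradient step followed by a Gram-Schmidt retraction."""
    grad = [[-2.0 * v for v in row] for row in matmul(A, X)]  # Euclidean gradient
    XtG = matmul(transpose(X), grad)
    p = len(XtG)
    sym = [[0.5 * (XtG[i][j] + XtG[j][i]) for j in range(p)] for i in range(p)]
    X_sym = matmul(X, sym)
    # Tangent-space projection: grad - X * sym(X^T grad), then a plain descent step.
    Y = [[x - lr * (g - c) for x, g, c in zip(xr, gr, cr)]
         for xr, gr, cr in zip(X, grad, X_sym)]
    # The retraction maps Y back onto the manifold; this is the extra cost
    # that the abstract says the proposed method avoids.
    return gram_schmidt(Y)

# Illustrative run: symmetric A with eigenvalues 3, 2, 1, 0.5.
A = [[3.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
rng = random.Random(0)
X = gram_schmidt([[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)])
f_start = objective(A, X)
for _ in range(200):
    X = retraction_step(A, X, 0.1)
f_end = objective(A, X)
print(f_start, f_end, orthogonality_error(X))
```

Every iterate stays on the manifold only because of the explicit re-orthonormalization at the end of each step. The abstract's claim is that the proposed optimizer keeps the same exact feasibility without that operation.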

research
05/29/2019

Entropic Regularisation of Robust Optimal Transport

Grogan et al [11,12] have recently proposed a solution to colour transfe...
research
06/08/2019

A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

This paper provides a simple procedure to fit generative networks to tar...
research
11/02/2022

A new method for determining Wasserstein 1 optimal transport maps from Kantorovich potentials, with deep learning applications

Wasserstein 1 optimal transport maps provide a natural correspondence be...
research
09/01/2021

Wasserstein GANs with Gradient Penalty Compute Congested Transport

Wasserstein GANs with Gradient Penalty (WGAN-GP) are an extremely popula...
research
12/19/2022

Fully Probabilistic Design for Optimal Transport

The goal of this paper is to introduce a new theoretical framework for O...
research
04/19/2023

Generative Modeling of Time-Dependent Densities via Optimal Transport and Projection Pursuit

Motivated by the computational difficulties incurred by popular deep lea...
research
07/24/2023

An Isometric Stochastic Optimizer

The Adam optimizer is the standard choice in deep learning applications....
