Amortized Proximal Optimization

02/28/2022
by Juhan Bae et al.

We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods that trade off the current-batch loss against proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO can recover existing optimizers such as natural gradient descent and KFAC. It enjoys low computational overhead and avoids expensive and numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically test APO for online adaptation of learning rates and structured preconditioning matrices on regression, image reconstruction, image classification, and natural language translation tasks. The learning rate schedules found by APO generally outperform optimal fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix generally yields optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making it effective for low-precision training.
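To make the amortization idea concrete, below is a minimal sketch (in PyTorch) of meta-learning a single scalar learning rate by differentiating through a one-step lookahead update and minimizing a proximal-point-style objective: the current-batch loss plus function-space and weight-space proximity terms. The toy data, model, and the proximity weights `lam_fn` and `lam_wt` are illustrative assumptions, not the paper's experimental setup or reference implementation.

```python
# Hedged sketch of the APO idea (not the paper's reference implementation):
# meta-learn a scalar learning rate by differentiating through a one-step
# lookahead update and minimizing a proximal-point-style objective that trades
# off the current-batch loss against function-space and weight-space proximity.
# The toy data, model, and proximity weights below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression problem and a small MLP.
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# Meta-parameter: log learning rate (log space keeps the step size positive).
log_eta = torch.tensor(-3.0, requires_grad=True)
meta_opt = torch.optim.Adam([log_eta], lr=1e-2)

lam_fn, lam_wt = 1.0, 1e-3  # assumed weights on the two proximity terms

for step in range(200):
    idx = torch.randint(0, X.shape[0], (32,))
    xb, yb = X[idx], y[idx]

    params = list(model.parameters())
    names = [n for n, _ in model.named_parameters()]
    loss = loss_fn(model(xb), yb)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # One-step lookahead w' = w - eta * grad(w), kept on the autograd graph
    # so the proximal objective below is differentiable w.r.t. log_eta.
    eta = log_eta.exp()
    new_params = [p - eta * g for p, g in zip(params, grads)]

    # Evaluate the network at the lookahead parameters (functional call).
    out_new = torch.func.functional_call(model, dict(zip(names, new_params)), (xb,))
    out_old = model(xb).detach()

    # Proximal-point-style meta-objective: batch loss at w' plus
    # function-space and weight-space proximity to the current weights w.
    prox = (
        loss_fn(out_new, yb)
        + lam_fn * (out_new - out_old).pow(2).mean()
        + lam_wt * sum((w - p.detach()).pow(2).sum() for w, p in zip(new_params, params))
    )

    # Meta-step on the learning rate.
    meta_opt.zero_grad()
    prox.backward()
    meta_opt.step()
    for p in params:
        p.grad = None  # discard gradients accumulated on the model itself

    # Apply the base update with the freshly adapted learning rate.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= log_eta.exp() * g
```

Replacing the scalar `eta` with a structured (e.g., per-layer or Kronecker-factored) matrix applied to the gradient gives the preconditioner-adaptation variant described in the abstract; because the preconditioner is meta-learned directly, no matrix inverse is ever computed, which is the source of the numerical-stability benefit noted above.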


