Towards Efficient and Scalable Sharpness-Aware Minimization

03/05/2022
by Yong Liu, et al.

Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape with generalization, has demonstrated significant performance boosts on training large-scale models such as vision transformers. However, the update rule of SAM requires two sequential (non-parallelizable) gradient computations at each step, which can double the computational overhead. In this paper, we propose a novel algorithm, LookSAM, that only periodically calculates the inner gradient ascent, to significantly reduce the additional training cost of SAM. The empirical results show that LookSAM achieves similar accuracy gains to SAM while being tremendously faster: it enjoys computational complexity comparable to that of first-order optimizers such as SGD or Adam. To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform experiments in the large-batch training scenario, which is more prone to converging to sharp local minima. We are the first to successfully scale up the batch size when training Vision Transformers (ViTs). With a 64k batch size, we are able to train ViTs from scratch in minutes while maintaining competitive performance.
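The abstract sketches the core trade-off: standard SAM pays for two sequential gradient computations per step, while LookSAM refreshes the inner ascent step only periodically. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation; `model`, `loss_fn`, `loader`, `optimizer`, `rho`, and `k` are illustrative placeholders, and the non-refresh steps here simply fall back to a plain first-order update, which simplifies LookSAM's actual reuse of the ascent-direction gradient component.

```python
# A minimal sketch of SAM's two-gradient step and a LookSAM-style schedule.
# Not the authors' implementation; hyperparameters rho and k are illustrative.
import torch


def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One SAM update: two sequential (non-parallelizable) gradient computations."""
    # 1st gradient: ascent direction at the current weights.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # climb to the worst-case neighbor
            eps.append(e)
    optimizer.zero_grad()
    # 2nd gradient: evaluated at the perturbed weights (the extra cost of SAM).
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                  # restore the original weights
    optimizer.step()
    optimizer.zero_grad()


def train(model, loss_fn, loader, optimizer, rho=0.05, k=5):
    """LookSAM-style schedule: run the full two-gradient SAM step only every
    k iterations; otherwise fall back to a single-gradient first-order update."""
    for step, (x, y) in enumerate(loader):
        if step % k == 0:
            sam_step(model, loss_fn, x, y, optimizer, rho=rho)
        else:
            loss_fn(model(x), y).backward()
            optimizer.step()
            optimizer.zero_grad()
```

With k = 5, roughly four out of every five iterations cost the same as plain SGD or Adam, which is how the periodic schedule recovers most of the first-order training speed.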

Related research

06/01/2021 · Concurrent Adversarial Learning for Large-Batch Training
Large-batch training has become a commonly used technique when training ...

05/27/2022 · Sharpness-Aware Training for Free
Modern deep neural networks (DNNs) have achieved state-of-the-art perfor...

08/20/2023 · Enhancing Transformers without Self-supervised Learning: A Loss Landscape Perspective in Sequential Recommendation
Transformer and its variants are a powerful class of architectures for s...

12/16/2020 · Study on the Large Batch Size Training of Neural Networks Based on the Second Order Gradient
Large batch size training in deep neural networks (DNNs) possesses a wel...

10/23/2022 · K-SAM: Sharpness-Aware Minimization at the Speed of SGD
Sharpness-Aware Minimization (SAM) has recently emerged as a robust tech...

10/16/2021 · Sharpness-Aware Minimization Improves Language Model Generalization
The allure of superhuman-level capabilities has led to considerable inte...

06/01/2016 · Efficiently Bounding Optimal Solutions after Small Data Modification in Large-Scale Empirical Risk Minimization
We study large-scale classification problems in changing environments wh...