No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models

02/06/2022
by Chen Liang, et al.

Recent research has shown that large Transformer models contain significant redundancy: one can prune the redundant parameters without significantly sacrificing generalization performance. However, we question whether those redundant parameters could have contributed more had they been properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure of the parameter's contribution to model performance. A parameter with low sensitivity is redundant, so we improve its fit by increasing its learning rate. Conversely, a parameter with high sensitivity is already well trained, so we regularize it by decreasing its learning rate to prevent overfitting. Extensive experiments on natural language understanding, neural machine translation, and image classification demonstrate the effectiveness of the proposed schedule, and our analysis shows that it indeed reduces redundancy and improves generalization performance.
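As a rough sketch of how such a sensitivity-guided, per-parameter learning rate could be implemented, the PyTorch snippet below approximates each weight's sensitivity with the standard first-order proxy |theta * dL/dtheta|, smooths it with an exponential moving average, and maps it to a per-weight step size that is larger for low-sensitivity weights and smaller for high-sensitivity ones. This is an illustrative assumption, not the paper's exact algorithm: the function name sensitivity_scaled_step, the EMA coefficient, the per-tensor normalization, and the 2x scaling range are all hypothetical choices.

```python
import torch


def sensitivity_scaled_step(params, base_lr, ema_state, beta=0.9, eps=1e-12):
    """SGD-style update whose per-weight learning rate shrinks with sensitivity.

    Low-sensitivity (redundant) weights get a step of up to 2 * base_lr;
    high-sensitivity (well-trained) weights get a step close to zero.
    """
    for p in params:
        if p.grad is None:
            continue
        # First-order sensitivity proxy: approximate change in the loss
        # from zeroing out a weight, |theta * dL/dtheta|.
        sens = (p.detach() * p.grad.detach()).abs()
        # Smooth the noisy mini-batch estimate with an exponential moving average.
        if p in ema_state:
            ema_state[p].mul_(beta).add_(sens, alpha=1.0 - beta)
        else:
            ema_state[p] = sens.clone()
        s = ema_state[p]
        # Normalize to [0, 1] within each tensor so the scaling is
        # comparable across layers of very different magnitudes.
        s_norm = (s - s.min()) / (s.max() - s.min() + eps)
        # Redundant weights (s_norm near 0) train faster; sensitive
        # weights (s_norm near 1) are barely updated.
        lr = 2.0 * base_lr * (1.0 - s_norm)
        with torch.no_grad():
            p.sub_(lr * p.grad)


# Minimal usage: replace optimizer.step() with the sensitivity-scaled update.
model = torch.nn.Linear(16, 4)
ema_state = {}
for _ in range(100):
    x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    model.zero_grad()
    loss.backward()
    sensitivity_scaled_step(model.parameters(), base_lr=1e-2, ema_state=ema_state)
```

Keeping the EMA buffers in an external dictionary mirrors how stateful optimizers carry per-parameter state across steps, so the sensitivity estimate stabilizes as training proceeds.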

Related research

The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains (05/23/2022)
Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima (08/25/2022)
Learning an Adaptive Learning Rate Schedule (09/20/2019)
Applying Cyclical Learning Rate to Neural Machine Translation (04/06/2020)
Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models (09/10/2022)
AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning (11/28/2022)
A one-dimensional morphoelastic model for burn injuries: sensitivity analysis and a feasibility study (10/24/2020)
