Scheduled DropHead: A Regularization Method for Transformer Models

04/28/2020
by Wangchunshu Zhou, et al.

In this paper, we introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, a key component of the Transformer, a state-of-the-art model for various NLP tasks. In contrast to conventional dropout mechanisms that randomly drop individual units or connections, DropHead drops entire attention heads during training. This prevents the multi-head attention model from being dominated by a small portion of attention heads and reduces the risk of overfitting the training data, thus making more efficient use of the multi-head attention mechanism. Motivated by recent studies on the learning dynamics of the multi-head attention mechanism, we propose a specific dropout rate schedule that adaptively adjusts the dropout rate of DropHead and achieves a better regularization effect. Experimental results on both machine translation and text classification benchmark datasets demonstrate the effectiveness of the proposed approach.
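To make the mechanism concrete, below is a minimal PyTorch sketch of head-level structured dropout, not the authors' implementation. It assumes per-head attention outputs of shape (batch, num_heads, seq_len, head_dim) before the output projection; the function names and the linear-ramp schedule in drophead_rate are illustrative assumptions, since the abstract describes an adaptive schedule but does not specify its exact shape.

```python
# Minimal sketch of DropHead-style structured dropout over attention heads.
# Shapes, names, and the schedule shape are assumptions for illustration only.
import torch


def drop_head(head_outputs: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Randomly zero out entire attention heads with probability p.

    head_outputs: per-head attention outputs of shape
                  (batch, num_heads, seq_len, head_dim), before the output projection.
    """
    if not training or p <= 0.0:
        return head_outputs
    batch, num_heads = head_outputs.shape[0], head_outputs.shape[1]
    # One Bernoulli keep/drop decision per head and per example in the batch.
    keep = (torch.rand(batch, num_heads, 1, 1, device=head_outputs.device) >= p)
    keep = keep.to(head_outputs.dtype)
    # Rescale the surviving heads so the expected combined output is unchanged.
    return head_outputs * keep / (1.0 - p)


def drophead_rate(step: int, warmup_steps: int, p_max: float) -> float:
    """Illustrative linear-ramp schedule for the DropHead rate.

    The abstract only states that the rate is adjusted adaptively during
    training; this placeholder is not the paper's actual schedule.
    """
    return p_max * min(1.0, step / max(1, warmup_steps))
```

In a Transformer layer, drop_head would be applied to the per-head outputs just before they are concatenated and passed through the output projection, with p supplied by drophead_rate at each training step.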

Related research

MaxMatch-Dropout: Subword Regularization for WordPiece (09/09/2022)
We present a subword regularization method for WordPiece, which uses a m...

DropKey (08/04/2022)
In this paper, we focus on analyzing and improving the dropout technique...

A Dynamic Head Importance Computation Mechanism for Neural Machine Translation (08/03/2021)
Multiple parallel attention mechanisms that use multiple attention heads...

UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost (04/11/2021)
Transformer architecture achieves great success in abundant natural lang...

TargetDrop: A Targeted Regularization Method for Convolutional Neural Networks (10/21/2020)
Dropout regularization has been widely used in deep learning but perform...

Orthogonality Constrained Multi-Head Attention For Keyword Spotting (10/10/2019)
Multi-head attention mechanism is capable of learning various representa...

Low Rank Factorization for Compact Multi-Head Self-Attention (11/26/2019)
Effective representation learning from text has been an active area of r...
