Guiding Attention for Self-Supervised Learning with Transformers

10/06/2020
by Ameet Deshpande, et al.

In this paper, we propose a simple and effective technique for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function that guides attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence as well as better performance on downstream tasks than the baselines, achieving state-of-the-art results in low-resource settings. Surprisingly, we also find that the linguistic properties of attention heads are not necessarily correlated with language modeling performance.
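
To make the idea concrete, the sketch below is a minimal Python/PyTorch illustration of such an auxiliary guidance loss, not the authors' released implementation. It assumes one fixed target pattern (a hypothetical "attend to the previous token" pattern), penalizes the mean squared error between a guided head's attention map and that pattern, and combines the term with the pre-training loss via an illustrative weighting coefficient lambda_guide. The pattern choice, the MSE formulation, and lambda_guide are all assumptions for the sake of the example.

import torch
import torch.nn.functional as F

def previous_token_pattern(seq_len: int) -> torch.Tensor:
    # Fixed target pattern: each position attends to the token before
    # it; position 0 attends to itself. (Illustrative pattern choice.)
    pattern = torch.zeros(seq_len, seq_len)
    pattern[0, 0] = 1.0
    rows = torch.arange(1, seq_len)
    pattern[rows, rows - 1] = 1.0
    return pattern

def attention_guidance_loss(attn: torch.Tensor, pattern: torch.Tensor) -> torch.Tensor:
    # attn: (batch, heads, seq_len, seq_len) softmax attention weights
    # of the heads being guided. pattern: (seq_len, seq_len) target,
    # broadcast across batch and heads.
    return F.mse_loss(attn, pattern.expand_as(attn))

# Hypothetical combined objective: the usual pre-training loss (e.g.,
# masked language modeling) plus the weighted guidance term.
# total_loss = mlm_loss + lambda_guide * attention_guidance_loss(attn, pattern)

Because the target pattern is fixed and requires no learned parameters, the extra term adds only an elementwise comparison per guided head, which is one way to read the abstract's claim of computational efficiency.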

Related research

11/29/2021
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
Vision Transformers (ViTs) have shown great performance in self-supervis...

12/22/2021
Fine-grained Multi-Modal Self-Supervised Learning
Multi-Modal Self-Supervised Learning from videos has been shown to impro...

08/03/2020
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Music annotation has always been one of the critical topics in the field...

01/10/2020
Pruning Convolutional Neural Networks with Self-Supervision
Convolutional neural networks trained without supervision come close to ...

06/09/2022
Spatial Entropy Regularization for Vision Transformers
Recent work has shown that the attention maps of Vision Transformers (VT...

06/05/2020
Understanding Self-Attention of Self-Supervised Audio Transformers
Self-supervised Audio Transformers (SAT) enable great success in many do...

06/08/2022
CASS: Cross Architectural Self-Supervision for Medical Image Analysis
Recent advances in Deep Learning and Computer Vision have alleviated man...