Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

by   Hongyu Gong, et al.

Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of +2.0 BLEU over 13 language directions in multilingual setting and +2.0 BLEU over 3 domains in multi-domain setting.


page 10

page 11


Adaptive Sparse Transformer for Multilingual Translation

Multilingual machine translation has attracted much attention recently d...

Cross-Modal Transfer Learning for Multilingual Speech-to-Text Translation

We propose an effective approach to utilize pretrained speech and text m...

Multi^2OIE: Multilingual Open Information Extraction based on Multi-Head Attention with BERT

In this paper, we propose Multi^2OIE, which performs open information ex...

FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

In this paper, we describe our end-to-end multilingual speech translatio...

Parameter Sharing Methods for Multilingual Self-Attentional Translation Models

In multilingual neural machine translation, it has been shown that shari...

Adaptive multilingual speech recognition with pretrained models

Multilingual speech recognition with supervised learning has achieved gr...