A Mixture of h-1 Heads is Better than h Heads

05/13/2020
by Hao Peng, et al.

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them: the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts (MAE) model. MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English-to-German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.
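
To make the abstract's idea concrete, here is a minimal PyTorch sketch, based only on the description above: each expert is the attention layer with one of its h heads masked out (so each expert uses h-1 heads), and a small gate assigns input-dependent responsibilities over the h experts. The class name, the mean-pooled gating input, the h/(h-1) rescaling, and the two-optimizer setup are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a mixture-of-attentive-experts layer (assumed design).
# Expert i is multi-head self-attention with head i masked out (h-1 active heads);
# a learned gate produces input-dependent responsibilities over the h experts.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfAttentiveExperts(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0 and num_heads > 1
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared Q/K/V projections
        self.out = nn.Linear(d_model, d_model)       # shared output projection
        self.gate = nn.Linear(d_model, num_heads)    # responsibilities over experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, h, T, d_head).
        q, k, v = (t.view(B, T, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = F.softmax(scores, dim=-1) @ v                      # (B, h, T, d_head)

        # Input-dependent responsibilities, here gated on the mean-pooled input.
        resp = F.softmax(self.gate(x.mean(dim=1)), dim=-1)         # (B, h)

        # Expert i drops head i and rescales the remaining heads by h/(h-1).
        outputs = []
        for i in range(self.h):
            kept = heads.clone()
            kept[:, i] = 0.0
            merged = kept.transpose(1, 2).reshape(B, T, D) * self.h / (self.h - 1)
            outputs.append(self.out(merged))
        experts = torch.stack(outputs, dim=1)                      # (B, h, T, D)

        # Mix the experts with their responsibilities.
        return (resp[:, :, None, None] * experts).sum(dim=1)       # (B, T, D)


# Block coordinate descent, sketched: alternate between updating the gate
# (which sets the experts' responsibilities) and updating the expert parameters.
model = MixtureOfAttentiveExperts(d_model=512, num_heads=8)
gate_opt = torch.optim.Adam(model.gate.parameters(), lr=1e-4)
expert_opt = torch.optim.Adam(
    [p for n, p in model.named_parameters() if not n.startswith("gate")], lr=1e-4
)
```

In one training phase the gate optimizer would step while the expert parameters are held fixed; in the other, the expert optimizer would step while the gate is held fixed.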

Related Research

11/06/2017 · Weighted Transformer Network for Machine Translation
State-of-the-art results on neural machine translation often use attenti...

10/11/2022 · Mixture of Attention Heads: Selecting Attention Heads Per Token
Mixture-of-Experts (MoE) networks have been proposed as an efficient way...

09/11/2021 · Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy
Simultaneous machine translation (SiMT) generates translation before rea...

07/30/2018 · Doubly Attentive Transformer Machine Translation
In this paper a doubly attentive transformer machine translation model (...

02/21/2018 · Globally Consistent Algorithms for Mixture of Experts
Mixture-of-Experts (MoE) is a widely popular neural network architecture...

05/03/2023 · Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Mixture-of-experts (MoE) models that employ sparse activation have demon...

02/06/2018 · Granger-causal Attentive Mixtures of Experts
Several methods have recently been proposed to detect salient input feat...
