Understanding Multi-Head Attention in Abstractive Summarization

11/10/2019
by Joris Baan, et al.

Attention mechanisms in deep learning architectures are often used as a means of transparency, that is, to shed light on the inner workings of these architectures. Recently, there has been growing interest in whether this assumption is justified. In this paper we investigate the interpretability of multi-head attention in abstractive summarization, a sequence-to-sequence task for which attention does not have an intuitive alignment role, as it does in machine translation. We first introduce three metrics to gain insight into the focus of attention heads and observe that these heads specialize towards relative positions, specific part-of-speech tags, and named entities. However, we also find that ablating and pruning these heads does not lead to a significant drop in performance, indicating redundancy. After replacing the softmax activation functions with sparsemax activation functions, attention heads seem to behave more transparently: fewer heads can be ablated, and heads score higher on our interpretability metrics. However, when we apply pruning to the sparsemax model, we find that even more heads can be pruned, raising the question of whether enforced sparsity actually improves transparency. Finally, relative position heads appear integral to summarization performance and persistently remain after pruning.
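To make the idea of a head-focus metric concrete, the sketch below estimates, for each head, how far the most-attended source position lies from the current query position; a head that consistently attends to, say, the previous token would score close to -1. This is only an illustration of a relative-position statistic: the function name and the argmax-offset formulation are assumptions, not the paper's exact definitions.

```python
import numpy as np

def relative_position_focus(attn):
    """Per-head relative-position statistic (illustrative, not the paper's
    exact metric): for every query position, take the offset between that
    position and the key position receiving the highest attention weight,
    then average over queries. `attn` has shape
    (num_heads, seq_len, seq_len) and is assumed to be self-attention."""
    num_heads, q_len, k_len = attn.shape
    queries = np.arange(q_len)
    offsets = attn.argmax(axis=-1) - queries   # (num_heads, q_len)
    return offsets.mean(axis=-1)               # (num_heads,)

# Toy check: a single head that always looks one position to the left.
attn = np.zeros((1, 5, 5))
attn[0, 0, 0] = 1.0
attn[0, np.arange(1, 5), np.arange(0, 4)] = 1.0
print(relative_position_focus(attn))  # -> approximately [-0.8]
```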

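The sparsemax activation mentioned above is the Euclidean projection of the attention logits onto the probability simplex (Martins and Astudillo, 2016); unlike softmax, it can assign exactly zero weight to most positions, which is why it is a candidate for making heads look more transparent. A minimal NumPy sketch of the generic definition (not the authors' training code):

```python
import numpy as np

def softmax(z):
    """Dense distribution: every position receives non-zero weight."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: projection of the logits onto the probability simplex,
    so many attention weights become exactly zero."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z    # threshold subtracted from logits
    return np.maximum(z - tau, 0.0)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))    # ~[0.66, 0.24, 0.10]: every position attended to
print(sparsemax(logits))  # [1.0, 0.0, 0.0]: attention collapses onto one position
```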

Related research

07/01/2019  Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Learning algorithms become more powerful, often at the cost of increased...

10/20/2018  Abstractive Summarization Using Attentive Neural Techniques
In a world of proliferating data, the ability to rapidly summarize text ...

04/14/2021  Sparse Attention with Linear Units
Recently, it has been argued that encoder-decoder models can be made mor...

08/10/2021  Differentiable Subset Pruning of Transformer Heads
Multi-head attention, a collection of several attention mechanisms that ...

05/23/2019  Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Multi-head self-attention is a key component of the Transformer, a state...

05/22/2017  A Regularized Framework for Sparse and Structured Neural Attention
Modern neural networks are often augmented with an attention mechanism, ...

10/24/2018  Multi-Head Attention with Disagreement Regularization
Multi-head attention is appealing for the ability to jointly attend to i...
