
Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers

by William Held, et al.

Multilingual transformer-based models demonstrate remarkable zero- and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades, a phenomenon termed interference. Often attributed to limited model capacity, interference is commonly addressed by adding additional parameters, despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific parameters. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structured prediction, with gains as large as 24.7%. Finally, we provide insights into language-agnostic and language-specific attention heads using attention visualization.
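The head-attribution step described above can be sketched as a Monte Carlo Shapley estimate over subsets of attention heads: each head's value is its average marginal contribution to a task metric across random activation orders. The function names and the toy additive metric below are illustrative assumptions, not the paper's implementation (which evaluates an actual model with heads masked):

```python
import random

def shapley_head_values(heads, score, n_perms=200, seed=0):
    """Monte Carlo estimate of each attention head's Shapley value.

    heads:   list of head identifiers
    score:   maps a frozenset of active heads -> task metric (higher = better)
    n_perms: number of random permutations to average over
    """
    rng = random.Random(seed)
    values = {h: 0.0 for h in heads}
    for _ in range(n_perms):
        order = heads[:]
        rng.shuffle(order)          # random coalition-building order
        active = set()
        prev = score(frozenset(active))
        for h in order:             # add heads one at a time
            active.add(h)
            cur = score(frozenset(active))
            values[h] += cur - prev  # marginal contribution of h
            prev = cur
    return {h: v / n_perms for h, v in values.items()}

# Toy additive metric: head "a" helps (+2.0), head "b" interferes (-1.0),
# head "c" is neutral. In the additive case Shapley values recover these exactly.
contrib = {"a": 2.0, "b": -1.0, "c": 0.0}
score = lambda active: sum(contrib[h] for h in active)

vals = shapley_head_values(["a", "b", "c"], score)
pruned = [h for h, v in vals.items() if v < 0]  # heads flagged for removal
```

Heads with negative estimated Shapley value are the candidates for pruning, since removing them is expected to raise the target-language metric.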

