
Pruning a BERT-based Question Answering Model

by J. S. McCarley et al.

We investigate compressing a BERT-based question answering system by pruning parameters from the underlying BERT model. We start from models trained for SQuAD 2.0 and introduce gates that allow selected parts of transformers to be individually eliminated. Specifically, we investigate (1) reducing the number of attention heads in each transformer, (2) reducing the intermediate width of the feed-forward sublayer of each transformer, and (3) reducing the embedding dimension. We compare several approaches for determining the values of these gates. We find that a combination of pruning attention heads and the feed-forward layer almost doubles the decoding speed, with only a 1.5 F1-point loss in accuracy.
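The gating idea described above can be illustrated with a minimal numpy sketch: each attention head's output is scaled by a learned gate value, and heads whose gates fall below a threshold are pruned (zeroed, and eventually removed from the weights). The function name, gate values, and threshold here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gate_attention_heads(head_outputs, gates, threshold=0.5):
    """Scale each head's output by its gate; prune heads below threshold.

    head_outputs: array of shape (num_heads, seq_len, head_dim)
    gates: array of shape (num_heads,), gate values in [0, 1]
    Returns the gated outputs and a boolean mask of surviving heads.
    """
    keep = gates >= threshold                     # heads that survive pruning
    effective = gates * keep                      # pruned heads contribute 0
    gated = head_outputs * effective[:, None, None]  # broadcast over seq, dim
    return gated, keep

# Toy example: 4 heads, sequence length 3, per-head dimension 2.
rng = np.random.default_rng(0)
outputs = rng.standard_normal((4, 3, 2))
gates = np.array([0.9, 0.1, 0.7, 0.05])  # hypothetical learned gate values
gated, keep = gate_attention_heads(outputs, gates)
# keep is [True, False, True, False]; rows 1 and 3 of gated are all zeros,
# so those heads (and their parameters) can be removed at decoding time.
```

In practice the gates are determined by the methods compared in the paper (e.g. trained or estimated per structure), and pruned heads are physically removed from the weight matrices so the speedup is realized at inference.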
