Pruning a BERT-based Question Answering Model
We investigate compressing a BERT-based question answering system by pruning parameters from the underlying BERT model. We start from models trained for SQuAD 2.0 and introduce gates that allow selected parts of each transformer layer to be individually eliminated. Specifically, we investigate (1) reducing the number of attention heads in each transformer layer, (2) reducing the intermediate width of the feed-forward sublayer of each transformer layer, and (3) reducing the embedding dimension. We compare several approaches for determining the values of these gates. We find that a combination of pruning attention heads and the feed-forward sublayer almost doubles the decoding speed, with only a 1.5 F1-point loss in accuracy.
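As an illustration of the gating idea, the sketch below shows one plausible way to attach learnable per-head gates to a multi-head attention output in PyTorch. This is not the paper's implementation; the names (`HeadGate`, `num_heads`, `head_dim`) and the use of a simple scalar gate per head are assumptions for illustration only, and the paper compares several strategies for determining the gate values before pruning.

```python
# Minimal sketch, assuming a PyTorch transformer where per-head context
# vectors are available before concatenation. A gate driven to zero
# corresponds to removing that head, so its parameters can be pruned.
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Multiplies each attention head's output by a scalar gate."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One gate per head, initialized open (value 1.0).
        self.gate = nn.Parameter(torch.ones(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        return head_outputs * self.gate.view(1, -1, 1, 1)


# Usage: wrap the per-head context vectors before they are concatenated
# and projected; once gate values are determined (e.g., by an importance
# score or sparsity-inducing regularization), heads whose gates reach
# zero can be removed from the model entirely.
gate = HeadGate(num_heads=12)
ctx = torch.randn(2, 12, 128, 64)   # (batch, heads, seq_len, head_dim)
gated_ctx = gate(ctx)
```

Analogous gates can be placed on the feed-forward sublayer's intermediate units and on the embedding dimensions, which are the other two pruning targets the abstract describes.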