Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

05/04/2020
by Christopher Brix, et al.

Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and for on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English-to-German and English-to-French tasks, we show that stabilized lottery ticket pruning performs similarly to magnitude pruning for sparsity levels of up to 85%, and propose a new combination of pruning techniques that outperforms all other techniques for even higher levels of sparsity. Furthermore, we confirm that the parameter's initial sign, and not its specific value, is the primary factor for successful training, and we show that magnitude pruning cannot be used to find winning lottery tickets.
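To make the two procedures concrete, the following PyTorch-style sketch contrasts magnitude pruning, which keeps the converged weights that survive the mask, with stabilized lottery ticket pruning, which applies the same converged-model mask to the weights as they were after none or only a few training iterations (the rewind point) before retraining. The function names, the dictionary-of-named-tensors parameter format, and the constant magnitude in the sign experiment are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def magnitude_mask(converged_params, sparsity):
    """Global magnitude pruning mask computed from the converged, unpruned model:
    the `sparsity` fraction of weights with the smallest absolute value is pruned
    (mask entry 0); the remaining weights are kept (mask entry 1)."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in converged_params.values()])
    num_pruned = int(sparsity * all_weights.numel())
    if num_pruned == 0:
        return {name: torch.ones_like(p) for name, p in converged_params.items()}
    threshold = all_weights.kthvalue(num_pruned).values
    return {name: (p.detach().abs() > threshold).float()
            for name, p in converged_params.items()}


def stabilized_lottery_ticket(rewind_params, converged_params, sparsity):
    """Stabilized lottery ticket pruning: the mask is derived from the converged
    model, but it is applied to the weights from the rewind point (after none or
    few training iterations); the resulting sparse sub-network is then retrained."""
    mask = magnitude_mask(converged_params, sparsity)
    pruned = {name: rewind_params[name] * mask[name] for name in rewind_params}
    return pruned, mask


def sign_rewind(rewind_params, reference_magnitude=0.02):
    """Sign experiment (sketch): keep only the sign of each rewind-point weight and
    replace its magnitude with a constant (0.02 here is a hypothetical value),
    probing the claim that the initial sign, not the exact value, drives training."""
    return {name: torch.sign(p) * reference_magnitude
            for name, p in rewind_params.items()}
```

In plain magnitude pruning, retraining would instead continue from `converged_params[name] * mask[name]`; the lottery ticket variant differs only in which parameter snapshot the mask is applied to.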

