SpikeBERT: A Language Spikformer Trained with Two-Stage Knowledge Distillation from BERT

08/29/2023
by Changze Lv, et al.

Spiking neural networks (SNNs) offer a promising avenue for implementing deep neural networks in a more energy-efficient way. However, the network architectures of existing SNNs for language tasks are too simplistic, and deep architectures have not been fully explored, resulting in a significant performance gap compared to mainstream transformer-based networks such as BERT. To this end, we improve a recently proposed spiking transformer (i.e., Spikformer) so that it can process language tasks, and we propose a two-stage knowledge distillation method for training it: the first stage pre-trains the model by distilling knowledge from BERT on a large collection of unlabelled text, and the second stage fine-tunes it on task-specific instances by again distilling knowledge from a BERT model fine-tuned on the same training examples. Through extensive experimentation, we show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve results comparable to BERT on text classification tasks in both English and Chinese, with much lower energy consumption.
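To make the two-stage objective concrete, the sketch below illustrates one plausible form of the losses: a feature-alignment term against BERT's hidden states for the pre-training stage, and a soft-label distillation term from the fine-tuned BERT teacher combined with cross-entropy for the fine-tuning stage. The function names, loss choices, and hyper-parameters are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of two-stage knowledge distillation (not the paper's exact code).
import torch
import torch.nn.functional as F

def stage1_pretrain_loss(student_hidden, teacher_hidden):
    """Stage 1 (pre-training on unlabelled text): align the spiking student's
    representations with BERT's hidden states. MSE feature matching is one
    common choice; the paper may combine several alignment terms."""
    return F.mse_loss(student_hidden, teacher_hidden)

def stage2_finetune_loss(student_logits, teacher_logits, labels,
                         temperature=2.0, alpha=0.5):
    """Stage 2 (task-specific fine-tuning): distill soft labels from the
    fine-tuned BERT teacher (KL term) and keep the usual cross-entropy
    on gold labels. alpha and temperature are assumed hyper-parameters."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

if __name__ == "__main__":
    # Toy usage with random tensors standing in for student/teacher outputs.
    batch, seq_len, hidden, num_classes = 8, 32, 768, 2
    s_hidden = torch.randn(batch, seq_len, hidden, requires_grad=True)
    t_hidden = torch.randn(batch, seq_len, hidden)
    print("stage-1 loss:", stage1_pretrain_loss(s_hidden, t_hidden).item())

    s_logits = torch.randn(batch, num_classes, requires_grad=True)
    t_logits = torch.randn(batch, num_classes)
    labels = torch.randint(0, num_classes, (batch,))
    print("stage-2 loss:", stage2_finetune_loss(s_logits, t_logits, labels).item())
```

In this reading, stage 1 needs only unlabelled text because the teacher's hidden states serve as the training signal, while stage 2 requires labelled task data so that the distillation term can be balanced against the standard classification loss.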

