Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization

10/14/2022
by Manuele Rusci, et al.

This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU) with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline that interleaves the parallel computation of LSTM or GRU recurrent blocks, exploiting vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually managed memory transfers of the model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8 bits while keeping the remaining layers in FP16. Experiments are conducted on multiple LSTM- and GRU-based SE models trained on the Valentini dataset, with up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4x with respect to the lossless FP16 baselines. Unlike uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme limits the degradation to only 0.06, while achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models into the limited on-chip non-volatile memory, and we gain an MCU power saving of up to 2.5x by reducing the supply voltage from 0.8V to 0.65V while still meeting the real-time constraints. Our design is 10x more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs, which rely on smaller models and quantization-aware training.
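To illustrate the mixed FP16-INT8 PTQ idea described above, the sketch below quantizes only the recurrent-layer weights to INT8 while other layers stay in FP16. This is an illustrative NumPy example only: the symmetric per-tensor scheme, the function names, and the layer shapes are assumptions for the sketch, not the paper's actual implementation.

```python
# Minimal sketch of mixed FP16-INT8 post-training quantization:
# recurrent (LSTM/GRU) weights -> symmetric per-tensor INT8,
# remaining layers -> FP16. Names and shapes are illustrative.
import numpy as np

def quantize_int8_symmetric(w_fp32: np.ndarray):
    """Return (INT8 weights, scale) for symmetric per-tensor quantization."""
    scale = np.max(np.abs(w_fp32)) / 127.0
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP16 approximation of the original weights."""
    return (w_int8.astype(np.float32) * scale).astype(np.float16)

rng = np.random.default_rng(0)
w_lstm = rng.standard_normal((256, 256)).astype(np.float32)  # recurrent layer -> INT8
w_fc   = rng.standard_normal((257, 256)).astype(np.float16)  # other layers kept in FP16

w_lstm_q, s = quantize_int8_symmetric(w_lstm)
err = np.max(np.abs(dequantize(w_lstm_q, s).astype(np.float32) - w_lstm))
print(f"INT8 scale = {s:.4f}, max abs weight error = {err:.4f}")
```

In this setup the quantization error on the recurrent weights is bounded by half the per-tensor scale, which is consistent with the small PESQ degradation reported for the mixed-precision scheme compared with quantizing every layer to 8 bits.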


