Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets

07/13/2022
by Lu Zeng, et al.

We propose a novel 2-stage sub 8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the first stage, we adapt a recently proposed quantization technique that applies a non-linear transformation with tanh(.) to dense-layer weights. In the second stage, we apply linear quantization to the rest of the network, including the remaining parameters (bias, gain, batchnorm), inputs, and activations. We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data and evaluating on 4,000 hours of data. We organize our results in two embedded chipset settings: a) with the commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, and 8-bit) and 8-bit quantization of the rest of the network; b) with off-the-shelf neural network accelerators, we present accuracy results for a range of weight bit widths (1 and 5-bit) and project the reduction in memory utilization. In both configurations, our results show that the proposed algorithm achieves: a) parity with a full floating-point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at a given false rejection rate (FRR); b) significant reductions in compute and memory, with up to 3x lower CPU consumption and more than 4x lower memory consumption.
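To make the two-stage scheme concrete, below is a minimal NumPy sketch of the two kinds of quantizers the abstract describes: a tanh(.)-transformed weight quantizer (in the spirit of DoReFa-style weight quantization) for stage 1, and a plain uniform (linear) quantizer for stage 2. The function names, bit-width choices, and the exact transform are illustrative assumptions, not the paper's implementation; in quantization-aware training, the rounding step would additionally be paired with a straight-through estimator for gradients.

```python
import numpy as np

def tanh_weight_quantize(w: np.ndarray, num_bits: int) -> np.ndarray:
    """Stage 1 (sketch): quantize dense-layer weights after a tanh(.) transform.

    Hypothetical illustration in the spirit of DoReFa-style weight
    quantization; the paper's exact transform may differ.
    """
    # Non-linear compression: tanh(.) bounds outliers; rescale into [-1, 1].
    w_t = np.tanh(w)
    w_t = w_t / np.max(np.abs(w_t))
    # Shift to [0, 1], round onto 2^b - 1 uniform levels, shift back to [-1, 1].
    levels = 2 ** num_bits - 1
    w_q = np.round((w_t + 1.0) / 2.0 * levels) / levels
    return 2.0 * w_q - 1.0

def linear_quantize(x: np.ndarray, num_bits: int) -> np.ndarray:
    """Stage 2 (sketch): uniform (linear) quantization for the remaining
    tensors (bias, gain, batchnorm parameters, inputs, activations)."""
    # Symmetric per-tensor scale chosen from the observed dynamic range.
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    if scale == 0.0:
        return x
    q = np.clip(np.round(x / scale),
                -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
    return q * scale  # de-quantized values used in the QAT forward pass

# Example: 4-bit weights, 8-bit activations.
rng = np.random.default_rng(0)
w4 = tanh_weight_quantize(rng.normal(size=(64, 64)), num_bits=4)
a8 = linear_quantize(rng.normal(size=(64,)), num_bits=8)
```

The tanh(.) pre-transform compresses weight outliers before uniform rounding, which is what allows the dense-layer weights to survive at 4-5 bits, while the simpler linear quantizer suffices for the better-behaved remaining tensors.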


