Optimizing Speech Recognition For The Edge

09/26/2019 ∙ by Yuan Shangguan, et al. ∙ Google

While most deployed speech recognition systems today still run on servers, we are in the midst of a transition towards deployments on edge devices. This leap to the edge is powered by the progression from traditional speech recognition pipelines to end-to-end (E2E) neural architectures, and the parallel development of more efficient neural network topologies and optimization techniques. Thus, we are now able to create highly accurate speech recognizers that are both small and fast enough to execute on typical mobile devices. In this paper, we begin with a baseline RNN-Transducer architecture comprised of Long Short-Term Memory (LSTM) layers. We then experiment with a variety of more computationally efficient layer types, as well as apply optimization techniques like neural connection pruning and parameter quantization to construct a small, high quality, on-device speech recognizer that is an order of magnitude smaller than the baseline system without any optimizations.




1 Introduction

Whether for image processing Gokhale et al. (2014) or for speech applications Chen (2014), neural networks have been finding their way onto edge devices for the better part of a decade now. It stands to reason then that the search for ways to make these networks smaller and faster has become increasingly urgent. The three predominant ways of doing so are through quantization Alvarez et al. (2016); Jacob et al. (2017), sparsity LeCun et al. (1990), and architecture variation Greff et al. (2016) or some combination thereof Han et al. (2016). This work explores all three to accomplish the goal of creating an all-neural speech recognizer that runs on-device in real time.

We begin by examining neural network pruning. Early approaches, including Optimal Brain Damage LeCun et al. (1990) and Optimal Brain Surgeon Hassibi et al. (1994), describe rather complex methods of determining precisely which connections should be kept and which should be cut from a dense network that is already trained to reasonable accuracy. More recently, simpler approaches have prevailed, which rely on just the magnitude of the weights to make the same decision Yu et al. (2012); Han et al. (2015); Zhu and Gupta (2018); Frankle and Carbin (2019).

Others have explicitly explored pruning a network at initialization. Researchers show that a pruned network can be trained to similar accuracy as the original dense network with winning ticket initialization Frankle and Carbin (2019); Frankle et al. (2019). Liu et al. demonstrate that random initialization usually suffices for training structured pruned networks Liu et al. (2019). Lee et al. propose a method to obtain the pruned network structure at initialization, prior to training Lee et al. (2019). All these methods raise the possibility of obtaining an optimal pruned network without having to pre-train a dense network.

In this paper, we apply the automated gradual pruning algorithm to obtain pruned speech recognition models with fewer parameters and minimal accuracy loss Zhu and Gupta (2018).

After pruning, we examine alternative recurrent neural network (RNN) layer architectures. The baseline is a fairly standard long short-term memory (LSTM) architecture that consists of the basic topology first proposed by Hochreiter and Schmidhuber (1997), further enhanced by a forget gate Gers et al. (2000) to allow resetting the cell states at the beginning of sub-sequences. One last commonly accepted addition was in the form of peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs Gers et al. (2003).

The two primary alternatives we explore in this work are the Simple Recurrent Unit (SRU) Lei et al. (2018) and the Coupled Input-Forget Gate (CIFG) Greff et al. (2016). Both are significantly less complex both conceptually and computationally than the LSTM.

Finally, we examine quantization. In the literature, there exist methodologies that convert the weights to all manner of low-bit representations Mellempudi et al. (2017); Zhou et al. (2017); Courbariaux et al. (2016). In this paper, however, we focus our experimentation on 8-bit integerization of weights, combined with either an efficient mix of 8-bit and 16-bit integer computations in the fully integer quantized approach, or 8-bit integer and 32-bit float computation in the so-called hybrid approach proposed by Alvarez et al. (2016).

The remainder of this paper is organized as follows: section 2 describes the core architecture of the end-to-end speech recognizer that we study in subsequent sections. Sections 3, 4, and 5 cover pruning, architecture variants and quantization respectively. Section 6 delves into detailed experiments that apply the aforementioned approaches to the neural networks that comprise the speech recognizer. Finally, section 7 wraps up with interpretations we derived from our results.

2 RNN Transducer

The speech recognition architecture at the core of our experiments is the RNN Transducer (RNN-T) Graves (2012); Graves et al. (2013), depicted in Figure 1.

Figure 1: A schematic representation of CTC and RNNT, from Narayanan et al. (2019).

There are three components of the RNN-T. First, the encoder, which takes as input acoustic frames. This network plays the analogous role of a traditional acoustic model. Second, the prediction network, which functions as a kind of language model. Lastly, we have the joint network that combines the outputs of the previous two and leads finally to the softmax output. The internal layers can be specified in any number of ways, some of which will be more fully described in later sections. The weights of these three components of the model effectively parameterize the conditional probability distribution that plays a central role in computing the loss function used during training.

To further characterize the distribution, however, we must first introduce some notation. A sequence of input acoustic frames is denoted as $\mathbf{x} = (x_1, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$ are $d$-dimensional vectors (log-mel filterbank energies in this work) and $T$ denotes the number of frames in $\mathbf{x}$. A ground-truth label sequence of length $U$ is specified as $\mathbf{y} = (y_1, \ldots, y_U)$, where $y_u \in \mathcal{Z}$ and where $\mathcal{Z}$ corresponds to context-independent (CI) phonemes, graphemes or, as in this work, word-pieces.

We can now establish a conditional distribution by first augmenting $\mathcal{Z}$ with an additional blank symbol, $\langle b \rangle$, and defining:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\hat{\mathbf{y}} \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{T+U} P(\hat{y}_i \mid x_1, \ldots, x_{t_i}, y_0, \ldots, y_{u_{i-1}}) \qquad (1)$$

where $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_{T+U}) \in \mathcal{A}(\mathbf{x}, \mathbf{y})$ are alignment sequences with $T$ blanks and $U$ labels such that removing the blanks in $\hat{\mathbf{y}}$ yields $\mathbf{y}$, and $y_0$ is a start-of-sequence label. We also use $t_i$ and $u_i$ to represent the number of acoustic frames and non-blank symbols in the partial alignment $\hat{y}_1, \ldots, \hat{y}_i$.

Two crucial independence assumptions have been made here. First, $\hat{y}_i$ cannot depend on future acoustic frames. Second, the probability of observing the $u$th label, $y_u$, in an alignment, $\hat{\mathbf{y}}$, is conditioned only on the history of non-blank labels emitted thus far. These assumptions enable both efficient inference and, through the use of the forward-backward algorithm, tractable computation of the gradients of the loss function. Incidentally, they also allow us to build streaming systems that do not need to wait for the entire utterance to begin processing.
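The sum over alignments can be evaluated with the forward half of the forward-backward algorithm. The following is a minimal sketch rather than the training code used in this work: it assumes the model's outputs have already been reduced to two hypothetical tables, `blank[t][u]` and `label[t][u]`, holding the probability of emitting a blank or the next correct label at frame `t` after `u` labels have been emitted.

```python
def rnnt_likelihood(blank, label):
    """Forward algorithm for P(y|x) on a T x (U+1) lattice.

    blank[t][u]: probability of emitting blank at frame t after u labels.
    label[t][u]: probability of emitting the (u+1)-th label at frame t
                 after u labels (only needed for u < U).
    Both tables are assumed inputs; a real model would produce them
    from the joint network's softmax.
    """
    T, U = len(blank), len(blank[0]) - 1
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting a blank (advance one frame)
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:  # arrive by emitting a label (stay on the frame)
                alpha[t][u] += alpha[t][u - 1] * label[t][u - 1]
    # Every complete alignment ends with a blank on the last frame.
    return alpha[T - 1][U] * blank[T - 1][U]


# Two frames, one label, all probabilities 0.5: two alignments of
# three symbols each contribute 0.5 ** 3.
p = rnnt_likelihood([[0.5, 0.5], [0.5, 0.5]], [[0.5], [0.5]])
```

Because each lattice cell only looks at frames up to `t` and labels up to `u`, the recursion reflects the two independence assumptions stated above.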

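Finally, given a dataset $\mathcal{D}$ of $N$ utterance-transcript pairs $(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$, we can estimate the expected loss as:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}) \qquad (2)$$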


3 Pruning

It is often repeated in the literature that neural networks are over-parameterized, making them computationally expensive and giving them a larger than necessary memory footprint. This motivates network pruning LeCun et al. (1990); Han et al. (2015), where sparsity is introduced to reduce model size significantly while maintaining the quality of the original.

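In this work, we adopt the pruning method proposed by Zhu and Gupta (2018), in which the sparsity of weight matrices increases from an initial value $s_i$ to a final value $s_f$ over a span of $n$ pruning steps, starting at training step $t_0$ with pruning frequency $\Delta t$:

$$s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n \Delta t}\right)^3 \quad \text{for } t \in \{t_0, t_0 + \Delta t, \ldots, t_0 + n \Delta t\} \qquad (3)$$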

Unlike most iterative pruning methods in the literature, which update the mask incrementally, this method allows a pruned weight element to be recovered at a later training stage Zhu and Gupta (2018). This is achieved by retaining the values of the pruned weights instead of setting them to zero, even though they do not contribute to forward propagation and receive no updates from back-propagation. Thus, when the mask is updated later, a pruned weight can be recovered if its retained value is larger in magnitude than that of some un-pruned weights. In our experiments, we allow the mask to be updated even after sparsity reaches the final value so that weights pruned early due to bad initialization can be recovered.
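The schedule and the recoverable mask can be sketched in a few lines. This is an illustrative sketch, not the TensorFlow model optimization implementation: the function names are hypothetical, weights are a flat list, and the schedule is clamped to $s_i$ before $t_0$ and to $s_f$ after the last pruning step (an assumption made here for convenience).

```python
def target_sparsity(step, s_i, s_f, t0, n, dt):
    """Cubic sparsity schedule of Zhu and Gupta (2018).

    In practice it is evaluated every `dt` steps; clamping outside
    the pruning window is an assumption of this sketch.
    """
    if step < t0:
        return s_i
    progress = min(1.0, (step - t0) / (n * dt))
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the smallest-magnitude weights.

    `weights` holds the retained values, including those of entries
    that are currently masked out, so a pruned weight can re-enter
    the network when the mask is recomputed.
    """
    k = int(round(sparsity * len(weights)))  # number of weights to prune
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    mask = [1] * len(weights)
    for j in order[:k]:
        mask[j] = 0
    return mask


# Index 2 is pruned at first; after training shrinks weight 0 it is
# recovered, because its retained value 0.3 now outranks 0.05.
first = magnitude_mask([0.5, -0.1, 0.3, -0.8], 0.5)
later = magnitude_mask([0.05, -0.1, 0.3, -0.8], 0.5)
```

With $s_i = 0$, $s_f = 0.5$ and a 100k-step window, the schedule reaches about 44% sparsity halfway through, so most of the pruning happens early while the network still has time to adapt.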

To exploit the sparse structure of the pruned model for faster computation, we prune the model so that it has a block-sparse structure.

3.1 Block Compressed Sparse Row

To reduce memory usage and compute more efficiently with the pruned model of block sparse structure, we use a variation of Block Compressed Sparse Row (BCSR) Saad (2003) as the storage format.

Given a matrix $A$ and a block size $r \times c$ such that $A$ can be evenly partitioned into equal-sized blocks, the standard BCSR format of $A$ consists of three arrays:

  1. A real array containing non-zero blocks in row-major order.

  2. An integer array containing the block column indices of non-zero blocks.

  3. A second integer array containing the pointers to the beginning of each block row in the real array.

In our implementation, we combine the two integer arrays into one so-called ledger array which contains the number of non-zero blocks of each block row followed by block column indices of non-zero blocks.
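The ledger layout can be illustrated with a small pure-Python sketch. The function names, the list-of-rows matrix representation and the exact element order are assumptions for illustration; the production kernels are not shown in this paper.

```python
def to_bcsr_ledger(dense, r, c):
    """Convert a dense matrix (list of rows) to block values + ledger.

    The ledger stores, for each block row, the count of non-zero
    blocks followed by their block-column indices.
    """
    rows, cols = len(dense), len(dense[0])
    assert rows % r == 0 and cols % c == 0, "matrix must tile evenly"
    values, ledger = [], []
    for bi in range(rows // r):
        kept = []
        for bj in range(cols // c):
            block = [dense[bi * r + i][bj * c + j]
                     for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                kept.append(bj)
                values.extend(block)  # row-major inside the block
        ledger.append(len(kept))
        ledger.extend(kept)
    return values, ledger


def bcsr_matvec(values, ledger, r, c, x, num_rows):
    """Compute y = A @ x directly from the block values and the ledger."""
    y = [0.0] * num_rows
    v = 0  # cursor into values
    p = 0  # cursor into ledger
    for bi in range(num_rows // r):
        count = ledger[p]
        p += 1
        for bj in ledger[p:p + count]:
            for i in range(r):
                for j in range(c):
                    y[bi * r + i] += values[v] * x[bj * c + j]
                    v += 1
        p += count
    return y


dense = [[1, 2, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 3, 4],
         [5, 6, 0, 0]]
values, ledger = to_bcsr_ledger(dense, 2, 2)
y = bcsr_matvec(values, ledger, 2, 2, [1, 1, 1, 1], 4)
```

A single ledger means the kernel walks one integer stream sequentially alongside the value stream, instead of consulting a separate row-pointer array.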

4 Efficient RNN Variants

The long short-term memory (LSTM) recurrent neural network cell topology was designed to accurately model dependencies between temporally distant events. It has been successfully employed in several state-of-the-art systems solving sequential tasks. For example, LSTM cells have been used as building blocks to construct many speech recognition architectures. They are the basis for acoustic modeling, a core component of a speech recognizer Sak et al. (2014); McGraw et al. (2016). They have also been used to build language models Jozefowicz et al. (2016), and more recently, as in the case of this paper, to create end-to-end audio-to-transcription models that perform most if not all of the job of the recognizer Chan et al. (2016); Jaitly et al. (2016); Chiu et al. (2018); He et al. (2019).

The streaming RNN-T speech recognition model of He et al. (2019), which serves as the baseline system described in Section 2, is composed of LSTM cells as described in Section 4.1.

4.1 Baseline LSTM architecture

Our baseline LSTM cell is built from the original topology by Hochreiter and Schmidhuber (1997), further enhanced by a forget gate Gers et al. (2000) to allow resetting the cell states at the beginning of sub-sequences. We do not include peephole connections from its internal cells to the gates in the same cell (conceived by Gers et al. (2003) to learn precise timing of the outputs), but we do use a more recent modification to the LSTM topology proposed by Sak et al. (2014), in which a recurrent projection is applied to the outputs of the cell to reduce the output size and thus the number of recurrent connections within the cell, as well as those to the next layer(s).

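Thus, the equations for the baseline LSTM are:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) && (4) \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) && (5) \\
z_t &= g(W_{zx} x_t + W_{zh} h_{t-1} + b_z) && (6) \\
c_t &= i_t \odot z_t + f_t \odot c_{t-1} && (7) \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) && (8) \\
m_t &= o_t \odot g(c_t) && (9) \\
h_t &= W_{proj} m_t && (10)
\end{aligned}
$$

where $x_t$ is the input at time $t$, the $W_{*x}$ and $W_{*h}$ terms denote weight matrices (e.g. $W_{ix}$ is the matrix of weights applied to the input vectors to calculate input gates, $W_{ih}$ is the recurrent weights matrix applied to outputs of the previous time step to calculate the input gate), the $b$ terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the logistic sigmoid function, and $z_t$, $i_t$, $f_t$, $o_t$, and $c_t$ are respectively the LSTM-block input, input gate, forget gate, output gate and cell activation vectors, all of which are of the same size as the cell output activation vector $m_t$. $h_t$ is the projected output, $\odot$ is the element-wise product of the vectors, and $g$ is the cell input and cell output activation function, generally $\tanh$.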

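In the LSTM cells, we add layer normalization to the input block and gates ($i_t$, $f_t$, and $o_t$). Layer normalization helps to stabilize the hidden layer dynamics and speeds up model convergence Ba et al. (2016). Taking the reset gate as an example: for hidden node $j$ of the reset gate at time $t$, $r_t^j$, we have

$$\mu_t = \frac{1}{H} \sum_{j=1}^{H} r_t^j, \qquad \sigma_t = \sqrt{\frac{1}{H} \sum_{j=1}^{H} (r_t^j - \mu_t)^2}, \qquad \bar{r}_t^j = \frac{g^j}{\sigma_t} (r_t^j - \mu_t)$$

where $g$ is a trainable gain parameter that has the same dimension as $r_t$, and $H$ is the number of hidden units in the layer.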

During training, the weight matrices $W_{*x}$ and the weight matrices $W_{*h}$ are usually transposed and grouped into a single matrix, $W$, for efficient computation.

We experiment with variants of the LSTM cell aimed at reducing the number of parameters and the amount of computation without a significant impact on the final accuracy of the system. Two such variants are the coupled input-forget gate (CIFG) LSTM (Section 4.2) and the simple recurrent unit (SRU) (Section 4.3).

4.2 CIFG-LSTM

The CIFG-LSTM variant is an LSTM modification explored by Greff et al. (2016) and similar to that proposed in the gated recurrent unit (GRU) Cho et al. (2014). CIFG-LSTM simplifies the LSTM cell topology by having a single set of weights (i.e. a single gate) controlling both the amount of information added to and removed from the cell state (i.e. $c_t$ in Equation 7). This is implemented by setting the input and forget gates to be the complement of one another, replacing Equation 4 with Equation 16.

$$i_t = 1 - f_t \qquad (16)$$
This can be interpreted as if the cell is learning information complementary to the information it forgets (and vice versa), at any given time step.

Greff et al. show that simplifying LSTM cells by coupling the forget gate with the input gate, or by removing the peephole connections, does not lead to a significant decrease in LSTM performance on the TIMIT dataset Greff et al. (2016). Experiments by van der Westhuizen et al. also find that the forget gate is one of the most important gates in the LSTM cell, and that a forget-gate-only version of the LSTM cell beats the performance of standard LSTM cells on the MNIST and pMNIST datasets van der Westhuizen and Lasenby (2018). Similar findings are empirically validated in the work by Jozefowicz et al., where the forget gate is shown to be the most important gate: when forget gate biases are properly initialized (i.e. to 1), the quality of the LSTM improves significantly across sequential tasks Jozefowicz et al. (2015).

Greff et al., however, did not explore in depth the benefits of such an LSTM configuration. From the efficiency point of view, keeping the number of cells constant, a CIFG reduces the number of gate parameters and computations by 25% with respect to the baseline LSTM. Moreover, since it does not change the fundamental formulation of an LSTM, it still allows leveraging the projection proposed by Sak et al. (2014), reducing the parameter count further. We compare the performance of the CIFG-LSTM with that of the baseline LSTM, and explore the impact of combining it with other optimizations such as pruning and quantization.
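The 25% figure follows from counting gate blocks: a standard LSTM has four weight blocks (block input plus three gates) and a CIFG has three. The sketch below is a back-of-the-envelope count under stated assumptions: no peepholes, a bias-free projection, and layer normalization gains ignored. The helper name `lstm_params` is hypothetical, and the layer sizes are taken from the encoder description in Section 6.1.

```python
def lstm_params(n_in, n_cell, n_proj, coupled=False):
    """Parameter count of one projected LSTM layer (no peepholes).

    Each gate block has input weights, recurrent weights (acting on
    the projected output of size n_proj) and a bias. A standard LSTM
    has four such blocks; a CIFG drops the separate input gate.
    """
    blocks = 3 if coupled else 4
    gate_params = blocks * n_cell * (n_in + n_proj + 1)
    proj_params = n_cell * n_proj  # bias-free recurrent projection
    return gate_params, proj_params


lstm_gates, lstm_proj = lstm_params(640, 2048, 640)
cifg_gates, cifg_proj = lstm_params(640, 2048, 640, coupled=True)
gate_ratio = cifg_gates / lstm_gates  # 0.75: one of four blocks removed
layer_ratio = (cifg_gates + cifg_proj) / (lstm_gates + lstm_proj)
```

Because the projection matrix is unchanged by the coupling, the saving for a whole layer (`layer_ratio`) is somewhat smaller than the 25% saving on the gate blocks alone.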

4.3 Simple Recurrent Unit (SRU)

The SRU cell topology has emerged in recent work as an effective tool for speech recognition and natural language processing tasks. Lei et al. show that SRU cells are not only highly parallelizable at inference time but also adequately expressive in capturing the recurrent statistical patterns in the input data. They construct neural networks with SRU cells and reproduce state-of-the-art results for text classification, question answering, machine translation, and character-level language modeling tasks, at 5-9x speed-ups over similar networks constructed with cuDNN-optimized LSTM cells Lei et al. (2018).

Park et al. build an acoustic model by combining one layer of SRU cells with one layer of depth-wise 1D convolutional neural network. They add an RNN language model on top of the acoustic model, constructing an automatic speech recognition system smaller than 15MB. This system runs in real time on mobile devices and produces a WER on the WSJ dataset similar to that of a 100MB state-of-the-art model Park et al. (2018).

We implemented a version of the SRU in which $\sigma$, the logistic sigmoid activation function, is applied point-wise to its input, and $g$ is the cell output activation function, generally defined as $\tanh$. We also added an output projection, described in Equation 23, to the SRU cells. This output projection is similar to the output projection of LSTM cells Sak et al. (2014) in Section 4.1. It too does not contain biases and simply applies a weight matrix that shrinks the SRU layers by reducing cell output dimensions.
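As a rough illustration of such a cell, the sketch below follows the general shape of the SRU recurrence of Lei et al. (2018) with a $\tanh$ applied to the cell state and a bias-free output projection. It is an assumption-laden sketch, not the exact variant used in this work: the parameter names in `p` are hypothetical, layer normalization is omitted, and the input is assumed to have as many dimensions as the layer has cells so that the highway term is well defined.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sru_step(x, c_prev, p):
    """One SRU time step with tanh output activation and projection.

    All matrix products involve only the current input x, never the
    previous output, so they can be batched across time steps.
    """
    x_tilde = matvec(p["W"], x)
    f = [sigmoid(a + b) for a, b in zip(matvec(p["W_f"], x), p["b_f"])]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), p["b_r"])]
    # Cell state: forget gate interpolates between old state and input.
    c = [fj * cj + (1.0 - fj) * xj for fj, cj, xj in zip(f, c_prev, x_tilde)]
    # Output: reset gate mixes the activated cell state with the raw input.
    h = [rj * math.tanh(cj) + (1.0 - rj) * xj for rj, cj, xj in zip(r, c, x)]
    return matvec(p["W_proj"], h), c  # bias-free output projection


params = {"W": [[0.0]], "W_f": [[0.0]], "b_f": [0.0],
          "W_r": [[0.0]], "b_r": [0.0], "W_proj": [[1.0]]}
y, c = sru_step([2.0], [4.0], params)
```

The only sequential dependency is the element-wise update of `c`, which is what makes the cell cheaper than an LSTM whose gates all depend on the previous output.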

Inspired by the work on layer normalization in RNN cells Ba et al. (2016), we also added layer normalization to the SRU cells. For each layer, we normalized the forget gate $f_t$, the reset gate $r_t$, and the cell value $c_t$.

To explore the effectiveness of SRU cells, we looked at the results of using SRU cells in the encoders, decoders, or both in the streaming RNN-T model.

5 Quantization

We usually train and represent our models using floating-point arithmetic and parameters (i.e. 32-bit floating point type). If we then deploy the floating-point model directly for inference, we expect the same precision and quality as observed during model training. However, there are significant advantages in representing the model in lower integer precision: it reduces memory consumption, both on disk and in RAM, and speeds up the model's execution to meet the on-device real-time requirements.

Quantization is the process of transforming a computation graph such that it can (entirely or partially) be represented and executed in a lower precision by discretizing the original graph’s values to a more restricted representation and rewriting the computation to take place using these values. Thus, the resulting quantized graph is an approximation of the original.

In this work, quantization refers to an affine transformation that linearly maps values from a higher to a lower precision. We typically express it as a transformation involving a scale and an offset (or zero point) on one or more dimensions of a tensor.
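A per-tensor version of this mapping can be sketched as follows. The function names, the unsigned 8-bit target range and the rounding choices are assumptions for illustration; the TensorFlow tooling referenced in this paper makes these decisions internally.

```python
def affine_params(x_min, x_max, num_bits=8):
    """Scale and zero point mapping [x_min, x_max] onto unsigned ints."""
    q_min, q_max = 0, 2 ** num_bits - 1
    # Widen the range so that 0.0 is always exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # guard empty range
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to its integer code, saturating at the limits."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize."""
    return scale * (q - zero_point)


scale, zero_point = affine_params(-128.0, 127.0)
code = quantize(5.0, scale, zero_point)
restored = dequantize(code, scale, zero_point)
```

The scale sets the resolution of the representation and the zero point shifts it, so values outside the chosen range saturate rather than wrap around.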

We explored two quantization schemes (i.e. ways to convert, represent, and execute the model), for which TensorFlow provides tooling and execution capabilities TensorFlow model optimization (2019); TensorFlow Lite (2019). Hybrid quantization, described in Section 5.1, makes use of integer and floating point computation to execute an inference, whereas integer quantization, described in Section 5.2, performs inferences entirely using integer operations. Both schemes make use of post-training quantization, meaning the model's computation graph is quantized after it has been fully trained, and without any quantization-aware training. In other words, we do not perform quantization emulation during the forward passes, which could have adapted the model to the noise introduced by quantization and further reduced possible losses in model accuracy Alvarez et al. (2016); Jacob et al. (2017). Both schemes also transform the weights from 32-bit floating-point precision into 8-bit integers. They mainly differ in how the computation takes place.

5.1 Hybrid Quantization

Hybrid quantization, as introduced in McGraw et al. (2016); Alvarez et al. (2016), transforms all weights from their original 32-bit representation to 8-bit integers, quantizes dynamic values (e.g. activations) on the fly, and performs some of the computation, such as non-linearities, in floating point. One of the advantages of hybrid quantization is that it can be performed entirely as a single-pass transformation over the graph, without collecting statistics on the dynamic ranges of the tensors, and thus no external sample data is needed. Moreover, there is an advantage in accurately quantizing the dynamic tensors with the true range of values.

As explained in prior work, we symmetrically quantize the values by defining the quantized tensor, $\hat{x}$, to be the product of the original vector, $x$, and a quantization scale, $s$, defined as in He et al. (2019).

Figure 2: Computation $y = f(Wx + b)$: weights $W$ are already quantized, and inputs $x$ are quantized on the fly before performing the multiplication $Wx$; the product is then dequantized to apply the biases $b$ and the activation function $f$.

The approach we follow during model inference is depicted in Figure 2. Internally, layers operate on 8-bit integers for the matrix multiplications (typically the most computationally intensive operations), and their product is dequantized to floating point to go through activation functions.
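The data flow of Figure 2 can be sketched for a single dense layer. This is a simplified sketch under assumptions: the scale is taken as $127 / \max|x|$ (a common symmetric choice, not necessarily the exact definition in He et al. (2019)), a single scale is used per tensor, and the function names are hypothetical.

```python
def symmetric_quantize(values, num_bits=8):
    """Quantize with a scale taken from the true range of `values`."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    peak = max(abs(v) for v in values) or 1.0  # guard all-zero input
    scale = q_max / peak  # quantized value = original value * scale
    return [int(round(v * scale)) for v in values], scale


def hybrid_dense(w_q, w_scale, x, bias, activation):
    """y = activation(W x + b): integer matmul, floating point epilogue."""
    x_q, x_scale = symmetric_quantize(x)  # on the fly, at every inference
    out = []
    for row, b in zip(w_q, bias):
        acc = sum(wq * xq for wq, xq in zip(row, x_q))  # integer accumulate
        out.append(activation(acc / (w_scale * x_scale) + b))  # dequantize
    return out


# Weights are quantized once, offline.
w = [[1.0, -0.5], [0.25, 0.75]]
flat_q, w_scale = symmetric_quantize([v for row in w for v in row])
w_q = [flat_q[0:2], flat_q[2:4]]
y = hybrid_dense(w_q, w_scale, [2.0, -1.0], [0.0, 0.0], lambda v: v)
```

For this input the exact result is `[2.5, -0.25]`, and the hybrid result differs only by a small rounding error, because the input scale is derived from the actual values seen at inference time.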

5.2 Integer Quantization

Integer quantization is designed such that the graph performs only integer operations. Thus, it requires, among other things, that the dynamic tensors be quantized with a scale pre-computed from statistics, as opposed to hybrid quantization's use of the true range of values. However, integer-only quantization has a number of important advantages over hybrid:

  • Widespread availability. Integer operations are common across hardware, including the latest generation of specialized chips.

  • Efficiency. Having all operations as integer means faster execution, and less power consumption. Furthermore, the use of pre-computed scales means there is no overhead re-computing scales with every inference, nor quantizing and dequantizing tensors on-the-fly as with the hybrid approach.

The statistics for all dynamic tensors are computed by running inferences on a floating point version of the model, which logs the dynamic ranges of each tensor. The dynamic ranges are then used to compute the needed scales and zero point by simply taking the absolute range across all logs from the same tensor.
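The calibration step can be sketched as below. It assumes, for illustration only, that each calibration utterance has produced one hypothetical `(min, max)` log entry per tensor and that the target is a signed 8-bit representation; the actual logging and scale selection are handled by the TensorFlow tooling.

```python
def calibrate(range_logs):
    """Merge per-utterance (min, max) logs of one tensor into one range."""
    lo = min(r[0] for r in range_logs)
    hi = max(r[1] for r in range_logs)
    return lo, hi


def int8_params(lo, hi):
    """Pre-computed scale and zero point for a signed 8-bit tensor."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 representable
    scale = (hi - lo) / 255.0 or 1.0     # guard against an empty range
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point


logs = [(-1.0, 2.0), (-3.0, 1.5), (-0.5, 2.1)]  # one entry per utterance
lo, hi = calibrate(logs)
scale, zero_point = int8_params(lo, hi)
```

Unlike the hybrid scheme, these parameters are fixed before deployment, so a value outside the calibrated range at inference time saturates.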

Empirically, we have observed that despite the large variety of audio and semantic conditions in speech recognition, a fixed dataset of 100 utterances is sufficient to produce fully quantized models with negligible accuracy loss.

Which tensor scales are actually required depends on the needs of the quantized graph (e.g. when rescaling a higher-precision accumulator to a lower precision). This is the main challenge of integer quantization: the design of the execution strategy for the operations in the graph. It largely involves optimizing the choice of the number of bits as well as the scale. In the case of stateful operations like the LSTMs used in this paper, we commonly need to use larger bit representations.

The computation “recipe” for integer LSTMs used in this paper follows these principles:

  • Matrix-related operations, such as matrix-matrix multiplication and matrix-vector multiplication, are all in 8-bit.

  • Vector-related operations, such as element-wise sigmoid, are a mixture of 8-bit and 16-bit.

We would also like to highlight that a larger number of bits does not necessarily mean higher accuracy. The accuracy of quantized computation is determined by the scale; for a given scale, the number of bits determines when values saturate. In practice, these decisions are part of the TensorFlow tooling TensorFlow model optimization (2019) and runtime TensorFlow Lite (2019).

On the modeling side, though, there are ways to make a model more amenable to quantization. For example, layer normalization Ba et al. (2016) helps improve accuracy for integer-only calculation. We theorize that this is because layer normalization is robust against the overall shift introduced by the gate matrix multiplications, which is the primary source of accuracy degradation in quantization.

Using all the approaches mentioned above, we are able to quantize the models in this paper with negligible accuracy loss. See Section 6 for the details.

6 Experiments

6.1 Model Architecture Details

Our base model is an RNN-T with a target word-piece output size of 4096, similar to the model proposed by He et al. (2019). The RNN-T takes globally normalized log-Mel features of audio as input Narayanan et al. (2018). Our model pipeline first extracts 128-dimensional log-Mel features from audio data with a sliding window of 32ms width and 10ms stride. It then stacks every 4 consecutive frames to form a 512-dimensional input vector, before downsampling the input vectors to one frame every 30ms. The encoder network consists of 8 LSTM layers with 2048 hidden units followed by a projection layer with 640 units. A time reduction layer is inserted after the second layer of the encoder, further reducing the frame rate to one frame every 60ms. The prediction network consists of 2 layers of LSTM cells, each of which has 2048 hidden units and an output projection layer of size 640. The joint network has 640 hidden units and a softmax layer with 4096 units. Our baseline RNN-T model has 122M parameters in total. In this work, we apply pruning, CIFG-LSTM, SRU and quantization to the baseline model and compare their performance.

All models are trained using Lingvo Shen et al. (2019) in TensorFlow Abadi et al. (2015) on Tensor Processing Unit slices with a global batch size of 4096, on a dataset of cross-domain utterances including VoiceSearch, YouTube and Telephony. Pruned models are trained with the TensorFlow model optimization (2019) pruning implementation. In training, the learning rate ramps up linearly to 1e-3 over the first 32k steps and decays exponentially to 1e-5 from 100k to 200k steps. We evaluate all models on 3 test datasets with utterances from the same domains as used in training: the VoiceSearch, the YouTube and the Telephony datasets.

Dataset       Utterance length mean (s)   50th percentile (s)   90th percentile (s)
VoiceSearch   4.7                         4.4                   6.6
YouTube       987.5                       836.4                 1799.0
Telephony     4.7                         3.7                   8.2
Table 1: Test dataset utterance length distributions

Table 1 shows the distributions of utterance lengths in these three datasets. VoiceSearch and Telephony contain shorter utterances with a mean of 4.7 seconds per utterance. YouTube contains longer utterances, averaging 16.5 min per utterance.

6.2 Pruning

Sparsity   #Params in millions (% of baseline)   WER VoiceSearch   WER YouTube   WER Telephony
0%         122.1 (100%)                          6.6               19.5          8.1
50%        69.7 (57%)                            6.7               20.3          8.2
70%        48.7 (39.9%)                          7.1               20.6          8.5
80%        38.2 (31.3%)                          7.4               21.2          8.9
Table 2: Comparison of pruned models with different sparsities

We first train a base model as described in Section 6.1. Then we apply the pruning algorithm described in Section 3 to the grouped weight matrix $W$ in each LSTM layer of the RNN-T model, with sparsity increasing polynomially to the target sparsity between step 0 and step 100k, as defined in Equation 3. In order to leverage the learned sparse structure to speed up inference on modern CPUs, a block-sparse structure is enforced in $W$. Table 2 shows the word error rates (WER) and number of parameters of the base model and of pruned models at different sparsity levels.

6.3 Comparing RNN Topologies

Tables 3 and 4 compare models trained with the same model architecture (described in Section 6.1) but different RNN cell topologies. We also train pruned RNN-T models with 50% sparsity in each RNN layer.

In Table 3, we show that CIFG-based RNN-T models are comparable to LSTM-based RNN-T models in performance, regardless of whether sparsity is applied. We also show the results of a dense LSTM model (labeled “Small”) whose numbers of hidden-layer cells and projection-layer cells are reduced relative to those of the baseline model. This smaller model is 45% smaller than the original model but suffers a relative WER degradation of 18.2% on VoiceSearch, much worse than the sparse CIFG model, which is 55% smaller than the original model with only a 4.5% relative degradation on VoiceSearch.
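The degradation figures quoted above are relative changes in WER with respect to the baseline, computed from the VoiceSearch column of Table 3. A short worked check (the helper name is hypothetical):

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


small = relative_change(6.6, 7.8)        # dense "Small" LSTM, about 18.2
sparse_cifg = relative_change(6.6, 6.9)  # 50% sparse CIFG, about 4.5
```

The same convention (relative rather than absolute WER differences) is used for the percentages reported in the rest of this section.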

Cell                    Enc & Dec #layers   Sparsity   #Params in millions (% of baseline)   WER VoiceSearch   WER YouTube   WER Telephony
LSTM (Baseline)         LSTMx8, LSTMx2      -, -       122.1 (100%)                          6.6               19.5          8.1
LSTM (Baseline small)   LSTMx8, LSTMx2      -, -       67.7 (55.4%)                          7.8               20.3          8.6
CIFG                    CIFG-LSTMx8         -          95.8 (79%)                            6.8               18.6          8.1
Sparse LSTM             LSTMx8, LSTMx2      50%, 50%   69.7 (57%)                            6.7               20.3          8.2
Sparse CIFG             CIFG-LSTMx8         50%        56.3 (46%)                            6.9               21.0          8.1
Table 3: Comparison of LSTM based RNN-T model with CIFG-LSTM based RNN-T models.

Table 4 records the results of using SRU cells in different parts of the end-to-end RNN-T model. We refer to the first two layers of the encoder collectively as encoder0, and to the subsequent layers as encoder1. In between encoder0 and encoder1 is the time reduction layer (see Section 6.1). We find that SRU layers can effectively substitute for LSTM layers in the decoder, but do not perform comparably to LSTM layers in the encoder.

An LSTM cell contains roughly twice as many parameters as an SRU cell. To keep the decoder parameter count unchanged, we stacked 4 layers of SRU cells as the decoder. Compared to the LSTM-based decoder, the 4-layer SRU decoder improved WER on the VoiceSearch, YouTube, and Telephony test sets by 1.5%, 6.7% and 4.9% relative, respectively.

We also experimented with mixing SRU layers and LSTM layers in the encoder. Although the resulting models had better WER than the model using SRU layers alone, they are still not on par with the LSTM-based encoder. Last but not least, we found that SRU layers need more time to converge than LSTM layers. We lengthened the warm-up period, peak period and decay of the learning rate schedule by 2x, and observed an 8.5% relative drop in the VoiceSearch test set WER of the SRU-based RNN-T, from 9.4% to 8.6%. Comparable LSTM-based models could achieve that result with half the learning rate schedule.

Cell            Enc & Dec #layers   #Params in millions (% of baseline)   WER VoiceSearch   WER YouTube   WER Telephony
SRU-dec         LSTMx8              111.6 (91%)                           6.7               18.5          8.1
SRU-dec deep    LSTMx8              124.7 (102%)                          6.5               18.2          7.7
SRU-enc0        SRUx2,LSTMx6        111.6 (91%)                           7.0               20.6          8.4
SRU-enc1        LSTMx2,SRUx6        90.6 (74%)                            7.2               19.0          8.5
SRU (long lr)   SRUx8               69.6 (57%)                            8.6               21.2          10.0
Table 4: Comparison of RNN-T models that contain SRU cells in encoder0, encoder1, and/or the decoder. ‘long lr’ denotes a learning rate schedule twice as long.

Table 5 shows our finalized RNN-T model with sparse CIFG layers (50% weight sparsity) in the encoder and sparse SRU layers (30% weight sparsity) in the decoder. This model has 59% fewer parameters than the baseline LSTM-based model, yet its WER degrades by only 7.5% and 1.2% relative on the VoiceSearch and Telephony test sets, and its WER on YouTube improves by 3.1%. This final RNN-T model is significantly better than the dense “Small” LSTM model.

Cell                      Enc & Dec #layers    Sparsity   #Params in millions (% of baseline)   WER VoiceSearch   WER YouTube   WER Telephony
LSTM (Baseline small)     LSTMx8, LSTMx2       -, -       67.7 (55.4%)                          7.8               20.3          8.6
CIFG-SRU                  CIFG-LSTMx8, SRUx2   -, -       89.6 (73%)                            6.9               19.1          7.8
sparse-CIFG, sparse-SRU   CIFG-LSTMx8, SRUx2   50%, 30%   50.6 (41%)                            7.1               18.9          8.2
Table 5: Comparing smaller dense LSTM model with models trained with sparse CIFG-LSTM and SRU cells.

To summarize, although SRU layers required a longer learning rate schedule to converge, they were smaller and effective substitutes for LSTM layers in the RNN-T decoder. A combination of 50% sparse CIFG (encoder layers) and 30% sparse SRU (decoder layers) eliminated 59% of the parameters of the baseline RNN-T model with a small loss in WER.

6.4 Quantized LSTM

Model         #Params in millions (% of baseline)   Quantization       WER VoiceSearch   WER YouTube   WER Telephony   RT(0.9), Pixel 3 small cores
LSTM          122.1 (100%)                          Float (baseline)   6.6               19.5          8.1             3.223
                                                    Hybrid             6.7               19.8          8.2             1.024
                                                    Integer            6.7               19.8          8.2             1.013
Sparse LSTM   69.7 (57%)                            Float              6.7               20.2          8.2             1.771
                                                    Hybrid             6.8               20.4          8.4             0.888
                                                    Integer            6.9               22.9          8.7             0.869
Sparse CIFG   56.3 (46%)                            Float              7.1               21.7          8.3             1.503
                                                    Hybrid             7.2               21.4          8.5             0.743
                                                    Integer            7.2               20.6          8.7             0.709
Table 6: Comparison of float, hybrid and fully quantized models.

The accuracy and CPU performance comparison between float, hybrid and fully quantized models is listed in Table 6.

Accuracy-wise, our experimental results show that our proposed integer quantization has negligible accuracy loss on VoiceSearch. Even on longer utterances (in YouTube), which are typically more challenging for sparse models than shorter utterances, the accuracy is still comparable with that of the float models.

We use the Real Time (‘RT’) factor, which is the ratio between the wall time needed to complete the speech recognition and the length of the audio, to measure end-to-end performance. For example, an RT factor of 0.9 means the system processes 10 seconds of audio in 9 seconds. We denote by RT(0.9) the RT factor at the 90th percentile: 90% of the utterances have RT values smaller than or equal to the value of RT(0.9). Using the TensorFlow Lite (2019) runtime on a typical mobile CPU (Pixel 3 small cores), we compare RT(0.9) between the float, hybrid and integer models, and see that the integer model achieves an RT factor that is roughly 30% of the float one for the dense LSTM (1.013 versus 3.223 in Table 6). There is an ongoing effort to optimize the integer model on CPU (for example, with better sparsity support), but much bigger performance improvements are expected for integer quantization on specialized neural network acceleration chips.
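
The RT metric defined above can be sketched as follows. This is an illustrative implementation only: the function names, the nearest-rank percentile convention and the timing values are our assumptions, not measurements or code from this work.

```python
import math


def rt_factor(wall_time_s, audio_length_s):
    """Real-time factor: processing wall time divided by audio duration."""
    return wall_time_s / audio_length_s


def rt_percentile(rt_values, fraction=0.9):
    """Smallest observed RT value v such that at least `fraction` of the
    utterances have an RT factor <= v (nearest-rank percentile)."""
    ordered = sorted(rt_values)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical (wall time, audio length) pairs in seconds, for illustration only.
measurements = [(9.0, 10.0), (2.0, 4.0), (3.0, 5.0), (6.0, 8.0), (1.0, 2.0),
                (7.0, 10.0), (4.5, 5.0), (2.4, 3.0), (5.5, 10.0), (12.0, 10.0)]

rt_values = [rt_factor(wall, audio) for wall, audio in measurements]
rt_90 = rt_percentile(rt_values, 0.9)
print(f"RT(0.9) = {rt_90:.2f}")
```

An RT(0.9) below 1 indicates that at least 90% of the utterances are recognized faster than real time; the single slow utterance in the hypothetical data (RT factor 1.2) does not affect the statistic.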

7 Conclusion

In this work we presented a comprehensive set of optimizations that span from more efficient neural network building blocks to the elimination, and reduction in precision, of neural network parameters and computations. Altogether they result in a high quality speech recognition system, “Sparse CIFG”, that is about 9x smaller in size (from 488.4MB to 56.3MB) and reduces the RT factor 4.5x (from 3.223 to 0.709) with respect to our full precision “LSTM Baseline”.

More specifically, we validated that neural connection pruning, as proposed in our work, is a very useful tool to reduce potential overparameterization in the neural network at the cost of a relatively small accuracy loss and an increase in overall training time. Moreover, the use of a particular “block sparsity” configuration enables further execution speedups on CPUs widely used in mobile, desktop and server devices, without requiring specialized hardware support. We expect, however, that specialized hardware can speed things up further.

We also validated that other RNN variants result in competitive qualitative performance with respect to the widely accepted and used LSTM topology, while also reducing the number of parameters and potentially enabling other optimizations. In particular, we believe that CIFG-LSTMs are an underused and relatively simple optimization to take advantage of, which, paired with a projection, significantly reduce size and computation at a relatively small trade-off in accuracy. We have been successfully using CIFG-LSTMs across a variety of tasks with the same benefits validated in this paper.

We also showed that neural network quantization is an extremely valuable technique to reduce the memory footprint of the model as well as speed up its inference on CPUs, while opening the door towards using more specialized neural network acceleration chips such as Tensor Processing Units (something we intend to take advantage of in future work).

Finally, we verified that all these techniques are complementary to each other, and whereas the accuracy losses of each technique do compound, they do not multiply each other with catastrophic results. On the contrary, our smallest model, “Sparse CIFG”, achieves better accuracy, even quantized, than a small baseline model, “LSTM (Baseline small)”, evaluated in full precision.
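
The headline size and speed ratios can be checked with a few lines of arithmetic. The sketch below assumes that float weights are stored as 4 bytes each and integer-quantized weights as 1 byte each (an assumption on our part that happens to reproduce the 488.4MB and 56.3MB figures), with parameter counts and RT(0.9) values taken from Table 6.

```python
# Assumption: float32 weights take 4 bytes each and 8-bit integer weights 1 byte.
BYTES_FLOAT32 = 4
BYTES_INT8 = 1

baseline_params_m = 122.1      # "LSTM" row of Table 6, millions of parameters
sparse_cifg_params_m = 56.3    # "Sparse CIFG" row of Table 6

baseline_mb = baseline_params_m * BYTES_FLOAT32    # float model size in MB
quantized_mb = sparse_cifg_params_m * BYTES_INT8   # integer model size in MB
size_ratio = baseline_mb / quantized_mb

rt_float_lstm = 3.223          # RT(0.9) of the float LSTM baseline
rt_int_sparse_cifg = 0.709     # RT(0.9) of the integer Sparse CIFG model
rt_ratio = rt_float_lstm / rt_int_sparse_cifg

print(f"size: {baseline_mb:.1f} MB -> {quantized_mb:.1f} MB ({size_ratio:.1f}x smaller)")
print(f"RT(0.9): {rt_ratio:.1f}x lower")
```

Under these assumptions the size ratio comes out at about 8.7x, which the text rounds to 9x, and the RT(0.9) ratio at about 4.5x.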


References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. External Links: Link Cited by: §6.1.
  • R. Alvarez, R. Prabhavalkar, and A. Bakhtin (2016) On the efficient representation and execution of deep acoustic models. In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), External Links: Link Cited by: §1, §1, §5.1, §5.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.1, §4.3, §5.2.
  • W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §4.
  • L. H. Chen (2014) Speech recognition repair using contextual information. Google Patents. Note: US Patent 8,812,316 Cited by: §1.
  • C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: §4.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. External Links: Link, 1406.1078 Cited by: §4.2.
  • M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §1.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §1, §1.
  • J. Frankle, K. Dziugaite, D. M. Roy, and M. Carbin (2019) Stabilizing the lottery ticket hypothesis. External Links: Link Cited by: §1.
  • F. A. Gers, J. Schmidhuber, and F. Cummins (2000) Learning to forget: continual prediction with LSTM. Neural Computation 12 (10), pp. 2451–2471. Cited by: §1, §4.1.
  • F. A. Gers, N. N. Schraudolph, and J. Schmidhuber (2003) Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3, pp. 115–143. External Links: ISSN 1532-4435, Document Cited by: §1, §4.1.
  • V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello (2014) A 240 G-ops/s mobile coprocessor for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 682–687. Cited by: §1.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §2.
  • A. Graves (2012) Sequence transduction with recurrent neural networks. CoRR abs/1211.3711. Cited by: §2.
  • K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2016) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §1, §1, §4.2, §4.2.
  • S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. (2016) DSD: dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381. Cited by: §1.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, Cited by: §1, §3.
  • B. Hassibi, D. G. Stork, and G. Wolff (1994) Optimal brain surgeon: extensions and performance comparisons. In Advances in neural information processing systems, pp. 263–270. Cited by: §1.
  • Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §4, §4, §5.1, §6.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Document Cited by: §1, §4.1.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko (2017) Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR abs/1712.05877. External Links: Link, 1712.05877 Cited by: §1, §5.
  • N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio (2016) An online sequence-to-sequence model using partial conditioning. In Advances in Neural Information Processing Systems, pp. 5067–5075. Cited by: §4.
  • R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. External Links: Link Cited by: §4.
  • R. Jozefowicz, W. Zaremba, and I. Sutskever (2015) An empirical exploration of recurrent network architectures. Journal of Machine Learning Research. Cited by: §4.2.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1, §1, §3.
  • N. Lee, T. Ajanthan, and P. H. Torr (2019) SNIP: single-shot network pruning based on connection sensitivity. In ICLR, Cited by: §1.
  • T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi (2018) Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4470–4481. Cited by: §1, §4.3.
  • Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In ICLR, Cited by: §1.
  • I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, and C. Parada (2016) Personalized speech recognition on mobile devices. In Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §4, §5.1.
  • N. Mellempudi, A. Kundu, D. Das, D. Mudigere, and B. Kaul (2017) Mixed low-precision deep learning inference using dynamic fixed point. arXiv preprint arXiv:1701.08978. Cited by: §1.
  • A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani (2018) Toward domain-invariant speech recognition via large scale training. CoRR abs/1808.05312. External Links: Link, 1808.05312 Cited by: §6.1.
  • A. Narayanan, R. Prabhavalkar, C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman (2019) Recognizing long-form speech using streaming end-to-end models. 2019 IEEE Automatic Speech Recognition and Understanding (ASRU) (accepted). Cited by: Figure 1.
  • J. Park, Y. Boo, I. Choi, S. Shin, and W. Sung (2018) Fully neural network based speech recognition on mobile and embedded devices. In Advances in Neural Information Processing Systems, pp. 10620–10630. Cited by: §4.3.
  • Y. Saad (2003) Iterative methods for sparse linear systems. 2nd edition, SIAM. Cited by: §3.1.
  • H. Sak, A. Senior, and F. Beaufays (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, Cited by: §4.1, §4.2, §4.3, §4.
  • J. Shen, P. Nguyen, Y. Wu, Z. Chen, et al. (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling. External Links: 1902.08295 Cited by: §6.1.
  • TensorFlow Lite (2019) External Links: Link Cited by: §5.2, §5, §6.4.
  • TensorFlow model optimization (2019) External Links: Link Cited by: §5.2, §5, §6.1.
  • J. van der Westhuizen and J. Lasenby (2018) The unreasonable effectiveness of the forget gate. arXiv preprint arXiv:1804.04849. Cited by: §4.2.
  • D. Yu, F. Seide, G. Li, and L. Deng (2012) Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In 2012 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp. 4409–4412. Cited by: §1.
  • A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
  • M. Zhu and S. Gupta (2018) To prune, or not to prune: exploring the efficacy of pruning for model compression. In ICLR Workshop, Cited by: §1, §1, §3, §3.