Dynamic Sparsity Neural Networks for Automatic Speech Recognition

05/16/2020 · Zhaofeng Wu, et al. · Google, University of Washington

In automatic speech recognition (ASR), model pruning is a widely adopted technique that reduces model size and latency to deploy neural network models on edge devices with resource constraints. However, in order to optimize for hardware with different resource specifications and for applications that have various latency requirements, models with varying sparsity levels usually need to be trained and deployed separately. In this paper, generalizing from slimmable neural networks, we present dynamic sparsity neural networks (DSNN) that, once trained, can instantly switch to execute at any given sparsity level at run-time. We show the efficacy of such models on ASR through comprehensive experiments and demonstrate that the performance of a dynamic sparsity model is on par with, and in some cases exceeds, the performance of individually trained single sparsity networks. A trained DSNN model can therefore greatly ease the training process and simplify deployment in diverse scenarios with resource constraints.


1 Introduction

Traditionally, network pruning methods [1, 2] have been employed to obtain sparse neural network models to support edge devices with limited resources [3]. However, today's machine learning production models often target a variety of consumer hardware capabilities. The wide spectrum of mobile devices alone can exhibit latency differences of multiple orders of magnitude for the same machine learning model [4]. The situation becomes even more complicated when systems such as home speakers and cars are taken into consideration. As an additional compounding factor, different software applications might have different latency requirements. For example, despite likely using the same architecture, the speech recognizer for video conference captioning requires a higher degree of synchronicity than one that aids online video subtitle generation.

Ideally, different-sized models with varying sparsity levels should be trained to target every single device type. However, this scheme is impractical given the myriad of existing devices. Alternatively, one could train a few sparse models only targeting typical hardware configurations. In addition to the maintenance overhead of an offline device sparsity table, this strategy will also necessarily under- or over-utilize resources on the heterogeneous long tail of devices. Additionally, even on a single device, resource availability usually changes dynamically as concurrent activities vary. Models with a static sparsity level will hence likely result in sub-optimal resource usage.

To support such diverse sets of scenarios, we propose dynamic sparsity neural networks (DSNN). After training, a single DSNN model is able to execute at any sparsity level at run-time with no or insignificant loss in accuracy compared to individually trained single sparsity networks. With such a model, we can dynamically adjust its sparsity according to the device capability and resource availability, thereby achieving an optimal accuracy-latency trade-off with minimal memory footprint.

DSNN was inspired by recent work [5, 6] which showed that even for untrained random networks, there exist sub-networks of arbitrary sparsity levels that achieve very high quality. Trained networks should therefore also simultaneously contain powerful sub-networks at different sparsity levels.

Methodologically, DSNN builds upon the recent line of work on slimmable neural networks (SNN) [7, 8], which were developed to tackle a similar issue of model deployment across heterogeneous devices. However, these models are designed only for convolutional neural networks, restricting their applicability to many domains and tasks. We demonstrate in Section 5.2 that a naive generalization of SNN to the task of automatic speech recognition (ASR) performs poorly.

DSNN, on the other hand, is a sparsity-based extension of SNN that is applicable to any weight-based neural network. In this paper, we choose to focus on the task of ASR due to an increasing demand for on-device ASR [9, 3]. We show that a single DSNN model can match, or sometimes exceed, the performance of individually trained single sparsity networks across a series of sparsity levels (Section 5.1).

DSNN models contribute to practical machine learning systems in two ways. First, the same DSNN model can be deployed to multiple hardware types with different resource and energy constraints, which greatly reduces both the training overhead and the management complexity of deployment processes. Second, because DSNN models exceed the performance of single sparsity models at high sparsity levels, the DSNN training scheme also constitutes an effective approach to improving sparse model performance.

2 Related Work

Over-parameterization is a commonly addressed issue of neural networks [10, 11]. To deal with this issue, model pruning methods have been developed to remove unimportant connections in the weight matrices of neural network models. Optimal Brain Damage [12] and Optimal Brain Surgeon [13] first proposed pruning methods based on second-order derivatives, and [1] demonstrated the magnitude-based pruning approach upon which our work builds. The resulting pruned models contain only sparse structures, allowing them to run efficiently at inference time while maintaining performance [12, 1]. Many studies have demonstrated the empirical strength of such sparse networks [1, 3] and examined their theoretical properties [14, 15, 5, 6].

Recently, [5] and [6] showed that untrained random networks contain sub-networks at arbitrary sparsity levels that perform well without training. The best of these sub-networks, usually at around 50% sparsity, can perform as well as the full (i.e. zero-sparsity) model on specific datasets. Our work also tries to find a single network containing multiple high-quality sub-networks, but we allow model training while requiring these sub-networks to match the quality of individually trained single sparsity networks.

Dynamic neural networks are a family of models that optimize the run-time accuracy-efficiency trade-off using dynamic inference graphs [16, 17, 7, 8]. These models often allow selective execution, which is desirable when the target inference platforms vary in their constraints. To our knowledge, our proposed DSNN is the first such model to achieve this optimization using sparse networks.

3 Dynamic Sparsity Neural Networks

In this section, we first provide a formulation of dynamic sparsity neural networks (DSNN) and justify it using previous studies. We then introduce the DSNN training algorithm. Finally, we highlight several key distinctions from slimmable neural networks, our methodological precursor.

3.1 Model Formulation

DSNN aims to train a super-network N such that, given an arbitrary sparsity level s in a range [S_min, S_max], we can find a sub-network N_s with only a subset of the connections in N, without further fine-tuning or re-training. This sub-network should have the same or better quality than an individually trained single sparsity model obtained through traditional pruning algorithms at the same sparsity level s. With such a super-network, we are able to dynamically switch to sub-networks with different sparsity levels during deployment, optimizing for hardware capacities and application latency constraints.
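One compact way to read this goal, using our own shorthand rather than notation taken from the paper, is as minimizing the expected task loss over sub-networks drawn from the sparsity range; here theta denotes the super-network weights, N_s(theta) the sub-network pruned to sparsity s, and the loss is the ASR training loss:

% A possible formalization of the DSNN objective (our own notation, not the paper's):
%   \theta      -- super-network weights
%   N_s(\theta) -- sub-network obtained by pruning to sparsity level s
%   \mathcal{L} -- task (ASR) training loss
\min_{\theta} \; \mathbb{E}_{s \sim \mathcal{U}[S_{\min},\, S_{\max}]}
    \Big[ \mathcal{L}\big( N_s(\theta) \big) \Big]

In practice, Algorithm 1 approximates this expectation by always including the two endpoints S_min and S_max and sampling the remaining sparsity levels uniformly.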

While it is non-trivial to theoretically prove the existence of such a super-network N, empirical evidence does suggest that one is likely to exist. Zhou et al. [5] and Ramanujan et al. [6] showed that an untrained random model can simultaneously contain sub-networks that perform well; at very specific sparsity levels, their models can perform as well as individually trained dense models. Slimmable neural networks [7, 8] demonstrated that, for ImageNet [18] classification, a trained convolutional super-network can have structured sub-networks with similar or better performance than individually trained networks with the same architecture. More recently, BigNAS [19] trained a single set of shared weights on ImageNet from which child models are obtained via a simple coarse-to-fine architecture selection heuristic. All these works hint at a possible super-network that encompasses multiple high-quality sub-networks, and they encourage us to explore DSNN, a general sparsity-based super-network.

3.2 Approach

Define S_min, S_max, L, number of iterations T
iter_length = L + 2  # L intermediate levels plus the min and max sparsity endpoints
Initialize model N with a pre-trained full (zero-sparsity) network
for iteration in [0, ..., T - 1]:
   for step in [0, ..., iter_length - 1]:
      for W in N:
         if step == 0:
            sparsity = S_min
         elif step == iter_length - 1:
            sparsity = S_max
         else:
            sparsity = random(S_min, S_max)  # sampled independently per weight matrix
         W = sparsify(W, sparsity)
      forward(N)
      backward(N), accumulating gradients
   update weights of N with the accumulated gradients using the optimizer
Algorithm 1: Dynamic sparsity neural networks training algorithm.

With the likely existence of such an N, the question becomes how to find it efficiently. In order for a single model to execute at arbitrary sparsity levels, we jointly train the same network at a variety of sparsity levels. Specifically, at each training step, we choose a target sparsity level at which to train the model.

We leverage “the sandwich rule” [8], which states that the quality of models of varying sizes is bounded by that of the largest and the smallest model. Formally, given two sparsity levels s_1 and s_2 with s_1 <= s_2, as long as the pruning function guarantees

    C(s_2) \subseteq C(s_1)    (1)

where C(s_1) and C(s_2) are the sets of connections remaining after pruning at s_1 and s_2 (a weak condition that many pruning functions satisfy, including the magnitude-based pruning used in this work), the residual error of the network at the smaller sparsity level should be no higher than that at the larger sparsity level. By extension, for any s in [S_min, S_max],

    E(N_{S_{\min}}) \le E(N_s) \le E(N_{S_{\max}})    (2)

where E(.) denotes the residual error of a sub-network.

This formulation allows us to focus on training the two endpoint models, at the minimum and maximum sparsity levels S_min and S_max. In addition, we also sample intermediate sparsity levels to allow better generalization between the endpoints.

We take inspiration from regular sparse model training, where it is common practice to pretrain the full model for some number of steps before pruning begins [20, 21]. This provides a high-quality model as the initialization from which the sparse models are pruned. For DSNN, because we choose S_min = 0% (Section 4.3), the full model is already present during the training algorithm as the minimally sparse endpoint. Nevertheless, we empirically find the inclusion of the pretraining stage to still be crucial for DSNN quality (Section 5.3).

We sketch the DSNN training procedure in Algorithm 1. In each iteration, we alternate among the minimum, intermediate, and maximum sparsity levels for weight masking and execute forward and backward propagation at each level. However, this alternation during training could be a source of instability. To deal with it, instead of updating model parameters immediately after backward propagation in each training step, we accumulate the parameter gradients across the training steps and perform only one parameter update per iteration.
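Below is a minimal TensorFlow 2-style sketch of one such training iteration, written against hypothetical model, optimizer, loss_fn, and batch objects and the sparsify function from Algorithm 1. For simplicity it masks every trainable variable (the paper prunes only the recurrent weights), so it illustrates the control flow rather than the authors' exact implementation.

import random
import tensorflow as tf

def dsnn_iteration(model, optimizer, loss_fn, batch,
                   s_min=0.0, s_max=0.9, num_intermediate=2):
    # Endpoints are always visited; None means "sample a sparsity per weight matrix".
    sparsities = [s_min] + [None] * num_intermediate + [s_max]
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

    for target in sparsities:
        # Mask each weight matrix at the chosen sparsity level.
        originals = [v.read_value() for v in model.trainable_variables]
        for v in model.trainable_variables:
            s = target if target is not None else random.uniform(s_min, s_max)
            v.assign(sparsify(v, s))

        with tf.GradientTape() as tape:
            loss = loss_fn(model(batch["features"], training=True), batch["labels"])
        grads = tape.gradient(loss, model.trainable_variables)

        # Restore the dense weights, then accumulate gradients across sparsity levels.
        for v, orig in zip(model.trainable_variables, originals):
            v.assign(orig)
        accumulated = [a + (g if g is not None else 0.0)
                       for a, g in zip(accumulated, grads)]

    # A single optimizer update per iteration, using the accumulated gradients.
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))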

3.3 Comparison with Slimmable Neural Networks

Our model is similar to slimmable neural networks (SNN) [7, 8]: both allow dynamic inference graphs, albeit with several key distinctions. First, SNN shrinks models by truncating convolutional channels, while DSNN obtains smaller model variants through model pruning, which allows DSNN to be applied easily to more domains and tasks. The sparse structure also allows DSNN to preserve the high dimensionality of the input and output spaces, even though the mapping from input to output is low-dimensional. One may consider a simple generalization of SNN that prunes whole nodes in a network instead of convolutional channels; in contrast, DSNN uses an edge-pruning approach, the common practice in model pruning. Node pruning restricts SNN to always use fully connected sub-networks, whereas the lack of a pre-defined sub-network structure in DSNN allows greater modeling flexibility. SNN can therefore be viewed as a special case of DSNN whose sparsity patterns are skewed, with all connections to the last channels masked as zeros. See Figure 1 for an illustration.

Figure 1: Comparison of slimmable and dynamic sparsity neural networks. SNN, which takes a node-pruning approach, is a special case of DSNN, which prunes edges. DSNN allows more connection flexibility, while SNN restricts sub-networks to always be fully connected.

4 Experimental Setup

In this section we describe our experimental settings.

4.1 Task and Dataset

While our approach is widely applicable to all weight-based neural networks, we choose to conduct experiments in automatic speech recognition (ASR) due to the increased interest in on-device ASR [9, 3]. We perform experiments on the LibriSpeech dataset [22], which consists of read speech from audiobooks. We merge its three training sets while maintaining the “clean” and “other” distinction, corresponding to low versus high noise conditions. The final dataset contains 960.9 hours of training data, 5.4/5.3 hours of clean/other development data, and 5.4/5.1 hours of clean/other test data.

4.2 Model Architecture and Settings

We use an attention-based encoder-decoder architecture as the base model [23]. The encoder consists of 2 convolutional layers with 3x3 kernels, 3 projection layers with 2048 output units, and 4 bidirectional LSTM layers [24] with 2048 output units. The decoder consists of 2 LSTM layers with 1024 output units, 1 attention layer with 128 hidden units, and 1 fully connected layer. The model contains 184M parameters. The quality of our base model generally matches that of the similar architecture in [25] (Table 1).
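For concreteness, the following is a rough Keras-style sketch of the encoder shape described above. The input feature dimension, convolution filter counts and strides, activations, and the split of each BiLSTM's 2048 output units into 1024 per direction are assumptions, and the attention decoder is omitted; this is not the authors' exact implementation.

import tensorflow as tf

def build_encoder(feature_dim=80):
    features = tf.keras.Input(shape=(None, feature_dim, 1))
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(features)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
    # Flatten the frequency and channel axes, keeping the (subsampled) time axis.
    x = tf.keras.layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    for _ in range(3):  # 3 projection layers with 2048 output units
        x = tf.keras.layers.Dense(2048, activation="relu")(x)
    for _ in range(4):  # 4 bidirectional LSTM layers with 2048 total output units
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(1024, return_sequences=True))(x)
    return tf.keras.Model(features, x)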

Our models are implemented in TensorFlow [26] and trained on 8x8 Tensor Processing Units (TPUs) with a batch size of 2048. We use the Adam optimizer [27] with a constant learning rate of 1e-3 after warm-up.
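A minimal sketch of this optimizer setting is given below, assuming a linear ramp-up; the warm-up shape and the number of warm-up steps are assumptions, since the text only states that the learning rate is constant at 1e-3 after warm-up.

import tensorflow as tf

class WarmupThenConstant(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr=1e-3, warmup_steps=10000):  # warmup_steps is an assumption
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # Ramp up linearly to peak_lr, then stay constant.
        step = tf.cast(step, tf.float32)
        return self.peak_lr * tf.minimum(1.0, step / self.warmup_steps)

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupThenConstant())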

4.3 Model Pruning

We use magnitude-based pruning which, given a target sparsity level s and a weight matrix W, zeros out the elements of W with the smallest absolute values by applying a binary mask over W. We employ block pruning [28] with a fixed block size to more efficiently leverage hardware resources: instead of the smallest individual elements, we zero out the smallest blocks of W. This procedure corresponds to the sparsify function in Algorithm 1. We only prune the recurrent connections, which constitute 64.57% of the model parameters. (When we indicate a sparsity level in this paper, we refer to the sparsity level of these recurrent weights only.) We also allow the mask to be updated at each iteration, which enables a pruned weight to be recovered if, at a later step, its magnitude exceeds that of other surviving weights, similar to [29, 30].
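A minimal sketch of such a block-magnitude sparsify function is shown below; the default block shape and the use of the mean absolute value as the block score are assumptions made for illustration, not the paper's exact settings.

import tensorflow as tf

def sparsify(weight, sparsity, block_shape=(4, 1)):
    # Magnitude-based block pruning: zero out the lowest-scoring blocks of
    # `weight` so that roughly `sparsity` fraction of its entries become zero.
    rows, cols = weight.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "weight must tile evenly into blocks"

    # Score each (br x bc) block by its mean absolute value.
    blocks = tf.reshape(weight, (rows // br, br, cols // bc, bc))
    scores = tf.reduce_mean(tf.abs(blocks), axis=(1, 3))  # shape: (rows//br, cols//bc)

    # Keep the highest-scoring blocks; the rest are masked to zero.
    num_blocks = scores.shape[0] * scores.shape[1]
    num_keep = num_blocks - int(round(sparsity * num_blocks))
    flat_scores = tf.reshape(scores, [-1])
    threshold = tf.sort(flat_scores, direction="DESCENDING")[max(num_keep - 1, 0)]
    block_mask = tf.cast(scores >= threshold, weight.dtype)

    # Expand the block mask back to the weight's shape and apply it.
    mask = tf.repeat(tf.repeat(block_mask, br, axis=0), bc, axis=1)
    return weight * mask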

For both the baseline single sparsity models (Section 5.1) and DSNN, we first train a zero-sparsity network until convergence (around 200k steps). While training this network, we maintain exponential moving averages of all model parameters, and when the pruning stage begins, we load all model variables from these averages. The baseline models use a constant pruning schedule that fixes the sparsity level; the DSNN training algorithm does not require a pruning schedule.
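The following is a small sketch of this hand-off, assuming TensorFlow's tf.train.ExponentialMovingAverage; the decay value and the pretrain_step and load_averages helper names are assumptions for illustration.

import tensorflow as tf

# Shadow averages maintained during zero-sparsity pretraining (decay is an assumption).
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

def pretrain_step(model, optimizer, loss_fn, batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(model(batch["features"], training=True), batch["labels"])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    ema.apply(model.trainable_variables)  # update the shadow averages

def load_averages(model):
    # Overwrite each variable with its moving average before pruning begins.
    for v in model.trainable_variables:
        v.assign(ema.average(v))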

We train DSNN with a minimum sparsity level S_min = 0% (i.e. the full model) and a maximum sparsity level S_max = 90%. In addition to the two endpoint sparsity levels, we sample L intermediate sparsity levels in this range to train at during each iteration.

5 Results and Discussion

In this section we present our experimental results. We evaluate the models on the minimum and maximum sparsity levels, 0% and 90%, and two intermediate sparsity levels, 30% and 60%, that test the model generalizability between the endpoints.

Sparsity | Model        | Dev   | Dev   | Test  | Test
Level    | Type         | clean | other | clean | other
---------+--------------+-------+-------+-------+------
0%       | Zeyer et al. | 3.54  | 11.52 | 3.82  | 12.76
0%       | Single       | 3.6   | 11.8  | 3.9   | 11.8
0%       | SNN          | 4.7   | 14.2  | 4.8   | 14.5
0%       | DSNN         | 3.7   | 12.3  | 4.0   | 12.4
30%      | Single       | 3.7   | 12.3  | 4.0   | 12.2
30%      | SNN          | 4.7   | 14.4  | 4.9   | 14.8
30%      | DSNN         | 3.7   | 12.3  | 4.0   | 12.3
60%      | Single       | 3.8   | 12.5  | 4.2   | 12.6
60%      | SNN          | 5.9   | 16.8  | 6.3   | 17.5
60%      | DSNN         | 3.7   | 12.3  | 4.0   | 12.3
90%      | Single       | 4.2   | 13.9  | 4.5   | 14.1
90%      | SNN          | 8.0   | 20.3  | 8.6   | 21.5
90%      | DSNN         | 3.9   | 13.2  | 4.2   | 13.2

Table 1: WER (%) for single sparsity networks (Single), SNN, and DSNN. Lower is better. We also show the LAS performance from [25] (Zeyer et al.) as a baseline for comparison.

5.1 Comparison with Single Sparsity Networks

As argued in Section 3, the most important success criterion for DSNN is matching the quality of individually trained single sparsity networks. If the DSNN model significantly degraded model quality, it would not be useful, especially in real-world scenarios where quality is prioritized. We therefore conducted baseline experiments with single sparsity networks and show the results in Table 1.

The dynamic sparsity model generally matches the quality of the single sparsity networks. Moreover, quality declines much more slowly with increasing sparsity for DSNN than for the single sparsity networks: DSNN slightly trails the single sparsity networks at 0% sparsity but is the best at 60% and 90% sparsity. We hypothesize that this is because the sparser sub-networks in DSNN have thinner structures that are, in expectation, trained more frequently at each step, whereas connections with smaller weights are trained more sporadically and receive relatively less focus. Notably, a similar trend is observed for SNN in [7, 8], where the quality gap relative to individually trained models also grows less negative or more positive as fewer convolutional channels are used at inference.

In practice, even in highly quality-driven scenarios where the slight quality gap at low sparsity levels is unacceptable, one can still deploy DSNN at only the high sparsity levels, complementing single sparsity networks used for the denser models.

5.2 Comparison with Slimmable Neural Networks

Slimmable neural networks (SNN) [7, 8] only vary the number of channels in convolutional networks: for each target width, the last channels (the ones with the largest indices) are removed. Although this is not directly applicable to arbitrary weight matrices, we experiment with a simple generalization of SNN as follows.

Given a target sparsity level s and a weight matrix W with n dimensions of sizes d_1, ..., d_n, we apply a binary mask M on W. For each dimension i, we select a threshold t_i by

    t_i = \mathrm{round}\left( d_i \cdot (1 - s)^{1/n} \right)    (3)

where round(.) rounds to the nearest integer. We then generate the mask by

    M_{j_1, \dots, j_n} =
    \begin{cases}
        1 & \text{if } j_i \le t_i \text{ for all } i \\
        0 & \text{otherwise}
    \end{cases}    (4)

Finally we prune W by

    W' = W \odot M    (5)

where \odot denotes element-wise multiplication.

Intuitively, we truncate the last rows in each dimension by an equal fraction, trying to make the resulting matrix have a sparsity close to the target sparsity. (This procedure does not guarantee an exact final sparsity level because the pre-rounded t_i is usually not an integer, but the larger the original matrix, the closer it will be; after all, model pruning is usually only applied to large weight matrices. In our experiments, the differences between the target sparsity levels and the resulting sparsity levels are always within 0.01%, so we neglect this difference when comparing results.) This is analogous to removing the last convolutional channels.
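A minimal sketch of this generalized SNN pruning is given below; the function name and the use of TensorFlow broadcasting are our own choices for illustration.

import tensorflow as tf

def snn_prune(weight, sparsity):
    # Keep only the first round(d_i * (1 - s)^(1/n)) indices along each of the
    # n dimensions, so that the retained fraction is approximately 1 - s.
    shape = weight.shape.as_list()
    n = len(shape)
    keep_fraction = (1.0 - sparsity) ** (1.0 / n)
    mask = tf.ones_like(weight)
    for i, d in enumerate(shape):
        t = int(round(d * keep_fraction))
        # Ones for the first t indices of dimension i, zeros for the rest.
        axis_mask = tf.concat([tf.ones([t], dtype=weight.dtype),
                               tf.zeros([d - t], dtype=weight.dtype)], axis=0)
        broadcast_shape = [1] * n
        broadcast_shape[i] = d
        mask = mask * tf.reshape(axis_mask, broadcast_shape)
    return weight * mask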

We compare SNN and DSNN quality in Table 1. We see that DSNN’s edge pruning approach significantly outperforms SNN’s node pruning approach.

5.3 Ablations

Sparsity | Model           | Dev   | Dev   | Test  | Test
Level    | Setting         | clean | other | clean | other
---------+-----------------+-------+-------+-------+------
0%       | Baseline DSNN   | 4.2   | 12.9  | 4.4   | 13.1
0%       |  + pretraining  | 3.8   | 12.3  | 4.1   | 12.3
0%       |   + GA          | 3.7   | 12.3  | 4.0   | 12.4
30%      | Baseline DSNN   | 4.2   | 12.9  | 4.4   | 13.1
30%      |  + pretraining  | 3.8   | 12.3  | 4.1   | 12.3
30%      |   + GA          | 3.7   | 12.3  | 4.0   | 12.3
60%      | Baseline DSNN   | 4.2   | 12.9  | 4.4   | 13.2
60%      |  + pretraining  | 3.8   | 12.3  | 4.1   | 12.3
60%      |   + GA          | 3.7   | 12.3  | 4.0   | 12.3
90%      | Baseline DSNN   | 4.3   | 13.4  | 4.5   | 13.6
90%      |  + pretraining  | 4.0   | 13.4  | 4.3   | 13.3
90%      |   + GA          | 3.9   | 13.2  | 4.2   | 13.2

Table 2: WER (%) for incrementally adding pretraining and gradient accumulation (GA) on top of a baseline DSNN. Lower is better.

We analyzed the effect of pretraining and gradient accumulation (GA) and show the ablation results in Table 2. Similar to previous model pruning work [20, 21], we find it important to pretrain the zero-sparsity model for a certain number of steps before pruning begins. This pretraining stage uniformly improves performance by up to 0.8% absolute WER, which further confirms the importance of initialization for sparse model training: it helps even though the zero-sparsity model is already present during the DSNN training algorithm. The gradient accumulation technique also consistently improves model performance across sparsity levels by stabilizing the training process.

6 Conclusion and Future Work

We presented a training scheme that allows a single trained model to switch its sparsity level at inference time. Given that its performance is on par with individually trained single sparsity networks, such a model can simultaneously support a variety of devices with different hardware capabilities and applications with diverse latency requirements. As it outperforms single sparsity models at high sparsity levels, DSNN also serves as a way to improve sparse model performance. Nevertheless, DSNN still trails the quality of single sparsity networks at low sparsity levels; we leave closing this gap to future work.

In this work, we only considered models in which all pruned weight matrices use the same sparsity fraction. However, the components of a machine learning model are not always equally important, and setting different sparsity levels for different weights may yield a higher-quality model [31]. Since each weight matrix is pruned independently in the DSNN training algorithm, DSNN is able to approximate the performance of individually trained networks with arbitrary sparsity configurations across weights. Combined with a greedy search algorithm, DSNN can therefore be used to search for an optimal per-weight sparsity configuration, analogous to [32]. This would be an interesting future exploration.
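As a purely illustrative sketch of this future direction (not a procedure from the paper), a greedy search over per-weight-matrix sparsity levels could look as follows; evaluate_wer and remaining_params are hypothetical helpers, and the step size and budget semantics are assumptions.

def greedy_sparsity_search(dsnn_model, layer_names, evaluate_wer, remaining_params,
                           budget, s_max=0.9, step=0.1):
    # Start fully dense, then repeatedly sparsify whichever layer hurts WER least,
    # until the remaining parameter count fits the budget. Because the DSNN model
    # can run at any sparsity level, each candidate is evaluated without re-training.
    config = {name: 0.0 for name in layer_names}
    while remaining_params(config) > budget:
        best_name, best_wer = None, float("inf")
        for name in layer_names:
            if config[name] + step > s_max:
                continue
            candidate = dict(config, **{name: config[name] + step})
            wer = evaluate_wer(dsnn_model, candidate)
            if wer < best_wer:
                best_name, best_wer = name, wer
        if best_name is None:
            break  # every layer is already at s_max
        config[best_name] += step
    return config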

References