Dynamic latency speech recognition with asynchronous revision

11/03/2020 ∙ by Mingkun Huang, et al. ∙ ByteDance Inc.

In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Specifically, we achieve dynamic latency with only one model by using arbitrary right context during inference. The model is composed of a stack of convolutional layers for audio encoding. At inference, the history states of the encoder and decoder can be asynchronously revised to trade off latency against accuracy. To alleviate the mismatch between training and inference, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. This allows us to obtain dynamic latency speech recognition with large improvements in accuracy. Experiments show that our dynamic latency model with asynchronous revision gives 8%-14% relative improvement over the streaming models.

1 Introduction

End-to-end (E2E) models [1, 2, 3, 4, 5, 6] have gained great popularity for ASR over the last few years. These models, which combine the acoustic, pronunciation and language models into a single network, have shown competitive results compared to conventional ASR systems.

Recently, there has been considerable interest in training E2E models for streaming ASR [7, 8, 9, 10, 11]. It has been shown that having access to future context when encoding the current frame significantly improves speech recognition accuracy. Bidirectional encoders take full advantage of future context; however, inference can only be performed once the entire input sequence is available. Streaming recognition can therefore be achieved by specifying a limited right context, at the cost of accuracy degradation.

Typically, streaming and non-streaming ASR models are trained separately. In particular, in the streaming condition, various models have to be trained to reach a suitable latency-accuracy trade-off. In this work, we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Moreover, in streaming mode, we achieve dynamic latency ASR with only one model. To be specific, the non-streaming model can be used with arbitrary right context during inference. At inference, chunk-based incremental decoding is applied to the non-streaming model, whose history encoder and decoder states can be asynchronously revised to achieve dynamic latency ASR. When performing asynchronous revision during inference, the encoder network may get incomplete right context. To alleviate this training and inference mismatch, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. We show that the dynamic latency model gives 8%-14% relative improvements over streaming models with the same latency.

2 Related Work

There has been a growing interest in building streaming ASR systems based on RNN-T [1, 2]. Compared with attention-based encoder-decoder (AED) [3] models, RNN-T models are naturally streamable and have shown great potential for low-latency streaming ASR. In this work, we mainly focus on RNN-T based models.

To improve the latency of AED models, partial hypothesis selection is proposed in [12]. Both works use chunk-based incremental decoding during inference; the main difference is that we achieve truly streaming ASR based on RNN-T, while they use a global attention mechanism. Similar work has been explored in simultaneous (speech) translation [13, 14]. Variable context training is used in the Y-model [15] architecture, where the context length of the variable context layers can be changed at inference time. However, the latency of the Y-model is predefined at the training stage, so only a limited set of latency configurations can be used at inference. Universal ASR [16] is a unified framework to train a single E2E ASR model for both streaming and non-streaming speech recognition. Nevertheless, only one streaming mode is available.

Unlike these approaches, our work not only explores the unification of streaming and non-streaming ASR models, but also achieves dynamic latency ASR with only one model. In other words, with the asynchronous revision decoding technique, the non-streaming model can be used with any right context during inference.

3 Dynamic Latency ASR

Modern E2E ASR systems have an encoder-decoder structure [1, 3]. The encoder network encodes Mel-spectrogram features into hidden representations containing semantic information. Then, the decoder network makes predictions based on the current encoder output and history prediction states. In this work, we focus on the most commonly used RNN-T model.
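For readers less familiar with RNN-T, the sketch below shows the encoder, prediction network (decoder) and joint network described above as a minimal PyTorch module; the layer types and sizes are illustrative placeholders, not the DFSMN encoder or LSTM decoder configuration used later in this paper.

    import torch
    import torch.nn as nn

    # Minimal RNN-T sketch (illustrative only, not the authors' implementation).
    class TinyRNNT(nn.Module):
        def __init__(self, feat_dim=80, hidden=256, vocab=100):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # audio encoder stand-in
            self.embed = nn.Embedding(vocab, hidden)
            self.predictor = nn.LSTM(hidden, hidden, batch_first=True)   # label-history network
            self.joint = nn.Linear(2 * hidden, vocab)                    # joint network

        def forward(self, feats, labels):
            enc, _ = self.encoder(feats)                  # (B, T, H) acoustic states
            pred, _ = self.predictor(self.embed(labels))  # (B, U, H) label states
            t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
            u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
            return self.joint(torch.cat([t, u], dim=-1))  # (B, T, U, V) logits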

3.1 Frame Synchronous Decoding

Figure 1: Frame synchronous decoding of the RNN-T model. C_i, E_i and D_i denote the i-th input chunk, encoder state and decoder state, respectively. Suppose the right context of the encoder network covers two chunks; a decoder state is then finalized iff its complete right context is provided. The incremental dependencies of the encoder and decoder states are omitted, along with the left context of the encoder network.

The decoder in RNN-T is an auto-regressive model in both streaming and non-streaming ASR. When performing streaming inference, the complete input context for the encoder is necessary. Commonly, for incremental inference, we divide the input utterance into fixed-size chunks and decode every time a new chunk arrives. As shown in figure 1, when performing incremental decoding, the encoder state is conditioned on the current chunk and the history states (memory), while the decoder state is based on the current encoder state and the previous decoder state.
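The following is a minimal sketch of this chunk-based incremental decoding loop; encoder_step and decoder_step are hypothetical callables standing in for the real networks, and the chunk size is illustrative.

    CHUNK = 40  # frames per chunk (illustrative)

    def incremental_decode(frames, encoder_step, decoder_step):
        """Decode an utterance chunk by chunk, carrying history states forward."""
        enc_state, dec_state, hyps = None, None, []
        for start in range(0, len(frames), CHUNK):
            chunk = frames[start:start + CHUNK]
            # Encoder state E_i depends on the current chunk and the history memory.
            enc_out, enc_state = encoder_step(chunk, enc_state)
            # Decoder state D_i depends on the current encoder output and D_{i-1}.
            tokens, dec_state = decoder_step(enc_out, dec_state)
            hyps.extend(tokens)
        return hyps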

3.2 Asynchronous Revision Decoding

For a non-streaming model, it is hard to perform incremental decoding since the long right context leads to large latency. What if we limit the right context? Figure 2 depicts the limited-context decoding process for a non-streaming model. Suppose the right context of the encoder network covers two chunks; figure 2 part (a) shows the initial states when decoding chunk C0. Since the right context of E0 is not complete, when chunk C1 arrives we need to revise the encoder state E0 along with the decoder state D0. As shown in figure 2 part (b), if we only revise the last decoder state, then D0 is finalized. After chunk C2 arrives, the encoder state E0 gets its full input context, so we revise E0 and finalize it, as shown in part (c). Finalized states will not be changed when new chunks arrive, as shown in part (d); temporary states will be revised in the next decoding step.

Since the revision steps of the encoder and decoder states may differ, we refer to this technique as asynchronous revision decoding. For a non-streaming model, more revisions mean more available right context and therefore better results; however, more revisions also lead to larger latency. Revision thus provides a trade-off between latency and accuracy. In other words, the non-streaming model can be used with any right context during inference, which achieves arbitrary latency. The latency of an ASR model can therefore be changed arbitrarily after training, which we call dynamic latency.

Figure 2: Asynchronous revision decoding of the RNN-T model. The right context of the encoder network is limited for non-streaming models when performing incremental decoding. The last several encoder states are not finalized due to incomplete input context. Therefore, the encoder states are revised for the last two steps and the decoder states for the last one step. At each decoding step, temporary states are replaced with finalized states. The memory caches (forward connections) of the encoder and decoder states are omitted, along with the left context of the encoder network.
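The sketch below captures our reading of this procedure; run_encoder and run_decoder are hypothetical callables that recompute states for a span of chunks given the finalized memory, and the revision depths are examples rather than the paper's settings.

    def revise_decode(chunks, run_encoder, run_decoder, enc_rev=2, dec_rev=1):
        """On each new chunk, re-compute the last enc_rev encoder states and the
        last dec_rev decoder states using the freshly available right context."""
        enc_states, dec_states = [], []
        for i in range(len(chunks)):
            # Keep only the finalized states; the temporary tail will be revised.
            enc_states = enc_states[:max(0, i - enc_rev)]
            dec_states = dec_states[:max(0, i - dec_rev)]
            # Re-encode everything after the last finalized encoder state,
            # now that one more chunk of right context is available.
            enc_states += run_encoder(chunks[len(enc_states):i + 1], memory=enc_states)
            # Re-decode the revised decoder steps from the refreshed encoder states.
            dec_states += run_decoder(enc_states[len(dec_states):], memory=dec_states)
        return dec_states

Larger enc_rev and dec_rev expose more right context at the cost of recomputation and latency, which is exactly the trade-off discussed above.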

3.3 Segment Cropping Training

When training a non-streaming ASR model, the full utterance is provided, so at each frame the encoder has complete context. When performing asynchronous revision during inference, the finalized encoder states may get incomplete right context. As shown in figure 2, suppose the right context covers five chunks; the finalized encoder state E0 only sees two chunks to the right since the encoder revision is two steps. E0 then provides partial memory to the next state E1, and so on. This leads to a mismatch between inference and training.

Figure 3: Partial memory problem and segment cropping training. The left contexts of the encoder states are omitted.

Prefix training is used to alleviate partial inputs in simultaneous (speech) translation [13, 14]. However, it does not work for RNN-T models because of their intrinsic monotonic alignment property. In our framework, revision is applied to each chunk during inference; if the right context of a finalized encoder state is not the same as in training, a mismatch occurs. We call this incomplete right-context state partial memory. To alleviate the partial memory problem in incremental decoding, we propose a training strategy, segment cropping, which randomly splits the input sequence into several segments with forward connections. As shown in figure 3, every chunk has full right context in normal training, whereas at inference partial memory occurs in every chunk if there is no revision. Segment cropping allows us to simulate partial memory during training, as sketched below.
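The sketch is hypothetical helper code, not the authors' training pipeline: the utterance is cut at random points, and the encoder processes the segments left to right, carrying only the forward memory across segment boundaries so that right context is truncated much as it is under revision-limited inference.

    import random

    def segment_crop(frames, n_segments=3):
        """Randomly split an utterance into n_segments contiguous pieces."""
        cuts = sorted(random.sample(range(1, len(frames)), n_segments - 1))
        bounds = [0] + cuts + [len(frames)]
        return [frames[s:e] for s, e in zip(bounds[:-1], bounds[1:])]

    def encode_with_cropping(frames, encoder_step, n_segments=3):
        memory, outputs = None, []
        for segment in segment_crop(frames, n_segments):
            # Right context is cut at the segment boundary; only the forward
            # connection (memory) is carried into the next segment.
            out, memory = encoder_step(segment, memory)
            outputs.append(out)
        return outputs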

4 Experiments

4.1 Data

For our experiments we use a 10K-hour internal Mandarin speech dataset. The test set consists of 21k utterances, each shorter than 30 seconds, drawn from various applications. The training and test sets are all anonymized and hand-transcribed. The input speech waveforms are framed using a 25 ms window with a 10 ms shift, from which we extract 80-dimensional filter bank features. We report model performance using Character Error Rate (CER).

4.2 Model Structures and Hyperparameters

Without loss of generality, our encoder network is based on a temporal convolutional network, DFSMN [17], which can be extended to self-attentional networks such as Conformer [6]. Our decoder network consists of two LSTM layers with 1024 units. At the front of the encoder network, we use two convolutional subsampling layers, each with a stride of 2. For the non-streaming model, the encoder has 30 DFSMN layers with 2048 hidden units, and the input context for each convolution layer covers 20 left frames and 20 right frames along with the current frame. For the streaming models, we use three different right-context configurations, corresponding to 0.4 second (0.4s for short), 1.2s and 2.4s latency models; the only difference among these models is the right context configuration. For simplicity, we use greedy search to decode our models.

4.3 Asynchronous Revision on Non-Streaming Model

We perform asynchronous revision decoding on the baseline non-streaming model (CER 9.45%) with various numbers of revisions. At inference, we divide the input utterances into fixed-size chunks of 40 frames and decode every time a new chunk arrives. Hence, the number of revisions corresponds to the number of history chunks to be revised.

Figure 4 shows the relationship between the number of revisions and CER. With the same number of decoder revisions, more encoder revisions have little impact on model performance. As we can see, more decoder revisions bring better results, although the marginal improvement becomes smaller.

We have also run experiments with smaller encoder revisions; the results are similar to those in figure 4. Besides, when there is no decoder revision (revise=0), the decoder network sees partial memory without any right context, which leads to poor performance. We will not report results with no revision in the following experiments.

Figure 4: Asynchronous revision on non-streaming model.

4.4 Asynchronous Revision on Streaming Models

The proposed framework also works on streaming models. As shown in figure 5, we perform asynchronous revision on two different streaming models, along with the baseline non-streaming model. For simplicity, we use the same number of revisions for the encoder and decoder networks. It is evident that the number of revisions is positively related to performance, whatever the model latency is. Meanwhile, large numbers of revisions give little further relative improvement, since the right context provided by revision is then close to that seen in training. Interestingly, when the number of revisions is small (e.g. revise=1), the lower latency model gives better results. We believe this is because partial memory has a larger impact on higher latency models.

Figure 5: Asynchronous revision on streaming models.

4.5 Dynamic Latency Versus Streaming

Model                        Revisions   Latency   CER
Streaming                    -           0.4s      11.63
                             -           1.2s      10.63
                             -           2.4s      10.20
Dynamic Latency              1           0.4s      11.75
                             3           1.2s      10.28
                             6           2.4s      9.75
                             -           24s       9.45
Dynamic Latency              1           0.4s      10.29
 + 3 Segments Cropping       3           1.2s      9.55
                             6           2.4s      9.32
                             -           24s       9.26
Dynamic Latency              1           0.4s      10.03
 + 5 Segments Cropping       3           1.2s      9.55
                             6           2.4s      9.37
                             -           24s       9.29

Table 1: Dynamic latency model versus streaming models. '-' means full context decoding.

As discussed in section 3.2, the latency of an ASR model can be changed arbitrarily with asynchronous revision during inference. Without loss of generality, the decoding chunk size is 40 frames. Revising one chunk corresponds to 0.4 seconds of latency, 3 chunks to 1.2 seconds and 6 chunks to 2.4 seconds. We denote the non-streaming model with asynchronous revision applied during inference as the dynamic latency model.
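With the 10 ms frame shift from section 4.1, this mapping from revisions to latency is a small calculation, sketched below for the three settings used here.

    FRAME_SHIFT_S = 0.01   # 10 ms per frame (section 4.1)
    CHUNK_FRAMES = 40      # decoding chunk size

    def latency_seconds(revisions):
        return round(revisions * CHUNK_FRAMES * FRAME_SHIFT_S, 2)

    for r in (1, 3, 6):
        print(r, "revision(s) ->", latency_seconds(r), "s")   # 0.4, 1.2, 2.4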

Table 1 presents the results of the dynamic latency model versus the streaming models described in section 4.2. The dynamic latency model gives better or similar results compared with the streaming models at the same latency configuration. To alleviate the partial memory mismatch, we apply segment cropping training, using two cropping strategies that randomly split input utterances into 3 or 5 segments. As shown in the last two blocks of table 1, both strategies give similar results. Compared with the streaming models, the dynamic latency model with segment cropping training gives 13.76%, 10.16% and 8.14% relative improvements at 0.4s, 1.2s and 2.4s latency, respectively. Furthermore, at 2.4 seconds of latency the performance of the dynamic latency model is extremely close (less than 1% relative) to that of the non-streaming model.

5 Conclusion

We propose an inference framework, asynchronous revision, to perform streaming and non-streaming speech recognition with one model. Moreover, in streaming mode, the latency of a non-streaming model can be arbitrarily changed, which we call dynamic latency; asynchronous revision can thus be applied as a pure inference technique without requiring extra training support. We also propose segment cropping training to alleviate the partial memory problem during inference. We show that the dynamic latency model gives 8%-14% relative improvements over streaming models at the same latency. We also show that in the 2.4 seconds latency setting, the performance of asynchronous revision decoding is very close to that of full context decoding. A limitation of our framework is that more revisions mean more computation. Our future work will focus on improving the computational efficiency of asynchronous revision.

References