Sequence-to-sequence (S2S) attention-based models [chorowski2015attention, chan2016listen] have become increasingly popular for end-to-end speech recognition. Several advances [chiu2018state, zeyer2018improved, weng2018improving, pham2019very] in the architecture and the optimization of S2S models have been proposed to achieve state-of-the-art recognition performance. In offline scenarios, i.e., batch processing of audio files, the S2S models in [park2019specaugment, nguyen2019improving] have already shown state-of-the-art performance on standard benchmarks. However, methods for employing S2S models in online speech recognition, i.e., run-on recognition with low latency, still need to be researched in order to obtain the desired accuracy and latency.
[raffel2017online, chiu2017monotonic] pointed out early that the shortcoming of an attention-based S2S model used in online conditions lies in its attention mechanism, which must perform a pass over the entire input sequence for every element of the output sequence. They proposed a so-called monotonic attention mechanism, which enforces a monotonic alignment between the input and output sequences. Later on, [fan2019online, miao2019online, tsunoo2019online] addressed the latency issue of bidirectional encoders, which is also an obstacle for online speech recognition. In these studies, unidirectional and chunk-based encoder architectures replace the fully bidirectional approach to control the latency.
In this work, we analyze the alignment behavior of the attention function of a high-performance S2S model and propose an additional constraint loss to make it capable of streaming inference. By discussing the problems that occur when adapting an S2S model for use in a streaming recognizer, we additionally show that the standard beam search offers no guarantee of low-latency inference results and needs to be modified to provide partial hypotheses. In addition, we argue that the common real-time factor is not a proper choice for measuring the user-perceived latency in an online and streaming setup, and we propose a novel and suitable technique for measuring latency.
In contrast to earlier research and literature, our experimental results show that a bidirectional encoder can be combined with suitable inference methods to produce high-accuracy and low-latency speech recognition output. With a delay of 1.5 seconds in all output elements, our streaming recognizer can fully achieve the performance of an offline system of the same configuration. To the best of our knowledge, this is the first time that an S2S speech recognition model can be used in online conditions without sacrificing accuracy.
2 Sequence-to-Sequence Model
In this work, we modify the LSTM-based sequence-to-sequence encoder-decoder model proposed in [nguyen2019improving] to perform high-accuracy online streaming ASR with very low latency. Our model can be decomposed using a set of neural network functions as follows:
In principle, the functions are designed to map a sequence of acoustic vectors to a sequence of sub-words and can be grouped into two parts: an encoder and a decoder. In the encoder, acoustic vectors are down-sampled with two convolutional layers and then fed into several bidirectional LSTM layers to generate the encoder’s hidden states. In the decoder, two unidirectional LSTM layers are used to embed a sub-word unit into a latent representation. The soft-attention function proposed in [vaswani2017attention] is used to model the relationship between this latent representation and the encoder’s hidden states, which results in a context vector ctx. All the functions are jointly trained via the sequence cross-entropy loss by placing a softmax distribution on top of the context vector and the latent representation.
As shown in [nguyen2019improving], this S2S model can achieve highly competitive offline performance on the Switchboard speech recognition task. However, the model encounters latency issues when used in online conditions, since both the attention function and the bidirectional encoder network require the entire input sequence to achieve their optimal performance.
3 Streaming S2S ASR
In this section, we describe the modifications that enable the S2S model to perform online streaming speech recognition with low latency and without loss in performance. The modifications include an additional loss to control the uncertainty of the attention function and a search algorithm to infer high-accuracy partial hypotheses.
3.1 Discouraging Look-ahead Attention
The core of S2S models is the mechanism that autoregressively generates a context vector ctx for the prediction of the next token. For the model described in Section 2, ctx is computed as a sum of all the encoder’s hidden states weighted by the attention scores, which are calculated by the attention function. The attention scores calculated for a specific token typically reveal which positions of the encoder (or spectral frames) correspond to the token. The attention function can therefore be considered an alignment model. However, this unsupervised alignment does not pursue what traditional forced alignments (or human alignments) do for the speech recognition task. As illustrated in Figure 1a, during the inference of an utterance, the attention scores produced for many tokens (e.g., #3, 8, 9, 16) are dominated by the start and end frames, which are not the proper alignments. In this case, the inference still produces the correct transcript, so the attention function works as expected. The mismatch between the attention-based alignment and the regular alignment reveals an uncertainty that the attention function may have while being optimized with the sequence training likelihood. Although this uncertainty may not lead to inference errors, the attention function always employs all the encoder’s hidden states, which hinders the model from being used in streaming inference. For streaming inference, it is preferable that, for the prediction of a token, the attention function only considers past frames up to a particular time (the endpoint) and disregards all future frames.
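The weighted-sum computation of ctx described above can be sketched as follows. This is a minimal single-head illustration in plain Python, not the authors' implementation: the scaled dot-product scoring and the names `softmax`, `soft_attention`, `query`, and `enc_states` are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(query, enc_states):
    """Return (attention scores, context vector) for one decoder step.

    Assumption: scores come from a scaled dot product between the decoder
    query and each encoder hidden state (single head).
    """
    dim = len(query)
    raw = [
        sum(q * h for q, h in zip(query, state)) / math.sqrt(dim)
        for state in enc_states
    ]
    weights = softmax(raw)
    # ctx is the sum of all encoder states weighted by the attention scores.
    ctx = [
        sum(w * state[d] for w, state in zip(weights, enc_states))
        for d in range(dim)
    ]
    return weights, ctx

# Toy example: three encoder frames with two-dimensional hidden states.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, ctx = soft_attention(query, enc_states)
```

Because every encoder frame receives a non-zero weight, the context vector depends on the whole input, which is the property that makes unconstrained soft attention awkward for streaming.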
To build such an S2S model for streaming, we investigated the incorporation of an additional loss that discourages the attention function from using future frames during training. Specifically, given a token that belongs to a word in the label sequence, we find a region of frames defined by the end time of that word, as provided by a Viterbi alignment. The attention-based constraint loss is computed as the sum of all attention scores within this region, accumulated over all tokens in the label sequence.
The tunable scale parameter adjusts the influence of the constraint loss relative to the maximum likelihood loss of the label sequence during training. By minimizing the sum of both losses, we expect the attention function to learn to produce close-to-zero scores in the constraint regions for all label tokens while still minimizing the main loss.
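A rough sketch of such a constraint loss is given below. It assumes that the constraint region of a token consists of all frames strictly after the aligned end frame of its word, and that the scale multiplies the summed attention mass before it is added to the main loss; the function and argument names are hypothetical.

```python
def constraint_loss(attention, word_end_frames, scale):
    """Scaled sum of attention scores that fall on future frames.

    attention[i][t] is the attention score of token i on encoder frame t.
    word_end_frames[i] is the aligned last frame of the word containing
    token i (assumed to come from a Viterbi alignment).
    """
    future_mass = 0.0
    for scores, end_frame in zip(attention, word_end_frames):
        # Assumed region: every frame strictly after the word's end frame.
        future_mass += sum(scores[end_frame + 1:])
    return scale * future_mass

def total_loss(cross_entropy, attention, word_end_frames, scale):
    """Joint objective: main sequence loss plus the scaled constraint loss."""
    return cross_entropy + constraint_loss(attention, word_end_frames, scale)

# Two tokens over four frames; the words end at frames 1 and 2.
attention = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.2],
]
loss = constraint_loss(attention, word_end_frames=[1, 2], scale=0.05)
```

In this toy case the attention mass on future frames is 0.1 for the first token and 0.2 for the second, so the constraint term is 0.05 times 0.3.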
3.2 Inference for Partial Stable Hypothesis
Beam search is the most efficient approach for the inference of S2S models. Its idea is to maintain a search network in which paths are extended with the new nodes that have the highest accumulated scores, and then to prune the network so that only a set of active paths (or hypotheses) is kept. Typically, the most probable hypothesis for an utterance is found and guaranteed when the entire search space constructed from the utterance is supplied to the search. However, requiring the complete acoustic signal of the utterance up to its very end before outputting the inference result is not efficient for a streaming setup. A streaming recognizer must be able to produce partial output while processing partial input. In this section, we describe the search algorithm that we apply to the proposed S2S model to produce partial output while retaining high accuracy.
Assume that, in a streaming setup, we use the proposed S2S model at a given time to perform inference on the audio frames received so far. Given a context sequence, the attention function is used to generate attention scores for the prediction of the next token. We find a time such that the sum of all attention scores in the covering window ending at that time is equal to a constant. When this constant is close to one, the window covers all dominant attention scores, and the context vector generated from the window is almost the same as the one generated from all available frames. If this time is observed to be unchanged while the amount of available audio keeps growing, then we consider it the endpoint of the context sequence. During stream processing, we use a term to determine whether the endpoint has finally become fixed.
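The endpoint search can be illustrated with the following sketch. Several details are assumptions made for the example rather than taken from the paper: the window is taken to start at the first frame, the threshold test uses "greater than or equal" instead of exact equality, and the term that decides when an endpoint is fixed is modelled as a hypothetical `patience` count of consecutive unchanged observations.

```python
def covering_endpoint(scores, threshold):
    """Smallest frame index whose cumulative attention mass reaches threshold.

    Returns None if the threshold is never reached.
    """
    running = 0.0
    for frame, score in enumerate(scores):
        running += score
        if running >= threshold:
            return frame
    return None

class EndpointTracker:
    """Declares an endpoint fixed once it stays unchanged for `patience` updates."""

    def __init__(self, patience):
        self.patience = patience
        self.last = None
        self.unchanged = 0

    def update(self, endpoint):
        if endpoint is not None and endpoint == self.last:
            self.unchanged += 1
        else:
            self.last = endpoint
            self.unchanged = 0
        return endpoint is not None and self.unchanged >= self.patience

# Attention scores of one prefix over four frames (toy values).
endpoint = covering_endpoint([0.125, 0.5, 0.25, 0.125], threshold=0.875)

# The same endpoint is observed again as more audio arrives.
tracker = EndpointTracker(patience=2)
decisions = [tracker.update(endpoint) for _ in range(3)]
```

A larger `patience` waits longer before committing, which trades latency for stability.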
We then incorporate the information about endpoints into the beam search to find a partial stable hypothesis. Assume that our beam search can always run in real time on the available audio frames to produce N considered hypotheses. If all N hypotheses share the same prefix sequence C and the endpoint of C is determined, then we consider C to be an immortal part that will not change anymore in the future. When more audio frames become available in the stream, C is used as the prefix for all search hypotheses, and we repeat this step to find a longer stable hypothesis. Except for the condition on endpoints, the idea of finding an immortal prefix is similar to the partial trace-back [brown1982partial, selfridge2011stability] used in HMM-based speech recognizers.
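The immortal prefix condition can be sketched as below. The callback `endpoint_fixed` is a hypothetical stand-in for the endpoint test, and backing off to a shorter shared prefix when the endpoint of the longest one is not yet fixed is a design choice of this sketch, not something stated in the text.

```python
def common_prefix(hypotheses):
    """Longest token prefix shared by all hypotheses in the beam."""
    prefix = []
    for tokens in zip(*hypotheses):
        if all(token == tokens[0] for token in tokens):
            prefix.append(tokens[0])
        else:
            break
    return prefix

def immortal_prefix(hypotheses, endpoint_fixed):
    """Shared prefix whose endpoint is already determined.

    endpoint_fixed(prefix) -> bool is assumed to report whether the
    attention endpoint of the given prefix has become fixed.
    """
    prefix = common_prefix(hypotheses)
    # Back off token by token until the endpoint condition holds.
    while prefix and not endpoint_fixed(prefix):
        prefix = prefix[:-1]
    return prefix

beam = [
    ["so", "i", "think"],
    ["so", "i", "thing"],
    ["so", "i", "think", "that"],
]
# Toy endpoint test: only prefixes of length one are fixed so far.
stable = immortal_prefix(beam, endpoint_fixed=lambda prefix: len(prefix) <= 1)
```

In a streaming loop, the returned tokens would be emitted and then forced as the prefix of every hypothesis in the next search step.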
In addition to the immortal prefix, we also investigated a more straightforward method in which we only consider the best-ranked hypothesis and decide on a stable part based solely on the term used for fixing endpoints. This approach is inspired by the incremental speech recognition proposed in [wachsmuth1998integration].
3.3 Bidirectional Encoder
To achieve high performance, bidirectional LSTMs have been the optimal choice for the encoder of LSTM-based S2S models. However, due to the backward LSTM, bidirectional LSTMs are not suited to providing the partial, low-latency output needed by streaming recognizers. Any additional acoustic input affects all of the encoder’s hidden states, which makes all partial inference results unstable. As a result, stable output can be confidently inferred only once the input is complete. Therefore, earlier works [raffel2017online, he2019streaming, narayanan2019recognizing] switched to unidirectional LSTMs in their online models.
In this work, we try to utilize bidirectional LSTMs for high-performance speech recognition in a streaming scenario. In the first setting, we investigate the use of the S2S model with a fully bidirectional encoder. First, we train the S2S model with the optimal settings found for an offline setup and with the attention-based constraint loss proposed in Section 3.1. Then, during inference, we update the encoder’s hidden states from all available acoustic input before applying the search approaches in Section 3.2 to find stable hypotheses. As will be shown later, using a bidirectional LSTM in this way is possible because the proposed inference methods rely on the determination of endpoints, and updating the encoder’s hidden states helps to stabilize this determination.
In addition to fully bidirectional LSTMs, we also experimented with a chunk-based BLSTM approach. During training, we divide input sequences into non-overlapping chunks of a fixed size and then use a BLSTM to process the chunks sequentially. To benefit from long-context learning, we initialize the forward LSTM with its last hidden states after processing the previous chunk. The backward LSTM can be initialized either with a constant or from the previous chunk. In this way, the encoder’s hidden states can be computed incrementally and efficiently, as for unidirectional LSTMs. This chunk-based approach differs from [audhkhasi2019forget] and the latency-controlled BLSTM [fan2018online, xue2017improving], which adopt constant initialization of both directions.
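The state handling of the chunk-based encoder can be sketched with a toy recurrence. The scalar `step` function below merely stands in for an LSTM cell, and all names are hypothetical; the point is only how the forward and backward states are carried from one chunk to the next.

```python
def run_recurrence(frames, state, step):
    """Apply a recurrent step over frames; return all states and the last one."""
    outputs = []
    for frame in frames:
        state = step(state, frame)
        outputs.append(state)
    return outputs, state

def chunked_bidirectional(frames, chunk_size, step, init=0.0):
    """Encode frames chunk by chunk with carried-over recurrent states.

    The forward state continues from the previous chunk. The backward pass
    of each chunk starts from the final backward state of the previous
    chunk (the variant initialized "from the previous chunk").
    """
    forward_state = init
    backward_state = init
    encoded = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        fwd, forward_state = run_recurrence(chunk, forward_state, step)
        bwd_reversed, backward_state = run_recurrence(
            chunk[::-1], backward_state, step
        )
        # Pair forward and backward states frame by frame.
        encoded.extend(zip(fwd, bwd_reversed[::-1]))
    return encoded

def toy_step(state, frame):
    """Toy cell: decaying running sum, used instead of a real LSTM."""
    return 0.5 * state + frame

frames = [1.0, 2.0, 3.0, 4.0]
encoded = chunked_bidirectional(frames, chunk_size=2, step=toy_step)
```

With this scheme the forward states are identical to those of an unchunked unidirectional pass, and only the backward states are limited to the current chunk plus whatever the carried-over state summarizes.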
4.1 Experimental Setup
Our experiments were conducted on the Fisher+Switchboard corpus consisting of 2,000 hours of telephone conversation speech. The Hub5’00 evaluation data was used as the test set. All the experimental models use the same input features of 40-dimensional log-mel filterbanks to predict 4,000 BPE sub-word units generated with the SentencePiece toolkit from all the training transcripts. The models with a bidirectional encoder employ six layers of 1024 units, while the unidirectional encoders use 1536 units. We used only a single head for the attention function in all setups. All models were trained with a dropout of 0.3. We further used the combination of two data augmentation methods, Dynamic Time Stretching and SpecAugment, proposed in [nguyen2019improving] to reduce model overfitting. We use Adam [kingma2014adam] with an adaptive learning rate schedule to perform 12,000 updates during training. The model parameters of the 5 best epochs according to the perplexity on the cross-validation set are averaged to produce the final model.
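The final averaging step can be illustrated with a small sketch. Parameters are represented as plain dictionaries of float lists, and the helper names are hypothetical; an actual setup would operate on the tensors of the training framework in use.

```python
def select_best(epochs, k):
    """Return the parameters of the k epochs with the lowest perplexity.

    epochs is a list of (validation_perplexity, parameters) pairs.
    """
    ranked = sorted(epochs, key=lambda item: item[0])
    return [params for _, params in ranked[:k]]

def average_parameters(checkpoints):
    """Element-wise mean of identically shaped parameter dictionaries."""
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(checkpoint[name] for checkpoint in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged

# Toy run with three epochs, keeping the two best (the paper keeps five).
epochs = [
    (12.0, {"w": [1.0, 2.0]}),
    (10.0, {"w": [3.0, 4.0]}),
    (11.0, {"w": [5.0, 6.0]}),
]
final_model = average_parameters(select_best(epochs, k=2))
```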
For beam search, we use neither length normalization nor a language model. With a beam size of 8, the experimental models typically achieve their optimal accuracy.
4.2 Latency Measure
Neither the commonly used real-time factor (RTF) nor commitment latency is sufficient to measure the user-perceived latency of a streaming recognizer. For example, the transcript output for an 11-second sentence can appear 10 seconds later than that for a 1-second sentence, even though the RTFs measured in the two cases can be similar. In this study, we propose a different method for measuring streaming latency. Assume that a recognizer processes a sentence S of T seconds in a streaming fashion and outputs N tokens $c_1, c_2, \ldots, c_N$ at timestamps $t_1, t_2, \ldots, t_N$. Assume further that the inference time is always a small constant, so that timestamp $t_i$ is simply the time at which the recognizer is confident enough to produce $c_i$. The latency of recognizing S with regard to the transcript $c_1, c_2, \ldots, c_N$ is calculated as the average of all token latencies normalized by the duration of S: $\frac{1}{N}\sum_{i=1}^{N} \frac{t_i}{T}$. With this measure, the latency of an offline system is always 1, as the offline system becomes confident about the entire transcript only at the end of the sentence. In the same way, we can simulate the latency of an instant recognizer by using a forced alignment to find $t_i$ for $c_i$.
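The contrast between RTF and the proposed measure can be made concrete with a short sketch. The processing times and timestamps below are made-up toy values, and the function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def streaming_latency(timestamps, duration):
    """Average of the token output times, each normalized by the duration."""
    return sum(t / duration for t in timestamps) / len(timestamps)

# Two offline runs with the same RTF: a 1-second and an 11-second sentence.
rtf_short = real_time_factor(0.5, 1.0)
rtf_long = real_time_factor(5.5, 11.0)

# An offline system emits every token at the end of the sentence.
offline = streaming_latency([4.0, 4.0, 4.0], duration=4.0)

# A streaming system emits tokens while the sentence is still being spoken.
streaming = streaming_latency([1.0, 2.0, 3.0], duration=4.0)
```

Both runs have the same RTF even though the user waits much longer for the long sentence, whereas the proposed measure separates the offline system (latency 1) from the streaming one (latency 0.5 in this toy case).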
5.1 Effect of the Constraint Loss
In this section, we evaluate the influence of the constraint loss proposed in Section 3.1 on the training of the S2S model. We started with a high value for the loss scale and exponentially decreased it to train several systems for comparison. As observed during training, the constraint loss quickly decreases to a stable value that depends on the scale. Joint training slows down the convergence of the main loss but does not have a significant impact on the final performance. As shown in Table 1, WERs are slightly worse with a high scale and can be similar to those of regular training when the scale is small (e.g., 0.05). In contrast, the constraint loss may substantially change the behavior of the attention function. For example, in Figure 1b, the attention function moves the high scores of the mismatched alignment to the start frames, instead of the start and end frames as in regular training. We also found an extreme case at one scale value, in which the attention-based alignment does not correspond at all to the proper alignment, as illustrated in Figure 1c.
Using the model trained with , we follow the approach in Section 3.2 to extract the endpoints for all prefixes found during the inference of the evaluation set. We could verify that the extracted endpoints in all sentences match the expectation for streaming inference described in Section 3.1. So we keep this model for further experiments.
5.2 Latency on Various Conditions
Using the S2S model with a bidirectional encoder trained with the constraint loss scale , we performed several experiments with the inference approaches described in Section 3.2. In the experiments, the streaming scenario is simulated by repeatedly feeding an additional audio chunk of 250 ms to the experimental systems for incremental inferences. All the inferences were performed on a single Nvidia Titan RTX GPU, which produced an average RTF of 0.065 with a beam size of 8. The RTF result shows that real-time capacity is not a bottleneck problem in this setup. So we focus on the latency measure proposed in Section 4.2.
For baselines, we computed the offline WER performance with beam sizes of 8, 4, and 2, and then used a forced-alignment system to produce the ideal latency from the offline transcripts. The ideal latency is always 0.6. If we shift the time alignment of the transcripts by 250 ms (i.e., all the outputs have a delay of 250 ms), 500 ms, 1 second, and 1.5 seconds, we obtain latencies of 0.71, 0.78, 0.86, and 0.91, respectively.
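The effect of shifting an alignment by a fixed delay can be sketched as follows. The timestamps are toy values, and capping a delayed timestamp at the sentence duration is an assumption of this sketch (no token can be emitted later than the end of the sentence plus the constant inference time); it is not stated in the text.

```python
def streaming_latency(timestamps, duration):
    """Average of the token output times, each normalized by the duration."""
    return sum(t / duration for t in timestamps) / len(timestamps)

def delay_alignment(timestamps, delay, duration):
    """Shift every aligned timestamp by a fixed delay.

    Assumption: delayed timestamps are capped at the sentence duration.
    """
    return [min(t + delay, duration) for t in timestamps]

aligned = [1.0, 2.0, 3.0]  # toy forced-alignment end times in seconds
duration = 4.0

ideal = streaming_latency(aligned, duration)
shifted = streaming_latency(delay_alignment(aligned, 0.5, duration), duration)
capped = streaming_latency(delay_alignment(aligned, 1.5, duration), duration)
```

Under this capping assumption, each additional unit of delay raises the latency less and less, since more tokens hit the end-of-sentence limit.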
Table 2 presents the accuracy and the latency we achieved when using the immortal prefix and 1st-ranked prefix inference methods with several settings of . Overall, the two methods are consistent with the observations in the HMM-based systems [brown1982partial, selfridge2011stability, wachsmuth1998integration]. Using the immortal prefix condition, the final accuracy can be guaranteed as for the offline inference for large beam sizes, e.g., 8 and 4. For a smaller beam size, this condition is not strong enough to deal with unstable partial results – probably due to the changes of the encoder’s hidden states. In the 1st-ranked prefix approach, increasing allows for a flexible trade-off between the accuracy and the latency. The offline accuracy can also be achieved if a very large is applied. These results consolidate our findings in two aspects. First, the integration of is reliable and crucial for the streaming inferences to work efficiently. And second, the use of the bidirectional LSTM for the encoder is possible and results in high accuracy.
To achieve 8.9% WER (the offline accuracy), the system needs to delay outputs with an average duration of about 1.5 seconds. To obtain a lower latency of 1 second, the WER increases to 9.2%, e.g., by using the immortal prefix method with and . The combination of both methods is efficient if we want to reach a latency of 0.81, which is closer to the average delay of 0.5 seconds.
5.3 Performance of Different Encoders
The shortcoming of the bidirectional encoder lies in the recomputation of all of the encoder’s hidden states for every addition of input signal in the stream. In this section, we investigate two additional network architectures, the unidirectional LSTM and the chunk-based BLSTM described in Section 3.3, that improve the computational efficiency of the encoder. For the chunk-based encoder, we experimented with chunk sizes of 800 ms and 2 seconds. We consistently found that initializing the backward LSTM from the last hidden state of the previous chunk is better than using a constant, so we only present the results of this approach. We evaluated the two types of encoders in two categories: the best accuracy, and the accuracy that the systems retain when maintaining an average delay of 1 second. To do so, we use the same immortal prefix inference and experiment with different settings of the beam size and the endpoint term.
As shown in Table 3, there is a big gap between the best WER of the unidirectional and bidirectional encoders (12.6% vs. 8.9%). The chunk-based encoder closes the gap and moves closer to the performance of the bidirectional encoder when a large chunk size is used. As the encoder’s states are fixed early, the inferences are already stable when for all beam sizes. To achieve a 1-second delay, all the approaches need to trade off a relative accuracy reduction of 5%. In terms of latency, the chunk-based approach with and and is the best setting in this setup.
6 Difference to Related work
[raffel2017online, chiu2017monotonic] pointed out that the soft-attention mechanism requires all of the encoder’s hidden states and proposed a trainable monotonic attention function to train sequence-to-sequence models for online applications. Given a prefix, the monotonic attention function allows finding an encoder position [raffel2017online] or the endpoint of a chunk [chiu2017monotonic] used for the prediction of the next token. In our study, we showed that endpoints can also be estimated precisely and efficiently via the regular soft-attention function by controlling its uncertainty. We further showed that there are more issues to be addressed for high-performance online speech recognition, such as finalizing partial results of the beam search and the use of a bidirectional encoder, and proposed effective methods for addressing these issues.
In this paper, we have proposed and evaluated several techniques for applying attention-based encoder-decoder models to run-on speech recognition that produces results with a low word-based latency. To overcome the general problem that, for this type of model, the complete audio data needs to be available at the start of processing, we have introduced an additional loss to control the attention mechanism, a modified beam-search algorithm that produces stable hypotheses with low latency, and techniques for using BLSTMs in the encoder without introducing high latency. Our results show that, with these techniques, it is possible to produce low-latency online recognition results on the Switchboard+Fisher task without a significant decrease in performance.