A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition

05/20/2020 ∙ by Linhao Dong, et al. ∙ Ping An Bank

End-to-end models are gaining wider attention in the field of automatic speech recognition (ASR). One of their advantages is the simplicity of building a system that directly transforms the speech frame sequence into the text label sequence with neural networks. According to the driving end in the recognition process, end-to-end ASR models can be categorized into two types: label-synchronous and frame-synchronous, each of which has its own model behaviour and characteristics. In this work, we make a detailed comparison of a representative label-synchronous model (the transformer) and a soft frame-synchronous model (the continuous integrate-and-fire (CIF) based model). The results on three public datasets and a large-scale dataset with 12000 hours of training data show that the two types of models have respective advantages that are consistent with their synchronous modes.


1 Introduction

End-to-end models are profoundly affecting the development of automatic speech recognition (ASR), and have demonstrated their advantages on a wide range of ASR tasks. Since end-to-end ASR models directly transform the speech sequence (usually in the form of feature frames) into the text label sequence by neural networks, there must be one end (the text end or the speech end) to drive the recognition process of the entire model. According to the driving end, current end-to-end models could be categorized into two types: label-synchronous models and frame-synchronous models.

Label-synchronous models refer to the models that are driven by the text end: they process one label at each step and stop after the end-of-sentence label is recognized. The representative models of this type were originally proposed for neural machine translation (NMT) [1, 2] and rely on the attention mechanism to extract relevant speech information from the encoded frames; examples include the attention-based models [3, 4, 5, 6, 7, 8] and the transformer [9, 10, 11]. Compared with frame-synchronous models, they have two main advantages: 1) since they usually attend to a large range of encoded frames, they can utilize the speech information more comprehensively; 2) since they are label-synchronous, they can jointly extract the relevant information for each label from multiple channels [12] or sources [13, 14]. Recently, the transformer has shown superior performance compared with the RNN attention-based model [10], so we focus on the transformer in this work.

Figure 1: Schematic diagram of the label-synchronous end-to-end model and the soft frame-synchronous end-to-end model.

Frame-synchronous models refer to the models that are driven by the speech end: they process one frame at each step and stop after the last frame is processed. Compared with label-synchronous models, they have two main advantages: 1) since they process in a frame-by-frame manner, they naturally support online speech recognition; 2) since they are aware of the frame position, they can provide time stamps for the recognition result. The representative models of this type fall into two categories: 1) models that use a hard [15] alignment, including CTC [16] and CTC-like [17, 18] models; 2) models that use a soft [15] alignment, which first locate the acoustic boundary through frame-by-frame detection and then integrate the acoustic information from the located speech piece in the form of a weighted sum (e.g. [19, 20, 21]). Compared with the hard ones, the soft frame-synchronous models exclude the influence of the blank label, thus having less decoding computation and a simpler search process. In this work, we focus on the soft frame-synchronous models and experiment on the continuous integrate-and-fire (CIF) based model [21], which is calculated in a concise manner and uses the same self-attention network (SAN) as the transformer.

The concepts of label-synchronous and frame-synchronous first appeared in [22], where the authors compare the attention-based model [4] with the CTC [16] and RNN-T [17] models in detail. Our work differs in two ways: 1) we think the concepts of label-synchronous and frame-synchronous should not only describe the decoding manner [22], but can also be used to characterize end-to-end models with different model behaviours and characteristics. Distinguishing the two types of models not only helps clarify their respective advantages and disadvantages, but also provides guidance for applying the suitable model in specific ASR scenarios; 2) we focus on comparing a label-synchronous model (the transformer) with a soft frame-synchronous model (the CIF-based model), rather than the hard frame-synchronous models in [22]. Specifically, we make a detailed comparison of the two models on multiple datasets, including three public datasets and a large-scale dataset with 12000 hours of training data. We find the two models show their respective advantages in accuracy, speed and generalization. The details are introduced in the following sections.

2 Models

In this section, we describe the label-synchronous model (the transformer) and the frame-synchronous model (the continuous integrate-and-fire (CIF) based model) compared in this work. Both of them follow the encoder-decoder framework: the encoder transforms the speech feature sequence into the encoded acoustic representations H = (h_1, ..., h_T), and the decoder receives the previously predicted label y_{u-1} and the relevant acoustic information to predict the current label y_u. The encoder and the decoder are connected by the alignment mechanism, which determines how the relevant acoustic information used for decoding is extracted. Next, we introduce the two models with reference to the diagram of their alignment mechanisms (shown in figure 2).

2.1 Transformer

The transformer model used in this work has a similar structure to [9, 10], but with two differences: 1) it abandons the sinusoidal positional encoding in [9, 10] and instead uses the proximity bias in the self-attention network (SAN) [23] to provide relative positional information (since this performs better and more stably in our experiments); 2) it uses the encoder structure of [23], which consists of a convolutional front-end and a pyramid of SAN blocks.

The running of the transformer is driven by the text end, or the decoder end. At each decoder step u, the previous label y_{u-1} is input to a stack of identical decoder blocks, each of which is formed by inserting a multi-head encoder-decoder attention between the two sub-networks (multi-head self-attention and position-wise feed-forward network) of the SAN. Each multi-head encoder-decoder attention receives the encoded representations H as keys and values, receives the output of the multi-head self-attention (in the same decoder block) as queries, and then conducts multi-head attention to extract the relevant acoustic information. As shown in figure 2 (a), the transformer runs label by label, and stops after the end-of-sentence label is predicted. Assuming the length of the predicted label sequence is U and the length of H is T, the computational complexity of the alignment mechanism in the transformer is O(U·T).
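To make the label-synchronous alignment concrete, the following is a minimal sketch (our own illustration, not the authors' implementation; single-head, NumPy only) of the encoder-decoder attention whose score matrix has one row per predicted label and one column per encoded frame, which is the source of the O(U·T) cost stated above.

```python
# Minimal sketch of encoder-decoder attention in a label-synchronous decoder.
# Shapes are illustrative: H holds the encoded speech (T x d), S holds the
# decoder self-attention outputs for the U labels predicted so far (U x d).
import numpy as np

def encoder_decoder_attention(S, H):
    """Single-head dot-product attention; queries come from the decoder,
    keys/values from the encoder, so the score matrix is U x T."""
    d = H.shape[-1]
    scores = S @ H.T / np.sqrt(d)                      # (U, T): one row per label step
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over encoder steps
    return weights @ H                                 # (U, d): acoustic info per label

T, U, d = 200, 20, 256
H = np.random.randn(T, d)    # encoder outputs
S = np.random.randn(U, d)    # decoder queries
context = encoder_decoder_attention(S, H)              # O(U*T) score computation
```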

2.2 CIF-based model

The CIF-based model used in this work follows the model structure in [21]. Its encoder first generates the encoded representation h_t for each encoder step t. Then, it predicts a weight w_t for each h_t by passing a small window centred at h_t to a 1-dimensional convolutional layer and then a fully connected layer with one output unit and a sigmoid activation. Next, it passes the weight w_t and h_t to CIF, which forwardly accumulates the weights and integrates the acoustic representations (in the form of a weighted sum) until the accumulated weight reaches a threshold (1.0), which means an acoustic boundary is located. At this point, CIF divides the current weight into two parts: one part is used to complete the integration of the current label by building a complete distribution (whose weights sum to 1.0) over the relevant encoder steps, and the other part is used for the integration of the next label. After that, it fires the integrated acoustic embedding c_u (which serves as the context vector) to the decoder to predict the corresponding label y_u.

The running of the CIF-based model is driven by the speech end, or the encoder end. At each encoder step t, the representation h_t and the weight w_t of the encoded frame are input to the CIF module, which determines whether enough acoustic information has been integrated to trigger a calculation of the decoder. As shown in figure 2 (b), the CIF-based model runs frame by frame, and stops after the last frame is processed. Assuming the length of H is T, the computational complexity of the CIF alignment mechanism is O(T).
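The accumulate-and-fire behaviour described above can be sketched in a few lines. The following is our own simplified illustration of the forward CIF computation in [21]; details such as the training-time handling of the weights and the tail of the utterance are omitted.

```python
# Simplified sketch of continuous integrate-and-fire (CIF): weights w_t are
# accumulated frame by frame, representations h_t are integrated as a weighted
# sum, and an acoustic embedding is "fired" when the accumulator crosses 1.0.
import numpy as np

def cif(H, W, threshold=1.0):
    """H: (T, d) encoder outputs, W: (T,) predicted weights in [0, 1]."""
    fired = []                       # integrated embeddings, one per label
    acc_w = 0.0                      # accumulated weight
    acc_h = np.zeros(H.shape[-1])    # accumulated (weighted) representation
    for h_t, w_t in zip(H, W):       # frame-synchronous: one step per frame
        if acc_w + w_t < threshold:
            acc_w += w_t
            acc_h += w_t * h_t
        else:
            w_used = threshold - acc_w        # part that completes this label
            fired.append(acc_h + w_used * h_t)
            acc_w = w_t - w_used              # remainder starts the next label
            acc_h = acc_w * h_t
    # (handling of any leftover accumulation at the end is omitted for brevity)
    return np.stack(fired) if fired else np.zeros((0, H.shape[-1]))

T, d = 200, 256
H = np.random.randn(T, d)
W = np.random.rand(T) * 0.2
C = cif(H, W)    # (U, d): one integrated embedding per located boundary
```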

2.3 Model details

The two models use the same encoder structure [21], which reduces the frame rate to 1/8. The CIF-based model uses an autoregressive decoder that is based on the SAN. The decoders of both models cache the computed SAN states of all heads for efficient inference. In training, both models use multi-task learning: the CIF-based model is trained with the cross-entropy loss, the CTC loss on the encoder (with coefficient λ_ctc) and the quantity loss on the CIF (with coefficient λ_qua) [21], and the transformer is trained with the cross-entropy loss and the CTC loss on the encoder [10]. In inference, we first perform beam search to get the decoded hypotheses, then use the score of a SAN-based language model (LM), weighted by a coefficient, to rescore the hypotheses predicted by the two models.
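As an illustration of how these objectives combine, the sketch below uses our own names for the coefficients (lambda_ctc, lambda_qua); the quantity loss follows the description in [21] of pushing the summed CIF weights toward the number of target labels.

```python
# Sketch of the multi-task objectives described above; coefficient names and
# default values mirror the settings reported later in Section 3.
import numpy as np

def quantity_loss(W, num_labels):
    """Encourage the summed CIF weights to match the target label count."""
    return abs(float(np.sum(W)) - num_labels)

def cif_total_loss(ce, ctc, qua, lambda_ctc=0.5, lambda_qua=1.0):
    # Cross-entropy on the decoder, CTC on the encoder, quantity loss on CIF.
    return ce + lambda_ctc * ctc + lambda_qua * qua

def transformer_total_loss(ce, ctc, lambda_ctc=0.5):
    # Cross-entropy on the decoder plus CTC on the encoder.
    return ce + lambda_ctc * ctc
```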

Figure 2: Schematic diagram of the alignment mechanisms used in the transformer and the CIF-based model when processing an utterance with 5 encoded frames labelled as "CAT". The shade of gray in each square represents the weight of each encoder step involved in the calculation of the decoded labels. The arrows indicate the synchronous mode of the two models.

3 Experimental Setup

Model            | Librispeech             | AISHELL-2                            | HKUST | CN12000h
                 | test_clean   test_other | test_ios   test_android   test_mic  |       | test_set 1   test_set 2   test_set 3
Transformer      | 2.78         7.57       | 6.12       5.74           6.31      | 21.88 | 5.52         4.61         22.27
CIF-based model  | 2.86         8.08       | 6.09       5.68           6.20      | 22.80 | 5.54         5.01         23.01
Table 1: Results of accuracy performance for the models compared in this work. The evaluation metric is word error rate (WER) for Librispeech and character error rate (CER) for the three Mandarin datasets; the same applies to the later tables 3 and 4.

We conduct the comparison on three public ASR datasets and a large Mandarin ASR dataset (CN12000h) that contains about 12000 hours of training data. The three public ASR datasets are the English read-speech dataset Librispeech [24] with 960 hours of training data, the Mandarin read-speech dataset AISHELL-2 [25] with 1000 hours of training data, and the Mandarin telephone ASR benchmark HKUST [26] with 168 hours of training data. Other details about the usage of the three datasets are introduced in [21]. For the CN12000h dataset, the training set covers data from different application scenarios, so we use three representative test sets to evaluate the performance of the models and use one of them (test_set 2) as the development set.

For the three public ASR datasets, the generation of input features and output labels is the same as in [21]. The frequency masking and time masking of SpecAugment [27] are applied to the models on the three datasets. For the CN12000h dataset, the input features are 29-dimensional filterbanks extracted from a 25 ms window shifted every 10 ms, then extended with delta and delta-delta features and globally normalized. The output labels cover 4733 classes, including 4701 Chinese characters, 26 uppercase English letters, 3 special markers (noise, etc.), the blank, the end-of-sentence label and the pad. The above settings are kept the same for the two models compared in this work.

For the CIF-based model, the encoder uses the 2-layer convolutional front-end and the pyramid of self-attention network (SAN) in [23]; the head number, the hidden size and the inner size of the SAN use one setting for all Mandarin datasets and another for Librispeech, and the per-stage depth of the pyramid structure is set to 5 for all datasets, so the number of SAN encoder layers is 15. The 1-dimensional convolutional layer that predicts the weights uses a convolutional width of 3. Besides, layer normalization [28] and a ReLU activation are applied after this convolution. The decoder uses the autoregressive decoder in [21] and the number of SAN decoder layers is 2. The coefficient λ_ctc on the CTC loss is set to 0.5 for all Mandarin datasets and 0.25 for Librispeech, and the coefficient λ_qua on the quantity loss is set to 1.0.

For the transformer, it uses the same head number, hidden size and inner size as the SAN in the CIF-based model. The encoder uses the same structure as the CIF-based model except that the per-stage depth of the pyramid structure is set to 4 for all datasets, so the number of SAN encoder layers is 12. The decoder uses 6 decoder blocks. The setting of the layer numbers in the encoder and the decoder refers to the best setting in [9, 10], and it also makes the two models have a similar number of parameters. The coefficient on the CTC loss is kept the same as in the CIF-based model.

The language model (LM) is a SAN-based model; the number of SAN layers is set to 3, 6, 20 and 6 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively. The above models are implemented in TensorFlow [29].

We use almost the same training and inference strategy for the two models. In training, we batch the data by approximate frame length and fill about 20000 frames per batch for the three public datasets, and 37500 frames (the maximum for a P6000 GPU) per batch for CN12000h. We use 4, 8, 8 and 8 GPUs to train the models on HKUST, AISHELL-2, Librispeech and CN12000h, respectively. We use the optimizer and the varied learning rate formula in [9], where the warmup step is set to 25000 for HKUST and 36000 for the other datasets, and the global coefficient is set to 4.0. We only apply dropout to the SAN, whose attention dropout and residual dropout are all set to 0.2, except for the models on CN12000h where they are set to 0.1. We use the uniform label smoothing in [30] and set it to 0.2 for both the ASR models and the language models. Parallel scheduled sampling (PSS) [31] with a constant sampling rate of 0.5 is applied to the models on the three Mandarin datasets. After training, we average the latest 10 checkpoints for inference. In inference, we use beam search with size 10 for all models. For the CIF-based model, the coefficient for LM rescoring is set to 0.1, 0.2, 0.9 and 0.2 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively. For the transformer, this hyper-parameter is set to 0.6, 0.9, 2.0 and 0.6 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively.
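For reference, the following is a sketch of the varied learning-rate schedule of [9] with the warmup steps and global coefficient mentioned above; the hidden size used here is a placeholder, since its exact value is not restated in this section.

```python
# Sketch of the "Noam" learning-rate schedule from [9], as an illustration of
# the warmup (25000/36000 steps) and global coefficient (4.0) reported above.
def noam_lr(step, d_model=256, warmup=25000, coeff=4.0):
    # d_model=256 is a placeholder value, not taken from this paper.
    step = max(step, 1)
    return coeff * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example: learning rate at a few training steps for the HKUST setting.
for s in (1000, 25000, 100000):
    print(s, noam_lr(s, warmup=25000))
```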

4 Results

In this section, we present the results of the two compared models in terms of accuracy, speed and generalization. The two models are evaluated in the offline mode, and use almost the same training strategy, similar model structures and a similar number of parameters to make the comparison as fair as possible; the details are introduced in section 3.

4.1 Comparison on accuracy performance

As shown in table 1, the transformer performs better on 3 of the 4 datasets (Librispeech, HKUST and CN12000h), while the CIF-based model only shows better performance on AISHELL-2, a Mandarin read-speech dataset with clear acoustic boundaries between labels. To the best of our knowledge, the result of the CIF-based model on AISHELL-2 and the result of the transformer on HKUST are the best reported accuracy on the respective datasets, so the comparison is conducted on two very strong models with different synchronous modes.

From the perspective of the synchronous mode, the accuracy difference between the two models can be attributed to two aspects: 1) since the transformer is a label-synchronous model that extracts the acoustic information of each label from a global view, it can make comprehensive use of the encoded acoustic information and capture more acoustic details for decoding. In contrast, the CIF-based model operates in a frame-by-frame manner and does not cache the processed frames as in [32], so it integrates acoustic information from a limited local view, which may affect its modeling expressiveness to some extent; 2) since the CIF-based model is a soft frame-synchronous model that needs to locate the acoustic boundaries during processing, it performs slightly worse on datasets with blurred acoustic boundaries between labels. In contrast, the transformer is less affected by the clearness of acoustic boundaries. In addition, the multi-layer encoder-decoder attention of the transformer (from its decoder blocks) also benefits its extraction of acoustic information and promotes better accuracy.

4.2 Comparison on speed performance

The accuracy advantage of the transformer is partly due to its global encoder-decoder attention. However, attending to every encoder step inevitably brings a large amount of unnecessary computation on steps that are acoustically irrelevant to the decoded label, which may affect its calculation speed. Thus we also compare the speed performance, including the training speed and the real time factor (RTF) (= inference time / audio time).

Dataset       Model             Training speed (steps/sec)   Real time factor (RTF)
Librispeech   Transformer       0.407                        0.0815
              CIF-based model   0.308                        0.0120
AISHELL-2     Transformer       0.847                        0.0209
              CIF-based model   0.630                        0.0048
HKUST         Transformer       0.987                        0.0226
              CIF-based model   0.873                        0.0049

Table 2: Results of speed performance for the models compared in this work. The results are obtained on a Quadro P6000 GPU.

As shown in table 2, the transformer has a faster training speed than the CIF-based model. Since both of them use a SAN-based encoder and decoder, which are computed in a highly parallel manner, the gap in training speed can be attributed to the frame-by-frame incremental calculation of the CIF. Even so, since the incremental calculation of the CIF is lightweight, the training efficiency of the CIF-based model is still considerable, at roughly 74-88% of that of the transformer.

Besides, we find the CIF-based model obtains about 15-23% of the RTF of the transformer, which means it achieves about 4.4-6.8 times the inference speed of the transformer. Since the two models are compared in the offline mode, both of them perform a one-time encoder calculation and multiple decoder calculations; the heavy decoder of the transformer, which takes O(U·T) cost to extract the acoustic information, inevitably brings a large amount of computation. In contrast, the CIF-based model, which takes O(T) cost to integrate the acoustic information, has a much lighter inference calculation, which is one of the advantages of this type of frame-synchronous model.

4.3 Comparison on generalization

In addition to accuracy and speed, we also want to know whether the synchronous mode affects the generalization of the models in some special cases.

Dataset                    Model             20s-30s   30s-40s   40s-50s   50s-60s   60s-70s
Librispeech (test-clean)   Transformer       3.64      4.12      3.68      6.40      11.51
                           CIF-based model   4.14      3.72      2.68      3.38      3.61
AISHELL-2 (test-ios)       Transformer       16.47     29.97     45.25     58.35     60.00
                           CIF-based model   8.02      6.57      7.99      9.73      9.13

Table 3: Results on the long utterances with multiple durations. The 's' after the figures represents seconds.

We first compare the two models on long utterances, which are generated by randomly concatenating multiple utterances of the same speaker in the test set; each duration bucket covers at least 50 utterances. For a fair comparison, we do not use the long-utterance strategies designed for specific models in [33, 34].
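A minimal sketch of this construction is given below, assuming preloaded same-speaker waveforms; the function and variable names are our own illustration rather than the script used in this work.

```python
# Sketch of building long test utterances by randomly concatenating waveforms
# of the same speaker until a target duration is reached.
import random
import numpy as np

def make_long_utterance(speaker_waves, target_seconds, sample_rate=16000):
    """speaker_waves: list of 1-D numpy arrays from one speaker."""
    pieces, total = [], 0
    while total < target_seconds * sample_rate:
        wav = random.choice(speaker_waves)
        pieces.append(wav)
        total += len(wav)
    return np.concatenate(pieces)

# e.g. fill the 40s-50s bucket for one speaker (waveforms assumed preloaded)
# long_wav = make_long_utterance(waves_of_speaker, target_seconds=40)
```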

As shown in table 3, the CIF-based model performs more stably on the long utterances, and its deletion and insertion errors stay at a comparable level for all durations. In contrast, the deletion errors of the transformer increase rapidly as the utterance length grows and become the main reason for its performance degradation on long utterances. We also notice that the performance gap between the two models at the same durations differs between the two datasets. We suspect this is because the average length of training utterances in AISHELL-2 (3.55 seconds) is shorter than that in Librispeech (12.28 seconds).

Then we compare the two models on repeated utterances and noisy utterances. The repeated utterances are generated by randomly selecting 1/10 of the utterances of each test set and concatenating the same utterance repeatedly; the number of repetitions is 1, 2, 3 and 4 (the columns 1-4 in table 4). The noisy utterances are generated by mixing every utterance in the test set with one of the 115 noise types in [35], with the signal-to-noise ratio (SNR) ranging from 0 to 20.
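As an illustration of the noisy-set construction, the following sketch (our own, showing one common way of mixing at a chosen SNR; the exact mixing procedure of this work is not detailed here) scales the noise so that the resulting signal-to-noise ratio matches the requested value.

```python
# Sketch of mixing a clean utterance with a noise clip at a chosen SNR (in dB).
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """speech, noise: 1-D float arrays; returns speech plus scaled noise."""
    if len(noise) < len(speech):                      # loop noise if too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. draw the SNR uniformly from the 0-20 range used in this work
# noisy = mix_at_snr(clean_wav, noise_wav, snr_db=np.random.uniform(0, 20))
```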

Dataset                    Model             1      2      3       4        Noisy
Librispeech (test-clean)   Transformer       3.23   4.24   24.25   50.50    18.20
                           CIF-based model   3.34   3.48   3.58    3.69     18.15
AISHELL-2 (test-ios)       Transformer       5.73   7.12   52.7    152.92   18.95
                           CIF-based model   5.60   6.20   6.58    6.78     18.89

Table 4: Results on the repeated and noisy utterances.

As shown in table 4, the transformer encounters large performance degradation as the number of repetitions increases, especially with 3 and 4 repetitions. Most of the errors are insertion errors, which come from many cases where too many repetitions are predicted; there are also some cases where too few repetitions are predicted, but not many. We suspect the degradation is due to the decoding confusion brought by the repeated, similar decoder states, which makes it hard for the transformer to predict the end-of-sentence label at the correct step. In contrast, the CIF-based model performs stably on the repeated utterances and only encounters slight performance degradation. Besides, as the number of repetitions grows, the inference time of the CIF-based model grows roughly linearly while that of the transformer grows much faster on Librispeech, which is basically consistent with the computational complexity of the two models (O(T) vs. O(U·T)). On the noisy utterances, the two models show similar performance.

The poor generalization of the transformer on the above cases can be explained from two aspects: 1) the transformer uses its decoder state as the query over the encoded representations H to extract the relevant acoustic information, so its recognition is affected by both the decoder state and H, and changes in either of them on special cases bring great challenges to its recognition. In contrast, the integration of acoustic information in the CIF-based model is only affected by H. 2) the transformer mainly relies on the dependency between decoder states to determine when to stop, which is hard to achieve when it suffers from decoding confusion in some cases. In contrast, the CIF-based model has a clear stop signal: the last frame being processed. Based on the above, we believe frame-synchronous models may be more suitable for ASR application scenarios that need to cover a wider range of speech.

5 Conclusions

In this work, we make a detailed comparison of a label-synchronous model (the transformer) and a soft frame-synchronous model (the CIF-based model). Through experiments on multiple datasets, we find: 1) the label-synchronous model achieves slightly better accuracy on most datasets, which can be attributed to its comprehensive usage of acoustic information and its insensitivity to the clearness of acoustic boundaries; 2) the frame-synchronous model achieves 4.4-6.8 times faster inference, which is determined by its frame-by-frame calculation and linear computational complexity; 3) the frame-synchronous model also achieves better generalization in the special cases. Since there must be one driving end in end-to-end ASR models, we hope this comparison of the two types of models can benefit the selection and design of end-to-end models in practical ASR applications.

References