CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition

Linhao Dong et al., 05/27/2019

Automatic speech recognition (ASR) systems are becoming simpler and more practical with the rise of various end-to-end models. However, most of these models neglect the positioning of token boundaries in continuous speech, which is considered crucial in human language learning and instant speech recognition. In this work, we propose Continuous Integrate-and-Fire (CIF), a 'soft' and 'monotonic' acoustic-to-linguistic alignment mechanism that addresses boundary positioning by simulating the integrate-and-fire neuron model with continuous functions under the encoder-decoder framework. As the connection between the encoder and decoder, CIF forwardly integrates the information in the encoded acoustic representations to determine a boundary, and instantly fires the integrated information to the decoder once a boundary is located. Multiple effective strategies are introduced to the CIF-based model to alleviate the problems brought by inaccurate positioning. Besides, multi-task learning is performed during training and an external language model is incorporated during inference to further boost model performance. Evaluated on multiple ASR datasets that cover different languages and speech types, the CIF-based model shows stable convergence and competitive performance. In particular, it achieves a word error rate (WER) of 3.70% on the test-clean set of Librispeech.


1 Introduction

The prevailing end-to-end models are driving automatic speech recognition (ASR) systems to become simpler and more practical. These models directly transform the input speech features into the output text with a single neural network, which integrates the functions of the acoustic model, the acoustic-to-linguistic alignment mechanism and the language model into one. Two main branches have gradually formed with the evolution of end-to-end models. One of them uses the connectionist temporal classification (CTC) [Graves et al., 2006] or its extensions (RNN-T [Graves, 2012], RNA [Sak et al., 2017]) as the alignment mechanism, which generates a ‘hard’ and ‘monotonic’ alignment with every frame labeled (possibly by the blank label) and is optimized by maximizing the sum of the probabilities of all alignments that map to the targets. The other branch is built on the encoder-decoder framework and uses an attention-based alignment mechanism [Chan et al., 2016, Jaitly et al., 2016], which generates a ‘soft’ but ‘non-monotonic’ alignment by calculating a weight for each position in the globally [Chan et al., 2016] or locally [Jaitly et al., 2016] encoded representations, then extracting the weighted sum of the representations for decoding.

Despite their great success, these mainstream end-to-end models neglect the positioning of token (word, word piece, etc.) boundaries in continuous speech, which is considered a crucial step in the language learning of infants [Jusczyk, 1999]. The importance of locating boundaries is also reflected in foreign language learning, where many learners experience difficulty in hearing where one word ends and another begins, leading to error-prone recognition of speech content. Moreover, positioning a token boundary can be regarded as a pre-step for instant speech recognition, which is required by various online ASR scenarios. Based on these findings, we believe it is worthwhile to explore an alignment mechanism that addresses the positioning of token boundaries by forwardly integrating the acoustic information, and fires the integrated information for instant recognition once a boundary is located.

Figure 1: Illustration of the two mainstream alignment mechanisms and our CIF alignment mechanism on an utterance of length 5 labelled as "CAT". In (c), the CIF uses a ‘soft’ manner to extract acoustic information based on the weights (represented by the shade of gray of each square), similar to the attention in (b). Besides, it keeps the ‘monotonic’ manner by first integrating the acoustic information corresponding to the token ‘C’ until a boundary (represented by a dashed line) is located and then starting the integration of the subsequent token, which differs from the CTC in (a), which implements ‘monotonic’ alignment through restricted transitions between states (represented by circles).

There is a strong similarity between the above alignment mechanism and the integrate-and-fire neuron model [Lapicque, 1907, Abbott, 1999], one of the most canonical models for analyzing the behaviour of neural systems [Burkitt, 2006], which works by integrating stimulation from the input signal over a period and firing an action potential (spike) when its membrane potential reaches a threshold value. However, the discontinuity of spike information retards the penetration of the integrate-and-fire idea into end-to-end models that are optimized with back-propagation. Here, we take a small step forward by simulating the integrate-and-fire process with vector information processed by continuous functions.

In this work, we propose Continuous Integrate-and-Fire (CIF), a novel ‘soft’ and ‘monotonic’ alignment mechanism to be utilized in the encoder-decoder framework. As the connection between the encoder and decoder, it first calculates a weight (representing the amount of acoustic information) for each incoming encoded acoustic representation. Then, it forwardly integrates the information in the acoustic representations until the accumulated weight reaches a threshold, which means a boundary is located. At this point, it divides the information in the boundary frame into two parts: one for completing the integration of the current token and the other for the subsequent integration, mimicking how the integrate-and-fire neuron model behaves when its membrane potential reaches the threshold at some point within the period of an encoded frame. After that, it fires the integrated acoustic information to the decoder to predict the current token. This process is illustrated in Figure 1 (c) and is repeated until the end of recognition.

In the process of implementing the CIF-based model, inaccurate positioning often occurs and brings difficulties to both training and inference. In training, it may cause unequal lengths between the predicted tokens and the targeted tokens, thus hindering the cross-entropy training. To solve this, we introduce a scaling strategy on the calculated weights to teacher-force the CIF to produce the same number of tokens as the target during training. We also present a loss function that supervises the quantity of produced tokens to be closer to the quantity of targets for better positioning. In inference, inaccurate positioning causes some useful but insufficient information to be left at the tail, which leads to incomplete words at the end of the recognition result. To alleviate this, we present a rounding method that decides whether to make an additional firing based on the residual weight during inference, and introduce an extra token at the tail of the target sequence to mark the end of sentence and provide tolerance during training.

Evaluated on multiple ASR datasets covering different languages and speech types, the CIF-based model shows stable convergence and competitive performance. On the Librispeech dataset, where the text is converted to sequences of word pieces, the boundaries between word pieces are relatively blurred; even so, the CIF-based model still achieves a word error rate (WER) of 3.70% on test-clean, which matches the results of most end-to-end models while keeping the potential of instant speech recognition and using a lower encoded frame rate. On the Mandarin ASR datasets, the CIF-based model exhibits impressive performance due to the relatively clear boundaries between Mandarin characters: specifically, it achieves a new state-of-the-art character error rate (CER) of 6.69% on the test-ios set of the read-speech dataset AISHELL-2 and a competitive CER of 24.71% on the spontaneous-speech dataset HKUST.

2 Related Work

The soft and monotonic acoustic-to-linguistic alignment mechanism may be a preferable choice for end-to-end ASR models. On the one hand, the soft characteristic enables the model to extract information from the relevant acoustic representations based on the calculated weights, thus utilizing the acoustic information more directly and comprehensively. On the other hand, the monotonic characteristic fits the left-to-right nature of the ASR task, enabling the model to conduct instant speech recognition and to run with lower computational complexity by avoiding calculations at irrelevant positions.

Several prior works have studied soft and monotonic alignment mechanisms in end-to-end ASR models. [Hou et al., 2017, Tjandra et al., 2017] assume the alignment to be a forward-moving window that follows a Gaussian distribution, where the center and width of the Gaussian window are predicted from the decoder state. Differing from them, the CIF neither introduces such an assumption nor uses the state of the decoder, thus encouraging more pattern learning from the acoustic data without the restriction of the assumption. In addition, the CIF provides a concise calculation process by conducting the locating and the integrating at the same time, rather than performing soft attention over small chunks of memory preceding the point where a hard monotonic attention mechanism decides to stop, as in [Chiu and Raffel, 2017, Fan et al., 2018]. Besides, the CIF-based model is trained from scratch and does not need a trained CTC model to conduct pre-partition before decoding like [Moritz et al., 2019]. In [Li et al., 2019a], Li et al. present the important Adaptive Computation Steps (ACS) algorithm to dynamically decide how many frames should be processed to predict a linguistic output. Their method amounts to locating ‘hard’ boundaries at the encoded frame level, which leads to insufficient usage of the acoustic information in the boundary frame. In contrast, the CIF mimics the integrate-and-fire neuron model and assumes that locating a boundary (firing) occurs at some point within the period of an encoded frame, thus locating ‘soft’ boundaries at a finer time granularity (inside the encoded frame) and integrating the acoustic information more sufficiently. Besides, the ACS did not present constructive solutions to the inaccurately computed frames, which is probably the main reason why their model has a large performance gap compared with a DNN-HMM model. In comparison, our model introduces multiple strategies to alleviate the difficulties brought by the inaccurate positioning of the CIF, thus supporting effective training and showing competitive performance.

3 Model Architecture

Continuous Integrate-and-Fire (CIF) is a ‘soft’ and ‘monotonic’ alignment mechanism employed in the encoder-decoder architecture. It is suitable for many sequence transduction tasks with the left-to-right nature (ASR, scene text recognition, grapheme-to-phoneme, etc.). In this paper, we focus on the ASR task and illustrate the architecture of our CIF-based model in Figure 2.

As shown in Figure 2, the encoder transforms the speech features $X = (x_1, \ldots, x_T)$ into high-level acoustic representations $H = (h_1, \ldots, h_U)$, where $U < T$ due to the temporal down-sampling. Then, the CIF part consumes $H$ in a left-to-right manner to produce the integrated acoustic representations $C = (c_1, \ldots, c_S)$, where $c_i$ can be regarded as the acoustic embedding of the token $y_i$ in the output sequence $Y = (y_1, \ldots, y_S)$. When $c_i$ is produced, the decoder takes it and maps it to the probability distribution over the token $y_i$. Three loss functions are placed on the encoder, the CIF part and the decoder respectively to provide sufficient supervision for training. Besides, an external language model is incorporated to further improve the model performance. More details are described in the following sections.

Figure 2: The architecture of our CIF-based encoder-decoder model. Operations in the dashed rectangles are only applied in the training stage, and the switch (S) in the CIF part connects the right in the training stage and the left in the inference stage.

3.1 Encoder

The encoder uses a convolutional front-end and a pyramid structure composed of self-attention networks (SANs), which have shown competitiveness in ASR [Salazar et al., 2019, Dong et al., 2019]. The convolutional front-end employs the structure in [Dong et al., 2018], which utilizes a 2-dimensional strided convolutional network to conduct temporal down-sampling by 2, and a multiplicative unit (MU) [Kalchbrenner et al., 2016] to further capture acoustic details. The 2-dimensional outputs are then flattened and projected to form the input of the pyramid structure composed of SANs. Two temporal pooling layers with width 2 are uniformly inserted between the stacked SANs to encourage effective encoding at each temporal resolution; they further reduce the original temporal sampling rate to 1/8, bringing lighter learning and inference. After the modeling of the pyramid structure, the encoded acoustic representations $H$ are obtained.

3.2 Continuous Integrate-and-Fire

The Continuous Integrate-and-Fire (CIF) part produces the acoustic embeddings $C$ of the output sequence by integrating the information in $H$ step by step. Specifically, at step $u$, it first calculates a weight $\alpha_u$ for the incoming encoded representation $h_u$, where the weight represents the amount of acoustic information carried by $h_u$. The weight is calculated by first using a 1-dimensional convolution to capture the local dependencies around $h_u$, then using a projection layer with sigmoid activation to produce a scalar between 0 and 1.

To determine whether a boundary is located at step $u$, the weight $\alpha_u$ is added to the previous residual weight $\alpha^a_{u-1}$ to obtain the current accumulated weight $\alpha^a_u$. If $\alpha^a_u$ is less than the given threshold value $\beta$, no boundary is located: $\alpha^a_u$ is kept as the residual weight for the next step, and the current integrated state is updated as $h^a_u = h^a_{u-1} + \alpha_u \cdot h_u$ and kept as the residual state for the next step. If $\alpha^a_u$ is greater than $\beta$, a token boundary is located and the accumulated weight of the current token is set to 1: $\alpha_u$ is divided into a part $\alpha_{u,1}$ that completes the integration of the current token and a part $\alpha_{u,2}$ that starts the integration of the next token. The fired embedding $c_i$, the residual weight $\alpha^a_u$ and the residual state $h^a_u$ are calculated as follows:

$\alpha_{u,1} = 1 - \alpha^a_{u-1}$  (1)
$c_i = h^a_{u-1} + \alpha_{u,1} \cdot h_u$  (2)
$\alpha^a_u = \alpha_{u,2} = \alpha_u - \alpha_{u,1}, \quad h^a_u = \alpha_{u,2} \cdot h_u$  (3)

where $c_i$ is fired to the decoder as the integrated acoustic information corresponding to the current token $y_i$. The above calculations are repeated until the end of the utterance, and they let the CIF-based model run with a linear-time complexity of $O(U)$.
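As a concrete reference, the following is a minimal NumPy sketch of the integrate-and-fire loop described above (inference mode, without the scaling and tail-handling strategies introduced next). It assumes the per-frame weights have already been computed from the encoder outputs; the names and edge-case handling are ours, not the authors' implementation.

```python
import numpy as np

def cif(h: np.ndarray, alpha: np.ndarray, beta: float = 1.0):
    """Simplified Continuous Integrate-and-Fire over encoded frames.

    h:     encoded acoustic representations, shape (U, d)
    alpha: per-frame weights in [0, 1], shape (U,)
    beta:  firing threshold (the experiments use 0.9; this sketch uses 1.0
           so that the split weights stay non-negative)
    Returns the fired acoustic embeddings c_1, ..., c_S (one per located token).
    """
    fired = []
    acc_weight = 0.0                      # residual accumulated weight
    acc_state = np.zeros_like(h[0])       # residual integrated state
    for u in range(len(h)):
        if acc_weight + alpha[u] < beta:  # no boundary inside this frame
            acc_weight += alpha[u]
            acc_state = acc_state + alpha[u] * h[u]
        else:                             # a boundary is located inside frame u
            fill = 1.0 - acc_weight       # part of alpha[u] completing the current token
            fired.append(acc_state + fill * h[u])
            rest = alpha[u] - fill        # part left for the next token
            acc_weight = rest
            acc_state = rest * h[u]
    return fired

# toy example: 8 encoded frames of dimension 4
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
alpha = np.array([0.3, 0.4, 0.5, 0.2, 0.6, 0.3, 0.4, 0.1])
print(len(cif(h, alpha)))  # number of fired token embeddings (2 for these weights)
```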

In training, the length $S$ of the produced $C$ may differ from the length $\tilde{S}$ of the targets, which brings difficulties for the cross-entropy training. To solve this problem, we introduce a scaling strategy that multiplies the calculated weights $\alpha_u$ by a scalar so that the scaled weights sum to $\tilde{S}$, thus teacher-forcing the CIF to produce $C$ with length $\tilde{S}$ for more effective training.

In inference, some weight that is not enough to trigger a firing but still useful is left at the tail of the utterance, which may cause incomplete words at the end of the predictions. To alleviate this tail problem, we present a rounding method that makes an additional firing if the last residual weight is greater than 0.5 during inference. We also introduce a token <EOS> at the tail of the target sequence to mark the end of sentence and provide tolerance during training.
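To make the two alleviation strategies of the preceding paragraphs concrete, here is a small sketch (our own simplification, assuming the β = 1 variant of the loop shown earlier so that each firing consumes exactly one unit of weight):

```python
import numpy as np

def scale_weights(alpha: np.ndarray, target_len: int) -> np.ndarray:
    """Training-time scaling strategy: rescale the weights so that their sum
    equals the number of target tokens, teacher-forcing CIF to fire exactly
    target_len times."""
    return alpha * (target_len / alpha.sum())

def fire_count_with_tail(alpha: np.ndarray) -> int:
    """Inference-time tail handling: each unit of accumulated weight fires once,
    and one additional firing is made if the leftover residual exceeds 0.5."""
    total = float(alpha.sum())
    full_fires = int(total)
    residual = total - full_fires
    return full_fires + (1 if residual > 0.5 else 0)

alpha = np.array([0.3, 0.4, 0.5, 0.2, 0.6, 0.3, 0.4, 0.1])
print(scale_weights(alpha, target_len=3).sum())  # ~3.0, matching the target length
print(fire_count_with_tail(alpha))               # 2 full firings + residual 0.8 -> 3
```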

3.3 Decoder

The decoder also uses SANs to capture the dependencies between output positions. Two versions of the decoder are investigated in this work. Figure 2 shows our better performing version: the autoregressive (AR) decoder, which follows the decoder networks in [Dong et al., 2019] and models the probability distribution of $y_i$ as follows:

$P(Y \mid C) = \prod_{i=1}^{S} P(y_i \mid y_{<i}, c_{\leq i})$  (4)

However, such an autoregressive property leads to low parallelization and slow inference. To alleviate this, we introduce a non-autoregressive (NAR) decoder, which simply inputs $C$ to the SANs and generates the probability distributions of all $y_i$ independently in parallel as follows:

$P(Y \mid C) = \prod_{i=1}^{S} P(y_i \mid C)$  (5)
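To make the difference concrete, a toy sketch follows; the `toy_ar_step` and `toy_nar_step` functions are hypothetical stand-ins rather than the paper's SAN decoders. The AR decoder must run sequentially because each step consumes the previously emitted token, while the NAR decoder scores every position independently and could run in parallel.

```python
import numpy as np

VOCAB = 10  # toy vocabulary size

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def ar_decode(c, step_fn):
    """Autoregressive decoding: step i sees c_1..c_i and the tokens y_1..y_{i-1}."""
    tokens = []
    for i in range(len(c)):
        probs = step_fn(c[: i + 1], tokens)
        tokens.append(int(np.argmax(probs)))
    return tokens

def nar_decode(c, step_fn):
    """Non-autoregressive decoding: every position is predicted independently."""
    return [int(np.argmax(step_fn(c, i))) for i in range(len(c))]

# hypothetical stand-ins for the SAN decoders (not the paper's networks)
def toy_ar_step(c_prefix, prev_tokens):
    bias = prev_tokens[-1] if prev_tokens else 0  # make the dependency on y_<i visible
    return softmax(c_prefix[-1] + 0.1 * bias)

def toy_nar_step(c, i):
    return softmax(c[i])

c = np.random.default_rng(1).normal(size=(4, VOCAB))  # 4 fired embeddings
print(ar_decode(c, toy_ar_step))
print(nar_decode(c, toy_nar_step))
```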

3.4 Loss functions

In addition to the cross-entropy loss $\mathcal{L}_{ce}$, we introduce two auxiliary loss functions to provide more supervision for better training. Specifically, we place a CTC loss $\mathcal{L}_{ctc}$ on the encoder (similar to Kim et al. [2017]) to promote left-to-right acoustic modelling. Besides, we introduce a loss function on the CIF part to supervise the boundary positioning and make the quantity of predicted tokens closer to the quantity in the target. We term it the quantity loss $\mathcal{L}_{qua}$, defined as $\mathcal{L}_{qua} = \left| \sum_{u=1}^{U} \alpha_u - \tilde{S} \right|$, where $\tilde{S}$ is the quantity of the targeted tokens. Our model is thus trained with multi-task learning as follows:

$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{ctc} + \lambda_2 \mathcal{L}_{qua}$  (6)

where $\lambda_1$ and $\lambda_2$ are tunable hyper-parameters.
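A minimal sketch of the combined objective, assuming the cross-entropy and CTC terms come from standard library implementations and spelling out only the quantity loss; the default weights below mirror the Mandarin setup reported in the experimental section:

```python
import numpy as np

def quantity_loss(alpha: np.ndarray, target_len: int) -> float:
    """L_qua = | sum_u alpha_u - S~ |, supervising the number of firings."""
    return abs(float(alpha.sum()) - target_len)

def total_loss(ce_loss: float, ctc_loss: float, qua_loss: float,
               lambda1: float = 0.5, lambda2: float = 1.0) -> float:
    """Multi-task objective: L = L_ce + lambda1 * L_ctc + lambda2 * L_qua."""
    return ce_loss + lambda1 * ctc_loss + lambda2 * qua_loss

alpha = np.array([0.3, 0.4, 0.5, 0.2, 0.6, 0.3, 0.4, 0.1])
print(total_loss(ce_loss=2.1, ctc_loss=1.3, qua_loss=quantity_loss(alpha, 3)))
```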

3.5 Incorporating with Language Model

To further boost the performance of our model, we incorporate an SAN-based language model (LM) by performing second-pass rescoring with log-linear interpolation as in [Chiu et al., 2018]. Given the hypotheses produced by beam search, we determine the final transcript as:

$Y^* = \arg\max_{Y} \; \log P(Y \mid X) + \lambda \log P_{LM}(Y)$  (7)

where $\lambda$ is a hyper-parameter tuned on the development dataset.
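A sketch of this second-pass rescoring, assuming beam search has already produced hypotheses with their model log-probabilities and that an external LM can score each hypothesis; both scoring functions below are placeholders:

```python
def rescore(hypotheses, lm_score, lam=0.3):
    """Pick the transcript maximizing log P(Y|X) + lambda * log P_LM(Y).

    hypotheses: list of (token_sequence, asr_log_prob) pairs from beam search
    lm_score:   function returning log P_LM(token_sequence)
    lam:        interpolation weight tuned on the development set
    """
    return max(hypotheses, key=lambda h: h[1] + lam * lm_score(h[0]))[0]

# toy usage with a made-up LM that prefers shorter hypotheses
hyps = [(["the", "cat"], -3.2), (["the", "cat", "sat"], -3.0)]
print(rescore(hyps, lm_score=lambda y: -0.5 * len(y), lam=0.3))
```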

4 Experiments

4.1 Experimental Setup

We evaluate our approach on three public ASR datasets: two read speech corpora (Librispeech [Panayotov et al., 2015] and AISHELL-2 [Du et al., 2018]) and a spontaneous speech corpus (HKUST [Liu et al., 2006]). On Librispeech, we use all the available training data (960 hours) for training, put the two development subsets together for validation, and leave the two test subsets only for evaluation. Besides, we train our language model on the separately prepared language-model training data, which is available at http://www.openslr.org/11 together with the above speech data. On the Mandarin ASR dataset AISHELL-2, we use all the available training data (1000 hours) for training, put the three development subsets together for validation, and leave the three test subsets only for evaluation. The speech data of AISHELL-2 is available through an application process at http://www.aishelltech.com/aishell_2. The HKUST corpus (LDC2005S15, LDC2005T32) consists of a training set and a development set, which add up to about 178 hours of Mandarin telephone conversation speech. We extract about 5 hours from the original training set for tuning the hyper-parameters, use the remaining training data for training, and use the original development set only for evaluation. In addition, the training of the language models on HKUST and AISHELL-2 only uses the text data from the respective training sets.

We extract input features using the Kaldi [Povey et al., 2011] recipe. Specifically, we extract 40-dimensional mel-filterbanks from a 25 ms sliding window with a 10 ms shift, extend them with delta and delta-delta features, and apply per-speaker normalization and global normalization for all three datasets. We also perform speed perturbation [Ko et al., 2015] with a fixed 10% change for data augmentation. As for the output tokens, we use word pieces for Librispeech and characters for AISHELL-2 and HKUST. Specifically, we use the BPE toolkit (https://github.com/rsennrich/subword-nmt) [Sennrich et al., 2015] to generate 3722 word pieces from the training set of Librispeech by setting the number of merge operations to 7500. Adding the blank label, the end-of-sentence label <EOS> and the pad label, the number of output tokens is 3725 for the Librispeech dataset. We collect the characters and special markers appearing in the datasets of AISHELL-2 and HKUST, respectively. After adding the same three labels, we obtain 5230 output tokens for AISHELL-2 and 3674 output tokens for HKUST.
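As a quick check of the vocabulary sizes above (the exact spellings of the blank and pad labels are our assumption; only <EOS> is named in the text):

```python
# word pieces from BPE plus the three special labels used above
num_word_pieces = 3722
special_labels = ["<blank>", "<EOS>", "<PAD>"]  # illustrative spellings
print(num_word_pieces + len(special_labels))    # -> 3725 output tokens for Librispeech
```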

We implement our model on the TensorFlow [Girija, 2016] platform. The self-attention networks (SANs) in our model follow the implementation in [Dong et al., 2019] and share the same model dimension, inner feed-forward dimension and number of attention heads across all datasets. In the encoder, the convolutional front-end uses the same configuration as in [Dong et al., 2019], and the number of SAN layers in each block of the pyramid structure is set to 5 for all datasets. For fair comparison with other results, our encoder encodes bi-directionally; forward-only encoding can be obtained by introducing suitable masking in the SANs, which is left as future work. In the CIF part, the width of the 1-dimensional convolution is set to 3 for the two Mandarin datasets and to 5 for Librispeech. Layer normalization [Ba et al., 2016] and a ReLU activation are applied after the convolution. The firing threshold $\beta$ is set to 0.9. In the decoder, the number of layers is set to 2 for the two Mandarin datasets and to 3 for Librispeech. The multi-task hyper-parameter $\lambda_1$ is set to 0.5 for the two Mandarin datasets and to 0.25 for Librispeech (making the CTC loss value about 0.5-0.7 of the CE loss), and $\lambda_2$ is set to 1.0 for all datasets. The language models (LMs) in our experiments are also constructed with SANs whose model dimension, feed-forward dimension and number of heads are kept the same as in the encoder-decoder model. The number of SAN layers in the LM is set to 3, 6 and 15 for HKUST, AISHELL-2 and Librispeech, respectively.

In training, we batch utterances with similar numbers of frames together so that each batch contains about 20000 frames. We use the Adam optimizer and the varied learning rate formula in [Vaswani et al., 2017], where the warm-up step is set to 25000 for Librispeech and AISHELL-2 and to 16000 for HKUST, and the global coefficient on the varied learning rate is set to 4.0. We use two regularization methods: dropout and label smoothing. We only apply dropout to the self-attention networks (SANs), whose attention dropout and residual dropout are set to 0.1 for Librispeech and AISHELL-2 and to 0.2 for HKUST. We use the uniform label smoothing in [Chorowski and Jaitly, 2016] and set it to 0.2. In the training of the language model, both dropout rates are set to 0.2 and the uniform label smoothing is set to 0.2 for all datasets. Scheduled sampling [Bengio et al., 2015] with a constant sampling probability of 0.5 is applied on the two Mandarin datasets. After training, we average the newest 10 checkpoints for inference.
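For reference, the varied learning-rate formula of [Vaswani et al., 2017] with the global coefficient mentioned above can be sketched as follows; the model dimension passed in the example is only illustrative, since its value is not restated in this excerpt:

```python
def learning_rate(step: int, d_model: int, warmup_steps: int = 25000,
                  coeff: float = 4.0) -> float:
    """Varied learning rate from [Vaswani et al., 2017], scaled by a global coefficient:
    lr = coeff * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return coeff * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# warms up linearly to a peak at `warmup_steps`, then decays as step^-0.5
print(learning_rate(1000, d_model=512), learning_rate(25000, d_model=512))  # 512 is illustrative
```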

In inference, we use beam search with a beam size of 10 for all datasets. The hyper-parameter $\lambda$ for incorporating the language model is set to 0.2, 0.3 and 0.9 for HKUST, AISHELL-2 and Librispeech, respectively. We evaluate the results using word error rate (WER) for Librispeech and character error rate (CER) for the two Mandarin datasets. All experiments are run 3 times, and the results are presented as mean and standard deviation below.

4.2 Experimental Results

4.2.1 Results on Read Speech

Since Mandarin characters are mono-syllabic and have relatively clear boundaries, we first evaluate our model on AISHELL-2, the largest publicly released Mandarin ASR dataset, which was released recently. As shown in Table 1, the CIF-based model performs competitively on all of the test sets and significantly improves on the result achieved by the Chain model [Povey et al., 2016].

Model | End-to-End | test_android | test_ios | test_mic
Chain-TDNN [Povey et al., 2016] | No | 9.59 | 8.81 | 10.87
CIF-based model | Yes | 7.25 ± 0.06 | 6.69 ± 0.02 | 7.47 ± 0.06

Table 1: Comparison with other published models on AISHELL-2, CER (%)

We further evaluate our CIF-based model on the English ASR dataset Librispeech. Since we use word pieces as the output tokens, the acoustic boundaries between adjacent output tokens may be blurred. Even so, our model still shows competitive results that match those of most end-to-end models. Specifically, our model achieves a mean WER of 4.48% on test-clean and a mean WER of 12.62% on test-other, which are comparable to the results of 4.1% and 12.5% achieved by the current state-of-the-art LAS model without the powerful data augmentation of SpecAugment [Park et al., 2019] (which was released recently and will be applied to further boost our model performance in future work). Besides, our model still keeps the potential of instant speech recognition and utilizes a lower encoded frame rate (12.5 Hz), and thus may be more practical in various ASR scenarios. Compared with other soft and monotonic alignments that instantly recognize speech, the CIF-based model shows significant performance advantages. In particular, it achieves huge absolute WER improvements over the result of Adaptive Computation Steps [Li et al., 2019a], which we reproduce with the same model setting, further demonstrating the superiority of the CIF, which locates and integrates at a finer time granularity.

Model | Params | test-clean w/o LM | test-clean w/ LM | test-other w/o LM | test-other w/ LM
LAS + SpecAugment [Park et al., 2019] | - | 2.8 | 2.5 | 6.8 | 5.8
Jasper [Li et al., 2019b] | 333 M | 3.86 | 2.95 | 11.95 | 8.79
wav2letter++ [Vineel Pratap, 2018] | - | - | 3.44 | - | 11.24
LAS + Deep bLSTM [Zeyer et al., 2018] | 150 M | 4.87 | 3.82 | 15.39 | 12.76
ASG + Gated ConvNet [Liptchinsky et al., 2017] | 208 M | 6.7 | 4.8 | 20.8 | 14.5
CTC + policy learning [Zhou et al., 2018a] | 75 M | - | 5.42 | - | 14.70
CTC + i-SRU 1D-Conv [Park et al., 2018] | 36 M | - | 5.73 | - | 15.96
‘Soft’ and ‘monotonic’:
 ACS [Li et al., 2019a] | 67 M | 16.72 ± 0.07 | 16.11 ± 0.03 | 24.09 ± 0.25 | 22.66 ± 0.30
 Triggered Attention [Moritz et al., 2019] | - | 7.4 | 5.7 | 19.2 | 16.1
CIF-based model | 67 M | 4.48 ± 0.09 | 3.70 ± 0.10 | 12.62 ± 0.09 | 10.90 ± 0.16

Table 2: Comparison with other end-to-end models on the Librispeech dataset, WER (%)

The results on the two read-speech datasets show that the CIF-based model has the potential to cover different languages with relatively clear or blurred token boundaries.

4.2.2 Ablation Study

Model | test-clean | test-other
without scaling strategy | 6.03 ± 0.18 | 14.98 ± 0.08
without quantity loss | 8.84 ± 0.76 | 15.49 ± 0.44
without handling tail | 6.04 ± 0.02 | 14.11 ± 0.07
without CTC loss | 4.96 ± 0.06 | 13.27 ± 0.16
without autoregressive | 9.27 ± 0.18 | 21.56 ± 0.15
Full Model | 4.48 ± 0.09 | 12.62 ± 0.09

Table 3: Ablation study on the Librispeech dataset, WER (%)

In this section, we use an ablation study to evaluate the importance of the different methods in our CIF-based model. As shown in Table 3, all of the introduced methods have a positive impact on the modelling of the CIF-based model. The most crucial one is the auto-regression in the decoder, which explicitly captures the language dependency required in ASR. The quantity loss used to supervise the boundary positioning also greatly matters, since obvious performance degradation and instability appear after ablating it. The scaling strategy and the methods of handling the tail are proposed to alleviate the problems brought by inaccurate positioning. In line with our expectations, they provide significant and stable improvements to the CIF-based model. With the joint action of these methods, the CIF-based model conducts better boundary positioning and shows significant improvements.

4.2.3 Results on Conversational Speech

Figure 3: Token boundary positioning by CIF on an English utterance in Librispeech test-clean, where "_" represents the space. The boundaries in the spectrogram were marked by two humans. The middle part shows the calculated weight for each encoded representation, and the upper part shows the accumulated weight at different steps. When the accumulated weight reaches the threshold $\beta$, a firing happens and a token boundary is located. We find the located boundaries are roughly accurate, and tokens with more stable and clear pronunciations are more prone to be located ahead of time by the CIF.

We further evaluate our model on a telephone conversational speech dataset (HKUST). As shown in Table 4, our CIF-based model achieves a competitive CER on this spontaneous speech, which is not as well structured acoustically and linguistically as read speech and has less clear boundaries. The achieved performance further demonstrates the generalization of our CIF alignment mechanism.

Model | CER
Joint CTC-attention model / ESPNet [Watanabe et al., 2018] | 27.4
Extended-RNA [Dong et al., 2018] | 26.8
Transformer [Zhou et al., 2018b] | 26.6
Self-attention aligner [Dong et al., 2019] | 24.1
CIF-based model | 24.71 ± 0.07

Table 4: Comparison with recent end-to-end models on the HKUST, CER (%)

5 Conclusion

In this work, we draw inspiration from the integrate-and-fire neuron model and propose Continuous Integrate-and-Fire (CIF), a soft and monotonic alignment mechanism that supports instant speech recognition by forwardly integrating the acoustic information and firing the integrated information once a token boundary is located. Since it mimics the integrate-and-fire neuron model, it locates and integrates at a finer time granularity (inside the encoded frame, or from another perspective, on the continuous speech that is framed), thus enabling the model to sufficiently utilize the acoustic information and to perform an effective and concise calculation process with linear-time complexity.

In the future, we will further validate the performance of the CIF-based model on larger-scale ASR datasets and other monotonic sequence transduction tasks. Besides, we will continue to draw inspiration from biologically-inspired neuron models (e.g. the integrate-and-fire neuron family) to further boost the practicality of the CIF-based model. We also hope this work offers some useful ideas for the construction of biologically-plausible ASR systems and serves as a small step toward research in this field.

References