Although autoregressive models have achieved great success in various NLP tasks and in speech recognition [bahdanau2014neural, chorowski2015attention, chan2016listen, vaswani2017attention, kim2017joint, dong2018speech], their autoregressive nature results in large latency during inference [lee2018deterministic]. Most attention-based sequence-to-sequence models generate the target sequence in an autoregressive fashion: they predict the next token conditioned on the previously generated tokens and the source state sequence. By contrast, a non-autoregressive model removes this temporal dependency and is able to perform parallel computation, greatly improving inference speed.
Non-autoregressive transformers (NAT) have achieved results comparable with autoregressive models in neural machine translation and speech recognition [lee2018deterministic, gu2017non, ma2019flowseq, wang2019non, chen2019non, libovicky2018end, moritz2019triggered]. Different from the autoregressive sequence-to-sequence model, the NAT takes a fixed-length mask sequence as input to predict the target sequence. The setting of this predefined length is very important. If the length is shorter than the actual length, it causes many deletion errors; on the contrary, a longer length causes the model to generate duplicate tokens and consume additional computation. To the best of our knowledge, there are three ways to estimate the length of the target sequence. Firstly, some works introduce a neural network module behind the encoder to predict the target length [lee2018deterministic, gu2017non, ma2019flowseq]. These methods cannot guarantee the accuracy of the predicted lengths, so during inference it is necessary to sample different lengths and select the optimal sequence. Secondly, [wang2019non, chen2019non] set an empirical (or maximum) length based on the length of the source sequence. To guarantee the performance of the model, this length is often much longer than the actual length of the target sequence, which results in extra computation cost and affects the inference speed. Thirdly, [libovicky2018end]
utilizes the CTC loss function instead of the cross entropy to optimize the model, which allows the model to generate tokens without calculating the length of the target sequence. However, the characteristics of CTC cause the model to generate some duplicate tokens and a large number of blanks during inference, and this approach does not accelerate the inference speed.
For speech recognition, the number of valid characters or words contained in a piece of speech is affected by various factors such as the speaker's speech rate, silence, and noise. It is unreasonable to set a fixed length only according to the duration of the audio. To estimate the length of the target sequence accurately and accelerate the inference speed, we propose a spike-triggered non-autoregressive transformer (ST-NAT) for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate convergence. The CTC loss plays three important roles in our proposed model. Firstly, the ST-NAT utilizes the CTC module to predict the length of target sequences. The CTC module generates spike-like label posterior probabilities, and the number of spikes accurately reflects the length of the target sequence [ma2019flowseq, moritz2019streaming]. During inference, the ST-NAT can count the number of spikes to avoid redundant calculations. Secondly, the ST-NAT adopts the encoder states corresponding to the positions of the spikes as the input of the decoder. We assume that the triggered encoder state sequence carries more prior information than a mask sequence, which may be able to improve the performance of the model. Thirdly, the ST-NAT adopts the CTC loss as an auxiliary loss to speed up training and convergence [kim2017joint]. Additionally, a non-autoregressive transformer cannot model the inter-dependencies between the outputs. Therefore, we improve the model performance by integrating the output probabilities predicted by the ST-NAT with those of a neural language model. All experiments are conducted on AISHELL-1, a public Mandarin Chinese dataset. The results show that the ST-NAT can predict the length of the target sequence accurately and achieves performance comparable with the most advanced end-to-end models; the probability of missing words or characters is less than 2%. What's more, the model achieves a real-time factor (RTF) of 0.0056, which exceeds all mainstream speech recognition models.
The remainder of this paper is organized as follows. Section 2 describes our proposed triggered non-autoregressive transformers. Section 3 presents our experimental setup and results. The conclusions and future work will be given in Section 4.
2 Spike-Triggered Non-Autoregressive Transformer
2.1 Model Architecture
The spike-triggered non-autoregressive transformer consists of an encoder, a decoder, and a CTC module, as depicted in Fig. 1. Both the encoder and the decoder are composed of multi-head attention layers and feed-forward layers [vaswani2017attention], similar to the speech transformer [dong2018speech].
As shown in Fig. 1, we place a 2D convolution front end at the bottom of the encoder to pre-process the input speech feature sequences, including dimension transformation (from 40 to 320), time-axis down-sampling, and the addition of sine-cosine positional information.
The multi-head attention (MHA) layer allows the model to focus on information from different positions. Each head is a complete self-attention component computing softmax(QK^T / sqrt(d_k))V, where Q, K, and V represent the queries, keys, and values respectively, and d_k is the dimension of the keys. W^Q, W^K, and W^V are projection parameter matrices that map the inputs to each head's queries, keys, and values.
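As a concrete reference, the attention computation above can be sketched for a single head in NumPy (the function name and shapes are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single attention head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                            # weighted sum of values
```

A full MHA layer applies this in parallel over several heads after projecting the inputs with W^Q, W^K, and W^V, then concatenates the head outputs.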
The feed-forward network (FFN) contains two linear layers and a gated linear unit (GLU) [dauphin2017language] activation function [Tian2019, fan2019speaker], i.e., FFN(x) = GLU(xW_1 + b_1)W_2 + b_2, where the parameters W_1, b_1, W_2, and b_2 are learnable.
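A minimal NumPy sketch of this position-wise feed-forward layer, assuming the common formulation in which the GLU splits its input in half and gates one half with a sigmoid of the other (names and shapes are illustrative):

```python
import numpy as np

def glu(x):
    # gated linear unit: split the last dimension in half and
    # gate the first half with a sigmoid of the second half
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def ffn(x, W1, b1, W2, b2):
    # position-wise feed-forward: GLU(x W1 + b1) W2 + b2
    # note that W1 must map to twice the inner dimension,
    # since the GLU halves its input
    return glu(x @ W1 + b1) @ W2 + b2
```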
The sine and cosine positional embeddings proposed by [vaswani2017attention] are applied in all the experiments in this paper. Besides, the model also applies residual connections and layer normalization.
The ST-NAT introduces a CTC module to predict the length of the target sequence and accelerate convergence. The CTC module consists of only a linear projection layer. Most non-autoregressive transformer models adopt a fixed-length sequence filled with '⟨mask⟩' tokens as the input of the decoder; these sequences do not contain any useful information. In fact, a CTC spike is usually located within the range of one specific word. Therefore, the ST-NAT utilizes the encoder states corresponding to the CTC spikes as the input of the decoder. We assume that the triggered encoder state sequence contains some prior information about the target words, which makes the decoding process more purposeful than guessing from an empty sequence.
It is very important to predict the length of the target sequence accurately. When the predicted length N' is shorter than the target length N, there is no doubt that the generated sequence will miss many words or characters, causing many deletion errors. Conversely, when the predicted length N' is longer than the target length N, it costs extra computation and may even generate many duplicate tokens. The ST-NAT can predict the length of the target sequence accurately by counting the number of spikes produced by the CTC module. When the probability that the CTC module generates a non-blank token is greater than the trigger threshold γ, the corresponding position is recorded as a trigger. This process can be described as selecting the set of positions {t | 1 − p_blank(t) > γ}, where t indexes the encoder output states and p_blank(t) is the blank probability predicted by the CTC module at position t; the non-blank probability can thus be expressed as 1 − p_blank(t). The ST-NAT also inserts an end-of-sentence token '⟨eos⟩' into the target sequence to guarantee that the model is still able to generate a correct sequence when the predicted length N' is larger than the target length N.
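The counting rule above can be sketched in a few lines of plain Python (the function name and the `blank_probs` input are illustrative; `blank_probs[t]` stands for the blank probability the CTC module assigns to encoder frame t):

```python
def trigger_positions(blank_probs, threshold):
    """Return the encoder frame indices where the CTC non-blank
    probability (1 - p_blank) exceeds the trigger threshold."""
    return [t for t, p_blank in enumerate(blank_probs)
            if 1.0 - p_blank > threshold]

# The predicted target length is the number of triggered spikes,
# and the encoder states at these indices feed the decoder.
spikes = trigger_positions([0.99, 0.15, 0.97, 0.98, 0.05, 0.96], 0.3)
# spikes == [1, 4], i.e. a predicted length of 2
```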
Furthermore, it has been widely shown that the CTC loss function [graves2006connectionist] effectively helps a model accelerate training and convergence [kim2017joint]. It is difficult to train a non-autoregressive model from scratch. Therefore, we use the CTC loss as an auxiliary loss function to optimize the model.
The joint loss can be written as L = λ·L_CTC + (1 − λ)·L_CE, where L_CE is the cross-entropy loss [de2005tutorial], L_CTC is the CTC loss, and λ is the weight of the CTC loss in the joint loss function. Let N' denote the predicted target length and N the real target length. If N' is smaller than N, the ST-NAT utilizes only the CTC loss to optimize the encoder. Thanks to the CTC module, the ST-NAT can be trained from scratch without any pre-training or other tricks.
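The loss selection logic can be sketched as follows (a simplified scalar view, assuming the two branch losses have already been computed; in practice both are tensors produced by the training framework):

```python
def joint_loss(ce_loss, ctc_loss, ctc_weight, pred_len, target_len):
    """Interpolate CTC and cross-entropy losses with weight lambda.

    When the CTC trigger predicts fewer positions than the target has,
    position-wise cross entropy is ill-defined, so only the CTC branch
    is used to optimize the encoder.
    """
    if pred_len < target_len:
        return ctc_loss
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * ce_loss
```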
During inference, we simply select the token with the highest probability at each position. Generating the '⟨eos⟩' token or reaching the last position in the sequence marks the end of the decoding process.
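This greedy parallel decoding can be sketched in plain Python (names are illustrative; `position_probs[i]` stands for the decoder's output distribution at position i, all positions being computed in one parallel pass):

```python
def greedy_decode(position_probs, eos_id):
    """Pick the argmax token at every output position, then
    truncate the sequence at the first end-of-sentence token."""
    tokens = [max(range(len(dist)), key=dist.__getitem__)
              for dist in position_probs]
    if eos_id in tokens:
        tokens = tokens[:tokens.index(eos_id)]
    return tokens
```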
A non-autoregressive model cannot capture the temporal dependencies between the output labels, which largely limits its performance. We therefore introduce a transformer-based language model into the decoding process; the neural language model compensates for this weakness of the non-autoregressive model. The joint decoding process can be described as
Y' = argmax_Y [log P_NAT(Y|X) + β·log P_LM(Y)], where Y' is the predicted sequence, P_LM is the probability assigned by the language model, and β is the weight of the language model probabilities.
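A minimal sketch of this log-linear combination for scoring one candidate sequence (names are illustrative; `model_probs` and `lm_probs` stand for the per-token probabilities from the ST-NAT and the language model):

```python
import math

def joint_score(model_probs, lm_probs, lm_weight):
    """Log-linear combination of ST-NAT and language-model
    probabilities for one candidate sequence; the candidate
    with the highest score is selected."""
    return (sum(math.log(p) for p in model_probs)
            + lm_weight * sum(math.log(p) for p in lm_probs))
```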
3 Experiments and Results
In this work, all experiments are conducted on AISHELL-1 (http://www.openslr.org/13/), a public Mandarin speech corpus. The training set contains about 150 hours of speech (120,098 utterances) recorded by 340 speakers. The development set contains about 20 hours (14,326 utterances) recorded by 40 speakers, and about 10 hours (7,176 utterances / 36,109 seconds) of speech are used as the test set. The speakers of the different sets do not overlap.
3.2 Experiment Setup
For all experiments, we use 40-dimensional FBANK features computed on a 25 ms window with a 10 ms shift. We choose 4,233 characters (including a padding symbol '⟨pad⟩', an unknown symbol '⟨unk⟩', and an end-of-sentence symbol '⟨eos⟩') as model units.
Our proposed model and the baseline models are built on OpenTransformer (https://github.com/ZhengkunTian/OpenTransformer). The ST-NAT model consists of 6 encoder blocks and 6 decoder blocks, with 4 heads in each multi-head attention layer. The 2D convolution front end utilizes a two-layer time-axis CNN with ReLU activation, stride 2, 320 channels, and kernel size 3. The output size of the multi-head attention and feed-forward layers is 320. We adopt an Adam optimizer with 12,000 warm-up steps and the learning rate schedule reported in [vaswani2017attention]. After 80 epochs, we average the parameters saved in the last 20 epochs. We also use the time-mask and frequency-mask methods proposed in [park2019specaugment] for the baseline transformer, SAN-CTC, and all non-autoregressive models. During inference, we use a beam search with a width of 5 for the baseline transformer model, the SAN-CTC model, and the ST-NAT with a language model.
We use the character error rate (CER) to evaluate the performance of the different models. To evaluate their inference speed, we decode utterances one by one on the test set and compute the real-time factor (RTF), i.e., the time taken to decode one second of speech. All experiments are conducted on a GeForce GTX TITAN X 12G GPU.
3.3.1 Explore the effects of different weights and trigger thresholds.
We train the ST-NAT model with different CTC weights and trigger thresholds from scratch. As shown in Table 1, the ST-NAT model with CTC weight 0.7 and trigger threshold 0.3 achieves a CER of 7.66% on the test set. At the same threshold, the ST-NAT with weight 0.6 achieves the best performance on the development set. The CTC weights and trigger thresholds affect the performance of the model in different ways: the CTC weight balances the CTC trigger module against the decoder, whereas the trigger threshold determines how many encoder states are triggered. Both play important roles in the performance of the models.
3.3.2 Explore the effects of different trigger thresholds on the inference speed.
We evaluate our ST-NAT with different trigger thresholds in terms of inference speed. All the ST-NAT models here are trained with a CTC weight of 0.6. It is obvious from Table 2 that the larger the threshold, the faster the model decodes an utterance. When the trigger threshold is 0.7, the model achieves an RTF of 0.0054, which means the model has a latency of only about 20 milliseconds. However, a large threshold does not mean that the model achieves the best accuracy: a large trigger threshold may cause the predicted length generated by the CTC trigger to be shorter than the target length, which in turn hurts the performance of the model. Fortunately, different trigger thresholds have only a negligible effect on the inference speed.
3.3.3 Analysis on trigger mechanism.
We analyze the spike-triggered non-autoregressive transformer from the following two perspectives.
On the one hand, we explore the relationship between the length predicted by the model and the target length, as shown in Fig. 2. The histogram records the difference between the target length and the predicted length. When this value is less than or equal to zero, the predicted length is greater than or equal to the target length; this causes no irreversible effects, since the decoder is still able to predict an end-of-sentence token. We find that the vast majority of predicted lengths contain no error at all; moreover, the probability of missing words or characters is even less than 2%, and for most weights (0.3, 0.5 and 0.7) the maximum prediction error does not exceed 4. Therefore, we conclude that the CTC module can predict the length of the target sequence almost exactly. However, if the value is larger than zero, the model will miss some words permanently. We can mitigate this problem by adding a padding bias to the predicted length.
On the other hand, Fig. 3(a) shows the relationship between the trigger positions and the word pronunciation boundaries. There is no triggered spike within the range of silence. Within the scope of the last pronounced word, there are two triggered spikes, because we also take the end-of-sentence token into consideration during training. It is obvious that each spike lies within the boundary of a word. Therefore, our assumption that the triggered encoder state sequence contains more prior information about the target sequence is reasonable. It is also obvious from Fig. 3(b) that the ST-NAT model aligns the target sequence better to the encoded state sequence. What's more, the center of the alignment position almost coincides with the trigger position, which again verifies our assumption.
Model                                 | Dev CER (%) | Test CER (%) | RTF
TDNN-Chain (Kaldi) [povey2016purely]  | -           | 7.45         | -
SAN-CTC * [salazar2019self]           | 7.83        | 8.74         | 0.0168
NAT-MASKED * [chen2019non]            | 7.16        | 8.03         | 0.0058

* These models are re-implemented by ourselves according to the papers. We supplement the RTF of our previous two models.
3.3.4 Compare with other models.
We also compare our proposed ST-NAT model with various mainstream models, e.g., a traditional hybrid model, a CTC-based model, a transducer model, and attention-based sequence-to-sequence models. Under the same training conditions and with the same model parameters, we train a Speech-Transformer [dong2018speech], a NAT-MASKED model [chen2019non], and our proposed ST-NAT model, where the speech transformer applies a beam search with beam width 5 to decode utterances.
From Table 3, we find that the ST-NAT models achieve performance comparable with the advanced speech-transformer model [dong2018speech] and the TDNN-Chain model [povey2016purely], and better than LAS. From another perspective, the ST-NAT has the fastest inference speed among them, with an RTF of only about one tenth of the speech-transformer's. The ST-NAT with a transformer language model achieves the best CER of 7.02% on the test set and an RTF of 0.0292.
Compared with streaming end-to-end models, e.g., SAN-CTC [salazar2019self], Sync-Transformer [tian2019synchronous], and SA-Transducer [Tian2019], the ST-NAT achieves not only the best performance but also the fastest inference speed. We attribute this to the fact that the ST-NAT can decode an utterance with full context and without temporal dependencies.
By contrast, we also re-implement a NAT-MASKED model in a BERT-like way [chen2019non], which adopts a fixed-length (set to 60) mask sequence as input. The NAT-MASKED model has the same parameters as our ST-NAT except for the CTC module. We find that the ST-NAT achieves better performance; we conjecture that it is difficult for the NAT-MASKED model to learn to predict the target words (or characters) and the target length jointly. The two models have very similar inference speeds.
4 Conclusions and Future Works
To estimate the length of the target sequence accurately and accelerate the inference speed, we proposed a spike-triggered non-autoregressive transformer (ST-NAT) for end-to-end speech recognition, which introduces a CTC module to predict the target length and accelerate convergence. The ST-NAT adopts the encoder states corresponding to the positions of the CTC spikes as the input of the decoder. During inference, the ST-NAT can count the number of spikes to avoid redundant calculations. We conducted all experiments on AISHELL-1, a public Mandarin Chinese dataset. The results show that the CTC module can accurately predict the length of the target sequence, and that the ST-NAT achieves performance comparable with the advanced speech transformer model. Moreover, the ST-NAT has a real-time factor of 0.0056, which exceeds all mainstream models, and the ST-NAT with a language model still retains a very high inference speed. In the future, we will try to utilize the CTC module for joint decoding to further improve the performance of the model during inference.