1 Introduction
Although autoregressive models have achieved great success in various NLP tasks and speech recognition [bahdanau2014neural, chorowski2015attention, chan2016listen, vaswani2017attention, kim2017joint, dong2018speech], their autoregressive nature results in large latency during inference [lee2018deterministic]. Most attention-based sequence-to-sequence models generate the target sequence in an autoregressive fashion: they predict the next token conditioned on the previously generated tokens and the source state sequence. By contrast, a non-autoregressive model gets rid of this temporal dependency and is able to compute in parallel, greatly improving inference speed. Non-autoregressive transformers (NAT) have achieved results comparable to autoregressive models in neural machine translation and speech recognition [lee2018deterministic, gu2017non, ma2019flowseq, wang2019non, chen2019non, libovicky2018end, moritz2019triggered].
Different from the autoregressive sequence-to-sequence model, the NAT takes a fixed-length mask sequence as input to predict the target sequence. The setting of this predefined length is very important. If the length is shorter than the actual length, it will cause many deletion errors. On the contrary, a longer length will cause the model to generate duplicate tokens and consume additional computation. To the best of our knowledge, there are three ways to estimate the length of the target sequence. Firstly, some works introduce a neural network module behind the encoder to predict the target length
[lee2018deterministic, gu2017non, ma2019flowseq]. These methods cannot guarantee the accuracy of the predicted lengths; during inference, it is necessary to sample different lengths and select the optimal sequence. Secondly, [wang2019non, chen2019non] set an empirical (or maximum) length based on the length of the source sequence. To guarantee the performance of the model, this length is often much longer than the actual length of the target sequence, which results in extra computation cost and affects the inference speed. Thirdly, [libovicky2018end] utilizes the CTC loss function instead of the cross entropy to optimize the model, which allows the model to generate tokens without estimating the length of the target sequence. However, the characteristics of CTC cause the model to generate some duplicate tokens and a large number of blanks during inference, and this approach does not accelerate inference.
For speech recognition, the number of valid characters or words contained in a piece of speech is affected by various factors such as the speaker's speech rate, silence, and noise. It is unreasonable to set a fixed length only according to the duration of the audio. To estimate the length of the target sequence accurately and accelerate inference, we propose a spike-triggered non-autoregressive transformer (ST-NAT) for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate convergence. The CTC loss plays three important roles in our proposed model. Firstly, the ST-NAT utilizes the CTC module to predict the length of target sequences. The CTC module generates spike-like label posterior probabilities, and the number of spikes accurately reflects the length of the target sequence
[ma2019flowseq, moritz2019streaming]. During inference, the ST-NAT counts the number of spikes to avoid redundant calculations. Secondly, the ST-NAT adopts the encoder states corresponding to the positions of the spikes as the input of the decoder. We assume that this triggered encoder state sequence carries more prior information than a mask sequence, which may improve the performance of the model. Thirdly, the ST-NAT adopts the CTC loss as an auxiliary loss to speed up training and convergence [kim2017joint]. Additionally, a non-autoregressive transformer cannot model the interdependencies between the outputs. Therefore, we further improve performance by integrating the output probabilities predicted by the ST-NAT with a neural language model. All experiments are conducted on the public Chinese Mandarin dataset AISHELL-1. The results show that the ST-NAT can predict the length of the target sequence accurately and achieves performance comparable to the most advanced end-to-end models. The probability of missing words or characters is less than 2%. What's more, the model achieves a real-time factor (RTF) of 0.0056, which exceeds all mainstream speech recognition models. The remainder of this paper is organized as follows. Section 2 describes our proposed spike-triggered non-autoregressive transformer. Section 3 presents our experimental setup and results. Conclusions and future work are given in Section 4.
2 Spike-Triggered Non-Autoregressive Transformer
2.1 Model Architecture
The spike-triggered non-autoregressive transformer consists of an encoder, a decoder, and a CTC module, as depicted in Fig. 1. Both the encoder and the decoder are composed of multi-head attention layers and feed-forward layers [vaswani2017attention], similar to the speech transformer [dong2018speech].
As shown in Fig. 1, we put a 2D convolution front end at the bottom of the encoder to pre-process the input speech feature sequences, including dimension transformation (from 40 to 320), time-axis downsampling, and adding sine-cosine positional information.
The multi-head attention (MHA) layer allows the model to focus on information from different positions. Each head is a complete self-attention component. Q, K, and V represent the queries, keys, and values respectively, and d_k is the dimension of the keys. W_i^Q, W_i^K, W_i^V, and W^O are projection parameter matrices.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (1)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
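As an illustration of Eqs. (1) and (2), a minimal NumPy sketch is given below. The random projection matrices stand in for the learned parameters W_i^Q, W_i^K, W_i^V, and W^O; the dimensions (d_model = 320, 4 heads) follow the configuration described in Section 3.2.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention, Eq. (1)
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, n_heads, rng):
    # Eq. (2): project per head, attend, concatenate, project back
    d_model = Q.shape[-1]
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # W_i^Q, W_i^K, W_i^V (random stand-ins for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    W_o = rng.standard_normal((d_model, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 320))  # 10 encoder frames, d_model = 320
out = multi_head_attention(x, x, x, n_heads=4, rng=rng)
print(out.shape)  # (10, 320)
```

In self-attention, as here, the same sequence serves as queries, keys, and values.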
The feed-forward network (FFN) contains two linear layers with a gated linear unit (GLU) [dauphin2017language] activation function [Tian2019, fan2019speaker]:

FFN(x) = GLU(x W_1 + b_1) W_2 + b_2    (3)

where the parameters W_1, b_1, W_2, and b_2 are learnable.
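The following sketch shows one plausible realization of Eq. (3). The exact bias placement and the doubled width of the first projection (the GLU halves the dimension by gating) are standard but should be read as assumptions; the weights here are random stand-ins.

```python
import numpy as np

def glu(x):
    # Gated Linear Unit: split the last dimension in half and
    # gate the first half with a sigmoid of the second half
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def ffn(x, W1, b1, W2, b2):
    # Eq. (3): FFN(x) = GLU(x W1 + b1) W2 + b2
    return glu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 320, 1280
x = rng.standard_normal((10, d_model))
W1 = rng.standard_normal((d_model, 2 * d_ff)) * 0.01  # 2x width for GLU
b1 = np.zeros(2 * d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.01
b2 = np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 320)
```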
The sine and cosine positional embedding proposed by [vaswani2017attention] is applied in all the experiments in this paper. Besides, the model also applies residual connections and layer normalization.
The ST-NAT introduces a CTC module to predict the length of the target sequence and accelerate convergence. The CTC module consists of only a linear projection layer. Most non-autoregressive transformer models adopt a fixed-length sequence filled with '<MASK>' as the input of the decoder. These sequences do not contain any useful information. In fact, a CTC spike is usually located within the range of one specific word. Therefore, the ST-NAT utilizes the encoded states corresponding to the CTC spikes as the input of the decoder. We assume that the triggered encoder state sequence contains some prior information on the target words, which makes the decoding process more purposeful than guessing from an empty sequence.
2.2 Training
It is very important to predict the length of the target sequence accurately. When the predicted length N' is shorter than the target length N, there is no doubt that the generated sequence will miss many words or characters, causing many deletion errors. Conversely, when the predicted length is longer than the target length, it costs extra computation and may even generate many duplicate tokens. The ST-NAT predicts the length of the target sequence accurately by counting the number of spikes produced by the CTC module. When the probability that the CTC module generates a non-blank token is greater than the trigger threshold λ, the corresponding trigger position is recorded. This process can be described as follows.
T = { t | 1 − p_t(blank) > λ }    (4)

where t indexes the positions of the encoder output states and p_t(blank) is the blank probability predicted by the CTC module at position t; the probability of a non-blank token can then be expressed as 1 − p_t(blank). The ST-NAT also inserts an end-of-sentence token '<eos>' into the target sequence to guarantee that the model can still generate a correct sequence when the predicted length N' is larger than the target length N.
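The spike selection of Eq. (4) amounts to a simple per-frame threshold test. A sketch with invented per-frame blank probabilities:

```python
def trigger_positions(blank_probs, threshold):
    """Return the encoder-frame indices where the non-blank probability
    1 - p_t(blank) exceeds the trigger threshold (Eq. 4)."""
    return [t for t, p_blank in enumerate(blank_probs)
            if 1.0 - p_blank > threshold]

# hypothetical per-frame blank probabilities from the CTC module:
# low values mark spikes, high values mark blanks/silence
blank_probs = [0.99, 0.98, 0.10, 0.97, 0.95, 0.20, 0.99, 0.05, 0.99]
spikes = trigger_positions(blank_probs, threshold=0.3)
print(spikes)       # [2, 5, 7]
print(len(spikes))  # predicted target length N' = 3
```

The encoder states at the triggered indices are then gathered and fed to the decoder in place of a mask sequence.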
Furthermore, it has been widely demonstrated that the CTC loss function [graves2006connectionist] is effective in helping a model train and converge faster [kim2017joint]. It is difficult to train a non-autoregressive model from scratch. Therefore, we use the CTC loss as an auxiliary loss function to optimize the model.
L = λ L_CTC + (1 − λ) L_CE,  if N' ≥ N
L = L_CTC,                   if N' < N    (5)

where L_CE is the cross entropy loss [de2005tutorial] and L_CTC is the CTC loss. λ is the weight of the CTC loss in the joint loss function, N' is the predicted target length, and N is the real target length. If N' is smaller than N, the ST-NAT only utilizes the CTC loss to optimize the encoder. Thanks to the CTC module, the ST-NAT can be trained from scratch without any pre-training or other tricks.
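A minimal sketch of the case distinction in Eq. (5). The linear-interpolation form λ·L_CTC + (1 − λ)·L_CE is an assumption consistent with λ being described as the CTC weight; the loss values below are invented for illustration.

```python
def joint_loss(ce_loss, ctc_loss, lam, pred_len, target_len):
    # Eq. (5): if the CTC module triggers fewer spikes than the target
    # length, only the CTC loss (which optimizes the encoder) is used;
    # otherwise the two losses are interpolated with weight lam.
    if pred_len < target_len:
        return ctc_loss
    return lam * ctc_loss + (1.0 - lam) * ce_loss

# enough spikes: both losses contribute (0.6 * 4.0 + 0.4 * 2.0 = 3.2)
print(joint_loss(ce_loss=2.0, ctc_loss=4.0, lam=0.6,
                 pred_len=12, target_len=12))
# too few spikes: fall back to CTC only
print(joint_loss(ce_loss=2.0, ctc_loss=4.0, lam=0.6,
                 pred_len=10, target_len=12))  # 4.0
```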
2.3 Inference
During inference, we simply select the token with the highest probability at each position. Generating the token '<eos>' or the last word in the sequence marks the end of the decoding process.
A non-autoregressive model cannot model the temporal dependencies between the output labels, which largely limits model performance. We therefore introduce a transformer-based language model into the decoding process; the neural language model makes up for this weakness of the non-autoregressive model. The joint decoding process can be described as

Ŷ = argmax_Y { log P_NAT(Y | X) + β log P_LM(Y) }    (6)

where Ŷ is the predicted sequence, P_LM(Y) is the probability assigned by the language model, and β is the weight of the language model probabilities.
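One simple way to realize Eq. (6) is to rescore an n-best candidate list with the language model; this sketch is illustrative, and the two hypotheses and their log-probabilities are invented.

```python
def joint_score(log_p_nat, log_p_lm, beta):
    # Eq. (6): combine the NAT score with the LM score, weighted by beta
    return log_p_nat + beta * log_p_lm

def rescore(candidates, beta):
    """candidates: list of (sequence, log_p_nat, log_p_lm) tuples,
    e.g. an n-best list from beam search. Returns the best sequence."""
    return max(candidates, key=lambda c: joint_score(c[1], c[2], beta))[0]

# hypothetical 2-best list: the NAT slightly prefers the first
# hypothesis, but the LM strongly prefers the second
nbest = [("hyp_a", -1.0, -8.0),
         ("hyp_b", -1.2, -3.0)]
print(rescore(nbest, beta=0.5))  # "hyp_b" (-2.7 beats -5.0)
```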
3 Experiments and Results
3.1 Dataset
In this work, all experiments are conducted on the public Mandarin speech corpus AISHELL-1 (http://www.openslr.org/13/). The training set contains about 150 hours of speech (120,098 utterances) recorded by 340 speakers. The development set contains about 20 hours (14,326 utterances) recorded by 40 speakers. About 10 hours (7,176 utterances / 36,109 seconds) of speech is used as the test set. The speakers of the different sets do not overlap.
3.2 Experiment Setup
For all experiments, we use 40-dimensional FBANK features computed on a 25 ms window with a 10 ms shift. We chose 4233 characters (including a padding symbol '<PAD>', an unknown symbol '<UNK>', and an end-of-sentence symbol '<eos>') as model units. Our proposed model and baseline models are built on OpenTransformer (https://github.com/ZhengkunTian/OpenTransformer).
The ST-NAT model consists of 6 encoder blocks and 6 decoder blocks. There are 4 heads in the multi-head attention. The 2D convolution front end utilizes a two-layer time-axis CNN with ReLU activation, stride 2, 320 channels, and kernel size 3. The output size of the multi-head attention and feed-forward layers is 320. We adopt an Adam optimizer with 12000 warmup steps and the learning rate scheduler reported in [vaswani2017attention]. After 80 epochs, we average the parameters saved in the last 20 epochs. We also use the time mask and frequency mask method proposed in [park2019specaugment] for the baseline transformer, SAN-CTC, and all non-autoregressive models. During inference, we use a beam search with a width of 5 for the baseline transformer model, the SAN-CTC model, and the ST-NAT with a language model. We use the character error rate (CER) to evaluate the performance of different models. To evaluate inference speed, we decode utterances one by one to compute the real-time factor (RTF) on the test set; the RTF is the time taken to decode one second of speech. All experiments are conducted on a GeForce GTX TITAN X 12G GPU.
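The RTF definition above is a simple ratio. Using the 36,109-second test set from Section 3.1 and the total decoding times reported later in Table 2:

```python
def real_time_factor(decode_seconds, audio_seconds):
    # RTF = time taken to decode one second of speech
    return decode_seconds / audio_seconds

# total decoding time for the 36,109-second AISHELL-1 test set
print(round(real_time_factor(212.04, 36109), 4))  # 0.0059 (threshold 0.1)
print(round(real_time_factor(202.59, 36109), 4))  # 0.0056 (threshold 0.3)
```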
3.3 Results
3.3.1 Explore the effects of different weights and trigger thresholds.
We train the ST-NAT model with different CTC weights and trigger thresholds from scratch. As shown in Table 1, the ST-NAT model with CTC weight 0.7 and trigger threshold 0.3 achieves a CER of 7.66% on the test set. At the same threshold, the ST-NAT with weight 0.6 achieves the best performance on the development set. The CTC weights and trigger thresholds affect the performance of the model in different ways: the CTC weight balances the performance of the CTC trigger module and the decoder, while the trigger threshold determines how many encoder states are triggered. Both play important roles in the performance of the model.
Table 1: CER (%) of the ST-NAT on the development/test sets for different CTC weights and trigger thresholds.

CTC Weight | Threshold 0.1 | 0.3       | 0.5       | 0.7
0.1        | 7.67/8.66     | 7.56/8.50 | 7.66/8.45 | 7.56/8.50
0.3        | 7.37/8.14     | 7.25/8.12 | 7.26/8.19 | 7.21/8.01
0.5        | 7.06/7.97     | 7.10/7.88 | 7.38/8.14 | 7.30/8.12
0.6        | 7.06/7.88     | 6.88/7.67 | 7.05/7.77 | 7.01/7.70
0.7        | 7.26/8.05     | 6.91/7.66 | 7.03/7.87 | 7.39/8.02
Table 2: Test CER (%), total decoding time, and RTF for different trigger thresholds (CTC weight 0.6).

Threshold | 0.1    | 0.3    | 0.5    | 0.7
CER (%)   | 7.88   | 7.67   | 7.77   | 7.70
Seconds   | 212.04 | 202.59 | 200.62 | 198.44
RTF       | 0.0059 | 0.0056 | 0.0055 | 0.0054
3.3.2 Explore the effects of different trigger thresholds on the inference speed.
We evaluate the inference speed of our ST-NAT with different trigger thresholds. All ST-NAT models here are trained with a CTC weight of 0.6. It is obvious from Table 2 that the larger the threshold, the faster the model decodes an utterance. When the trigger threshold is 0.7, the model achieves an RTF of 0.0054, meaning a latency of only about 20 milliseconds per utterance. However, a larger threshold does not mean that the model achieves the best performance: a large trigger threshold may cause the predicted length generated by the CTC trigger to be shorter than the target length, which in turn hurts performance. Fortunately, different trigger thresholds have only a negligible effect on the inference speed.
3.3.3 Analysis on trigger mechanism.
We analyze the spike-triggered non-autoregressive transformer from the following two perspectives.
On the one hand, we explore the relationship between the length predicted by the model and the target length, as shown in Fig. 2. The histogram records the difference between the target length and the predicted length. When the value is less than or equal to zero, the predicted length is greater than or equal to the target length; this causes no irreversible effect, since the decoder can still end the sentence by predicting an end-of-sentence token. We find that the vast majority of predicted lengths have no error at all, and the probability of missing words or characters is even less than 2%. For most of the weights (0.3, 0.5, and 0.7), the maximum prediction error does not exceed 4. Therefore, we conclude that the CTC module can predict the length of the target sequence approximately accurately. However, if the value is larger than zero, the predicted length is shorter than the target length and the model will miss some words permanently. This problem can be mitigated by adding a padding bias to the predicted length.
On the other hand, Fig. 3(a) shows the relationship between the trigger positions and the word pronunciation boundaries. There is no triggered spike in the range of silence. Within the scope of the last pronounced word, there are two triggered spikes, because we also take an end-of-sentence token into consideration during training. It is obvious that each spike falls within the boundary of a word. Therefore, our assumption that the triggered encoder state sequence contains more prior information about the target sequence is reasonable. It is also obvious from Fig. 3(b) that the ST-NAT model aligns the target sequence well to the encoded state sequence. What's more, the center of the alignment position almost coincides with the trigger position, which again verifies our assumption.
Table 3: CER (%) and RTF of different models on AISHELL-1.

Model                                    | DEV  | TEST  | RTF
TDNN-Chain (Kaldi) [povey2016purely]     | -    | 7.45  | -
LAS [8682490]                            | -    | 10.56 | -
Speech-Transformer * [dong2018speech]    | 6.57 | 7.37  | 0.0504
SA-Transducer † [Tian2019]               | 8.30 | 9.30  | 0.1536
SAN-CTC * [salazar2019self]              | 7.83 | 8.74  | 0.0168
Sync-Transformer † [tian2019synchronous] | 7.91 | 8.91  | 0.1183
NAT-MASKED * [chen2019non]               | 7.16 | 8.03  | 0.0058
ST-NAT (ours)                            | 6.88 | 7.67  | 0.0056
ST-NAT + LM (ours)                       | 6.39 | 7.02  | 0.0292

* These models are re-implemented by ourselves according to the papers.
† We supplement the RTF of our previous two models.
3.3.4 Compare with other models.
We also compare our proposed ST-NAT model with various mainstream models, e.g. a traditional hybrid model, a CTC-based model, a transducer model, and attention-based sequence-to-sequence models. Under the same training conditions and with the same model parameters, we train a Speech-Transformer [dong2018speech], a NAT-MASKED model [chen2019non], and our proposed ST-NAT model, where the Speech-Transformer applies a beam search with beam width 5 to decode utterances.
From Table 3, we can see that the ST-NAT models achieve performance comparable to the advanced Speech-Transformer [dong2018speech] and TDNN-Chain [povey2016purely] models, and better than LAS. From another perspective, the ST-NAT has the fastest inference speed among them; its RTF is only about 1/10 of that of the Speech-Transformer. The ST-NAT with a transformer language model achieves the best CER of 7.02% on the test set, with an RTF of 0.0292.
Compared with streaming end-to-end models, e.g. SAN-CTC [salazar2019self], Sync-Transformer [tian2019synchronous], and SA-Transducer [Tian2019], the ST-NAT achieves not only the best performance but also the fastest inference speed. We attribute this to the fact that the ST-NAT can decode an utterance with full context and without temporal dependencies.
By contrast, we also re-implement a NAT-MASKED model in a BERT-like way [chen2019non], which adopts a fixed-length (set to 60) mask sequence as input. The NAT-MASKED model has the same parameters as our ST-NAT except for the CTC module. We find that the ST-NAT achieves better performance; we conjecture that it is difficult for the mask-based model to learn to predict the target words (or characters) and the target length jointly. The two models have very similar inference speeds.
4 Conclusions and Future Work
To estimate the length of the target sequence accurately and accelerate inference, we proposed a spike-triggered non-autoregressive transformer (ST-NAT) for end-to-end speech recognition, which introduces a CTC module to predict the target length and accelerate convergence. The ST-NAT adopts the encoded states corresponding to the positions of the CTC spikes as the input of the decoder. During inference, the ST-NAT counts the number of spikes to avoid redundant calculations. We conducted all experiments on the public Chinese Mandarin dataset AISHELL-1. The results show that the CTC module can accurately predict the length of the target sequence, and the ST-NAT model achieves performance comparable to the advanced speech transformer model. Moreover, the ST-NAT reaches a real-time factor of 0.0056, which exceeds all mainstream models, and the ST-NAT with a language model still retains a very high inference speed. In the future, we will try to utilize the CTC module for joint decoding to further improve the performance of the model during inference.