A key challenge in implementing such a method lies in the decoding process. Since CTC-based models have no built-in decoding procedure, a beam search strategy is often used to achieve satisfactory accuracy. This requires an external language model (LM) to provide prior information on which sequences of words are most probable.
While both CTC and sequence-to-sequence models can process the whole sequence of speech data in a bidirectional manner, the traditional beam search method does not take advantage of future information, since the current prediction relies only on previous outputs. Although there have been attempts to exploit the whole sequence in the decoding process of CTC-based methods, their effectiveness has not been demonstrated in automatic speech recognition.
In this paper, we propose a bi-directional decoding method with noisy future context, together with a training method robust to noisy context. As the ASR model, we employ the Connectionist Temporal Classification (CTC) model instead of an autoregressive model, because the latter relies on label-synchronous decoding, which makes it difficult to know which frame of speech is being decoded at each time step; frame-synchronous decoding [13, 17] makes this straightforward.
The majority of bi-directional LMs [24, 4, 14] are designed for N-best LM rescoring, with architectural advances from LSTM-RNN to neural random fields and Transformer encoders. Although N-best LM rescoring can replace or be combined with our decoding method, we argue that rescoring generally underperforms shallow fusion: the former only picks the best hypothesis among first-pass search results, while the latter can use the LM score at every decoding time step of the beam search.
use a bidirectional shallow fusion decoding algorithm, with forward/backward uni-directional decoders on top of a bi-directional encoder architecture. Similar to our work, they use the result of greedy decoding from the backward decoder as the future context. However, greedy decoding of an autoregressive model is known to suffer from the exposure bias problem, which degrades the quality of generation and requires expensive supervision (e.g., scheduled sampling, REINFORCE) to fix.
2 CTC greedy and prefix beam search
Assume that a sequence of speech representations $\mathbf{x} = (x_1, \ldots, x_T)$, the corresponding label sequence $\mathbf{y} = (y_1, \ldots, y_U)$, and its alignment $\pi$ are given. The alignment sequence $\pi = (\pi_1, \ldots, \pi_T)$ consists of a special CTC character called the blank label as well as output labels from the label dictionary $\mathcal{V}$, and is mapped to a unique result label sequence $\mathbf{y} = \mathcal{B}(\pi)$ by removing all blanks and repeated labels.
For each time step $t$, greedy decoding of the CTC model chooses the most probable CTC output label $\hat{\pi}_t$ using the probability $p(\pi_t \mid \mathbf{x})$, namely

$$\hat{\pi}_t = \operatorname*{arg\,max}_{k \in \mathcal{V} \cup \{\text{blank}\}} p(\pi_t = k \mid \mathbf{x}). \qquad (1)$$
Greedy decoding ends with the actual label sequence obtained through the mapping, $\hat{\mathbf{y}} = \mathcal{B}(\hat{\pi})$. Greedy decoding is not guaranteed to find the most probable output sequence, since it assumes the probability of the sequence factorizes independently over frames. However, it is very simple and fast, because the only computation required is finding the most probable output label at each time step $t$.
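As a concrete illustration, the per-frame argmax followed by the collapse mapping $\mathcal{B}$ can be sketched as follows (a minimal sketch; the label indices and the convention that index 0 is the blank are illustrative):

```python
def ctc_greedy_decode(probs, blank=0):
    """probs: T x V matrix of per-frame label probabilities."""
    # Pick the most probable label at every frame independently.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    # The CTC mapping B: collapse repeats, then remove blanks.
    decoded = []
    prev = None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```

For example, a best path `[1, 1, blank, 2]` collapses to `[1, 2]`.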
Beyond greedy decoding, we can efficiently calculate the probabilities of successive extensions of each labelling prefix with prefix search decoding. However, in full prefix search the number of prefixes grows exponentially as the input sequence gets longer. For that reason, prefix search decoding should be combined with beam search, which limits the number of extensions kept at each search step.
The CTC model is based on a conditional independence assumption; therefore, decoding often requires an external language model to exploit the dependency between outputs. Assume that we have an external language model which models the probability $p_{\mathrm{LM}}(y_u \mid y_{1:u-1})$. CTC beam search can be performed with a certain beam width $B$ by keeping the $B$ most probable candidates at each time step $t$. We seek the optimal CTC label sequence $\hat{\mathbf{y}}$ such that

$$\hat{\mathbf{y}} = \operatorname*{arg\,max}_{\mathbf{y}} \; p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) \, p_{\mathrm{LM}}(\mathbf{y})^{\alpha}, \qquad (2)$$

where $p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x})$ is the CTC probability, $p_{\mathrm{LM}}(\mathbf{y})$ is the language model probability, $y_u$ is an output label of the CTC label sequence, and $\alpha$ is a hyperparameter controlling the effect of the language model. At each time step, the $B$ most probable beams according to the probability in (2) are kept and carried over to the next time step.
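The prefix beam search described above can be sketched as follows. This is a simplified probability-domain version: real implementations work in log space and usually add a length reward, and the LM hook `lm` with weight `alpha` is an illustrative interface, not the paper's exact one:

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, beam_width=4, lm=None, alpha=0.3, blank=0):
    """probs: T x V per-frame probabilities.
    lm(prefix, label) -> P(label | prefix), optional shallow-fusion LM.
    Returns the most probable label prefix as a tuple of label indices."""
    # Each prefix keeps two scores: probability of paths ending in blank (pb)
    # and ending in a non-blank (pnb).
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (pb, pnb) in beams.items():
            for label, p in enumerate(frame):
                if p == 0.0:
                    continue
                if label == blank:
                    npb, npnb = next_beams[prefix]
                    next_beams[prefix] = (npb + (pb + pnb) * p, npnb)
                elif prefix and label == prefix[-1]:
                    # Repeated label: same prefix if no blank in between,
                    # extended prefix only via the blank-ending paths.
                    npb, npnb = next_beams[prefix]
                    next_beams[prefix] = (npb, npnb + pnb * p)
                    ext = prefix + (label,)
                    epb, epnb = next_beams[ext]
                    lm_p = lm(prefix, label) ** alpha if lm else 1.0
                    next_beams[ext] = (epb, epnb + pb * p * lm_p)
                else:
                    ext = prefix + (label,)
                    epb, epnb = next_beams[ext]
                    lm_p = lm(prefix, label) ** alpha if lm else 1.0
                    next_beams[ext] = (epb, epnb + (pb + pnb) * p * lm_p)
        # Prune to the beam_width most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    return max(beams, key=lambda k: sum(beams[k]))
```

Unlike greedy decoding, this sums probability mass over all alignments of each prefix, so it can prefer a labelling whose individual alignments are not the single best path.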
3 CTC Bidirectional Beam Search
In this section, we propose the bi-directional decoding method with noisy future context for CTC. In addition, we present a training method that makes the bi-directional language model robust to noisy future context.
3.1 Bidirectional Language Model
Assume that a bi-directional context including future information is given. Then we can predict the next output token $y_u$ with both a forward language model using the past information $y_{1:u-1}$ and a backward language model using the future information $y_{u+1:U}$, as follows:

$$p(y_u \mid \mathbf{y}_{\setminus u}) \propto p_{\mathrm{fw}}(y_u \mid y_{1:u-1}) \, p_{\mathrm{bw}}(y_u \mid y_{u+1:U}), \qquad (3)$$

where $p_{\mathrm{fw}}$ represents the forward language model and $p_{\mathrm{bw}}$ the backward language model, respectively (Fig. 2-(a)).
The bidirectional language model can be realized with a Transformer as well as an RNN, treating the future sequence as memory in the Transformer. We used both LSTM and Transformer-XL to train the bidirectional language model. It needs two input sequences: a forward sequence for the past text and a backward sequence for the future text. Feeding the two input sequences, we obtain two hidden states, one from each direction. The combined state is obtained by adding these two hidden states and is used to calculate the final probability vector.
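The combination of the two directional hidden states into an output distribution can be sketched as follows (a minimal sketch; the hypothetical projection matrix `proj` stands in for the model's learned output layer):

```python
import math

def combine_bilm_states(h_fw, h_bw, proj):
    """Sum forward/backward hidden states, project to the vocabulary,
    and normalize with a softmax. proj: V x H weight matrix."""
    h = [f + b for f, b in zip(h_fw, h_bw)]          # combined hidden state
    logits = [sum(w * x for w, x in zip(row, h)) for row in proj]
    m = max(logits)                                   # stabilized softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```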
3.2 CTC Decoding with Bidirectional Language Model
As described in Fig. 1, while decoding through time, the greedy decoding result provides a noisy future context for the backward language model. In (1), for the greedy decoding result $\hat{\mathbf{y}}^{g} = \mathcal{B}(\hat{\pi})$, we may take the part of $\hat{\mathbf{y}}^{g}$ after a certain time step $t$ as the future sequence. Therefore, to decode CTC with bidirectional beam search, we find the optimal output sequence $\hat{\mathbf{y}}$ such that

$$\hat{\mathbf{y}} = \operatorname*{arg\,max}_{\mathbf{y}} \; p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) \left( p_{\mathrm{fw}}(\mathbf{y}) \, p_{\mathrm{bw}}(\mathbf{y} \mid \hat{\mathbf{y}}^{g}) \right)^{\alpha}, \qquad (4)$$

where $\alpha$ is the hyper-parameter controlling the effect of the language model on the final decoding, and the second term of the product can be evaluated as described in Section 3.1.
With such bidirectional beam search, we may expect the decoder to obtain more valuable semantic information from the future sequence, which helps predict the next label output. Beam search with a unidirectional language model can be relatively inaccurate at the start of a sentence, because the language model has little context in the front part. A bidirectional language model, having context from both sides, is better grounded thanks to the backward context provided by the future information.
3.3 Future Exposure Bias
The well-known exposure bias problem implies that the model is exposed only to ground-truth context during training, never to predicted context. Both the forward and backward language models (FWLM, BWLM) suffer from this exposure bias problem due to their autoregressive nature. The BWLM in particular consumes noisy context produced by greedy decoding, and such noisy context cannot be trivially obtained from a text corpus. To simulate the noisy context produced by CTC greedy decoding, we propose two data augmentation methods.
3.3.1 Future Shift
For the backward language model, exposure bias in the immediate future input can cause critical problems in predicting the next output. To mitigate this, we train the backward part of the bidirectional language model with a future shift $\delta$, so that only the future sequence beyond position $t + \delta$ affects the prediction at position $t$. If $\delta = 0$, there is no shift in the future context, and the backward model may suffer from the exposure bias problem. To achieve the expected result, the backward language model should be trained and decoded with a future shift $\delta \geq 1$. Fig. 2-(b) shows how the shift is applied to the future context.
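The shifted future contexts used to train the backward model can be built as follows (a minimal sketch; the function name and the list-of-lists layout are illustrative):

```python
def backward_inputs_with_shift(tokens, delta=1):
    """For each target position u, return the future context starting
    delta steps ahead, reversed for the backward language model."""
    return [list(reversed(tokens[u + delta:])) for u in range(len(tokens))]
```

With `delta=1` the token immediately after the target is still visible; larger shifts discard the nearest (and noisiest) future tokens.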
3.3.2 Random Noise In Future
To further reduce the mismatch between training data and real decoding conditions, we simulate noisy greedy decoding when training the backward language model. Greedy decoding can make three kinds of errors: insertion, deletion, and substitution. A substitution error can be simulated by simply changing the input at a certain step into some other random token. Simulating insertion and deletion is more complicated, as both break the alignment between the future sequence and the input sequence. Fig. 2-(c),(d) show how the future sequence is aligned with random noise when the bidirectional LM is trained.
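A sketch of this noise simulation is shown below. The helper name and vocabulary argument are illustrative; the default insertion/deletion ratios follow the greedy-error statistics reported in the experiments section:

```python
import random

def corrupt_future(tokens, vocab, noise_ratio=0.05,
                   p_ins=0.45, p_del=0.20, rng=random):
    """Corrupt a future-context token sequence with insertion, deletion,
    and substitution errors, mimicking CTC greedy-decoding mistakes."""
    out = []
    for tok in tokens:
        if rng.random() < noise_ratio:
            r = rng.random()
            if r < p_ins:                 # insertion: extra random token
                out.append(rng.choice(vocab))
                out.append(tok)
            elif r < p_ins + p_del:       # deletion: drop this token
                continue
            else:                         # substitution: replace token
                out.append(rng.choice(vocab))
        else:
            out.append(tok)
    return out
```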
4.1 Dataset

The CTC model is trained on the LibriSpeech dataset, which contains 960 hours of read audiobook speech. The models are tested on two datasets, dev-clean and test-clean, each containing 5.4 hours of speech. We used a character-level dictionary consisting of 48 tokens, including special tokens such as unknown, start of sentence, end of sentence, and other special characters. Every model is trained on a single NVIDIA V100 GPU.
The external bidirectional language model is trained on the full training corpus of about 280,000 sentences. Sentences shorter than 10 letters or consisting almost entirely of non-alphabet characters were removed.
4.2 Acoustic Model
The 161-dimensional log spectrogram is used as the input of the CTC model. Two convolutional layers are used to compress the information of the log spectrogram on the temporal dimension. Four bidirectional LSTM layers with 2048 hidden nodes are used for acoustic information analysis. Then, one fully connected layer is used to compute the probability of each token from the output of the LSTM layers. The dimension of the final output vector of the CTC model is 49, which is one greater than the number of tokens because the CTC model should compute the probability of the blank label.
4.3 Language Model
We used both LSTM and Transformer-XL to train the bidirectional language model described in Section 3.1. We used 6 LSTM layers, with implementation details following AWD-LSTM. For Transformer-XL, there are 6 Transformer decoder layers with memory-augmented hidden states carrying previous hidden states during auto-regressive decoding. Every hidden state is a 1024-dimensional vector at each layer.
There are two training parameters for the bidirectional language model: the future shift $\delta$ and the random noise proportion $\rho$. We experimented with $\delta$ from 1 to 3. The parameter $\rho$ controls the random noise density in the future sequence. We assume that the proportion of noise should follow the actual error rate of greedy decoding. Since the character error rate (CER) of CTC greedy decoding turned out to be around 5%, we set $\rho$ to 0.05, i.e., 5 percent of all characters in a sequence are turned into noise. Among these noises, we set 45% to be insertion noise, another 20% to be deletion noise, and the rest to be substitution noise, following the statistics of the greedy result (Table 1). We compare the model trained with this noise modeling against one trained without it.
4.4 Beam Search
CTC prefix beam search finds the best beams for each time step with the score

$$\mathrm{score}(\mathbf{y}) = \log p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) + \alpha \log p_{\mathrm{LM}}(\mathbf{y}) + \beta |\mathbf{y}|,$$

where $\alpha$ is the language model weight and $\beta$ is a length reward that normalizes with respect to length, countering the bias toward shorter utterances. For all experiments, we fixed the beam width, $\alpha$, and $\beta$, since we are investigating the effect of bidirectional decoding rather than searching for optimal parameters.
The detailed algorithm is described in Algorithm 1. First, we obtain the greedy decoding result $\hat{\mathbf{y}}^{g}$. Using it, we compute the hidden states of the backward language model in advance. These hidden states are cached and reused by the bidirectional language model through time. For each time $t$, we find (line 10) the earliest time step after $t$ that has a non-blank greedy output. This position determines which part of the cached backward hidden states to use; depending on the future shift parameter $\delta$, the state shifted $\delta$ steps further into the future is used as the backward hidden state. Together with the forward hidden state, the language model score, a vector containing a probability for each output label, can be calculated. The regular CTC prefix search step (CTC_PREFIX_SEARCH) then produces the updated beams, where the second argument of CTC_PREFIX_SEARCH is the score added to the non-blank probability inside the function. After pruning, the best beams according to the score are kept for the next time step. Since the backward language model is run just once, on the greedy decoding result, the total amount of computation is not much larger than for unidirectional decoding.
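The lookup in line 10, finding for each frame the earliest later frame with a non-blank greedy label, can be precomputed in a single backward pass over the greedy result (a minimal sketch; the function name is illustrative):

```python
def next_nonblank_index(greedy_labels, blank=0):
    """For each frame t, return the index of the earliest frame strictly
    after t whose greedy label is non-blank, or None if there is none."""
    nxt = [None] * len(greedy_labels)
    upcoming = None
    for t in range(len(greedy_labels) - 1, -1, -1):
        nxt[t] = upcoming          # earliest non-blank strictly after t
        if greedy_labels[t] != blank:
            upcoming = t
    return nxt
```

With this table, each beam-search step indexes the cached backward hidden states in constant time instead of scanning forward through the frames.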
In this section, we show the improvement in perplexity of the bidirectional language model over the unidirectional language model. We trained both LSTM and Transformer-XL, varying the hyper-parameter $\delta$ from 1 to 3.
As shown in Table 2, compared to the unidirectional language model, the bidirectional LM improves performance significantly for both LSTM and Transformer-XL in terms of perplexity. This improvement is expected, since predicting with future information is clearly more helpful than without it, especially here where the future context is ground truth and thus perfect. Likewise, the closer the future information is, the more helpful it is, because the near future is more strongly related; therefore a smaller $\delta$ yields more accurate results. However, as explained in Section 3.3, a smaller value of $\delta$ also increases the risk of inaccurate predictions caused by incorrect future information at decoding time.
Table 2: Perplexity of Language Models
5.2 Decoding Results
Table 3 shows that bidirectional decoding is quite effective for both datasets. However, observing the CER results for the smallest future shift, we speculate that the most immediate future information has side effects, as we anticipated. The optimal $\delta$ differs between LSTM and Transformer-XL. Furthermore, the best performance for both LSTM and Transformer-XL is obtained when random noise is added to the greedy result during training. We note that the purpose of our experiments is to demonstrate the effectiveness of the proposed bidirectional decoding; the reported $\delta$ and $\rho$ values may therefore not be optimal. More experiments should be conducted in the future to find generally optimal hyper-parameters.
5.3 Relative Error Position
To further analyse the effectiveness of the proposed method, we visualize the statistics of the relative error positions as a histogram in Fig. 3. As shown in the figure, unidirectional decoding performs clearly worse at the beginning of the sentence, which coincides with the analysis in Section 3.2. In contrast, bidirectional decoding shows a relatively uniform distribution over all relative positions, implying that the proposed bidirectional decoding outperforms the unidirectional one in overall error rate. The enhancement is largest in the front 10% of the sequence, implying that bidirectional decoding is especially good at correcting errors located in the front part.
Table 4 shows a few examples that appear to be properly corrected by future information. For instance, in the second example the greedy decoding result is "tusday august eighteenth", and unidirectional decoding has no choice but to decode it as-is. Bidirectional decoding, however, can see the future part "august eighteenth" and successfully recovers the right answer "tuesday". Although the results are not flawless, we can expect positive effects from the future context, as in these examples.
| Unidirectional decoding | Bidirectional decoding |
| sweak squeak | squeak squeak |
| tusday august eighteenth | tuesday august eighteenth |
| i name nine others and said | i named nine others and said |
In this paper, we proposed a new decoding method for CTC speech recognition models with a bidirectionally trained language model. The bidirectional language model combines a traditional unidirectional language model with a backward language model that takes advantage of future information, for which the greedy decoding result is used. Through several experiments, we demonstrated that the proposed method helps decode the front part of the speech compared to unidirectional decoding. Furthermore, we alleviated the future exposure bias problem with the future shift and random noise. As future work, we could apply bidirectional decoding to transducer-based models [21, 26, 15] or label-synchronous attention-based decoders [5, 12].
-  (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22 (10), pp. 1533–1545. Cited by: §1.
-  (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International conference on machine learning, pp. 173–182. Cited by: §1.
-  (2016) End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4945–4949. Cited by: §1.
Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation. IEEE Spoken Language Technology. Cited by: §1.
-  (2015) Listen, attend and spell. CoRR abs/1508.01211. External Links: Cited by: §1, §6.
-  (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: §1.
Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing 20 (1), pp. 30–42. Cited by: §1.
-  (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. Cited by: §1.
-  (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §1.
-  (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §1, §2, §4.4.
-  (2014) Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pp. 1764–1772. Cited by: §1.
-  (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §6.
-  (2014) First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns. arXiv preprint arXiv:1408.2873. Cited by: §1.
-  (2019) Effective sentence scoring method using bert for speech recognition. ACML. Cited by: §1.
-  (2021) Improving rnn transducer based asr with auxiliary tasks. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 172–179. Cited by: §6.
-  (2017) Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182. Cited by: §4.3.
-  (2019) Streaming end-to-end speech recognition with joint ctc-attention based models. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Cited by: §1.
-  (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.1.
-  (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §1.
-  (2017) A comparison of sequence-to-sequence models for speech recognition.. In Interspeech, pp. 939–943. Cited by: §1.
-  (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193–199. Cited by: §6.
-  (2014) Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128. Cited by: §1.
Asynchronous bidirectional decoding for neural machine translation. AAAI. Cited by: §1.
-  (2017) Future word contexts in neural network language models. IEEE Automatic Speech Recognition and Understanding. Cited by: §1.
-  (2018) Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864. Cited by: §1.
-  (2021) Librispeech transducer model with internal language model prior correction. arXiv preprint arXiv:2104.03006. Cited by: §6.