Back from the future: bidirectional CTC decoding using future information in speech recognition

10/07/2021
by   Namkyu Jung, et al.

In this paper, we propose a simple but effective method to decode the output of a Connectionist Temporal Classification (CTC) model using a bi-directional neural language model. The bidirectional language model uses future as well as past information in order to predict the next output in the sequence. The proposed method, based on bi-directional beam search, takes advantage of the CTC greedy decoding output to represent the noisy future information. Experiments on the LibriSpeech dataset demonstrate the superiority of our proposed method compared to baselines using unidirectional decoding. In particular, the boost in accuracy is most apparent at the start of a sequence, which is the most error-prone part for existing systems based on unidirectional decoding.



1 Introduction

In recent years, the performance of automatic speech recognition (ASR) systems has seen dramatic improvements due to the application of deep learning and the use of large-scale datasets [7, 8, 1, 22, 3, 2, 25, 6, 19]. In particular, two strands of research have pushed the state of the art in this field, namely sequence-to-sequence models [5, 9] and Connectionist Temporal Classification (CTC) based methods [10, 11]. This work focuses on the latter due to its simplicity and good performance.

A key challenge in implementing such methods lies in the decoding process. Since CTC-based models do not have a built-in decoding method, a beam search strategy is often used to achieve satisfactory accuracy. This requires an external language model (LM) to provide prior information about which sequences of words are most probable.

While both CTC and sequence-to-sequence models are able to process the whole sequence of speech data in a bidirectional manner, the traditional beam search method does not take advantage of future information, since the current prediction relies only on previous outputs. Although there have been attempts to exploit the whole sequence in the decoding process of CTC-based methods, their effectiveness has not been demonstrated in automatic speech recognition.

In this paper, we propose a bi-directional decoding method with noisy future context, together with a training method that is robust to noisy context. As the ASR model, we employ the Connectionist Temporal Classification (CTC) model instead of an autoregressive model, because the latter is based on label-synchronous decoding [20], which makes it difficult to know which frame of speech is being decoded at each time step, whereas frame-synchronous decoding [13, 17] makes this straightforward.

The majority of bi-directional LMs [24, 4, 14] are designed for N-best LM rescoring, with architectural advances from LSTM-RNNs to neural random fields and Transformer encoders. Although N-best LM rescoring can replace or be combined with our decoding method, we argue that rescoring generally underperforms shallow fusion. The former only selects the best solution among first-pass search results, while the latter can use the LM score at every decoding time step of the beam search.

[23] use a bidirectional shallow fusion decoding algorithm with forward/backward uni-directional decoders on top of a bi-directional encoder architecture. Similar to our work, they use the result of greedy decoding from the backward decoder as the future context. However, greedy decoding of an autoregressive model is known to suffer from the exposure bias problem, which degrades the quality of generation and requires expensive supervision (e.g., scheduled sampling, REINFORCE) to fix.

2 CTC greedy and prefix beam search

Assume that a sequence of speech representations $X = (x_1, \dots, x_T)$, the corresponding sequence of labels $\mathbf{y}$, and its alignment $\pi$ are given. The alignment sequence $\pi = (\pi_1, \dots, \pi_T)$ consists of a special CTC character called the blank label $\varnothing$ as well as output labels from the label dictionary $\mathcal{V}$, and is mapped to a unique result label sequence $\mathbf{y} = \mathcal{B}(\pi)$ by removing all blanks and repeated labels.

For each time step $t$, greedy decoding of the CTC model chooses the most probable CTC output label using the probability $P(\pi_t = k \mid X)$, namely

$$\hat{\pi}_t = \operatorname*{argmax}_{k \in \mathcal{V} \cup \{\varnothing\}} P(\pi_t = k \mid X). \qquad (1)$$

Greedy decoding ends with the actual label sequence $\hat{\mathbf{y}} = \mathcal{B}(\hat{\pi})$ obtained by mapping the frame-wise choices through $\mathcal{B}$. Greedy decoding is not guaranteed to find the most probable output sequence, since it assumes that the output probabilities are independent across frames. However, it is very simple and fast, because the only computation required is finding the most probable output label at each time step $t$.
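To make the greedy step in (1) and the collapse mapping $\mathcal{B}$ concrete, here is a minimal Python sketch; the blank index, the array layout, and the function name are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: (T, V) per-frame CTC log-probabilities.
    Pick the most probable label for each frame (Eq. 1), then apply the
    collapse mapping: drop repeated labels, then drop blanks."""
    best_path = np.argmax(log_probs, axis=1)
    decoded, prev = [], None
    for k in best_path:
        if k != prev and k != blank:
            decoded.append(int(k))
        prev = k
    return decoded
```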

Other than greedy decoding, we can efficiently calculate the probabilities of successive extensions of each labelling prefix with prefix search decoding [10]. However, in full prefix search the number of prefixes grows exponentially with the length of the input sequence. For that reason, prefix search decoding should be combined with beam search decoding, which limits the number of extensions kept at each search step.

The CTC model is based on a conditional independence assumption, so decoding often requires an external language model to exploit dependencies between outputs. Assume that we have an external language model which models the probability $P_{\mathrm{LM}}(\mathbf{y})$. CTC beam search can then be performed for a certain beam width $B$ by keeping the most probable candidates at each time step $t$. We seek the optimal CTC label sequence $\hat{\pi}$ such that

$$\hat{\pi} = \operatorname*{argmax}_{\pi} \; P_{\mathrm{CTC}}(\pi \mid X)\, P_{\mathrm{LM}}(\mathcal{B}(\pi))^{\alpha}, \qquad (2)$$

where $P_{\mathrm{CTC}}(\pi \mid X)$ is the CTC probability, $P_{\mathrm{LM}}(\mathcal{B}(\pi))$ is the language model probability of the output label sequence corresponding to the CTC label sequence $\pi$, and $\alpha$ is a hyperparameter controlling the effect of the language model. For each time step $t$, we keep the $B$ most probable beams according to the score (2) and proceed to the next time step with those beams.
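As a toy illustration of the shallow-fusion score in (2), the snippet below combines CTC and LM scores in log space; the candidate prefixes and the weight value are made up for the example, and real prefix beam search maintains separate blank/non-blank probabilities, which is omitted here.

```python
def shallow_fusion_score(log_p_ctc, log_p_lm, alpha=0.5):
    """Combined beam score in the spirit of Eq. (2), in log space:
    log P_CTC(prefix | X) + alpha * log P_LM(prefix)."""
    return log_p_ctc + alpha * log_p_lm

# Toy example: two candidate prefixes with (CTC, LM) log-probabilities.
candidates = {"the cat": (-3.2, -1.1), "the cad": (-3.0, -4.0)}
best = max(candidates, key=lambda w: shallow_fusion_score(*candidates[w]))
print(best)  # "the cat": the LM term outweighs its slightly lower CTC score
```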

Figure 1: Bidirectional CTC Beam Search

3 CTC Bidirectional Beam Search

Figure 2: Bi-directional LM with future shift, random insertion/deletion noise

In this section, we propose a bi-directional decoding method with noisy future context for CTC. In addition, we present a training method that makes the bi-directional language model robust to noisy future context.

3.1 Bidirectional Language Model

Assume that a bi-directional context including future information is given. Then we can predict the next output token $y_t$ with both a forward language model conditioned on the past information $\mathbf{y}_{<t}$ and a backward language model conditioned on the future information $\mathbf{y}_{>t}$, as follows:

$$P(y_t \mid \mathbf{y}_{<t}, \mathbf{y}_{>t}) = \mathrm{softmax}\big(W(h^{\mathrm{fw}}_t + h^{\mathrm{bw}}_t)\big), \qquad (3)$$

where $h^{\mathrm{fw}}_t$ is the hidden state produced by the forward language model from the past information and $h^{\mathrm{bw}}_t$ is the hidden state produced by the backward language model from the future information, respectively (Fig. 2-(a)).

The bidirectional language model can be realized with a Transformer as well as an RNN, treating past hidden states as memory in the Transformer. We have used both LSTM and Transformer-XL to train the bidirectional language model. It needs two input sequences: a forward sequence for the past text and a backward sequence for the future text. Feeding these two input sequences, we obtain two hidden states, $h^{\mathrm{fw}}_t$ and $h^{\mathrm{bw}}_t$, from the bidirectional language model. The combined representation is obtained by adding these two hidden states and is used to compute the final probability vector.
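The sketch below illustrates one way such a hidden-state combination could look in PyTorch: a forward LSTM over the past tokens and a backward LSTM over the (reversed) future tokens, with the two last hidden states summed before the output projection. Class and argument names (`BiDirLM`, `hidden_dim`, etc.) are our own choices for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class BiDirLM(nn.Module):
    """Minimal sketch: forward LM over past tokens plus backward LM over future
    tokens, combined by summing hidden states before the output projection."""

    def __init__(self, vocab_size, hidden_dim=1024, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.fw_lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.bw_lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, past_tokens, future_tokens):
        # past_tokens:   (batch, n) tokens y_{<t}, in natural order
        # future_tokens: (batch, m) tokens y_{>t} (after any shift), reversed
        h_fw, _ = self.fw_lstm(self.embed(past_tokens))    # (batch, n, hidden)
        h_bw, _ = self.bw_lstm(self.embed(future_tokens))  # (batch, m, hidden)
        combined = h_fw[:, -1] + h_bw[:, -1]               # sum the two hidden states
        return torch.log_softmax(self.proj(combined), dim=-1)  # log P(y_t | past, future)
```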

3.2 CTC Decoding with Bidirectional Language Model

As described in Fig. 1, while decoding through time, the greedy decoding result provides a noisy future context for the backward language model. From (1), for the greedy decoding result $\hat{\pi}$, we may take $\mathcal{B}(\hat{\pi}_{>t})$ as the future sequence for a certain time step $t$. Therefore, in order to decode CTC with bidirectional beam search, we find the optimal output sequence $\hat{\pi}$ such that

$$\hat{\pi} = \operatorname*{argmax}_{\pi} \; P_{\mathrm{CTC}}(\pi \mid X)\, P_{\mathrm{biLM}}\big(\mathcal{B}(\pi) \mid \mathcal{B}(\hat{\pi})\big)^{\alpha}, \qquad (4)$$

where $\alpha$ is the hyper-parameter controlling the effect of the language model on the final decoding, and the second term of the product can be evaluated as stated in (3).

With such bidirectional beam search, we expect the decoding to obtain more valuable semantic information from the future sequence, which helps predict the next label output. Beam search with a unidirectional language model can be relatively inaccurate at the start of a sentence, because there is little context for the language model in the front part. A bidirectional language model, having context from both the front and the back, is better grounded there, since the backward context supplies information from the future.
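As a small sketch of how the noisy future context might be formed from the greedy output at a given time step, the helper below collapses the greedy alignment after frame t; the function names and blank index are illustrative assumptions, and the actual decoder (Algorithm 1) works with cached hidden states rather than re-collapsing text.

```python
def collapse_ctc(alignment, blank=0):
    """Map a CTC alignment to a label sequence: drop repeated labels, then blanks."""
    out, prev = [], None
    for k in alignment:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out

def noisy_future_context(greedy_alignment, t, blank=0):
    """Future context for the backward LM at time step t: the collapsed greedy
    output of all frames after t (noisy, since greedy decoding makes errors)."""
    return collapse_ctc(greedy_alignment[t + 1:], blank=blank)
```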

3.3 Future Exposure Bias

The well-known exposure bias problem means that the model is exposed only to ground-truth context during training, not to predicted context. Both the forward and backward language models (FWLM, BWLM) suffer from this exposure bias problem due to their autoregressive nature. In particular, the BWLM consumes noisy context produced by greedy decoding, and such noisy context cannot be trivially obtained from a text corpus. To simulate the noisy context produced by CTC greedy decoding, we propose two data augmentation methods.

3.3.1 Future Shift

For the backward language model, exposure bias in the immediate future input can cause critical problems in predicting the next output. In order to mitigate this problem, we train the backward part of the bidirectional language model with a future shift $\delta$, so that only the future sequence starting $\delta$ steps ahead affects the prediction of $y_t$. Without such a shift, the backward model may suffer from the exposure bias problem. To achieve the expected effect, the backward language model should be trained and decoded with the same future shift $\delta$. Fig. 2-(b) shows how the shift is applied to the future context.
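One way the shifted training pairs for the backward LM could be constructed is sketched below: to predict y_t, the model only sees tokens from position t + δ onward, fed in reverse order. The function name and the exact indexing convention are our assumptions for illustration.

```python
def backward_lm_examples(tokens, delta=2):
    """Yield (future_context_reversed, target) pairs for training the backward LM
    with a future shift of `delta`: the target y_t is predicted only from
    tokens y_{t+delta}, y_{t+delta+1}, ..., presented in reverse order."""
    for t in range(len(tokens)):
        future = tokens[t + delta:]
        if future:  # skip positions with no usable future context
            yield list(reversed(future)), tokens[t]

# Example: with delta=2, the target "b" in "a b c d e" is predicted from "e d".
pairs = list(backward_lm_examples("a b c d e".split(), delta=2))
```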

3.3.2 Random Noise In Future

To further minimize the difference between training data and real decoding conditions, we simulate noisy greedy decoding when training the backward language model. Greedy decoding can make three kinds of errors: insertion, deletion and substitution. A substitution error can be simulated by simply changing the token at a certain step into some other random token. Simulating insertion and deletion is more complicated, as both break the alignment between the future sequence and the input sequence. Fig. 2-(c),(d) show how the future sequence is aligned with random noise when the bidirectional LM is trained.
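A minimal sketch of such noise injection is shown below, assuming a character vocabulary and the error-type proportions reported later in Table 1; the exact alignment bookkeeping for insertions and deletions (Fig. 2-(c),(d)) is simplified away here.

```python
import random

def corrupt_future(tokens, vocab, noise_ratio=0.05,
                   p_ins=0.45, p_del=0.20, p_sub=0.35):
    """Simulate greedy-decoding noise on a future token sequence: with probability
    `noise_ratio` per position, apply an insertion, deletion, or substitution
    drawn with the given proportions."""
    noisy = []
    for tok in tokens:
        if random.random() < noise_ratio:
            kind = random.choices(["ins", "del", "sub"],
                                  weights=[p_ins, p_del, p_sub])[0]
            if kind == "ins":       # spurious extra token before the real one
                noisy.append(random.choice(vocab))
                noisy.append(tok)
            elif kind == "del":     # drop the real token
                continue
            else:                   # replace with a random token
                noisy.append(random.choice(vocab))
        else:
            noisy.append(tok)
    return noisy
```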

4 Experiments

4.1 Dataset

The CTC model is trained on the LibriSpeech dataset [18], which contains 960 hours of read audiobook speech. The models are tested on two datasets, dev-clean and test-clean, each containing 5.4 hours of speech. We used a character-level dictionary consisting of 48 tokens, including special tokens such as unknown, start of sentence, end of sentence and other special characters. Every model is trained on a single NVIDIA V100 GPU.

The external bidirectional language model is trained on the training transcripts, which consist of about 280,000 sentences. Sentences shorter than 10 letters or consisting almost entirely of non-alphabetic characters were removed.

4.2 Acoustic Model

The 161-dimensional log spectrogram is used as the input to the CTC model. Two convolutional layers compress the information of the log spectrogram along the temporal dimension. Four bidirectional LSTM layers with 2048 hidden nodes are used for acoustic analysis. Then, one fully connected layer computes the probability of each token from the output of the LSTM layers. The dimension of the final output vector of the CTC model is 49, one more than the number of tokens, because the CTC model must also compute the probability of the blank label.
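For reference, a rough PyTorch sketch of an acoustic model with this shape (two temporal convolutions, four bidirectional LSTM layers with 2048 hidden units, and a final projection to 49 outputs) is given below; kernel sizes, strides, and other unspecified details are our assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Sketch: 2 temporal convolutions -> 4 BiLSTM layers -> linear -> per-frame log-probs."""

    def __init__(self, n_feats=161, hidden=2048, n_out=49):
        super().__init__()
        # Assumed kernel/stride values; the paper only states "two convolutional layers".
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_out)

    def forward(self, spec):
        # spec: (batch, time, 161) log spectrogram
        x = self.conv(spec.transpose(1, 2)).transpose(1, 2)  # compress the time axis
        x, _ = self.lstm(x)
        return torch.log_softmax(self.fc(x), dim=-1)         # per-frame CTC log-probs
```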

4.3 Language Model

We used both LSTM and Transformer-XL to train the bidirectional language model described in Section 3.1. We used 6 LSTM layers with implementation details following AWD-LSTM [16]. For Transformer-XL, there are 6 Transformer decoder layers with memory-augmented hidden states for previous hidden states in auto-regressive decoding. Every hidden state is a 1024-dimensional vector for each layer.

There are two training parameters for the bidirectional language model: the future shift $\delta$ and the random noise proportion $\rho$. We experimented with $\delta \in \{1, 2, 3\}$. The parameter $\rho$ controls the density of random noise in the future sequence. We assume that the proportion of noise should follow the actual error rate of greedy decoding. Since the character error rate (CER) of CTC greedy decoding turned out to be around 5%, we set $\rho$ to 0.05, i.e., 5 percent of the characters in a sequence are turned into noise. Among these noisy positions, we set 45% to be insertion noise, 20% to be deletion noise, and the rest to be substitution noise, following the statistics of the greedy results in Table 1. We compare models trained with this noise modelling and models trained without it.

type dev-clean test-clean
insertion 43.2% 45.6%
deletion 22.7% 21.9%
substitution 34.1% 32.5%
Table 1: Proportions of error types in greedy decoding

4.4 Beam Search

CTC prefix beam search finds the best beams for each time step with the score

$$\mathrm{score}(\mathbf{y}) = \log P_{\mathrm{CTC}}(\mathbf{y} \mid X) + \alpha \log P_{\mathrm{LM}}(\mathbf{y}) + \beta\,|\mathbf{y}|, \qquad (5)$$

where $\alpha$ is the language model weight and $\beta$ is the length reward, which normalizes with respect to the length and counteracts the bias toward shorter utterances.

For all experiments, we fix $\alpha$, $\beta$ and the beam width to the same values across all settings, since we are examining the effect of bidirectional decoding rather than searching for optimal parameters.

The detailed algorithm is described in Algorithm 1. First, we obtain the greedy decoding result $\hat{\pi}$. Using it, we compute the hidden states of the backward language model for every time step. These hidden states are cached and reused by the bidirectional language model throughout decoding. For time $t$, we find (line 10) the earliest time step $t'$ after $t$ that has a non-blank greedy output. With $t'$ we can determine which of the cached backward hidden states to use; depending on the future shift parameter $\delta$, the backward hidden state shifted by $\delta$ from $t'$ is selected. Together with the forward hidden state of each beam, the language model score, a $V$-dimensional vector of probabilities over the output labels, can be calculated. The regular CTC prefix search step (CTC_PREFIX_SEARCH) [10] is then applied, where its second argument is the score added to the non-blank probabilities inside this function. After pruning beams below the beam threshold, the $B$ best beams according to the score are kept for the next time step. Since the inference of the backward language model is done only once, on the greedy decoding result, the total amount of computation does not increase much.

0:  Input: B: beam width, V: vocab size, δ: future shift, θ: beam threshold, α: language model weight, β: length reward
1:  π̂ ← GREEDY_DECODE(X)  (Greedy Decoding)
2:  ŷ ← collapse(π̂)  (remove blanks and repeated labels)
3:  for t = T down to 1 do
4:     h_bw[t] ← backward LM hidden state computed on the greedy future context after frame t
5:  end for
6:  beams ← {empty prefix}
7:  initialize the forward LM hidden state for the empty prefix
8:  for t = 1 to T do
9:     candidates ← ∅
10:     t′ ← earliest time step s > t with π̂_s ≠ blank
11:     for each beam b in beams do
12:        h_fw ← forward LM hidden state of prefix b
13:        h_bw ← cached backward hidden state indexed by t′ and the future shift δ
14:        s_LM ← softmax(W(h_fw + h_bw))  (V-dimensional LM score)
15:        p ← CTC_PREFIX_SEARCH(b, α · log s_LM)
16:        for k = 1 to V do
17:           if p(k) > θ then
18:              add extension b·k to candidates
19:           end if
20:        end for
21:     end for
22:     beams ← top B beams in candidates
23:     Reorder beams according to score (5)
24:  end for
Algorithm 1 Bidirectional CTC beam search
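As a small companion to line 10 of Algorithm 1, the helper below precomputes, for every frame t, the earliest later frame with a non-blank greedy output, so the cached backward hidden state can be looked up in constant time during beam search; the function name and the sentinel value are our own.

```python
def next_nonblank_index(greedy_alignment, blank=0):
    """For each frame t, return t' = the earliest frame s > t whose greedy output
    is non-blank (len(alignment) is used as a sentinel when none exists)."""
    T = len(greedy_alignment)
    nxt = [T] * T
    following = T  # nearest non-blank frame strictly after the current position
    for t in range(T - 1, -1, -1):
        nxt[t] = following
        if greedy_alignment[t] != blank:
            following = t
    return nxt

# Example with blank = 0: frame 0's nearest later non-blank frame is index 2.
print(next_nonblank_index([0, 0, 7, 0, 7, 0]))  # [2, 2, 4, 4, 6, 6]
```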

5 Results

5.1 Perplexity

In this section, we show the improvement in perplexity of the bidirectional language model compared to the unidirectional language model. We trained both LSTM and Transformer-XL models, varying the hyper-parameter $\delta$ from 1 to 3.

As shown in Table 2, compared to the unidirectional language model, the bidirectional LM significantly improves the perplexity for both LSTM and Transformer-XL. This improvement is expected, since predicting with future information is more helpful than without it, and in perplexity evaluation the future information is the ground truth and therefore error-free. Similarly, the closer the future information is, the more helpful it is, because the near future is more strongly related to the current token; therefore a smaller $\delta$ yields lower perplexity. However, as explained in 3.3, a smaller value of $\delta$ also increases the risk of inaccurate results caused by incorrect future information.

model      LSTM            T-XL
dataset    dev     test    dev     test
uni        5.811   5.745   5.539   5.505
δ=1        2.711   2.676   2.451   2.245
δ=2        4.327   4.281   3.409   3.214
δ=3        5.351   5.171   4.564   4.452
Table 2: Perplexity of language models

5.2 Decoding Results

Table 3 shows that bidirectional decoding is quite effective for both datasets. However, we observe the side effect of direct future information that we anticipated: the CER for the case $\delta=1$ degrades. $\delta=2$ appears optimal for LSTM, whereas $\delta=3$ is optimal for Transformer-XL. Furthermore, the best performance for both LSTM and Transformer-XL is obtained when random noise with ratio $\rho=0.05$ is added to the greedy result during training. We note that the purpose of our experiments is to demonstrate the effectiveness of the proposed bidirectional decoding; therefore, the reported $\delta$ and $\rho$ values may not be optimal. More experiments should be conducted in the future to find generally optimal hyper-parameters.

LM            LSTM          T-XL
dataset       dev    test   dev    test
greedy        5.30   5.22   5.31   5.22
uni           4.56   4.48   4.37   4.12
δ=1           5.15   5.02   4.90   4.77
δ=2           4.40   4.23   4.29   4.12
δ=3           4.51   4.31   4.22   4.05
δ=2, ρ=0.05   4.38   4.21   4.24   4.07
δ=3, ρ=0.05   4.46   4.28   4.17   4.02
Table 3: Character Error Rate (CER) on the LibriSpeech dev-clean and test-clean datasets for greedy decoding, unidirectional beam search, and bidirectional beam search with various δ and ρ.

5.3 Relative Error Position

In order to further analyse the effectiveness of the proposed method, we visualized the statistics of the relative error positions as a histogram in Fig. 3. As shown in the figure, unidirectional decoding performs clearly worse at the beginning of the sentence, which coincides with the analysis in 3.2. On the contrary, bidirectional decoding shows a relatively uniform distribution across all relative positions, implying that the proposed bidirectional decoding outperforms the unidirectional one across positions. The improvement is largest in the front 10% of the sequence, implying that bidirectional decoding is especially good at correcting errors located in the front part.

Figure 3: Histogram of the relative positions at which errors occur in test-clean with the Transformer-XL model: the x-axis represents the relative position of each error in the recognition results, where 0 means the beginning of the speech and 1 means the end; the y-axis shows the total count of errors in each bin. The yellow line indicates the improvement from unidirectional to bidirectional decoding.

In Table 4, there are a few examples that seem to have been corrected properly by future information. For instance, in the second sentence the greedy decoding result is "tusday august eighteenth", and unidirectional decoding has no choice but to decode it as is. However, bidirectional decoding is able to see the future part "august eighteenth" and successfully recovers the right word "tuesday". Although the results are not flawless, we can expect positive effects from the future context, as in these examples.

unidirectional                  bidirectional
sweak squeak                    squeak squeak
tusday august eighteenth        tuesday august eighteenth
i name nine others and said     i named nine others and said
Table 4: Examples where bidirectional decoding uses future information when necessary. In these cases, it is very hard to produce the correct text without seeing the future information, as in unidirectional decoding.

6 Conclusions

In this paper, we proposed a new decoding method for CTC speech recognition models with a bidirectionally trained language model. The bidirectional language model is obtained by combining a traditional unidirectional language model with a backward language model that takes advantage of future information, for which the greedy decoding result is used. Through several experiments, we have demonstrated that the proposed method is particularly helpful for decoding the front part of the speech compared to unidirectional decoding. Furthermore, we alleviated the future exposure bias problem with the future shift and random noise. As future work, we could apply bidirectional decoding to transducer-based models [21, 26, 15] or label-synchronous attention-based decoders [5, 12].

References

  • [1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22 (10), pp. 1533–1545. Cited by: §1.
  • [2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International conference on machine learning, pp. 173–182. Cited by: §1.
  • [3] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio (2016) End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4945–4949. Cited by: §1.
  • [4] W. Bin and O. Zhijian (2018) Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation. IEEE Spoken Language Technology. Cited by: §1.
  • [5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals (2015) Listen, attend and spell. CoRR abs/1508.01211. External Links: Link, 1508.01211 Cited by: §1, §6.
  • [6] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: §1.
  • [7] G. E. Dahl, D. Yu, L. Deng, and A. Acero (2011) Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing 20 (1), pp. 30–42. Cited by: §1.
  • [8] L. Deng, G. Hinton, and B. Kingsbury (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. Cited by: §1.
  • [9] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §1.
  • [10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §1, §2, §4.4.
  • [11] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pp. 1764–1772. Cited by: §1.
  • [12] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §6.
  • [13] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng (2014) First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns. arXiv preprint arXiv:1408.2873. Cited by: §1.
  • [14] S. Joongbo, L. Yoonhyung, and J. Kyomin (2019) Effective sentence scoring method using bert for speech recognition. ACML. Cited by: §1.
  • [15] C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig (2021) Improving rnn transducer based asr with auxiliary tasks. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 172–179. Cited by: §6.
  • [16] S. Merity, N. S. Keskar, and R. Socher (2017) Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182. Cited by: §4.3.
  • [17] N. Moritz, T. Hori, and J. Le Roux (2019) Streaming end-to-end speech recognition with joint ctc-attention based models. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Cited by: §1.
  • [18] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.1.
  • [19] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §1.
  • [20] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly (2017) A comparison of sequence-to-sequence models for speech recognition.. In Interspeech, pp. 939–943. Cited by: §1.
  • [21] K. Rao, H. Sak, and R. Prabhavalkar (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193–199. Cited by: §6.
  • [22] H. Sak, A. Senior, and F. Beaufays (2014) Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128. Cited by: §1.
  • [23] Z. Xiangwen, S. Jinsong, Q. Yue, L. Yang, J. Rongrong, and W. Hongji (2018) Asynchronous bidirectional decoding for neural machine translation. AAAI. Cited by: §1.
  • [24] C. Xie, L. Xunying, R. Anton, W. Yu, and G. Mark (2017) Future word contexts in neural network language models. IEEE Automatic Speech Recognition and Understanding. Cited by: §1.
  • [25] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert (2018) Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864. Cited by: §1.
  • [26] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney (2021) Librispeech transducer model with internal language model prior correction. arXiv preprint arXiv:2104.03006. Cited by: §6.