Blank Collapse: Compressing CTC emission for the faster decoding

by   Minkyu Jung, et al.

The Connectionist Temporal Classification (CTC) model is a very efficient method for modeling sequences, especially speech data. To use a CTC model for Automatic Speech Recognition (ASR), beam search decoding with an external language model (LM) such as an n-gram LM is necessary to obtain reasonable results. In this paper we analyze the blank label in CTC beam search in depth and propose a very simple method that reduces the amount of computation, resulting in faster beam search decoding. With this method, we obtain up to 78% faster decoding than ordinary beam search decoding, with a very small loss of accuracy on the LibriSpeech datasets. We show that this method is effective not only practically, by experiments, but also theoretically, by mathematical reasoning. We also observe that the reduction is more pronounced when the accuracy of the model is higher.


1 Introduction

Recently, there has been remarkable improvement in Automatic Speech Recognition (ASR) systems thanks to End-to-End (E2E) training approaches. In particular, most ASR models are built on one of three popular base architectures: the Connectionist Temporal Classification (CTC) based model [graves2006connectionist], the transducer-based model [graves12, zhang2020transformer], and the Attention-based Encoder Decoder (AED) model [chan2016listen, dong2018speech].

Unlike other models, CTC encodes the waveform with sequential models like a Recurrent Neural Network (RNN) [graves2014towards] or a Transformer [baevski2020wav2vec, gulati2020conformer] and outputs the probability or logit for each sample frame. Since it does not have any decoder module itself and assumes conditional independence with respect to time, it can produce the recognition result directly from its output. Moreover, because this model tries to learn a frame-wise phoneme representation, it inherently lacks temporal information. Therefore a CTC-based model is often used with a beam search decoder and an external Language Model (LM) such as an n-gram LM [synnaeve2019end, baevski2020wav2vec] or a deep learning based LM [hannun2014first]. However, even though the beam search decoder gives better results, it requires a longer decoding time. Both accuracy and decoding time are too important to give up, so in practice we have to choose a balance point in this trade-off.

Reducing inference time is very important because it lowers cost and increases the amount of data that can be handled at a single time, namely throughput. There have been some efforts to reduce the total recognition time in ASR. In particular, many works have tried to improve beam search decoding, not only for speed but also for accuracy. [freitag2017beam] introduced beam threshold pruning, which prunes beam candidates with relatively low scores, and showed significant improvements in beam search speed. [seki2019vectorized] introduced vectorized CTC, which also shows large improvements in CTC-Attention-based beam search decoding, while [jain2019rnn] proposed an improved RNN-T beam search that reduces decoding time. [drexler2019subword] introduced a new beam search decoding using subword optimization for better accuracy. Besides, there are many other studies on optimizing Weighted Finite State Transducer (WFST) decoders [tsukada2004efficient, miao2015eesen, mendis2016parallelizing, chen2018gpu, laptev22_interspeech].

CTC in particular has a special label, blank, that serves multiple functions in the decoding process as well as in the calculation of the CTC loss. However, there seem to have been few studies on optimizing the decoding process by exploiting the blank token, even though it plays a crucial role in the CTC objective. In this paper, we study the role of the blank label in CTC beam search decoding in depth and find that there are redundant computations for the blank label. Based on this, we propose the blank collapse method, which reduces the calculations in decoding and thereby improves decoding speed with negligible loss of accuracy. It requires no further training and can be applied to any CTC emission.

Figure 1: Number of runs of consecutive frames of each length, by frame type, in LibriSpeech test-other. A frame is of the blank type if its highest probability is at the blank index, and of the non-blank type otherwise. Most non-blank runs last only one or two frames, whereas blank runs tend to last longer.

2 Analysis on CTC blank

In this section, we study the characteristics of CTC blank. Most of the notations in this section follow those in [graves2006connectionist].

2.1 Blanks on greedy decoding

Before introducing the method, we have to define the blank frame. Let $y = (y^1, \dots, y^T)$ be the CTC emission probability of a certain waveform from a CTC model, where $y^t_k$ is the probability of label $k$ at a specified timestep $t$ and $L'$ is the set of labels, $k \in L'$; $\epsilon \in L'$ represents the blank label. The CTC probability can normally be obtained by applying the softmax function to the logits output by an LSTM or Transformer trained with the CTC loss. Each label $k$ can be a subword or a character; we use a character dictionary.

For CTC greedy decoding (best path decoding), we calculate the CTC output sequence as $\pi_t = \arg\max_{k} y^t_k$ for each timestep $t$ and map the sequence with the mapping $\mathcal{B}$ in [graves2006connectionist]. $\mathcal{B}$ maps a sequence of CTC outputs to a label sequence by first merging repeated labels and then removing all blanks from the sequence. Thus, by the definition of $\mathcal{B}$, consecutive blanks play the same role as a single blank (e.g. $\mathcal{B}(a\epsilon\epsilon b) = \mathcal{B}(a\epsilon b) = ab$, where $\epsilon$ denotes the blank). This property motivates the idea that consecutive blank outputs could be replaced by a single blank output. Note, however, that we cannot ignore a series of blanks entirely, because the blank label is what allows a repetition of a non-blank label to be represented (e.g. $\mathcal{B}(a\epsilon a) = aa$ but $\mathcal{B}(aa) = a$). That is why we have to leave at least one blank on behalf of the blanks that follow it.

However, consecutive blanks at the beginning and the end of a sequence can be omitted entirely, since they do not affect the result of greedy decoding. In other words, in greedy decoding, we are allowed to collapse consecutive blanks into a single blank and to drop all blanks before the first non-blank output and after the last non-blank output before decoding. More formally, we get the following.

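Definition 1.

$G$ is the CTC greedy decoding or best path decoding given by

$$G(y) = \mathcal{B}\left(\arg\max_{k} y^1_k, \dots, \arg\max_{k} y^T_k\right)$$

for a CTC emission probability $y$ of length $T$, where $\mathcal{B}$ is the mapping in [graves2006connectionist] and $\epsilon$ denotes the blank label. $W(y)$ is called a set of weak blank frames of $y$ defined as

$$W(y) = \{\, t : \arg\max_{k} y^t_k = \epsilon \,\}.$$

$C$ is a function called the consecutive extension defined by

$$C(A) = \{\, t \in A : t = 1 \ \text{or}\ t - 1 \in A \ \text{or}\ \{t, \dots, T\} \subseteq A \,\}.$$

By definition it is straightforward that

$$G(y) = G\left(y_{\{1, \dots, T\} \setminus C(W(y))}\right) \tag{1}$$

for a CTC emission probability $y$, where $y_I$ denotes $y$ restricted to the frames in the index set $I$. This means we may drop the frames included in $C(W(y))$ for CTC greedy decoding, even though this is not useful in practice (greedy decoding is already fast enough).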
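As a quick sanity check of Equation 1, the following Python sketch compares greedy decoding before and after dropping the weak blank frames that are leading, trailing, or preceded by another blank. The blank index 0, the function names, and the toy emission are assumptions made for illustration; they are not taken from the original work.

```python
from itertools import groupby

BLANK = 0  # assumed index of the blank label


def greedy_path(emission):
    """Return the arg-max label index of every frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in emission]


def ctc_collapse(path):
    """Mapping B: merge repeated labels, then remove blanks."""
    return [label for label, _ in groupby(path) if label != BLANK]


def droppable_weak_blank_frames(path):
    """Weak blank frames that are leading, trailing, or preceded by a blank."""
    droppable = set()
    for t, label in enumerate(path):
        if label != BLANK:
            continue
        leading = t == 0
        follows_blank = t > 0 and path[t - 1] == BLANK
        trailing = all(other == BLANK for other in path[t:])
        if leading or follows_blank or trailing:
            droppable.add(t)
    return droppable


# Toy emission over 3 labels whose arg-max path is the list below.
toy_path = [0, 0, 1, 0, 0, 1, 2, 2, 0]
emission = [[0.8 if k == label else 0.1 for k in range(3)] for label in toy_path]

path = greedy_path(emission)
dropped = droppable_weak_blank_frames(path)
reduced = [frame for t, frame in enumerate(emission) if t not in dropped]

print(ctc_collapse(path))
print(ctc_collapse(greedy_path(reduced)))
```

Both printed label sequences coincide, while the reduced emission keeps only five of the nine frames. The single blank between the two occurrences of label 1 is kept, which is what preserves the repetition.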

2.2 Blanks on CTC beam search

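Equation 1 might not be valid for CTC beam search decoding. Let $\mathcal{H}_t$ be the set of beam candidates at timestep $t$. During CTC beam search, we calculate $p_b(\ell; t)$ and $p_{nb}(\ell; t)$, the blank probability and non-blank probability respectively, for each candidate $\ell \in \mathcal{H}_t$ and each timestep $t$. As $t$ proceeds, $p_b$ and $p_{nb}$ have to be updated for the case of stay ($\ell$ unchanged) or extend to another path ($\ell' = \ell + c$) according to the following rules:

$$
\begin{aligned}
p_b(\ell; t) &= y^t_\epsilon \left( p_b(\ell; t-1) + p_{nb}(\ell; t-1) \right), \\
p_{nb}(\ell; t) &\mathrel{+}= y^t_{\ell_{end}}\, p_{nb}(\ell; t-1), \\
p_{nb}(\ell'; t) &\mathrel{+}=
\begin{cases}
y^t_c\, p_b(\ell; t-1) & \text{if } c = \ell_{end}, \\
y^t_c \left( p_b(\ell; t-1) + p_{nb}(\ell; t-1) \right) & \text{otherwise,}
\end{cases}
\end{aligned} \tag{2}
$$

where $y^t_\epsilon$ is the blank emission at timestep $t$, $\ell' = \ell + c$ is $\ell$ extended by a non-blank label $c$, and $\ell_{end}$ represents the last label of $\ell$. We use $\mathrel{+}=$ instead of $=$ for the non-blank probability because it can also receive probability from another beam extension. CTC beam search with a language model (LM) sorts the candidates by their score defined by

$$\mathrm{score}(\ell; t) = \log\left( p_b(\ell; t) + p_{nb}(\ell; t) \right) + \alpha \log P_{LM}(\ell).$$

Here, $P_{LM}$ is the LM probability and $\alpha$ is the LM weight (other hyper-parameters such as a length penalty are not considered here). As we see in Equation 2, even when $y^t_\epsilon$ is the biggest probability in a frame, $y^t_c$ is still applied to the non-blank probability, which can accumulate as $t$ proceeds. Thus collapsing weak blank frames carelessly is too risky.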
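The update rules can be made concrete with a small prefix beam search. The sketch below is a minimal illustration in plain Python that follows the standard CTC prefix beam search formulation without an LM term; the function name, the blank index 0, and the beam size are assumptions for illustration, and this is not the decoder used in the experiments.

```python
from collections import defaultdict

BLANK = 0  # assumed index of the blank label


def prefix_beam_search(emission, beam_size=8):
    """CTC prefix beam search without an LM; returns {prefix: (p_b, p_nb)}."""
    beams = {(): (1.0, 0.0)}  # the empty prefix starts with all mass in p_b
    for frame in emission:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            # Stay by emitting a blank: the prefix is unchanged.
            nxt[prefix][0] += frame[BLANK] * (p_b + p_nb)
            if prefix:
                # Stay by repeating the last label: the prefix is unchanged.
                nxt[prefix][1] += frame[prefix[-1]] * p_nb
            for c in range(len(frame)):
                if c == BLANK:
                    continue
                extended = prefix + (c,)
                if prefix and c == prefix[-1]:
                    # A repeated label extends the prefix only after a blank.
                    nxt[extended][1] += frame[c] * p_b
                else:
                    nxt[extended][1] += frame[c] * (p_b + p_nb)
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {p: (b, nb) for p, (b, nb) in ranked[:beam_size]}
    return beams


# Two frames over {blank, a}, each with probability 0.5 for both labels.
result = prefix_beam_search([[0.5, 0.5], [0.5, 0.5]])
for prefix, (p_b, p_nb) in result.items():
    print(prefix, p_b, p_nb)
```

In this toy run the prefix `(1,)` receives non-blank probability both from staying on itself and from the extension of the empty prefix, which is why the non-blank updates are accumulated with `+=` rather than assigned.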

Consequently, we consider a stronger condition for blank frames, under which a frame affects the search little enough to be collapsed.

Definition 2.

$S_\theta(y)$ is called a set of strong blank frames or just blank frames of $y$ for $\theta$ defined by

$$S_\theta(y) = \{\, t : y^t_\epsilon > \theta \,\},$$

where $y^t_\epsilon$ is the blank probability at timestep $t$ and $\theta$ is called the blank threshold.

Compared to weak blank frames, on the blank frames in $S_\theta(y)$ the CTC model predicts the blank label with higher confidence. We can also control this confidence through $\theta$ to limit the difference from the result obtained with the original CTC emissions. We propose our method based on this definition in the next section, followed by its reasoning.

2.3 Comparison between blank and non-blank

Although other consecutive non-blank labels also might be collapsed into a single label in greedy decoding, we don’t cover the case of non-blank labels because non-blank labels (a) don’t occur often in a row and (b) are riskier than blank labels.

As we can see in Figure 1, non-blank labels usually last at most two consecutive frames, whereas blanks last longer once they occur. Our ultimate purpose is to reduce the decoding time, and collapsing consecutive non-blanks into a single non-blank would make little difference to it.
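The statistic behind this observation can be computed from an arg-max path with a run-length count. The following sketch assumes that frames are grouped only by type (blank versus non-blank) and that the blank index is 0; the function name and the toy path are hypothetical.

```python
from collections import Counter
from itertools import groupby

BLANK = 0  # assumed index of the blank label


def run_length_histogram(path):
    """Count runs of consecutive same-type frames, keyed by (type, length)."""
    histogram = Counter()
    for is_blank, run in groupby(path, key=lambda label: label == BLANK):
        kind = "blank" if is_blank else "non-blank"
        histogram[(kind, len(list(run)))] += 1
    return histogram


# Toy arg-max path: blank runs of length 3, 2, 1 and non-blank runs of length 1, 2.
toy_path = [0, 0, 0, 1, 0, 0, 2, 2, 0]
histogram = run_length_histogram(toy_path)
for (kind, length), count in sorted(histogram.items()):
    print(kind, length, count)
```

Summing such histograms over a whole evaluation set would give per-length counts of the kind plotted in Figure 1.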

3 Blank collapse method

Now we propose a new method called blank collapse which drops all collapsible frames before the CTC beam search in order to reduce the size of decoding frames. Additionally we study the theoretical reasoning behind this method, followed by its limitations and implementation.

3.1 Definition of the method

Definition 3 (blank collapse).

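The blank collapse method with blank threshold $\theta$ is the map defined by $y \mapsto y_{\{1, \dots, T\} \setminus C(S_\theta(y))}$ for a CTC emission probability $y$ of length $T$, where $S_\theta(y) = \{t : y^t_\epsilon > \theta\}$ is the set of blank frames, $C(A) = \{t \in A : t = 1 \ \text{or}\ t - 1 \in A \ \text{or}\ \{t, \dots, T\} \subseteq A\}$ is the consecutive extension, and $y_I$ keeps only the frames of $y$ indexed by $I$. We call $C(S_\theta(y))$ the set of collapsible frames of $y$ with $\theta$. The method using the weak blank frames $W(y)$ instead of $S_\theta(y)$ as the index set is called weak blank collapse.

In other words, from the original CTC emissions, this method drops the blank frames that occur at the beginning, at the end, or right after another blank frame. If we drop these collapsible frames, the CTC emission is compressed in length, resulting in a shorter beam search time.

This method certainly has to maintain the accuracy of beam search as much as possible, and this is accomplished by the definition of blank frames. Since $y^t_\epsilon > \theta$ on a blank frame $t$, it automatically implies $y^t_k < 1 - \theta$ for every $k \neq \epsilon$. This means that the change in the non-blank probability is bounded on blank frames, while the blank probability remains almost the same as before for sufficiently large $\theta$. For this reason we can almost surely ignore these frames.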
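The bound on the non-blank probabilities is easy to check numerically. The sketch below selects the frames whose blank probability exceeds a threshold and verifies that every other label in those frames stays below one minus that threshold; the blank index 0, the helper names, and the toy values are assumptions for illustration.

```python
BLANK = 0  # assumed index of the blank label


def strong_blank_frames(emission, theta):
    """Timesteps whose blank probability exceeds the blank threshold."""
    return [t for t, frame in enumerate(emission) if frame[BLANK] > theta]


def largest_non_blank(frame):
    """Largest probability assigned to any non-blank label in a frame."""
    return max(p for k, p in enumerate(frame) if k != BLANK)


emission = [
    [0.9950, 0.0030, 0.0020],  # strong blank frame
    [0.6000, 0.3500, 0.0500],  # weak blank frame only
    [0.0100, 0.9800, 0.0100],  # non-blank frame
    [0.9995, 0.0003, 0.0002],  # strong blank frame
]
theta = 0.99
frames = strong_blank_frames(emission, theta)

for t in frames:
    # Each frame sums to one, so non-blank labels share less than 1 - theta.
    print(t, largest_non_blank(emission[t]), largest_non_blank(emission[t]) < 1 - theta)
```

The second frame is a weak blank frame (blank is the arg-max) but not a strong one, and its largest non-blank probability of 0.35 shows why weak blank frames offer no comparable bound.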

3.2 Limitations of the method

Even though we can almost surely omit the consecutive blank frames, this method always has a nonzero chance of harming the original result. No matter how large $\theta$ is, consecutive blank frames can sometimes change the scores considerably and reverse the order of the candidates. As Equation 2 shows, the non-blank probability accumulates score not only from its own beam but also from another beam as they merge their scores. Namely,

$$p_{nb}(\ell; t) = y^t_c\, p_{nb}(\ell; t-1) + y^t_c\, p(\ell^-; t-1),$$

where $c$ is the last label of the beam $\ell$, $\ell = \ell^- + c$, and $p(\ell^-; t-1)$ is the probability that $\ell^-$ passes on when it is extended by $c$ (its blank probability alone if $\ell^-$ already ends with $c$, and the sum of its blank and non-blank probabilities otherwise). In this equation, we can limit the scale of $y^t_c$, but we do not know how much larger $p(\ell^-; t-1)$ is than the other beam scores. Again, every beam always has a chance to receive a relatively large score from another beam that has a much higher score. The LM score can deepen this kind of side effect. Additionally, this tells us why we collapse only blanks and not non-blanks, as mentioned in Section 2.3 (b): on frames with a high non-blank emission probability, $y^t_c$ cannot be bounded, which can cause serious distortion.

Nevertheless, this method is still effective, making little difference from the original. This is because the non-blank probability still has an upper limit and proper beam threshold pruning [freitag2017beam] prevents the beams from overspreading. Trivially, the higher $\theta$ is, the less distortion we can expect.

3.3 Implementation

To implement this method, we take advantage of a special utility function, unique_consecutive, from PyTorch [NEURIPS2019_9015]. First, we get a Boolean vector $m$ whose value $m_t$ represents whether the frame at timestep $t$ belongs to the blank frames or not. The function returns each unique value of a vector in order, together with how many times that value occurs consecutively. Applying this function to $m$, we get the counts of how many blank or non-blank frames occur in a row. With these values, we can keep only the non-collapsible frames.

This method is also compatible with timestep alignment [10.1007/978-3-030-60276-5_27], since we can map the timestep alignment result back to the original timesteps with the returned indices. The detailed method is described in Algorithm 1. Figure 2 shows the CTC emission of a sample before and after the blank collapse. Yellow represents high log probability and the top row is for the blank label. As we can see, consecutive blank frames disappear in the collapsed emission. The total length of the emission reduces from 169 to 102. The ground truth transcript is "I had that curiosity beside me at this moment".


Figure 2: Log probability of original/collapsed CTC emission for a sample waveform in LibriSpeech.
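Data: CTC probability y, blank threshold θ
Result: collapsed probability y', indices I
m ← (blank probability of y at timestep t > θ) for every timestep t;
(v, n) ← unique_consecutive(m), where v holds the run values and n the run lengths;
I ← empty list;
for i = 1 to length(v) do
       if v_i is True then
             if i = 1 or i = length(v) then
                   append n_i 0's to I;   // drop leading and trailing blanks
             else
                   append one 1 and (n_i - 1) 0's to I;   // keep a single blank per run
             end if
       else
             append n_i 1's to I;   // keep all non-blanks
       end if
end for
y' ← frames of y at the timesteps where I is 1;
Algorithm 1 blank collapse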
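A minimal pure-Python sketch of the procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: `unique_consecutive` here is a small stand-in for `torch.unique_consecutive` with `return_counts=True`, the blank index is assumed to be 0, and the function names and toy emission are hypothetical.

```python
from itertools import groupby


def unique_consecutive(values):
    """Run-length encode a sequence: return (run values, run lengths)."""
    runs = [(value, len(list(group))) for value, group in groupby(values)]
    return [v for v, _ in runs], [n for _, n in runs]


def blank_collapse(emission, theta, blank=0):
    """Drop collapsible blank frames; return (collapsed emission, kept indices)."""
    is_blank = [frame[blank] > theta for frame in emission]
    values, counts = unique_consecutive(is_blank)
    keep = []
    for i, (value, count) in enumerate(zip(values, counts)):
        if not value:
            keep.extend([True] * count)  # keep every non-blank frame
        elif i == 0 or i == len(values) - 1:
            keep.extend([False] * count)  # drop leading and trailing blank runs
        else:
            keep.extend([True] + [False] * (count - 1))  # keep one blank per run
    indices = [t for t, kept in enumerate(keep) if kept]
    return [emission[t] for t in indices], indices


# Toy two-label emission given by its per-frame blank probabilities.
blank_probs = [0.999, 0.999, 0.1, 0.999, 0.999, 0.999, 0.2, 0.3, 0.999]
emission = [[p, 1.0 - p] for p in blank_probs]

collapsed, indices = blank_collapse(emission, theta=0.99)
print(len(emission), len(collapsed), indices)
```

The returned indices refer to timesteps of the original emission, so a timestep found while decoding the collapsed emission can be mapped back with `indices[t]`. Note that in this sketch an emission consisting only of blank frames collapses to an empty emission, which a caller may want to handle separately.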

4 Experiments

4.1 ASR model and datasets

We use the wav2vec 2.0 Base / Large [baevski2020wav2vec] models pre-trained on the unlabeled audio of the LibriVox dataset [librivox] and fine-tuned on either 10 minutes, 100 hours, or 960 hours of the transcribed LibriSpeech dataset [panayotov2015librispeech]. We do not fine-tune further, and we use the CTC beam search decoder and 4-gram word LM provided by torchaudio [yang2022torchaudio], which uses [kahn2022flashlight] for the decoder. Since this decoder uses the logit of each beam for sorting beam candidates, we compute the probability vector explicitly with the softmax function in order to apply our method. We also try several values of the beam threshold.

Beam search decoding uses a batch size of 32, 1,500 beams, LM weight 1.57, and length penalty -0.64 for every experiment in order to reproduce the setting of [baevski2020wav2vec]. To analyze their effect, we vary the beam threshold and the blank threshold $\theta$. Every experiment is run on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz.

We evaluate our method on the LibriSpeech dev and test sets. First, we check the proportion of collapsible frames to all frames of a CTC emission with $\theta \in \{0.999, 0.99, 0.9\}$ and with weak collapse (Definition 3). As we can see in Table 1, a little under half of the CTC frames are collapsible. Additionally, there are more collapsible frames with a lower $\theta$ than with a higher one, though the gap between them is not large. From this observation, we infer that the CTC model predicts blank frames with very high confidence, so that there are not many frames of ambiguous certainty.

Dataset       θ = 0.999   θ = 0.99   θ = 0.9   weak
dev-clean     42.87       43.24      43.60     43.97
dev-other     42.98       43.77      44.51     45.20
test-clean    43.88       44.27      44.65     45.05
test-other    43.83       44.67      45.44     46.15
Table 1: The percentage of collapsible frames to all frames from CTC emissions with wav2vec 2.0 Large model fine-tuned on 960 hours.

Setting      Metric   dev-clean      dev-other      test-clean     test-other
original     WER      1.781          3.509          2.031          3.681
original     RTF      0.278          0.290          0.279          0.291
θ = 0.999    WER      1.781          3.511          2.031          3.681
θ = 0.999    RTF      0.160 (0.42)   0.167 (0.42)   0.157 (0.44)   0.165 (0.43)
θ = 0.99     WER      1.783          3.513          2.029          3.683
θ = 0.99     RTF      0.160 (0.42)   0.165 (0.43)   0.156 (0.44)   0.163 (0.44)
weak         WER      1.866          3.749          2.109          3.834
weak         RTF      0.158 (0.43)   0.161 (0.45)   0.154 (0.45)   0.159 (0.45)
Table 2: Word Error Rate (WER) (%) and Real Time Factor (RTF), with the RTF reduction ratio relative to the original in parentheses, on LibriSpeech dev/test sets at a fixed beam threshold. RTF includes the time spent executing blank collapse, which is less than a second.

4.2 Experimental results

Table 2 shows the accuracy and decoding time of each experiment at a fixed beam threshold. As we can see, blank collapse with a sufficiently high $\theta$ shows a significant improvement in decoding speed with little difference in accuracy. For test-clean, $\theta = 0.999$ shows a time reduction of about 44%, which equals a speed enhancement of about 78%, without any accuracy loss. Weak blank collapse shows the best speed among all settings, but with a non-negligible distortion of the accuracy. Thus a sufficiently high $\theta$ is safe for this method.
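The relation between the time reduction and the speed enhancement quoted above follows directly from the RTF values in Table 2. The short sketch below reproduces the arithmetic for test-clean with the blank threshold 0.999 (RTF 0.279 originally, 0.157 after collapsing); the function names are hypothetical.

```python
def time_reduction(rtf_original, rtf_collapsed):
    """Fraction of decoding time saved relative to the original."""
    return 1.0 - rtf_collapsed / rtf_original


def speed_gain(rtf_original, rtf_collapsed):
    """Relative increase in decoding speed (throughput)."""
    return rtf_original / rtf_collapsed - 1.0


# test-clean RTF values from Table 2: original versus blank threshold 0.999.
reduction = time_reduction(0.279, 0.157)
gain = speed_gain(0.279, 0.157)
print(f"time reduction: {reduction:.2f}, speed gain: {gain:.2f}")
```

A time reduction of r corresponds to a speed gain of 1 / (1 - r) - 1, so a reduction of roughly 44% already translates into a speed-up of roughly 78%.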

Interestingly, the reduction ratio of the decoding time seems to be almost as large as the reduction ratio of the number of frames: 43.2% and 43.8% respectively for test-other. However, this is not always true. As we see in Figure 3, with a larger beam threshold the reduction ratio of the decoding time is very similar to that of the number of frames. In contrast, a smaller beam threshold shows a slightly smaller effect. From this, we infer that the ratio of the time spent on blank frames to that spent on non-blank frames is relatively lower with a smaller beam threshold. This is because beam threshold pruning with a small threshold reduces the time on blank frames more than on the others. In other words, beam search with a higher beam threshold takes longer on blank frames than with a lower one, because it keeps more candidates that would have been pruned with the lower threshold.

Table 3 shows the correlation between the model type and the improvement of decoding time by blank collapse. The larger the model and the larger the fine-tuning dataset, the more evident the speed improvement. These two factors directly affect the accuracy of the model, which is the key factor deciding how many frames can be collapsed. This is because a model with better accuracy tends to produce CTC emission probabilities with higher confidence, which makes the blank probability high enough for the frame to be dropped. In other words, a good model can tell whether a certain frame can be collapsed or not.

Model / fine-tuning data   θ = 0.999   θ = 0.99   θ = 0.9   weak
Base / 10 min              23.56       28.22      31.54     35.81
Large / 10 min             35.95       39.05      40.34     40.59
Large / 100h               41.08       42.50      43.54     43.96
Large / 960h               43.30       43.99      45.36     45.70
Table 3: The percentage of decoding time improvement for various model types, depending on the size of the model and the fine-tuning dataset.

Figure 3: Reduction ratio of the decoding time by blank collapse in LibriSpeech test-other according to the blank threshold $\theta$ for each beam threshold, compared to the reduction ratio of the number of frames.

4.3 Results on other decoder settings

In this section, we discuss the proposed method in two different decoder settings on the LibriSpeech test-other subset. First, we decode both the original frames and the collapsed ones with a CTC beam search decoder fused with a Transformer-based LM. We use a word-level Transformer LM with 20 decoder layers [synnaeve2019end, baevski2020wav2vec] and the same decoder as in Section 4.1. However, unlike the n-gram fusion, which shows a significant improvement in inference speed with negligible WER degradation, there is only a small gain in decoding time. This is presumably because most of the time is spent on the large LM's inference when decoding with a neural LM.

Next, we experiment with a WFST-based beam search decoder. We compile the same 4-gram LM as in Section 4.1 into a language graph and a CTC graph using the k2 framework [povey2021speech], and we implement a WFST decoder using Kaldi [povey2011kaldi]. We observe an improvement in decoding time from blank collapse with WFST-based decoding as well: the reduction ratio of the decoding time is about 33% with a very small loss of accuracy. Since the hyper-parameters of the WFST decoder differ from those of the CTC beam search decoder, the improvement may differ slightly between the two decoders.

4.4 Analysis of side-effects

It may seem strange that, in Figure 3, the reduction ratio with a lower blank threshold is sometimes smaller than that with a higher one at the same beam threshold. This occasionally happens on other datasets as well. Our interpretation is that over-collapsing might drop candidates that would otherwise have become the best candidate and, with appropriate pruning, would have led to faster decoding for the rest of the time.

Additionally, there are some cases in which the accuracy turns out better when collapsed than in the original. For example, the WER of test-clean with $\theta = 0.99$ (2.029) is lower than that of the original (2.031), though the difference is small. We conjecture that blank collapse may drop some frames carrying potentially harmful information. However, such cases do not occur often, and the difference is usually small throughout our experiments.

5 Conclusions

In this paper, we analyze the characteristics of the blank label, which plays a special role in the CTC model. We define a blank frame as a frame with a high blank probability and find that a CTC emission usually contains many blank frames. We also discuss why consecutive blank frames can be omitted in CTC beam search decoding in almost every case.

Based on this, we propose a new method called blank collapse, which removes the collapsible frames in order to improve the decoding speed with minimal loss of accuracy. Several experiments show that this method does improve the decoding speed. We also find that we can collapse more frames when the model is well-trained on a dataset and with a higher beam threshold.

As future work, we expect that our method can be plugged into any E2E ASR models using CTC loss as a regularization.

6 Acknowledgements

We would like to thank Chan Kyu Lee and Icksang Han for their helpful advice.