Code-Switching (CS) is a phenomenon in which multiple languages are used in a single conversation. CS can appear at the word, phrase, or sentence level and can alternate throughout the conversation. If the language changes between sentences, it is called inter-sentential CS. If CS happens within a single utterance or sentence, it is called intra-sentential CS, which is the focus of this work.
Connectionist Temporal Classification (CTC) loss [graves2006connectionist] has gained popularity in the community due to its computational efficiency and ability to optimize character-based end-to-end models. Character-based models, requiring no dictionary, lead to greater usability across different languages [miiller2018multilingual, tong2018cross, wang2019end, dalmia2018domain, ito2017end]. However, when dealing with a mixture of characters from different languages in utterances with CS, CTC can result in spelling inconsistency. A word can contain a mixture of characters from different languages because the model does not know the context when predicting the output. To avoid this problem, multilingual sentences are sometimes transliterated into the same language [emond2018transliteration] which can be unnatural.
The lack of context can be alleviated by postprocessing with a Language Model (LM) or by providing context information from past states. However, LM rescoring increases the inference time. Long Short-Term Memory (LSTM) or previous hidden states can be used to provide context, again at the expense of inference speed. One can also add language information to the model, which can be done jointly with the main ASR task in a multi-task manner. This language identification subtask has been explored either on the frame level [luo2018towards, li2019towards] or the subword level [Zeng2019, shan2019investigating]. However, adding language information requires alignments in order to train the language identification subtask.
For monolingual ASR, teaching the model to predict future context has led to improvements in the main prediction. In [Jaitly2014AutoregressivePO], the model was trained to predict multiple frames at once given a contextualized input. For inference, results were averaged for the final prediction. In [zhang2015speech, zhang2016prediction], the main prediction was based on both the current frame and prediction made on previous frames. These works made use of frame-level alignments to compute the Cross-Entropy (CE) loss. Providing the required contextual information to the model in this manner should also alleviate the issue with CS utterances without the need to sacrifice speed. However, predicting future context in the CTC framework is not straightforward, since there is no explicit frame-level alignment, and most outputs in CTC are blanks which provide little contextual information.
We propose Contextualized CTC (CCTC) loss, a modification to the original CTC-based modeling that allows for prediction of surrounding context. CCTC loss does not require frame-level labels from an external model. Instead, CCTC loss allows the model to make a prediction of the current frame based on its best estimate of the context. Concretely, this is done by adding secondary prediction heads that predict the surrounding characters. The target prediction for the surrounding characters is acquired by using the prediction from the previous iteration. Using information from the previous iteration replaces the need for frame-based alignments from Hidden-Markov-Model-based models, which might not exactly match the CTC-based alignments. Moreover, predicting the surrounding characters should provide longer contextual information than predicting the surrounding frames, which has been shown to be more effective in [zhang2015speech].
Experiments on a Thai-English CS corpus show that the CCTC loss can help mitigate intra-word language inconsistency. Applying the CCTC loss has the same effect as implicitly learning a low-order LM, yet gives complementary gains when combined with n-gram rescoring.
In this section, we give an overview of the CTC loss [graves2006connectionist]. CTC is an alignment-free objective function for sequence-to-sequence tasks such as ASR and handwriting recognition. Given an input sequence, $X = (x_1, \ldots, x_T)$, the CTC loss maximizes the probability of predicting the ground truth transcription, $Y = (y_1, \ldots, y_U)$, $y_u \in A$, where $A$ is the set of characters in the languages, which in this work are the Thai and English alphabets. A blank token, $\epsilon$, is also included in the character set, $A' = A \cup \{\epsilon\}$, to handle noise, silence, and consecutive duplicate characters in transcriptions. The model outputs a path, $\pi = (\pi_1, \ldots, \pi_T)$, $\pi_t \in A'$, which has the same length as the input frames. Lastly, $\pi$ is mapped to an inferred transcription, $\hat{Y}$, using a mapping function $\mathcal{B}$. By applying the function $\mathcal{B}$, adjacent duplicate characters are merged and blank tokens are removed.
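The mapping from a path to a transcription can be illustrated with a minimal Python sketch. The function name `ctc_collapse` and the use of `"_"` as a stand-in for the blank token are assumptions made for this illustration only.

```python
from itertools import groupby

BLANK = "_"  # stand-in symbol for the CTC blank token (an assumption)


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to a transcription.

    Adjacent duplicate tokens are merged first, then blank tokens are removed.
    """
    merged = [token for token, _ in groupby(path)]
    return [token for token in merged if token != blank]


if __name__ == "__main__":
    # A blank between the two "l" tokens keeps them as a double letter.
    print("".join(ctc_collapse(list("ss_e_l_ll_eer"))))
```

The order of the two steps matters: removing blanks before merging would collapse genuine double letters such as the "ll" in "seller".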
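The CTC loss, $\mathcal{L}_{\mathrm{CTC}}$, is the negative log probability of all paths that can be mapped to the ground truth transcription. The CTC loss is calculated as follows:
$$\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P(\pi_t \mid X)$$
where $X$ is the input sequence of $T$ frames, $\pi$ is a path over those frames, and $\mathcal{B}^{-1}(Y)$ is the set of paths that the mapping function collapses to the transcription $Y$.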
3 Contextualized Connectionist Temporal Classification
The motivation for the CCTC loss is to indirectly introduce context conditioning into ASR models that focus on speed, which are usually non-autoregressive and non-recurrent. As shown in Fig. 1, additional prediction heads are added to the original CTC prediction head. These additional heads try to minimize the CE loss for predicting the next and previous characters according to the path predicted by a model trained with CTC. Since a non-autoregressive model is not aware of the surrounding predictions during inference, we introduce dependencies on the predicted left and right contexts, $l_t^{1:K}$ and $r_t^{1:K}$, to encourage the model to generate more consistent predictions. Therefore, the probability of the prediction at index $t$ can now be rewritten as $P(\pi_t \mid l_t^{1:K}, r_t^{1:K}, X)$, where $K$ is the context size. We refer to the CCTC loss with context size $K$ as a $K$-order CCTC from now on.
Since a typical CTC path output contains many blank tokens, which are not informative, we opt to train the context heads with dense character supervision from the prediction after the mapping $\mathcal{B}$ is applied instead. Concretely, the contexts $l_t^{k}$ and $r_t^{k}$ at a position $t$ are the $k$th-nearest characters to the left and right of $\pi_t$ that are not a blank token or a consecutive duplicate. However, a naive search for every position is computationally expensive. Thus, we propose the following efficient and generalizable algorithm that can scale to any order of the CCTC loss on top of the existing CTC pipeline.
First, we decompose the mapping function $\mathcal{B}$ into two steps, namely merging duplicates and removing blank tokens. We define $\mathcal{M}$ as a mapping that merges all consecutive duplicates into one and $\mathcal{R}$ as a mapping that removes blank tokens from the sequence. Thus, the path mapping can be written as $\mathcal{B}(\pi) = \mathcal{R}(\mathcal{M}(\pi))$. Given a merged path, $\pi^{m} = \mathcal{M}(\pi)$, we create a list, $I$, containing indices of $\pi^{m}$. An index, $I_t = j$, indicates that the letter $\pi^{m}_{j}$ is derived from the path token $\pi_t$ after applying the merging function $\mathcal{M}$. We demonstrate this process with an example in Fig. 2.
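The label construction described above can be sketched in Python for the first-order case. This is an illustrative reading rather than the exact procedure: the function names, the `"_"` blank symbol, and the choice to label a frame with the blank when no character exists on that side are all assumptions.

```python
BLANK = "_"  # stand-in symbol for the CTC blank token (an assumption)


def merge_with_index(path):
    """Merge consecutive duplicates and record, for every frame of the path,
    the index of the merged token it was folded into."""
    merged, index = [], []
    for token in path:
        if not merged or merged[-1] != token:
            merged.append(token)
        index.append(len(merged) - 1)
    return merged, index


def first_order_contexts(path, blank=BLANK):
    """Return per-frame (left, right) labels: the nearest non-blank character
    on each side in the merged path. Frames with no such character get the
    blank label, which is an assumption of this sketch."""
    merged, index = merge_with_index(path)
    n = len(merged)

    left, last = [blank] * n, blank
    for j in range(n):            # single left-to-right sweep
        left[j] = last
        if merged[j] != blank:
            last = merged[j]

    right, last = [blank] * n, blank
    for j in reversed(range(n)):  # single right-to-left sweep
        right[j] = last
        if merged[j] != blank:
            last = merged[j]

    # Broadcast the merged-level labels back to every frame via the index list.
    return [left[j] for j in index], [right[j] for j in index]


if __name__ == "__main__":
    left, right = first_order_contexts(list("_bb_e"))
    print("".join(left), "".join(right))
```

Working on the merged path means each sweep touches every merged token once, so no per-frame search is needed; frames that were merged together share the same labels through the index list.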
To calculate the $k$-order context loss, we apply the CE criterion to the obtained labels, as shown in (3).
where indicates the position of the first non-blank letter next to the letter , the subscript indicates the direction of the context loss. The variable is set to when and otherwise.
where the superscript indicates the order of the context loss.
Finally, the CCTC loss is the combination of the CTC loss and the context losses up to order $K$, as shown in (6). The weights of the left and right contexts can be set differently with $\alpha$ and $\beta$.
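One way to write this combination is sketched below. It assumes that the per-order context losses are simply summed, that $\alpha$ weights the left context heads and $\beta$ the right ones, and it uses $\mathcal{L}_{\mathrm{left}}^{k}$ and $\mathcal{L}_{\mathrm{right}}^{k}$ as placeholder names for the $k$-order left and right context losses.

```latex
\mathcal{L}_{\mathrm{CCTC}}
  = \mathcal{L}_{\mathrm{CTC}}
  + \alpha \sum_{k=1}^{K} \mathcal{L}_{\mathrm{left}}^{k}
  + \beta  \sum_{k=1}^{K} \mathcal{L}_{\mathrm{right}}^{k}
```

Under this reading, setting both weights to zero recovers the standard CTC loss.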
For our experiments, we used a 200-hour Thai speech corpus crawled from public YouTube podcast channels. The utterances were then manually transcribed. The recordings were converted to a 16 kHz sampling rate and 16-bit depth. CS with English was found in 4.4% of the training set, 4.3% of the development set, and 5.7% of the test set. More details are shown in Table 1. The YouTube channels in the test set are different from those in the training and development sets. Therefore, speakers in the test set are not in the training data.
For performance comparison and analysis, we separated the development set and test set into the monolingual part containing only Thai utterances and the CS part. We refer to these subsets as TH and TH-CS from now on. Note that training and hyperparameter tuning were still done on the entire data, making no such distinction.
| ||Train||Dev||Test|
|Duration||150 hr||24 hr||26 hr|
5 Experiments and Results
We conducted a series of experiments to measure the performance of models trained with our proposed CCTC loss. The experiments are designed to compare the CCTC loss with the standard CTC loss. The details of our implementation are provided in Sec. 5.1. We present the results on our Thai-English dataset in Sec. 5.2, and on the LibriSpeech dataset in Sec. 5.3. We also show the effect of different CCTC loss weights in Sec. 5.4 and different LM n-gram orders in Sec. 5.5.
5.1 Experimentation details
We adopted a non-autoregressive and fully-convolutional model, Wav2Letter+ (https://github.com/NVIDIA/OpenSeq2Seq), a modified version of wav2Letter [collobert2016wav2letter, liptchinsky2017based], as our base model. It comprises 17 1D convolutional layers and two fully connected layers at the end. We added context prediction heads and projection layers with Dropout [srivastava2014dropout] right after the last layer of the base model, as shown in Fig. 1. For simplicity, we only considered first-order CCTC. We also set $\alpha = \beta$ and tuned them using the development set.
Since the labels for the context heads are derived from the predicted path of the middle head, it is important that the context losses are applied only when these predictions are reliable. Therefore, in all experiments, we started by training the models with only the CTC loss for 130 epochs. Afterwards, the context losses were included, and the training resumed for an additional 170 epochs. This additional training was also performed on the CTC baseline models. For each mini-batch, the context labels were generated on-the-fly from the current model's output path for efficiency.
The default settings of Wav2Letter+ were used with some exceptions. The Adam optimizer [DBLP:journals/corr/KingmaB14] with a starting learning rate of 1e-4 was used for the first 130 epochs and then decreased to 4e-5 for the rest. The Layer-wise Adaptive Rate Control [you2017large] and weight decay were not used as we found them to hurt performance. We also replaced the polynomial decay with an exponential decay with a rate of 0.98.
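The two-stage schedule described above can be summarized in a small Python sketch. The function names are hypothetical, epochs are assumed to be indexed from 0, the first-order case is assumed, and the exponential decay is left out because the text does not say how often it is applied.

```python
CTC_ONLY_EPOCHS = 130  # stage 1: CTC loss only
CONTEXT_EPOCHS = 170   # stage 2: CTC loss plus context losses
TOTAL_EPOCHS = CTC_ONLY_EPOCHS + CONTEXT_EPOCHS


def context_weight(epoch, weight):
    """Weight applied to a context loss at a 0-based epoch index."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch outside the training schedule")
    return 0.0 if epoch < CTC_ONLY_EPOCHS else weight


def starting_learning_rate(epoch):
    """Learning rate at the start of each stage, before any decay."""
    return 1e-4 if epoch < CTC_ONLY_EPOCHS else 4e-5


def combined_loss(ctc, left, right, epoch, alpha, beta):
    """First-order combination of the CTC loss and the two context losses."""
    return (ctc
            + context_weight(epoch, alpha) * left
            + context_weight(epoch, beta) * right)
```

Gating the context losses by epoch keeps the first stage identical to plain CTC training, so the context labels are only used once the middle head's path is reasonably reliable.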
LM rescoring was also applied to investigate more realistic setups. We curated two corpora with 145M letters from Thai Wikipedia and 330M letters from Pantip (a Thai Q&A forum). For each corpus, character-based n-gram models were trained using KenLM [Heafield-kenlm] with pruning methods from [Likhomanenko2019]. The final LM was obtained by n-gram interpolation. A beam width of 32 was used for LM rescoring.
As Thai has no explicit or agreed-upon rules for word boundaries and space usage, we discarded all spaces and opted for the Character Error Rate (CER) as the evaluation metric instead of the Word Error Rate (WER).
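A minimal Python sketch of a space-insensitive CER follows. It assumes that a character is a single Unicode code point and that the CER is the Levenshtein distance divided by the reference length; the function names are chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate after discarding all spaces from both strings."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference is empty after removing spaces")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One missing "l" against a 10-character reference.
    print(cer("best seller", "bestseler"))
```

Because spaces are stripped before scoring, a hypothesis that differs from the reference only in spacing has a CER of zero.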
5.2 Effect of CCTC loss on EN-TH dataset
We investigated the performance of the model trained with the CCTC loss compared to the standard CTC loss on the Thai CS dataset. Table 2 summarizes the results of four setups: greedy decoding, beam search decoding without an LM, with a 3-gram LM, and with a 20-gram LM. For detailed analysis, we split the test utterances into two cases: Thai-only utterances (TH) and CS utterances (TH-CS). The performance consistently improves when using the CCTC loss for all setups on CS utterances. For TH utterances, CCTC performs slightly better.
Further qualitative analyses have shown that CCTC mostly fixes the inconsistencies in the spelling. The top example of Fig. 3 illustrates a case where the CTC baseline produces an improbable character sequence, while CCTC produces a non-Thai word that has a similar pronunciation to the ground truth. The bottom example illustrates a CS utterance. The phrase “best seller” is spelt with a mixture of Thai and English characters in the CTC case, while CCTC outputs English characters consistently. Note that the phoneme sequence /st/ only appears in loanwords in Thai.
5.3 Effect of CCTC loss on Monolingual English
In this experiment, we investigated the performance of the CCTC loss on a 100-hour subset of LibriSpeech [panayotov2015librispeech]. The goal of this experiment is to determine whether CCTC is beneficial to other languages. The character LM was trained on the whole LibriSpeech text corpus. Unlike the Thai experiments, we also report the WER, as per the standard evaluation for LibriSpeech. Table 3 summarizes the results on the test-clean utterances. As a strong baseline, we also provide results from the Wav2Letter++ [pratap2019wav2letter++] model taken from the Wav2Letter tutorial (https://github.com/facebookresearch/wav2letter/tree/master/tutorials/1-librispeech_clean), which was trained on the same subset. The CCTC loss improves the CER and WER over the CTC loss by 3.5% relative when used with a greedy decoder. When LM rescoring is used, the improvement over the CTC loss is only marginal. This suggests that the CCTC loss may be useful for monolingual data when LM rescoring is not an option.
|Model||Decoding||CER (%)||WER (%)|
|Wav2Letter++ [pratap2019wav2letter++]||beam w/ word 3-gr||8.72||18.97|
|CTC||beam w/ char 3-gr||8.20||21.76|
|CCTC||beam w/ char 3-gr||8.18||21.59|
|CTC||beam w/ char 10-gr||6.94||15.54|
|CCTC||beam w/ char 10-gr||6.94||15.52|
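To make the size of the gains with LM rescoring concrete, the relative reductions can be computed from the rows above. The sketch below assumes the two numeric columns are CER and WER in percent, in that order; the helper name is chosen for illustration.

```python
def relative_reduction(baseline, proposed):
    """Relative error reduction of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# (CTC, CCTC) error rates copied from the table rows above.
RESULTS = {
    "char 3-gram": {"CER": (8.20, 8.18), "WER": (21.76, 21.59)},
    "char 10-gram": {"CER": (6.94, 6.94), "WER": (15.54, 15.52)},
}

if __name__ == "__main__":
    for decoder, metrics in RESULTS.items():
        for name, (ctc, cctc) in metrics.items():
            gain = relative_reduction(ctc, cctc)
            print(f"{decoder} {name}: {gain:.2f}% relative")
```

All four reductions come out below 1% relative, compared with the 3.5% relative reported for greedy decoding.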
5.4 Effect of CCTC context loss weight
As the weights of the context losses, $\alpha$ and $\beta$, control the trade-off between the artificial context and the middle prediction during training, we studied how the choice of weights affects the performance. The results are shown in Fig. 4. The optimal context loss weight is larger when a greedy decoder is used than when a beam search decoder is used. This is because, when the external LM is not available, the model needs to rely more on the context heads to make consistent predictions. In general, we found that any value between 0.05 and 0.075 yields improvements over regular CTC on both LibriSpeech and our Thai corpus.
5.5 Effect of n-gram order
In order to understand what CCTC is learning, we varied the context size of the external LM and studied the effects. A Thai word is around 3-9 characters long, so a 10-gram character LM is comparable to a 2- or 3-gram word LM. As shown in Fig. 5, for monolingual TH, both CCTC and CTC models demonstrate almost identical results. On the other hand, for TH-CS, the CCTC loss performs consistently better regardless of the n-gram order. Note how the gap is small near 5-gram and widens as the n-gram order increases. This suggests that CCTC is learning something similar to a small LM, since we are forcing the model to learn about the context around it. However, the gains from CCTC and LM rescoring seem to be complementary, since the LM can carry longer context information. The top example of Fig. 3 shows how CCTC can help the model produce a more consistent character sequence, which can then be corrected again by the LM.
We introduced the CCTC loss for incorporating context information into a CTC-based non-autoregressive model. We showed that the CCTC loss improved results on utterances with CS by encouraging context consistency in the predicted path. We believe that the technique of adding context dependencies in the CCTC loss can be helpful for CS ASR regardless of the language. In the future, we plan to investigate the impact of different left and right context weights, as it might be more natural for the model to depend more on the preceding characters. We also plan to measure the effect of incorporating larger context sizes into the CCTC loss.