Data Augmentation with Locally-time Reversed Speech for Automatic Speech Recognition

10/09/2021 ∙ by Si-Ioi Ng, et al.

Psychoacoustic studies have shown that locally-time reversed (LTR) speech, i.e., speech whose signal samples are time-reversed within short segments, can be accurately recognised by human listeners. This study addresses the question of how well a state-of-the-art automatic speech recognition (ASR) system performs on LTR speech. The underlying objective is to explore the feasibility of deploying LTR speech in the training of end-to-end (E2E) ASR models as a data augmentation strategy for improving recognition performance. The investigation starts with experiments to understand the effect of LTR speech on general-purpose ASR. LTR speech with reversed segment durations of 5 ms to 50 ms is rendered and evaluated. For ASR training data augmentation with LTR speech, training sets are created by combining natural speech with different partitions of LTR speech. The efficacy of data augmentation is confirmed by ASR results on speech corpora in various languages and speaking styles. ASR on LTR speech with reversed segment durations of 15 ms to 30 ms is found to have a lower error rate than with other segment durations, and data augmentation with LTR speech in this range achieves consistent improvement in ASR performance.


1 Introduction

Perception and recognition of speech require a tremendous amount of processing in the human brain. Cross-disciplinary studies have investigated the relation between speech sounds and human behaviour. To understand how human listeners respond to different speech stimuli, unintelligible speech is often used. Speech reversal refers to reversing a speech signal in time, which typically renders the speech unintelligible. In studies of the irrelevant speech effect [4], time-reversed speech was found to affect the short-term memory of human listeners. In neurophysiological research [8], observations of neural responses to time-reversed speech revealed that the human brain discriminates speech stimuli based on acoustics rather than linguistic features. In speech technology, time-reversed speech has been applied to train a speech enhancement model [3], improving the quality of the enhanced speech.

While time-reversed speech is generally unintelligible, speech intelligibility, i.e. the percentage of correctly recognised syllables or words, was found to withstand time reversal performed within short local segments [21]. Such speech is known as locally-time reversed (LTR) speech. Let a 5-second speech utterance be divided into 250 segments of 20 ms in length. If the samples in each segment are reversed in time order, listeners still consider the manipulated speech highly intelligible. Cognitive restoration of LTR speech was further investigated in a multilingual study [25]. Across languages, intelligibility degraded as the duration of the time-reversed segments increased: it remains above 90% when the reversed segment duration is shorter than 50 ms, and drops below 25% when the duration exceeds 100 ms. This observation is in agreement with other monolingual studies on LTR speech [6, 9, 16, 12].

Analogous to human speech recognition (HSR), ASR aims to transform human speech into text and can be cast as a pattern recognition task. Motivated by human perception of LTR speech, we are interested in the following question: how would a general-purpose ASR system perform on LTR speech? Local time reversal clearly alters the spectral characteristics of natural speech, and this alteration increases the discrepancy between training data of natural speech and test data of LTR speech. However, if the reversed segment duration is relatively short, e.g. below 50 ms, the acoustic characteristics of phonemes, syllable structure and word order are largely preserved in LTR speech. We therefore expect LTR speech to remain similar to natural speech, so that ASR performance on LTR speech would be close to that on natural speech.

State-of-the-art ASR systems are predominantly based on the data-driven deep neural network (DNN) approach. Using data augmentation strategies to increase the quantity and variability of training data is effective in improving the performance of DNN models. Data augmentation does not require additional acoustic data and is fairly easy to implement. Commonly used methods include vocal-tract length perturbation [10], speed perturbation [13], and spectral masking, swapping and stretching [18, 22, 17]. Data augmentation with LTR speech has the potential to help improve ASR performance.

The present study starts with an investigation of the effect of LTR speech on ASR. The performance of state-of-the-art end-to-end (E2E) ASR systems on LTR speech is evaluated with speech corpora of different languages and speaking styles. The E2E design is chosen because it was shown to match human performance better than other predominant systems in psychometric experiments [27]. We find that recognition of LTR speech with particularly short or particularly long reversed segment durations is prone to errors. Based on these observations, we propose a speech data augmentation method that combines LTR speech with natural speech in ASR training. Extensive ASR experiments are carried out to confirm the effectiveness of this strategy.

2 Method

2.1 Speech corpora

Corpus      Type         Language          Hours
TIMIT       Read         English           5.4
WSJ         Read         English           81
Aishell     Read         Mandarin Chinese  178
Ted-lium2   Spontaneous  English           118

Table 1: Speech corpora.

ASR experiments are carried out on natural and LTR speech with four speech databases: TIMIT [5], WSJ [19], Aishell [2] and Ted-lium2 [20]. The databases are chosen in consideration of language diversity, speaking style and the amount of speech data. TIMIT is a read-speech corpus of English containing phonetically balanced utterances; it has been widely used to support research on acoustic phonetics and continuous speech recognition. WSJ is a read-speech corpus of English news covering a large vocabulary, and has been extensively used for benchmark evaluation of ASR systems. Aishell is an open-source corpus of Mandarin read speech covering a wide range of topics. Ted-lium2 is a corpus of spontaneous English speech extracted from recordings of public talks, whose content is less well organized. Details of the four databases are summarised in Table 1.

2.2 Locally-time reversed speech

Figure 1: Spectrograms of an utterance from WSJ: (a) unaltered, (b) reversed segment duration of 5 ms, (c) reversed segment duration of 50 ms.

The narrowing performance gap between ASR and HSR has motivated exploratory studies on the proximity of ASR and the human auditory system [14, 24]. Although some of the latest ASR systems surpass human performance on specific tasks [28, 17], ASR remains clearly inferior to the human auditory system under challenging acoustic conditions, e.g. when input speech is clipped, spectrally modulated, band-pass filtered, or masked by noise [27, 23]. The present study focuses on the effect of LTR speech, a type of temporally distorted speech, on E2E ASR. Local reversal with segment durations from 5 ms to 50 ms, in steps of 5 ms, is applied to the speech data in TIMIT, WSJ, Aishell and Ted-lium2. In [1], the effect of LTR speech on ASR was investigated with the English corpus Ted-lium2 and the Japanese corpus CSJ [15]. LTR speech with segment durations of 25 ms, 50 ms, 70 ms and 100 ms was examined, and ASR performance was found to degrade substantially when the reversed segment duration was longer than 50 ms.

Figure 1 shows the spectrograms of natural and LTR speech of the same utterance. The temporal order of phonemes is mostly preserved in the LTR speech. In the case of 5 ms local reversal, acoustic properties of speech such as pitch and formant contours are locally altered. Local reversal in short segments tends to introduce noise, owing to discontinuities of speech samples at the segment boundaries. Reversal in relatively longer segments, e.g. 50 ms, does not cause much noise, but tends to disrupt the smooth transitions between speech sounds.
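
The local reversal operation itself is simple to reproduce. Below is a minimal Python sketch (not the authors' original tooling) that renders LTR versions of a waveform for reversed segment durations of 5 ms to 50 ms; the input/output file names and the use of the `soundfile` package are assumptions for illustration.

```python
# Minimal sketch of locally-time reversed (LTR) speech rendering.
# Assumptions: mono WAV input, `soundfile` for I/O, hypothetical file names.
import numpy as np
import soundfile as sf

def locally_time_reverse(signal: np.ndarray, sample_rate: int, segment_ms: float) -> np.ndarray:
    """Reverse the sample order within consecutive segments of `segment_ms` milliseconds."""
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = signal.copy()
    for start in range(0, len(signal), seg_len):
        out[start:start + seg_len] = signal[start:start + seg_len][::-1]
    return out

if __name__ == "__main__":
    speech, rate = sf.read("utterance.wav")      # hypothetical input utterance
    for segment_ms in range(5, 55, 5):           # 5 ms to 50 ms in steps of 5 ms
        ltr = locally_time_reverse(speech, rate, segment_ms)
        sf.write(f"utterance_ltr_{segment_ms}ms.wav", ltr, rate)
```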

2.3 E2E ASR system

The ASR model is implemented with an encoder-decoder architecture. It is trained toward two objectives, namely connectionist temporal classification (CTC) and attention-based classification [11]. Tokens such as English/Chinese characters, phonemes and graphemes are extracted from the speech transcriptions as the output labels (or symbols). CTC aims to map the input acoustic features to the output token sequence via the encoder. Assume a ground-truth token sequence $C$ contains $L$ token labels, and let an intermediate token sequence $Z$ of length $T$ represent $C$ with repetitions of token labels and blank labels, where $T \geq L$. Given input features $X$ of $T$ frames in length, the CTC loss function is defined as,

$$\mathcal{L}_{\mathrm{CTC}} = - \ln \sum_{Z \in \mathcal{Z}(C')} P(Z \mid X),$$

where $C'$ is the prolonged version of $C$ with blank tokens inserted in between the ground-truth tokens, as well as at the beginning and the end of $C$, and $\mathcal{Z}(C')$ denotes all possible sequences $Z$ of $C'$.

The decoder predicts the output token sequence in an auto-regressive manner with the attention mechanism. Let $c_u$ be the $u$-th ground-truth token; the decoder loss is expressed as,

$$\mathcal{L}_{\mathrm{att}} = - \sum_{u=1}^{L} \ln P(c_u \mid X, c_1, \ldots, c_{u-1}).$$

The two loss functions are fused in a multi-task manner by,

$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},$$

where $\lambda$ is the task weight. In the decoder, a language model (LM) can be incorporated by shallow fusion. The ASR system determines the best token sequence as,

$$\hat{C} = \operatorname*{arg\,max}_{C \in \mathcal{C}^{*}} \big\{ \gamma \ln P_{\mathrm{CTC}}(C \mid X) + (1 - \gamma) \ln P_{\mathrm{att}}(C \mid X) + \beta \ln P_{\mathrm{LM}}(C) \big\},$$

where $\mathcal{C}^{*}$ denotes the complete set of hypotheses, and $\gamma$ and $\beta$ are hyper-parameters that control the contributions of the CTC, attention and LM scores.
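
As a concrete illustration of the multi-task objective and the shallow-fusion score, the following PyTorch sketch computes $\lambda \mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\mathcal{L}_{\mathrm{att}}$ and the combined hypothesis score. This is a simplified stand-in rather than ESPnet's actual implementation; the tensor shapes, padding index and function names are assumptions.

```python
# Simplified sketch of the joint CTC-attention loss and shallow-fusion scoring.
# Not ESPnet's actual code: shapes, padding index and names are assumptions.
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs, targets, input_lens, target_lens,
               att_logits, att_targets, weight=0.3, blank=0, pad=-1):
    """weight * L_CTC + (1 - weight) * L_att, following the multi-task fusion above."""
    # CTC branch: ctc_log_probs has shape (T, batch, vocab) and is log-softmaxed over vocab.
    l_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens, blank=blank)
    # Attention branch: token-level cross-entropy over decoder outputs of shape (batch, U, vocab).
    l_att = F.cross_entropy(att_logits.transpose(1, 2), att_targets, ignore_index=pad)
    return weight * l_ctc + (1.0 - weight) * l_att

def fused_score(ctc_logp, att_logp, lm_logp, ctc_weight=0.3, lm_weight=0.6):
    """Shallow-fusion score of one hypothesis from CTC, attention and LM log-probabilities."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp + lm_weight * lm_logp
```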

Corpus      Model                               Architecture (Layers / Units)           CTC-Attention Weight   Epochs / Patience
TIMIT       Bi-directional GRU with projection  Encoder: 5 / 320; Decoder: 1 / 300      0.5                    20 / 3
WSJ         Conformer [7]                       Encoder: 12 / 2048; Decoder: 6 / 2048   0.3                    100 / 5
Aishell     Transformer                         Encoder: 12 / 2048; Decoder: 6 / 2048   0.3                    100 / None
Ted-lium2   Conformer                           Encoder: 17 / 1024; Decoder: 4 / 1024   0.3                    50 / None

Table 2: System and training configurations of E2E ASR for the four tasks/databases.

Corpus      Model                           Architecture (Layers / Units)   Epochs / Patience
WSJ         Long short-term memory (LSTM)   1 / 1000                        20 / 3
Aishell     LSTM                            2 / 650                         20 / 3
Ted-lium2   LSTM                            4 / 2048                        2 / None

Table 3: RNNLM training setup.

Training of the E2E models is implemented with the ESPnet toolkit [26] using the Adadelta optimizer [29]. The input acoustic features are 80-dimensional Mel-scale filter-bank (F-bank) coefficients with mean and variance normalisation. The model architectures and training parameters are summarised in Table 2. Recurrent neural network LMs (RNNLMs) are trained on WSJ, Aishell and Ted-lium2. All RNNLMs adopt the long short-term memory (LSTM) architecture; details are provided in Table 3. In addition, a 4-gram LM is trained on Aishell and used jointly with the RNNLM.
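
As a rough illustration of this front-end, the sketch below computes 80-dimensional F-bank features and applies per-utterance mean and variance normalisation using torchaudio. ESPnet's own feature pipeline (e.g., global CMVN statistics) may differ, and the input file name is hypothetical.

```python
# Rough sketch of the 80-dim F-bank front-end with mean/variance normalisation.
# ESPnet's actual pipeline may differ (e.g., global CMVN); the input path is hypothetical.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")         # mono waveform, shape (1, time)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)                                                                # shape (num_frames, 80)
fbank = (fbank - fbank.mean(dim=0)) / (fbank.std(dim=0) + 1e-8)  # per-utterance normalisation
```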

3 Results And Discussion

3.1 ASR performance on locally time-reversed speech

Figure 2: ASR performance on LTR speech in different languages: (a) TIMIT, (b) WSJ, (c) Aishell, (d) Ted-lium2.

The ASR systems are first trained on natural speech and used to decode input utterances of LTR speech. The segment duration for local reversal varies from 5 ms to 50 ms. Recognition performance is measured in terms of phone error rate (PER) on TIMIT, character error rate (CER) on Aishell, and word error rate (WER) on WSJ and Ted-lium2. Except for the TIMIT phone recognition task, shallow fusion with an RNNLM is applied in decoding. The results are shown in Figure 2. It must be noted that many previous studies on LTR speech [25, 6, 9, 16, 1] assumed a minimum segment duration of 20 ms; the effect of local time reversal with notably short segments, i.e. below 20 ms, has not been well understood.
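
Decoding itself follows the standard recipes of each corpus; for completeness, a small sketch of scoring the decoded LTR test sets per reversed-segment duration is given below. The hypothesis/reference file layout and the use of the `jiwer` package are assumptions for illustration.

```python
# Sketch of scoring decoded LTR test sets per reversed-segment duration.
# File names and the `jiwer` dependency are assumptions; the paper's scoring
# follows the standard recipe of each corpus.
import jiwer

with open("ref.txt") as f:                           # one reference transcript per line
    refs = [line.strip() for line in f]

for segment_ms in range(5, 55, 5):
    with open(f"hyp_ltr_{segment_ms}ms.txt") as f:   # hypothetical decoder output
        hyps = [line.strip() for line in f]
    print(f"{segment_ms} ms: WER = {100 * jiwer.wer(refs, hyps):.1f}%")
```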

Corpus (metric)               LM    Original      Set 1         Set 2         Set 3         Set 4         Set 5         Speed Perturbation  SpecAugment
                                    (Baseline)    (+5/10 ms)    (+15/20 ms)   (+25/30 ms)   (+35/40 ms)   (+45/50 ms)   (3-fold)
TIMIT (dev/test), PER(%)      No    16.1 / 18.1   16.1 / 18.3   15.8 / 17.6   16.1 / 17.9   16.5 / 18.3   17.4 / 19.1   14.9 / 17.3         17.8 / 18.8
WSJ (dev93/eval92), WER(%)    Yes   7.2 / 4.5     7.0 / 4.6     7.0 / 4.5     6.4 / 4.3     6.8 / 4.4     6.9 / 4.4     6.1 / 3.9           7.4 / 4.3
                              No    18.6 / 14.5   13.2 / 9.9    14.0 / 11.1   13.6 / 10.5   14.1 / 10.3   13.8 / 10.7   12.2 / 9.7          15.3 / 11.9
Aishell (dev/test), CER(%)    Yes   6.4 / 7.4     5.7 / 6.3     5.7 / 6.5     5.8 / 6.5     5.8 / 6.7     5.9 / 6.7     5.4 / 5.9           6.5 / 7.0
                              No    6.5 / 8.0     5.8 / 6.6     6.5 / 6.8     5.9 / 6.9     6.0 / 7.1     6.1 / 7.1     5.7 / 6.7           5.1 / 5.8
Ted-lium2 (dev/test), WER(%)  Yes   10.4 / 8.4    10.0 / 8.3    10.2 / 8.3    10.2 / 8.1    10.2 / 8.1    10.1 / 8.2    9.2 / 7.7           9.8 / 7.5
                              No    10.8 / 9.5    10.2 / 9.1    10.3 / 9.0    11.0 / 9.0    10.9 / 9.0    10.4 / 9.0    9.8 / 8.3           10.6 / 8.0

Table 4: Recognition performance (dev / test) using LTR speech in ASR training. Each training set combines the original (natural) speech with LTR speech of two reversed segment durations: Set 1 adds 5 ms and 10 ms, Set 2 adds 15 ms and 20 ms, Set 3 adds 25 ms and 30 ms, Set 4 adds 35 ms and 40 ms, and Set 5 adds 45 ms and 50 ms. The LM column indicates whether shallow fusion with an LM is applied in decoding.

The error rate first increases from no reversal to reversal with a 5 ms segment duration. As the segment duration increases to 10 ms and 20 ms, the error rate drops. When the segment duration increases further from 20 ms to 50 ms, the error rate rises steadily. These trends are consistent across the corpora regardless of model architecture. The error rate peak at 5 ms could be caused by the additional noise and changes in acoustic properties of LTR speech; the resulting dissimilarity between training and test data degrades recognition performance.

Different types of ASR errors are analysed for LTR speech and natural speech. In the case of TIMIT, frequent errors involve nasal substitution (/m/ → /n/) and approximant substitution (/w/ → /l/); these errors are not common for natural speech. Insertion and deletion of the article ‘a’ in WSJ, and deletion of ‘a’, ‘and’, ‘the’ in Ted-lium2, are persistent ASR errors for natural speech. These errors are found more frequently for LTR speech, especially when the reversed segment duration is 5 ms or above 20 ms. For Aishell, substitution errors by homophones, i.e. two different Chinese characters sharing the same constituent phonemes and tone, are common in the recognition of natural speech. For LTR speech, ASR is more prone to minimal-pair errors when the reversed segment duration is 10 ms or below. Two Chinese characters are said to be a minimal pair if they differ by a single phonological element: for instance, they may differ in tone, e.g. ‘shi4’ (city) and ‘shi2’ (time), or in one phoneme, e.g. the Mandarin particles ‘le’ and ‘de’. More deletion errors are found when the reversed segment duration is 15 ms or above.

The recognition performance and error analysis on LTR speech help clarify how general-purpose ASR systems behave on LTR speech. The results suggest that if an appropriate reversed segment duration is chosen, the mismatch between natural and LTR speech can be alleviated.

3.2 Data augmentation with LTR speech for ASR training

Utterances of LTR speech are divided into 5 sets according to the reversed segment duration, as shown in Table 4. Each set represents a different level of intelligibility. For data augmentation, each set of LTR speech utterances is combined with natural speech in ASR training, creating a 3-fold training set. Data augmentation with LTR speech is compared with speed perturbation and SpecAugment, two of the most widely used data augmentation strategies. With speed perturbation factors of 0.9, 1.0 and 1.1, a 3-fold training set is created with the same quantity of training data as data augmentation with LTR speech. SpecAugment warps and masks the time and frequency dimensions of the input F-bank coefficients and does not increase the size of the original training data. Model training follows the configurations in Table 2. The systems trained on natural speech only are regarded as the baseline systems. Recognition performance on natural speech is given in Table 4.
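
To make the augmentation recipe concrete, the sketch below builds a Set-2-style 3-fold training set by pairing each natural utterance with LTR copies at 15 ms and 20 ms. The directory layout and file naming are hypothetical; the resulting utterance list then feeds the same training configuration as the baseline, reusing the original transcriptions for the LTR copies.

```python
# Sketch of a 3-fold training set: natural speech + LTR 15 ms + LTR 20 ms (Set 2).
# The corpus layout and naming are hypothetical; data preparation is otherwise unchanged.
import glob
import numpy as np
import soundfile as sf

def locally_time_reverse(signal, rate, segment_ms):
    seg = max(1, int(round(rate * segment_ms / 1000.0)))
    return np.concatenate([signal[i:i + seg][::-1] for i in range(0, len(signal), seg)])

for wav_path in glob.glob("train/*.wav"):                # hypothetical training utterances
    speech, rate = sf.read(wav_path)
    for segment_ms in (15, 20):                          # the two durations of Set 2
        out_path = wav_path.replace(".wav", f"_ltr{segment_ms}ms.wav")
        sf.write(out_path, locally_time_reverse(speech, rate, segment_ms), rate)
# Each original utterance now has two LTR copies, tripling the training data
# while the transcriptions remain unchanged.
```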

Data augmentation with LTR speech improves ASR performance over the baseline system on all four corpora. This confirms the feasibility and efficacy of deploying LTR speech in ASR training. For each corpus, the best performance attained by data augmentation with LTR speech is comparable to that with speed perturbation. Data augmentation with LTR speech outperforms SpecAugment on TIMIT, WSJ and Aishell, but not on Ted-lium2.

When shallow fusion with an LM is applied, using LTR speech with reversed segment durations shorter than 30 ms (Sets 1-3) performs better than using longer segment durations (Sets 4-5) on WSJ, Aishell and Ted-lium2. Without shallow fusion, better performance is obtained with LTR speech with reversed segment durations of 5 ms and 10 ms (Set 1). For TIMIT, the best result is achieved with LTR speech with reversed segment durations of 15 ms and 20 ms (Set 2). Consistent improvement is observed when using LTR speech with reversed segment durations of 15 ms to 30 ms (Sets 2-3). These results agree with Figure 2, in which the error rates for this range are comparatively low: LTR speech in this range contains less noise and retains smoother transitions between speech sounds than longer segment durations. Local reversal with 15 ms to 30 ms segments is therefore recommended for effective data augmentation with LTR speech.

4 Conclusion

This paper studies the effect of LTR speech on E2E ASR performance and the feasibility of using LTR speech in ASR training. With general-purpose ASR systems trained on natural speech, recognition of LTR speech with very short or very long reversed segment durations is prone to errors. Data augmentation with LTR speech is confirmed to improve ASR performance on natural speech; using LTR speech with reversed segment durations of 15 ms to 30 ms consistently improves ASR performance on corpora of different languages and speaking styles. In future work, the intelligibility of LTR speech with short reversed segment durations, e.g. below 20 ms, will be studied via subjective rating, and the use of LTR speech will be extended to other topics in speech technology, such as speaker verification, speech enhancement and self-supervised learning of speech representations.

References

  • [1] T. Ashihara, T. Moriya, and M. Kashino (2021) Investigating the impact of spectral and temporal degradation on end-to-end automatic speech recognition performance. In Proc. of Interspeech, pp. 1757–1761. Cited by: §2.2, §3.1.
  • [2] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017) Aishell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In Proc. of O-COCOSDA, pp. 1–5. Cited by: §2.1.
  • [3] F. Chao, S. F. Jiang, B. Yan, J. Hung, and B. Chen (2021) TENET: a time-reversal enhancement network for noise-robust ASR. arXiv preprint arXiv:2107.01531. Cited by: §1.
  • [4] W. Ellermeier and K. Zimmer (2014) The psychoacoustics of the irrelevant sound effect. Acoustical Science and Technology 35 (1), pp. 10–16. Cited by: §1.
  • [5] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93, pp. 27403. Cited by: §2.1.
  • [6] S. Greenberg and T. Arai (2001) The relation between speech intelligibility and the complex modulation spectrum. In Proc. of Eurospeech, Cited by: §1, §3.1.
  • [7] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020) Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. of Interspeech, pp. 5036–5040. Cited by: Table 2.
  • [8] M. F. Howard and D. Poeppel (2010) Discrimination of speech stimuli based on neuronal response phase patterns depends on acoustics but not comprehension. Journal of Neurophysiology 104 (5), pp. 2500–2511. Cited by: §1.
  • [9] M. Ishida, T. Arai, and M. Kashino (2018) Perceptual restoration of temporally distorted speech in L1 vs. L2: local time reversal and modulation filtering. Frontiers in Psychology 9, pp. 1749. Cited by: §1, §3.1.
  • [10] N. Jaitly and G. E. Hinton (2013) Vocal tract length perturbation (vtlp) improves speech recognition. In Proc. of ICML, Vol. 117, pp. 21. Cited by: §1.
  • [11] S. Kim, T. Hori, and S. Watanabe (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In Proc. of ICASSP, pp. 4835–4839. Cited by: §2.3.
  • [12] M. Kiss, T. Cristescu, M. Fink, and M. Wittmann (2008) Auditory language comprehension of temporally reversed speech signals in native and non-native speakers. Acta neurobiologiae experimentalis 68 (2), pp. 204. Cited by: §1.
  • [13] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In Proc. of Interspeech, pp. 3586–3589. Cited by: §1.
  • [14] R. P. Lippmann (1997) Speech recognition by machines and humans. Speech communication 22 (1), pp. 1–15. Cited by: §2.2.
  • [15] K. Maekawa, H. Koiso, S. Furui, and H. Isahara (2000) Spontaneous speech corpus of Japanese. In LREC, Vol. 6, pp. 1–5. Cited by: §2.2.
  • [16] I. Magrin-Chagnolleau, M. Barkat, and F. Meunier (2002) Intelligibility of reverse speech in French: a perceptual study. In Proc. of ICSLP. Cited by: §1, §3.1.
  • [17] T. Nguyen, S. Stueker, J. Niehues, and A. Waibel (2020) Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. In Proc. of ICASSP, pp. 7689–7693. Cited by: §1, §2.2.
  • [18] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. of Interspeech, pp. 2613–2617. Cited by: §1.
  • [19] D. B. Paul and J. Baker (1992) The design for the wall street journal-based csr corpus. In Proceedings of the Workshop on Speech and Natural Language, pp. 357–362. Cited by: §2.1.
  • [20] A. Rousseau, P. Deléglise, and Y. Esteve (2012) TED-lium: an automatic speech recognition dedicated corpus.. In LREC, pp. 125–129. Cited by: §2.1.
  • [21] K. Saberi and D. R. Perrott (1999) Cognitive restoration of reversed speech. Nature 398 (6730), pp. 760–760. Cited by: §1.
  • [22] X. Song, Z. Wu, Y. Huang, D. Su, and H. Meng (2020) SpecSwap: a simple data augmentation method for end-to-end speech recognition. In Proc. of Interspeech, pp. 581–585. Cited by: §1.
  • [23] C. Spille, B. Kollmeier, and B. T. Meyer (2018) Comparing human and automatic speech recognition in simple and complex acoustic scenes. Computer Speech & Language 52, pp. 123–140. Cited by: §2.2.
  • [24] S. Thomas, M. Suzuki, Y. Huang, G. Kurata, Z. Tuske, G. Saon, B. Kingsbury, M. Picheny, T. Dibert, A. Kaiser-Schatzlein, et al. (2019) English broadcast news speech recognition by humans and machines. In Proc. of ICASSP, pp. 6455–6459. Cited by: §2.2.
  • [25] K. Ueda, Y. Nakajima, W. Ellermeier, and F. Kattner (2017) Intelligibility of locally time-reversed speech: a multilingual comparison. Scientific reports 7 (1), pp. 1–8. Cited by: §1, §3.1.
  • [26] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai (2018) ESPnet: end-to-end speech processing toolkit. In Proc. of Interspeech, pp. 2207–2211. Cited by: §2.3.
  • [27] L. Weerts, S. Rosen, C. Clopath, and D. F. Goodman (2021) The psychometrics of automatic speech recognition. bioRxiv. Cited by: §1, §2.2.
  • [28] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig (2017) Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (12), pp. 2410–2423. Cited by: §2.2.
  • [29] M. D. Zeiler (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §2.3.