Speech-to-text translation (ST), which translates source-language speech into target-language text, is useful in many scenarios such as international conferences, travel, and foreign-language video subtitling. Conventional cascaded approaches to ST Ney99; Matusov2005OnTI; Mathias2006; Berard2016 first transcribe the speech audio into source-language text (ASR) and then perform text-to-text machine translation (MT), which inevitably suffers from error propagation in the pipeline. To alleviate this problem, recent efforts explore end-to-end approaches (E2E-ST) Weiss2017; Berard2018EndtoEndAS; Vila2018EndtoEndST; Gangi2019, which are computationally more efficient at inference time and mitigate the risk of error propagation from imperfect ASR.
To improve the translation accuracy of E2E-ST models, researchers either initialize the ST encoder with a pre-trained ASR encoder Berard2018EndtoEndAS; bansal2019; WangWLY020 to get a better representation of the speech signal, or perform multi-task training (ASR+ST) to bring more training signals to the shared encoder anastasopoulos2016; anastasopoulos2018; Sperber2019; Liu2019 (see Fig. 1). Both methods improve translation quality by giving the encoder more training signals from which to learn better self-attention.
However, both of the above solutions assume that a substantial amount of transcribed source-language speech exists. Unfortunately, this assumption is problematic. On the one hand, for certain low-resource languages, especially endangered ones bird2010; bird2014, source speech transcriptions are expensive to collect. Moreover, the majority of human languages have no written form or no standard orthography, making phonetic transcription impossible duong2016. On the other hand, the amount of speech audio with transcriptions is limited (as transcriptions are expensive to collect), and far more audio exists without any annotation. It is much more straightforward and cheaper to leverage this raw audio directly to train a robust encoder.
To remove the dependency on source-language transcriptions, we present a very simple yet effective solution, Masked Acoustic Modeling (MAM), which utilizes speech data in a self-supervised fashion without using any source-language transcription, unlike other speech pre-training models Chuang2019; WangWLY020ACL. Alongside the regular training of E2E-ST (without ASR as multi-tasking or pre-training), MAM randomly masks certain portions of the speech input and aims to recover the masked speech signals from their context on the encoder side. MAM is a general technique that can also be used as a pre-training module on arbitrary acoustic signals, e.g., multilingual speech, music, animal sounds or even noise. The contributions of our paper are as follows:
We demonstrate the importance of a self-supervised module for E2E-ST. Unlike all previous attempts, which depend heavily on transcription, MAM improves the capacity of the encoder by recovering masked speech signals based merely on their context. MAM can also be used together with ASR multi-tasking.
MAM can also be used as a pre-training module by itself. In the pre-training setting, MAM can utilize arbitrary acoustic signals (e.g., music, animal sounds, or noise) other than regular speech audio. Considering that there is much more acoustic data than human speech, MAM has better potential to be used for pre-training. To the best of our knowledge, MAM is the first technique that is able to perform pre-training with any form of audio signal.
With the help of pre-training, MAM advances the E2E-ST performance by 1.36 in BLEU without using transcription. With the help of ASR multi-tasking, MAM further boosts the translation quality by 1.22 in BLEU.
We show that the success of MAM does not rely on intensive or expensive computation. All our experiments are done on 8 1080Ti GPUs within 2 days of training, without any parameter or architecture search.
2 Preliminaries: ASR and ST
We first briefly review standard E2E-ST and E2E-ST with ASR multi-task training to set up the notation. We then analyze why ASR training is beneficial to ST, which motivates our own solution, MAM, in the next section.
2.1 Vanilla E2E-ST Training with Seq2Seq
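Regardless of the particular design of Seq2Seq models for different tasks, the encoder always takes a source input sequence $\mathbf{x} = (x_1, \dots, x_n)$ of $n$ elements, where each $x_i \in \mathbb{R}^{d_x}$ is a $d_x$-dimensional vector, and produces a sequence of hidden representations $\mathbf{h} = f(\mathbf{x}) = (h_1, \dots, h_n)$, where $h_i \in \mathbb{R}^{d_h}$. The encoding function $f$ can be implemented by a mixture of convolution, RNN and Transformer layers. More specifically, $\mathbf{x}$ can be the spectrogram or mel-spectrogram of the source speech audio in our case, and each $x_i$ represents the frame-level feature of the speech signal within a certain duration.
On the other hand, the decoder greedily predicts a new output word $y_t$ given both the source sequence $\mathbf{x}$ and the prefix of decoded tokens, denoted $\mathbf{y}_{<t} = (y_1, \dots, y_{t-1})$. The decoder continues the generation until it emits <eos>, which finishes the entire decoding process. Finally, we obtain the hypothesis $\mathbf{y} = (y_1, \dots, \texttt{<eos>})$ with the model score defined as follows:
$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$$
During training, the entire model aims to maximize the conditional probability of each ground-truth target sentence $\mathbf{y}^\star$ given input $\mathbf{x}$ over the entire training corpus $D_{\mathbf{x},\mathbf{y}^\star}$, or equivalently to minimize the following loss:
$$\ell_{\text{ST}}(D_{\mathbf{x},\mathbf{y}^\star}) = -\sum_{(\mathbf{x},\mathbf{y}^\star) \in D_{\mathbf{x},\mathbf{y}^\star}} \log p(\mathbf{y}^\star \mid \mathbf{x})$$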
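As a small illustration of the model score and the training loss described above, the following sketch sums per-step log-probabilities for a toy corpus. The function names and the toy probabilities are hypothetical and stand in for what a real decoder would produce; this is not the training code used in our experiments.

```python
import math

def sequence_log_prob(step_probs):
    """Log of the model score: sum over steps of log p(y_t | x, y_<t)."""
    return sum(math.log(p) for p in step_probs)

def st_loss(corpus_step_probs):
    """Negative log-likelihood of the reference translations over a corpus.

    Each item holds the probabilities the decoder assigns to the
    ground-truth tokens of one target sentence (including <eos>).
    """
    return -sum(sequence_log_prob(probs) for probs in corpus_step_probs)

# Hypothetical per-token probabilities for two reference sentences.
toy_corpus = [[0.5, 0.25], [0.5]]
loss = st_loss(toy_corpus)
print(round(loss, 4))
```

Working in log space turns the product over decoding steps into a sum, which is also what keeps the loss numerically stable for long target sentences.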
Compared with other tasks that also employ the Seq2Seq framework for E2E training, e.g., MT or ASR, E2E-ST is more challenging in several ways. Firstly, the data modalities differ between the source and target sides. For ST, the encoder deals with speech signals while the decoder learns word representations, whereas MT has text on both sides. Secondly, due to the high sampling rate of speech signals, speech inputs are generally several times longer than the target sequence, which increases the difficulty of learning the correspondence between source and target. Thirdly, compared with the monotonic nature of alignment in ASR, ST usually needs to learn a global reordering between the speech signal and the translation, which raises the challenge to another level.
2.2 Multi-task Learning with ASR
To address the issues above, researchers have proposed either to use a pre-trained ASR encoder to initialize the ST encoder, or to perform ASR Multi-task Learning (MTL) together with ST training. We only discuss multi-task training here, since pre-training does not require significant changes to the Seq2Seq model.
During multi-task training, there are two decoders sharing one encoder. Besides the MT decoder, there is also another decoder for generating transcriptions. With the help of ASR training, the encoder is able to learn more accurate speech segmentations (similar to forced alignment) making the global reordering of those segments for MT relatively easier.
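We define the following training loss for ASR:
$$\ell_{\text{ASR}}(D_{\mathbf{x},\mathbf{z}^\star}) = -\sum_{(\mathbf{x},\mathbf{z}^\star) \in D_{\mathbf{x},\mathbf{z}^\star}} \log p(\mathbf{z}^\star \mid \mathbf{x})$$
where $\mathbf{z}^\star$ represents the annotated, ground-truth transcription for speech audio $\mathbf{x}$. In our baseline setting, we also adopt the hybrid CTC/Attention framework Watanabe2017 on the encoder side. In the case of multi-task training with ASR for ST, the total loss is defined as
$$\ell_{\text{ST}}(D_{\mathbf{x},\mathbf{y}^\star}) + \ell_{\text{ASR}}(D_{\mathbf{x},\mathbf{z}^\star})$$
where $\ell_{\text{ST}}$ is the ST training loss of Sec. 2.1.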
Fig. 2 illustrates the difference between E2E-ST and E2E-ST with ASR multi-task training. We extract the topmost encoder layer for comparison. We notice that E2E-ST tends to obtain more meaningful self-attention when it receives the training signal from ASR. With the help of ASR, the source input spectrogram is chunked into segments that contain phoneme-level information. Given these larger spectrogram segments, the target-side decoder only needs to reorder segments instead of frames. These benefits come from using the annotated transcription.
However, in the following sections, we demonstrate that our proposed MAM is able to obtain segmented, accurate self-attention without using transcriptions.
3 Masked Acoustic Modeling
As discussed above, all existing solutions for boosting E2E-ST performance depend heavily on the availability of source-language transcriptions. Those solutions cannot take advantage of the large amount of speech without any annotation. They also become inapplicable when the source language is low-resource, or does not even have a standard orthography. Therefore, the ideal solution should not be constrained by source-language transcription and should still achieve similar translation quality. We thus introduce MAM in this section.
3.1 MAM as Part of Training Objective
As discussed in Sec. 2.2, both ASR pre-training and ASR multi-task training are beneficial for encoder self-attention. Based on this observation, we propose to perform self-supervised training on the encoder side by reconstructing corrupted speech signals from the input. Note that MAM is totally different from other self-supervised training methods Chuang2019; WangSemantic2019; WangWLY020ACL, which rely on transcription to segment the speech audio with forced alignment tools Povey11thekaldi; McAuliffe2017. We directly apply random masks of different widths over the speech audio, eliminating the dependency on transcription. Therefore, MAM can be easily applied to speech audio without transcription and even to any non-human-speech audio, e.g., noise, music and animal sounds.
Formally, we define a random replacement function over the original speech input $\mathbf{x}$:
$$\hat{\mathbf{x}} = \mathrm{Rand}(\mathbf{x}, \lambda)$$
where $\mathrm{Rand}(\cdot)$ randomly replaces certain vectors in $\mathbf{x}$, each with probability $\lambda$, with the same randomly initialized vector $\epsilon \in \mathbb{R}^{d_x}$. Note that we use the same vector $\epsilon$ to represent all the corrupted frames (see one example in Fig. 4(b)). We then obtain a corrupted input $\hat{\mathbf{x}}$ and its corresponding latent representation $\hat{\mathbf{h}} = f(\hat{\mathbf{x}})$.
For the MAM module, we have the following training objective, which reconstructs the original speech signal from the surrounding context in a self-supervised fashion:
$$\ell_{\text{Rec}}(D_{\mathbf{x}}) = \sum_{\mathbf{x} \in D_{\mathbf{x}}} \Delta\big(\mathbf{x}, \phi(f(\hat{\mathbf{x}}))\big)$$
where $\phi$ is a reconstruction function which tries to recover the original signal from $\hat{\mathbf{h}}$. We use the function $\Delta$ to measure the difference between the reconstructed and the original signal. For simplicity, we use regular 2D deconvolution as $\phi$, and a norm of the difference as $\Delta$. Finally, we have the following total loss of our model:
$$\ell_{\text{MAM}} = \ell_{\text{ST}} + \ell_{\text{Rec}}$$
To further boost the performance of E2E-ST, we also have the option to use ASR multi-task training when transcription of the speech is available. Then our total loss is
$$\ell_{\text{MAM}} + \ell_{\text{ASR}}$$
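The corruption step and the combined objective can be sketched as follows. This is a minimal illustration under stated assumptions rather than our implementation: frames are plain Python lists, the squared L2 distance is assumed as the difference measure, and the encoder and deconvolution are replaced by an identity stand-in so that the snippet stays self-contained. All function names are hypothetical.

```python
import random

def corrupt(frames, mask_vector, mask_prob, rng):
    """Replace each frame by the shared mask vector with probability mask_prob."""
    corrupted, masked = [], []
    for frame in frames:
        hit = rng.random() < mask_prob
        masked.append(hit)
        corrupted.append(list(mask_vector) if hit else list(frame))
    return corrupted, masked

def reconstruction_loss(original, reconstructed):
    """Squared L2 distance summed over all frames (assumed choice of norm)."""
    return sum((o - r) ** 2
               for fo, fr in zip(original, reconstructed)
               for o, r in zip(fo, fr))

def total_loss(st_loss, rec_loss, asr_loss=0.0):
    """ST loss plus reconstruction loss, optionally plus the ASR loss."""
    return st_loss + rec_loss + asr_loss

rng = random.Random(0)
frames = [[float(t), float(t) + 0.5] for t in range(20)]  # toy 2-dim features
mask_vector = [rng.gauss(0.0, 1.0) for _ in range(2)]     # one shared random vector
corrupted, masked = corrupt(frames, mask_vector, 0.15, rng)

# Stand-in for the encoder followed by deconvolution: return the input unchanged.
reconstructed = corrupted
rec_loss = reconstruction_loss(frames, reconstructed)
print(sum(masked), "of", len(frames), "frames masked")
```

With the identity stand-in, the loss is non-zero only on masked frames, which makes explicit that the gradient signal comes from the positions the model has to fill in from context.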
3.2 Different Masking Strategies
MAM aims at a much harder task than purely textual pre-training models, e.g., BERT or ERNIE, which only perform semantic learning over missing tokens. In our case, we try to recover not only the semantic meaning but also the acoustic characteristics of the given audio. MAM simultaneously predicts the missing words and generates spectrograms, as in a speech synthesis task.
To ensure that the masked segments cover different levels of granularity of speech semantics, we propose the following masking methods.
Single Frame Masking
Uniformly mask a fraction $\lambda$ of the frames out of $\mathbf{x}$ to construct $\hat{\mathbf{x}}$. Note that some of the masked frames may be contiguous.
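A minimal sketch of single frame masking is given below, assuming that a fixed fraction of frame indices is drawn uniformly without replacement. The helper names are hypothetical. The second helper groups the selected indices into runs, which shows that neighbouring frames can be masked together by chance.

```python
import random

def single_frame_mask(num_frames, mask_ratio, rng):
    """Pick round(num_frames * mask_ratio) distinct frame indices uniformly."""
    num_masked = int(round(num_frames * mask_ratio))
    return sorted(rng.sample(range(num_frames), num_masked))

def contiguous_runs(indices):
    """Group sorted indices into runs of consecutive frames."""
    runs = []
    for idx in indices:
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)
        else:
            runs.append([idx])
    return runs

rng = random.Random(0)
masked = single_frame_mask(100, 0.15, rng)
runs = contiguous_runs(masked)
print(masked)
print("longest contiguous run:", max(len(r) for r in runs))
```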
Span Masking
Similar to SpanBERT joshi2020spanbert, we first sample a series of span widths and then apply those spans at random positions of the input signal. Note that we do not allow spans to overlap in this case.
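One way to realize non-overlapping span masking is sketched below. The width distribution (uniform between a minimum and a maximum) and the placement scheme are assumptions made for illustration, and the function names are hypothetical. The placement treats every span as a single token among the unmasked frames, chooses which token positions are spans, and then expands them, so that overlap is impossible by construction.

```python
import random

def sample_span_widths(num_frames, mask_ratio, min_width, max_width, rng):
    """Sample widths until the masking budget is used up.

    The last span is truncated to fit the budget, so it may be
    shorter than min_width.
    """
    budget = int(round(num_frames * mask_ratio))
    widths = []
    while budget > 0:
        width = min(rng.randint(min_width, max_width), budget)
        widths.append(width)
        budget -= width
    return widths

def place_spans(num_frames, widths, rng):
    """Place spans at random positions without overlap; returns (start, end) pairs."""
    free = num_frames - sum(widths)  # frames that stay unmasked
    slots = sorted(rng.sample(range(free + len(widths)), len(widths)))
    spans, offset = [], 0
    for i, (slot, width) in enumerate(zip(slots, widths)):
        start = (slot - i) + offset  # unmasked frames before + earlier span widths
        spans.append((start, start + width))
        offset += width
    return spans

rng = random.Random(0)
widths = sample_span_widths(100, 0.15, 2, 6, rng)
spans = place_spans(100, widths, rng)
print(spans)
```

Two spans may still end up directly adjacent; if a gap between spans is required, one unmasked frame can be reserved per span before sampling the slots.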
Segmentation with non-silence Segments Detection
The above methods do not guarantee that the pronunciation of a full word is masked; part of the masked word's information can therefore leak into the input, making the recovery easier. To perform word- or phoneme-level masking, we use a silence detection algorithm to segment full words or phonemes.
Given a raw audio signal, we first process it with a Gaussian low-pass filter savitzky64; William2007 and set a threshold on the normalized, smoothed signal to locate the silent areas. We can then easily locate and segment the non-silence areas. Fig. 4 shows one example of a segmented audio file; during MAM training, we randomly select segments to mask from those non-silence pieces.
In this way, we achieve a segmentation goal similar to that of forced-alignment-based methods Chuang2019; WangWLY020ACL without requiring transcription. More importantly, due to the flexibility of our random masking algorithms, MAM can be easily applied to all kinds of audio, e.g., music, animal sounds and even any kind of noise, during the pre-training step.
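The silence-based segmentation can be sketched as follows. The sketch assumes that the filter is applied to the absolute amplitude of the waveform, that the smoothed envelope is normalized by its peak, and that the kernel width and threshold take the illustrative values shown; none of these specifics are the settings used in our experiments, and the function names are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights on the offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Low-pass filter by convolution, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def non_silence_segments(samples, sigma=2.0, radius=4, threshold=0.3):
    """Return (start, end) index pairs where the normalized envelope exceeds the threshold."""
    envelope = smooth([abs(s) for s in samples], gaussian_kernel(sigma, radius))
    peak = max(envelope) or 1.0  # avoid division by zero on pure silence
    active = [v / peak > threshold for v in envelope]
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# Toy waveform: silence, a loud burst, silence, a quieter burst, silence.
samples = [0.0] * 20 + [1.0] * 20 + [0.0] * 20 + [0.8] * 20 + [0.0] * 20
segments = non_silence_segments(samples)
print(segments)
```

The segments returned here play the role of the non-silence pieces from which masks are drawn; choosing a random subset of them and replacing the corresponding frames then follows the same pattern as the other masking strategies.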
3.3 Pre-training MAM
MAM is a powerful technique that is not only beneficial to the training procedure, but can also be used as a pre-training framework that does not need any annotation.
The bottleneck of current speech-related tasks, e.g., ASR and ST, is the lack of annotated training corpora. For some languages that do not even have a standard orthography, such annotation is impossible to obtain.
Although current speech-related pre-training frameworks Chuang2019; WangWLY020ACL do relieve some of the need for large-scale parallel training corpora for E2E-ST, all of these pre-training methods still need intensive transcription annotation for the source speech.
During pre-training, we only use the encoder part of MAM. Thanks to our flexible masking techniques, MAM is able to perform pre-training with any kind of audio signal. In this way, MAM can recover any form of audio and produces more robust and accurate self-attention in practice when the system is used in a noisy environment. With different Transformer heads, MAM is also capable of distinguishing background and non-human sounds from speech and of generating more accurate translations and transcriptions.
4 Experiments
In this section, we present MAM results on E2E-ST English-to-German translation on the MUST-C dataset mustc. All raw audio files are processed by Kaldi Povey11thekaldi to extract 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, using a window size of 25 ms and a step size of 10 ms. Our basic E2E-ST framework has settings similar to ESPnet-ST inaguma-etal-2020-espnet. We first downsample the speech input with 2 layers of 2D convolution of size 3 with a stride of 2. Then there is a standard 12-layer Transformer with a hidden size of 2048 to bridge the source and target sides. We use only 4 attention heads on each side of the Transformer, and each of them has a dimensionality of 256. For the MAM module, we simply linearly project the outputs of the Transformer encoder to another latent space, then upsample the latent representation with a 2-layer deconvolution to match the size of the original input signal. For the random masking ratio, we choose 15% across all the experiments, including pre-training. During inference, we do not perform any masking over the speech input.
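To make the sequence lengths implied by this setup concrete, the following worked example computes the number of frames for an utterance and the length after the two downsampling convolutions. It assumes that frames extending past the end of the audio are dropped and that the convolutions use no padding; both are assumptions for illustration, and the function names are hypothetical.

```python
def num_frames(duration_ms, window_ms=25, step_ms=10):
    """Number of full analysis windows that fit into the utterance."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // step_ms

def conv_out_len(length, kernel=3, stride=2, padding=0):
    """Output length of one convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1

feature_dim = 80 + 3                                 # log-Mel filterbanks + pitch
frames = num_frames(10_000)                          # a 10-second utterance
encoder_len = conv_out_len(conv_out_len(frames))     # after 2 stride-2 layers
masked_frames = round(0.15 * frames)                 # 15% masking ratio

print(feature_dim, frames, encoder_len, masked_frames)
```

Under these assumptions the encoder sees roughly a quarter of the input frames, which is why the MAM module needs two upsampling layers to get back to the original resolution before the reconstruction loss is computed.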
We conducted MAM pre-training experiments on two corpora, Librispeech (human speech in English) panayotov2015librispeech and FMA (music audio) defferrard2016fma. The statistics of the datasets are shown in Table 2.
MAM is very easy to replicate, as we do not perform any parameter or architecture search on top of the baseline system. Thanks to its simple but effective design, MAM does not rely on intensive computation; it converges within 2 days of training on 8 1080Ti GPUs. We compare the number of parameters of different solutions to E2E-ST in Table 1. This is a big difference from currently popular computation-intensive frameworks such as BERT BERT (340M parameters) and GPT-3 brown2020language (175B parameters), and makes this technique accessible to regular users.
| Cascade di2019adapting * | - | - | 18.5 |
| E2E-ST + ASR MTL * | 8.71 | 16.98 | 20.65 |
| MAM as auxiliary task |
| MAM (span) + ASR MTL | 9.1 | 17.79 | 21.87 |
| ASR encoder pre-trained with English speech data |
| E2E-ST + ASR pre-trained encoder * | 10.21 | 16.9 | 20.26 |
| MAM pre-trained with English speech data |
| MAM (span) + ASR MTL | 9.17 | 17.2 | 21.6 |
| MAM pre-trained with any audio (FMA music data) |
4.1 Visualizing Reconstruction
To verify the pre-training results of MAM, we demonstrate its reconstruction ability by visualizing the results in Fig. 5. We first show the original spectrogram of a given speech utterance in Fig. 4(a). We then corrupt the original spectrogram by replacing the selected mask frames with $\epsilon$, which is a randomly initialized vector, to form $\hat{\mathbf{x}}$ (see Fig. 4(b)). In Fig. 4(c), we show that our proposed MAM is able to recover the missing frames after pre-training on the Librispeech dataset. Since MAM does not need any transcription to perform pre-training, we also pre-train MAM on the FMA corpus fma_challenge, which is a music dataset. Surprisingly, this model shows reconstruction ability very similar to that of the model pre-trained on the speech dataset, even though the corrupted audio contains only speech.
The above analysis opens up a new approach to pre-training for speech-related tasks, suggesting that we should pre-train on any sort of audio instead of relying solely on speech data. Considering the large-scale audio already available on the Internet and the much smaller amount of annotated speech audio, MAM has much greater potential to improve the performance of any speech-related task, e.g., ASR, ST, and even speech synthesis.
4.2 Translation Accuracy Comparisons
We compare the translation accuracy of our proposed MAM against several baselines in Table 3. All baseline results are marked with * in the table. We include the cascaded ST framework, which first transcribes the speech and then passes the transcription to a subsequent machine translation system. The cascaded system has performance similar to E2E-ST. The MT system directly generates the target translation from the ground-truth transcription instead of ASR output, and can be approximately considered as the upper bound on accuracy for this speech translation corpus. To make the comparison complete, we also include the performance of multi-task training with E2E-ST and ASR, as well as that of the pre-trained ASR encoder.
For MAM, we first show three results which only use MAM as an extra training module without any pre-training (grouped as "MAM as auxiliary task"). In the setting with span masking, MAM outperforms E2E-ST by 1.36 in BLEU and is only 0.94 behind ST+ASR MTL. When we also enable ASR MTL for MAM, we further boost the translation performance to 21.87, which outperforms the best result above by 1.22 in BLEU.
In the latter part of Table 3, we demonstrate the effectiveness of MAM when it is used as a pre-training technique. Note that we do not use any transcription for MAM during pre-training. In the first setting, we use the speech part of the external Librispeech corpus as the pre-training corpus. In this setting, MAM with span masking further improves the accuracy by 0.56 in BLEU with the help of pre-trained knowledge. Compared with ASR pre-trained encoder initialization (20.26), MAM pre-trained on Librispeech achieves very similar performance (20.27) without using transcription, indicating that MAM is able to generate accurate translations when transcription of the source language is absent, as in some low-resource settings.
Last but not least, and more interestingly, when we pre-train MAM on the FMA dataset, which only contains music audio, MAM also demonstrates improvements, especially in the mid and high settings. Because non-speech audio datasets are so widely available, MAM can utilize a much larger pre-training corpus to further boost performance. As a result of using a non-speech pre-training corpus, we only need one pre-trained model, which can be fine-tuned for any language.
We have presented MAM, a novel acoustic modeling framework, in this paper. MAM can be used not only as an extra component during training, but also as a separate pre-training framework that can be applied to arbitrary acoustic signals. We demonstrate the effectiveness of MAM in multiple experimental settings. In particular, we show that pre-training MAM on music data also boosts the performance of English-to-German speech translation.