MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

by   Junkun Chen, et al.
Oregon State University
Baidu, Inc.

End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions heavily depend on source language transcriptions for pre-training or for multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion only on the speech side, which can utilize speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), can also perform pre-training, for the first time, on any acoustic signals (including non-speech ones) without annotation. Compared with current state-of-the-art models on ST, our technique achieves +1.4 BLEU improvement without using transcriptions, and +1.2 BLEU using transcriptions. Pre-training MAM with arbitrary acoustic signals also boosts downstream speech-related tasks.





1 Introduction

Speech-to-text translation (ST), which translates source language speech to target language text, is useful in many scenarios such as international conferences, travel, and foreign-language video subtitling. Conventional cascaded approaches to ST Ney99; Matusov2005OnTI; Mathias2006; Berard2016 first transcribe the speech audio into source language text (ASR) and then perform text-to-text machine translation (MT), which inevitably suffers from error propagation in the pipeline. To alleviate this problem, recent efforts explore end-to-end approaches (E2E-ST) Weiss2017; Berard2018EndtoEndAS; Vila2018EndtoEndST; Gangi2019, which are computationally more efficient at inference time and mitigate the risk of error propagation from imperfect ASR.

Figure 1: Comparison of different existing solutions and our proposed Masked Acoustic Modeling (MAM).

To improve the translation accuracy of E2E-ST models, researchers either initialize the encoder of ST with a pre-trained ASR encoder Berard2018EndtoEndAS; bansal2019; WangWLY020 to get a better representation of the speech signal, or perform multi-task training (ASR+ST) to bring more training signals to the shared encoder anastasopoulos2016; anastasopoulos2018; Sperber2019; Liu2019 (see Fig. 1). These methods improve translation quality by providing more training signals to the encoder so it learns better self-attention.

However, both of the above solutions assume the existence of substantial speech transcriptions in the source language. Unfortunately, this assumption is problematic. On one hand, for certain low-resource languages, especially endangered ones bird2010; bird2014, source speech transcriptions are expensive to collect. Moreover, the majority of human languages have no written form or no standard orthography, making phonetic transcription impossible duong2016. On the other hand, the amount of speech audio with transcriptions is limited (as transcriptions are expensive to collect), and there exists far more audio without any annotation. It would be much more straightforward and cheaper to directly leverage this raw audio to train a robust encoder.

To remove the dependency on source language transcriptions, we present a very simple yet effective solution, Masked Acoustic Modeling (MAM), which utilizes the speech data in a self-supervised fashion without using any source language transcription, unlike other speech pre-training models Chuang2019; WangWLY020ACL. Alongside the regular training of E2E-ST (without ASR as multi-tasking or pre-training), MAM randomly masks certain portions of the speech input and aims to recover the masked speech signals from their context on the encoder side. MAM is a general technique that can also be used as a pre-training module on arbitrary acoustic signals, e.g., multilingual speech, music, animal sounds, or even noise. The contributions of our paper are as follows:

  • We demonstrate the importance of a self-supervised module for E2E-ST. Unlike all previous attempts, which heavily depend on transcription, MAM improves the capacity of the encoder by recovering masked speech signals merely from their context. MAM can also be used together with ASR multi-tasking.

  • MAM can also be used as a pre-training module by itself. In the pre-training setting, MAM is capable of utilizing arbitrary acoustic signals (e.g., music, animal sounds, or noise) beyond regular speech audio. Considering there is far more acoustic data than human speech, MAM has great potential for pre-training. To the best of our knowledge, MAM is the first technique that is able to perform pre-training with any form of audio signal.

  • With the help of pre-training, MAM advances E2E-ST performance by 1.36 BLEU without using transcription. With the help of ASR multi-tasking, MAM further boosts translation quality by 1.22 BLEU.

  • We show that the success of MAM does not rely on intensive or expensive computation. All our experiments are done with eight 1080Ti GPUs within 2 days of training, without any hyperparameter or architecture search.

2 Preliminaries: ASR and ST

We first briefly review standard E2E-ST and E2E-ST with ASR multi-task training to set up the notations. Then we analyze the reason why ASR training is beneficial to ST, which motivates our own solution of MAM in the next section.

Figure 2: Comparison of one head of the last encoder self-attention layer between Transformer-based E2E-ST and ST+ASR multi-task training. ASR multi-tasking helps the encoder learn more meaningful self-attention.

2.1 Vanilla E2E-ST Training with Seq2Seq

Regardless of the particular design of Seq2Seq models for different tasks, the encoder always takes a source input sequence of $n$ elements $\mathbf{x} = (x_1, \dots, x_n)$, where each $x_i$ is a $d_x$-dimensional vector, and produces a sequence of hidden representations $\mathbf{h} = f(\mathbf{x})$, where $h_i \in \mathbb{R}^{d_h}$. The encoding function $f$ can be implemented as a mixture of Convolution, RNN, and Transformer layers. More specifically, in our case $\mathbf{x}$ can be the spectrogram or mel-spectrogram of the source speech audio, and each $x_i$ represents the frame-level feature of the speech signal within a certain duration.

On the other hand, the decoder greedily predicts a new output word $y_t$ given both the source sequence $\mathbf{x}$ and the prefix of decoded tokens, denoted $y_{<t}$. The decoder continues the generation until it emits <eos>, which finishes the entire decoding process. Finally, we obtain the hypothesis $\mathbf{y}$ with the model score defined as follows:

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, y_{<t})$$
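To make the decoding loop concrete, here is a minimal greedy-decoding sketch. The `toy_step` scorer is our stand-in for the real decoder, which would condition on the encoded source speech; names and shapes are illustrative, not from the paper's code.

```python
import numpy as np

def greedy_decode(step_fn, eos_id, max_len=50):
    """Greedily pick the highest-scoring token given the decoded prefix,
    stopping once <eos> is emitted."""
    prefix = []
    for _ in range(max_len):
        y = int(np.argmax(step_fn(prefix)))
        if y == eos_id:
            break
        prefix.append(y)
    return prefix

# toy scorer that deterministically emits tokens 3, 1, 2, then <eos> = 0
def toy_step(prefix, plan=(3, 1, 2, 0)):
    scores = np.zeros(4)
    scores[plan[len(prefix)]] = 1.0
    return scores

hyp = greedy_decode(toy_step, eos_id=0)  # -> [3, 1, 2]
```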
During training, the entire model aims to maximize the conditional probability of each ground-truth target sentence $\mathbf{y}^\star$ given input $\mathbf{x}$ over the entire training corpus $D$, or equivalently to minimize the following loss:

$$\ell_{\mathrm{ST}}(D) = -\sum_{(\mathbf{x}, \mathbf{y}^\star) \in D} \log p(\mathbf{y}^\star \mid \mathbf{x})$$
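This training loss can be sketched numerically for a single example: the toy below computes the negative log-likelihood of one target sequence from per-step vocabulary distributions (the shapes and names are ours, for illustration only).

```python
import numpy as np

def seq2seq_nll(step_probs, target_ids):
    """Negative log-likelihood of one gold target sequence.
    step_probs: (T, V) array, row t is the model's distribution over
    the vocabulary at decoding step t; target_ids: gold ids y*_1..y*_T."""
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked))

# 3 decoding steps over a 4-word vocabulary
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.8, 0.05, 0.05],
              [0.25, 0.25, 0.25, 0.25]])
loss = seq2seq_nll(p, [0, 1, 2])  # -(log .7 + log .8 + log .25)
```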
Compared with other tasks that also employ the Seq2Seq framework for E2E training, e.g., MT or ASR, E2E-ST is more difficult and challenging in many ways. Firstly, the data modalities differ between source and target sides: for ST, the encoder deals with speech signals while the decoder learns word representations, whereas MT has text on both sides. Secondly, due to the high sampling rate of speech signals, speech inputs are generally several times longer than the target sequence, which increases the difficulty of learning the correspondence between source and target. Thirdly, compared with the monotonic nature of ASR alignments, ST usually needs to learn a global reordering between the speech signal and the translation, which raises the challenge to another level.

2.2 Multi-task Learning with ASR

To address the aforementioned issues, researchers have proposed either to use a pre-trained ASR encoder to initialize the ST encoder, or to perform ASR Multi-task Learning (MTL) together with ST training. We only discuss multi-task training, since pre-training does not require significant changes to the Seq2Seq model.

During multi-task training, two decoders share one encoder. Besides the MT decoder, there is another decoder for generating transcriptions. With the help of ASR training, the encoder is able to learn more accurate speech segmentations (similar to forced alignment), making the global reordering of those segments for MT relatively easier.

We define the following training loss for ASR:

$$\ell_{\mathrm{ASR}}(D) = -\sum_{(\mathbf{x}, \mathbf{z}^\star) \in D} \log p(\mathbf{z}^\star \mid \mathbf{x})$$

where $\mathbf{z}^\star$ represents the annotated, ground-truth transcription for speech audio $\mathbf{x}$. In our baseline setting, we also adopt the hybrid CTC/Attention framework Watanabe2017 on the encoder side. In the case of multi-task training with ASR for ST, the total loss is defined as

$$\ell_{\mathrm{ST+ASR}}(D) = \ell_{\mathrm{ST}}(D) + \ell_{\mathrm{ASR}}(D)$$
Fig. 2 illustrates the difference between E2E-ST and E2E-ST with ASR multi-task training. We extract the topmost encoder layer for comparison. We observe that E2E-ST tends to learn more meaningful self-attention when given the training signal from ASR. With ASR's help, the source input spectrogram is chunked into segments that contain phoneme-level information. With these larger-scale segments of the spectrogram, the target-side decoder only needs to perform reordering over segments instead of frames. These benefits come from utilizing the annotated transcription.

However, in the following sections, we demonstrate that our proposed MAM is capable of learning segmented, accurate self-attention without using transcriptions.

3 Masked Acoustic Modeling

Figure 3: MAM (in blue box) can be treated as one extra module besides standard Transformer encoder-decoder and convolution layers for processing speech signals.

As discussed above, all the existing solutions for boosting E2E-ST performance heavily depend on the availability of transcriptions in the source language. These solutions are not able to take advantage of the large amount of speech without any annotation. They also become inapplicable when the source language is low-resource, or does not even have a standard orthography system. An ideal solution should not be constrained by source language transcription and should still achieve similar translation quality. We therefore introduce MAM in this section.

3.1 MAM as Part of Training Objective

As discussed in Sec. 2.2, both ASR pre-training and ASR multi-task training are beneficial for encoder self-attention. Based on this observation, we propose to perform self-supervised training on the encoder side by reconstructing sabotaged speech signals from the input. Note that MAM is entirely different from other self-supervised training approaches Chuang2019; WangSemantic2019; WangWLY020ACL, which rely on transcriptions to segment the speech audio with forced-alignment tools Povey11thekaldi; McAuliffe2017. We directly apply random masks of different widths over the speech audio, eliminating the dependency on transcription. Therefore, MAM can be easily applied to speech audio without transcription, and even to non-human-speech audio, e.g., noise, music, and animal sounds.

Formally, we define a random replacement function over the original speech input $\mathbf{x}$:

$$\hat{\mathbf{x}} = \mathrm{Rand}(\mathbf{x}, \lambda)$$

where $\mathrm{Rand}(\cdot)$ replaces certain vectors in $\mathbf{x}$, each with probability $\lambda$, by a single randomly initialized vector $\epsilon$. Note that we use the same vector $\epsilon$ to represent all corrupted frames (see one example in Fig. 4(b)). We then obtain a corrupted input $\hat{\mathbf{x}}$ and its corresponding latent representation $\hat{\mathbf{h}} = f(\hat{\mathbf{x}})$.
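A minimal sketch of this corruption step follows. The function and variable names are ours, and the toy all-ones input stands in for a real log-Mel spectrogram.

```python
import numpy as np

def rand_mask(x, lam=0.15, rng=None):
    """Replace each spectrogram frame, with probability lam, by one shared
    randomly initialized vector eps, mirroring MAM's corruption step.
    x: (n_frames, n_bins) array. Returns (corrupted copy, boolean mask)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(x.shape[1])   # same vector for every masked frame
    masked = rng.random(x.shape[0]) < lam   # each frame corrupted with prob. lam
    x_hat = x.copy()
    x_hat[masked] = eps
    return x_hat, masked

x = np.ones((200, 80))                      # toy "spectrogram"
x_hat, m = rand_mask(x, rng=np.random.default_rng(0))
```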

For the MAM module, we use the following training objective, which reconstructs the original speech signal from the surrounding context in a self-supervised fashion:

$$\ell_{\mathrm{MAM}}(D) = \sum_{\mathbf{x} \in D} \Delta\big(g(f(\hat{\mathbf{x}})), \mathbf{x}\big)$$

where $g$ is a reconstruction function which tries to recover the original signal $\mathbf{x}$ from $f(\hat{\mathbf{x}})$, and $\Delta$ measures the difference between the reconstruction and the original signal. For simplicity, we use a regular 2D deconvolution as $g$, and the $\ell_2$ norm as $\Delta$. Finally, we have the following total loss for our model:

$$\ell_{\mathrm{ST+MAM}}(D) = \ell_{\mathrm{ST}}(D) + \ell_{\mathrm{MAM}}(D)$$
To further boost the performance of E2E-ST, we also have the option of ASR multi-task training when a transcription of the speech is available. Then our total loss is

$$\ell_{\mathrm{ST+MAM+ASR}}(D) = \ell_{\mathrm{ST}}(D) + \ell_{\mathrm{MAM}}(D) + \ell_{\mathrm{ASR}}(D)$$
3.2 Different Masking Strategies

Figure 4: One example of our non-silence segment detection technique described in Section 3.2. Blue lines indicate the starting points of each non-silence segment, while white lines represent the ends. Note that some of the blue and white lines are very close to each other and hard to distinguish. We hide the transcription on purpose to show that we do not rely on textual information to segment.

MAM aims at a much harder task than purely textual pre-training models, e.g., BERT or ERNIE, which only perform semantic learning over missing tokens. In our case, we try to recover not only the semantic content, but also the acoustic characteristics of the given audio. MAM simultaneously predicts the missing words and generates spectrograms, much like a speech synthesis task.

To ensure the masked segments contain different levels of granularity of speech semantics, we propose the following masking methods.

Single Frame Masking

Uniformly mask $\lambda \cdot n$ frames out of the $n$ input frames to construct $\hat{\mathbf{x}}$. Note that some masked frames may happen to be consecutive.

Span Masking

Similar to SpanBERT joshi2020spanbert, we first sample a series of span widths and then apply those spans at random positions of the input signal. Note that we do not allow spans to overlap in this case.
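A sketch of such a span sampler follows. The width range, masking budget, and rejection of overlapping spans are our illustrative choices; SpanBERT's exact width distribution is not reproduced here.

```python
import numpy as np

def sample_span_mask(n_frames, lam=0.15, max_width=10, rng=None):
    """Mark roughly lam of the frames for masking using random-width,
    non-overlapping spans (overlapping candidates are rejected)."""
    rng = rng or np.random.default_rng()
    budget = int(lam * n_frames)
    taken = np.zeros(n_frames, dtype=bool)
    while taken.sum() < budget:
        w = int(rng.integers(1, max_width + 1))     # span width
        s = int(rng.integers(0, n_frames - w + 1))  # span start
        if taken[s:s + w].any():                    # keep spans disjoint
            continue
        taken[s:s + w] = True
    return taken

m = sample_span_mask(400, rng=np.random.default_rng(1))
```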

Segmentation with Non-silence Segment Detection

The above methods do not guarantee that full word pronunciations are masked, and thus may leak part of a masked word's information into the input, making the recovery easier. To ensure word- or phoneme-level masking, we introduce a silence detection algorithm to segment full words or phonemes.

Given a raw audio file, we first process the audio with a Gaussian low-pass filter savitzky64; William2007, and set a threshold upon the normalized, smoothed signal to locate the silence regions. We can then easily locate and segment the non-silence regions. Fig. 4 shows one example of a segmented audio file; during MAM training, we randomly select non-silence segments to mask.
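A sketch of this segmentation on a synthetic waveform follows; the Gaussian width `sigma` and the `threshold` value are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

def nonsilence_segments(signal, sigma=25, threshold=0.1):
    """Find non-silence regions: rectify the waveform, smooth it with a
    Gaussian low-pass filter, normalize, and threshold; return a list of
    (start, end) sample-index pairs for the runs above the threshold."""
    t = np.arange(-3 * sigma, 3 * sigma + 1)
    kernel = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()                       # unit-area Gaussian kernel
    env = np.convolve(np.abs(signal), kernel, mode="same")
    env /= max(env.max(), 1e-12)                 # normalize to [0, 1]
    active = env > threshold
    edges = np.flatnonzero(np.diff(active.astype(int)))
    bounds = np.concatenate(([0], edges + 1, [len(signal)]))
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:]) if active[a]]

# synthetic waveform: silence, then a tone burst, then silence again
sig = np.zeros(3000)
sig[1000:2000] = np.sin(np.linspace(0, 100 * np.pi, 1000))
segs = nonsilence_segments(sig)                  # one segment near samples 1000-2000
```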

In this way, we achieve a similar segmentation goal to forced-alignment-based methods Chuang2019; WangWLY020ACL without requiring transcription. More importantly, due to the flexibility of our random masking algorithms, MAM can easily be applied during the pre-training step to all kinds of audio, e.g., music, animal sounds, and even noise.

3.3 Pre-training MAM

MAM is a powerful technique that is not only beneficial to the training procedure, but can also be used as a pre-training framework that does not need any annotation.

The bottleneck of current speech-related tasks, e.g., ASR and ST, is the lack of annotated training corpora. For some languages that do not even have a standard orthography system, such annotations are impossible to obtain.

Although current speech-related pre-training frameworks Chuang2019; WangWLY020ACL indeed relieve some of the need for large-scale parallel training corpora for E2E-ST, all of these pre-training methods still require intensive transcription annotation for the source speech.

During pre-training, we only use the encoder part of MAM. Thanks to our flexible masking techniques, MAM is able to perform pre-training with any kind of audio signal. In this way, MAM can recover many forms of audio and, in practice, learns more robust and accurate self-attention when the system is used in noisy environments. With the different heads of the Transformer, MAM is also capable of distinguishing background, non-human sounds from speech, producing more accurate translations and transcriptions.

4 Experiments

In this section, we present MAM results on E2E-ST English-to-German translation with the MUST-C dataset mustc. All raw audio files are processed by Kaldi Povey11thekaldi to extract 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, using a window size of 25 ms and a step size of 10 ms. Our basic E2E-ST framework has similar settings to ESPnet-ST inaguma-etal-2020-espnet. We first downsample the speech input with 2 layers of 2D convolution of kernel size 3 and stride 2. Then a standard 12-layer Transformer with hidden size 2048 bridges the source and target sides. We use only 4 attention heads on each side of the Transformer, each with dimensionality 256. For the MAM module, we simply linearly project the outputs of the Transformer encoder to another latent space, then upsample the latent representation with a 2-layer deconvolution to match the size of the original input signal. For the random masking ratio $\lambda$, we choose 15% across all experiments, including pre-training. During inference, we do not perform any masking over the speech input.
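As a sanity check on the resulting sequence lengths, the arithmetic above can be sketched as follows. We assume no padding in the convolutions; the exact padding used in ESPnet may shift the result by a frame or two.

```python
def num_frames(num_samples, sr=16000, win_ms=25, hop_ms=10):
    """Frames produced by a 25 ms analysis window sliding at a 10 ms hop."""
    win, hop = sr * win_ms // 1000, sr * hop_ms // 1000
    return 1 + (num_samples - win) // hop

def after_conv(n, layers=2, kernel=3, stride=2):
    """Sequence length after stacked stride-2 convolutions (no padding)."""
    for _ in range(layers):
        n = (n - kernel) // stride + 1
    return n

frames = num_frames(16000)   # 1 second of 16 kHz audio -> 98 frames
short = after_conv(frames)   # roughly 4x shorter after the two conv layers
```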

                       E2E-ST   E2E-ST + ASR MTL   E2E-ST + MAM
number of parameters    31M           47M              33M
Table 1: MAM only has 6.5% more parameters than the baseline model, while ASR multi-tasking needs 51.6% more parameters.
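The overhead percentages quoted in Table 1 follow directly from the parameter counts:

```python
# parameter counts from Table 1
base, asr_mtl, mam = 31e6, 47e6, 33e6

mam_overhead = (mam / base - 1) * 100       # extra parameters for MAM, ~6.5%
asr_overhead = (asr_mtl / base - 1) * 100   # extra parameters for ASR MTL, ~51.6%
```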

We conducted MAM pre-training experiments on two corpora, Librispeech (English speech) panayotov2015librispeech and FMA (music audio) defferrard2016fma, respectively. Statistics of the datasets are shown in Table 2.

        MUST-C   Librispeech   FMA
Hours    408h       960h       208h
Table 2: Statistics of the corpora and the original tasks the audio corresponds to.

Our MAM is very easy to replicate, as we do not perform any hyperparameter or architecture search upon the baseline system. Due to MAM's simple but effective design, it does not rely on intensive computation; it easily converges within 2 days of training on eight 1080Ti GPUs. We show a comparison of parameter counts between different E2E-ST solutions in Table 1. This stands in contrast to currently popular computation-intensive frameworks such as BERT BERT (340M parameters) and GPT-3 brown2020language (175B parameters), making our technique accessible to regular users.

4.1 Visualizing Reconstruction

(a) The original speech spectrogram. Note that though we annotate the transcription underneath, we do not use transcription information at all during pre-training.
(b) We mask the selected frames (underlined with blue lines) with the same random initialized vector.
(c) Recovered spectrogram with MAM, pre-trained with Librispeech corpus.
(d) MAM pre-trained with the FMA music corpus still has the ability to reconstruct the corrupted speech signal.
Figure 5: One speech example showcasing the reconstruction ability of pre-trained MAM. MAM reconstructs the corrupted audio signal whether it is pre-trained on an ordinary speech dataset or on a music dataset.
                                         low     mid     high
Cascade di2019adapting *                  -       -      18.5
MT di2019adapting *                       -       -      25.3
E2E-ST *                                 0.98    11.2    18.35
E2E-ST + ASR MTL *                       8.71    16.98   20.65
MAM as auxiliary task:
MAM (single)                             1.67    11.87   19.35
MAM (span)                               2.58    12.37   19.71
MAM (span) + ASR MTL                     9.1     17.79   21.87
With transcription (ASR encoder pre-trained with English speech data):
E2E-ST + ASR pre-trained encoder *       10.21   16.9    20.26
Without transcription (MAM pre-trained with English speech data):
MAM (span)                               7.18    15.61   20.27
MAM (span) + ASR MTL                     9.17    17.2    21.6
Without transcription (MAM pre-trained with any audio, FMA music data):
MAM (span)                               2.73    13.55   19.82
Table 3: Experimental comparisons between MAM and several baselines (indicated with *). We use 20% (low), 50% (mid), and the full corpus (high) of the MUST-C dataset to show translation accuracy under different amounts of training or fine-tuning data. Note that the MT baseline is the upper bound for E2E-ST since it translates from gold source transcriptions. All reported results are from a single model without any model ensembling.

To verify the pre-trained results of MAM, we demonstrate the reconstruction ability of MAM by visualizing the results in Fig. 5. We first showcase the original spectrogram of a given speech utterance in Fig. 5(a). We then corrupt the original spectrogram by replacing the selected masked frames with $\epsilon$, a randomly initialized vector, to form $\hat{\mathbf{x}}$ (see Fig. 5(b)). In Fig. 5(c), we show that our proposed MAM is able to recover the missing frames after pre-training on the Librispeech dataset. Since MAM does not need any transcription to perform pre-training, we also pre-train MAM with the FMA corpus fma_challenge, which is a music dataset. Surprisingly, MAM pre-trained on music exhibits very similar reconstruction ability to the model pre-trained on speech, even though the corrupted audio contains only speech.

The above analysis opens up a completely new approach to speech-related pre-training, suggesting that we should perform pre-training over any sort of audio instead of relying solely on speech data. Considering the large amount of audio already available on the Internet, compared with the relatively small amount of annotated speech audio, MAM has much greater potential to improve the performance of any speech-related task, e.g., ASR, ST, and even speech synthesis.

(a) The original musical spectrogram that is mixed with different instruments’ sound.
(b) We mask the selected frames (underlined with blue lines) with the same random initialized vector.
(c) Recovered spectrogram with MAM, pre-trained with Librispeech corpus.
Figure 6: One music example showcasing the reconstruction ability of pre-trained MAM. MAM pre-trained with the Librispeech corpus (human speech data only) cannot reconstruct the original music spectrogram accurately, since many of the musical instruments' sounds are unseen in speech data.

4.2 Translation Accuracy Comparisons

We showcase the translation accuracy of our proposed MAM against several baselines in Table 3. All baseline results are indicated with * in the table. The cascaded ST framework first transcribes the speech and then passes the result to a machine translation system; it has similar performance to E2E-ST. The MT system directly generates the target translation from the ground-truth transcription instead of ASR output, and can be approximately considered the upper bound accuracy on this speech translation corpus. To make the comparison complete, we also include the performance of multi-task training with E2E-ST and ASR, as well as that of a pre-trained ASR encoder.

For MAM's performance, we first show three results that only use MAM as an extra training module without any pre-training (grouped as "MAM as auxiliary task"). With span masking, MAM outperforms E2E-ST by 1.36 BLEU and has only a 0.94 BLEU gap to ST+ASR MTL. When we also enable ASR MTL for MAM, we further boost translation performance to 21.87 BLEU, which outperforms the previous best by 1.22 BLEU.

In the latter part of Table 3, we demonstrate the effectiveness of MAM as a pre-training technique. Note that we do not use any transcription for MAM during pre-training. In the first setting, we use the speech portion of the external Librispeech corpus for pre-training. With the help of pre-trained knowledge, MAM with span masking further improves accuracy by 0.56 BLEU. Compared with ASR pre-trained encoder initialization (20.26), MAM pre-trained on Librispeech achieves very similar performance (20.27) without using transcriptions, indicating that MAM is capable of generating accurate translations when transcriptions of the source language are absent, as in some low-resource settings.

Last but not least, and more interestingly, when we pre-train MAM with the FMA dataset, which contains only music audio, MAM also demonstrates improvements, especially in the mid and high settings. Given the vast availability of non-speech audio data, MAM is able to utilize a much larger pre-training corpus to further boost performance. As a result of using a non-speech pre-training corpus, we only need one pre-trained model, which can then be fine-tuned for any language.

5 Conclusions

We have presented MAM, a novel acoustic modeling framework. MAM can be used not only as an extra component during training, but also as a separate pre-training framework that can be applied to arbitrary acoustic signals. We demonstrate the effectiveness of MAM in multiple different experimental settings. In particular, we show that music-data pre-training with MAM also boosts the performance of English-to-German speech translation.