Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

by Chengyi Wang et al.
Nankai University

End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model. Conventional approaches employ multi-task learning and pre-training methods for this task, but they suffer from the huge gap between pre-training and fine-tuning. To address these issues, we propose a Tandem Connectionist Encoding Network (TCEN) which bridges the gap by reusing all subnets in fine-tuning, keeping the roles of subnets consistent, and pre-training the attention module. Furthermore, we propose two simple but effective methods to guarantee the speech encoder outputs and the MT encoder inputs are consistent in terms of semantic representation and sequence length. Experimental results show that our model outperforms baselines by 2.2 BLEU on a large benchmark dataset.




1 Introduction

Speech-to-Text translation (ST) is essential for a wide range of scenarios: for example in emergency calls, where agents have to respond to urgent requests in a foreign language [Munro2010]; or in online courses, where audiences and speakers use different languages [Jan et al.2018]. To tackle this problem, existing approaches can be categorized into the cascaded method [Ney1999, Ma et al.2019], where a machine translation (MT) model translates the outputs of an automatic speech recognition (ASR) system into the target language, and the end-to-end method [Duong et al.2016, Weiss et al.2017], where a single model learns the mapping from acoustic frames to target word sequences in one step towards the final objective of interest. Although the cascaded model remains the dominant approach due to its better performance, the end-to-end method is becoming increasingly popular because it has lower latency, avoiding inference with two models, and in theory avoids error propagation.

Since it is hard to obtain a large-scale ST dataset, multi-task learning [Weiss et al.2017, Bérard et al.2018] and pre-training techniques [Bansal et al.2019] have been applied to end-to-end ST models to leverage large-scale ASR and MT datasets. A common practice is to pre-train two encoder-decoder models for ASR and MT respectively, and then initialize the ST model with the encoder of the ASR model and the decoder of the MT model. Subsequently, the ST model is optimized via multi-task learning by weighting the losses of ASR, MT, and ST. This approach, however, causes a huge gap between pre-training and fine-tuning, which can be summarized in three respects:

  • Subnet Waste: The ST system just reuses the ASR encoder and the MT decoder, while discards other pre-trained subnets, such as the MT encoder. Consequently, valuable semantic information captured by the MT encoder cannot be inherited by the final ST system.

  • Role Mismatch: The speech encoder plays different roles in pre-training and fine-tuning. The encoder is a pure acoustic model in pre-training, while it has to extract semantic and linguistic features additionally in fine-tuning, which significantly increases the learning difficulty.

  • Non-pre-trained Attention Module: Previous work [Bérard et al.2018] trains attention modules for ASR, MT and ST respectively; hence, the attention module of ST does not benefit from pre-training.

Figure 1: An illustration of multi-task learning for speech translation. Networks inherited from pre-trained models are labeled by rectangles.

To address these issues, we propose a Tandem Connectionist Encoding Network (TCEN), which is able to reuse all subnets in pre-training, keep the roles of subnets consistent, and pre-train the attention module. Concretely, the TCEN consists of three components: a speech encoder, a text encoder, and a target text decoder. Different from previous work that pre-trains an encoder-decoder based ASR model, we only pre-train an ASR encoder by optimizing the Connectionist Temporal Classification (CTC) [Graves et al.2006] objective function. In this way, no additional ASR decoder is required, while the speech encoder keeps the ability to map acoustic features into the source language space. Besides, the text encoder and decoder can be pre-trained on a large MT dataset. After that, we employ the commonly used multi-task learning method to jointly learn the ASR, MT and ST tasks.

Compared to prior works, the encoder of TCEN is a concatenation of an ASR encoder and an MT encoder, and our model does not have an ASR decoder, so the subnet waste issue is solved. Furthermore, the two encoders work in tandem, disentangling acoustic feature extraction and linguistic feature extraction, ensuring role consistency between pre-training and fine-tuning. Moreover, we reuse the pre-trained MT attention module in ST, so we can leverage the alignment information learned in pre-training.

Since the text encoder consumes word embeddings of real text in the MT task but uses speech encoder outputs in the ST task, another question is how to guarantee that the speech encoder outputs are consistent with the word embeddings. We further modify our model to achieve semantic consistency and length consistency. Specifically, (1) the projection matrix at the CTC classification layer for ASR is shared with the word embedding matrix, ensuring that they are mapped to the same latent space; and (2) since the length of the speech encoder output is proportional to the number of input frames, it is much longer than a natural sentence, so to bridge the length gap, source sentences in MT are lengthened by adding word repetitions and blank tokens to mimic CTC output sequences.

We conduct comprehensive experiments on the IWSLT18 speech translation benchmark [Jan et al.2018], demonstrating the effectiveness of each component. Our model is significantly better than previous methods by 3.6 and 2.2 BLEU scores for the subword-level decoding and character-level decoding strategies, respectively.

Our contributions are threefold: 1) we shed light on why previous ST models cannot sufficiently utilize the knowledge learned from the pre-training process; 2) we propose a new ST model, which alleviates shortcomings in existing methods; and 3) we empirically evaluate the proposed model on a large-scale public dataset.

Figure 2: The architecture of our model. The linear projection matrix in ASR is shared with the word embedding matrix in MT.

2 Background

2.1 Problem Formulation

End-to-end speech translation aims to translate a piece of audio into a target-language translation in one step. The raw speech signals are usually converted to sequences of acoustic features, e.g. Mel filterbank features. Here, we define the speech feature sequence as x = (x_1, ..., x_{T_x}). The transcription and translation sequences are denoted as y^s = (y^s_1, ..., y^s_{T_s}) and y^t = (y^t_1, ..., y^t_{T_t}) respectively. Each symbol in y^s or y^t is an integer index of the symbol in a vocabulary V_s or V_t respectively. In this work, we suppose that an ASR dataset, an MT dataset, and an ST dataset are available, denoted as A = {(x, y^s)}, M = {(y^s, y^t)} and S = {(x, y^t)} respectively. Given a new piece of audio x, our goal is to learn an end-to-end model to generate a translation sentence y^t without generating an intermediate transcription y^s.

2.2 Multi-Task Learning and Pre-training for ST

To leverage large-scale ASR and MT data, multi-task learning and pre-training techniques are widely employed to improve the ST system. As shown in Figure 1, there are three popular multi-task strategies for ST, including 1) the one-to-many setting, in which a speech encoder is shared between the ASR and ST tasks; 2) the many-to-one setting, in which a decoder is shared between the MT and ST tasks; and 3) the many-to-many setting, where both the encoder and decoder are shared.

A many-to-many multi-task model contains two encoders as well as two decoders. It can be jointly trained on ASR, MT, and ST tasks. As the attention module is task-specific, three attentions are defined.

Usually, the sizes of A and M are much larger than that of S. Therefore, the common training practice is to pre-train the model on the ASR and MT tasks and then fine-tune it in a multi-task manner. However, as aforementioned, this method suffers from the subnet waste, role mismatch and non-pre-trained attention issues, which severely limit the end-to-end ST performance.

3 Our method

In this section, we first introduce the architecture of TCEN, which consists of two encoders connected in tandem, and one decoder with an attention module. Then we give the pre-training and fine-tuning strategy for TCEN. Finally, we propose our solutions for semantic and length inconsistency problems, which are caused by multi-task learning.

3.1 TCEN Architecture

Figure 2 sketches the overall architecture of TCEN, including a speech encoder enc_s, a text encoder enc_t and a decoder dec with an attention module att. During training, enc_s acts like an acoustic model which reads the input x into word or subword representations h, then enc_t learns high-level linguistic knowledge into hidden representations g. Finally, dec defines a distribution probability over target words. The advantage of our architecture is that the two encoders disentangle acoustic feature extraction and linguistic feature extraction, making sure that valuable knowledge learned from the ASR and MT tasks can be effectively leveraged for ST training. Besides, every module in pre-training can be utilized in fine-tuning, alleviating the subnet waste problem.

Following Inaguma et al. (2018), we use a CNN-BiLSTM architecture to build our model (our method does not depend on a specific architecture, such as RNN or Transformer [Vaswani et al.2017], so it is easy to apply Transformer in our method). Specifically, the input features x are organized as a sequence of feature vectors of length T_x. Then, x is passed into a stack of two convolutional layers followed by max-pooling:

    v^(l) = MaxPool(ReLU(W^(l) * v^(l-1)))

where v^(l-1) denotes the feature maps of the last layer and W^(l) is the filter. The max-pooling layers downsample the sequence in length by a total factor of four. The down-sampled feature sequence v is further fed into a stack of five bidirectional d-dimensional LSTM layers:

    h_t = [LSTM_fwd(v_t, h^fwd_{t-1}); LSTM_bwd(v_t, h^bwd_{t+1})]

where [·;·] denotes vector concatenation. The final output representation from the speech encoder is denoted as h = (h_1, ..., h_{T_h}), where T_h = T_x / 4.
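As a minimal sketch (not from the paper), the downsampling factor of four can be checked by counting the effect of the two stride-2 max-pooling layers on the sequence length:

```python
def downsampled_length(t_x: int, num_pool_layers: int = 2, stride: int = 2) -> int:
    """Sequence length after the max-pooling layers of the speech encoder.

    Two pooling layers with stride 2 give the total downsampling factor of
    four described in the text (T_h = T_x / 4 for lengths divisible by 4).
    """
    t = t_x
    for _ in range(num_pool_layers):
        t //= stride  # each max-pooling layer halves the sequence length
    return t

print(downsampled_length(1000))  # 250
```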

The text encoder enc_t consists of two bidirectional LSTM layers. In the ST task, enc_t accepts the speech encoder output h as input, while in the MT task, enc_t consumes the word embedding representation e = (e_1, ..., e_{T_s}) derived from y^s, where each element e_i is computed by choosing the y^s_i-th vector from the source embedding matrix E^s. The goal of enc_t is to extract high-level linguistic features like syntactic or semantic features from the lower-level subword representations h or e. Since h and e belong to different latent spaces and have different lengths, there remain semantic and length inconsistency problems. We will provide our solutions in Section 3.3. The output sequence of enc_t is denoted as g.

The decoder dec is defined as two unidirectional LSTM layers with an additive attention att. It predicts the target sequence y^t by estimating the conditional probability:

    P(y^t | x) = prod_k P(y^t_k | y^t_{<k}, x) = prod_k softmax(W_o s_k + b_o)

Here, s_k is the hidden state of the decoder RNN at step k, computed using a time-dependent context vector c_k produced by the attention att.
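To make the attention computation concrete, here is a minimal NumPy sketch of one step of additive attention over the text-encoder outputs; the parameter names W, U and v are illustrative, not taken from the paper.

```python
import numpy as np

def additive_attention(s, g, W, U, v):
    """One step of additive attention.

    s: decoder hidden state at the current step, shape (d,)
    g: text-encoder outputs, shape (T, d)
    W, U: projection matrices, shape (d, d); v: scoring vector, shape (d,)
    Returns the attention weights (T,) and the context vector c (d,).
    """
    scores = np.tanh(s @ W.T + g @ U.T) @ v   # e_t = v^T tanh(W s + U g_t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over encoder steps
    context = weights @ g                     # weighted sum of encoder outputs
    return weights, context

rng = np.random.default_rng(0)
d, T = 4, 6
weights, context = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
```

The weights form a distribution over the T encoder steps, and the context vector is their weighted average of encoder outputs.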

3.2 Training Procedure

Following previous work, we split the training procedure into pre-training and fine-tuning stages. In the pre-training stage, the speech encoder enc_s is trained towards the CTC objective using dataset A, while the text encoder enc_t and the decoder dec are trained on the MT dataset M. In the fine-tuning stage, we jointly train the model on the ASR, MT, and ST tasks.

Transcript:  we were not v @en @ge @ful at all
CTC path 1:  -(11) we we -(3) were -(3) not -(4) v @en @en @ge - @ful -(8) at at -(3) all -(10)
CTC path 2:  -(9) we -(3) were were -(4) not not -(3) v v @en @en @en @ge - @ful -(7) at -(3) all all -(10)
Table 1: An example of the comparison between the golden transcript and the predicted CTC paths given the corresponding speech. '-' denotes the blank token and the number in parentheses represents repetition times.


To sufficiently utilize the large datasets A and M, the model is pre-trained on the CTC-based ASR task and the MT task in the pre-training stage.

For the ASR task, in order to remove the need for an ASR decoder and enable enc_s to generate subword representations, we leverage the connectionist temporal classification (CTC) [Graves et al.2006] loss to train the speech encoder.

Given an input x, enc_s emits a sequence of hidden vectors h = (h_1, ..., h_{T_h}), then a softmax classification layer predicts a CTC path π = (π_1, ..., π_{T_h}), where π_t ∈ V_s ∪ {'-'} is the observed label at the particular RNN step t, and '-' is the blank token representing no observed label:

    P(π | x) = prod_{t=1}^{T_h} P(π_t | x) = prod_{t=1}^{T_h} softmax(W_ctc h_t)

where W_ctc is the weight matrix in the classification layer and T_h is the total length of the encoder RNN output.

A legal CTC path π is a variation of the source transcription obtained by allowing occurrences of blank tokens and repetitions, as shown in Table 1. For each transcription y^s, there exist many legal CTC paths of length T_h. The CTC objective trains the model to maximize the probability of observing the golden sequence y^s, which is calculated by summing the probabilities of all possible legal paths:

    L_CTC = - log P(y^s | x) = - log sum_{π ∈ Π(y^s, T_h)} P(π | x)

where Π(y^s, T_h) is the set of all legal CTC paths for sequence y^s with length T_h. The loss can be efficiently computed using the forward-backward algorithm. More details about CTC are provided in the supplementary material.
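The mapping from a legal path back to the transcription (merge adjacent repeated labels, then delete blanks) can be sketched in a few lines; the path below is the first example from Table 1:

```python
def collapse_ctc_path(path, blank="-"):
    """Collapse a CTC path to its transcription: merge adjacent repeated
    labels, then remove blank tokens (the standard CTC collapsing rule)."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# First CTC path from Table 1; -(n) in the table means n blanks in a row.
path = (["-"] * 11 + ["we", "we"] + ["-"] * 3 + ["were"] + ["-"] * 3
        + ["not"] + ["-"] * 4 + ["v", "@en", "@en", "@ge", "-", "@ful"]
        + ["-"] * 8 + ["at", "at"] + ["-"] * 3 + ["all"] + ["-"] * 10)
print(" ".join(collapse_ctc_path(path)))  # we were not v @en @ge @ful at all
```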

For the MT task, we use the cross-entropy loss as the training objective. During training, y^s is converted to embedding vectors e through the embedding layer E^s, then enc_t consumes e and passes the output g to the decoder. The objective function is defined as:

    L_MT = - log P(y^t | y^s) = - sum_k log P(y^t_k | y^t_{<k}, y^s)

In the fine-tuning stage, we jointly update the model on the ASR, MT, and ST tasks. The training for ASR and MT follows the same process as in the pre-training stage.

For the ST task, enc_s reads the input x and generates h, then enc_t encodes h into high-level linguistic representations g. Finally, dec predicts the target sentence. The ST loss function is defined as:

    L_ST = - log P(y^t | x) = - sum_k log P(y^t_k | y^t_{<k}, x)

Following the update strategy proposed by Luong et al. (2015), we allocate a different training ratio α_i for each task. When switching between tasks, we randomly select a new task i with probability α_i / Σ_j α_j.
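A minimal sketch of this task-switching rule; the ratio values below are illustrative placeholders, not the paper's settings:

```python
import random

def pick_task(ratios, rng):
    """Select the next training task with probability proportional to its
    training ratio (a sketch of the multi-task switching strategy)."""
    tasks = list(ratios)
    r = rng.random() * sum(ratios.values())
    for task in tasks:
        r -= ratios[task]
        if r <= 0:
            return task
    return tasks[-1]

rng = random.Random(0)
ratios = {"asr": 1.0, "mt": 2.0, "st": 2.0}  # illustrative ratios
counts = {t: 0 for t in ratios}
for _ in range(10_000):
    counts[pick_task(ratios, rng)] += 1
# mt and st are each selected roughly twice as often as asr
```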

3.3 Subnet-Consistency

Our model keeps role consistency between pre-training and fine-tuning by connecting two encoders for the ST task. However, this leads to some new problems: 1) the text encoder consumes e during MT training but accepts h during ST training; since e and h may not follow the same distribution, this results in semantic inconsistency. 2) The length of h is not of the same order of magnitude as the length of e, resulting in length inconsistency.

In response to the above two challenges, we propose two countermeasures: 1) we share weights between the CTC classification layer and the source-end word embedding layer during ASR and MT training, encouraging h and e to lie in the same space; 2) we feed the text encoder source sentences in the format of CTC paths, which are generated by a seq2seq model, making it more robust toward long inputs.

Semantic Consistency

As shown in Figure 2, during multi-task training, two different hidden features will be fed into the text encoder enc_t: the embedding representation e in the MT task, and the speech encoder output h in the ST task. Without any regularization, they may belong to different latent spaces. Due to this space gap, enc_t has to compromise between the two tasks, limiting its performance on individual tasks.

To bridge the space gap, our idea is to pull h into the latent space where e belongs. Specifically, we share the weights of the CTC classification layer with the source embedding weights E^s, which means W_ctc = E^s. In this way, when predicting the CTC path π, the probability of observing a particular label w ∈ V_s ∪ {'-'} at time step t is computed by normalizing the product of the hidden vector h_t and the w-th vector of E^s:

    P(π_t = w | x) = exp(E^s_w · h_t) / sum_{w'} exp(E^s_{w'} · h_t)

The loss function closes the distance between h_t and the golden embedding vector, encouraging h to have the same distribution as e.
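As a small NumPy sketch of the tied projection (shapes are illustrative): when W_ctc = E^s, the CTC logits at each step are just dot products between h_t and every embedding vector, so training pulls h_t toward the embedding of the correct label.

```python
import numpy as np

def ctc_label_probs(h_t, E):
    """Label distribution at one encoder step with tied weights.

    h_t: speech-encoder hidden vector, shape (d,)
    E:   source embedding matrix (one row per label, incl. the blank),
         shape (num_labels, d). Tying W_ctc = E makes the logits the dot
         products of h_t with every embedding row.
    """
    logits = E @ h_t                     # one logit per source label
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = ctc_label_probs(rng.normal(size=8), rng.normal(size=(100, 8)))
```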

CTC path π:          -(11) we we -(3) were -(3) not -(4) v @en @en @ge - @ful -(8) at at -(3) all -(10)
Unique tokens ỹ:     -   we  -  were  -  not  -  v  @en  @ge  -  @ful  -  at  -  all  -
Repetition times r:  11  2   3   1    3   1   4  1   2    1   1   1    8   2  3   1  10
Table 2: The CTC path π and the corresponding unique tokens ỹ and repetition times r, where Φ(π) = (ỹ, r).
Figure 3: The architecture of seq2seq model. It predicts the next token and its number of repetition at the same time.

Length Consistency

Another existing problem is length inconsistency. The length of the sequence h is proportional to the length of the input frames x, which is much longer than the length of e. To solve this problem, we train an RNN-based seq2seq model to transform normal source sentences into noisy sentences in CTC path format, and replace standard MT with denoising MT for multi-tasking.

Specifically, we first train a CTC ASR model on dataset A, and generate a CTC path π for each audio x by greedy decoding. Then we define an operation Φ, which converts a CTC path π to a sequence of unique tokens ỹ and a sequence of repetition times r for each token, denoted as Φ(π) = (ỹ, r). Notably, the operation is reversible, meaning that Φ^{-1}(ỹ, r) = π. We use the example in Table 1 and show the corresponding ỹ and r in Table 2.

Then we build a dataset by decoding all the audio pieces in A and transforming the resulting paths by the operation Φ. After that, we train a seq2seq model, as shown in Figure 3, which takes y^s as input and decodes (ỹ, r) as outputs. With this seq2seq model, a noisy MT dataset M′ is obtained by converting every source sentence y^s to π′ = Φ^{-1}(ỹ, r). We did not use a standard seq2seq model which takes y^s as input and generates π directly, since there are too many blank tokens '-' in π, and such a model tends to generate long sequences containing only blank tokens. During MT training, we randomly sample text pairs from M and M′ according to a hyper-parameter tuned on the validation set, which controls the fraction of pairs drawn from M′. In this way, enc_t becomes more robust toward the longer inputs given by enc_s.
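The operation Φ and its inverse amount to run-length encoding; a minimal sketch (token names from Table 2):

```python
def phi(path):
    """Phi: run-length encode a CTC path into (unique_tokens, repeats),
    as in Table 2. Adjacent equal labels are merged and counted."""
    tokens, repeats = [], []
    for label in path:
        if tokens and tokens[-1] == label:
            repeats[-1] += 1
        else:
            tokens.append(label)
            repeats.append(1)
    return tokens, repeats

def phi_inverse(tokens, repeats):
    """Phi^{-1}: expand (tokens, repeats) back into the original path."""
    return [t for t, r in zip(tokens, repeats) for _ in range(r)]

path = ["-"] * 11 + ["we", "we"] + ["-"] * 3 + ["were"]
tokens, repeats = phi(path)
print(tokens, repeats)  # ['-', 'we', '-', 'were'] [11, 2, 3, 1]
assert phi_inverse(tokens, repeats) == path  # Phi is reversible
```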

4 Experiments

We conduct experiments on the IWSLT18 speech translation task [Jan et al.2018]. Since IWSLT participants use different data pre-processing methods, we reproduce several competitive baselines based on the ESPnet toolkit [Watanabe et al.2018] for a fair comparison.

4.1 Dataset

                        ---------- Subword Level Decoder ----------    ------------ Char Level Decoder ------------
Model                   tst2010 tst2013 tst2014 tst2015 Average        tst2010 tst2013 tst2014 tst2015 Average
Vanilla                   7.52    7.04    6.77    6.57    6.98          13.77   12.50   11.50   12.68   12.61
+enc pretrain            10.70   10.12    8.82    7.76    9.35          16.00   14.49   12.66   12.20   13.76
+dec pretrain             9.75    9.02    8.34    8.01    8.78          14.44   12.99   11.91   12.87   13.05
+enc dec pretrain        12.14   11.07    9.96    8.77   10.49          15.52   14.62   13.39   13.33   14.22
One-to-many               9.33    7.97    7.64    7.50    8.11          15.03   13.52   12.00   12.18   13.18
Many-to-one               8.65    7.28    6.89    7.12    7.51          14.59   13.21   11.86   12.30   12.99
Many-to-many              8.84    7.65    7.43    7.79    7.93          14.98   13.54   12.33   12.37   13.31
Many-to-many+pretrain    11.92   11.78    9.89    9.27   10.72          15.70   15.42   13.43   12.66   14.30
Triangle+pretrain         9.89    9.91    7.48    7.22    8.63          11.35   10.73    9.43    9.47   10.25
TCEN                     15.49   15.50   13.21   13.02   14.31          17.61   17.67   15.73   14.94   16.49
Table 3: BLEU score results on IWSLT-18. "Average" denotes the average over tst2010, tst2013, tst2014 and tst2015. We copy the numbers of the vanilla model from its original report. Since the pre-training data is different, we run the ESPnet code to obtain the numbers of the pre-training and multi-task learning methods, which are slightly higher than the numbers in their report.
Speech translation data: The organizer provides a speech translation corpus extracted from TED talks (ST-TED), which consists of raw English wave files, English transcriptions, and aligned German translations. The corpus contains 272 hours of English speech with 171k segments. We split 2k segments from the corpus as the dev set, and tst2010, tst2013, tst2014 and tst2015 are used as test sets.

Speech recognition data: Aside from ST-TED, the TED-LIUM2 corpus [Rousseau, Deléglise, and Esteve2014] is provided as speech recognition data, which contains 207 hours of English speech and 93k transcript sentences.

Text translation data: We use the transcription and translation pairs in the ST-TED corpus and WIT3 as in-domain MT data, which contain 130k and 200k sentence pairs respectively. WMT2018 data is used as out-of-domain training data, consisting of 41M sentence pairs.

Data preprocessing: For speech data, the utterances are segmented into multiple frames with a 25 ms window size and a 10 ms step size. Then we extract 80-channel log-Mel filter bank and 3-dimensional pitch features using Kaldi [Povey et al.2011], resulting in 83-dimensional input features. We normalize them by the mean and the standard deviation on the whole training set. The utterances with more than 3000 frames are discarded. The transcripts in ST-TED are in true case with punctuation, while the transcripts in TED-LIUM2 are lower-cased and unpunctuated. Thus, we lowercase all the sentences and remove the punctuation for consistency. To increase the amount of training data, we perform speed perturbation on the raw signals with speed factors 0.9 and 1.1.

For the text translation data, sentences longer than 80 words or shorter than 10 words are removed. Besides, we discard pairs whose length ratio between source and target sentence is smaller than 0.5 or larger than 2.0. Word tokenization is performed using the Moses scripts, and both English and German words are lower-cased.

We use two different sets of vocabulary for our experiments. For the subword experiments, both English and German vocabularies are generated using SentencePiece [Kudo2018] with a fixed size of 5k tokens. Inaguma et al. (2018) show that increasing the vocabulary size is not helpful for the ST task. For the character experiments, both English and German sentences are represented at the character level.

For evaluation, we segment each audio with the LIUM SpkDiarization tool [Meignier and Merlin2010] and then perform MWER segmentation with the RWTH toolkit [Bender et al.2004]. We use lowercase BLEU as the evaluation metric.

4.2 Baseline Models and Implementation

We compare our method with the following baselines.

Vanilla ST baseline: The vanilla ST [Inaguma et al.2018] has only a speech encoder and a decoder. It is trained from scratch on the ST-TED corpus.

Pre-training baselines: We conduct three pre-training baseline experiments: 1) encoder pre-training, in which the ST encoder is initialized from an ASR model; 2) decoder pre-training, in which the ST decoder is initialized from an MT model; and 3) encoder-decoder pre-training, where both the encoder and decoder are pre-trained. The ASR model has the same architecture as the vanilla ST model and is trained on the mixture of the ST-TED and TED-LIUM2 corpora. The MT model has a text encoder and decoder with the same architecture as those in TCEN. It is first trained on WMT data (out-of-domain) and then fine-tuned on in-domain data.

Multi-task baselines: We also conduct three multi-task baseline experiments, including the one-to-many setting, the many-to-one setting, and the many-to-many setting. In the first two settings, we train the ST task jointly with the ASR task or the MT task respectively; for the many-to-many setting, we train all three tasks jointly. For the MT task, we use only in-domain data.

Many-to-many+pre-training: We train a many-to-many multi-task model where the encoders and decoders are derived from pre-trained ASR and MT models.

Triangle+pre-train: Anastasopoulos and Chiang (2018) proposed a triangle multi-task strategy for speech translation. Their model solves the subnet waste issue by concatenating an ST decoder to an ASR encoder-decoder model. Notably, their ST decoder can consume representations from the speech encoder as well as the ASR decoder. For a fair comparison, the speech encoder and the ASR decoder are initialized from the pre-trained ASR model. The Triangle model is fine-tuned in their multi-task manner.

All our baselines as well as TCEN are implemented based on ESPnet [Watanabe et al.2018], with the same RNN size used for all models. We use a dropout of 0.3 for embeddings and encoders, and train using Adadelta with an initial learning rate of 1.0 for a maximum of 10 epochs.

For the training of TCEN, the task ratios in the pre-training stage favor the MT task, since the MT dataset is much larger than the ASR dataset. For fine-tuning, we use the same task ratios as the 'many-to-many' baseline.

For testing, we select the model with the best accuracy on the speech translation task on the dev set. At inference time, we use a beam size of 10, and the beam scores include length normalization with a weight of 0.2.
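The effect of length normalization on beam scoring can be illustrated with a toy example; the exact normalization formula used at inference may differ, so treat this as a sketch of the general idea.

```python
def normalized_score(log_prob: float, length: int, weight: float = 0.2) -> float:
    """Length-normalized hypothesis score: divide the log-probability by
    length**weight so longer hypotheses are not unduly penalized.
    (A sketch; the exact formula used at inference is an assumption.)"""
    return log_prob / (length ** weight)

# Raw log-probability prefers the short hypothesis (-5.0 > -6.0), but after
# normalization the longer hypothesis wins:
short = normalized_score(-5.0, length=4)   # ~ -3.79
long_ = normalized_score(-6.0, length=20)  # ~ -3.30
```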

4.3 Experimental Results

Table 3 shows the results on the four test sets as well as the average performance. Our method significantly outperforms the strong 'many-to-many+pretrain' baseline by 3.6 and 2.2 BLEU scores respectively, indicating that the proposed method is very effective and substantially improves translation quality. Besides, both pre-training and multi-task learning improve translation quality, and the pre-training settings (2nd-4th rows) are more effective than the multi-task settings (5th-8th rows). We observe a performance degradation in the 'triangle+pretrain' baseline. Compared to our method, where the decoder receives higher-level syntactic and semantic linguistic knowledge extracted from the text encoder, their ASR decoder can only provide lower word-level linguistic information. Besides, since their model lacks a text encoder and the architecture of the ST decoder differs from the MT decoder, their model cannot utilize the large-scale MT data in all training stages.

Interestingly, we find that the char-level models outperform the subword-level models in all settings, especially the vanilla baseline. A similar phenomenon is observed by Bérard et al. (2018). A possible explanation is that learning the alignments between speech frames and subword units in another language is notoriously difficult. Our method brings more gains in the subword setting since our model is good at learning text-to-text alignment, and subword-level alignment is more helpful to translation quality.

4.4 Discussion

System tst2010 tst2013 tst2014 tst2015
TCEN 15.49 15.50 13.21 13.02
-MT noise 15.01 14.95 13.34 12.80
-weight sharing 13.51 14.02 12.25 11.66
-pretrain 8.98 8.42 7.94 8.08
Table 4: Ablation study for subword-level experiments.

Ablation Study

To better understand the contribution of each component, we perform an ablation study on the subword-level experiments. The results are shown in Table 4. In the '-MT noise' setting, we do not add noise to the source sentences for MT. In the '-weight sharing' setting, we use different parameters in the CTC classification layer and the source embedding layer. These two experiments show that both weight sharing and noisy MT input benefit the final translation quality. Performance degrades more in '-weight sharing', indicating that semantic consistency contributes more to our model. In the '-pretrain' experiment, we remove the pre-training stage and directly update the model on the three tasks, leading to a dramatic decrease in BLEU score, which indicates that pre-training is an indispensable step for end-to-end ST.

Figure 4: Model learning curves in fine-tuning.

Learning Curve

It is interesting to investigate why our method is superior to the baselines. We find that TCEN achieves a higher final result owing to a better starting point for fine-tuning. Figure 4 provides learning curves of subword accuracy on the validation set. The x-axis denotes the fine-tuning training steps. The vanilla model starts at a low accuracy, because its networks are not pre-trained on the ASR and MT data. The trends of our model and 'many-to-many+pretrain' are similar, but our model outperforms it by about five points throughout the fine-tuning process. This indicates that the gain comes from bridging the gap between pre-training and fine-tuning rather than from a better fine-tuning process.

System             tst2010  tst2013  tst2014  tst2015
cascaded            13.38    15.84    12.94    13.79
cascaded+re-seg     17.12    17.77    14.94    15.01
our model           17.61    17.67    15.73    14.94
Table 5: BLEU comparison of cascaded results and our best end-to-end results. 're-seg' denotes that the ASR outputs are re-segmented before being fed into the MT model.

Compared with a Cascaded System

Table 3 compares our model with end-to-end baselines; here, we compare our model with cascaded systems. We build a cascaded system by combining the ASR model and the MT model used in the pre-training baseline. The word error rate (WER) of the ASR system and the BLEU score of the MT system are reported in the supplementary material. In addition to a simple combination of the ASR and MT systems, we also re-segment the ASR outputs before feeding them to the MT system, denoted as cascaded+re-seg. Specifically, we train a seq2seq model [Bahdanau, Cho, and Bengio2015] on the MT dataset, where the source side is an unpunctuated sentence and the target side is a natural sentence. We then use this seq2seq model to add sentence boundaries and punctuation to the ASR outputs. Experimental results are shown in Table 5. Our end-to-end model outperforms the simple cascaded model by over 2 BLEU, and it achieves performance comparable to the cascaded model combined with a sentence re-segmentation model.

5 Related Work

Early works conduct speech translation in a pipeline manner [Ney1999, Matusov, Kanthak, and Ney2005], where the ASR output lattices are fed into an MT system to generate target sentences. HMM [Juang and Rabiner1991], DenseNet [Huang et al.2017], and TDNN [Peddinti, Povey, and Khudanpur2015] are commonly used ASR systems, while RNN with attention [Bahdanau, Cho, and Bengio2015] and Transformer [Vaswani et al.2017] are top choices for MT. To enhance the robustness of the NMT model towards ASR errors, Tsvetkov et al. (2014) and Chen et al. (2017) propose to simulate the noise in training and inference.

To avoid the error propagation and high latency issues, recent works propose translating the acoustic speech into text in the target language without yielding the source transcription [Duong et al.2016]. Since ST data is scarce, pre-training [Bansal et al.2019], multi-task learning [Duong et al.2016, Bérard et al.2018], curriculum learning [Kano, Sakti, and Nakamura2018], attention-passing [Sperber et al.2019], and knowledge distillation [Liu et al.2019, Jia et al.2019a] strategies have been explored to utilize ASR and MT data. Specifically, Weiss et al. (2017) show performance improvements by training the ST model jointly with the ASR and MT models. Bérard et al. (2018) observe faster convergence and better results due to pre-training and multi-task learning on a larger dataset. Bansal et al. (2019) show that pre-training a speech encoder on one language can improve ST quality on a different source language. All of them follow the traditional multi-task training strategies. Kano, Sakti, and Nakamura (2018) propose to use curriculum learning to improve ST performance on syntactically distant language pairs. To effectively leverage transcriptions in ST data, Anastasopoulos and Chiang (2018) augment the multi-task model such that the target decoder receives information from the source decoder, and they show improvements on low-resource speech translation. Their model only consumes ASR and ST data; in contrast, our work sufficiently utilizes the large-scale MT data to capture rich semantic knowledge. Jia et al. (2019) use pre-trained MT and text-to-speech (TTS) synthesis models to convert weakly supervised data into ST pairs and demonstrate that an end-to-end ST model can be trained using only synthesized data.

6 Conclusion

This paper has investigated the end-to-end method for ST and discussed why a huge gap exists between pre-training and fine-tuning in previous methods. To alleviate these issues, we have proposed a method that reuses every sub-net and keeps the role of each sub-net consistent between pre-training and fine-tuning. Empirical studies have demonstrated that our model significantly outperforms baselines.


  • [Anastasopoulos and Chiang2018] Anastasopoulos, A., and Chiang, D. 2018. Tied multitask learning for neural speech translation. In NAACL-HLT 2018, 82–91.
  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
  • [Bansal et al.2019] Bansal, S.; Kamper, H.; Livescu, K.; Lopez, A.; and Goldwater, S. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In NAACL-HLT 2019, 58–68.
  • [Bender et al.2004] Bender, O.; Zens, R.; Matusov, E.; and Ney, H. 2004. Alignment templates: the RWTH SMT system. In IWSLT 2004, 79–84.
  • [Bérard et al.2018] Bérard, A.; Besacier, L.; Kocabiyikoglu, A. C.; and Pietquin, O. 2018. End-to-end automatic speech translation of audiobooks. In ICASSP, 2018, 6224–6228.
  • [Chen et al.2017] Chen, P.; Hsu, I.; Huang, Y. Y.; and Lee, H. 2017. Mitigating the impact of speech recognition errors on chatbot using sequence-to-sequence model. In ASRU 2017, 497–503.
  • [Duong et al.2016] Duong, L.; Anastasopoulos, A.; Chiang, D.; Bird, S.; and Cohn, T. 2016. An attentional model for speech translation without transcription. In NAACL 2016, 949–959.
  • [Graves et al.2006] Graves, A.; Fernández, S.; Gomez, F. J.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML 2006, 369–376.
  • [Huang et al.2017] Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR 2017, 4700–4708.
  • [Inaguma et al.2018] Inaguma, H.; Zhang, X.; Wang, Z.; Renduchintala, A.; Watanabe, S.; and Duh, K. 2018. The JHU/KyotoU speech translation system for IWSLT 2018. In IWSLT 2018.
  • [Jan et al.2018] Niehues, J.; Cattoni, R.; Stüker, S.; Cettolo, M.; Turchi, M.; and Federico, M. 2018. The IWSLT 2018 evaluation campaign. In IWSLT 2018, 2–6.
  • [Jia et al.2019a] Jia, Y.; Johnson, M.; Macherey, W.; Weiss, R. J.; Cao, Y.; Chiu, C.-C.; Ari, N.; Laurenzo, S.; and Wu, Y. 2019a. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019, 7180–7184. IEEE.
  • [Juang and Rabiner1991] Juang, B. H., and Rabiner, L. R. 1991. Hidden Markov models for speech recognition. Technometrics 33(3):251–272.
  • [Kano, Sakti, and Nakamura2018] Kano, T.; Sakti, S.; and Nakamura, S. 2018. Structured-based curriculum learning for end-to-end English-Japanese speech translation. In Interspeech 2017, 2630–2634.
  • [Kudo2018] Kudo, T. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL 2018, 66–75.
  • [Liu et al.2019] Liu, Y.; Xiong, H.; He, Z.; Zhang, J.; Wu, H.; Wang, H.; and Zong, C. 2019. End-to-end speech translation with knowledge distillation. CoRR abs/1904.08075.
  • [Luong et al.2016] Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2016. Multi-task sequence to sequence learning. In ICLR 2016.
  • [Ma et al.2019] Ma, M.; Huang, L.; Xiong, H.; Zheng, R.; Liu, K.; Zheng, B.; Zhang, C.; He, Z.; Liu, H.; Li, X.; Wu, H.; and Wang, H. 2019. STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In ACL 2019, 3025–3036.
  • [Matusov, Kanthak, and Ney2005] Matusov, E.; Kanthak, S.; and Ney, H. 2005. On the integration of speech recognition and statistical machine translation. In INTERSPEECH 2005, 3177–3180.
  • [Meignier and Merlin2010] Meignier, S., and Merlin, T. 2010. LIUM SpkDiarization: an open source toolkit for diarization. In CMU SPUD Workshop.
  • [Munro2010] Munro, R. 2010. Crowdsourced translation for emergency response in Haiti: the global collaboration of local knowledge. In AMTA Workshop on Collaborative Crowdsourcing for Translation, 1–4.
  • [Ney1999] Ney, H. 1999. Speech translation: coupling of recognition and translation. In ICASSP 1999, 517–520.
  • [Peddinti, Povey, and Khudanpur2015] Peddinti, V.; Povey, D.; and Khudanpur, S. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In INTERSPEECH 2015.
  • [Povey et al.2011] Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
  • [Rousseau, Deléglise, and Esteve2014] Rousseau, A.; Deléglise, P.; and Estève, Y. 2014. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC 2014, 3935–3939.
  • [Sperber et al.2019] Sperber, M.; Neubig, G.; Niehues, J.; and Waibel, A. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. TACL 7:313–325.
  • [Tsvetkov, Metze, and Dyer2014] Tsvetkov, Y.; Metze, F.; and Dyer, C. 2014. Augmenting translation models with simulated acoustic confusions for improved spoken language translation. In EACL 2014, 616–625.
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS 2017, 5998–6008.
  • [Watanabe et al.2018] Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N. E. Y.; Heymann, J.; Wiesner, M.; Chen, N.; et al. 2018. ESPnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
  • [Weiss et al.2017] Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-sequence models can directly translate foreign speech. In Interspeech 2017, 2625–2629.