
ESPnet-ST IWSLT 2021 Offline Speech Translation System

This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that each of these techniques contributed to large improvements in translation performance. Our best E2E system combined all the above techniques with model ensembling and achieved 31.4 BLEU on the 2-ref of tst2021 and 21.2 BLEU and 19.3 BLEU on the two single references of tst2021.



1 Introduction

This paper presents the ESPnet-ST group’s English→German speech translation (ST) system submitted to the IWSLT 2021 offline speech translation track. ESPnet (watanabe2018espnet) has been widely used for many speech applications: automatic speech recognition (ASR), text-to-speech (hayashi2020espnet), speech translation (inaguma-etal-2020-espnet), machine translation (MT), and speech separation/enhancement (li2020espnet). The purpose of this submission is not only to show recent progress in ST research, but also to encourage future research by building strong systems along with the open-source project.

This year we focused on (1) sequence-level knowledge distillation (SeqKD) (kim-rush-2016-sequence), (2) the Conformer encoder (gulati2020), (3) the Multi-Decoder architecture (dalmia2021searchable), (4) model ensembling, and (5) better segmentation with a neural network-based voice activity detection (VAD) system (bredin2020pyannote) and a novel algorithm to merge multiple short segments for long context modeling. Our primary focus was E2E models, although we also compared them with cascade systems on a best-effort basis. All experiments were conducted with the ESPnet-ST toolkit (inaguma-etal-2020-espnet), and the recipe is publicly available.

2 Data preparation

In this section, we describe data preparation for each task. The corpus statistics are listed in Table 1. We removed the off-limit talks following previous evaluation campaigns. To fit the GPU memory, we excluded utterances having more than 3000 speech frames or more than 400 characters. All sentences were tokenized with the tokenizer.perl script in the Moses toolkit (koehn-etal-2007-moses).
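The length-based exclusion above can be sketched as a simple predicate. This is a minimal illustration, not the recipe's actual code: the function name, the tuple layout, and the choice to count characters on a single text field are assumptions.

```python
MAX_FRAMES = 3000  # limit on speech frames, taken from the text
MAX_CHARS = 400    # limit on characters, taken from the text


def keep_utterance(num_frames: int, text: str) -> bool:
    """Return True if the utterance is short enough to be kept for training."""
    return num_frames <= MAX_FRAMES and len(text) <= MAX_CHARS


# Hypothetical (utterance id, frame count, text) triples, for illustration only.
utterances = [
    ("utt1", 1200, "short sentence"),
    ("utt2", 3500, "too many frames"),
    ("utt3", 800, "x" * 401),
]
kept = [utt_id for utt_id, n_frames, text in utterances if keep_utterance(n_frames, text)]
```

With a 10-ms frame shift, 3000 frames correspond to roughly 30 seconds of audio.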

Task     Corpus             #Hour       #Sentence
ASR      Must-C             408 (×3)    0.68M
         Must-C v2          458 (×3)    0.74M
         ST-TED (cleaned)   200 (×3)    0.40M
         Librispeech        960         0.28M
         TEDLIUM2           210 (×3)    0.27M
E2E-ST   Must-C             408 (×3)    0.68M
         Must-C v2          458 (×3)    0.74M
         ST-TED (cleaned)   200 (×3)    0.40M
MT       Must-C             -           0.68M
         Must-C v2          -           0.74M
         ST-TED (cleaned)   -           0.40M
         Europarl           -           1.82M
         Commoncrawl        -           2.39M
         Paracrawl          -           34.37M
         NewsCommentary     -           0.37M
         WikiTitles         -           1.38M
         RAPID              -           1.63M
         WikiMatrix         -           1.57M
Table 1: Corpus statistics

2.1 ASR

We used the Must-C (di-gangi-etal-2019-must), Must-C v2, ST-TED (jan2018iwslt), Librispeech (librispeech), and TEDLIUM2 (tedlium) corpora. We used the cleaned version of ST-TED following (inaguma19asru). The speech data was augmented by three-fold speed perturbation (speed_perturbation) with speed ratios of 0.9, 1.0, and 1.1 except for Librispeech. We removed case information and punctuation marks except for apostrophes from the transcripts. The 5k unit vocabulary was constructed based on the byte pair encoding (BPE) algorithm (sennrich-etal-2016-neural) with the sentencepiece toolkit using the English transcripts only.
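The transcript normalization described above (dropping case and all punctuation except apostrophes) could look roughly like the sketch below. The function name and the exact regular expression are assumptions; the actual recipe may treat hyphens, digits, or special symbols differently.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe
# (underscores are treated as punctuation too).
_PUNCT_RE = re.compile(r"[^\w\s']|_")


def normalize_transcript(text: str) -> str:
    """Lowercase the transcript and strip punctuation except apostrophes."""
    text = _PUNCT_RE.sub(" ", text.lower())
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join(text.split())
```

For example, `normalize_transcript("Hello, World! It's 5 o'clock.")` yields `hello world it's 5 o'clock`.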

2.2 E2E-ST

We used Must-C, Must-C v2, and ST-TED only. The shared source and target vocabulary of BPE16k units was constructed using cased and punctuated transcripts and translations.

2.3 MT

We used the bitext available for WMT20 (Europarl, Commoncrawl, Paracrawl, NewsCommentary, WikiTitles, RAPID, and WikiMatrix) in addition to the in-domain TED data used for E2E-ST systems. We first performed perplexity-based filtering with an in-domain n-gram language model (LM) (moore-lewis-2010-intelligent). We controlled the WMT data size by thresholding and obtained three data pools: 5M, 10M, and 20M sentences. Next, we removed non-printing characters and performed language identification with the langid.py toolkit (lui-baldwin-2012-langid), keeping only sentence pairs whose language IDs were identified correctly on both the English and German sides. We also removed sentences having more than 250 tokens in either language or a source-target length ratio of more than 1.5 with the clean-corpus-n.perl script in Moses. Finally, we removed sentences containing CJK and other unrelated characters in either language with the built-in regex module in Python. The resulting data size is shown in Table 2. We found that our filtering strategy removed 22-37% of the data. Note that the above filtering process was performed over the WMT data only. For each data size, the joint source and target vocabulary of BPE32k units was constructed using cased and punctuated sentences after the filtering. We did not use additional monolingual data.
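The length, ratio, and character filters described above can be approximated in a few lines. The sketch below is an assumption-laden illustration rather than the scripts actually used: tokens are split on whitespace, the ratio is taken as longer over shorter, and the Unicode ranges (kana, CJK ideographs, Hangul) are one plausible reading of "CJK characters".

```python
import re

# Assumed character ranges: kana, CJK ideographs (incl. Extension A), Hangul syllables.
CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def keep_pair(src: str, tgt: str, max_tokens: int = 250, max_ratio: float = 1.5) -> bool:
    """Return True if a sentence pair passes the length, ratio, and character filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False  # drop empty sides
    if n_src > max_tokens or n_tgt > max_tokens:
        return False  # too long in either language
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return False  # lengths too dissimilar
    if CJK_RE.search(src) or CJK_RE.search(tgt):
        return False  # unrelated script in an English-German pair
    return True
```

Language identification and the LM-based selection are left out here because they depend on external models.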

Filtering method         #Sentence
In-domain LM             5.00M    10.00M    20.00M
 + langid                3.42M     7.90M    15.33M
  + length/character     3.15M     7.77M    15.01M
Table 2: MT bitext filtering
Figure 1: Block diagram of Conformer architecture

3 System

3.1 Conformer encoder

The Conformer encoder (gulati2020) is a stacked multi-block architecture and has shown consistent improvement over a wide range of E2E speech processing applications (guo2020recent). The architecture of each block is depicted in Figure 1. It includes a multi-head self-attention module, a convolution module, and a pair of position-wise feed-forward modules in the Macaron-Net style. While the self-attention module learns the long-range global context, the convolution module aims to model the local feature patterns synchronously. Recent studies have shown improvements by introducing Conformer in the E2E-ST task (guo2020recent; inaguma2021source), which motivated us to adopt this architecture in our system.

3.2 SeqKD

Sequence-level knowledge distillation (SeqKD) (kim-rush-2016-sequence) is an effective method to transfer knowledge in a teacher model to a student model via discrete symbols. Our recent studies (inaguma2020orthros; inaguma2021source) showed a large improvement in ST performance with this technique. Unlike the previous studies, however, we trained the teacher MT models on more data than the bitext in the ST training data. We translated the source transcripts in the ST training data with the teacher MT models using a beam width of 5 and then replaced the original ground-truth translation with the generated translation. We used cased and punctuated transcripts as inputs to the MT teachers. We also combined both the original and pseudo translations as data augmentation (multi-referenced training) (gordon2019explaining).
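The way single-reference and multi-referenced SeqKD training targets are assembled can be sketched as follows. The function name, the dictionary layout, and the example sentences are hypothetical; the teacher translations are assumed to have been generated beforehand.

```python
def build_training_targets(references, teacher_outputs, use_original=True):
    """Collect the target translations paired with each utterance.

    references: dict mapping utterance id -> ground-truth translation.
    teacher_outputs: list of dicts (one per teacher MT model) mapping
        utterance id -> pseudo translation produced by that teacher.
    use_original: if False, the ground truth is replaced by the pseudo
        translations (plain SeqKD); if True, both are kept
        (multi-referenced training).
    """
    targets = {}
    for utt_id, reference in references.items():
        candidates = [reference] if use_original else []
        for outputs in teacher_outputs:
            candidates.append(outputs[utt_id])
        targets[utt_id] = candidates
    return targets


# Hypothetical data for illustration only.
references = {"utt1": "Guten Morgen."}
teacher_a = {"utt1": "Guten Morgen!"}
teacher_b = {"utt1": "Schönen guten Morgen."}

seqkd_only = build_training_targets(references, [teacher_a], use_original=False)
two_ref = build_training_targets(references, [teacher_a])
three_ref = build_training_targets(references, [teacher_a, teacher_b])
```

Each utterance is then paired with every target in its list, so the same speech input appears once per reference during training.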

Figure 2: The Multi-Decoder (MD) architecture decomposes the overall ST task with ASR and MT sub-nets while maintaining E2E differentiability.

3.3 Multi-Decoder architecture

The Multi-Decoder is an E2E-ST model using Searchable Hidden Intermediates to decompose the overall ST task into ASR and MT sub-tasks (dalmia2021searchable). As shown in Figure 2, the Multi-Decoder consists of two encoder-decoder models, an ASR sub-net and a subsequent MT sub-net, where the hidden representations of the ASR decoder are passed as inputs to the encoder of the MT sub-net. During inference, the best ASR decoder hidden representations are retrieved using beam search decoding at this intermediate stage.

Since this framework decomposes the overall ST task, it brings several advantages of cascaded approaches into the E2E setting. For instance, the Multi-Decoder allows for greater search capabilities and separation of speech and text encoding. However, one trade-off is a greater risk of error propagation from the ASR sub-net to the downstream MT sub-net. To alleviate this issue, we condition the decoder of the MT sub-net on the ASR encoder hidden representations in addition to the MT encoder hidden representations using multi-source cross-attention. This improved variant of the architecture is called the Multi-Decoder with Speech Attention.

3.4 Model ensembling

We use posterior probability combination to ensemble models trained with different data and architectures. During inference, we perform a posterior combination at each step of beam search decoding by first computing the softmax normalized posterior probabilities for each model in the ensemble and then taking the mean value. In this ensembling approach, a single unified beam search operates over the combined posteriors of the models to find the most likely decoded sequence.
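A single decoding step of this posterior combination can be sketched in plain Python. The function names are illustrative, and taking the log of the averaged posterior as the per-token beam-search score is an assumption about how the combined distribution is consumed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one model's output logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(logits_per_model):
    """Average the softmax-normalized posteriors of all models in the ensemble."""
    posteriors = [softmax(logits) for logits in logits_per_model]
    n_models = len(posteriors)
    vocab_size = len(posteriors[0])
    return [sum(p[i] for p in posteriors) / n_models for i in range(vocab_size)]


def step_scores(logits_per_model):
    """Log of the combined posterior, usable as per-token scores in one beam search."""
    return [math.log(p) for p in combine_posteriors(logits_per_model)]


# Two hypothetical models over a two-token vocabulary.
combined = combine_posteriors([[0.0, 0.0], [math.log(3.0), 0.0]])
```

Because the average is taken in probability space at every step, all models must share the same output vocabulary and advance over the same hypotheses.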

3.5 Segmentation

How audio is segmented during inference significantly impacts ST performance (Gaido2020; Pham2020; potapczyk-przybysz-2020-srpols; gaido2021beyond). This is because ST systems are usually trained with utterances segmented based on punctuation marks (di-gangi-etal-2019-must), while audio segmentation by voice activity detection (VAD) at test time does not have access to such meta information. Since VAD splits a long speech recording into chunks at silence regions, it can prevent models from extracting semantically coherent contextual information. Therefore, it is very important to seek a better segmentation strategy in order to minimize this gap between training and test conditions and to evaluate models correctly. In fact, last year’s winner obtained huge improvements by using their own segmentation strategy.

Motivated by this fact, we investigated two VAD systems apart from the provided segmentation. Specifically, we used the WebRTC and pyannote.audio (bredin2020pyannote) toolkits. For WebRTC, we set the frame duration, padding duration, and aggressiveness mode to 10ms, 150ms, and 3, respectively. For pyannote.audio, we used a publicly available model pre-trained on the DIHARD corpus.


However, we observed that VAD systems are more likely to generate short segments because they do not take contextual information into account. Therefore, we propose a novel algorithm to merge multiple short segments into a single chunk to enable long context modeling by self-attention in both encoder and decoder modules. The proposed algorithm is shown in Algorithm 1. We first perform VAD and obtain multiple segments. Then, we check the segments greedily from left to right and merge adjacent segments if (1) the total duration of the merged utterance is below a maximum-duration threshold and (2) the time interval between the two segments is below a maximum-interval threshold, both measured in units of 10ms. This process continues until no segment is merged in an iteration. Although recent studies proposed similar methods (potapczyk-przybysz-2020-srpols; gaido2021beyond), our algorithm is a bottom-up approach while theirs are top-down.

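function MergeSegment(segments, max_duration, max_interval)
      while True do
            merged <- empty queue
            (start, end) <- start/end time of the first segment
            for each remaining segment (s, e), from left to right do
                 if e - start < max_duration and s - end < max_interval then
                        end <- e                          (merge segments)
                 else
                        append (start, end) to merged
                        (start, end) <- (s, e)            (reset)
                 end if
            end for
            append (start, end) to merged
            if |merged| = |segments| then
                 break
            end if
            segments <- merged
      end while
      return merged
end function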
Algorithm 1 Merge short segments after VAD for long context modeling
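A runnable sketch of the merging procedure, following the prose description in this section, is given below. The function and argument names are assumptions, segments are represented as (start, end) pairs in 10-ms frames sorted by start time, and the exact bookkeeping of the original algorithm may differ.

```python
def merge_segments(segments, max_duration, max_interval):
    """Greedily merge adjacent VAD segments into longer chunks.

    segments: list of (start, end) pairs in 10-ms frames, sorted by start time.
    max_duration: upper bound on the duration of a merged chunk.
    max_interval: upper bound on the silence gap between two merged segments.
    """
    segments = list(segments)
    while True:
        if not segments:
            return []
        merged = []
        start, end = segments[0]
        for seg_start, seg_end in segments[1:]:
            short_enough = seg_end - start < max_duration
            close_enough = seg_start - end < max_interval
            if short_enough and close_enough:
                end = seg_end  # extend the current chunk
            else:
                merged.append((start, end))
                start, end = seg_start, seg_end  # start a new chunk
        merged.append((start, end))
        if len(merged) == len(segments):
            return merged  # nothing was merged in this pass
        segments = merged


# Hypothetical VAD output, in 10-ms frames.
vad_output = [(0, 500), (550, 900), (1400, 1800), (1850, 2300)]
chunks = merge_segments(vad_output, max_duration=2000, max_interval=100)
```

In this example the first two and the last two segments are merged, while the 5-second gap between frames 900 and 1400 keeps the two resulting chunks apart.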

4 Experimental setting

In this section, we describe the experimental setting for each task. The detailed configurations for each task are summarized in Table 3.

4.1 Feature extraction

We extracted 80-channel log-mel filterbank coefficients, computed with a 25-ms window and a 10-ms shift, together with 3-dimensional pitch features using the Kaldi toolkit (kaldi). The features were normalized by the mean and the standard deviation calculated on the entire training set. We applied SpecAugment (specaugment) with masking and time-warping for both ASR and E2E-ST tasks.
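The global mean and variance normalization mentioned above can be illustrated with nested lists. This is a minimal sketch under stated assumptions: the function names are made up, the population standard deviation is used, and in practice the statistics would be computed by the feature extraction toolkit rather than in pure Python.

```python
import math


def global_mean_std(utterances):
    """Per-dimension mean and standard deviation over all frames of all utterances.

    utterances: list of utterances, each a list of frames, each frame a list of floats.
    """
    dim = len(utterances[0][0])
    count = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for utt in utterances:
        for frame in utt:
            count += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    mean = [s / count for s in total]
    # Population variance, floored to avoid division by zero for constant dimensions.
    std = [math.sqrt(max(sq / count - m * m, 1e-20)) for sq, m in zip(total_sq, mean)]
    return mean, std


def normalize(utt, mean, std):
    """Apply the training-set statistics to every frame of one utterance."""
    return [[(v - m) / s for v, m, s in zip(frame, mean, std)] for frame in utt]


# Two tiny hypothetical utterances with 2-dimensional features.
train = [[[1.0, 10.0], [3.0, 10.0]], [[5.0, 14.0]]]
mean, std = global_mean_std(train)
normalized = normalize(train[0], mean, std)
```

The same training-set statistics are applied unchanged to development and test utterances.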

Configuration           ASR        E2E-ST (non-MD)   E2E-ST (MD)   MT
Warmup step             25k        25k               25k           8k
Learning rate factor    10.0       2.5               12.5          1.0
Batch size              200 utt    128 utt           120 utt       65k tok
Epoch                   30         30                30            40
Validation metric       Accuracy   BLEU              BLEU          BLEU
Model average           5          5                 5             5
Beam width              10         4                 16, 10        4
Table 3: Summary of training configuration

4.2 ASR

We used both Transformer and Conformer architectures. The encoder had two CNN blocks followed by 12 Transformer/Conformer blocks following (karita2019comparative; guo2020recent). Each CNN block consisted of a channel size of 256 and a kernel size of 3 with a stride of 2 × 2, which resulted in time reduction by a factor of 4. Both architectures had six Transformer blocks in the decoder. In both encoder and decoder blocks, the dimensions of the self-attention layer and feed-forward network were set to 512 and 2048, respectively. The number of attention heads was set to 8. The kernel size of depthwise separable convolution in Conformer blocks was set to 31. We optimized the model with the joint CTC/attention objective (hybrid_ctc_attention) with a CTC weight of 0.3. We also used CTC scores during decoding but did not use any external LM for simplicity. We adopted the best model configuration from the Librispeech ASR recipe in ESPnet.
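Two quantities from this configuration are easy to make concrete: the sequence length after the two strided CNN blocks, and the interpolation of the CTC and attention losses. The sketch below assumes unpadded convolutions along the time axis and the usual weighted-sum form of the joint objective; the function names are illustrative.

```python
def conv_output_length(length: int, kernel: int = 3, stride: int = 2) -> int:
    """Time-axis output length of one convolution, assuming no padding."""
    return (length - kernel) // stride + 1


def encoder_input_length(num_frames: int, num_blocks: int = 2) -> int:
    """Number of frames left after the stacked CNN blocks (about 1/4 of the input)."""
    for _ in range(num_blocks):
        num_frames = conv_output_length(num_frames)
    return num_frames


def joint_ctc_attention_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Weighted sum of the CTC and attention losses with the CTC weight from the text."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


frames_after_cnn = encoder_input_length(3000)
```

Under these assumptions, a 3000-frame utterance (the longest kept for training) is reduced to 749 frames before entering the Transformer or Conformer blocks.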

4.3 MT

We used the Transformer-Base and -Big configurations in (vaswani2017attention).

4.4 E2E-ST

We used the same Conformer architecture as ASR except for the vocabulary. We initialized the encoder parameters with those of the Conformer ASR. On the decoder side, we initialized parameters like BERT (devlin-etal-2019-bert), where weight parameters were sampled from N(0, 0.02), biases were set to zero, and layer normalization parameters were set to β = 0 and γ = 1. This technique led to better translation performance and faster convergence.
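A BERT-style initialization of decoder parameters can be sketched without any deep learning framework. The helper names are hypothetical, and the standard deviation of 0.02 follows the BERT convention, which is assumed here rather than taken from the recipe.

```python
import random


def init_linear(in_dim, out_dim, std=0.02, rng=None):
    """Return (weight, bias) with normally distributed weights and zero biases."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    weight = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def init_layer_norm(dim):
    """Return (gain, shift) so that layer normalization starts as a plain normalizer."""
    gain = [1.0] * dim
    shift = [0.0] * dim
    return gain, shift


weight, bias = init_linear(in_dim=4, out_dim=3)
gain, shift = init_layer_norm(4)
```

In a real model the same rule would be applied to every linear and embedding layer of the decoder, while the encoder keeps the pre-trained ASR parameters.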

Model         WER (↓)
              Librispeech test-other   TEDLIUM2 test   Must-C tst-COMMON
Transformer   9.4                      6.4             7.0
Conformer     7.1                      6.2             5.6
Table 4: Word error rate (WER) of ASR systems
VAD        Max duration   Max interval   tst2010   tst2015   tst2018   tst2019   Avg.
Provided   -              -              18.2      32.1      23.5      20.8      23.65
           1500           200            14.4      29.3      18.4      15.5      19.40
           2000           200            12.7      27.7      16.4      11.5      17.08
           2500           200            14.5      29.9      15.1      12.2      17.93
WebRTC     -              -              35.3      35.1      44.0      22.7      34.28
           1500           200            19.4      26.7      27.7      13.8      21.90
           2000           200            19.8      27.7      27.1      11.9      21.63
           2500           200            22.9      29.5      27.1      11.6      22.78
pyannote   -              -              9.5       24.0      15.5      7.3       14.08
           1500           200            8.0       23.0      12.4      7.3       12.68
           1500           100            7.5       22.2      12.4      6.5       12.15
           2000           200            10.3      22.5      12.2      6.5       12.88
           2000           150            9.6       21.8      12.3      6.1       12.45
           2000           100            8.1       21.5      12.0      5.8       11.90
           2000           50             7.3       21.9      12.4      5.9       11.88
Table 5: Impact of audio segmentation for ASR
Model; BLEU (↑) columns in order: Must-C dev, Must-C tst-COMMON, Must-C v2 tst-COMMON, tst2010, tst2015, tst2018, tst2019, Must-C Train
Base (Must-C only) 30.02 29.86 27.28 24.92 21.13 20.37
Base (WMT5M) 31.31 34.13 33.85 31.61 32.44 28.30 28.28 45.68
 + Big 27.32 29.11 28.85 27.61 28.44 24.42 23.92
Base (WMT10M) 33.28 35.09 34.80 33.58 33.26 29.24 28.87 38.31
 + In-domain finetune 30.67 35.50 35.30 30.79 31.43 25.35 26.10
Base (WMT20M) 33.15 35.06 34.87 33.26 33.56 29.94 29.08 33.60
Table 6: BLEU scores of text-based MT systems

5 Results

5.1 ASR

5.1.1 Architecture

We compared Transformer and Conformer ASR architectures in Table 4. We observed that Conformer significantly outperformed Transformer. Therefore, we use the Conformer encoder in the following experiments.

5.1.2 Segmentation

Next, we investigated the VAD systems and the proposed segment merging algorithm for long context modeling in Table 5. We used the same decoding hyperparameters tuned on Must-C. We first observed that merging short segments was very effective, probably because it alleviated frame classification errors in the VAD systems. Among the three audio segmentation methods, we confirmed that pyannote.audio significantly reduced the WER, while WebRTC had a negative impact compared to the provided segmentation. Specifically, we found that the dihard option in pyannote.audio worked very well while the other options did not. The optimal maximum duration was around 2000 frames (i.e., 20 seconds). In the last experiments, we tuned the maximum interval among {50, 100, 150, 200} and found that 50 and 100 (i.e., 0.5 and 1 second) were best on average. Compared to the provided segmentation, we obtained a 49.6% improvement on average.

ID; Model; BLEU (↑) columns in order: Must-C dev, Must-C tst-COMMON, Must-C tst-HE, Must-C v2 tst-COMMON, tst2010, tst2015, tst2018, tst2019
- Bidir SeqKD (E2E) (inaguma2021source) 25.67 27.01 25.36
Multi-Decoder (E2E) (dalmia2021searchable) 26.4
RWTH (Cascade) (bahar2021tight) 26.50 26.80 28.4
KIT (E2E) (Pham2020) 30.60 24.27 21.82
KIT (Cascade) (Pham2020) 26.68 24.95
SRPOL (E2E) (potapczyk-przybysz-2020-srpols) 29.44 24.6 23.96
A1 Baseline (X) 25.14 35.63 22.63 36.07 21.40 18.18 16.69 17.39
A2  + SeqKD (Y) 26.31 29.29 26.33 29.50 23.34 21.24 21.09 22.25
A3  + 2ref SeqKD (X+Y) 26.50 30.59 26.21 30.92 23.00 22.18 20.38 21.59
A4  + 3ref SeqKD (X+Y+Z) 27.66 30.90 27.44 31.07 24.97 22.66 22.20 23.41
B1 MD + 2ref SeqKD 30.78 23.78
C1 Conformer ASR Base MT (WMT10M) 27.01 29.42 26.13 29.75 25.04 23.17 23.05 23.19
Table 7: BLEU scores of ST systems. X: original, Y: WMT5M, Z: WMT10M. For unsegmented test sets, we used pyannote-based segmentation with a maximum duration of 2000 and a maximum interval of 100.

5.2 MT

In this section, we show the results of our MT systems used for cascade systems and pseudo labeling in SeqKD. We report case-sensitive detokenized BLEU scores (papineni-etal-2002-bleu) with the multi-bleu-detok.perl script in Moses. We carefully investigated the effective amount of WMT training data to improve the performance of the TED domain. The results are shown in Table 6. We confirmed that adding the WMT data improved the performance by more than 4 BLEU. Regarding the WMT data size, using up to 10M sentences was helpful, but 20M did not show clear improvements, probably because of the undersampling of the TED data. Oversampling as in multilingual NMT (arivazhagan2019massively) could alleviate this problem, but this is beyond our scope.

After training with a mix of the WMT and TED data, we also tried to finetune the model with the TED data only, but this did not lead to clear improvement, especially for the IWSLT test sets. Increasing the model capacity was not helpful, although the conclusion might change by adding more training data and evaluating the model in other domains such as news. Because our primary focus to use MT systems was pseudo labeling for SeqKD, we decided to use the Base configuration to speed up decoding.

Finally, we checked the BLEU scores on the Must-C training data used for SeqKD. We observed that adding more WMT data decreased the BLEU score, from which we can conclude that using more WMT data gradually changed the MT output from the TED style. Therefore, we decided to use the models trained on WMT5M and WMT10M as teachers for SeqKD.

5.3 Speech translation

5.3.1 E2E-ST


The results are shown in Table 7. We first observed that the baseline Conformer model (A1) achieved 35.63 BLEU on the Must-C tst-COMMON set, which is a new state-of-the-art result to the best of our knowledge. Surprisingly, it even outperformed the text-based MT systems in Table 6. On the other hand, unlike our observations in (inaguma2020orthros; inaguma2021source), SeqKD (A2-4) degraded the performance on the Must-C tst-COMMON set. However, the results on the Must-C dev and tst-HE sets showed completely different trends, where we observed better BLEU scores with SeqKD in proportion to the amount of WMT data used for training the teachers. Therefore, after tuning audio segmentation, we also evaluated the models on the unsegmented IWSLT test sets. Here, we used the pyannote-based segmentation with a maximum duration of 2000 and a maximum interval of 100, as described in §5.1.2. We then confirmed large improvements of 2-6 BLEU with SeqKD, and therefore decided to select the best model based on the IWSLT test sets. Multi-referenced training consistently improved the BLEU scores on the IWSLT sets. For example, A4 outperformed A1 by 6.02 BLEU on tst2019 although the tst2019 set was well-segmented (WER: 6.0%). Given these observations, we recommend evaluating ST models on multiple test sets in future research.

Multi-Decoder architecture

We combined the SeqKD and Multi-Decoder techniques in our B1 system. B1, which used a Conformer ASR encoder and 2ref SeqKD, showed an improvement of 2.19 BLEU on tst2019 over A3, the encoder-decoder model that also used 2ref SeqKD. B1 also achieved a slightly higher result on tst2019 than A4, which used 3ref SeqKD. These results suggest that the Multi-Decoder architecture is indeed compatible with SeqKD.

ID   Ensembled models       tst2019
-    B1                     21.06
E1   B1, A4                 22.51
E2   B1, A4, A1             22.83
E3   B1, A4, A1, A3         23.36
E4   B1, A4, A1, A3, A2     23.61
Table 8: BLEU (↑) scores of ensembled E2E-ST systems on tst2019, using the provided segmentation with segment merging
Segmentation         Max duration   Max interval   tst2010   tst2015   tst2018   tst2019   Avg.
Provided†            -              -              -         -         -         20.1      -
Provided (E2E)       -              -              21.99     19.94     19.29     19.70     20.23
                     1000           200            22.62     20.54     19.80     20.54     20.88
                     1500           200            23.00     21.66     20.14     21.50     21.58
                     2000           200            22.95     21.58     20.03     21.34     21.48
WebRTC (E2E)         -              -              13.13     12.97     11.07     13.32     12.62
                     1000           200            20.95     20.66     17.09     20.87     19.89
                     1500           200            21.00     20.99     17.67     21.05     20.18
                     2000           200            20.25     21.81     17.08     20.71     19.96
pyannote (E2E)       -              -              22.26     16.84     17.78     19.98     19.22
                     1500           200            25.00     22.22     21.97     22.67     22.97
                     1500           100            25.92     22.81     22.51     22.88     23.53
                     2000           200            24.10     21.98     21.00     22.71     22.45
                     2000           150            24.25     22.26     21.41     22.99     22.73
                     2000           100            24.97     22.66     22.20     23.41     23.31
                     2000           50             24.50     20.67     22.14     22.89     22.55
pyannote (Cascade)   1500           200            25.06     22.65     23.01     22.51     23.31
                     1500           100            25.56     22.85     23.03     22.82     23.57
                     2000           200            24.41     22.76     22.15     22.08     22.85
                     2000           150            24.50     23.03     23.12     23.11     23.44
                     2000           100            25.04     23.17     23.05     23.19     23.61
                     2000           50             24.33     20.79     23.12     23.11     22.84
Table 9: Impact of audio segmentation for ST. A4 was used for the E2E model. †(potapczyk-przybysz-2020-srpols)
Model ensemble

As shown in Table 8, ensembling our various ST systems using the posterior combination method described in §3.4 showed improvements over the best single model, B1. We found that an ensemble of all of our models, A1-4 and B1, achieved the best result of 23.61 BLEU on tst2019 and outperformed B1 by 2.55 BLEU. Although A1 as a single system performs worse on tst2019 than the other single systems as shown in Table 7, including it in an ensemble with the two best single systems, B1 and A4, still yielded a slight gain of 0.32 BLEU (E2). Therefore, we can conclude that weak models are still beneficial for ensembling.

System             Segmentation   Segment merging   BLEU (↑)
                                                    tst2019   tst2020   tst2021 ref1   tst2021 ref2   tst2021 both
IWSLT’20 winner†   given          -                 20.1      21.5      -              -              -
                   own            -                 23.96     25.3      -              -              -
E4 (primary)       pyannote       200               24.14     25.6      19.3           21.2           31.4
E4+*               pyannote       200               24.41     25.5      19.7           20.6           30.8
E4+*               pyannote       100               24.87     26.0      19.5           21.1           31.3
E4+*               given          100               23.72     25.1      19.4           21.4           31.5
E4+*               given          -                 21.10     22.3      17.4           18.4           27.7
B1                 pyannote       100               23.78     25.0      18.9           20.9           31.1
Table 10: BLEU scores of submitted systems on tst2020 and tst2021. †(potapczyk-przybysz-2020-srpols). A maximum duration of 2000 was used for the segment merging algorithm. *Late submission (not official). E4+ denotes E4 trained for more steps.

5.3.2 Segmentation

Similar to §5.1.2, we also investigated the impact of audio segmentation for E2E-ST models. To this end, we used the A4 model. Note that we used the same decoding hyperparameters tuned on Must-C. The results are shown in Table 9. We confirmed a similar trend to ASR. Although a maximum duration of 1500 with a maximum interval of 100 showed the best performance on average, we decided to use a maximum duration of 2000 with a maximum interval of 100 for submission, considering its best performance on the latest IWSLT test set, tst2019.

5.3.3 Cascade system

We also evaluated the cascade system with the Conformer ASR and the Transformer-Base MT trained on the WMT10M data (C1). The MT model was trained by feeding source sentences without case information and punctuation marks. The results in Table 9 showed that the BLEU scores correlated with the WER in Table 5, and the performance was comparable with that of A4. Although there is some room for improving the performance of the cascade system further by using an in-domain English LM, it is difficult to conclude which modeling approach (cascade or E2E) is more effective because the cascade system had more model parameters in the ASR decoder and MT encoder. This means that the E2E model could also be enhanced by using a similar number of parameters.

5.3.4 Final system

Our final system was the best ensemble system E4, using the pyannote-based segmentation and segment merging with a maximum interval of 200 (because of time limitations, we submitted the systems before completing the tuning of segmentation hyperparameters). This system, which was our primary submission, scored 24.14 BLEU on tst2019 as shown in Table 10. Compared to the result in Table 8, it improved by 0.53 BLEU thanks to better audio segmentation. It was also slightly higher than the IWSLT 2020 winner’s submission by SRPOL (potapczyk-przybysz-2020-srpols).

We also present the results on tst2020 and tst2021 in Table 10. Our primary submission E4 outperformed the result of last year’s winner system on tst2020.

6 Conclusion

In this paper, we have presented the ESPnet-ST group’s offline systems submitted to IWSLT 2021. We significantly improved the baseline Conformer performance with multi-referenced SeqKD, the Multi-Decoder architecture, the segment merging algorithm, and model ensembling. Our future work includes scaling training data and careful analysis of the performance gap across different test sets.

7 Acknowledgement

This work was partly supported by ASAPP and JHU HLTCOE. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (towns2014xsede), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (nystrom2015bridges), which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).