
Cross-modal Contrastive Learning for Speech Translation

How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms the previous methods and achieves an average BLEU of 29.4. The analysis further verifies that ConST indeed closes the representation gap between the modalities: its learned representation improves the accuracy of cross-modal speech-text retrieval from 4 at


1 Introduction

End-to-end speech-to-text translation (E2E ST) has become important in many internet products and real applications. An E2E ST system accepts audio signals as input and generates the target translation using a single model. Compared with conventional cascade ST models, E2E ST models have achieved almost comparable Bentivogli et al. (2021) or even superior Ansari et al. (2020); Potapczyk and Przybysz (2020); Xu et al. (2021) performance.

The performance of an E2E ST model is still restricted for languages with relatively small parallel data, compared to text machine translation (MT). Existing approaches for ST focus on using additional data from MT and automatic speech recognition (ASR). This can be realized through pre-training approaches Zheng et al. (2021); Dong et al. (2021b, a) or multi-task training frameworks Tang et al. (2021b); Ye et al. (2021); Han et al. (2021).

Figure 1: Illustration of representations for speech and transcript text (projected to 2D). (a) representations learned by existing models. Pairs of speech and text representations are distant. (b) an ideal representation that we expect, where different modalities with same meaning should stay close to each other.

Different from the data perspective, this paper investigates the bottleneck of E2E ST from the neural representation perspective. We believe that when the representation of audio input is similar to its corresponding textual representation, it is easier for information to transfer from MT to ST, thus improving speech translation performance.

We analyze Transformer models for speech translation and observe a noticeable modality gap between the encoder representations of speech and text in existing ST models (see Section 6 for details). An ideal representation should satisfy the following: if the content of the speech and the transcription are similar, their encoded representations should likewise be close to each other. How, then, can we learn unified and aligned speech-text representations?

Inspired by the recent progress of contrastive learning approaches in cross-lingual Lample and Conneau (2019); Pan et al. (2021) and cross-modal vision-and-language domains (Li et al., 2021; Zhou et al., 2020; Dong et al., 2019), we designed a simple contrastive learning method for ST (ConST) to learn the representations that meet the aforementioned conditions explicitly. On the one hand, our model inherits the advantages of the previous multi-task learning methods. On the other hand, it reduces the gap between the representations of speech and its corresponding transcription.

Our contributions are as follows.

  • We develop ConST for speech translation, a cross-modal contrastive learning method, on top of the multi-task training framework.

  • Our experiments on the MuST-C benchmark show that ConST achieves an average BLEU score of 29.4, outperforming the best previous baseline.

  • We show that ConST indeed learns similar representations for two modalities and better retrieves text with speech input.

Figure 2: Left: Model structure of ConST. The gray shaded modules are the optional data augmentation operations introduced in Section 3.3. Right: An illustration of cross-modal contrastive learning.

2 Related Work

End-to-end ST  To alleviate the error propagation in cascaded ST systems and to make deployment simpler, Bérard et al. (2016); Weiss et al. (2017) proposed to use an end-to-end architecture to directly translate speech into text in another language, without the intermediate transcription. Kano et al. (2017); Berard et al. (2018); Inaguma et al. (2020); Wang et al. (2020a); Zhao et al. (2021) implemented several off-the-shelf encoder-decoder E2E-ST models, such as BiLSTM Greff et al. (2016) and Speech-Transformer Dong et al. (2018). However, training an end-to-end speech translation model is difficult because the model must be both cross-modal and cross-lingual, while the speech-transcription-translation supervised data for speech translation is significantly scarcer than that for MT and ASR. Methods such as data augmentation Park et al. (2019); Pino et al. (2020); Chen et al. (2021), pre-training Weiss et al. (2017); Berard et al. (2018); Bansal et al. (2019); Wang et al. (2020b); Alinejad and Sarkar (2020); Dong et al. (2021a); Zheng et al. (2021), self-training Pino et al. (2020); Wang et al. (2021), and utilizing self-supervised pre-trained audio representations Wu et al. (2020); Han et al. (2021); Ye et al. (2021); Wang et al. (2021) have proven effective. Meanwhile, some work has shown that an encoder-decoder model with a single encoder cannot encode speech information well. For example, Dong et al. (2021b) first proposed a second encoder to further extract semantic information from the speech sequence. Xu et al. (2021) proposed a stacked acoustic-and-textual encoder and introduced large-scale out-of-domain data. Also, multi-task frameworks Le et al. (2020); Tang et al. (2021b); Ye et al. (2021) are widely applied to further enhance the robustness of ST. Because ST is a cross-modal task, some work has noted the problem of the modality gap. Han et al. (2021) designed a fixed-size semantic memory module to bridge such a gap, from the neuroscience perspective. However, we find that this approach actually sacrifices MT performance. So in this paper, we propose a simple yet effective contrastive learning method to bridge the gap and to improve ST performance.

Cross-modal grounding learning  This paper attempts to address the problem in speech translation from the perspective of cross-modal speech-text representation learning. We are also inspired by cross-modal representation learning in the acoustic word embedding (AWE) Palaskar et al. (2019); Kamper et al. (2020); Hu et al. (2020) and the visual-language pre-training (VLP) Wu et al. (2019); Lu et al. (2019); Chen et al. (2020b); Li et al. (2021) tasks. These works usually focus on enhancing textual representations with acoustic or visual information, in other words, grounding learning. In this work, we consider its dual form, i.e., grounding speech representations using text.

Contrastive learning  Our method is motivated by the recent success in contrastive representation learning. The contrastive learning method was first proposed to learn representations from unlabeled datasets (hence the term, self-supervised learning) by telling which data points are similar or distinct, especially in the field of computer vision Chopra et al. (2005); Gutmann and Hyvärinen (2010); Schroff et al. (2015); Sohn (2016); Oord et al. (2018); Chen et al. (2020a); Grill et al. (2020). Khosla et al. (2020) extended the self-supervised batch contrastive approach to the fully-supervised setting and proposed a supervised contrastive learning method. In speech processing, representative methods focused on speaker identification Ravanelli and Bengio (2018), speech recognition Schneider et al. (2019), and audio representation learning van den Oord et al. (2018); Baevski et al. (2020). In the NLP area, the contrastive framework is used for sentence representation learning Fang et al. (2020); Shen et al. (2020); Gao et al. (2021); Wu et al. (2021); Yan et al. (2021), machine translation Pan et al. (2021), and summarization Wang et al. (2021); Cao and Wang (2021). Very recently, contrastive learning has also been applied to learning a unified representation of image and text Dong et al. (2019); Zhou et al. (2020); Li et al. (2021). Motivated by the contrastive learning frameworks in cross-lingual and cross-modal topics, we introduce a similar idea in speech translation.

3 The ConST Approach

An end-to-end speech translation model directly translates an audio sequence into text in the target language. A speech translation corpus also provides the transcript in the source language.

In this section, we present the overall speech translation model and cross-modal contrastive learning. We also provide several feasible strategies to construct more positive and negative pairs to enhance the contrastive learning.

3.1 Model Framework

Our model consists of four sub-modules: a speech encoder, a word embedding layer, a Transformer encoder and a Transformer decoder (Figure 2). It is designed to take either speech or a sentence as input, and to output either the source transcript or the target translation text. Such an architecture enables a universal framework for multiple tasks, including ST, MT and ASR.

The speech encoder module (S-Enc) is designed to extract low-level features for speech signals. It contains Wav2vec2.0 Baevski et al. (2020) and two additional convolutional layers. The input is a raw waveform signal sampled at 16kHz. Each convolutional layer has $d$ channels and a stride of 2 (the setting given under Model Configurations in Section 4.1), so the two layers together shrink the time dimension by a factor of 4. Denote $a = \text{S-Enc}(s)$ as the audio representation of the speech $s$, a sequence of $d$-dimensional vectors.

Parallel to the speech encoder is the word embedding layer. It is the same as the word embedding used for text translation.

Both the speech encoder and the word embedding layer are connected to the Transformer encoder, whose output is then passed to the Transformer decoder. The Transformer encoder and decoder use the same configuration as the original Transformer Vaswani et al. (2017). The Transformer encoder further extracts the high-level semantic hidden representation of the two modalities. The Transformer decoder generates the word sequences (transcription and translation) for the ST, MT and ASR tasks. Since our model has a complete Transformer encoder-decoder as a sub-module, it can be pre-trained on large-scale extra MT parallel data.

Previous work has shown that multi-task learning on ST, MT and ASR improves translation performance Indurthi et al. (2020); Tang et al. (2021b); Ye et al. (2021). Our training loss consists of the following elements.

$$\mathcal{L} = \mathcal{L}_{\text{ST}} + \mathcal{L}_{\text{ASR}} + \mathcal{L}_{\text{MT}} + \lambda \mathcal{L}_{\text{CTR}} \qquad (1)$$

The first three elements are cross-entropy losses on <speech, target text>, <speech, source text> and <source text, target text> pairs. These pairs are built from the triplet ST data. We also introduce a cross-modal contrastive loss term $\mathcal{L}_{\text{CTR}}$ (see Section 3.2 for details). It aims to bring the representations of the speech and textual transcription modalities closer (its effect is analyzed in detail in Section 6). $\lambda$ is a tuned hyper-parameter that weights the contrastive loss term.
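As a minimal sketch of how the four terms combine, the snippet below computes a sequence-level cross-entropy from per-token probabilities and adds the weighted contrastive term. The function names and the numeric values are illustrative assumptions, not the authors' implementation.

```python
import math

def cross_entropy(token_probs):
    """Negative log-likelihood of a target sequence, given the probability
    the model assigned to each reference token (hypothetical inputs)."""
    return -sum(math.log(p) for p in token_probs)

def total_loss(l_st, l_asr, l_mt, l_ctr, lam):
    """Sum of the three cross-entropy terms plus the contrastive term,
    weighted by the hyper-parameter lam (the weight of the contrastive loss)."""
    return l_st + l_asr + l_mt + lam * l_ctr

# Illustrative numbers only.
l_st = cross_entropy([0.6, 0.7, 0.9])   # <speech, target text>
l_asr = cross_entropy([0.8, 0.9])       # <speech, source text>
l_mt = cross_entropy([0.7, 0.8, 0.9])   # <source text, target text>
loss = total_loss(l_st, l_asr, l_mt, l_ctr=0.4, lam=1.0)
```

Setting `lam` to zero recovers the plain multi-task objective, which is the comparison the analysis sections rely on.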

3.2 Cross-modal Contrastive Learning

As mentioned at the beginning, we need to produce similar representations for a speech utterance and a transcript that share the same meaning, so we propose a cross-modal contrastive learning method to bring their representations closer together. The main idea of cross-modal contrastive learning is to introduce a loss that pulls speech and its corresponding transcript (positive example) close together while pushing irrelevant ones (negative examples) far apart.

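Given a positive example of such a speech-transcript pair $(s, x)$, we randomly pick a set of $N-1$ transcripts $\{x_i^-\}_{i=1}^{N-1}$ from the same batch as negative examples. For speech $s$ and its transcript $x$, we first average them in terms of the time dimension,

$$u = \text{MeanPool}(\text{S-Enc}(s)) \qquad (2)$$

$$v = \text{MeanPool}(\text{Emb}(x)) \qquad (3)$$

and apply the multi-class N-pair contrastive loss Sohn (2016):

$$\mathcal{L}_{\text{CTR}} = -\sum_{s, x} \log \frac{\exp(\text{sim}(u, v)/\tau)}{\sum_{x_j \in \mathcal{A}} \exp(\text{sim}(u, v(x_j))/\tau)} \qquad (4)$$

where $\mathcal{A} = \{x\} \cup \{x_i^-\}_{i=1}^{N-1}$, $\tau$ is the temperature hyper-parameter, and $\text{sim}$ is the cosine similarity function $\text{sim}(a, b) = a^\top b / (\lVert a \rVert \lVert b \rVert)$. In the implementation, negative examples are from the same training batch of data (Figure 2(b)).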
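The loss described above can be sketched in a few lines of plain Python: mean-pool each sequence over time, score every speech vector against every transcript vector in the batch with cosine similarity, and take the negative log-softmax of the matching pair. This is a sketch under stated assumptions, not the authors' code: the representations are passed in as nested lists, and the default temperature `tau=0.1` is a placeholder value.

```python
import math

def mean_pool(seq):
    """Average a sequence of equal-length vectors over the time dimension."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(speech_batch, text_batch, tau=0.1):
    """N-pair contrastive loss over a batch.

    speech_batch[i] and text_batch[i] form the positive pair; every other
    transcript in the batch serves as a negative for utterance i.
    """
    us = [mean_pool(s) for s in speech_batch]
    vs = [mean_pool(t) for t in text_batch]
    loss = 0.0
    for i, u in enumerate(us):
        logits = [cosine(u, v) / tau for v in vs]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_denom
    return loss

# Toy batch: two utterances with 2-dimensional features.
speech = [[[1.0, 0.0], [1.0, 0.2]], [[0.0, 1.0], [0.1, 1.0]]]
text = [[[1.0, 0.1]], [[0.0, 1.0], [0.0, 1.0], [0.1, 0.9]]]
aligned = contrastive_loss(speech, text)
misaligned = contrastive_loss(speech, text[::-1])
```

On the toy batch, pairing each utterance with the wrong transcript gives a larger loss than the correct pairing, which is the behaviour the objective is meant to reward.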

3.3 Mining Hard Examples for Contrastive Learning

To further enhance the contrastive learning, we introduce three strategies to mine additional hard examples. These strategies operate at the input and representation levels (gray shaded modules in Figure 2(a)). Schematic illustrations of each operation are shown in Figure 3.

Span-Masked Augmentation  We mask consecutive segments of an original audio waveform sequence $s$ to obtain a new modified speech $s'$. We take $s'$ as an input to the model, and compute the contrastive loss on its original corresponding transcript. We randomly sample without replacement all time steps in the original waveform of the speech to be the starting indices with a probability $p$, and then we set the sub-sequence of $M$ successive time steps to be blank. In the experiments, we tried multiple configurations of $p$ and $M$ and kept the best one, which results in a very short masked span. Since the masked speech fragment is very short, we consider the masked speech and the original transcript to be positive pairs, and the remaining transcripts in the same batch to be negative pairs.
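The masking procedure can be illustrated as follows. This is a minimal sketch, assuming that "blank" means zeroing the samples and that each time step is chosen as a span start independently; the argument names `p` and `span` and the values in the usage lines are placeholders rather than the configuration used in the experiments.

```python
import random

def span_mask(waveform, p, span, rng=None):
    """Return a copy of the waveform with random spans set to zero.

    Each time step becomes a span start with probability p; the next
    `span` samples from that start are blanked (assumed here to mean
    zeroed). Overlapping spans simply merge.
    """
    rng = rng or random.Random()
    masked = list(waveform)
    n = len(masked)
    for start in range(n):
        if rng.random() < p:
            for t in range(start, min(start + span, n)):
                masked[t] = 0.0
    return masked

# Toy waveform of 20 samples; values are illustrative only.
wave = [1.0] * 20
masked_wave = span_mask(wave, p=0.2, span=3, rng=random.Random(0))
```

The masked waveform keeps its original length, so it can be fed to the speech encoder unchanged and paired with the original transcript as a positive example.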

Figure 3: Schematic illustration of the hard examples mining strategies. In the cut-off strategy, the gray shaded grid represents the zero-out element.

Word Repetition  The word repetition strategy randomly replicates some words (or sub-words) in the original sentences, with two advantages for improving representation robustness. First, as the sentence is shorter than its audio representation, randomly repeating words in the sentence is a simple yet useful technique to increase its length. Second, repeating words does not change the semantics, so the result is suitable as an extra positive example for the corresponding speech. Specifically, given a sentence $x$, each sub-word token $x_i$ can be duplicated $k$ more times, resulting in the duplicated sentence $x'$, where $k$ is a randomly sampled non-negative integer. We regard $x'$ as the additional positive example for the speech $s$ and the samples with the same operation in the same batch as the negative examples.
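A sketch of the repetition step is shown below. The distribution of the number of extra copies is an assumption made only for illustration (uniform over 0 to `max_extra`), and `repeat_tokens` is a hypothetical helper name.

```python
import random

def repeat_tokens(tokens, max_extra=2, rng=None):
    """Duplicate each sub-word token k more times in place.

    k is drawn uniformly from {0, ..., max_extra}; this distribution is an
    illustrative assumption. Token order is preserved, so the meaning of
    the sentence is unchanged.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        k = rng.randint(0, max_extra)
        out.extend([tok] * (1 + k))
    return out

# Hypothetical sub-word sequence.
tokens = ["_li", "ghts", ",", "_sounds"]
repeated = repeat_tokens(tokens, max_extra=2, rng=random.Random(0))
```

Collapsing adjacent duplicates in the output recovers the original sequence, which is why the repeated sentence can serve as an additional positive example.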

Cut-off strategy  Recent studies on natural language understanding and generation have shown the cut-off strategy to be successful Shen et al. (2020); Yan et al. (2021). We apply an analogous idea to speech representations. We entirely erase a slice of the representation matrix along one dimension and set the erased terms to 0. Here, we present two variants: sequence cut-off, which erases along the sequence dimension, and feature cut-off, which erases along the feature dimension. Note that there is a difference between cut-off and dropout. Dropout randomly sets individual elements to 0, while cut-off is a dimensional "block" dropout. Similarly, we treat the cut-off audio representation and the original transcribed sentence as positive pairs, and the remaining sentences in the same batch as negative pairs.
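The two cut-off variants can be sketched on a representation stored as a list of time steps, each a list of features. The sketch assumes the erased slice is one contiguous block whose width is given by the caller; both the contiguity and the `width` parameter are illustrative choices.

```python
import random

def sequence_cutoff(rep, width, rng=None):
    """Zero out `width` consecutive time steps (whole rows)."""
    rng = rng or random.Random()
    start = rng.randrange(0, len(rep) - width + 1)
    return [
        [0.0] * len(row) if start <= t < start + width else list(row)
        for t, row in enumerate(rep)
    ]

def feature_cutoff(rep, width, rng=None):
    """Zero out `width` consecutive feature dimensions (whole columns)."""
    rng = rng or random.Random()
    start = rng.randrange(0, len(rep[0]) - width + 1)
    return [
        [0.0 if start <= j < start + width else x for j, x in enumerate(row)]
        for row in rep
    ]

# Toy representation: 4 time steps, 3 features.
rep = [[1.0] * 3 for _ in range(4)]
seq_cut = sequence_cutoff(rep, 2, rng=random.Random(0))
feat_cut = feature_cutoff(rep, 1, rng=random.Random(0))
```

Unlike element-wise dropout, every zero introduced here belongs to a full row or a full column, which matches the "block" description in the text.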

4 Experiments

4.1 Experimental Setups

ST datasets  We conduct experiments on all the translation directions in the MuST-C dataset (v1.0) Di Gangi et al. (2019): English (En) to German (De), Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Romanian (Ro) and Russian (Ru). As one of the largest ST benchmarks, MuST-C contains more than 385 hours of TED talks for each direction.

MT datasets  We also introduce external WMT datasets Bojar et al. (2016) for En-De/Es/Fr/Ro/Ru and OPUS100 datasets Zhang et al. (2020) for En-It/Nl/Pt directions, as the expanded setup.

Table 8 (in Appendix. A) lists the statistics of all the datasets included.

Models External Data BLEU
Speech Text ASR MT De Es Fr It Nl Pt Ro Ru Avg.
w/o external MT data
Fairseq ST (Wang et al., 2020a) - - - - 22.7 27.2 32.9 22.7 27.3 28.1 21.9 15.3 24.8
NeurST Zhao et al. (2021) - - - - 22.8 27.4 33.3 22.9 27.2 28.7 22.2 15.1 24.9
Espnet ST Inaguma et al. (2020) - - - - 22.9 28.0 32.8 23.8 27.4 28.0 21.9 15.6 25.1
Dual Decoder Le et al. (2020) - - - - 23.6 28.1 33.5 24.2 27.6 30.0 22.9 15.2 25.6
W-Transf. Ye et al. (2021) - - - 23.6 28.4 34.6 24.0 29.0 29.6 22.4 14.4 25.7
Speechformer Papi et al. (2021) - - - - 23.6 28.5 - - 27.7 - - - -
Self-training (Pino et al., 2020) - - 25.2 - 34.5 - - - - - -
SATE (Xu et al., 2021) - - - - 25.2 - - - - - - - -
BiKD (Inaguma et al., 2021) - - - - 25.3 - 35.3 - - - - - -
Mutual-learning Zhao et al. (2021) - - - - - 28.7 36.3 - - - - - -
XSTNet Ye et al. (2021) - - - 25.5 29.6 36.0 25.5 30.0 31.3 25.1 16.9 27.5
ConST - - - 25.7 30.4 36.8 26.3 30.6 32.0 24.8 17.3 28.0
w/ external MT data
MTL Tang et al. (2021b) - - - 23.9 28.6 33.1 - - - - - -
LightweightAdaptor Le et al. (2021) - 24.7 28.7 35.0 25.0 28.8 31.1 23.8 16.4 26.6
FAT-ST (Big) (Zheng et al., 2021) 25.5 30.8 - - 30.1 - - - -
JT-S-MT Tang et al. (2021a) - - - 26.8 31.0 37.4 - - - - - -
Chimera Han et al. (2021) - - 27.1 30.6 35.6 25.0 29.2 30.2 24.0 17.4 27.4
XSTNet Ye et al. (2021) - - 27.1 30.8 38.0 26.4 31.2 32.4 25.7 18.5 28.8
SATE Xu et al. (2021) - - 28.1 - - - - - - - -
STEMM Fang et al. (2022) - - 28.7 31.0 37.4 25.8 30.5 31.7 24.5 17.8 28.4
TaskAware Indurthi et al. (2021) - - 28.9 - - - - - - - -
STPT Tang et al. (2022) - 33.1 39.7 - - - - - -
ConST - - 28.3 32.0 38.3 27.2 31.7 33.1 25.6 18.9 29.4
Table 1: Case-sensitive detokenized BLEU scores on MuST-C tst-COMMON set. "Speech" denotes unlabeled speech data. "Text" means unlabeled text data, e.g. Europarl V7 Koehn and others (2005), CC25 Liu et al. (2020a). † use external 40M OpenSubtitles (Lison and Tiedemann, 2016) MT data. Other models only use WMT data.

Model Configurations  The Wav2vec2.0 in the S-Enc is only pre-trained on Librispeech Panayotov et al. (2015) speech without any downstream fine-tuning. The two CNN layers after the Wav2vec2.0 use kernel size 5, stride size 2 and hidden size 512. The Transformer follows the base configuration, with 6 layers of encoder and decoder, hidden size $d = 512$, 8 attention heads, and 2048 FFN hidden states. We use pre-layer normalization for stable training. The model with the above configurations has a total of about 150M parameters.

Experiment Details  We evaluate case-sensitive detokenized BLEU using sacreBLEU Post (2018) (BLEU signature: nrefs:1 | bs:1000 | seed:12345 | case:mixed | eff:no | tok:13a | smooth:exp | version:2.0.0) on the MuST-C tst-COMMON set. In the analysis, we also report the ChrF++ score Popović (2017) (ChrF2++ signature: nrefs:1 | bs:1000 | seed:12345 | case:mixed | eff:yes | nc:6 | nw:2 | space:no | version:2.0.0) and the learning-based BLEURT score Sellam et al. (2020) (as recommended, the checkpoint we use is BLEURT-20). We use the raw 16-bit 16kHz mono-channel speech input. We jointly tokenize the bilingual text using SentencePiece Kudo and Richardson (2018), with a vocabulary size of 10k, which is the same as Ye et al. (2021)'s setup. For the training loss, we set the contrastive temperature $\tau$ and the weight of the contrastive term $\lambda$, using one setting for German and Dutch and another for the other languages.

Appendix B contains more detailed settings and explanations for the baseline models in Table 1. Appendix C shows the experiments on the choice of the hyper-parameters.

4.2 Main Results

Comparison with end-to-end ST models  Table 1 shows the main results. Since many existing works regard "leveraging external data" as one of their model's features, their strong performance is largely predicated on the use of auxiliary data, especially large-scale MT data. For a relatively fair comparison, we investigate two cases: (1) without external MT data and (2) with external MT data. Without the external MT data, our method already gains an average improvement of 0.5 BLEU over the previous best models. Also, when speech data is introduced for pre-training, our method works better than others (Self-training, W-Transf. and XSTNet). When extra MT data are introduced, our method also outperforms SOTA by an average of 0.6 BLEU. Among the benchmark models, Chimera Han et al. (2021) shares our goal of closing the modality gap and constructs an extra fixed-length shared semantic space. However, the shared memory with fixed size actually compromises the MT performance, while our contrastive learning approach is more straightforward and effective.

Comparison with cascaded ST systems  We compare our method with several cascade baselines, where Ye et al. (2021) and Xu et al. (2021) provided two strong cascade systems trained using MuST-C and external ASR and MT data (LibriSpeech, WMT, and Opensubtitles). From Table 2, we find that as an end-to-end model, ConST can outperform these strong cascade models. In Section 7, we provide a case study to illustrate this improvement.

Models En-De En-Fr En-Ru
Espnet Inaguma et al. (2020) 23.6 33.8 16.4
Ye et al. (2021) 25.2 34.9 17.0
Xu et al. (2021) 28.1 - -
ConST 28.3 38.3 18.9
Table 2: ConST versus the cascaded ST systems on MuST-C En-De/Fr/Ru test sets. Ye et al. (2021) and Xu et al. (2021) are two strong cascaded models.

5 Analysis

5.1 Is contrastive loss effective?

With the same model architecture and the same pre-training + fine-tuning procedure, the main difference between ConST and XSTNet Ye et al. (2021) is whether the contrastive loss term is used during fine-tuning. Comparing the BLEU results in the w/o and w/ external MT data settings in Table 1, we find that ConST further improves the average BLEU score over the eight translation directions by 0.5 and 0.6 respectively, which demonstrates the effectiveness of cross-modal contrastive learning. By gradually removing each loss in Eq.(1), Table 3 shows the improvements brought by multi-task learning and by contrastive learning. For the En-De translation direction, contrastive learning brings an average improvement of 0.9 BLEU over the baseline model that only optimizes $\mathcal{L}_{\text{ST}}$ (corresponding to the last row of Table 3), and multi-task learning leads to a further improvement of about 1.2 BLEU on top of that.

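Config. | External MT: without | External MT: with
ConST | 25.7 | 28.3
$\mathcal{L}_{\text{ST}} + \lambda \mathcal{L}_{\text{CTR}}$ (multi-task losses removed) | 24.6 | 27.0
$\mathcal{L}_{\text{ST}}$ only | 23.6 | 26.3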
Table 3: BLEU scores on MuST-C En-De tst-COMMON set by removing individual losses. We test the results under settings with and without the introduction of external MT data.

5.2 Which layer to contrast on?

An intriguing question is which representations should be considered in the contrastive loss function. In the method part (Section 3.2), we use the averaged audio representation for speech (Eq.(2)) and the averaged lexical embedding for the transcript (Eq.(3)), denoted as low-level repr. Inspired by a recent study in multilingual MT (Pan et al., 2021), we also provide an alternative contrastive loss as a comparison, whose speech and text features are average-pooled semantic representations derived from the Transformer encoder, denoted as high-level repr.

Table 4 shows that contrastive learning using the low-level representations (Line 1) is better than using the high-level ones (Line 2). On the other hand, although the performance of Line 2 is relatively inferior, it still outperforms the multi-task model without the contrastive loss (Line 3). The detailed analysis of possible explanations will be shown in Section 6.2.

Representations BLEU ChrF++ BLEURT
low-level repr. 28.3* 53.2* 64.5
high-level repr. 27.5 52.6 63.6
w/o contrastive loss 27.1 52.1 62.4
Table 4: BLEU, ChrF++ and BLEURT (%) on En-De test set. Different representations are tested. *: ConST is significantly better than the other two baselines (). : the model is significantly better than the baseline model without contrastive loss ().

5.3 Is contrastive loss better than other losses?

Our goal in introducing the contrastive loss term (denoted as CTR Loss) is to close the distance between speech and text representations. However, there are other options to achieve this goal, such as L2 loss and CTC loss.

  • L2 Loss: Without introducing any negative samples, L2 loss directly reduces the Euclidean distance between the representations of two modalities by minimizing $\lVert u - v \rVert^2$, where $u$ and $v$ are the averaged speech and text representations from Section 3.2. L2 loss can be viewed as an implementation based on the idea of knowledge distillation Heo et al. (2019); Dong et al. (2021b).

  • CTC Loss: The connectionist temporal classification (CTC) loss Graves et al. (2006) is commonly used in speech-related tasks Xu et al. (2021); Dong et al. (2021b). Unlike contrastive loss that cares about the representation, CTC loss connects the two modalities by establishing speech-text alignment and maximizing $P(x \mid a) = \sum_{\pi \in \Pi} \prod_{t=1}^{T} p_t(\pi_t \mid a)$, where $a$ is the audio representation, $x$ is the transcript, and $\Pi$ is the set of all valid alignments.

Compared to the other two ways of bridging the modality gap, L2 and CTC loss, is the contrastive loss term better? The answer is yes according to the results in Table 5. Our explanation is that the contrastive loss benefits from the information in the negative samples: it pulls the speech closer to its corresponding transcription while pushing it farther from irrelevant text.
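The role of negative samples can be made concrete with a toy comparison. In the sketch below, the L2 loss depends only on the positive pair, whereas the contrastive loss grows when a negative sits close to the speech vector. The vectors and the temperature `tau=0.1` are made-up values for illustration.

```python
import math

def l2_loss(u, v):
    """Squared Euclidean distance between two pooled representations."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ctr_loss(u, v_pos, v_negs, tau=0.1):
    """Contrastive loss for one speech vector with one positive and some negatives."""
    logits = [cosine(u, v) / tau for v in [v_pos] + v_negs]
    m = max(logits)  # subtract the max for numerical stability
    return -(logits[0] - m - math.log(sum(math.exp(l - m) for l in logits)))

u = [1.0, 0.0]          # pooled speech representation (toy)
v = [0.9, 0.1]          # pooled transcript representation (toy)
easy_neg = [[-1.0, 0.0]]  # an unrelated transcript far from u
hard_neg = [[0.8, 0.3]]   # an unrelated transcript close to u

l2 = l2_loss(u, v)                       # identical for both negative sets
ctr_easy = ctr_loss(u, v, easy_neg)
ctr_hard = ctr_loss(u, v, hard_neg)
```

With the hard negative the contrastive loss is clearly larger, so the gradient pushes that transcript away from the speech vector; the L2 loss provides no such signal.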

Extra Loss BLEU ChrF++ BLEURT
CTR Loss 28.3* 53.2 64.5
CTC Loss 27.6 53.0 64.1
L2 Loss 27.3 52.4 63.0
- 27.1 52.1 62.4
Table 5: BLEU, ChrF++ and BLEURT (%) on En-De test set under different loss terms other than the basic multi-task NLL loss. *: ConST is significantly () better than the other three alternatives. : the improvement from CTC loss over the baseline without extra loss is significant ().

5.4 Analysis on the hard example mining strategies

Figure 4: The heat map visualization of the BLEU scores on En-De test set, with 15 combinations of the original contrastive loss (Original) and hard examples mining methods – word repetition (Rep), span-masked augmentation (SMA), sequence cut-off (SCut) and feature cut-off (FCut). * and ** mean the improvements over the XSTNet baseline without contrastive loss are statistically significant (*:, **:).

In Section 3.3, we proposed four methods to mine the hard examples for contrastive learning, namely span-masked augmentation (SMA), word repetition (Rep), sequence cut-off (SCut), and feature cut-off (FCut). In this section, we study how effective these methods are, and to do so, we consider the BLEU performances of their 15 combinations (Figure 4). Note that “Original” means the original contrastive loss in Eq.(4) without any additional hard examples mining operation, and the diagonal in the heat map represents only one strategy used. For an easy and fair comparison, we set the weight of the contrastive term to 1.0 uniformly. We have the following observations.

All the hard examples mining methods are effective.  All the BLEU scores in Figure 4 exceed the strong multi-task model trained without contrastive learning (27.1). Among all the strategies, the combination of the original and SCut reaches the best result, and is better than the model without any expanded operations. Generally, to find the best model, we suggest adopting multiple strategies and choosing the best checkpoint on the dev-set.

The combinations of the hard examples mining methods and the “original” have relatively better performances. We argue that we need the original positive and negative examples to give more accurate representations (without any dropout) for contrastive learning. On the contrary, without the help of “original” loss, the performance with both sequence cut-off and feature cut-off is the worst in Figure 4, probably because too much information is lost by superimposing the two.

Ref. src: Lights, sounds, solar panels, motors — everything should be accessible.
tgt: Lichter, Töne, Solarelemente, Motoren — alles sollte verfügbar sein.
Cascaded src: Lights sounds solar panels motors everything should be accessible.
tgt: Licht klingt Solarpaneele, Motoren; alles sollte zugänglich sein.
XSTNet tgt: Licht, Geräusche, Solarkollektoren, Motoren — alles sollte zugänglich sein.
ConST tgt: Licht, Geräusche, Solarpanele, Motoren, alles sollte zugänglich sein.
Ref. src: Eight years ago when I was at the Media Lab, I started exploring this idea of how to put the power of engineers in the hands of artists and designers.
tgt: Vor acht Jahren war ich am Media Lab und ich begann diese Idee zu erforschen, wie man die Macht der Ingenieure in die Hand von Künstlern und Designern legen könnte.
Cascaded src: Eight years ago when I was at the Media Lab, I started exploring this idea of how to put the power of engineers in the hands of artists and designers.
tgt: Vor 8 Jahren, als ich im Media Lab war, begann ich, diese Idee zu erforschen, wie man die Macht der Ingenieure in die Hände von Künstlern und Designern legte.
XSTNet tgt: Vor acht Jahren, als ich im Media Lab war, begann ich zu erforschen, wie man die Kraft der Ingenieure in die Hände von Künstlern und Designern legt.
ConST tgt: Vor acht Jahren, als ich im Media Lab war, begann ich, diese Idee zu erforschen, wie man die Macht von Ingenieuren in die Hände von Künstlern und Designern legt.
Table 6: En-De test cases generated by the cascaded model, XSTNet (both provided by Ye et al. (2021)) and our ConST model. The red underlined text indicates grammatically incorrect or inaccurate translations.

6 Why does cross-modal contrastive learning work? — Analysis on the Modality Gap

As mentioned earlier, the existing multi-task training models cannot address the speech-text modality gap. Does ConST reduce the representation gap between speech and text?

6.1 Visualization of Representation

(a) w/o contrastive loss
(b) ConST
Figure 5: Bivariate KDE contour plot of the representation of speech and transcript in source language English. T-SNE is used to reduce into 2D. The blue lines are the audio representations and the red dashed lines stand for text. (a) for the vanilla multi-task framework without any extra supervision. (b) for our proposed ConST model. Sentences are from En-De test set.

Does the speech-text modality gap exist without explicitly bridging the two? The speech-text modality gap means the discrepancy between the audio representations and the transcription sentence embeddings. To visualize it, we plot the bivariate kernel density estimation Parzen (1962) (KDE) contour of their dim-reduced features, where T-SNE (Van der Maaten and Hinton, 2008) is used to reduce the dimension to two (Figure 5). Ideally, if the representations of speech and its corresponding transcript are similar, their KDEs will be similar, and thus the contour lines will overlap as much as possible. However, Figure 5(a), the KDE contour of the multi-task framework without any explicit modeling to bring the two modalities together Ye et al. (2021), shows that the representations are so dissimilar that they naturally separate into two clusters, i.e. the speech-text modality gap exists.

Does ConST reduce the modality gap? As shown in Figure 5(b), compared to the baseline model without contrastive learning, ConST with cross-modal contrastive learning is able to bring representations of different modalities much closer. This means that the audio representation contains more linguistic information similar to that of the textual transcription, which is more advantageous for the downstream ST generation through the shared Transformer encoder and Transformer decoder.

6.2 Cross-modal Retrieval

How good is the cross-modal representation space learned by ConST? To answer this question, we conduct a retrieval experiment, i.e., finding the nearest transcript (the one with the highest cosine similarity) given the speech representation. We compare the ConST model with the baseline without cross-modal contrastive learning and report the top-1 retrieval accuracy using (1) the low-level representations and (2) the high-level semantic representations, in Table 7.

When retrieving the text using low-level representations, our method gains a substantial 79% increase over the baseline. In addition, we find that even without explicit contrastive modeling, the baseline already achieves high retrieval accuracy with the semantic representations output by the Transformer encoder (Table 7). We believe that such high accuracy is learned automatically from the triple-supervised data itself under the multi-task learning framework. With such a degree of cross-modal alignment, constructing the contrastive loss with semantic representations brings only limited gains to ST performance, which corroborates the findings in Section 5.2: low-level representations are preferred in cross-modal contrastive learning.

Representations CTR loss Acc.
low-level repr. 9.4
high-level repr. 94.7
Table 7: Cross-modal top-1 retrieval accuracy (%) on En-De test set. Two different representations are used, based on which, ConST achieves huge accuracy improvements.
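The retrieval protocol can be sketched as follows. This is a minimal illustration, assuming each utterance and each transcript has already been pooled into a fixed-size vector (the pooling choice and the toy vectors are assumptions, not taken from the paper) and that `speech_vecs[i]` and `text_vecs[i]` form a ground-truth pair. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_retrieval_accuracy(speech_vecs, text_vecs):
    """Percentage of utterances whose most similar transcript is the paired one."""
    hits = 0
    for i, speech in enumerate(speech_vecs):
        sims = [cosine_similarity(speech, text) for text in text_vecs]
        # Nearest transcript = highest cosine similarity.
        best = max(range(len(text_vecs)), key=lambda j: sims[j])
        if best == i:
            hits += 1
    return 100.0 * hits / len(speech_vecs)

# Toy embeddings: aligned pairs versus a deliberately mismatched pairing.
speech_vecs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
text_vecs = [[0.9, 0.0], [0.0, 1.1], [-0.8, 0.1]]
mismatched = [text_vecs[1], text_vecs[2], text_vecs[0]]

acc_aligned = top1_retrieval_accuracy(speech_vecs, text_vecs)
acc_mismatched = top1_retrieval_accuracy(speech_vecs, mismatched)
```

A well-aligned representation space ranks the paired transcript first for most utterances; a space with a large modality gap does not.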

7 Case Analysis

In this section, we analyze several cases generated by ConST. We compare our model with the cascaded model and the previous end-to-end model, XSTNet Ye et al. (2021).

In the first case, the cascaded system fails to give a correct translation due to a mis-punctuation issue (klingt is a verb), while the end-to-end models, XSTNet and ConST, translate correctly. In the second case, the previous end-to-end model XSTNet cannot accurately translate the phrase “started exploring this idea of” and performs worse than the cascaded system, whereas ConST successfully conveys the meaning of “this idea” and translates more accurately than XSTNet. We believe this improvement comes from the cross-modal contrastive learning.

8 Conclusion

In this paper, we propose ConST, a simple yet effective contrastive learning framework that bridges the speech-text representation gap and facilitates ST with limited data. We also provide feasible hard example mining methods to learn robust representations. The results on the MuST-C ST dataset demonstrate the effectiveness of the method.

9 Broader Impact

This work improves the performance of ST tasks on public datasets by learning speech representations that are more similar to text representations, but the model is still far from ready for industrial-grade deployment. In real scenarios, for example, the original voice is noisier and the distribution of speech lengths is more complex than in the public dataset, which cannot be handled by an end-to-end model alone. A shortcoming of this model is that it still needs a certain amount of labeled data for training, especially <speech,transcription> pairs to learn better speech representations. Most of the world's languages and dialects have no corresponding translations or even transcriptions, and our method does not work in such untranscribed scenarios. In this paper, we focus on the improvement brought by better speech representations on the ST task, and obtain good results with hundreds of hours of speech data. We hope that our work achieves better results using more data (e.g. raw speech, raw text, ASR, MT data) in the future.


  • A. Alinejad and A. Sarkar (2020) Effectively pretraining a speech translation decoder with machine translation data. In Proc. of EMNLP, pp. 8014–8020. Cited by: §2.
  • E. Ansari, A. Axelrod, N. Bach, O. Bojar, R. Cattoni, F. Dalvi, N. Durrani, M. Federico, C. Federmann, J. Gu, et al. (2020) Findings of the iwslt 2020 evaluation campaign. In Proc. of IWSLT, pp. 1–34. Cited by: §1.
  • A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §2, §3.1.
  • S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proc. of NAACL-HLT, pp. 58–68. Cited by: §2.
  • L. Bentivogli, M. Cettolo, M. Gaido, A. Karakanta, A. Martinelli, M. Negri, and M. Turchi (2021) Cascade versus direct speech translation: do the differences still make a difference?. In Proc. of ACL, Cited by: §1.
  • A. Berard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin (2018) End-to-end automatic speech translation of audiobooks. In Proc. of ICASSP, pp. 6224–6228. Cited by: §2.
  • A. Bérard, O. Pietquin, C. Servan, and L. Besacier (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. In NIPS workshop on End-to-end Learning for Speech and Audio Processing, Cited by: §2.
  • O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Névéol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri (2016) Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 131–198. Cited by: §4.1.
  • S. Cao and L. Wang (2021) CLIFF: contrastive learning for improving faithfulness and factuality in abstractive summarization. In Proc. of ACL, pp. 6633–6649. Cited by: §2.
  • J. Chen, M. Ma, R. Zheng, and L. Huang (2021) SpecRec: an alternative solution for improving end-to-end speech-to-text translation via spectrogram reconstruction. Proc. of InterSpeech, pp. 2232–2236. Cited by: §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. In Proc. of ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 1597–1607. Cited by: §2.
  • Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020b) Uniter: universal image-text representation learning. In Proc. of ECCV, pp. 104–120. Cited by: §2.
  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In Proc. of CVPR, Vol. 1, pp. 539–546. Cited by: §2.
  • M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi (2019) MuST-C: a Multilingual Speech Translation Corpus. In Proc. of NAACL-HLT, pp. 2012–2017. Cited by: §4.1.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Proc. of NeurIPS, pp. 13063–13075. Cited by: §1, §2.
  • L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proc. of ICASSP, pp. 5884–5888. Cited by: §2.
  • Q. Dong, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021a) Consecutive decoding for speech-to-text translation. In Proc. of AAAI, Cited by: §1, §2.
  • Q. Dong, R. Ye, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021b) Listen, understand and translate: triple supervision decouples end-to-end speech-to-text translation. In Proc. of AAAI, Vol. 35, pp. 12749–12759. Cited by: §1, §2, 1st item, 2nd item.
  • H. Fang, S. Wang, M. Zhou, J. Ding, and P. Xie (2020) Cert: contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766. Cited by: §2.
  • Q. Fang, R. Ye, L. Li, Y. Feng, and M. Wang (2022) STEMM: self-learning with speech-text manifold mixup for speech translation. In Proc. of ACL, Cited by: 4th item, Table 1.
  • T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In Proc. of EMNLP, Cited by: §2.
  • A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML, W. W. Cohen and A. W. Moore (Eds.), Vol. 148, pp. 369–376. Cited by: 2nd item.
  • K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2016) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §2.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. (2020) Bootstrap your own latent-a new approach to self-supervised learning. Proc. of NeurIPS 33, pp. 21271–21284. Cited by: §2.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297–304. Cited by: §2.
  • C. Han, M. Wang, H. Ji, and L. Li (2021) Learning shared semantic space for speech-to-text translation. In Proc. of ACL - Findings, Cited by: 2nd item, §1, §2, §4.2, Table 1.
  • B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi (2019) A comprehensive overhaul of feature distillation. In Proc. of the ICCV, pp. 1921–1930. Cited by: 1st item.
  • Y. Hu, S. Settle, and K. Livescu (2020) Multilingual jointly trained acoustic and written word embeddings. In Proc. of INTERSPEECH, Cited by: §2.
  • H. Inaguma, T. Kawahara, and S. Watanabe (2021) Source and target bidirectional knowledge distillation for end-to-end speech translation. In Proc. of NAACL, pp. 1872–1881. Cited by: Appendix B, Table 1.
  • H. Inaguma, S. Kiyono, K. Duh, S. Karita, N. Yalta, T. Hayashi, and S. Watanabe (2020) ESPnet-ST: all-in-one speech translation toolkit. In Proc. of ACL, pp. 302–311. Cited by: Appendix B, §2, Table 1, Table 2.
  • S. Indurthi, H. Han, N. K. Lakumarapu, B. Lee, I. Chung, S. Kim, and C. Kim (2020) Data efficient direct speech-to-text translation with modality agnostic meta-learning. In Proc. of ICASSP, Cited by: §3.1.
  • S. Indurthi, M. A. Zaidi, N. K. Lakumarapu, B. Lee, H. Han, S. Ahn, S. Kim, C. Kim, and I. Hwang (2021) Task aware multi-task learning for speech to text tasks. In Proc. of ICASSP, pp. 7723–7727. Cited by: Appendix B, Table 1.
  • H. Kamper, Y. Matusevych, and S. Goldwater (2020) Multilingual acoustic word embedding models for processing zero-resource languages. In Proc. of ICASSP, pp. 6414–6418. Cited by: §2.
  • T. Kano, S. Sakti, and S. Nakamura (2017) Structured-based curriculum learning for end-to-end english-japanese speech translation. In Proc. of INTERSPEECH, pp. 2630–2634. Cited by: §2.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. Proc. of NeurIPS 33. Cited by: §2.
  • P. Koehn et al. (2005) Europarl: a parallel corpus for statistical machine translation. In MT summit, Vol. 5, pp. 79–86. Cited by: Table 1.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. of EMNLP, pp. 66–71. Cited by: §4.1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. In Proc. of NeurIPS, Cited by: §1.
  • H. Le, J. Pino, C. Wang, J. Gu, D. Schwab, and L. Besacier (2020) Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proc. of COLING, pp. 3520–3533. Cited by: Appendix B, §2, Table 1.
  • H. Le, J. Pino, C. Wang, J. Gu, D. Schwab, and L. Besacier (2021) Lightweight adapter tuning for multilingual speech translation. In Proc. of ACL, Cited by: Appendix B, Table 1.
  • W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang (2021) UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Proc. of ACL, Cited by: §1, §2, §2.
  • P. Lison and J. Tiedemann (2016) OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In Proc. of LREC, pp. 923–929. Cited by: Table 1.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020a) Multilingual denoising pre-training for neural machine translation. TACL 8, pp. 726–742. Cited by: Table 1.
  • Y. Liu, J. Zhu, J. Zhang, and C. Zong (2020b) Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920. Cited by: Appendix B.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Proc. of NeurIPS 32. Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proc. of NAACL - Demonstrations, pp. 48–53. Cited by: Appendix B.
  • S. Palaskar, V. Raunak, and F. Metze (2019) Learned in speech recognition: contextual acoustic word embeddings. In Proc. of ICASSP, pp. 6530–6534. Cited by: §2.
  • X. Pan, L. Wu, M. Wang, and L. Li (2021) Contrastive learning for many-to-many multilingual neural machine translation. In Proc. of ACL, Cited by: §1, §2, §5.2.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. of ICASSP, pp. 5206–5210. Cited by: §4.1.
  • S. Papi, M. Gaido, M. Negri, and M. Turchi (2021) Speechformer: reducing information loss in direct speech translation. In Proc. of EMNLP, Cited by: Appendix B, Table 1.
  • D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple augmentation method for automatic speech recognition. In Proc. of INTERSPEECH, Cited by: §2.
  • E. Parzen (1962) On estimation of a probability density function and mode. The annals of mathematical statistics 33 (3), pp. 1065–1076. Cited by: §6.1.
  • J. Pino, Q. Xu, X. Ma, M. J. Dousti, and Y. Tang (2020) Self-training for end-to-end speech translation. In Proc. of INTERSPEECH, pp. 1476–1480. Cited by: Appendix B, §2, Table 1.
  • M. Popović (2017) ChrF++: words helping character n-grams. In Proceedings of the second conference on machine translation, pp. 612–618. Cited by: §4.1.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. Cited by: §4.1.
  • T. Potapczyk and P. Przybysz (2020) SRPOL’s system for the iwslt 2020 end-to-end speech translation task. In Proc. of IWSLT, pp. 89–94. Cited by: §1.
  • M. Ravanelli and Y. Bengio (2018) Learning speaker representations with mutual information. arXiv preprint arXiv:1812.00271. Cited by: §2.
  • S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Proc. of INTERSPEECH, Cited by: §2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proc. of CVPR, pp. 815–823. Cited by: §2.
  • T. Sellam, D. Das, and A. P. Parikh (2020) BLEURT: learning robust metrics for text generation. In Proc. of ACL, Cited by: footnote 5.
  • D. Shen, M. Zheng, Y. Shen, Y. Qu, and W. Chen (2020) A simple but tough-to-beat data augmentation approach for natural language understanding and generation. arXiv preprint arXiv:2009.13818. Cited by: §2, §3.3.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Proc. of NeurIPS, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, pp. 1857–1865. Cited by: §2, §3.2.
  • Y. Tang, H. Gong, N. Dong, C. Wang, W. Hsu, J. Gu, A. Baevski, X. Li, A. Mohamed, M. Auli, et al. (2022) Unified speech-text pre-training for speech translation and recognition. In Proc. of ACL, Cited by: Appendix B, Table 1.
  • Y. Tang, J. Pino, X. Li, C. Wang, and D. Genzel (2021a) Improving speech translation by understanding and learning from the auxiliary text translation task. In Proc. of ACL, Cited by: Appendix B, Table 1.
  • Y. Tang, J. Pino, C. Wang, X. Ma, and D. Genzel (2021b) A general multi-task learning framework to leverage text data for speech to text tasks. In Proc. of ICASSP, pp. 6209–6213. Cited by: Appendix B, §1, §2, §3.1, Table 1.
  • A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis (2018) Parallel wavenet: fast high-fidelity speech synthesis. In Proc. of ICML, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3915–3923. Cited by: §2.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (11). Cited by: §6.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. of NeurIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §3.1.
  • C. Wang, Y. Tang, X. Ma, A. Wu, D. Okhonko, and J. Pino (2020a) Fairseq s2t: fast speech-to-text modeling with fairseq. In Proc. of AACL, pp. 33–39. Cited by: Appendix B, §2, Table 1.
  • C. Wang, A. Wu, J. Pino, A. Baevski, M. Auli, and A. Conneau (2021) Large-scale self-and semi-supervised learning for speech translation. In Proc. of INTERSPEECH, Cited by: §2.
  • C. Wang, Y. Wu, S. Liu, M. Zhou, and Z. Yang (2020b) Curriculum pre-training for end-to-end speech translation. In Proc. of ACL, pp. 3728–3738. Cited by: §2.
  • D. Wang, J. Chen, H. Zhou, X. Qiu, and L. Li (2021) Contrastive aligned joint learning for multilingual summarization. In Proc. of ACL - Findings, Cited by: §2.
  • R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen (2017) Sequence-to-sequence models can directly translate foreign speech. In Proc. of INTERSPEECH, pp. 2625–2629. Cited by: §2.
  • A. Wu, C. Wang, J. Pino, and J. Gu (2020) Self-supervised representations improve end-to-end speech translation. In Proc. of INTERSPEECH, Cited by: §2.
  • H. Wu, J. Mao, Y. Zhang, W. Sun, Y. Jiang, L. Li, and W. Ma (2019) Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In Proc. of CVPR, Cited by: §2.
  • X. Wu, C. Gao, L. Zang, J. Han, Z. Wang, and S. Hu (2021) ESimCSE: enhanced sample building method for contrastive learning of unsupervised sentence embedding. arXiv preprint arXiv:2109.04380. Cited by: §2.
  • C. Xu, B. Hu, Y. Li, Y. Zhang, Q. Ju, T. Xiao, J. Zhu, et al. (2021) Stacked acoustic-and-textual encoding: integrating the pre-trained models into speech translation encoders. In Proc. of ACL, Cited by: Appendix B, §1, §2, §4.2, Table 1, Table 2, 2nd item.
  • Y. Yan, R. Li, S. Wang, F. Zhang, W. Wu, and W. Xu (2021) ConSERT: a contrastive framework for self-supervised sentence representation transfer. In Proc. of ACL, Cited by: §2, §3.3.
  • R. Ye, M. Wang, and L. Li (2021) End-to-end speech translation via cross-modal progressive training. In Proc. of INTERSPEECH, Cited by: 1st item, 3rd item, §1, §2, §3.1, §4.1, §4.2, Table 1, Table 2, §5.1, Table 6, §6.1, §7.
  • B. Zhang, P. Williams, I. Titov, and R. Sennrich (2020) Improving massively multilingual neural machine translation and zero-shot translation. In Proc. of ACL, pp. 1628–1639. Cited by: §4.1.
  • C. Zhao, M. Wang, Q. Dong, R. Ye, and L. Li (2021) NeurST: neural speech translation toolkit. In Proc. of ACL - System Demonstrations, Cited by: Appendix B, §2, Table 1.
  • J. Zhao, W. Luo, B. Chen, and A. Gilman (2021) Mutual-learning improves end-to-end speech translation. In Proc. of the EMNLP, pp. 3989–3994. Cited by: Appendix B, Table 1.
  • R. Zheng, J. Chen, M. Ma, and L. Huang (2021) Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In Proc. of ICML, External Links: 2102.05766 Cited by: Appendix B, §1, §2, Table 1.
  • L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao (2020) Unified vision-language pre-training for image captioning and vqa. In Proc. of AAAI, Vol. 34, pp. 13041–13049. Cited by: §1, §2.

Appendix A Statistics of all datasets

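En→   ST hours   ST #sents   MT corpus   MT #sents
De    408        234K        WMT16       4.6M
Es    504        270K        WMT13       15.2M
Fr    492        292K        WMT14       40.8M
It    465        258K        OPUS100     1.0M
Nl    442        253K        OPUS100     1.0M
Pt    385        211K        OPUS100     1.0M
Ro    432        240K        WMT16       0.6M
Ru    489        270K        WMT16       2.5M
Table 8: Statistics of all datasets. For each target language, the ST columns give the hours of speech and number of sentences of the ST data, and the MT columns give the name and number of sentences of the external MT corpus.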

Appendix B Experimental Details

Training and Implementation Details  We use Adam optimizer () with learning rate and warmup 25k steps during the ST training. We also implement the expanded setting with the introduction of external WMT to train the Transformer module. In the pre-training stage, we set the learning rate and warmup 4000 steps. For robust training, we set label smoothing to , and dropout rate to . The hyper-parameters for different data augmentation methods are as follows: for masked audio span strategy, we set masking probability and masking span length frames; for both sequence and feature cut-off, we set the cut-off dropout rate as . We save the checkpoint with the best BLEU on dev-set and average the last 10 checkpoints. For decoding, we use a beam size of 10 and length penalty for German, for French, and for Russian. We train the models in 8 Nvidia Tesla V100 GPUs for each experiment. We use Fairseq Ott et al. (2019) as the code-base for our implementation.
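The checkpoint-averaging step mentioned above can be sketched as follows. This is a framework-agnostic illustration, assuming each checkpoint is a dictionary that maps parameter names to flat lists of floats. The helper names `average_checkpoints` and `average_last_k` are hypothetical and are not the Fairseq implementation.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameter dicts of the form {name: [floats]}."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged

def average_last_k(checkpoints, k=10):
    """Average only the k most recent checkpoints (ordered oldest to newest)."""
    return average_checkpoints(checkpoints[-k:])

# Toy run: 12 saved checkpoints of a single two-element parameter "w".
saved = [{"w": [float(step), 2.0 * step]} for step in range(12)]
final = average_last_k(saved, k=10)  # averages steps 2..11
```

Averaging the last few checkpoints is a common way to smooth out noise between late training steps; in a real setup the same element-wise mean would be applied to every tensor in the model's state.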

Baseline Models  In Table 1, we compared our method with end-to-end baseline models whose audio inputs are 80-channel log Mel-filter bank, including: FairseqST Wang et al. (2020a), NeurST Zhao et al. (2021), Espnet ST Inaguma et al. (2020), Dual-decoder Transformer Le et al. (2020), SATE Xu et al. (2021), Speechformer Papi et al. (2021), self training Pino et al. (2020) and mutual learning Zhao et al. (2021) method, STAST Liu et al. (2020b), bi-KD Inaguma et al. (2021), MLT method Tang et al. (2021b), Lightweight Adaptor Le et al. (2021), JT-S-MT Tang et al. (2021a), FAT-ST Zheng et al. (2021), TaskAware Indurthi et al. (2021), and STPT Tang et al. (2022). We also compare our method to baseline models that have pretrained Wav2vec2.0 as a module, including:

  • W-Transf. Ye et al. (2021): the model has the same structure as ours, but is only trained on <speech, translation> parallel data.

  • Chimera-ST Han et al. (2021): the model that builds a shared semantic memory for both audio and text modalities.

  • XSTNet Ye et al. (2021): the model has the same structure as ours, and adopted a multi-task fine-tuning strategy.

  • STEMM Fang et al. (2022): the model that bridges the modality representation gap by minimizing the Jensen–Shannon divergence between the original speech representation and the manifold mix-up representation.

Appendix C The Choice for Hyper-parameters

Influence of Temperature  In the contrastive loss, the temperature hyper-parameter is provided to control the smoothness of the distribution normalized by softmax operation. A high temperature helps to smooth the distribution, making it more difficult for the model to distinguish between positive and negative samples (corresponding to correct transcriptions and other transcriptions in this work), while the low temperature behaves just the opposite. We choose several temperature hyper-parameters ranging from to , and Figure 6 shows their BLEUs on the test and dev sets . We find that (1) the choice of the temperature does not drastically affect the final BLEU score, and (2) we recommend that the temperature be set between 0.02 and 0.05 to ensure a relatively good ST performance. In the experiment, we use .
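The smoothing effect of the temperature can be seen in a small sketch. Since Eq.(4) is not reproduced in this appendix, the code assumes a standard InfoNCE-style form: similarities are divided by the temperature, normalized with softmax, and the loss is the negative log-probability of the positive pair. The similarity values and function names are invented for illustration.

```python
import math

def softmax_with_temperature(similarities, temperature):
    """Softmax over similarity scores scaled by 1 / temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(similarities, positive_index, temperature):
    """Negative log-probability of the positive sample (assumed InfoNCE form)."""
    probs = softmax_with_temperature(similarities, temperature)
    return -math.log(probs[positive_index])

# Toy cosine similarities: index 0 is the correct transcript,
# the others are transcripts of different utterances.
sims = [0.9, 0.7, 0.2]

sharp = softmax_with_temperature(sims, temperature=0.02)
smooth = softmax_with_temperature(sims, temperature=1.0)
loss_sharp = contrastive_loss(sims, 0, temperature=0.02)
loss_smooth = contrastive_loss(sims, 0, temperature=1.0)
```

With the low temperature almost all probability mass falls on the positive sample, whereas with the high temperature the distribution is much flatter and the positive is harder to separate from the negatives.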

Figure 6: En-De BLEU scores on the tst-COMMON and dev sets. The x-axis shows the choice of temperature in Eq.(4).

Influence of Contrastive Loss Weight  The total loss we optimize, Eq.(1), is a linear combination of the multi-task cross-entropy losses and the contrastive term. To investigate how much the contrastive term affects BLEU, we fix its temperature, vary its loss weight from 0.1 to 2.0, run three experiments for each value, and report the average BLEU on the En-De tst-COMMON set. Figure 7 depicts the results. First, all objective functions containing the contrastive term, regardless of its weight, are clearly better than the baseline model trained with only the multi-task cross-entropy losses. Second, the best BLEU score is achieved at the loss weight that corresponds to the results in Table 1. When analyzing the effect of data augmentation strategies (Section 5.4), we need to consider combinations of strategies, which is more complicated, so we use a single loss weight uniformly for simplicity. In general, we recommend that the weight hyper-parameter takes a value between and .

Figure 7: En-De BLEU scores on the tst-COMMON and dev sets. The x-axis is the weight of the contrastive loss term in Eq.(1). Experiments are performed under a fixed temperature hyper-parameter.

Appendix D Data Scale for Fine-tuning

The experiments in the main paper show that our model performs well when external MT data are introduced for pre-training. Here, we simulate a scenario with plenty of MT and speech data but limited triple-labeled ST data, and ask whether ConST is capable of low-resource learning. In the experiment, we reduce the labeled ST data to 1, 10, and 100 hours, corresponding to about 500, 5k, and 50k sentences. For a fair comparison, we use the same MT pre-trained Transformer module as in the main paper. First, we find the contrastive loss particularly helpful when the amount of speech data is extremely small, such as only 1 hour of speech. Second, the multi-task training strategy is also very effective in improving the robustness of the model performance. We also find that, by using easily accessible MT and speech pre-training, our model can reach the results of previous baselines trained without pre-training while using only a fraction of the original labeled ST data.

Figure 8: En-De BLEU scores on tst-COMMON sets. The horizontal axis is the amount of ST data (in hours of speech).