The development of benchmark datasets such as MuST-C Di Gangi et al. (2019), Europarl-ST Iranzo-Sánchez et al. (2020) and CoVoST Wang et al. (2020) has greatly contributed to the increasing popularity of speech-to-text translation (ST) as a research topic. MuST-C provides translations of TED talks from English into 8 European languages, with per-language data ranging from 385 to 504 hours, thereby encouraging research into end-to-end ST Berard et al. (2016) as well as one-to-many multilingual ST Di Gangi et al. (2019). Europarl-ST offers translations between 6 European languages, for a total of 30 translation directions, enabling research into many-to-many multilingual ST Inaguma et al. (2019). These two corpora involve European languages that are in general high-resource from the perspective of machine translation (MT) and speech processing. CoVoST is a diverse multilingual ST corpus from 11 languages into English, based on the Common Voice project Ardila et al. (2020). Unlike the previous corpora, it includes low-resource languages such as Mongolian, and it enables many-to-one ST research. Nevertheless, all of the corpora described so far cover a limited number of languages.
In this paper, we describe CoVoST 2, an extension of CoVoST (Wang et al., 2020) that provides translations from English (En) into 15 languages: Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr) and Chinese (Zh), as well as from 21 languages into English: the 15 languages above plus Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt) and Russian (Ru). The overall speech duration is extended from 700 hours to 2,880 hours, and the total number of speakers from 11K to 78K. We make the data available at https://github.com/facebookresearch/covost under a CC0 license.
Table 1: Basic statistics of CoVoST 2: speech hours, speaker counts and source/target token counts, broken down into train, dev and test sets. Numbers in parentheses are for the extended CoVoST splits (§2.2); the others are for the original Common Voice splits. All rows except the last are X→En directions; the last row (En) covers En→X.

| | Hours: train | dev | test | Speakers: train | dev | test | Src./Tgt. tokens: train | dev | test |
|---|---|---|---|---|---|---|---|---|---|
| Fr | 180 (264) | 22 (23) | 23 (24) | 2K (2K) | 2K (2K) | 4K (4K) | 2M/2M | 145K/143K | 143K/141K |
| De | 119 (184) | 21 (23) | 22 (120) | 1K (1K) | 1K (1K) | 4K (5K) | 1M/1M | 137K/152K | 777K/844K |
| Es | 97 (113) | 22 (22) | 23 (23) | 1K (1K) | 2K (2K) | 4K (4K) | 747K/751K | 131K/134K | 131K/132K |
| Ca | 81 (136) | 19 (21) | 20 (25) | 557 (557) | 722 (722) | 2K (2K) | 939K/972K | 142K/148K | 160K/169K |
| It | 28 (44) | 14 (15) | 15 (15) | 236 (236) | 640 (640) | 2K (2K) | 307K/329K | 89K/95K | 88K/93K |
| Ru | 16 (18) | 10 (15) | 11 (14) | 8 (8) | 30 (30) | 417 (417) | 118K/144K | 89K/109K | 81K/100K |
| Zh | 10 (10) | 8 (8) | 8 (8) | 22 (22) | 83 (83) | 784 (784) | 131K/85K | 91K/60K | 88K/57K |
| Pt | 7 (10) | 4 (5) | 5 (6) | 2 (2) | 16 (16) | 301 (301) | 67K/68K | 27K/28K | 34K/34K |
| Fa | 5 (49) | 5 (11) | 5 (40) | 532 (545) | 854 (908) | 1K (1K) | 307K/313K | 67K/73K | 244K/271K |
| Et | 3 (3) | 3 (3) | 3 (3) | 20 (20) | 74 (74) | 135 (135) | 23K/32K | 19K/27K | 20K/27K |
| Mn | 3 (3) | 3 (3) | 3 (3) | 4 (4) | 24 (24) | 209 (209) | 20K/23K | 19K/22K | 18K/20K |
| Nl | 2 (7) | 2 (3) | 2 (3) | 74 (74) | 144 (144) | 379 (383) | 58K/59K | 19K/19K | 20K/20K |
| Tr | 2 (4) | 2 (2) | 2 (2) | 34 (34) | 76 (76) | 324 (324) | 24K/33K | 11K/16K | 11K/15K |
| Ar | 2 (2) | 2 (2) | 2 (2) | 6 (6) | 13 (13) | 113 (113) | 10K/13K | 9K/11K | 8K/10K |
| Sv | 2 (2) | 1 (1) | 2 (2) | 4 (4) | 7 (7) | 83 (83) | 12K/12K | 8K/9K | 9K/10K |
| Lv | 2 (2) | 1 (1) | 2 (2) | 2 (2) | 3 (3) | 54 (54) | 11K/14K | 6K/7K | 8K/10K |
| Sl | 2 (2) | 1 (1) | 1 (1) | 2 (2) | 1 (1) | 28 (28) | 11K/13K | 3K/4K | 2K/2K |
| Ta | 2 (2) | 1 (1) | 1 (1) | 3 (3) | 2 (2) | 48 (48) | 6K/10K | 2K/3K | 3K/5K |
| Ja | 1 (1) | 1 (1) | 1 (1) | 2 (2) | 3 (3) | 37 (37) | 20K/9K | 12K/5K | 12K/6K |
| Id | 1 (1) | 1 (1) | 1 (1) | 2 (2) | 5 (5) | 44 (44) | 7K/8K | 5K/5K | 5K/6K |
| Cy | 1 (2) | 1 (12) | 1 (16) | 135 (135) | 234 (371) | 275 (597) | 11K/10K | 79K/76K | 110K/103K |
| En | 364 (430) | 26 (27) | 25 (472) | 10K (10K) | 4K (4K) | 9K (29K) | 3M/3M | 156K/171K | 4M/4M |
2 Dataset Creation
2.1 Data Collection and Quality Control
Translations are collected from professional translators in the same way as for CoVoST. We then conduct sanity checks based on language model perplexity, LASER scores Artetxe and Schwenk (2019) and a length-ratio heuristic in order to ensure translation quality. The length-ratio and LASER score checks are conducted as in the original version of CoVoST. For the language model perplexity checks, 20M lines are sampled from the OSCAR corpus Ortiz Suárez et al. (2020) for each CoVoST 2 language, except for English and Russian, for which pre-trained language models Ng et al. (2019) are used (https://github.com/pytorch/fairseq/tree/master/examples/language_model). 5K lines are reserved for validation and the rest for training. BPE vocabularies of size 20K are then built on the training data, with character coverage 0.9995 for Japanese and Chinese and 1.0 for the other languages. A Transformer base model (Vaswani et al., 2017) is then trained for up to 800K updates. Professional translations are ranked by perplexity, and the most suspicious ones (with the highest perplexity) are manually examined and sent for re-translation as appropriate. In the data release, we mark the sentences that cannot be translated properly (they are mostly extracted from articles without context, which lack the clarity needed for an appropriate translation).
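The length-ratio and perplexity checks can be sketched as follows. The ratio bounds are illustrative rather than CoVoST's actual thresholds, and the add-one-smoothed character unigram model is a toy stand-in for the Transformer language models described above:

```python
import math
from collections import Counter

def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    """Flag pairs whose target/source character-length ratio is implausible.
    The bounds here are illustrative, not the paper's actual thresholds."""
    return low <= len(tgt) / max(len(src), 1) <= high

def char_unigram_ppl(sentence: str, counts: Counter, total: int) -> float:
    """Perplexity under an add-one-smoothed character unigram model; a toy
    stand-in for a trained Transformer language model."""
    vocab = len(counts) + 1
    logp = sum(math.log((counts[ch] + 1) / (total + vocab)) for ch in sentence)
    return math.exp(-logp / max(len(sentence), 1))
```

Translations would then be ranked by perplexity, with the most suspicious (highest-perplexity) ones routed to manual review.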
2.2 Dataset Splitting
The original Common Voice (CV) dataset splits use only one sample per sentence, while the raw dataset potentially contains multiple samples (speakers) for the same sentence. To increase data utilization and speaker diversity, we add part of the discarded samples back, while keeping the speaker sets disjoint and the sentence-to-split assignment unchanged. We refer to this extension as the CoVoST splits. As a result, data utilization increases from 44.2% (1,273 hours) to 78.8% (2,270 hours). By default, we use the CoVoST train split for model training and the CV dev (test) split for evaluation. The complementary CoVoST dev (test) split is useful for multi-speaker evaluation (Wang et al., 2020) to analyze model robustness, but the large number of repeated sentences (e.g. in English and German) may skew the overall BLEU (WER) scores.
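The split-extension idea can be sketched as follows. This is a simplified illustration of the constraints (fixed sentence assignment, disjoint speaker sets), not the released preparation tooling:

```python
def extend_splits(samples, sentence_split):
    """Add multi-speaker samples back into CV-style splits: every sample of a
    sentence goes to that sentence's split, a speaker is locked to the first
    split they appear in, and conflicting samples are dropped. A simplified
    sketch of the CoVoST-split idea, not the released tooling."""
    speaker_split = {}
    splits = {"train": [], "dev": [], "test": []}
    for sent, spk in samples:
        split = sentence_split[sent]               # sentence assignment is fixed
        locked = speaker_split.setdefault(spk, split)
        if locked == split:                        # keep speaker sets disjoint
            splits[split].append((sent, spk))
    return splits
```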
Basic statistics of CoVoST 2 are listed in Table 1, including speech duration, speaker counts and token counts for both transcripts and translations. As the table shows, CoVoST 2 is diverse, with large speaker sets even for some of the low-resource languages (e.g. Persian, Welsh and Dutch). Moreover, the speakers are distributed widely across 66 accent groups, 8 age groups and 3 gender groups.
3 Models

Our speech recognition (ASR) and ST models share the same BLSTM-based encoder-decoder architecture Bérard et al. (2018), which is similar to the Listen, Attend and Spell (LAS) architecture (Chan et al., 2016; Chiu et al., 2017; Park et al., 2019). Specifically, on the encoder side, audio features are first fed into a two-layer DNN. Two strided 2D convolutional layers are then applied to reduce the sequence length; both have 16 output channels, and their outputs are flattened and projected. Finally, the features are passed through a stack of bidirectional LSTM layers to form the encoder output states. On the decoder side, a stack of LSTM layers with additive attention (Bahdanau et al., 2014) is applied, followed by a linear projection to the output vocabulary size. In the multilingual settings (En→All and All→All), we follow Inaguma et al. (2019) in forcing decoding into a given language by using a target language ID as the first token.
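The sequence-length reduction from the strided convolutions follows the standard output-length formula; kernel size 3 and stride 2 below are illustrative assumptions, not the paper's reported settings:

```python
def conv_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard output-length formula for one axis of a strided convolution."""
    return (length + 2 * padding - kernel) // stride + 1

# With two stride-2 layers, the time axis shrinks roughly 4x:
# 1000 frames -> 499 -> 249 (assuming kernel 3, stride 2, no padding).
```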
For MT, we use a Transformer base architecture Vaswani et al. (2017) with 0.3 dropout and embeddings shared across encoder inputs, decoder inputs and decoder outputs. For multilingual models, encoders and decoders are shared across languages, as preliminary experiments showed this approach to be competitive.
4 Experiments

We provide MT, cascaded ST and end-to-end ST baselines under bilingual as well as multilingual settings: All→En (A2E), En→All (E2A) and All→All (A2A). Similarly, for ASR we provide both monolingual and multilingual baselines.
4.1 Experimental Settings
For all texts, we normalize the punctuation and build vocabularies with SentencePiece Kudo and Richardson (2018) without pre-tokenization. For ASR and ST, character vocabularies with 100% coverage are used. For bilingual MT models, BPE Sennrich et al. (2016) vocabularies of size 5K are learned jointly on both transcripts and translations. For multilingual MT models, BPE vocabularies of size 40K are created jointly on all available source and target text. For MT on a language pair X→Y, we also contrast using only X→Y training data with using both X→Y and Y→X training data. The latter setting is referred to as +Rev subsequently.
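The BPE procedure underlying these vocabularies can be illustrated with a minimal merge learner in the spirit of Sennrich et al. (2016); this is a didactic sketch, not the SentencePiece implementation actually used here:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a word list: repeatedly merge the most
    frequent adjacent symbol pair. Didactic sketch, not SentencePiece."""
    # Represent each word as a tuple of symbols with an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```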
We extract 80-channel log-mel filterbank features (25ms window size and 10ms shift) using Kaldi Povey et al. (2011), with per-utterance cepstral mean and variance normalization applied. For GPU memory efficiency, we remove training samples with more than 3,000 frames or more than 512 characters.
For ASR and ST, the architecture hyperparameters described in §3 (DNN, convolution and LSTM dimensions) are configured separately for bilingual and multilingual models. We adopt SpecAugment Park et al. (2019) (LB policy, without time warping) to alleviate overfitting. To accelerate model training, we pre-train all non-English ASR models and all ST models with the encoder of the English ASR model. For MT, we likewise use separate configurations for bilingual and multilingual models. All models are implemented in Fairseq Ott et al. (2019).
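The time-masking part of SpecAugment (no time warping is used here) can be sketched as below; the single-mask design and the `max_width` parameter are illustrative simplifications, not the exact LB-policy settings:

```python
import random

def time_mask(features, max_width, seed=0):
    """Zero out one random span of frames, in the spirit of SpecAugment's
    time masking. `features` is a list of per-frame feature vectors and
    `max_width` caps the masked span. Illustrative single-mask version."""
    rng = random.Random(seed)
    n = len(features)
    width = rng.randint(0, min(max_width, n))
    start = rng.randint(0, n - width)
    masked = [frame[:] for frame in features]  # copy; leave the input intact
    for t in range(start, start + width):
        masked[t] = [0.0] * len(masked[t])
    return masked
```

Frequency masking is analogous along the feature axis.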
We use a beam size of 5 and length penalty 1 for all models. We use the best checkpoint by validation loss for MT, and average the last 5 checkpoints for ASR and ST. For MT and ST, we report case-sensitive detokenized BLEU Papineni et al. (2002) using sacreBLEU Post (2018) with default options, except for English-Chinese and English-Japanese, where we report character-level BLEU. For ASR, we report character error rate (CER) for Japanese and Chinese (which have no word segmentation) and word error rate (WER) for the other languages, using VizSeq (Wang et al., 2019). Before calculating WER (CER), sentences are tokenized with the sacreBLEU tokenizers, lowercased, and stripped of punctuation (except for apostrophes and hyphens).
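The WER metric itself is the word-level Levenshtein distance normalized by reference length; a minimal sketch (not the VizSeq implementation, and applied after the tokenization and lowercasing described above) is:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER: Levenshtein distance over word sequences divided by reference
    length. Sketch of the metric, not the VizSeq implementation; for CER,
    compare character lists instead of word lists."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, sub)
    return dp[-1][-1] / max(len(r), 1)
```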
4.2 Monolingual and Bilingual Baselines
We report monolingual ASR baselines alongside bilingual MT, cascaded ST (C-ST) and end-to-end ST baselines. As expected, transcription and translation quality depends heavily on the amount of training data per language pair. The poor results obtained on low-resource pairs can be improved by leveraging training data from the opposite direction (+Rev) for MT and C-ST. These results serve as baselines for the research community to improve upon, with methods such as multilingual training, self-supervised pre-training and semi-supervised learning.
4.3 Multilingual Baselines
5 Conclusion

We introduced CoVoST 2, the largest speech-to-text translation corpus to date in terms of language coverage and total volume, covering 21 languages into English and English into 15 languages. We also provided extensive monolingual, bilingual and multilingual baselines for ASR, MT and ST. CoVoST 2 is free to use under a CC0 license and enables the research community to develop methods including, but not limited to, massively multilingual modeling, ST modeling for low-resource languages, self-supervision for multilingual ST, and semi-supervised modeling for multilingual ST.
- Common Voice: a massively multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4218–4222. Cited by: §1.
- Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3197–3203. Cited by: §2.1.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.
- End-to-end automatic speech translation of audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6224–6228. Cited by: §3.
- Listen and translate: a proof of concept for end-to-end speech-to-text translation. In Proceedings of the 2016 NeurIPS Workshop on End-to-end Learning for Speech and Audio Processing, Cited by: §1.
- Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §3.
- State-of-the-art speech recognition with sequence-to-sequence models. Cited by: §3.
- One-to-many multilingual end-to-end speech translation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 585–592. Cited by: §1.
- MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2012–2017. Cited by: §1.
- Multilingual end-to-end speech translation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 570–577. Cited by: §1, §3.
- Europarl-ST: a multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8229–8233. Cited by: §1.
- SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. Cited by: §4.1.
- Facebook FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 314–319. Cited by: §2.1.
- A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1703–1714. Cited by: §2.1.
- Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §4.1.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.1.
- SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §3, §4.1.
- A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191. Cited by: §4.1.
- The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, Cited by: §4.1.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. Cited by: §4.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.1, §3.
- VizSeq: a visual analysis toolkit for text generation tasks. EMNLP-IJCNLP 2019, pp. 253. Cited by: §4.1.
- CoVoST: a diverse multilingual speech-to-text translation corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4197–4203. Cited by: §1, §2.2.