Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation

02/09/2018 ∙ by Ali Can Kocabiyikoglu, et al. ∙ Université Grenoble Alpes

Recent works in spoken language translation (SLT) have attempted to build end-to-end speech-to-text translation without using source language transcription during learning or decoding. However, while large quantities of parallel texts (such as Europarl, OpenSubtitles) are available for training machine translation systems, there are no large (>100h) and open source parallel corpora that include speech in a source language aligned to text in a target language. This paper tries to fill this gap by augmenting an existing (monolingual) corpus: LibriSpeech. This corpus, used for automatic speech recognition, is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. After gathering French e-books corresponding to the English audiobooks from LibriSpeech, we align speech segments at the sentence level with their respective translations and obtain 236h of usable parallel data. This paper presents the details of the processing as well as a manual evaluation conducted on a small subset of the corpus. This evaluation shows that the automatic alignment scores are reasonably correlated with the human judgments of the bilingual alignment quality. We believe that this corpus (which is made available online) is useful for replicable experiments in direct speech translation or more general spoken language translation experiments.




1 Introduction

Attention-based encoder-decoder approaches have been very successful in Machine Translation [Bahdanau et al.2014], and have shown promising results in End-to-End Speech Translation [Bérard et al.2016, Weiss et al.2017] (translation from raw speech, without any intermediate transcription). End-to-End speech translation is also attractive for language documentation, which often uses corpora made of audio recordings aligned with their translation in another language (no transcript in the source language) [Blachon et al.2016, Adda et al.2016, Anastasopoulos and Chiang2017]. However, while large quantities of parallel texts (such as Europarl, OpenSubtitles) are available for training (text) machine translation systems, there are no large (>100h) and open source parallel corpora that include speech in a source language aligned to text in a target language. For End-to-End speech translation, only a few parallel corpora are publicly available. For example, the Fisher and Callhome Spanish-English corpora provide 38 hours of speech transcriptions of telephone conversations aligned with their translations [Post et al.2013]. However, these corpora are only medium-sized and contain low-bandwidth recordings. The Microsoft Speech Language Translation (MSLT) corpus also provides speech aligned to translated text. Speech is recorded for English, German and French [Federmann and Lewis2016]. But this corpus is again rather small (less than 8h per language).

Paper contributions. Our objective is to provide a large corpus for direct speech translation evaluation which is an order of magnitude bigger than the existing corpora described above. For this, we propose to enrich an existing (monolingual) corpus based on read audiobooks called LibriSpeech. The approach is straightforward: we align e-books in a foreign language (French) with the English utterances of LibriSpeech. This results in 236h of English speech automatically aligned to French translations at the utterance level (our dataset is available online).


This paper is organized as follows: after presenting our starting point (Librispeech) in section 2, we describe how we aligned foreign translations to the speech corpus in section 3. Section 4 describes our evaluation of a subset of the corpus (quality of the automatically obtained alignments). Finally, section 5 concludes this work and gives some perspectives.

2 Our Starting Point: Librispeech Corpus

Our starting point is the LibriSpeech corpus used for Automatic Speech Recognition (ASR). It is a large-scale corpus which contains approximately 1000 hours of speech aligned with their transcriptions [Panayotov et al.2015]. The read audiobook recordings derive from a collaborative project: LibriVox. The recordings are based on public domain books available on Project Gutenberg; these texts are distributed with LibriSpeech, as are the original recordings.

We start from this corpus because it has been widely used in ASR and because we believe it is possible to find the text translations for a large subset of the read audiobooks. (Another dataset could have been used, TED Talks, but we considered it better to start with a read speech corpus for evaluating end-to-end speech translation.)

subset            hours   per-spk minutes   female spkrs   male spkrs   total spkrs
dev-clean           5.4                 8             20           20            40
test-clean          5.4                 8             20           20            40
dev-other           5.3                10             16           17            33
test-other          5.1                10             17           16            33
train-clean-100   100.6                25            125          126           251
train-clean-360   363.6                25            439          482           921
train-other-500   496.7                30            564          602          1166
Table 1: Details on LibriSpeech corpus

Table 1 gives details on Librispeech as well as its data split. Recordings are segmented and put into different subsets of the corpus according to their quality (better quality speech segments are put in the clean part). Note that in order to obtain a balanced corpus with a large number of speakers, each speaker only read a small portion of a book (8-10 minutes for dev and test, 25-30 minutes for train). Moreover, in the training data, speech segments are obtained by splitting long signals at silences in order to obtain segments that are at most 35s long.

3 Aligning Foreign Translations to Librispeech

3.1 Overview

The main steps of our process are the following:

  • Collect e-books in foreign language corresponding to English books read in Librispeech (section 3.2),

  • Extract chapters from these foreign books, corresponding to read chapters in Librispeech (section 3.3),

  • Perform bilingual text alignment from comparable chapters (section 3.4),

  • Realign speech signal with text translations obtained (section 3.5).

These different steps are described in the next subsections.

3.2 Collecting Foreign Novels

The LibriSpeech corpus is composed of 5831 chapters (from 1568 books) aligned with their transcriptions. We used the given metadata to search for e-books in a foreign language (French) corresponding to the English books read in Librispeech. Firstly, we used DBPedia [Auer et al.2007] in order to (automatically) obtain title translations. Secondly, we used a public domain index of French e-books to find Web links matching the titles we found. Then, we finished this process for the entire LibriSpeech corpus by manually searching for French novels in different public domain resources. Overall, we collected 1818 chapters (from 315 books) in French to be aligned with Librispeech. Some of the public domain resources that we used are: Project Gutenberg, Wikisource, Gallica, Google Books, BEQ and UQAC.

Audiobooks available in LibriSpeech are of different literary genres: most of them are novels, but there are also poems, fables, treatises, plays, religious texts, etc. Belonging to the public domain, most of the texts are old and not publicly available in a foreign language. Therefore, the books that were collected in French are mostly world classics. As a few of them are ancient texts, some translations are in old French.

3.3 Chapters Extraction

LibriSpeech transcriptions are provided for each chapter. As the readers only read for a short period of time (one goal of Librispeech was to have as many speakers as possible), transcriptions may correspond to incomplete chapters. For the same reason, books are not read entirely. Therefore, in order to obtain an alignment at the sentence level, a first step was to decompose the English and French books into chapters. This step was achieved by a semi-automatic process. After converting books to text format (both English and French), regular expressions were used to identify chapter transitions. Then, each French chapter was extracted and aligned to its counterpart in English. After manual verification of all chapters, we obtained 1423 usable chapters (from 247 books).
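The chapter extraction step can be illustrated with a minimal sketch. The heading pattern, the function names `split_into_chapters` and `pair_chapters`, and the position-based pairing are all assumptions made for illustration; the actual regular expressions used are not given here, and the real process included manual verification.

```python
import re

# Hypothetical heading pattern: matches lines such as "CHAPTER IV",
# "Chapter 12" or "CHAPITRE XII." (roman or arabic numbering).
CHAPTER_HEADING = re.compile(
    r"^[ \t]*(?:chapter|chapitre)\s+([ivxlcdm]+|\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_into_chapters(book_text):
    """Return a list of (heading, body) pairs, one per detected chapter."""
    matches = list(CHAPTER_HEADING.finditer(book_text))
    chapters = []
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or to the end of the book.
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(book_text)
        heading = match.group(0).strip()
        chapters.append((heading, book_text[body_start:body_end].strip()))
    return chapters


def pair_chapters(english_chapters, french_chapters):
    """Pair chapters by position (an assumption); extras are left unpaired."""
    return list(zip(english_chapters, french_chapters))
```

Pairing by position only works when both editions use the same chapter division, which is one reason a manual check of every chapter pair remains necessary.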

3.4 Bilingual Text Alignment

The 1423 parallel chapters establish the comparable corpus from which we extracted bilingual sentences. This was done using an off-the-shelf bilingual sentence aligner called hunAlign [Varga et al.2007]. HunAlign takes as input a comparable (not sentence-aligned) corpus and outputs a sequence of bilingual sentence pairs. It combines (Gale-Church) sentence-length information as well as dictionary-based alignment methods.

The initial dictionary available for alignment was the default French-English lexicon (40k entries) created for a hunAlign wrapper developed by Andras Farkas. We enriched this dictionary by adding entries from other open source bilingual dictionaries. Different dictionaries (woaifayu, apertium, freedict, quick) from a language learning resource were gathered in various formats and adapted to the hunAlign dictionary format. We finally obtained and used a dictionary of 128,000 unique entries.
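As a rough sketch of how several dictionaries might be merged and de-duplicated, consider the code below. Both the input format (tab-separated `english<TAB>french` lines) and the output format (one `target @ source` entry per line, which is how we understand the hunAlign dictionary format) are assumptions that should be checked against the hunAlign documentation; all function names are hypothetical.

```python
def parse_tsv_entries(lines):
    """Parse hypothetical tab-separated 'english<TAB>french' lines into pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than guess their meaning
        english, french = (field.strip().lower() for field in fields)
        if english and french:
            pairs.append((english, french))
    return pairs


def merge_dictionaries(*sources):
    """Merge several lists of (english, french) pairs, dropping duplicates."""
    seen = set()
    merged = []
    for pairs in sources:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


def to_hunalign_lines(pairs):
    """Format pairs as 'french @ english' lines (assumed 'target @ source')."""
    return [f"{french} @ {english}" for english, french in pairs]
```

Lower-casing before de-duplication is a design choice of this sketch; it keeps the entry count closer to the number of truly distinct entries.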

In order to improve the quality of sentence-level alignments, the data had to be pre-processed. For English and French, our extracted chapters were cleaned with regular expressions. Then, we used the Python NLTK [Bird2006] sentence splitter to detect sentence boundaries in the corpora. Furthermore, the bitexts were stemmed (removing suffixes to reduce data sparsity). Finally, the parallel sentences found were brought back to their initial form with reverse stemming. This last step was done using Google’s diff-match-patch library [Fraser2012].

English Sentence                                                   French Sentence
Oh, I beg your pardon!                                             «Oh! je vous demande bien pardon!
A lane was forthwith opened through the crowd of spectators.       Un chemin fut alors ouvert parmi la foule des spectateurs.
No, ”said Catherine,” he is not here; I cannot see him anywhere.   - Non, dit Catherine, il n’est pas ici. Jamais je ne parviens à le rencontrer.
Table 2: Examples of parallel sentences obtained from comparable corpora made up of aligned book chapters

Table 2 shows examples of 3 bilingual sentence pairs obtained from 3 different chapters.

3.5 Realigning Speech Signal with Text Translations

In order to associate parallel sentences with speech signal transcriptions, a realignment of the LibriSpeech speech segments was necessary. This realignment is a two-step process: first, we force-aligned the Librispeech English transcripts to match the English sentences obtained in the previous stage; secondly, we resegmented the speech signal according to the new sentence splits.
For the first step, we used a tool for realigning texts in the same language but with a different sentence tokenization [Matusov et al.2005]. We applied it to realign our speech transcriptions in English to the English sentences of our bilingual corpus obtained in section 3.4. The outcome of this first step is a new sentence segmentation for our English transcriptions that are now correctly aligned to our French translations.
The second step was to resegment the speech signals to match them to the new sentence segmentation. We did that by:

  • creating a big file by concatenating speech segments for each chapter,

  • re-aligning the large speech signal to the transcripts using an off-the-shelf English forced aligner based on the Kaldi ASR toolkit [Povey et al.2011],

  • re-segmenting speech according to the desired sentence split.
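The last step can be sketched as follows, assuming the forced aligner returns one `(word, start_sec, end_sec)` triple per transcript word and that each target sentence covers a known number of consecutive words. The data layout and the function names are assumptions for illustration, not the output format of any particular aligner; the 16 kHz default matches the LibriSpeech sampling rate.

```python
def sentence_boundaries(word_timings, sentences):
    """Map each sentence to (start_sec, end_sec) using word-level timings.

    word_timings: list of (word, start_sec, end_sec) in spoken order.
    sentences: sentences whose whitespace-separated tokens, taken in order,
               correspond one-to-one to the entries of word_timings.
    """
    boundaries = []
    cursor = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            raise ValueError("empty sentence in the target segmentation")
        chunk = word_timings[cursor:cursor + n_words]
        if len(chunk) < n_words:
            raise ValueError("fewer aligned words than sentence tokens")
        # A segment starts with its first word and ends with its last word.
        boundaries.append((chunk[0][1], chunk[-1][2]))
        cursor += n_words
    return boundaries


def to_sample_range(start_sec, end_sec, sample_rate=16000):
    """Convert a time span in seconds to sample indices at the given rate."""
    return int(round(start_sec * sample_rate)), int(round(end_sec * sample_rate))
```

For example, cutting the concatenated chapter signal at the sample ranges returned for each sentence yields one audio segment per aligned sentence pair.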

Table 3 presents an overview of the final data (speech with aligned translations) obtained after this final step. For each sentence pair, we also added the En-Fr machine translation output of our English transcripts (Google Translate). So we have 2 French translations in the end (a correct one from automatic alignment; a noisy one from MT).

Chapters   Books   Duration (h)   Total Segments
1408       247     ~236           131395
Table 3: Statistics of the final multimodal and bilingual corpus obtained (English speech aligned to French text)

4 Human Evaluation of a Corpus Subset

4.1 Protocol

Now that we have obtained a multimodal alignment between (English) speech signals and (French) translations, we want to evaluate its quality. At this point, the only score available is the confidence score given by hunAlign for each aligned sentence pair. One goal of this human evaluation, which can only be carried out on a corpus subset, is to see whether this score correlates well with human judgements.

50 sentences from each of 4 different chapters were chosen for evaluation. These chapters were selected according to their average alignment scores (from hunAlign): two chapters near the mean of the overall alignment scores (hypothesized medium-quality alignments), one chapter above the mean score (hypothesized good-quality alignment) and one chapter below the mean score (hypothesized bad-quality alignment). We established a scale from 1 to 3 to judge the matching quality between English speech and English transcriptions. This 3-step scale is precise enough because few errors were found in speech alignments. We established a scale from 1 to 5 to judge the quality of the bilingual text alignments. Overall, 200 sentences were evaluated (on both scales) by 3 annotators.

Chapter                                            Average confidence   Average speech alignment   Average textual alignment
                                                   score (hunAlign)     score (max 3)              score (max 5)
Chapter XXIII                                      1.34                 2.82                       4.64
Alice’s Adventures in Wonderland, Chapter V        1.14                 2.98                       4.28
A Tale of Two Cities, Book III, Chapter III        0.96                 2.86                       3.86
Adventures of Huckleberry Finn, Chapter VIII       0.66                 2.9                        2.58
Average                                            1.02                 2.89                       3.84
Table 4: Results of human evaluation by 3 annotators.

Cohen’s weighted kappa for inter-annotator agreement on textual alignment is 0.76.
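For readers who wish to reproduce this kind of agreement figure, a minimal sketch of weighted Cohen's kappa for two annotators is given below. The weighting scheme (linear or quadratic) and the way the three annotators were combined are not stated here, so the linear default and the pairwise averaging in `mean_pairwise_kappa` are assumptions; both function names are hypothetical.

```python
from itertools import combinations


def weighted_kappa(ratings_a, ratings_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two annotators over ordered categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty item set")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions of (annotator A, annotator B) ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weighting == "linear" else 2
    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            disagreement_observed += weight * observed[i][j]
            disagreement_expected += weight * row[i] * col[j]
    if disagreement_expected == 0.0:
        return 1.0  # both annotators used one and the same category throughout
    return 1.0 - disagreement_observed / disagreement_expected


def mean_pairwise_kappa(annotators, categories, weighting="linear"):
    """Average the weighted kappa over all annotator pairs (an assumption)."""
    pairs = list(combinations(annotators, 2))
    total = sum(weighted_kappa(a, b, categories, weighting) for a, b in pairs)
    return total / len(pairs)
```

With the 1 to 5 textual alignment scale, `categories` would be `[1, 2, 3, 4, 5]` and each annotator's ratings a list of 200 marks.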

Below, we give example sentences for each mark (1-5) used in the human evaluation of bilingual alignments. Two different dimensions are evaluated at the same time: the accuracy of the alignment (an alignment can be wrong, partial or correct) and the extent to which the translational equivalence is compositional and can be isolated from the current context.

  • 1. Wrong alignment


    • French: Je sais, par exemple, que maintenant il souffre de la faim dans un vaste désert, où l’on ne saurait trouver de nourriture.

  • 2. Partial alignment with slightly compositional translational equivalence


    • French: Mais il paraît que tu préfères être courtisée avec l’arc et la hache, plutôt qu’avec des phrases polies et avec la langue de la courtoisie.

  • 3. Partial alignment with compositional translation and additional or missing information


    • French: C’est ainsi qu’enfin débuta le journal du soir à la Force, le jour où la pauvre Lucie avait vu danser la carmagnole.

  • 4. Correct alignment with compositional translation and little additional or missing information


    • French: La nuit était sombre; le vent âpre et froid chassait devant lui avec rage les nuages rapides.

  • 5. Correct alignment and fully compositional translation

    • English: WHAT IS A CAUCUS RACE

    • French: Qu’est-ce qu’une course cocasse?

4.2 Results

Table 4 reports our human evaluations for the 4 chapters.

The first thing that we can notice is that the textual alignment quality is higher for chapters with higher confidence scores. The first evaluation (speech alignment; scale 1-3) shows an average score of 2.89/3, which confirms that our re-segmentation of speech signals worked correctly. The second evaluation (bilingual alignment; scale 1-5) shows an average score of 3.84/5. Some sentences were found to be incorrectly aligned but, overall, the alignment quality can be considered correct. The main reason why the average alignment score varies between chapters is the varying compositionality of the translations. Also, the dictionary that we used for bilingual alignments is inadequate for old texts and results in lower overall confidence scores.

We also computed automatic correspondence scores obtained with a cross-language textual similarity detection method applied to transcriptions and their translations [Ferrero et al.2016]. Our idea was to add another automatic score in addition to the hunAlign score. We computed the correlation between human evaluation scores and hunAlign scores and obtained a correlation of 0.41. The same correlation was obtained between human evaluation scores and those obtained automatically with the method of [Ferrero et al.2016]. This shows that automatic alignment scores are reasonably correlated with human judgments and could be used to extract a subset of the best alignments, for instance by ranking them according to their score.
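A minimal sketch of both ideas (correlating automatic scores with human marks, then keeping the best-scoring alignments) is given below. The correlation coefficient is assumed to be Pearson's, which is not stated here, and the names `pearson` and `top_fraction` are hypothetical. The example data are the four chapter-level averages of Table 4, so the resulting value is a chapter-level illustration only and is not comparable to the sentence-level correlation of 0.41 reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


def top_fraction(pairs, scores, fraction):
    """Keep the best-scoring fraction of aligned pairs (at least one pair)."""
    ranked = sorted(zip(scores, range(len(pairs))), reverse=True)
    keep = max(1, int(len(pairs) * fraction))
    return [pairs[i] for _, i in ranked[:keep]]


# Chapter-level averages taken from Table 4 (four data points only).
confidence = [1.34, 1.14, 0.96, 0.66]
textual_alignment = [4.64, 4.28, 3.86, 2.58]
chapter_level_r = pearson(confidence, textual_alignment)
```

Applied at the sentence level, `top_fraction` would take the list of aligned sentence pairs and their confidence scores, and the retained fraction becomes a tunable trade-off between corpus size and alignment quality.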

5 Conclusion

We have presented a large corpus (236h) which is an augmentation of Librispeech, providing a bilingual speech-text corpus for direct (end-to-end) speech translation experiments. The methodology described here could be used to add languages other than French (German, Spanish, etc.) to our augmented Librispeech. The current corpus contains several ancient texts, so it would also be interesting to extend it to other kinds of corpora: different speaking styles (not only read speech), more contemporary texts, etc.

Preliminary direct speech translation experiments have been carried out recently and will be presented at the ICASSP 2018 conference [Bérard et al.2018]. Our online repository provides a data split for speech translation experiments, and results show that it is possible to train compact and efficient end-to-end speech translation models in this setup, but the dataset is challenging (BLEU score around 15 for the direct speech translation task; more details in [Bérard et al.2018]).

6 Bibliographical References


  • [Adda et al.2016] Adda, G., Stücker, S., Adda-Decker, M., Ambouroue, O., Besacier, L., Blachon, D., Bonneau-Maynard, H., Godard, P., Hamlaoui, F., Idiatov, D., Kouarata, G.-N., Lamel, L., Makasso, E.-M., Rialland, A., Van de Velde, M., Yvon, F., and Zerbian, S. (2016). Breaking the unwritten language barrier: The Bulb project. In Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages), Yogyakarta, Indonesia.
  • [Anastasopoulos and Chiang2017] Anastasopoulos, A. and Chiang, D. (2017). A case study on using speech-to-translation alignments for language documentation. arXiv preprint arXiv:1702.04372.
  • [Auer et al.2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A nucleus for a web of open data. The semantic web, pages 722–735.
  • [Bahdanau et al.2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Bérard et al.2016] Bérard, A., Pietquin, O., Servan, C., and Besacier, L. (2016). Listen and translate: A proof of concept for end-to-end speech-to-text translation. In NIPS workshop on End-to-end Learning for Speech and Audio Processing.
  • [Bérard et al.2018] Bérard, A., Besacier, L., Kocabiyikoglu, A. C., and Pietquin, O. (2018). End-to-end automatic speech translation of audiobooks. Accepted to the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
  • [Bird2006] Bird, S. (2006). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for Computational Linguistics.
  • [Blachon et al.2016] Blachon, D., Gauthier, E., Besacier, L., Kouarata, G.-N., Adda-Decker, M., and Rialland, A. (2016). Parallel speech collection for under-resourced language studies using the LIG-Aikuma mobile device app. In Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages), Yogyakarta, Indonesia, May.
  • [Federmann and Lewis2016] Federmann, C. and Lewis, W. D. (2016). Microsoft Speech Language Translation (MSLT) corpus: The IWSLT 2016 release for English, French and German.
  • [Ferrero et al.2016] Ferrero, J., Agnes, F., Besacier, L., and Schwab, D. (2016). A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection.
  • [Fraser2012] Fraser, N. (2012). google-diff-match-patch-diff, match and patch libraries for plain text.
  • [Matusov et al.2005] Matusov, E., Leusch, G., Bender, O., and Ney, H. (2005). Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation (IWSLT) 2005.
  • [Panayotov et al.2015] Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE.
  • [Post et al.2013] Post, M., Kumar, G., Lopez, A., Karakos, D., Callison-Burch, C., and Khudanpur, S. (2013). Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus.
  • [Povey et al.2011] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, number EPFL-CONF-192584. IEEE Signal Processing Society.
  • [Varga et al.2007] Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., and Trón, V. (2007). Parallel corpora for medium density languages. Amsterdam Studies in the Theory and History of Linguistic Science Series 4, 292:247.
  • [Weiss et al.2017] Weiss, R. J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. (2017). Sequence-to-sequence models can directly transcribe foreign speech. arXiv preprint arXiv:1703.08581.