BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus

by Josh Meyer, et al.
Universität Saarland

BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio quality 48kHz single speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. This corpus is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have aligned, cleaned, and filtered the original recordings, and additionally hand-checked a subset of the alignments for each language. We present results for text-to-speech models with Coqui TTS. The data is released under a commercial-friendly CC-BY-SA license.


1 Introduction

The majority of the world’s approximately 7,000 languages [ethnologue] do not have open speech datasets, and even fewer have high-quality data with aligned text and speech, which can be used for training text-to-speech (TTS) models. The creation of benchmark datasets such as Librispeech [panayotov2015librispeech], LibriTTS [zen2019libritts], and LJSpeech [ljspeech17] enabled significant advances through community development on common resources, but these resources cover few languages, and most TTS systems evaluate on English only.

Speech synthesis systems have received significant attention in recent years due to the advances provided by deep learning. These advances enable TTS models to achieve improved naturalness with respect to human speech [tacotron2, valle2020flowtron, kim2021conditional], and improved synthesized speech has driven adoption of virtual assistants [purington2017alexa, dempsey2017teardown]. However, neural models often require a non-trivial amount of data for training. This necessity leaves many language communities under-served in the development of speech technologies [casanova2022tts], and it further results in researchers not evaluating models on diverse linguistic phenomena.

In this work, we present the BibleTTS corpus, a high-quality aligned speech corpus for ten African languages. This data enables further research and resource creation for these languages and will allow researchers to create meaningful benchmarks against non-English languages.

Creating high-quality aligned datasets typically requires tools not available for most languages, hindering the creation of datasets for lower-resourced languages. Specifically, forced alignment of speech and text typically requires pre-trained acoustic models and grapheme-to-phoneme (G2P) models. This process can be challenging and error-prone without high-quality resources. We demonstrate that it is possible to force-align data without access to any pre-trained models (acoustic or G2P), and still produce quality output.

Additionally, recent corpora that significantly expand linguistic coverage for TTS datasets are often not freely available [black2019cmu, babel2011iarpa], contain less single-speaker data [salesky2021mtedx], and/or have lower-quality recordings. BibleTTS stands out in this regard as it is a large, high-fidelity corpus made of single-speaker recordings. The corpus is released under an open CC-BY-SA license. Corpus links and samples created with our TTS models can be accessed from the project website.

2 Related Work

We focus on related work for African languages in this section. Existing publicly available datasets are typically small. For Yorùbá these include a 2.75-hour corpus [van2015lagos, Niekerk2012ToneRI] and a 4-hour multi-speaker dataset [gutkin2020developing]. TWB Gamayun kits [gamayun] include a 6-hour, single-speaker, high-quality Swahili speech corpus optimized for TTS training. Earlier Yorùbá TTS efforts typically used bespoke private data [Dagba2016DesignOA, Afolabi2014DevelopmentOT, Akinwonmi2013APT, ajadi2007quantitative, Odejobi2004ACM]. For isiXhosa, Sesotho, Setswana and Afrikaans, multi-speaker corpora of approximately 2 hours each have been developed for TTS [van2017rapid, barnard2014nchlt]. The CMU Wilderness dataset [black2019cmu] includes up to 20 hours of high-quality, single-speaker data for several African languages, but it is not publicly available and the alignments can contain noise. TTS systems research for African languages has comprised development efforts in frameworks like Festival [Black1998FestivalSS] or MaryTTS [Schrder2011OpenSV] for Yorùbá [iyanda2017development, aoga2016integration], Ibibio [Ekpenyong2008TowardsAU], Amharic [Mariam2004UnitSV], Fon [Dagba2014ATT], isiZulu [Louw2005AGI], and KiSwahili [Gakuru2005DevelopmentOA]. While many of these systems used concatenative synthesis, in large part because the available corpora were small, there have also been investigations into statistical parametric speech synthesis for Ibibio [Ekpenyong2014StatisticalPS]. Finally, there have been efforts in related tasks, such as grapheme-to-phoneme research for Yorùbá [ynd2014DevelopmentOG], intonation modeling [ajadi2007quantitative], and numeral preprocessing [akinade2014computational].

3 Languages represented

Table 1 shows the languages in the BibleTTS corpus, with their language families, the number of speakers [ethnologue], and the regions in Africa where they are spoken. The corpus consists of ten languages from the three largest language families in Africa (Niger-Congo, Afro-Asiatic and Nilo-Saharan) and four regions of Africa. All of these languages are tonal and are spoken primarily in sub-Saharan Africa.

3.1 Language Characteristics

Language            Classification               African region   No. of speakers
Éwé [ewe]           Niger-Congo / Kwa            West             5.5M
Hausa [hau]         Afro-Asiatic / Chadic        West             77M
Kikuyu [kik]        Niger-Congo / Bantu          East             8.2M
Lingala [lin]       Niger-Congo / Bantu          Central          40M
Luganda [lug]       Niger-Congo / Bantu          East             11M
Luo [luo]           Nilo-Saharan / Luo–Acholi    East             5.3M
Chichewa [nya]      Niger-Congo / Bantu          South-East       14M
Akuapem Twi [aka]   Niger-Congo / Akan           West             626k
Asante Twi [aka]    Niger-Congo / Akan           West             3.8M
Yorùbá [yor]        Niger-Congo / Volta-Niger    West             46M
Table 1: Language, classification and statistics. All language classifications and numbers of speakers are from Ethnologue.

Éwé [ewe] uses 35 Latin letters, excluding (c, j, q) and including 12 additional letters (ɖ, dz, ɛ, ƒ, gb, ɣ, kp, ny, ŋ, ɔ, ts, ʋ). Éwé has three tones, and they are marked in text.

Hausa [hau] uses two different writing scripts: Ajami and Boko. The Boko script is the most widely used and is based on the Latin alphabet with 44 letters. The alphabet excludes the letters (p, q, v, x) and uses 12 additional letters (ɓ, ɗ, ƙ, ƴ, kw, ƙw, gw, ky, ƙy, gy, sh, ts). Hausa is tonal, but tones are not represented in text.

Kikuyu [kik] uses the Latin script with 27 letters, excluding (f, l, p, s, v, x, y, z) and including nine additional letters (ĩ, ũ, mb, nd, nj, ng, ng’, ny, th). Kikuyu uses two tones (high and low), but they are not marked in text.

Lingala [lin] uses the Latin script with 40 letters, excluding (j, q, x) and including 17 additional letters (ɛ, gb, kp, mb, mf, mp, mv, nd, ng, ngb, nk, ns, nt, ny, nz, ɔ, ts). Lingala uses two tones (high and low), but they are not marked in text.

Luganda [lug] uses 24 Latin letters, excluding (h, q, x) and including two additional letters (ŋ, ny). Luganda uses three tones, but they are not marked in text.

Luo [luo], or Dholuo, uses the Latin script with 31 letters, excluding (c, q, x, v, z) and including ten additional letters (ch, dh, mb, nd, ng’, ng, ny, nj, th, sh). Luo has four tones, but they are not marked in text.

Chichewa [nya] uses the Latin script with 31 letters, excluding (q, x, y) and including eight additional letters (ch, kh, ng, ŋ, ph, tch, th, ŵ). Chichewa uses two tones (high and low), but they are not marked in text.

Akan [aka] is a language with multiple dialects (including Fante, Bono, Asante, and Akuapem), and they are collectively known as Twi. In this study, we focus on Asante and Akuapem, which are mutually intelligible and share the same alphabet (referred to herein as aka-Asante and aka-Akuapem). Twi uses 22 Latin letters, excluding (c, j, q, v, x, z) and including two additional letters (ɛ, ɔ).

Yorùbá [yor] uses 25 Latin letters, excluding (c, q, v, x, z) and including four additional letters (ẹ, gb, ṣ, ọ). Yorùbá is a tonal language with three tones: low, middle, and high. These tones are represented by the grave (e.g. “è”), optional macron (e.g. “ē”), and acute (e.g. “é”) accents respectively, but the mid tone is usually left unmarked in writing.

4 Corpus creation

The BibleTTS corpus consists of high-quality audio released as 48kHz, 24-bit, mono-channel FLAC files. Recordings for each language were made under professional-quality, close-microphone conditions (i.e., without background noise or echo). BibleTTS is unique among open speech corpora for the volume of data per speaker and its suitability for TTS. The corpus consists of ten languages which are under-represented in today’s voice technology landscape, both in academia and in industry. We release train/dev/test splits for each language, where dev is the Book of Ezra, test is Colossians, and train is all other books.
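The book-based split can be illustrated with a minimal sketch. It assumes each sample carries the English name of its book as a string; how books are actually identified in the released files is not specified here, so the `(book, sample_id)` pairs and function names below are hypothetical.

```python
# Hypothetical sketch of the book-based split described above.
# Assumption: every sample is tagged with the English name of its book.
DEV_BOOK = "Ezra"
TEST_BOOK = "Colossians"


def assign_split(book: str) -> str:
    """Return 'dev', 'test', or 'train' for a sample from the given book."""
    if book == DEV_BOOK:
        return "dev"
    if book == TEST_BOOK:
        return "test"
    return "train"


def split_samples(samples):
    """Group (book, sample_id) pairs into train/dev/test lists."""
    splits = {"train": [], "dev": [], "test": []}
    for book, sample_id in samples:
        splits[assign_split(book)].append(sample_id)
    return splits
```

Holding out whole books, rather than random verses, keeps the dev and test text disjoint from the training text at the chapter level.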

Figure 1: Distribution of the sample length per language. Samples longer than 30s or with fewer than 10 characters were removed, and outlier segments were detected and discarded as described in Section 4.2. Lingala is a slight outlier with the majority of segments between 10 and 20 seconds, while the other five languages have segments centered at 5-10s each.

4.1 Alignment

The BibleTTS corpus contains audio recordings and text transcripts (i.e. “Open Contemporary Bible” translations) which were released by Biblica via the Open.Bible project. The original audio recordings were 48kHz, mono-channel WAV, typically one recording per chapter of the Bible. Each chapter was up to 30 minutes long, which is too long for most modeling tasks. Verses are a natural alternative, as the text already contains verse boundaries. Aligning at the verse level creates more manageable recording lengths of up to 30 seconds (see Figure 1), which are more likely to be consistent across languages than segmentation on voice activity detection or other alternatives.

Potential challenges in alignment include additional content in either the speech or text, such as spoken titles and headings or text annotations, and the availability of pre-trained acoustic models and grapheme-to-phoneme mappings. We have employed various alignment techniques depending on the availability of verse timestamps and resources in each language, and evaluated a subset of the alignments with native speakers.

4.1.1 Verse timestamps

Three languages (aka-Akuapem, aka-Asante, and lin) were straightforwardly segmented using verse-level timestamps released by the Open.Bible project. The timestamps show the start time of every verse, as well as when the book and chapter titles were spoken. With these timestamps, verses were isolated and saved as individual audio files using sox. These alignment scripts can be found on GitHub at coqui-ai/open-bible-scripts.
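As a rough illustration of timestamp-based segmentation, the sketch below turns a list of start times into verse spans and builds a `sox` command using its `trim` effect (start position followed by duration). The marker format (`(label, start_seconds)` pairs with labels beginning with `verse`) is an assumption made for this example and is not the Open.Bible timestamp format; the released scripts may work differently.

```python
# Hypothetical sketch: cut verses out of a chapter recording from start times.
# Assumption: markers are (label, start_seconds) pairs sorted by time, where
# verse labels start with "verse" and spoken titles use any other label.

def verse_segments(markers, audio_end):
    """Return (label, start, end) for each verse; titles are skipped.

    Each segment ends where the next marker starts; the final segment
    ends at the end of the chapter audio.
    """
    segments = []
    for i, (label, start) in enumerate(markers):
        end = markers[i + 1][1] if i + 1 < len(markers) else audio_end
        if label.startswith("verse"):
            segments.append((label, start, end))
    return segments


def sox_trim_command(src, dst, start, end):
    """Build a sox invocation that copies [start, end) of src into dst."""
    return ["sox", src, dst, "trim", f"{start:.3f}", f"{end - start:.3f}"]
```

The returned argument list could be passed to `subprocess.run` on a machine where `sox` is installed.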

4.1.2 Forced alignment using pre-trained acoustic models

Forced alignment is the process of extracting timestamps given an audio and a transcript pair, and requires either a pre-trained acoustic model or training one from scratch. For Hausa (hau), we opted to use the Montreal Forced Aligner (MFA) [mfa], for which there is a pre-trained Hausa model [globalphone]. The code is open-sourced on GitHub. The process is as follows:

  1. Audio of each chapter of each book is downloaded together with its script in the form of an XML file.

  2. The XML script is parsed and converted into a plain normalized text file. Normalization entails: (a) adding the chapter title at the beginning of the script as “Sura <chapter-no>”, (b) converting numbers into written form using a dictionary prepared with Hausa linguists, and (c) adding a new line after every sentence-ending punctuation mark (e.g., “.”, “?”, “!”).

  3. A grapheme-to-phoneme (G2P) dictionary is created from the word list extracted from the transcripts using the Hausa G2P model.

  4. Alignment is performed for each chapter using the audio and normalized script with a beam length of 1000.

  5. The time-aligned TextGrid file is processed in parallel with the sentence-segmented transcript to partition the chapter audio into sentence-level audio chunks with their transcriptions.
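The normalization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: `NUMBER_WORDS` is a three-entry toy stand-in for the dictionary prepared with Hausa linguists, the function name is hypothetical, and XML parsing is omitted.

```python
import re

# Toy stand-in for the number dictionary prepared with Hausa linguists;
# the real dictionary is not reproduced here.
NUMBER_WORDS = {"1": "ɗaya", "2": "biyu", "3": "uku"}


def normalize_chapter(chapter_no, text, number_words):
    """Apply the three normalization steps to one chapter of plain text."""
    # (a) Prepend the chapter title.
    script = f"Sura {chapter_no}\n{text}"
    # (b) Replace digit sequences with written forms when the dictionary
    #     has an entry; unknown numbers are left untouched.
    script = re.sub(
        r"\d+", lambda m: number_words.get(m.group(0), m.group(0)), script
    )
    # (c) Start a new line after each run of sentence-ending punctuation.
    script = re.sub(r"([.?!]+)\s*", r"\1\n", script)
    lines = [line.strip() for line in script.splitlines() if line.strip()]
    return "\n".join(lines)
```

With placeholder text, `normalize_chapter(2, "aaa bbb. ccc 3? ddd!", NUMBER_WORDS)` yields four lines: the spelled-out chapter title followed by one sentence per line.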

4.1.3 Forced alignment from scratch

Two languages (ewe and yor) were aligned via forced alignment from scratch. Using only the found audio and transcripts (i.e., without a pre-trained acoustic model), an acoustic model was trained and the data aligned with the Montreal Forced Aligner. Graphemes were used as a proxy for phonemes in place of G2P data. The code used to generate alignments can be found in the coqui-ai/open-bible-scripts repository.
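One way to use graphemes as a proxy for phonemes is to build a pronunciation dictionary in which every word is "pronounced" as its own sequence of characters. The sketch below is an assumption about how such a lexicon could be built, not the released implementation. It keeps combining tone marks attached to their base letter and treats digraphs such as "gb" as two separate symbols; both are choices made for this example.

```python
import string
import unicodedata


def split_graphemes(word):
    """Split a word into graphemes, keeping combining marks on their base."""
    graphemes = []
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch) and graphemes:
            graphemes[-1] += ch  # attach tone mark or diacritic to base letter
        else:
            graphemes.append(ch)
    return graphemes


def grapheme_lexicon(transcripts):
    """Map every word in the transcripts to its grapheme sequence."""
    words = set()
    for line in transcripts:
        for token in line.lower().split():
            token = token.strip(string.punctuation)
            if token:
                words.add(token)
    return {word: split_graphemes(word) for word in sorted(words)}


def to_dictionary_lines(lexicon):
    """Render 'word<TAB>symbol symbol ...' lines, one entry per word."""
    return [f"{word}\t{' '.join(symbols)}" for word, symbols in lexicon.items()]
```

Because no G2P resource is needed, the same procedure applies to any language with a reasonably phonemic orthography.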

After forced alignment, we used regular expressions to pull out whole verses which were aligned such that silence occurred both at the beginning and the end of a verse. Segmenting out audio at the verse-level instead of splitting on silence may allow downstream TTS models to capture higher-level prosody.

4.2 Outlier detection

Following the alignment stage, we detected and removed outliers using the data-checker toolkit together with human judgments. The relevant code is open-sourced on GitHub at coqui-ai/data-checker. First, all segments longer than 30 seconds, or with fewer than 10 characters in the aligned transcript, were removed. Then, the removal of outliers was performed and fine-tuned for each language independently until the major offending samples were no longer encountered, as described below. (“Major offending samples” was not explicitly defined, but refers to samples labeled by a non-native speaker of these languages as containing obvious mismatches between transcripts and speech.)

Every <audio,transcript> pair was assigned an “outlier score”, and the most extreme outliers were removed. First, the ratio of transcript length (characters) to audio length (seconds) was calculated for each sample. Then a Gaussian distribution was estimated over all samples in a given language. Lastly, the number of standard deviations from the mean was calculated for each sample. Samples were excluded if they lay more than N standard deviations away from the mean, where N was fine-tuned per language with an iterative human-in-the-loop approach, until minimal offending samples were encountered. For most languages, it was sufficient to exclude samples more than 3 standard deviations from the mean (or 0.2% of the data). However, yor notably required more outliers to be removed to attain a quality dataset. The resulting distribution of segment lengths per language is shown in Figure 1.
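The scoring procedure amounts to a z-score on the characters-per-second ratio. The following is a minimal sketch of that idea rather than the data-checker implementation. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def outlier_scores(samples):
    """Return |z|-scores of the characters-per-second ratio.

    samples: list of (num_chars, duration_seconds) pairs.
    """
    ratios = [chars / seconds for chars, seconds in samples]
    mean = statistics.fmean(ratios)
    std = statistics.pstdev(ratios)  # assumption: population std. deviation
    if std == 0:
        return [0.0] * len(ratios)
    return [abs(ratio - mean) / std for ratio in ratios]


def filter_samples(samples, n_std=3.0, max_seconds=30.0, min_chars=10):
    """Drop too-long or too-short samples, then drop ratio outliers."""
    kept = [s for s in samples if s[1] <= max_seconds and s[0] >= min_chars]
    if not kept:
        return []
    scores = outlier_scores(kept)
    return [s for s, z in zip(kept, scores) if z <= n_std]
```

Lowering `n_std` for a given language removes more data, which corresponds to the per-language tuning of N described above.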

4.3 Human evaluation of alignment quality

We facilitated human evaluation of both the alignment and the output of the TTS models. In total, we collected labels from 15 annotators (three per language) for ewe, hau, lin, aka-Asante, and aka-Akuapem and an additional five annotators for yor. To judge the quality of <audio,transcript> pairs from our alignments, we randomly sampled 50 example pairings of aligned transcripts and the corresponding audio clips across the train, dev, and test sets. Annotators selected the one option that best described the quality of the alignment:

  1. Audio contains EXTRA words not in the transcript

  2. Audio is MISSING words that are in the transcript

  3. Audio is MISSING words AND includes EXTRA words

  4. No missing or extra words

In cases where the labels from different annotators disagreed, we took the majority vote label. In cases where the number of labels was spread evenly among different choices, we noted these as “conflicting.” Results of human evaluation are shown in Table 3.
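The label aggregation can be sketched as a majority vote with an explicit tie case. The label strings are hypothetical, and treating any tie for the top count as conflicting is an assumption about how "spread evenly" was applied.

```python
from collections import Counter


def aggregate_labels(labels):
    """Return the majority label, or 'conflicting' when the top count ties.

    labels: non-empty list of annotator labels for one sample.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "conflicting"
    return counts[0][0]
```

With three annotators, a sample is conflicting only when all three choose different options.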

As discussed in Section 4.1, some languages (aka-Asante, aka-Akuapem, lin) were segmented using existing verse-level timestamp files. Interestingly, annotators labeled these languages as having a high percentage of samples where the audio contains additional words not present in the aligned text. Aligning from scratch (ewe, yor) produced a greater proportion of segments with exact matches between speech and text than using forced alignment with a pre-trained acoustic model (hau). However, it should be noted that significantly more data was removed as outliers for yor, and less data overall was aligned for ewe and yor (see the statistics for unaligned vs. aligned hours in Table 2).

Language      Unaligned hours   Unaligned samples   Aligned hours   Aligned samples
Éwé           100.1             1,167               86.8            24,957
Hausa         103.2             1,189               86.6            40,603
Kikuyu        90.6              1,189               -               -
Lingala       151.7             1,189               71.6            15,117
Luganda       110.4             1,189               -               -
Luo           80.4              1,189               -               -
Chichewa      115.9             1,162               -               -
Akuapem Twi   75.7              1,189               67.1            28,238
Asante Twi    82.6              1,189               74.9            29,021
Yorùbá        93.6              1,189               33.3            10,228
Table 2: Corpus statistics. The corpus consists of data for ten languages, of which six have been aligned and formatted for immediate use to train TTS models.

5 TTS Models

To experimentally validate the quality of our dataset, we train the VITS end-to-end speech synthesis model [kim2021conditional] at a sampling rate of 22,050 Hz for each of the six aligned languages: Akuapem Twi, Asante Twi, Éwé, Hausa, Lingala, and Yorùbá. We chose VITS for its state-of-the-art naturalness and also for its robust alignment mechanism [kim2020glow]. The model takes characters as input and does not require a phonemizer.

To accelerate training we use transfer learning. We start from a model pre-trained on LJSpeech [ito2017lj] for 1M steps, which is available via the Coqui TTS repository. We continue training for approximately 110K steps for each one of the languages. We use the AdamW optimizer [loshchilov2017decoupled] with betas 0.8 and 0.99, weight decay 0.01, and an initial learning rate of 0.0002 decaying exponentially by a gamma of 0.999875 [paszke2017automatic]. The models were trained using an NVIDIA A100 SXM4 80GB with a batch size of 100. All models are released in the Coqui TTS toolkit.
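To make the exponential schedule concrete, the sketch below collects the reported hyperparameters in a plain dictionary and computes the learning rate after a given number of decay events. Whether the decay is applied per optimizer step or per epoch is not stated in the text, so `num_decays` is deliberately left abstract. The dictionary keys and function names are hypothetical and not tied to any toolkit's configuration format.

```python
import math

# Hyperparameters as reported in the text; key names are illustrative only.
OPTIMIZER_CONFIG = {
    "optimizer": "AdamW",
    "betas": (0.8, 0.99),
    "weight_decay": 0.01,
    "initial_lr": 0.0002,
    "lr_gamma": 0.999875,
    "batch_size": 100,
}


def learning_rate(num_decays, config=OPTIMIZER_CONFIG):
    """Learning rate after num_decays applications of the gamma factor."""
    return config["initial_lr"] * config["lr_gamma"] ** num_decays


def decays_to_halve(config=OPTIMIZER_CONFIG):
    """Number of decay events needed for the learning rate to halve."""
    return math.log(0.5) / math.log(config["lr_gamma"])
```

With a gamma this close to 1, the learning rate changes slowly, so the unit of decay matters a great deal for the effective schedule over 110K steps.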

5.1 Results and Discussion

We evaluated the synthesized speech using subjective judgments, averaged across multiple speakers. We randomly sampled 50 segments from the in-domain test set, as well as out-of-domain corpora to test the models’ ability to generalize to non-Bible contexts. The out-of-domain sentences were obtained from the NEWS corpus (except for Akuapem Twi). Annotators rated the quality of synthesized speech in terms of naturalness of voice and appropriateness of pronunciation for the particular language or dialect. Annotators selected from a 5-point Likert rating for each sample: 1 (bad), 2 (poor), 3 (fair), 4 (good), and 5 (excellent). The mean opinion scores (MOS) are shown in Table 4. We additionally use mel cepstral distortion (MCD) [kominek2008mcd], an automatic edit distance metric, to assess quality for the in-domain segments where we have reference speech, with dynamic time warping (DTW) to align the segments. MCD largely follows MOS: languages with better human judgments (higher MOS) typically have better (lower) MCD scores, though MCD can be misleading, as in Lingala.
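For readers unfamiliar with the metric, the sketch below computes MCD with DTW in its commonly used form, where the per-frame distortion is (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) and the result is averaged over DTW-aligned frame pairs. It operates on lists of mel-cepstral coefficient vectors that are assumed to be extracted already. Feature extraction, the number of coefficients, and whether the energy coefficient is excluded are all left open, so this is not necessarily the exact configuration behind Table 4.

```python
import math


def frame_mcd(ref, syn):
    """Mel cepstral distortion (dB) between two coefficient vectors."""
    squared = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared)


def dtw_path(ref_frames, syn_frames):
    """Return the minimum-cost alignment as a list of (ref_idx, syn_idx)."""
    n, m = len(ref_frames), len(syn_frames)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = frame_mcd(ref_frames[i - 1], syn_frames[j - 1])
            cost[i][j] = dist + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def mean_mcd(ref_frames, syn_frames):
    """Average frame-level MCD over the DTW alignment of two utterances."""
    path = dtw_path(ref_frames, syn_frames)
    total = sum(frame_mcd(ref_frames[i], syn_frames[j]) for i, j in path)
    return total / len(path)
```

DTW absorbs differences in speaking rate between reference and synthesized speech, so a time-stretched but otherwise identical utterance scores close to zero.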

The MOS judgments seem related to the alignment quality evaluations. That is, the language with the best alignments (ewe) also received the best MOS for speech synthesized from the resulting model. Similarly, the language with the worst alignments (hau) resulted in the TTS model with the lowest out-of-domain MOS. To improve MOS, it may be necessary to either improve the alignments or apply more stringent outlier exclusion criteria, which should be possible, as the training data size remains significantly larger than many available TTS corpora for these languages or others.

Language      EM    Add.   Miss.   Both   Conflicting labels
Éwé           92%   2%     4%      0%     2%
Hausa         32%   68%    0%      0%     0%
Lingala       74%   12%    0%      0%     14%
Akuapem Twi   88%   0%     8%      2%     2%
Asante Twi    78%   2%     6%      6%     8%
Yorùbá        76%   24%    0%      0%     0%
Table 3: Human evaluation of alignment. Shown are percentages of <audio,transcript> samples with an exact match (EM), added words, missing words, or both.

Language      MOS (in-domain)   MOS (out-of-domain)   MCD
Éwé           4.34              3.87                  5.8
Hausa         3.42              2.34                  7.6
Lingala       3.31              2.40                  5.6
Akuapem Twi   2.79              -                     7.5
Asante Twi    3.07              2.44                  6.8
Yorùbá        4.06              2.93                  5.8
Table 4: Evaluation of TTS model outputs using both human judgments (MOS) and an automatic metric (MCD). In-Domain texts are Bible verses, and Out-of-Domain is news.

6 Conclusions

The BibleTTS corpus is the first of its kind in many respects. The quality and volume of the data is extremely rare in open speech corpora – these are professional, studio quality, 48kHz recordings, with up to 86 hours of verse-aligned data per language, for 10 languages spoken in sub-Saharan Africa. The BibleTTS license is research and commercial friendly: CC-BY-SA. We hope that this corpus will enable advances in speech technology for African languages and also will unlock new techniques in TTS, which require more, higher-quality data.

We described our approach to verse and sentence-level alignment of the original found data with a variety of different resources. We used human evaluation to assess the quality of the resulting alignments, and validated the resulting data by training high-quality speech synthesis models with Coqui TTS.

There are two clear and immediate avenues for future work: (1) verse-level alignment of the remaining four languages (kik, lug, luo, and nya), and (2) improvement of the quality of existing alignments. Given the volume of data per language, it may well be the case that we can be stricter with outlier removal, keeping only the best 20 or 30 hours of data, and obtain even better TTS models. Nevertheless, we have shown that the data can already be used to produce high-quality TTS models (as with Éwé), on both in-domain and out-of-domain text. We plan to update BibleTTS such that we have high-quality verse-level alignments for all ten languages.

7 Acknowledgements

We are very grateful to the Masakhane community and to the volunteers who helped with human evaluation: Richard J. Bonnie, Komlanvi D. Akoly, Komlanvi M. Klove, Ibrahim Haruna, Oluwabusayo O. Awoyomi, Emmanuel Anebi, Christian Kilapi, and Pacifick Taba.