Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Many languages will face extinction in the coming decades. Half of the 7,000 languages spoken worldwide are expected to disappear by the end of this century [Austin and Sallabank2011], and there are too few field linguists to document all of these endangered languages. Innovative speech data collection methodologies [Bird et al.2014, Blachon et al.2016] as well as computational assistance [Adda et al.2016, Stüker et al.2016] were recently proposed to help them in their documentation and description work.
As more and more research addresses computational language documentation [Duong et al.2016, Franke et al.2016a, Godard et al.2016, Anastasopoulos and Chiang2017], there is a need for realistic corpora to fuel reproducible and replicable language studies at the phonetic, lexical and syntactic levels. To our knowledge, very few corpora are available for computational analysis of endangered languages (we are only aware of a Griko-Italian corpus [Lekakou et al.2013] and of a Basaa-French corpus [Hamlaoui et al.2018]).
Our work follows this objective and presents a speech dataset collected following a real language documentation scenario. It is multilingual (Mboshi speech aligned to French text) and contains linguists’ transcriptions in Mboshi (in a non-standard graphemic form close to the language phonology). The corpus is also enriched with automatic forced alignments between speech and transcriptions. The dataset is made available to the research community (it will be made available for free from ELRA, but its current version is online at https://github.com/besacier/mboshi-french-parallel-corpus). This dataset is part of a larger data collection conducted on the Mboshi language and presented in a companion paper [Rialland et al.2018].
The expected impact of this work is the evaluation of efficient and reproducible computational language documentation approaches, in order to face the fast and inexorable extinction of languages.
This paper is organized as follows: after presenting the language of interest (Mboshi) in section 2, we describe the data collection and processing in sections 3 and 4 respectively. Section 5 illustrates its first use for an unsupervised word discovery task. Our spoken term discovery pipeline is also presented and evaluated in this section. Finally, section 6 concludes this work and gives some perspectives.
Mboshi (Bantu C25) is a typical Bantu language spoken in Congo-Brazzaville. It is one of the languages documented by the BULB (Breaking the Unwritten Language Barrier) project [Adda et al.2016, Stüker et al.2016].
Mboshi has a seven-vowel system (i, e, ɛ, a, ɔ, o, u) with an opposition between long and short vowels. Its consonantal system includes the following phonemes: p, t, k, b, d, β, l, r, m, n, ɲ, mb, nd, ndz, ng, mbv, f, s, ɣ, pf, bv, ts, dz, w, j. It has a set of prenasalized consonants (mb, nd, ndz, ng, mbv) which are common in Bantu languages [Embanga Aborobongui2013, Kouarata2014].
While the language can be considered as rarely written, linguists have nonetheless defined a non-standard graphemic form for it, considered to be close to the language phonology. Affricates and prenasalized plosives were coded using multiple symbols (e.g. two symbols for dz, three for mbv). Short and long vowels were coded as V and VV, respectively.
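To make this coding concrete, the following sketch splits a transcribed word into phoneme-like units by greedy longest match against the consonant inventory listed above, keeping a doubled vowel (VV) together as a single unit. It is only an illustration under assumptions: the function names are hypothetical, tone marks are simply stripped before matching, and greedy matching may group symbols across a morpheme boundary. Symbols outside the listed inventory raise an error, since the full grapheme-to-phoneme mapping is not given here.

```python
import unicodedata

# Consonant inventory taken from the text; longest symbols first so that
# multi-symbol units (e.g. "ndz", "mbv") win over their prefixes.
CONSONANTS = sorted(
    ["p", "t", "k", "b", "d", "β", "l", "r", "m", "n", "ɲ", "mb", "nd",
     "ndz", "ng", "mbv", "f", "s", "ɣ", "pf", "bv", "ts", "dz", "w", "j"],
    key=len, reverse=True)
VOWELS = set("ieɛaɔou")
ACUTE = "\u0301"  # combining acute accent, used here for the high tone


def strip_tones(word):
    """Remove high-tone marks (assumption: tones are ignored in this sketch)."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def segment(word):
    """Split a word into consonant and vowel units by greedy longest match."""
    word = strip_tones(word)
    units, i = [], 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            # A doubled vowel is kept as one VV unit.
            if i + 1 < len(word) and word[i + 1] == ch:
                units.append(ch * 2)
                i += 2
            else:
                units.append(ch)
                i += 1
            continue
        for cons in CONSONANTS:
            if word.startswith(cons, i):
                units.append(cons)
                i += len(cons)
                break
        else:
            raise ValueError(f"symbol {ch!r} is not in the listed inventory")
    return units


print(segment("lendúma"))
```

Here `segment("lendúma")` yields the units l, e, nd, u, m, a, with the prenasalized plosive kept as one unit.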
Mboshi displays a complex set of phonological rules. The deletion of a vowel before another vowel in particular, common in Bantu languages, occurs at 40% of word junctions [Rialland et al.2015]. This tends to obscure word segmentation and introduces an additional challenge for automatic processing.
Mboshi words are typically composed of roots and affixes, and almost always include at least one prefix, while the presence of several prefixes and one suffix is also very common. The suffix structure tends to consist of a single vowel V (e.g. -a or -i) whereas the prefix structure may be both CV and V. Most common syllable structures are V and CV, although CCV may arise due to the presence of affricates and prenasalized plosives mentioned above.
The noun class prefix system, another typical feature of Bantu languages, has an unusual rule of deletion targeting the consonant of prefixes (a prefix consonant drops if the root begins with a consonant [Rialland et al.2015]). The structure of the verbs, also characteristic of Bantu languages, is as follows: Subject Marker - Tense/Mood Marker - Root-derivative Extensions - Final Vowel. A verb can be very short or quite long, depending on the markers involved.
The Mboshi prosodic system involves two tones and an intonational organization without downdrift [Rialland and Aborobongui2016]. The high tone is coded using an acute accent on the vowel, while a low-tone vowel has no special marker. Word roots, prefixes and suffixes all bear specific tones, which tend to be realized as such in their surface forms (the distinction between high and low tones is phonological; see [Rialland and Aborobongui2016]). Tonal modifications may also arise from vowel deletion at word boundaries.
A productive combination of tonal contours in words can also take place due to the preceding and appended affixes. These tone combinations play an important grammatical role particularly in the differentiation of tenses. However, in Mboshi, the tones of the roots are not modified due to conjugations, unlike in many other Bantu languages.
We have recently introduced Lig_Aikuma (http://lig-aikuma.imag.fr), a mobile app specifically dedicated to fieldwork language documentation, which works on both Android-powered smartphones and tablets [Blachon et al.2016]. It relies on an initial smartphone application developed by [Bird et al.2014] for the purpose of language documentation with an aim of long-term interpretability. We extended the initial app with the concept of retranslation (to produce oral translations of the initial recorded material) and speech elicitation from text or images (to collect speech utterances aligned to text or images). In that way, human annotation labels can be replaced by weaker signals in the form of parallel multimodal information (images or text in another language). Lig_Aikuma also implements the concept of respeaking initially introduced in [Woodbury2003]. It involves listening to an original recording and repeating what was heard carefully and slowly. This results in a secondary recording that is much easier to analyze later on (by a linguist or by a machine). So far, Lig_Aikuma has been used to collect data in three unwritten African Bantu languages in close collaboration with three major European language documentation groups (LPP, LLACAN in France; ZAS in Germany).
Focusing on Mboshi data, our corpus was built both from translated reference sentences for oral language documentation [Bouquiaux and Thomas1976] and from a Mboshi dictionary [Beapami et al.2000]. Speech elicitation from text was performed by three speakers in Congo-Brazzaville and led to more than 5k spoken utterances. The corpus is split into two parts (train and dev) for which we give basic statistics in Table 1. We shuffled the corpus prior to splitting in order to have comparable distributions in terms of speakers and origin (either [Bouquiaux and Thomas1976] or [Beapami et al.2000]). There is no text overlap for Mboshi transcriptions between the two parts.
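One possible way to obtain a shuffled split with no transcription overlap is sketched below. It is an assumption-laden illustration, not the procedure actually used for the corpus: the record layout (utterance identifier, transcription), the dev ratio and the seed are all hypothetical. Utterances are grouped by transcription before shuffling, so that identical Mboshi texts always fall in the same part.

```python
import random
from collections import defaultdict


def split_corpus(utterances, dev_ratio=0.1, seed=0):
    """Split (utt_id, transcription) pairs into train and dev parts.

    Identical transcriptions are never shared between the two parts.
    dev_ratio and seed are hypothetical values chosen for the example.
    """
    groups = defaultdict(list)
    for utt_id, text in utterances:
        groups[text].append(utt_id)

    # Shuffle the distinct transcriptions, not the utterances themselves.
    texts = sorted(groups)
    random.Random(seed).shuffle(texts)

    n_dev = max(1, round(len(texts) * dev_ratio))
    dev_texts = set(texts[:n_dev])

    train = [(u, t) for u, t in utterances if t not in dev_texts]
    dev = [(u, t) for u, t in utterances if t in dev_texts]
    return train, dev
```

Shuffling before the cut is what keeps the speaker and origin distributions comparable on average; a stratified split would enforce this more strictly, at the cost of extra bookkeeping.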
All the characters of the Mboshi transcription have been checked, in order to avoid multiple encodings of the same character. Some characters have also been transcoded so that a character with a diacritic effectively corresponds to the UTF-8 composition of the individual character with the diacritic. Incorrect sequences of tones (for instance tone on a consonant) have been corrected. It was also decided to remove the elision symbol in Mboshi.
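The kind of check described here can be sketched with Python's standard `unicodedata` module. The text does not state which normalization form was used, so the choice of NFC below is an assumption, as are the function names; the vowel set comes from the inventory given earlier, and the acute accent is the high-tone mark.

```python
import unicodedata

ACUTE = "\u0301"            # combining acute accent (high tone)
VOWELS = set("ieɛaɔou")     # seven-vowel system described in the text


def normalize_transcription(text, form="NFC"):
    """Give every accented character a single, canonical encoding.

    Assumption: NFC is used; vowels without a precomposed accented form
    (e.g. ɛ, ɔ) stay as base character + combining accent.
    """
    return unicodedata.normalize(form, text)


def misplaced_tones(text):
    """Return the positions (in the decomposed text) of tone marks that
    are not attached to a vowel, e.g. a tone on a consonant."""
    decomposed = unicodedata.normalize("NFD", text)
    return [i for i, ch in enumerate(decomposed)
            if ch == ACUTE and (i == 0 or decomposed[i - 1] not in VOWELS)]
```

Comparing strings only after normalization avoids treating two encodings of the same accented vowel as different symbols when counting graphemes or building a lexicon.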
On the French side, the translations were double-checked semi-automatically (using a Linux command followed by a manual process; 3.3% of initial sentences were corrected this way). The French translations were finally tokenized (using the Moses toolkit, http://www.statmt.org/moses/) and lowercased. We give an example of a sentence pair from our corpus in Figure 1.
|Mboshi||wáá ngá iwé léekundá ngá sá oyoá lendúma saa m ótéma|
|French||si je meurs enterrez-moi dans la forêt oyoa avec une guitare sur la poitrine|
As the linguists’ transcriptions did not contain any timing information, timed alignments had to be created. The word- and phoneme-level alignments were produced with what Cooper-Leavitt et al. refer to as ‘a light-weight ASR tool’ [Cooper-Leavitt et al.2017a]. The alignment tool is an ASR system used in forced-alignment mode: it associates the words of the provided orthographic transcription with the corresponding audio segments, using a pronunciation lexicon that represents each word with one or more pronunciations (phonemic forms). The word-position-independent GMM-HMM monophone models are trained using the STK tools at LIMSI [Lamel and Gauvain2015]. A set of 68 phonemes is used in the pronunciation dictionary, with multiple symbols for each vowel representing different tones [Cooper-Leavitt et al.2017b, Bedrosian1996] and a symbol for silence. The acoustic model is estimated iteratively, with 5 rounds of segmentation and parameter estimation, and the model resulting from the last iteration is used to resegment the data. Since vowel elision and morpheme deletion are known to be characteristic of the Mboshi language, variants explicitly allowing their presence or absence are included in the pronunciation lexicon. Details of how this was implemented can be found in [Cooper-Leavitt et al.2017a].
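The idea of listing elision variants in a pronunciation lexicon can be illustrated with a toy sketch. It covers only one hypothetical rule (a word-final vowel may be dropped, as happens before a vowel-initial word) and ignores tones and morpheme deletion; the function name and the phone representation are assumptions, and the actual implementation is the one described in [Cooper-Leavitt et al.2017a].

```python
VOWELS = set("ieɛaɔou")


def pronunciation_variants(phones):
    """Return the full pronunciation plus, when the word ends in a vowel,
    a variant with that final vowel elided.

    phones: list of phone symbols for one word, e.g. ["ng", "a"].
    Hypothetical rule for illustration only.
    """
    variants = [list(phones)]
    # Keep at least one phone so that the word never becomes empty.
    if len(phones) > 1 and phones[-1][0] in VOWELS:
        variants.append(list(phones[:-1]))
    return variants


# Toy lexicon: each word maps to the pronunciations the aligner may choose from.
lexicon = {word: pronunciation_variants(phones)
           for word, phones in {"nga": ["ng", "a"], "saa": ["s", "aa"]}.items()}
```

During forced alignment, the aligner then picks for each word occurrence whichever listed variant best matches the audio.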
In this section, we illustrate how our corpus can be used to evaluate unsupervised discovery of word units from raw speech. This task corresponds to Track 2 in the Zero Resource Speech Challenge 2017 (http://zerospeech.com).
We present below the evaluation metrics used as well as a monolingual baseline system which does not yet take advantage of the French translations aligned to the speech utterances (bilingual approaches may also be evaluated with this dataset, but we leave that for future work).
The evaluation relies on the Token, Type and Boundary metrics. They measure the quality of a word segmentation and of the discovered boundaries with respect to the gold corpus. For these metrics, the precision, recall and F-score are computed; the Token recall is defined as the probability for a gold word token to belong to a cluster of discovered tokens (averaging over all the gold tokens), while the Token precision represents the probability that a discovered token will match a gold token. The F-score is the harmonic mean of these two scores. Similar definitions are applied to the Type and Boundary metrics.
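A simplified, text-level version of these scores can be sketched as follows. This is an assumption-laden illustration: the challenge metrics operate on time-aligned speech intervals and clusters, whereas the sketch compares two segmentations of the same symbol string, and all function names are hypothetical. The last line checks that the F-score of the gold topline on boundaries (precision 53.8, recall 83.5) is indeed the harmonic mean of the two.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(n_correct, n_found, n_gold):
    """Precision, recall and F-score from raw counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall, f_score(precision, recall)


def spans(words):
    """(start, end) offsets of each word in the unsegmented string."""
    out, pos = [], 0
    for word in words:
        out.append((pos, pos + len(word)))
        pos += len(word)
    return out


def token_prf(gold_words, found_words):
    """A token is correct when both of its edges match a gold token."""
    gold, found = set(spans(gold_words)), set(spans(found_words))
    return prf(len(gold & found), len(found), len(gold))


def boundary_prf(gold_words, found_words):
    """Scores on utterance-internal word boundaries only."""
    gold = {end for _, end in spans(gold_words)[:-1]}
    found = {end for _, end in spans(found_words)[:-1]}
    return prf(len(gold & found), len(found), len(gold))


def type_prf(gold_words, found_words):
    """Scores on the sets of distinct word forms (the lexicon)."""
    gold, found = set(gold_words), set(found_words)
    return prf(len(gold & found), len(found), len(gold))


print(round(f_score(53.8, 83.5), 1))  # 65.4
```

For example, with the gold segmentation `["waa", "nga", "iwe"]` and the hypothesis `["waa", "ngai", "we"]`, one of the two internal boundaries is correct, so `boundary_prf` returns 0.5 for precision, recall and F-score, while only one token out of three matches.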
Our baseline for word discovery from speech involves the cascading of the following two modules:
1. unsupervised phone discovery (UPD) from speech: find pseudo-phone units from the spoken input,
2. unsupervised word discovery (UWD): find lexical units from the sequence of pseudo-phone units found in the previous step.
In order to discover a set of phone-like units, we use the KIT system, which is a three-step process. First, phoneme boundaries are found using a BLSTM as described in [Franke et al.2016b]. Then, for each speech segment, articulatory features are extracted (see [Müller et al.2017a] for more details). Finally, segments are clustered based on these articulatory features [Müller et al.2017b]. The number of clusters (pseudo-phones) can thus be controlled during this process.
To perform unsupervised word discovery, we use dpseg [Goldwater et al.2006, Goldwater et al.2009] (http://homepages.inf.ed.ac.uk/sgwater/resources.html). It implements a Bayesian non-parametric approach, where (pseudo-)morphs are generated by a bigram model over a non-finite inventory, through the use of a Dirichlet process. Estimation is performed through Gibbs sampling.
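The core quantity behind this family of models can be illustrated with a deliberately simplified sketch: the predictive probability of a word under a unigram Dirichlet process, which trades off how often the word has already been generated against a base distribution over symbol strings. This is not the bigram model used in our experiments, and the base distribution (uniform symbols, geometric length), the concentration parameter and the function names are assumptions chosen for the example.

```python
from collections import Counter


def base_prob(word, n_symbols, p_stop=0.5):
    """Base distribution P0: uniform pseudo-phones, geometric word length.

    word: tuple of pseudo-phone labels. p_stop is a hypothetical value.
    """
    n = len(word)
    return p_stop * (1 - p_stop) ** (n - 1) * (1.0 / n_symbols) ** n


def dp_word_prob(word, counts, alpha, n_symbols):
    """Predictive probability of `word` given the words generated so far.

    P(w) = (count(w) + alpha * P0(w)) / (N + alpha), with N the total
    number of word tokens generated so far.
    """
    total = sum(counts.values())
    return (counts[word] + alpha * base_prob(word, n_symbols)) / (total + alpha)


# Toy state: two word types already in the segmentation, 5 pseudo-phones.
counts = Counter({("a", "b"): 3, ("c",): 1})
seen = dp_word_prob(("a", "b"), counts, alpha=1.0, n_symbols=5)
unseen = dp_word_prob(("d",), counts, alpha=1.0, n_symbols=5)
```

A Gibbs sampler repeatedly revisits each potential boundary and compares such probabilities for the "one word" and "two words" hypotheses, which is why frequent units are reused while new units remain possible through the base distribution.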
[Godard et al.2016] compare this method to more complex models on a smaller Mboshi corpus, and demonstrate that it produces stable and competitive results, although it tends to oversegment the input, achieving very high recall and a lower precision.
We compare our results to a pure speech-based baseline which does pair-matching using locality-sensitive hashing applied to PLP features and then groups pairs using graph clustering [Jansen and Van Durme2011]. The parameters of this baseline spoken term discovery system are the same as the ones used for the Zero Resource Challenge 2017 [Dunbar et al.2017] (notably, the DTW threshold, which controls the number of spoken clusters found, is set to 0.90 in our experiment).
A topline where dpseg is applied to the gold forced alignments (phone boundaries are considered to be perfect) is also evaluated.
For the pipeline, we experiment with different granularities of the UPD system (5, 30 and 60 units obtained after the clustering step). For each granularity, we perform 3 runs and report the averaged results.
We note that the baseline provided by the system of [Jansen and Van Durme2011] has a low coverage (few matches). Given that our proposed pipeline provides an exhaustive parse of the speech signals, it is guaranteed to have full coverage and, thus, shows better performance according to the Boundary metric. The quality of segmentation, in terms of tokens and types is, however, still low for all systems. Increasing the number of pseudo phone units improves Boundary recall but reduces precision. For Token and Type metrics, a coarser granularity provides slightly better results.
Boundary
|System||Precision||Recall||F-score|
|gold FA phones + dpseg||53.8||83.5||65.4|
|[Jansen and Van Durme2011]||31.9||13.8||19.3|
|UPD+dpseg pipeline (5 units)||27.4||46.5||34.5|
|UPD+dpseg pipeline (30 units)||24.6||58.9||34.7|
|UPD+dpseg pipeline (60 units)||24.4||60.2||34.8|
Token
|System||Precision||Recall||F-score|
|gold FA phones + dpseg||20.8||35.2||26.2|
|[Jansen and Van Durme2011]||4.5||1.6||2.4|
|UPD+dpseg pipeline (5 units)||2.1||4.4||2.9|
|UPD+dpseg pipeline (30 units)||1.4||4.6||2.2|
|UPD+dpseg pipeline (60 units)||1.4||4.7||2.1|
Type
|System||Precision||Recall||F-score|
|gold FA phones + dpseg||7.5||13.9||9.7|
|[Jansen and Van Durme2011]||4.6||4.8||4.7|
|UPD+dpseg pipeline (5 units)||2.5||6.7||3.6|
|UPD+dpseg pipeline (30 units)||2.0||4.5||2.8|
|UPD+dpseg pipeline (60 units)||2.0||4.4||2.7|
We have presented a speech corpus in Mboshi made available to the research community for reproducible computational language documentation experiments. As an illustration, we presented the first unsupervised word discovery (UWD) experiments applied to a truly unwritten language (Mboshi).
The results obtained with our pipeline on pseudo-phones are still far behind those obtained with gold labels, which indicates that there is room for improvement for the UPD module of our pipeline. The UWD module should, in future work, make use of the bilingual data available to improve the quality of the segmentation.
Future work also includes enriching our dataset with some alignments at the word level, in order to evaluate a bilingual lexicon discovery task. This is possible with encoder-decoder approaches, as shown in [Zanon Boito et al.2017].
As we distribute this corpus, we hope that this will help the community to strengthen its effort to improve the technologies currently available to support linguists in documenting endangered languages.
This work was partly funded by the French ANR and the German DFG under grant ANR-14-CE35-0002.
Some experiments reported in this paper were performed during the Jelinek Summer Workshop on Speech and Language Technology (JSALT) 2017 at CMU, Pittsburgh, which was supported by JHU and CMU (via grants from Google, Microsoft, Amazon, Facebook, and Apple).
An attentional model for speech translation without transcription. In Proceedings of NAACL-HLT, pages 949–959.
Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC.