The objective of language-universal speech recognition is to indiscriminately process utterances from anywhere in the world and produce intelligible transcriptions of what was said [kohler2001multilingual, schultz2001language]. In order to be truly universal, recognition systems need to encompass not only speech from many languages, but also intra-sentential code-switched speech [bullock2009cambridge, li2019codeswitch], speech with accents or otherwise non-standard pronunciations [coupland2007style, sun2018domain], and speech from languages without known written forms [himmelmann2006language, hillis2019unsupervised].
Language-universal speech recognition requires phonological units that are agnostic to any particular language such as articulatory features [stuker2003integrating, livescu2016articulatory, li2020towards] or global phones [schultz2002globalphone, li2020universal], which can be annotated through examination of audio data. While recent advancements in the related field of multilingual speech recognition have significantly improved the language coverage of a single system [adams2019massively, pratap2020massively], these works differ in that they operate on language-specific levels of surface vocabulary units [li2019bytes] or phonemic units that are defined with reference to the unique phonological rules of each language [dalmia2018sequence]. Prior works have avoided universal phone level annotation by implicitly incorporating this knowledge in shared latent representations that map to language-specific phonemes with neural nets [stolcke2006cross, vesely2012language, dalmia2018sequence].
Another approach is to learn explicit universal phone representations by relating language-specific units to their universal phonetic distinctions. Instead of relying on phone annotations, these prior works approximate universal phonological units through statistical acoustic-phonetic methods [kohler2001multilingual] or phone-to-phoneme realization rules [mortensen-etal-2020-allovera, li2020universal]. Unlike the implicit latent approach, this method allows for language-universal prediction. However, performance is dependent on the clarity of phone-phoneme dynamics in the selected training languages [kohler1996multi, li2020universal].
We are interested in systems that can incorporate the strengths of both the implicit and explicit approaches to representing universal phones. In particular, we are interested in language-universal automatic speech recognition (ASR) systems that can 1) explicitly represent universal phones and language-specific phonemes, 2) be built using only automatically generated grapheme-to-phoneme annotations and phone-to-phoneme rules, 3) resolve naturally ambiguous phone-to-phoneme mappings using information from other languages, and 4) learn interpretable probabilistic weights of each mapping.
In this work, we seek to incorporate these desiderata in a phone-based speech recognition system. We first propose a general framework to represent phone-to-phoneme rules as differentiable allophone graphs using weighted finite-state transducers [mohri2002weighted, moritz2020semi, doetsch2017returnn, shao2020pychain, hannun2020differentiable, k2] to probabilistically map phone realizations to their underlying language-specific phonemes (§3.1). We then incorporate these differentiable allophone graphs in a multilingual model with a universal phone recognizing layer trained in an end-to-end manner, which we call the AlloGraph model (§3.2). We show the efficacy of the AlloGraph model in predicting phonemes for 7 seen languages and predicting phones for 2 unseen languages with comparison to prior works (§5). More importantly, we show that our model resolves the ambiguity of manifold phone-to-phoneme mappings with an analysis of substitution errors and an examination of the interpretable allophone graph weights (§5.2). Finally we demonstrate our phone-based approach in two linguistic applications: pronunciation variation and allophone discovery (§LABEL:sec:applications).
2 Background and Motivation
In this section, we first introduce phone-to-phoneme mappings for manufacturing phone supervision from phoneme annotations (§2.1). Then we discuss shortcomings of a baseline method representing mappings as a pass-through matrix (§2.2) to motivate our graph-based framework in the subsequent section (§3).
2.1 Phonological Units
2.1.1 Language-Specific Phonemes vs. Universal Phones
A phone $n$ is a unit of spoken sound within a universal set $\mathcal{N}$ which is invariant across all languages, where $\mathcal{N} = \{n_1, \dots, n_{|\mathcal{N}|}\}$ consists of $|\mathcal{N}|$ total phones [schultz2002globalphone]. In contrast, a phoneme $m^{(l)}$ is a unit of linguistically contrastive sound for a given language $l$ within a language-specific set $\mathcal{M}^{(l)}$, where $\mathcal{M}^{(l)} = \{m^{(l)}_1, \dots, m^{(l)}_{|\mathcal{M}^{(l)}|}\}$ consists of $|\mathcal{M}^{(l)}|$ total phonemes [mortensen2018epitran]. Phonemes defined for different languages describe different underlying sounds. Multilingual systems that conflate phonemes across languages have been shown to perform worse than those that treat phonemes as language-specific [kohler1996multi, li2020universal].
2.1.2 Phone-to-Phoneme Mappings
For each language $l$, the phone-to-phoneme mappings are defined as a series of tuples, $(n_i, m_j^{(l)})$, where $m_j^{(l)} \in \mathcal{M}^{(l)}$ and $n_i \in \mathcal{N}^{(l)} \subseteq \mathcal{N}$ for some subset $\mathcal{N}^{(l)}$ of phones that occur as realizations in the language. Each phoneme has one or more phone realizations, and not all universal phones are necessarily mapped to a phoneme grounding in a particular language. Note that mappings may be imperfect in our resources [mortensen-etal-2020-allovera].
Phone-to-phoneme mappings can be one-to-one, but often the relationships are manifold. As shown in Figure 1, many-to-one mappings are found in scenarios where multiple phones are allophones, or different realizations, of the same phoneme. This is the prototypical mapping type. One-to-many mappings also occur for duplicitous phones that are mapped to multiple phonemes.¹ Furthermore, many-to-one and one-to-many mappings can occur together in various many-to-many forms.

¹ These occur in resources like [mortensen-etal-2020-allovera] when the source conflates allophonic and morphophonemic alternations, in instances of archiphonemic underspecification and neutralization (e.g. treating Japanese [m] as a realization of both /m/ and /N/ or English [ɾ] as a realization of both /t/ and /d/ as in writer [aj] and rider [a:j]), or, spuriously, when the grapheme-phoneme mapping is complex.
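The four mapping types can be made concrete with a small sketch. It treats the tuples as edges of a bipartite graph and labels each connected component by how many phones and phonemes it contains. The inventory in `PAIRS` and the helper name `classify_mappings` are hypothetical and chosen for illustration; they are not taken from AlloVera.

```python
from collections import defaultdict

# Hypothetical (phone, phoneme) tuples for one language; illustrative only.
PAIRS = [
    ("a", "/a/"),                                  # one-to-one
    ("t", "/t/"), ("t_h", "/t/"),                  # many-to-one (allophones)
    ("m", "/m/"), ("m", "/N/"),                    # one-to-many
    ("d", "/d/"), ("r", "/d/"), ("r", "/r/"),      # many-to-many
]


def classify_mappings(pairs):
    """Group tuples into connected components and label each component."""
    phone_to_phonemes = defaultdict(set)
    phoneme_to_phones = defaultdict(set)
    for phone, phoneme in pairs:
        phone_to_phonemes[phone].add(phoneme)
        phoneme_to_phones[phoneme].add(phone)

    seen, components = set(), []
    for start in phone_to_phonemes:
        if start in seen:
            continue
        phones, phonemes, stack = set(), set(), [start]
        while stack:  # depth-first walk over the bipartite graph
            phone = stack.pop()
            if phone in phones:
                continue
            phones.add(phone)
            for phoneme in phone_to_phonemes[phone]:
                phonemes.add(phoneme)
                stack.extend(phoneme_to_phones[phoneme] - phones)
        seen |= phones
        if len(phones) == 1 and len(phonemes) == 1:
            kind = "one-to-one"
        elif len(phonemes) == 1:
            kind = "many-to-one"
        elif len(phones) == 1:
            kind = "one-to-many"
        else:
            kind = "many-to-many"
        components.append((kind, sorted(phones), sorted(phonemes)))
    return components


for kind, phones, phonemes in classify_mappings(PAIRS):
    print(kind, phones, phonemes)
```

Labeling whole components rather than single tuples matters for the many-to-many case, where no individual tuple reveals the type on its own.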
2.1.3 Manufacturing Phone-Level Supervision
Since phones are fine-grained distinctions of spoken sounds in the universal space, phonemes are only fuzzy approximations. Multilingual sharing between diverse languages is required to properly learn phonetic distinctions. Consider the following:
One-to-One: If a phone is mapped one-to-one with a phoneme, then the learned phone representation will directly correspond to one supervising phoneme. In the multilingual setting, these direct mappings help other languages disambiguate this phone.
One-to-Many: If a phone is mapped to many phonemes, then each phoneme provides supervision in proportion to its prior distribution. If the learned phoneme representations are mapped from the learned phone, phoneme confusions occur if the one-to-many mappings are not disambiguated. This ambiguity persists despite information sharing from other languages.
Many-to-One: If many phones are mapped to a phoneme, each phone receives the same supervision. A second language with complementary mappings is required to learn distinct phones.
Many-to-Many: When one-to-many and many-to-one mappings occur together, they can take various forms. Generally, the many-to-one portions can be resolved through multilingual sharing but the one-to-many portions would still be problematic.
2.2 Encoding Phone-to-Phoneme as Pass-through Matrix
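Prior works have shown that phone-to-phoneme mappings can be encoded as pass-through layers that convert a phone distribution into a phoneme distribution [li2020universal]. This phone-to-phoneme encoding, which we call AlloMatrix, is a sparse matrix $A^{(l)} \in \{0,1\}^{|\mathcal{N}| \times |\mathcal{M}^{(l)}|}$ where each tuple $(n_i, m_j^{(l)})$ in the mappings described in §2.1.2 is represented by $a_{i,j}^{(l)} = 1$. The AlloMatrix transforms a logit vector of phones, $\mathbf{p}^{\mathcal{N}}$, to a logit vector of phonemes, $\mathbf{p}^{\mathcal{M}^{(l)}}$, by the dot product of the $j$th column of $A^{(l)}$ with each phone logit $p^{n_i}$:

$$p^{m_j^{(l)}} = \sum_{i} a_{i,j}^{(l)} \, p^{n_i} \qquad (1)$$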
In many-to-one mappings, this amounts to summing the phone contributions, which is in accordance with our desired mapping of allophones in §2.1.2. However, in one-to-many mappings a phone logit is broadcast equally to each of the phonemes. This disagrees with the definition of phone realization. Rather, we state that a realized phone in an utterance is grounded to each of the mapped phonemes with some probability.
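The broadcasting behavior can be seen in a toy pass-through computation. The sketch below builds a binary matrix from hypothetical mappings and applies the column-wise sum described above; the inventories, logit values, and function names are illustrative assumptions, not the baseline's actual code.

```python
# Hypothetical toy inventories and mappings (not taken from any real resource).
PHONES = ["a", "t", "t_h", "m"]
PHONEMES = ["/a/", "/t/", "/m/", "/N/"]
PAIRS = [("a", "/a/"), ("t", "/t/"), ("t_h", "/t/"), ("m", "/m/"), ("m", "/N/")]


def build_allomatrix(phones, phonemes, pairs):
    """Binary |phones| x |phonemes| matrix with a 1 for every mapped tuple."""
    matrix = [[0] * len(phonemes) for _ in phones]
    for phone, phoneme in pairs:
        matrix[phones.index(phone)][phonemes.index(phoneme)] = 1
    return matrix


def pass_through(matrix, phone_logits):
    """Phoneme logit j is the dot product of column j with the phone logits."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        sum(matrix[i][j] * phone_logits[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


A = build_allomatrix(PHONES, PHONEMES, PAIRS)
phone_logits = [0.5, 1.0, 2.0, 3.0]          # logits for a, t, t_h, m
phoneme_logits = pass_through(A, phone_logits)
print(dict(zip(PHONEMES, phoneme_logits)))
```

In this toy case /t/ receives the sum of its two allophones (1.0 + 2.0), while the single logit of [m] (3.0) is copied unchanged to both /m/ and /N/, so the two phonemes cannot be told apart from the phone evidence alone.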
3 Proposed Framework
3.1 Encoding Phone-to-Phoneme as WFST
We define the allophone graph for language $l$, denoted by $G^{(l)}$, to be a single-state weighted finite-state transducer (WFST) with a transition function $\pi(n_i, m_j^{(l)})$ giving each phone-to-phoneme mapping and a corresponding weight function $w(n_i, m_j^{(l)})$ giving the likelihood that $n_i$ is the phonetic realization of $m_j^{(l)}$ for each transition. The allophone graph accepts phone emission probabilities $E^{\mathcal{N}}$ and transduces them into phoneme emission probabilities $E^{\mathcal{M}^{(l)}}$ through WFST composition [mohri2002weighted], which is denoted as $\circ$:

$$E^{\mathcal{M}^{(l)}} = E^{\mathcal{N}} \circ G^{(l)} \qquad (2)$$
This WFST is an analogous data structure to the aforementioned matrix in §2.2, but this graphical representation of phone-to-phoneme mappings as arcs in a probabilistic transduction allows us to make two key intuitive determinations. First, many-to-one mappings are transductions of several phones into the same phoneme and therefore the phoneme posterior is given by summing over the input phone posteriors, as is also done in §2.2. Second, one-to-many mappings are transductions splitting the posterior of a single phone to several phoneme posteriors, depending on how likely those phonemes are to be groundings of the phone. In §2.2, the broadcasting method fails to do this probabilistic splitting in one-to-many scenarios, creating ambiguity.
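A minimal sketch of the two determinations above, assuming that composing one frame of phone posteriors with a single-state transducer reduces to a weighted sum over arcs. This is a plain-Python stand-in rather than a call into a WFST toolkit, and the arc weights are hypothetical.

```python
# Arcs of a hypothetical single-state allophone graph:
# (input phone, output phoneme) -> weight. Values are illustrative only.
ARCS = {
    ("a", "/a/"): 1.0,
    ("t", "/t/"): 1.0,      # many-to-one: both t and t_h lead to /t/
    ("t_h", "/t/"): 1.0,
    ("m", "/m/"): 0.7,      # one-to-many: the posterior of m is split
    ("m", "/N/"): 0.3,
}


def transduce(phone_posteriors, arcs):
    """Map one frame of phone posteriors to phoneme posteriors via the arcs."""
    phoneme_posteriors = {}
    for (phone, phoneme), weight in arcs.items():
        phoneme_posteriors[phoneme] = (
            phoneme_posteriors.get(phoneme, 0.0) + weight * phone_posteriors[phone]
        )
    return phoneme_posteriors


frame = {"a": 0.1, "t": 0.2, "t_h": 0.3, "m": 0.4}
out = transduce(frame, ARCS)
print(out)
```

Here /t/ collects the mass of both of its allophones, and the mass of [m] is divided between /m/ and /N/ according to the arc weights instead of being copied to both.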
3.2 Phone Recognition with Allophone Graphs
In this section, we apply the allophone graphs as differentiable WFST [mohri2002weighted, moritz2020semi, doetsch2017returnn, shao2020pychain, hannun2020differentiable, k2] layers in phone-based ASR systems optimized with only multilingual phoneme supervision.
Table 1: Phoneme error rate (%) on the seen languages (Eng through Jav and their total) and phone error rate (%) on the unseen languages (Tusom, Inuktitut, and their total).

|Model Type|Model Name|Uses Phones|Eng|Tur|Tgl|Vie|Kaz|Amh|Jav|Total (seen)|Tusom|Inuktitut|Total (unseen)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Phoneme-Only|Multilingual-CTC [dalmia2018sequence]|✗|25.3|27.7|28.5|31.9|31.5|28.6|35.2|29.8|No phone predictions| | |
|AlloGraph|Our Proposed Model|✓|26.0|28.6|28.2|31.9|32.5|29.1|36.2|30.5|81.2|85.8|84.1|
|AlloGraph|+ Universal Constraint (UC)|✓|27.3|28.7|29.9|32.5|35.1|30.9|36.6|31.6|80.5|79.9|80.2|
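In this work, we use the connectionist temporal classification (CTC) network [graves2006connectionist, miao2015eesen], where a language-universal Encoder maps an input sequence $\mathbf{x} = [x_1, \dots, x_T]$ to a sequence of hidden representations $\mathbf{h} = [h_1, \dots, h_T]$, where $h_t \in \mathbb{R}^d$. The phone emission probabilities $E^{\mathcal{N}}$ are given by the affine projection of $\mathbf{h}$ followed by the softmax function, denoted as SoftmaxOut.² To handle the blank token used in CTC to represent the null emission [graves2006connectionist], we add the transition $\langle\text{blank}\rangle \rightarrow \langle\text{blank}\rangle$ as an additional arc in the language-specific allophone graphs $G^{(l)}$. Phone and phoneme emissions are thus given by:

$$\mathbf{h} = \text{Encoder}(\mathbf{x}) \qquad (3)$$
$$E^{\mathcal{N}} = \text{SoftmaxOut}(\mathbf{h}) \qquad (4)$$
$$E^{\mathcal{M}^{(l)}} = E^{\mathcal{N}} \circ G^{(l)} \qquad (5)$$

² In training, logits corresponding to unmapped phones in a particular language are masked prior to being softmax normalized, similar to [dalmia2019plm].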
Equation 5 shows the CTC-specific form of the general phone-to-phoneme emission transduction shown in Equation 2. During training, we maximize the likelihood of the ground-truth phonemes $\mathbf{y} = [y_1, \dots, y_S]$, where $y_s \in \mathcal{M}^{(l)}$ and $S$ is the length of the ground truth, which is at most the length of the input $T$, by marginalizing over all possible CTC alignments using the forward-backward computation [graves2006connectionist, miao2015eesen].
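The marginalization over alignments can be illustrated with the forward half of the forward-backward computation. The sketch below works in the probability domain on a tiny hand-written emission matrix and checks the result against brute-force enumeration of all alignments. It is a didactic assumption-laden example: practical toolkits work in the log domain, and the emission values and function names here are made up.

```python
from itertools import product


def ctc_likelihood(emissions, target, blank=0):
    """Sum the probability of every CTC alignment of `target` (forward pass).

    emissions: list of T rows, each a list of per-label probabilities.
    target:    list of label ids without blanks.
    """
    extended = [blank]
    for label in target:
        extended += [label, blank]          # blank-interleaved target
    n_states = len(extended)
    alpha = [0.0] * n_states
    alpha[0] = emissions[0][extended[0]]
    if n_states > 1:
        alpha[1] = emissions[0][extended[1]]
    for t in range(1, len(emissions)):
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                # stay in the same state
            if s >= 1:
                total += alpha[s - 1]       # advance by one state
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]       # skip the blank between two labels
            new_alpha[s] = total * emissions[t][extended[s]]
        alpha = new_alpha
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


def collapse(path, blank=0):
    """Merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def brute_force(emissions, target, blank=0):
    """Reference value: enumerate every frame-level path explicitly."""
    total = 0.0
    for path in product(range(len(emissions[0])), repeat=len(emissions)):
        if collapse(path, blank) == list(target):
            prob = 1.0
            for t, label in enumerate(path):
                prob *= emissions[t][label]
            total += prob
    return total


# Three frames over the labels {0: blank, 1: /a/, 2: /t/}; values are made up.
E = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]]
print(ctc_likelihood(E, [1, 2]), brute_force(E, [1, 2]))
```

The dynamic program visits each frame once per state, whereas the enumeration grows exponentially in the number of frames, which is why the former is used for training.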
We refer to this multilingual CTC architecture with allophone graphs as our proposed AlloGraph model. In the vanilla AlloGraph, we allow the weights of $G^{(l)}$ to freely take on any values. This is a loose coupling of phone and phoneme emissions where each $G^{(l)}$ may amplify or reduce the phone posteriors; for instance, this allows $G^{(l)}$ to learn cases where a phone is universally rare but is a prominent realization in language $l$.
While loose coupling of phone and phoneme emissions is beneficial to language-specific phoneme recognition, it dilutes supervision to the universal phone layer. We address this by enforcing a tight coupling of phone and phoneme emissions such that the phone posterior is only isometrically transformed: $\sum_{m_j^{(l)} \in \mathcal{M}'} w(n_i, m_j^{(l)}) = 1$, where $\mathcal{M}' \subseteq \mathcal{M}^{(l)}$ is the subset of phonemes that $n_i$ is mapped to in language $l$. Now, Equation (5) exactly sums phone posteriors for many-to-one and splits phone posteriors for one-to-many in the manner that we desire, as stated in §3.1. We call this tightly-coupled variant the AlloGraph + Universal Constraint (UC) model.
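One way to picture the tight coupling is to normalize the arcs leaving each phone so that their weights sum to one. The sketch below does this with a per-phone softmax over unconstrained scores; the softmax parameterization is an assumption made for illustration, and the scores are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical unconstrained arc scores: (phone, phoneme) -> real number.
RAW_SCORES = {
    ("a", "/a/"): 0.3,
    ("t", "/t/"): -1.2,
    ("t_h", "/t/"): 2.0,
    ("m", "/m/"): 1.0,
    ("m", "/N/"): 0.0,
}


def normalize_arcs(raw_scores):
    """Softmax over the arcs leaving each phone, so their weights sum to 1."""
    by_phone = defaultdict(list)
    for (phone, phoneme), score in raw_scores.items():
        by_phone[phone].append((phoneme, score))
    weights = {}
    for phone, arcs in by_phone.items():
        top = max(score for _, score in arcs)          # for numerical stability
        norm = sum(math.exp(score - top) for _, score in arcs)
        for phoneme, score in arcs:
            weights[(phone, phoneme)] = math.exp(score - top) / norm
    return weights


def transduce(phone_posteriors, arcs):
    """Weighted sum over arcs, as in a single-state transduction of one frame."""
    out = defaultdict(float)
    for (phone, phoneme), weight in arcs.items():
        out[phoneme] += weight * phone_posteriors[phone]
    return dict(out)


WEIGHTS = normalize_arcs(RAW_SCORES)
frame = {"a": 0.1, "t": 0.2, "t_h": 0.3, "m": 0.4}
phoneme_posteriors = transduce(frame, WEIGHTS)
print(WEIGHTS)
print(phoneme_posteriors)
```

Under this constraint a phone with a single outgoing arc always gets weight 1, so many-to-one mappings reduce to a plain sum, and the total probability mass of the frame is preserved when it moves from the phone level to the phoneme level.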
4 Data and Experimental Setup
We use the English LDC Switchboard Dataset [godfrey1993switchboard, ldceval2kctm, ldceval2kspeech] and 6 languages from the IARPA BABEL Program: Turkish, Tagalog, Vietnamese, Kazakh, Amharic and Javanese [babel]. These datasets contain 8kHz recordings of conversational speech, each containing around 50 to 80 hours of training data, with the exception of around 300 hours for English. We also consider two indigenous languages with phone-level annotations, Tusom [mortensen2021tusom2021] and Inuktitut, during evaluation only. We obtain phonemic annotations using Epitran for automatic grapheme-to-phoneme conversion [mortensen2018epitran] and phone-to-phoneme rules from AlloVera [mortensen-etal-2020-allovera].
Experimental Setup: All our models were trained using the ESPnet toolkit [watanabe2018espnet] with differentiable WFSTs implemented using the GTN toolkit [hannun2020differentiable]. To prepare our speech input features we first upsample the audio to 16kHz, augment it by applying a speed perturbation of and , and then extract global mean-variance normalized 83-dimensional log-mel filterbank and pitch features. Input frames are processed by an audio encoder with convolutional blocks to subsample by [watanabe2018espnet] before feeding to transformer-encoder blocks with a feed-forward dim of , attention dim of , and attention heads. We augment our data with the Switchboard Strong (SS) augmentation policy of SpecAugment [specaug] and apply a dropout of for the entire network. We use the Adam optimizer to train 100 epochs with an inverse square root decay schedule, a transformer-lr scale [watanabe2018espnet] of , k warmup steps, and an effective batchsize of .
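For readers unfamiliar with the inverse square root decay schedule, the sketch below assumes the standard Transformer formulation: a linear warmup followed by decay proportional to the inverse square root of the step, scaled by the attention dimension. The values of `SCALE`, `ATTENTION_DIM`, and `WARMUP_STEPS` are placeholders for illustration and are not the settings used in this work.

```python
def inverse_sqrt_lr(step, scale, attention_dim, warmup_steps):
    """Learning rate at a 1-indexed optimizer step.

    Rises linearly for `warmup_steps` steps, then decays as step ** -0.5.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return scale * attention_dim ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )


# Placeholder hyperparameters, chosen only to make the example runnable.
SCALE, ATTENTION_DIM, WARMUP_STEPS = 1.0, 256, 1000

for step in (1, 500, 1000, 4000, 16000):
    lr = inverse_sqrt_lr(step, SCALE, ATTENTION_DIM, WARMUP_STEPS)
    print(step, lr)
```

The two branches of the `min` meet exactly at the end of warmup, which is where the learning rate peaks; quadrupling the step count after that point halves the rate.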
5 Results
In Table 1, we show the results of our AlloGraph and AlloGraph + UC models. As mentioned in §4, we use Tusom and Inuktitut as two unseen languages with phone-level annotations to evaluate our language-universal predictions; since these languages are unseen, our model does not know their phoneme sets or which phones appear as realizations, allowing us to assess how universal our phone-based predictions are. On these two unseen languages, our AlloGraph model outperforms our AlloMatrix baseline based on [li2020universal] by an average of 9.9 phone error rate points. When using the Universal Constraint described in §3.2, our approach gains an additional 3.9 points of phone error rate improvement. The AlloGraph models make fewer substitution errors than the AlloMatrix baseline, and the substitutions are also less severe; we examine these improvements further in §5.1.
Table 1 also shows the language-specific phoneme-level performance of the AlloGraph model on 7 seen languages. Note that these languages are annotated with phonemes as described in §4 but not with phones. Here our AlloGraph model slightly outperforms the AlloMatrix baseline, but both show degradation compared to our Phoneme-Only³ baseline based on [dalmia2018sequence]. We observe that models placing emphasis on learning universal phones do so with some cost to the language-specific level.

³ Phoneme-Only [dalmia2018sequence] directly maps the shared Encoder hidden states to a language-specific phoneme-level SoftmaxOut, replacing the shared phone level in Equation (4). Thus there are no phone predictions.
The AlloGraph is advantageous in jointly modeling phones and phonemes compared to the AlloMatrix baseline due to learned disambiguations of phone-to-phoneme mappings; we examine this benefit further in §5.2.
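Phoneme and phone error rates of the kind reported in Table 1 are conventionally computed from a minimum-edit alignment between the reference and the hypothesis. The sketch below assumes that convention (substitutions, deletions, and insertions divided by the reference length); the example sequences are invented and the function names are hypothetical.

```python
def edit_ops(ref, hyp):
    """Counts (substitutions, deletions, insertions) of a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total cost, substitutions, deletions, insertions).
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)          # delete every reference symbol
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)          # insert every hypothesis symbol
    for i in range(1, rows):
        for j in range(1, cols):
            cost, subs, dels, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, subs, dels, ins)
            else:
                diagonal = (cost + 1, subs + 1, dels, ins)
            cost, subs, dels, ins = table[i - 1][j]
            deletion = (cost + 1, subs, dels + 1, ins)
            cost, subs, dels, ins = table[i][j - 1]
            insertion = (cost + 1, subs, dels, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    return table[-1][-1][1:]


def error_rate(ref, hyp):
    """Error rate in percent: (S + D + I) / len(ref) * 100."""
    subs, dels, ins = edit_ops(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


reference = ["k", "a", "t", "u"]            # invented phone sequences
hypothesis = ["k", "a:", "u"]
print(edit_ops(reference, hypothesis), error_rate(reference, hypothesis))
```

Keeping the three operation counts separate, rather than only the total cost, is what allows substitution errors to be isolated for the analysis in §5.1.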
5.1 Universal Phone Recognition for Unseen Languages
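Table 3: Top phone confusion pairs and their counts for each model.

|Model|Confusion pair|Count|Confusion pair|Count|
|---|---|---|---|---|
|AlloMatrix| |15|[a] |13|
|AlloMatrix| |13|[i] |13|
|AlloMatrix| [s’]|17|[u] [s’]|23|
|AlloGraph|[i] [i:]|2|[a] [*̄A]|3|
|AlloGraph|[k] [kp]|4|[u] [o]|4|
|AlloGraph|[a] [a:]|2|[a] [a:]|2|
|AlloGraph + UC|[a] |4|[q] [k]|2|
|AlloGraph + UC| |2|[a] |4|
|AlloGraph + UC|[a] [A]|2|[i] [I]|2|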
As shown in Table 2, the improvements of the AlloGraph models over the AlloMatrix baseline come from reduced phone substitution errors. In addition to making fewer substitution errors, the AlloGraph models also make less severe substitutions than the AlloMatrix baseline. We quantify this severity by computing the average distance between the articulatory feature vectors [mortensen-etal-2016-panphon] of the ground-truth and incorrectly predicted phones over all substitution errors. Compared to the AlloMatrix, the substitutions made by the AlloGraph and AlloGraph + UC models are 31% and 37% closer in articulatory feature distance (AFD), respectively.
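The severity measure can be sketched as follows, assuming a normalized Hamming distance between feature vectors. The five-feature vectors in `FEATURES` are a hypothetical toy table for illustration, not PanPhon's actual feature values, and the exact distance used in this work may differ.

```python
# Toy articulatory features: (syllabic, voiced, high, back, long), +1 or -1.
# Hypothetical values for illustration only.
FEATURES = {
    "a":  (+1, +1, -1, -1, -1),
    "a:": (+1, +1, -1, -1, +1),
    "i":  (+1, +1, +1, -1, -1),
    "k":  (-1, -1, +1, +1, -1),
}


def feature_distance(phone_a, phone_b, features=FEATURES):
    """Fraction of articulatory features on which two phones differ."""
    vec_a, vec_b = features[phone_a], features[phone_b]
    differing = sum(1 for x, y in zip(vec_a, vec_b) if x != y)
    return differing / len(vec_a)


def average_afd(substitutions, features=FEATURES):
    """Mean feature distance over (ground truth, prediction) substitution pairs."""
    distances = [feature_distance(ref, hyp, features) for ref, hyp in substitutions]
    return sum(distances) / len(distances)


mild = [("a", "a:"), ("a", "a:")]           # vowel confused with a longer vowel
severe = [("a", "k"), ("i", "k")]           # vowels confused with a plosive
print(average_afd(mild), average_afd(severe))
```

In this toy table, a vowel-to-plosive substitution differs on three or four of the five features, while a length-only vowel confusion differs on one, which is the kind of gap the averaged measure is meant to expose.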
The high AFD of the AlloMatrix baseline results from degenerate behavior in which vowels are frequently confused for plosives, as shown by the top confusion pairs in Table 3. On the other hand, the top confusion pairs of the AlloGraph models are between related vowels which are proximate in the articulatory feature space. Thus the AlloGraph models produce intelligible phone transcriptions, while the AlloMatrix model fails. For qualitative examples of phone recognition, please see §LABEL:sec:qualitative.
5.2 Probabilistic Phone-to-Phoneme Disambiguation
An added benefit of our model is the ability to interpret the weights of learned AlloGraphs, which show disambiguations of ambiguous phone-to-phoneme mappings. As shown in Figure 2, our AlloGraph + UC model distributes phone emissions to multiple phonemes in the one-to-many and many-to-many scenarios. These probabilities can be interpreted as prior distributions of each mapping captured by the allophone graph and can be used to determine the relative dominance of each arc in manifold mappings that can be otherwise difficult to explain.