In recent years there has been considerable research devoted to reducing the amount of human effort required to build an automatic speech recognition (ASR) system for a new language. Conventional ASR training requires large quantities of manually-transcribed training data, as well as a hand-crafted pronunciation dictionary. Recent grapheme-based hybrid-HMM approaches  have shown success at removing the need for explicit pronunciation knowledge, whilst more recent end-to-end systems 
have removed the need for a lexicon entirely by modelling output tokens at the character or word-piece level. However, transcribed training data is typically still required, with end-to-end systems being particularly data hungry.
The process of manual transcription can be extremely time-consuming and expensive. Consequently a body of research has focused on reducing the need for such data, for example through the use of approximately-matching “in-the-wild” speech and text data, known as lightly-supervised training , and through the use of unlabelled data transcribed with a seed model, known as semi-supervised training . However, in both cases, manual expertise is required to train the initial model.
In his 2012 position paper, Glass  described the road towards unsupervised speech processing through a set of scenarios that, he noted, “might seem increasingly outlandish and impractical”. He suggested a move from “expert-based” systems, with a dictionary and phoneme set provided, through “data-based” systems with parallel speech and text data, to what he called “decipher-based” systems, through which ASR training could be achieved using entirely untranscribed speech, together with unpaired text data. This scenario has the significant advantage that, at least for any language with a significant web presence, both resources are likely to be relatively abundant without any human effort.
Since Glass’s paper, significant effort has been devoted to this so-called “zero-resource” scenario. Approaches to this problem tend to fall into two categories: those attempting to learn phoneme- or word-like patterns from speech in a bottom-up manner, often motivated by child speech learning [6, 7]; and those using cross-lingual information to inform the target model. The latter category extends a long strand of research into cross-lingual ASR methods – which seek to improve supervised training on a target language through the use of out-of-language data – to the case where no transcribed data exists for the target language. There have been a variety of recent approaches to this problem, all of which in one way or another address the problem of matching the modelling units of the out-of-language model to meaningful units in the target language. The earliest approaches used IPA-based phone mapping schemes , whilst more recent related methods have used automatic multilingual pronunciation mining from the web  or cross-lingual transfer from languages with similar orthographies . Other work uses knowledge of compositional phonetics in the target language to remove the need for target-language resources, building on earlier supervised approaches , whilst a further line of work used a semi-supervised approach, extending an initial lexicon using unpaired phonemic transcripts and text data .
Separately, there has been significant work towards building language-universal systems, generally with shared phonetic knowledge . These approaches can be problematic due to differing phonotactics between languages , though language-specific embeddings may be used . Again, these methods require knowledge of pronunciations in the target language in order to produce word output. Purely graphemic multilingual systems have been developed  but require the target language to be in the training set; supervised transliteration approaches have been used in the context of end-to-end systems .
Pure bottom-up approaches to zero-resource ASR, whilst interesting, have not generally yielded state-of-the-art ASR performance when compared to cross-lingual methods. A notable exception uses Facebook’s wav2vec 2.0 architecture  to produce phone-like sequences in an entirely bottom-up manner, which are then mapped to phonemized text sequences using an adversarial objective . Importantly, however, the authors of this work relied on manually-obtained phone units and a system trained on a large hand-crafted pronunciation dictionary to achieve their results, noting that it is easier to learn a mapping between the speech audio and phone units. Further, the wav2vec 2.0 architecture is exceptionally computationally intensive and data hungry, making the method unavailable to those not wishing to use Facebook’s pre-trained models.
We believe that no prior research has yet achieved Glass’s vision of removing both the need for transcribed audio data and human phonetic knowledge of the target language, thus building a system with no expert input. In this paper, we propose to return to his original term of “decipher-based” systems. Inspired by this vision, and by similar work on unsupervised transliteration and machine translation [21, 22], we present a method for deciphering speech data using only mismatched text data from the language of interest. Our method starts with a cross-lingual approach, taking a universal phone recogniser trained on a variety of source languages. However, absolutely no phonetic knowledge of the target language is used, making this method applicable in principle to almost any language. Furthermore, we find the technique to be extremely data efficient, requiring just 20 minutes of speech data to achieve a successful decipherment on read speech data. We follow this with rounds of specially-designed semi-supervised training (SST) using 25 hours of untranscribed speech data to achieve results that can be competitive with a conventional supervised approach.
2 Zero-Resource Cross-Lingual Transfer
Our method for zero-resource cross-lingual transfer uses a three-stage approach. First, a universal phone recogniser transcribes audio into phones. Then, we decipher this phone sequence into graphemes of the target language – for this, only language models trained on target-language text data are required. Finally, a flat-start semi-supervised training procedure is used to train a new acoustic model using the deciphered pseudo-labels. The complete pipeline is illustrated in Figure 1. We describe the three steps in detail below.
2.1 Universal Phone Recognition
The aim of a universal phone recogniser is to phonetically transcribe speech from any language. To achieve good generalisation to unseen languages, it is necessary to train the model on a diverse set of languages, in order to cover as wide a set of phones as possible. One way to train such a system is to simply pool data and phonemic lexicons and train a multilingual model with a shared phoneme set. In this work we use such a shared phoneme set to train a conventional hybrid HMM-DNN system on six well-resourced languages. We are aware of work noting that pooling phoneme sets is sub-optimal, as phonemes might have different surface forms in different languages , and that the use of linguistically-derived allophone mappings  might be beneficial.
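As an illustration of the pooling step, once each language’s lexicon has been mapped to a common phone alphabet such as X-SAMPA, building the shared phoneme set amounts to a simple union. The sketch below uses made-up lexicon entries, not GlobalPhone data:

```python
# Sketch of pooling lexicons (already mapped to X-SAMPA) into a shared
# phoneme inventory for multilingual training. Entries are illustrative.
lexicons = {
    "german":  {"hallo": ["h", "a", "l", "o:"]},
    "spanish": {"hola":  ["o", "l", "a"]},
    "french":  {"salut": ["s", "a", "l", "y"]},
}

# The shared phoneme set is the union of all phonemes across languages.
shared_phones = sorted({p for lex in lexicons.values()
                          for pron in lex.values()
                          for p in pron})

# A pooled lexicon keyed by (language, word) keeps homographs apart.
pooled_lexicon = {(lang, word): pron
                  for lang, lex in lexicons.items()
                  for word, pron in lex.items()}

print(shared_phones)  # ['a', 'h', 'l', 'o', 'o:', 's', 'y']
```

In practice this is exactly where the sub-optimality noted above arises: the union treats, say, German /a/ and Spanish /a/ as one unit even when their surface realisations differ.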
2.2 Decipherment
The task of decipherment is to convert a cipher into plain natural language, a classic example being solving a letter-substitution cipher. The use of this technique to decipher the output of a multilingual phone recogniser is the most significant contribution of this paper. We start with the work of Knight , who showed that a noisy-channel framework can be used for decipherment. In this framework, the probability of deciphering a cipher X into an English text Y (in this section we follow the literature in taking English as the target language; in reality we decipher other languages) is modelled as
p(Y | X) ∝ p(X | Y) p(Y),
where we call p(X | Y) the lexical model and p(Y) the language model. The lexical model produces the probability that an English letter corresponds to a cipher letter, and the language model assigns probabilities to sequences of English letters. The language model can be trained on any text corpus, and the lexical model can be trained in an unsupervised fashion with the Baum-Welch algorithm . Once the lexical model is trained, the most probable English text corresponding to the cipher can be obtained with the Viterbi algorithm as
Ŷ = argmax_Y p(X | Y) p(Y).
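To make the noisy-channel formulation concrete, the toy sketch below deciphers a 1:1 letter-substitution cipher using a character bigram language model. For brevity it uses hard (Viterbi) EM rather than Baum-Welch soft counts, and it ignores insertions, deletions and word segmentation; all data is made up.

```python
import math
from collections import defaultdict

def train_bigram_lm(text, alphabet, k=0.1):
    """Character bigram language model p(y_i | y_{i-1}) with add-k smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    lm = {}
    for a in alphabet:
        total = sum(counts[a].values()) + k * len(alphabet)
        lm[a] = {b: math.log((counts[a][b] + k) / total) for b in alphabet}
    return lm

def viterbi(cipher, lm, emit, alphabet):
    """Most probable plaintext Y under p(X | Y) * p(Y)."""
    delta = {y: emit[y][cipher[0]] for y in alphabet}  # flat prior over starts
    back = []
    for x in cipher[1:]:
        prev, delta, ptr = delta, {}, {}
        for y in alphabet:
            best = max(alphabet, key=lambda yp: prev[yp] + lm[yp][y])
            delta[y] = prev[best] + lm[best][y] + emit[y][x]
            ptr[y] = best
        back.append(ptr)
    y = max(delta, key=delta.get)
    out = [y]
    for ptr in reversed(back):
        y = ptr[y]
        out.append(y)
    return "".join(reversed(out))

def decipher_em(cipher, lm, alphabet, cipher_alphabet, iters=10, k=0.1):
    """Hard-EM training of the lexical model p(x | y)."""
    emit = {y: {x: math.log(1.0 / len(cipher_alphabet))
                for x in cipher_alphabet} for y in alphabet}
    for _ in range(iters):
        plain = viterbi(cipher, lm, emit, alphabet)   # E-step (hard decision)
        counts = defaultdict(lambda: defaultdict(float))
        for y, x in zip(plain, cipher):
            counts[y][x] += 1
        for y in alphabet:                            # M-step with smoothing
            total = sum(counts[y].values()) + k * len(cipher_alphabet)
            emit[y] = {x: math.log((counts[y][x] + k) / total)
                       for x in cipher_alphabet}
    return emit, viterbi(cipher, lm, emit, alphabet)

# Toy example: encipher a plaintext, then try to recover it unsupervised.
plaintext = "banana bandana banana"
alphabet = sorted(set(plaintext))
key = {c: chr(65 + i) for i, c in enumerate(alphabet)}  # e.g. 'a' -> 'B'
cipher = "".join(key[c] for c in plaintext)
lm = train_bigram_lm(plaintext, alphabet)
emit, decoded = decipher_em(cipher, lm, alphabet, sorted(set(cipher)))
```

Whether the toy EM recovers the exact plaintext depends on initialisation and the amount of data, but the structure mirrors the noisy-channel formulation above: the language model supplies p(Y) and the trained lexical model supplies p(X | Y).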
Decipherment has previously been used in various NLP applications such as unsupervised machine transliteration , unsupervised machine translation [21, 26] and unsupervised Chinese pronunciation learning . However, phoneme-to-grapheme (P2G) conversion is much more difficult than solving a deterministic substitution cipher, for several reasons. First, a grapheme can map to many phonemes: for example, the English grapheme “a” can be pronounced as AH, AA, AE or EH. Second, one grapheme can correspond to multiple sequential phonemes – for example “x” is pronounced as “K S” – and, conversely, one phoneme can align with multiple sequential graphemes – for example “th” is often pronounced as “DH”. Finally, when applying phoneme-to-grapheme conversion at the utterance level, our model needs to be able to perform word segmentation. These challenges are further multiplied when we deal with noisy inputs from the universal phone recogniser.
To deal with the insertions and deletions inherent to P2G conversion, we use the following parameterisation proposed by Nuhn :
p(X | Y) = Σ_A p(X, A | Y).
In this parameterisation the decipherment model consists of three components: the lexical model, the alignment model and the language model. The random variable A represents a sequence of substitution, insertion and deletion operations. We can also express decipherment using WFST notation as
D = X ∘ L ∘ A ∘ G,
where X is an input phone acceptor, L is the lexical model transducer, A is the alignment model transducer and G is the language model acceptor.
The role of the lexical model is to model the probabilities of mapping phones to graphemes, as well as the probability of phone deletions. The lexical model is implemented as a simple single-state flower transducer. It is initialised to allow all possible substitutions, but during training the unseen substitutions are pruned from the model, which results in faster training and inference due to a smaller composition. However, since we train on small amounts of data in an unsupervised fashion, it is possible that some important arcs are pruned from the model, which can be detrimental. Therefore, following , we smooth the lexical model at various stages of training as
p̂(x | y) = β p(x | y) + (1 − β) / |P|,
where β is a smoothing parameter (we use 0.9) and |P| denotes the size of the input phone set. Finally, the lexical model always maps silence phones to silence or a word boundary in the output. This results in faster training and inference – because silence prunes the space of possible word segmentations – and more accurate decipherment.
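As a sketch, this smoothing step amounts to interpolating each grapheme’s distribution over phones with a uniform distribution; the variable names below are ours:

```python
def smooth_lexical(p_lex, phone_set, beta=0.9):
    """Interpolate each grapheme's distribution p(phone | grapheme) with a
    uniform distribution over the input phone set, restoring a little mass
    on substitutions that were pruned away during training."""
    uniform = (1.0 - beta) / len(phone_set)
    return {g: {ph: beta * dist.get(ph, 0.0) + uniform for ph in phone_set}
            for g, dist in p_lex.items()}

# After pruning, grapheme "a" may retain only its single most likely phone;
# smoothing re-opens the other substitutions with small probability.
pruned = {"a": {"AH": 1.0}}
smoothed = smooth_lexical(pruned, ["AH", "AA", "AE"])
```

Since every phone now receives at least (1 − β)/|P| mass, arcs pruned in an earlier stage can be re-learned in the next one.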
The language model is the component most important to decipherment, as it provides the information driving the training process. The language model predicts the probability of a sequence of graphemes in the target language, so character n-gram models can be used. It is important to keep in mind that an n-gram model with a large context results in a large composition when composed with an unpruned lexical model; it is therefore not feasible to use such models from the beginning. Hence we start with a simple bigram model and move up to 5-gram grapheme models as training progresses. Subsequently, we use a word trigram language model together with a grapheme lexicon for the final round of training. Since the composition of the lexical model, alignment model and word language model is slow, and is required after every training epoch, we reimplemented the standard composition with a three-way composition . We also use pruning to speed up training and inference with the word language models. Our decipherment pipeline is implemented in OpenFST  and its design is heavily inspired by the BaumWelch library from OpenGrm . The code will be open-sourced.
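The schedule of increasing n-gram order can be illustrated with a minimal add-k-smoothed character n-gram model (our own sketch, unrelated to the OpenFST implementation):

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Add-k smoothed character n-gram model; order = context length + 1.
    '^' is used as a begin-of-sequence padding symbol."""

    def __init__(self, order, alphabet, k=0.1):
        self.order, self.alphabet, self.k = order, alphabet, k
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, text):
        padded = "^" * (self.order - 1) + text
        for i in range(len(text)):
            ctx, ch = padded[i:i + self.order - 1], padded[i + self.order - 1]
            self.counts[ctx][ch] += 1

    def logprob(self, text):
        padded = "^" * (self.order - 1) + text
        lp = 0.0
        for i in range(len(text)):
            ctx, ch = padded[i:i + self.order - 1], padded[i + self.order - 1]
            num = self.counts[ctx][ch] + self.k
            den = sum(self.counts[ctx].values()) + self.k * len(self.alphabet)
            lp += math.log(num / den)
        return lp

# Train a bigram and a 5-gram on the same toy text; in decipherment training
# we would start with the bigram and switch to the 5-gram as it progresses.
text = "abababab"
alphabet = sorted(set(text))
bigram, fivegram = CharNgramLM(2, alphabet), CharNgramLM(5, alphabet)
bigram.train(text)
fivegram.train(text)
```

The higher-order model constrains candidate decipherments more sharply, which is precisely why it is only affordable once the lexical model has been pruned.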
2.3 Semi-Supervised Training
In conventional Semi-Supervised Training (SST), a seed model is used to create “pseudo-labels” for untranscribed speech data [4, 31]. In our previous work , we showed that SST can be successful even with seed models with WER over 80%, provided that lattices are used to model uncertainty in the hypotheses. In the previous section we described how decipherment can be used to convert the output of a universal phone recogniser into a sequence of target-language graphemes, to be used as pseudo-labels for untranscribed data. The pseudo-labels can either be one-best transcripts or decipherment lattices. Unlike conventional SST, we have no seed model for the target language, since there is no equivalence between the outputs of the phone recogniser and the target-language graphemes. We therefore choose to train a model with flat-start LF-MMI , initialising the lower layers of the model with the universal phone recogniser, but using a randomly-initialised output layer.
3 Experiments
To demonstrate that decipherment can be used for cross-lingual transfer, we train a multilingual acoustic model on English (ENG), French, German, Spanish, Russian and Polish, and perform decipherment experiments on Czech (CES), Portuguese (POR) and Swedish (SWE).
Our experiments were performed using the GlobalPhone corpus . This corpus provides data for a variety of languages, lexicons that can be mapped to X-SAMPA so that phonemes can be pooled across languages, and pretrained language models built from crawled web data. All these properties make GlobalPhone an ideal test-bed for the evaluation of zero-resource cross-lingual transfer with decipherment.
To train the universal phone recogniser as a multilingual model with a shared phone set, we pooled 20 hours of English LibriSpeech  with the training data from GlobalPhone German, French, Spanish, Russian and Polish . The multilingual model was a small time-delay neural network with 18 hidden layers, each with a hidden dimension of 798 and a bottleneck dimension of 90. In total the model had 7.2M parameters, used 40-dimensional normalised MFCC features as inputs, and was trained with lattice-free MMI (LF-MMI) . We used a phone bigram estimated on the training data for the phone decoding reported in Table 2. The topline monolingual models in Table 3 were trained with the same procedure using only the monolingual training data, and were decoded using the provided GlobalPhone n-gram language models.
The decipherment model was trained on the 200 shortest utterances from the development set of each language. We increase the power of the character-level language model over successive epochs, from a bigram up to a 5-gram, performing 10 iterations of full-batch training with each model. These grapheme language models were trained on the GlobalPhone training data. To speed up training with the grapheme models, after training with the bigram grapheme language model we pruned the lexical model to retain probabilities only for the top 10 phones for each grapheme. After this stage we smoothed the lexical model and continued training with the provided GlobalPhone word language model. Finally, we smoothed the lexical model again and deciphered the GlobalPhone training data. During the course of our experiments we found that the transcripts of the GlobalPhone training data are contained in the provided language models, which would bias the results of semi-supervised training. To prevent this, we always used a language model trained on CommonCrawl data when deciphering and later re-decoding the training data. The CommonCrawl language models were trained on approximately 100M tokens and pruned to contain only the 100k most frequent words.
After obtaining pseudo-labels with decipherment, we performed semi-supervised training in two iterations. In the first iteration we used the pseudo-labels for flat-start LF-MMI training . Instead of training the model from scratch, we replaced the output layer of the multilingual model with a new layer producing pseudo-likelihoods for mono-graphemes. In the second iteration we used the mono-grapheme model to re-decode the training data. To prevent overfitting to the training data we did not continue training the mono-grapheme model; instead, we again replaced the output layer of the pretrained multilingual model and used that model for training. This time we used bi-grapheme targets, since the more reliable pseudo-labels now allowed us to estimate the state clustering tree. Since the decipherment tends to produce many deletion errors, we used a deletion penalty during decoding to allow the model to learn to fix them .
Table 1 (excerpt): decipherment WER (%)
                            ENG    CES    POR    SWE
reference w. silence        21.9   11.7   18.1   10.5
reference w/o silence       61.8   13.2   23.5   14.4

Table 3 (excerpt): WER (%) after semi-supervised training
                            CES    POR    SWE
+ first iteration of SST    21.1   33.1   59.3
+ second iteration of SST   16.7   25.3   42.5
In the first set of experiments, summarised in Table 1, we studied how decipherment works when deciphering a ground-truth phoneme sequence with and without silence; when deciphering phones decoded with a mono-lingual model for the target language; and when deciphering a cross-lingual phone output from the multilingual model. From the results it can be seen that the presence of silence in the phone sequence leads to better results, and also that as the amount of noise due to phone decoding errors increases, as shown in Table 2, the decipherment performance degrades. The English results in Table 1 are notably the weakest, especially on the inputs without silence. We attribute this to the fact that English has a more irregular orthography, which the model may not be able to learn from only 200 utterances. We therefore subsequently trained the English decipherment model on 400 utterances, which reduced the WERs from 21.9, 61.8 and 44.0 to 20.9, 35.2 and 40.7, respectively. This shows that a better decipherment model can be learned with more data.
In the second set of experiments we used the decipherment model trained on the cross-lingual decode to produce pseudo-labels for the GlobalPhone training data, for use in SST. Table 4 contains WER and CER statistics for these pseudo-labels. In Table 3, we show the results following SST, which we compare with a “topline” performance obtained from conventional supervised training on the same data. The results demonstrate that it is possible to successfully train an acoustic model using the deciphered pseudo-labels, with the relative difference between the topline models and the SST models being 38% for Czech, 58% for Portuguese and 134% for Swedish, and the absolute differences being 4.6%, 9.3% and 24.4% respectively. We believe that the differences could be further reduced by using a better universal phone recogniser , a better pre-trained model for initialisation  and by leveraging more crawled data for SST [38, 32]. Our results are in line with our previous work on conventional low-resource SST, where we showed that it is possible to perform SST even with a poor seed acoustic model provided that we have a good language model .
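The WER and CER figures reported throughout are standard Levenshtein-based error rates; for reference, a minimal implementation (our own sketch) is:

```python
def error_rate(ref, hyp):
    """Levenshtein-based error rate: (substitutions + deletions + insertions)
    divided by the reference length. Pass word lists for WER, or character
    lists for CER."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i          # delete all i reference tokens
    for j in range(H + 1):
        d[0][j] = j          # insert all j hypothesis tokens
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # sub
                          d[i - 1][j] + 1,                               # del
                          d[i][j - 1] + 1)                               # ins
    return d[R][H] / max(R, 1)
```

Note that this metric weights deletions and insertions equally; the deletion penalty mentioned above instead biases the decoder itself towards emitting more tokens.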
4 Conclusions and Future Work
We presented a method for zero-resource cross-lingual transfer of ASR models based on decipherment, which allows the training of ASR models using only untranscribed speech, text corpora and a universal phone recogniser. Across the three test languages our method was able to produce a working acoustic model, which could be further improved by using more untranscribed data for SST. We concede that the languages we investigated are geographically close; in future we plan to apply decipherment to more challenging languages, though we believe that this may require a more robust universal phone recogniser that works reliably across a wider range of languages. We also intend to improve our decipherment algorithm to enable the selection of utterances that can be reliably deciphered. Finally, we hope to replace the universal phone recogniser with automatic unit discovery, to create a pronunciation-lexicon-free alternative for unsupervised speech recognition .
-  C. Liu, Q. Zhang, X. Zhang, K. Singh, Y. Saraf, and G. Zweig, “Multilingual graphemic hybrid asr with massive data augmentation,” in SLTU-CCURL, 2020.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016.
-  L. Lamel, J.-L. Gauvain, and G. Adda, “Lightly supervised and unsupervised acoustic model training,” Computer Speech and Language, vol. 16, 2002.
-  L. Lamel, J.-L. Gauvain, and G. Adda, “Unsupervised acoustic model training,” in ICASSP, 2002.
-  J. Glass, “Towards unsupervised speech processing,” in ISSPA, 2012.
-  H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,” Computer Speech and Language, vol. 46, 2017.
-  E. Hermann, H. Kamper, and S. Goldwater, “Multilingual and unsupervised subword modeling for zero-resource languages,” Computer Speech and Language, vol. 65, 2021.
-  N. T. Vu, F. Kraus, and T. Schultz, “Cross-language bootstrapping based on completely unsupervised training using multilingual a-stabil,” in ICASSP, 2011.
-  J. L. Lee, L. F. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, “Massively multilingual pronunciation modeling with WikiPron,” in LREC, 2020.
-  M. Wiesner, O. Adams, D. Yarowsky, J. Trmal, and S. Khudanpur, “Zero-shot pronunciation lexicons for cross-language acoustic model transfer,” in ASRU, 2019.
-  X. Li, J. Li, F. Metze, and A. W. Black, “Hierarchical phone recognition with compositional phonetics,” Interspeech, 2021.
-  V. H. Do, X. Xiao, E. S. Chng, and H. Li, “Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages,” IEICE Transactions on Information and Systems, vol. 97, no. 2, pp. 285–295, 2014.
-  T. Shinozaki, S. Watanabe, D. Mochihashi, and G. Neubig, “Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text,” in Interspeech, 2017.
-  X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black, et al., “Universal phone recognition with a multilingual allophone system,” in ICASSP, 2020.
-  S. Feng, P. Żelasko, L. Moro-Velázquez, A. Abavisani, M. Hasegawa-Johnson, O. Scharenborg, and N. Dehak, “How phonotactics affect multilingual and zero-shot asr performance,” in ICASSP, 2021.
-  H. Gao, J. Ni, Y. Zhang, K. Qian, S. Chang, and M. Hasegawa-Johnson, “Zero-shot cross-lingual phonetic recognition with external language embedding,” Interspeech, 2021.
-  S. Khare, A. Mittal, A. Diwan, S. Sarawagi, P. Jyothi, and S. Bharadwaj, “Low resource ASR: The surprising effectiveness of high resource transliteration,” in Interspeech, 2021.
-  A. Baevski, W.-N. Hsu, A. Conneau, and M. Auli, “Unsupervised speech recognition,” arXiv preprint arXiv:2105.11084, 2021.
-  A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, 2014.
-  S. Ravi and K. Knight, “Deciphering foreign language,” in ACL-HLT, 2011.
-  M. Nuhn, “Unsupervised training with applications in natural language processing,” Ph.D. thesis, RWTH Aachen University, Germany, 2019.
-  K. Knight, “Decoding complexity in word-replacement translation models,” Computational linguistics, vol. 25, no. 4, pp. 607–615, 1999.
-  L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970.
-  S. Ravi and K. Knight, “Learning phoneme mappings for transliteration without parallel data,” in NAACL-HLT, 2009.
-  M. Nuhn and H. Ney, “EM decipherment for large vocabularies,” in ACL, 2014.
-  C. Chu, S. Fang, and K. Knight, “Learning to pronounce Chinese without a pronunciation dictionary,” in EMNLP, 2020.
-  C. Allauzen and M. Mohri, “3-way composition of weighted finite-state transducers,” in CIAA, 2008.
-  C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A general and efficient weighted finite-state transducer library,” in CIAA, 2007.
-  B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai, “The OpenGrm open-source finite-state grammar software libraries,” in ACL, 2012.
-  V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018.
-  E. Wallington, B. Kershenbaum, O. Klejch, and P. Bell, “On the learning dynamics of semi-supervised training for ASR,” Interspeech, 2021.
-  H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE/ACM TASLP, vol. 26, no. 11, pp. 1949–1961, 2018.
-  T. Schultz, N. T. Vu, and T. Schlippe, “GlobalPhone: A multilingual text & speech database in 20 languages,” in ICASSP, 2013.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015.
-  D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Interspeech, 2018.
-  D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016.
-  A. Carmantini, P. Bell, and S. Renals, “Untranscribed web audio for low resource speech recognition,” in Interspeech, 2019.