The communication methods between pilots and Air-Traffic Controllers (ATCos) have remained almost unchanged for many decades. The ATCo's main task is to relay spoken guidance to pilots during all flight phases (e.g. approach, landing or taxi) while ensuring safety, reliability and efficiency. This task has been shown to be extremely stressful and highly voice-demanding because of the impact a small mistake can have. Several attempts to increase confidence in, and reduce the workload of, pilot-controller communication have been pursued in the past, including experiments with Automatic Speech Recognition (ASR). Initially, due to limited budgets and scarce computing power, previous work targeted isolated word recognition or 'voice activity detection', whereas most current work performs ASR on whole utterances. Military applications were among the first to involve engines for command-related ASR; in fact, Beek et al. [beek1977] relate the main ASR techniques to their relevance for military applications such as speaker verification, recognition of spoken codes and system control of aircraft. They remarked that pilot-ATCo communications involve a very limited word set (vocabulary), speaker-dependent issues and environmental noise, all of which need to be addressed to produce a sufficiently reliable system. The integration of ASR technologies in ATC started in the late 1980s with the report of Hamel et al. [hamel1989]; more recently, ASR technologies have been successfully deployed in ATC training simulators. For example, Matrouf et al. [matrouf1990adapting] proposed a user-friendly and robust system to train ATCos based on hierarchical frames and dialogue history (i.e. context-dependent). Similarly, DLR [matrouf1990adapting], MITRE [tarakan2008automated] and more recently UPM-AENA [ferreiros2012speech], under the INVOCA project, proposed similar training systems.
One of the current limitations in developing highly accurate ASR engines for ATCo communications is the lack of available databases; moreover, generating transcriptions of such data is extremely costly. Typically, one hour of raw ATCo-pilot voice recording (including silences) requires between eight and ten man-hours of transcription effort [cordero2012automated], mainly because it requires highly trained participants, often active or retired ATCos. In addition, only 10 to 15 minutes of ATCo speech are usually obtained from a one-hour recording after removing silence segments. Hence, it takes approximately one man-week of work to obtain one hour of ATCo speech without silences [ferreiros2012speech, cordero2012automated].
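The one-man-week estimate follows from the two figures quoted above. The short sketch below reproduces the arithmetic; the function name is hypothetical, and equating one man-week with roughly 40 working hours is our assumption, not a figure from the cited studies.

```python
def man_hours_per_speech_hour(effort_per_raw_hour, speech_minutes_per_raw_hour):
    """Man-hours of transcription needed to obtain one hour of silence-free speech."""
    # Number of raw recording hours needed to collect 60 minutes of speech.
    raw_hours_needed = 60.0 / speech_minutes_per_raw_hour
    return effort_per_raw_hour * raw_hours_needed


# Most favourable case: 8 man-hours per raw hour, 15 minutes of speech per raw hour.
best_case = man_hours_per_speech_hour(8, 15)
# Least favourable case: 10 man-hours per raw hour, 10 minutes of speech per raw hour.
worst_case = man_hours_per_speech_hour(10, 10)

print(f"{best_case:.0f} to {worst_case:.0f} man-hours per hour of ATCo speech")
```

Under these figures the effort ranges from 32 to 60 man-hours, which brackets a working week of about 40 hours.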
Several researchers [holone2015possibilities] and the International Civil Aviation Organization (ICAO) have estimated that air traffic will grow by about 3 to 6 percent yearly at least until 2025. Consequently, the European Union (EU) has invested heavily in addressing ATCo workload and in developing ASR engines for operational pilot-ATCo communication, not only for training purposes. Two recent EU-funded projects in the scope of ASR for ATCo communications are MALORCA (MAchine Learning Of speech Recognition models for Controller Assistance, http://www.malorca-project.de/wp/) and ATCO2 (AuTomatic COllection and processing of voice data from Air-Traffic COmmunications, https://www.atco2.org/). The MALORCA project, together with AcListant (Active Listening Assistant, www.AcListant.de), demonstrated that ASR tools can reduce ATCo workload [helmke2016reducing] and increase efficiency [helmke2017increasing]. MALORCA also addressed the lack of transcribed air-traffic speech data by using semi-supervised training to decrease Word Error Rates (WER) and command error rates [kleinert2018semi, srinivasamurthy2017semi]. We take the results from [kleinert2018semi, srinivasamurthy2017semi] as baseline word error rates for two of the proposed train/test sets. The ongoing ATCO2 project aims at developing a unique platform to collect, organize and pre-process air-traffic speech data from the airspace. ATCO2 considers real-time pilot-ATCo voice communication available either directly through publicly accessible radio frequency channels (such as LiveATC [liveatc2020]) or indirectly from air-navigation service providers. One of the current challenges for ASR engines in ATCo communications is that ATCo accents and vocabularies change across airports; hence, ATCO2 will develop a robust methodology capable of minimizing their impact on the system.
In this work, we present first results (a benchmark) based on six ATC in-domain databases; to the authors' knowledge, this is the first time that such a quantity of command-related databases (spanning more than 170 hours) has been used during the training phase. First, we explore transfer learning from a Deep Neural Network (DNN) system trained on an Out-Of-Domain (OOD) corpus; then we contrast the results with state-of-the-art ASR chain recipes from the Kaldi toolkit [povey2011kaldi], such as TDNNF and CNN+TDNNF. We also conclude that there is a considerable opportunity for byte-pair encoding (BPE) algorithms (used as a new representation in the lexicon instead of word-based units) because of the structure of ATCo speech data, i.e. ATCo communications follow a simple vocabulary in which the most frequently spoken words are numbers. BPE algorithms do not restrict the length of the 'units' (words, in LMs), and those units are not tied to a single word.
Even though building a full ATCo-pilot communication system goes far beyond ASR alone, the following sections present a benchmark of experiments ranging from transfer learning (from an OOD corpus) and adaptation with partial or complete in-domain command-related databases to BPE algorithms and end-to-end TDNNF models. Section 2 describes the corpora and data preparation used for our benchmark experiments. Section 3 reviews the lexicon and language modelling. The acoustic modelling and experimental setup are presented in Section 4. Section 5 then reviews and discusses the main results. Finally, Section 6 concludes the paper and proposes a roadmap for ASR systems for ATCo communications.
2 Data Preparation
Several studies conclude that almost 80% of all pilot radio messages contain at least one error and that 30% of incidents are attributable to miscommunication (up to 50% in the terminal manoeuvring area) [geacuar2010reducing]; therefore, ASR systems stand as a viable solution. Kleinert et al. [kleinert2018semi] mention that a new technology for Air-Traffic Management (ATM), such as ASR on pilot-ATCo communication, needs to be user-friendly, comfortable and sufficiently reliable while keeping an affordable initial cost. Accordingly, ASR systems cannot afford to be trained and tested 'on-the-fly' in a real operational environment; instead, the best possible system must be built before deployment. With this intention, we use state-of-the-art DNN-based ASR engines such as Time-Delay Neural Networks (TDNN) and Convolutional Neural Networks (CNN). These models are known as 'data-hungry' algorithms, because state-of-the-art ASR systems need to be trained on large amounts of data to achieve an acceptable operational performance. Unfortunately, the ATM world currently lacks such databases. One of our main contributions is to address this problem by employing partly in-domain or 'command-related' databases, which retain similar phraseology and structure but contain different speaker accents, thus helping the algorithms to achieve lower WERs.
2.1 Command-related databases
One concern that has delayed the development of a unified ASR framework for ATM globally, or at least at country level, is the vast accent variability among ATCos from non-English-speaking countries. Often, ATCos working in the same country but at different airports have different accents (e.g. in Switzerland). There is also large variability in the vocabulary used across airports, as different call-signs, commands or parameters (e.g. waypoints) can be used. Therefore, an unadapted ASR system will perform significantly worse due to unseen accents, Out-Of-Vocabulary (OOV) words, different recording procedures, parameters, etc.
In order to address this issue, Table 1 presents six databases that closely resemble ATCo speech data, amounting to nearly 180 hours (train and test sets). The phraseology and vocabulary are shared across the databases, but the speakers' accents are domain-dependent. As part of our ASR benchmark for ATC, we also measured the impact of transfer learning from DNN models trained on out-of-domain databases (i.e. Librispeech and Commonvoice, presented in Table 1).
|Database||Hours||Accents||Ref.|
|MALORCA||13||German and Czech||[kleinert2018semi, srinivasamurthy2017semi]|
|LDC ATCC||72.5||American English||[LDC_ATCC]|
|HIWIRE||28.3||French, Greek, Italian and Spanish||[HIWIRE]|
|ATCOSIM||10.67||German, Swiss German & French||[ATCOSIM]|
Another concern in pilot-ATCo communication is errors due to OOV words and phonetically confusable phrases (e.g. "hold in position" and "holding position", or "climb to two thousand" and "climb two two thousand"). Hence, the ICAO has created a standard phraseology to reduce these errors during communication. Similarly, Helmke et al. [helmke2018ontology] propose a new ontology for transcribing ATCo-pilot communications, which will harmonize their integration into ASR systems independently of the country of origin.
2.2 Out-of-domain databases
As part of the proposed benchmark, we measured the impact of transfer learning to address the lack of in-domain databases. The idea is to pre-train models with well-known out-of-domain databases such as Librispeech [panayotov2015librispeech] (960 hours) and Commonvoice [ardila2019common] (500 hours English subset) and then adapt the pre-trained models using in-domain data. The final out-of-domain train set contains nearly 1500 hours of speech data (see Table 1).
|Set||Hours||Description|
|Train1||38.7||Atcosim (train) + Malorca (Vienna+Prague) + UWB ATCC|
|Train2||137.7||Airbus + ATCC USA + Hiwire|
|Tr1+Tr2||176.4||Train1 + Train2|
|OOD set||1500||Out-of-domain set: Librispeech + Commonvoice|
|Atcosim||2.5||20% of Atcosim train set|
|Prague||2.2||From Malorca set|
|Vienna||1.9||From Malorca set|
|Airbus||1||From Airbus set|
2.3 Databases split
In order to measure whether the amount of data and the variety of English accents (including a variety of non-English words) in the databases influence the training process, we merged the six command-related databases into three training sets, as shown in Table 2. For ATCOSIM, we split the database by speaker in an 80/20 ratio (i.e. we used 80% of the data for training/validation and the remaining 20% as the test set). The MALORCA database comprises two ATC approaches (collected from two ANSPs), Vienna and Prague; the initial datasets (Table 1) were already split as shown in Table 2. As reviewed in Section 5, our methodology and acoustic models are evaluated on four different test sets, across which features such as ATCo accent, spoken commands, airport of origin and quantity of training data vary.
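A speaker-based split such as the one used for ATCOSIM can be sketched as follows. This is a minimal illustration and not the exact procedure used for the benchmark: the function name, the `utt2spk` mapping (borrowed from Kaldi's naming convention) and the random seed are assumptions, and the split is made on the number of speakers, which only approximates an 80/20 split of the audio duration.

```python
import random
from collections import defaultdict


def split_by_speaker(utt2spk, test_ratio=0.2, seed=0):
    """Split utterances into train/test so that no speaker appears in both sets."""
    # Group utterance ids by speaker.
    spk2utts = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utts[spk].append(utt)

    # Shuffle the speakers reproducibly and hold out a fraction of them.
    speakers = sorted(spk2utts)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_ratio))

    test = [u for s in speakers[:n_test] for u in spk2utts[s]]
    train = [u for s in speakers[n_test:] for u in spk2utts[s]]
    return train, test


# Toy example: 10 hypothetical speakers with 3 utterances each.
utt2spk = {f"spk{s}_utt{u}": f"spk{s}" for s in range(10) for u in range(3)}
train_utts, test_utts = split_by_speaker(utt2spk)
print(len(train_utts), len(test_utts))
```

Splitting by speaker rather than by utterance keeps the test speakers unseen during training, so the reported WER reflects generalization to new voices.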
3 Lexicon and Language Modelling
The word list for the lexicon was assembled from the transcripts of all the ATCo audio databases (i.e. Tr1+Tr2, see Table 2) and from other publicly available resources (e.g. lists of airline names, airports and the ICAO alphabet). The pronunciations were synthesized with Phonetisaurus [phonetisaurus]. The G2P (grapheme-to-phoneme) model was trained on the Librispeech lexicon, and we inherited its set of phonemes. Likewise, 'spelled' acronyms were detected automatically, and we created their pronunciations separately.
3.2 Language Modelling
We train N-gram language models using SRI-LM [srilm] on the transcripts of the training set Tr1+Tr2 (see Table 2). We use a tri-gram model for the initial decoding and a four-gram model for re-scoring. In our results (Table 3), 'LM-3' stands for the tri-gram and 'LM-4' for the four-gram model. For the BPE model we additionally trained a six-gram model, identified as 'LM-6'.
4 Acoustic Modelling and Experimental Setup
All experiments are conducted using the Kaldi speech recognition toolkit [povey2011kaldi]. We train two frequently used DNN-based acoustic models. On the one hand, we train a Factorized TDNN, or TDNNF [povey2018semi], on 1500 hours of OOD speech (see Table 1) and then adapt the resulting model with three ATC command-related datasets (see Subsection 4.1). On the other hand, we perform flat-start CNN+TDNNF training without any kind of transfer learning or adaptation; the idea is to measure quantitatively whether the amount and accent of the training data help to reduce WERs. We use Kaldi's standard chain LF-MMI recipe for both architectures, which includes 3-fold speed perturbation and one-third frame sub-sampling.
4.1 Conventional LF-MMI Training
Conventional LF-MMI training of TDNNF models still relies on an HMM-GMM model to build both the alignments and the lattices needed during training. The HMM-GMM models are trained only on the out-of-domain databases, i.e. Librispeech + Commonvoice. We prepare 100-dimensional i-vector features, 3-fold speed perturbation, and lattices for LF-MMI training supervision. The TDNNF system trained on the out-of-domain training set (1500 hours) is tagged as 'TDNNF-B'. To measure the impact of the amount of training data on performance in the target domain, we train once with and once without transfer learning on each of the three ATC train sets presented in Table 2. Models trained with transfer learning have 'TF' in the name (e.g. TDNNF-TF-B). The systems without transfer learning are simply denoted according to their architectures (e.g. TDNNF, CNN+TDNNF or TDNNF-BPE).
4.2 Byte-Pair Encoding
As part of the benchmark experiments, we use Byte-Pair Encoding (BPE) [SennrichBPE] on the training transcripts to create a sub-word vocabulary for language modelling. BPE is a compression algorithm that transforms whole words into 'units' of sub-strings, allowing the representation of an open vocabulary in which new words can easily be introduced into the lexicons and LMs. Several studies have used BPE for ASR systems [BPEDrexler2019, BPEZeyer2018, BPEZeng2019]. We believe there is an especially strong case for it in ATC communications, which rely mostly on simple commands and call-signs (our ATC vocabulary is smaller than 10k words) but at the same time contain a relatively high number of foreign proper nouns, which could be missing from a word-based model. For BPE training we limited the number of merges to 2000 (resulting in 2000 sub-words), using the original implementation from [SennrichBPE]. We use a character-based sub-word lexicon: to obtain the pronunciation of a unit, we simply split it into its characters and use these characters instead of phones. As mentioned previously, the LM is a six-gram language model. After decoding, units that end with the separator symbol are joined with the following unit, so that the final output consists of words on which we can calculate the WER (comparable with word-based models).
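The steps described above (learning merges, segmenting words into units, building a character-based pronunciation and re-joining units after decoding) can be illustrated with a small self-contained sketch. It is a simplified re-implementation for illustration only, not the original code from [SennrichBPE]: it omits the end-of-word marker used there, the `@@` separator is an assumption borrowed from common BPE tooling, and all function names are hypothetical. Words are assumed to be non-empty.

```python
from collections import Counter

SEP = "@@"  # assumed separator marking a unit that continues into the next one


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into BPE units; all units but the last carry the separator."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [s + SEP for s in symbols[:-1]] + [symbols[-1]]


def char_pronunciation(unit):
    """Character-based 'pronunciation': the unit's characters replace phones."""
    core = unit[: -len(SEP)] if unit.endswith(SEP) else unit
    return list(core)


def join_units(units):
    """Undo the segmentation after decoding so that WER is computed on words."""
    return " ".join(units).replace(SEP + " ", "").split()


corpus = "climb two thousand descend two thousand lufthansa two two".split()
merges = learn_bpe(corpus, num_merges=10)
units = [u for w in ["swiss", "two", "thousand"] for u in segment(w, merges)]
print(units)
print(join_units(units))
```

Because an unseen word such as "swiss" can always be decomposed into known units (in the worst case single characters), the sub-word system has no out-of-vocabulary words in the strict sense.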
5 Results and Discussion
The results (see Table 3) are split into four blocks. First, a system (TDNNF-B) is trained on an OOD set consisting of 1500 hours. This is our base model for transfer learning. Second, we adapt the TDNNF-B model to the different ATC datasets, i.e. Train1, Train2 and Tr1+Tr2, by training on them with TDNNF-B as initialization. Third, we compare WERs for TDNNFs trained without transfer learning on each of the three proposed training sets. Finally, we present results for a CNN+TDNNF chain model and a TDNNF model trained with BPE units (see Section 4.2 for details on the setup). We kept the same hyper-parameters across all experiments in order to make fair comparisons between models.
|System||Train Set||Params||Word Error Rates (WER) % - (test sets)|
The base model performs poorly on the ATC data. This is not surprising, as Librispeech and Commonvoice both consist of read speech with mostly clear audio, whereas the ATC data is noisier, the speakers talk much faster, and the accents are stronger. Despite the significant difference in domains, pre-training still helps when the target dataset is not too large, as can be seen by comparing the first two rows (trained on Train1 and Train2) of the TDNNF-TF-B and TDNNF models in Table 3. However, once the target-domain dataset becomes large enough, we no longer see a benefit from pre-training (see the last row of the TDNNF-TF-B and TDNNF models).
The main purpose of the last block of experiments is to provide broader coverage of different DNN architectures and techniques in our proposed ASR benchmark for air-traffic communications. There is no clear winner. The CNN+TDNNF system yielded a new baseline of 5% WER for Atcosim, a relative WER improvement of 16.7% and 3.9% compared to TDNNF-TF-B and TDNNF, respectively. For the Vienna approach, our best model was the TDNNF trained on Tr1+Tr2 and re-scored with a 4-gram LM; for the Prague approach, the best-performing model was the TDNNF with a 6-gram LM and a BPE-based lexicon. Compared to previous experiments on MALORCA [kleinert2018semi, srinivasamurthy2017semi], our approach yields 29.8% and 37.9% relative WER improvement for Vienna and Prague, respectively.
We further investigated why the BPE model does significantly better on the Prague test set and found that the difference in performance is entirely explained by fewer deletions (the TDNNF and CNN+TDNNF systems produce five times more deletions than the TDNNF-BPE system). The word-based model cannot recognize out-of-vocabulary words, which is the primary reason for the deletion errors. We checked the OOV rates and found that they are 3.3%, 1.1%, 0.0% and 0.1% on the Prague, Vienna, Airbus and Atcosim test sets, respectively. This shows that the BPE system is capable of recognizing OOVs and thereby improving performance, although this comes at a cost, since the BPE models also perform significantly worse on some test sets. Further investigation is required to understand the differences in performance between word-based and sub-word (BPE) based systems. For instance, we noticed that the BPE model does better on foreign words (even when the word-based model includes these words in its lexicon), which we attribute to the character-based lexicon generalizing better to foreign languages that are not closely related to English.
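An OOV rate of this kind can be computed as in the sketch below. We assume here that the rate is measured over running words (tokens) of the test transcripts rather than over distinct word types, which the text does not specify; the function name and the toy data are hypothetical.

```python
def oov_rate(test_transcripts, lexicon_words):
    """Percentage of running words in the test transcripts missing from the lexicon."""
    lexicon = set(lexicon_words)
    tokens = [word for line in test_transcripts for word in line.split()]
    if not tokens:
        return 0.0
    n_oov = sum(1 for word in tokens if word not in lexicon)
    return 100.0 * n_oov / len(tokens)


# Toy example: "baltu" and "erlos" stand in for waypoints missing from the lexicon.
lexicon = ["descend", "direct", "to", "two", "thousand", "climb"]
transcripts = ["descend two thousand direct baltu", "climb to erlos"]
toy_rate = oov_rate(transcripts, lexicon)
print(f"OOV rate: {toy_rate:.1f}%")
```

In the toy example two of the eight running words are unknown, giving a rate of 25%; a word-based decoder can never output these two words, whereas a sub-word system can assemble them from smaller units.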
The Atcosim baseline WER is presented in [holone2016n], where the authors achieved 8.5% absolute WER by performing n-best list re-ranking using syntactic knowledge. In our case, we first obtain 63.4% WER with TDNNF-B, which improves to 8.1% absolute WER when training only on the Train1 set. An additional 10% relative WER improvement can be obtained by employing transfer learning (i.e. TDNNF-TF-B + Train1), reaching 7.3% absolute WER. As on the previous test sets, an increasing amount of training data helped the models to generalize better; consequently, we achieved an additional 28% relative WER improvement when training TDNNF on Tr1+Tr2. Finally, to explore different DNN architectures, we further reduced the absolute WER to 5.0% with a CNN+TDNNF system trained on Tr1+Tr2, a 3.8% relative improvement over TDNNF.
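For readers less familiar with the metrics, the sketch below shows how a WER and a relative WER improvement are typically computed. It is a generic illustration with hypothetical function names, not the scoring script used for Table 3; the WER is the word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Confusable pair from Section 2.1: one substitution in four reference words.
example_wer = word_error_rate("climb to two thousand", "climb two two thousand")
# Atcosim: TDNNF on Train1 (8.1%) versus TDNNF-TF-B on Train1 (7.3%).
example_gain = relative_improvement(8.1, 7.3)
print(f"WER: {example_wer:.1f}%  relative improvement: {example_gain:.1f}%")
```

The second figure evaluates to roughly 9.9%, matching the 10% relative improvement quoted for transfer learning on Train1.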
The main intention of this work is to introduce state-of-the-art DNN architectures to the area of ASR for air-traffic communications. We performed a benchmark that varies the DNN architecture, the amount of training data and the use of transfer learning across the presented experiments in order to compare their performance fairly. To the authors' knowledge, this is the first paper employing six air-traffic command-related databases, spanning more than 176 hours of speech data, that are strongly related in both phraseology and structure to ATCo-pilot communications, thereby addressing the lack of databases that many previous studies have noted. Specifically, we have shown that by using in-domain ATC databases, even if not from the same country or airport, the system is capable of yielding 29.8% and 37.9% relative WER improvement for the Vienna and Prague approaches, respectively. We also reported new baselines for the Vienna, Prague and Atcosim test sets. Finally, one of the main outcomes of this research was the result of byte-pair encoding on the Prague approach, reaching 5.0% WER. We advise that future research should focus on this approach to AM, LM and lexicon modelling.
The work was supported by the European Union's Horizon 2020 project No. 864702 - ATCO2, which is part of the Clean Sky Joint Undertaking. Karel Vesely was also supported by the Czech National Science Foundation (GACR) project "NEUREM3" No. 19-26934X.