The recent advances in Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) technologies have opened the way to potential applications in the field of Air Traffic Control (ATC).
On the controllers’ side, these technologies are expected to provide an alternative input modality. Controllers have to keep track of all the clearances they issue; this is currently done either by mouse input or by hand, which generates a high workload. The ongoing MALORCA research project (http://www.malorca-project.de/), for instance, aims at improving ASR models to provide assistance at different controller working positions.
On the pilots’ side, ASR of ATC messages could also help decrease pilots’ cognitive workload. Indeed, pilots have to perform several cognitive tasks to handle spoken communications with air traffic controllers:
constantly listening to the VHF (Very High Frequency) radio in case their call sign (i.e. their aircraft’s identifier) is called;
understanding the controller’s message, even if pronounced with a non-native accent and/or in noisy conditions;
remembering complex and lengthy messages.
| | SWITCHBOARD speech | ATC speech |
|---|---|---|
| intelligibility | good (phone quality) | bad (VHF quality + noise) |
| accents | US English | diverse & non-native |
| lexicon & syntax | oral syntax, everyday topics | limited to ICAO phraseology and related |
| other | - | code switching, possible Lombard effect |
In short, industrial stakeholders now consider that ASR and NLU technologies could help decrease operators’ workload, on both the pilots’ and the controllers’ sides. A first step towards cognitive assistance in ATC-related tasks could be a system able to (1) provide a reliable transcription of an ATC message; and (2) identify automatically the call sign of the recipient aircraft.
Although significant progress has been made recently in the field of ASR (see, for example, the work of Xiong et al. [1] and Saon et al. [2], who have both claimed to have reached human parity on the SWITCHBOARD corpus [3]), ATC communications still pose challenges to the ASR community, in particular because they combine several difficulties for speech recognition: accented speech, code-switching, bad audio quality, noisy environments, high speech rate and domain-specific language, together with a lack of voluminous datasets [4]. The Airbus Air Traffic Control Speech Recognition 2018 challenge was intended to provide the research community with an opportunity to address the specific issues of ATC speech recognition.
The paper is organized as follows: Section 2 presents the specificity of ATC speech and existing ATC speech corpora; Section 3 describes the tasks, dataset and evaluation metrics used in the challenge; Section 4 briefly describes the best performing systems and analyses the results of the challenge. Perspectives are discussed in Section 5.
2 Specificity of ATC speech and existing ATC speech corpora
Because ATC communications are very specific, voluminous generic datasets like the SWITCHBOARD corpus [3] cannot be used to build an ATC speech recognition system. Table 1 compares ATC speech with SWITCHBOARD speech. ATC speech poses many challenges to automatic speech recognition: audio quality is bad (VHF), the language is English but pronounced by non-native speakers, speech rate is higher than in conversational telephone speech (CTS) [5], and there is also a lot of code switching. The only advantage of ATC speech compared to CTS is that the vocabulary is limited to the International Civil Aviation Organisation (ICAO) phraseology [6].
Several ATC datasets have been collected in the past. Unfortunately most of them are either unavailable, lack challenging features of ATC or lack proper annotation. On top of this, it was required that at least a small portion of the dataset had never been disclosed so that it could be used for evaluation.
The HIWIRE database [7] contains military ATC-related voice commands uttered by non-native speakers and recorded in artificial conditions. The nnMATC corpus [8] contains 24h of real-life, non-native military ATC messages. Unfortunately, it is not available outside of NATO (North Atlantic Treaty Organization) groups and affiliates. Similarly, the VOCALISE dataset [9] and the corpus of Lopez et al. [10] (respectively 150h and 22h of real-life French-accented civil ATC communications) are not publicly available. ATCOSIM [11] is a freely available resource composed of realistic simulated ATC communications. Its limitations are its size (11h) and the fact that it lacks real-life features. The NIST Air Traffic Control Corpus [12] is composed of 70h of real-life ATC from 3 different US airports and is commercially available through the Linguistic Data Consortium (LDC). Unfortunately, it is mainly composed of native English speech and the call signs have not been annotated. The corpus collected by Šmídl and Ircing [13] is freely available and contains real-life non-native ATC speech. It is, however, quite small (20h) and does not contain call sign annotations.
| Team | WER (%) | ins (%) | del (%) | sub (%) | F1 (%) | p (%) | r (%) |
3 Challenge description
3.1 Two tasks: ASR and call sign detection (CSD)
The Airbus ATC challenge consisted in tackling two tasks: 1) automatic speech-to-text transcription from authentic recordings in accented English, 2) call sign detection (CSD).
Aviation call signs (CS) are communication call signs assigned as unique identifiers to aircraft. They are expected to adhere to the following pre-defined format: an airline code followed by three to five numbers and zero to two letters. For instance, “ENAC School six seven november” is a call sign in which “ENAC School” is a company name followed by two numbers (six and seven), and “november” stands for the letter “N” in the aviation alphabet. One difficulty lies in the use of shortened spoken CS when there is no ambiguity.
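The format described above can be illustrated with a small regular expression over transcribed words. This is only a minimal sketch, not the method of any participant: the airline list, the spelling of the digit and letter words, and the function names `call_sign_pattern` and `find_call_signs` are assumptions made for illustration. The `min_digits` parameter shows why shortened call signs are a problem for a strict rule.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumptions for illustration only).
AIRLINES = ["enac school", "air france", "lufthansa"]
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LETTERS = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
           "hotel", "india", "juliett", "kilo", "lima", "mike", "november",
           "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
           "victor", "whiskey", "x-ray", "yankee", "zulu"]


def call_sign_pattern(min_digits=3, max_digits=5, max_letters=2):
    """Build a regex for: airline, then digit words, then optional letter words."""
    airline = "|".join(re.escape(a) for a in AIRLINES)
    digit = "|".join(DIGITS)
    letter = "|".join(LETTERS)
    return re.compile(
        rf"\b(?:{airline})"
        rf"(?:\s+(?:{digit})\b){{{min_digits},{max_digits}}}"
        rf"(?:\s+(?:{letter})\b){{0,{max_letters}}}"
    )


def find_call_signs(text, **kwargs):
    """Return all call sign spans found in a lower-cased transcription."""
    pattern = call_sign_pattern(**kwargs)
    return [m.group(0) for m in pattern.finditer(text.lower())]


full = "Air France one two three four descend flight level eight zero"
short = "ENAC School six seven november turn left"

strict_full = find_call_signs(full)                  # strict format matches
strict_short = find_call_signs(short)                # only two digits: no match
relaxed_short = find_call_signs(short, min_digits=1) # relaxed rule matches
```

With the strict three-to-five digit rule, the shortened example from the text is missed; relaxing `min_digits` recovers it, at the price of a looser pattern that may also produce more false alarms.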
3.2 Speech material
The dataset used for running the challenge is a subset of the transcribed ATC speech corpus collected by Airbus [4]. This corpus contains all the specific features of ATC mentioned above: non-native speech, bad audio quality, code-switching, high speech rate, etc. On top of this, the call signs contained in the audio have been tagged, which allowed the challenge organizers to propose a “call sign detection” task. Although the corpus is not publicly available, a subset of it was made available to the challengers, for challenge use only. Half of the whole corpus, totalling 50 hours of manually transcribed speech, was used. Utterances were isolated, randomly selected and shuffled. All the meta-information (speaker accent, role, timestamps, category of control) was removed. The corpus was then split into three subsets, which were provided to the participants at different moments during the challenge: 40h of speech together with transcriptions and call sign tags for training, 5h of speech recordings for development (leaderboard) and 5h for final evaluation. The participants did not have access to the ground truth of the development and evaluation subsets. They could make submissions to a leaderboard to get their scores on the development subset. Several criteria were considered to split the data into subsets that share similar characteristics (percentages given in speech duration): 1) speaker sex (female: 25%, male: 75%); 2) speaker job: ATIS (Automatic Terminal Information Service, mostly weather forecasts, 3%), pilots (54%) and controllers (43%); 3) the “program”: ATIS (3%), approach (72%), tower (25%).
3.3 Evaluation metrics
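Evaluation was performed on both the ASR and CSD tasks. ASR was evaluated with Word Error Rate (WER). Before comparison, hypothesis and reference texts were set to lower case. For CSD, the F-measure ($F_1$, or F1-score) was used. A score $S_i$ of a submission $i$ was defined to combine WER and F1 as the harmonic mean of the normalized pseudo-accuracy ($\mathrm{pACC}_i^{\mathrm{norm}}$) and the normalized F1 score ($\mathrm{F1}_i^{\mathrm{norm}}$):
$$S_i = \frac{2 \times \mathrm{pACC}_i^{\mathrm{norm}} \times \mathrm{F1}_i^{\mathrm{norm}}}{\mathrm{pACC}_i^{\mathrm{norm}} + \mathrm{F1}_i^{\mathrm{norm}}}$$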
Submissions were sorted by decreasing score values to get the final participant ranking.
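The metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the organizers' scoring script. The WER part is the standard word-level edit distance. The text does not say how pseudo-accuracy is derived from WER or how scores are normalized, so the definitions used in `challenge_scores` (pseudo-accuracy as `1 - min(1, WER)` and min-max normalization across submissions) are assumptions, and all function names are hypothetical.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimal word alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_errors, subs, dels, ins) for the prefixes ref[:i], hyp[:j].
    table = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = (i, 0, i, 0)          # delete all reference words
    for j in range(cols):
        table[0][j] = (j, 0, 0, j)          # insert all hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)         # match, no cost
            else:
                diag = (t + 1, s + 1, d, n) # substitution
            t, s, d, n = table[i - 1][j]
            dele = (t + 1, s, d + 1, n)     # deletion
            t, s, d, n = table[i][j - 1]
            ins = (t + 1, s, d, n + 1)      # insertion
            table[i][j] = min(diag, dele, ins)
    _, s, d, n = table[-1][-1]
    return s, d, n


def wer(reference, hypothesis):
    """Word Error Rate after lower-casing, as a fraction of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    s, d, n = edit_counts(ref, hyp)
    return (s + d + n) / len(ref)


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def min_max(values):
    """ASSUMPTION: min-max normalization across all submissions."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in values]


def challenge_scores(wers, f1s):
    """ASSUMPTION: pseudo-accuracy is 1 - min(1, WER); both metrics are min-max normalized."""
    pacc_norm = min_max([1.0 - min(1.0, w) for w in wers])
    f1_norm = min_max(f1s)
    return [harmonic_mean(p, f) for p, f in zip(pacc_norm, f1_norm)]


ref = "Air France one two three descend flight level eight zero"
hyp = "air france one three descend flight level eight zero zero"
counts = edit_counts(ref.lower().split(), hyp.lower().split())
example_wer = wer(ref, hyp)
scores = challenge_scores([0.08, 0.10, 0.30], [0.82, 0.80, 0.0])
```

In this toy example the hypothesis drops “two” and adds a final “zero”, i.e. one deletion and one insertion over ten reference words. A submission with no call sign predictions (F1 of zero) gets a combined score of zero whatever its WER, because of the harmonic mean.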
4 Result analysis and system overview
In this section, we report detailed results for the two tasks, ASR and CSD. We also give a bird’s-eye view of the approaches behind the best ranked predictions on the Eval subset.
Figures 1a and 1b show the Word Error Rates (WER) for the ASR task and the F1-scores for CSD, obtained by the 22 teams ordered by their final ranking. Only the names of the entities that agreed to disclosure are displayed.
VOCAPIA-LIMSI achieved the best results in both tasks, with a 7.62% WER and an 82.41% CSD F1-score. Overall, the best teams obtained impressive results, with WERs below 10%, and below 8% for the winner. Table 2 gives more details to analyze these results. One can see that almost all the ASR systems produced twice as many deletions and substitutions (around 3%) as insertions (around 1.5%).
Regarding CSD, the best systems yielded F1-scores above 80%. Except for the two best systems, which had similar precision and recall values (respectively 81.99% and 82.82% for VOCAPIA-LIMSI), precision was larger than recall by a significant margin. This means that the systems miss call signs (false negatives) more often than they output spurious ones (false positives). This lack of robustness may be explained by the variability with which call signs are employed: sometimes in their full form, sometimes in partial forms. Three teams, including Queensland Speech Lab and U. Sheffield, did not submit CS predictions, resulting in a zero score in CSD (no visible bar in Fig. 1b) and a final ranking that does not reflect their good performance in ASR.
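The relation between precision, recall and F1 discussed above can be checked with a short sketch. The function names and the detection counts are hypothetical; only the precision and recall of the best system are taken from the text.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from detection counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the best system (in %).
best_f1 = f1_score(81.99, 82.82)

# Hypothetical counts: 80 correct detections, 10 false alarms, 20 missed call signs.
p, r = precision_recall(true_pos=80, false_pos=10, false_neg=20)
```

Here `best_f1` comes out at about 82.40, matching the reported 82.41% up to rounding of the published precision and recall. In the hypothetical counts, precision exceeds recall precisely because there are more missed call signs than false alarms.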
Column groups: Acoustic frontend, Acoustic Modeling, Language Modeling.

| Team | Features | Data augmentation | Modeling | Context | Complexity | Lex. size | LM | Decoding | Ensemble |
|---|---|---|---|---|---|---|---|---|---|
| UWr-ToopLoox | Mel F-BANK | freq. shifting, noise | CTC Conv-BiLSTM | diphones | 50M | 2.2k | 4-gram | Lattice | Yes |
| UWB-JHU | MFCC, ivectors | volume, speed | TDNN-F | triphones | 20M | 2.2k | 3-gram | Lattice | No |
| Team5 | MFCC, ivectors | reverb, speed, volume | TDNN | triphones | 6M | 2.7k | 4-gram | Lattice | No |
To gain more insight into these results, Table 3 shows the WER and CSD F1-score of the highest ranked team according to the program, speaker job, and speaker sex. As expected, ATIS speech (mostly weather forecasts with limited vocabulary) is easier to transcribe than Approach (AP) and Tower (TO), for which similar WERs were obtained: 8.1% and 7.8%, respectively. An interesting finding is that pilots’ speech (P) was much more difficult to transcribe than controllers’ speech (C), with almost a factor of two in WER, and an 8% absolute difference in CSD F1-score. This can be explained by the greater diversity of accents among pilots compared to controllers, most of whom are French. Since French-accented English is the most represented accent in the corpus, C is better recognized than P. Better performance was obtained for female speakers than for male speakers, probably because 78% of the female utterances are controller utterances.
4.2 ASR system characteristics
Table 4 gives an overview of the ASR modules used by the five best ranked teams. Regarding the acoustic front-end, Vocapia-LIMSI used Perceptual Linear Predictive (PLP) features with RASTA-filtering [14, 15]. Except for UWr-ToopLoox, which used Mel F-BANK coefficients, all the other participants used high-resolution MFCC (40 to 80 coefficients) and 100-d i-vectors. According to their findings, i-vectors bring very small gains. For acoustic modeling, Vocapia-LIMSI used a hybrid HMM-MLP model (Hidden Markov Models - Multi-Layer Perceptron). UWr-ToopLoox used an ensemble of six large models (50M parameters each), each comprising two convolution layers and five bidirectional Long Short-Term Memory (Bi-LSTM) layers, trained with the CTC (Connectionist Temporal Classification) objective function. CRIM also combined six different models, three Bi-LSTM and three Time-Delay Neural Networks (TDNN). UWB-JHU used factorized TDNNs (TDNN-F, [17]), which are TDNNs whose layers are compressed via Singular Value Decomposition.
Finally, almost all the teams used the 2.2k word-type vocabulary extracted from the challenge corpus. The participants reported little or no gain when using neural language models rather than n-gram models.
4.3 Call Sign Detection system characteristics
For CSD, two main approaches were implemented: on the one hand, grammar-based and regular expression (RE) methods, i.e. knowledge-based methods; on the other hand, machine learning models. The first type of model requires adaptation to capture production variants that do not strictly respect CS rules (pilots and controllers often shorten CS, for example). The second type, namely neural networks, Consensus Network Search (CNS) and n-grams, performed better in this evaluation but cannot detect unseen CS. Vocapia-LIMSI combined both approaches (RE allowing full and partial CSD together with CNS) and achieved the highest scores.
Some participants attempted to use external ATC speech data for semi-supervised acoustic model training, but this proved unsuccessful, even though the technique usually brings performance gains, as in [18]. This may be due to the fact that the eval subset is very close to the training subset, so that adding external data just adds noise. This outcome reveals a robustness issue that needs to be addressed. A large-scale speech data collection is very much needed to solve ATC ASR. Several criteria should be considered for this data collection: diversity in the airports where speech is collected, diversity in foreign accents, and the acoustic devices used for ATC, among others.
Regarding the organization of a future challenge, using speech from different airports for training and testing purposes should be considered. This would also require systems with more generalization capabilities for the CSD task, since most of the call signs would be unseen during training.
Furthermore, for such an effort to succeed, the major players in the field should join forces, both to collect data and to share the large costs of manually transcribing the recordings. Finally, much attention should be paid to legal aspects of data protection and privacy (in Europe, the recent General Data Protection Regulation).
The organizing team would like to thank all the participants. This work was partially funded by AIRBUS (https://www.airbus.com/), École Nationale de l’Aviation Civile (ENAC, http://www.enac.fr/en), Institut de Recherche en Informatique de Toulouse (https://www.irit.fr/) and SAFETY DATA-CFH (http://www.safety-data.com/en/).
[1] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving human parity in conversational speech recognition,” CoRR, 2016.
[2] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” CoRR, 2017.
[3] J. Godfrey, E. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” 1992, vol. 1, pp. 517–520.
[4] E. Delpech, M. Laignelet, C. Pimm, C. Raynal, M. Trzos, A. Arnold, and D. Pronto, “A real-life, French-accented corpus of Air Traffic Control communications,” in Proc. LREC, Miyazaki, 2018, pp. 2866–2870.
[5] R. Cauldwell, “Speech in action: Fluency for Air Traffic Control,” PTLC, August 2007.
[6] ICAO, Manual of Radiotelephony, 4th edition, Doc 9432-AN/925.
[7] J. C. Segura, T. Ehrette, A. Potamianos, D. Fohr, I. Illina, P.-A. Breton, V. Clot, R. Gemello, M. Matassoni, and P. Maragos, “The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication,” Web Download, 2007, http://catalog.elra.info/en-us/repository/browse/ELRA-S0293/.
[8] S. Pigeon, W. Shen, and D. van Leeuwen, “Design and characterization of the non-native military air traffic communications database (nnMATC),” in Proc. Interspeech, Antwerp, 2007, pp. 2417–2420.
[9] L. Graglia, B. Favennec, and A. Arnoux, “Vocalise: Assessing the impact of data link technology on the R/T channel,” in The 24th Digital Avionics Systems Conference, 2005, vol. 1.
[10] S. Lopez, A. Condamines, A. Josselin-Leray, M. O’Donoghue, and R. Salmon, “Linguistic analysis of English phraseology and plain language in air-ground communication,” Journal of Air Transport Studies, vol. 4, no. 1, pp. 44–60, 2013.
[11] K. Hofbauer, S. Petrik, and H. Hering, “The ATCOSIM corpus of non-prompted clean Air Traffic Control speech,” 2008.
[12] J. Godfrey, “Air traffic control complete LDC94S14A,” Web Download, 1994, https://catalog.ldc.upenn.edu/LDC94S14A.
[13] L. Šmídl and P. Ircing, “Air Traffic Control communications (ATCC) speech corpus,” Web Download, 2014, https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-CCA1-0.
[14] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[15] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[16] V. Gupta and G. Boulianne, “CRIM’s system for the MGB-3 English multi-genre broadcast media transcription,” in Proc. Interspeech, Hyderabad, 2018, pp. 2653–2657.
[17] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. Interspeech, Hyderabad, 2018, pp. 3743–3747.
[18] L. Šmídl, J. Švec, A. Pražák, and J. Trmal, “Semi-supervised training of DNN-based acoustic model for ATC speech recognition,” in Proc. SPECOM, Leipzig, 2018, Springer, pp. 646–655.