Previous research [srinivasamurthy2017semi, kleinert2018semi] as part of the MALORCA (MAchine Learning Of speech Recognition models for Controller Assistance: http://www.malorca-project.de/wp/) project focused on i) improving ABSR accuracy for ATCOs, ii) reducing workload for ATCOs [helmke2016reducing], and iii) increasing efficiency [helmke2017increasing] of ATCOs.
As part of the ongoing HAAWAII (Highly Advanced Air Traffic Controller Workstation with Artificial Intelligence Integration: https://www.haawaii.de) project, we aim to research and develop a reliable and adaptable solution to automatically transcribe voice commands issued by both ATCOs and pilots.
An error-resilient and accurate ASR system is critical in the ATC domain. Current state-of-the-art technologies require large amounts of data to train ASR systems. The goal of another ongoing project, ATCO2 (Automatic collection and processing of voice data from air-traffic communications: https://www.atco2.org/), is to collect a large set of voice recordings of ATCOs and pilots with minimum effort for this purpose. To train ASR for this task, ATCO and pilot speech recordings are usually pooled together [zuluaga2020asratc, zuluaga2020callsign, srinivasamurthy2017semi], despite significant variability in the data distribution (acoustic and grammatical conditions) and in the number of speakers in the data. As a result of this variability, ASR performance differs significantly between ATCO and pilot speech (i.e. ATCO speech is easier to recognize). Our baseline system trained by pooling all data shows an absolute WER difference of 9.7% between ATCO and pilot speech (ATCO WER: 36.1%, pilot WER: 45.8%).
In general terms, this research is centered on integrating contextual knowledge during ASR decoding. Nevertheless, previous research has shown that this technique can also be integrated during semi-supervised learning, i.e. contextual semi-supervised learning [zuluaga2021contextual].
In this paper, we hypothesize that instead of developing the ASR as a single task, ATCO and pilot ASR can be considered as two separate tasks [ruder2017overview]. Specifically, this paper investigates a multitask approach to train AMs to be integrated in ASR for ATCOs and pilots. An obvious first step is to automatically split the ATC speech communications into the two tasks, since obtaining these speaker labels manually on a large dataset would be expensive and time-consuming. A common approach is to use speaker diarization to classify the speakers in the audio [anguera2012speaker, park2021review]. Although ATCO speech is often cleaner than pilot speech (as the former communicates from a controlled acoustic environment), the speech recordings collected in the ATCO2 project using Very High Frequency (VHF) receivers are noisy for both ATCO and pilot channels. In such a case, the speaker diarization system may fail to assign speaker labels (ATCO or pilot) accurately. Thus, a speaker diarization system cannot be easily deployed to obtain accurate speaker labels.
A vital aspect of the air traffic management (ATM) environment is the communication between controller and pilot. To keep air traffic flowing smoothly, this communication is defined by a standard phraseology from ICAO [allclear]. Another approach to obtaining the speaker class is therefore to leverage the ICAO grammar to classify an utterance as one of the classes at the text level. Once the speaker labels (ATCO and pilot) are available for the large dataset, AMs can be trained for both controllers and pilots through different approaches. In this study, we show that, due to the poor acoustic conditions, training a single AM by pooling all data does not provide the best performance for pilots even if the speech is constrained by grammar. To obtain better recognition accuracy, the AM should be trained separately for ATCO and pilot data, or the two should be treated as different tasks in a multitask approach.
Section 2 provides a brief overview of the work related to multitask automatic speech recognition. The datasets used are described in Section 3 followed by Section 4 that describes speaker role classification with text. Section 5 explains the experimental setup and the results obtained which are followed by the conclusion in Section 6.
2 Related work
Previous research [madikeri2020lattice, burget2010multilingual, imseng2014using, vu2014multilingual, karafiat2016multilingual] has shown that to compensate for limited data available in low-resourced languages, multilingual systems are an effective way to train ASR systems. In such a system, the output layer could be a separate layer for each language, or a single layer shared between all languages [karafiat2016multilingual]. The Kaldi [povey2011kaldi] toolkit provides state-of-the-art techniques to train AMs, specifically Lattice-Free Maximum Mutual Information (LF-MMI) [povey2016purely]. Recently, [madikeri2020lattice] showed that multilingual AM can be trained with LF-MMI [povey2016purely]. In MMI training, the cost function is given as:
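$$\mathcal{F}_{MMI} = \sum_{u \in U} \log \frac{p(\mathbf{x}_u \mid \mathcal{M}_{\mathbf{w}_u}, \theta)\, P(\mathbf{w}_u)}{p(\mathbf{x}_u \mid \mathcal{M}_{den}, \theta)} \tag{1}$$

where $\mathbf{x}_u$ is an input sequence for an utterance $u$, $U$ is the set of all utterances in the training data, $\mathcal{M}_{\mathbf{w}_u}$ corresponds to a numerator graph specific to the word sequence in the transcription, $\mathcal{M}_{den}$ is a denominator graph modelling all possible sequences, which is usually a phone LM, $\theta$ is the model parameter and $P(\mathbf{w}_u)$ is the language model probability for an utterance.

However, in multitask training with separate output layers, the cost function from Equation 1 is computed for each task. For $T$ tasks, the output cost function for each task $t$ depends only on the utterances of that task:

$$\mathcal{F}^{(t)}_{MMI} = \sum_{u=1}^{U_t} \log \frac{p(\mathbf{x}^{(t)}_u \mid \mathcal{M}^{(t)}_{\mathbf{w}_u}, \theta^{(t)})\, P(\mathbf{w}_u)}{p(\mathbf{x}^{(t)}_u \mid \mathcal{M}^{(t)}_{den}, \theta^{(t)})} \tag{2}$$

where $U_t$ is the number of utterances in a minibatch for a task $t$, $\theta^{(t)}$ contains the shared and task-dependent parameters, and $\mathcal{M}^{(t)}_{\mathbf{w}_u}$ and $\mathcal{M}^{(t)}_{den}$ are the task-specific numerator and denominator graphs, respectively. For a task $t$, the denominator graph is built using the task-specific phone LM. For each minibatch, the gradient of each task output layer is computed and the layer is updated.

The overall cost function is then given as a weighted sum of all task-dependent cost functions (Equation 3):

$$\mathcal{F} = \sum_{t=1}^{T} \alpha_t\, \mathcal{F}^{(t)}_{MMI} \tag{3}$$

where $\alpha_t$ is a task-dependent weight.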
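The weighted combination of task-dependent costs described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Kaldi implementation: the function name, the task names, the per-task cost values and the equal weights are all assumptions made for the example.

```python
from typing import Dict


def multitask_cost(task_costs: Dict[str, float],
                   task_weights: Dict[str, float]) -> float:
    """Weighted sum of per-task costs, one term per task."""
    missing = set(task_costs) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[task] * cost for task, cost in task_costs.items())


# Hypothetical per-task objective values for one training step.
costs = {"atco": -0.12, "pilot": -0.20}
# Equal weights are assumed here purely for illustration.
weights = {"atco": 1.0, "pilot": 1.0}
total = multitask_cost(costs, weights)
```

Lowering the weight of one task reduces its contribution to the overall objective, which is one way to keep a larger task from dominating the shared layers.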
Although language and phone sets are the same for ATCO and pilots, due to the variation in the acoustic conditions, we consider them as different tasks and propose to use a multitask approach to train AMs. We hypothesize that using a multitask approach can lead to better ASR performance for both ATCOs and pilots compared to a single AM trained by combining all data.
The following subsections provide an overview of the data used in this paper.
3.1 Collection and pre-processing of VHF data
3.1.1 Data collection
To obtain ATC voice communications, the following two sources are considered: i) open-source speech such as LiveATC (LiveATC.net is a streaming audio network consisting of local receivers tuned to aircraft communications: https://www.liveatc.net/), and ii) speech collected with our own setup of VHF receivers. In addition to speech data, the available time-aligned metadata from the OpenSky Network (OSN), which provides open access to real-world air traffic control data to the public, is used to obtain contextual information (e.g. the callsign list for each utterance). This process yielded 377 hours of speech data from Prague (LKPR) and Brno (LKTB) airports, collected from August 2020 until January 2021 for the ATCO2 project.
3.1.2 Data pre-processing
Figure 1 shows the pipeline used for preparing the VHF database. First, a seed ASR system is used to produce the transcripts for the 377 hours of collected data. The seed model is a ‘hybrid’ speech-to-text recognizer based on Kaldi [povey2011kaldi] trained with LF-MMI [povey2016purely]. The neural network has six convolutional layers followed by nine Factorized Time-Delay Neural Network (TDNN-F) layers [povey2018semi].
The list of callsigns retrieved from OSN is in ICAO format. The ICAO format is composed of a three-character airline code (e.g. TVS) followed by the callsign number, which consists of digits and an additional character combination, e.g. TVS84J. In order to use this prior knowledge, this format is transformed into its “expanded version”. Several variants exist for a given callsign. As illustrated in Figure 1, the callsign TVS84J can be pronounced as "skytravel eight four juliett", or each letter can instead be spelt out as "tango victor sierra eight four juliett".
Then, a set of callsigns with their variants is created. Finally, string matching of this expanded callsign list is applied to the automatic transcripts. Only the utterances in which one of the callsigns is found are kept. This pre-processing reduced the data from 377 hours to 66 hours.
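The expansion and string-matching steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the database: the function names, the one-entry airline table and the exact word spellings (e.g. "x-ray") are hypothetical, and only the two variants named in the text are generated.

```python
from typing import Dict, List

# Spelling alphabet and digit words; spellings follow the examples in the text.
LETTERS = {
    "A": "alfa", "B": "bravo", "C": "charlie", "D": "delta", "E": "echo",
    "F": "foxtrot", "G": "golf", "H": "hotel", "I": "india", "J": "juliett",
    "K": "kilo", "L": "lima", "M": "mike", "N": "november", "O": "oscar",
    "P": "papa", "Q": "quebec", "R": "romeo", "S": "sierra", "T": "tango",
    "U": "uniform", "V": "victor", "W": "whiskey", "X": "x-ray",
    "Y": "yankee", "Z": "zulu",
}
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
# Airline code -> spoken designator; a single entry taken from the example.
AIRLINES = {"TVS": "skytravel"}


def spell(chars: str) -> List[str]:
    """Spell out each character as a letter word or a digit word."""
    return [LETTERS.get(c, DIGITS.get(c, c.lower())) for c in chars.upper()]


def expand_callsign(callsign: str, airlines: Dict[str, str]) -> List[str]:
    """Return spoken variants of an ICAO-format callsign such as TVS84J."""
    code, suffix = callsign[:3].upper(), callsign[3:]
    variants = [" ".join(spell(code) + spell(suffix))]  # fully spelt out
    if code in airlines:
        variants.insert(0, " ".join([airlines[code]] + spell(suffix)))
    return variants


def contains_callsign(transcript: str, variants: List[str]) -> bool:
    """Whole-word string match of any variant against a transcript."""
    padded = f" {' '.join(transcript.lower().split())} "
    return any(f" {variant} " in padded for variant in variants)


variants = expand_callsign("TVS84J", AIRLINES)
keep = contains_callsign(
    "good morning skytravel eight four juliett descend", variants)
```

An utterance would be kept when `contains_callsign` is true for at least one callsign in the list retrieved for that utterance.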
3.2 Related ATC datasets available for training
In addition to the above data collection, ATCO2 has brought together several air-traffic command-related databases [srinivasamurthy2017semi, N4NATO, HIWIRE, ATCOSIM, AIRBUS, LDC_ATCC] from different publicly available open data sources. The full set of databases spans approximately 140 hours of speech data that are strongly related, in both phraseology and structure, to ATCO-pilot communications [zuluaga2020asratc, zuluaga2020callsign, zuluaga2021contextual]. These databases were additionally augmented by adding noises that match LiveATC audio channels, doubling the size of the training data. Since each of the seven databases had a different annotation ontology (annotation procedure, rules, and symbols), the transcripts had to be standardized and normalized [ATCOSIM, helmke2018ontology].
4 Speaker role classification with text
As described in Section 1, to develop a reliable and better performing ASR for both controllers and pilots, labelled speech data are required for each role. However, in most cases, such as in the ATCO2 project, large amounts of data are collected without speaker labels. The first task is therefore to split the speech recordings into two classes: ATCO and pilot. To accomplish this, we extract information based on the ICAO grammar to identify the speaker’s role.
ICAO defines a separate grammar for ATCOs and pilots to enable clear communication. For instance, there are certain phrases/commands that an ATCO should use in a specific order. This knowledge is used to extract/identify potential words/commands that indicate a specific speaker role. For example, words such as "identified", "approved", "wind" would most probably only be spoken by an ATCO, and the words "wilco", "maintaining", "we", "our" would probably be spoken only by a pilot. Currently we have made a list of 25 words for ATCO and 9 words for pilot that indicate each role. This list was generated by manual curation and expert feedback. A list of callsigns (https://en.wikipedia.org/wiki/List_of_airline_codes) is also prepared from available airline codes.
Since this method operates at the word level, manual (if available) or automatically generated transcripts are required for the corresponding speech recordings. To identify whether an utterance is spoken by an ATCO or a pilot, we check the corresponding transcript for the following conditions. If the callsign appears at the beginning of an utterance, the utterance is classified as ATCO; otherwise it is classified as pilot. As there is quite often a greeting at the beginning, we check whether the callsign appears within the first four words. If one of the words in the utterance is in the list of ATCO words or in the list of pilot words, then the respective role is assigned.
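A minimal sketch of such a rule-based classifier is given below. It rests on several assumptions that are not specified in the text: the keyword sets contain only the example words quoted above (not the full lists of 25 and 9 words), the airline set has two illustrative entries, and, when keyword evidence is present for exactly one role, it takes precedence over the callsign-position rule.

```python
from typing import Set

# Example words only; the full curated lists are not reproduced here.
ATCO_WORDS: Set[str] = {"identified", "approved", "wind"}
PILOT_WORDS: Set[str] = {"wilco", "maintaining", "we", "our"}
# Hypothetical subset of spoken airline designators.
AIRLINE_WORDS: Set[str] = {"lufthansa", "skytravel"}


def classify_role(transcript: str, window: int = 4) -> str:
    """Classify an utterance as 'ATCO' or 'pilot' from its transcript."""
    words = transcript.lower().split()
    has_atco = any(w in ATCO_WORDS for w in words)
    has_pilot = any(w in PILOT_WORDS for w in words)
    # Assumed precedence: unambiguous keyword evidence decides first.
    if has_atco != has_pilot:
        return "ATCO" if has_atco else "pilot"
    # Otherwise fall back to the callsign position: a callsign within the
    # first `window` words suggests an ATCO addressing an aircraft.
    callsign_first = any(w in AIRLINE_WORDS for w in words[:window])
    return "ATCO" if callsign_first else "pilot"


examples = [
    "lufthansa one eight nine descend flight level eight zero",
    "descending flight level eight zero lufthansa one eight nine",
    "good morning skytravel eight four juliett identified",
    "wilco skytravel eight four juliett",
]
roles = [classify_role(e) for e in examples]
```

Matching only the airline designator, as done here, is the simplest form of the callsign check; matching full callsign variants would be a stricter alternative.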
Once each utterance in the training data is classified as ATCO or pilot, we propose to train two versions of ASR. In the first system there are two acoustic models: one for ATCO and one for pilot. In the second system we train a multitask network with one task as ATCO ASR and other as pilot ASR. The procedure is illustrated in Figure 2.
4.1 Speaker Role Classification Performance
This method has been tested on manually speaker segmented and transcribed data from two different Air Navigation Service Providers (ANSPs) as a part of the HAAWAII project: i) NATS for London Approach and ii) ISAVIA for Icelandic en-route. In the first set, there are 1060 ATCO utterances and 1280 pilot utterances. From the confusion matrix shown in Figure 3, we can observe that this method provides a true positive rate (TPR) of (correctly classified ATCO) and true negative rate (TNR) of (correctly classified pilot). The second set used consists of 775 ATCO utterances and 887 pilot utterances. From the confusion matrix shown in Figure 4, we see that this method provides a TPR of and TNR of .
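For reference, the two rates are computed from confusion-matrix counts as shown in the sketch below, with ATCO treated as the positive class as in the text. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values from Figures 3 and 4.

```python
from typing import Tuple


def tpr_tnr(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (TPR, TNR) from confusion-matrix counts.

    tp: ATCO utterances classified as ATCO
    fn: ATCO utterances classified as pilot
    tn: pilot utterances classified as pilot
    fp: pilot utterances classified as ATCO
    """
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr


# Hypothetical counts: 10 ATCO utterances and 10 pilot utterances.
tpr, tnr = tpr_tnr(tp=8, fn=2, tn=9, fp=1)
```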
4.2 Error Analysis
As there exist many variants for any given callsign, checking only for the airline code (e.g. lufthansa) is a major factor contributing to the misclassification of ATCO as pilot. A reason for the misclassification of pilot as ATCO is the occurrence of callsigns at the beginning of the utterance. Analysis of the misclassification errors shows that the accuracy can be improved by i) matching the spoken callsign against its allowed variants (e.g. LUF189AF: lufthansa one eight nine alfa foxtrot, one eight nine alfa foxtrot, etc.) and ii) using the context prior to the callsign (e.g. the pilot may mention the control position they want to address, followed by the callsign). We will consider applying these improvements as part of our future work.
For all our experiments, conventional biphone Convolutional Neural Network (CNN) [lecun1995convolutional] + TDNN-F [povey2018semi] based acoustic models trained with the Kaldi [povey2011kaldi] toolkit (i.e. the nnet3 model architecture) are used. AMs are trained with the LF-MMI [povey2016purely] training framework, which is considered to produce state-of-the-art performance for hybrid ASR systems. In all the experiments, 3-fold speed perturbation [ko2015audio] and i-vectors are used. The multitask training script used can be found in Kaldi [povey2011kaldi] (egs/babel_multilang/s5d/local/chain2/run_tdnn.sh). The value of the task dependent weight used in our experiments is . The language model (LM) is trained with all the manual transcripts available from the datasets described in Section 3.2 and is used for all the experiments.
The performance of different models is evaluated on LiveATC test set with the Word Error Rate (WER) metric which is based on the Levenshtein distance at the word level. The total duration of the test set is 1h 50 mins. The set is split into two subsets: ATCO set (52 mins) and Pilot set (58 mins).
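As a reminder of how the metric is obtained, the sketch below computes WER as the word-level Levenshtein distance (substitutions, deletions and insertions) summed over the test set and divided by the number of reference words. It is a plain illustration with hypothetical function names and a made-up utterance, not the scoring tool used in the experiments.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Word error rate in percent over paired reference/hypothesis strings."""
    n_words = sum(len(r.split()) for r in references)
    if n_words == 0:
        raise ValueError("references contain no words")
    errors = sum(word_edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return 100.0 * errors / n_words


# One substitution in a five-word reference.
score = wer(["descend flight level eight zero"],
            ["descend flight level nine zero"])
```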
In each group of experiments, results are given for i) AM trained for each task separately, ii) AM trained by combining all data and iii) AM trained with multitask learning.
5.1 Experiments on ATC databases
In this setup, we use data from the ATC databases mentioned in Section 3.2 as Clean data and its noise augmented part as Noise data. As shown in Table 1, both ATCO and pilot test sets provide better performance when the model is trained with Noise data compared to the model trained with only Clean data. This shows that the noise augmented version of the clean data matches with the test sets much better than the clean version. Moreover, the Combined system performs significantly worse than the Noise system. This shows that using the Clean dataset in fact hurts ASR performance. This is one of the reasons why the multitask system performs only on par with the Noise system. Therefore only the noise augmented data is used for training in the next experiments.
5.2 Experiments on VHF data
Results in Table 2 are presented for AMs trained with only the VHF data. Applying speaker role identification to the pre-processed data (66 h) yields 43 h for ATCO and 23 h for pilot. Similar to Table 1, the results in Table 2 show that using multitask learning instead of training the AM by combining all the data provides better ASR performance. Furthermore, the results reveal that, due to the low amount of data, multitask learning outperforms its single-task counterparts.
5.3 Experiments on VHF+other ATC datasets
In this subsection we report results with models trained on both the VHF and ATC datasets used in the previous two experiments. By investigating the ATC databases used in Section 5.1, we discovered that some of the datasets also contain pilot speech. Since no speaker role labels are available for these sets, we applied the proposed method to split the noise augmented speech into ATCO and pilot and combined the resulting subsets with their respective classes of the VHF data. This provided 123 h of data for ATCO and 80 h for pilot. The results in Table 3 show that training AMs for each task separately yields relative improvements of 2.9% for ATCO and 2.4% for pilot over the Combined system. This suggests that when more data is available, using our grammar-based approach to obtain speaker role information and train separate ATCO and pilot ASR is better than the Combined approach. The Multitask system does not perform better than the Combined system, suggesting a negative transfer between the ATCO and pilot tasks. This is expected as the ATC data dominates in size during training.
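Relative figures of this kind differ from absolute WER differences: they are normalized by the baseline WER. A minimal sketch of both quantities follows, using hypothetical WER values chosen only to keep the arithmetic simple; the function names are likewise illustrative.

```python
def absolute_reduction(wer_baseline: float, wer_new: float) -> float:
    """Absolute WER reduction in percentage points."""
    return wer_baseline - wer_new


def relative_reduction(wer_baseline: float, wer_new: float) -> float:
    """Relative WER reduction in percent of the baseline WER."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


# Hypothetical WERs (in %): a baseline system and an improved system.
abs_gain = absolute_reduction(40.0, 38.0)
rel_gain = relative_reduction(40.0, 38.0)
```

With these values a 2.0-point absolute reduction corresponds to a 5.0% relative improvement.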
In this work, we compared different ways of training AMs with the state-of-the-art LF-MMI framework for ATCO and pilot speech recordings. The developed ASR systems were evaluated separately on ATCO and pilot test sets built from LiveATC. Due to the noisy nature of both ATCO and pilot test sets, an AM trained with only noise augmented speech data boosts the ASR performance. We proposed a simple grammar-based approach to identify speaker roles automatically and train acoustic models either by speaker role or in a multitask fashion. The results show that the multitask training approach outperforms other training methods when limited training data is available. When sufficient data is available, we show that training AMs separately provides better ASR performance for both ATCO and pilot compared to the model trained by combining all data. Relative improvements of 3.2% for the ATCO set and 1.9% for the pilot set were obtained.
The work was supported by the European Union’s Horizon 2020 project No. 864702 - ATCO2 (Automatic collection and processing of voice data from air-traffic communications), which is a part of Clean Sky Joint Undertaking. The work was also partially supported by SESAR EC project No. 884287 - HAAWAII (Highly automated air-traffic controller workstations with artificial intelligence integration).