Oral cancer is a disease which impacts approximately 529,500 people worldwide every year [30]. Apart from improving survival rates (mortality), research attention has shifted to improving the quality of life after surgery. Oral cancer survivors can suffer from several problems affecting their quality of life: difficulty swallowing [36, 21], decreased tongue mobility [14] and impaired speech intelligibility. The latter is the focus of our paper.
Speech impairment results from oral cancer treatment in which part of the tongue or the entire tongue is removed (partial/total glossectomy). This partial or full removal makes it impossible to reach some articulatory targets. Oral cancer speech is consequently impaired primarily at the articulatory level; only patients who have also undergone radiation therapy have additional problems with phonation.
Different studies show different characteristics of oral cancer speech impairment. Stop consonants (mainly /k/, /g/, /b/, /p/, /t/, /d/) [2, 3] and alveolar sibilants (i.e., /s/, /z/) [18] seem to be primarily affected. In certain cases, patients are able to learn articulatory compensation techniques to adjust for the lost tongue tissue. For example, /t/ and /d/ can be produced by an altered bilabial seal. The effect of glossectomy on vowels and diphthongs is less clear [17, 6].
Analysis of oral cancer speech has so far focused on read speech. In the studies above, participants were asked to read a text passage. However, it has been shown that such structured tasks can fail to identify some characteristics of speech.
So far, no research has been carried out investigating whether oral cancer speech can be reliably differentiated from non-oral cancer speech automatically. The aim of the paper is three-fold: 1) We investigate whether spontaneous oral cancer speech can be differentiated from healthy speech, focusing on spontaneous speech for the first time as far as we know, and as such present the first baselines for oral cancer speech detection. 2) In order to do so, we collected a large dataset, which allows us to use machine learning techniques. Creating a large dataset of pathological speech is time-consuming due to slow patient recruitment. We therefore created a database of “found” oral cancer speech, which is freely available to the community (http://doi.org/10.5281/zenodo.3732322). 3) We provide a preliminary analysis of the differences between spontaneous oral cancer speech and healthy speech.
Pathological speech detection is a broad field with two main approaches. The first is to develop a new acoustic feature using knowledge about the pathological speech and to use that feature in a simple classification model. Effectively, this solves the problem with a divide-and-conquer approach: detecting known characteristics of a pathology and then feeding them into a classifier. A typical example is looking at unsuccessful phone realisations with an automatic speech recogniser (ASR) [37, 22]. The second approach is to generate acoustic features from the audio using standard feature extractors (frontends) such as openSMILE [9], Kaldi [27] or librosa [24]. This is a good approach if we are unsure which features would be the most appropriate. These features are then used to train a few chosen machine learning models (backends) such as artificial neural networks and boosted regression trees. These techniques rely on the models’ learning capabilities to find any difference in the feature distributions of healthy and pathological speakers. We follow the second approach here, using the Kaldi feature extractor along with ASR features.
In order to analyse the differences between oral cancer speech and healthy speech we only use backends which have some degree of explainability. Moreover, in addition to Kaldi features, we use phonetic posteriorgrams (PPG) as ASR-based features, which are easier to interpret than MFCC or PLP.
We manually collected audio data containing English, spontaneous oral cancer speech from 3 male speakers and 8 female speakers from YouTube. The presence of oral cancer speech in the audio was determined by the content of the video and the authors’ experience with such speakers. The audio was manually cut to exclude music, healthy speakers and artefacts, leaving only the oral cancer speech. The total duration of the oral cancer dataset is 2 h and 59 mins.
As our spontaneous healthy speech, we chose a subset of native American English speakers from the VoxCeleb dataset [25]. This dataset was chosen because it was also originally collected from YouTube, which allows us to exclude YouTube characteristics as a confounding factor in the detection task. The gender and number of speakers, as well as the amount of speech material for each speaker, were matched with those of the 11 oral cancer speakers to ensure that the ratio of the recordings is similar in the two datasets. There is no overlap in speakers between the training and test sets. In total, there are 10 speakers (8 female, 2 male) in the training set and 12 speakers (8 female, 4 male) in the test set. The total duration of the training set is 4 h and 36 mins, and that of the test set is 1 h and 28 mins. The average duration per speaker is 27.6 min in the training set and 7.3 min in the test set.
The recordings in the oral cancer dataset were automatically cut into 5 s chunks to match the average duration of the utterances in the VoxCeleb dataset using ffmpeg. The audio was downsampled to 16 kHz and converted from stereo to mono. Loudness was normalised to 0.1 dB using the sox tool.
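The channel downmix and the 5 s chunking described above can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline, which used ffmpeg and sox; the function names `to_mono` and `split_into_chunks` are hypothetical, resampling and loudness normalisation are not shown, and keeping a shorter final chunk is an assumption.

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels of a (num_samples, num_channels) array."""
    if audio.ndim == 1:
        return audio  # already mono
    return audio.mean(axis=1)


def split_into_chunks(audio: np.ndarray, sample_rate: int = 16000,
                      chunk_seconds: float = 5.0) -> list:
    """Cut a mono signal into consecutive fixed-length chunks.

    Assumption: the final chunk is kept even if it is shorter than
    chunk_seconds; the paper does not say how the remainder was handled.
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    return [audio[start:start + chunk_len]
            for start in range(0, len(audio), chunk_len)]


# Toy example: 12 s of synthetic stereo noise at 16 kHz.
rng = np.random.default_rng(0)
stereo = rng.standard_normal((12 * 16000, 2))
chunks = split_into_chunks(to_mono(stereo))
chunk_lengths = [len(c) for c in chunks]
```

With a 12 s input, the sketch yields two full 5 s chunks and one 2 s remainder.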
We compared several frontend and backend combinations to find the best oral cancer vs. healthy speech detection system. Sections 3.1 and 3.2 describe the different frontends and backends, respectively, and the rationale for choosing them. The code of the analysis is also available online (https://github.com/karkirowle/oral_cancer_analysis).
3.1 The preprocessing frontends
The following features were extracted from the audio. All features were calculated using the Kaldi frontend [27], unless mentioned otherwise. Silences were cut using Kaldi’s voice activity detection (VAD) algorithm.
MFCC - Mel-frequency cepstral coefficients are used as the baseline feature.
LTAS - Long term average spectrum is used as a voice quality measurement in the early detection of pathological speech [31, 23] and to evaluate the effect of speech therapy or surgery on voice quality [32]. The LTAS features are extracted by calculating the mean and standard deviation of the frequency bins of Kaldi spectrograms and stacking them together.
PLP - Perceptual linear predictive coefficients are known to be related to the geometry of the vocal tract based on the principles of source-filter theory [10]. Because oral cancer changes the geometry of the vocal tract, we expect PLPs to carry useful information for detection. The PLPs are calculated based on [13].
Pitch - To investigate whether there are also prosodic and phonation impairments in oral cancer speech, a combination of pitch and voicing likelihood features is used [11].
PPG - Phonetic posteriorgrams (PPGs) are probability distributions over a set of phones, i.e., the probability that a given phone is spoken at a given frame of the utterance. The implementation that we used included 40 phones, including the phone for silence. However, silence phones were excluded in our approach.
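As an illustration of the LTAS item in the list above, the following sketch computes the per-bin mean and standard deviation of a spectrogram and stacks them into one vector. It assumes the spectrogram is already available as a (frames × frequency bins) array; the function name `ltas_features`, the toy dimensions, and the ordering (means first, then standard deviations) are assumptions, not details given in the paper.

```python
import numpy as np


def ltas_features(spectrogram: np.ndarray) -> np.ndarray:
    """Return a long-term average spectrum feature vector.

    spectrogram: array of shape (num_frames, num_bins).
    Output: vector of length 2 * num_bins, the per-bin mean over time
    followed by the per-bin standard deviation over time (assumed order).
    """
    mean_per_bin = spectrogram.mean(axis=0)
    std_per_bin = spectrogram.std(axis=0)
    return np.concatenate([mean_per_bin, std_per_bin])


# Toy example: 500 frames with 257 frequency bins of random values.
rng = np.random.default_rng(0)
toy_spectrogram = rng.random((500, 257))
features = ltas_features(toy_spectrogram)
```

Each utterance is thus reduced to a single fixed-length vector regardless of its duration.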
3.2 The backends
Two different backends were used: a Gaussian Mixture Model (GMM) and a linear regression method (LASSO) [34]. Linear regression is generally the easiest to interpret; however, when the dimensionality of the features is high, a feature selection step is usually recommended, which is why we used LASSO. The GMM is widely used in pathological speech detection [35, 7]. The GMM and LASSO models were implemented using the sklearn library [4].
In addition to these two traditional models, we also trained a Dilated Residual Network (ResNet) [38]. Similar architectures have been successful in detecting spoofed speech [12, 20]. We expect ResNet’s ability to recognise unnatural speech to be useful for detection of pathological speech.
3.2.1 Gaussian mixture model
We trained separate GMMs for oral cancer speech and healthy speech. The number of mixture components for each GMM was chosen from a list of candidate values so that it maximises performance on the test set. This could result in overfitting to the test set; however, in practice we found that the test set performance is relatively insensitive to the choice of the mixture parameter. This is further discussed in Section 5. We report the number of mixture components used with the results in Table 1. At test time, we presented the healthy and the oral cancer speech utterances to both models. To determine whether the input speech contained healthy or oral cancer speech, we calculated the likelihoods for each speech frame and averaged over all frames to compute a single likelihood for the entire stretch of speech. The average likelihoods for both models were subsequently compared.
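The two-model decision rule described above can be sketched with scikit-learn's `GaussianMixture`. This is a toy illustration on synthetic frames, not the authors' code: the diagonal covariance type, the number of components, the 13-dimensional features, and the helper names are all assumptions. Note also that `score` returns the mean per-frame log-likelihood, which is one plausible reading of "averaged over all frames".

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_class_gmm(frames: np.ndarray, n_components: int, seed: int = 0):
    """Fit one GMM on all frames of one class (frames: num_frames x dim)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",  # assumption
                          random_state=seed)
    return gmm.fit(frames)


def classify_utterance(frames: np.ndarray, gmm_cancer, gmm_healthy) -> str:
    """Compare the frame-averaged log-likelihood under both models."""
    if gmm_cancer.score(frames) > gmm_healthy.score(frames):
        return "cancer"
    return "healthy"


# Synthetic stand-ins for frame-level features of the two classes.
rng = np.random.default_rng(0)
healthy_train = rng.normal(0.0, 1.0, size=(2000, 13))
cancer_train = rng.normal(3.0, 1.0, size=(2000, 13))

gmm_healthy = train_class_gmm(healthy_train, n_components=2)
gmm_cancer = train_class_gmm(cancer_train, n_components=2)

cancer_like_utt = rng.normal(3.0, 1.0, size=(300, 13))
healthy_like_utt = rng.normal(0.0, 1.0, size=(300, 13))
prediction_a = classify_utterance(cancer_like_utt, gmm_cancer, gmm_healthy)
prediction_b = classify_utterance(healthy_like_utt, gmm_cancer, gmm_healthy)
```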
3.2.2 LASSO
LASSO is a variant of linear regression which performs feature selection and regression simultaneously. For a given linear regression task, some features may not contain any information relevant to the prediction. In LASSO, regression coefficients are encouraged to be close to zero if they do not provide useful information. Zeroing (pruning) some features means that the model requires only a subset of all predictors, making it parsimonious and easier to interpret. Pruning of the features is controlled by a sparsity hyperparameter: the larger this parameter is, the closer the coefficients are to zero. This hyperparameter is taken from a list of candidate values and tuned on the test set (see Section 5). The hyperparameters are reported with the results in Table 1.
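The pruning effect of the sparsity hyperparameter can be seen in a small scikit-learn sketch. The data are synthetic, the two penalty values are arbitrary illustrations rather than values from the paper, and scikit-learn calls the hyperparameter `alpha`, which may differ from the paper's notation. Thresholding the regression output at 0.5 to obtain a class label is also an assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.standard_normal((n_samples, n_features))
# Toy labels: only the first two features carry class information.
y = (X[:, 0] - X[:, 1] > 0).astype(float)  # 1 = oral cancer, 0 = healthy


def fit_lasso(X, y, alpha):
    """Fit a LASSO regression with the given sparsity penalty."""
    return Lasso(alpha=alpha, max_iter=10000).fit(X, y)


def predict_label(model, X, threshold=0.5):
    """Turn the regression output into a binary label (assumed rule)."""
    return (model.predict(X) >= threshold).astype(int)


weak_penalty = fit_lasso(X, y, alpha=0.001)
strong_penalty = fit_lasso(X, y, alpha=10.0)

# Number of features that survive pruning under each penalty.
n_kept_weak = int(np.count_nonzero(weak_penalty.coef_))
n_kept_strong = int(np.count_nonzero(strong_penalty.coef_))
```

With the strong penalty every coefficient is driven to zero, while the weak penalty retains at least the two informative features.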
3.2.3 Neural network classifier (Spectrogram-ResNet)
The ResNet architecture consists of four Dilated ResNet blocks. Each block has a different kernel size (width × height) and number of filters. The blocks are followed by a fully connected layer with 100 hidden nodes and finally a softmax layer. The architecture is described in detail in [12].
The input of the ResNet consisted of spectrograms. Spectrograms are highly informative, high dimensional features, which capture most properties of the raw speech signals. They are widely used with neural network backends [29, 20]. The network was implemented in Keras [5], and the model with the best validation loss was selected after training. We used the Adam optimiser with a learning rate of . To avoid overfitting on the test material, no hyperparameter optimisation was performed (see Section 5).
4 Results and analysis
4.1 Results on training and test set
Detection performance on the training and test sets is measured using accuracy and equal error rate (EER) and is reported in Table 1. Chance level accuracy for the test set is 57.82%. Table 1 also gives the number of Gaussian mixture components used during GMM training and the sparsity-inducing hyperparameter of LASSO, where a larger value indicates a sparser model.
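For readers unfamiliar with the two measures, the sketch below shows one common way to compute an equal error rate from detection scores, together with a majority-class baseline. It is an illustration only: the function names are hypothetical, the EER is approximated at the threshold where the two error rates are closest, and reading "chance level" as the accuracy of always predicting the more frequent class is an assumption.

```python
import numpy as np


def equal_error_rate(scores, labels) -> float:
    """Approximate EER; higher score means more likely oral cancer.

    labels: 1 = oral cancer, 0 = healthy.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)
    # Extra threshold above the maximum so "reject everything" is covered.
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])
    # Miss rate: cancer utterances scored below the threshold.
    fnr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    # False alarm rate: healthy utterances scored at or above the threshold.
    fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    idx = int(np.argmin(np.abs(fnr - fpr)))
    return float((fnr[idx] + fpr[idx]) / 2.0)


def majority_class_accuracy(labels) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    positive_share = float(np.mean(np.asarray(labels, dtype=int)))
    return max(positive_share, 1.0 - positive_share)


eer_separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
baseline = majority_class_accuracy([1, 1, 1, 0])
```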
The Spectrogram-ResNet-based detector achieved the best classification performance both in terms of accuracy and EER. This is closely followed by the LTAS-LASSO model. The superiority of the ResNet over the other methods is likely due to the ResNet seeing the whole utterance at once, unlike the other methods. LASSO backends always outperformed the GMM-based backends on the test set, which is especially striking on the LTAS-based features. One possible explanation for the performance difference might be the collinearity of certain features, as LASSO is known to handle collinearity better. We can see that for non-collinear features like MFCC or PLP the performance difference between LASSO and the GMMs is marginal. The worst performance was achieved by the Pitch features, which is close to chance level for both backends. This indicates that Pitch features are not appropriate for oral cancer speech detection, suggesting that oral cancer speech is indeed impaired primarily at the articulation level [36].
4.2 Analysis of the differences between oral cancer speech and healthy speech
To investigate the differences between oral cancer speech and healthy speech, we analysed the two best performing architectures (Spectrogram-ResNet, LTAS-LASSO) and PPG-GMM in terms of the information in the speech signal that these models used to distinguish oral cancer speech from healthy speech. While the PPG-GMM does not stand out in terms of accuracy, it lends itself to easy interpretation of the acoustic information used for the task.
4.2.1 Analysis of the Spectrogram-ResNet detector
To investigate what information the ResNet classifier uses to distinguish between oral cancer speech and healthy speech, we look at what parts of the spectrogram change the classification results the most.
To find these salient parts of the spectrogram, we calculate mean class activation maps, which indicate what frequencies are the most important for detection of both classes, for each sample in our test set as follows: Given a spectrogram image and a class label (oral cancer speech/healthy speech) as input, we pass the image through the ResNet to obtain the raw class scores before softmax. The gradients are set to zero for all classes except the desired class (i.e., oral cancer speech), which is set to 1. This signal is then backpropagated to the rectified convolutional feature map of interest. We used the implementation from the keras-vis library.
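The combination step of a gradient-based class activation map can be sketched in NumPy as follows. This shows only how feature maps and their gradients are combined in the Grad-CAM formulation and how maps are averaged over samples; it assumes the gradients of the chosen class score with respect to the feature maps have already been computed by the network framework. The function names and the (height, width, channels) layout are assumptions, and the authors relied on the keras-vis implementation rather than code like this.

```python
import numpy as np


def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine feature maps and gradients into one activation map.

    Both inputs have shape (height, width, channels). Each channel is
    weighted by its spatially averaged gradient, the weighted channels
    are summed, and negative values are clipped to zero (ReLU).
    """
    channel_weights = gradients.mean(axis=(0, 1))
    cam = np.tensordot(feature_maps, channel_weights, axes=([2], [0]))
    return np.maximum(cam, 0.0)


def mean_activation_map(cams: list) -> np.ndarray:
    """Average the per-sample maps of one class."""
    return np.mean(np.stack(cams, axis=0), axis=0)


# Tiny worked example with two channels on a 2 x 2 grid.
feature_maps = np.zeros((2, 2, 2))
feature_maps[:, :, 0] = [[1.0, 2.0], [3.0, 4.0]]
feature_maps[:, :, 1] = 1.0
gradients = np.zeros((2, 2, 2))
gradients[:, :, 0] = 1.0   # channel 0 supports the class
gradients[:, :, 1] = -2.0  # channel 1 counts against it

cam = grad_cam(feature_maps, gradients)
mean_cam = mean_activation_map([cam, np.zeros((2, 2))])
```

In the toy example the channel weights are 1 and -2, so the map before clipping is [[-1, 0], [1, 2]] and the clipped map keeps only the bottom row.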
Figure 1 shows the mean class activation maps for healthy speech and oral cancer speech. For healthy speech (top panel), the neural network spreads its focus (indicated by the coloured bands where red means higher intensity) among all the frequency bands. For oral cancer speech (bottom panel), the majority of the acoustic energy lies above the 4 kHz band. This indicates that sibilant frequencies, which are above 4 kHz [19], might be important for distinguishing between oral cancer and healthy speech. This is in agreement with previous studies which name sibilants as impaired sounds [18].
4.2.2 Analysis of the PPG-GMM detector
The trained GMMs can be viewed as models of a global oral cancer speaker and a global healthy speaker. By constructing a difference model, the mean parameters of the GMMs can tell us which features are more typical of oral cancer speakers and which of healthy speakers. First, we calculate the difference of the mixture components in the two GMMs, obtaining a difference vector for each phone. Taking the mean of this difference vector, we obtain a signed scalar for each phone class. If this scalar is positive, that phone has a higher likelihood of occurring in oral cancer speech than in healthy speech. If it is negative, the likelihood of that phone is lower in oral cancer speakers, meaning that they have trouble pronouncing that phone.
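A minimal sketch of such a difference model is given below. It assumes both GMMs have the same number of components and that components are paired by index, which the text does not specify; the function names, the three-phone inventory, and the numeric values are purely illustrative. The selection threshold of 0.005 is the one used for Figure 2.

```python
import numpy as np


def phone_difference_scores(means_cancer: np.ndarray,
                            means_healthy: np.ndarray) -> np.ndarray:
    """Signed score per phone from two sets of GMM means.

    Both inputs have shape (num_components, num_phones). Components are
    paired by index (assumption), differenced, and averaged per phone.
    """
    difference = means_cancer - means_healthy
    return difference.mean(axis=0)


def select_phones(scores, phone_labels, threshold=0.005) -> dict:
    """Keep phones whose absolute mean difference exceeds the threshold."""
    return {phone: float(score)
            for phone, score in zip(phone_labels, scores)
            if abs(score) > threshold}


# Illustrative values only: two components, three phone classes.
phone_labels = ["t", "s", "aa"]
means_cancer = np.array([[0.01, 0.05, 0.030],
                         [0.03, 0.05, 0.030]])
means_healthy = np.array([[0.04, 0.02, 0.030],
                          [0.04, 0.02, 0.032]])

scores = phone_difference_scores(means_cancer, means_healthy)
selected = select_phones(scores, phone_labels)
```

Here the toy phone "t" receives a negative score (less likely in the oral cancer model), "s" a positive one, and "aa" falls below the threshold.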
Figure 2 shows the results for the phones with absolute mean differences larger than 0.005. The blue bars indicate phones that are more often present in healthy speech and the red bars indicate phones that are more typical of cancer speakers. We can see that /t/, /w/, /iy/, /k/ and /d/ have lower likelihoods, indicating that some stop consonants are challenging for oral cancer speakers. This is in agreement with [2, 3].
4.2.3 Analysis of the LTAS-LASSO based detector
LASSO-based models can be analysed using the regression coefficients. A positive coefficient indicates a feature contributing to the cancer class, and a negative coefficient a feature contributing to the healthy class. Figure 3 shows the learned coefficients. The blue line shows the coefficients of the LTAS means and the orange line the coefficients of the LTAS standard deviations. Given that LASSO discourages selecting neighbouring frequencies (because they provide similar, collinear information), it is surprising that some frequency bands have several adjacent positive or negative coefficients (clusters, shown as adjacent spikes). These clusters indicate that a greater level of frequency resolution is needed for that particular frequency band. For oral cancer speech this is the 3-4 kHz band, indicating that sibilant frequencies need greater frequency resolution.
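One way to locate such clusters programmatically is to search the coefficient vector for runs of adjacent non-zero coefficients with the same sign. The sketch below is an assumption about how this could be done, not the authors' procedure; the helper names, the minimum run length of two, and the ordering of the stacked LTAS vector (means first, then standard deviations) are hypothetical.

```python
import numpy as np


def split_ltas_coefficients(coef: np.ndarray):
    """Split a stacked coefficient vector into mean and std halves.

    Assumption: the first half belongs to the per-bin means and the
    second half to the per-bin standard deviations.
    """
    half = len(coef) // 2
    return coef[:half], coef[half:]


def find_sign_clusters(coef: np.ndarray, min_length: int = 2) -> list:
    """Return (start, end_exclusive, sign) for runs of adjacent
    non-zero coefficients that share the same sign."""
    signs = np.sign(coef).astype(int)
    clusters = []
    start = 0
    for i in range(1, len(signs) + 1):
        # Close the current run at the end or when the sign changes.
        if i == len(signs) or signs[i] != signs[start]:
            if signs[start] != 0 and i - start >= min_length:
                clusters.append((start, i, int(signs[start])))
            start = i
    return clusters


# Illustrative coefficients for nine frequency bins.
toy_coef = np.array([0.0, 0.2, 0.1, 0.0, -0.3, -0.1, -0.2, 0.0, 0.4])
clusters = find_sign_clusters(toy_coef)
mean_half, std_half = split_ltas_coefficients(np.arange(6.0))
```

In the toy vector, bins 1-2 form a positive cluster and bins 4-6 a negative one, while the isolated coefficient in the last bin is not reported.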
5 General discussion
The paper presents the first baseline models for the task of healthy vs. oral cancer speech classification. The Spectrogram-ResNet classifier achieved a high classification accuracy and outperformed all other models. A preliminary analysis of the three models indicated that sibilant frequencies and stop consonant phones are the most decisive in the classification.
The current study used healthy speech from the VoxCeleb dataset, which only contains recordings from celebrities. Although potentially the detectors could use other features than those related to the acoustic characteristics of the speech for classification, this is not likely: inspection of the recognition results of the individual speakers in both datasets showed that in both datasets some speakers are well classified whereas others are not (range oral cancer speech: 49.7% – 100.0%; range healthy speech: 34.8% – 94.4% on the test set), although on average the oral cancer speakers were better classified than the healthy speakers: 89.6% vs. 66.2%.
Because of our relatively small-sized datasets, we kept hyperparameter tuning to a minimum to avoid overfitting on the test set. The more traditional methods (GMM and LASSO) only have a single hyperparameter, so chances of overfitting to the test set are small. Neural networks, on the other hand, usually have a myriad of hyperparameters, which makes overfitting more likely. To avoid this, we refrained from using any tuning mechanisms at all with the neural networks.
We presented a brand new dataset for analysis of spontaneous oral cancer speech, and showed that a detector based on ResNet taking spectrograms as input had a high performance in distinguishing between oral cancer speech and healthy speech, and generalised well to unseen data. Analysis of the speech signals through the classifiers shows that sibilants and stop consonants are important for oral cancer speech detection, while no evidence has been found on the importance of vowels and diphthongs.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 766287. The Department of Head and Neck Oncology and surgery of the Netherlands Cancer Institute receives a research grant from Atos Medical (Hörby, Sweden), which contributes to the existing infrastructure for quality of life research.
[1] (2008) Age and gender recognition for telephone applications based on GMM supervectors and support vector machines. In ICASSP, pp. 1605–1608.
[2] (2009) Speech outcomes for partial glossectomy surgery: measures of speech articulation and listener perception. Head and Neck Cancer 33 (4), pp. 204.
[3] (2004) Consonant intelligibility and tongue motility in patients with partial glossectomy. Journal of Oral and Maxillofacial Surgery 62 (3), pp. 298–303.
[4] (2013) API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.
[5] Keras: The Python Deep Learning library. Keras.Io.
[6] (2009) Objective acoustic-phonetic speech analysis in patients treated for oral or oropharyngeal cancer. Folia Phoniatrica et Logopaedica 61 (3), pp. 180–187.
[7] (2002) Feature analysis for automatic detection of pathological speech. In Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society (Engineering in Medicine and Biology), Vol. 1, pp. 182–183.
[8] (1999) Quality of life and oral function following radiotherapy for head and neck cancer. Head Neck.
[9] OpenSMILE: the Munich open-source large-scale multimedia feature extractor. ACM SIGMultimedia Records 6 (4), pp. 4–13.
[10] (1981) The source filter concept in voice production. STL-QPSR 1 (1981), pp. 21–37.
[11] A pitch extraction algorithm tuned for automatic speech recognition. In ICASSP, pp. 2494–2498.
[12] (2020) Residual networks for resisting noise: analysis of an embeddings-based spoofing countermeasure. In Odyssey The Speaker and Language Recognition Workshop.
[13] (1990) Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 87 (4), pp. 1738–1752.
[14] (2019) Quantification of tongue mobility impairment using optical tracking in patients after receiving primary surgery or chemoradiation. PloS one 14 (8).
[15] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
[16] (2018) ANN-based Alzheimer’s disease classification from bag of words. In Speech Communication; 13th ITG-Symposium, pp. 1–4.
[17] (2010) Speech after radial forearm free flap reconstruction of the tongue: a longitudinal acoustic study of vowel and diphthong sounds. Clinical linguistics & phonetics 24 (1), pp. 41–54.
[18] (2011) A longitudinal acoustic study of the effects of the radial forearm free flap reconstruction on sibilants produced by tongue cancer patients. Clinical linguistics & phonetics 25 (4), pp. 253–264.
[19] (2012) Vowels and consonants. Wiley & Blackwell.
[20] (2019) ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual Networks. In Interspeech, pp. 1013–1017.
[21] (1997) Speech and swallowing rehabilitation for head and neck cancer patients. Oncology 11 (5).
[22] (2006) Fully automatic assessment of speech of children with cleft lip and palate. Informatica 30 (4).
[23] (2006) The long-term average spectrum in research and in the clinical practice of speech therapists. Pro-fono: revista de atualizacao cientifica.
[24] (2015) Librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Vol. 8.
[25] (2017) VoxCeleb: a large-scale speaker identification dataset. In Interspeech, pp. 2616–2620.
[26] (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210.
[27] (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
[28] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision.
[29] (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, pp. 4779–4783.
[30] (2017) The global incidence of lip, oral cavity, and pharyngeal cancers by subsite in 2012. CA: A Cancer Journal for Clinicians 67 (1), pp. 51–64.
[31] (2014) Long-time average spectrum in individuals with Parkinson disease. NeuroRehabilitation.
[32] Spectral moments of the long-term average spectrum: Sensitive indices of voice change after therapy?. Journal of Voice.
[33] (2005) Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech. In IEEE International Conference Mechatronics and Automation, 2005, Vol. 3, pp. 1569–1574.
[34] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
[35] (2012) Automatic detection of sigmatism in children. In Third Workshop on Child, Computer and Interaction.
[36] (2014) Head and neck cancer: treatment, rehabilitation, and outcomes. Chapter 5: speech and swallowing following oral, oropharyngeal, and nasopharyngeal cancer. Plural Publishing.
[37] (2009) Automatic quantification of speech intelligibility of adults with oral squamous cell carcinoma. Folia Phoniatrica et Logopaedica 60 (3), pp. 151–6.
[38] (2017) Dilated residual networks. CoRR abs/1705.09914.
[39] (2019) Foreign accent conversion by synthesizing speech from phonetic posteriorgrams. In Interspeech, pp. 2843–2847.