Alzheimer’s disease (AD) is a chronic neurodegenerative disease that affects memory, language, cognitive skills, and the ability to perform simple everyday tasks.
Throughout the course of AD, patients have been observed to lose lexical-semantic skills, including anomia, reduced word comprehension, object naming problems, semantic paraphasia, and a reduction in vocabulary and verbal fluency [forbes2005detecting, bayles1982potential]. Speech in patients with AD is mostly characterised by a low speech rate and frequent hesitations at the phonetic and phonological level; however, syntactic ability is better preserved than lexical-semantic ability in the early stages of the disease [kave2003morphology].
Before an AD diagnosis can be made, the presence of cognitive dysfunction must be confirmed by neuropsychological tests, such as the Mini-Mental State Examination (MMSE), performed in medical clinics. A diagnosis is made on the basis of typical neurological and neuropsychological characteristics and a clinical examination of the patient’s history.
Detecting early diagnostic biomarkers that are non-invasive and cost-effective is of great value for clinical assessments. Several previous studies have investigated AD diagnosis via acoustic, lexical, syntactic, and semantic aspects of speech and language. More interactional aspects of language, like disfluencies, and purely non-verbal features, such as intra- and inter-speaker silence, can be key features of AD conversations. If useful for diagnosis, these features can have many advantages: they are easy to extract and are relatively language, subject, and task agnostic.
In terms of speech features, the number of pauses, pause proportion, phonation time, phonation-to-time ratio, speech rate, articulation rate, and noise-to-harmonic ratio were all found to be related to the severity of Alzheimer’s disease [szatloczki2015speaking]. Weiner et al. [weiner2016speech] used a Linear Discriminant Analysis (LDA) classifier with a set of acoustic features, including the mean duration of silent segments, silence durations, and the silence-to-speech ratio, to differentiate subjects with AD from a control group, achieving 85.7% accuracy on binary AD classification. Ambrosini et al. [ambrosini2019automatic] used selected acoustic features (pitch, voice breaks, shimmer, speech rate, syllable duration) to detect mild cognitive impairment from a spontaneous speech task.
Lexical features from spontaneous speech have been shown to be informative in terms of features that assist AD detection. For example, Jarrold et al. [jarrold2014aided] merged acoustic features with the frequency occurrence of 14 distinct parts of speech features. Abel et al. [abel2009connectionist] modeled patient speech errors (naming and repetition disorders) to aid AD diagnosis.
Modeling multimodal input for AD detection has also been studied. Gosztolya et al. [gosztolya2019identifying] looked at how two SVM models with different sets of acoustic and linguistic features could be combined. Their research demonstrated how audio and lexical features could provide additional knowledge about an individual with AD.
Among other similar tasks within cognitive state prediction like depression, research has been done on integrating temporal information from two or more modalities using multimodal fusion [rohanian2019detecting]. The different predictive capacities of each modality and their different levels of noise are a major challenge for these models. A gating mechanism is effective in controlling the level of contribution of each modality to the final prediction in a variety of multimodal tasks, including in AD classification and regression [Rohanian2020].
This paper constitutes an entry into the Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSSo) challenge 2021 [bib:LuzEtAl21ADReSSo], which involves AD classification and MMSE score regression tasks, in addition to a cognitive decline (disease progression) inference task, using only the audio data from formal diagnosis interviews with patients as input. In the first two tasks, participants are required to rate the severity of Alzheimer’s disease in various subjects, with the target severity determined by their MMSE scores. In the third task, participants should identify those patients who exhibit cognitive decline within two years.
In this paper, we were particularly interested in the benefit of fusing ASR results (rather than transcripts) with acoustic data and whether self-repair disfluencies and unfilled pauses in individuals’ speech and language model probabilities (a measure of lexical predictability) from automatic speech recognition (ASR) results would help predict the severity of the patient’s cognitive impairment.
Inspired by [Rohanian2020], to detect AD we used audio and text features to model the sessions in a Bidirectional Long Short-Term Memory (BiLSTM) neural network. In a separate experiment, we used the Bidirectional Encoder Representations from Transformers (BERT) model to classify AD from speech recognition results. Our findings suggest that AD can be identified using pure sequential modelling of the speech recognition results from the interview sessions, with limited details of the structure of the description tasks. Disfluency markers, unfilled pauses, and language model probabilities were also found to have predictive power for detecting Alzheimer’s disease.
2 Data and features
Two distinct datasets were used for the ADReSSo Challenge:
a set of speech recordings of picture descriptions produced by both patients with an AD diagnosis and subjects without AD (controls), who were asked to describe the Boston Diagnostic Aphasia Exam’s Cookie Theft picture [bib:LuzEtAl21ADReSSo].
a set of speech recordings of Alzheimer’s patients performing a category (semantic) fluency task [benton1968differential] at their baseline visit for prediction of cognitive decline over two years.
Dataset 1, for AD classification and severity detection, includes 237 audio recordings; the state of the subjects is assessed based on the MMSE score. The MMSE is a commonly used cognitive function test for older people, involving tests of orientation, memory, language, and visual-spatial skills. Scores of 25-30 out of 30 are considered normal, 21-24 mild, 10-20 moderate, and below 10 severe impairment.
Dataset 2, for the disease prognostics task (prediction of cognitive decline), was created from a longitudinal cohort study involving AD patients. The period for assessing disease progression spanned the baseline and year-2 data collection visits of the patients to the clinic. The task involves classifying patients into ‘decline’ or ‘no-decline’ categories, given speech collected at baseline as part of a verbal fluency test.
Various features were extracted automatically from both datasets for the 3 ADReSSo tasks as described below.
2.1 Acoustic features
A set of 79 audio features was extracted using the COVAREP acoustic analysis framework, a software package for automatic extraction of features from speech [degottex2014covarep]: prosodic features (fundamental frequency and voicing), voice quality features (normalized amplitude quotient, quasi-open quotient, the difference in amplitude of the first two harmonics of the differentiated glottal source spectrum, maxima dispersion quotient, parabolic spectral parameter, spectral tilt/slope of wavelet responses, and the shape parameter of the Liljencrants-Fant model of the glottal pulse dynamics), and spectral features (Mel cepstral coefficients 0-24, Harmonic Model and Phase Distortion mean 0-24 and deviations 0-12). Segments without audio data were set to zero. Standard zero-mean and unit-variance normalization was applied to the features, and we omitted all features with no statistically significant univariate correlation with the labels on the training set.
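The normalization and univariate feature selection described above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the significance threshold (alpha=0.05) and the use of Pearson correlation as the univariate test are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def normalize_and_select(X, y, alpha=0.05):
    """Zero-mean/unit-variance normalization, then drop features whose
    univariate Pearson correlation with the labels is not significant
    at level alpha (assumed threshold).
    X: (n_samples, n_features) feature matrix; y: length-n labels/scores."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    Xn = (X - mu) / sigma
    keep = []
    for j in range(Xn.shape[1]):
        _, p = pearsonr(Xn[:, j], y)
        if p < alpha:
            keep.append(j)
    return Xn[:, keep], keep
```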
2.2 Linguistic Features
To transcribe the audio files automatically, we used the free trial version of IBM’s Watson Speech-To-Text service (https://www.ibm.com/uk-en/cloud/watson-speech-to-text). The service performs ASR on audio data that has considerable noise and may be affected by the patients’ non-standard North American dialects: the average Word Error Rate (WER) on 10 transcripts we randomly selected from the training data is 32.8%. Crucially for our task, the Watson service does not filter out hesitation markers or disfluencies [baumann2017recognising]. It also outputs word timings, which we use as features in our system.
For our models which did not use BERT, a pre-trained GloVe model [pennington2014glove] was used to extract lexical feature representations from the picture description transcripts and convert the utterance sequences into word vectors. We selected the hyperparameter values that optimised model output on the training set; the optimal embedding dimension was found to be 100.
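The GloVe lookup can be sketched as below. The loader assumes the published GloVe file format (one token per line followed by its vector components); the handling of out-of-vocabulary tokens as zero vectors is an assumption, and the tiny 3-dimensional table used for illustration is hypothetical.

```python
import numpy as np

def load_glove(path):
    """Load GloVe-format embeddings: each line is a token followed by
    its space-separated vector components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_utterance(tokens, vectors, dim=100):
    """Map a token sequence to a (len(tokens), dim) matrix of word
    vectors; OOV tokens get zero vectors (an assumed convention)."""
    return np.stack([vectors.get(t, np.zeros(dim, dtype=np.float32))
                     for t in tokens])
```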
2.3 Disfluencies
Disfluencies are usually seen as indicative of communication problems caused by production or self-monitoring issues [levelt1983monitoring]. Individuals with AD are likely to experience difficulties with language and cognitive skills: patients with AD speak more slowly, with longer breaks, and invest extra time in searching for the right word, which contributes to disfluency [lopez2013selection, nasreen2021alzheimer].
We automatically annotate self-repairs and edit terms using [rohanian-hough-2020-framing]’s multi-task learning model, which predicts disfluency tags in a left-to-right, word-by-word manner. Each word is tagged by the disfluency detector as one of: repair onset, edit term, or fluent word. We concatenate the disfluency tags with the word vectors to create the input for the text-based LSTM classifier described below.
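The concatenation of disfluency tags with word vectors can be sketched as follows, using a one-hot encoding of the three-way tag inventory from the text; the one-hot representation itself is an assumption about how the tags are encoded.

```python
import numpy as np

# Tag inventory from the text; the ordering is arbitrary.
TAGS = ["fluent", "edit", "repair_onset"]

def concat_disfluency(word_vecs, tags):
    """Append a one-hot disfluency tag to each word vector to build the
    input for the text-based LSTM classifier.
    word_vecs: (T, D) array; tags: length-T list of entries from TAGS."""
    onehot = np.zeros((len(tags), len(TAGS)), dtype=np.float32)
    for i, t in enumerate(tags):
        onehot[i, TAGS.index(t)] = 1.0
    return np.concatenate([word_vecs, onehot], axis=1)
```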
2.4 Unfilled Pauses
Pause durations were calculated from the word timings in the ASR hypotheses, taking the latency between the end of the previous word and the beginning of the patient’s current word as the pause length, with the value for the first word set to 0. We further categorized pauses as either short pauses (SP) or long pauses (LP). An SP is a silence that occurs inside a single speaker turn and lasts less than 1.5 seconds; an LP is a longer pause within a single speaker turn, defined as a speech pause of 1.5 seconds or greater. Pauses in the interviewer’s speech were excluded.
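The pause computation above can be sketched directly from the ASR word timings. This is an illustrative sketch: the `(word, start, end)` tuple format stands in for whatever structure the ASR service actually returns.

```python
def pause_features(word_timings, lp_threshold=1.5):
    """Compute inter-word pause durations from ASR word timings and
    label them SP (< 1.5 s) or LP (>= 1.5 s), within one speaker turn.
    word_timings: list of (word, start, end) tuples in seconds
    (an assumed format). The first word gets a pause of 0 and no label."""
    feats = []
    prev_end = None
    for word, start, end in word_timings:
        pause = 0.0 if prev_end is None else max(0.0, start - prev_end)
        label = "LP" if pause >= lp_threshold else "SP" if pause > 0 else None
        feats.append((word, pause, label))
        prev_end = end
    return feats
```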
2.5 Language Model Probabilities
People with speech disorders or cognitive impairment express themselves differently from control groups [gabani2009corpus]. Language model probabilities, which estimate the predictability of a sequence of words, can be used to assess a participant’s language structure, including vocabulary and syntactic constructions. The present work uses a Multi-task Learning (MTL) LSTM language model [rohanian-hough-2020-framing] trained on the Switchboard corpus [godfrey1992switchboard], a sizable multi-speaker corpus of conversational speech and text. The language model uses the standard Switchboard training data for disfluency detection (all conversations numbered sw2*, sw3* in the Penn Treebank III release: 100k utterances, 650k words) and is trained jointly with other tasks, including disfluency detection, as described in [rohanian-hough-2020-framing]. This corpus can be viewed as an approximation of control (non-AD) spoken dialogue. The model is then run on the ASR transcript of each session, and the probability of each word is calculated. Finally, we concatenate the probability of the current word given its history with the word vectors to create the lexical input for our model.
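The per-word probability feature can be illustrated with a toy language model. The sketch below uses an add-one-smoothed bigram model purely as a stand-in for the MTL LSTM language model described above; the smoothing scheme and `<s>` sentence-start token are assumptions of the illustration, not details of the actual system.

```python
from collections import Counter

def train_bigram_lm(corpus_sents):
    """Train a tiny add-one-smoothed bigram LM (a stand-in for the MTL
    LSTM LM trained on Switchboard). corpus_sents: list of token lists.
    Returns a function prob(prev, word)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sents:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return prob

def word_probabilities(tokens, prob):
    """Probability of each word given its predecessor; these scalars are
    what gets concatenated with the word vectors."""
    prev, out = "<s>", []
    for t in tokens:
        out.append(prob(prev, t))
        prev = t
    return out
```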
3 Proposed Approach
We experiment with different deep-learning architectures for predicting AD in both classification and regression and for cognitive decline prediction:
unimodal LSTM models using either acoustic or lexical features.
multimodal LSTM model using lexical and acoustic information, including disfluency and pause tagging.
unimodal BERT based classifier using lexical features.
multimodal BERT model with gating using lexical and acoustic information.
3.1 Sequence modeling
Our approach is to model the speech of individuals as a sequence to predict whether they have AD or not, and if so, to what degree, using either LSTMs or BERT models.
LSTM The potential of neural networks lies in their power to derive feature representations through non-linear transformations of the input data, providing greater capacity than traditional models. As we were interested in modelling the temporal nature of the speech recordings and transcripts, we used a bi-directional LSTM. For each of the audio and text modalities, we trained a separate unimodal LSTM model using different sets of features, then used late fusion to combine their output probabilities.
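The late fusion step can be sketched as a weighted average of the class probabilities produced by the two unimodal LSTMs; the equal weighting (`w=0.5`) and the 0.5 decision threshold below are assumptions for illustration, not tuned values from the paper.

```python
def late_fusion(p_audio, p_text, w=0.5):
    """Combine per-example AD probabilities from the unimodal audio and
    text LSTMs by weighted averaging (equal weights are an assumption)."""
    return [w * a + (1.0 - w) * t for a, t in zip(p_audio, p_text)]

def predict_ad(p_audio, p_text, threshold=0.5):
    """Binary AD / non-AD decision from the fused probabilities."""
    return [int(p >= threshold) for p in late_fusion(p_audio, p_text)]
```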
BERT Pre-trained BERT models are fine-tuned for the AD classification task. Each training instance is considered a data point. The input to the model consists of the sequence of words from the transcript for each speaker. Following [yuan2020disfluencies], we used BertForSequenceClassification for fine-tuning. The standard default tokenizer was used, and two special tokens, [CLS] and [SEP], were added to the beginning and end of each input. For regression, the last layer has shape (hidden size, 1), and we use MSE loss instead of cross-entropy.
3.2 Multimodal Model with Gating
Since the learned representation for the text can be undermined by the corresponding audio representation, and ASR results can be unreliable, we need to minimise the effects of noise and overlap during multimodal fusion. For the audio and textual input to the BiLSTM models, we use two branches of the LSTM, one per modality, with their outputs combined in final feed-forward highway layers [srivastava2015highway], whose gating units learn to weigh the text and audio inputs at each time step and so regulate information flow through the network.
The concatenated output is passed through N highway layers, where the best value of N was determined by optimizing on held-out data. Because the audio and text inputs for each branch of the LSTM had different timesteps and strides, we padded the training examples in the text set (the smaller set) to match the audio set, mapping together instances that occurred in the same session.
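A single highway layer [srivastava2015highway] can be sketched in numpy as below: a transform gate T decides, per dimension, how much of the transformed input H(x) versus the raw input x passes through, which is the mechanism the gating units use to weigh the (concatenated) text and audio components. The tanh non-linearity for H is an assumption of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, Wh, bh, Wt, bt):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.
    x: (batch, d); Wh, Wt: (d, d); bh, bt: (d,).
    T near 0 carries the input through; T near 1 applies the transform."""
    H = np.tanh(x @ Wh + bh)       # candidate transformation (assumed tanh)
    T = sigmoid(x @ Wt + bt)       # transform gate in (0, 1)
    return T * H + (1.0 - T) * x   # gated mix of transform and carry
```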
For the BERT-based multimodal models with gating, the output from the BERT-based textual classifier is combined with the acoustic data into the final feed-forward highway layers.
4.1 Implementation and Metrics
We set up our models to learn the most helpful information from each modality for predicting AD. All experiments are carried out without conditioning on the identity of the speaker.
For the LSTM models, the layer sizes and learning rates were determined by grid search on a validation set. For the input data, we explored different timesteps and strides. After exploring different hyper-parameters, the model using audio data has a timestep of 20 and stride of 1, with four bi-directional LSTM layers of 256 hidden nodes each. The model using text input has a timestep of 10 and stride of 2, with 2 LSTM layers of 16 hidden nodes each. We use a block of 3 stacked highway layers. The LSTM models were trained using ADAM [kingma2014adam] with a learning rate of 0.0001. For the loss function, we used Binary Cross-Entropy for binary outcomes and Mean Square Error (MSE) for regression outcomes.
For the BERT models, following [yuan2020disfluencies], we use the “bert-large-uncased” model with the hyperparameters: learning rate = 2e-5, batch size = 4, epochs = 8, and a maximum input length of 256.
For binary classification of AD and non-AD, we report binary accuracy scores. For the MMSE prediction task, we report the Root Mean Square Error (RMSE) for the prediction error score. For the cognitive decline task, we report the mean of F1 classification scores.
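The three reported metrics can be computed as follows; this is a plain numpy sketch of the standard definitions (accuracy, RMSE, and per-class F1, with the challenge metric being the mean of the per-class F1 scores).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Binary accuracy, as reported for AD vs. non-AD classification."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Square Error, as reported for MMSE score prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def f1_score(y_true, y_pred, positive=1):
    """F1 for one class; the cognitive-decline metric averages this
    over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))
```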
The code used in the experiments is publicly available in an online repository (https://github.com/mortezaro/ad-recognition-from-speech).
4.2 Baseline Models
We compare the performance of our models to the ADReSSo Challenge baselines [bib:LuzEtAl21ADReSSo], which use an ensemble of audio and linguistic features provided with the dataset. The best baselines included here are decision trees (DT), linear discriminant analysis (LDA), support vector machines (SVM), support vector regression (SVR), and Gaussian process regression (GP).
Table 1 (excerpt): AD classification accuracy (ACC) and MMSE regression error (RMSE) on the test set.
| Model | Features | ACC | RMSE |
| BERT w/ Gating | Words+Acoustic | - | 4.38 |
| LSTM w/ Gating | Words+Acoustic+Disf+Pse+WP | 0.84 | 4.26 |
AD classification and regression tasks In Table 1, we present our proposed models’ performance against that of the baseline models on the AD classification and regression tasks on the provided test set, and in Table 2 in a cross-validation setting. For AD detection, our proposed LSTM model with gating and additional features (disfluency, unfilled pause, and language model probabilities) achieves an accuracy of 0.84 and an RMSE of 4.26, outperforming all the baselines. Overall, the results support our hypothesis that a model with a gating structure can more effectively reduce the errors and noise of the individual modalities, including that from errorful ASR results. Furthermore, this LSTM model outperforms the fine-tuned BERT models in both unimodal and multimodal settings (ACC 0.84 vs. 0.80; RMSE 4.26 vs. 4.49 and 4.38). It should also be noted that the BERT model is very large in comparison to the LSTM models: BERT has approximately 21 times as many parameters as our second-largest model (105 million vs. 4.9 million), so, compared to the BERT model, our LSTM models need fewer resources for development.
Effect of disfluency and unfilled pause features We found that disfluencies and unfilled pauses help as features in AD detection. Adding disfluency and pause features to the lexical features leads to improvements on the test set (ACC 0.81 vs. 0.76) and in CV (ACC 0.78 vs. 0.74; RMSE 5.02 vs. 5.31). Our LSTM model with disfluencies and unfilled pauses outperforms the BERT model in both the classification and regression tasks on the test set (ACC 0.81 vs. 0.80; RMSE 4.43 vs. 4.49).
Effect of language model probabilities Language model probabilities (as an indicator of grammatical integrity) are useful as features in the diagnosis of AD. Adding language model probabilities to the lexical features improves performance on the test set (ACC 0.77 vs. 0.76) and in CV (ACC 0.78 vs. 0.74; RMSE 4.78 vs. 5.31).
Table 2 (excerpt): AD classification accuracy (ACC) and MMSE regression error (RMSE) in cross-validation.
| Model | Features | ACC | RMSE |
| LSTM w/ Gating | Words+Acoustic | 0.79 | 4.88 |
| LSTM w/ Gating | Words+Acoustic+Disf+Pse+WP | 0.81 | 4.75 |
Effect of multimodality On both the test set and in CV, the multimodal LSTM with gating outperforms the single-modality AD detection models in both the classification and regression tasks. In CV, integrating the textual and audio modalities with gating improves performance over single-modality models (ACC 0.79 vs. 0.74; RMSE 4.88 vs. 5.31). Even though each LSTM branch has different strides and timestep inputs in the multimodal models, adding audio features improves performance. The multimodal BERT model outperforms the single-modality BERT model in the regression task on both the test set and in CV (RMSE 4.72 and 4.38 vs. 4.94 and 4.49). However, integrating the BERT and audio models with gating decreases performance relative to BERT alone for classification in CV (ACC 0.78 vs. 0.80). Text features are more informative than audio features: using the text modality alone predicts AD better than using the audio modality alone in CV (ACC 0.74 vs. 0.68; RMSE 5.31 vs. 6.03).
Table 3 (excerpt): cognitive decline prediction on the test set.
| Model | Features | F1 | ACC |
| LSTM w/ Gating | Words+Acoustic+Disf+Pse+WP | 0.66 | 0.62 |
Cognitive decline (disease progression) inference task In Table 3, we present our results for the disease progression task. As can be seen, our models do not reach the best baseline, a Decision-Tree (DT) based classifier. However, as with AD classification, the multimodal LSTM with gating outperforms all our other models and is close to the DT classifier in performance on the test data (ACC 0.62 vs. 0.67). Overall, this task shows considerably greater variation in performance across baseline classifiers and feature sets than the other two tasks. The lower performance of the LSTM model using words with disfluency and pause information compared to using words alone (ACC 0.55 vs. 0.59) suggests these extra features are not as useful here as the lexical information alone. This suggests ASR quality is more critical for this task, and comparing the IBM Watson system used here against the results obtained with the Google Cloud Speech Recogniser used by [bib:LuzEtAl21ADReSSo] would be a future step to take.
5 Conclusion
We have presented two multimodal fusion-based deep learning models which consume ASR-transcribed speech and acoustic data simultaneously to classify whether a speaker in a structured diagnostic task has Alzheimer’s Disease, and to what degree. Our best model, a BiLSTM with highway layers using words, word probabilities, disfluency features, pause information, and a variety of acoustic features, achieves an accuracy of 84%. While predicting cognitive decline is more challenging, our models show improvements from the multimodal approach and from word probabilities, disfluency, and pause information over word-only models. In addition, we show there are considerable gains for AD classification from multimodal fusion and gating, which can effectively deal with noisy inputs from acoustic features and ASR hypotheses.
Acknowledgements
Purver is partially supported by the EPSRC under grant EP/S033564/1, and by the European Union’s Horizon 2020 programme under grant agreements 769661 (SAAM, Supporting Active Ageing through Multimodal coaching) and 825153 (EMBEDDIA, Cross-Lingual Embeddings for Less-Represented Languages in European News Media). The results of this publication reflect only the authors’ views, and the Commission is not responsible for any use that may be made of the information it contains.