Alzheimer's Dementia Recognition through Spontaneous Speech: The ADReSS Challenge

by   Saturnino Luz, et al.
Carnegie Mellon University

The ADReSS Challenge at INTERSPEECH 2020 defines a shared task through which different approaches to the automated recognition of Alzheimer's dementia based on spontaneous speech can be compared. ADReSS provides researchers with a benchmark speech dataset which has been acoustically pre-processed and balanced in terms of age and gender, and defines two cognitive assessment tasks: the Alzheimer's speech classification task and the neuropsychological score regression task. In the Alzheimer's speech classification task, ADReSS challenge participants create models for classifying speech as dementia or healthy control speech. In the neuropsychological score regression task, participants create models to predict mini-mental state examination scores. This paper describes the ADReSS Challenge in detail and presents a baseline for both tasks, including a feature extraction procedure and results for a classification and a regression model. ADReSS aims to provide the speech and language Alzheimer's research community with a platform for comprehensive methodological comparisons. This will contribute to addressing the lack of standardisation that currently affects the field and shed light on avenues for future research and clinical applicability.




1 Introduction

Alzheimer’s Disease (AD) is a neurodegenerative disease that entails a long-term and usually gradual decrease of cognitive functioning [1]. It is also the most common underlying cause of dementia. The main risk factor for AD is age, and therefore its greatest incidence is amongst the elderly. Given the current demographics in the Western world, where the population aged 65 years or more has been predicted to triple between the years 2000 and 2050 [2], institutions are investing considerably in dementia prevention, early detection and disease management. There is a need for cost-effective and scalable methods that are able to identify the most subtle forms of AD, from the preclinical stage of Subjective Cognitive Decline (SCD), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer’s Dementia (AD) itself.

Whilst memory loss is often considered the main symptom of AD, language is also deemed a valuable source of clinical information. Furthermore, the ubiquity of speech has led to a number of studies investigating speech and language features for the detection of AD, such as [3, 4, 5, 6], to cite some examples. Although these studies propose various signal processing and machine learning methods for this task, the field still lacks balanced and standardised datasets on which these different approaches could be systematically compared.

Consequently, the main objective of the ADReSS Challenge of INTERSPEECH 2020 is to define a shared task through which different approaches to AD detection, based on spontaneous speech, could be compared. This aims to address one of the main problems of this active research field, the lack of standardisation, which hinders its translation into clinical practice. The ADReSS Challenge therefore aims:

  1. to target a difficult automatic prediction problem of societal and medical relevance, namely, the detection of cognitive impairment and Alzheimer’s Dementia (AD);

  2. to provide a forum for different research groups to test their existing methods (or develop novel approaches) on a new shared standardised dataset;

  3. to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances of gender and age distribution; and

  4. to focus on AD recognition using spontaneous speech, rather than speech samples collected under laboratory conditions.

To the best of our knowledge, this will be the first such shared task focused on AD. Unlike some tests performed in clinical settings, where short speech samples are collected under controlled conditions, this task focuses on AD recognition using spontaneous speech. While a number of researchers have proposed speech processing and natural language processing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied datasets, consequently hindering reproducibility, replicability, and comparability of approaches. The ADReSS Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a shared dataset which consists of a statistically balanced, acoustically enhanced set of recordings of spontaneous speech sessions along with segmentation and detailed timestamped transcriptions. Short speech samples collected under controlled conditions are arguably less suitable than spontaneous speech for the development of large-scale longitudinal monitoring technology.


As data scarcity and heterogeneity have hindered research into the relationship between speech and AD, the ADReSS Challenge provides researchers with the very first available benchmark, acoustically pre-processed and balanced in terms of age and gender. ADReSS defines two different prediction tasks:

  • the AD recognition task, which requires researchers to model participants’ speech data to perform a binary classification of speech samples into AD and non-AD classes, and

  • the MMSE prediction task, which requires researchers to create regression models of the participants’ speech in order to predict their scores in the Mini-Mental State Examination (MMSE).

This paper presents a baseline for both tasks, including feature extraction procedures (openSMILE features and MRCG features) and initial results for a classification and a regression model.

2 ADReSS Challenge Dataset

A dataset has been created for this challenge which is matched for age and gender, as shown in Table 1 and Table 2, so as to minimise risk of bias in the prediction tasks. The data consists of speech recordings and transcripts of spoken picture descriptions elicited from participants through the Cookie Theft picture from the Boston Diagnostic Aphasia Exam [8, 9]. Transcripts were annotated using the CHAT coding system [10]. The recorded speech has been segmented for voice activity using a simple voice activity detection algorithm based on signal energy thresholding. We set the log energy threshold parameter to 65 dB with a maximum duration of 10 seconds per speech segment. The segmented dataset contains 2,033 speech segments from 82 non-AD subjects and 2,043 speech segments from 82 AD subjects. The average number of speech segments produced by each participant was 24.86 (standard deviation […]). Audio volume was normalised across all speech segments to control for variation caused by recording conditions such as microphone placement.
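The energy-thresholding segmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the challenge's actual implementation: the function names, the 10 ms frame size, and the dB reference level (unit energy) are assumptions, so the 65 dB default is only meaningful relative to that assumed reference.

```python
import math

def frame_log_energy_db(frame):
    """Log energy of one frame in dB; the reference level (unit energy) is an assumption."""
    energy = sum(x * x for x in frame)
    return 10.0 * math.log10(energy + 1e-12)  # small floor avoids log10(0)

def energy_vad(samples, sample_rate, frame_ms=10, threshold_db=65.0, max_segment_s=10.0):
    """Return (start, end) sample indices of speech segments.

    A frame is active when its log energy exceeds threshold_db. Consecutive
    active frames form a segment, which is closed at the first inactive frame
    or once it reaches max_segment_s seconds.
    """
    frame_len = max(1, round(sample_rate * frame_ms / 1000))
    max_len = round(sample_rate * max_segment_s)
    n_frames = len(samples) // frame_len  # trailing partial frame is ignored
    segments, start = [], None
    for k in range(n_frames):
        pos = k * frame_len
        active = frame_log_energy_db(samples[pos:pos + frame_len]) > threshold_db
        if active and start is None:
            start = pos                                  # open a new segment
        elif not active and start is not None:
            segments.append((start, pos))                # close on silence
            start = None
        if start is not None and pos + frame_len - start >= max_len:
            segments.append((start, pos + frame_len))    # close at maximum duration
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))   # close a segment still open at the end
    return segments
```

Capping the segment length keeps every segment at or below the stated 10 seconds even when a participant speaks without pausing.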

Age interval   AD Male   AD Female   non-AD Male   non-AD Female
[…]            1         0           1             0
[…]            5         4           5             4
[…]            3         6           3             6
[…]            6         10          6             10
[…]            6         8           6             8
[…]            3         2           3             2
Total          24        30          24            30
Table 1: ADReSS Training Set: Basic characteristics of the patients in each group.

Age interval   AD Male   AD Female   non-AD Male   non-AD Female
[…]            1         0           1             0
[…]            2         2           2             2
[…]            1         3           1             3
[…]            3         4           3             4
[…]            3         3           3             3
[…]            1         1           1             1
Total          11        13          11            13
Table 2: ADReSS Test Set: Basic characteristics of the patients in each group.

3 Feature Extraction

The baseline results reported below make no use of the transcribed language data included in the datasets. Acoustic feature extraction was performed on the speech segments using the openSMILE v2.1 toolkit, which is an open-source software suite for automatic extraction of features from speech, widely used for emotion and affect recognition in speech [11]. The following is a brief description of each of the feature sets constructed in this way:

emobase: This acoustic feature set contains the mel-frequency cepstral coefficients (MFCC), voice quality, fundamental frequency (F0), F0 envelope, line spectral pairs (LSP) and intensity features with their first and second order derivatives. Several statistical functions are applied to these features, resulting in a total of 988 features for every speech segment.

ComParE: The ComParE 2013 [12] feature set includes energy, spectral, MFCC, and voicing related low-level descriptors (LLDs). LLDs include logarithmic harmonic-to-noise ratio, voice quality features, Viterbi smoothing for F0, spectral harmonicity and psychoacoustic spectral sharpness. Statistical functionals are also computed, bringing the total to 6,373 features.

eGeMAPS: The eGeMAPS [13] feature set resulted from an attempt to reduce the somewhat unwieldy feature sets above to a basic set of acoustic features based on their potential to detect physiological changes in voice production, as well as theoretical significance and proven usefulness in previous studies [14]. It contains the F0 semitone, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index and slope V0 features, as well as their most common statistical functionals, for a total of 88 features per speech segment.

MRCG functionals: Multi-resolution cochleagram (MRCG) features were proposed by Chen et al. [15] and have since been used in speech related applications such as voice activity detection [16], speech separation [15], and more recently attitude recognition [17]. MRCG features are based on cochleagrams [18]. A cochleagram is generated by applying the gammatone filter to the audio signal, decomposing it in the frequency domain so as to mimic the human auditory filters. MRCG uses the time-frequency representation to encode the multi-resolution power distribution of the audio signal. Four cochleagram features were generated at different levels of resolution. The high resolution level encodes local information while the remaining three lower resolution levels capture spectrotemporal information. A total of 768 features were extracted from each frame: 256 MRCG features (frame length of 20 ms and frame shift of 10 ms), along with 256 ΔMRCG and 256 ΔΔMRCG features. The Δ and ΔΔ features are meant to capture temporal dynamics of the signal [15]. The statistical functionals (mean, standard deviation, minimum, maximum, range, mode, median, skewness and kurtosis) were applied on the 768 MRCG features for a total of 6,912 features.
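As an illustration of how nine functionals turn 768 frame-level values into 6,912 segment-level features (9 × 768), the following Python sketch applies the listed statistics to each feature dimension across the frames of one segment. The function names are hypothetical, and the conventions used here (population standard deviation, moment-based skewness, non-excess kurtosis, mode as the most frequent exact value) are assumptions; the paper does not state which conventions its MATLAB implementation used.

```python
import statistics
from collections import Counter

FUNCTIONAL_NAMES = ["mean", "sd", "min", "max", "range",
                    "mode", "median", "skewness", "kurtosis"]

def functionals(values):
    """Nine summary statistics of one feature dimension over all frames of a segment."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)                      # population SD (assumed convention)
    m3 = sum((v - mean) ** 3 for v in values) / n       # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n       # fourth central moment
    lo, hi = min(values), max(values)
    return [
        mean, sd, lo, hi, hi - lo,
        Counter(values).most_common(1)[0][0],           # mode: most frequent exact value
        statistics.median(values),
        m3 / sd ** 3 if sd > 0 else 0.0,                # skewness
        m4 / sd ** 4 if sd > 0 else 0.0,                # kurtosis (non-excess)
    ]

def segment_functionals(frames):
    """frames: list of equal-length per-frame feature vectors for one speech segment."""
    out = []
    for column in zip(*frames):                         # iterate over feature dimensions
        out.extend(functionals(list(column)))
    return out
```

With 768-dimensional frames, `segment_functionals` returns a vector of length 6,912 regardless of how many frames the segment contains, which is what makes segments of different durations comparable.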

In sum, we extracted 88 eGeMAPS, 988 emobase, 6,373 ComParE and 6,912 MRCG features from 4,077 speech segments. Pearson’s correlation test was performed on the whole dataset to remove acoustic features that were significantly correlated with duration (when […]). Hence, 72 eGeMAPS, 599 emobase, 3,056 ComParE and 3,253 MRCG features were not correlated with the duration of the speech chunks, and were therefore selected for the machine learning experiments. Examples of features from the ComParE feature set discarded by the above-described procedure include L1-norms of segment length functionals smoothed by a moving average filter (including their means, maxima and standard deviations), and the relative spectral transform applied to auditory spectrum (RASTA) functionals (including the percentage of time the signal is above 25%, 50% and 75% of range plus minimum).
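The duration-based filtering step can be sketched as follows. This is an illustrative Python version under stated assumptions: the helper names are hypothetical, and the cut-off is expressed as a maximum absolute Pearson coefficient passed in by the caller, because the exact selection criterion is not recoverable from the text above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0          # a constant column carries no correlation
    return sxy / math.sqrt(sxx * syy)

def select_uncorrelated(features, durations, max_abs_r):
    """Keep feature names whose |r| with segment duration is at most max_abs_r.

    features: dict mapping feature name -> list of values, one per segment.
    durations: list of segment durations, aligned with the feature lists.
    max_abs_r: caller-chosen cut-off (hypothetical parameter, not from the paper).
    """
    return [name for name, column in features.items()
            if abs(pearson_r(column, durations)) <= max_abs_r]
```

Removing duration-correlated features prevents a classifier from exploiting segment length, which depends on the segmentation procedure, rather than voice characteristics.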

4 AD classification task

The AD classification task consists of creating binary classification models to distinguish between AD and non-AD patient speech. These models may use speech data, transcribed speech, or both. Any methodological approach may be taken, but participants will work with the same dataset. The evaluation metrics for this task are accuracy $= \frac{TN + TP}{N}$, precision $\pi = \frac{TP}{TP + FP}$, recall $\rho = \frac{TP}{TP + FN}$, and $F_1 = \frac{2\pi\rho}{\pi + \rho}$, where N is the number of patients, and TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives, respectively.
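The four evaluation metrics can be computed from confusion-matrix counts as in the short Python sketch below. The function name is hypothetical. The example counts (TP=18, FP=12, FN=6, TN=12, with AD as the positive class) are not reported in the paper; they are our own back-calculation from the test-set row of Table 5 and the 48 test subjects of Table 2, used here only as a worked example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Back-calculated counts for the AD class on the test set (an inference, see above).
metrics_ad = classification_metrics(tp=18, fp=12, fn=6, tn=12)
```

With these counts the sketch yields accuracy 0.625, precision 0.60, recall 0.75 and F1 of about 0.67, which agrees with the AD row of the test results in Table 5.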

4.1 Baseline classification

We performed our baseline classification experiments using five different methods, namely linear discriminant analysis (LDA), decision trees (DT, with leaf size of 20), nearest neighbour (1NN, for KNN with K=1), random forests (RF, with 50 trees and a leaf size of 20) and support vector machines (SVM, with a linear kernel with box constraint of 0.1, and sequential minimal optimisation solver). The classification methods are implemented in MATLAB [19] using the statistics and machine learning toolbox. A leave-one-subject-out (LOSO) cross-validation setting was adopted, where the training data do not contain any information from validation subjects.
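The key property of LOSO cross-validation is that all segments of the held-out subject are removed from training together. A minimal Python sketch of such a split generator is given below; the paper's experiments were run in MATLAB, so this is an illustrative re-expression and the function name is hypothetical.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids: one subject identifier per speech segment. All segments of
    the held-out subject go to the test fold, so no information from that
    subject leaks into training.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

Splitting by segment instead of by subject would place segments from the same speaker on both sides of the split, which is the speaker-dependence problem the LOSO setting is meant to avoid.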

Figure 1: System architecture. The audio recording of each subject is segmented using voice activity detection (VAD) into speech segments. Feature extraction (FE) is performed at segment level, and a classification or regression output is produced for each segment of the audio recording. MV outputs the majority vote for classification, and Average the mean regression score.

We conducted a two-step classification experiment to detect cognitive impairment due to AD (as shown in Figure 1). This consisted of segment-level (SL) classification, where a classifier was trained and tested with acoustic features, age and gender to predict whether a speech segment was uttered by a non-AD or AD patient, and majority vote (MV) classification, which assigned each subject an AD or non-AD label based on the majority labels of SL classification.
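The second step, majority voting over segment-level predictions, can be sketched in Python as follows. The function name is hypothetical, and the tie-breaking rule is an assumption: the text does not say how a subject with equally many AD and non-AD segment labels is handled, so the sketch exposes this as a `tie_label` parameter.

```python
from collections import Counter, defaultdict

def majority_vote(segment_subjects, segment_labels, tie_label="AD"):
    """Map each subject to the most frequent label among that subject's segments.

    segment_subjects: subject identifier for each segment.
    segment_labels: predicted label for each segment (same order).
    tie_label: label assigned when the top two labels tie (assumed rule).
    """
    votes = defaultdict(Counter)
    for subject, label in zip(segment_subjects, segment_labels):
        votes[subject][label] += 1
    decisions = {}
    for subject, counts in votes.items():
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            decisions[subject] = tie_label      # tie between the top two labels
        else:
            decisions[subject] = ranked[0][0]
    return decisions
```

For example, a subject with segment predictions AD, non-AD, AD is labelled AD, and subject-level accuracy is then computed over these per-subject decisions.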

4.2 Results

MV classification accuracy is shown in Tables 3 and 4 for LOSO and test settings respectively. In the LOSO setting, 1NN provides the best accuracy (0.574) using ComParE features for AD detection, above the chance level of 0.50. For further insight, the confusion matrices of the top three LOSO cross-validation results are also shown in Figure 2.

From the results shown in Table 3, we note that even though 1NN provides the best result (0.574), DT and LDA also exhibit promising performance, being in fact more stable across all feature sets than the other classifiers (the best average accuracies are 0.516 for LDA and 0.512 for DT). We also note that ComParE and MRCG are more stable across all classifiers than the other feature sets (the best average accuracies are 0.541 for ComParE and 0.507 for MRCG). Based on these findings, we have selected the LDA model trained using the ComParE feature set as our baseline model.

Table 4 shows that 1NN provides less accurate results on the test set than in LOSO cross validation. However, the results of LDA (0.625) and DT (0.625) improve on the test data. For further insight, the confusion matrices for LDA, DT and 1NN are also shown in Figure 3. Hence the challenge baseline accuracy for the classification task is 0.625. The precision, recall and F1 score are reported in Table 5.

Features   LDA     DT      1NN     SVM     RF      avg.
emobase    0.500   0.519   0.398   0.491   0.472   0.476
ComParE    0.565   0.528   0.574   0.528   0.509   0.541
eGeMAPS    0.482   0.500   0.380   0.333   0.482   0.435
MRCG       0.519   0.500   0.482   0.528   0.509   0.507
avg.       0.516   0.512   0.458   0.470   0.493
Table 3: AD classification task LOSO cross validation.

Figure 2: Top 3 classification results in LOSO cross validation.

Features   LDA     DT      1NN     SVM     RF      avg.
emobase    0.542   0.688   0.604   0.500   0.729   0.613
ComParE    0.625   0.625   0.458   0.500   0.542   0.550
eGeMAPS    0.583   0.542   0.688   0.563   0.604   0.596
MRCG       0.542   0.563   0.417   0.521   0.542   0.517
avg.       0.573   0.605   0.542   0.521   0.604
Table 4: AD Recognition sub-challenge: Test Results

Figure 3: Confusion matrix classification on test data.

Setting   Class    Precision   Recall   F1 Score   Accuracy
LOSO      non-AD   0.56        0.61     0.58       0.565
LOSO      AD       0.57        0.52     0.54
TEST      non-AD   0.67        0.50     0.57       0.625
TEST      AD       0.60        0.75     0.67
Table 5: The baseline results of AD classification task using LDA classifier with ComParE features.

5 MMSE prediction task

The MMSE prediction task consists of generating a regression model for prediction of MMSE scores of individual participants from the AD and non-AD groups. Unlike classification, MMSE prediction is relatively uncommon in the literature, despite MMSE scores often being available. While models may use speech (acoustic) or language data, or both, the baseline described here uses only acoustic data.

5.1 Baseline regression

We performed our baseline regression experiments using five different methods, namely decision trees (DT, with leaf size of 20), linear regression (LR), Gaussian process regression (GPR, with a squared exponential kernel), least-squares boosting (LSBoost, which contains the results of boosting 100 regression trees) and support vector machines (SVM, with a radial basis function kernel with box constraint of 0.1, and sequential minimal optimisation solver). The regression methods are implemented in MATLAB [19] using the statistics and machine learning toolbox. A LOSO cross-validation setting was adopted, where the training data do not contain any information from validation subjects.

As with classification, the regression experiments to predict MMSE score were conducted in two steps (Figure 1), that is, SL regression was performed as a first step and the predicted MMSE scores were averaged.
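The averaging step and the RMSE evaluation can be sketched in Python as below. The function names and the toy values in the usage lines are illustrative assumptions, not data from the challenge.

```python
import math
from collections import defaultdict

def subject_mean_predictions(segment_subjects, segment_preds):
    """Average segment-level MMSE predictions into one score per subject."""
    grouped = defaultdict(list)
    for subject, pred in zip(segment_subjects, segment_preds):
        grouped[subject].append(pred)
    return {subject: sum(p) / len(p) for subject, p in grouped.items()}

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy usage with made-up numbers: two subjects, three segments.
preds = subject_mean_predictions(["s1", "s1", "s2"], [20.0, 24.0, 28.0])
true_scores = {"s1": 25.0, "s2": 28.0}
subjects = sorted(true_scores)
score = rmse([true_scores[s] for s in subjects], [preds[s] for s in subjects])
```

Here subject s1 receives the mean of its two segment predictions (22.0), so the error is computed per subject and not per segment, matching the two-step design.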

5.2 Results

The regression results are reported as root mean squared error (RMSE) scores in Tables 6 and 7 for LOSO and test data respectively. In the LOSO setting, DT provides the best RMSE (7.28) using MRCG features for MMSE prediction, and is also more stable across all feature sets than the other regression methods (the best average RMSE of 7.29). We also note that eGeMAPS exhibits promising performance, with an average RMSE of 8.02 across models. Based on this, the DT model trained using MRCG features was chosen as the baseline model for the regression task.

Table 7 shows that SVM provides a more accurate average result (6.17) on the test data than the other regression methods. The average results of all regression methods improve on test data. The baseline model (DT with MRCG features) provides an RMSE of 6.14 in the test setting. Hence the challenge baseline RMSE for this task is 6.14.

Features   Linear     DT      GP      SVM     LSBoost   avg.
emobase    20.12      7.29    7.71    7.71    8.33      10.23
ComParE    1.41e+04   7.29    7.67    7.63    7.84      2.83e+03
eGeMAPS    7.86       7.31    7.72    8.55    8.68      8.02
MRCG       8.08       7.28    7.65    7.50    8.02      7.71
avg.       3.54e+03   7.293   7.688   7.848   8.218
Table 6: MMSE prediction LOSO cross validation results.

Features   Linear     DT      GP      SVM     LSBoost   avg.
emobase    14.18      6.78    6.36    6.18    6.73      8.05
ComParE    2.10e+04   6.52    6.33    6.19    6.72      4.21e+03
eGeMAPS    6.32       5.99    6.28    6.12    6.41      6.22
MRCG       6.93       6.14    6.33    6.20    6.31      6.38
avg.       5.26e+03   6.36    6.33    6.17    6.54
Table 7: MMSE prediction test results.

6 Discussion

The results of the classification baseline are comparable to those attained by models based on speech recordings available from spontaneous speech samples in DementiaBank’s Pitt corpus [8], which is widely used. Accuracy scores of 81.92%, 80%, 79% and 64% have been reported in the literature [3, 20, 21, 7]. Although these studies report higher accuracy than ours, all of them (except [7]) include information from the manual transcripts, and were conducted on an unbalanced dataset (in terms of age, gender and number of subjects in the AD and non-AD classes). It is also worth noting that accuracy for the best performing of these models drops to 58.5% when feature selection is not performed on their original set of 370 linguistic and acoustic features [3]. The performance of a model without the information from transcripts, that is, relying only on acoustic features as we do, is only reported in [7] (64%) and [21], where the SVM model drops to an average accuracy of 62%. It is also noted that previous studies do not evaluate their methods in a completely subject-independent setting (i.e. they consider multiple sessions for a subject and classify a session instead of a subject). This could lead to overfitting, as the model might learn speaker-dependent features from a session and then, based on those features, classify the next session of the same speaker. One strength of our method is its speaker-independent nature. Ambrosini et al. reported an accuracy of 80% while using acoustic (pitch, unvoiced duration, shimmer, pause duration, speech rate), age and educational level features for cognitive decline detection using an Italian dataset of an episodic story telling setting [22]. However, this dataset is less easily comparable to ours, as it is elicited differently, and is not age and gender balanced.

Yancheva et al. predicted MMSE scores with speech-related features [23] using the full DementiaBank Pitt dataset, which is not balanced and includes longitudinal observations. Their model yielded a mean absolute error (MAE) of 3.83 in predicting MMSE. However, they employed lexicosyntactic and semantic features derived from manual transcription, rather than automatically extracted acoustic features as we used in our analysis. In [23], those linguistic features were the main features selected from a group of 477, with acoustic features typically not being among the most relevant. Therefore no quantitative results were reported for acoustic features.

7 Conclusions

This paper demonstrates the relevance of acoustic features of spontaneous speech for cognitive impairment detection in the context of Alzheimer’s Disease diagnosis and MMSE prediction. Machine learning methods operating on automatically extracted voice features provide a baseline accuracy of up to 62.5%, well above the chance level of 50%, and a baseline RMSE of 6.14 on test data for the ADReSS Challenge. By bringing the research community together to work on a shared task on the same dataset, ADReSS aims to enable comprehensive methodological comparisons. This will hopefully highlight research caveats and shed light on avenues for clinical applicability and future research directions.

8 Acknowledgements

This research is funded by the European Union’s Horizon 2020 research programme, under grant agreement No 769661, towards the SAAM project. SdlFG is supported by the Medical Research Council (MRC).


  • [1] American Psychiatric Association, “Delirium, dementia, and amnestic and other cognitive disorders,” in Diagnostic and Statistical Manual of Mental Disorders, Text Revision (DSM-IV-TR), 2000, ch. 2.
  • [2] World Health Organization, “Mental health action plan 2013-2020,” WHO Library Cataloguing-in-Publication Data, pp. 1–44, 2013.
  • [3] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, “Linguistic features identify Alzheimer’s disease in narrative speech,” Journal of Alzheimer’s Disease, vol. 49, no. 2, pp. 407–422, 2016.
  • [4] S. Luz, S. D. la Fuente, and P. Albert, “A method for analysis of patient speech in dialogue for dementia detection,” in Procs. of LREC’18, D. Kokkinakis, Ed.   Paris, France: ELRA, may 2018.
  • [5] B. Mirheidari, D. Blackburn, T. Walker, A. Venneri, M. Reuber, and H. Christensen, “Detecting signs of dementia using word vector representations,” in INTERSPEECH, 2018, pp. 1893–1897.
  • [6] F. Haider, S. De La Fuente Garcia, and S. Luz, “An assessment of paralinguistic acoustic features for detection of Alzheimer’s dementia in spontaneous speech,” IEEE Journal of Selected Topics in Signal Processing, 2019.
  • [7] S. Luz, “Longitudinal monitoring and detection of Alzheimer’s type dementia from spontaneous speech data,” in Procs. of the Intl. Symp on Comp. Based Medical Systems (CBMS).   IEEE, 2017, pp. 45–46.
  • [8] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L. McGonigle, “The Natural History of Alzheimer’s Disease,” Archives of Neurology, vol. 51, no. 6, p. 585, 1994.
  • [9] H. Goodglass, E. Kaplan, and B. Barresi, BDAE-3: Boston Diagnostic Aphasia Examination – Third Edition.   Lippincott Williams & Wilkins Philadelphia, PA, 2001.
  • [10] B. MacWhinney, The CHILDES project: Tools for analyzing talk, Volume II: The database.   Psychology Press, 2014.
  • [11] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Procs. of ACM-MM.   ACM, 2010, pp. 1459–1462.
  • [12] F. Eyben, F. Weninger, F. Groß, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in ACM-MM.   ACM, 2013, pp. 835–838.
  • [13] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set GeMAPS for voice research and affective computing,” vol. 7, no. 2, pp. 190–202, 2016.
  • [14] ——, “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” vol. 7, no. 2, pp. 190–202, 2016.
  • [15] J. Chen, Y. Wang, and D. Wang, “A feature study for classification-based speech separation at low signal-to-noise ratios,” vol. 22, no. 12, pp. 1993–2002, 2014.
  • [16] J. Kim and M. Hahn, “Voice activity detection using an adaptive context attention model,” IEEE Signal Processing Letters, 2018.
  • [17] F. Haider and S. Luz, “Attitude recognition using multi-resolution cochleagram features,” in Procs. of ICASSP, 2019, pp. 3737–3741.
  • [18] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines.   Springer, 2005, pp. 181–197.
  • [19] MATLAB, version 9.6 (R2019a).   Natick, Massachusetts: The MathWorks Inc., 2019.
  • [20] M. Yancheva and F. Rudzicz, “Vector-space topic models for detecting Alzheimer’s disease,” in Procs. of ACL, 2016, pp. 2337–2346.
  • [21] L. Hernández-Domínguez, S. Ratté, G. Sierra-Martínez, and A. Roche-Bergua, “Computer-based evaluation of Alzheimer’s disease and mild cognitive impairment patients during a picture description task,” Alzheimer’s & Dementia: Diagn., Asses. & Dis. Mon., vol. 10, pp. 260–268, 2018.
  • [22] E. Ambrosini, M. Caielli, M. Milis, C. Loizou, D. Azzolino, S. Damanti, L. Bertagnoli, M. Cesari, S. Moccia, M. Cid et al., “Automatic speech analysis to early detect functional cognitive decline in elderly population,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).   IEEE, 2019, pp. 212–216.
  • [23] M. Yancheva, K. Fraser, and F. Rudzicz, “Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 134–139.