Alzheimer’s Disease (AD) is a neurodegenerative disease that entails a long-term, usually gradual, decline in cognitive functioning. It is also the most common underlying cause of dementia. The main risk factor for AD is age, and its incidence is therefore greatest amongst the elderly. Given current demographics in the Western world, where the population aged 65 years or over is predicted to triple between 2000 and 2050, institutions are investing considerably in dementia prevention, early detection and disease management. There is a need for cost-effective and scalable methods capable of identifying even the most subtle forms of AD, from the preclinical stage of Subjective Cognitive Decline (SCD), through Mild Cognitive Impairment (MCI), to Alzheimer’s Dementia itself.
Whilst memory impairment is often considered the main symptom of AD, language is also a valuable source of clinical information. Furthermore, the ubiquity of speech has led to a number of studies investigating speech and language features for the detection of AD [3, 4, 5, 6], to cite some examples. Although these studies propose various signal processing and machine learning methods for this task, the field still lacks balanced and standardised datasets on which these different approaches could be systematically compared.
Consequently, the main objective of the ADReSS Challenge of INTERSPEECH 2020 is to define a shared task through which different approaches to AD detection based on spontaneous speech can be compared. This addresses one of the main problems of this active research field: the lack of standardisation, which hinders its translation into clinical practice. The ADReSS Challenge therefore aims:
to target a difficult automatic prediction problem of societal and medical relevance, namely, the detection of cognitive impairment and Alzheimer’s Dementia (AD);
to provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a new shared standardized dataset;
to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances of gender and age distribution; and
to focus on AD recognition using spontaneous speech, rather than speech samples collected under laboratory conditions.
To the best of our knowledge, this will be the first such shared task focused on AD. Unlike some tests performed in clinical settings, where short speech samples are collected under controlled conditions, this task focuses on AD recognition using spontaneous speech. While a number of researchers have proposed speech processing and natural language processing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied datasets, consequently hindering reproducibility, replicability, and comparability of approaches. The ADReSS Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a shared dataset consisting of a statistically balanced, acoustically enhanced set of recordings of spontaneous speech sessions, along with segmentation and detailed timestamped transcriptions. The use of spontaneous speech also sets the ADReSS Challenge apart from tests performed in clinical settings, where short speech samples are collected under controlled conditions that are arguably less suitable for the development of large-scale longitudinal monitoring technology.
As data scarcity and heterogeneity have hindered research into the relationship between speech and AD, the ADReSS Challenge provides researchers with the very first available benchmark, acoustically pre-processed and balanced in terms of age and gender. ADReSS defines two different prediction tasks:
the AD recognition task, which requires researchers to model participants’ speech data to perform a binary classification of speech samples into AD and non-AD classes, and
the MMSE prediction task, which requires researchers to create regression models of the participants’ speech in order to predict their scores in the Mini-Mental State Examination (MMSE).
This paper presents a baseline for both tasks, including feature extraction procedures (openSMILE features and MRCG features) and initial results for a classification and a regression model.
2 ADReSS Challenge Dataset
A dataset has been created for this challenge, matched for age and gender as shown in Tables 1 and 2, so as to minimise the risk of bias in the prediction tasks. The data consist of speech recordings and transcripts of spoken picture descriptions elicited from participants using the Cookie Theft picture from the Boston Diagnostic Aphasia Exam [8, 9]. Transcripts were annotated using the CHAT coding system. The recorded speech has been segmented for voice activity using a simple voice activity detection algorithm based on signal energy thresholding. We set the log energy threshold parameter to 65 dB, with a maximum duration of 10 seconds per speech segment. The segmented dataset contains 2,033 speech segments from 82 non-AD subjects and 2,043 speech segments from 82 AD subjects, an average of 24.86 speech segments per participant. Audio volume was normalised across all speech segments to control for variation caused by recording conditions such as microphone placement.
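As an illustration, the energy-thresholding segmentation described above can be sketched as follows. This is a minimal pure-Python sketch, not the implementation used for the dataset; the frame length and the dB reference implied by raw sample values are assumptions, as the source does not specify them.

```python
import math

def segment_speech(signal, sr, frame_ms=25, log_energy_db=65.0, max_seg_s=10.0):
    """Energy-thresholding voice activity segmentation (illustrative sketch)."""
    frame_len = int(sr * frame_ms / 1000)
    segments, start = [], None
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        # Log energy in dB; the epsilon guards against log(0) on silent frames.
        energy_db = 10.0 * math.log10(sum(x * x for x in frame) + 1e-10)
        is_speech = energy_db > log_energy_db
        if start is None:
            if is_speech:
                start = i * frame_len  # open a new speech segment
        elif not is_speech:
            segments.append((start, i * frame_len))  # close at first silence
            start = None
        elif (i + 1) * frame_len - start >= max_seg_s * sr:
            segments.append((start, (i + 1) * frame_len))  # cap segment length
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```

Each returned pair is a (start, end) sample range of one speech segment, capped at `max_seg_s` seconds.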
3 Feature Extraction
The baseline results reported below make no use of the transcribed language data included in the dataset. Acoustic feature extraction was performed on the speech segments using the openSMILE v2.1 toolkit, an open-source software suite for automatic extraction of features from speech, widely used for emotion and affect recognition. The following is a brief description of each of the feature sets constructed in this way:
emobase: This acoustic feature set contains mel-frequency cepstral coefficients (MFCC), voice quality, fundamental frequency (F0), F0 envelope, LSP and intensity features, with their first and second order derivatives. Several statistical functionals are applied to these features, resulting in a total of 988 features for every speech segment.
ComParE: The ComParE 2013 feature set includes energy, spectral, MFCC, and voicing-related low-level descriptors (LLDs). The LLDs include logarithmic harmonic-to-noise ratio, voice quality features, Viterbi-smoothed F0, spectral harmonicity and psychoacoustic spectral sharpness. Statistical functionals are also computed, bringing the total to 6,373 features per speech segment.
eGeMAPS: The eGeMAPS  feature set resulted from an attempt to reduce the somewhat unwieldy feature sets above to a basic set of acoustic features based on their potential to detect physiological changes in voice production, as well as theoretical significance and proven usefulness in previous studies . It contains the F0 semitone, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index and slope V0 features, as well as their most common statistical functionals, for a total of 88 features per speech segment.
MRCG functionals: MRCG features were proposed by Chen et al. and have since been used in speech-related applications such as voice activity detection, speech separation and, more recently, attitude recognition. MRCG features are based on cochleagrams. A cochleagram is generated by applying a gammatone filterbank to the audio signal, decomposing it in the frequency domain so as to mimic the human auditory filters. MRCG uses this time-frequency representation to encode the multi-resolution power distribution of the audio signal. Four cochleagram features were generated at different levels of resolution: the high-resolution level encodes local information, while the three lower-resolution levels capture spectrotemporal information. A total of 768 features were extracted from each frame: 256 MRCG features (frame length of 20 ms and frame shift of 10 ms), along with 256 ΔMRCG and 256 ΔΔMRCG features, which are meant to capture the temporal dynamics of the signal.
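The Δ and ΔΔ functionals mentioned above are the standard regression-based delta coefficients; a minimal sketch follows (the window width of 2 is an assumed default, not taken from the source). Applying `delta` to the output of `delta` yields the second-order (ΔΔ) features.

```python
def delta(frames, width=2):
    """First-order delta (regression) coefficients over a feature sequence.

    `frames` is a list of per-frame feature vectors. Deltas are computed with
    the standard regression formula used for MFCC/MRCG dynamics, replicating
    the edge frames for padding.
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    T = len(frames)
    out = []
    for t in range(T):
        vec = []
        for d in range(len(frames[0])):
            num = sum(n * (frames[min(t + n, T - 1)][d] - frames[max(t - n, 0)][d])
                      for n in range(1, width + 1))
            vec.append(num / denom)
        out.append(vec)
    return out
```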
In sum, we extracted 88 eGeMAPS, 988 emobase, 6,373 ComParE and 6,912 MRCG features from 4,077 speech segments. A Pearson correlation test was performed on the whole dataset to remove acoustic features that were significantly correlated with segment duration. The 72 eGeMAPS, 599 emobase, 3,056 ComParE and 3,253 MRCG features that were not correlated with the duration of the speech segments were therefore selected for the machine learning experiments. Examples of ComParE features retained by this procedure include L1-norms of segment length functionals smoothed by a moving average filter (including their means, maxima and standard deviations), and functionals of the relative spectral transform applied to the auditory spectrum (RASTA), including the percentage of time the signal is above 25%, 50% and 75% of range plus minimum.
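The duration-based filtering step can be illustrated as follows. This sketch substitutes an absolute-correlation cutoff for the significance test applied in the paper, to stay dependency-free; the `r_cut` value and feature names are illustrative assumptions.

```python
from statistics import mean

def pearson_r(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def drop_duration_correlated(features, durations, names, r_cut=0.5):
    """Keep only features weakly correlated with segment duration.

    `features[i][j]` is feature j of segment i; `durations[i]` is the
    duration of segment i. Features whose |r| with duration reaches
    `r_cut` are discarded.
    """
    kept = []
    for j, name in enumerate(names):
        col = [row[j] for row in features]
        if abs(pearson_r(col, durations)) < r_cut:
            kept.append(name)
    return kept
```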
4 AD classification task
The AD classification task consists of creating binary classification models to distinguish between AD and non-AD patient speech. These models may use speech data, transcribed speech, or both. Any methodological approach may be taken, but all participants work with the same dataset. The evaluation metrics for this task are accuracy ((TN + TP)/N), precision (TP/(TP + FP)), recall (TP/(TP + FN)) and F1 score (2 · precision · recall / (precision + recall)), where N is the number of patients and TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives, respectively.
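These are the standard definitions of the evaluation metrics; as a minimal reference implementation:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    n = tp + fp + fn + tn                     # N: total number of patients
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```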
4.1 Baseline classification
We performed our baseline classification experiments using five different methods, namely linear discriminant analysis (LDA), decision trees (DT, with a leaf size of 20), nearest neighbour (1NN, i.e. KNN with K = 1), random forests (RF, with 50 trees and a leaf size of 20) and support vector machines (SVM, with a linear kernel, a box constraint of 0.1 and a sequential minimal optimisation solver). The classification methods are implemented in MATLAB using the Statistics and Machine Learning Toolbox. A leave-one-subject-out (LOSO) cross-validation setting was adopted, in which the training data contain no information from validation subjects.
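A LOSO split can be sketched by grouping segment indices by subject; the data layout here is an assumption for illustration.

```python
def loso_folds(subject_ids):
    """Leave-one-subject-out folds over segment-level data.

    `subject_ids[i]` is the subject who produced segment i. Each fold holds
    out every segment of one subject, so no validation subject's speech ever
    appears in the corresponding training set.
    """
    folds = []
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        folds.append((train, test))
    return folds
```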
We conducted a two-step classification experiment to detect cognitive impairment due to AD (as shown in Figure 1). This consisted of segment-level (SL) classification, where a classifier was trained and tested on acoustic features, age and gender to predict whether a speech segment was uttered by a non-AD or an AD patient, and majority-vote (MV) classification, which assigned each subject an AD or non-AD label based on the majority of that subject's SL labels.
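The second, majority-vote step can be sketched as follows; the tie-breaking rule is an assumption, as the paper does not specify one.

```python
from collections import Counter

def majority_vote(segment_labels):
    """Subject-level label from that subject's segment-level predictions.

    Ties are broken in favour of the 'AD' label here (an assumed convention).
    """
    counts = Counter(segment_labels)
    if counts["AD"] >= counts["non-AD"]:
        return "AD"
    return "non-AD"
```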
MV classification accuracy is shown in Tables 3 and 4 for the LOSO and test settings respectively. These results show that 1NN with ComParE features provides the best accuracy for AD detection (0.574), above the chance level of 0.50. For further insight, the confusion matrices of the top three LOSO-CV results are also shown in Figure 2.
From the results shown in Table 3, we note that even though 1NN provides the best result (0.574), DT and LDA also perform promisingly and are in fact more stable across all feature sets than the other classifiers (best average accuracies of 0.516 for LDA and 0.512 for DT). Likewise, ComParE and MRCG are more stable across all classifiers than the other feature sets (best average accuracies of 0.541 for ComParE and 0.507 for MRCG). Based on these findings, we selected the LDA model trained on the ComParE feature set as our baseline model.
Table 4 shows that 1NN provides less accurate results on the test set than in LOSO cross-validation. However, the results of LDA (0.625) and DT (0.625) improve on the test data. For further insight, the confusion matrices for LDA, DT and 1NN are shown in Figure 3. Hence, the challenge baseline accuracy for the classification task is 0.625. Precision, recall and F1 scores are reported in Table 5.
5 MMSE prediction task
The MMSE prediction task consists of generating a regression model for prediction of MMSE scores of individual participants from the AD and non-AD groups. Unlike classification, MMSE prediction is relatively uncommon in the literature, despite MMSE scores often being available. While models may use speech (acoustic) or language data, or both, the baseline described here uses only acoustic data.
5.1 Baseline regression
We performed our baseline regression experiments using five different methods, namely decision trees (DT, with a leaf size of 20), linear regression (LR), Gaussian process regression (GPR, with a squared exponential kernel), least-squares boosting (LSBoost, boosting 100 regression trees) and support vector machines (SVM, with a radial basis function kernel, a box constraint of 0.1 and a sequential minimal optimisation solver). The regression methods are implemented in MATLAB using the Statistics and Machine Learning Toolbox. A LOSO cross-validation setting was adopted, in which the training data contain no information from validation subjects.
As with classification, the regression experiments to predict MMSE scores were conducted in two steps (Figure 1): SL regression was performed first, and the predicted segment-level MMSE scores were then averaged to produce a subject-level prediction.
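The averaging step and its RMSE evaluation can be sketched as follows (the data layout is assumed for illustration):

```python
import math

def subject_mmse_and_rmse(segment_preds, subject_of, true_mmse):
    """Average segment-level MMSE predictions per subject, then score RMSE.

    `segment_preds[i]` is the predicted MMSE for segment i, `subject_of[i]`
    the subject it belongs to, and `true_mmse` maps subject -> actual score.
    """
    sums, counts = {}, {}
    for pred, subj in zip(segment_preds, subject_of):
        sums[subj] = sums.get(subj, 0.0) + pred
        counts[subj] = counts.get(subj, 0) + 1
    # Subject-level prediction: mean over that subject's segments.
    per_subject = {s: sums[s] / counts[s] for s in sums}
    sq_err = [(per_subject[s] - true_mmse[s]) ** 2 for s in per_subject]
    rmse = math.sqrt(sum(sq_err) / len(sq_err))
    return per_subject, rmse
```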
The regression results are reported as root mean squared error (RMSE) in Tables 6 and 7 for the LOSO and test settings respectively. These results show that DT with MRCG features provides the best RMSE (7.28) for MMSE prediction, and that DT is also the most stable method across all feature sets (best average RMSE of 7.29). We also note that eGeMAPS exhibits promising performance, with an average RMSE of 8.02 across models. Based on this, the DT model trained on MRCG features was chosen as the baseline model for the regression task.
Table 7 shows that SVM provides a more accurate average result (6.17) on the test data than the other regression methods, and that the average results of all regression methods improve on the test data. The baseline model (DT with MRCG features) attains an RMSE of 6.14 in the test setting. Hence, the challenge baseline RMSE for this task is 6.14.
These classification baseline results are comparable to those attained by models based on the spontaneous speech recordings of DementiaBank’s Pitt corpus, which is widely used. Accuracy scores of 81.92%, 80%, 79% and 64% have been reported in the literature [3, 20, 21, 7]. Although these studies report higher accuracy than ours, all but one of them include information from the manual transcripts, and all were conducted on an unbalanced dataset (in terms of age, gender and number of subjects in the AD and non-AD classes). It is also worth noting that accuracy for the best performing of these models drops to 58.5% when feature selection is not performed on its original set of 370 linguistic and acoustic features. Performance without information from transcripts, that is, relying only on acoustic features as we do, is reported in only two of these studies: one attains 64%, while the SVM model of the other drops to an average accuracy of 62%. We also note that previous studies do not evaluate their methods in a fully subject-independent setting (i.e. they consider multiple sessions per subject and classify sessions rather than subjects). This could lead to overfitting, as a model might learn speaker-dependent features from one session and use them to classify another session of the same speaker. One strength of our method is its speaker-independent nature. Ambrosini et al. reported an accuracy of 80% using acoustic (pitch, unvoiced duration, shimmer, pause duration, speech rate), age and educational level features for cognitive decline detection on an Italian dataset elicited through episodic story telling. However, that dataset is less readily comparable to ours, as it is elicited differently and is not age and gender balanced.
Yancheva et al. predicted MMSE scores from speech-related features using the full DementiaBank Pitt dataset, which is not balanced and includes longitudinal observations. Their model yielded a mean absolute error (MAE) of 3.83 in predicting MMSE. However, they employed lexicosyntactic and semantic features derived from manual transcription, rather than the automatically extracted acoustic features used in our analysis. In that work, linguistic features were the main features selected from a group of 477, with acoustic features typically not being among the most relevant; no quantitative results were therefore reported for acoustic features alone.
This paper demonstrates the relevance of acoustic features of spontaneous speech for cognitive impairment detection in the context of Alzheimer’s Disease diagnosis and MMSE prediction. Machine learning methods operating on automatically extracted voice features provide a baseline accuracy of up to 62.5%, well above the chance level of 50%, and a baseline RMSE of 6.14 on test data for the ADReSS Challenge. By bringing the research community together to work on a shared task on the same dataset, ADReSS intends to enable comprehensive methodological comparisons. This will hopefully highlight research caveats and shed light on avenues for clinical applicability and future research directions.
This research is funded by the European Union’s Horizon 2020 research programme, under grant agreement No 769661, towards the SAAM project. SdlFG is supported by the Medical Research Council (MRC).
-  American Psychiatric Association, “Delirium, dementia, and amnestic and other cognitive disorders,” in Diagnostic and Statistical Manual of Mental Disorders, Text Revision (DSM-IV-TR), 2000, ch. 2.
-  World Health Organization, “Mental health action plan 2013–2020,” WHO Library Cataloguing-in-Publication Data, pp. 1–44, 2013.
-  K. C. Fraser, J. A. Meltzer, and F. Rudzicz, “Linguistic features identify Alzheimer’s disease in narrative speech,” Journal of Alzheimer’s Disease, vol. 49, no. 2, pp. 407–422, 2016.
-  S. Luz, S. D. la Fuente, and P. Albert, “A method for analysis of patient speech in dialogue for dementia detection,” in Procs. of LREC’18, D. Kokkinakis, Ed. Paris, France: ELRA, May 2018.
-  B. Mirheidari, D. Blackburn, T. Walker, A. Venneri, M. Reuber, and H. Christensen, “Detecting signs of dementia using word vector representations,” in INTERSPEECH, 2018, pp. 1893–1897.
-  F. Haider, S. de la Fuente Garcia, and S. Luz, “An assessment of paralinguistic acoustic features for detection of Alzheimer’s dementia in spontaneous speech,” IEEE Journal of Selected Topics in Signal Processing, 2019.
-  S. Luz, “Longitudinal monitoring and detection of Alzheimer’s type dementia from spontaneous speech data,” in Procs. of the Intl. Symp on Comp. Based Medical Systems (CBMS). IEEE, 2017, pp. 45–46.
-  J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L. McGonigle, “The Natural History of Alzheimer’s Disease,” Archives of Neurology, vol. 51, no. 6, p. 585, 1994.
-  H. Goodglass, E. Kaplan, and B. Barresi, BDAE-3: Boston Diagnostic Aphasia Examination – Third Edition. Lippincott Williams & Wilkins Philadelphia, PA, 2001.
-  B. MacWhinney, The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press, 2014.
-  F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Procs. of ACM-MM. ACM, 2010, pp. 1459–1462.
-  F. Eyben, F. Weninger, F. Groß, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in ACM-MM. ACM, 2013, pp. 835–838.
-  F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set GeMAPS for voice research and affective computing,” vol. 7, no. 2, pp. 190–202, 2016.
-  J. Chen, Y. Wang, and D. Wang, “A feature study for classification-based speech separation at low signal-to-noise ratios,” vol. 22, no. 12, pp. 1993–2002, 2014.
-  J. Kim and M. Hahn, “Voice activity detection using an adaptive context attention model,” IEEE Signal Processing Letters, 2018.
-  F. Haider and S. Luz, “Attitude recognition using multi-resolution cochleagram features,” in Procs. of ICASSP, 2019, pp. 3737–3741.
-  D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines. Springer, 2005, pp. 181–197.
-  MATLAB, version 9.6 (R2019a). Natick, Massachusetts: The MathWorks Inc., 2019.
-  M. Yancheva and F. Rudzicz, “Vector-space topic models for detecting Alzheimer’s disease,” in Procs. of ACL, 2016, pp. 2337–2346.
-  L. Hernández-Domínguez, S. Ratté, G. Sierra-Martínez, and A. Roche-Bergua, “Computer-based evaluation of Alzheimer’s disease and mild cognitive impairment patients during a picture description task,” Alzheimer’s & Dementia: Diagn., Asses. & Dis. Mon., vol. 10, pp. 260–268, 2018.
-  E. Ambrosini, M. Caielli, M. Milis, C. Loizou, D. Azzolino, S. Damanti, L. Bertagnoli, M. Cesari, S. Moccia, M. Cid et al., “Automatic speech analysis to early detect functional cognitive decline in elderly population,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 212–216.
-  M. Yancheva, K. Fraser, and F. Rudzicz, “Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 134–139.