Speech Paralinguistic Approach for Detecting Dementia Using Gated Convolutional Neural Network

04/16/2020 ∙ by Tifani Warnita, et al. ∙ Keio University

We propose a non-invasive and cost-effective method to automatically detect dementia using solely speech audio data, without any linguistic features. We extract paralinguistic features from short speech utterance segments and use Gated Convolutional Neural Networks (GCNN) to classify each segment as dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. On the Pitt Corpus, our method yields an accuracy of 73.1% using an average of 114 seconds of speech data. On the PROMPT Database, our method yields an accuracy of 74.7% using 4 seconds of speech data, which improves to 79.0% when more speech data are used. We further evaluate our method on a three-class classification problem that includes the Mild Cognitive Impairment (MCI) class and achieve an accuracy of 60.6%.




1 Introduction

Dementia is an umbrella term for a group of medical signs and symptoms associated with cognitive deficiency due to neuronal damage (alzheimer20182018). Types of dementia include Alzheimer’s disease (AD), vascular dementia, dementia with Lewy bodies (DLB) and frontotemporal lobar degeneration (FTLD). Dementia has various characteristics that reflect cognitive dysfunction, such as poor narrative memory when recalling experiences (prud2011extraction) and difficulties in making plans, solving problems, and completing daily tasks (alzheimer20182018).

The increasing number of people living with dementia has gained a lot of attention. AD, which accounts for the largest proportion of dementia cases, has become the 6th leading cause of death in the United States of America (alz2017). Moreover, according to the World Health Organization (dementia2017who), dementia affected 47 million people worldwide in 2015, and this number is estimated to nearly triple by 2050.

Unfortunately, there is no clear protocol on how to detect dementia in an accurate and effective manner (alz2017). The most common approach is to perform various clinical assessments of the patients, such as examining their medical history, conducting cognitive tests (e.g., memory tasks, executive function tasks, picture description tasks, naming tasks), assessing their mood and mental status, performing brain imaging (i.e., computerized tomography (CT), magnetic resonance imaging (MRI), single-photon emission computed tomography (SPECT) and positron emission tomography (PET)), and testing blood or cerebrospinal fluid. This careful diagnostic process can be invasive, time-consuming and costly. Thus, faster and more cost-effective dementia detection approaches are in strong demand.

Another challenge related to dementia detection is the identification of people with Mild Cognitive Impairment (MCI), which is the stage of cognitive impairment between the expected cognitive decline of normal ageing and early dementia (petersen2004mild). MCI is characterized by a cognitive decline that is greater than the age-related expectation, but that cannot yet be defined as dementia (petersen1999mild). The detection of MCI has been extensively studied (tuokko2020mild) and it is considered extremely important, since it is estimated that an annual average of 10% to 15% of people with MCI might progress to dementia (arevalo2015mini). Early cognitive deficiency treatments can help patients preserve their cognitive functions (martono2000buku) and some causes of dementia are treatable in early stages (tripathi2009reversible).

Most approaches to automatic dementia detection use linguistic information (zimmerer2016formulaic; fraser2016linguistic; orimaye2017predicting; wankerl2017n; mirheidari2018detecting), since the cognitive dysfunctions of the patients typically appear as linguistic impairments. While these methods are effective, their major drawback is that they require transcriptions.

To address these issues, we propose a language-independent method that detects dementia based solely on acoustic information from speech audio data. Moreover, this work focuses on using as little patient speech data as possible to make the diagnosis less physically demanding for the patients. We formulate the problem as a paralinguistic task in which we define paralinguistic-related cues such as pitch, voice energy and jitter as features. We employ Gated Convolutional Neural Networks (GCNN) in order to capture the temporal pattern in the extracted features. As an extension of our previous work (Warnita2018), we evaluate our method on two datasets: the DementiaBank Pitt Corpus (English) and the PROMPT Database (Japanese), which we collected ourselves. We further explore the detection of Mild Cognitive Impairment (MCI) patients. The experimental results demonstrate the effectiveness of our method in terms of accuracy, cost, and time required for each prediction.

2 Related Works

2.1 Evaluation Measure of Dementia and MCI

Various evaluation methods have been defined to assess dementia. In the medical field, commonly used approaches are the Clinical Dementia Rating (CDR) (morris1993clinical), the Clock Drawing Test (CDT) (sunderland1989clock), the Neuropsychiatric Inventory (NPI) (cummings1994neuropsychiatric) and the Mini-Mental State Examination (MMSE) (folstein1975mini). CDR uses an interview protocol to assess the dementia severity as mild or severe while, in CDT, the patients are asked to draw a clock showing a specific time to get a score, which can assist the evaluation of whether a patient suffers from a neurological problem. The NPI test is used to assess disruptions in several areas of behavioural functioning.

MMSE is an extensively used screening test that quantifies patients’ cognitive function as the total score of a series of questions and problems (pangman2000examination). The test itself is designed to aid the screening and diagnosis of dementia. The MMSE is used in this work and, based on saxton2009computer; kaufer2008cognitive; tombaugh1992mini; arevalo2015mini, we define the score ranges of 0–23, 24–26 and 27–30 to represent dementia, MCI and healthy people, respectively. While most current medical research works present similar cut-off points to represent these classes, the definition of these score ranges has not been standardized (onwuekwe2012assessment; o2008detecting; kochhann2010mini). In addition, it is difficult to diagnose MCI, since its symptoms are subtle (arevalo2015mini).

Several automatic dementia detection studies adopt the accuracy score as their performance measure (fraser2016linguistic; wankerl2017n; khodabakhsh2015evaluation; sadeghian2017speech) and some additionally report the Receiver Operating Characteristic (ROC) and its area under the curve (AUC) (fritsch2019automatic).

In some medical research, Cohen’s Kappa score is used to evaluate a classifier’s performance (pezzotti2008accuracy; wang2003comparison). The Kappa score measures the observed accuracy from the classifier’s outcome, normalized by the expected accuracy from random chance.
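As a minimal sketch of the Kappa definition above, the function below computes Cohen’s Kappa from a confusion matrix as (p_o − p_e) / (1 − p_e), where p_o is the observed accuracy and p_e is the accuracy expected by chance from the row and column totals. The function name `cohen_kappa` is our own illustrative choice; the example counts are the session-level counts reported in Table 2(b).

```python
def cohen_kappa(matrix):
    """Cohen's Kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed accuracy: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected accuracy under chance agreement, from the marginal totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Session-level counts from Table 2(b): [[dementia row], [healthy row]].
session_kappa = cohen_kappa([[189, 66], [65, 168]])
```

A Kappa of 1 corresponds to perfect agreement and a Kappa of 0 to agreement no better than chance, which is why it complements plain accuracy on imbalanced data.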

2.2 Features

Several types of features can be used to identify dementia. In this section, features based on the patient’s brain images and on linguistic characteristics of their speech will be presented as well as acoustic features extracted from their speech, which are the features used in this work.

2.2.1 Image

Structural brain images from Magnetic Resonance Imaging (MRI) can be used to identify AD patients (farhan2014ensemble). Brain imaging plays an important role in neurodegenerative disorder detection because it provides useful information regarding the anatomical change in the brain of patients (narayanan2016can). The combination of MRI and fluorodeoxyglucose-positron emission tomography (FDG-PET) images was used to identify MCI patients who will further progress to dementia (lu2018multimodal). Despite their effectiveness, medical image acquisition is costly and it requires doctors with a high level of expertise.

2.2.2 Linguistic and ASR-based

Language deficiency is a prominent and perceivable symptom of dementia patients. Several syntactic, lexical and n-gram features were used for detecting AD on the DementiaBank Pitt Corpus (zimmerer2016formulaic; orimaye2017predicting). wankerl2017n used n-gram and the MMSE score correlation analysis. More recently, chen2019attention defined a hybrid RNN-CNN architecture with an attention mechanism to detect AD from textual embeddings of the Pitt Corpus transcriptions. fritsch2019automatic proposed an LSTM-based neural network language model whose prediction is calculated from the model’s perplexity.

Several other works have studied the combination of linguistic and acoustic features. mirheidari2019dementia combined features inspired by the conversation analysis of clinical interviews, lexical information extracted with an ASR and acoustic features. They then input these features to a support vector machine (SVM) to classify the patients of their own dataset into dementia or functional memory disorder.

gosztolya2016detecting and toth2015automatic used ASR to extract phonetic-based features in order to detect MCI patients from their speech. fraser2016linguistic fused transcription-based linguistic features with acoustic features such as Mel-frequency cepstral coefficients (MFCC). sadeghian2017speech utilized the combination of speech duration, pause-related features, pitch-related features and other prosodic features, as well as linguistic features from a customized ASR adapted for dementia patients.

Even though those approaches have shown good results, most of them are limited by the availability of transcription data and/or ASR which usually has poor performance due to the degraded speech intelligibility of the patients.

2.2.3 Speech

Several works proposed identifying dementia by using ASR-independent speech features. Features such as silence ratio were found to be more meaningful than other linguistic features when applied to an SVM classifier (khodabakhsh2015evaluation). Moreover, the usage of acoustic and context-free linguistic features to classify patients showed promising results on the Carolina Conversations Collection dataset (luz2018method).

Besides having problems with language deficits, people with AD, especially in the early stages of the disease, might become apathetic and have a tendency to get depressed (alzheimer20182018). People with AD usually suffer from prosodic impairment, which makes it difficult for them to express their emotions (tosto2011prosodic). Those signs suggest the presence of paralinguistic cues that can be used to detect cognitive dysfunction.

OpenSMILE (eyben2013recent) is a commonly used tool for feature extraction in paralinguistic tasks (schuller2009interspeech; schuller2010interspeech). It provides a series of default feature sets, such as the INTERSPEECH 2010 Paralinguistic Challenge Feature Set (IS10) (schuller2010interspeech), which is used in this work. The IS10 defines 76 low-level descriptor (LLD) features for each time frame. In this work, a time frame is 25 ms long and frames are extracted every 10 ms. Those LLD are based on several speech descriptors that were used in previous works, such as pitch, voicing probability (khodabakhsh2015evaluation) and MFCC (fraser2016linguistic). Additionally, the IS10 yielded the best result in the AD detection task compared to other feature sets defined in OpenSMILE (Warnita2018).
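To make the framing arithmetic concrete, the sketch below counts how many 25 ms frames fit into a segment when a new frame starts every 10 ms. It assumes that no padding is applied at the segment edges; the openSMILE toolkit may handle edges differently, so the exact count it produces can differ. The function name `num_frames` is illustrative.

```python
def num_frames(duration_ms, win_ms=25, hop_ms=10):
    """Number of full analysis frames in a segment, assuming no edge padding."""
    if duration_ms < win_ms:
        return 0
    return 1 + (duration_ms - win_ms) // hop_ms


# A 4-second segment with 76 LLD features per frame gives a 76 x T feature matrix.
T = num_frames(4000)
feature_matrix_shape = (76, T)
```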

2.3 Classifiers

Support vector machine (SVM) classifiers (boser1992training) were widely used as the baseline method of several paralinguistic tasks, such as emotion recognition (schuller2009interspeech) and age-gender classification (schuller2010interspeech). While training the SVM, the Sequential Minimal Optimization (SMO) (platt1998sequential) algorithm is used. The SMO solves the quadratic programming (QP) optimization in the SVM by dividing the QP into the smallest possible QP sub-problems, which allows the SMO to handle a large amount of data.

Recently, the focus has changed to deep learning-based approaches. huang2014speech applied convolutional neural networks (CNN) (lecun1989backpropagation) to speech emotion recognition. For the same task, a recurrent neural network (RNN) (hopfield1982neural) was added on top of CNN layers (keren2016convolutional) to capture the speech’s dynamic features.

The RNN sequential model can accommodate the temporal pattern change, but it requires a long training time (peddinti2015time) and more training data. On the other hand, CNNs need a relatively small amount of training data compared to the other existing networks (bengio2009learning) due to their reduced number of connection weights. Moreover, even without any explicit sequential mechanism, CNNs are still able to model the temporal context in the data by means of their convolution operations (peddinti2015time).

There have been various studies that incorporated gating mechanisms into convolution layers, achieving state-of-the-art performance on tasks such as conditional image modelling (oord2016conditional), language modelling (dauphin2016language), speech synthesis (oord2016wavenet) and generative image inpainting. When applied to RNNs, such as Long Short-Term Memory (LSTM) networks, gating mechanisms were shown to be effective in handling the long-term dependencies problem. Moreover, gating mechanisms in CNNs can be used to manage the information flow as well as to mitigate the vanishing gradient problem (dauphin2016language). Therefore, inspired by these advantages and the superior performance of the combination of CNNs and gating mechanisms applied to different tasks, we hypothesize that automatic dementia detection can also benefit from this architecture.

3 Methods

3.1 Gated Convolutional Neural Network

Figure 1: A convolution layer over LLD features extracted with the openSMILE toolkit.
Figure 2: A GCNN structure with one gated block. A deeper network can be made by stacking gated blocks.

A Gated Convolutional Neural Network (GCNN) consists of convolution layers and gating mechanisms. In our case, each convolution layer is expected to extract the salient information from the combined LLD features for every short period of time. Thus, the temporal pattern change will be encapsulated within the combination of several extracted patches of features.

The convolution operation “slides” a kernel over the input features in order to extract their prominent cues. In our study, since we want to model the correlation between all the LLD features, captured at each time frame, we use the one-dimensional (1D) CNN, hence each kernel slides only in the time axis, as represented in Figure 1. This network is also referred to as Time-Delay Neural Network (TDNN) (waibel1990phoneme).

The gating mechanism applies this convolution operation to the input in two different paths, as shown in Figure 2. The gate in this network works as a controller that manages the information flow between succeeding layers. Thus, it can prevent the vanishing gradient problem during backpropagation.

Following the gated linear unit (GLU) architecture proposed by dauphin2016language, we feed the utterance feature matrix $X \in \mathbb{R}^{F \times T}$ into our network, in which $F$ and $T$ are the dimension of the LLD features and the number of time frames, respectively. We further convolve the input with a kernel of dimension $F \times w$, in which $w$ is the length of the kernel in the time axis. At each convolution operation between this kernel and the input, a scalar output is produced. Combining the output of the convolution layer with linear activation with the output of the convolution layer with the sigmoid function $\sigma$ as the activation function results in the scalar output $h_{k,t}$ of kernel $k$ at position $t$ of the output matrix $H$:

$$h_{k,t} = \left( \sum_{i=1}^{F} \sum_{j=1}^{w} W^{(k)}_{i,j} \, x_{i,t+j-1} + b_k \right) \cdot \sigma\left( \sum_{i=1}^{F} \sum_{j=1}^{w} V^{(k)}_{i,j} \, x_{i,t+j-1} + c_k \right) \qquad (1)$$

in which $x_{i,t+j-1}$ is the element of $X$ at position $(i, t+j-1)$ and $W^{(k)}_{i,j}$ the element of $W^{(k)}$ at position $(i, j)$. $W^{(k)}$ and $b_k$ are the linear gate kernel weight matrix and bias respectively (i.e., they represent the convolution operation in the right stream of the gated block in Figure 2) and $V^{(k)}$ and $c_k$ are the respective weight matrix and the bias of the convolutional operation in the left stream of Figure 2.

In Equation (1), both summations enclosed by parenthesis represent a convolution operation that results in one scalar. In addition, the term to which the sigmoid function is applied is the gate operation which controls the linear gate output.
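The gated convolution of Equation (1) can be sketched in plain Python as follows. This is an illustrative reading of the GLU formulation, not the authors’ implementation: the names `W`, `b`, `V` and `c` stand for the linear-path and gate-path kernels and biases, the input is a list of F feature rows with T frames each, and no zero padding is applied, so the output has T − w + 1 positions per kernel.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv_scalar(x, kernel, bias, t):
    """One convolution output: sum over features i and taps j of kernel[i][j] * x[i][t + j], plus bias."""
    return bias + sum(
        kernel[i][j] * x[i][t + j]
        for i in range(len(kernel))
        for j in range(len(kernel[0]))
    )


def gated_conv1d(x, W, b, V, c):
    """x: F x T input; W, V: K kernels of size F x w; returns a K x (T - w + 1) output."""
    num_steps = len(x[0]) - len(W[0][0]) + 1
    output = []
    for k in range(len(W)):
        row = []
        for t in range(num_steps):
            linear = conv_scalar(x, W[k], b[k], t)         # linear path
            gate = sigmoid(conv_scalar(x, V[k], c[k], t))  # gate in (0, 1)
            row.append(linear * gate)
        output.append(row)
    return output


# Toy input with F = 2 features and T = 4 frames, one kernel of width w = 2.
x = [[1, 2, 3, 4], [0, 1, 0, 1]]
h = gated_conv1d(x, W=[[[1, 0], [0, 1]]], b=[0.0], V=[[[0, 0], [0, 0]]], c=[0.0])
```

With an all-zero gate kernel the sigmoid evaluates to 0.5 everywhere, so the gate simply halves the linear output; a trained gate instead learns where to pass or suppress the linear response.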

The matrix $H$ resulting from the convolution operation has dimensions $K \times T'$, in which $K$ is the number of kernels and $T'$ is the output segment length. After the convolution operation, the matrix $H$ has its length halved in the time axis by the max-pooling layer (zhou1988computation) to retain its most significant information and reduce its dimensionality.
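A minimal sketch of the pooling step, assuming a pooling window and stride of 2 along the time axis. For an odd number of frames, this version keeps the trailing frame as its own window (i.e., the halved length is rounded up); the source does not state how odd lengths are handled, so this is an assumption.

```python
def max_pool_time(h, pool=2):
    """Max-pool each kernel's output row along the time axis with window = stride = pool."""
    return [
        [max(row[t:t + pool]) for t in range(0, len(row), pool)]
        for row in h
    ]


pooled = max_pool_time([[1, 3, 2, 5, 4], [0, 0, 7, 1, 2]])
```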

Figure 2 shows our GCNN with a single gated block (i.e., the components enclosed by the pink rectangle in Figure 2), which consists of two convolution layers and one max-pooling layer. A deeper GCNN would consist of multiple gated blocks.

The output of the network’s last gated block, $H^{(L)} \in \mathbb{R}^{K_L \times T_L}$, with $K_L$ as the number of kernels of the last gated block and $T_L$ as the final output segment length, is then flattened into one feature vector $v \in \mathbb{R}^{D}$, in which $D = K_L T_L$. This vector is input to a fully-connected (dense) layer with the ReLU activation function. We also apply batch normalization (ioffe2015batch) at the end of each dense and convolution layer.
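As a rough check on the size of the flattened vector, the sketch below traces the time-axis length through a stack of gated blocks. It assumes that the zero-padded convolutions preserve the time length and that each max-pooling layer halves it with rounding up. The frame count of 400 and the 64 kernels are hypothetical values chosen only for illustration.

```python
def flattened_size(num_frames, num_kernels, num_blocks):
    """Length of the flattened vector after num_blocks gated blocks (each halves the time axis)."""
    t = num_frames
    for _ in range(num_blocks):
        t = (t + 1) // 2  # halve with rounding up (assumed pooling behaviour)
    return num_kernels * t


# Hypothetical setting: 400 input frames and 64 kernels in the last gated block.
sizes = {blocks: flattened_size(400, 64, blocks) for blocks in (6, 8, 10)}
```

Under these assumptions, deeper stacks shrink the time axis quickly, so the dense layer sees only a handful of time positions per kernel.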

4 Experimental Settings

4.1 Datasets

In this experiment, we use two datasets containing the speech of people with and without dementia: the DementiaBank Pitt Corpus (becker1994natural), in which the subjects speak English, and the PROMPT Database, in which the subjects speak Japanese.

4.1.1 Pitt Corpus

The Pitt Corpus, a part of the DementiaBank, contains speech data of healthy people (Control group) and of people with Alzheimer’s disease (AD group) speaking in English. We apply four constraints prior to using the data.

First, we select the data only from the picture description task. In this task, the subjects are asked to describe the Cookie Theft Picture of the Boston Diagnostic Aphasia Examination (kaplan1983boston). The task is considered an approximation of real-life spontaneous conversations (giles1996performance).

Second, from the AD group’s subjects, we select the sessions that correspond to patients with a diagnosis of AD or probable AD. There are no specific restrictions to select sessions from the control group. It should be noted that, even though we select sessions using the same method as in fraser2016linguistic, wankerl2017n, and chen2019attention, the number of sessions is slightly different from those works since the amount of data has increased over time.

Third, we select only sessions that include both the audio file and the transcription information available in the dataset to further compare with the existing linguistic approaches.

Fourth, we exclude audio data that contains overlapping sounds from other interview sessions. As a result, the data we use comprises 488 sessions (255 dementia, 233 healthy), which have an average duration of seconds, from 267 participants (169 dementia, 98 healthy).

We use the speech turns information available in the dataset in order to extract interviewee utterances. We perform three preprocessing stages on this dataset. First, we normalize each audio signal using the average value of decibels relative to full scale (dBFS) in the data. Then, we segment the audio data of each participant into several segments according to the turns information, thus obtaining a total of 6,267 utterance segments (3,276 dementia, 2,991 healthy). Finally, we extend the duration of the utterances by 10ms at the beginning and 10ms at the end of each utterance segment.
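The normalization and turn-extension steps can be sketched as below. Several details are assumptions rather than facts from the source: samples are floats in [-1, 1], dBFS is computed from the RMS level, and every recording is scaled to the mean dBFS over all recordings. The helper names are illustrative.

```python
import math


def rms_dbfs(samples, full_scale=1.0):
    """RMS level of a signal in decibels relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("silent signal has no finite dBFS level")
    return 20.0 * math.log10(rms / full_scale)


def normalize_to(samples, target_dbfs):
    """Scale a signal so that its RMS level equals target_dbfs."""
    gain_db = target_dbfs - rms_dbfs(samples)
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


def extend_turn(start_ms, end_ms, total_ms, margin_ms=10):
    """Extend a speech turn by margin_ms on both sides, clipped to the recording."""
    return max(0, start_ms - margin_ms), min(total_ms, end_ms + margin_ms)


recordings = [[0.5] * 4, [0.25] * 4]
target = sum(rms_dbfs(r) for r in recordings) / len(recordings)
normalized = [normalize_to(r, target) for r in recordings]
```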

The audio files in the Pitt Corpus are single channelled (mono), sampled at a frequency of 44.1kHz and stored as PCM encoded wave files.

4.1.2 PROMPT Database

The PROMPT database is a part of a larger project of Keio University School of Medicine: the Project for Objective Measures Using Computational Psychiatry (PROMPT). On the 9th of March 2016, the PROMPT project and its medical data collection were approved by the ethics committee and the Institutional Review Board of Keio University School of Medicine and by all of the other participating facilities. PROMPT protocols have been registered with the University Hospital Medical Information Network (UMIN) (UMIN ID: UMIN000023764) (kishimoto19013011).

All the patients have given their written consent before participating in the study and, in cases in which patients were judged to be decisionally impaired, the patients’ guardians provided consent. Participants were able to leave the study at any time.

In this work, we use the PROMPT Database collected from May 2, 2016 to March 31, 2019 at seven hospitals and three outpatient clinics in five different Japanese prefectures. Speech data were recorded when the participant had free-discussion and performed several clinical tasks with trained research psychiatrists and/or psychologists. We define the cognitive impairment categorization based on the MMSE score.

The PROMPT data used in this work comprises 496 session recordings (153 dementia, 111 MCI and 232 healthy) with an average duration of 1,487 seconds, from 163 participants (49 dementia, 42 MCI and 72 healthy). In this work, dementia, MCI and healthy categories are defined as a MMSE score in the respective ranges of 0-23, 24-26 and 27-30. In this dataset, the intraclass correlation coefficient among the raters, conducting the evaluation from the video recordings, is ( CI=0.990-0.999, )

Since the PROMPT Database collects the audio recordings from synchronized microphones, one close to the patient (i.e., interviewee) and another close to the doctor (i.e., interviewer), we adopted the Cross-Channel Spectral Subtraction method (nasu2011cross) to extract the patient-only speech segments. Thus we reduce the impact of noisy information from the interviewer’s speech. Then, we apply the three preprocessing stages described in Section 4.1.1. Following our previous work (Warnita2018), we segment the utterances as having a fixed length of 4 seconds, thus resulting in 184,337 utterance segments (39,593 dementia, 27,234 MCI and 117,510 healthy).

The audio files are recorded under various unconstrained acoustic conditions (i.e., with different microphones and in rooms with different reverberation characteristics), but all the audio files are single-channel (mono), sampled at 16 kHz and stored as PCM-encoded wave files.
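The fixed-length segmentation of the PROMPT recordings can be sketched as follows. At 16 kHz, a 4-second segment holds 64,000 samples. The sketch assumes non-overlapping segments and drops any trailing remainder shorter than 4 seconds; the source does not specify how the remainder is handled.

```python
def fixed_segments(samples, sample_rate=16000, segment_s=4.0):
    """Split a signal into non-overlapping fixed-length segments, dropping the remainder."""
    seg_len = int(sample_rate * segment_s)
    return [
        samples[start:start + seg_len]
        for start in range(0, len(samples) - seg_len + 1, seg_len)
    ]


# Ten seconds of (silent) audio yield two full 4-second segments.
segments = fixed_segments([0.0] * 160000)
```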

4.2 Evaluation Measures

We use classification accuracy as the main evaluation measure for this experiment, in line with previous studies that used the Pitt Corpus (fraser2016linguistic; wankerl2017n) as well as studies that used other datasets (khodabakhsh2015evaluation; sadeghian2017speech). We report the accuracy averaged over the results of a 10-fold cross-validation. We partition the dataset into 10 subsets and, at each fold, select one subset for testing and the remaining nine for training. We design the ten subsets so that no subject’s data appears in both training and testing. For the PROMPT Database, we additionally evaluate our best model using 4 seconds of audio data by analysing the Receiver Operating Characteristic (ROC) and the Detection Error Tradeoff (DET) curves. We also report the Kappa score of the conducted experiments.
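One simple way to build subject-independent folds is to assign whole subjects, rather than individual sessions, to folds. The sketch below does this in round-robin order; it is only an illustration, since the source does not describe how subjects were distributed or whether the folds were stratified by class. The function name and the subject identifiers are hypothetical.

```python
def subject_folds(session_subjects, n_folds=10):
    """Group session indices into folds so that each subject falls in exactly one fold."""
    subjects = sorted(set(session_subjects))
    fold_of = {subject: i % n_folds for i, subject in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for index, subject in enumerate(session_subjects):
        folds[fold_of[subject]].append(index)
    return folds


# Hypothetical data: 20 subjects with two sessions each.
session_subjects = ["s%02d" % (i // 2) for i in range(40)]
folds = subject_folds(session_subjects)
```

Each fold serves once as the test set, with the sessions of all other folds used for training.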

4.3 Implementation Details

We split the data of each interview session into several utterance segments of predefined length. Then, we classify each segment using our Gated Convolutional Neural Network architecture. Finally, after aggregating the scores from multiple utterance segments, we conduct a majority voting to determine the session-level dementia classification.
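The session-level decision described above can be sketched as a majority vote over the segment-level outputs. The 0.5 decision threshold and the tie-breaking rule (a tie is labelled healthy) are assumptions made for this illustration.

```python
from collections import Counter


def session_prediction(segment_probs, threshold=0.5):
    """Majority vote over segment predictions; returns 1 (dementia) or 0 (healthy)."""
    votes = Counter(1 if p >= threshold else 0 for p in segment_probs)
    return 1 if votes[1] > votes[0] else 0


label = session_prediction([0.9, 0.6, 0.2, 0.7])
```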

In our binary GCNN, we consistently use the window length for

kernels in every convolution operation in each gated block and we apply zero padding to the input. We have tested our model with 6, 8 and 10 stacked gated blocks. The dense layer after the last gated block consists of 256 hidden neurons. We apply dropout with the value of 0.5 before the output layer for regularization. The output layer consists of one neuron with a sigmoid function. We trained the network using LLD features from IS10 of each utterance segment and its corresponding binary label using the 10-fold cross-validation scheme.

We used binary cross-entropy as the loss function and the Adam

(kingma2014adam) optimizer with learning rate equal to and exponential decay rate respectively defined as and

for the first and second moment estimates. A batch size equal to

was consistently used over all the experiments and the input has a dimension of , in which , the number of LLD features per time frame, is defined as and is fixed as .

5 Experimental Results

5.1 Pitt Corpus Experiments

| Method | Accuracy (%) |
|---|---|
| Speech (SMO Baseline) | 67.5 |
| Speech + Linguistic (fraser2016linguistic) | 81.9 |
| Linguistic (wankerl2017n) | 77.1 |
| Linguistic (fritsch2019automatic) | 85.6 |
| Linguistic (chen2019attention) | 97.4 |
| Ours | 73.1 |

Table 1: Comparison of dementia detection methods on the sessions of the Pitt Corpus selected according to the method described in Section 4.1.1.

We present the average accuracy over the 488 selected sessions of the Pitt Corpus in Table 1. We employ SMO on the IS10 segment-level features as a baseline for comparison. Our method yields an accuracy of 73.1%, which outperforms the SMO result of 67.5%. The best linguistic method (chen2019attention) and the method combining speech and linguistic features (fraser2016linguistic) yield accuracies of 97.4% and 81.9%, respectively. Although our method performs worse than the methods that use linguistic features, it does not require any transcription; it is therefore more cost-effective and can be applied to fast diagnosis.

(a) Segment-Level Classification

| Actual \ Predicted | Dementia | Healthy |
|---|---|---|
| Dementia | 2,340 | 936 |
| Healthy | 1,213 | 1,778 |

(b) Session-Level Classification

| Actual \ Predicted | Dementia | Healthy |
|---|---|---|
| Dementia | 189 | 66 |
| Healthy | 65 | 168 |

Table 2: Confusion matrices of the classification results from all folds, using the IS10 feature set as input and a GCNN with eight gated blocks on the Pitt Corpus.

Table 2 shows the confusion matrix of our best model on the Pitt Corpus for the aggregated values from the ten folds both from the utterance-level and the session-level predictions. The model is composed of eight gated blocks and it uses the actual utterance lengths obtained from the turns information in the Pitt Corpus. The confusion matrices show that predicting the individual utterance in a session is a difficult task since the amount of information in one utterance might be too limited. Thus, combining several utterances for one session improves the session-level prediction result.
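The gain from aggregating segments into sessions can be checked directly from the counts in Table 2. The sketch below derives accuracy, false positive rate (FPR) and false negative rate (FNR) from a binary confusion matrix, treating dementia as the positive class; the helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, FPR and FNR from binary confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # healthy cases predicted as dementia
        "fnr": fn / (fn + tp),  # dementia cases predicted as healthy
    }


# Counts from Table 2(a) (segment level) and Table 2(b) (session level).
segment_level = binary_metrics(tp=2340, fn=936, fp=1213, tn=1778)
session_level = binary_metrics(tp=189, fn=66, fp=65, tn=168)
```

On these counts, the session-level accuracy is higher than the segment-level accuracy, and the false positive rate drops noticeably once the segment votes are combined.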

Figure 3: Accuracy of GCNNs with different segment length on the Pitt Corpus

We examine the importance of the utterance length for the classification performance on the Pitt Corpus in Figure 3. We use a set of different segment durations: 0.5 s, 1 s, 2 s and 4 s. In this case, we separate each participant’s data into segments with a predetermined fixed duration, without taking into consideration the actual utterance length (i.e., the length obtained from the processing of the speaker’s turn information as described in Section 4.1.1). Each utterance segment is input to the model and we apply majority voting over the utterance predictions in order to obtain the session’s classification. The experiment is carried out using GCNNs with 6, 8 and 10 gated blocks.

Figure 3 shows that using speech utterances of only 4 seconds yields good results. In addition, segmenting the subject’s voice in the middle of their speech does not significantly degrade the performance. This suggests that discriminative dementia cues exist even in a short duration of speech data.

5.2 PROMPT Database Experiments

Figure 4: Accuracy of GCNN with different duration of data used for each session on PROMPT Database. The horizontal axis is presented in logarithmic scale.
| Category | Condition | Accuracy (%) | Kappa Score | Duration (s) |
|---|---|---|---|---|
| 2-classes | D vs M + H | 78.6 | 0.56 | 40 |
| 2-classes | D + M vs H | 75.9 | 0.53 | 300 |
| 2-classes | D vs H | 80.8 | 0.62 | 1000 |
| 3-classes | D vs M vs H | 60.6 | 0.51 | 40 |

Table 3: 2-classes and 3-classes session classification experiment conditions and results for Dementia (D), MCI (M) and Healthy (H) participants of the PROMPT Database. Both the accuracy and the Kappa score reported in this table are the best acquired for each condition among different time durations.

We further evaluate our approach on the PROMPT Database obtaining the average accuracy of 80.8%. It should be noted that the PROMPT Database has a longer duration of data compared to the Pitt Corpus hence it is easier for the network to learn how to distinguish dementia patients from healthy participants.

Furthermore, we investigate whether we can still detect dementia with only a short duration of speech data for each session. In the experiments conducted with the Pitt Corpus (Section 5.1), the model with the best accuracy was the GCNN with 8 gated blocks, thus we have chosen it for the experiments on the PROMPT Database. Moreover, the experiments on the Pitt Corpus confirmed the minimum duration of each utterance as being 4 seconds. Starting from this utterance length, we experiment with different duration configurations for each session’s data: 4 seconds, 8 seconds, 20 seconds, 40 seconds, 1 minute, 5 minutes and all of the session speech data. Our model performs the classification over each 4-second speech utterance and, for longer duration configurations, we apply majority voting to determine each speech interval prediction. In order to define the session-level prediction, we compute the majority voting result over all utterances.

We report the accuracy results in Figure 4. The figure shows that performance degrades when shorter speech durations are used. However, we obtain an average accuracy of 77.1% by using only 20 seconds of data for each session and an average accuracy of 74.7%, with a false positive rate (FPR) of 23.2% and a false negative rate (FNR) of 24.0%, when we use only 4 seconds of data. This result represents an important step towards the application of automatic dementia detection tools in the real world, in which the detection speed plays a fundamental role. We additionally report the segment-level classification confusion matrix for the 4-second utterance segmentation of one of our folds in Table 4.

| Actual \ Predicted | Dementia | Healthy |
|---|---|---|
| Dementia | 3,882 | 380 |
| Healthy | 3,227 | 7,418 |

Table 4: Segment-level confusion matrix of one of the folds on the PROMPT database for the classification based on 4-second-long utterance segments.

We further perform three-class classification to distinguish between sessions with dementia, MCI and healthy subjects on the PROMPT Database. We also perform a more comprehensive experiment on the binary classification with various configurations. Apart from distinguishing dementia and healthy sessions, we perform the classification on the dementia versus non-dementia case and the non-healthy versus healthy case by adding the participants with an MMSE score in the MCI range to the healthy and to the dementia groups, respectively. Table 3 lists these configurations in the column “Condition” and reports the best result in terms of accuracy and Kappa score for each of these conditions, together with the speech duration at which it was obtained.

In terms of session classification accuracy, adding the MCI class yields worse performance on both Condition D vs M + H (78.6%) and Condition D + M vs H (75.9%) compared to Condition D vs H (80.8%), as can be seen in Table 3. This result suggests that we should not combine the MCI participants with either the healthy or the dementia participants. This might be explained by the fact that MCI participants cannot be considered healthy due to their cognitive ability decline, but MCI cannot be framed as dementia either, since this decline is less severe compared to dementia. It is also interesting to see that combining MCI with dementia patients yields worse performance than combining MCI with healthy subjects. Further investigation on the closer relation between MCI patients and healthy subjects might be needed based on this result.

For the three-class classification, we obtain an average accuracy of 60.6% using 40 seconds of session data. While MCI patients might show only subtly different visible characteristics from healthy subjects or dementia patients, they are in fact very different. Detecting MCI patients is important for the early prediction of dementia, but it is difficult due to the nearly ambiguous nature of the data.

Accuracy (%)

| Condition | 4 sec | 8 sec | 20 sec | 40 sec | 1 min | 5 min | All data |
|---|---|---|---|---|---|---|---|
| D vs M + H | 73.8 | 74.3 | 77.9 | 78.6 | 78.3 | 77.0 | 77.6 |
| D + M vs H | 71.3 | 70.9 | 73.6 | 74.0 | 74.1 | 75.9 | 74.3 |
| D vs H | 74.7 | 74.0 | 77.1 | 78.4 | 78.1 | 79.0 | 80.8 |

Kappa score

| Condition | 4 sec | 8 sec | 20 sec | 40 sec | 1 min | 5 min | All data |
|---|---|---|---|---|---|---|---|
| D vs M + H | 0.47 | 0.47 | 0.56 | 0.56 | 0.56 | 0.54 | 0.55 |
| D + M vs H | 0.43 | 0.43 | 0.48 | 0.49 | 0.49 | 0.53 | 0.50 |
| D vs H | 0.50 | 0.49 | 0.55 | 0.57 | 0.57 | 0.58 | 0.62 |

Table 5: Experimental results on the PROMPT Database for the different binary classification conditions. Detailed information about each condition can be found in Table 3. The results are reported in accuracy and Kappa score.

| Metric | 4 sec | 8 sec | 20 sec | 40 sec | 1 min | 5 min | All data |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 65.0 | 61.1 | 58.4 | 60.6 | 59.2 | 57.7 | 58.3 |
| Kappa score | 0.45 | 0.46 | 0.47 | 0.51 | 0.49 | 0.49 | 0.51 |

Table 6: Experimental results on the PROMPT Database for the three-class classification of D vs M vs H. The results are reported in accuracy and Kappa score.
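The text does not spell out how the Kappa score is computed. Assuming it is the standard unweighted Cohen's kappa, it can be obtained from any square confusion matrix as in the sketch below, which works for both the binary and the three-class case. The function name `cohen_kappa` is illustrative, and the example input reuses the single-fold, segment-level counts of Table 4, so its value is not one of the session-level scores in the tables.

```python
def cohen_kappa(cm):
    """Unweighted Cohen's kappa for a square confusion matrix.

    cm[i][j] is the number of items of actual class i predicted as class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Observed agreement: fraction of items on the diagonal.
    p_observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of the row and column marginals per class.
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Segment-level counts of Table 4 (rows: actual D, H; columns: predicted D, H).
table4 = [[3882, 380],
          [3227, 7418]]
kappa = cohen_kappa(table4)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which makes it more informative than raw accuracy when the classes are unbalanced, as they are in this fold.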

Further details on the experiments with the four different conditions, as well as their Kappa scores, are reported in Table 5 and Table 6. These tables show that, for the experiments that include the MCI class data, the accuracy does not increase monotonically with the amount of speech data. This can be explained by the challenge of classifying MCI patients. In fact, as indicated in Section 2.1, there is no medical standard for classifying MCI patients, as they have characteristics of both healthy subjects and dementia patients, making their classification nearly ambiguous.
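To make this observation concrete, the short sketch below takes the accuracy rows of Table 5 and locates the duration at which each binary condition peaks. The values are copied from the table; the names `table5_accuracy` and `peak` are illustrative.

```python
durations = ["4 sec", "8 sec", "20 sec", "40 sec", "1 min", "5 min", "all data"]

# Accuracy (%) rows copied from Table 5.
table5_accuracy = {
    "D vs M + H": [73.8, 74.3, 77.9, 78.6, 78.3, 77.0, 77.6],
    "D + M vs H": [71.3, 70.9, 73.6, 74.0, 74.1, 75.9, 74.3],
    "D vs H":     [74.7, 74.0, 77.1, 78.4, 78.1, 79.0, 80.8],
}


def peak(values):
    """Return (duration, accuracy) of the best-performing setting."""
    best = max(range(len(values)), key=lambda i: values[i])
    return durations[best], values[best]


peaks = {cond: peak(vals) for cond, vals in table5_accuracy.items()}
for cond, (duration, acc) in peaks.items():
    print(f"{cond}: best accuracy {acc}% at {duration}")
```

Both conditions that merge the MCI group peak before the longest setting (40 seconds and 5 minutes), whereas D vs H peaks when all data are used.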

We finally plot the ROC and the DET curves for the dementia versus healthy (D vs H) classification using 4 seconds of speech data as shown in Figures 5 and 6. These curves were obtained as a combination of the 10-fold cross-validation conducted in our experiments and they represent the trade-off involved with our model. The area under the curve (AUC) of the ROC was found to be , which shows that our model has a high capability of classifying between healthy and dementia categories, but it also shows that there is still room for improvement.

Figure 5: ROC curve of the D vs H 10-fold classification for the PROMPT Database using 4 seconds of speech data. The reported accuracy of refers to a TPR equal to and a FPR equivalent to
Figure 6: The DET curve of the D vs H 10-fold classification for the PROMPT Database using 4 seconds of speech data. The reported accuracy of refers to a FNR equal to and a FPR equivalent to
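For readers who wish to reproduce curves of this kind, the sketch below shows one common way to combine cross-validation folds: pooling the held-out scores of all folds and sweeping a single threshold. Whether the curves above were built exactly this way is an assumption, the scores are toy values rather than PROMPT data, and the function names are illustrative. The DET curve uses the same sweep, plotting FNR = 1 - TPR against FPR.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores; labels use 1 = dementia, 0 = healthy."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all items tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Toy held-out predictions from two folds: (scores, labels) per fold.
folds = [([0.9, 0.8, 0.3], [1, 0, 0]),
         ([0.7, 0.4, 0.2], [1, 1, 0])]
pooled_scores = [s for scores, _ in folds for s in scores]
pooled_labels = [y for _, labels in folds for y in labels]

fpr, tpr = roc_points(pooled_scores, pooled_labels)
fnr = [1 - t for t in tpr]  # DET curve: FNR against FPR
auc = trapezoid_auc(fpr, tpr)
print(f"AUC = {auc:.3f}")
```

Pooling assumes the scores of different folds are comparable; averaging per-fold curves is an alternative when they are not.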

6 Conclusion

In this paper, we present a language-independent method for dementia detection using speech data. Using a GCNN architecture on top of the IS10 paralinguistic feature set yields a best accuracy of 73.1% on an English dataset, the Pitt Corpus, and 80.8% on a Japanese dataset, the PROMPT Database. We achieve an accuracy of 77.1% using only 20 seconds of data on the PROMPT Database and 74.7% when we consider only 4 seconds of data. We further perform a three-class classification of dementia, MCI and healthy subjects on the PROMPT Database, which yields an accuracy of 60.6%.

Even though our results on the Pitt Corpus are worse than those of current linguistic approaches, our method is cost-effective since it does not require any transcription data, and it allows the detection result to be obtained faster, which is particularly promising for the early diagnosis of dementia and MCI. Moreover, our method may be applied to speakers of resource-deficient languages more easily than methods that use linguistic information, because it is difficult to build a language model and a high-accuracy ASR system for those languages.

Nevertheless, further improvements are needed before our model can be used for diagnosis in real-world scenarios. In the near future, we intend to analyse the temporal patterns of dementia patients and to incorporate more modalities (e.g., facial features, body motion). Moreover, we would like to analyse the similarities and differences between the MCI patients’ data and the other classes in order to improve the detection of MCI patients, and to perform a cross-dataset experiment in order to evaluate whether our model can be applied across languages.


This work was supported by JST CREST [grant numbers JPMJCR1687, JPMJCR19F5]; JSPS KAKEN [grant number 16H02845], and the Japan Agency for Medical Research and Development (AMED) [grant number JP18he1102004].