## I Introduction

Machine learning techniques have been widely used in medical signal and image analysis for the prediction of neurodegenerative disorders, such as Alzheimer’s and Parkinson’s, which significantly affect elderly people, especially in developed countries [d1], [d2], [d3].

In the last few years, the development of deep learning technologies has boosted the investigation of using deep neural networks for early prediction of the above-mentioned neurodegenerative disorders. In [d4], stacked auto-encoders were used for diagnosis of Alzheimer’s disease. 3-D Convolutional Neural Networks (CNNs) were used in [d5] to analyze imaging data for Alzheimer’s diagnosis. Both methods were based on the Alzheimer’s Disease Neuroimaging Initiative dataset, which includes medical images and assessments of several hundred subjects. Recently, CNN and convolutional-recurrent neural network (CNN-RNN) architectures have been developed for prediction of Parkinson’s disease [new2], based on a new database including Magnetic Resonance Imaging (MRI) data and Dopamine Transporter (DaT) Scans from patients with Parkinson’s and non patients [new3].

In this paper we focus on the early prediction of Parkinson’s. It is these two types of medical image data, i.e., MRI and DaT Scans, that we explore for predicting an asymptomatic (healthy) status, or the stage of Parkinson’s at which a subject appears to be. In particular, MRI data show the internal structure of the brain, using magnetic fields and radio waves; an atrophy of the lentiform and caudate nuclei can be detected in MRI data of patients with Parkinson’s. DaT Scans are a specific form of single-photon emission computed tomography, using Ioflupane Iodide-123 to detect the lack of dopamine in patients’ brains.

In this paper we base our developments on the deep neural network (DNN) structures (CNNs, CNN-RNNs) developed in [new2] for predicting Parkinson’s using MRI, DaT Scan, or combined MRI & DaT Scan data from the recently developed Parkinson’s database [new3]. We extend these developments by extracting latent variable information from the DNNs trained with MRI & DaT Scan data and generating clusters of this information; these are evaluated by medical experts with reference to the corresponding status/stage of Parkinson’s. The generated and medically annotated cluster centroids are then used in three different scenarios of major medical significance:

1) Transparently predicting a new subject’s status/stage of Parkinson’s; this is performed using nearest neighbor classification of new subjects’ MRI and DaT Scan data with reference to the cluster centroids and the respective medical annotations.

2) Retraining the DNNs with the new subjects’ data, without forgetting the current medical cluster annotations; this is performed by considering the retraining as a constrained optimization problem and using a gradient projection training algorithm instead of the usual gradient descent method.

3) Transferring the learning achieved by DNNs fed with MRI & DaT Scan data to medical centers that only possess MRI information about subjects, thus improving their prediction capabilities; this is performed through a domain adaptation methodology, in which a new error criterion is introduced that includes the above-derived cluster centroids as desired outputs during training.

Section II describes related work where machine learning techniques have been applied to MRI and DaT Scan data for detecting Parkinson’s. The new Parkinson’s database we are using in this paper is also described in this section. Section III first describes the extraction of latent variable information from trained deep neural networks and then presents the proposed approach in the framework of the three considered testing, transfer learning and domain adaptation scenarios. Section IV provides the experimental evaluation which illustrates the performance of the proposed approach using an augmented version of the Parkinson’s database, which we also make publicly available. Conclusions and future work are presented in Section V.

## II Related Work

Medical image data constitute a rich source of information regarding cell degeneration in the human nervous system of Parkinson’s patients. MRI and DaT Scan data have been the focus of related research: in [d10], principal component analysis and support vector machines were applied to MRI data, while the same techniques and empirical mode decomposition were applied to DaT Scans in [d11].

A Parkinson’s database comprising MRI and DaT Scan data from 78 subjects, 55 patients with Parkinson’s and 23 non patients, has recently been released [new3]; it includes, in total, 41,528 MRI images (31,147 from patients and 10,381 from non patients) and 925 DaT Scans (595 and 330 respectively). Our developments below are based on this database.

CNN architectures [ylc86], [c10] include convolutional, pooling and fully connected layers, in which convolutional kernel and fully connected layer weights are usually learned through gradient descent, while pooling layers reduce the input sizes through averaging operations. CNN-RNN architectures [c10], [d12] are capable of effectively analyzing temporal variations of the inputs, by permitting intra-layer connections and using appropriate gating operations.

Recent advances in deep neural networks [c10], [new], [c9], [b5] have been explored in [new2], where convolutional (CNN) and convolutional-recurrent (CNN-RNN) neural networks were developed and trained to classify the information in the above Parkinson’s database into two categories, i.e., patients and non patients, based on either MRI inputs, DaT Scan inputs, or combined MRI and DaT Scan inputs.

DaT Scans, which are a specific examination for Parkinson’s, generally convey more information than MRI; however, using both inputs can provide better prediction performance.

The developed networks included: transfer learning of the ResNet-50 network [c25] as far as the convolutional part of the networks was concerned, with retraining of the fully connected network layers; and adding on top of this, and training in an end-to-end manner, a recurrent network using Gated Recurrent Units (GRUs) [c6].

In this paper we focus first on the analysis of the combined MRI and DaT Scan dataset. It should be mentioned that the target in Parkinson’s disease detection through MRI data is the estimation of the volume of the lentiform nucleus and of the head of the caudate nucleus. To deal with volume estimation, we analyse MRIs in triplets of consecutive frames. Thus, an MRI triplet of (gray-scale) images and a DaT Scan (colour) image constitute the input to the CNN and/or CNN-RNN architectures that we use in our developments. Fig. 1 shows such a triplet of consecutive frames from an MRI sequence and a corresponding DaT Scan image.

Section III-A presents the methodology used to extract latent variables from the trained DNNs and to achieve diagnosis of Parkinson’s. Section III-B describes the approach for retraining the DNNs with new information, while preserving the already extracted information. In Section III-C we examine DNN-based analysis of only MRI input triplets and show how this analysis can be improved by adaptation of the latent variable information extracted from the DNNs trained with both MRI and DaT Scan data.

## III The Proposed Approach

### III-A Extracting Latent Variables from Trained Deep Neural Networks

The proposed approach begins with training a CNN, or a CNN-RNN architecture, on the (train) dataset of MRI and DaT Scan data. The CNN networks include a convolutional part and one or more Fully Connected (FC) layers, using neurons with a ReLU activation function. In the CNN-RNN case, these are followed by a recurrent part, including one or more hidden layers, composed of GRU neurons.

We then focus on the neuron outputs in the last FC layer (CNN case), or in the last RNN hidden layer (CNN-RNN case). These latent variables, extracted from the trained DNNs, represent the higher-level information through which the networks produce their predictions, i.e., whether the input information indicates that the subject is a patient or not.

In particular, let us consider the following dataset for training the DNN to predict Parkinson’s:

$$ S = \{(\mathbf{x}_i, y_i),\; i = 1, \dots, n\} \quad (1) $$

and the corresponding test dataset:

$$ S' = \{(\mathbf{x}'_i, y'_i),\; i = 1, \dots, m\} \quad (2) $$

where: $\mathbf{x}_i$ and $y_i$ represent the network training inputs (each of which consists of an MRI triplet and a DaT Scan) and the respective desired outputs (with a binary value 0/1, where 0 represents a non patient and 1 represents a patient case); $\mathbf{x}'_i$ and $y'_i$ similarly represent the network test inputs and respective desired outputs.

After training the Deep Neural Network using dataset $S$, its neurons’ outputs in the final FC layer (CNN case), or hidden layer (CNN-RNN case), $\mathbf{r}_i$ and $\mathbf{r}'_i$, both in $\mathbb{R}^{l}$, are extracted as latent variables, obtained through forward propagation of each image in the train set and test set respectively:

$$ R = \{\mathbf{r}_i,\; i = 1, \dots, n\} \quad (3) $$

and

$$ R' = \{\mathbf{r}'_i,\; i = 1, \dots, m\} \quad (4) $$
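As a concrete illustration, the extraction step can be sketched in a few lines of numpy: forward-propagating pooled convolutional features through a trained network's last ReLU FC layer yields the latent vectors of Eq. (3). The weights, shapes and names below are hypothetical stand-ins, not the authors' implementation:

```python
import numpy as np

def extract_latent(features, W_fc, b_fc):
    """Forward-propagate pooled convolutional features through the last
    fully connected (FC) layer and return its ReLU activations, i.e. the
    latent variables r_i of Eq. (3)."""
    z = features @ W_fc + b_fc        # affine part of the FC layer
    return np.maximum(z, 0.0)         # ReLU activation

# Toy example: n=4 inputs, 6-dim pooled features, l=3 latent units.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6))
W, b = rng.normal(size=(6, 3)), np.zeros(3)
R = extract_latent(feats, W, b)       # rows of R play the role of r_i
```

In a TensorFlow/Keras setting the same effect is obtained by building a sub-model whose output is the last FC layer of the trained network and calling it on each input.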

The following clustering procedure is then implemented on the $\mathbf{r}_i$ in $R$:

We generate a set of $k$ clusters $C = \{C_1, \dots, C_k\}$ by minimizing the within-cluster sum of squared norms:

$$ \arg\min_{C} \sum_{j=1}^{k} \sum_{\mathbf{r}_i \in C_j} \|\mathbf{r}_i - \boldsymbol{\mu}_j\|^2 \quad (5) $$

where $\boldsymbol{\mu}_j$ is the mean value of the data in cluster $C_j$.

This is done using the k-means++ [d50] algorithm, with the first cluster centroid being selected at random from $R$. The class label of a given cluster is simply the mode class of the data points within it.

As a consequence, we generate a set of $k$ cluster centroids, representing the different types of input data included in our train set $S$:

$$ Q = \{\mathbf{q}_j,\; j = 1, \dots, k\} \quad (6) $$

Through medical evaluation of the MRI and DaT Scan images corresponding to the cluster centroids, we can annotate each cluster according to the stage of Parkinson’s that its centroid represents.

By computing the Euclidean distances between the test data in $R'$ and the cluster centroids in $Q$, and by then using the nearest neighbor criterion, we can assign each one of the test data to a specific cluster and evaluate the obtained classification - disease prediction - performance. This is an alternative to the prediction accomplished when the trained DNN is applied to the test data.

This alternative prediction is, however, of great significance: in the case of non-annotated new subjects’ data, selecting the nearest cluster centroid in $Q$ can be a transparent way of diagnosing the subject’s Parkinson’s stage, since the MRI and DaT Scan data and related medical annotations of the cluster centroids can be directly compared to the new subject’s data.
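The nearest-centroid diagnosis step can be sketched as follows; the centroids `Q` and their medical labels are hypothetical placeholders for the annotated centroids produced by the clustering and expert evaluation described above:

```python
import numpy as np

def nearest_centroid(r, centroids):
    """Assign a latent vector r to the closest annotated cluster
    centroid, using Euclidean distance (nearest neighbor criterion)."""
    d = np.linalg.norm(centroids - r, axis=1)
    return int(np.argmin(d))

# Hypothetical annotated centroids (k=3, l=2) and their medical labels.
Q = np.array([[0.0, 0.0],   # e.g. "non patient"
              [5.0, 0.0],   # e.g. "early Parkinson's"
              [5.0, 5.0]])  # e.g. "advanced Parkinson's"
labels = ["non patient", "stage 1-2", "stage 3"]

r_new = np.array([4.6, 0.3])             # latent vector of a new subject
diagnosis = labels[nearest_centroid(r_new, Q)]
```

Because each centroid carries its own MRI/DaT Scan images and medical annotation, the returned `diagnosis` is accompanied by concrete reference data the clinician can inspect, which is what makes this prediction path transparent.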

### III-B Retraining of Deep Neural Networks with Annotated Latent Variables

Whenever new data, either from patients or from non patients, are collected, they should be used to extend the knowledge already acquired by the DNN, by adapting its weights to the new data. In such a case, let us assume that a new train dataset, say $S_{new}$, usually of small size, say $n_{new}$, is generated, and that an updated DNN should be created based on this dataset as well.

There are different methods developed in the framework of transfer learning [d20], for training a new DNN on using the structure and weights of the above-described DNN. However, a major problem is that of catastrophic forgetting, i.e., the fact that the DNN forgets some formerly learned information when fine-tuning to the new data. This can lead to loss of annotations related to the latent variables extracted from the formerly trained DNN. To avoid this, we propose the following DNN adaptation method, which preserves annotated latent variables.

For simplicity of presentation, let us consider a CNN architecture, in which we keep the convolutional and pooling layers fixed and retrain the FC and output layers. Let $\mathbf{W}$ be a vector including the weights of the FC and output network layers of the original network, before retraining, and let $\mathbf{W}'$ denote the new (updated) weight vector, obtained through retraining. Let us also denote by $\mathbf{w}$ and $\mathbf{w}'$, respectively, the original and updated weights connecting the outputs of the last FC layer, defined as $\mathbf{r}$ in Eq. (3), to the network outputs $z$.

During retraining, the new network weights $\mathbf{W}'$ are computed by minimizing the following error criterion:

$$ E = E_{new} + \eta\, E_{old} \quad (7) $$

where $E_{new}$ represents the misclassifications in $S_{new}$, which includes the new data, and $E_{old}$ represents the misclassifications in $S$, which includes the old information; the parameter $\eta$ is used to differentiate the focus between the new and old data. In the following we make the hypothesis that a small change of the weights $\mathbf{W}$ is enough to achieve good classification performance in the current conditions. Consequently, we get:

$$ \mathbf{W}' = \mathbf{W} + \Delta\mathbf{W} \quad (8) $$

and in the output layer case:

$$ \mathbf{w}' = \mathbf{w} + \Delta\mathbf{w} \quad (9) $$

in which $\Delta\mathbf{W}$ and $\Delta\mathbf{w}$ denote small weight increments. Under this formulation, we can apply a first-order Taylor series expansion to linearize the neurons’ activations.

Let us now give more attention to the new data in $S_{new}$. We can do this by expressing $E_{new}$ in Eq. (7) in terms of the following constraint:

$$ z'_i = y_i, \qquad \forall\, (\mathbf{x}_i, y_i) \in S_{new} \quad (10) $$

which requests that the new network outputs and the desired outputs are identical.

Moreover, to preserve the formerly extracted latent variables, we move the input data corresponding to the annotated cluster centroids in $Q$ from dataset $S$ to $S_{new}$. Consequently, Eq. (10) includes these inputs as well; the size of $S_{new}$ becomes:

$$ n'_{new} = n_{new} + k \quad (11) $$

where $k$ is the number of clusters in $Q$.

Let the difference of the retrained network output from the original one be:

$$ \Delta z_i = z'_i - z_i \quad (12) $$

Expressing the output as a weighted average, through the $\mathbf{w}$ weights, of the last FC layer outputs, we get [new2]:

$$ z'_i - z_i \approx f'(u_i)\,\big[\mathbf{w}^{T} \Delta\mathbf{r}_i + \Delta\mathbf{w}^{T} \mathbf{r}_i\big] \quad (13) $$

where $f'(u_i)$ denotes the derivative of the former DNN output layer’s neurons’ activation function, evaluated at the neuron’s input $u_i$. Inserting Eq. (10) into Eq. (13) results in:

$$ y_i - z_i \approx f'(u_i)\,\big[\mathbf{w}^{T} \Delta\mathbf{r}_i + \Delta\mathbf{w}^{T} \mathbf{r}_i\big] \quad (14) $$

All terms in Eq. (14) are known, except for the differences in weights $\Delta\mathbf{w}$ and in the last FC neuron outputs $\Delta\mathbf{r}_i$. As a consequence, Eq. (14) can be used to compute the new DNN weights of the output layer in terms of the neuron outputs of the last FC layer.

If there is more than one FC layer, we apply the same procedure, i.e., we linearize the differences of the FC layer outputs iteratively through the previous FC layers and express the $\Delta\mathbf{r}$ in terms of the weight differences in these layers. When reaching the convolutional/pooling layers, where no retraining is to be performed, the procedure ends, since the respective difference is zero. It can be shown, similarly to [new2], that the weight updates are finally estimated through the solution of a set of linear equations defined on $S_{new}$:

$$ \mathbf{A}\, \Delta\mathbf{W} = \mathbf{v} \quad (15) $$

where matrix $\mathbf{A}$ includes weights of the original DNN and vector $\mathbf{v}$ is defined as follows:

$$ \mathbf{v} = \mathbf{y} - \mathbf{z} \quad (16) $$

with $\mathbf{z}$ denoting the output of the original DNN applied to the data in $S_{new}$, and $\mathbf{y}$ the corresponding desired outputs.

Similarly to [new2], the size of $\mathbf{v}$ is lower than the size of $\Delta\mathbf{W}$; many methods exist, therefore, for solving Eq. (15). Following the assumption made at the beginning of this section, we choose the solution that provides minimal modification of the original DNN weights; this is the one that provides the minimum change in the value of $E_{old}$ in Eq. (7).

Summarizing, the targeted adaptation can be solved as a nonlinear constrained optimization problem, minimizing Eq. (7), subject to Eq. (10) and the selection of minimal weight increments. In our implementation, we use the gradient projection method [c40] for computing the network weight updates and consequently the adapted DNN architecture.
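Assuming the constraint takes the linear form $\mathbf{A}\,\Delta\mathbf{W} = \mathbf{v}$ referenced in Eq. (15), the gradient projection idea can be sketched as follows: start from a feasible increment and project every gradient step onto the null space of $\mathbf{A}$, so the constraint keeps holding while the remaining error term decreases. This is an illustrative numpy sketch, not the authors' implementation; `grad_fn` is a hypothetical stand-in for the gradient of the unconstrained part of the criterion:

```python
import numpy as np

def gradient_projection(A, v, grad_fn, lr=0.1, steps=100):
    """Minimise an error term over the weight increment dW subject to
    the linear constraint A @ dW = v: start from the minimum-norm
    feasible point and project each gradient step onto the null space
    of A, so the constraint keeps holding throughout."""
    dW = np.linalg.pinv(A) @ v                         # feasible start
    P = np.eye(A.shape[1]) - np.linalg.pinv(A) @ A     # null-space projector
    for _ in range(steps):
        dW = dW - lr * (P @ grad_fn(dW))               # projected step
    return dW

# Toy check: minimise ||dW||^2 subject to a random consistent constraint.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 5))
v = rng.normal(size=2)
dW = gradient_projection(A, v, grad_fn=lambda w: 2.0 * w)
```

Because the projector removes the gradient components that would violate $\mathbf{A}\,\Delta\mathbf{W} = \mathbf{v}$, the new data (and the centroid inputs) remain correctly classified at every step, which is exactly the behaviour the retraining scheme requires.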

### III-C Domain Adaptation of Deep Neural Networks through Annotated Latent Variables

In the two previous subsections we have focused on the generation of cluster centroids, based on the extraction of latent variables from a trained DNN, and on their use for prediction and for adaptation of a Parkinson’s diagnosis system. To do this, we have considered all available imaging information, consisting of MRI and DaT Scan data.

However, in many cases, especially in general-purpose medical centers, DaT Scan equipment may not be available, while MRI technology is. In the following we present a domain adaptation methodology that uses the annotated latent variables extracted from the originally trained DNN to improve the prediction of Parkinson’s achieved when using only MRI input data. A new DNN training loss function is used to achieve this target.

Let us consider the following train and test datasets, similar to $S$ and $S'$ in Eq. (1) and Eq. (2) respectively, in which the input consists only of triplets of MRI data:

$$ T = \{(\tilde{\mathbf{x}}_i, \tilde{y}_i),\; i = 1, \dots, \tilde{n}\} \quad (17) $$

and

$$ T' = \{(\tilde{\mathbf{x}}'_i, \tilde{y}'_i),\; i = 1, \dots, \tilde{m}\} \quad (18) $$

where: $\tilde{\mathbf{x}}_i$ and $\tilde{y}_i$ represent the network training inputs (each of which consists of only an MRI triplet) and the respective desired outputs (with a binary value 0/1, where 0 represents a non patient and 1 represents a patient case); $\tilde{\mathbf{x}}'_i$ and $\tilde{y}'_i$ similarly represent the network test inputs and respective desired outputs.

Using $T$, we train a similar DNN structure - as in the full MRI and DaT Scan case - producing the following set of neuron outputs in its last FC or hidden layer:

$$ \tilde{R} = \{\tilde{\mathbf{r}}_i,\; i = 1, \dots, \tilde{n}\} \quad (19) $$

with the dimension of each vector $\tilde{\mathbf{r}}_i$ being $l$, as in the last FC, or hidden, layer of the original DNN.

As far as the outputs are concerned, it would be desirable for these latent variables to be close, e.g., according to the mean squared error criterion, to one of the cluster centroids in Eq. (6) that belongs to the same category (patient/non patient) as them.

In this way, training the DNN with only MRI inputs would also bring its output closer to the one generated by the original DNN; this would potentially improve the network’s performance towards the much better one produced by the original DNN (trained with both MRI and DaT Scan data).

Let us compute the Euclidean distances between the latent variables in $\tilde{R}$ and the cluster centroids in $Q$, as defined in Eq. (6). Using the nearest neighbor criterion, we can define a set of desired vector values for the latent variables, with respect to the cluster centroids, as follows:

$$ D = \{\mathbf{d}_i = [d_{i,1}, \dots, d_{i,k}]^{T},\; i = 1, \dots, \tilde{n}\} \quad (20) $$

where $d_{i,j}$ is equal either to 1, in the case of the cluster centroid $\mathbf{q}_j$ that was selected as closest to $\tilde{\mathbf{r}}_i$ during the above-described procedure, or to 0 in the case of the rest of the cluster centroids.
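The construction of these desired vectors can be sketched as follows; the centroids and latent vectors are hypothetical toy data, and each row of the returned matrix is one indicator vector with a single 1 at the nearest centroid:

```python
import numpy as np

def desired_vectors(latents, centroids):
    """For each MRI-only latent vector, build the desired indicator
    vector d_i: 1 at the nearest cluster centroid (Euclidean distance,
    nearest neighbor criterion) and 0 elsewhere."""
    d = np.zeros((len(latents), len(centroids)))
    for i, r in enumerate(latents):
        j = int(np.argmin(np.linalg.norm(centroids - r, axis=1)))
        d[i, j] = 1.0
    return d

Q = np.array([[0.0, 0.0], [4.0, 4.0]])          # hypothetical centroids
R_tilde = np.array([[0.5, -0.2], [3.8, 4.1]])   # hypothetical latents
D = desired_vectors(R_tilde, Q)
```

The resulting rows serve as the targets $\mathbf{d}_i$ that the modified error criterion below pulls the MRI-only latent variables towards.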

In the following, we introduce the $d_{i,j}$ values in a modified Error Criterion to be used in DNN learning to correctly classify the MRI inputs.

Normally, the DNN (CNN, or CNN-RNN) training is performed through minimization of the error criterion in Eq. (21) in terms of the DNN weights:

$$ E_1 = \frac{1}{2} \sum_{i=1}^{N} (z_i - \tilde{y}_i)^2 \quad (21) $$

where $z_i$ and $\tilde{y}_i$ denote the actual and desired network outputs and $N$ is equal to the number of all MRI input triplets.

We propose a modified Error Criterion, introducing an additional term, using the following definitions:

$$ e_{i,j} = \|\tilde{\mathbf{r}}_i - \mathbf{q}_j\|^2 \quad (22) $$

and

$$ \mathbf{e}_i = [e_{i,1}, \dots, e_{i,k}]^{T} \quad (23) $$

with $T$ indicating the transpose operator.

It is desirable that the term $e_{i,j}$ - with a respective value of $d_{i,j}$ equal to one - is minimized, whilst the $e_{i,j}$ values - corresponding to the rest of the $d_{i,j}$ values, which are equal to zero - are maximized. Similarly to [c8], we pass $\mathbf{e}_i$ through a softmax function $f$ and subtract its output from 1, so as to obtain the above-described respective minimum and maximum values.

The generated Loss Function is expressed in terms of the differences of the transformed $e_{i,j}$ values from the corresponding desired responses $d_{i,j}$, as follows:

$$ E_2 = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{k} \big[ (1 - f(e_{i,j})) - d_{i,j} \big]^2 \quad (24) $$

calculated on the $\tilde{R}$ data and the cluster centroids.

In general, our target is to minimize Eq. (21) and Eq. (24) together. We can achieve this using the following Loss Function:

$$ E = \mu\, E_1 + (1 - \mu)\, E_2 \quad (25) $$

where $\mu$ is chosen in the interval $[0, 1]$.

Using a value of $\mu$ towards zero gives more importance to the introduced centroids of the clusters of the latent variables extracted from the best performing DNN, trained with both MRI and DaT Scan data. On the contrary, using a value towards one leads to minimization of the normal error criterion.
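Assuming the combined criterion takes the convex-combination form described above (a squared output error plus a softmax-transformed centroid term), it can be sketched in numpy as follows; all arrays are illustrative stand-ins for the quantities of Eqs. (19)-(24), not values from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def combined_loss(z, y, R_tilde, Q, d, mu=0.5):
    """Convex mix of the usual squared output error and the centroid
    term: squared distances between latent vectors and centroids are
    softmax-transformed, subtracted from 1 and compared with the
    desired indicators d_ij."""
    e1 = 0.5 * np.sum((z - y) ** 2)                            # output error
    e2 = 0.0
    for r, d_i in zip(R_tilde, d):
        e_i = np.array([np.sum((r - q) ** 2) for q in Q])      # distances
        e2 += 0.5 * np.sum(((1.0 - softmax(e_i)) - d_i) ** 2)  # centroid term
    return mu * e1 + (1.0 - mu) * e2

# Toy quantities standing in for network outputs and latent variables.
z = np.array([0.9, 0.1]); y = np.array([1.0, 0.0])
Q = np.array([[0.0, 0.0], [4.0, 4.0]])        # hypothetical centroids
R_tilde = np.array([[0.5, 0.2], [3.9, 4.2]])
d = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = combined_loss(z, y, R_tilde, Q, d, mu=0.5)
```

With `mu=1.0` the centroid term vanishes and the criterion reduces to the plain squared error, matching the statement that values of $\mu$ towards one recover normal error criterion minimization.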

## IV Experimental Evaluation

In this section we present a variety of experiments evaluating the proposed approach. The implementation of all algorithms described in the former Section has been performed in Python using the TensorFlow library.

### IV-A The Parkinson’s Dataset

The data that are used in our experiments come from the Parkinson’s database described in Section II. For training the CNN and CNN-RNN networks, we performed an augmentation procedure in the train dataset, as follows. After forming all triplets of consecutive MRI frames, we generated combinations of these image triplets with each one of the DaT Scans in each category (patients, non patients).

Consequently, we created a dataset of 66,176 training inputs, each consisting of 3 MRI images and 1 DaT Scan image. In the test dataset, which referred to different subjects than the train dataset, we made this combination per subject; this created 1130 test inputs.
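The combination step can be sketched as follows: consecutive MRI frames are grouped into triplets, and each triplet is paired with every DaT Scan of the same category. The function name and toy data are illustrative only:

```python
from itertools import product

def augment(mri_frames, dat_scans):
    """Sketch of the augmentation: form all triplets of consecutive MRI
    frames, then pair each triplet with every DaT Scan of the same
    category (patient / non patient)."""
    triplets = [tuple(mri_frames[i:i + 3])
                for i in range(len(mri_frames) - 2)]
    return [(t, d) for t, d in product(triplets, dat_scans)]

# Toy example: 5 MRI frames give 3 consecutive triplets; with 2 DaT
# Scans of the same category this yields 3 * 2 = 6 training inputs.
inputs = augment(["m1", "m2", "m3", "m4", "m5"], ["d1", "d2"])
```

Applying this per category over the full database is what scales the original images up to the 66,176 training inputs reported above.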

For possible reproduction of our experiments, both the training and test datasets, each being split in two folders - patients and non patients - are available upon request from the mlearn.lincoln.ac.uk web site.

### IV-B Testing the Proposed Approach for Parkinson’s Prediction

We used the DNN structures described in [new2], including both CNN and CNN-RNN architectures to perform Parkinson’s diagnosis, using the train and test data of the above-described database. The convolutional and pooling part of the architectures was based on the ResNet-50 structure; GRU units were used in the RNN part of the CNN-RNN architecture.

The best performing CNN and CNN-RNN structures, when trained with both MRI and DaT Scan data, are presented in Table I.

| Structure | No. of FC Layers | No. of Hidden Layers | No. of Units in FC Layer(s) | No. of Units in Hidden Layers | Accuracy (%) |
|---|---|---|---|---|---|
| CNN | 2 | - | 2622-1500 | - | 94 |
| CNN-RNN | 1 | 2 | 1500 | 128-128 | 98 |

It is evident that the CNN-RNN architecture was able to provide excellent prediction results on the database test set. We, therefore, focus on this architecture for extracting latent variables. For comparison purposes, it can be mentioned that the performance of a similar CNN-RNN architecture when trained only with MRI inputs was about 70%.

It can be seen, from Table I, that the number of neurons in the last FC layer of the CNN-RNN architecture was 128. This is, therefore, the dimension of the vectors r extracted as in Eq. (3) and used in the cluster generation procedure of Eq. (5).

We then implemented this cluster generation procedure, as described in the former Section. The k-means algorithm provided five clusters of the data in the 128-dimensional space. Fig. 2 depicts a 3-D visualization of the five cluster centroids; stars in blue color denote the two centroids corresponding to non patient data, while squares in red color represent the three cluster centroids corresponding to patient data.

With the aid of medical experts, we generated annotations of the images (3 MRI and 1 DaT Scan) corresponding to the 5 cluster centroids. It was very interesting to discover that these centroids represent different levels of Parkinson’s evolution. Since the DaT Scans conveyed the major part of this discrimination, we show in Fig. 3 the DaT Scans corresponding to each of the cluster centroids.

According to the provided medical annotation, the 1st centroid ($\mathbf{q}_1$) corresponds to a typical non patient case. The 2nd centroid ($\mathbf{q}_2$) represents a non patient case as well, but with some findings that seem to be pathological. Moving to the patient cases, the 3rd centroid ($\mathbf{q}_3$) shows an early stage of Parkinson’s - between stage 1 and stage 2 - while the 4th centroid ($\mathbf{q}_4$) denotes a typical Parkinson’s case - in stage 2. Finally, the 5th centroid ($\mathbf{q}_5$) represents an advanced stage of Parkinson’s - in stage 3.

It is interesting to note here that, although the DNN was trained to classify input data in two categories - patients and non patients -, by extracting and clustering the latent variables, we were able to generate a richer representation of the diagnosis problem in five categories. It should be mentioned that the purity of each generated cluster was almost perfect.

| Cluster | No. of Data (%) |
|---|---|
| $\mathbf{q}_1$ | 4.3 |
| $\mathbf{q}_2$ | 38.4 |
| $\mathbf{q}_3$ | 27.6 |
| $\mathbf{q}_4$ | 2.3 |
| $\mathbf{q}_5$ | 27.4 |

Table II shows the percentages of training data included in each one of the five generated clusters. It should be mentioned that almost two thirds of the data belong in clusters 2 and 3, i.e., in the categories which are close to the borderline between patients and non patients. These cases require major attention by the medical experts and the proposed procedure can be very helpful for diagnosis of such subjects’ cases.

We tested this procedure on the Parkinson’s test dataset, by computing the Euclidean distances of the corresponding extracted latent variables from the 5 cluster centroids and by classifying them to the closest centroid.

Table III shows the number of test data referring to six different subjects that were classified to each cluster. All non patient cases were correctly classified. In the patient cases, the great majority of the data of each patient were correctly classified to one of the respective centroids. In the small number of misclassifications, the disease symptoms were not so evident. However, based on the large majority of correct classifications, the subject would certainly attract the necessary interest from the medical expert.

| Test case | $\mathbf{q}_1$ | $\mathbf{q}_2$ | $\mathbf{q}_3$ | $\mathbf{q}_4$ | $\mathbf{q}_5$ |
|---|---|---|---|---|---|
| Non Patient 1 | 44 | 398 | 0 | 0 | 0 |
| Non Patient 2 | 10 | 90 | 0 | 0 | 0 |
| Patient 1 | 3 | 7 | 94 | 8 | 8 |
| Patient 2 | 1 | 7 | 139 | 17 | 20 |
| Patient 3 | 3 | 0 | 145 | 18 | 38 |
| Patient 4 | 0 | 0 | 0 | 8 | 72 |

We next examined the ability of the above-described DNN to be retrained using the procedure described in Subsection III.B.

In the developed scenario, we split the above test data in two parts: we included 3 of the subjects (Non Patient 2, Patient 2 and Patient 3) in the retraining dataset $S_{new}$ and left the other 3 subjects in the new test dataset. The size of $S_{new}$ was equal to 493 inputs, including the five inputs corresponding to the cluster centroids in $Q$; the size of the new test set was equal to 642 inputs.

We applied the proposed procedure to minimize the error over all train data in $S$ and $S_{new}$, focusing more on the latter, as described by Eq. (10).

The network managed to learn and correctly classify all 493 inputs, including the inputs corresponding to the cluster centroids, with a minimal degradation of its performance over the original input data. We then applied the retrained network to the test dataset consisting of three subjects. In this case, there was also a slight improvement, since the performance was raised to 98.91%, compared to the corresponding performance on the same three subjects’ data, shown in Table III, which was 98.44%.

Table IV shows the clusters to which the new extracted latent variables were classified. A comparison with the corresponding results in Table III shows the differences produced through retraining.

| Test case | $\mathbf{q}_1$ | $\mathbf{q}_2$ | $\mathbf{q}_3$ | $\mathbf{q}_4$ | $\mathbf{q}_5$ |
|---|---|---|---|---|---|
| Non Patient 1 | 41 | 401 | 0 | 0 | 0 |
| Patient 1 | 2 | 5 | 99 | 7 | 7 |
| Patient 4 | 0 | 0 | 0 | 7 | 73 |

We finally examined the performance of the domain adaptation approach that was presented in Subsection III.C.

We started by training the CNN-RNN network with only the MRI triplets in $T$ as inputs. The obtained performance when the trained network was applied to the test set was only 70.6%. For illustration of the proposed developments, we extracted the latent variables from this trained network and classified them to a set of respectively extracted cluster centroids. Table V presents the results of this classification task, which is consistent with the acquired DNN performance. It can be seen that the MRI information leads the DNN prediction towards the patient class, which indeed contained more samples in the train dataset. Most errors were made in the non patient class (subjects 1 and 2).

| Test case | $\mathbf{q}_1$ | $\mathbf{q}_2$ | $\mathbf{q}_3$ | $\mathbf{q}_4$ | $\mathbf{q}_5$ |
|---|---|---|---|---|---|
| Non Patient 1 | 181 | 74 | 179 | 8 | 0 |
| Non Patient 2 | 14 | 4 | 44 | 33 | 5 |
| Patient 1 | 16 | 0 | 53 | 49 | 2 |
| Patient 2 | 6 | 0 | 83 | 80 | 15 |
| Patient 3 | 26 | 3 | 130 | 35 | 10 |
| Patient 4 | 12 | 0 | 51 | 11 | 6 |

We then examined the ability of the proposed approach to train the CNN-RNN network using the modified Loss Function, with various values of $\mu$; here we present the case of a value equal to 0.5.

The obtained performance when the trained network was applied to the test set was raised to 81.1%. To illustrate this improvement, we also extracted the latent variables from this trained network and classified them to one of the five annotated original cluster centroids in $Q$.

Table VI presents the results of this classification task. It is evident that minimization of the modified Loss Function managed to force the extracted latent variables closer to cluster centroids belonging to the correct class for Parkinson’s diagnosis.

| Test case | $\mathbf{q}_1$ | $\mathbf{q}_2$ | $\mathbf{q}_3$ | $\mathbf{q}_4$ | $\mathbf{q}_5$ |
|---|---|---|---|---|---|
| Non Patient 1 | 176 | 147 | 114 | 5 | 0 |
| Non Patient 2 | 13 | 41 | 25 | 18 | 3 |
| Patient 1 | 13 | 0 | 70 | 35 | 2 |
| Patient 2 | 5 | 0 | 116 | 54 | 9 |
| Patient 3 | 20 | 2 | 140 | 34 | 8 |
| Patient 4 | 9 | 0 | 31 | 5 | 35 |

## V Conclusions and Future Work

The paper proposed a new approach for extracting latent variables from trained DNNs, in particular CNN and CNN-RNN architectures, and using them in a clustering and nearest neighbor classification method for achieving high performance and transparency in Parkinson’s diagnosis. We have used augmentation of the MRI and DaT Scan data in a recent Parkinson’s database and provide the resulting datasets upon request from mlearn.lincoln.ac.uk.

A DNN retraining procedure was presented, which is able to preserve the knowledge provided by annotated formerly extracted clustered latent variables. Moreover, a domain adaptation approach has been developed, which is able to use the extracted clustered latent variable information for improving the performance of the DNN architecture when presented with less input (only MRI) data.

An experimental study has been developed, using the above datasets, which illustrates the ability of the proposed approach to achieve high performance.

Future work will be based on a close collaboration of National Technical University of Athens and University of Lincoln with IBM, particularly relating the presented research to the IBM Watson Health initiative. The target will be generation of novel performance-aware and transparent systems for better diagnosis of neurodegenerative diseases like Parkinson’s, based on a combination of MRI and other images, epidemiological data, historical data of treatments and clinical data.

## Acknowledgment

The authors wish to thank the Department of Neurology of the Georgios Gennimatas General Hospital in Athens, Greece, and particularly Dr Georgios Tagaris, for the creation and provision of the main Parkinson’s dataset and for his collaboration in the evaluation of the results of the performed analysis.
