Ischemic stroke is a leading cause of acquired disability, dementia and mortality worldwide [gorelick2019global]. Several machine learning-based studies have been devoted to predicting the final stroke lesion [winzeck2018isles, nielsen2018prediction, yu2020use, Debs21], but very few have addressed clinical outcome [van2018predicting, ramos2020predicting]. Clinical outcome is usually measured by the modified Rankin scale (mRS), which grades the degree of disability in daily activities; it has become the most widely used clinical outcome measure in stroke clinical trials [mRS1, mRS2]. In the state of the art, the classification of patients according to the mRS has been treated with classical ML approaches (Random Forest) using clinical variables [van2018predicting] and radiological parameters (presence of leukoaraiosis, old infarctions, hyperdense vessel sign, and hemorrhagic transformation) [ramos2020predicting].
In this paper, we propose an approach which, in addition to the clinical metadata, takes the MRI images as input. This is achieved through a spatio-temporal encoding, an idea that has already proved its efficiency in previous works [giacalone2017, giacalone2018local]. However, unlike these previous works, the encoding is here specifically designed for deep learning architectures and is, to the best of our knowledge, presented for the first time in the context of stroke imaging. This spatio-temporal encoding is based on a convolutional neural network - long short-term memory (CNN-LSTM) architecture, which uses CNN layers for spatial feature extraction on the input MR images combined with LSTMs to support temporal sequence prediction. The CNN-LSTM architecture was originally developed for image or video description [venugopalan2016improving] and has very recently been proposed in medical imaging for the classification of cancer types [jang2018prediction, marentakis2021lung].
In the context of our study, the CNN-LSTM architecture is applied independently to each MRI modality available at admission, namely diffusion and perfusion MRI. More specifically, we encode five inputs: raw diffusion MRI, the apparent diffusion coefficient (ADC), and the parametric maps derived from perfusion MRI, i.e. time to maximum (Tmax), cerebral blood flow (CBF) and cerebral blood volume (CBV). In order to take advantage of ensemble learning (bias and variance reduction) [hatami2012], we then propose to fuse the classifiers constructed for each MR input with an ensemble voting approach. The originality of this fusion is that it is weighted using the clinical variables (not used as input to the CNN-LSTM architecture) to improve the classifiers’ performance. This is done by giving higher weights (reward) to the modules that agree with the clinical meta-data, and lower weights (penalty) to those that disagree. We evaluated our framework on a cohort of 119 patients with large intracranial artery occlusion treated by thrombectomy. Results show an mRS prediction accuracy of 74% and an AUC of 0.77. We also compared our solution against different baselines. Our solution improves performance and stability by proposing an encoding adapted to the spatio-temporal nature of the image data and by fusing image and clinical data with an ensemble approach at the very end of the machine learning pipeline.
II-A Data source and preprocessing
Patients were included from the HIBISCUS-STROKE cohort [Debs21], an ongoing monocentric observational cohort enrolling patients with an ischemic stroke due to a proximal intracranial artery occlusion treated by thrombectomy. In total, 119 patients, both male and female, of age (mean ± std), were analyzed. Inclusion criteria were: (1) patients with an anterior circulation stroke related to a proximal intracranial occlusion; (2) diffusion and perfusion MRI as baseline imaging; (3) patients treated by thrombectomy with or without intravenous thrombolysis. All patients gave their informed consent and the imaging protocol was approved by the regional ethics committee. All patients underwent the following protocol (IRB number: 00009118) at admission: diffusion-weighted imaging (DWI), dynamic susceptibility-contrast perfusion imaging (DSC-PWI) and a clinical evaluation including age and the National Institutes of Health Stroke Scale (NIHSS), which ranges from 0 to 42 (increasing scores indicate more severe neurological deficits) [NIHSS14]. Final clinical outcome was assessed at 3 months during a face-to-face follow-up visit using the mRS. The distribution of the final mRS scores and its associated binarization into poor and good outcomes are shown in Figure 1. In this paper, we used the binarized mRS for classifying patients’ outcome.
Parametric maps were extracted from the DSC-PWI by circular singular value decomposition of the tissue concentration curves (Olea Sphere, Olea Medical, La Ciotat, France): cerebral blood flow (CBF), cerebral blood volume (CBV) and time to maximum (Tmax). DSC-PWI parametric maps were coregistered within subjects to DWI using linear registration with ANTs (Avants et al., 2011) and all MRI slices were of size. The skull was removed for all patients using FSL (Smith et al., 2001). Finally, images were normalized between 0 and 1 to ensure inter-patient standardization.
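As an illustration of the last preprocessing step, the following minimal Python sketch rescales one volume to the range [0, 1]. It assumes min-max scaling computed over the whole volume; the text does not state whether the normalization was applied per slice, per volume or per modality, and the function name `min_max_normalize` and the toy intensities are hypothetical.

```python
def min_max_normalize(volume):
    """Rescale a volume (list of 2D slices, each a list of rows) to [0, 1].

    ASSUMPTION: min and max are taken over the whole volume; the paper only
    states that images were normalized between 0 and 1.
    """
    values = [v for sl in volume for row in sl for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant image: map everything to 0 to avoid a division by zero.
        return [[[0.0 for _ in row] for row in sl] for sl in volume]
    scale = hi - lo
    return [[[(v - lo) / scale for v in row] for row in sl] for sl in volume]


# Toy 2-slice volume with arbitrary intensities (not real MRI data).
toy_volume = [[[0, 50], [100, 150]], [[200, 250], [300, 400]]]
normalized = min_max_normalize(toy_volume)
```

After this step every voxel lies in [0, 1], with the darkest voxel of the volume mapped to 0 and the brightest to 1.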
II-B Proposed model and baselines
) for ADC, CBF, CBV, DWI and Tmax. Each module receives sequences of images representing the entire parenchyma moving up the vascular tree from its lower part to its upper part. To extract the deep features, and since the pre-trained CNNs expect three-channel (RGB) inputs, we used 3 consecutive slices as the three channels. The resulting feature vectors were then used to train an LSTM [LSTM97] for obtaining preliminary mRS predictions ().
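The grouping of consecutive slices into three-channel inputs can be sketched as follows. This is only an illustration under stated assumptions: the text says that 3 consecutive slices are used but not whether the groups overlap, so the `stride` parameter (default 1, i.e. overlapping groups) and the function name `slices_to_rgb_sequence` are hypothetical.

```python
def slices_to_rgb_sequence(slices, stride=1):
    """Group consecutive slices into 3-channel items, ordered bottom to top.

    `slices` is the list of 2D slices of one modality, lowest slice first.
    ASSUMPTION: stride=1 gives overlapping triplets, stride=3 disjoint ones;
    the paper does not specify which was used.
    """
    if len(slices) < 3:
        raise ValueError("need at least 3 slices to build one RGB item")
    return [tuple(slices[i:i + 3]) for i in range(0, len(slices) - 2, stride)]


# Slice identifiers stand in for the actual 2D images.
slice_ids = ["s0", "s1", "s2", "s3", "s4"]
sequence = slices_to_rgb_sequence(slice_ids)
```

Each item of `sequence` would then be passed through the pre-trained CNN, and the ordered list of resulting feature vectors forms the input sequence of the LSTM.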
The final mRS prediction () is computed as a weighted average of the preliminary probabilities, where the weights are calculated according to Algorithm 1. The idea behind weighting with clinical data (i.e. age and/or NIHSS) is to reward the preliminary probabilities that agree with the clinical data and penalize those that disagree. In other words, the extra information provided by the clinical data is used to increase the confidence in the preliminary probabilities, and therefore in the final mRS prediction. The binary patient outcome is then predicted by thresholding the final mRS prediction.
Algorithm 1 consists of the following steps. First, an optimal threshold is calculated for the image modules so as to maximize the AUCs (step 1). Then, the preliminary binary labels of the modules are calculated using this threshold (step 2). Depending on the output labels of the modules, a weight is assigned to each of them according to the clinical variable considered (step 3). Note that the clinical variables used here, i.e. age and NIHSS, are positively correlated with the final output. In subjects with higher age or higher NIHSS score, the weights () reward the modules with “poor” outputs and penalize the modules with “good” outputs, so that the final mRS prediction is pushed towards higher mRS values. These weights are then normalized so that they sum to 1 over the 5 modules (step 4) and combined with the preliminary probabilities to predict the final mRS (step 5). This prediction is finally thresholded to produce the predicted clinical outcome (step 6).
|Algorithm 1: Patient outcome prediction: fusion of the clinical meta-data and preliminary probabilities obtained from the image modules.|
|INPUT: normalized clinical meta-data , preliminary probabilities|
|OUTPUT: final mRS probability|
|1. Find the global threshold for that maximize|
|2. Obtain preliminary binary labels|
|3. If ==poor:|
|4. , where|
|6. If :|
|Final Patient Outcome ”good”|
|Final Patient Outcome ”poor”|
Table I reports two sub-tables which illustrate, for five patients each, the preliminary probabilities, the weights calculated by Algorithm 1 and their impact on the final prediction. We also give the decision obtained without these weights (w/o) and the gold standard (GS). This table shows how the final predicted outcome is corrected (shown in bold) by the proposed image-clinical data fusion algorithm, compared to the standard ensemble (fusion of image modules by averaging). For example (Table I, bottom panel), patient 023 has an NIHSS score of 8 and a “good” outcome (gold standard mRS2) given by the clinicians. The preliminary probabilities of the five CNN-LSTM modules are [0.68, 0.75, 0.74, 0.07, 0.24], which result in an average ensemble probability of 0.49. With the best threshold () of 0.40 obtained by cross-validation, the final patient output of the standard ensemble is “poor”. Given the low NIHSS of this patient, the proposed algorithm assigns the weights in such a way that the final ensemble label is corrected to “good”. In contrast, patient 046 has an NIHSS of 23 and a “poor” outcome (gold standard mRS3). The high NIHSS value helps the algorithm to push the final score towards “poor”, while the standard ensemble votes for a “good” outcome. Note that the proposed fusion algorithm uses only one clinical variable in an ensemble model, i.e. age or NIHSS. Age has a similar impact on the model’s final output, as increasing age is associated with an increasing likelihood of poor outcome (Table I, upper panel).
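The fusion steps 2 to 6 of Algorithm 1 and the patient 023 example can be sketched in Python as follows. This is a minimal sketch under explicit assumptions, not the authors' implementation: the exact weight formula is not recoverable from the text, so the rule used here (weight equal to the normalized clinical value c for modules voting “poor” and 1 - c for modules voting “good”) is hypothetical, as is the normalization of NIHSS by its maximum of 42. The function names are also hypothetical. The threshold of step 1 is taken as given.

```python
def average_ensemble(probs, threshold):
    """Standard ensemble: unweighted mean of the module probabilities."""
    mean_prob = sum(probs) / len(probs)
    return mean_prob, ("good" if mean_prob < threshold else "poor")


def fuse_with_clinical(probs, clinical, threshold):
    """Weight module probabilities with one clinical variable in [0, 1].

    probs: preliminary probabilities, one per image module.
    clinical: normalized clinical variable (higher = worse prognosis).
    threshold: decision threshold, assumed tuned beforehand (step 1).
    """
    if not 0.0 <= clinical <= 1.0:
        raise ValueError("clinical variable must be normalized to [0, 1]")
    # Step 2: preliminary binary labels of the modules.
    labels = ["poor" if p >= threshold else "good" for p in probs]
    # Step 3 (HYPOTHETICAL rule): a high clinical value rewards "poor"
    # modules, a low clinical value rewards "good" modules.
    raw = [clinical if lab == "poor" else 1.0 - clinical for lab in labels]
    # Step 4: normalize the weights so that they sum to 1.
    total = sum(raw)
    if total == 0:
        weights = [1.0 / len(probs)] * len(probs)  # fall back to averaging
    else:
        weights = [w / total for w in raw]
    # Step 5: weighted average of the preliminary probabilities.
    final_prob = sum(w * p for w, p in zip(weights, probs))
    # Step 6: threshold the final probability.
    outcome = "good" if final_prob < threshold else "poor"
    return final_prob, outcome, weights


# Patient 023: probabilities and threshold from the text, NIHSS = 8.
probs_023 = [0.68, 0.75, 0.74, 0.07, 0.24]
nihss_023 = 8 / 42  # ASSUMED normalization by the maximum NIHSS score
avg_prob, avg_outcome = average_ensemble(probs_023, threshold=0.40)
fused_prob, fused_outcome, weights = fuse_with_clinical(
    probs_023, nihss_023, threshold=0.40
)
```

With these assumed weights, the low NIHSS shifts most of the weight to the two modules voting “good”, so the fused probability falls below the threshold while the plain average stays above it, which reproduces the qualitative correction described for this patient.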
There are three main hyperparameters to be tuned in the proposed model. The first is the type of CNN used in the image encoding part: six popular CNN models (VGG16, VGG19, Xception, ResNet50, MobileNet and DenseNet) with pre-trained ImageNet weights were considered [VGG16, ResNet]. The second concerns the LSTM: both unidirectional and bidirectional LSTMs were investigated, and we saw no significant difference between them. The third is the threshold on the final mRS probability used to obtain the binary labels. To determine the optimal hyperparameters, 5-fold cross-validation with patient-level separation was applied, maximizing the AUC.
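Patient-level separation means that all images of a given patient fall in the same fold, so that no patient contributes to both training and validation. A minimal sketch of such a split is given below; the shuffling, the seed, the round-robin assignment and the function name `patient_level_folds` are assumptions made for illustration, since the text only specifies 5 folds with patient-level separation.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Return (train_ids, val_ids) pairs with no patient shared between them."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Round-robin assignment of the shuffled patients to the folds.
    folds = [unique_ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = set(folds[k])
        train = [pid for pid in unique_ids if pid not in val]
        splits.append((train, sorted(val)))
    return splits


# 119 hypothetical patient identifiers, matching the cohort size.
splits = patient_level_folds(range(119))
```

Each hyperparameter setting would then be scored by its mean validation AUC over the five splits.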
To compare our results, and because there is no published research on our dataset, we proposed three baseline models. The first baseline is a Random Forest (RF) classifier inspired by [ramos2020predicting]. The inputs to the RF are the following clinical data: baseline NIHSS, age, door-to-puncture time and Fazekas scale. The second baseline is inspired by [Debs21], where the input MR images are identical to those of our proposed model. It is an early-fusion 3D-CNN model (CNN), where the 5 MR images are concatenated and used as a single input. In order to represent the architecture, we use and where is a 3D convolution (is fully-connected. Therefore, the CNN architecture is ------.
Six measures are used to evaluate the performance of the models: classification accuracy (recognition rate), F1 score, sensitivity, specificity, mean absolute error (MAE) and the area under the curve (AUC). Ten independent runs with random seeds are performed, and the means and standard deviations of the measures are reported. For our binary problem with imbalanced classes (see Figure 1), some measures give more insight than others. For mRS prediction, we believe that AUC gives a more precise evaluation of the overall binary models; F1 score, sensitivity and specificity also provide useful information, as false positive and false negative errors are critical for the considered medical application. MAE is generally used to evaluate regression tasks and has previously been used to evaluate mRS prediction [maier2017isles, winzeck2018isles]. To compare our model to the baselines, we also report the p-values of two-sided Wilcoxon signed-rank tests. With a p-value below 0.05, we can reject the null hypothesis and conclude that the results predicted by two different models are significantly different at the 95% confidence level.
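For reference, the threshold-based measures and the AUC can be computed as in the following sketch. It assumes that the poor outcome is encoded as the positive class (1), which the text does not state explicitly; the function names and the toy labels and scores are hypothetical and are not results from this study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for binary labels.

    ASSUMPTION: 1 = poor outcome (positive class), 0 = good outcome.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 4 poor-outcome and 4 good-outcome cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.6, 0.4, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = binary_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

Unlike the other measures, the AUC does not depend on the decision threshold, which is one reason to favor it when classes are imbalanced.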
III Experimental Results and Discussions
We carried out experiments in PyTorch. For each module, an LSTM regressor is trained to estimate the preliminary mRS scores from the feature sequences. The LSTM architecture consists of a sequence of input layers, two LSTM layers, a fully-connected layer and a sigmoid layer, with the half-mean-squared-error loss. In order to obtain the best results, the Adam optimizer with learning rate of were tried with maximum number of epochs of with early stopping and batch size of . In our experiments, we explored different numbers of hidden nodes for the LSTM, and the best configuration, using ResNet18 features, was selected based on 5-fold cross-validation performance.
Table II reports the performance (mean ± std over 10 runs) of the proposed ensemble compared to its individual modules. As shown, DWI is the best performing module (also performing well compared to the ensemble) with AUC=0.71, MAE=0.38, F1 score=0.67, sensitivity=0.71, specificity=0.67, and accuracy=72%. Regarding the benefit of weighting the vote with clinical data, both age and NIHSS improved the ensemble performance, which supports the importance of integrating imaging and clinical meta-data and their complementary value for the prediction. It is also important to note that adding the NIHSS boosts the performance of our model more than adding age.
Tables III and IV report the performance (mean ± std over 10 runs) of the proposed ensemble (both with and without clinical meta-data) compared to the different baseline models. As shown, the AUC, MAE, F1 score, sensitivity, specificity and accuracy of the proposed model are significantly better than those of the baselines. Another advantage is the training time, which is considerably reduced for our model compared to the 3D-CNN model. Regarding the RF, which is a shallow model, the saving in learning time and in the number of parameters comes at the cost of a significant drop in all the performance measures. The low accuracy of the RF can have two explanations: 1) the RF inputs are hand-crafted features and it cannot accept MRI images as spatio-temporal data, and 2) because of the limited number of parameters and RF depths, it tends to overfit and shows a poor generalization ability (compared to the proposed model). It is also important to point out that the proposed ensemble outperforms the baselines both with and without clinical data (see Table III); therefore, the CNN-LSTM architecture is better suited to encoding the MRI images than the RF and the 3D-CNN.
From a statistical point of view, as shown in Table IV, the results of the proposed models are statistically different from those of the baselines. It is also interesting to notice that the results of the baselines are similar to each other, although they belong to different categories of algorithms and use different input data.
Comparing our results with the state of the art, several points are worth noting. First, our results offer a balanced prediction, meaning that the prediction of good outcome is as good as the prediction of poor outcome. Methods such as [ramos2020predicting] focused only on the poor outcome and suffer from a high false positive rate; in contrast, our model offers both high sensitivity and high specificity. Another interesting point is the importance of age versus NIHSS in the outcome prediction. Our finding is in line with independent research concluding that the NIHSS is likely to have a greater impact than age on mRS prediction [amitrano2016]. Lastly, as also shown in [ramos2020predicting], both age and NIHSS are important features for mRS prediction, and boost the prediction accuracy when combined with other predictor variables.
Why does NIHSS perform better than age in the model? Although both increasing age and higher NIHSS scores are associated with worse clinical outcome [andersen2011], elderly patients can benefit from therapy (reperfusion), especially patients with milder baseline neurological severity [Drouard2019]. In our dataset, the NIHSS score appears to add more prognostic information than age. This is likely related to the selection criteria of our patients (i.e. patients deemed eligible for thrombectomy), in whom the individual clinical severity has more prognostic importance than age. Elderly patients with significant pre-stroke comorbidities and disability are less likely to be treated by thrombectomy and thus to be included in the present study.
IV Conclusions and Future Work
A CNN-LSTM-based multimodal MR image fusion for predicting the final mRS is proposed. The proposed model offers the following advantages: (1) an efficient encoding that fits the spatio-temporal nature of MRI data, (2) an original fusion of MR images and clinical meta-data in a unified framework, and (3) a reduced computational cost, since the deep image features are extracted from off-the-shelf CNNs (previously trained on ImageNet [ImageNet15]) and only the LSTMs need to be trained, while offering an accuracy boost in comparison with the state of the art.
One of the main limitations of the proposed fusion model is that it can only use one clinical variable in an ensemble model, e.g. either age or NIHSS. One direction for future research is to extend the proposed weighting algorithm in order to combine multiple clinical variables with multiple image modules.
This work was supported by the RHU MARVELOUS (ANR-16-RHUS-0009) of Universite Claude Bernard Lyon-1 (UCBL) and by the RHU BOOSTER (ANR-18-RHUS-0001), within the program “Investissements d’Avenir” operated by the French National Research Agency (ANR).