Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder that progressively destroys memory and other cognitive domains of the brain. While effective therapeutic management of AD is still in development, it seems reasonable to expect their prospective outcomes to depend on the severity of baseline pathology. For this reason, substantial research efforts have been invested in the development of effective means of non-invasive diagnosis of AD at its earliest possible stages. In pursuit of the same objective, the present paper addresses the problem of the quantitative diagnosis of AD by means of Diffusion Magnetic Resonance Imaging (dMRI). In particular, the paper introduces the notion of a pathology specific imaging contrast (PSIC), which, in addition to supplying a valuable diagnostic score, can serve as a means of visual representation of the spatial extent of neurodegeneration. The values of PSIC are computed by a dedicated deep neural network (DNN), which has been specially adapted to the processing of dMRI signals. Once available, such values can be used for several important purposes, including stratification of study subjects. In particular, experiments confirm the DNN-based classification can outperform a wide range of alternative approaches in application to the basic problem of stratification of cognitively normal (CN) and AD subjects. Notwithstanding its preliminary nature, this result suggests a strong rationale for further extension and improvement of the explorative methodology described in this paper.
The world population is steadily ageing, and with advanced age comes a higher risk of dementia. At the present time, the dementia of Alzheimer’s type, or Alzheimer’s disease (AD), accounts for almost two-thirds of all prevalent cases of dementia in the elderly. AD is an irreversible, progressive disease that slowly destroys memory and other cognitive domains, eventually leaving the patient bedridden. The course of AD pathology is likely to span around two to three decades [Kumar:2015aa]. Unfortunately, by the time when the first symptoms emerge, it is usually too late to save the brain. For this reason, over the last two decades, considerable efforts have been directed towards finding effective means of the earliest possible diagnosis of AD [AD2018].
The current arsenal of methods for quantitative diagnosis of AD is impressively broad, ranging from advanced proteomics to state-of-the-art neuroimaging. In the latter case, particularly promising results have been demonstrated by both nuclear and Magnetic Resonance Imaging (MRI) [Frisoni:2010aa, Weiner:2012aa, AD2018].
Among various methods of MRI, diffusion MRI (dMRI) is exceptional for its unique ability to generate imaging contrast based on the microscopic (rather than macroscopic) properties of neurological tissue, which makes it singularly fit for the task of detecting the earliest signs of neurodegeneration [Basser:1994aa, Bihan:1986aa]. This ability of dMRI has been investigated in a number of studies [Acosta:2012aa, Amlien:2014aa, Schouten:2017aa, Zhang:2014aa, Mayo:2018aa], which predominantly focused on the problem of classification (aka stratification) of three groups of subjects, viz. cognitively normal (CN) subjects, AD subjects, and the subjects diagnosed with mild cognitive impairment (MCI). Note that the latter is broadly recognized as a prodromal condition that frequently heralds the onset of “full-blown” AD [Petersen:2016aa, Edmonds:2019aa].
In virtually all earlier studies on dMRI-based stratification of AD, MCI, and CN subjects, the protocol of choice has been Diffusion Tensor Imaging
Diffusion Tensor Imaging(DTI) [Bihan:1986aa, Basser:1994aa, Mori:2014aa]. The latter is known to provide an adequate characterization of diffusion dynamics in the white matter associated with non-crossing bundles of neural fibre tracts. Unfortunately, its dependence on Gaussian modelling curbs the ability of DTI to delineate more complex diffusion processes, e.g., within crossing fibres [Assaf:2008aa, Jones:2010aa]. It is thus no wonder that, even though many DTI metrics have demonstrated considerable sensitivity across multiple brain regions, the most consistent findings have been confined to the corpus callosum [Bozzali:2002aa, Acosta:2012aa, Lee:2013aa, Amlien:2014aa, Teipel:2014aa, Hawkins:2015aa, Mayo:2017aa, Schouten:2017aa, Mayo:2018aa]. At the same time, few DTI studies have been able to stratify CN, AD, and MCI subjects based on DTI analysis of the medial-temporal white matter, which is known to be abundant in both crossing and “kissing” fibre tracts. The problem here has obviously been in the intrinsic modelling limitations of DTI, which is rather discouraging out-turn in view of the known involvement of the above region in the early stages of neuropathological AD [Englund:1988aa].
The limitations of DTI have prompted the development of more advanced methods of dMRI, among which Neurite Orientation Dispersion and Density Imaging (NODDI) is considered to be one of the most comprehensive approaches to quantitative characterization of cerebral diffusion [Zhang:2012aa]. Naturally, several studies have investigated the applicability of NODDI to early diagnosis of AD. However, when trying to correlate the spatial distribution of NODDI metrics with histopathological evidence of AD, it was observed in [Colgan:2016aa] that NODDI offered somewhat marginal advantages over DTI. A similar conclusion was reached in [Slattery:2015aa], in which NODDI was used in application to diagnosis of young-onset AD.
Needless to say, DTI and NODDI are only two specific examples among a wide range of methods available under the umbrella of dMRI. However, regardless of their specific modelling assumptions, all these methods share a tendency to produce more accurate results at the expense of higher complexity of parametrization. At the same time, the use of parametric spaces of progressively higher dimensionality requires a proportional increase in the number of data points, which might not always be possible due to practical constraints. Thus, given the melange of available protocols and models, the question of which of the existing dMRI methods is “the best” for early diagnosis of AD appears to be rather non-trivial.
Before going any further, it is important to note that not all methods of dMRI are equally feasible from the viewpoint of clinical implementation. In particular, for practical reasons, the typical duration of a clinical dMRI examination rarely exceeds 15-20 mins. This constraint puts a strict upper bound on the amount of acquirable data, and, consequently, on the maximal order of numerically stable parametrization. For this reason, most of the studies on the dMRI-based diagnosis of AD have predominantly relied on DTI. Despite its numerous limitations, DTI remains “the method of choice” in many ongoing studies thanks to the minimality of its technical requirements and its time efficiency.
It is also worthwhile noting that, in quotidian exchanges, the term “DTI” is usually used in two different connections. In particular, it could refer to the Gaussian (i.e., 2-order tensor) diffusion model which lies in the foundation of DTI analysis. This model is described by a total of seven parameters111The DTI model is parametrized by a symmetric diffusion tensor having six independent entries. An additional parameter is required to account for the effect of normalization.
, which could be estimated based on a minimum of seven independent measurements. Although the practical number of diffusion measurements normally exceeds this low bound, their acquisition is still rare to require more than 15 mins of scanning time, which makes such imaging protocolsclinically feasible
. It is probably due to the association between the low dimensionality of DTI modelling and its dependence on a relatively small number of measurements that the term “DTI” has also been used to refer to diffusion data comprised of a comparatively small number of diffusion encodings. Hence, to avoid misconception, the terms DTIdata and DTI modelling need to be distinguished.
For the reasons explained above, virtually all clinical dMRI data could be characterized as “DTI”. From the practical point of view, therefore, it seems reasonable to restrict the scope of available dMRI models to only those which could be reliably fitted based on clinical DTI data. Unfortunately, this would have mainly left us with low-parametric models of a descriptive power similar to that of DTI. This out-turn reveals a critical methodological predicament, where the use of more advanced models is fraught with estimation artefacts, whereas suppressing such artefacts by including additional measurements is precluded by practical constraints. In such conditions, the use of low-parametric dMRI models becomes the only option, which, unfortunately, comes at the cost of reduced accuracy.
A particularly promising way to improve over the performance of low-parametric dMRI modelling is offered by data-driven inference and, in particular, by its recent realization in the theory of Deep Learning (DL) [LeCun:2015aa, Goodfellow:2016aa]. The modern methods of DL make it possible to discern subtle and complex dependencies in experimental data, which would have been impossible to describe in mechanistic terms. In this case, the actual (unknown) model is replaced by its phenomenological representation in terms of a Deep Neural Network (DNN) that is capable of “learning” to predict future outcomes based on past observations.
Although the idea of using DL in the imaging-based diagnosis of AD is not original, much of the work along this direction has mainly focused on structural MRI. The latter has proven instrumental in assessing the cerebral atrophy due to AD in both cortical and subcortical regions of the brain [Bhagwat:2019aa, Jain:2019aa, Wang:2019bb, Choi:2018aa, Lian:2018aa]
. In longitudinal studies, the substantial potential for prediction of ensuing cognitive decline in MCI and AD subjects has been demonstrated through the use of recurrent DNNs[Ghazi:2019aa, Lee:2019aa, Wang:2019aa, Aghili:2018aa, Lin:2018aa, Kawahara:2017aa], both with and without augmenting the structural MRI data with other sources of diagnostic information [Lee:2019aa, Zhou:2019aa, Forouzannezhad:2018aa, Lu:2018aa]. However, the application of DL to a dMRI-based diagnosis of AD remains a barely tapped in area of research, notwithstanding the abundant evidence of its successful use in other applications [Golkov:2016aa, Koppers:2016aa]. Accordingly, the primary goal of this work has been to leverage the combined power of dMRI and DL towards the early diagnosis of AD.
The key idea of the proposed methodology is built around the notion of pathology specific imaging contrast (PSIC). Similarly to other imaging-based markers, PSIC is a scalar score that indicates the presence of a suspected pathology (such as, e.g., AD). However, instead of characterising an entire dataset, the values of PSIC are computed at each spatial coordinate within a specified region-of-interest (ROI). In this way, PSIC can serve as a local indicator of the degree to which AD pathology affects various anatomical sites.
Once available, the values of PSIC can be converted into regional statistics, which can, in turn, be used for the purpose of subject classification. In the case of multiple ROIs, the performance of such classification could be further improved through exploring the interdependencies between different regional statistics, which are known to undergo sizeable changes in AD [Maryam:2017aa]. Furthermore, the local PSIC values can be displayed in superposition with anatomical images in the form of a contrast. In this way, PSIC also offers a means for visual analysis of the extent and severity of suspected pathology.
In this work, the values of PSIC have been generated by a dedicated DNN. Although the overall architecture of the DNN is based on a standard feed-forward configuration, its principal operations (such as, e.g., convolution, pooling, etc.) have been properly adjusted to the physical and analytical properties of diffusion signals. The adjustment made it possible to minimize the number of network parameters and, consequently, to reduce the amount of training data substantially. In particular, the results of this paper have been obtained based on only 40 dMRI datasets.
It is important to emphasize that the results reported here should be regarded as neither exhaustive nor final, but rather describing a novel concept which admits many possible extensions and improvements. For this reason, no attempts have been made to “push the limits” of the proposed method by testing its performance in deliberately difficult scenarios. Instead, the present experimental study has been deliberately limited to the basic task of stratification of CN and AD subjects, while focusing instead on a comparative performance between the proposed method and a range of existing alternatives.
The remainder of this paper is organized as follows. The principal idea of the proposed method is described in Section 2, while Section 3 introduces some necessary technical preliminaries, followed by a description of the proposed network design in Section 4. Subsequently, Section 5 provides details on the experimental setup, study data, and performance metrics used in this work, while experimental results are summarized in Section 6. Finally, Section 7 concludes the paper with a discussion of its main findings along with an outline of possible directions for future research.
2. Principal idea and data structure
The experimental study of this work has been based on the dMRI data available through the continuing efforts of the Alzheimer’s Disease Neuroimaging Initiative (ADNI)222For more details on various ADNI programs, visit adni.loni.usc.edu.. Over the last few years, the ADNI database has been extended to include dMRI data acquired by means of relatively advanced protocols. For the purposes of this paper, however, we used the dMRI data collected during an earlier phase of the ADNI study – known as ADNI-II. At that time, the data acquisition relied on more standard protocols which are more common in present-day clinical DTI. Thus, working with the earlier data should provide a more objective demonstration of the practical value of the proposed methodology.
In what follows, we consider a typical setting in which a DTI dataset consists of diffusion-encoded MRI volumes which encode the values of apparent diffusivity along different spatial orientations. Such data are usually acquired at a fixed level of diffusion sensitization controlled by the -value [Mori:2014aa]333For relatively large , this acquisition scheme is also known as High Angular Resolution Diffusion Imaging (HARDI) [Tuch:2004].. In particular, in the case of ADNI-II data, the number of diffusion encodings was set to , with s/mm.
Formally, DTI signals can be considered to be functions of both spatial and spherical coordinates. In practice, the spatial coordinate is sampled over a regular Cartesian grid which represents the (anatomical) image domain. On the other hand, the process of diffusion encoding restricts the signal values to points over the unit sphere in the diffusion -space. Thus, from a practical point of view, a DTI dataset can be viewed as a 4-D numerical array of size , with three “anatomical” and one “diffusion” dimension.
For some , let be a symmetric neighbourhood of consisting of all such that for some (small) radius 444For example, with , represents a standard 27-connected neighbourhood of .. The term “diffusion cube (DC) at ” will be used below to refer to a segment of DTI data spatially restricted to . Formally, the DC at is defined as
where denotes the signal value at position and diffusion-encoding orientation . Thus, similarly to the entire dataset, can be viewed as a 4-D array of size , with .
As the next step, we refer to Fig. 1 that illustrates various stages of computing the PSIC values in application to the stratification of CN and AD subjects. Suppose, for a given study subject (Subplot A); we are interested in making an inference based on DTI data confined to some prescribed ROI (Subplot B). Then, for each , such that (Subplot C), its associated DC (Subplot D) is passed to a dedicated DNN that yields a positive PSIC score
. For the sake of argument, at the moment, the network parametersare assumed to be set to their optimal values . In this case, by virtue of network design and training, is set up to scale proportionally to the likelihood of the neural tissue at to be affected by AD. Thus, in particular, the CN cases would be associated with relatively low values of PSIC (e.g., ), while, in the case of AD, the values would be in close vicinity of 1 (e.g., ).
Finally, the values of computed at all can also be used as an imaging contrast, which can be superimposed over a structural display of underlying anatomy. Such contrast is expected to be comparatively weak and scarce in CN (Subfigure F) while being intense and densely concentrated in AD (Subfigure G).
In practice, the optimal network parameters are estimated from training data. To this end, we hypothesize that, for a proper choice of , the effects of neurodegeneration are manifested in some hidden diffusion characteristics which persevere throughout the entire ROI. Moreover, such hidden characteristics are likely to be shared between different subjects within the same diagnostic group.
Under the above hypothesis, the training data can be formed by stockpiling the DCs from all voxels within a specified ROI across all subjects within the same diagnostic group. Specifically, in application to the binary problem of stratification of CN and AD subjects, such training data would consist of a set of DCs obtained from all available CN subjects, on the one hand, and a similar set of DCs coming from all available AD subjects, on the other hand.
More formally, let and be the number of subjects in the CN and AD groups, respectively. Also, let (resp., ) denote the diffusion datasets collected from the CN (resp., AD) subjects. In this case, the training data would consist of two subsets of samples defined as
along with their corresponding (target) labels and , respectively. A conceptual illustration of the process of formation of the training data is shown in Fig. 2.
In general, the above procedure can be applied to different ROIs , in which case one would have different training sets, and . Each such set could then be used to estimate the optimal parameters for its related independently.
3. Analytical tools and principal operations
3.1. Composite Convolution
At a conceptual level, the DNN proposed in this paper is based on a standard feed-forward configuration, consisting of a typical succession of convolutional operations interleaved with nonlinearities and resizing. However, when defining such operations, it would be amiss to disregard the physical properties of diffusion signals which offer a number of important advantages.
As discussed in Section 2, at its input, the network receives a single DC, which can be viewed as a 4-D numeric array of size . Alternatively, the DC could be considered to be an array of the discrete values of a diffusion signal measured over and spherical orientations . Formally, can be defined over the combined domain , where stands for the convex hull of . Thus, in order to incorporate convolution into the network design, one needs a proper definition of such operation for the signals defined over .
Generally speaking, the convolution of -domain signals is defined over the special Euclidean group , working with which might be excessively complicated from a computational point of view. In what follows, we introduce a particular definition of this operation, which offers a number of important practical advantages. In this way, the present results expand on the simpler case of spherical signals over , which have been successfully addressed in a number of recent studies [Poulenard:2019aa, Mukhaimar:2020aa].
The proposed method relies on a simplified interpretation of convolution, which is derived under the assumptions of separability and zonality. In particular, the former requires the convolution kernel to be a separable product of a spatially-dependent and a spherically-dependent component. The assumption of zonality, on the other hand, requires the spherical component to be a zonal function, which implies its invariance to azimuthal rotations. Such functions admit representation in terms of Legendre polynomials as given by
with known as the Legendre coefficient of degree .
The use of zonal kernels considerably simplifies the definition of spherical convolution. To see that, we first assume that, at each , can be closely approximated by its truncated Spherical Harmonic (SH) expansion of the form
with and being the -th order SH of degree and its corresponding expansion coefficient, respectively. Note that, due to the spherical symmetry of DTI signals, the summation in (2) is restricted to the even values of , in which case the total number of expansion coefficients is equal to , for a predefined maximal degree .
At any , the availability of the SH coefficients and the knowledge of the Legendre coefficients of the zonal kernel can be used to define the spherical convolution of and according to
for all and .
Next, we note that at each ,
is just a real vector of length. When computed over the entire ; all such vectors can be conveniently assembled into a 4-D array of size . Alternatively, can be viewed as a collection of volumetric “coefficient images” , where each comprises the spatially-dependent values of the SH coefficient of degree and order .
It is also convenient to split into subgroups of 3-D arrays according to the value of their degree , i.e., , with . In this case, the spherical convolution in (3) amounts to scaling each “coefficient image” in by the same scalar . While computationally efficient, however, this operation has the disadvantage of ignoring the spatial behaviour of input signals.
The coordinate-wise spherical convolution in (3) can be generalized into a composite spatial-spherical convolution as follows. For , let be a set of spatial-domain filters of size . Then, based on the assumption of separability, the operation of -band spatio-spherical filtering can be defined as
where stands for discrete convolution in the spatial domain. The definition in (4) suggests that, for each spherical order , the output is computed as a linear convolutional combination of the “input images” in with filters , …, , … , . In this way, the action of (4) extends across the entire , as opposed to the case of (3).
Extending the -band spatio-spherical filtering in (4) to all gives rise to a convolution-type operator that takes effect across both of the spherical and spatial domains. In what follows, this operation will be referred below to as composite convolution555For each , the computation of can be identified with the action of multi-channel convolution that is a standard computational routine included in many existing DL frameworks, such as, e.g., TensorFlow® (which has been used in this study).
can be identified with the action of multi-channel convolution that is a standard computational routine included in many existing DL frameworks, such as, e.g., TensorFlow® (which has been used in this study).. Formally, for a 4-D array of SH coefficients and a bank of filters , their composite convolution
is defined according to (4) for each and .
3.2. SH coefficients
The composite convolutional in (5) implies the availability of SH coefficients , which can be estimated directly from DTI data, as described, e.g., in [Descoteaux:2006]. The estimation is carried out independently on each of the signals comprising a given DC, resulting in an array of associated SH coefficients . The proposed DNN has been designed to work directly with such , which are assumed to be precomputed prior to network training. For the sake of notational simplicity, such input arrays will also be referred bellow to as “DCs”.
Finally, the question of setting should not be overlooked. Usually, is defined in accordance with a required -value. In particular, for s/mm (as used in ADNI-II), setting seems to be a conventional choice [Rohde:2004, Jones:2004aa, Alexander:2007]. Note that, in this case, the total number of SHs is equal to .
4. Proposed network architecture
4.1. Convolutional layers
The proposed DNN consists of several convolutional layers, each of which is parameterized by (convolutional) weights and a vector of (scalar) biases , with and . Given an input array , each such layer computes its output according to
where the plus sign is assumed to broadcast the values of so that every “coefficient image” in is summed with a different . Note that the affine operation in (6) is not exclusive to volumetric data, since it can be reduced to its 2-D and 1-D versions by merely replacing all in by their 2-D and 1-D counterparts, respectively.
The scope of activation functions currently used in DL is broad. In this work, all activation functions have had the form of a basicrectified linear unit
(ReLU)[LeCun:2015aa], with its input-output relation defined as given by
where the maximization is carried out independently for all values in the input array. Note that, over the last years, (7) has been modified in several ways (resulting in, e.g., leaky ReLu, noisy ReLu, and exponential ReLU). In our experiments, however, the basic definition in (7) was observed to work more than adequately.
Pooling is a standard method of data aggregation which can be used to suppress redundancies in input data, reduce the number of network parameters, and to minimize the risk of overfitting [Goodfellow:2016aa]. The special structure of DC samples , however, requires a proper adaptation of this operation. Specifically, in this work, the operation of pooling consisted of max pooling applied along a singular spatial dimension. Specifically, depending on the direction of maximization, the pooling can be defined in three possible ways as given by
for each and . Above, denotes the pooling operator, with its sub- and superscripts indicating the spatial dimensionality of and the direction of maximization, respectively.
It is important to emphasize, while in (8) depends on three spatial variables, the number of spatial dimensions of is reduced to two, and, consequently, each of the “coefficient images” composing is now an array of size . The resulting outputs could be subjected to 2-D pooling defined according to
for each and , with . Note that, in both variants above, the spatial dimension of has been reduced to one (i.e., each is now an array of size ).
Proceeding analogously, one can finally define the operation of 1-D pooling as
for each and . This type of pooling collapses the spatial dimension of , resulting in a length- vector of modified SH coefficients.
4.4. Network Architecture
Apart from using the purpose-built operations of convolution and pooling, the proposed DNN relies on a standard feed-forward architecture, which is depicted in Fig. 3 for the case of and (i.e., and ). The network consists of six neural layers which, in the figure, are shown separated by vertical dotted lines. The dimensions specified at the top of these lines indicate the size of the inputs received by each layer. Specifically, the input layer of the network receives a DC array of size , which is subjected to three different transformations of type (6), followed by ReLU activation and pooling along the three spatial coordinately. Thus, the second layer of the network receives a total of three arrays of size , each of which is processed in an analogous manner, using (9) and (6) (with a 2-D version of (5)).
Similar computations are applied to all of the six inputs of the third layer, which are subjected to transformation (6) (with a 1-D version of (5)), ReLU activation, and 1-D pooling according to (10). Subsequently, each of the resulting pairs of -length vectors is fused into a single -length vector by means of a fully connected layer (FCL). An additional FCL is used to transform its inputs into a single vector of length .
The final layer of the DNN consists of a linear transformation into the space of diagnostic outcomes, followed by thesoftmax operator. In the case of binary stratification (i.e., CN vs AD), the latter can be defined as [Goodfellow:2016aa]
with and being the two outputs of the linear transformation. In the context of this paper, the score , thus computed, is referred to as PSIC.
4.5. Network Optimization
The proposed DNN is parameterized by the values of convolution kernels, biases and weight matrices used across various components of the network. Altogether, these parameters can be gathered into a single vector , in which case, for a given , the DNN acts as a forward mapping , associating each DC with its PSIC score. In this case, it is a standard practice to estimate the optimal value of via solving a cross-entropy minimization problem of the form
where expectation is computed over the empirical distribution of .
Finally, let be a set of relevant ROIs. Also, for each , let denote the optimal values of its associated DNN parameters. Then, given an unlabelled set of DTI measurements, can be used to compute the PSIC values across the entire . Subsequently, the resulting scores can be summarized into regional statistics for the purpose of subject stratification, as described next.
5. Experimental study design
5.1. Study data
As stated earlier, the experimental study of this work has been focused on the problem of stratification of CN and AD subjects. To this end, 20 CN and 20 AD age-matched subjects (mean age 72.6 7.6 years) were selected from the DTI database of ADNI-II666For more details on the inclusion/exclusion criteria and other design parameters used by the ADNI-II study, please refer to adni.loni.usc.edu.. The dataset of each subject consisted of diffusion-encoded volumetric scans, acquired at s/mm, as well as five -volumes (i.e., scans acquired in the absence of diffusion sensitization). In addition, each dataset was supplemented with its associated - and -weighted scans required for structural image alignment and segmentation.
5.2. Data preprocessing
For each dataset, its -volumes were first co-registered and then merged into a single (average) -volume, which was subsequently used for normalization of the diffusion-encoded volumes. After that, the normalized data were subjected to preprocessing by means of a custom pipeline which had been designed based on the recommendations of [Glasser:2013aa]. The technical implementation of the pipeline relied on the NiPype framework of NiPy (nipype.readthedocs.io), which allowed convenient integration of many well-established tools of computational imaging, including FSL (https://fsl.fmrib.ox.ac.uk/), ANTs (picsl. upenn.edu/software/ants/), and SPM (www.fil.ion.ucl.ac.uk/spm/). As per usual, the principal purpose of the preprocessing pipeline has been to compensate for various imaging artefacts caused by the effects of subject motion, variable magnetic susceptibility, eddy currents, etc.
Additionally, FreeSurfer [FS:2012aa] (surfer.nmr.mgh.harvard.edu/) was used for the purpose of segmentation of grey and white matter, with their subsequent parcellation into smaller anatomical regions. Subsequently, the anatomic labels provided by FreeSurfer were used to partition the image domain of the DTI volumes into a set of predefined ROIs , as detailed below.
5.3. Definition of ROIs
The two left columns of Table 1 summarize the names of the anatomical regions computed by FreeSurfer, along with their acronyms. In addition, for each ROI, the leftmost columns of the table indicate the total number of DC samples available in its respective CN () and AD () subsets.
The ROIs have been chosen to correspond to different parts of white matter anatomy, which are known to be implicated in the pathogenesis of AD. It should be noted, however, that FreeSurfer lacks means of direct delineation of white matter. Instead, various elements of the latter are labelled based on their proximity to the nearby anatomical structures of grey matter. Thus, for instance, all voxels designated as white matter and located, e.g., within a 5 mm ribbon around the left superior frontal gyrus would be labelled as lh-SFW. This approach is obviously not without limitations, the rectification of which has been in the focus of ongoing research [Huo:2018aa]. Nevertheless, considering the comparative nature of the results reported here, the choice of a specific method of whole-brain segmentation does not seem to be particularly critical.
|1||Superior frontal wm (lh)||lh-SFW||17251||15935||33186|
|2||Superior frontal wm (rh)||rh-SFW||16768||16739||33507|
|3||Cerebellum wm (lh)||lh-CBW||16010||13686||29696|
|4||Cerebellum wm (rh)||rh-CBW||16510||14397||30907|
|5||Precentral wm (lh)||lh-PreCW||10028||8453||18481|
|6||Precentral wm (rh)||rh-PreCW||9539||8147||17686|
|7||Inferior parietal wm (lh)||lh-IPW||4442||3841||8283|
|8||Inferior parietal wm (rh)||rh-IPW||5452||4678||10130|
|9||Middle frontal wm, rostral (lh)||lh-MFWros||4397||4593||8990|
|10||Middle frontal wm, rostral (rh)||rh-MFWros||3809||4027||7836|
|11||Precuneus wm (lh)||lh-PCUNW||3046||2970||6016|
|12||Precuneus wm (rh)||rh-PCUNW||3540||3416||6956|
|13||Superior parietal wm (lh)||lh-SPW||3184||3010||6194|
|14||Superior parietal wm (rh)||rh-SPW||2989||3139||6128|
Using the methodology of Section 2, each of the regions in Table 1 was used to assemble its associated input samples and , corresponding to the target labels and , respectively. Subsequently, the process of network training was carried out for each independently, yielding a vector of optimal network parameters for each of the chosen ROIs.
5.4. Network training
Prior to training, for each , the samples in and were randomized and split into a training and a validation dataset (in proportion 4:1), which were subsequently used for the purpose of estimation of and final performance evaluation, respectively. In all cases, the optimization was performed by means of the adaptive moment estimation (Adam) algorithm [ADAM], with a fixed learning rate of
, batch size of 256 samples, and 200 epochs. The network training procedure was augmented with dropout regularization, with the value of keep probability set to 0.7 to alleviate the effects of overfitting.
The convergence of optimization has been monitored in terms of empirical prediction accuracy (PA). Given a set of labelled DC samples , with , the latter can be defined as
where , if and , otherwise.
In the context of network training, the PA criterion can be a useful indication of the convergence of stochastic gradient. On the other hand, when used on validation data, PA provides a more objective measure of both optimality and generalizability in view of its virtual independence of the effects of overfitting. Thus, the higher values of validation PA are typically indicative of more accurate performance, in general [Goodfellow:2016aa].
5.5. Binary classification
Once the training is complete, and the values of are available for each ROI , the optimal forward mappings can be used for stratification of unclassified subjects. In particular, an unlabelled dataset of DTI measurements can be converted into subsets of DC samples in accordance with the selected ROIs. Subsequently, for each , its respective DC samples can be converted into a set of PSIC scores , with . In this case, the subject could be stratified based on
with the decision made independently for each . In such case, the median PSIC score on the left side of (14) can be viewed as a cumulative regional marker of AD.
Furthermore, for the same subject, the PSIC values in can serve in the role of imaging contrast. Particularly, these values can be superimposed on structural MRI scans, thus offering the possibility of visual exploration of the spatial variability of PSIC values.
5.6. Reference methods
The performance of the proposed classifier has been compared against region-based classification based on multiple diffusion metrics and their combination. The selected metrics included four standard DTI measures,viz.: mean diffusivity (MD), fractional anisotropy (FA), as well as diffusion linearity (CL) and planarity [Mori:2014aa]. The list of metrics also included diffusion volume (DV), average sample diffusion (ASD), diffusion energy (DE), and the coefficient of variation of diffusion (CVD), which can provide more general characterization of diffusion dynamics, independent of DTI modelling [Aja-Fernandez:2018aa]. It goes without saying, the above list is by no means exclusive, and other useful characteristics of cerebral diffusion could have been included as well [Novikov:2018aa]. This being the case, however, besides covering both basic and advanced options, the selected metrics have had an important advantage of estimability based on relatively small datasets, as it is the case in the present study.
In this paper, the region-based classification was based on the likelihood ratio test [Rigollet:2011aa]. Specifically, for each metric , with , its values inside
were assumed to be independent realizations of a random variable, with its probability densities in the CN and AD subgroups given byand , respectively. Then, given an unlabelled dataset, the observed values of were used to stratify the subject according to
for some decision variable . For each and , had been set to its optimal value through maximizing the area under the receiver operating characteristic (ROC) curve of its corresponding classifier. In each case, the probability densities in (15
) were assumed to be Gaussian, with their means and variances estimated from the available data, following a standard leave-one-out cross-validation procedure[Konishi:1992aa].
An addition reference method was based on concurrent use of multiple diffusion metrics within the framework of logistic regression (LR). Similarly to the proposed DNN-based classifier, LR was used to map the input values of into a scalar classification score [Agresti:2002aa].
6. Experimental results
The main experimental results of this study are summarized in Tables 2 and 3, where the former shows the PA scores produced by different classifiers for different ROIs (with the proposed method of classification denoted by “DNN”). One can see that the DNN-based classifier considerably outperforms the reference approaches across all ROIs. There is, however, a slight decrease in DNN’s performance for with , which is likely due to the reduction in the total number of training samples available for these ROIs (as indicated in Table 1).
As evidenced by Table 2, among the reference methods, the LR classifier showed superior performance in less than half of the cases. In other cases, the accuracy of classification was observed to depend on a particular metric/ROI combination. Interestingly enough, in about half of such cases, the most basic DTI metrics (such as MD and FA) demonstrated better performance in comparison to more advanced options.
In addition, leave-one-out cross-validation [Konishi:1992aa] was used to compare the proposed and reference classifiers in terms of their respective ROC curves. Specifically, the optimality of classification was assessed in terms of the area-under-curve (AUC) criterion, whose values are shown in Table 3. Here, the higher values of AUC indicate a higher accuracy of classification, with the upper bound of 1 corresponding to the case of perfect classification. Thus, as demonstrated by the table, the proposed classifier offers a notably better solution to the problem of CN/AD stratification in comparison with the reference ones.
Finally, Fig. 4 and Fig. 5 exemplify the use of PSIC in its capacity as imaging contrast. It should be emphasized that, as opposed to diffusion metrics, the values of PSIC have no physiological interpretation. Instead, PSIC could be viewed as a pathology-specific risk indicator, whose higher values reflect a higher probability of the brain to be affected by the disease. This property of PSIC is evident in Subplots A of Fig. 4 and Fig. 5 which show the PSIC-enhanced structural scans of two AD subjects. In this case, the spatial distribution of PSIC values appears to be both intense and spatially pervasive. On the other hand, Subplots B of the same figures show results for two CN subjects. One can see that, in this case, the magnitude and spatial spread of PSIC appear to be much more “diluted”.
Due to the preliminary nature of the present paper, an in-detail exploration of the spatial characteristics of PSIC as well as its correlation with underlying brain anatomy and its possible etiological explanations are left beyond the scope of this report. However, in view of the empirical evidence provided by Fig. 4 and Fig. 5, it is reasonable to expect the proposed contrast mechanism to be “worth a thousand words”, both as an adjunct to establishing a confident diagnosis and as a means to facilitate post hoc discoveries.
7. Discussion and Conclusions
The main objective of this work has been to explore the potential of dMRI in application to early diagnosis of AD. As opposed to alternative means of medical imaging, the physics of dMRI happens to be uniquely suited for the detection and assessment of microscopic damage to brain tissue, which is known to precede the ensuing morphological changes in cortical grey matter due to AD [Braak:1991aa, Kumar:2015aa]. This is what endows dMRI with the unique ability to detect the presence of neurodegeneration at its earliest pathological stages.
The proposed method has been derived based on the concept of data-driven inference, which allows overcoming some critical limitations of model-based analysis of diffusion signals, especially in situations with relatively small DTI datasets (i.e., when ). The DNN-based classifier designed this way has been rendered independent of any mechanistic assumptions (and, thus, of their limitations). Instead, it has been optimized based on known diagnostic outcomes, resulting in the phenomenological mechanism that establishes a direct correspondence between DTI data and a quantitative measure of health risks due to AD.
The proposed DNN has been designed to process spatially localized segments of DTI data, i.e., 4-D “diffusion cubes” corresponding to different spatial coordinates. The local definition of input DC samples has served two important purposes. First, it gave the means to use the output scores as a spatially dependent “risk indicator”, which has been referred to as PSIC (in view of its purposive specificity to suspected pathology). Second, the same locality made it possible to collect tens of thousands of training samples from as few as only 40 DTI datasets. More importantly, the training data thus obtained have been sufficient for reliable optimization of the network parameters. Needless to say, such a result would have been impossible to attain, had, in accordance with established practice, each of the datasets been dealt with as a single observation. Thus, the proposed methodology can be particularly advantageous in situations when larger sets of training data are not available.
It goes without saying; the present work has barely scratched the surface of the possibilities offered through the combination of dMRI measurements and DL, with many important questions left yet to be addressed. In particular, although sufficient as a “proof-of-concept” in comparative studies, the problem of DTI-based stratification of CN and AD subjects is clearly of limited practical importance. Thus, to further attest to its viability, the proposed method needs to be extended to the classification of multiple diagnostic groups. Along this direction of research, a particularly enticing question would be to find correspondence between PSIC and different pathological stages of AD [Braak:1991aa]. In a similar vein, it would also be interesting to apply the proposed solution to the problem of classification of various subtypes of MCI [Petersen:2016aa].
Addressing more complex classification problems is likely to require a proportional increase in the complexity of the DNN. Thus, in the case of simple CN/AD stratification, it was neither necessary to involve more complex data nor to extend the network architecture beyond a basic feed-forward configuration. Moreover, the minimality of the employed DNN architecture (with the total number of trainable parameters equal to 50,376) has been instrumental in preventing overfitting during its training on the small set of 40 subjects. In more complicated clinical scenarios, however, the architecture could be extended through, e.g., processing the spatial neighbourhoods of different sizes, with a corresponding increase in the number of hidden layers. Another way to enhance the predictive power of the DNN would be to take advantage of dMRI data acquired at multiple -values (as has been done in the ADNI-III study). Needless to add, all such extensions would come with an increase in the number of network parameters, which should be accompanied by a pro-rata increase in the size of training data.
Finally, although working with SH coefficients is not the only way to “planarize” the operation of spherical convolution, it offers an important advantage in the context of between-site variability of classification scores. Specifically, due to discrepancies in the design and settings of MRI scanners, similar dMRI signals acquired at different sites are not uncommon to have notably different spectral characteristics (which is often the reason behind conflicting reports in clinical applications of dMRI). This problem has been addressed by a range of approaches, among which a particularly effective way to counteract the effects of between-site variability is lent by means of spectral data harmonization of the SH coefficients of DTI data, as detailed in [Mirzaalian:2018aa]. This approach suggests a straightforward way of combining the proposed method with the compatible means of normalization, which should render its performance consistent across different clinical sites. This expectation, however, still needs to be validated via proper experimental studies, which constitutes another objective of our future research.