1 Introduction
Longitudinal image data provides us with a wealth of information to study aging processes, brain development and disease progression. Such studies, for example ADNI jack2015magnetic and the Rotterdam study ikram2015rotterdam , involve analyzing thousands of images. In fact, even larger studies will be available in the near future. For example, the UK Biobank biobankWebsite targets on the order of 100,000 images once completed. With the number of images increasing, largescale image analysis typically resorts to using compute clusters for parallel processing. While this is, in principle, a viable solution, increasingly larger compute clusters will become necessary for such studies. Alternatively, more efficient algorithms can reduce computational requirements, which then facilitates computations on individual computers or much smaller compute clusters, interactive (e.g., clinical) applications, efficient algorithm development, and use of these efficient algorithms as components in more sophisticated analysis approaches (which may use them as part of iterative processes).
Image registration is a key task in medical image analysis to study deformations between images. Building on image registration approaches, image regression models ref:georegression ; hong2012simple ; hong2012metamorphic ; ref:nikhil ; fletcher2013geodesic ; hong2014time ; singh2014splines ; hong2014geodesic ; singh2015splines ; hong2016parametric have been developed to analyze deformation trends in longitudinal imaging studies. One such approach is geodesic regression (GR) ref:georegression ; ref:nikhil ; fletcher2013geodesic which (for images) build on the large displacement diffeomorphic metric mapping model (LDDMM) ref:lddmm
. In general, GR generalizes linear regression to Riemannian manifolds. When applied to longitudinal image data, it can compactly express spatial image transformations over time. However, the solution to the underlying optimization problem is computationally expensive. Hence, a simplified, approximate, GR approach has been proposed
ref:hong (SGR) to decouple the computation of the regression geodesic into pairwise image registrations. However, even such a simplified GR approach would require months of computation time on a single graphics processing unit (GPU) to process thousands of 3D image registrations for largescale imaging studies such as ADNI jack2015magnetic . The primary reason computational bottleneck for SGR are the optimization required to compute pairwise registrations.Recently, efficient approaches have been proposed for deformable image registration dinggang ; miao2016real ; nonrigid ; ref:yang2016 ; quicksilver ; zhang2017frequency . In particular, for LDDMM, which is the basis of GR approaches for images, registrations can be dramatically sped up, by either working with finitedimensional Lie algebras zhang2015finite and frequency diffeomorphisms zhang2017frequency , or by fast predictive image registration (FPIR) ref:yang2016 ; quicksilver . FPIR predicts the initial conditions (specifically, the initial momentum) of LDDMM, which fully characterize the geodesic and the spatial transformation using a learned a patchbased deep regression model. Because numerical optimization of standard LDDMM registration is replaced by a single prediction step, followed by optional correction steps quicksilver , FPIR is dramatically faster than optimizationbased LDDMM without compromising registration accuracy, as measured on several registration benchmarks klein2009 .
Besides FPIR, other predictive image registration approaches have been proposed. Dosovitskiy et al. flownet2015
use a convolutional neural network (CNN) to directly predict optical flow. Liu et al.
ref:liu use an encoderdecoder network to synthesize video frames. Schuster et al. schuster2016optical investigate strategies to improve optical flow prediction via a CNN. Cao et al. dinggang use a sampling strategy and CNN regression to directly learn the mapping from moving and target image pairs to the final deformation field. Miao et al. miao2016real use CNN regression for 2D/3D rigid registration. Sokooti et al.nonrigiduse CNNs to directly predict a 3D displacement vector field from input image pairs. An unsupervised approach for image registration was proposed by de Vos et al.
dlmia2017_unsupervised; here, the loss function is the image similarity measure between images themselves and a deformation is parameterized via a spatial transformer (which essentially amounts to a parameterized model of deformation in image registration) which generates the soughtfor displacement vector field. In
hong2017fast , Hong et al. employ a lowdimensional bandlimited representation of velocity fields in Fourier space zhang2015finite to speed up SGR ref:hong for populationbased image analysis.In this work, we will build on FPIR, as it is a desirable approach for brain image registration for the following reasons: First, FPIR predicts the initial momentum of LDDMM and therefore inherits the theoretical properties of LDDMM. Consequently, FPIR results in diffeomorphic transformations, even though predictions are computed in a patchbypatch manner; this can not be guaranteed by most other prediction methods. Second, patchwise prediction allows for training of the prediction models based on a very small number of images, containing a large number of patches. Third, by using a patchwise approach, even highresolution image volumes can be processed without running into memory issues on a GPU. Fourth, none of the existing predictive methods address longitudinal data. However, as both FPIR and SGR are based on LDDMM, they naturally integrate and hence result in our proposed fast predictive simple geodesic regression (FPSGR) approach.
Our contributions can be summarized as follows:
 Predictive geodesic regression

We use a fast predictive registration approach for image geodesic regression. Different to quicksilver , we specifically validate that our approach can indeed capture the frequently subtle deformation trends of longitudinal image data.
 Largescale dataset capability

Our predictive regression approach facilitates largescale image regression within a short amount of time on a single GPU, instead of requiring months of computation time for standard optimizationbased methods on a single computer, or on a compute cluster.
 Accuracy

We assess the accuracy of FPSGR by (1) studying linear models of atrophy scores (which are derived from the nonlinear SGR model) over time, as well as (2) correlations between atrophy scores and various diagnostic groups.
 Validation

We demonstrate the performance of FPSGR by analyzing images of the ADNI1 / ADNI2 datasets. For comparison, we also perform SGR using numerical optimization for the registrations, again on the complete ADNI1 / ADNI2 datasets.
This work is an extension of a recent conference paper ding2017 . In particular, all our experiments are now in 3D. We also added significantly more results to further explore the behavior of FPSGR in comparison to optimizationbased SGR. In particular, we added (a) an example to visualize the performance of regression models and associated quantitative comparisons (Sec. 4.1); (b) an analysis of local atrophy score correlated with clinical variables (Sec. 4.3); (c) correlations within diagnostic groups (Sec. 4.3); (d) a comparison with pairwise registration (Sec. 4.4); (e) and experiments on extrapolation on unseen data (Sec. 4.5, Sec. 4.6).
Organization. The remainder of this article is organized as follows: Sec. 2 describes FPSGR, Sec. 3 discusses the experimental setup and the training of the prediction models. In Sec. 4, we present experimental results for 3D MR brain images. The paper concludes with a summary and an outlook on future work.
2 Fast predictive simple geodesic regression
Our fast predictive simple geodesic regression approach is a combination of two methods: First, fast predictive image registration (FPIR) and, second, integration of FPIR with simple geodesic regression (SGR). Both FPIR and SGR are based on the shooting formulation of LDDMM ref:nikhil ; Fig. 1 illustrates our overall approach. The individual components are described in the following.
2.1 Lddmm
Shootingbased LDDMM and geodesic regression minimize
(1)  
where is the initial image (known for imagetoimage registration and to be determined for geodesic regression), is the initial momentum, is a smoothing operator that connects velocity and momentum as and with , is a weight, is the measured image at time (there will be only one such image for imagetoimage registration at ), and denotes the image similarity measure between and (for example or geodesic distance); is the dual of the negative JacobiLie bracket of vector fields: and denotes the Jacobian. The deformation of the source image can be computed by solving , where id denotes the identity map.
2.2 Fpir
Fast predictive image registration ref:yang2016 ; quicksilver aims at predicting the initial momentum, , between a source and a target image patchbypatch. Specifically, we use a deep encoderdecoder network to predict the patchwise momentum. As shown in Fig. 1, in 3D the inputs are two layers of image patches ( in 2D), where the two layers are from the source and target images respectively. Two patches are taken at the same position by two parallel encoders, which learn features independently. The output is the predicted initial momentum in the , and directions (obtained by numerical optimization on the training samples). Basically, the network is split into an encoder and a decoder part. An encoder consists of 2 blocks of three 3 3 3 convolutional layers with PReLU activations, followed by another 2 2
2 convolution+PReLU with a stride of two, serving as a “pooling” operation. The number of features in the first convolutional layer is 64 and increases to 128 in the second. In the
decoder, three parallel decoders share the same input generated from the encoder. Each decoder is the inverse of the encoder except for using 3D transposed convolution layers with a stride of two to perform “unpooling”, and no nonlinearity at the end. To speed up computations, we use patch pruning (i.e., for brain imaging, e.g., patches outside the brain are not predicted as the momentum is expected to be zero there) and a large pixel stride (e.g., 14 for patches) for the sliding window of the predicted patches.2.3 Correction network
We follow quicksilver and use a twostep approach to improve overall prediction accuracy. An additional correction step, i.e., a correction network, corrects the prediction of the initial prediction network. Fig. 2 illustrates this twostep approach graphically. The correction network has the same structure as the prediction network. Only the inputs and outputs differ. For the prediction network, the inputs are the original moving image and the original target image; output is the predicted initial momentum. For the correction network, the inputs are the original moving image and the warped target image; the output is the momentum difference.
3D Longitudinal Test Case Deformation Error [pixel]  
Data Percentile  0.3%  5%  25%  50%  75%  95%  99.7% 
Longitudinal Training  0.0156  0.0407  0.0761  0.1098  0.1559  0.2681  0.3238 
Crosssectional Training  0.0544  0.1424  0.2641  0.3723  0.5067  0.7502  0.8425 
3D Crosssectional Test Case Deformation Error [pixel]  
Data Percentile  0.3%  5%  25%  50%  75%  95%  99.7% 
Longitudinal Training  0.1694  0.4802  1.0765  1.7649  2.7630  4.8060  5.6826 
Crosssectional Training  0.1123  0.3024  0.5863  0.8737  1.2743  2.2659  2.7836 
2.4 Sgr
Determining the initial image, , and the initial momentum, , of Eq. (1) is computationally costly. However, in simple geodesic regression, the initial image is fixed to the first image of a subject’s longitudinal image set (leftmost part of Fig. 1). Furthermore, the similarity measure is chosen as the geodesic distance between images and approximated so that the geodesic regression problem can be solved by computing pairwise image registrations with respect to the first image. Specifically, we define the quadratic distance between two images and as
(2)  
Assume we have an image at time as well as two images and . Further, assume that the spatial transformation maps to and maps to . Then and . Furthermore, assume that maps to , i.e., . Then . Assuming that the geodesic between and is parameterized by the initial velocity and between and by the initial velocity and that we travel between and in time (and similarly for ) we can rewrite the map between and based on the exponential map as
(3) 
which can be approximated to first order as
(4) 
Hence, the squared geodesic distance between the two images can be approximated as
(5) 
where and . Hence, Eq. (1) becomes
(6) 
where is the soughtfor initial momentum of the regression geodesic and are the initial momenta corresponding to the geodesic connecting (the starting image of the geodesic) and the measurements in time . Differentiating Eq. (6) w.r.t. results in
(7) 
Thus,
(8) 
In practice, is very small and can thus be omitted. Furthermore, is obtained by either registering to in unit time or, as in our FPSGR approach, by predicting the momenta via FPIR, denoted as . As Equation 8 was derived assuming that images are transformed into each other in time instead of unit time, the obtained unittime predicted momenta correspond in fact to the approximation . Finally, we obtain the approximated optimal of the energy functional in Eq. (1), for a fixed as
(9) 
3 Setup / Training
All our experiments use 3D images from the ADNI dataset^{1}^{1}1Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a publicprivate partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). which consists of 6471 3D MR brain images of size voxels. In particular, ADNI1 contains 3479 images from 833 subjects and ADNI2 contains 2992 images from 823 subjects. Images belong to various types of diagnostic categories which we will discuss later.
We perform the following two types of studies:
 Registration

We assess our hypothesis that training FPIR on longitudinal data for longitudinal registrations is preferred over training using crosssectional data. Vice versa, training FPIR on crosssectional data for crosssectional registrations is preferred over training using longitudinal data. Comparisons are with respect to registration results obtained by numerical optimization (i.e., LDDMM).
 Regression

As for regression, we compare linear models fitted to atrophy scores over time, where scores are either obtained from FPSGR or optimizationbased SGR. Additionally, we study correlations between atrophy scores and diagnostic groups. Our hypothesis is that FPSGR is accurate enough to achieve comparable performance to optimizationbased SGR, at much lower computational cost, in both situations.
3.1 Training of the prediction models
We use a randomly selected set of 120 patients’ MRI images from ADNI for training the prediction models and to test the performance of FPIR. We use all of the ADNI data for our regression experiments.
Training for registration. We randomly selected 120 subjects from ADNI1 and registered their baseline images to their 24 month followup images. We used the first 100 subjects for training and the remaining 20 subjects for testing. For longitudinal training, we registered the baseline image of a subject to the subject’s 24month image. For crosssectional training, we registered a subject’s baseline image to another subject’s 24month image. To assess the performance of prediction models trained on these two types of paired data, we (1) perform the same type of registrations on the heldout 20 subjects and (2) compare the 2norm of the deformation error computed from the output of the prediction models with respect to the result obtained by numerical optimization of LDDMM^{2}^{2}2LDDMM results are generated using a vector momentum formulation: https://bitbucket.org/scicompanat/vectormomentum (which serves as the “groundtruth”). Table 1 shows the results which confirm our hypothesis that training the prediction model with longitudinal registration cases is preferred for longitudinal registration over training with crosssectional data. The deformation error is very small for longitudinal training / testing which provides strong evidence that the predictive method exhibits performance comparable to the (costly) optimizationbased LDDMM. Another interpretation of these results is, that it is beneficial to train a prediction model with deformations that are to be expected, i.e., relatively small deformations for longitudinal registrations and larger deformations for crosssectional registrations. As we are interested in longitudinal registrations for the ADNI data, we only train our 3D models using longitudinal registrations in the following.
Training for regression. The ADNI1 dataset contains 228 normal controls, 257 subjects with mild cognitive impairment (MCI), 149 with late mild cognitive impairment (LMCI), as well as 199 subjects suffering from Alzheimer’s disease (AD). We randomly picked roughly 1/6 of patients from each diagnostic category to form a set of 139 subjects for training in ADNI1, i.e., 38 normal controls, 43 MCI, 25 LMCI, as well as 33 AD subjects; this results in 139 subjects overall. The baseline images of each subject were registered to all the later timepoints within the same subject. To maintain the diagnostic ratio, we picked (out of all registrations) 45 registrations from the normal group, 50 registrations from the MCI group, 30 registrations from the LMCI group, and 40 registrations from the AD group, resulting in 165 longitudinal registration cases for training. The same strategy was applied to ADNI2. In detail, ADNI2 contains 200 normal controls, 111 subjects with significant memory complaint (SMC), 182 subjects with early mild cognitive impairment (EMCI), 175 with late mild cognitive impairment (LMCI), and 155 subjects with Alzheimer’s disease (AD). We picked 150 subjects and 140 longitudinal registrations, consisting of 35 registrations from the control group, 20 registrations from the SMC group, 30 registrations from the EMCI group, 30 registrations from the LMCI group, and 25 registrations from the AD group. Note that there are fewer registrations than subjects (140 vs. 150) in this setup, as our priority is to maintain the overall diagnostic ratio.
For both, ADNI1 and ADNI2, the remaining 5/6 of the data is used for testing. We trained four prediction models and their four corresponding correction models, leading to eight prediction models in total, listed in Table 2. We also note that the training sets within ADNI1 and ADNI2, resp., were not overlapping.
ADNI1 Pred1  Model v1 (no corr.) 

ADNI1 Pred+Corr1  Model v1 +1x corr. step 
ADNI1 Pred2  Model v2 (no corr.) 
ADNI1 Pred+Corr2  Model v2 +1x corr. step 
ADNI2 Pred1  Model v1 (no corr.) 
ADNI2 Pred+Corr1  Model v1 +1x corr. step 
ADNI2 Pred2  Model v2 (no corr.) 
ADNI2 Pred+Corr2  Model v2 +1x corr. step 
3.2 Parameter selection
3.3 Efficiency
Once trained, the prediction models allow fast computations of registrations. We use a TITAN X (Pascal) GPU and PyTorch^{3}^{3}3http://pytorch.org for our implementation of FPIR. For the 3D ADNI1 dataset ( MR images), FPSGR took about one day to predict 2646 pairwise registrations (i.e., 25 [s]/prediction) and to compute the regression result. Optimizationbased LDDMM^{4}^{4}4Here, we used 300 fixed iterations for each registration. 300 iterations can guarantee almost all the results converge. Note that the optimizationbased LDDMM also uses a GPU implementation. would require 40 days of runtime. Runtime for FPIR on ADNI2 is identical to ADNI1 as the images have the same spatial dimension.
Compared to thestateofart fast geodesic regression model hong2017fast , FPSGR is also at least twice as fast. The model in hong2017fast achieves times speedup compared with SGR ref:hong for the same setting (parallel computing with the same number of cores). In our case, we achieve a more than 40 times speedup compared with SGR for the same setting (a single GPU).
4 Experimental results for 3D ADNI data
For our experiments, we created 10 different (dataset, registration approach) combinations, each combination specifically designed to assess certain properties of our proposed strategy. These combinations are described next.

All subjects from the ADNI1 dataset in combination with optimizationbased LDDMM.

Two subgroups of ADNI1 (i.e., different training data portions) in combination with FPSGR without a correction network.

The same two subgroups as in 2), but in combination with FPSGR with a correction network.

The same five groups of 13, but for ADNI2.
Our general hypothesis is that the prediction models (for ADNI1/2) show similar performance to optimizationbased LDDMM and that using the correction network for the predictions improves results. To assess differences, we compare differences in deformations. Specifically, for every deformation produced by the different approaches, we compute its Jacobian determinant (JD). The JDs are then warped to a common coordinate system for the entire ADNI dataset using existing nonlinear deformations from GMF_ISBI_matching ; GMF_ISBI_optimization . Each such spatially normalized JD is then averaged within a region where the rate of atrophy is significantly associated with Alzheimer’s disease (AD), i.e., within a statistical region of interest (statROI) (see Fig. 3). Specifically, we quantify atrophy as
(10) 
where denotes the determinant and the cardinality/size of a set; is an area in the temporal lobes which was determined in prior studies GMF_ISBI_matching ; GMF_ISBI_optimization
to be significantly associated with accelerated atrophy in Alzheimer’s disease. The resulting scalar value is an estimate of the relative volume change experienced by that region between the baseline and a followup image. Hence, its sign is positive when the region has lost volume over time and is negative if the region has gained volume over time.
We limited our experiments to the applications in Xue_ADNI1 ; Xue_ADNI2 , wherein nonlinear registration/regression is used to quantify atrophy within regions known to be associated to varying degrees with AD (), mild cognitive impairment (MCI) () (including LMCI^{5}^{5}5We combine MCI and LMCI mainly because (a) the diagnostic changes available on the IDA website (https://ida.loni.usc.edu/login.jsp) only provide these three diagnostic groups; (b) to be consistent with the experiments conducted by Hua et al. Xue_ADNI1
, where only Normal, MCI and AD were used as labels to classify
ADNI1. Hereafter, in all discussions of ADNI1, MCI is a combination of MCI and LMCI of ADNI1), and normal ageing (NC: normal control) () in an elderly population. These are the diagnostic groups for ADNI1. For ADNI2, there are also 3 diagnostic categories^{6}^{6}6Similar to ADNI1, a detailed diagnosis for ADNI2 is only available for the baseline images; MR images at later time points are only labeled as NC, MCI, and AD. Thus, we combine SMC and NC, as well as EMCI and LMCI to be consistent with the diagnostic changes in the ADNI Diagnosis Summary available on the IDA website. Hereafter, in all discussions of ADNI2, NC includes NC and SMC and MCI includes EMCI and LMCI.: normal ageing () (including SMC), mild cognitive impairment (including EMCI and LMCI) (), and AD ().Specifically, we investigate the following six questions:

Can the prediction models for regression qualitatively capture similar trends to the regression model obtained by numerical optimization? (Sec. 4.1)

Are atrophy measurements derived from FPSGR biased to overestimate or underestimate volume changes? (Sec. 4.2)

Are FPSGR atrophy measurements consistent with those derived from deformations via numerical optimization (LDDMM) which produced the training dataset? (Sec. 4.3)

Are regression results more stable and hence capture trends better than pairwise registrations? (Sec. 4.4)

Is the predictive power of the regression models strong enough to forecast deformations for unseen future timepoints (Sec. 4.5)

Do the prediction results capture expected trends in deformation? (Sec. 4.6)
If these experiments resolve favorably, then the substantially improved computational efficiency of FPSGR justifies its use for largescale imaging studies. Tables 3 and 4 show the distributions of the prediction cases per timepoint and the diagnostic groups in ADNI1 and ADNI2, respectively.
4.1 Regression results
Table 1 indicates that FPIR can predict deformation fields similar to the ones obtained using optimizationbased LDDMM, even for the subtle changes seen in longitudinal imaging data. However, it remains to be seen how a predictive model performs for image regression. Fig. 4 shows an exemplary regression result. In this specific case, large changes can be observed around the ventricles. To illustrate differences between the methods, Fig. 4 shows regression results based on optimizationbased LDDMM, for FPSGR without a correction network, and for FPSGR with a correction network. All three methods successfully capture the expanding ventricles and generally capture the image changes. Both FPSGR methods show results that are highly similar to SGR using optimizationbased LDDMM. Hence, FPSGR is useful for longitudinal image regression. To further quantify the regression accuracy, we compute the overlay error between measured images and the images on the geodesic as
(11) 
where is the brain area, is the regressed image at time and is the measured image at time . Table 5 shows the overlay error for the population of 100 subjects which includes all diagnostic groups in ADNI1. Both FPSGR methods obtain results comparable with optimizationbased LDDMM. This justifies the use of the proposed methods. The correction network generally increases the prediction accuracy over using the prediction network only.
4.2 Bias
Estimates of atrophy are susceptible to bias Yushkevich_Bias . To quantitatively assess this potential bias, we separately considered different diagnostic groups. Specifically, we considered six diagnostic change groups in our experiments: (1) NC for all time points (NCNC), (2) starting with NC and changing to MCI or AD at a later time point (NCMCI), (3) MCI for all time points (MCIMCI), (4) starting with MCI and reversing to NC at later time points (MCINC), (5) starting with MCI and changing to AD at later time points (MCIAD), and (6) AD for all the time points (ADAD)^{7}^{7}7In ADNI1/ADNI2, there are two patients who show a reversion from AD to MCI. We omitted these cases in our experiment because the number of such cases is too small.. In particular, we follow Xue_ADNI1
and fit a straight line (i.e., linear regression) through all atrophy measurements over time, conditioned on each diagnostic change category. The intercept term is an estimate of the atrophy one would measure when registering two scans acquired on the same day; hence it should be near zero and its 95% confidence interval should contain zero. Quantitatively, Table
6 lists the slopes, intercepts, and 95% confidence intervals for all ten groups of ADNI1 and ADNI2, respectively. LDDMM1 and LDDMM2 denote the optimizationbased results split into the same testing groups used for Pred1 and Pred2 to allow for a direct comparison. All of the results show intercepts that are near zero relative to the range of changes observed and all prediction intercept confidence intervals contain zero. For all diagnostic change groups the prediction and prediction+correction models exhibit more stable results than the optimizationbased LDDMM method as indicated by the tighter confidence intervals. Furthermore, all slopes are positive, indicating average volume loss over time. This is consistent with expectations for an aging and neurodegenerative population. The slopes capture increasing atrophy with disease severity. In ADNI1/ADNI2, we expect and all six experimental groups (i.e. LDDMM1, Pred1, Pred+Corr1, LDDMM2, Pred2, and Pred+Corr2) are generally consistent with this expectation. Exceptions happen in ADNI2 for the NCMCI and MCINC cases. As the number of subjects involved is relatively small, i.e., fewer than 20, compared with the other cases (roughly 100), one may speculate that this observation is caused by the limited number of data points for NCMCI and MCINC as shown in the #data column of Table 6. However, the behavior within each starting diagnostic category, is consistent, i.e., for NC and for MCI . Hence, all six groups’ slope results in ADNI1/ADNI2 are generally consistent with our expectation (and also consistent with results in Xue_ADNI1 ). The slope estimated from the prediction+correction results is larger than the slope estimated from the prediction model results and closer to the slope obtained from the optimizationbased LDDMM results. This indicates that the correction network can improve prediction accuracy. Fig. 5 shows linear regression results for the estimated atrophy scores in ADNI1/2for the Pred+Corr1 model. Both the data points themselves (i.e., the atrophy scores), as well as kernel density estimates for the linear trends for each subject are shown. These results are consistent with the results of Table
6 discussed above. We conclude that (1) neither LDDMM optimization nor FPSGR produced deformations with significant bias to overestimate or underestimate volume change; (2) a linear model of atrophy scores generated by FPSGR can capture intrinsic volume change (i.e., slope) among different diagnostic change groups. Note that our LDDMM optimization results and the prediction results show the same trends. Further, they are directly comparable as the results are based on the same test images (also for the atrophy measurements).4.3 Atrophy
Atrophy estimates have also been shown to correlate with clinical variables GMF_ISBI_matching . To quantify this effect, we computed the Spearman rankorder correlation^{8}^{8}8We used Spearman rankorder correlation instead of Pearson correlation, because the diagnostic groups imply an ordering only. between our atrophy estimates and the diagnostic groups (NC = 0, MCI = 1, AD = 2), and also between our atrophy estimates and the scores of the minimental state exam (MMSE). We applied the BenjaminiHochberg procedure FDR for all the correlation results in this paper to reduce the false discovery rate for multiple comparisons. The overall false discovery rate was set to be 0.01, which resulted in an effective significance level of 0.0093. Detailed results can be found in Table 7 and Fig. 6, respectively. In detail, for ADNI1/2, we randomly selected 200^{9}^{9}9In ADNI1 48 month, the number was 60 because there was not enough data; ADNI2 36 month was omitted due to lack of data. cases from each diagnostic category at each month and calculated the Spearman rankorder correlation. Fig. 6 shows the results for 50 repetitions. We observe median correlations for all four prediction models in the range of to for MMSE and to for diagnostic category. The correlations for all four prediction+correction models were in the range of to for MMSE and to for diagnostic category. Previous studies reported Pearson correlations between comparable atrophy estimates and clinical variables as high as for MMSE and for diagnostic category for 100 subjectsGMF_ISBI_matching ; GMF_ISBI_optimization .Our two optimizationbased LDDMM results achieve median correlations ranging from to for MMSE and to for diagnostic category, which is very similar to the predction+correlation models. In general, the correction+prediction FPSGR models outperform the models using only the prediction network. Further, using the correction network, FPSGR achieved comparable and sometimes even slightly better performance compared to the optimizationbased LDDMM SGR method, see Table 7
for additional quantitative results. Specifically, FPSGR using the prediction+correction network performs best in 8 out of 18 comparisons for MMSE and in 12 out of 20 comparisons for diagnostic group. In the cases where FPSGR with prediction+correction network did not perform best its difference to the best method was generally very small. In general FPSGR using the correction network performs better than FPSGR without the correction network. To check for statistical differences in the performance of FPSGR, we use a paired ttest. Table
8 shows the resulting pvalues for the three methods: optimizationbased SGR (i.e., LDDMM), FPSGR without correction network (i.e., Pred) and FPSGR with correction network (i.e., Pred+Corr). In both correlation with MMSE and DX, FPSGR with correction network shows significantly better performance than LDDMM and FPSGR without correction network, which justifies the use of the FPSGR method. In summary, FPSGR captures correlations between atrophy and clinical measures well.To further explore the correlations of atrophy with MMSE scores, we visualize them separated by diagnostic group where diagnosis did not change (i.e., NCNC, MCIMCI, ADAD) in Fig. 7. For the ADNI1 dataset, we observe (as expected) very low correlations for the normal diagnostic group (with no clear trend), and much stronger correlations for the MCI and AD groups. MCI and AD also exhibit increasingly stronger correlations with time. In case of ADNI2, the MCI group shows modest correlations, which remain consistent across time. Correlations are relatively low for the normal groups. The AD groups show increasingly strong correlations over time. In contrast to ADNI1, ADNI2 focuses mainly on earlier stages of the diagnostic groups Xue_ADNI2 . Hence, the deformations in ADNI2 are generally smaller than in ADNI1. This may explain why the NC and MCI diagnostic groups show consistent correlation values over time (instead of stronger correlations as for AD in ADNI2 or the MCI and AD groups in ADNI1).
To address the question how statROI specific measures behave over time, we explore how atrophy locally (i.e., voxelbyvoxel) correlates with MMSE. The local atrophy is defined as
I.e., each voxel in a statROI has an associated atrophy score. Fig. 8 shows kernel density estimates of the highest 10% local correlations in a violin plot. For the ADNI1 MCI and AD groups, a clear shift toward stronger correlations can be observed over time, consistent with the boxplots of Fig. 7. This indicates the progression of the disease. Correlations for the normal groups in ADNI 1/2 are mostly centered around a modest correlation (as expected). In ADNI2, only the AD diagnostic group shows a shift towards stronger correlations over time. All the other diagnostic groups show a relatively consistent distribution over time. This is also consistent with Fig. 7.
4.4 Justification of SGR
For simple geodesic regression to be a useful model it should outperform pairwise image registration. The main conceptual difference is that the regression model will recover an average trend
based on multiple image timepoints, i.e., the resulting regression geodesic will be a compromise between all the measurements. In contrast, for pairwise image registration (which can be seen as a trivial case of geodesic regression with two images only) the deformation will in general be able to match the target image well. However, just as in linear regression, this may accentuate the effects of noise. In both setups, images can be interpolated or extrapolated based on the estimated geodesic.
Tables 9 and 10 justify the use of SGR. Specifically, Table 9 shows linear regression results of atrophy measures over time as obtained via SGR (i.e., using an SGR fit over all timepoints followed by atrophy computations based on the deformations of the regression geodesic) compared with atrophy measures obtained by pairwise registration. For both the ADNI1 and the ADNI2 datasets, SGR outperforms the pairwise registration approach in two aspects: (1) the estimated intercept of SGR is generally closer to zero than for the pairwise method and the intercept 95% confidence interval is narrower; (2) 11 out of 24 of the 95% confidence intervals of the pairwise methods show bias to either overestimate or underestimate volume change, while none of the SGR results show such significant bias. Table 10 compares the correlations between atrophy and clinical measures (MMSE and diagnostic category) of SGR and the pairwise approach. SGR performs better than the pairwise approach in 13 out of 18 cases for MMSE and in 15 out of 20 cases for the diagnostic category. Furthermore, when the pairwise method is better than SGR, the difference is much smaller compared to the differences observed for the cases where SGR is better than the pairwise method. Also note that the pairwise method shows better performance in later months compared to earlier months. This could, for example, be because the deformations are larger for later timepoints and hence the registration result becomes more stable, or because SGR is also heavily influenced by the last timepoint. To address the above observation, we used a ShapiroWilk normality test and a Wilcoxon signedrank test. From Table 11 we see that we can reject the nullhypothesis of normality and hence, a paired test is not appropriate. As an alternative, we conducted a Wilcoxon signedrank test to compare the SGR prediction model and the pairwise prediction model. Table 11 shows that the SGR prediction model is statistically significantly better than the pairwise prediction model. Based on the above points, we conclude that SGR is more stable over time than the pairwise method and in general also results in stronger correlations.
4.5 Forecasting
Another interesting question for SGR and geodesic regression in general is the suitability of the model for the data. To address this question, we evaluate if SGR can forecast unseen future timepoints. Specifically we consider this question in two different scenarios:

Extrapolateclinical: Can we extrapolate the SGR results into the future (to timepoints that do not exist in the ADNI image dataset, but for the clinical data) while still obtaining strong correlations.

Extrapolateimage: How well can correlations between atrophy and clinical measures be predicted for timepoints when we do or do not use image data at that very timepoint. We artificially leave out image measurements so that we can compare prediction results to results when we have the image measurement.
For both scenarios we use two different forecasting approaches. In the first approach (Forecast) we simply compute SGR results with the available image timepoints and then extrapolate using the resulting regression geodesic to the desired timepoint in the future. In the second approach (Replace
), we artificially impute the missing image timepoints by simply replacing them by the image at the closest measured timepoint. For example, if we have images at 6, 12, and 18 month, but we want to forecast at 24 month, we use the 18 month image as the imputed 24 month image and then perform SGR on the 6, 12, 18, and the imputed 24 month images. We then obtain the deformation at 24 months from the SGR result.
ad Q1. Table 12 shows correlations between atrophy and the clinical measures for the Forecast results for 60 month, 72 month and 84 month.The resulting correlations of atrophy with diagnostic category are all above 0.3 (or below 0.3). Furthermore, the Forecast correlations show a downward trend with respect to time, which means that the prediction of “faraway“ points is not as accurate as for the “near” future. On the other hand, SGR using the 6 month to 48 month time points results in correlation around 0.5 for MMSE and 0.5 for DX on average. Hence the correlation with the diagnostic category is consistent for that of 60 months. In other words, using 6 month to 48 month data, our prediction model can predict accurately up to 60 month. Our prediction+correction network performs as well as and even slightly better than SGR using optimizationbased LDDMM. Fig. 10 shows that these forecasting results capture the trends of the changes in the temporal lobes near the hippocampus and changes in the ventricles.
ad Q2. Table 13 and Fig. 9 show Forecast and Replace results for correlations between atrophy and clinical measures in comparison to using all images. Specifically, for the Forecast and Replace results we did not use the available images at 36 and 48 month so we could compare against the results obtained when using these images. If FPSGR is a good model, it should results in correlation results as close to the correlation results using all images as possible. The Forecast correlations are only slightly weaker (0.02 to 0.05 lower) than the original correlations using all images illustrating that FPSGR can approximately forecast future changes.
The overall correlations in Table 13 show that the Replace group performs better than the Forecast group. In particular, we are also interested in the prediction of MCI converters, namely, MCI to NC, MCI to MCI, and MCI to AD. The boxplots in Fig. 9 show the correlations for such predictions. The Replace group in Fig. 9 show relatively worse correlation performance than the Forecast group in ADNI1 Pred1 and consistent performance in ADNI1 Pred2. Hence SGR on a longitudinal image data can achieves good forecasting result for MCI converters.
Thus, both Extrapolateclinical and Extrapolateimage experiments justify the use of FPSGR in predicting near future longitudinal trends especially for MCI converters.
4.6 Jacobian Determinant (JD)
The average JD images qualitatively agree with prior results Xue_ADNI1 ; Xue_ADNI2 : severity of volume change increases with severity of diagnosis and time. Change is most substantial in the temporal lobes near the hippocampus (see Fig. 10). In Fig. 10, 6 month to 48 month are existing data points; 60 month to 84 month are forecast results. Blue indicates volume loss. Red indicates expansion. Results are consistent with expectations: volume loss increases with time and severity of diagnosis in temporal lobes; volume expansion increases with respect to time and severity of diagnosis around the ventricles / cerebrospinal fluid. The forecast results capture visually sensible volume loss or expansion over time, qualitatively illustrating the performance of our method.
5 Conclusion and future work
In this work, we proposed a fast approach for geodesic regression (FPSGR) to study longitudinal image data. FPSGR incorporates the recently proposed FPIR ref:yang2016 ; quicksilver into the SGR ref:hong
framework, thus leading to a computationally efficient solution to geodesic regression. Since FPSGR replaces the computationally intensive intermediate step of computing pairwise initial momenta via a deeplearning prediction method, it is orders of magnitude faster than existing approaches
ref:hong ; hong2017fast , without compromising accuracy. Consequently, FPSGR facilitates the analysis of largescale imaging studies. Experiments on the ADNI1/ADNI2 datasets demonstrate that FPSGR captures expected atrophy trends of normal aging, MCI and AD. It further (1) exhibits negligible bias towards volume changes within statROIs, (2) shows high correlations with clinical variables (MMSE and diagnosis) and (3) produces consistent forecasting results on unseen data.In future work, it will be be interesting to explore FPSGR for the task of classifying stable Mild Cognitive Impairment (sMCI) and progressive Mild Cognitive Impairment (pMCI). Currently, FPSGR only shows modest accuracy for distinguishing these types within MCI. Extending our approach to timewarped geodesic regression models hong2014time might improve the accuracy in this context. Furthermore, endtoend prediction of averaged initial momenta would be an interesting future direction, as this would allow learning representations that characterize the geodesic path among multiple timeseries images, not only based on averages of momenta for two images as in FPIR ref:yang2016 ; quicksilver .
Acknowledgements
Research reported in this publication was supported by the National Institutes of Health (NIH) and the National Science Foundation (NSF) under award numbers NIH R01AR072013, NSF ECCS1148870, and EECS1711776. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF. We also thank Nvidia for the donation of a TitanX GPU.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH1220012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; BristolMyers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. HoffmannLa Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
References
References
 (1) C. R. Jack, J. Barnes, M. A. Bernstein, B. J. Borowski, J. Brewer, S. Clegg, A. M. Dale, O. Carmichael, C. Ching, C. DeCarli, et al., Magnetic resonance imaging in ADNI 2, Alzheimer’s & Dementia 11 (7) (2015) 740–756.
 (2) M. A. Ikram, A. van der Lugt, W. J. Niessen, P. J. Koudstaal, G. P. Krestin, A. Hofman, D. Bos, M. W. Vernooij, The Rotterdam scan study: design update 2016 and main findings, European journal of epidemiology 30 (12) (2015) 1299–1315.
 (3) Biobank website: www.ukbiobank.ac.uk.
 (4) M. Niethammer, Y. Huang, F.X. Vialard, Geodesic regression for image timeseries, in: MICCAI, 2011, pp. 655–662.
 (5) Y. Hong, Y. Shi, M. Styner, M. Sanchez, M. Niethammer, Simple geodesic regression for image timeseries., in: WBIR, Vol. 12, Springer, 2012, pp. 11–20.
 (6) Y. Hong, S. Joshi, M. Sanchez, M. Styner, M. Niethammer, Metamorphic geodesic regression, Medical Image Computing and ComputerAssisted Intervention–MICCAI (2012) 197–205.
 (7) N. Singh, J. Hinkle, S. Joshi, P. T. Fletcher, A vector momenta formulation of diffeomorphisms for improved geodesic regression and atlas construction, in: ISBI, 2013, pp. 1219–1222.
 (8) P. T. Fletcher, Geodesic regression and the theory of least squares on Riemannian manifolds, IJCV 105 (2) (2013) 171–185.
 (9) Y. Hong, N. Singh, R. Kwitt, M. Niethammer, Timewarped geodesic regression, in: Medical image computing and computerassisted intervention–MICCAI, Vol. 17, 2014, p. 105.
 (10) N. Singh, M. Niethammer, Splines for diffeomorphic image regression, in: Medical image computing and computerassisted intervention – MICCAI, Vol. 17, 2014, p. 121.

(11)
Y. Hong, R. Kwitt, N. Singh, B. Davis, N. Vasconcelos, M. Niethammer, Geodesic regression on the Grassmannian, in: European Conference on Computer Vision, Springer, 2014, pp. 632–646.
 (12) N. Singh, F.X. Vialard, M. Niethammer, Splines for diffeomorphisms, Medical image analysis 25 (1) (2015) 56–71.
 (13) Y. Hong, R. Kwitt, N. Singh, N. Vasconcelos, M. Niethammer, Parametric regression on the Grassmannian, IEEE transactions on pattern analysis and machine intelligence 38 (11) (2016) 2284–2297.
 (14) M. F. Beg, M. I. Miller, A. Trouvé, L. Younes, Computing large deformation metric mappings via geodesic flows of diffeomorphisms, International journal of computer vision 61 (2) (2005) 139–157.
 (15) Y. Hong, Y. Shi, M. Styner, M. Sanchez, M. Niethammer, Simple geodesic regression for image timeseries, in: WBIR, 2012, pp. 11–20.
 (16) X. Cao, J. Yang, J. Zhang, D. Nie, M. Kim, Q. Wang, D. Shen, Deformable image registration based on similaritysteered CNN regression, in: MICCAI, Springer, 2017.
 (17) S. Miao, Z. J. Wang, Y. Zheng, R. Liao, Realtime 2D/3D registration via CNN regression, in: Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, IEEE, 2016, pp. 1430–1434.
 (18) H. Sokooti, B. d. Vos, F. Berendsen, B. P. Lelieveldt, I. Isgum, M. Staring, Nonrigid image registration using multiscale 3D convolutional neural networks, in: MICCAI, Springer, 2017.
 (19) X. Yang, R. Kwitt, M. Niethammer, Fast predictive image registration, in: International Workshop on LargeScale Annotation of Biomedical Data and Expert Label Synthesis, Springer, 2016, pp. 48–57.
 (20) X. Yang, R. Kwitt, M. Styner, M. Niethammer, Quicksilver: Fast predictive image registration–a deep learning approach, NeuroImage 158 (2017) 378–396.
 (21) M. Zhang, R. Liao, A. V. Dalca, E. A. Turk, J. Luo, P. E. Grant, P. Golland, Frequency diffeomorphisms for efficient image registration, in: International Conference on Information Processing in Medical Imaging, Springer, 2017, pp. 559–570.
 (22) M. Zhang, P. T. Fletcher, Finitedimensional Lie algebras for fast diffeomorphic image registration, in: IPMI, 2015, pp. 249–260.
 (23) A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner, B. Avants, M.C. Chiang, G. E. Christensen, D. L. Collins, J. Gee, P. Hellier, et al., Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration, Neuroimage 46 (3) (2009) 786–802.
 (24) A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow with convolutional networks, in: ICCV, 2015, pp. 2758–2766.
 (25) Z. Liu, R. Yeh, X. Tang, Y. Liu, A. Agarwala, Video frame synthesis using deep voxel flow, arXiv preprint arXiv:1702.02463.
 (26) T. Schuster, L. Wolf, D. Gadot, Optical flow requires multiple strategies (but only one network), arXiv preprint arXiv:1611.05607.
 (27) B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, I. Išgum, Endtoend unsupervised deformable image registration with a convolutional neural network, arXiv preprint arXiv:1704.06065.
 (28) Y. Hong, P. Golland, M. Zhang, Fast geodesic regression for populationbased image analysis, in: International Conference on Medical Image Computing and ComputerAssisted Intervention, Springer, 2017, pp. 317–325.
 (29) Z. Ding, G. Fleishman, X. Yang, P. Thompson, R. Kwitt, M. Niethammer, A. D. N. Initiative, et al., Fast predictive simple geodesic regression, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2017, pp. 267–275.
 (30) D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 (31) G. Fleishman, P. M. Thompson, The impact of matching functional on atrophy measurement from geodesic shooting in diffeomorphisms, in: ISBI, 2017.
 (32) G. Fleishman, P. M. Thompson, Adaptive gradient descent optimization of initial momenta for geodesic shooting in diffeomorphisms, in: ISBI, 2017.

(33)
X. Hua, D. P. Hibar, C. R. K. Ching, C. P. Boyle, P. Rajagopalan, B. Gutman, A. D. Leow, A. W. Toga, C. R. J. Jr., D. J. Harvey, M. W. Weiner, P. M. Thompson, Unbiased tensorbased morphometry:improved robustness & sample size estimates for Alzheimer’s disease clinical trials, NeuroImage 66 (2013) 648–661.
 (34) X. Hua, C. R. K. Ching, A. Mezher, B. Gutman, D. P. Hibar, P. Bhatt, A. D. Leow, C. R. J. Jr., M. Bernstein, M. W. Weiner, P. M. Thompson, MRIbased brain atrophy rates in ADNI phase 2: acceleration and enrichment considerations for clinical trials, Neurobiology of Aging 37 (2016) 26–37.
 (35) P. A. Yushkevich, B. B. Avants, S. R. Das, J. Pluta, M. Altinay, C. Craige, Bias in estimation of hippocampal atrophy using deformationbased morphometry arises from asymmetric global normalization: An illustration in ADNI 3T MRI data, NeuroImage 50 (2) (2010) 434 – 445.
 (36) Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the royal statistical society. Series B (Methodological) (1995) 289–300.