1 Introduction
Deformable registration computes a dense correspondence between two images, and is fundamental to many medical image analysis tasks. Classical registration techniques have been rigorously developed and studied, but require computationally intensive optimization for each image pair, often requiring tens of minutes to hours of compute time on a CPU. Recent, learningbased registration methods achieve fast runtimes by building on machine learning developments, but largely omit rigorous theoretical treatment of deformations and topologypreserving guarantees. In this work, we present an approach that builds on the strengths of both paradigms, and overcomes these shortcomings. We provide a rigorous connection between probabilistic generative models for deformations and learning algorithms based on convolutional neural networks (CNNs). We also demonstrate that the learning can be done endtoend in an unsupervised fashion for this model. The resulting framework provides registration for a new image pair in under a second on a GPU, while maintaining guarantees developed for classical methods.
Our formulation casts registration as variational inference on a probabilistic generative model. This framework naturally results in an algorithm that leverages a collection of images to learn a global convolutional neural network with an intuitive cost function. Importantly, we introduce diffeomorphic integration layers combined with a spatial transform layer to enable unsupervised endtoend learning for diffeomorphic registration. We demonstrate that our algorithm achieves stateoftheart registration accuracy while providing diffeomorphic deformations and fast runtime, and can estimate of registration uncertainty. In our experiments we focus on the example of registering 3D MR brain scans, using a multistudy dataset of over 3,500 scans. However, the method is broadly applicable to many registration tasks.
This paper extends a preliminary version of this work presented at the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2018 conference [16]. We build on that work by providing theoretical extensions, new results, analysis, and discussion. We first expand the model, including a natural extension to anatomical surfaces. In our experiments, we add baselines, new experiments on registration of both images and surfaces, and provide an analysis of the effect of our diffeomorphic implementation on field regularity and runtime. We implement our method as part of the registration framework called VoxelMorph, which is available at http://voxelmorph.csail.mit.edu.
1.1 Related Works
1.1.1 Classical Registration Methods
Classical methods solve an optimization over the space of deformations [3, 5, 7, 10, 17, 25, 55, 58, 59]
. Common representations are displacement vector fields, including elastictype models
[7, 19, 52], freeform deformations with bsplines [51], statistical parametric mapping [4], Demons [48, 55], and more recently discrete methods [17, 27, 25].Constraining the allowable transformations to diffeomorphisms ensures certain desirable properties, such as preservation of topology. Diffeomorphic transforms have seen extensive methodological development, yielding stateoftheart tools, such as Large Diffeomorphic Distance Metric Mapping (LDDMM) [10, 12, 13, 28, 33, 42, 47, 59], DARTEL [3], diffeomorphic Demons [56], and symmetric normalization (SyN) [5]. In general, these tools demand substantial time and computational resources for a given image pair.
Some recent GPUbased iterative algorithms use these frameworks to develop faster algorithms by requiring a GPU to be available for each registration [44, 43]. Recent learningbased registration methods have demonstrated that they can provide good initializations to iterative GPU methods [9] to further improve runtime.
1.1.2 Learningbased Registration
Recent methods have proposed to train neural networks that map a pair of input images to an output deformation. Most earlier approaches demonstrated the feasibility of deep learning based registration, and required ground truth registration fields
[11, 36, 49, 53, 57]. Such ground truth deformations are often derived via more conventional registration tools or simulations, sometimes limiting their applicability.Building on the successful demonstration of these methods, several recent papers [8, 9, 21, 20, 38] explore unsupervised, or endtoend, strategies. These methods employ a neural network that computes spatial transformation [32] to warp one image to another, enabling endtoend training. These approaches use machine learning techniques to achieve efficient training and fast runtimes, but build on classical registration development, such as probabilistic models and diffeomorphic theory. In our work, we bridge these two paradigms to offer classical guarantees within a machine learning approach. We note the contemporaneous development of a method that uses a conditional VAE to learn diffeomorphic representations, whose focus is on learning anatomical embeddings [37].
Recent methods proposed using segmentationbased cost functions, such as Dice [23], to replace the imagebased similarity term when segmentations are available during training for multimodal registration, such as T2w MRI and 3D ultrasound, within the same subject [31, 30]. We extend this line of work by showing that our generative probabilistic model naturally describes the deformation of surfaces, therefore enabling the use of segmentations during training within a single cohesive framework. The extended model results in a combination of segmentation (surface) and imagebased training losses.
1.2 Background: Diffeomorphic Registration
Although the method presented in this paper applies to a multitude of deformable representations, we choose to work with diffeomorphisms, and in particular with a stationary velocity field representation [3]. Diffeomorphic deformations are differentiable and invertible, and thus preserve topology. Let
represent the deformation that maps the coordinates from one image to coordinates in another image. In our implementation, the deformation field is defined through the following ordinary differential equation (ODE):
(1) 
where is the identity transformation and is time. We integrate the stationary velocity field over to obtain the final registration field [46].
While we implement and evaluate several numerical integration techniques, we find scaling and squaring to be most efficient, and we briefly review the technique here [2]. The integration of a stationary ODE represents a oneparameter subgroup of diffeomorphisms. In group theory, is a member of the Lie algebra and is exponentiated to produce , which is a member of the Lie group: . From the properties of oneparameter subgroups, for any scalars and , , where is a composition map associated with the Lie group. Starting from where is a map of spatial locations, we use the recurrence to obtain . is chosen so that is very small.
2 Methods
We let and be 3D images, such as MRI volumes, and let be a latent variable that parametrizes a transformation function . We propose a generative model that describes the formation of by warping via . We propose a variational inference approach that leverages a convolutional neural network with diffeomorphic integration and spatial transform layers. We learn network parameters in an unsupervised fashion, without access to ground truth registrations. We describe how the network yields fast diffeomorphic registration of a new image pair , in a probabilistic framework. We expand this treatment by including anatomical surface alignment, which enables training the network given (optional) anatomical segmentations.
2.1 Generative Model
We model the prior probability of the parametrization
as:(2) 
where is the multivariate normal distribution with mean and covariance . Our work applies to a wide range of representations . For example, could be a dense displacement field, or a lowdimensional embedding of the displacement field. In this paper, we let be a stationary velocity field that specifies a diffeomorphism through the ODE (1). We let be the Laplacian of a neighborhood graph defined on the voxel grid, where is the graph degree matrix, and is a voxel neighbourhood adjacency matrix. We encourage spatial smoothness of the velocity field by setting , where is a precision matrix and denotes a parameter controlling the scale of the velocity field .
We let be a noisy observation of warped image :
(3) 
where captures the variance of additive image noise.
We aim to estimate the posterior registration probability . Using this, we can obtain the most likely registration field for a new image pair via MAP estimation, along with an estimate of velocity field variance at each voxel. Figure 1 provides a graphical representation of our model.
2.2 Learning
Given our assumptions, computing the posterior probability is intractable. We use a variational approach, and introduce an approximate posterior probability parametrized by . We minimize the KL divergence
(4) 
which yields the negative of the variational lower bound of the model evidence [34]. We model the approximate posterior as a multivariate normal:
(5) 
where we let be diagonal. To understand the effects of this assumption, we explore a nondiagonal covariance in a later section. The statistics and the diagonal of can be interpreted as the voxelwise mean and variance, respectively.
We estimate , and using a convolutional neural network parameterized by , as described in the next section. We learn parameters by optimizing the variational lower bound (4) using stochastic gradient methods. Specifically, for each image pair and sample , we compute , with the resulting loss (detailed derivation in supplementary material):
(6) 
where is the number of samples used to approximate the expectation. The first term encourages image to be similar to the warped image . The second term encourages the posterior to be close to the prior . Although the variational covariance is diagonal, the last term spatially smoothes the mean, which can be seen by expanding , where are the neighbors of voxel . We treat and as fixed hyperparameters that we investigate in our experiments, and use .
2.3 Neural Network Framework
We design the network that takes as input and and outputs and , based on a 3D UNetstyle architecture [50]
. The network includes a convolutional layer with 32 filters, four downsampling layers with 64 convolutional filters and a stride of two, and three upsampling convolutional layers with 64 filters. All convolutional layers use 3x3 kernels and LeakyReLu activations. We emphasize that the exact architecture is not a focus of this paper, and many architectural choices are appropriate.
To enable unsupervised learning of parameters using (6), we must form and compute the data term. We first implement a layer that samples a new using the “reparameterization trick" [34]: , where is a sample from the standard normal: .
Given , we need to compute as described in the introduction. We propose vector integration layers using scaling and squaring operations. Specifically, scaling and squaring operations involve compositions within the neural network architecture using a differentiable spatial transformation operation. Given two 3D vector fields and , for each voxel this operation computes , a noninteger voxel location in
, using linear interpolation. Starting with
, we compute recursively using these operations T times, leading to . In our experiments, we extensively analyze the effect of the step size on the runtime of the network, the accuracy of the registration, and the regularity of the deformation. We also implement vector integration layers using quadrature and ODE solvers, and in the experiments show that these are significantly slower and can require significant memory.Finally, we warp volume according to the computed diffeomorphic field using a spatial transform layer.
In summary, the network takes as input images and , computes statistics and , samples a new velocity field , computes a diffeomorphic and warps
. Since all the steps are designed to be differentiable, we learn the network parameters using stochastic gradient descentbased methods. This network results in three outputs,
and , which are used in the model loss (6). The framework is summarized in Figure 2.2.4 Registration
Given learned parameters, we approximate registration of a new scan pair using . We first obtain the most likely velocity field using
(7) 
by evaluating the neural network . We then compute using the scaling and squaring based integration, altogether requiring less than a second on a GPU. We highlight that at test time, the diagonal covariance is not used, however it enables an estimation of the deformation uncertainty. Analysis of uncertainty is beyond the scope of this paper, and is an interesting avenue for future study.
Using a stationary velocity field representation, computing the inverse deformation field can be achieved by integrating the negative of the velocity field: , since [3, 45]. This enables the computation of both fields inside one efficient network when desired.
We implement our method as part of the VoxelMorph package [8], available online at http://voxelmorph.csail.mit.edu
, using neuron
[18]and Keras
[14]with a Tensorflow
[1] backend.2.5 Surfacebased Registration
In various instances, anatomical segmentation maps for specific structures of interest may also be available with some of the training images. Recent papers have demonstrated that the use of segmentations can help in registration [9, 30]. Here, we show that our proposed model naturally extends to handle surfaces, enabling the use of segmentations during training within the same principled framework.
We focus on the case where one anatomical structure is segmented in the image. Given a segmentation map where each voxel is assigned the desired anatomical label or background, we extract the anatomical surface and let represent the surface coordinates of the anatomical structure for image , which can be stored as an matrix. Given the diffeomorphism in the previous section, we model each surface location , as formed by displacing a matching surface location according to , and adding (spatial) displacement noise:
(8) 
where the composition warps surface coordinates.
Given both images and segmentation maps during training, we extract surfaces of the desired structure and aim to estimate the posterior probability . As before, we use a variational approximation. Since segmentation maps, and hence surfaces, are usually derived from images, we assume that images are sufficient to approximate the posterior: . As before, we minimize the KL divergence between the true and approximate posterior (derived in supplementary material):
(9) 
and arrive at the loss function:
(10) 
Compared to the original model loss (6), the additional third term encourages the deformation to warp the moving surface close to the fixed surface . As described in the generative model (8), this requires corresponding surface points in and . However, these correspondences are not available in practice, as segmentations are provided independently for each image. Therefore, the third term cannot be computed directly.
We propose an approximation of the surface term using surface distance transforms. Let be a surface distance function, which for location returns the Euclidean distance to the closest surface point in (Figure 3Left).^{1}^{1}1Function is a generating function for a distance transform image for the surface , by evaluating it at every grid point Noting that for two surfaces and , due to potential asymmetries in the surfaces (see Figure 3Right), we approximate the distance by computing in both directions:
(11) 
We implement this function efficiently using distance transforms and points sampled along each structure, and take advantage of our diffeomorphic representation that enables computing the inverse efficiently within the network.
Method  Avg. Dice  GPU sec  CPU sec  

Affine only  0.584 (0.157)  0  0  0 
ANTs (SyN)  0.749 (0.136)    9059 (2023)  9662 (6258) 
NiftyReg (CC)  0.755 (0.143)    2347 (202)  41251 (14336) 
VoxelMorph (CC)  0.753 (0.145)  0.45 (0.01)  57 (1.0)  19077 (5928) 
VoxelMorphdiff 
0.754 (0.139)  0.47 (0.01)  84.2 (0.1)  0.1 (1.27) 
In summary, since to compute the posterior approximation the neural network takes as input only the images and , images alone are required at test time. Given a diffeomorphism , at training time the network uses both a warped image and a warped surface to evaluate the quality of the registration.
This model can also be used to register two surfaces when the images themselves are not available. The only modelling change required is removing the image likelihood terms and using the variational approximation , which uses the segmentation maps and as input. Surfaceonly registration is beyond the scope of this paper, and we leave it for future work.
The complete neural network framework, including the surface loss, is illustrated in supplemental Figure 11.
2.6 Nondiagonal Covariance
Approximating the velocity field covariance using a diagonal matrix is a strong assumption that ignores spatial smoothness. As seen in (6), the spatiallysmooth prior encourages a smooth mean velocity field , but samples might still be noisy.
To evaluate the effects of the diagonal covariance, we explore a second approximation where is a diagonal matrix returned by the neural network and is a fixed smoothing convolution matrix. Specifically, for each row of we create a flattened Gaussian smoothing kernel centered at a particular voxel, such that is equivalent to 3D convolution of image by a gaussian filter with variance . We choose such that the smoothing operation matches the scale of the prior determined by : .
During training, sampling from the posterior is achieved using the reparametrization trick: , where is a sample from the standard normal. Intuitively, compared to the diagonal approximation, this sampling procedure smoothes the term before adding the mean .
In our experiments, we show that this approximation yields smoother velocity fields during training, and the effect diminishes with higher values. However, the resulting deformation fields are diffeomorphic and accurate for both approximations, demonstrating that the diagonal covariance approximation is sufficient when working with diffeomorphisms.
3 Experiments
We perform a series of experiments demonstrating that the proposed probabilistic image registration framework achieves accuracy and runtime comparable to stateoftheart methods while enabling diffeomorphic deformations. We also show the improvements enabled by the extended surface model, and analyze the effect of the various integration layers during test time.
We focus on atlasbased registration, a common task in population analysis. Specifically, we register each scan to an atlas computed using external data [24, 54]. Because we implement our algorithm as part of the VoxelMorph framework, we will refer to it as VoxelMorphdiff
.
3.1 Experiment setup
3.1.1 Data and Preprocessing
We use a largescale, multisite dataset of 3731 T1weighted brain MRI scans from eight publicly available datasets: OASIS [39], ABIDE [22], ADHD200 [41], MCIC [26], PPMI [40], HABS [15], and Harvard GSP [29]. Acquisition details, subject age ranges and health conditions are different for each dataset. We performed standard preprocessing steps on all scans, including resampling to mm isotropic voxels, affine spatial normalization and brain extraction for each scan using FreeSurfer [24]. We crop the final images to . Segmentation maps including 29 anatomical structures, obtained using FreeSurfer for each scan, are used in evaluating registration results. Each image contains roughly million brain voxels. We split the dataset into 3231, 250, and 250 volumes for train, validation, and test sets respectively, although we underscore that the training is unsupervised.
3.1.2 Evaluation Metrics
To evaluate a registration algorithm, we register each subject to an atlas, propagate the segmentation map using the resulting warp, and measure volume overlap using the Dice metric. For the surface experiments, we also employ the Euclidean surface distance, computed using the strategy described in (11).
We also evaluate the diffeomorphic property, a focus of our work. Specifically, the Jacobian matrix captures the local properties of around voxel . The local deformation is diffeomorphic, both invertible and orientationpreserving, only at locations for which , where is the determinant operator [3]. We count all other (folding) voxels, where .
3.1.3 Baseline Methods
We compare our approach with the popular ANTs software package using Symmetric Normalization (SyN) [6], a topperforming algorithm [35]. We found that the default ANTs settings are suboptimal for our task, and performed a wide parameter and similarity metric search across several datasets. We identified and use top performing parameter values for the Dice metric using: the crosscorrelation (CC) loss function, SyN step size of 0.25, Gaussian smoothing of (9, 0.2) and three scales of 201 iterations. We also test the NiftyReg package, for which we use a multithreaded CPU implementation as a GPU implementation is not currently available.^{2}^{2}2We compiled the latest source code from March, 2018 (tree [4e4525]). We experimented with different parameter settings for improved behavior, and used the following setting: CC cost function, grid spacing of 5, and 500 iterations.
To compare with recent learningbased registration approaches, we also test our recent CNNbased method, VoxelMorph, which produces stateoftheart fast and accurate registration, but does not yield diffeomorphic results [8, 9]. We sweep the regularization parameter using our validation set, and use the optimal regularization parameter of 1 in our results.
3.2 Image Registration
Table 1 provides a summary of the results on the heldout test set. Figure 5 and supplementary material Figure 12 show representative results. Figure 4 illustrates Dice results on several anatomical structures. For better visualization, we combine the same structures from the two hemispheres, such as the left and right hippocampus. Our algorithm, VoxelMorphdiff
, achieve stateoftheart Dice results and runtimes, but produces diffeomorphic registration fields (nearly no folding voxels per scan) in a probabilistic framework.
All methods achieve comparable Dice results on each structure and overall. Learningbased methods require a fraction of the baseline runtimes to register two images: less than a second on a GPU, and less than a minute and a half on a CPU. Runtimes were computed for an NVIDIA TitanX GPU and a Intel Xeon (E52680) CPU, and exclude preprocessing common to all methods.
Our method outputs positive Jacobians at nearly all voxels, which we analyze in more detail in a later section. We find that for most scans, the deformation fields result in zero folding voxels, while very few volumes have a handful of grouped folding voxels, averaging out to less than a voxel per scan in our test set. In contrast, the deformation fields resulting from the baseline methods contain a few thousand locations of nonpositive Jacobians for each scan (Table 1), usually grouped in clusters. This may be alleviated with increased spatial regularization or more optimization iterations, but this in turn leads to a drop in performance on the Dice metric or even longer runtimes.
Method  

ANTs SyN (CC)  9060 (4445)  0.545 (0.267) 
NiftyReg (CC)  40425 (9901)  2.431 (0.595) 
VoxelMorph (CC)  19077 (5928)  1.147 (0.360) 
VoxelMorphdiff  0.1 (1.2)  6.1e6 (7.6e5) 
VoxelMorphsurf (W.M.)  3.0 (6.2)  1.8e4 (3.8e4) 
VoxelMorphsurf (cortex)  3.4 (6.4)  2.0e4 (3.9e4) 
VoxelMorphsurf (lat. ven.)  4.0 (8.0)  2.3e4 (4.8e4) 
VoxelMorphsurf (thalamus)  4.3 (8.2)  2.6e4 (4.9e4) 
VoxelMorphsurf (hipp.)  2.7 (5.8)  1.6e4 (3.5e4) 
3.3 Image and Surface Registration
In this section, we evaluate the generative surface model. We demonstrate the use of anatomical segmentation alongside images during training, and refer to this model as VoxelMorphSurf
. We focus on the setting where one structure of interest is available during training, and learn separate networks for the left white matter, gray matter, ventricle, thalamus and hippocampus. Our goal is to analyze how the additional surface model terms affect the accuracy and regularity of resulting deformations.
Figure 7 illustrates the behaviour of the model with respect to the hyperparameter on the validation set. For a range of values, we find a significant improvement in terms of Dice for the desired structure. For very small values of , the training becomes unstable leading to poor generalization. A very large value leads to the model ignoring the surface term. Since the Dice scores are comparable in the range , for the rest of this section we use , which exhibits slightly fewer folding voxels ( compared to for ).
Figure 6 demonstrates the improvement on the test set in terms of Euclidean surface distance and Dice, compared to the imageonly registration model VoxelMorphdiff
. VoxelMorphsurf
improves significantly in all measures for most desired structures. Additionally, Table 2 illustrates that with increased accuracy in both metrics, the number of folding voxels in the entire volume increases only very slightly (to an average 3.5 voxels per volume), which remains orders of magnitude fewer than the baseline methods (Table 1). Figure 13 in the supplementary material illustrates example results.
In summary, the principled joint diffeomorphic model enables the use of surfaces during training which dramatically improves registration near a given structure while preserving desired deformation properties. For example, given hippocampus surfaces at training, registration using VoxelMorphSurf
improves Dice by points over VoxelMorphdiff
, improves maximum surface distance by more than three voxels, and preserving diffeomorphisms (less than three folding voxels per scan).
3.4 Analysis
3.4.1 Parameter Analysis
The two main hyperparameters, smoothing precision and image noise
, have physical meaning in our generative model. However, they share a single degree of freedom in the loss function. We set
, and vary the precision scale between 0.5 and 100. Figure 8 shows average Dice scores for 50 validation set scans for different parameter values, showing that the results vary smoothly over a large range, with reasonable behavior even near . We use in our experiments above.3.4.2 Integration Steps
During training, we hold the number of scaling and squaring steps fixed. However, this number can be varied at test time, affecting aspects of the resulting deformation field. In this section, we analyze the effects of the number of steps on accuracy, runtime, field regularity and invertability. We perform this experiment using 50 validation subjects and the image registration network VoxelMorphdiff trained with integration steps and regularization parameter . The velocity field is computed every two voxels, but all of the conclusions in this section are likely to apply to many reasonable field spacings.
Figure 9 summarizes the analysis results. The runtime increases modestly with the number of steps, and is overall significantly smaller than the cost of the rest of the network (i.e. the deformation network computation of the velocity field, and the spatial transform of the full moving image). After four scaling and squaring steps, the method achieved maximum Dice score. We observe a steep decline in the number of folding voxels (note the logscale vertical axis), reaching less than five voxels after five scaling and squaring steps, compared to classical methods which can include thousands of such voxels (Table 1). Finally, we measure the average displacement error after inverting the deformation fields: . We find that after five scaling and squaring steps, even the worst error is under a half voxel, indicating that five steps are sufficient to ensure invertible deformations.
In addition, we implemented the integration of the velocity field using Tensorflow ODE solvers and using standard quadrature, but found that these required significantly more runtime compared to the scaling and squaring strategy, consistent with literature findings. Specifically, while five scaling and squaring operations required seconds, equivalent quadrature integration required operations (occupying prohibitive amounts of memory) and seconds, and ODEsolver based integration with default parameter required a single layer and seconds. At comparable integration settings such as these, all three methods achieve similar Dice scores of . While these alternative methods require significant resources, all three implementations are available in our source code for experimentation.
This analysis indicates that the proposed scaling and squaring network integration layer is efficient and accurate. Increasing the number of scaling and squaring layers incurs a negligible runtime cost while improving deformation field properties. We use squaring steps in the test experiments above.
3.4.3 Velocity Sampling and Uncertainty
We also evaluate the modeling assumptions of the variational covariance . Figure 10 illustrates example samples of the velocity field and voxelwise empirical variance for the two approximations: diagonal covariance and the extended approximation in Section 2.6 that smooths samples . For underregularized networks (very low values for hyperparameter ), the latter approximation yields smoother velocity fields. However, given a higher hyperparameter value, such as the one used in our experiments, the network learns smaller values for the diagonal approximation, and yields smooth samples with either method. Futhermore, despite the difference in smoothness of the velocity field samples , the integration operation leads to equally regular and accurate deformations for a given (Table 3).
Therefore, although the diagonal covariance has the potential to add noise to velocity field samples, the loss function coupled with the integration operation lead to smooth and accurate deformation fields at reasonable values. Diagonal covariances would likely have negative effects in a different deformation model, for example if would be the displacement field itself.
Dice  

diagonal  extension  diagonal  extension  
0.74 (0.01)  0.74 (0.01)  2934 (2007)  2720 (1593)  
0.75 (0.01)  0.75 (0.01)  0.32 (0.96)  0.16 (0.57) 
4 Discussion and Conclusion
In this work, we build a principled connection between classical registration methods and recent learningbased approaches. We propose a probabilistic model for diffeomorphic image registration and derive a learning algorithm that leverages a convolutional neural network and unsupervised, endtoend learning for fast runtime. To achieve diffeomorphic transforms, we integrate stationary velocity fields through novel scaling and squaring differentiable network operations, and provide implementation and analysis for other integration layers.
Although the simplifying diagonal approximation to the velocity covariance adds voxelindependent noise to every velocity field sample , the resulting deformation fields are well behaved because of our smoothing prior and diffeomorphic representation.
We also provide an anatomical surface deformation model. If image segmentations are available for a particular anatomical structure, the generative model incorporates them naturally in the same joint framework during training, while not requiring the surfaces at test time.
Our algorithm can infer the registration of new image pairs in under a second. Compared to traditional methods, our approach is significantly faster, and compared to recent learning based methods, our method offers diffeomorphic guarantees. We demonstrate that the surface extension to our model can help improve registration while preserving properties such as low runtime and diffeomorphisms.
Our focus in this framework has been to present the technical connection between classical and learning paradigms, and show that diffeomorphisms are attainable in a very low runtime. Immediate extensions can enable other models and applications. For example, our derivation is generalizable to other formulations: can be a low dimensional embedding representation of a deformation field, or the displacement field itself. Similarly, the variational covariance enables an estimation of the uncertainty of the deformation field at each voxel, which can be informative in downstream tasks such as biomedical segmentation or population analysis. The model is also widely applicable to other applications, such as subjecttosubject registration, segmentationonly registration, or using multiple surfaces to improve imagebased registration.
5 Acknowledgments
This research was funded by NIH grants R01LM012719, R01AG053949, and 1R21AG050122, and NSF NeuroNex Grant grant 1707312.
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [2] Vincent Arsigny, Olivier Commowick, Xavier Pennec, and Nicholas Ayache. A logeuclidean framework for statistics on diffeomorphisms. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 924–931. Springer, 2006.
 [3] J. Ashburner. A fast diffeomorphic image registration algorithm. Neuroimage, 38(1):95–113, 2007.
 [4] J. Ashburner and K. Friston. Voxelbased morphometrythe methods. Neuroimage, 11:805–821, 2000.
 [5] Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with crosscorrelation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis, 12(1):26–41, 2008.
 [6] Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with crosscorrelation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis, 12(1):26–41, 2008.
 [7] R. Bajcsy and S. Kovacic. Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing, 46:1–21, 1989.

[8]
Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca.
An unsupervised learning model for deformable medical image
registration.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 9252–9260, 2018.  [9] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: A learning framework for deformable medical image registration. arXiv preprint arXiv:1809.05231, 2019.
 [10] M Faisal Beg, Michael I Miller, Alain Trouvé, and Laurent Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vision, 61:139–157, 2005.
 [11] Xiaohuan Cao, Jianhua Yang, Jun Zhang, Dong Nie, Minjeong Kim, Qian Wang, and Dinggang Shen. Deformable image registration based on similaritysteered cnn regression. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 300–308. Springer, 2017.
 [12] Yan Cao, Michael I Miller, Raimond L Winslow, and Laurent Younes. Large deformation diffeomorphic metric mapping of vector fields. IEEE transactions on medical imaging, 24(9):1216–1230, 2005.

[13]
Can Ceritoglu, Kenichi Oishi, Xin Li, MingChung Chou, Laurent Younes, Marilyn
Albert, Constantine Lyketsos, Peter CM van Zijl, Michael I Miller, and Susumu
Mori.
Multicontrast large deformation diffeomorphic metric mapping for diffusion tensor imaging.
Neuroimage, 47(2):618–627, 2009.  [14] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
 [15] Alexander Dagley, Molly LaPoint, Willem Huijbers, Trey Hedden, Donald G McLaren, Jasmeer P Chatwal, Kathryn V Papp, Rebecca E Amariglio, Deborah Blacker, Dorene M Rentz, et al. Harvard aging brain study: dataset and accessibility. NeuroImage, 2015.
 [16] Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning for fast probabilistic diffeomorphic registration. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 729–738, Cham, 2018. Springer.
 [17] Adrian V Dalca, Andreea Bobu, Natalia S Rost, and Polina Golland. Patchbased discrete registration of clinical brain images. In MICCAIPATCHMI Patchbased Techniques in Medical Imaging. Springer, 2016.
 [18] Adrian V Dalca, John Guttag, and Mert R Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9290–9299, 2018.
 [19] Christos Davatzikos. Spatial transformation and registration of brain images using elastically deformable models. Computer Vision and Image Understanding, 66(2):207–222, 1997.
 [20] Bob D de Vos, Floris F Berendsen, Max A Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis, 52:128–143, 2019.
 [21] Bob D de Vos, Floris F Berendsen, Max A Viergever, Marius Staring, and Ivana Išgum. Endtoend unsupervised deformable image registration with a convolutional neural network. In DLMIA, pages 204–212. Springer, 2017.
 [22] A. Di Martino et al. The autism brain imaging data exchange: towards a largescale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry, 19(6):659–667, 2014.
 [23] Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
 [24] B. Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.

[25]
B. Glocker, N. Komodakis, G. Tziritas, N. Navab, and N. Paragios.
Dense image registration through mrfs and efficient linear programming.
Medical image analysis, 12(6):731–741, 2008.  [26] Randy L Gollub, Jody M Shoemaker, Margaret D King, Tonya White, Stefan Ehrlich, Scott R Sponheim, Vincent P Clark, Jessica A Turner, Bryon A Mueller, Vince Magnotta, et al. The MCIC collection: a shared repository of multimodal, multisite brain image data from a clinical investigation of schizophrenia. Neuroinformatics, 11(3):367–388, 2013.
 [27] Mattias P Heinrich, Mark Jenkinson, Michael Brady, and Julia A Schnabel. Mrfbased deformable registration and ventilation estimation of lung ct. IEEE transactions on medical imaging, 32(7):1239–1248, 2013.
 [28] Monica Hernandez, Matias N Bossa, and Salvador Olmos. Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows. International Journal of Computer Vision, 85(3):291–306, 2009.
 [29] Avram J Holmes, Marisa O Hollinshead, Timothy M O’Keefe, Victor I Petrov, Gabriele R Fariello, Lawrence L Wald, Bruce Fischl, Bruce R Rosen, Ross W Mair, Joshua L Roffman, et al. Brain genomics superstruct project initial data release with structural, functional, and behavioral measures. Scientific data, 2, 2015.

[30]
Yipeng Hu, Marc Modat, Eli Gibson, Nooshin Ghavami, Ester Bonmati, Caroline M
Moore, Mark Emberton, J Alison Noble, Dean C Barratt, and Tom Vercauteren.
Labeldriven weaklysupervised learning for multimodal deformable image registration.
In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1070–1074. IEEE, 2018.  [31] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al. Weaklysupervised convolutional neural networks for multimodal image registration. Medical image analysis, 49:1–13, 2018.
 [32] Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [33] Sarang C Joshi and Michael I Miller. Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing, 9(8):1357–1370, 2000.
 [34] D.P. Kingma and M. Welling. Autoencoding variational bayes. arXiv:1312.6114, 2013.
 [35] Arno Klein, Jesper Andersson, Babak A Ardekani, John Ashburner, Brian Avants, MingChang Chiang, Gary E Christensen, D Louis Collins, James Gee, Pierre Hellier, et al. Evaluation of 14 nonlinear deformation algorithms applied to human brain mri registration. Neuroimage, 46(3):786–802, 2009.
 [36] Julian Krebs et al. Robust nonrigid registration through agentbased action learning. In Medical Image Computing and ComputerAssisted Intervention (MICCAI), pages 344–352, Cham, 2017. Springer International Publishing.
 [37] Julian Krebs, Tommaso Mansi, Boris Mailhé, Nicholas Ayache, and Hervé Delingette. Unsupervised probabilistic deformation modeling for robust diffeomorphic registration. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 101–109. Springer, 2018.
 [38] H. Li and Y. Fan. Nonrigid image registration using fully convolutional networks with deep selfsupervision. arXiv preprint arXiv:1709.00799, 2017.
 [39] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (oasis): crosssectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience, 19(9):1498–1507, 2007.
 [40] Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The parkinson progression marker initiative (ppmi). Progress in neurobiology, 95(4):629–635, 2011.
 [41] Michael P Milham, Damien Fair, Maarten Mennes, Stewart HMD Mostofsky, et al. The ADHD200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in systems neuroscience, 6:62, 2012.
 [42] Michael I Miller, M Faisal Beg, Can Ceritoglu, and Craig Stark. Increasing the power of functional maps of the medial temporal lobe by using large deformation diffeomorphic metric mapping. Proceedings of the National Academy of Sciences, 102(27):9685–9690, 2005.
 [43] Marc Modat, David M Cash, Pankaj Daga, Gavin P Winston, John S Duncan, and Sébastien Ourselin. Global image registration using a symmetric blockmatching approach. Journal of Medical Imaging, 1(2):024003, 2014.
 [44] Marc Modat, Gerard R Ridgway, Zeike A Taylor, Manja Lehmann, Josephine Barnes, David J Hawkes, Nick C Fox, and Sébastien Ourselin. Fast freeform deformation using graphics processing units. Computer methods and programs in biomedicine, 98(3):278–284, 2010.
 [45] Marc Modat, Ivor JA Simpson, Manual Jorge Cardoso, David M Cash, Nicolas Toussaint, Nick C Fox, and Sébastien Ourselin. Simulating neurodegeneration through longitudinal population analysis of structural and diffusion weighted mri data. Medical Image Computing and ComputerAssisted Intervention, LNCS 8675:57–64, 2014.
 [46] Cleve Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twentyfive years later. SIAM review, 45(1):3–49, 2003.
 [47] Kenichi Oishi, Andreia Faria, Hangyi Jiang, Xin Li, Kazi Akhter, Jiangyang Zhang, John T Hsu, Michael I Miller, Peter CM van Zijl, Marilyn Albert, et al. Atlasbased whole brain white matter analysis using large deformation diffeomorphic metric mapping: application to normal elderly and alzheimer’s disease participants. Neuroimage, 46(2):486–499, 2009.
 [48] Xavier Pennec, Pascal Cachier, and Nicholas Ayache. Understanding the "demon’s algorithm": 3d nonrigid registration by gradient descent. International Conference on Medical Image Computing and ComputerAssisted Intervention (MICCAI), pages 597–605, 1999.
 [49] MarcMichel Rohé, Manasi Datar, Tobias Heimann, Maxime Sermesant, and Xavier Pennec. SVFNet: Learning deformable image registration using shape matching. In MICCAI, pages 266–274. Springer, 2017.
 [50] O. Ronneberger et al. Unet: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
 [51] Daniel Rueckert, Luke I Sonoda, Carmel Hayes, Derek LG Hill, Martin O Leach, and David J Hawkes. Nonrigid registration using freeform deformation: Application to breast mr images. IEEE Transactions on Medical Imaging, 18(8):712–721, 1999.
 [52] D. Shen and C. Davatzikos. Hammer: Hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imag., 21(11):1421–1439, 2002.
 [53] Hessam Sokooti, Bob de Vos, Floris Berendsen, Boudewijn PF Lelieveldt, Ivana Išgum, and Marius Staring. Nonrigid image registration using multiscale 3d convolutional neural networks. In MICCAI, pages 232–239, Cham, 2017. Springer.
 [54] R. Sridharan, A.V. Dalca, K.M. Fitzpatrick, L. Cloonan, A. Kanakis, O. Wu, K.L. Furie, J. Rosand, N.S. Rost, and P. Golland. Quantification and analysis of large multimodal clinical image studies: Application to stroke. In International Workshop on Multimodal Brain Image Analysis, pages 18–30. Springer International Publishing, 2013.
 [55] J.P. Thirion. Image matching as a diffusion process: an analogy with maxwell’s demons. Medical Image Analysis, 2(3):243–260, 1998.
 [56] Tom Vercauteren et al. Diffeomorphic demons: Efficient nonparametric image registration. NeuroImage, 45(1):S61–S72, 2009.
 [57] Xiao Yang, Roland Kwitt, Martin Styner, and Marc Niethammer. Quicksilver: Fast predictive image registration–a deep learning approach. NeuroImage, 158:378–396, 2017.
 [58] BT Thomas Yeo, Mert R Sabuncu, Tom Vercauteren, Daphne J Holt, Katrin Amunts, Karl Zilles, Polina Golland, and Bruce Fischl. Learning taskoptimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex. IEEE transactions on medical imaging, 29(7):1424–1441, 2010.
 [59] Miaomiao. Zhang, Ruizhi. Liao, Adrian V Dalca, Esra A Turk, Jie Luo, P Ellen Grant, and Polina Golland. Frequency diffeomorphisms for efficient image registration. In International Conference on Information Processing in Medical Imaging, pages 559–570. Springer, 2017.
Supplementary Material
Derivation of Main Loss
Using the facts that is constant, , and , and approximating the expectation with samples , we obtain
(12) 
Derivation of Surface VLB and Loss
We derive the variational lower bound and loss for the generative surface model. We start by minimizing the KL divergence between the true and approximate posteriors:
(13) 
where in we used the assumptions that the fixed image is independent of anatomical surfaces given the moving image and the deformation, and the fixed surface is independent of either image given the moving surface and the deformation. Following this variational lower bound, the loss follows the previous section closely (see (13)), with the additional term:
(14) 
Combining this term with (12), and approximating expectations with samples leads to the final loss (10):
(15) 