Unsupervised Learning of Probabilistic Diffeomorphic Registration for Images and Surfaces

03/08/2019 ∙ by Adrian V. Dalca, et al. ∙ MIT cornell university 13

Classical deformable registration techniques achieve impressive results and offer a rigorous theoretical treatment, but are computationally intensive since they solve an optimization problem for each image pair. Recently, learning-based methods have facilitated fast registration by learning spatial deformation functions. However, these approaches use restricted deformation models, require supervised labels, or do not guarantee a diffeomorphic (topology-preserving) registration. Furthermore, learning-based registration tools have not been derived from a probabilistic framework that can offer uncertainty estimates. In this paper, we build a connection between classical and learning-based methods. We present a probabilistic generative model and derive an unsupervised learning-based inference algorithm that uses insights from classical registration methods and makes use of recent developments in convolutional neural networks (CNNs). We demonstrate our method on a 3D brain registration task for both images and anatomical surfaces, and provide extensive empirical analyses of the algorithm. Our principled approach results in state of the art accuracy and very fast runtimes, while providing diffeomorphic guarantees. Our implementation is available online at http://voxelmorph.csail.mit.edu.



There are no comments yet.


page 4

page 7

page 10

page 15

page 16

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deformable registration computes a dense correspondence between two images, and is fundamental to many medical image analysis tasks. Classical registration techniques have been rigorously developed and studied, but require computationally intensive optimization for each image pair, often requiring tens of minutes to hours of compute time on a CPU. Recent, learning-based registration methods achieve fast runtimes by building on machine learning developments, but largely omit rigorous theoretical treatment of deformations and topology-preserving guarantees. In this work, we present an approach that builds on the strengths of both paradigms, and overcomes these shortcomings. We provide a rigorous connection between probabilistic generative models for deformations and learning algorithms based on convolutional neural networks (CNNs). We also demonstrate that the learning can be done end-to-end in an unsupervised fashion for this model. The resulting framework provides registration for a new image pair in under a second on a GPU, while maintaining guarantees developed for classical methods.

Our formulation casts registration as variational inference on a probabilistic generative model. This framework naturally results in an algorithm that leverages a collection of images to learn a global convolutional neural network with an intuitive cost function. Importantly, we introduce diffeomorphic integration layers combined with a spatial transform layer to enable unsupervised end-to-end learning for diffeomorphic registration. We demonstrate that our algorithm achieves state-of-the-art registration accuracy while providing diffeomorphic deformations and fast runtime, and can estimate of registration uncertainty. In our experiments we focus on the example of registering 3D MR brain scans, using a multi-study dataset of over 3,500 scans. However, the method is broadly applicable to many registration tasks.

This paper extends a preliminary version of this work presented at the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2018 conference [16]. We build on that work by providing theoretical extensions, new results, analysis, and discussion. We first expand the model, including a natural extension to anatomical surfaces. In our experiments, we add baselines, new experiments on registration of both images and surfaces, and provide an analysis of the effect of our diffeomorphic implementation on field regularity and runtime. We implement our method as part of the registration framework called VoxelMorph, which is available at http://voxelmorph.csail.mit.edu.

1.1 Related Works

1.1.1 Classical Registration Methods

Classical methods solve an optimization over the space of deformations [3, 5, 7, 10, 17, 25, 55, 58, 59]

. Common representations are displacement vector fields, including elastic-type models 

[7, 19, 52], free-form deformations with b-splines [51], statistical parametric mapping [4], Demons [48, 55], and more recently discrete methods [17, 27, 25].

Constraining the allowable transformations to diffeomorphisms ensures certain desirable properties, such as preservation of topology. Diffeomorphic transforms have seen extensive methodological development, yielding state-of-the-art tools, such as Large Diffeomorphic Distance Metric Mapping (LDDMM) [10, 12, 13, 28, 33, 42, 47, 59], DARTEL [3], diffeomorphic Demons [56], and symmetric normalization (SyN) [5]. In general, these tools demand substantial time and computational resources for a given image pair.

Some recent GPU-based iterative algorithms use these frameworks to develop faster algorithms by requiring a GPU to be available for each registration [44, 43]. Recent learning-based registration methods have demonstrated that they can provide good initializations to iterative GPU methods [9] to further improve runtime.

1.1.2 Learning-based Registration

Recent methods have proposed to train neural networks that map a pair of input images to an output deformation. Most earlier approaches demonstrated the feasibility of deep learning based registration, and required ground truth registration fields 

[11, 36, 49, 53, 57]. Such ground truth deformations are often derived via more conventional registration tools or simulations, sometimes limiting their applicability.

Building on the successful demonstration of these methods, several recent papers [8, 9, 21, 20, 38] explore unsupervised, or end-to-end, strategies. These methods employ a neural network that computes spatial transformation [32] to warp one image to another, enabling end-to-end training. These approaches use machine learning techniques to achieve efficient training and fast runtimes, but build on classical registration development, such as probabilistic models and diffeomorphic theory. In our work, we bridge these two paradigms to offer classical guarantees within a machine learning approach. We note the contemporaneous development of a method that uses a conditional VAE to learn diffeomorphic representations, whose focus is on learning anatomical embeddings [37].

Recent methods proposed using segmentation-based cost functions, such as Dice [23], to replace the image-based similarity term when segmentations are available during training for multi-modal registration, such as T2w MRI and 3D ultrasound, within the same subject [31, 30]. We extend this line of work by showing that our generative probabilistic model naturally describes the deformation of surfaces, therefore enabling the use of segmentations during training within a single cohesive framework. The extended model results in a combination of segmentation (surface) and image-based training losses.

1.2 Background: Diffeomorphic Registration

Although the method presented in this paper applies to a multitude of deformable representations, we choose to work with diffeomorphisms, and in particular with a stationary velocity field representation [3]. Diffeomorphic deformations are differentiable and invertible, and thus preserve topology. Let 

represent the deformation that maps the coordinates from one image to coordinates in another image. In our implementation, the deformation field is defined through the following ordinary differential equation (ODE):


where is the identity transformation and  is time. We integrate the stationary velocity field over to obtain the final registration field  [46].

While we implement and evaluate several numerical integration techniques, we find scaling and squaring to be most efficient, and we briefly review the technique here [2]. The integration of a stationary ODE represents a one-parameter subgroup of diffeomorphisms. In group theory, is a member of the Lie algebra and is exponentiated to produce , which is a member of the Lie group: . From the properties of one-parameter subgroups, for any scalars and , , where is a composition map associated with the Lie group. Starting from  where  is a map of spatial locations, we use the recurrence  to obtain . is chosen so that is very small.

Figure 1:

A graphical representation of our generative model. Circles indicate random variables and rounded squares represent parameters. Shaded quantities are observed

at test time, and the plates indicate replication.  and  are the input images. The image intensities 

are generated from a normal distribution centered at 

. The registration prior is defined by normal parameters , and . In blue, the optional similar model structure is included for an anatomical surface, used purely for learning an improved posterior of registration.
Figure 2: Overview of end-to-end unsupervised architecture. The first part of the network, 

takes the input images and outputs the approximate posterior probability parameters representing the velocity field mean, 

, and variance

. A velocity field  is sampled and transformed to a diffeomorphic deformation field  using novel differentiable squaring and scaling integration layers. Finally, a spatial transform warps  to obtain . Figure 11 expands on this overview by including the optional surface-based loss.

2 Methods

We let  and  be 3D images, such as MRI volumes, and let  be a latent variable that parametrizes a transformation function . We propose a generative model that describes the formation of  by warping  via . We propose a variational inference approach that leverages a convolutional neural network with diffeomorphic integration and spatial transform layers. We learn network parameters in an unsupervised fashion, without access to ground truth registrations. We describe how the network yields fast diffeomorphic registration of a new image pair , in a probabilistic framework. We expand this treatment by including anatomical surface alignment, which enables training the network given (optional) anatomical segmentations.

2.1 Generative Model

We model the prior probability of the parametrization 



where  is the multivariate normal distribution with mean  and covariance . Our work applies to a wide range of representations . For example,  could be a dense displacement field, or a low-dimensional embedding of the displacement field. In this paper, we let  be a stationary velocity field that specifies a diffeomorphism through the ODE (1). We let  be the Laplacian of a neighborhood graph defined on the voxel grid, where  is the graph degree matrix, and  is a voxel neighbourhood adjacency matrix. We encourage spatial smoothness of the velocity field  by setting , where  is a precision matrix and  denotes a parameter controlling the scale of the velocity field .

We let  be a noisy observation of warped image :


where  captures the variance of additive image noise.

We aim to estimate the posterior registration probability . Using this, we can obtain the most likely registration field  for a new image pair  via MAP estimation, along with an estimate of velocity field variance at each voxel. Figure 1 provides a graphical representation of our model.

2.2 Learning

Given our assumptions, computing the posterior probability  is intractable. We use a variational approach, and introduce an approximate posterior probability  parametrized by . We minimize the KL divergence


which yields the negative of the variational lower bound of the model evidence [34]. We model the approximate posterior  as a multivariate normal:


where we let  be diagonal. To understand the effects of this assumption, we explore a non-diagonal covariance in a later section. The statistics  and the diagonal of can be interpreted as the voxel-wise mean and variance, respectively.

We estimate , and  using a convolutional neural network  parameterized by , as described in the next section. We learn parameters  by optimizing the variational lower bound (4) using stochastic gradient methods. Specifically, for each image pair  and sample , we compute , with the resulting loss (detailed derivation in supplementary material):


where  is the number of samples used to approximate the expectation. The first term encourages image  to be similar to the warped image . The second term encourages the posterior to be close to the prior . Although the variational covariance  is diagonal, the last term spatially smoothes the mean, which can be seen by expanding  , where are the neighbors of voxel . We treat  and  as fixed hyper-parameters that we investigate in our experiments, and use .

2.3 Neural Network Framework

We design the network   that takes as input  and  and outputs  and , based on a 3D UNet-style architecture [50]

. The network includes a convolutional layer with 32 filters, four downsampling layers with 64 convolutional filters and a stride of two, and three upsampling convolutional layers with 64 filters. All convolutional layers use 3x3 kernels and LeakyReLu activations. We emphasize that the exact architecture is not a focus of this paper, and many architectural choices are appropriate.

To enable unsupervised learning of parameters  using (6), we must form  and compute the data term. We first implement a layer that samples a new using the “re-parameterization trick" [34]: , where  is a sample from the standard normal: .

Given , we need to compute  as described in the introduction. We propose vector integration layers using scaling and squaring operations. Specifically, scaling and squaring operations involve compositions within the neural network architecture using a differentiable spatial transformation operation. Given two 3D vector fields and , for each voxel this operation computes , a non-integer voxel location in

, using linear interpolation. Starting with 

, we compute  recursively using these operations T times, leading to . In our experiments, we extensively analyze the effect of the step size  on the runtime of the network, the accuracy of the registration, and the regularity of the deformation. We also implement vector integration layers using quadrature and ODE solvers, and in the experiments show that these are significantly slower and can require significant memory.

Finally, we warp volume  according to the computed diffeomorphic field  using a spatial transform layer.

In summary, the network takes as input images  and , computes statistics  and , samples a new velocity field , computes a diffeomorphic  and warps 

. Since all the steps are designed to be differentiable, we learn the network parameters using stochastic gradient descent-based methods. This network results in three outputs, 

and , which are used in the model loss (6). The framework is summarized in Figure 2.

2.4 Registration

Given learned parameters, we approximate registration of a new scan pair  using . We first obtain the most likely velocity field  using


by evaluating the neural network . We then compute  using the scaling and squaring based integration, altogether requiring less than a second on a GPU. We highlight that at test time, the diagonal covariance  is not used, however it enables an estimation of the deformation uncertainty. Analysis of uncertainty is beyond the scope of this paper, and is an interesting avenue for future study.

Using a stationary velocity field representation, computing the inverse deformation field  can be achieved by integrating the negative of the velocity field: , since  [3, 45]. This enables the computation of both fields inside one efficient network when desired.

We implement our method as part of the VoxelMorph package [8], available online at http://voxelmorph.csail.mit.edu

, using neuron 


and Keras 


with a Tensorflow 

[1] backend.

2.5 Surface-based Registration

In various instances, anatomical segmentation maps for specific structures of interest may also be available with some of the training images. Recent papers have demonstrated that the use of segmentations can help in registration [9, 30]. Here, we show that our proposed model naturally extends to handle surfaces, enabling the use of segmentations during training within the same principled framework.

We focus on the case where one anatomical structure is segmented in the image. Given a segmentation map where each voxel is assigned the desired anatomical label or background, we extract the anatomical surface and let represent the  surface coordinates of the anatomical structure for image , which can be stored as an  matrix. Given the diffeomorphism  in the previous section, we model each surface location , as formed by displacing a matching surface location  according to , and adding (spatial) displacement noise:


where the composition  warps surface coordinates.

Figure 3: Left: an illustration of the surface distance function . Right: asymmetric surface behavior requires that we compute the surface distance in both directions. For example, computing  will be considerably smaller than  (recall that surface points are not directly corresponding.)

Given both images and segmentation maps during training, we extract surfaces of the desired structure and aim to estimate the posterior probability . As before, we use a variational approximation. Since segmentation maps, and hence surfaces, are usually derived from images, we assume that images are sufficient to approximate the posterior: . As before, we minimize the KL divergence between the true and approximate posterior (derived in supplementary material):


and arrive at the loss function:


Compared to the original model loss (6), the additional third term encourages the deformation  to warp the moving surface close to the fixed surface . As described in the generative model (8), this requires corresponding surface points in  and . However, these correspondences are not available in practice, as segmentations are provided independently for each image. Therefore, the third term cannot be computed directly.

We propose an approximation of the surface term using surface distance transforms. Let  be a surface distance function, which for location  returns the Euclidean distance to the closest surface point in  (Figure 3Left).111Function  is a generating function for a distance transform image for the surface , by evaluating it at every grid point  Noting that for two surfaces  and , due to potential asymmetries in the surfaces (see Figure 3Right), we approximate the distance  by computing  in both directions:


We implement this function efficiently using distance transforms and  points sampled along each structure, and take advantage of our diffeomorphic representation that enables computing the inverse  efficiently within the network.

Method Avg. Dice GPU sec CPU sec
Affine only 0.584 (0.157) 0 0 0
ANTs (SyN) 0.749 (0.136) - 9059 (2023) 9662 (6258)
NiftyReg (CC) 0.755 (0.143) - 2347 (202) 41251 (14336)
VoxelMorph (CC) 0.753 (0.145) 0.45 (0.01) 57 (1.0) 19077 (5928)
VoxelMorph-diff 0.754 (0.139) 0.47 (0.01) 84.2 (0.1) 0.1 (1.27)
Table 1: Summary of results: mean Dice scores over all anatomical structures and subjects (higher is better), mean runtime; and mean number of locations with a non-positive Jacobian of each registration field (lower is better). All methods have comparable Dice scores, while our method and the original VoxelMorph are orders of magnitude faster than ANTs or NiftiReg. Only our method, VoxelMorph-diff, achieves both high accuracy and fast runtime while also having nearly zero non-negative Jacobian locations. Each aspect of these results is studied in more details in the rest of the experiments and figures.

In summary, since to compute the posterior approximation  the neural network takes as input only the images  and , images alone are required at test time. Given a diffeomorphism , at training time the network uses both a warped image and a warped surface to evaluate the quality of the registration.

This model can also be used to register two surfaces when the images themselves are not available. The only modelling change required is removing the image likelihood terms and using the variational approximation , which uses the segmentation maps  and  as input. Surface-only registration is beyond the scope of this paper, and we leave it for future work.

The complete neural network framework, including the surface loss, is illustrated in supplemental Figure 11.

2.6 Non-diagonal Covariance

Approximating the velocity field covariance  using a diagonal matrix is a strong assumption that ignores spatial smoothness. As seen in (6), the spatially-smooth prior  encourages a smooth mean velocity field , but samples  might still be noisy.

To evaluate the effects of the diagonal covariance, we explore a second approximation  where  is a diagonal matrix returned by the neural network and  is a fixed smoothing convolution matrix. Specifically, for each row of  we create a flattened Gaussian smoothing kernel centered at a particular voxel, such that  is equivalent to 3D convolution of image  by a gaussian filter with variance . We choose  such that the smoothing operation matches the scale of the prior  determined by .

During training, sampling from the posterior is achieved using the reparametrization trick, where  is a sample from the standard normal. Intuitively, compared to the diagonal  approximation, this sampling procedure smoothes the term  before adding the mean .

In our experiments, we show that this approximation yields smoother velocity fields during training, and the effect diminishes with higher  values. However, the resulting deformation fields are diffeomorphic and accurate for both approximations, demonstrating that the diagonal covariance approximation is sufficient when working with diffeomorphisms.

Figure 4: Boxplots indicating Dice scores for anatomical structures for baselines ANTs, NiftiReg, VoxelMorph (CC), and finally our algorithm VoxelMorph-diff. Left and right hemisphere structures are merged for visualization, and ordered by average ANTs Dice score. In general, all four algorithms demonstrate comparable results, each performing slightly better in some structures and slightly worse in others.

3 Experiments

We perform a series of experiments demonstrating that the proposed probabilistic image registration framework achieves accuracy and runtime comparable to state-of-the-art methods while enabling diffeomorphic deformations. We also show the improvements enabled by the extended surface model, and analyze the effect of the various integration layers during test time.

We focus on atlas-based registration, a common task in population analysis. Specifically, we register each scan to an atlas computed using external data [24, 54]. Because we implement our algorithm as part of the VoxelMorph framework, we will refer to it as VoxelMorph-diff.

Figure 5: Example MR slices of input moving image, atlas, and resulting warped image for our method and ANTs, with overlaid boundaries of ventricles, thalami and hippocampi. Our resulting registration field is shown as a warped grid and RGB image, with each channel representing dimension. We omit VoxelMorph (CC) and NiftyReg examples, which are visually similar to our results and ANTs. More examples are provided in the supplementary material Figure 12.
Figure 6: Surface results for the proposed VoxelMorph models. Left: maximum Euclidean surface distance (lower is better). Middle: median Euclidean surface distance (lower is better). Right: mean Dice (higher is better). VoxelMorph-surf trained with surfaces of the desired structures achieves significantly smaller surface distances and larger Dice scores on each structure. We use left hemisphere white matter (WM), gray matter (GM), lateral ventricle (LV), Thalamus (T), and hippocampus (H).

3.1 Experiment setup

3.1.1 Data and Preprocessing

We use a large-scale, multi-site dataset of 3731 T1-weighted brain MRI scans from eight publicly available datasets: OASIS [39], ABIDE [22], ADHD200 [41], MCIC [26], PPMI [40], HABS [15], and Harvard GSP [29]. Acquisition details, subject age ranges and health conditions are different for each dataset. We performed standard pre-processing steps on all scans, including resampling to mm isotropic voxels, affine spatial normalization and brain extraction for each scan using FreeSurfer [24]. We crop the final images to . Segmentation maps including 29 anatomical structures, obtained using FreeSurfer for each scan, are used in evaluating registration results. Each image contains roughly  million brain voxels. We split the dataset into 3231, 250, and 250 volumes for train, validation, and test sets respectively, although we underscore that the training is unsupervised.

3.1.2 Evaluation Metrics

To evaluate a registration algorithm, we register each subject to an atlas, propagate the segmentation map using the resulting warp, and measure volume overlap using the Dice metric. For the surface experiments, we also employ the Euclidean surface distance, computed using the strategy described in (11).

We also evaluate the diffeomorphic property, a focus of our work. Specifically, the Jacobian matrix captures the local properties of around voxel . The local deformation is diffeomorphic, both invertible and orientation-preserving, only at locations for which , where is the determinant operator [3]. We count all other (folding) voxels, where .

3.1.3 Baseline Methods

We compare our approach with the popular ANTs software package using Symmetric Normalization (SyN) [6], a top-performing algorithm [35]. We found that the default ANTs settings are sub-optimal for our task, and performed a wide parameter and similarity metric search across several datasets. We identified and use top performing parameter values for the Dice metric using: the cross-correlation (CC) loss function, SyN step size of 0.25, Gaussian smoothing of (9, 0.2) and three scales of 201 iterations. We also test the NiftyReg package, for which we use a multi-threaded CPU implementation as a GPU implementation is not currently available.222We compiled the latest source code from March, 2018 (tree [4e4525]). We experimented with different parameter settings for improved behavior, and used the following setting: CC cost function, grid spacing of 5, and 500 iterations.

To compare with recent learning-based registration approaches, we also test our recent CNN-based method, VoxelMorph, which produces state-of-the-art fast and accurate registration, but does not yield diffeomorphic results [8, 9]. We sweep the regularization parameter using our validation set, and use the optimal regularization parameter of 1 in our results.

3.2 Image Registration

Figure 7: Average Dice score for VoxelMorph-surf models on the validation set. We test various values of the spatial noise parameter , for both the desired structures observed during training (obs) and all structures (all). For a range of values of , we find significant increases for observed surfaces when using the generative surface model.

Table 1 provides a summary of the results on the held-out test set. Figure 5 and supplementary material Figure 12 show representative results. Figure 4 illustrates Dice results on several anatomical structures. For better visualization, we combine the same structures from the two hemispheres, such as the left and right hippocampus. Our algorithm, VoxelMorph-diff, achieve state-of-the-art Dice results and runtimes, but produces diffeomorphic registration fields (nearly no folding voxels per scan) in a probabilistic framework.

All methods achieve comparable Dice results on each structure and overall. Learning-based methods require a fraction of the baseline runtimes to register two images: less than a second on a GPU, and less than a minute and a half on a CPU. Runtimes were computed for an NVIDIA TitanX GPU and a Intel Xeon (E5-2680) CPU, and exclude preprocessing common to all methods.

Our method outputs positive Jacobians at nearly all voxels, which we analyze in more detail in a later section. We find that for most scans, the deformation fields result in zero folding voxels, while very few volumes have a handful of grouped folding voxels, averaging out to less than a voxel per scan in our test set. In contrast, the deformation fields resulting from the baseline methods contain a few thousand locations of non-positive Jacobians for each scan (Table 1), usually grouped in clusters. This may be alleviated with increased spatial regularization or more optimization iterations, but this in turn leads to a drop in performance on the Dice metric or even longer runtimes.

ANTs SyN (CC) 9060 (4445) 0.545 (0.267)
NiftyReg (CC) 40425 (9901) 2.431 (0.595)
VoxelMorph (CC) 19077 (5928) 1.147 (0.360)
VoxelMorph-diff 0.1 (1.2) 6.1e-6 (7.6e-5)
VoxelMorph-surf (W.M.) 3.0 (6.2) 1.8e-4 (3.8e-4)
VoxelMorph-surf (cortex) 3.4 (6.4) 2.0e-4 (3.9e-4)
VoxelMorph-surf (lat. ven.) 4.0 (8.0) 2.3e-4 (4.8e-4)
VoxelMorph-surf (thalamus) 4.3 (8.2) 2.6e-4 (4.9e-4)
VoxelMorph-surf (hipp.) 2.7 (5.8) 1.6e-4 (3.5e-4)
Table 2: Regularity measures for image and surface models on the test set. Leveraging diffeomorphic aspect of our joint image and surface model, VoxelMorph-surf preserved very low numbers of folding voxels even when training with example surfaces.

3.3 Image and Surface Registration

In this section, we evaluate the generative surface model. We demonstrate the use of anatomical segmentation alongside images during training, and refer to this model as VoxelMorph-Surf. We focus on the setting where one structure of interest is available during training, and learn separate networks for the left white matter, gray matter, ventricle, thalamus and hippocampus. Our goal is to analyze how the additional surface model terms affect the accuracy and regularity of resulting deformations.

Figure 7 illustrates the behaviour of the model with respect to the hyper-parameter  on the validation set. For a range of  values, we find a significant improvement in terms of Dice for the desired structure. For very small values of , the training becomes unstable leading to poor generalization. A very large  value leads to the model ignoring the surface term. Since the Dice scores are comparable in the range , for the rest of this section we use , which exhibits slightly fewer folding voxels ( compared to for ).

Figure 6 demonstrates the improvement on the test set in terms of Euclidean surface distance and Dice, compared to the image-only registration model VoxelMorph-diff. VoxelMorph-surf improves significantly in all measures for most desired structures. Additionally, Table 2 illustrates that with increased accuracy in both metrics, the number of folding voxels in the entire volume increases only very slightly (to an average 3.5 voxels per volume), which remains orders of magnitude fewer than the baseline methods (Table 1). Figure 13 in the supplementary material illustrates example results.

In summary, the principled joint diffeomorphic model enables the use of surfaces during training which dramatically improves registration near a given structure while preserving desired deformation properties. For example, given hippocampus surfaces at training, registration using VoxelMorph-Surf improves Dice by  points over VoxelMorph-diff, improves maximum surface distance by more than three voxels, and preserving diffeomorphisms (less than three folding voxels per scan).

3.4 Analysis

3.4.1 Parameter Analysis

The two main hyper-parameters, smoothing precision  and image noise 

, have physical meaning in our generative model. However, they share a single degree of freedom in the loss function. We set 

, and vary the precision scale  between 0.5 and 100. Figure 8 shows average Dice scores for 50 validation set scans for different parameter values, showing that the results vary smoothly over a large range, with reasonable behavior even near . We use  in our experiments above.

Figure 8: Dice score (computed using 50 validation scans) for VoxelMorph-diff with various values of the precision parameter .

3.4.2 Integration Steps

During training, we hold the number of scaling and squaring steps fixed. However, this number can be varied at test time, affecting aspects of the resulting deformation field. In this section, we analyze the effects of the number of steps on accuracy, runtime, field regularity and invertability. We perform this experiment using 50 validation subjects and the image registration network VoxelMorph-diff trained with integration steps and regularization parameter . The velocity field is computed every two voxels, but all of the conclusions in this section are likely to apply to many reasonable field spacings.

Figure 9 summarizes the analysis results. The runtime increases modestly with the number of steps, and is overall significantly smaller than the cost of the rest of the network (i.e. the deformation network computation of the velocity field, and the spatial transform of the full moving image). After four scaling and squaring steps, the method achieved maximum Dice score. We observe a steep decline in the number of folding voxels (note the log-scale vertical axis), reaching less than five voxels after five scaling and squaring steps, compared to classical methods which can include thousands of such voxels (Table 1). Finally, we measure the average displacement error after inverting the deformation fields: . We find that after five scaling and squaring steps, even the worst error is under a half voxel, indicating that five steps are sufficient to ensure invertible deformations.

In addition, we implemented the integration of the velocity field using Tensorflow ODE solvers and using standard quadrature, but found that these required significantly more runtime compared to the scaling and squaring strategy, consistent with literature findings. Specifically, while five scaling and squaring operations required  seconds, equivalent quadrature integration required  operations (occupying prohibitive amounts of memory) and  seconds, and ODE-solver based integration with default parameter required a single layer and  seconds. At comparable integration settings such as these, all three methods achieve similar Dice scores of . While these alternative methods require significant resources, all three implementations are available in our source code for experimentation.

This analysis indicates that the proposed scaling and squaring network integration layer is efficient and accurate. Increasing the number of scaling and squaring layers incurs a negligible runtime cost while improving deformation field properties. We use  squaring steps in the test experiments above.

Figure 9: The effect of different number of scaling and squaring steps on the registration accuracy, runtime, deformation regularity and invertability. We find that after five scaling and squaring steps, our model, VoxelMorph-diff, is able to produce state-of-the-art accuracy while having essentially no folding (note the log-scale vertical axis in the top-right graph). Similarly, it is able to produce invertible deformations, as seen by the measure of the displacement error . The total runtime cost of the scaling and squaring operations is below the runtime of the rest of the networks, indicating that increasing the number of steps improves deformation properties for trivial runtime cost.

3.4.3 Velocity Sampling and Uncertainty

We also evaluate the modeling assumptions of the variational covariance . Figure 10 illustrates example samples of the velocity field  and voxel-wise empirical variance for the two  approximations: diagonal covariance and the extended approximation in Section 2.6 that smooths samples . For under-regularized networks (very low values for hyper-parameter ), the latter approximation yields smoother velocity fields. However, given a higher hyper-parameter  value, such as the one used in our experiments, the network learns smaller values for the diagonal  approximation, and yields smooth samples  with either method. Futhermore, despite the difference in smoothness of the velocity field samples , the integration operation leads to equally regular and accurate deformations  for a given  (Table 3).

Therefore, although the diagonal covariance has the potential to add noise to velocity field samples, the loss function coupled with the integration operation lead to smooth and accurate deformation fields  at reasonable  values. Diagonal covariances would likely have negative effects in a different deformation model, for example if  would be the displacement field itself.

Figure 10: Illustration of voxel independence assumption in variational approximations for two prior parameters  (top) and  (bottom). Each row contains an example sample velocity field 

, and the voxel-wise standard deviation over 500 samples for that subject.

diagonal extension diagonal extension
0.74 (0.01) 0.74 (0.01) 2934 (2007) 2720 (1593)
0.75 (0.01) 0.75 (0.01) 0.32 (0.96) 0.16 (0.57)
Table 3: Accuracy and deformation regularity for the different variational approximations and two dramatically different values for smoothing parameter . We find that for a given parameter value, the approximations lead to comparable accuracy and number of folding voxels.

4 Discussion and Conclusion

In this work, we build a principled connection between classical registration methods and recent learning-based approaches. We propose a probabilistic model for diffeomorphic image registration and derive a learning algorithm that leverages a convolutional neural network and unsupervised, end-to-end learning for fast runtime. To achieve diffeomorphic transforms, we integrate stationary velocity fields through novel scaling and squaring differentiable network operations, and provide implementation and analysis for other integration layers.

Although the simplifying diagonal approximation to the velocity covariance  adds voxel-independent noise to every velocity field sample , the resulting deformation fields are well behaved because of our smoothing prior and diffeomorphic representation.

We also provide an anatomical surface deformation model. If image segmentations are available for a particular anatomical structure, the generative model incorporates them naturally in the same joint framework during training, while not requiring the surfaces at test time.

Our algorithm can infer the registration of new image pairs in under a second. Compared to traditional methods, our approach is significantly faster, and compared to recent learning based methods, our method offers diffeomorphic guarantees. We demonstrate that the surface extension to our model can help improve registration while preserving properties such as low runtime and diffeomorphisms.

Our focus in this framework has been to present the technical connection between classical and learning paradigms, and show that diffeomorphisms are attainable in a very low runtime. Immediate extensions can enable other models and applications. For example, our derivation is generalizable to other formulations:  can be a low dimensional embedding representation of a deformation field, or the displacement field itself. Similarly, the variational covariance  enables an estimation of the uncertainty of the deformation field at each voxel, which can be informative in downstream tasks such as biomedical segmentation or population analysis. The model is also widely applicable to other applications, such as subject-to-subject registration, segmentation-only registration, or using multiple surfaces to improve image-based registration.

5 Acknowledgments

This research was funded by NIH grants R01LM012719, R01AG053949, and 1R21AG050122, and NSF NeuroNex Grant grant 1707312.


  • [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [2] Vincent Arsigny, Olivier Commowick, Xavier Pennec, and Nicholas Ayache. A log-euclidean framework for statistics on diffeomorphisms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 924–931. Springer, 2006.
  • [3] J. Ashburner. A fast diffeomorphic image registration algorithm. Neuroimage, 38(1):95–113, 2007.
  • [4] J. Ashburner and K. Friston. Voxel-based morphometry-the methods. Neuroimage, 11:805–821, 2000.
  • [5] Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis, 12(1):26–41, 2008.
  • [6] Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis, 12(1):26–41, 2008.
  • [7] R. Bajcsy and S. Kovacic. Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing, 46:1–21, 1989.
  • [8] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. An unsupervised learning model for deformable medical image registration. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 9252–9260, 2018.
  • [9] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: A learning framework for deformable medical image registration. arXiv preprint arXiv:1809.05231, 2019.
  • [10] M Faisal Beg, Michael I Miller, Alain Trouvé, and Laurent Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vision, 61:139–157, 2005.
  • [11] Xiaohuan Cao, Jianhua Yang, Jun Zhang, Dong Nie, Minjeong Kim, Qian Wang, and Dinggang Shen. Deformable image registration based on similarity-steered cnn regression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 300–308. Springer, 2017.
  • [12] Yan Cao, Michael I Miller, Raimond L Winslow, and Laurent Younes. Large deformation diffeomorphic metric mapping of vector fields. IEEE transactions on medical imaging, 24(9):1216–1230, 2005.
  • [13] Can Ceritoglu, Kenichi Oishi, Xin Li, Ming-Chung Chou, Laurent Younes, Marilyn Albert, Constantine Lyketsos, Peter CM van Zijl, Michael I Miller, and Susumu Mori.

    Multi-contrast large deformation diffeomorphic metric mapping for diffusion tensor imaging.

    Neuroimage, 47(2):618–627, 2009.
  • [14] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
  • [15] Alexander Dagley, Molly LaPoint, Willem Huijbers, Trey Hedden, Donald G McLaren, Jasmeer P Chatwal, Kathryn V Papp, Rebecca E Amariglio, Deborah Blacker, Dorene M Rentz, et al. Harvard aging brain study: dataset and accessibility. NeuroImage, 2015.
  • [16] Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning for fast probabilistic diffeomorphic registration. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 729–738, Cham, 2018. Springer.
  • [17] Adrian V Dalca, Andreea Bobu, Natalia S Rost, and Polina Golland. Patch-based discrete registration of clinical brain images. In MICCAI-PATCHMI Patch-based Techniques in Medical Imaging. Springer, 2016.
  • [18] Adrian V Dalca, John Guttag, and Mert R Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9290–9299, 2018.
  • [19] Christos Davatzikos. Spatial transformation and registration of brain images using elastically deformable models. Computer Vision and Image Understanding, 66(2):207–222, 1997.
  • [20] Bob D de Vos, Floris F Berendsen, Max A Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis, 52:128–143, 2019.
  • [21] Bob D de Vos, Floris F Berendsen, Max A Viergever, Marius Staring, and Ivana Išgum. End-to-end unsupervised deformable image registration with a convolutional neural network. In DLMIA, pages 204–212. Springer, 2017.
  • [22] A. Di Martino et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry, 19(6):659–667, 2014.
  • [23] Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
  • [24] B. Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.
  • [25] B. Glocker, N. Komodakis, G. Tziritas, N. Navab, and N. Paragios.

    Dense image registration through mrfs and efficient linear programming.

    Medical image analysis, 12(6):731–741, 2008.
  • [26] Randy L Gollub, Jody M Shoemaker, Margaret D King, Tonya White, Stefan Ehrlich, Scott R Sponheim, Vincent P Clark, Jessica A Turner, Bryon A Mueller, Vince Magnotta, et al. The MCIC collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics, 11(3):367–388, 2013.
  • [27] Mattias P Heinrich, Mark Jenkinson, Michael Brady, and Julia A Schnabel. Mrf-based deformable registration and ventilation estimation of lung ct. IEEE transactions on medical imaging, 32(7):1239–1248, 2013.
  • [28] Monica Hernandez, Matias N Bossa, and Salvador Olmos. Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows. International Journal of Computer Vision, 85(3):291–306, 2009.
  • [29] Avram J Holmes, Marisa O Hollinshead, Timothy M O’Keefe, Victor I Petrov, Gabriele R Fariello, Lawrence L Wald, Bruce Fischl, Bruce R Rosen, Ross W Mair, Joshua L Roffman, et al. Brain genomics superstruct project initial data release with structural, functional, and behavioral measures. Scientific data, 2, 2015.
  • [30] Yipeng Hu, Marc Modat, Eli Gibson, Nooshin Ghavami, Ester Bonmati, Caroline M Moore, Mark Emberton, J Alison Noble, Dean C Barratt, and Tom Vercauteren.

    Label-driven weakly-supervised learning for multimodal deformable image registration.

    In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1070–1074. IEEE, 2018.
  • [31] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis, 49:1–13, 2018.
  • [32] Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
  • [33] Sarang C Joshi and Michael I Miller. Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing, 9(8):1357–1370, 2000.
  • [34] D.P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.
  • [35] Arno Klein, Jesper Andersson, Babak A Ardekani, John Ashburner, Brian Avants, Ming-Chang Chiang, Gary E Christensen, D Louis Collins, James Gee, Pierre Hellier, et al. Evaluation of 14 nonlinear deformation algorithms applied to human brain mri registration. Neuroimage, 46(3):786–802, 2009.
  • [36] Julian Krebs et al. Robust non-rigid registration through agent-based action learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 344–352, Cham, 2017. Springer International Publishing.
  • [37] Julian Krebs, Tommaso Mansi, Boris Mailhé, Nicholas Ayache, and Hervé Delingette. Unsupervised probabilistic deformation modeling for robust diffeomorphic registration. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 101–109. Springer, 2018.
  • [38] H. Li and Y. Fan. Non-rigid image registration using fully convolutional networks with deep self-supervision. arXiv preprint arXiv:1709.00799, 2017.
  • [39] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience, 19(9):1498–1507, 2007.
  • [40] Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The parkinson progression marker initiative (ppmi). Progress in neurobiology, 95(4):629–635, 2011.
  • [41] Michael P Milham, Damien Fair, Maarten Mennes, Stewart HMD Mostofsky, et al. The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in systems neuroscience, 6:62, 2012.
  • [42] Michael I Miller, M Faisal Beg, Can Ceritoglu, and Craig Stark. Increasing the power of functional maps of the medial temporal lobe by using large deformation diffeomorphic metric mapping. Proceedings of the National Academy of Sciences, 102(27):9685–9690, 2005.
  • [43] Marc Modat, David M Cash, Pankaj Daga, Gavin P Winston, John S Duncan, and Sébastien Ourselin. Global image registration using a symmetric block-matching approach. Journal of Medical Imaging, 1(2):024003, 2014.
  • [44] Marc Modat, Gerard R Ridgway, Zeike A Taylor, Manja Lehmann, Josephine Barnes, David J Hawkes, Nick C Fox, and Sébastien Ourselin. Fast free-form deformation using graphics processing units. Computer methods and programs in biomedicine, 98(3):278–284, 2010.
  • [45] Marc Modat, Ivor JA Simpson, Manual Jorge Cardoso, David M Cash, Nicolas Toussaint, Nick C Fox, and Sébastien Ourselin. Simulating neurodegeneration through longitudinal population analysis of structural and diffusion weighted mri data. Medical Image Computing and Computer-Assisted Intervention, LNCS 8675:57–64, 2014.
  • [46] Cleve Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM review, 45(1):3–49, 2003.
  • [47] Kenichi Oishi, Andreia Faria, Hangyi Jiang, Xin Li, Kazi Akhter, Jiangyang Zhang, John T Hsu, Michael I Miller, Peter CM van Zijl, Marilyn Albert, et al. Atlas-based whole brain white matter analysis using large deformation diffeomorphic metric mapping: application to normal elderly and alzheimer’s disease participants. Neuroimage, 46(2):486–499, 2009.
  • [48] Xavier Pennec, Pascal Cachier, and Nicholas Ayache. Understanding the "demon’s algorithm": 3d non-rigid registration by gradient descent. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 597–605, 1999.
  • [49] Marc-Michel Rohé, Manasi Datar, Tobias Heimann, Maxime Sermesant, and Xavier Pennec. SVF-Net: Learning deformable image registration using shape matching. In MICCAI, pages 266–274. Springer, 2017.
  • [50] O. Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • [51] Daniel Rueckert, Luke I Sonoda, Carmel Hayes, Derek LG Hill, Martin O Leach, and David J Hawkes. Nonrigid registration using free-form deformation: Application to breast mr images. IEEE Transactions on Medical Imaging, 18(8):712–721, 1999.
  • [52] D. Shen and C. Davatzikos. Hammer: Hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imag., 21(11):1421–1439, 2002.
  • [53] Hessam Sokooti, Bob de Vos, Floris Berendsen, Boudewijn PF Lelieveldt, Ivana Išgum, and Marius Staring. Nonrigid image registration using multi-scale 3d convolutional neural networks. In MICCAI, pages 232–239, Cham, 2017. Springer.
  • [54] R. Sridharan, A.V. Dalca, K.M. Fitzpatrick, L. Cloonan, A. Kanakis, O. Wu, K.L. Furie, J. Rosand, N.S. Rost, and P. Golland. Quantification and analysis of large multimodal clinical image studies: Application to stroke. In International Workshop on Multimodal Brain Image Analysis, pages 18–30. Springer International Publishing, 2013.
  • [55] J.P. Thirion. Image matching as a diffusion process: an analogy with maxwell’s demons. Medical Image Analysis, 2(3):243–260, 1998.
  • [56] Tom Vercauteren et al. Diffeomorphic demons: Efficient non-parametric image registration. NeuroImage, 45(1):S61–S72, 2009.
  • [57] Xiao Yang, Roland Kwitt, Martin Styner, and Marc Niethammer. Quicksilver: Fast predictive image registration–a deep learning approach. NeuroImage, 158:378–396, 2017.
  • [58] BT Thomas Yeo, Mert R Sabuncu, Tom Vercauteren, Daphne J Holt, Katrin Amunts, Karl Zilles, Polina Golland, and Bruce Fischl. Learning task-optimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex. IEEE transactions on medical imaging, 29(7):1424–1441, 2010.
  • [59] Miaomiao. Zhang, Ruizhi. Liao, Adrian V Dalca, Esra A Turk, Jie Luo, P Ellen Grant, and Polina Golland. Frequency diffeomorphisms for efficient image registration. In International Conference on Information Processing in Medical Imaging, pages 559–570. Springer, 2017.

Supplementary Material

Derivation of Main Loss

Using the facts that  is constant, , and , and approximating the expectation with  samples , we obtain


Derivation of Surface VLB and Loss

We derive the variational lower bound and loss for the generative surface model. We start by minimizing the KL divergence between the true and approximate posteriors:


where in  we used the assumptions that the fixed image is independent of anatomical surfaces given the moving image and the deformation, and the fixed surface is independent of either image given the moving surface and the deformation. Following this variational lower bound, the loss follows the previous section closely (see (13)), with the additional term:


Combining this term with (12), and approximating expectations with  samples leads to the final loss (10):


Overview figure with surface loss

Figure 11: Overview of end-to-end unsupervised architecture building on Figure 2. The first part of the network,  takes the input images and outputs the approximate posterior probability parameters representing the velocity field mean, , and variance, . A velocity field  is sampled and transformed to a diffeomorphic deformation field  using novel differentiable squaring and scaling layers. Finally, a spatial transform warps  to obtain . The blue window illustrated the computation of optional surface registration loss. The surface points and distance transform are computed for the both the moving and fixed surfaces. The surface points are warped by the resulting deformation, and a distance is computed using distance transforms.

Additional Figures

Figure 12: Additional example MR slices of input moving image, atlas, and resulting warped image for our method (VoxelMorph-diff) and ANTs, with overlaid boundaries of ventricles, thalami and hippocampi. Each row is a different scan. Our resulting registration field is shown as a warped grid and RGB image, with each channel representing a dimension. We omit VoxelMorph and NiftyReg examples, which are visually similar to our results and ANTs.
Figure 13: Example surface-driven results. For three subjects, we show cropped MR slices of input moving image, atlas, and resulting warped image for our method and ANTs, with overlaid boundaries of ventricles for VoxelMorph-diff (top) and VoxelMorph-surf (bottom). For each set of two rows, we highlight in the red box an improvement in the segmentation of the ventricle from the top row to the bottom row.