VoxelMorph: A Learning Framework for Deformable Medical Image Registration

by   Guha Balakrishnan, et al.

We present VoxelMorph, a fast, unsupervised, learning-based algorithm for deformable pairwise medical image registration. Traditional registration methods optimize an objective function independently for each pair of images, which is time-consuming for large datasets. We define registration as a parametric function, implemented as a convolutional neural network (CNN). We optimize its global parameters given a set of images from a collection of interest. Given a new pair of scans, VoxelMorph rapidly computes a deformation field by directly evaluating the function. Our model is flexible, enabling the use of any differentiable objective function to optimize these parameters. In this work, we propose and extensively evaluate a standard image matching objective function as well as an objective function that can use auxiliary data such as anatomical segmentations available only at training time. We demonstrate that the unsupervised model's accuracy is comparable to state-of-the-art methods, while operating orders of magnitude faster. We also show that VoxelMorph trained with auxiliary data significantly improves registration accuracy at test time. Our method promises to significantly speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration and its applications. Our code is freely available at voxelmorph.csail.mit.edu.


page 2

page 4

page 6


An Unsupervised Learning Model for Deformable Medical Image Registration

We present an efficient learning-based algorithm for deformable, pairwis...

A Learning Framework for Diffeomorphic Image Registration based on Quasi-conformal Geometry

Image registration, the process of defining meaningful correspondences b...

Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning with Pairwise Alignment

Atlas building and image registration are important tasks for medical im...

Deformable Registration through Learning of Context-Specific Metric Aggregation

We propose a novel weakly supervised discriminative algorithm for learni...

Generating Patient-like Phantoms Using Fully Unsupervised Deformable Image Registration with Convolutional Neural Networks

The use of Convolutional neural networks (ConvNets) in medical imaging r...

Learning the Effect of Registration Hyperparameters with HyperMorph

We introduce HyperMorph, a framework that facilitates efficient hyperpar...

Fast Predictive Simple Geodesic Regression

Deformable image registration and regression are important tasks in medi...

Code Repositories


Multi Atlas Segmentation for 3D MRI image

view repo


CNN for segmentation of 3D images

view repo

I Introduction

Deformable registration is a fundamental task in a variety of medical imaging studies, and has been a topic of active research for decades. In deformable registration, a dense, non-linear correspondence is established between a pair of images, such as 3D MR brain scans. Traditional registration methods solve an optimization problem for each volume pair by aligning voxels with similar appearance while enforcing constraints on the registration mapping. Unfortunately, solving an optimization is computationally intensive, and therefore extremely slow in practice. For example, state-of-the-art algorithms can take over two hours to register a pair of scans.

We propose a novel registration method that learns a parametrized registration function from a collection of volumes. We implement the function using a convolutional neural network (CNN), that takes two -D input volumes and outputs a mapping of all voxels of one volume to another volume. The parameters of the network, the convolutional kernel weights, are optimized using a training set of volumes from the dataset of interest. By sharing the same parameters for a collection of volumes, the procedure learns a common representation that enables alignment of a new pair of volumes from the same distribution. In essence, we replace a costly optimization of traditional registration algorithms for each test image pair with one global function optimization during a training phase. Registration of a new test scan pair is achieved by simply evaluating the learned function on the given volumes, resulting in rapid registration. We implement our method as a general purpose framework, VoxelMorph, available at voxelmorph.csail.mit.edu.

Our model is flexible, enabling the use of any differentiable objective function to optimize the parameters of the CNN. In this paper, we present and extensively evaluate two objective functions. The first uses only the input volume pair and the proposed registration field of our model. Similar to traditional image registration algorithms, this loss function drives our method to produce registration fields that match voxels with similar appearance, while varying smoothly in space. The second objective function is useful when anatomical segmentations are available at training time for a subset of the data.

Throughout this study, we use the example of registering 3D MR brain scans. However, our method is broadly applicable to other registration tasks, both within and beyond the medical imaging domain. We evaluate our work on a multi-study dataset of over 7,000 scans containing images of healthy and diseased brains from a variety of age groups. Our unsupervised model achieves comparable accuracy to state-of-the-art registration, while taking orders of magnitude less time. Registration with VoxelMorph requires less than a minute using a CPU and under a second with a GPU, in contrast to the state-of-the-art baseline which takes over 2 hours on the CPU. This is of significant practical importance for many medical image tasks, including population analyses.

This paper extends a preliminary version of the work presented at the 2018 International Conference on Computer Vision and Pattern Recognition 

[1]. We build on that work by introducing an auxiliary learning model that uses anatomical segmentations during training, and demonstrate that training with segmentations can dramatically improve registration accuracy in new test image pairs where segmentation maps are not available. We discuss an interpretation of our model as amortized optimization, and present a rigorous analysis quantifying the effect of training set size on accuracy. We demonstrate that VoxelMorph can be used even with small datasets. We also perform parameter analyses for all of the experiments.

The paper is organized as follows. Section 2 introduces medical image registration and Section 3 describes related work. Section 4 presents our unsupervised and supervised methods. Section 5 presents experimental results on MRI data. We discuss insights of the results and conclude in Section 6.

Ii Background

In the traditional volume registration formulation, one (moving or source) volume is warped to align with a second (fixed or target) volume. Fig. 1

shows sample 2D coronal slices taken from 3D MRI volumes, with boundaries of several anatomical structures outlined. There is significant variability across subjects, caused by natural anatomical brain variations and differences in health state. Deformable registration enables comparison of structures between scans. Such analyses are useful for understanding variability across populations or the evolution of brain anatomy over time for individuals with disease. Deformable registration strategies often involve two steps: an initial affine transformation for global alignment, followed by a much slower deformable transformation with more degrees of freedom. We concentrate on the latter step, in which we compute a dense, nonlinear correspondence for all voxels.

Most existing deformable registration algorithms iteratively optimize a transformation based on an energy function [2]. Let and denote the fixed and moving images, respectively, and let be the registration field. The optimization problem can be written as:


where represents warped by , function

measures image similarity between its two inputs,

imposes regularization, and is the regularization trade-off parameter.

There are several common formulations for , and . Often,

is a displacement vector field, specifying the vector offset from

to for each voxel [3]. Diffeomorphic transforms model as the integral of a velocity vector field, preserving topology and maintaining invertibility on  [4]. Common metrics used for include intensity mean squared error, mutual information [5], and cross-correlation [6]. The latter two are particularly useful when volumes have varying intensity distributions and contrasts. enforces a spatially smooth deformation, often modeled as a linear operator on spatial gradients of .

Traditional algorithms optimize (1) for each volume pair. This is prohibitively expensive when registering many volumes, for example as part of population-wide analyses. In contrast, we assume that a field can be computed by a parameterized function of the data. We optimize the function parameters by minimizing the expected energy of the form of (1) over a dataset of volume pairs. Essentially, we replace pair-specific optimization of the deformation field by global optimization of the shared parameters, which has been referred to as amortization in other domains [7, 8, 9, 10]

. Once the global function is estimated, a field can be produced by evaluating the function on a given volume pair. In this paper, we use a displacement-based vector field, and focus on various aspects of the learning framework and its advantages. However, we recently demonstrated that velocity-based representations are also possible in a VoxelMorph-like framework 


Fig. 1: Example coronal slices from the MRI brain dataset, after affine alignment. Each column is a different scan (subject) and each row is a different coronal slice. Some anatomical regions are outlined using different colors: L/R white matter in light/dark blue, L/R ventricles in yellow/red, and L/R hippocampi in purple/green. There are significant structural differences across scans, necessitating a deformable registration step to analyze inter-scan variations.

Iii Related Work

Iii-a Medical Image Registration (Non-learning-based)

There is extensive work in 3D medical image registration [12, 6, 3, 4, 13, 14, 15, 16, 17]. Several studies optimize within the space of displacement vector fields. These include elastic-type models [3, 18, 19], statistical parametric mapping [20], free-form deformations with b-splines [21], discrete methods [13, 14] and Demons [22, 15]. Diffeomorphic transforms, which are topology-preserving, have shown remarkable success in various computational anatomy studies. Popular formulations include Large Diffeomorphic Distance Metric Mapping (LDDMM) [4, 23, 24, 25, 26, 27, 28, 17], DARTEL [12], diffeomorphic demons [29], and standard symmetric normalization (SyN) [6]. All of these non-learning-based approaches optimize an energy function for each image pair, resulting in slow registration.

Iii-B Medical Image Registration (Learning-based)

There are several recent papers proposing neural networks to learn a function for medical image registration. Most of these rely on ground truth warp fields [30, 31, 32, 33, 34, 35]. A ground truth warp field is ill-defined and difficult to acquire, and using fields derived via conventional registration tools as ground truth can introduce a bias and require a long time to collect. In contrast, VoxelMorph is unsupervised, and is also capable of leveraging auxiliary information such as segmentations during training if those are available. Two recent works [36, 37] present unsupervised methods. Both propose a neural network consisting of a CNN and spatial transformation function [38] that warps images to one another. These methods are preliminary and are only demonstrated on limited subsets of volumes, such as 3D subregions or 2D slices, and support only small transformations. Others [36]

employ regularization only implicitly determined by interpolation methods. A recent method proposed a segmentation label-driven cost function to be used in registering different imaging modalities- T2w MRI and 3D ultrasound- within the same subject 

[31, 30]. In contrast to these methods, our generalizable method is demonstrated on entire 3D volumes between different subjects, handles large deformations, and supports any differentiable cost function. We extend the preliminary version of our method [1] to show how to leverage auxiliary information during training, and provide extensive analysis of the effect of different label subsets on registration quality.

Iii-C 2D Image Alignment

Optical flow estimation is a related registration problem for 2D images. Optical flow algorithms return a dense displacement vector field depicting small displacements between a pair of 2D images. Traditional optical flow approaches typically solve an optimization problem similar to (1) using variational methods [39, 40, 41]. Extensions that better handle large displacements or dramatic changes in appearance include feature-based matching [42, 43] and nearest neighbor fields [44].

Several learning-based approaches to dense 2D image alignment have been proposed. These methods learn a low-dimensional basis for optical flow in natural images using PCA [45] or learn a parametric function using convolutional neural networks [46, 47]. These methods require ground truth registrations during training.

The spatial transform layer enables neural networks to perform global parametric 2D image alignment without requiring supervised labels [38]. This approach has been used for dense spatial transformations as well [48, 49]. We extend the spatial transformer to the -D setting in our work.

Fig. 2: Overview of the method. We learn parameters for a function , and register 3D volume to a second, fixed volume . During training, we warp with using a spatial transformer function. Optionally, auxiliary information such as anatomical segmentations can be leveraged during training (blue box).

Iv Method

Let be two image volumes defined over an -D spatial domain . For the rest of this paper, we focus on the case but our method and implementation are dimension independent. For simplicity we assume that and contain single-channel, grayscale data. We also assume that and are affinely aligned as a preprocessing step, so that the only source of misalignment between the volumes is nonlinear. Many packages are available for rapid affine alignment.

We model a function using a convolutional neural network (CNN), where is a registration displacement field between  and  in , and are network parameters, the kernels of the convolutional layers. That is, for each voxel , is a displacement such that and correspond to similar anatomical locations.

Fig. 2 presents an overview of our method. The network takes and as input, and computes using on a set of parameters . We warp  to  using a spatial transformation function, enabling evaluation of the similarity of  and .

We use stochastic gradient descent to find optimal parameters

by minimizing an expected loss function using a training dataset. We propose two unsupervised loss functions in this work. The first captures image similarity and field smoothness, while the second also leverages anatomical segmentations. Given unseen images and during test time, we obtain a registration field by evaluating . We describe our CNN architecture and the two loss functions in detail in the next sections.

Iv-a VoxelMorph CNN Architecture

In this section we describe the particular architecture used in our experiments, but emphasize that a wide range of architectures may work similarly well and that the exact architecture is not our focus. The parametrization of is based on a convolutional neural network architecture similar to UNet [50, 51], which consists of encoder and decoder sections with skip connections.

Fig. 3: Convolutional UNet architecture implementing . Each rectangle represents a 3D volume, generated from its preceding volume using a 3D convolutional network layer. The spatial resolution of each volume with respect to the input volume is printed underneath. Arrows represent skip connections, which concatenate encoder and decoder features.

Fig. 3 depicts our UNet-style network, which takes a single input formed by concatenating and into a 2-channel 3D image. In our experiments, the input is of size , but the framework is not limited by a particular size. We apply 3D convolutions (followed by nonlinear activations) in both the encoder and decoder stages. The convolutional layers capture hierarchical features of the input image pair, used to estimate

. In the encoder, we use strided convolutions to reduce the spatial dimensions in half at each layer. Successive layers of the encoder therefore operate over coarser representations of the input, similar to the image pyramid used in traditional image registration work.

In the decoding stage, we alternate between upsampling, convolutions and concatenating skip connections that propagate features learned during the encoding stages directly to layers generating the registration. Successive layers of the decoder operate on finer spatial scales, enabling precise anatomical alignment. The receptive fields of the convolutional kernels of the smallest layer should be at least as large as the maximum expected displacement between corresponding voxels in and . In our architecture, the smallest layer applies convolutions over a volume of the size of the input images. We provide further architecture details in the supplementary material.

Iv-B Spatial Transformation Function

The proposed method learns optimal parameter values in part by minimizing differences between and

. In order to use standard gradient-based methods, we construct a differentiable operation based on spatial transformer networks 

[38] to compute .

For each voxel , we compute a (subpixel) voxel location in . Because image values are only defined at integer locations, we linearly interpolate the values at the eight neighboring voxels:


where are the voxel neighbors of and iterates over dimensions of . Because the operations are differentiable111The absolute value is implemented to be differentiable at  via a set derivative value of 

, we can backpropagate errors during optimization.

Iv-C Loss Functions

The proposed model can be trained with any differentiable loss function. In this section, we propose two loss functions: an unsupervised loss that evaluates the model using only the input volumes and generated registration field, and an auxiliary loss that also leverages anatomical segmentations at training time.

Iv-C1 Unsupervised Loss Function

The unsupervised loss consists of two components: that penalizes differences in appearance, and that penalizes local spatial variations in :


where is a regularization parameter. We experimented with two often-used functions for . The first is the mean squared voxelwise difference, applicable when and have similar image intensity distributions and local contrast:


The second is the local cross-correlation of and , which is more robust to intensity variations found across scans and datasets [6]. Let and denote images with local mean intensities subtracted out: , where iterates over a volume around , with in our experiments. The local cross-correlation of and is written as:


A higher CC indicates a better alignment, yielding the loss function: .

Minimizing will encourage to approximate , but may generate a non-smooth that is not physically realistic. We encourage a smooth displacement field using a diffusion regularizer on its spatial gradients:


and approximate spatial gradients using differences between neighboring voxels.

Iv-C2 Auxiliary Data Loss Function

Here, we describe how VoxelMorph can leverage auxiliary information available during training but not during testing. Anatomical segmentation maps are sometimes available during training, and can be annotated by human experts or automated algorithms. A segmentation map assigns each voxel to an anatomical structure. If a registration field represents accurate anatomical correspondences, the regions in and corresponding to the same anatomical structure should overlap well.

Let be the set of voxels of structure for and , respectively. We quantify the volume overlap for structure using the Dice score [52]:


A Dice score of indicates that the anatomy matches perfectly, and a score of 0 indicates that there is no overlap. We define the segmentation loss over all structures as:


alone does not encourage smoothness and agreement of image appearance, which are essential to good registration. We therefore combine with (4) to obtain the objective:


where is a regularization parameter. We emphasize that although the segmentations are used during training, they are not inputs to , as shown in Fig. 2. We do not use them to register a new image pair during test time.

Since anatomical labels are categorical, a naive implementation of linear interpolation to compute is inappropriate, and a direct implementation of (8) might not be amenable to auto-differentiation frameworks. We design and to be image volumes with channels, where each channel is a binary mask specifying the spatial domain of a particular structure. We compute by spatially transforming each channel of . We then compute the numerator and denominator of (8) by multiplying and adding and , respectively.

Iv-D Amortized Optimization Interpretation

Our method substitutes the pair-specific optimization over the deformation field with a global optimization of function parameters for function . This process is sometimes referred to as amortized optimization [53]. Because the function is tasked with estimating registration between any two images, the fact that parameters  are shared globally acts as a natural regularization. We explore this aspect in Section V-E2 (Regularization Analysis). In addition, the quality and generalizability of the deformations output by the function will depend on the data it is trained on. Indeed, the resulting deformation can be interpreted as simply an approximation or initialization to the optimal deformation , and the resulting difference is sometimes referred to the amortization gap [10, 53]. If desired, this initial deformation field could be improved using any instance-specific optimization. In our experiments, we accomplish this by fine-tuning the network parameters for each particular scan independently. However, most often we find that the initial deformation, the VoxelMorph output, is already as accurate as state of the art results. We explore these aspects in experiments presented in Section V-E3.

V Experiments

We demonstrate our method on the task of brain MRI registration. We focus on atlas-based registration, in which we compute a registration field between an atlas, or reference volume, and each volume in our dataset. Atlas-based registration is a common formulation in population analysis, where inter-subject registration is a core problem. The atlas represents a reference, or average volume, and is usually constructed by jointly and repeatedly aligning a dataset of brain MR volumes and averaging them together [54]. We use an atlas computed using an external dataset [55, 56]. Each input volume pair consists of the atlas (image ) and a volume from the dataset (image ). Fig. 4 shows example image pairs using the same fixed atlas for all examples. All figures that depict brains in this paper show 2D slices, but all registration is done in 3D.

Fig. 4: Example MR coronal slices extracted from input pairs (columns 1-2), and resulting for VoxelMorph and ANTs. We overlaid boundaries of a few structures: ventricles (blue/dark green), thalami (red/pink), and hippocampi (light green/orange). A good registration will cause structures in to look similar to structures in . Our models are able to handle various changes in shape of structures, including expansion/shrinkage of the ventricles in rows 2 and 3, and stretching of the hippocampi in row 4. We produce similar, but not identical results to ANTs.

V-a Dataset

We use a large-scale, multi-site, multi-study dataset of 7829 T1–weighted brain MRI scans from eight publicly available datasets: ADNI [57], OASIS [58], ABIDE [59], ADHD200 [60], MCIC [61], PPMI [62], HABS [63], and Harvard GSP [64]. Acquisition details, subject age ranges and health conditions are different for each dataset. All scans were resampled to a grid with mm isotropic voxels. We carry out standard pre‐processing steps, including affine spatial normalization and brain extraction for each scan using FreeSurfer [55], and crop the resulting images to . All MRIs were anatomically segmented with FreeSurfer, and we applied quality control using visual inspection to catch gross errors in segmentation results. We include all anatomical structures that are at least  voxels in volume for all test subjects, resulting in 30 structures. We use the resulting segmentation maps in evaluating our registration as described below. We split our dataset into 7329, 250, and 250 volumes for train, validation, and test sets respectively, although we highlight that we do not use any supervised information at any stage.

V-B Evaluation Metrics

Obtaining dense ground truth registration for these data is not well-defined since many registration fields can yield similar looking warped images. We evaluate our method using volume overlap of anatomical segmentations. If a registration field represents accurate correspondences, the regions in and corresponding to the same anatomical structure should overlap well (see Fig. 4 for examples). We quantify the volume overlap between structures using the Dice score (8).

V-C Baseline Method

We use Symmetric Normalization (SyN) [6], the top-performing registration algorithm in a comparative study [65] as a baseline. We use the SyN implementation in the publicly available ANTs software package [66], with a cross-correlation similarity measure. Throughout our work with medical images, we found the default ANTs smoothness parameters to be sub-optimal for applying ANTs to our data. We obtained improved parameters using a wide parameter sweep across multiple datasets, and use those in these experiments.

Method Dice GPU sec CPU sec
Affine only 0.567 (0.157) 0 0
ANTs (SyN) 0.750 (0.136) - 9059 (2023)
VoxelMorph (CC) 0.750 (0.138) 0.45 (0.01) 57 (1)
VoxelMorph (MSE) 0.753 (0.134) 0.45 (0.01) 57 (1)

Average Dice scores and runtime results for affine alignment, ANTs and VoxelMorph. Standard deviations across structures and subjects are in parentheses. The average Dice score is computed over all structures and subjects. Timing is computed after preprocessing. Our networks yield comparable results to ANTs in Dice score, while operating orders of magnitude faster during testing. To our knowledge, ANTs does not have a GPU implementation.

V-D VoxelMorph Implementation

We implemented our method using Keras 


with a Tensorflow backend 

[68]. We extended the 2D spatial transformer layer to 3D with linear interpolation. We use the ADAM optimizer [69] with a learning rate of . To reduce memory usage, each training batch consists of one pair of volumes. We train separate networks with different regularization parameters, and show the behavior of our method with respect to these parameters. We then select the network that optimizes Dice score on our validation set, and report results on our held-out test set. Our code and model parameters are available online at voxelmorph.csail.mit.edu.

V-E Unsupervised Results

Fig. 5: Boxplots of Dice scores for various anatomical structures for VoxelMorph and ANTs results. We average Dice scores of the left and right brain hemispheres into one score for this visualization. Structures are ordered by average ANTs Dice score.

Table I presents average Dice scores computed for all subjects and structures for: 1) a baseline of only global affine alignment, 2) ANTs, and 3) VoxelMorph with different losses. VoxelMorph variants perform comparably to ANTs in terms of Dice, and are significantly better than affine alignment. Example visual results of the warped images from our algorithms are shown in Fig. 4. Both VoxelMorph and ANTs are able to handle significant shape changes for various structures.

Fig. 5 presents the Dice scores for each structure as a boxplot. For ease of visualization, we average Dice scores of the same structures from the two hemispheres into one score, e.g., the left and right hippocampi scores are averaged. The VoxelMorph models achieve comparable Dice measures to ANTs for all structures, performing slightly better than ANTs on some structures such as cerebral white matter, and worse on others such as the hippocampi. Because MSE and CC perform similarly, we focus on MSE for the remaining experiments.

Fig. 6: Effect of varying the regularization parameter on Dice score of validation subjects for VoxelMorph. Shown for comparison in a dashed line is the Dice score for results from ANTs with optimal parameters.

V-E1 Runtime

Table I presents runtime results using an Intel Xeon (E5-2680) CPU, and a NVIDIA TitanX GPU. We report the elapsed time for computations following the affine alignment preprocessing step, which all of the presented methods share, and requires just a few minutes even on a CPU. ANTs requires two or more hours on the CPU. ANTs runtimes vary widely, as its convergence depends on the difficulty of the alignment task. Registering two images with VoxelMorph is, on average, 60 times faster on the CPU. When using the GPU, VoxelMorph computes a registration in under a second. To our knowledge, there is no publicly available ANTs implementation for GPUs. It is likely that the SyN algorithm would benefit from a GPU implementation, but the main advantage of VoxelMorph comes from not requiring an optimization on each test pair, as can be seen in the CPU comparison.

V-E2 Regularization Analysis

Fig. 6 shows average Dice scores for the validation set for different values of the regularization parameter . Shown for comparison in a dashed line is the Dice score for alignment with ANTs with optimal parameter values. The results vary smoothly over a large range of values, illustrating that our model is robust to choice of . Interestingly, even setting , which enforces no regularization on registration, results in a significant improvement over affine registration. This is likely because the optimal network parameters need to register all pairs in the training set well, yielding an implicit dataset regularization for the function .

Fig. 7: Effect of training set size on accuracy. Also shown are results of instance-specific optimization of VoxelMorph parameters, after these are initialized with the optimal global parameters resulting from the training phase.

V-E3 Training Set Size and Instance-Specific Optimization

We evaluate the effect of training set size on accuracy, and the relationship between amortized and instance-specific optimization. We train VoxelMorph on subsets of different sizes from our training dataset, and report Dice scores on: (1) the training subset, (2) the test set, and (3) the test set when each deformation is further individually optimized for each test image pair. We perform (3) by “fine-tuning” the VoxelMorph parameters for 1000 additional iterations on each test pair (which is possible since VoxelMorph is unsupervised).

Fig. 8: Results on test scans when using auxiliary data during training. We test having varying number of observed labels (a-c), and having coarser segmentation maps (d). Error bars indicate standard deviations across subjects. The leftmost datapoint in each graph for all labels, corresponding to , indicates results of VoxelMorph without using auxiliary data (unsupervised). We show Dice scores for results from ANTs with optimal parameters, which does not use segmentation maps, for comparison.

Fig. 7 presents our results. A small training set size of 10 scans results in slightly lower train and test Dice scores compared to larger training set sizes. However, there is no significant difference in Dice scores when training with 100 scans or the full dataset of over 7000 scans. These results indicate that training with just  subjects already yields near-optimal test performance, and training with  subjects results in state-of-the -art results. Further optimizing the VoxelMorph parameters on each test image pair results in better test Dice scores regardless of training set size.

V-F Registration with Auxiliary Data

In this section, we evaluate VoxelMorph when using loss function (10), where we use segmentation maps during training, through the auxiliary loss function. We present an evaluation of our model in two practical scenarios: (1) when subsets of anatomical structure labels are available during training, and (2) when coarse segmentations labels are available during training. We use the same train/validation/test split as the previous experiments.

V-F1 Training with a subset of anatomical labels

In many practical settings, it may be infeasible to obtain training segmentations for all structures. We therefore first consider the case where segmentations are available for only a subset of the 30 structures. We refer to structures present in segmentations as observed, and the rest as unobserved. We considered three cases, when: 1, 15 (half), and 30 (all) structure segmentations are observed. The first two experiments simulate different amounts of partially observed segmentations. We train separate models on different subsets of observed structures. For single structure segmentations, we manually selected four important structures for four folds of the experiment: hippocampi, gray matter, white matter, and ventricles. For the second experiment type, we randomly selected 15 of the 30 structures, repeated over five folds. We use these segmentation maps at training, and show results on test pairs where segmentation maps are not used.

Fig. 8a-c shows Dice scores for both the observed and unobserved labels when sweeping in (10), the auxiliary regularization trade-off parameter. In general, VoxelMorph with auxiliary data significantly outperforms unsupervised VoxelMorph (equivalent to or ) and ANTs on observed structures. Dice score on observed labels generally increases with an increase in .

Interestingly, VoxelMorph (trained with auxiliary data) yields improved Dice scores for unobserved structures compared to the unsupervised variant for a range of values (see Fig. 8a-b), even though these segmentations were not explicitly observed during training. Registration accuracy for unobserved structures starts declining when is large, which is expected since the model starts over-fitting the observed structures too much at the cost of image matching and spatial regularization.

V-F2 Training with coarse labels

We consider the scenario where only coarse labels are available, such as when all the white matter is segmented as one structure. This situation enables evaluation of how the auxiliary data affects anatomical registration at finer scales, within the coarsely delineated structures. To achieve this, we merge the 30 structures into four broad groups: white matter, gray matter, cerebral spinal fluid (CSF) and the brain stem, and evaluate the accuracy of the registration on the original structures.

Fig. 8d presents mean Dice scores over the original 30 structures with varying regularization parameter . With an optimal of 0.01, we obtain an average Dice score of . This is roughly a Dice point improvement over VoxelMorph without auxiliary information.

Vi Discussion and Conclusion

VoxelMorph with unsupervised loss performs comparably to the state-of-the-art ANTs software in terms of Dice score, while reducing the computation time from hours to minutes on a CPU and under a second on a GPU. VoxelMorph is flexible and handles both partially observed or coarsely delineated auxiliary information during training. When all structures are annotated, the Dice score of the VoxelMorph results improve to Dice points beyond the state-of-the-art, while still preserving the runtime improvement.

VoxelMorph performs amortized optimization, learning function parameters that are optimal for an entire training dataset. As Fig. 7 shows, the dataset need not be large: with only 100 training images, VoxelMorph leads to state-of-the-art registration quality scores for both training and test sets. Instance-specific optimization further improves VoxelMorph performance by Dice point. This is a small increase, illustrating that amortized optimization leads to nearly optimal registration.

We performed a thorough set of experiments demonstrating that, for a reasonable choice of , the availability of anatomical segmentations during training significantly improves test registration performance with VoxelMorph. The performance gain varies based on the quality and number of anatomical segmentations available. Given a single labelled anatomical structure during training, the accuracy of registration of test subjects for that label increases, without negatively impacting other anatomy. If half or all of the labels are observed, or even a coarse segmentation is provided at training, registration accuracy improves for all labels during test. While we experimented with one type of auxiliary data in this study, VoxelMorph can leverage other auxiliary data, such as different modalities or anatomical keypoints.

VoxelMorph is a general learning model, and is not limited to a particular image type or anatomy – it may be useful in other medical image registration applications such as cardiac MR scans or lung CT images. With an appropriate loss function such as mutual information, the model can also perform multimodal registration. VoxelMorph promises to significantly speed up medical image analysis and processing pipelines, while opening novel directions in learning-based registration.


  • [1] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “An unsupervised learning model for deformable medical image registration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9252–9260.
  • [2] A. Sotiras, C. Davatzikos, and N. Paragios, “Deformable medical image registration: A survey,” IEEE transactions on medical imaging, vol. 32, no. 7, pp. 1153–1190, 2013.
  • [3] R. Bajcsy and S. Kovacic, “Multiresolution elastic matching,” Computer Vision, Graphics, and Image Processing, vol. 46, pp. 1–21, 1989.
  • [4] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, “Computing large deformation metric mappings via geodesic flows of diffeomorphisms,” Int. J. Comput. Vision, vol. 61, pp. 139–157, 2005.
  • [5] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International journal of computer vision, vol. 24, no. 2, pp. 137–154, 1997.
  • [6] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, “Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain,” Medical image analysis, vol. 12, no. 1, pp. 26–41, 2008.
  • [7] Y. Kim, S. Wiseman, A. C. Miller, D. Sontag, and A. M. Rush, “Semi-amortized variational autoencoders,” preprint arXiv:1802.02550, 2018.
  • [8] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised map inference for image super-resolution,” arXiv preprint arXiv:1610.04490, 2016.
  • [9] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt, “Advances in variational inference,” arXiv preprint arXiv:1711.05597, 2017.
  • [10] C. Cremer, X. Li, and D. Duvenaud, “Inference suboptimality in variational autoencoders,” arXiv preprint arXiv:1801.03558, 2018.
  • [11] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, “Unsupervised learning for fast probabilistic diffeomorphic registration,” arXiv preprint arXiv:1805.04605, 2018.
  • [12] J. Ashburner, “A fast diffeomorphic image registration algorithm,” Neuroimage, vol. 38, no. 1, pp. 95–113, 2007.
  • [13] A. V. Dalca, A. Bobu, N. S. Rost, and P. Golland, “Patch-based discrete registration of clinical brain images,” in International Workshop on Patch-based Techniques in Medical Imaging.   Springer, 2016, pp. 60–67.
  • [14]

    B. Glocker, N. Komodakis, G. Tziritas, N. Navab, and N. Paragios, “Dense image registration through mrfs and efficient linear programming,”

    Medical image analysis, vol. 12, no. 6, pp. 731–741, 2008.
  • [15] J. Thirion, “Image matching as a diffusion process: an analogy with maxwell’s demons,” Medical Image Analysis, vol. 2, no. 3, pp. 243–260, 1998.
  • [16] B. T. Yeo, M. R. Sabuncu, T. Vercauteren, D. J. Holt, K. Amunts, K. Zilles, P. Golland, and B. Fischl, “Learning task-optimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex,” IEEE transactions on medical imaging, vol. 29, no. 7, pp. 1424–1441, 2010.
  • [17] M. Zhang, R. Liao, A. V. Dalca, E. A. Turk, J. Luo, P. E. Grant, and P. Golland, “Frequency diffeomorphisms for efficient image registration,” in International conference on information processing in medical imaging.   Springer, 2017, pp. 559–570.
  • [18] C. Davatzikos, “Spatial transformation and registration of brain images using elastically deformable models,” Computer Vision and Image Understanding, vol. 66, no. 2, pp. 207–222, 1997.
  • [19] D. Shen and C. Davatzikos, “Hammer: Hierarchical attribute matching mechanism for elastic registration,” IEEE Transactions on Medical Imaging, vol. 21, no. 11, pp. 1421–1439, 2002.
  • [20] J. Ashburner and K. J. Friston, “Voxel-based morphometry-the methods,” Neuroimage, vol. 11, pp. 805–821, 2000.
  • [21] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformation: Application to breast mr images,” IEEE Transactions on Medical Imaging, vol. 18, no. 8, pp. 712–721, 1999.
  • [22] X. Pennec, P. Cachier, and N. Ayache, “Understanding the ”demon’s algorithm”: 3d non-rigid registration by gradient descent,” pp. 597–605, 1999.
  • [23] Y. Cao, M. I. Miller, R. L. Winslow, and L. Younes, “Large deformation diffeomorphic metric mapping of vector fields,” IEEE transactions on medical imaging, vol. 24, no. 9, pp. 1216–1230, 2005.
  • [24]

    C. Ceritoglu, K. Oishi, X. Li, M.-C. Chou, L. Younes, M. Albert, C. Lyketsos, P. C. van Zijl, M. I. Miller, and S. Mori, “Multi-contrast large deformation diffeomorphic metric mapping for diffusion tensor imaging,”

    Neuroimage, vol. 47, no. 2, pp. 618–627, 2009.
  • [25] M. Hernandez, M. N. Bossa, and S. Olmos, “Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows,” International Journal of Computer Vision, vol. 85, no. 3, pp. 291–306, 2009.
  • [26] S. C. Joshi and M. I. Miller, “Landmark matching via large deformation diffeomorphisms,” IEEE transactions on image processing, vol. 9, no. 8, pp. 1357–1370, 2000.
  • [27] M. I. Miller, M. F. Beg, C. Ceritoglu, and C. Stark, “Increasing the power of functional maps of the medial temporal lobe by using large deformation diffeomorphic metric mapping,” Proceedings of the National Academy of Sciences, vol. 102, no. 27, pp. 9685–9690, 2005.
  • [28] K. Oishi, A. Faria, H. Jiang, X. Li, K. Akhter, J. Zhang, J. T. Hsu, M. I. Miller, P. C. van Zijl, M. Albert et al., “Atlas-based whole brain white matter analysis using large deformation diffeomorphic metric mapping: application to normal elderly and alzheimer’s disease participants,” Neuroimage, vol. 46, no. 2, pp. 486–499, 2009.
  • [29] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache, “Diffeomorphic demons: Efficient non-parametric image registration,” NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.
  • [30]

    Y. Hu, M. Modat, E. Gibson, N. Ghavami, E. Bonmati, C. M. Moore, M. Emberton, J. A. Noble, D. C. Barratt, and T. Vercauteren, “Label-driven weakly-supervised learning for multimodal deformable image registration,” in

    Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on.   IEEE, 2018, pp. 1070–1074.
  • [31] Y. Hu, M. Modat, E. Gibson, W. Li, N. Ghavami, E. Bonmati, G. Wang, S. Bandula, C. M. Moore, M. Emberton et al., “Weakly-supervised convolutional neural networks for multimodal image registration,” Medical image analysis, vol. 49, pp. 1–13, 2018.
  • [32] J. Krebs, T. Mansi, H. Delingette, L. Zhang, F. C. Ghesu, S. Miao, A. K. Maier, N. Ayache, R. Liao, and A. Kamen, “Robust non-rigid registration through agent-based action learning,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2017, pp. 344–352.
  • [33] M.-M. Rohé, M. Datar, T. Heimann, M. Sermesant, and X. Pennec, “Svf-net: Learning deformable image registration using shape matching,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2017, pp. 266–274.
  • [34] H. Sokooti, B. de Vos, F. Berendsen, B. P. Lelieveldt, I. Išgum, and M. Staring, “Nonrigid image registration using multi-scale 3d convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2017, pp. 232–239.
  • [35]

    X. Yang, R. Kwitt, M. Styner, and M. Niethammer, “Quicksilver: Fast predictive image registration–a deep learning approach,”

    NeuroImage, vol. 158, pp. 378–396, 2017.
  • [36] B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, and I. Išgum, “End-to-end unsupervised deformable image registration with a convolutional neural network,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 2017, pp. 204–212.
  • [37] H. Li and Y. Fan, “Non-rigid image registration using fully convolutional networks with deep self-supervision,” preprint arXiv:1709.00799, 2017.
  • [38] M. Jaderberg, K. Simonyan, and A. Zisserman, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  • [39] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” European Conference on Computer Vision (ECCV), pp. 25–36, 2004.
  • [40] B. K. Horn and B. G. Schunck, “Determining optical flow,” 1980.
  • [41] D. Sun, S. Roth, and M. J. Black, “Secrets of optical flow estimation and their principles,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2439, 2010.
  • [42] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500–513, 2011.
  • [43] C. Liu, J. Yuen, and A. Torralba, “SIFT flow: Dense correspondence across scenes and its applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, 2011.
  • [44] Z. Chen, H. Jin, Z. Lin, S. Cohen, and Y. Wu, “Large displacement optical flow from nearest neighbor fields,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2443–2450, 2013.
  • [45] J. Wulff and M. J. Black, “Efficient sparse-to-dense optical flow estimation using a learned basis and layers,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 120–130.
  • [46] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766.
  • [47] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1385–1392.
  • [48] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg, “Transformation-grounded image generation network for novel 3D view synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 702–711.
  • [49] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” European Conference on Computer Vision (ECCV), pp. 286–301, 2016.
  • [50]

    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,”

    arXiv preprint, 2017.
  • [51] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2015, pp. 234–241.
  • [52] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
  • [53] J. Marino, Y. Yue, and S. Mandt, “Iterative amortized inference,” arXiv preprint arXiv:1807.09356, 2018.
  • [54] M. De Craene, A. du Bois d’Aische, B. Macq, and S. K. Warfield, “Multi-subject registration for unbiased statistical atlas construction,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2004, pp. 655–662.
  • [55] B. Fischl, “Freesurfer,” Neuroimage, vol. 62, no. 2, pp. 774–781, 2012.
  • [56] R. Sridharan, A. V. Dalca, K. M. Fitzpatrick, L. Cloonan, A. Kanakis, O. Wu, K. L. Furie, J. Rosand, N. S. Rost, and P. Golland, “Quantification and analysis of large multimodal clinical image studies: Application to stroke,” in International Workshop on Multimodal Brain Image Analysis.   Springer, 2013, pp. 18–30.
  • [57] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. R. Jack, W. Jagust, J. Q. Trojanowski, A. W. Toga, and L. Beckett, “Ways toward an early diagnosis in alzheimer’s disease: the alzheimer’s disease neuroimaging initiative (adni),” Alzheimer’s & Dementia, vol. 1, no. 1, pp. 55–66, 2005.
  • [58] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults,” Journal of cognitive neuroscience, vol. 19, no. 9, pp. 1498–1507, 2007.
  • [59] A. Di Martino, C.-G. Yan, Q. Li, E. Denio, F. X. Castellanos, K. Alaerts, J. S. Anderson, M. Assaf, S. Y. Bookheimer, M. Dapretto et al., “The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism,” Molecular psychiatry, vol. 19, no. 6, pp. 659–667, 2014.
  • [60] M. P. Milham, D. Fair, M. Mennes, S. H. Mostofsky et al., “The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience,” Frontiers in systems neuroscience, vol. 6, p. 62, 2012.
  • [61] R. L. Gollub, J. M. Shoemaker, M. D. King, T. White, S. Ehrlich, S. R. Sponheim, V. P. Clark, J. A. Turner, B. A. Mueller, V. Magnotta et al., “The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia,” Neuroinformatics, vol. 11, no. 3, pp. 367–388, 2013.
  • [62] K. Marek, D. Jennings, S. Lasch, A. Siderowf, C. Tanner, T. Simuni, C. Coffey, K. Kieburtz, E. Flagg, S. Chowdhury et al., “The parkinson progression marker initiative (ppmi),” Progress in neurobiology, vol. 95, no. 4, pp. 629–635, 2011.
  • [63] A. Dagley, M. LaPoint, W. Huijbers, T. Hedden, D. G. McLaren, J. P. Chatwal, K. V. Papp, R. E. Amariglio, D. Blacker, D. M. Rentz et al., “Harvard aging brain study: dataset and accessibility,” NeuroImage, 2015.
  • [64] A. J. Holmes, M. O. Hollinshead, T. M. O’Keefe, V. I. Petrov, G. R. Fariello, L. L. Wald, B. Fischl, B. R. Rosen, R. W. Mair, J. L. Roffman et al., “Brain genomics superstruct project initial data release with structural, functional, and behavioral measures,” Scientific data, vol. 2, 2015.
  • [65] A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner, B. Avants, M.-C. Chiang, G. E. Christensen, D. L. Collins, J. Gee, P. Hellier et al., “Evaluation of 14 nonlinear deformation algorithms applied to human brain mri registration,” Neuroimage, vol. 46(3), pp. 786–802, 2009.
  • [66] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J. C. Gee, “A reproducible evaluation of ants similarity metric performance in brain image registration,” Neuroimage, vol. 54, no. 3, pp. 2033–2044, 2011.
  • [67] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [68] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [69] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.