We present an efficient learning-based algorithm for deformable, pairwise 3D medical image registration. Current registration methods optimize an energy function independently for each pair of images, which can be time-consuming for large data. We define registration as a parametric function, and optimize its parameters given a set of images from a collection of interest. Given a new pair of scans, we can quickly compute a registration field by directly evaluating the function using the learned parameters. We model this function using a CNN, and use a spatial transform layer to reconstruct one image from another while imposing smoothness constraints on the registration field. The proposed method does not require supervised information such as ground truth registration fields or anatomical landmarks. We demonstrate registration accuracy comparable to state-of-the-art 3D image registration, while operating orders of magnitude faster in practice. Our method promises to significantly speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration and its applications. Our code is available at https://github.com/balakg/voxelmorph.READ FULL TEXT VIEW PDF
Deformable registration is a fundamental task in a variety of medical imaging studies, and has been a topic of active research for decades. In deformable registration, a dense, non-linear correspondence is established between a pair of -D image volumes, such as 3D MR brain scans, depicting similar structures. Most registration methods solve an optimization problem for each volume pair that aligns voxels with similar appearance while enforcing smoothness constraints on the registration mapping. Solving this optimization is computationally intensive, and therefore extremely slow in practice.
In contrast, we propose a novel registration method that learns a parametrized registration function from a collection of volumes. We implement the function using a convolutional neural network (CNN), that takes two -D input volumes and outputs a mapping of all voxels of one volume to another volume. The parameters of the network, i.e., the convolutional kernel weights, are optimized using a training set of volume pairs from the dataset of interest. By sharing the same parameters for a collection of volumes, the procedure learns a common representation which can align any new pair of volumes from the same distribution. In essence, we replace a costly optimization of traditional registration algorithms for each test image pair with one global function optimization during a training phase. Registration between a new test scan pair is achieved by simply evaluating the learned function on the given volumes, resulting in rapid registration.
The novelty of this work is that:
we present a learning-based solution requiring no supervised information such as ground truth correspondences or anatomical landmarks during training,
we propose a CNN function with parameters shared across a population, enabling registration to be achieved through a function evaluation, and
our method enables parameter optimization for a variety of cost functions, which can be adapted to various tasks.
Throughout this paper, we use the example of registering 3D MR brain scans. However, our method is broadly applicable to registration tasks, both within and beyond the medical imaging domain. We evaluate our method on a multi-study dataset of over 7,000 scans containing images of healthy and diseased brains from a variety of age groups. Results show that our method achieves comparable accuracy to a state-of-the-art registration package, while taking orders of magnitude less time. Scans that used to take two hours to register can now be registered within one or two minutes using a CPU, and under a second with a GPU. This is of significant practical importance for many medical image analysis tasks.
In the typical volume registration formulation, one (moving or source) volume is warped to align with a second (fixed or target) volume. Deformable registration strategies separate an initial affine transformation for global alignment from a typically much slower deformable transformation with higher degrees of freedom. We concentrate on the latter step, in which we compute a dense, nonlinear correspondence for all voxels. Fig.1 shows sample 2D coronal slices taken from 3D MRI volumes, with boundaries of several anatomical structures outlined. There is significant variability across subjects, caused by differences in health state and natural anatomical variations in healthy brains. Deformable registration enables comparison of structures across scans and population analyses. Such analyses are useful for understanding variability across populations or the evolution of brain anatomy over time for individuals with disease.
Most existing registration algorithms iteratively optimize a transformation based on an energy function. Let denote the fixed and moving images, respectively, and let be the registration field. The optimization problem is typically written as:
is warped by , function
measures image similarity betweenand , imposes regularization on , and is the regularization parameter.
There are several common formulations for , and . Often,
is a displacement vector field, specifying the vector offset fromto for each voxel. Diffeomorphic transforms are a popular alternative that model as the integral of a velocity vector field. As a result, they are able to preserve topology and enforce invertibility on . Common metrics used for include mean squared voxel difference, mutual information, and cross-correlation. The latter two are particularly useful when volumes have varying intensity distributions and contrasts. enforces a spatially smooth deformation, often modeled as a linear operator on spatial gradients of . In our work, we optimize function parameters to minimize the expected energy of the form of (1) using a dataset of volume pairs, instead of doing it for each pair independently.
There is extensive work in 3D medical image registration [2, 4, 6, 7, 13, 18, 42].111in medical imaging literature, the volumes produced by 3D imaging techniques are often referred to as images Several studies optimize within the space of displacement vector fields. These include elastic-type models [6, 38], statistical parametric mapping , free-form deformations with b-splines,  and Demons . Our model also assumes displacement vector fields. Diffeomorphic transforms, which are topology-preserving, have shown remarkable success in various computational anatomy studies. Popular formulations include Large Diffeomorphic Distance Metric Mapping (LDDMM) , DARTEL  and standard symmetric normalization (SyN) .
There are several recent papers proposing neural networks to learn a function for medical image registration. Most of these rely on ground truth warp fields or segmentations [26, 35, 39, 45], a significant drawback compared to our method, which does not require either. Two recent works [14, 27] present unsupervised methods that are closer to our approach. Both propose a neural network consisting of a CNN and spatial transformation function  that warps images to one another. Unfortunately, these methods are preliminary and have significant drawbacks: they are only demonstrated on limited subsets of volumes, such as 3D subregions or 2D slices, and support only small transformations. Others 
employ regularization only implicitly determined by interpolation methods. In contrast, our generalizable method is applicable to entire 3D volumes, handles large deformations, and enables any differentiable cost function. We present a rigorous analysis of our method, and demonstrate results on full MR volumes.
Optical flow estimation is an analogous problem to 3D volume registration for 2D images. Optical flow algorithms return a dense displacement vector field depicting small displacements between a 2D image pair. Traditional optical flow approaches typically solve an optimization problem similar to (1) using variational methods [8, 21, 41]. Extensions that better handle large displacements or dramatic changes in appearance include feature-based matching [9, 28] and nearest neighbor fields .
Several learning-based approaches to dense 2D image alignment have been proposed. One study learns a low-dimensional basis for optical flow in natural images using PCA . Other recent studies in optical flow learn a parametric function using convolutional neural networks [16, 43]. Unfortunately, these methods require ground truth registrations during training. The spatial transform layer enables neural networks to perform global parametric 2D image alignment without requiring supervised labels . The layer has since been used for dense spatial transformations as well [34, 46]. We extend the spatial transformer to the 3D setting in our work.
Let be two image volumes defined over a -D spatial domain . For the rest of this paper, we focus on the case . For simplicity we assume that and contain single-channel, grayscale data. We also assume that and are affinely aligned as a preprocessing step, so that the only source of misalignment between the volumes is nonlinear. Many packages are available for rapid affine alignment.
We model a function using a convolutional neural network (CNN), where is a registration field and are learnable parameters of . For each voxel , is a location such that and define similar anatomical locations.
Fig. 2 presents an overview of our method. Our network takes and as input, and computes based on a set of parameters , the kernels of the convolutional layers. We warp to using a spatial transformation function, enabling the model to evaluate the similarity of and and update .
We use stochastic gradient descent to find optimal parameters
by minimizing an expected loss function, similar to (2), using a training dataset:
where is the dataset distribution. We learn by aligning volume pairs sampled from . Importantly, we do not require supervised information such as ground truth registration fields or anatomical landmarks. Given an unseen and during test time, we obtain a registration field by evaluating . We describe our model, which we call VoxelMorph, in the next sections.
The parametrization of is based on a convolutional neural network architecture similar to UNet [22, 36]. The network consists of an encoder-decoder with skip connections that is responsible for generating given and .
Fig. 3 depicts two variants of the proposed architectures that tradeoff between registration accuracy and computation time. Both take a single input formed by concatenating and into a 2-channel 3D image. In our experiments, the input is of size
. We apply 3D convolutions followed by Leaky ReLU activations in both the encoder and decoder stages, with a convolutional kernel size of. The convolutional layers capture hierarchical features of the input image pair necessary to estimate the correspondence
. In the encoder, we use strided convolutions to reduce the spatial dimensions in half until the smallest layer is reached. Successive layers of the encoder operate over coarser representations of the input, similar to the image pyramid used in traditional image registration work.
The receptive fields of the convolutional kernels of the smallest layer should be at least as large as the maximum expected displacement between corresponding voxels in and . The smallest layer applies convolutions over a volume the size of the input images. In the decoding stage, we alternate between upsampling, convolutions (followed by Leaky ReLU activations) and concatenating skip connections. Skip connections propagate features learned during the encoding stages directly to layers generating the registration. The output of the decoder, , is of size in our experiments.
Successive layers of the decoder operate on finer spatial scales, enabling precise anatomical alignment. However, these convolutions are applied to the largest image volumes, which is computationally expensive. We explore this tradeoff using two architectures, VoxelMorph-1 and VoxelMorph-2, that differ in size at the end of the decoder (see Fig. 3). VoxelMorph-1 uses one less layer at the final resolution and fewer channels over its last three layers.
The proposed method learns optimal parameter values in part by minimizing differences between and
. In order to use standard gradient-based methods, we construct a differentiable operation based on spatial transformer networks to compute.
For each voxel , we compute a (subpixel) voxel location in . Because image values are only defined at integer locations, we linearly interpolate the values at the eight neighboring voxels. That is, we perform:
where are the voxel neighbors of
. Because the operations are differentiable almost everywhere, we can backpropagate errors during optimization.
The proposed method works with any differentiable loss. In this section, we formulate an example of a popular loss function of the form (2), consisting of two components: that penalizes differences in appearance, and that penalizes local spatial variations in . In our experiments, we set to the negative local cross-correlation of and , a popular metric that is robust to intensity variations often found across scans and datasets.
Let and denote images with local mean intensities subtracted out. We compute local means over a volume, with in our experiments. We write the local cross-correlation of and , as:
where iterates over a volume around . A higher CC indicates a better alignment, yielding the loss function: . We compute CC efficiently using only convolutional operations over and .
Minimizing will encourage to approximate , but may generate a discontinuous . We encourage a smooth using a diffusion regularizer on its spatial gradients:
We approximate spatial gradients using differences between neighboring voxels. The complete loss is therefore:
where is a regularization parameter.
We demonstrate our method on the task of brain MRI registration. We use a large-scale, multi-site, multi-study dataset of 7829 T1‐weighted brain MRI scans from eight publicly available datasets: ADNI , OASIS , ABIDE , ADHD200 , MCIC , PPMI , HABS , and Harvard GSP . Acquisition details, subject age ranges and health conditions are different for each dataset. All scans were resampled to a grid with mm isotropic voxels. We carry out standard pre‐processing steps, including affine spatial normalization and brain extraction for each scan using FreeSurfer , and crop the resulting images to . All MRIs were also anatomically segmented with FreeSurfer, and we applied quality control (QC) using visual inspection to catch gross errors in segmentation results. We use the resulting segmentation maps in evaluating our registration as described below. We split our dataset into 7329, 250, and 250 volumes for train, validation, and test sets respectively, although we highlight that we do not use any supervised information at any stage.
We focus on atlas-based registration, in which we compute a registration field between an atlas, or reference volume, and each volume in our dataset. Atlas-based registration is a common formulation in population analysis, where inter-subject registration is a core problem. The atlas represents a reference, or average volume, and is usually constructed by jointly and repeatedly aligning a dataset of brain MR volumes and averaging them together. We use an atlas computed using an external dataset [17, 40]. Each input volume pair consists of the atlas (image ) and a random volume from the dataset (image ). Columns 1-2 of Fig. 4 show example image pairs from the dataset using the same fixed atlas for all examples. All figures that depict brains in this paper show 2D coronal slices for visualization purposes only. All registration is done in 3D.
Obtaining dense ground truth registration for these data is not well-defined since many registration fields can yield similar looking warped images. We evaluate our method using volume overlap of anatomical segmentations. We include any anatomical structures that are at least voxels in volume for all test subjects, resulting in 29 structures. If a registration field represents accurate anatomical correspondences, we expect the regions in and corresponding to the same anatomical structure to overlap well (see Fig. 4 for examples). Let be the set of voxels of structure for and , respectively. We measure the accuracy of our method using the Dice score , which quantifies the volume overlap between two structures:
A Dice score of indicates that the structures are identical, and a score of 0 indicates that there is no overlap.
We compare our approach to Symmetric Normalization (SyN) , the top-performing registration algorithm in a comparative study . We use the SyN implementation in the publicly available ANTs software package , with a cross-correlation similarity measure. Throughout our work with medical images, we found the default ANTs smoothness parameters to be sub-optimal for our purposes. We obtained improved parameters using a wide parameter sweep across a multiple of datasets, and use those in these experiments.
We implement our networks using Keras
with a Tensorflow backend. We use the ADAM optimizer  with a learning rate of . To reduce memory usage, each training batch consists of one pair of volumes. We train separate networks with different values until convergence. We select the network that optimizes Dice score on our validation set, and report results on our held-out test set. Our code and model parameters are available online at https://github.com/balakg/voxelmorph.
Table 1 shows average Dice scores over all subjects and structures for ANTs, the proposed VoxelMorph architectures, and a baseline of only global affine alignment. VoxelMorph models perform comparably to ANTs, and VoxelMorph-2 performs slightly better than VoxelMorph-1. All three improve significantly on affine alignment. We visualize the distribution of Dice scores for each structure as boxplots in Fig. 5. For visualization purposes, we combine same structures from the two hemispheres, such as the left and right white matter. The VoxelMorph models achieve comparable Dice measures to ANTs for all structures, performing slightly better than ANTs on some structures such as cerebral white matter, and worse on others such as the hippocampi.
|Method||Avg. Dice||GPU sec||CPU sec|
|Affine only||0.567 (0.157)||0||0|
|ANTs||0.749 (0.135)||-||9059 (2023)|
|VoxelMorph-1||0.742 (0.139)||0.365 (0.012)||57(1)|
|VoxelMorph-2||0.750 (0.137)||0.554 (0.017)||144 (1)|
Average Dice scores and runtime results for affine alignment, ANTs, VoxelMorph-1, VoxelMorph-2. Standard deviations are in parentheses. The average Dice score is computed over all structures and subjects. Timing is computed after preprocessing. Our networks yield comparable results to ANTs in Dice score, while operating orders of magnitude faster during testing. To our knowledge, ANTs does not have a GPU implementation.
Table 1 presents runtime results using an Intel Xeon (E5-2680) CPU, and a NVIDIA TitanX GPU. We report the elapsed time for computations following the affine alignment preprocessing step, which all of the presented methods share, and requires just a few minutes on a CPU. ANTs requires roughly two or more hours of CPU time. VoxelMorph-1 and VoxelMorph-2 are 60+ and 150+ times faster on average using the CPU. ANTs runtimes vary widely, because its convergence depends on the difficulty of the alignment task. When using the GPU, our networks compute a registration in under a second. To our knowledge, there is no publicly available ANTs implementation for GPUs.
The results in the previous sections combine multiple datasets consisting of different population types, resulting in a trained model that generalizes well to a range of subjects. In this section, we model parameters specific to a subpopulation, demonstrating the ability of tailoring our approach to particular tasks. We train using ABIDE subject scans, and evaluate test performance on unseen ABIDE scans. ABIDE contains scans of subjects with autism and controls, and includes a wide age range, with a median age of 15 years. In Table 2 we compare the results to those of the models trained on all datasets, presented in the previous section. The dataset-specific networks achieve a Dice score improvement.
|Avg. Dice||Avg. Dice|
|Method||(Train on All)||(Train on ABIDE)|
Fig. 5(a) presents average Dice scores for the validation set for different values of the smoothing parameter . As a baseline, we display Dice score of the affinely aligned scans. The optimal Dice scores occur when for VoxelMorph-1 and for VoxelMorph-2. However, the results vary slowly over a large range of values, showing that our model is robust to choice of . Interestingly, even setting , which enforces no regularization, results in a significant improvement over affine registration. This is likely due to the fact that the optimal network parameters need to register all pairs in the training set well, giving an implicit regularization. Fig. 5(b) shows example registration fields at a coronal slice with different regularization values. For low , the field can change dramatically across edges and structural boundaries.
Our model is able to perform on par with the state-of-the-art ANTs registration package while requiring far less computation time to register test volume pairs. While our method learns general features about the data necessary for registration, it can adapt these parameters to specific subpopulations. When training on the ABIDE dataset only, we obtain improved Dice scores on test ABIDE scans compared to training on a dataset from several sources exhibiting different health conditions and variations in acquisition. This result shows that some of our model’s parameters are learning properties specific to the training images.
We present two models which trade off in accuracy and computation time. The smaller architecture, VoxelMorph-1, runs significantly faster on the CPU and is less than
Dice point worse than VoxelMorph-2. This enables an application-specific decision. An advantage of our model is that it is easy to explore this tradeoff by changing the number of convolutional layers and channels of the network, which can be considered as hyperparameters. We selected these hyperparameters by experimenting on training and validation data, and they could be adapted to other tasks.
We quantify accuracy in this study using Dice score, which acts as a proxy measure of registration accuracy. While our models achieve comparable Dice scores, ANTs produces diffeomorphic registrations, which are not guaranteed by our models. Diffeomorphic fields have attractive properties like invertibility and topology-preservation that are useful in some analyses. This presents an exciting area of future work for learning-based registration.
Our method replaces a costly optimization problem for each test image pair, with one function optimization aggregated over a dataset during a training phase. This idea is applicable to a wide variety of problems traditionally relying on complex, non-learning-based optimization algorithms for each input. Our network implementations needed a one-time training period of a few days on a single NVIDIA TITANX GPU, but less than a second to register a test pair of images. Given the growing availability of image data, our solution is preferable to a non-learning-based approach, and sorely-needed to facilitate fast medical image analyses.
This paper presents an unsupervised learning-based approach to medical image registration, that requires no supervised information such as ground truth registration fields or anatomical landmarks. The approach obtains similar registration accuracy to state-of-the-art 3D image registration on a large-scale, multi-study MR brain dataset, while operating orders of magnitude faster. Model analysis shows that our model is robust to regularization parameter, can be tailored to different data populations, and can be easily modified to explore accuracy and runtime tradeoffs. Our method promises to significantly speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2443–2450, 2013.
Dense image registration through mrfs and efficient linear programming.Medical image analysis, 12(6):731–741, 2008.