Deformable image registration establishes pixel- or voxel-level dense correspondences for 2D or 3D image pairs, which form a deformation that transforms images into a common space for comparison and analysis. Such a deformation desires a good property of diffeomorphism, a smooth transformation with a smooth inverse, to ensure the preservation of topology when warping images. Classical image registration models, e.g., LDDMM [beg2005computing], Stationary Velocity Fields (SVF) [arsigny2006log]
, successfully estimate diffeomorphic deformations for building dense correspondences between image pairs. However, these models face challenges for practical applications, i.e., providing both fast and accurate solutions. Therefore, researchers have been working on improving the efficiency of diffeomorphic image registration methods[ashburner2007fast, zhang2015finite].
Recently, deep learning based approaches open an alternative to address the above challenges, which motivates our work in this paper. Existing deep registration models focus on tackling the efficiency challenge using supervised[yang2017quicksilver] or unsupervised techniques [dalca2018unsupervised, krebs2018unsupervised]. Supervised approaches [yang2017quicksilver] maintain the diffeomorphic property, which is inherited from the classical diffeomorphic model LDDMM, but it requires extra effort to obtain the ground-truth deformations. Meanwhile, its registration accuracy is limited by that of the obtained deformations. The unsupervised approaches [dalca2018unsupervised, krebs2018unsupervised] have shown promising diffeomorphic and efficient registration results by introducing an integration layer into the network design, based on the scaling and squaring method [higham2005scaling]
. Due to the flexibility in selecting the network architecture and the loss function, the unsupervised framework has the potential to further improve the registration accuracy. However, because the integration step is computationally expensive and the network faces the memory challenge for high-dimensional images, the unsupervised approaches often work on downsampled images or deformations, which does not fully leverage available information and limits the registration accuracy.
In this paper, we aim to improve the accuracy of unsupervised image registration, using a multi-scale design to integrate image deformations at global, local, and original scales. The difference in our work from current unsupervised registration models [dalca2018unsupervised, krebs2018unsupervised] is that our model works on downsampled images and chopped chunks, which reduce the computational and memory cost compared to working on the original image size directly. Meanwhile, the integration of these two-scale deformations back to the original scale improves the registration accuracy because of the information fusion at different levels. Previous work [hering2019mlvirnet, krebs2019learning, yang2017quicksilver] either use multi-scale approaches of downsampling the entire image to improve accuracy of the result or use only patches of the images to reduce the memory cost, but both these techniques do not adequately leverage the data. Therefore, to gain an accuracy boost but not run into memory issues, we propose the Dividing and Downsampling mixed Registration Network (DDR-Net).
Our contributions in this paper are summarized as follows:
We propose a novel architecture DDR-Net, which can effectively use both global and local features yielding high quality registration performance. Multi-scale information benefits the task of deep image registration.
We demonstrate an effective way to obtain a trade-off between fully leveraging the available data under limited computing resources and gaining an improved accuracy of diffeomorphic image registration at the same time.
We conduct extensive experiments on both 2D and 3D datasets with different image types, including brain MRIs and liver CT scans. The experimental results demonstrate that our framework has better registration performance compared to deep-learning-based method VoxelMorph [dalca2018unsupervised] and the classical registration method ANTs SyN [avants2011reproducible], in terms of image matching, deformation smoothness, and multi-structure segmentation.
2 Dividing and Downsampling mixed Registration Network (DDR-Net)
Architecture Overview. As shown in Figure 1, our proposed DDR-Net includes three main components: a global branch that handles the registration for downsampled images, a local branch that handles the registration for the cropped local chunks of original images, and an original branch that merges estimated velocity fields from the global and local branches to register original images. Each branch outputs a deformation field to register image pairs at its corresponding level, and they share some network designs as discussed below.
Backbone Registration. At each scale, we have a diffeomorphic image registration problem. Given an image pair, a source image and a target image , each of size , the goal of diffeormorphic image registration is to estimate a smooth deformation field with a smooth , such that the image deformed from the source, i.e. , is similar to the target image . Such a diffeomorphic deformation field is driven by a smooth velocity field , via the following differential equation:
Here, is an identity deformation. This formulation estimates an optimal velocity field that drives a deformation field to match an image pair. So, the registration network has three sub-tasks, i.e., estimating the velocity field, solving Eq. (1) for deformations, and deforming an image with interpolation.
Velocity Field Estimation. The global and local branches follow the same UNet [ronneberger2015u] architecture as shown in Fig. 1. The UNet takes in image pairs and outputs the mean and the variance for sampling a corresponding stationary velocity field . Here, the stationary velocity field assumption simplifies the solution of Eq. (1). Given a collection of image pairs , where , the downsampling branch takes the low-resolution image pairs , downsampled by half, i.e., . The chunk branch receives each input as divided patches with the same resolution as and , i.e., and . These cropped chunks have small overlaps on their boundaries to mitigate the discontinuity of the generated velocity fields at the chunk boundaries, when merging back to form the original high resolution velocity fields. The detailed network architecture in terms of the number of convolution layers and the kernel sizes are shown in Fig. 1.
Deformation Integration. Under that assumption of the stationary velocity field, Equation (1) is simplified with a constant velocity field, and its solution is . Similar to VoxelMorph, we adopt the scaling and squaring algorithm [higham2005scaling] to approximate this solution, which is implemented as a differentiable layer in the network. The downsampling and chunk branches integrate deformations separately, which are driven by their respective velocity fields. In the original branch, we use an averaged global and local velocity field to integrate the deformation. In particular, we upsample the global velocity field to the original resolution using an upsampling layer and merge the chunks to the original volume using a chunk-merge layer. As a result, we obtain a velocity field at the original resolution which is integrated to get the deformation at the original level.
Image Interpolation. We use an interpolation layer to deform the source image at each branch. For each voxel in the target image, we compute its location
in the source image and compute its intensity value using linear interpolation. This differentiable operation allows the backpropagation of the network errors.
Loss Functions. All the velocity fields generated in the network have the constraint to ensure the smoothness of the deformations. Each branch also outputs a deformed source image to match the corresponding target image at each level, i.e., the downsampled, chunk, and original scales. We use the Mean Square Error (MSE) to measure the matching results and a K-L divergence loss to encourage the smoothness of the velocity field as done in [dalca2018unsupervised, krebs2019learning].
|OASIS||SyN [avants2011reproducible]||4.79 0.001||495.37 0.03||0.1 0.768|
|VM [dalca2018unsupervised]||1.25 0.001||1.46 0.005||0.051 0.014|
|DDR-Net||1.10 0.001||1.09 0.000||0.003 0.019|
|IBSR18||SyN [avants2011reproducible]||10.93 0.26||0.00 0.000||0.00 0.00|
|VM [dalca2018unsupervised]||2.06 0.001||20.17 0.005||0.05 0.014|
|DDR-Net||1.67 0.001||17.266 0.000||0.04 0.117|
|3DIRCADB-01||SyN [avants2011reproducible]||68.04 0.024||476.98 0.001||35.32 5.227|
|VM [dalca2018unsupervised]||16.84 0.005||1221.21 0.01||0.72 0.274|
|DDR-Net||14.11 0.004||723.89 0.003||0.65 0.247|
|Dataset||Region||SyN [avants2008symmetric]||VM [dalca2018unsupervised]||DDR-Net|
|IBSR18||3rd ventricle||0.28 0.14||0.42 0.16||0.44 0.17|
|4th ventricle||0.20 0.13||0.39 0.16||0.36 0.16|
|amygdala||0.34 0.13||0.27 0.23||0.31 0.20|
|brainstem||0.62 0.12||0.69 0.13||0.66 0.16|
|caudate||0.48 0.09||0.40 0.21||0.40 0.20|
|cerebellum cortex||0.55 0.12||0.59 0.20||0.63 0.13|
|cerebellum white matter||0.42 0.14||0.34 0.20||0.44 0.18|
|cerebral cortex||0.43 0.09||0.55 0.18||0.58 0.13|
|cerebral white matter||0.53 0.06||0.58 0.15||0.59 0.11|
|csf||0.22 0.17||0.31 0.16||0.31 0.17|
|hippocampus||0.32 0.11||0.17 0.14||0.34 0.21|
|lateral ventricle||0.38 0.12||0.53 0.18||0.45 0.17|
|pallidum||0.24 0.16||0.23 0.23||0.29 0.24|
|putamen||0.29 0.21||0.27 0.25||0.33 0.27|
|thalamus||0.68 0.08||0.61 0.18||0.60 0.19|
|ventraldc||0.49 0.11||0.49 0.21||0.54 0.18|
|Avg. Dice||0.28 0.14||0.43 0.1||0.47 0.1|
|3DIRCADB-01||bone||0.223 0.106||0.356 0.149||0.369 0.153|
|skin||0.889 0.065||0.982 0.012||0.991 0.009|
|liver||0.690 0.099||0.731 0.122||0.741 0.131|
We evaluate our method on three public datasets, which involves two types of 3D medical images, i.e., brain MRI scans and liver CT scans.
OASIS Dataset [marcus2007open]. The T1-weighted brain scans from the OASIS dataset underwent pre-processing steps including down-sampling, skull-stripping, intensity normalization to the range [0, 1], bias field correction, and co-registration with affine transformations. We have resulting images of resampled resolution and the voxel size of . We collect 360 volume pairs for the 3D experiments.
IBSR18 Dataset [valverde2015comparison]. The IBSR18 dataset consists of T1-weighted scans for 18 subjects with dimensions of . We resampled these images to . Scans underwent preprocessing steps including skull-stripping, bias field correction, and intensity normalization. Each scan also comes with 84 manually labelled anatomical structures. We collect 306 3D image pairs. Segmentation maps including 28 anatomical structures (see Table 2) were also obtained in this dataset, which followed the same preprocessing steps of resampling.
3DIRCADB-01 Dataset [3dircadb]. The 3DIRCADB-01 dataset contains 20 CT scans with masks of the segmented structures available for bone, liver, and skin. Scans underwent preprocessing steps including histogram equalization and intensity normalization to the range [0, 1]. We collect and pair the central slices on the axial plan and obtained 380 2D image pairs with a resolution of .
Experimental Settings. For all the experiments, we use 70% of the collected data for training, 10% for validation, and 20% for test, after a random shuffle. We use the Adam [kingma2014adam] optimizer with a learning rate of . All the experiments were trained using NVIDIA GeForce TITAN X GPUs.
We measure the image matching after registration using the intensity root mean square error (RMSE), and the smoothness of a deformation by counting the number of its foldings and the absolute sum of negative determinant of its Jacobian [ashburner2007fast]. Also, we perform the segmentation task using image registration and report the Dice score in terms of volume overlap.
We compare our approach with ANTsPy package using Symmetric Normalization (SyN) [avants2011reproducible]. We use cross-correlation and other default settings, which are optimal for our task with 201 iterations for registration. Another baseline algorithm is VoxelMorph [dalca2018unsupervised], and we also use its default settings for comparison. Initial affine registration is not performed for any of the experiments.
Experimental Results. Figures 2, 3, 4 show qualitative results for OASIS, IBSR18, and 3DIRCADB-01 datasets, respectively. Overall, our approach produces better matching results as compared to the baseline methods, demonstrated by the image difference plots. VoxelMorph tends to produce unwanted artifacts in its deformed images, which are not seen in our results. The deformation plot and the visualization of the determinant of the Jacobian show that our results have less deformations happening in the background, as expected. Also, our method achieves significantly smaller deformations throughout the image space and provides a better matching result to the target at the same time.
We observe that the VoxelMorph produces more irregular deformation fields compared to the other methods. Table 1 show the quantitative analysis of the registration performance on the OASIS, IBSR18, and 3DIRCADB-01 datasets. We achieve consistently lower number of foldings compared to the baseline methods while maintaining the consistently higher registration accuracy. Table 2 indicates the Dice score of anatomical structures for the IBSR18 and 3DIRACADB-01 datasets. Our method performs better on most of the anatomical structures for both the IBSR18 and 3DIRCADB-01 datasets. Figure 5 shows the segmentation maps for a few anatomical structures from the IBSR18 dataset.
Regarding the inference time, ANTs SyN, which is tested on CPU since it does not have a GPU implementation, takes on an average 20 minutes, whereas VoxelMorph tested on GPU takes 121 milliseconds, while DDR-Net tested on GPU takes 425 milliseconds to register an image pair from the OASIS dataset. A limited ablation study conducted by us revealed that a single scale UNet architecture working on the full size velocity field integration for 2D input size of
for a single input image pair takes 26.46 MB of memory, whereas an architecture like DDR-Net for the same input and UNet architecture will take 21.16 MB of memory. The memory requirement was calculated considering the forward pass and parameters in the neural network. For 3D cases, our DDR-Net can take in more images in a batch, compared to a single scale UNet.
4 Conclusion and Discussion
In this paper, we have proposed a diffeomorphic image registration model which estimates smoother deformations and provides a better image matching result. It leverages both the global context and local fine structures of the data effectively to enhance the registration result and handle large deformations as a result of such an architecture. Generally, our approach produced more regular deformation fields, which are significantly smoother than the baseline methods. The dice scores indicate that our approach is comparable to the existing methods even surpassing them in most cases. Moreover, our architecture shows that simple yet efficient changes in architectures can lead to better deep learning strategies. We believe that an optimal balance of GPU memory and accuracy is essential for registering image pairs in 3D in order to maximize the utilization of 3D information. Our work will push new avenues of research in considering memory efficient registration models in deep learning, which fully leverage the data.