1 Introduction
While deep learning-based metrics [1, 2] show great promise for solving difficult registration problems, they are thought to require well-registered training data, which can be a serious drawback. Consider the registration of abdominal MRI and ultrasound (US), an important and difficult problem. There is strong application pull for such registration, as diagnostic MRI has good contrast for tumors, while US is routinely used for interventional guidance. Unfortunately, for this application it is difficult to obtain well-registered training data, because the scans cannot be acquired simultaneously and the anatomy shifts between scans. While manual registration is a possibility, it is laborious and technically challenging. For these reasons, there is great practical advantage in eliminating or reducing the requirement for well-registered training data when learning deep metrics for registration. In this paper we show how this may be achieved.
Many registration systems can be decomposed into two components: an image agreement metric (or objective function) and an optimizer. Human-designed agreement metrics such as mutual information (MI) have been successful in multimodal image registration [3]. However, for difficult problems, single-pixel statistics may not capture all the information that is needed. In addition, using manually constructed features limits the capacity to learn the information that is shared among images.
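To make concrete what "single-pixel statistics" means for MI, the following minimal sketch estimates mutual information from the joint histogram of paired intensity labels. It is an illustration only: the function name `mutual_information` and the toy label sequences are assumptions, and real registration code would first bin continuous intensities.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length sequences of
    discrete intensity labels, estimated from their joint histogram."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))  # joint histogram of co-occurring labels
    count_a = Counter(a)        # marginal histogram of the first image
    count_b = Counter(b)        # marginal histogram of the second image
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written with raw counts
        mi += p_xy * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


# Intensities related by a one-to-one mapping: MI equals the entropy, log(2).
mi_dependent = mutual_information([0, 0, 1, 1], [3, 3, 8, 8])
# Statistically independent intensities: MI is zero.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Note that the estimate depends only on how often intensity pairs co-occur at the same location, not on their spatial arrangement, which is the limitation the paragraph above points to.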
Convolutional Neural Networks (CNNs) have proven remarkably powerful for image classification and other image processing tasks, presumably because they are able to learn and manipulate effective representations of image contents at multiple levels of abstraction. They are now gaining traction on difficult medical imaging problems. A recent survey on deep learning methods in medical image analysis [4] reports two common themes where deep neural networks have been applied to image registration: 1) estimation of similarity measures [1, 5, 2]; 2) estimation of transformation parameters between the images [6, 7]. Here, we focus on the first category, as our work also concerns the estimation of similarity metrics. Wu et al. [5] explored unsupervised learning methods to extract deep features from the input patches and used the learned feature vectors, instead of hand-crafted features, in an existing registration method. Cheng et al. [2] proposed learning such a metric by training on corresponding and non-corresponding patches from CT and MR in a multi-modal stacked denoising autoencoder framework. Simonovsky et al. [1] trained a CNN classifier to distinguish between pairs of corresponding and non-corresponding patches. After training, gradients of the deep metric were used to compute the updates for the transformation parameters in an iterative manner. A limitation of this approach appears to be that well-registered training data is a requirement for training such a classifier.
In this paper we focus on the training requirements of deep metrics for registration. We demonstrate that well-registered training data is actually not required. With misregistered training data there is a risk that the objective function will be biased. However, we demonstrate that with suitable data augmentation, including a novel “dithering” approach, the effect of misregistration is to broaden the objective function while eliminating its bias, in comparison to the non-augmented case. While there may be some loss of accuracy due to the broadening of the objective function, we show that a multi-shot approach can be used, whereby the broadened response function is used to improve the registration of the training data, and the process is repeated. This leads to a well-registered training data set, on which the ultimately trained network would perform as well as the one presented in [1]. We envision that our training approach will be used once for a new application domain; subsequent registrations in the domain will only require the trained network.

2 Method
Registration between a fixed image $I_f$ and a moving image $I_m$ can be formulated as optimizing a transformation $\hat{T}$, in transformation space $\mathcal{T}$, that maximizes their agreement measured by the (similarity) score $S$:
$$\hat{T} = \arg\max_{T_\theta \in \mathcal{T}} S\big(I_f(x),\, I_m(T_\theta(x))\big) \qquad (1)$$
In this context, $T_\theta$ is a transformation from the coordinate frame of the moving image to the fixed image, $x$ are locations in the coordinate frame, and $\theta$ are the transformation parameters. Mutual information compares images by measuring the statistical relationship among single pixels of two modalities; however, deep networks compare images using a hierarchy of concepts (edges, contours, etc.) from all the pixels, and we hypothesize this will substantially improve the performance in non-trivial registration problems. In this work, we estimate a similarity metric ($S$) from the output score of classifier CNNs by training over corresponding and non-corresponding pairs of patches. In our framework we learn the metric from roughly aligned data, in contrast to the method in [1] where perfectly aligned data were used for the training. By applying symmetrization and dithering to the training data, we demonstrate that even with substantial misalignment between the modalities, a deep metric can be learned and further used for the task of registration. Our metric is formulated as the aggregation of the score of the network, $f$, for randomly chosen patches, $p$, from the fixed and moving image:
$$S\big(I_f(x),\, I_m(T_\theta(x))\big) = \sum_{p} f\big(I_f(p),\, I_m(T_\theta(p))\big) \qquad (2)$$
In our registration experiments we optimize the metric with Powell’s method, a popular derivative-free optimizer [8].
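The structure described above (a per-patch score, summed over randomly chosen patches, then maximized over the transformation without derivatives) can be sketched on a 1D toy problem. Everything here is an assumption made for illustration: `patch_score` is a hypothetical stand-in for the trained CNN (a negative squared difference), the transformation is an integer shift, and an exhaustive search replaces Powell's method.

```python
import random


def patch_score(fixed_patch, moving_patch):
    # Hypothetical stand-in for the trained CNN's per-patch score:
    # higher means the two patches agree better.
    return -sum((a - b) ** 2 for a, b in zip(fixed_patch, moving_patch))


def metric(fixed, moving, shift, starts, size):
    # Aggregate (sum) the per-patch scores over the chosen patch locations,
    # reading the moving signal at positions displaced by `shift`.
    total = 0.0
    for s in starts:
        total += patch_score(fixed[s:s + size],
                             moving[s + shift:s + shift + size])
    return total


def register(fixed, moving, max_shift, n_patches=16, size=8, seed=0):
    # Exhaustive search over integer shifts; a simple derivative-free
    # stand-in for Powell's method.
    rng = random.Random(seed)
    first = max_shift
    last = len(fixed) - size - max_shift
    starts = [rng.randrange(first, last) for _ in range(n_patches)]
    return max(range(-max_shift, max_shift + 1),
               key=lambda t: metric(fixed, moving, t, starts, size))


# Toy data: the moving signal is the fixed signal delayed by 5 samples.
data_rng = random.Random(1)
fixed = [data_rng.random() for _ in range(200)]
true_shift = 5
moving = [0.0] * true_shift + fixed[:-true_shift]
estimated = register(fixed, moving, max_shift=10)
```

In this toy setting the summed score is maximal only at the true shift, so `estimated` recovers `true_shift`; with a learned metric and a rigid 3D transformation the same loop would be driven by the optimizer instead of a grid.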
2.1 Data Augmentation by Symmetrization and Dithering
Depending on the distribution of misregistration in the training data, without any augmentation there may be substantial bias in the deep metric response function. For instance, if the moving images in the training data were all shifted to the right, then the response function will reflect this bias with a peak displaced from zero by that shift. Such bias can be effectively addressed by symmetrizing the training data; in the above example, flipping half of the moving images in the training set about the vertical axis results in a response function that now has two prominent modes, one on either side of zero. Although the bias has been greatly reduced, the bimodal property of the response function will cause major problems for optimizers. Historically, Gaussian blur of the images has been used to achieve a single mode in objective functions, leading to multi-resolution approaches [9]. We experimented with such blurring; however, we found that “dithering” is a more effective approach. In signal processing, dithering is the deliberate introduction of noise that can be used for randomizing quantization error [10]. We apply Gaussian distributed random displacements to the moving image to effectively merge modes whose separation is on the order of the standard deviation of the dithering; the displacements are drawn from a Gaussian distribution whose variance is the variance of the dither.
Pairs of 3D patches are randomly cropped from the fixed image and from the corresponding locations in the moving image. For the non-corresponding class, patches are selected randomly in space. Empirically, we observed an increase in performance from applying both a restriction on position (a minimum distance between misregistered patches) and a restriction of sampling to the anatomy. Finally, all the patches are augmented by a combination of rotation and flipping to remove the bias, as explained earlier.
In situations where we have substantial misregistration in the training data, small patch sizes cannot fully capture long-range shared information between the images. Therefore, we propose to perform multi-resolution, multi-shot registration by applying downsampling to the images. In our method, the deep metric learned on the downsampled data is used to realign the training data, which eventually makes the misalignment smaller and enables registration with smaller patch sizes. To accomplish anti-aliasing in the downsampling, a Gaussian kernel with a standard deviation proportional to the level of downsampling was applied to smooth the images, and intensities were normalized. On average, 1 million patches were generated for training each CNN, consisting of augmented versions of the roughly registered patches from the training data and uniformly randomly misregistered patches.
2.2 Network Architecture and Training
Architecture: A 2-channel 3D CNN was used to estimate the similarity of two 3D volumes. The 2-channel architecture has the capacity to compare the patches from the beginning (early fusion), and has exhibited better performance than the other architectures in the task of patch comparison [11]. We stacked the fixed image patch and the moving image patch as input channels of the network. Specifically, we used a 5-layer architecture consisting of 3 convolutional layers (filter size 5 followed by 3 layers of filter size 3, all stride 2), a pooling layer, a dense layer and a final softmax layer.
Training: We train our network by minimizing the cross-entropy loss between registered and unregistered patches. To tune the weights of the CNN, the Adam [12] update rule is used with a learning rate of 0.01 that decays by a factor of 0.8 after each epoch. To prevent overfitting, we also add dropout [13] with a drop rate of 0.5, and regularize the weights with weight decay, i.e., a penalty on the weights.
Registration: We use the signal immediately before the final softmax as the per-patch response function because the softmax tends to flatten the response functions, which can cause difficulties for registration optimizers. In probabilistic terms, the pre-softmax value corresponds to the log likelihood ratio for the registered vs. unregistered classes (rather than the posterior probability of “registered”). After using dithering to shape the response function, we use the trained network’s output summed over multiple patches as an objective function to register new data, or to improve registration of the training data, which will greatly reduce bias, and reduce variance, in the misregistration of the training data. This process may be repeated if necessary, as depicted in Figure 1.
3 Experiments and Results
Materials: We use the IXI brain development dataset [14], which consists of approximately 600 aligned T1- and T2-weighted MRI, PD, MRA and DTI scans of adult brains. In the first two experiments, we randomly choose 100 pairs of T1- and T2-weighted scans, and use 25 pairs for training and validation and the rest for evaluation. All the images are resampled to a common resolution. A single patch size was used for all experiments, except for the first part of experiment 1, which used a different patch size.
Experiment 1: We claim that misalignment in the training data can appear as a substantial bias in the deep metric response function, and that applying symmetrization will make the shifted response function bimodal. To support this claim, we train separate CNNs on two datasets. In the first, one of the images is deterministically translated by a fixed offset along one axis. In the second, we generate an augmented version of the previous data set by combinations of rotation and mirroring applied to the fixed and moving patches. The deep metric is characterized by plotting, as a function of, e.g., translation, the sum of pre-softmax activations of 64 randomly chosen patches in a test data set.
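The reason for summing pre-softmax activations rather than softmax probabilities can be illustrated numerically. The sketch below assumes a two-logit output (one logit per class) and takes the response to be the logit difference, which equals the log posterior odds and, under equal class priors, the log likelihood ratio; the function names and the example logits are hypothetical.

```python
import math


def softmax_registered(z_reg, z_unreg):
    # Posterior probability of the "registered" class from the two logits.
    m = max(z_reg, z_unreg)  # subtract the max for numerical stability
    e_reg = math.exp(z_reg - m)
    e_unreg = math.exp(z_unreg - m)
    return e_reg / (e_reg + e_unreg)


def pre_softmax_response(z_reg, z_unreg):
    # Logit difference; algebraically equal to log(p / (1 - p)).
    return z_reg - z_unreg


# Two hypothetical patches, both confidently "registered".
logits = [(8.0, 0.0), (12.0, 0.0)]
probs = [softmax_registered(a, b) for a, b in logits]
responses = [pre_softmax_response(a, b) for a, b in logits]

# The probabilities are nearly identical (the softmax has saturated),
# while the pre-softmax responses still differ by 4.
prob_gap = probs[1] - probs[0]
response_gap = responses[1] - responses[0]

# Summing pre-softmax responses over patches gives the aggregate score.
aggregate = sum(responses)
```

Because the saturated probabilities barely change between the two patches, an optimizer working on them would see an almost flat objective near the optimum, whereas the summed logit differences keep a usable slope.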
We also study the effect of dithering on a training set that has substantial bias in the distribution of misregistration, created by rigidly transforming the moving image. We apply a uniformly distributed 3D translation followed by a uniformly distributed 3D rotation. We then train a CNN on a dithered version of the data, obtained by applying a random 3D translation following a Gaussian distribution.
Experiment 2: In this experiment we demonstrate the effectiveness of our proposed method on substantially misregistered training data, for intra-subject rigid registration of T1-T2 MRI scans. A rigid transformation, with randomly sampled translation and rotation parameters, is applied to misregister the moving images (T1). Our baseline for this experiment is normalized mutual information with 60 histogram bins. Given the large range of misalignment, we experiment with a multi-resolution, multi-shot training strategy. Specifically, a 4-stage training is performed, where each stage’s learned deep metric is used to realign the training data to reduce the misregistration. Model selection is performed based on the accuracy on the validation set. Each of the 4 stages learns a deep metric at a given downsampling factor and dithering variance. We also augment each patch by rotation and flipping. We misregister the unseen test set in the same manner as the training set, and use the learned deep metric to rigidly register the data.
Experiment 3: To explore the accuracy of the proposed method in handling misalignments in other domains, we perform an experiment on learning a deep metric from a dataset of T2-weighted MRI and Gradient Magnitude (GM) images (Figure 1) derived from T1 MRI scans; the GM images are meant to be somewhat similar to US images. We randomly select cases from the IXI dataset and use 75 for training and validation, and the rest for evaluation. After converting all the T1 images to GM images, we misalign the data by a randomly sampled rigid transformation. In this experiment we perform a 3-stage learning strategy to effectively learn the deep metric from the misaligned training data, and we apply the learned metric to register the misaligned evaluation data.
Results: Figure 2a demonstrates the effect of the distribution of misregistration in the training data on the deep metric response function for experiment 1. In Figures 2b and 2c we can see the unimodal and broad response function that is caused by dithering of the moving image. The quantitative results of experiments 2 and 3 are listed in Table 1. For experiment 2, the results demonstrate a statistically significant improvement in the Euclidean norm of the translation error and in the mean absolute error of the translation along one axis with our proposed method. The 3-stage response functions in experiment 3 are estimated on test data as a function of a transformation parameter and depicted in Figure 2d. In addition, the quantitative results of this experiment show significantly improved performance in the norm of the translation error compared to MI for the T2-GM registration problem.
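The unimodal, broadened response attributed to dithering can be illustrated with an idealized model, which is an assumption of this sketch rather than the experiment itself: after symmetrization the training offsets sit at -d and +d, and Gaussian dithering with standard deviation sigma turns their distribution into an equal-weight mixture of two Gaussians. Such a mixture is known to be unimodal when the separation 2d is at most 2 sigma. The helper names and the values of `d` and `sigma` are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    # Normal probability density with mean mu and standard deviation sigma.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture(x, d, sigma):
    # Equal-weight mixture of two Gaussians centred at -d and +d.
    return 0.5 * (gaussian(x, -d, sigma) + gaussian(x, d, sigma))


def count_modes(d, sigma, half_width=30.0, steps=6001):
    # Count strict local maxima of the mixture on a uniform grid.
    xs = [-half_width + 2.0 * half_width * i / (steps - 1)
          for i in range(steps)]
    ys = [mixture(x, d, sigma) for x in xs]
    return sum(1 for i in range(1, steps - 1)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])


# Offsets at -4 and +4: a small dither leaves two separate modes,
# a dither larger than the offset merges them into one broad mode.
modes_small_dither = count_modes(d=4.0, sigma=1.0)
modes_large_dither = count_modes(d=4.0, sigma=6.0)
```

The merged mode is wider than either original peak, which mirrors the trade-off described in the text: the bias and bimodality disappear at the cost of a broader response.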

Task | Method | |||||||
---|---|---|---|---|---|---|---|---|
T1-T2 | MI | |||||||
T1-T2 | CNN | |||||||
T1-GM | MI | |||||||
T1-GM | CNN | |||||||
4 Discussion and Conclusion
In this work, we have presented a novel strategy to enable learning a deep metric for multimodal image registration from substantially misaligned data. Specifically, symmetrization and dithering are proposed to reduce bias and effectively merge multiple modes in the response function that are caused by the distribution of the misregistration in the training data. Although larger misalignments can be captured by using larger patches, limited memory capacity prevents us from increasing the patch size. Online data generation and training could overcome this, but would drastically increase the training time. To address this issue, we proposed a multi-resolution, multi-shot approach. Currently, this approach comes with a mild interpolation artifact in the early-stage response functions that can be seen in Figure 2d; it is not evident in the final-stage response functions.
We demonstrated that our learned deep metric can effectively be used for the task of rigid registration and significantly outperforms MI in estimating translation parameters. We believe applying dithering in rotation, in addition to translation, will increase the performance with respect to rotation.
In future work, we plan to extend our framework to nonrigid registration and evaluate it on a more difficult registration task: ultrasound to MRI registration, with resection. In conclusion, data augmentation via symmetrization and dithering is an effective strategy that removes the need for well-aligned training data; this brings deep metric registration from the realm of supervised to semi-supervised machine learning.
References
- [1] Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., Komodakis, N.: A deep metric for multimodal registration. In: MICCAI. (2016) 10-18
- [2] Cheng, X., Zhang, L., Zheng, Y.: Deep similarity learning for multimodal medical images. CMBBE (2016) 1-5
- [3] Viola, P., Wells III, W.M.: Alignment by maximization of mutual information. International journal of computer vision 24(2) (1997) 137-154
- [4] Litjens, G.J.S., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical image analysis 42 (2017) 60-88
- [5] Wu, G., Kim, M., Wang, Q., Gao, Y., Liao, S., Shen, D.: Unsupervised deep feature learning for deformable registration of mr brain images. In: MICCAI. (2013) 649-656
- [6] Sokooti, H., de Vos, B.D., Berendsen, F.F., Lelieveldt, B.P.F., Isgum, I., Staring, M.: Nonrigid image registration using multi-scale 3d convolutional neural networks. In: MICCAI. (2017)
- [7] Miao, S., Wang, Z.J., Liao, R.: A cnn regression approach for real-time 2d/3d registration. IEEE Transactions on Medical Imaging 35 (2016) 1352-1363
- [8] Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal 7(2) (1964) 155-162
- [9] Thevenaz, P., Ruttimann, U.E., Unser, M.: A pyramid approach to subpixel registration based on intensity. IEEE TMI 7(1) (Jan 1998) 27-41
- [10] Schuchman, L.: Dither signals and their effect on quantization noise. IEEE Transactions on Communication Technology 12(4) (December 1964) 162-165
- [11] Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: 2015 IEEE (CVPR). (June 2015) 4353-4361
- [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR (2014)
- [13] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014) 1929-1958
- [14] IXI: Information eXtraction from Images. http://brain-development.org/