I Introduction
Multi-modal registration has many use cases, such as in image-guided therapy where pre-therapeutic imaging is aligned with live data to guide treatment (e.g., tumor localization in surgery or radiotherapy). The large computational run-time of conventional iterative methods means that registration in these applications is limited to inaccurate linear transforms [13]. Recent advances in unsupervised deep learning-based registration have demonstrated sub-second run-times in accurate mono-modal deformable registration [3]. A range of auxiliary approaches have since been investigated to improve registration performance, specifically segmentation maps, multi-resolution schemes, and locally adaptive regularisation [3, 5, 9, 15, 11]. The extension of deep learning methods to multi-modal registration depends on an appropriate similarity function, such as mutual information (MI) [14], which is one of the most studied and successful similarity metrics in multi-modal iterative registration approaches [13]. Application of MI in a learning-based setting is not trivial due to the non-differentiable quantization step (i.e., binning of intensity values). One solution is to use a continuous quantization function [11]. In this paper, we propose estimating MI using a 2-layer convolutional neural network (CNN) to extend deep learning-based frameworks to multi-modal registration. More specifically, we estimate MI with mutual information neural estimation (MINE), which eliminates the cumbersome discretization of the density estimation step and allows end-to-end training as it is fully differentiable [4].
II Methods
II-A Registration framework
Image registration estimates a transform $\phi_\theta$ (parameterized by $\theta$) that aligns a moving image $m$ to a fixed image $f$, such that $m \circ \phi_\theta$ becomes similar to $f$, where $x \in \Omega$ are points on the image lattice $\Omega$. For brevity we drop $\theta$ and write the transformation as $\phi$.
For our registration framework we use the diffeomorphic variant of VoxelMorph (VM) [5], see Fig. 1, but our method is not specific to this framework. VM takes the input images and passes them through a U-Net-like structure [12] to output a posterior distribution from which a sample is taken that is a stationary velocity field $v$ [5]. This is then integrated to form a diffeomorphic transform $\phi$ using a scaling and squaring approach [1], and applied to the moving image. VM applies the mean square error (MSE), a mono-modal similarity metric, between $m \circ \phi$ and $f$, and a regularization function that contains one hyper-parameter to control the scale of the velocity field.
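The scaling and squaring step can be made concrete with a short sketch. Below is a minimal PyTorch illustration of integrating a stationary velocity field by repeated self-composition of the half-scaled displacement; the tensor layout, the number of steps, and the `warp` helper are our own assumptions, not VoxelMorph's released implementation.

```python
import torch
import torch.nn.functional as F

def warp(image, disp):
    """Warp `image` (B, C, D, H, W) with a displacement field `disp`
    (B, 3, D, H, W), given in voxel units, via trilinear interpolation."""
    B, _, D, H, W = disp.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H),
                                torch.arange(W), indexing="ij")
    identity = torch.stack((zz, yy, xx), dim=0).float().unsqueeze(0).to(disp)
    coords = identity + disp                          # sampling locations (voxels)
    scale = torch.tensor([D - 1, H - 1, W - 1], dtype=coords.dtype,
                         device=coords.device).view(1, 3, 1, 1, 1)
    coords = 2.0 * coords / scale - 1.0               # normalise to [-1, 1]
    grid = coords.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]]  # (x, y, z) order
    return F.grid_sample(image, grid, align_corners=True)

def scaling_and_squaring(velocity, num_steps=7):
    """Integrate a stationary velocity field into a diffeomorphic displacement
    field [1]: scale down by 2**num_steps, then compose the map with itself
    num_steps times. num_steps=7 is an assumed, not reported, value."""
    disp = velocity / (2 ** num_steps)
    for _ in range(num_steps):
        # (phi o phi)(x) = x + u(x) + u(x + u(x))  =>  u' = u + warp(u, u)
        disp = disp + warp(disp, disp)
    return disp
```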

II-B Novel image matching loss function
We replace the image matching term in the loss function of VM
[5] with a loss function based on MI that has been demonstrated to work effectively for multi-modal problems in traditional (non-machine-learning) registration methods [14]. The design of our image matching loss function is based on MINE [4] and replaces the cumbersome approximation of MI through a binning approach by training a network that estimates MI and works on multi-dimensional continuous-valued data. The MI-based registration of $m$ to $f$ is defined by
$\mathrm{MI}(f, m \circ \phi) = \iint p_{f,\,m\circ\phi}(a, b)\, \log \frac{p_{f,\,m\circ\phi}(a, b)}{p_f(a)\, p_{m\circ\phi}(b)}\, \mathrm{d}a\, \mathrm{d}b$   (1)
where $p_{f,\,m\circ\phi}(a, b)$ measures the joint probability between the gray values of the fixed and moving images, and $p_f(a)$ and $p_{m\circ\phi}(b)$ are the marginal probabilities of $f$ and $m\circ\phi$ – note that the MI in (1) is equivalent to the Kullback-Leibler (KL) divergence between the joint probability density $\mathbb{P}_{f,\,m\circ\phi}$ and the product of marginals $\mathbb{P}_f \otimes \mathbb{P}_{m\circ\phi}$, defined by $\mathrm{MI}(f, m\circ\phi) = D_{\mathrm{KL}}(\mathbb{P}_{f,\,m\circ\phi}\,\|\,\mathbb{P}_f\otimes\mathbb{P}_{m\circ\phi})$. We extend MINE [4] by training the image matching network with a lower-bound estimate on the MI in (1) using the following inequality [6]:
$\mathrm{MI}(f, m\circ\phi) \;\ge\; \sup_{\omega}\; \mathbb{E}_{\mathbb{P}_{f,\,m\circ\phi}}[T_\omega] - \log \mathbb{E}_{\mathbb{P}_f\otimes\mathbb{P}_{m\circ\phi}}[e^{T_\omega}]$   (2)
where $T_\omega$ (parameterized by $\omega$) represents any class of functions that satisfies the integrability constraints [6], and the voxel pairs fed to $T_\omega$ are $(f(x), m(\phi(\pi(x))))$ with $\pi$ being the identity function when the expectation is over $\mathbb{P}_{f,\,m\circ\phi}$ (so $f$ and $m\circ\phi$ match according to $\phi$), and $\pi = \sigma$ (with $\sigma$ being a flat distribution over $\Omega$, such as a random permutation of locations) when the expectation is over $\mathbb{P}_f\otimes\mathbb{P}_{m\circ\phi}$ – this means that for the product of marginals, $f$ and $m\circ\phi$ are unlikely to match according to $\phi$.
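As a concrete illustration, the lower bound in (2) reduces to a one-line estimate once the network outputs on matched and shuffled voxel pairs are available. The sketch below (PyTorch, with our own naming) shows this computation; it is a sketch under our assumptions, not the authors' code.

```python
import torch

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound of Eq. (2): E_P[T_w] - log E_Q[exp(T_w)].
    `t_joint` holds T_w evaluated on aligned (f, m o phi) voxel pairs,
    `t_marginal` holds T_w evaluated on shuffled (unaligned) pairs."""
    n = torch.tensor(float(t_marginal.numel()), device=t_marginal.device)
    log_mean_exp = torch.logsumexp(t_marginal.flatten(), dim=0) - torch.log(n)
    return t_joint.mean() - log_mean_exp
```

Maximising this quantity over $\omega$ tightens the bound, while maximising it over the registration network parameters drives the images into alignment.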
The loss function that we use to optimise our model is the following:
$\mathcal{L}(f, m; \theta, \omega) = \mathcal{L}_{\mathrm{reg}}(\phi) - \gamma\, \widehat{\mathrm{MI}}_{\omega}(f, m \circ \phi)$   (3)
where $\mathcal{L}_{\mathrm{reg}}$ is the regularization function that encourages the variational posterior to match the prior [5], $\phi$ is the diffeomorphic transform computed from the network (see left hand side of Fig. 1), $\widehat{\mathrm{MI}}_{\omega}$ is the lower bound defined in Eq. (2), and $\gamma$ is the image similarity weight.
In (3), $T_\omega$ is modeled as a simple 2-layer CNN with $1\times1\times1$ filters to preserve the independent voxel assumption in MI. The first layer combines linearly transformed versions of the input voxel pair to create a feature vector per voxel pair. Then, a second layer maps these features back to a single value per voxel, which is used to compute the two expectation terms in (2). During inference, we are given two test images $f$ and $m$, and the registration is obtained by applying the diffeomorphic transform $\phi$ to $m$.
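A minimal sketch of such a statistics network is given below; the feature width and activation are assumptions (the paper's exact settings are not reproduced here), but the use of $1\times1\times1$ convolutions matches the per-voxel-pair independence described above.

```python
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """Sketch of T_w: a 2-layer CNN with 1x1x1 filters, so each voxel pair
    is scored independently of its neighbours, as required by MI."""
    def __init__(self, features=64):          # feature width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, features, kernel_size=1),   # per-voxel-pair features
            nn.ReLU(inplace=True),
            nn.Conv3d(features, 1, kernel_size=1),   # scalar T_w per voxel
        )

    def forward(self, fixed, warped_moving):
        # stack the two intensities of each voxel pair along the channel axis
        pair = torch.cat([fixed, warped_moving], dim=1)  # (B, 2, D, H, W)
        return self.net(pair)
```

Feeding aligned and shuffled pairs through this network and plugging the outputs into the bound of Eq. (2) gives the image-matching term of Eq. (3).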
For the shuffling function $\sigma$ in (2), we evaluate global and local strategies, referred to as MINE-global and MINE-local. In MINE-global, $\sigma$ randomly shuffles all voxel coordinates (using a uniform distribution over the whole input lattice $\Omega$) to estimate the distribution over voxel pairs for unaligned images. However, since a large fraction of voxels in our data belong to background, we also evaluate a local shuffling strategy that estimates the distribution of voxel pairs in their expected neighborhood, as would be expected after sensible pre-processing and initialisation. This is denoted by MINE-local, where $\sigma$ returns a new voxel address that lies within a fixed number of voxels (in each direction) of the original address $x$.
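The two shuffling strategies can be sketched as follows (PyTorch, our own naming; the local radius `r=3` is a hypothetical value, not the setting used in the experiments).

```python
import torch

def shuffle_global(x):
    """MINE-global: sigma is a random permutation of all voxel coordinates."""
    flat = x.reshape(x.shape[0], x.shape[1], -1)
    perm = torch.randperm(flat.shape[-1], device=x.device)
    return flat[:, :, perm].reshape_as(x)

def shuffle_local(x, r=3):
    """MINE-local: each voxel is replaced by a random voxel drawn uniformly
    from within +/- r voxels of its own address, in each direction."""
    B, C, D, H, W = x.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D, device=x.device),
                                torch.arange(H, device=x.device),
                                torch.arange(W, device=x.device),
                                indexing="ij")
    offsets = torch.randint(-r, r + 1, (3, D, H, W), device=x.device)
    z = (zz + offsets[0]).clamp(0, D - 1)
    y = (yy + offsets[1]).clamp(0, H - 1)
    w = (xx + offsets[2]).clamp(0, W - 1)
    return x[:, :, z, y, w]
```

The shuffled volume is paired with the unshuffled fixed image to approximate samples from the product of marginals in (2).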
II-C Data and preprocessing
This research study was conducted retrospectively using human subject data made available in open access [8, 10]. Ethical approval was not required as confirmed by the license attached with the open access data. For our experiments we use the T1-weighted scans from the OASIS-2 dataset [8], and a subset of the ADNI dataset [10], where we included all subjects that had a T1-weighted and FLAIR scan with high spatial resolution acquired in the initial session. Preprocessing was performed using FreeSurfer [7] on all T1-weighted scans and consisted of resampling to 1 mm isotropic voxels, affine spatial normalization, brain extraction, and automated generation of segmentation maps. Since both modalities of the ADNI dataset are acquired in the same session, we assume that they are well aligned and apply the preprocessing parameters (including segmentation maps) found for T1 to the FLAIR volumes. Anatomical regions were excluded if they had fewer than 100 voxels in any of the subjects, resulting in 30 regions that are used for evaluation (see later). Volumes are cropped to a common size and, after removal of subjects that failed pre-processing, split into training:validation:test sets with 230:24:118 subjects for OASIS and 293:24:118 for ADNI.
II-D Training process
We evaluate our method on two tasks. First, we replicate the setup from [5] and evaluate mono-modal registration to an atlas. Second, we evaluate the performance of our method on multi-modal inter-patient registration by aligning T1-weighted and FLAIR volumes, which we regard as an example test for the non-linear registration required in image-guided procedures. For the training of the MINE-network in (3), we set the feature vector length and the size of the neighbouring region for MINE-local empirically. Evaluation of these values was limited by available memory and prolonged training, and they were chosen on the basis of capturing a reasonable level of anatomical context (for the neighbourhood size) and achieving fast and stable convergence (for the feature vector length). In (3), we use the same regularization parameter for all experiments, and training was performed using the Adam optimizer for 1500 epochs with a batch size of 1 and Kaiming initialization.
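For completeness, a hypothetical end-to-end training step is sketched below. It relies on the helper sketches above (`scaling_and_squaring`, `warp`, `shuffle_local`, `StatisticsNet`), assumes a registration network that returns a velocity field together with its regularization term, and uses an assumed learning rate rather than the paper's setting.

```python
import torch

def make_optimizer(reg_net, stats_net, lr=1e-4):
    # one Adam optimizer over both networks: the whole loss (Eq. 3) is
    # differentiable, so registration and MINE networks train jointly
    params = list(reg_net.parameters()) + list(stats_net.parameters())
    return torch.optim.Adam(params, lr=lr)

def train_step(reg_net, stats_net, optimizer, fixed, moving, sim_weight=1.0):
    velocity, reg_term = reg_net(fixed, moving)        # hypothetical interface
    disp = scaling_and_squaring(velocity)              # diffeomorphic transform
    warped = warp(moving, disp)
    t_joint = stats_net(fixed, warped)                 # aligned voxel pairs
    t_marg = stats_net(fixed, shuffle_local(warped))   # shuffled pairs
    n = torch.tensor(float(t_marg.numel()), device=t_marg.device)
    mi_bound = t_joint.mean() - (torch.logsumexp(t_marg.flatten(), 0)
                                 - torch.log(n))
    loss = reg_term - sim_weight * mi_bound            # Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```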
II-E Comparison with alternative methods
The baseline methods that we include are ANTs [2] and the diffeomorphic variant of VoxelMorph [5]. ANTs registration uses the default settings that combine SyN [2] with cross-correlation (CC) and MI to produce the method's optimal results for both registration tasks. The training of VM is performed using local normalized cross-correlation (VM-CC) over a local neighborhood [3]. We use CC rather than MSE as CC is also capable of multi-modal registration.
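For reference, the local normalized cross-correlation used by the VM-CC baseline can be sketched as follows; the window size of 9 is an assumption chosen only for illustration, not the value used in the experiments.

```python
import torch
import torch.nn.functional as F

def local_ncc(a, b, win=9):
    """Local normalized cross-correlation between volumes a, b of shape
    (B, 1, D, H, W), averaged over win^3 neighbourhoods."""
    kernel = torch.ones(1, 1, win, win, win, device=a.device) / win ** 3
    pad = win // 2
    mu_a = F.conv3d(a, kernel, padding=pad)
    mu_b = F.conv3d(b, kernel, padding=pad)
    var_a = F.conv3d(a * a, kernel, padding=pad) - mu_a ** 2
    var_b = F.conv3d(b * b, kernel, padding=pad) - mu_b ** 2
    cov = F.conv3d(a * b, kernel, padding=pad) - mu_a * mu_b
    return (cov ** 2 / (var_a * var_b + 1e-5)).mean()
```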
Evaluation is separated into a quantitative and a qualitative part. The quantitative part evaluates the Dice similarity coefficient (DSC) as a measure of registration accuracy, the incidence of non-positive Jacobian determinants as a measure of diffeomorphic performance, and the run-time at test-time to assess feasibility for image-guided therapies. The qualitative part compares visual registration results and shows the response of the MINE-network for the joint distribution in (2).
III Results

III-A Quantitative results
Tuning results for the similarity weight $\gamma$ (in Eq. 3), for all VM methods, are shown in Fig. 2, with the best results summarized in Table I. Note that the DSC is not very sensitive to the choice of the similarity weight for the MINE-based methods, and the only notable sensitivity for non-positive Jacobian determinants occurs when the similarity weight becomes large. On the mono- and multi-modal tasks, all methods achieve comparable results at peak performance, with clear increases in mono- and multi-modal DSC compared to the affine results. (A side-by-side comparison of all methods in terms of Dice performance per area is available on request.)
TABLE I: Best mean Dice for each framework (Affine, ANTs, VM-CC, VM-MINE-global, VM-MINE-local) on the mono-modality and multi-modality tasks.
Even though ANTs combines multiple similarity metrics with SyN, it does not substantially outperform the VM-based methods and has more outliers in failed registrations, making the deep-learning-based approaches more reliable for clinical applications. Comparing the VM-based methods for multi-modal registration in terms of DSC, MINE-local performs better than CC, with MINE-global a close third place. For mono-modal registration, CC is slightly better than both MINE variants. However, there are limits to how strongly we can extrapolate from these findings, as there can be imperfections in automated segmentation, variable pre-processing performance across modalities, and observed run-to-run variations in mean DSC. Table I and Fig. 2 show that the diffeomorphic properties of VM are unaffected by the switch to multi-modal registration and the choice of similarity metric. The number of non-diffeomorphic transforms is negligible in all cases up to the peak-performance point, based on DSC, though it rapidly increases thereafter.
Since the similarity metric does not need to be evaluated at test-time, the run-time of VM is the same for all similarity metrics: sub-second for the mono-modal task, with a modest increase for multi-modal registration on a Titan Xp GPU. In both tasks, the deep-learning-based methods are orders of magnitude faster than ANTs and are fast enough for most real-time applications.
The low sensitivity to the similarity weight, together with the similar DSC performance, fewer outliers, and faster execution, makes MINE-local the overall best, or at least equally favourable to VM-CC, in both mono- and multi-modal tasks on the basis of the quantitative results.
III-B Qualitative results
Representative registration results are shown in Fig. 3. All methods produce accurate results on both tasks, with minor differences in varying locations. Ignoring the extracortical differences resulting from background masking in the atlas, the most prominent differences for all methods are in the cortex, where the largest inter-patient anatomical variation occurs, though differences in the lateral ventricles are also visible. In the mono-modal task these differences are consistent throughout the image for ANTs and MINE, while VM-CC shows larger average misalignment in the frontal lobe, and especially the anterior horn of the lateral ventricles, compared to the rest of the image. This is consistent for all values of the similarity weight in all similarity metrics evaluated. We hypothesize that intensity normalization in small neighborhoods for CC, while producing excellent overall results, can have a negative effect on learning features for registration where there is large anatomical variation. Since MINE is a global metric, even in our local sampling strategy, it does not suffer from this variation. The difference images also show that MINE behaves consistently everywhere, with minimal differences on sub-structure boundaries where voxel intensities change.
To further investigate the MINE-network we took two artificial inputs that cover the range of possible intensity-pair combinations present in the training data to visualize the network's joint-distribution prediction map; see Fig. 4 for multi-modal results. The local shuffling strategy has removed less informative background-brain intensity pairs compared to MINE-global, instead using local brain-brain pairs, as seen in unaligned inputs. This is reflected in the more complex joint distribution learned for MINE-local, whose 2-layer CNN creates better features for registration.
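The joint-distribution response in Fig. 4 can be reproduced, in principle, with two synthetic inputs that enumerate every intensity pair on a grid and a single forward pass of the trained statistics network; the sketch below uses the hypothetical `StatisticsNet` introduced earlier and assumes intensities normalized to [0, 1].

```python
import torch

def joint_response_map(stats_net, num_levels=64):
    """Evaluate T_w on a grid of all (fixed, moving) intensity combinations."""
    levels = torch.linspace(0.0, 1.0, num_levels)
    a, b = torch.meshgrid(levels, levels, indexing="ij")      # intensity pairs
    fixed = a.reshape(1, 1, 1, num_levels, num_levels)        # pseudo-volumes
    moving = b.reshape(1, 1, 1, num_levels, num_levels)
    with torch.no_grad():
        response = stats_net(fixed, moving)                   # (1, 1, 1, N, N)
    return response.squeeze()                                 # (N, N) map to plot
```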
In clinical applications it is important to avoid non-diffeomorphic transformations, as these represent cases where anatomically implausible results are generated (e.g., structures disappear or break apart). For real-time applications it is very important to maintain good regional performance (measured by DSC) and anatomical fidelity (measured by Jacobian determinants) while achieving fast, sub-second run-times. The local form of the MINE-network allows us to satisfy all three of these criteria, and without difficulty in tuning hyper-parameters. This makes MINE-local an excellent candidate for real-time clinical applications, from both a theoretical and a practical standpoint.

IV Conclusion
We have shown that a small 2-layer CNN MINE-network is capable of producing state-of-the-art mono- and multi-modal registration results on continuous data when smart sampling strategies are used. This eliminates the cumbersome quantization step of conventional MI-based registration and has relatively low sensitivity to the only key hyper-parameter, the image similarity training weight. This makes the similarity metric easy to use and easy to tune during training, with an extremely low rate of producing non-diffeomorphic transformations or failures/outliers. Overall, it produces accurate and anatomically correct results that are fast enough for most real-time applications of multi-modal deformable registration in image-guided therapies.
References
- [1] (2006) A log-Euclidean framework for statistics on diffeomorphisms. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pp. 924–931.
- [2] (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12, pp. 26–41.
- [3] (2019) VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. on Medical Imaging 38, pp. 1788–1800.
- [4] (2018) Mutual information neural estimation. In Int. Conf. on Machine Learning, pp. 531–540.
- [5] (2019) Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57, pp. 226–236.
- [6] (1983) Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics 36, pp. 183–212.
- [7] (2012) FreeSurfer. NeuroImage 62, pp. 774–781.
- [8] (2010) Open Access Series of Imaging Studies: longitudinal MRI data in nondemented and demented older adults. Journal of Cognitive Neuroscience 22, pp. 2677–2684.
- [9] (2020) DRMIME: differentiable mutual information and matrix exponential for multi-resolution image registration. Proceedings of Machine Learning Research 121, pp. 527–543.
- [10] (2010) Alzheimer's Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74, pp. 201–209.
- [11] (2021) Learning diffeomorphic and modality-invariant registration using B-splines. In Medical Imaging with Deep Learning.
- [12] (2015) U-Net: convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- [13] (2013) Deformable medical image registration: a survey. IEEE Trans. on Medical Imaging 32, pp. 1153–1190.
- [14] (1996) Multi-modal volume registration by maximization of mutual information. Medical Image Analysis 1, pp. 35–51.
- [15] (2021) A novel unsupervised learning model for diffeomorphic image registration. In Proc. SPIE Medical Imaging 2021, pp. 115960M.