Deform-GAN: An Unsupervised Learning Model for Deformable Registration
Deformable registration is one of the most challenging tasks in medical image analysis, especially for alignment across different sequences and modalities. In this paper, a non-rigid registration method for 3D medical images is proposed that leverages unsupervised learning. To the best of our knowledge, this is the first attempt to introduce a gradient loss into deep-learning-based registration. The proposed gradient loss is robust across sequences and modalities for large deformations. In addition, an adversarial learning approach is used to transfer multi-modal similarity to mono-modal similarity and improve precision. Neither ground truth nor manual labeling is required during training. We evaluate our network comprehensively on a 3D brain registration task. The experiments demonstrate that the proposed method can cope with data that exhibit non-functional intensity relations, noise and blur. Our approach outperforms other methods in both accuracy and speed.
Due to the complex relationship between intensity distributions, multi-modal registration [4] remains a challenging topic. Optimization-based registration, which maximizes similarity across phases or modalities by aligning voxel pairs, has long been the dominant solution. However, besides the high computational cost of optimizing over 3D images, it is very hard to define a descriptor robust enough to cope with the considerable differences between image pairs.
Recently, many methods leveraging deep learning have been proposed to solve the problems mentioned above. These approaches usually require ground-truth registration fields or landmarks annotated by experts. Some methods [1][2] have explored unsupervised strategies built on the spatial transformer network.
There are two main challenges in unsupervised-learning-based registration. The first is to define a loss that can efficiently measure similarity across modalities or sequences. For example, mutual information (MI) has been widely and successfully used in registration tasks, but it requires binning or quantizing, which can cause a gradient-vanishing problem [7]. The second challenge is the lack of ground truth. An intuitive way to tackle the multi-modal problem is image-to-image translation [5][3]. But without pixel-wise aligned data pairs, it is difficult to train a GAN to generate synthesized images in which all texture maps exactly to the source. For example, Cycle-GAN can generate images from MR that merely look like CT, but the accuracy of the details cannot meet the requirements of registration.

In this paper, we propose a novel unsupervised method that can easily achieve deformable registration between different sequences or modalities. The local gradient loss, an efficient and robust metric, is used for the first time in a deep-learning-based registration method. We combine an adversarial learning approach with spatial transformation to reduce multi-modal similarity to mono-modal similarity. Experimental results show that our approach is competitive with state-of-the-art image registration solutions in terms of accuracy and speed.
Our model mainly consists of three parts: an image transformation network that outputs the registration warp field for spatial transformation, an image generator that performs multi-modal translation, and a discriminator that distinguishes real images from synthesized ones. The architecture of our model and the training details are illustrated in Fig.1.
The transformation network $T$ takes the reference image $R$ and the floating image $F$ as input and outputs the registration warp field $\phi$; the mapping can be written as $\phi = T(R, F)$. The floating image is warped into $F \circ \phi$ using a spatial transformation function. Then $F \circ \phi$ is sent to the generator $G$, which synthesizes an image $G(F \circ \phi)$ in the domain of the reference image. That, in turn, yields an easier mono-modal registration task between $R$ and $G(F \circ \phi)$.
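For concreteness, the warping step can be sketched as follows in PyTorch (the function name, tensor layout and voxel-displacement convention are our assumptions for illustration, not necessarily those of the released code):

```python
import torch
import torch.nn.functional as nnf

def warp(floating, flow):
    """Trilinear warp of a 3D volume by a dense displacement field.
    floating: (B, C, D, H, W) image; flow: (B, 3, D, H, W) displacements
    in voxels, channel order (z, y, x)."""
    B, _, D, H, W = floating.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H),
                                torch.arange(W), indexing="ij")
    grid = torch.stack((zz, yy, xx)).float().to(flow.device)   # identity grid
    norm = torch.tensor([D - 1, H - 1, W - 1], dtype=torch.float32,
                        device=flow.device).view(1, 3, 1, 1, 1)
    # Displaced sampling positions, normalized to [-1, 1] for grid_sample.
    coords = 2.0 * (grid.unsqueeze(0) + flow) / norm - 1.0
    # grid_sample expects shape (B, D, H, W, 3) with the last axis as (x, y, z).
    coords = coords.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]]
    return nnf.grid_sample(floating, coords, align_corners=True)
```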
In the proposed network, the registration problem is divided into two parts: multi-modal registration between $R$ and $F$, and mono-modal registration between $R$ and $G(F \circ \phi)$, which share the same deformation warp field $\phi$. Every voxel in the synthesized image should therefore map precisely to the warped floating image $F \circ \phi$. However, in the early stage of training, $\phi$ is poor and the registration result is not accurate. If we used an architecture like Pix2Pix [6] and sent the unpaired $G(F \circ \phi)$ and $R$ to the discriminator, the generator would be confused and generate a misaligned $G(F \circ \phi)$. To solve this problem, we present a gradient-constrained GAN method for unpaired data. This method differs in that the learned loss penalizes not only structure that differs between output and target, but also structure that differs between output and source. The task of the generator thus consists of three parts: fooling the discriminator, minimizing the distance between output and target, and keeping the output texture similar to the source. The discriminator's job remains unchanged: to distinguish real from fake.
Both $T$ and $G$ are U-Net-like [8] networks; for details, our code and model parameters are available online at https://github.com/Lauraxy/Multi_Modal_Registration. The three networks are trained by turns: one step optimizing $T$, one step optimizing $G$, and one step optimizing $D$. Note that while training one network, the weights of the other two are fixed; please refer to Fig.1 for details. As $\phi$ is gradually refined, $G(F \circ \phi)$ becomes more and more realistic, which helps to update $G$ and $D$. In turn, $F \circ \phi$ can be aligned better to $R$, which contributes to training $T$. As a result, $T$ and $G$ become mutually beneficial.
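A minimal sketch of this by-turns schedule, assuming PyTorch modules and externally supplied loss functions (all names here are placeholders, not the authors' code):

```python
def train_epoch(loader, T, G, D, opt_T, opt_G, opt_D,
                loss_T_fn, loss_G_fn, loss_D_fn):
    """Alternate one optimization step per network; while one network is
    updated, the parameters of the other two stay frozen."""
    def freeze(net, flag):
        for p in net.parameters():
            p.requires_grad_(flag)

    for R, F in loader:  # reference / floating mini-batches
        for net, opt, loss_fn in ((T, opt_T, loss_T_fn),
                                  (G, opt_G, loss_G_fn),
                                  (D, opt_D, loss_D_fn)):
            for other in (T, G, D):
                freeze(other, other is net)  # train `net`, fix the rest
            opt.zero_grad()
            loss_fn(T, G, D, R, F).backward()
            opt.step()
```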
We tried several loss functions for evaluating similarity between multi-sequence images, such as MI and NGF, but each has its own weaknesses and could not achieve satisfying registration results. For example, we tried a Parzen-window estimation of MI to address the gradient-vanishing problem, but its huge memory consumption makes it impractical for training. NGF, in our experiments, could not drive the warp field to convergence. Here we present a local gradient loss that captures local structural information across modalities. It is similar to NGF, but more robust against noise and faster to converge.
Suppose that $p$ is a voxel position in volume $V$; the local gradient is obtained by summing the image gradient over a window $\Omega(p)$ around $p$:

$$g(p) = \sum_{q_i \in \Omega(p)} \big( \nabla_x V(q_i),\ \nabla_y V(q_i),\ \nabla_z V(q_i) \big) \tag{1}$$

where $\nabla_x V$, $\nabla_y V$, $\nabla_z V$ are the gradient fields along the three axes and $q_i$ iterates over the volume $\Omega(p)$ around $p$. The gradient is then normalized by:

$$\hat{g}(p) = \frac{g(p)}{\left\| g(p) \right\|_2 + \epsilon} \tag{2}$$

where $\|\cdot\|_2$ denotes the L2 norm. The local gradient loss between two volumes $A$ and $B$ can then be defined as:

$$LG(A, B) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \big( \hat{g}_A(p) \cdot \hat{g}_B(p) \big)^2 \tag{3}$$

where $\Omega$ is the common volume domain of $A$ and $B$. In our experiments, if the window $\Omega(p)$ in Eq.1 is too small, the network is difficult to converge; if it is too large, the edges of $R$ and $F \circ \phi$ cannot be aligned accurately. An intermediate window size gave the best results.
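A minimal PyTorch sketch of this measure, using window-summed central-difference gradients (the window size, the stabilizing epsilon and all names are illustrative assumptions; the exact values are not given above):

```python
import torch
import torch.nn.functional as nnf

def local_gradient_sim(a, b, win=9, eps=1e-5):
    """LG similarity of Eqs. 1-3 between volumes of shape (B, 1, D, H, W);
    the training losses use its negative."""
    def normalized_local_grad(v):
        # Central differences along z, y, x, zero-padded back to full size.
        dz = nnf.pad((v[:, :, 2:] - v[:, :, :-2]) / 2, (0, 0, 0, 0, 1, 1))
        dy = nnf.pad((v[:, :, :, 2:] - v[:, :, :, :-2]) / 2, (0, 0, 1, 1))
        dx = nnf.pad((v[:, :, :, :, 2:] - v[:, :, :, :, :-2]) / 2, (1, 1))
        g = torch.cat((dz, dy, dx), dim=1)
        # Eq. 1: sum the gradient over the window Omega(p).
        g = nnf.avg_pool3d(g, win, stride=1, padding=win // 2) * win ** 3
        # Eq. 2: normalize to unit length.
        return g / (g.norm(dim=1, keepdim=True) + eps)
    # Eq. 3: mean squared dot product of the normalized local gradients.
    dot = (normalized_local_grad(a) * normalized_local_grad(b)).sum(dim=1)
    return (dot ** 2).mean()
```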
Next we describe the loss of $T$, which can be expressed as:

$$\mathcal{L}_T = \mathcal{L}_{sim} + \lambda\, \mathcal{L}_{smooth} \tag{4}$$

We set $\mathcal{L}_{sim}$ as two parts: the negative local cross-correlation between $R$ and $G(F \circ \phi)$, and the negative local gradient similarity between $R$ and $F \circ \phi$:

$$\mathcal{L}_{sim} = -\,CC\big(R,\ G(F \circ \phi)\big) - \alpha\, LG\big(R,\ F \circ \phi\big) \tag{5}$$

The smooth loss, which enforces a spatially smooth deformation, can be set as follows [2]:

$$\mathcal{L}_{smooth}(\phi) = \sum_{p} \big\| \nabla \phi(p) \big\|^2 \tag{6}$$
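Putting Eqs. 4-6 together, the loss of $T$ might look as follows, reusing the `warp` and `local_gradient_sim` helpers sketched above (the weights and the VoxelMorph-style windowed cross-correlation are illustrative assumptions):

```python
def local_cc(a, b, win=9, eps=1e-5):
    """Local (windowed) normalized cross-correlation, VoxelMorph-style."""
    def box_sum(x):
        return nnf.avg_pool3d(x, win, stride=1, padding=win // 2) * win ** 3
    n = win ** 3
    cross = box_sum(a * b) - box_sum(a) * box_sum(b) / n
    var_a = box_sum(a * a) - box_sum(a) ** 2 / n
    var_b = box_sum(b * b) - box_sum(b) ** 2 / n
    return (cross * cross / (var_a * var_b + eps)).mean()

def loss_T(T_net, G_net, R, F, alpha=1.0, lam=1.0):
    """Eq. 4 = Eq. 5 (similarity) + weighted Eq. 6 (smoothness)."""
    phi = T_net(R, F)
    warped = warp(F, phi)
    sim = -local_cc(R, G_net(warped)) - alpha * local_gradient_sim(R, warped)
    # Eq. 6: squared forward differences of the flow field.
    dz = phi[:, :, 1:] - phi[:, :, :-1]
    dy = phi[:, :, :, 1:] - phi[:, :, :, :-1]
    dx = phi[:, :, :, :, 1:] - phi[:, :, :, :, :-1]
    smooth = (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()
    return sim + lam * smooth
```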
We now turn to the generator $G$ and the discriminator $D$. First, let us review Pix2Pix, a promising approach for many image-to-image translation tasks. The loss of Pix2Pix can be expressed as:

$$G^* = \arg\min_G \max_D\ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G) \tag{7}$$

where $\mathcal{L}_{cGAN}(G, D)$ is the objective of a conditional GAN [6] and $\mathcal{L}_{L1}(G)$ is the L1 distance between the translated source and the ground-truth target. Different from Pix2Pix, in a multi-modal registration task the source and the target are not pixel-wise mapped, so directly pushing the output toward the target in an L1 sense may lead to false translation, which is harmful for registration. Here we introduce the local gradient loss to constrain the gradient distance between the synthesized image and the source image, keeping the output texture similar to the source. We mix the GAN objective with the local gradient loss into a complete loss:

$$G^* = \arg\min_G \max_D\ \mathcal{L}_{cGAN}(G, D) + \lambda_1\, \mathcal{L}_{L1}\big(G(F \circ \phi),\ R\big) - \lambda_2\, LG\big(G(F \circ \phi),\ F \circ \phi\big) \tag{8}$$
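Under the same assumptions, the generator side of Eq. 8 can be sketched as follows (weights and label conventions are illustrative; `local_gradient_sim` is the helper above):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def loss_G(D_net, fake, warped, R, lam1=1.0, lam2=1.0):
    """Eq. 8: fool the discriminator, stay close to the reference R in L1,
    and match the local gradients of the warped source."""
    logits = D_net(fake)
    adv = bce(logits, torch.ones_like(logits))   # fool D
    l1 = (fake - R).abs().mean()                 # output vs. target
    lg = local_gradient_sim(fake, warped)        # output vs. source (Eq. 3)
    return adv + lam1 * l1 - lam2 * lg
```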
We evaluated our method on the Brain Tumor Segmentation (BraTS) 2018 dataset [12], which provides a large number of multi-sequence MRI scans, including T1, T2, FLAIR, and T1Gd. The different sequences in the dataset are already well aligned. We evaluated registration on T1-T2 data pairs, randomly choosing 235 scans for training and the remaining 50 for testing. We cropped and downsized the images to the network input size. We added random shift, rotation, scaling and elastic deformation to the scans to generate data pairs for registration, and the synthetic deformation fields can be regarded as registration ground truth. The deformations can be as large as -40 to +40 voxels, which makes the registration challenging.
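A sketch of how such pairs could be generated with NumPy/SciPy (the smoothing scale, sampling scheme and interpolation order are illustrative assumptions; only the ±40-voxel range comes from the text):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def synth_pair(volume, max_disp=40, sigma=8, seed=None):
    """Create a (floating image, ground-truth field) pair from one scan
    by a random smooth deformation. volume: (D, H, W) array."""
    rng = np.random.default_rng(seed)
    # Smooth random displacement per axis, scaled to at most max_disp voxels.
    field = np.stack([gaussian_filter(rng.uniform(-1, 1, volume.shape), sigma)
                      for _ in range(3)])
    field *= max_disp / (np.abs(field).max() + 1e-8)
    coords = np.indices(volume.shape) + field   # where each voxel samples from
    floating = map_coordinates(volume, coords, order=1)
    return floating, field
```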
We compare our method with two well-established image registration approaches: a conventional MI-based approach [?] and VoxelMorph [4]. In the former, MI is implemented as driving forces within a fluid registration framework. The latter introduces novel diffeomorphic integration layers combined with a transform layer to enable unsupervised end-to-end learning, but the original VoxelMorph uses local cross-correlation as its similarity, which only functions well in mono-modal registration. As described in Section 2.2, we also tried several similarity metrics within the VoxelMorph framework, such as MI, NGF and LG, but only LG was capable of this registration task, so we use VoxelMorph with LG for comparison.
We fix the loss weights in Eq.4, Eq.5 and Eq.8, and use the same window size for both CC and the local gradient. We use the ADAM optimizer; an NVIDIA GeForce 1080 Ti GPU with 11GB of memory is used for training and testing. To evaluate the effect of the gradient-constrained loss (Eq.8) in the generator $G$, we train the network with and without the gradient constraint, named Deform-GAN-2 and Deform-GAN-1, respectively.
The registration results are illustrated in Fig.2. MI is intrinsically a global measure, so its local estimation is difficult. VoxelMorph with local CC is based on gray values and does not function well across sequences. The results of VoxelMorph with the local gradient loss are much better and can handle large deformations between $R$ and $F$, which demonstrates the effectiveness of the local gradient loss. Our methods, both Deform-GAN-1 and Deform-GAN-2, achieve higher registration accuracy. Even for blurred and noisy images (see the second row), Deform-GAN obtains satisfying results, and Deform-GAN-2 is clearly better.
For further evaluation of the two settings of Deform-GAN, the warped floating images $F \circ \phi$ and the synthesized images $G(F \circ \phi)$ from the two GANs at different stages of training are shown in Fig.3. We can see that the gradient constraint brings faster convergence during training: even at the first epoch, white matter can be seen clearly in $G(F \circ \phi)$. What is more, Deform-GAN-2 is more stable in the training process (as the yellow arrows point out, there is less noise in $G(F \circ \phi)$ of Deform-GAN-2 than in that of Deform-GAN-1). Note that since $G(F \circ \phi)$ is important for calculating $\mathcal{L}_{sim}$, it should be aligned to $F \circ \phi$ strictly; the red arrows show that this alignment is much better for Deform-GAN-2.

In order to quantify the registration results of our method and the compared methods, we performed additional evaluation experiments. For the BraTS dataset, we can warp the floating image by the synthetic deformation field to obtain a ground truth, so the root-mean-square error (RMSE) of pixel-wise intensity can be calculated for evaluation. Also, because tumor masks are provided by the BraTS challenge, we can calculate the Dice score to evaluate registration around the tumor area. Table 1 shows the quantitative results: our method outperforms the others in terms of tumor Dice and RMSE. In terms of registration speed, deep-learning-based methods are significantly faster than the traditional one. In particular, our method only needs to run the transformation network at inference time, so the runtime remains very fast, though a bit slower than VoxelMorph.
| Method | RMSE (%) | Tumor Dice | Runtime (s, GPU/CPU) |
|---|---|---|---|
| MI | 1.39 ± 0.40 | 0.55 ± 0.18 | – / 6.1 |
| VoxelMorph-LG | 1.42 ± 0.36 | 0.61 ± 0.12 | 0.09 / 3.9 |
| Deform-GAN-1 | 1.33 ± 0.31 | 0.67 ± 0.13 | 0.11 / 4.4 |
| Deform-GAN-2 | 1.18 ± 0.23 | 0.69 ± 0.10 | 0.11 / 4.4 |
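For reference, the two quantitative measures reported in Table 1 can be computed as in the sketch below (array layouts and mask conventions are assumptions):

```python
import numpy as np

def rmse(warped, ground_truth):
    """Root-mean-square error of pixel-wise intensity."""
    diff = warped.astype(np.float64) - ground_truth.astype(np.float64)
    return np.sqrt((diff ** 2).mean())

def dice(mask_a, mask_b, eps=1e-8):
    """Dice overlap between two binary tumor masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum() + eps)
```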
A fast multi-modal deformable registration method based on unsupervised learning is proposed. Adversarial learning combined with spatial transformation helps to reduce the multi-modal similarity calculation to a mono-modal one. We improve the registration results with a weighted sum of the local gradient loss and the local cross-correlation loss, where the gradient-based loss handles global coarse alignment while the local CC loss ensures registration accuracy. Compared to recent learning-based methods, our approach can effectively cope with multi-modal registration problems involving large deformation, non-functional intensity relations, noise and blur, achieving state-of-the-art accuracy with fast runtimes.
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca. An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9252–9260.
Y. Hu et al. Adversarial deformation regularization for training image registration neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 774–782.
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.