JSSR: A Joint Synthesis, Segmentation, and Registration System for 3D Multi-Modal Image Alignment of Large-scale Pathological CT Scans

05/25/2020 ∙ by Fengze Liu, et al.

Multi-modal image registration is a challenging yet clinically important task in many real applications and scenarios. For medical imaging based diagnosis, deformable registration among different image modalities is often required as a first step in order to provide complementary visual information. During registration, semantic information is the key to matching homologous points and pixels. Nevertheless, many conventional registration methods are incapable of capturing high-level semantic anatomical dense correspondences. In this work, we propose a novel multi-task learning system, JSSR, based on an end-to-end 3D convolutional neural network that is composed of a generator, a register and a segmentor, for the tasks of synthesis, registration and segmentation, respectively. The system is optimized to satisfy the implicit constraints between the different tasks in an unsupervised manner. It first synthesizes the source domain images into the target domain, then applies intra-modal registration between the synthesized and target images. Semantic segmentations are then obtained by applying the segmentor to the synthesized and target images, which are aligned by the same deformation field produced by the register. Supervision from another fully-annotated dataset is used to regularize the segmentor. We extensively evaluate our JSSR system on a large-scale medical image dataset containing 1,485 patient CT imaging studies of four different phases (i.e., 5,940 3D CT scans with pathological livers) on the registration, segmentation and synthesis tasks. The performance is improved after joint training on the registration and segmentation tasks by 0.9% and 1.9%, respectively, over a highly competitive and accurate baseline. The registration part also consistently outperforms the conventional state-of-the-art multi-modal registration methods.




1 Introduction

Image registration attempts to discover a spatial transformation between a pair of images that maps the points in one image to the homologous points in the other [38]. Within medical imaging, registration often focuses on inter-patient/inter-study mono-modal alignment. Another important, and perhaps even more frequent, focal point is multi-channel imaging, e.g., dynamic-contrast computed tomography (CT), multi-parametric magnetic resonance imaging (MRI), or positron emission tomography (PET) combined with CT/MRI. In this setting, the needs of intra-patient multi-modal registration are paramount, given the unavoidable patient movements or displacements between subsequent imaging scans. For scenarios where deformable misalignments are present, e.g., the abdomen, correspondences can be highly complex. Because different modalities provide complementary visual/diagnostic information, proper and precise anatomical alignment benefits the human reader's radiological observation and decision, and is crucial for any downstream computerized analyses. However, finding correspondences between homologous points is usually not trivial because of the complex appearance changes across modalities, which may be conditioned on anatomy, pathology, or other complicated interactions.

Unfortunately, multi-modal registration remains a challenging task, particularly since ground-truth deformations are hard or impossible to obtain. Methods must instead learn transformations or losses that allow for easier correspondences between images. Unsupervised registration methods, like [2, 8], often use a local modality-invariant feature to measure similarity. However, these low-level features may not be universally applicable and cannot always capture high-level semantic information. Other approaches use generative models to reduce the domain shift between modalities, and then apply registration based on direct intensity similarity [36]. A different strategy learns registrations that maximize the overlap in segmentation labels [2, 12]. This latter approach is promising, as it treats the registration process similarly to a segmentation task, aligning images based on their semantic category. Yet, these approaches rely on having supervised segmentation labels in the first place for every deployment scenario.

Both the synthesis and segmentation approaches are promising, but neither has been fully exploited when used in isolation. As Fig. 1 elaborates, the synthesis, segmentation, and registration tasks are linked together and define implicit constraints between each other. That motivates us to develop a joint synthesis, segmentation, and registration (JSSR) system which satisfies these implicit constraints. JSSR is composed of a generator, a segmentor, and a register, and can perform all three tasks, i.e., synthesis, segmentation and registration, simultaneously. Given a fixed image and a moving image from different modalities, the generator synthesizes the moving image into the modality of the fixed image, conditioned on the fixed image to better reduce the domain gap. The register then takes the synthesized image and the fixed image to estimate the deformation field. Last, the segmentor estimates the segmentation maps of the moving, fixed and synthesized images. During training, we optimize several consistency losses, including the similarity between the fixed image and the warped synthesized image and the similarity between the segmentation maps of the warped moving image and the fixed image, together with an adversarial loss for generating high-fidelity images and a smoothness loss to regularize the deformation field. To prevent the segmentor from producing meaningless segmentation maps, we add some data with segmentation annotations to regularize it. We evaluate our system on a large-scale clinical liver CT dataset containing four phases per patient, on the unpaired image synthesis, multi-modal image registration and multi-modal image segmentation tasks. Our system outperforms the state-of-the-art conventional multi-modal registration methods and significantly improves the baseline models we use for the other two tasks, validating the effectiveness of joint learning.

We summarize our main contributions as follows:

  • We focus on the problem of multi-modal image registration and propose a novel joint learning approach for the synthesis, registration and segmentation tasks, where each task can communicate with the other two tasks during training, forming a virtuous cycle that benefits all of them.

  • We evaluate and validate the performance improvement of the baseline methods for the synthesis and segmentation tasks after joint training in our system, which proves the effectiveness of joint training and reveals the possibility of obtaining a better overall system by building upon better baseline models.

  • Our system consistently and significantly outperforms the state-of-the-art conventional multi-modal registration approaches using a large-scale CT imaging dataset of 1,485 patients (each patient under four different intravenous contrast phases, i.e., 5,940 3D CT scans with various liver tumors).

  • Our method does not use or rely on any manual segmentation labels from this CT imaging dataset, demonstrating its great potential to scale and generalize widely to real clinical scenarios.

Figure 1: The relationship between the synthesis, segmentation and registration tasks. The constraints should hold in both ideal and real settings, which motivates us to develop a joint system under the necessary constraints.

2 Related Work

Multi-modal Image Registration Methods

Multi-modal image registration has been widely studied and applied in the medical area. Existing registration methods can be based on additional information like landmarks [30, 35] and surfaces [40], or based on voxel properties, operating directly on the image grey values without prior data introduced by the user or by segmentation [24]. For voxel-based methods, there are two typical strategies. One is to use a self-similarity measure that does not require cross-modal features, like the Local Self-Similarities (LSS) in [31]; similarly, [7] introduced the MIND descriptor, which can effectively measure cross-modal differences, and [8, 9, 10] employed discrete dense displacement sampling for deformable registration with the self-similarity context (SSC) [11]. The other strategy is to map both modalities into a shared space and measure a mono-modal difference. [22] introduced Mutual Information (MI), based on information theory, which can be applied directly to cross-modal images as a similarity measure, followed up by [34] with Normalized Mutual Information (NMI). However, such methods suffer from shortcomings such as low convergence rates and loss of spatial information. [3] employed a convolutional neural network (CNN) to learn interpretable modality-invariant features with a small amount of supervision data. [4] utilized Haar-like features from paired multi-modality images to fit a patch-wise random forest (P-RF) regression for bi-directional image synthesis, and [36, 23] applied CycleGANs to reduce the gap between modalities for better alignment.

With the development of deep learning in recent years, a variety of learning-based registration methods have been proposed. Since ground-truth deformation fields are hard to obtain, many methods use synthesized deformation fields as supervision for training [19, 33, 29]. Unsupervised methods like [6, 21, 5, 2] generally use a CNN with a spatial transformation function [16]; these mainly focus on mono-modal image registration. Other methods make use of correspondences between labelled anatomical structures as weak supervision to help the registration procedure [12]. [2] also showed how segmentation maps can help registration. However, in most cases the segmentation map is not available, which motivates us to combine the registration and segmentation components together.

Multi-task Learning Methods

As the registration, synthesis and segmentation tasks are all related to each other, a variety of works have explored the advantages of combining different tasks. [36, 23, 37] used CycleGANs to synthesize multi-modal images into a single modality and applied mono-modal registration methods. [17] projected multi-modal images into a shared feature space and registered based on those features. [28] used a generative model to disentangle the appearance space from the shape space. [20, 39, 27] combined segmentation and registration models so that they benefit each other, but focused on mono-modal registration. [43] performed supervised multi-phase segmentation based on paired multi-phase images, but did not jointly train the registration and segmentation. [41, 42, 14] used generative models to help guide the segmentation model. In this work, we are the first to combine all three tasks together to tackle the multi-modal registration problem in the most general setting, where ground-truth deformations, paired multi-modal images and segmentation maps are all unavailable.

Figure 2: The JSSR system. We denote the generator, segmentor, register and spatial transform as Syn, Seg, Reg and ST respectively.
Figure 3: The model structure for each component. We use the cohemis structure for generator, phnn for register and VNet for segmentor.

3 Methodology

Given a moving image x ∈ X and a fixed image y ∈ Y from different modalities, we aim to find a spatial transformation function τ that optimizes the similarity between τ(x) and y. We tackle this multi-modal image registration problem in a fully unsupervised way to meet the setting of a common application scene, where none of the ground-truth deformation fields, segmentation maps or paired multi-modal images are available. Motivated by the discussion of the relationship between image synthesis, segmentation and registration, we develop a system consisting of three parts: a generator G, a register F and a segmentor S. We optimize for the constraints in Fig. 1 since they are necessary conditions for a correct joint system. By putting them together, the system is also capable of unpaired image synthesis and multi-modal image segmentation. During optimization, the three tasks benefit from each other, leading to improvements for all of them. Refer to Fig. 2 for the overall framework of our system.

3.1 Unpaired Image Synthesis

Although there are existing works solving unpaired image synthesis, such as [13], they generate a variety of different target-domain images via random sampling. In registration, however, the synthesized images are supposed to have the same texture properties conditioned on the fixed images, making a paired image synthesis method the better choice. Suppose we have x ∈ X and y ∈ Y as the moving and fixed images, where X and Y represent the sets of images from the two modalities. As in [15], we use a conditional GAN with a generator G which learns a mapping G : (x, y) ↦ ŷ ∈ Y. Here τ denotes the ground-truth deformation field we are estimating and τ⁻¹ its inverse mapping. Note that τ is absent in all experiments; we use it here only for convenience. A discriminator D is also equipped to detect fake images from the generator.

The objective of the conditional GAN is:

\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y}\big[\log D(\tau^{-1}(y))\big] + \mathbb{E}_{x,y}\big[\log\big(1 - D(G(x, y))\big)\big]. \tag{1}
Here we would normally use \mathbb{E}_{y}[\log D(y)] as in the classical GAN setting, but based on the assumption that the spatial transformation does not change texture, we replace y with \tau^{-1}(y) in the first term of (1). We also add a texture-based \ell_1 loss to complement the GAN objective:

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\|\tau^{-1}(y) - G(x, y)\|_1\big]. \tag{2}
The final objective for the synthesis part is:

G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{GAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G). \tag{3}
However, we can only optimize this objective if τ is given for each image pair, which becomes possible after combining with the registration part.
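To make the synthesis objective concrete, here is a minimal numpy sketch of the conditional-GAN-plus-ℓ1 loss computation; the discriminator probabilities, image shapes and the weight `lam` are hypothetical stand-ins for the paper's 3D networks and hyper-parameters:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on discriminator probabilities."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean()

def synthesis_losses(d_real, d_fake, warped_fixed, synthesized, lam=10.0):
    """Sketch of the conditional-GAN + L1 synthesis objective.

    d_real / d_fake: discriminator probabilities for the (inverse-warped)
    fixed image and the generator output; warped_fixed stands in for the
    warped fixed image and synthesized for the generator output G(x, y).
    """
    loss_d = bce(d_real, 1.0) + bce(d_fake, 0.0)  # discriminator side
    # Generator side: fool D, plus the texture-based L1 term.
    loss_g = bce(d_fake, 1.0) + lam * np.abs(warped_fixed - synthesized).mean()
    return loss_d, loss_g
```

In the real system both losses are computed over mini-batches of 3D volumes and minimized with the alternating GAN scheme described in Section 3.4.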

3.2 Multi-Modal Image Registration

For two images x ∈ X and y ∈ Y, registration methods try to learn a function F(x, y) = τ, where τ is a spatial transformation function [16], also called the deformation field, so that it warps the moving image x to be as similar to the fixed image y as possible. For mono-modal registration, an \ell_1 loss can be used to measure the similarity between the fixed and warped images. Here we are registering two images from different modalities: [2] proposed to use a cross-modal similarity measure like cross-correlation [1], while we utilize a generative model to transfer x into the Y domain so that we can still use a mono-modal similarity measure. The objective for the registration part is:

\mathcal{L}_{sim}(F) = \mathbb{E}_{x,y}\big[\|\tau(G(x, y)) - y\|_1\big], \tag{4}
where \tau = F(G(x, y), y) and G is the generator that synthesizes images from X to Y. Another smoothness term is added to prevent non-realistic deformation fields:

\mathcal{L}_{smooth}(F) = \mathbb{E}_{x,y}\Big[\sum_{p}\|\nabla\tau(p)\|^{2}\Big], \tag{5}
where p represents the location of a voxel and \nabla\tau(p) is computed from the differences between neighboring voxels of \tau(p). We use the same implementation of the smoothness term as in [2]. The final objective is:

F^{*} = \arg\min_{F}\ \mathcal{L}_{sim}(F) + \lambda_{smooth}\,\mathcal{L}_{smooth}(F). \tag{6}
Similarly, we cannot optimize this objective without being given G. However, to obtain a good G we need a good F, as discussed in Section 3.1, which makes this a chicken-and-egg conundrum. One way out is to add the objectives of the synthesis and registration parts together, which leads to:

G^{*}, F^{*} = \arg\min_{G,F}\max_{D}\ \mathcal{L}_{GAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G) + \mathcal{L}_{sim}(F) + \lambda_{smooth}\,\mathcal{L}_{smooth}(F). \tag{7}
The \tau^{-1} in (1) and (2) is not trivial to calculate, but we can prove that \|\tau(G(x,y)) - y\|_1 and \|G(x,y) - \tau^{-1}(y)\|_1 are close to each other when \tau is smooth enough, so we can merge the terms without changing the objective too much. See the appendix for more detail.

However, there is no guarantee that we reach the optimal solution by minimizing (7). In fact, there is a trivial solution: the generator simply copies the fixed image, G(x, y) = y, while the deformation field degenerates to the identity. To counter this, we add skip connections from the source domain in the generator structure to preserve spatial information, as shown in Fig. 3.
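Two recurring operations in this section are warping a volume by a deformation field and penalizing that field's roughness. Below is a minimal numpy/scipy sketch of both, as illustrative stand-ins for the spatial transformation function [16] and the diffusion-style regularizer of [2]; the paper's actual GPU implementation differs:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(volume, flow):
    """Warp a 3D volume by a dense displacement field `flow` of shape
    (3, D, H, W): each output voxel p samples the input at p + flow[:, p]
    with trilinear interpolation."""
    grid = np.meshgrid(*(np.arange(s) for s in volume.shape), indexing="ij")
    coords = np.stack(grid).astype(np.float64) + flow
    return map_coordinates(volume, coords, order=1, mode="nearest")

def smoothness_loss(flow):
    """Mean squared forward-difference gradient of the displacement field,
    penalizing non-smooth (non-realistic) deformations."""
    loss = 0.0
    for axis in range(1, flow.ndim):        # spatial axes only
        diffs = np.diff(flow, axis=axis)    # neighboring-voxel differences
        loss += (diffs ** 2).mean()
    return loss
```

An identity transform (all-zero flow) leaves the volume unchanged and incurs zero smoothness penalty, which is exactly the degenerate solution the skip connections guard against.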

3.3 Multi-Modal Image Segmentation

We include the segmentation part for two reasons. Firstly, as noted in [2], the additional information of a segmentation map can help guide the registration process, so we use segmentation models to provide segmentation map information, since manual annotation is not available. Secondly, as shown in other literature [20, 39, 27, 41, 42], the synthesis and registration procedures can benefit the segmentation model by providing auxiliary information, which helps us develop a better segmentation model on a dataset without annotation.

Denote the segmentation model as a function S : X ∪ Y → P, where P represents the segmentation map domain. Based on the constraints between the synthesis, registration and segmentation tasks, we develop the objective as:

\mathcal{L}_{consist}(F, S) = \mathbb{E}_{x,y}\big[1 - \mathrm{Dice}\big(\tau(S(G(x, y))),\ S(y)\big)\big], \tag{8}
where \tau = F(G(x, y), y) and \mathrm{Dice}(A, B) = 2|A \cap B|/(|A| + |B|) is the measurement of similarity between two binary volumes that is widely used in medical image segmentation. This loss term connects the three components together and proves to be the major driver of the whole system's improvement in the experiments that follow.
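The Dice measurement used in this consistency term can be sketched as follows (numpy, binary volumes):

```python
import numpy as np

def dice(a, b, eps=1e-7):
    """Dice similarity coefficient between two binary volumes:
    2 * |A ∩ B| / (|A| + |B|)."""
    a = a.astype(bool)
    b = b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum() + eps)
```

During training a soft (differentiable) variant over predicted probabilities is what would actually be optimized; the hard version above is the evaluation-time measure.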

To make the consistency term work properly, we need the segmentor to be sufficiently good. With only the consistency loss, however, the segmentor cannot learn meaningful semantic information: a segmentor that outputs all-background volumes also optimizes the consistency loss. To avoid this, we add extra data with segmentation annotations in one modality as supervision, providing the segmentor with a meaningful initial state. The supervision loss is:

\mathcal{L}_{sup}(S) = \mathbb{E}_{y_s}\big[1 - \mathrm{Dice}\big(S(y_s),\ p_s\big)\big], \tag{9}
where y_s is in the same modality as y but the two sets do not overlap, and p_s is the corresponding annotation. The final regularization term provided by the segmentor is then:

\mathcal{L}_{seg}(F, S) = \mathcal{L}_{consist}(F, S) + \lambda_{sup}\,\mathcal{L}_{sup}(S). \tag{10}
3.4 Joint Optimization Strategy

Based on the previous sections, the final objective for our whole system is:

G^{*}, F^{*}, S^{*} = \arg\min_{G,F,S}\max_{D}\ \mathcal{L}_{GAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G) + \mathcal{L}_{sim}(F) + \lambda_{smooth}\,\mathcal{L}_{smooth}(F) + \lambda_{seg}\,\mathcal{L}_{seg}(F, S). \tag{11}
In order to provide all components with a good initial point, we first train S on the data with supervision, then learn G and F using (7) on the unsupervised data. Finally, we jointly learn all parts with (11). During the optimization of (7) and (11), we use the classic alternating scheme for GAN training: fix G and optimize D, then fix D and optimize the other components, and alternate.
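The optimization schedule above can be sketched structurally as follows; the step functions are hypothetical stand-ins for the real per-component gradient updates:

```python
def train_jssr(pretrain_segmentor, pretrain_syn_reg, d_step, joint_step,
               n_joint_iters=1000):
    """Training schedule (sketch): supervised segmentor warm-up, then
    generator/register pre-training, then alternating joint optimization
    (fix G, update D; fix D, update G, F and S)."""
    pretrain_segmentor()          # S on the annotated dataset
    pretrain_syn_reg()            # G and F on the unsupervised data
    for _ in range(n_joint_iters):
        d_step()                  # discriminator step, generator frozen
        joint_step()              # G/F/S step, discriminator frozen
```

Each callable would, in practice, draw a mini-batch and apply one Adam update to the corresponding sub-network(s).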

4 Experiments

Datasets. We conduct our main experiments on a large-scale 3D liver CT dataset collected by our clinical collaborators at an academic hospital. The dataset contains CT scans from 1,485 patients, and for each patient there are image volumes of four different intravenous contrast phases: venous, arterial, delay and uncontrast. The studied population is composed of patients with liver tumors who underwent CT imaging for interventional biopsy. Our end goal is to develop a computer-aided diagnosis system to identify the pathological subtype of any given liver tumor. Whether for human readers or computers, all 3D CT phases of each patient need to be pre-registered precisely to facilitate the downstream diagnosis process of observing and judging the visually dynamic contrast-changing patterns of liver tumor tissues in the sequential order of uncontrast, arterial, venous and delay CTs.

The different phases are obtained from the CT scanner at different time points after contrast media injection and display different information according to the distribution of contrast media in the human body. The intensity value of each voxel, measured in Hounsfield Units (HU), is an integer that is also affected by the density of the contrast media. The axial extent of each CT volume depends on the physical coordinate at which scanning started and on the resolution along the axial axis, which is 5 mm in our dataset. Since the venous phase usually contains most of the information for diagnosis, we choose it as the anchor phase: images from the other three phases are registered to it, and the other three phases are synthesized to the venous phase. We divide the dataset into 1,350/45/90 patients for training, validation and testing, with liver segmentations annotated on the validation and testing sets for evaluation. Note that in total 5,940 3D CT scans (all containing pathological livers) are used in our work. To the best of our knowledge, this is the largest clinically realistic study of this kind to date. For the supervised part, we choose the public MSD dataset [32], which contains 131 venous-phase CT images with voxel-wise liver annotations, and divide it into 100/31 for training and validation. We evaluate on all of the registration, synthesis and segmentation tasks to show how joint training improves each of them.

4.1 Baseline

We compare against baselines for the image synthesis, image registration and image segmentation tasks.

  • For image synthesis, we choose Pix2Pix [15] using both reconstruction loss and cGAN loss.

  • For image registration, we first compare with Deeds [8], the conventional state-of-the-art multi-modal registration method. The advantage of learning-based methods over conventional ones usually lies in inference speed, but we also show a performance improvement. We additionally compare with VoxelMorph [2], a learning-based method, using local cross-correlation to handle multi-modal image registration.

  • For the segmentation task, we compare with VNet [26], a popular framework in medical image segmentation.

| Method | Dice: Arterial | Dice: Delay | Dice: Uncontrast | HD95: Arterial | HD95: Delay | HD95: Uncontrast |
|---|---|---|---|---|---|---|
| Initial State | 90.94 (7.52) | 90.52 (8.08) | 90.08 (6.74) | 7.54 (4.89) | 7.86 (5.83) | 7.87 (4.37) |
| Affine [25] | 92.01 (6.57) | 91.69 (6.80) | 91.52 (5.48) | 6.81 (4.83) | 6.95 (5.32) | 6.73 (3.63) |
| Deeds [8] | 94.73 (2.10) | 94.70 (1.91) | 94.73 (1.90) | 4.74 (1.96) | 4.76 (1.69) | 4.62 (1.05) |
| VoxelMorph [2] | 94.28 (2.53) | 94.23 (3.15) | 93.93 (2.58) | 5.29 (2.33) | 5.42 (3.25) | 5.40 (2.48) |
| JSynR-Reg | 94.81 (2.35) | 94.71 (2.62) | 94.57 (2.52) | 4.93 (2.14) | 5.07 (3.06) | 4.87 (2.30) |
| JSegR-Reg | 95.52 (1.76) | 95.39 (2.14) | 95.37 (1.80) | 4.47 (2.21) | 4.70 (3.24) | 4.45 (1.85) |
| JSSR-Reg | 95.56 (1.70) | 95.42 (2.00) | 95.41 (1.72) | 4.44 (2.19) | 4.65 (3.14) | 4.35 (1.60) |

| Method | ASD: Arterial | ASD: Delay | ASD: Uncontrast | Time (GPU/CPU, s) |
|---|---|---|---|---|
| Initial State | 2.12 (1.86) | 2.27 (2.19) | 2.37 (1.77) | -/- |
| Affine [25] | 1.74 (1.58) | 1.86 (1.89) | 1.87 (1.41) | -/7.77 |
| Deeds [8] | 1.01 (0.44) | 1.01 (0.39) | 0.99 (0.36) | -/41.51 |
| VoxelMorph [2] | 1.10 (0.53) | 1.12 (0.87) | 1.20 (0.67) | 1.71/1.76 |
| JSynR-Reg | 0.95 (0.45) | 0.98 (0.72) | 0.98 (0.56) | 3.14/1.76 |
| JSegR-Reg | 0.80 (0.37) | 0.83 (0.59) | 0.83 (0.40) | 3.14/1.76 |
| JSSR-Reg | 0.79 (0.36) | 0.83 (0.56) | 0.82 (0.37) | 1.71/1.76 |

Table 1: Evaluation of the registration task on the CGMHLiver dataset in terms of Dice score (%), HD95 (mm), ASD (mm) and running time (s, GPU/CPU). Standard deviations in parentheses.

4.2 Implementation Detail

To perform registration between the four phases, we conduct several preprocessing steps before applying deformable registration. Firstly, since CT images from different phases have different volume sizes even for the same patient, we crop the maximal intersection of all four phases based on physical coordinates to make their sizes identical. Secondly, we apply rigid registration between the four phases using [25], with the venous phase as the anchor. Thirdly, we apply intensity windowing (in HU) to each CT image, normalize the intensities, and resize the CT volume to fit in GPU memory. For the public dataset, we resample along the axial axis to a 5 mm resolution and then apply the same intensity preprocessing.
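The intensity preprocessing can be sketched as below; the window of −200 to 300 HU and the [−1, 1] target range are illustrative assumptions, since the paper's exact values did not survive extraction:

```python
import numpy as np

def preprocess_ct(volume_hu, window=(-200.0, 300.0)):
    """Clip a CT volume to an HU window and linearly rescale to [-1, 1].

    The window endpoints are hypothetical; liver studies typically use a
    soft-tissue window of roughly this width.
    """
    lo, hi = window
    clipped = np.clip(volume_hu.astype(np.float32), lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0
```

Voxels below the window map to −1 and voxels above it to 1, which keeps the network inputs in a fixed numeric range regardless of scanner calibration.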

To optimize the objective, we use the Adam solver [18] for all components. We choose different learning rates for the different components to better balance training: 0.0001 for the generator, 0.001 for the register, and 0.1 for the segmentor and the discriminator. Another way to balance training is to adjust the weights of the different loss terms; however, some loss terms relate to multiple components, which makes it more complex to control each component separately. We train on an Nvidia Quadro RTX 6000 GPU with 24 GB memory, using instance normalization and batch size 1. The training process takes about 1.4 GPU days.

Figure 4: The box-plot for the result of registration (DSC) on three phases (A for arterial, D for delay, N for uncontrast) for different methods.

4.3 Main Results

4.3.1 Multi-modal image registration

We summarize the results of the registration task in Table 1. The methods are evaluated by the similarity between the segmentation map of the fixed image, which is always the venous phase here, and the warped segmentation map of the moving image, chosen from the arterial, delay and uncontrast phases. Similarity is measured with the Dice score, 95th-percentile Hausdorff distance (HD95) and average surface distance (ASD). We report the average and standard deviation on the testing set for each measurement, as well as the GPU/CPU time consumed in seconds by each method. In Table 1, Initial State refers to the result before applying any registration and Affine to the result after rigid registration. We also compare with the conventional method Deeds and the learning-based method VoxelMorph. We term our joint system JSSR and its registration part JSSR-Reg, and compare two ablations of JSSR: JSynR, which only contains the generator and register and is optimized with (7), and JSegR, which has the segmentor and register instead; more detail is discussed in Section 5. Our JSSR method outperforms Deeds on average Dice while enjoying much faster inference. Benefiting from joint training, JSSR also achieves significantly higher results than VoxelMorph with comparable inference time. We observe gradual improvements from VoxelMorph to JSynR to JSSR, which underlines the necessity of joint training. Refer to Fig. 4 for more detail of the results.
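The surface-distance metrics reported in Table 1 can be sketched as follows; this is a simplified numpy/scipy version of HD95 and ASD, not the paper's evaluation code:

```python
import numpy as np
from scipy import ndimage

def _surface_distances(a, b, spacing=(1.0, 1.0, 1.0)):
    """Distances from each surface voxel of binary mask `a` to the surface of `b`."""
    struct = ndimage.generate_binary_structure(a.ndim, 1)
    surf_a = a & ~ndimage.binary_erosion(a, struct)
    surf_b = b & ~ndimage.binary_erosion(b, struct)
    # Euclidean distance (in mm, via `spacing`) to the nearest surface voxel of b.
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_b[surf_a]

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric Hausdorff distance between binary masks."""
    d = np.hstack([_surface_distances(a, b, spacing),
                   _surface_distances(b, a, spacing)])
    return np.percentile(d, 95)

def asd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance between binary masks."""
    d = np.hstack([_surface_distances(a, b, spacing),
                   _surface_distances(b, a, spacing)])
    return d.mean()
```

The `spacing` argument converts voxel offsets to millimeters, matching the mm units in Table 1.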

VNet [26]:

| Dice | Venous | Arterial | Delay | Uncontrast |
|---|---|---|---|---|
| Non-Synthesis | 90.47 (6.23) | 89.47 (7.05) | 89.88 (6.38) | 89.38 (6.38) |
| Pix2Pix [15] | 90.47 (6.23) | 76.50 (17.77) | 79.60 (13.13) | 67.48 (15.97) |
| JSynR-Syn | 90.47 (6.23) | 89.69 (7.09) | 90.01 (6.27) | 90.15 (6.21) |
| JSSR-Syn | 90.47 (6.23) | 89.44 (7.15) | 89.76 (6.34) | 89.31 (7.57) |

JSegR-Seg:

| Dice | Venous | Arterial | Delay | Uncontrast |
|---|---|---|---|---|
| Non-Synthesis | 91.88 (4.84) | 90.91 (5.06) | 91.18 (4.68) | 91.12 (4.72) |
| Pix2Pix [15] | 91.88 (4.84) | 89.59 (5.51) | 87.78 (5.78) | 89.59 (5.51) |
| JSynR-Syn | 91.88 (4.84) | 91.15 (4.93) | 91.37 (4.56) | 91.36 (4.54) |
| JSSR-Syn | 91.88 (4.84) | 91.12 (4.99) | 91.30 (4.63) | 91.39 (4.53) |

JSSR-Seg:

| Dice | Venous | Arterial | Delay | Uncontrast |
|---|---|---|---|---|
| Non-Synthesis | 92.24 (3.88) | 91.25 (4.10) | 91.34 (3.76) | 91.37 (3.81) |
| Pix2Pix [15] | 92.24 (3.88) | 85.30 (7.11) | 84.68 (9.29) | 79.89 (8.49) |
| JSynR-Syn | 92.24 (3.88) | 91.42 (4.06) | 91.58 (3.64) | 91.67 (3.67) |
| JSSR-Syn | 92.24 (3.88) | 91.39 (4.10) | 91.51 (3.72) | 91.60 (3.69) |

Table 2: Evaluation of the synthesis and segmentation tasks on the CGMHLiver dataset in terms of average Dice score (%). Each panel corresponds to one segmentation model (rows are synthesis variants); standard deviations in parentheses.

4.3.2 Multi-modal image segmentation and synthesis

We show the results of the synthesis and segmentation tasks in Table 2. Following the idea in [15], we evaluate each synthesis model by applying a segmentation model to its synthesized images; the intuition is that the better the synthesized image, the better the segmentation map that can be estimated from it. We evaluate with three segmentation models: the VNet baseline, trained on the MSD dataset with full supervision; JSegR-Seg, the segmentation part of JSegR as described in Section 5; and JSSR-Seg, the segmentor of our JSSR system. For each segmentation model, we test different versions of the synthesis model. For Non-Synthesis, we directly apply the segmentation model to the original images of all four phases in the test data and report the average Dice between prediction and annotation. For the other three synthesis models, we test the segmentation model on the original venous image and on the images synthesized from the arterial, delay and uncontrast phases. From the Non-Synthesis rows we observe a performance drop when directly applying the segmentation model to the arterial, delay and uncontrast phases, since the supervised data is all venous phase. Among the three phases, delay drops the least while uncontrast drops the most, indicating that the domain gap between venous and delay is smaller than that between venous and uncontrast. For Pix2Pix, the performance decreases to different degrees across segmentors and falls below Non-Synthesis, possibly because of artifacts introduced by the GAN model, and because the L1 term provides less constraint in the absence of paired data. For JSynR-Syn and JSSR-Syn, the performance is better thanks to the paired data produced by the register, but is only comparable to Non-Synthesis. For JSynR-Syn, this is because JSynR is not jointly learned with a segmentor, so the performance on synthesized images does not necessarily improve. For JSSR-Syn, it suggests that the constraints we use for optimizing the system do not bring enough communication between the generator and segmentor. At the same time, we see a large improvement from VNet to JSegR-Seg to JSSR-Seg on the Non-Synthesis data, meaning that although the generator and segmentor are not tightly connected, the segmentor still benefits from a joint system that includes the synthesis part. Refer to Fig. 5 for qualitative examples of JSSR registration, synthesis and segmentation results.

(a) Results on the arterial CT phase.
(b) Results on the uncontrast CT phase.
Figure 5: Qualitative examples of JSSR synthesis, segmentation and registration.

5 Ablation and Discussion

5.0.1 JSegR vs JSSR

We implement JSegR as another ablation, to explore the importance of the synthesis module in JSSR. Since JSegR has no generator, the register takes images from different phases directly as input. The segmentation consistency term in (8) then turns into:

\mathcal{L}_{consist}(F, S) = \mathbb{E}_{x,y}\big[1 - \mathrm{Dice}\big(\tau(S(x)),\ S(y)\big)\big],

where \tau = F(x, y). This framework is similar to [39], which jointly learns a register and segmentor; in our case, however, x and y lie in different domains and annotations are available for neither of them. This method is not expected to work properly since x and y come from different phases. Yet, as Table 2 shows, the performance drop across phases is not severe even for the baseline VNet, and with that imperfect segmentor JSegR achieves higher registration results than JSynR, close to JSSR, which shows the great importance of incorporating semantic information into registration. For data with a larger domain gap, such as CT and MRI, the synthesis part is still necessary to regularize the system.

5.0.2 Extra constraints

We do not introduce all the constraints of Fig. 1 because the joint training process can be sensitive to some of them. We tried several extra constraints as ablations. One of them adds:

\mathcal{L}_{extra}(G, S) = \mathbb{E}_{x,y}\big[1 - \mathrm{Dice}\big(S(G(x, y)),\ S(x)\big)\big], \tag{12}

which requires the segmentation of the synthesized image to agree with that of the moving image. If we add (12) together with (8), the system has a chance to eventually output all-blank segmentations. It is likely that with more constraints and a more sophisticated parameter setting the system could converge to a better point. Since each component of JSSR can be separated out, each can also be replaced with a more powerful sub-framework to improve the system, which is a future research direction for this joint system.

6 Conclusion

In this paper, we propose a novel JSSR system for multi-modal image registration. Our system takes advantage of joint learning based on the intrinsic connections between the synthesis, segmentation and registration tasks. The optimization is conducted end-to-end with several unsupervised consistency losses, and each component benefits from the joint training process. We evaluate the JSSR system on a large-scale multi-phase, clinically realistic CT image dataset without any annotation. After joint training, the performance of registration and segmentation increases by 0.9% and 1.9%, respectively, in average Dice score over all phases. Our system outperforms the recent VoxelMorph algorithm [2] and the state-of-the-art conventional multi-modal registration method [8], with considerably faster inference than the latter.


  • [1] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26–41. Cited by: §3.2.
  • [2] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca (2019) VoxelMorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging 38 (8), pp. 1788–1800. Cited by: §1, §2, §3.2, §3.3, 2nd item, Table 1, §6.
  • [3] M. Blendowski and M. P. Heinrich (2019) Learning interpretable multi-modal features for alignment with supervised iterative descent. In International Conference on Medical Imaging with Deep Learning, pp. 73–83. Cited by: §2.
  • [4] X. Cao, J. Yang, Y. Gao, Y. Guo, G. Wu, and D. Shen (2017) Dual-core steered non-rigid registration for multi-modal images via bi-directional image synthesis. Medical image analysis 41, pp. 18–31. Cited by: §2.
  • [5] B. D. de Vos, F. F. Berendsen, M. A. Viergever, H. Sokooti, M. Staring, and I. Išgum (2019) A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis 52, pp. 128–143. Cited by: §2.
  • [6] B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, and I. Išgum (2017) End-to-end unsupervised deformable image registration with a convolutional neural network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 204–212. Cited by: §2.
  • [7] M. P. Heinrich, M. Jenkinson, M. Bhushan, T. Matin, F. V. Gleeson, M. Brady, and J. A. Schnabel (2012) MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Medical image analysis 16 (7), pp. 1423–1435. Cited by: §2.
  • [8] M. P. Heinrich, M. Jenkinson, M. Brady, and J. A. Schnabel (2012) Globally optimal deformable registration on a minimum spanning tree using dense displacement sampling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 115–122. Cited by: §1, §2, 2nd item, Table 1, §6.
  • [9] M. P. Heinrich, M. Jenkinson, M. Brady, and J. A. Schnabel (2013) MRF-based deformable registration and ventilation estimation of lung ct. IEEE transactions on medical imaging 32 (7), pp. 1239–1248. Cited by: §2.
  • [10] M. P. Heinrich, O. Maier, and H. Handels Multi-modal multi-atlas segmentation using discrete optimisation and self-similarities.. Cited by: §2.
  • [11] M. P. Heinrich, M. Jenkinson, B. W. Papież, M. Brady, and J. A. Schnabel (2013) Towards realtime multimodal fusion for image-guided interventions using self-similarities. In International conference on medical image computing and computer-assisted intervention, pp. 187–194. Cited by: §2.
  • [12] Y. Hu, M. Modat, E. Gibson, W. Li, N. Ghavami, E. Bonmati, G. Wang, S. Bandula, C. M. Moore, M. Emberton, et al. (2018) Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis 49, pp. 1–13. Cited by: §1, §2.
  • [13] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §3.1.
  • [14] Y. Huo, Z. Xu, H. Moon, S. Bao, A. Assad, T. K. Moyo, M. R. Savona, R. G. Abramson, and B. A. Landman (2018) Synseg-net: synthetic segmentation without target modality ground truth. IEEE transactions on medical imaging 38 (4), pp. 1016–1025. Cited by: §2.
  • [15] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §3.1, 1st item, §4.3.2, Table 2.
  • [16] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §2, §3.2.
  • [17] M. D. Ketcha, T. S. De Silva, R. Han, A. Uneri, S. Vogt, G. Kleinszig, and J. H. Siewerdsen (2019) Learning-based deformable image registration: effect of statistical mismatch between train and test images. Journal of Medical Imaging 6 (4), pp. 044008. Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [19] J. Krebs, H. Delingette, B. Mailhé, N. Ayache, and T. Mansi (2019) Learning a probabilistic model for diffeomorphic registration. IEEE transactions on medical imaging 38 (9), pp. 2165–2176. Cited by: §2.
  • [20] B. Li, W. J. Niessen, S. Klein, M. de Groot, M. A. Ikram, M. W. Vernooij, and E. E. Bron (2019) A hybrid deep learning framework for integrated segmentation and registration: evaluation on longitudinal white matter tract changes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 645–653. Cited by: §2, §3.3.
  • [21] H. Li and Y. Fan (2017) Non-rigid image registration using fully convolutional networks with deep self-supervision. arXiv preprint arXiv:1709.00799. Cited by: §2.
  • [22] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens (1997) Multimodality image registration by maximization of mutual information. IEEE transactions on Medical Imaging 16 (2), pp. 187–198. Cited by: §2.
  • [23] D. Mahapatra, B. Antony, S. Sedai, and R. Garnavi (2018) Deformable medical image registration using generative adversarial networks. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1449–1453. Cited by: §2, §2.
  • [24] J. A. Maintz and M. A. Viergever (1996) An overview of medical image registration methods. In Symposium of the Belgian hospital physicists association (SBPH/BVZF), Vol. 12, pp. 1–22. Cited by: §2.
  • [25] K. Marstal, F. Berendsen, M. Staring, and S. Klein (2016) SimpleElastix: a user-friendly, multi-lingual library for medical image registration. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 134–142. Cited by: §4.2, Table 1.
  • [26] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: 3rd item, Table 2.
  • [27] C. Qin, W. Bai, J. Schlemper, S. E. Petersen, S. K. Piechnik, S. Neubauer, and D. Rueckert (2018) Joint learning of motion estimation and segmentation for cardiac mr image sequences. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 472–480. Cited by: §2, §3.3.
  • [28] C. Qin, B. Shi, R. Liao, T. Mansi, D. Rueckert, and A. Kamen (2019) Unsupervised deformable registration for multi-modal images via disentangled representations. In International Conference on Information Processing in Medical Imaging, pp. 249–261. Cited by: §2.
  • [29] M. Rohé, M. Datar, T. Heimann, M. Sermesant, and X. Pennec (2017) SVF-net: learning deformable image registration using shape matching. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 266–274. Cited by: §2.
  • [30] K. Rohr, H. S. Stiehl, R. Sprengel, T. M. Buzug, J. Weese, and M. H. Kuhn (2001-06) Landmark-based elastic registration using approximating thin-plate splines. IEEE Transactions on Medical Imaging 20 (6), pp. 526–534. External Links: Document, ISSN 1558-254X Cited by: §2.
  • [31] E. Shechtman and M. Irani (2007) Matching local self-similarities across images and videos. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
  • [32] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. Van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063. Cited by: §4.
  • [33] H. Sokooti, B. De Vos, F. Berendsen, B. P. Lelieveldt, I. Išgum, and M. Staring (2017) Nonrigid image registration using multi-scale 3d convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 232–239. Cited by: §2.
  • [34] C. Studholme, D. L. Hill, and D. J. Hawkes (1999) An overlap invariant entropy measure of 3d medical image alignment. Pattern recognition 32 (1), pp. 71–86. Cited by: §2.
  • [35] S. Sultana, D. Y. Song, and J. Lee (2019) A deformable multimodal image registration using PET/CT and TRUS for intraoperative focal prostate brachytherapy. In Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling, B. Fei and C. A. Linte (Eds.), Vol. 10951, pp. 383 – 388. External Links: Document Cited by: §2.
  • [36] C. Tanner, F. Ozdemir, R. Profanter, V. Vishnevsky, E. Konukoglu, and O. Goksel (2018) Generative adversarial networks for mr-ct deformable image registration. arXiv preprint arXiv:1807.07349. Cited by: §1, §2, §2.
  • [37] D. Wei, S. Ahmad, J. Huo, W. Peng, Y. Ge, Z. Xue, P. Yap, W. Li, D. Shen, and Q. Wang (2019) Synthesis and inpainting-based mr-ct registration for image-guided thermal ablation of liver tumors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 512–520. Cited by: §2.
  • [38] R. P. Woods (2009) Registration. In Handbook of Medical Image Processing and Analysis (Second Edition), I. N. BANKMAN (Ed.), pp. 495 – 497. External Links: ISBN 978-0-12-373904-9, Document, Link Cited by: §1.
  • [39] Z. Xu and M. Niethammer (2019) Deepatlas: joint semi-supervised learning of image registration and segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 420–429. Cited by: §2, §3.3, §5.0.1.
  • [40] X. Yang, H. Akbari, L. Halig, and B. Fei (2011) 3D non-rigid registration using surface and local salient features for transrectal ultrasound image-guided prostate biopsy. Proceedings of SPIE–the International Society for Optical Engineering 7964, pp. 79642V–79642V. External Links: Document Cited by: §2.
  • [41] Z. Zhang, L. Yang, and Y. Zheng (2018) Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 9242–9251. Cited by: §2, §3.3.
  • [42] H. Zheng, L. Xie, T. Ni, Y. Zhang, Y. Wang, Q. Tian, E. K. Fishman, and A. L. Yuille (2018) Phase collaborative network for multi-phase medical imaging segmentation. ArXiv abs/1811.11814. Cited by: §2, §3.3.
  • [43] Y. Zhou, Y. Li, Z. Zhang, Y. Wang, A. Wang, E. K. Fishman, A. L. Yuille, and S. Park (2019) Hyper-pairing network for multi-phase pancreatic ductal adenocarcinoma segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 155–163. Cited by: §2.

Appendix 0.A More Visualization

We visualize more examples in detail here. The notation follows Figure 6. These examples reveal some potential improvements for the JSSR system.

Firstly, the generator is conditioned on both and , which brings both benefits and shortcomings. In the column we can see that the synthesized image captures the intensity change between and well, since the checkerboard image shows only mild differences between and . However, in the column, as in the Arterial rows of Figure 7 and Figure 11, the generator also introduces additional boundary information from , which can mislead the register.
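As a side note, the checkerboard composites used in these difference figures can be produced with a few lines of array indexing (a generic sketch; the tile size here is an assumption, not taken from the paper):

```python
import numpy as np

def checkerboard(img_a, img_b, tile=16):
    """Interleave two spatially aligned 2D images in a checkerboard
    pattern; misalignment then shows up as discontinuities at tile edges."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = ((yy // tile) + (xx // tile)) % 2 == 0
    return np.where(mask, img_a, img_b)

# Toy usage: alternating tiles from a zero image and a one image.
a = np.zeros((64, 64))
b = np.ones((64, 64))
board = checkerboard(a, b, tile=16)
```

Well-registered image pairs yield smooth tile boundaries, which is why mild checkerboard differences indicate a good intensity match.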

Secondly, as shown in Figure 10, the segmentor produces a poor segmentation, yet the overlap in is still large, meaning that the consistency is well satisfied but provides wrong supervision to the register. A better consistency term may help in this situation.

Figure 6: Review of JSSR system
Figure 7: An example for evaluation of the JSSR system. Each picture shows the difference, in checkerboard style, between the two inputs indicated at the top (on the left and right of ). The rows correspond to Venous and Arterial, Delay, and Uncontrast.

This system has so far been tested only on multi-phase CT images. However, equipped with a generator and a segmentor, it can be applied to many other scenarios, such as registration from CT to MRI, domain adaptation for segmentation between CT and MRI, or tumor detection that combines multi-modality information, if we extend the segmentor to segment both normal organs and tumor regions.

Figure 8: Visualization of the segmentation part of JSSR on the same example as Figure 7. In each picture, the pink part belongs to the input on the right of , the green part belongs to the left input, and the white part is the overlap.
Figure 9: Another example
Figure 10: Same example as Figure 9
Figure 11: Another example
Figure 12: Same example as Figure 11
Figure 13: Another example
Figure 14: Same example as Figure 13

Appendix 0.B Proof

We made the approximation in the paper that for some constant when the generated by is smooth enough. Here we give the proof.

using the smoothness assumption that , and the identity transform . Then we have
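Since the inline formulas in this appendix were lost in extraction, a generic version of this type of smoothness argument (with assumed symbols: a deformation $\varphi = \mathrm{Id} + u$ with small displacement $u$ applied to an image $I$ with bounded gradient) reads:

```latex
% Generic first-order smoothness bound; \varphi, u, I and c are assumed
% placeholders, not the paper's original notation.
\begin{aligned}
\| I \circ \varphi - I \|_{\infty}
  &= \sup_{x} \, \bigl| I\bigl(x + u(x)\bigr) - I(x) \bigr| \\
  &\le \|\nabla I\|_{\infty} \, \| u \|_{\infty} \\
  &\le c ,
\end{aligned}
```

where the last step holds for some constant $c$ whenever the deformation field is smooth enough that $\|u\|_{\infty}$ is bounded.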