
FIRE: Unsupervised bi-directional inter-modality registration using deep networks

by Chengjia Wang, et al.

Inter-modality image registration is a critical preprocessing step for many applications within the routine clinical pathway. This paper presents an unsupervised deep inter-modality registration network that can learn the optimal affine and non-rigid transformations simultaneously. Inverse consistency is an important property that is commonly ignored by recent deep learning based inter-modality registration algorithms. We address this issue through the proposed multi-task architecture and a new comprehensive transformation network. Specifically, the proposed model learns a modality-independent latent representation to perform cycle-consistent cross-modality synthesis, and uses an inverse-consistent loss to learn a pair of transformations that align the synthesized image with the target. We name this framework FIRE after the shape of its structure. Our method performs comparably to or better than a popular baseline method in experiments on multi-sequence brain MR data and intra-modality 4D cardiac Cine-MR data.



1 Introduction

Modern medical diagnosis benefits from fusing complementary information obtained by different modalities. This makes inter-modality image registration a critical pre-processing task for many applications within the routine clinical pathway [1]. (In this paper, we use “modality” to refer uniformly to data acquired with different imaging techniques and with different parametric setups.) Traditional and early learning-based methods typically model registration as an iterative optimization process that searches for the optimum of a manually designed similarity metric, so they are often computationally expensive [2]. Furthermore, manually designed metrics have limited robustness and performance when registering inter-modality data.

As discussed in [2], in the past decade a variety of deep learning based methods have been proposed that can predict the geometric correspondences between a pair of images in one pass. However, most existing methods are based on supervised learning and require manually generated ground truths, such as pre-aligned image pairs, simulated transformation fields and segmentation labels [3][4][5]. At the same time, present unsupervised methods [6] have mostly been tested only on limited subsets of 3D volumes or 2D slices with small misalignments, and they require reliable affine registration as a preprocessing step. Deep learning models that can perform both affine and non-rigid registration [7] often require two independent models, one for each type of transformation. In this paper, we present an unsupervised deep inter-modality registration network that can learn the optimal affine and non-rigid transformations simultaneously.

Figure 1: Architecture of the FIRE model: a synthesis encoder that extracts modality-independent features; two synthesis decoders that map the features extracted by the encoder to synthesized images; and two transformation networks that predict the transformation fields.

Our method solves N-D image registration problems through cross-modality image synthesis and inverse-consistent transformations [8]. Cycle-consistency adversarial losses have been widely used in this type of method. Similarly, inverse consistency (bi-directional transformation) is a favorable property that better preserves the neighbourhood topology and the anatomy of organs. However, most previous works fail to address this issue and estimate only asymmetric transformations. Two inverse-consistent models presented in concurrent preliminary works [9][10] are closer to the proposed method. However, [10] is designed for intra-modality registration, and [9] has only been tested on 2D non-rigid registration.

We named the proposed model FIRE because its architecture, shown in Fig. 1, resembles the character “火” (the Chinese character meaning fire). We present experiments demonstrating that our method achieves state-of-the-art performance when registering multi-sequence brain MR data with aggressive simulated deformations and intra-modality 4D cardiac MR data. In summary, the contributions of this paper are: (1) the “火”-shaped FIRE architecture for inverse-consistent inter-modality registration; (2) simultaneous learning of affine and non-rigid transformations; and (3) a new regularization for non-rigid registration using the predicted affine transformation.

2 Method

Given two images, the proposed FIRE model predicts two transformations, one in each direction, that warp each image into alignment with the other. The transformation fields are obtained by minimizing a loss function. Computations described in this section are based on input data normalized to the range .

2.1 Architecture

The FIRE model consists of five sub-networks (Fig. 1): a synthesis encoder that extracts modality-independent features from each input image; two synthesis decoders that map the features extracted by the encoder to synthesized images; and two transformation networks that predict the transformation fields. In the training stage, the extracted features are also warped by the predicted transformations and then used to generate additional synthesized images.

2.1.1 Synthesis Encoder and Decoder

Fig. 2 details the architecture of the synthesis networks. The encoder contains an input convolutional layer, two downsampling convolutional layers and four ResNet blocks. Each decoder starts with four ResNet blocks, followed by two upsampling convolutional layers and an output convolutional layer. All convolutional layers use a kernel size of 3 and are followed by an instance normalization layer.
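The layer list above fixes the kernel size but not the strides or padding. The sketch below tracks the spatial size along one axis through the encoder and a decoder. It assumes padding 1, stride 2 for the two downsampling layers, and stride-2 transposed convolutions with output padding 1 for the upsampling layers. These settings are assumptions, not values given in the paper. Under them, the latent features sit at one quarter of the input resolution per axis, and the decoder restores the original size.

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Output length along one axis for a standard convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_out(n, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length along one axis for a transposed (upsampling) convolution."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def encoder_out_size(n):
    n = conv_out(n)              # input conv (stride 1, assumed)
    n = conv_out(n, stride=2)    # downsampling conv 1 (stride 2, assumed)
    n = conv_out(n, stride=2)    # downsampling conv 2 (stride 2, assumed)
    # Four ResNet blocks built from stride-1, padded 3x3 convs keep the size.
    return n


def decoder_out_size(n):
    # Four ResNet blocks keep the size.
    n = transposed_conv_out(n)   # upsampling conv 1 (assumed transposed conv)
    n = transposed_conv_out(n)   # upsampling conv 2 (assumed transposed conv)
    return conv_out(n)           # output conv (stride 1, assumed)


print(encoder_out_size(256), decoder_out_size(encoder_out_size(256)))  # 64 256
```

For a 3D volume, the same arithmetic applies independently along each axis.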

Figure 2: Architecture of the synthesis encoder and decoders.

2.1.2 Transformation Network

A transformation network learns both an affine transformation and a non-rigid transformation from its pair of inputs. Fig. 3 presents the architecture of one transformation network; the other has the same architecture. The affine transformation sub-network has a structure similar to the original spatial transformer network (STN). A global average pooling layer reduces the convolutional features to a fixed-size feature vector, from which two fully connected layers compute the affine transformation. The non-rigid transformation sub-network takes two images as input and first processes them with parallel layers. The extracted features are then concatenated to produce the non-rigid deformation. The last layer expresses the deformation in a normalized coordinate system defined over the N-D image.
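To make the normalized coordinate system concrete, here is a minimal 2D NumPy sketch of how an affine matrix and a dense displacement could be combined to resample a moving image. It assumes the spatial transformer convention: a 2x3 matrix maps output grid coordinates in [-1, 1] to input sampling coordinates. It also assumes that the non-rigid displacement is simply added after the affine mapping. The exact composition used by FIRE is not specified in this text, and all function names are hypothetical.

```python
import numpy as np


def normalized_grid(h, w):
    """Sampling grid of shape (h, w, 2) holding (x, y) coordinates in [-1, 1]."""
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    gx, gy = np.meshgrid(xs, ys)  # each of shape (h, w)
    return np.stack([gx, gy], axis=-1)


def apply_affine(grid, theta):
    """Map grid points through a 2x3 affine matrix (STN-style convention)."""
    ones = np.ones(grid.shape[:-1] + (1,))
    homogeneous = np.concatenate([grid, ones], axis=-1)  # (h, w, 3)
    return homogeneous @ theta.T                          # (h, w, 2)


def bilinear_sample(img, coords):
    """Sample a 2D image at normalized (x, y) coords, replicating the border."""
    h, w = img.shape
    x = (coords[..., 0] + 1.0) * (w - 1) / 2.0  # back to pixel units
    y = (coords[..., 1] + 1.0) * (h - 1) / 2.0
    x0 = np.clip(np.floor(x).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def warp(img, theta, displacement):
    """Warp img by an affine map followed by an additive (h, w, 2) displacement."""
    grid = normalized_grid(*img.shape)
    coords = apply_affine(grid, theta) + displacement
    return bilinear_sample(img, coords)
```

With the identity matrix and a zero displacement, the warp returns the input image. A constant displacement of 2 / (w - 1) along x shifts the image content left by one pixel.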

Figure 3: Architecture of the transformation networks.

2.2 Loss Functions and Training Procedure

The synthesis networks generate two synthesized images: one aligned with the moving image and one expected to be identical to the target image. The backward synthesis and registration are performed through the same pipeline using the networks of the opposite direction. The losses used for training the FIRE model include a synthesis loss and a registration loss. A new regularization term encourages spatially smooth and topology-preserving deformations. The loss function of the proposed FIRE model is defined as:


2.2.1 Synthesis Loss

The synthesis loss includes four terms with different purposes. First, for accurate cross-domain synthesis, we define a synthesis accuracy loss using the root-mean-square (RMS) error. Second, the encoder is expected to extract modality-independent features, so features extracted from aligned image pairs should be identical regardless of their modalities; we therefore define a feature loss. Third, a cycle-consistency loss is defined for robust cross-modality synthesis. Finally, we define an alignment loss between the corresponding pair of images. To sum up, the FIRE synthesis loss is:


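The four synthesis terms could be combined as sketched below. This is an illustration under stated assumptions. Only the accuracy term is specified as an RMS error, so RMS is used for the other three terms purely as a placeholder. The per-term weights default to 1, and the term names are hypothetical.

```python
import numpy as np

# Hypothetical names for the four terms described in the text.
TERMS = ("accuracy", "feature", "cycle", "alignment")


def rms(pred, ref):
    """Root-mean-square error between two arrays of the same shape."""
    diff = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def synthesis_loss(pairs, weights=None):
    """Weighted sum of four (prediction, reference) discrepancies.

    pairs: dict mapping each name in TERMS to a (prediction, reference) tuple.
    weights: optional dict of per-term weights; missing terms default to 1.0.
    RMS is a placeholder distance for every term except "accuracy".
    """
    weights = weights or {}
    total = 0.0
    for name in TERMS:
        pred, ref = pairs[name]
        total += weights.get(name, 1.0) * rms(pred, ref)
    return total
```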
2.2.2 Registration Loss

Transforming the features extracted by the encoder serves the synthesis task, whereas registration is achieved by applying the transformations to the input images. Here we define a registration accuracy loss. Because the forward and backward transformations should be mutual inverses, we also define an inverse-consistency loss. The registration loss is computed as:


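One common way to express inverse consistency for dense displacement fields is to require that applying one transformation and then the other returns every point to its starting position, i.e. u_AB(x) + u_BA(x + u_AB(x)) ≈ 0. The sketch below penalizes the squared residual of both compositions for 2D fields in pixel units. It illustrates the idea and is not necessarily the exact inverse-consistency loss used by FIRE. The notation u_AB, u_BA and the (2, H, W) field layout with (dy, dx) channels are assumptions.

```python
import numpy as np


def sample_bilinear(field, y, x):
    """Bilinearly sample a (H, W) array at float pixel coords, clamped to the border."""
    h, w = field.shape
    y = np.clip(y, 0, h - 1)
    x = np.clip(x, 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = y - y0
    wx = x - x0
    top = field[y0, x0] * (1 - wx) + field[y0, x1] * wx
    bottom = field[y1, x0] * (1 - wx) + field[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def compose_residual(u1, u2):
    """Residual u1(x) + u2(x + u1(x)) for displacement fields of shape (2, H, W)."""
    h, w = u1.shape[1:]
    yy, xx = np.meshgrid(np.arange(h, dtype=float), np.arange(w, dtype=float),
                         indexing="ij")
    y_moved = yy + u1[0]
    x_moved = xx + u1[1]
    return np.stack([u1[c] + sample_bilinear(u2[c], y_moved, x_moved)
                     for c in range(2)])


def inverse_consistency_loss(u_ab, u_ba):
    """Mean squared residual norm of both compositions (symmetric form)."""
    r1 = compose_residual(u_ab, u_ba)
    r2 = compose_residual(u_ba, u_ab)
    return float(np.mean(np.sum(r1 ** 2, axis=0)) + np.mean(np.sum(r2 ** 2, axis=0)))
```

For example, a translation of +1.5 pixels paired with one of -1.5 pixels gives a loss of zero. Two translations in the same direction do not.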
2.2.3 Regularization

Previous works regularize non-rigid transformation fields with a smoothness term based on the Laplacian operator. In this work, the estimated affine transformations should account for as much of the misalignment as possible, keeping the non-rigid transformations to a minimum. In the synthesis process, the affinely transformed features can be fed into the synthesis decoders to obtain affine-only synthesized images, from which the synthesis regularization is computed. Similarly, a registration regularization is computed as:


To sum up, the regularization of the proposed FIRE model is:


where a scaling parameter weights the regularization. Empirically, when registering N-D images, this parameter is set according to the number of output channels of the transformation network and the number of points in an input image.
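The conventional Laplacian smoothness term can be approximated with a 5-point finite-difference stencil. The sketch below assumes a 2D displacement field stored as (C, H, W), replicates border values for the stencil, and averages the squared Laplacian. It covers only the conventional smoothness term, not the affine-based regularization introduced by FIRE, and the function names are hypothetical.

```python
import numpy as np


def laplacian_2d(f):
    """5-point discrete Laplacian of a 2D array, replicating border values."""
    p = np.pad(f, 1, mode="edge")
    up, down = p[:-2, 1:-1], p[2:, 1:-1]
    left, right = p[1:-1, :-2], p[1:-1, 2:]
    return up + down + left + right - 4.0 * f


def laplacian_smoothness(u):
    """Mean squared Laplacian over all channels of a (C, H, W) displacement field."""
    return float(np.mean([np.mean(laplacian_2d(channel) ** 2) for channel in u]))
```

A constant field has zero penalty. A single-pixel spike is penalized heavily, which is the behaviour a smoothness prior is meant to have.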

2.2.4 Optimization

We use three Adam optimizers to update the parameters of two of the sub-networks and of the remaining networks separately, one optimizer per iteration within each group of three consecutive iterations, for stable convergence. Different learning rates are used for different pairs of sub-networks.
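The alternating schedule can be read as a round-robin: within every window of three consecutive iterations, each optimizer steps once on its own parameter group. The sketch below only simulates which group is updated at each iteration. The group labels are hypothetical, and a real training loop would call the selected optimizer's step after computing the relevant losses.

```python
from collections import Counter

# Hypothetical labels for the three parameter groups.
GROUPS = ("group_1", "group_2", "rest")


def group_for_iteration(iteration):
    """Round-robin choice of the parameter group updated at this iteration."""
    return GROUPS[iteration % len(GROUPS)]


def simulate(num_iterations):
    """Count how often each group would be updated over a run."""
    updates = Counter()
    for it in range(num_iterations):
        updates[group_for_iteration(it)] += 1  # stand-in for optimizer.step()
    return updates
```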

3 Experiments

3.0.1 MRBrainS

We use a dataset of 3T multi-sequence brain MR data obtained by combining the training sets of the MRBrainS18 and MRBrainS13 challenges. The dataset contains co-registered 3D T1-weighted, inversion recovery (IR) and T2-FLAIR data acquired from 12 subjects. All scans have a voxel size of . We use manual segmentations of 3 anatomical structures to evaluate the performance of registration algorithms. Data from 8 patients are used for training, 1 for validation and 3 for testing. For both 3D and 2D registration, we resampled all data to per voxel. We perform 2D and 3D registration between T1 and FLAIR data, and 2D registration between IR and FLAIR data. In the training stage, randomly generated affine and non-rigid transformations are applied to the moving image.
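The sampling ranges for these random training deformations are not given here. The sketch below therefore uses hypothetical ranges: a rotation of up to 10 degrees, per-axis scaling of up to 10%, a translation of up to 0.1, and a smooth displacement obtained by linearly upsampling a coarse random grid. Both outputs are in normalized [-1, 1] coordinates. The displacement is laid out as (H, W, 2) in (x, y) order, a layout chosen here for illustration.

```python
import numpy as np


def random_affine_2d(rng, max_rot_deg=10.0, max_scale=0.1, max_shift=0.1):
    """Random 2x3 affine matrix (rotation, anisotropic scale, translation)."""
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale, size=2)
    shift = rng.uniform(-max_shift, max_shift, size=2)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    return np.hstack([rot @ np.diag(scale), shift[:, None]])


def upsample_linear(coarse, h, w):
    """Separable linear interpolation of a small 2D grid up to (h, w)."""
    ch, cw = coarse.shape
    rows = np.array([np.interp(np.linspace(0, cw - 1, w), np.arange(cw), r)
                     for r in coarse])                        # (ch, w)
    cols = np.array([np.interp(np.linspace(0, ch - 1, h), np.arange(ch), c)
                     for c in rows.T])                        # (w, h)
    return cols.T                                             # (h, w)


def random_displacement_2d(rng, h, w, grid=4, max_disp=0.05):
    """Smooth random displacement of shape (h, w, 2), channels in (x, y) order."""
    coarse = rng.uniform(-max_disp, max_disp, size=(2, grid, grid))
    return np.stack([upsample_linear(c, h, w) for c in coarse], axis=-1)


rng = np.random.default_rng(0)
theta = random_affine_2d(rng)
displacement = random_displacement_2d(rng, 32, 48)
```

Linear upsampling keeps every displacement within the range of the coarse control values. A Gaussian-smoothed noise field would be an equally reasonable choice.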

3.0.2 ACDC

For intra-modality registration, we use 4D cine-MR data from the 2017 ACDC Challenge. The training dataset includes data from 100 patients with a variety of pathologies. The in-plane resolution is between and , and each 4D image has 28 to 40 phases that completely or partially cover the cardiac cycle. Manual segmentations of 2 phases are provided for each 4D image. We use all phases for training and the two segmented phases for testing. We use 40 patients for training, 10 for validation and 50 for testing.

3.0.3 Evaluation Metrics and Baselines

We evaluate our method by the overlap of segmented structures, measured with the Dice metric; higher Dice scores indicate better registration. Previous comparison studies show that Symmetric Normalization (SyN) [11], implemented in the ANTs toolbox, has outstanding non-rigid registration performance. We therefore compare our FIRE model against SyN [11] for non-rigid registration. Affine registration performance is compared against the mutual information (MI) based affine registration implemented in ANTs.
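The Dice metric measures the overlap of two binary segmentations A and B as 2|A ∩ B| / (|A| + |B|). A minimal sketch for integer label maps follows. Returning 1.0 when both masks are empty is a convention assumed here, not something the paper states. The tables appear to report Dice multiplied by 100, whereas the discussion of IR-FLAIR results uses fractions.

```python
import numpy as np


def dice(seg_a, seg_b, label=1):
    """Dice overlap of one label between two segmentation maps of equal shape."""
    a = np.asarray(seg_a) == label
    b = np.asarray(seg_b) == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / denom)
```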

3.1 Results and Discussion

Figure 4: Representative results on MRBrainS T1-FLAIR data. The outer contour of cerebrospinal fluid in the extracerebral space, segmented on T1 images, is shown in blue.

Table 1 summarizes the Dice scores obtained from registration between MRBrainS T1 and FLAIR data, and Fig. 4 shows representative results. For 2D registration, the proposed FIRE model achieved results comparable to SyN on the cerebellum (Ce) and brain stem (BS), and higher scores on white matter (WHM). For 3D registration, our method obtained higher Dice scores on BS. In the example shown in Fig. 4, FIRE achieved visibly better alignment of the outer contour of cerebrospinal fluid in the extracerebral space (shown in blue).

Data  Object  Unaligned      ANTs-affine   FIRE-affine   ANTs-SyN      FIRE
2D    BS      11.62 (6.1)    61.25 (3.7)   62.90 (4.1)   78.73 (7.3)   80.68 (7.7)
2D    Ce       7.17 (4.4)    63.32 (3.2)   64.36 (4.0)   75.72 (8.1)   76.96 (7.3)
2D    WHM     14.29 (7.5)    59.12 (4.5)   59.97 (4.4)   81.36 (6.0)   84.18 (3.7)
3D    BS      27.15 (9.2)    67.15 (3.1)   69.81 (4.1)   79.77 (6.7)   81.08 (7.0)
3D    Ce      28.38 (10.3)   68.38 (3.6)   70.62 (3.7)   86.00 (6.9)   86.13 (7.2)
3D    WHM     20.27 (9.3)    60.27 (3.8)   60.61 (3.8)   72.33 (7.4)   72.56 (7.1)
Table 1: Results of 2D and 3D T1-FLAIR registration on MRBrainS data. Dice scores (%) are calculated on the brain stem (BS), cerebellum (Ce) and white matter (WHM).

Registration between IR and FLAIR images is challenging. We failed to produce a visible alignment with the SyN method implemented in ANTs, even after a grid search over its settings. For example, the average Dice score obtained on Ce using SyN with MI-based affine registration is below 0.4, whereas the unaligned images give a Dice score of 0.6. Our method achieved a Dice score of 0.69 for IR-FLAIR registration. An example result is shown in Fig. 5.

Figure 5: Example results of registering the IR

Table 2 and Fig. 6 show the results of intra-modality registration on the ACDC data. Because the data show only small local displacements between frames, both methods achieve Dice scores above 0.9 on LVe. The FIRE model achieved performance comparable to SyN.

Object unaligned ANTs-SyN FIRE
LVe 65.75 (16.25) 90.81 (4.3) 90.08 (5.5)
Myo 51.97 (14.50) 70.71 (5.6) 71.66 (6.3)
Table 2: Results on ACDC data. Dice scores (%) computed on the left ventricular endocardium (LVe) and myocardium (Myo).
Figure 6: Representative results of registration on ACDC data. Outer contours of myocardium are shown in blue.

4 Conclusion

We proposed a deep learning model that solves inverse-consistent inter- and intra-modality image registration problems through cross-domain synthesis. The new spatial transformation network and associated loss functions allow the model to predict both optimal affine and topology-preserving non-rigid transformations. Experiments show that our method achieves performance comparable to the state of the art in both 3D and 2D registration tasks. We achieved better performance than the selected baseline on registration between IR and FLAIR brain data. The model has a new “火”-shaped architecture formed by five sub-networks, hence the name FIRE.