1 Introduction
Image registration aims to align two or more images to achieve point-wise spatial correspondence. This is a fundamental step for many medical image analysis tasks and has been a very active field of research for decades. In deformable image registration, nonrigid, nonlinear deformation fields are established between a pair of images, such as cardiac cine MR images. Typically, image registration is phrased as an unsupervised optimization problem w.r.t. a spatial mapping that minimizes a suitable cost function via iterative optimization schemes. Due to the substantial increase in computational power and availability of image data over recent years, learning-based image registration methods have emerged as an alternative to energy-optimization approaches.
Prior Work on CNN-Based Deformable Registration: Compared to other fields, relatively little research has yet been undertaken in deep-learning-based image registration. Only recently have the first deep-learning-based image registration methods been proposed [1, 2, 3], which mostly aim to learn a function in the form of a CNN that predicts a spatial deformation warping a moving image to a fixed image. We categorize these approaches into supervised [1], unsupervised [2, 4], and weakly-supervised [3] techniques based on how they train the network.
Supervised methods use ground-truth deformation fields for training. These deformation fields can either be randomly generated or produced by classic image registration methods. The main limitation of these approaches is that their performance is bounded by the quality of the existing algorithms or simulations used to generate the training data. In contrast, unsupervised methods do not require any ground-truth data. The learning process is driven by image similarity measures or, more generally, by evaluating the cost function of classic variational image registration methods. An important milestone for the development of these methods was the introduction of spatial transformer networks [5], which allow the moving image to be warped differentiably during training. Weakly-supervised methods also do not rely on ground-truth deformation fields, but training is still supervised with prior information: the labels of the moving image are transformed by the deformation field and compared with the fixed labels within the loss function. The anatomical labels are only required during training.
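To make the warping step concrete, the core of a spatial transformer is a differentiable resampling of the moving image at positions displaced by the deformation field. Below is a minimal numpy sketch of bilinear resampling; in an actual PyTorch implementation this is typically done with `torch.nn.functional.grid_sample`. The function name and the (H, W, 2) field convention are illustrative, not taken from the paper:

```python
import numpy as np

def bilinear_warp(image, field):
    """Warp a 2D image with a dense displacement field of shape (H, W, 2).

    Illustrative numpy sketch of the resampling a spatial transformer
    performs; a real implementation uses a differentiable op such as
    torch.nn.functional.grid_sample.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions: identity grid plus displacement, clamped to the domain.
    sy = np.clip(ys + field[..., 0], 0, h - 1)
    sx = np.clip(xs + field[..., 1], 0, w - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = sy - y0, sx - x0
    # Bilinear interpolation of the four neighbouring pixels.
    return ((1 - wy) * (1 - wx) * image[y0, x0]
            + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0]
            + wy * wx * image[y1, x1])
```

With a zero displacement field this returns the input image unchanged, which is a useful sanity check when wiring up a registration network.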
Contributions: We propose a new deep-learning-based image registration method that learns a registration function in the form of a CNN to predict a spatial deformation warping a moving image to a fixed image. In contrast to previous work, we propose the first weakly-supervised approach that successfully combines the strengths of prior information (segmentation labels) with an energy-based distance metric within a comprehensive multilevel deep-learning-based registration approach.
2 Materials and Methods
Let $F$ and $M$ denote the fixed image and the moving image, respectively, and let $\Omega \subset \mathbb{R}^2$ be a domain modeling the field of view of $F$. We aim to compute a deformation $y : \Omega \to \mathbb{R}^2$ that aligns the fixed image $F$ and the moving image $M$ on the field of view $\Omega$ such that $F(x)$ and $M(y(x))$ are similar for $x \in \Omega$. Inspired by recent unsupervised image registration methods (e.g. [2, 1]), we do not employ iterative optimization as in classic registration, but rather use a CNN that takes the images $F$ and $M$ as input and yields the deformation $y$ as output (cf. Fig. 1). Thus, in the context of CNNs, we can consider $y$ as a function of the input images and trainable CNN model parameters $\theta$ to be learned, i.e. $y = y(\theta; F, M)$. During training, the CNN parameters $\theta$ are learned so that the deformation field $y$ minimizes the loss function
$$\mathcal{L}(\theta) = \mathcal{D}(F, M(y)) + \alpha\, \mathcal{R}(y) + \beta\, \mathcal{D}_B(b_F, b_M(y)) \qquad (1)$$

with the so-called distance measure $\mathcal{D}$ that quantifies the similarity of the fixed image $F$ and the deformed moving image $M(y)$, the regularizer $\mathcal{R}$ that forces smoothness of the deformation, and a second distance measure $\mathcal{D}_B$ that quantifies the similarity of the fixed segmentation $b_F$ and the warped moving segmentation $b_M(y)$. The parameters $\alpha, \beta$ are weighting factors. For convenience, we write $M(y) := M \circ y$. Note that the segmentations are only used to evaluate the loss function; they are not used as network input and are therefore only required during training. We use the edge-based normalized gradient fields (NGF) distance measure [6]

$$\mathcal{D}(F, M(y)) = \int_\Omega 1 - \left( \frac{\langle \nabla F(x),\, \nabla M(y(x)) \rangle_\eta}{\|\nabla F(x)\|_\eta\, \|\nabla M(y(x))\|_\eta} \right)^{\!2} \mathrm{d}x$$

with $\langle a, b \rangle_\eta := a^\top b + \eta^2$, $\|a\|_\eta := \sqrt{\langle a, a \rangle_\eta}$, the so-called edge parameter $\eta > 0$, and the curvature regularizer [6] $\mathcal{R}(y) = \int_\Omega \|\Delta (y - \mathrm{id})\|^2\, \mathrm{d}x$. The similarity of the segmentation masks is measured using the sum of squared differences of the one-hot representation of the segmentations, i.e. $\mathcal{D}_B(b_F, b_M(y)) = \tfrac{1}{2} \int_\Omega \|b_F(x) - b_M(y(x))\|^2\, \mathrm{d}x$.
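As an illustration, the three terms of loss (1) can be sketched in numpy as follows. This is a minimal discrete sketch, not the paper's implementation; the default values of `eta`, `alpha`, and `beta` are placeholders rather than the tuned parameters used in the experiments:

```python
import numpy as np

def ngf_distance(fixed, warped, eta=0.1):
    """Normalized gradient fields distance; eta is the edge parameter."""
    gf = np.stack(np.gradient(fixed))    # (2, H, W) image gradients
    gm = np.stack(np.gradient(warped))
    inner = (gf * gm).sum(axis=0) + eta ** 2
    norm_f = (gf ** 2).sum(axis=0) + eta ** 2
    norm_m = (gm ** 2).sum(axis=0) + eta ** 2
    # 1 minus the squared normalized gradient alignment, averaged over pixels.
    return np.mean(1.0 - inner ** 2 / (norm_f * norm_m))

def curvature(displacement):
    """Curvature regularizer: squared Laplacian of each displacement component."""
    lap = sum(np.gradient(np.gradient(displacement, axis=a), axis=a)
              for a in (0, 1))
    return np.mean(lap ** 2)

def seg_ssd(b_fixed, b_warped):
    """Sum of squared differences of one-hot segmentation representations."""
    return 0.5 * np.mean((b_fixed - b_warped) ** 2)

def total_loss(fixed, warped, disp, b_fixed, b_warped, alpha=1.0, beta=1.0):
    # alpha and beta are illustrative weights, not the paper's tuned values.
    return (ngf_distance(fixed, warped)
            + alpha * curvature(disp)
            + beta * seg_ssd(b_fixed, b_warped))
```

Note that the NGF term vanishes for identical images and that, unlike plain SSD on intensities, it compares edge orientation rather than absolute gray values.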
Architecture and Training: Our architecture is illustrated in Fig. 2. The network basically follows the structure of a U-Net [7], taking a pair of fixed and moving images as input. However, we start with two separate, weight-sharing processing streams for the moving and the fixed image. The CNN generates a grid of control points for a B-spline transformer, whose output is a full displacement field that warps the moving image to the fixed image. During training, the network outputs three deformation fields of different resolutions, and we compute the overall loss as a weighted sum of the losses on these resolution levels. This design decision is inspired by the multilevel strategy of classic image registration. During inference, only the deformation field at the highest resolution is used.
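The multilevel weighting can be sketched as follows. Plain SSD stands in here for the full loss (1) at each level, and the level weights are illustrative, since the paper does not state its values:

```python
import numpy as np

def pyramid(img, levels=3):
    """Build a resolution pyramid by 2x2 block averaging (finest level first)."""
    out = [img]
    for _ in range(levels - 1):
        h, w = out[-1].shape
        out.append(out[-1][:h - h % 2, :w - w % 2]
                   .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return out

def multilevel_loss(fixed, warped, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of a per-level loss over the resolution pyramid.

    SSD is used per level as a simple stand-in; the weights are
    illustrative placeholders, finest level first.
    """
    losses = [np.mean((f - m) ** 2)
              for f, m in zip(pyramid(fixed), pyramid(warped))]
    return sum(w * l for w, l in zip(weights, losses))
```

In the actual network, each level's deformation field warps the moving image at that resolution before the loss is evaluated; the pyramid here only illustrates how the per-level terms are combined.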
Experiments: We perform our experiments on the ACDC dataset [8]. It contains cardiac multi-slice 2D cine MR images of 100 patients captured at the end-diastole (ED) and end-systole (ES) time points, amounting to 951 2D images per cardiac phase. The dataset includes annotations by a clinical expert for the left and right ventricle cavities and the myocardium. We only use slices that contain all labels, i.e. 680 2D image pairs. All images are cropped to the region of interest with a size of 112×112 pixels. Image intensities are normalized to a fixed range. For data augmentation, we slightly deform the images and segmentations to increase the number of image pairs by a factor of 8.
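A sketch of the described preprocessing. The target range [0, 1] is an assumption (the paper only states that intensities are normalized to a fixed range), and the simple center crop stands in for the actual region-of-interest cropping:

```python
import numpy as np

def preprocess(img, size=112):
    """Center-crop to size x size and min-max normalize.

    Assumptions: normalization target [0, 1] and a plain center crop;
    the paper's region of interest may be chosen differently.
    """
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    crop = img[top:top + size, left:left + size]
    lo, hi = crop.min(), crop.max()
    # Guard against constant images to avoid division by zero.
    if hi > lo:
        return (crop - lo) / (hi - lo)
    return np.zeros_like(crop, dtype=float)
```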
Training is performed as a cross-validation with folds that divide our dataset patient-wise. Our method is implemented in PyTorch. The network is trained for 40 epochs on an NVIDIA GTX 1080 in approximately 0.5 hours using an ADAM optimizer with a fixed learning rate and a batch size of 30. We empirically choose the regularization parameter $\alpha$, the boundary-loss weight $\beta$, and the edge parameter $\eta$ in the loss function. To evaluate our registration method, we use the computed deformation field to propagate the segmentation of the moving image and measure the volume overlap using the Dice coefficient. We compare our method with a classic multilevel image registration model similar to [6], which iteratively minimizes the loss function without the use of segmentation data.

3 Results
As shown in Fig. 3, our proposed method outperforms the classic multilevel image registration approach in terms of the average Dice coefficient. Our method improves the average Dice accuracy across all three labels while drastically reducing the computation time from 3.583 s to 0.006 s per image pair. Not only is the average Dice score of our approach higher for every anatomical label, but the variation is also reduced (see Fig. 3). Fig. 4 shows two example image pairs and the registration results, demonstrating the ability of our method to compensate for large local deformations.
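The Dice coefficient used throughout the evaluation measures the overlap between the propagated moving segmentation and the fixed segmentation. For a pair of binary masks it can be computed as:

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Two empty masks overlap perfectly by convention.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```

For multi-label evaluation, the score is computed per anatomical label and then averaged.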
In comparison to the method of Krebs et al. [4], which uses the same dataset and reports comparable Dice coefficients for unregistered images, our method yields a clear improvement.
As illustrated in Table 1, our choice of weighting parameters within the loss function (1) leads to a compromise between maximizing the Dice score and keeping the percentage of foldings low.
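The percentage of foldings refers to points where the Jacobian determinant of the deformation $y = \mathrm{id} + u$ is non-positive, i.e. where the deformation grid folds over itself. A numpy sketch on a pixel grid (finite differences stand in for the exact derivatives):

```python
import numpy as np

def folding_fraction(disp):
    """Fraction of pixels with non-positive Jacobian determinant of y = id + u.

    disp has shape (H, W, 2): component 0 is the y-displacement,
    component 1 the x-displacement (an illustrative convention).
    """
    duy_dy, duy_dx = np.gradient(disp[..., 0])
    dux_dy, dux_dx = np.gradient(disp[..., 1])
    # det(I + grad u) for the 2D case.
    det = (1.0 + duy_dy) * (1.0 + dux_dx) - duy_dx * dux_dy
    return float(np.mean(det <= 0))
```

A zero displacement field yields a determinant of one everywhere and thus a folding fraction of zero.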
4 Discussion
We have presented a new weakly-supervised deep-learning-based method for image registration that replaces iterative optimization steps with deep CNN layers. Our approach advances the state of the art in CNN-based deformable registration by combining the complementary strengths of global semantic information (supervised learning with segmentation labels) and local distance metrics borrowed from classic medical image registration, which support the alignment of surrounding structures. We have evaluated our technique on cardiac multi-slice 2D cine MR images and demonstrated substantial improvements compared to classic image registration methods, both in terms of Dice coefficient and execution time. We also demonstrated the importance of integrating both unsupervised (distance measure $\mathcal{D}$) and supervised (boundary term $\mathcal{D}_B$) learning objectives into a unified framework, achieving state-of-the-art Dice scores of 86.5%.

References
[1] Rohé MM, Datar M, Heimann T, Sermesant M, Pennec X. SVF-Net: Learning Deformable Image Registration Using Shape Matching. In: MICCAI. Springer; 2017. p. 266–274.
[2] de Vos BD, Berendsen FF, Viergever MA, Staring M, Išgum I. End-to-end unsupervised deformable image registration with a convolutional neural network. In: DLMIA and ML-CDS. Springer; 2017. p. 204–212.
[3] Hu Y, Modat M, Gibson E, Li W, Ghavami N, Bonmati E, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis. 2018;49:1–13.
[4] Krebs J, Mansi T, Mailhé B, Ayache N, Delingette H. Unsupervised Probabilistic Deformation Modeling for Robust Diffeomorphic Registration. In: DLMIA and ML-CDS. Springer; 2018. p. 101–109.
[5] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in Neural Information Processing Systems; 2015. p. 2017–2025.
[6] Rühaak J, Heldmann S, Kipshagen T, Fischer B. Highly accurate fast lung CT registration. In: Medical Imaging 2013: Image Processing. vol. 8669. International Society for Optics and Photonics; 2013. p. 86690Y.
[7] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. Springer; 2015. p. 234–241.
[8] Jodoin PM, Lalande A, Bernard O. Automated Cardiac Diagnosis Challenge (ACDC); 2017. Accessed: 2018-09-24. Available from: https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html.