Image registration aims to align two or more images to achieve point-wise spatial correspondence. This is a fundamental step for many medical image analysis tasks and has been a very active field of research for decades. In deformable image registration approaches, non-rigid, non-linear deformation fields are established between a pair of images, such as cardiac cine-MR images. Typically, image registration is phrased as an unsupervised optimization problem w.r.t. a spatial mapping that minimizes a suitable cost function by applying iterative optimization schemes. Due to substantially increased computational power and availability of image data in recent years, learning-based image registration methods have emerged as an alternative to energy-optimization approaches.
Prior work on CNN-Based Deformable Registration: Compared to other fields, relatively little research has been undertaken in deep-learning-based image registration. Only recently have the first deep-learning-based image registration methods been proposed [1, 2, 3], most of which aim to learn a function in the form of a CNN that predicts a spatial deformation warping a moving image to a fixed image. We categorize these approaches into supervised [1], unsupervised [2, 4] and weakly-supervised [3] techniques based on how they train the network.
The supervised methods use ground-truth deformation fields for training. These deformation fields can either be randomly generated or produced by classic image registration methods. The main limitation of these approaches is that their performance is bounded by the performance of the existing algorithms or simulations that produced the training data. In contrast, the unsupervised methods do not require any ground-truth data. The learning process is driven by image similarity measures or, more generally, by evaluating the cost function of classic variational image registration methods. An important milestone for the development of these methods was the introduction of spatial transformer networks [5], which differentiably warp the moving image during training. Weakly-supervised methods also do not rely on ground-truth deformation fields, but training is still supervised with prior information. The labels of the moving image are transformed by the deformation field and compared with the fixed labels within the loss function. The anatomical labels are required only during training.
Contributions: We propose a new deep-learning-based image registration method that learns a registration function in the form of a CNN to predict a spatial deformation that warps a moving image to a fixed image. In contrast to previous work, we propose the first weakly-supervised approach that successfully combines the strengths of prior information (segmentation labels) with an energy-based distance metric within a comprehensive multi-level deep-learning-based registration approach.
2 Materials and Methods
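Let $\mathcal{F}, \mathcal{M}: \mathbb{R}^2 \to \mathbb{R}$ denote the fixed image and moving image, respectively, and let $\Omega \subset \mathbb{R}^2$ be a domain modeling the field of view of $\mathcal{F}$. We aim to compute a deformation $y: \Omega \to \mathbb{R}^2$ that aligns the fixed image $\mathcal{F}$ and the moving image $\mathcal{M}$ on the field of view $\Omega$ such that $\mathcal{F}(x)$ and $\mathcal{M}(y(x))$ are similar for $x \in \Omega$. Inspired by recent unsupervised image registration methods (e.g. [2, 1]), we do not employ iterative optimization as in classic registration, but rather use a CNN that takes the images $\mathcal{F}$ and $\mathcal{M}$ as input and yields the deformation $y$ as output (cf. Fig. 1). Thus, in the context of CNNs, we can consider $y$ as a function of the input images $\mathcal{F}, \mathcal{M}$ and trainable CNN model parameters $\theta$ to be learned, i.e. $y(x) \equiv y(\theta; \mathcal{F}, \mathcal{M}, x)$. During training, the CNN parameters $\theta$ are learned so that the deformation field $y$ minimizes the loss function

$$\mathcal{L}(\mathcal{F}, \mathcal{M}, b_{\mathcal{F}}, b_{\mathcal{M}}, y) = \mathcal{D}(\mathcal{F}, \mathcal{M}(y)) + \alpha \, \mathcal{R}(y) + \beta \, \mathcal{B}(b_{\mathcal{F}}, b_{\mathcal{M}}(y)) \qquad (1)$$

with the so-called distance measure $\mathcal{D}$ that quantifies the similarity of the fixed image $\mathcal{F}$ and the deformed moving image $\mathcal{M}(y)$, the regularizer $\mathcal{R}$ that enforces smoothness of the deformation, and a second distance measure $\mathcal{B}$ that quantifies the similarity of the fixed segmentation $b_{\mathcal{F}}$ and the warped moving segmentation $b_{\mathcal{M}}(y)$. The parameters $\alpha, \beta \geq 0$ are weighting factors. For convenience, we set $\mathcal{M}(y)(x) := \mathcal{M}(y(x))$. Note that the segmentations are used only to evaluate the loss function and not as network input; they are therefore required only during training. We use the edge-based normalized gradient fields distance measure [6]

$$\mathcal{D}(\mathcal{F}, \mathcal{M}(y)) = \frac{1}{2} \int_{\Omega} 1 - \frac{\langle \nabla \mathcal{M}(y(x)), \nabla \mathcal{F}(x) \rangle_{\varepsilon}^{2}}{\|\nabla \mathcal{M}(y(x))\|_{\varepsilon}^{2} \, \|\nabla \mathcal{F}(x)\|_{\varepsilon}^{2}} \, dx$$

with $\langle u, v \rangle_{\varepsilon} := u^{\top} v + \varepsilon^{2}$, $\|u\|_{\varepsilon} := \sqrt{\langle u, u \rangle_{\varepsilon}}$, the so-called edge parameter $\varepsilon > 0$ and the curvature regularizer $\mathcal{R}(y) = \frac{1}{2} \int_{\Omega} \|\Delta (y(x) - x)\|^{2} \, dx$. The similarity of the segmentation masks is measured using the sum of squared differences of the one-hot representations of the segmentations

$$\mathcal{B}(b_{\mathcal{F}}, b_{\mathcal{M}}(y)) = \frac{1}{2} \int_{\Omega} \|b_{\mathcal{M}}(y(x)) - b_{\mathcal{F}}(x)\|^{2} \, dx.$$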
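The three terms of the loss (edge-based image distance, curvature regularizer and one-hot segmentation distance) can be illustrated with a small NumPy sketch. This is not the authors' PyTorch implementation: the function names, the pixel-wise mean used in place of the integral, the finite-difference stencils, and the default values of `eps`, `alpha` and `beta` are all assumptions made for illustration.

```python
import numpy as np


def ngf_distance(fixed, warped, eps=0.1):
    """Discrete normalized-gradient-fields distance between two 2D images.

    Gradients use central differences with unit pixel spacing (assumption);
    eps is the edge parameter (placeholder value, not from the paper).
    """
    f_r, f_c = np.gradient(fixed.astype(np.float64))
    m_r, m_c = np.gradient(warped.astype(np.float64))
    inner = m_r * f_r + m_c * f_c + eps ** 2       # <grad M, grad F>_eps
    norm_f = f_r ** 2 + f_c ** 2 + eps ** 2        # ||grad F||_eps^2
    norm_m = m_r ** 2 + m_c ** 2 + eps ** 2        # ||grad M||_eps^2
    return 0.5 * np.mean(1.0 - inner ** 2 / (norm_m * norm_f))


def curvature_regularizer(disp):
    """Curvature energy of a displacement field of shape (2, H, W).

    Uses a 5-point Laplacian on interior pixels only (assumption).
    """
    u = disp.astype(np.float64)
    lap = (u[:, :-2, 1:-1] + u[:, 2:, 1:-1]
           + u[:, 1:-1, :-2] + u[:, 1:-1, 2:]
           - 4.0 * u[:, 1:-1, 1:-1])
    return 0.5 * np.mean(np.sum(lap ** 2, axis=0))


def label_ssd(fixed_labels, warped_labels, num_classes):
    """Sum of squared differences between one-hot encoded integer label maps."""
    eye = np.eye(num_classes)
    diff = eye[warped_labels] - eye[fixed_labels]   # shape (H, W, C)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))


def total_loss(fixed, warped, disp, fixed_labels, warped_labels,
               num_classes, alpha=1.0, beta=1.0, eps=0.1):
    """Weighted sum of the three terms; alpha and beta are placeholder weights."""
    return (ngf_distance(fixed, warped, eps)
            + alpha * curvature_regularizer(disp)
            + beta * label_ssd(fixed_labels, warped_labels, num_classes))
```

With this discretization, the image distance is zero for identical images and never exceeds 0.5, and the segmentation term equals the fraction of pixels whose labels disagree. In a training setting, the same expressions would have to be written with differentiable tensor operations so that gradients can flow back to the network parameters.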
Architecture and Training: Our architecture is illustrated in Fig. 2. It follows the structure of a U-Net [7], taking a pair of fixed and moving images as input. However, we start with two separate, yet shared processing streams for the moving and fixed image. The CNN generates a grid of control points for a B-spline transformer, whose output is a full displacement field that warps the moving image to the fixed image. During training, the network outputs three deformation fields at different resolutions, and we compute the overall loss as a weighted sum over these resolution levels. This design decision is inspired by the multi-level strategy of classic image registration. During inference, only the deformation field at the highest resolution is used.
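The multi-level weighting described above can be sketched as follows. The helper names, the 2x2 average pooling used to build the coarser levels, and the level weights are hypothetical; the source does not state how the images are downsampled or which weights are used.

```python
import numpy as np


def downsample2(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks.

    Assumes both image dimensions are even.
    """
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def image_pyramid(img, levels=3):
    """Return the image at `levels` resolutions, ordered finest to coarsest."""
    pyramid = [img.astype(np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid


def multilevel_loss(level_losses, level_weights):
    """Weighted sum of per-level loss values (the weights are placeholders)."""
    if len(level_losses) != len(level_weights):
        raise ValueError("need exactly one weight per resolution level")
    return float(sum(w * l for w, l in zip(level_weights, level_losses)))
```

For a 112x112 input, `image_pyramid` yields levels of size 112, 56 and 28. The loss would be evaluated once per level, using the deformation field predicted at that resolution, before the values are combined by `multilevel_loss`.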
Experiments: We perform our experiments on the ACDC dataset [8]. It contains cardiac multi-slice 2D cine-MR images of 100 patients captured at the end-diastole (ED) and end-systole (ES) time points, amounting to 951 2D images per cardiac phase. The dataset includes annotations of the left and right ventricle cavities and the myocardium by a clinical expert. We use only slices that contain all labels, i.e. 680 2D image pairs. All images are cropped to the region of interest with a size of 112x112 pixels. Image intensities are normalized to a range of . For data augmentation, we slightly deform the images and segmentations to increase the number of image pairs by a factor of 8.
Training is performed as a $k$-fold cross-validation that divides our dataset patient-wise. Our method is implemented in PyTorch. The network is trained for 40 epochs on an NVIDIA GTX 1080 in approximately 0.5 hours using an Adam optimizer with a learning rate of and a batch size of 30. We empirically choose the regularization parameter , the boundary-loss weight and edge parameter in the loss function. To evaluate our registration method, we use the computed deformation field to propagate the segmentation of the moving image and measure the volume overlap using the Dice coefficient. We compare our method with a classic multi-level image registration model similar to [6], which iteratively minimizes the loss function without the use of segmentation data.
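The evaluation step, propagating the moving segmentation with the computed deformation and measuring overlap with the Dice coefficient, can be sketched as below. Nearest-neighbour interpolation, clamping at the image border, and the (row, column) displacement convention in pixel units are assumptions of this sketch, as are the function names.

```python
import numpy as np


def warp_labels(moving_labels, disp):
    """Propagate a 2D label map with the deformation y(x) = x + disp(x).

    disp has shape (2, H, W) holding (row, column) displacements in pixels.
    Each output pixel x takes the moving label at the rounded position y(x).
    """
    h, w = moving_labels.shape
    rows, cols = np.mgrid[0:h, 0:w]
    src_r = np.clip(np.rint(rows + disp[0]).astype(int), 0, h - 1)
    src_c = np.clip(np.rint(cols + disp[1]).astype(int), 0, w - 1)
    return moving_labels[src_r, src_c]


def dice(labels_a, labels_b, label):
    """Dice coefficient of one anatomical label between two label maps."""
    a = labels_a == label
    b = labels_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # label absent in both maps: treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```

A per-label score would be obtained by calling `dice(warp_labels(moving_labels, disp), fixed_labels, label)` for each annotated structure and averaging over image pairs.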
As shown in Fig. 3, our proposed method outperforms the classic multi-level image registration approach in terms of the average Dice coefficient. Our method achieves an average improvement from to across all three labels in terms of Dice accuracy while drastically reducing the computation time from 3.583s to 0.006s per image pair. Not only is the average Dice score of our approach higher for every anatomical label, but the variation is also reduced (see Fig. 3). Fig. 4 shows two example image pairs and the registration results, demonstrating the ability of our method to compensate for large local deformations.
In comparison to the method of Krebs et al. [4], which uses the same dataset and reports comparable Dice coefficients for unregistered images, our method yields an improvement from to . As illustrated in Table 1, our choice of weighting parameters within the loss function (1) leads to a compromise between maximizing the Dice score and keeping the percentage of foldings low.
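The percentage of foldings can be estimated from the Jacobian determinant of the deformation. The sketch below assumes that a folding is counted wherever the determinant of the Jacobian of y(x) = x + u(x) is non-positive, and it uses finite differences with unit pixel spacing; the source does not state its exact definition, and the function name is hypothetical.

```python
import numpy as np


def folding_percentage(disp):
    """Percentage of pixels where the deformation y(x) = x + disp(x) folds.

    disp has shape (2, H, W) holding (row, column) displacements in pixels.
    A pixel counts as folded when det(I + grad disp) <= 0 (assumption).
    """
    du0_dr, du0_dc = np.gradient(disp[0].astype(np.float64))
    du1_dr, du1_dc = np.gradient(disp[1].astype(np.float64))
    det = (1.0 + du0_dr) * (1.0 + du1_dc) - du0_dc * du1_dr
    return 100.0 * np.mean(det <= 0.0)
```

The identity deformation (zero displacement) has determinant 1 everywhere and therefore yields 0%, whereas a reflection of one axis yields 100%.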
We have presented a new weakly-supervised deep-learning-based method for image registration that replaces iterative optimization steps with deep CNN layers. Our approach advances the state of the art in CNN-based deformable registration by combining the complementary strengths of global semantic information (supervised learning with segmentation labels) and local distance metrics borrowed from classical medical image registration that support the alignment of surrounding structures. We have evaluated our technique on cardiac multi-slice 2D cine-MR images of 100 patients and demonstrated substantial improvements over classic image registration methods in terms of both Dice coefficient and execution time. We also demonstrated the importance of integrating both unsupervised (distance measure) and supervised (boundary term) learning objectives into a unified framework and achieved state-of-the-art Dice scores of 86.5%.
[1] Rohé MM, Datar M, Heimann T, Sermesant M, Pennec X. SVF-Net: Learning Deformable Image Registration Using Shape Matching. In: MICCAI. Springer; 2017. p. 266–274.
[2] de Vos BD, Berendsen FF, Viergever MA, Staring M, Išgum I. End-to-end unsupervised deformable image registration with a convolutional neural network. In: DLMIA and ML-CDS. Springer; 2017. p. 204–212.
[3] Hu Y, Modat M, Gibson E, Li W, Ghavami N, Bonmati E, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis. 2018;49:1–13.
[4] Krebs J, Mansi T, Mailhé B, Ayache N, Delingette H. Unsupervised Probabilistic Deformation Modeling for Robust Diffeomorphic Registration. In: DLMIA and ML-CDS. Springer; 2018. p. 101–109.
[5] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in Neural Information Processing Systems; 2015. p. 2017–2025.
[6] Rühaak J, Heldmann S, Kipshagen T, Fischer B. Highly accurate fast lung CT registration. In: Medical Imaging 2013: Image Processing. vol. 8669. International Society for Optics and Photonics; 2013. p. 86690Y.
[7] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. Springer; 2015. p. 234–241.
[8] Jodoin PM, Lalande A, Bernard O. Automated Cardiac Diagnosis Challenge (ACDC); 2017. Accessed: 2018-09-24. Available from: https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html.