Enhancing Label-Driven Deep Deformable Image Registration with Local Distance Metrics for State-of-the-Art Cardiac Motion Tracking

by Alessa Hering, et al.

While deep learning has achieved significant advances in accuracy for medical image segmentation, its benefits for deformable image registration have so far remained limited to reduced computation times. Previous work has focused either on replacing the iterative optimization of distance and smoothness terms with CNN layers or on using supervised approaches driven by labels. Our method is the first to combine the complementary strengths of global semantic information (represented by segmentation labels) and local distance metrics that help align surrounding structures. We demonstrate significantly higher Dice scores (86.5%) for deformable cardiac image registration compared to classic registration (79.0%) as well as label-driven deep learning frameworks (83.4%).




1 Introduction

Image registration aims to align two or more images to achieve point-wise spatial correspondence. This is a fundamental step for many medical image analysis tasks and has been a very active field of research for decades. In deformable image registration approaches, non-rigid, non-linear deformation fields are established between a pair of images, such as cardiac cine-MR images. Typically, image registration is phrased as an unsupervised optimization problem w.r.t. a spatial mapping that minimizes a suitable cost function by applying iterative optimization schemes. Due to substantially increased computational power and availability of image data over the last years, learning-based image registration methods have emerged as an alternative to energy-optimization approaches.

Prior work on CNN-Based Deformable Registration: Compared to other fields, relatively little research has yet been undertaken in deep-learning-based image registration. Only recently have the first deep-learning-based image registration methods been proposed [1, 2, 3], which mostly aim to learn a function in the form of a CNN that predicts a spatial deformation warping a moving image to a fixed image. We categorize these approaches into supervised [1], unsupervised [2, 4] and weakly-supervised [3] techniques based on how they train the network.

The supervised methods use ground-truth deformation fields for training. These deformation fields can either be randomly generated or produced by classic image registration methods. The main limitation of these approaches is that their performance is bounded by the performance of existing algorithms or simulations. In contrast, the unsupervised methods do not require any ground-truth data. The learning process is driven by image similarity measures or, more generally, by evaluating the cost function of classic variational image registration methods. An important milestone for the development of these methods was the introduction of spatial transformer networks [5] to differentiably warp the moving image during training. Weakly-supervised methods also do not rely on ground-truth deformation fields, but training is still supervised with prior information. The labels of the moving image are transformed by the deformation field and compared with the fixed labels within the loss function. The anatomical labels are only required during training.
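The label-warping step used by weakly-supervised methods can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited method: the function name `warp_bilinear`, the displacement layout `(2, H, W)` and the clamping at the image border are assumptions made for illustration. Applied channel-wise to a one-hot label map, such a warp yields the transformed moving labels that are compared with the fixed labels.

```python
import numpy as np

def warp_bilinear(image, displacement):
    """Warp a 2D image so that output(x) = image(x + u(x)).

    displacement has shape (2, H, W): row offsets first, column offsets second
    (an assumed layout). Sampling positions are clamped to the image border.
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = np.clip(rows + displacement[0], 0, h - 1)
    c = np.clip(cols + displacement[1], 0, w - 1)
    # Integer corners surrounding each sampling position.
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    # Fractional parts act as bilinear interpolation weights.
    fr = r - r0
    fc = c - c0
    top = (1 - fc) * image[r0, c0] + fc * image[r0, c1]
    bottom = (1 - fc) * image[r1, c0] + fc * image[r1, c1]
    return (1 - fr) * top + fr * bottom

# Usage: shift a small test image by one pixel along the column axis.
img = np.arange(12.0).reshape(3, 4)
shift = np.zeros((2, 3, 4))
shift[1] = 1.0
shifted = warp_bilinear(img, shift)
```

Because the interpolation weights depend smoothly on the displacement, the same operation written in an automatic-differentiation framework lets gradients of a label loss flow back to the network that predicts the displacement.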

Contributions: We propose a new deep-learning-based image registration method that learns a registration function in the form of a CNN to predict a spatial deformation that warps a moving image to a fixed image. In contrast to previous work, we propose the first weakly-supervised approach that successfully combines the strengths of prior information (segmentation labels) with an energy-based distance metric within a comprehensive multi-level deep-learning-based registration approach.

2 Materials and Methods

Figure 1: Illustration of the training process. For convenience, only one output deformation field is shown instead of three. At application time after training, only the flows represented by red dotted lines and the red parts are required.

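Let $F, M: \mathbb{R}^2 \to \mathbb{R}$ denote the fixed image and moving image, respectively, and let $\Omega \subset \mathbb{R}^2$ be a domain modeling the field of view of $F$. We aim to compute a deformation $y: \Omega \to \mathbb{R}^2$ that aligns the fixed image $F$ and the moving image $M$ on the field of view $\Omega$ such that $F(x)$ and $M(y(x))$ are similar for $x \in \Omega$. Inspired by recent unsupervised image registration methods (e.g. [2, 1]), we do not employ iterative optimization like in classic registration, but rather use a CNN that takes images $F$ and $M$ as input and yields the deformation $y$ as output (cf. Fig. 1). Thus, in the context of CNNs, we can consider $y$ as a function of input images $F$, $M$ and trainable CNN model parameters $\theta$ to be learned, i.e. $y(x) \equiv y(\theta; F, M; x)$. During the training, the CNN parameters are learned so that the deformation field minimizes the loss function

$$\mathcal{L}(F, M, b_F, b_M, y) = \mathcal{D}(F, M(y)) + \alpha \mathcal{R}(y) + \beta \mathcal{B}(b_F, b_M(y)) \tag{1}$$

with so-called distance measure $\mathcal{D}$ that quantifies the similarity of fixed image $F$ and deformed moving image $M(y)$, regularizer $\mathcal{R}$ that forces smoothness of the deformation and a second distance measure $\mathcal{B}$ that quantifies the similarity of fixed segmentation $b_F$ and warped moving segmentation $b_M(y)$. The parameters $\alpha, \beta$ are weighting factors. For convenience, we set . Note that the segmentations are only used to evaluate the loss function and not used as network input and are therefore only required during training. We use the edge-based normalized gradient fields distance measure [6]

$$\mathcal{D}(F, M(y)) = \frac{1}{2} \int_\Omega 1 - \frac{\langle \nabla M(y(x)), \nabla F(x) \rangle_\varepsilon^2}{\|\nabla M(y(x))\|_\varepsilon^2 \, \|\nabla F(x)\|_\varepsilon^2} \, dx$$

with $\langle f, g \rangle_\varepsilon := \sum_j f_j g_j + \varepsilon^2$, $\|f\|_\varepsilon := \sqrt{\langle f, f \rangle_\varepsilon}$, so-called edge parameter $\varepsilon > 0$ and curvature regularizer [6]. The similarity of the segmentation masks is measured using a sum of squared differences of the one-hot-representation of the segmentations

$$\mathcal{B}(b_F, b_M(y)) = \frac{1}{2} \int_\Omega \| b_M(y(x)) - b_F(x) \|^2 \, dx.$$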
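The two data terms described above can be sketched in discrete form as follows. This is a minimal NumPy illustration, not the authors' PyTorch code: the function names, the use of `np.gradient` for image derivatives, the pixel-wise mean in place of the integral, the way the edge parameter `eps` enters (added as `eps ** 2`) and the weight names `alpha` and `beta` are all assumptions.

```python
import numpy as np

def ngf_distance(fixed, warped, eps=0.1):
    """Discrete normalized-gradient-fields distance between two 2D images."""
    # np.gradient returns one derivative array per image axis.
    gf = np.stack(np.gradient(fixed.astype(float)))
    gw = np.stack(np.gradient(warped.astype(float)))
    # Regularized inner product and squared norms; eps damps noise in flat regions.
    inner = (gf * gw).sum(axis=0) + eps ** 2
    norm_f2 = (gf * gf).sum(axis=0) + eps ** 2
    norm_w2 = (gw * gw).sum(axis=0) + eps ** 2
    # One minus the squared cosine of the angle between gradients, averaged.
    return 0.5 * float(np.mean(1.0 - inner ** 2 / (norm_f2 * norm_w2)))

def one_hot(labels, num_classes):
    """Turn an integer label map (H, W) into a one-hot array (H, W, C)."""
    return np.eye(num_classes)[labels]

def label_ssd(fixed_labels, warped_labels, num_classes):
    """Sum of squared differences of one-hot label maps, normalized per pixel."""
    diff = one_hot(fixed_labels, num_classes) - one_hot(warped_labels, num_classes)
    return 0.5 * float(np.sum(diff ** 2)) / fixed_labels.size

def total_loss(distance, regularizer, boundary, alpha, beta):
    """Combine the three terms with hypothetical weights alpha and beta."""
    return distance + alpha * regularizer + beta * boundary

# Usage: a horizontal and a vertical intensity ramp have orthogonal gradients.
ramp = np.tile(np.arange(8.0), (8, 1))
d_same = ngf_distance(ramp, ramp)
d_orth = ngf_distance(ramp, ramp.T)
```

With this normalization, `label_ssd` equals the fraction of pixels whose labels disagree, because each mismatched pixel differs in exactly two one-hot entries. The NGF term depends only on gradient directions, which is why it can align structures that carry no segmentation label.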

Figure 2: Proposed UNet-based architecture of our CNN. Each blue box represents a multi-channel feature map whose width corresponds to the number of channels, which is denoted above or below the box.
Figure 3: Comparison of Dice overlaps for all test images and each anatomical label (average of all labels, left ventricle cavity (vc), right vc and myocardium). For each label, the distributions of Dice coefficients before registration, after classic registration and after our proposed registration are shown.

Architecture and Training: Our architecture is illustrated in Fig. 2. Our network architecture basically follows the structure of a UNet [7], taking a pair of fixed and moving images as input. However, we start with two separate, yet shared processing streams for the moving and fixed image. The CNN generates a grid of control points for a B-spline transformer, whose output is a full displacement field to warp a moving image to a fixed image. During training, the outputs of the network are three deformation fields of different resolutions. We compute the overall loss as a weighted sum of the losses of the network outputs on these different resolution levels. This design decision is inspired by the multi-level strategy of classic image registration. During inference, only the deformation field on the highest resolution is used.
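The weighted multi-level loss can be sketched as below. The text does not state how the fixed image is brought to the coarser resolutions or which level weights are used, so the average pooling, the example weights and the mean-squared-error stand-in for the per-level loss are assumptions chosen only to keep the example short.

```python
import numpy as np

def avg_pool2d(image, factor):
    """Downsample a 2D image by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def mse(a, b):
    """Placeholder per-level loss (stand-in for the full registration loss)."""
    return float(np.mean((a - b) ** 2))

def multilevel_loss(fixed, warped_per_level, level_weights, loss_fn):
    """Weighted sum of per-level losses; levels are ordered coarse to fine."""
    total = 0.0
    for warped, weight in zip(warped_per_level, level_weights):
        # Bring the fixed image to the resolution of this network output.
        factor = fixed.shape[0] // warped.shape[0]
        total += weight * loss_fn(avg_pool2d(fixed, factor), warped)
    return total

# Usage with three levels (quarter, half and full resolution).
fixed = np.arange(64.0).reshape(8, 8)
levels = [avg_pool2d(fixed, 4), avg_pool2d(fixed, 2), fixed + 1.0]
loss = multilevel_loss(fixed, levels, [0.25, 0.5, 1.0], mse)
```

Supervising the coarse outputs gives the network a training signal for large displacements before fine detail is resolved, mirroring the coarse-to-fine schedule of classic multi-level registration.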

Experiments: We perform our experiments on the ACDC dataset [8]. It contains cardiac multi-slice 2D cine-MR images of 100 patients captured at the end-diastole (ED) and end-systole (ES) time points, amounting to 951 2D images per cardiac phase. The dataset includes annotations of the left and right ventricle cavities and the myocardium by a clinical expert. We only use slices that contain all labels, i.e. 680 2D image pairs. All images are cropped to the region of interest with a size of 112x112 pixels. Image intensities are normalized to a range of . For data augmentation, we slightly deform the images and segmentations to increase the number of image pairs by a factor of 8.

Training is performed as a -fold cross-validation with

which divide our dataset patient-wise. Our method is implemented in PyTorch. The network is trained for 40 epochs on an NVIDIA GTX 1080 in approximately 0.5 hours using the Adam optimizer with a learning rate of

and a batch size of 30. We empirically choose the regularization parameter , the boundary-loss weight and edge parameter in the loss function. To evaluate our registration method we use the computed deformation field to propagate the segmentation of the moving image and measure the volume overlap using the Dice coefficient. We compare our method with a classic multi-level image registration model similar to [6] which iteratively minimizes the loss function without the use of segmentation data.
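The evaluation protocol described above (patient-wise folds, Dice overlap of the propagated segmentation) can be sketched as follows. The helper names, the round-robin fold assignment, the number of folds `k` and the convention of returning 1.0 when a label is absent from both masks are assumptions, not details taken from the paper.

```python
import numpy as np

def patient_wise_folds(patient_ids, k):
    """Split sample indices into k folds so that no patient spans two folds."""
    unique = sorted(set(patient_ids))
    fold_patients = [set(unique[i::k]) for i in range(k)]
    return [
        [idx for idx, pid in enumerate(patient_ids) if pid in patients]
        for patients in fold_patients
    ]

def dice(seg_a, seg_b, label):
    """Dice coefficient of one label between two integer label maps."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: label absent from both masks
    return 2.0 * np.logical_and(a, b).sum() / denom

def mean_dice(seg_a, seg_b, labels):
    """Average Dice over the given foreground labels."""
    return float(np.mean([dice(seg_a, seg_b, lab) for lab in labels]))

# Usage: six slices from four hypothetical patients, split into two folds.
folds = patient_wise_folds(["p1", "p1", "p2", "p3", "p3", "p4"], k=2)
warped_seg = np.array([[1, 1], [0, 2]])
fixed_seg = np.array([[1, 0], [0, 2]])
score = mean_dice(warped_seg, fixed_seg, labels=(1, 2))
```

Splitting by patient rather than by slice keeps neighbouring slices of the same heart out of both the training and the test set of a fold, which would otherwise inflate the reported overlap.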

3 Results

opt. Dice Folding Dice Folding Dice Folding 0.1% 78.2% 75.4% 0.0% 86.0% 0.6% 83.1% 0.7% 83.5% 0.2% 87.2% 0.1% 86.5% 0.0% 86.5% 0.0% 86.5% 0.0% 85.9% 0.0% 87.0% 0.2% 85.3% 0.1% 85.5% 0.1% 87.2% 0.4% 81.6% 0.4% 83.4% 0.1% 0.0% 40.6% 81.9% 0.4% Colormap 87% 86% 85% 84% 83% 82% 81% 80%
Table 1: The quantitative effect of variations of the weighting within the loss function is shown when varying one parameter and fixing the others to their empirically determined optimal values. Besides the resulting Dice coefficient the percentage of pixels in which foldings () occur is depicted.
Figure 4: Example input images (moving image in the first column, fixed image in the second column), difference image of the input images (third column), of the fixed image and the warped image after classic registration (fourth column) and of the fixed image and the warped image after our proposed registration (fifth column). White and black indicate a large difference, while grey indicates similar images.

As shown in Fig. 3, our proposed method outperforms the classic multi-level image registration approach in terms of the average Dice coefficient. Our method achieves an average improvement from 79.0% to 86.5% across all three labels in terms of Dice accuracy while reducing the computation time drastically from 3.583s to 0.006s per image pair. Not only is the average Dice score of our approach higher for every anatomical label, but the variation is also reduced (see Fig. 3). Fig. 4 shows two example image pairs and the registration results, demonstrating the ability of our method to compensate for large local deformations.
In comparison to the method of Krebs et al. [4], which uses the same dataset and presents comparable Dice coefficients of unregistered images, our method yields an improvement from to . As illustrated in Table 1, our choice of weighting parameters within the loss function (1) leads to a compromise between maximizing the Dice score and keeping the percentage of foldings low.
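The folding percentage reported alongside the Dice scores can be approximated as the share of pixels at which the Jacobian determinant of the deformation is not positive. The sketch below is one common way to compute it. The finite-difference scheme, the `(2, H, W)` layout of the deformation and the threshold `<= 0` are assumptions, since the exact criterion used for Table 1 is not recoverable from the text.

```python
import numpy as np

def folding_percentage(deformation):
    """Percentage of pixels with a non-positive Jacobian determinant.

    deformation has shape (2, H, W) and stores, for every pixel, the row and
    column coordinate it is mapped to (an assumed layout).
    """
    # Partial derivatives of both coordinate functions along rows and columns.
    d0_dr, d0_dc = np.gradient(deformation[0])
    d1_dr, d1_dc = np.gradient(deformation[1])
    det = d0_dr * d1_dc - d0_dc * d1_dr
    return 100.0 * float(np.mean(det <= 0))

# Usage: the identity map has no foldings; mirroring the columns folds everywhere.
rows, cols = np.meshgrid(np.arange(5.0), np.arange(6.0), indexing="ij")
identity = np.stack([rows, cols])
mirrored = np.stack([rows, -cols])
no_folds = folding_percentage(identity)
all_folds = folding_percentage(mirrored)
```

A lower regularization weight typically lets the network reach a higher Dice score at the cost of more such pixels, which is the trade-off the parameter study in Table 1 examines.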

4 Discussion

We have presented a new weakly-supervised deep-learning-based method for image registration that replaces iterative optimization steps with deep CNN layers. Our approach advances the state-of-the-art in CNN-based deformable registration by combining the complementary strengths of global semantic information (supervised learning with segmentation labels) and local distance metrics borrowed from classical medical image registration that support the alignment of surrounding structures. We have evaluated our technique on dozens of cardiac multi-slice 2D cine-MR images and demonstrate substantial improvements compared to classic image registration methods both in terms of Dice coefficient and execution time. We also demonstrated the importance of integrating both unsupervised (distance measure) and supervised (boundary term) learning objectives into a unified framework and achieve state-of-the-art Dice scores of 86.5%.


  • [1] Rohé MM, Datar M, Heimann T, Sermesant M, Pennec X. SVF-Net: Learning Deformable Image Registration Using Shape Matching. In: MICCAI. Springer; 2017. p. 266–274.
  • [2] de Vos BD, Berendsen FF, Viergever MA, Staring M, Išgum I. End-to-end unsupervised deformable image registration with a convolutional neural network. In: DLMIA and ML-CDS. Springer; 2017. p. 204–212.
  • [3] Hu Y, Modat M, Gibson E, Li W, Ghavami N, Bonmati E, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis. 2018;49:1–13.
  • [4] Krebs J, Mansi T, Mailhé B, Ayache N, Delingette H. Unsupervised Probabilistic Deformation Modeling for Robust Diffeomorphic Registration. In: DLMIA and ML-CDS. Springer; 2018. p. 101–109.
  • [5] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems; 2015. p. 2017–2025.
  • [6] Rühaak J, Heldmann S, Kipshagen T, Fischer B. Highly accurate fast lung CT registration. In: Medical Imaging 2013: Image Processing. vol. 8669. International Society for Optics and Photonics; 2013. p. 86690Y.
  • [7] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. Springer; 2015. p. 234–241.
  • [8] Jodoin PM, Lalande A, Bernard O. Automated Cardiac Diagnosis Challenge (ACDC); 2017. Accessed: 2018-09-24. Available from: https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html.