Tackling the Problem of Large Deformations in Deep Learning Based Medical Image Registration Using Displacement Embeddings

05/27/2020 ∙ by Lasse Hansen, et al. ∙ Universität Lübeck

Although deep learning based medical image registration is starting to show promising advances, it often still falls behind conventional frameworks in terms of registration accuracy. This is especially true for applications with large deformations, such as inter-patient abdominal MRI registration or inhale-to-exhale CT lung registration. Most current works use U-Net-like architectures to predict dense displacement fields from the input images in different supervised and unsupervised settings. We believe that the U-Net architecture itself limits, to some degree, the ability to predict large deformations (even when using multilevel strategies) and therefore propose a novel approach in which the input images are mapped into a displacement space and final registrations are reconstructed from this embedding. Experiments on inhale-to-exhale CT lung registration demonstrate the ability of our architecture to predict large deformations in a single forward pass through our network (leading to errors below 2 mm).




1 Introduction

Recently, learning based medical image registration has shown great advances on different tasks but, in contrast to other medical image analysis applications, e.g. segmentation, remains a very challenging problem for deep networks. The majority of recently published methods use encoder-decoder architectures (like the U-Net) to predict dense displacement fields directly from the input images, either in an unsupervised setting driven by similarity metrics (minimizing an objective function similar to conventional iterative registration frameworks) [balakrishnan2019voxelmorph] or using annotated label images to guide the training process [hu2018weakly]. To deal with large deformations in medical images, primarily multilevel strategies and iteratively trained networks have been proposed [eppenhof2019progressively, de2019deep, hering2019mlvirnet]. Still, while deep networks offer very fast inference times and have the potential to further learn from expert annotations, with errors above 2.2 mm they cannot yet compete with the accuracies of conventional registration frameworks on challenging thoracic CT benchmarks (below 1 mm [ruhaak2017estimation]).

2 Methods

In contrast to the aforementioned encoder-decoder architectures, in this work we propose to explicitly model the relation between fixed and moving image features with displacement maps. Figure 1 outlines our approach. This concept is similar to the low-resolution correlation layer for 2D images in FlowNet [dosovitskiy2015flownet] and to the dense volumes in PDD-Net [heinrich2019closing], but we compute the dissimilarities on high-resolution feature maps (stride of 2) and therefore employ different strategies to limit the computational burden (this is especially important for learning based approaches, as gradients need to be passed through the network in the backward pass): 1. we extract fixed features only at sparse keypoints based on the Foerstner interest operator, thus reducing the search space, and 2. we propose to (non-linearly) map the displacement map to a low-dimensional embedding space, substantially compressing the displacement features. Further processing, e.g. estimation of final displacements and regularization, is applied directly on the displacement embeddings.
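The two steps above can be illustrated with a minimal NumPy sketch (function names, patch handling and dimensions are illustrative assumptions, not the paper's actual implementation; the learned non-linear embedding is replaced here by a linear PCA-style projection): SSD dissimilarities between each fixed keypoint feature and all displaced moving features form the displacement map, which is then compressed to a low-dimensional embedding.

```python
import numpy as np

def displacement_map(feat_fix, feat_mov, keypoints, search=2):
    """Build a sparse feature displacement map: for every fixed-image
    keypoint, store the SSD dissimilarity to the moving-image feature
    at every displacement in a (2*search+1)^3 cube.
    feat_fix, feat_mov: (C, D, H, W) dense feature volumes.
    keypoints: (N, 3) integer voxel coordinates (away from the border).
    Returns: (N, (2*search+1)**3) dissimilarity map."""
    offsets = np.stack(np.meshgrid(*(np.arange(-search, search + 1),) * 3,
                                   indexing="ij"), -1).reshape(-1, 3)
    dm = np.empty((len(keypoints), len(offsets)))
    for i, (z, y, x) in enumerate(keypoints):
        f = feat_fix[:, z, y, x]                     # fixed feature vector
        for j, (dz, dy, dx) in enumerate(offsets):
            m = feat_mov[:, z + dz, y + dy, x + dx]  # displaced moving feature
            dm[i, j] = np.sum((f - m) ** 2)          # SSD dissimilarity
    return dm

def pca_embed(dm, k=256):
    """Compress the displacement map to a k-dimensional linear embedding
    (stand-in for the learned non-linear mapping in the paper)."""
    dm0 = dm - dm.mean(0)
    _, _, vt = np.linalg.svd(dm0, full_matrices=False)
    return dm0 @ vt[:k].T, vt[:k]
```

In practice the fixed feature would be a whole patch rather than a single voxel vector, and the loops would be replaced by vectorized (GPU) unfolding; the sketch only shows the data flow from feature volumes to displacement embedding.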

Figure 1: Outline of our proposed architecture for deep learning based medical image registration under large deformations. Dense features are extracted from both the fixed and the moving image using a convolutional neural network. In the fixed image, features are sparsely sampled at given descriptive keypoints and compared with displaced feature patches in the moving image, building up a feature displacement map. To make it feasible to learn final predictions and regularized displacements, we propose to map to a (learned) embedding for further processing.
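The keypoint selection mentioned above can be sketched as follows: a simplified 3D Foerstner-style interest operator that smooths the structure tensor and scores voxels by det(S)/trace(S), which is high where the local gradient spans all directions. Parameters and the lack of non-maximum suppression are simplifying assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foerstner_keypoints(vol, sigma=1.5, num=100):
    """Return the (num, 3) voxel coordinates with the highest
    Foerstner-style distinctiveness score det(S) / trace(S),
    where S is the Gaussian-smoothed structure tensor."""
    g = np.gradient(vol.astype(np.float64))
    # structure tensor entries, Gaussian-smoothed
    S = np.empty((3, 3) + vol.shape)
    for i in range(3):
        for j in range(3):
            S[i, j] = gaussian_filter(g[i] * g[j], sigma)
    det = (S[0, 0] * (S[1, 1] * S[2, 2] - S[1, 2] * S[2, 1])
           - S[0, 1] * (S[1, 0] * S[2, 2] - S[1, 2] * S[2, 0])
           + S[0, 2] * (S[1, 0] * S[2, 1] - S[1, 1] * S[2, 0]))
    trace = S[0, 0] + S[1, 1] + S[2, 2]
    score = det / (trace + 1e-12)
    idx = np.argsort(score, axis=None)[::-1][:num]    # top-scoring voxels
    return np.stack(np.unravel_index(idx, vol.shape), -1)
```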

3 Experiments and Results

To validate our approach we choose the challenging task of inhale-to-exhale lung registration on the DIR-Lab 4D-CT and DIR-Lab COPD data sets [castillo2013reference], as they contain complex and large deformations. We use a fixed feature extractor (a lightweight U-Net with 3 encoder and 2 decoder blocks) that is pretrained to predict MIND-like descriptors [heinrich2012mind] from the input images. Feature patches from the fixed image are sampled at distinctive keypoints and compared with displaced feature patches at the corresponding locations in the moving image. For this proof of concept we simplify our setting and use a PCA embedding (instead of a learned mapping) with 256 and 512 dimensions, substantially compressing the displacement space. As regularization method, we employ a simple diffusion over all keypoints and displacements using the graph Laplacian. Table 1 shows the results of our method in comparison to other learning based registration frameworks: four approaches based on dense encoder-decoder (multilevel) architectures [eppenhof2019progressively, de2019deep, balakrishnan2019voxelmorph, hering2019mlvirnet] and one that uses keypoints with graph CNNs and a point cloud matching algorithm [hansen2019learning].
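The graph-Laplacian diffusion used for regularization can be sketched as iterative neighbourhood averaging over a kNN graph of the keypoints. This is a crude stand-in with illustrative parameters (neighbourhood size, step weight, iteration count), not the paper's exact solver:

```python
import numpy as np

def laplacian_diffusion(keypoints, disp, k=5, lam=1.0, iters=10):
    """Regularize keypoint displacements by diffusion on a kNN graph:
    each step blends every displacement toward the mean of its
    neighbours' displacements, smoothing outliers while a spatially
    constant field stays unchanged.
    keypoints: (N, 3) coordinates, disp: (N, 3) displacement vectors."""
    d2 = ((keypoints[:, None] - keypoints[None]) ** 2).sum(-1)  # pairwise sq. distances
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                     # k nearest neighbours
    out = disp.copy()
    for _ in range(iters):
        neigh_mean = out[nn].mean(1)                            # average over neighbours
        out = (out + lam * neigh_mean) / (1.0 + lam)            # diffusion step
    return out
```

A spatially constant displacement field is a fixed point of this update, while an isolated outlier displacement is pulled toward its neighbours, which is the qualitative behaviour desired of the regularizer.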

Method                                    # levels   DIR-Lab 4D-CT   DIR-Lab COPD
[eppenhof2019progressively]               1
DLIR [de2019deep]                         3
VoxelMorph* [balakrishnan2019voxelmorph]  2
mlVIRNET [hering2019mlvirnet]             3
[hansen2019learning]                      1
ours-256                                  1
ours-512                                  1
Table 1: Results for inhale-to-exhale CT scan pairs of the DIR-Lab 4D-CT and DIR-Lab COPD data sets [castillo2013reference], respectively. The mean (standard deviation) target registration error (TRE) in mm is computed on 300 expert-annotated landmark pairs per case. *VoxelMorph was trained on affine pre-aligned images using the publicly available code.


4 Conclusion

We presented a registration framework for large deformations in medical images that, in contrast to recent approaches, explicitly considers a large number of discrete feature displacements and maps them into an embedding space. It outperforms other deep learning based state-of-the-art methods on the DIR-Lab 4D-CT (errors below 2 mm) as well as on the DIR-Lab COPD data set (errors below 3.5 mm). As this work may be considered a proof of concept, we see great potential for improving our method by using a learned non-linear mapping to the embedding space as well as by extending the regularization to graph CNNs that can learn from the inherent structure of the keypoint graph.

We gratefully acknowledge the support of the NVIDIA Corporation with their GPU donations for this research.