Image registration – i.e., finding a dense correspondence map between images or volumes taken at different points in time or under different conditions – is still a crucial component of many clinical and research pipelines: compensating for patient movement and breathing in radiological follow-up and radiation therapy, monitoring progression of degenerative diseases, 3D reconstruction from slices in histopathology, and many others. It is made particularly challenging by the typically large, three-dimensional nature of the data, highly non-convex energies, and runtime requirements of clinical practice.
Towards reducing runtime, the authors of [1, 2] propose a highly accurate non-rigid registration model with applications in follow-up imaging in radiology and liver ultrasound tracking, and introduce a parallel algorithm for the CPU. They also include a preliminary GPU implementation for the 2D case provided by . In this work, we extend these ideas to a fast, matrix-free, parallel algorithm that solves the variational, regularized problem for full 3D image registration on the GPU. This allows us to achieve sub-second runtimes on the standard resolution of , and in the low seconds range for high-resolution volumes, at state-of-the-art accuracy.
2 Materials and methods
Regarding the model, we follow 
: We seek a three-dimensional deformation vector field, , discretized on a deformation grid with dimensions , which deforms a template image to be similar to a reference image , , both discretized on an image grid with dimensions .
To find , we numerically minimize an objective function , consisting of distance measure and smoothing term , weighted by :
Here denotes the grid conversion , which converts the deformation
from the deformation grid to the image grid, before it is used to interpolate the deformed template imageon the image grid.
For the distance measure, we use the well-known normalized gradient field (NGF), which is particularly suitable for multi-modal images . It focuses on intensity changes and compares the angles of the image gradients. To encourage smooth deformations, we employ the curvature-based regularization term introduced by , which penalizes the Laplacian of the deformation field components via .
We chose to implement our method on the GPU using the CUDA toolkit, which allows working close to the hardware and fine-tuning.
Performance on the GPU is tightly coupled to a high occupancy, defined as the number of running threads divided by the number of potentially running threads that the device can handle. Using a large number of registers per thread decreases occupancy , therefore, we keep the number of variables per thread low and split large functions (kernels) into smaller ones.
We generally used single-precision (bit) floating variables due to the faster computations and only half the number of required registers compared to double precision (bit) .
Generating the multi-level pyramid.
For the multi-level approach, reference and template images need to be downsampled to various resolutions. The CUDA framework provides CUDA streams, which enable concurrency between GPU computations and memory transfers between host and GPU . This allows to run the pyramid generation and data transfer for reference and template image in parallel.
Evaluating the objective function.
Evaluating the distance term and its gradient requires two grid conversions and a gradient computation:
convert the deformation to the image grid, denoted by : ,
compute the distance measure and its gradient , and
convert and to the deformation grid by applying .
In the following sections, we discuss the details of each step and its implications for the implementation with CUDA.
Distance measure and gradient computation.
We denote by and the gradients of the reference and deformed template image at the -th image grid point and discretize the NGF distance measure as a sum over grid points,
with voxel volume as product of the image grid spacings, the smoothed norm function , and the modality-dependent parameters to filter the gradient image for noise. Following , we can parallelize the computation of the distance measure function value directly over the terms in the sum.
Applying derivative-based numerical optimization methods such as L-BFGS (section 2.2) requires frequent evaluation of the gradient
. The chain rule yieldswith the reduction function .
Evaluating the gradient using the chain rule by computing the gradient parts and multiplying step-by-step is expensive in terms of (intermediate) memory required. We avoid this by relying on the matrix-free methods introduced by .
Following the approach proposed by , we separate the deformation grid resolution and the image resolution . This allows to save memory and speed up the registration by discretizing the deformation on a coarser grid while preserving all information in the input images.
For optimal performance, a (surprisingly) crucial step in computing the distance measure and its gradient is conversion between the two grids, i.e., computing matrix-vector products with and . Applying is directly parallelizable when using trilinear interpolation . However, applying with a coarser deformation grid produces possible write conflicts introduced when summing up values from multiple points on the higher-resolution image grid in order to obtain a value for a single point on the lower-resolution deformation grid.
To account for this issue, the authors of 
introduced a red-black scheme, where all odd slices are computed in parallel, followed by all even slices. However, the author of observed a poor utilization of GPU cores with this method. Therefore they computed every slice, row, and column in parallel, and used atomic operations to avoid write conflicts.
We introduce a different method, which is not based on the red-black-scheme and free of write conflicts: Each thread computes a deformation grid point independently by summing the corresponding image domain points,
Here, is the local weight and are the corresponding indices of for each dimension, which are determined beforehand, separately for each dimension. While there is a certain overhead in computing the weights this way, in our case it was found that the overall runtime is still faster due to the higher parallelism.
We investigated the accuracy and speed of our method in comparison to state-of-the-art alternatives from the DIR-Lab 4DCT benchmark [8, 9]. We also compared to an Open Multi-Processing–(OMP–)based implementation of the same model on the CPU proposed in , which is already one to two orders of magnitude faster than a matrix-based implementation using the MATLAB FAIR toolbox .
All experiments were performed using an NVIDIA GeForce GTX 1080Ti GPU and an Intel Core i7-6700K CPU.
3.1 Radboud follow-up CT dataset
In order to investigate the performance of our method on high-resolution 3D data, we measured the average runtime over 986 registrations on a dataset of follow-up thorax abdomen CT scans provided by the Radboud University Medical Center, Nijmegen, Netherlands. The images have resolutions in the range of . As full image resolution was slightly out of reach due to memory restrictions of the GPU, we evaluated our approach on half, quarter, eighth and sixteenth resolution per dimension.
For the highest resolution, average runtime was seconds, with an average speedup of compared to the CPU-based parallel OMP implementation (Table 1). On the lower resolutions, our method achieves sub-second runtimes at a speedup of about one order of magnitude. A majority of the runtime on the lower resolutions is spent on the multi-level creation, due to the large memory transfer and downsampling.
Mean runtimes and standard deviations, averaged over 986 thorax-abdomen registrations. Finest image resolutions were approximately, , and . Compared to the CPU-based OMP implementation, we achieve an average speedup of with average runtimes of less than seconds, which opens up new application scenarios for clinical use and interactive registration.
It is prudent to ask whether moving from double precision to single precision on the GPU introduces differences due to rounding. In fact, we observed that this can have an effect (Figure 1). However, it typically only occurs when there are no clear correspondences, such as in regions of the colon with different content, or when the examination table is visible in one of the two scans. In these areas, there is no strong objective function gradient in either direction during optimization, so that numerical differences have a larger impact. However, we argue that if such areas were to be registered accurately, a more elaborate model that accounts for the possible removal of structures would have to be employed in any case.
3.2 DIR-Lab 4DCT benchmark
For a comparison to the state of the art, we evaluated our method on the DIR-Lab 4DCT dataset [8, 9], consisting of ten CT scan pairs of the lung in fully-inhaled and fully-exhaled state. Resolutions are in the range of for the first five images and for the last five images. We set the deformation grid to one quarter of the image resolution.
Accuracy of the final registration was measured by the average landmark error (LME) over 300 expert-annotated landmarks for each dataset (Figure 2). Our OMP implementation scores only slightly behind the best-performing pTVreg method at an average LME of mm vs. mm and places second-best overall in terms of accuracy.
Our GPU implementation follows closely due to the single precision computations and achieves third place overall in terms of accuracy at an LME of mm. Moreover, it is about one order of magnitude faster than all other methods in the benchmark for which runtimes could be obtained. Compared to the only method with better accuracy (pTVreg), it is approximately times faster, at an average of seconds per full 3D registration.
|Ours (GPU)||0.32 s|
|Ours (OMP)||8.24 s|
We introduced a new method for non-linear registration using the GPU, which is highly efficient while maintaining state-of-the-art accuracy. We compared it to an optimized CPU implementation and achieved speedups up to a factor of at runtimes under seconds, while placing third with respect to accuracy in the DIR-Lab 4DCT benchmark. We believe that such low overall runtimes will open up new application scenarios for clinical use, such as interactive registration and real-time organ tracking, and will further clinical adoption of fully-deformable, non-rigid registration methods.
-  König L, Rühaak J. A fast and accurate parallel algorithm for non-linear image registration using Normalized Gradient fields. Proc IEEE Int Symp Biomed Imaging. 2014 apr; p. 580–583.
-  König L, et al. A Matrix-Free Approach to Parallel and Memory-Efficient Deformable Image Registration. SIAM J Sci Comput. 2018 jan;40(3):B858–B888.
-  Meike M. GPU-basierte nichtlineare Bildregistrierung [mathesis]. Universität zu Lübeck; 2016.
-  Modersitzki J. FAIR: Flexible Algorithms for Image Registration. SIAM; 2009.
-  Fischer B, Modersitzki J. A unified approach to fast image registration and a new curvature based registration technique. Linear Algebra Appl. 2004 mar;380:107–124.
-  Nocedal J. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation. 1980 sep;35(151):773–782.
-  Wilt N. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley; 2013.
-  Castillo R, et al. A framework for evaluation of deformable image registration spatial accuracy using large landmark point sets. Phys Med Biol. 2009 mar;54(7):1849–1870.
-  Castillo E, et al. Four-dimensional deformable image registration using trajectory modeling. Phys Med Biol. 2009 dec;55(1):305–327.