1 Introduction
Image registration – i.e., finding a dense correspondence map between images or volumes taken at different points in time or under different conditions – is still a crucial component of many clinical and research pipelines: compensating for patient movement and breathing in radiological follow-up and radiation therapy, monitoring the progression of degenerative diseases, 3D reconstruction from slices in histopathology, and many others. It is made particularly challenging by the typically large, three-dimensional nature of the data, highly nonconvex energies, and the runtime requirements of clinical practice.
Towards reducing runtime, the authors of [1, 2] propose a highly accurate nonrigid registration model with applications in follow-up imaging in radiology and liver ultrasound tracking, and introduce a parallel algorithm for the CPU. They also include a preliminary GPU implementation for the 2D case provided by [3]. In this work, we extend these ideas to a fast, matrix-free, parallel algorithm that solves the variational, regularized problem for full 3D image registration on the GPU. This allows us to achieve sub-second runtimes at standard image resolutions, and runtimes in the low seconds for high-resolution volumes, at state-of-the-art accuracy.
2 Materials and methods
2.1 Model
Regarding the model, we follow [2]: We seek a three-dimensional deformation vector field y, discretized on a deformation grid, which deforms a template image T to be similar to a reference image R, both discretized on an image grid. To find y, we numerically minimize an objective function J, consisting of a distance measure D and a smoothing term S, weighted by α:

(1)  J(y) = D(T(P y), R) + α S(y)
Here P denotes the grid conversion, which converts the deformation y from the deformation grid to the image grid before it is used to interpolate the deformed template image on the image grid. For the distance measure D, we use the well-known normalized gradient field (NGF), which is particularly suitable for multi-modal images [4]. It focuses on intensity changes and compares the angles of the image gradients. To encourage smooth deformations, we employ the curvature-based regularization term introduced by [5], which penalizes the Laplacian of the deformation field components via S(y) = 1/2 Σ_{j=1}^{3} ∫ (Δ y_j)² dx.
2.2 Parallelization
We chose to implement our method on the GPU using the CUDA toolkit, which allows working close to the hardware and fine-tuning for performance.
Performance on the GPU is tightly coupled to high occupancy, defined as the ratio of active threads to the maximum number of threads the device can keep resident. Using a large number of registers per thread decreases occupancy [7]; therefore, we keep the number of variables per thread low and split large functions (kernels) into smaller ones.
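As an illustration of this trade-off, CUDA's `__launch_bounds__` qualifier can cap register usage so that an occupancy target is met. The kernel below is a hypothetical sketch of this technique, not code from our implementation:

```cuda
// Hypothetical sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
// asks the compiler to limit register usage so that at least minBlocksPerSM
// blocks of up to maxThreadsPerBlock threads can be resident per
// multiprocessor, sustaining occupancy at the cost of possible spills.
__global__ void __launch_bounds__(256, 4)
scaleKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];  // few live variables -> low register pressure
}
```

Splitting one large kernel into several such small kernels keeps the number of live variables, and thus registers, per thread low.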
We generally used single-precision (32-bit) floating-point variables due to the faster computations and only half the number of required registers compared to double precision (64-bit) [7].
Generating the multilevel pyramid.
For the multilevel approach, reference and template images need to be downsampled to various resolutions. The CUDA framework provides CUDA streams, which enable concurrency between GPU computations and memory transfers between host and GPU [7]. This allows the pyramid generation and the data transfers for the reference and template images to run in parallel.
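A minimal sketch of this overlap, with hypothetical kernel and buffer names (our actual pyramid code differs):

```cuda
#include <cuda_runtime.h>

// Hypothetical downsampling kernel: halves the resolution per dimension.
// Simple subsampling here; a real pyramid would average/filter first.
__global__ void downsample(const float* fine, float* coarse,
                           int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    int cx = nx / 2, cy = ny / 2, cz = nz / 2;
    if (x >= cx || y >= cy || z >= cz) return;
    coarse[(z * cy + y) * cx + x] = fine[(2 * z * ny + 2 * y) * nx + 2 * x];
}

// Overlap host-to-device transfers with pyramid generation: the copy and
// downsampling of the reference image (stream sR) can run concurrently with
// those of the template image (stream sT). Host buffers must be page-locked
// (cudaHostAlloc / cudaHostRegister) for the async copies to truly overlap.
void buildPyramids(const float* hostR, const float* hostT,
                   float* devR, float* devRcoarse,
                   float* devT, float* devTcoarse,
                   size_t bytes, int nx, int ny, int nz)
{
    cudaStream_t sR, sT;
    cudaStreamCreate(&sR);
    cudaStreamCreate(&sT);

    dim3 block(8, 8, 8);
    dim3 grid((nx / 2 + 7) / 8, (ny / 2 + 7) / 8, (nz / 2 + 7) / 8);

    cudaMemcpyAsync(devR, hostR, bytes, cudaMemcpyHostToDevice, sR);
    downsample<<<grid, block, 0, sR>>>(devR, devRcoarse, nx, ny, nz);

    cudaMemcpyAsync(devT, hostT, bytes, cudaMemcpyHostToDevice, sT);
    downsample<<<grid, block, 0, sT>>>(devT, devTcoarse, nx, ny, nz);

    cudaStreamSynchronize(sR);
    cudaStreamSynchronize(sT);
    cudaStreamDestroy(sR);
    cudaStreamDestroy(sT);
}
```

Within each stream, the downsampling kernel starts as soon as its image's copy finishes, while the other stream's transfer is still in flight.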
Evaluating the objective function.
Evaluating the distance term and its gradient requires two grid conversions and a gradient computation:

1. convert the deformation to the image grid by applying P,
2. compute the distance measure D and its gradient on the image grid, and
3. convert the result back to the deformation grid by applying P^⊤.
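In host code, these three steps map to three kernel launches per evaluation; the kernel names below are illustrative, not our actual API:

```cuda
// Hypothetical kernels for the three steps (bodies omitted for brevity).
__global__ void applyP(const float* defGrid, float* defImg);
__global__ void ngf(const float* R, const float* T, const float* defImg,
                    float* dist, float* gradImg);
__global__ void applyPT(const float* gradImg, float* gradDef);

// Illustrative host-side sequence for one evaluation of D and its gradient.
void evalDistance(const float* defGrid,   // deformation on coarse grid
                  float* defImg,          // deformation on image grid
                  const float* R, const float* T,
                  float* gradImg,         // gradient on image grid
                  float* gradDef,         // gradient on coarse grid
                  float* dist,            // scalar distance (device memory)
                  dim3 gridImg, dim3 gridDef, dim3 block)
{
    // Step 1: convert the deformation to the image grid (apply P).
    applyP<<<gridImg, block>>>(defGrid, defImg);
    // Step 2: interpolate T, evaluate the NGF value and per-voxel gradient.
    ngf<<<gridImg, block>>>(R, T, defImg, dist, gradImg);
    // Step 3: convert the gradient back to the deformation grid (apply P^T).
    applyPT<<<gridDef, block>>>(gradImg, gradDef);
}
```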
In the following sections, we discuss the details of each step and its implications for the implementation with CUDA.
Distance measure and gradient computation.
We denote by ∇R_i and ∇T_i the gradients of the reference and deformed template image at the i-th image grid point and discretize the NGF distance measure as a sum over grid points,

(2)  D(y) = h̄ Σ_i [ 1 − (⟨∇R_i, ∇T_i⟩ + ε_R ε_T)² / (‖∇R_i‖²_{ε_R} ‖∇T_i‖²_{ε_T}) ]

with voxel volume h̄ as the product of the image grid spacings, the smoothed norm function ‖x‖²_ε := ⟨x, x⟩ + ε², and the modality-dependent parameters ε_R, ε_T to filter the gradient image for noise. Following [1], we can parallelize the computation of the distance measure function value directly over the terms in the sum.
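A sketch of this parallelization over the terms of (2), with one thread per voxel (illustrative names; our implementation uses a faster tree reduction than the atomic accumulation shown here):

```cuda
// One thread per voxel computes its NGF term; atomicAdd serves as a simple
// reduction (a tree reduction or cub::DeviceReduce would be faster).
// gradR/gradT hold the precomputed image gradients per voxel; epsR/epsT are
// the modality-dependent edge parameters; hbar is the voxel volume.
__global__ void ngfValue(const float3* gradR, const float3* gradT,
                         float epsR, float epsT, float hbar,
                         int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 r = gradR[i], t = gradT[i];
    float dot  = r.x * t.x + r.y * t.y + r.z * t.z + epsR * epsT;
    float nr   = r.x * r.x + r.y * r.y + r.z * r.z + epsR * epsR;  // |.|^2_eps
    float nt   = t.x * t.x + t.y * t.y + t.z * t.z + epsT * epsT;
    float term = 1.0f - (dot * dot) / (nr * nt);
    atomicAdd(result, hbar * term);  // accumulate the weighted sum of terms
}
```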
Applying derivative-based numerical optimization methods such as L-BFGS [6] requires frequent evaluation of the gradient ∇D(y). Writing D(y) = ψ(T(P y)) with the reduction function ψ collecting the per-voxel terms of (2), the chain rule yields

∇D(y) = P^⊤ ∇T(P y)^⊤ ∇ψ(T(P y)).

Evaluating the gradient using the chain rule by computing the gradient parts and multiplying step by step is expensive in terms of (intermediate) memory required. We avoid this by relying on the matrix-free methods introduced by [2].
Grid conversion.
Following the approach proposed by [1], we separate the deformation grid resolution from the image resolution. This allows us to save memory and speed up the registration by discretizing the deformation on a coarser grid while preserving all information in the input images.
For optimal performance, a (surprisingly) crucial step in computing the distance measure and its gradient is the conversion between the two grids, i.e., computing matrix–vector products with P and P^⊤. Applying P is directly parallelizable when using trilinear interpolation [1]. However, applying P^⊤ with a coarser deformation grid introduces possible write conflicts when summing up values from multiple points on the higher-resolution image grid to obtain a value for a single point on the lower-resolution deformation grid.
To account for this issue, the authors of [1] introduced a red-black scheme, where all odd slices are computed in parallel, followed by all even slices. However, the author of [3] observed a poor utilization of GPU cores with this method, and therefore computed every slice, row, and column in parallel, using atomic operations to avoid write conflicts. We introduce a different method, which is not based on the red-black scheme and is free of write conflicts: each thread computes a deformation grid point independently by summing over the corresponding image-grid points,

(3)  (P^⊤ x)_j = Σ_{i ∈ I_j} w_{ij} x_i

Here w_{ij} is the local interpolation weight and I_j contains the corresponding indices of image-grid points for deformation-grid point j, which are determined beforehand, separately for each dimension. While there is a certain overhead in computing the weights this way, we found the overall runtime to be faster due to the higher parallelism.
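The gather scheme of (3) can be sketched as follows, reduced to one dimension for readability (the 3D version combines the per-dimension weights multiplicatively); the index ranges lo/hi and weights w are assumed to be precomputed on the host:

```cuda
// 1D illustration of the gather-based application of P^T. Each thread owns
// one coarse (deformation) grid point j and sums the contributions
// w[i] * x[i] from fine (image) grid points i in a precomputed range
// [lo[j], hi[j]). No write conflicts arise, because no two threads write
// to the same output element y[j].
__global__ void applyPT1D(const float* x, const int* lo, const int* hi,
                          const float* w, float* y, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= m) return;
    float sum = 0.0f;
    for (int i = lo[j]; i < hi[j]; ++i)
        sum += w[i] * x[i];   // w[i]: precomputed interpolation weight
    y[j] = sum;               // single writer per output -> conflict-free
}
```

In contrast to the scatter formulation, each output element has exactly one writer, so neither atomics nor a red-black ordering are needed.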
3 Results
We investigated the accuracy and speed of our method in comparison to state-of-the-art alternatives on the DIR-Lab 4DCT benchmark [8, 9]. We also compared to an Open Multi-Processing (OMP) based implementation of the same model on the CPU proposed in [2], which is already one to two orders of magnitude faster than a matrix-based implementation using the MATLAB FAIR toolbox [4].
All experiments were performed using an NVIDIA GeForce GTX 1080 Ti GPU and an Intel Core i7-6700K CPU.
3.1 Radboud follow-up CT dataset
In order to investigate the performance of our method on high-resolution 3D data, we measured the average runtime over 986 registrations on a dataset of follow-up thorax-abdomen CT scans provided by the Radboud University Medical Center, Nijmegen, The Netherlands. As full image resolution was slightly out of reach due to memory restrictions of the GPU, we evaluated our approach at half, quarter, eighth, and sixteenth resolution per dimension.
For the highest resolution, average runtimes and the speedups over the CPU-based parallel OMP implementation are given in Table 1. At the lower resolutions, our method achieves sub-second runtimes at a speedup of about one order of magnitude. A majority of the runtime at the lower resolutions is spent on the multilevel pyramid creation, due to the large memory transfer and downsampling.
Ours (s) | OMP (s) | Speedup
Table 1. Mean runtimes and standard deviations, averaged over 986 thorax-abdomen registrations at the finest image resolutions. Compared to the CPU-based OMP implementation, we achieve average speedups that open up new application scenarios for clinical use and interactive registration.

It is prudent to ask whether moving from double precision to single precision on the GPU introduces differences due to rounding. In fact, we observed that this can have an effect (Figure 1). However, it typically only occurs when there are no clear correspondences, such as in regions of the colon with different content, or when the examination table is visible in one of the two scans. In these areas, there is no strong objective-function gradient in either direction during optimization, so numerical differences have a larger impact. However, we argue that if such areas were to be registered accurately, a more elaborate model that accounts for the possible removal of structures would have to be employed in any case.
3.2 DIR-Lab 4DCT benchmark
For a comparison to the state of the art, we evaluated our method on the DIR-Lab 4DCT dataset [8, 9], consisting of ten CT scan pairs of the lung in fully inhaled and fully exhaled states. The first five and the last five image pairs differ in resolution. We set the deformation grid to one quarter of the image resolution.
Accuracy of the final registration was measured by the average landmark error (LME) over 300 expert-annotated landmarks for each dataset (Figure 2). Our OMP implementation scores only slightly behind the best-performing pTVreg method and places second-best overall in terms of accuracy.
Our GPU implementation follows closely, with a small accuracy difference due to the single-precision computations, and achieves third place overall in terms of accuracy. Moreover, it is about one order of magnitude faster than all other methods in the benchmark for which runtimes could be obtained. Compared to the only method with better accuracy (pTVreg), it is more than two orders of magnitude faster (0.32 s vs. 130 s on average per full 3D registration).
Method       LME (mm)   Runtime
Ours (GPU)              0.32 s
NGF(b)                  6.56 s
Ours (OMP)              8.24 s
NGF(a)                  20.90 s
cEPE                    46 s
SGFM3D                  98 s
NLR                     104.19 s
cTVL1                   110 s
pTVreg                  130 s
pTV                     180 s
LMP                     N/A
isoPTV                  N/A
4 Discussion
We introduced a new method for nonlinear registration using the GPU, which is highly efficient while maintaining state-of-the-art accuracy. We compared it to an optimized CPU implementation and achieved speedups of about an order of magnitude at sub-second to low-second runtimes, while placing third with respect to accuracy in the DIR-Lab 4DCT benchmark. We believe that such low overall runtimes will open up new application scenarios for clinical use, such as interactive registration and real-time organ tracking, and will further clinical adoption of fully deformable, nonrigid registration methods.
References
 [1] König L, Rühaak J. A fast and accurate parallel algorithm for non-linear image registration using normalized gradient fields. Proc IEEE Int Symp Biomed Imaging. 2014 Apr; p. 580–583.
 [2] König L, et al. A Matrix-Free Approach to Parallel and Memory-Efficient Deformable Image Registration. SIAM J Sci Comput. 2018 Jan;40(3):B858–B888.
 [3] Meike M. GPU-basierte nichtlineare Bildregistrierung [GPU-based nonlinear image registration; Master's thesis]. Universität zu Lübeck; 2016.
 [4] Modersitzki J. FAIR: Flexible Algorithms for Image Registration. SIAM; 2009.
 [5] Fischer B, Modersitzki J. A unified approach to fast image registration and a new curvature based registration technique. Linear Algebra Appl. 2004 Mar;380:107–124.
 [6] Nocedal J. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation. 1980 Sep;35(151):773–782.
 [7] Wilt N. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley; 2013.
 [8] Castillo R, et al. A framework for evaluation of deformable image registration spatial accuracy using large landmark point sets. Phys Med Biol. 2009 Mar;54(7):1849–1870.
 [9] Castillo E, et al. Four-dimensional deformable image registration using trajectory modeling. Phys Med Biol. 2009 Dec;55(1):305–327.