I Introduction
Deformable registration (also known as image alignment, warping, or matching) refers to methods that find point correspondences between images by comparing image intensities. We refer to these point correspondences as the deformation map (see Figure 1 in §II). An example of a low-dimensional image registration problem is affine registration; it creates simple maps consisting of rotations, translations, and scalings [49]. Typically, affine registration is used as an initialization step for large deformation diffeomorphic registration (LDDR), which is the problem we are concerned with in the present work. In LDDR, we typically search for a deformation map whose number of degrees of freedom is the dimension of the ambient space times the number of grid points defined in the image space. LDDR is much more flexible than affine registration and thus, in general, more informative in clinical studies
[55, 38]. Such high-dimensional transformations can be defined in many different ways [49, 50, 55]. Image registration is an ill-posed inverse problem; it does not have a unique solution. Not all large deformation maps are admissible, since a map can shuffle points arbitrarily to match intensities. It is crucial to impose constraints on the deformation while allowing for flexibility. The most important constraint is that the map is diffeomorphic (see Figure 2 in §II). Solving an LDDR problem in a rigorous way requires the solution of a nonconvex partial differential equation (PDE) constrained optimal control problem [10, 26, 31, 41]. This problem is ill-posed and involves nonlinear and ill-conditioned operators. Most state-of-the-art packages circumvent these issues by sacrificing scalability and settling for crude solutions using simple but suboptimal algorithms. In many cases this works sufficiently well, but in several others it does not. There is significant activity in trying to improve the existing algorithms. With regard to performance optimizations, most codes use open multiprocessing (OpenMP) or graphics processing unit (GPU) acceleration; very few utilize distributed-memory parallelism. As a result, they do not scale to full image resolution; to solve problems for large images, most codes use subsampling. This is limiting considering current imaging resolutions. Seven Tesla magnetic resonance imaging (MRI) scanners can reach a resolution of up to ( voxels) [44]. Ultra-high resolution computed tomography (CT) captures resolution ( voxels) [36].
Beyond the need for strong scaling of image registration algorithms for clinical applications, there is also a need for weak scaling for imaging in biology, biophysics, and neuroscience. Animal micro-CT reaches resolution ( voxels) [56, 64]. In small-animal neuroimaging, CLARITY [58], a novel optical imaging technique, can deliver submicron resolution for the whole brain of the animal, resulting in 10-100 billion-voxel images. To our knowledge, none of the existing schemes for LDDR allow for the registration of such large volumetric images [5, 9].
Contributions: The design goals for our 3D LDDR scheme are the following: (1) ability to represent large diffeomorphic deformations; (2) algorithms based on rigorous mathematical foundations; (3) algorithmic optimality with respect to both the deformation map resolution and the image resolution; and (4) parallel scalability. Here, we propose an algorithm that achieves these goals and has the following characteristics:

It is based on optimal control theory. Our formulation allows us to control the registration quality in terms of image correspondence and different quality metrics for the diffeomorphism/deformation map (see §II-A).

It uses a semi-Lagrangian approach for solving the transport equations that govern the deformation of the image. This approach leads to algorithmic optimality (see §III-B2).

It uses a spectral discretization in space (see §III-B1). This discretization enables flexibility in the choice of regularization operators for the deformation map. Such flexibility is necessary since different image registration applications have different requirements. It also allows for efficient solvers for saddle-point linear systems.

It uses optimal algorithms based on an adjoint-based formulation solved via a line-search globalized, inexact, preconditioned Gauss-Newton-Krylov scheme (see §III-A).

It uses distributed-memory parallelism for scalability, employs several performance optimizations specific to our problem, and uses a parallel FFT for elliptic solvers and differentiation that has been shown to scale to hundreds of thousands of cores (see §III-C). It employs several optimizations for the most expensive part of the computation, cubic interpolation (see §III-C2). It also supports GPU acceleration (not discussed here).
In addition, we analyze the overall complexity of our method in terms of communication, computation, and storage. The class of deformations we consider here is among the most challenging, since we enable locally volume preserving maps (referred to as "mass preserving" maps in the medical imaging jargon), which find many applications [54, 63, 29, 12]. We present results for synthetic and neurological images and demonstrate the performance of our algorithm (see §IV) for both volume preserving and more generic deformation maps.
Related work on high performance computing methods for 3D image registration: A rich literature survey on high performance computing (HPC) in image registration can be found in [53, 18, 35]. General surveys on image registration can be found in [49, 55]. Formulations related to the one discussed in this work are reviewed in [45, 46, 47].
State-of-the-art registration packages used in the medical imaging and medical image computing community include ELASTIX [39], ANTS [6], DARTEL [4], and DEMONS [60, 61, 43]. All of these offer some kind of diffeomorphic registration scheme. These packages mostly support OpenMP but do not use GPU or Message Passing Interface (MPI) acceleration (exceptions are discussed below). An important distinction should be made between the image resolution (number of voxels) and the map resolution (number of degrees of freedom for the map parameterization). In general, the higher the map resolution, the better the registration quality [38, 39], but the harder the optimization problem, since it has more degrees of freedom. Most existing codes downsample the map resolution significantly.
The majority of researchers have used GPUs to accelerate the calculations. For example, the solver for the LDDR scheme in [28] uses a preconditioned gradient descent (not Newton) algorithm with hardware-provided trilinear interpolation on a GPU architecture. It supports the distributed solution of multiple independent LDDR image registration problems (in an embarrassingly parallel way), but does not support distributed-memory parallelism for a single LDDR problem.
Two popular packages that exploit GPU acceleration are NIFTYREG [48] and PLASTIMATCH [52]. They use B-spline parameterized low-resolution maps ( coefficients), a trilinear interpolation scheme, and gradient-descent-type optimization; NIFTYREG supports soft constraints to penalize volume change.
An MPI version of NIFTYREG for bigger images [33] exists, but the map resolution remains the same ( regular grid for the deformation field). The pioneering works [62, 42] support MPI. Their formulation is based on elastic deformation maps (not LDDR). A GPU LDDR scheme that supports somewhat high-resolution maps () is presented in [59]. It uses steepest descent (not Newton) and does not support MPI; no timings are reported.
To summarize, existing schemes do not support scalable LDDR algorithms and no scaling studies have been reported.
Limitations: In multi-frame volume registration (e.g., 4D Cine-MRI) one seeks to register multiple images using a smooth, continuous mapping [13, 9]. Our solver can be used as is, but our diffeomorphic map parameterization is better suited for registering two images. Our parameterization can be extended without any major algorithmic changes, but the software engineering would require some work. Another missing piece is a preconditioner that is insensitive to the regularization parameter. There are several techniques for this, e.g., grid continuation and multilevel preconditioning [10, 1, 47]. Here we focus on the single-level solver. The single-node performance of the interpolation can be improved by more sophisticated blocking, manual vectorization, and possibly prefetching. Similar considerations hold true for the GPU version of the interpolation. This is ongoing work. Another limitation is that we only consider a discretization on Cartesian grids. This is not always the best grid [62]. The structure of our algorithm changes significantly for unstructured grids.

II Background
Let be the spatial domain, in which we define functions (images). denotes the boundary of and a point in . Let be a function defined on . In imaging, is the image intensity at a point ; in optimal control, is the state field. In the registration problem, we have a reference image, denoted by , and a template image, denoted by ; the goal is to find a vector-valued deformation map, denoted by , that maps a point of the template image to the corresponding point in the reference image [49, 50].
Let be the velocity field that generates the map . In our formulation, we introduce a pseudo-time and denote the deformation of the template image at time by . We define to be the undeformed template image , and to be the result of applying the deformation map (which needs to be compared to ).
For the optimal control problem, is the adjoint field, is the reduced Hessian operator, is the gradient field, and is the scalar regularization parameter. We use periodic boundary conditions for all differential operators. For the discretization, is the number of grid points per th dimension; is the total number of unknowns in space; is the number of time steps and the time step size. We use for the number of MPI tasks, for the latency in seconds, and for the reciprocal of the bandwidth when we do complexity analysis. Boldface lowercase symbols indicate vectors in .
II-A Image registration
The image registration problem can be abstractly defined as follows. Given two functions (i.e., images) and , we seek a vector function such that the distance (i.e., the residual) between and is minimal. We can think of as deforming an (infinite) grid of points in the template image so that their intensities after the deformation match the reference image in the norm. (Other types of distance measures can be used; see, e.g., [49, 50, 55]. There would be no significant changes in our formulation or algorithm if we considered other popular distance measures.)
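On a discrete grid, the squared-norm distance above reduces to a scaled sum of squared intensity differences. The following is a minimal sketch; the names `m1` (deformed template) and `mR` (reference), and the simple quadrature by the cell volume, are illustrative choices, not the paper's implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared L2 distance between a deformed template m1 and a reference image mR
// sampled on a regular grid, approximated by summing squared pointwise
// residuals and scaling by the grid cell volume.
double image_distance(const std::vector<double>& m1,
                      const std::vector<double>& mR, double cell_volume) {
    assert(m1.size() == mR.size());
    double d = 0.0;
    for (std::size_t i = 0; i < m1.size(); ++i) {
        const double r = m1[i] - mR[i];  // pointwise residual
        d += r * r;
    }
    return 0.5 * d * cell_volume;  // 0.5 * ||m1 - mR||^2
}
```

In the full objective this term is balanced against the regularization term on the velocity; the optimizer drives this residual down while the regularizer keeps the map smooth.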
It can be shown that this problem has an infinite number of solutions, most of which are not useful. To resolve this, one needs to impose additional constraints, such as smoothness (i.e., exists), although this alone may not be sufficient to ensure that the map is plausible. Note that we require for all to guarantee that is a diffeomorphism (for an illustration, see Figure 2). In velocity-based LDDR it can be shown that such a diffeomorphic map exists if the generating velocity field is adequately smooth [9, 15, 13]; a typical requirement is that is an function [9]. The deformation map can be computed from by solving
(1) 
where . It can be shown that if the velocity is incompressible (i.e., for every in ), then and the diffeomorphism is referred to as volume preserving [27]. Here, we consider both the general and the incompressible velocity cases. The latter is more challenging. We only consider stationary velocity fields, that is, . (It can be shown that the space of diffeomorphisms generated by time-varying velocities is strictly larger than the space generated by stationary velocities. This has no practical implications when we register two images (see, e.g., [45]). However, it is restrictive if we want to register a sequence of images, as in optical flow problems.)
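For orientation, in the stationary-velocity LDDM literature (see, e.g., [9, 15]) this flow equation is commonly written as follows, with \(y\) denoting the deformation map, \(\mathbf{v}\) the velocity, and \(\mathbf{x}\) a point in the domain; the symbol names here are ours and may differ from the original notation:

```latex
\partial_t y(\mathbf{x}, t) = \mathbf{v}\bigl(y(\mathbf{x}, t)\bigr),
\qquad y(\mathbf{x}, 0) = \mathbf{x}, \qquad t \in (0, 1],
```

The deformation map is then \(y(\cdot, 1)\); if \(\nabla \cdot \mathbf{v} = 0\), the flow preserves volume, i.e., \(\det \nabla y = 1\).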
In the following we drop the dependence of the functions on the spatial position for notational convenience.
II-B Formulation
The solution to the image registration problem can be found by solving the following PDE-constrained optimization problem [30, 13, 45]:
(2a)  
subject to  
(2b)  
(2c)  
(2d) 
The second term of enforces smoothness for and is the regularization parameter. In this formulation, , where is the solution of (1) at . The constraint (2b) defines an implicit function between and ; given we solve (2b) for .
Computing the gradient of
Given , we need several steps to compute the gradient . First we compute by solving (2b) (with the initial condition defined by (2c)). Then we compute the adjoint function by solving the backward-in-time adjoint equation [34, 10]
(3)  
Once we have the state and adjoint fields, we can evaluate the gradient given by
(4) 
The operator , also known as the Leray operator [57], eliminates the incompressibility constraint for . Furthermore, we define the vector field so that . The gradient is a nonlinear elliptic integro-differential operator in which the state (forward in time) and adjoint (backward in time) transport PDEs are "hidden" in .
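In closely related formulations (see, e.g., [45, 47]), the reduced gradient takes the following representative form, with \(\beta\) the regularization weight, \(\mathcal{A}\) the regularization operator, \(\mathcal{K}\) the Leray projection, \(m\) the state, and \(\lambda\) the adjoint; the notation is ours and may differ from the original:

```latex
\mathbf{g}(\mathbf{v}) = \beta \mathcal{A}[\mathbf{v}]
  + \mathcal{K}[\mathbf{b}],
\qquad
\mathbf{b}(\mathbf{x}) = \int_0^1 \lambda(\mathbf{x}, t)\, \nabla m(\mathbf{x}, t)\, \mathrm{d}t .
```

The time integral in \(\mathbf{b}\) is what makes the gradient integro-differential: evaluating it requires the full space-time state and adjoint trajectories.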
Evaluating the gradient requires solving two transport equations, inverting the Laplacian, and applying gradient, divergence, and biharmonic operators. (If we do not wish to compute a volume preserving map, we can drop (i.e., not enforce) the incompressibility constraint; in the gradient calculation, we then only need to replace the operator with an identity operator.) The first-order optimality condition for (2) requires that . Most registration packages use steepest descent (first-order) methods to find an optimal point (minimizer) [6, 9, 13]. However, steepest descent methods only have a linear convergence rate. Using Newton methods, which provide a much better convergence rate, is considered prohibitive, especially for LDDR, for two main reasons. First, it is cumbersome to derive the equations for the second-order optimality conditions. Second, a naive implementation of a Newton method can be very costly.
The Newton and Gauss-Newton Hessian operators
To solve for we use an Armijo line-search globalized Newton method [51]. The key operation is the action of the Hessian on a vector field , commonly referred to as the Hessian matvec. This matvec is computed as follows: First, we solve (2b) and (3) to compute the state and adjoint variables and , respectively. We then solve (5aa) for the incremental state variable and (5ac) for the incremental adjoint variable , accumulate them in time to compute , and finally evaluate (5ae).
(5aa)  
(5ab)  
(5ac)  
(5ad)  
(5ae) 
where
III Methods
Given two images and , our goal is to find . We use an optimize-then-discretize approach for (2). Our scheme can be summarized as follows:

We solve (2) for .

Once we have , we use (1) to compute .

To find we solve (where is given by (4)) using a preconditioned Newton-Krylov method.

In space we use Fourier expansions (regular grids with periodic boundary conditions), and in time we use a semi-Lagrangian scheme.

We use data parallelism in space (the regular grid).

All spatial differential operators (and their inverses if needed) are computed spectrally using our parallel FFT.

All spatial algebraic operators are done in parallel.
We provide more details for our algorithm in the following subsections.
III-A Newton-Krylov solver and preconditioning
For the optimization we use a Newton method globalized with an Armijo line search. We use a preconditioned conjugate gradient (PCG) method to compute the Newton step. The linear solves using PCG are done inexactly with a tolerance that depends on the relative norm of the gradient [17]. The preconditioner is the inverse of the biharmonic operator () and can be applied in nearly linear time using FFTs (with a logarithmic factor). This preconditioner delivers mesh independence, but not independence with respect to the regularization parameter (see §IV). Since the problem is highly nonlinear, we use parameter continuation on . The target value for is application dependent and, in our algorithm, determined by various metrics defined on [45]. We use the TAO module from the PETSc library [8, 7] for numerical optimization, which supports a user-defined, matrix-free PCG. TAO provides interfaces that allow one to control two main parameters in the Newton-Krylov solver: () the accuracy of the solution of the linear system for the Newton step (the relative tolerance of the PCG method used to solve the Hessian equation); and () the nonlinear termination criteria. We provide the algorithms to determine these parameters and, given and , efficient routines for the function evaluation , the gradient evaluation , the Hessian matvec, and the action of the preconditioner .
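The gradient-dependent PCG tolerance can be sketched as a forcing-term rule: the Krylov solve gets tighter as the outer Newton iteration converges. The cap of 0.5 and the specific rule below are common defaults for this kind of inexact Newton method, not necessarily the exact values used by the solver described here:

```cpp
#include <algorithm>
#include <cassert>

// Forcing term for an inexact Newton-Krylov method: the relative tolerance
// for the PCG solve of the Hessian system shrinks with the relative gradient
// norm, so early (far-from-solution) Newton steps are solved only coarsely.
double pcg_tolerance(double grad_norm, double grad_norm0) {
    const double reduction = grad_norm / grad_norm0;  // relative gradient norm
    return std::min(0.5, reduction);                  // never tighter than needed
}
```

The payoff is that most of the Krylov iterations are spent only near the solution, where an accurate Newton step actually matters.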
III-B Discretization
We use Cartesian grids to approximate spatial functions and spatial differential operators. We use an explicit second-order Runge-Kutta semi-Lagrangian method to discretize in time.
III-B1 Space
We use a spectral projection scheme for all spatial operations on a regular grid defined on with periodic boundary conditions. For simplicity, we consider the isotropic case, in which the grid spacing is the same in all directions; that is, the number of points per direction is given by . Our actual implementation does not require this (see §IV for an example). We approximate
where is a multiindex with . The corresponding regular grid points are given by , where and .
We refer to as the spectral coefficients of . Mappings between and are done using the forward and inverse fast Fourier transform (FFT). We use a similar spectral discretization for and for each component of the velocity field . All derivatives are computed by first taking the FFT and then filtering the spectral coefficients appropriately. In general, the input images and may not be periodic functions. In that case a spectral approximation will create excessively high aliasing errors. To address this, we use zero-padding for and . Also, in general, images have discontinuities and thus are not differentiable, creating similar aliasing problems. So, before applying our algorithm, we smooth them spectrally with a Gaussian filter whose bandwidth is (the grid size). Notice that our spectral representation with periodic boundary conditions allows us to apply all the different spatial operators, including and , in a stable, accurate, and extremely efficient manner. As a result, the main cost of the computation is solving the transport equations, not applying and inverting elliptic differential operators.

III-B2 Time
We choose a semi-Lagrangian method since it is unconditionally stable [19] and allows us to take a small number of time steps . This is critical since we store several space-time fields. For example, when solving (5aa) for , we need for all . For large the storage requirements become excessive, and more sophisticated checkpointing schemes [2], which are more expensive, are required. If we were using a CFL-restricted scheme (the Courant-Friedrichs-Lewy (CFL) condition defines an upper bound on the time step size to ensure a stable solution of stiff, time-dependent PDEs [40]) for solving the transport equations, storing the time history would have been impossible, since we would have to store hundreds of time steps (see, e.g., [45, 46, 47]).
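The contrast can be made concrete by counting the steps a CFL-restricted explicit scheme would need on the unit time horizon. The CFL constant of one and the unit domain below are illustrative assumptions:

```cpp
#include <cassert>
#include <cmath>

// Minimum number of explicit time steps on t in [0,1] implied by a CFL bound
// dt <= c_cfl * h / vmax, for grid spacing h = 1/n and maximum velocity
// magnitude vmax. The semi-Lagrangian scheme is not subject to this bound.
int cfl_step_count(int n, double vmax, double c_cfl = 1.0) {
    const double h = 1.0 / static_cast<double>(n);
    const double dt_max = c_cfl * h / vmax;           // largest stable step
    return static_cast<int>(std::ceil(1.0 / dt_max)); // steps to cover [0,1]
}
```

For a grid with 256 points per dimension and a unit maximum velocity this already implies 256 stored time steps, versus the handful of steps the unconditionally stable scheme permits.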
To explain the semi-Lagrangian method, we reintroduce the notational dependence on space. We consider the general transport equation for a scalar field (with a stationary velocity)
For example, for (5ac) we have and . Then, for each , to compute given , we first compute a new point , the semi-Lagrangian point, using the scheme below:
(6) 
and then we set
(7) 
This scheme is fully explicit and unconditionally stable. Recall that is a regular grid point and thus and are known. and are not regular grid points. Computing and at these off-grid locations requires multiple interpolations: three interpolations for , interpolations for the terms that depend on the semi-Lagrangian point , and finally one interpolation for . If depends on derivatives of , we first differentiate on the regular grid and then interpolate the derivatives. The same scheme is used for the adjoint equations by changing the time variable from to , so that and . Note that the interpolation cannot be done using an FFT, since the interpolation points can be spaced irregularly between grid points.
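A 1D analogue of this update may clarify the structure: trace each grid point back along the (constant) velocity and interpolate the field at the departure point. Linear interpolation, the periodic unit domain, and the single-step Euler backtrace (which coincides with the second-order scheme for constant velocity) are simplifying assumptions relative to the actual solver:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One semi-Lagrangian step for 1D advection of a periodic scalar field f with
// constant velocity v on [0,1): trace grid point x_i back to X = x_i - v*dt
// and interpolate f there. Grid spacing is h = 1/n.
std::vector<double> semi_lagrangian_step(const std::vector<double>& f,
                                         double v, double dt) {
    const std::size_t n = f.size();
    const double h = 1.0 / static_cast<double>(n);
    std::vector<double> fnew(n);
    for (std::size_t i = 0; i < n; ++i) {
        double X = i * h - v * dt;          // departure (semi-Lagrangian) point
        X -= std::floor(X);                 // periodic wrap into [0,1)
        const double s = X / h;             // position in grid units
        const std::size_t i0 = static_cast<std::size_t>(s) % n;
        const std::size_t i1 = (i0 + 1) % n;
        const double w = s - std::floor(s); // linear interpolation weight
        fnew[i] = (1.0 - w) * f[i0] + w * f[i1];
    }
    return fnew;
}
```

No CFL restriction appears anywhere: the step is stable even when `v*dt` is many grid cells, which is exactly why the scheme tolerates a small number of time steps.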
III-C Parallel algorithms
The main computational kernels are the 3D FFTs used to compute derivatives, elliptic operators, and their inverses; the interpolation to off-grid points needed for the semi-Lagrangian time stepping; the Krylov solver (PCG) for the Hessian; and the Newton solver. The 3D FFT has well-known algorithmic complexity. The interpolation at the semi-Lagrangian points is the most expensive part of the computation, despite being local: about 60% of the overall time for the image registration problem is spent on interpolation. The Krylov and Newton solvers are sequential across iterations, whereas all function, gradient, and Hessian evaluations are done using data parallelism. Below we give more details on the FFT, the interpolation, and how we put everything together.
III-C1 Partitioning and FFT
Let , , be the number of grid points in the dimension. Also assume we have MPI tasks. We partition the data using the pencil decomposition for 3D FFTs (see Figure 4). Each MPI task gets grid points. There is no partitioning in time; all time steps are stored in memory. The scalability of the 3D FFT has been well studied [16, 25]. The 3D FFT requires computations and communications. We use the open-source package AccFFT [24, 22], which supports both GPU and CPU FFTs and is based on the 1D FFTs implemented in the FFTW package [21]. Our code features optimizations for the and operators that allow us to avoid multiple 3D FFTs. For example, requires 1D FFTs across the first coordinate, a diagonal scaling, and then the same number of inverse FFTs, again across the first dimension. A similar process is required for the other components, but they require collective all-to-all communications for rearranging the data. The remaining operators , , and are diagonal and require standard 3D FFTs.
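The ownership computation implied by the pencil decomposition can be sketched as follows. The row-major rank-to-coordinate mapping and the divisibility assumption are ours for simplicity; AccFFT's actual layout handles uneven splits and may order tasks differently:

```cpp
#include <array>
#include <cassert>

// Pencil decomposition of an n0 x n1 x n2 grid over a p0 x p1 grid of MPI
// tasks: task (r0, r1) owns an (n0/p0) x (n1/p1) x n2 pencil, i.e., the full
// extent of the third dimension. Divisibility of n0 by p0 and n1 by p1 is
// assumed here.
struct Pencil {
    std::array<int, 3> start;  // global index of the first owned point
    std::array<int, 3> size;   // extent of the owned block per dimension
};

Pencil local_pencil(int n0, int n1, int n2, int p0, int p1, int rank) {
    const int r0 = rank / p1;  // task coordinates in the 2D task grid
    const int r1 = rank % p1;
    Pencil p;
    p.size  = {n0 / p0, n1 / p1, n2};
    p.start = {r0 * p.size[0], r1 * p.size[1], 0};
    return p;
}
```

Because every task owns the full third dimension, 1D FFTs along that axis need no communication; transforms along the other two axes require the all-to-all transposes mentioned above.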
III-C2 Interpolation
For every grid point we have to find the points required in (7) using (6). In the distributed case, every processor interpolates all the points that fall into the region defined by its pencil (subfigure (a) in Figure 4). This is essentially an all-to-allv operation. We refer to this step as the "scatter" phase. Note that the points need to be reconstructed only when the velocity field changes. In a Newton iteration for a given we have to compute these points only once for (forward transport) and once for (adjoint equations); that is, the scatter phase needs to be done once per field per Newton iteration. This results in speedups due to savings in communication and computation. After the scatter phase is completed, each process performs a cubic interpolation on the points it owns locally as well as the points it received from other processors. After this step, an all-to-allv operation is necessary to send/receive the interpolation results. This needs to be done once per time step.
The computation is organized as follows. For every forward or adjoint solve, we invoke an interpolation planner, which performs the scatter phase, stores the semi-Lagrangian points, and creates the communication plans for the transport equation. Then the actual transport (7), which involves multiple interpolations at every time step, is performed. For divergence-free , the computation of (2b) and (3) involves only interpolations. For (5aa) and (5ac) it also involves differential operators for the gradient and divergence terms that appear on the right-hand side.
Note that it is possible for an interpolation point to fall in between the locally owned domains of the processes, because the local domain of each process is disjoint from the others. For this reason, every processor maintains a layer of ghost points, regular grid points that belong to other processors. The values of at these points must be synchronized before interpolation takes place. Notice that for every point we have to bring in 64 scalar values and perform roughly floating point operations. The constant is related to the 64 coefficients () required to build and evaluate the tricubic interpolant, times five flops per coefficient. Therefore, the computation to memory traffic ratio will be ; the computation is memory bound. Blocking, prefetching, and vectorization can be used to improve the performance.
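The per-point cost can be made concrete with a tensor-product cubic interpolant over the 4x4x4 neighborhood: 16 one-dimensional interpolations along x, 4 along y, and 1 along z touch all 64 samples. The Catmull-Rom basis below is an illustrative choice; the production kernel may use a different cubic basis and evaluation order with the same cost structure:

```cpp
#include <cassert>

// Catmull-Rom cubic through four equally spaced samples p[0..3] located at
// offsets -1, 0, 1, 2; evaluated at fractional offset t in [0,1].
double cubic1d(const double p[4], double t) {
    return 0.5 * ((2.0 * p[1])
        + (-p[0] + p[2]) * t
        + (2.0 * p[0] - 5.0 * p[1] + 4.0 * p[2] - p[3]) * t * t
        + (-p[0] + 3.0 * p[1] - 3.0 * p[2] + p[3]) * t * t * t);
}

// Tricubic interpolation on a 4x4x4 neighborhood nb[z][y][x] at fractional
// offsets (tx, ty, tz) relative to the grid point at index (1,1,1).
double tricubic(const double nb[4][4][4], double tx, double ty, double tz) {
    double plane[4], line[4];
    for (int z = 0; z < 4; ++z) {
        for (int y = 0; y < 4; ++y)
            line[y] = cubic1d(nb[z][y], tx);  // 16 x-interpolations
        plane[z] = cubic1d(line, ty);         // 4 y-interpolations
    }
    return cubic1d(plane, tz);                // 1 z-interpolation
}
```

Every evaluation streams in the 64 neighborhood values but reuses each only a handful of times, which is the memory-bound behavior described above.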
The execution flow of the interpolation algorithm is given in Algorithm 1 for an MPI task . Let be the value of the scalar field at a regular grid point . Also, let be the corresponding semi-Lagrangian points for each computed by (6). Let be the spatial domain assigned to the MPI task . As shown in Figure 3, does not imply that , and vice versa. For an off-grid point , owner denotes the MPI task that owns that point. That is, but . Similarly, but . Of course, for most points both the owner and worker domains will be identical to . In line 1 every task sends the values of that it owns to its four neighbors (recall that we use a pencil decomposition) with a communication cost of (the four corner neighbors can be combined with the messages of the edge neighbors by appropriate ordering of the messages). In line 2, for whose corresponding regular grid point belongs to a different MPI task, sends to that task.
III-C3 Algorithm for the incremental state equation
We summarize the steps needed to solve the incremental forward problem (5aa) for one time step in Algorithm 2 to illustrate the communication and computation pattern.
This calculation requires four interpolation steps using Algorithm 1: one for the scalar interpolation in line 1 and three for the vector interpolation in line 4. It also requires four FFTs: two for line 2 (two because we need to go to the spectral domain, differentiate, and then transform back to the spatial domain) and two for line 6. The other parts are triple "for-loops" over all the grid points in . The FFTs require global synchronizations. The total cost of the incremental state solve is four 3D FFTs and two interpolations. The incremental adjoint solve requires the same computations (for divergence-free velocity fields).
III-C4 Complexity of the Hessian matvec and the overall algorithm
Every Hessian matvec requires forward and adjoint solves, or FFTs and interpolations. The remaining operations, applying the regularization and the preconditioner, are negligible since they involve just two FFTs each. The gradient is also cheaper, since (2b) and (3) are simpler than the equations in (5). Regarding memory, every task needs to store values for the incremental adjoint and state variables. Therefore, accounting for the complexities of the FFT and the interpolation, we obtain,
This estimate assumes that the semi-Lagrangian points are uniformly distributed across processors; however, this is not guaranteed and depends on the velocity field and the CFL number. In practice the interpolation is the predominant cost of the calculation, at least for the problem sizes we have tested. For fixed the number of Newton iterations is independent of the mesh size, and the inversion of highly ill-conditioned operators is done in linear time.

IV Results
IV-A Experimental setup
In this section, we give details on the experimental setup we used to test our solver.
IV-A1 Images
We use one real-world and one synthetic image to test our algorithm. For the synthetic case we construct the template image as follows: ; the velocity is given by ; the reference image is the solution of (2b) with the exact velocity (see Figure 5 for an illustration of this problem). (For the incompressible case we use a similar but divergence-free velocity field .) We use a synthetic case to perform the scaling studies, since medical images come with a fixed resolution/grid size. To test our scheme on real medical images, we use two 3D MRI brain images of different individuals ("multi-subject registration problem"; grid size: ). This data is from the Non-rigid Registration Evaluation Project (NIREP) [14]; it is available at http://nirep.org, and the interested reader is referred to [14] for more details. We consider the first two datasets, na01 and na02, from this repository (see Figure 6 for an illustration).
IV-A2 Implementation and Hardware
Our code is implemented in C++ and uses MPI and the OpenMP library for multithreading. The code is compiled with the Intel C++ compiler using the -O3 flag. Although we have GPU implementations of both the FFT and the interpolation, we have not used accelerators in the results we report here. We carry out runtime experiments on TACC's "Maverick" system. Each compute node contains dual ten-core Intel Xeon E5-2680 v2 (Ivy Bridge) processors running at 2.8GHz with 12.8GB/core of memory. Each node also has an NVIDIA Tesla K40 GPU accelerator. We also report large-scale runs on TACC's "Stampede" system (two eight-core Xeon E5-2680 v1 (Sandy Bridge) processors with 32GB host memory per node). As mentioned before, we use PETSc's TAO for the nonlinear optimization, PETSc's vector operations for linear vector operations, and AccFFT for the Fourier transforms. The basic interface to TAO consists of the functional, gradient, Hessian matvec, and preconditioner, as well as routines to select the tolerances for the nonlinear solver and for the Newton steps.
IV-A3 Parameters
The regularization parameter is set to for the scalability runs (for both the synthetic and the real-world brain example). The number of time steps controls the accuracy and should be related to the CFL number. For simplicity, and to be able to compare different cases, we have kept it fixed to . The gradient tolerance is unless otherwise stated. We use an inexact Newton method with quadratic forcing. We do not report continuation results; all runs are done for a single (experimentally determined) value of . We report () wall-clock times, () communication times, as well as () the time to solution for our method with respect to different registration problems and parameter settings. Since the problem is nonconvex and we are not interested in high-accuracy solutions, we opt for a Gauss-Newton approximation. (Many image registration codes do not even compute gradients; their termination criterion is the number of iterations. Given the limitations in resolution and image quality, a relative reduction of the gradient by 1% is typically considered quite excessive.)
IV-B Scalability using synthetic images
FFT  interpolation  
nodes  tasks  time to solution  communication  execution  communication  execution  
#1  1  16  
#2  2  32  
#3  1  16  
#4  2  32  
#5  4  64  
#6  16  256  
#7  2  32  
#8  8  128  
#9  32  512  
#10  64  1024  
#11  8  128  
#12  32  512  
#13  64  1024 
FFT  interpolation  
nodes  tasks  time to solution  communication  execution  communication  execution  
#14  256  512  
#15  512  1024  
#16  1024  2048  
#17  256  512  
#18  512  1024  
#19  1024  2048 
We use a fixed set of parameters, which we experimentally determined to yield a good balance between computational complexity and computational performance. We illustrate the registration problem in Figure 5. We report results for different grid resolutions (), and different numbers of cores and MPI task configurations (). The results are reported in Table I ("Maverick" runs) and Table II (large-scale "Stampede" runs). First, we interpret runs (#1–#3), which represent a strong scaling analysis (in general, strong scaling is what matters most in image registration). From 32 tasks to 512 tasks the parallel efficiency is 67%, whereas from 32 to 1024 tasks the efficiency is 50%. This is not ideal; however, it is quite good. The majority of the calculation for low task counts goes to the interpolation computation, whereas, as we increase the number of tasks, the majority of the time goes to the FFT communication phase. (We use FFTs for the discretization of differential operators since this allows us to invert them at the cost of a spectral diagonal scaling. This offers the opportunity to exactly fulfill and efficiently eliminate the incompressibility constraint from the optimality system. It also allows for efficient preconditioning of the Hessian with essentially no construction cost; see §III for details.) Similar conclusions can be drawn for the set; again, going from 16 tasks to 256 tasks, we observe 50% efficiency. For the runs (#11–#13) the efficiency is 72%. The latter is a problem with 1.5 billion unknowns for the velocity, without counting the unknowns for the state and adjoint fields; it takes only 32 seconds to solve to an accuracy of practical interest.
If we look at the weak scaling results, we can consider runs #3, #8, and #13, in which we increase the problem size and the number of tasks each by a factor of eight. The overall timings are 15.2 seconds, 23 seconds, and 32 seconds, respectively, which again is not perfect. If we look more closely at how the time is spent, we observe that the FFT execution time scales nearly perfectly across these three runs (1.35 seconds, 1.56 seconds, and 1.77 seconds, respectively). The interpolation also scales well, both in terms of communication and computation. The deterioration of the overall time is due to the FFT communication costs. The largest problem we solved for the synthetic case was run #19, with 3.2 billion unknowns for the velocity field on 2048 MPI tasks on "Stampede"; it took 85 seconds. The good scalability of the computation phase confirms the algorithmic optimality of the preconditioned Newton–Krylov method. We report results for the incompressible case in Table III.
Table III fragment (incompressible case). Full columns: nodes, tasks, time to solution, and communication/execution times for the FFT and the interpolation; only the nodes and tasks columns survive here.

run   nodes  tasks
#20     1      1
#21     2      4
#22     4      8
#23     8     16
#24    16     32
IV-C Real-world registration problem
Table IV fragment (brain data sets). Full columns: nodes, tasks, time to solution, and communication/execution times for the FFT and the interpolation; only the nodes and tasks columns survive here.

run   nodes  tasks
#25     1      1
#26     2      4
#27     8     16
#28    16     32
#29    32    256
Table V fragment (sensitivity to the regularization parameter). The absolute times to solution are missing; relative times are with respect to run #30.

run   matvecs  time to solution (relative)
#30      43    (1.0)
#31     217    (4.6)
#32    1689    (35.0)
We report exemplary results for the brain data sets illustrated in Figure 6. We set the maximal number of outer iterations (Newton steps) to 50. We study strong scaling and the sensitivity of the convergence of our solver with respect to changes in the regularization weight. We report scalability results for the brain images in Table IV, display exemplary results for the considered datasets in Figure 7, and report results for varying choices of the regularization parameter in Table V.
We observe that we can significantly reduce the computational time by switching to parallel architectures. The scaling results are consistent with what we observed for the synthetic data sets. We can reduce the wall-clock time by two orders of magnitude if we go from one task on one node to 64 MPI tasks on 32 nodes. The entire problem fits on a single node, which demonstrates the practicability of our solver. The communication and execution times of the FFT and the interpolation drop significantly as we increase the number of nodes. The interpolation again contributes critically (about 50% or more of the time to solution).
As for the sensitivity with respect to the regularization parameter, we can see that the number of Hessian matvecs (a proxy for the overall number of Newton–Krylov iterations) increases as we reduce the regularization parameter. The time to solution increases by a factor of 35 for the smallest regularization weight reported here. This clearly demonstrates that the performance of our preconditioner is not ideal; it deteriorates as the regularization parameter is reduced. As we have seen in the previous section, the solver behaves independently of the mesh size. Implementing an improved scheme for preconditioning the Hessian requires more work.
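The claim that Hessian matvecs are a good proxy for the overall cost can be sanity-checked against Table V: the growth in matvec counts closely tracks the reported relative times. A small sketch using the numbers from the table:

```python
# Matvec counts and relative times to solution taken from Table V.
matvecs = {"#30": 43, "#31": 217, "#32": 1689}
rel_time = {"#30": 1.0, "#31": 4.6, "#32": 35.0}

# Growth in matvec count relative to the baseline run #30.
matvec_ratio = {run: round(n / matvecs["#30"], 1) for run, n in matvecs.items()}
print(matvec_ratio)  # {'#30': 1.0, '#31': 5.0, '#32': 39.3}

# The ratios (5.0x and 39.3x) track the reported relative times
# (4.6x and 35.0x), supporting the matvec count as a cost proxy.
```

The slight gap between the matvec ratios and the time ratios reflects the fixed per-solve overheads (setup, I/O) that do not grow with the Krylov iteration count.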
V Conclusion
We presented a complete algorithm for large deformation diffeomorphic medical image registration. We were able to solve problems of unprecedented scale. One may ask how such runs translate to a clinical setting. As the cost of computing drops, we are hopeful that 32- and 256-task calculations will become possible at modest cost.
The proposed algorithm is flexible and scalable. It supports different types of regularization functionals and can be extended to different image distance measures. Our approach can be easily extended to vector images and, with some additional work, can also be extended to nonstationary (time-varying) velocities [30, 9]. This will be necessary to register time series of images or to solve optical flow problems. All parallelism-related issues remain the same. A major remaining challenge is the design of preconditioners that are insensitive to the regularization parameter.
Finally, our algorithm is relevant to applications beyond medical imaging, for example in weather prediction and ocean physics (for tracking Lagrangian tracers in the oceans) [37], in the reconstruction of porous media flows [20], and in the registration of micro-CT images for material science and biology [32]. Although our method is highly optimized for regular grids with periodic boundary conditions, many aspects of our algorithm carry over.
References
 [1] S. S. Adavani and G. Biros, “Fast algorithms for source identification problems with elliptic PDE constraints,” SIAM Journal on Imaging Sciences, vol. 3, no. 4, pp. 791–808, 2008.
 [2] V. Akcelik, G. Biros, and O. Ghattas, “Parallel multiscale Gauss–Newton–Krylov methods for inverse wave propagation,” in Proc ACM/IEEE Conference on Supercomputing, 2002, pp. 1–15.
 [3] Y. Amit, “A nonlinear variational problem for image matching,” SIAM Journal on Scientific Computing, vol. 15, no. 1, pp. 207–224, 1994.
 [4] J. Ashburner, “A fast diffeomorphic image registration algorithm,” NeuroImage, vol. 38, no. 1, pp. 95–113, 2007.
 [5] J. Ashburner and K. J. Friston, “Diffeomorphic registration using geodesic shooting and GaussNewton optimisation,” NeuroImage, vol. 55, no. 3, pp. 954–967, 2011.
 [6] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook et al., “A reproducible evaluation of ANTs similarity metric performance in brain image registration,” NeuroImage, vol. 54, pp. 2033–2044, 2011.
 [7] S. Balay, S. Abhyankar, M. F. Adams, J. Brown et al., “PETSc Web page.” [Online]. Available: http://www.mcs.anl.gov/petsc
 [8] ——, “PETSc users manual,” Argonne National Laboratory, Tech. Rep. ANL-95/11 - Revision 3.7, 2016. [Online]. Available: http://www.mcs.anl.gov/petsc

 [9] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, “Computing large deformation metric mappings via geodesic flows of diffeomorphisms,” International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.
 [10] A. Borzì and V. Schulz, Computational optimization of systems governed by partial differential equations. Philadelphia, Pennsylvania, US: SIAM, 2012.
 [11] J. P. Boyd, Chebyshev and Fourier spectral methods. Mineola, New York, US: Dover, 2000.
 [12] M. Burger, J. Modersitzki, and L. Ruthotto, “A hyperelastic regularization energy for image registration,” SIAM Journal on Scientific Computing, vol. 35, no. 1, pp. B132–B148, 2013.
 [13] K. Chen and D. A. Lorenz, “Image sequence interpolation using optimal control,” Journal of Mathematical Imaging and Vision, vol. 41, pp. 222–238, 2011.
 [14] G. E. Christensen, X. Geng, J. G. Kuhl, J. Bruss et al., “Introduction to the nonrigid image registration evaluation project,” in Proc Biomedical Image Registration, vol. LNCS 4057, 2006, pp. 128–135.
 [15] G. Crippa, “The flow associated to weakly differentiable vector fields,” Ph.D. dissertation, University of Zurich, 2007.
 [16] K. Czechowski, C. Battaglino, C. McClanahan, K. Iyer et al., “On the communication complexity of 3D FFTs and its implications for exascale,” in Proc ACM/IEEE Conference on Supercomputing, 2012, pp. 205–214.
 [17] S. C. Eisenstat and H. F. Walker, “Choosing the forcing terms in an inexact Newton method,” SIAM Journal on Scientific Computing, vol. 17, no. 1, pp. 16–32, 1996.
 [18] A. Eklund, P. Dufort, D. Forsberg, and S. M. LaConte, “Medical image processing on the GPU–past, present and future,” Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013.
 [19] M. Falcone and R. Ferretti, “Convergence analysis for a class of highorder semiLagrangian advection schemes,” SIAM Journal on Numerical Analysis, vol. 35, no. 3, pp. 909–940, 1998.
 [20] J. Fohring, E. Haber, and L. Ruthotto, “Geophysical imaging of fluid flow in porous media,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. S218–S236, 2014.
 [21] M. Frigo and S. G. Johnson, “FFTW home page.” [Online]. Available: http://www.fftw.org
 [22] A. Gholami, J. Hill, D. Malhotra, and G. Biros, “AccFFT: A library for distributedmemory FFT on CPU and GPU architectures,” arXiv eprints, 2016, in review (arXiv preprint: http://arxiv.org/abs/1506.07933).
 [23] A. Gholami, A. Mang, and G. Biros, “An inverse problem formulation for parameter estimation of a reactiondiffusion model of low grade gliomas,” Journal of Mathematical Biology, vol. 72, no. 1, pp. 409–433, 2016.
 [24] A. Gholami and G. Biros, “AccFFT home page.” [Online]. Available: http://www.accfft.org
 [25] A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to parallel computing: Design and analysis of algorithms, 2nd ed. Addison Wesley, 2003.
 [26] M. D. Gunzburger, Perspectives in flow control and optimization. Philadelphia, Pennsylvania, US: SIAM, 2003.
 [27] M. E. Gurtin, An introduction to continuum mechanics, ser. Mathematics in Science and Engineering. Academic Press, 1981, vol. 158.
 [28] L. Ha, J. Krüger, S. Joshi, and T. C. Silva, “Multiscale unbiased diffeomorphic atlas construction on multiGPUs,” GPU Computing Gems Emerald Edition, vol. 1, pp. 771–791, 2010.
 [29] E. Haber and J. Modersitzki, “Numerical methods for volume preserving image registration,” Inverse Problems, vol. 20, pp. 1621–1638, 2004.

 [30] G. L. Hart, C. Zach, and M. Niethammer, “An optimal control approach for deformable registration,” in Proc IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 9–16.
 [31] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich, Optimization with PDE constraints. Berlin, DE: Springer, 2009.
 [32] S. T. Ho and W. D. Hutmacher, “A comparison of micro CT with other techniques used in the characterization of scaffolds,” Biomaterials, vol. 27, no. 8, pp. 1362–1376, 2006.
 [33] F. Ino, K. Ooyama, and K. Hagihara, “A data distributed parallel algorithm for nonrigid image registration,” Parallel Computing, vol. 31, no. 1, pp. 19–43, 2005.
 [34] K. Ito and K. Kunisch, Lagrange multiplier approach to variational problems and applications. Philadelphia, Pennsylvania, US: SIAM, 2008, vol. 15.
 [35] J. Shackleford, N. Kandasamy, and G. Sharp, High performance deformable image registration algorithms for manycore processors. Morgan Kaufmann, 2013.
 [36] R. Kakinuma, N. Moriyama, Y. Muramatsu, S. Gomi et al., “Ultrahighresolution computed tomography of the lung: Image quality of a prototype scanner,” PloS one, vol. 10, no. 9, p. e0137165, 2015.
 [37] E. Kalnay, Atmospheric modeling, data assimilation and predictability. Oxford University Press, 2002.
 [38] A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner et al., “Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration,” NeuroImage, vol. 46, no. 3, pp. 786–802, 2009.
 [39] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. W. Pluim, “ELASTIX: A toolbox for intensity-based medical image registration,” Medical Imaging, IEEE Transactions on, vol. 29, no. 1, pp. 196–205, 2010.
 [40] R. J. LeVeque, Numerical methods for conservation laws. Springer, 1992, vol. 132.
 [41] J.L. Lions, Some aspects of the optimal control of distributed parameter systems. Philadelphia, Pennsylvania, US: SIAM, 1972.
 [42] Y. Liu, A. Fedorov, R. Kikinis, and N. Chrisochoides, “Realtime nonrigid registration of medical images on a cooperative parallel architecture,” in Proc IEEE International Conference on Bioinformatics and Biomedicine, 2009, pp. 401–404.
 [43] M. Lorenzi, N. Ayache, G. B. Frisoni, and X. Pennec, “LCC-demons: a robust and accurate symmetric diffeomorphic registration algorithm,” NeuroImage, vol. 81, pp. 470–483, 2013.
 [44] F. Lüsebrink, A. Wollrab, and O. Speck, “Cortical thickness determination of the human brain using high resolution 3T and 7T MRI data,” Neuroimage, vol. 70, pp. 122–131, 2013.
 [45] A. Mang and G. Biros, “An inexact Newton–Krylov algorithm for constrained diffeomorphic image registration,” SIAM Journal on Imaging Sciences, vol. 8, no. 2, pp. 1030–1069, 2015.
 [46] ——, “Constrained regularization schemes for diffeomorphic image registration,” SIAM Journal on Imaging Sciences, 2016, to appear (arXiv preprint: http://arxiv.org/abs/1503.00757).
 [47] ——, “A SemiLagrangian twolevel preconditioned Newton–Krylov solver for constrained diffeomorphic image registration,” arXiv eprints, 2016, in review (arXiv preprint: http://arxiv.org/abs/1604.02153).
 [48] M. Modat, G. R. Ridgway, Z. A. Taylor, M. Lehmann et al., “Fast free-form deformation using graphics processing units,” Computer Methods and Programs in Biomedicine, vol. 98, no. 3, pp. 278–284, 2010.
 [49] J. Modersitzki, Numerical methods for image registration. New York: Oxford University Press, 2004.
 [50] ——, FAIR: Flexible algorithms for image registration. Philadelphia, Pennsylvania, US: SIAM, 2009.
 [51] J. Nocedal and S. J. Wright, Numerical Optimization. New York, New York, US: Springer, 2006.
 [52] J. A. Schackleford, N. Kandasamy, and G. C. Sharp, “On developing Bspline registration algorithms for multicore processors,” Physics in Medicine and Biology, vol. 55, no. 21, pp. 6329–6351, 2010.
 [53] R. Shams, P. Sadeghi, R. A. Kennedy, and R. I. Hartley, “A survey of medical image registration on multicore and the GPU,” Signal Processing Magazine, IEEE, vol. 27, no. 2, pp. 50–60, 2010.
 [54] D. Shen and C. Davatzikos, “Very highresolution morphometry using masspreserving deformations and HAMMER elastic registration,” NeuroImage, vol. 18, no. 1, pp. 28–41, 2003.
 [55] A. Sotiras, C. Davatzikos, and N. Paragios, “Deformable medical image registration: A survey,” Medical Imaging, IEEE Transactions on, vol. 32, no. 7, pp. 1153–1190, 2013.
 [56] Z. Starosolski, C. A. Villamizar, D. Rendon, M. J. Paldino et al., “Ultra highresolution in vivo computed tomography imaging of mouse cerebrovasculature using a long circulating blood pool contrast agent,” Scientific Reports, vol. 5, no. 10178, 2015.
 [57] R. Temam, Navier–Stokes equations: Theory and numerical analysis. North-Holland Pub. Co., 1977.
 [58] R. Tomer, L. Ye, B. Hsueh, and K. Deisseroth, “Advanced CLARITY for rapid and highresolution imaging of intact tissues,” Nature protocols, vol. 9, no. 7, pp. 1682–1697, 2014.
 [59] T. ur Rehman, E. Haber, G. Pryor, J. Melonakos, and A. Tannenbaum, “3D nonrigid registration via optimal mass transport on the GPU,” Medical Image Analysis, vol. 13, no. 6, pp. 931–940, 2009.
 [60] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache, “Symmetric logdomain diffeomorphic registration: A demonsbased approach,” in Proc Medical Image Computing and ComputerAssisted Intervention, vol. LNCS 5241, no. 5241, 2008, pp. 754–761.
 [61] ——, “Diffeomorphic demons: Efficient nonparametric image registration,” NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.
 [62] S. K. Warfield, M. Ferrant, X. Gallez, A. Nabavi et al., “Realtime biomechanical simulation of volumetric brain deformation for image guided neurosurgery,” in Proc ACM/IEEE Conference on Supercomputing, 2000, pp. 23–23.
 [63] Y. Yin, E. A. Hoffman, and C.L. Lin, “Mass preserving nonrigid registration of CT lung images using cubic Bspline,” Medical Physics, vol. 36, no. 9, pp. 4213–4222, 2009.
 [64] M.Q. Zhang, L. Zhou, Q.F. Deng, Y.Y. Xie et al., “Ultrahighresolution 3D digitalized imaging of the cerebral angioarchitecture in rats using synchrotron radiation,” Scientific Reports, vol. 5, no. 14982, 2015.