1 Introduction
Deformable registration is a key technology in the medical imaging. It is about computing a map that establishes a meaningful spatial correspondence between two (or more) images (the reference (fixed) image) and (the template (deformable or moving) image; image to be registered) of the same scene [96, 43]. Numerous approaches for formulating and solving image registration problems have appeared in the past; we refer to [96, 97, 43, 60, 116] for lucid overviews. Image registration is typically formulated as a variational optimization problem that consists of a data fidelity term and a Tikhonov regularization functional to overcome illposedness [43, 40]. A key concern is that is a diffeomorphism, i.e., the map is differentiable, a bijection, and has a differentiable inverse. We require that the determinant of the deformation gradient does not vanish or change sign. An intuitive approach to safeguard against nondiffeomorphic maps is to add hard and/or soft constraints on to the variational optimization problem [29, 58, 103, 108]. An alternative strategy to ensure regularity is to introduce a pseudotime variable and invert for a smooth velocity field that encodes the map [16, 37, 94, 126]; existence of a diffeomorphism can be guaranteed if is adequately smooth [16, 30, 37, 121]. This model is, as opposed to many traditional formulations that directly invert for [88, 58, 108, 95], adequate for recovering large, highly nonlinear deformations. Our approach falls into this category. There exists a large body of literature that, in many cases, focuses on theoretical considerations [37, 94, 131, 129, 130]. There is much less work on the design of efficient solvers for velocity based large deformation diffeomorphic image registration; examples are [11, 8, 9, 16, 34, 132, 64, 126, 102]
. Most of these solvers use first order methods for numerical optimization and/or are based on heuristics that do not guarantee convergence. Due to computational costs, early termination results in compromised registration quality. Our intention is to introduce an efficient solver for diffeomorphic image registration problems that
(i) uses stateofthe art algorithms, (ii) is scalable to thousands of cores, (iii) requires minimal parameter tuning, (iv) and produces highfidelity results with guaranteed regularity on a discrete level.spatial domain;  
spatial coordinate;  
pseudotime variable;  
reference image  
template image (image to be registered/deformed)  
stationary velocity field  
deformation map (Eulerian/pullback map)  
state variable (transported intensities)  
final state;  
adjoint variable  
incremental state variable  
incremental state variable  
Lagrangian functional  
(reduced) gradient  
(reduced) Hessian operator  
partial derivative with respect to ^{th} coordinate direction  
gradient operator;  
divergence operator  
∇

Laplacian operator (vectorial and scalar) 
number of grid points (spatial domain);  
number of cells in temporal grid  
number of unknowns;  
CLAIRE  constrained large deformation diffeomorphic image registration 
CFL  Courant–Friedrichs–Lewy (condition) 
CHEB()  Chebyshev (iteration) with fixed iteration number [49, 53] 
FFT  fast Fourier transform 
GPL  GNU General Public License 
LDDMM  large deformation diffeomorphic metric mapping [16] 
matvec  matrix vector product 
MPI  Message Passing Interface 
PETSc  Portable Extensible Toolkit for Scientific Computation [13] 
PCG  preconditioned conjugate gradient (method) [66] 
PCG()  PCG method with relative tolerance 
RK2  2nd order Runge–Kutta method 
(S)DDEM  (symmetric) diffeomorphic demons [124, 126] 
(S)LDDEM  (symmetric) logdomain diffeomorphic demons [125] 
SL  semiLagrangian (scheme) 
TAO  Toolkit for Advanced Optimization [98] 
1.1 Outline of the Method
We use an optimal control formulation. The task is to find a smooth velocity field (the ‘’control variable‘’) such that the distance between two images (or densities) is minimized, subject to a regularization norm for and a deformation model given by a hyperbolic PDE constraint.^{1}^{1}1We refer to [20, 51, 68, 27] for more details on recent developments in PDEconstrained optimization. We review PDEconstrained optimization formulations in the context of image analysis in [86]. More precisely, given two functions (reference image) and (template image) compactly supported on an open set with boundary , we solve for a stationary velocity field as follows: equationparentequation
(1a)  
subject to  
(1b)  
(1c) 
with periodic boundary conditions on . Here, (the ‘’state variable‘’) corresponds to the transported intensities of subject to the velocity field ; in our formulation, —i.e., the solution of (1b)—is equivalent to for all in . The regularization functional is a Sobolev norm that, if chosen appropriately, ensures that gives rise to a diffeomorphism [16, 37, 62, 121]. We augment the formulation in (1) by constraints on the divergence of to control volume change (see also [83, 30]). Details can be found in §2.1.
Problem (1) is illposed and involves illconditioned operators. We use the method of Lagrange multipliers to solve the constrained optimization problem (1). Our solver is based on an optimizethendiscretize approach; we first derive the optimality conditions and then discretize in space using a pseudospectral discretization with a Fourier basis.^{2}^{2}2A discretizethenoptimize approach can be found in [87]. We use a globalized, inexact, preconditioned Gauss–Newton–Krylov method to solve for the first order optimality conditions [82]. The hyperbolic transport equations that appear in our formulation are integrated using a semiLagrangian method [117, 41]. Our solver uses distributedmemory parallelism and can be scaled up to thousands of cores [83, 47]. The linear solvers and the Gauss–Newton optimizer are built on top of PETSc [14] and TAO [98]. Details can be found in §2.3.
1.2 Contributions
We follow up on our former work on constrained diffeomorphic image registration [82, 83, 84, 85]. Our optimal control formulation is introduced in [82, 83]. Our Newton–Krylov solver has been proposed in [82]; an improved solver is described in [84].^{3}^{3}3A comparison of our Newton–Krylov scheme against a (preconditioned) gradient descent scheme (i.e., a gradient descent in the dual space , , induced by the regularization norm; see, e.g., [16, 62]) can be found in [82]. We extend our original solver in [82] to the 3D setting in [85, 47]. The focus in [85, 47] is the scalability of our solver on high performance computing platforms. In [107] we presented an integrated formulation for diffeomorphic image registration and biophysical tumor growth simulations. In this work, we focus on the registration performance and introduce additional algorithmic improvements to our threedimensional solver. Our contributions are:

We implement an improved preconditioner for the reduced space Hessian (originally described in [84] for the twodimensional case). We empirically evaluate several variants of this preconditioner.

We introduce options for a grid, scale, and parameter continuation to our threedimensional solver.

We study strong scaling performance of our improved solver.

We make our software termed CLAIRE (which stands for constrained large deformation diffeomorphic image registration) available under GPL license (https://github.com/andreasmang/claire).
1.3 Limitations and Unresolved Issues
Several limitations and unresolved issues remain. We assume similar intensity statistics for the reference image and the template image . This is a common assumption in many deformable image registration algorithms [16, 62, 77, 99, 127]. To enable the registration of images with a more complicated intensity relationship, more involved distance measures need to be considered (see, e.g., [96, 116]). Our formulation is not symmetric, i.e., not invariant to a permutation of the reference and template image. The extension of our scheme to the symmetric case is mathematically straightforward but its efficient implementation is nontrivial (see [10, 80, 125] for examples). This will be the subject of future work. We invert for a stationary velocity field (i.e., the velocity is not a function of time). Stationary paths on the manifold of diffeomorphisms are the group exponentials (i.e., oneparameter subgroups) and—in contrast to LDDMM methods [16, 62, 127], they do not cover the entire space of diffeomorphisms. Developing an effective, parallel solver for nonstationary requires more work.^{4}^{4}4We note that our Matlab prototype implementations support nonstationary [82, 87].
1.4 Related Work
With this work we follow up on our prior work on constrained diffeomorphic image registration [82, 83, 85, 87, 89]. We do not directly invert for the deformation map (see [29, 58, 88, 42, 57, 97, 104, 73] for examples) but for its velocity . We use an optimal control formulation. The PDE constraint is, in its simplest form, a hyperbolic transport equation. We refer to [20, 27, 51, 68, 79] for an introduction to the theory and algorithmic developments in PDEconstrained optimization. The numerics for these type of problems have to be tailored to the specific structure of the control problem, which is dominated by the PDE constraints that appear in the formulation; see [1, 21], [2, 46, 90, 118], and [18, 26, 77, 65, 128] for elliptic, parabolic, and hyperbolic examples, respectively.
Excellent reviews on image registration include [96, 60, 116]. Optimal control formulations for image registration have been described in [26, 30, 62, 77, 78, 127, 65]. Related formulations for optimal mass transport are described in [18, 56, 122, 87]. Our work differs from optimal mass transport in that intensities are constant along the computed characteristics (i.e., mass is not preserved). Our formulation also shares characteristics with traditional optical flow formulations [69, 72, 105]. The key difference is that we treat the transport equation for the image intensities as a hard constraint. PDEconstrained formulations for optical flow, which are equivalent to our formulation, are described in [6, 15, 26, 30]. Our work is closely related to the LDDMM approach [10, 11, 16, 37, 121, 129] (c.f., [82, 62]), which builds upon the pioneering work in [32]. LDDMM uses a nonstationary velocity but there exist variants that use stationary [7, 8, 64, 81, 80, 126] because they are much cheaper and if we are only interested in registering one template image to the reference stationary produce good results. Another strategy to reduce the size of the search space is ‘’geodesic shooting‘’ [9, 93, 127, 129, 133]. Here, the deformation map is parameterized by an initial momentum (or, similarly, the velocity at ).
Unlike existing LDDMM approaches, CLAIRE features explicit control on the determinant of the deformation gradient by introducing additional constraints for the divergence of . Our formulation for controlling the divergence of was originally proposed in [83]; a similar formulation is described in [26]. Other works that consider divergence free velocities have been described in [30, 67, 91, 105, 106].
With this work we release CLAIRE, a software package for velocity based diffeomorphic image registration. Among the most popular, publicly available packages for large deformation diffeomorphic image registration are DEMONS [125, 126], ANTS [11], and DARTEL [8]. Other popular packages for deformable registration are IRTK [104], ELASTIX [73], NiftyReg [95], and FAIR [97], which, with the exception of FAIR, are based on (low dimensional) parametric deformation models that use an expansion based on basis splines to represent the deformation map .
There are only few works that discuss effective numerical implementations for velocity based diffeomorphic image registration. Despite the fact that first order methods for numerical optimization display poor convergence for nonlinear, illposed problems, most work, with the exception of ours and [9, 18, 63, 114, 126, 65], uses first order information for numerical optimization. We use a globalized Newton–Krylov method instead. For these methods to be efficient, it is critical to design an effective preconditioner (something we address with this work; we refer to [17] for an overview on preconditioning of saddle point problems). Preconditioners for problems that are similar to ours can be found in [18, 114, 65]. Another critical component of an effective method for PDEconstrained optimization problems is the solver for the PDE operators that appear in the optimality systems. In our case, these PDE operators are hyperbolic transport equations. Several strategies to efficiently solve these equations have been considered; they range from conditionally stable total variation diminishing [26], second and fourthorder RungeKutta [82, 83, 102], and highorder essentially nonoscillatory [62] schemes, to unconditionally stable implicit LaxFriedrich [18, 114], SL [16, 30, 89, 85], and Lagrangian [87] schemes. We use a SL scheme.
Examples for parallel solvers for PDEconstrained optimization problems can be found in [3, 4, 22, 23, 24, 20, 19, 113]. We refer to [39, 44, 110, 112] for surveys on parallel algorithms for image registration. Implementations, such as DEMONS [125, 126], ANTS [11], or ELASTIX [73], which are largely based on kernels implemented in the ITK package [70], exploit multithreading for parallelism. GPU implementations of different variants of mapbased parametric approaches based on basis splines are described in [109, 95, 111]. A GPU implementation of a mapbased nonparametric image registration approach that builds upon FAIR [97] is described in [74]. GPU implementations of diffeomorphic velocity based registration approaches are described in [55, 54, 115, 122, 123]. The work that is most closely related to ours [85, 47], is [55, 54, 123]. In [55, 54] a multiGPUaccelerated implementation of the diffeomorphic image registration approach described in [71] is presented. The work in [123] discusses a GPUaccelerated implementation of DARTEL [8].
What sets our work apart are the numerics and our distributedmemory implementation: We use highorder numerical methods (second order time integration, cubic interpolation, and spectral differentiation). Our solver uses MPI for parallelism and has been deployed to high performance platforms
[83, 47]. This allows us to target applications of unprecedented scale such as CLARITY imaging [120], a novel optical imaging technique that delivers submicron resolution. The first attempt of applying LDDMM to CLARITY is described in [75]. The data is downsampled by about a factor of 100 (from to ) to be able to perform the registration. By exploiting a multiresolution strategy they were able to register the images in about 100 minutes. We will see that we can solve problems with unknowns in on 22 compute nodes (256 MPI tasks) and in less than if we use 342 compute nodes (4096 MPI tasks). However, CLAIRE cannot only be used to solve problems of unprecedented scale but also paves the way for realtime applications of large deformation diffeomorphic image registration. We are the first group to present a distributedmemory implementation of a globalized Gauss–Newton–Krylov method for these type of problems.1.5 Outline
We present our approach for large deformation diffeomorphic image registration in §2, which comprises the formulation of the problem (see §2.1), a formal presentation of the optimality conditions (see §2.2), and a discussion of the numerical implementation (see §2.3). Numerical experiments are reported in §3. We conclude with §4. Supplementary material is presented in §A, §B, and §C.
2 Methods
In what follows, we describe the main building blocks of our formulation, our solver, and its implementation, and introduce new features that distinguish this work from our former work on constrained diffeomorphic image registration [82, 83, 85, 47, 87, 84]. We use a globalized preconditioned, inexact, reduced space Gauss–Newton–Krylov method to solve (1). Our scheme has been described in [82, 85, 84]. Extensions to our original optimal control formulation [82] are described in [83]. The parallel implementation of the main kernels of our solver is described in [85, 47].
2.1 Formulation
Given two images—the reference image and the template image —compactly supported on , with boundary and closure , our aim is to compute a plausible deformation map such that [96, 97, 43]. We consider a map to be plausible, if it is a diffeomorphism, i.e., an invertible map, which is continuously differentiable (in particular a function) and maps onto itself. In our formulation, we do not directly invert for ; we introduce a pseudotime variable and invert for a stationary velocity , instead. In particular, we solve for the velocity field and a mass source map as follows [83]: equationparentequation
(2a) 
subject to
(2b)  
(2c)  
(2d) 
with periodic boundary conditions on , and , , . The state variable in (2b) represents the transported intensities of subjected to the velocity field ; the solution of (2b), i.e., , is equivalent to , where is the Eulerian (or pullback) map.^{5}^{5}5The deformation map and the determinant of the deformation gradient can be computed from as a postprocessing step (see [82, 83]). We use a squared distance to measure the proximity between and . The parameters and control the contribution of the regularization norms for and . The constraint on the divergence of in (2d) allows us to control the compressibility of . If we set in (2d) to zero is incompressible, i.e., up to numerical accuracy [52]; see [30, 67, 82, 83, 85, 91, 105, 106] for examples. By introducing a nonzero masssource map , we can relax this model to nearincompressible diffeomorphisms [83, 25]; the regularization on in (2a) acts like a penalty on the divergence of ; we use an norm (a similar formulation can be found in [25]).
If we neglect the constraint (2d) we obtain a formulation that is closely related to available models for velocitybased diffeomorphic image registration [8, 16, 62, 64, 127, 126] (see [62, 82]).^{6}^{6}6We, likewise to [7, 8, 64, 81, 80, 126], invert for a stationary . The original LDDMM formulation [16, 37, 121, 129] uses timedependent velocity fields. We note that stationary do not cover the entire space of diffeomorphisms. The paths obtained from stationary are oneparameter subgroups that do not depend on any metric; the definition of such a metric may be desirable in certain applications [16, 92, 132] and, in general, requires nonstationary velocities. We observed for the 2D case that using timevarying did not improve the mismatch for twoimage registration problems [82]. Our solver supports different Sobolev (semi)norms to regularize . We summarize the available operators in Tab. 2. The choice of the differential operator not only depends on application requirements but is also critical from a theoretical point of view; an adequate choice guarantees existence and uniqueness of an optimal solution of the control problem [15, 16, 26, 30, 77] (subject to the smoothness properties of the images).
seminorm  seminorm  seminorm  
∇


2.2 Optimality Condition and Newton Step
We use the method of Lagrange multipliers [79] to turn the constrained problem (2) into an unconstrained one; neglecting boundary conditions, the Lagrangian functional is given by
(3)  
with Lagrange multipliers for the hyperbolic transport equation (2b), for the initial condition (2c), and for the incompressibility constraint (2d). Formally, we have to compute variations of with respect to the state, adjoint, and control variables. We will only consider a reduced form (after eliminating the incompressibility constraint) of the optimality system—a system of nonlinear PDEs for , , and . Details on how we formally arrive at this reduced from can be found in [82, 83].
The evaluation of the reduced gradient (the first variation of the Lagrangian in (3) with respect to ) for a candidate requires several steps. We first solve the state equation (2b) with initial condition (2c) forward in time to obtain the state variable . Given we then compute the adjoint variable by solving the adjoint equation equationparentequation
(4a)  
(4b) 
with periodic boundary conditions on , backward in time. Once we have the adjoint and state fields and , respectively, we can evaluate the expression for the reduced gradient given by
(5) 
The differential operator in (5) corresponds to the first variation of the regularization norm for in (3), resulting in an elliptic, biharmonic, or triharmonic control equation for , respectively (see Tab. 2). The operator projects onto the space of incompressible or nearincompressible velocity fields. We have
for the incompressible case. If we neglect the incompressibility constraint (2d), in (5) is an identity operator. The dependence of and on is “hidden” in (2b) and (4a), respectively.
The first order optimality condition (control or decision equation) requires that for an admissible solution to (2). Most available registration packages for large deformation diffeomorphic registration use Sobolev gradient descent type optimization schemes to find an optimal point [16, 62, 127]. The search direction is typically given by the gradient in the dual space , :
(6) 
These methods have linear convergence. Newtontype methods yield superlinear to quadratic convergence [100].^{7}^{7}7We demonstrated in [82] that our (Gauss–)Newton–Krylov method is superior to this Sobolev gradient descent scheme. However, their implementation requires more work and, if implemented naively, they can become computationally prohibitive.
We use a Newton–Krylov method to solve for , which is globalized using an Armijo linesearch [100]. We use a reduced space formulation [23], in which, for every value of , we eliminate and . Then, given a candidate , the Newton direction is computed based on the second variations of the Lagrangian in (3). The expression for the Hessian matvec (action of the Hessian on the vector ) is given by
(7) 
where , , and denote the incremental state, adjoint, and control variable, respectively. We use the notation for the application of the reduced space Hessian to indicate that is a function of and through , , , and in (2b), (8a), (4a), and (9a). Each application of the reduced Hessian in (7) requires the following steps: Given and for some candidate we have to solve^{8}^{8}8We assign the computational costs for computing and to the evaluation of the gradient in (5). equationparentequation
(8a)  
(8b) 
for (forward in time) and equationparentequation
(9a)  
(9b) 
for (backward in time). Computing the Newton step requires the (iterative) solution of
(10) 
where the right hand is given by the reduced gradient in (5).
2.3 Numerics
In the following, we describe our distributedmemory solver for 3D diffeomorphic image registration problems.^{9}^{9}9Our original Newton–Krylov scheme is described in [82]. Its parallel implementation is described in [85, 47], where we focus on the scalability of our solver. Improvements of our solver in [82] are described in [84]; the work in [82, 84] is limited to the 2D case. In what follows, we present an extensions of the 3D implementation described in [85, 47].
2.3.1 Discretization
We discretize in space on a regular nodal grid with grid nodes , , , , and periodic boundary conditions; denotes the Hadamard division. In the continuum, we model images as compactly supported (periodic), smooth functions. We apply a Gaussian smoothing (in the spectral domain) with a bandwidth of , and mollify the discrete data, to meet these requirements. We rescale the images to an intensity range of prior to registration.
We use a trapezoidal rule for numerical quadrature and a spectral projection scheme for all spatial operations. The mapping between the spectral and the spatial domain is done using forward and inverse FFTs. All spatial derivatives are computed in the spectral domain; we first take the FFT, then apply the appropriate weights to the spectral coefficients, and then take the inverse FFT. Notice, that this scheme allows us to efficiently and accurately apply differential operators and their inverses. Consequently, the main cost of our scheme is the solution of the transport equations (2b), (4a), (8a), and (9a), and not the inversion of differential (e.g., elliptic or biharmonic) operators. We use a nodal discretization in time, which results in spacetime fields we need to solve for. We use an unconditionally stable SL scheme [117, 41] to solve the transport equations, which allows us to keep small (we found that yields a good compromise between runtime and numerical accuracy).
2.3.2 Newton–Krylov Solver
The discretized Hessian operator , , is dense and illconditioned [82]. Constructing and storing is out of question, especially for 3D problems (e.g., ). We use iterative, matrixfree Krylov subspace methods, which only require the action of on a vector (see (7)). We use a PCG method [66] under the assumption that is a symmetric, positive (semi)definite operator.^{10}^{10}10The exact Hessian is of course guaranteed to be symmetric. But since we use an optimizethendiscretize formulation, discretization discrepancies between the adjoint and the forward operators result in a nonsymmetric Hessian. However, the error in symmetry is small compared to the tolerance levels we use in our solver. In general, the more accurately we want to compute the gradient, the more accurately we need to solve the forward and adjoint problems. The linear solves are done inexactly, with a tolerance that is proportional to the norm of the reduced gradient [36, 38]; our solver supports quadratic and superlinear forcing sequences (see, e.g., [100, 166ff]).
We note that it cannot be guaranteed that the Hessian operator is positive definite far away from the optimum. As a remedy we opt for a Gauss–Newton approximation. This corresponds to dropping all terms in (7) and (9) that involve [82]. The rate of convergence drops from quadratic to superlinear. As tends to zero (i.e., the mismatch goes to zero; see (4)), we recover quadratic convergence.
Preconditioner
As we have pointed out in §2.2, the transport equations (8) and (9) have to be solved every time in (7) is applied to a vector. To design an efficient algorithm we have to keep the number of Hessian matvecs (i.e., the number of PCG iterations) as small as possible. This necessitates the design of an effective preconditioner. We refer to [82] for a study of the spectral properties of the Hessian; for practical values of the regularization parameter
, the Hessian behaves as a compact operator; larger eigenvalues are associated with smooth eigenvectors
[82].The speed of convergence of the PCG method depends on the distance of from identity; ideally, the spectrum of is clustered around one. In former work, we have considered two preconditioners, a spectral preconditioner based on the inverse of the regularization operator^{11}^{11}11The operator is singular if we consider a seminorm as regularization model in (2
). We bypass this problem by setting the zero singular values of the operator
to one before computing its inverse. (a common choice in PDEconstrained optimization problems [5, 28]) and a twolevel (nested) preconditioner (proposed and tested in [84] for the 2D case). The latter strategy can be interpreted as a simplified twolevel multigrid Vcycle. We present these two preconditioners in more detail next. Both of our approaches are matrixfree, i.e., they do not require assembling or storage of the preconditioner.Applying the spectral preconditioner to (10) yields
(11) 
The preconditioned Hessian operator in (11) is a compact perturbation of the identity; we expect a rate of convergence that is independent of the grid size. Notice that the inversion of can be done efficiently and exactly due to the spectral discretization in space; it only requires a forward and an inverse FFT, and an appropriate diagonal scaling in the spectral domain. This preconditioner is used in our past work [82, 83, 85, 87, 86].
We need a good approximation of the inverse of the Hessian to have an effective preconditioner. Our second approach uses the inverse of the reduced space Hessian as a preconditioner (in theory, the ideal preconditioner). The key idea is to amortize the computational costs by solving for the action of the inverse of the Hessian inexactly, on a coarser grid. This preconditioner will only act on the low frequency components of the input vector. We ensure this numerically by separating the spectrum. We do so by applying ideal (spectral) low and highpass filters to the vector the preconditioner is applied to (input vector). We proceed as follows: We apply a spectral restriction operator to present the filtered low frequency modes of our input vector to the algorithm for inverting the Hessian on the coarse grid. We apply a spectral prolongation operator to the output vector of this algorithm to project it back to the fine grid, and subsequently add the filtered high frequency modes of the input vector to it (i.e., we leave the high frequency components of the input vector untouched). To be able to evaluate the Hessian matvec, we also need to apply restriction operators to the adjoint and state variable and , respectively.
We do not apply this scheme to (7) but to the spectrally preconditioned Hessian operator in (11)—with a slight modification. The preconditioned Hessian in (11) is no longer symmetric. Since we can efficiently compute the square root of the operator we use
(12) 
instead.^{12}^{12}12Notice that the left preconditioned Hessian in (11) and the splitpreconditioned Hessian in (12) have identical spectral properties. Consequently, we can use a PCG algorithm to compute the action of the inverse of in (12). Notice that the inverse of the regularization operator is a smoother; we can view it as an approximation of a smoother in multigrid methods.
The final building block of our twolevel scheme for preconditioning (10) is the solver used to invert the reduced space Hessian on the coarse grid. Forming and storing this operator is out of question (too costly); we opt for a matrixfree iterative solver to compute the action of its inverse on a vector, instead. We have several choices. If we opt for a nested Krylov subspace method, such as PCG, we have to use a smaller tolerance for the inversion of the reduced space Hessian , , on the coarse grid (i.e., the preconditioner) than we use for the inversion of on the fine grid, i.e., with . This is due to the fact that Krylov subspace methods are nonlinear operators; this scheme would violate the theory of preconditioning linear systems if the preconditioner was not inverted accurately. Another approach is to use a semiiterative Chebyshev (CHEB) method [53], which is a fixed linear operator for a particular choice of eigenvalue bounds [49]
. We have to estimate these bounds at every Newton iteration; this is costly. We do so using the Lanczos method. This is the method we used in
[84].^{13}^{13}13We have also considered a flexible PCG method [12, 101] for the inversion of the Hessian operator. We observed that the performance deteriorated significantly as we reduce the regularization parameter. We disregard this approach.Stopping Criteria
We terminate the inversion if the norm of the gradient in (5) is reduced by a factor of , i.e., if , where is the gradient at iteration and is the gradient for the initial condition of the inversion. In most of our experiments, we use . We also provide an option to set a lower bound for the norm of the gradient (the default value is ). CLAIRE also features the stopping criteria discussed in [97, 48, 82] (not considered here).
2.3.3 Hyperbolic PDE Solver
We use a fully explicit, unconditionally stable SL scheme [41, 117] to solve the hyperbolic transport equations that appear in our Eulerian formulation ((2b), (4a), (8a), and (9a)).^{14}^{14}14We introduced our SL formulation in [84] and introduced a first parallel implementation in [85]. SL methods are a hybrid between Eulerian and Lagrangian methods. To explain our scheme, we consider the general transport equation
with periodic boundary conditions on . We introduce the notion of a characteristic that is defined by the flow on a time interval , . The interval corresponds to a single time step of size of our scheme for integrating the transport equation with .
If the Lagrangian derivative vanishes, is constant along the characteristic . The value of at a given coordinate at can then be computed by tracing the characteristic backward in time to obtain the location (departure point) the particle originated from. We can trace the characteristic by solving in with final condition at . We solve this ODE backward in time based on a fully explicit RK2 scheme (Heun’s method):
Notice, that is stationary; we have to compute the characteristic only once for a single time step at each Newton iteration (more precisely, once for the state and once for the adjoint equations). The coordinate is in general an offgrid location; the evaluation of requires three scalar interpolations. We use a cubic interpolation model.
In a subsequent step, we have to solve along the characteristic . As we have mentioned earlier, is constant along for the homogeneous case (). Hence, we can compute at by scalar interpolation at the offgrid locations ; the solution of the underlying PDE overall requires interpolations. If the Lagrangian derivative does not vanish, we compute the solution based on an RK2 scheme. Depending on the form of , this requires the evaluation of differential and/or interpolation operators for each datum evaluated at (see [84] for details). Overall, our scheme is second order accurate in time and third order accurate in space.
2.3.4 Continuation Schemes
The automatic selection of an optimal regularization parameter for unseen data is a topic of research by itself [61, 59]. The choice of the regularization norm (i.e., the regularity requirements for the deformation map ) depends on the application. Subject to smoothness requirements on the data, we know that we require at least regularity of to guarantee that is a diffeomorphism [16, 37, 121]. In a practical setting, we assume that it is sufficient to require that is strictly positive (up to numerical accuracy). We can, for instance, monitor during the optimization and adjust the regularization parameter to guarantee that is locally diffeomorphic.
Remark
In our formulation, does not appear. Also, we have to differentiate (or the displacement field) once we found ; resolving is challenging and may require a significantly larger amount of discretization points than we use for the solution of the optimization problem. For this reason, we do not compute ; we directly transport .
Probing for an adequate is expensive, since it typically requires a repeated solution of the inverse problem. We assume that, once we have found an adequate , we can use this parameter for similar registration problems. Such cohort studies are quite typical in medical imaging. We determine based on a binary search (originally proposed in [82]; a similar strategy is described in [59]). The measure for this search is a bound (a user defined parameter) on . We choose so that . We start at level with a regularization parameter of and reduce by one order of magnitude until we hit the tolerance . At this point, we execute a binary search. We terminate the binary search if the update for at level is smaller than of the value for at which we reached the bound (i.e., if we reach for we terminate the binary search if the update for between level and is smaller than ). We use the solution of level as initial guess for the next level. We perform at least one Gauss–Newton iteration, and terminate the level if the gradient is reduced by (with respect to the norm of the gradient for our initial guess ; notice that we only need to evaluate the reference gradient once for this choice).
We implement the following acceleration schemes to speed up the rate of convergence and reduce the likelihood to get trapped in local minima:

Parameter continuation. For a given target parameter , we reduce (starting with ) by one order of magnitude until we reach the order of . We suggest to solve the problem on each level with a larger tolerance for the stopping criteria (e.g., ). On the final level, we set to and solve with the user defined tolerance . We perform at least one Gauss–Newton iteration per level.

Grid continuation. Our solver also features a (spectral) grid continuation (i.e., multilevel) scheme. We solve the problem on coarse grid first, and successively refine until we reach the resolution of the original image data. The restriction and prolongation is performed in the spectral domain. The solution on the coarse grid at level is used as the initial condition for the fine grid at level . At each level, we solve the registration up to the user defined tolerance .

Scale continuation. We have also implemented a multiscale approach. We use a Gaussian kernel (applied in the spectral domain) for the scale space representation of the image data.
We summarize the critical parameters of CLAIRE in Tab. 3.
variable  meaning  suggested value  determined automatically 
regularization parameter for  —  yes  
regularization parameter for  no  
relative tolerance for gradient  no  
number of time steps  4  no  
bound for  0.25 (div) or 0.1 ()  no 
2.3.5 Implementation Details
We make CLAIRE available under GPL license. It can be downloaded at https://github.com/andreasmang/claire. CLAIRE is written in C/C++; it implements data parallelism via MPI. Our solver supports single and double precision. We use the PETSc library [13] for linear algebra, and PETSc’s TAO [98] for numerical optimization. We provide routines for the evaluation of the objective functional in (2a), the evaluation of the gradient in (5), the Hessian matvec in (7), the construction and application of the preconditioner, and routines that control the tolerances and implement the stopping conditions. We use the AccFFT package [45]—a parallel, opensource FFT library for CPU/GPU architectures developed in our group, to apply the spectral operators.
AccFFT dictates the data layout: We partition the data based on a pencil decomposition for 3D FFTs [50, 35]: Let denote the number of MPI tasks; then each MPI task gets grid points (i.e., we partition the domain along the  and axis into subdomains , , of size . There is no partitioning in time. In our improved implementation presented in this work we have significantly reduced the memory allocation for the Gauss–Newton approximation; we only store the time history of the state variable , , . This is accomplished by, e.g., performing the integration of
during the solution of the adjoint equation in (4) (we proceed in a similar fashion for the evaluation of (7), in particular, when computing the solution of the incremental adjoint equation (8)). Here, , represents the Kronecker product, and is the Hadamard product. We summarize the memory requirements in Tab. 4. Notice that we in addition to that require the memory of the Hessian matvec, if we consider the twolevel preconditioner (under the assumption that the coarse level is half the size of the fine level). The spectral preconditioner does not add to the memory pressure.
Newton  Gauss–Newton  
Gradient (see (5))  
Hessian matvec (see (7)) 
The scalability of the 3D FFT is well explored [50, 35, 45]. If we assume that the number of grid points , , is equal along each spatial direction, i.e., , each 3D FFT requires computations and communications, where is the startup time for the data transfer and represents the perword transfer time [50].
We use a tricubic interpolation model to evaluate offgrid points in our SL scheme (see §2.3.3). The parallel implementation of our interpolation kernel is introduced in [85] and improved in [47]. The polynomial is implemented in Lagrange form. The evaluation of this interpolation kernel requires the computation of twelve basis polynomials. The local support of the cubic basis is grid points. Overall, this results in a complexity of computations. We have implemented a SIMD vectorization based on advanced vector extensions (AVX2) for Haswell architectures, for the evaluation of the interpolation kernel. In addition to that we have introduced a binning method (based on Morton sorting) that reduces the cache misses (see [47] for details). This yields an interpolation kernel that is now bound by communication time instead of time spent in the interpolation (which was the case in [85]).
The communication costs are more difficult to estimate; they depend on the computed characteristic (see Fig. 2 for an illustration). If a departure point is owned by the current processor, we require no communication. If the values for a departure point are owned by another processor/MPI task (the “worker”), we communicate the coordinates from the “owner” to the “worker”. We then evaluate the interpolation model on the “worker” and communicate the result back to the “owner”. This results in a communication cost of per offgrid point not owned by a processor. To evaluate the interpolation model at offgrid points not owned by any MPI task (i.e., located inbetween the subdomains ), we add a layer of four ghost points (scalar values to be interpolated; see Fig. 2, right). This results in an additional communication cost of for each MPI task for the four face neighbors, where is the size of layer for the ghost points (in our case four). The communication with the four corner neighbors can be combined with the messages of the edge neighbors, by appropriate ordering of the messages. Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since our velocity field is stationary. We perform this communication when we evaluate the forward and the adjoint operators (i.e., during the evaluation of the objective functional and the reduced gradient).
3 Experiments
We report results for realworld and synthetic datasets. We evaluate the registration accuracy for 16 segmented MRI brain volumes [31]. Details on the considered datasets can be found in §A. We showcase two exemplary datasets in Fig. 3. Notice, that these datasets have been rigidly preregistered. We directly apply our method to this data (without an additional affine preregistration step). The runs were executed on the CACDS’s Opuntia server or on TACC’s Lonestar 5 system. The specs of these systems can be found in §B.
Remark
In most experiments, we consider an seminorm for the velocity since we observed (see also [83]) that it yields a good tradeoff between timetosolution and registration quality. We consider a moderate regularization weight of . We will see that this results in a good agreement between the reference image and the deformed template image. If we further reduce the regularization weight, we obtain a smaller mismatch at the cost of a prolonged runtime.
3.1 Convergence: Preconditioner
We study the performance of different variants of our preconditioner for the reduced space Hessian.
Setup. We solve the KKT system in (7) at a true solution . This velocity is found by registering two neuroimaging data sets from the NIREP dataset (na01 and na02). The images are downsampled to a grid of size (half the original resolution). We consider an div regularization model with and and an regularization model with with a tolerance to compute . Once we have found , we generate a synthetic reference image by transporting the reference image using . We use the velocity as an initial guess for our solver, and iteratively solve for the search direction using different variants of our preconditioner. The number of time steps for the PDE solves is set to . We fix the tolerance for the (outer) PCG method to . We consider an inexact Chebyshev semiiterative method with a fixed number of iterations (denoted by CHEB()) to compute the action of the inverse of our preconditioner. We also consider a nested PCG method for the iterative inversion of the preconditioner, with a tolerance of (denoted by PCG()). We compare these strategies to a spectral preconditioner (inverse of the regularization operator ; used in [85, 47, 86]). We study the rate of convergence of the PCG solver for a vanishing regularization parameter . We consider mesh sizes of and .
Results. We display the trend of the residual with respect to the (outer) PCG iterations in Fig. 4 (seminorm for with ) and in Fig. 5 (div regularization model with an seminorm for , and ), respectively. We provide detailed results in Tab. 6 and Tab. 7 in §C.1. We execute the code on a single node of Opuntia with 20 MPI tasks.
Observations. The most important observations are:

The PCG method converges significantly faster for the twolevel preconditioner.

The performance of all preconditioners considered in this study is not independent of the regularization parameter . The workload increases significantly for vanishing regularity of the velocity for all preconditioners. The plots in Fig. 4 and Fig. 5 imply that the convergence of the PCG method for the twolevel preconditioner is less sensitive to (or even independent of) . The work goes to the inversion of the reduced space Hessian on the coarse grid (see Tab. 6 and Tab. 7 in §C.1 for details). If we further reduce the regularization parameter (below for the regularization model and below for the div regularization model) the performance of our preconditioners deteriorates further; the runtime becomes impractical for all variants of our preconditioners.

The rate of convergence of the PCG method is (almost) independent of the mesh size for all preconditioners.^{15}^{15}15We note that we apply a smoothing of along each spacial dimension so that the input image data is resolved on the coarse grid of size . The same frequency content is presented to the solver on the fine grid of size .

The PCG method converges significantly faster if we consider an regularization model for . This is a direct consequence of fact that the condition number of the Hessian increases with the order of the regularization operator .

The differences of the performance of the preconditioners are less pronounced for an div regularization model for than for an regularization model. For an regularization model with we require more than 200 iterations for the spectral preconditioner.

We obtain a speedup of up to 2.9 for the regularization model (run #20 in Tab. 6 in §C.1) and a speedup of up to 2.6 for the div regularization model (run #40 in Tab. 7 in §C.1). The coarser the grid, the less effective is the twolevel preconditioner, especially for vanishing regularization parameters . This is expected, since we cannot resolve highfrequency components of the fine level on the coarse level. Secondly, we do not use a proper (algorithmic) smoother in our scheme to reduce the highfrequency errors.

The performance of the CHEB and the nested PCG method for iteratively inverting the reduced space Hessian are similar. There are differences in terms of the mesh size. For a coarser grid () the CHEB seems to perform slightly better. For a grid size of the nested PCG method is slightly better.
Conclusions. (i) The twolevel preconditioner is more effective than the spectral preconditioner. (ii) The nested PCG method is more effective than the CHEB method on a finer grid (and does not require a repeated estimation of the spectrum of the Hessian operator). (iii) The PCG method converges faster if we consider an div regularization model for . (iv) Designing a preconditioner that delivers a good performance for vanishing regularization parameters requires more work .
3.2 Convergence: Newton–Krylov Solver
We study the rate of convergence of our Newton–Krylov solver for the entire inversion. We consider a synthetic test problem and neuroimaging data (see Fig. 3).
3.2.1 Synthetic Data
Setup. We report results for a synthetic test problem (see §A) discretized on a grid of size . We consider an regularization model (seminorm; ) and an div regularization model (seminorm for ; , ). We run the registration at full resolution ( unknowns). The number of Newton iterations is limited to 50 (not reached). We use a superlinear forcing sequence and limit the number of Krylov iterations to 100 (not reached). The tolerance for the relative change of the gradient is ; the absolute tolerance for the norm of the gradient is . The number of time steps for the PDE solves is set to . We use 20 cores (64GB compute nodes) resulting in a processor layout of ( unknowns per core). We do not perform any parameter, scale, or grid continuation. We will consider the same synthetic test problem when we study the scalability of our solver in §3.5.
Results. We report results in Fig. 6. The top row shows results for the div regularization model and the bottom row for the regularization model. We plot the relative reduction of the mismatch, the reduced gradient, and the objective functional with respect to the Gauss–Newton iteration index. We also report results for the convergence of the PCG solver for different realization of the preconditioner.
Observations. The most important observations are:

Our Newton–Krylov solver converges after 3 to 4 Gauss–Newton iterations.

We can reduce the gradient by three orders of magnitude in less than 5 Gauss–Newton iterations (about one order of magnitude per Gauss–Newton iteration if we consider a twolevel preconditioner).

We require one Hessian matvec per Gauss–Newton iteration for the twolevel preconditioner in combination with a nested PCG method. The residual in the PCG method drops rapidly for both preconditioners.

We obtain a better search direction per Gauss–Newton iteration if we consider the twolevel preconditioner in combination with a nested PCG method (we slightly oversolve the KKT system for the nested preconditioner). The reduced gradient drops more rapidly. The trend of the objective and the mismatch is more or less similar.
Conclusions. Our Newton–Krylov solver converges quickly for smooth data. We can observe that the twolevel preconditioner is not only more effective for the inversion of the reduced space Hessian, but also yields a better search direction; the Newton–Krylov solver converges quicker. We will see that these differences are less significant for real data. This indicates that the performance of our scheme deteriorates as the objects become more irregular.
3.2.2 Real Data
Setup. We register the datasets na02 through na16 (template images) with na01 (reference image). We run the registration at the full resolution (; unknowns). We consider and div regularization model (seminorm for with and ; the parameters are chosen empirically). We perform these runs on TACC’s Lonestar 5 system. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of and (the latter is not reached) for the relative reduction and the absolute norm of the reduced gradient as a stopping criterion. We use time steps for numerical time integration. We compare results obtained for the twolevel preconditioner to results obtained using a spectral preconditioner (inverse of the regularization operator). We use a nested PCG method with a tolerance of for computing the action of the inverse of the twolevel preconditioner. We do not perform any parameter, scale, or grid continuation. We compare results obtained for single () and double () precision.
Results. We show convergence plots for all datasets in Fig. 7. We plot the relative reduction of the mismatch (left column), the relative reduction of the gradient (middle column), and the relative reduction of the objective functional (right column) with respect to the Gauss–Newton iterations. The top row shows results for the spectral preconditioner; the bottom rows show results for the twolevel preconditioner. Detailed results for these runs are provided in Tab. 8, Tab. 9 and Tab. 10 in §C.2. We also report a comparison of the performance of our solver for single () and double () precision in Tab. 11 for two exemplary images of the NIREP dataset in §C.2.
Observations. The most important observations are:

Switching from double to single precision does not affect the convergence of our solver (see Fig. 7; see also Tab. 11).^{16}^{16}16This is only true if we consider an regularization model. For an regularization model we have observed that single precision does not provide sufficient numerical accuracy to ensure the convergence of our method. Stabilizing the computations for higher order regularization operators for single precision requires more work.

The twolevel preconditioner executed with single precision yields a speedup of up to 6 (with an average speedup of ) compared to our baseline method (spectral preconditioner executed in double precision) [83] (see Fig. 7 top right; run #13 in Tab. 10). Switching from single to double precision yields a speedup of more than 2 (see Tab. 11).

We obtain a very similar convergence behavior for the different variants of our solver (see Fig. 7). We can reduce the norm of the gradient by in 6 to 14 Gauss–Newton iterations (depending on the considered pair of images).

The mismatch between the deformed template image and the reference image stagnates once we have reduced the gradient by more than one order of magnitude (for the considered regularization weight).

We oversolve the reduced space KKT system if we consider a superlinear forcing sequence in combination with a nested PCG method (see Fig. 7).
Conclusions. (i) Our improved implementation of CLAIRE yields an overall speedup of for real data if executed on a single resolution level. (ii) Executing CLAIRE in single precision does not deteriorate the performance of our solver (if we consider an regularization model for the velocity) .
3.3 TimetoSolution
We study the performance of our solver for different continuation schemes to stabilize and accelerate the computations of CLAIRE. We note that the DEMONS algorithm requires between (3 levels with 15, 10, and 5 iterations) and (3 levels with 1500, 1000 and 500 iterations) until ”convergence” on the same system (depending on the parameter choices; this includes a grid continuation scheme; see §3.4 for details).^{17}^{17}17Since we perform a fixed number of iterations for the DEMONS algorithm, the runtime only depends on the execution time of the operators. The regularization parameters control the support of the Gaussian smoothing operator; the larger the parameters, the longer the execution time. This is different for CLAIRE; large regularization parameters result in fast convergence and, hence, yield a short execution time. A simple strategy to obtain competitive results in terms of runtime would be to also execute CLAIRE for a fixed number of iterations. We prefer to use a tolerance for the relative reduction of the gradient, instead, since it yields consistent results across different datasets.
Setup. We use the dataset na02 and na03 as a template images, and register them to na01 (reference image). We consider and div regularization model (seminorm for with and ; these parameters are chosen empirically). We execute these runs on one node of the Opuntia system using 20 MPI tasks. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of for the relative reduction of the norm of the gradient and a tolerance of (not reached) for its norm as a stopping criterion. We use time steps for numerical time integration. We compare results obt