Deformable registration is a key technology in the medical imaging. It is about computing a map that establishes a meaningful spatial correspondence between two (or more) images (the reference (fixed) image) and (the template (deformable or moving) image; image to be registered) of the same scene [96, 43]. Numerous approaches for formulating and solving image registration problems have appeared in the past; we refer to [96, 97, 43, 60, 116] for lucid overviews. Image registration is typically formulated as a variational optimization problem that consists of a data fidelity term and a Tikhonov regularization functional to over-come ill-posedness [43, 40]. A key concern is that is a diffeomorphism, i.e., the map is differentiable, a bijection, and has a differentiable inverse. We require that the determinant of the deformation gradient does not vanish or change sign. An intuitive approach to safeguard against nondiffeomorphic maps is to add hard and/or soft constraints on to the variational optimization problem [29, 58, 103, 108]. An alternative strategy to ensure regularity is to introduce a pseudo-time variable and invert for a smooth velocity field that encodes the map [16, 37, 94, 126]; existence of a diffeomorphism can be guaranteed if is adequately smooth [16, 30, 37, 121]. This model is, as opposed to many traditional formulations that directly invert for [88, 58, 108, 95], adequate for recovering large, highly nonlinear deformations. Our approach falls into this category. There exists a large body of literature that, in many cases, focuses on theoretical considerations [37, 94, 131, 129, 130]. There is much less work on the design of efficient solvers for velocity based large deformation diffeomorphic image registration; examples are [11, 8, 9, 16, 34, 132, 64, 126, 102]
. Most of these solvers use first order methods for numerical optimization and/or are based on heuristics that do not guarantee convergence. Due to computational costs, early termination results in compromised registration quality. Our intention is to introduce an efficient solver for diffeomorphic image registration problems that(i) uses state-of-the art algorithms, (ii) is scalable to thousands of cores, (iii) requires minimal parameter tuning, (iv) and produces high-fidelity results with guaranteed regularity on a discrete level.
|template image (image to be registered/deformed)|
|stationary velocity field|
|deformation map (Eulerian/pull-back map)|
|state variable (transported intensities)|
|incremental state variable|
|incremental state variable|
|(reduced) Hessian operator|
|partial derivative with respect to th coordinate direction|
|Laplacian operator (vectorial and scalar)|
|number of grid points (spatial domain);|
|number of cells in temporal grid|
|number of unknowns;|
|CLAIRE||constrained large deformation diffeomorphic image registration|
|CHEB()||Chebyshev (iteration) with fixed iteration number [49, 53]|
fast Fourier transform
|GPL||GNU General Public License|
|LDDMM||large deformation diffeomorphic metric mapping |
matrix vector product
|MPI||Message Passing Interface|
|PETSc||Portable Extensible Toolkit for Scientific Computation |
|PCG||preconditioned conjugate gradient (method) |
|PCG()||PCG method with relative tolerance|
|RK2||2nd order Runge–Kutta method|
|(S)DDEM||(symmetric) diffeomorphic demons [124, 126]|
|(S)LDDEM||(symmetric) log-domain diffeomorphic demons |
|TAO||Toolkit for Advanced Optimization |
1.1 Outline of the Method
We use an optimal control formulation. The task is to find a smooth velocity field (the ‘’control variable‘’) such that the distance between two images (or densities) is minimized, subject to a regularization norm for and a deformation model given by a hyperbolic PDE constraint.111We refer to [20, 51, 68, 27] for more details on recent developments in PDE-constrained optimization. We review PDE-constrained optimization formulations in the context of image analysis in . More precisely, given two functions (reference image) and (template image) compactly supported on an open set with boundary , we solve for a stationary velocity field as follows: equationparentequation
with periodic boundary conditions on . Here, (the ‘’state variable‘’) corresponds to the transported intensities of subject to the velocity field ; in our formulation, —i.e., the solution of (1b)—is equivalent to for all in . The regularization functional is a Sobolev norm that, if chosen appropriately, ensures that gives rise to a diffeomorphism [16, 37, 62, 121]. We augment the formulation in (1) by constraints on the divergence of to control volume change (see also [83, 30]). Details can be found in §2.1.
Problem (1) is ill-posed and involves ill-conditioned operators. We use the method of Lagrange multipliers to solve the constrained optimization problem (1). Our solver is based on an optimize-then-discretize approach; we first derive the optimality conditions and then discretize in space using a pseudospectral discretization with a Fourier basis.222A discretize-then-optimize approach can be found in . We use a globalized, inexact, preconditioned Gauss–Newton–Krylov method to solve for the first order optimality conditions . The hyperbolic transport equations that appear in our formulation are integrated using a semi-Lagrangian method [117, 41]. Our solver uses distributed-memory parallelism and can be scaled up to thousands of cores [83, 47]. The linear solvers and the Gauss–Newton optimizer are built on top of PETSc  and TAO . Details can be found in §2.3.
We follow up on our former work on constrained diffeomorphic image registration [82, 83, 84, 85]. Our optimal control formulation is introduced in [82, 83]. Our Newton–Krylov solver has been proposed in ; an improved solver is described in .333A comparison of our Newton–Krylov scheme against a (preconditioned) gradient descent scheme (i.e., a gradient descent in the dual space , , induced by the regularization norm; see, e.g., [16, 62]) can be found in . We extend our original solver in  to the 3D setting in [85, 47]. The focus in [85, 47] is the scalability of our solver on high performance computing platforms. In  we presented an integrated formulation for diffeomorphic image registration and biophysical tumor growth simulations. In this work, we focus on the registration performance and introduce additional algorithmic improvements to our three-dimensional solver. Our contributions are:
We implement an improved preconditioner for the reduced space Hessian (originally described in  for the two-dimensional case). We empirically evaluate several variants of this preconditioner.
We introduce options for a grid, scale, and parameter continuation to our three-dimensional solver.
We study strong scaling performance of our improved solver.
We make our software termed CLAIRE (which stands for constrained large deformation diffeomorphic image registration) available under GPL license (https://github.com/andreasmang/claire).
1.3 Limitations and Unresolved Issues
Several limitations and unresolved issues remain. We assume similar intensity statistics for the reference image and the template image . This is a common assumption in many deformable image registration algorithms [16, 62, 77, 99, 127]. To enable the registration of images with a more complicated intensity relationship, more involved distance measures need to be considered (see, e.g., [96, 116]). Our formulation is not symmetric, i.e., not invariant to a permutation of the reference and template image. The extension of our scheme to the symmetric case is mathematically straightforward but its efficient implementation is nontrivial (see [10, 80, 125] for examples). This will be the subject of future work. We invert for a stationary velocity field (i.e., the velocity is not a function of time). Stationary paths on the manifold of diffeomorphisms are the group exponentials (i.e., one-parameter subgroups) and—in contrast to LDDMM methods [16, 62, 127], they do not cover the entire space of diffeomorphisms. Developing an effective, parallel solver for nonstationary requires more work.444We note that our Matlab prototype implementations support nonstationary [82, 87].
1.4 Related Work
With this work we follow up on our prior work on constrained diffeomorphic image registration [82, 83, 85, 87, 89]. We do not directly invert for the deformation map (see [29, 58, 88, 42, 57, 97, 104, 73] for examples) but for its velocity . We use an optimal control formulation. The PDE constraint is, in its simplest form, a hyperbolic transport equation. We refer to [20, 27, 51, 68, 79] for an introduction to the theory and algorithmic developments in PDE-constrained optimization. The numerics for these type of problems have to be tailored to the specific structure of the control problem, which is dominated by the PDE constraints that appear in the formulation; see [1, 21], [2, 46, 90, 118], and [18, 26, 77, 65, 128] for elliptic, parabolic, and hyperbolic examples, respectively.
Excellent reviews on image registration include [96, 60, 116]. Optimal control formulations for image registration have been described in [26, 30, 62, 77, 78, 127, 65]. Related formulations for optimal mass transport are described in [18, 56, 122, 87]. Our work differs from optimal mass transport in that intensities are constant along the computed characteristics (i.e., mass is not preserved). Our formulation also shares characteristics with traditional optical flow formulations [69, 72, 105]. The key difference is that we treat the transport equation for the image intensities as a hard constraint. PDE-constrained formulations for optical flow, which are equivalent to our formulation, are described in [6, 15, 26, 30]. Our work is closely related to the LDDMM approach [10, 11, 16, 37, 121, 129] (c.f., [82, 62]), which builds upon the pioneering work in . LDDMM uses a non-stationary velocity but there exist variants that use stationary [7, 8, 64, 81, 80, 126] because they are much cheaper and if we are only interested in registering one template image to the reference stationary produce good results. Another strategy to reduce the size of the search space is ‘’geodesic shooting‘’ [9, 93, 127, 129, 133]. Here, the deformation map is parameterized by an initial momentum (or, similarly, the velocity at ).
Unlike existing LDDMM approaches, CLAIRE features explicit control on the determinant of the deformation gradient by introducing additional constraints for the divergence of . Our formulation for controlling the divergence of was originally proposed in ; a similar formulation is described in . Other works that consider divergence free velocities have been described in [30, 67, 91, 105, 106].
With this work we release CLAIRE, a software package for velocity based diffeomorphic image registration. Among the most popular, publicly available packages for large deformation diffeomorphic image registration are DEMONS [125, 126], ANTS , and DARTEL . Other popular packages for deformable registration are IRTK , ELASTIX , NiftyReg , and FAIR , which, with the exception of FAIR, are based on (low dimensional) parametric deformation models that use an expansion based on basis splines to represent the deformation map .
There are only few works that discuss effective numerical implementations for velocity based diffeomorphic image registration. Despite the fact that first order methods for numerical optimization display poor convergence for non-linear, ill-posed problems, most work, with the exception of ours and [9, 18, 63, 114, 126, 65], uses first order information for numerical optimization. We use a globalized Newton–Krylov method instead. For these methods to be efficient, it is critical to design an effective preconditioner (something we address with this work; we refer to  for an overview on preconditioning of saddle point problems). Preconditioners for problems that are similar to ours can be found in [18, 114, 65]. Another critical component of an effective method for PDE-constrained optimization problems is the solver for the PDE operators that appear in the optimality systems. In our case, these PDE operators are hyperbolic transport equations. Several strategies to efficiently solve these equations have been considered; they range from conditionally stable total variation diminishing , second and fourth-order Runge-Kutta [82, 83, 102], and high-order essentially nonoscillatory  schemes, to unconditionally stable implicit Lax-Friedrich [18, 114], SL [16, 30, 89, 85], and Lagrangian  schemes. We use a SL scheme.
Examples for parallel solvers for PDE-constrained optimization problems can be found in [3, 4, 22, 23, 24, 20, 19, 113]. We refer to [39, 44, 110, 112] for surveys on parallel algorithms for image registration. Implementations, such as DEMONS [125, 126], ANTS , or ELASTIX , which are largely based on kernels implemented in the ITK package , exploit multi-threading for parallelism. GPU implementations of different variants of map-based parametric approaches based on basis splines are described in [109, 95, 111]. A GPU implementation of a map-based non-parametric image registration approach that builds upon FAIR  is described in . GPU implementations of diffeomorphic velocity based registration approaches are described in [55, 54, 115, 122, 123]. The work that is most closely related to ours [85, 47], is [55, 54, 123]. In [55, 54] a multi-GPU-accelerated implementation of the diffeomorphic image registration approach described in  is presented. The work in  discusses a GPU-accelerated implementation of DARTEL .
What sets our work apart are the numerics and our distributed-memory implementation: We use high-order numerical methods (second order time integration, cubic interpolation, and spectral differentiation). Our solver uses MPI for parallelism and has been deployed to high performance platforms[83, 47]. This allows us to target applications of unprecedented scale such as CLARITY imaging , a novel optical imaging technique that delivers sub-micron resolution. The first attempt of applying LDDMM to CLARITY is described in . The data is downsampled by about a factor of 100 (from to ) to be able to perform the registration. By exploiting a multi-resolution strategy they were able to register the images in about 100 minutes. We will see that we can solve problems with unknowns in on 22 compute nodes (256 MPI tasks) and in less than if we use 342 compute nodes (4096 MPI tasks). However, CLAIRE cannot only be used to solve problems of unprecedented scale but also paves the way for real-time applications of large deformation diffeomorphic image registration. We are the first group to present a distributed-memory implementation of a globalized Gauss–Newton–Krylov method for these type of problems.
We present our approach for large deformation diffeomorphic image registration in §2, which comprises the formulation of the problem (see §2.1), a formal presentation of the optimality conditions (see §2.2), and a discussion of the numerical implementation (see §2.3). Numerical experiments are reported in §3. We conclude with §4. Supplementary material is presented in §A, §B, and §C.
In what follows, we describe the main building blocks of our formulation, our solver, and its implementation, and introduce new features that distinguish this work from our former work on constrained diffeomorphic image registration [82, 83, 85, 47, 87, 84]. We use a globalized preconditioned, inexact, reduced space Gauss–Newton–Krylov method to solve (1). Our scheme has been described in [82, 85, 84]. Extensions to our original optimal control formulation  are described in . The parallel implementation of the main kernels of our solver is described in [85, 47].
Given two images—the reference image and the template image —compactly supported on , with boundary and closure , our aim is to compute a plausible deformation map such that [96, 97, 43]. We consider a map to be plausible, if it is a diffeomorphism, i.e., an invertible map, which is continuously differentiable (in particular a -function) and maps onto itself. In our formulation, we do not directly invert for ; we introduce a pseudo-time variable and invert for a stationary velocity , instead. In particular, we solve for the velocity field and a mass source map as follows : equationparentequation
with periodic boundary conditions on , and , , . The state variable in (2b) represents the transported intensities of subjected to the velocity field ; the solution of (2b), i.e., , is equivalent to , where is the Eulerian (or pullback) map.555The deformation map and the determinant of the deformation gradient can be computed from as a post-processing step (see [82, 83]). We use a squared -distance to measure the proximity between and . The parameters and control the contribution of the regularization norms for and . The constraint on the divergence of in (2d) allows us to control the compressibility of . If we set in (2d) to zero is incompressible, i.e., up to numerical accuracy ; see [30, 67, 82, 83, 85, 91, 105, 106] for examples. By introducing a nonzero mass-source map , we can relax this model to near-incompressible diffeomorphisms [83, 25]; the regularization on in (2a) acts like a penalty on the divergence of ; we use an -norm (a similar formulation can be found in ).
If we neglect the constraint (2d) we obtain a formulation that is closely related to available models for velocity-based diffeomorphic image registration [8, 16, 62, 64, 127, 126] (see [62, 82]).666We, likewise to [7, 8, 64, 81, 80, 126], invert for a stationary . The original LDDMM formulation [16, 37, 121, 129] uses time-dependent velocity fields. We note that stationary do not cover the entire space of diffeomorphisms. The paths obtained from stationary are one-parameter subgroups that do not depend on any metric; the definition of such a metric may be desirable in certain applications [16, 92, 132] and, in general, requires nonstationary velocities. We observed for the 2D case that using time-varying did not improve the mismatch for two-image registration problems . Our solver supports different Sobolev (semi-)norms to regularize . We summarize the available operators in Tab. 2. The choice of the differential operator not only depends on application requirements but is also critical from a theoretical point of view; an adequate choice guarantees existence and uniqueness of an optimal solution of the control problem [15, 16, 26, 30, 77] (subject to the smoothness properties of the images).
2.2 Optimality Condition and Newton Step
with Lagrange multipliers for the hyperbolic transport equation (2b), for the initial condition (2c), and for the incompressibility constraint (2d). Formally, we have to compute variations of with respect to the state, adjoint, and control variables. We will only consider a reduced form (after eliminating the incompressibility constraint) of the optimality system—a system of nonlinear PDEs for , , and . Details on how we formally arrive at this reduced from can be found in [82, 83].
The evaluation of the reduced gradient (the first variation of the Lagrangian in (3) with respect to ) for a candidate requires several steps. We first solve the state equation (2b) with initial condition (2c) forward in time to obtain the state variable . Given we then compute the adjoint variable by solving the adjoint equation equationparentequation
with periodic boundary conditions on , backward in time. Once we have the adjoint and state fields and , respectively, we can evaluate the expression for the reduced gradient given by
The differential operator in (5) corresponds to the first variation of the regularization norm for in (3), resulting in an elliptic, biharmonic, or triharmonic control equation for , respectively (see Tab. 2). The operator projects onto the space of incompressible or near-incompressible velocity fields. We have
The first order optimality condition (control or decision equation) requires that for an admissible solution to (2). Most available registration packages for large deformation diffeomorphic registration use Sobolev gradient descent type optimization schemes to find an optimal point [16, 62, 127]. The search direction is typically given by the gradient in the dual space , :
These methods have linear convergence. Newton-type methods yield super-linear to quadratic convergence .777We demonstrated in  that our (Gauss–)Newton–Krylov method is superior to this Sobolev gradient descent scheme. However, their implementation requires more work and, if implemented naively, they can become computationally prohibitive.
We use a Newton–Krylov method to solve for , which is globalized using an Armijo linesearch . We use a reduced space formulation , in which, for every value of , we eliminate and . Then, given a candidate , the Newton direction is computed based on the second variations of the Lagrangian in (3). The expression for the Hessian matvec (action of the Hessian on the vector ) is given by
where , , and denote the incremental state, adjoint, and control variable, respectively. We use the notation for the application of the reduced space Hessian to indicate that is a function of and through , , , and in (2b), (8a), (4a), and (9a). Each application of the reduced Hessian in (7) requires the following steps: Given and for some candidate we have to solve888We assign the computational costs for computing and to the evaluation of the gradient in (5). equationparentequation
for (forward in time) and equationparentequation
for (backward in time). Computing the Newton step requires the (iterative) solution of
where the right hand is given by the reduced gradient in (5).
In the following, we describe our distributed-memory solver for 3D diffeomorphic image registration problems.999Our original Newton–Krylov scheme is described in . Its parallel implementation is described in [85, 47], where we focus on the scalability of our solver. Improvements of our solver in  are described in ; the work in [82, 84] is limited to the 2D case. In what follows, we present an extensions of the 3D implementation described in [85, 47].
We discretize in space on a regular nodal grid with grid nodes , , , , and periodic boundary conditions; denotes the Hadamard division. In the continuum, we model images as compactly supported (periodic), smooth functions. We apply a Gaussian smoothing (in the spectral domain) with a bandwidth of , and mollify the discrete data, to meet these requirements. We rescale the images to an intensity range of prior to registration.
We use a trapezoidal rule for numerical quadrature and a spectral projection scheme for all spatial operations. The mapping between the spectral and the spatial domain is done using forward and inverse FFTs. All spatial derivatives are computed in the spectral domain; we first take the FFT, then apply the appropriate weights to the spectral coefficients, and then take the inverse FFT. Notice, that this scheme allows us to efficiently and accurately apply differential operators and their inverses. Consequently, the main cost of our scheme is the solution of the transport equations (2b), (4a), (8a), and (9a), and not the inversion of differential (e.g., elliptic or biharmonic) operators. We use a nodal discretization in time, which results in space-time fields we need to solve for. We use an unconditionally stable SL scheme [117, 41] to solve the transport equations, which allows us to keep small (we found that yields a good compromise between runtime and numerical accuracy).
2.3.2 Newton–Krylov Solver
The discretized Hessian operator , , is dense and ill-conditioned . Constructing and storing is out of question, especially for 3D problems (e.g., ). We use iterative, matrix-free Krylov subspace methods, which only require the action of on a vector (see (7)). We use a PCG method  under the assumption that is a symmetric, positive (semi)definite operator.101010The exact Hessian is of course guaranteed to be symmetric. But since we use an optimize-then-discretize formulation, discretization discrepancies between the adjoint and the forward operators result in a nonsymmetric Hessian. However, the error in symmetry is small compared to the tolerance levels we use in our solver. In general, the more accurately we want to compute the gradient, the more accurately we need to solve the forward and adjoint problems. The linear solves are done inexactly, with a tolerance that is proportional to the norm of the reduced gradient [36, 38]; our solver supports quadratic and superlinear forcing sequences (see, e.g., [100, 166ff]).
We note that it cannot be guaranteed that the Hessian operator is positive definite far away from the optimum. As a remedy we opt for a Gauss–Newton approximation. This corresponds to dropping all terms in (7) and (9) that involve . The rate of convergence drops from quadratic to superlinear. As tends to zero (i.e., the mismatch goes to zero; see (4)), we recover quadratic convergence.
As we have pointed out in §2.2, the transport equations (8) and (9) have to be solved every time in (7) is applied to a vector. To design an efficient algorithm we have to keep the number of Hessian matvecs (i.e., the number of PCG iterations) as small as possible. This necessitates the design of an effective preconditioner. We refer to  for a study of the spectral properties of the Hessian; for practical values of the regularization parameter82].
The speed of convergence of the PCG method depends on the distance of from identity; ideally, the spectrum of is clustered around one. In former work, we have considered two preconditioners, a spectral preconditioner based on the inverse of the regularization operator111111The operator is singular if we consider a seminorm as regularization model in (2 ). We bypass this problem by setting the zero singular values of the operator
). We bypass this problem by setting the zero singular values of the operatorto one before computing its inverse. (a common choice in PDE-constrained optimization problems [5, 28]) and a two-level (nested) preconditioner (proposed and tested in  for the 2D case). The latter strategy can be interpreted as a simplified two-level multigrid V-cycle. We present these two preconditioners in more detail next. Both of our approaches are matrix-free, i.e., they do not require assembling or storage of the preconditioner.
Applying the spectral preconditioner to (10) yields
The preconditioned Hessian operator in (11) is a compact perturbation of the identity; we expect a rate of convergence that is independent of the grid size. Notice that the inversion of can be done efficiently and exactly due to the spectral discretization in space; it only requires a forward and an inverse FFT, and an appropriate diagonal scaling in the spectral domain. This preconditioner is used in our past work [82, 83, 85, 87, 86].
We need a good approximation of the inverse of the Hessian to have an effective preconditioner. Our second approach uses the inverse of the reduced space Hessian as a preconditioner (in theory, the ideal preconditioner). The key idea is to amortize the computational costs by solving for the action of the inverse of the Hessian inexactly, on a coarser grid. This preconditioner will only act on the low frequency components of the input vector. We ensure this numerically by separating the spectrum. We do so by applying ideal (spectral) low- and high-pass filters to the vector the preconditioner is applied to (input vector). We proceed as follows: We apply a spectral restriction operator to present the filtered low frequency modes of our input vector to the algorithm for inverting the Hessian on the coarse grid. We apply a spectral prolongation operator to the output vector of this algorithm to project it back to the fine grid, and subsequently add the filtered high frequency modes of the input vector to it (i.e., we leave the high frequency components of the input vector untouched). To be able to evaluate the Hessian matvec, we also need to apply restriction operators to the adjoint and state variable and , respectively.
We do not apply this scheme to (7) but to the spectrally preconditioned Hessian operator in (11)—with a slight modification. The -preconditioned Hessian in (11) is no longer symmetric. Since we can efficiently compute the square root of the operator we use
instead.121212Notice that the left preconditioned Hessian in (11) and the split-preconditioned Hessian in (12) have identical spectral properties. Consequently, we can use a PCG algorithm to compute the action of the inverse of in (12). Notice that the inverse of the regularization operator is a smoother; we can view it as an approximation of a smoother in multigrid methods.
The final building block of our two-level scheme for preconditioning (10) is the solver used to invert the reduced space Hessian on the coarse grid. Forming and storing this operator is out of question (too costly); we opt for a matrix-free iterative solver to compute the action of its inverse on a vector, instead. We have several choices. If we opt for a nested Krylov subspace method, such as PCG, we have to use a smaller tolerance for the inversion of the reduced space Hessian , , on the coarse grid (i.e., the preconditioner) than we use for the inversion of on the fine grid, i.e., with . This is due to the fact that Krylov subspace methods are nonlinear operators; this scheme would violate the theory of preconditioning linear systems if the preconditioner was not inverted accurately. Another approach is to use a semi-iterative Chebyshev (CHEB) method , which is a fixed linear operator for a particular choice of eigenvalue bounds 
. We have to estimate these bounds at every Newton iteration; this is costly. We do so using the Lanczos method. This is the method we used in.131313We have also considered a flexible PCG method [12, 101] for the inversion of the Hessian operator. We observed that the performance deteriorated significantly as we reduce the regularization parameter. We disregard this approach.
We terminate the inversion if the -norm of the gradient in (5) is reduced by a factor of , i.e., if , where is the gradient at iteration and is the gradient for the initial condition of the inversion. In most of our experiments, we use . We also provide an option to set a lower bound for the -norm of the gradient (the default value is ). CLAIRE also features the stopping criteria discussed in [97, 48, 82] (not considered here).
2.3.3 Hyperbolic PDE Solver
We use a fully explicit, unconditionally stable SL scheme [41, 117] to solve the hyperbolic transport equations that appear in our Eulerian formulation ((2b), (4a), (8a), and (9a)).141414We introduced our SL formulation in  and introduced a first parallel implementation in . SL methods are a hybrid between Eulerian and Lagrangian methods. To explain our scheme, we consider the general transport equation
with periodic boundary conditions on . We introduce the notion of a characteristic that is defined by the flow on a time interval , . The interval corresponds to a single time step of size of our scheme for integrating the transport equation with .
If the Lagrangian derivative vanishes, is constant along the characteristic . The value of at a given coordinate at can then be computed by tracing the characteristic backward in time to obtain the location (departure point) the particle originated from. We can trace the characteristic by solving in with final condition at . We solve this ODE backward in time based on a fully explicit RK2 scheme (Heun’s method):
Notice, that is stationary; we have to compute the characteristic only once for a single time step at each Newton iteration (more precisely, once for the state and once for the adjoint equations). The coordinate is in general an off-grid location; the evaluation of requires three scalar interpolations. We use a cubic interpolation model.
In a subsequent step, we have to solve along the characteristic . As we have mentioned earlier, is constant along for the homogeneous case (). Hence, we can compute at by scalar interpolation at the off-grid locations ; the solution of the underlying PDE overall requires interpolations. If the Lagrangian derivative does not vanish, we compute the solution based on an RK2 scheme. Depending on the form of , this requires the evaluation of differential and/or interpolation operators for each datum evaluated at (see  for details). Overall, our scheme is second order accurate in time and third order accurate in space.
2.3.4 Continuation Schemes
The automatic selection of an optimal regularization parameter for unseen data is a topic of research by itself [61, 59]. The choice of the regularization norm (i.e., the regularity requirements for the deformation map ) depends on the application. Subject to smoothness requirements on the data, we know that we require at least -regularity of to guarantee that is a diffeomorphism [16, 37, 121]. In a practical setting, we assume that it is sufficient to require that is strictly positive (up to numerical accuracy). We can, for instance, monitor during the optimization and adjust the regularization parameter to guarantee that is locally diffeomorphic.
In our formulation, does not appear. Also, we have to differentiate (or the displacement field) once we found ; resolving is challenging and may require a significantly larger amount of discretization points than we use for the solution of the optimization problem. For this reason, we do not compute ; we directly transport .
Probing for an adequate is expensive, since it typically requires a repeated solution of the inverse problem. We assume that, once we have found an adequate , we can use this parameter for similar registration problems. Such cohort studies are quite typical in medical imaging. We determine based on a binary search (originally proposed in ; a similar strategy is described in ). The measure for this search is a bound (a user defined parameter) on . We choose so that . We start at level with a regularization parameter of and reduce by one order of magnitude until we hit the tolerance . At this point, we execute a binary search. We terminate the binary search if the update for at level is smaller than of the value for at which we reached the bound (i.e., if we reach for we terminate the binary search if the update for between level and is smaller than ). We use the solution of level as initial guess for the next level. We perform at least one Gauss–Newton iteration, and terminate the level if the gradient is reduced by (with respect to the norm of the gradient for our initial guess ; notice that we only need to evaluate the reference gradient once for this choice).
We implement the following acceleration schemes to speed up the rate of convergence and reduce the likelihood to get trapped in local minima:
Parameter continuation. For a given target parameter , we reduce (starting with ) by one order of magnitude until we reach the order of . We suggest to solve the problem on each level with a larger tolerance for the stopping criteria (e.g., ). On the final level, we set to and solve with the user defined tolerance . We perform at least one Gauss–Newton iteration per level.
Grid continuation. Our solver also features a (spectral) grid continuation (i.e., multilevel) scheme. We solve the problem on coarse grid first, and successively refine until we reach the resolution of the original image data. The restriction and prolongation is performed in the spectral domain. The solution on the coarse grid at level is used as the initial condition for the fine grid at level . At each level, we solve the registration up to the user defined tolerance .
Scale continuation. We have also implemented a multi-scale approach. We use a Gaussian kernel (applied in the spectral domain) for the scale space representation of the image data.
We summarize the critical parameters of CLAIRE in Tab. 3.
|variable||meaning||suggested value||determined automatically|
|regularization parameter for||—||yes|
|regularization parameter for||no|
|relative tolerance for gradient||no|
|number of time steps||4||no|
|bound for||0.25 (-div) or 0.1 ()||no|
2.3.5 Implementation Details
We make CLAIRE available under GPL license. It can be downloaded at https://github.com/andreasmang/claire. CLAIRE is written in C/C++; it implements data parallelism via MPI. Our solver supports single and double precision. We use the PETSc library  for linear algebra, and PETSc’s TAO  for numerical optimization. We provide routines for the evaluation of the objective functional in (2a), the evaluation of the gradient in (5), the Hessian matvec in (7), the construction and application of the preconditioner, and routines that control the tolerances and implement the stopping conditions. We use the AccFFT package —a parallel, open-source FFT library for CPU/GPU architectures developed in our group, to apply the spectral operators.
AccFFT dictates the data layout: We partition the data based on a pencil decomposition for 3D FFTs [50, 35]: Let denote the number of MPI tasks; then each MPI task gets grid points (i.e., we partition the domain along the - and -axis into subdomains , , of size . There is no partitioning in time. In our improved implementation presented in this work we have significantly reduced the memory allocation for the Gauss–Newton approximation; we only store the time history of the state variable , , . This is accomplished by, e.g., performing the integration of
during the solution of the adjoint equation in (4) (we proceed in a similar fashion for the evaluation of (7), in particular, when computing the solution of the incremental adjoint equation (8)). Here, , represents the Kronecker product, and is the Hadamard product. We summarize the memory requirements in Tab. 4. Notice that we in addition to that require the memory of the Hessian matvec, if we consider the two-level preconditioner (under the assumption that the coarse level is half the size of the fine level). The spectral preconditioner does not add to the memory pressure.
|Gradient (see (5))|
|Hessian matvec (see (7))|
The scalability of the 3D FFT is well explored [50, 35, 45]. If we assume that the number of grid points , , is equal along each spatial direction, i.e., , each 3D FFT requires computations and communications, where is the startup time for the data transfer and represents the per-word transfer time .
We use a tricubic interpolation model to evaluate off-grid points in our SL scheme (see §2.3.3). The parallel implementation of our interpolation kernel is introduced in  and improved in . The polynomial is implemented in Lagrange form. The evaluation of this interpolation kernel requires the computation of twelve basis polynomials. The local support of the cubic basis is grid points. Overall, this results in a complexity of computations. We have implemented a SIMD vectorization based on advanced vector extensions (AVX2) for Haswell architectures, for the evaluation of the interpolation kernel. In addition to that we have introduced a binning method (based on Morton sorting) that reduces the cache misses (see  for details). This yields an interpolation kernel that is now bound by communication time instead of time spent in the interpolation (which was the case in ).
The communication costs are more difficult to estimate; they depend on the computed characteristic (see Fig. 2 for an illustration). If a departure point is owned by the current processor, we require no communication. If the values for a departure point are owned by another processor/MPI task (the “worker”), we communicate the coordinates from the “owner” to the “worker”. We then evaluate the interpolation model on the “worker” and communicate the result back to the “owner”. This results in a communication cost of per off-grid point not owned by a processor. To evaluate the interpolation model at off-grid points not owned by any MPI task (i.e., located in-between the subdomains ), we add a layer of four ghost points (scalar values to be interpolated; see Fig. 2, right). This results in an additional communication cost of for each MPI task for the four face neighbors, where is the size of layer for the ghost points (in our case four). The communication with the four corner neighbors can be combined with the messages of the edge neighbors, by appropriate ordering of the messages. Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since our velocity field is stationary. We perform this communication when we evaluate the forward and the adjoint operators (i.e., during the evaluation of the objective functional and the reduced gradient).
We report results for real-world and synthetic datasets. We evaluate the registration accuracy for 16 segmented MRI brain volumes . Details on the considered datasets can be found in §A. We showcase two exemplary datasets in Fig. 3. Notice, that these datasets have been rigidly preregistered. We directly apply our method to this data (without an additional affine preregistration step). The runs were executed on the CACDS’s Opuntia server or on TACC’s Lonestar 5 system. The specs of these systems can be found in §B.
In most experiments, we consider an -seminorm for the velocity since we observed (see also ) that it yields a good trade-off between time-to-solution and registration quality. We consider a moderate regularization weight of . We will see that this results in a good agreement between the reference image and the deformed template image. If we further reduce the regularization weight, we obtain a smaller mismatch at the cost of a prolonged runtime.
3.1 Convergence: Preconditioner
We study the performance of different variants of our preconditioner for the reduced space Hessian.
Setup. We solve the KKT system in (7) at a true solution . This velocity is found by registering two neuroimaging data sets from the NIREP dataset (na01 and na02). The images are downsampled to a grid of size (half the original resolution). We consider an -div regularization model with and and an regularization model with with a tolerance to compute . Once we have found , we generate a synthetic reference image by transporting the reference image using . We use the velocity as an initial guess for our solver, and iteratively solve for the search direction using different variants of our preconditioner. The number of time steps for the PDE solves is set to . We fix the tolerance for the (outer) PCG method to . We consider an inexact Chebyshev semi-iterative method with a fixed number of iterations (denoted by CHEB()) to compute the action of the inverse of our preconditioner. We also consider a nested PCG method for the iterative inversion of the preconditioner, with a tolerance of (denoted by PCG()). We compare these strategies to a spectral preconditioner (inverse of the regularization operator ; used in [85, 47, 86]). We study the rate of convergence of the PCG solver for a vanishing regularization parameter . We consider mesh sizes of and .
Results. We display the trend of the residual with respect to the (outer) PCG iterations in Fig. 4 (-seminorm for with ) and in Fig. 5 (-div regularization model with an -seminorm for , and ), respectively. We provide detailed results in Tab. 6 and Tab. 7 in §C.1. We execute the code on a single node of Opuntia with 20 MPI tasks.
Observations. The most important observations are:
The PCG method converges significantly faster for the two-level preconditioner.
The performance of all preconditioners considered in this study is not independent of the regularization parameter . The workload increases significantly for vanishing regularity of the velocity for all preconditioners. The plots in Fig. 4 and Fig. 5 imply that the convergence of the PCG method for the two-level preconditioner is less sensitive to (or even independent of) . The work goes to the inversion of the reduced space Hessian on the coarse grid (see Tab. 6 and Tab. 7 in §C.1 for details). If we further reduce the regularization parameter (below for the -regularization model and below for the -div regularization model) the performance of our preconditioners deteriorates further; the runtime becomes impractical for all variants of our preconditioners.
The rate of convergence of the PCG method is (almost) independent of the mesh size for all preconditioners.151515We note that we apply a smoothing of along each spacial dimension so that the input image data is resolved on the coarse grid of size . The same frequency content is presented to the solver on the fine grid of size .
The PCG method converges significantly faster if we consider an -regularization model for . This is a direct consequence of fact that the condition number of the Hessian increases with the order of the regularization operator .
The differences of the performance of the preconditioners are less pronounced for an -div regularization model for than for an -regularization model. For an regularization model with we require more than 200 iterations for the spectral preconditioner.
We obtain a speedup of up to 2.9 for the -regularization model (run #20 in Tab. 6 in §C.1) and a speedup of up to 2.6 for the -div regularization model (run #40 in Tab. 7 in §C.1). The coarser the grid, the less effective is the two-level preconditioner, especially for vanishing regularization parameters . This is expected, since we cannot resolve high-frequency components of the fine level on the coarse level. Secondly, we do not use a proper (algorithmic) smoother in our scheme to reduce the high-frequency errors.
The performance of the CHEB and the nested PCG method for iteratively inverting the reduced space Hessian are similar. There are differences in terms of the mesh size. For a coarser grid () the CHEB seems to perform slightly better. For a grid size of the nested PCG method is slightly better.
Conclusions. (i) The two-level preconditioner is more effective than the spectral preconditioner. (ii) The nested PCG method is more effective than the CHEB method on a finer grid (and does not require a repeated estimation of the spectrum of the Hessian operator). (iii) The PCG method converges faster if we consider an -div regularization model for . (iv) Designing a preconditioner that delivers a good performance for vanishing regularization parameters requires more work .
3.2 Convergence: Newton–Krylov Solver
We study the rate of convergence of our Newton–Krylov solver for the entire inversion. We consider a synthetic test problem and neuroimaging data (see Fig. 3).
3.2.1 Synthetic Data
Setup. We report results for a synthetic test problem (see §A) discretized on a grid of size . We consider an -regularization model (seminorm; ) and an -div regularization model (-seminorm for ; , ). We run the registration at full resolution ( unknowns). The number of Newton iterations is limited to 50 (not reached). We use a superlinear forcing sequence and limit the number of Krylov iterations to 100 (not reached). The tolerance for the relative change of the gradient is ; the absolute tolerance for the norm of the gradient is . The number of time steps for the PDE solves is set to . We use 20 cores (64GB compute nodes) resulting in a processor layout of ( unknowns per core). We do not perform any parameter, scale, or grid continuation. We will consider the same synthetic test problem when we study the scalability of our solver in §3.5.
Results. We report results in Fig. 6. The top row shows results for the -div regularization model and the bottom row for the regularization model. We plot the relative reduction of the mismatch, the reduced gradient, and the objective functional with respect to the Gauss–Newton iteration index. We also report results for the convergence of the PCG solver for different realization of the preconditioner.
Observations. The most important observations are:
Our Newton–Krylov solver converges after 3 to 4 Gauss–Newton iterations.
We can reduce the gradient by three orders of magnitude in less than 5 Gauss–Newton iterations (about one order of magnitude per Gauss–Newton iteration if we consider a two-level preconditioner).
We require one Hessian matvec per Gauss–Newton iteration for the two-level preconditioner in combination with a nested PCG method. The residual in the PCG method drops rapidly for both preconditioners.
We obtain a better search direction per Gauss–Newton iteration if we consider the two-level preconditioner in combination with a nested PCG method (we slightly oversolve the KKT system for the nested preconditioner). The reduced gradient drops more rapidly. The trend of the objective and the mismatch is more or less similar.
Conclusions. Our Newton–Krylov solver converges quickly for smooth data. We can observe that the two-level preconditioner is not only more effective for the inversion of the reduced space Hessian, but also yields a better search direction; the Newton–Krylov solver converges quicker. We will see that these differences are less significant for real data. This indicates that the performance of our scheme deteriorates as the objects become more irregular.
3.2.2 Real Data
Setup. We register the datasets na02 through na16 (template images) with na01 (reference image). We run the registration at the full resolution (; unknowns). We consider and -div regularization model (-seminorm for with and ; the parameters are chosen empirically). We perform these runs on TACC’s Lonestar 5 system. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of and (the latter is not reached) for the relative reduction and the absolute -norm of the reduced gradient as a stopping criterion. We use time steps for numerical time integration. We compare results obtained for the two-level preconditioner to results obtained using a spectral preconditioner (inverse of the regularization operator). We use a nested PCG method with a tolerance of for computing the action of the inverse of the two-level preconditioner. We do not perform any parameter, scale, or grid continuation. We compare results obtained for single () and double () precision.
Results. We show convergence plots for all datasets in Fig. 7. We plot the relative reduction of the mismatch (left column), the relative reduction of the gradient (middle column), and the relative reduction of the objective functional (right column) with respect to the Gauss–Newton iterations. The top row shows results for the spectral preconditioner; the bottom rows show results for the two-level preconditioner. Detailed results for these runs are provided in Tab. 8, Tab. 9 and Tab. 10 in §C.2. We also report a comparison of the performance of our solver for single () and double () precision in Tab. 11 for two exemplary images of the NIREP dataset in §C.2.
Observations. The most important observations are:
Switching from double to single precision does not affect the convergence of our solver (see Fig. 7; see also Tab. 11).161616This is only true if we consider an -regularization model. For an -regularization model we have observed that single precision does not provide sufficient numerical accuracy to ensure the convergence of our method. Stabilizing the computations for higher order regularization operators for single precision requires more work.
The two-level preconditioner executed with single precision yields a speedup of up to 6 (with an average speedup of ) compared to our baseline method (spectral preconditioner executed in double precision)  (see Fig. 7 top right; run #13 in Tab. 10). Switching from single to double precision yields a speedup of more than 2 (see Tab. 11).
We obtain a very similar convergence behavior for the different variants of our solver (see Fig. 7). We can reduce the -norm of the gradient by in 6 to 14 Gauss–Newton iterations (depending on the considered pair of images).
The mismatch between the deformed template image and the reference image stagnates once we have reduced the gradient by more than one order of magnitude (for the considered regularization weight).
We oversolve the reduced space KKT system if we consider a superlinear forcing sequence in combination with a nested PCG method (see Fig. 7).
Conclusions. (i) Our improved implementation of CLAIRE yields an overall speedup of for real data if executed on a single resolution level. (ii) Executing CLAIRE in single precision does not deteriorate the performance of our solver (if we consider an -regularization model for the velocity) .
We study the performance of our solver for different continuation schemes to stabilize and accelerate the computations of CLAIRE. We note that the DEMONS algorithm requires between (3 levels with 15, 10, and 5 iterations) and (3 levels with 1500, 1000 and 500 iterations) until ”convergence” on the same system (depending on the parameter choices; this includes a grid continuation scheme; see §3.4 for details).171717Since we perform a fixed number of iterations for the DEMONS algorithm, the runtime only depends on the execution time of the operators. The regularization parameters control the support of the Gaussian smoothing operator; the larger the parameters, the longer the execution time. This is different for CLAIRE; large regularization parameters result in fast convergence and, hence, yield a short execution time. A simple strategy to obtain competitive results in terms of runtime would be to also execute CLAIRE for a fixed number of iterations. We prefer to use a tolerance for the relative reduction of the gradient, instead, since it yields consistent results across different datasets.
Setup. We use the dataset na02 and na03 as a template images, and register them to na01 (reference image). We consider and -div regularization model (-seminorm for with and ; these parameters are chosen empirically). We execute these runs on one node of the Opuntia system using 20 MPI tasks. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of for the relative reduction of the -norm of the gradient and a tolerance of (not reached) for its -norm as a stopping criterion. We use time steps for numerical time integration. We compare results obt