CLAIRE: A distributed-memory solver for constrained large deformation diffeomorphic image registration

We introduce CLAIRE, a distributed-memory algorithm and software for solving constrained large deformation diffeomorphic image registration problems in three dimensions. We invert for a stationary velocity field that parameterizes the deformation map. Our solver is based on a globalized, preconditioned, inexact reduced space Gauss--Newton--Krylov scheme. We exploit state-of-the-art techniques in scientific computing to develop an effective solver that scales to thousand of distributed memory nodes on high-end clusters. Our improved, parallel implementation features parameter-, scale-, and grid-continuation schemes to speedup the computations and reduce the likelihood to get trapped in local minima. We also implement an improved preconditioner for the reduced space Hessian to speedup the convergence. We test registration performance on synthetic and real data. We demonstrate registration accuracy on 16 neuroimaging datasets. We compare the performance of our scheme against different flavors of the DEMONS algorithm for diffeomorphic image registration. We study convergence of our preconditioner and our overall algorithm. We report scalability results on state-of-the-art supercomputing platforms. We demonstrate that we can solve registration problems for clinically relevant data sizes in two to four minutes on a standard compute node with 20 cores, attaining excellent data fidelity. With the present work we achieve a speedup of (on average) 5x with a peak performance of up to 17x compared to our former work.


page 5

page 19


A Semi-Lagrangian two-level preconditioned Newton-Krylov solver for constrained diffeomorphic image registration

We propose an efficient numerical algorithm for the solution of diffeomo...

Constrained H^1-regularization schemes for diffeomorphic image registration

We propose regularization schemes for deformable registration and effici...

An inexact Newton-Krylov algorithm for constrained diffeomorphic image registration

We propose numerical algorithms for solving large deformation diffeomorp...

Effective Implementation of GPU-based Revised Simplex algorithm applying new memory management and cycle avoidance strategies

Graphics Processing Units (GPUs) with high computational capabilities us...

Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

Parallelism patterns (e.g., map or reduce) have proven to be effective t...

Multi-Node Multi-GPU Diffeomorphic Image Registration for Large-Scale Imaging Problems

We present a Gauss-Newton-Krylov solver for large deformation diffeomorp...

Distributed-memory large deformation diffeomorphic 3D image registration

We present a parallel distributed-memory algorithm for large deformation...

1 Introduction

Deformable registration is a key technology in the medical imaging. It is about computing a map that establishes a meaningful spatial correspondence between two (or more) images (the reference (fixed) image) and (the template (deformable or moving) image; image to be registered) of the same scene [96, 43]. Numerous approaches for formulating and solving image registration problems have appeared in the past; we refer to [96, 97, 43, 60, 116] for lucid overviews. Image registration is typically formulated as a variational optimization problem that consists of a data fidelity term and a Tikhonov regularization functional to over-come ill-posedness [43, 40]. A key concern is that is a diffeomorphism, i.e., the map is differentiable, a bijection, and has a differentiable inverse. We require that the determinant of the deformation gradient does not vanish or change sign. An intuitive approach to safeguard against nondiffeomorphic maps is to add hard and/or soft constraints on to the variational optimization problem [29, 58, 103, 108]. An alternative strategy to ensure regularity is to introduce a pseudo-time variable and invert for a smooth velocity field that encodes the map  [16, 37, 94, 126]; existence of a diffeomorphism can be guaranteed if is adequately smooth [16, 30, 37, 121]. This model is, as opposed to many traditional formulations that directly invert for  [88, 58, 108, 95], adequate for recovering large, highly nonlinear deformations. Our approach falls into this category. There exists a large body of literature that, in many cases, focuses on theoretical considerations [37, 94, 131, 129, 130]. There is much less work on the design of efficient solvers for velocity based large deformation diffeomorphic image registration; examples are [11, 8, 9, 16, 34, 132, 64, 126, 102]

. Most of these solvers use first order methods for numerical optimization and/or are based on heuristics that do not guarantee convergence. Due to computational costs, early termination results in compromised registration quality. Our intention is to introduce an efficient solver for diffeomorphic image registration problems that

(i) uses state-of-the art algorithms, (ii) is scalable to thousands of cores, (iii) requires minimal parameter tuning, (iv) and produces high-fidelity results with guaranteed regularity on a discrete level.

spatial domain;
spatial coordinate;
pseudo-time variable;
reference image
template image (image to be registered/deformed)
stationary velocity field
deformation map (Eulerian/pull-back map)
state variable (transported intensities)
final state;
adjoint variable
incremental state variable
incremental state variable
Lagrangian functional
(reduced) gradient
(reduced) Hessian operator
partial derivative with respect to th coordinate direction
gradient operator;
divergence operator
Laplacian operator (vectorial and scalar)
number of grid points (spatial domain);
number of cells in temporal grid
number of unknowns;
CLAIRE constrained large deformation diffeomorphic image registration
CFL Courant–Friedrichs–Lewy (condition)
CHEB() Chebyshev (iteration) with fixed iteration number [49, 53]

fast Fourier transform

GPL GNU General Public License
LDDMM large deformation diffeomorphic metric mapping [16]

matrix vector product

MPI Message Passing Interface
PETSc Portable Extensible Toolkit for Scientific Computation [13]
PCG preconditioned conjugate gradient (method) [66]
PCG() PCG method with relative tolerance
RK2 2nd order Runge–Kutta method
(S)DDEM (symmetric) diffeomorphic demons [124, 126]
(S)LDDEM (symmetric) log-domain diffeomorphic demons [125]
SL semi-Lagrangian (scheme)
TAO Toolkit for Advanced Optimization [98]
Table 1: Notation and symbols.

1.1 Outline of the Method

We use an optimal control formulation. The task is to find a smooth velocity field (the ‘’control variable‘’) such that the distance between two images (or densities) is minimized, subject to a regularization norm for and a deformation model given by a hyperbolic PDE constraint.111We refer to [20, 51, 68, 27] for more details on recent developments in PDE-constrained optimization. We review PDE-constrained optimization formulations in the context of image analysis in [86]. More precisely, given two functions (reference image) and (template image) compactly supported on an open set with boundary , we solve for a stationary velocity field as follows: equationparentequation

subject to

with periodic boundary conditions on . Here, (the ‘’state variable‘’) corresponds to the transported intensities of subject to the velocity field ; in our formulation, —i.e., the solution of (1b)—is equivalent to for all in . The regularization functional is a Sobolev norm that, if chosen appropriately, ensures that gives rise to a diffeomorphism  [16, 37, 62, 121]. We augment the formulation in (1) by constraints on the divergence of to control volume change (see also [83, 30]). Details can be found in §2.1.

Problem (1) is ill-posed and involves ill-conditioned operators. We use the method of Lagrange multipliers to solve the constrained optimization problem (1). Our solver is based on an optimize-then-discretize approach; we first derive the optimality conditions and then discretize in space using a pseudospectral discretization with a Fourier basis.222A discretize-then-optimize approach can be found in [87]. We use a globalized, inexact, preconditioned Gauss–Newton–Krylov method to solve for the first order optimality conditions [82]. The hyperbolic transport equations that appear in our formulation are integrated using a semi-Lagrangian method [117, 41]. Our solver uses distributed-memory parallelism and can be scaled up to thousands of cores [83, 47]. The linear solvers and the Gauss–Newton optimizer are built on top of PETSc [14] and TAO [98]. Details can be found in §2.3.

1.2 Contributions

We follow up on our former work on constrained diffeomorphic image registration [82, 83, 84, 85]. Our optimal control formulation is introduced in [82, 83]. Our Newton–Krylov solver has been proposed in [82]; an improved solver is described in [84].333A comparison of our Newton–Krylov scheme against a (preconditioned) gradient descent scheme (i.e., a gradient descent in the dual space , , induced by the regularization norm; see, e.g., [16, 62]) can be found in [82]. We extend our original solver in [82] to the 3D setting in [85, 47]. The focus in [85, 47] is the scalability of our solver on high performance computing platforms. In [107] we presented an integrated formulation for diffeomorphic image registration and biophysical tumor growth simulations. In this work, we focus on the registration performance and introduce additional algorithmic improvements to our three-dimensional solver. Our contributions are:

  • We implement an improved preconditioner for the reduced space Hessian (originally described in [84] for the two-dimensional case). We empirically evaluate several variants of this preconditioner.

  • We introduce options for a grid, scale, and parameter continuation to our three-dimensional solver.

  • We evaluate registration quality and compare our new, improved solver against different variants of the diffeomorphic DEMONS algorithm [125, 126].

  • We study strong scaling performance of our improved solver.

  • We make our software termed CLAIRE (which stands for constrained large deformation diffeomorphic image registration) available under GPL license (

1.3 Limitations and Unresolved Issues

Several limitations and unresolved issues remain. We assume similar intensity statistics for the reference image and the template image . This is a common assumption in many deformable image registration algorithms [16, 62, 77, 99, 127]. To enable the registration of images with a more complicated intensity relationship, more involved distance measures need to be considered (see, e.g., [96, 116]). Our formulation is not symmetric, i.e., not invariant to a permutation of the reference and template image. The extension of our scheme to the symmetric case is mathematically straightforward but its efficient implementation is nontrivial (see [10, 80, 125] for examples). This will be the subject of future work. We invert for a stationary velocity field (i.e., the velocity is not a function of time). Stationary paths on the manifold of diffeomorphisms are the group exponentials (i.e., one-parameter subgroups) and—in contrast to LDDMM methods [16, 62, 127], they do not cover the entire space of diffeomorphisms. Developing an effective, parallel solver for nonstationary requires more work.444We note that our Matlab prototype implementations support nonstationary  [82, 87].

1.4 Related Work

With this work we follow up on our prior work on constrained diffeomorphic image registration [82, 83, 85, 87, 89]. We do not directly invert for the deformation map (see [29, 58, 88, 42, 57, 97, 104, 73] for examples) but for its velocity . We use an optimal control formulation. The PDE constraint is, in its simplest form, a hyperbolic transport equation. We refer to [20, 27, 51, 68, 79] for an introduction to the theory and algorithmic developments in PDE-constrained optimization. The numerics for these type of problems have to be tailored to the specific structure of the control problem, which is dominated by the PDE constraints that appear in the formulation; see [1, 21], [2, 46, 90, 118], and [18, 26, 77, 65, 128] for elliptic, parabolic, and hyperbolic examples, respectively.

Excellent reviews on image registration include [96, 60, 116]. Optimal control formulations for image registration have been described in [26, 30, 62, 77, 78, 127, 65]. Related formulations for optimal mass transport are described in [18, 56, 122, 87]. Our work differs from optimal mass transport in that intensities are constant along the computed characteristics (i.e., mass is not preserved). Our formulation also shares characteristics with traditional optical flow formulations [69, 72, 105]. The key difference is that we treat the transport equation for the image intensities as a hard constraint. PDE-constrained formulations for optical flow, which are equivalent to our formulation, are described in [6, 15, 26, 30]. Our work is closely related to the LDDMM approach [10, 11, 16, 37, 121, 129] (c.f., [82, 62]), which builds upon the pioneering work in [32]. LDDMM uses a non-stationary velocity but there exist variants that use stationary  [7, 8, 64, 81, 80, 126] because they are much cheaper and if we are only interested in registering one template image to the reference stationary produce good results. Another strategy to reduce the size of the search space is ‘’geodesic shooting‘’ [9, 93, 127, 129, 133]. Here, the deformation map is parameterized by an initial momentum (or, similarly, the velocity at ).

Unlike existing LDDMM approaches, CLAIRE features explicit control on the determinant of the deformation gradient by introducing additional constraints for the divergence of . Our formulation for controlling the divergence of was originally proposed in [83]; a similar formulation is described in [26]. Other works that consider divergence free velocities have been described in [30, 67, 91, 105, 106].

With this work we release CLAIRE, a software package for velocity based diffeomorphic image registration. Among the most popular, publicly available packages for large deformation diffeomorphic image registration are DEMONS [125, 126], ANTS [11], and DARTEL [8]. Other popular packages for deformable registration are IRTK [104], ELASTIX [73], NiftyReg [95], and FAIR [97], which, with the exception of FAIR, are based on (low dimensional) parametric deformation models that use an expansion based on basis splines to represent the deformation map .

There are only few works that discuss effective numerical implementations for velocity based diffeomorphic image registration. Despite the fact that first order methods for numerical optimization display poor convergence for non-linear, ill-posed problems, most work, with the exception of ours and [9, 18, 63, 114, 126, 65], uses first order information for numerical optimization. We use a globalized Newton–Krylov method instead. For these methods to be efficient, it is critical to design an effective preconditioner (something we address with this work; we refer to [17] for an overview on preconditioning of saddle point problems). Preconditioners for problems that are similar to ours can be found in [18, 114, 65]. Another critical component of an effective method for PDE-constrained optimization problems is the solver for the PDE operators that appear in the optimality systems. In our case, these PDE operators are hyperbolic transport equations. Several strategies to efficiently solve these equations have been considered; they range from conditionally stable total variation diminishing [26], second and fourth-order Runge-Kutta [82, 83, 102], and high-order essentially nonoscillatory [62] schemes, to unconditionally stable implicit Lax-Friedrich [18, 114], SL [16, 30, 89, 85], and Lagrangian [87] schemes. We use a SL scheme.

Examples for parallel solvers for PDE-constrained optimization problems can be found in [3, 4, 22, 23, 24, 20, 19, 113]. We refer to [39, 44, 110, 112] for surveys on parallel algorithms for image registration. Implementations, such as DEMONS [125, 126], ANTS [11], or ELASTIX [73], which are largely based on kernels implemented in the ITK package [70], exploit multi-threading for parallelism. GPU implementations of different variants of map-based parametric approaches based on basis splines are described in [109, 95, 111]. A GPU implementation of a map-based non-parametric image registration approach that builds upon FAIR [97] is described in [74]. GPU implementations of diffeomorphic velocity based registration approaches are described in [55, 54, 115, 122, 123]. The work that is most closely related to ours [85, 47], is [55, 54, 123]. In [55, 54] a multi-GPU-accelerated implementation of the diffeomorphic image registration approach described in [71] is presented. The work in [123] discusses a GPU-accelerated implementation of DARTEL [8].

What sets our work apart are the numerics and our distributed-memory implementation: We use high-order numerical methods (second order time integration, cubic interpolation, and spectral differentiation). Our solver uses MPI for parallelism and has been deployed to high performance platforms 

[83, 47]. This allows us to target applications of unprecedented scale such as CLARITY imaging [120], a novel optical imaging technique that delivers sub-micron resolution. The first attempt of applying LDDMM to CLARITY is described in [75]. The data is downsampled by about a factor of 100 (from to ) to be able to perform the registration. By exploiting a multi-resolution strategy they were able to register the images in about 100 minutes. We will see that we can solve problems with unknowns in on 22 compute nodes (256 MPI tasks) and in less than if we use 342 compute nodes (4096 MPI tasks). However, CLAIRE cannot only be used to solve problems of unprecedented scale but also paves the way for real-time applications of large deformation diffeomorphic image registration. We are the first group to present a distributed-memory implementation of a globalized Gauss–Newton–Krylov method for these type of problems.

Figure 1: We compare results for CLAIRE and the diffeomorphic DEMONS algorithm. We consider the first two volumes of the NIREP dataset. We report results for the symmetric diffeomorphic DEMONS algorithm (SDDEM) with regularization parameters () determined by an exhaustive search. We report results for CLAIRE for different choices for the regularization parameter for the velocity ( and ; determined by a binary search). We show the original mismatch on the left. For each variant of the considered algorithms we show the mismatch after registration and a map for the determinant of the deformation gradient. We report values of the Dice score of the union of all available gray matter labels below the mismatch. We also report the extremal values for the determinant of the deformation gradient. We execute the DEMONS algorithm on one node of the CACDS’s Opuntia server (Intel ten-core Xeon E5-2680v2 at with memory (2 sockets for a total of 20 cores)) using 20 threads. We use a grid continuation scheme with 15, 10, and 5 iterations per level, respectively. If we execute CLAIRE on the same system, the runtime is and , respectively. If we increase the number of iterations of SDDEM to 150, 100, 50 per level, we obtain a dice score of and with a runtime of and , respectively. The results for CLAIRE are for 16 nodes with 12 MPI tasks per node on TACC’s Lonestar 5 system (2-socket Xeon E5-2690 v3 (Haswell) with 12 cores/socket, memory per node). We execute CLAIRE at full resolution using a parameter continuation scheme in . Detailed results for these runs can be found in Tab. 12, Tab. 13, and Tab. 17.

1.5 Outline

We present our approach for large deformation diffeomorphic image registration in §2, which comprises the formulation of the problem (see §2.1), a formal presentation of the optimality conditions (see §2.2), and a discussion of the numerical implementation (see §2.3). Numerical experiments are reported in §3. We conclude with §4. Supplementary material is presented in §A, §B, and §C.

2 Methods

In what follows, we describe the main building blocks of our formulation, our solver, and its implementation, and introduce new features that distinguish this work from our former work on constrained diffeomorphic image registration [82, 83, 85, 47, 87, 84]. We use a globalized preconditioned, inexact, reduced space Gauss–Newton–Krylov method to solve (1). Our scheme has been described in [82, 85, 84]. Extensions to our original optimal control formulation [82] are described in [83]. The parallel implementation of the main kernels of our solver is described in [85, 47].

2.1 Formulation

Given two images—the reference image and the template image —compactly supported on , with boundary and closure , our aim is to compute a plausible deformation map such that  [96, 97, 43]. We consider a map to be plausible, if it is a diffeomorphism, i.e., an invertible map, which is continuously differentiable (in particular a -function) and maps onto itself. In our formulation, we do not directly invert for ; we introduce a pseudo-time variable and invert for a stationary velocity , instead. In particular, we solve for the velocity field and a mass source map as follows [83]: equationparentequation


subject to


with periodic boundary conditions on , and , , . The state variable in (2b) represents the transported intensities of subjected to the velocity field ; the solution of (2b), i.e., , is equivalent to , where is the Eulerian (or pullback) map.555The deformation map and the determinant of the deformation gradient can be computed from as a post-processing step (see [82, 83]). We use a squared -distance to measure the proximity between and . The parameters and control the contribution of the regularization norms for and . The constraint on the divergence of in (2d) allows us to control the compressibility of . If we set in (2d) to zero is incompressible, i.e., up to numerical accuracy [52]; see [30, 67, 82, 83, 85, 91, 105, 106] for examples. By introducing a nonzero mass-source map , we can relax this model to near-incompressible diffeomorphisms  [83, 25]; the regularization on in (2a) acts like a penalty on the divergence of ; we use an -norm (a similar formulation can be found in [25]).

If we neglect the constraint (2d) we obtain a formulation that is closely related to available models for velocity-based diffeomorphic image registration [8, 16, 62, 64, 127, 126] (see [62, 82]).666We, likewise to [7, 8, 64, 81, 80, 126], invert for a stationary . The original LDDMM formulation [16, 37, 121, 129] uses time-dependent velocity fields. We note that stationary do not cover the entire space of diffeomorphisms. The paths obtained from stationary are one-parameter subgroups that do not depend on any metric; the definition of such a metric may be desirable in certain applications [16, 92, 132] and, in general, requires nonstationary velocities. We observed for the 2D case that using time-varying did not improve the mismatch for two-image registration problems [82]. Our solver supports different Sobolev (semi-)norms to regularize . We summarize the available operators in Tab. 2. The choice of the differential operator not only depends on application requirements but is also critical from a theoretical point of view; an adequate choice guarantees existence and uniqueness of an optimal solution of the control problem [15, 16, 26, 30, 77] (subject to the smoothness properties of the images).

-seminorm -seminorm -seminorm
Table 2: Regularization norms available in CLAIRE with their associated differential operator in (2a), and their first and second variation .

2.2 Optimality Condition and Newton Step

We use the method of Lagrange multipliers [79] to turn the constrained problem (2) into an unconstrained one; neglecting boundary conditions, the Lagrangian functional is given by


with Lagrange multipliers for the hyperbolic transport equation (2b), for the initial condition (2c), and for the incompressibility constraint (2d). Formally, we have to compute variations of with respect to the state, adjoint, and control variables. We will only consider a reduced form (after eliminating the incompressibility constraint) of the optimality system—a system of nonlinear PDEs for , , and . Details on how we formally arrive at this reduced from can be found in [82, 83].

The evaluation of the reduced gradient (the first variation of the Lagrangian in (3) with respect to ) for a candidate requires several steps. We first solve the state equation (2b) with initial condition (2c) forward in time to obtain the state variable . Given we then compute the adjoint variable by solving the adjoint equation equationparentequation


with periodic boundary conditions on , backward in time. Once we have the adjoint and state fields and , respectively, we can evaluate the expression for the reduced gradient given by


The differential operator in (5) corresponds to the first variation of the regularization norm for in (3), resulting in an elliptic, biharmonic, or triharmonic control equation for , respectively (see Tab. 2). The operator projects onto the space of incompressible or near-incompressible velocity fields. We have

for the incompressible case. If we neglect the incompressibility constraint (2d), in (5) is an identity operator. The dependence of and on is “hidden” in (2b) and (4a), respectively.

The first order optimality condition (control or decision equation) requires that for an admissible solution to (2). Most available registration packages for large deformation diffeomorphic registration use Sobolev gradient descent type optimization schemes to find an optimal point [16, 62, 127]. The search direction is typically given by the gradient in the dual space , :


These methods have linear convergence. Newton-type methods yield super-linear to quadratic convergence [100].777We demonstrated in [82] that our (Gauss–)Newton–Krylov method is superior to this Sobolev gradient descent scheme. However, their implementation requires more work and, if implemented naively, they can become computationally prohibitive.

We use a Newton–Krylov method to solve for , which is globalized using an Armijo linesearch [100]. We use a reduced space formulation [23], in which, for every value of , we eliminate and . Then, given a candidate , the Newton direction is computed based on the second variations of the Lagrangian in (3). The expression for the Hessian matvec (action of the Hessian on the vector ) is given by


where , , and denote the incremental state, adjoint, and control variable, respectively. We use the notation for the application of the reduced space Hessian to indicate that is a function of and through , , , and in (2b), (8a), (4a), and (9a). Each application of the reduced Hessian in (7) requires the following steps: Given and for some candidate we have to solve888We assign the computational costs for computing and to the evaluation of the gradient in (5). equationparentequation


for (forward in time) and equationparentequation


for (backward in time). Computing the Newton step requires the (iterative) solution of


where the right hand is given by the reduced gradient in (5).

2.3 Numerics

In the following, we describe our distributed-memory solver for 3D diffeomorphic image registration problems.999Our original Newton–Krylov scheme is described in [82]. Its parallel implementation is described in [85, 47], where we focus on the scalability of our solver. Improvements of our solver in [82] are described in [84]; the work in [82, 84] is limited to the 2D case. In what follows, we present an extensions of the 3D implementation described in [85, 47].

2.3.1 Discretization

We discretize in space on a regular nodal grid with grid nodes , , , , and periodic boundary conditions; denotes the Hadamard division. In the continuum, we model images as compactly supported (periodic), smooth functions. We apply a Gaussian smoothing (in the spectral domain) with a bandwidth of , and mollify the discrete data, to meet these requirements. We rescale the images to an intensity range of prior to registration.

We use a trapezoidal rule for numerical quadrature and a spectral projection scheme for all spatial operations. The mapping between the spectral and the spatial domain is done using forward and inverse FFTs. All spatial derivatives are computed in the spectral domain; we first take the FFT, then apply the appropriate weights to the spectral coefficients, and then take the inverse FFT. Notice, that this scheme allows us to efficiently and accurately apply differential operators and their inverses. Consequently, the main cost of our scheme is the solution of the transport equations (2b), (4a), (8a), and (9a), and not the inversion of differential (e.g., elliptic or biharmonic) operators. We use a nodal discretization in time, which results in space-time fields we need to solve for. We use an unconditionally stable SL scheme [117, 41] to solve the transport equations, which allows us to keep small (we found that yields a good compromise between runtime and numerical accuracy).

2.3.2 Newton–Krylov Solver

The discretized Hessian operator , , is dense and ill-conditioned [82]. Constructing and storing is out of question, especially for 3D problems (e.g., ). We use iterative, matrix-free Krylov subspace methods, which only require the action of on a vector (see (7)). We use a PCG method [66] under the assumption that is a symmetric, positive (semi)definite operator.101010The exact Hessian is of course guaranteed to be symmetric. But since we use an optimize-then-discretize formulation, discretization discrepancies between the adjoint and the forward operators result in a nonsymmetric Hessian. However, the error in symmetry is small compared to the tolerance levels we use in our solver. In general, the more accurately we want to compute the gradient, the more accurately we need to solve the forward and adjoint problems. The linear solves are done inexactly, with a tolerance that is proportional to the norm of the reduced gradient [36, 38]; our solver supports quadratic and superlinear forcing sequences (see, e.g., [100, 166ff]).

We note that it cannot be guaranteed that the Hessian operator is positive definite far away from the optimum. As a remedy we opt for a Gauss–Newton approximation. This corresponds to dropping all terms in (7) and (9) that involve  [82]. The rate of convergence drops from quadratic to superlinear. As tends to zero (i.e., the mismatch goes to zero; see (4)), we recover quadratic convergence.


As we have pointed out in §2.2, the transport equations (8) and (9) have to be solved every time in (7) is applied to a vector. To design an efficient algorithm we have to keep the number of Hessian matvecs (i.e., the number of PCG iterations) as small as possible. This necessitates the design of an effective preconditioner. We refer to [82] for a study of the spectral properties of the Hessian; for practical values of the regularization parameter

, the Hessian behaves as a compact operator; larger eigenvalues are associated with smooth eigenvectors 


The speed of convergence of the PCG method depends on the distance of from identity; ideally, the spectrum of is clustered around one. In former work, we have considered two preconditioners, a spectral preconditioner based on the inverse of the regularization operator111111The operator is singular if we consider a seminorm as regularization model in (2

). We bypass this problem by setting the zero singular values of the operator

to one before computing its inverse. (a common choice in PDE-constrained optimization problems [5, 28]) and a two-level (nested) preconditioner (proposed and tested in [84] for the 2D case). The latter strategy can be interpreted as a simplified two-level multigrid V-cycle. We present these two preconditioners in more detail next. Both of our approaches are matrix-free, i.e., they do not require assembling or storage of the preconditioner.

Applying the spectral preconditioner to (10) yields


The preconditioned Hessian operator in (11) is a compact perturbation of the identity; we expect a rate of convergence that is independent of the grid size. Notice that the inversion of can be done efficiently and exactly due to the spectral discretization in space; it only requires a forward and an inverse FFT, and an appropriate diagonal scaling in the spectral domain. This preconditioner is used in our past work [82, 83, 85, 87, 86].

We need a good approximation of the inverse of the Hessian to have an effective preconditioner. Our second approach uses the inverse of the reduced space Hessian as a preconditioner (in theory, the ideal preconditioner). The key idea is to amortize the computational costs by solving for the action of the inverse of the Hessian inexactly, on a coarser grid. This preconditioner will only act on the low frequency components of the input vector. We ensure this numerically by separating the spectrum. We do so by applying ideal (spectral) low- and high-pass filters to the vector the preconditioner is applied to (input vector). We proceed as follows: We apply a spectral restriction operator to present the filtered low frequency modes of our input vector to the algorithm for inverting the Hessian on the coarse grid. We apply a spectral prolongation operator to the output vector of this algorithm to project it back to the fine grid, and subsequently add the filtered high frequency modes of the input vector to it (i.e., we leave the high frequency components of the input vector untouched). To be able to evaluate the Hessian matvec, we also need to apply restriction operators to the adjoint and state variable and , respectively.

We do not apply this scheme to (7) but to the spectrally preconditioned Hessian operator in (11)—with a slight modification. The -preconditioned Hessian in (11) is no longer symmetric. Since we can efficiently compute the square root of the operator we use


instead.121212Notice that the left preconditioned Hessian in (11) and the split-preconditioned Hessian in (12) have identical spectral properties. Consequently, we can use a PCG algorithm to compute the action of the inverse of in (12). Notice that the inverse of the regularization operator is a smoother; we can view it as an approximation of a smoother in multigrid methods.

The final building block of our two-level scheme for preconditioning (10) is the solver used to invert the reduced space Hessian on the coarse grid. Forming and storing this operator is out of question (too costly); we opt for a matrix-free iterative solver to compute the action of its inverse on a vector, instead. We have several choices. If we opt for a nested Krylov subspace method, such as PCG, we have to use a smaller tolerance for the inversion of the reduced space Hessian , , on the coarse grid (i.e., the preconditioner) than we use for the inversion of on the fine grid, i.e., with . This is due to the fact that Krylov subspace methods are nonlinear operators; this scheme would violate the theory of preconditioning linear systems if the preconditioner was not inverted accurately. Another approach is to use a semi-iterative Chebyshev (CHEB) method [53], which is a fixed linear operator for a particular choice of eigenvalue bounds [49]

. We have to estimate these bounds at every Newton iteration; this is costly. We do so using the Lanczos method. This is the method we used in 

[84].131313We have also considered a flexible PCG method [12, 101] for the inversion of the Hessian operator. We observed that the performance deteriorated significantly as we reduce the regularization parameter. We disregard this approach.

Stopping Criteria

We terminate the inversion if the -norm of the gradient in (5) is reduced by a factor of , i.e., if , where is the gradient at iteration and is the gradient for the initial condition of the inversion. In most of our experiments, we use . We also provide an option to set a lower bound for the -norm of the gradient (the default value is ). CLAIRE also features the stopping criteria discussed in [97, 48, 82] (not considered here).

2.3.3 Hyperbolic PDE Solver

We use a fully explicit, unconditionally stable SL scheme [41, 117] to solve the hyperbolic transport equations that appear in our Eulerian formulation ((2b), (4a), (8a), and (9a)).141414We introduced our SL formulation in [84] and introduced a first parallel implementation in [85]. SL methods are a hybrid between Eulerian and Lagrangian methods. To explain our scheme, we consider the general transport equation

with periodic boundary conditions on . We introduce the notion of a characteristic that is defined by the flow on a time interval , . The interval corresponds to a single time step of size of our scheme for integrating the transport equation with .

If the Lagrangian derivative vanishes, is constant along the characteristic . The value of at a given coordinate at can then be computed by tracing the characteristic backward in time to obtain the location (departure point) the particle originated from. We can trace the characteristic by solving in with final condition at . We solve this ODE backward in time based on a fully explicit RK2 scheme (Heun’s method):

Notice, that is stationary; we have to compute the characteristic only once for a single time step at each Newton iteration (more precisely, once for the state and once for the adjoint equations). The coordinate is in general an off-grid location; the evaluation of requires three scalar interpolations. We use a cubic interpolation model.

In a subsequent step, we have to solve along the characteristic . As we have mentioned earlier, is constant along for the homogeneous case (). Hence, we can compute at by scalar interpolation at the off-grid locations ; the solution of the underlying PDE overall requires interpolations. If the Lagrangian derivative does not vanish, we compute the solution based on an RK2 scheme. Depending on the form of , this requires the evaluation of differential and/or interpolation operators for each datum evaluated at (see [84] for details). Overall, our scheme is second order accurate in time and third order accurate in space.

Figure 2: Illustration of the SL scheme in 2D. The original grid at timepoint is distributed across processors P, . To solve the transport problem, we have to trace a characteristic for each grid point backward in time. The deformed configuration of the grid (‘’departure points‘’) originally owned by P4 (red points) are displayed in overlay. We illustrate three scenarios: The departure point is located (i) on P4 (left; ), (ii) on a different processor P1 (left; ), and (iii) between processors P3 and P4 (right). For the first case, no communication is required. The second case requires the communication of to P1, and the communication of the interpolation result back to P4. For the third case, we add a ghost layer with a size equal to the support of the interpolation kernel (4 grid points in our case) to each processor; the evaluation of the interpolation happens on the same processor (like in the first case). Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since our velocity field is stationary.

2.3.4 Continuation Schemes

The automatic selection of an optimal regularization parameter for unseen data is a topic of research by itself [61, 59]. The choice of the regularization norm (i.e., the regularity requirements for the deformation map ) depends on the application. Subject to smoothness requirements on the data, we know that we require at least -regularity of to guarantee that is a diffeomorphism [16, 37, 121]. In a practical setting, we assume that it is sufficient to require that is strictly positive (up to numerical accuracy). We can, for instance, monitor during the optimization and adjust the regularization parameter to guarantee that is locally diffeomorphic.


In our formulation, does not appear. Also, we have to differentiate (or the displacement field) once we found ; resolving is challenging and may require a significantly larger amount of discretization points than we use for the solution of the optimization problem. For this reason, we do not compute ; we directly transport .

Probing for an adequate is expensive, since it typically requires a repeated solution of the inverse problem. We assume that, once we have found an adequate , we can use this parameter for similar registration problems. Such cohort studies are quite typical in medical imaging. We determine based on a binary search (originally proposed in [82]; a similar strategy is described in [59]). The measure for this search is a bound (a user defined parameter) on . We choose so that . We start at level with a regularization parameter of and reduce by one order of magnitude until we hit the tolerance . At this point, we execute a binary search. We terminate the binary search if the update for at level is smaller than of the value for at which we reached the bound (i.e., if we reach for we terminate the binary search if the update for between level and is smaller than ). We use the solution of level as initial guess for the next level. We perform at least one Gauss–Newton iteration, and terminate the level if the gradient is reduced by (with respect to the norm of the gradient for our initial guess ; notice that we only need to evaluate the reference gradient once for this choice).

We implement the following acceleration schemes to speed up the rate of convergence and reduce the likelihood to get trapped in local minima:

  • Parameter continuation. For a given target parameter , we reduce (starting with ) by one order of magnitude until we reach the order of . We suggest to solve the problem on each level with a larger tolerance for the stopping criteria (e.g., ). On the final level, we set to and solve with the user defined tolerance . We perform at least one Gauss–Newton iteration per level.

  • Grid continuation. Our solver also features a (spectral) grid continuation (i.e., multilevel) scheme. We solve the problem on coarse grid first, and successively refine until we reach the resolution of the original image data. The restriction and prolongation is performed in the spectral domain. The solution on the coarse grid at level is used as the initial condition for the fine grid at level . At each level, we solve the registration up to the user defined tolerance .

  • Scale continuation. We have also implemented a multi-scale approach. We use a Gaussian kernel (applied in the spectral domain) for the scale space representation of the image data.

We summarize the critical parameters of CLAIRE in Tab. 3.

variable meaning suggested value determined automatically
regularization parameter for yes
regularization parameter for no
relative tolerance for gradient no
number of time steps 4 no
bound for 0.25 (-div) or 0.1 () no
Table 3: Parameters available in CLAIRE (there are more, but these are the critical ones).

2.3.5 Implementation Details

We make CLAIRE available under GPL license. It can be downloaded at CLAIRE is written in C/C++; it implements data parallelism via MPI. Our solver supports single and double precision. We use the PETSc library [13] for linear algebra, and PETSc’s TAO [98] for numerical optimization. We provide routines for the evaluation of the objective functional in (2a), the evaluation of the gradient in (5), the Hessian matvec in (7), the construction and application of the preconditioner, and routines that control the tolerances and implement the stopping conditions. We use the AccFFT package [45]—a parallel, open-source FFT library for CPU/GPU architectures developed in our group, to apply the spectral operators.

AccFFT dictates the data layout: We partition the data based on a pencil decomposition for 3D FFTs [50, 35]: Let denote the number of MPI tasks; then each MPI task gets grid points (i.e., we partition the domain along the - and -axis into subdomains , , of size . There is no partitioning in time. In our improved implementation presented in this work we have significantly reduced the memory allocation for the Gauss–Newton approximation; we only store the time history of the state variable , , . This is accomplished by, e.g., performing the integration of

during the solution of the adjoint equation in (4) (we proceed in a similar fashion for the evaluation of (7), in particular, when computing the solution of the incremental adjoint equation (8)). Here, , represents the Kronecker product, and is the Hadamard product. We summarize the memory requirements in Tab. 4. Notice that we in addition to that require the memory of the Hessian matvec, if we consider the two-level preconditioner (under the assumption that the coarse level is half the size of the fine level). The spectral preconditioner does not add to the memory pressure.

Newton Gauss–Newton
Gradient (see (5))
Hessian matvec (see (7))
Table 4: Estimates for the memory pressure.

The scalability of the 3D FFT is well explored [50, 35, 45]. If we assume that the number of grid points , , is equal along each spatial direction, i.e., , each 3D FFT requires computations and communications, where is the startup time for the data transfer and represents the per-word transfer time [50].

We use a tricubic interpolation model to evaluate off-grid points in our SL scheme (see §2.3.3). The parallel implementation of our interpolation kernel is introduced in [85] and improved in [47]. The polynomial is implemented in Lagrange form. The evaluation of this interpolation kernel requires the computation of twelve basis polynomials. The local support of the cubic basis is grid points. Overall, this results in a complexity of computations. We have implemented a SIMD vectorization based on advanced vector extensions (AVX2) for Haswell architectures, for the evaluation of the interpolation kernel. In addition to that we have introduced a binning method (based on Morton sorting) that reduces the cache misses (see [47] for details). This yields an interpolation kernel that is now bound by communication time instead of time spent in the interpolation (which was the case in [85]).

The communication costs are more difficult to estimate; they depend on the computed characteristic (see Fig. 2 for an illustration). If a departure point is owned by the current processor, we require no communication. If the values for a departure point are owned by another processor/MPI task (the “worker”), we communicate the coordinates from the “owner” to the “worker”. We then evaluate the interpolation model on the “worker” and communicate the result back to the “owner”. This results in a communication cost of per off-grid point not owned by a processor. To evaluate the interpolation model at off-grid points not owned by any MPI task (i.e., located in-between the subdomains ), we add a layer of four ghost points (scalar values to be interpolated; see Fig. 2, right). This results in an additional communication cost of for each MPI task for the four face neighbors, where is the size of layer for the ghost points (in our case four). The communication with the four corner neighbors can be combined with the messages of the edge neighbors, by appropriate ordering of the messages. Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since our velocity field is stationary. We perform this communication when we evaluate the forward and the adjoint operators (i.e., during the evaluation of the objective functional and the reduced gradient).

3 Experiments

We report results for real-world and synthetic datasets. We evaluate the registration accuracy for 16 segmented MRI brain volumes [31]. Details on the considered datasets can be found in §A. We showcase two exemplary datasets in Fig. 3. Notice, that these datasets have been rigidly preregistered. We directly apply our method to this data (without an additional affine preregistration step). The runs were executed on the CACDS’s Opuntia server or on TACC’s Lonestar 5 system. The specs of these systems can be found in §B.


In most experiments, we consider an -seminorm for the velocity since we observed (see also [83]) that it yields a good trade-off between time-to-solution and registration quality. We consider a moderate regularization weight of . We will see that this results in a good agreement between the reference image and the deformed template image. If we further reduce the regularization weight, we obtain a smaller mismatch at the cost of a prolonged runtime.

Figure 3: Illustration of exemplary datasets from the NIREP dataset [31]. Left: Volume rendering of an exemplary reference image (dataset na01) and an exemplary template image (dataset na03), respectively. Right: Axial slice for these datasets together with label maps associated with these data. Each color corresponds to one of the 32 individual anatomical gray matter regions that serve as a ground truth to evaluate the registration performance.

3.1 Convergence: Preconditioner

We study the performance of different variants of our preconditioner for the reduced space Hessian.

Setup. We solve the KKT system in (7) at a true solution . This velocity is found by registering two neuroimaging data sets from the NIREP dataset (na01 and na02). The images are downsampled to a grid of size (half the original resolution). We consider an -div regularization model with and and an regularization model with with a tolerance to compute . Once we have found , we generate a synthetic reference image by transporting the reference image using . We use the velocity as an initial guess for our solver, and iteratively solve for the search direction using different variants of our preconditioner. The number of time steps for the PDE solves is set to . We fix the tolerance for the (outer) PCG method to . We consider an inexact Chebyshev semi-iterative method with a fixed number of iterations (denoted by CHEB()) to compute the action of the inverse of our preconditioner. We also consider a nested PCG method for the iterative inversion of the preconditioner, with a tolerance of (denoted by PCG()). We compare these strategies to a spectral preconditioner (inverse of the regularization operator ; used in [85, 47, 86]). We study the rate of convergence of the PCG solver for a vanishing regularization parameter . We consider mesh sizes of and .

Results. We display the trend of the residual with respect to the (outer) PCG iterations in Fig. 4 (-seminorm for with ) and in Fig. 5 (-div regularization model with an -seminorm for , and ), respectively. We provide detailed results in Tab. 6 and Tab. 7 in §C.1. We execute the code on a single node of Opuntia with 20 MPI tasks.

Figure 4: Convergence of Krylov solver for different variants of the preconditioner for the reduced space Hessian. We consider an -seminorm as regularization model for the velocity. We report results for different regularization weights . We report the trend of the relative residual for the outer Krylov method (PCG) versus the iteration count. We report results for the spectral preconditioner and the two-level preconditioner. We use different algorithms to compute the action of the inverse of the preconditioner.
Figure 5: Convergence of Krylov solver for different variants of the preconditioner for the reduced space Hessian. We consider an -div regularization model with an -seminorm for the velocity. We report results for different regularization weights . We set . We report the trend of the relative residual for the outer Krylov method (PCG) versus the iteration count. We report results for the spectral preconditioner and the two-level preconditioner. We use different algorithms to compute the action of the inverse of the preconditioner.

Observations. The most important observations are:

  • The PCG method converges significantly faster for the two-level preconditioner.

  • The performance of all preconditioners considered in this study is not independent of the regularization parameter . The workload increases significantly for vanishing regularity of the velocity for all preconditioners. The plots in Fig. 4 and Fig. 5 imply that the convergence of the PCG method for the two-level preconditioner is less sensitive to (or even independent of) . The work goes to the inversion of the reduced space Hessian on the coarse grid (see Tab. 6 and Tab. 7 in §C.1 for details). If we further reduce the regularization parameter (below for the -regularization model and below for the -div regularization model) the performance of our preconditioners deteriorates further; the runtime becomes impractical for all variants of our preconditioners.

  • The rate of convergence of the PCG method is (almost) independent of the mesh size for all preconditioners.151515We note that we apply a smoothing of along each spacial dimension so that the input image data is resolved on the coarse grid of size . The same frequency content is presented to the solver on the fine grid of size .

  • The PCG method converges significantly faster if we consider an -regularization model for . This is a direct consequence of fact that the condition number of the Hessian increases with the order of the regularization operator .

  • The differences of the performance of the preconditioners are less pronounced for an -div regularization model for than for an -regularization model. For an regularization model with we require more than 200 iterations for the spectral preconditioner.

  • We obtain a speedup of up to 2.9 for the -regularization model (run #20 in Tab. 6 in §C.1) and a speedup of up to 2.6 for the -div regularization model (run #40 in Tab. 7 in §C.1). The coarser the grid, the less effective is the two-level preconditioner, especially for vanishing regularization parameters . This is expected, since we cannot resolve high-frequency components of the fine level on the coarse level. Secondly, we do not use a proper (algorithmic) smoother in our scheme to reduce the high-frequency errors.

  • The performance of the CHEB and the nested PCG method for iteratively inverting the reduced space Hessian are similar. There are differences in terms of the mesh size. For a coarser grid () the CHEB seems to perform slightly better. For a grid size of the nested PCG method is slightly better.

Conclusions. (i) The two-level preconditioner is more effective than the spectral preconditioner. (ii) The nested PCG method is more effective than the CHEB method on a finer grid (and does not require a repeated estimation of the spectrum of the Hessian operator). (iii) The PCG method converges faster if we consider an -div regularization model for . (iv) Designing a preconditioner that delivers a good performance for vanishing regularization parameters requires more work .

3.2 Convergence: Newton–Krylov Solver

We study the rate of convergence of our Newton–Krylov solver for the entire inversion. We consider a synthetic test problem and neuroimaging data (see Fig. 3).

3.2.1 Synthetic Data

Setup. We report results for a synthetic test problem (see §A) discretized on a grid of size . We consider an -regularization model (seminorm; ) and an -div regularization model (-seminorm for ; , ). We run the registration at full resolution ( unknowns). The number of Newton iterations is limited to 50 (not reached). We use a superlinear forcing sequence and limit the number of Krylov iterations to 100 (not reached). The tolerance for the relative change of the gradient is ; the absolute tolerance for the norm of the gradient is . The number of time steps for the PDE solves is set to . We use 20 cores (64GB compute nodes) resulting in a processor layout of ( unknowns per core). We do not perform any parameter, scale, or grid continuation. We will consider the same synthetic test problem when we study the scalability of our solver in §3.5.

Results. We report results in Fig. 6. The top row shows results for the -div regularization model and the bottom row for the regularization model. We plot the relative reduction of the mismatch, the reduced gradient, and the objective functional with respect to the Gauss–Newton iteration index. We also report results for the convergence of the PCG solver for different realization of the preconditioner.

Figure 6: Convergence of Newton–Krylov solver for a synthetic test problem for different regularization norms and variants for the preconditioner. The top row shows results for an -div regularization model (-seminorm for with and ). The bottom row shows results for an regularization model (seminorm; ). We plot (from left to right) the relative reduction of () the mismatch (-distance between the images to be registered), () the reduced gradient, and () the objective functional, with respect to the Gauss–Newton iterations. We use a relative change of the gradient by as a stopping criterion (dashed red line in second column). The two plots on the right show the convergence of the PCG solver per Gauss–Newton iteration for different realization of the preconditioner, respectively. The vertical, dashed red lines separate the individual Gauss–Newton iterations; the PCG iteration index is cumulative.

Observations. The most important observations are:

  • Our Newton–Krylov solver converges after 3 to 4 Gauss–Newton iterations.

  • We can reduce the gradient by three orders of magnitude in less than 5 Gauss–Newton iterations (about one order of magnitude per Gauss–Newton iteration if we consider a two-level preconditioner).

  • We require one Hessian matvec per Gauss–Newton iteration for the two-level preconditioner in combination with a nested PCG method. The residual in the PCG method drops rapidly for both preconditioners.

  • We obtain a better search direction per Gauss–Newton iteration if we consider the two-level preconditioner in combination with a nested PCG method (we slightly oversolve the KKT system for the nested preconditioner). The reduced gradient drops more rapidly. The trend of the objective and the mismatch is more or less similar.

Conclusions. Our Newton–Krylov solver converges quickly for smooth data. We can observe that the two-level preconditioner is not only more effective for the inversion of the reduced space Hessian, but also yields a better search direction; the Newton–Krylov solver converges quicker. We will see that these differences are less significant for real data. This indicates that the performance of our scheme deteriorates as the objects become more irregular.

3.2.2 Real Data

Setup. We register the datasets na02 through na16 (template images) with na01 (reference image). We run the registration at the full resolution (; unknowns). We consider and -div regularization model (-seminorm for with and ; the parameters are chosen empirically). We perform these runs on TACC’s Lonestar 5 system. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of and (the latter is not reached) for the relative reduction and the absolute -norm of the reduced gradient as a stopping criterion. We use time steps for numerical time integration. We compare results obtained for the two-level preconditioner to results obtained using a spectral preconditioner (inverse of the regularization operator). We use a nested PCG method with a tolerance of for computing the action of the inverse of the two-level preconditioner. We do not perform any parameter, scale, or grid continuation. We compare results obtained for single () and double () precision.

Results. We show convergence plots for all datasets in Fig. 7. We plot the relative reduction of the mismatch (left column), the relative reduction of the gradient (middle column), and the relative reduction of the objective functional (right column) with respect to the Gauss–Newton iterations. The top row shows results for the spectral preconditioner; the bottom rows show results for the two-level preconditioner. Detailed results for these runs are provided in Tab. 8, Tab. 9 and Tab. 10 in §C.2. We also report a comparison of the performance of our solver for single () and double () precision in Tab. 11 for two exemplary images of the NIREP dataset in §C.2.

Figure 7: Convergence of CLAIRE’s Newton–Krylov solver for neuroimaging data for different realizations of the preconditioner. Top row: inverse regularization operator. Middle and bottom row: two-level preconditioner using PCG() for double (; middle row) and single (; bottom row) precision, respectively. We report results for 15 multi-subject brain registration problems (na02 through na16 of the NIREP repository registered to na01). Each of these 15 registrations is plotted in a different shade of gray. We plot (from left to right) the relative reduction of () the mismatch (squared -distance between the images to be registered), () the reduced gradient, and () the objective functional, with respect to the Gauss–Newton iterations. We use a relative change of the gradient by as a stopping criterion (dashed red line in second column). We also report the runtime in seconds for each registration problem (right plot at top) and an exemplary plot of the reduction of the residual of the (outer) PCG solver per Newton iteration (right plot at bottom; the Newton iterations are separated by vertical dashed lines). The runs are performed on one node of TACC’s Lonestar 5 system. The results reported here correspond to those in Tab. 8, Tab. 9 and Tab. 10 in the appendix.

Observations. The most important observations are:

  • Switching from double to single precision does not affect the convergence of our solver (see Fig. 7; see also Tab. 11).161616This is only true if we consider an -regularization model. For an -regularization model we have observed that single precision does not provide sufficient numerical accuracy to ensure the convergence of our method. Stabilizing the computations for higher order regularization operators for single precision requires more work.

  • The two-level preconditioner executed with single precision yields a speedup of up to 6 (with an average speedup of ) compared to our baseline method (spectral preconditioner executed in double precision) [83] (see Fig. 7 top right; run #13 in Tab. 10). Switching from single to double precision yields a speedup of more than 2 (see Tab. 11).

  • The average runtime of our improved solver is with a maximum of (run #13 in Tab. 10) and a minimum of (run #7 in Tab. 10).

  • We obtain a very similar convergence behavior for the different variants of our solver (see Fig. 7). We can reduce the -norm of the gradient by in 6 to 14 Gauss–Newton iterations (depending on the considered pair of images).

  • The mismatch between the deformed template image and the reference image stagnates once we have reduced the gradient by more than one order of magnitude (for the considered regularization weight).

  • We oversolve the reduced space KKT system if we consider a superlinear forcing sequence in combination with a nested PCG method (see Fig. 7).

Conclusions. (i) Our improved implementation of CLAIRE yields an overall speedup of for real data if executed on a single resolution level. (ii) Executing CLAIRE in single precision does not deteriorate the performance of our solver (if we consider an -regularization model for the velocity) .

3.3 Time-to-Solution

We study the performance of our solver for different continuation schemes to stabilize and accelerate the computations of CLAIRE. We note that the DEMONS algorithm requires between (3 levels with 15, 10, and 5 iterations) and (3 levels with 1500, 1000 and 500 iterations) until ”convergence” on the same system (depending on the parameter choices; this includes a grid continuation scheme; see §3.4 for details).171717Since we perform a fixed number of iterations for the DEMONS algorithm, the runtime only depends on the execution time of the operators. The regularization parameters control the support of the Gaussian smoothing operator; the larger the parameters, the longer the execution time. This is different for CLAIRE; large regularization parameters result in fast convergence and, hence, yield a short execution time. A simple strategy to obtain competitive results in terms of runtime would be to also execute CLAIRE for a fixed number of iterations. We prefer to use a tolerance for the relative reduction of the gradient, instead, since it yields consistent results across different datasets.

Setup. We use the dataset na02 and na03 as a template images, and register them to na01 (reference image). We consider and -div regularization model (-seminorm for with and ; these parameters are chosen empirically). We execute these runs on one node of the Opuntia system using 20 MPI tasks. The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of for the relative reduction of the -norm of the gradient and a tolerance of (not reached) for its -norm as a stopping criterion. We use time steps for numerical time integration. We compare results obt