Hubless keypoint-based 3D deformable groupwise registration

09/11/2018 ∙ by Rémi Agier, et al. ∙ INSA Lyon

We present a novel deformable groupwise registration method applied to large 3D image groups. Our approach extracts 3D SURF keypoints from the images, computes matched pairs of keypoints, and registers the group by minimizing pair distances in a hubless way, i.e. without computing any central mean image. Using keypoints significantly reduces the problem complexity compared to voxel-based approaches, and enables us to provide an in-core global optimization, similar to Bundle Adjustment for 3D reconstruction. As we aim at registering images of different patients, the matching step yields many outliers. We therefore propose a new EM-weighting algorithm which efficiently discards outliers. Global optimization is carried out with a fast gradient descent algorithm. This allows our approach to robustly register large datasets. The result is a set of half transforms which link the volumes together and can be subsequently exploited for computational anatomy, landmark detection or image segmentation. We show experimental results on whole-body CT scans, with groups of up to 103 volumes. On a benchmark based on anatomical landmarks, our algorithm compares favorably with the star-groupwise voxel-based ANTs and NiftyReg approaches while being much faster. We also discuss the limitations of our approach for lower resolution images such as brain MRI.







1 Introduction

Registration of several images together, also known as groupwise registration, is nowadays most often carried out for human brain studies (jenkinson2002improved). Whole body studies remain rare as they still raise significant problems (XU2016BENCHMARK). The scalability of current groupwise registration algorithms is limited, while advances in imaging techniques are constantly increasing the number and size of 3D images in hospital databases. Hence the need for low complexity groupwise image registration techniques. Instead of dense voxel-based registration, matching sparse keypoints extracted from the images is a promising approach. Keypoints are extracted with their location and description vector, and points from different images are paired by comparing their description vectors. This is a great challenge when shifting from intrapatient registration to interpatient registration, as the human anatomy exhibits a large variability.

In this paper we propose a novel groupwise registration approach aimed at large image databases, able to register high resolution images such as the whole body CT scans shown in Figure 9. Inspired by advances in the computer vision field, our algorithm exploits keypoint detection and matching, which bring speed and robustness. Some paired points are erroneous (outliers), and we devised an algorithm that remains robust against outliers even when they are very frequent. Potential applications range from Computational Anatomy to Forensic Anthropology and robust patient-specific model construction. An example of forensic application is probabilistic sex diagnosis using worldwide variability in hip-bone measurements.

Moreover, the ever-increasing amount of medical images stored in Picture Archiving and Communication Systems (PACS) in hospitals offers a great opportunity for big data analysis. Screening huge image groups could improve diagnosis and could be of critical help for Computational Anatomy. Recent advances in machine learning (michalski2013machine) or atlas-based approaches (Iglesias2015survey) have pushed for efficient big data analysis tools. Yet the modest size of current annotated medical image databases limits the use of these approaches for medical imaging. As a consequence, whole-body groupwise registration could bridge the gap between big data and organ localization, segmentation, and Computational Anatomy. One known limitation of our approach is that the number of extracted keypoints should be as high as possible to obtain the best accuracy.

The paper is organized as follows: section 2 presents previous works related to registration, keypoints and Bundle Adjustment. Section 3 summarizes the key contributions of our hubless approach. Section 4 explains our approach in detail while section 5 shows results obtained with our approach, as well as comparisons with both NiftyReg and ANTs star-groupwise algorithms. We conclude the paper in section 6.

2 Related works

2.1 Image registration

Image registration, also known as image matching, fusion or warping, consists in finding a transform between two or more images, mapping any point from a source image to a position in the target image. It is a crucial step in many medical applications, such as longitudinal studies (scahill2003longitudinal), radiotherapy planning (keall2005four), brain studies (jenkinson2002improved), atlas-based segmentation (lotjonen2010fast), image reconstruction (huang2008three) and microscopy (vercauteren2007non). We present here a brief state of the art, and we refer the reader to Brown:1992:SIR:146370.146374; maintz1998survey; sotiras2013deformable for thorough reviews of existing approaches.

2.1.1 Pairwise & Groupwise registration

Pairwise registration consists in matching one moving image to a fixed image. Groupwise registration consists in registering a whole set of images together.

One can categorize groupwise registration approaches using graph theory, as shown in Figure 1, where vertices represent images and edges represent transforms. Transforming any image into any other image requires at least a graph spanning all vertices; hence at least N − 1 transforms are needed for N images.

  • Star groupwise methods register each image of the set against a reference image. In this case the graph is a star graph (Figure 1(a)). The output is a set of half transforms mapping each image to the common space. After registration, the transform from one image to another is obtained by composing the two corresponding half transforms (inverting one of them). These methods are biased by the choice of the initial reference or template image. This bias takes two forms: the intensity bias and the shape bias. guimond00 reduced these biases by iteratively updating the reference image, taking into account the average of each transformed image and the average of image transformations. Later, joshi2004unbiased improved the theoretical foundations of this approach via diffeomorphisms. An example of application to brain pediatric studies is shown in fonov11. The publicly available softwares ANTs (Avants2008ants), Elastix (klein10) and NiftyReg (modat2008) all belong to the star groupwise category.

  • Tree groupwise methods use a spanning tree graph and multiple references (wu2011sharpmean), as shown by Figure 1(b). Variability can then be distributed across several references instead of only one.

  • Hubless methods use an abstract common space, but do not use any reference image: while star and spanning tree graphs are minimal graphs, hubless methods use a dense graph to benefit from more constraints, as shown by Figure 1(c). When registering N images, up to N(N − 1)/2 pairwise observations can be used. Observations can be pairwise registrations, or local matches. hamm2010 proposed GRAM, a hubless registration method based on manifold learning. This algorithm requires dense pairwise registration of all image pairs, which is very time consuming. As an example, the authors report a processing time of 24 hours on a cluster to register 416 low resolution images. ying2014hierarchical proposed a similar approach, using geodesic graph shrinkage. A recent work on groupwise nonrigid registration was also proposed by wu2012feature, but is dedicated to the processing of brains acquired with MRI, as it requires a segmentation of the image into White Matter, Grey Matter and Ventricular Cerebrospinal Fluid. More recently, the approach proposed in agier2016hubless uses no reference at all. This makes it possible to deal with partially matching body parts and high variability, but is restricted to rigid transforms.

Groupwise registration has most often been used for brain analysis, as the overall brain shape exhibits low variation between subjects. Whole body studies such as the one carried out by Suh2011 remain rare. The key advantage of hubless approaches becomes clear when averaging a group of images is problematic. A good example is the registration of full body images, where computing the average between men and women is inconsistent in several regions.

(a) Star
(b) Tree
(c) Hubless
Figure 1: Different groupwise registration methods. Images are outlined in red, reference images are outlined in dotted black, transforms in dashed orange and observations in black. For dense registrations, observations are generally voxelwise differences, cross-correlation or mutual information. For sparse registrations, observations are usually keypoint match distances. Transforms are optimized via the minimization of an energy function driven by the observations. (a) star groupwise registration (1 reference, N transforms and N observations). (b) tree-groupwise registration (multiple references, N − 1 transforms, N − 1 observations). (c) hubless registration, with no reference image but one abstract common space, N half-transforms and up to N(N − 1)/2 observations.
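The transform and observation counts above follow directly from graph theory; a tiny sketch for concreteness (function names are ours, not from the paper):

```python
# Edge counts for the registration graphs: a spanning structure needs n - 1
# transforms for n images, while a hubless (complete) graph offers up to
# n * (n - 1) / 2 pairwise observations.
def spanning_tree_edges(n: int) -> int:
    # Minimal number of transforms linking n images together.
    return n - 1

def complete_graph_edges(n: int) -> int:
    # Maximal number of pairwise observations between n images.
    return n * (n - 1) // 2

print(spanning_tree_edges(20), complete_graph_edges(20))  # 19 190
```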

2.1.2 Rigid & Deformable

Rigid registration (ashburner2003rigid) consists in computing a linear transform between two images. The transform can be a translation or a more expressive one, such as a combination of translation, rotation and scaling. Rigid transforms can be used for Procrustes Analysis (schonemann1966generalized), Generalized Procrustes Analysis (Bartoli2013), mosaic assembly, or as an initialization step for nonrigid transforms. Nonrigid (deformable) registration consists in finding a free transform able to express local variability, which cannot be done with a rigid transform. Nevertheless, the transform cannot be arbitrary, and should generally be smooth and invertible.

2.1.3 Dense & Sparse

Registration approaches can also be split in two classes, depending on how they process the input data:

Dense registration estimates displacement on the entire image domain (Avants2008ants; modat2008; modat2014). The most common approaches use intensity differences between images, but there are many other criteria such as cross-correlation (lewis1995fast) or mutual information (pluim2003mutual). The output is a displacement vector for each input voxel. The key difficulties with dense registration are its high computational cost and an ill-posed optimization problem, which generally needs explicit regularization (robinson2004fundamental). As an example, the Advanced Normalization Tools (ANTs) framework (Avants2008ants) registers two thoracic images in about 2 hours on our test machine.

Sparse registration, instead of working on a dense grid, uses only point sets. Compared to dense registration, sparse registration is much faster (allaire2008full; cheung2009n) but may yield less accurate results because points do not always span the whole space. For nonrigid registration, sparse transforms need an interpolant, such as splines or radial basis functions (fornefett2001radial). The interpolant inherently provides a smooth solution, which helps regularize the problem. Point sets can be of different types:

  • Reference landmarks, which are distinguishable anatomic structures manually placed on the image by an expert. This task is usually carried out by physicians and can be time consuming. These are used in (wang06; li12).

  • Reference landmarks automatically placed on the image by an algorithm, such as proposed by Zhu et al.(zhu13)

  • Vertices of a surface mesh representing the object to register, such as proposed in (rasoulian12)

  • Keypoints, automatically detected and placed in the image. One important difference between reference landmarks and keypoints is that keypoints are not defined by anatomical definitions but by mathematical properties of the image. Hence, although it is interesting to find the same keypoints in each image, this is never the case in practice, and the number of extracted points from each image is not a priori known. Next section (2.2) gives more details on keypoints.

Some works use a combination of both dense and sparse approaches. As an example, wu2012feature uses driving voxels to speed up a dense method.

2.2 Keypoints

During the last decades keypoints (harris1988combined) have successfully been used for object recognition (lowe1999object), action recognition (wang2011action), robotic navigation with Simultaneous Localization and Mapping (karlsson2005vslam) and panorama creation (anguelov2010google). They aim at being fast while reducing the amount of data to process, mainly to deal with real time processing or tasks involving large amounts of data. Their suitability to medical imaging has been evaluated in (lopez1999evaluation), and various applications have been proposed in this context such as image annotation (datta2005content), retrieval (zheng2008interest) and segmentation (KECH-14; wachinger18). Initially designed for 2D images, keypoint approaches have been extended to process 3D medical images by (allaire2008full; agier2016hubless; cheung2009n; SIFT3D).

The keypoint extraction pipeline is the following: first, for each image, a detector extracts locations exhibiting important features such as corners (first derivative analysis (harris1988combined)) or blobs (second derivative), most often via fast approximations of these derivatives. As an example, Scale Invariant Feature Transform (SIFT) uses differences of Gaussians (lowe2004distinctive) and Speeded Up Robust Features (SURF) use integral images (bay2006surf). Afterwards, the neighborhood of each location is summarized into a compact feature vector, called descriptor. SIFT populates the feature vector with orientation histograms, while SURF stores local texture information.
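The speed of SURF rests on the integral-image trick: any box-filter sum costs four lookups regardless of the box size. A small 2-D sketch of that trick (names ours, not from the paper's code):

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    # ii[y, x] = sum of img[:y, :x]; a zero row/column pads the top-left border.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii: np.ndarray, y0: int, x0: int, y1: int, x1: int) -> float:
    # Sum of img[y0:y1, x0:x1] from four corner lookups, in O(1).
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```

The same construction extends to 3D volumes with one more cumulative sum and eight corner lookups.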

Efficient registration can be performed as follows: (1) keypoints are extracted from the input images, with locations and descriptions. (2) Keypoints from different images are paired by comparing their feature vectors. Note that the number of keypoints may vary between images. This results in a set of paired points. (3) Registration is performed by minimizing the distance between all paired points of this set (mikolajczyk2005performance).
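Step (2) of this pipeline is typically implemented with a nearest-neighbour ratio test on descriptor distances; a minimal sketch (the 0.8 threshold is a common default, not the paper's tuned value):

```python
import numpy as np

def match_descriptors(desc_a: np.ndarray, desc_b: np.ndarray, ratio: float = 0.8):
    # Pair each keypoint of image a with its nearest descriptor in image b,
    # keeping the pair only if the best match is clearly better than the
    # runner-up (Lowe's ratio test). Requires at least two descriptors in b.
    pairs = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[k]:
            pairs.append((i, int(j)))
    return pairs
```

Real matchers add further criteria, as the paper does (laplacian sign, scale difference), and accelerate the search with k-d trees.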

The use of paired points combined with sparse registration approaches leads to very fast registration compared to dense registration. Moreover, as keypoints have to be extracted only once per image, repeated registration with the same reference image is very efficient.

A drawback of using keypoints is the significant number of mismatches (outliers) in the paired point set. Outliers are pairs of keypoints with similar descriptors describing regions which are actually different and therefore should not be paired. Decreasing the image quality (e.g. with the low-end cameras and webcams used in consumer electronics devices) potentially increases the outlier ratio to more than 30% of the set (labe2004geometric) (37% in (mikolajczyk2002affine)).

M-estimators (huber2011robust) provide a good way to mitigate the influence of outliers in this context. During optimization, M-estimators provide an adaptive weighting function which adjusts the contribution of each pair by computing robust statistics on the distance distribution. Each point pair can then be classified as inlier or outlier (soft classification) and weighted accordingly. For this aim, one needs an estimation of the inlier contribution variance, computed via the Median Absolute Deviation (MAD). Note that the maximum outlier ratio that MAD can process (the breakdown point (donoho1983notion)) is 50%. Hence the breakdown point of the M-estimator is also 50%.
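A minimal sketch of the MAD scale estimate and the resulting soft inlier/outlier weighting behind classical M-estimators (Huber weights; the constants are the usual textbook defaults, not values from the paper):

```python
import numpy as np

def mad_sigma(residuals: np.ndarray) -> float:
    # Robust scale estimate: 1.4826 makes the MAD consistent with the
    # standard deviation of a Gaussian inlier distribution.
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))

def huber_weights(residuals: np.ndarray, k: float = 1.345) -> np.ndarray:
    # Huber weighting: full weight within k * sigma, downweighted beyond
    # (soft inlier/outlier classification).
    sigma = mad_sigma(residuals)
    r = np.abs(residuals) / (sigma + 1e-12)
    return np.where(r <= k, 1.0, k / r)
```

If more than half of the residuals are outliers, the median itself is corrupted and the scale estimate breaks down, which is exactly the 50% limit discussed above.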

Finally, keypoints are rarely used for groupwise registration; most previous approaches use anatomical reference landmarks instead. Few papers deal with keypoints, such as Zhang et al. (zhang12), which is restricted to 2D images.

2.3 Bundle Adjustment

Bundle adjustment (BA) is a class of computer vision approaches which reconstruct a unique 3D scene from multiple 2D views by back-projecting keypoints into the 3D scene and optimizing globally (triggs2000bundle). It enables efficient tracking even with low-end cameras (karlsson2005vslam). Recent BA implementations can manage large amounts of data and successfully reconstruct large parts of a city (frahm2010building).

BA jointly optimizes point positions, camera positions and camera parameters. Because of (1) the non-linearity of projective geometry (hartley2003multiple), (2) mismatches and (3) the M-estimator, optimization is high-dimensional and non-convex (NIPS2014_5486). As a consequence, particular attention has to be paid to the optimization method. Algorithms such as Levenberg-Marquardt optimization (more1978levenberg) can be used to avoid converging to a local minimum. BA also needs a good initialization to remain close to the global minimum at any time.

3 Challenges and contributions

Our paper provides a unified solution to register a whole image bundle at once in a nonrigid way. It can be seen as a complete graph registration because each image is linked with all other images (Figure 1(c)). Several challenges need to be addressed:

  • Amount of data: As we want to process many images in a reasonable time, we cannot use dense registration: each image from our test database contains roughly 512x512x400 voxels, each voxel being encoded with 32 bits. Depending on the experiments, 20 to more than 100 images are used, representing more than 50GB of image data. Current approaches process a lower number of images, as in (LPBA40). On top of a hubless approach as in (wu2012feature), we need a compact solution which avoids reading voxel data and allows us to manage many images in an in-core way.

  • Repeatability & Outliers: We have to overcome inter-patient variability. When using keypoints, the rate of outliers increases from 30% for similar patients up to 70% in the case of different patients. Two main reasons stand out, both induced by inter-patient variability: (1) match selectivity needs to be decreased to find more matches between differing patients; (2) match consistency cannot be enforced (if three points A, B, C are linked by two matches AB and BC, then A and C must match). As a result, the rate of outliers can exceed the M-estimator breakdown point (50%), hence the need for a more robust estimator.

  • Computational complexity: The analogy with BA is not perfect. BA processes a set of 2D images distributed in a 3D space. As each image overlaps with few other images, BA can be significantly simplified to a band matrix problem. In our case, we need to register 3D images in a 3D space, and the overlap between images is much higher: the images are all located within the human body, but for different patients. This results in a dense matrix problem, illustrated by Figure 2.

Figure 2: Sparsity induced by a panoramic reconstruction with BA (top) compared to our application (bottom). Overlap matrices of each problem are given on the right. The top image is provided by Noso1, via Wikimedia Commons.

To overcome these challenges, we introduce in this paper three main contributions:

  • Keypoint-based hubless registration: As input data, we use only the keypoints extracted from the images. We still use an abstract common space and half-transforms as in star groupwise registration, but we do not need any central reference data: our optimization is driven only by inter-image registration, as shown in Figure 1(c). Half-transforms are represented by spline pyramids, and we use the graph linking all images together (the complete graph) for the optimization. This compact framework can register 100 images with a memory footprint of only 10GB.

  • EM-weighting: To manage a ratio of outliers that can exceed 50%, we devise an Expectation-Maximization algorithm which explicitly estimates the inlier and outlier distributions. Moreover, experimental results show that the proposed EM-weighting yields better selectivity against outliers than the M-estimator. As a result, EM-weighting allows our algorithm to converge, while the M-estimator failed to provide convergence.

  • Efficient optimization: We propose a fast energy function minimization algorithm suited to our nonrigid problem. With this algorithm, our 24-core testing computer registers 20 images (512x512x400 voxels each) in less than one hour.

Figure 3: The three main steps in our algorithm: (1) Keypoints are extracted from each image. (2) Matching constructs a list of point pairs from the point sets. (3) Registration optimizes a transformation for each image in order to minimize point pair distances in the common space.

4 Method

Our algorithm proceeds in three steps: (1) 3D SURF keypoint extraction from all input images, (2) point pairing according to descriptors, (3) half-transform optimization by minimizing the distance between paired points while discarding outlier influence. Figure 3 shows a block diagram of the three steps.

4.1 Half transforms driven by keypoints


Figure 4: Example with 4 input images. For clarity, only bone structures are displayed. Overprinted on each image: the grid on which lie the splines used to represent the nonrigid transform. In the middle: the common space, which is used to measure match distances. As in Figure 1(c), matches are drawn in solid black while point transforms are drawn in dashed orange. In this simple example, 7 keypoints were extracted and 5 matches were found; one of the matches is an outlier. The figure also shows the corresponding match matrix.

We consider a set of N 3D images. In order to get rid of dense computation as early as possible in the process, we extract from each image its 3D SURF keypoint set, as done by agier2016hubless. The entire algorithm uses physical coordinates (in millimeters) and is invariant to image sampling conditions. We compute the set of matched keypoints for all image pairs by combining several criteria: feature vector distance, nearest neighbor ratio, laplacian sign as in bay2006surf, and scale difference. As explained earlier in the paper, this set contains outliers. Optimizing the set of half transforms is done by minimizing the match distance-errors, the distance-error of a match being the distance, measured in the abstract common space, between the transformed positions of its two keypoints.

This distance is the only measure computed in the abstract common space. We use tensor products of Uniform Cubic B-splines to interpolate the half-transforms, driven by control points placed on a rectilinear grid. Each half-transform maps a point to its original position plus a displacement obtained by evaluating the spline basis functions at this point and weighting the control point displacements accordingly; the spline coefficient matrix gathering the control point displacements (one column for each coordinate) is the unknown we optimize for each image. Figure 4 gives an overview of our framework, a simple example with 4 input images, 7 keypoints and 5 matches.
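The spline parameterisation can be sketched in 1-D as follows (the paper uses 3-D tensor products of uniform cubic B-splines; all names here are ours, and the evaluation point is assumed to lie well inside the grid):

```python
import numpy as np

def cubic_bspline_basis(t: float) -> np.ndarray:
    # The four uniform cubic B-spline basis values for local coordinate
    # t in [0, 1); they always sum to 1 (partition of unity).
    return np.array([(1 - t) ** 3,
                     3 * t ** 3 - 6 * t ** 2 + 4,
                     -3 * t ** 3 + 3 * t ** 2 + 3 * t + 1,
                     t ** 3]) / 6.0

def transform(x: float, grid_step: float, c: np.ndarray) -> float:
    # T(x) = x + displacement interpolated from the control-point
    # displacements c on a rectilinear grid of spacing grid_step.
    i = int(np.floor(x / grid_step))   # index of the spanning grid cell
    t = x / grid_step - i
    w = cubic_bspline_basis(t)
    disp = sum(w[k] * c[i - 1 + k] for k in range(4))
    return x + disp
```

Because the basis is a partition of unity, a constant set of control-point displacements yields a pure translation, and each control point only influences a local support of four grid cells.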

4.2 Outliers - EM-weighting

An efficient way of mitigating the presence of outliers in the set of paired points is to use robust statistics such as M-estimators (fox2002robust). The usual assumption made with M-estimators is that the probability distribution of the match distance is a combination of one normal distribution (the inlier contribution) and one uniform distribution (the outlier contribution). M-estimators need to estimate the inlier variance, the tuning parameter, which provides a way to balance the contribution of each distance (1) to the optimization criterion. Smaller values of the tuning constant increase robustness to outliers.

In our case, the underlying distributions follow different laws:

  • Inliers are norms of Gaussian random vectors of R³. They follow a chi distribution with 3 degrees of freedom, also known as the Maxwell distribution.

  • Outliers are distances between random vectors of R³ following uniform distributions. We propose to approximate their distribution with a Maxwell distribution as well.

Then, the probability density function of the match distance d can be expressed for each image as the mixture

f(d) = rho M(d; sigma_in) + (1 - rho) M(d; sigma_out)

where M is the Maxwell probability density function, sigma_in and sigma_out are respectively the inlier and outlier scale parameters (the standard deviation of the vector distance coordinates, with zero mean) of the two Maxwell laws, and rho is their mixing ratio. Instead of estimating the inlier variance, we directly estimate for each image the parameters of f using the Expectation-Maximization (EM) algorithm (dempster1977maximum). As we explicitly estimate the Maxwell laws, we can handle more than 50% of outliers since we do not rely on the median operator. Following Bayes' rule, we can infer for any match distance-error d the probability of belonging to the inlier class:

P(inlier | d) = rho M(d; sigma_in) / f(d)
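A minimal sketch of the EM estimation of this two-Maxwell mixture (the quantile initialisation and fixed iteration count are our choices, not the paper's):

```python
import numpy as np

def maxwell_pdf(d: np.ndarray, sigma: float) -> np.ndarray:
    # Maxwell distribution: norm of a 3-D Gaussian vector with scale sigma.
    return np.sqrt(2 / np.pi) * d**2 * np.exp(-d**2 / (2 * sigma**2)) / sigma**3

def em_two_maxwell(d: np.ndarray, iters: int = 50):
    # Initialise the inlier scale among small distances, outlier among large.
    s_in, s_out, rho = np.quantile(d, 0.25), np.quantile(d, 0.75), 0.5
    for _ in range(iters):
        # E-step: responsibility of the inlier component for each distance.
        p_in = rho * maxwell_pdf(d, s_in)
        p_out = (1 - rho) * maxwell_pdf(d, s_out)
        r = p_in / (p_in + p_out)
        # M-step: the Maxwell maximum-likelihood estimate is
        # sigma^2 = E[d^2] / 3 under each component's responsibilities.
        s_in = np.sqrt(np.sum(r * d**2) / (3 * np.sum(r)))
        s_out = np.sqrt(np.sum((1 - r) * d**2) / (3 * np.sum(1 - r)))
        rho = r.mean()
    return s_in, s_out, rho, r
```

The responsibilities r are exactly the per-match inlier probabilities used for weighting, and no median is involved, which is why the 50% breakdown point does not apply.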


This probability can be used as a weighting function to inhibit outlier contributions. But as each match links two images a and b, two probabilities can actually be computed for each match: P_a and P_b, using respectively image a and image b as statistical context. As a symmetric criterion is preferable, we select the minimum value as the weight of each match:

w = min(P_a, P_b)

The minimum is a good heuristic since it increases selectivity against outliers. We observed in our experiments that it is more effective to wrongly reject more inliers than to wrongly accept more outliers. Figure 5 shows example data as well as a comparison between EM-weighting and M-estimators, showing that our approach is more robust and more selective than the M-estimator.

Figure 5: EM-weighting and M-estimators. Top: synthetic distance distribution, with the inlier distribution in light blue and the outlier distribution in orange. The dark-blue and red curves represent their approximations with Maxwell distributions. Middle: real-world example, with more than 50% of outliers. The gray histogram is computed for one image of the VISCERAL data set after registration. The dark-blue and red curves respectively represent the estimated inlier and outlier Maxwell distributions. Bottom: comparison of weighting functions computed with the same real-world data. Blue: incorrect standard deviation estimation leads the M-estimator to include many outliers. Dashed blue: M-estimator with corrected standard deviation. Red: our EM-weighting yields better selectivity.

4.3 Local EM-weighting

With M-estimators, statistics are defined globally, but nonrigid registration and the large variability across body images result in spatially heterogeneous statistics. More precisely, during optimization, one part of a body can be well registered while another part of the same image might still be off-target. Hence the need for spatial adaptivity.

To create a local estimation of the inlier and outlier distributions, we rely on splines. Instead of computing EM-parameters on the whole image, we compute an estimation for each control point, using only keypoints contained in the support of the control point basis function. Afterwards, we use splines to interpolate the estimations.
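The local estimation can be sketched as follows (1-D for brevity; the Maxwell scale is estimated from the match distances of keypoints inside a control point's basis support, and the fallback behaviour when the support is empty is our assumption):

```python
import numpy as np

def local_sigma(control_x: float, keypoint_x: np.ndarray,
                distances: np.ndarray, grid_step: float):
    # Keypoints within the support of a cubic B-spline basis function
    # (radius of 2 grid steps around the control point).
    support = np.abs(keypoint_x - control_x) < 2 * grid_step
    if not support.any():
        return None  # empty support: fall back to a global estimate
    d = distances[support]
    # Maxwell maximum-likelihood scale on the local sample.
    return np.sqrt(np.mean(d**2) / 3)
```

The per-control-point estimates are then interpolated by the same splines, giving spatially varying inlier/outlier statistics at negligible extra cost.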

4.4 Energy minimization

We propose to find the best set of transforms by minimizing for each keypoint the mean square distance to its matching points, each match contribution (1) being weighted according to EM-weighting (4):

E = sum over keypoints p of (1 / |M(p)|) sum over q in M(p) of w(p, q) d(p, q)^2          (5)

where M(p) is the set of points matching with p and |M(p)| its cardinal.

However, N - 1 transforms are enough to register N volumes whereas there are N unknown transforms in our problem. This results in an underdetermined problem: convergence is not guaranteed and the transforms can be subject to global drift. Following wu2012feature, we remove the extra degree of freedom by adding a constraint on the control point displacements: the sum of the spline coefficient matrices over all images must be zero,

sum over i of C_i = 0          (6)
By traversing the set of matches, the energy function (5) can be rewritten as a weighted sum over matches (7), each match contributing its combined weight (8), which gathers the EM-weight and the normalization factors of (5), times the squared distance between the transformed positions of its two keypoints.

Equation (7) can be expressed in matrix notation:

E = tr( (M(P + BC))^T W (M(P + BC)) )          (9)

where:

W is the diagonal matrix of all match weights.

P is the matrix of keypoint locations.

M is the sparse match matrix. Each line of M represents a match between two keypoints, and contains exactly two nonzero values, one 1 and one -1, in the columns corresponding to the two matching keypoints. M can be seen as the incidence matrix of the match directed graph where vertices are keypoints and edges are matches with arbitrary orientation. An example of M is shown by Figure 4.

B is the spline matrix: a sparse non-square block diagonal matrix, where each non-zero block is the spline matrix of one image.

C is the stacking of all coefficient matrices C_i.
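The match matrix can be built as a plain incidence matrix; a dependency-free sketch (a real implementation would use a sparse storage format such as scipy.sparse, since almost all entries are zero):

```python
import numpy as np

def incidence_matrix(matches, n_points: int) -> np.ndarray:
    # One row per match, +1 and -1 in the columns of the two matched
    # keypoints (the orientation of each match is arbitrary).
    M = np.zeros((len(matches), n_points))
    for row, (i, j) in enumerate(matches):
        M[row, i] = 1.0
        M[row, j] = -1.0
    return M

# M @ P then yields, per match, the coordinate difference between the
# paired keypoints, i.e. the residual the optimization drives to zero.
```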

To minimize equation (7), inspired by the Iteratively Reweighted Least Squares method, we use a gradient descent algorithm and we update the weight parameters every ten iterations. The gradient of equation (9) is:

grad_C E = 2 B^T M^T W M (P + BC)          (10)
We initially set C = 0, and each iteration step of our algorithm is defined as follows:

  1. every 10 iterations: re-estimate the EM-weighting parameters with the EM algorithm,

  2. apply the gradient descent step:

     C <- C - epsilon grad_C E

     where epsilon is a small positive coefficient,

  3. split C into the per-image coefficient matrices C_i and shift the mean of the spline coefficients in order to satisfy the constraint in (6):

     C_i <- C_i - (1/N) sum over j of C_j

     then construct C again by stacking the matrices C_i. Note that an interesting byproduct of this step is the iterative removal of spatial bias in the common space (guimond00).
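The mean-shift of step 3 is a simple projection onto the zero-mean constraint; a sketch (the array shapes are our illustrative choice, not the paper's layout):

```python
import numpy as np

def project_zero_mean(C_blocks: np.ndarray) -> np.ndarray:
    # C_blocks: one coefficient block per image. Subtracting the per-control-
    # point mean over images enforces sum_i C_i = 0 and removes global drift.
    return C_blocks - C_blocks.mean(axis=0)

C = np.random.randn(5, 8, 3)      # 5 images, 8 control points, 3 coordinates
C = project_zero_mean(C)
assert np.allclose(C.mean(axis=0), 0)
```

Because the projection is applied after every gradient step, the common space cannot drift, which plays the same role as iteratively recentering the reference in star-groupwise methods.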

We use a hierarchical approach, beginning with a coarse spline grid and incrementally doubling the grid density. This improves the speed of our approach and keeps the solution close to the global minimum, which is required as we use EM-weighting. Note that we achieve a significant simplification of the gradient computation (10) by taking into account the fact that the matrices M and B are sparse.

5 Results

5.1 Data sets and evaluation

We evaluate our approach on the VISCERAL database (Langs2012visceral), which is composed of three groups of CT volumes (around 512x512x400 voxels each, stored as 32-bit floats):

  • A: 20 volumes (thorax and abdomen) of contrast enhanced (via contrast agent injection) CT scans. For these volumes, experts have located anatomical landmarks (up to 45 per volume, between 41 and 42 on average).

  • B: 20 volumes (whole body) of CT scans. For these volumes, experts have located anatomical landmarks (up to 53 per volume, between 52 and 53 on average).

  • C: 63 volumes (thorax and abdomen) of contrast enhanced CT scans, not annotated. The VISCERAL consortium refers to these volumes as the silver corpus.

With these three groups (A, B and C), we are able to carry out groupwise registrations under different scenarios: only A, only B, A and C, or all three groups together, while observing the effect of each case on landmark registration accuracy. To measure the quality of a given groupwise registration, we first project all reference landmarks into the common space using the half-transforms. Then, for each landmark category (Clavicle L., Clavicle R., etc.), we compute a mean position. The quality criterion for this category is the average distance to this mean position. We also compute the global maximum individual distance to the mean position as a robustness criterion. Note that reference landmarks are only present for images in groups A and B. As a consequence, when registering groups A and C together, the criterion is restricted to images of group A. Note also that landmarks are never used by any of the tested algorithms during registration. Experiments were carried out on a 24-core computer (48 logical cores with hyperthreading) with 128 GB of RAM.
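The benchmark criterion can be sketched as follows (names are ours; the paper aggregates this per landmark category):

```python
import numpy as np

def landmark_criterion(projected: np.ndarray):
    # projected: (n_images, 3) positions of one landmark category after
    # projection into the common space. Quality is the mean distance to the
    # mean position; robustness is the maximum individual distance.
    center = projected.mean(axis=0)
    dists = np.linalg.norm(projected - center, axis=1)
    return dists.mean(), dists.max()
```

A perfect groupwise registration would send each landmark category to a single point in the common space, making both values zero.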

We have compared our approach with the star-groupwise voxel-based methods NiftyReg (modat2008; modat2014) and ANTs (Avants2008ants). As the test data contains high resolution images, other dense methods could not be applied. For example, Elastix (klein10) and GLIRT (wu2012feature) require all images to be stored in-core, and GRAM (hamm2010) exhibits a computational complexity which is too high for our data sets (see section 2.1).

5.2 Implementation and settings

Our initial implementation uses C++ for both keypoint extraction and matching, while the optimization is written in Matlab.

In the complexity analysis, the dominant terms are for keypoint extraction, for match computation and for optimization, where is the number of volumes, is the number of voxels for each volume, is the total number of extracted keypoints, and is the number of control points at the highest resolution.

Our pyramid of splines contains four levels, with a grid step of , , respectively. For each volume, we keep up to 20000 keypoints (see agier2016hubless). For each resolution level, we compute 200 gradient descent iterations, with set to , , and , respectively.
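The coarse-to-fine schedule above (200 gradient descent iterations at each of the four pyramid levels) can be sketched as follows; the spline control-grid refinement between levels is deliberately elided, and the gradient function, step size and level count are placeholders, not the paper's exact values:

```python
def multires_descent(grad, theta, n_levels=4, n_iters=200, step=1e-3):
    """Coarse-to-fine gradient descent: run `n_iters` descent iterations
    on parameters `theta` at each of `n_levels` pyramid levels.
    `grad(theta, level)` is a caller-supplied gradient of the criterion;
    for simplicity `theta` keeps a fixed size across levels (a deliberate
    simplification of the spline pyramid described in the text)."""
    for level in range(n_levels):          # coarsest grid first
        for _ in range(n_iters):
            theta = theta - step * grad(theta, level)
    return theta
```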

5.3 Comparison with NiftyReg and ANTs

With NiftyReg, we have used the script. This script performs a star-groupwise registration (see Figure 1(a)) and builds a template model from all input volumes, alternating between mean reference computation and pairwise registration of all images against this reference. By default, 15 groupwise iterations are applied: the first 5 use rigid registration, the last 10 use nonrigid registration.

With ANTs, we have used the script. This script performs a star-groupwise registration (see Figure 1(a)) and builds a template model from all input volumes, alternating between mean reference computation and pairwise registration of all images against this reference. By default, 5 groupwise iterations are applied: the first is rigid, the remaining 4 are nonrigid. The best results with ANTs were obtained using mutual information as the registration criterion.
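The star-groupwise alternation used by both scripts can be sketched as follows; `register`, `warp` and `average` are hypothetical placeholders for the underlying pairwise primitives, not the actual NiftyReg or ANTs calls:

```python
def star_groupwise(images, register, warp, average, n_rigid=1, n_nonrigid=4):
    """Star-groupwise template construction: alternate between building a
    mean reference and registering every image to it.

    `register(img, ref, mode)` returns a transform, `warp(img, t)` applies
    it, and `average(imgs)` builds the mean reference; all three are
    caller-supplied (hypothetical) primitives.
    """
    reference = average(images)                   # initial template
    transforms = [None] * len(images)
    for it in range(n_rigid + n_nonrigid):
        mode = "rigid" if it < n_rigid else "nonrigid"
        transforms = [register(img, reference, mode) for img in images]
        reference = average([warp(img, t) for img, t in zip(images, transforms)])
    return reference, transforms
```

This makes the contrast with our hubless formulation explicit: the star scheme needs the `average` reference at every iteration, whereas our criterion only involves pairwise keypoint distances.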

5.3.1 Group A

Detailed results are given in Tables 1 and 2.

ANTs processes the 20 volumes in 62.5 hours, with a mean registration distance of 15.5 mm. To further increase the speed of ANTs, we switched the registration subsampling factor from 1 to 2 and 4 (i.e. setting the shrinking factor parameter to 12x6x4x2 and 20x12x6x4, respectively). Setting increases both speed and registration quality: the 20 volumes are registered in 13.8 hours with a mean distance of 11.3 mm, and no volume is wrongly registered. Setting further increases computing speed but decreases output quality: the 20 volumes are registered in 9.4 hours with a mean distance of 14.0 mm. These results are presented in Table 2, where ANTs-S2 and ANTs-S4 refer to setting to 2 and 4, respectively.

NiftyReg registered group A in 67 hours on our test machine, yielding an average landmark distance of .

Figure 6: Convergence of the minimized criterion and mean distances when registering group A. Distances are in mm. The blue curve is the square root of the criterion given by Equation (5) which reflects the average distance between matched keypoints. The black and orange curves represent the global mean landmark distances when using EM-weighting and M-estimator, respectively. Note that when using the M-estimator, the average distance increases which is a sign of convergence failure.

Our approach was able to successfully register all volumes of group A, while being much faster than NiftyReg and ANTs. The one-tailed Welch’s t-test shows that our approach (Table 1, third column) achieves an average landmark distance () significantly lower than that of NiftyReg (, ) and ANTs (, ). We have also added experiments to Table 2, varying the number of keypoints kept per image to 40k, 30k, 10k, 5k and 2.5k, respectively. As expected, accuracy increases with the number of extracted keypoints, but saturates above 30k points per volume; hence we fixed the number of kept points to 20k. Figure 6 shows the convergence curve for (see Equation (5)) in blue, the mean landmark distance during optimization in black, and, in orange, the mean landmark distance when using M-estimators instead of our EM-weighting algorithm. All distances are in mm. One can clearly see the convergence of our approach, whereas using M-estimators results in a diverging error.
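The EM-weighting idea can be illustrated with a generic two-component mixture (Gaussian inliers plus a uniform outlier density); this is a simplified sketch, not the paper's exact local formulation, and the `outlier_density` constant is a hypothetical parameter:

```python
import numpy as np

def em_weights(d, n_iter=20, outlier_density=1e-3):
    """Inlier weights for keypoint pair distances `d` (1D array).

    Model: distances follow a zero-mean Gaussian (inliers) mixed with a
    uniform outlier density. Alternates an M-step (weighted variance
    estimate) and an E-step (inlier posterior). The returned weights can
    then multiply each pair's term in the least-squares criterion, so
    outlier matches are progressively discarded.
    """
    w = np.ones_like(d)
    for _ in range(n_iter):
        sigma2 = np.sum(w * d**2) / max(np.sum(w), 1e-12)       # M-step
        g = np.exp(-d**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        w = g / (g + outlier_density)                           # E-step
    return w
```

Unlike a fixed-scale M-estimator, the scale `sigma2` is re-estimated from the current weights at every iteration, which is what lets the weighting adapt as the registration tightens.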

Landmark ANTs NiftyReg Ours
Clavicle L
Clavicle R
Tubercul. L
Tubercul. R
Sternoclav. L
Sternoclav. R
Aortic arch
Trachea bif.
Bronchus L
Bronchus R
Aortic valve
Renal pelvis L
Renal pelvis R
Crista iliaca L
Crista iliaca R
Aorta bif.
VCI bif.
Troch. maj. L
Troch. maj. R
Ischiadicum L
Ischiadicum R
Troch. min. L
Troch. min. R
Time (h)
Table 1: Registering group A: comparison of average landmark distances (in mm) between our approach, NiftyReg and ANTs.
(a) Niftyreg
(b) Ours
Figure 7: Comparison between our approach and NiftyReg on VISCERAL group A: landmarks transformed in the common space.

Figure 7 compares reference landmarks transformed into the common space for NiftyReg and our approach. Figure 8 compares average landmark distances between ANTs, NiftyReg and our approach. In this figure, the landmarks were split into 5 anatomical regions: 1-shoulders (landmarks 1–4), 2-vertebrae (5–23), 3-chest (24–32), 4-kidneys (33–38), 5-hips (39–45). The error curves of our approach, NiftyReg and ANTs follow similar trends.

5.3.2 Group B

Figure 9 compares groupwise registrations of group B. We have extracted bone structures from each volume and superposed them in the common space, using (a) ANTs, (b) NiftyReg and (c) our approach. One volume of group B has a very unnatural height (probably due to a wrong spacing definition), hence we perform a global rigid registration before deformable registration, as ANTs does. NiftyReg yields a better result on the torso and hip regions, but still lacks robustness on the legs. Our approach is much more robust to variability: most of its error is concentrated around the patients’ arms, which are very challenging regions due to differing shoulder poses. The average landmark distance (reported in the second block of Table 2) obtained with our approach is , significantly lower than NiftyReg () and ANTs (). Figure 9(c) also confirms the spatial variability of the estimator parameters: colors on the right image represent local inlier mean distances and confirm the visual observation that hips are better registered than the torso. Hence, our local EM-weighting also provides a good estimation of registration quality. Note that landmark distances are larger than those obtained on group A for all algorithms (ours, NiftyReg, ANTs). One reason is that the images of group B were acquired without contrast enhancement; as a consequence, soft tissue is much less visible in group B, which could explain the lower accuracy.

5.3.3 Groups A and C

When registering groups A and C together, ANTs yields a mean distance of 44.7 mm. One explanation is that computing the average image of 83 whole-body scans with high variability can lead to a very blurred image when registration is not perfect (third image of Figure 10); registration against such a blurry reference then degenerates.

Registering groups A and C together took 324 hours with NiftyReg and has resulted in an average landmark distance of .

Our algorithm registers groups A and C within 5.9 hours and yields similar results: , equivalent to the result of NiftyReg () and significantly better than ANTs (). We estimated an outlier rate of 70% when registering groups A and C together. Note that our previous work based on rigid groupwise registration (agier2016hubless) yields an average error of 31.7 mm.

Figure 8: Mean error comparison between our approach, ANTs and NiftyReg on the set of reference landmarks transformed into the common space. Distances are in mm. Manually placed reference landmarks are split into 5 regions (1: shoulders, 2: vertebrae, 3: chest, 4: kidneys, 5: hips). Errors are measured when registering group A, except for the curve Ours* where both groups A and C are registered.


Groups N Algorithm Max (mm) pt (mm) Time (h)
A 20 Ours 72.6
A 20 Ours (40k) 74.1 1.4
A 20 Ours (30k) 72.6 1.0
A 20 Ours (10k) 65.5 0.6
A 20 Ours (5k) 61.2 0.5
A 20 Ours (2.5k) 73.6 0.4
A 20 Ours (SIFT) 80.9 4.5
A 20 ANTs-S4 152.7 9.4
A 20 ANTs-S2 70.6 13.8
A 20 ANTs 150.2 62.5
A 20 NiftyReg 67.0
B 20 Ours
B 20 ANTs-S2 231.0 36.0
B 20 NiftyReg 105.0 135.0
AC 83 Ours 65.9
AC 83 ANTs-S2 198.0 82.8
AC 83 NiftyReg 324.0
ABC 103 Ours 88.1 11.3
Table 2: Comparison between our approach, ANTs and NiftyReg on the VISCERAL data set. Our approach extracts 20k points per volume, except for lines 2 to 6, where the number of points varies between 40k and 2.5k.

5.3.4 Overall

In Figures 7 and 8 and Table 1, we can observe that hips are well registered, while distances are higher in the torso region. There are several possible explanations for this trend. First, soft tissue is less visible in CT scans and hence generates fewer keypoints, which in turn yields fewer constraints for the registration; bones, on the other hand, are very visible and have well-defined boundaries, which yields better points and hence more accurate registration. Second, the torso contains many structures with similar shapes (vertebrae, ribs), which could yield many outliers during point matching.

Table 3 shows timings for our algorithm as well as the average number of matches per volume for several scenarios. Note that even though the number of keypoints per volume is constant (20k), the average number of matches increases with the number of volumes. As an example, when registering group A, each keypoint is matched with points on average, but when registering groups A and C together, each keypoint is matched with points on average. In terms of memory footprint, our approach and NiftyReg never needed more than 10 GB, while we observed a peak memory consumption greater than 65 GB with ANTs at full resolution, and 34 GB when setting .

(a) ANTs
(b) NiftyReg
(c) Ours
Figure 9: Bone structures extracted from a registered group of 20 whole-body CT volumes of the VISCERAL group B. Groupwise registration has been carried out with (a) ANTs, (b) NiftyReg and (c) our algorithm. Our algorithm exhibits a better overall robustness to variability. The color scale for our algorithm indicates local mean error in mm, as an accuracy indicator.
(a) ANTs
(b) NiftyReg
(c) Ours
Figure 10: Average images obtained with (a) ANTs, (b) NiftyReg, (c) ours. Left: when registering group A. Right: when registering groups A and C.

Finally, Figure 10 shows the averages of the registered images. From top to bottom: ANTs, NiftyReg and our algorithm, when registering group A (left) and groups A and C (right). NiftyReg and ANTs provide a sharper image than our approach when registering group A. When registering groups A and C (83 images), the resulting average image is blurred for ANTs, whereas our approach remains robust to the number of images; NiftyReg still yields an average image sharper than ours. Hence, our approach is not very effective for template construction (average image). However, in contrast with NiftyReg and ANTs, our hubless approach does not need any average image during registration: our average image is only a byproduct computed after registration for comparison purposes. Moreover, image-based criteria have been shown to be unreliable for evaluating registration accuracy (rohlfing12). Finally, although our average image is not the sharpest, landmark distances (Table 1) are on average lower with our approach.

We have also registered all three groups A, B and C in 11.3 hours with a reasonable error, which illustrates the ability of our approach to register more than 100 images, mixing images obtained with or without contrast agent.

5.4 Using 3D-SIFT

We have carried out an experiment with keypoints defined by 3D-SIFT (SIFT3D), reported in the Ours (SIFT) row of Table 2. Keypoint extraction is much slower than with our 3D-SURF approach, which results in a registration time of 4.5 hours, while yielding larger landmark distances (11.0 mm for 3D-SIFT vs 9.0 mm for 3D-SURF). This shows that our approach is able to exploit various keypoint types, and also illustrates the relevance of choosing 3D-SURF over 3D-SIFT, as initially reported by bay2006surf.

5.5 Limitation

The main limitation of our proposal is the requirement of a sufficient number of inlier keypoints in each cell of the grid at the highest resolution. In other words, the size of the finest usable grid is locally limited by the keypoint density, which in turn limits registration accuracy. We observed this limitation when processing the MRI LONI Probabilistic Brain Atlas (LPBA40) (LPBA40), containing 40 human brain images of resolution . Processing time for the whole dataset is less than 1 hour. The resulting average image is displayed in Figure 11. The accuracy of our generic approach is not sufficient to correctly handle small structures inside the brain; hence, our approach is not as accurate as algorithms dedicated to brain processing such as GLIRT (wu2012feature), which reports an average overlap ratio of (about higher than ours). Yet our algorithm converges in a short time, which illustrates its versatility. In practice, our approach extracts about 10k keypoints from each brain volume of LPBA40, while more than 100k points were extracted from each volume of the VISCERAL database. Thus, for the VISCERAL dataset, we retained only the most significant points, discarding weak ones; this was not possible for the LPBA40 database. This comes from the fact that brain MRI images are less sharp and contain fewer voxels than whole-body CT scans. Hence, a possible improvement could be to develop a keypoint extractor better adapted to brain MRI, which could yield many more keypoints and therefore increase registration accuracy.
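This density limitation can be checked directly by counting keypoints per control-grid cell; a minimal sketch, where the minimum acceptable count per cell is application-dependent:

```python
import numpy as np

def cell_occupancy(points, grid_step):
    """Keypoint counts per control-grid cell.

    `points` is an (n, 3) array of keypoint positions and `grid_step`
    the cell size at the finest pyramid level. Returns the per-cell
    counts for occupied cells; cells with too few keypoints indicate
    that the local deformation there is under-constrained.
    """
    cells = np.floor(points / grid_step).astype(int)     # integer cell coordinates
    _, counts = np.unique(cells, axis=0, return_counts=True)
    return counts
```

On a dataset like LPBA40, where few keypoints are extracted, such a check would reveal many sparsely populated cells at the finest level, explaining the accuracy ceiling discussed above.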

Data N Matches I (h) II (h) III (h) Total (h)
A 20 41k 0.1 0.2 0.6 0.9
B 20 65k 0.1 0.2 1.0 1.3
AC 83 178k 0.2 0.7 5.0 5.9
ABC 103 272k 0.3 1.0 10.0 11.3
Table 3: Number of matches per volume and processing times for our algorithm, for each registered group (second column: number of registered volumes; third column: average number of matches per volume). Columns I, II and III indicate the time spent (in hours) on keypoint extraction, matching and optimization, respectively.
Figure 11: Average brain when registering the LPBA40 MRI brain dataset with our approach.

6 Conclusion

Our algorithm is able to register CT and MRI volumes in reasonable time and with low memory consumption. Experimental results illustrate its robustness to variability, and comparisons with NiftyReg and ANTs show that our algorithm yields lower mean landmark distances with much shorter computation times. Our approach is currently suited to large databases of relatively high resolution images. As an example, brain databases generally contain images with voxels, but our approach is not as accurate as state-of-the-art voxel-based approaches at these resolutions. A possible improvement could be a keypoint extractor dedicated to MRI images with fewer voxels and fewer sharp features. Nevertheless, many applications could already benefit from our approach: its low complexity and memory requirements make it possible to use full groupwise nonrigid registration as a preliminary step for applications such as atlas-based segmentation, machine learning classification and longitudinal studies.

As depicted in Figure 1, we use a full-graph approach. As a consequence, the number of matches grows faster than the number of images, and the number of matches per image increases with the number of images (see Table 3). This means that, contrary to classical groupwise approaches, our hubless approach could benefit from using even more images.
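This growth is a simple combinatorial fact: in a full graph every image participates in pairs with all the others. A toy calculation (the per-pair match count here is a hypothetical average, not a value taken from Table 3):

```python
def matches_per_image(n_images, matches_per_pair):
    """Approximate matches per image in a full-graph (hubless) setting.

    Every image can be paired with each of the (n_images - 1) others, so
    the matches each image participates in grow roughly linearly with
    the group size, while the total pair count grows quadratically.
    """
    return (n_images - 1) * matches_per_pair
```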

Our method is oblivious to point cloud sources and is not restricted to SURF keypoints and splines. Hence, investigating keypoint descriptors tailored to medical images and richer transform models are two very interesting research perspectives for our work. Computing features with machine learning seems a promising option in this context.

As our algorithm provides a robust approach able to handle keypoints between different patients without needing dense voxel information during optimization, it paves the way to fast algorithms able to deal with large data sets. We showed that our approach is able to register more than one hundred images. Scaling the algorithm to even larger data sets (more than 10000 patients) is still a challenging problem.


Many thanks to the VISCERAL Consortium (VISCERAL) for allowing us to compute evaluations using the VISCERAL data set.