Image registration is the task of finding the optimal spatial transformation between two or more images. In most registration methods, no assessment of the registration quality is provided, and simply the result is returned. Evaluation of the registration is devolved to human experts, which is very time-consuming and prone to inter-observer errors as well as human fatigue (Murphy et al., 2011b). Automatic quantitative error prediction of registration would decrease quality assessment time and can provide information about the registration uncertainty. Many medical pipelines are based on registered images and it is important to know the uncertainty of registration before continuing to a next phase in order to prevent accumulation of errors. For example, in online adaptive radiotherapy daily contouring of the tumor and organs-at-risk can be performed with the help of image registration (Thörnqvist et al., 2010). In this task, quality assessment (QA) is mandatory to ensure patient safety. In addition, the accumulation of delivered dose over several treatment fractions is also impacted by the quality of registration (Murphy et al., 2012, Tilly et al., 2013, Veiga et al., 2015). Registration quality therefore has to be checked before the treatment starts. Visualizing the error of registration can also be directly helpful in medical applications before making a clinical decision. Smit et al. (2017) localized autonomic pelvic nerves by registering a pre-operative MRI scan to an atlas model that includes nerve information. These nerves are not visible in the MRI scans and are prone to be damaged during a surgical procedure. Utilizing registration uncertainty yield better visualization of the autonomic nerves.
Refinement of registration is another important application of automatic error prediction. Muenzing et al. (2014) improved registration by focusing only on regions with high registration error and discarding pixels which are aligned correctly. Registration refinement can also be done with the feedback of human experts by manually adding several corresponding landmarks (Gunay et al., 2017).
Schlachter et al. (2016) did a comprehensive study on visualization of registration quality with the help of three radiation oncologists on the DIR-Lab-COPDgene data, which has a slice thickness of 2.5 mm. The [average, maximum] TRE of the landmarks that were rated to be of acceptable registration quality with the conventional visualization method (checkerboard visualization and color blended) was [2.3, 6.9] mm, while with the best visualization method (histogram intersection) [1.8, 3.3] mm was achieved.
A few methods have been proposed to detect the misalignment of a pair of images with the purpose to refine the registration result. Rohde et al. (2003) proposed to use the gradient of the cost function to detect which region in the image pair is poorly registered and potentially can be improved. Schnabel et al. (2001) suggested to refine the registration by increasing the number of registration parameters in regions with high local entropy, or with high local variation in the intensity or with relatively steep cost function. In another work, analyzing the shape of the cost function around each voxel was used to estimate the confidence of registration (Saygili et al., 2016). Park et al. (2004) used normalized local mutual information to find poorly aligned regions in order to increase the number of registration parameters. Forsberg et al. (2011) utilized the outer product of the intensity gradient as an uncertainty measure in multi-channel diffeomorphic Demons registration. Although the mentioned metrics can be used to improve the image registration, it has not been shown how these metrics are correlated with the image registration error.
Several methods exploit continuous probabilistic image registration by utilizing Bayesian inference to achieve an intrinsic transformation uncertainty measure(Risholm et al., 2013, Simpson et al., 2015). However, it has been shown that there is no clear statistical correlation between transformation uncertainty and registration uncertainty (Luo et al., 2017)
. The transformation and corresponding label (of a pair of images) are two random variables and it is not possible to quantify the uncertainty of the corresponding label by the summary statistics of the transformation. Another downside of these methods is that they can only be used for the specific paradigm of Bayesian registration.
Some methods are based on the consistency of multiple registrations between a group of images (Datteri and Dawant, 2012, Gass et al., 2015), but these methods cannot be used in pairwise registrations.
In the stochastic approaches, Kybic (2010) suggested to perform multiple registrations with random sampling of pixels with replacement. He found a correlation between the true registration error and the variation of the 2D translational parameters. The method was not extended to 3D and to nonrigid registration. Hub et al. (2009) calculated the local mean square intensity difference multiple times by perturbing the B-spline grid. They showed that the maximum change of the dissimilarity metric in a local region is correlated with the registration error in that region. The drawback of this method is that it is not efficient in homogeneous areas (Hub and Karger, 2013)2013), using the Demons algorithm. However, to find large misalignment a large search region is needed.
In this paper, we turn our attention to methods capable of learning the registration error allowing to take advantage of multiple features related to registration uncertainty within a single framework. Muenzing et al. (2012)
casted the registration assessment task to a classification problem with three categories (wrong, poor and correct registrations). In their method, they mostly utilize intensity-based features, except for the determinant of the Jacobian of the transformation. Although their training samples consist of manually selected landmarks, later they showed that assessing registration in all regions is possible by interpolation(Muenzing et al., 2014).
In our paper, instead of casting the uncertainty estimation task to a classification problem, we formulate it as a regression problem. To the best of our knowledge, in the field of continuous prediction of 3D registration error, Lotfi et al. (2013) only tested their method on artificially deformed images. Recently Eppenhof and Pluim (2017)
estimated the registration error by utilizing convolutional neural networks. Only preliminary results were available for synthetic 3D data.
We explore several features related to the uncertainty of the registration transformation as well as related to intensity. All features are calculated in physical units, i.e. mm, which makes the system independent of voxel size. Finally, features are combined by using regression forests. The proposed method is applied and evaluated on chest CT scans. This work is an extension of Sokooti et al. (2016) with updated methodology and substantially extended evaluation.
2.1 System overview
A block diagram of the proposed algorithm is shown in Fig. 1. The system has two inputs: a fixed image and a moving image . Several registration-based and intensity-based features are generated. A regression forests (RF) is then trained from all features to estimate the registration error.
The proposed system is trained to predict residual distances (registration errors) obtained from a set of semi-automatically established corresponding landmarks. During evaluation, the prediction result is compared with errors obtained from an independent set of ground truth landmarks, using cross-validation. The proposed system therefore estimates registration errors in physical units, i.e. mm. More information about the ground truth is available in Section 3.1. Details of the features are elaborated in Section 2.3.
Registration can be formulated as an optimization problem in which the cost function is minimized with respect to :
where denotes the transformation. The optimization is usually solved by an iterative method embedded in a multi-resolution setting. A registration can be initialized by an initial transform .
2.3 Features and pooling
The features we used in our system, consist of several registration-based as well as intensity-based features. Some features are intrinsically capable to be calculated over differently sized local boxes, for others, a pool of features is created by computing local averages and maxima afterwards. The features used in this paper are listed in Table 1. We propose the following features:
2.3.1 Registration-based features
Variation of deformation vector field (): The final solution of an iterative optimization problem can be influenced by the initial parameters. If in a region the cost function has multiple local minima or is semi-flat, a slight change in the initial parameters can lead to a different solution. In contrast, in areas where the cost function is well-defined, variations in the initial state are expected to have much less effect on the final solution. A flow chart of the described feature is available in Fig. 2. Given random initial transformations , that are used as initializations of the registration algorithm from Eq. (1), the variation in the final transformation results
is a surrogate for the precision of the registration. We propose to use the standard deviationof those final transformations as a feature:
In this work, the initial transformations
are created by uniformly distributed offsets in the rangemm to all B-spline coefficients. The offset range is chosen to be relatively small in comparison to the B-spline grid spacing in order to avoid unrealistic deformation. An example of in a synthetically deformed image is given in Fig. 3(a).
Instead of perturbing the initial state of the registration, it is also possible to first perform the registration without any manipulated initial state, resulting in a transformation (Klein et al., 2009). Then, random offsets are added to after which another registration is performed, resulting in . This is close to the work of Hub and Karger (2013), and approximately measures the concavity of the cost function. The feature is then derived akin to Eq. (3):
It is expected that is small in regions where the cost function is concave, as by adding small offsets to the parameters, it can still move back to the previous optimal point. A flow chart of is shown in Fig 2. is calculated using the same setting as , except that only one resolution is used.
If the difference between and is relatively large, regions indicating a small are still potentially regions of low registration quality. We then consider the bias and as complementary features to and computed by:
Coefficient of variation of joint histograms (CVH): Multiple registration results can be used to extract additional information from the matched intensity patterns of the images. Given a fixed image and a registration sub-result , we calculate their joint histogram . For identical sub-registrations, all resulting joint histograms are equal. Variation in the joint histograms implies registration uncertainty as a surrogate for registration error. The coefficient of variation of the joint histograms is calculated by dividing the standard deviation of all joint histograms over the average, , of them. This normalization is done to compensate for large differences between the elements of . We obtain the CVH in histogram space as follows:
where B is the number of histogram bins, and a constant to avoid division by zero. In the experiments we set to 5. The CVH in histogram space is subsequently transferred to the spatial domain, by assigning voxels with a particular intensity combination the corresponding value from CVH, resulting in the final CVH feature with size equal to the fixed image. Note that the CVH can be used in a multi-modality setting, like the previous features. An example of the CVH on a synthetically deformed image is given in Fig. 3(b).
Determinant of the Jacobian (Jac): Jac measures the relative local volume change. This can point to poor registration quality in case of very large () or very small () values, or discontinuous transformations in case of a negative value (). In the experiments, the determinant of the Jacobian of is used.
2.3.2 Intensity-based features
MIND: The Modality Independent Neighborhood Descriptor (MIND) was introduced by Heinrich et al. (2012) in order to register multi-modal images. In this local self-similarity metric, a patch is considered to compare intensities between fixed and moving images. Finally, the sum of absolute differences between the MIND vector of and that of is computed. We calculate MIND with a sparse patch including 82 voxels inside a box, which is approximately physically isotropic for the data used in the experiments (see Fig. 4).
Local normalized mutual information: Mutual information is used as an entropy-based similarity measure of two images. Similar to Muenzing et al. (2012) we use the following definitions for local normalized mutual information:
Both metrics are calculated over 8 differently sized boxes: [5, 10, 15, 20, 25, 30, 35, 40] mm. Two strategies for the selection of the number of bins are used, one uses a constant value , the other strategy depends on the number of samples , in which is the number of samples in each box. The notations and indicate mutual information calculated with the latter strategy.
Modality-dependent features: In addition to the modality-independent features from above, we consider the use of several modality-dependent features. In the experiments we assess their contributed value. Similar to Muenzing et al. (2012) the squared intensity difference (SID) and the gradient of intensity difference (GID) are computed using Gaussian (derivative) operators with standard deviations of [0.5, 1, 2, 4, 8, 16] mm. Normalized correlation (NC) is calculated within boxes of size [5, 10, 15, 20, 25, 30, 35, 40] mm akin to Muenzing et al. (2012).
In order to reduce discontinuities and improve interaction with other features, the total set of features is increased by generating a pool from those mother features by calculating averages and maxima over them using differently sized boxes. The features MI, SID, GID and NC are inherently computed over differently sized local regions. The features , , , CVH, , and Jac are calculated in a voxel-based fashion, and then pooled afterwards. Average and maximum pooling is performed with box sizes of mm. As a result, for each feature we obtain a pool of 18 features: 9 from box averages and 9 from box maxima. The average-pooling is done efficiently by the help of integral images introduced by Viola and Jones (2004). A list of the proposed mother features together with the number of derived features are given in Table 1.
|18||9 average boxes + 9 maxima boxes|
|MI||32||NMI, NMIS, PMI, PMIS calculated over 8 boxes|
|18||9 average boxes + 9 maxima boxes|
|18||9 average boxes + 9 maxima boxes|
|CVH||18||9 average boxes + 9 maxima boxes|
|18||9 average boxes + 9 maxima boxes|
|18||9 average boxes + 9 maxima boxes|
|Jac||18||9 average boxes + 9 maxima boxes|
|NC||8||calculated over 8 boxes|
|SID&GID||12||calculated over 6 sigma’s|
2.4 Regression forests
by extending the idea of bagging. The forests consist of several weak learners (trees) which are combined in an efficient fashion. Each tree is started from a node and continues splitting until reaching certain criteria. In contrast to bagging, splitting is performed with a random subset of features which makes the training phase faster and reduces correlation between trees, consequently decreasing the forest error rate. The reason that we chose the random forest is that it can handle data without preprocessing. For instance rescaling of data, outlier removal and selection of features are not necessary in random forests. In addition, random forest are efficient to train and fast at runtime.
Random forests have the capability to calculate the importance of each feature with a little additional computation, which shows the contribution of each feature to the forest. Training of each tree is based on a bootstrap of all samples, and the so-called out-of-bootstrap samples are used to compute the importance of a feature . Importance is then defined as the difference between the mean square error (MSE) before and after a permutation of this feature:
where is the real value, the predicted value from the regression, the predicted value when permuting feature , and the number of trees.
3 Experiments and results
3.1 Materials and ground truth
The SPREAD (Stolk et al., 2007) DIR-Lab-4DCT (Castillo et al., 2009) and DIR-Lab-COPDgene (Castillo et al., 2013) databases have been used in this study. In the SPREAD study, there are 21 pairs of 3D follow-up lung CT images. Each patient in this database has a baseline and a follow-up image (which is taken after 30 months) both in inhale phase. The age of the patients ranges from 49 to 78 years old. The average size of the images is with an average voxel size of mm. In each pair of images, about 100 well-distributed corresponding landmarks were previously selected (Staring et al., 2014) semi-automatically on distinctive locations (Murphy et al., 2011a).
From the DIR-Lab-4DCT data, five cases (4DCT1 to 4DCT5) are selected with each five phases between maximum inhalation and exhalation. The average image size is with an average voxel size of mm. Each scan has 75 corresponding landmarks annotated. Ten cases with severe breathing disorders are available via the DIR-Lab-COPDgene database. The images are taken in inhale and exhale phases. In total, 300 landmarks are annotated. The average image size and the average voxel size are and mm, respectively.
Accuracy of the registration can be defined as the residual Euclidean distance after registration between the corresponding landmarks:
with and the corresponding landmark locations. Based on the idea that the registration error is smooth, we include voxels from a small local neighborhood around the landmarks to increase the total set of available landmarks. In this small neighborhood we assume that the registration error is equal to the error at the center of the neighborhood. This assumption seems reasonable for smooth transformations and within a small region. The neighborhood size is chosen as , which is approximately equivalent to the final grid spacing of the B-spline registration (see Fig. 5).
The core software is written in Python. The feature pooling is performed with a C++ program (Glocker et al., 2014) and the regression forest is calculated with the help of the Scikit-learn package (Pedregosa et al., 2011). All registrations are performed by elastix (Klein et al., 2010). Detailed registration setting can be found in the elastix parameter file database (elastix.isi.uu.nl, entry par0049). The code is publicly available via github.com/hsokooti/regun.
3.2 Evaluation measures
In the SPREAD database, we employ 10 cross-validations by randomly splitting the data in 15 image pairs for training and the remaining 6 pairs for testing. To evaluate the regression performance, the mean absolute error (MAE) of the real registration error and the estimated one is calculated over the neighborhood of the landmarks by:
To further detail the regression performance, the MAE is subdivided into three categories: , and with in , and mm, corresponding to correct, poor and wrong registration, similar to Muenzing et al. (2012). We then do the same for
, and report the accuracy and F1 score for classifying the registration error in these three categories.
3.3 Parameter selection
The RF is trained using 100 trees with a maximum tree depth of 9, while at least 5 samples remain in the leaf nodes. At each splitting node, features are randomly selected. We set to the square root of the total number of features in that experiment, which performed slightly better than (Liaw et al., 2002). The total number of registrations is chosen as 20 to ensure that the estimation of does not change considerably when increasing the number of registrations (Sokooti et al., 2016).
3.4 Reference registration error set
For the SPREAD and the DIR-Lab-4DCT study, registrations are based on free-form deformations by B-splines (Rueckert et al., 1999)
. The cost function is mutual information, which is optimized by adaptive stochastic gradient descent. We used three resolutions with a final B-spline grid spacing ofmm. We collect samples by performing four different registrations using 20, 100, 500 and 2000 iterations, respectively. All other registration settings remain the same in these registrations. By varying the number of iterations we increase the variation in the samples, as well as the training size. Table 2 gives the distribution of reference registration errors in each database. As expected, increasing the number of iterations shifts the distribution towards the “correct” registration category. The maximum registration error is 81.8 mm in the SPREAD database, 17.6 mm in the DIR-Lab-4DCT database.
Since the a priori distribution of registration errors is imbalanced, with much more samples in the “correct” category, we perform the following balancing step during training. For landmarks that fall in the category “correct”, we only add samples from a smaller neighborhood of instead of the neighborhoods used for landmarks in the categories “poor” and “wrong”. The distribution of reference registration errors of the training samples is shown in Table 3.
For the DIR-Lab-COPDgene study, more advanced settings of the registration are used. In this experiment, samples are taken only on the landmark locations. More details are given in Section 3.5.8. The maximum registration error in this data is 31.5 mm.
3.5.1 Single feature performance in SPREAD
The proposed features are described in Section 2.3 and summarized in Table 1. To investigate the strength of the individual features, we trained the random forest with only a single feature with pooling. By comparing the MAE results in Table 4, it can be seen that MIND, and SIDGID are the best single features in the categories Intensity, Registration and Modality-dependent, respectively.
3.5.2 Combined features performance
Instead of using only a single feature, several combinations of features are used to build the RFs:
Intensity: Combination of all modality-independent intensity features: and MI (50 features).
Registration: Combination of all registration features: , , CVH, , and Jac (108 features).
Combined: Combination of both intensity and registration features (158 features).
All results are available in Table 5. By combining features from both the registration and modality-independent intensity category, improvements were obtained in all evaluation measures.
The result of the regression with combined features is detailed in Fig. 6(a), which shows the real error (solid blue line) against the predicted error, sorted from small to large. In Fig. 6(b) we grouped the real errors in the three categories, each category showing a box-plot of the predicted errors. Intuitively, a smaller overlap between the boxes represents a better regression.
) mm classes, respectively. MD, NN and LR stands for modality dependent, neural networks and linear regression, respectively.
3.5.3 Including modality-dependent features
We consider adding the combination of three modality-dependent features to the combined feature set (Combined+MD): NC, SID and GID. In both databases, if we add the modality-dependent features (see Table 5), negligible differences are observed. Therefore, to keep the feature set small and modality-independent, we select the “combined features” class without the modality-dependent features as the final system in the remainder of this paper.
3.5.4 The effect of pooling
To examine the effect of pooling, we perform an experiment without pooling on the combined feature set. We only calculate PMIS within a box size of 15 mm in this experiment. From Table 5 the benefit of pooling can be observed.
3.5.5 Alternative regression methods
In this section, we compare RF regression with linear regression (LR) and neural networks (NN). Feature normalization is done for both regressors. We utilized neural networks with three hidden layers of 1024, 512 and 256 units each. ReLU is used as an activation function and Huber is utilized as a loss function. Table5 gives the results of these experiments. The performance of neural networks is on par with random forests. However, the results of linear regression are not comparable to that of random forests, both in MAE and accuracy.
3.5.6 Feature importance
The feature importance using a different number of iterations is shown in Fig. 8. The contribution of all intensity features stay the same in all experiments, while some of the registration features contribute differently with respect to the number of iterations. For instance, the importance of and CVH increase with increasing the number of iterations. The features and play important roles when the number of iterations is not enough for registration convergence.
3.5.7 Excluding a single feature
To further investigate the importance of the several features, we additionally perform an experiment where we leave one feature out of the combined feature set. The results are reported in Table 6. In these experiments, feature redundancy can be found. For instance, MI has a large importance values in random forests, but if we leave that feature out, other features can compensate for that.
3.5.8 Inter-database validation
To study the generalizability of the proposed system, instead of cross-validation on a single database, we perform training on the DIR-Lab-4DCT database and test it on the SPREAD database. As mentioned before, the SPREAD database consists of only inhale images but the DIR-Lab-4DCT database has images from inhale to exhale phases. Therefore, this makes the DIR-Lab-4DCT more suitable for training. The result of this experiment is available in Table 7. Once more, we can draw the conclusion that by combining both intensity and registration-based features, the regression performance can be improved. In contrast to the SPREAD experiment, this time it is observed that the registration features perform better than the intensity features.
To further evaluate the generalizability of the proposed method, we test it for different registration methods on a third independent test set, the DIR-Lab-COPDgene dataset. The regression forest is trained on a combination of the SPREAD and DIR-Lab-4DCT data. We evaluate two registration algorithms that achieved excellent performance in the EMPIRE10 challenge (Murphy et al., 2011b), i.e. the ANTs registration package (Avants et al., 2009, Tustison et al., 2013) and elastix with advanced settings (Staring et al., 2010).
Prior to deformable registration we perform an affine registration using 5 resolutions and utilizing torso masks. For the deformable registration we use settings similar to the ones used in the EMPIRE10 challenge, specifically:
ANTs-BSplineSyN: With respect to the EMPIRE10 challenge we increased the number of iterations to 1000 for each of the 4 resolutions, using a 10% sampling rate. This improved the performance on our data and considerably reduced the calculation time. As suggested in Tustison et al. (2013), several preprocessing steps are used, including masking out the lungs, and inverting the image intensities and rescaling them between 0 and 1. Further settings include: registration model: symmetric diffeomorphic; dissimilarity metric: local cross correlation; number of resolutions: 4; maximum number of iterations: 1000; sampling: 10% random samples; convergence threshold: 1e-6. The average TRE on DIR-Lab-COPDgene is mm.
elastix-advanced: Settings are adopted from Staring et al. (2010). The most important ones are: registration model: B-spline; dissimilarity metric: normalized correlation; number of resolutions: 6; number of iterations: 1000; sampling: 2000 random samples; B-spline grid spacing: [5, 5, 5] mm. The average TRE with this setting is mm on the DIR-Lab-COPDgene dataset.
Detailed parameter files for both registration methods are available via elastix.isi.uu.nl (entry par0049) and github.com/hsokooti/regun. The calculation time of ANTs was about 60 hours per registration, comparing to 12 minutes for elastix.
In this experiment, the evaluation is performed only on the landmarks locations, where Table 2 displays the distribution of reference registration errors during testing. The results of the experiments are given in Table 8. A scatter plot is also depicted in Fig. 9. Similar to the previous inter-database experiment (Table 7), the MAE and accuracy of the registration features are slightly better than the MAE and accuracy of the intensity-based features. However, intensity features obtained better classification score in the wrong category. We conclude that the proposed method indeed generalizes to different settings of the same method (elastix-advanced), as well as registration methods with quite a different underlying transformation model (ANTs-BSplineSyN, which uses a symmetric diffeomorphic model).
A system for quantitative error prediction of medical image registration is proposed and it is quantitatively evaluated on multiple chest CT datasets.
In the intra-database (SPREAD) validation, it is observed that the single MIND feature can perform almost as good as the overall combined system. By adding MI and registration features, the results slightly improved. Muenzing et al. (2012) did not consider MIND in their feature set and found that the most important single features in their classification experiments are mutual information and Gaussian intensity, whereas, based on Table 4 these features are less important than MIND in our experiments. Furthermore, the calculation time of MI for the whole image is about 3 h, as opposed to the calculation time of MIND, which is about 8 min (3 min MIND + 5 min pooling). Although less accurate, it is possible to reduce the calculation time of the MI feature by calculating it over a single window and then aggregate by pooling.
The modality-dependent intensity features do not increase regression accuracy on the data used in our paper. Consequently more generally applicable modality-independent features can be used, even for mono-modal problems.
Table 5, 7 and 8 together suggests that features in the intensity and registration categories provide complementary information, and that a better system can be obtained in terms of MAE and accuracy by considering both intensity and registration-derived features.
The intensity features were better predictors than the registration features in the intra-database experiment. However, in the inter-database experiment, the registration features outperform intensity features in terms of total accuracy and MAE. The same observation can be made for the average F1 score in the inter-database experiments using elastix (See Table 7, 8). For ANTs (Table 8), the average F1 score of the intensity-based features was slightly higher than that of the registration-based features.
The registration features contribute differently with respect to the number of iterations (See Fig. 8). The features and play important roles when the number of iterations is not enough for convergence. When the number of iterations increases, the contribution of and CVH go up. In the work of Muenzing et al. (2012), only one registration feature, Jac, was used and they reported that the impact of this feature is relatively low in comparison with intensity-based features. We observed the same result for Jac, but it should be pointed out that the range of Jac in our database was [0.3, 3.9] so voxels with negative or very large values were not encountered.
Feature pooling improves the regression results in all evaluation measures, due to the addition of contextual information. In some features like , average pooling contributed more to the regression performance, while in features like CVH, maximum pooling had a higher importance value (See Fig. 8(d)).
As can be seen in Table 6, the proposed combined system has redundant features. Hence, by removing a single feature, the system is still able to predict the registration error with almost equal MAE as the total system. However, removing these features may decrease the generalizability of the system. For example, looking at the feature in Fig. 7 we see that its contribution is relatively small overall. However, in Fig. 8 it can be seen that while it is less important for better registration results (100, 500 and 2000 iterations), it is still important for poor registration results (20 iterations).
4.2 Quantitative validation
Commonly, in image registration tasks, the distribution of registration errors is not balanced as can be seen in Table 3.
In the SPREAD experiment, Table 5 reports that in the combined experiment, the MAE of the correct and poor classes are and , respectively. On the contrary, the MAE of the wrong class is . It is expected that the regression error of values of the wrong class is relatively larger than that of the other classes. However, it should be emphasized that only of samples are available to make a regression model between 6 and 81.8 mm. We tried to add more samples and make the distribution more balanced by performing registrations with different number of iterations, but there is still room for improvement for the wrong class by adding more samples and data.
In terms of classification, we obtained F1 scores of , and in the classes correct, poor and wrong, respectively (Table 5
). For the wrong class, which is arguably most important for clinical application, the precision and recall areand , respectively. This means that of all samples predicted to be over 6 mm are correct and the proposed method caught of larger registration errors. Muenzing et al. (2012) obtained F1 scores of , and in the classes [0,2), [2,5) and [5, ) mm. They achieved a better F1 score in the poor class and they also reported zero overlap between the correct and wrong classes. However, the comparison between the two methods is not easy because of the differences in the data. For example, the slice thickness in SPREAD is 2.5 mm, while it is 1 mm for Muenzing’s data, which may affect the performance especially in the poor class. Moreover, we generated the classes by thresholding the regression values. Thus, the forests are optimized for regression not for classification.
4.3 Qualitative validation
Muenzing et al. (2014) generated an uncertainty map by spatial interpolation of landmark-based quality estimates. On the contrary, our proposed system, which is trained on landmark locations, can be applied in all regions of the image. We showed this for two example images, see Fig. 5. It can be easily visualized that in the blue region, images are matched correctly. On the other hand, by tracking the vessels in the red region misalignment can be seen. Another note about the prediction is that there are no abrupt changes, and error varies smoothly from blue to yellow and then red, even though the error is predicted for each voxel independently.
Another example is given in Fig. 10(a-d). Although all landmarks indicate that the registration error is small in this slice, the quantitative results found several misregistered regions, which implies that few landmarks may not be sufficient to assess the registration quality of the whole image. In Fig. 10(e, f), it can be observed that the performance in the homogeneous area (left side of the images) is as good as the performance in the area with structure. The main reason of acceptable performance in the homogeneous area is that the training samples consist of landmarks as well as their neighborhood region, which can be homogeneous. Thus, the system is trained both for homogeneous regions and regions with structure.
Another example is given in Fig 10(g, h), where the proposed system is not able to predict the registration error correctly because of a shift in the slice direction.
Discrete optimization: If the optimization method is less or not dependent on the initial state, for instance for discrete optimization methods (Glocker et al., 2008, Heinrich et al., 2016), many of the proposed registration features, which are generated by varying the initial transformation of the registration, are not informative anymore. In such cases, instead of or , other measures can be used. For example, by utilizing the adaptive mean-shift algorithm, the local standard deviation of the displacement distribution can be calculated (Heinrich et al., 2016).
Anatomical changes: The proposed method is trained in such a way that any dissimilarity between the fixed and moving images is counted as misalignment in registration. In case of anatomical changes this assumption may be invalid, but typically prior knowledge of the underlying anatomy is required to determine which regions are allowed to be ”misaligned” because of anatomical changes and which are not (Muenzing et al., 2018). The proposed method highlights all changes, coming from misalignment or from anatomical change.
4.5 Future work
In the proposed method we predict the misalignment as an Euclidean distance in millimeters, rather than a 3D vector representing residual displacement. This is mostly because the features used in the system are not direction-wise, especially the local intensity features. The use of features that include directional information may help the system to be used in predicting the registration error in each direction, which is then effectively a new registration method.
The proposed method was tested on chest CT scans. Since the proposed features are generic and modality-independent, the overall method can in principle be applied to other modality data from other anatomical regions. The performance in such cases however remains to be investigated.
The uncertainty of affine registration is not measured in this work. Defining a gold standard for this mid-phase result is a complex task. However, extending the experiments to other databases where only affine transformations are applicable can be done in the future.
Instead of manually defined features, it is possible to use convolutional neural networks, which can learn features automatically. Eppenhof and Pluim (2017) predicted the Euclidean distance of registration error. Our own work on CNNs for registration (Sokooti et al., 2017)
can also be modified to predict registration uncertainty in a direction-wise manner. Both methods are trained only based on intensity, where the current paper shows that registration-derived information still contributes to a better regression. Thus, adding registration information to the neural networks should probably be considered as well.
A larger set of corresponding points annotated more densely throughout the scan could potentially also benefit training of the regression forest. In addition, experimenting on multi-modality data and investigating the contribution of all introduced features on them are future plans of this work.
Finally, the uncertainty map produced by the proposed method may be exploited to improve local registration results.
In this paper we proposed a method based on random regression forests to predict registration accuracy on chest CT scans from registration-based as well as intensity-based features. We introduced the variation in registration result from differences in initialization () and CVH, which showed high feature importance in several experiments. Registration-based features provided additional information on registration error with respect to intensity-based features.
The regression method was evaluated on data from the SPREAD study and predicted the registration error with a mean absolute error of 1.07 1.86 mm. The proposed method gained promising results on inter-database validation with a regression error of 1.76 2.59 mm.
This work is financed by the Netherlands Organization for Scientific Research (NWO), project 13351. Ben Glocker received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 757173, project MIRA, ERC-2017-STG). Dr. M.E. Bakker and J. Stolk are acknowledged for providing a ground truth for the SPREAD study data used in this paper. We would like to thank Dr. R. Castillo and T. Guerrero for providing the DIR-Lab database.
- Avants et al. (2009) Avants, B. B., Tustison, N., Song, G., 2009. Advanced normalization tools (ants). Insight J. 2, 1–35.
Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5–32.
- Castillo et al. (2013) Castillo, R., Castillo, E., Fuentes, D., Ahmad, M., Wood, A. M., Ludwig, M. S., Guerrero, T., 2013. A reference dataset for deformable image registration spatial accuracy evaluation using the copdgene study archive. Physics in Medicine & Biology 58 (9), 2861.
- Castillo et al. (2009) Castillo, R., Castillo, E., Guerra, R., Johnson, V. E., McPhail, T., Garg, A. K., Guerrero, T., 2009. A framework for evaluation of deformable image registration spatial accuracy using large landmark point sets. Physics in Medicine and Biology 54 (7), 1849.
- Datteri and Dawant (2012) Datteri, R. D., Dawant, B. M., 2012. Automatic detection of the magnitude and spatial location of error in non-rigid registration. In: Biomedical Image Registration. Springer, pp. 21–30.
- Eppenhof and Pluim (2017) Eppenhof, K. A., Pluim, J. P., 2017. Supervised local error estimation for nonlinear image registration using convolutional neural networks. In: SPIE Medical Imaging. International Society for Optics and Photonics, pp. 101331U–101331U.
- Forsberg et al. (2011) Forsberg, D., Rathi, Y., Bouix, S., Wassermann, D., Knutsson, H., Westin, C.-F., 2011. Improving registration using multi-channel diffeomorphic demons combined with certainty maps. In: Multimodal Brain Image Analysis. Springer, pp. 19–26.
- Gass et al. (2015) Gass, T., Szekely, G., Goksel, O., 2015. Consistency-based rectification of nonrigid registrations. Journal of Medical Imaging 2 (1), 014005–014005.
Glocker et al. (2008)
Glocker, B., Paragios, N., Komodakis, N., Tziritas, G., Navab, N., 2008. Optical flow estimation with uncertainties through dynamic mrfs. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, pp. 1–8.
- Glocker et al. (2014) Glocker, B., Zikic, D., Haynor, D. R., 2014. Robust registration of longitudinal spine ct. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 251–258.
- Gunay et al. (2017) Gunay, G., Luu, M. H., Moelker, A., Walsum, T., Klein, S., 2017. Semi-automated registration of pre-and intra-operative CT for image-guided percutaneous liver tumor ablation interventions. Medical Physics.
- Heinrich et al. (2012) Heinrich, M. P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F. V., Brady, M., Schnabel, J. A., 2012. MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis 16 (7), 1423–1435.
- Heinrich et al. (2016) Heinrich, M. P., Simpson, I. J., Papież, B. W., Brady, M., Schnabel, J. A., 2016. Deformable image registration by combining uncertainty estimates from supervoxel belief propagation. Medical Image Analysis 27, 57–71.
- Hub and Karger (2013) Hub, M., Karger, C., 2013. Estimation of the uncertainty of elastic image registration with the Demons algorithm. Physics in Medicine and Biology 58 (9), 3023.
- Hub et al. (2009) Hub, M., Kessler, M. L., Karger, C. P., 2009. A stochastic approach to estimate the uncertainty involved in B-spline image registration. IEEE Transactions on Medical Imaging 28 (11), 1708–1716.
- Klein et al. (2009) Klein, S., Pluim, J. P., Staring, M., Viergever, M. A., 2009. Adaptive stochastic gradient descent optimisation for image registration. International Journal of Computer Vision 81 (3), 227–239.
- Klein et al. (2010) Klein, S., Staring, M., Murphy, K., Viergever, M. A., Pluim, J. P., 2010. Elastix: A toolbox for intensity-based medical image registration. IEEE Transactions on Medical Imaging 29 (1), 196–205.
- Kybic (2010) Kybic, J., 2010. Bootstrap resampling for image registration uncertainty estimation without ground truth. IEEE Transactions on Image Processing 19 (1), 64–73.
- Liaw et al. (2002) Liaw, A., Wiener, M., et al., 2002. Classification and regression by randomforest. R news 2 (3), 18–22.
Lotfi et al. (2013)
Lotfi, T., Tang, L., Andrews, S., Hamarneh, G., 2013. Improving probabilistic image registration via reinforcement learning and uncertainty evaluation. In: Machine Learning in Medical Imaging. Springer, pp. 187–194.
- Luo et al. (2017) Luo, J., Popuri, K., Cobzas, D., Ding, H., Wells, W. M., Sugiyama, M., 2017. Misdirected registration uncertainty. arXiv preprint arXiv:1704.08121.
- Muenzing et al. (2018) Muenzing, S. E., Strauch, M., Truman, J. W., Bühler, K., Thum, A. S., Merhof, D., 2018. Larvalign: Aligning gene expression patterns from the larval brain of drosophila melanogaster. Neuroinformatics 16 (1), 65–80.
- Muenzing et al. (2012) Muenzing, S. E., van Ginneken, B., Murphy, K., Pluim, J. P., 2012. Supervised quality assessment of medical image registration: Application to intra-patient CT lung registration. Medical Image Analysis 16 (8), 1521–1531.
- Muenzing et al. (2014) Muenzing, S. E., van Ginneken, B., Viergever, M. A., Pluim, J. P., 2014. Dirboost–an algorithm for boosting deformable image registration: Application to lung CT intra-subject registration. Medical Image Analysis 18 (3), 449–459.
- Murphy et al. (2011a) Murphy, K., van Ginneken, B., Klein, S., Staring, M., de Hoop, B. J., Viergever, M. A., Pluim, J. P., 2011a. Semi-automatic construction of reference standards for evaluation of image registration. Medical Image Analysis 15 (1), 71–84.
- Murphy et al. (2011b) Murphy, K., Van Ginneken, B., Reinhardt, J. M., Kabus, S., Ding, K., Deng, X., Cao, K., Du, K., Christensen, G. E., Garcia, V., et al., 2011b. Evaluation of registration methods on thoracic CT: the EMPIRE10 challenge. IEEE Transactions on Medical Imaging 30 (11), 1901–1920.
- Murphy et al. (2012) Murphy, M. J., Salguero, F. J., Siebers, J. V., Staub, D., Vaman, C., 2012. A method to estimate the effect of deformable image registration uncertainties on daily dose mapping. Medical Physics 39 (2), 573–580.
- Park et al. (2004) Park, H., Bland, P. H., Brock, K. K., Meyer, C. R., 2004. Adaptive registration using local information measures. Medical Image Analysis 8 (4), 465–473.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
- Risholm et al. (2013) Risholm, P., Janoos, F., Norton, I., Golby, A. J., Wells, W. M., 2013. Bayesian characterization of uncertainty in intra-subject non-rigid registration. Medical Image Analysis 17 (5), 538–555.
- Rohde et al. (2003) Rohde, G. K., Aldroubi, A., Dawant, B. M., 2003. The adaptive bases algorithm for intensity-based nonrigid image registration. IEEE Transactions on Medical Imaging 22 (11), 1470–1479.
- Rueckert et al. (1999) Rueckert, D., Sonoda, L. I., Hayes, C., Hill, D. L., Leach, M. O., Hawkes, D. J., 1999. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging 18 (8), 712–721.
- Saygili et al. (2016) Saygili, G., Staring, M., Hendriks, E. A., 2016. Confidence estimation for medical image registration based on stereo confidences. IEEE Transactions on Medical Imaging 35 (2), 539–549.
- Schlachter et al. (2016) Schlachter, M., Fechter, T., Jurisic, M., Schimek-Jasch, T., Oehlke, O., Adebahr, S., Birkfellner, W., Nestle, U., Bühler, K., 2016. Visualization of deformable image registration quality using local image dissimilarity. IEEE Transactions on Medical Imaging 35 (10), 2319–2328.
- Schnabel et al. (2001) Schnabel, J. A., Rueckert, D., Quist, M., Blackall, J. M., Castellano-Smith, A. D., Hartkens, T., Penney, G. P., Hall, W. A., Liu, H., Truwit, C. L., et al., 2001. A generic framework for non-rigid registration based on non-uniform multi-level free-form deformations. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2001. Springer, pp. 573–581.
- Simpson et al. (2015) Simpson, I. J., Cardoso, M. J., Modat, M., Cash, D. M., Woolrich, M. W., Andersson, J. L., Schnabel, J. A., Ourselin, S., et al., 2015. Probabilistic non-linear registration with spatially adaptive regularisation. Medical Image Analysis 26 (1), 203–216.
- Smit et al. (2017) Smit, N., Lawonn, K., Kraima, A., DeRuiter, M., Sokooti, H., Bruckner, S., Eisemann, E., Vilanova, A., 2017. Pelvis: Atlas-based surgical planning for oncological pelvic surgery. IEEE Transactions on Visualization and Computer Graphics 23 (1), 741–750.
- Sokooti et al. (2017) Sokooti, H., de Vos, B., Berendsen, F., Lelieveldt, B. P., Išgum, I., Staring, M., 2017. Nonrigid image registration using multi-scale 3d convolutional neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 232–239.
- Sokooti et al. (2016) Sokooti, H., Saygili, G., Glocker, B., Lelieveldt, B. P., Staring, M., 2016. Accuracy estimation for medical image registration using regression forests. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 107–115.
- Staring et al. (2014) Staring, M., Bakker, M., Stolk, J., Shamonin, D., Reiber, J., Stoel, B., 2014. Towards local progression estimation of pulmonary emphysema using CT. Medical Physics 41 (2), 021905.
- Staring et al. (2010) Staring, M., Klein, S., Reiber, J. H., Niessen, W. J., Stoel, B. C., 2010. Pulmonary image registration with elastix using a standard intensity-based algorithm. Medical Image Analysis for the Clinic: A Grand Challenge, 73–79.
- Stolk et al. (2007) Stolk, J., Putter, H., Bakker, E. M., Shaker, S. B., Parr, D. G., Piitulainen, E., Russi, E. W., Grebski, E., Dirksen, A., Stockley, R. A., Reiber, J. H. C., Stoel, B. C., 2007. Progression parameters for emphysema: a clinical investigation. Respiratory Medicine 101 (9), 1924–1930.
- Thörnqvist et al. (2010) Thörnqvist, S., Petersen, J. B., Høyer, M., Bentzen, L. N., Muren, L. P., 2010. Propagation of target and organ at risk contours in radiotherapy of prostate cancer using deformable image registration. Acta Oncologica 49 (7), 1023–1032.
- Tilly et al. (2013) Tilly, D., Tilly, N., Ahnesjö, A., 2013. Dose mapping sensitivity to deformable registration uncertainties in fractionated radiotherapy–applied to prostate proton treatments. BMC Medical Physics 13 (1), 2.
- Tustison et al. (2013) Tustison, N., Song, G., Gee, J., Avants, B., 2013. Two greedy syn variants for pulmonary image registration.
- Veiga et al. (2015) Veiga, C., Lourenço, A. M., Mouinuddin, S., van Herk, M., Modat, M., Ourselin, S., Royle, G., McClelland, J. R., 2015. Toward adaptive radiotherapy for head and neck patients: Uncertainties in dose warping due to the choice of deformable registration algorithm. Medical Physics 42 (2), 760–769.
Viola and Jones (2004)
Viola, P., Jones, M. J., 2004. Robust real-time face detection. International journal of computer vision 57 (2), 137–154.