1 Introduction
Image fusion is frequently involved in modern image-guided medical interventions, typically augmenting intraoperatively acquired 2D X-ray images with preoperative 3D CT or MRI images. Accurate alignment between the fused images is essential for clinical applications and can be achieved using 2D/3D rigid registration, which aims at finding the pose of a 3D volume that aligns its projections to the 2D X-ray images. Most commonly, intensity-based methods are employed [8], where a similarity measure between the 2D image and the projection of the 3D image is defined and optimized, as e.g. described by Kubias et al. [6]. Despite decades of investigation, 2D/3D registration remains challenging. The difference in dimensionality of the input images makes the problem ill-posed. In addition, content mismatch between the preoperative and intraoperative images, poor image quality and a limited field of view challenge the robustness and accuracy of registration algorithms. Miao et al. [9] propose a learning-based registration method that is built upon the intensity-based approach. While they achieve high robustness, registration accuracy remains a challenge.
The intuition behind 2D/3D rigid registration is to globally minimize the visual misalignment between the 2D images and the projections of the 3D image. Based on this intuition, Schmid and Chênes [13] decompose the target structure into local shape patches and model image forces using Hooke's law of a spring from image block matching. Wang et al. [15] propose a point-to-plane correspondence (PPC) model for 2D/3D registration, which linearly constrains the global differential motion update using local correspondences. Registration is performed by iteratively establishing correspondences and performing the motion estimation. During the intervention, devices and implants, as well as locally similar anatomies, can introduce outliers for the local correspondence search (see Fig. 3(a) and (b)). Weighting of the local correspondences, in order to emphasize the correct ones, directly influences the accuracy and robustness of the registration. An iteratively reweighted scheme is suggested by Wang et al. [15] to enhance the robustness against outliers. However, this scheme only works when outliers are a minority of the measurements. Recently, Qi et al. [11]
proposed PointNet, a type of neural network that directly processes point clouds. PointNet is capable of internally extracting global features of the cloud and relating them to local features of individual points. It is thus well suited for correspondence weighting in 2D/3D registration. Yi et al.
[16] propose to learn the selection of correct correspondences for wide-baseline stereo images. As a basis, candidate correspondences are established, e.g. using SIFT features. Ground-truth labels are generated by exploiting the epipolar constraint; this way, an outlier label is obtained. Additionally, a regression loss is introduced, which is based on the error in the estimation of a known essential matrix between two images. Both losses are combined during training. While including the regression loss improves the results, the classification loss is shown to be important for finding highly accurate correspondences. The performance of iterative correspondence-based registration algorithms (e.g. [13], [15]) can be improved by learning a weighting strategy for the correspondences. However, automatic labeling of the correspondences is not practical for iterative methods, as even correct correspondences may have large errors in the first few iterations. This means that labeling cannot be performed by applying a simple rule such as a threshold based on the ground-truth position of a point. In this paper, we propose a method to learn an optimal weighting strategy for the local correspondences in rigid 2D/3D registration directly with the criterion of minimizing the registration error, without the need for per-correspondence ground-truth annotations. We treat the correspondences as a point cloud with extended per-point features and use a modified PointNet architecture to learn global interdependencies of local correspondences according to the PPC registration metric. We choose the PPC model as it was shown to enable high registration accuracy as well as robustness [15]. Furthermore, it is differentiable and therefore lends itself to use in our training objective function. To train the network, we propose a novel training objective function, which is composed of the motion estimation according to the PPC model and the registration error computation steps.
It allows us to learn a correspondence weighting strategy by minimizing the registration error. We demonstrate the effectiveness of the learned weighting strategy on single-vertebra registration, where we show highly improved robustness compared to the original PPC registration.
2 Registration and Learned Correspondence Weighting
In the following, we begin with an overview of the registration method using the PPC model. Then, further details on the motion estimation (see Sec. 2.2) and the registration error computation (see Sec. 2.3) are given, as these two steps play a crucial role in our objective function. The architecture of our network is discussed in Sec. 2.4, followed by the introduction of our objective function in Sec. 2.5. Finally, important details regarding the training procedure are given in Sec. 2.6.
2.1 Registration Using Point-to-Plane Correspondences
Wang et al. [15] measure the local misalignment between the projection of a 3D volume V and the 2D fluoroscopic (live X-ray) image and compute a motion which compensates for this misalignment. Surface points are extracted from V using the 3D Canny detector [1]. A set of contour generator points {w_i} [4], i.e. surface points which correspond to contours in the projection of V, are projected onto the image plane as the points {p_i}. Additionally, gradient projection images of V are generated and used to perform local patch matching to find a correspondence p_i' for each p_i in the 2D image. As motion along contours is not detectable, the patch matching is only performed in the direction orthogonal to the contour. Therefore, the displacement of p_i along the contour is not known, nor is the displacement along the viewing direction. These unknown directions span a plane π_i with normal n_i. After the registration, the point w_i should be located on the plane π_i. To minimize the point-to-plane distances d_i, a linear equation is defined for each correspondence under the small-angle assumption. The resulting system of equations is solved for the differential motion δv, which contains both the rotational components δω in axis-angle representation and the translational components δν, i.e. δv = (δωᵀ, δνᵀ)ᵀ. The correspondence search and motion estimation steps are applied iteratively over multiple resolution levels. To increase the robustness of the motion estimation, the maximum correntropy criterion for regression (MCCR) [3] is used to solve the system of linear equations [15]. The motion estimation is extended to coordinate systems related to the camera coordinates by a rigid transformation by Schaffert et al. [12].
The PPC model sets up a linear relationship between the local point-to-plane correspondences and the differential transformation, i.e. a linear misalignment metric based on the found correspondences. In this paper, we introduce a learning method for correspondence weighting, where the PPC metric is used during training to optimize the weighting strategy for the correspondences with respect to the registration error.
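To make the linear relationship concrete, the sketch below builds the generic small-angle point-to-plane constraint n·(δω × w + δν) = d and recovers a small motion from a set of such constraints. This is a simplified illustration, not the exact projective PPC formulation of [15]; all function and variable names are our own.

```python
import numpy as np

def ppc_row(w, n, d):
    """One linearized point-to-plane constraint n.(dw x w + dv) = d.
    Using the triple product n.(dw x w) = dw.(w x n), the row of the
    system matrix is a = ((w x n)^T, n^T) with right-hand side d."""
    return np.r_[np.cross(w, n), n], d

# Recover a small rigid motion dv = (dw, dv_t) from synthetic
# point-to-plane correspondences: each plane is placed so that its
# signed distance matches the displacement of the point.
rng = np.random.default_rng(1)
dv_true = 0.01 * rng.standard_normal(6)
rows, rhs = [], []
for _ in range(12):
    w = rng.standard_normal(3)                 # contour generator point
    n = rng.standard_normal(3)
    n /= np.linalg.norm(n)                     # plane normal
    d = n @ (np.cross(dv_true[:3], w) + dv_true[3:])
    a, d_i = ppc_row(w, n, d)
    rows.append(a)
    rhs.append(d_i)
dv = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
# dv now matches dv_true up to numerical precision
```

With twelve well-conditioned constraints, the six-dimensional differential motion is fully determined and the least-squares solve recovers it exactly.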
2.2 Weighted Motion Estimation
Motion estimation according to the PPC model is performed by solving a linear system of equations defined by a matrix A and a vector b, where each equation corresponds to one point-to-plane correspondence and N is the number of used correspondences. We perform the motion estimation in the camera coordinate system with the origin shifted to the centroid of the contour generator points {w_i}. This allows us to use the regularized least-squares estimation

  δv* = argmin_δv ‖W(A·δv − b)‖₂² + λ‖δv‖₂²    (1)

in order to improve the robustness of the estimation. Here, A ∈ ℝ^(N×6), b ∈ ℝ^N, and λ is the regularizer weight. The diagonal matrix W = diag(w) contains the weights for all correspondences. As Eq. (1) is differentiable w.r.t. W, we obtain the closed-form solution

  δv* = (AᵀWᵀWA + λ·I)⁻¹ AᵀWᵀW·b,    (2)
where I is the identity matrix. After each iteration, the registration T is updated as

  T ← ΔT · T_prev,    ΔT = ( R(δω)  δν ; 0ᵀ  1 ),    (3)

where R(δω) ≈ I₃ + [δω]_× is the rotation corresponding to δω under the small-angle assumption, [δω]_× is a skew matrix which expresses the cross product with δω
as a matrix multiplication, and T_prev is the registration after the previous iteration [15].

2.3 Registration Error Computation
In the training phase, the registration error is measured and minimized via our training objective function. Different error metrics, such as the mean target registration error (mTRE) or the mean re-projection distance (mRPD), can be used; for details on these metrics, see Sec. 3.3. In this work, we choose the projection error (PE) [14], as it directly corresponds to the visible misalignment in the images and therefore roughly correlates with the difficulty of finding correspondences by patch matching in the next iteration of the registration method. The PE is computed as
  PE(T_est) = (1/M) · Σ_j ‖P(q_j, T_est) − P(q_j, T_gt)‖₂,    (4)

where {q_j} is a set of M target points and j is the point index. P(q_j, T_est) is the projection of q_j onto the image plane under the currently estimated registration T_est and P(q_j, T_gt) the projection under the ground-truth registration matrix T_gt. The corners of the bounding box of the surface point set are used as the target points {q_j}.
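A minimal sketch of the PE computation, assuming a simple pinhole model with an intrinsic matrix K; the names `project` and `projection_error` are ours:

```python
import numpy as np

def project(K, T, points):
    """Pinhole projection of (N, 3) world points onto the image plane,
    given a 3x3 intrinsic matrix K and a 4x4 pose T."""
    ph = np.c_[points, np.ones(len(points))]      # homogeneous coordinates
    cam = (T @ ph.T)[:3]                          # points in the camera frame
    pix = K @ cam
    return (pix[:2] / pix[2]).T                   # perspective division

def projection_error(K, T_est, T_gt, targets):
    """Mean 2-D distance between the projections of the target points
    under the estimated and the ground-truth pose (cf. Eq. 4)."""
    d = project(K, T_est, targets) - project(K, T_gt, targets)
    return np.linalg.norm(d, axis=1).mean()
```

For a pure in-plane translation of the volume, the PE is simply the translation scaled by focal length over depth, which matches the intuition that the PE measures the visible 2D misalignment.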
2.4 Network Architecture
We want to weight individual correspondences based on their geometrical properties as well as the image similarity, taking into account the global properties of the correspondence set. For every correspondence i, we define a feature vector

  f_i = (w_iᵀ, n_iᵀ, d_i, g_i)ᵀ,    (5)

where g_i denotes the normalized gradient correlation of the correspondence, which is obtained in the patch matching step.
The goal is to learn the mapping from the set F = {f_1, …, f_N} of feature vectors representing all correspondences to the weight vector w containing the weights for all correspondences, i.e. the mapping

  w = N_θ(F),    (6)

where N_θ is our network and θ are the network parameters.
To learn directly on correspondence sets, we use the PointNet [11] architecture and modify it to fit our task (see Fig. 1). The basic idea behind PointNet is to process points individually and to obtain global information by combining the points in a symmetric way, i.e. independent of the order in which the points appear in the input [11]. In the simplest variant, PointNet consists of a multi-layer perceptron (MLP) which is applied to each point, transforming the respective feature vector f_i into a higher-dimensional feature space and thereby obtaining a local point descriptor. To describe the global properties of the point set, the resulting local descriptors are combined by max pooling over all points, i.e. for each feature, the maximum activation over all points in the set is retained. To obtain per-point outputs, the resulting global descriptor is concatenated to the local descriptor of each point. The resulting descriptors, containing global as well as local information, are further processed for each point independently by a second MLP. For our network, we choose MLPs which are smaller than in the original network [11]. We enforce the output to be in the range of (0, 1) by using a soft-sign activation function [2] in the last layer of the second MLP, modified to rescale the output range from (−1, 1) to (0, 1). Our modified soft-sign activation function is defined as

  s(x) = (1/2) · (x / (1 + |x|) + 1),    (7)

where x is the state of the neuron. Additionally, we introduce a global trainable weighting factor which is applied to all correspondences. This allows for an automatic adjustment of the strength of the regularization in the motion estimation step. Note that the network is able to process correspondence sets of variable size, so that no fixed number of correspondences is needed and all extracted correspondences can be utilized.
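The forward pass described above can be sketched as follows. Layer sizes and helper names are illustrative assumptions, not the exact configuration of the paper; the key properties shown are the order-invariant max pooling and the (0, 1) output range of the rescaled soft-sign.

```python
import numpy as np

def softsign01(x):
    """Soft-sign activation rescaled from (-1, 1) to (0, 1), cf. Eq. (7)."""
    return 0.5 * (x / (1.0 + np.abs(x)) + 1.0)

def mlp(x, layers):
    """Apply a shared per-point MLP, given as a list of (W, b) pairs."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)    # ReLU hidden layers
    return x

def weight_correspondences(F, mlp1, mlp2, global_scale=1.0):
    """PointNet-style forward pass: per-point MLP, symmetric max
    pooling, concatenation of global and local descriptors, second
    per-point MLP with a rescaled soft-sign output. F is (N, d)."""
    local = mlp(F, mlp1)                           # local descriptors (N, h)
    glob = local.max(axis=0)                       # order-invariant pooling (h,)
    both = np.c_[local, np.tile(glob, (len(F), 1))]
    hidden = mlp(both, mlp2[:-1])
    W_out, b_out = mlp2[-1]                        # final layer is linear
    return global_scale * softsign01((hidden @ W_out + b_out).ravel())
```

Because the global descriptor is a maximum over all points, permuting the input correspondences only permutes the output weights, which is exactly the symmetry property that makes the architecture suitable for unordered correspondence sets.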
2.5 Training Objective
We now combine the motion estimation, PE computation and the modified PointNet to obtain the training objective function as
  θ* = argmin_θ Σ_s PE(T_s(θ)),    (8)

where s is the training sample index, the sum runs over all training samples, and T_s(θ) is the registration obtained for sample s using the correspondence weights w_s = N_θ(F_s). Equation (2) is differentiable with respect to the weights W, Eq. (3) with respect to the motion δv*, and Eq. (4) with respect to the registration T. Therefore, gradient-based optimization can be performed on Eq. (8).
Note that using Eq. (8), we learn directly with the objective of minimizing the registration error, and no per-correspondence ground-truth weights are needed. Instead, the PPC metric is used to implicitly assess the quality of the correspondences during the backpropagation step of training, and the weights are adjusted accordingly. In other words, the optimization of the weights is driven by the PPC metric.
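The following toy example illustrates this metric-driven principle on synthetic data: the weights of a regularized least-squares motion estimate are optimized purely by descending on the resulting error metric (here the distance to the true motion stands in for the PE, and finite differences stand in for backpropagation). The setup and all numbers are entirely hypothetical.

```python
import numpy as np

# Synthetic PPC-style system: 64 correspondences, the first 8 corrupted.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 6))           # constraint matrix
dv_true = 0.1 * rng.standard_normal(6)     # ground-truth differential motion
b = A @ dv_true
b[:8] += 5.0                               # outlier correspondences

def solve(w, lam=1e-3):
    """Weighted, regularized least-squares motion estimate (cf. Eq. 2)."""
    AtW2 = A.T * w ** 2                    # A^T W^T W for diagonal W
    return np.linalg.solve(AtW2 @ A + lam * np.eye(6), AtW2 @ b)

def loss(w):
    """Registration error surrogate: distance to the true motion."""
    return np.sum((solve(w) - dv_true) ** 2)

# Descend on the error metric alone; no per-correspondence labels.
w = np.ones(64)
for _ in range(200):
    g = np.array([(loss(w + 1e-4 * e) - loss(w - 1e-4 * e)) / 2e-4
                  for e in np.eye(64)])
    w = np.clip(w - 0.5 * g, 0.0, 1.0)
```

In this toy setting, the outlier correspondences end up with markedly lower weights than the inliers, purely because down-weighting them reduces the error metric, mirroring how the PPC metric drives the weights during training.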
2.6 Training Procedure
To obtain training data, a set of volumes is used, each with one or more 2D images and a known ground-truth registration T_gt (see Sec. 3.1). For each pair of images, 60 random initial transformations with a uniformly distributed mTRE are generated [5]. For details on the computation of the mTRE and the start positions, see Sec. 3.3. Estimating correspondences at training time is computationally expensive. Instead, the correspondence search is performed once and the precomputed correspondences are used during training. Training is performed for one iteration of the registration method, and start positions with a small initial error are assumed to be representative for subsequent registration iterations at test time. For training, the number of correspondences is fixed to 1024 to enable efficient batch-wise computations. The subset of used correspondences is selected randomly for every training step. Data augmentation is performed on the correspondence sets by applying translations, in-plane rotations and horizontal flipping, i.e. reflection over the plane spanned by the vertical axis of the 2D image and the principal direction. For each resolution level, a separate model is trained.
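The fixed-size subsampling and the horizontal flipping of correspondences might be sketched as follows; this is a simplified illustration in camera coordinates (horizontal image axis = x), and the names and conventions are our assumptions:

```python
import numpy as np

def sample_correspondences(feats, n=1024, rng=None):
    """Pick a fixed-size random subset of the precomputed correspondences
    so that training batches have equal size; sampled with replacement
    only if fewer than n correspondences are available."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(feats), size=n, replace=len(feats) < n)
    return feats[idx]

def flip_horizontal(points, normals):
    """Reflect correspondences over the plane spanned by the vertical
    image axis and the principal direction, i.e. negate the horizontal
    (x) components of points and plane normals."""
    flip = np.array([-1.0, 1.0, 1.0])
    return points * flip, normals * flip
```

Applying the flip twice returns the original correspondences, so the augmentation is an involution, as a reflection should be.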
3 Experiments and Results
3.1 Data
We perform experiments for single-view registration of individual vertebrae. Note that single-vertebra registration is challenging due to the small size of the target structure and the presence of neighboring vertebrae, so achieving high robustness is difficult. We use clinical C-arm CT acquisitions from the thoracic and pelvic regions of the spine for training and evaluation. Each acquisition consists of a sequence of 2D images acquired with a rotating C-arm, which are used to reconstruct the 3D volume. To enable reconstruction, the C-arm geometry has to be calibrated with high accuracy (measured in our case as the projection error at the isocenter). We register the acquired 2D images to the respective reconstructed volume, and therefore the ground-truth registration is known within the accuracy of the calibration. Vertebrae are defined by an axis-aligned volume of interest (VOI) containing the whole vertebra, and only surface points inside the VOI are used for registration. We register the projection images (pixel size of 0.62 mm) to the reconstructed volumes (containing around 390 slices with a voxel size of 0.49 mm). To simulate realistic conditions, we add Poisson noise to all 2D images and rescale the intensities to better match fluoroscopic images.
The training set consists of 19 acquisitions with a total of 77 vertebrae. For each vertebra, 8 different 2D images are used. An additional validation set of 23 vertebrae from 6 acquisitions is used to monitor the training process. The registration is performed on a test set of 6 acquisitions. For each acquisition, 2 vertebrae are evaluated, and registration is performed independently for both the anterior-posterior and the lateral view. Each set contains data from different patients, i.e. no patient appears in two different sets. The sets were defined so that each is representative of the overall quality of the available images, i.e. contains both pelvic and thoracic vertebrae, as well as images with more and less clearly visible vertebrae. Examples of images used in the test set are shown in Fig. 2.
3.2 Compared Methods
We evaluate the performance of the registration using the PPC model in combination with the learned correspondence weighting strategy (PPC-L), which was trained using our proposed metric-driven learning method. To show the effectiveness of the correspondence weighting, we compare PPC-L to the original PPC method. The compared methods differ in the computation of the correspondence weights and the regularizer weight λ. For PPC-L, the learned correspondence weights are used. For PPC, we set λ = 0, and the correspondence weights are the normalized gradient correlation values of the found correspondences, where any value below a threshold is set to 0, i.e. the correspondence is rejected. Additionally, the MCCR is used in the PPC method only. The minimum resolution level has a scaling of 0.25 and the highest a scaling of 1.0. For the PPC method, registration is first performed on the lowest resolution level without allowing motion in depth, as this was shown to increase the robustness of the method. To differentiate between the effect of the correspondence weighting and the regularized motion estimation, we also consider registration using regularized motion estimation alone. We use a variant where the global weighting factor, which is applied to all points, is matched to the regularizer weight automatically by using our objective function (PPC-R). Additionally, we empirically set the correspondence weight to a fixed value which increases the robustness of the registration while still allowing for a reasonable amount of motion (PPC-RM).
3.3 Evaluation Metrics
To evaluate the registration, we follow the standardized evaluation methodology [5, 10]. The following metrics are defined by van de Kraats et al. [5]:

Mean Target Registration Error (mTRE): the mean distance of the target points under the ground-truth registration T_gt and the estimated registration T_est.

Mean Re-Projection Distance (mRPD): the mean distance of the target points under T_gt to the re-projection rays of the points as projected under T_est.

Success Rate (SR): the percentage of registrations with a registration error below a given threshold. As we are concerned with single-view registration, we define the success criterion as an mRPD below 2 mm.

Capture Range (CR): the maximum initial mTRE for which at least 95% of the registrations are successful.
Additionally, we compute the gross success rate (GSR) [9] as well as a gross capture range (GCR) with a success criterion of an mRPD below 10 mm in order to further assess the robustness of the methods in the case of low accuracy. We define the target points as uniformly distributed points inside the VOI of the registered vertebra. For the evaluation, we generate 600 random start transformations for each vertebra in a range of 0 mm to 30 mm initial mTRE using the methodology described by van de Kraats et al. [5]. We evaluate the accuracy using the mRPD and the robustness using the SR, CR, GSR and GCR.
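Assuming 1 mm bins of the initial mTRE (following the standardized methodology [5]), the mTRE, SR and CR can be computed as sketched below; the function names are ours:

```python
import numpy as np

def transform(T, points):
    """Apply a 4x4 rigid transform to (N, 3) points."""
    return points @ T[:3, :3].T + T[:3, 3]

def mtre(T_gt, T_est, targets):
    """Mean target registration error: mean 3-D distance of the target
    points under the ground-truth and the estimated registration."""
    return np.linalg.norm(transform(T_gt, targets)
                          - transform(T_est, targets), axis=1).mean()

def success_rate(errors, threshold=2.0):
    """Percentage of registrations with an error below the threshold."""
    return 100.0 * (np.asarray(errors) < threshold).mean()

def capture_range(initial_mtre, success, bin_width=1.0, min_rate=0.95):
    """Largest initial mTRE (in bins of bin_width) up to which at least
    min_rate of the registrations in every bin are successful."""
    initial_mtre = np.asarray(initial_mtre)
    success = np.asarray(success, dtype=float)
    cr, edge = 0.0, bin_width
    while True:
        in_bin = (initial_mtre >= edge - bin_width) & (initial_mtre < edge)
        if not in_bin.any() or success[in_bin].mean() < min_rate:
            return cr
        cr, edge = edge, edge + bin_width
```

Note that the capture range stops at the first bin that falls below the required per-bin success rate, so a single hard bin limits the CR even if registrations from larger initial errors still succeed.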
3.4 Results and Discussion
3.4.1 Accuracy and Robustness
The evaluation results for the compared methods are summarized in Tab. 1. We observe that PPC-L achieves the best SR of 94.3% and CR of 13 mm. Compared to PPC (SR of 79.3% and CR of 3 mm), PPC-R also achieves a higher SR of 88.1% and CR of 6 mm. For the regularized motion estimation, the accuracy decreases with increasing regularizer influence (0.79±0.22 mm for PPC-R and 1.18±0.42 mm for PPC-RM), compared to PPC (0.75±0.21 mm) and PPC-L (0.74±0.26 mm). A sample registration result using PPC-L is shown in Fig. 3(d).
Method  | mRPD [mm] | SR [%] | CR [mm] | GSR [%] | GCR [mm]
PPC     | 0.75±0.21 | 79.3   | 3       | 81.8    | 3
PPC-R   | 0.79±0.22 | 88.1   | 6       | 90.7    | 6
PPC-RM  | 1.18±0.42 | 59.6   | 4       | 95.1    | 20
PPC-L   | 0.74±0.26 | 94.3   | 13      | 96.3    | 22
For strongly regularized motion estimation, we observe a large difference between the GSR and the SR. While for PPC-R the difference is relatively small (88.1% vs. 90.7%), it is very large for PPC-RM: here a GSR of 95.1% is achieved, while the SR is 59.6%. This indicates that while the method is robust, its accuracy is low. Compared to the CR, the GCR is increased for PPC-L (22 mm vs. 13 mm) and especially for PPC-RM (20 mm vs. 4 mm). Overall, this shows that while some inaccurate registrations are present for PPC-L, they are very common for PPC-RM.
3.4.2 Single Iteration Evaluation
To better understand the effect of the correspondence weighting and the regularization, we investigate the registration results after one iteration on the lowest resolution level. In Fig. 4, the PE in pixels (computed using the bounding-box corners as target points) is shown for all cases in the validation set. As in training, 1024 correspondences are used per case for all methods. We observe that for PPC, the error has a high spread: for some cases it is decreased considerably, while for others it is increased. For PPC-R, most cases are below the initial error. However, the error is decreased only marginally, as the regularization prevents large motions. For PPC-L, we observe that the error is drastically decreased for most cases. This shows that PPC-L is able to estimate motion efficiently. An example of correspondence weighting in PPC-L is shown in Fig. 3(c), where we observe a set of consistent correspondences with high weights, while the remaining correspondences have low weights.
3.4.3 Method Combinations
We observed that while the PPC-RM method has high robustness (GCR and GSR), it leads to low accuracy. For PPC-L, we observed an increased GCR compared to the CR. In both cases, this demonstrates that registrations with an mRPD between 2 mm and 10 mm are present. As PPC works reliably for small initial errors, we combine these methods with PPC by performing PPC on the highest resolution level instead of the respective method. We denote the resulting methods as PPC-RM+ and PPC-L+. We observe that PPC-RM+ achieves an accuracy of 0.74±0.18 mm, an SR of 94.6% and a CR of 18 mm, while PPC-L+ achieves an accuracy of 0.74±0.19 mm, an SR of 96.1% and a CR of 19 mm. While the results are similar, we note that for PPC-RM+ a manual weight selection is necessary. Further investigations are needed to clarify the better performance of PPC compared to PPC-L on the highest resolution level. However, this result may also demonstrate the strength of the MCCR for cases where the majority of correspondences are correct. We evaluate the convergence behavior of PPC-L+ and PPC-RM+ by considering only cases which were successful. For these cases, we investigate the error distribution after the first resolution level. The results are shown in Fig. 5. We observe that for PPC-L+, an mRPD below 10 mm is achieved for all cases, while for PPC-RM+, higher misalignments of around 20 mm mRPD are present. The result for PPC-L+ is achieved after an average of 7.6 iterations, while 11.8 iterations were performed on average for PPC-RM+ using the stop criterion defined in [15]. In combination, this further substantiates our findings from the single-iteration evaluation and shows the efficiency of PPC-L and its potential for reducing the computational cost.
4 Conclusion
For 2D/3D registration, we propose a method to learn the weighting of the local correspondences directly from the global criterion of minimizing the registration error. We achieve this by incorporating the motion estimation and error computation steps into our training objective function. A modified PointNet network is trained to weight correspondences based on their geometrical properties and image similarity. A large improvement in registration robustness is demonstrated when using the learning-based correspondence weighting, while high accuracy is maintained. Although high robustness can also be achieved by regularized motion estimation, registration using learned correspondence weighting has the following advantages: it is more efficient, does not need manual parameter tuning, and achieves high accuracy. One direction of future work is to further improve the weighting strategy, e.g. by including more information in the decision process and optimizing the objective function for robustness and/or accuracy depending on the stage of the registration, such as the current resolution level. By regarding the motion estimation as part of the network and not of the objective function, our model can also be understood in the framework of precision learning [7] as a regression model for the motion, where we learn only the unknown component (weighting of correspondences), while employing prior knowledge for the known component (motion estimation). Following the framework of precision learning, replacing further steps of the registration framework with learned counterparts can be investigated. One candidate is the correspondence estimation, as it is challenging to design an optimal correspondence estimation method by hand.
Disclaimer: The concept and software presented in this paper are based on research and are not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
References
 [1] Canny, J.: A Computational Approach to Edge Detection. IEEE Trans Pattern Anal Mach Intell (6), 679–698 (1986)
 [2] Elliott, D.L.: A Better Activation Function for Artificial Neural Networks. Tech. rep. (1993)
 [3] Feng, Y., Huang, X., Shi, L., Yang, Y., Suykens, J.A.: Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J Mach Learn Res 16, 993–1034 (2015)
 [4] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, p. 200. Cambridge University Press, 2 edn. (2003)
 [5] van de Kraats, E.B., Penney, G.P., Tomaževič, D., van Walsum, T., Niessen, W.J.: Standardized Evaluation Methodology for 2D-3D Registration. IEEE Trans Med Imag 24(9), 1177–1189 (2005)
 [6] Kubias, A., Deinzer, F., Feldmann, T., Paulus, D., Schreiber, B., Brunner, T.: 2D/3D Image Registration on the GPU. Pattern Recognition and Image Analysis 18(3), 381–389 (2008)
 [7] Maier, A., Schebesch, F., Syben, C., Würfl, T., Steidl, S., Choi, J.H., Fahrig, R.: Precision Learning: Towards Use of Known Operators in Neural Networks. arXiv preprint arXiv:1712.00374v3 (2017)
 [8] Markelj, P., Tomaževič, D., Likar, B., Pernuš, F.: A Review of 3D/2D Registration Methods for Image-Guided Interventions. Med. Image Anal. 16(3), 642–661 (2012)
 [9] Miao, S., Piat, S., Fischer, P., Tuysuzoglu, A., Mewes, P., Mansi, T., Liao, R.: Dilated FCN for Multi-Agent 2D/3D Medical Image Registration. In: AAAI Conference on Artificial Intelligence (AAAI). pp. 4694–4701 (2018)
 [10] Mitrović, U., Špiclin, Ž., Likar, B., Pernuš, F.: 3D-2D Registration of Cerebral Angiograms: A Method and Evaluation on Clinical Images. IEEE Trans Med Imag 32(8), 1550–1563 (2013)
 [11] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 77–85 (2017)
 [12] Schaffert, R., Wang, J., Fischer, P., Borsdorf, A., Maier, A.: Multi-View Depth-Aware Rigid 2D/3D Registration. In: IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC) (2017)
 [13] Schmid, J., Chênes, C.: Segmentation of X-ray Images by 3D-2D Registration Based on Multibody Physics. In: Cremers, D., Reid, I., Saito, H., Yang, M.H. (eds.) ACCV 2014, LNCS, vol. 9004, pp. 674–687. Springer (2014)
 [14] Wang, J., Borsdorf, A., Heigl, B., Köhler, T., Hornegger, J.: Gradient-Based Differential Approach for 3D Motion Compensation in Interventional 2D/3D Image Fusion. In: International Conference on 3D Vision (3DV). pp. 293–300 (2014)
 [15] Wang, J., Schaffert, R., Borsdorf, A., Heigl, B., Huang, X., Hornegger, J., Maier, A.: Dynamic 2D/3D Rigid Registration Framework Using Point-to-Plane Correspondence Model. IEEE Trans Med Imag 36(9), 1939–1954 (2017)
 [16] Yi, K.M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to Find Good Correspondences. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2666–2674 (2018)