1 Introduction
In 2D/3D rigid registration for intervention, the goal is to find a rigid pose of pre-intervention 3D data, e.g., computed tomography (CT), such that it aligns with a 2D intra-intervention image of a patient, e.g., fluoroscopy. In practice, CT is usually the preferred 3D pre-intervention data, as digitally reconstructed radiographs (DRRs) can be produced from CT using ray casting [21]. The generation of DRRs simulates how an X-ray is captured, which makes them visually similar to X-rays. Therefore, they are leveraged to facilitate 2D/3D registration, as we can observe the misalignment between the CT and the patient by directly comparing the intra-intervention X-ray and the generated DRR (see Figure 1 and Section 3.1 for details).
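To make the ray-casting idea concrete, the following sketch integrates attenuation along parallel rays through a toy volume. This is a deliberate simplification of the cone-beam geometry used for real DRRs, and `drr_raysum` is a name introduced here for illustration only:

```python
import numpy as np

def drr_raysum(volume, axis=0):
    """Toy DRR: integrate attenuation along parallel rays.

    A real DRR traces diverging rays from the X-ray source through the
    CT volume (cone-beam geometry); summing voxels along one axis is the
    parallel-beam special case, but it already shows why a DRR looks like
    an X-ray: both accumulate attenuation along rays.
    """
    return volume.sum(axis=axis)

# A 3D "phantom": a dense cube inside an empty volume.
vol = np.zeros((32, 32, 32))
vol[8:24, 8:24, 8:24] = 1.0
drr = drr_raysum(vol, axis=0)
# Rays through the cube accumulate 16 unit-attenuation voxels;
# rays that miss it accumulate nothing.
```
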
One of the most commonly used 2D/3D registration strategies [12] is an optimization-based approach, where a similarity metric is first designed to measure the closeness between the DRRs and the 2D data, and the 3D pose is then iteratively searched and optimized for the best similarity score. However, the iterative pose-searching scheme usually suffers from two problems. First, the generation of DRRs incurs high computation, and the iterative pose searching requires a significant number of DRRs for the similarity measure, making it computationally slow. Second, iterative pose searching relies on a good initialization. When the initial position is not close enough to the correct one, the method may converge to a local extremum, and the registration fails. Although many studies have been proposed to address these two problems [4, 16, 15, 8, 5, 7, 19], trade-offs still have to be made between sampling good starting points and a less costly registration.
In recent years, the development of deep neural networks (DNNs) has enabled a learning-based strategy for medical image registration [13, 22, 11, 14] that aims to estimate the pose of the 3D data without searching and sampling the pose space at a large scale. Despite the efficiency, there are still two limitations to the existing learning-based methods. First, learning-based methods usually require generating a huge number of DRRs for training. The corresponding poses of the DRRs have to densely cover the entire search space to avoid overfitting. Considering that the number of required DRRs is exponential with respect to the dimension of the pose space (which is usually six), this is computationally prohibitive, thus making learning-based methods less reliable during testing. Second, the current state-of-the-art learning-based methods [13, 22, 11] require an iterative refinement of the estimated pose and use DNNs to predict the most plausible update direction for faster convergence. However, the iterative approach still introduces a non-negligible computational cost, and the DNNs may direct the search to an unseen state, which quickly fails the registration.
In this paper, we introduce a novel learning-based approach, referred to as a Point-Of-Interest Network for Tracking and Triangulation (POINT²). POINT² directly aligns the 3D data with the patient by using DNNs to establish a point-to-point correspondence between multiple views of DRRs and X-ray images (a different view indicates that the DRR or X-ray is captured at a different projection angle). The 3D pose is then estimated by aligning the matched points. Specifically, these are achieved by tracking a set of points of interest (POIs). For 2D correspondence, we use a POI tracking network to map the 2D POIs from the DRRs to the X-ray images. For 3D correspondence, we develop a triangulation layer that projects the tracked POIs in the X-ray images of multiple views back into 3D.
We highlight that since the point-to-point correspondence is established in a shift-invariant manner, the requirement of dense sampling over the entire pose space is avoided.
The contributions of this paper are as follows:


A novel learning-based multi-view 2D/3D rigid registration method that directly measures the 3D misalignment by exploiting the point-to-point correspondence between the X-rays and DRRs, which avoids the costly and unreliable iterative pose searching and thus delivers faster and more robust registration.

A novel POI tracking network, constructed using a Siamese U-Net with POI convolution, that enables fine-grained feature extraction and an effective POI similarity measure and, more importantly, offers a shift-invariant 2D misalignment measure that is robust to in-plane offsets (an in-plane/out-of-plane offset refers to a translation and rotation offset within/outside the plane of the DRR or X-ray images).
A unified framework of the POI tracker and the triangulation layer, which enables (i) end-to-end learning of informative 2D features and (ii) 3D pose estimation.

An extensive evaluation on a large-scale and challenging clinical cone-beam CT (CBCT) dataset, which shows that the proposed method performs significantly better than the state-of-the-art learning-based approaches, and, when used as an initial pose estimator, it also greatly improves the robustness and speed of the state-of-the-art optimization-based approaches.
2 Related Work
Optimization-Based Approaches. Optimization-based approaches usually suffer from high computational cost and are sensitive to the initial estimate. To reduce the computational cost, many works have been proposed to improve the efficiency at the hardware level [10, 8, 15] or the software level [19, 27, 9]. Although these works have successfully reduced the DRR generation time to a reasonable range, the overall registration time is still non-negligible [19, 15], and the registration accuracy might be compromised for faster speed [27, 19]. For better initial pose estimation, many attempts have been made by either sampling better initial positions [7, 5], using multi-start strategies [26, 16], or carefully designing an objective function that is less sensitive to the initial position selection [15]. However, these methods usually achieve a more robust registration at the cost of a longer running time, as more locations and the corresponding DRRs need to be sampled and generated, respectively, to avoid being trapped in local extrema.
Learning-Based Approaches. An early learning-based approach [14] aims to train DNNs to directly predict the 3D pose given a pair of DRR and X-ray images. However, this approach is generally too ambitious and hence relies on the existence of opaque objects, such as medical implants, that provide strong features for robustness. Alternatively, it has been shown that formulating the registration as a Markov decision process (MDP) is viable [11]. Instead of directly regressing the 3D pose, MDP-based methods first train an agent that predicts the most plausible search direction, and then the registration is iteratively repeated until a fixed number of steps is reached. However, the MDP-based approach requires the agent to be trained on a large number of samples such that the registration can follow the expected trajectory. Though mitigated with a multi-agent design [13], it is still inevitable that the neighborhood search may reach an unseen pose and the registration fails. Moreover, the MDP-based approach cannot guarantee convergence, which limits its registration accuracy. Therefore, the MDP-based approach [13] is usually used to find a good initial pose for the registration, and a combination with an optimization-based method is applied for better performance. Another possible approach is to directly track landmarks from multiple views of X-ray images [2]. However, the landmark-based tracking approach does not make use of the information from the CT volume and requires the landmarks to be present in the X-ray images, making it less robust and less applicable to clinical applications.
3 Methodology
3.1 Problem Formulation
Following the convention in the literature [12], we assume a 2D/3D rigid registration problem and also assume that the 3D data is a CT or CBCT volume, which is the most accessible modality and allows the generation of DRRs. For the 2D data, we use X-rays. As single-view 2D/3D registration is an ill-posed problem (due to the ambiguity introduced by the out-of-plane offset), X-rays from multiple views are usually captured during the intervention. Therefore, we also follow the literature [12] and tackle a multi-view 2D/3D registration problem. Without loss of generality, most of the studies in this work are conducted under two views, and it is easy to extend our work to cases with more views.
2D/3D Rigid Registration with DRRs. In 2D/3D rigid registration, the misalignment between the patient and the CT volume is formulated through a transformation matrix $T$ that brings the CT volume from its initial location to the patient's location under the same coordinate system. As illustrated in Figure 2, $T$ is usually parameterized by three translations $(t_x, t_y, t_z)$ and three rotations $(\theta_x, \theta_y, \theta_z)$ about the axes, and can be written as a $4 \times 4$ matrix under the homogeneous coordinate

$$T = \begin{bmatrix} R(\theta_x, \theta_y, \theta_z) & t \\ \mathbf{0}^\top & 1 \end{bmatrix}, \quad t = (t_x, t_y, t_z)^\top, \tag{1}$$

where $R(\theta_x, \theta_y, \theta_z)$ is the $3 \times 3$ rotation matrix that controls the rotation of the CT volume around the origin.
As demonstrated in Figure 1, casting simulated X-rays through the CT volume creates a DRR on the detector. Similarly, passing a real X-ray beam through the patient's body gives an X-ray image. Hence, the misalignment between the CT volume and the patient can be observed on the detector by comparing the DRR and the X-ray image. Given a transformation matrix $T$ and a CT volume $V$, the DRR $I_D$ can be computed by

$$I_D(p) = \int_{L(p)} V(T^{-1} x)\, \mathrm{d}x, \tag{2}$$

where $L(p)$, whose parameters are determined by the imaging model, is a line segment connecting the X-ray source and a point $p$ on the detector. Therefore, letting $I_X$ denote the X-ray image, the 2D/3D registration can be seen as finding the optimal $T$ such that $I_D$ and $I_X$ are aligned.
X-Ray Imaging Model. An X-ray imaging system is usually modeled as a pinhole camera [3, 6], as illustrated in Figure 2, where the X-ray source serves as the camera center and the X-ray detector serves as the image plane. Following the convention in X-ray imaging [3], we assume an isocenter coordinate system whose origin lies at the isocenter. Without loss of generality, we also assume the imaging model is calibrated, and there is no X-ray source offset or detector offset. Thus, the X-ray source, the isocenter, and the detector's origin are collinear, and the line from the X-ray source to the isocenter (referred to as the principal axis) is perpendicular to the detector. Let $d_{sd}$ denote the distance between the X-ray source and the detector origin, and $d_{si}$ denote the distance between the X-ray source and the isocenter; then for a point $x$ in the isocenter coordinate, its projection $\tilde{p}$ on the detector is given by

$$\tilde{p} = K \tilde{x}, \tag{3}$$

where

$$K = \begin{bmatrix} d_{sd} & 0 & 0 & 0 \\ 0 & d_{sd} & 0 & 0 \\ 0 & 0 & -1 & d_{si} \end{bmatrix}.$$

Here $\tilde{p}$ is defined under the homogeneous coordinate, and its counterpart under the detector coordinate can be written as $p = (\tilde{p}_1 / \tilde{p}_3,\; \tilde{p}_2 / \tilde{p}_3)$.
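A minimal numeric sketch of this projection model, assuming (as in the reconstruction above) that the source sits on the z-axis at distance $d_{si}$ from the isocenter; `project`, `d_sd`, and `d_si` are names introduced here for illustration:

```python
import numpy as np

def project(x, d_sd, d_si):
    """Project a 3D point (isocenter coordinates) onto the detector.

    The X-ray source sits at (0, 0, d_si) on the principal (z) axis,
    and the detector plane is perpendicular to it at distance d_sd
    from the source. Returns 2D detector coordinates.
    """
    K = np.array([[d_sd, 0.0, 0.0, 0.0],
                  [0.0, d_sd, 0.0, 0.0],
                  [0.0, 0.0, -1.0, d_si]])
    xh = np.append(x, 1.0)        # homogeneous coordinates
    ph = K @ xh
    return ph[:2] / ph[2]         # dehomogenize

# A point at the isocenter projects to the detector origin,
# and off-axis points are magnified by d_sd / d_si.
print(project(np.array([0.0, 0.0, 0.0]), d_sd=1500.0, d_si=1000.0))  # -> [0. 0.]
```
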
In general, an X-ray is usually not captured at the canonical view discussed above. Let $T_v$ be a transformation matrix that converts the canonical view to a non-canonical view (Figure 2); then the projection of $x$ for the non-canonical view can be written as

$$\tilde{p} = K T_v^{-1} \tilde{x}, \quad T_v = \begin{bmatrix} R_v & t_v \\ \mathbf{0}^\top & 1 \end{bmatrix}, \tag{4}$$

where $R_v$ and $t_v$ perform the rotation and translation, respectively, as in Equation (1). Similarly, we can rewrite Equation (2) at a non-canonical view as

$$I_D(p) = \int_{L(p)} V(T^{-1} T_v\, x)\, \mathrm{d}x. \tag{5}$$
3.2 The Proposed POINT² Approach
An overview of the proposed method with two views is shown in Figure 3. Given a set of DRR and X-ray pairs of different views, our approach first selects a set of 3D POIs from the CT volume and projects them to each DRR using Equation (4), as shown in Figure 3(a). Then, the approach measures the misalignment between each pair of DRR and X-ray by tracking the projected DRR POIs in the X-ray (Figure 3(b)). Using the tracked POIs on the X-rays, we estimate their corresponding 3D POIs on the patient through triangulation (Figure 3(c)). Finally, by aligning the CT POIs with the patient POIs, the pose misalignment between the CT and the patient can be calculated (Figure 3(d)).
POINT. One of the key components of the proposed method is a Point-Of-Interest Network for Tracking (POINT) that finds the point-to-point correspondence between two images; that is, we use this network to track the POIs from the DRR to the X-ray. Specifically, the network takes a DRR and X-ray pair and a set of projected DRR POIs as input, and outputs the tracked X-ray POIs in the form of heatmaps.
The structure of the network is illustrated in Figure 4. We construct this network under a Siamese architecture [1, 23], with each branch having a U-Net-like structure [18]. The weights of the two branches are shared. Each branch takes an image as input and performs fine-grained feature extraction at the pixel level. Thus, the output is a feature map with the same resolution as the input image; for an image of size M×N, the size of the feature map is M×N×C, where C is the number of channels. We denote the extracted feature maps of the DRR and the X-ray as $F_D$ and $F_X$, respectively.
With the feature map $F_D$, the feature vector of a DRR POI $p$ can be extracted by interpolating $F_D$ at $p$. The feature extraction layer (FE layer) in Figure 4 performs this operation, and we denote its output as a feature kernel $k_D$. For a richer feature representation, the neighboring feature vectors around $p$ may also be used. A neighborhood of size $K$ gives in total $(2K+1) \times (2K+1)$ feature vectors, and the feature kernel in this case has size $(2K+1) \times (2K+1) \times C$. Similarly, a feature kernel at a location $q$ of the X-ray feature map $F_X$ can be extracted and denoted as $k_X(q)$. Then, we may apply a similarity operation to $k_D$ and $k_X(q)$ to give a similarity score of the two locations $p$ and $q$. When the similarity check is operated exhaustively over all locations on the X-ray, the location with the highest similarity score is regarded as the corresponding POI of $p$ on the X-ray. Such an exhaustive search on $F_X$ can be performed effectively with convolution and is denoted as the POI convolution layer in Figure 4. The output of the layer is a heatmap $h$ computed by

$$h = F_X \circledast (w \odot k_D), \tag{6}$$

where $w$ is a learned weight that selects the features for a better similarity measure, $\circledast$ denotes convolution, and $\odot$ denotes element-wise multiplication. Each element $h(q)$ denotes a similarity score of the corresponding location $q$ on the X-ray.
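The mechanics of this POI convolution can be sketched as a plain cross-correlation of a weighted feature kernel against the X-ray feature map. The names below are illustrative, and a real implementation would of course batch this as a GPU convolution rather than loop in numpy:

```python
import numpy as np

def poi_convolution(feat_xray, kernel, weight):
    """Slide a weighted DRR feature kernel over the X-ray feature map;
    each output element is a similarity score for that X-ray location
    (a hedged sketch of Eq. (6)).

    feat_xray: (M, N, C) X-ray feature map
    kernel:    (k, k, C) feature kernel extracted around a DRR POI
    weight:    (k, k, C) learned weight selecting informative features
    """
    k = kernel.shape[0]
    pad = k // 2
    wk = weight * kernel                    # element-wise feature selection
    padded = np.pad(feat_xray, ((pad, pad), (pad, pad), (0, 0)))
    M, N = feat_xray.shape[:2]
    heat = np.empty((M, N))
    for i in range(M):
        for j in range(N):
            heat[i, j] = np.sum(padded[i:i + k, j:j + k] * wk)
    return heat

# Demo: embed a distinctive patch and recover its location from the heatmap.
rng = np.random.default_rng(0)
F = rng.normal(size=(16, 16, 4))
F[5:8, 9:12] += 5.0                         # make one 3x3 patch distinctive
kernel = F[5:8, 9:12].copy()                # "DRR" kernel = that patch
heat = poi_convolution(F, kernel, np.ones_like(kernel))
# The heatmap peak lands at the patch center, row 6, column 10.
```
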
POINT². With the tracked POIs from different views of the X-rays, we can obtain their 3D locations on the patient using triangulation, as shown in Figure 3(c). However, this work seeks a unified solution that formulates the POINT network and the triangulation under the same framework, so that the two tasks can be trained jointly in an end-to-end fashion, which could potentially benefit the learning of the tracking network. An illustration of this end-to-end design for two views is shown in Figure 5. For an $N$-view 2D/3D registration problem, the proposed design includes $N$ POINT networks as discussed above. Each of the networks tracks POIs for its designated view and, therefore, the weights are not shared among the networks. Given a set of DRR and X-ray pairs of the $N$ views, these networks output the tracked X-ray POIs of each view in the form of heatmaps.
After obtaining the heatmaps, we introduce a triangulation layer that localizes a 3D point by forming triangles to it from the 2D tracked POIs given by the heatmaps. Formally, we denote by $\{h_i^1, \dots, h_i^N\}$ the set of heatmaps from the $N$ different views that all correspond to the same 3D POI $x_i$. Here, $h_i^j$ is the heatmap of the $i$-th X-ray POI from the $j$-th view, and we obtain the 2D X-ray POI $p_i^j$ as the expectation over the normalized heatmap

$$p_i^j = \sum_{q} q\, \frac{h_i^j(q)}{\sum_{q'} h_i^j(q')}. \tag{7}$$
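One differentiable way to read a POI off a heatmap, consistent with the normalized expectation in Equation (7), is a soft-argmax. This sketch assumes non-negative heatmap responses (e.g., after a sigmoid); `soft_argmax` is a name introduced here:

```python
import numpy as np

def soft_argmax(heatmap):
    """Differentiable POI readout from a heatmap: normalize the heatmap
    to a distribution and take the expected 2D location. Unlike a hard
    argmax, gradients flow back into the heatmap, which is what allows a
    triangulation layer downstream to train the tracker end to end.
    """
    w = heatmap / heatmap.sum()
    rows, cols = np.indices(heatmap.shape)
    return np.array([np.sum(w * cols), np.sum(w * rows)])  # (u, v)

h = np.zeros((8, 8))
h[3, 5] = 1.0            # a single confident response
print(soft_argmax(h))    # -> [5. 3.]
```

For a multi-modal response, the readout is the weighted centroid of the modes, which is why sharp, unimodal heatmaps are desirable in practice.
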
Next, we rewrite Equation (4) for view $j$ as

$$\tilde{p}_i^j = P^j \tilde{x}_i, \quad P^j = K (T_v^j)^{-1} = [\, M^j \mid m^j \,], \tag{8}$$

where $M^j$ and $m^j$ denote the left $3 \times 3$ block and the last column of the projection matrix $P^j$, respectively. Thus, by applying Equation (8) and eliminating the homogeneous scale for each view, we can get

$$A^j x_i = b^j, \quad A^j = \begin{bmatrix} u_i^j M_3^j - M_1^j \\ v_i^j M_3^j - M_2^j \end{bmatrix}, \quad b^j = \begin{bmatrix} m_1^j - u_i^j m_3^j \\ m_2^j - v_i^j m_3^j \end{bmatrix}, \tag{9}$$

where $p_i^j = (u_i^j, v_i^j)$ and the subscripts of $M^j$ and $m^j$ index rows. Let

$$A = \begin{bmatrix} A^1 \\ \vdots \\ A^N \end{bmatrix}, \quad b = \begin{bmatrix} b^1 \\ \vdots \\ b^N \end{bmatrix}; \tag{10}$$

then the estimated 3D POI $\hat{x}_i$ is given by the least-squares solution

$$\hat{x}_i = (A^\top A)^{-1} A^\top b. \tag{11}$$
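The least-squares triangulation of Equations (8)-(11) can be sketched as follows; the projection matrices and point names are illustrative, and `np.linalg.lstsq` stands in for the closed-form normal-equations solve:

```python
import numpy as np

def triangulate(projections, points2d):
    """Each view j has a 3x4 projection matrix P_j and a tracked detector
    point (u_j, v_j); every view contributes two linear equations in the
    unknown 3D point x, and the stacked system is solved in closed form by
    least squares -- the kind of operation autodiff can backpropagate through.
    """
    A_rows, b_rows = [], []
    for P, (u, v) in zip(projections, points2d):
        M, m = P[:, :3], P[:, 3]                 # split P = [M | m]
        A_rows.append(u * M[2] - M[0]); b_rows.append(m[0] - u * m[2])
        A_rows.append(v * M[2] - M[1]); b_rows.append(m[1] - v * m[2])
    A, b = np.array(A_rows), np.array(b_rows)
    return np.linalg.lstsq(A, b, rcond=None)[0]  # (A^T A)^{-1} A^T b

def proj(P, x):
    """Project a 3D point with a 3x4 matrix and dehomogenize."""
    ph = P @ np.append(x, 1.0)
    return ph[:2] / ph[2]

# Two toy views observing the same point (1, 2, 3).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # looks down z
P2 = np.array([[0., 0., 1., 0.], [0., 1., 0., 0.], [1., 0., 0., 4.]])
x_true = np.array([1., 2., 3.])
x_hat = triangulate([P1, P2], [proj(P1, x_true), proj(P2, x_true)])
# x_hat recovers (1, 2, 3) exactly for noise-free observations.
```
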
The triangulation layer can be plugged into a loss function that regulates the training of the POINT networks of the different views:

$$\mathcal{L} = \sum_{i,j} \mathrm{BCE}\big(\sigma(h_i^j),\, g_i^j\big) + \lambda \sum_i \big\| \hat{x}_i - x_i^* \big\|_2, \tag{12}$$

where $g_i^j$ is the ground truth heatmap, $x_i^*$ is the ground truth 3D POI, BCE is the pixel-wise binary cross entropy function, $\sigma$ is the sigmoid function, and $\lambda$ is a weight balancing the losses between tracking and triangulation errors.

Shape Alignment. Let $\{x_i\}$ be the selected CT POIs and $\{\hat{x}_i\}$ be the estimated 3D POIs (the shape alignment assumes the points are under the homogeneous coordinate). The shape alignment finds a transformation matrix $\hat{T}$ such that the transformed CT POIs align closely with the estimated POIs, i.e.,

$$\hat{T} = \arg\min_{T} \sum_i \big\| T x_i - \hat{x}_i \big\|_2^2. \tag{13}$$
This problem is solved analytically through Procrustes analysis [20].
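A compact sketch of the SVD-based Procrustes solution to this alignment (the Kabsch algorithm) is shown below; `rigid_align` and the toy pose are illustrative:

```python
import numpy as np

def rigid_align(src, dst):
    """Closed-form rigid (rotation + translation) alignment of source
    points to destination points via orthogonal Procrustes / Kabsch.
    Points are (N, 3) arrays; returns R (3x3) and t (3,) such that
    dst ~= src @ R.T + t.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard vs. reflection
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Recover a known pose from transformed points.
rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.],
                   [np.sin(theta),  np.cos(theta), 0.],
                   [0., 0., 1.]])
t_true = np.array([1.0, -2.0, 0.5])
R_hat, t_hat = rigid_align(pts, pts @ R_true.T + t_true)
```

The determinant guard is what keeps the solution a proper rotation rather than a reflection, which matters when the point sets are noisy or nearly planar.
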
4 Experiments
4.1 Dataset
The dataset we use in the experiments is a cone-beam CT (CBCT) dataset captured for radiation therapy. The dataset contains 340 raw CBCT scans, each of which has 780 X-ray images. Each X-ray image comes with a geometry file that provides the registration ground truth as well as the information needed to reconstruct the CBCT volume. Each CBCT volume is reconstructed from the 780 X-ray images, and in total we have 340 CBCT volumes (one for each CBCT scan). We use 300 scans for training and validation, and 40 scans for testing. The size of the CBCT volumes is with 0.5 mm voxel spacing, and the size of the X-ray images is with 0.388 mm pixel spacing. During the experiments, the CBCT volumes are treated as the 3D pre-intervention data, and the corresponding X-ray images are treated as the 2D intra-intervention data. Sample X-ray images from our dataset are shown in Figure. Note that unlike many existing approaches [15, 17, 25] that evaluate their methods on small datasets (typically about 10 scans) captured under relatively ideal scenarios, we use a significantly larger dataset with complex clinical settings, e.g., diverse fields of view, surgical instruments/implants, and various image contrast and quality.
We consider two common views during the experiments: the anterior-posterior view and the lateral view. Hence, only X-rays that are close to () these views are used for training and testing. Note that this selection does not tightly constrain the diversity of the X-rays, as the patient may be subject to movements with respect to the operating bed. To train the proposed method, X-ray and DRR pairs are selected and generated with a maximum of rotation offset and mm translation offset. We first invert all the raw X-ray images and then apply histogram equalization to both the inverted X-ray images and the DRRs to facilitate the similarity measurement. For each of the scans, we also annotate landmarks on the reconstructed CBCT volume for further evaluation.
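The invert-then-equalize preprocessing described above can be sketched with a plain numpy histogram equalization. This is a simplification; `invert_and_equalize` is a name introduced here, and a production pipeline might instead use a library routine such as scikit-image's equalization:

```python
import numpy as np

def invert_and_equalize(img, levels=256):
    """Invert a raw X-ray image, then apply global histogram equalization
    so intensities of inverted X-rays and DRRs become more comparable.
    Input is a float array; output values lie in [0, 1].
    """
    inv = img.max() - img                               # invert
    hist, bins = np.histogram(inv.ravel(), bins=levels)
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalized CDF
    # Map each pixel through the CDF of its intensity bin (equalization).
    idx = np.clip(np.digitize(inv.ravel(), bins[1:-1]), 0, levels - 1)
    return cdf[idx].reshape(inv.shape)

out = invert_and_equalize(np.linspace(0.0, 1.0, 64).reshape(8, 8))
```
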
4.2 Implementation and Training Details
We implement the proposed approach under the PyTorch (https://pytorch.org) framework with GPU acceleration. For the POINT network, each of the Siamese branches has five encoding blocks (BatchNorm, Conv, and LeakyReLU) followed by five decoding blocks (BatchNorm, Deconv, and ReLU), thus forming a symmetric structure, and we use skip connections to shuttle the lower-level features from an encoding block to its symmetric decoding counterpart (see details in the supplementary material). The triangulation layer is implemented according to Equation (11), with the backpropagation automatically supported by PyTorch. We train the proposed approach in a two-stage fashion. In the first stage, we train the POINT network of each view independently for 30 epochs. Then, we fine-tune POINT² for 20 epochs. We find this mechanism converges faster than training POINT² from scratch. For the optimization, we use mini-batch stochastic gradient descent with learning rate for the first stage and for the second. We set the loss weight as , which we empirically find works well during training. For the X-ray imaging model, we use mm and mm.
4.3 Ablation Study
This section discusses an ablation study of the proposed POINT network. As the network tracks POIs in 2D, we use mean projected distance (mPD) [24] to evaluate different models with specific design choices. The evaluation results are given in Table 1.
POI Selection. The first step of the proposed approach requires selecting a set of POIs to set up a point-to-point correspondence. In this experiment, we investigate different POI selection strategies. First, we investigate directly using landmarks as the POIs, since they usually have strong semantic meaning and can be annotated before the intervention. Second, we also investigate an automatic solution that uses Harris corners as the POIs to avoid the labor of annotation. Finally, we try random POI selection.
As shown in Figure 7(a), we find our approach is prone to overfitting when trained with landmark POIs. This is actually reasonable, as each CBCT volume only contains about a dozen landmarks, which in total is about POIs. Considering the variety of the fields of view in our dataset, this is far from enough and leads to overfitting. For the Harris corners, a few hundred POIs are selected from each CBCT volume, and we can see an improvement in performance, but the overfitting still exists (Figure 7(b)). We find the use of random POIs gives the best performance and generalizes well to unseen data (Figure 7(c)). This seemingly surprising observation is, in fact, reasonable, as it forces the model to learn a more general way to extract features at a fine-grained level, instead of memorizing feature points that may look different when projected from a different view.
POI Convolution. We also explore two design options for the POI convolution layer. First, it is worth knowing how much neighborhood information around the POI is necessary to extract a distinctive feature while the learning can still generalize easily. To this end, we try different sizes of the feature kernel for the POI convolution given in Equation (6). Rows 1-3 in Table 1 show the performance of the POINT network with different feature kernel sizes. We observe that a 1×1 kernel does not give features distinctive enough for a better similarity measure, and a 5×5 kernel seems to include too much neighborhood information (and use more computation), making it harder for the model to figure out a general representation. In general, a 3×3 kernel serves the feature similarity measure better. It should also be noted that a 1×1 kernel does not mean only the information at the current pixel location is used, since each element of $F_D$ or $F_X$ is supported by the receptive field of the U-Net, which readily provides rich neighborhood information. Second, we compare the performance of the POINT network with and without the weight $w$ in Equation (6). Rows 2 and 6 show that it is critical to have a weighted feature kernel convolution so that discriminative features can be highlighted in the similarity measure.
Shift-Invariant Tracking. The POINT network benefits from the shift-invariant property of the convolution operation, which makes it less sensitive to the in-plane offset of the DRRs. Figure 8 shows some tracking results from the POINT network. Here, the odd rows show the (a) X-ray and (b-d) DRR images. The heatmap below each DRR shows the tracking result between this DRR and the leftmost X-ray image. The red and blue marks on the X-ray and DRR images denote the POIs. The red and blue marks on the heatmaps are the ground truth POIs and the tracked POIs, respectively. The green blobs are the heatmap responses, and they are used to generate the tracked POIs (blue) according to Equation (7). The numbers under each DRR denote the mPD scores before and after the tracking. As we can observe, the tracking results are consistently good, no matter how much initial offset there is between the DRR and the X-ray image. This shows that our POINT network indeed benefits from the POI convolution layer and provides consistent outputs regardless of the in-plane offsets.


#  Kernel size  POI type  Weight  mPD (mm)
1  1×1          rand.     w/      8.46
2  3×3          rand.     w/      8.12
3  5×5          rand.     w/      9.49
4  3×3          Harris    w/      9.87
5  3×3          land.     w/      12.72
6  3×3          rand.     w/o     11.26

4.4 2D/3D Registration
We compare our method with one learning-based method (MDP [13]) and three optimization-based methods (Opt-GC [4], Opt-GO [4], and Opt-NGI [16]). To further evaluate the performance of the proposed method as an initial pose estimator, we also compare two approaches that use MDP or our method to initialize the optimization. We denote these two approaches as MDP+Opt and POINT²+Opt, respectively. Finally, we investigate the registration performance of our method when it only uses the POINT network without the triangulation layer, and denote the corresponding models as POINT and POINT+Opt. For MDP+Opt, POINT+Opt, and POINT²+Opt, we use the Opt-GC method during the optimization, as we find it converges faster when the initial pose is close to the global optimum.
Following the standard in 2D/3D registration [24], the performances of the proposed method and the baseline methods are evaluated with the mean target registration error (mTRE), i.e., the mean distance (in mm) between the patient landmarks and the aligned CT landmarks in 3D. The mTRE results are reported as the 50th, 75th, and 95th percentiles to demonstrate the robustness of the compared methods. In addition, we also report the gross failure rate (GFR) and the average registration time, where GFR is defined as the percentage of tested cases with a TRE greater than 10 mm [13].
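The evaluation protocol above can be sketched as follows; the function name is illustrative, and the 10 mm failure threshold follows [13]:

```python
import numpy as np

def mtre_summary(tre_mm, fail_thresh=10.0):
    """Given per-case target registration errors (mm), report the
    50th/75th/95th percentiles and the gross failure rate (fraction of
    cases whose TRE exceeds `fail_thresh` mm).
    """
    tre = np.asarray(tre_mm, dtype=float)
    pct = np.percentile(tre, [50, 75, 95])
    gfr = float(np.mean(tre > fail_thresh))
    return {"50th": pct[0], "75th": pct[1], "95th": pct[2], "GFR": gfr}

s = mtre_summary([1.0, 2.0, 3.0, 4.0, 12.0])
# One of five cases exceeds 10 mm, so GFR = 0.2; the median TRE is 3.0 mm.
```

Reporting percentiles rather than the mean keeps a handful of gross failures from dominating the accuracy summary, which is why the GFR is reported separately.
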
The evaluation results are given in Table 2. We find that the optimization-based methods generally require a good initialization for accurate registration; otherwise, they fail quickly. Opt-NGI is overall less sensitive to the initial location than Opt-GO and Opt-GC, with more than half of its registration results having less than 1 mm mTRE. Despite the high accuracy, it still suffers from a high failure rate and long registration time, and so do the Opt-GO and Opt-GC methods. On the other hand, MDP achieves a better GFR and registration time by learning a function that guides the iterative pose searching. This also demonstrates the benefit of using a learning-based approach to guide the registration. However, due to the problems mentioned in Section 1, it still has a relatively high GFR and a noticeable registration time. In contrast, our base model POINT already achieves performance comparable to MDP; however, it runs more than twice as fast. Further, by including the triangulation layer, POINT² performs significantly better than both POINT and MDP in terms of mTRE and GFR. This means that the triangulation layer, which brings 3D information into the training of the POINT network, is indeed useful.
In addition, we notice that when our method is combined with an optimization-based method (POINT² + Opt), the GFR is greatly reduced, which demonstrates that our method provides initial poses close enough to the global optimum that the optimization is unlikely to fall into local optima. The speed is also significantly improved due to faster convergence and less sampling over the pose space.



Method          mTRE 50th  75th   95th   GFR     Reg. time
Initial         20.4       24.4   29.7   92.9%   N/A
Opt-NGI [16]    0.62       25.2   57.8   40.0%   23.5s
Opt-GO [4]      6.53       23.8   44.7   45.1%   22.8s
Opt-GC [4]      7.40       25.7   56.5   47.7%   22.1s
MDP [13]        5.40       8.62   27.6   16.4%   1.74s
POINT           5.63       7.72   12.8   18.6%   0.75s
POINT²          4.22       5.70   9.84   4.9%    0.78s
MDP [13] + Opt  1.06       2.25   24.6   15.6%   3.21s
POINT + Opt     1.19       4.67   21.8   14.8%   2.16s
POINT² + Opt    0.55       0.96   5.67   2.7%    2.25s

5 Limitations
First, similar to other learning-based approaches, our method requires a considerably large dataset from the target medical domain to learn reliable feature representations. When the data is insufficient, the proposed method may fail. Second, although our method alone is quite robust and its accuracy is state-of-the-art when combined with an optimization-based approach, it is still desirable to come up with a more elegant solution that solves the problem directly. Finally, due to the use of triangulation, our method requires X-rays from at least two views to be available. Hence, for applications where only a single view is acceptable, our method will render an estimate of the registration parameters with inherent ambiguity.
6 Conclusion
We proposed a fast and robust method for 2D/3D registration. The proposed method avoids the often costly and unreliable iterative pose searching by directly aligning the CT with the patient through a novel POINT² framework, which first establishes the point-to-point correspondence between the pre- and intra-intervention data in both 2D and 3D, and then performs a shape alignment between the matched points to estimate the pose of the CT. We evaluated the proposed POINT² framework on a challenging and large-scale CBCT dataset and showed that 1) a robust POINT network should be trained with random POIs, 2) a good POI convolution layer should convolve with a weighted feature kernel, and 3) the POINT network is not sensitive to in-plane offsets. We also demonstrated that the proposed POINT² framework is significantly more robust and faster than the state-of-the-art learning-based approach. When used as an initial pose estimator, we also showed that the POINT² framework can greatly improve the speed and robustness of current optimization-based approaches while attaining a higher registration accuracy. Finally, we discussed several limitations of the POINT² framework, which we will address in future work.
Acknowledgement: This work was partially supported by NSF award #17228477, and the Morris K. Udall Center of Excellence in Parkinson’s Disease Research by NIH.
References

[1] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.
 [2] Bastian Bier, Mathias Unberath, Jan-Nico Zaech, Javad Fotouhi, Mehran Armand, Greg Osgood, Nassir Navab, and Andreas K. Maier. X-ray-transform invariant anatomical landmark detection for pelvic trauma surgery. In International Medical Image Computing and Computer Assisted Intervention, volume 11073 of Lecture Notes in Computer Science, pages 55–63. Springer, 2018.
 [3] International Electrotechnical Commission et al. Radiotherapy equipment: coordinates, movements and scales. IEC, 2008.
 [4] T De Silva, A Uneri, MD Ketcha, S Reaungamornrat, G Kleinszig, S Vogt, N Aygun, SF Lo, JP Wolinsky, and JH Siewerdsen. 3d–2d image registration for target localization in spine surgery: investigation of similarity metrics providing robustness to content mismatch. Physics in Medicine & Biology, 61(8):3009, 2016.
 [5] Joyoni Dey and Sandy Napel. Targeted 2d/3d registration using ray normalization and a hybrid optimizer. Medical physics, 33(12):4730–4738, 2006.
 [6] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
 [7] HS Jans, AM Syme, S Rathee, and BG Fallone. 3d inter-fractional patient position verification using 2d-3d registration of orthogonal images. Medical physics, 33(5):1420–1439, 2006.
 [8] Ali Khamene, Peter Bloch, Wolfgang Wein, Michelle Svatos, and Frank Sauer. Automatic registration of portal images and volumetric ct for patient positioning in radiation therapy. Medical Image Analysis, 10(1):96–112, 2006.
 [9] David LaRose, John Bayouth, and Takeo Kanade. Transgraph: Interactive intensity-based 2d/3d registration of x-ray and ct data. In Medical Imaging 2000: Image Processing, volume 3979, pages 385–397. International Society for Optics and Photonics, 2000.
 [10] David A LaRose. Iterative X-ray/CT registration using accelerated volume rendering. PhD thesis, Citeseer, 2001.

[11] Rui Liao, Shun Miao, Pierre de Tournemire, Sasa Grbic, Ali Kamen, Tommaso Mansi, and Dorin Comaniciu. An artificial agent for robust image registration. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4168–4175, 2017.
 [12] Primoz Markelj, Dejan Tomaževič, Bostjan Likar, and Franjo Pernuš. A review of 3d/2d registration methods for image-guided interventions. Medical image analysis, 16(3):642–661, 2012.
 [13] Shun Miao, Sebastien Piat, Peter Walter Fischer, Ahmet Tuysuzoglu, Philip Walter Mewes, Tommaso Mansi, and Rui Liao. Dilated FCN for multi-agent 2d/3d medical image registration. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 4694–4701, 2018.
 [14] Shun Miao, Z Jane Wang, and Rui Liao. A cnn regression approach for real-time 2d/3d registration. IEEE transactions on medical imaging, 35(5):1352–1363, 2016.
 [15] Yoshito Otake, Mehran Armand, Robert S Armiger, Michael D Kutzer, Ehsan Basafa, Peter Kazanzides, and Russell H Taylor. Intraoperative image-based multiview 2d/3d registration for image-guided orthopaedic surgery: incorporation of fiducial-based C-arm tracking and GPU acceleration. IEEE transactions on medical imaging, 31(4):948–962, 2012.
 [16] Yoshito Otake, Adam S Wang, J Webster Stayman, Ali Uneri, Gerhard Kleinszig, Sebastian Vogt, A Jay Khanna, Ziya L Gokaslan, and Jeffrey H Siewerdsen. Robust 3d–2d image registration: application to spine interventions and vertebral labeling in the presence of anatomical deformation. Physics in Medicine & Biology, 58(23):8535, 2013.
 [17] Franjo Pernus et al. 3d-2d registration of cerebral angiograms: a method and evaluation on clinical images. IEEE transactions on medical imaging, 32(8):1550–1563, 2013.
 [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
 [19] Daniel B Russakoff, Torsten Rohlfing, Kensaku Mori, Daniel Rueckert, Anthony Ho, John R Adler, and Calvin R Maurer. Fast generation of digitally reconstructed radiographs using attenuation fields with application to 2d-3d image registration. IEEE transactions on medical imaging, 24(11):1441–1454, 2005.
 [20] George AF Seber. Multivariate observations, volume 252. John Wiley & Sons, 2009.
 [21] George W Sherouse, Kevin Novins, and Edward L Chaney. Computation of digitally reconstructed radiographs for use in radiotherapy treatment design. International Journal of Radiation Oncology · Biology · Physics, 18(3):651–658, 1990.

[22] Daniel Toth, Shun Miao, Tanja Kurzendorfer, Christopher A Rinaldi, Rui Liao, Tommaso Mansi, Kawal Rhode, and Peter Mountney. 3d/2d model-to-image registration by imitation learning for cardiac procedures. International journal of computer assisted radiology and surgery, pages 1–9, 2018.
 [23] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5000–5008. IEEE, 2017.
 [24] Everine B Van de Kraats, Graeme P Penney, Dejan Tomazevic, Theo Van Walsum, and Wiro J Niessen. Standardized evaluation methodology for 2d-3d registration. IEEE transactions on medical imaging, 24(9):1177–1189, 2005.
 [25] Jian Wang, Roman Schaffert, Anja Borsdorf, Benno Heigl, Xiaolin Huang, Joachim Hornegger, and Andreas Maier. Dynamic 2d/3d rigid registration framework using point-to-plane correspondence model. IEEE transactions on medical imaging, 36(9):1939–1954, 2017.
 [26] BM You, Pepe Siy, William Anderst, and Scott Tashman. In vivo measurement of 3d skeletal kinematics from sequences of biplane radiographs: application to knee kinematics. IEEE transactions on medical imaging, 20(6):514–525, 2001.
 [27] L Zollei, Eric Grimson, Alexander Norbash, and W Wells. 2d-3d rigid registration of x-ray fluoroscopy and ct images using mutual information and sparsely sampled histogram estimators. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.