I. Introduction
I-A. Background
Image registration is one of the most fundamental tools in biomedical image processing, with applications that range from image-based navigation in imaging and image-guided interventions to longitudinal and group analyses [1, 2, 3, 4, 5, 6, 7]. Registration can be performed between images of the same modality or across modalities, and within a subject or across subjects, with diverse goals such as motion correction, pose estimation, spatial normalization, and atlas-based segmentation. Image registration is defined as an optimization problem to find a global transformation or a deformation model that maps a source (or moving) image to a reference (or fixed) image. The complexity of the transformation is defined by its degrees of freedom (DOF), i.e., the number of its parameters. The most widely used transformations in biomedical image registration range from rigid and affine to high-dimensional small or large deformations based on biophysical/biomechanical, elastic, or viscous fluid models [6].
Given a transformation model and images, iterative numerical optimization methods are used to maximize intensity-based similarity metrics or to minimize point cloud or local feature distances between images; however, the cost functions associated with these metrics are often non-convex, limiting the capture range of these registration methods. Techniques such as center-of-gravity matching, principal axes and moments matching, grid search, and multiscale registration are used to initialize the transformation parameters so that iterative optimization starts from the vicinity of the global optimum. These techniques, however, are not always successful, especially if the range of possible rotations is wide and shapes have complex features. Grid search and multiscale registration may find global optima but are computationally expensive and may not be useful in time-sensitive applications such as image-based navigation.
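As an illustration of the center-of-gravity initialization mentioned above, a minimal sketch (the function name and the voxel-spacing handling are ours, not from any particular registration package):

```python
import numpy as np
from scipy import ndimage

def center_of_gravity_translation(moving, fixed, spacing):
    """Translation (in mm) that maps the intensity centroid of the
    moving image onto that of the fixed image; a common initializer
    for iterative rigid registration."""
    com_moving = np.asarray(ndimage.center_of_mass(moving)) * spacing
    com_fixed = np.asarray(ndimage.center_of_mass(fixed)) * spacing
    return com_fixed - com_moving
```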
There has been an increased interest in using deep learning in medical image processing, motivated by promising results that have been achieved in semantic segmentation in computer vision [8] and medical imaging [9, 10]. The use of learning-based techniques in image registration, however, has been limited. Some registration tasks, for example image-to-template, atlas, or standard-space registration, are amenable to learning and may provide significant improvement over strategies such as iterative optimization or grid search when the range of plausible positions/orientations is wide, demanding a large capture range. Under these conditions, a human observer can find the approximate pose of 3D objects quickly and bring them into rough alignment without solving an iterative optimization. This is performed through feature identification.

I-B. Related Work
Deep feature representations have recently been used to learn metrics that guide local deformations for multimodal inter-subject registration [11, 12]. These works have shown that deep-learned metrics provide slight improvements over the local image intensity and patch features that are currently used in deformable image registration. Initialized by rigid and affine alignments, the goal there was merely to improve local deformations and not the global alignment. In another recent work on deformable registration, Yang et al. [13] developed a deep encoder-decoder convolutional neural network (CNN) that learned to predict the Large Deformation Diffeomorphic Metric Mapping (LDDMM) model, and achieved state-of-the-art performance with an order of magnitude faster optimization in inter-subject and subject-to-atlas deformable registration.
For 3D global rigid registration, which is the subject of this study, Liao et al. [14] proposed a reinforcement learning algorithm for a CNN with 3 fully connected layers. They used a greedy supervised learning strategy with an attention-driven hierarchical method to simultaneously encode a matching metric and learn a registration strategy, and showed improved accuracy and robustness compared to state-of-the-art registration methods in computed tomography (CT). This algorithm is, however, relatively slow and lacks a systematic stopping criterion at test time.
In an effort to speed up slice-to-volume (X-ray to CT) rigid registration and improve its capture range, Miao et al. [15, 16] proposed a real-time registration algorithm using CNN regressors. In this method, called pose estimation via hierarchical learning, they partitioned the 6-dimensional parameter space into three zones to learn the regression function hierarchically, based on in-plane rotations, out-of-plane rotations, and out-of-plane translations. CNN regressors were trained separately in each zone, where local image residual features were used as input and the Euclidean distance between the transformation parameters was used as the loss function. In experiments with relatively small rotations (perturbations with small standard deviations in each of the rotation parameters), they reported improved registrations achieved in 100 ms (20-45 times faster than the best intensity-based slice-to-volume registration in that application).

The slice-to-volume (X-ray to CT) image registration problem shares similarities with 3D pose estimation in computer vision, where the term 3D pose estimation refers to finding the underlying 3D transformation between an object and the camera from 2D images. State-of-the-art methods for CNN-based 3D pose estimation can be classified into two groups: 1) models that are trained to predict keypoints and then use object models to find the orientation [17, 18]; and 2) models that predict the pose of the object directly from images [19, 20]. Pose estimation in computer vision has largely been treated as a classification problem, where the pose space is discretized into bins and the pose is predicted to belong to one of the bins [19, 20]. Mahendran et al. [21] have recently modeled 3D camera/object pose estimation as a regression problem. They proposed deep CNN regression to find rotation matrices and a new loss function based on geodesic distance for training.

I-C. Contributions
Similar to [15, 16, 21], we propose deep CNN regression models for 3D pose estimation; but unlike those works, which focused on estimating pose from 2D-projected image representations of objects (and thus limited rotations), we aimed to find the 3D pose of arbitrarily-oriented objects based on their volumetric or sectional (slice) image representations. In this paper, we use the term 3D pose mainly to refer to 3D orientation, and use registration for the estimation of both rotation and translation parameters in 3D. Our goal was to speed up and improve the capture range of volume-to-volume and slice-to-volume registrations. To achieve this, we formulated a regression problem for 3D pose estimation based on the angle-axis representation of 3D rotations, which form the Special Orthogonal Group SO(3), and used the bi-invariant geodesic distance, which is a natural Riemannian metric on SO(3) [22], as the loss function. We augmented our proposed deep regression network with a correction network to estimate translation parameters, and ultimately used it to initialize optimization-based registration to achieve robust and accurate registration over the widest plausible range of 3D rotations. We do not suggest a general method of registration for arbitrary pairs of images; rather, 3D pose estimation finds the orientation of a shape or anatomy with respect to a canonical space or template. Inter-subject registration can then be achieved by composing transformations from the estimated 3D poses of the images of individual subjects.
We applied our proposed method to rigidly register reconstructed fetal brain MRI images [23] to a standard (atlas) space. Fetal brains can be in any arbitrary orientation with respect to the MRI scanner coordinate system, as one cannot predefine the position of a fetus when a pregnant woman is positioned on the MRI scanner table. Moreover, fetuses frequently move and can rotate within scan sessions. Our deep model, trained on reconstructed T2-weighted images of 28-37 week gestational age (GA) fetuses from the training set, was able to find the 3D pose of fetuses in the test set in real time (on the order of milliseconds) in the majority of cases where optimization-based methods failed by falling into local minima. We then examined the generalization properties of the learned model on test images of much younger fetuses (21-27 weeks GA), as well as T2- and T1-weighted images of newborns, all of which exhibited significantly different size, shape, and features.
Based on our formulation, we also trained models for slice-to-volume registration, an application that exhibits significant technical challenges in medical imaging, as recently reviewed in [7]. Prior work on slice-to-volume registration in fetal MRI has shown a strong need for regularization and initialization of slice transformations through hierarchical registration [23, 24] or state-space motion modeling [25]. Learning-based methods have recently been used to improve the prediction of slice locations in fetal MRI [26, 27] and fetal ultrasound [28]. In [26, 27], an anchor-point slice parametrization was used along with a Euclidean loss function based on [29] to predict slice positions and reconstruct fetal MRI in a canonical space. The alignment of fetal ultrasound slices in [28] was formulated as z-position estimation and 3-class slice plane classification (mid-axial, target eye, and eye planes); a CNN was trained with a negative log-likelihood loss for simultaneous prediction of slice location and brain segmentation.
For slice-to-volume registration we used the full 3D rotation representation to train our CNN regression model. Our results are also promising in this application, as they show that the initial pose of the fetal head can be estimated in real time from slice acquisitions, which is particularly helpful if good-quality slices are only sparsely acquired due to fetal motion. Real-time pose estimation and registration have broader potential applications such as guided and automated ultrasonography [28], automated fetal MRI [26, 30], and motion-robust MRI [31, 32, 33, 34]. For example, Hou et al. [27] used slice-to-volume registration for fetal brain MRI reconstruction. Real-time slice-to-volume registration may also be used for real-time motion tracking in MRI of moving subjects for data re-acquisition or prospective navigation. Our formulation is generic and may be used in other applications. The remainder of this paper presents the details of the methods in Section II, followed by results in Section III, a discussion in Section IV, and the conclusion.
II. Methods
In this section we present a 3D rotation representation that underlies our CNN regression models for 3D pose estimation. We show how a nonlinear activation function can mimic the exact rotation representation. We then present our network architectures and propose a two-step training algorithm with appropriate loss functions.
II-A. 3D Rotation Representation
A 3D rotation is commonly represented by a matrix $R$ with 9 elements that are subject to six norm and orthogonality constraints ($R$ is orthogonal, $R^TR = I$, and $\det(R) = 1$). The set of 3D rotations forms the Special Orthogonal Group SO(3), which is a 3-dimensional manifold embedded in $\mathbb{R}^{3\times3}$ (and thus has 3 DOFs). SO(3) is a compact Lie group whose Lie algebra consists of the skew-symmetric matrices. Its 3 DOFs can be represented as 3 consecutive rotations about the principal axes of the coordinate frame.
Based on Euler's theorem, each rotation matrix $R$ can be described by an axis of rotation and an angle around it (known as the angle-axis representation). A 3-dimensional rotation vector $v$ is a compact representation of the rotation matrix, such that the rotation axis $\hat{v}$ is its unit vector and the angle $\theta$ (in radians) is its magnitude. The axis is oriented so that the rotation by the angle is counter-clockwise around it. As a consequence, the rotation angle is always non-negative and at most $\pi$; i.e., $\theta \in [0, \pi]$. For a 3-dimensional vector $v$, by defining $\hat{v} = v/\|v\|$ as the axis of rotation and $\theta = \|v\|$ as the angle of rotation (in radians), the rotation matrix $R$ is calculated as:
$$R = e^{\theta\,[\hat{v}]_\times} \qquad (1)$$
where $[\cdot]_\times$ is the skew-symmetric operator:

$$[\hat{v}]_\times = \begin{bmatrix} 0 & -\hat{v}_3 & \hat{v}_2 \\ \hat{v}_3 & 0 & -\hat{v}_1 \\ -\hat{v}_2 & \hat{v}_1 & 0 \end{bmatrix} \qquad (2)$$
Using Rodrigues' rotation formula, (1) can be simplified to:

$$R = I + \sin\theta\,[\hat{v}]_\times + (1-\cos\theta)\,[\hat{v}]_\times^2 \qquad (3)$$
and

$$\theta = \cos^{-1}\left(\frac{\mathrm{tr}(R) - 1}{2}\right) \qquad (4)$$
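Equations (1)-(4) translate directly into a few lines of code. A minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def skew(v_hat):
    """Skew-symmetric operator [v]_x of Eq. (2)."""
    x, y, z = v_hat
    return np.array([[0, -z,  y],
                     [z,  0, -x],
                     [-y, x,  0]])

def rotvec_to_matrix(v):
    """Rodrigues' formula, Eq. (3): rotation vector -> rotation matrix."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)          # zero rotation
    k = skew(v / theta)
    return np.eye(3) + np.sin(theta) * k + (1 - np.cos(theta)) * (k @ k)

def matrix_to_angle(r):
    """Eq. (4): rotation angle recovered from the matrix trace."""
    return np.arccos(np.clip((np.trace(r) - 1) / 2, -1.0, 1.0))
```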
As a result, to find any arbitrary rotation in 3D space it is sufficient to find the rotation vector corresponding to that orientation. In the next section, the proposed networks that can find this rotation vector are introduced.
Figure 1(a) shows the general parts of the regression networks used in this study. Each network contains 3 parts: input, feature extraction, and output. We used three networks with slightly different configurations of these parts; next we discuss the architecture of each network in detail.
II-B. Slice 3D Pose Estimation Network Architecture
For slice 3D pose estimation we used an 18-layer residual CNN [35] for feature extraction, and two regression heads (the top of the network is referred to as the head of the network): one for regression over the 3 rotation parameters and the other for the slice location in the atlas space. While different choices existed for the feature extraction component, we tried different architectures based on suggestions from the pose estimation literature, including [27]: we examined VGG16 [36], ResNet-18, and DenseNet, and observed the best performance with ResNet-18. The network architecture is shown in Figure 1(b). For the rotation head, the last fully connected layer has size three, corresponding to the elements of the rotation vector $v$. The last nonlinear function on top of the fully connected layer is a scaled hyperbolic tangent, which limits the output of each element to the range $(-\pi, \pi)$ and enforces the constraint on each element of the rotation vector independently. The physical location estimator head outputs a scalar, as this network tries to estimate the physical location of the slice (in millimeters) along with its orientation. A ReLU nonlinearity is applied on top of this head, as the value of the slice location is non-negative.
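To make the two-head design concrete, here is a minimal PyTorch sketch, with torchvision's standard ResNet-18 standing in for the 18-layer residual feature extractor; the scaled-tanh bound on the rotation head and the ReLU on the location head follow the description above, while the single-channel input adaptation and all other details are illustrative:

```python
import math
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SlicePoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Adapt the first convolution to single-channel MRI slices.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()              # keep the 512-dim features
        self.features = backbone
        self.rotation_head = nn.Linear(512, 3)   # rotation vector v
        self.location_head = nn.Linear(512, 1)   # slice position (mm)

    def forward(self, x):
        f = self.features(x)
        # pi * tanh bounds each element of v in (-pi, pi).
        v = math.pi * torch.tanh(self.rotation_head(f))
        z = torch.relu(self.location_head(f))    # location is non-negative
        return v, z
```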
II-C. Volume 3D Pose Estimation Network Architecture
The 3D feature extraction part of our 3D pose estimation network for volume-to-volume registration is shown in Figure 1(c), where block arrows show the functions defined on the right-hand side of the figure. All convolutional kernels have the same size. In the first layer, eight convolutional kernels are applied to the 3D input image, followed by the ReLU nonlinearity and batch normalization (C0 and C1). The tensors are downsampled by a factor of 2 using 3D max-pooling (C2) before the second and third convolutional layers (C3 and C4). For C3 and C4, the ReLU nonlinearity and batch normalization are used after applying 32 convolutional kernels. In the last two convolutional layers (C6 and C7), 64 kernels are used, followed by ReLU and batch normalization. Following C7, 3 fully connected layers with sizes of 512, 512, and 256 are used with the ReLU activation function and batch normalization (C8). The feature extraction part thus provides 256 features that are fed into the regression head. The overall architecture of the pose network is shown in Figure 1(d). This network estimates orientation only, and has the same rotation regression head as the slice pose estimation network.
II-D. Volume-to-Volume Correction Network Architecture

The correction network aims at simultaneously estimating translations and rotations. Note that we assume initial translations between stacks of slices (or volumes) and the template (or reference) are calculated by center-of-gravity matching, and initial rotations are estimated by the volume 3D pose estimation network, so the correction network aims to register a roughly-aligned source image to the reference template. The architecture of this network is shown in Figure 1(e). Its 3D feature extraction part is the same as that of the volume 3D pose estimation network. In this architecture, a 3D reference image (an atlas or template image) and a roughly-oriented 3D moving image are fed together as a 2-channel input, as we aim to estimate both rotation and translation parameters. The regression part of this network contains two heads: a rotation head as already described, and a translation head that outputs a vector of 3 parameters to translate the moving image onto the target image.
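A minimal PyTorch sketch of this correction network: the channel counts (8, 32, 64), the max-pooling, and the fully connected sizes (512, 512, 256) loosely follow the description of Figure 1(c) above, but the adaptive average pooling before the fully connected layers and the exact layer placement are our simplifications:

```python
import math
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """2-channel input: [reference/atlas, roughly-aligned moving image]."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm3d(cout))
        self.features = nn.Sequential(
            block(2, 8), block(8, 8),
            nn.MaxPool3d(2),
            block(8, 32), block(32, 32),
            nn.MaxPool3d(2),
            block(32, 64), block(64, 64),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, 512), nn.ReLU(inplace=True), nn.BatchNorm1d(512),
            nn.Linear(512, 512), nn.ReLU(inplace=True), nn.BatchNorm1d(512),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.BatchNorm1d(256))
        self.rotation_head = nn.Linear(256, 3)
        self.translation_head = nn.Linear(256, 3)

    def forward(self, x):
        f = self.features(x)
        v = math.pi * torch.tanh(self.rotation_head(f))  # rotation vector
        t = self.translation_head(f)                     # translation (mm)
        return v, t
```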
II-E. Training the Networks
In this section we describe the training procedures for the networks. The loss function is designed as:
$$L = L_r + \lambda L_t \qquad (5)$$
where $\lambda$ is a hyperparameter that balances the rotation loss $L_r$ (which is bounded between 0 and $\pi$) and the translation loss $L_t$. The translation loss is the mean-squared error (MSE) between the predicted and ground truth translation vectors. For the first stage of training, we use the MSE loss also for the rotation parameters, and then switch to the geodesic loss in the second stage. The MSE loss is defined as
$$L_{MSE} = \frac{1}{n}\sum_{i=1}^{n} \|\hat{y}_i - y_i\|_2^2 \qquad (6)$$
where $\hat{y}$ and $y$ are the output of the rotation head and the ground truth rotation vector, respectively. MSE, as a convex loss function, can help narrow down the search space for pose prediction learning and is thus appropriate for early training; however, it does not accurately represent a distance between two rotations. The distance between two 3D rotations is geometrically interpreted as the geodesic distance between two points on a sphere: the geodesic distance (or shortest path) is the radian angle between the two viewpoints, and is computed through the matrix logarithm. Let $R_1$ and $R_2$ be the estimated and the ground truth rotation matrices, respectively. The distance between these rotation matrices is defined as:
$$d(R_1, R_2) = \|\log(R_1 R_2^T)\| \qquad (7)$$
Equation (7) gives the amount of rotation, in radians, around a specific axis that needs to be applied to rotation matrix $R_2$ to reach rotation matrix $R_1$, and is calculated as:
$$d(R_1, R_2) = \frac{\|\log(R_1 R_2^T)\|_F}{\sqrt{2}} \qquad (8)$$
where $\|\cdot\|_F$ is the Frobenius norm and $\log(R)$ is the matrix logarithm of a rotation matrix, which can be written as:
$$\log(R) = \frac{\theta}{2\sin\theta}\,(R - R^T) \qquad (9)$$
To show that (8) is indeed the distance between the rotation matrices, consider that a rotation matrix is orthogonal ($R^{-1} = R^T$), so the rotation from $R_2$ to $R_1$ is $R = R_1 R_2^T$. Considering (9), and the fact that $R$ can be written using (3) with $\hat{v}$ and $\theta$ the axis and angle of the 3-dimensional rotation vector representation of $R$, and knowing that the Frobenius norm of the skew-symmetric matrix of a unit vector is $\sqrt{2}$, one can show that (8) is equal to $\theta$.
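Making that step explicit (a short derivation, in the notation above):

```latex
\log R \overset{(9)}{=} \theta\,[\hat{v}]_\times,
\qquad
\big\|[\hat{v}]_\times\big\|_F^2 = 2\left(\hat{v}_1^2+\hat{v}_2^2+\hat{v}_3^2\right) = 2,
\qquad
\Rightarrow\quad \frac{\|\log R\|_F}{\sqrt{2}}
  = \frac{\theta\sqrt{2}}{\sqrt{2}} = \theta .
```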
On the other hand, since the relative rotation between $R_1$ and $R_2$ is the rotation matrix $R = R_1 R_2^T$, by (4) this distance is equal to $\cos^{-1}\left(\frac{\mathrm{tr}(R_1 R_2^T) - 1}{2}\right)$. Therefore, the geodesic loss, defined as the distance between the two rotation matrices, can be written as:

$$L_G = \cos^{-1}\left(\frac{\mathrm{tr}(R_1 R_2^T) - 1}{2}\right) \qquad (10)$$
This is a natural Riemannian metric on the compact Lie group SO(3). Equations (8) and (10) are equivalent, so we calculate the geodesic loss using (10), as it is easier to implement. To use (10), we compute the rotation matrices from the rotation vectors as described in Section II-A. In summary, training the networks involves iterations of backpropagation with the total loss function in (5), where the translation loss is the MSE and the rotation loss is calculated by (6) in the first stage and by (10) in the second stage. This schedule was chosen because of the convexity advantage of the MSE loss and the accuracy of the geodesic loss.
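A PyTorch sketch of the geodesic loss (10) and the combined loss (5); the clamping epsilon, which keeps acos differentiable at the boundary, and the λ value are illustrative choices of ours:

```python
import torch

def geodesic_loss(r_pred, r_true, eps=1e-7):
    """Eq. (10): angle of the relative rotation R1 R2^T, in radians.
    r_pred, r_true: batches of 3x3 rotation matrices, shape (B, 3, 3)."""
    rel = r_pred @ r_true.transpose(1, 2)
    trace = rel.diagonal(dim1=1, dim2=2).sum(-1)
    cos = ((trace - 1) / 2).clamp(-1 + eps, 1 - eps)
    return torch.acos(cos).mean()

def total_loss(r_pred, r_true, t_pred, t_true, lam=1.0):
    """Eq. (5): rotation loss plus weighted translation MSE."""
    return geodesic_loss(r_pred, r_true) + \
        lam * torch.nn.functional.mse_loss(t_pred, t_true)
```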
In our experiments each stage involved ten epochs. The details of the data and experiments are discussed next.
III. Experiments
III-A. Datasets
The datasets used in this study contained 93 reconstructed T2-weighted MRI scans of fetuses, as well as T1- and T2-weighted MRI scans of 40 newborns. The newborn data was obtained from the first data release of the Developing Human Connectome Project [37]. The fetal MRI data was obtained from fetuses scanned at Boston Children's Hospital at a gestational age between 21 and 37 weeks (mean = 30.1, stdev = 4.6) on 3-Tesla Siemens Skyra scanners with 18-channel body matrix and spine coils. Written informed consent was obtained from all pregnant women who participated in the research. Repeated multiplanar T2-weighted single-shot fast spin echo scans were acquired of the moving fetuses, and ellipsoidal brain masks were automatically extracted based on the real-time algorithm in [38]. The scans were then combined through slice-level motion correction and robust super-resolution volume reconstruction [23, 24]. Brain masks were generated on the reconstructed images using Auto-Net [39] and manually corrected in ITK-SNAP [40] as needed.

Brain-extracted reconstructed images were then registered to a spatiotemporal fetal brain MRI atlas [41] at isotropic resolution. This registration was performed through the procedure described in [41] and is briefly described here, as it generated the set of fetal brain scans (all registered to the standard atlas space) used to generate the ground truth data. First, a rigid transform was found between the fetal head coordinates and the MRI scanner (world) coordinates by inverting the direction cosine matrix of one of the original fetal MRI scans that appeared in an orthogonal plane with respect to the fetal head (the idea being that the MR technologist who prescribed the scan planes identified and used the fetal head coordinates rather than the world coordinates). Applying this transform to the image reconstructed in the world coordinates mapped it to the fetal coordinates; the oblique reconstructed image thus appeared orthogonal with respect to the fetal head after this mapping, which in turn enabled a grid search over all orthogonal 3D rotations that could map this image to the corresponding age of the spatiotemporal atlas (fetal coordinates to atlas space). Multiscale rigid registration was performed afterwards to fine-tune the alignment.
It should be noted that, due to differences between the anatomy of subjects at different ages and the templates, the final alignments have an intrinsic level of uncertainty, as an exact rigid alignment of two different anatomies is not well defined; but since our goals are improved capture range and speed, our analysis is not sensitive to this uncertainty in the alignment of the reference data. All images were manually inspected to ensure visually correct alignment to the atlas space.
III-A1. Training Dataset
From the total database of 93 fetal MRI scans, reconstructed T2-weighted images of 36 fetuses scanned at 28 to 37 weeks GA were used to train one network. Each image was randomly 3D rotated and translated and fed to the network. Since the rotation matrix was known, the corresponding rotation vector was computed and used as the ground truth. Two different algorithms were used to randomly generate rotation matrices.
For slice pose estimation training, each input image was randomly rotated around the axes within a range that covered half of all possible orientations and provided all different views in the training set. Therefore, for training the network, separating the different views (i.e., axial, coronal, and sagittal) was unnecessary. The reason we did not span the whole rotation space in this experiment is that 2D brain slices do not carry enough information to discriminate between rotations that are $\pi$ radians apart around arbitrary rotation vectors: predicting the 3D direction of the brain from a 2D slice is difficult due to the near-symmetric shape of the brain. To choose input slices, we randomly selected 30 slices from the middle 66 percent of the slices, skipping the border slices that did not carry sufficient information for training.
For volume pose estimation training we used the algorithm proposed in [42, p. 355] to uniformly span the whole rotation space. This algorithm maps three random variables in the unit interval onto the set of orthogonal matrices with positive determinant, that is, the set of all 3D rotations, and generates uniformly distributed rotation samples.
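A sketch of such a sampler; this follows the fast random-rotation construction commonly attributed to Arvo (a random rotation about z composed with a Householder reflection), with variable names of our choosing:

```python
import numpy as np

def random_rotation(rng=np.random):
    """Uniformly distributed random rotation matrix."""
    x1, x2, x3 = rng.uniform(size=3)
    theta = 2 * np.pi * x1                      # rotation about the z axis
    rz = np.array([[np.cos(theta), np.sin(theta), 0],
                   [-np.sin(theta), np.cos(theta), 0],
                   [0, 0, 1]])
    phi = 2 * np.pi * x2                        # direction of the reflection pole
    v = np.array([np.cos(phi) * np.sqrt(x3),
                  np.sin(phi) * np.sqrt(x3),
                  np.sqrt(1 - x3)])
    h = np.eye(3) - 2 * np.outer(v, v)          # Householder reflection
    return -h @ rz                              # negation makes det = +1
```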
For the volume-to-volume training of the correction network (referred to as CorrectionNet), each moving image was randomly rotated around the axes and randomly translated in each direction, both within limited ranges. The transformed image was then concatenated with its corresponding atlas image to form a 2-channel input to the network. The range of transform parameter variations was lower for this network because its objective was to correct initial predictions made by the other networks: initial translations between stacks of slices (or volumes) and the template were estimated using center-of-gravity matching, and rotations were estimated by the pose estimation networks, so CorrectionNet was trained and tested on a limited range of transformation parameters. However, to evaluate the capacity of this network to learn to predict the entire parameter space for comparison purposes, we also trained it on the wider range of rotations used for the 3D pose estimation network. We refer to this model as 3DRegNet in the results, as opposed to CorrectionNet, which was trained only to correct initial transformations.
Translation and rotation of each image were applied as one combined transformation, and the resampling was done on-the-fly during training. Linear interpolation was used for resampling the images, for faster training. Scaling with random scale factors in the range of 0.95 to 1.05 was also used for data augmentation. A large number of training samples was generated in this way for slice pose estimation and for volume-to-volume registration, and the number of epochs for each training stage was set to ten.
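A sketch of how one such augmentation can be realized with a single resampling; the translation is given in voxels for simplicity, and the inverse-mapping convention of scipy's affine_transform is handled inside:

```python
import numpy as np
from scipy import ndimage

def rigid_augment(volume, rotation, translation_vox, rng=np.random):
    """Apply one combined rotation + scale + translation about the
    volume center, resampled once with linear interpolation (order=1)."""
    scale = rng.uniform(0.95, 1.05)             # random scale augmentation
    center = (np.array(volume.shape) - 1) / 2.0
    # affine_transform maps output coordinates to input coordinates,
    # so the matrix is the inverse of the spatial transform.
    matrix = rotation.T / scale
    offset = center - matrix @ (center + np.asarray(translation_vox))
    return ndimage.affine_transform(volume, matrix, offset=offset, order=1)
```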
III-A2. Testing Datasets

To test the performance and generalization properties of the trained models, three test sets were used: Test Set 1) reconstructed T2-weighted images of 40 fetuses with GA between 27 and 37 weeks that were not used in training, as well as original T2-weighted slices of those scans; Test Set 2) reconstructed T2-weighted images of 17 fetuses with GA between 22 and 26 weeks, as well as T2-weighted MRI scans of 7 newborns scanned at 38 to 44 weeks GA-equivalent age (selected from the total of 40 cases to span the age range); and Test Set 3) T1-weighted MRI scans of those newborns. There was no overlap between the test sets and the training set described in the previous subsection.
On each 3D image, 10 randomly generated rotation matrices were applied, resulting in 400, 170, and 70 samples for the three sets, respectively. For each application, rotation matrices were generated through the same process used for the training data, as discussed in Section III-A1. Figure 2 shows the histogram of the synthetic rotations for the slice and volume pose estimation experiments; the horizontal axis shows the distance of each generated rotation matrix from the identity matrix, in degrees.
III-B. Intensity-Based Registration
To compare the pose predictions made by our pose estimation CNNs, referred to as 3DPoseNet, with conventional intensity-based registration methods, we developed multiple variants of rigid registration for volume-to-volume registration (VVR) and slice-to-volume registration (SVR) between images and age-matched templates. For the VVR comparisons, we developed the following programs: (i) VVR-GC: a multiscale approach with 3 levels of transform refinement; the transform was initialized using a gravity center (GC) matching strategy, and a gradient-descent optimizer was used to maximize the normalized mutual information between the source and reference images. (ii) VVR-PAA: the same as VVR-GC except that the transform was initialized using moments matching and principal axis alignment. (iii) VVR-Deep: the same as VVR-GC except that the transform was initialized using 3DPoseNet-predicted transforms, without employing any other initialization strategy. For the SVR comparisons, we developed two versions of the program: (i) SVR-GC: a multiscale approach with 3 levels of transform refinement, initialized with center-of-gravity matching, with gradient-descent maximization of normalized cross-correlation; and (ii) SVR-Deep: the same as SVR-GC except that the transform was initialized using 3DPoseNet-predicted transforms. The learning rate of the optimization was set lower in both the VVR and SVR programs when they were initialized with 3DPoseNet predictions.
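For illustration, a SimpleITK sketch in the spirit of VVR-GC; Mattes mutual information stands in for the normalized mutual information named above, and all parameter values are illustrative rather than those used in our experiments:

```python
import SimpleITK as sitk

def vvr_gc(fixed, moving):
    """Multiscale rigid registration initialized by intensity
    center-of-gravity (moments) matching. Images are assumed to be
    cast to sitk.sitkFloat32 beforehand."""
    init = sitk.CenteredTransformInitializer(
        fixed, moving, sitk.Euler3DTransform(),
        sitk.CenteredTransformInitializerFilter.MOMENTS)
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsGradientDescent(learningRate=1.0,
                                      numberOfIterations=100)
    reg.SetOptimizerScalesFromPhysicalShift()
    reg.SetShrinkFactorsPerLevel([4, 2, 1])     # 3 levels of refinement
    reg.SetSmoothingSigmasPerLevel([2, 1, 0])
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(init, inPlace=False)
    return reg.Execute(fixed, moving)
```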
III-C. Results
We evaluated pose predictions obtained from the proposed methods in different scenarios.
III-C1. Slice 3D Pose Estimation
As described in Section II, optimization-based SVR methods and the trained CNN were used for slice 3D pose estimation. To investigate the influence of the geodesic loss compared to the MSE, after the model was trained with the MSE loss it was fine-tuned for 10 more epochs, once with the MSE loss and once with the geodesic loss. In visualizing and comparing the results, test samples were distributed over 6 bins according to their magnitude of rotation, such that the number of samples in each bin was roughly equal. With this comparison we aimed to evaluate the performance of the methods in terms of their capture range. It can be seen in Figure 3 and Table I that: 1) the geodesic loss improved the results, significantly so in two of the bins; 2) the optimization-based method without deep CNN initialization failed in most cases; and 3) the optimization-based method with deep initialization performed the best.
Figure 4(a) shows the estimated physical locations of slices compared to their actual locations, with error lines (in mm). The estimation error for the majority of slices was below 5 mm, while the error was higher for some slices, especially those closer to the boundary of the brain. Figure 4(b) shows snapshots of four slices of one of the test cases, illustrating the limited features available to the algorithm and the similarity of slices, especially those closer to the boundaries. Learning to estimate slice locations for subjects with different anatomies scanned at different ages is challenging, but can be augmented with slice-level motion tracking algorithms, such as [25], when motion is fast and continuous.
Table I: Mean slice 3D pose estimation error (in degrees) in six bins of increasing rotation magnitude.

Method | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Bin 5 | Bin 6
SVR-GC | 21.36 | 50.62 | 70.16 | 83.28 | 105.64 | 147.69
3DPoseNet (MSE) | 15.73 | 15.9 | 20.9 | 22.74 | 21.66 | 38.45
3DPoseNet (Geodesic) | 15.1 | 14.05 | 16.18 | 24.33 | 19.8 | 36.47
SVR-Deep | 10.23 | 12.32 | 13.08 | 17.6 | 16.19 | 26.85
We also tested the trained model on original slices from T2-weighted stacks of slices of the test subjects. The goal in this experiment, illustrated in Figure 5, was to find the corresponding 3D pose and location, in the atlas space, of the fetal brain in an input slice. The fetal brain was extracted in each slice using [38] and provided as input to the trained slice 3DPoseNet. A transformation composed from the estimated pose and location of the slice was applied to the corresponding age of the spatiotemporal fetal brain MRI atlas [41] to obtain the corresponding atlas slice. Representative original slices and the corresponding estimated atlas slices are shown for five samples from different fetuses in Figure 6.
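A sketch of this composition using scipy; the conversion of the predicted physical location to a slice index and the direction convention of the rotation depend on implementation details not shown here:

```python
import numpy as np
from scipy import ndimage
from scipy.spatial.transform import Rotation

def matching_atlas_slice(atlas, pred_rotvec, pred_z_index):
    """Rotate the atlas by the predicted rotation vector (about its
    center) and extract the plane at the predicted slice index."""
    r = Rotation.from_rotvec(pred_rotvec).as_matrix()
    center = (np.array(atlas.shape) - 1) / 2.0
    rotated = ndimage.affine_transform(
        atlas, r.T, offset=center - r.T @ center, order=1)
    return rotated[int(round(pred_z_index))]
```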
III-C2. Volume 3D Pose Estimation
In the volume-to-volume rigid registration scenario, seven algorithms were compared: VVR-GC, VVR-PAA, 3DPoseNet with MSE, 3DPoseNet with geodesic loss, CorrectionNet, 3DRegNet, and VVR-Deep. Figure 7 shows that: 1) VVR-GC performed very well for small rotations but failed for almost all samples with large rotations, as it converged to wrong local minima; 2) with the principal axis initialization, VVR-PAA significantly improved the performance for large rotations but still failed for the majority of samples, and it incurred a large loss in performance (compared to VVR-GC) for small rotations, as it incorrectly shifted the initial point into the region of a wrong local minimum; 3) the trained deep CNN models all performed well, with a much lower number of failures; the geodesic loss showed significant improvement over the MSE loss, and CorrectionNet performed the best, with only a very small fraction of failures over the range of rotations; and 4) VVR-Deep, the optimization-based registration initialized by deep pose estimation, generated the most accurate results and the minimum number of failures. Table II shows that VVR-Deep performed the best, while the CorrectionNet results were also comparable, especially as CorrectionNet-based registration is real-time and several orders of magnitude faster than VVR-Deep registration. The average runtimes of the methods are discussed in Section III-C4.
Table II: Mean volume-to-volume registration error (in degrees) in six bins of increasing rotation magnitude.

Method | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Bin 5 | Bin 6
VVR-GC | 2.42 | 45.39 | 149.91 | 177.0 | 174.87 | 177.2
VVR-PAA | 95.54 | 131.1 | 128.44 | 129.68 | 131.15 | 141.44
3DPoseNet (MSE) | 16.28 | 17.85 | 20.06 | 19.5 | 18.93 | 45.38
3DPoseNet (Geodesic) | 10.08 | 11.44 | 12.43 | 13.46 | 16.46 | 34.19
CorrectionNet | 4.54 | 4.45 | 4.83 | 4.82 | 6.33 | 19.42
3DRegNet | 18.21 | 17.53 | 18.80 | 18.65 | 19.57 | 43.88
VVR-Deep | 2.42 | 2.35 | 2.43 | 2.36 | 4.84 | 20.44
Figure 8 shows the translation error of the correction network, calculated as the distance between the true translation vector and the predicted one. The initial translation was calculated as the distance of the input image from the atlas location. Note that all errors reported here, including the translation errors, are between images of different subjects and atlases, so there is an intrinsic level of uncertainty in the alignment, as the exact alignment of two different anatomies (with different size and shape) using rigid registration is not well defined.
Figure 9 shows the results of the different algorithms on an example from the volume-to-volume registration tests, where all algorithms tried to register the brain of a fetus (with mild unilateral ventriculomegaly) to the corresponding age of the atlas, shown on the right. The first column is the input with a synthetic rotation. As the rotation was large, VVR-GC failed due to the non-convex similarity cost function; without deep initialization, it converged to a wrong local optimum that corresponds to a flipped version of the correct orientation (fourth column). The second and third columns show the results of 3DPoseNet and CorrectionNet. The geodesic distance error (in degrees) of each algorithm is given underneath its column. For this example, the correction network generated the most accurate result.
III-C3. Generalization Properties of the Trained Models
An important question that is frequently asked about learning-based methods such as those developed in this study concerns their generalization performance: can they generalize well to new test data, possibly with different features? To investigate the generalization properties of our trained models, we carried out two sets of experiments with Test Sets 2 and 3.
First, we added Test Set 1 to Test Set 2 to investigate the generalization of the algorithm to fetal brains at ages other than those in the training set: younger fetuses at 22-27 weeks GA, and newborn brains scanned at 38-44 weeks GA-equivalent age in different, ex-utero scan settings. We recall that the training dataset only contained fetuses scanned at 28-37 weeks. The brain develops very rapidly, especially throughout the second trimester of pregnancy, so the differences in brain size and shape between these test sets and the training set were significant. The images underneath the box plots in Figure 10 show sample slices at different ages. Using only a scale parameter calculated from the size ratio of the atlases at different ages, we scaled the images and fed them into the network. The box plots of the estimated pose error at different ages in Figure 10 show that the network generalized very well over all age ranges and different scan settings. The average and median errors, however, increased slightly towards the lower age range as the anatomy became significantly different from that of the training set.
In our second experiment on generalization, we investigated the generalizability of the networks across modalities. To test whether 3DPoseNet, trained only on reconstructed T2-weighted scans of fetuses, could generalize to T1-weighted newborn MRI scans, we applied our volume-to-volume registration test pipeline to the T1-weighted scans of the 7 newborns (70 samples in total) in Test Set 3. Figure 11 shows the results of applying the trained model to the T1-weighted scans (blue box plots) compared to the T2-weighted scans (orange box plots) with the exact same random rotations. While 3DPoseNet still performed better than VVR-GC and VVR-PAA (compare with Figure 7), it did not generalize well to T1-weighted scans.
To address this issue through preprocessing, we developed an image contrast transfer algorithm based on a conditional generative adversarial network (GAN) [43]; details are given in Appendix A. By training a conditional GAN in this way, we translated T1-weighted images into T2-like images. The results for the translated images are shown in the last row of Figure 11. The pose error box plots in this figure show that transferring the image contrast from T1- to T2-weighted and using the generated T2 images as input to the pose network significantly decreased the pose estimation error. The trained T2-weighted image generator can thus be used as an input cascade to 3DPoseNet or CorrectionNet so that they can directly register T1-weighted newborn brain images without being trained in this domain. Note that no reference data (aligned to an atlas) was needed for the T1-weighted scans to train the conditional GAN, only a set of paired T1 and T2 scans in the subject space, which was easy to obtain. A similar approach can be taken to further expand the generalization domain of the trained pose estimation networks, for example to adult brains. In this work we had access to paired T1 and T2 images; in applications where paired images are not available between two domains, CycleGAN [44] can be used instead.
III-C4. Testing Times
Table III shows the average testing times (in milliseconds) of the algorithms developed in this study, measured on an NVIDIA GeForce GTX 1080 GPU; all of the deep-learning-based algorithms ran in real time. The difference in test time between 3DPoseNet and CorrectionNet was due to a resampling operation on the image between the two stages of CorrectionNet, which took about 80 milliseconds. For comparison, we note that efficient implementations of intensity-based optimization methods for VVR typically require about 5 ms per iteration of optimization (for 1M voxel samples) on GPUs, and about 10 ms per iteration through symmetric multiprocessing [4]. Depending on the range of rotations and translations, which may require a multiscale registration approach and between 10 and 100 iterations of optimization, these algorithms may take between 50 ms and several seconds if implemented efficiently on appropriate hardware. All preprocessing steps prior to the application of 3DPoseNet and CorrectionNet were also real-time; this includes the center-of-gravity matching used to estimate initial translations, as well as the scaling and the application of the conditional GAN, whose average test time is also reported in Table III. This analysis shows that the techniques proposed in this paper can improve the performance and capture range of massively parallel implementations of optimization-based registration algorithms [4] for real-time applications such as image-based surgical navigation and motion-robust imaging.
Table III: Average testing times of the proposed networks.

Method | Volume | Slice
3DPoseNet | ms | ms
CorrectionNet | ms | -
Conditional GAN | ms | -
IV. Discussion
In this work we trained deep CNN regression models for 3D pose estimation of anatomy based on medical images. Our results show that deep-learning-based algorithms not only can provide a good initialization for optimization-based methods to improve the capture range of slice-to-volume registration, but also can be used directly for robust volume-to-volume rigid registration in real time. Using these learning-based methods along with accelerated optimization-based registration methods can provide powerful registration systems that capture almost all possible rotations in 3D space.
Our networks were composed of feature extraction layers and regression heads at the output layer. A bounded nonlinearity at the regression layer mimicked the behavior of the angle-axis representation of the rotation matrix, and the geodesic loss was used as a bi-invariant, natural Riemannian distance metric on the space of 3D rotations. Compared to the MSE on rotation vectors, our results showed that the geodesic loss led to significantly improved performance, especially in 3D, when images contained sufficient information for pose estimation.
By using a two step approach, where the 3D pose of an object (anatomy) is first approximately found in a standard (atlas) space, and then fed along with a reference image as two channels of input to a regression CNN (the correction network), accurate intersubject rigid registration can be achieved in realtime for all ranges of rotation and translation. Initial translations may be achieved also in realtime through center of gravity matching.
One of the main concerns with learning-based methods is their generalization when they face test images whose features differ from the training set. This is all the more important in medical imaging, where the number of training samples is rather limited. In this study, to evaluate the generalization of the trained models across ages, given that the shape and size of the brain change rapidly over the early gestational weeks, we intentionally trained the network on older fetuses and tested it on younger ones, using only a predefined scale parameter inferred from the gestational age based on a fetal brain MRI atlas.
We also tested the trained models on brain MRI scans of newborns, which were obtained in a completely different setting, with head coils for ex-utero imaging. While the trained models worked very well for T2-weighted brain scans of newborns at 38-44 weeks, we further challenged them with T1-weighted MRI scans of newborn brains. For the T1-weighted scans the performance of the networks dropped significantly; but by using a GAN-based technique that learned to translate T1-weighted images into T2-like images, and feeding the outputs into the trained regression CNNs, we achieved strong performance for T1-weighted images as well. To this end, we designed and trained an image-to-image translation GAN on pairs of T1- and T2-weighted images of newborn subjects in a training set, and used it as a real-time preprocessing step for T1-weighted scans before they were fed into the pose estimation networks. With the conditional GAN approach, many learning-based algorithms can be generalized across modalities, as long as some paired images are available for training.
V. Conclusion
We developed and evaluated deep pose estimation networks for slice-to-volume and volume-to-volume registration. In learning-based image-to-template (standard space) registration scenarios, the proposed methods provided very fast (real-time) registration with a wide capture range over the space of all plausible 3D rotations, and provided good initialization for current optimization-based registration methods. While current highly-evolved multiscale optimization-based methods that use cost functions such as mutual information or local cross-correlation can converge to wrong local minima due to non-convex cost functions, our CNN-based methods learn to predict the 3D pose of images from their features. A combination of these techniques and accelerated optimization-based methods can dramatically enhance the performance of the imaging devices and image processing methods of the future.
Appendix A
For the image contrast transfer algorithm we trained a conditional generative adversarial network (cGAN) [43] to simultaneously learn the mapping from T1- to T2-weighted images and a loss function for this mapping. Figure A.1 shows the pipeline used to train the adversarial network on T1 and T2 image pairs. Two networks, a generator ($G$) and a discriminator ($D$), were trained simultaneously, such that $G$ tried to generate T2-like images from the T1-weighted scans, and $D$ tried to distinguish real from fake (synthetically generated) T2-weighted image contrast in {T1, T2} pairs. To train these networks the following objective was used, where $z$ is a random noise vector:
$$G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G) \qquad (11)$$
where the loss function of the cGAN, $\mathcal{L}_{cGAN}(G, D)$, was defined as:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right] \qquad (12)$$
and the distance between the generated and real T2 scans in the training set was computed with the $L_1$ norm, to encourage the generation of sharper images:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\|y - G(x, z)\|_1\right] \qquad (13)$$
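For reference, a PyTorch sketch of the objectives (11)-(13); the discriminator is assumed to end in a sigmoid and to take the conditioning T1 image together with a T2 candidate, the noise $z$ is omitted (in [43] it is realized via dropout), and $\lambda = 100$ is the default suggested in [43]:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d, g, t1, t2):
    """Eq. (12), from D's side: real {T1, T2} pairs vs. pairs that
    contain the generated T2-like image (detached so only D updates)."""
    fake_t2 = g(t1).detach()
    pred_real = d(t1, t2)          # D is assumed to output probabilities
    pred_fake = d(t1, fake_t2)
    return (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))

def generator_loss(d, g, t1, t2, lam=100.0):
    """Eqs. (11) and (13), from G's side: fool D, plus the L1 term
    that pulls the output toward the real T2 image."""
    fake_t2 = g(t1)
    pred_fake = d(t1, fake_t2)
    adv = F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
    return adv + lam * F.l1_loss(fake_t2, t2)
```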
To train the conditional GAN networks we used 33 pairs of T1- and T2-weighted newborn brain images (of subjects not used in the test set), resulting in 3300 paired slices. These images were used for training only. We then tested the trained generator on the test set of 7 newborn brain images.
References
 [1] D. L. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, “Medical image registration,” Physics in medicine & biology, vol. 46, no. 3, p. R1, 2001.
 [2] J. P. Pluim, J. A. Maintz, and M. A. Viergever, “Mutual-information-based registration of medical images: a survey,” IEEE transactions on medical imaging, vol. 22, no. 8, pp. 986–1004, 2003.
 [3] A. Gholipour, N. Kehtarnavaz, R. Briggs, M. Devous, and K. Gopinath, “Brain functional localization: a survey of image registration techniques,” IEEE transactions on medical imaging, vol. 26, no. 4, pp. 427–451, 2007.
 [4] R. Shams, P. Sadeghi, R. A. Kennedy, and R. I. Hartley, “A survey of medical image registration on multicore and the GPU,” IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 50–60, 2010.
 [5] P. Markelj, D. Tomaževič, B. Likar, and F. Pernuš, “A review of 3D/2D registration methods for image-guided interventions,” Medical image analysis, vol. 16, no. 3, pp. 642–661, 2012.
 [6] A. Sotiras, C. Davatzikos, and N. Paragios, “Deformable medical image registration: A survey,” IEEE transactions on medical imaging, vol. 32, no. 7, pp. 1153–1190, 2013.
 [7] E. Ferrante and N. Paragios, “Slice-to-volume medical image registration: A survey,” Medical image analysis, vol. 39, pp. 101–123, 2017.

 [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [9] H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
 [10] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
 [11] M. Simonovsky, B. GutiérrezBecker, D. Mateus, N. Navab, and N. Komodakis, “A deep metric for multimodal registration,” in International Conference on Medical Image Computing and ComputerAssisted Intervention. Springer, 2016, pp. 10–18.
 [12] G. Wu, M. Kim, Q. Wang, B. C. Munsell, and D. Shen, “Scalable highperformance image registration framework by unsupervised deep feature representations learning,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 7, pp. 1505–1516, July 2016.
 [13] X. Yang, R. Kwitt, M. Styner, and M. Niethammer, “Quicksilver: Fast predictive image registration–a deep learning approach,” NeuroImage, vol. 158, pp. 378–396, 2017.
 [14] R. Liao, S. Miao, P. de Tournemire, S. Grbic, A. Kamen, T. Mansi, and D. Comaniciu, “An artificial agent for robust image registration.” in AAAI, 2017, pp. 4168–4175.
 [15] S. Miao, Z. J. Wang, Y. Zheng, and R. Liao, “Real-time 2D/3D registration via CNN regression,” in Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on. IEEE, 2016, pp. 1430–1434.
 [16] S. Miao, Z. J. Wang, and R. Liao, “A CNN regression approach for real-time 2D/3D registration,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1352–1363, 2016.
 [17] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single image 3D interpreter network,” in European Conference on Computer Vision. Springer, 2016, pp. 365–382.
 [18] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis, “6-DoF object pose from semantic keypoints,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2011–2018.
 [19] S. Tulsiani and J. Malik, “Viewpoints and keypoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1510–1519.
 [20] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2686–2694.
 [21] S. Mahendran, H. Ali, and R. Vidal, “3D pose regression using convolutional neural networks,” in IEEE International Conference on Computer Vision, vol. 1, no. 2, 2017, p. 4.
 [22] D. Q. Huynh, “Metrics for 3D rotations: Comparison and analysis,” Journal of Mathematical Imaging and Vision, vol. 35, no. 2, pp. 155–164, 2009.
 [23] A. Gholipour, J. A. Estroff, and S. K. Warfield, “Robust super-resolution volume reconstruction from slice acquisitions: application to fetal brain MRI,” IEEE transactions on medical imaging, vol. 29, no. 10, pp. 1739–1758, 2010.
 [24] B. Kainz, M. Steinberger, W. Wein, M. Kuklisova-Murgasova, C. Malamateniou, K. Keraudren, T. Torsney-Weir, M. Rutherford, P. Aljabar, J. V. Hajnal et al., “Fast volume reconstruction from motion corrupted stacks of 2D slices,” IEEE transactions on medical imaging, vol. 34, no. 9, pp. 1901–1913, 2015.
 [25] B. Marami, S. S. M. Salehi, O. Afacan, B. Scherrer, C. K. Rollins, E. Yang, J. A. Estroff, S. K. Warfield, and A. Gholipour, “Temporal slice registration and robust diffusion-tensor reconstruction for improved fetal brain structural connectivity analysis,” NeuroImage, vol. 156, pp. 475–488, 2017.
 [26] B. Hou, A. Alansary, S. McDonagh, A. Davidson, M. Rutherford, J. V. Hajnal, D. Rueckert, B. Glocker, and B. Kainz, “Predicting slice-to-volume transformation in presence of arbitrary subject motion,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 296–304.
 [27] B. Hou, B. Khanal, A. Alansary, S. McDonagh, A. Davidson, M. Rutherford, J. V. Hajnal, D. Rueckert, B. Glocker, and B. Kainz, “3D reconstruction in canonical coordinate space from arbitrarily oriented 2D images,” IEEE Transactions on Medical Imaging, 2018.
 [28] A. I. Namburete, W. Xie, M. Yaqub, A. Zisserman, and J. A. Noble, “Fully-automated alignment of 3D fetal brain ultrasound to a canonical reference space using multi-task learning,” Medical Image Analysis, 2018.
 [29] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 2938–2946.
 [30] A. Gholipour, J. A. Estroff, C. E. Barnewolt, R. L. Robertson, P. E. Grant, B. Gagoski, S. K. Warfield, O. Afacan, S. A. Connolly, J. J. Neil, A. Wolfberg, and R. V. Mulkern, “Fetal MRI: a technical update with educational aspirations,” Concepts in Magnetic Resonance Part A, vol. 43, no. 6, pp. 237–266, 2014.
 [31] S. Thesen, O. Heid, E. Mueller, and L. R. Schad, “Prospective acquisition correction for head motion with image-based tracking for real-time fMRI,” Magnetic Resonance in Medicine, vol. 44, no. 3, pp. 457–465, 2000.
 [32] N. White, C. Roddey, A. Shankaranarayanan, E. Han, D. Rettmann, J. Santos, J. Kuperman, and A. Dale, “PROMO: Real-time prospective motion correction in MRI using image-based tracking,” Magnetic Resonance in Medicine, vol. 63, no. 1, pp. 91–105, 2010.
 [33] A. Gholipour, M. Polak, A. van der Kouwe, E. Nevo, and S. K. Warfield, “Motion-robust MRI through real-time motion tracking and retrospective super-resolution volume reconstruction,” in Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. IEEE, 2011, pp. 5722–5725.
 [34] B. Marami, B. Scherrer, O. Afacan, B. Erem, S. K. Warfield, and A. Gholipour, “Motion-robust diffusion-weighted brain MRI reconstruction through slice-level registration-based motion tracking,” IEEE transactions on medical imaging, vol. 35, no. 10, pp. 2258–2269, 2016.
 [35] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
 [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [37] E. Hughes, L. C. Grande, M. Murgasova, J. Hutter, A. Price, A. S. Gomes, J. Allsop, J. Steinweg, N. Tusor, J. Wurie et al., “The developing human connectome: announcing the first release of open access neonatal brain imaging,” Organization for Human Brain Mapping, pp. 25–29, 2017.
 [38] S. S. M. Salehi, S. R. Hashemi, C. Velasco-Annis, A. Ouaalam, J. A. Estroff, D. Erdogmus, S. K. Warfield, and A. Gholipour, “Real-time automatic fetal brain extraction in fetal MRI by deep learning,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), April 2018, pp. 720–724.
 [39] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, “Auto-context convolutional neural network (Auto-Net) for brain extraction in magnetic resonance imaging,” IEEE transactions on medical imaging, vol. 36, no. 11, pp. 2319–2330, 2017.
 [40] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, “Userguided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability,” Neuroimage, vol. 31, no. 3, pp. 1116–1128, 2006.
 [41] A. Gholipour, C. K. Rollins, C. Velasco-Annis, A. Ouaalam, A. Akhondi-Asl, O. Afacan, C. M. Ortinau, S. Clancy, C. Limperopoulos, E. Yang et al., “A normative spatiotemporal MRI atlas of the fetal brain for automatic segmentation and analysis of early brain growth,” Scientific Reports, vol. 7, no. 1, p. 476, 2017.
 [42] J. Arvo, Graphics Gems II. Elsevier, 2013.

 [43] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
 [44] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.